YaDT command line program dTcmd
Yet another Decision Tree builder

(c) Salvatore Ruggieri, 2002-2005
http://www.di.unipi.it/~ruggieri/YaDT

dTcmd is a command line program that exploits some of the features of the YaDT classes to build decision trees. dTcmd takes a metadata table and a training table as input and constructs a decision tree. Command line options specify the minimum number of cases required to split a node and the confidence level used when pruning the tree. Optionally, a test table and a score table may also be specified. Tables can be stored in comma separated text files or in Microsoft SQL Server tables. Built trees can be saved as (PMML compliant) XML documents, text files or binary files.

> dTcmd <input options> <tree options> <output options>

Command line options.

  • input data options
  • tree construction options
  • output options

Input data types.

  • metadata table
  • training data table
  • test data table
  • score data table
  • binary data table
  • binary tree

Input data providers.

  • text files
  • gzipped text files
  • Microsoft SQL Server tables

Output data types.

  • binary data table
  • binary tree
  • XML tree
  • confusion matrix and text tree
  • scored table
  • verbose log

    Input data options

    Input data to dTcmd consists of:

  • a table describing metadata (option -fm <file> or -sm <db> <table>)
  • a table containing training cases (option -fd <file> or -sd <db> <table>)
  • a binary file previously saved by dTcmd containing the two tables above (option -bd <file>)
  • (optional) a table containing test cases (option -ft <file> or -st <db> <table>)
  • (optional) a table containing cases to score (option -fs <file> or -ss <db> <table>)

    Tables are represented:

  • as text files, in the case of options beginning with -f,
  • as gzipped text files, in the case of options beginning with -f and filenames ending with .gz,
  • or as tables of SQL Server databases, in the case of options beginning with -s.

    Text files and SQL Server tables can be mixed (e.g., the metadata in a (gzipped) text file and the training data in a SQL Server table).


    Tree construction options

    The following parameters affect the tree construction algorithm:

  • set the minimum number of cases required to split a node (option -m <num>, where num is > 1, default 2)
  • set the pruning confidence level (option -c <num>, where num is in the range (0,1], default 0.25)
  • use exactly the same pruning strategy as C4.5 (option -c4.5, default not set)
  • disable pruning altogether (option -np)
  • randomly split the training data into an actual training set and an additional test set (option -h <num>, where num is the percentage, in the range [0,100], of cases that belong to the actual training set, default 100)
  • do not build the tree from input data files, but load it from a binary file previously saved by dTcmd (option -bt <file>)
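
    To make the meaning of the -h percentage concrete, the following Python sketch splits a list of cases at random. It is only an illustration: the source does not describe how dTcmd performs the split, so the shuffle-and-cut procedure, the rounding, and the names holdout_split and seed are assumptions.

```python
import random

def holdout_split(cases, percent, seed=0):
    """Split cases into (training, test); percent is the share kept for training.

    Assumption: shuffle, then cut at the rounded percentage. dTcmd may
    proceed differently.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in the range [0,100]")
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * percent / 100)
    return shuffled[:cut], shuffled[cut:]

# 14 cases, 70 percent kept for the actual training set
training, test = holdout_split(range(14), 70)
print(len(training), len(test))
```

    With the default value of 100, every case stays in the training set and the additional test set is empty.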

    Output options

    The following options affect the outputs of dTcmd:

  • output metadata and training table in binary format to a file (option -db <file>): this has no effect if option -bt is used
  • output tree in XML format to a file (option -x <file>) or to standard output (option -xstd)
  • output scored cases to a file (option -s <file>)
  • output tree in binary format to a file (option -tb <file>)
  • output confusion matrix(es) and text format tree to a file (option -t <file>) or to standard output (option -tstd)
  • output (verbose) log to a file (option -l <file>) or to standard output (option -lstd)

    Zero, one or more of these options can be specified at the command line, i.e., they are not mutually exclusive.
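
    The options above can be combined in a single invocation. The Python sketch below only assembles an argument list from the documented options for text-file inputs; it does not run dTcmd. The helper name build_dtcmd_args and the file names golf.score, golf.tree and golf.scored are hypothetical, and it is an assumption that each option and its value are passed as separate arguments.

```python
def build_dtcmd_args(metadata, training, test=None, score=None,
                     min_cases=None, confidence=None,
                     xml_out=None, text_out=None, scored_out=None):
    """Return a dTcmd argument list (hypothetical helper, text-file inputs only)."""
    args = ["dTcmd", "-fm", metadata, "-fd", training]
    if test is not None:
        args += ["-ft", test]
    if score is not None:
        args += ["-fs", score]
    if min_cases is not None:
        if min_cases <= 1:
            raise ValueError("-m requires a value > 1")
        args += ["-m", str(min_cases)]
    if confidence is not None:
        if not 0 < confidence <= 1:
            raise ValueError("-c requires a value in the range (0,1]")
        args += ["-c", str(confidence)]
    if xml_out is not None:
        args += ["-x", xml_out]
    if text_out is not None:
        args += ["-t", text_out]
    if scored_out is not None:
        args += ["-s", scored_out]
    return args

args = build_dtcmd_args("golf.names", "golf.data", score="golf.score",
                        confidence=0.25, text_out="golf.tree",
                        scored_out="golf.scored")
print(" ".join(args))
```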


    Text files

    Text files encode tables as comma separated columns. To change the separator to another character c, the -sep <c> option is provided. For instance, -sep " " switches to space separated columns. Also, the special string "?" can be present in text files to represent unknown values.
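
    As an illustration of this format, the Python sketch below reads separator-delimited text and maps the string "?" to None. The function name read_rows is hypothetical, and plain splitting on the separator is an assumption: the source does not say whether dTcmd supports quoting or escaping.

```python
def read_rows(text, sep=","):
    """Parse separator-delimited lines; the string "?" becomes None (unknown)."""
    rows = []
    for line in text.splitlines():
        if line.strip() == "":
            continue  # skip empty lines
        rows.append([None if field == "?" else field
                     for field in line.split(sep)])
    return rows

rows = read_rows("sunny,85,?,false\nrain,?,96,true\n")
print(rows)
```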


    Gzipped text files

    Gzipped text files are files with suffix .gz obtained by compressing text files with gzip.


    Microsoft SQL Server tables

    Microsoft SQL Server tables are accessed via ADO. Notice that the YaDT classes may access any ADO data provider, but dTcmd presently only supports SQL Server with trusted connections. In particular, no user name or password needs to be provided. Also, unknown values are coded as NULL values.


    Metadata table

    Metadata tables have three columns, which in order represent:

  • training column names,
  • training column data types, which can be:
      • null, i.e., no value (requires column type ignore),
      • string, i.e., any string delimited by the column separator or end of line,
      • integer, i.e., any integer value,
      • float, i.e., any float value,
  • training column types, which can be:
      • ignore, i.e., do not use the column in tree construction,
      • discrete, i.e., the column is used as a discrete attribute (not compatible with the null data type),
      • continuous, i.e., the column is used as a continuous attribute (not compatible with the null or string data types),
      • weights, i.e., the column is used to weight cases (not compatible with the null or string data types, and at most one column can be of this type),
      • or class, i.e., the column contains class values (not compatible with the null data type, and exactly one column of this type must be present).

    For instance, the file golf.names

    outlook,string,discrete
    temperature,integer,continuous
    humidity,integer,continuous
    windy,string,discrete
    goodPlaying,float,weights
    toPlay,string,class

    describes training data consisting of the following columns:

  • outlook, which contains strings interpreted as discrete values
  • temperature, which contains integers interpreted as continuous values
  • humidity, which contains integers interpreted as continuous values
  • windy, which contains strings interpreted as discrete values
  • goodPlaying, which contains floats interpreted as weight values
  • toPlay, which contains strings interpreted as class values
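
    The compatibility rules between data types and column types can be checked mechanically. The Python sketch below is one way to do so; the function name check_metadata and the wording of the messages are hypothetical, and the checks are derived only from the rules listed above, not from the YaDT sources. The goodPlaying row follows the column description given above.

```python
DATA_TYPES = {"null", "string", "integer", "float"}
COLUMN_TYPES = {"ignore", "discrete", "continuous", "weights", "class"}

def check_metadata(rows):
    """Return a list of rule violations for (name, data_type, column_type) rows."""
    errors = []
    for name, data_type, column_type in rows:
        if data_type not in DATA_TYPES:
            errors.append(f"{name}: unknown data type {data_type}")
        if column_type not in COLUMN_TYPES:
            errors.append(f"{name}: unknown column type {column_type}")
        # null is only allowed together with column type ignore
        if data_type == "null" and column_type != "ignore":
            errors.append(f"{name}: null requires column type ignore")
        # continuous and weights columns cannot hold strings
        if data_type == "string" and column_type in ("continuous", "weights"):
            errors.append(f"{name}: {column_type} is not compatible with string")
    kinds = [column_type for _, _, column_type in rows]
    if kinds.count("weights") > 1:
        errors.append("at most one weights column is allowed")
    if kinds.count("class") != 1:
        errors.append("exactly one class column is required")
    return errors

golf = [
    ("outlook", "string", "discrete"),
    ("temperature", "integer", "continuous"),
    ("humidity", "integer", "continuous"),
    ("windy", "string", "discrete"),
    ("goodPlaying", "float", "weights"),
    ("toPlay", "string", "class"),
]
print(check_metadata(golf))  # prints [] because no rule is violated
```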

    Training data table

    Training data tables have a number of columns according to the metadata table. The order of columns must be consistent with the order of metadata table rows. Unknown values are not admitted when the column type is weights or class. Here is the golf.data training data file:

    sunny,85,85,false,1,Don't Play
    sunny,80,90,true,1,Don't Play
    overcast,83,78,false,1.5,Play
    rain,70,96,false,0.8,Play
    rain,68,80,false,2,Play
    rain,65,70,true,1,Don't Play
    overcast,64,65,true,2.5,Play
    sunny,72,95,false,1,Don't Play
    sunny,69,70,false,1,Play
    rain,75,80,false,1.5,Play
    sunny,75,70,true,3,Play
    overcast,72,90,true,1.5,Play
    overcast,81,75,false,1,Play
    rain,71,80,true,1,Don't Play
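
    The Python sketch below shows how one row of golf.data lines up with the metadata: fields are matched to metadata rows by position, converted according to the data type, and unknown values are rejected in weights and class columns. The function name convert_row is hypothetical, and the conversion is an illustration of the rules stated above, not dTcmd's own parser.

```python
def convert_row(fields, metadata):
    """Convert one training row; metadata is a list of (name, data_type, column_type)."""
    if len(fields) != len(metadata):
        raise ValueError("row and metadata have different numbers of columns")
    converted = {}
    for field, (name, data_type, column_type) in zip(fields, metadata):
        if field == "?":
            # unknown values are not admitted for weights or class columns
            if column_type in ("weights", "class"):
                raise ValueError(f"{name}: unknown value not admitted")
            converted[name] = None
        elif data_type == "integer":
            converted[name] = int(field)
        elif data_type == "float":
            converted[name] = float(field)
        else:
            converted[name] = field
    return converted

golf_metadata = [
    ("outlook", "string", "discrete"),
    ("temperature", "integer", "continuous"),
    ("humidity", "integer", "continuous"),
    ("windy", "string", "discrete"),
    ("goodPlaying", "float", "weights"),
    ("toPlay", "string", "class"),
]
case = convert_row("overcast,83,78,false,1.5,Play".split(","), golf_metadata)
print(case)
```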
    


    Binary data table

    dTcmd may save and load a binary file containing a binary representation of a metadata table and a training table (see options -bd <file> and -db <file>). Binary files are not guaranteed to be readable by future or past versions of YaDT.


    Binary tree

    dTcmd may save and load a binary file containing a binary representation of a decision tree (see options -bt <file> and -tb <file>). Binary files are not guaranteed to be readable by future or past versions of YaDT.


    XML tree

    dTcmd may save to a file or to standard output a PMML compliant XML representation of the built tree (see options -x <file> and -xstd).


    Confusion matrix and text trees

    dTcmd may save to a file or to standard output a text representation of the built tree and of the confusion matrices over the training and test data (see options -t <file> and -tstd).


    Verbose log

    dTcmd may save to a file or to standard output a verbose log of the computation in progress (see options -l <file> and -lstd).


    Test data table

    A test data table has exactly the same format as a training data table.


    Score data table

    A score data table has the same format as a training data table, with the following exceptions:

  • the weights column is removed,
  • the class column is removed,
  • all other columns maintain the same relative order,
  • an additional column, which we call the key column, may optionally be present as the last column of the score table.

    An example score file for the golf example is the following:

    overcast,80,75,false,1
    rain,90,75,true,2
    sunny,98,82,false,3
    sunny,80,75,true,4
    overcast,90,75,false,5
    rain,78,82,false,6
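
    The relation between a training row and a score row can be sketched in Python as follows: the weights and class columns are dropped, the remaining columns keep their order, and an optional key is appended as the last column. The function name to_score_row is hypothetical; the column types are those of the golf example.

```python
def to_score_row(fields, column_types, key=None):
    """Drop weights and class columns; optionally append a key as the last column."""
    kept = [field for field, column_type in zip(fields, column_types)
            if column_type not in ("weights", "class")]
    if key is not None:
        kept.append(str(key))
    return kept

column_types = ["discrete", "continuous", "continuous", "discrete",
                "weights", "class"]
training_row = "overcast,83,78,false,1.5,Play".split(",")
score_row = to_score_row(training_row, column_types, key=1)
print(",".join(score_row))
```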
    


    Scored data table

    Scoring a score data table with a tree yields a scored data table, written as a text file that contains, in the same row order as the score data table:

  • the key column of the score data table, if present,
  • a column with the predicted class,
  • a column with the prediction probability.

    An example scored file for the golf score data table is the following:

    1,Play,1
    2,Don't Play,1
    3,Don't Play,1
    4,Play,1
    5,Play,1
    6,Play,1
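
    A scored file in this layout can be read back with a few lines of Python. The sketch below assumes comma separated output, as in the example above; the function name read_scored and the has_key parameter are hypothetical.

```python
def read_scored(text, has_key=True, sep=","):
    """Return (key, predicted_class, probability) tuples from scored output."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        fields = line.split(sep)
        key = fields[0] if has_key else None
        # the last two columns are the predicted class and its probability
        records.append((key, fields[-2], float(fields[-1])))
    return records

records = read_scored("1,Play,1\n2,Don't Play,1\n")
print(records)
```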