4 How can you use it?
4.1 Specification of algebraic datatypes
Before AIFAD can learn from data, you will have to tell it how this data
looks like. For this purpose you create a file with the extension '.ads'
(algebraic datatype specification), which contains a set of (possibly
recursive) type equations. If you happen to know modern functional or
logic programming languages, this concept will be common to you. You
can find an example of a specification in file 'test.ads' in the
'examples'-directory of the distribution.
Note that specifications that do not make sense (e.g. violate liveness
properties, etc.) are currently pruned down to the minimum number of type
equations that can be reasonably supported. It might be more helpful
to display these automated correction steps, which is not yet done.
4.1.1 Lexical conventions
Note: Some equivalent definitions copied from the OCaml-manual.
The following characters are considered as blanks: space, newline,
horizontal tabulation, carriage return, line feed and form feed. Blanks
are ignored, but they separate adjacent identifiers, literals and
keywords that would otherwise be confused as one single identifier,
literal or keyword.
Comments in specifications start with a hash '#' and are valid up to
the end of the current line.
Identifiers are sequences of letters, digits, _ (the underscore
character), and ' (the single quote), starting with a letter or an
underscore. Letters contain at least the 52 lowercase and uppercase
letters from the ASCII set. The current implementation (except on MacOS)
also recognizes as letters all accented characters from the ISO 8859-1
(“ISO Latin 1”) set. All characters in an identifier are meaningful. The
current implementation places no limits on the number of characters of
an identifier.
Type names are identifiers, but start with either a lowercase character or
an underscore _. Data constructors are identifiers, but always start with
uppercase characters. The only keywords are “domain” and ”codomain”.
Here is the EBNF-grammar of type definitions, of which you can have
almost arbitrarily many in your specification:
| type-definition |
::= |
type-name '=' rhs '.' |
| |
| rhs |
::= |
sum { '|' sum }* |
| |
| |
type |
| |
| sum |
::= |
data-constructor |
| |
| |
data-constructor type |
| |
| type |
::= |
type-name |
| |
| |
product |
| |
| product |
::= |
'(' type { '*' type }+ ')' |
Types can only be defined once. Data constructors may appear only once
at each right hand side, but may be reused in other definitions. You
will have to name one of your types “domain”, which defines the shape
of your input data, and another type “codomain” to indicate the form
of your output data.
4.2 Data files
Data files take as extension '.add'. You can find an example that matches
the example specification in file 'test.add' in the distribution. Such
files follow the same lexical conventions as data specifications.
There are otherwise three kinds of data files: ones that contain mappings
from input values to output values, ones that only contain input values
and others that only contain output values. Depending on the operation
you want to perform, one or several of the above formats are valid.
Here is the EBNF-grammar of samples as used in data files containing
mappings from input to output values:
| sample |
::= |
data '->' data '.' |
| |
| data |
::= |
sum-value |
| |
| |
product-value |
| |
| sum-value |
::= |
data-constructor |
| |
| |
data-constructor argument |
| |
| argument |
::= |
data-constructor |
| |
| |
'(' data-constructor argument ')' |
| |
| |
product-value |
| |
| product-value |
::= |
'(' data { '*' data }+ ')' |
Files that contain only input or output data do not contain “samples”
but “data” followed by a dot.
With the current OCaml-implementation there can be at most 4194303
samples, which should be sufficient for most needs.
4.3 C4.5-compatible data
AIFAD can also handle data in this format. Data specifications of this
kind are stored in files having the extension '.names' and related data
in ones having the suffix '.data'. You can find an example for this data
format in the files 'c45.names' and 'c45.data' in directory 'examples'
of the distribution.
The '.names'-file contains as first line a comma-separated list of class
values terminated by a dot. Each following line starts with the name
of an attribute followed by a colon and again a comma-separated list of
attribute values terminated by a dot. Instead of this list, the keyword
“continuous” may be used to indicate numerical data. C4.5 can also
handle continuous attributes, which are currently ignored by AIFAD,
but may be added some time in the future. AIFAD will, however, make
use of the discrete data available.
Data files in C4.5-format consist of comma-separated lists of data stored
linewise, which must match the order and contents of the attributes
specified in the corresponding '.names'-file. The class values are
contained in the last column of these comma-separated lists.
4.4 Learning from data
When starting AIFAD on the command-line, learning mode is indicated by
the flag '-learn'. In this case you will also have to provide a name
for the specification of the data. This can either be done using the
'-spec'-flag, which takes the filename of the specification, but which
does not set a name for the data file (default: stdin). Or you can use
the flag '-stem', which expects a filestem and sets both the name of
the specification appropriately, i.e. 'stem.ads' and 'stem.add' for
normal data and respectively 'stem.names' and 'stem.data' if you happen
to learn in C4.5-mode. The latter mode can be indicated with the flag
'-c45' on the command-line. Examples:
The following learns normal data, the specification being in file
'foo.ads', the data being read via stdin from file 'bar.add'.
aifad -learn -spec foo.ads < bar.add
This learns specification 'foo.ads' and data 'foo.add' given
a stem:
aifad -learn -stem foo
This learns specification 'foo.names' and 'foo.data' given a
stem (C4.5-format):
aifad -c45 -learn -stem foo
By default the system will learn a model and print it in a
human-readable format to stdout. The model complexity (number of bits
consumed from input data to output one bit of information) is printed
to stderr. If '-no-hmod' is specified, no human-readable model will
be printed.
If you also want to store the learnt model in a machine-readable format
for later use (application), then you will have to specify a name for
the model file using the '-model'-flag, e.g.:
aifad -c45 -learn -stem foo -model mymodel.c45model
Note that C4.5-models and ones for normal specifications are
not compatible. AIFAD makes sure that you do not accidently apply some
model to the wrong data.
Models are printed directly in the OCaml-language. They are represented
by a function “model”, which maps the tuple (or single attribute)
of input to the corresponding output. Generally, variables that are
unused are unnamed, i.e. written as '_'. The data constructors are
represented by so-called polymorphic variants so that there is no need
for an explicit data specification: the human-readable model file should
actually be perfectly valid OCaml.
Input data is matched in match-clauses so as to discriminate between
different cases. As soon as the result is clear, constructors of the
output specification will be used to construct a result. If the result is
only partially known, let-bindings will factor out the already predictable
part, and more deeply nested matches predict the rest.
When no other actions are mentioned on the command-line except for the
machine-readable model, the model will be printed in human-readable form
on stdout, e.g.:
aifad -model mymodel.c45model
4.5 Learning parameters
4.5.1 Variable selection
Currently there is only one hardly different alternative to using the
gain ratio criterion, namely the one used by Quinlan in C4.5. This
criterion can be turned on using the flag '-gain-c45'. It incorporates
an additional constraint that selects also on basis of the unscaled
gains. See [Qui90] for details.
4.5.2 Pre-pruning
There is currently a very crude but still quite effective and efficient
pre-pruning technique, which is turned on with '-with-min-gr'. It stops
growing decision-trees during learning in a branch when the number of
bits required to select a variable for splitting exceeds the number of
bits gained by this step. This actually sets a minimum gain ratio at
each selection.
4.5.3 Different entropy measures
You can choose '-indep-entropy' when you assume that tuples (including
ones in substructures) are independent of each other. This is faster
than the default, which assumes dependence.
Choose '-shallow-entropy' if you want to compute the entropy without
descending into structures. This is much faster than the default, which
does so, especially for deeply nested datastructures.
4.5.4 Most frequent values
When no more splitting is possible or wanted, the most frequent value has
to be computed from the available data. By setting '-indep-most-freq',
you can choose to assume independence of tuple elements in the data,
the default assumes dependence again.
4.5.5 Splitting null branches
When splitting data depending on some variable, it can happen that one
branch of the tree does not have any data for further learning (i.e.
it is a null branch). Normally, the most frequent value of the
previous set is then taken as leaf. By setting '-split-null', the learning
process may continue using the remaining variables. Sometimes this hardly
makes any difference in efficiency, sometimes it can lead to extremely
long learning times and excessive model size.
Though the learning time is generally polynomial in the size of the
dataset, the timing curve can be almost exponential in the beginning even
for flat datastructures. When using structured data, time complexity is
currently indeed exponential then, though this problem might be solved
in a future release using very complicated models that employ CPS
(continuation-passing style).
At the moment we do not have empirical evidence that this experimental
feature significantly improves learning, but there is still some
experimentation left to be done.
4.5.6 Random models
There is an experimental feature which allows you to generate additional
random models that are constructed such that at each split a variable is
chosen from a distribution built from the gain ratios. This can be turned
on with the flag '-n-rand-gain', which expects the number of additional
random models to generate as argument. Another parameter '-t-rand-gain'
can be used to set the total number of seconds (floating point) that
may be spent learning in this case. Whichever limit is reached first
(time limit or number of models) will determine the end of learning.
Models are necessarily equally accurate but may differ in complexity:
the latter is taken as decision criterion on which model to choose
(minimum complexity). This seems quite interesting from an empirical
point of view but needs further investigation.
4.5.7 Handling missing values in C4.5-data
Samples containing missing values in their codomain are ignored. For input
data there are three strategies for handling missing values (indicated by
'?' in the data):
- Ignore samples containing missing values. Turned on by
'-mv-ignore'. This is recommended if you are absolutely sure that missing
values cannot occur, because it improves efficiency of learning, model
size and efficiency of application.
- Predict most probable (frequent) value when a missing-value is
present in a sample. Turned on by '-mv-mprob'. When learning such models,
the algebraic encoding of input data is either 'None' or 'Some data',
where data is a tuple of all variables in the C4.5-specification.
- Add another constructor to encode missing values (“flat
representation”). Turned on by '-mv-flat'.
- Algebraically encode the data by lifting constructors such that
they are either 'None' if some value is missing in an attribute or
'Some value' if it is present. This is the default ('-mv-lift-all').
4.5.8 Factorization of models
Models are always constructed in such a way that matches are only
performed when some information (also from substructures) is really
needed. It can, however, happen during learning that it is only visible
after learning some subtree that some predictions can already be made
before the match. By default, models will also be simplified in this
respect by using let-bindings that factor out common information from
subbranches. You can prohibit this feature by passing '-no-factor'. There
is hardly any performance penalty for factorization so you will usually
be happy with the default.
4.6 Applying models
If you have a machine-readable model file, you can apply it to input
data. There is a certain degree of flexibility here: you can apply a
model either to samples, i.e. mappings from input to output data, or
to input data only — even mixed. In the case of mappings the output
part in the data will be ignore. The input is read from standard input,
the resulting predictions of the model application will be printed in
the corresponding format (“normal” or C4.5) to stdout or to a file
whose name is give to flag '-pred'. Example:
aifad -apply -model mymodel.c45model < foo.data -pred bar.data
There is no need here to specify the type of model (C4.5), because the
latter knows what it is.
4.7 Evaluating results
To evaluate the accuracy of some model application we compare its result
to reference data. The file containing the reference data is specified by
'-eval', the data to be evaluated is read from stdin. You will also have
to provide for a data specification by either using '-spec' or '-stem'. The
result of the comparison consists of various statistics:
- number of samples
- number of bits required to encode the whole data
- average number of bits per sample
- number of common bits shared by both data sources
- average number of common bits per sample shared by both data sources
- accuracy, i.e. percentage of common bits
4.8 Model complexity
After learning the model complexity is printed to stderr. It is a
dimensionless measure measuring how many bits of input are needed in
average to produce one bit of output. Note that this is completely
unrelated to the accuracy of the model!
4.9 Random data generation
For testing purposes it is often very helpful to generate random
data given some specification. You might want to try this out with
the specification 'large.ads' in the 'examples'-directory of the
distribution. As usual, you will have to use the '-c45'-flag if you
want C4.5-data.
The following generates ten random samples on stdout for a
normal specification 'foo.ads':
aifad -rand-gen 10 -spec foo.ads
If you do not want to generate data for the target variable(s),
use this instead:
aifad -rand-gen-no-target 10 -spec foo.ads
You can even simulate the presence of missing values by setting
their probability (default: 0.01):
aifad -rand-gen 10 -rand-mv-prob 0.05 -spec foo.ads
If you want to pass a certain seed to the random number
generator, use '-rand-init'. By using '-rand-self-init', it will be
initialized at startup in a system-dependent way.