Rather than just enumerating improvements to the data representation in our database, we will discuss general principles that can guide us to improved specifications of the usual attribute-value encoding as required by ordinary decision tree learning, but we will do so by providing concrete examples for demonstration purposes.
One immediately obvious shortcoming of the usual representation is the
impossibility of expressing order relations between attribute values. This
can be seen most easily with discretized numerical values. For example,
the variable CM10A in the CONFMAN database tells the number of months
that a conflict persisted before intervention by conflict management.
There are seven possible values with the following meaning:
value tag description 0 pre-intervention 1 1-2 2 3-6 3 7-12 4 13-24 5 25-36 6 37+
Table 1: CM10A: number of months before conflict intervention
This encoding drops significant information, namely that these values can be ordered (on a time scale). The learning algorithm is forced to discriminate between all these cases, which means that provided training data is split up into seven subsets here. The more values the variable can take, the smaller those subsets, which will usually impose a strong bias in subsequent learning steps. This is also suggested by empirical evidence in surveys concerning the “small disjuncts”-problem4.
This problem also shows up in a dual way when we consider target attributes. Instead of just predicting whether some mediation outcome is a success or failure, we might be interested in the exact quality. The attribute CM14 expresses it as one of five values:
value tag description 1 med offered only 2 unsuccessful 3 ceasefire 4 partial agreement 5 full settlement
Table 2: CM14: detailed outcome of mediation attempt
Here, again, our background knowledge tells us that there is an
implicit order hidden in these values. We would certainly expect
that, for instance, a ceasefire is a requirement for obtaining a full
settlement. This should also change the way we estimate classification
accuracy. Predicting "unsuccessful" is surely worse than predicting
"partial agreement" if indeed a "full settlement" was achievable.
This problem should not be confused with cost-sensitive learning as
described in e.g. [Dom99] or similar techniques that
take a detour over regression as in e.g. [KWPD01] to predict
ordinal classes. We will show that ordered structures can be represented
naturally without the help of numerical values of any kind.
To remedy this shortcoming, we now explain the way in which ordered
structures can be expressed using algebraic datatypes. These datatypes
are defined as (potentially recursive) type equations. On the left hand
side we have the type name and on the right hand side a disjoint union of
values, which are separated by bars. Such values consist of a constructor
(starting with a capital letter) such that the value is uniquely tagged,
and this constructor may take an argument. The argument is either a type
or a (Cartesian) product of types. This way we can specify any kind of
inductively definable datastructure5.
Returning to our previous example, instead of defining the variables
as follows:
cm10a = PreInt | M1_2 | M3_6 | M7_12 | M13_24 | M25_36 | M37p. cm14 = MedOff | UnSucc | CeaseFire | PartAgr | FullSett.
we could use the following definitions (type equations), which exhibit the underlying ordered structure:
cm10a = PreInt | LateInt late_int. late_int = M1_2 | M3p m3p. m3p = M3_6 | M7p m7p. m7p = M7_12 | M13p m13p. m13p = M13_24 | M25p m25p. m25p = M25_36 | M37p.
cm14 = MedOff | MedAcc med_acc. med_acc = UnSucc | Succ succ. succ = CeaseFire | Agree agree. agree = PartAgr | FullSett.
Data values would now be represented in a structured way. Instead of
encoding, for example, a ceasefire as “CeaseFire”, we would write
MedAcc (Succ CeaseFire)
which makes clear that a ceasefire is an accepted, successful mediation attempt.
Conversely, an intervention after two months would be encoded using the
expression “LateInt M1_2” (“late intervention after 1 or 2
months”) rather than as “M1_2” directly.
How does this help us? For instance, it might be completely irrelevant
for the learner to know that some mediation attempt was undertaken, say,
more than six months too late. Therefore, the learner could stop splitting
up the dataset according to the number of months exceeding this limit and
instead continue with other, possibly more relevant input variables, which
is likely to improve accuracy (can choose better variables in subsequent
splits) and efficiency (no need to learn more subtrees than required).
For what concerns structured output values, the resulting classifier is
allowed to start constructing values even before it is sure about their
exact form. For example, given some political scenario, the classifier
might be able to predict with very little data that the outcome of a
mediation attempt will be successful, but might need much more information
to classify the exact kind of success, i.e. a ceasefire only or even a
(partial or full) agreement.
This property is also very important from a pragmatic point of view.
It may often be extremely costly to collect all relevant data. If
mediators are only interested in whether the outcome will be successful
irrespective of the details, they need only collect this much information
as required for this goal. On the other hand, if input data about the
conflict is already available, we can guarantee that the classifier will
already predict every property that only requires this data. In short,
we can predict a maximum of information about the result with a minimum
of information about the input.
It may also be useful to depict the difference between ordinary
encodings and ones that allow subattributes in a visually more appealing
way6:
cm14 MedOff UnSucc CeaseFire PartAgr FullSett
Table 3: CM14 represented in the standard way.
cm14 MedOff MedAcc med_acc UnSucc Succ succ CeaseFire Agree agree PartAgr FullSett
Table 4: CM14 represented in a structured way.
Taking a general view on subattributes, we see that they are useful for factoring out common information from values which would otherwise be encoded as atomic tags. We now demonstrate this with variable CM187:
value tag description 1 one medtr 2 two medtrs-same interests 3 two medtrs-diff interests 4 group medtrs-same interests 5 group medtrs-diff interests
Table 5: CM18: number of mediators acting in the mediation event
Instead of forcing the learning algorithm again to discriminate between five values, we can give hints that some of these values have common information by using a more appropriate encoding:
cm18 = OneMedtr | TwoMedtrs interests | GroupMedtrs interests. interests = Same | Diff.
Then, again, the learning algorithm need not continue discriminating further if the interests of the parties do not make a difference but their number does. The algorithm is even free to look at the mediators’ interests in certain cases, say, if there are only two mediators.
Note that there are not always unique factorizations. For example, we could also have encoded the upper variable as follows:
cm18 = OneMedtr | Same medtrs | Diff medtrs. medtrs = Two | Group.
Doing this, we put more stress on the ability of multiple mediators to coordinate their interests rather than on their number. This way we can bias learning by putting discriminating features (value constructors) that we consider more important closer to the top of structured values.
But what do we do if we are unsure about which feature to prefer? Then we can just leave this learning task to the algorithm again and encode our data as follows:
cm18 = OneMedtr | TwoOrMore (size * interests). size = Two | Group. interests = Same | Diff.
Of course, the question about exact size or commonality of interests only
makes sense if there are at least two mediators, assuming that mediators
cannot be schizophrenic. This example also demonstrates the usefulness of
allowing more than one subattribute by employing the product type
operator “*”. A group of mediators having differing interests
can now be denoted by:
TwoOrMore (Group, Diff)
The structure improvements we have considered so far have only affected one variable. By rearrangement of values in a structured way, we could express semantic relations between them.
We can, however, improve existing specifications even more by considering several variables at once and modelling them in a different way. This will most often result in fewer but more structured variables. Due to lack of expressiveness with usual encodings, there are many cases where redundancies exist which can be eliminated.
For example, consider variables P23 and P24:
value tag description 1 both Christian 2 both Muslim 3 both other global religions 4 both traditional-indigenous 5 both mixed religious base 6 different religions
Table 6: P23: Religion type compared
value tag description 1 same religion 2 different religion
Table 7: P24: Religion identity of parties compared
As we can see, both variables allow one to decide whether two parties have the same or different religions, but one of them (P23) is much more detailed. This is obviously a workaround invented by the database designers, because they could not express both the abstract and detailed views adequately without redundancies.
The solution here is to merge equivalent values or constructors as follows:
pm23_24 = Same religion | Different religion = Christian | Muslim | OtherGlobal | Traditional | Mixed
The learning algorithm would initially see only one variable, which
essentially captures the meaning of the variable P23. Only in the case
of two parties having the same religion would the algorithm get access
to the details as encoded in P24.
This has several advantages:
It is much more effort in general to factor out redundancies among many variables, but this is certainly justified by the benefits.