Previous Up Next

2  Improving existing specifications

Rather than just enumerating improvements to the data representation in our database, we will discuss general principles that can guide us to improved specifications of the usual attribute-value encoding as required by ordinary decision tree learning, but we will do so by providing concrete examples for demonstration purposes.

2.1  Local improvements

2.1.1  Problem: handling ordered values

One immediately obvious shortcoming of the usual representation is the impossibility of expressing order relations between attribute values. This can be seen most easily with discretized numerical values. For example, the variable CM10A in the CONFMAN database tells the number of months that a conflict persisted before intervention by conflict management. There are seven possible values with the following meaning:


value tagdescription
0pre-intervention
11-2
23-6
37-12
413-24
525-36
637+
Table 1: CM10A: number of months before conflict intervention

This encoding drops significant information, namely that these values can be ordered (on a time scale). The learning algorithm is forced to discriminate between all these cases, which means that provided training data is split up into seven subsets here. The more values the variable can take, the smaller those subsets, which will usually impose a strong bias in subsequent learning steps. This is also suggested by empirical evidence in surveys concerning the “small disjuncts”-problem4.

This problem also shows up in a dual way when we consider target attributes. Instead of just predicting whether some mediation outcome is a success or failure, we might be interested in the exact quality. The attribute CM14 expresses it as one of five values:


value tagdescription
1med offered only
2unsuccessful
3ceasefire
4partial agreement
5full settlement
Table 2: CM14: detailed outcome of mediation attempt

Here, again, our background knowledge tells us that there is an implicit order hidden in these values. We would certainly expect that, for instance, a ceasefire is a requirement for obtaining a full settlement. This should also change the way we estimate classification accuracy. Predicting "unsuccessful" is surely worse than predicting "partial agreement" if indeed a "full settlement" was achievable.
This problem should not be confused with cost-sensitive learning as described in e.g. [Dom99] or similar techniques that take a detour over regression as in e.g. [KWPD01] to predict ordinal classes. We will show that ordered structures can be represented naturally without the help of numerical values of any kind.

2.1.2  Solution: expressing order with algebraic datatypes

To remedy this shortcoming, we now explain the way in which ordered structures can be expressed using algebraic datatypes. These datatypes are defined as (potentially recursive) type equations. On the left hand side we have the type name and on the right hand side a disjoint union of values, which are separated by bars. Such values consist of a constructor (starting with a capital letter) such that the value is uniquely tagged, and this constructor may take an argument. The argument is either a type or a (Cartesian) product of types. This way we can specify any kind of inductively definable datastructure5.
Returning to our previous example, instead of defining the variables as follows:

  cm10a = PreInt | M1_2 | M3_6 | M7_12 | M13_24 | M25_36 | M37p.
  cm14  = MedOff | UnSucc | CeaseFire | PartAgr | FullSett.

we could use the following definitions (type equations), which exhibit the underlying ordered structure:

  cm10a    = PreInt | LateInt late_int.
  late_int = M1_2   | M3p m3p.
  m3p      = M3_6   | M7p m7p.
  m7p      = M7_12  | M13p m13p.
  m13p     = M13_24 | M25p m25p.
  m25p     = M25_36 | M37p.
  cm14    = MedOff    | MedAcc med_acc.
  med_acc = UnSucc    | Succ succ.
  succ    = CeaseFire | Agree agree.
  agree   = PartAgr   | FullSett.

Data values would now be represented in a structured way. Instead of encoding, for example, a ceasefire as “CeaseFire”, we would write

  MedAcc (Succ CeaseFire)

which makes clear that a ceasefire is an accepted, successful mediation attempt.

Conversely, an intervention after two months would be encoded using the expression “LateInt M1_2” (“late intervention after 1 or 2 months”) rather than as “M1_2” directly.
How does this help us? For instance, it might be completely irrelevant for the learner to know that some mediation attempt was undertaken, say, more than six months too late. Therefore, the learner could stop splitting up the dataset according to the number of months exceeding this limit and instead continue with other, possibly more relevant input variables, which is likely to improve accuracy (can choose better variables in subsequent splits) and efficiency (no need to learn more subtrees than required).
For what concerns structured output values, the resulting classifier is allowed to start constructing values even before it is sure about their exact form. For example, given some political scenario, the classifier might be able to predict with very little data that the outcome of a mediation attempt will be successful, but might need much more information to classify the exact kind of success, i.e. a ceasefire only or even a (partial or full) agreement.
This property is also very important from a pragmatic point of view. It may often be extremely costly to collect all relevant data. If mediators are only interested in whether the outcome will be successful irrespective of the details, they need only collect this much information as required for this goal. On the other hand, if input data about the conflict is already available, we can guarantee that the classifier will already predict every property that only requires this data. In short, we can predict a maximum of information about the result with a minimum of information about the input.
It may also be useful to depict the difference between ordinary encodings and ones that allow subattributes in a visually more appealing way6:


cm14
MedOffUnSuccCeaseFirePartAgrFullSett
Table 3: CM14 represented in the standard way.


cm14
MedOffMedAcc med_acc
 UnSuccSucc succ
  CeaseFireAgree agree
   PartAgrFullSett
Table 4: CM14 represented in a structured way.

2.1.3  Factoring out local information

Taking a general view on subattributes, we see that they are useful for factoring out common information from values which would otherwise be encoded as atomic tags. We now demonstrate this with variable CM187:


value tagdescription
1one medtr
2two medtrs-same interests
3two medtrs-diff interests
4group medtrs-same interests
5group medtrs-diff interests
Table 5: CM18: number of mediators acting in the mediation event

Instead of forcing the learning algorithm again to discriminate between five values, we can give hints that some of these values have common information by using a more appropriate encoding:

  cm18 = OneMedtr | TwoMedtrs interests | GroupMedtrs interests.
  interests = Same | Diff.

Then, again, the learning algorithm need not continue discriminating further if the interests of the parties do not make a difference but their number does. The algorithm is even free to look at the mediators’ interests in certain cases, say, if there are only two mediators.

Note that there are not always unique factorizations. For example, we could also have encoded the upper variable as follows:

  cm18 = OneMedtr | Same medtrs | Diff medtrs.
  medtrs = Two | Group.

Doing this, we put more stress on the ability of multiple mediators to coordinate their interests rather than on their number. This way we can bias learning by putting discriminating features (value constructors) that we consider more important closer to the top of structured values.

But what do we do if we are unsure about which feature to prefer? Then we can just leave this learning task to the algorithm again and encode our data as follows:

  cm18 = OneMedtr | TwoOrMore (size * interests).
  size = Two | Group.
  interests = Same | Diff.

Of course, the question about exact size or commonality of interests only makes sense if there are at least two mediators, assuming that mediators cannot be schizophrenic. This example also demonstrates the usefulness of allowing more than one subattribute by employing the product type operator*”. A group of mediators having differing interests can now be denoted by:

  TwoOrMore (Group, Diff)

2.2  Global improvements

The structure improvements we have considered so far have only affected one variable. By rearrangement of values in a structured way, we could express semantic relations between them.

We can, however, improve existing specifications even more by considering several variables at once and modelling them in a different way. This will most often result in fewer but more structured variables. Due to lack of expressiveness with usual encodings, there are many cases where redundancies exist which can be eliminated.

For example, consider variables P23 and P24:


value tagdescription
1both Christian
2both Muslim
3both other global religions
4both traditional-indigenous
5both mixed religious base
6different religions
Table 6: P23: Religion type compared


value tagdescription
1same religion
2different religion
Table 7: P24: Religion identity of parties compared

As we can see, both variables allow one to decide whether two parties have the same or different religions, but one of them (P23) is much more detailed. This is obviously a workaround invented by the database designers, because they could not express both the abstract and detailed views adequately without redundancies.

The solution here is to merge equivalent values or constructors as follows:

  pm23_24 = Same religion | Different
  religion = Christian | Muslim | OtherGlobal | Traditional | Mixed

The learning algorithm would initially see only one variable, which essentially captures the meaning of the variable P23. Only in the case of two parties having the same religion would the algorithm get access to the details as encoded in P24.
This has several advantages:

It is much more effort in general to factor out redundancies among many variables, but this is certainly justified by the benefits.


Previous Up Next