
3  Design hints

This section gives a short summary of design hints to apply when creating new data specifications.

3.1  Top-down design

The following steps may give a suitable start:

3.2  Treatment of missing values

An annoying but important point in machine learning concerns the treatment of missing values, i.e. value tags indicating that some value could not be observed or is even meaningless in a certain context. A common approach in traditional techniques is to explicitly encode (“hardwire”) strategies for handling missing values in the learning algorithm. Given more expressive representations, it is worth asking whether this is still required or even advisable: more declarative ways of specifying (the handling of) missing values may be possible.
Let us consider the following definition:

  t = A | B

If this variable could have missing values, then users of machine learning systems often extend this definition as follows:

  t = A | B | Missing

This, however, may produce unexpected results. Assume that a dataset that uses t as its result variable contains ten As, nine Bs and eleven missing values. An algorithm that predicts the most frequent value would then predict Missing. But the opposite of a missing value is an available value, so the relevant comparison is between eleven missing values and nineteen observations. We would actually have to predict that there will be an observation, because this case covers both the As and the Bs and is therefore more frequent.
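The arithmetic behind this example can be checked with a few lines of Python. This is only an illustrative sketch: the list `labels` and the two counters are hypothetical names introduced here, and the "most frequent value" rule stands in for whatever learning algorithm is actually used.

```python
from collections import Counter

# Hypothetical dataset from the text: ten As, nine Bs, eleven missing values.
labels = ["A"] * 10 + ["B"] * 9 + ["Missing"] * 11

# Flat encoding (t = A | B | Missing): the three tags compete directly.
flat_counts = Counter(labels)
flat_prediction = flat_counts.most_common(1)[0][0]

# Grouping A and B together compares "observed" against "missing" instead.
grouped_counts = Counter(
    "Missing" if label == "Missing" else "Observed" for label in labels
)
grouped_prediction = grouped_counts.most_common(1)[0][0]

print(flat_prediction)     # Missing  (11 beats 10 and 9)
print(grouped_prediction)  # Observed (19 beats 11)
```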

Therefore, it may be more suitable to choose the following definition:

  t = Observed value | Missing
  value = A | B

This advises the learning algorithm to treat observed values separately from missing ones. We can even specify structures that explain in more detail how to interpret missing values. For example:

  t = Relevant observable_value | Irrelevant
  observable_value = Observed value | Missing
  value = A | B

This allows a more precise specification of why a value is missing: because it is irrelevant, or because it could not be observed for some reason, etc. In all these cases the result may be a consequence of the input data, which is why predicting missing values does make sense. For example, a value may have been unobservable because of other (observable) conditions that make collecting data very difficult.
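To illustrate how such a nested definition could be exploited, the following Python sketch predicts the most frequent constructor level by level. It rests on several assumptions: the tuple encoding of values, the function name `predict`, and the level-by-level majority rule are hypothetical and do not describe the algorithm of any particular learning system. The split of the eleven missing entries into six unobserved and five irrelevant ones is also invented for the example.

```python
from collections import Counter

def predict(samples):
    """Pick the most frequent constructor at each nesting level in turn."""
    prefix = ()
    while True:
        depth = len(prefix)
        # Keep only samples that agree with the choices made so far
        # and still have a constructor at this depth.
        candidates = [
            s[depth] for s in samples
            if s[:depth] == prefix and len(s) > depth
        ]
        if not candidates:
            return prefix
        prefix += (Counter(candidates).most_common(1)[0][0],)

# Each value is written as the path of constructors leading to it, e.g.
# Relevant (Observed A) becomes ("Relevant", "Observed", "A").
samples = (
    [("Relevant", "Observed", "A")] * 10
    + [("Relevant", "Observed", "B")] * 9
    + [("Relevant", "Missing")] * 6   # assumed split of the 11 missing values
    + [("Irrelevant",)] * 5
)

print(predict(samples))  # ('Relevant', 'Observed', 'A')
```

Because the decision is made one level at a time, Missing and Irrelevant each compete only with their direct alternative, never with the individual values A and B.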

