Data Mining Group - Target Fields

PMML 3.0 - Target fields and values

Introduction

The target values are derived from a variety of elements in the models. For example, the target categories in RegressionModel are specified in the RegressionTable elements, while the TreeModel defines them within Node elements and Naive Bayes models specify them in TargetValueCounts.

The PMML element Targets and its child elements Target provide a common syntax for describing targets in all models.

Example


  <Targets>
    <Target field="response" optype="categorical" >
      <TargetValue value="YES" rawDataValue="Yes" priorProbability="0.02" />
      <TargetValue value="NO" rawDataValue="No" priorProbability="0.98" />
    </Target>

    <!-- alternative for continuous field -->
    <Target field="amount" optype="continuous" >
      <TargetValue defaultValue="432.21" />
    </Target>
  </Targets>

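The example defines a categorical target field named "response" with two categories, "YES" and "NO". These values are used in the mining expressions for regression tables, tree nodes, Bayes counts, etc. The second Target element shows the alternative for a continuous field, "amount", which carries a defaultValue instead of a list of categories.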

Schema


  <xs:element name="Targets">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="Target" maxOccurs="unbounded"  />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Target">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="TargetValue" maxOccurs="unbounded"  />
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="required" />
      <xs:attribute name="optype" type="OPTYPE" />
      <xs:attribute name="castInteger" >
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="round"/>
            <xs:enumeration value="ceiling"/>
            <xs:enumeration value="floor"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

  <xs:element name="TargetValue">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="value" type="xs:string" />
      <xs:attribute name="rawDataValue" type="xs:string" />
      <xs:attribute name="priorProbability" type="PROB-NUMBER" />
      <xs:attribute name="defaultValue" type="NUMBER" />
    </xs:complexType>
  </xs:element>

The attribute field must refer to the name of a DataField or DerivedField.

When the Target element specifies optype, it overrides the optype attribute of the corresponding MiningField, if one exists. If the Target does not specify optype, the optype of the MiningField is used as the default. In turn, if the MiningField does not specify an optype, it is taken from the corresponding DataField. In other words, a MiningField overrides a DataField, and a Target overrides a MiningField.
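
As an illustration of this override chain, the following fragment declares the same field at all three levels. It is only a sketch: the field name "rating" and its values are hypothetical, and the enclosing DataDictionary, MiningSchema and Targets elements are left out. Under the rules above, the effective optype within this model would be categorical. Without the optype on Target it would be ordinal, and without the optype on MiningField as well it would fall back to continuous.

```xml
<!-- In the DataDictionary: the base declaration (hypothetical field) -->
<DataField name="rating" optype="continuous" dataType="integer"/>

<!-- In the MiningSchema of one model: overrides the DataField -->
<MiningField name="rating" usageType="predicted" optype="ordinal"/>

<!-- In the Targets of the same model: overrides the MiningField -->
<Target field="rating" optype="categorical">
  <TargetValue value="1"/>
  <TargetValue value="2"/>
  <TargetValue value="3"/>
</Target>
```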

If a regression model should predict integers, use the attribute castInteger to control how decimal places should be handled:

round: round to the nearest integer, e.g., 2.718 becomes 3, -2.89 becomes -3

ceiling: smallest integer greater than or equal to the predicted value, e.g., 2.718 becomes 3

floor: largest integer smaller than or equal to the predicted value, e.g., 2.718 becomes 2
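
A minimal sketch of how castInteger might appear on a continuous target is shown below. The field name "numberOfClaims" and the default value are assumptions made for illustration only. With this declaration, a raw regression result of 2.718 would be reported as 3.

```xml
<Targets>
  <!-- hypothetical integer-valued target of a regression model -->
  <Target field="numberOfClaims" optype="continuous" castInteger="round">
    <TargetValue defaultValue="2"/>
  </Target>
</Targets>
```

Replacing "round" with "floor" in the same fragment would turn 2.718 into 2 instead.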

The attribute value corresponds to the categories in 'RegressionTable', tree 'Node', neural network 'NeuralOutput' and Bayes 'TargetValueCounts'.

The attribute rawDataValue defines the corresponding values as they were found in the original input data. These values are not normalized or formatted. A PMML consumer can use them as display values in the scoring results instead of returning a possibly transformed value that is used as the internal target value. A model might map different raw data values to the same internal target value. E.g., 'yes' and 'Yes' may be mapped to 'YES'. In such cases rawDataValue is just one representative value, e.g., 'yes' or 'Yes'. Note that the raw data values are not used for identifying a target category within the model.

The attribute priorProbability specifies a default probability for the corresponding target category. It is used if the prediction logic itself did not produce a result. This can happen, e.g., if an input value is missing and there is no other method for treating missing values. The exact rules for using the prior probability are defined in the particular models.

The attribute defaultValue is the counterpart of prior probabilities for continuous fields. Usually the value is the mean of the target values in the training data.

The attribute priorProbability is used only if the optype of the field is categorical or ordinal. The attribute defaultValue is used only if the optype of the field is continuous.

Note that the schema allows multiple target fields. Whether the prediction of multiple fields is supported depends on the kind of model.

Further notes:

The definition of targets may depend on derived fields in the model. So it makes sense to specify them after the derived fields right before the definition of RegressionTable, Node, etc.
The target categories may be different from the values that appear in the original training data. The predicted values may be normalized or they may be produced by user-defined format functions.
The definition of target categories can be different from the list of valid values in the data dictionary. The same field can also have different target specifications in different models; e.g., the prior probabilities may differ.

Target fields are usually declared with usageType="predicted" in MiningField. The definition of predicted fields in the MiningSchema is not required and it does not have an impact on the scoring results. But it is very useful because it gives a user a first hint about the detailed results that can be computed by the model.
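
The fragment below sketches how such a declaration could look alongside the Targets element. The input field names "age" and "income" are hypothetical; "response" and its categories are taken from the example at the top of this page. Other elements of the model are omitted.

```xml
<MiningSchema>
  <MiningField name="age"/>       <!-- hypothetical input field -->
  <MiningField name="income"/>    <!-- hypothetical input field -->
  <MiningField name="response" usageType="predicted"/>
</MiningSchema>

<Targets>
  <Target field="response" optype="categorical">
    <TargetValue value="YES" rawDataValue="Yes" priorProbability="0.02"/>
    <TargetValue value="NO" rawDataValue="No" priorProbability="0.98"/>
  </Target>
</Targets>
```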

The list of target values within a target field is similar to the list of valid values in a DataField. However, the DataField defines the values that are allowed as input to the model, while the target values describe properties of the predicted values in a mining result. The default probabilities and the default value do not necessarily describe statistical properties of a target field as found in the training data. For example, the defaultValue can be the mean of the actual values in the training data, but it could also be the median or any other value that was chosen during training. The same goes for the default probabilities. They are usually the 'prior' probabilities of the respective values in the training data, but they can also be any other adjusted probabilities.