PMML 3.1 - Trees

The TreeModel in PMML allows for defining either a classification or prediction structure. Each Node holds a logical predicate expression that defines the rule for choosing the Node or any of the branching Nodes.


  <xs:element name="TreeModel">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="MiningSchema"/>
        <xs:element ref="Output" minOccurs="0" />
        <xs:element ref="ModelStats" minOccurs="0"/>
        <xs:element ref="Targets" minOccurs="0" />
        <xs:element ref="LocalTransformations" minOccurs="0" />
        <xs:element ref="Node"/>
        <xs:element ref="ModelVerification" minOccurs="0"/>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string" />
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
      <xs:attribute name="algorithmName" type="xs:string" />
      <xs:attribute name="missingValueStrategy" type="MISSING-VALUE-STRATEGY" default="none"/>
      <xs:attribute name="missingValuePenalty" type="PROB-NUMBER" default="1.0"/>
      <xs:attribute name="noTrueChildStrategy" type="NO-TRUE-CHILD-STRATEGY" default="returnNullPrediction" />
      <xs:attribute name="splitCharacteristic" default="multiSplit">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="binarySplit"/>
            <xs:enumeration value="multiSplit"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

Definitions:

TreeModel: starts the definition for a tree model.

Node: this element is an encapsulation for either defining a split or a leaf in a tree model. Every Node contains a predicate that identifies a rule for choosing itself or any of its siblings. A predicate may be an expression composed of other nested predicates.

modelName: the value in modelName in a TreeModel element identifies the model with a unique name in the context of the PMML file. See general structure of PMML models.

missingValueStrategy: defines a strategy for dealing with missing values. See Missing Value Strategies and Penalties for details.

missingValuePenalty: defines a penalty applied to confidence calculation when missing value handling is performed. See Missing Value Strategies and Penalties for details.

noTrueChildStrategy: defines what to do in situations where scoring cannot reach a leaf node. See Handling the situation where scoring cannot continue for details.

splitCharacteristic: indicates whether non-leaf Nodes in the tree model have exactly two children, or an unrestricted number of children. In the case of multiSplit, it means that each Node may have 0 or more child Nodes. In the case of binarySplit, it means that each Node must have either 0 or 2 child Nodes.
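The binarySplit constraint can be checked mechanically. The following is a minimal sketch, assuming a hypothetical in-memory representation in which each Node is a dict with an optional "children" list; neither the representation nor the function name is part of PMML.

```python
def satisfies_binary_split(node):
    """Return True if every Node in the subtree has either 0 or 2 children.

    `node` is a hypothetical dict representation of a PMML Node:
    {"children": [ ...child dicts... ]}; a missing key means no children.
    """
    children = node.get("children", [])
    if len(children) not in (0, 2):
        return False
    # The constraint must hold recursively for every descendant.
    return all(satisfies_binary_split(child) for child in children)


binary_tree = {"children": [{"children": []}, {}]}
multi_tree = {"children": [{}, {}, {}]}
```

Under this sketch, a tree declared with splitCharacteristic="binarySplit" would be expected to pass the check, while a multiSplit tree may or may not.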

Each Node consists of:


  <xs:element name="Node">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:group ref="PREDICATE" />
        <xs:choice>
          <xs:sequence>
            <xs:element ref="ScoreDistribution" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element ref="Node" minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
          <xs:group ref="EmbeddedModel"/>
        </xs:choice>
      </xs:sequence>
      <xs:attribute name="id" type="xs:string"/>
      <xs:attribute name="score" type="xs:string" use="required"/>
      <xs:attribute name="recordCount" type="NUMBER"/>
      <xs:attribute name="defaultChild" type="xs:string"/>
    </xs:complexType>
  </xs:element>

Definitions:

id: The value of id serves as a unique identifier for any given Node within the tree model.

score: The value of score in a Node serves as the predicted value for a record that chooses the Node.

recordCount: The value of recordCount in a Node serves as a base size for recordCount values in ScoreDistribution elements. These numbers do not necessarily reflect the number of records that were used to build/train the model. Nevertheless, they make it possible to determine the relative size of given values in a ScoreDistribution as well as the relative size of a Node when compared to the parent Node.

defaultChild: only applicable when missingValueStrategy is set to defaultChild in the TreeModel element. Gives the id of the child node to use when no predicates can be evaluated due to missing values. See Missing Value Strategies and Penalties for details.

EmbeddedModel: Only applies in the context of model composition. It will hold a reference towards the model that is embedded in this Node and has to be applied to receive the actual prediction. For further details, see model composition.

The content of the attribute id can be any string that is unique within a model. Any numbering scheme can be used: ids can be enumerated as 1, 2, 3, 4, etc., or they can follow a hierarchical scheme like the one used for chapters, sections, and subsections in a book, e.g., 1.1.2.1 or 1.2.2.2.

Predicates

Each Node has one PREDICATE, which may be a SimplePredicate, a CompoundPredicate, a SimpleSetPredicate, a True, or a False.


  <xs:group name="PREDICATE">
    <xs:choice>
      <xs:element ref="SimplePredicate" />
      <xs:element ref="CompoundPredicate" />
      <xs:element ref="SimpleSetPredicate" />
      <xs:element ref="True" />
      <xs:element ref="False" />
    </xs:choice>
  </xs:group>

  <xs:element name="SimplePredicate">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="required"/>
      <xs:attribute name="operator" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="equal"/>
            <xs:enumeration value="notEqual"/>
            <xs:enumeration value="lessThan"/>
            <xs:enumeration value="lessOrEqual"/>
            <xs:enumeration value="greaterThan"/>
            <xs:enumeration value="greaterOrEqual"/>
            <xs:enumeration value="isMissing"/>
            <xs:enumeration value="isNotMissing"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="value" type="xs:string"/>
    </xs:complexType>
  </xs:element>

If the operator is isMissing or isNotMissing, the attribute value must not appear. With all other operators, however, the attribute value is required.

The predicates in the subnodes are evaluated left-to-right. The application algorithm chooses the first Node where the predicate evaluates to TRUE. Typically the rightmost Node just contains the predicate <True/>. If no Node applies and no final <True/> node is present, the noTrueChildStrategy applies (see below).

Definitions:

SimplePredicate: this element defines a rule in the form of a simple boolean expression. The rule consists of field, operator (booleanOperator) for binary comparison, and value.

field: This attribute of SimplePredicate element is a name entry of a MiningField or a DerivedField from TransformationDictionary or LocalTransformations.

operator: This attribute of SimplePredicate is one of the six pre-defined comparison operators listed below, or one of the two missing-value checks isMissing and isNotMissing.

Operator        Math Symbol
equal           =
notEqual        ≠
lessThan        <
lessOrEqual     ≤
greaterThan     >
greaterOrEqual  ≥

value: This attribute of SimplePredicate element is the information to evaluate / compare against.

Mathematically, the rule is expressed as field booleanOperator value; that is, field is the left operand and value is the right operand. The following samples are all equivalent to age < 30 (the order of the attributes does not matter):


  <SimplePredicate field="age" operator="lessThan" value="30" />

  <SimplePredicate value="30" operator="lessThan" field="age" />

  <SimplePredicate operator="lessThan" value="30" field="age" />
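To make the operand order concrete, here is a minimal sketch of how a scoring engine might evaluate a SimplePredicate. It assumes the record is a dict, that a missing field is represented by an absent key or None, that Python's None stands in for the result UNKNOWN, and that value has already been converted to the field's data type. The function and table names are illustrative only.

```python
import operator

# Illustrative mapping from PMML operator names to Python comparisons.
_COMPARISONS = {
    "equal": operator.eq,
    "notEqual": operator.ne,
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
}


def eval_simple_predicate(record, field, op, value=None):
    """Evaluate `field op value`; returns True, False, or None (UNKNOWN)."""
    field_value = record.get(field)  # None models a missing value
    if op == "isMissing":
        return field_value is None
    if op == "isNotMissing":
        return field_value is not None
    if field_value is None:
        return None  # comparison against a missing value is UNKNOWN
    # field is the left operand, value is the right operand
    return _COMPARISONS[op](field_value, value)
```

For example, `eval_simple_predicate({"age": 25}, "age", "lessThan", 30)` evaluates age < 30 for a record with age 25.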
                    

Compound predicates


  <xs:element name="CompoundPredicate">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:sequence minOccurs="2" maxOccurs="unbounded">
          <xs:group ref="PREDICATE" />
        </xs:sequence>
      </xs:sequence>
      <xs:attribute name="booleanOperator" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="or"/>
            <xs:enumeration value="and"/>
            <xs:enumeration value="xor"/>
            <xs:enumeration value="surrogate"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

Definitions:

CompoundPredicate: an encapsulating element for combining two or more elements as defined by the group PREDICATE. The attribute associated with this element, booleanOperator, can take one of the following logical (boolean) operators: and, or, xor or surrogate.

booleanOperator: The operators and, or and xor are associative binary operators with their usual semantics. For these three operators, the order in which the predicates within one CompoundPredicate are evaluated is irrelevant. For surrogate the order matters: an expression surrogate(a,b) is equivalent to if not unknown(a) then a else b.

The operator and indicates an evaluation to TRUE if all the predicates evaluate to TRUE.

The operator or indicates an evaluation to TRUE if one of the predicates evaluates to TRUE.

The operator xor indicates an evaluation to TRUE if an odd number of the predicates evaluates to TRUE and all others evaluate to FALSE.

The operator surrogate allows for specifying surrogate predicates. These provide an alternative predicate for cases where the primary (first) predicate cannot be evaluated because of a missing value.

Simple set predicates


  <xs:element name="SimpleSetPredicate">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="Array"/>
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="required"/>
      <xs:attribute name="booleanOperator" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="isIn"/>
            <xs:enumeration value="isNotIn"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

Definition:

    SimpleSetPredicate: checks whether a field value is an element of a set. The set of values is specified by the array in the content.

The attribute associated with this element, booleanOperator, can take one of the following boolean operators: isIn and isNotIn.

The operator isIn indicates an evaluation to TRUE if the field value is contained in the list of values in the array.

The operator isNotIn indicates an evaluation to TRUE if the field value is not contained in the list of values in the array.
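As a small illustration of the two set operators, the sketch below evaluates a SimpleSetPredicate against a record held in a dict. It assumes the field value is present in the record and that the Array content has already been parsed into a Python collection; the function name is hypothetical.

```python
def eval_simple_set_predicate(record, field, boolean_operator, values):
    """Evaluate isIn / isNotIn for a field whose value is assumed present.

    `values` is the already-parsed content of the PMML Array element.
    """
    contained = record[field] in values
    if boolean_operator == "isIn":
        return contained
    if boolean_operator == "isNotIn":
        return not contained
    raise ValueError(f"unsupported booleanOperator: {boolean_operator}")


wet_outlooks = ["overcast", "rain"]
```

With these definitions, `eval_simple_set_predicate({"outlook": "rain"}, "outlook", "isIn", wet_outlooks)` checks whether the outlook is one of the listed values.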


  <xs:element name="True">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Definition:

    True: a predicate element that identifies the boolean constant TRUE.


  <xs:element name="False">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Definition:

    False: a predicate element that identifies the boolean constant FALSE.

Sub-predicates (children of a CompoundPredicate) are grouped together and evaluated together. For example,
( (temperature > 60) and (temperature < 100) and (outlook=overcast) )
is represented by

  <CompoundPredicate booleanOperator="and" >
    <SimplePredicate field="temperature" operator="greaterThan" 
                        value="60" />
    <SimplePredicate field="temperature" operator="lessThan" 
                        value="100" />
    <SimplePredicate field="outlook" operator="equal" 
                        value="overcast"/>
  </CompoundPredicate>

Where the children of a CompoundPredicate are themselves CompoundPredicates, each nested CompoundPredicate is evaluated as a unit. For example,
( ( (temperature < 90) and (temperature > 50) ) or (humidity ≥ 80) )
is represented by

  <CompoundPredicate booleanOperator="or" >
    <CompoundPredicate booleanOperator="and" >
      <SimplePredicate field="temperature" operator="lessThan"
                       value="90" />
      <SimplePredicate field="temperature" operator="greaterThan"
                       value="50" />
    </CompoundPredicate>
    <SimplePredicate field="humidity" operator="greaterOrEqual"
                      value="80" />
  </CompoundPredicate>

Predicates on missing values

The value of any field in a logical expression may be missing. A SimplePredicate of the form Field Operator Value evaluates to UNKNOWN if the value of Field is missing. Note that the DataDictionary and MiningSchema may contain a definition on how to handle a missing value, e.g., by replacing it by a substitute. In that case the substituted value is used to evaluate the predicate.

The result of a CompoundPredicate with an operator and, or or xor is determined by the following table:

P        Q        P and Q  P or Q   P xor Q
True     True     True     True     False
True     False    False    True     True
True     Unknown  Unknown  True     Unknown
False    True     False    True     True
False    False    False    False    False
False    Unknown  False    Unknown  Unknown
Unknown  True     Unknown  True     Unknown
Unknown  False    False    Unknown  Unknown
Unknown  Unknown  Unknown  Unknown  Unknown
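The table can be expressed compactly in code. The sketch below is one possible rendering, assuming Python's None is used to represent UNKNOWN; the function names and3, or3, xor3 and combine3 are illustrative and not defined by PMML.

```python
from functools import reduce


def and3(p, q):
    """Three-valued AND: a single False decides, otherwise None propagates."""
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True


def or3(p, q):
    """Three-valued OR: a single True decides, otherwise None propagates."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False


def xor3(p, q):
    """Three-valued XOR: any None operand makes the result None."""
    if p is None or q is None:
        return None
    return p != q


def combine3(op, results):
    """Fold a binary three-valued operator over two or more sub-results."""
    return reduce(op, results)
```

Because the three operators are associative, folding them pairwise over the children of a CompoundPredicate gives the same result regardless of grouping; for instance, `combine3(and3, [True, None, False])` is False, since one FALSE operand is enough to decide an and.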

The operator surrogate provides a special means to handle logical expressions with missing values. It is applied to a sequence of predicates whose order matters: the first predicate is the primary, and the following predicates are the surrogates. Evaluation order is left-to-right. The surrogates are applied in turn when the primary predicate evaluates to UNKNOWN, so a surrogate predicate can resolve an otherwise undetermined predicate.

Example


  <CompoundPredicate booleanOperator="surrogate" >
    <CompoundPredicate booleanOperator="and" >
      <SimplePredicate field="temperature" operator="lessThan"
                       value="90" />
      <SimplePredicate field="temperature" operator="greaterThan"
                       value="50" />
    </CompoundPredicate>
    <SimplePredicate field="humidity" operator="greaterOrEqual"
                       value="80" />
    <False/>
  </CompoundPredicate>

The primary predicate is (temperature < 90) and (temperature > 50). If this evaluates to TRUE or FALSE, then that value is the result of the whole surrogate predicate. If the primary predicate evaluates to UNKNOWN because the value for the field temperature is missing, then the evaluation proceeds with the second predicate, humidity ≥ 80. If the humidity value is also missing, the evaluation falls through to the final <False/> and the result is FALSE.
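The cascade in this example can be sketched as follows. This is an illustration only: None stands in for UNKNOWN, the record is a dict in which an absent key means a missing value, and the helper names are hypothetical.

```python
import operator


def compare(value, op, threshold):
    """Three-valued comparison: None (UNKNOWN) when the value is missing."""
    if value is None:
        return None
    return op(value, threshold)


def and3(p, q):
    """Three-valued AND with None as UNKNOWN."""
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True


def surrogate(*results):
    """Return the first result, in document order, that is not UNKNOWN."""
    for result in results:
        if result is not None:
            return result
    return None


def example_predicate(record):
    """The surrogate CompoundPredicate from the example above."""
    t = record.get("temperature")
    h = record.get("humidity")
    primary = and3(compare(t, operator.lt, 90), compare(t, operator.gt, 50))
    # Second predicate: humidity >= 80; third predicate: the constant <False/>.
    return surrogate(primary, compare(h, operator.ge, 80), False)
```

Note that the primary predicate wins whenever it is determined, even if it is FALSE: a record with temperature 95 and humidity 85 yields FALSE, while a record with only humidity 85 yields TRUE through the first surrogate.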

ScoreDistribution

This element comprises a method to list predicted values in a classification tree structure.


  <xs:element name="ScoreDistribution">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="value" type="xs:string" use="required"/>
      <xs:attribute name="recordCount" type="NUMBER" use="required"/>
      <xs:attribute name="confidence" type="PROB-NUMBER"/>
    </xs:complexType>
  </xs:element>

Attribute Definitions

    ScoreDistribution: an element of Node to represent segments of the score that a Node predicts in a classification model. If the Node holds an enumeration, each entry of the enumeration is stored in one ScoreDistribution element.

    value: This attribute of ScoreDistribution is the label in a classification model.

    recordCount: This attribute of ScoreDistribution is the size (in number of records) associated with the value attribute.

    confidence: This optional attribute of ScoreDistribution assigns a confidence to a given prediction class for this tree node.

When a Node is selected as the final Node and this Node has no score attribute, the ScoreDistribution entry with the highest recordCount determines which value is selected as the predicted class. If several ScoreDistribution entries share the highest recordCount, the first of them is selected.

Note: If a Node has an attribute score then this attribute value overrides the computation of a predicted value from the ScoreDistribution.
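The selection rule, including the tie-break and the override by score, might be sketched like this. The representation of a ScoreDistribution as a list of (value, recordCount) pairs in document order is an assumption made for the example, as is the function name.

```python
def predicted_value(score, distribution):
    """Pick the predicted class for a final Node.

    score: the Node's score attribute, or None if absent.
    distribution: list of (value, recordCount) pairs in document order.
    """
    if score is not None:
        return score  # an explicit score overrides the ScoreDistribution
    if not distribution:
        return None
    best_value, best_count = distribution[0]
    for value, count in distribution[1:]:
        if count > best_count:  # strict comparison: the first entry wins ties
            best_value, best_count = value, count
    return best_value


node4_distribution = [("will play", 4), ("may play", 0), ("no play", 6)]
```

Here `predicted_value(None, node4_distribution)` selects "no play", while passing an explicit score returns that score unchanged.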

Missing Value Strategies and Penalties

The purpose of the missing value strategy is to define what happens when missing values are encountered in a case to be scored by the tree model - in situations where the main predicate defined at a decision tree node evaluates to UNKNOWN. See the section on Predicates on Missing Values for an explanation of how missing values can cause a predicate to evaluate to UNKNOWN.

missingValueStrategy:

This optional attribute of TreeModel indicates which strategy to apply when a Node's predicate evaluates to UNKNOWN during the scoring of a case:

  <xs:simpleType name="MISSING-VALUE-STRATEGY">
    <xs:restriction base="xs:string">
      <xs:enumeration value="lastPrediction" />
      <xs:enumeration value="nullPrediction" />
      <xs:enumeration value="defaultChild" />
      <xs:enumeration value="weightedConfidence" />
      <xs:enumeration value="none" />
    </xs:restriction>
  </xs:simpleType>

Definitions:

  • lastPrediction: If a Node's predicate evaluates to UNKNOWN while traversing the tree, evaluation is stopped and the current winner is returned as the final prediction.
  • nullPrediction: If a Node's predicate evaluates to UNKNOWN while traversing the tree, abort the scoring process and give no prediction.
  • defaultChild: If a Node's predicate evaluates to UNKNOWN while traversing the tree, evaluate the attribute defaultChild, which gives the child to continue traversing with. Requires the presence of the attribute defaultChild in every non-leaf Node.
  • weightedConfidence: If a Node's predicate evaluates to UNKNOWN while traversing the tree, the confidences for each class are calculated by scoring it and each of its sibling Nodes in turn (excluding any siblings whose predicates evaluate to FALSE). The confidences returned for each class from each sibling Node that was scored are weighted by the proportion of the number of records in that Node, then summed to produce a total confidence for each class. The winner is the class with the highest confidence. Note that weightedConfidence should be applied recursively to deal with situations where several predicates within the tree evaluate to UNKNOWN during the scoring of a case.
  • none: Comparisons with missing values other than checks for missing values always evaluate to FALSE. If no rule fires, then use the noTrueChildStrategy to decide on a result. This option requires that missing values be handled after all rules at the Node have been evaluated.
    Note: In contrast to lastPrediction, evaluation is carried on instead of stopping immediately upon first discovery of a Node whose predicate value cannot be determined due to missing values.
Note: The missingValueStrategy is not invoked if missing values are handled within predicates, either by compound predicates composed using the surrogate operator or by simple predicates containing the comparison operators isMissing or isNotMissing. When the predicate contains these operators it is possible for the predicate to evaluate to TRUE or FALSE when fields referenced within the predicate have missing values.

missingValuePenalty: This optional attribute of TreeModel allows computed confidences to be reduced by a specified factor each time certain kinds of missing value handling are invoked during the scoring of a case. For each Node where either surrogate rules or the defaultChild strategy had to be used to select a child, the final confidences are multiplied by this factor. Note that this is based on the number of Nodes, not on the overall number of missing values that were encountered (with operator surrogate, multiple missing values can be encountered within a single Node). For example, if missing value handling was needed at two Nodes on the way to the final prediction, the confidence is multiplied by missingValuePenalty twice, i.e., by its square.
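The arithmetic of the penalty is simple enough to state in a few lines. The sketch below assumes the scoring engine has already counted the Nodes at which surrogate rules or the defaultChild strategy were needed; the function and parameter names are illustrative.

```python
def apply_missing_value_penalty(confidence, missing_value_penalty, penalised_nodes):
    """Multiply the confidence by the penalty once per affected Node.

    penalised_nodes: number of Nodes (not missing values) at which
    surrogate rules or defaultChild had to be used to select a child.
    """
    return confidence * missing_value_penalty ** penalised_nodes
```

For instance, with a confidence of 0.9, a missingValuePenalty of 0.8 and two affected Nodes, the result is 0.9 * 0.8 * 0.8 = 0.576. With the default penalty of 1.0 the confidence is unchanged.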

Handling the situation where scoring cannot continue

noTrueChildStrategy:

During the scoring of a case, if the scoring reaches an internal Node at which none of the subnodes' predicates evaluate to TRUE, and no missing value handling strategy (if defined) is invoked, this optional attribute of TreeModel determines what to do next:


  <xs:simpleType name="NO-TRUE-CHILD-STRATEGY">
    <xs:restriction base="xs:string">
      <xs:enumeration value="returnNullPrediction" />
      <xs:enumeration value="returnLastPrediction" />
    </xs:restriction>
  </xs:simpleType>

Definitions:

  • returnNullPrediction: No prediction is returned (this is the default behaviour)
  • returnLastPrediction: If the parent has a score attribute, return the value of this attribute. Otherwise, no prediction is returned.

In the following example, if scoring reaches N1, but the case to be scored has a value for field prob1 which is less than or equal to 0.33, the noTrueChildStrategy defined for the tree determines what action to take. If set to returnNullPrediction, then no prediction is returned. If set to returnLastPrediction, then the score of N1 (0) is returned.


  <Node id="N1" score="0">
    <True />
    <Node id="T1" score="1">
      <SimplePredicate field="prob1" operator="greaterThan" value="0.33"/>
    </Node>
  </Node>
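The behaviour of this two-node fragment under both strategies can be sketched as below. The function is hard-wired to the fragment above, assumes prob1 is not missing, and uses None to represent "no prediction"; its name is hypothetical.

```python
def score_n1(prob1, no_true_child_strategy="returnNullPrediction"):
    """Score the N1/T1 fragment for a record with a known prob1 value."""
    if prob1 > 0.33:
        return "1"  # the predicate of child T1 is TRUE, so T1 is chosen
    # No child predicate is TRUE: fall back to the noTrueChildStrategy.
    if no_true_child_strategy == "returnLastPrediction":
        return "0"  # the score attribute of the parent N1
    return None  # returnNullPrediction: no prediction is returned
```

A record with prob1 = 0.2 therefore yields no prediction by default, and the score "0" when the tree declares returnLastPrediction.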

Examples

How to use surrogate

The CART algorithm features the concept of surrogate splits. For example, when classifying a record, the record is dropped to a node where the primary split is salary ≤ 35000. Further assume the record has a missing value for the field salary. CART deals with this situation by applying a sequence of surrogate rules in cascade-like fashion, until one of them can classify the given record. There may be 0 or more surrogate splits available. In our example, we could have age ≤ 28 and homeowner = 0 as surrogates. If age is not missing, the record is classified according to the age value. If age is missing, we try homeowner. If homeowner is also missing, we have run out of surrogates and we apply TRUE or FALSE, as specified by the PMML model.
For example,

salary ≤ 35000 SURROGATE TRUE
means:
Classify the record according to salary, if salary is not missing. If salary is missing, the predicate returns TRUE anyway.

Example TreeModel


  <?xml version="1.0" ?>
  <PMML version="3.1" xmlns="http://www.dmg.org/PMML-3_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header copyright="www.dmg.org" description="A very small binary tree model to show structure."/>
    <DataDictionary numberOfFields="5" >
      <DataField name="temperature" optype="continuous" dataType="double"/>
      <DataField name="humidity" optype="continuous" dataType="double"/>
      <DataField name="windy" optype="categorical" dataType="string">
        <Value value="true"/>
        <Value value="false"/>
      </DataField>
      <DataField name="outlook" optype="categorical" dataType="string" >
        <Value value="sunny"/>
        <Value value="overcast"/>
        <Value value="rain"/>
      </DataField>
      <DataField name="whatIdo" optype="categorical" dataType="string" >
        <Value value="will play"/>
        <Value value="may play"/>
        <Value value="no play"/>
      </DataField>
    </DataDictionary>
    <TreeModel modelName="golfing" functionName="classification">
      <MiningSchema>
        <MiningField name="temperature"/>
        <MiningField name="humidity"/>
        <MiningField name="windy"/>
        <MiningField name="outlook"/>
        <MiningField name="whatIdo" usageType="predicted"/>
      </MiningSchema>
      <Node score="will play">
        <True/>
        <Node score="will play">
          <SimplePredicate field="outlook" operator="equal" value="sunny"/>
          <Node score="will play">
            <CompoundPredicate booleanOperator="and" >
              <SimplePredicate field="temperature" operator="lessThan" value="90" />
              <SimplePredicate field="temperature" operator="greaterThan" value="50" />
            </CompoundPredicate>
            <Node score="will play" >
              <SimplePredicate field="humidity" operator="lessThan" value="80" />
            </Node>
            <Node score="no play" >
              <SimplePredicate field="humidity" operator="greaterOrEqual" value="80" />
            </Node>
          </Node>
          <Node score="no play" >
            <CompoundPredicate booleanOperator="or" >
              <SimplePredicate field="temperature" operator="greaterOrEqual" value="90"/>
              <SimplePredicate field="temperature" operator="lessOrEqual" value="50" />
            </CompoundPredicate>
          </Node>
        </Node>
        <Node score="may play" >
          <CompoundPredicate booleanOperator="or" >
            <SimplePredicate field="outlook" operator="equal" value="overcast" />
            <SimplePredicate field="outlook" operator="equal" value="rain" />
          </CompoundPredicate>
          <Node score="may play" >
            <CompoundPredicate booleanOperator="and" >
              <SimplePredicate field="temperature" operator="greaterThan" value="60" />
              <SimplePredicate field="temperature" operator="lessThan" value="100" />
              <SimplePredicate field="outlook" operator="equal" value="overcast" />
              <SimplePredicate field="humidity" operator="lessThan" value="70" />
              <SimplePredicate field="windy" operator="equal" value="false" />
            </CompoundPredicate>
          </Node>
          <Node score="no play" >
            <CompoundPredicate booleanOperator="and" >
              <SimplePredicate field="outlook" operator="equal" value="rain" />
              <SimplePredicate field="humidity" operator="lessThan" value="70" />
            </CompoundPredicate>
          </Node>
        </Node>
      </Node>
    </TreeModel>
  </PMML>

Scoring Procedure

We will use the above example to illustrate the steps that should be followed in the scoring process.

The input data is assumed to be:

temperature=75, humidity=55, windy="false", outlook="overcast"

1) Select the root Node. Its predicate is the constant TRUE.

2) Select the first child Node at the root Node (the order of the child Nodes is top-down in the document). Evaluate the predicate of this Node (outlook="sunny"). The observation is outlook="overcast", so this predicate evaluates to FALSE. Return to the parent Node, and select the second child. The predicate at the second child is

outlook="overcast" OR outlook="rain"
With input outlook="overcast" this predicate evaluates to TRUE. Stay at this Node.

3) Select the first child of the current Node. Evaluate the predicate

temperature > 60 AND temperature < 100 AND outlook="overcast" AND humidity < 70 AND windy="false"
The predicate evaluates to TRUE, so select this Node.

4) Since the last selected Node does not have any child Nodes, it is a leaf Node that contains a score value for the predicted field whatIdo. Finally, the returned score value for whatIdo is may play.
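The four steps above can be sketched as a short traversal. This is a simplified illustration, not a complete scoring engine: the golfing tree is transcribed by hand into nested dicts with Python lambdas as predicates, no missing values are handled, and the default noTrueChildStrategy (returnNullPrediction) is assumed. All names are hypothetical.

```python
golf_tree = {
    "score": "will play",
    "children": [
        {
            "score": "will play",
            "predicate": lambda r: r["outlook"] == "sunny",
            "children": [
                {
                    "score": "will play",
                    "predicate": lambda r: 50 < r["temperature"] < 90,
                    "children": [
                        {"score": "will play",
                         "predicate": lambda r: r["humidity"] < 80},
                        {"score": "no play",
                         "predicate": lambda r: r["humidity"] >= 80},
                    ],
                },
                {
                    "score": "no play",
                    "predicate": lambda r: (r["temperature"] >= 90
                                            or r["temperature"] <= 50),
                },
            ],
        },
        {
            "score": "may play",
            "predicate": lambda r: r["outlook"] in ("overcast", "rain"),
            "children": [
                {
                    "score": "may play",
                    "predicate": lambda r: (60 < r["temperature"] < 100
                                            and r["outlook"] == "overcast"
                                            and r["humidity"] < 70
                                            and r["windy"] == "false"),
                },
                {
                    "score": "no play",
                    "predicate": lambda r: (r["outlook"] == "rain"
                                            and r["humidity"] < 70),
                },
            ],
        },
    ],
}


def score(root, record):
    """Start at the root (predicate TRUE) and descend to a leaf."""
    node = root
    while True:
        for child in node.get("children", []):
            if child["predicate"](record):  # first TRUE child, in document order
                node = child
                break
        else:
            if node.get("children"):
                return None  # no child applies: returnNullPrediction
            return node["score"]  # leaf reached


case = {"temperature": 75, "humidity": 55, "windy": "false", "outlook": "overcast"}
```

Calling `score(golf_tree, case)` follows the same path as the walkthrough and returns "may play".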

Scoring Procedure with Missing Value Strategies

Example calculations of scoring with missing values are based on the following tree model:


  <?xml version="1.0" ?>
  <PMML version="3.1" xmlns="http://www.dmg.org/PMML-3_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header copyright="www.dmg.org" description="A very small tree model to demonstrate missing value handling and confidence calculation."/>
    <DataDictionary numberOfFields="4" >
      <DataField name="temperature" optype="continuous" dataType="double"/>
      <DataField name="humidity" optype="continuous" dataType="double"/>
      <DataField name="outlook" optype="categorical" dataType="string">
        <Value value="sunny"/>
        <Value value="overcast"/>
        <Value value="rain"/>
      </DataField>
      <DataField name="whatIdo" optype="categorical" dataType="string">
        <Value value="will play"/>
        <Value value="may play"/>
        <Value value="no play"/>
      </DataField>
    </DataDictionary>
    <TreeModel modelName="golfing" functionName="classification" missingValueStrategy="weightedConfidence" >
      <MiningSchema>
        <MiningField name="temperature"/>
        <MiningField name="humidity"/>
        <MiningField name="outlook"/>
        <MiningField name="whatIdo" usageType="predicted"/>
      </MiningSchema>
      <Node id="1" score="will play" recordCount="100" defaultChild="2">
        <True/>
        <ScoreDistribution value="will play" recordCount="60" confidence="0.6" />
        <ScoreDistribution value="may play" recordCount="30" confidence="0.3" />
        <ScoreDistribution value="no play" recordCount="10" confidence="0.1" />
        <Node id="2" score="will play" recordCount="50" defaultChild="3" >
          <SimplePredicate field="outlook" operator="equal" value="sunny"/>
          <ScoreDistribution value="will play" recordCount="40" confidence="0.8" />
          <ScoreDistribution value="may play" recordCount="2" confidence="0.04" />
          <ScoreDistribution value="no play" recordCount="8" confidence="0.16" />
          <Node id="3" score="will play" recordCount="40">
            <CompoundPredicate booleanOperator="surrogate" >
              <SimplePredicate field="temperature" operator="greaterOrEqual" value="50" />
              <SimplePredicate field="humidity" operator="lessThan" value="80" />
            </CompoundPredicate>
            <ScoreDistribution value="will play" recordCount="36" confidence="0.9" />
            <ScoreDistribution value="may play" recordCount="2" confidence="0.05" />
            <ScoreDistribution value="no play" recordCount="2" confidence="0.05" />
          </Node>
          <Node id="4" score="no play" recordCount="10" >
            <CompoundPredicate booleanOperator="surrogate" >
              <SimplePredicate field="temperature" operator="lessThan" value="50"/>
              <SimplePredicate field="humidity" operator="greaterOrEqual" value="80" />
            </CompoundPredicate>
            <ScoreDistribution value="will play" recordCount="4" confidence="0.4" />
            <ScoreDistribution value="may play" recordCount="0" confidence="0.0" />
            <ScoreDistribution value="no play" recordCount="6" confidence="0.6" />
          </Node>
        </Node>
        <Node id="5" score="may play" recordCount="50" >
          <CompoundPredicate booleanOperator="or" >
            <SimplePredicate field="outlook" operator="equal" value="overcast" />
            <SimplePredicate field="outlook" operator="equal" value="rain" />
          </CompoundPredicate>
          <ScoreDistribution value="will play" recordCount="20" confidence="0.4" />
          <ScoreDistribution value="may play" recordCount="28" confidence="0.56" />
          <ScoreDistribution value="no play" recordCount="2" confidence="0.04" />
        </Node>
      </Node>
    </TreeModel>
  </PMML>

Example 1 - Scoring with explicit confidences

The case to be scored has temperature=45, outlook="sunny", humidity=60. There are no missing values, and traversing the tree leads to node 4.

The prediction at node 4 is no play and the associated confidence (given by the confidence attribute where value="no play" in the score distribution) is 0.6.
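
The traversal in Example 1 can be sketched in Python. This is only an illustration, not part of PMML: the helper names simple and surrogate are hypothetical, UNKNOWN is represented by None, and the surrogate operator is assumed to return the result of the first predicate that does not evaluate to UNKNOWN.

```python
from typing import Callable, Optional

# A predicate returns True, False, or None (None stands for UNKNOWN).
Predicate = Callable[[dict], Optional[bool]]

def simple(field: str, op: str, value: float) -> Predicate:
    """Hypothetical stand-in for SimplePredicate; a missing field yields UNKNOWN."""
    def test(case: dict) -> Optional[bool]:
        x = case.get(field)
        if x is None:
            return None
        if op == "lessThan":
            return x < value
        if op == "greaterOrEqual":
            return x >= value
        raise ValueError(f"unsupported operator: {op}")
    return test

def surrogate(*predicates: Predicate) -> Predicate:
    """Assumed surrogate semantics: the first result that is not UNKNOWN wins."""
    def test(case: dict) -> Optional[bool]:
        for predicate in predicates:
            result = predicate(case)
            if result is not None:
                return result
        return None
    return test

# Predicates of nodes 3 and 4, copied from the example TreeModel.
node3 = surrogate(simple("temperature", "greaterOrEqual", 50),
                  simple("humidity", "lessThan", 80))
node4 = surrogate(simple("temperature", "lessThan", 50),
                  simple("humidity", "greaterOrEqual", 80))

case = {"temperature": 45, "outlook": "sunny", "humidity": 60}
print(node3(case), node4(case))  # False True
```

With temperature known, the humidity surrogate is never consulted, so node 3 is rejected and node 4 is chosen.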

Example 2 - Scoring with a missing value, and weightedConfidence missing value handling

The case to be scored has outlook="sunny" but temperature and humidity are unknown.

Scoring leads to node 2, but because temperature and humidity are unknown, the predicate of node 2's first child (node 3) evaluates to UNKNOWN, and the missingValueStrategy weightedConfidence is invoked at this point.

This is resolved by deriving confidences for each class resulting from choosing each child node of node 2 where the predicate does not evaluate to FALSE (nodes 3 and 4).

Node 3 confidences:
conf(will play)=0.9
conf(may play)=0.05
conf(no play)=0.05
Node 4 confidences:
conf(will play)=0.4
conf(may play)=0.0
conf(no play)=0.6

Now these confidences are recombined, but weighted according to the relative numbers of records assigned to nodes 3 (40 records) and 4 (10 records).

Node 2 confidences:
conf(will play)=(40/50) * 0.9 + (10/50) * 0.4 = 0.72 + 0.08 = 0.8
conf(may play)=(40/50) * 0.05 + (10/50) * 0.0 = 0.04 + 0.0 = 0.04
conf(no play)=(40/50) * 0.05 + (10/50) * 0.6 = 0.04 + 0.12 = 0.16

The overall prediction returned for this case is the one with the highest confidence, will play.
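
The recombination in Example 2 can be reproduced with a short Python sketch. The function name weighted_confidence is hypothetical, and the weights are assumed to be each child's recordCount divided by the sum of the recordCounts of the children under consideration (which equals node 2's recordCount of 50 here).

```python
def weighted_confidence(children):
    """children: list of (record_count, {class: confidence}) pairs for the
    child nodes whose predicate did not evaluate to FALSE."""
    total = sum(count for count, _ in children)
    combined = {}
    for count, confidences in children:
        for cls, conf in confidences.items():
            combined[cls] = combined.get(cls, 0.0) + count / total * conf
    return combined

# recordCount and ScoreDistribution confidences of nodes 3 and 4.
node3 = (40, {"will play": 0.9, "may play": 0.05, "no play": 0.05})
node4 = (10, {"will play": 0.4, "may play": 0.0, "no play": 0.6})

node2 = weighted_confidence([node3, node4])
prediction = max(node2, key=node2.get)
print(prediction)  # will play
```

The values in node2 match the Node 2 confidences above up to floating-point rounding.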

Example 3 - Scoring with multiple missing values, and weightedConfidence missing value handling

The case to be scored has unknown values for outlook, humidity and temperature.

The predicate of node 2 evaluates to UNKNOWN, due to the missing value for outlook, so the missingValueStrategy weightedConfidence is invoked at this point. Confidences for each class are derived from each child of node 1 where the predicate does not evaluate to FALSE (nodes 2 and 5) and recombined.

Node 2 confidences for this case are computed using the steps in example 2.

Node 2 confidences:
conf(will play)=0.8
conf(may play)=0.04
conf(no play)=0.16
Node 5 confidences:
conf(will play)=0.4
conf(may play)=0.56
conf(no play)=0.04

Now the confidences are recombined, but weighted according to the numbers of records assigned to nodes 2 (50 records) and 5 (50 records).

Node 1 confidences:
conf(will play)=(50/100) * 0.8 + (50/100) * 0.4 = 0.4 + 0.2 = 0.6
conf(may play)=(50/100) * 0.04 + (50/100) * 0.56 = 0.02 + 0.28 = 0.3
conf(no play)=(50/100) * 0.16 + (50/100) * 0.04 = 0.08 + 0.02 = 0.1

The overall prediction returned for this case is the one with the highest confidence, will play.
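
Because every input is missing in Example 3, the weighting is applied at node 1 and again at node 2. A recursive Python sketch follows; the dictionary layout and the function name confidences are hypothetical, and the sketch assumes that no child predicate evaluates to FALSE, which holds only when all inputs are unknown.

```python
def confidences(node):
    """Leaf: return its own distribution.
    Inner node: record-weighted mix of the distributions of all children
    (assumes every child predicate is UNKNOWN, as in Example 3)."""
    children = node.get("children")
    if not children:
        return node["conf"]
    total = sum(child["records"] for child in children)
    mixed = {}
    for child in children:
        for cls, conf in confidences(child).items():
            mixed[cls] = mixed.get(cls, 0.0) + child["records"] / total * conf
    return mixed

# Leaves carry the ScoreDistribution confidences of the example TreeModel.
tree = {"records": 100, "children": [
    {"records": 50, "children": [
        {"records": 40, "conf": {"will play": 0.9, "may play": 0.05, "no play": 0.05}},
        {"records": 10, "conf": {"will play": 0.4, "may play": 0.0, "no play": 0.6}},
    ]},
    {"records": 50, "conf": {"will play": 0.4, "may play": 0.56, "no play": 0.04}},
]}

node1 = confidences(tree)
print(max(node1, key=node1.get))  # will play
```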

Example 4 - Scoring with defaultChild missing value handling

Suppose we alter the example TreeModel to set the missingValueStrategy attribute to defaultChild. We also add the attribute missingValuePenalty and set it to 0.8.

Now consider how to score a case with temperature=40, humidity=70, but outlook is unknown.

The predicate of node 2 evaluates to UNKNOWN, due to the missing value for outlook, so the missingValueStrategy defaultChild is invoked at this point. Scoring continues by selecting node 1's defaultChild (node 2) and then proceeds normally from node 2. The prediction returned is no play, but the confidence returned is 0.6 multiplied by the missingValuePenalty of 0.8, giving 0.48.

Example 5 - Scoring with defaultChild missing value handling, multiple missing values

Suppose we alter the example TreeModel to set the missingValueStrategy attribute to defaultChild. We also add the attribute missingValuePenalty and set it to 0.8.

Now consider how to score a case with humidity=70, but outlook and temperature are unknown.

The predicate of node 2 evaluates to UNKNOWN, due to the missing value for outlook, so the missingValueStrategy defaultChild is invoked at this point. Scoring continues by selecting node 1's defaultChild (node 2). At node 2, the surrogate predicate based on humidity is used to select node 3. The prediction returned is will play, but the confidence of 0.9 is multiplied by the missingValuePenalty of 0.8 once for each of the two nodes (node 1 and node 2) where missing value handling was used, giving 0.9 * 0.8 * 0.8 = 0.576.
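
The penalty arithmetic of Examples 4 and 5 amounts to multiplying the leaf confidence by missingValuePenalty once per node at which missing value handling was used. A minimal Python sketch, with a hypothetical function name:

```python
def penalized_confidence(confidence: float, penalty: float, missing_steps: int) -> float:
    """Apply the penalty once for each node where missing value handling was used."""
    return confidence * penalty ** missing_steps

example4 = penalized_confidence(0.6, 0.8, 1)  # approximately 0.48
example5 = penalized_confidence(0.9, 0.8, 2)  # approximately 0.576
print(round(example4, 3), round(example5, 3))
```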

Example 6 - Scoring with lastPrediction missing value handling

Suppose we alter the example TreeModel to set the missingValueStrategy attribute to lastPrediction.

Now consider how to score a case with outlook="sunny", but temperature and humidity are unknown. Here scoring can go no further than node 2, because the predicates of node 3 and node 4 evaluate to UNKNOWN, so the missingValueStrategy lastPrediction is invoked at this point. The prediction of node 2, will play, is returned with a confidence of 0.8.

Example 7 - Scoring with nullPrediction missing value handling

Suppose we alter the example TreeModel to set the missingValueStrategy attribute to nullPrediction.

Now consider how to score a case with outlook="sunny", but temperature and humidity are unknown. Here scoring can go no further than node 2, because the predicate of node 3 evaluates to UNKNOWN, so the missingValueStrategy nullPrediction is invoked at this point. No prediction can be returned.
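
The contrast between Examples 6 and 7 can be summarized in a small Python sketch. The function resolve_unknown and the dictionary layout are hypothetical; the sketch only covers the point where scoring stops at a node because a child predicate evaluates to UNKNOWN.

```python
def resolve_unknown(strategy, current_node):
    """Outcome when scoring cannot descend past current_node because of a missing value."""
    if strategy == "lastPrediction":
        # Return the prediction of the last node that was reached.
        return current_node["score"], current_node["confidence"]
    if strategy == "nullPrediction":
        # No prediction is returned at all.
        return None
    raise ValueError(f"strategy not covered by this sketch: {strategy}")

# Node 2 of the example TreeModel, with the values quoted in Example 6.
node2 = {"score": "will play", "confidence": 0.8}

print(resolve_unknown("lastPrediction", node2))  # ('will play', 0.8)
print(resolve_unknown("nullPrediction", node2))  # None
```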

Example 8 - Scoring with missingValueStrategy none


   ...
   <TreeModel modelName="golfing" functionName="classification" missingValueStrategy="none" >
     ...
     <Node id="1" score="will play" recordCount="100" >
        <True/>
        <Node id="2" score="will play" recordCount="50" >
           <SimplePredicate field="age" operator="lessThan" value="30"/>
        </Node>
        <Node id="3" score="will not play" recordCount="20" >
           <SimplePredicate field="age" operator="greaterOrEqual" value="30"/>
        </Node>
        <Node id="4" score="will play" recordCount="30" >
           <True/>
        </Node>
     </Node>
     ...

Now consider how to score a case where age is unknown. All valid values of age are covered by nodes 2 and 3, so such cases never reach node 4. With missingValueStrategy="none", however, a missing age prevents either of these nodes from firing, since a missing value is neither less than 30 nor greater than or equal to 30. The final node, whose predicate is always true, takes care of missing values for age.
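
The child selection in Example 8 can be sketched in Python. The helper names are hypothetical, UNKNOWN is represented by None, and the sketch assumes that with the strategy none the first child whose predicate evaluates to TRUE is chosen, with UNKNOWN treated the same as FALSE.

```python
def first_true_child(children, case):
    """Return the id of the first child whose predicate is TRUE, else None."""
    for node_id, predicate in children:
        if predicate(case) is True:
            return node_id
    return None

def age_less_than_30(case):
    age = case.get("age")
    return None if age is None else age < 30

def age_at_least_30(case):
    age = case.get("age")
    return None if age is None else age >= 30

def always_true(case):
    return True

# Children of node 1 in document order, as in the Example 8 fragment.
children = [(2, age_less_than_30), (3, age_at_least_30), (4, always_true)]

print(first_true_child(children, {"age": 25}))    # 2
print(first_true_child(children, {"age": 42}))    # 3
print(first_true_child(children, {"age": None}))  # 4
```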

Conformance

Everything except the following items is in core:

(a) element ScoreDistribution
(b) operators xor and surrogate of attribute booleanOperator at element CompoundPredicate
(c) all missing value strategies except none