PMML 2.0 -- Trees

The tree models in PMML allow for defining either a classification or a prediction structure. Each Node holds a logical predicate expression that defines the rule for choosing the Node or any of the branching Nodes.


<!ELEMENT TreeModel (Extension*, MiningSchema, ModelStats?, 
                     Node, Extension*)>

<!ATTLIST TreeModel
   modelName                  CDATA                         #IMPLIED
   functionName               %MINING-FUNCTION;             #REQUIRED
   algorithmName              CDATA                         #IMPLIED
   splitCharacteristic        (binarySplit | multiSplit) "multiSplit"
>

Definitions:

TreeModel: starts the definition for a tree model.

Node: this element is an encapsulation for either defining a split or a leaf in a tree model. Every Node contains a predicate that identifies a rule for choosing itself or any of its siblings. A predicate may be an expression composed of other nested predicates.

modelName: the value of modelName in a TreeModel element identifies the model with a unique name in the context of the PMML file. See the general structure of PMML models.

splitCharacteristic: indicates whether the tree model has exactly two children per node (binarySplit) or multiple children per node (multiSplit). With multiSplit, each node may have two or more child nodes.


Each Node consists of:


<!ENTITY % PREDICATE "( SimplePredicate | CompoundPredicate |
                        SimpleSetPredicate | True | False  ) " >

<!ELEMENT Node ( Extension*, (%PREDICATE;), ScoreDistribution*, Node* )>
<!ATTLIST Node
   score                      CDATA                          #REQUIRED
   recordCount                %NUMBER;                       #IMPLIED
>

Definitions:

score: The value of score in a Node serves as the predicted value for a record that chooses the Node.

recordCount: The value of recordCount in a Node serves as a base size for recordCount values in ScoreDistribution elements. These numbers do not necessarily reflect the number of records used to build or train the model. Nevertheless, they make it possible to determine the relative size of given values in a score distribution, as well as the relative size of a node compared to its parent node.
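
The relative sizes mentioned for recordCount are plain ratios. The following is a minimal sketch; the function name relative_sizes and all counts are hypothetical and only serve as an illustration.

```python
def relative_sizes(node_count, parent_count, distribution):
    """Return (node share of parent, share of each value within the node).

    distribution is a list of (value, recordCount) pairs taken from the
    ScoreDistribution elements of one Node. All numbers are made up.
    """
    share_of_parent = node_count / parent_count
    value_shares = {value: count / node_count for value, count in distribution}
    return share_of_parent, value_shares


share, shares = relative_sizes(100, 400, [("will play", 60), ("no play", 40)])
print(share)   # 0.25
print(shares)  # {'will play': 0.6, 'no play': 0.4}
```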


Predicates

Each Node has one %PREDICATE;, which may be a SimplePredicate, a SimpleSetPredicate, a CompoundPredicate, a True, or a False.


<!ELEMENT SimplePredicate EMPTY>
<!ATTLIST SimplePredicate
     field              %FIELD-NAME;                          #REQUIRED
     operator           (equal | notEqual |
                         lessThan | lessOrEqual |
                         greaterThan | greaterOrEqual)        #REQUIRED 
     value              CDATA                                 #REQUIRED
>

The predicates in the subnodes are evaluated left-to-right. The application algorithm chooses the first node where the predicate evaluates to true. Typically the rightmost node just contains the predicate <True/>.

Definitions:

SimplePredicate: this element defines a rule in the form of a simple boolean expression. The rule consists of a field, a binary comparison operator (booleanOperator), and a value.

field: This attribute of the SimplePredicate element is the name of one of the MiningField elements in the MiningSchema.

operator: This attribute of SimplePredicate is one of the six pre-defined comparison operators.

Operator         Math Symbol
equal            =
notEqual         !=
lessThan         <
lessOrEqual      <=
greaterThan      >
greaterOrEqual   >=

value: This attribute of the SimplePredicate element is the value to compare the field against.

Mathematically the rule is expressed as 'field booleanOperator value', that is, the field is the left operand and the value is the right operand. The following samples are all equivalent to "age < 30"; the order of the XML attributes is not significant.


<SimplePredicate field="age" operator="lessThan" 
                    value="30" />

<SimplePredicate value="30" operator="lessThan" 
                    field="age" />

<SimplePredicate operator="lessThan" value="30" 
                    field="age" />
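
As an illustration of the 'field operator value' rule, the six operators can be mapped onto ordinary comparisons. This is a sketch only: the function name eval_simple_predicate and the record-as-dictionary representation are assumptions, and so is the choice to compare numerically whenever both operands parse as numbers. Missing values are not handled here.

```python
import operator

# Map the six PMML operator names to Python comparison functions.
OPERATORS = {
    "equal": operator.eq,
    "notEqual": operator.ne,
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
}


def eval_simple_predicate(record, field, op, value):
    """Evaluate 'field op value'; the field is always the left operand."""
    left = record[field]
    # Assumption: compare as numbers if both sides parse, otherwise as strings.
    try:
        left, right = float(left), float(value)
    except (TypeError, ValueError):
        left, right = str(left), str(value)
    return OPERATORS[op](left, right)


print(eval_simple_predicate({"age": 25}, "age", "lessThan", "30"))  # True
print(eval_simple_predicate({"age": 42}, "age", "lessThan", "30"))  # False
```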


Compound predicates


<!ELEMENT CompoundPredicate ( %PREDICATE; , (%PREDICATE;)+ )>
<!ATTLIST CompoundPredicate
     booleanOperator     (or|and|xor|surrogate)     #REQUIRED >

Definitions:

CompoundPredicate: an encapsulating element for combining two or more elements as defined by the entity %PREDICATE;. The attribute associated with this element, booleanOperator, can take one of the following logical (boolean) operators: and, or, xor, or surrogate.

booleanOperator: The operators and, or, and xor are associative binary operators, having their usual semantics. The order of evaluation is irrelevant for all the predicates within one compound predicate. An expression 'surrogate(a,b)' is equivalent to 'if not unknown(a) then a else b'.

The operator 'and' indicates an evaluation to TRUE if all the predicates evaluate to TRUE.

The operator 'or' indicates an evaluation to TRUE if one of the predicates evaluates to TRUE.

The operator 'xor' is only used with two arguments, it indicates an evaluation to TRUE if the predicates evaluate to different values (one TRUE and one FALSE).

The operator 'surrogate' allows for specifying surrogate predicates. They are used when a missing value appears in the evaluation of the primary predicate, so that an alternative predicate is available.

Simple set predicates


<!ELEMENT SimpleSetPredicate ( %ARRAY; ) >
<!ATTLIST SimpleSetPredicate
     field                 %FIELD-NAME;                    #REQUIRED
     booleanOperator       (isIn|isNotIn)                  #REQUIRED 
>

Definition:

    SimpleSetPredicate: checks whether a field value is an element of a set. The set of values is specified by the array in the content. The attribute associated with this element, booleanOperator, can take one of the following boolean operators: isIn and isNotIn.

The operator isIn indicates an evaluation to TRUE if the field value is contained in the list of values in the array.

The operator isNotIn indicates an evaluation to TRUE if the field value is not contained in the list of values in the array.
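
A set predicate reduces to a membership test. The sketch below assumes the array content has already been parsed into a Python list; the function name eval_simple_set_predicate is hypothetical, and missing values are not handled.

```python
def eval_simple_set_predicate(record, field, boolean_operator, values):
    """Evaluate isIn / isNotIn for one field against a list of values."""
    is_member = record[field] in values
    if boolean_operator == "isIn":
        return is_member
    if boolean_operator == "isNotIn":
        return not is_member
    raise ValueError("unsupported booleanOperator: " + boolean_operator)


record = {"outlook": "overcast"}
print(eval_simple_set_predicate(record, "outlook", "isIn", ["overcast", "rain"]))     # True
print(eval_simple_set_predicate(record, "outlook", "isNotIn", ["overcast", "rain"]))  # False
```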


<!ELEMENT True EMPTY>

Definition:

    True: a predicate element that identifies the boolean constant TRUE.


<!ELEMENT False EMPTY>

Definition:

    False: a predicate element that identifies the boolean constant FALSE.


Sub-predicates (siblings of a compound predicate) are to be grouped together and evaluated together. For example,

( (temperature > 60) and (temperature < 100) and (outlook=overcast) )


<CompoundPredicate booleanOperator="and" >
     <SimplePredicate field="temperature" operator="greaterThan" 
                         value="60" />
     <SimplePredicate field="temperature" operator="lessThan" 
                         value="100" />
     <SimplePredicate field="outlook" operator="equal" 
                         value="overcast"/>
</CompoundPredicate>

When the sub-predicates of a compound predicate are themselves compound predicates, each nested compound predicate is evaluated as a unit. For example, ( ( (temperature < 90) and (temperature > 50) ) or (humidity >= 80) )


<CompoundPredicate booleanOperator="or" >
     <CompoundPredicate booleanOperator="and" >
          <SimplePredicate field="temperature" operator="lessThan"
                              value="90" />
          <SimplePredicate field="temperature" operator="greaterThan"
                              value="50" />
     </CompoundPredicate>
     <SimplePredicate field="humidity" operator="greaterOrEqual"
                         value="80" />
</CompoundPredicate>

Predicates on missing values

The value of any field in a logical expression may be missing. A SimplePredicate of the form

     Field Operator Value

evaluates to 'Unknown' if the value of Field is missing. Note that the DataDictionary and MiningSchema may contain a definition of how to handle a missing value, e.g. by replacing it with a substitute. In that case the substituted value is used to evaluate the predicate.

The result of a CompoundPredicate with an operator 'and' or 'or' is determined by the following table:

P         Q         P and Q   P or Q
True      True      True      True
True      False     False     True
True      Unknown   Unknown   True
False     True      False     True
False     False     False     False
False     Unknown   False     Unknown
Unknown   True      Unknown   True
Unknown   False     False     Unknown
Unknown   Unknown   Unknown   Unknown
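
The table can be reproduced with two small functions. In this sketch 'Unknown' is represented by Python's None; that representation and the names and3 and or3 are assumptions made for illustration.

```python
def and3(values):
    """Three-valued 'and': False dominates, then Unknown (None), else True."""
    if any(v is False for v in values):
        return False
    if any(v is None for v in values):
        return None
    return True


def or3(values):
    """Three-valued 'or': True dominates, then Unknown (None), else False."""
    if any(v is True for v in values):
        return True
    if any(v is None for v in values):
        return None
    return False


# Print every row of the table for two operands P and Q.
for p in (True, False, None):
    for q in (True, False, None):
        print(p, q, and3([p, q]), or3([p, q]))
```

Both functions accept any number of operands, which matches the fact that a CompoundPredicate may contain more than two predicates.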

The operator 'surrogate' provides a special means to handle logical expressions with missing values. It is applied to a sequence of predicates. The order of the predicates matters: the first predicate is the primary, and the following predicates are the surrogates. Evaluation order is left-to-right. The cascaded predicates are applied when the primary predicate evaluates to 'Unknown'. Therefore, the surrogate predicates can provide a resolution to an undetermined predicate.

Example:

<CompoundPredicate booleanOperator="surrogate" >
     <CompoundPredicate booleanOperator="and" >
          <SimplePredicate field="temperature" operator="lessThan"
                              value="90" />
          <SimplePredicate field="temperature" operator="greaterThan"
                              value="50" />
     </CompoundPredicate>
     <SimplePredicate field="humidity" operator="greaterOrEqual"
                         value="80" />
     <False/>
</CompoundPredicate>

The primary predicate is ( (temperature < 90) and (temperature > 50) ). If this evaluates to True or False, then this is the result of the surrogate predicate. If the primary predicate evaluates to Unknown because the value for the field temperature is missing, then the evaluation proceeds with the second predicate (humidity >= 80). If the humidity value is also missing, the final result is False.
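
The cascade in this example can be sketched as follows. Predicates are modelled as Python functions that return True, False, or None for 'Unknown'; this modelling and all function names are assumptions, not part of PMML.

```python
def surrogate(predicates, record):
    """Return the result of the first predicate that is not Unknown (None)."""
    for predicate in predicates:
        result = predicate(record)
        if result is not None:
            return result
    return None  # every predicate was Unknown


def primary(record):
    # (temperature < 90) and (temperature > 50); Unknown if temperature is missing.
    t = record.get("temperature")
    return None if t is None else (50 < t < 90)


def by_humidity(record):
    # humidity >= 80; Unknown if humidity is missing.
    h = record.get("humidity")
    return None if h is None else h >= 80


def always_false(record):
    # Stands in for the <False/> element that ends the cascade.
    return False


cascade = [primary, by_humidity, always_false]
print(surrogate(cascade, {"temperature": 70}))                  # True
print(surrogate(cascade, {"temperature": 95, "humidity": 85}))  # False
print(surrogate(cascade, {"humidity": 85}))                     # True
print(surrogate(cascade, {}))                                   # False
```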


ScoreDistribution

A method to list predicted values in a classification tree structure.


<!ELEMENT ScoreDistribution EMPTY>
<!ATTLIST ScoreDistribution 
     value                      CDATA                         #REQUIRED
     recordCount                %NUMBER;                      #REQUIRED
>

Attribute Definitions

    ScoreDistribution: an element of Node to represent segments of the score that a node predicts in a classification model. If the node holds an enumeration, each entry of the enumeration is stored in one ScoreDistribution element.

    value: This attribute of ScoreDistribution is the label in a classification model.

    recordCount: This attribute of ScoreDistribution is the size (in number of records) associated with the value attribute.

When a Node is selected as the final node and this node has no 'score' attribute, the ScoreDistribution element with the highest recordCount determines which value is selected as the predicted class. If a Node contains a sequence of ScoreDistribution elements in which more than one entry has the highest recordCount, the first such entry is selected.

Note: if a Node has an attribute 'score' then this attribute value overrides the computation of a predicted value from the ScoreDistribution.
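
The selection rule, including the tie-break and the override by 'score', might look like this. The function name predicted_class and the counts are hypothetical; the distribution is assumed to be a list of (value, recordCount) pairs in document order.

```python
def predicted_class(score, distribution):
    """Pick the predicted class for a final node.

    score is the Node's 'score' attribute, or None if it is absent.
    """
    if score is not None:
        # An explicit score overrides the ScoreDistribution.
        return score
    # max() returns the first maximal item, so ties go to the earliest entry.
    best_value, _ = max(distribution, key=lambda entry: entry[1])
    return best_value


dist = [("will play", 10), ("may play", 30), ("no play", 30)]
print(predicted_class(None, dist))       # may play
print(predicted_class("no play", dist))  # no play
```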


Examples:

How to use surrogate

The CART algorithm has the concept of a surrogate split. Suppose a record being classified reaches a node where the primary split is "salary <= 35000", and the record has a missing value for salary, which happens quite naturally. CART deals with this situation by applying a sequence of surrogate rules, in cascade-like fashion, until one of them can classify the given record. There may be zero or more surrogate splits available. In our example, we could have "age <= 28" and "homeowner == 0" as surrogates. If age is not missing, the record is classified according to the age value. If age is missing, we try homeowner. If homeowner is also missing, we have run out of surrogates and we apply True or False, as specified by the XML document. For example, [(salary <= 35000) surrogate (True)] means "classify the record according to salary if salary is not missing; if salary is missing, the predicate returns True anyway".


Example TreeModel

<?xml version="1.0" ?>
<PMML version="2.0" >
<Header copyright="www.dmg.org" description="A very small 
         binary tree model to show structure."/>
     <DataDictionary numberOfFields="5" >
          <DataField name="temperature" optype="continuous"/>
          <DataField name="humidity" optype="continuous"/>
          <DataField name="windy" optype="categorical" >
               <Value value="true"/>
               <Value value="false"/>
          </DataField>
          <DataField name="outlook" optype="categorical" >
               <Value value="sunny"/>
               <Value value="overcast"/>
               <Value value="rain"/>
          </DataField>
          <DataField name="whatIdo" optype="categorical" >
               <Value value="will play"/>
               <Value value="may play"/>
               <Value value="no play"/>
          </DataField>
     </DataDictionary>
     <TreeModel modelName="golfing" functionName="classification">
          <MiningSchema>
               <MiningField name="temperature"/>
               <MiningField name="humidity"/>
               <MiningField name="windy"/>
               <MiningField name="outlook"/>
               <MiningField name="whatIdo" usageType="predicted"/>
          </MiningSchema>
     <Node score="will play">
        <True/>
        <Node score="will play">
           <SimplePredicate field="outlook" operator="equal" 
               value="sunny"/>
           <Node score="will play">
              <CompoundPredicate booleanOperator="and" >
                  <SimplePredicate field="temperature" 
                                   operator="lessThan" value="90" />
                  <SimplePredicate field="temperature" 
                                   operator="greaterThan" value="50" />
              </CompoundPredicate>
              <Node score="will play" >
                 <SimplePredicate field="humidity" 
                                  operator="lessThan" value="80" />
              </Node>
              <Node score="no play" >
                 <SimplePredicate field="humidity" 
                                  operator="greaterOrEqual" value="80" />
              </Node>
           </Node>
           <Node score="no play" >
              <CompoundPredicate booleanOperator="or" >
                 <SimplePredicate field="temperature" 
                                  operator="greaterOrEqual" value="90"/>
                 <SimplePredicate field="temperature" 
                                  operator="lessOrEqual" value="50" />
              </CompoundPredicate>
           </Node>
        </Node>
        <Node score="may play" >
           <CompoundPredicate booleanOperator="or" >
              <SimplePredicate field="outlook" 
                               operator="equal" value="overcast" />
              <SimplePredicate field="outlook" 
                               operator="equal" value="rain" />
           </CompoundPredicate>
           <Node score="may play" >
              <CompoundPredicate booleanOperator="and" >
                 <SimplePredicate field="temperature" 
                                  operator="greaterThan" value="60" />
                 <SimplePredicate field="temperature" 
                                  operator="lessThan" value="100" />
                 <SimplePredicate field="outlook" 
                                  operator="equal" value="overcast" />
                 <SimplePredicate field="humidity" 
                                  operator="lessThan" value="70" />
                 <SimplePredicate field="windy" 
                                  operator="equal" value="false" />
              </CompoundPredicate>
           </Node>
           <Node score="no play" >
              <CompoundPredicate booleanOperator="and" >
                 <SimplePredicate field="outlook" 
                                  operator="equal" value="rain" />
                 <SimplePredicate field="humidity" 
                                  operator="lessThan" value="70" />
              </CompoundPredicate>
           </Node>
        </Node>
     </Node>
     </TreeModel>
</PMML>

Scoring Procedure

We will use the above example to illustrate the steps that should be followed in the scoring process.

The input data is assumed to be:

     temperature=75, humidity=55, windy="false", outlook="overcast"

1) Select the root node. Its predicate is the constant True.

2) Select the first child node at the root node (the order of the child nodes is top-down in the document). Evaluate the predicate of this node (outlook="sunny"). The observation is outlook="overcast", so this predicate evaluates to False. Return to the parent node, and select the second child. The predicate at the second child is (outlook="overcast" or outlook="rain"). With input outlook="overcast" this predicate evaluates to True. Stay at this node.

3) Select the first child of the current node. Evaluate the predicate
(temperature > 60 and temperature < 100 and outlook="overcast" and humidity < 70 and windy="false").
The predicate evaluates to True, so select this node.

4) Since the last selected node does not have any child nodes, it is a leaf node that contains a score value for the predicted field "whatIdo". Finally, the returned score value for "whatIdo" is "may play".
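
The steps above can be sketched in a few lines. Here the example tree is transcribed by hand into nested Python dictionaries, with each predicate written as a plain function. The representation, the helper names node and score_record, and the behaviour when no child predicate is true (return the score of the last selected node) are all assumptions. Missing values are not handled.

```python
def node(score, predicate, *children):
    """Build one tree node; children are kept in document order."""
    return {"score": score, "predicate": predicate, "children": children}


# Hand transcription of the example TreeModel "golfing".
TREE = node("will play", lambda r: True,
    node("will play", lambda r: r["outlook"] == "sunny",
        node("will play", lambda r: 50 < r["temperature"] < 90,
            node("will play", lambda r: r["humidity"] < 80),
            node("no play", lambda r: r["humidity"] >= 80)),
        node("no play",
             lambda r: r["temperature"] >= 90 or r["temperature"] <= 50)),
    node("may play", lambda r: r["outlook"] in ("overcast", "rain"),
        node("may play", lambda r: (60 < r["temperature"] < 100
                                    and r["outlook"] == "overcast"
                                    and r["humidity"] < 70
                                    and r["windy"] == "false")),
        node("no play",
             lambda r: r["outlook"] == "rain" and r["humidity"] < 70)))


def score_record(tree, record):
    """Walk down from the root, taking the first child whose predicate is true."""
    current = tree
    while True:
        chosen = next(
            (c for c in current["children"] if c["predicate"](record)), None)
        if chosen is None:
            # Leaf reached, or (assumption) no child matched: use this score.
            return current["score"]
        current = chosen


record = {"temperature": 75, "humidity": 55,
          "windy": "false", "outlook": "overcast"}
print(score_record(TREE, record))  # may play
```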

Conformance

The following items are the non-core features of TreeModel:

    (a) element ScoreDistribution

    (b) operators "xor" and "surrogate" of attribute booleanOperator at element CompoundPredicate
