Data Mining Group - PMML Tree Model

PMML 1.1 -- DTD of Tree Model

The tree modeling framework allows for defining either a classification or a regression structure. Each Node holds a rule, called a predicate (one of the %PREDICATES; alternatives), that determines whether the Node is chosen. In addition to the predicate, an inner Node holds two or more child Nodes (multi-ary branching). A Node is chosen in order to retrieve the value held in its score attribute.

A Tree Model consists of four major parts:

					
	<!ELEMENT TreeModel (Extension*, MiningSchema, ModelStats?, Node)>
	
	<!ATTLIST TreeModel
	     modelName              CDATA                           #IMPLIED
	>

Definitions:

Tree Model - starts the definition for a tree model.
Node - this element encapsulates either a split or a leaf of a tree model. Every Node contains one predicate that defines the rule for choosing the Node itself. A CompoundPredicate, which combines several and possibly nested predicates, can be used to define a complex split rule.
Model Name - the value of modelName identifies the model with a unique name in the context of the PMML file. This attribute is not required; therefore, consumers of PMML models are free to manage model naming at their discretion.

Each Node consists of:


	<!ELEMENT Node ( Extension*, (%PREDICATES;), ScoreDistribution*, Node* )>
					
	<!ATTLIST Node
	     score                  CDATA                           #REQUIRED
	     recordCount            %NUMBER;                        #IMPLIED
	>

Definitions:

Score - The value of score in a Node serves as the predicted value for a record that chooses the Node.
RecordCount - The value of recordCount in a Node serves as the base size for the recordCount values in its ScoreDistribution elements. These numbers do not necessarily reflect the number of records that were used to build/train the model. Nevertheless, they make it possible to determine the relative size of given values in a score distribution as well as the relative size of a Node compared to its parent Node.

The PREDICATES are:

Each Node has one %PREDICATES; that may be a Predicate, a CompoundPredicate, a True, or a False.


	<!ENTITY % PREDICATES "( Predicate | CompoundPredicate | True | False ) " >

	<!ELEMENT Predicate EMPTY>
	
	<!ATTLIST Predicate
	     field      %FIELD-NAME;                                                                #REQUIRED
	     operator   (equal | notEqual | lessThan | lessOrEqual | greaterThan | greaterOrEqual)  #REQUIRED
	     value      CDATA                                                                       #REQUIRED
	>

Definitions:

Predicate - this element defines a rule in the form of a simple boolean expression. The rule consists of a field, a comparison operator (the operator attribute), and a value. Mathematically, the rule is expressed as field operator value, that is, the field is the left operand and the value is the right operand.
The following samples represent the equivalent to "age < 30"

	
	<Predicate field="age"         operator="lessThan" value="30" />
	<Predicate value="30"          operator="lessThan" field="age" />
	<Predicate operator="lessThan" value="30"          field="age" />

Field - This attribute of the Predicate element names one of the MiningField elements in the MiningSchema.
Operator - This attribute of Predicate is one of the six pre-defined comparison operators.

	     Operator          Math Symbol

	     equal                =
	     notEqual             !=
	     lessThan             <
	     lessOrEqual          <=
	     greaterThan          >
	     greaterOrEqual       >=

Value - This attribute of Predicate element is the information to evaluate/compare against.
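
The rule "field operator value" can be sketched in a few lines of Python. This is only an illustration, not part of the PMML specification: the names OPERATORS, _coerce, and evaluate_predicate are hypothetical, and the numeric coercion of the value attribute is an assumption, since the DTD declares value as plain CDATA.

```python
import operator

# Map the six PMML operator names to Python comparison functions.
OPERATORS = {
    "equal": operator.eq,
    "notEqual": operator.ne,
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
}


def _coerce(raw):
    """Assumption: anything that parses as a number is compared numerically."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return raw


def evaluate_predicate(field, op, value, record):
    """Evaluate `field op value`; the field is the left operand."""
    left = _coerce(record[field])
    right = _coerce(value)
    return OPERATORS[op](left, right)


record = {"age": 25, "outlook": "sunny"}
print(evaluate_predicate("age", "lessThan", "30", record))     # True
print(evaluate_predicate("outlook", "equal", "rain", record))  # False
```

The sketch does not handle missing fields or ordering comparisons between a number and a label; a real consumer would need a policy for both.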


	<!ELEMENT CompoundPredicate ( %PREDICATES; , (%PREDICATES;)+ )>
	
	<!ATTLIST CompoundPredicate 
	     booleanOperator           (or | and | xor | cascade)      #REQUIRED
	>

Definition:

CompoundPredicate - an encapsulating element for combining two or more elements as defined by the entity %PREDICATES;.
BooleanOperators - The operators and, or, and xor are evaluated as binary operators, where two operands are combined by one operator (operand1 operator operand2). Because these operators are associative, the order of evaluation is irrelevant for the predicates within one compound predicate. For example,

(P1 xor P2) xor P3 = P1 xor (P2 xor P3)

(P1 and P2) and P3 = P1 and (P2 and P3)

(P1 or P2) or P3 = P1 or (P2 or P3)

Sub-predicates (the children of a compound predicate) are to be grouped together and evaluated together. For example,

( (temperature > 60) and (temperature < 100) and (outlook=overcast) )

		
	<CompoundPredicate booleanOperator="and" >
						
		<Predicate field="temperature" operator="greaterThan" value="60" />
		<Predicate field="temperature" operator="lessThan"   value="100" />
		<Predicate field="outlook"     operator="equal"      value="overcast"/>
	</CompoundPredicate>

When the children of a compound predicate are themselves compound predicates, each nested compound predicate is evaluated as a unit. For example,

( ( (temperature < 90) and (temperature > 50) ) or (humidity >=80) )

					
	<CompoundPredicate booleanOperator="or">
	
		<CompoundPredicate booleanOperator="and">
			<Predicate field="temperature" operator="lessThan"   value="90" />
			<Predicate field="temperature" operator="greaterThan" value="50" />
		</CompoundPredicate>
			
		<Predicate field="humidity" operator="greaterOrEqual" value="80" />

	</CompoundPredicate>
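
As a rough illustration of how and, or, and xor combine sub-predicate results, the following Python sketch folds a list of already-evaluated booleans. The function name combine and the sample record are hypothetical; cascade is deliberately left out because it depends on missing values rather than on plain boolean logic.

```python
from functools import reduce
import operator


def combine(boolean_operator, results):
    """Combine already-evaluated sub-predicate results (True/False)."""
    results = list(results)
    if boolean_operator == "and":
        return all(results)
    if boolean_operator == "or":
        return any(results)
    if boolean_operator == "xor":
        # Left fold; since xor is associative the grouping does not matter.
        return reduce(operator.xor, results)
    raise ValueError(f"unsupported booleanOperator: {boolean_operator}")


record = {"temperature": 95, "humidity": 85}

# ( (temperature < 90) and (temperature > 50) ) or (humidity >= 80)
inner = combine("and", [record["temperature"] < 90, record["temperature"] > 50])
outer = combine("or", [inner, record["humidity"] >= 80])
print(inner, outer)  # False True
```

The nested compound predicate is evaluated first and its single result is then passed to the enclosing operator, which mirrors the grouping shown in the XML above.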

The operator cascade is used to specify surrogate splits. The order of the predicates matters: the first predicate (or compound predicate) is the primary, and the following predicates are the surrogates. Evaluation order is top-down. Surrogate predicates are applied when the primary predicate cannot be evaluated because of a missing value and the DataDictionary and MiningSchema do not define how to handle the missing value. The surrogate predicates can therefore provide a resolution for the undetermined predicate. For example,


	<CompoundPredicate booleanOperator="cascade">
		
		<CompoundPredicate booleanOperator="and">
			<Predicate field="temperature" operator="lessThan"   value="90" />
			<Predicate field="temperature" operator="greaterThan" value="50" />
		</CompoundPredicate>

		<Predicate field="humidity" operator="greaterOrEqual" value="80" />
		<False/>
	</CompoundPredicate>

Primary split rule: ( (temperature < 90) and (temperature > 50) )
Surrogate split (a): (humidity >= 80)
Surrogate split (b): False

If the value for the field temperature is missing, then try to evaluate the predicate on the field humidity; if the humidity value is also missing, then return false for the evaluation.
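
The cascade behavior can be sketched in Python by letting a predicate return None when it cannot be evaluated. All names here (compare, all_of, cascade) are hypothetical. Treating an "and" with a missing operand as undetermined is an assumption of this sketch; the text above only states that the primary predicate "cannot be evaluated".

```python
def compare(field, test):
    """Build a predicate that yields None when the field is missing."""
    def predicate(record):
        value = record.get(field)
        return None if value is None else test(value)
    return predicate


def all_of(*predicates):
    """'and' that is undetermined (None) if any operand is undetermined.

    Assumption: the source does not define 'and' over missing values.
    """
    def predicate(record):
        results = [p(record) for p in predicates]
        return None if None in results else all(results)
    return predicate


def cascade(*predicates):
    """Return the result of the first predicate that can be evaluated."""
    def predicate(record):
        for p in predicates:  # top-down: primary first, then surrogates
            result = p(record)
            if result is not None:
                return result
        return None  # nothing could be evaluated
    return predicate


split = cascade(
    all_of(compare("temperature", lambda t: t < 90),
           compare("temperature", lambda t: t > 50)),
    compare("humidity", lambda h: h >= 80),
    lambda record: False,  # <False/> as the last resort
)

print(split({"temperature": 75, "humidity": 40}))  # True  (primary rule)
print(split({"humidity": 85}))                     # True  (surrogate a)
print(split({}))                                   # False (surrogate b)
```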

					
	<!ELEMENT True EMPTY>

Definition:

True - a predicate element that identifies the boolean constant TRUE.

	
	<!ELEMENT False EMPTY>

Definition:

False - a predicate element that identifies the boolean constant FALSE.

ScoreDistribution

A method to list predicted values in a classification tree structure.


	<!ELEMENT ScoreDistribution EMPTY>
	<!ATTLIST ScoreDistribution 
	     value                  CDATA                           #REQUIRED
	     recordCount            %NUMBER;                        #REQUIRED
	>

Definitions:

Score Distribution - an element of Node that represents segments of the score that a node predicts for a classification model. If the node holds an enumeration, each entry of the enumeration is stored in one ScoreDistribution element.

Value - This attribute of ScoreDistribution is one of the labels for the categorical predicted field.

RecordCount - This attribute of ScoreDistribution is the size (possibly the number of records) associated with the value attribute.
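
Because the recordCount of a Node is the base size for its ScoreDistribution entries, the relative share of each label can be derived by a simple division. The Python sketch below assumes a hypothetical Node fragment with made-up counts; class_shares is an illustrative helper, not something defined by PMML.

```python
import xml.etree.ElementTree as ET

# Hypothetical Node with invented record counts, for illustration only.
NODE_XML = """
<Node score="will play" recordCount="100">
  <True/>
  <ScoreDistribution value="will play" recordCount="60"/>
  <ScoreDistribution value="may play" recordCount="30"/>
  <ScoreDistribution value="no play" recordCount="10"/>
</Node>
"""


def class_shares(node):
    """Divide each ScoreDistribution recordCount by the Node's recordCount."""
    base = float(node.get("recordCount"))
    return {
        sd.get("value"): float(sd.get("recordCount")) / base
        for sd in node.findall("ScoreDistribution")
    }


shares = class_shares(ET.fromstring(NODE_XML))
print(shares)  # {'will play': 0.6, 'may play': 0.3, 'no play': 0.1}
```

Since recordCount on Node is #IMPLIED, a consumer would have to decide what to do when it is absent; this sketch simply assumes it is present.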

Extensions:

	
	<!ATTLIST TreeModel x-splitCharacteristic (binarySplit | multiSplit) #REQUIRED>

Definition:

x-splitCharacteristic - indicates whether the tree model has exactly two splits per node, or multiple splits per node. In the case of multiSplit, each node may independently have two or more splits.
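
One way a producer might derive this value is to count the child Nodes of every inner Node. The following Python sketch reads "two splits per node" as "two child Nodes per inner Node", which is an interpretation rather than something the definition states outright; split_characteristic is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def split_characteristic(root):
    """Return 'binarySplit' if every inner Node has exactly two child Nodes.

    Assumption: any other branching is reported as 'multiSplit'.
    """
    for node in root.iter("Node"):
        children = node.findall("Node")
        if children and len(children) != 2:
            return "multiSplit"
    return "binarySplit"


binary_tree = ET.fromstring(
    "<Node score='a'><True/>"
    "<Node score='b'><True/></Node>"
    "<Node score='c'><True/></Node>"
    "</Node>"
)
multi_tree = ET.fromstring(
    "<Node score='a'><True/>"
    "<Node score='b'><True/></Node>"
    "<Node score='c'><True/></Node>"
    "<Node score='d'><True/></Node>"
    "</Node>"
)

print(split_characteristic(binary_tree))  # binarySplit
print(split_characteristic(multi_tree))   # multiSplit
```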

Example for a TreeModel:


	<?xml version="1.0" standalone="no" ?>
	<!DOCTYPE PMML PUBLIC "PMML1.1" "pmml-1-1.dtd">
	<PMML version="1.1">
	<Header copyright="www.DMG.com" description="A very small binary tree model to show structure." />

	<DataDictionary numberOfFields="5">  
	
		<DataField name="temperature" optype="continuous">
			<Interval closure="openOpen" leftMargin="-20" rightMargin="120" />
		</DataField>

		<DataField name="humidity" optype="continuous">
			<Interval closure="closedClosed" leftMargin="30" rightMargin="100" />
		</DataField>
		
		<DataField name="windy" optype="categorical">
			<Value value="true" />
			<Value value="false" />
		</DataField>

		<DataField name="outlook" optype="categorical" >
			<Value value="sunny" />
			<Value value="overcast" />
			<Value value="rain" />
		</DataField>  
		
		<DataField name="whatIdo" optype="ordinal">	
			<Value value="no play" />
			<Value value="may play" />
			<Value value="will play" />
		</DataField>
	
	</DataDictionary>
	
	<TreeModel modelName="golfing">

	<MiningSchema>
		<MiningField name="temperature" />
		<MiningField name="humidity" /> 
		<MiningField name="windy" />
		<MiningField name="outlook" />
		<MiningField name="whatIdo" usageType="predicted" />
	</MiningSchema> 
	
	<Node score="will play">
		<True/> 
	
	<Node score="will play">
		<Predicate field="outlook" operator="equal" value="sunny"/> 

		<Node score="will play">
		
			<CompoundPredicate booleanOperator="and">
				<Predicate field="temperature" operator="lessThan" value="90" />
				<Predicate field="temperature" operator="greaterThan" value="50" />
			</CompoundPredicate> 
		
			<Node score="will play">
				<Predicate field="humidity" operator="lessThan" value="80" />
			</Node>

			<Node score="no play">
				<Predicate field="humidity" operator="greaterOrEqual" value="80" />
			</Node>
	
		</Node>
		
		<Node score="no play">
			<CompoundPredicate booleanOperator="or">
				<Predicate field="temperature" operator="greaterOrEqual" value="90"/>
				<Predicate field="temperature" operator="lessOrEqual" value="50" />
			</CompoundPredicate>  
		</Node> 
		
	</Node>

	<Node score="may play"> 
		<CompoundPredicate booleanOperator="or">
			<Predicate field="outlook" operator="equal" value="overcast" />
			<Predicate field="outlook" operator="equal" value="rain" /> 
		</CompoundPredicate>
			
		<Node score="may play">

			<CompoundPredicate booleanOperator="and">
				<Predicate field="temperature" operator="greaterThan" value="60" />
				<Predicate field="temperature" operator="lessThan" value="100" />
				<Predicate field="outlook" operator="equal" value="overcast" />
				<Predicate field="humidity" operator="lessThan" value="70" />
				<Predicate field="windy" operator="equal" value="false" />
			</CompoundPredicate>
		</Node>

		<Node score="no play">  
			<CompoundPredicate booleanOperator="and">
				<Predicate field="outlook" operator="equal" value="rain" />
				<Predicate field="humidity" operator="lessThan" value="70" />
			</CompoundPredicate> 
		</Node>
	
	</Node>
	
	</Node>
	</TreeModel>
	</PMML>

Simplified Scoring Procedure

We will use the above example to illustrate the steps that should be followed in the scoring process.

The following case (observation) must be scored:

temperature=75, humidity=55, windy=false, outlook=overcast

Parse the PMML file.
Select the root node and evaluate its predicate. The evaluation must be "true" because its predicate is the constant "true".
Select the first child node of the root node (the child nodes are ordered top-down). Evaluate the predicate of this node (outlook=sunny); the observation is outlook=overcast, so this predicate evaluates to "false". Return to the parent node, and then select the second child. The predicate at the second child is (outlook=overcast or outlook=rain). The observation is still outlook=overcast, so this predicate evaluates to "true". Stay at this node.
Select the first child of the current node. Evaluate the predicate: (temperature > 60 & temperature < 100 & outlook=overcast & humidity < 70 & windy=false). The observation (temperature=75, humidity=55, windy=false, outlook=overcast) evaluates to "true", so select this node.
Since the last selected node does not have any child nodes, it must be a leaf node that contains a score value for the dependent field "whatIdo". The returned score for "whatIdo" will be "may play".
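
The steps above can be sketched in Python with the standard xml.etree.ElementTree module. This is a minimal illustration under several assumptions: the tree is a trimmed copy of the example (the sunny branch is cut to a leaf), only the operators and and or are handled, numeric record values are compared numerically, and a node whose children all evaluate to "false" returns its own score. The names evaluate and score are hypothetical.

```python
import operator
import xml.etree.ElementTree as ET

# Trimmed copy of the example tree; the sunny branch is reduced to a leaf.
TREE_XML = """
<Node score="will play">
  <True/>
  <Node score="will play">
    <Predicate field="outlook" operator="equal" value="sunny"/>
  </Node>
  <Node score="may play">
    <CompoundPredicate booleanOperator="or">
      <Predicate field="outlook" operator="equal" value="overcast"/>
      <Predicate field="outlook" operator="equal" value="rain"/>
    </CompoundPredicate>
    <Node score="may play">
      <CompoundPredicate booleanOperator="and">
        <Predicate field="temperature" operator="greaterThan" value="60"/>
        <Predicate field="temperature" operator="lessThan" value="100"/>
        <Predicate field="outlook" operator="equal" value="overcast"/>
        <Predicate field="humidity" operator="lessThan" value="70"/>
        <Predicate field="windy" operator="equal" value="false"/>
      </CompoundPredicate>
    </Node>
    <Node score="no play">
      <CompoundPredicate booleanOperator="and">
        <Predicate field="outlook" operator="equal" value="rain"/>
        <Predicate field="humidity" operator="lessThan" value="70"/>
      </CompoundPredicate>
    </Node>
  </Node>
</Node>
"""

OPS = {
    "equal": operator.eq, "notEqual": operator.ne,
    "lessThan": operator.lt, "lessOrEqual": operator.le,
    "greaterThan": operator.gt, "greaterOrEqual": operator.ge,
}
PREDICATE_TAGS = ("Predicate", "CompoundPredicate", "True", "False")


def evaluate(pred, record):
    """Evaluate one predicate element against a record (no missing values)."""
    if pred.tag == "True":
        return True
    if pred.tag == "False":
        return False
    if pred.tag == "Predicate":
        left, right = record[pred.get("field")], pred.get("value")
        if isinstance(left, (int, float)):
            right = float(right)  # assumption: numeric fields compare as numbers
        return OPS[pred.get("operator")](left, right)
    results = [evaluate(child, record) for child in pred]
    if pred.get("booleanOperator") == "and":
        return all(results)
    if pred.get("booleanOperator") == "or":
        return any(results)
    raise NotImplementedError("xor and cascade are omitted in this sketch")


def score(node, record):
    """Return the score of the selected node, or None if `node` is rejected."""
    predicate = next(child for child in node if child.tag in PREDICATE_TAGS)
    if not evaluate(predicate, record):
        return None
    for child in node.findall("Node"):  # children are tried top-down
        result = score(child, record)
        if result is not None:
            return result
    return node.get("score")  # leaf, or no child matched (assumption)


root = ET.fromstring(TREE_XML)
record = {"temperature": 75, "humidity": 55, "windy": "false", "outlook": "overcast"}
print(score(root, record))  # may play
```

Categorical values are kept as strings in the record so that they compare directly with the value attribute; a production scorer would take the field types from the DataDictionary instead.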

Conformance

Model producers and consumers are not required to implement the non-core features of a TreeModel; a model that omits them is still compliant.

The following items are the non-core features of TreeModel:

(a) element ScoreDistribution

(b) operators "xor" and "cascade" of attribute booleanOperator at element CompoundPredicate