Data Mining Group - PMML 4.0 - Multiple Models: Model Composition, Ensembles, and Segmentation

The PMML standard provides several ways to represent multiple models within one PMML file. The simplest way is to put several models in one PMML element, but then it is not clear how the models should be used. The element MiningModel allows precise specification of the usage of multiple models within one PMML file. The two main approaches are Model Composition and Segmentation.

Model Composition includes model sequencing and model selection using a decision tree, but is only applicable to Tree and Regression models. The Segmentation approach allows representation of different models for different data segments and also of model ensembles. Scoring a case using a model ensemble consists of scoring it using each model separately, then combining the results into a single scoring result using one of the pre-defined combination methods.

Segmentation is accomplished by placing any PMML model inside a Segment element, which also contains a PREDICATE and an optional weight. MiningModel then contains a Segmentation element, which holds a number of Segment elements and carries the attribute multipleModelMethod specifying how all the models applicable to a record should be combined. It is also possible to combine the model composition and segmentation approaches, using simple regression or decision trees for data preprocessing before segmentation.

Sample scenarios

Treatment of multiple models in PMML covers a variety of scenarios such as the following examples:

A logistic regression model may require non-trivial rules for replacing missing values like
  if Age is missing
    if Occupation is "Student" then Age := 20
    else if Occupation is "Retired" then Age := 70
    else Age := 40
These preprocessing rules can be defined by a simple decision tree model that is put into a sequence with the regression model.
Tree ensembles are a case where simple algorithms for combining the results of either classification or regression trees are well known. Multiple classification trees, for example, are commonly combined using a "majority-vote" method. Multiple regression trees are often combined using various averaging techniques. PMML allows models to be tagged with segment identifiers and weights that are used when combining them in these ways.
One method for selecting among multiple models according to context is to supply a list of models tagged with segment definitions that dictate the circumstances under which that model is applicable.
Another common method for optimizing prediction models is the combination of segmentation and regression using a tree model to specify segments. Data are grouped into segments and for each segment there may be different regression equations. If the segmentation can be expressed by decision rules then this kind of segment based regression can be implemented by a decision tree where any leaf node in the tree can contain an embedded regression model.
Prediction results may have to be combined with a cost or profit matrix before a decision can be derived. A mailing campaign model may use tree classification to determine response probabilities per customer and channel. The cost matrix can be appended as a regression model that applies cost weighting factors to different channels, e.g., high cost for phone and low cost for email. The final decision is then based on the outcome of the regression model.
A voting scheme that merges results from multiple models can also be implemented by model composition in PMML. For example, there may be an ensemble of four classification models A, B, C, and D for the same target with values "yes" and "no". The final classification result may be defined as the average of the results from A, B, C, and D. The average can be computed by a regression model with equations
  p_yes = 0.25*pA_yes + 0.25*pB_yes + 0.25*pC_yes + 0.25*pD_yes
  p_no = 0.25*pA_no + 0.25*pB_no + 0.25*pC_no + 0.25*pD_no
where pX_yes stands for the probability of class "yes" and pX_no stands for the probability of class "no" in model X. Note that the segmentation approach provides a simpler alternative to this representation: if each model is placed inside a Segment element with the predicate <True/> and the multipleModelMethod attribute of Segmentation is set to "average", no Regression element is needed.
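
The averaging scheme for models A, B, C, and D can be checked with a small sketch. The probabilities below are made-up values for illustration, and the function name average_probabilities is hypothetical; the code only mirrors what equal coefficients of 0.25 (or the "average" combination method) would compute for one case.

```python
def average_probabilities(model_outputs):
    """Average per-class probabilities over all models with equal weights."""
    classes = list(model_outputs[0].keys())
    count = len(model_outputs)
    return {c: sum(m[c] for m in model_outputs) / count for c in classes}

# Hypothetical outputs of models A, B, C, and D for a single case.
outputs = [
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.6, "no": 0.4},
    {"yes": 0.3, "no": 0.7},
    {"yes": 0.8, "no": 0.2},
]

combined = average_probabilities(outputs)
# The final class is the one with the highest combined probability.
winner = max(combined, key=combined.get)
print(combined, winner)
```

With four models, dividing the sum by the model count is the same as multiplying each probability by 0.25.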

XML Schema

All variations on support for multiple models rely on the MiningModel model type:


  <xs:element name="MiningModel">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="MiningSchema"/>
        <xs:element ref="Output" minOccurs="0"/>
        <xs:element ref="ModelStats" minOccurs="0"/>
        <xs:element ref="ModelExplanation" minOccurs="0"/>
        <xs:element ref="Targets" minOccurs="0"/>
        <xs:element ref="LocalTransformations" minOccurs="0" />
        <xs:choice minOccurs="0" maxOccurs="unbounded">
          <xs:element ref="Regression"/>
          <xs:element ref="DecisionTree"/>
        </xs:choice>
        <xs:element ref="Segmentation" minOccurs="0"/>
        <xs:element ref="ModelVerification" minOccurs="0"/>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string" use="optional"/>
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
      <xs:attribute name="algorithmName" type="xs:string" use="optional"/>
    </xs:complexType>
  </xs:element>

A Segmentation element contains several Segments and a model combination method. It can also contain some local transformations that can help in data preprocessing. Each Segment includes a PREDICATE element specifying the conditions under which that segment is to be used. For more details on PREDICATE see the section on predicates in TreeModel. It explains how predicates are described and evaluated and how missing values are handled.


  <xs:element name="Segmentation">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="LocalTransformations" minOccurs="0" />
        <xs:element ref="Segment" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="multipleModelMethod" type="MULTIPLE-MODEL-METHOD" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="Segment">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:group ref="PREDICATE"/>
        <xs:choice>
          <xs:element ref="ClusteringModel"/>
          <xs:element ref="GeneralRegressionModel"/>
          <xs:element ref="NaiveBayesModel"/>
          <xs:element ref="NeuralNetwork"/>
          <xs:element ref="RegressionModel"/>
          <xs:element ref="RuleSetModel"/>
          <xs:element ref="SupportVectorMachineModel"/>
          <xs:element ref="TreeModel"/>
          <xs:element ref="Extension"/>
        </xs:choice>
      </xs:sequence>
      <xs:attribute name="id" type="xs:string" use="optional"/>
      <xs:attribute name="weight" type="NUMBER" use="optional"/>
    </xs:complexType>
  </xs:element>

The Segment element is used to tag each model that can be combined as part of an ensemble or associated with a population segment. A multiple model combination method must be specified using the multipleModelMethod attribute of the Segmentation element.


  <xs:simpleType name="MULTIPLE-MODEL-METHOD">
    <xs:restriction base="xs:string">
      <xs:enumeration value="majorityVote"/>
      <xs:enumeration value="weightedMajorityVote"/>
      <xs:enumeration value="average"/>
      <xs:enumeration value="weightedAverage"/>
      <xs:enumeration value="median"/>
      <xs:enumeration value="max"/>
      <xs:enumeration value="sum"/>
      <xs:enumeration value="selectFirst"/>
      <xs:enumeration value="selectAll"/>
    </xs:restriction>
  </xs:simpleType>

Note that all models used inside Segment elements in one MiningModel must have the same mining function (the functionName attribute): regression, classification, or clustering. The model combination methods listed above are applicable as follows:

selectFirst is applicable to any model type. Simply use the first model for which the predicate in the Segment evaluates to true.
selectAll is applicable to any model type. All models for which the predicate in the Segment evaluates to true are evaluated. The Output element should be used to specify inclusion of a segment id in the evaluation results so as to match results with the associated model segment. The PMML standard does not specify a mechanism for returning more than one value per record scored. Different implementations may choose to implement returning multiple values for a single record differently.
For clustering models only majorityVote, weightedMajorityVote, selectFirst, or selectAll can be used. In case of majorityVote the cluster ID that was selected by the largest number of models wins. For weightedMajorityVote the weights specified in Segment elements are used, and the cluster ID with highest total weight wins.
For regression models only average, weightedAverage, median, sum, selectFirst, or selectAll are applicable. The first four methods are applied to the predicted values of all models for which the predicate evaluates to true.
For classification models all the combination methods except sum can be used. The methods average, weightedAverage, median, and max are applied to the predicted probabilities of the target categories in each of the models used for the case; the winning category is then the one with the highest combined probability. In contrast, majorityVote and weightedMajorityVote use the predicted categories from all applicable models and select the winner based on the models' "votes".
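
The applicability rules above can be written down as a lookup table. The following is a minimal sketch, not part of the standard: the names ALLOWED_METHODS and check_segmentation are hypothetical, and the table is simply transcribed from the rules in this section.

```python
# Combination methods allowed per mining function, transcribed from the
# rules above (classification allows everything except "sum").
ALLOWED_METHODS = {
    "clustering": {"majorityVote", "weightedMajorityVote",
                   "selectFirst", "selectAll"},
    "regression": {"average", "weightedAverage", "median", "sum",
                   "selectFirst", "selectAll"},
    "classification": {"majorityVote", "weightedMajorityVote", "average",
                       "weightedAverage", "median", "max",
                       "selectFirst", "selectAll"},
}

def check_segmentation(function_names, method):
    """Return the shared mining function, or raise ValueError if invalid."""
    distinct = set(function_names)
    if len(distinct) != 1:
        raise ValueError("all segment models must share one mining function")
    (function,) = distinct
    if function not in ALLOWED_METHODS:
        raise ValueError("mining function not covered by these rules: " + function)
    if method not in ALLOWED_METHODS[function]:
        raise ValueError(method + " is not applicable to " + function)
    return function

print(check_segmentation(["classification"] * 3, "majorityVote"))
```

A producer or consumer could run such a check before scoring, so that an invalid pairing such as "sum" with classification models is rejected early.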

The following three examples demonstrate the use of the Segment element to accomplish tree ensembles and segmentation. The first example demonstrates the implementation of an ensemble of classification trees whose results are combined by majority vote:

 <MiningModel functionName="classification" >
  <MiningSchema>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="day" usageType="active"/>
      <MiningField name="continent" usageType="active"/>
      <MiningField name="sepal_length" usageType="supplementary"/>
      <MiningField name="sepal_width" usageType="supplementary"/>
      <MiningField name="Class" usageType="predicted"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
  <Segment id="1">
   <True/>
   <TreeModel modelName="Iris" functionName="classification" splitCharacteristic="binarySplit">
    <MiningSchema>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="Class" usageType="predicted"/>
    </MiningSchema>
    <Node score="Iris-setosa" recordCount="150">
     <True/>
     <ScoreDistribution value="Iris-setosa" recordCount="50"/>
     <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
     <ScoreDistribution value="Iris-virginica" recordCount="50"/>
     <Node score="Iris-setosa" recordCount="50">
       <SimplePredicate field="petal_length" operator="lessThan" value="2.45"/>
       <ScoreDistribution value="Iris-setosa" recordCount="50"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="0"/>
       <ScoreDistribution value="Iris-virginica" recordCount="0"/>
     </Node>
     <Node score="Iris-versicolor" recordCount="100">
       <True/>
       <ScoreDistribution value="Iris-setosa" recordCount="0"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
       <ScoreDistribution value="Iris-virginica" recordCount="50"/>
       <Node score="Iris-versicolor" recordCount="54">
         <SimplePredicate field="petal_width" operator="lessThan" value="1.75"/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="49"/>
         <ScoreDistribution value="Iris-virginica" recordCount="5"/>
       </Node>
       <Node score="Iris-virginica" recordCount="46">
         <True/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="1"/>
         <ScoreDistribution value="Iris-virginica" recordCount="45"/>
       </Node>
     </Node>
    </Node>
   </TreeModel>
  </Segment>
  <Segment id="2">
   <True/>
   <TreeModel modelName="Iris" functionName="classification" splitCharacteristic="binarySplit">
    <MiningSchema>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="continent" usageType="active"/>
      <MiningField name="Class" usageType="predicted"/>
    </MiningSchema>
    <Node score="Iris-setosa" recordCount="150">
     <True/>
     <ScoreDistribution value="Iris-setosa" recordCount="50"/>
     <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
     <ScoreDistribution value="Iris-virginica" recordCount="50"/>
     <Node score="Iris-setosa" recordCount="50">
       <SimplePredicate field="petal_length" operator="lessThan" value="2.15"/>
       <ScoreDistribution value="Iris-setosa" recordCount="50"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="0"/>
       <ScoreDistribution value="Iris-virginica" recordCount="0"/>
     </Node>
     <Node score="Iris-versicolor" recordCount="100">
       <True/>
       <ScoreDistribution value="Iris-setosa" recordCount="0"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
       <ScoreDistribution value="Iris-virginica" recordCount="50"/>
       <Node score="Iris-versicolor" recordCount="54">
         <SimplePredicate field="petal_width" operator="lessThan" value="1.93"/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="49"/>
         <ScoreDistribution value="Iris-virginica" recordCount="5"/>
         <Node score="Iris-versicolor" recordCount="48">
           <SimplePredicate field="continent" operator="equal" value="africa"/>
           <ScoreDistribution value="Iris-setosa" recordCount="0"/>
           <ScoreDistribution value="Iris-versicolor" recordCount="48"/>
           <ScoreDistribution value="Iris-virginica" recordCount="1"/>
         </Node>
         <Node score="Iris-virginica" recordCount="6">
           <SimplePredicate field="continent" operator="notEqual" value="africa"/>
           <ScoreDistribution value="Iris-setosa" recordCount="0"/>
           <ScoreDistribution value="Iris-versicolor" recordCount="1"/>
           <ScoreDistribution value="Iris-virginica" recordCount="4"/>
         </Node>
       </Node>
       <Node score="Iris-virginica" recordCount="46">
         <True/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="1"/>
         <ScoreDistribution value="Iris-virginica" recordCount="45"/>
       </Node>
     </Node>
    </Node>
   </TreeModel>
  </Segment>
  <Segment id="3">
   <True/>
   <TreeModel modelName="Iris" functionName="classification" splitCharacteristic="binarySplit">
    <MiningSchema>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="continent" usageType="active"/>
      <MiningField name="Class" usageType="predicted"/>
    </MiningSchema>
    <Node score="Iris-setosa" recordCount="150">
     <True/>
     <ScoreDistribution value="Iris-setosa" recordCount="50"/>
     <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
     <ScoreDistribution value="Iris-virginica" recordCount="50"/>
     <Node score="Iris-setosa" recordCount="50">
       <SimplePredicate field="petal_width" operator="lessThan" value="2.85"/>
       <ScoreDistribution value="Iris-setosa" recordCount="50"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="0"/>
       <ScoreDistribution value="Iris-virginica" recordCount="0"/>
     </Node>
     <Node score="Iris-versicolor" recordCount="100">
       <True/>
       <ScoreDistribution value="Iris-setosa" recordCount="0"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
       <ScoreDistribution value="Iris-virginica" recordCount="50"/>
       <Node score="Iris-versicolor" recordCount="54">
         <SimplePredicate field="continent" operator="equal" value="asia"/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="49"/>
         <ScoreDistribution value="Iris-virginica" recordCount="5"/>
       </Node>
       <Node score="Iris-virginica" recordCount="46">
         <SimplePredicate field="continent" operator="notEqual" value="asia"/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="1"/>
         <ScoreDistribution value="Iris-virginica" recordCount="45"/>
       </Node>
     </Node>
    </Node>
   </TreeModel>
  </Segment>
  </Segmentation>
 </MiningModel>
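
To illustrate how the votes of such an ensemble might be counted, here is a minimal sketch. The function name majority_vote and the per-tree predictions are hypothetical, and resolving ties by first occurrence is an assumption, since this section does not define tie-breaking.

```python
from collections import Counter

def majority_vote(predictions, weights=None):
    """Return the category with the largest (optionally weighted) vote total."""
    if weights is None:
        # Plain majorityVote: every applicable model counts once.
        weights = [1.0] * len(predictions)
    totals = Counter()
    for category, weight in zip(predictions, weights):
        totals[category] += weight
    # Assumption: on a tie, the category seen first wins.
    return max(totals, key=totals.get)

# Hypothetical predictions of the three trees for one record.
votes = ["Iris-versicolor", "Iris-virginica", "Iris-versicolor"]
print(majority_vote(votes))
# weightedMajorityVote would use the weight attribute of each Segment.
print(majority_vote(votes, weights=[0.2, 0.7, 0.1]))
```

The same function covers weightedMajorityVote because a plain majority vote is just the special case in which every weight is 1.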

The second example shows an ensemble of regression trees whose results are combined by weighted averaging:

 <MiningModel functionName="regression" >
  <MiningSchema>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="day" usageType="active"/>
      <MiningField name="continent" usageType="active"/>
      <MiningField name="sepal_length" usageType="predicted"/>
      <MiningField name="sepal_width" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="weightedAverage">
  <Segment id="1" weight="0.25">
   <True/>
   <TreeModel modelName="Iris" functionName="regression" splitCharacteristic="binarySplit">
    <MiningSchema>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="sepal_length" usageType="predicted"/>
      <MiningField name="sepal_width" usageType="active"/>
    </MiningSchema>
    <Node score="5.843333" recordCount="150">
     <True/>
     <Node score="5.179452" recordCount="73">
       <SimplePredicate field="petal_length" operator="lessThan" value="4.25"/>
       <Node score="5.005660" recordCount="53">
         <SimplePredicate field="petal_length" operator="lessThan" value="3.40"/>
         <Node score="4.735000" recordCount="20">
           <SimplePredicate field="sepal_width" operator="lessThan" value="3.25"/>
         </Node>
         <Node score="5.169697" recordCount="33">
           <SimplePredicate field="sepal_width" operator="greaterThan" value="3.25"/>
         </Node>
       </Node>
       <Node score="5.640000" recordCount="20">
         <SimplePredicate field="petal_length" operator="greaterThan" value="3.40"/>
       </Node>
     </Node>
     <Node score="6.472727" recordCount="77">
       <SimplePredicate field="petal_length" operator="greaterThan" value="4.25"/>
       <Node score="6.326471" recordCount="68">
         <SimplePredicate field="petal_length" operator="lessThan" value="6.05"/>
         <Node score="6.165116" recordCount="43">
           <SimplePredicate field="petal_length" operator="lessThan" value="5.15"/>
           <Node score="6.054545" recordCount="33">
             <SimplePredicate field="sepal_width" operator="lessThan" value="3.05"/>
           </Node>
           <Node score="6.530000" recordCount="10">
             <SimplePredicate field="sepal_width" operator="greaterThan" value="3.05"/>
           </Node>
         </Node>
         <Node score="6.604000" recordCount="25">
           <SimplePredicate field="petal_length" operator="greaterThan" value="5.15"/>
         </Node>
        </Node>
       <Node score="7.577778" recordCount="9">
         <SimplePredicate field="petal_length" operator="greaterThan" value="6.05"/>
       </Node>
     </Node>
    </Node>
   </TreeModel>
  </Segment>
  <Segment id="2" weight="0.25">
   <True/>
   <TreeModel modelName="Iris" functionName="regression" splitCharacteristic="binarySplit">
    <MiningSchema>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="sepal_length" usageType="predicted"/>
      <MiningField name="sepal_width" usageType="active"/>
    </MiningSchema>
    <Node score="5.843333" recordCount="150">
     <True/>
     <Node score="5.073333" recordCount="60">
       <SimplePredicate field="petal_width" operator="lessThan" value="1.15"/>
       <Node score="4.953659" recordCount="41">
         <SimplePredicate field="petal_width" operator="lessThan" value="0.35"/>
         <Node score="4.688235" recordCount="17">
           <SimplePredicate field="sepal_width" operator="lessThan" value="3.25"/>
         </Node>
         <Node score="5.141667" recordCount="24">
           <SimplePredicate field="sepal_width" operator="greaterThan" value="3.25"/>
         </Node>
       </Node>
       <Node score="5.331579" recordCount="19">
         <SimplePredicate field="petal_width" operator="greaterThan" value="0.35"/>
       </Node>
     </Node>
     <Node score="6.356667" recordCount="90">
       <SimplePredicate field="petal_width" operator="greaterThan" value="1.15"/>
       <Node score="6.160656" recordCount="61">
         <SimplePredicate field="petal_width" operator="lessThan" value="1.95"/>
         <Node score="5.855556" recordCount="18">
           <SimplePredicate field="petal_width" operator="lessThan" value="1.35"/>
         </Node>
         <Node score="6.288372" recordCount="43">
           <SimplePredicate field="petal_width" operator="greaterThan" value="1.35"/>
           <Node score="6.000000" recordCount="13">
             <SimplePredicate field="sepal_width" operator="lessThan" value="2.75"/>
           </Node>
           <Node score="6.413333" recordCount="30">
             <SimplePredicate field="sepal_width" operator="greaterThan" value="2.75"/>
           </Node>
         </Node>
        </Node>
       <Node score="6.768966" recordCount="29">
         <SimplePredicate field="petal_width" operator="greaterThan" value="1.95"/>
       </Node>
     </Node>
    </Node>
   </TreeModel>
  </Segment>
  <Segment id="3" weight="0.5">
   <True/>
   <TreeModel modelName="Iris" functionName="regression" splitCharacteristic="binarySplit">
    <MiningSchema>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="sepal_length" usageType="predicted"/>
    </MiningSchema>
    <Node score="5.843333" recordCount="150">
     <True/>
     <Node score="5.179452" recordCount="73">
       <SimplePredicate field="petal_length" operator="lessThan" value="4.25"/>
       <Node score="5.005660" recordCount="53">
         <SimplePredicate field="petal_length" operator="lessThan" value="3.40"/>
       </Node>
       <Node score="5.640000" recordCount="20">
         <SimplePredicate field="petal_length" operator="greaterThan" value="3.40"/>
       </Node>
     </Node>
     <Node score="6.472727" recordCount="77">
       <SimplePredicate field="petal_length" operator="greaterThan" value="4.25"/>
       <Node score="6.326471" recordCount="68">
         <SimplePredicate field="petal_length" operator="lessThan" value="6.05"/>
         <Node score="6.165116" recordCount="43">
           <SimplePredicate field="petal_length" operator="lessThan" value="5.15"/>
         </Node>
         <Node score="6.604000" recordCount="25">
           <SimplePredicate field="petal_length" operator="greaterThan" value="5.15"/>
         </Node>
        </Node>
       <Node score="7.577778" recordCount="9">
         <SimplePredicate field="petal_length" operator="greaterThan" value="6.05"/>
       </Node>
     </Node>
    </Node>
   </TreeModel>
  </Segment>
 </Segmentation>
 </MiningModel>
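
The weighted combination used by this ensemble can be sketched as follows. The function name weighted_average is hypothetical, and dividing by the total weight is an assumption that this section does not spell out; with the weights 0.25, 0.25, and 0.5 used above the total is 1, so the assumption does not affect the result. The three predictions are leaf scores that appear in the example, chosen only for illustration; which record would reach them is not worked out here.

```python
def weighted_average(values, weights):
    """Combine predicted values using the Segment weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    # Assumption: the weighted sum is normalized by the total weight.
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# One illustrative leaf score per tree, with the weights of segments 1 to 3.
predictions = [5.169697, 5.141667, 5.005660]
weights = [0.25, 0.25, 0.5]

result = weighted_average(predictions, weights)
print(result)
```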

The third example shows the implementation of segmentation where the model to employ is the first for which the predicate element of a segment is satisfied.

 <MiningModel functionName="classification">
  <MiningSchema>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="day" usageType="active"/>
      <MiningField name="continent" usageType="active"/>
      <MiningField name="sepal_length" usageType="supplementary"/>
      <MiningField name="sepal_width" usageType="supplementary"/>
      <MiningField name="Class" usageType="predicted"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="selectFirst">
  <Segment id="1">
    <CompoundPredicate booleanOperator="and">
       <SimplePredicate field="continent" operator="equal" value="asia" />
       <SimplePredicate field="day" operator="lessThan" value="60.0" />
       <SimplePredicate field="day" operator="greaterThan" value="0.0" />
    </CompoundPredicate>
    <TreeModel modelName="Iris" functionName="classification" splitCharacteristic="binarySplit">
    <MiningSchema>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="Class" usageType="predicted"/>
    </MiningSchema>
    <Node score="Iris-setosa" recordCount="150">
     <True/>
     <ScoreDistribution value="Iris-setosa" recordCount="50"/>
     <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
     <ScoreDistribution value="Iris-virginica" recordCount="50"/>
     <Node score="Iris-setosa" recordCount="50">
       <SimplePredicate field="petal_length" operator="lessThan" value="2.45"/>
       <ScoreDistribution value="Iris-setosa" recordCount="50"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="0"/>
       <ScoreDistribution value="Iris-virginica" recordCount="0"/>
     </Node>
     <Node score="Iris-versicolor" recordCount="100">
       <SimplePredicate field="petal_length" operator="greaterThan" value="2.45"/>
       <ScoreDistribution value="Iris-setosa" recordCount="0"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
       <ScoreDistribution value="Iris-virginica" recordCount="50"/>
       <Node score="Iris-versicolor" recordCount="54">
         <SimplePredicate field="petal_width" operator="lessThan" value="1.75"/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="49"/>
         <ScoreDistribution value="Iris-virginica" recordCount="5"/>
       </Node>
       <Node score="Iris-virginica" recordCount="46">
         <SimplePredicate field="petal_width" operator="greaterThan" value="1.75"/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="1"/>
         <ScoreDistribution value="Iris-virginica" recordCount="45"/>
       </Node>
     </Node>
    </Node>
   </TreeModel>
  </Segment>
  <Segment id="2" >
    <CompoundPredicate booleanOperator="and">
      <SimplePredicate field="continent" operator="equal" value="africa" />
      <SimplePredicate field="day" operator="lessThan" value="60.0" />
      <SimplePredicate field="day" operator="greaterThan" value="0.0" />
    </CompoundPredicate>
   <TreeModel modelName="Iris" functionName="classification" splitCharacteristic="binarySplit">
   <MiningSchema>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="Class" usageType="predicted"/>
    </MiningSchema>
    <Node score="Iris-setosa" recordCount="150">
     <True/>
     <ScoreDistribution value="Iris-setosa" recordCount="50"/>
     <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
     <ScoreDistribution value="Iris-virginica" recordCount="50"/>
     <Node score="Iris-setosa" recordCount="50">
       <SimplePredicate field="petal_length" operator="lessThan" value="2.15"/>
       <ScoreDistribution value="Iris-setosa" recordCount="50"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="0"/>
       <ScoreDistribution value="Iris-virginica" recordCount="0"/>
     </Node>
     <Node score="Iris-versicolor" recordCount="100">
       <SimplePredicate field="petal_length" operator="greaterThan" value="2.15"/>
       <ScoreDistribution value="Iris-setosa" recordCount="0"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
       <ScoreDistribution value="Iris-virginica" recordCount="50"/>
       <Node score="Iris-versicolor" recordCount="54">
         <SimplePredicate field="petal_width" operator="lessThan" value="1.93"/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="49"/>
         <ScoreDistribution value="Iris-virginica" recordCount="5"/>
       </Node>
       <Node score="Iris-virginica" recordCount="46">
         <SimplePredicate field="petal_width" operator="greaterThan" value="1.93"/>
         <ScoreDistribution value="Iris-setosa" recordCount="0"/>
         <ScoreDistribution value="Iris-versicolor" recordCount="1"/>
         <ScoreDistribution value="Iris-virginica" recordCount="45"/>
       </Node>
     </Node>
    </Node>
   </TreeModel>
  </Segment>
  <Segment id="3">
   <SimplePredicate field="continent" operator="equal" value="africa" />
   <TreeModel modelName="Iris" functionName="classification" splitCharacteristic="binarySplit">
   <MiningSchema>
      <MiningField name="petal_width" usageType="active"/>
      <MiningField name="Class" usageType="predicted"/>
    </MiningSchema>
    <Node score="Iris-setosa" recordCount="150">
     <True/>
     <ScoreDistribution value="Iris-setosa" recordCount="50"/>
     <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
     <ScoreDistribution value="Iris-virginica" recordCount="50"/>
     <Node score="Iris-setosa" recordCount="50">
       <SimplePredicate field="petal_width" operator="lessThan" value="2.85"/>
       <ScoreDistribution value="Iris-setosa" recordCount="50"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="0"/>
       <ScoreDistribution value="Iris-virginica" recordCount="0"/>
     </Node>
     <Node score="Iris-versicolor" recordCount="100">
       <SimplePredicate field="petal_width" operator="greaterThan" value="2.85"/>
       <ScoreDistribution value="Iris-setosa" recordCount="0"/>
       <ScoreDistribution value="Iris-versicolor" recordCount="50"/>
       <ScoreDistribution value="Iris-virginica" recordCount="50"/>
     </Node>
    </Node>
   </TreeModel>
  </Segment>
 </Segmentation>
 </MiningModel>
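
The selection logic of this example can be sketched in a few lines. The segment predicates below restate the three predicates of the example as Python functions; the name select_first is hypothetical, and missing-value handling (predicates evaluating to unknown) is deliberately left out of this sketch.

```python
def select_first(segments, record):
    """Return the id of the first segment whose predicate is true, else None."""
    for segment_id, predicate in segments:
        if predicate(record):
            return segment_id
    return None

# Segments in document order, mirroring the predicates of the example.
segments = [
    ("1", lambda r: r["continent"] == "asia" and 0.0 < r["day"] < 60.0),
    ("2", lambda r: r["continent"] == "africa" and 0.0 < r["day"] < 60.0),
    ("3", lambda r: r["continent"] == "africa"),
]

print(select_first(segments, {"continent": "africa", "day": 30.0}))
print(select_first(segments, {"continent": "africa", "day": 90.0}))
print(select_first(segments, {"continent": "europe", "day": 30.0}))
```

Order matters: a record from africa with day between 0 and 60 satisfies the predicates of both segment 2 and segment 3, and only the earlier segment is used. A record that satisfies no predicate selects no model in this sketch.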

Model Composition

Two general variants of Model Composition of decision trees and simple regression are supported:

Model sequencing: two or more models are combined into a sequence where the results of one model are used as input in another model.
Model selection: one of many models can be selected based on decision rules.

Model composition uses three syntactical concepts:

The essential elements of a predictive model are captured in elements that can be included in other models.
Embedded models can define new fields, similar to derived fields.
The leaf nodes in a decision tree can contain another predictive model.

For example, using a sequence of models, a field could be defined by a regression equation. This field is then used as an ordinary input field in a decision tree. The basic idea is that we capture the essential elements of a model, in this example from a regression model, and use them to define new fields. That is similar to defining a derived field.
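
The sequencing idea described above can be sketched as two functions chained together. Everything concrete here is hypothetical: the field names x1, x2, and derived, the coefficients, and the tree threshold are invented for illustration and do not come from the standard.

```python
def regression_field(record):
    """Hypothetical linear equation whose result defines a new field."""
    return 1.0 + 0.5 * record["x1"] - 0.25 * record["x2"]

def decision_tree(record):
    """Hypothetical tree that reads the new field like any other input."""
    if record["derived"] < 2.0:
        return "low"
    return "high"

def score_sequence(record):
    # Step 1: the first model publishes its result as a new field.
    extended = dict(record, derived=regression_field(record))
    # Step 2: the next model in the sequence uses that field as input.
    return decision_tree(extended)

print(score_sequence({"x1": 4.0, "x2": 2.0}))
print(score_sequence({"x1": 1.0, "x2": 2.0}))
```

The input record is copied rather than modified, which keeps the original fields intact for any later model in the sequence.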

Mining models and their corresponding embedded elements

The first step in making models reusable in other models is the definition of 'model expression' elements that can be embedded in another model. PMML defines two such elements, Regression and DecisionTree.

Standalone model element	Embedded model element	Main content
`RegressionModel`	`Regression`	`RegressionTable`(s)
`TreeModel`	`DecisionTree`	`Node`(s)

An EmbeddedModel does not contain a MiningSchema of its own. There is only one MiningSchema, at the top level.


  <xs:group name="EmbeddedModel">
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:choice>
        <xs:element ref="Regression"/>
        <xs:element ref="DecisionTree"/>
      </xs:choice>
    </xs:sequence>
  </xs:group>
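The rule that an embedded model carries no MiningSchema of its own can be checked mechanically. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the function names and the two sample fragments are hypothetical and are not part of PMML.

```python
import xml.etree.ElementTree as ET

# Embedded model elements named by the EmbeddedModel group.
EMBEDDED_TAGS = {"Regression", "DecisionTree"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def embedded_models_with_schema(mining_model):
    """Return the names of embedded models that contain a MiningSchema child."""
    offenders = []
    for elem in mining_model.iter():
        if local_name(elem.tag) in EMBEDDED_TAGS:
            if any(local_name(child.tag) == "MiningSchema" for child in elem):
                offenders.append(local_name(elem.tag))
    return offenders


# Hypothetical fragment: one MiningSchema at the top level only.
GOOD = """<MiningModel functionName="regression">
  <MiningSchema><MiningField name="age"/></MiningSchema>
  <Regression functionName="regression">
    <RegressionTable intercept="1.0"/>
  </Regression>
</MiningModel>"""

# Hypothetical fragment: the embedded Regression wrongly has its own MiningSchema.
BAD = """<MiningModel functionName="regression">
  <MiningSchema><MiningField name="age"/></MiningSchema>
  <Regression functionName="regression">
    <MiningSchema><MiningField name="age"/></MiningSchema>
    <RegressionTable intercept="1.0"/>
  </Regression>
</MiningModel>"""

good_offenders = embedded_models_with_schema(ET.fromstring(GOOD))
bad_offenders = embedded_models_with_schema(ET.fromstring(BAD))
print(good_offenders, bad_offenders)
```

A schema-validating parser would reject the second fragment as well; the sketch only illustrates the structural rule.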

The element ResultField is very similar to OutputField and DerivedField. It allows an embedded model to define a new field that can be used by a subsequent model.


  <xs:element name="ResultField">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="name" type="FIELD-NAME" use="required" />
      <xs:attribute name="displayName" type="xs:string" />
      <xs:attribute name="optype" type="OPTYPE" />
      <xs:attribute name="dataType" type="DATATYPE"/>
      <xs:attribute name="feature" type="RESULT-FEATURE" />
      <xs:attribute name="value" type="xs:string" />
    </xs:complexType>
  </xs:element>
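As an illustration of how a consumer might discover the fields an embedded model defines, the sketch below reads the ResultField declarations from a small fragment. It uses Python's standard xml.etree.ElementTree; the helper name declared_result_fields and the fragment are assumptions made for this example, and the fragment omits the PMML namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical embedded model that publishes its prediction as the field "term".
FRAGMENT = """
<Regression functionName="regression">
  <ResultField name="term" feature="predictedValue"/>
  <RegressionTable intercept="2.34"/>
</Regression>
"""


def declared_result_fields(embedded_model):
    """Map each ResultField name to its 'feature' attribute (None if absent)."""
    return {
        rf.get("name"): rf.get("feature")
        for rf in embedded_model.findall("ResultField")
    }


fields = declared_result_fields(ET.fromstring(FRAGMENT))
print(fields)
```

A subsequent model in the same MiningModel could then refer to the field name "term" like any other input field.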

Model selection is enabled by allowing an EmbeddedModel within a tree Node.

The element Regression contains the essential elements of a RegressionModel:


  <xs:element name="Regression">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="Output" minOccurs="0"/>
        <xs:element ref="ModelStats" minOccurs="0"/>
        <xs:element ref="Targets" minOccurs="0"/>
        <xs:element ref="LocalTransformations" minOccurs="0" />
        <xs:element ref="ResultField" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="RegressionTable" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string" />
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
      <xs:attribute name="algorithmName" type="xs:string" />
      <xs:attribute name="normalizationMethod" type="REGRESSIONNORMALIZATIONMETHOD" default="none" />
    </xs:complexType>
  </xs:element>

ResultField elements define named results; see above.

The element DecisionTree contains the essential elements of a TreeModel:


  <xs:element name="DecisionTree">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="Output" minOccurs="0"/>
        <xs:element ref="ModelStats" minOccurs="0"/>
        <xs:element ref="Targets" minOccurs="0"/>
        <xs:element ref="LocalTransformations" minOccurs="0" />
        <xs:element ref="ResultField" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="Node" />
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string" />
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
      <xs:attribute name="algorithmName" type="xs:string" />
      <xs:attribute name="missingValueStrategy" type="MISSING-VALUE-STRATEGY" default="none"/>
      <xs:attribute name="missingValuePenalty" type="PROB-NUMBER" default="1.0"/>
      <xs:attribute name="noTrueChildStrategy" type="NO-TRUE-CHILD-STRATEGY" default="returnNullPrediction" />
      <xs:attribute name="splitCharacteristic" default="multiSplit">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="binarySplit" />
            <xs:enumeration value="multiSplit" />
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

Regression and DecisionTree can be used only inside a MiningModel.

Model Sequencing for Input Transformations

The following example demonstrates how a regression equation can be used to define an input transformation for another model, in this case a decision tree.


  <?xml version="1.0" ?>
  <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header copyright="DMG.org"/>
    <DataDictionary numberOfFields="4">
      <DataField name="age" optype="continuous" dataType="double"/>
      <DataField name="income" optype="continuous" dataType="double"/>
      <DataField name="gender" optype="categorical" dataType="string">
        <Value value="female"/>
        <Value value="male"/>
      </DataField>
      <DataField name="weight" optype="continuous" dataType="double"/>
    </DataDictionary>
    <MiningModel functionName="regression">
      <MiningSchema>
        <MiningField name="age"/>
        <MiningField name="income"/>
        <MiningField name="gender"/>
        <MiningField name="weight" usageType="predicted"/>
      </MiningSchema>
      <LocalTransformations>
        <DerivedField name="mc" optype="continuous" dataType="double">
          <MapValues outputColumn="mapped" mapMissingTo="-1">
            <FieldColumnPair field="gender" column="sourceval"/>
            <InlineTable>
              <row><sourceval>female</sourceval><mapped>1</mapped></row>
              <row><sourceval>male</sourceval><mapped>0</mapped></row>
            </InlineTable>
          </MapValues>
        </DerivedField>
      </LocalTransformations>
      <Regression functionName="regression">
        <ResultField name="term" feature="predictedValue"/>
        <RegressionTable intercept="2.34">
          <NumericPredictor name="income" coefficient="0.03"/>
          <PredictorTerm coefficient="1.23">
            <FieldRef field="age"/>
            <FieldRef field="mc"/>
          </PredictorTerm>
        </RegressionTable>
      </Regression>
      <DecisionTree functionName="regression">
        <Node score="0.0">
          <True/>
          <Node score="32.32">
            <SimplePredicate field="term" operator="lessThan" value="42"/>
          </Node>
          <Node score="78.91">
            <SimplePredicate field="term" operator="greaterOrEqual" value="42"/>
          </Node>
        </Node>
      </DecisionTree>
    </MiningModel>
  </PMML>

Remarks:

The submodels comprising a sequence are ordered in such a way that each is defined after any other submodels on which it depends.
The prediction from the last submodel defined is taken as the prediction for the composite model.
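The evaluation order described in these remarks can be traced by hand for the example above. The following Python sketch hard-codes the coefficients and thresholds from that example; the function names are hypothetical, and the sketch assumes that age and income are never missing and that gender is either "female", "male", or missing (None).

```python
def map_gender(gender):
    """Derived field 'mc': female -> 1, male -> 0, missing -> -1 (mapMissingTo)."""
    if gender is None:
        return -1.0
    return {"female": 1.0, "male": 0.0}[gender]


def regression_term(age, income, gender):
    """First submodel: the regression whose result is published as field 'term'."""
    mc = map_gender(gender)
    # PredictorTerm contributes coefficient * product of the referenced fields.
    return 2.34 + 0.03 * income + 1.23 * age * mc


def score_sequence(age, income, gender):
    """Last submodel: the decision tree, which reads 'term' as an ordinary input."""
    term = regression_term(age, income, gender)
    # The two child nodes split on term < 42 versus term >= 42.
    return 32.32 if term < 42 else 78.91


print(score_sequence(30, 1000, "female"))
print(score_sequence(30, 1000, "male"))
```

The value returned by score_sequence corresponds to the prediction of the last submodel, which is taken as the prediction of the composite model.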

Model selection through tree models

Model selection through a tree model combines multiple 'embedded models' (also called model expressions) with decision logic that selects one of them depending on the current input values.

The following example shows how regression elements are used within the nodes of a decision tree:


  <?xml version="1.0" ?>
  <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header copyright="DMG.org"/>
    <DataDictionary numberOfFields="4">
      <DataField name="age" optype="continuous" dataType="double"/>
      <DataField name="income" optype="continuous" dataType="double"/>
      <DataField name="gender" optype="categorical" dataType="string">
        <Value value="female"/>
        <Value value="male"/>
      </DataField>
      <DataField name="weight" optype="continuous" dataType="double"/>
    </DataDictionary>
    <MiningModel functionName="regression">
      <MiningSchema>
        <MiningField name="age"/>
        <MiningField name="income"/>
        <MiningField name="gender"/>
        <MiningField name="weight" usageType="predicted"/>
      </MiningSchema>
      <LocalTransformations>
        <DerivedField name="mc" optype="continuous" dataType="double">
          <MapValues outputColumn="mapped" mapMissingTo="-1">
            <FieldColumnPair field="gender" column="sourceval"/>
            <InlineTable>
              <row><sourceval>female</sourceval><mapped>1</mapped></row>
              <row><sourceval>male</sourceval><mapped>0</mapped></row>
            </InlineTable>
          </MapValues>
        </DerivedField>
      </LocalTransformations>
      <DecisionTree functionName="regression">
        <Node score="0.0">
          <True/>
          <Node score="0.0">
            <SimplePredicate field="age" operator="lessOrEqual" value="50"/>
            <Regression functionName="regression">
              <RegressionTable intercept="0.0">
                <NumericPredictor name="income" coefficient="0.03"/>
                <PredictorTerm coefficient="1.23">
                  <FieldRef field="age"/>
                  <FieldRef field="mc"/>
                </PredictorTerm>
              </RegressionTable>
            </Regression>
          </Node>
          <Node score="0.0">
            <SimplePredicate field="age" operator="greaterThan" value="50"/>
            <Regression functionName="regression">
              <RegressionTable intercept="2.22">
                <NumericPredictor name="income" coefficient="0.01"/>
                <PredictorTerm coefficient="-0.11">
                  <FieldRef field="age"/>
                  <FieldRef field="mc"/>
                </PredictorTerm>
              </RegressionTable>
            </Regression>
          </Node>
        </Node>
      </DecisionTree>
    </MiningModel>
  </PMML>
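The selection logic of this example can be sketched in Python as follows. The coefficients and the split on age are taken from the example above; the function names are hypothetical. The sketch assumes that the regression embedded in the selected node supplies the prediction (rather than the node's score attribute), that age and income are never missing, and that gender is either "female", "male", or missing (None).

```python
def map_gender(gender):
    """Derived field 'mc': female -> 1, male -> 0, missing -> -1 (mapMissingTo)."""
    if gender is None:
        return -1.0
    return {"female": 1.0, "male": 0.0}[gender]


def score_selection(age, income, gender):
    """Pick one of two regression equations based on the tree's split on age."""
    mc = map_gender(gender)
    if age <= 50:
        # Node with predicate age lessOrEqual 50.
        return 0.0 + 0.03 * income + 1.23 * age * mc
    # Node with predicate age greaterThan 50.
    return 2.22 + 0.01 * income - 0.11 * age * mc


print(score_selection(50, 1000, "male"))
print(score_selection(60, 1000, "female"))
```

Only the regression in the matching node is evaluated; the other one plays no part in scoring that record.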
