Data Mining Group - PMML 4.0

Ruleset models can be thought of as flattened decision tree models. A ruleset consists of a number of rules. Each rule contains a predicate and a predicted class value, plus some information collected at training or testing time on the performance of the rule.

For example, the following text describes a rule:

PREDICATE: BP="HIGH" AND K > 0.045804001 AND Age <= 50 AND Na <= 0.77240998
PREDICTION: "drugB" 
CONFIDENCE: 0.9

Rulesets can be applied to new instances to derive predictions and associated confidences (scoring). When a case is scored, a rule is said to fire if its predicate evaluates to TRUE on that case. The ruleset can also have an optional default prediction and associated confidence that are used to score a case if no rules fire.

If missing values in fields mentioned in a rule's predicate cause the predicate to evaluate to UNKNOWN, the rule does not fire.
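The firing condition can be illustrated with a small Python sketch of the example rule above. It is not part of PMML: the helper names (simple_predicate, and_all, rule_fires) are hypothetical, None stands in for UNKNOWN, and the "and" combination is assumed to follow the usual three-valued logic (FALSE dominates UNKNOWN). The authoritative rules for predicate evaluation are in the TreeModel section.

```python
from typing import Optional

def simple_predicate(value, op, threshold) -> Optional[bool]:
    """Evaluate one comparison; a missing value (None) yields UNKNOWN (None)."""
    if value is None:
        return None
    if op == "equal":
        return value == threshold
    if op == "greaterThan":
        return value > threshold
    if op == "lessOrEqual":
        return value <= threshold
    raise ValueError(f"unsupported operator: {op}")

def and_all(results) -> Optional[bool]:
    """Three-valued AND (assumed): any FALSE gives FALSE, else any UNKNOWN gives UNKNOWN."""
    results = list(results)
    if any(r is False for r in results):
        return False
    if any(r is None for r in results):
        return None
    return True

def rule_fires(case: dict) -> bool:
    """The example rule fires only when its predicate is TRUE, not UNKNOWN."""
    outcome = and_all([
        simple_predicate(case.get("BP"), "equal", "HIGH"),
        simple_predicate(case.get("K"), "greaterThan", 0.045804001),
        simple_predicate(case.get("Age"), "lessOrEqual", 50),
        simple_predicate(case.get("Na"), "lessOrEqual", 0.77240998),
    ])
    return outcome is True
```

With this sketch, a case that has BP="HIGH", K=0.0621 and Age=36 but no value for Na leaves the predicate UNKNOWN, so the rule does not fire.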

One important question is then how to resolve conflicting predictions when multiple rules "fire". Useful strategies include:

first hit (just pick the first rule that fires).
weighted maximum (pick the rule with the highest weight).
weighted sum (pick the best prediction by combining the weights of all firing rules).

Each rule can have a confidence and a weight that are set at model build time, likely by considering each rule's performance on the training data. The method used to compute confidence and weight is chosen by the application authoring the PMML and lies outside the scope of the PMML model description.


  <xs:element name="RuleSetModel">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="MiningSchema"/>
        <xs:element ref="Output" minOccurs="0" />
        <xs:element ref="ModelStats" minOccurs="0"/>
        <xs:element ref="ModelExplanation" minOccurs="0"/>
        <xs:element ref="Targets" minOccurs="0" />
        <xs:element ref="LocalTransformations" minOccurs="0" />
        <xs:element ref="RuleSet"/>
        <xs:element ref="ModelVerification" minOccurs="0"/>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string" use="optional"/>
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
      <xs:attribute name="algorithmName" type="xs:string" use="optional"/>
    </xs:complexType>
  </xs:element>

Definitions:

RuleSetModel: starts the definition for a ruleset model.

RuleSet: this element describes a list of rules that make up a ruleset model. The order of rules in the list is important when considering how to score the ruleset.

modelName: the value of modelName in a RuleSetModel element identifies the model with a unique name in the context of the PMML file. See general structure of PMML models.

A RuleSet consists of:


  <xs:element name="RuleSet">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="RuleSelectionMethod" minOccurs="1" maxOccurs="unbounded"/>
        <xs:element ref="ScoreDistribution" minOccurs="0" maxOccurs="unbounded"/>
        <xs:group ref="Rule" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="recordCount" type="NUMBER" use="optional"/>
      <xs:attribute name="nbCorrect" type="NUMBER" use="optional"/>
      <xs:attribute name="defaultScore" type="xs:string" use="optional"/>
      <xs:attribute name="defaultConfidence" type="NUMBER" use="optional"/>
    </xs:complexType>
  </xs:element>

Definitions:

recordCount: The number of training/test cases to which the ruleset was applied to generate support and confidence measures for individual rules.

nbCorrect: indicates the number of training/test instances for which the default score is correct.

defaultScore: The value of defaultScore in a RuleSet serves as the default predicted value when scoring a case for which no rules in the ruleset fire.

defaultConfidence: provides a confidence to be returned with the default score (when scoring a case and no rules in the ruleset fire).

RuleSelectionMethod: specifies how to select rules from the ruleset to score a new case. If more than one method is included, the first method is used as the default method for scoring, but the other methods included may be selected by the application wishing to perform scoring as valid alternative methods.

ScoreDistribution: describes the distribution of the predicted value in the test/training data.

Rule: contains 0 or more rules which comprise the ruleset.

The RuleSelectionMethod describes how rules are selected to apply the model to a new case, and consists of:


  <xs:element name="RuleSelectionMethod">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="criterion" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="weightedSum"/>
            <xs:enumeration value="weightedMax"/>
            <xs:enumeration value="firstHit"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

Definitions:

criterion: explains how to determine and rank predictions and their associated confidences from the ruleset in case multiple rules fire. There are many possible ways of applying rulesets, but three useful approaches are covered.

firstHit: First firing rule is chosen as the predicted class, and the confidence is the weight of that rule. If further predictions and confidences are required, a search for the next firing rule that chooses a different predicted class is made, and so on.
weightedSum: Calculate the total weight for each class by summing the weights for each firing rule which predicts that class. The prediction with the highest total weight is then selected. The confidence is the total weight of the winning class divided by the number of firing rules. If further predictions and confidences are required, the process is repeated to find the class with the second highest total weight, and so on. Note that if two or more classes are assigned the same weight, the winning class is the one that appears first in the data dictionary values.
weightedMax: Select the firing rule with the highest weight. The confidence returned is the confidence of the selected rule. Note that if two firing rules have the same weight, the rule that occurs first in the ruleset is chosen.
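The three criteria can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: the function names and the dictionary layout of a rule (score, confidence, weight) are hypothetical, firing is assumed to hold only the rules that fired, in ruleset order, and class_order is assumed to list the target values in data dictionary order.

```python
def first_hit(firing):
    """Rank classes by the first firing rule that predicts each one."""
    ranked, seen = [], set()
    for rule in firing:  # firing rules, in ruleset order
        if rule["score"] not in seen:
            seen.add(rule["score"])
            # Confidence is taken from the rule's weight, as the text above states.
            ranked.append((rule["score"], rule["weight"]))
    return ranked

def weighted_sum(firing, class_order):
    """Rank classes by total weight; confidence = total weight / number of firing rules."""
    totals = {}
    for rule in firing:
        totals[rule["score"]] = totals.get(rule["score"], 0.0) + rule["weight"]
    # Highest total first; ties go to the class listed first in class_order.
    classes = sorted(totals, key=lambda c: (-totals[c], class_order.index(c)))
    return [(c, totals[c] / len(firing)) for c in classes]

def weighted_max(firing):
    """Pick the firing rule with the highest weight and return its confidence."""
    if not firing:
        return None
    # max() keeps the first of several equal weights, i.e. the earlier rule.
    best = max(firing, key=lambda r: r["weight"])
    return best["score"], best["confidence"]
```

When no rule fires, first_hit and weighted_sum return an empty list and weighted_max returns None; a caller would then fall back to the default score and default confidence of the ruleset.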

Each Rule can be either a SimpleRule or a CompoundRule.


  <xs:group name="Rule">
    <xs:choice>
      <xs:element ref="SimpleRule" />
      <xs:element ref="CompoundRule" />
    </xs:choice>
  </xs:group>

Each SimpleRule consists of an identifier, a predicate, a score and information on rule performance.


  <xs:element name="SimpleRule">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:group ref="PREDICATE"/>
        <xs:element ref="ScoreDistribution" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:string" use="optional"/>
      <xs:attribute name="score" type="xs:string" use="required"/>
      <xs:attribute name="recordCount" type="NUMBER" use="optional"/>
      <xs:attribute name="nbCorrect" type="NUMBER" use="optional"/>
      <xs:attribute name="confidence" type="NUMBER" use="optional"/>
      <xs:attribute name="weight" type="NUMBER" use="optional"/>
    </xs:complexType>
  </xs:element>

Definitions:

PREDICATE: the condition upon which the rule fires. For more details on PREDICATE see the section on predicates in TreeModel. This explains how predicates are described and evaluated and how missing values are handled.

ScoreDistribution: Describes the distribution of the predicted value for instances where the rule fires in the training/test data.

id: The value of id serves as a unique identifier for the rule. Must be unique within the ruleset.

score: The predicted value when the rule fires.

recordCount: The number of training/test instances on which the rule fired.

nbCorrect: Indicates the number of training/test instances on which the rule fired and the prediction was correct.

confidence: Indicates the confidence of the rule.

weight: Indicates the relative importance of the rule. May or may not be equal to the confidence.

Each CompoundRule consists of a predicate and one or more rules. CompoundRules offer a shorthand for a more compact representation of rulesets and suggest a more efficient execution mechanism.


  <xs:element name="CompoundRule">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:group ref="PREDICATE" />
        <xs:group ref="Rule" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Definitions:

PREDICATE: For more details on PREDICATE see TreeModel.

Rule: One or more rules that are contained within the CompoundRule. Each of these rules may be a SimpleRule or a CompoundRule.

A ruleset containing both compound rules and simple rules has the same meaning as an equivalent ruleset containing only simple rules. It is possible to derive a ruleset containing only simple rules by repeating the following transformation:

The original rule


  <CompoundRule>
    <PREDICATE1/>
    <SimpleRule id="1" ...>
      <PREDICATE2/>
      ... contents of simple rule 1 ...
    </SimpleRule>
    ... further rules ...
  </CompoundRule>

transforms to


  <SimpleRule id="1" ...>
    <CompoundPredicate booleanOperator="and">
      <PREDICATE1/>
      <PREDICATE2/>
    </CompoundPredicate>
    ... contents of simple rule 1 ...
  </SimpleRule>
  <CompoundRule>
    <PREDICATE1/>
    ... further rules ...
  </CompoundRule>

Or in other words, a simple rule is said to fire if its predicate evaluates to TRUE, and the predicates of all compound rules that contain the simple rule also evaluate to TRUE.
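Repeating the transformation until no compound rules remain amounts to a depth-first walk that carries the enclosing predicates down to each simple rule. The Python sketch below illustrates this under simplifying assumptions: the SimpleRule and CompoundRule classes here are hypothetical stand-ins for the PMML elements, and predicates are kept as plain text labels whose list is read as an "and" conjunction.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class SimpleRule:
    id: str
    predicates: List[str]  # read as a conjunction ("and") of predicate labels

@dataclass
class CompoundRule:
    predicate: str
    rules: List[Union["SimpleRule", "CompoundRule"]]

def flatten(rules, inherited=()):
    """Return equivalent simple rules, preserving the original rule order."""
    flat = []
    for rule in rules:
        if isinstance(rule, CompoundRule):
            # Push this compound rule's predicate down to everything it contains.
            flat.extend(flatten(rule.rules, inherited + (rule.predicate,)))
        else:
            flat.append(SimpleRule(rule.id, list(inherited) + rule.predicates))
    return flat
```

Because the walk visits rules in document order, the flattened ruleset keeps the rule order that firstHit and the weightedMax tie-break depend on.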

A Complete RuleSet Example

Consider a ruleset with three rules:

RULE1: 
        PREDICATE: BP="HIGH" AND K > 0.045804001 AND Age <= 50 AND Na <= 0.77240998
        PREDICTION: drugB
        Training/test measures:
                recordCount     79
                nbCorrect       76
                confidence      0.9
                weight          0.9
RULE2:
        PREDICATE: K > 0.057789002 AND BP="HIGH" AND Age <= 50
        PREDICTION: drugA
        Training/test measures:
                recordCount     278
                nbCorrect       168
                confidence      0.6
                weight          0.6
RULE3:
        PREDICATE: BP="HIGH" AND Na > 0.21
        PREDICTION: drugA
        Training/test measures:
                recordCount     100
                nbCorrect       50
                confidence      0.36
                weight          0.36

PMML for the example (using only simple rules)


  <?xml version="1.0" encoding="UTF-8"?>
  <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header copyright="MyCopyright">
      <Application name="MyApplication" version="1.0"/>
    </Header>
    <DataDictionary numberOfFields="7">
      <DataField name="BP" displayName="BP" optype="categorical" dataType="string">
        <Value value="HIGH" property="valid"/>
        <Value value="LOW" property="valid"/>
        <Value value="NORMAL" property="valid"/>
      </DataField>
      <DataField name="K" displayName="K" optype="continuous" dataType="double">
        <Interval closure="closedClosed" leftMargin="0.020152" rightMargin="0.079925"/>
      </DataField>
      <DataField name="Age" displayName="Age" optype="continuous" dataType="integer"/>
      <DataField name="Na" displayName="Na" optype="continuous" dataType="double"/>
      <DataField name="Cholesterol" displayName="Cholesterol" optype="categorical" dataType="string">
        <Value value="HIGH" property="valid"/>
        <Value value="NORMAL" property="valid"/>
      </DataField>
      <DataField name="$C-Drug" displayName="$C-Drug" optype="categorical" dataType="string">
        <Value value="drugA" property="valid"/>
        <Value value="drugB" property="valid"/>
        <Value value="drugC" property="valid"/>
        <Value value="drugX" property="valid"/>
        <Value value="drugY" property="valid"/>
      </DataField>
      <DataField name="$CC-Drug" displayName="$CC-Drug" optype="continuous" dataType="double"/>   
    </DataDictionary>
    <RuleSetModel modelName="NestedDrug" functionName="classification" algorithmName="RuleSet">
      <MiningSchema>
        <MiningField name="BP" usageType="active"/>
        <MiningField name="K" usageType="active"/>
        <MiningField name="Age" usageType="active"/>
        <MiningField name="Na" usageType="active"/>
        <MiningField name="Cholesterol" usageType="active"/>
        <MiningField name="$C-Drug" usageType="predicted"/>
        <MiningField name="$CC-Drug" usageType="supplementary"/>
      </MiningSchema>
      <RuleSet defaultScore="drugY" recordCount="1000" nbCorrect="149" defaultConfidence="0.0">
        <RuleSelectionMethod criterion="weightedSum"/>
        <RuleSelectionMethod criterion="weightedMax"/>
        <RuleSelectionMethod criterion="firstHit"/>
        <SimpleRule id="RULE1" score="drugB" recordCount="79" nbCorrect="76" confidence="0.9" weight="0.9">
          <CompoundPredicate booleanOperator="and">
            <SimplePredicate field="BP" operator="equal" value="HIGH"/>
            <SimplePredicate field="K" operator="greaterThan" value="0.045804001"/>
            <SimplePredicate field="Age" operator="lessOrEqual" value="50"/>
            <SimplePredicate field="Na" operator="lessOrEqual" value="0.77240998"/>
          </CompoundPredicate>
          <ScoreDistribution value="drugA" recordCount="2"/>
          <ScoreDistribution value="drugB" recordCount="76"/>
          <ScoreDistribution value="drugC" recordCount="1"/>
          <ScoreDistribution value="drugX" recordCount="0"/>
          <ScoreDistribution value="drugY" recordCount="0"/>
        </SimpleRule>
        <SimpleRule id="RULE2" score="drugA" recordCount="278" nbCorrect="168" confidence="0.6" weight="0.6">
          <CompoundPredicate booleanOperator="and">
            <SimplePredicate field="K" operator="greaterThan" value="0.057789002"/>
            <SimplePredicate field="BP" operator="equal" value="HIGH"/>
            <SimplePredicate field="Age" operator="lessOrEqual" value="50"/>
          </CompoundPredicate>
          <ScoreDistribution value="drugA" recordCount="168"/>
          <ScoreDistribution value="drugB" recordCount="40"/>
          <ScoreDistribution value="drugC" recordCount="12"/>
          <ScoreDistribution value="drugX" recordCount="14"/>
          <ScoreDistribution value="drugY" recordCount="24"/>
        </SimpleRule>
        <SimpleRule id="RULE3" score="drugA" recordCount="100" nbCorrect="50" confidence="0.36" weight="0.36">
          <CompoundPredicate booleanOperator="and">
            <SimplePredicate field="BP" operator="equal" value="HIGH"/>
            <SimplePredicate field="Na" operator="greaterThan" value="0.21"/>
          </CompoundPredicate>
          <ScoreDistribution value="drugA" recordCount="50"/>
          <ScoreDistribution value="drugB" recordCount="10"/>
          <ScoreDistribution value="drugC" recordCount="12"/>
          <ScoreDistribution value="drugX" recordCount="7"/>
          <ScoreDistribution value="drugY" recordCount="11"/>
        </SimpleRule>
      </RuleSet>
    </RuleSetModel>
  </PMML>
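A scoring application first has to read the selection methods and rules out of such a document. The following Python sketch uses the standard xml.etree.ElementTree module on a reduced fragment of the example. The fragment is an assumption made for brevity: it omits Header, DataDictionary and MiningSchema and is therefore not schema-valid PMML. The namespace is read from the root element rather than hard-coded.

```python
import xml.etree.ElementTree as ET

# Reduced, not schema-valid fragment of the example (illustration only).
PMML_FRAGMENT = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <RuleSetModel functionName="classification">
    <RuleSet defaultScore="drugY" defaultConfidence="0.0">
      <RuleSelectionMethod criterion="weightedSum"/>
      <RuleSelectionMethod criterion="firstHit"/>
      <SimpleRule id="RULE3" score="drugA" confidence="0.36" weight="0.36">
        <CompoundPredicate booleanOperator="and">
          <SimplePredicate field="BP" operator="equal" value="HIGH"/>
          <SimplePredicate field="Na" operator="greaterThan" value="0.21"/>
        </CompoundPredicate>
      </SimpleRule>
    </RuleSet>
  </RuleSetModel>
</PMML>"""

root = ET.fromstring(PMML_FRAGMENT)
# ElementTree writes qualified tags as "{namespace}PMML"; extract the namespace.
ns = {"p": root.tag[1:root.tag.index("}")]}

rule_set = root.find("p:RuleSetModel/p:RuleSet", ns)

# The first RuleSelectionMethod listed is the default scoring method.
criteria = [m.get("criterion") for m in rule_set.findall("p:RuleSelectionMethod", ns)]
default_criterion = criteria[0]

# Top-level simple rules only; CompoundRule children are not handled here.
rules = [
    {"id": r.get("id"), "score": r.get("score"), "weight": float(r.get("weight"))}
    for r in rule_set.findall("p:SimpleRule", ns)
]
```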

Scoring Procedure for the Example

We will use the above example to illustrate the steps that should be followed in the scoring process.

Suppose we wish to score an instance where:

BP="HIGH", K=0.0621, Age = 36, Na = 0.5023

criterion="firstHit" scoring
If the criterion attribute in the RuleSelectionMethod is set to "firstHit" then RULE1 "fires" first and the prediction is "drugB". The confidence is the weight of RULE1, 0.9.

criterion="weightedSum" scoring
RULE1, RULE2 and RULE3 all fire. To choose the winner, for each prediction, sum the weights of the firing rules to produce a total weight for that prediction.
drugA: total weight = weight(RULE2) + weight(RULE3) = 0.6 + 0.36 = 0.96
drugB: total weight = weight(RULE1) = 0.9

The winning prediction with the highest total weight is drugA. The confidence for this prediction is the total weight for firing rules that predict drugA divided by the number of rules that fired:

confidence(drugA) = total_weight(drugA) / number_of_firing_rules = 0.96 / 3 = 0.32

criterion="weightedMax" scoring
RULE1 has the highest weight of the firing rules and the prediction is "drugB". The confidence is the confidence of RULE1, 0.9.
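The weightedSum arithmetic above, together with the fallback to the default score, can be checked with a short Python sketch. The names RULES, DEFAULT, firing_rules and weighted_sum_score are hypothetical, the predicates are hand-written from the three example rules, and the sketch assumes that no field value is missing and does not implement the tie-break between classes.

```python
# (id, predicate, predicted class, weight), in ruleset order.
RULES = [
    ("RULE1",
     lambda c: c["BP"] == "HIGH" and c["K"] > 0.045804001
     and c["Age"] <= 50 and c["Na"] <= 0.77240998,
     "drugB", 0.9),
    ("RULE2",
     lambda c: c["K"] > 0.057789002 and c["BP"] == "HIGH" and c["Age"] <= 50,
     "drugA", 0.6),
    ("RULE3",
     lambda c: c["BP"] == "HIGH" and c["Na"] > 0.21,
     "drugA", 0.36),
]
DEFAULT = ("drugY", 0.0)  # defaultScore and defaultConfidence of the example

def firing_rules(case):
    """Return (id, class, weight) for every rule whose predicate is true."""
    return [(rid, score, w) for rid, pred, score, w in RULES if pred(case)]

def weighted_sum_score(case):
    """Winning class and confidence under weightedSum, or the default if nothing fires."""
    firing = firing_rules(case)
    if not firing:
        return DEFAULT
    totals = {}
    for _, score, weight in firing:
        totals[score] = totals.get(score, 0.0) + weight
    winner = max(totals, key=totals.get)  # ties between classes not handled here
    return winner, totals[winner] / len(firing)

case = {"BP": "HIGH", "K": 0.0621, "Age": 36, "Na": 0.5023}
prediction, confidence = weighted_sum_score(case)
```

For the instance above all three rules fire and the sketch selects drugA with a confidence of approximately 0.32. For an instance with BP="LOW" no rule fires, so the default score drugY is returned with confidence 0.0.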

PMML for the example (using compound rules)

The following PMML shows how the example model can be described using compound rules.


  <?xml version="1.0" encoding="UTF-8"?>
  <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header copyright="MyCopyright">
      <Application name="MyApplication" version="1.0"/>
    </Header>
    <DataDictionary numberOfFields="7">
      <DataField name="BP" displayName="BP" optype="categorical" dataType="string">
        <Value value="HIGH" property="valid"/>
        <Value value="LOW" property="valid"/>
        <Value value="NORMAL" property="valid"/>
      </DataField>
      <DataField name="K" displayName="K" optype="continuous" dataType="double">
        <Interval closure="closedClosed" leftMargin="0.020152" rightMargin="0.079925"/>
      </DataField>
      <DataField name="Age" displayName="Age" optype="continuous" dataType="integer">
        <Interval closure="closedClosed" leftMargin="15" rightMargin="74"/>
      </DataField>
      <DataField name="Na" displayName="Na" optype="continuous" dataType="double">
        <Interval closure="closedClosed" leftMargin="0.500517" rightMargin="0.899774"/>
      </DataField>
      <DataField name="Cholesterol" displayName="Cholesterol" optype="categorical" dataType="string">
        <Value value="HIGH" property="valid"/>
        <Value value="NORMAL" property="valid"/>
      </DataField>
      <DataField name="$C-Drug" displayName="$C-Drug" optype="categorical" dataType="string">
        <Value value="drugA" property="valid"/>
        <Value value="drugB" property="valid"/>
        <Value value="drugC" property="valid"/>
        <Value value="drugX" property="valid"/>
        <Value value="drugY" property="valid"/>
      </DataField>
      <DataField name="$CC-Drug" displayName="$CC-Drug" optype="continuous" dataType="double">
        <Interval closure="closedClosed" leftMargin="0" rightMargin="1"/>
      </DataField>
    </DataDictionary>
    <RuleSetModel modelName="Drug" functionName="classification" algorithmName="RuleSet">
      <MiningSchema>
        <MiningField name="BP" usageType="active"/>
        <MiningField name="K" usageType="active"/>
        <MiningField name="Age" usageType="active"/>
        <MiningField name="Na" usageType="active"/>
        <MiningField name="Cholesterol" usageType="active"/>
        <MiningField name="$C-Drug" usageType="predicted"/>
        <MiningField name="$CC-Drug" usageType="supplementary"/>
      </MiningSchema>
      <RuleSet defaultScore="drugY" recordCount="1000" nbCorrect="149" defaultConfidence="0.0">
        <RuleSelectionMethod criterion="weightedSum"/>
        <RuleSelectionMethod criterion="weightedMax"/>
        <RuleSelectionMethod criterion="firstHit"/>
          <CompoundRule>
            <SimplePredicate field="BP" operator="equal" value="HIGH"/>
            <CompoundRule>
              <SimplePredicate field="Age" operator="lessOrEqual" value="50"/>
              <SimpleRule id="RULE1" score="drugB" recordCount="79" nbCorrect="76" confidence="0.9" weight="0.9">                         
                <CompoundPredicate booleanOperator="and">
                  <SimplePredicate field="K" operator="greaterThan" value="0.045804001"/>
                  <SimplePredicate field="Na" operator="lessOrEqual" value="0.77240998"/>
                </CompoundPredicate>
                <ScoreDistribution value="drugA" recordCount="2"/>
                <ScoreDistribution value="drugB" recordCount="76"/>
                <ScoreDistribution value="drugC" recordCount="1"/>
                <ScoreDistribution value="drugX" recordCount="0"/>
                <ScoreDistribution value="drugY" recordCount="0"/>
              </SimpleRule>
              <SimpleRule id="RULE2" score="drugA" recordCount="278" nbCorrect="168" confidence="0.6" weight="0.6">    
                <SimplePredicate field="K" operator="greaterThan" value="0.057789002"/>
                <ScoreDistribution value="drugA" recordCount="168"/>
                <ScoreDistribution value="drugB" recordCount="40"/>
                <ScoreDistribution value="drugC" recordCount="12"/>
                <ScoreDistribution value="drugX" recordCount="14"/>
                <ScoreDistribution value="drugY" recordCount="24"/>
              </SimpleRule>
            </CompoundRule>
            <SimpleRule id="RULE3" score="drugA" recordCount="100" nbCorrect="50" confidence="0.36" weight="0.36">
              <SimplePredicate field="Na" operator="greaterThan" value="0.21"/>
              <ScoreDistribution value="drugA" recordCount="50"/>
              <ScoreDistribution value="drugB" recordCount="10"/>
              <ScoreDistribution value="drugC" recordCount="12"/>
              <ScoreDistribution value="drugX" recordCount="7"/>
              <ScoreDistribution value="drugY" recordCount="11"/>
            </SimpleRule>
          </CompoundRule>
      </RuleSet>
    </RuleSetModel>
  </PMML>
