PMML 4.0 - Ruleset
Ruleset models can be thought of as flattened decision tree
models. A ruleset consists of a number of rules. Each rule contains a
predicate and a predicted class value, plus some information
collected at training or testing time on the performance of the rule.
For example, the following text describes a rule:
PREDICATE: BP="HIGH" AND K > 0.045804001 AND Age <= 50 AND Na <= 0.77240998
PREDICTION: "drugB"
CONFIDENCE: 0.9
Rulesets can be applied to new instances to derive predictions and
associated confidences (scoring). When a case is scored, a rule is
said to fire if its predicate evaluates to TRUE on the instance.
The ruleset can also have an optional default prediction and
associated confidence that are used to score a case if no rules fire.
If missing values in fields mentioned in a rule's predicate cause
the predicate to evaluate to UNKNOWN, the rule does not fire.
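The firing condition can be sketched in Python as follows. This is an illustration only: `None` stands in for UNKNOWN, the helper names `compare_gt`, `and3` and `fires` are made up for this sketch, and the three-valued AND (any FALSE gives FALSE, otherwise any UNKNOWN gives UNKNOWN) is assumed to match the predicate semantics described in TreeModel.

```python
from typing import Optional

def compare_gt(value: Optional[float], threshold: float) -> Optional[bool]:
    """Return True/False, or None (UNKNOWN) when the field value is missing."""
    if value is None:
        return None
    return value > threshold

def and3(*results: Optional[bool]) -> Optional[bool]:
    """Three-valued AND: any False wins, then any UNKNOWN, else True."""
    if any(r is False for r in results):
        return False
    if any(r is None for r in results):
        return None
    return True

def fires(predicate_result: Optional[bool]) -> bool:
    """A rule fires only when its predicate is definitely TRUE."""
    return predicate_result is True

# K is missing, so the conjunction is UNKNOWN and the rule does not fire.
case = {"BP": "HIGH", "K": None}
result = and3(case["BP"] == "HIGH", compare_gt(case["K"], 0.045804001))
print(result, fires(result))
```

Keeping UNKNOWN distinct from FALSE matters mainly for compound predicates; for the decision whether a rule fires, both lead to the same outcome.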
One important question is then how to resolve conflicting predictions
when multiple rules fire. Useful strategies include:
- first hit (pick the first rule that fires).
- weighted maximum (pick the firing rule with the highest weight).
- weighted sum (pick the best prediction by combining the weights of
all firing rules).
Each rule can have a confidence and a weight that are set at model
build time, typically from the rule's performance on the training
data. How confidence and weight are computed is up to the application
that produces the PMML and lies outside the scope of the PMML model
description.
<xs:element name="RuleSetModel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="MiningSchema"/>
<xs:element ref="Output" minOccurs="0" />
<xs:element ref="ModelStats" minOccurs="0"/>
<xs:element ref="ModelExplanation" minOccurs="0"/>
<xs:element ref="Targets" minOccurs="0" />
<xs:element ref="LocalTransformations" minOccurs="0" />
<xs:element ref="RuleSet"/>
<xs:element ref="ModelVerification" minOccurs="0"/>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="modelName" type="xs:string" use="optional"/>
<xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
<xs:attribute name="algorithmName" type="xs:string" use="optional"/>
</xs:complexType>
</xs:element>
Definitions:
RuleSetModel: starts the definition for a ruleset model.
RuleSet: describes the list of rules that make up a ruleset model. The order of rules in the list is important when considering how to score the ruleset.
modelName: the value of modelName in a RuleSetModel element identifies the model with a unique name in the context of the PMML file. See general structure of PMML models.
A RuleSet consists of:
<xs:element name="RuleSet">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="RuleSelectionMethod" minOccurs="1" maxOccurs="unbounded"/>
<xs:element ref="ScoreDistribution" minOccurs="0" maxOccurs="unbounded"/>
<xs:group ref="Rule" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="recordCount" type="NUMBER" use="optional"/>
<xs:attribute name="nbCorrect" type="NUMBER" use="optional"/>
<xs:attribute name="defaultScore" type="xs:string" use="optional"/>
<xs:attribute name="defaultConfidence" type="NUMBER" use="optional"/>
</xs:complexType>
</xs:element>
Definitions:
recordCount: the number of training/test cases to which the ruleset was applied to generate support and confidence measures for individual rules.
nbCorrect: the number of training/test instances for which the default score is correct.
defaultScore: the value of defaultScore in a RuleSet serves as the default predicted value when scoring a case and no rules in the ruleset fire.
defaultConfidence: provides a confidence to be returned with the default score (when scoring a case and no rules in the ruleset fire).
RuleSelectionMethod: specifies how to select rules from the ruleset to score a new case. If more than one method is included, the first method is used as the default method for scoring, but the application performing the scoring may select any of the other included methods as a valid alternative.
ScoreDistribution: describes the distribution of the predicted value in the test/training data.
Rule: zero or more rules that make up the ruleset.
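A minimal Python sketch of two of these definitions: the first declared RuleSelectionMethod acts as the default, and defaultScore/defaultConfidence are returned when no rule fires. The function names are hypothetical, and raising an error for a criterion the ruleset does not declare is an assumption of this sketch, not something the text prescribes.

```python
from typing import List, Optional, Tuple

def choose_criterion(declared: List[str], requested: Optional[str] = None) -> str:
    """Use the first declared method unless another declared one is requested."""
    if requested is None:
        return declared[0]
    if requested not in declared:
        # Assumption of this sketch: undeclared methods are rejected.
        raise ValueError(f"{requested!r} is not offered by this ruleset")
    return requested

def with_default(result: Optional[Tuple[str, float]],
                 default_score: Optional[str],
                 default_confidence: Optional[float]):
    """Fall back to the ruleset defaults when no rule fired (result is None)."""
    if result is not None:
        return result
    if default_score is None:
        return None  # no default declared, so there is no prediction
    return default_score, default_confidence

declared = ["weightedSum", "weightedMax", "firstHit"]
print(choose_criterion(declared))
print(choose_criterion(declared, "firstHit"))
print(with_default(None, "drugY", 0.0))
```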
The RuleSelectionMethod describes how
rules are selected to apply the model to a new case, and consists of:
<xs:element name="RuleSelectionMethod">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="criterion" use="required">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="weightedSum"/>
<xs:enumeration value="weightedMax"/>
<xs:enumeration value="firstHit"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
Definitions:
criterion: explains how to determine and rank predictions and their
associated confidences from the ruleset in case multiple rules fire.
There are many possible ways of applying rulesets, but three useful
approaches are covered.
- firstHit: The first firing rule is chosen; its score is the
predicted class, and the confidence is the weight of that rule.
If further predictions and confidences are required, a search is made
for the next firing rule that predicts a different class,
and so on.
- weightedSum: Calculate the total weight for each
class by summing the weights for each firing rule which predicts that
class. The prediction with the
highest total weight is then selected. The confidence is the total
weight of the winning class divided by the number of firing rules. If
further predictions and confidences are required, the process is
repeated to find the class with the second highest total weight, and
so on. Note that if two or more classes are assigned the same weight,
the winning class is the one that appears first in the data dictionary
values.
- weightedMax: Select the firing rule with the highest weight.
The confidence returned is the confidence of the selected rule. Note that
if two firing rules have the same weight, the rule that occurs first
in the ruleset is chosen.
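The three criteria can be sketched in Python as below. This is an illustrative reading of the definitions, not a reference implementation: `FiringRule` and the function names are made up here, each function assumes at least one rule has fired and receives the firing rules in ruleset order, and `class_order` is assumed to hold the target values in data dictionary order. Only the top prediction is returned; ranking further predictions is left out.

```python
from collections import defaultdict
from typing import List, NamedTuple, Tuple

class FiringRule(NamedTuple):
    score: str          # predicted class
    weight: float
    confidence: float

def first_hit(firing: List[FiringRule]) -> Tuple[str, float]:
    # The first firing rule wins; its weight is reported as the confidence.
    rule = firing[0]
    return rule.score, rule.weight

def weighted_max(firing: List[FiringRule]) -> Tuple[str, float]:
    # max() returns the first maximal item, so ties go to the earlier rule.
    rule = max(firing, key=lambda r: r.weight)
    return rule.score, rule.confidence

def weighted_sum(firing: List[FiringRule], class_order: List[str]) -> Tuple[str, float]:
    totals = defaultdict(float)
    for rule in firing:
        totals[rule.score] += rule.weight
    # Highest total wins; ties go to the class listed first in class_order.
    winner = min(totals, key=lambda c: (-totals[c], class_order.index(c)))
    return winner, totals[winner] / len(firing)

firing = [FiringRule("x", 0.5, 0.7), FiringRule("y", 0.5, 0.4), FiringRule("y", 0.2, 0.1)]
print(first_hit(firing))
print(weighted_max(firing))
print(weighted_sum(firing, ["x", "y"]))
```

Note how the three criteria can disagree on the same firing rules: here firstHit and weightedMax both pick class x, while weightedSum picks class y because two weaker rules together outweigh the single stronger one.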
Each Rule can be either a SimpleRule
or a CompoundRule.
<xs:group name="Rule">
<xs:choice>
<xs:element ref="SimpleRule" />
<xs:element ref="CompoundRule" />
</xs:choice>
</xs:group>
Each SimpleRule consists of an
identifier, a predicate, a score and information on rule performance.
<xs:element name="SimpleRule">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:group ref="PREDICATE"/>
<xs:element ref="ScoreDistribution" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string" use="optional"/>
<xs:attribute name="score" type="xs:string" use="required"/>
<xs:attribute name="recordCount" type="NUMBER" use="optional"/>
<xs:attribute name="nbCorrect" type="NUMBER" use="optional"/>
<xs:attribute name="confidence" type="NUMBER" use="optional"/>
<xs:attribute name="weight" type="NUMBER" use="optional"/>
</xs:complexType>
</xs:element>
Definitions:
PREDICATE: the condition upon which the rule fires. For more details on PREDICATE see the section on predicates in TreeModel. This explains how predicates are described and evaluated and how missing values are handled.
ScoreDistribution: describes the distribution of the predicted value for instances where the rule fires in the training/test data.
id: the value of id serves as a unique identifier for the rule. Must be unique within the ruleset.
score: the predicted value when the rule fires.
recordCount: the number of training/test instances on which the rule fired.
nbCorrect: the number of training/test instances on which the rule fired and the prediction was correct.
confidence: the confidence of the rule.
weight: the relative importance of the rule. May or may not be equal to the confidence.
Each CompoundRule consists of a
predicate and one or more rules. CompoundRules offer a shorthand for
a more compact representation of rulesets and suggest a more
efficient execution mechanism.
<xs:element name="CompoundRule">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:group ref="PREDICATE" />
<xs:group ref="Rule" minOccurs="1" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
Definitions:
PREDICATE: for more details on PREDICATE see TreeModel.
Rule: one or more rules that are contained within the CompoundRule. Each of these rules may be a SimpleRule or a CompoundRule.
A ruleset containing both compound rules and simple rules has the
same meaning as an equivalent ruleset containing only simple rules.
It is possible to derive a ruleset containing only simple rules by
repeating the following transformation:
The original rule
<CompoundRule>
<PREDICATE1/>
<SimpleRule id="1" ...>
<PREDICATE2/>
... contents of simple rule 1 ...
</SimpleRule>
... further rules ...
</CompoundRule>
transforms to
<SimpleRule id="1" ...>
<CompoundPredicate booleanOperator="and">
<PREDICATE1/>
<PREDICATE2/>
</CompoundPredicate>
... contents of simple rule 1 ...
</SimpleRule>
<CompoundRule>
<PREDICATE1/>
... further rules ...
</CompoundRule>
Or in other words,
a simple rule is said to fire if its predicate evaluates to TRUE, and
the predicates of all compound rules that contain the simple rule
also evaluate to TRUE.
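The transformation can be sketched as a recursive flattening step. In this Python sketch the classes `SimpleRule` and `CompoundRule` are simplified stand-ins for the PMML elements, predicates are represented only by placeholder labels (P1, P2, ...), and a tuple of labels is read as their conjunction; all of these are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class SimpleRule:
    id: str
    predicates: Tuple[str, ...]   # conjunction of predicate labels
    score: str

@dataclass
class CompoundRule:
    predicate: str
    rules: List[Union[SimpleRule, "CompoundRule"]]

def flatten(rules, inherited: Tuple[str, ...] = ()) -> List[SimpleRule]:
    """Return simple rules in document order, each AND-ed with its ancestors' predicates."""
    flat: List[SimpleRule] = []
    for rule in rules:
        if isinstance(rule, CompoundRule):
            flat.extend(flatten(rule.rules, inherited + (rule.predicate,)))
        else:
            flat.append(SimpleRule(rule.id, inherited + rule.predicates, rule.score))
    return flat

nested = [
    CompoundRule("P1", [
        SimpleRule("1", ("P2",), "a"),
        CompoundRule("P3", [SimpleRule("2", ("P4",), "b")]),
    ])
]
flat_rules = flatten(nested)
for rule in flat_rules:
    print(rule.id, " AND ".join(rule.predicates), "->", rule.score)
```

Document order is preserved by the traversal, which matters because firstHit and the tie-breaking of weightedMax depend on the order of rules.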
A Complete RuleSet Example
Consider a ruleset with three rules:
RULE1:
PREDICATE: BP="HIGH" AND K > 0.045804001 AND Age <= 50 AND Na <= 0.77240998
PREDICTION: drugB
Training/test measures:
recordCount 79
nbCorrect 76
confidence 0.9
weight 0.9
RULE2:
PREDICATE: K > 0.057789002 AND BP="HIGH" AND Age <= 50
PREDICTION: drugA
Training/test measures:
recordCount 278
nbCorrect 168
confidence 0.6
weight 0.6
RULE3:
PREDICATE: BP="HIGH" AND Na > 0.21
PREDICTION: drugA
Training/test measures:
recordCount 100
nbCorrect 50
confidence 0.36
weight 0.36
PMML for the example (using only simple rules)
<?xml version="1.0" encoding="UTF-8"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="MyCopyright">
<Application name="MyApplication" version="1.0"/>
</Header>
<DataDictionary numberOfFields="7">
<DataField name="BP" displayName="BP" optype="categorical" dataType="string">
<Value value="HIGH" property="valid"/>
<Value value="LOW" property="valid"/>
<Value value="NORMAL" property="valid"/>
</DataField>
<DataField name="K" displayName="K" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="0.020152" rightMargin="0.079925"/>
</DataField>
<DataField name="Age" displayName="Age" optype="continuous" dataType="integer"/>
<DataField name="Na" displayName="Na" optype="continuous" dataType="double"/>
<DataField name="Cholesterol" displayName="Cholesterol" optype="categorical" dataType="string">
<Value value="HIGH" property="valid"/>
<Value value="NORMAL" property="valid"/>
</DataField>
<DataField name="$C-Drug" displayName="$C-Drug" optype="categorical" dataType="string">
<Value value="drugA" property="valid"/>
<Value value="drugB" property="valid"/>
<Value value="drugC" property="valid"/>
<Value value="drugX" property="valid"/>
<Value value="drugY" property="valid"/>
</DataField>
<DataField name="$CC-Drug" displayName="$CC-Drug" optype="continuous" dataType="double"/>
</DataDictionary>
<RuleSetModel modelName="NestedDrug" functionName="classification" algorithmName="RuleSet">
<MiningSchema>
<MiningField name="BP" usageType="active"/>
<MiningField name="K" usageType="active"/>
<MiningField name="Age" usageType="active"/>
<MiningField name="Na" usageType="active"/>
<MiningField name="Cholesterol" usageType="active"/>
<MiningField name="$C-Drug" usageType="predicted"/>
<MiningField name="$CC-Drug" usageType="supplementary"/>
</MiningSchema>
<RuleSet defaultScore="drugY" recordCount="1000" nbCorrect="149" defaultConfidence="0.0">
<RuleSelectionMethod criterion="weightedSum"/>
<RuleSelectionMethod criterion="weightedMax"/>
<RuleSelectionMethod criterion="firstHit"/>
<SimpleRule id="RULE1" score="drugB" recordCount="79" nbCorrect="76" confidence="0.9" weight="0.9">
<CompoundPredicate booleanOperator="and">
<SimplePredicate field="BP" operator="equal" value="HIGH"/>
<SimplePredicate field="K" operator="greaterThan" value="0.045804001"/>
<SimplePredicate field="Age" operator="lessOrEqual" value="50"/>
<SimplePredicate field="Na" operator="lessOrEqual" value="0.77240998"/>
</CompoundPredicate>
<ScoreDistribution value="drugA" recordCount="2"/>
<ScoreDistribution value="drugB" recordCount="76"/>
<ScoreDistribution value="drugC" recordCount="1"/>
<ScoreDistribution value="drugX" recordCount="0"/>
<ScoreDistribution value="drugY" recordCount="0"/>
</SimpleRule>
<SimpleRule id="RULE2" score="drugA" recordCount="278" nbCorrect="168" confidence="0.6" weight="0.6">
<CompoundPredicate booleanOperator="and">
<SimplePredicate field="K" operator="greaterThan" value="0.057789002"/>
<SimplePredicate field="BP" operator="equal" value="HIGH"/>
<SimplePredicate field="Age" operator="lessOrEqual" value="50"/>
</CompoundPredicate>
<ScoreDistribution value="drugA" recordCount="168"/>
<ScoreDistribution value="drugB" recordCount="40"/>
<ScoreDistribution value="drugC" recordCount="12"/>
<ScoreDistribution value="drugX" recordCount="14"/>
<ScoreDistribution value="drugY" recordCount="24"/>
</SimpleRule>
<SimpleRule id="RULE3" score="drugA" recordCount="100" nbCorrect="50" confidence="0.36" weight="0.36">
<CompoundPredicate booleanOperator="and">
<SimplePredicate field="BP" operator="equal" value="HIGH"/>
<SimplePredicate field="Na" operator="greaterThan" value="0.21"/>
</CompoundPredicate>
<ScoreDistribution value="drugA" recordCount="50"/>
<ScoreDistribution value="drugB" recordCount="10"/>
<ScoreDistribution value="drugC" recordCount="12"/>
<ScoreDistribution value="drugX" recordCount="7"/>
<ScoreDistribution value="drugY" recordCount="11"/>
</SimpleRule>
</RuleSet>
</RuleSetModel>
</PMML>
Scoring Procedure for the Example
We will use the above example to illustrate the steps that should
be followed in the scoring process.
Suppose we wish to score an instance where:
BP = "HIGH", K = 0.0621, Age = 36, Na = 0.5023
criterion="firstHit" scoring
If the criterion attribute in the RuleSelectionMethod is set to
"firstHit", then RULE1 fires first and the prediction is "drugB".
The confidence is the weight of RULE1, 0.9.
criterion="weightedSum" scoring
RULE1, RULE2 and RULE3 all fire. To choose the winner, sum the
weights of the firing rules for each prediction to produce a total
weight for that prediction:
drugA: total weight = weight(RULE2) + weight(RULE3) = 0.6 + 0.36 = 0.96
drugB: total weight = weight(RULE1) = 0.9
The winning prediction, with the highest total weight, is drugA. The
confidence for this prediction is the total weight of the firing rules
that predict drugA divided by the number of rules that fired:
confidence(drugA) = total_weight(drugA) / number_of_firing_rules = 0.96 / 3 = 0.32
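The weightedSum arithmetic can be reproduced with a short Python sketch. The three rules and the instance are taken from the example; the `Rule` tuple and the use of plain Python functions as predicates are simplifications made for this illustration (missing values are not handled, and no tie-breaking is needed here).

```python
from typing import Callable, Dict, NamedTuple

class Rule(NamedTuple):
    id: str
    fires: Callable[[Dict], bool]   # predicate written as a plain Python function
    score: str
    weight: float

rules = [
    Rule("RULE1", lambda c: c["BP"] == "HIGH" and c["K"] > 0.045804001
                            and c["Age"] <= 50 and c["Na"] <= 0.77240998, "drugB", 0.9),
    Rule("RULE2", lambda c: c["K"] > 0.057789002 and c["BP"] == "HIGH"
                            and c["Age"] <= 50, "drugA", 0.6),
    Rule("RULE3", lambda c: c["BP"] == "HIGH" and c["Na"] > 0.21, "drugA", 0.36),
]
case = {"BP": "HIGH", "K": 0.0621, "Age": 36, "Na": 0.5023}

# Keep the rules whose predicate is TRUE for this instance, in ruleset order.
firing = [rule for rule in rules if rule.fires(case)]

# Sum the weights of the firing rules per predicted class.
totals: Dict[str, float] = {}
for rule in firing:
    totals[rule.score] = totals.get(rule.score, 0.0) + rule.weight

winner = max(totals, key=totals.get)          # no tie occurs in this example
confidence = totals[winner] / len(firing)

print([rule.id for rule in firing])
print(winner, round(totals[winner], 2), round(confidence, 2))
```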
criterion="weightedMax" scoring
RULE1 has the highest weight of the firing rules and the prediction is
"drugB". The confidence is the confidence of RULE1, 0.9.
PMML for the example (using compound rules)
The following PMML shows how the example model can be described
using compound rules.
<?xml version="1.0" encoding="UTF-8"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="MyCopyright">
<Application name="MyApplication" version="1.0"/>
</Header>
<DataDictionary numberOfFields="7">
<DataField name="BP" displayName="BP" optype="categorical" dataType="string">
<Value value="HIGH" property="valid"/>
<Value value="LOW" property="valid"/>
<Value value="NORMAL" property="valid"/>
</DataField>
<DataField name="K" displayName="K" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="0.020152" rightMargin="0.079925"/>
</DataField>
<DataField name="Age" displayName="Age" optype="continuous" dataType="integer">
<Interval closure="closedClosed" leftMargin="15" rightMargin="74"/>
</DataField>
<DataField name="Na" displayName="Na" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="0.500517" rightMargin="0.899774"/>
</DataField>
<DataField name="Cholesterol" displayName="Cholesterol" optype="categorical" dataType="string">
<Value value="HIGH" property="valid"/>
<Value value="NORMAL" property="valid"/>
</DataField>
<DataField name="$C-Drug" displayName="$C-Drug" optype="categorical" dataType="string">
<Value value="drugA" property="valid"/>
<Value value="drugB" property="valid"/>
<Value value="drugC" property="valid"/>
<Value value="drugX" property="valid"/>
<Value value="drugY" property="valid"/>
</DataField>
<DataField name="$CC-Drug" displayName="$CC-Drug" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="0" rightMargin="1"/>
</DataField>
</DataDictionary>
<RuleSetModel modelName="Drug" functionName="classification" algorithmName="RuleSet">
<MiningSchema>
<MiningField name="BP" usageType="active"/>
<MiningField name="K" usageType="active"/>
<MiningField name="Age" usageType="active"/>
<MiningField name="Na" usageType="active"/>
<MiningField name="Cholesterol" usageType="active"/>
<MiningField name="$C-Drug" usageType="predicted"/>
<MiningField name="$CC-Drug" usageType="supplementary"/>
</MiningSchema>
<RuleSet defaultScore="drugY" recordCount="1000" nbCorrect="149" defaultConfidence="0.0">
<RuleSelectionMethod criterion="weightedSum"/>
<RuleSelectionMethod criterion="weightedMax"/>
<RuleSelectionMethod criterion="firstHit"/>
<CompoundRule>
<SimplePredicate field="BP" operator="equal" value="HIGH"/>
<CompoundRule>
<SimplePredicate field="Age" operator="lessOrEqual" value="50"/>
<SimpleRule id="RULE1" score="drugB" recordCount="79" nbCorrect="76" confidence="0.9" weight="0.9">
<CompoundPredicate booleanOperator="and">
<SimplePredicate field="K" operator="greaterThan" value="0.045804001"/>
<SimplePredicate field="Na" operator="lessOrEqual" value="0.77240998"/>
</CompoundPredicate>
<ScoreDistribution value="drugA" recordCount="2"/>
<ScoreDistribution value="drugB" recordCount="76"/>
<ScoreDistribution value="drugC" recordCount="1"/>
<ScoreDistribution value="drugX" recordCount="0"/>
<ScoreDistribution value="drugY" recordCount="0"/>
</SimpleRule>
<SimpleRule id="RULE2" score="drugA" recordCount="278" nbCorrect="168" confidence="0.6" weight="0.6">
<SimplePredicate field="K" operator="greaterThan" value="0.057789002"/>
<ScoreDistribution value="drugA" recordCount="168"/>
<ScoreDistribution value="drugB" recordCount="40"/>
<ScoreDistribution value="drugC" recordCount="12"/>
<ScoreDistribution value="drugX" recordCount="14"/>
<ScoreDistribution value="drugY" recordCount="24"/>
</SimpleRule>
</CompoundRule>
<SimpleRule id="RULE3" score="drugA" recordCount="100" nbCorrect="50" confidence="0.36" weight="0.36">
<SimplePredicate field="Na" operator="greaterThan" value="0.21"/>
<ScoreDistribution value="drugA" recordCount="50"/>
<ScoreDistribution value="drugB" recordCount="10"/>
<ScoreDistribution value="drugC" recordCount="12"/>
<ScoreDistribution value="drugX" recordCount="7"/>
<ScoreDistribution value="drugY" recordCount="11"/>
</SimpleRule>
</CompoundRule>
</RuleSet>
</RuleSetModel>
</PMML>
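To check that the compound form carries the same conditions as the simple-rule form, the nesting can be walked with Python's standard `xml.etree.ElementTree`. The sketch below uses a trimmed copy of the RuleSet above (ScoreDistribution elements, the namespace and most attributes are omitted); the helper names `local`, `describe` and `walk` are made up here, and `describe` only handles the predicate forms that occur in this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the RuleSet above, kept only to show the nesting.
FRAGMENT = """
<RuleSet defaultScore="drugY">
  <RuleSelectionMethod criterion="weightedSum"/>
  <CompoundRule>
    <SimplePredicate field="BP" operator="equal" value="HIGH"/>
    <CompoundRule>
      <SimplePredicate field="Age" operator="lessOrEqual" value="50"/>
      <SimpleRule id="RULE1" score="drugB" weight="0.9">
        <CompoundPredicate booleanOperator="and">
          <SimplePredicate field="K" operator="greaterThan" value="0.045804001"/>
          <SimplePredicate field="Na" operator="lessOrEqual" value="0.77240998"/>
        </CompoundPredicate>
      </SimpleRule>
      <SimpleRule id="RULE2" score="drugA" weight="0.6">
        <SimplePredicate field="K" operator="greaterThan" value="0.057789002"/>
      </SimpleRule>
    </CompoundRule>
    <SimpleRule id="RULE3" score="drugA" weight="0.36">
      <SimplePredicate field="Na" operator="greaterThan" value="0.21"/>
    </SimpleRule>
  </CompoundRule>
</RuleSet>
"""

PREDICATES = {"SimplePredicate", "CompoundPredicate"}

def local(tag):
    """Strip an XML namespace prefix such as '{uri}SimpleRule', if present."""
    return tag.rsplit("}", 1)[-1]

def describe(pred):
    """Render a predicate element as text (only the forms used in this example)."""
    if local(pred.tag) == "SimplePredicate":
        return f'{pred.get("field")} {pred.get("operator")} {pred.get("value")}'
    # CompoundPredicate: join the child predicates with the boolean operator.
    operator = f' {pred.get("booleanOperator")} '
    return operator.join(describe(child) for child in pred)

def walk(element, inherited=()):
    """Yield (rule id, list of conditions) for every SimpleRule, in document order."""
    for child in element:
        name = local(child.tag)
        if name not in ("SimpleRule", "CompoundRule"):
            continue
        own = [describe(p) for p in child if local(p.tag) in PREDICATES][0]
        conditions = inherited + (own,)
        if name == "CompoundRule":
            yield from walk(child, conditions)
        else:
            yield child.get("id"), list(conditions)

flattened = dict(walk(ET.fromstring(FRAGMENT)))
for rule_id, conditions in flattened.items():
    print(rule_id, "=", " AND ".join(conditions))
```

Each SimpleRule ends up with the conditions of its enclosing CompoundRules plus its own predicate, which corresponds to the conjunctions written out in the simple-rule version of the model.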