PMML 4.0 - Association Rules
The Association Rule model represents rules where some set of items
is associated with another set of items. For example, a rule can express
that a certain product or set of products is often bought in combination with a certain set
of other products.
The attribute definitions of the association rule model use the
entity ELEMENT-ID to express the semantic constraint
that a value must be unique within a set of elements of the same type
(contained in the same XML document).
An Association Rule model consists of four major parts:
- Model attributes
- Items
- Itemsets
- AssociationRules
<xs:element name="AssociationModel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="MiningSchema"/>
<xs:element ref="Output" minOccurs="0"/>
<xs:element ref="ModelStats" minOccurs="0"/>
<xs:element ref="LocalTransformations" minOccurs="0"/>
<xs:element ref="Item" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Itemset" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="AssociationRule" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="modelName" type="xs:string"/>
<xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
<xs:attribute name="algorithmName" type="xs:string"/>
<xs:attribute name="numberOfTransactions" type="INT-NUMBER" use="required"/>
<xs:attribute name="maxNumberOfItemsPerTA" type="INT-NUMBER"/>
<xs:attribute name="avgNumberOfItemsPerTA" type="REAL-NUMBER"/>
<xs:attribute name="minimumSupport" type="PROB-NUMBER" use="required"/>
<xs:attribute name="minimumConfidence" type="PROB-NUMBER" use="required"/>
<xs:attribute name="lengthLimit" type="INT-NUMBER"/>
<xs:attribute name="numberOfItems" type="INT-NUMBER" use="required"/>
<xs:attribute name="numberOfItemsets" type="INT-NUMBER" use="required"/>
<xs:attribute name="numberOfRules" type="INT-NUMBER" use="required"/>
</xs:complexType>
</xs:element>
An AssociationModel can contain any number of Itemsets and AssociationRules.
These elements can be mixed but all Itemsets that are used in an
AssociationRule element must appear before the rule element.
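The ordering constraint above can be checked mechanically. The following is a minimal sketch, assuming the model is available as an XML string; the function name `rules_with_undeclared_itemsets` and the two test documents are hypothetical and not part of PMML.

```python
import xml.etree.ElementTree as ET

def rules_with_undeclared_itemsets(model_xml):
    """Return (antecedent, consequent) id pairs of rules that reference an
    Itemset id which has not appeared earlier in document order."""
    model = ET.fromstring(model_xml)
    seen = set()
    offending = []
    for child in model:
        name = child.tag.rsplit("}", 1)[-1]  # drop a namespace prefix if present
        if name == "Itemset":
            seen.add(child.get("id"))
        elif name == "AssociationRule":
            pair = (child.get("antecedent"), child.get("consequent"))
            if pair[0] not in seen or pair[1] not in seen:
                offending.append(pair)
    return offending

# Hypothetical fragments: the second one declares Itemset "2" after the rule.
GOOD = ('<AssociationModel><Itemset id="1"/><Itemset id="2"/>'
        '<AssociationRule antecedent="1" consequent="2"/></AssociationModel>')
BAD = ('<AssociationModel><Itemset id="1"/>'
       '<AssociationRule antecedent="1" consequent="2"/>'
       '<Itemset id="2"/></AssociationModel>')

print(rules_with_undeclared_itemsets(GOOD))
print(rules_with_undeclared_itemsets(BAD))
```

A single pass in document order is enough here because the constraint only concerns what has already been declared when a rule is reached.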
Here is a description of the attributes:
numberOfTransactions: The number of transactions contained in the input data.
maxNumberOfItemsPerTA: The number of items contained in the largest transaction.
avgNumberOfItemsPerTA: The average number of items contained in a transaction.
minimumSupport: The minimum relative support value (#supporting transactions / #total transactions) satisfied by all rules.
minimumConfidence: The minimum confidence value satisfied by all rules. Confidence is calculated as (support(rule) / support(antecedent)).
lengthLimit: The maximum number of items contained in a rule, which was used to limit the number of rules.
numberOfItems: The number of different items contained in the input data.
numberOfItemsets: The number of itemsets contained in the model.
numberOfRules: The number of rules contained in the model.
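As an illustration of how the data-dependent attributes relate to the input, here is a small sketch that derives them from a list of transactions. The transactions and the helper name `model_attributes` are assumptions made for this example only; each transaction is modelled as a set of item values.

```python
# Hypothetical input data: four transactions over three distinct items.
transactions = [
    {"Cracker", "Coke", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Water"},
    {"Cracker", "Coke", "Water"},
]

def model_attributes(transactions):
    """Derive the data-dependent model attributes from the transactions."""
    sizes = [len(t) for t in transactions]
    return {
        "numberOfTransactions": len(transactions),
        "maxNumberOfItemsPerTA": max(sizes),
        "avgNumberOfItemsPerTA": sum(sizes) / len(sizes),
        "numberOfItems": len(set().union(*transactions)),
    }

print(model_attributes(transactions))
```

The remaining attributes (minimumSupport, minimumConfidence, lengthLimit, numberOfItemsets, numberOfRules) depend on the mining run rather than on the data alone, so they are not derived in this sketch.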
We consider items next:
<xs:element name="Item">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string" use="required"/>
<xs:attribute name="value" type="xs:string" use="required"/>
<xs:attribute name="mappedValue" type="xs:string"/>
<xs:attribute name="weight" type="REAL-NUMBER"/>
</xs:complexType>
</xs:element>
Here is a description of the attributes in an item:
id: An identification to uniquely identify an item.
value: The value of the item as in the input data.
mappedValue: Optional, a value to which the original item
value is mapped. For instance, this could be a product name if the
original value is an EAN code.
weight: The weight of the item. For example, the price or value of an item.
The id of an Item must be unique. Furthermore, the Item values must be unique too.
That is, an AssociationModel must not have
different instances of Item where the values of the attribute named 'value'
are duplicates.
The entries in mappedValue may be the same, though.
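The uniqueness rules for Item can be sketched as a simple check. The in-memory representation (a list of dictionaries) and the function name `duplicate_item_attributes` are assumptions for illustration; mappedValue is deliberately left out because duplicates are allowed there.

```python
from collections import Counter

def duplicate_item_attributes(items):
    """Return, for 'id' and 'value', the attribute values used by more than one Item."""
    duplicates = {}
    for attr in ("id", "value"):
        counts = Counter(item[attr] for item in items)
        duplicates[attr] = sorted(v for v, n in counts.items() if n > 1)
    return duplicates

# Hypothetical items: ids are unique, but the value "Cracker" occurs twice.
items = [
    {"id": "1", "value": "Cracker", "mappedValue": "Snack"},
    {"id": "2", "value": "Coke", "mappedValue": "Drink"},
    {"id": "3", "value": "Cracker", "mappedValue": "Snack"},
]

print(duplicate_item_attributes(items))
```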
We consider itemsets next:
<xs:element name="Itemset">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element minOccurs="0" maxOccurs="unbounded" ref="ItemRef"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string" use="required"/>
<xs:attribute name="support" type="PROB-NUMBER"/>
<xs:attribute name="numberOfItems" type="xs:nonNegativeInteger"/>
</xs:complexType>
</xs:element>
Here is a description of the attributes in an Itemset:
id: An identification to uniquely identify an Itemset.
support: The relative support of the Itemset.
- support(set) = (number of transactions containing the set) / (total number of transactions)
numberOfItems: The number of Items contained in this Itemset.
ItemRef: Item references point to elements of type Item and are defined as:
<xs:element name="ItemRef">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="itemRef" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
Here is a description of the attributes in an ItemRef:
itemRef: Contains the identification of an item.
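A consumer may want to verify that every ItemRef points at a declared Item and that numberOfItems, when present, agrees with the number of ItemRef elements. The sketch below assumes that reading of numberOfItems and uses a hypothetical dictionary representation of an Itemset; it is not a PMML API.

```python
def check_itemset(itemset, known_item_ids):
    """Return a list of problems found in one Itemset description.

    itemset: dict with 'id', 'item_refs' (list of itemRef values) and an
    optional 'numberOfItems' (hypothetical in-memory form).
    known_item_ids: set of ids of the declared Item elements.
    """
    problems = []
    for ref in itemset["item_refs"]:
        if ref not in known_item_ids:
            problems.append(f"Itemset {itemset['id']}: itemRef {ref} has no matching Item")
    declared = itemset.get("numberOfItems")
    actual = len(itemset["item_refs"])
    if declared is not None and declared != actual:
        problems.append(f"Itemset {itemset['id']}: numberOfItems={declared} but {actual} ItemRef elements")
    return problems

item_ids = {"1", "2", "3"}
ok = {"id": "3", "item_refs": ["1", "3"], "numberOfItems": 2}
broken = {"id": "4", "item_refs": ["1", "9"], "numberOfItems": 3}

print(check_itemset(ok, item_ids))
print(check_itemset(broken, item_ids))
```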
We consider association rules of the
form "<antecedent itemset> => <consequent itemset>"
next:
<xs:element name="AssociationRule">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="antecedent" type="xs:string" use="required"/>
<xs:attribute name="consequent" type="xs:string" use="required"/>
<xs:attribute name="support" type="PROB-NUMBER" use="required"/>
<xs:attribute name="confidence" type="PROB-NUMBER" use="required"/>
<xs:attribute name="lift" type="xs:float" use="optional"/>
<xs:attribute name="id" type="xs:string" use="optional"/>
</xs:complexType>
</xs:element>
Here is a description of the attributes in an AssociationRule:
antecedent: The id value of the itemset which is the antecedent
of the rule. We represent the itemset by the letter A.
consequent: The id value of the itemset which is the consequent
of the rule. We represent the itemset by the letter C.
support: The support of the rule, that is, the relative
frequency of transactions that contain A and C.
- support(A->C) = support(A+C)
confidence: The confidence of the rule.
- confidence(A->C) = support(A+C) / support(A)
lift: The lift value of the rule. If the XML attribute is
specified explicitly in the rule, the following equation must hold true.
- lift(A->C) = confidence(A->C) / support(C)
id: An identification to uniquely identify an association rule.
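The three formulas above can be evaluated directly from transaction data. The following sketch uses hypothetical transactions and helper names (`support`, `rule_measures`); it assumes the antecedent and consequent each occur in at least one transaction, since otherwise the divisions are undefined.

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_measures(antecedent, consequent, transactions):
    """Compute support, confidence and lift of the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_ac = support(antecedent | consequent, transactions)  # support(A+C)
    confidence = s_ac / s_a          # assumes support(A) > 0
    return {
        "support": s_ac,
        "confidence": confidence,
        "lift": confidence / s_c,    # assumes support(C) > 0
    }

# Hypothetical input data.
transactions = [
    {"Cracker", "Water"},
    {"Cracker", "Water", "Coke"},
    {"Cracker", "Coke"},
    {"Water"},
]

print(rule_measures({"Cracker"}, {"Water"}, transactions))
```

With this data, support(A) and support(C) are both 3/4 and support(A+C) is 2/4, so the confidence is 2/3 and the lift is 8/9, that is, slightly below 1.0.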
A very popular measure of interestingness of a rule is lift.
Lift values greater than 1.0 indicate that transactions containing A
tend to contain C more often than transactions that do not contain
A.
Another measure of interestingness is leverage.
An association with higher frequency and lower lift may be
more interesting than an alternative rule with lower frequency and higher lift.
The former can be more important in practice because it applies to more cases.
Leverage can be computed by
leverage(A->C) = support(A->C) - support(A)*support(C)
This is the difference between the observed frequency of A+C
and the frequency that would be expected if A and C were independent.
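To make the contrast between lift and leverage concrete, here is a small worked sketch. The support values for the two rules are invented for illustration, and the function names are hypothetical.

```python
def lift(s_a, s_c, s_ac):
    """lift(A->C) = confidence(A->C) / support(C), with confidence = support(A+C) / support(A)."""
    return (s_ac / s_a) / s_c

def leverage(s_a, s_c, s_ac):
    """leverage(A->C) = support(A+C) - support(A) * support(C)."""
    return s_ac - s_a * s_c

# Hypothetical support values for two rules.
frequent = dict(s_a=0.5, s_c=0.5, s_ac=0.4)    # applies to many transactions
rare = dict(s_a=0.02, s_c=0.02, s_ac=0.01)     # applies to very few transactions

for name, rule in (("frequent", frequent), ("rare", rare)):
    print(name, "lift:", lift(**rule), "leverage:", leverage(**rule))
```

Under these assumed values the rare rule has the much higher lift (about 25 versus 1.6), while the frequent rule has the higher leverage (about 0.15 versus 0.0096), which is the situation described above.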
A number of other measures can be defined. PMML does not provide specific
attributes in a rule for leverage and measures other than support, confidence,
and lift. Further measures can usually be derived
from the information in the PMML model.
Note that confidence and lift can be derived from support values.
The attributes confidence and lift have been included
in PMML because they are very common.
Here is an example of an association model:
<?xml version="1.0" ?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="www.dmg.org" description="example model for association rules"/>
<DataDictionary numberOfFields="2" >
<DataField name="transaction" optype="categorical" dataType="string"/>
<DataField name="item" optype="categorical" dataType="string"/>
</DataDictionary>
<AssociationModel
functionName="associationRules"
numberOfTransactions="4" numberOfItems="3"
minimumSupport="0.6" minimumConfidence="0.5"
numberOfItemsets="3" numberOfRules="2">
<MiningSchema>
<MiningField name="transaction" usageType="group"/>
<MiningField name="item" usageType="predicted"/>
</MiningSchema>
<Output>
<!-- There are nine outputs defined for this model -->
<!-- that return the top three highest confidence -->
<!-- "exclusiveRecommendation" results (selecting -->
<!-- rules where the items in the input itemset -->
<!-- appear in the antecedent but do not appear in -->
<!-- the consequent). For each of these three -->
<!-- rules, there are three available outputs: -->
<!-- rule: for example, "Cracker -> Water" -->
<!-- consequent: for example, "Water" -->
<!-- ruleId: for example, 1 -->
<OutputField name="Rule (Highest Confidence)" rankBasis="confidence" rank="1"
algorithm="exclusiveRecommendation" feature="ruleValue" ruleFeature="rule"
dataType="string" optype="categorical"/>
<OutputField name="Recommendation (Highest Confidence)" rankBasis="confidence" rank="1"
algorithm="exclusiveRecommendation" feature="ruleValue" ruleFeature="consequent"
dataType="string" optype="categorical"/>
<OutputField name="Rule Id (Highest Confidence)" rankBasis="confidence" rank="1"
algorithm="exclusiveRecommendation" feature="ruleValue" ruleFeature="ruleId"
dataType="double" optype="continuous"/>
<OutputField name="Rule (2nd Highest Confidence)" rankBasis="confidence" rank="2"
algorithm="exclusiveRecommendation" feature="ruleValue" ruleFeature="rule"
dataType="string" optype="categorical"/>
<OutputField name="Recommendation (2nd Highest Confidence)" rankBasis="confidence" rank="2"
algorithm="exclusiveRecommendation" feature="ruleValue" ruleFeature="consequent"
dataType="string" optype="categorical"/>
<OutputField name="Rule Id (2nd Highest Confidence)" rankBasis="confidence" rank="2"
algorithm="exclusiveRecommendation" feature="ruleValue" ruleFeature="ruleId"
dataType="double" optype="continuous"/>
<OutputField name="Rule (3rd Highest Confidence)" rankBasis="confidence" rank="3"
algorithm="exclusiveRecommendation" feature="ruleValue" ruleFeature="rule"
dataType="string" optype="categorical"/>
<OutputField name="Recommendation (3rd Highest Confidence)" rankBasis="confidence" rank="3"
algorithm="exclusiveRecommendation" feature="ruleValue" ruleFeature="consequent"
dataType="string" optype="categorical"/>
<OutputField name="Rule Id (3rd Highest Confidence)" rankBasis="confidence" rank="3"
algorithm="exclusiveRecommendation" feature="ruleValue" ruleFeature="ruleId"
dataType="double" optype="continuous"/>
</Output>
<!-- We have three items in our input data -->
<Item id="1" value="Cracker"/>
<Item id="2" value="Coke"/>
<Item id="3" value="Water"/>
<!-- and two frequent itemsets with a single item -->
<Itemset id="1" support="1.0" numberOfItems="1">
<ItemRef itemRef="1"/>
</Itemset>
<Itemset id="2" support="1.0" numberOfItems="1">
<ItemRef itemRef="3"/>
</Itemset>
<!-- and one frequent itemset with two items. -->
<Itemset id="3" support="1.0" numberOfItems="2">
<ItemRef itemRef="1"/>
<ItemRef itemRef="3"/>
</Itemset>
<!-- Two rules satisfy the requirements -->
<AssociationRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
<AssociationRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>
</AssociationModel>
</PMML>
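One way to read such a model is to resolve the id references into item values. The sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the AssociationModel element from the example (MiningSchema and Output omitted, no namespace); the helper names `local` and `readable_rules` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example model, without MiningSchema and Output.
MODEL = """
<AssociationModel functionName="associationRules"
    numberOfTransactions="4" numberOfItems="3"
    minimumSupport="0.6" minimumConfidence="0.5"
    numberOfItemsets="3" numberOfRules="2">
  <Item id="1" value="Cracker"/>
  <Item id="2" value="Coke"/>
  <Item id="3" value="Water"/>
  <Itemset id="1" support="1.0" numberOfItems="1"><ItemRef itemRef="1"/></Itemset>
  <Itemset id="2" support="1.0" numberOfItems="1"><ItemRef itemRef="3"/></Itemset>
  <Itemset id="3" support="1.0" numberOfItems="2">
    <ItemRef itemRef="1"/><ItemRef itemRef="3"/>
  </Itemset>
  <AssociationRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
  <AssociationRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>
</AssociationModel>
"""

def local(tag):
    """Strip a '{namespace}' prefix so the code also works on namespaced documents."""
    return tag.rsplit("}", 1)[-1]

def readable_rules(model_xml):
    """Resolve Itemset and Item references and render each rule as text."""
    items, itemsets, rules = {}, {}, []
    for el in ET.fromstring(model_xml):
        name = local(el.tag)
        if name == "Item":
            items[el.get("id")] = el.get("value")
        elif name == "Itemset":
            itemsets[el.get("id")] = [
                items[ref.get("itemRef")] for ref in el if local(ref.tag) == "ItemRef"
            ]
        elif name == "AssociationRule":
            antecedent = " AND ".join(itemsets[el.get("antecedent")])
            consequent = " AND ".join(itemsets[el.get("consequent")])
            rules.append(f"{antecedent} -> {consequent}")
    return rules

print(readable_rules(MODEL))  # ['Cracker -> Water', 'Water -> Cracker']
```

The single pass relies on Items and Itemsets appearing before the rules that use them, which holds for this example.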
Scoring Procedure
The scoring procedure has as input an association model
and an itemset. It determines all rules defined within the input model which
are associated with the input itemset, based on the algorithm specified within a specific
OutputField:
recommendation: For a given input itemset, a rule is selected if its antecedent itemset is a subset
of the input itemset. Note: This could result in recommending item(s) that are included in the input itemset. To recommend items that
are not included in the input itemset, use exclusiveRecommendation (see next).
exclusiveRecommendation: For a given input itemset, a rule is selected if its antecedent itemset is
a subset of the input itemset, and its consequent itemset is not a subset of the input itemset. Note that, by definition, if the
requested output is defined in the output element as the consequent of the selected rule, the output will consist of the entire consequent
itemset. It is left to the consumer to decide whether or not items within the consequent should be excluded from the output if they are
included in the input itemset.
ruleAssociation: For a given input itemset, a rule is selected if its antecedent and consequent
itemsets are included in the input itemset.
Let us consider a sample model with the following rules:
rule 1: Cracker → Water
rule 2: Water → Cracker
rule 3: Cracker → Coke
rule 4: Cracker AND Water → Nachos
rule 5: Water → Pear AND Banana
Now, let's apply the model to the following input
itemsets, using each of the possible algorithms, to see which rules satisfy the
various input itemsets:
Input itemset | recommendation | exclusiveRecommendation | ruleAssociation
#1 - Cracker, Coke | 1, 3 | 1 | 3
#2 - Cracker, Water | 1, 2, 3, 4, 5 | 3, 4, 5 | 1, 2
#3 - Water, Coke | 2, 5 | 2, 5 | null
#4 - Cracker, Water, Coke | 1, 2, 3, 4, 5 | 4, 5 | 1, 2, 3
#5 - Cracker, Water, Banana, Apple | 1, 2, 3, 4, 5 | 3, 4, 5 | 1, 2
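The rule selection shown in the table can be reproduced with a short sketch of the three algorithms, using set inclusion for the subset tests. The rule table and the function name `select_rules` are illustrative assumptions; only the selection step is covered, not ranking or output generation.

```python
# The five sample rules as (antecedent, consequent) pairs of item sets.
RULES = {
    1: ({"Cracker"}, {"Water"}),
    2: ({"Water"}, {"Cracker"}),
    3: ({"Cracker"}, {"Coke"}),
    4: ({"Cracker", "Water"}, {"Nachos"}),
    5: ({"Water"}, {"Pear", "Banana"}),
}

ALGORITHMS = ("recommendation", "exclusiveRecommendation", "ruleAssociation")

def select_rules(input_itemset, algorithm):
    """Return the ids of the rules selected for input_itemset by the given algorithm."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm}")
    selected = []
    for rule_id, (antecedent, consequent) in RULES.items():
        if not antecedent <= input_itemset:
            continue  # all three algorithms require the antecedent to be included
        consequent_included = consequent <= input_itemset
        if (algorithm == "recommendation"
                or (algorithm == "exclusiveRecommendation" and not consequent_included)
                or (algorithm == "ruleAssociation" and consequent_included)):
            selected.append(rule_id)
    return selected

for algorithm in ALGORITHMS:
    print(algorithm, select_rules({"Cracker", "Water", "Banana", "Apple"}, algorithm))
```

An empty list corresponds to the "null" entry in the table.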
For instance, if we apply "exclusiveRecommendation" to input itemset #3,
"Water, Coke", rule 2 is found to match because its antecedent "Water" is a subset of the
input itemset, while its consequent "Cracker" is not. Rule 1 does not match because its antecedent
"Cracker" is not a subset of the input itemset and its consequent "Water" is included in it;
also, rule 4 does not match because its antecedent "Cracker AND Water" is not a subset of the input itemset.
If we apply "exclusiveRecommendation" to input itemset #5, "Cracker, Water, Banana,
Apple", rule 4 is found to match because its antecedent "Cracker AND Water" is a subset of the input itemset, while its
consequent "Nachos" is not. Rule 5 also matches because its antecedent "Water" is a subset of the input itemset,
while its consequent "Pear AND Banana" is not.
Using input itemset #5 again, if we apply "ruleAssociation", rule 2 is found to match
since both the antecedent and consequent are subsets of the input itemset. Rule 5 is not found to match as its consequent
"Pear AND Banana" is not a subset of the input itemset.
See the chapter on Outputs for details on the various
types of outputs that can be returned by Association Rules models.