PMML 4.3 - Association Rules

The Association Rule model represents rules in which one set of items is associated with another set of items. For example, a rule can express that a certain product or set of products is often bought in combination with a certain set of other products; this kind of analysis is also known as Market Basket Analysis. An Association Rule model typically has two variables: one for grouping records together into transactions (usageType="group") and another that uniquely identifies each record (usageType="active"). Alternatively, association rule models can be built on regular data, where each category of each categorical field is an item. Yet another possible data format is a table with true/false values, where only the fields having a true value in a record are considered valid items.

Here are examples of data sources that can be used to build an association rules model. A transaction-style data source:

RowNumber TransactionId Item
1 10001 Water
2 10001 Bread
3 10002 Cracker
4 10002 Coke
5 10002 Bread

This example data has one field for the transaction ID and one field that contains the items. It holds two transactions: one containing the items "Water" and "Bread", the other containing "Cracker", "Coke", and "Bread".
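As a minimal sketch, grouping transaction-style records into transactions could look like this in Python. The rows mirror the two transactions described above; the function name `build_transactions` is made up for illustration and is not part of PMML.

```python
from collections import defaultdict

# Rows of (TransactionId, Item), matching the two transactions described above.
rows = [
    ("10001", "Water"),
    ("10001", "Bread"),
    ("10002", "Cracker"),
    ("10002", "Coke"),
    ("10002", "Bread"),
]

def build_transactions(rows):
    """Group items by transaction id (the usageType="group" field)."""
    transactions = defaultdict(set)
    for transaction_id, item in rows:
        transactions[transaction_id].add(item)
    return dict(transactions)

transactions = build_transactions(rows)
```

Using sets means an item repeated within one transaction is counted once, which is an assumption of this sketch rather than a statement of the specification.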

Association models can also be built on data in true/false form or on regular categorical data. These two types of data can also be present in the same data source. Here is an example:

RowNumber Water Cracker Coke Bread Area Day Day=Weekday
1 false false true true urban Monday true
2 true false true false rural Tuesday true
3 false true true true urban Sunday false
4 false true true true suburban Sunday false
5 false false true true suburban Friday true

In this example, the fields "Water", "Cracker", "Coke", "Bread", and "Day=Weekday" are true/false fields, while "Area" and "Day" are regular categorical fields. In this setup every row of data corresponds to one transaction, and false values of the true/false fields do not contribute to the transaction. Possible items in a transaction created from this data are: "Water", "Cracker", "Coke", "Bread", "Day=Weekday", "Area=urban", "Area=suburban", "Day=Monday", etc. The item value is the field name when the field has only true/false values, and the field name followed by "=" and the field value (category) when the field is a regular categorical one, such as "Area" and "Day" in the example.
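The item naming convention just described can be sketched in Python as follows. This is only an illustration: the function `row_to_items` is hypothetical, and it assumes that true/false fields arrive as Python booleans while all other fields are regular categorical fields.

```python
def row_to_items(row):
    """Turn one record into a set of item values.

    Assumption: true/false fields hold Python booleans; every other
    field is treated as a regular categorical field.
    """
    items = set()
    for field, value in row.items():
        if isinstance(value, bool):
            if value:  # false values do not contribute an item
                items.add(field)
        else:
            items.add(f"{field}={value}")
    return items

# Row 1 of the example table above.
row1 = {
    "Water": False, "Cracker": False, "Coke": True, "Bread": True,
    "Area": "urban", "Day": "Monday", "Day=Weekday": True,
}
items = row_to_items(row1)
```

For row 1 this yields the items "Coke", "Bread", "Area=urban", "Day=Monday", and "Day=Weekday".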

The attribute definitions of the association rule model use the entity ELEMENT-ID to express the semantic constraint that a value must be unique within the set of elements of the same type contained in the same XML document.

An Association Rule model consists of four major parts:

• Model attributes
• Items
• ItemSets
• AssociationRules
```<xs:element name="AssociationModel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="MiningSchema"/>
<xs:element ref="Output" minOccurs="0"/>
<xs:element ref="ModelStats" minOccurs="0"/>
<xs:element ref="LocalTransformations" minOccurs="0"/>
<xs:element ref="Item" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Itemset" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="AssociationRule" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="ModelVerification" minOccurs="0"/>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="modelName" type="xs:string"/>
<xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
<xs:attribute name="algorithmName" type="xs:string"/>
<xs:attribute name="numberOfTransactions" type="INT-NUMBER" use="required"/>
<xs:attribute name="maxNumberOfItemsPerTA" type="INT-NUMBER"/>
<xs:attribute name="avgNumberOfItemsPerTA" type="REAL-NUMBER"/>
<xs:attribute name="minimumSupport" type="PROB-NUMBER" use="required"/>
<xs:attribute name="minimumConfidence" type="PROB-NUMBER" use="required"/>
<xs:attribute name="lengthLimit" type="INT-NUMBER"/>
<xs:attribute name="numberOfItems" type="INT-NUMBER" use="required"/>
<xs:attribute name="numberOfItemsets" type="INT-NUMBER" use="required"/>
<xs:attribute name="numberOfRules" type="INT-NUMBER" use="required"/>
<xs:attribute name="isScorable" type="xs:boolean" default="true"/>
</xs:complexType>
</xs:element>
```

An AssociationModel can contain any number of Itemsets and AssociationRules. Note, however, that all Itemsets must be listed before any of the rules.

Here is a description of the attributes:

numberOfTransactions: The number of transactions contained in the input data.

maxNumberOfItemsPerTA: The number of items contained in the largest transaction.

avgNumberOfItemsPerTA: The average number of items contained in a transaction.

minimumSupport: The minimum relative support value (#supporting transactions / #total transactions) satisfied by all rules.

minimumConfidence: The minimum confidence value satisfied by all rules. Confidence is calculated as (support (rule) / support(antecedent)).

lengthLimit: The maximum number of items contained in a rule which was used to limit the number of rules.

numberOfItems: The number of different items contained in the input data. This number may be greater than or equal to the number of items contained in the model. The value will be greater if any items in the input data are excluded from the model, as a consequence of not being referenced by the model.

numberOfItemsets: The number of itemsets contained in the model.

numberOfRules: The number of rules contained in the model.

isScorable: This attribute indicates if the model is valid for scoring. If this attribute is true or if it is missing, then the model should be processed normally. However, if the attribute is false, then the model producer has indicated that this model is intended for information purposes only and should not be used to generate results. In order to be valid PMML, all required elements and attributes must be present, even for non-scoring models. For more details, see General Structure.

We consider items next:

```<xs:element name="Item">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string" use="required"/>
<xs:attribute name="value" type="xs:string" use="required"/>
<xs:attribute name="field" type="FIELD-NAME"/>
<xs:attribute name="category" type="xs:string"/>
<xs:attribute name="mappedValue" type="xs:string"/>
<xs:attribute name="weight" type="REAL-NUMBER"/>
</xs:complexType>
</xs:element>
```

Here is a description of the attributes in an item:

id: An identification to uniquely identify an item.

value: The value of the item as it appears in the input data, as described at the beginning of this document.

mappedValue: Optional, a value to which the original item value is mapped. For instance, this could be a product name if the original value is an EAN code.

weight: The weight of the item. For example, the price or value of an item.

PMML 4.3 adds optional attributes field and category that help to link the items to the data without ambiguity presented in the single string value.

The id of an Item must be unique. Furthermore, the Item values must be unique; if they are not, the attributes field and category must distinguish them. That is, an AssociationModel must not contain different instances of Item whose value, field, and category attributes are all the same. The entries in mappedValue may be the same, though. Here are some examples of Items:
```<Item id="1" value="Water" field="Item" category="Water"/>
<Item id="2" value="Cracker" field="Cracker" category="true"/>
<Item id="3" value="Day=Weekday" field="Day=Weekday" category="true"/>
<Item id="4" value="Day=Monday" field="Day" category="Monday"/>
```

Here item 1 is an example of items created based on the first data representation above. In this case the field name is not used in item value at all since all items come from the same field, so the item value is just the category value. This was sufficient when only the first representation of data was used. Items 2 and 3 are examples for the data in true/false form, where the names of the fields with true values are used as item values. Item 4 presents a case where the item value is built from the field name and the category, which happens when the data has regular form, like in the second example of data above. Note that when the name of a field contains "=", the item value alone may be ambiguous. For example, without knowing the set of categories for field "Day" in the second data example, we would not be able to decide if item "Day=Weekday" represents field "Day=Weekday" with a true value or field "Day" with value "Weekday".

All names of the fields used in items must appear as "active" in MiningSchema.
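The uniqueness rules for Item elements stated above can be checked mechanically. The following Python sketch is one possible way to do so; the tuple layout `(id, value, field, category)` and the function `find_duplicates` are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical item records as (id, value, field, category) tuples,
# taken from the Item examples above.
items = [
    ("1", "Water", "Item", "Water"),
    ("2", "Cracker", "Cracker", "true"),
    ("3", "Day=Weekday", "Day=Weekday", "true"),
    ("4", "Day=Monday", "Day", "Monday"),
]

def find_duplicates(items):
    """Return duplicated ids and duplicated (value, field, category) keys."""
    id_counts = Counter(item_id for item_id, *_ in items)
    key_counts = Counter(
        (value, field, category) for _, value, field, category in items
    )
    dup_ids = sorted(k for k, n in id_counts.items() if n > 1)
    dup_keys = sorted(k for k, n in key_counts.items() if n > 1)
    return dup_ids, dup_keys

# Same value as item 3, but field and category tell the two apart.
disambiguated = items + [("5", "Day=Weekday", "Day", "Weekday")]
```

With the extra item, the value "Day=Weekday" occurs twice, but no (value, field, category) combination repeats, so the check reports no violation.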

We consider itemsets next:

```<xs:element name="Itemset">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element minOccurs="0" maxOccurs="unbounded" ref="ItemRef"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string" use="required"/>
<xs:attribute name="support" type="PROB-NUMBER"/>
<xs:attribute name="numberOfItems" type="xs:nonNegativeInteger"/>
</xs:complexType>
</xs:element>
```

Here is a description of the attributes in an Itemset:

id: An identification to uniquely identify an Itemset.

support: The relative support of the Itemset.
support(set) = (number of transactions containing the set) / (total number of transactions)

numberOfItems: The number of Items contained in this Itemset.

ItemRef: Item references point to elements of type Item and are defined as:

```<xs:element name="ItemRef">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="itemRef" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
```

Here is a description of the attributes in an ItemRef:

itemRef: Contains the identification of an item.
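The relative support defined for an Itemset above can be computed directly from a list of transactions. The Python sketch below illustrates the formula; the function name `support` and the sample transactions are chosen for this example only.

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing every item of the set."""
    itemset = set(itemset)
    if not transactions:
        raise ValueError("at least one transaction is required")
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# The two transactions from the transaction-style example data.
transactions = [
    {"Water", "Bread"},
    {"Cracker", "Coke", "Bread"},
]

bread_support = support({"Bread"}, transactions)
water_bread_support = support({"Water", "Bread"}, transactions)
```

Here "Bread" occurs in both transactions, while "Water" and "Bread" occur together in only one of the two.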

We consider association rules of the form "<antecedent itemset> => <consequent itemset>" next:

```<xs:element name="AssociationRule">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="antecedent" type="xs:string" use="required"/>
<xs:attribute name="consequent" type="xs:string" use="required"/>
<xs:attribute name="support" type="PROB-NUMBER" use="required"/>
<xs:attribute name="confidence" type="PROB-NUMBER" use="required"/>
<xs:attribute name="lift" type="xs:float" use="optional"/>
<xs:attribute name="leverage" type="xs:float" use="optional"/>
<xs:attribute name="affinity" type="PROB-NUMBER" use="optional"/>
<xs:attribute name="id" type="xs:string" use="optional"/>
</xs:complexType>
</xs:element>
```

Here is a description of the attributes in an AssociationRule (note that the formulae listed below must hold true for the attributes included in the rule):

antecedent: The id value of the itemset which is the antecedent of the rule. We represent the itemset by the letter A.

consequent: The id value of the itemset which is the consequent of the rule. We represent the itemset by the letter C.

support: The support of the rule, that is, the relative frequency of transactions that contain A and C.
support(A->C) = support(A+C)

confidence: The confidence of the rule.
confidence(A->C) = support(A+C) / support(A)

lift: A very popular measure of interestingness of a rule is lift. Lift values greater than 1.0 indicate that transactions containing A tend to contain C more often than transactions that do not contain A.
lift(A->C) = confidence(A->C) / support(C)

leverage: Another measure of interestingness is leverage. An association with higher frequency and lower lift may be more interesting than an alternative rule with lower frequency and higher lift. The former can be more important in practice because it applies to more cases. The value is the difference between the observed frequency of A+C and the frequency that would be expected if A and C were independent.
leverage(A->C) = support(A->C) - support(A)*support(C)

affinity: Also known as Jaccard Similarity, affinity is a measure of the transactions that contain both the antecedent and consequent (intersect) compared to those that contain the antecedent or the consequent (union).
affinity(A->C) = support(A+C) / [ support(A) + support(C) - support(A+C)]

id: An identification to uniquely identify an association rule.

These statistics and their calculation are described visually in the chart below:

Figure 1: Association Rules Statistics Diagram
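The formulae for the AssociationRule attributes can be collected into one small Python function. This is a sketch for illustration: the function name `rule_metrics` and its parameter names are not defined by PMML, and the input values in the usage line are invented.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Compute rule statistics from the relative supports of A, C and A+C."""
    if support_a <= 0 or support_c <= 0:
        raise ValueError("support(A) and support(C) must be positive")
    confidence = support_ac / support_a
    return {
        "support": support_ac,
        "confidence": confidence,
        "lift": confidence / support_c,
        "leverage": support_ac - support_a * support_c,
        "affinity": support_ac / (support_a + support_c - support_ac),
    }

# Invented example: A occurs in 50% of transactions, C in 25%, both in 25%.
metrics = rule_metrics(support_a=0.5, support_c=0.25, support_ac=0.25)
```

For these inputs the confidence is 0.5, the lift is 2.0, the leverage is 0.125, and the affinity is 0.5.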

Here is an example of an association model:

```<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
<Header/>
<DataDictionary numberOfFields="2">
<DataField name="transaction" optype="categorical" dataType="string"/>
<DataField name="item" optype="categorical" dataType="string"/>
</DataDictionary>
<AssociationModel functionName="associationRules" numberOfTransactions="4" numberOfItems="3" minimumSupport="0.6" minimumConfidence="0.5" numberOfItemsets="3" numberOfRules="2">
<MiningSchema>
<MiningField name="transaction" usageType="group"/>
<MiningField name="item" usageType="active"/>
</MiningSchema>

<Output>
<!-- There are nine outputs defined for this model -->
<!-- that return the top three highest confidence  -->
<!-- "exclusiveRecommendation" results (selecting  -->
<!-- rules where the items in the input itemset    -->
<!-- appear in the antecedent but do not appear in -->
<!-- the consequent). For each of these three      -->
<!-- rules, there are three available outputs:     -->
<!--   rule:       for example, "Cracker -> Water" -->
<!--   consequent: for example, "Water"            -->
<!--   entityId:   for example, 1                  -->
<OutputField name="Rule (Highest Confidence)" rankBasis="confidence" rank="1" algorithm="exclusiveRecommendation" feature="rule" dataType="string" optype="categorical"/>
<OutputField name="Recommendation (Highest Confidence)" rankBasis="confidence" rank="1" algorithm="exclusiveRecommendation" feature="consequent" dataType="string" optype="categorical"/>
<OutputField name="Rule Id (Highest Confidence)" rankBasis="confidence" rank="1" algorithm="exclusiveRecommendation" feature="entityId" dataType="double" optype="continuous"/>
<OutputField name="Rule (2nd Highest Confidence)" rankBasis="confidence" rank="2" algorithm="exclusiveRecommendation" feature="rule" dataType="string" optype="categorical"/>
<OutputField name="Recommendation (2nd Highest Confidence)" rankBasis="confidence" rank="2" algorithm="exclusiveRecommendation" feature="consequent" dataType="string" optype="categorical"/>
<OutputField name="Rule Id (2nd Highest Confidence)" rankBasis="confidence" rank="2" algorithm="exclusiveRecommendation" feature="entityId" dataType="double" optype="continuous"/>
<OutputField name="Rule (3rd Highest Confidence)" rankBasis="confidence" rank="3" algorithm="exclusiveRecommendation" feature="rule" dataType="string" optype="categorical"/>
<OutputField name="Recommendation (3rd Highest Confidence)" rankBasis="confidence" rank="3" algorithm="exclusiveRecommendation" feature="consequent" dataType="string" optype="categorical"/>
<OutputField name="Rule Id (3rd Highest Confidence)" rankBasis="confidence" rank="3" algorithm="exclusiveRecommendation" feature="entityId" dataType="double" optype="continuous"/>
</Output>

<!-- We have three items in our input data -->
<Item id="1" value="Cracker"/>
<Item id="2" value="Coke"/>
<Item id="3" value="Water"/>

<!-- and two frequent itemsets with a single item -->

<Itemset id="1" support="1.0" numberOfItems="1">
<ItemRef itemRef="1"/>
</Itemset>

<Itemset id="2" support="1.0" numberOfItems="1">
<ItemRef itemRef="3"/>
</Itemset>

<!-- and one frequent itemset with two items. -->

<Itemset id="3" support="1.0" numberOfItems="2">
<ItemRef itemRef="1"/>
<ItemRef itemRef="3"/>
</Itemset>

<!-- Two rules satisfy the requirements -->

<AssociationRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>

<AssociationRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>

</AssociationModel>
</PMML>
```
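To see how the id references in such a model fit together, the following Python sketch reads a trimmed copy of the example with the standard library's `xml.etree.ElementTree` and resolves each rule to its item values. The helper `read_rules` is hypothetical, and the trimmed document omits everything except the items, itemsets, and rules, so it is not schema-valid PMML; ElementTree does not validate against the schema.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_3"}

# Trimmed copy of the example model (only Item, Itemset and AssociationRule kept).
PMML_TEXT = """
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <AssociationModel functionName="associationRules" numberOfTransactions="4"
      numberOfItems="3" minimumSupport="0.6" minimumConfidence="0.5"
      numberOfItemsets="3" numberOfRules="2">
    <Item id="1" value="Cracker"/>
    <Item id="2" value="Coke"/>
    <Item id="3" value="Water"/>
    <Itemset id="1" support="1.0" numberOfItems="1"><ItemRef itemRef="1"/></Itemset>
    <Itemset id="2" support="1.0" numberOfItems="1"><ItemRef itemRef="3"/></Itemset>
    <Itemset id="3" support="1.0" numberOfItems="2">
      <ItemRef itemRef="1"/><ItemRef itemRef="3"/>
    </Itemset>
    <AssociationRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
    <AssociationRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>
  </AssociationModel>
</PMML>
"""

def read_rules(pmml_text):
    """Return each rule as (antecedent item values, consequent item values)."""
    model = ET.fromstring(pmml_text).find("p:AssociationModel", NS)
    # Item id -> item value
    items = {i.get("id"): i.get("value") for i in model.findall("p:Item", NS)}
    # Itemset id -> list of item values, resolved through ItemRef
    itemsets = {
        s.get("id"): [items[r.get("itemRef")] for r in s.findall("p:ItemRef", NS)]
        for s in model.findall("p:Itemset", NS)
    }
    # antecedent and consequent hold Itemset ids, not Item ids
    return [
        (itemsets[r.get("antecedent")], itemsets[r.get("consequent")])
        for r in model.findall("p:AssociationRule", NS)
    ]

rules = read_rules(PMML_TEXT)
```

The point to note is that antecedent and consequent refer to Itemset ids, so antecedent="1" consequent="2" resolves to "Cracker -> Water", not "Cracker -> Coke".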

Scoring Procedure

The scoring procedure has as input an association model and an itemset. The way the itemset is created from the input data depends on the types of input data. See the description and examples for Item element for details. The scoring determines all rules defined within the input model which are associated with the input itemset, based on the algorithm specified within a specific OutputField:

recommendation: For a given input itemset, a rule is selected if its antecedent itemset is a subset of the input itemset. Note: This could result in recommending item(s) that are included in the input itemset. To recommend items that are not included in the input itemset, use exclusiveRecommendation (see next).

exclusiveRecommendation: For a given input itemset, a rule is selected if its antecedent itemset is a subset of the input itemset, and its consequent itemset is not a subset of the input itemset. Note that, by definition, if the requested output is defined in the output element as the consequent of the selected rule, the output will consist of the entire consequent itemset. It is left to the consumer to decide whether or not items within the consequent should be excluded from the output if they are included in the input itemset.

ruleAssociation: For a given input itemset, a rule is selected if its antecedent and consequent itemsets are included in the input itemset.

Let us consider a sample model with the following rules:

rule 1: Cracker -> Water
rule 2: Water -> Cracker
rule 3: Cracker -> Coke
rule 4: Cracker AND Water -> Nachos
rule 5: Water -> Pear AND Banana

Now, let's apply the model to the following input itemsets, using each of the possible algorithms, to see which rules satisfy the various input itemsets:

| Input itemset | recommendation | exclusiveRecommendation | ruleAssociation |
| --- | --- | --- | --- |
| #1 - Cracker, Coke | 1, 3 | 1 | 3 |
| #2 - Cracker, Water | 1, 2, 3, 4, 5 | 3, 4, 5 | 1, 2 |
| #3 - Water, Coke | 2, 5 | 2, 5 | null |
| #4 - Cracker, Water, Coke | 1, 2, 3, 4, 5 | 4, 5 | 1, 2, 3 |
| #5 - Cracker, Water, Banana, Apple | 1, 2, 3, 4, 5 | 3, 4, 5 | 1, 2 |

For instance, if we apply "exclusiveRecommendation" to input itemset #3, "Water, Coke", rule 2 is found to match because its antecedent "Water" is a subset of the input itemset, while its consequent "Cracker" is not. Rule 1 does not match because its antecedent "Cracker" is not a subset of the input itemset (in addition, its consequent "Water" is already included in the input itemset); likewise, rule 4 does not match because its antecedent "Cracker AND Water" is not a subset of the input itemset.

If we apply "exclusiveRecommendation" to input itemset #5, "Cracker, Water, Banana, Apple", rule 4 is found to match because its antecedent "Cracker AND Water" is a subset of the input itemset, while its consequent "Nachos" is not. Rule 5 also matches because its antecedent "Water" is a subset of the input itemset, while its consequent "Pear AND Banana" is not: "Banana" is present, but "Pear" is missing.

Using input itemset #5 again, if we apply "ruleAssociation", rule 2 is found to match since both the antecedent and consequent are subsets of the input itemset. Rule 5 is not found to match as its consequent "Pear AND Banana" is not a subset of the input itemset.
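The three selection algorithms can be expressed compactly with set operations. The Python sketch below encodes the five sample rules and reproduces the selections listed in the table; the names `RULES` and `select_rules` are made up for this illustration, and ranking by rankBasis is not covered.

```python
# The sample rules above as (antecedent, consequent) item sets, keyed by rule number.
RULES = {
    1: ({"Cracker"}, {"Water"}),
    2: ({"Water"}, {"Cracker"}),
    3: ({"Cracker"}, {"Coke"}),
    4: ({"Cracker", "Water"}, {"Nachos"}),
    5: ({"Water"}, {"Pear", "Banana"}),
}

def select_rules(input_items, algorithm, rules=RULES):
    """Return the numbers of the rules selected for an input itemset."""
    input_items = set(input_items)
    selected = []
    for number, (antecedent, consequent) in sorted(rules.items()):
        antecedent_in = antecedent <= input_items  # antecedent is a subset
        consequent_in = consequent <= input_items  # consequent is a subset
        if algorithm == "recommendation":
            match = antecedent_in
        elif algorithm == "exclusiveRecommendation":
            match = antecedent_in and not consequent_in
        elif algorithm == "ruleAssociation":
            match = antecedent_in and consequent_in
        else:
            raise ValueError(f"unknown algorithm: {algorithm}")
        if match:
            selected.append(number)
    return selected

# Input itemset #5 from the table.
itemset_5 = {"Cracker", "Water", "Banana", "Apple"}
exclusive_5 = select_rules(itemset_5, "exclusiveRecommendation")
```

For input itemset #5 the exclusive selection contains rules 3, 4, and 5, as in the table. An empty list corresponds to the "null" entry shown for input itemset #3 under ruleAssociation.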

See the chapter on Outputs for details on the various types of outputs that can be returned by Association Rules models.
