Sequences

PMML 2.0 -- Sequence Rules

The basic data model consists of sequence objects. Each object is identified by the "Primary Key" and has a number of events attributed to it, each identified by the "Secondary Key". Each event consists of a set of ordered items. An "Order Field" defines the order of the items within an event, with an optional qualifier in the form of an attribute name.

SequenceModel

A Sequence mining model consists of a number of major parts:

<!ELEMENT SequenceModel (Extension*, MiningSchema, Item*, Itemset*, 
                         SetPredicate*, Sequence+, 
                         SequenceRule*, Extension*)>
<!ATTLIST SequenceModel
  modelName                       CDATA          #IMPLIED
  functionName                    %MINING-FUNCTION; #REQUIRED
  algorithmName                   CDATA          #IMPLIED
  numberOfTransactions            %INT-NUMBER;   #REQUIRED
  maxNumberOfItemsPerTransaction  %INT-NUMBER;   #IMPLIED
  avgNumberOfItemsPerTransaction  %REAL-NUMBER;  #IMPLIED
  minimumSupport                  %REAL-NUMBER;  #REQUIRED
  minimumConfidence               %REAL-NUMBER;  #REQUIRED
  lengthLimit                     %INT-NUMBER;   #IMPLIED
  numberOfItems                   %INT-NUMBER;   #REQUIRED
  numberOfSets                    %INT-NUMBER;   #REQUIRED
  numberOfSequences               %INT-NUMBER;   #REQUIRED
  numberOfRules                   %INT-NUMBER;   #REQUIRED
  timeWindowWidth                 %INT-NUMBER;   #IMPLIED
  minimumTime                     %INT-NUMBER;   #IMPLIED
  maximumTime                     %INT-NUMBER;   #IMPLIED
>

Extension provides the capability to extend the content of a model.
MiningSchema lists the fields that are used in this model. This is a subset of the fields as defined in the data dictionary and the transformation dictionary. The transformations in the transformation dictionary will have been carried out on one of the DataField values in the data dictionary, providing new fields for use in the model.
Item is defined in the Association Model.
Itemset is defined in the Association Model.
SetPredicate is a set of predicates made up of simple boolean expressions.
Sequence is an ordered collection of SetPredicates or Itemsets. There will be at least one Sequence.
SequenceRule describes the relationship between two sequences.

Attribute description:

numberOfTransactions : the number of objects in the data (e.g. unique customers or visitors).
maxNumberOfItemsPerTransaction : the maximum number of events (e.g. visits) per object.
avgNumberOfItemsPerTransaction : the average number of events that make up the object.
minimumSupport : the minimum support for a sequence to be discovered.
minimumConfidence : the minimum confidence for a rule to be discovered.
lengthLimit : the maximum length of a sequence to be discovered.
numberOfItems : total number of unique items (e.g. pages on a site).
numberOfSets : total number of sets.
numberOfSequences : total number of sequences discovered.
numberOfRules : total number of rules discovered.
timeWindowWidth : this may be used to separate items associated with an object into discrete events, but only if no clear key already exists for the separate events. Two consecutive items must have a time gap of less than this value to be considered as being part of the same event.
minimumTime : minimum time between items as defined above.
maximumTime : maximum time between items as defined above.
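
As an illustration of the timeWindowWidth rule, the following minimal sketch groups the time stamps of one object's items into events. The function name split_into_events and the sample time stamps are assumptions made for this illustration and are not part of PMML.

```python
def split_into_events(timestamps, time_window_width):
    """Group item time stamps into events.

    Two consecutive items stay in the same event only if the gap
    between them is less than time_window_width.
    """
    events = []
    for t in sorted(timestamps):
        if events and t - events[-1][-1] < time_window_width:
            events[-1].append(t)   # gap is small enough: same event
        else:
            events.append([t])     # first item, or gap too large: new event
    return events


# Hypothetical time stamps for the items of a single object.
events = split_into_events([0, 3, 4, 20, 22, 50], time_window_width=10)
print(events)
```

The comparison is strict because the text requires a gap of less than the window width.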

SetPredicate

<!ENTITY % ELEMENT-ID    "CDATA">

<!ELEMENT SetPredicate ( %STRING-ARRAY; )>
<!ATTLIST SetPredicate
  id               %ELEMENT-ID;              #IMPLIED
  field            %FIELD-NAME;              #REQUIRED
  operator         CDATA                     #FIXED "supersetOf"
>

<!ELEMENT ItemSetReference EMPTY>
<!ATTLIST ItemSetReference 
  itemSetId        %ELEMENT-ID;              #REQUIRED
>
SetPredicate elements consist of a boolean expression. This is made up of a field, a comparison operator, and a value. The value(s) will be written in the form of an array.

Attribute description:

id : An element ID uniquely identifying a predicate set. (Referenced in Sequences by setId.)
field : The subject of the predicate statement. Usually this name refers to one of the DerivedField elements in the TransformationDictionary.
operator : The association between the subject of the predicate statement and the array of values.
Note that a SetPredicate compares two sets while a SimpleSetPredicate (as defined in the tree model) checks membership of a single value in a set.
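
The difference between the two kinds of predicate can be sketched as follows. This is only an illustration: the helper names superset_of and is_member are hypothetical, and the field value is treated as a plain set, so duplicate entries in a multiset are ignored.

```python
def superset_of(field_values, predicate_values):
    """SetPredicate-style test: the field's set must contain every listed value."""
    return set(field_values) >= set(predicate_values)


def is_member(value, predicate_values):
    """SimpleSetPredicate-style test: a single value is looked up in a set."""
    return value in set(predicate_values)


# Hypothetical value of the "transaction" field for one visit.
transaction = ["index.html", "offer.html", "kdnuggets.com"]

holds = superset_of(transaction, ["offer.html", "kdnuggets.com"])
fails = superset_of(transaction, ["offer.html", "basket.html"])
member = is_member("offer.html", ["offer.html", "kdnuggets.com"])
```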

Delimiter & Time

<!ENTITY % DELIMITER   "( sameTimeWindow | acrossTimeWindows )">
<!ENTITY % GAP         "( true | false | unknown )">

<!ELEMENT Delimiter EMPTY>
<!ATTLIST Delimiter
  delimiter   %DELIMITER;  #REQUIRED
  gap         %GAP;        #REQUIRED
>

Delimiter is the separation between two Sets in a Sequence, or between two Sequences in a SequenceRule.

Attribute description:

delimiter : states whether this Set occurred within the same event or time period (as defined by a time window, e.g. a session) as the previous one.
gap : the possible existence of other Sets between this and the previous Set or Sequence. True represents an open sequence, which allows for gaps between Sets or Sequences (as does unknown). In a closed sequence the gap is set to false, indicating that the two Sets or Sequences being described are consecutive in the data.
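
One possible reading of the gap attribute is sketched below. The sketch makes several assumptions that go beyond the text: the data for one object is flattened into an ordered list of item sets, the delimiter attribute (same or different time window) is ignored, and the name find_match is hypothetical. A gap of "false" requires the next Set to match the immediately following step, while "true" and "unknown" allow any later step.

```python
def find_match(steps, sets, gaps):
    """Return True if `sets` occur in order within `steps`, honoring `gaps`.

    steps: ordered list of item sets observed for one object
    sets:  ordered list of item sets required by the sequence
    gaps:  gaps[k] is the gap value between sets[k] and sets[k + 1]
    """
    def match_from(pos, k):
        if not steps[pos] >= sets[k]:          # step must be a superset of the set
            return False
        if k == len(sets) - 1:                 # last set matched
            return True
        if gaps[k] == "false":                 # closed: only the very next step
            candidates = range(pos + 1, min(pos + 2, len(steps)))
        else:                                  # "true" or "unknown": any later step
            candidates = range(pos + 1, len(steps))
        return any(match_from(nxt, k + 1) for nxt in candidates)

    return any(match_from(p, 0) for p in range(len(steps)))


# Hypothetical click stream for one visitor.
steps = [{"index.html"}, {"products.html"}, {"about.html"},
         {"basket.html"}, {"checkout.html"}]
sets = [{"products.html"}, {"basket.html"}, {"checkout.html"}]

open_match = find_match(steps, sets, ["true", "false"])
closed_match = find_match(steps, sets, ["false", "false"])
```

With the open gap the intervening about.html step is tolerated; with both gaps closed the same data no longer matches.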

<!ELEMENT Time EMPTY>
<!ATTLIST Time
  min       %NUMBER;  #REQUIRED
  max       %NUMBER;  #REQUIRED
  mean      %NUMBER;  #IMPLIED
>

Attribute description:

min : the minimum time between Sets in a Sequence (or between an antecedent and consequent Sequence in a Rule).
max : the maximum time between Sets in a Sequence (or between an antecedent and consequent Sequence in a Rule).
mean : the mean time between Sets in a Sequence (or between an antecedent and consequent Sequence in a Rule).


Sequence

<!ENTITY % FOLLOW_SET  "(Delimiter, SetReference )" >

<!ELEMENT Sequence ( SetReference, (%FOLLOW_SET;)* )>
<!ATTLIST Sequence
  id               %ELEMENT-ID;   #REQUIRED
  numberOfSets     %INT-NUMBER;   #IMPLIED
  occurrence       %INT-NUMBER;   #IMPLIED
  support          %REAL-NUMBER;  #IMPLIED
>

Each Sequence consists of a SetReference and optional FOLLOW_SET(s). A FOLLOW_SET is another SetReference preceded by a delimiter.

Attribute description:

id : the unique ID of this sequence. (Referenced in SequenceRules by seqId).
numberOfSets : the number of SetPredicates and/or Itemsets in this sequence.
occurrence : the number of objects in the data for which this sequence holds true.
support : the ratio of the number of objects in the data for which this sequence holds true, to the total number of objects in the data.
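
The occurrence and support attributes can be derived from the data as in the following minimal sketch. The function name occurrence_and_support, the holds callback, and the sample visitors are assumptions made for this illustration.

```python
def occurrence_and_support(objects, holds):
    """Count the objects for which a sequence holds and derive its support.

    objects: one entry per object (e.g. per visitor)
    holds:   function returning True if the sequence holds for an object
    """
    occurrence = sum(1 for obj in objects if holds(obj))
    support = occurrence / len(objects)
    return occurrence, support


# Hypothetical data: the pages seen by each of five visitors.
visitors = [
    ["index.html", "offer.html"],
    ["index.html"],
    ["products.html"],
    ["index.html", "basket.html"],
    ["index.html", "checkout.html"],
]

occurrence, support = occurrence_and_support(
    visitors, lambda pages: "index.html" in pages)
```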

<!ELEMENT SetReference EMPTY >
<!ATTLIST SetReference
  setId     %ELEMENT-ID; #REQUIRED
>

The SetReference refers (or points) to a previously defined set. That set will be either a SetPredicate or an Itemset (which will contain ItemRef elements).

Attribute description:

setId : a pointer to the id attribute of a SetPredicate or Itemset.
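
How a reader of the model might resolve SetReference pointers is sketched below using Python's standard xml.etree.ElementTree module. The fragment is a cut-down model written for this illustration. The sketch handles only SetPredicate targets (not Itemsets) and assumes that array values contain no embedded whitespace.

```python
import xml.etree.ElementTree as ET

# Cut-down, hypothetical model fragment (not a complete PMML document).
FRAGMENT = """
<SequenceModel>
  <SetPredicate id="sp001" field="transaction" operator="supersetOf">
    <Array n="1" type="string"> index.html </Array>
  </SetPredicate>
  <SetPredicate id="sp002" field="transaction" operator="supersetOf">
    <Array n="2" type="string"> offer.html kdnuggets.com </Array>
  </SetPredicate>
  <Sequence id="seq001" numberOfSets="2">
    <SetReference setId="sp001"/>
    <Delimiter delimiter="sameTimeWindow" gap="unknown"/>
    <SetReference setId="sp002"/>
  </Sequence>
</SequenceModel>
"""

model = ET.fromstring(FRAGMENT)

# Index every SetPredicate by its id; the Array text is split on whitespace.
sets_by_id = {
    sp.get("id"): sp.find("Array").text.split()
    for sp in model.iter("SetPredicate")
}

# Follow each setId pointer of the sequence, in document order.
sequence = model.find("Sequence")
resolved = [sets_by_id[ref.get("setId")]
            for ref in sequence.findall("SetReference")]
print(resolved)
```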



Sequence Rules

<!ELEMENT SequenceRule ( Extension*, AntecedentSequence, Delimiter, 
       Time*, ConsequentSequence )>
<!ATTLIST SequenceRule
  id                %ELEMENT-ID;  #REQUIRED
  numberOfSets      %INT-NUMBER;  #REQUIRED
  occurrence        %INT-NUMBER;  #REQUIRED
  support           %REAL-NUMBER; #REQUIRED
  confidence        %REAL-NUMBER; #REQUIRED
>

A Sequence Rule consists of an antecedent Sequence and a consequent Sequence, separated by a delimiter and, possibly, time.

Attribute description:

id : the unique ID of this sequence rule.
numberOfSets : the total number of sets in both the antecedent and consequent Sequences.
occurrence : the number of objects in the data for which the antecedent and consequent Sequences hold true.
support : the ratio of the number of objects in the data for which the antecedent and consequent Sequences hold true, to the total number of objects in the data.
confidence : the probability of the consequent following the antecedent. Calculated as the number of occurrences of the rule (objects for which both the antecedent and consequent Sequences hold true) divided by the number of occurrences of the antecedent Sequence.
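
The two ratios can be checked with a few lines of arithmetic. The sketch below uses the figures from the example at the end of this section (rule occurrence 20, antecedent occurrence 80, 100 objects); the function names are assumptions made for this illustration.

```python
def rule_support(rule_occurrence, number_of_transactions):
    """Share of all objects for which antecedent and consequent both hold."""
    return rule_occurrence / number_of_transactions


def rule_confidence(rule_occurrence, antecedent_occurrence):
    """Share of the antecedent's objects for which the consequent also holds."""
    return rule_occurrence / antecedent_occurrence


support = rule_support(20, 100)       # rule occurrence / numberOfTransactions
confidence = rule_confidence(20, 80)  # rule occurrence / antecedent occurrence
```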



Antecedent & Consequent Sequences

<!ENTITY % SEQUENCE  "( SequenceReference, Time* )" >

<!ELEMENT SequenceReference EMPTY >
<!ATTLIST SequenceReference
  seqId    %ELEMENT-ID; #REQUIRED
>

<!ELEMENT AntecedentSequence (%SEQUENCE;) >
<!ELEMENT ConsequentSequence (%SEQUENCE;) >

Attribute description:

seqId : a pointer to the id attribute of a previously defined sequence.



Example

The example below represents the following scenario:

Visitors that come from { index.html } will do the following (with a confidence of 0.25):
Visit { offer.html, kdnuggets.com } in the same visit, and without visiting another site;
Within 2 days, return to { products.html } without visiting other sites in between;
Visit { basket.html } after visiting at least one site beforehand;
Finally, go directly to { checkout.html }.

<?xml version="1.0" ?>
<PMML version="2.0">
 <Header copyright="DMG.org" description="example model for sequences"/>
 <DataDictionary numberOfFields="4">
   <DataField name="visitor" optype="categorical"/>
   <DataField name="visit" optype="categorical"/>
   <DataField name="time" optype="categorical"/>
   <DataField name="page" optype="categorical"/>
 </DataDictionary>

 <TransformationDictionary>
   <DerivedField name="transaction">
     <Aggregate field="page" function="multiset" groupField="visit"/>
   </DerivedField>
 </TransformationDictionary>

 <SequenceModel functionName="sequences"
     numberOfTransactions="100" minimumSupport="0.20"
     minimumConfidence="0.25" numberOfItems="6"
     numberOfSets="5" numberOfSequences="3" numberOfRules="1">
 
  <MiningSchema>
    <MiningField name="visitor" usageType="supplementary"/>
    <MiningField name="visit" usageType="active"/>
    <MiningField name="time" usageType="active">
      <Extension name="unit" value="days"/>
    </MiningField>
    <MiningField name="page" usageType="active"/>
  </MiningSchema>

  <!-- ========== Predicates ========== -->
  <SetPredicate id="sp001" field="transaction" operator="supersetOf">
   <Array n="1" type="string"> index.html </Array>
  </SetPredicate>

  <SetPredicate id="sp002" field="transaction" operator="supersetOf">
   <Array n="2" type="string"> offer.html kdnuggets.com </Array>
  </SetPredicate>

  <SetPredicate id="sp003" field="transaction" operator="supersetOf">
   <Array n="1" type="string"> products.html </Array>
  </SetPredicate>

  <SetPredicate id="sp004" field="transaction" operator="supersetOf">
   <Array n="1" type="string"> basket.html </Array>
  </SetPredicate>

  <SetPredicate id="sp005" field="transaction" operator="supersetOf">
   <Array n="1" type="string"> checkout.html </Array>
  </SetPredicate>

  <!-- ========== Sequences ========== -->
  <Sequence id="seq001" numberOfSets="1" occurrence="80" support="0.80">
    <SetReference setId="sp001"/>
  </Sequence>

  <Sequence id="seq002" numberOfSets="4" occurrence="40" support="0.40">
    <SetReference setId="sp002"/>
    <Delimiter delimiter="acrossTimeWindows" gap="false"/>
    <SetReference setId="sp003"/>
    <Delimiter delimiter="sameTimeWindow" gap="true"/>
    <SetReference setId="sp004"/>
    <Delimiter delimiter="sameTimeWindow" gap="false"/>
    <SetReference setId="sp005"/>
  </Sequence>

  <Sequence id="seq003" numberOfSets="5" occurrence="20" support="0.20">
    <SetReference setId="sp001"/>
    <Delimiter delimiter="sameTimeWindow" gap="unknown"/>
    <SetReference setId="sp002"/>
    <Delimiter delimiter="acrossTimeWindows" gap="false"/>
    <SetReference setId="sp003"/>
    <Delimiter delimiter="sameTimeWindow" gap="true"/>
    <SetReference setId="sp004"/>
    <Delimiter delimiter="sameTimeWindow" gap="false"/>
    <SetReference setId="sp005"/>
  </Sequence>

  <!-- ========== SequenceRules ========== -->
  <SequenceRule id="rule001" numberOfSets="5" occurrence="20" 
        support="0.20" confidence="0.25">
   <Extension name="qWeight" value="0.5"/>
   <Extension name="attrWeight" value="0.5"/>
   <Extension name="seqWeight" value="0.5"/>

   <AntecedentSequence>
    <SequenceReference seqId="seq001"/>
   </AntecedentSequence>

   <Delimiter delimiter="sameTimeWindow" gap="unknown"/>
   <Time min="0" max="0"/>

   <ConsequentSequence>
    <SequenceReference seqId="seq002"/>
    <Time min="0" max="2"/> 
        <!-- time between "sp002" and "sp003" in sequence "seq002" -->
    <Time min="0" max="0"/>
        <!-- time between "sp003" and "sp004" in sequence "seq002" -->
    <Time min="0" max="0"/>
        <!-- time between "sp004" and "sp005" in sequence "seq002" -->
   </ConsequentSequence>

  </SequenceRule>

 </SequenceModel>
</PMML>

e-mail info at dmg.org