PMML 3.0 - Clustering Models

PMML defines two classes of clustering models: center-based and distribution-based. Both use the element ClusteringModel as the top-level type, and they share many other element types.

A cluster model basically consists of a set of clusters. For each cluster a center vector can be given. In center-based models a cluster is defined by a vector of center coordinates. Some distance measure is used to determine the nearest center, that is the nearest cluster for a given input record. For distribution-based models (e.g., in demographic clustering) the clusters are defined by their statistics. Some similarity measure is used to determine the best matching cluster for a given record. The center vectors then only approximate the clusters.

The model must contain information on the distance or similarity measure used for clustering. It may also contain information on the overall data distribution, such as a covariance matrix or other statistics. Names of coordinates in CenterFields, ClusteringFields and statistics must be consistent with the names of the fields in the data dictionary and in the transformation dictionary.


  <xs:element name="ClusteringModel">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="MiningSchema"/>
        <xs:element ref="Output" minOccurs="0" />
        <xs:element ref="ModelStats" minOccurs="0"/>
        <xs:element ref="LocalTransformations" minOccurs="0" />
        <xs:element ref="ComparisonMeasure"/>
        <xs:element ref="ClusteringField" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="CenterFields" minOccurs="0"/>
        <xs:element ref="MissingValueWeights" minOccurs="0"/>
        <xs:element ref="Cluster" maxOccurs="unbounded"/>
        <xs:element ref="ModelVerification" minOccurs="0"/>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string" use="optional"/>
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
      <xs:attribute name="algorithmName" type="xs:string" use="optional"/>
      <xs:attribute name="modelClass" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="centerBased"/>
            <xs:enumeration value="distributionBased"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="numberOfClusters" type="INT-NUMBER" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="CenterFields">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="DerivedField" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

The attribute modelClass specifies whether the clusters are defined by center-vectors or whether they are defined by the statistics. The latter is used by demographic clustering.

The fields that are used in the center vectors are normalized. In particular, this allows categorical input fields to be mapped to numeric values in center vectors. See the DerivedField and the section on normalization.

MiningField information (in MiningSchema) must be present for each active variable. For numeric variables it specifies the treatment of outliers. Note that there may be supplementary mining fields. The statistics for these fields are part of the model but they are not required to apply the model.

For each active MiningField, an element of type UnivariateStats (see ModelStats) holds information about the overall (background) population. This includes (required) DiscrStats or ContStats, which include possible field values and interval boundaries. Optionally, statistical information is included for the background data.

A cluster is defined by its center vector or by statistics. A center vector is implemented by a NUM-ARRAY. Each Partition corresponds to a cluster and holds field statistics to describe it. The definition of a cluster may contain a center vector as well as statistics. The attribute modelClass in the ClusteringModel defines which one is used to actually define the cluster.


  <xs:element name="MissingValueWeights">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:group ref="NUM-ARRAY"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Cluster">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="KohonenMap" minOccurs="0"/>
        <xs:group ref="NUM-ARRAY" minOccurs="0"/>
        <xs:element ref="Partition" minOccurs="0"/>
        <xs:element ref="Covariances" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="name" type="xs:string" use="optional"/>
      <xs:attribute name="size" type="xs:nonNegativeInteger" use="optional"/>
    </xs:complexType>
  </xs:element>


  <xs:element name="KohonenMap">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="coord1" type="xs:float" use="optional"/>
      <xs:attribute name="coord2" type="xs:float" use="optional"/>
      <xs:attribute name="coord3" type="xs:float" use="optional"/>
    </xs:complexType>
  </xs:element>

If a clustering model is center-based then each cluster has a NUM-ARRAY containing the center coordinates for the cluster. The correspondence between input fields and their coordinates is defined by CenterFields, see above. Note that categorical fields can have more than one coordinate, depending on the normalization method.

If some normalization is defined for input fields, then the center coordinates are defined using the normalized values. For numeric fields we could also use the values in the original domain. For categorical values, however, the center vector is not required to contain the indicator values 0.0 or 1.0. There can also be any values between 0.0 and 1.0. These indicate a distribution of categorical values, defining a kind of virtual center point.
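As an illustration of such a virtual center point, the following sketch averages the 0/1 indicator vectors of a few records. Averaging is only one way fractional coordinates might arise; the helper names `indicator` and `virtual_center` are hypothetical and not part of PMML.

```python
def indicator(value, categories):
    """Return the 0.0/1.0 indicator vector of `value` over `categories`."""
    return [1.0 if value == c else 0.0 for c in categories]


def virtual_center(values, categories):
    """Average the indicator vectors of several records.

    Assumption: the center is taken as the plain mean of the member records.
    The result is a distribution over the categories, with entries in [0, 1].
    """
    rows = [indicator(v, categories) for v in values]
    return [sum(col) / len(rows) for col in zip(*rows)]


# Four records with marital status m, m, d, s
center = virtual_center(["m", "m", "d", "s"], ["m", "d", "s"])
```

Here the coordinates for m, d and s are the shares of each value among the four records.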

MissingValueWeights is used to adjust distance or similarity measures for missing data. See the formulas for ComparisonMeasure below.

The element KohonenMap is appropriate for clustering models that were produced by a Kohonen map algorithm. The attributes coord1, coord2 and coord3 describe the position of the current cluster in a map with up to three dimensions. This element is not relevant to the scoring function.

If the cluster model contains statistics with mean values, then the center coordinates are not necessarily identical to the corresponding mean values. There may be differences depending on the kind of normalization of input values.

A covariance matrix stores coordinate-by-coordinate variances (diagonal cells) and covariances (non-diagonal cells).


  <xs:element name="Covariances">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="Matrix"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

The covariance matrix must be symmetric, so only half of the non-diagonal covariance cells need to be stored. Missing covariance cells are reconstructed by symmetry. The sequence of rows/columns corresponds to the sequence in MiningSchema. Note that Covariances does not contain information about internal fields that are defined in the TransformationDictionary.
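The reconstruction by symmetry can be sketched as follows. The sketch assumes the stored half is available as lower-triangular rows (row i holds cells (i,0) to (i,i)); how the Matrix element actually encodes the cells is defined elsewhere in PMML, and the function name `fill_symmetric` is hypothetical.

```python
def fill_symmetric(lower_rows):
    """Expand lower-triangular rows into a full symmetric matrix.

    Assumption: lower_rows[i] contains the cells (i, 0) .. (i, i), so the
    last entry of each row is the variance on the diagonal.
    """
    n = len(lower_rows)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(lower_rows):
        for j, value in enumerate(row):
            full[i][j] = value
            full[j][i] = value  # mirror the cell across the diagonal
    return full


cov = fill_symmetric([[4.0], [1.5, 9.0], [0.2, -0.7, 1.0]])
```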


  <xs:element name="ClusteringField">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="Comparisons" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="required"/>
      <xs:attribute name="fieldWeight" type="REAL-NUMBER" use="optional"/>
      <xs:attribute name="similarityScale" type="REAL-NUMBER" use="optional"/>
      <xs:attribute name="compareFunction" type="COMPARE-FUNCTION" use="optional" />
    </xs:complexType>
  </xs:element>

field refers (by name) to a MiningField or to a DerivedField.

fieldWeight is the importance factor for the field. This field weight is used in the comparison functions in order to compute the comparison measure. The value must be a number greater than 0. The default value is 1.0.

similarityScale is the distance such that similarity becomes 0.5.

compareFunction is a function that takes two field values and a similarityScale and returns their similarity or distance. It can override the general specification of compareFunction in ComparisonMeasure.

For the computation of distances and similarities see below.


  <xs:element name="Comparisons">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="Matrix"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Comparisons is a matrix which contains the similarity values or distance values, depending on the attribute modelClass in ClusteringModel. The order of the rows and columns corresponds to the order of discrete values or intervals in that field.

Distance or Similarity Measure

When two records are compared then either the distance or the similarity is of interest. In both cases the measures can be computed by a combination of an 'inner' function and an 'outer' function. The inner function compares two single field values and the outer function computes an aggregation over all fields.

Each field has a comparison function. It is either defined as a default in ComparisonMeasure or defined per ClusteringField. Given two field values x and y, the inner function can be one of:

absDiff: absolute difference

c(x,y) = |x-y|

gaussSim: gaussian similarity

c(x,y) = exp(-ln(2)*z*z/(s*s)) where z=x-y, and s is the value of attribute similarityScale (required in this case) in the ClusteringField

delta:

c(x,y) = 0 if x=y, 1 else

equal:

c(x,y) = 1 if x=y, 0 else

table:

c(x,y) = lookup in similarity matrix


  <xs:simpleType name="COMPARE-FUNCTION">
    <xs:restriction base="xs:string">
      <xs:enumeration value="absDiff" />
      <xs:enumeration value="gaussSim" />
      <xs:enumeration value="delta" />
      <xs:enumeration value="equal" />
      <xs:enumeration value="table" />
    </xs:restriction>
  </xs:simpleType>
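The five inner functions listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the Python function names and the `values`/`matrix` arguments of `table` are assumptions chosen for the sketch.

```python
import math


def abs_diff(x, y):
    """absDiff: c(x,y) = |x-y|."""
    return abs(x - y)


def gauss_sim(x, y, s):
    """gaussSim: c(x,y) = exp(-ln(2)*z*z/(s*s)), z = x-y, s = similarityScale."""
    z = x - y
    return math.exp(-math.log(2) * z * z / (s * s))


def delta(x, y):
    """delta: 0 if the values are equal, 1 otherwise."""
    return 0 if x == y else 1


def equal(x, y):
    """equal: 1 if the values are equal, 0 otherwise."""
    return 1 if x == y else 0


def table(x, y, values, matrix):
    """table: look up c(x,y) in a matrix whose rows and columns follow
    the order of the field's discrete values (assumed given as `values`)."""
    return matrix[values.index(x)][values.index(y)]
```

With gaussSim, two values that are exactly similarityScale apart have similarity 0.5, for example `gauss_sim(10, 7, 3)`.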

ComparisonMeasure

Each ClusteringModel has exactly one aggregation function. The attribute kind in ComparisonMeasure determines how the aggregated value is interpreted: for a distance measure, 0 is the optimal value and smaller values indicate a better fit; for a similarity measure, greater values indicate a better fit.

A model can have MissingValueWeights in order to adjust distance measures for missing data.

Given:
  • A data vector Xi, i=1,...,n, where some of the values may be missing
  • A vector of cluster seeds Yi, i=1,...,n, all nonmissing
  • A vector of field weight values, Wi, i=1,...,n
    Wi is the value of fieldWeight in the ClusteringField corresponding to the i-th field. If the attribute is missing, Wi is assumed to be 1.0.
  • A vector of adjustment values, Qi, i=1,...,n, all nonmissing
    Qi is the i-th value in the element MissingValueWeights. If the model does not have MissingValueWeights, Qi is assumed to be 1.0.
  • A function nonmissing() that returns 1 if the argument is nonmissing, or 0 otherwise
  • A summation sumNM(expr_i) that returns the sum of expr_i over all indices i from 1 to n for which Xi is not missing.
The adjustment values are used to compute an adjustment factor:

                     sum[Qi]
AdjustM   =  --------------------------
             sum[nonmissing(Xi)*Qi]

Note that AdjustM = 1.0 if there are no missing values.
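A minimal sketch of the adjustment factor, assuming missing values are represented by `None`. The function name `adjust_m` is hypothetical, and raising an error when every value is missing is a choice made for this sketch; the text above does not define that case.

```python
def adjust_m(x, q=None):
    """Compute AdjustM = sum[Qi] / sum[nonmissing(Xi)*Qi].

    x: field values, with None marking a missing value (assumption).
    q: MissingValueWeights; defaults to 1.0 per field when absent.
    """
    if q is None:
        q = [1.0] * len(x)
    total = sum(q)
    present = sum(qi for xi, qi in zip(x, q) if xi is not None)
    if present == 0:
        # Not covered by the formula: the denominator would be zero.
        raise ValueError("all values are missing; AdjustM is undefined")
    return total / present
```

For example, with four equally weighted fields of which two are missing, `adjust_m([1.0, None, 3.0, None])` gives 2.0, which scales the partial sum up to compensate for the missing half.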

The following aggregation functions are defined by PMML.

euclidean: kind=distance

D = (sumNM( Wi * c(Xi,Yi)^2 ) * AdjustM)^(1/2)

squaredEuclidean, aka squared: kind=distance

D = sumNM( Wi * c(Xi,Yi)^2 ) * AdjustM

euclidean and squaredEuclidean are essentially equivalent for cluster assignment: the square root is monotonic, so both measures select the same nearest cluster.

chebychev, aka maximum: kind=distance

D = max (Wi*c(Xi,Yi)) *AdjustM over all i

cityBlock, aka sum:

D = sumNM (Wi*c(Xi,Yi)) *AdjustM

minkowski: p-parameter>0

D = (sumNM( c(Xi,Yi)^p ) * AdjustM)^(1/p)
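The distance aggregations above can be sketched in one function. The sketch makes several assumptions: missing values are `None`, the inner function defaults to absDiff, the chebychev maximum is taken over the nonmissing fields only (a missing Xi cannot be compared), and minkowski is computed without field weights, as written in its formula. The name `aggregate` and its parameters are hypothetical.

```python
def aggregate(measure, x, y, w=None, q=None, p=None,
              c=lambda a, b: abs(a - b)):
    """Aggregate per-field comparisons c(Xi,Yi) into one distance D."""
    n = len(x)
    w = w or [1.0] * n          # fieldWeight defaults to 1.0
    q = q or [1.0] * n          # MissingValueWeights default to 1.0
    nm = [i for i in range(n) if x[i] is not None]
    if not nm:
        raise ValueError("all values are missing")
    adjust = sum(q) / sum(q[i] for i in nm)   # AdjustM

    if measure == "squaredEuclidean":
        return sum(w[i] * c(x[i], y[i]) ** 2 for i in nm) * adjust
    if measure == "euclidean":
        return (sum(w[i] * c(x[i], y[i]) ** 2 for i in nm) * adjust) ** 0.5
    if measure == "chebychev":
        return max(w[i] * c(x[i], y[i]) for i in nm) * adjust
    if measure == "cityBlock":
        return sum(w[i] * c(x[i], y[i]) for i in nm) * adjust
    if measure == "minkowski":
        return (sum(c(x[i], y[i]) ** p for i in nm) * adjust) ** (1.0 / p)
    raise ValueError("unknown measure: " + measure)


# Second field is missing, so AdjustM = 3 / 2 = 1.5
x = [1.0, None, 4.0]
y = [0.0, 5.0, 2.0]
```

With these vectors the per-field differences are 1 and 2, so squaredEuclidean yields (1 + 4) * 1.5 = 7.5 and cityBlock yields (1 + 2) * 1.5 = 4.5.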

For binary or categorical data, compare the values of two individuals X and Y attribute by attribute, and let

a11 = number of times where Xi=1 and Yi=1

a10 = number of times where Xi=1 and Yi=0

a01 = number of times where Xi=0 and Yi=1

a00 = number of times where Xi=0 and Yi=0

simpleMatching: kind=similarity, min=0, max=1

D = ( a11 + a00 ) / ( a11 + a10 + a01 + a00 )

jaccard: kind=similarity

D = ( a11 ) / ( a11 + a10 + a01 )

tanimoto: kind=similarity, min=0, max=1

D = ( a11 + a00 ) / ( a11 + 2*(a10+a01) + a00 )

binarySimilarity: kind=similarity, min=0, max=1, c.. and d.. are parameters > 0.

        c11*a11 + c10*a10 + c01*a01 + c00*a00
D = ---------------------------------------------------------
        d11*a11 + d10*a10 + d01*a01 + d00*a00
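The binary similarity measures can be sketched directly from the four counts. This is an illustration under the assumption that both records are given as equal-length sequences of 0/1 values; all function names are hypothetical.

```python
def binary_counts(x, y):
    """Return (a11, a10, a01, a00) for two equal-length 0/1 sequences."""
    pairs = list(zip(x, y))
    a11 = sum(1 for xi, yi in pairs if xi == 1 and yi == 1)
    a10 = sum(1 for xi, yi in pairs if xi == 1 and yi == 0)
    a01 = sum(1 for xi, yi in pairs if xi == 0 and yi == 1)
    a00 = sum(1 for xi, yi in pairs if xi == 0 and yi == 0)
    return a11, a10, a01, a00


def simple_matching(a11, a10, a01, a00):
    return (a11 + a00) / (a11 + a10 + a01 + a00)


def jaccard(a11, a10, a01, a00):
    return a11 / (a11 + a10 + a01)


def tanimoto(a11, a10, a01, a00):
    return (a11 + a00) / (a11 + 2 * (a10 + a01) + a00)


def binary_similarity(a, c, d):
    """a, c, d are (11, 10, 01, 00) tuples of counts and parameters."""
    num = sum(ci * ai for ci, ai in zip(c, a))
    den = sum(di * ai for di, ai in zip(d, a))
    return num / den


counts = binary_counts([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```

For these two records the counts are a11=2, a10=1, a01=1 and a00=1. Choosing c = (1, 0, 0, 1) and d = (1, 1, 1, 1) in `binary_similarity` reproduces simpleMatching.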


  <xs:element name="ComparisonMeasure">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:choice>
          <xs:element ref="euclidean"/>
          <xs:element ref="squaredEuclidean"/>
          <xs:element ref="chebychev"/>
          <xs:element ref="cityBlock"/>
          <xs:element ref="minkowski"/>
          <xs:element ref="simpleMatching"/>
          <xs:element ref="jaccard"/>
          <xs:element ref="tanimoto"/>
          <xs:element ref="binarySimilarity"/>
        </xs:choice>
      </xs:sequence>
      <xs:attribute name="kind" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="distance"/>
            <xs:enumeration value="similarity"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>

      <xs:attribute name="compareFunction" type="COMPARE-FUNCTION" use="optional" />
      <xs:attribute name="minimum" type="NUMBER" use="optional"/>
      <xs:attribute name="maximum" type="NUMBER" use="optional"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="euclidean">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="squaredEuclidean">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="cityBlock">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="chebychev">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="minkowski">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="p-parameter" type="NUMBER" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="simpleMatching">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="jaccard">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="tanimoto">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="binarySimilarity">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="c00-parameter" type="NUMBER" use="required"/>
      <xs:attribute name="c01-parameter" type="NUMBER" use="required"/>
      <xs:attribute name="c10-parameter" type="NUMBER" use="required"/>
      <xs:attribute name="c11-parameter" type="NUMBER" use="required"/>
      <xs:attribute name="d00-parameter" type="NUMBER" use="required"/>
      <xs:attribute name="d01-parameter" type="NUMBER" use="required"/>
      <xs:attribute name="d10-parameter" type="NUMBER" use="required"/>
      <xs:attribute name="d11-parameter" type="NUMBER" use="required"/>
    </xs:complexType>
  </xs:element>

The attributes minimum and maximum in ComparisonMeasure are not used in the scoring function. They are for information only, describing the minimum and maximum possible value in the comparison function.

Conformance

Center-based clustering:

  • comparison function 'absDiff' (that is, the value of attribute 'compareFunction' in 'ComparisonMeasure') is in core, other comparison functions are not in core.
  • aggregation function 'squaredEuclidean' is in core, other aggregation functions are not in core.

Distribution-based clustering:

  • comparison functions 'gaussSim', 'equal' and 'table' are in core, other comparison functions are not in core.
  • aggregation function 'cityBlock' is in core, other aggregation functions are not in core.
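The conformance rules above can be expressed as a small lookup. This is only a sketch of how a consumer might check them; the `CORE` table and the function `is_core` are hypothetical names, and the inputs are assumed to have been extracted from the model already.

```python
# Core functions per modelClass, transcribed from the lists above.
CORE = {
    "centerBased": {
        "compare": {"absDiff"},
        "aggregate": {"squaredEuclidean"},
    },
    "distributionBased": {
        "compare": {"gaussSim", "equal", "table"},
        "aggregate": {"cityBlock"},
    },
}


def is_core(model_class, compare_functions, aggregate):
    """Return True if every comparison function and the aggregation
    function used by the model are in core for the given modelClass."""
    core = CORE[model_class]
    return (aggregate in core["aggregate"]
            and set(compare_functions) <= core["compare"])
```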

Example for a center-based clustering model


  <?xml version="1.0" ?>
  <PMML version="3.0">
    <Header copyright="dmg.org"/>
    <DataDictionary numberOfFields="3">
      <DataField name="marital status" optype="categorical">
        <Value value="s"/>
        <Value value="d"/>
        <Value value="m"/>
      </DataField>
      <DataField name="age" optype="continuous"/>
      <DataField name="salary" optype="continuous"/>
    </DataDictionary>
    <ClusteringModel modelName="Mini Clustering"
        functionName="clustering"
        modelClass="centerBased"
        numberOfClusters="2">
      <MiningSchema>
        <MiningField name="marital status"/>
        <MiningField name="age"/>
        <MiningField name="salary"/>
      </MiningSchema>

      <ComparisonMeasure kind="distance">
        <squaredEuclidean/>
      </ComparisonMeasure>

      <ClusteringField field="marital status"
          compareFunction="absDiff"/>
      <ClusteringField field="age"
          compareFunction="absDiff"/>
      <ClusteringField field="salary"
          compareFunction="absDiff"/>
      <CenterFields>
        <DerivedField name="c1">
          <NormContinuous field="age">
            <LinearNorm orig="45" norm="0"/>
            <LinearNorm orig="82" norm="0.5"/>
            <LinearNorm orig="105" norm="1"/>
          </NormContinuous>
        </DerivedField>
        <DerivedField name="c2">
          <NormContinuous field="salary">
            <LinearNorm orig="39000" norm="0"/>
            <LinearNorm orig="39800" norm="0.5"/>
            <LinearNorm orig="41000" norm="1"/>
          </NormContinuous>
        </DerivedField>
        <DerivedField name="c3">
          <NormDiscrete field="marital status" value="m"/>
        </DerivedField>
        <DerivedField name="c4">
          <NormDiscrete field="marital status" value="d"/>
        </DerivedField>
        <DerivedField name="c5">
          <NormDiscrete field="marital status" value="s"/>
        </DerivedField>
      </CenterFields>
      <MissingValueWeights>
        <Array n="5" type="real">1 1 1 1 1</Array>
      </MissingValueWeights>
      <Cluster name="marital status is d or s">
        <Array n="5" type="real">0.524561 0.486321 0.128427 0.459188 0.412384</Array>
      </Cluster>
      <Cluster name="marital status is m">
        <Array n="5" type="real">0.69946 0.419037 0.591226 0.173521 0.235253</Array>
      </Cluster>
    </ClusteringModel>
  </PMML>
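To make the example concrete, the following sketch scores a record against this model by hand. It is an illustration under stated assumptions, not a PMML consumer: the constants are transcribed from the example, LinearNorm is read as piecewise-linear interpolation, NormDiscrete as a 0/1 indicator, the distance is taken between normalized coordinates, and inputs outside the LinearNorm range are rejected because outlier handling is not sketched here. All Python names are hypothetical.

```python
def linear_norm(x, points):
    """Piecewise-linear interpolation through (orig, norm) pairs."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError("value outside the LinearNorm range")


AGE = [(45, 0.0), (82, 0.5), (105, 1.0)]               # c1
SALARY = [(39000, 0.0), (39800, 0.5), (41000, 1.0)]    # c2
STATUS = ["m", "d", "s"]                               # c3, c4, c5

CLUSTERS = {
    "marital status is d or s":
        [0.524561, 0.486321, 0.128427, 0.459188, 0.412384],
    "marital status is m":
        [0.69946, 0.419037, 0.591226, 0.173521, 0.235253],
}


def to_center_space(age, salary, status):
    """Map a record to the five coordinates defined by CenterFields."""
    return ([linear_norm(age, AGE), linear_norm(salary, SALARY)]
            + [1.0 if status == v else 0.0 for v in STATUS])


def assign(age, salary, status):
    """Return the name of the cluster with the smallest squaredEuclidean
    distance. All weights are 1.0 and nothing is missing, so AdjustM = 1."""
    x = to_center_space(age, salary, status)

    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))

    return min(CLUSTERS, key=lambda name: dist(CLUSTERS[name]))
```

A record with age 82, salary 39800 and marital status m maps to the coordinates 0.5, 0.5, 1, 0, 0 and lands in the cluster named "marital status is m"; the same record with status s lands in "marital status is d or s".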
