PMML 4.4.1 - Clustering Models

PMML defines two classes of clustering models: center-based and distribution-based. Both use the element ClusteringModel as the top-level type, and they share many other element types.

A cluster model consists of a set of clusters. For each cluster a center vector can be given. In center-based models a cluster is defined by a vector of center coordinates, and a distance measure is used to determine the nearest center, that is, the nearest cluster for a given input record. In distribution-based models (e.g., in demographic clustering) the clusters are defined by their statistics, and a similarity measure is used to determine the best matching cluster for a given record. The center vectors then only approximate the clusters.

The model must contain information on the distance or similarity measure used for clustering. It may also contain information on the overall data distribution, such as a covariance matrix or other statistics. Field names used in ClusteringField elements and in statistics must be consistent with the names of the fields in the data dictionary and in the transformation dictionary.

<xs:element name="ClusteringModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="ModelExplanation" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="ComparisonMeasure"/>
      <xs:element ref="ClusteringField" minOccurs="1" maxOccurs="unbounded"/>
      <xs:element ref="MissingValueWeights" minOccurs="0"/>
      <xs:element ref="Cluster" maxOccurs="unbounded"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string" use="optional"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string" use="optional"/>
    <xs:attribute name="modelClass" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="centerBased"/>
          <xs:enumeration value="distributionBased"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="numberOfClusters" type="xs:nonNegativeInteger" use="required"/>
    <xs:attribute name="isScorable" type="xs:boolean" default="true"/>
  </xs:complexType>
</xs:element>

The attribute modelClass specifies whether the clusters are defined by center-vectors or whether they are defined by the statistics. The latter is used by distribution-based clustering.

The isScorable attribute indicates whether the model is valid for scoring. If this attribute is true or if it is missing, then the model should be processed normally. However, if the attribute is false, then the model producer has indicated that this model is intended for information purposes only and should not be used to generate results. In order to be valid PMML, all required elements and attributes must be present, even for non-scoring models. For more details, see General Structure.

The numberOfClusters attribute must be equal to the number of Cluster elements in the ClusteringModel.
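The consistency rule above can be checked mechanically. The following is a minimal sketch, assuming the model is available as an XML string; the helper name cluster_count_matches and the SAMPLE fragment are hypothetical, and the fragment is deliberately incomplete (it omits MiningSchema and other required elements), so it is not valid PMML.

```python
import xml.etree.ElementTree as ET


def cluster_count_matches(model_xml: str) -> bool:
    """Return True if numberOfClusters equals the number of Cluster children."""
    model = ET.fromstring(model_xml)
    declared = int(model.attrib["numberOfClusters"])
    # Compare on the local name so the check works with or without a namespace.
    actual = sum(1 for child in model if child.tag.rsplit("}", 1)[-1] == "Cluster")
    return declared == actual


# Hypothetical, deliberately incomplete fragment used only to exercise the check.
SAMPLE = """<ClusteringModel functionName="clustering" modelClass="centerBased" numberOfClusters="2">
  <Cluster/><Cluster/>
</ClusteringModel>"""

print(cluster_count_matches(SAMPLE))
```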

The fields which are used in the center vectors are normalized. In particular this allows the mapping of categorical input fields to numeric values in center vectors. See the DerivedField and the section on normalization.

MiningField information (in MiningSchema) must be present for each active variable. For numeric variables it specifies the treatment of outliers. Note that there may be supplementary mining fields. The statistics for these fields are part of the model but they are not required to apply the model.

For each active MiningField, an element of type UnivariateStats (see ModelStats) holds information about the overall (background) population. This includes (required) DiscrStats or ContStats, which include possible field values and interval boundaries. Optionally, statistical information is included for the background data.

A cluster is defined by its center vector or by statistics. A center vector is implemented by a NUM-ARRAY. Each Partition corresponds to a cluster and holds field statistics to describe it. The definition of a cluster may contain a center vector as well as statistics. The attribute modelClass in the ClusteringModel defines which one is used to actually define the cluster.

<xs:element name="MissingValueWeights">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="NUM-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="Cluster">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="KohonenMap" minOccurs="0"/>
      <xs:group ref="NUM-ARRAY" minOccurs="0"/>
      <xs:element ref="Partition" minOccurs="0"/>
      <xs:element ref="Covariances" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="optional"/>
    <xs:attribute name="name" type="xs:string" use="optional"/>
    <xs:attribute name="size" type="xs:nonNegativeInteger" use="optional"/>
  </xs:complexType>
</xs:element>

<xs:element name="KohonenMap">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="coord1" type="xs:double" use="optional"/>
    <xs:attribute name="coord2" type="xs:double" use="optional"/>
    <xs:attribute name="coord3" type="xs:double" use="optional"/>
  </xs:complexType>
</xs:element>

If a clustering model is center-based then each cluster has a NUM-ARRAY containing the center coordinates for the cluster. The correspondence between input fields and their coordinates is defined via ClusteringFields, see below. Note that categorical fields can have more than one coordinate, depending on the normalization method.

If some normalization is defined for input fields, then the center coordinates are defined using the normalized values. For numeric fields we could also use the values in the original domain. For categorical values, however, the center vector is not required to contain the indicator values 0.0 or 1.0. There can also be any values between 0.0 and 1.0. These indicate a distribution of categorical values, defining a kind of virtual center point.

MissingValueWeights is used to adjust distance or similarity measures for missing data. See the formulas for ComparisonMeasure below.

The element Cluster defines a single cluster. Clusters are identified by an implicit 1-based index, indicating the position in which each cluster appears in the model. Optionally, the model may provide explicit identifiers through the id attribute. In this case, all clusters must provide an explicit identifier. Explicit identifiers must be unique across all the clusters. For clustering models, the explicit identifier of the winning cluster is returned as the predictedValue. If absent, the implicit identifier is returned instead. The implicit 1-based index identifier can always be obtained through the entityID feature. A cluster may also provide a name. The name of a cluster is not required to be unique and is returned as the predictedDisplayValue. Finally, size is descriptive only (not used in predictions) and intended to capture the size of each cluster.
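The identification rules above can be sketched as follows. This is an illustration only: the Cluster dataclass and the function winner_outputs are hypothetical names, and returning the implicit index as a string is an assumption rather than something the text prescribes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cluster:
    """Only the attributes needed for identification (hypothetical helper type)."""
    id: Optional[str] = None
    name: Optional[str] = None


def winner_outputs(clusters: List[Cluster], winner_position: int) -> Dict[str, Optional[str]]:
    """Build the identifier outputs for the winning cluster.

    winner_position is the 0-based position of the winner in document order.
    """
    explicit_ids = [c.id for c in clusters if c.id is not None]
    if explicit_ids and len(explicit_ids) != len(clusters):
        raise ValueError("either all clusters carry an id or none does")
    if len(set(explicit_ids)) != len(explicit_ids):
        raise ValueError("explicit cluster ids must be unique")

    winner = clusters[winner_position]
    implicit_id = str(winner_position + 1)  # the implicit index is 1-based
    return {
        "predictedValue": winner.id if winner.id is not None else implicit_id,
        "entityID": implicit_id,
        "predictedDisplayValue": winner.name,
    }


with_ids = [Cluster(id="young", name="Young"), Cluster(id="old", name="Old")]
without_ids = [Cluster(name="A"), Cluster(name="A")]  # names need not be unique
print(winner_outputs(with_ids, 1))
print(winner_outputs(without_ids, 1))
```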

The element KohonenMap is appropriate for clustering models that were produced by a Kohonen map algorithm. The attributes coord1, coord2 and coord3 describe the position of the current cluster in a map with up to three dimensions. This element is not relevant to the scoring function.

If the cluster model contains statistics with mean values, then the center coordinates are not necessarily identical to the corresponding mean values. There may be differences depending on the kind of normalization of input values.

A covariance matrix stores coordinate-by-coordinate variances (diagonal cells) and covariances (non-diagonal cells).

<xs:element name="Covariances">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Matrix"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The covariance matrix must be symmetric, so only half of the non-diagonal covariance cells need to be stored. Missing covariance cells are reconstructed by symmetry. The sequence of rows/columns corresponds to the sequence in MiningSchema. Note that Covariances does not contain information about internal fields that are defined in the TransformationDictionary.
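Reconstruction by symmetry can be sketched as below. The input layout (one list per row holding the cells up to and including the diagonal) and the name full_symmetric are assumptions made for illustration; how the cells are actually stored is governed by the Matrix element.

```python
def full_symmetric(lower_rows):
    """Expand lower-triangular rows (diagonal included) into a full symmetric matrix."""
    n = len(lower_rows)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(lower_rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i} must hold {i + 1} cells")
        for j, value in enumerate(row):
            full[i][j] = value
            full[j][i] = value  # mirror the cell across the diagonal
    return full


# Variances on the diagonal, covariances below it (made-up numbers).
lower = [[4.0], [1.0, 9.0], [0.5, 2.0, 16.0]]
for row in full_symmetric(lower):
    print(row)
```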

<xs:element name="ClusteringField">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Comparisons" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="isCenterField" default="true">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="true"/>
          <xs:enumeration value="false"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="fieldWeight" type="REAL-NUMBER" default="1"/>
    <xs:attribute name="similarityScale" type="REAL-NUMBER" use="optional"/>
    <xs:attribute name="compareFunction" type="COMPARE-FUNCTION" use="optional"/>
  </xs:complexType>
</xs:element>

field refers (by name) to a MiningField or to a DerivedField.

isCenterField indicates whether the respective field is a center field, i.e. a component of the center, in a center-based model. Only center fields have entries in the center vectors: the center fields, taken in order, correspond to the entries of each center vector.
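The correspondence between center fields and center-vector entries can be illustrated with a short sketch. The function name center_by_field and the (field, isCenterField) tuple representation are hypothetical.

```python
def center_by_field(clustering_fields, center):
    """Pair each center field, in document order, with its center coordinate.

    clustering_fields: list of (field name, isCenterField) tuples.
    center: the values of the cluster's NUM-ARRAY.
    """
    names = [name for name, is_center in clustering_fields if is_center]
    if len(names) != len(center):
        raise ValueError("center vector length must equal the number of center fields")
    return dict(zip(names, center))


# Made-up field names and coordinates; "region" is not a center field.
fields = [("c1", True), ("region", False), ("c2", True)]
print(center_by_field(fields, [0.25, 0.75]))
```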

fieldWeight is the importance factor for the field. This field weight is used in the comparison functions in order to compute the comparison measure. The value must be a number greater than 0. The default value is 1.0.

similarityScale is the distance such that similarity becomes 0.5.

compareFunction is a function that takes two field values and a similarityScale and returns their similarity or distance. It can override the general specification of compareFunction in ComparisonMeasure.

For the computation of distances and similarities see below.

<xs:element name="Comparisons">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Matrix"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Comparisons is a matrix which contains the similarity values or distance values, depending on the attribute modelClass in ClusteringModel. The order of the rows and columns corresponds to the order of discrete values or intervals in that field.

Distance or Similarity Measure

When two records are compared then either the distance or the similarity is of interest. In both cases the measures can be computed by a combination of an 'inner' function and an 'outer' function. The inner function compares two single field values and the outer function computes an aggregation over all fields.

Each field has a comparison function. It is either defined as a default in ComparisonMeasure or defined per ClusteringField. Given two field values x and y, the inner function can be one of:

absDiff: absolute difference: c(x,y) = |x-y|
gaussSim: gaussian similarity: c(x,y) = exp(-ln(2)*z*z/(s*s)) where z=x-y, and s is the value of attribute similarityScale (required in this case) in the ClusteringField
delta: c(x,y) = 0 if x=y, 1 else
equal: c(x,y) = 1 if x=y, 0 else
table: c(x,y) = lookup in similarity matrix
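The inner functions listed above translate directly into code. The sketch below is illustrative: the Python function names are hypothetical, and table_lookup assumes that the caller supplies the ordered list of field values matching the rows and columns of the Comparisons matrix.

```python
import math


def abs_diff(x, y):
    """absDiff: absolute difference."""
    return abs(x - y)


def gauss_sim(x, y, s):
    """gaussSim: equals 0.5 when |x - y| equals the similarityScale s."""
    z = x - y
    return math.exp(-math.log(2) * z * z / (s * s))


def delta(x, y):
    """delta: 0 if the values are equal, 1 otherwise."""
    return 0.0 if x == y else 1.0


def equal(x, y):
    """equal: 1 if the values are equal, 0 otherwise."""
    return 1.0 if x == y else 0.0


def table_lookup(x, y, values, matrix):
    """table: look up the cell addressed by the positions of x and y in values."""
    return matrix[values.index(x)][values.index(y)]


print(abs_diff(3.0, 5.5))
print(gauss_sim(10.0, 12.0, 2.0))
print(delta("m", "s"), equal("m", "s"))
# Made-up similarity matrix for three categorical values.
print(table_lookup("s", "d", ["s", "d", "m"], [[1.0, 0.4, 0.1], [0.4, 1.0, 0.3], [0.1, 0.3, 1.0]]))
```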

<xs:simpleType name="COMPARE-FUNCTION">
  <xs:restriction base="xs:string">
    <xs:enumeration value="absDiff"/>
    <xs:enumeration value="gaussSim"/>
    <xs:enumeration value="delta"/>
    <xs:enumeration value="equal"/>
    <xs:enumeration value="table"/>
  </xs:restriction>
</xs:simpleType>

ComparisonMeasure

Each ClusteringModel has exactly one aggregation function. Depending on the attribute kind in ComparisonMeasure, either the aggregated value is optimal when it is 0 (distance measure) or greater values indicate a better fit (similarity measure).

A model can have MissingValueWeights in order to adjust distance measures for missing data.

Given:

A data vector Xi, i=1,...,n, where some of the values may be missing
A vector of cluster seeds Yi, i=1,...,n, all nonmissing
A vector of field weight values Wi, i=1,...,n. Wi is the value of fieldWeight in the ClusteringField corresponding to the i-th field. If the attribute is missing, Wi is assumed to be 1.0.
A vector of adjustment values Qi, i=1,...,n, all nonmissing. Qi is the i-th value in the element MissingValueWeights. If the model does not have MissingValueWeights, Qi is assumed to be 1.0.
A function nonmissing() that returns 1 if the argument is nonmissing, or 0 otherwise
A summation sumNM(expr_i) that returns the sum of expr_i over all indices i from 1 to n for which Xi is not missing.

The adjustment values are used to compute an adjustment factor:

AdjustM = sum[Qi] / sum[nonmissing(Xi)*Qi]

where both sums run over i=1,...,n.

Note that AdjustM = 1.0 if there are no missing values.
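The adjustment factor can be computed as in the following sketch. Representing a missing value by None and raising an error when every value is missing are assumptions of this illustration; the function name adjust_m is hypothetical.

```python
def adjust_m(x, q):
    """AdjustM: sum of all Qi divided by the sum of Qi over nonmissing Xi.

    x: data vector, with None marking a missing value (assumption).
    q: adjustment values taken from MissingValueWeights.
    """
    total = sum(q)
    present = sum(qi for xi, qi in zip(x, q) if xi is not None)
    if present == 0:
        raise ValueError("all values are missing; the factor is undefined")
    return total / present


print(adjust_m([1.0, 2.0, 3.0], [1.0, 1.0, 2.0]))   # no missing values
print(adjust_m([1.0, None, 3.0], [1.0, 1.0, 2.0]))  # second value missing
```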

The following aggregation functions are defined by PMML.

euclidean: kind=distance

D = (sumNM( Wi * c(Xi,Yi)² ) *AdjustM )^(1/2)

squaredEuclidean, aka squared: kind=distance

D = sumNM( Wi * c(Xi,Yi)² ) *AdjustM

euclidean and squaredEuclidean are essentially equivalent for cluster assignment: the square root is monotonic, so both measures rank the clusters identically and select the same nearest cluster.

chebychev, aka maximum: kind=distance

D = max (Wi*c(Xi,Yi)) *AdjustM over all i

cityBlock, aka sum: kind=distance

D = sumNM (Wi*c(Xi,Yi)) *AdjustM

minkowski: kind=distance, p-parameter>0

D = (sumNM(c(Xi,Yi)^p) *AdjustM)^(1/p)
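The distance formulas above can be sketched in one function. This is an illustration under stated assumptions: the inner function is fixed to absDiff, a missing value is represented by None, the chebychev maximum is taken over the nonmissing fields only, and the minkowski branch follows the formula exactly as written above (without Wi). The function name distance is hypothetical.

```python
def distance(measure, x, y, w=None, q=None, p=2.0):
    """Aggregate per-field absDiff comparisons into one distance value."""
    n = len(x)
    w = w if w is not None else [1.0] * n  # fieldWeight defaults to 1.0
    q = q if q is not None else [1.0] * n  # Qi defaults to 1.0
    present = [i for i in range(n) if x[i] is not None]
    if not present:
        raise ValueError("all values are missing")
    adjust = sum(q) / sum(q[i] for i in present)  # AdjustM
    c = {i: abs(x[i] - y[i]) for i in present}    # inner function: absDiff

    if measure == "squaredEuclidean":
        return sum(w[i] * c[i] ** 2 for i in present) * adjust
    if measure == "euclidean":
        return (sum(w[i] * c[i] ** 2 for i in present) * adjust) ** 0.5
    if measure == "chebychev":
        return max(w[i] * c[i] for i in present) * adjust
    if measure == "cityBlock":
        return sum(w[i] * c[i] for i in present) * adjust
    if measure == "minkowski":
        return (sum(c[i] ** p for i in present) * adjust) ** (1.0 / p)
    raise ValueError(f"unknown measure: {measure}")


record, center = [0.0, 3.0], [4.0, 0.0]
for name in ("euclidean", "squaredEuclidean", "chebychev", "cityBlock", "minkowski"):
    print(name, distance(name, record, center))
```

With one value missing, for example distance("cityBlock", [0.0, 3.0, None], [4.0, 0.0, 1.0]), the sum over the two present fields is scaled by AdjustM = 3/2.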

For binary or categorical data, compare the values of two individuals X and Y attribute by attribute, and let

a11 = number of times where Xi=1 and Yi=1

a10 = number of times where Xi=1 and Yi=0

a01 = number of times where Xi=0 and Yi=1

a00 = number of times where Xi=0 and Yi=0

simpleMatching: kind=similarity, min=0, max=1: D = ( a11 + a00 ) / ( a11 + a10 + a01 + a00 )
jaccard: kind=similarity: D = ( a11 ) / ( a11 + a10 + a01 )
tanimoto: kind=similarity, min=0, max=1: D = ( a11 + a00 ) / ( a11 + 2*(a10+a01) + a00 )
binarySimilarity: kind=similarity, min=0, max=1, c.. and d.. are parameters > 0: D = ( c11*a11 + c10*a10 + c01*a01 + c00*a00 ) / ( d11*a11 + d10*a10 + d01*a01 + d00*a00 )
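The counts a11, a10, a01, a00 and the similarity measures built on them can be sketched as follows. The function names and the dictionary-based passing of the c and d parameters are hypothetical; the input vectors are assumed to contain only 0 and 1.

```python
def binary_counts(x, y):
    """Return (a11, a10, a01, a00) for two equal-length 0/1 vectors."""
    pairs = list(zip(x, y))
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 1)), pairs.count((0, 0)))


def simple_matching(x, y):
    a11, a10, a01, a00 = binary_counts(x, y)
    return (a11 + a00) / (a11 + a10 + a01 + a00)


def jaccard(x, y):
    a11, a10, a01, _ = binary_counts(x, y)
    return a11 / (a11 + a10 + a01)


def tanimoto(x, y):
    a11, a10, a01, a00 = binary_counts(x, y)
    return (a11 + a00) / (a11 + 2 * (a10 + a01) + a00)


def binary_similarity(x, y, c, d):
    """c and d map the keys "11", "10", "01", "00" to the respective parameters."""
    a = dict(zip(("11", "10", "01", "00"), binary_counts(x, y)))
    return sum(c[k] * a[k] for k in a) / sum(d[k] * a[k] for k in a)


X = [1, 1, 0, 0, 1]
Y = [1, 0, 1, 0, 1]
print(binary_counts(X, Y))
print(simple_matching(X, Y), jaccard(X, Y), tanimoto(X, Y))
ones = {"11": 1.0, "10": 1.0, "01": 1.0, "00": 1.0}
print(binary_similarity(X, Y, ones, ones))
```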

<xs:element name="ComparisonMeasure">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:choice>
        <xs:element ref="euclidean"/>
        <xs:element ref="squaredEuclidean"/>
        <xs:element ref="chebychev"/>
        <xs:element ref="cityBlock"/>
        <xs:element ref="minkowski"/>
        <xs:element ref="simpleMatching"/>
        <xs:element ref="jaccard"/>
        <xs:element ref="tanimoto"/>
        <xs:element ref="binarySimilarity"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="kind" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="distance"/>
          <xs:enumeration value="similarity"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>

    <xs:attribute name="compareFunction" type="COMPARE-FUNCTION" default="absDiff"/>
    <xs:attribute name="minimum" type="NUMBER" use="optional"/>
    <xs:attribute name="maximum" type="NUMBER" use="optional"/>
  </xs:complexType>
</xs:element>

<xs:element name="euclidean">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="squaredEuclidean">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="cityBlock">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="chebychev">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="minkowski">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="p-parameter" type="NUMBER" use="required"/>
  </xs:complexType>
</xs:element>

<xs:element name="simpleMatching">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="jaccard">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="tanimoto">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="binarySimilarity">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="c00-parameter" type="NUMBER" use="required"/>
    <xs:attribute name="c01-parameter" type="NUMBER" use="required"/>
    <xs:attribute name="c10-parameter" type="NUMBER" use="required"/>
    <xs:attribute name="c11-parameter" type="NUMBER" use="required"/>
    <xs:attribute name="d00-parameter" type="NUMBER" use="required"/>
    <xs:attribute name="d01-parameter" type="NUMBER" use="required"/>
    <xs:attribute name="d10-parameter" type="NUMBER" use="required"/>
    <xs:attribute name="d11-parameter" type="NUMBER" use="required"/>
  </xs:complexType>
</xs:element>

The attributes minimum and maximum in ComparisonMeasure are not used in the scoring function. They are for information only, describing the minimum and maximum possible value in the comparison function.

Example for a center-based clustering model

<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header copyright="dmg.org"/>
  <DataDictionary numberOfFields="3">
    <DataField name="marital status" optype="categorical" dataType="string">
      <Value value="s"/>
      <Value value="d"/>
      <Value value="m"/>
    </DataField>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="salary" optype="continuous" dataType="double"/>
  </DataDictionary>
  <ClusteringModel modelName="Mini Clustering" functionName="clustering" modelClass="centerBased" numberOfClusters="2">
    <MiningSchema>
      <MiningField name="marital status"/>
      <MiningField name="age"/>
      <MiningField name="salary"/>
    </MiningSchema>
    <LocalTransformations>
      <DerivedField name="c1" optype="continuous" dataType="double">
        <NormContinuous field="age">
          <LinearNorm orig="45" norm="0"/>
          <LinearNorm orig="82" norm="0.5"/>
          <LinearNorm orig="105" norm="1"/>
        </NormContinuous>
      </DerivedField>
      <DerivedField name="c2" optype="continuous" dataType="double">
        <NormContinuous field="salary">
          <LinearNorm orig="39000" norm="0"/>
          <LinearNorm orig="39800" norm="0.5"/>
          <LinearNorm orig="41000" norm="1"/>
        </NormContinuous>
      </DerivedField>
      <DerivedField name="c3" optype="continuous" dataType="double">
        <NormDiscrete field="marital status" value="m"/>
      </DerivedField>
      <DerivedField name="c4" optype="continuous" dataType="double">
        <NormDiscrete field="marital status" value="d"/>
      </DerivedField>
      <DerivedField name="c5" optype="continuous" dataType="double">
        <NormDiscrete field="marital status" value="s"/>
      </DerivedField>
    </LocalTransformations>
    <ComparisonMeasure kind="distance">
      <squaredEuclidean/>
    </ComparisonMeasure>
    <ClusteringField field="c1" compareFunction="absDiff"/>
    <ClusteringField field="c2" compareFunction="absDiff"/>
    <ClusteringField field="c3" compareFunction="absDiff"/>
    <ClusteringField field="c4" compareFunction="absDiff"/>
    <ClusteringField field="c5" compareFunction="absDiff"/>
    <MissingValueWeights>
      <Array n="5" type="real">1 1 1 1 1</Array>
    </MissingValueWeights>
    <Cluster name="marital status is d or s">
      <Array n="5" type="real">0.524561 0.486321 0.128427 0.459188 0.412384</Array>
    </Cluster>
    <Cluster name="marital status is m">
      <Array n="5" type="real">0.69946 0.419037 0.591226 0.173521 0.235253</Array>
    </Cluster>
  </ClusteringModel>
</PMML>
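To illustrate how the example model could be applied, the following sketch scores two records by hand. It is not a PMML engine: the normalization tables and centers are copied from the example, values are assumed to lie inside the normalization ranges (outlier treatment is not handled), and the helper names are hypothetical. Because no values are missing and all field weights default to 1, AdjustM is 1 and the weights drop out of the squaredEuclidean formula.

```python
def linear_norm(value, points):
    """Piecewise-linear interpolation through the (orig, norm) pairs of LinearNorm."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= value <= x1:
            return y0 + (value - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError("value outside the normalization range")


def indicator(value, category):
    """NormDiscrete: 1.0 if the field has the given value, else 0.0."""
    return 1.0 if value == category else 0.0


def normalize(record):
    """Compute the derived fields c1..c5 in ClusteringField order."""
    return [
        linear_norm(record["age"], [(45, 0.0), (82, 0.5), (105, 1.0)]),
        linear_norm(record["salary"], [(39000, 0.0), (39800, 0.5), (41000, 1.0)]),
        indicator(record["marital status"], "m"),
        indicator(record["marital status"], "d"),
        indicator(record["marital status"], "s"),
    ]


CLUSTERS = [
    ("marital status is d or s", [0.524561, 0.486321, 0.128427, 0.459188, 0.412384]),
    ("marital status is m", [0.69946, 0.419037, 0.591226, 0.173521, 0.235253]),
]


def squared_euclidean(x, center):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, center))


def assign(record):
    """Return the name of the nearest cluster and the distances to all clusters."""
    x = normalize(record)
    distances = [squared_euclidean(x, center) for _, center in CLUSTERS]
    best = min(range(len(CLUSTERS)), key=distances.__getitem__)
    return CLUSTERS[best][0], distances


print(assign({"age": 82, "salary": 39800, "marital status": "m"}))
print(assign({"age": 45, "salary": 39000, "marital status": "s"}))
```

The first record normalizes to (0.5, 0.5, 1, 0, 0) and lands in the cluster named "marital status is m"; the second normalizes to (0, 0, 0, 0, 1) and lands in "marital status is d or s".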
