Data Mining Group - Cluster Models

PMML 2.0 -- Cluster Models

PMML defines two classes of cluster models: center-based and distribution-based. Both use the DTD element ClusteringModel as the top-level type, and they share many other element types.

A cluster model basically consists of a set of clusters. For each cluster a center vector can be given. In center-based models a cluster is defined by a vector of center coordinates. Some distance measure is used to determine the nearest center, that is the nearest cluster for a given input record. For distribution-based models (e.g. in demographic clustering) the clusters are defined by their statistics. Some similarity measure is used to determine the best matching cluster for a given record. The center vectors then only approximate the clusters.

The model must contain information on the distance or similarity measure used for clustering. It may also contain information on the overall data distribution, such as a covariance matrix or other statistics. Names of coordinates in CenterFields, ClusteringFields and statistics must be consistent with the names of the fields in the data dictionary and in the transformation dictionary.


<!ELEMENT ClusteringModel ( Extension*, MiningSchema, ModelStats?,
                            ComparisonMeasure, ClusteringField*,
                            CenterFields?, Cluster+, Extension* ) >
<!ATTLIST ClusteringModel
    modelName                    CDATA                     #IMPLIED
    functionName                 %MINING-FUNCTION;         #REQUIRED
    algorithmName                CDATA                     #IMPLIED
    modelClass                   ( centerBased |
                                   distributionBased )     #REQUIRED
    numberOfClusters             %INT-NUMBER;              #REQUIRED
>
<!ELEMENT CenterFields ( DerivedField+ ) >

The attribute modelClass specifies whether the clusters are defined by center-vectors or whether they are defined by the statistics. The latter is used by demographic clustering.

The fields used in the center vectors are normalized. In particular, normalization makes it possible to map categorical input fields to numeric values in center vectors. See DerivedField and the DTD on normalization.

MiningField information (in MiningSchema) must be present for each active variable. For numeric variables it specifies the treatment of outliers. Note that there may be supplementary mining fields. The statistics for these fields are part of the model but they are not required to apply the model.

For each active MiningField, an element of type UnivariateStats (see ModelStats) holds information about the overall (background) population. This includes (required) DiscrStats or ContStats, which include possible field values and interval boundaries. Optionally, statistical information is included for the background data.

A cluster is defined by its center vector or by statistics. A center vector is implemented by a %NUM-ARRAY;. Each Partition corresponds to a cluster and holds field statistics to describe it. The definition of a cluster may contain a center vector as well as statistics. The attribute modelClass in the ClusteringModel defines which one is used to actually define the cluster.


<!ELEMENT Cluster (Extension*, (%NUM-ARRAY;)?, Partition?, Covariances?)>
<!ATTLIST Cluster
    name                         CDATA                     #IMPLIED
>

The %NUM-ARRAY; contains the center coordinates for the cluster. The correspondence between input fields and their coordinates is defined by CenterFields, see above. Note that categorical fields can have more than one coordinate, depending on the normalization method.

If some normalization is defined for input fields, then the center coordinates are defined using the normalized values. For numeric fields we could also use the values in the original domain. For categorical values, however, the center vector is not required to contain the indicator values 0.0 or 1.0. There can also be any values between 0.0 and 1.0. These indicate a distribution of categorical values, defining a kind of virtual center point.

If the cluster model contains statistics with mean values, then the center coordinates are not necessarily identical to the corresponding mean values. There may be differences depending on the kind of normalization of input values.

A covariance matrix stores coordinate-by-coordinate variances (diagonal cells) and covariances (non-diagonal cells).


<!ELEMENT Covariances   ( Matrix ) >

The covariance matrix must be symmetric, so only half of the non-diagonal covariance cells need to be stored. Missing covariance cells are reconstructed by symmetry. The sequence of rows/columns corresponds to the sequence of fields in MiningSchema. Note that Covariances does not contain information about internal fields that are defined in the TransformationDictionary.


<!ELEMENT ClusteringField ( Extension*, Comparisons? ) >
<!ATTLIST ClusteringField
    field                        %FIELD-NAME;                   #REQUIRED
    fieldWeight                  %REAL-NUMBER;                  #IMPLIED
    similarityScale              %REAL-NUMBER;                  #IMPLIED
    compareFunction              %COMPARE-FUNCTION;             #IMPLIED
>

field refers (by name) to a MiningField or to a DerivedField.

fieldWeight is the importance factor for the field. This field weight is used in the comparison functions in order to compute the comparison measure. The default value is 1.0.

similarityScale is the distance such that similarity becomes 0.5.

compareFunction is a function that takes two field values and a similarityScale and returns their similarity or distance. It can override the general specification of compareFunction in ComparisonMeasure.

For the computation of distances and similarities see below.


<!ELEMENT Comparisons ( Extension*, Matrix ) >

Comparisons is a matrix which contains the similarity values or distance values, depending on the attribute modelClass in ClusteringModel. The order of the rows and columns corresponds to the order of discrete values or intervals in that field.

Matrix

There are several kinds of matrices used within cluster models, e.g., to describe covariances and similarities. To save space, a matrix can be stored as a sparse matrix, a diagonal matrix, etc.


<!ELEMENT Matrix   ( (%NUM-ARRAY;)+  |  MatCell+ )? >
<!ATTLIST Matrix
    kind                      ( diagonal | symmetric | any ) "any"
    numberOfRows                  %INT-NUMBER;                   #IMPLIED
    numberOfCols                  %INT-NUMBER;                   #IMPLIED
    diagDefault                   %REAL-NUMBER;                  #IMPLIED
    offDiagDefault                %REAL-NUMBER;                  #IMPLIED
>

A matrix may be represented as a sequence of arrays. If the matrix is diagonal, then the content is just one array of numbers representing the diagonal values. Otherwise, each array contains elements of one row in the matrix. If the kind of the matrix is any, then all values are given. If the matrix is symmetric then the first array contains the matrix element M(0,0), the second array contains M(1,0), M(1,1), and so on (that's the lower left triangle). Other elements are defined by symmetry.
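The array-based representations described above can be expanded into a full matrix along the following lines. This is a minimal sketch, not part of the PMML definition: the function name expand_matrix and the parameter off_diag_default are assumptions made for illustration, and each array is assumed to have already been parsed into a Python list of numbers.

```python
def expand_matrix(kind, arrays, off_diag_default=None):
    """Expand the arrays of a Matrix into a list of full rows.

    kind             -- "diagonal", "symmetric" or "any"
    arrays           -- list of lists of numbers, one per array element
    off_diag_default -- hypothetical filler for cells that a diagonal
                        matrix does not store (None = unspecified)
    """
    if kind == "diagonal":
        # A single array holds the diagonal values.
        diag = arrays[0]
        n = len(diag)
        return [[diag[i] if i == j else off_diag_default for j in range(n)]
                for i in range(n)]
    if kind == "symmetric":
        # Row i holds M(i,0) .. M(i,i), i.e. the lower left triangle.
        n = len(arrays)
        full = [[None] * n for _ in range(n)]
        for i, row in enumerate(arrays):
            for j, value in enumerate(row):
                full[i][j] = value
                full[j][i] = value  # mirrored cell, defined by symmetry
        return full
    if kind == "any":
        # Every row is given completely.
        return [list(row) for row in arrays]
    raise ValueError(f"unknown matrix kind: {kind!r}")
```

For example, the three arrays [1], [2, 3] and [4, 5, 6] of a symmetric matrix expand to the rows [1, 2, 4], [2, 3, 5] and [4, 5, 6].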

A sparse matrix may also be represented in a compact form as an enumeration of MatCell. Each MatCell contains the numeric value of a single cell. In this case, diagonal has no significance for the matrix representation.


<!ELEMENT MatCell (#PCDATA)>
<!ATTLIST MatCell
    row                       %INT-NUMBER;                   #REQUIRED
    col                       %INT-NUMBER;                   #REQUIRED
>

Evaluating a matrix element M(i,j) proceeds as follows; the first rule that applies determines the value:

The element is explicitly given, either in a MatCell with row=i and col=j, or in the j-th element of the i-th array of Matrix.
The attribute kind of the matrix is symmetric, and the element M(j,i) is explicitly given.
A default value is given, either in the attribute diagDefault if i=j, or in the attribute offDiagDefault otherwise.
No value can be calculated at this step. Calculation is done only if a default behavior or additional information is given at a higher level.
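The evaluation rules for M(i,j) can be sketched as a single lookup function. This is an illustration under stated assumptions, not a normative algorithm: the name matrix_element is hypothetical, the explicitly given cells are assumed to have been collected into a dictionary keyed by (row, col), and row and column indices are assumed to start at 0, as in the M(0,0) notation used for arrays.

```python
def matrix_element(i, j, cells, kind="any",
                   diag_default=None, off_diag_default=None):
    """Return M(i,j), or None if no value is available at this level.

    cells            -- dict {(row, col): value} of explicitly given cells
    kind             -- value of the Matrix attribute kind
    diag_default     -- value of diagDefault, or None if absent
    off_diag_default -- value of offDiagDefault, or None if absent
    """
    # Rule 1: the element is explicitly given.
    if (i, j) in cells:
        return cells[(i, j)]
    # Rule 2: the matrix is symmetric and M(j,i) is explicitly given.
    if kind == "symmetric" and (j, i) in cells:
        return cells[(j, i)]
    # Rule 3: fall back to the matching default attribute, if present.
    # Rule 4: if that is absent too, None is returned and the caller
    # must rely on information given at a higher level.
    return diag_default if i == j else off_diag_default
```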

Distance or Similarity Measure

When two records are compared then either the distance or the similarity is of interest. In both cases the measures can be computed by a combination of an 'inner' function and an 'outer' function. The inner function compares two single field values and the outer function computes an aggregation over all fields.

Each field has a comparison function, which is either defined as a default for the ClusteringModel (attribute compareFunction in ComparisonMeasure) or specified per ClusteringField. Given two field values x and y, the inner function can be one of:

absDiff: absolute difference: c(x,y) = |x-y|
gaussSim: Gaussian similarity: c(x,y) = exp(-ln(2)*z*z/(s*s)) where z=x-y, and s is the value of attribute similarityScale (required in this case) in the ClusteringField
delta: c(x,y) = 0 if x=y, 1 else
equal: c(x,y) = 1 if x=y, 0 else
table: c(x,y) = lookup in similarity matrix


<!ENTITY % COMPARE-FUNCTION "(absDiff | gaussSim | delta 
    | equal | table)" >
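The inner functions named in COMPARE-FUNCTION can be sketched in Python as follows. This is a minimal illustration, not part of the PMML definition: the Python function names are assumptions, and so are the arguments values and matrix of table, which stand for the ordered discrete values of the field and a fully expanded similarity matrix given as nested lists.

```python
import math


def abs_diff(x, y):
    """absDiff: absolute difference |x-y|."""
    return abs(x - y)


def gauss_sim(x, y, s):
    """gaussSim: exp(-ln(2)*z*z/(s*s)) with z = x-y and s = similarityScale."""
    z = x - y
    return math.exp(-math.log(2) * z * z / (s * s))


def delta(x, y):
    """delta: 0 if the values are equal, 1 otherwise."""
    return 0 if x == y else 1


def equal(x, y):
    """equal: 1 if the values are equal, 0 otherwise."""
    return 1 if x == y else 0


def table(x, y, values, matrix):
    """table: look up the pair (x, y) in a similarity matrix.

    Assumes rows and columns follow the order of `values`.
    """
    return matrix[values.index(x)][values.index(y)]
```

With gauss_sim, two values that differ by exactly s have similarity 0.5, which matches the definition of similarityScale.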

ComparisonMeasure

Each ClusteringModel has exactly one aggregation function. The attribute kind in ComparisonMeasure determines how the aggregated value is interpreted: for a distance measure, 0 is the optimal value and smaller values indicate a better fit; for a similarity measure, greater values indicate a better fit.

The following aggregation functions are defined by PMML. Wi is the fieldWeight in ClusteringField, Xi and Yi are the field values being compared, and c is the inner comparison function.

euclidean: kind=distance

D = (sum( Wi * c(Xi,Yi)^2 ))^(1/2)

squaredEuclidean, aka squared: kind=distance

D = sum( Wi * c(Xi,Yi)^2)

euclidean and squaredEuclidean are essentially equivalent for cluster assignment: the square root is monotonic, so both measures rank the clusters identically and select the same nearest cluster.

chebychev, aka maximum: kind=distance

D = max (Wi*c(Xi,Yi)) over all i

cityBlock, aka sum:

D = sum (Wi*c(Xi,Yi))

minkowski: p-parameter>0

D = (sum |Xi-Yi|^p)^(1/p)
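The numeric aggregation functions can be sketched in Python as follows. This is an illustration only: the function names and argument layout are assumptions. The argument c is the list of per-field comparison values c(Xi,Yi) already computed by the inner function, and w is the list of field weights Wi in the same order. minkowski follows the formula as given here, which operates directly on the field values and uses no weights.

```python
import math


def euclidean(c, w):
    """D = (sum(Wi * c(Xi,Yi)^2))^(1/2)"""
    return math.sqrt(sum(wi * ci ** 2 for wi, ci in zip(w, c)))


def squared_euclidean(c, w):
    """D = sum(Wi * c(Xi,Yi)^2)"""
    return sum(wi * ci ** 2 for wi, ci in zip(w, c))


def chebychev(c, w):
    """D = max(Wi * c(Xi,Yi)) over all i"""
    return max(wi * ci for wi, ci in zip(w, c))


def city_block(c, w):
    """D = sum(Wi * c(Xi,Yi))"""
    return sum(wi * ci for wi, ci in zip(w, c))


def minkowski(x, y, p):
    """D = (sum |Xi-Yi|^p)^(1/p), with p-parameter > 0."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)
```

For comparison values [3, 4] with unit weights, euclidean yields 5, squared_euclidean 25, chebychev 4 and city_block 7.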

For binary or categorical data, let two individuals X and Y compare their values for each attribute, and

a11 = number of times where Xi=1 and Yi=1

a10 = number of times where Xi=1 and Yi=0

a01 = number of times where Xi=0 and Yi=1

a00 = number of times where Xi=0 and Yi=0

simpleMatching: kind=similarity, min=0, max=1: D = ( a11 + a00 ) / ( a11 + a10 + a01 + a00 )
jaccard: kind=similarity: D = ( a11 ) / ( a11 + a10 + a01 )
tanimoto: kind=similarity, min=0, max=1: D = ( a11 + a00 ) / ( a11 + 2*(a10+a01) + a00 )
binarySimilarity: kind=similarity, min=0, max=1, c.. and d.. are parameters > 0: D = ( c11*a11 + c10*a10 + c01*a01 + c00*a00 ) / ( d11*a11 + d10*a10 + d01*a01 + d00*a00 )
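The counts a11, a10, a01, a00 and the binary similarity measures built on them can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, records are given as equal-length lists of 0/1 values, and the c and d parameters of binarySimilarity are passed as dictionaries keyed by "11", "10", "01" and "00". A zero denominator (for example jaccard on two all-zero records) is not handled here.

```python
def binary_counts(x, y):
    """Return (a11, a10, a01, a00) for two equal-length 0/1 records."""
    pairs = list(zip(x, y))
    a11 = sum(1 for xi, yi in pairs if xi == 1 and yi == 1)
    a10 = sum(1 for xi, yi in pairs if xi == 1 and yi == 0)
    a01 = sum(1 for xi, yi in pairs if xi == 0 and yi == 1)
    a00 = sum(1 for xi, yi in pairs if xi == 0 and yi == 0)
    return a11, a10, a01, a00


def simple_matching(x, y):
    a11, a10, a01, a00 = binary_counts(x, y)
    return (a11 + a00) / (a11 + a10 + a01 + a00)


def jaccard(x, y):
    a11, a10, a01, _ = binary_counts(x, y)
    return a11 / (a11 + a10 + a01)


def tanimoto(x, y):
    a11, a10, a01, a00 = binary_counts(x, y)
    return (a11 + a00) / (a11 + 2 * (a10 + a01) + a00)


def binary_similarity(x, y, c, d):
    """Ratio of two weighted sums of the counts; c and d are dicts."""
    a11, a10, a01, a00 = binary_counts(x, y)
    num = c["11"] * a11 + c["10"] * a10 + c["01"] * a01 + c["00"] * a00
    den = d["11"] * a11 + d["10"] * a10 + d["01"] * a01 + d["00"] * a00
    return num / den
```

For X = [1, 1, 0, 0, 1] and Y = [1, 0, 1, 0, 1] the counts are a11=2, a10=1, a01=1, a00=1, so simpleMatching is 3/5, jaccard is 2/4 and tanimoto is 3/7.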


<!ELEMENT ComparisonMeasure (Extension*,
       ( euclidean      | squaredEuclidean     | chebychev |
         cityBlock  | minkowski  | simpleMatching |
         jaccard  | tanimoto  | binarySimilarity ) )
>



<!ATTLIST ComparisonMeasure
    kind                        (distance |similarity)    #REQUIRED
    compareFunction             %COMPARE-FUNCTION;        #IMPLIED
    minimum                     %NUMBER;                  #IMPLIED
    maximum                     %NUMBER;                  #IMPLIED
>

<!ELEMENT    euclidean                 EMPTY>
<!ELEMENT    squaredEuclidean          EMPTY>
<!ELEMENT    cityBlock                 EMPTY>
<!ELEMENT    chebychev                 EMPTY>
<!ELEMENT    minkowski                 EMPTY>
<!ATTLIST    minkowski p-parameter %NUMBER;             #REQUIRED>
<!ELEMENT    simpleMatching            EMPTY>
<!ELEMENT    jaccard                   EMPTY>
<!ELEMENT    tanimoto                  EMPTY>
<!ELEMENT    binarySimilarity          EMPTY>
<!ATTLIST    binarySimilarity
     c00-parameter               %NUMBER;                  #REQUIRED
     c01-parameter               %NUMBER;                  #REQUIRED
     c10-parameter               %NUMBER;                  #REQUIRED
     c11-parameter               %NUMBER;                  #REQUIRED
     d00-parameter               %NUMBER;                  #REQUIRED
     d01-parameter               %NUMBER;                  #REQUIRED
     d10-parameter               %NUMBER;                  #REQUIRED
     d11-parameter               %NUMBER;                  #REQUIRED
>

The attributes 'minimum' and 'maximum' in ComparisonMeasure are not used in the scoring function. They are for information only, describing the minimum and maximum possible value in the comparison function.

Conformance

Center-based clustering:

comparison function 'absDiff' (that is, the value of attribute 'compareFunction' in 'ComparisonMeasure') is in core, other comparison functions are not in core.
aggregation function 'squaredEuclidean' is in core, other aggregation functions are not in core.

Distribution-based clustering:

comparison functions 'gaussSim', 'equal' and 'table' are in core, other comparison functions are not in core.
aggregation function 'cityBlock' is in core, other aggregation functions are not in core.

Example for a center-based clustering model

<?xml version="1.0" ?>
<PMML version="2.0">
     <Header copyright="dmg.org"/>
     <DataDictionary numberOfFields="3">
          <DataField name="marital status" optype="categorical">
               <Value value="s"/>
               <Value value="d"/>
               <Value value="m"/>
          </DataField>
          <DataField name="age" optype="continuous"/>
          <DataField name="salary" optype="continuous"/>
     </DataDictionary>
     <ClusteringModel modelName="Mini Clustering" 
          functionName="clustering"
          modelClass="centerBased" 
          numberOfClusters="2">
          <MiningSchema>
               <MiningField name="marital status"/>
               <MiningField name="age"/>
               <MiningField name="salary"/>
          </MiningSchema>
     <ComparisonMeasure kind="distance">
          <squaredEuclidean/>
     </ComparisonMeasure>
     <ClusteringField field="marital status"
                      compareFunction="absDiff"/>
     <ClusteringField field="age"
                      compareFunction="absDiff"/>
     <ClusteringField field="salary"
                      compareFunction="absDiff"/>
     <CenterFields>
       <DerivedField name="c1">
          <NormContinuous field="age">
               <LinearNorm orig="45" norm="0"/>
               <LinearNorm orig="82" norm="0.5"/>
               <LinearNorm orig="105" norm="1"/>
          </NormContinuous>
       </DerivedField>
       <DerivedField name="c2">
          <NormContinuous field="salary">
               <LinearNorm orig="39000" norm="0"/>
               <LinearNorm orig="39800" norm="0.5"/>
               <LinearNorm orig="41000" norm="1"/>
          </NormContinuous>
       </DerivedField>
       <DerivedField name="c3">
          <NormDiscrete field="marital status" value="m"/>
       </DerivedField>
       <DerivedField name="c4">
          <NormDiscrete field="marital status" value="d"/>
       </DerivedField>
       <DerivedField name="c5">
          <NormDiscrete field="marital status" value="s"/>
       </DerivedField>
     </CenterFields>
     <Cluster name="marital status is d or s">
          <Array n="5" type="real">
               0.524561 0.486321 0.128427 0.459188 0.412384</Array>
     </Cluster>
     <Cluster name="marital status is m">
          <Array n="5" type="real">
               0.69946 0.419037 0.591226 0.173521 0.235253</Array>
     </Cluster>
     </ClusteringModel>
</PMML>
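Applying the example model to a record can be sketched in Python as follows. This is an illustration under explicit assumptions rather than a reference scorer: it assumes the squaredEuclidean aggregation with the absDiff comparison function and unit field weights (the core configuration for center-based clustering), it treats LinearNorm as piecewise-linear interpolation between the given points, and it rejects values outside the normalization range because outlier treatment is not covered here. All Python names are hypothetical.

```python
# (orig, norm) pairs copied from the LinearNorm elements of the example.
AGE_NORM = [(45, 0.0), (82, 0.5), (105, 1.0)]
SALARY_NORM = [(39000, 0.0), (39800, 0.5), (41000, 1.0)]

# Center vectors copied from the Cluster elements (order c1..c5).
CLUSTERS = {
    "marital status is d or s":
        [0.524561, 0.486321, 0.128427, 0.459188, 0.412384],
    "marital status is m":
        [0.69946, 0.419037, 0.591226, 0.173521, 0.235253],
}


def linear_norm(x, points):
    """Piecewise-linear interpolation through (orig, norm) pairs."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    # Assumption: outlier handling is out of scope for this sketch.
    raise ValueError("value outside the normalization range")


def center_coordinates(record):
    """Map a record to the coordinates c1..c5 defined by CenterFields."""
    status = record["marital status"]
    return [
        linear_norm(record["age"], AGE_NORM),        # c1
        linear_norm(record["salary"], SALARY_NORM),  # c2
        1.0 if status == "m" else 0.0,               # c3 (NormDiscrete)
        1.0 if status == "d" else 0.0,               # c4
        1.0 if status == "s" else 0.0,               # c5
    ]


def assign_cluster(record):
    """Return the name of the cluster with the smallest squared distance."""
    x = center_coordinates(record)

    def squared_distance(center):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, center))

    return min(CLUSTERS, key=lambda name: squared_distance(CLUSTERS[name]))
```

For instance, a record with age 60, salary 39500 and marital status m has normalized coordinates of roughly (0.203, 0.3125, 1, 0, 0) and is assigned to the cluster named "marital status is m".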