Data Mining Group - PMML Clustering Model

PMML 1.1 -- DTD for Clustering Models

PMML defines two classes of clustering models: center-based and distribution-based. Both use the DTD element ClusteringModel as the top-level type, and they share many other element types.

A cluster model basically consists of a set of clusters. For each cluster a center vector can be given. In center-based models a cluster is defined by a vector of center coordinates. Some distance measure is used to determine the nearest center, that is, the nearest cluster for a given input record. For distribution-based models (e.g. in demographic clustering) the clusters are defined by their statistics. Some similarity measure is used to determine the best matching cluster for a given record. The center vectors then only approximate the clusters.

The model must contain information on the distance or similarity measure used for clustering. It may also contain information on the overall data distribution, such as a covariance matrix or other statistics. Names of coordinates in CenterFields, ClusteringFields, and statistics must be consistent with the names of the variables in the data dictionary.


	<!ELEMENT ClusteringModel ( Extension*, MiningSchema, ModelStats?, ComparisonMeasure, 
	     ClusteringField*, CenterFields?, Cluster+ ) >
  
	<!ATTLIST ClusteringModel 
	     modelName          CDATA                                  #IMPLIED 
	     modelClass         ( centerBased | distributionBased )    #REQUIRED 
	     numberOfClusters   %INT-NUMBER;                           #REQUIRED
	>
	     
	<!ELEMENT CenterFields ( (%NORM-INPUT;)+ ) >

The attribute modelClass specifies whether the clusters are defined by center-vectors or whether they are defined by the statistics. The latter is used by demographic clustering.

The fields used in the center vectors are normalized. In particular, normalization makes it possible to map categorical input fields to numeric values in center vectors. For more information, see the DTD on normalization.
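
As a rough illustration of how normalization can map input values to center coordinates, the sketch below assumes that NormContinuous with LinearNorm points denotes piecewise-linear interpolation and that NormDiscrete yields a 0/1 indicator. Both semantics belong to the normalization DTD, not to this section, so treat them as assumptions. The function names are hypothetical; the LinearNorm points for age are those of the example model at the end of this document.

```python
from bisect import bisect_right

def norm_continuous(x, points):
    """Piecewise-linear interpolation through (orig, norm) pairs.

    Assumption of this sketch: values outside the covered range are
    extrapolated along the nearest segment. The normalization DTD is
    authoritative for the real behavior.
    """
    points = sorted(points)
    origs = [o for o, _ in points]
    # Index of the segment containing x, clamped to the first/last segment.
    k = min(max(bisect_right(origs, x) - 1, 0), len(points) - 2)
    (o0, n0), (o1, n1) = points[k], points[k + 1]
    return n0 + (x - o0) * (n1 - n0) / (o1 - o0)

def norm_discrete(value, category):
    """Indicator coordinate: 1.0 if the field value equals `category`, else 0.0."""
    return 1.0 if value == category else 0.0

# LinearNorm points for the field "age" in the example model.
AGE_POINTS = [(45, 0.0), (82, 0.5), (105, 1.0)]

age = norm_continuous(82, AGE_POINTS)
mid = norm_continuous(63.5, AGE_POINTS)
# One categorical field becomes three coordinates, one per NormDiscrete.
indicators = [norm_discrete("m", c) for c in ("m", "d", "s")]
print(age, mid, indicators)
```

A categorical field therefore contributes one coordinate per NormDiscrete entry, which is why a center vector can be longer than the list of input fields.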

MiningField information (in MiningSchema) must be present for each active variable. For numeric variables it specifies the treatment of outliers. Note that there may be supplementary mining fields. The statistics for these fields are part of the model but they are not required to apply the model.

For each active MiningField, UnivariateStats (in ModelStats) holds information about the overall (background) population. This includes (required) DiscrStats or ContStats, which include possible field values and interval boundaries. Optionally, statistical information is included for the background data.

Each Partition corresponds to a cluster and holds a center vector and/or field statistics to describe it.


	<!ELEMENT Cluster (Extension*, (%NUM-ARRAY;)?, Partition?, Covariances?)>
 
	<!ATTLIST Cluster 
	     name          CDATA     #IMPLIED
	>

The %NUM-ARRAY; contains the center coordinates for the cluster. The correspondence between input fields and their coordinates is defined by CenterFields, see ClusteringModel. Note that categorical fields can have more than one coordinate, depending on the normalization method (see %NORM-INPUT;).

If some normalization is defined for input fields, then the center coordinates are defined using the normalized values. For numeric fields we could also use the values in the original domain. For categorical values, however, the center vector is not required to contain the indicator values 0.0 or 1.0. Any value between 0.0 and 1.0 is also allowed. Such values indicate a distribution of categorical values, defining a kind of virtual center point.

If the cluster model contains statistics with mean values, then the center coordinates are not necessarily identical to the corresponding mean values. There may be differences depending on the kind of normalization of input values.


	<!ELEMENT Covariances ( Matrix ) >

A covariance matrix stores coordinate-by-coordinate variances (diagonal cells) and covariances (non-diagonal cells). The covariance matrix must be symmetric, so only half of the non-diagonal covariance cells need to be stored. Missing covariance cells are reconstructed by symmetry. The sequence of rows/columns corresponds to the sequence in MiningSchema.


	<!ELEMENT ClusteringField ( Extension*, Comparisons? ) >
 
	<!ATTLIST ClusteringField 
	     field               %FIELD-NAME;          #REQUIRED 
	     fieldWeight         %REAL-NUMBER;         #IMPLIED 
	     similarityScale     %REAL-NUMBER;         #IMPLIED 
	     compareFunction     %CMP-FCT;             #IMPLIED
	>

Attribute List:

field: references (the name of) a MiningField.

fieldWeight: is the importance factor for the field.

similarityScale: is the distance such that similarity becomes 0.5.

compareFunction: is a function taking two field values and a similarityScale to define similarity/distance. It can override the general specification of compareFunction in ComparisonMeasure. For the computation of distances and similarities see below.


	<!ELEMENT Comparisons ( Extension*, Matrix ) >

Comparisons is a matrix which contains the similarity values or distance values, depending on the attribute modelClass in ClusteringModel. The order of the rows and columns corresponds to the order of discrete values or intervals in that field.

Matrix

There are several kinds of matrices which are used within cluster models, e.g., to describe covariances and similarities. In order to save space, a matrix can be stored as a sparse matrix, a diagonal matrix, etc.


	<!ELEMENT Matrix  ( (%NUM-ARRAY;)+ |  MatCell+ )? > 
	
	<!ATTLIST Matrix
	     kind             ( diagonal | symmetric | any )      "any" 
	     numberOfRows     %INT-NUMBER;                        #IMPLIED 
	     numberOfCols     %INT-NUMBER;                        #IMPLIED 
	     diagDefault      %REAL-NUMBER;                       #IMPLIED 
	     offDiagDefault   %REAL-NUMBER;                       #IMPLIED
	>

A matrix may be represented as a sequence of arrays. If the matrix is diagonal, then the content is just one array of numbers representing the diagonal values. Otherwise, each array contains elements of one row in the matrix. If the kind of the matrix is any, then all values are given. If the matrix is symmetric then the first array contains the matrix element M(0,0), the second array contains M(1,0), M(1,1), and so on (that's the lower left triangle). Other elements are defined by symmetry.
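
The array layouts described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name expand_matrix and the list-of-lists input are hypothetical, and filling the off-diagonal cells of a diagonal matrix with 0.0 is an assumption of the sketch (the offDiagDefault attribute is not handled here).

```python
def expand_matrix(kind, arrays):
    """Build a dense list-of-lists matrix from the arrays of a Matrix element.

    kind:   "diagonal", "symmetric" or "any" (the Matrix attribute `kind`).
    arrays: one list of floats per array child, in document order.
    """
    if kind == "diagonal":
        diag = arrays[0]
        size = len(diag)
        # Assumption: off-diagonal cells are 0.0 in this sketch.
        return [[diag[i] if i == j else 0.0 for j in range(size)]
                for i in range(size)]
    if kind == "symmetric":
        size = len(arrays)
        # Row i holds M(i,0)..M(i,i); the upper triangle follows by symmetry.
        return [[arrays[max(i, j)][min(i, j)] for j in range(size)]
                for i in range(size)]
    if kind == "any":
        # Every row is given in full.
        return [list(row) for row in arrays]
    raise ValueError(f"unknown matrix kind: {kind}")

lower = [[1.0], [0.5, 2.0], [0.1, 0.2, 3.0]]
full = expand_matrix("symmetric", lower)
diag = expand_matrix("diagonal", [[1.0, 2.0]])
print(full)
print(diag)
```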

A sparse matrix may also be represented in a compact form as an enumeration of MatCell. Each MatCell contains the numeric value of a single cell. In this case, diagonal has no significance for the matrix representation.


	<!ATTLIST MatCell 
	     row            %INT-NUMBER;          #REQUIRED 
	     col            %INT-NUMBER;          #REQUIRED
	>

Evaluating a matrix element M(i,j) proceeds as follows; the first rule that applies determines the value:

1. The element is explicitly given, either in a MatCell with row=i and col=j, or in the j-th element of the i-th array of Matrix.
2. The attribute kind of the matrix is symmetric, and the element M(j,i) is explicitly given.
3. A default value is given, either in the attribute diagDefault if i=j, or in the attribute offDiagDefault.
4. No value can be calculated at this step. Calculation will be done only if a default behavior or additional information is given at a higher level.
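
The lookup order above can be sketched as a small fallback chain. The function name matrix_element and the dictionary of explicitly given cells are hypothetical; returning None stands for the case where no value can be calculated at this level.

```python
def matrix_element(i, j, cells, kind="any",
                   diag_default=None, off_diag_default=None):
    """Evaluate M(i,j) from explicitly given cells and Matrix attributes.

    cells: dict mapping (row, col) to a value, collected from MatCell
           elements or from the arrays of the Matrix (hypothetical layout).
    """
    # The element is explicitly given.
    if (i, j) in cells:
        return cells[(i, j)]
    # A symmetric matrix may provide the mirrored element instead.
    if kind == "symmetric" and (j, i) in cells:
        return cells[(j, i)]
    # Fall back to diagDefault or offDiagDefault, if present.
    # None means: no value can be calculated at this step.
    return diag_default if i == j else off_diag_default

cells = {(0, 0): 1.0, (1, 0): 0.3}
mirrored = matrix_element(0, 1, cells, kind="symmetric")
defaulted = matrix_element(2, 2, cells, kind="symmetric", diag_default=1.0)
missing = matrix_element(0, 2, cells, kind="symmetric")
print(mirrored, defaulted, missing)
```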

Distance or Similarity Measure

When two records are compared then either the distance or the similarity is of interest. In both cases the measures can be computed by a combination of an 'inner' function and an 'outer' function. The inner function compares two single field values and the outer function computes an aggregation over all fields.

Each field has a comparison function. It is either defined as a default for the model (attribute compareFunction in ComparisonMeasure) or it can be defined per ClusteringField. Given two field values x and y, the inner function can be one of:

absDiff: absolute difference

c(x,y) = |x-y|

gaussSim: gaussian similarity

c(x,y) = exp(-ln(2)*z*z/(s*s))

where z=x-y, and s is the value of the attribute similarityScale (required in this case) in the ClusteringField.

delta:

c(x,y) = 0 if x=y, 1 else

equal:

c(x,y) = 1 if x=y, 0 else

table:

c(x,y) = lookup in similarity matrix


	<!ENTITY % CMP-FCT "(absDiff | gaussSim | delta | equal | table)" >
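
The inner functions listed above can be sketched in a few lines. The dispatcher name compare and its keyword arguments are hypothetical; for table, the sketch assumes that the rows and columns of the Comparisons matrix follow the order of the field's discrete values, as stated for Comparisons.

```python
import math

def compare(fct, x, y, scale=None, values=None, matrix=None):
    """Inner comparison c(x,y) for one field, selected by compareFunction.

    scale:  similarityScale of the ClusteringField (needed for gaussSim).
    values: ordered discrete values of the field (needed for table).
    matrix: dense Comparisons matrix as a list of rows (needed for table).
    """
    if fct == "absDiff":
        return abs(x - y)
    if fct == "gaussSim":
        if scale is None:
            raise ValueError("gaussSim requires similarityScale")
        z = x - y
        return math.exp(-math.log(2) * z * z / (scale * scale))
    if fct == "delta":
        return 0.0 if x == y else 1.0
    if fct == "equal":
        return 1.0 if x == y else 0.0
    if fct == "table":
        return matrix[values.index(x)][values.index(y)]
    raise ValueError(f"unknown compareFunction: {fct}")

# At a distance equal to similarityScale, gaussSim yields 0.5.
half = compare("gaussSim", 10.0, 13.0, scale=3.0)

# Illustrative similarity table for a three-valued field (made-up numbers).
status = ["s", "d", "m"]
sim = [[1.0, 0.2, 0.1],
       [0.2, 1.0, 0.4],
       [0.1, 0.4, 1.0]]
looked_up = compare("table", "d", "m", values=status, matrix=sim)
print(half, looked_up)
```

The gaussSim case shows why similarityScale is described as the distance at which similarity becomes 0.5: for z equal to s the exponent reduces to -ln(2).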

Per ClusteringModel there is one aggregation function. Depending on the attribute kind in ComparisonMeasure, the aggregated value is optimal if it is 0 (for a distance measure), or greater values indicate a better fit (for a similarity measure).

Wi is the fieldWeight in ClusteringField. Xi and Yi are the field values being compared.

euclidean: kind=distance

D = (sum( Wi * c(Xi,Yi)^2 ))^(1/2)

squaredEuclidean, aka squared: kind=distance

D = sum( Wi * c(Xi,Yi)^2)

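euclidean and squaredEuclidean are essentially equivalent for cluster assignment: the square root is monotonic, so both rank the clusters in the same order, and only the cluster with the minimum distance is needed in order to assign a cluster.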

chebychev, aka maximum: kind=distance

D = max (Wi*c(Xi,Yi)) over all i

cityBlock, aka sum:

D = sum (Wi*c(Xi,Yi))

minkowski: p-parameter>0

D = (sum |Xi-Yi|^p)^(1/p)
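
The distance aggregations above can be sketched as follows, together with a nearest-center assignment. The function names, the default weight of 1.0 for fields without fieldWeight, and the two-dimensional centers are assumptions made for illustration only. The minkowski branch follows the formula as written above, which uses neither weights nor a separate inner function.

```python
def aggregate(measure, c, w=None, p=None):
    """Outer function: aggregate per-field comparison values c[i] = c(Xi,Yi)."""
    if w is None:
        w = [1.0] * len(c)  # assumption: unweighted fields count as 1.0
    if measure == "euclidean":
        return sum(wi * ci ** 2 for wi, ci in zip(w, c)) ** 0.5
    if measure == "squaredEuclidean":
        return sum(wi * ci ** 2 for wi, ci in zip(w, c))
    if measure == "chebychev":
        return max(wi * ci for wi, ci in zip(w, c))
    if measure == "cityBlock":
        return sum(wi * ci for wi, ci in zip(w, c))
    if measure == "minkowski":
        return sum(abs(ci) ** p for ci in c) ** (1.0 / p)
    raise ValueError(f"unknown aggregation: {measure}")

def nearest_cluster(record, centers, measure, w=None, p=None):
    """Index of the center with the smallest distance, using absDiff per field."""
    dists = [aggregate(measure, [abs(x - y) for x, y in zip(record, center)], w, p)
             for center in centers]
    return min(range(len(centers)), key=dists.__getitem__)

# Made-up normalized record and centers, for illustration only.
record = [0.2, 0.9]
centers = [[0.0, 0.0], [0.5, 1.0]]
by_squared = nearest_cluster(record, centers, "squaredEuclidean")
by_euclid = nearest_cluster(record, centers, "euclidean")
print(by_squared, by_euclid)
```

Both calls select the same center, which is the sense in which euclidean and squaredEuclidean are interchangeable for assignment.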

For binary or categorical data, let two individuals X and Y compare their values for each attribute, and:

a11 = number of times where Xi=1 and Yi=1
a10 = number of times where Xi=1 and Yi=0
a01 = number of times where Xi=0 and Yi=1
a00 = number of times where Xi=0 and Yi=0

simpleMatching: kind=similarity min=0 max=1

D = ( a11 + a00 ) / ( a11 + a10 + a01 + a00 )

jaccard: kind=similarity

D = ( a11 ) / ( a11 + a10 + a01 )

tanimoto: kind=similarity min=0 max=1

D = ( a11 + a00 ) / ( a11 + 2*(a10+a01) + a00 )

binarySimilarity: kind=similarity min=0 max=1, c.. and d.. are parameters > 0.

D = ( c11*a11 + c10*a10 + c01*a01 + c00*a00 ) / ( d11*a11 + d10*a10 + d01*a01 + d00*a00 )
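
A small sketch of the binary measures may help to check the definitions of a11, a10, a01 and a00. All function names are hypothetical, the input vectors are made up, and the sketch does not guard against a zero denominator.

```python
def binary_counts(x, y):
    """Count the four value combinations over two 0/1 vectors of equal length."""
    a = {"11": 0, "10": 0, "01": 0, "00": 0}
    for xi, yi in zip(x, y):
        a[f"{int(xi)}{int(yi)}"] += 1
    return a

def simple_matching(a):
    return (a["11"] + a["00"]) / (a["11"] + a["10"] + a["01"] + a["00"])

def jaccard(a):
    return a["11"] / (a["11"] + a["10"] + a["01"])

def tanimoto(a):
    return (a["11"] + a["00"]) / (a["11"] + 2 * (a["10"] + a["01"]) + a["00"])

def binary_similarity(a, c, d):
    """c and d map "11", "10", "01", "00" to the c..- and d..-parameters."""
    return sum(c[k] * a[k] for k in a) / sum(d[k] * a[k] for k in a)

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
a = binary_counts(x, y)
c = {"11": 1, "10": 1, "01": 1, "00": 1}
d = {"11": 1, "10": 2, "01": 2, "00": 1}
print(a, simple_matching(a), jaccard(a), tanimoto(a), binary_similarity(a, c, d))
```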


	<!ELEMENT ComparisonMeasure (Extension*, ( euclidean | squaredEuclidean
	     | chebychev | cityBlock | minkowski | simpleMatching | jaccard  | tanimoto 
	     | binarySimilarity ))>
	
	<!ATTLIST ComparisonMeasure
	     kind                (distance | similarity)       #REQUIRED
	     compareFunction     %CMP-FCT;                     #IMPLIED 
	     minimum             %NUMBER;                      #IMPLIED 
	     maximum             %NUMBER;		       #IMPLIED
	> 

	<!ELEMENT euclidean EMPTY> 
	<!ELEMENT squaredEuclidean EMPTY> 
	<!ELEMENT cityBlock EMPTY> 
	<!ELEMENT chebychev EMPTY> 
	<!ELEMENT minkowski EMPTY>
	 
	<!ATTLIST minkowski p-parameter %NUMBER; #REQUIRED> 
	
	<!ELEMENT simpleMatching EMPTY> 
	<!ELEMENT jaccard EMPTY>
	<!ELEMENT tanimoto EMPTY> 
	<!ELEMENT binarySimilarity EMPTY> 
	
	<!ATTLIST binarySimilarity 
	     c00-parameter       %NUMBER;           #REQUIRED 
	     c01-parameter       %NUMBER;           #REQUIRED 
	     c10-parameter       %NUMBER;           #REQUIRED 
	     c11-parameter       %NUMBER;           #REQUIRED
	     d00-parameter       %NUMBER;           #REQUIRED
	     d01-parameter       %NUMBER;           #REQUIRED 
	     d10-parameter       %NUMBER;           #REQUIRED
	     d11-parameter       %NUMBER;           #REQUIRED
	>

Conformance

Center-based clustering:

comparison function 'absDiff' (that is, the value of attribute 'compareFunction' in 'ComparisonMeasure') is in core, other comparison functions are not in core.
aggregation function 'squaredEuclidean' is in core, other aggregation functions are not in core.

Distribution-based clustering:

comparison functions 'gaussSim', 'equal' and 'table' are in core, other comparison functions are not in core.
aggregation function 'cityBlock' is in core, other aggregation functions are not in core.

Example for a center-based clustering model


	<?xml version="1.0" ?>
	<PMML version="1.1"> 
	<Header copyright="dmg.org"/>
	
	<DataDictionary numberOfFields="3">
	<DataField name="marital status" optype="categorical">
	<Value value="s"/>
	<Value value="d"/>
	<Value value="m"/>
	</DataField>
	<DataField name="age" optype="continuous"/>
	<DataField name="salary" optype="continuous"/>
	</DataDictionary> 
	<ClusteringModel modelName="Mini Clustering" modelClass="centerBased" 
	     numberOfClusters="2"> 

	<MiningSchema>
	<MiningField name="marital status"/>
	<MiningField name="age"/>
	<MiningField name="salary"/>
	</MiningSchema> 

	<ComparisonMeasure kind="distance">
	<squaredEuclidean/>
	</ComparisonMeasure>
	<ClusteringField field="marital status" compareFunction="absDiff"/>
	<ClusteringField field="age" compareFunction="absDiff"/>
	<ClusteringField field="salary" compareFunction="absDiff"/>
	<CenterFields>
	<NormContinuous field="age">
	<LinearNorm orig="45" norm="0"/>
	<LinearNorm orig="82" norm="0.5"/>
	<LinearNorm orig="105" norm="1"/>
	</NormContinuous>
	<NormContinuous field="salary"> 
	<LinearNorm orig="39000" norm="0"/> 
	<LinearNorm orig="39800" norm="0.5"/>
	<LinearNorm orig="41000" norm="1"/>
	</NormContinuous> 
	<NormDiscrete field="marital status" value="m"/> 
	<NormDiscrete field="marital status" value="d"/> 
	<NormDiscrete field="marital status" value="s"/>
	</CenterFields>
	<Cluster name="marital status is d or s">
	<Array n="5" type="real">0.524561 0.486321 0.128427 0.459188 0.412384</Array>
	</Cluster> 
	<Cluster name="marital status is m">
	<Array n="5" type="real">0.69946 0.419037 0.591226 0.173521 0.235253</Array>
	</Cluster>
	</ClusteringModel>
	</PMML>
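
To illustrate how the coordinates of a center vector line up with the entries of CenterFields, the sketch below reads a trimmed fragment of the example with Python's standard xml.etree.ElementTree module. The fragment keeps only CenterFields (without the LinearNorm children) and the first cluster, so it is not valid against the DTD; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the example model, for illustration only.
FRAGMENT = """
<ClusteringModel modelName="Mini Clustering" modelClass="centerBased">
  <CenterFields>
    <NormContinuous field="age"/>
    <NormContinuous field="salary"/>
    <NormDiscrete field="marital status" value="m"/>
    <NormDiscrete field="marital status" value="d"/>
    <NormDiscrete field="marital status" value="s"/>
  </CenterFields>
  <Cluster name="marital status is d or s">
    <Array n="5" type="real">0.524561 0.486321 0.128427 0.459188 0.412384</Array>
  </Cluster>
</ClusteringModel>
"""

def coordinate_labels(model):
    """One label per CenterFields entry, in document order."""
    labels = []
    for norm in model.find("CenterFields"):
        field, value = norm.get("field"), norm.get("value")
        labels.append(field if value is None else f"{field}={value}")
    return labels

def cluster_centers(model):
    """Map each cluster name to its list of center coordinates."""
    centers = {}
    for cluster in model.findall("Cluster"):
        array = cluster.find("Array")
        coords = [float(token) for token in array.text.split()]
        if len(coords) != int(array.get("n")):
            raise ValueError("Array length does not match attribute n")
        centers[cluster.get("name")] = coords
    return centers

model = ET.fromstring(FRAGMENT)
labels = coordinate_labels(model)
centers = cluster_centers(model)
for name, coords in centers.items():
    print(name, dict(zip(labels, coords)))
```

The categorical field marital status occupies three of the five coordinates, one per NormDiscrete entry.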