PMML Data Dictionary

PMML 1.1 -- Data Dictionary

The data dictionary contains definitions for fields as used in mining models. It specifies the types and value ranges. These definitions are assumed to be independent of the specific data sets used for training or building a specific model. A data dictionary can be shared by multiple models; statistics and other information related to the training set are stored within a model. See also the DTDs for statistics and mining fields.


 
					
	<!ENTITY % FIELD-NAME "CDATA" > 
	 
	<!ELEMENT DataDictionary (Extension*, DataField+) > 
	<!ATTLIST DataDictionary
	     numberOfFields    %INT-NUMBER;                         #IMPLIED
	>
	     
	<!ELEMENT DataField ( Extension*, (Interval* | Value*) ) >
	<!ATTLIST DataField
	     name         %FIELD-NAME;                              #REQUIRED
	     displayName  CDATA                                     #IMPLIED
	     optype       ( categorical | ordinal | continuous )    #REQUIRED
	     isCyclic     ( 0 | 1 )                                 "0"
	>
	     

The value 'numberOfFields' is the number of fields defined in the content of 'DataDictionary'; this number can be added for consistency checks. The name of a data field must be unique in the data dictionary. The displayName is a string which may be used by applications to refer to that field. Within the XML document only the value of name is significant. If displayName is not given, then name is the default.
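The consistency check and the displayName default described above can be sketched in Python as follows. This is only an illustration, not part of the PMML specification: the function name `check_data_dictionary` and the sample field names are hypothetical, and the sketch assumes the DataDictionary is available as a standalone XML fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample fragment; the field names are invented for illustration.
SAMPLE = """
<DataDictionary numberOfFields="2">
  <DataField name="age" optype="continuous"/>
  <DataField name="gender" displayName="Gender" optype="categorical"/>
</DataDictionary>
"""

def check_data_dictionary(xml_text):
    """Return a dict mapping each field name to its display name.

    Raises ValueError if numberOfFields disagrees with the number of
    DataField children, or if a field name occurs more than once.
    """
    root = ET.fromstring(xml_text)
    fields = root.findall("DataField")

    # numberOfFields is optional (#IMPLIED), so only check it when present.
    declared = root.get("numberOfFields")
    if declared is not None and int(declared) != len(fields):
        raise ValueError(
            f"numberOfFields={declared}, but {len(fields)} fields are defined"
        )

    display_names = {}
    for field in fields:
        name = field.get("name")
        if name in display_names:
            raise ValueError(f"duplicate field name: {name!r}")
        # If displayName is not given, name is the default.
        display_names[name] = field.get("displayName", name)
    return display_names

labels = check_data_dictionary(SAMPLE)
```

Here `labels` maps "age" to "age" (the default) and "gender" to "Gender".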

The fields are separated into different types depending on which operations are defined on the values; this is defined by the attribute optype. Categorical fields have the operator "=", ordinal fields have an additional "<", and continuous fields also have arithmetic operators. Cyclic fields have a distance measure which takes into account that the maximal value and minimal value are close together.


The content of a DataField defines the set of values which are considered to be valid.

Mining models distinguish three properties of values:

Missing value:

The input value is missing, for example, if a database column contains a null value. It is possible to explicitly define values which are interpreted as missing values.

Invalid value:

The input value is not missing but it does not belong to a certain value range. The range of valid values can be defined for each field.

Valid value:

A value which is neither missing nor invalid.


The following element definitions for Value and Interval are used to define the types and value ranges for fields in the data dictionary. The range of valid values can be defined either by specifying the set itself or by specifying the complement set.

Note that PMML does not define how an interpreter of a model actually represents invalid or missing values. This depends on the application environment.

A continuous field may have at most two intervals defining the range of valid values. If intervals are present, any data that is outside the intervals will be considered invalid. If no intervals are present, the entire real axis (except for discrete missing values) consists of valid values. Intervals are not allowed for non-continuous fields.

There can be continuous fields where a sequence of valid values is specified instead of an interval range. This can be used, for example, in a rating system with a fixed set of values 1,2,3,4,5,6 representing grades. Values other than these numbers are considered invalid input, but the operational type still allows sums and mean values to be computed.
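The grading example above might look as follows in practice. This is a sketch under assumptions: the field name "grade" and the helper names are hypothetical, and dropping invalid inputs before averaging is just one possible application policy, since PMML does not prescribe how invalid values are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical continuous field whose valid values are listed explicitly.
GRADE_FIELD = """
<DataField name="grade" optype="continuous">
  <Value value="1"/>
  <Value value="2"/>
  <Value value="3"/>
  <Value value="4"/>
  <Value value="5"/>
  <Value value="6"/>
</DataField>
"""

def valid_numbers(field_xml):
    """Collect the numeric values whose property is 'valid' (the default)."""
    field = ET.fromstring(field_xml)
    return {
        float(v.get("value"))
        for v in field.findall("Value")
        if v.get("property", "valid") == "valid"
    }

def mean_of_valid(inputs, valid):
    """Average the inputs that are valid; return None if none are.

    Assumption: invalid inputs are simply skipped.
    """
    kept = [x for x in inputs if x in valid]
    return sum(kept) / len(kept) if kept else None

grades = valid_numbers(GRADE_FIELD)
# 7 and 0 are not listed, so only 1, 2 and 6 contribute to the mean.
average = mean_of_valid([1, 2, 7, 6, 0], grades)
```

Because the optype is continuous, arithmetic such as the mean is meaningful even though only six discrete values are valid.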

Note that binary fields are modelled as categorical fields where the value range is given and it contains exactly two valid values. There is no special optype value for binary fields.

Ordinal and continuous fields may be cyclic; this is used in clustering models, where the distance between values depends on the 'cyclic' property. For cyclic ordinal fields, the sequence of Value elements also defines the first and last value of the cycle. For cyclic continuous fields, there must be exactly one interval of valid values; this interval also defines the first and last value of the cycle.

The ordering for ordinal fields can be specified by a sequence of Value elements. If the ordering is not specified in the DataField element, then the default ordering on the representation type (strings, dates, numbers, etc.) also defines the ordering of the field values in the mining model. If the application algorithm for some class of models does not require any special treatment of ordinal fields, then they can be interpreted as categorical fields.
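This section does not fix a formula for the distance on cyclic fields. One common choice, shown here purely as an assumption, is to take the shorter way around the cycle. The function names below are hypothetical, and treating the two ends of a continuous interval as the same point (distance 0) is also an assumption of this sketch.

```python
def cyclic_distance_continuous(a, b, left, right):
    """Shorter-arc distance on a continuous cyclic field.

    left and right are the margins of the single valid interval,
    which define the first and last value of the cycle.
    """
    period = right - left
    d = abs(a - b) % period
    return min(d, period - d)

def cyclic_distance_ordinal(a, b, ordered_values):
    """Shorter-arc distance, in steps, on a cyclic ordinal field.

    ordered_values is the sequence given by the Value elements;
    the last value is treated as adjacent to the first.
    """
    n = len(ordered_values)
    d = abs(ordered_values.index(a) - ordered_values.index(b))
    return min(d, n - d)

# Hypothetical examples: an angle in degrees and the days of the week.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
angle_gap = cyclic_distance_continuous(350, 10, 0, 360)
day_gap = cyclic_distance_ordinal("Mon", "Sun", DAYS)
```

With these definitions, 350 and 10 degrees are 20 apart rather than 340, and Monday and Sunday are one step apart rather than six.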

	
	<!ELEMENT Value (Extension*) > 
	
	<!ATTLIST Value 
	     value             CDATA                           #REQUIRED
	     displayValue      CDATA                           #IMPLIED
	     property          (valid | invalid | missing )    "valid"
	>
	      

If a categorical or ordinal field contains at least one Value element where the value of property is 'valid' or unspecified, then the set of Value elements completely defines the set of valid values. Otherwise any value is valid by default.
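The rule above, together with the three value properties, can be sketched as a small classifier. The names `make_classifier` and `classify` and the sample field are hypothetical. Representing a missing input as Python's `None` is an assumption, since PMML leaves the representation of missing and invalid values to the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical categorical field with explicit valid and missing values.
COLOR_FIELD = """
<DataField name="color" optype="categorical">
  <Value value="red"/>
  <Value value="green" property="valid"/>
  <Value value="N/A" property="missing"/>
</DataField>
"""

def make_classifier(field_xml):
    """Build a function that labels a raw input as missing, invalid or valid."""
    field = ET.fromstring(field_xml)
    by_property = {"valid": set(), "invalid": set(), "missing": set()}
    for v in field.findall("Value"):
        # An unspecified property defaults to "valid", as in the DTD.
        by_property[v.get("property", "valid")].add(v.get("value"))

    def classify(raw):
        if raw is None or raw in by_property["missing"]:
            return "missing"
        if raw in by_property["invalid"]:
            return "invalid"
        if by_property["valid"]:
            # At least one valid Value: the list is exhaustive.
            return "valid" if raw in by_property["valid"] else "invalid"
        # No valid Value listed: any other value is valid by default.
        return "valid"

    return classify

classify = make_classifier(COLOR_FIELD)
```

For this field, "red" is classified as valid, "blue" as invalid because the valid set is exhaustive, and "N/A" as missing.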



The element Interval defines a range of numeric values.


					
	<!ELEMENT Interval EMPTY>
	
	<!ATTLIST Interval 
	     closure     (openClosed | openOpen | closedOpen | closedClosed )     #REQUIRED 
	     leftMargin  %NUMBER;                                                 #IMPLIED 
	     rightMargin %NUMBER;                                                 #IMPLIED
	>
	

The attributes leftMargin and rightMargin are optional, but at least one of them must be defined. If leftMargin is missing, negative infinity is assumed; if rightMargin is missing, positive infinity is assumed.
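A membership test for a single Interval, following the closure and margin rules above, could be sketched like this. The function name `in_interval` and its parameter names are hypothetical; `None` stands for an omitted margin attribute.

```python
import math

CLOSURES = {"openClosed", "openOpen", "closedOpen", "closedClosed"}

def in_interval(x, closure, left_margin=None, right_margin=None):
    """Return True if x lies inside the interval described by the attributes."""
    if closure not in CLOSURES:
        raise ValueError(f"unknown closure: {closure!r}")
    if left_margin is None and right_margin is None:
        raise ValueError("at least one of leftMargin, rightMargin is required")

    # A missing margin is treated as minus or plus infinity.
    left = -math.inf if left_margin is None else left_margin
    right = math.inf if right_margin is None else right_margin

    # The first word of the closure describes the left end, the second the right.
    left_closed = closure.startswith("closed")
    right_closed = closure.endswith("Closed")

    left_ok = x >= left if left_closed else x > left
    right_ok = x <= right if right_closed else x < right
    return left_ok and right_ok

# Hypothetical checks: [0, 1) contains 0 but not 1.
contains_zero = in_interval(0, "closedOpen", 0, 1)
contains_one = in_interval(1, "closedOpen", 0, 1)
```

A continuous field with several Interval elements would then accept a value if any one of its intervals contains it.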
