PMML 2.1 - Data Dictionary
The data dictionary contains definitions for fields as used in mining models. It specifies the types and value ranges. These definitions are assumed to be independent of specific data sets as used for training or scoring a specific model.
A data dictionary can be shared by multiple models. Statistics and other information related to the training set are stored within a model; see also the schemas for statistics and mining fields.
<xs:element name="DataDictionary">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
      <xs:element maxOccurs="unbounded" ref="DataField" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Taxonomy" />
    </xs:sequence>
    <xs:attribute name="numberOfFields" type="xs:nonNegativeInteger" />
  </xs:complexType>
</xs:element>

<xs:element name="DataField">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
      <xs:choice>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="Interval" />
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="Value" />
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="name" type="FIELD-NAME" use="required" />
    <xs:attribute name="displayName" type="xs:string" />
    <xs:attribute name="optype" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="categorical" />
          <xs:enumeration value="ordinal" />
          <xs:enumeration value="continuous" />
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="dataType">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="string" />
          <xs:enumeration value="integer" />
          <xs:enumeration value="float" />
          <xs:enumeration value="double" />
          <xs:enumeration value="boolean" />
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="taxonomy" type="xs:string" />
    <xs:attribute name="isCyclic" default="0">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="0" />
          <xs:enumeration value="1" />
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>
The value numberOfFields is the number of fields defined in the content of DataDictionary; it can be added for consistency checks. The name of a data field must be unique in the data dictionary. The displayName is a string which may be used by applications to refer to that field. Within the XML document only the value of name is significant. If displayName is not given, then name is the default value.
The fields are separated into different types depending on which operations are defined on the values; this is defined by the attribute optype. Categorical fields have the operator "=", ordinal fields have an additional "<", and continuous fields also have arithmetic operators. Cyclic fields have a distance measure which takes into account that the maximal value and minimal value are close together.
The optional attribute 'taxonomy' refers to a taxonomy of values. The value is a name of a taxonomy. It describes a hierarchy of values. The attribute is only applicable to categorical fields.
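As an illustration of the attributes described above, a minimal dictionary might look like the following sketch. The field names and the taxonomy name regionHierarchy are hypothetical, chosen only for this example; the Taxonomy element that regionHierarchy would refer to is defined by a separate schema and is omitted here.

```xml
<DataDictionary numberOfFields="3">
  <!-- continuous: supports "=", "<" and arithmetic -->
  <DataField name="age" displayName="Customer age" optype="continuous" dataType="integer" />
  <!-- ordinal: supports "=" and "<" -->
  <DataField name="satisfaction" optype="ordinal" dataType="string" />
  <!-- categorical: supports "=" only; taxonomy applies to categorical fields -->
  <DataField name="region" optype="categorical" dataType="string" taxonomy="regionHierarchy" />
</DataDictionary>
```

Here numberOfFields matches the three DataField children, and only age carries a displayName; for the other two fields an application would fall back to name.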
Valid Values
The content of a DataField defines the set of values which are considered to be valid.
Mining models distinguish three properties of values:
Missing value: Input value is missing, for example, if a database column contains a null value. It is possible to explicitly define values which are interpreted as missing values.
Invalid value: The input value is not missing but it does not belong to a certain value range. The range of valid values can be defined for each field.
Valid value: A value which is neither missing nor invalid.
These different types of values are used in the Value element defined in the next section.
Values and Intervals
The following element definitions for Value and Interval are used to define the types and value ranges for fields in the data dictionary. The range of valid values can be defined either by specifying the set itself or by specifying its complement.
Trailing blanks in categorical input values are not significant, and categorical values in PMML must not have trailing blanks. Leading blanks are significant.
Note that PMML does not define how an interpreter of a model actually represents invalid or missing values. This depends on the application environment.
A continuous field may have at most two intervals defining the range of valid values. If intervals are present, any value that lies outside the intervals is considered invalid. If no intervals are present, the entire real axis (except for discrete missing values) consists of valid values. Intervals are not allowed for non-continuous fields.
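A sketch of how intervals restrict a continuous field is shown below. The field names and margins are hypothetical. The first field accepts only values between -40 and 60 inclusive; the second uses two intervals, so a value such as 15 would be treated as invalid.

```xml
<DataDictionary numberOfFields="2">
  <DataField name="temperature" optype="continuous" dataType="double">
    <Interval closure="closedClosed" leftMargin="-40" rightMargin="60" />
  </DataField>
  <DataField name="dosage" optype="continuous" dataType="double">
    <!-- valid: 0 <= x < 10, or 20 <= x <= 30 -->
    <Interval closure="closedOpen" leftMargin="0" rightMargin="10" />
    <Interval closure="closedClosed" leftMargin="20" rightMargin="30" />
  </DataField>
</DataDictionary>
```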
Some continuous fields specify a sequence of valid values instead of an interval range. This can be used, for example, in a rating system with a fixed set of values 1, 2, 3, 4, 5, 6 representing grades. Values other than these numbers are considered invalid input. The operational type still allows sums and mean values to be computed.
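The grading example could be written as follows. This is a sketch; the field name grade is an assumption made for illustration.

```xml
<DataField name="grade" optype="continuous" dataType="integer">
  <!-- property defaults to "valid", so these six values form the valid set -->
  <Value value="1" />
  <Value value="2" />
  <Value value="3" />
  <Value value="4" />
  <Value value="5" />
  <Value value="6" />
</DataField>
```

Because optype is continuous, a consumer may still average grades, while an input such as 7 would be treated as invalid.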
Note that binary fields are modelled as categorical fields where the value range is given and it contains exactly two valid values. There is no special optype value for binary fields.
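For instance, a binary field might be declared as in this sketch, where the field name and the two values are hypothetical:

```xml
<DataField name="isMember" optype="categorical" dataType="string">
  <!-- exactly two valid values make this field binary -->
  <Value value="yes" />
  <Value value="no" />
</DataField>
```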
Ordinal and continuous fields may be cyclic; these are used for clustering models where the similarity or distance of values depends on the 'cyclic' property. For cyclic ordinal fields the sequence of 'Value' elements defines the first and last value of a cycle. The value range for cyclic continuous fields may be defined by an interval or by a sequence of 'Value' elements. If the field has an interval there must be exactly one. This interval defines the default values for the first and last value of a cycle. Otherwise these are the minimum and maximum values given by the 'Value' elements.
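The two kinds of cyclic field can be sketched as below. The field names and values are hypothetical. In the ordinal case the cycle runs from the first listed value to the last; in the continuous case the single interval supplies the cycle boundaries.

```xml
<DataDictionary numberOfFields="2">
  <!-- cyclic ordinal: Mon is the first value of the cycle, Sun the last -->
  <DataField name="dayOfWeek" optype="ordinal" dataType="string" isCyclic="1">
    <Value value="Mon" />
    <Value value="Tue" />
    <Value value="Wed" />
    <Value value="Thu" />
    <Value value="Fri" />
    <Value value="Sat" />
    <Value value="Sun" />
  </DataField>
  <!-- cyclic continuous: exactly one interval, giving the cycle bounds 0 and 360 -->
  <DataField name="heading" optype="continuous" dataType="double" isCyclic="1">
    <Interval closure="closedOpen" leftMargin="0" rightMargin="360" />
  </DataField>
</DataDictionary>
```

With these definitions a clustering model could treat Sun and Mon, or headings of 359 and 1, as close together.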
The ordering for ordinal fields can be specified by a sequence of Value elements, one per category. If the ordering is not specified in the element, then the default ordering on the representation type (strings, dates, numbers, etc.) also defines the ordering of the field values in the mining model. If the application algorithm for some class of models does not require any special treatment of ordinal fields, then they can be interpreted as categorical fields.
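The sketch below shows why an explicit ordering can matter. The field name and categories are hypothetical. Without the listed sequence, the default string ordering would place "high" before "low" and "medium".

```xml
<DataField name="risk" optype="ordinal" dataType="string">
  <!-- document order gives the intended ordering: low < medium < high -->
  <Value value="low" />
  <Value value="medium" />
  <Value value="high" />
</DataField>
```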
<xs:element name="Value">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
    <xs:attribute name="value" type="xs:string" use="required" />
    <xs:attribute name="displayValue" type="xs:string" />
    <xs:attribute name="property" default="valid">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="valid" />
          <xs:enumeration value="invalid" />
          <xs:enumeration value="missing" />
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>
If a categorical or ordinal field contains at least one Value element where the value of property is 'valid' or unspecified, then the set of Value elements completely defines the set of valid values. Otherwise any value is valid by default.
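The two cases in this rule can be contrasted with a sketch. The field names and values are hypothetical. The first field lists valid values, so the listed set is complete and any unlisted input is not valid. The second field lists only a missing-value marker, so every other value remains valid by default.

```xml
<DataDictionary numberOfFields="2">
  <DataField name="maritalStatus" optype="categorical" dataType="string">
    <Value value="single" displayValue="Single" />
    <Value value="married" displayValue="Married" />
    <Value value="unknown" property="missing" />
    <Value value="error" property="invalid" />
  </DataField>
  <DataField name="city" optype="categorical" dataType="string">
    <!-- no valid Value listed: any value other than "N/A" is valid -->
    <Value value="N/A" property="missing" />
  </DataField>
</DataDictionary>
```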
The element Interval defines a range of numeric values.
<xs:element name="Interval">
  <xs:complexType>
    <xs:attribute name="closure" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="openClosed" />
          <xs:enumeration value="openOpen" />
          <xs:enumeration value="closedOpen" />
          <xs:enumeration value="closedClosed" />
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="leftMargin" type="NUMBER" />
    <xs:attribute name="rightMargin" type="NUMBER" />
  </xs:complexType>
</xs:element>
The attributes leftMargin and rightMargin are both optional, but at least one of them must be specified. A missing leftMargin is taken as negative infinity and a missing rightMargin as positive infinity.
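One-sided intervals can be sketched as follows. The field names and margins are hypothetical, and marking the unbounded side as open is a choice made for this example rather than something the schema requires.

```xml
<DataDictionary numberOfFields="2">
  <DataField name="income" optype="continuous" dataType="double">
    <!-- rightMargin omitted: valid range is x >= 0 -->
    <Interval closure="closedOpen" leftMargin="0" />
  </DataField>
  <DataField name="discountPercent" optype="continuous" dataType="double">
    <!-- leftMargin omitted: valid range is x <= 100 -->
    <Interval closure="openClosed" rightMargin="100" />
  </DataField>
</DataDictionary>
```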