PMML 3.0 - Data Dictionary

The data dictionary contains definitions for fields as used in mining models. It specifies the types and value ranges. These definitions are assumed to be independent of specific data sets as used for training or scoring a specific model.

A data dictionary can be shared by multiple models. Statistics and other information related to the training set are stored within a model; see also the schemas for statistics and mining fields.


  <xs:element name="DataDictionary">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="DataField" maxOccurs="unbounded"  />
        <xs:element ref="Taxonomy" minOccurs="0" maxOccurs="unbounded"  />
      </xs:sequence>
      <xs:attribute name="numberOfFields" type="xs:nonNegativeInteger" />
    </xs:complexType>
  </xs:element>

  <xs:element name="DataField">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:choice>
          <xs:element ref="Interval" minOccurs="0" maxOccurs="unbounded" />
          <xs:element ref="Value"    minOccurs="0" maxOccurs="unbounded" />
        </xs:choice>
      </xs:sequence>
      <xs:attribute name="name" type="FIELD-NAME" use="required" />
      <xs:attribute name="displayName" type="xs:string" />
      <xs:attribute name="optype" type="OPTYPE" use="required" />
      <xs:attribute name="dataType" type="DATATYPE" />
      <xs:attribute name="taxonomy" type="xs:string" />
      <xs:attribute name="isCyclic" default="0">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="0" />
            <xs:enumeration value="1" />
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="OPTYPE">      
     <xs:restriction base="xs:string">
      <xs:enumeration value="categorical"/>
      <xs:enumeration value="ordinal"/>
      <xs:enumeration value="continuous"/>
     </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="DATATYPE">      
     <xs:restriction base="xs:string">
        <xs:enumeration value="string" />
        <xs:enumeration value="integer" />
        <xs:enumeration value="float" />
        <xs:enumeration value="double" />
        <xs:enumeration value="boolean" />
        <xs:enumeration value="date"/>
        <xs:enumeration value="time" />
        <xs:enumeration value="dateTime" />
        <xs:enumeration value="dateDaysSince[0]" />
        <xs:enumeration value="dateDaysSince[1960]" />
        <xs:enumeration value="dateDaysSince[1970]" />
        <xs:enumeration value="dateDaysSince[1980]" />
        <xs:enumeration value="timeSeconds" />
        <xs:enumeration value="dateTimeSecondsSince[0]" />
        <xs:enumeration value="dateTimeSecondsSince[1960]" />
        <xs:enumeration value="dateTimeSecondsSince[1970]" />
        <xs:enumeration value="dateTimeSecondsSince[1980]" />
     </xs:restriction>
  </xs:simpleType>

The value numberOfFields is the number of fields defined in the content of DataDictionary; it can be added for consistency checks. The name of a data field must be unique in the data dictionary. The displayName is a string which may be used by applications to refer to that field. Within the XML document only the value of name is significant. If displayName is not given, then name is the default value.
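
As an illustration, a minimal dictionary with two fields might look as follows. The field names and the display name are hypothetical and chosen only for this sketch; numberOfFields matches the number of DataField elements.

```xml
<DataDictionary numberOfFields="2">
  <!-- "cust_age" is the name that models refer to; "Customer age" is for display only -->
  <DataField name="cust_age" displayName="Customer age" optype="continuous" dataType="integer"/>
  <!-- no displayName is given, so applications fall back to the name "region" -->
  <DataField name="region" optype="categorical" dataType="string"/>
</DataDictionary>
```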

The fields are separated into different types depending on which operations are defined on the values; this is defined by the attribute optype. Categorical fields have the operator "=", ordinal fields have an additional "<", and continuous fields also have arithmetic operators. Cyclic fields have a distance measure which takes into account that the maximal value and minimal value are close together.

The dataType reuses names and semantics of atomic types in XML Schema. The representation of the values is defined in the XML Schema specification (Part 2: Datatypes).

PMML introduces three new types that are not part of the types in XML Schema. These are
  • timeSeconds,
  • dateDaysSince[aYear], and
  • dateTimeSecondsSince[aYear].
These additional types are supported in PMML because mining models often convert input values into numbers. After date and time values have been converted into numbers they can be used easily in comparisons and other mathematical computations such as differences. For example, the date 2003-04-01 can be converted to the value 15796 of type dateDaysSince[1960]. These type casts are analogous to, e.g., casting an integer to a double or vice versa.

PMML supports variants of dateDaysSince and dateTimeSecondsSince because there is no commonly used reference date. If a training algorithm uses, e.g., the reference year 1980, it is difficult to adjust the model to another reference year while exporting a PMML model. Changing the reference year would require adjustment of possibly many coefficients. It's easier to support a few reference years in a PMML consumer.

The type timeSeconds is a variant of the type time where the values are represented as the number of seconds since 00:00, that is, since midnight. The time 00:00 is represented by the number 0. No negative values are allowed. For technical simplicity, values of type timeSeconds are allowed to exceed the number of seconds in 24 hours. Otherwise we would have to check for overflow or execute some 'modulo' operation.

Example:

Obs      hms          ts
0      0:00:00         0
1      0:01:40       100
2      0:03:20       200
3      0:16:40      1000
4     24:00:00     86400
5     24:00:01     86401
6     27:46:40    100000

The type dateDaysSince[aYear] is a variant of the type date where the values are represented as the number of days since aYear-01-01. The date aYear-01-01 is represented by the number 0. aYear-01-02 is represented by 1, aYear-02-01 is represented by 31, etc. Dates before aYear-01-01 are represented as negative numbers.
For example, values of type dateDaysSince[1960] are the number of days since 1960-01-01. The date 1960-01-01 is represented by the number 0.

The type dateTimeSecondsSince[aYear] is a variant of the type dateTime where the values are represented as the number of seconds since 00:00 on aYear-01-01. The datetime 00:00:00 on aYear-01-01 is represented by the number 0. The datetime 00:00:01 on aYear-01-01 is represented by 1, etc. Datetimes before aYear-01-01 are represented as negative numbers.
For example, values of type dateTimeSecondsSince[1960] are the number of seconds since 00:00 on 1960-01-01. The datetime 00:00:00 on 1960-01-01 is represented by the number 0. The datetime 00:01:00 on 1960-01-01 is represented by 60.
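
A sketch of how fields with these numeric date and time types might be declared. The field names are hypothetical, and the choice of optype "continuous" is an assumption; the comments only restate values given in the text above.

```xml
<DataDictionary numberOfFields="3">
  <!-- seconds since midnight: 0:16:40 is represented by 1000 -->
  <DataField name="callStart" optype="continuous" dataType="timeSeconds"/>
  <!-- days since 1960-01-01: 2003-04-01 is represented by 15796 -->
  <DataField name="purchaseDate" optype="continuous" dataType="dateDaysSince[1960]"/>
  <!-- seconds since 00:00:00 on 1960-01-01: 00:01:00 on that day is represented by 60 -->
  <DataField name="loginTime" optype="continuous" dataType="dateTimeSecondsSince[1960]"/>
</DataDictionary>
```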

Note that there is no explicit specification of a time zone. The local time zone is always assumed.

Note that the Java class Date uses integers as an alternative representation of a date value. The integer value is the number of milliseconds since January 1, 1970, 00:00:00 GMT. See Java DateFormat.

The optional attribute taxonomy refers to a taxonomy of values. The value is a name of a taxonomy. It describes a hierarchy of values. The attribute is only applicable to categorical fields.

Valid Values

The content of a DataField defines the set of values which are considered to be valid.

Mining models distinguish three properties of values:

    Missing value: Input value is missing, for example, if a database column contains a null value. It is possible to explicitly define values which are interpreted as missing values.

    Invalid value: The input value is not missing but it does not belong to a certain value range. The range of valid values can be defined for each field.

    Valid value: A value which is neither missing nor invalid.

These different types of values are used in the Value element defined in the next section.

Values and Intervals

The following element definitions for Value and Interval are used to define the types and value ranges for fields in the data dictionary. The range of valid values can be defined either by specifying the set itself or by specifying its complement.

Trailing blanks in categorical input values are not significant, and categorical values in PMML must not have trailing blanks. Leading blanks are significant.

Note that PMML does not define how an interpreter of a model actually represents invalid or missing values. This depends on the application environment.

A continuous field may have at most two intervals defining the range of valid values. If intervals are present, any data outside the intervals is considered invalid. If no intervals are present, the entire real axis (except for discrete missing values) consists of valid values. Intervals are not allowed for non-continuous fields.
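
For example, a continuous field whose valid range consists of two disjoint intervals could be sketched as below. The field name and the margins are assumptions made for illustration.

```xml
<!-- valid values: [0, 10) and [20, 30]; any other value, such as 15, is invalid -->
<DataField name="reading" optype="continuous" dataType="double">
  <Interval closure="closedOpen" leftMargin="0" rightMargin="10"/>
  <Interval closure="closedClosed" leftMargin="20" rightMargin="30"/>
</DataField>
```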

There can be continuous fields where a sequence of valid values is specified instead of an interval range. This can be used, for example, in a rating system with a fixed set of values 1,2,3,4,5,6 representing grades. Values other than these numbers are considered invalid input. The operational type still allows sums and mean values to be computed.

Note that binary fields are modelled as categorical fields where the value range is given and it contains exactly two valid values. There is no special optype value for binary fields.
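
The two preceding cases might be written as in the following sketch. The field names and the values "yes" and "no" are hypothetical.

```xml
<DataDictionary numberOfFields="2">
  <!-- continuous field restricted to the grades 1 to 6; sums and means remain meaningful -->
  <DataField name="grade" optype="continuous" dataType="integer">
    <Value value="1"/>
    <Value value="2"/>
    <Value value="3"/>
    <Value value="4"/>
    <Value value="5"/>
    <Value value="6"/>
  </DataField>
  <!-- binary field: a categorical field with exactly two valid values -->
  <DataField name="smoker" optype="categorical" dataType="string">
    <Value value="yes"/>
    <Value value="no"/>
  </DataField>
</DataDictionary>
```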

Ordinal and continuous fields may be cyclic; these are used for clustering models where the similarity or distance of values depends on the 'cyclic' property. For cyclic ordinal fields the sequence of 'Value' elements defines the first and last value of a cycle. The value range for cyclic continuous fields may be defined by an interval or by a sequence of 'Value' elements. If the field has an interval there must be exactly one. This interval defines the default values for the first and last value of a cycle. Otherwise these are the minimum and maximum value given by the 'Value' elements.
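
A sketch of one cyclic ordinal field and one cyclic continuous field follows. The field names, the weekday abbreviations, and the 0 to 360 range are assumptions chosen for illustration.

```xml
<DataDictionary numberOfFields="2">
  <!-- cyclic ordinal field: the cycle runs from the first Value (Mon) to the last (Sun) -->
  <DataField name="weekday" optype="ordinal" dataType="string" isCyclic="1">
    <Value value="Mon"/>
    <Value value="Tue"/>
    <Value value="Wed"/>
    <Value value="Thu"/>
    <Value value="Fri"/>
    <Value value="Sat"/>
    <Value value="Sun"/>
  </DataField>
  <!-- cyclic continuous field: the single Interval gives the default first and last value of the cycle -->
  <DataField name="windDirection" optype="continuous" dataType="double" isCyclic="1">
    <Interval closure="closedClosed" leftMargin="0" rightMargin="360"/>
  </DataField>
</DataDictionary>
```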

The ordering for ordinal fields can be specified by a sequence of Value elements, one per category. If the ordering is not specified in the DataField element, then the default ordering on the representation type (strings, dates, numbers, etc.) also defines the ordering of the field values in the mining model. If the application algorithm for some class of models does not require any special treatment of ordinal fields, then they can be interpreted as categorical fields.
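
The following sketch assumes that the ordering is given by the order in which the Value elements of the field are listed. The field name and categories are hypothetical. Without the explicit list, the default string ordering would place "high" before "low" and "medium".

```xml
<!-- the order of the Value elements gives the ordering: low < medium < high -->
<DataField name="risk" optype="ordinal" dataType="string">
  <Value value="low"/>
  <Value value="medium"/>
  <Value value="high"/>
</DataField>
```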


  <xs:element name="Value">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="value" type="xs:string" use="required" />
      <xs:attribute name="displayValue" type="xs:string" />
      <xs:attribute name="property" default="valid">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="valid" />
            <xs:enumeration value="invalid" />
            <xs:enumeration value="missing" />
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

The attribute displayValue in a DataField/Value is similar to mappedValue in association rules. The idea is that the raw data could have cryptic key values such as a product code 'sku043834', and the displayValue would then be the readable product name like 'coke, 12 fl oz can'. The display values can be used in a visualizer. Scoring engines typically use key values.

If a categorical or ordinal field contains at least one Value element where the value of property is 'valid' or unspecified, then the set of Value elements completely defines the set of valid values. Otherwise any value is valid by default.
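
The rule can be illustrated with two hypothetical fields. The field names, the second product code with its display value, and the markers "N/A" and "unknown" are assumptions made for this sketch.

```xml
<DataDictionary numberOfFields="2">
  <!-- valid Value elements are listed, so only the two product codes are valid;
       "N/A" is interpreted as missing and every other input is invalid -->
  <DataField name="product" optype="categorical" dataType="string">
    <Value value="sku043834" displayValue="coke, 12 fl oz can"/>
    <Value value="sku043835" displayValue="coke, 20 fl oz bottle"/>
    <Value value="N/A" property="missing"/>
  </DataField>
  <!-- no valid Value element is listed, so any value other than "unknown" is valid by default -->
  <DataField name="city" optype="categorical" dataType="string">
    <Value value="unknown" property="missing"/>
  </DataField>
</DataDictionary>
```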

The element Interval defines a range of numeric values.


  <xs:element name="Interval">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="closure" use="required">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="openClosed" />
            <xs:enumeration value="openOpen" />
            <xs:enumeration value="closedOpen" />
            <xs:enumeration value="closedClosed" />
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="leftMargin" type="NUMBER" />
      <xs:attribute name="rightMargin" type="NUMBER" />
    </xs:complexType>
  </xs:element>

The attributes leftMargin and rightMargin are optional but at least one value must be defined. If a margin is missing, then +/- infinity is assumed.
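
As a sketch, an interval with only one margin could describe a field that accepts any non-negative number. The field name is hypothetical.

```xml
<!-- rightMargin is omitted, so the interval is [0, +infinity) -->
<DataField name="income" optype="continuous" dataType="double">
  <Interval closure="closedOpen" leftMargin="0"/>
</DataField>
```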

Conformance

Support for ordinal fields and for cyclic value ranges is not part of the core PMML requirements. Support for specifying missing values, invalid values, or value ranges (see the content of 'DataField') is not in core PMML. None of the date/time related dataTypes are in core PMML.