PMML 4.2 - Data Dictionary

The DataDictionary contains definitions for fields as used in mining models. It specifies the types and value ranges. These definitions are assumed to be independent of specific data sets as used for training or scoring a specific model.

A DataDictionary can be shared by multiple models. Statistics and other information related to the training set are stored within a model; see also the schemas for statistics and mining fields.

<xs:element name="DataDictionary">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="DataField" maxOccurs="unbounded"/>
      <xs:element ref="Taxonomy" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="numberOfFields" type="xs:nonNegativeInteger"/>
  </xs:complexType>
</xs:element>

<xs:element name="DataField">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:sequence>
        <xs:element ref="Interval" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="Value" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:sequence>
    <xs:attribute name="name" type="FIELD-NAME" use="required"/>
    <xs:attribute name="displayName" type="xs:string"/>
    <xs:attribute name="optype" type="OPTYPE" use="required"/>
    <xs:attribute name="dataType" type="DATATYPE" use="required"/>
    <xs:attribute name="taxonomy" type="xs:string"/>
    <xs:attribute name="isCyclic" default="0">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="0"/>
          <xs:enumeration value="1"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

<xs:simpleType name="OPTYPE">      
  <xs:restriction base="xs:string">
    <xs:enumeration value="categorical"/>
    <xs:enumeration value="ordinal"/>
    <xs:enumeration value="continuous"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="DATATYPE">      
  <xs:restriction base="xs:string">
    <xs:enumeration value="string"/>
    <xs:enumeration value="integer"/>
    <xs:enumeration value="float"/>
    <xs:enumeration value="double"/>
    <xs:enumeration value="boolean"/>
    <xs:enumeration value="date"/>
    <xs:enumeration value="time"/>
    <xs:enumeration value="dateTime"/>
    <xs:enumeration value="dateDaysSince[0]"/>
    <xs:enumeration value="dateDaysSince[1960]"/>
    <xs:enumeration value="dateDaysSince[1970]"/>
    <xs:enumeration value="dateDaysSince[1980]"/>
    <xs:enumeration value="timeSeconds"/>
    <xs:enumeration value="dateTimeSecondsSince[0]"/>
    <xs:enumeration value="dateTimeSecondsSince[1960]"/>
    <xs:enumeration value="dateTimeSecondsSince[1970]"/>
    <xs:enumeration value="dateTimeSecondsSince[1980]"/>
  </xs:restriction>
</xs:simpleType>

The value numberOfFields is the number of fields defined in the content of the DataDictionary; it can be added for consistency checks.
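
The consistency check mentioned above could look roughly like the following Python sketch. The function name check_number_of_fields and the sample field names are illustrative only, and the fragment is assumed to carry no XML namespace (a real PMML document normally declares one, which would need to be handled when searching for DataField elements).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample fragment; field names are made up for illustration.
EXAMPLE = """<DataDictionary numberOfFields="2">
  <DataField name="age" optype="continuous" dataType="double"/>
  <DataField name="gender" optype="categorical" dataType="string"/>
</DataDictionary>"""


def check_number_of_fields(data_dictionary_xml: str) -> bool:
    """Return True if numberOfFields is absent or equals the DataField count."""
    root = ET.fromstring(data_dictionary_xml)
    declared = root.get("numberOfFields")
    if declared is None:
        # The attribute is optional, so there is nothing to verify.
        return True
    # Count only direct DataField children (assumes no namespace).
    actual = len(root.findall("DataField"))
    return int(declared) == actual
```

A consumer might run such a check once when loading a document and reject or warn about dictionaries whose declared count does not match.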

The name of a DataField must be unique among the names in the DataDictionary and, with few exceptions, must differ from the names of other fields in the PMML document. For information on the naming and scope of DataFields, see Scope of Fields.

The displayName is a string which may be used by applications to refer to that field. Within the XML document only the value of name is significant. If displayName is not given, then it defaults to the value of name. For example, there may be a field with name="CSTAGE" and displayName="Customer age". An application may use the label Customer age, e.g., at the user interface in order to ask for input values. That is, displayName can be used when the application calls the PMML consumer. Once the consumer has received the parameters and matched them to the MiningFields, the displayName is no longer relevant. Only name is significant for internal processing.

The fields are separated into different types depending on which operations are defined on the values; this is defined by the attribute optype. Categorical field values can only be tested for equality, while ordinal field values in addition have an order defined. Values of continuous fields can be used with arithmetic operators. Cyclic fields have a distance measure which takes into account that the maximal value and minimal value are close together (see below).

The dataType reuses names and semantics of atomic types in XML Schema. The representation of the values is defined in the XML Schema datatypes specification.

PMML introduces three new types that are not part of the types in XML Schema. These are

  • timeSeconds,
  • dateDaysSince[aYear], and
  • dateTimeSecondsSince[aYear].
where aYear is one of 0, 1960, 1970 or 1980. Note that aYear is not a variable and these are the only values allowed. If arbitrary years are necessary, you can use an appropriate built-in function.
These additional types are supported in PMML because mining models often convert input values into numbers. After date and time values have been converted into numbers they can be used easily in comparisons and other mathematical computations such as differences. For example, the date 2003-04-01 can be converted to the value 15796 of type dateDaysSince[1960]. These type casts are analogous to, e.g., casting an integer to a double or vice versa.

PMML supports variants of dateDaysSince and dateTimeSecondsSince because there is no commonly used reference date. If a training algorithm uses, e.g., the reference year 1980, it is difficult to adjust the model to another reference year while exporting a PMML model. Changing the reference year would require adjustment of possibly many coefficients. It's easier to support a few reference years in a PMML consumer.

The type timeSeconds is a variant of the type time where the values are represented as the number of seconds since 00:00, that is, since midnight. The time 00:00 is represented by the number 0. No negative values are allowed. For technical simplicity, values of type timeSeconds are allowed to be greater than the number of seconds in 24 hours. Otherwise we would have to check for overflow or execute some modulo operation.

Example:

Obs      hms          ts
0      0:00:00         0
1      0:01:40       100
2      0:03:20       200
3      0:16:40      1000
4     24:00:00     86400
5     24:00:01     86401
6     27:46:40    100000
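
The table above can be reproduced with a few lines of Python. This is only a sketch: the function name time_seconds and the h:mm:ss input format are assumptions made for illustration, not part of PMML.

```python
def time_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to seconds since midnight.

    Hours may exceed 23, since timeSeconds values are allowed to be
    larger than the number of seconds in 24 hours. No modulo is applied.
    """
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    if hours < 0 or not 0 <= minutes < 60 or not 0 <= seconds < 60:
        # Negative timeSeconds values are not allowed.
        raise ValueError(f"not a valid time of day: {hms!r}")
    return hours * 3600 + minutes * 60 + seconds
```

For instance, time_seconds("27:46:40") yields 100000, matching the last row of the table.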

The type dateDaysSince[aYear] is a variant of the type date where the values are represented as the number of days since aYear-01-01. The date aYear-01-01 is represented by the number 0. aYear-01-02 is represented by 1, aYear-02-01 is represented by 31, etc. Dates before aYear-01-01 are represented as negative numbers.
For example, values of type dateDaysSince[1960] are the number of days since 1960-01-01. The date 1960-01-01 is represented by the number 0.
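
As an illustration, the conversion can be sketched in Python with the standard datetime module. The function name date_days_since is hypothetical. The reference year 0 is deliberately left out of this sketch because Python's date type cannot represent it.

```python
from datetime import date

# Reference years handled by this sketch; PMML also allows 0,
# which Python's date type (minimum year 1) cannot express.
SUPPORTED_YEARS = (1960, 1970, 1980)


def date_days_since(d: date, a_year: int) -> int:
    """Number of days from a_year-01-01 to d (negative for earlier dates)."""
    if a_year not in SUPPORTED_YEARS:
        raise ValueError(f"unsupported reference year: {a_year}")
    return (d - date(a_year, 1, 1)).days
```

With this sketch, date_days_since(date(2003, 4, 1), 1960) gives 15796, the value quoted earlier for the date 2003-04-01, and date_days_since(date(1960, 2, 1), 1960) gives 31.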

The type dateTimeSecondsSince[aYear] is a variant of the type dateTime where the values are represented as the number of seconds since 00:00 on aYear-01-01. The datetime 00:00:00 on aYear-01-01 is represented by the number 0. The datetime 00:00:01 on aYear-01-01 is represented by 1, etc. Datetimes before aYear-01-01 are represented as negative numbers.
For example, values of type dateTimeSecondsSince[1960] are the number of seconds since 00:00 on 1960-01-01. The datetime 00:00:00 on 1960-01-01 is represented by the number 0. The datetime 00:01:00 on 1960-01-01 is represented by 60.
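
A corresponding Python sketch for datetimes follows. The function name date_time_seconds_since is hypothetical, naive datetime objects are used because the values carry no time zone, and the reference year 0 is again omitted because Python cannot represent it. Fractions of a second are discarded, which is an assumption of this sketch.

```python
from datetime import datetime

# Reference years handled by this sketch (year 0 is not representable
# with Python's datetime type).
SUPPORTED_YEARS = (1960, 1970, 1980)


def date_time_seconds_since(dt: datetime, a_year: int) -> int:
    """Whole seconds from 00:00:00 on a_year-01-01 to dt."""
    if a_year not in SUPPORTED_YEARS:
        raise ValueError(f"unsupported reference year: {a_year}")
    delta = dt - datetime(a_year, 1, 1)
    # timedelta keeps 0 <= seconds < 86400 and lets days go negative,
    # so this sum is also correct for datetimes before the reference.
    return delta.days * 86400 + delta.seconds
```

For example, date_time_seconds_since(datetime(1960, 1, 1, 0, 1, 0), 1960) gives 60, as stated in the text.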

Note that there is no explicit specification of a time zone. The local time zone is always assumed.

Note that the Java class Date uses integers as an alternative representation of a date value. The integer value is the number of milliseconds since January 1, 1970, 00:00:00 GMT. See Java DateFormat.

The optional attribute taxonomy refers to a taxonomy of values. The value is a name of a taxonomy. It describes a hierarchy of values. The attribute is only applicable to categorical fields.

Valid Values

The content of a DataField defines the set of values which are considered to be valid.

Mining models distinguish three properties of values:

Missing value: Input value is missing, for example, if a database column contains a null value. It is possible to explicitly define values which are interpreted as missing values.

Invalid value: The input value is not missing but it does not belong to a certain value range. The range of valid values can be defined for each field.

Valid value: A value which is neither missing nor invalid.

These different types of values are used in the Value element defined in the next section.

Note that PMML does not define how an interpreter of a model actually represents invalid or missing values. This depends on the application environment.

Values and Intervals

The following element definitions for Value and Interval are used to define the types and value ranges for fields in the DataDictionary. The range of valid values can be defined either by specifying the set itself or by specifying its complement.

Trailing blanks in categorical input values are not significant, and categorical values in PMML must not have trailing blanks. Leading blanks are significant.

A continuous field may have an unlimited number of intervals defining the range of valid values. If intervals are present, any data outside the intervals is considered invalid (except for discrete Values whose property attribute identifies them as missing). For example, if a DataField is continuous with a valid range of [0,100] but the application encodes missing values as -999, the proper way to represent this field is:

<DataField name="SomeVariable" dataType="double" optype="continuous">
  <Interval closure="closedClosed" leftMargin="0" rightMargin="100"/>
  <Value property="missing" value="-999"/>
</DataField>

If no intervals are present, the entire real axis consists of valid values (except for discrete values whose property attribute identifies them as missing or invalid). Intervals are not allowed for non-continuous fields.
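
The rules for continuous fields described above can be summarized in a small Python sketch. Everything named here (classify_continuous, in_0_100, the string results) is illustrative. Intervals are passed as membership predicates so that closure handling stays out of the picture, and lists of explicitly valid Value elements are not covered. How an interpreter actually represents missing or invalid values is left open by PMML.

```python
from typing import Callable, Iterable, Optional


def classify_continuous(
    raw: Optional[float],
    intervals: Iterable[Callable[[float], bool]] = (),
    missing: Iterable[float] = (),
    invalid: Iterable[float] = (),
) -> str:
    """Classify an input of a continuous field as missing, invalid or valid."""
    # A null input or an explicitly flagged Value counts as missing,
    # even if it lies outside every interval.
    if raw is None or raw in set(missing):
        return "missing"
    if raw in set(invalid):
        return "invalid"
    intervals = list(intervals)
    # With intervals present, anything outside all of them is invalid.
    if intervals and not any(contains(raw) for contains in intervals):
        return "invalid"
    # Without intervals, the entire real axis is valid.
    return "valid"


def in_0_100(x: float) -> bool:
    """Membership test for the closedClosed interval [0, 100]."""
    return 0 <= x <= 100
```

Applied to the SomeVariable example, classify_continuous(-999, [in_0_100], missing=[-999]) returns "missing", while 150 with the same arguments is "invalid" and 50 is "valid".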

There can be continuous fields where a sequence of valid values is specified instead of an interval range. This can be used, for example, in a rating system with a fixed set of values 1,2,3,4,5,6 representing grades. Values other than these numbers are considered invalid input. The operational type still allows sums and mean values to be computed.

Note that binary fields are modeled as categorical fields where the value range is given and it contains exactly two valid values. There is no special optype value for binary fields.

Ordinal and continuous fields may be cyclic; these are used for clustering models where the similarity or distance of values depends on the cyclic property. For cyclic ordinal fields the sequence of Value elements defines the first and last value of a cycle in ascending order. The value range for cyclic continuous fields may be defined by an interval or by a sequence of Value elements. If the field has an interval there must be exactly one. This interval defines the default values for the first and last value of a cycle. Otherwise these are the minimum and maximum value given by the Value elements.
For example, an Interval of 1 to 7 or set of Value elements 1 through 7 may be used for the days of the week, where the seventh day of one week is followed by the first day of the next week.
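
To make the cyclic property concrete, here is a minimal Python sketch of one possible distance for a cyclic field given by a sequence of Value elements. The exact measure is determined by the model that uses the field and is not defined in this section; the wrap-around formula and the name cyclic_distance are assumptions chosen for illustration.

```python
from typing import Sequence


def cyclic_distance(a, b, cycle: Sequence) -> int:
    """Steps between a and b when the last value is followed by the first.

    cycle lists the valid values in ascending order, as the Value
    elements of a cyclic field would. Raises ValueError if a or b is
    not part of the cycle.
    """
    i, j = cycle.index(a), cycle.index(b)
    direct = abs(i - j)
    # Going the other way around the cycle may be shorter.
    return min(direct, len(cycle) - direct)


DAYS_OF_WEEK = [1, 2, 3, 4, 5, 6, 7]
```

Under this assumption, day 7 and day 1 are one step apart, whereas a non-cyclic field would place them six steps apart.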

For ordinal fields, if a sequence of Value elements is given, the sequence defines the ascending order on the respective field.

For example, the following defines loud < louder < insane:

<DataField name="Volume" optype="ordinal" dataType="string">
  <Value value="loud"/>
  <Value value="louder"/>
  <Value value="insane"/>
</DataField>

If no Value elements are given, the order is always assumed to be the ascending order of the representation type. For example, if the representation type is integer, an order needs to be given explicitly via a sequence of Value elements only when it differs from the natural order of integers.

The notable exception to this rule is ordinal DataFields of type string: they must always specify the order explicitly via Value elements; see the example above.
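
The following Python sketch shows why string-typed ordinal fields need an explicit order. It reads the Volume example, derives a rank from the position of each Value element, and compares the result with plain string ordering. The helper name ordinal_ranks is hypothetical, the fragment is assumed to have no XML namespace, and skipping Value elements flagged as missing or invalid is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

VOLUME = """<DataField name="Volume" optype="ordinal" dataType="string">
  <Value value="loud"/>
  <Value value="louder"/>
  <Value value="insane"/>
</DataField>"""


def ordinal_ranks(data_field_xml: str) -> dict:
    """Map each valid value to its position in the Value sequence."""
    field = ET.fromstring(data_field_xml)
    valid = [
        v.get("value")
        for v in field.findall("Value")
        if v.get("property", "valid") == "valid"  # property defaults to valid
    ]
    return {value: rank for rank, value in enumerate(valid)}


ranks = ordinal_ranks(VOLUME)
# Order defined by the field: loud < louder < insane.
by_field_order = sorted(["insane", "loud", "louder"], key=ranks.__getitem__)
# Plain string comparison would put "insane" first instead.
by_string_order = sorted(["loud", "louder", "insane"])
```

Here by_field_order follows the sequence of Value elements, while by_string_order is alphabetical and therefore does not reflect the intended order.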

If the application algorithm for some class of models does not require any special treatment of ordinal fields, then they can be interpreted as categorical fields.

<xs:element name="Value">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="value" type="xs:string" use="required"/>
    <xs:attribute name="displayValue" type="xs:string"/>
    <xs:attribute name="property" default="valid">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="valid"/>
          <xs:enumeration value="invalid"/>
          <xs:enumeration value="missing"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

The attribute displayValue in a DataField or Value is similar to mappedValue in AssociationRule. The idea is that the raw data could have cryptic key values such as a product code sku043834, and the displayValue would then be the readable product name like coke, 12 fl oz can. The displayValue can be used in a visualizer. Scoring engines typically use key values.

If a categorical or ordinal field contains at least one Value element where the value of property is valid, then the set of Value elements completely defines the set of valid values. Otherwise any value is valid by default.
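
The rule above can be sketched in Python as follows. The function name classify_categorical and the representation of Value elements as (value, property) pairs are illustrative choices. Stripping trailing blanks from the input reflects the statement that they are not significant.

```python
from typing import Optional, Sequence, Tuple


def classify_categorical(
    raw: Optional[str], values: Sequence[Tuple[str, str]]
) -> str:
    """Classify an input of a categorical or ordinal field.

    values holds (value, property) pairs taken from the Value elements,
    with property being "valid", "invalid" or "missing".
    """
    if raw is None:
        return "missing"
    key = raw.rstrip(" ")  # trailing blanks are not significant
    declared = dict(values)
    if key in declared:
        return declared[key]
    # Once at least one valid Value is listed, the list is exhaustive.
    lists_valid_values = any(prop == "valid" for _, prop in values)
    return "invalid" if lists_valid_values else "valid"


# Hypothetical field with two valid categories and one missing marker.
GENDER = [("M", "valid"), ("F", "valid"), ("?", "missing")]
```

With GENDER, the input "X" is invalid because the valid values are listed exhaustively. If only ("?", "missing") were declared, "X" would be valid by default.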

The element Interval defines a range of numeric values.

<xs:element name="Interval">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="closure" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="openClosed"/>
          <xs:enumeration value="openOpen"/>
          <xs:enumeration value="closedOpen"/>
          <xs:enumeration value="closedClosed"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="leftMargin" type="NUMBER"/>
    <xs:attribute name="rightMargin" type="NUMBER"/>
  </xs:complexType>
</xs:element>

The attributes leftMargin and rightMargin are optional but at least one value must be defined. If a margin is missing, then ±∞ is assumed.
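
A membership test for an Interval might be sketched in Python as shown below. The function name interval_contains is hypothetical, and the sketch assumes the usual reading of the closure names, in which the first word describes the left boundary and the second the right boundary (so openClosed means left open, right closed).

```python
import math
from typing import Optional

CLOSURES = ("openClosed", "openOpen", "closedOpen", "closedClosed")


def interval_contains(
    x: float,
    closure: str,
    left_margin: Optional[float] = None,
    right_margin: Optional[float] = None,
) -> bool:
    """Return True if x lies inside the interval."""
    if closure not in CLOSURES:
        raise ValueError(f"unknown closure: {closure!r}")
    if left_margin is None and right_margin is None:
        raise ValueError("at least one margin must be defined")
    # A missing margin stands for minus or plus infinity.
    left = -math.inf if left_margin is None else left_margin
    right = math.inf if right_margin is None else right_margin
    left_closed = closure in ("closedOpen", "closedClosed")
    right_closed = closure in ("openClosed", "closedClosed")
    above_left = x >= left if left_closed else x > left
    below_right = x <= right if right_closed else x < right
    return above_left and below_right
```

For example, interval_contains(100, "closedOpen", 0, 100) is False because the right boundary is open, while interval_contains(1e9, "closedOpen", 0) is True because the missing rightMargin is treated as +∞.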
