Data Mining Group - Mining Schema

PMML 3.1 - Mining Schema

Each model contains one mining schema which lists the fields used in that model. These fields are a subset of the fields defined in the data dictionary. Field names in the MiningSchema must be unique; otherwise the PMML document is not valid. While the mining schema contains information that is specific to a certain model, the data dictionary contains data definitions which do not vary per model. The main purpose of the mining schema is to list the fields which a user has to provide in order to apply the model.


  <xs:element name="MiningSchema">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element maxOccurs="unbounded" ref="MiningField" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="MiningField">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="name" type="FIELD-NAME" use="required" />
      <xs:attribute name="usageType" type="FIELD-USAGE-TYPE" default="active" />
      <xs:attribute name="optype" type="OPTYPE" />
      <xs:attribute name="importance" type="PROB-NUMBER" />
      <xs:attribute name="outliers" type="OUTLIER-TREATMENT-METHOD" default="asIs" />
      <xs:attribute name="lowValue" type="NUMBER" />
      <xs:attribute name="highValue" type="NUMBER" />
      <xs:attribute name="missingValueReplacement" type="xs:string" />
      <xs:attribute name="missingValueTreatment" type="MISSING-VALUE-TREATMENT-METHOD" />
      <xs:attribute name="invalidValueTreatment" type="INVALID-VALUE-TREATMENT-METHOD" default="returnInvalid" />
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="FIELD-USAGE-TYPE">
    <xs:restriction base="xs:string">
      <xs:enumeration value="active" />
      <xs:enumeration value="predicted" />
      <xs:enumeration value="supplementary" />
      <xs:enumeration value="group" />
      <xs:enumeration value="order" />
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="OUTLIER-TREATMENT-METHOD">
    <xs:restriction base="xs:string">
      <xs:enumeration value="asIs" />
      <xs:enumeration value="asMissingValues" />
      <xs:enumeration value="asExtremeValues" />
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="MISSING-VALUE-TREATMENT-METHOD">
    <xs:restriction base="xs:string">
      <xs:enumeration value="asIs" />
      <xs:enumeration value="asMean" />
      <xs:enumeration value="asMode" />
      <xs:enumeration value="asMedian" />
      <xs:enumeration value="asValue" />
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="INVALID-VALUE-TREATMENT-METHOD">
    <xs:restriction base="xs:string">
      <xs:enumeration value="returnInvalid" />
      <xs:enumeration value="asIs" />
      <xs:enumeration value="asMissing" />
    </xs:restriction>
  </xs:simpleType>

name: symbolic name of the field; it must refer to a field in the data dictionary.
The name of a field is used as an identifier within the PMML document. When an application uses a PMML model, it binds the actual parameter values to the input fields in the MiningSchema. Parameters are passed by name. If the data dictionary defines a displayName for a certain field, then this displayName is used for matching the input parameters to the internal formulas. This allows using artificial identifiers within the models while still being able to use human-readable names at the interface.
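
As an illustration of this name binding, the following fragment is a minimal sketch. The field names f1 and f2 and their displayName values are hypothetical, and the fragment is not a complete PMML document. Under the rule above, an application would pass a parameter named "Customer Age", while the model refers to the same field internally as f1.

    <!-- Hypothetical field names; fragment only, not a complete PMML document -->
    <DataDictionary numberOfFields="2">
      <DataField name="f1" displayName="Customer Age" optype="continuous" dataType="double"/>
      <DataField name="f2" displayName="Region" optype="categorical" dataType="string"/>
    </DataDictionary>

    <MiningSchema>
      <MiningField name="f1"/>
      <MiningField name="f2"/>
    </MiningSchema>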

usageType

active: field used as input (independent field).

predicted: field whose value is predicted by the model.

supplementary: field holding additional descriptive information. Supplementary fields are not required to apply a model; they are provided as additional information for explanatory purposes. When a field has gone through preprocessing transformations before a model is built, an additional supplementary field is typically used to describe the statistics for the original field values.

group: field similar to the SQL "group by". For example, this is used by the association and sequence models to group items into transactions by customerID or by transactionID.

order: This field defines the order of items or transactions and is currently used in sequence models only. Similarly to "group", it is motivated by the SQL syntax, namely by the "order by" statement.
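
The usage types listed above might be combined in a prediction model as in the following sketch. All field names are hypothetical. To apply such a model, a user would have to supply only the two active fields, age and income.

    <!-- Hypothetical fields for a prediction model -->
    <MiningSchema>
      <MiningField name="age" usageType="active"/>
      <!-- usageType is omitted here, so the default "active" applies -->
      <MiningField name="income"/>
      <!-- descriptive only; not required to apply the model -->
      <MiningField name="rawIncome" usageType="supplementary"/>
      <MiningField name="churn" usageType="predicted"/>
    </MiningSchema>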

optype: The attribute value overrides the corresponding value in the DataField. That is, a DataField can be used with different optypes in different models. For example, a 0/1 indicator could be used as a numeric input field in a regression model while the same field is used as a categorical field in a tree model.
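
The 0/1 indicator case can be sketched as follows. The field name hasContract is hypothetical, and the three elements are separate fragments that would appear in the data dictionary and in the mining schemas of two different models.

    <!-- Hypothetical 0/1 indicator, declared once in the data dictionary -->
    <DataField name="hasContract" optype="categorical" dataType="integer"/>

    <!-- Mining schema of a regression model: used as a numeric input -->
    <MiningField name="hasContract" optype="continuous"/>

    <!-- Mining schema of a tree model: used as a categorical input -->
    <MiningField name="hasContract" optype="categorical"/>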

importance: states the relative importance of the field. This indicator is typically used in prediction models in order to rank fields by their predictive contribution. A value of 1.0 suggests that the target field is directly correlated to this field. A value of 0.0 suggests that the field is completely irrelevant. Most likely such a field would have a usageType="supplementary" rather than usageType="active". Note that the importance cannot be negative. Unlike a Pearson correlation coefficient, it does not indicate the 'direction' of a correlation with a negative number if a higher field value correlates to a lower target value. There is no commonly accepted correlation measure that is applicable to all combinations of numeric and categorical fields. But this attribute is still useful as it provides a mechanism for representing the results of feature selection. Note that other mining standards such as JDM include algorithms for computing the importance of input fields. The results can be represented by this attribute in PMML.
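
A feature-selection result might be recorded as in the sketch below. The field names and importance values are invented for illustration and do not come from any real model.

    <!-- Hypothetical fields; importance values are made up for illustration -->
    <MiningSchema>
      <MiningField name="tenure" importance="0.82"/>
      <MiningField name="age" importance="0.35"/>
      <!-- an irrelevant field would more likely be supplementary than active -->
      <MiningField name="postalCode" usageType="supplementary" importance="0.0"/>
      <MiningField name="churn" usageType="predicted"/>
    </MiningSchema>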

outliers

asIs: field values treated at face value.

asMissingValues: outlier values are treated as if they were missing.

asExtremeValues: outlier values are changed to a specific high or low value defined in MiningField.

highValue and lowValue: required when outliers="asExtremeValues" is specified. They supply the replacement values for records with outliers in this field: if x < lowValue then x = lowValue, and if x > highValue then x = highValue.
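
A minimal sketch of the asExtremeValues treatment follows. The field name and the bounds are hypothetical, and the handling of values above highValue is assumed to mirror the lowValue rule.

    <!-- Hypothetical field and bounds -->
    <MiningField name="temperature" outliers="asExtremeValues" lowValue="0" highValue="40"/>
    <!-- an input of 55 would be scored as 40 (above highValue) -->
    <!-- an input of -5 would be scored as 0 (below lowValue) -->
    <!-- an input of 20 lies within the bounds and is used unchanged -->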

The DataDictionary describes the value element, which includes a property attribute for defining values as valid, invalid or missing. While valid values can be dealt with unchanged, the standard allows special treatment for missing and invalid values. The next two sections describe how missing and invalid values should be handled.

Missing Values

missingValueReplacement: If this attribute is specified then a missing input value is automatically replaced by the given value. That is, the model itself works as if the given value was found in the original input. For example the surrogate operator in the TreeModel does not apply if the MiningField specifies a replacement value.

missingValueTreatment: In a PMML consumer this field is 'for information only'. The consumer only looks at missingValueReplacement. If a value is present it replaces missing values. The missingValueTreatment attribute just indicates how the missingValueReplacement was derived, but places no behavioral requirement on the consumer.

The missingValueTreatment attribute is a useful parameter in an API for training, and the parameter can be copied into the PMML model. The scoring function, however, does not always know the actual mean, mode, median, etc. The corresponding value must therefore be present in the attribute missingValueReplacement.

For example, if you want the scoring function to replace missing values by the mean value, and the mean value in the training data is 3.14, then write


    <MiningField
      name="..."
      missingValueReplacement="3.14"
      missingValueTreatment="asMean" />

The replacement value MUST be specified using the missingValueReplacement attribute.

Specifications for missing values occur in several places in PMML:

The external representation of missing values is not directly defined by PMML. A PMML consumer system may implement them as null values in a database, or as blank strings in a file, etc.
The PMML data dictionary allows for an optional list of values which indicate a missing value. E.g., the data source may use the string "-" or "NA". If such a value occurs in the input data, a PMML consumer must treat it as a missing value.
The PMML mining schema within a model may define an optional replacement value. If an input value is missing, then a PMML consumer must replace it with the specified value.
For each type of PMML model, there is a specific method for how missing values are used in the computation of the score results.
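
The second and third points can be sketched together as follows. The field name, the missing-value markers, and the replacement value are hypothetical. With these fragments, an input of "NA" or "-" for income would be treated as missing, and the consumer would then score the record as if the input had been 42000.

    <!-- Data dictionary: hypothetical markers that indicate a missing value -->
    <DataField name="income" optype="continuous" dataType="double">
      <Value value="NA" property="missing"/>
      <Value value="-" property="missing"/>
    </DataField>

    <!-- Mining schema of a model: hypothetical replacement value -->
    <MiningField name="income"
      missingValueReplacement="42000"
      missingValueTreatment="asMedian"/>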

Invalid Values

invalidValueTreatment: This attribute specifies how invalid input values are handled. returnInvalid is the default and specifies that, when an invalid input is encountered, the model should return a value indicating that the result is invalid. asIs means to use the input without modification. asMissing specifies that an invalid input value should be treated as a missing value and follow the behavior specified by the missingValueReplacement attribute if present (see above). If asMissing is specified but no corresponding missingValueReplacement is present, a missing value is passed on for eventual handling by successive transformations via DerivedFields or in the actual mining model.
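
The three cases might look as follows in a mining schema. This is a sketch with hypothetical field names and a hypothetical replacement value.

    <!-- Hypothetical fields illustrating invalid value handling -->
    <MiningSchema>
      <!-- an invalid color is treated as missing and then replaced by "unknown" -->
      <MiningField name="color" invalidValueTreatment="asMissing" missingValueReplacement="unknown"/>
      <!-- an invalid size is treated as missing and passed on as a missing value -->
      <MiningField name="size" invalidValueTreatment="asMissing"/>
      <!-- no attribute given: the default returnInvalid applies -->
      <MiningField name="weight"/>
    </MiningSchema>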

Conformance

Outlier treatment 'asIs', i.e. the default value of the attribute outliers in MiningField, is in core; other options are not in core.