Data Mining Group - PMML 4.0

Each model contains one MiningSchema which lists fields as used in that model. This is a subset of the fields as defined in the DataDictionary. While the MiningSchema contains information that is specific to a certain model, the DataDictionary contains data definitions which do not vary per model. The main purpose of the MiningSchema is to list the fields which a user has to provide in order to apply the model.

MiningFields identify which of the DataFields defined in the DataDictionary are used in the model. They also define the usage of each field (active, supplementary, predicted, ...) as well as policies for treating missing or invalid values. Within a MiningSchema, DataField references must be valid and unique. In other words, the name of a MiningField must match that of a DataField, and no two MiningFields can have the same name.


  <xs:element name="MiningSchema">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element maxOccurs="unbounded" ref="MiningField" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="MiningField">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="name" type="FIELD-NAME" use="required" />
      <xs:attribute name="usageType" type="FIELD-USAGE-TYPE" default="active" />
      <xs:attribute name="optype" type="OPTYPE" />
      <xs:attribute name="importance" type="PROB-NUMBER" />
      <xs:attribute name="outliers" type="OUTLIER-TREATMENT-METHOD" default="asIs" />
      <xs:attribute name="lowValue" type="NUMBER" />
      <xs:attribute name="highValue" type="NUMBER" />
      <xs:attribute name="missingValueReplacement" type="xs:string" />
      <xs:attribute name="missingValueTreatment" type="MISSING-VALUE-TREATMENT-METHOD" />
      <xs:attribute name="invalidValueTreatment" type="INVALID-VALUE-TREATMENT-METHOD" default="returnInvalid" />
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="FIELD-USAGE-TYPE">
    <xs:restriction base="xs:string">
      <xs:enumeration value="active" />
      <xs:enumeration value="predicted" />
      <xs:enumeration value="supplementary" />
      <xs:enumeration value="group" />
      <xs:enumeration value="order" />
      <xs:enumeration value="frequencyWeight" />
      <xs:enumeration value="analysisWeight" />
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="OUTLIER-TREATMENT-METHOD">
    <xs:restriction base="xs:string">
      <xs:enumeration value="asIs" />
      <xs:enumeration value="asMissingValues" />
      <xs:enumeration value="asExtremeValues" />
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="MISSING-VALUE-TREATMENT-METHOD">
    <xs:restriction base="xs:string">
      <xs:enumeration value="asIs" />
      <xs:enumeration value="asMean" />
      <xs:enumeration value="asMode" />
      <xs:enumeration value="asMedian" />
      <xs:enumeration value="asValue" />
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="INVALID-VALUE-TREATMENT-METHOD">
    <xs:restriction base="xs:string">
      <xs:enumeration value="returnInvalid" />
      <xs:enumeration value="asIs" />
      <xs:enumeration value="asMissing" />
    </xs:restriction>
  </xs:simpleType>

name: symbolic name of the field; it must refer to a field in the DataDictionary.
The name of a field is used as an identifier within the PMML document. When an application uses a PMML model, it binds the actual parameter values to the input fields in the MiningSchema. Parameters are passed by name. Even if the DataDictionary defines a displayName for a field, the name attribute is still used for matching the input parameters to the internal formulas. The displayName allows human-readable names to be used at the interface while artificial identifiers are used within the semantics of the model.
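
As an illustration of name binding, the following fragment is a minimal sketch. The field names cust_age and churn and their displayName values are hypothetical. Each MiningField refers to a DataField by name, so a consumer would be expected to bind an input parameter called cust_age rather than one called "Customer age".

```xml
<!-- Fragment only: DataDictionary sits at the PMML top level,
     MiningSchema inside a model element. Field names are hypothetical. -->
<DataDictionary numberOfFields="2">
  <DataField name="cust_age" displayName="Customer age" optype="continuous" dataType="double"/>
  <DataField name="churn" displayName="Churned" optype="categorical" dataType="string"/>
</DataDictionary>

<MiningSchema>
  <!-- Inputs are matched on the name attribute, not on displayName -->
  <MiningField name="cust_age"/>
  <MiningField name="churn" usageType="predicted"/>
</MiningSchema>
```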

usageType

active: field used as input (independent field).

predicted: field whose value is predicted by the model.

supplementary: field holding additional descriptive information. Supplementary fields are not required to apply a model; they are provided as additional information for explanatory purposes. When a field has gone through preprocessing transformations before a model is built, an additional supplementary field is typically used to describe the statistics for the original field values.

group: field similar to the SQL GROUP BY. For example, this is used by AssociationModel and SequenceModel to group items into transactions by customerID or by transactionID.

order: field that defines the order of items or transactions; it is currently used in SequenceModel only. Like group, it is motivated by SQL syntax, namely by the ORDER BY clause.

frequencyWeight and analysisWeight: These fields are not needed for scoring, but they provide important information on how the model was built. Frequency weight usually has positive integer values and is sometimes called "replication weight". Its values can be interpreted as the number of times each record appears in the data. Analysis weight can have fractional positive values; it could be used as the regression weight in regression models, as the case weight in trees, etc. It can be interpreted as expressing the different importance of the cases in the model. Counts in ModelStats and Partitions can be computed using frequency weight, whereas mean and standard deviation values can be computed using both weights.

The definition of predicted fields in the MiningSchema is not required and it does not have an impact on the scoring results. But it is very useful because it gives a user a first hint about the detailed results that can be computed by the model.
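
The usage types described above might appear together as in the following sketch. All field names are hypothetical, and the two schemas are separate fragments that would belong to different models (the second one to a SequenceModel or AssociationModel).

```xml
<!-- Predictive model: fields without usageType default to "active" -->
<MiningSchema>
  <MiningField name="age"/>
  <MiningField name="income" usageType="active"/>
  <MiningField name="region_label" usageType="supplementary"/>
  <MiningField name="case_count" usageType="frequencyWeight"/>
  <MiningField name="churn" usageType="predicted"/>
</MiningSchema>

<!-- Sequence model: items are grouped per customer and ordered by time -->
<MiningSchema>
  <MiningField name="customerID" usageType="group"/>
  <MiningField name="visit_time" usageType="order"/>
  <MiningField name="item" usageType="active"/>
</MiningSchema>
```

In the first schema only age and income would have to be supplied for scoring; the supplementary, weight and predicted fields are informational.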

optype: The attribute value overrides the corresponding value in the DataField. That is, a DataField can be used with different optypes in different models. For example, a 0/1 indicator could be used as a numeric input field in a regression model while the same field is used as a categorical field in a tree model.
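
The 0/1 indicator case could be written as in the following sketch. The field name is_member is hypothetical, and the two MiningSchema fragments are assumed to belong to a regression model and a tree model respectively.

```xml
<!-- DataDictionary: the indicator is declared once, here as categorical -->
<DataField name="is_member" optype="categorical" dataType="integer"/>

<!-- MiningSchema of a regression model: optype overridden to continuous -->
<MiningSchema>
  <MiningField name="is_member" optype="continuous"/>
</MiningSchema>

<!-- MiningSchema of a tree model: the categorical optype is kept -->
<MiningSchema>
  <MiningField name="is_member" optype="categorical"/>
</MiningSchema>
```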

importance: states the relative importance of the field. This indicator is typically used in predictive models in order to rank fields by their predictive contribution. A value of 1.0 suggests that the target field is directly correlated to this field. A value of 0.0 suggests that the field is completely irrelevant. Most likely such a field would have usageType="supplementary" rather than usageType="active".
Note that the importance cannot be negative. Unlike a Pearson correlation coefficient, it does not indicate the 'direction' of a correlation with a negative number if a higher field value correlates to a lower target value. There is no commonly accepted correlation measure that is applicable to all combinations of numeric and categorical fields. But this attribute is still useful as it provides a mechanism for representing the results of feature selection.
Note that other mining standards such as JDM include algorithms for computing the importance of input fields. The results can be represented by this attribute in PMML.
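
A feature-selection result might be recorded as in the sketch below. The field names and importance values are hypothetical and chosen only to show the permitted range from 0.0 to 1.0; they are not the output of any particular algorithm.

```xml
<MiningSchema>
  <MiningField name="income" importance="0.82"/>
  <MiningField name="age" importance="0.35"/>
  <!-- An irrelevant field is more likely to be listed as supplementary -->
  <MiningField name="record_id" usageType="supplementary" importance="0.0"/>
  <MiningField name="churn" usageType="predicted"/>
</MiningSchema>
```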

outliers

asIs: field values treated at face value.

asMissingValues: outlier values are treated as if they were missing.

asExtremeValues: outlier values are changed to a specific high or low value defined in MiningField.

highValue and lowValue: used in conjunction with, and required for, outliers="asExtremeValues". They give the replacement values for records with outliers in this field.
Usage:

if x<lowValue then x = lowValue

if x>highValue then x = highValue

Note that outliers applies only to fields defined in the MiningSchema and hence cannot be used for DerivedFields.
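
The clamping rule can be illustrated with a single MiningField. This is a minimal sketch; the field name temperature and the bounds are hypothetical.

```xml
<MiningSchema>
  <!-- With these bounds:
         input -55 is below lowValue and is scored as -40
         input  72 is above highValue and is scored as 60
         input  21 lies within the bounds and is used unchanged -->
  <MiningField name="temperature"
               outliers="asExtremeValues"
               lowValue="-40"
               highValue="60"/>
</MiningSchema>
```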

The DataDictionary describes the Value element, which includes a property attribute for defining values as valid, invalid or missing. While valid values can be used unchanged, the standard allows special treatment for missing and invalid values. The next two sections describe how missing and invalid values should be handled.

Missing Values

missingValueReplacement: If this attribute is specified, then a missing input value is automatically replaced by the given value. That is, the model itself works as if the given value had been found in the original input. For example, the surrogate operator in TreeModel does not apply if the MiningField specifies a replacement value.

missingValueTreatment: In a PMML consumer this attribute is for information only. The consumer only looks at missingValueReplacement: if a value is present, it replaces missing values. The missingValueTreatment attribute just indicates how the missingValueReplacement was derived, but places no behavioral requirement on the consumer.

missingValueTreatment is a useful parameter in an API for training. The parameter can be copied into the PMML model. The scoring function, however, does not always know the actual mean, mode, median, etc. The corresponding value must therefore be present in the attribute missingValueReplacement.

For example, if you want the scoring function to replace missing values by the mean value, and the mean value in the training data is 3.14, write


  ...
  <MiningField name="foo" missingValueReplacement="3.14" missingValueTreatment="asMean" />
  ...

The replacement value MUST be specified using the missingValueReplacement attribute.

Specifications for missing values occur in several places in PMML.

The external representation of missing values is not directly defined by PMML. A PMML consumer system may implement them as null values in a database, or as blank strings in a file, etc.
The DataDictionary allows for an optional list of values which indicate a missing value. E.g., the data source may use the string "-" or "NA". If such a value occurs in the input data, a PMML consumer must treat it as a missing value.
The MiningSchema within a model may define an optional replacement value. If an input value is missing, then a PMML consumer must replace it with the specified value.
For each PMML model type, there is a specific method for how missing values are used in the computation of the score results.
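
The interplay between the DataDictionary and the MiningSchema might look as follows. This is a sketch only: the field name income, the missing-value markers and the replacement value 52000 are hypothetical.

```xml
<!-- DataDictionary: the strings "-" and "NA" mark a missing value -->
<DataField name="income" optype="continuous" dataType="double">
  <Value value="-" property="missing"/>
  <Value value="NA" property="missing"/>
</DataField>

<!-- MiningSchema: a missing income is replaced by 52000 before scoring.
     missingValueTreatment only records that 52000 was the training median. -->
<MiningSchema>
  <MiningField name="income"
               missingValueReplacement="52000"
               missingValueTreatment="asMedian"/>
</MiningSchema>
```

With these definitions, an input of "NA" for income would first be recognized as missing and then scored as if 52000 had been supplied.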

Invalid Values

invalidValueTreatment: This attribute specifies how invalid input values are handled. returnInvalid is the default and specifies that, when an invalid input is encountered, the model should return a value indicating that the result is invalid. asIs means that the input is used without modification. asMissing specifies that an invalid input value should be treated as a missing value and follow the behavior specified by the missingValueReplacement attribute, if present (see above). If asMissing is specified but no missingValueReplacement is present, a missing value is passed on for eventual handling by subsequent transformations via DerivedFields or in the actual mining model.
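
The options can be compared in a short sketch. The field names color and size, the value "unknown" and the replacement "red" are hypothetical.

```xml
<!-- DataDictionary: "unknown" is declared as an invalid value for both fields -->
<DataField name="color" optype="categorical" dataType="string">
  <Value value="unknown" property="invalid"/>
</DataField>
<DataField name="size" optype="categorical" dataType="string">
  <Value value="unknown" property="invalid"/>
</DataField>

<MiningSchema>
  <!-- color="unknown" is treated as missing and then replaced by "red" -->
  <MiningField name="color"
               invalidValueTreatment="asMissing"
               missingValueReplacement="red"/>
  <!-- size="unknown" falls under the default returnInvalid,
       so the model reports an invalid result -->
  <MiningField name="size"/>
</MiningSchema>
```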
