PMML 2.0 -- Mining Schema
Each model contains one mining schema which lists fields as used in that model. This is a subset of the fields as defined in the data dictionary. While the mining schema contains information that is specific to a certain model, the data dictionary contains data definitions which do not vary per model. The main purpose of the mining schema is to list the fields which a user has to provide in order to apply the model.
<!ENTITY % FIELD-USAGE-TYPE "(active | predicted | supplementary)" >
<!ENTITY % OUTLIER-TREATMENT-METHOD "( asIs | asMissingValues | asExtremeValues ) " >
<!ENTITY % MISSING-VALUE-TREATMENT-METHOD "(asIs | asMean | asMode | asMedian | asValue) " >
<!ELEMENT MiningField (Extension*)>
<!ATTLIST MiningField
    name                     %FIELD-NAME;                      #REQUIRED
    usageType                %FIELD-USAGE-TYPE;                "active"
    outliers                 %OUTLIER-TREATMENT-METHOD;        "asIs"
    lowValue                 %NUMBER;                          #IMPLIED
    highValue                %NUMBER;                          #IMPLIED
    missingValueReplacement  CDATA                             #IMPLIED
    missingValueTreatment    %MISSING-VALUE-TREATMENT-METHOD;  #IMPLIED
>
<!ELEMENT MiningSchema (MiningField+) >
usageType
active: field used as input (independent field).
predicted: field whose value is predicted by the model.
supplementary: field holding additional descriptive information.
Supplementary fields are not required to apply a model; they only provide additional information for explanatory purposes. When a field has gone through preprocessing transformations before the model is built, an additional supplementary field is typically used to describe the statistics of the original field values.
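As an illustration of how a consumer might read the usage types, the following minimal Python sketch groups the fields of a mining schema by usageType, falling back to the DTD default "active" when the attribute is absent. The schema fragment and its field names (age, income, income_raw, risk) are invented for this example, and the helper fields_by_usage is hypothetical, not part of PMML.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; the field names are made up for illustration.
SCHEMA_XML = """
<MiningSchema>
  <MiningField name="age"/>
  <MiningField name="income" usageType="active"/>
  <MiningField name="income_raw" usageType="supplementary"/>
  <MiningField name="risk" usageType="predicted"/>
</MiningSchema>
"""


def fields_by_usage(schema_xml):
    """Group MiningField names by usageType (the DTD default is 'active')."""
    groups = {"active": [], "predicted": [], "supplementary": []}
    root = ET.fromstring(schema_xml.strip())
    for field in root.findall("MiningField"):
        usage = field.get("usageType", "active")
        groups[usage].append(field.get("name"))
    return groups


groups = fields_by_usage(SCHEMA_XML)
# The active fields are the ones a user has to provide to apply the model.
print(groups["active"])
```

In this sketch only the "active" group has to be supplied by the user; the supplementary field is carried along for explanation only.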
outliers
asIs: field values treated at face value.
asMissingValues: outlier values are treated as if they were missing.
asExtremeValues: outlier values are changed to a specific high or low value defined in MiningField.
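The three outlier treatments can be sketched in Python as follows. This is an illustration under stated assumptions rather than normative behavior: it assumes that an outlier is a value outside the range given by the lowValue and highValue attributes of MiningField, it uses None to stand for a missing value, and the function name treat_outlier is hypothetical.

```python
def treat_outlier(x, method="asIs", low_value=None, high_value=None):
    """Apply an outlier treatment to x; None stands for a missing value.

    Assumption: a value counts as an outlier when it lies below low_value
    or above high_value (the lowValue / highValue attributes).
    """
    if x is None or method == "asIs":
        return x  # values are taken at face value
    is_low = low_value is not None and x < low_value
    is_high = high_value is not None and x > high_value
    if method == "asMissingValues":
        # Outliers are treated as if they were missing.
        return None if (is_low or is_high) else x
    if method == "asExtremeValues":
        # Outliers are replaced by the nearest boundary value.
        if is_low:
            return low_value
        if is_high:
            return high_value
        return x
    raise ValueError("unknown outlier treatment: " + str(method))


print(treat_outlier(150, "asExtremeValues", low_value=0, high_value=100))
```

With the default method "asIs" the bounds are ignored, which matches the default value of the outliers attribute.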
name: symbolic name of field, must refer to a field in the data dictionary.
highValue and lowValue: used in conjunction with the outlier treatment method "asExtremeValues" as replacement values for outliers in this field: if x < lowValue then x = lowValue, and if x > highValue then x = highValue.
missingValueReplacement: If this attribute is specified, then a missing input value is automatically replaced by the given value. That is, the model itself works as if the given value had been found in the original input. For example, the surrogate operator in the TreeModel does not apply if the MiningField specifies a replacement value.
missingValueTreatment: In a PMML consumer this attribute is 'for information only'. The consumer only looks at missingValueReplacement; if that attribute is present, its value replaces missing values. The missingValueTreatment attribute just indicates how the missingValueReplacement was derived, but places no behavioral requirement on the consumer.
Specifications for missing values occur in several places in PMML.
- The external representation of missing values is not directly defined by PMML. A PMML consumer system may implement them as null values in a database, or as blank strings in a file, etc.
- The PMML data dictionary allows for an optional list of values which indicate a missing value. E.g., the data source may use the string "-" or "NA". If such a value occurs in the input data, a PMML consumer must treat it as a missing value.
- The PMML mining schema within a model may define an optional replacement value. If an input value is missing, then a PMML consumer must replace it with the specified value.
- Each type of PMML model defines its own method for how missing values are used in the computation of the score results.
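The first three points above can be combined into a small Python sketch of how a consumer might resolve one input value. It rests on several assumptions: None is used as the external representation of a missing value, the missing-value markers are passed in as a plain tuple, and the function name resolve_input is hypothetical. Conversion of the replacement string to the data type of the field is deliberately left out.

```python
def resolve_input(raw, missing_markers=(), replacement=None):
    """Resolve one raw input value for a mining field.

    raw             -- the value from the data source (None = missing, an assumption)
    missing_markers -- values the data dictionary declares as missing, e.g. "-" or "NA"
    replacement     -- the missingValueReplacement of the MiningField, if any
    """
    is_missing = raw is None or raw in missing_markers
    if not is_missing:
        return raw
    # With a replacement, the model works as if that value had been supplied.
    # Without one, None is returned and the model-specific handling applies.
    return replacement


markers = ("-", "NA")
print(resolve_input("NA", markers, replacement="0"))
print(resolve_input("42", markers, replacement="0"))
```

When no replacement is defined the sketch returns None, leaving the value to the model-specific treatment of missing values described in the last point.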
Conformance
Outlier treatment 'asIs', i.e. the default value of the attribute outliers in MiningField, is in core; the other options are not in core.