PMML 2.0 -- Mining Schema

Each model contains one mining schema, which lists the fields used in that model. These fields are a subset of the fields defined in the data dictionary. The mining schema contains information that is specific to a particular model, whereas the data dictionary contains data definitions that do not vary per model. The main purpose of the mining schema is to list the fields that a user has to provide in order to apply the model.


 <!ENTITY  % FIELD-USAGE-TYPE "(active |
                                predicted |
                                supplementary)" >

 <!ENTITY  % OUTLIER-TREATMENT-METHOD "( asIs |
                                         asMissingValues |
                                         asExtremeValues ) " >

 <!ENTITY  % MISSING-VALUE-TREATMENT-METHOD "(asIs | asMean |
                                              asMode | asMedian |
                                              asValue) " >

<!ELEMENT MiningField (Extension*)>
<!ATTLIST MiningField
     name                     %FIELD-NAME;                     #REQUIRED
     usageType                %FIELD-USAGE-TYPE;               "active"
     outliers                 %OUTLIER-TREATMENT-METHOD;       "asIs"
     lowValue                 %NUMBER;                         #IMPLIED
     highValue                %NUMBER;                         #IMPLIED
     missingValueReplacement  CDATA                            #IMPLIED 
     missingValueTreatment    %MISSING-VALUE-TREATMENT-METHOD; #IMPLIED
>

<!ELEMENT MiningSchema   (MiningField+) >
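
As an illustration of the declarations above, the following Python sketch parses a small MiningSchema instance and fills in the attribute defaults by hand, because xml.etree.ElementTree does not apply defaults from a DTD. The field names (age, income, region, risk) and the helper read_mining_fields are hypothetical and not part of PMML.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; the field names are illustrative only.
EXAMPLE_SCHEMA = """
<MiningSchema>
  <MiningField name="age"/>
  <MiningField name="income" outliers="asExtremeValues"
               lowValue="0" highValue="250000"/>
  <MiningField name="region" usageType="supplementary"/>
  <MiningField name="risk" usageType="predicted"/>
</MiningSchema>
"""

def read_mining_fields(xml_text):
    """Return one dict per MiningField, applying the DTD defaults manually."""
    root = ET.fromstring(xml_text.strip())
    fields = []
    for elem in root.findall("MiningField"):
        fields.append({
            "name": elem.attrib["name"],                   # #REQUIRED
            "usageType": elem.get("usageType", "active"),  # DTD default
            "outliers": elem.get("outliers", "asIs"),      # DTD default
            "lowValue": elem.get("lowValue"),              # #IMPLIED, may be None
            "highValue": elem.get("highValue"),            # #IMPLIED, may be None
            "missingValueReplacement": elem.get("missingValueReplacement"),
            "missingValueTreatment": elem.get("missingValueTreatment"),
        })
    return fields

fields = read_mining_fields(EXAMPLE_SCHEMA)
for f in fields:
    print(f["name"], f["usageType"], f["outliers"])
```

Attribute values are kept as strings here; a real consumer would convert lowValue and highValue to numbers before use.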

usageType

    active: field used as input (independent field).

    predicted: field whose value is predicted by the model.

    supplementary: field holding additional descriptive information.

    Supplementary fields are not required to apply a model; they are provided as additional information for explanatory purposes. When a field has gone through preprocessing transformations before a model is built, an additional supplementary field is typically used to describe the statistics of the original field values.
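
A minimal sketch of how the usage types might be used to decide which inputs a caller must supply. It assumes that exactly the active fields are required inputs, which follows from the descriptions above but is not stated as a rule; the function name required_inputs and the sample field names are hypothetical.

```python
def required_inputs(mining_fields):
    """Return the names of fields with usageType 'active'.

    mining_fields is a list of (name, usageType) pairs. Predicted fields are
    model outputs and supplementary fields are descriptive only, so this
    sketch treats neither as a required input.
    """
    return [name for name, usage in mining_fields if usage == "active"]

# Hypothetical field list for illustration.
schema = [
    ("age", "active"),
    ("income", "active"),
    ("region", "supplementary"),
    ("risk", "predicted"),
]
needed = required_inputs(schema)
print(needed)
```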

outliers

    asIs: field values treated at face value.

    asMissingValues: outlier values are treated as if they were missing.

    asExtremeValues: outlier values are changed to a specific high or low value defined in MiningField.

name: symbolic name of the field; it must refer to a field in the data dictionary.

highValue and lowValue: used in conjunction with the %OUTLIER-TREATMENT-METHOD; value "asExtremeValues" as replacement values for records with outliers in this field: if x < lowValue then x = lowValue, and if x > highValue then x = highValue.
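
The three outlier treatments can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative algorithm: it assumes that an outlier is any value outside the range given by lowValue and highValue, that the upper bound is applied in the same way as the lower bound, and that a missing value is represented by None. The function name treat_outlier is hypothetical.

```python
METHODS = {"asIs", "asMissingValues", "asExtremeValues"}

def treat_outlier(x, outliers="asIs", low=None, high=None):
    """Apply a MiningField outlier treatment to one numeric value x.

    low and high correspond to lowValue and highValue; either may be None
    when the attribute is absent. None is returned for "missing"
    (an assumption of this sketch).
    """
    if outliers not in METHODS:
        raise ValueError("unknown outlier treatment: %r" % outliers)
    is_low = low is not None and x < low
    is_high = high is not None and x > high
    if outliers == "asIs" or not (is_low or is_high):
        return x                      # value is used at face value
    if outliers == "asMissingValues":
        return None                   # treated as if it were missing
    return low if is_low else high    # asExtremeValues: clamp to the bound

print(treat_outlier(-5, "asExtremeValues", low=0, high=100))
print(treat_outlier(150, "asMissingValues", low=0, high=100))
```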

missingValueReplacement: If this attribute is specified, then a missing input value is automatically replaced by the given value. That is, the model itself works as if the given value had been found in the original input. For example, the surrogate operator in the TreeModel does not apply if the MiningField specifies a replacement value.

missingValueTreatment: In a PMML consumer this attribute is 'for information only'. The consumer only looks at missingValueReplacement: if a value is present, it replaces missing values. The missingValueTreatment attribute just indicates how the missingValueReplacement was derived and places no behavioral requirement on the consumer.
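
On the producer side, the relationship between the two attributes might look like the following Python sketch. It assumes that asMean, asMedian and asMode refer to the corresponding statistics of the non-missing training values, that asValue means a constant chosen by the user, and that asIs means no replacement value is written; none of these definitions is spelled out in this section. The function name derive_replacement is hypothetical.

```python
import statistics

def derive_replacement(training_values, treatment, constant=None):
    """Return a candidate missingValueReplacement for a given treatment.

    training_values may contain None for missing entries, which are ignored.
    Returns None when no replacement value would be written.
    """
    present = [v for v in training_values if v is not None]
    if treatment == "asIs":
        return None                        # assumed: no replacement written
    if treatment == "asMean":
        return statistics.mean(present)
    if treatment == "asMedian":
        return statistics.median(present)
    if treatment == "asMode":
        return statistics.mode(present)
    if treatment == "asValue":
        return constant                    # user-supplied constant
    raise ValueError("unknown treatment: %r" % treatment)

ages = [1, 2, None, 6]
print(derive_replacement(ages, "asMean"))
print(derive_replacement(ages, "asMedian"))
```

A consumer never repeats this computation; it only reads the resulting missingValueReplacement.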

Specifications for missing values occur in several places in PMML.

  1. The external representation of missing values is not directly defined by PMML. A PMML consumer system may implement them as null values in a database, as blank strings in a file, etc.
  2. The PMML data dictionary allows for an optional list of values which indicate a missing value. E.g., the data source may use the string "-" or "NA". If such a value occurs in the input data, a PMML consumer must treat it as a missing value.
  3. The PMML mining schema within a model may define an optional replacement value. If an input value is missing, then a PMML consumer must replace it with the specified value.
  4. Each type of PMML model defines its own method for handling missing values when computing the score results.
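
The first three points can be combined into a short consumer-side sketch in Python. It assumes that the external representation of a missing value is None, and that a value still missing after this step is handed to the model-specific logic of point 4. The function name prepare_input and its parameters are hypothetical.

```python
def prepare_input(raw, missing_markers=(), replacement=None):
    """Resolve one raw input value before it is passed to a model.

    raw             -- the external value, or None if absent (point 1)
    missing_markers -- data dictionary values meaning "missing" (point 2)
    replacement     -- missingValueReplacement from the mining schema,
                       or None if the attribute is not specified (point 3)
    """
    value = None if raw is None or raw in missing_markers else raw
    if value is None and replacement is not None:
        return replacement   # the model sees this as ordinary input
    return value             # None is left to the model type (point 4)

markers = ("-", "NA")
print(prepare_input("NA", markers, replacement="0"))
print(prepare_input("42", markers, replacement="0"))
print(prepare_input("NA", markers))
```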

Conformance

outlier treatment 'asIs', i.e. the default value of the attribute outliers in MiningField, is in core; other options are not in core.
