PMML 4.0 - Mining Schema

Each model contains one MiningSchema, which lists the fields used in that model. These fields are a subset of the fields defined in the DataDictionary. While the MiningSchema contains information that is specific to a certain model, the DataDictionary contains data definitions that do not vary per model. The main purpose of the MiningSchema is to list the fields that a user has to provide in order to apply the model.

MiningFields identify which of the DataFields defined in the DataDictionary are used in the model. They also define the usage of each field (active, supplementary, predicted, ...) as well as policies for treating missing or invalid values.

Within a MiningSchema, DataField references must be valid and unique. In other words, the name of each MiningField must match that of a DataField, and no two MiningFields can have the same name.
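
To make the subset relationship concrete, the fragment below sketches a DataDictionary with three fields and a MiningSchema that uses two of them. The field names (age, region, churn) are invented for illustration; in a complete document the MiningSchema would sit inside a model element.

```xml
<!-- Hypothetical field names; shown only to illustrate the subset relationship. -->
<DataDictionary numberOfFields="3">
  <DataField name="age" optype="continuous" dataType="double"/>
  <DataField name="region" optype="categorical" dataType="string"/>
  <DataField name="churn" optype="categorical" dataType="string"/>
</DataDictionary>

<!-- Inside the model element: every name refers to a DataField above,
     and no name appears twice. "region" is simply not used by this model. -->
<MiningSchema>
  <MiningField name="age" usageType="active"/>
  <MiningField name="churn" usageType="predicted"/>
</MiningSchema>
```
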
name: symbolic name of the field; must refer to a field in the DataDictionary.

usageType:
  active: field used as input (independent field).
  predicted: field whose value is predicted by the model.
  supplementary: field holding additional descriptive information. Supplementary fields are not required to apply a model; they are provided as additional information for explanatory purposes. When a field has gone through preprocessing transformations before a model is built, an additional supplementary field is typically used to describe the statistics for the original field values.
  group: field similar to the SQL GROUP BY. For example, AssociationModel and SequenceModel use it to group items into transactions by customerID or by transactionID.
  order: field that defines the order of items or transactions; currently used in SequenceModel only. Like group, it is motivated by SQL syntax, namely the ORDER BY clause.
  frequencyWeight and analysisWeight: fields that are not needed for scoring but provide important information on how the model was built. Frequency weight usually has positive integer values and is sometimes called "replication weight"; its values can be interpreted as the number of times each record appears in the data. Analysis weight can have fractional positive values; it could be used as the regression weight in regression models or as the case weight in trees, and it can be interpreted as the relative importance of the cases in the model. Counts in ModelStats and Partitions can be computed using frequency weight; mean and standard deviation values can be computed using both weights.

The definition of predicted fields in the MiningSchema is not required and has no impact on the scoring results. It is nevertheless useful because it gives a user a first hint about the detailed results that can be computed by the model.

optype: the attribute value overrides the corresponding value in the DataField. That is, a DataField can be used with different optypes in different models. For example, a 0/1 indicator could be used as a numeric input field in a regression model while the same field is used as a categorical field in a tree model.

importance: states the relative importance of the field. This indicator is typically used in predictive models to rank fields by their predictive contribution. A value of 1.0 suggests that the target field is directly correlated to this field. A value of 0.0 suggests that the field is completely irrelevant; most likely such a field would have usageType="supplementary" rather than usageType="active".

outliers:
  asIs: field values are treated at face value.
  asMissingValues: outlier values are treated as if they were missing.
  asExtremeValues: outlier values are changed to a specific high or low value defined in the MiningField.

highValue and lowValue: used in conjunction with, and required for, outliers="asExtremeValues"; they supply the values used for records with outliers in this field.

Note that outliers applies only to fields defined in the MiningSchema and hence cannot be used for DerivedFields.

The DataDictionary describes the Value element, which includes a property attribute for defining values as valid, invalid or missing. While valid values are used unchanged, the standard allows special treatment for missing and invalid values. The next two sections describe how missing and invalid values should be handled.

Missing Values

missingValueReplacement: if this attribute is specified, a missing input value is automatically replaced by the given value. That is, the model itself works as if the given value had been found in the original input. For example, the surrogate operator in TreeModel does not apply if the MiningField specifies a replacement value.

missingValueTreatment: in a PMML consumer this attribute is for information only. The consumer only looks at missingValueReplacement: if a value is present, it replaces missing values. The missingValueTreatment attribute just indicates how the missingValueReplacement was derived and places no behavioral requirement on the consumer.

missingValueTreatment is a useful parameter in an API for training, and the parameter can be copied into the PMML model. The scoring function, however, does not always know the actual mean, mode, median, etc., so the corresponding value must be present in the attribute missingValueReplacement.
For example, if you want the scoring function to replace missing values by the mean value, and the mean value in the training data is 3.14, write missingValueReplacement="3.14" together with missingValueTreatment="asMean" on the MiningField.
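
The mean-replacement case described above could be written as the following MiningField. This is a minimal sketch: the field name foo is a placeholder, and only the two attribute values follow from the example in the text.

```xml
<!-- "foo" is a placeholder field name. The consumer acts on missingValueReplacement;
     missingValueTreatment only records that 3.14 was derived as the mean. -->
<MiningField name="foo" missingValueReplacement="3.14" missingValueTreatment="asMean"/>
```
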
Specifications for missing values occur at a couple of places in PMML.
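
One of those places is the DataDictionary itself, where the property attribute of a Value element can mark particular values as missing or invalid. The fragment below is a sketch under assumed names: the field region and all of its listed values are hypothetical.

```xml
<!-- Hypothetical field and codes. Values without a property attribute are valid. -->
<DataField name="region" optype="categorical" dataType="string">
  <Value value="north"/>
  <Value value="south"/>
  <!-- An input equal to "unknown" is treated as a missing value. -->
  <Value value="unknown" property="missing"/>
  <!-- An input equal to "n/a" is treated as an invalid value. -->
  <Value value="n/a" property="invalid"/>
</DataField>
```

How a model reacts to such inputs is then governed by the MiningField attributes described in this section.
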
Invalid Values

invalidValueTreatment: this attribute specifies how invalid input values are handled.
  returnInvalid: the default; when an invalid input is encountered, the model should return a value indicating that the result is invalid.
  asIs: use the input without modification.
  asMissing: treat an invalid input value as a missing value and follow the behavior specified by the missingValueReplacement attribute, if present (see above). If asMissing is specified but no missingValueReplacement is present, a missing value is passed on for eventual handling by successive transformations via DerivedFields or in the actual mining model.
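
Putting the attributes of this section together, a MiningSchema might look like the sketch below. All field names and numeric values are made up for illustration; only the attribute names and their enumerated values come from the descriptions above.

```xml
<MiningSchema>
  <!-- Outliers are replaced by lowValue or highValue; both are required with asExtremeValues. -->
  <MiningField name="income" usageType="active"
               outliers="asExtremeValues" lowValue="0" highValue="250000"/>

  <!-- Invalid inputs are treated as missing, and missing inputs are replaced by 3.14. -->
  <MiningField name="score" usageType="active"
               invalidValueTreatment="asMissing"
               missingValueReplacement="3.14" missingValueTreatment="asMean"/>

  <!-- optype here overrides whatever optype the DataField declares for this 0/1 indicator. -->
  <MiningField name="isMember" optype="categorical"/>

  <!-- Not needed for scoring; kept for explanatory purposes. -->
  <MiningField name="customerName" usageType="supplementary"/>

  <!-- Optional, but hints at what the model computes. -->
  <MiningField name="churn" usageType="predicted"/>
</MiningSchema>
```
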