PMML 4.2 - Model Explanation

The following elements can be used:
<xs:element name="ModelExplanation">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:choice>
        <xs:element ref="PredictiveModelQuality" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="ClusteringModelQuality" minOccurs="0" maxOccurs="unbounded"/>
      </xs:choice>
      <xs:element ref="Correlations" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Univariate Statistics

To provide univariate statistics for MiningFields, use UnivariateStats in the ModelStats element as defined in the statistics section. See further information there.

Partitions

Partition elements are defined in the statistics section. They can be used to provide statistics for a subset of records defined by the context in which the Partition element appears. For instance, a Partition in a Cluster element in a cluster model provides statistics for all records that belong to that cluster. A Partition element inside a Node element in a tree model gives statistics on the value distributions of the subset of records that belong to that Node. Likewise, a Partition element in a TargetValue gives details on the distributions of the records for which the model predicted the respective TargetValue.

Predictive Model Quality

The PredictiveModelQuality element is a wrapper around various elements used to illustrate the quality of a predictive model.
Since model quality information can be recalculated for any dataset that matches the fields specified in the MiningSchema, the element also records which dataset the quality information was collected on.

<xs:element name="PredictiveModelQuality">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="ConfusionMatrix" minOccurs="0"/>
      <xs:element ref="LiftData" minOccurs="0"/>
      <xs:element ref="ROC" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="targetField" type="xs:string" use="required"/>
    <xs:attribute name="dataName" type="xs:string" use="optional"/>
    <xs:attribute name="dataUsage" default="training">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="training"/>
          <xs:enumeration value="test"/>
          <xs:enumeration value="validation"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="meanError" type="NUMBER" use="optional"/>
    <xs:attribute name="meanAbsoluteError" type="NUMBER" use="optional"/>
    <xs:attribute name="meanSquaredError" type="NUMBER" use="optional"/>
    <xs:attribute name="rootMeanSquaredError" type="NUMBER" use="optional"/>
    <xs:attribute name="r-squared" type="NUMBER" use="optional"/>
    <xs:attribute name="adj-r-squared" type="NUMBER" use="optional"/>
    <xs:attribute name="sumSquaredError" type="NUMBER" use="optional"/>
    <xs:attribute name="sumSquaredRegression" type="NUMBER" use="optional"/>
    <xs:attribute name="numOfRecords" type="NUMBER" use="optional"/>
    <xs:attribute name="numOfRecordsWeighted" type="NUMBER" use="optional"/>
    <xs:attribute name="numOfPredictors" type="NUMBER" use="optional"/>
    <xs:attribute name="degreesOfFreedom" type="NUMBER" use="optional"/>
    <xs:attribute name="fStatistic" type="NUMBER" use="optional"/>
    <xs:attribute name="AIC" type="NUMBER" use="optional"/>
    <xs:attribute name="BIC" type="NUMBER" use="optional"/>
    <xs:attribute name="AICc" type="NUMBER" use="optional"/>
  </xs:complexType>
</xs:element>

See the following sections for descriptions of ConfusionMatrix and LiftData.

Attribute description:
Example:

<PredictiveModelQuality targetField="salary" dataName="MyData" dataUsage="training" meanError="0.01" meanAbsoluteError="123.4" meanSquaredError="234567.8">
  ...
</PredictiveModelQuality>

In this example, the model quality information for the field salary was gathered during training on a dataset named MyData.

Clustering Model Quality

The ClusteringModelQuality element is a wrapper around various elements used to illustrate the quality of a clustering model.

<xs:element name="ClusteringModelQuality">
  <xs:complexType>
    <xs:attribute name="dataName" type="xs:string" use="optional"/>
    <xs:attribute name="SSE" type="NUMBER" use="optional"/>
    <xs:attribute name="SSB" type="NUMBER" use="optional"/>
  </xs:complexType>
</xs:element>

Attribute description:
Gains/Lift Charts and Ranking Quality Information

Gains and Lift charts are a popular method to display the quality of a predictive data mining model. Regression models have a single Gains chart for the whole model; classification models can have one for each class label. The data for a Gains chart is calculated in the following ways:
The necessary data for Gains and Lift charts is provided in a LiftData element. It is provided by record counts and
<xs:element name="LiftData">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="ModelLiftGraph"/>
      <xs:element ref="OptimumLiftGraph" minOccurs="0"/>
      <xs:element ref="RandomLiftGraph" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="targetFieldValue" type="xs:string"/>
    <xs:attribute name="targetFieldDisplayValue" type="xs:string"/>
    <xs:attribute name="rankingQuality" type="NUMBER"/>
  </xs:complexType>
</xs:element>

Attribute targetFieldValue in element LiftData gives the class label for which the gains/lift data is provided. It is required for classification models only, and the values must be unique. A normalized version, if applicable, can additionally be provided in attribute targetFieldDisplayValue (see attributes value and displayValue in the DataDictionary element).

rankingQuality = (area between model and random curve) / (area between optimum and random curve)

rankingQuality is 1 if the model is the optimum model. It is close to 0 if the model is close to a random model. It can be less than 0 if the model is worse than a random model.

Three types of lift graphs can be included:
<xs:element name="ModelLiftGraph">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="LiftGraph"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="OptimumLiftGraph">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="LiftGraph"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="RandomLiftGraph">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="LiftGraph"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

At a minimum, ModelLiftGraph must always be provided. For classification models, OptimumLiftGraph is usually derived from the ModelLiftGraph by assuming that the total number of instances n in which the respective class label is predicted occurs in the first n records. It may still be provided explicitly if there is good reason why the trivial optimum model does not apply. Regression models, however, must always provide OptimumLiftGraph.

RandomLiftGraph is usually derived, for both classification and regression models, by interpolating between the start and end points of the data. Again, there may be good reason why the random assumption is only applicable with restrictions. Hence it is possible to give RandomLiftGraph explicitly.
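The rankingQuality ratio and the interpolated random curve described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the PMML specification: the function names are hypothetical, the curves are taken as cumulative values starting at the origin, areas are approximated with the trapezoidal rule, and the model and optimum curves are assumed to end at the same total.

```python
def random_curve(x_cum, y_total):
    """Straight line from (0, 0) to (x_cum[-1], y_total), sampled at each x."""
    total_records = x_cum[-1]
    return [y_total * x / total_records for x in x_cum]


def area_under(x_cum, y_cum):
    """Trapezoidal area under a cumulative curve that starts at the origin."""
    xs = [0.0] + list(x_cum)
    ys = [0.0] + list(y_cum)
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )


def ranking_quality(x_cum, model_cum, optimum_cum):
    """(area between model and random) / (area between optimum and random)."""
    rand_cum = random_curve(x_cum, model_cum[-1])
    a_rand = area_under(x_cum, rand_cum)
    a_model = area_under(x_cum, model_cum)
    a_opt = area_under(x_cum, optimum_cum)
    # Assumption: the optimum curve differs from the random curve,
    # otherwise the denominator would be zero.
    return (a_model - a_rand) / (a_opt - a_rand)


# Hypothetical data: 40 records, 20 of which carry the class label.
x_cum = [10, 20, 30, 40]
optimum_cum = [10, 20, 20, 20]   # all 20 hits in the first 20 records
model_cum = [8, 14, 18, 20]
print(ranking_quality(x_cum, model_cum, optimum_cum))
```

With this definition the value is 1 when the model curve equals the optimum curve and 0 when it equals the interpolated random curve, which matches the behaviour described for rankingQuality.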
<xs:element name="LiftGraph">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="XCoordinates"/>
      <xs:element ref="YCoordinates"/>
      <xs:element ref="BoundaryValues" minOccurs="0"/>
      <xs:element ref="BoundaryValueMeans" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="XCoordinates">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="NUM-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="YCoordinates">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="NUM-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="BoundaryValues">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="NUM-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="BoundaryValueMeans">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="NUM-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

For each segment of the lift chart, the cumulative number of records up to that point is provided in XCoordinates. The respective entry in YCoordinates gives
Likewise, element BoundaryValueMeans holds the mean value of the scores for all records that belong to the respective segment.
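As a rough illustration of how the four arrays of a LiftGraph relate to scored records, the sketch below builds them for a classification case. It rests on assumptions that are not stated in this section: records are sorted by descending score, YCoordinates holds the per-segment count of records carrying the class label (as the classification example in this document suggests), BoundaryValues holds the lowest score in each segment, and every segment contains at least one record. The function name and input layout are hypothetical.

```python
from statistics import mean


def build_lift_graph(records, segment_sizes):
    """records: list of (score, is_hit) pairs; segment_sizes: records per segment."""
    if sum(segment_sizes) != len(records):
        raise ValueError("segment sizes must cover all records")
    # Highest scores first, so the first segment holds the most confident records.
    ordered = sorted(records, key=lambda r: r[0], reverse=True)
    x, y, bounds, means = [], [], [], []
    start = 0
    for size in segment_sizes:
        segment = ordered[start:start + size]
        start += size
        scores = [score for score, _ in segment]
        x.append(start)                                  # cumulative record count
        y.append(sum(1 for _, hit in segment if hit))    # hits in this segment
        bounds.append(min(scores))                       # lower score limit
        means.append(mean(scores))                       # mean score of segment
    return {
        "XCoordinates": x,
        "YCoordinates": y,
        "BoundaryValues": bounds,
        "BoundaryValueMeans": means,
    }


# Hypothetical scored records: (confidence, record has the class label).
records = [(0.9, True), (0.8, True), (0.7, False),
           (0.4, True), (0.2, False), (0.1, False)]
graph = build_lift_graph(records, [2, 2, 2])
print(graph["XCoordinates"], graph["YCoordinates"], graph["BoundaryValues"])
```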
Example

<LiftData targetFieldValue="1" targetFieldDisplayValue="Yes">
  <ModelLiftGraph>
    <LiftGraph>
      <XCoordinates>
        <Array type="int" n="6">57 75 98 124 149 240</Array>
      </XCoordinates>
      <YCoordinates>
        <Array type="int" n="6">51 15 18 7 4 8</Array>
      </YCoordinates>
      <BoundaryValues>
        <Array type="real" n="6">0.8947 0.8333 0.7826 0.2692 0.16 0.0879</Array>
      </BoundaryValues>
      <BoundaryValueMeans>
        <Array type="real" n="6">0.9134 0.8691 0.8002 0.5389 0.2261 0.1492</Array>
      </BoundaryValueMeans>
    </LiftGraph>
  </ModelLiftGraph>
</LiftData>
<LiftData targetFieldValue="2" targetFieldDisplayValue="No">
  <ModelLiftGraph>
    <LiftGraph>
      <XCoordinates>
        <Array type="int" n="6">91 116 142 165 183 240</Array>
      </XCoordinates>
      <YCoordinates>
        <Array type="int" n="6">83 21 19 5 3 6</Array>
      </YCoordinates>
      <BoundaryValues>
        <Array type="real" n="6">0.9120 0.84 0.7307 0.2173 0.16667 0.1052</Array>
      </BoundaryValues>
      <BoundaryValueMeans>
        <Array type="real" n="6">0.9569 0.8921 0.7478 0.4301 0.1836 0.1285</Array>
      </BoundaryValueMeans>
    </LiftGraph>
  </ModelLiftGraph>
</LiftData>

For instance, the value Yes was predicted 66 times in the first 75 records for which Yes was predicted by the model with the highest confidence. All of these predictions had a confidence of at least 0.8333. In the first segment, Yes was predicted in 51 instances, while the second segment has 75-57=18 records of which 15 have value Yes.

Here is the respective gains chart for class label Yes:

The blue line corresponds to ModelLiftGraph, the green line to OptimumLiftGraph and the red line to RandomLiftGraph. Note that the latter two are not present in the example PMML, so the respective curves were derived from ModelLiftGraph.

Example:

<LiftData>
  <ModelLiftGraph>
    <LiftGraph>
      <XCoordinates>
        <Array n="7" type="int">5 12 18 23 31 41 52</Array>
      </XCoordinates>
      <YCoordinates>
        <Array n="7" type="real">80 70 48 40 53 64 66</Array>
      </YCoordinates>
      <BoundaryValues>
        <Array n="7" type="real">7.4261 7.1911 7.0731 6.9845 6.8072 6.6085 6.4999</Array>
      </BoundaryValues>
      <BoundaryValueMeans>
        <Array n="7" type="real">7.9327 7.6732 7.2982 6.9978 6.8734 6.7254 6.5373</Array>
      </BoundaryValueMeans>
    </LiftGraph>
  </ModelLiftGraph>
  <OptimumLiftGraph>
    <LiftGraph>
      <XCoordinates>
        <Array n="7" type="int">5 12 18 23 31 41 52</Array>
      </XCoordinates>
      <YCoordinates>
        <Array n="7" type="real">90 81 70 65 65 35 15</Array>
      </YCoordinates>
      <BoundaryValues>
        <Array n="7" type="real"> 8 7 6 5 4 3 2</Array>
      </BoundaryValues>
      <BoundaryValueMeans>
        <Array n="7" type="real"> 8.2872 7.8273 6.2362 5.7523 4.7895 3.4356 2.4563</Array>
      </BoundaryValueMeans>
    </LiftGraph>
  </OptimumLiftGraph>
</LiftData>

The lift data was built on a total of 52 records. The first 18 records with the highest predicted values accumulate to 198, and the lower limit for these is 7.0731. The third segment has 6 records and accounts for a sum of 48 with a mean value of 7.2982. The optimum model would have predicted values that sum up to 70 in the same segment.

Here is the respective gains chart:

Again, the blue line corresponds to ModelLiftGraph, the green line to OptimumLiftGraph and the red line to RandomLiftGraph. Note that OptimumLiftGraph is present in the sample PMML, while RandomLiftGraph had to be derived from ModelLiftGraph.
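The figures quoted for the two examples can be re-derived from the arrays with a few lines of Python. The sketch below is illustrative only; the helper name and the dictionary keys are hypothetical, and it simply treats XCoordinates as cumulative record counts and YCoordinates as per-segment values, as the explanatory text does.

```python
from itertools import accumulate


def gains_rows(x_cum, y_seg, boundaries):
    """Per-segment size, per-segment Y value, cumulative Y value and lower bound."""
    sizes = [b - a for a, b in zip([0] + list(x_cum[:-1]), x_cum)]
    y_cum = list(accumulate(y_seg))
    return [
        {"records": x, "segment_size": s, "segment_y": y,
         "cumulative_y": c, "lower_bound": b}
        for x, s, y, c, b in zip(x_cum, sizes, y_seg, y_cum, boundaries)
    ]


# Classification example, class label Yes.
yes_rows = gains_rows(
    [57, 75, 98, 124, 149, 240],
    [51, 15, 18, 7, 4, 8],
    [0.8947, 0.8333, 0.7826, 0.2692, 0.16, 0.0879],
)

# Regression example, ModelLiftGraph.
reg_rows = gains_rows(
    [5, 12, 18, 23, 31, 41, 52],
    [80, 70, 48, 40, 53, 64, 66],
    [7.4261, 7.1911, 7.0731, 6.9845, 6.8072, 6.6085, 6.4999],
)

print(yes_rows[1])   # second segment of the classification example
print(reg_rows[2])   # third segment of the regression example
```

The second row of the classification data reproduces the 18 records and the cumulative count of 66 mentioned above; the third row of the regression data reproduces the 6 records, the cumulative sum of 198 and the lower limit of 7.0731.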
ROC Graph

ROC (Receiver Operating Characteristic) Graphs are used in binary classification models. The ROC curve is a graphical representation of the sensitivity vs. (1 - specificity) for a binary classifier as its discrimination threshold varies. Equivalently, it plots the TPR (true positive rate) against the FPR (false positive rate).

<xs:element name="ROC">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="ROCGraph"/>
    </xs:sequence>
    <xs:attribute name="positiveTargetFieldValue" type="xs:string" use="required"/>
    <xs:attribute name="positiveTargetFieldDisplayValue" type="xs:string"/>
    <xs:attribute name="negativeTargetFieldValue" type="xs:string"/>
    <xs:attribute name="negativeTargetFieldDisplayValue" type="xs:string"/>
  </xs:complexType>
</xs:element>

Attribute positiveTargetFieldValue in element ROC gives the positive class label for which the ROC data is provided, whereas attribute negativeTargetFieldValue in element ROC gives the negative class label. A normalized version, if applicable, can additionally be provided in attributes positiveTargetFieldDisplayValue and negativeTargetFieldDisplayValue (see attributes value and displayValue in the DataDictionary element).

<xs:element name="ROCGraph">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="XCoordinates"/>
      <xs:element ref="YCoordinates"/>
      <xs:element ref="BoundaryValues" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

In ROCGraph, FPR is provided in the XCoordinates, whereas TPR is provided in the YCoordinates. FPR is defined as follows:

number of false positive records / (number of true negative records + number of false positive records)

TPR, on the other hand, is defined as follows:

number of true positive records / (number of true positive records + number of false negative records)

A boundary value represents a discrimination threshold or lower limit for the respective score. That is, all records that are part of the segment must have a value equal to or higher than the respective threshold in BoundaryValues. The point (0,0) in the graph is associated with the highest score and hence the highest limit; it represents a binary classifier that predicts all cases to be negative. The point (1,1) is associated with the lowest score and hence the lowest limit; it corresponds to a classifier that predicts every case to be positive.

Example

<ROC positiveTargetFieldValue="1" negativeTargetFieldValue="0">
  <ROCGraph>
    <XCoordinates>
      <Array type="real" n="4">0.13 0.2 0.28 0.56</Array>
    </XCoordinates>
    <YCoordinates>
      <Array type="real" n="4">0.54 0.75 0.86 0.93</Array>
    </YCoordinates>
    <BoundaryValues>
      <Array type="real" n="4">0.8 0.6 0.4 0.2</Array>
    </BoundaryValues>
  </ROCGraph>
</ROC>

Here is the respective ROC graph:

The ROC graph shows the tradeoff between the ability of a classifier to correctly identify positive records and the number of negative records that are incorrectly classified.
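The FPR and TPR definitions given for ROCGraph can be applied to scored records as in the following Python sketch. It is a minimal illustration with hypothetical names and data: a record is counted as predicted positive when its score is equal to or higher than the threshold, and the data is assumed to contain at least one positive and one negative record so that neither denominator is zero.

```python
def roc_point(scores, labels, threshold, positive):
    """Return (FPR, TPR) for one discrimination threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        actual_positive = label == positive
        if predicted_positive and actual_positive:
            tp += 1
        elif predicted_positive and not actual_positive:
            fp += 1
        elif actual_positive:
            fn += 1
        else:
            tn += 1
    fpr = fp / (tn + fp)   # false positives / all actual negatives
    tpr = tp / (tp + fn)   # true positives / all actual positives
    return fpr, tpr


# Hypothetical scores for the positive class "1" and the true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = ["1", "1", "0", "1", "0", "0"]

for threshold in (1.0, 0.65, 0.0):
    print(threshold, roc_point(scores, labels, threshold, positive="1"))
```

A threshold above every score yields the point (0,0), and a threshold at or below every score yields (1,1), in line with the description of the two end points.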
Confusion Matrix

The ConfusionMatrix is used in classification models to give an overview of correct and incorrect classifications.

<xs:element name="ConfusionMatrix">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="ClassLabels"/>
      <xs:element ref="Matrix"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="ClassLabels">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="STRING-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

ClassLabels specifies the class labels that the confusion matrix refers to. New values in addition to the class labels of the initial classification model are possible in a test result, if the test data contains values that were not present in the training data.

Example

A confusion matrix for the class labels suburban, urban and rural is represented by the following PMML code:

<ConfusionMatrix>
  <ClassLabels>
    <Array type="string" n="3">suburban urban rural</Array>
  </ClassLabels>
  <Matrix>
    <Array type="int" n="3"> 84 19 25</Array>
    <Array type="int" n="3"> 14 123 17</Array>
    <Array type="int" n="3"> 7 42 176</Array>
  </Matrix>
</ConfusionMatrix>

The confusion matrix shows that in 84 instances, the domicile was correctly predicted to be suburban. In 17 instances, the domicile was predicted to be urban instead of rural. Likewise, in 7 instances rural was predicted instead of suburban.

Field Correlations

The Correlations element is used to give correlations between fields used in a mining model.

<xs:element name="Correlations">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="CorrelationFields"/>
      <xs:element ref="CorrelationValues"/>
      <xs:element ref="CorrelationMethods" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="CorrelationFields">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="STRING-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="CorrelationValues">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Matrix"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="CorrelationMethods">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Matrix"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The CorrelationFields element specifies the field names that the correlations refer to. These field names must match entries of MiningFields in the MiningSchema.

The CorrelationValues matrix holds the correlations in numeric form. The rows and columns refer to the respective entries in CorrelationFields. Valid entries must be in the range of [-1;1].
Values outside of this range indicate that the correlation for the respective field combination is not available.

The CorrelationMethods element is optional and has string entries. The entries correspond to the correlations given in CorrelationValues. For each correlation provided in CorrelationValues, a corresponding entry must exist in CorrelationMethods. Valid values for the entries are:
Note that some of these methods are distribution tests, since especially for categorical fields, it is popular to use distribution tests to express correlations. Entries that would refer to missing entries in the CorrelationValues must, if specified, still have one of these valid values. If CorrelationMethods is not present at all, defaults are pearson for correlations between numeric fields, and contingencyTable for any other combination. Otherwise, a corresponding entry for each entry in CorrelationValues must be present.

Note that correlations between numeric and non-numeric fields are usually done by defining buckets for the numeric field. These buckets are not covered here.

Note that both matrices must be symmetric.

Example:

Here is an example correlation table:

<Correlations>
  <CorrelationFields>
    <Array n="5" type="string">"Age" "Angina" "Blood_Pressure" "Cholesterol" "Diseased"</Array>
  </CorrelationFields>
  <CorrelationValues>
    <Matrix kind="symmetric" nbRows="5" nbCols="5">
      <Array n="1" type="real"> 1</Array>
      <Array n="2" type="real"> 0.6207 1</Array>
      <Array n="3" type="real"> -0.2651 0.5793 1</Array>
      <Array n="4" type="real"> 0.2161 -99 0.1344 1</Array>
      <Array n="5" type="real"> 0.5649 0.5700 0.5257 -99 1</Array>
    </Matrix>
  </CorrelationValues>
  <CorrelationMethods>
    <Matrix kind="symmetric" nbRows="5" nbCols="5">
      <Array n="1" type="string"> pearson</Array>
      <Array n="2" type="string"> cramer cramer</Array>
      <Array n="3" type="string"> spearman cramer spearman</Array>
      <Array n="4" type="string"> fisher contingencyTable spearman spearman</Array>
      <Array n="5" type="string"> contingencyTable chiSquare chiSquare chiSquare chiSquare</Array>
    </Matrix>
  </CorrelationMethods>
</Correlations>

For instance, according to this sample, the correlation between Age and Blood_Pressure is -0.2651 and was calculated using Spearman's rank correlation coefficient. The correlation between Diseased and Cholesterol is not available.
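To make the lookup rules concrete, the following Python sketch expands the lower-triangular rows of the two symmetric matrices from the example and reads one field pair at a time. It is an illustration under assumptions, not a PMML parser: the array contents are copied by hand from the example, the helper names are hypothetical, and a value outside [-1;1] is reported as None to mark it as not available.

```python
def expand_symmetric(rows):
    """Turn lower-triangular rows (row i has i+1 entries) into a full matrix."""
    n = len(rows)
    full = [[None] * n for _ in range(n)]
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError("row %d must have %d entries" % (i, i + 1))
        for j, value in enumerate(row):
            full[i][j] = value
            full[j][i] = value   # mirror across the diagonal
    return full


def correlation(fields, values, methods, a, b):
    """Return (value or None, method) for the field pair (a, b)."""
    i, j = fields.index(a), fields.index(b)
    value = values[i][j]
    if not -1.0 <= value <= 1.0:
        value = None             # outside [-1;1]: correlation not available
    return value, methods[i][j]


fields = ["Age", "Angina", "Blood_Pressure", "Cholesterol", "Diseased"]
values = expand_symmetric([
    [1],
    [0.6207, 1],
    [-0.2651, 0.5793, 1],
    [0.2161, -99, 0.1344, 1],
    [0.5649, 0.5700, 0.5257, -99, 1],
])
methods = expand_symmetric([
    ["pearson"],
    ["cramer", "cramer"],
    ["spearman", "cramer", "spearman"],
    ["fisher", "contingencyTable", "spearman", "spearman"],
    ["contingencyTable", "chiSquare", "chiSquare", "chiSquare", "chiSquare"],
])

print(correlation(fields, values, methods, "Age", "Blood_Pressure"))
print(correlation(fields, values, methods, "Diseased", "Cholesterol"))
```

Because the expanded matrices are symmetric, the order of the two field names in a lookup does not matter.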