PMML 4.0 - Statistics
This schema for statistics provides a basic framework for
representing univariate statistics. It is used by schemas for specific
models as ModelStats.
<xs:element name="ModelStats">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="UnivariateStats" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
The statistics for a model consist of the collection of
statistics for the individual fields.
Univariate Statistics
<xs:element name="UnivariateStats">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Counts" minOccurs="0"/>
<xs:element ref="NumericInfo" minOccurs="0"/>
<xs:element ref="DiscrStats" minOccurs="0"/>
<xs:element ref="ContStats" minOccurs="0"/>
<xs:element ref="Anova" minOccurs="0"/>
</xs:sequence>
<xs:attribute name="field" type="FIELD-NAME"/>
<xs:attribute name="weighted" default="0">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="0" />
<xs:enumeration value="1" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
A UnivariateStats element contains statistical information
about a single MiningField. UnivariateStats elements in ModelStats must
name the field they refer to in the attribute field. In other contexts, field may be omitted.
The attribute weighted indicates whether the counts for the field are
computed using the weight variable. By default the weight is not used.
Discrete and continuous statistics are possible simultaneously
for numeric fields. This may be important if a numeric field has too
many discrete values. The statistics can include the most frequent
values and also a complete histogram distribution.
Statistics for ordinal fields are contained in DiscrStats.
<xs:element name="Counts">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="totalFreq" type="NUMBER" use="required"/>
<xs:attribute name="missingFreq" type="NUMBER"/>
<xs:attribute name="invalidFreq" type="NUMBER"/>
<xs:attribute name="cardinality" type="xs:nonNegativeInteger"/>
</xs:complexType>
</xs:element>
The element Counts carries frequency counters for values
according to whether they are missing, invalid, or valid. The counts
can be non-integer if they are weighted.
totalFreq counts all records; it is the same for the statistics of all MiningFields.
missingFreq counts the records in which the value is missing.
invalidFreq counts the records whose values are not valid.
The total frequency includes the missing values and the invalid values.
The attribute cardinality is the number of unique, or distinct,
values that the variable has.
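As a rough illustration of how these counters relate, the sketch below tallies unweighted Counts attributes for one field. The function name, the use of None to mark a missing value, and the choice to count cardinality over valid values only are assumptions made for this example, not rules taken from the schema.

```python
from typing import Iterable, Optional, Set


def compute_counts(values: Iterable[Optional[str]], valid_values: Set[str]) -> dict:
    """Tally unweighted Counts attributes for one field.

    Assumptions of this sketch: None marks a missing value, and any
    non-missing value outside valid_values is treated as invalid.
    """
    total = missing = invalid = 0
    distinct = set()
    for value in values:
        total += 1                      # totalFreq includes every record
        if value is None:
            missing += 1
        elif value not in valid_values:
            invalid += 1
        else:
            distinct.add(value)         # cardinality over valid values only (assumption)
    return {
        "totalFreq": total,
        "missingFreq": missing,
        "invalidFreq": invalid,
        "cardinality": len(distinct),
    }


counts = compute_counts(["female", "male", None, "unknown", "female"], {"female", "male"})
```

Here totalFreq is 5 even though only three of the records carry a valid value, because missing and invalid records are included in the total.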
<xs:element name="NumericInfo">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Quantile" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="minimum" type="NUMBER"/>
<xs:attribute name="maximum" type="NUMBER"/>
<xs:attribute name="mean" type="NUMBER"/>
<xs:attribute name="standardDeviation" type="NUMBER"/>
<xs:attribute name="median" type="NUMBER"/>
<xs:attribute name="interQuartileRange" type="NUMBER"/>
</xs:complexType>
</xs:element>
The values for mean, minimum, maximum and standardDeviation
are defined as usual.
median is calculated as the 50% quantile;
interQuartileRange is calculated as (75% quantile - 25% quantile).
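The relationship between the quantiles and the derived attributes can be sketched as follows. The helper name is hypothetical, and the interpolation rule is an assumption: the text does not prescribe one, so this example uses the "inclusive" method of Python's statistics.quantiles.

```python
import statistics


def numeric_info(values):
    """Derive median and interQuartileRange from the quartiles of values.

    The "inclusive" interpolation method is an assumption of this sketch;
    other quantile definitions can give slightly different results.
    """
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.fmean(values),
        "median": q50,                     # 50% quantile
        "interQuartileRange": q75 - q25,   # 75% quantile - 25% quantile
    }


info = numeric_info([1, 2, 3, 4, 5])
```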
<xs:element name="Quantile">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="quantileLimit" type="PERCENTAGE-NUMBER" use="required"/>
<xs:attribute name="quantileValue" type="NUMBER" use="required"/>
</xs:complexType>
</xs:element>
quantileLimit is a percentage number between 0 and 100.
quantileValue is the corresponding value in the domain of field
values.
Discrete Statistics
<xs:element name="DiscrStats">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Array" minOccurs="0" maxOccurs="2"/>
</xs:sequence>
<xs:attribute name="modalValue" type="xs:string"/>
</xs:complexType>
</xs:element>
modalValue is the most frequent discrete value.
If only a single array is present, it must be of type NUM-ARRAY
and contain a compact representation of all frequency numbers.
If there is an additional STRING-ARRAY, then the
frequency numbers in the NUM-ARRAY correspond to the string values one by one.
Otherwise the frequency numbers correspond to the list of (valid)
values as given in the DataDictionary.
If the statistics refer to a DerivedField, then all values
are listed in the STRING-ARRAY because a DerivedField
has no means to specify a list of valid values. If DiscrStats
contains an array of string values and the corresponding DataField
defines a list of valid values, then the set of values listed in the
array in DiscrStats must be a subset of the set of
valid values specified in the DataField.
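The lookup rules above can be sketched as a small helper that pairs each frequency with the value it describes. The function and parameter names are hypothetical, and treating a length mismatch as an error is a choice made for this example.

```python
def discr_frequencies(num_array, dictionary_values=None, string_array=None):
    """Pair DiscrStats frequencies with the values they describe.

    num_array: frequencies from the NUM-ARRAY.
    string_array: values from the optional STRING-ARRAY.
    dictionary_values: valid values declared by the DataField, if any.
    """
    if string_array is not None:
        # Listed values must be a subset of the DataField's valid values.
        if dictionary_values is not None and not set(string_array) <= set(dictionary_values):
            raise ValueError("STRING-ARRAY contains values not declared valid")
        labels = string_array
    elif dictionary_values is not None:
        # No STRING-ARRAY: fall back to the order of values in the DataDictionary.
        labels = dictionary_values
    else:
        raise ValueError("no value list available to interpret the frequencies")
    if len(labels) != len(num_array):
        raise ValueError("frequency count does not match number of values")
    return dict(zip(labels, num_array))


sex = discr_frequencies([166, 74], dictionary_values=["female", "male"])
```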
Continuous Statistics
<xs:element name="ContStats">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Interval" minOccurs="0" maxOccurs="unbounded"/>
<xs:group ref="FrequenciesType" minOccurs="0"/>
</xs:sequence>
<xs:attribute name="totalValuesSum" type="NUMBER"/>
<xs:attribute name="totalSquaresSum" type="NUMBER"/>
</xs:complexType>
</xs:element>
<xs:group name="FrequenciesType">
<xs:sequence>
<xs:group ref="NUM-ARRAY" minOccurs="1" maxOccurs="3"/>
</xs:sequence>
</xs:group>
The three ARRAYs contain the frequencies, sum of values, and sum of
squared values for each interval.
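To make the meaning of the three arrays concrete, the following sketch accumulates them from raw values. It assumes openClosed intervals given as (leftMargin, rightMargin) pairs and silently skips values that fall outside every interval; both the helper name and that handling are assumptions of the example.

```python
def cont_stats_arrays(values, intervals):
    """Build per-interval frequencies, sums of values and sums of squares.

    A value v belongs to an interval when leftMargin < v <= rightMargin
    (openClosed). Values outside all intervals are ignored in this sketch.
    """
    freqs = [0] * len(intervals)
    sums = [0.0] * len(intervals)
    squares = [0.0] * len(intervals)
    for v in values:
        for i, (left, right) in enumerate(intervals):
            if left < v <= right:
                freqs[i] += 1
                sums[i] += v
                squares[i] += v * v
                break
    return freqs, sums, squares


freqs, sums, squares = cont_stats_arrays(
    [30, 31, 35, 40], [(29, 33.8), (33.8, 38.6), (38.6, 43.4)]
)
```

With the sums and sums of squares available per interval, the mean and variance inside each interval can be recovered without access to the raw records.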
Note: Interval is defined
in the schema for DataDictionary.
Example
Here is an example excerpt which gives univariate statistics for the continuous field
Age and the categorical field Sex:
...
<DataDictionary>
<DataField name="Sex" optype="categorical" dataType="string">
<Value value="female"/>
<Value value="male"/>
</DataField>
<DataField name="Age" optype="continuous" dataType="double"/>
...
</DataDictionary>
...
<ModelStats>
<UnivariateStats field="Age">
<Counts totalFreq="240"/>
<NumericInfo mean="54.43" minimum="29" maximum="77"/>
<ContStats>
<Interval closure="openClosed" leftMargin="29" rightMargin="33.8"/>
<Interval closure="openClosed" leftMargin="33.8" rightMargin="38.6"/>
<Interval closure="openClosed" leftMargin="38.6" rightMargin="43.4"/>
<Interval closure="openClosed" leftMargin="43.4" rightMargin="48.2"/>
<Interval closure="openClosed" leftMargin="48.2" rightMargin="53"/>
<Interval closure="openClosed" leftMargin="53" rightMargin="57.8"/>
<Interval closure="openClosed" leftMargin="57.8" rightMargin="62.6"/>
<Interval closure="openClosed" leftMargin="62.6" rightMargin="67.4"/>
<Interval closure="openClosed" leftMargin="67.4" rightMargin="72.2"/>
<Interval closure="openClosed" leftMargin="72.2" rightMargin="77"/>
<Array type="int" n="10"> 1 8 28 30 30 43 51 33 13 3</Array>
</ContStats>
</UnivariateStats>
<UnivariateStats field="Sex">
<Counts totalFreq="240"/>
<DiscrStats>
<Array type="int" n="2"> 166 74</Array>
</DiscrStats>
</UnivariateStats>
...
</ModelStats>
...
The array in DiscrStats for field Sex refers to the respective Value
elements in the DataDictionary. Out of the total 240 instances, 166 are
female while 74 are male.
Field Age has a mean of 54.43 years, with a minimum of 29 and a maximum of 77. The
ContStats element gives more details about the actual distribution. For instance, in 43
records the value of Age is larger than 53 years and less than or equal to 57.8.
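The reading of the ContStats element described above can be reproduced with a short script. This is a minimal sketch that assumes the element is available as a standalone string without an XML namespace (real PMML documents normally declare one); the helper name histogram is hypothetical.

```python
import xml.etree.ElementTree as ET

# ContStats element for field Age, copied from the example above.
CONT_STATS = """
<ContStats>
  <Interval closure="openClosed" leftMargin="29" rightMargin="33.8"/>
  <Interval closure="openClosed" leftMargin="33.8" rightMargin="38.6"/>
  <Interval closure="openClosed" leftMargin="38.6" rightMargin="43.4"/>
  <Interval closure="openClosed" leftMargin="43.4" rightMargin="48.2"/>
  <Interval closure="openClosed" leftMargin="48.2" rightMargin="53"/>
  <Interval closure="openClosed" leftMargin="53" rightMargin="57.8"/>
  <Interval closure="openClosed" leftMargin="57.8" rightMargin="62.6"/>
  <Interval closure="openClosed" leftMargin="62.6" rightMargin="67.4"/>
  <Interval closure="openClosed" leftMargin="67.4" rightMargin="72.2"/>
  <Interval closure="openClosed" leftMargin="72.2" rightMargin="77"/>
  <Array type="int" n="10"> 1 8 28 30 30 43 51 33 13 3</Array>
</ContStats>
"""


def histogram(cont_stats_xml):
    """Return a list of ((leftMargin, rightMargin), frequency) pairs."""
    root = ET.fromstring(cont_stats_xml)
    intervals = [
        (float(iv.get("leftMargin")), float(iv.get("rightMargin")))
        for iv in root.findall("Interval")
    ]
    # The first (here: only) Array holds the frequencies, one per Interval.
    counts = [int(token) for token in root.find("Array").text.split()]
    return list(zip(intervals, counts))


bins = histogram(CONT_STATS)
total = sum(count for _, count in bins)       # matches totalFreq in Counts
in_53_to_57_8 = dict(bins)[(53.0, 57.8)]      # records with 53 < Age <= 57.8
```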
ANOVA
PMML supports inclusion of ANOVA (Analysis of Variance)
information that, while descriptive in nature, can help understand
the relationship between certain independent variables and the
target (dependent) variable. Specifically, the analysis is
performed using one independent categorical variable, X, and a
continuous target (dependent) variable, Y (usually found in types of
regression and time series models).
<xs:element name="Anova">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="AnovaRow" minOccurs="3" maxOccurs="3"/>
</xs:sequence>
<xs:attribute name="target" type="FIELD-NAME" />
</xs:complexType>
</xs:element>
<xs:element name="AnovaRow">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="type" use="required">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="Model"/>
<xs:enumeration value="Error"/>
<xs:enumeration value="Total"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="sumOfSquares" type="NUMBER" use="required"/>
<xs:attribute name="degreesOfFreedom" type="NUMBER" use="required"/>
<xs:attribute name="meanOfSquares" type="NUMBER"/>
<xs:attribute name="fValue" type="NUMBER"/>
<xs:attribute name="pValue" type="PROB-NUMBER"/>
</xs:complexType>
</xs:element>
There can be an Anova element for each target (dependent) variable
defined in the model. The target attribute specifies
the target variable of this ANOVA. If there is only one target variable,
the target attribute is optional. If the model
specifies more than one target, this attribute must give the name of the
target variable analyzed with respect to the independent variable (identified in
the parent UnivariateStats element's field attribute).
The F-value and p-value attributes are required when
the AnovaRow is of type Model. They are not used
for Error and Total AnovaRow types.
The meanOfSquares is required when the AnovaRow type
is Model or Error.
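These conditional requirements are not expressed in the schema itself, so a consumer may want to check them separately. The sketch below is one way to do that; the function name, the REQUIRED table, and the assumption that the Anova element is parsed without an XML namespace are all choices made for this example.

```python
import xml.etree.ElementTree as ET

# Attributes required per AnovaRow type, following the rules stated above.
REQUIRED = {
    "Model": ("meanOfSquares", "fValue", "pValue"),
    "Error": ("meanOfSquares",),
    "Total": (),
}

ANOVA_X2 = """
<Anova>
  <AnovaRow type="Model" sumOfSquares="26811371" degreesOfFreedom="6"
            meanOfSquares="4468561" fValue="1.699" pValue="0.13"/>
  <AnovaRow type="Error" sumOfSquares="247218517" degreesOfFreedom="94"
            meanOfSquares="2629984"/>
  <AnovaRow type="Total" sumOfSquares="274029888" degreesOfFreedom="100"/>
</Anova>
"""


def check_anova(anova_xml):
    """Return a list of missing-attribute problems (empty if none)."""
    problems = []
    for row in ET.fromstring(anova_xml).findall("AnovaRow"):
        row_type = row.get("type")
        for attr in REQUIRED.get(row_type, ()):
            if row.get(attr) is None:
                problems.append(f"{row_type} row lacks {attr}")
    return problems


problems = check_anova(ANOVA_X2)
```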
ANOVA is used to estimate the effect of X, an independent variable,
on Y, the target (dependent) variable. It does this by testing
the hypothesis that X has no effect on Y, called the null hypothesis.
The null hypothesis is rejected if the p-value is less than
the specified significance level α, and not rejected otherwise. The p-value is obtained using the F-distribution.
Let's assume X contains k categories and there are
a total of N records (or samples, rows, etc.). By
analyzing the Y values amongst and
within each category, we can test the null hypothesis and determine if X has
an impact on Y. Sum-of-squares (SS) calculations are made which total the
squared difference between all the Y values of a group and the
average (or mean) of the Y values for that group. The sum-of-squares
for Y can be partitioned into two groups: the sum-of-squares within each
category and between the k categories. Thus,
SSTotal = SSModel + SSError
Where SSTotal is the SS for all records, SSModel
is the SS between each category's mean and the total population's mean,
and SSError is the SS between each record and the mean of that record's category.
Similarly, the degrees of freedom (DF) of the data can be partitioned into two groups as well:
DFTotal = DFModel + DFError
Where DFTotal is N - 1, DFModel is k - 1 and DFError is N - k.
These calculations can then be used to compute the Mean of Squares (MS, from SS/DF),
F-Value (from MSModel/MSError) and its resulting p-Value.
These results are summarized in an ANOVA table, represented as:
Type | Sum of Squares | Degrees of Freedom | Mean of SS | F-Value | p-Value
Model | SSModel | DFModel | MSModel | MSModel/MSError | p-Value
Error | SSError | DFError | MSError | |
Total | SSModel + SSError | DFModel + DFError | | |
Conventionally, in the context of ANOVA between a single independent
variable and the target variable (also called one-factor ANOVA),
the model and error variabilities are also referred to as
between-sample variability and within-sample variability,
respectively.
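The partitioning of sums of squares and degrees of freedom described above can be sketched in a few lines. The function name and the dictionary layout are hypothetical, and the p-value is left out because it requires the F-distribution, which the Python standard library does not provide.

```python
from statistics import fmean


def one_way_anova(groups):
    """Compute ANOVA table entries for Y values grouped by category of X.

    groups maps each category of X to the list of Y values in that category.
    """
    all_y = [y for ys in groups.values() for y in ys]
    n, k = len(all_y), len(groups)
    grand_mean = fmean(all_y)
    # Between categories: each category mean against the overall mean.
    ss_model = sum(len(ys) * (fmean(ys) - grand_mean) ** 2 for ys in groups.values())
    # Within categories: each record against the mean of its own category.
    ss_error = sum((y - fmean(ys)) ** 2 for ys in groups.values() for y in ys)
    df_model, df_error = k - 1, n - k
    ms_model, ms_error = ss_model / df_model, ss_error / df_error
    return {
        "Model": {"SS": ss_model, "DF": df_model, "MS": ms_model, "F": ms_model / ms_error},
        "Error": {"SS": ss_error, "DF": df_error, "MS": ms_error},
        "Total": {"SS": ss_model + ss_error, "DF": df_model + df_error},
    }


table = one_way_anova({"a": [1, 2, 3], "b": [2, 3, 4], "c": [5, 6, 7]})
```

For this toy input the category means are 2, 3 and 6, so most of the variability lies between the categories rather than within them, which is what a large F-value reflects.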
ANOVA Example
This example uses ANOVA to study two categorical independent variables, X1 and X2, and a
continuous target (dependent) variable, Y. We want to know the effect of these two variables
on Y.
X1 has three categories: AA, BB, CC. X2 has seven categories: a, b, c, d, e, f, g. The number of
records analyzed, N, is 101. We'll use α = 0.20 for our
hypothesis test.
For X1, we have the following ANOVA table. The p-value is greater than our α value of 0.20,
so the data do not provide sufficient evidence that X1 affects Y. We therefore do not reject the
null hypothesis, which says that X1 has no effect on Y:
Type | Sum of Squares | Degrees of Freedom | Mean of SS | F-Value | p-Value
Model | 21389708 | 2 | 10694854 | 0.77 | 0.66
Error | 809210617 | 98 | 8257251 | |
Total | 830600325 | 100 | | |
For X2, we have another ANOVA table. The p-value is less than our α value of 0.20, so the
observed differences would be unlikely if X2 had no effect on Y. We therefore reject the null
hypothesis and conclude that X2 does have an effect on Y:
Type | Sum of Squares | Degrees of Freedom | Mean of SS | F-Value | p-Value
Model | 26811371 | 6 | 4468561 | 1.70 | 0.13
Error | 247218517 | 94 | 2629984 | |
Total | 274029888 | 100 | | |
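The derived entries of the X2 table can be recomputed from its sums of squares and degrees of freedom using MS = SS/DF and F = MSModel/MSError. The variable names below are illustrative only, and the p-value is not recomputed because it needs the F-distribution.

```python
# Sum of squares and degrees of freedom for X2, taken from the table above.
ss_model, df_model = 26811371, 6
ss_error, df_error = 247218517, 94

ms_model = ss_model / df_model      # mean of squares for the Model row
ms_error = ss_error / df_error      # mean of squares for the Error row
f_value = ms_model / ms_error       # F-value for the Model row

ss_total = ss_model + ss_error      # Total row: sum of squares
df_total = df_model + df_error      # Total row: degrees of freedom (N - 1)
```

The Total row follows directly from the other two rows, and df_total equals N - 1 for the N = 101 records in this example.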
The PMML for both ANOVA tables in this example is shown below:
...
<ModelStats>
<UnivariateStats field="X1">
...
<Anova>
<AnovaRow type="Model" sumOfSquares="21389708" degreesOfFreedom="2" meanOfSquares="7129903" fValue="0.85" pValue="0.47" />
<AnovaRow type="Error" sumOfSquares="809210617" degreesOfFreedom="98" meanOfSquares="8342377"/>
<AnovaRow type="Total" sumOfSquares="830600325" degreesOfFreedom="100" />
</Anova>
</UnivariateStats>
<UnivariateStats field="X2">
...
<Anova>
<AnovaRow type="Model" sumOfSquares="26811371" degreesOfFreedom="6" meanOfSquares="4468561" fValue="1.699" pValue="0.13" />
<AnovaRow type="Error" sumOfSquares="247218517" degreesOfFreedom="94" meanOfSquares="2629984"/>
<AnovaRow type="Total" sumOfSquares="274029888" degreesOfFreedom="100" />
</Anova>
</UnivariateStats>
...
</ModelStats>
...
Partitions
A Partition contains statistics for a subset of records; for example, it can
describe the population in a cluster.
The content of a Partition mirrors the definition of the general univariate
statistics. That is, each Partition describes the distribution per field.
For each field there can be information about frequencies,
numeric moments, etc.
<xs:element name="Partition">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="PartitionFieldStats" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="name" type="xs:string" use="required"/>
<xs:attribute name="size" type="NUMBER"/>
</xs:complexType>
</xs:element>
The attribute name identifies the Partition.
The attribute size is the number of records.
All aggregates in PartitionFieldStats must have
size = totalFreq in Counts, if specified.
<xs:element name="PartitionFieldStats">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Counts" minOccurs="0"/>
<xs:element ref="NumericInfo" minOccurs="0"/>
<xs:group ref="FrequenciesType" minOccurs="0"/>
</xs:sequence>
<xs:attribute name="field" type="FIELD-NAME" use="required" />
<xs:attribute name="weighted" default="0">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="0"/>
<xs:enumeration value="1"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
field refers to (the name of) a MiningField for background statistics.
The sequence of NUM-ARRAYs is the same as for ContStats.
For categorical fields there is only one array containing the frequencies;
for numeric fields, the second and third array contain the sums of values
and the sums of squared values, respectively.
The number of values in each array must match the number of categories or intervals
in UnivariateStats of the field.
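A consumer could check the array-length rule with a helper along these lines. The function name is hypothetical, and comparing the frequency sum with the Partition size is an extra plausibility check assumed here for unweighted data without missing values; the text itself only requires the array lengths to match.

```python
def check_partition_field(frequencies, n_bins, partition_size=None):
    """Check a PartitionFieldStats frequency array against its reference.

    n_bins is the number of categories or intervals in the field's
    UnivariateStats. The optional size comparison is an assumption of
    this sketch, not a rule stated in the text.
    """
    problems = []
    if len(frequencies) != n_bins:
        problems.append(f"expected {n_bins} values, found {len(frequencies)}")
    if partition_size is not None and sum(frequencies) != partition_size:
        problems.append(f"frequencies sum to {sum(frequencies)}, not {partition_size}")
    return problems


# Ten Age intervals and a Partition of size 134.
age_problems = check_partition_field([1, 5, 17, 24, 17, 19, 23, 17, 8, 3], 10, 134)
```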
Example
Here is an example from a tree model:
...
<DataDictionary>
<DataField name="Sex" optype="categorical" dataType="string">
<Value value="female"/>
<Value value="male"/>
</DataField>
<DataField name="Age" optype="continuous" dataType="double"/>
...
</DataDictionary>
...
<ModelStats>
<UnivariateStats field="Age" weighted="0">
<Counts totalFreq="240"/>
<NumericInfo mean="54.43" minimum="29" maximum="77"/>
<ContStats>
<Interval closure="openClosed" leftMargin="29" rightMargin="33.8"/>
<Interval closure="openClosed" leftMargin="33.8" rightMargin="38.6"/>
<Interval closure="openClosed" leftMargin="38.6" rightMargin="43.4"/>
<Interval closure="openClosed" leftMargin="43.4" rightMargin="48.2"/>
<Interval closure="openClosed" leftMargin="48.2" rightMargin="53"/>
<Interval closure="openClosed" leftMargin="53" rightMargin="57.8"/>
<Interval closure="openClosed" leftMargin="57.8" rightMargin="62.6"/>
<Interval closure="openClosed" leftMargin="62.6" rightMargin="67.4"/>
<Interval closure="openClosed" leftMargin="67.4" rightMargin="72.2"/>
<Interval closure="openClosed" leftMargin="72.2" rightMargin="77"/>
<Array type="int" n="10"> 1 8 28 30 30 43 51 33 13 3</Array>
</ContStats>
</UnivariateStats>
...
...
<Node score="1" recordCount="134">
<Partition name="1.1" size="134">
<PartitionFieldStats field="Age" weighted="0">
<Array type="int" n="10"> 1 5 17 24 17 19 23 17 8 3</Array>
</PartitionFieldStats>
<PartitionFieldStats field="Sex">
<Array type="int" n="2"> 70 64</Array>
</PartitionFieldStats>
...
...
<Targets>
<Target field="response" optype="categorical">
<TargetValue value="Yes">
<Partition name="Yes_classified" size="103">
<PartitionFieldStats field="Age" weighted="0">
<NumericInfo mean="56.6796116504854" minimum="35" maximum="77"/>
<Array type="int" n="10"> 0 3 7 8 10 17 34 18 5 1</Array>
</PartitionFieldStats>
<PartitionFieldStats field="Sex">
<Array type="int" n="2"> 88 15</Array>
</PartitionFieldStats>
</Partition>
</TargetValue>
...
</Target>
</Targets>
...
...
There is a total of 134 records in the Node. This is already evident from the
recordCount of the Node, but is reflected in the Partition element as well. However, there might
be cases in which statistics are given only for a sample, in which case the counts would
differ. The Array inside PartitionFieldStats gives the distribution according
to the intervals defined in UnivariateStats. For example, in 24 instances the Age of
the records in the Node was larger than 43.4 years and less than or equal to 48.2. Likewise,
in 70 cases the value of Sex was female, and male in 64.
The Partition element in Target indicates that 103 records were classified Yes
by the model. Again, more details on the distributions for fields Age and Sex are
given in PartitionFieldStats: 88 instances had value female for field Sex,
while 15 had the value male. There were no cases in which Age was less than or equal
to 33.8 years, and only 1 in which it was larger than 72.2 and less than or equal to 77 years.