Data Mining Group - Statistics

PMML 3.2 - Statistics

This schema provides a basic framework for representing univariate statistics. The schemas for specific models use it through the element ModelStats.


  <xs:element name="ModelStats">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="UnivariateStats" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

The statistics for a model are the collection of the statistics for its individual fields.

Univariate Statistics


  <xs:element name="UnivariateStats">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="Counts" minOccurs="0"/>
        <xs:element ref="NumericInfo" minOccurs="0"/>
        <xs:element ref="DiscrStats" minOccurs="0"/>
        <xs:element ref="ContStats" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME"/>
    </xs:complexType>
  </xs:element>

A UnivariateStats element contains statistical information about a single MiningField. A UnivariateStats element inside ModelStats must name the field it refers to in the attribute field. In other contexts, field may be omitted. Discrete and continuous statistics can be given simultaneously for numeric fields. This may be important if a numeric field has too many discrete values. The statistics can include the most frequent values and also a complete histogram distribution.

Statistics for ordinal fields are contained in DiscrStats.


  <xs:element name="Counts">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="totalFreq" type="xs:nonNegativeInteger" use="required"/>
      <xs:attribute name="missingFreq" type="xs:nonNegativeInteger"/>
      <xs:attribute name="invalidFreq" type="xs:nonNegativeInteger"/>
    </xs:complexType>
  </xs:element>

The element Counts carries counters for frequency of values with respect to their state of being missing, invalid, or valid.

totalFreq counts all records and is therefore the same for the statistics of all MiningFields. missingFreq counts the records in which the value is missing. invalidFreq counts the records with values other than valid. The total frequency includes the missing values and the invalid values.
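
As an illustration, the following fragment sketches how the three counters relate. The numbers are made up and not taken from any real dataset. Assuming the states missing, invalid, and valid are mutually exclusive, as the description of Counts suggests, 240 - 5 - 3 = 232 records carry valid values.

```xml
<!-- Hypothetical values: 240 records in total, 5 missing, 3 invalid, hence 232 valid -->
<Counts totalFreq="240" missingFreq="5" invalidFreq="3"/>
```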


  <xs:element name="NumericInfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="Quantile" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="minimum" type="NUMBER"/>
      <xs:attribute name="maximum" type="NUMBER"/>
      <xs:attribute name="mean" type="NUMBER"/>
      <xs:attribute name="standardDeviation" type="NUMBER"/>
      <xs:attribute name="median" type="NUMBER"/>
      <xs:attribute name="interQuartileRange" type="NUMBER"/>
    </xs:complexType>
  </xs:element>

The values for mean, minimum, maximum and standardDeviation are defined as usual. median is calculated as the 50% quantile; interQuartileRange is calculated as (75% quantile - 25% quantile).


  <xs:element name="Quantile">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="quantileLimit" type="PERCENTAGE-NUMBER" use="required"/>
      <xs:attribute name="quantileValue" type="NUMBER" use="required"/>
    </xs:complexType>
  </xs:element>

quantileLimit is a percentage number between 0 and 100. quantileValue is the corresponding value in the domain of field values.
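
A minimal sketch of a NumericInfo element with quantiles may help to see how median and interQuartileRange relate to the Quantile entries. All numbers are hypothetical and chosen only so that the arithmetic is easy to follow: the median equals the 50% quantile (55), and interQuartileRange equals 62 - 47 = 15.

```xml
<!-- Hypothetical numbers for illustration only -->
<NumericInfo minimum="29" maximum="77" median="55" interQuartileRange="15">
  <Quantile quantileLimit="25" quantileValue="47"/>
  <Quantile quantileLimit="50" quantileValue="55"/>
  <Quantile quantileLimit="75" quantileValue="62"/>
</NumericInfo>
```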

Discrete Statistics


  <xs:element name="DiscrStats">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="Array" minOccurs="0" maxOccurs="2"/>
      </xs:sequence>
      <xs:attribute name="modalValue" type="xs:string"/>
    </xs:complexType>
  </xs:element>

modalValue is the most frequent discrete value. If only a single array is present, it must be an INT-ARRAY containing a compact representation of all frequency numbers; the frequency numbers then correspond, in order, to the list of (valid) values given in the DataDictionary. If there is an additional STRING-ARRAY, the frequency numbers in the INT-ARRAY correspond one by one to the string values. If the statistics refer to a DerivedField, all values are listed in the STRING-ARRAY, because a DerivedField has no means of specifying a list of valid values. If DiscrStats contains an array of string values and the corresponding DataField defines a list of valid values, both sets must be the same.
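
The two-array form might look like the following sketch. The field name AgeGroup, its values, and the frequencies are hypothetical; the field stands for a DerivedField whose values cannot be looked up in the DataDictionary. The schema does not prescribe the order of the two arrays, so placing the string array first is merely a choice made here.

```xml
<!-- Hypothetical DerivedField "AgeGroup"; values and counts are illustrative -->
<UnivariateStats field="AgeGroup">
  <Counts totalFreq="240"/>
  <DiscrStats modalValue="middle">
    <Array type="string" n="3">young middle senior</Array>
    <!-- frequencies correspond one by one to the strings above: 40 + 120 + 80 = 240 -->
    <Array type="int" n="3">40 120 80</Array>
  </DiscrStats>
</UnivariateStats>
```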

Continuous Statistics


  <xs:element name="ContStats">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="Interval" minOccurs="0" maxOccurs="unbounded"/>
        <xs:group ref="FrequenciesType" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="totalValuesSum" type="NUMBER"/>
      <xs:attribute name="totalSquaresSum" type="NUMBER"/>
    </xs:complexType>
  </xs:element>

  <xs:group name="FrequenciesType">
    <xs:sequence>
      <xs:group ref="INT-ARRAY"/>
      <xs:group ref="NUM-ARRAY" minOccurs="0" maxOccurs="2"/>
    </xs:sequence>
  </xs:group>

The three arrays contain, for each interval and in this order, the frequencies (INT-ARRAY), the sums of values, and the sums of squared values (the two optional NUM-ARRAYs).

Note: Interval is defined in the schema for DataDictionary.
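
As a sketch of the full three-array form, consider a hypothetical field Income with the six values 4, 8, 12, 15, 18, and 25, binned into three intervals. The field name and all numbers are illustrative. The fragment also assumes that totalValuesSum and totalSquaresSum are the totals over all intervals (12 + 45 + 25 = 82 and 80 + 693 + 625 = 1398), which the text above does not state explicitly.

```xml
<!-- Hypothetical field and values: 4, 8 | 12, 15, 18 | 25 -->
<UnivariateStats field="Income">
  <Counts totalFreq="6"/>
  <ContStats totalValuesSum="82" totalSquaresSum="1398">
    <Interval closure="openClosed" leftMargin="0" rightMargin="10"/>
    <Interval closure="openClosed" leftMargin="10" rightMargin="20"/>
    <Interval closure="openClosed" leftMargin="20" rightMargin="30"/>
    <!-- frequencies per interval -->
    <Array type="int" n="3">2 3 1</Array>
    <!-- sums of values per interval: 4+8, 12+15+18, 25 -->
    <Array type="real" n="3">12 45 25</Array>
    <!-- sums of squared values per interval: 16+64, 144+225+324, 625 -->
    <Array type="real" n="3">80 693 625</Array>
  </ContStats>
</UnivariateStats>
```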

Example

Here is an example excerpt which gives univariate statistics for the continuous field Age and the categorical field Sex:


  ...
  <DataDictionary>
    <DataField name="Sex" optype="categorical" dataType="string">
      <Value value="female"/>
      <Value value="male"/>
    </DataField>
    <DataField name="Age" optype="continuous" dataType="double"/>
    ...
  </DataDictionary>
  ...
    <ModelStats>
      <UnivariateStats field="Age">
        <Counts totalFreq="240"/>
        <NumericInfo mean="54.43" minimum="29" maximum="77"/>
        <ContStats>
          <Interval closure="openClosed" leftMargin="29" rightMargin="33.8"/>
          <Interval closure="openClosed" leftMargin="33.8" rightMargin="38.6"/>
          <Interval closure="openClosed" leftMargin="38.6" rightMargin="43.4"/>
          <Interval closure="openClosed" leftMargin="43.4" rightMargin="48.2"/>
          <Interval closure="openClosed" leftMargin="48.2" rightMargin="53"/>
          <Interval closure="openClosed" leftMargin="53" rightMargin="57.8"/>
          <Interval closure="openClosed" leftMargin="57.8" rightMargin="62.6"/>
          <Interval closure="openClosed" leftMargin="62.6" rightMargin="67.4"/>
          <Interval closure="openClosed" leftMargin="67.4" rightMargin="72.2"/>
          <Interval closure="openClosed" leftMargin="72.2" rightMargin="77"/>
          <Array type="int" n="10"> 1 8 28 30 30 43 51 33 13 3</Array>
        </ContStats>
      </UnivariateStats>
      <UnivariateStats field="Sex">
        <Counts totalFreq="240"/>
        <DiscrStats>
          <Array type="int" n="2"> 166 74</Array>
        </DiscrStats>
      </UnivariateStats>
      ...
    </ModelStats>
    ...

The array in DiscrStats for field Sex refers to the respective Value elements in the DataDictionary. Out of the total 240 instances, 166 are female while 74 are male.
Field Age has a mean of 54.43 years, with a minimum of 29 and a maximum of 77. The ContStats gives more details about the actual distribution. For instance, in 43 records the value of Age is larger than 53 years and less than or equal to 57.8.

Partitions

A Partition contains statistics for a subset of records; for example, it can describe the population in a cluster. The content of a Partition mirrors the definition of the general univariate statistics. That is, each Partition describes the distribution per field. For each field there can be information about frequencies, numeric moments, etc.


  <xs:element name="Partition">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="PartitionFieldStats" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="name" type="xs:string" use="required"/>
      <xs:attribute name="size" type="xs:nonNegativeInteger"/>
    </xs:complexType>
  </xs:element>

The attribute name identifies the Partition. The attribute size is the number of records. For all aggregates in PartitionFieldStats, the value of totalFreq in Counts, if specified, must be equal to size.


  <xs:element name="PartitionFieldStats">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="Counts" minOccurs="0"/>
        <xs:element ref="NumericInfo" minOccurs="0"/>
        <xs:group ref="FrequenciesType" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="required" />
    </xs:complexType>
  </xs:element>

field refers to (the name of) a MiningField for background statistics. The sequence of NUM-ARRAYs is the same as for ContStats. For categorical fields there is only one array containing the frequencies; for numeric fields, the second and third array contain the sums of values and the sums of squared values, respectively. The number of values in each array must match the number of categories or intervals in UnivariateStats of the field.
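
The following fragment sketches a PartitionFieldStats for a numeric field with all three arrays. The partition name, the field Income, and the numbers are hypothetical; the sketch assumes that the UnivariateStats of Income defines three intervals and that the partition holds the four values 8, 12, 18, and 25. Note that size equals totalFreq and that the mean is (8 + 30 + 25) / 4 = 15.75.

```xml
<!-- Hypothetical partition; assumes UnivariateStats of "Income" defines three intervals -->
<Partition name="cluster_1" size="4">
  <PartitionFieldStats field="Income">
    <Counts totalFreq="4"/>
    <NumericInfo minimum="8" maximum="25" mean="15.75"/>
    <!-- frequencies per interval -->
    <Array type="int" n="3">1 2 1</Array>
    <!-- sums of values per interval: 8, 12+18, 25 -->
    <Array type="real" n="3">8 30 25</Array>
    <!-- sums of squared values per interval: 64, 144+324, 625 -->
    <Array type="real" n="3">64 468 625</Array>
  </PartitionFieldStats>
</Partition>
```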

Example

Here is an example from a tree model:


  ...
  <DataDictionary>
    <DataField name="Sex" optype="categorical" dataType="string">
      <Value value="female"/>
      <Value value="male"/>
    </DataField>
    <DataField name="Age" optype="continuous" dataType="double"/>
    ...
  </DataDictionary>
  ...
    <ModelStats>
      <UnivariateStats field="Age">
        <Counts totalFreq="240"/>
        <NumericInfo mean="54.43" minimum="29" maximum="77"/>
        <ContStats>
          <Interval closure="openClosed" leftMargin="29" rightMargin="33.8"/>
          <Interval closure="openClosed" leftMargin="33.8" rightMargin="38.6"/>
          <Interval closure="openClosed" leftMargin="38.6" rightMargin="43.4"/>
          <Interval closure="openClosed" leftMargin="43.4" rightMargin="48.2"/>
          <Interval closure="openClosed" leftMargin="48.2" rightMargin="53"/>
          <Interval closure="openClosed" leftMargin="53" rightMargin="57.8"/>
          <Interval closure="openClosed" leftMargin="57.8" rightMargin="62.6"/>
          <Interval closure="openClosed" leftMargin="62.6" rightMargin="67.4"/>
          <Interval closure="openClosed" leftMargin="67.4" rightMargin="72.2"/>
          <Interval closure="openClosed" leftMargin="72.2" rightMargin="77"/>
          <Array type="int" n="10"> 1 8 28 30 30 43 51 33 13 3</Array>
        </ContStats>
      </UnivariateStats>
      ...
    ...
    <Node score="1" recordCount="134">
      <Partition name="1.1" size="134">
        <PartitionFieldStats field="Age">
          <Array type="int" n="10"> 1 5 17 24 17 19 23 17 8 3</Array>
        </PartitionFieldStats>
        <PartitionFieldStats field="Sex">
          <Array type="int" n="2"> 70 64</Array>
        </PartitionFieldStats>
      ...
    ...
    <Targets>
      <Target field="response" optype="categorical">
        <TargetValue value="Yes">
          <Partition name="Yes_classified" size="103">
            <PartitionFieldStats field="Age">
              <NumericInfo mean="56.6796116504854" minimum="35" maximum="77"/>
              <Array type="int" n="10"> 0 3 7 8 10 17 34 18 5 1</Array>
            </PartitionFieldStats>
            <PartitionFieldStats field="Sex">
              <Array type="int" n="2"> 88 15</Array>
            </PartitionFieldStats>
          </Partition>
        </TargetValue>
        ...
      </Target>
    </Targets>
    ...
  ...

There is a total of 134 records in the Node. This is already obvious from recordCount of the Node, but is reflected in the Partition element as well. However, there may be cases in which statistics are given only for a sample, in which case the counts would differ. The Array inside PartitionFieldStats gives the distribution according to the intervals defined in UnivariateStats. For example, in 24 instances the Age of the records in the Node was larger than 43.4 years and less than or equal to 48.2. Likewise, in 70 cases the value of Sex was female, and in 64 it was male.
The Partition element in Target shows that 103 records were classified Yes by the model. Again, more details on the distributions for fields Age and Sex are given in PartitionFieldStats: 88 instances had the value female for field Sex, while 15 had the value male. There were no cases in which Age was less than or equal to 33.8 years, and only 1 in which it was larger than 72.2 and less than or equal to 77 years.