Data Mining Group - PMML Statistics

PMML 1.1 -- DTD for Statistics

This DTD subset for statistics provides a basic framework for representing univariate statistics. DTDs for data mining models use it through the ModelStats element. The general guideline for PMML models is that any statistics a model needs should look like the elements defined below. Models need not use exactly the same elements, but using the same basic structure is recommended for ease of presentation and implementation.


	<!ELEMENT ModelStats   ( UnivariateStats+ ) >

The statistics for a model are made up of the collection of statistics for the individual fields.

Univariate Statistics

				
	<!ENTITY  % AGGREGATE "(Counts?, NumericInfo?)" >
	
	<!ELEMENT UnivariateStats ( (%AGGREGATE;)?, DiscrStats?, ContStats? ) >
	
	<!ATTLIST UnivariateStats
	     field                  %FIELD-NAME;                    #IMPLIED
        >

A UnivariateStats element contains statistical information about a single mining field. For numeric fields, discrete and continuous statistics may both be present at the same time. This may be important if a numeric field has too many discrete values. The statistics can include the most frequent values as well as a complete histogram distribution.

Statistics for ordinal fields are contained in DiscrStats. It may be necessary to extend quantiles to ordinal fields.
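
The content model above fixes the order of the child elements. As a minimal sketch, the following instance fragment shows a numeric field that carries both discrete and continuous statistics. The field name "age" and all attribute values are invented for illustration, and the child elements are left as bare as their declarations allow.

```xml
<!-- Hypothetical field "age"; all numbers are invented. -->
<ModelStats>
    <UnivariateStats field="age">
        <!-- AGGREGATE part: Counts first, then NumericInfo -->
        <Counts totalFreq="1000"/>
        <NumericInfo minimum="18" maximum="90"/>
        <!-- Discrete statistics come before continuous statistics -->
        <DiscrStats modalValue="35"/>
        <ContStats/>
    </UnivariateStats>
</ModelStats>
```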

					
	<!ELEMENT Counts EMPTY>
	
	<!ATTLIST Counts
	     totalFreq              %NUMBER;                        #REQUIRED
	     missingFreq            %NUMBER;                        #IMPLIED
	     invalidFreq            %NUMBER;                        #IMPLIED
	>

The Counts element carries frequency counters for values according to whether they are missing, invalid, or valid.

totalFreq counts all records; it is the same for the statistics of all mining fields.
missingFreq counts the number of records where the value is missing.
invalidFreq counts the number of records with values other than valid. The total frequency includes the missing and invalid values.
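
A short worked example with invented numbers may help. It assumes that missing and invalid records do not overlap, which the text suggests but does not state explicitly. Under that assumption the number of valid records is totalFreq minus missingFreq minus invalidFreq, here 1000 - 30 - 20 = 950.

```xml
<!-- Hypothetical counts: 1000 records, of which 30 are missing and 20 invalid. -->
<!-- Assuming the two groups are disjoint, 950 records hold valid values. -->
<Counts totalFreq="1000" missingFreq="30" invalidFreq="20"/>
```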


	<!ELEMENT NumericInfo (Quantile*) >
	
	<!ATTLIST NumericInfo
	     minimum                %NUMBER;                        #IMPLIED
	     maximum                %NUMBER;                        #IMPLIED
	     mean                   %NUMBER;                        #IMPLIED
	     standardDeviation      %NUMBER;                        #IMPLIED
	     median                 %NUMBER;                        #IMPLIED
	     interQuartileRange     %NUMBER;                        #IMPLIED
        >

The values for mean, minimum, maximum, and standardDeviation are defined as usual. median is calculated as the 50% quantile; interQuartileRange is calculated as (75% quantile - 25% quantile).

					
	<!ELEMENT Quantile EMPTY>
	
	<!ATTLIST Quantile
	     quantileLimit          %PERCENTAGE-NUMBER;             #REQUIRED
	     quantileValue          %NUMBER;                        #REQUIRED
	>

quantileLimit is a percentage number between 0 and 100.
quantileValue is the corresponding value in the domain of field values.
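
The relationship between NumericInfo and its Quantile children can be sketched with an instance fragment. All values below are invented; they are chosen only so that median equals the 50% quantile value and interQuartileRange equals the 75% quantile value minus the 25% quantile value (6 - 4 = 2).

```xml
<!-- Invented values for a hypothetical numeric field. -->
<NumericInfo minimum="2" maximum="9" mean="5" standardDeviation="2"
             median="4.5" interQuartileRange="2">
    <Quantile quantileLimit="25" quantileValue="4"/>
    <!-- The 50% quantile matches the median attribute -->
    <Quantile quantileLimit="50" quantileValue="4.5"/>
    <Quantile quantileLimit="75" quantileValue="6"/>
</NumericInfo>
```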


	<!ELEMENT DiscrStats ( Extension*, (%STRING-ARRAY;)?, (%INT-ARRAY;)? ) >
	
	<!ATTLIST DiscrStats
	     modalValue             CDATA                           #IMPLIED
	>

modalValue is the most frequent discrete value. The INT-ARRAY contains a compact representation of all frequency numbers. If there is an array of string values then the frequency numbers in the INT-ARRAY correspond to the string values one by one. Otherwise the frequency numbers correspond to the list of (valid) values as given in the DataDictionary.
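
As an illustration, the fragment below pairs an array of string values with an array of frequencies. It assumes that the array entities expand to an Array element with n and type attributes; that definition is not part of this section and should be checked against the general PMML DTD. The values and counts are invented.

```xml
<!-- Assumption: %STRING-ARRAY; and %INT-ARRAY; expand to an Array element -->
<!-- with attributes n and type. Values and frequencies are invented. -->
<DiscrStats modalValue="red">
    <Array n="3" type="string">red green blue</Array>
    <!-- Frequencies correspond one by one to the strings above -->
    <Array n="3" type="int">50 30 20</Array>
</DiscrStats>
```

Here "red" occurs 50 times, "green" 30 times, and "blue" 20 times, so modalValue is "red".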

				
	<!ELEMENT ContStats (Extension*, Interval*, (%INT-ARRAY;)?, (%NUM-ARRAY;)?, (%NUM-ARRAY;)? )>

	<!ATTLIST ContStats
	     totalValuesSum         %NUMBER;                        #IMPLIED
	     totalSquaresSum        %NUMBER;                        #IMPLIED
	>

The three arrays contain, in this order, the frequencies, the sums of values, and the sums of squared values for each interval.

Note: Interval is defined in the DTD for DataDictionary.
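
A possible ContStats instance with three intervals is sketched below. Several things are assumed here and not defined in this section: the attributes of Interval (closure, leftMargin, rightMargin), the Array element used for the array entities, and the reading of totalValuesSum and totalSquaresSum as totals over all intervals, which their names suggest. The numbers are invented but consistent: 12 + 70 + 50 = 132 and 56 + 1020 + 1300 = 2376.

```xml
<!-- Assumptions: Interval attributes and the Array element are defined -->
<!-- in other parts of the PMML DTD. All numbers are invented. -->
<ContStats totalValuesSum="132" totalSquaresSum="2376">
    <Interval closure="closedOpen" leftMargin="0" rightMargin="10"/>
    <Interval closure="closedOpen" leftMargin="10" rightMargin="20"/>
    <Interval closure="closedClosed" leftMargin="20" rightMargin="30"/>
    <!-- Frequencies per interval -->
    <Array n="3" type="int">3 5 2</Array>
    <!-- Sums of values per interval -->
    <Array n="3" type="real">12 70 50</Array>
    <!-- Sums of squared values per interval -->
    <Array n="3" type="real">56 1020 1300</Array>
</ContStats>
```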

A partition contains statistics for a subset of records; for example, it can describe the population in a cluster.


	<!ELEMENT Partition (PartitionFieldStats+) >
					
	<!ATTLIST Partition
	     name                   CDATA                           #REQUIRED
	     size                   %NUMBER;                        #IMPLIED
	>

name identifies the partition.
size is the number of records in the partition. Every AGGREGATE in its PartitionFieldStats elements must have a total frequency (totalFreq) equal to size.


	<!ELEMENT PartitionFieldStats ( (%AGGREGATE;), (%NUM-ARRAY;)* ) >

	<!ATTLIST PartitionFieldStats
	     field                  %FIELD-NAME;                    #REQUIRED
	>

field refers to (the name of) a MiningField for background statistics. The sequence of NUM-ARRAYs is the same as for ContStats. For categorical fields there is only one array containing the frequencies; for numeric fields, the second and third array contain the sums of values and the sums of squared values, respectively. The number of values in each array must match the number of categories or intervals in UnivariateStats of the field.
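
To make the array conventions concrete, here is a sketch of a Partition describing a hypothetical cluster of 100 records. The partition name, the field names "color" and "age", and all numbers are invented. The Array element with n and type attributes is an assumption about how the array entities expand. The fragment also presumes that the UnivariateStats for "color" lists three categories and that the UnivariateStats for "age" defines three intervals.

```xml
<!-- Hypothetical cluster with 100 records; every Counts has totalFreq = size. -->
<Partition name="cluster1" size="100">
    <!-- Categorical field: a single array with the frequencies -->
    <PartitionFieldStats field="color">
        <Counts totalFreq="100"/>
        <Array n="3" type="int">50 30 20</Array>
    </PartitionFieldStats>
    <!-- Numeric field: frequencies, sums of values, sums of squared values -->
    <PartitionFieldStats field="age">
        <Counts totalFreq="100"/>
        <Array n="3" type="int">30 50 20</Array>
        <Array n="3" type="real">150 750 500</Array>
        <Array n="3" type="real">900 11600 12700</Array>
    </PartitionFieldStats>
</Partition>
```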