Data Mining Group - RFC - Baseline Model

RFC - Baseline Model

PMML 3.1 - Baseline Model (proposal version 5-2-7)

Motivating Examples. There are several different types of baseline, change detection, hypothesis testing, and related models that are supported by the PMML Baseline Model. We begin with several informal examples.

Change detection model with CUSUM statistic. For the first example, assume we have two Gaussian distributions, each characterized by a mean and a standard deviation, one representing normal and one representing abnormal behavior. Given a stream of events, a score is computed with each new event and this score is used to decide whether the stream of events is likely to be from the baseline distribution or the second distribution.

Notice that this may be viewed as hypothesis testing by considering the baseline distribution as the null hypothesis and the second distribution as the alternate hypothesis. Given a stream of events, the goal is to determine as quickly as possible when events are occurring from the alternate distribution.

A common score used for this purpose is the CUSUM defined as follows: Let f0(x) and f1(x) be the density functions for the two Gaussian distributions and let g(x) be the log odds ratio:

g(x) = log( f1(x) / f0(x) )

Given a stream of events with features x[0], x[1], x[2], ..., define the CUSUM score by:

assume Z[-1]=0

Z[n] = max{0, Z[n-1] + g(x[n])}

This would be represented in PMML with the following fragment:


  <BaselineModel modelName="geo-cusum" functionName="baseline" >

    <TestDistributions field="congestion-score" 
      testStatistic="CUSUM" testType="threshold" threshold="21.0" resetValue="0.0" >
      <Baseline>
        <GaussianDistribution mean="550.2" variance="48.2" />
      </Baseline>
      <Alternate>
        <GaussianDistribution mean="460.4" variance="39.2" />
      </Alternate>
    </TestDistributions>

  </BaselineModel>

Scoring with this type of change detection model can be done as follows: given a sequence of events, the mining and derived fields are evaluated to compute the field named in the TestDistributions element. Then the test indicated (CUSUM in the example above) is performed with the parameter values supplied. In the example above, if the test statistic computed for that field exceeds the threshold of 21.0, a TRUE is generated; otherwise a FALSE is generated. A reset value can be used to provide an alternative to 0 in the formula above defining Z[n].
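The CUSUM recursion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the proposal: the function names are hypothetical, each distribution is passed as a (mean, variance) pair, the score starts at Z[-1]=0, and the reset value simply replaces the 0 inside the max.

```python
import math


def gaussian_log_density(x, mean, variance):
    """Log of the Gaussian density with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def cusum_scores(events, baseline, alternate, reset_value=0.0):
    """Return Z[n] for each event; baseline and alternate are (mean, variance) pairs."""
    z = 0.0  # Z[-1] = 0, as assumed in the text
    scores = []
    for x in events:
        # g(x) = log(f1(x) / f0(x)), computed as a difference of log densities
        g = gaussian_log_density(x, *alternate) - gaussian_log_density(x, *baseline)
        # Assumption: reset_value takes the place of 0 in max{0, Z[n-1] + g(x[n])}
        z = max(reset_value, z + g)
        scores.append(z)
    return scores


def cusum_alerts(events, baseline, alternate, threshold, reset_value=0.0):
    """TRUE for each event whose CUSUM score exceeds the threshold."""
    return [z > threshold for z in cusum_scores(events, baseline, alternate, reset_value)]


if __name__ == "__main__":
    stream = [550.0, 551.0, 549.0, 460.0, 461.0]
    print(cusum_alerts(stream, (550.2, 48.2), (460.4, 39.2), threshold=21.0))
```

With the parameters of the geo-cusum fragment, events near the baseline mean keep the score at the reset value, while events near the alternate mean push it over the threshold quickly.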

Change detection model with generalized likelihood ratio (GLR) statistic. Often there is less knowledge in change detection applications about the alternate distribution. For example, its mean and standard deviation may be unknown or very difficult to estimate. In this case, the generalized likelihood ratio (GLR) test can be used.

This would be represented in PMML with the following fragment:


  <BaselineModel modelName="geo-glr" functionName="baseline" >

    <TestDistributions field="congestion-score" 
      testStatistic="GLR" testType="threshold" threshold="12.8" resetValue="0.0" >
      <Baseline>
        <GaussianDistribution mean="550.2" variance="48.2" />
      </Baseline>
    </TestDistributions>

  </BaselineModel>
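The proposal does not spell out the GLR formula. As a rough sketch only, the following assumes the common textbook setting: a Gaussian baseline with known mean and variance, an unknown mean after the change, and a statistic obtained by maximizing the log likelihood ratio over both the unknown mean and the possible change point. The function name is hypothetical.

```python
def glr_statistic(events, mean, variance):
    """GLR statistic at the most recent event for an unknown shift in the mean.

    Assumption: for each candidate change point k, the unknown post-change mean
    is replaced by its maximum likelihood estimate, which gives
    (sum of (x[i] - mean) for i = k..n) ** 2 / (2 * variance * (n - k + 1)).
    """
    best = 0.0
    partial = 0.0
    # Walk backwards so that the m most recent events form the candidate segment.
    for m, x in enumerate(reversed(events), start=1):
        partial += x - mean
        best = max(best, partial * partial / (2.0 * variance * m))
    return best


if __name__ == "__main__":
    stream = [550.2, 520.2, 520.2]
    # Compare against the threshold of the geo-glr fragment.
    print(glr_statistic(stream, mean=550.2, variance=48.2) > 12.8)
```

Unlike the CUSUM recursion, this form rescans the history at every event, so a practical scorer would typically bound the window it searches.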

Threshold break models using standard scores. For another very basic example, if the baseline model is a single distribution, then the value of a mining field can be converted to a standard score (also called a z-value) and a simple threshold used to determine whether the field is close enough to the mean of the distribution.


  <BaselineModel modelName="three-sigma-threshold" functionName="baseline" >

    <TestDistributions field="defects" 
      testStatistic="zValue" testType="threshold" threshold="3.0" >
      <Baseline>
        <GaussianDistribution mean="18.2" variance="4.2" />
      </Baseline>
    </TestDistributions>

  </BaselineModel>

Scoring with this type of model is very simple. Given an event with a mining or derived field called "defects," the value of defects is converted to a standard score or z-value using the formula

standard-defects = (defects - mean) / sqrt (variance)
and if standard-defects is greater than or equal to 3.0 the model returns TRUE, otherwise the model returns FALSE.
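As a minimal sketch of this rule (the function names are illustrative and not defined by the proposal):

```python
import math


def z_value(value, mean, variance):
    """Standard score of value under a distribution with the given mean and variance."""
    return (value - mean) / math.sqrt(variance)


def z_value_test(value, mean, variance, threshold):
    """TRUE when the standard score is greater than or equal to the threshold."""
    return z_value(value, mean, variance) >= threshold


if __name__ == "__main__":
    # Parameters taken from the three-sigma-threshold fragment.
    for defects in (20.0, 24.0, 25.0):
        print(defects, z_value_test(defects, mean=18.2, variance=4.2, threshold=3.0))
```

Note that the comparison is one-sided, as in the text: values far below the mean do not trigger a TRUE.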

Contingency table models using chi-squared statistics and p-values. For categorical variables, contingency tables summarize how one field and its categorical values (rows) relate to another field and its categorical values (columns). The baseline or null hypothesis is that there is no association between the field defining the rows and the field defining the columns. The alternate hypothesis is that there is such an association. A standard approach is to compute the chi-squared statistic for the table and then to use a p-value to determine whether the chi-squared statistic is statistically significant.


 <BaselineModel modelName="merchant-name-populated-model" functionName="baseline" >

  <TestDistributions testStatistic="chiSquared" testType="twoSidedPValue" threshold="95.0" >

    <CountTable>
      <FieldValueCounts field="merchant_name_populated">
        <FieldValueCount field="payment_accepted" count="41"/>
        <FieldValueCount field="payment_declined" count="81"/>
      </FieldValueCounts>
      <FieldValueCounts field="merchant_name_missing">
        <FieldValueCount field="payment_accepted" count="12"/>
        <FieldValueCount field="payment_declined" count="30"/>
      </FieldValueCounts>
    </CountTable>

  </TestDistributions>

 </BaselineModel>

Scoring with this type of model is done as follows. Given such a table (with any number of rows or columns), one first computes row and column totals, as well as the total for the whole table. Given any cell in the table, the expected number for that cell is computed as follows

expected number for cell = (row total * column total) / (total for table),
where the row total and column total are the row and column totals for that cell. Using the expected numbers, the chi-squared statistic is computed as follows:
chi-squared = sum (expected number - observed number)² / (expected number),
where the sum is over all cells in the table.

If the table has r rows and c columns, the number of degrees of freedom of the chi-squared statistic is (r-1)(c-1). The p-value is the probability that a chi-squared random variable with this many degrees of freedom exceeds the chi-squared value computed from the table. It is common to use thresholds of 95%, 97.5%, or 99%.
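The computation described above can be sketched as follows. The table is passed as a list of rows of observed counts, and the function names are hypothetical. To stay within the standard library, the p-value helper covers only the case of 1 degree of freedom (a 2 x 2 table); a general implementation would need the chi-squared distribution function for arbitrary degrees of freedom.

```python
import math


def chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected number for cell = (row total * column total) / (total for table)
            expected = row_totals[i] * col_totals[j] / total
            statistic += (expected - observed) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


if __name__ == "__main__":
    # Counts from the merchant-name-populated-model fragment.
    counts = [[41, 81], [12, 30]]
    statistic, dof = chi_squared(counts)
    print(statistic, dof, p_value_one_dof(statistic))
```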

An alternative test could have simply specified:


  <TestDistributions testStatistic="chiSquared" testType="threshold" threshold="2.1" >

In this case, the chi-squared statistic for the contingency table would have been computed as described above, and a TRUE returned if the computed chi-squared is equal to or above the threshold; otherwise, a FALSE is returned.

Target and Output elements. The PMML Target element is used by the Baseline Model to return the result computed by the model, while the PMML Output element can be used to return auxiliary information, such as derived fields that are needed for subsequent computations.

Model Specification

The top level BaselineModel element follows the usual GeneralStructure conventions and contains a single TestDistributions element.

The content of the TestDistributions element depends on whether a continuous or discrete case is specified. It will generally contain one or more distribution specifications along with the test type and parameters used to produce the result. The role of each distribution specification is indicated by placing it within the Baseline element, indicating that it is the null model, or within the Alternate element, indicating that it is the alternate model. The default role is Baseline.


  <xs:element name="BaselineModel">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="MiningSchema"/>
        <xs:element ref="Output" minOccurs="0" />
        <xs:element ref="ModelStats" minOccurs="0"/>
        <xs:element ref="Targets" minOccurs="0" />
        <xs:element ref="LocalTransformations" minOccurs="0" />
        <xs:element ref="TestDistributions"/>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string" use="optional" />
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
      <xs:attribute name="algorithmName" type="xs:string" use="optional" />
    </xs:complexType>
  </xs:element>

  <xs:element name="TestDistributions">
    <xs:complexType>
      <xs:sequence>
        <xs:choice>
          <xs:group ref="CONTINUOUS-DISTRIBUTION-TYPES" minOccurs="1" maxOccurs="2"/>
          <xs:group ref="DISCRETE-DISTRIBUTION-TYPES" minOccurs="1" maxOccurs="unbounded"/>
        </xs:choice>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="optional" />
      <xs:attribute name="testStatistic" type="BASELINE-TEST-STATISTIC" use="optional" />
      <xs:attribute name="testType" type="BASELINE-TEST-TYPE" default="threshold" use="optional" />
      <xs:attribute name="threshold" type="REAL-NUMBER" use="required" />
      <xs:attribute name="resetValue" type="REAL-NUMBER" default="0.0" use="optional" />
    </xs:complexType>
  </xs:element>


  <xs:simpleType name="BASELINE-TEST-STATISTIC">
    <xs:restriction base="xs:string">
      <xs:enumeration value="count" />
      <xs:enumeration value="zValue" />
      <xs:enumeration value="chiSquared" />
      <xs:enumeration value="fisher" />
      <xs:enumeration value="fisherExact" />
      <xs:enumeration value="yatesContinuityCorrection" />
      <xs:enumeration value="CUSUM" />
      <xs:enumeration value="GLR" />
      <xs:enumeration value="logOddsRatio" />
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="BASELINE-TEST-TYPE">
    <xs:restriction base="xs:string">
      <xs:enumeration value="threshold" />
      <xs:enumeration value="singleSidedPValue" />
      <xs:enumeration value="twoSidedPValue" />
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="ROLE-TYPE">
    <xs:restriction base="xs:string">
      <xs:enumeration value="baseline" />
      <xs:enumeration value="alternate" />
    </xs:restriction>
  </xs:simpleType>
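As an illustration of how a consumer might apply the attribute defaults declared for TestDistributions (testType defaults to threshold and resetValue to 0.0, while threshold is required), here is a small sketch using Python's standard xml.etree.ElementTree module. The function name and the returned dictionary layout are assumptions made for this example, and the fragment is parsed without a namespace.

```python
import xml.etree.ElementTree as ET

# A fragment that omits testType and resetValue, so the defaults apply.
FRAGMENT = """
<TestDistributions field="defects" testStatistic="zValue" threshold="3.0">
  <Baseline>
    <GaussianDistribution mean="18.2" variance="4.2"/>
  </Baseline>
</TestDistributions>
"""


def read_test_settings(xml_text):
    """Read the TestDistributions attributes, filling in the schema defaults."""
    node = ET.fromstring(xml_text)
    return {
        "field": node.get("field"),                          # optional
        "testStatistic": node.get("testStatistic"),          # optional
        "testType": node.get("testType", "threshold"),       # default="threshold"
        "threshold": float(node.attrib["threshold"]),        # required
        "resetValue": float(node.get("resetValue", "0.0")),  # default="0.0"
    }


if __name__ == "__main__":
    print(read_test_settings(FRAGMENT))
```

Reading threshold through node.attrib raises a KeyError when the attribute is missing, which mirrors its use="required" declaration.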

Continuous Case

For the continuous case, one or two statistical distributions are specified. If there is one distribution, it is assumed to be the baseline or null distribution and may be placed in the Baseline element. If there are two, one must be specified within an Alternate element. The field under test is specified on the TestDistributions element. A test statistic and test type should be specified. See above for an example.


  <xs:group name="CONTINUOUS-DISTRIBUTION-TYPES">
    <xs:sequence>
      <xs:choice>
        <xs:element ref="AnyDistribution"/>
        <xs:element ref="GaussianDistribution"/>
        <xs:element ref="PoissonDistribution"/>
        <xs:element ref="UniformDistribution"/>
      </xs:choice>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:group>

  <xs:element name="AnyDistribution">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="mean" type="REAL-NUMBER" use="required" />
      <xs:attribute name="variance" type="REAL-NUMBER" use="required" />
    </xs:complexType>
  </xs:element>

  <xs:element name="GaussianDistribution">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="mean" type="REAL-NUMBER" use="required" />
      <xs:attribute name="variance" type="REAL-NUMBER" use="required" />
    </xs:complexType>
  </xs:element>

  <xs:element name="PoissonDistribution">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="mean" type="REAL-NUMBER" use="required" />
    </xs:complexType>
  </xs:element>

  <xs:element name="UniformDistribution">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="lower" type="REAL-NUMBER" use="required" />
      <xs:attribute name="upper" type="REAL-NUMBER" use="required" />
    </xs:complexType>
  </xs:element>

Discrete Case

For the discrete case, rather than using statistical distributions, the CountTable and NormalizedCountTable are used to specify tables of counts. Each row and column in such a table is associated with a field in the MiningSchema or a derived field, and the cell values contain the corresponding counts or probabilities.

If there is a single table, it is interpreted as the baseline table. If there are two tables, one can be specified as the baseline and one as the alternate.


  <xs:group name="DISCRETE-DISTRIBUTION-TYPES">
    <xs:choice>
      <xs:element ref="CountTable"/>
      <xs:element ref="NormalizedCountTable"/>
      <xs:element ref="HistogramTable"/>
    </xs:choice>
  </xs:group>

  <xs:element name="CountTable" type="COUNT-TABLE-TYPE" />

  <xs:element name="NormalizedCountTable" type="COUNT-TABLE-TYPE" />

  <xs:complexType name="COUNT-TABLE-TYPE">
    <xs:sequence>
      <xs:element ref="FieldValueCounts" minOccurs="1" maxOccurs="unbounded"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="FieldValueCounts">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="FieldValueCount" minOccurs="1" maxOccurs="unbounded"/>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="required" />
      <xs:attribute name="value" use="optional" />
    </xs:complexType>
  </xs:element>

  <xs:element name="FieldValueCount">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="required" />
      <xs:attribute name="value" use="optional" />
      <xs:attribute name="count" type="NUMBER" use="required" />
    </xs:complexType>
  </xs:element>

Some discrete case examples follow:


<!-- ================================================================== -->
<!-- Discrete case, single table with threshold test -->
<!-- in this case, if the count of any cell exceeds the threshold, TRUE is returned. -->

  <TestDistributions testStatistic="count" testType="threshold" threshold="21.0" >

    <CountTable>
      <FieldValueCounts field="merchant_name_populated">
        <FieldValueCount field="payment_accepted" count="41"/>
        <FieldValueCount field="payment_declined" count="81"/>
      </FieldValueCounts>
      <FieldValueCounts field="merchant_name_missing">
        <FieldValueCount field="payment_accepted" count="12"/>
        <FieldValueCount field="payment_declined" count="30"/>
      </FieldValueCounts>
    </CountTable>

  </TestDistributions>
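The count threshold test in this example can be sketched as below. The nested dictionary is only one assumed way to hold the table in memory (outer keys for the FieldValueCounts entries, inner keys for the FieldValueCount entries), and the function name is hypothetical.

```python
def any_count_exceeds(table, threshold):
    """TRUE if the count of any cell in the table exceeds the threshold."""
    return any(
        count > threshold
        for row in table.values()
        for count in row.values()
    )


if __name__ == "__main__":
    # Counts from the fragment above.
    counts = {
        "merchant_name_populated": {"payment_accepted": 41, "payment_declined": 81},
        "merchant_name_missing": {"payment_accepted": 12, "payment_declined": 30},
    }
    print(any_count_exceeds(counts, threshold=21.0))
```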
