PMML 4.3 - Baseline Model

Motivating Examples. There are several different types of baseline, change detection, hypothesis testing, and related models that are supported by the PMML Baseline Model. We begin with several informal examples.

Example 1: Change detection model with CUSUM statistic. For the first example, assume we have two Gaussian distributions, each characterized by a mean and a variance, one representing normal (baseline) behavior and one representing abnormal behavior. Given a stream of events, a score is computed with each new event, and this score is used to decide whether the stream of events is likely to be from the baseline distribution or the second distribution.

Notice that this may be viewed from the viewpoint of hypothesis testing by considering the baseline distribution as the null hypothesis and the second distribution as the alternate hypothesis. Given a stream of events the goal is to determine as quickly as possible when events are occurring from the alternate distribution.

A common score used for this purpose is the CUSUM, defined as follows. Let f0(x) and f1(x) be the density functions for the baseline and alternate Gaussian distributions, respectively, let r be the reset value, and let g(x) be the log-likelihood ratio:

g(x) = log( f1(x) / f0(x) )

r = 0

Given a stream of events with features x[0], x[1], x[2], ..., define the CUSUM score by:

assume Z[-1]=0

Z[n] = max{r, Z[n-1] + g(x[n])}

This would be represented in PMML with the following fragment:

<BaselineModel modelName="geo-cusum" functionName="regression">
  <MiningSchema>
    <MiningField name="congestion-score" optype="continuous"/>
    <MiningField name="cusum-score" optype="continuous" usageType="target"/>
  </MiningSchema>

  <TestDistributions field="congestion-score" testStatistic="CUSUM" resetValue="0.0">
    <Baseline>
      <GaussianDistribution mean="550.2" variance="48.2"/>
    </Baseline>
    <Alternate>
      <GaussianDistribution mean="460.4" variance="39.2"/>
    </Alternate>
  </TestDistributions>

</BaselineModel>

Scoring with this type of change detection model proceeds as follows: for each event in the sequence, the mining and/or derived fields are evaluated to compute the field named in the TestDistributions element. The indicated test (CUSUM in the example above) is then performed with the supplied parameter values.

Example 2: Baseline models using standard scores. For another very basic example, if the baseline model is a single distribution, then the value of a mining field can be converted to a standard score (also called a z-value), and a simple threshold is used to determine whether the field is close enough to the mean of the distribution.

<BaselineModel modelName="standard-score" functionName="regression">
  <MiningSchema>
    <MiningField name="defects" optype="continuous"/>
    <MiningField name="score" optype="continuous" usageType="target"/>
  </MiningSchema>

  <Output>
    <OutputField name="alert" optype="categorical" dataType="string" feature="decision">
      <Apply function="if">
        <Apply function="greaterThan">
          <FieldRef field="score"/>
          <Constant dataType="double">1</Constant>
        </Apply>
        <!-- Then case -->
        <Constant dataType="string">True</Constant>
        <!-- Else case -->
        <Constant dataType="string">False</Constant>
      </Apply>
    </OutputField>
  </Output>

  <TestDistributions field="defects" testStatistic="zValue">
    <Baseline>
      <GaussianDistribution mean="18.2" variance="17.64"/>
    </Baseline>
  </TestDistributions>

</BaselineModel>

Scoring with this type of model is very simple. Given an event with a mining or derived field called "defects," the value of defects is converted to a standard score or z-value using the formula

score = (defects - mean) / sqrt (variance)

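For example, if defects is 24, then score = (24 - 18.2) / sqrt(17.64) = 5.8 / 4.2, which is approximately 1.38. Because this exceeds the threshold of 1 used in the Output element above, the alert field evaluates to "True".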
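
The z-value calculation and the threshold test from the Output element can be sketched in Python. This is a minimal illustration rather than a reference implementation; the function names z_value and alert are hypothetical and do not appear in PMML.

```python
import math


def z_value(x, mean, variance):
    """Standard score of x for a distribution with the given mean and variance."""
    return (x - mean) / math.sqrt(variance)


def alert(score, threshold=1.0):
    """Mirror the 'alert' OutputField: the string "True" when score > threshold."""
    return "True" if score > threshold else "False"


# Values taken from Example 2: mean 18.2, variance 17.64, defects = 24.
score = z_value(24, mean=18.2, variance=17.64)
print(round(score, 2), alert(score))  # prints: 1.38 True
```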

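Example 3: Baseline models using the scalar product. The scalar product detects changes between distributions by treating the expected and observed counts as vectors and measuring how closely the two vectors align.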

<BaselineModel modelName="website-model" functionName="regression">
  <MiningSchema>
    <MiningField name="bin" optype="categorical"/>
    <MiningField name="cnt" optype="continuous"/>
    <MiningField name="score" optype="continuous" usageType="target"/>
  </MiningSchema>

  <TestDistributions field="bin" testStatistic="scalarProduct" weightField="cnt" normalizationScheme="Independent">
    <Baseline>
      <CountTable sample="262">
        <FieldValueCount field="bin" count="100" value="bin1"/>
        <FieldValueCount field="bin" count="150" value="bin2"/>
        <FieldValueCount field="bin" count="10" value="bin3"/>
        <FieldValueCount field="bin" count="2" value="bin4"/>
      </CountTable>
    </Baseline>
  </TestDistributions>

</BaselineModel>

Scoring with this type of model is done as follows. Given an observed distribution associating count Ci with vector coordinate (or bin) Vi and an expected distribution with counts ci for the same respective bins, the scalar product is simply the sum over i of Ci x ci/N. N is an optional normalization factor implied by the normalizationScheme attribute of the TestDistributions element. If the attribute is not specified, no normalization factor is applied. If the normalizationScheme attribute is set to "Independent", the normalization is defined by requiring the scalar product of each vector with itself to be one. This is equivalent to setting

N = sqrt(sum over i of Ci x Ci) x sqrt(sum over j of cj x cj).

When the vector contents are non-negative, an independently normalized scalar product will vary from 0 (orthogonal vectors) to 1 (identical vectors). In the example above, an observed vector consisting of respective counts of 10, 20, 5, and 5 would yield a scalar product of 0.959, using independent normalization.
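
The figure of 0.959 can be reproduced with a short Python sketch. It assumes the observed and expected counts are supplied as equal-length sequences in the same bin order; the function name scalar_product and its parameter names are hypothetical.

```python
import math


def scalar_product(observed, expected, normalization_scheme=None):
    """Scalar product of two count vectors given in the same bin order.

    With normalization_scheme="Independent", the result is divided by the
    product of the two Euclidean norms, so that the scalar product of each
    vector with itself is one. Otherwise no normalization is applied.
    """
    dot = sum(o * e for o, e in zip(observed, expected))
    if normalization_scheme == "Independent":
        norm_observed = math.sqrt(sum(o * o for o in observed))
        norm_expected = math.sqrt(sum(e * e for e in expected))
        return dot / (norm_observed * norm_expected)
    return dot


baseline = [100, 150, 10, 2]  # bin1..bin4 counts from the CountTable above
observed = [10, 20, 5, 5]     # observed counts quoted in the text
print(round(scalar_product(observed, baseline, "Independent"), 3))  # prints: 0.959
```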

Example 4: Contingency tables and comparisons to distributions using chi-squared statistics and p-values. Chi-squared tests can be used within baseline models to test whether a sampled distribution is consistent with a known distribution, or whether two distributions are independent of one another, when the distributions are either discrete or binned. Contingency tables summarize how one field and its categorical values (rows) relate to another field and its categorical values (columns). The baseline or null hypothesis is that there is no association between the field defining the rows and the field defining the columns. The alternate hypothesis is that there is such an association. If the fields are independent of each other, the distributions represented by the columns and rows will be statistically similar to the expected distributions obtained by summing over the values of the contingent field. A standard approach is to compute the chi-squared statistic for the table. The chi-squared statistic can then be used to find the associated p-value and determine whether the differences are statistically significant, i.e., whether the fields are not independent.

A chi-squared statistic may also be computed to compare a discrete distribution for an event sample with an expected distribution. The corresponding p-value indicates the probability, assuming the sample is drawn from the expected distribution, of observing a value of chi-squared at least as large as the one observed. A small p-value indicates that the observed distribution differs from the expected distribution.

Scoring Baseline models with chi-squared tests is done as follows. For a contingency table (with any number of rows or columns), one first computes row and column totals, as well as the total for the whole table. Given any cell in the table, the expected number for that cell is computed as follows:

expected number for cell = (row total * column total) / (total for table),

where the row total and column total are the row and column totals for that cell. Using the expected numbers, the chi-squared statistic is computed as follows:

chi-squared = sum (expected number - observed number)² / (expected number),

where the sum is over all cells in the table.

If the table has r rows and c columns, the number of degrees of freedom of the chi-squared statistic is (r-1)(c-1). The p-value is the probability that a chi-squared variable with this number of degrees of freedom exceeds the chi-squared value computed from the table. It is common to use confidence levels of 95%, 97.5%, or 99%, which correspond to rejecting the null hypothesis when the p-value falls below 0.05, 0.025, or 0.01, respectively.
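
The contingency-table calculation can be sketched in Python as follows. This is an illustrative sketch only: the function name chi_square_independence is hypothetical, the table is assumed to be a list of rows of observed counts with no zero row or column totals, and the 2x2 counts are made up for the example. Converting the statistic to a p-value requires the chi-squared survival function, which a statistics library can supply (for instance scipy.stats.chi2.sf, if SciPy is available).

```python
def chi_square_independence(table):
    """Return (chi-squared statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows, each a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (expected - observed) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Hypothetical 2x2 table of counts (illustrative data, not from the specification).
chi2, dof = chi_square_independence([[10, 20], [20, 10]])
print(round(chi2, 3), dof)  # prints: 6.667 1
```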

For comparing a sampled distribution with an expected distribution, the value of chi-squared is calculated directly from the observed and expected counts for each bin or discrete value. The expected distribution should be normalized to the total count within the observed distribution, and the number of degrees of freedom is the number of values or bins minus one.
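
The comparison of a sample with an expected distribution can be sketched in the same way. The function name chi_square_distribution is hypothetical, both inputs are assumed to list counts in the same bin order with no zero expected counts, and the counts below are made up for illustration.

```python
def chi_square_distribution(observed, expected):
    """Return (chi-squared statistic, degrees of freedom) for binned counts.

    The expected counts are rescaled so that their total matches the total
    of the observed counts before the statistic is computed.
    """
    scale = sum(observed) / sum(expected)
    chi2 = 0.0
    for obs, exp in zip(observed, expected):
        exp_scaled = exp * scale
        chi2 += (exp_scaled - obs) ** 2 / exp_scaled
    return chi2, len(observed) - 1


# Hypothetical counts: expected totals 200, observed totals 100,
# so the expected counts are rescaled to 25, 25, and 50.
chi2, dof = chi_square_distribution([20, 30, 50], [50, 50, 100])
print(chi2, dof)  # prints: 2.0 2
```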

These tests are available for Baseline models by setting the testStatistic attribute to either "chiSquareIndependence" or "chiSquareDistribution". For chiSquareIndependence, a contingency table is described to the model by providing references to the fields describing the rows and columns of the table. For chiSquareDistribution, a reference to a field describing the sampled distribution and a discrete distribution describing the expectations are required. A field reference to a sampled distribution can be provided using the Aggregate element.

The following example shows how a chi-squared test for comparing a sample to a distribution would be specified within a Baseline model. In this example, an aggregation is used to calculate the observed distribution.

<BaselineModel modelName="chisquared" functionName="regression">
  <MiningSchema>
    <MiningField name="obs" optype="continuous"/>
    <MiningField name="bin" optype="categorical"/>
    <MiningField name="score" optype="continuous" usageType="target"/>
  </MiningSchema>

  <LocalTransformations>
    <DerivedField name="obsDist" optype="continuous" dataType="integer">
      <Aggregate field="obs" function="sum" groupField="bin"/>
    </DerivedField>
  </LocalTransformations>

  <TestDistributions field="obsDist" testStatistic="chiSquareDistribution">
    <Baseline>
      <CountTable sample="262">
        <FieldValueCount field="bin" count="100" value="bin1"/>
        <FieldValueCount field="bin" count="150" value="bin2"/>
        <FieldValueCount field="bin" count="10" value="bin3"/>
        <FieldValueCount field="bin" count="2" value="bin4"/>
      </CountTable>
    </Baseline>
  </TestDistributions>

</BaselineModel>

An example of specifying scoring through a contingency table is provided below. Here, the fields from which the contingency table is formed, and from which the chi-squared statistic is calculated, are specified as FieldRef elements within the Baseline element.

<BaselineModel modelName="chisquared" functionName="regression">
  <MiningSchema>
    <MiningField name="Count" optype="continuous"/>
    <MiningField name="Animal" optype="categorical"/>
    <MiningField name="TimeOfDay" optype="categorical"/>
    <MiningField name="score" optype="continuous" usageType="target"/>
  </MiningSchema>

  <TestDistributions field="Count" testStatistic="chiSquareIndependence">
    <Baseline>
      <FieldRef field="Animal"/>
      <FieldRef field="TimeOfDay"/>
    </Baseline>
  </TestDistributions>

</BaselineModel>

Output elements. The PMML Output element can be used to return auxiliary information, such as derived fields that are needed for subsequent computations. For more information, please refer to the chapters on outputs.

Model Specification

The top level BaselineModel element follows the usual GeneralStructure conventions and contains a single TestDistributions element.

The content of the TestDistributions element depends on whether a continuous or discrete case is specified. It will generally contain one or more distribution specifications along with the test type and parameters used to produce the result. The role of each distribution specification is indicated by placing it within the Baseline element, indicating that it is the null model, or within the Alternate element, indicating that it is the alternate model.

<xs:element name="BaselineModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="ModelExplanation" minOccurs="0"/>
      <xs:element ref="Targets" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="TestDistributions"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string" use="optional"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string" use="optional"/>
    <xs:attribute name="isScorable" type="xs:boolean" use="optional" default="true"/>
  </xs:complexType>
</xs:element>

<xs:element name="TestDistributions">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Baseline"/>
      <xs:element ref="Alternate" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="testStatistic" type="BASELINE-TEST-STATISTIC" use="required"/>
    <xs:attribute name="resetValue" type="REAL-NUMBER" default="0.0" use="optional"/>
    <xs:attribute name="windowSize" type="INT-NUMBER" default="0" use="optional"/>
    <xs:attribute name="weightField" type="FIELD-NAME" use="optional"/>
    <xs:attribute name="normalizationScheme" type="xs:string" use="optional"/>
  </xs:complexType>
</xs:element>

<xs:simpleType name="BASELINE-TEST-STATISTIC">
  <xs:restriction base="xs:string">
    <xs:enumeration value="zValue"/>
    <xs:enumeration value="chiSquareIndependence"/>
    <xs:enumeration value="chiSquareDistribution"/>
    <xs:enumeration value="CUSUM"/>
    <xs:enumeration value="scalarProduct"/>
  </xs:restriction>
</xs:simpleType>

<xs:element name="Baseline">
  <xs:complexType>
    <xs:choice>
      <xs:group ref="CONTINUOUS-DISTRIBUTION-TYPES" minOccurs="1"/>
      <xs:group ref="DISCRETE-DISTRIBUTION-TYPES" minOccurs="1"/>
    </xs:choice>
  </xs:complexType>
</xs:element>

<xs:element name="Alternate">
  <xs:complexType>
    <xs:choice>
      <xs:group ref="CONTINUOUS-DISTRIBUTION-TYPES" minOccurs="1"/>
    </xs:choice>
  </xs:complexType>
</xs:element>

The field attribute specifies which field is used in consuming a baseline model.

The testStatistic attribute specifies what type of baseline test is to be calculated. If the value is "CUSUM", then an Alternate element is required; otherwise, the Alternate element is forbidden.

The resetValue attribute is only used if the testStatistic attribute is "CUSUM". This specifies the reset value used in the CUSUM formula.

The windowSize attribute is used to specify how much history the model uses. The default is to consider all data that the model has seen before the current record. This attribute has no effect when using a test statistic like "zValue", which does not use past values to compute a score.

The weightField attribute is only used if the testStatistic attribute is "scalarProduct". This specifies a mining field or derived field whose value is used to increment the observed count in the relevant bucket. If no weight field is provided then all records are given equal weight.

The normalizationScheme attribute is only used if the testStatistic attribute is "scalarProduct". This specifies how to normalize the scored data with the baseline data.
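
The attribute rules above are not all expressible in the schema, so a consumer may want to check them separately. The following Python sketch shows one way to do so with the standard xml.etree.ElementTree module. It is an illustration under stated assumptions: the function name check_test_distributions is hypothetical, and the fragment is parsed without an XML namespace, whereas elements in a complete PMML document are namespace-qualified and would need qualified names in the lookup.

```python
import xml.etree.ElementTree as ET


def check_test_distributions(elem):
    """Return a list of rule violations found in a TestDistributions element."""
    problems = []
    stat = elem.get("testStatistic")
    has_alternate = elem.find("Alternate") is not None
    # An Alternate element is required for CUSUM and forbidden otherwise.
    if stat == "CUSUM" and not has_alternate:
        problems.append("CUSUM requires an Alternate element")
    if stat != "CUSUM" and has_alternate:
        problems.append("Alternate is only allowed with CUSUM")
    # weightField and normalizationScheme are only used with scalarProduct.
    if stat != "scalarProduct":
        for attr in ("weightField", "normalizationScheme"):
            if elem.get(attr) is not None:
                problems.append(attr + " is only used with scalarProduct")
    return problems


fragment = """
<TestDistributions field="defects" testStatistic="zValue">
  <Baseline><GaussianDistribution mean="18.2" variance="17.64"/></Baseline>
</TestDistributions>
"""
print(check_test_distributions(ET.fromstring(fragment.strip())))  # prints: []
```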

Continuous Case

For the continuous case, there are one or two statistical distributions specified. If there is one distribution, it is assumed to be the baseline or null distribution and must be in the Baseline element. If there are two, one must be specified within an Alternate element. The field under test is specified in the TestDistributions element. See above for an example.

<xs:group name="CONTINUOUS-DISTRIBUTION-TYPES">
  <xs:sequence>
    <xs:choice>
      <xs:element ref="AnyDistribution"/>
      <xs:element ref="GaussianDistribution"/>
      <xs:element ref="PoissonDistribution"/>
      <xs:element ref="UniformDistribution"/>
    </xs:choice>
    <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:group>

<xs:element name="AnyDistribution">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="mean" type="REAL-NUMBER" use="required"/>
    <xs:attribute name="variance" type="REAL-NUMBER" use="required"/>
  </xs:complexType>
</xs:element>

<xs:element name="GaussianDistribution">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="mean" type="REAL-NUMBER" use="required"/>
    <xs:attribute name="variance" type="REAL-NUMBER" use="required"/>
  </xs:complexType>
</xs:element>

<xs:element name="PoissonDistribution">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="mean" type="REAL-NUMBER" use="required"/>
  </xs:complexType>
</xs:element>

<xs:element name="UniformDistribution">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="lower" type="REAL-NUMBER" use="required"/>
    <xs:attribute name="upper" type="REAL-NUMBER" use="required"/>
  </xs:complexType>
</xs:element>

Discrete Case

For the discrete case, rather than using statistical distributions, the CountTable and NormalizedCountTable are used to specify tables of counts. Each row and column in such a table is associated with a field in the MiningSchema or a derived field, and the cell values contain the corresponding counts or probabilities.

<xs:group name="DISCRETE-DISTRIBUTION-TYPES">
  <xs:choice>
    <xs:element ref="CountTable"/>
    <xs:element ref="NormalizedCountTable"/>
    <xs:element ref="FieldRef" minOccurs="2" maxOccurs="unbounded"/>
  </xs:choice>
</xs:group>

<xs:element name="CountTable" type="COUNT-TABLE-TYPE"/>

<xs:element name="NormalizedCountTable" type="COUNT-TABLE-TYPE"/>

<xs:complexType name="COUNT-TABLE-TYPE">
  <xs:sequence>
    <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    <xs:choice>
      <xs:element ref="FieldValue" minOccurs="1" maxOccurs="unbounded"/>
      <xs:element ref="FieldValueCount" minOccurs="1" maxOccurs="unbounded"/>
    </xs:choice>
  </xs:sequence>
  <xs:attribute name="sample" type="NUMBER" use="optional"/>
</xs:complexType>

<xs:element name="FieldValue">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:choice>
        <xs:element ref="FieldValue" minOccurs="1" maxOccurs="unbounded"/>
        <xs:element ref="FieldValueCount" minOccurs="1" maxOccurs="unbounded"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="value" use="required"/>
  </xs:complexType>
</xs:element>

<xs:element name="FieldValueCount">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="value" use="required"/>
    <xs:attribute name="count" type="NUMBER" use="required"/>
  </xs:complexType>
</xs:element>
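
To illustrate how nested FieldValue elements can describe a table over more than one field, the following Python sketch flattens a count table into a dictionary keyed by (field, value) pairs. It rests on an assumption not spelled out above, namely that each level of FieldValue nesting fixes the value of one field and that the innermost FieldValueCount elements carry the cell counts. The function name flatten_counts, the field names, and the counts are all hypothetical, and the fragment is parsed without an XML namespace.

```python
import xml.etree.ElementTree as ET


def flatten_counts(node, path=()):
    """Map each path of (field, value) pairs to the count stored at its leaf."""
    counts = {}
    for child in node:
        if child.tag not in ("FieldValue", "FieldValueCount"):
            continue  # skip Extension elements
        key = path + ((child.get("field"), child.get("value")),)
        if child.tag == "FieldValue":
            counts.update(flatten_counts(child, key))  # descend one level
        else:
            counts[key] = float(child.get("count"))
    return counts


# Hypothetical two-field table; the field names and counts are illustrative only.
table = ET.fromstring("""
<CountTable sample="60">
  <FieldValue field="Animal" value="cat">
    <FieldValueCount field="TimeOfDay" value="day" count="10"/>
    <FieldValueCount field="TimeOfDay" value="night" count="20"/>
  </FieldValue>
  <FieldValue field="Animal" value="owl">
    <FieldValueCount field="TimeOfDay" value="day" count="20"/>
    <FieldValueCount field="TimeOfDay" value="night" count="10"/>
  </FieldValue>
</CountTable>
""".strip())

for key, count in flatten_counts(table).items():
    print(key, count)
```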

More Examples

Example 5: Scoring a CUSUM model. Consider a sequence of values for a variable x. CUSUM allows one to make repeated decisions as to whether x is described by distribution D0 or D1. As an example, let D0 be a Gaussian distribution with mean 0.0 and variance 1.0, and let D1 be a Gaussian distribution with mean 1.0 and variance 1.0. This model is represented by the following PMML snippet.

<BaselineModel modelName="example-cusum" functionName="regression">
  <MiningSchema>
    <MiningField name="x" optype="continuous"/>
    <MiningField name="score" optype="continuous" usageType="target"/>
  </MiningSchema>

  <TestDistributions field="x" testStatistic="CUSUM" resetValue="0.0">
    <Baseline>
      <GaussianDistribution mean="0" variance="1"/>
    </Baseline>
    <Alternate>
      <GaussianDistribution mean="1" variance="1"/>
    </Alternate>
  </TestDistributions>

</BaselineModel>

The CUSUM score is defined as: Max(reset, previousScore + Log_e(D1(x)/D0(x)))

When the accumulated score would fall below the reset value (often set to 0), the score is set to the reset value instead. Higher scores indicate a greater likelihood of the 'true' distribution being D1. Each observation of x for which Log_e(D1(x)/D0(x)) is positive (in this example, x > 1/2) adds evidence that the true distribution is D1.

In the current example, suppose we observe the sequence of values x=-1, 0, 1/2, 1, 1, 1/2, -1.

The values of Log_e(D1(x)/D0(x)) would be: -3/2, -1/2, 0, 1/2, 1/2, 0, -3/2.

The scores would be: 0, 0, 0, 1/2, 1, 1, 0.
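
These scores can be reproduced with a short Python sketch. It is a minimal illustration that assumes both distributions are Gaussian and that the score before the first observation is 0; the function names gaussian_log_pdf and cusum_scores are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, variance):
    """Natural log of the Gaussian density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


def cusum_scores(xs, baseline, alternate, reset_value=0.0):
    """CUSUM scores for a sequence; each distribution is a (mean, variance) pair."""
    scores = []
    previous = 0.0  # score before the first observation
    for x in xs:
        # Log-likelihood ratio Log_e(D1(x)/D0(x)) for this observation.
        g = gaussian_log_pdf(x, *alternate) - gaussian_log_pdf(x, *baseline)
        previous = max(reset_value, previous + g)
        scores.append(previous)
    return scores


xs = [-1, 0, 0.5, 1, 1, 0.5, -1]
scores = cusum_scores(xs, baseline=(0.0, 1.0), alternate=(1.0, 1.0))
print([round(s, 6) for s in scores])  # prints: [0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 0.0]
```

Rounding is applied only when printing, since the log-density differences are computed in floating point.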
