PMML 3.1 - Baseline Model (proposal version 5-2-7)
Motivating Examples. There are several different types of
baseline, change detection, hypothesis testing, and related models
that are supported by the PMML Baseline Model. We begin with
several informal examples.
Change detection model with CUSUM statistic.
For the first example, assume we have two Gaussian distributions, each
characterized by a mean and a standard deviation, one representing
normal and one representing abnormal behavior. Given a stream of
events, a score is computed with each new event and this score is used
to decide whether the stream of events is likely to be from the
baseline distribution or the second distribution.
Notice that this may be viewed from the viewpoint of hypothesis
testing by considering the baseline distribution as the null
hypothesis and the second distribution as the alternate hypothesis.
Given a stream of events, the goal is to determine as quickly as
possible when events are occurring from the alternate distribution.
A common score used for this purpose is the CUSUM statistic,
defined as follows.
Let f0(x) and f1(x) be the density functions for the baseline and
alternate Gaussian distributions, respectively, and let g(x) be the log odds ratio:
g(x) = log( f1(x) / f0(x) )
Given a stream of events with features x[0], x[1], x[2], ...,
define the CUSUM score by:
Z[-1] = 0
Z[n] = max{0, Z[n-1] + g(x[n])}
This would be represented in PMML with the following fragment:
<BaselineModel modelName="geo-cusum" functionName="baseline" >
<TestDistributions field="congestion-score"
testStatistic="CUSUM" testType="threshold" threshold="21.0" resetValue="0.0" >
<Baseline>
<GaussianDistribution mean="550.2" variance="48.2" />
</Baseline>
<Alternate>
<GaussianDistribution mean="460.4" variance="39.2" />
</Alternate>
</TestDistributions>
</BaselineModel>
Scoring with this type of change detection model can be done
as follows: given a sequence of events, the mining
and derived fields are evaluated to compute the
field named in the TestDistributions element.
Then the indicated test (CUSUM in the example
above) is performed with the parameter values supplied.
In the example above, if the CUSUM score exceeds the threshold of 21.0,
TRUE is generated; otherwise, FALSE is generated.
A reset value can be used to provide
an alternative to 0 in the formula above defining Z[n].
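The CUSUM recursion above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the proposal: the function names are hypothetical, and it assumes that the reset value replaces both the initial value Z[-1] and the 0 inside the max, which the text above does not state explicitly. It also assumes the score is not reset after the threshold is crossed.

```python
import math

def gaussian_log_density(x, mean, variance):
    """Log of the Gaussian density with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)

def cusum_scores(events, baseline, alternate, threshold, reset_value=0.0):
    """Return a list of (Z[n], alarm) pairs, one per event.

    baseline and alternate are (mean, variance) pairs.
    Assumption: reset_value stands in for the 0 in Z[-1] = 0 and in max{0, ...}.
    """
    z = reset_value  # plays the role of Z[-1]
    results = []
    for x in events:
        # g(x) = log(f1(x) / f0(x)), computed as a difference of log densities
        g = gaussian_log_density(x, *alternate) - gaussian_log_density(x, *baseline)
        z = max(reset_value, z + g)
        results.append((z, z > threshold))
    return results

# Parameters taken from the "geo-cusum" fragment above; the events are made up.
results = cusum_scores([550.0, 551.0, 460.0],
                       baseline=(550.2, 48.2),
                       alternate=(460.4, 39.2),
                       threshold=21.0)
for z, alarm in results:
    print(round(z, 3), alarm)
```

Working with log densities rather than the densities themselves avoids underflow when an event lies far from one of the two means.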
Change detection model with generalized likelihood ratio (GLR) statistic.
Often there is less knowledge in change detection applications about the
alternate distribution. For example, its mean and standard deviation may be unknown
or very difficult to estimate. In this case, the generalized likelihood ratio (GLR)
test can be used.
This would be represented in PMML with the following fragment:
<BaselineModel modelName="geo-glr" functionName="baseline" >
<TestDistributions field="congestion-score"
testStatistic="GLR" testType="threshold" threshold="12.8" resetValue="0.0" >
<Baseline>
<GaussianDistribution mean="550.2" variance="48.2" />
</Baseline>
</TestDistributions>
</BaselineModel>
Threshold break models using standard scores. For another
very basic example, if the baseline model is a single distribution,
then the value of a mining field can be converted to a standard score
(also called a z-value) and a simple threshold used to
determine whether the field is close enough to the mean of the
distribution.
<BaselineModel modelName="three-sigma-threshold" functionName="baseline" >
<TestDistributions field="defects"
testStatistic="zValue" testType="threshold" threshold="3.0" >
<Baseline>
<GaussianDistribution mean="18.2" variance="4.2" />
</Baseline>
</TestDistributions>
</BaselineModel>
Scoring with this type of model is very simple. Given an event with a mining
or derived field called "defects," the value of defects is converted to a standard
score or z-value using the formula
standard-defects = (defects - mean) / sqrt (variance)
and if standard-defects is greater than or equal to 3.0 the model returns
TRUE, otherwise the model returns FALSE.
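As a minimal sketch of this scoring rule (the function name is hypothetical and not defined by the proposal), the conversion and comparison could look like this in Python. It follows the text literally and applies a one-sided test, so large negative deviations do not trigger TRUE.

```python
import math

def z_value_test(value, mean, variance, threshold):
    """Return (z, result), where result is True when z >= threshold."""
    z = (value - mean) / math.sqrt(variance)
    return z, z >= threshold

# Parameters from the "three-sigma-threshold" fragment; the defect counts are made up.
for defects in (20.0, 25.0):
    z, result = z_value_test(defects, mean=18.2, variance=4.2, threshold=3.0)
    print(defects, round(z, 3), result)
```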
Contingency table models using chi-squared statistics and
p-values. For categorical variables, contingency tables
summarize how one field and its categorical values (rows) relate to another
field and its categorical values (columns). The baseline or null
hypothesis is that there is no association between the field defining
the rows and the field defining the columns. The alternate hypothesis
is that there is such an association. A standard approach is to compute
the chi-squared statistic for the table and then to use a p-value to
determine whether the chi-squared statistic is statistically significant.
<BaselineModel modelName="merchant-name-populated-model" functionName="baseline" >
<TestDistributions testStatistic="chiSquared" testType="twoSidedPValue" threshold="95.0" >
<CountTable>
<FieldValueCounts field="merchant_name_populated">
<FieldValueCount field="payment_accepted" count="41"/>
<FieldValueCount field="payment_declined" count="81"/>
</FieldValueCounts>
<FieldValueCounts field="merchant_name_missing">
<FieldValueCount field="payment_accepted" count="12"/>
<FieldValueCount field="payment_declined" count="30"/>
</FieldValueCounts>
</CountTable>
</TestDistributions>
</BaselineModel>
Scoring with this type of model is done as follows.
Given such a table (with any number of rows or columns), one first computes row
and column totals, as well as the total for the whole table.
Given any cell in the table, the expected number for that
cell is computed as follows
expected number for cell = (row total * column total) / (total for table),
where the row total and column total are the row and column totals for that cell.
Using the expected numbers, the chi-squared statistic is computed as follows:
chi-squared = sum (expected number - observed number)^2 / (expected number),
where the sum is over all cells in the table.
If the table has r rows and c columns, the number of degrees of freedom of the
chi-squared statistic is defined as (r-1)(c-1). The p-value is the probability
that a chi-squared random variable with this many degrees of freedom exceeds the
chi-squared value computed from the table. It is common to use thresholds
of 95%, 97.5%, or 99%.
An alternative test could have simply specified:
<TestDistributions testStatistic="chiSquared" testType="threshold" threshold="2.1" >
In this case, the chi-squared statistic for the contingency table would have been
computed as described above, and a TRUE returned if the computed chi-squared is equal
to or above the threshold; otherwise, a FALSE is returned.
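The chi-squared computation described above can be sketched in Python as follows, using the counts from the example table. The function names are hypothetical. Because the Python standard library has no general chi-squared distribution function, the p-value helper below covers only the case of one degree of freedom (a 2 x 2 table), where the upper-tail probability equals erfc(sqrt(chi-squared / 2)); larger tables would need a statistics library. How the p-value is compared with a threshold such as 95.0 is not spelled out in the text above, so the sketch stops at the p-value itself.

```python
import math

def chi_squared(table):
    """Return (statistic, degrees_of_freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (expected - observed) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def p_value_one_dof(stat):
    """Upper-tail probability of a chi-squared variable with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Rows: merchant_name_populated, merchant_name_missing.
# Columns: payment_accepted, payment_declined.
table = [[41, 81],
         [12, 30]]
stat, dof = chi_squared(table)
print(stat, dof, p_value_one_dof(stat))

# Threshold variant: TRUE when the statistic is equal to or above the threshold.
print(stat >= 2.1)
```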
Target and Output elements.
The PMML Target element is used by the Baseline Model to return the result computed
by the model, while the PMML Output element can be used to return auxiliary information,
such as derived fields that are needed for subsequent computations.
The top level BaselineModel element follows the
usual GeneralStructure conventions
and contains a single TestDistributions element.
The content of the TestDistributions element depends on whether a
continuous or
discrete case is specified.
It will generally contain one or more distribution specifications
along with the test type and parameters used to produce the result.
The role of each distribution specification is indicated by placing
it within the Baseline element, indicating that it is the null model,
or within the Alternate element, indicating that it is the alternate model.
The default role is Baseline.
<xs:element name="BaselineModel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="MiningSchema"/>
<xs:element ref="Output" minOccurs="0" />
<xs:element ref="ModelStats" minOccurs="0"/>
<xs:element ref="Targets" minOccurs="0" />
<xs:element ref="LocalTransformations" minOccurs="0" />
<xs:element ref="TestDistributions"/>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="modelName" type="xs:string" use="optional" />
<xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
<xs:attribute name="algorithmName" type="xs:string" use="optional" />
</xs:complexType>
</xs:element>
<xs:element name="TestDistributions">
<xs:complexType>
<xs:sequence>
<xs:choice>
<xs:group ref="CONTINUOUS-DISTRIBUTION-TYPES" minOccurs="1" maxOccurs="2"/>
<xs:group ref="DISCRETE-DISTRIBUTION-TYPES" minOccurs="1" maxOccurs="unbounded"/>
</xs:choice>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="field" type="FIELD-NAME" use="optional" />
<xs:attribute name="testStatistic" type="BASELINE-TEST-STATISTIC" use="optional" />
<xs:attribute name="testType" type="BASELINE-TEST-TYPE" default="threshold" use="optional" />
<xs:attribute name="threshold" type="REAL-NUMBER" use="required" />
<xs:attribute name="resetValue" type="REAL-NUMBER" default="0.0" use="optional" />
</xs:complexType>
</xs:element>
<xs:simpleType name="BASELINE-TEST-STATISTIC">
<xs:restriction base="xs:string">
<xs:enumeration value="count" />
<xs:enumeration value="zValue" />
<xs:enumeration value="chiSquared" />
<xs:enumeration value="fisher" />
<xs:enumeration value="fisherExact" />
<xs:enumeration value="yatesContinuityCorrection" />
<xs:enumeration value="CUSUM" />
<xs:enumeration value="GLR" />
<xs:enumeration value="logOddsRatio" />
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="BASELINE-TEST-TYPE">
<xs:restriction base="xs:string">
<xs:enumeration value="threshold" />
<xs:enumeration value="singleSidedPValue" />
<xs:enumeration value="twoSidedPValue" />
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="ROLE-TYPE">
<xs:restriction base="xs:string">
<xs:enumeration value="baseline" />
<xs:enumeration value="alternate" />
</xs:restriction>
</xs:simpleType>
For the continuous case, one or two statistical distributions are
specified. If there is one distribution, it is assumed to be the baseline
or null distribution and may be placed in a Baseline element.
If there are two, one must be specified within an Alternate element.
The field under test is specified on the TestDistributions element.
A test statistic and test type should be specified.
See above for an example.
<xs:group name="CONTINUOUS-DISTRIBUTION-TYPES">
<xs:sequence>
<xs:choice>
<xs:element ref="AnyDistribution"/>
<xs:element ref="GaussianDistribution"/>
<xs:element ref="PoissonDistribution"/>
<xs:element ref="UniformDistribution"/>
</xs:choice>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:group>
<xs:element name="AnyDistribution">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="mean" type="REAL-NUMBER" use="required" />
<xs:attribute name="variance" type="REAL-NUMBER" use="required" />
</xs:complexType>
</xs:element>
<xs:element name="GaussianDistribution">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="mean" type="REAL-NUMBER" use="required" />
<xs:attribute name="variance" type="REAL-NUMBER" use="required" />
</xs:complexType>
</xs:element>
<xs:element name="PoissonDistribution">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="mean" type="REAL-NUMBER" use="required" />
</xs:complexType>
</xs:element>
<xs:element name="UniformDistribution">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="lower" type="REAL-NUMBER" use="required" />
<xs:attribute name="upper" type="REAL-NUMBER" use="required" />
</xs:complexType>
</xs:element>
For the discrete case, rather than using statistical distributions,
the CountTable and NormalizedCountTable
are used to specify tables of counts.
Each row and column in such a table is associated with a field in the
MiningSchema or a derived field, and the cell values
contain the corresponding counts or probabilities.
If there is a single table, it is interpreted as the baseline
table. If there are two tables, one can be specified as the
baseline and one as the alternate.
<xs:group name="DISCRETE-DISTRIBUTION-TYPES">
<xs:choice>
<xs:element ref="CountTable"/>
<xs:element ref="NormalizedCountTable"/>
<xs:element ref="HistogramTable"/>
</xs:choice>
</xs:group>
<xs:element name="CountTable" type="COUNT-TABLE-TYPE" />
<xs:element name="NormalizedCountTable" type="COUNT-TABLE-TYPE" />
<xs:complexType name="COUNT-TABLE-TYPE">
<xs:sequence>
<xs:element ref="FieldValueCounts" minOccurs="1" maxOccurs="unbounded"/>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:element name="FieldValueCounts">
<xs:complexType>
<xs:sequence>
<xs:element ref="FieldValueCount" minOccurs="1" maxOccurs="unbounded"/>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="field" type="FIELD-NAME" use="required" />
<xs:attribute name="value" use="optional" />
</xs:complexType>
</xs:element>
<xs:element name="FieldValueCount">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="field" type="FIELD-NAME" use="required" />
<xs:attribute name="value" use="optional" />
<xs:attribute name="count" type="NUMBER" use="required" />
</xs:complexType>
</xs:element>
Some discrete case examples follow:
<!-- ================================================================== -->
<!-- Discrete case, single table with threshold test -->
<!-- in this case, if the count of any cell exceeds the threshold, TRUE is returned. -->
<TestDistributions testStatistic="count" testType="threshold" threshold="21.0" >
<CountTable>
<FieldValueCounts field="merchant_name_populated">
<FieldValueCount field="payment_accepted" count="41"/>
<FieldValueCount field="payment_declined" count="81"/>
</FieldValueCounts>
<FieldValueCounts field="merchant_name_missing">
<FieldValueCount field="payment_accepted" count="12"/>
<FieldValueCount field="payment_declined" count="30"/>
</FieldValueCounts>
</CountTable>
</TestDistributions>
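As a rough sketch of how a consumer might score the fragment above, the following Python reads the threshold and the cell counts with the standard library XML parser and returns TRUE if any cell count exceeds the threshold. The function name is hypothetical, and the sketch assumes the fragment carries no XML namespace; a complete PMML document normally declares one, in which case the element lookups would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<TestDistributions testStatistic="count" testType="threshold" threshold="21.0">
  <CountTable>
    <FieldValueCounts field="merchant_name_populated">
      <FieldValueCount field="payment_accepted" count="41"/>
      <FieldValueCount field="payment_declined" count="81"/>
    </FieldValueCounts>
    <FieldValueCounts field="merchant_name_missing">
      <FieldValueCount field="payment_accepted" count="12"/>
      <FieldValueCount field="payment_declined" count="30"/>
    </FieldValueCounts>
  </CountTable>
</TestDistributions>
"""

def count_threshold_test(fragment):
    """Return True if the count of any cell exceeds the threshold attribute."""
    root = ET.fromstring(fragment)
    threshold = float(root.get("threshold"))
    counts = [float(cell.get("count")) for cell in root.iter("FieldValueCount")]
    return any(count > threshold for count in counts)

print(count_threshold_test(FRAGMENT))
```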