## PMML 4.4 - Baseline Model

The **BaselineModel** in PMML allows for defining a change detection
model.

**Motivating Examples.** There are several different types of baseline,
change detection, hypothesis testing, and related models that are supported
by the PMML Baseline Model. We begin with several informal examples.

*Example 1: Change detection model with CUSUM statistic.* For the
first example, assume we have two Gaussian distributions, each characterized
by a mean and a standard deviation, one representing normal behavior and one
representing abnormal behavior. Given a stream of events, a score is computed
with each new event and this score is used to decide whether the stream of
events is likely to be from the baseline distribution or the second
distribution.

Notice that this may be viewed from the viewpoint of hypothesis testing by
considering the baseline distribution as the null hypothesis and the second
distribution as the alternate hypothesis. Given a stream of events the goal
is to determine as quickly as possible when events are occurring from the
alternate distribution.

A common score used for this purpose is the CUSUM defined as follows: Let
f0(x) and f1(x) be the density functions for the two Gaussian distributions,
let r be the reset value and let g(x) be the log odds ratio:

g(x) = log(f1(x) / f0(x))

r = 0

Given a stream of events with features x[0], x[1], x[2], ..., define
the CUSUM score by:

assume Z[-1]=0

Z[n] = max{r, Z[n-1] + g(x[n])}

This would be represented in PMML with the following fragment:

    <BaselineModel modelName="geo-cusum" functionName="regression">
      <MiningSchema>
        <MiningField name="congestion-score" optype="continuous"/>
        <MiningField name="cusum-score" optype="continuous" usageType="target"/>
      </MiningSchema>
      <TestDistributions field="congestion-score" testStatistic="CUSUM" resetValue="0.0">
        <Baseline>
          <GaussianDistribution mean="550.2" variance="48.2"/>
        </Baseline>
        <Alternate>
          <GaussianDistribution mean="460.4" variance="39.2"/>
        </Alternate>
      </TestDistributions>
    </BaselineModel>

Scoring with this type of change detection model proceeds as follows: for
each event in the sequence, the mining and/or derived fields are evaluated to
obtain the value of the `field` named in the **TestDistributions** element.
The indicated test (CUSUM in the example above) is then applied to that value
using the supplied parameter values.
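
For two Gaussian densities, the log odds ratio g(x) used in the CUSUM score has a closed form. The derivation below is a sketch; the symbols for the means and variances of the baseline (index 0) and alternate (index 1) distributions are notation introduced here, not part of the specification.

```latex
g(x) = \log\frac{f_1(x)}{f_0(x)}
     = \frac{1}{2}\log\frac{\sigma_0^2}{\sigma_1^2}
       + \frac{(x-\mu_0)^2}{2\sigma_0^2}
       - \frac{(x-\mu_1)^2}{2\sigma_1^2}
```

In the fragment above this corresponds to \mu_0 = 550.2, \sigma_0^2 = 48.2, \mu_1 = 460.4 and \sigma_1^2 = 39.2.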

*Example 2: Baseline models using standard scores.* As another
very basic example, if the baseline model is a single distribution, then the
value of a mining field can be converted to a standard score (also called a
z-value), and a simple threshold is used to determine whether the value is
close enough to the mean of the distribution.

    <BaselineModel modelName="standard-score" functionName="regression">
      <MiningSchema>
        <MiningField name="defects" optype="continuous"/>
        <MiningField name="score" optype="continuous" usageType="target"/>
      </MiningSchema>
      <Output>
        <OutputField name="alert" optype="categorical" dataType="string" feature="decision">
          <Apply function="if">
            <Apply function="greaterThan">
              <FieldRef field="score"/>
              <Constant dataType="double">1</Constant>
            </Apply>
            <!-- Then case -->
            <Constant dataType="string">True</Constant>
            <!-- Else case -->
            <Constant dataType="string">False</Constant>
          </Apply>
        </OutputField>
      </Output>
      <TestDistributions field="defects" testStatistic="zValue">
        <Baseline>
          <GaussianDistribution mean="18.2" variance="17.64"/>
        </Baseline>
      </TestDistributions>
    </BaselineModel>

Scoring with this type of model is very simple. Given an event with a
mining or derived field called "defects", the value of defects is converted
to a standard score or z-value using the formula

score = (defects - mean) / sqrt(variance)

For example, if defects is 24 then score = (24 - 18.2) / sqrt(17.64) =
5.8 / 4.2 ≈ 1.38. Because this exceeds the threshold of 1 used in the
**Output** element above, the "alert" output field is "True".
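
The z-value calculation and the threshold test from the **Output** element can be sketched in Python as follows. This is an illustrative sketch only: the helper name `z_value` is hypothetical, and the baseline parameters are copied from the "standard-score" model above.

```python
import math


def z_value(x: float, mean: float, variance: float) -> float:
    """Standard score of x under a Gaussian baseline with the given mean and variance."""
    return (x - mean) / math.sqrt(variance)


# Baseline parameters taken from the "standard-score" model above.
score = z_value(24, mean=18.2, variance=17.64)

# Mirrors the Output element: the alert is "True" when the score exceeds 1.
alert = "True" if score > 1 else "False"

print(f"score = {score:.2f}, alert = {alert}")  # prints: score = 1.38, alert = True
```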

*Example 3: Scalar product detects changes between distributions by
measuring distance between vectors.*

    <BaselineModel modelName="website-model" functionName="regression">
      <MiningSchema>
        <MiningField name="bin" optype="categorical"/>
        <MiningField name="score" optype="continuous" usageType="target"/>
      </MiningSchema>
      <TestDistributions field="bin" testStatistic="scalarProduct" weightField="cnt" normalizationScheme="Independent">
        <Baseline>
          <CountTable sample="262">
            <FieldValueCount field="bin" count="100" value="bin1"/>
            <FieldValueCount field="bin" count="150" value="bin2"/>
            <FieldValueCount field="bin" count="10" value="bin3"/>
            <FieldValueCount field="bin" count="2" value="bin4"/>
          </CountTable>
        </Baseline>
      </TestDistributions>
    </BaselineModel>

Scoring with this type of model is done as follows. Given an observed
distribution associating count Ci with vector coordinate (or bin) Vi and an
expected distribution with counts ci for the same respective bins, the scalar
product is simply the sum over i of Ci x ci/N. N is an optional normalization
factor implied by the `normalizationScheme` attribute of the
**TestDistributions** element. If the attribute is not specified, no
normalization factor is applied. If the `normalizationScheme`
attribute is set to "Independent", the normalization is defined by requiring
the scalar product of each vector with itself to be one. This is equivalent
to setting

N = sqrt(Ci x Ci) x sqrt(cj x cj),

(with implied index summation). When the vector contents are non-negative, an
independently normalized scalar product will vary from 0 (orthogonal vectors)
to 1 (identical vectors). In the example above, an observed vector consisting
of respective counts of 10, 20, 5, and 5 would yield a scalar product of
0.959, using independent normalization.
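
The independently normalized scalar product can be checked with a short Python sketch. The function name `normalized_scalar_product` is hypothetical; the baseline counts come from the CountTable above and the observed counts from the text.

```python
import math


def normalized_scalar_product(observed, expected):
    """Scalar product of two count vectors, each normalized to unit length."""
    dot = sum(o * e for o, e in zip(observed, expected))
    norm = math.sqrt(sum(o * o for o in observed)) * math.sqrt(sum(e * e for e in expected))
    if norm == 0:
        raise ValueError("both vectors must contain at least one non-zero count")
    return dot / norm


baseline_counts = [100, 150, 10, 2]  # bin1..bin4 from the CountTable
observed_counts = [10, 20, 5, 5]     # observed vector from the text

result = normalized_scalar_product(observed_counts, baseline_counts)
print(round(result, 3))  # prints: 0.959
```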

*Example 4: Contingency tables and comparisons to distributions using
chi-squared statistics and p-values.* Chi-squared tests can be used
within baseline models to test whether a sampled distribution is consistent
with a known distribution, or whether two distributions are independent of one
another, when the distributions are either discrete or binned. Contingency
tables summarize how one field and its categorical values (rows) relate to
another field and its categorical values (columns). The baseline or null
hypothesis is that there is no association between the field defining the rows
and the field defining the columns. The alternate hypothesis is that there is
such an association. If the fields are independent of each other, the
distributions represented by the columns and rows will be statistically
similar to the expected distributions obtained by summing over the values of
the contingent field. A standard approach is to compute the chi-squared
statistic for the table. The chi-squared statistic can then be used to find
the associated p-value and determine whether the differences are statistically
significant, i.e. whether the fields are not independent.

A chi-squared statistic may also be computed to compare a discrete
distribution for an event sample with an expected distribution. The
corresponding p-value indicates the probability, under the null hypothesis,
of observing a chi-squared value at least as large as the one observed. A
small p-value indicates that the observed distribution differs from the
expected distribution.

Scoring Baseline models with chi-squared tests is done as follows. For a
contingency table (with any number of rows or columns), one first computes
row and column totals, as well as the total for the whole table. Given any
cell in the table, the expected number for that cell is computed as
follows

expected number for cell = (row total * column total) / (total for table),

where the row total and column total are the row and column totals for that
cell. Using the expected numbers, the chi-squared statistic is computed as
follows:

chi-squared = sum ((expected number - observed number)^2 /
(expected number)),

where the sum is over all cells in the table.

If the table has r rows and c columns, the number of degrees of freedom of
the chi-squared statistic is (r-1)(c-1). The p-value is the probability that
a chi-squared variable with this many degrees of freedom exceeds the
chi-squared value computed from the table. It is common to use confidence
levels of 95%, 97.5%, or 99%, i.e. to treat the result as significant when
the p-value falls below 0.05, 0.025, or 0.01.
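
The contingency-table steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `chi_squared_independence` and the 2x2 table of counts are hypothetical, and every row and column total is assumed to be positive.

```python
def chi_squared_independence(table):
    """Return (chi-squared statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows, each a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (expected - observed) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical 2x2 table of counts (not taken from the specification).
counts = [[10, 20],
          [30, 40]]
statistic, dof = chi_squared_independence(counts)
print(f"chi-squared = {statistic:.4f}, degrees of freedom = {dof}")
# prints: chi-squared = 0.7937, degrees of freedom = 1
```

Turning the statistic into a p-value requires the chi-squared survival function, which is not in the Python standard library; SciPy's `scipy.stats.chi2.sf(statistic, dof)` is one option.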

For comparing a sampled distribution with an expected distribution, the
value of chi-squared is calculated directly from the observed and expected
counts for each bin or discrete value. The expected distribution should be
rescaled so that its total count equals that of the observed distribution,
and the number of degrees of freedom is the number of values or bins minus
one.
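
A sketch of the sample-versus-distribution calculation, under the assumption that every baseline count is positive. The function name `chi_squared_distribution` is hypothetical; the baseline counts reuse the CountTable from Example 3, and the observed counts (10, 20, 5, 5) are reused from that example purely for illustration.

```python
def chi_squared_distribution(observed, baseline):
    """Return (chi-squared statistic, degrees of freedom) for observed counts
    compared with baseline counts given for the same bins."""
    observed_total = sum(observed)
    baseline_total = sum(baseline)

    statistic = 0.0
    for obs, base in zip(observed, baseline):
        # Rescale the baseline count to the size of the observed sample.
        expected = base * observed_total / baseline_total
        statistic += (expected - obs) ** 2 / expected

    return statistic, len(observed) - 1


baseline_counts = [100, 150, 10, 2]  # CountTable counts for bin1..bin4
observed_counts = [10, 20, 5, 5]     # illustrative observed sample

statistic, dof = chi_squared_distribution(observed_counts, baseline_counts)
print(f"chi-squared = {statistic:.2f}, degrees of freedom = {dof}")
# prints: chi-squared = 82.27, degrees of freedom = 3
```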

These tests are available for Baseline models by specifying the
`testStatistic` attribute to be either "chiSquareIndependence" or
"chiSquareDistribution". For chiSquareIndependence, a contingency table is
described to the model by providing references to the fields describing the
rows and columns of the table. For chiSquareDistribution, a reference to a field
describing the sampled distribution and a discrete distribution describing
the expectations is required. A field reference to a sampled distribution can
be provided using the **Aggregate** element.

The following example shows how a
chi-squared test for comparing a sample to a distribution would be specified
within a Baseline model. In this example, an aggregation is used to calculate
the observed distribution.

    <BaselineModel modelName="chisquared" functionName="regression">
      <MiningSchema>
        <MiningField name="obs" optype="continuous"/>
        <MiningField name="bin" optype="categorical"/>
        <MiningField name="score" optype="continuous" usageType="target"/>
      </MiningSchema>
      <LocalTransformations>
        <DerivedField name="obsDist" optype="continuous" dataType="integer">
          <Aggregate field="obs" function="sum" groupField="bin"/>
        </DerivedField>
      </LocalTransformations>
      <TestDistributions field="obsDist" testStatistic="chiSquareDistribution">
        <Baseline>
          <CountTable sample="262">
            <FieldValueCount field="bin" count="100" value="bin1"/>
            <FieldValueCount field="bin" count="150" value="bin2"/>
            <FieldValueCount field="bin" count="10" value="bin3"/>
            <FieldValueCount field="bin" count="2" value="bin4"/>
          </CountTable>
        </Baseline>
      </TestDistributions>
    </BaselineModel>

An example of specifying scoring through a contingency table is provided
below. Here, the fields from which the contingency table should be formed and
the chi-squared calculated are specified within the **BaselineModel**
element.

    <BaselineModel modelName="chisquared" functionName="regression">
      <MiningSchema>
        <MiningField name="Count" optype="continuous"/>
        <MiningField name="Animal" optype="categorical"/>
        <MiningField name="TimeOfDay" optype="categorical"/>
        <MiningField name="score" optype="continuous" usageType="target"/>
      </MiningSchema>
      <TestDistributions field="Count" testStatistic="chiSquareIndependence">
        <Baseline>
          <FieldRef field="Animal"/>
          <FieldRef field="TimeOfDay"/>
        </Baseline>
      </TestDistributions>
    </BaselineModel>

*Output elements.* The PMML **Output** element can be used to
return auxiliary information, such as derived fields that are needed for
subsequent computations. For more information, please refer to the chapters
on outputs.

The top level **BaselineModel** element follows the usual GeneralStructure conventions and contains a
single **TestDistributions** element.

The content of the **TestDistributions** element depends on whether a
continuous or discrete case
is specified. It will generally contain one or more distribution
specifications along with the test type and parameters used to produce the
result. The role of each distribution specification is indicated by placing
it within the **Baseline** element, indicating that it is the null model,
or with the **Alternate** element, indicating the alternate model.

    <xs:element name="BaselineModel">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
          <xs:element ref="MiningSchema"/>
          <xs:element ref="Output" minOccurs="0"/>
          <xs:element ref="ModelStats" minOccurs="0"/>
          <xs:element ref="ModelExplanation" minOccurs="0"/>
          <xs:element ref="Targets" minOccurs="0"/>
          <xs:element ref="LocalTransformations" minOccurs="0"/>
          <xs:element ref="TestDistributions"/>
          <xs:element ref="ModelVerification" minOccurs="0"/>
          <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute name="modelName" type="xs:string" use="optional"/>
        <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
        <xs:attribute name="algorithmName" type="xs:string" use="optional"/>
        <xs:attribute name="isScorable" type="xs:boolean" use="optional" default="true"/>
      </xs:complexType>
    </xs:element>
    <xs:element name="TestDistributions">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
          <xs:element ref="Baseline"/>
          <xs:element ref="Alternate" minOccurs="0"/>
        </xs:sequence>
        <xs:attribute name="field" type="FIELD-NAME" use="required"/>
        <xs:attribute name="testStatistic" type="BASELINE-TEST-STATISTIC" use="required"/>
        <xs:attribute name="resetValue" type="REAL-NUMBER" default="0.0" use="optional"/>
        <xs:attribute name="windowSize" type="INT-NUMBER" default="0" use="optional"/>
        <xs:attribute name="weightField" type="FIELD-NAME" use="optional"/>
        <xs:attribute name="normalizationScheme" type="xs:string" use="optional"/>
      </xs:complexType>
    </xs:element>
    <xs:simpleType name="BASELINE-TEST-STATISTIC">
      <xs:restriction base="xs:string">
        <xs:enumeration value="zValue"/>
        <xs:enumeration value="chiSquareIndependence"/>
        <xs:enumeration value="chiSquareDistribution"/>
        <xs:enumeration value="CUSUM"/>
        <xs:enumeration value="scalarProduct"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:element name="Baseline">
      <xs:complexType>
        <xs:choice>
          <xs:group ref="CONTINUOUS-DISTRIBUTION-TYPES" minOccurs="1"/>
          <xs:group ref="DISCRETE-DISTRIBUTION-TYPES" minOccurs="1"/>
        </xs:choice>
      </xs:complexType>
    </xs:element>
    <xs:element name="Alternate">
      <xs:complexType>
        <xs:choice>
          <xs:group ref="CONTINUOUS-DISTRIBUTION-TYPES" minOccurs="1"/>
        </xs:choice>
      </xs:complexType>
    </xs:element>

The `field` attribute specifies the mining or derived field whose values
are tested when the baseline model is scored.

The `testStatistic` attribute specifies what type of baseline test
is to be calculated. If the value is "CUSUM" then an **Alternate** element
is required, otherwise the element is forbidden.

The `resetValue` attribute is only used if the
`testStatistic` attribute is "CUSUM". This specifies the reset value
used in the CUSUM formula.

The `windowSize` attribute is used to specify how much history the
model uses. The default is to consider all data that the model has seen
before the current record. This attribute has no effect when using a test
statistic like "zValue", which does not use past values to compute a
score.

The `weightField` attribute is only used if the
`testStatistic` attribute is "scalarProduct". This specifies a mining
field or derived field whose value is used to increment the observed count in
the relevant bucket. If no weight field is provided then all records are
given equal weight.

The `normalizationScheme` attribute is only used if the
`testStatistic` attribute is "scalarProduct". This specifies how to
normalize the scored data with the baseline data.

For the continuous case, there are one or two statistical distributions
specified. If there is one distribution, it is assumed to be the baseline or
null distribution and must be in the **Baseline** element. If there are
two, one must be specified within an **Alternate** element. The field
under test is specified in the **TestDistributions** element. See above for an example.

For the discrete case, rather than using statistical distributions, the
**CountTable** and **NormalizedCountTable** are used to specify tables
of counts. Each row and column in such a table is associated with a field in
the MiningSchema or a derived field, and the
cell values contain the corresponding counts or probabilities.

    <xs:group name="DISCRETE-DISTRIBUTION-TYPES">
      <xs:choice>
        <xs:element ref="CountTable"/>
        <xs:element ref="NormalizedCountTable"/>
        <xs:element ref="FieldRef" minOccurs="2" maxOccurs="unbounded"/>
      </xs:choice>
    </xs:group>
    <xs:element name="CountTable" type="COUNT-TABLE-TYPE"/>
    <xs:element name="NormalizedCountTable" type="COUNT-TABLE-TYPE"/>
    <xs:complexType name="COUNT-TABLE-TYPE">
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:choice>
          <xs:element ref="FieldValue" minOccurs="1" maxOccurs="unbounded"/>
          <xs:element ref="FieldValueCount" minOccurs="1" maxOccurs="unbounded"/>
        </xs:choice>
      </xs:sequence>
      <xs:attribute name="sample" type="NUMBER" use="optional"/>
    </xs:complexType>
    <xs:element name="FieldValue">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
          <xs:choice>
            <xs:element ref="FieldValue" minOccurs="1" maxOccurs="unbounded"/>
            <xs:element ref="FieldValueCount" minOccurs="1" maxOccurs="unbounded"/>
          </xs:choice>
        </xs:sequence>
        <xs:attribute name="field" type="FIELD-NAME" use="required"/>
        <xs:attribute name="value" type="xs:string" use="required"/>
      </xs:complexType>
    </xs:element>
    <xs:element name="FieldValueCount">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute name="field" type="FIELD-NAME" use="required"/>
        <xs:attribute name="value" type="xs:string" use="required"/>
        <xs:attribute name="count" type="NUMBER" use="required"/>
      </xs:complexType>
    </xs:element>

*Example 5: Scoring a CUSUM model:* Consider a sequence of values
for a variable x. CUSUM allows one to make repeated decisions as to whether x
is described by distribution D0 or D1. As an example, let D0 be a Gaussian
distribution with mean 0.0 and variance 1.0, and let D1 be a Gaussian
distribution with mean 1.0 and variance 1.0. This model is represented by the
following PMML snippet.

    <BaselineModel modelName="example-cusum" functionName="regression">
      <MiningSchema>
        <MiningField name="x" optype="continuous"/>
        <MiningField name="score" optype="continuous" usageType="target"/>
      </MiningSchema>
      <TestDistributions field="x" testStatistic="CUSUM" resetValue="0.0">
        <Baseline>
          <GaussianDistribution mean="0" variance="1"/>
        </Baseline>
        <Alternate>
          <GaussianDistribution mean="1" variance="1"/>
        </Alternate>
      </TestDistributions>
    </BaselineModel>

The CUSUM score is defined as: Max(reset, previousScore +
Log_e(D1(x)/D0(x)))

Whenever the accumulated score would fall below the reset value (often set to
0), the score is set to the reset value instead. Higher scores indicate a
greater likelihood of the 'true' distribution being D1. Each observation of x
for which Log_e(D1(x)/D0(x)) is positive adds evidence that the true
distribution is D1, while observations for which it is negative reduce the
score.

In the current example, suppose we observe the sequence of values x=-1, 0,
1/2, 1, 1, 1/2, -1.

For these two distributions, Log_e(D1(x)/D0(x)) simplifies to x - 1/2, so its
values would be: -3/2, -1/2, 0, 1/2, 1/2, 0, -3/2.

The scores would be: 0, 0, 0, 1/2, 1, 1, 0.
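
The sequence of scores can be reproduced with a short Python sketch. The function names are hypothetical, the score is assumed to start at 0 before the first observation, and `windowSize` is ignored (all history is used).

```python
import math


def gaussian_log_odds(x, mean0, var0, mean1, var1):
    """Log_e(D1(x)/D0(x)) for Gaussian distributions D0 and D1."""
    return (0.5 * math.log(var0 / var1)
            + (x - mean0) ** 2 / (2 * var0)
            - (x - mean1) ** 2 / (2 * var1))


def cusum_scores(xs, mean0, var0, mean1, var1, reset=0.0):
    """CUSUM score after each observation in xs."""
    scores = []
    previous = 0.0  # assumed starting score before the first observation
    for x in xs:
        previous = max(reset, previous + gaussian_log_odds(x, mean0, var0, mean1, var1))
        scores.append(previous)
    return scores


xs = [-1, 0, 0.5, 1, 1, 0.5, -1]
scores = cusum_scores(xs, mean0=0.0, var0=1.0, mean1=1.0, var1=1.0)
print(scores)  # prints: [0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 0.0]
```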