PMML 4.0 - Regression
The regression functions are used to determine the relationship between the
dependent variable (target field) and one or more independent variables.
The dependent variable is the one whose values you want to predict,
whereas the independent variables are the variables that you base your prediction on.
While the term regression usually refers to the prediction of numeric values,
the PMML element RegressionModel can also be used for classification.
This is due to the fact that multiple regression equations can be combined
in order to predict categorical values.
If the attribute functionName is regression then the model
is used for the prediction of a numeric value in a continuous domain.
These models should contain exactly one regression table.
If the attribute functionName is classification then the model
is used to predict a category.
These models should contain exactly one regression table for each
targetCategory. The normalizationMethod describes
how the prediction is converted into a confidence value (aka probability).
For simple regression with functionName='regression', the formula is:
Dependent variable = intercept + Sum_i ( coefficient_i * independent variable_i ) + error
Classification models can have multiple regression equations.
With n classes/categories there are n equations of the form
y_j = intercept_j + Sum_i ( coefficient_ji * independent variable_i )
A confidence/probability value for category j can be computed by the softmax function
p_j = exp(y_j) / ( Sum[i in 1..n]( exp(y_i) ) )
Another method, called simplemax, uses a simple quotient
p_j = y_j / ( Sum[i in 1..n]( y_i ) )
These confidence values are similar to statistical probabilities, but they only
mimic probabilities by post-processing the values y_i.
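As an illustration, here is a minimal Python sketch of the softmax and simplemax post-processing described above; the scores in ys are hypothetical.
import math

def softmax(ys):
    # p_j = exp(y_j) / Sum_i exp(y_i)
    exps = [math.exp(y) for y in ys]
    total = sum(exps)
    return [e / total for e in exps]

def simplemax(ys):
    # p_j = y_j / Sum_i y_i (only meaningful when all y_i are non-negative)
    total = sum(ys)
    return [y / total for y in ys]

ys = [2.0, 1.0, 0.5]          # hypothetical raw scores for three categories
print(softmax(ys))            # approx [0.63, 0.23, 0.14]
print(simplemax(ys))          # approx [0.57, 0.29, 0.14]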
Binary logistic regression is a special case with
y = intercept + Sum_i ( coefficient_i * independent variable_i )
p = 1/(1+exp(-y))
It can be implemented as a model
<RegressionModel functionName="regression" normalizationMethod="softmax" ...
<RegressionTable targetCategory="YES"
...
with only one RegressionTable. In that case the numeric predicted value p
represents the probability that the target category is YES.
Note that this model is still a plain numeric prediction model.
It is not a classification model even if the regression table
specifies a target category.
A drawback of this representation is that the predicted value
does not explicitly refer to all target categories
that were used while training the model.
A better implementation would be
<RegressionModel functionName="classification" normalizationMethod="softmax" ...
<RegressionTable targetCategory="YES" ...
<RegressionTable targetCategory="NO" intercept="0.0"
...
In this case the model is a true classification model.
Note that p = 1/(1+exp(-y)) is equivalent to p = exp(y)/(exp(y)+exp(0.0)).
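This equivalence can be checked numerically; a small Python sketch with a hypothetical score y:
import math

def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))

def softmax_yes_no(y_yes, y_no=0.0):
    # softmax over the "YES" score and a trivial "NO" table with intercept 0.0
    return math.exp(y_yes) / (math.exp(y_yes) + math.exp(y_no))

y = 1.7  # hypothetical result of the "YES" RegressionTable
print(logistic(y), softmax_yes_no(y))  # both print the same probability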
The XML Schema for RegressionModel
<xs:element name="RegressionModel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="MiningSchema" />
<xs:element ref="Output" minOccurs="0" />
<xs:element ref="ModelStats" minOccurs="0" />
<xs:element ref="ModelExplanation" minOccurs="0"/>
<xs:element ref="Targets" minOccurs="0" />
<xs:element ref="LocalTransformations" minOccurs="0" />
<xs:element ref="RegressionTable" maxOccurs="unbounded" />
<!-- optionally, specification of the decision logic -->
<xs:element ref="ModelVerification" minOccurs="0" />
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="modelName" type="xs:string" />
<xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
<xs:attribute name="algorithmName" type="xs:string" />
<xs:attribute name="modelType" use="optional">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="linearRegression" />
<xs:enumeration value="stepwisePolynomialRegression" />
<xs:enumeration value="logisticRegression" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="targetFieldName" type="FIELD-NAME" use="optional" />
<xs:attribute name="normalizationMethod" type="REGRESSIONNORMALIZATIONMETHOD" default="none"/>
</xs:complexType>
</xs:element>
<xs:simpleType name="REGRESSIONNORMALIZATIONMETHOD">
<xs:restriction base="xs:string">
<xs:enumeration value="none" />
<xs:enumeration value="simplemax" />
<xs:enumeration value="softmax" />
<xs:enumeration value="logit" />
<xs:enumeration value="probit" />
<xs:enumeration value="cloglog" />
<xs:enumeration value="exp" />
<xs:enumeration value="loglog" />
<xs:enumeration value="cauchit" />
</xs:restriction>
</xs:simpleType>
Definitions
- RegressionModel: The root element of an XML regression model. Each instance of a regression model must start with this element.
- modelName: This is a unique identifier specifying the name of the regression model.
- functionName: Can be regression or classification.
- algorithmName: Can be any string describing the algorithm that was used while creating the model.
- modelType: Specifies the type of a regression model. The attribute modelType is for information only. It has been changed to optional and its usage is deprecated. Use functionName and normalizationMethod to define the computation, and algorithmName to give further optional information.
- targetFieldName: The name of the target field (also called the dependent variable). The attribute targetFieldName is for information only. It has been changed to optional and its usage is deprecated. Use usageType="predicted" in MiningField instead.
<xs:element name="RegressionTable">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="NumericPredictor" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="CategoricalPredictor" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="PredictorTerm" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="intercept" type="REAL-NUMBER" use="required" />
<xs:attribute name="targetCategory" type="xs:string" />
</xs:complexType>
</xs:element>
- RegressionTable: A table that lists the values of all predictors or independent variables. If the model is used to predict a numerical field, then there is only one RegressionTable and the attribute targetCategory may be missing. If the model is used to predict a categorical field, then there are two or more RegressionTables and each one must have the attribute targetCategory defined with a unique value.
<xs:element name="NumericPredictor">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="name" type="FIELD-NAME" use="required" />
<xs:attribute name="exponent" type="INT-NUMBER" default="1" />
<xs:attribute name="coefficient" type="REAL-NUMBER" use="required" />
</xs:complexType>
</xs:element>
<xs:element name="CategoricalPredictor">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="name" type="FIELD-NAME" use="required" />
<xs:attribute name="value" type="xs:string" use="required" />
<xs:attribute name="coefficient" type="REAL-NUMBER" use="required" />
</xs:complexType>
</xs:element>
<xs:element name="PredictorTerm">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="FieldRef" minOccurs="1" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="coefficient" type="REAL-NUMBER" use="required" />
</xs:complexType>
</xs:element>
- NumericPredictor: Defines a numeric independent variable. The list of valid attributes comprises the name of the variable, the exponent to be used, and the coefficient by which the values of this variable must be multiplied. Note that the exponent defaults to 1, hence it need not always be specified.
- CategoricalPredictor: Defines a categorical independent variable. The list of attributes comprises the name of the variable, the value attribute, and the coefficient by which the values of this variable must be multiplied. To do a regression analysis with categorical values, some means must be applied to enable calculations. If the specified value of an independent variable occurs, the term variable_name(value) is replaced with 1, so the coefficient is multiplied by 1. If the value does not occur, the term variable_name(value) is replaced with 0, so that the product coefficient * variable_name(value) yields 0 and is ignored in the ongoing analysis. If the input value is missing, then variable_name(v) yields 0 for any v.
- PredictorTerm: Contains one or more fields that are combined by multiplication. That is, this element supports interaction terms. The type of all fields referenced within PredictorTerm must be continuous. The content of PredictorTerm might be extended to a sequence of arbitrary expressions; this feature is not yet needed. (A small evaluation sketch follows this list.)
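As a rough illustration of how the three predictor kinds combine, here is a Python sketch that evaluates one RegressionTable over a dict-shaped record; the tuple encodings and the function name are purely illustrative and not PMML syntax.
def eval_regression_table(record, intercept, numeric=(), categorical=(), terms=()):
    y = intercept
    # NumericPredictor: coefficient * value ** exponent
    for name, exponent, coefficient in numeric:
        y += coefficient * record[name] ** exponent
    # CategoricalPredictor: coefficient * indicator(record[name] == value);
    # a missing input makes the indicator 0 for every value
    for name, value, coefficient in categorical:
        y += coefficient * (1 if record.get(name) == value else 0)
    # PredictorTerm: coefficient * product of the referenced (continuous) fields
    for coefficient, fields in terms:
        product = 1.0
        for f in fields:
            product *= record[f]
        y += coefficient * product
    return y

# Hypothetical record, using the insurance example from later in this document
record = {"age": 30.0, "salary": 50000.0, "car_location": "carpark"}
print(eval_regression_table(
    record,
    intercept=132.37,
    numeric=[("age", 1, 7.1), ("salary", 1, 0.01)],
    categorical=[("car_location", "carpark", 41.1), ("car_location", "street", 325.03)],
))  # 886.47 (modulo float rounding)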
Valid combinations
functionName   | normalizationMethod | number of RegressionTable elements | result
regression     | none                | 1   | predictedValue = y1
regression     | softmax, logit      | 1   | predictedValue = 1/(1+exp(-y1))
regression     | exp                 | 1   | predictedValue = exp(y1)
regression     | other               | 1   | ERROR
regression     | any                 | >1  | ERROR
classification | any                 | any | apply normalizationMethod to y1 .. yn
Note that an (abnormal) classification model may have just one RegressionTable.
In that case the predicted class is constant and the confidence is 1.0.
How to compute p_j := probability of target = Value_j
Let y_j be the result of evaluating the formula in the jth RegressionTable.
If one or more of the y_j cannot be evaluated because the value in one of the
referenced fields is missing, then the following formulas do not apply.
In that case the predictions are defined by the priorProbability values in the
Target element.
- softmax, categorical: see above, p_j = exp(y_j) / ( Sum[i in 1..n]( exp(y_i) ) )
- logit, categorical: see above, p_j = 1 / ( 1 + exp( -y_j ) )
- probit, categorical: p_j = integral(from -∞ to y_j)( 1/sqrt(2*π) ) exp( -0.5*u*u ) du, e.g., F(10) ≈ 1
- cloglog, categorical: p_j = 1 - exp( -exp( y_j ) )
- loglog, categorical: p_j = exp( -exp( -y_j ) )
- cauchit, categorical: p_j = 0.5 + (1/π) arctan( y_j )
- softmax, ordinal: F(y_j) = exp(y_j) / ( Sum[i in 1..n]( exp(y_i) ) ); p_1 = F(y_1), p_j = F(y_j) - F(y_{j-1}) for j ≥ 2
- logit, ordinal: inverse of the logit function: F(y) = 1/(1+exp(-y)), e.g., F(15) ≈ 1; p_1 = F(y_1), p_j = F(y_j) - F(y_{j-1}) for j ≥ 2
- probit, ordinal: inverse of the probit function: F(y) = integral(from -∞ to y)( 1/sqrt(2*π) ) exp( -0.5*u*u ) du, e.g., F(10) ≈ 1; p_1 = F(y_1), p_j = F(y_j) - F(y_{j-1}) for j ≥ 2
- cloglog, ordinal: inverse of the cloglog function: F(y) = 1 - exp( -exp(y) ), e.g., F(4) ≈ 1; p_1 = F(y_1), p_j = F(y_j) - F(y_{j-1}) for j ≥ 2
- loglog, ordinal: inverse of the loglog function: F(y) = exp( -exp(-y) ); p_1 = F(y_1), p_j = F(y_j) - F(y_{j-1}) for j ≥ 2
- cauchit, ordinal: inverse of the cauchit function: F(y) = 0.5 + (1/π) arctan(y); p_1 = F(y_1), p_j = F(y_j) - F(y_{j-1}) for j ≥ 2
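The categorical variants listed above can be sketched directly in Python; math.erf is used here for the standard normal CDF, which is an implementation choice and not prescribed by PMML, and the scores in ys are hypothetical.
import math

def normal_cdf(y):
    # standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def normalize_categorical(ys, method):
    if method == "softmax":
        exps = [math.exp(y) for y in ys]
        return [e / sum(exps) for e in exps]
    if method == "logit":
        return [1.0 / (1.0 + math.exp(-y)) for y in ys]
    if method == "probit":
        return [normal_cdf(y) for y in ys]
    if method == "cloglog":
        return [1.0 - math.exp(-math.exp(y)) for y in ys]
    if method == "loglog":
        return [math.exp(-math.exp(-y)) for y in ys]
    if method == "cauchit":
        return [0.5 + math.atan(y) / math.pi for y in ys]
    raise ValueError("unsupported method: " + method)

ys = [1.2, -0.3, 0.0]   # hypothetical y_j values
for m in ("softmax", "logit", "probit", "cloglog", "loglog", "cauchit"):
    print(m, normalize_categorical(ys, m))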
Comments on exp:
The exp normalizationMethod is frequently used in
statistical models for predicting non-negative target variables,
such as Poisson regression which is used in sales forecasting,
queuing models, insurance risk models, etc., to predict counts per unit
time (e.g., daily sales volumes, hourly service requests,
quarterly insurance claim filings, etc.).
Comments on probit: The area under the standard normal curve
corresponds to probability, specifically the probability
of finding an observation less than a given Z value.
The total area under the curve, from -∞ to +∞, is 1.0.
For Z = 0, the area above Z is 0.5, i.e., half the curve lies below the mean.
For Z = 1.0, the area above Z is 0.1587, i.e., about 84% of the data lies below (mean + 1 sd).
t | F(t)
1 | 0.84134474606854
2 | 0.97724986805182
3 | 0.99865010196837
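The tabulated F(t) values can be reproduced with the standard normal CDF, sketched here in Python via math.erf:
import math

def normal_cdf(t):
    # area under the standard normal curve from -infinity to t
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

for t in (1, 2, 3):
    print(t, normal_cdf(t))   # 0.8413..., 0.9772..., 0.9986...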
For ordinal targets: Suppose y_i is the result of the ith RegressionTable.
Apply the inverse link function to obtain the cumulative probability.
For the last (which is the so-called trivial) RegressionTable,
the intercept should be a "large" number so that after
applying the inverse link function, you obtain a cumulative probability of 1.
The individual probability of each category is calculated by subtracting the
cumulative probability of the previous category from the cumulative
probability of the current category.
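A minimal Python sketch of this cumulative-difference scheme for an ordinal target, using hypothetical scores and the logit inverse link:
import math

def ordinal_probabilities(ys, inverse_link):
    # cumulative probabilities from the inverse link function
    cumulative = [inverse_link(y) for y in ys]
    # individual probabilities: first value, then successive differences
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs

inverse_logit = lambda y: 1.0 / (1.0 + math.exp(-y))
# the last ("trivial") RegressionTable has a large intercept, so F is essentially 1
print(ordinal_probabilities([-1.0, 0.5, 15.0], inverse_logit))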
Examples
The following regression formula is used to predict the
number of insurance claims:
number_of_claims = 132.37 + 7.1*age + 0.01*salary + 41.1*car_location('carpark') + 325.03*car_location('street')
If the value
carpark
was specified for
car_location
in a
particular record, you would get the following formula:
number_of_claims = 132.37 + 7.1*age + 0.01*salary + 41.1*1 + 325.03*0
Linear Regression Sample
This is a linear regression equation predicting a number of insurance
claims on prior knowledge of the values of the independent variables
age, salary and car_location.
car_location is the only categorical variable. Its
value attribute can take on two possible values,
carpark and street.
number_of_claims = 132.37 + 7.1*age + 0.01*salary + 41.1*car_location('carpark') + 325.03*car_location('street')
The corresponding PMML model is:
<?xml version="1.0" ?>
<PMML version="4.0" xmlns="https://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="DMG.org"/>
<DataDictionary numberOfFields="4">
<DataField name="age" optype="continuous" dataType="double"/>
<DataField name="salary" optype="continuous" dataType="double"/>
<DataField name="car_location" optype="categorical" dataType="string">
<Value value="carpark"/>
<Value value="street"/>
</DataField>
<DataField name="number_of_claims" optype="continuous" dataType="integer"/>
</DataDictionary>
<RegressionModel
modelName="Sample for linear regression"
functionName="regression"
algorithmName="linearRegression"
targetFieldName="number_of_claims">
<MiningSchema>
<MiningField name="age"/>
<MiningField name="salary"/>
<MiningField name="car_location"/>
<MiningField name="number_of_claims" usageType="predicted"/>
</MiningSchema>
<RegressionTable intercept="132.37">
<NumericPredictor name="age"
exponent="1" coefficient="7.1"/>
<NumericPredictor name="salary"
exponent="1" coefficient="0.01"/>
<CategoricalPredictor name="car_location"
value="carpark" coefficient="41.1"/>
<CategoricalPredictor name="car_location"
value="street" coefficient="325.03"/>
</RegressionTable>
</RegressionModel>
</PMML>
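Scoring this model by hand for one hypothetical record takes only a few lines; the indicator terms mirror the CategoricalPredictor elements:
# Hypothetical input record
age, salary, car_location = 24.0, 36000.0, "street"

number_of_claims = (
    132.37
    + 7.1 * age
    + 0.01 * salary
    + 41.1 * (1 if car_location == "carpark" else 0)
    + 325.03 * (1 if car_location == "street" else 0)
)
print(number_of_claims)  # about 987.8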
Polynomial Regression Sample
This is a polynomial regression equation predicting a number of insurance
claims on prior knowledge of the values of the independent variables
salary and car_location. car_location is a
categorical variable. Its value attribute can take on two possible
values, carpark and street.
number_of_claims = 3216.38 - 0.08*salary + 9.54E-7*salary**2 - 2.67E-12*salary**3 + 93.78*car_location('carpark') + 288.75*car_location('street')
<?xml version="1.0" ?>
<PMML version="4.0" xmlns="https://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="DMG.org"/>
<DataDictionary numberOfFields="3">
<DataField name="salary" optype="continuous" dataType="double"/>
<DataField name="car_location" optype="categorical" dataType="string">
<Value value="carpark"/>
<Value value="street"/>
</DataField>
<DataField name="number_of_claims" optype="continuous" dataType="integer"/>
</DataDictionary>
<RegressionModel
functionName="regression"
modelName="Sample for stepwise polynomial regression"
algorithmName="stepwisePolynomialRegression"
targetFieldName="number_of_claims">
<MiningSchema>
<MiningField name="salary"/>
<MiningField name="car_location"/>
<MiningField name="number_of_claims" usageType="predicted"/>
</MiningSchema>
<RegressionTable intercept="3216.38">
<NumericPredictor name="salary"
exponent="1" coefficient="-0.08"/>
<NumericPredictor name="salary"
exponent="2" coefficient="9.54E-7"/>
<NumericPredictor name="salary"
exponent="3" coefficient="-2.67E-12"/>
<CategoricalPredictor name="car_location"
value="carpark" coefficient="93.78"/>
<CategoricalPredictor name="car_location"
value="street" coefficient="288.75"/>
</RegressionTable>
</RegressionModel>
</PMML>
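For the polynomial table, the exponent attribute simply becomes a power when scoring; a sketch with a hypothetical record:
# Hypothetical input record
salary, car_location = 20000.0, "street"

number_of_claims = (
    3216.38
    - 0.08 * salary
    + 9.54E-7 * salary ** 2
    - 2.67E-12 * salary ** 3
    + 93.78 * (1 if car_location == "carpark" else 0)
    + 288.75 * (1 if car_location == "street" else 0)
)
print(number_of_claims)  # about 2265.37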
Logistic Regression for binary classification
Many regression modeling algorithms create (k-1) equations for
classification problems with k different categories. This is particularly
useful for binary classification. The resulting model can easily
be defined in PMML as in the following example.
<?xml version="1.0" ?>
<PMML version="4.0" xmlns="https://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="DMG.org"/>
<DataDictionary numberOfFields="3">
<DataField name="x1" optype="continuous" dataType="double"/>
<DataField name="x2" optype="continuous" dataType="double"/>
<DataField name="y" optype="continuous" dataType="double"/>
</DataDictionary>
<RegressionModel
functionName="regression"
modelName="Sample for stepwise polynomial regression"
algorithmName="stepwisePolynomialRegression"
normalizationMethod="softmax"
targetFieldName="y">
<MiningSchema>
<MiningField name="x1"/>
<MiningField name="x2"/>
<MiningField name="y" usageType="predicted"/>
</MiningSchema>
<RegressionTable targetCategory="no" intercept="125.56601826">
<NumericPredictor name="x1" coefficient="-28.6617384"/>
<NumericPredictor name="x2" coefficient="-20.42027426"/>
</RegressionTable>
<RegressionTable targetCategory="yes" intercept="0"/>
</RegressionModel>
</PMML>
Note that the last RegressionTable element is trivial.
It does not have any predictor entries.
A RegressionTable defines a formula:
intercept + (sum of predictor terms).
If there are no predictor terms, then the (sum of ..) is 0 and the formula becomes just:
intercept.
That's exactly what <RegressionTable targetCategory="yes"
intercept="0"/> defines.
Sample for classification with more than two categories:
y_clerical = 46.418 - 0.132*age + 7.867E-02*work - 20.525*sex('0') + 0*sex('1') - 19.054*minority('0') + 0*minority('1')
y_professional = 51.169 - 0.302*age + 0.155*work - 21.389*sex('0') + 0*sex('1') - 18.443*minority('0') + 0*minority('1')
y_trainee = 25.478 - 0.154*age + 0.266*work - 2.639*sex('0') + 0*sex('1') - 19.821*minority('0') + 0*minority('1')
Note that terms such as 0*minority('1') are superfluous, but it is valid
to use the same field with different indicator values such as '0' and '1'.
However, a RegressionTable must not have multiple NumericPredictors
with the same name, and it must not have multiple CategoricalPredictors
with the same pair of name and value.
The corresponding PMML model is:
<?xml version="1.0" ?>
<PMML version="4.0" xmlns="https://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="DMG.org"/>
<DataDictionary numberOfFields="5">
<DataField name="age" optype="continuous" dataType="double"/>
<DataField name="work" optype="continuous" dataType="double"/>
<DataField name="sex" optype="categorical" dataType="string">
<Value value="0"/>
<Value value="1"/>
</DataField>
<DataField name="minority" optype="categorical" dataType="integer">
<Value value="0"/>
<Value value="1"/>
</DataField>
<DataField name="jobcat" optype="categorical" dataType="string">
<Value value="clerical"/>
<Value value="professional"/>
<Value value="trainee"/>
<Value value="skilled"/>
</DataField>
</DataDictionary>
<RegressionModel
modelName="Sample for logistic regression"
functionName="classification"
algorithmName="logisticRegression"
normalizationMethod="softmax"
targetFieldName="jobcat">
<MiningSchema>
<MiningField name="age"/>
<MiningField name="work"/>
<MiningField name="sex"/>
<MiningField name="minority"/>
<MiningField name="jobcat" usageType="predicted"/>
</MiningSchema>
<RegressionTable intercept="46.418" targetCategory="clerical">
<NumericPredictor name="age" exponent="1" coefficient="-0.132"/>
<NumericPredictor name="work" exponent="1" coefficient="7.867E-02"/>
<CategoricalPredictor name="sex" value="0" coefficient="-20.525"/>
<CategoricalPredictor name="sex" value="1" coefficient="0.5"/>
<CategoricalPredictor name="minority" value="0" coefficient="-19.054"/>
<CategoricalPredictor name="minority" value="1" coefficient="0"/>
</RegressionTable>
<RegressionTable intercept="51.169" targetCategory="professional">
<NumericPredictor name="age" exponent="1" coefficient="-0.302"/>
<NumericPredictor name="work" exponent="1" coefficient="0.155"/>
<CategoricalPredictor name="sex" value="0" coefficient="-21.389"/>
<CategoricalPredictor name="sex" value="1" coefficient="0.1"/>
<CategoricalPredictor name="minority" value="0" coefficient="-18.443"/>
<CategoricalPredictor name="minority" value="1" coefficient="0"/>
</RegressionTable>
<RegressionTable intercept="25.478" targetCategory="trainee">
<NumericPredictor name="age" exponent="1" coefficient="-0.154"/>
<NumericPredictor name="work" exponent="1" coefficient="0.266"/>
<CategoricalPredictor name="sex" value="0" coefficient="-2.639"/>
<CategoricalPredictor name="sex" value="1" coefficient="0.8"/>
<CategoricalPredictor name="minority" value="0" coefficient="-19.821"/>
<CategoricalPredictor name="minority" value="1" coefficient="0.2"/>
</RegressionTable>
<RegressionTable intercept="0.0" targetCategory="skilled"/>
</RegressionModel>
</PMML>
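A sketch of scoring this four-category model for one hypothetical record; the zero-coefficient terms are omitted since they contribute nothing:
import math

# Hypothetical input record
age, work, sex, minority = 35.0, 10.0, "1", "0"

def ind(field, value):
    # categorical indicator: 1 if the field equals the value, else 0
    return 1 if field == value else 0

ys = {
    "clerical": 46.418 - 0.132 * age + 7.867E-02 * work
                - 20.525 * ind(sex, "0") - 19.054 * ind(minority, "0"),
    "professional": 51.169 - 0.302 * age + 0.155 * work
                    - 21.389 * ind(sex, "0") - 18.443 * ind(minority, "0"),
    "trainee": 25.478 - 0.154 * age + 0.266 * work
               - 2.639 * ind(sex, "0") - 19.821 * ind(minority, "0"),
    "skilled": 0.0,  # trivial RegressionTable
}
total = sum(math.exp(y) for y in ys.values())
probs = {cat: math.exp(y) / total for cat, y in ys.items()}
print(probs)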
Using interaction terms
The following example uses predictor terms that are implicitly combined by multiplication,
aka interaction terms.
y = 2.1 - 0.1*age*work - 20.525*sex('0')
The corresponding PMML model is:
<?xml version="1.0" ?>
<PMML version="4.0" xmlns="https://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="DMG.org"/>
<DataDictionary numberOfFields="4">
<DataField name="age" optype="continuous" dataType="double"/>
<DataField name="work" optype="continuous" dataType="double"/>
<DataField name="sex" optype="categorical" dataType="string">
<Value value="male"/>
<Value value="female"/>
</DataField>
<DataField name="y" optype="continuous" dataType="double"/>
</DataDictionary>
<RegressionModel
modelName="Sample for interaction terms"
functionName="regression"
targetFieldName="y">
<MiningSchema>
<MiningField name="age" optype="continuous" />
<MiningField name="work" optype="continuous" />
<MiningField name="sex" optype="categorical" />
<MiningField name="y" optype="continuous" usageType="predicted" />
</MiningSchema>
<RegressionTable intercept="2.1" >
<CategoricalPredictor name="sex" value="0" coefficient="-20.525"/>
<PredictorTerm coefficient="-0.1">
<FieldRef field="age"/>
<FieldRef field="work"/>
</PredictorTerm>
</RegressionTable>
</RegressionModel>
</PMML>
|
Note that the model can convert the categorical field sex into a
continuous field by defining an appropriate DerivedField.
Furthermore, fields can appear more than once within a PredictorTerm.
For example,
( 3.14 * salary**2 * age * income * (sex=='1') )
can be written in PMML as
<PredictorTerm coefficient="3.14">
<FieldRef field="salary"/>
<FieldRef field="age"/>
<FieldRef field="income"/>
<FieldRef field="salary"/>
<FieldRef field="_g_0"/> <!-- derived field for sex=='1' -->
</PredictorTerm>
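A sketch of how such a PredictorTerm evaluates: the referenced fields are multiplied together, so listing salary twice squares it. The record values are hypothetical, and the derived-field name _g_0 is taken from the snippet above for illustration only.
# Hypothetical record; _g_0 stands for a derived 0/1 field for sex=='1'
record = {"salary": 2.0, "age": 3.0, "income": 4.0, "_g_0": 1.0}

coefficient = 3.14
fields = ["salary", "age", "income", "salary", "_g_0"]
term = coefficient
for f in fields:
    term *= record[f]
print(term)  # 3.14 * 2**2 * 3 * 4 * 1 = 150.72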