PMML 4.3 - Regression

The regression functions are used to determine the relationship between the dependent variable (target field) and one or more independent variables. The dependent variable is the one whose values you want to predict, whereas the independent variables are the variables that you base your prediction on. While the term regression usually refers to the prediction of numeric values, the PMML element RegressionModel can also be used for classification. This is due to the fact that multiple regression equations can be combined in order to predict categorical values.

If the attribute functionName is regression then the model is used for the prediction of a numeric value in a continuous domain. These models should contain exactly one regression table.

If the attribute functionName is classification then the model is used to predict a category. These models should contain exactly one regression table for each targetCategory. The normalizationMethod describes how the prediction is converted into a confidence value (aka probability).

For simple regression with functionName='regression', the formula is:

Dependent variable = intercept + Sum_i (coefficient_i * independent variable_i) + error

Classification models can have multiple regression equations. With n classes/categories there are n equations of the form

y_j = intercept_j + Sum_i (coefficient_ji * independent variable_i)

One method to compute the confidence/probability value for category j is to use the softmax function

p_j = exp(y_j) / (Sum[i in 1..n](exp(y_i)))

Another method, called simplemax, uses a simple quotient

p_j = y_j / (Sum[i in 1..n](y_i))

These confidence values are similar to statistical probabilities, but they only mimic probabilities by post-processing the values y_i.
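
The two normalizations can be sketched in Python as follows. This is only an illustration: the function names and the sample values are assumptions, not part of PMML, and subtracting the maximum inside softmax is a numerical-stability choice that does not change the value of the formula.

```python
import math

def softmax(ys):
    """p_j = exp(y_j) / Sum_i exp(y_i)."""
    # Shifting by max(ys) avoids overflow; the ratios are unchanged.
    m = max(ys)
    exps = [math.exp(y - m) for y in ys]
    total = sum(exps)
    return [e / total for e in exps]

def simplemax(ys):
    """p_j = y_j / Sum_i y_i (undefined when the sum is zero)."""
    total = sum(ys)
    if total == 0:
        raise ValueError("simplemax is undefined when the y values sum to zero")
    return [y / total for y in ys]

# Hypothetical outputs of three regression tables.
ys = [1.0, 2.0, 3.0]
softmax_probs = softmax(ys)
simplemax_probs = simplemax(ys)
print(softmax_probs)
print(simplemax_probs)
```

Both functions return one value per regression table, in the order of the tables.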

Note that binary logistic regression is a special case of softmax with

y = intercept + Sum_i (coefficient_i * independent variable_i)
p = 1/(1+exp(-y))

It should be implemented as a classification model:

<RegressionModel functionName="classification"  normalizationMethod="softmax" ...
  <RegressionTable targetCategory="YES" ...
  <RegressionTable targetCategory="NO" intercept="0.0"
  ...

Here p is the probability associated with YES, the first targetCategory. Note that the probability of YES, as given by the softmax normalization method, is

p = exp(y) / (exp(y) + exp(0.0))

which, after dividing numerator and denominator by exp(y), is equivalent to

p = 1 / (1 + exp(-y))

The probability of NO is then computed as 1 - p.
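
The equivalence can also be checked numerically. The following Python sketch uses hypothetical function names and arbitrary test values of y; it is a sanity check, not part of the specification.

```python
import math

def softmax_yes(y):
    # Two regression tables: y for YES and a constant 0.0 for NO.
    return math.exp(y) / (math.exp(y) + math.exp(0.0))

def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))

# Largest difference between the two forms over a few arbitrary values of y.
max_difference = max(
    abs(softmax_yes(y) - logistic(y)) for y in (-3.0, -0.5, 0.0, 0.5, 4.0)
)
print(max_difference)
```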

The XML Schema for RegressionModel

<xs:element name="RegressionModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="ModelExplanation" minOccurs="0"/>
      <xs:element ref="Targets" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="RegressionTable" maxOccurs="unbounded"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string"/>
    <xs:attribute name="modelType" use="optional">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="linearRegression"/>
          <xs:enumeration value="stepwisePolynomialRegression"/>
          <xs:enumeration value="logisticRegression"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="targetFieldName" type="FIELD-NAME" use="optional"/>
    <xs:attribute name="normalizationMethod" type="REGRESSIONNORMALIZATIONMETHOD" default="none"/>
    <xs:attribute name="isScorable" type="xs:boolean" default="true"/>
  </xs:complexType>
</xs:element>

<xs:simpleType name="REGRESSIONNORMALIZATIONMETHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="none"/>
    <xs:enumeration value="simplemax"/>
    <xs:enumeration value="softmax"/>
    <xs:enumeration value="logit"/>
    <xs:enumeration value="probit"/>
    <xs:enumeration value="cloglog"/>
    <xs:enumeration value="exp"/>
    <xs:enumeration value="loglog"/>
    <xs:enumeration value="cauchit"/>
  </xs:restriction>
</xs:simpleType>

Definitions

Valid combinations

functionName | normalizationMethod | number of RegressionTable elements | result
regression | none | 1 | predictedValue = y_1
regression | softmax, logit | 1 | predictedValue = 1/(1+exp(-y_1))
regression | exp | 1 | predictedValue = exp(y_1)
regression | probit | 1 | predictedValue = CDF(y_1)
regression | cloglog | 1 | predictedValue = 1 - exp(-exp(y_1))
regression | loglog | 1 | predictedValue = exp(-exp(-y_1))
regression | cauchit | 1 | predictedValue = 0.5 + (1/π) arctan(y_1)
regression | other | 1 | ERROR
regression | any | >1 | ERROR
classification with an ordinal target | logit, probit, cauchit, cloglog, loglog, none | >1 | apply norm.method to y_1 .. y_n to find cumulative probabilities, see text for details
classification with a binary target | logit, probit, cauchit, cloglog, loglog | 2 | apply norm.method to y_1, p_2 = 1 - p_1
classification with a binary target | none | 2 | if y_1 < 0 then p_1 = 0, if y_1 > 1 then p_1 = 1, else p_1 = y_1; p_2 = 1 - p_1
classification with a categorical target with 2 or more categories | softmax, simplemax | 2 or more | apply norm.method to y_1 .. y_n
classification with a categorical target with >2 categories | none | >2 | p_i = y_i for i = 1..n-1, p_n = 1 - Sum(p_i)
classification | other | any | ERROR
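
For functionName="regression", the rows above can be sketched in Python as below. The names INVERSE_LINKS and predicted_value are hypothetical, CDF is taken to be the standard normal cumulative distribution function (as in the probit definition), and the ERROR rows are modelled as a ValueError, which is an assumption about how a scoring engine might report them.

```python
import math

def std_normal_cdf(y):
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))

# One entry per normalizationMethod that is valid for regression.
INVERSE_LINKS = {
    "none": lambda y: y,
    "softmax": logistic,
    "logit": logistic,
    "exp": math.exp,
    "probit": std_normal_cdf,
    "cloglog": lambda y: 1.0 - math.exp(-math.exp(y)),
    "loglog": lambda y: math.exp(-math.exp(-y)),
    "cauchit": lambda y: 0.5 + math.atan(y) / math.pi,
}

def predicted_value(method, ys):
    """ys holds the results of the RegressionTable elements."""
    if len(ys) != 1:
        raise ValueError("a regression model needs exactly one RegressionTable")
    if method not in INVERSE_LINKS:
        raise ValueError("invalid normalizationMethod for regression: " + method)
    return INVERSE_LINKS[method](ys[0])

print(predicted_value("exp", [0.0]))
```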

How to compute p_j := probability of target = Value_j

Let there be N target categories with N RegressionTable elements. Let y_j be the result of evaluating the formula in the jth RegressionTable. If one or more of the y_j cannot be evaluated because the value in one of the referenced fields is missing, then the following formulas do not apply. In that case the predictions are defined by the priorProbability values in the Target element if present, and are undefined otherwise.

softmax, categorical
p_j = exp(y_j) / ( Sum[i = 1 to N]( exp(y_i) ) )
simplemax, categorical
p_j = y_j / ( Sum[i = 1 to N]( y_i ) )
none, categorical
p_j = y_j for j = 1 to N - 1,
p_N = 1 - Sum[i = 1 to N - 1]( p_i )

Note that if N=2 and the probabilities are outside of [0,1], they should be changed to the closest of 0 or 1.
logit, categorical with two categories
p_1 = 1 / ( 1 + exp( -y_1 ) ),
p_2 = 1 - p_1
probit, categorical with two categories
p_1 = integral(from -∞ to y_1)(1/sqrt(2*π))exp(-0.5*u*u)du,
p_2 = 1 - p_1
cloglog, categorical with two categories
p_1 = 1 - exp( -exp( y_1 ) ),
p_2 = 1 - p_1
loglog, categorical with two categories
p_1 = exp( -exp( -y_1 ) ),
p_2 = 1 - p_1
cauchit, categorical with two categories
p_1 = 0.5 + (1/π) arctan( y_1 ),
p_2 = 1 - p_1
logit, ordinal
inverse of logit function: F(y) = 1/(1+exp(-y)), e.g. F(15) ≈ 1
p_1 = F(y_1)
p_j = F(y_j) - F(y_(j-1)), for 2 ≤ j < N
p_N = 1 - F(y_(N-1))
probit, ordinal
inverse of probit function: F(y) = integral(from -∞ to y)(1/sqrt(2*π))exp(-0.5*u*u)du
e.g. F(10) ≈ 1
p_1 = F(y_1)
p_j = F(y_j) - F(y_(j-1)), for 2 ≤ j < N
p_N = 1 - F(y_(N-1))
cloglog, ordinal
inverse of cloglog function: F(y) = 1 - exp( -exp(y) )
e.g. F(4) ≈ 1
p_1 = F(y_1)
p_j = F(y_j) - F(y_(j-1)), for 2 ≤ j < N
p_N = 1 - F(y_(N-1))
loglog, ordinal
inverse of loglog function: F(y) = exp( -exp(-y) )
p_1 = F(y_1)
p_j = F(y_j) - F(y_(j-1)), for 2 ≤ j < N
p_N = 1 - F(y_(N-1))
cauchit, ordinal
inverse of cauchit function: F(y) = 0.5 + (1/π) arctan(y)
p_1 = F(y_1)
p_j = F(y_j) - F(y_(j-1)), for 2 ≤ j < N
p_N = 1 - F(y_(N-1))
none, ordinal
p_1 = y_1
p_j = y_j - y_(j-1), for 2 ≤ j < N
p_N = 1 - y_(N-1)
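
Two of the less obvious cases in the list above are the ordinal formulas and the "none, categorical" rule with its clamping for N=2. The Python sketch below illustrates both. The function names and the sample y values are assumptions; the ordinal helper takes the inverse link F as a parameter, so passing the identity function gives the "none, ordinal" case.

```python
import math

def logit_inverse(y):
    return 1.0 / (1.0 + math.exp(-y))

def ordinal_probabilities(ys, inverse_link):
    """ys = [y_1, ..., y_N]; the result of the last (trivial) table is not used."""
    if len(ys) < 2:
        raise ValueError("an ordinal target needs at least two RegressionTables")
    # Cumulative probabilities F(y_1) .. F(y_(N-1)).
    cumulative = [inverse_link(y) for y in ys[:-1]]
    probs = [cumulative[0]]                      # p_1 = F(y_1)
    for previous, current in zip(cumulative, cumulative[1:]):
        probs.append(current - previous)         # p_j = F(y_j) - F(y_(j-1))
    probs.append(1.0 - cumulative[-1])           # p_N = 1 - F(y_(N-1))
    return probs

def categorical_none(ys):
    """p_j = y_j for j < N and p_N = 1 - Sum(p_j); clamp to [0,1] when N = 2."""
    first = list(ys[:-1])
    if len(ys) == 2:
        first = [min(1.0, max(0.0, first[0]))]
    return first + [1.0 - sum(first)]

# Hypothetical table results; 15.0 stands for the "large" trivial intercept.
ordinal = ordinal_probabilities([-1.0, 0.5, 15.0], logit_inverse)
clamped = categorical_none([1.4, 0.0])
print(ordinal)
print(clamped)
```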

Comments on exp: The exp normalizationMethod is frequently used in statistical models for predicting non-negative target variables, such as Poisson regression which is used in sales forecasting, queuing models, insurance risk models, etc., to predict counts per unit time (e.g., daily sales volumes, hourly service requests, quarterly insurance claim filings, etc.).

Comments on probit: The area under the standard normal curve corresponds to probability, specifically the probability of finding an observation less than a given Z value. The total area under the curve, from -∞ to +∞, is 1.0.

Z = 0: area above Z = 0.5, i.e. half of the curve lies below the mean
Z = 1.0: area above Z = 0.1587, i.e. about 84% of the data lies below (mean + 1 sd)

t | F(t)
1 | 0.84134474606854
2 | 0.97724986805182
3 | 0.99865010196837
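
The tabulated values can be reproduced with the error function from the Python standard library. This is a sketch; the function name probit_inverse is an assumption, and the identity F(t) = 0.5 * (1 + erf(t / sqrt(2))) is the usual way to express the standard normal CDF.

```python
import math

def probit_inverse(t):
    # Standard normal CDF: area under the curve from -infinity to t.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

for t in (1, 2, 3):
    print(t, probit_inverse(t))
```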

For ordinal targets: Suppose y_i is the result of the ith RegressionTable. Apply the inverse link function to obtain the cumulative probability. For the last (so-called trivial) RegressionTable, the intercept should be a "large" number so that, after applying the inverse link function, you obtain a cumulative probability of 1; this is assumed in the equations given above for computing the probabilities of ordinal targets. The individual probability of each category is calculated by subtracting the cumulative probability of the previous category from the cumulative probability of the current category.

Examples

The following regression formula is used to predict the number of insurance claims:

number_of_claims =
132.37 + 7.1*age + 0.01*salary + 41.1*car_location('carpark') + 325.03*car_location('street')

If the value carpark was specified for car_location in a particular record, you would get the following formula:

number_of_claims = 132.37 + 7.1*age + 0.01*salary + 41.1*1 + 325.03*0

Linear Regression Sample

This is a linear regression equation predicting a number of insurance claims on prior knowledge of the values of the independent variables age, salary and car_location. car_location is the only categorical variable. Its value attribute can take on two possible values, carpark and street.

number_of_claims = 132.37 + 7.1*age + 0.01*salary + 41.1*car_location('carpark') + 325.03*car_location('street')

The corresponding PMML model is:

<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header copyright="DMG.org"/>
  <DataDictionary numberOfFields="4">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="salary" optype="continuous" dataType="double"/>
    <DataField name="car_location" optype="categorical" dataType="string">
      <Value value="carpark"/>
      <Value value="street"/>
    </DataField>
    <DataField name="number_of_claims" optype="continuous" dataType="integer"/>
  </DataDictionary>
  <RegressionModel modelName="Sample for linear regression" functionName="regression" algorithmName="linearRegression" targetFieldName="number_of_claims">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="salary"/>
      <MiningField name="car_location"/>
      <MiningField name="number_of_claims" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="132.37">
      <NumericPredictor name="age" exponent="1" coefficient="7.1"/>
      <NumericPredictor name="salary" exponent="1" coefficient="0.01"/>
      <CategoricalPredictor name="car_location" value="carpark" coefficient="41.1"/>
      <CategoricalPredictor name="car_location" value="street" coefficient="325.03"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
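
To illustrate how a scoring engine might evaluate such a RegressionTable, here is a minimal Python sketch using the standard xml.etree.ElementTree module. It is not a reference implementation: the helper evaluate_table and the input record are assumptions, the table is parsed without the PMML namespace for brevity, and missing-value handling is left out.

```python
import xml.etree.ElementTree as ET

# The RegressionTable from the sample, without the surrounding PMML document.
TABLE_XML = """
<RegressionTable intercept="132.37">
  <NumericPredictor name="age" exponent="1" coefficient="7.1"/>
  <NumericPredictor name="salary" exponent="1" coefficient="0.01"/>
  <CategoricalPredictor name="car_location" value="carpark" coefficient="41.1"/>
  <CategoricalPredictor name="car_location" value="street" coefficient="325.03"/>
</RegressionTable>
"""

def evaluate_table(table, record):
    """intercept + sum of predictor terms for one input record."""
    y = float(table.get("intercept"))
    for predictor in table.findall("NumericPredictor"):
        exponent = int(predictor.get("exponent", "1"))
        value = record[predictor.get("name")]
        y += float(predictor.get("coefficient")) * value ** exponent
    for predictor in table.findall("CategoricalPredictor"):
        # The indicator is 1 when the field equals the given value, else 0.
        if str(record[predictor.get("name")]) == predictor.get("value"):
            y += float(predictor.get("coefficient"))
    return y

table = ET.fromstring(TABLE_XML.strip())
# Made-up input record.
record = {"age": 40.0, "salary": 50000.0, "car_location": "carpark"}
claims = evaluate_table(table, record)
print(claims)
```

For this record the terms are 132.37 + 7.1*40 + 0.01*50000 + 41.1, which is about 957.47.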

Polynomial Regression Sample

This is a polynomial regression equation predicting a number of insurance claims on prior knowledge of the values of the independent variables salary and car_location. car_location is a categorical variable. Its value attribute can take on two possible values, carpark and street.

number_of_claims =
3216.38 - 0.08*salary + 9.54E-7*salary**2 - 2.67E-12*salary**3 + 93.78*car_location('carpark') + 288.75*car_location('street')

The corresponding PMML model is:

<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header copyright="DMG.org"/>
  <DataDictionary numberOfFields="3">
    <DataField name="salary" optype="continuous" dataType="double"/>
    <DataField name="car_location" optype="categorical" dataType="string">
      <Value value="carpark"/>
      <Value value="street"/>
    </DataField>
    <DataField name="number_of_claims" optype="continuous" dataType="integer"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="Sample for stepwise polynomial regression" algorithmName="stepwisePolynomialRegression" targetFieldName="number_of_claims">
    <MiningSchema>
      <MiningField name="salary"/>
      <MiningField name="car_location"/>
      <MiningField name="number_of_claims" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="3216.38">
      <NumericPredictor name="salary" exponent="1" coefficient="-0.08"/>
      <NumericPredictor name="salary" exponent="2" coefficient="9.54E-7"/>
      <NumericPredictor name="salary" exponent="3" coefficient="-2.67E-12"/>
      <CategoricalPredictor name="car_location" value="carpark" coefficient="93.78"/>
      <CategoricalPredictor name="car_location" value="street" coefficient="288.75"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
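
The role of the exponent attribute can be seen by writing the same equation directly in Python. This is a sketch with a hypothetical function name and a made-up input; each NumericPredictor contributes coefficient * salary ** exponent.

```python
def predict_claims(salary, car_location):
    y = 3216.38                      # intercept
    y += -0.08 * salary ** 1         # exponent="1"
    y += 9.54e-7 * salary ** 2       # exponent="2"
    y += -2.67e-12 * salary ** 3     # exponent="3"
    # Exactly one of the two categorical indicators is 1.
    y += {"carpark": 93.78, "street": 288.75}[car_location]
    return y

claims = predict_claims(10000.0, "carpark")
print(claims)
```

With salary = 10000 the polynomial terms are -800, 95.4 and -2.67, so the result is about 2602.89.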

Logistic Regression for binary classification

Many regression modeling algorithms create (k-1) equations for classification problems with k different categories. This is particularly useful for binary classification. The resulting model can easily be defined in PMML as in the following example.

<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header copyright="DMG.org"/>
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="categorical" dataType="string">
      <Value value="yes"/>
      <Value value="no"/>
    </DataField>
  </DataDictionary>
  <RegressionModel functionName="classification" modelName="Sample for logistic regression" algorithmName="logisticRegression" normalizationMethod="softmax" targetFieldName="y">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable targetCategory="no" intercept="125.56601826">
      <NumericPredictor name="x1" coefficient="-28.6617384"/>
      <NumericPredictor name="x2" coefficient="-20.42027426"/>
    </RegressionTable>
    <RegressionTable targetCategory="yes" intercept="0"/>
  </RegressionModel>
</PMML>

Note that the last RegressionTable element is trivial: it does not have any predictor entries. A RegressionTable defines a formula: intercept + (sum of predictor terms). If there are no predictor terms, then the sum is 0 and the formula reduces to the intercept alone. That is exactly what <RegressionTable targetCategory="yes" intercept="0"/> defines.
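
A minimal Python sketch of scoring this two-table model with softmax follows. The function name score and the input values are assumptions; the coefficients are taken from the sample above, and the trivial table contributes the constant 0.

```python
import math

def score(x1, x2):
    y_no = 125.56601826 - 28.6617384 * x1 - 20.42027426 * x2
    y_yes = 0.0  # trivial RegressionTable: intercept only
    # Softmax over the two table results, shifted for numerical stability.
    m = max(y_no, y_yes)
    e_no, e_yes = math.exp(y_no - m), math.exp(y_yes - m)
    total = e_no + e_yes
    return {"no": e_no / total, "yes": e_yes / total}

probs = score(3.0, 2.0)  # made-up input record
print(probs)
```

Because the second table is constant 0, the probability of "no" equals 1/(1+exp(-y_no)), the usual logistic function.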

Sample for classification with more than two categories:

y_clerical =
46.418 - 0.132*age + 7.867E-02*work - 20.525*sex('0') + 0*sex('1') - 19.054*minority('0') + 0*minority('1')
y_professional =
51.169 - 0.302*age + 0.155*work - 21.389*sex('0') + 0*sex('1') - 18.443*minority('0') + 0*minority('1')
y_trainee =
25.478 - 0.154*age + 0.266*work - 2.639*sex('0') + 0*sex('1') - 19.821*minority('0') + 0*minority('1')

Note that terms such as 0*minority('1') are superfluous, but it is valid to use the same field with different indicator values such as '0' and '1'. However, a RegressionTable must not have multiple NumericPredictors with the same name, and it must not have multiple CategoricalPredictors with the same pair of name and value.

The corresponding PMML model is:

<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header copyright="DMG.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="work" optype="continuous" dataType="double"/>
    <DataField name="sex" optype="categorical" dataType="string">
      <Value value="0"/>
      <Value value="1"/>
    </DataField>
    <DataField name="minority" optype="categorical" dataType="integer">
      <Value value="0"/>
      <Value value="1"/>
    </DataField>
    <DataField name="jobcat" optype="categorical" dataType="string">
      <Value value="clerical"/>
      <Value value="professional"/>
      <Value value="trainee"/>
      <Value value="skilled"/>
    </DataField>
  </DataDictionary>
  <RegressionModel modelName="Sample for logistic regression" functionName="classification" algorithmName="logisticRegression" normalizationMethod="softmax" targetFieldName="jobcat">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="work"/>
      <MiningField name="sex"/>
      <MiningField name="minority"/>
      <MiningField name="jobcat" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="46.418" targetCategory="clerical">
      <NumericPredictor name="age" exponent="1" coefficient="-0.132"/>
      <NumericPredictor name="work" exponent="1" coefficient="7.867E-02"/>
      <CategoricalPredictor name="sex" value="0" coefficient="-20.525"/>
      <CategoricalPredictor name="sex" value="1" coefficient="0"/>
      <CategoricalPredictor name="minority" value="0" coefficient="-19.054"/>
      <CategoricalPredictor name="minority" value="1" coefficient="0"/>
    </RegressionTable>
    <RegressionTable intercept="51.169" targetCategory="professional">
      <NumericPredictor name="age" exponent="1" coefficient="-0.302"/>
      <NumericPredictor name="work" exponent="1" coefficient="0.155"/>
      <CategoricalPredictor name="sex" value="0" coefficient="-21.389"/>
      <CategoricalPredictor name="sex" value="1" coefficient="0"/>
      <CategoricalPredictor name="minority" value="0" coefficient="-18.443"/>
      <CategoricalPredictor name="minority" value="1" coefficient="0"/>
    </RegressionTable>
    <RegressionTable intercept="25.478" targetCategory="trainee">
      <NumericPredictor name="age" exponent="1" coefficient="-0.154"/>
      <NumericPredictor name="work" exponent="1" coefficient="0.266"/>
      <CategoricalPredictor name="sex" value="0" coefficient="-2.639"/>
      <CategoricalPredictor name="sex" value="1" coefficient="0"/>
      <CategoricalPredictor name="minority" value="0" coefficient="-19.821"/>
      <CategoricalPredictor name="minority" value="1" coefficient="0"/>
    </RegressionTable>
    <RegressionTable intercept="0.0" targetCategory="skilled"/>
  </RegressionModel>
</PMML>
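
The following Python sketch scores one record against the four equations (y_clerical, y_professional, y_trainee and the trivial y_skilled = 0) and normalizes with softmax. The data layout, the function names and the input record are assumptions made for illustration; the zero-coefficient terms are simply omitted.

```python
import math

# Coefficients from the equations above; indicator terms with coefficient 0 are left out.
TABLES = {
    "clerical": {"intercept": 46.418, "age": -0.132, "work": 7.867e-02,
                 "sex": {"0": -20.525}, "minority": {"0": -19.054}},
    "professional": {"intercept": 51.169, "age": -0.302, "work": 0.155,
                     "sex": {"0": -21.389}, "minority": {"0": -18.443}},
    "trainee": {"intercept": 25.478, "age": -0.154, "work": 0.266,
                "sex": {"0": -2.639}, "minority": {"0": -19.821}},
    "skilled": {"intercept": 0.0, "age": 0.0, "work": 0.0,
                "sex": {}, "minority": {}},
}

def table_value(table, record):
    y = table["intercept"]
    y += table["age"] * record["age"] + table["work"] * record["work"]
    y += table["sex"].get(record["sex"], 0.0)
    y += table["minority"].get(record["minority"], 0.0)
    return y

def classify(record):
    ys = {category: table_value(t, record) for category, t in TABLES.items()}
    m = max(ys.values())  # shift for numerical stability
    exps = {category: math.exp(y - m) for category, y in ys.items()}
    total = sum(exps.values())
    probs = {category: e / total for category, e in exps.items()}
    return max(probs, key=probs.get), probs

# Made-up input record.
record = {"age": 30.0, "work": 5.0, "sex": "0", "minority": "0"}
best, probs = classify(record)
print(best, probs)
```

For this record y_clerical is about 3.27 and y_professional about 3.05, so clerical receives the highest probability.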

Using interaction terms

The following example uses predictor terms that are implicitly combined by multiplication, aka interaction terms.

y =
2.1 - 0.1*age*work - 20.525*sex('female')

The corresponding PMML model is:

<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header copyright="DMG.org"/>
  <DataDictionary numberOfFields="4">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="work" optype="continuous" dataType="double"/>
    <DataField name="sex" optype="categorical" dataType="string">
      <Value value="male"/>
      <Value value="female"/>
    </DataField>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="Sample for interaction terms" functionName="regression" targetFieldName="y">
    <MiningSchema> 
      <MiningField name="age" optype="continuous"/> 
      <MiningField name="work" optype="continuous"/> 
      <MiningField name="sex" optype="categorical"/> 
      <MiningField name="y" optype="continuous" usageType="target"/> 
    </MiningSchema>
    <RegressionTable intercept="2.1">
      <CategoricalPredictor name="sex" value="female" coefficient="-20.525"/>
      <PredictorTerm coefficient="-0.1">
        <FieldRef field="age"/>
        <FieldRef field="work"/>
      </PredictorTerm>
    </RegressionTable>
  </RegressionModel>
</PMML>

Note that the model can convert the categorical field sex into a continuous field by defining an appropriate DerivedField. Furthermore, fields can appear more than once within a PredictorTerm.

For example,

3.14 * salary**2 * age * income * (sex=='female')

can be written in PMML as

<PredictorTerm coefficient="3.14">
  <FieldRef field="salary"/>
  <FieldRef field="age"/>
  <FieldRef field="income"/>
  <FieldRef field="salary"/>
  <FieldRef field="_g_0"/>  <!-- derived field for sex=='female' -->
</PredictorTerm>
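
A PredictorTerm evaluates to its coefficient multiplied by the product of all referenced fields, so a repeated FieldRef raises that field to a higher power. A minimal Python sketch, with a hypothetical helper name and made-up field values:

```python
import math

def predictor_term(coefficient, field_refs, record):
    # Multiply the coefficient by the value of every referenced field, repeats included.
    return coefficient * math.prod(record[name] for name in field_refs)

# _g_0 is the derived 0/1 indicator for sex == 'female'.
record = {"salary": 2.0, "age": 3.0, "income": 4.0, "_g_0": 1.0}
refs = ["salary", "age", "income", "salary", "_g_0"]
term = predictor_term(3.14, refs, record)
print(term)
```

Here the term is 3.14 * 2**2 * 3 * 4 * 1, which is about 150.72.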