Data Mining Group - Regression

PMML 2.1 -- Regression

The regression functions are used to determine the relationship between the dependent variable (target field) and one or more independent variables. The dependent variable is the one whose values you want to predict, whereas the independent variables are the variables that you base your prediction on.

A RegressionModel defines three types of regression models: linear, stepwise-polynomial, and logistic regression. The modelType attribute indicates the type of regression used.

Linear and stepwise-polynomial regression are designed for numeric dependent variables having a continuous spectrum of values. These models should contain exactly one regression table. The attributes normalizationMethod and targetCategory are not used in that case.

Logistic regression is designed for categorical dependent variables. These models should contain exactly one regression table for each targetCategory. The normalizationMethod describes whether/how the prediction is converted into a probability.

For linear and stepwise-polynomial regression, the regression formula is:

Dependent variable = intercept + Sum_i (coefficient_i * independent variable_i ) + error

For logistic regression the formula is:

y = intercept + Sum_i (coefficient_i * independent variable_i )
p = 1/(1+exp(-y))

p is the predicted value. In many cases p can be interpreted as the confidence or the probability of an individual belonging to the category of interest, as defined by targetCategory. There can be multiple regression equations. With n classes/categories there are n equations of the form

y_j = intercept_j + Sum_i (coefficient_j_i * independent variable_i )

A confidence value for category j can be computed by the softmax function

p_j = exp(y_j) / (Sum[i in 1..n](exp(y_i)) )

Another method, called simplemax, uses a simple quotient

p_j = y_j / (Sum[i in 1..n](y_i) )

These confidence values are similar to statistical probabilities but they only mimic probabilities by post-processing the values y_i.
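
The three conversions above can be sketched in a few lines of Python. This is only an illustration of the formulas, not part of the PMML definition; the function names and the input values are made up for the example, and subtracting the maximum inside softmax is a common numerical precaution that does not change the result.

```python
import math

def logistic(y):
    """Single-equation case: p = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def softmax(ys):
    """p_j = exp(y_j) / Sum_i exp(y_i)."""
    # Shifting by the maximum leaves the quotient unchanged but avoids overflow.
    m = max(ys)
    exps = [math.exp(y - m) for y in ys]
    total = sum(exps)
    return [e / total for e in exps]

def simplemax(ys):
    """p_j = y_j / Sum_i y_i."""
    total = sum(ys)
    return [y / total for y in ys]

ys = [2.0, 1.0, 1.0]   # made-up values y_1..y_3
print(logistic(0.0))   # 0.5
print(simplemax(ys))   # [0.5, 0.25, 0.25]
print(softmax(ys))     # three values that sum to 1
```

Note that simplemax is only meaningful when the y_i are non-negative and their sum is not zero; the sketch does not guard against that case.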


<xs:element name="RegressionModel">
    <xs:complexType>
      <xs:sequence>
	<xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
	<xs:element ref="MiningSchema" />
	<xs:element minOccurs="0" ref="ModelStats" />
	<xs:element maxOccurs="unbounded" ref="RegressionTable" />
	<xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string" />
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
      <xs:attribute name="algorithmName" type="xs:string" />
      <xs:attribute name="modelType" use="required">
	<xs:simpleType>
	  <xs:restriction base="xs:string">
	    <xs:enumeration value="linearRegression" />
	    <xs:enumeration value="stepwisePolynomialRegression" />
	    <xs:enumeration value="logisticRegression" />
	  </xs:restriction>
	</xs:simpleType>
      </xs:attribute>
      <xs:attribute name="targetFieldName" type="FIELD-NAME" use="required" />
      <xs:attribute name="normalizationMethod" default="none">
	<xs:simpleType>
	  <xs:restriction base="xs:string">
	    <xs:enumeration value="none" />
	    <xs:enumeration value="simplemax" />
	    <xs:enumeration value="softmax" />
	  </xs:restriction>
	</xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="RegressionTable">
    <xs:complexType>
      <xs:sequence>
	<xs:element minOccurs="0" maxOccurs="unbounded" ref="NumericPredictor" />
	<xs:element minOccurs="0" maxOccurs="unbounded" ref="CategoricalPredictor" />
      </xs:sequence>
      <xs:attribute name="intercept" type="REAL-NUMBER" use="required" />
      <xs:attribute name="targetCategory" type="xs:string" />
    </xs:complexType>
  </xs:element>
  <xs:element name="NumericPredictor">
    <xs:complexType>
      <xs:attribute name="name" type="FIELD-NAME" use="required" />
      <xs:attribute name="exponent" type="INT-NUMBER" use="required" />
      <xs:attribute name="coefficient" type="REAL-NUMBER" use="required" />
      <xs:attribute name="mean" type="REAL-NUMBER" />
    </xs:complexType>
  </xs:element>
  <xs:element name="CategoricalPredictor">
    <xs:complexType>
      <xs:attribute name="name" type="FIELD-NAME" use="required" />
      <xs:attribute name="value" type="xs:string" use="required" />
      <xs:attribute name="coefficient" type="REAL-NUMBER" use="required" />
    </xs:complexType>
  </xs:element>

RegressionModel: The root element of an XML regression model. Each instance of a regression model must start with this element.

modelName: This is a unique identifier specifying the name of the regression model.

functionName: Can be regression or classification.

algorithmName: Can be any string describing the algorithm that was used while creating the model.

modelType: Specifies the type of a regression model. This information is used to select the appropriate mathematical formulas during the scoring phase. The supported values are linearRegression, stepwisePolynomialRegression, and logisticRegression.

targetFieldName: The name of the target field (also called response variable).

RegressionTable: A table that lists the values of all predictors or independent variables. If the model is used to predict a numerical field, then there is only one RegressionTable and the attribute targetCategory may be missing. If the model is used to predict a categorical field, then there are two or more RegressionTables and each one must have the attribute targetCategory defined with a unique value.

NumericPredictor: Defines a numeric independent variable. The list of valid attributes comprises the name of the variable, the exponent to be used, and the coefficient by which the values of this variable must be multiplied. If the independent variable contains missing values, the mean attribute is used to replace the missing values with the mean value.

CategoricalPredictor: Defines a categorical independent variable. The list of attributes comprises the name of the variable, the value attribute, and the coefficient by which the values of this variable must be multiplied. To do a regression analysis with categorical values, they must first be mapped to numbers. If the input record has the specified value for the independent variable, the term variable_name(value) is replaced with 1, so the coefficient is multiplied by 1. If it has a different value, the term variable_name(value) is replaced with 0, so that the product coefficient * variable_name(value) yields 0 and the term drops out of the sum. If the input value is missing, then variable_name(v) yields 0 for any 'v'.

Example:

The following regression formula is used to predict the number of insurance claims:

number_of_claims = 132.37 + 7.1*age + 0.01*salary + 41.1*car_location('carpark') + 325.03*car_location('street')

If the value carpark was specified for car location in a particular record, you would get the following formula:

number_of_claims = 132.37 + 7.1*age + 0.01*salary + 41.1*1 + 325.03*0
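
As a rough sketch of how the two predictor types contribute to the sum, the example formula could be evaluated as follows. The helper names numeric_term and categorical_term are hypothetical, the input record is invented, and raising an error when a value is missing and no mean is given is an assumption, since the text above does not cover that case.

```python
def numeric_term(coefficient, exponent, value, mean=None):
    """coefficient * value**exponent; a missing value (None) is replaced by mean."""
    if value is None:
        if mean is None:
            # Assumption: behaviour without a mean is not specified in the text.
            raise ValueError("missing value and no mean to substitute")
        value = mean
    return coefficient * value ** exponent

def categorical_term(coefficient, actual, value):
    """coefficient * variable_name(value), where the indicator is 1 or 0."""
    indicator = 1 if actual == value else 0   # a missing input (None) gives 0
    return coefficient * indicator

def number_of_claims(age, salary, car_location):
    return (132.37
            + numeric_term(7.1, 1, age)
            + numeric_term(0.01, 1, salary)
            + categorical_term(41.1, car_location, "carpark")
            + categorical_term(325.03, car_location, "street"))

# Invented record: age 30, salary 50000, car parked in a carpark.
print(number_of_claims(30, 50000, "carpark"))   # approximately 886.47
# With car_location missing, both indicator terms contribute 0.
print(number_of_claims(30, 50000, None))        # approximately 845.37
```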

Linear Regression Sample

This is a linear regression equation that predicts the number of insurance claims from the values of the independent variables age, salary and car location. Car location is the only categorical variable. Its value attribute can take on two possible values, carpark and street.

number of claims = 132.37 + 7.1*age + 0.01*salary + 41.1*car location('carpark') + 325.03*car location('street')

The corresponding XML model is:

      
<RegressionModel
   functionName="regression"
   modelName="Sample for linear regression"
   modelType="linearRegression"
   targetFieldName="number of claims">

   <RegressionTable intercept="132.37">
       <NumericPredictor name="age" 
                         exponent="1" coefficient="7.1"/>
       <NumericPredictor name="salary" 
                         exponent="1" coefficient="0.01"/>
       <CategoricalPredictor name="car location"
                         value="carpark" coefficient="41.1"/>
       <CategoricalPredictor name="car location"
                         value="street" coefficient="325.03"/>
     </RegressionTable>

</RegressionModel>
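
One possible way to score a record against such a model is to read the RegressionTable with a generic XML parser and accumulate the terms. The sketch below uses Python's standard xml.etree.ElementTree; the function name score_table and the input record are assumptions made for illustration, the fragment is parsed without an XML namespace (as in the sample above), and missing numeric values are not handled.

```python
import xml.etree.ElementTree as ET

PMML_FRAGMENT = """
<RegressionModel functionName="regression"
   modelName="Sample for linear regression"
   modelType="linearRegression"
   targetFieldName="number of claims">
   <RegressionTable intercept="132.37">
       <NumericPredictor name="age" exponent="1" coefficient="7.1"/>
       <NumericPredictor name="salary" exponent="1" coefficient="0.01"/>
       <CategoricalPredictor name="car location" value="carpark" coefficient="41.1"/>
       <CategoricalPredictor name="car location" value="street" coefficient="325.03"/>
   </RegressionTable>
</RegressionModel>
"""

def score_table(table, record):
    """Evaluate one RegressionTable element against a dict of field values."""
    y = float(table.get("intercept"))
    for p in table.findall("NumericPredictor"):
        x = record[p.get("name")]
        y += float(p.get("coefficient")) * x ** int(p.get("exponent"))
    for p in table.findall("CategoricalPredictor"):
        # The indicator is 1 only when the record holds exactly this value.
        if record.get(p.get("name")) == p.get("value"):
            y += float(p.get("coefficient"))
    return y

model = ET.fromstring(PMML_FRAGMENT)
table = model.find("RegressionTable")
record = {"age": 30, "salary": 50000, "car location": "street"}  # invented input
print(score_table(table, record))   # approximately 1170.40
```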

Stepwise Polynomial Regression Sample

This is a stepwise polynomial regression equation that predicts the number of insurance claims from the values of the independent variables salary and car location. Car location is a categorical variable. Its value attribute can take on two possible values, carpark and street.

number of claims = 3216.38 - 0.08*salary + 9.54E-7*salary**2 - 2.67E-12*salary**3 + 93.78*car location('carpark') + 288.75*car location('street')

The corresponding XML model is:

<RegressionModel
   functionName="regression"
   modelName="Sample for stepwise polynomial regression"
   modelType="stepwisePolynomialRegression"
   targetFieldName="number of claims">

   <RegressionTable intercept="3216.38">
       <NumericPredictor name="salary" 
                         exponent="1" coefficient="-0.08"/>
       <NumericPredictor name="salary" 
                         exponent="2" coefficient="9.54E-7"/>
       <NumericPredictor name="salary" 
                         exponent="3" coefficient="-2.67E-12"/>
       <CategoricalPredictor name="car location"
                         value="carpark" coefficient="93.78"/>
       <CategoricalPredictor name="car location"
                         value="street" coefficient="288.75"/>
     </RegressionTable>

</RegressionModel>
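
To illustrate how the exponent attribute enters the calculation, the three NumericPredictor elements for salary can be treated as (exponent, coefficient) pairs. This is a minimal sketch with hard-coded values copied from the sample; the names SALARY_TERMS, LOCATION_TERMS and number_of_claims, and the salary used in the call, are invented for the example.

```python
INTERCEPT = 3216.38

# One (exponent, coefficient) pair per NumericPredictor named "salary".
SALARY_TERMS = [(1, -0.08), (2, 9.54E-7), (3, -2.67E-12)]

# One coefficient per CategoricalPredictor value of "car location".
LOCATION_TERMS = {"carpark": 93.78, "street": 288.75}

def number_of_claims(salary, car_location=None):
    y = INTERCEPT
    for exponent, coefficient in SALARY_TERMS:
        y += coefficient * salary ** exponent
    # A missing or unlisted car location contributes 0.
    y += LOCATION_TERMS.get(car_location, 0.0)
    return y

# 3216.38 - 800 + 95.4 - 2.67 + 93.78
print(number_of_claims(10000, "carpark"))   # approximately 2602.89
```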

Logistic Regression Sample

y_clerical = 46.418 -0.132*age +7.867E-02*work -20.525*sex('0') +0*sex('1') -19.054*minority('0') +0*minority('1')

y_professional = 51.169 -0.302*age +.155*work -21.389*sex('0') +0*sex('1') -18.443*minority('0') + 0*minority('1')

y_trainee = 25.478 -.154*age +.266*work -2.639*sex('0') +0*sex('1') -19.821*minority('0') +0*minority('1')

The model below defines no normalization, so p_j = y_j.

Note that terms such as 0*minority('1') are superfluous, but it is valid to use the same field with different indicator values such as '0' and '1'. However, a RegressionTable must not have multiple numeric predictors with the same name, and it must not have multiple categorical predictors with the same pair of name and value.

The corresponding XML model is:


<RegressionModel
   functionName="classification"
   modelName="Sample for logistic regression"
   modelType="logisticRegression"
   normalizationMethod="none"
   targetFieldName="jobcat">

   <RegressionTable intercept="46.418"
         targetCategory="clerical">
      <NumericPredictor name="age" exponent="1"
                           coefficient="-0.132"/>
      <NumericPredictor name="work" exponent="1"
                           coefficient="7.867E-02"/>

      <CategoricalPredictor name="sex" value="0"
                           coefficient="-20.525"/>
      <CategoricalPredictor name="sex" value="1" 
                           coefficient="0"/>
      <CategoricalPredictor name="minority" value="0"
                           coefficient="-19.054"/>
      <CategoricalPredictor name="minority" value="1" 
                           coefficient="0"/>
    </RegressionTable>

    <RegressionTable intercept="51.169"
           targetCategory="professional">
       <NumericPredictor name="age" exponent="1"
                            coefficient="-0.302"/>
       <NumericPredictor name="work" exponent="1"
                            coefficient=".155"/>

       <CategoricalPredictor name="sex" value="0"
                            coefficient="-21.389"/>
       <CategoricalPredictor name="sex" value="1" 
                            coefficient="0"/>
       <CategoricalPredictor name="minority" value="0"
                            coefficient="-18.443"/>
       <CategoricalPredictor name="minority" value="1" 
                            coefficient="0"/>
    </RegressionTable>

    <RegressionTable intercept="25.478"
             targetCategory="trainee">
       <NumericPredictor name="age" exponent="1"
                            coefficient="-.154"/>
       <NumericPredictor name="work" exponent="1"
                            coefficient=".266"/>

       <CategoricalPredictor name="sex" value="0"
                            coefficient="-2.639"/>
       <CategoricalPredictor name="sex" value="1" 
                            coefficient="0"/>
       <CategoricalPredictor name="minority" value="0"
                            coefficient="-19.821"/>
       <CategoricalPredictor name="minority" value="1" 
                            coefficient="0"/>
    </RegressionTable>

</RegressionModel>
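
The following sketch evaluates the three equations of this sample for one invented record. With normalizationMethod="none" the values y_j are used as they are. Selecting the category with the largest value as the prediction is an assumption of this sketch, not something the text above states; the names TABLES and score are likewise hypothetical.

```python
# Coefficients copied from the three RegressionTable elements above.
TABLES = {
    "clerical": {"intercept": 46.418, "age": -0.132, "work": 7.867E-02,
                 "sex": {"0": -20.525, "1": 0.0},
                 "minority": {"0": -19.054, "1": 0.0}},
    "professional": {"intercept": 51.169, "age": -0.302, "work": 0.155,
                     "sex": {"0": -21.389, "1": 0.0},
                     "minority": {"0": -18.443, "1": 0.0}},
    "trainee": {"intercept": 25.478, "age": -0.154, "work": 0.266,
                "sex": {"0": -2.639, "1": 0.0},
                "minority": {"0": -19.821, "1": 0.0}},
}

def score(table, record):
    """y_j = intercept_j + Sum_i (coefficient_j_i * independent variable_i)."""
    y = table["intercept"]
    y += table["age"] * record["age"]
    y += table["work"] * record["work"]
    for field in ("sex", "minority"):
        # A missing or unlisted categorical value contributes 0.
        y += table[field].get(record.get(field), 0.0)
    return y

record = {"age": 30, "work": 5, "sex": "0", "minority": "0"}  # invented input
ys = {category: score(table, record) for category, table in TABLES.items()}
predicted = max(ys, key=ys.get)   # assumption: the largest y_j wins

for category, y in ys.items():
    print(category, round(y, 5))
print("predicted:", predicted)
```

For this record the three values are roughly 3.272 (clerical), 3.052 (professional) and -0.272 (trainee). Because one of them is negative, simplemax would not yield usable confidences here, whereas softmax would.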