PMML 2.1 - General Regression

Model XSD and Tag Description

 <xs:element name="GeneralRegressionModel">
    <xs:complexType>
      <xs:sequence>
	<xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
	<xs:element minOccurs="0" ref="MiningSchema" />
	<xs:element minOccurs="0" ref="ModelStats" />
	<xs:element ref="ParameterList" />
	<xs:element minOccurs="0" ref="FactorList" />
	<xs:element minOccurs="0" ref="CovariateList" />
	<xs:element minOccurs="0" ref="PPMatrix" />
	<xs:element minOccurs="0" ref="PCovMatrix" />
	<xs:element ref="ParamMatrix" />
	<xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
      </xs:sequence>
      <xs:attribute name="targetVariableName" type="FIELD-NAME" use="required" />
      <xs:attribute name="modelType" use="required">
	<xs:simpleType>
	  <xs:restriction base="xs:string">
	    <xs:enumeration value="regression" />
	    <xs:enumeration value="generalLinear" />
	    <xs:enumeration value="logLinear" />
	    <xs:enumeration value="multinomialLogistic" />
	  </xs:restriction>
	</xs:simpleType>
      </xs:attribute>
      <xs:attribute name="modelName" type="xs:string" />
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
      <xs:attribute name="algorithmName" type="xs:string" />
    </xs:complexType>
  </xs:element>
  <xs:element name="ParameterList">
    <xs:complexType>
      <xs:sequence>
	<xs:element maxOccurs="unbounded" ref="Parameter" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Parameter">
    <xs:complexType>
      <xs:attribute name="name" type="xs:string" use="required" />
      <xs:attribute name="label" type="xs:string" />
    </xs:complexType>
  </xs:element>
  <xs:element name="FactorList">
    <xs:complexType>
      <xs:sequence>
	<xs:element minOccurs="0" maxOccurs="unbounded" ref="Predictor" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="CovariateList">
    <xs:complexType>
      <xs:sequence>
	<xs:element minOccurs="0" maxOccurs="unbounded" ref="Predictor" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="Predictor">
    <xs:complexType>
      <xs:attribute name="name" type="FIELD-NAME" use="required" />
    </xs:complexType>
  </xs:element>
  <xs:element name="PPMatrix">
    <xs:complexType>
      <xs:sequence>
	<xs:element maxOccurs="unbounded" ref="PPCell" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="PPCell">
    <xs:complexType>
      <xs:attribute name="value" type="xs:string" use="required" />
      <xs:attribute name="predictorName" type="FIELD-NAME" use="required" />
      <xs:attribute name="parameterName" type="xs:string" use="required" />
      <xs:attribute name="targetCategory" type="xs:string" />
    </xs:complexType>
  </xs:element>
  <xs:element name="PCovMatrix">
    <xs:complexType>
      <xs:sequence>
	<xs:element maxOccurs="unbounded" ref="PCovCell" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="PCovCell">
    <xs:complexType>
      <xs:attribute name="pRow" type="xs:string" use="required" />
      <xs:attribute name="pCol" type="xs:string" use="required" />
      <xs:attribute name="tRow" type="xs:string" />
      <xs:attribute name="tCol" type="xs:string" />
      <xs:attribute name="value" type="REAL-NUMBER" use="required" />
      <xs:attribute name="targetCategory" type="xs:string" />
    </xs:complexType>
  </xs:element>
  <xs:element name="ParamMatrix">
    <xs:complexType>
      <xs:sequence>
	<xs:element maxOccurs="unbounded" ref="PCell" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="PCell">
    <xs:complexType>
      <xs:attribute name="targetCategory" type="xs:string" use="required" />
      <xs:attribute name="parameterName" type="xs:string" use="required" />
      <xs:attribute name="beta" type="REAL-NUMBER" use="required" />
      <xs:attribute name="df" type="INT-NUMBER" />
    </xs:complexType>
  </xs:element>

GeneralRegressionModel: marks the beginning of a general regression model. As the name suggests, it is intended to support a multitude of regression models.

ParameterList: lists all Parameters. Each Parameter contains a required name and an optional label. Parameter names should be unique within the model and as brief as possible (since Parameter names appear frequently in the document). The label, if present, is meant to give a hint about the Parameter's correlation with the Predictors.

FactorList: list of factor names. Not present if this particular regression flavor does not support factors (e.g., linear regression). If present, the list may or may not be empty. Each name in the list must match a name from the dictionary. The factors are assumed to be categorical variables.

CovariateList: list of covariate names. Not present when there are no covariates. Each name in the list must match a name from the dictionary. The covariates are assumed to be continuous variables.

targetVariableName: name of the target variable (also called the response variable). It must match a name from the dictionary.

modelType: specifies the type of regression model in use. This information is used to select the appropriate mathematical formulas during scoring. The supported values are regression, generalLinear, logLinear, and multinomialLogistic.

modelName: This is a unique identifier specifying the name of the regression model.

PPMatrix: Predictor-to-Parameter correlation matrix. It is a rectangular matrix having a column for each Predictor (factor or covariate) and a row for each Parameter. The matrix is represented as a sequence of cells, each cell containing a number representing the correlation between the Predictor and the Parameter. The cell values are computed as follows:

~ For each Predictor variable v and each Parameter p, the corresponding cell value is missing (empty) if there is no correlation between v and p.

~ If there is a correlation between a covariate Predictor and the Parameter, the cell value is set to the exponent that the covariate is raised to in the dependency expression. Example: assuming variable jobcat is a factor and work is a covariate, the Parameter [jobcat=professional] * work * work is correlated to the covariate work, and the number that should be entered in the cell is 2 because work is present at second power in the expression.

~ If there is a correlation between the factor variable and the Parameter, the cell value is set to the Predictor value that determines the correlation. Example: assuming the categories of the factor variable jobcat are: professional, clerical, skilled, unskilled, the cell in the matrix that corresponds to (jobcat, jobcat=skilled) has a value of skilled.

Empty cells are not required to be present in the exported model file. Any cell that is absent from the XML file when the model is parsed is assumed to be empty. Since empty cells make up a large part of the matrix, this reduces the size of the exported model.

Note the optional targetCategory attribute. It is permitted in order to allow different PPMatrices to be used for different response values. If any PPCell contains this attribute, the expectation is that a full PPMatrix can be reconstructed from the PMML document for that particular response level. That matrix is then used during scoring to obtain the probability (and other statistics) for the response level. By default, all target categories share the PPMatrix.

The targetCategory attribute can thus be used to override the default for some or all target categories.
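
The sparse representation and the targetCategory override can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_pp_matrices` and the tuple layout of `cells` are hypothetical, and a category that has at least one cell carrying targetCategory is assumed to get its full matrix from those cells alone, as described above.

```python
from collections import defaultdict


def build_pp_matrices(cells, parameters, target_categories):
    """Rebuild one PPMatrix per target category from a sparse list of cells.

    cells: iterable of (value, predictor_name, parameter_name, target_category),
           with target_category set to None for cells shared by all categories.
    Returns {category: {parameter: {predictor: value}}}; cells that were not
    exported are simply absent, which stands for "empty".
    """
    shared = {p: {} for p in parameters}
    overrides = defaultdict(lambda: {p: {} for p in parameters})
    for value, predictor, parameter, category in cells:
        matrix = shared if category is None else overrides[category]
        matrix[parameter][predictor] = value
    # A category with its own cells uses its own matrix, all others share one.
    return {
        cat: (overrides[cat] if cat in overrides else shared)
        for cat in target_categories
    }


cells = [
    ("0", "sex", "p1", None),  # shared cell
    ("1", "age", "p7", None),  # shared cell
    ("1", "sex", "p1", "3"),   # hypothetical override for category "3"
]
matrices = build_pp_matrices(cells, ["p0", "p1", "p7"], ["1", "3"])
```

Here the intercept row `p0` stays empty in every matrix, category "1" sees the shared cells, and category "3" sees only the cells that name it.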

PPCell: cell in the PPMatrix. It carries its row name (parameterName), its column name (predictorName), and the value described above.

PCovMatrix: matrix of Parameter estimate covariances. It is made up of PCovCells, each of which is located via row information for the Parameter name (pRow), row information for the target variable value (tRow), column information for the Parameter name (pCol), and column information for the target variable value (tCol). Note that the matrix is symmetric with respect to the main diagonal (interchanging tRow and tCol will not change the value, and the same holds for the pair pRow and pCol). Therefore it is sufficient to export only half of the matrix.
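
A reader that receives only half of the PCovMatrix can mirror each cell when loading it. The sketch below assumes main-diagonal symmetry in the usual sense, that is, swapping the (pRow, tRow) pair with the (pCol, tCol) pair; the function name `build_pcov` and the numeric values are made up for illustration.

```python
def build_pcov(cells):
    """Fill a symmetric covariance lookup from exported PCovCells.

    cells: iterable of (pRow, tRow, pCol, tCol, value).
    Returns a dict keyed by ((pRow, tRow), (pCol, tCol)).
    """
    cov = {}
    for p_row, t_row, p_col, t_col, value in cells:
        row, col = (p_row, t_row), (p_col, t_col)
        cov[row, col] = value
        cov[col, row] = value  # mirrored entry that was not exported
    return cov


# Illustrative values only, not taken from any fitted model.
cov = build_pcov([
    ("p0", "1", "p0", "1", 2.0),
    ("p0", "1", "p1", "1", 0.5),
])
```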

ParamMatrix: Parameter matrix. A table containing the Parameter values along with associated statistics (degrees of freedom). One dimension has the target variable's categories, the other has the Parameter names. The table is represented by specifying each cell. There is no requirement for Parameter names other than that each name should uniquely identify one Parameter.

PCell: cell in the ParamMatrix. The targetCategory and parameterName attributes determine the cell's location in the Parameter matrix. The information contained is beta (the actual Parameter value, required) and df (degrees of freedom, optional).

General Regression Sample

Here is the information about the variables:

Name      Type       Number of    Categories (numeric coding in parentheses)
                     categories

JOBCAT    Target     7            Clerical(1), Office trainee(2), Security officer(3),
                                  College trainee(4), Exempt employee(5),
                                  MBA trainee(6), and Technical(7)
SEX       Factor     2            Males(0), and Females(1)
MINORITY  Factor     2            White(0), and Nonwhite(1)
AGE       Covariate
WORK      Covariate

The Parameter estimates are displayed as follows:

The PPMatrix is:

Parameter                    SEX    MINORITY   AGE    WORK
Intercept
[SEX = 0]                    0
[SEX = 1]                    1
[MINORITY = 0]([SEX = 0])    0      0
[MINORITY = 1]([SEX = 0])    0      1
[MINORITY = 0]([SEX = 1])    1      0
[MINORITY = 1]([SEX = 1])    1      1
AGE                                            1
WORK                                                  1

This mapping of Predictor-to-Parameter combinations is the same for each target variable category. The corresponding XML model is:


<GeneralRegressionModel 
         targetVariableName="jobcat" 
         modelType="multinomialLogistic"
         functionName="regression"
>

<ParameterList>
    <Parameter name="p0" label="Intercept"/>
    <Parameter name="p1" label="[SEX=0]"/>
    <Parameter name="p2" label="[SEX=1]"/>
    <Parameter name="p3" label="[MINORITY=0]([SEX=0])"/>
    <Parameter name="p4" label="[MINORITY=1]([SEX=0])"/>
    <Parameter name="p5" label="[MINORITY=0]([SEX=1])"/>
    <Parameter name="p6" label="[MINORITY=1]([SEX=1])"/>
    <Parameter name="p7" label="age"/>
    <Parameter name="p8" label="work"/>
</ParameterList>

<FactorList>
    <Predictor name="sex" />
    <Predictor name="minority" />
</FactorList>

<CovariateList>
    <Predictor name="age" />
    <Predictor name="work" />
</CovariateList>

<PPMatrix>
    <PPCell value="0" predictorName="sex" parameterName="p1"/>
    <PPCell value="1" predictorName="sex" parameterName="p2"/>
    <PPCell value="0" predictorName="sex" parameterName="p3"/>
    <PPCell value="0" predictorName="sex" parameterName="p4"/>
    <PPCell value="1" predictorName="sex" parameterName="p5"/>
    <PPCell value="1" predictorName="sex" parameterName="p6"/>
    <PPCell value="0" predictorName="minority" parameterName="p3"/>
    <PPCell value="1" predictorName="minority" parameterName="p4"/>
    <PPCell value="0" predictorName="minority" parameterName="p5"/>
    <PPCell value="1" predictorName="minority" parameterName="p6"/>
    <PPCell value="1" predictorName="age" parameterName="p7"/>
    <PPCell value="1" predictorName="work" parameterName="p8"/>
</PPMatrix>


<ParamMatrix>
    <PCell targetCategory="1" parameterName="p0" beta="26.836" df="1"/>
    <PCell targetCategory="1" parameterName="p1" beta="-.719" df="1"/>
    <PCell targetCategory="1" parameterName="p3" beta="-19.214" df="1"/>
    <PCell targetCategory="1" parameterName="p5" beta="-.114" df="1"/>
    <PCell targetCategory="1" parameterName="p7" beta="-.133" df="1"/>
    <PCell targetCategory="1" parameterName="p8" beta="7.885E-02" 
       df="1"/>
    <PCell targetCategory="2" parameterName="p0" beta="31.077" df="1"/>
    <PCell targetCategory="2" parameterName="p1" beta="-.869" df="1"/>
    <PCell targetCategory="2" parameterName="p3" beta="-18.99" df="1"/>
    <PCell targetCategory="2" parameterName="p5" beta="1.01" df="1"/>
    <PCell targetCategory="2" parameterName="p7" beta="-.3" df="1"/>
    <PCell targetCategory="2" parameterName="p8" beta=".152" df="1"/>
    <PCell targetCategory="3" parameterName="p0" beta="6.836" df="1"/>
    <PCell targetCategory="3" parameterName="p1" beta="16.305" df="1"/>
    <PCell targetCategory="3" parameterName="p3" beta="-20.041" df="1"/>
    <PCell targetCategory="3" parameterName="p5" beta="-.73" df="1"/>
    <PCell targetCategory="3" parameterName="p7" beta="-.156" df="1"/>
    <PCell targetCategory="3" parameterName="p8" beta=".267" df="1"/>
    <PCell targetCategory="4" parameterName="p0" beta="8.816" df="1"/>
    <PCell targetCategory="4" parameterName="p1" beta="15.264" df="1"/>
    <PCell targetCategory="4" parameterName="p3" beta="-16.799" df="1"/>
    <PCell targetCategory="4" parameterName="p5" beta="16.48" df="1"/>
    <PCell targetCategory="4" parameterName="p7" beta="-.133" df="1"/>
    <PCell targetCategory="4" parameterName="p8" beta="-.16" df="1"/>
    <PCell targetCategory="5" parameterName="p0" beta="5.862" df="1"/>
    <PCell targetCategory="5" parameterName="p1" beta="16.437" df="1"/>
    <PCell targetCategory="5" parameterName="p3" beta="-17.309" df="1"/>
    <PCell targetCategory="5" parameterName="p5" beta="15.888" df="1"/>
    <PCell targetCategory="5" parameterName="p7" beta="-.105" df="1"/>
    <PCell targetCategory="5" parameterName="p8" beta="6.914E-02" 
        df="1"/>
    <PCell targetCategory="6" parameterName="p0" beta="6.495" df="1"/>
    <PCell targetCategory="6" parameterName="p1" beta="17.297" df="1"/>
    <PCell targetCategory="6" parameterName="p3" beta="-19.098" df="1"/>
    <PCell targetCategory="6" parameterName="p5" beta="16.841" df="1"/>
    <PCell targetCategory="6" parameterName="p7" beta="-.141" df="1"/>
    <PCell targetCategory="6" parameterName="p8" beta="-5.058E-02" 
       df="1"/>
</ParamMatrix>

</GeneralRegressionModel>
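
As a rough illustration of how such a model might be read, the following Python sketch parses a reduced fragment of the sample with the standard library's `xml.etree.ElementTree`. The function name `parse_model` and the shape of the returned dictionary are assumptions made for this example. The fragment omits most cells, uses the 0/1 factor coding from the variable table, and has no XML namespace; a real PMML file may declare one, in which case the element paths would need to be qualified.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<GeneralRegressionModel targetVariableName="jobcat"
        modelType="multinomialLogistic" functionName="regression">
  <ParameterList>
    <Parameter name="p0" label="Intercept"/>
    <Parameter name="p1" label="[SEX=0]"/>
    <Parameter name="p7" label="age"/>
  </ParameterList>
  <FactorList><Predictor name="sex"/></FactorList>
  <CovariateList><Predictor name="age"/></CovariateList>
  <PPMatrix>
    <PPCell value="0" predictorName="sex" parameterName="p1"/>
    <PPCell value="1" predictorName="age" parameterName="p7"/>
  </PPMatrix>
  <ParamMatrix>
    <PCell targetCategory="1" parameterName="p0" beta="26.836" df="1"/>
    <PCell targetCategory="1" parameterName="p1" beta="-.719" df="1"/>
    <PCell targetCategory="1" parameterName="p7" beta="-.133" df="1"/>
  </ParamMatrix>
</GeneralRegressionModel>
"""


def parse_model(xml_text):
    """Collect parameter names, predictor lists, PPMatrix and ParamMatrix."""
    root = ET.fromstring(xml_text)
    parameters = [p.get("name") for p in root.findall("ParameterList/Parameter")]
    factors = [p.get("name") for p in root.findall("FactorList/Predictor")]
    covariates = [p.get("name") for p in root.findall("CovariateList/Predictor")]

    # One (possibly empty) row per Parameter; cells not exported stay absent.
    pp = {name: {} for name in parameters}
    for cell in root.findall("PPMatrix/PPCell"):
        pp[cell.get("parameterName")][cell.get("predictorName")] = cell.get("value")

    # beta[target category][parameter name] = estimate
    beta = {}
    for cell in root.findall("ParamMatrix/PCell"):
        row = beta.setdefault(cell.get("targetCategory"), {})
        row[cell.get("parameterName")] = float(cell.get("beta"))

    return {
        "target": root.get("targetVariableName"),
        "parameters": parameters,
        "factors": factors,
        "covariates": covariates,
        "pp": pp,
        "beta": beta,
    }


model = parse_model(FRAGMENT)
```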

Scoring Algorithm

We will use the above example to illustrate the steps that should be followed in the scoring process. Say the following case (observation) must be scored:

         obs = (sex=1 minority=0 age=25 work=4)
  1. Parse the model file. Reconstruct the dictionary, the PPMatrix, and the Parameter matrix.
  2. To score a case, construct the vector $x$ (of length equal to the number of Parameters in the model) as follows:

    ~ If row $i$ of the PPMatrix is empty, set $x_i = 1$.

    ~ If row $i$ of the PPMatrix is nonempty and corresponds to a factor value or set of factor values, set $x_i$ to 1 if the case being scored matches this row, and to 0 if it does not.

    ~ If row $i$ of the PPMatrix is nonempty and corresponds to a covariate $c$, the row should contain exactly one nonzero entry, in the column corresponding to the independent variable $c$. Let $r$ be the value of this entry; it is the exponent of the covariate, so set $x_i = c^r$ (using the value of $c$ that appears in this case).

  3. For each response category (value of the target variable) $j$, let $\beta_j$ be the vector of Parameter estimates for that response category. (If $k$ is the last response category, remember that by convention $\beta_k = 0$.) Set $r_j = \langle x, \beta_j \rangle$ and $s_j = \exp(r_j)$. The probability that the case falls into category $j$ is then $p_j = s_j / (s_1 + \dots + s_k)$.
  4. If you just want to assign each case to the category into which it has the highest probability of falling, it is not necessary to compute anything after $r_j$; the category whose $r_j$ value is highest is the one you want. If you want to compute the actual probabilities (for instance, in order to know whether a case is being assigned to a category with 51% probability or one with 99%), a small rearrangement avoids overflow. Dividing the numerator and denominator of $p_j$ by $s_j$ shows that $p_j$ is the reciprocal of $\exp(r_1 - r_j) + \dots + \exp(r_k - r_j)$. If $r_i - r_j > 700$ for any $i$, the exponential will overflow; but in this case $p_j$ is so small that it can be set to zero. Underflow in the denominator can be ignored, since the term $\exp(r_j - r_j)$ ensures that the denominator is at least 1.
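
The steps above can be sketched in Python for the sample model. This is a minimal illustration, not a reference implementation: the names `PP`, `BETA`, `design_vector`, and `score` are hypothetical, the PPMatrix rows use the 0/1 factor coding from the variable table, Parameters without a PCell (p2, p4, p6) are assumed to have an estimate of zero, and category 7 is treated as the last category with all estimates zero.

```python
import math

# PPMatrix rows of the sample: parameter -> {predictor: cell value}.
PP = {
    "p0": {},
    "p1": {"sex": "0"},
    "p2": {"sex": "1"},
    "p3": {"sex": "0", "minority": "0"},
    "p4": {"sex": "0", "minority": "1"},
    "p5": {"sex": "1", "minority": "0"},
    "p6": {"sex": "1", "minority": "1"},
    "p7": {"age": "1"},
    "p8": {"work": "1"},
}
FACTORS = {"sex", "minority"}

# Estimates from the sample ParamMatrix, in the order p0, p1, p3, p5, p7, p8.
_NAMES = ["p0", "p1", "p3", "p5", "p7", "p8"]
_ROWS = {
    "1": [26.836, -0.719, -19.214, -0.114, -0.133, 7.885e-02],
    "2": [31.077, -0.869, -18.99, 1.01, -0.3, 0.152],
    "3": [6.836, 16.305, -20.041, -0.73, -0.156, 0.267],
    "4": [8.816, 15.264, -16.799, 16.48, -0.133, -0.16],
    "5": [5.862, 16.437, -17.309, 15.888, -0.105, 6.914e-02],
    "6": [6.495, 17.297, -19.098, 16.841, -0.141, -5.058e-02],
}
BETA = {cat: dict(zip(_NAMES, row)) for cat, row in _ROWS.items()}
BETA["7"] = {}  # last category: all estimates are zero by convention


def design_vector(case):
    """Build x: 1 for an empty row, a 0/1 indicator for factor rows,
    and the covariate raised to the cell's exponent for covariate rows."""
    x = {}
    for param, row in PP.items():
        value = 1.0
        for predictor, cell in row.items():
            if predictor in FACTORS:
                if str(case[predictor]) != cell:
                    value = 0.0
            else:
                value *= float(case[predictor]) ** float(cell)
        x[param] = value
    return x


def score(case):
    """Return (r, p): linear predictors and probabilities per category."""
    x = design_vector(case)
    r = {cat: sum(b * x[name] for name, b in betas.items())
         for cat, betas in BETA.items()}
    p = {}
    for j, rj in r.items():
        diffs = [ri - rj for ri in r.values()]
        # exp() would overflow; p_j is negligible in that case.
        p[j] = 0.0 if max(diffs) > 700 else 1.0 / sum(math.exp(d) for d in diffs)
    return r, p


obs = {"sex": 1, "minority": 0, "age": 25, "work": 4}
r, p = score(obs)
best = max(p, key=p.get)
```

For the observation above, the largest linear predictor belongs to category 2 (Office trainee), so `best` is `"2"`. The loop in `design_vector` multiplies the contributions of all nonempty cells in a row, which also covers rows that combine factors with a covariate; that generalization goes slightly beyond the three cases listed in step 2.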