
PMML 4.4 - Gaussian Process Models

A Gaussian process (GP) is a stochastic process, i.e., a collection of random variables, any finite number of which has a joint Gaussian distribution. This probabilistic representation of a target function can be used for both regression and classification.

In a GP, the target response y is assumed to be a noisy evaluation of the target function, i.e., y = f(x) + ϵ, where ϵ ~ N(0, σϵ2). The prior on the function values is represented as a Gaussian process, i.e., p(f) = GP(m(∙), k(∙,∙)), where m(∙) is a mean function and k(∙,∙) is a kernel function (in this document, we assume the zero mean function m(∙) = 0). The likelihood function of the measured response is represented as a Gaussian distribution, i.e., p(y|f) = N(f, σϵ2I). Given the historical data D = {(xi, yi) | i = 1,…,n}, the posterior distribution on the target value ynew = f(xnew) + ϵ corresponding to the new test input xnew can be represented with a 1-D probability distribution as:

ynew~N(μ(xnew|D),σ2(xnew|D))

The mean μ(xnew|D) and the variance σ2(xnew|D) represent the predicted mean and the variance of ynew corresponding to the test point xnew, each of which can be expressed as:

μ(xnew|D) = kT(K + σϵ2I)-1y1:n
σ2(xnew|D) = k(xnew, xnew) - kT(K + σϵ2I)-1k
where kT = (k(x1,xnew),...,k(xn,xnew)) is the vector containing kernel evaluations between the training inputs x1,…,xn and the test feature vector xnew; K is the covariance matrix (kernel matrix) whose (i,j)th entry is Kij = k(xi,xj); and y1:n is the vector of training target responses.
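The posterior mean and variance above can be sketched in a few lines. The helper below is illustrative (gp_posterior is not part of the PMML specification); it assumes a generic kernel function k(x, z) and a known noise variance σϵ2.

```python
# Minimal sketch of GP posterior prediction with a zero mean function.
# `gp_posterior` and its signature are illustrative, not from the spec.
import numpy as np

def gp_posterior(X, y, x_new, kernel, noise_variance):
    """Return (mean, variance) of y_new at x_new given training data (X, y)."""
    n = X.shape[0]
    # Covariance (kernel) matrix K with K_ij = k(x_i, x_j)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    # Vector k with k_i = k(x_i, x_new)
    k_vec = np.array([kernel(X[i], x_new) for i in range(n)])
    A = K + noise_variance * np.eye(n)   # K + sigma_eps^2 I
    mean = k_vec @ np.linalg.solve(A, y)
    var = kernel(x_new, x_new) - k_vec @ np.linalg.solve(A, k_vec)
    return mean, var
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse of K + σϵ2I, which is both cheaper and numerically safer.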

For Gaussian Process Regression (GPR), note that to perform the scoring procedure for a new data set Dnew = {(xinew, yinew) | i = 1,…,m}, the following are required: the training data set, the type of kernel function used for training, the optimal values of the hyper-parameters of the kernel function, and the noise variance. To represent a GP regression model, the following information should be stored in the PMML file:

• Training data D = {(xi, yi) | i = 1,…,n}. Training data can be represented as an input data matrix X, whose ith row represents the ith input xi, and the target response vector y = {y1,…,yn}.
• The type of a kernel function used to describe the underlying structure of a target function.
• The hyper-parameters for the specified kernel function.
• The noise variance representing the error magnitude in the response.

<xs:element name="GaussianProcessModel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="MiningSchema"/>
<xs:element ref="Output" minOccurs="0"/>
<xs:element ref="ModelStats" minOccurs="0"/>
<xs:element ref="ModelExplanation" minOccurs="0"/>
<xs:element ref="Targets" minOccurs="0"/>
<xs:element ref="LocalTransformations" minOccurs="0"/>
<xs:sequence>
<xs:choice>
<xs:element ref="RadialBasisKernel"/>
<xs:element ref="ARDSquaredExponentialKernel"/>
<xs:element ref="AbsoluteExponentialKernel"/>
<xs:element ref="GeneralizedExponentialKernel"/>
</xs:choice>
</xs:sequence>
<xs:element ref="TrainingInstances"/>
<xs:element ref="ModelVerification" minOccurs="0"/>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="modelName" type="xs:string" use="optional"/>
<xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
<xs:attribute name="algorithmName" type="xs:string"/>
<xs:attribute name="optimizer" type="xs:string" use="optional"/>
<xs:attribute name="isScorable" type="xs:boolean" default="true"/>
</xs:complexType>
</xs:element>

Similar to other model elements, the following attributes are specified for the Gaussian Process model:

• modelName: an optional attribute specifying a name for the Gaussian Process model.
• algorithmName: can be any string describing the algorithm that was used while creating the Gaussian Process model.
• functionName: can be either classification or regression, depending on the Gaussian process target type.
• isScorable: This attribute indicates if the model is valid for scoring. If this attribute is true or if it is missing, then the model should be processed normally. However, if the attribute is false, then the model producer has indicated that this model is intended for information purposes only and should not be used to generate results. In order to be valid PMML, all required elements and attributes must be present, even for non-scoring models. For more details, see General Structure.

The attribute optimizer specifies the optimization algorithm used in training the Gaussian process model.

Since Gaussian processes require numeric inputs, which may need to be normalized, transformations are often applied. Such transformations can be performed in the LocalTransformations element.

Kernel Types

A kernel function defines the function space that GP regression can represent, and thus impacts the accuracy of the prediction model. A wide variety of kernel functions can be used. Here, we describe the four kernel functions that are commonly used for GP regression:

RadialBasisKernel: squared exponential basis function (p is the number of input variables)
k(x,z)=γ exp(-1/2 (Sum[(i=1)to p]( |xi-zi| /λ)2))

ARDSquaredExponentialKernel: Automatic Relevance Determination (ARD) squared exponential basis function. This covariance function is the squared exponential kernel function with a separate lengthscale hyper-parameter λi for each predictor, i.e., for each input dimension xi.
k(x,z)=γ exp(-1/2 (Sum[(i=1)to p]( |xi-zi| /λi)2))

AbsoluteExponentialKernel: absolute exponential basis function
k(x,z)=γ exp(-1/2 (Sum[(i=1)to p]( |xi-zi| /λi)))

GeneralizedExponentialKernel: generalized exponential basis function
k(x,z)=γ exp(-1/2 (Sum[(i=1)to p]( |xi-zi| /λi)degree))
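The four kernel functions can be written directly from the formulas above. The function names below are illustrative; the gamma, lambdas, and degree parameters correspond to the gamma, Lambda, and degree attributes of the kernel schema elements.

```python
# Sketch of the four kernel functions in plain Python; names are illustrative.
# The noise variance term is handled separately when building K + sigma_eps^2 I.
import math

def radial_basis(x, z, gamma=1.0, lam=1.0):
    # Squared exponential with a single lengthscale lambda for all dimensions
    return gamma * math.exp(-0.5 * sum(((xi - zi) / lam) ** 2 for xi, zi in zip(x, z)))

def ard_squared_exponential(x, z, gamma=1.0, lambdas=None):
    # Separate lengthscale lambda_i per input dimension
    return gamma * math.exp(-0.5 * sum(((xi - zi) / li) ** 2 for xi, zi, li in zip(x, z, lambdas)))

def absolute_exponential(x, z, gamma=1.0, lambdas=None):
    # Absolute differences, not squared
    return gamma * math.exp(-0.5 * sum(abs(xi - zi) / li for xi, zi, li in zip(x, z, lambdas)))

def generalized_exponential(x, z, gamma=1.0, lambdas=None, degree=1.0):
    # Raises the scaled absolute differences to an arbitrary power `degree`
    return gamma * math.exp(-0.5 * sum((abs(xi - zi) / li) ** degree for xi, zi, li in zip(x, z, lambdas)))
```

Note that generalized_exponential with degree = 1 coincides with absolute_exponential, and with degree = 2 it coincides with ard_squared_exponential.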

When GP regression is used to fit inputs and noisy outputs, the noise variance should be specified. The noise term is assumed to follow a Gaussian distribution and to be independent and identically distributed. Note that constructing the covariance matrix K + σϵ2I is equivalent to using the kernel function augmented with the noise variance, i.e., k(x,z) = k(x,z) + σϵ2 if x = z. Thus, the noise variance term σϵ2 is included in the hyper-parameters for each kernel function.

The aforementioned four kernel functions are represented by the following PMML schema elements:

<xs:element name="RadialBasisKernel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="description" type="xs:string" use="optional"/>
<xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
<xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
<xs:attribute name="lambda" type="REAL-NUMBER" use="optional" default="1"/>
</xs:complexType>
</xs:element>

<xs:element name="ARDSquaredExponentialKernel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Lambda" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="description" type="xs:string" use="optional"/>
<xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
<xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
</xs:complexType>
</xs:element>

<xs:element name="AbsoluteExponentialKernel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Lambda" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="description" type="xs:string" use="optional"/>
<xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
<xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
</xs:complexType>
</xs:element>

<xs:element name="GeneralizedExponentialKernel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="Lambda" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="description" type="xs:string" use="optional"/>
<xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
<xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
<xs:attribute name="degree" type="REAL-NUMBER" use="optional" default="1"/>
</xs:complexType>
</xs:element>

The element Lambda is defined as follows:

<xs:element name="Lambda">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:group ref="REAL-ARRAY"/>
</xs:sequence>
</xs:complexType>
</xs:element>

Additional information about the kernel can be entered in the free-text attribute description.

Training Instances

The element TrainingInstances refers to the element defined in the k-Nearest Neighbors model documentation. It encapsulates the definition of the fields included in the training instances as well as their values.

Definitions

• isTransformed: Used as a flag to determine whether or not the training instances have already been transformed. If isTransformed is "false", it indicates that the training data has not been transformed yet. If "true", it indicates that it has already been transformed.
• recordCount: Defines the number of training instances or records. This number needs to match the number of instances defined in the element InlineTable or in the external data if TableLocator is used.
• fieldCount: Defines the number of fields (features + targets). This number needs to match the number of InstanceField elements defined under InstanceFields.

The InstanceFields and InstanceField elements likewise refer to the elements defined in the k-Nearest Neighbors model documentation.

The InstanceFields element serves as an envelope for all the fields included in the training instances. It encapsulates InstanceField elements.

Definitions

• field: Contains the name of a DataField or a DerivedField (in case isTransformed is set to "true"). Can also contain the name of the case ID variable.
• column: Defines the name of the tag or column used by element InlineTable. This attribute is required if element InlineTable is used to represent training data.

The element TrainingInstances offers a choice of PMML elements when it comes to representing the training data (feature vectors and class labels). These are:

• InlineTable: Allows for the training instances to be part of the PMML document itself. When used in k-NN models, a row in an InlineTable should contain a sequence of elements representing the input fields. Each tag or column, as given by attribute column of element InstanceField, corresponds to a field name. The content of an element defines the field value.
• TableLocator: Allows for the training data to be stored in an external table. Such a table can then be referenced by the TableLocator element which implements a kind of URL for tables.

For more information on elements InlineTable and TableLocator, refer to Taxonomy.

Example Model

In this example, the training procedure of a GP regression model, the representation of the trained model in a PMML file, and the scoring procedure using the represented model are introduced. To illustrate these procedures, two training data points with input vectors {x1=(1,3), x2=(2,6)} and target responses {y1=1, y2=2} are used. The ARD squared exponential kernel function k(x,z)=γ exp(-1/2 (Sum[(i=1)to p]( |xi-zi| /λi)2)) is used, and its optimum hyper-parameters θ* = (γ*, λ*, σϵ*) are computed using the training data points.

Using the two training points with input vectors {x1 = (1,3), x2 = (2,6)} and target responses {y1 = 1, y2 = 2}, the optimum hyper-parameters θ* = (γ*, λ*, σϵ*) for the ARD squared exponential kernel function are determined as γ* = 2.4890, λ* = (1.5164, 59.3113) and σϵ* = 0.1051. The noise variance is (σϵ*)2 = (0.1051)2 = 0.0110.

This information is now represented in the following PMML file:

<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
<DataField dataType="double" name="x1" optype="continuous"/>
<DataField dataType="double" name="x2" optype="continuous"/>
<DataField dataType="double" name="y1" optype="continuous"/>
<GaussianProcessModel modelName="Gaussian Process Model" functionName="regression">
<MiningSchema>
<MiningField name="x1" usageType="active"/>
<MiningField name="x2" usageType="active"/>
<MiningField name="y1" usageType="predicted"/>
</MiningSchema>
<Output>
<OutputField dataType="double" feature="predictedValue" name="MeanValue" optype="continuous"/>
<OutputField dataType="double" feature="standardDeviation" name="StandardDeviation" optype="continuous"/>
</Output>
<ARDSquaredExponentialKernel gamma="2.4890" noiseVariance="0.0110">
<Lambda>
<Array n="2" type="real">1.5164 59.3113</Array>
</Lambda>
</ARDSquaredExponentialKernel>
<TrainingInstances recordCount="2" fieldCount="3" isTransformed="false">
<InstanceFields>
<InstanceField field="x1" column="x1"/>
<InstanceField field="x2" column="x2"/>
<InstanceField field="y1" column="y1"/>
</InstanceFields>
<InlineTable>
<row>
<x1>1</x1>
<x2>3</x2>
<y1>1</y1>
</row>
<row>
<x1>2</x1>
<x2>6</x2>
<y1>2</y1>
</row>
</InlineTable>
</TrainingInstances>
</GaussianProcessModel>
</PMML>

The optimized hyper-parameters are stored under the ARDSquaredExponentialKernel element. Note that noiseVariance in the PMML file denotes (σϵ*)2 = (0.1051)2 = 0.0110.
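A consumer can recover this information from the PMML document with any XML parser. The sketch below (read_gp_model is a hypothetical helper, not part of any PMML library) uses Python's standard xml.etree.ElementTree and handles the PMML namespace explicitly.

```python
# Sketch: pulling the kernel hyper-parameters and training instances back out
# of a GP PMML document like the example above. `read_gp_model` is illustrative.
import xml.etree.ElementTree as ET

PMML_NS = "{http://www.dmg.org/PMML-4_4}"

def read_gp_model(pmml_text):
    root = ET.fromstring(pmml_text)
    kernel = root.find(f".//{PMML_NS}ARDSquaredExponentialKernel")
    gamma = float(kernel.get("gamma"))
    noise_variance = float(kernel.get("noiseVariance"))
    # The Lambda element holds a whitespace-separated real Array
    lambdas = [float(v) for v in
               kernel.find(f"{PMML_NS}Lambda/{PMML_NS}Array").text.split()]
    # Each InlineTable row maps column tags (x1, x2, y1, ...) to values
    rows = []
    for row in root.iter(f"{PMML_NS}row"):
        rows.append({child.tag.replace(PMML_NS, ""): float(child.text)
                     for child in row})
    return gamma, noise_variance, lambdas, rows
```

A full reader would also dispatch on the other three kernel elements and honor isTransformed and TableLocator; this sketch covers only the inline ARD case shown above.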

Scoring procedure, example

We assume that the training data, the kernel function type, and its optimum hyper-parameters are retrieved from the given PMML file. Considering the same example as above, the scoring procedure at the new test input xnew = (1, 4) is illustrated. The target response ynew corresponding to the new input xnew follows the 1-D posterior distribution:

ynew~N(μ(xnew|D),σ2(xnew|D))

where:

μ(xnew|D) = kT(K + σϵ2I)-1y1:n
σ2(xnew|D) = k(xnew, xnew) - kT(K + σϵ2I)-1k

With the optimized hyper-parameters θ* = (γ* = 2.4890, λ* = (1.5164, 59.3113), σϵ* = 0.1051) and the training inputs x1 and x2, the covariance matrix augmented with the noise variance, K + (σϵ*)2I, is constructed as:

K11 = k(x1,x1) = 2.4890*exp(0) = 2.4890
K12 = k(x1,x2) = 2.4890*exp(-1/2 ((1-2)2/(1.5164)2 + (3-6)2/(59.3113)2)) = 2.0000
K21 = k(x2,x1) = 2.4890*exp(-1/2 ((2-1)2/(1.5164)2 + (6-3)2/(59.3113)2)) = 2.0000
K22 = k(x2,x2) = 2.4890*exp(0) = 2.4890

which gives

K + (σϵ*)2I = [2.4890 2.0000; 2.0000 2.4890] + (0.1051)2 [1 0; 0 1] = [2.5000 2.0000; 2.0000 2.5000]

When xnew=(1,4), the vector kT containing the kernel evaluations between the training inputs and test input is computed as

kT = (k(x1,xnew ),k(x2,xnew)) = (2.4886,2.0014)

Then, the predicted mean μ(xnew|D) and variance σ2(xnew|D) of ynew at xnew=(1,4) can be computed as

μ(xnew|D) = (2.4886, 2.0014) [2.5000 2.0000; 2.0000 2.5000]-1 (1; 2) = 1.0095
σ2(xnew|D) = 2.4890 - (2.4886, 2.0014) [2.5000 2.0000; 2.0000 2.5000]-1 (2.4886; 2.0014) = 0.0116

Based on the estimated mean value and the variance, the 95% confidence bound for ynew can be expressed as: [μ(xnew|D) - 1.96σ(xnew|D),μ(xnew|D) + 1.96σ(xnew|D)] = [0.7984,1.2206].
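The 2x2 arithmetic of this scoring example can be checked by hand. The sketch below reproduces it with the rounded values quoted in the text, so small differences in the last digit are expected.

```python
# Reproducing the worked scoring example above for the 2x2 case by hand,
# using the rounded values from the text (no linear-algebra library needed).
import math

K = [[2.5000, 2.0000], [2.0000, 2.5000]]  # K + (sigma_eps*)^2 I
k_vec = [2.4886, 2.0014]                  # kernel evaluations at x_new = (1, 4)
y = [1.0, 2.0]                            # training target responses

# Explicit 2x2 inverse via the determinant
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
K_inv = [[K[1][1] / det, -K[0][1] / det],
         [-K[1][0] / det, K[0][0] / det]]

def mat_vec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

alpha = mat_vec(K_inv, y)
mean = k_vec[0] * alpha[0] + k_vec[1] * alpha[1]      # mu(x_new | D)
beta = mat_vec(K_inv, k_vec)
var = 2.4890 - (k_vec[0] * beta[0] + k_vec[1] * beta[1])  # sigma^2(x_new | D)
lo, hi = mean - 1.96 * math.sqrt(var), mean + 1.96 * math.sqrt(var)
```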
