PMML 4.3 - Gaussian Process Models

A Gaussian process (GP) is a stochastic process, i.e., a collection of random variables, any finite number of which has a joint Gaussian distribution. This probabilistic representation of a target function can be used for both regression and classification. In GP, the target response $y$ is assumed to be a noisy evaluation of the target function, i.e., $y = f(x) + \epsilon$, where $\epsilon \sim N(0, \sigma_\epsilon^2)$. The prior on the function values is represented as a Gaussian process, i.e., $p(f) = GP(m(\cdot), k(\cdot, \cdot))$, where $m(\cdot)$ is a mean function and $k(\cdot, \cdot)$ is a kernel function (in this document, we assume the zero mean function $m(\cdot) = 0$). The likelihood function of the measured response is represented as a Gaussian distribution, i.e., $p(y|f) = N(f, \sigma_\epsilon^2)$.

Given the historical data $D = \{(x_i, y_i) \mid i = 1, \ldots, n\}$, the posterior distribution of the target value $y_{new} = f(x_{new}) + \epsilon$ corresponding to a new test input $x_{new}$ is the one-dimensional probability distribution

$$y_{new} \sim N(\mu(x_{new}|D), \sigma^2(x_{new}|D))$$

The mean $\mu(x_{new}|D)$ and the variance $\sigma^2(x_{new}|D)$ represent the predicted mean and variance of $y_{new}$ at the test point $x_{new}$, and can be expressed as

$$\mu(x_{new}|D) = k^T (K + \sigma_\epsilon^2 I)^{-1} y_{1:n}$$

$$\sigma^2(x_{new}|D) = k(x_{new}, x_{new}) + \sigma_\epsilon^2 - k^T (K + \sigma_\epsilon^2 I)^{-1} k$$

where $k^T = (k(x_1, x_{new}), \ldots, k(x_n, x_{new}))$ is the vector containing the kernel evaluations between the training inputs $x_1, \ldots, x_n$ and the test feature vector $x_{new}$; $K$ is the covariance matrix (kernel matrix) whose $(i,j)$-th entry is $K_{ij} = k(x_i, x_j)$; and $y_{1:n}$ is the vector of training target responses.

For Gaussian Process Regression (GPR), note that to perform the scoring procedure on a new data set $D_{new} = \{(x_i^{new}, y_i^{new}) \mid i = 1, \ldots, m\}$, the following are required: the training data set, the type of kernel function used for training, the optimal values of the hyper-parameters of the kernel function, and the noise variance.
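The two prediction equations above translate directly into code. The following is a minimal sketch of the scoring computation, assuming NumPy; the function gp_posterior and its signature are illustrative only and are not part of the PMML specification:

import numpy as np

def gp_posterior(kernel, X_train, y_train, x_new, noise_variance):
    """Posterior mean and variance of y_new at a single test input x_new.

    kernel         : callable k(x, z) -> float
    X_train        : (n, p) array of training inputs
    y_train        : (n,) array of training target responses
    x_new          : (p,) test input
    noise_variance : sigma_eps^2
    """
    n = len(X_train)
    # Covariance (kernel) matrix K with K_ij = k(x_i, x_j), augmented
    # with the noise variance: K + sigma_eps^2 * I.
    K = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
    K_noisy = K + noise_variance * np.eye(n)
    # Vector of kernel evaluations between the training inputs and the test input.
    k_vec = np.array([kernel(xi, x_new) for xi in X_train])
    mean = k_vec @ np.linalg.solve(K_noisy, y_train)
    var = kernel(x_new, x_new) + noise_variance - k_vec @ np.linalg.solve(K_noisy, k_vec)
    return mean, var

To represent a GP regression model, the following information should be stored in the PMML file: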
<xs:element name="GaussianProcessModel"> <xs:complexType> <xs:sequence> <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/> <xs:element ref="MiningSchema"/> <xs:element ref="Output" minOccurs="0"/> <xs:element ref="ModelStats" minOccurs="0"/> <xs:element ref="ModelExplanation" minOccurs="0"/> <xs:element ref="Targets" minOccurs="0"/> <xs:element ref="LocalTransformations" minOccurs="0"/> <xs:sequence> <xs:choice> <xs:element ref="RadialBasisKernel"/> <xs:element ref="ARDSquaredExponentialKernel"/> <xs:element ref="AbsoluteExponentialKernel"/> <xs:element ref="GeneralizedExponentialKernel"/> </xs:choice> </xs:sequence> <xs:element ref="TrainingInstances"/> <xs:element ref="ModelVerification" minOccurs="0"/> <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/> </xs:sequence> <xs:attribute name="modelName" type="xs:string" use="optional"/> <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/> <xs:attribute name="algorithmName" type="xs:string"/> <xs:attribute name="optimizer" type="xs:string" use="optional"/> <xs:attribute name="isScorable" type="xs:boolean" default="true"/> </xs:complexType> </xs:element> Similar to other model elements, the following attributes are specified for Gaussian Process model:
The attribute optimizer specifies the optimization algorithm used in training the Gaussian process model. Since Gaussian processes require numeric attributes, which may need to be normalized, transformations are often applied; such transformations can be performed in the LocalTransformations element.

Kernel Types

A kernel function defines the function space that GP regression can represent, and thus impacts the accuracy of the prediction model. A wide variety of kernel functions can be used. Four kernel functions commonly used for GP regression are described here: the radial basis kernel, the ARD squared exponential kernel, the absolute exponential kernel, and the generalized exponential kernel.
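As an illustration of how such a kernel is evaluated, the ARD squared exponential kernel, using the functional form given in the example later in this document, can be sketched as follows (the function name and NumPy usage are illustrative, not part of the specification):

import numpy as np

def ard_squared_exponential(x, z, gamma, lambdas):
    """ARD squared exponential kernel:
    k(x, z) = gamma * exp(-1/2 * sum_i (|x_i - z_i| / lambda_i)^2).

    Each input dimension i has its own length scale lambda_i
    (automatic relevance determination).
    """
    x, z, lambdas = np.asarray(x), np.asarray(z), np.asarray(lambdas)
    return gamma * np.exp(-0.5 * np.sum(((x - z) / lambdas) ** 2))

With the hyper-parameters of the example model below, ard_squared_exponential((1, 3), (2, 6), gamma=2.4890, lambdas=(1.5164, 59.3113)) evaluates to approximately 2.0000, the off-diagonal entry of the covariance matrix constructed in the scoring example.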
When GP regression is used to fit inputs to noisy outputs, the noise variance must be specified. The noise term is assumed to follow a Gaussian distribution and to be independent and identically distributed. Note that constructing the covariance matrix $K + \sigma_\epsilon^2 I$ is equivalent to using the kernel function augmented with the noise variance, i.e., $\tilde{k}(x, z) = k(x, z) + \sigma_\epsilon^2$ if $x = z$, and $\tilde{k}(x, z) = k(x, z)$ otherwise. Thus, the noise variance term $\sigma_\epsilon^2$ is included among the hyper-parameters of each kernel function. The aforementioned four kernel functions are expressed with the following PMML schema:

<xs:element name="RadialBasisKernel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="description" type="xs:string" use="optional"/>
    <xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="lambda" type="REAL-NUMBER" use="optional" default="1"/>
  </xs:complexType>
</xs:element>

<xs:element name="ARDSquaredExponentialKernel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Lambda" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="description" type="xs:string" use="optional"/>
    <xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
  </xs:complexType>
</xs:element>

<xs:element name="AbsoluteExponentialKernel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Lambda" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="description" type="xs:string" use="optional"/>
    <xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
  </xs:complexType>
</xs:element>

<xs:element name="GeneralizedExponentialKernel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Lambda" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="description" type="xs:string" use="optional"/>
    <xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="degree" type="REAL-NUMBER" use="optional" default="1"/>
  </xs:complexType>
</xs:element>

The element Lambda is defined as follows:

<xs:element name="Lambda">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="REAL-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Additional information about the kernel can be entered in the free-text attribute description.

Training Instances

The element TrainingInstances refers to the element defined in the KNN model documentation. It encapsulates the definition of the fields included in the training instances as well as their values.
The InstanceFields and InstanceField elements likewise refer to the elements defined in the KNN model documentation. The InstanceFields element serves as an envelope for all the fields included in the training instances; it encapsulates InstanceField elements.
The element TrainingInstances offers a choice of PMML elements for representing the training data (feature vectors and class labels). These are InlineTable and TableLocator.
For more information on the elements InlineTable and TableLocator, refer to Taxonomy.

Example Model

This example introduces the training procedure of a GP regression model, the representation of the trained GP regression model in a PMML file, and the scoring procedure using the represented model. To illustrate these procedures, two training data points with input vectors $\{x_1 = (1, 3), x_2 = (2, 6)\}$ and target responses $\{y_1 = 1, y_2 = 2\}$ are used, together with the ARD squared exponential kernel function

$$k(x, z) = \gamma \exp\left(-\frac{1}{2} \sum_{i=1}^{p} \left(\frac{|x_i - z_i|}{\lambda_i}\right)^2\right)$$

From the training data points, the optimum hyper-parameters $\theta^* = (\gamma^*, \lambda^*, \sigma_\epsilon^*)$ are determined as $\gamma^* = 2.4890$, $\lambda^* = (1.5164, 59.3113)$, and $\sigma_\epsilon^* = 0.1051$. The noise variance is $(\sigma_\epsilon^*)^2 = (0.1051)^2 = 0.0110$. This information is represented in the following PMML file:

<PMML xmlns="https://www.dmg.org/PMML-4_3" version="4.3">
  <Header copyright="DMG.org"/>
  <DataDictionary numberOfFields="3">
    <DataField dataType="double" name="x1" optype="continuous"/>
    <DataField dataType="double" name="x2" optype="continuous"/>
    <DataField dataType="double" name="y1" optype="continuous"/>
  </DataDictionary>
  <GaussianProcessModel modelName="Gaussian Process Model" functionName="regression">
    <MiningSchema>
      <MiningField name="x1" usageType="active"/>
      <MiningField name="x2" usageType="active"/>
      <MiningField name="y1" usageType="predicted"/>
    </MiningSchema>
    <Output>
      <OutputField dataType="double" feature="predictedValue" name="MeanValue" optype="continuous"/>
      <OutputField dataType="double" feature="predictedValue" name="StandardDeviation" optype="continuous"/>
    </Output>
    <ARDSquaredExponentialKernel gamma="2.4890" noiseVariance="0.0110">
      <Lambda>
        <Array n="2" type="real">1.5164 59.3113</Array>
      </Lambda>
    </ARDSquaredExponentialKernel>
    <TrainingInstances recordCount="2" fieldCount="3" isTransformed="false">
      <InstanceFields>
        <InstanceField field="x1" column="x1"/>
        <InstanceField field="x2" column="x2"/>
        <InstanceField field="y1" column="y1"/>
      </InstanceFields>
      <InlineTable>
        <row>
          <x1>1</x1>
          <x2>3</x2>
          <y1>1</y1>
        </row>
        <row>
          <x1>2</x1>
          <x2>6</x2>
          <y1>2</y1>
        </row>
      </InlineTable>
    </TrainingInstances>
  </GaussianProcessModel>
</PMML>

The optimized hyper-parameters are stored under the element ARDSquaredExponentialKernel. Note that the noiseVariance attribute in the PMML file denotes $(\sigma_\epsilon^*)^2 = (0.1051)^2 = 0.0110$.

Scoring Procedure, Example

We assume that the training data, the kernel function type, and its optimum hyper-parameters are retrieved from the given PMML file. Continuing the same example as above, the scoring procedure at the new test input $x_{new} = (1, 4)$ is illustrated.
The target response $y_{new}$ corresponding to the new input $x_{new}$ follows the one-dimensional posterior distribution

$$y_{new} \sim N(\mu(x_{new}|D), \sigma^2(x_{new}|D))$$

where

$$\mu(x_{new}|D) = k^T (K + \sigma_\epsilon^2 I)^{-1} y_{1:n}$$

$$\sigma^2(x_{new}|D) = k(x_{new}, x_{new}) + \sigma_\epsilon^2 - k^T (K + \sigma_\epsilon^2 I)^{-1} k$$

With the optimized hyper-parameters $\theta^* = (\gamma^* = 2.4890, \lambda^* = (1.5164, 59.3113), \sigma_\epsilon^* = 0.1051)$ and the training inputs $x_1$ and $x_2$, the covariance matrix augmented with the noise variance, $K + (\sigma_\epsilon^*)^2 I$, is constructed. For example,

$$K_{11} = k(x_1, x_1) = 2.4890 \cdot \exp(0) = 2.4890$$

which gives

$$K + (\sigma_\epsilon^*)^2 I = \begin{pmatrix} 2.4890 & 2.0000 \\ 2.0000 & 2.4890 \end{pmatrix} + (0.1051)^2 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 2.5000 & 2.0000 \\ 2.0000 & 2.5000 \end{pmatrix}$$

When $x_{new} = (1, 4)$, the vector $k^T$ containing the kernel evaluations between the training inputs and the test input is computed as

$$k^T = (k(x_1, x_{new}), k(x_2, x_{new})) = (2.4886, 2.0014)$$

Then the predicted mean $\mu(x_{new}|D)$ and variance $\sigma^2(x_{new}|D)$ of $y_{new}$ at $x_{new} = (1, 4)$ can be computed as

$$\mu(x_{new}|D) = (2.4886, 2.0014) \begin{pmatrix} 2.5000 & 2.0000 \\ 2.0000 & 2.5000 \end{pmatrix}^{-1} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = 1.0095$$

$$\sigma^2(x_{new}|D) = 2.4890 + 0.0110 - (2.4886, 2.0014) \begin{pmatrix} 2.5000 & 2.0000 \\ 2.0000 & 2.5000 \end{pmatrix}^{-1} \begin{pmatrix} 2.4886 \\ 2.0014 \end{pmatrix} = 0.0226$$

Based on the estimated mean value and the variance, the 95% confidence bound for $y_{new}$ can be expressed as

$$[\mu(x_{new}|D) - 1.96\,\sigma(x_{new}|D), \; \mu(x_{new}|D) + 1.96\,\sigma(x_{new}|D)] = [0.7148, 1.3042]$$
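For completeness, the following sketch reproduces this scoring procedure numerically, assuming NumPy; the variable names are illustrative and not part of the specification:

import numpy as np

# Hyper-parameters as stored in the example PMML file.
gamma = 2.4890
lambdas = np.array([1.5164, 59.3113])
noise_variance = 0.0110  # (sigma_eps*)^2 = (0.1051)^2

X_train = np.array([[1.0, 3.0], [2.0, 6.0]])
y_train = np.array([1.0, 2.0])
x_new = np.array([1.0, 4.0])

def kernel(x, z):
    # ARD squared exponential kernel.
    return gamma * np.exp(-0.5 * np.sum(((x - z) / lambdas) ** 2))

# Covariance matrix augmented with the noise variance: K + (sigma_eps*)^2 I.
K = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
K_noisy = K + noise_variance * np.eye(len(X_train))

# Kernel evaluations between the training inputs and the test input.
k_vec = np.array([kernel(xi, x_new) for xi in X_train])

mean = k_vec @ np.linalg.solve(K_noisy, y_train)   # ~1.0095
var = (kernel(x_new, x_new) + noise_variance
       - k_vec @ np.linalg.solve(K_noisy, k_vec))  # ~0.0226
lower = mean - 1.96 * np.sqrt(var)
upper = mean + 1.96 * np.sqrt(var)
print(mean, var, [lower, upper])

Up to rounding in the intermediate values, this reproduces the predicted mean of 1.0095, the variance of 0.0226, and the 95% confidence bound of approximately [0.715, 1.304].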