PMML 4.3 - Gaussian Process Models

A Gaussian process (GP) is a stochastic process, i.e., a collection of random variables, any finite number of which has a joint Gaussian distribution. This probabilistic representation of a target function can be used for both regression and classification. In GP, the target response $y$ is assumed to be a noisy evaluation of the target function, i.e., $y = f(x) + \epsilon$, where $\epsilon \sim N(0, \sigma_\epsilon^2)$. The prior on the function values is represented as a Gaussian process, i.e., $p(f) = GP(m(\cdot), k(\cdot, \cdot))$, where $m(\cdot)$ is a mean function and $k(\cdot, \cdot)$ is a kernel function (in this document, we assume the zero mean function $m(\cdot) = 0$). The likelihood function of the measured response is represented as a Gaussian distribution, i.e., $p(y|f) = N(f, \sigma_\epsilon^2)$.

Given the historical data $D = \{(x_i, y_i) \mid i = 1, \ldots, n\}$, the posterior distribution of the target value $y_{new} = f(x_{new}) + \epsilon$ corresponding to a new test input $x_{new}$ is the one-dimensional probability distribution

$$y_{new} \sim N(\mu(x_{new}|D), \sigma^2(x_{new}|D))$$

The mean $\mu(x_{new}|D)$ and the variance $\sigma^2(x_{new}|D)$ represent the predicted mean and variance of $y_{new}$ at the test point $x_{new}$, and can be expressed as

$$\mu(x_{new}|D) = k^T (K + \sigma_\epsilon^2 I)^{-1} y_{1:n}$$

$$\sigma^2(x_{new}|D) = k(x_{new}, x_{new}) + \sigma_\epsilon^2 - k^T (K + \sigma_\epsilon^2 I)^{-1} k$$

where $k^T = (k(x_1, x_{new}), \ldots, k(x_n, x_{new}))$ is the vector containing the kernel evaluations between the training inputs $x_1, \ldots, x_n$ and the test feature vector $x_{new}$; $K$ is the covariance matrix (kernel matrix) whose $(i,j)$-th entry is $K_{ij} = k(x_i, x_j)$; and $y_{1:n}$ is the vector of training target responses.

For Gaussian Process Regression (GPR), note that to perform the scoring procedure on a new data set $D_{new} = \{(x_i^{new}, y_i^{new}) \mid i = 1, \ldots, m\}$, the following are required: the training data set, the type of kernel function used for training, the optimal values of the hyper-parameters of the kernel function, and the noise variance.
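The two prediction equations above translate directly into code. The following is a minimal sketch of the scoring computation, assuming NumPy; the function gp_posterior and its signature are illustrative only and are not part of the PMML specification:

import numpy as np

def gp_posterior(kernel, X_train, y_train, x_new, noise_variance):
    """Posterior mean and variance of y_new at a single test input x_new.

    kernel         : callable k(x, z) -> float
    X_train        : (n, p) array of training inputs
    y_train        : (n,) array of training target responses
    x_new          : (p,) test input
    noise_variance : sigma_eps^2
    """
    n = len(X_train)
    # Covariance (kernel) matrix K with K_ij = k(x_i, x_j), augmented
    # with the noise variance: K + sigma_eps^2 * I.
    K = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
    K_noisy = K + noise_variance * np.eye(n)
    # Vector of kernel evaluations between the training inputs and the test input.
    k_vec = np.array([kernel(xi, x_new) for xi in X_train])
    mean = k_vec @ np.linalg.solve(K_noisy, y_train)
    var = kernel(x_new, x_new) + noise_variance - k_vec @ np.linalg.solve(K_noisy, k_vec)
    return mean, var

To represent a GP regression model, the following information should be stored in the PMML file: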
<xs:element name="GaussianProcessModel"> <xs:complexType> <xs:sequence> <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/> <xs:element ref="MiningSchema"/> <xs:element ref="Output" minOccurs="0"/> <xs:element ref="ModelStats" minOccurs="0"/> <xs:element ref="ModelExplanation" minOccurs="0"/> <xs:element ref="Targets" minOccurs="0"/> <xs:element ref="LocalTransformations" minOccurs="0"/> <xs:sequence> <xs:choice> <xs:element ref="RadialBasisKernel"/> <xs:element ref="ARDSquaredExponentialKernel"/> <xs:element ref="AbsoluteExponentialKernel"/> <xs:element ref="GeneralizedExponentialKernel"/> </xs:choice> </xs:sequence> <xs:element ref="TrainingInstances"/> <xs:element ref="ModelVerification" minOccurs="0"/> <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/> </xs:sequence> <xs:attribute name="modelName" type="xs:string" use="optional"/> <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/> <xs:attribute name="algorithmName" type="xs:string"/> <xs:attribute name="optimizer" type="xs:string" use="optional"/> <xs:attribute name="isScorable" type="xs:boolean" default="true"/> </xs:complexType> </xs:element> Similar to other model elements, the following attributes are specified for Gaussian Process model:
The attribute optimizer specifies the optimization algorithm used in training the Gaussian process model. Since Gaussian processes require numeric attributes, which may need to be normalized, transformations are often applied; such transformations can be performed in the LocalTransformations element.

Kernel Types

A kernel function defines the function space that GP regression can represent, and thus impacts the accuracy of the prediction model. A wide variety of kernel functions can be used. Four kernel functions commonly used for GP regression are described here: the radial basis kernel, the ARD squared exponential kernel, the absolute exponential kernel, and the generalized exponential kernel.
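As an illustration of how such a kernel is evaluated, the ARD squared exponential kernel, using the functional form given in the example later in this document, can be sketched as follows (the function name and NumPy usage are illustrative, not part of the specification):

import numpy as np

def ard_squared_exponential(x, z, gamma, lambdas):
    """ARD squared exponential kernel:
    k(x, z) = gamma * exp(-1/2 * sum_i (|x_i - z_i| / lambda_i)^2).

    Each input dimension i has its own length scale lambda_i
    (automatic relevance determination).
    """
    x, z, lambdas = np.asarray(x), np.asarray(z), np.asarray(lambdas)
    return gamma * np.exp(-0.5 * np.sum(((x - z) / lambdas) ** 2))

With the hyper-parameters of the example model below, ard_squared_exponential((1, 3), (2, 6), gamma=2.4890, lambdas=(1.5164, 59.3113)) evaluates to approximately 2.0000, the off-diagonal entry of the covariance matrix constructed in the scoring example.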
When GP regression is used to fit inputs to noisy outputs, the noise variance must be specified. The noise term is assumed to follow a Gaussian distribution and to be independent and identically distributed. Note that constructing the covariance matrix $K + \sigma_\epsilon^2 I$ is equivalent to using the kernel function augmented with the noise variance, i.e., $\tilde{k}(x, z) = k(x, z) + \sigma_\epsilon^2$ if $x = z$, and $\tilde{k}(x, z) = k(x, z)$ otherwise. Thus, the noise variance term $\sigma_\epsilon^2$ is included among the hyper-parameters of each kernel function. The aforementioned four kernel functions are expressed with the following PMML schema:

<xs:element name="RadialBasisKernel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="description" type="xs:string" use="optional"/>
    <xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="lambda" type="REAL-NUMBER" use="optional" default="1"/>
  </xs:complexType>
</xs:element>

<xs:element name="ARDSquaredExponentialKernel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Lambda" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="description" type="xs:string" use="optional"/>
    <xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
  </xs:complexType>
</xs:element>

<xs:element name="AbsoluteExponentialKernel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Lambda" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="description" type="xs:string" use="optional"/>
    <xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
  </xs:complexType>
</xs:element>

<xs:element name="GeneralizedExponentialKernel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Lambda" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="description" type="xs:string" use="optional"/>
    <xs:attribute name="gamma" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="noiseVariance" type="REAL-NUMBER" use="optional" default="1"/>
    <xs:attribute name="degree" type="REAL-NUMBER" use="optional" default="1"/>
  </xs:complexType>
</xs:element>

The element Lambda is defined as follows:

<xs:element name="Lambda">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="REAL-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Additional information about the kernel can be entered in the free-text attribute description.

Training Instances

The element TrainingInstances refers to the element defined in the KNN model documentation. It encapsulates the definition of the fields included in the training instances as well as their values.
The InstanceFields and InstanceField elements likewise refer to the elements defined in the KNN model documentation. The InstanceFields element serves as an envelope for all the fields included in the training instances; it encapsulates InstanceField elements.
The element TrainingInstances offers a choice of PMML elements for representing the training data (feature vectors and class labels). These are InlineTable and TableLocator.
For more information on the elements InlineTable and TableLocator, refer to Taxonomy.

Example Model

This example introduces the training procedure of a GP regression model, the representation of the trained GP regression model in a PMML file, and the scoring procedure using the represented model. To illustrate these procedures, two training data points with input vectors $\{x_1 = (1, 3), x_2 = (2, 6)\}$ and target responses $\{y_1 = 1, y_2 = 2\}$ are used, together with the ARD squared exponential kernel function

$$k(x, z) = \gamma \exp\left(-\frac{1}{2} \sum_{i=1}^{p} \left(\frac{|x_i - z_i|}{\lambda_i}\right)^2\right)$$

From the training data points, the optimum hyper-parameters $\theta^* = (\gamma^*, \lambda^*, \sigma_\epsilon^*)$ are determined as $\gamma^* = 2.4890$, $\lambda^* = (1.5164, 59.3113)$, and $\sigma_\epsilon^* = 0.1051$. The noise variance is $(\sigma_\epsilon^*)^2 = (0.1051)^2 = 0.0110$. This information is represented in the following PMML file:

<PMML xmlns="https://www.dmg.org/PMML-4_3" version="4.3">
  <Header copyright="DMG.org"/>
  <DataDictionary numberOfFields="3">
    <DataField dataType="double" name="x1" optype="continuous"/>
    <DataField dataType="double" name="x2" optype="continuous"/>
    <DataField dataType="double" name="y1" optype="continuous"/>
  </DataDictionary>
  <GaussianProcessModel modelName="Gaussian Process Model" functionName="regression">
    <MiningSchema>
      <MiningField name="x1" usageType="active"/>
      <MiningField name="x2" usageType="active"/>
      <MiningField name="y1" usageType="predicted"/>
    </MiningSchema>
    <Output>
      <OutputField dataType="double" feature="predictedValue" name="MeanValue" optype="continuous"/>
      <OutputField dataType="double" feature="predictedValue" name="StandardDeviation" optype="continuous"/>
    </Output>
    <ARDSquaredExponentialKernel gamma="2.4890" noiseVariance="0.0110">
      <Lambda>
        <Array n="2" type="real">1.5164 59.3113</Array>
      </Lambda>
    </ARDSquaredExponentialKernel>
    <TrainingInstances recordCount="2" fieldCount="3" isTransformed="false">
      <InstanceFields>
        <InstanceField field="x1" column="x1"/>
        <InstanceField field="x2" column="x2"/>
        <InstanceField field="y1" column="y1"/>
      </InstanceFields>
      <InlineTable>
        <row>
          <x1>1</x1>
          <x2>3</x2>
          <y1>1</y1>
        </row>
        <row>
          <x1>2</x1>
          <x2>6</x2>
          <y1>2</y1>
        </row>
      </InlineTable>
    </TrainingInstances>
  </GaussianProcessModel>
</PMML>

The optimized hyper-parameters are stored under the element ARDSquaredExponentialKernel. Note that the noiseVariance attribute in the PMML file denotes $(\sigma_\epsilon^*)^2 = (0.1051)^2 = 0.0110$.

Scoring Procedure, Example

We assume that the training data, the kernel function type, and its optimum hyper-parameters are retrieved from the given PMML file. Continuing the same example as above, the scoring procedure at the new test input $x_{new} = (1, 4)$ is illustrated.
The target response $y_{new}$ corresponding to the new input $x_{new}$ follows the one-dimensional posterior distribution

$$y_{new} \sim N(\mu(x_{new}|D), \sigma^2(x_{new}|D))$$

where

$$\mu(x_{new}|D) = k^T (K + \sigma_\epsilon^2 I)^{-1} y_{1:n}$$

$$\sigma^2(x_{new}|D) = k(x_{new}, x_{new}) + \sigma_\epsilon^2 - k^T (K + \sigma_\epsilon^2 I)^{-1} k$$

With the optimized hyper-parameters $\theta^* = (\gamma^* = 2.4890, \lambda^* = (1.5164, 59.3113), \sigma_\epsilon^* = 0.1051)$ and the training inputs $x_1$ and $x_2$, the covariance matrix augmented with the noise variance, $K + (\sigma_\epsilon^*)^2 I$, is constructed. For example,

$$K_{11} = k(x_1, x_1) = 2.4890 \cdot \exp(0) = 2.4890$$

which gives

$$K + (\sigma_\epsilon^*)^2 I = \begin{pmatrix} 2.4890 & 2.0000 \\ 2.0000 & 2.4890 \end{pmatrix} + (0.1051)^2 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 2.5000 & 2.0000 \\ 2.0000 & 2.5000 \end{pmatrix}$$

When $x_{new} = (1, 4)$, the vector $k^T$ containing the kernel evaluations between the training inputs and the test input is computed as

$$k^T = (k(x_1, x_{new}), k(x_2, x_{new})) = (2.4886, 2.0014)$$

Then the predicted mean $\mu(x_{new}|D)$ and variance $\sigma^2(x_{new}|D)$ of $y_{new}$ at $x_{new} = (1, 4)$ can be computed as

$$\mu(x_{new}|D) = (2.4886, 2.0014) \begin{pmatrix} 2.5000 & 2.0000 \\ 2.0000 & 2.5000 \end{pmatrix}^{-1} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = 1.0095$$

$$\sigma^2(x_{new}|D) = 2.4890 + 0.0110 - (2.4886, 2.0014) \begin{pmatrix} 2.5000 & 2.0000 \\ 2.0000 & 2.5000 \end{pmatrix}^{-1} \begin{pmatrix} 2.4886 \\ 2.0014 \end{pmatrix} = 0.0226$$

Based on the estimated mean value and the variance, the 95% confidence bound for $y_{new}$ can be expressed as

$$[\mu(x_{new}|D) - 1.96\,\sigma(x_{new}|D), \; \mu(x_{new}|D) + 1.96\,\sigma(x_{new}|D)] = [0.7148, 1.3042]$$
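For completeness, the following sketch reproduces this scoring procedure numerically, assuming NumPy; the variable names are illustrative and not part of the specification:

import numpy as np

# Hyper-parameters as stored in the example PMML file.
gamma = 2.4890
lambdas = np.array([1.5164, 59.3113])
noise_variance = 0.0110  # (sigma_eps*)^2 = (0.1051)^2

X_train = np.array([[1.0, 3.0], [2.0, 6.0]])
y_train = np.array([1.0, 2.0])
x_new = np.array([1.0, 4.0])

def kernel(x, z):
    # ARD squared exponential kernel.
    return gamma * np.exp(-0.5 * np.sum(((x - z) / lambdas) ** 2))

# Covariance matrix augmented with the noise variance: K + (sigma_eps*)^2 I.
K = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
K_noisy = K + noise_variance * np.eye(len(X_train))

# Kernel evaluations between the training inputs and the test input.
k_vec = np.array([kernel(xi, x_new) for xi in X_train])

mean = k_vec @ np.linalg.solve(K_noisy, y_train)   # ~1.0095
var = (kernel(x_new, x_new) + noise_variance
       - k_vec @ np.linalg.solve(K_noisy, k_vec))  # ~0.0226
lower = mean - 1.96 * np.sqrt(var)
upper = mean + 1.96 * np.sqrt(var)
print(mean, var, [lower, upper])

Up to rounding in the intermediate values, this reproduces the predicted mean of 1.0095, the variance of 0.0226, and the 95% confidence bound of approximately [0.715, 1.304].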