PMML 4.2 - Nearest Neighbors

k-Nearest Neighbors (k-NN) is an instance-based learning algorithm. In a k-NN model, a hypothesis or generalization is built from the training data directly at the time a query is made to the system. The prediction is based on the K training instances closest to the case being scored. Therefore, all training cases have to be stored, which may be problematic when the amount of data is large. This model has the ability to store the data directly in PMML using InlineTable or elsewhere using the TableLocator element defined in the Taxonomy document.

A k-NN model can have one or more target variables or no targets. When one or more targets are present, the predicted value is computed based on the target values of the nearest neighbors. When no targets are present, the model specifies a case ID variable for the training data. In this way, one can easily obtain the IDs of the K closest training cases (nearest neighbors).

A k-NN model consists of four major parts:

Model attributes
Training instances
Comparison measure
Input fields

XSD

<xs:element name="NearestNeighborModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="ModelExplanation" minOccurs="0"/>
      <xs:element ref="Targets" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="TrainingInstances"/>
      <xs:element ref="ComparisonMeasure"/>
      <xs:element ref="KNNInputs"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string"/>
    <xs:attribute name="numberOfNeighbors" type="INT-NUMBER" use="required"/>
    <xs:attribute name="continuousScoringMethod" type="CONT-SCORING-METHOD" default="average"/>
    <xs:attribute name="categoricalScoringMethod" type="CAT-SCORING-METHOD" default="majorityVote"/>
    <xs:attribute name="instanceIdVariable" type="xs:string"/>
    <xs:attribute name="threshold" type="REAL-NUMBER" default="0.001"/>
    <xs:attribute name="isScorable" type="xs:boolean" default="true"/>
  </xs:complexType> 
</xs:element>

Model Attributes

NearestNeighborModel: The root element of an XML k-NN model. Each instance of a k-NN model must start with this element.
modelName: This is a unique identifier specifying the name of the k-NN model.
functionName: Could be either "classification" or "regression" depending on the target variable(s). If the model contains categorical and continuous targets, this attribute should be "mixed". In case no targets are present then it should be "clustering".
algorithmName: Can be any string describing the algorithm that was used while creating the k-NN model.
numberOfNeighbors: Specifies K, the number of desired neighbors.
continuousScoringMethod and categoricalScoringMethod: Specify the scoring (or combining) method based on the continuous or categorical target values of K neighbors. Details of the combining algorithms are described in the document on multiple models in PMML.
instanceIdVariable Contains the instance ID variable name and so refers to the name of a field in InstanceFields. Required if the model has no targets, optional otherwise.
threshold: Defines a very small positive number to be used for "weighted" scoring methods to avoid numerical problems when distance or similarity measure is zero.
isScorable: This attribute indicates if the model is valid for scoring. If this attribute is true or if it is missing, then the model should be processed normally. However, if the attribute is false, then the model producer has indicated that this model is intended for information purposes only and should not be used to generate results. For more details, see General Structure. In order to be valid PMML, the following XSD requirements need to be met, even for non-scoring models:

Required Attributes: functionName, numberOfNeighbors

Required Elements: MiningSchema, TrainingInstances, KNNInputs

Main Elements

TrainingInstances: This element serves as an envelope for defining all of the training instances. It contains the definition of the fields included in the training instances as well as a table for representing the training data itself.
ComparisonMeasure: Defines the distance or similarity measure used to find the k-nearest neighbors.
KNNInputs: This element serves as an envelope for defining all of the k-NN inputs.

Training Instances

The element TrainingInstances consists of:

<xs:element name="TrainingInstances">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="InstanceFields"/>
      <xs:choice>
        <xs:element ref="TableLocator"/>
        <xs:element ref="InlineTable"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="isTransformed" type="xs:boolean" default="false"/>
    <xs:attribute name="recordCount" type="INT-NUMBER" use="optional"/>
    <xs:attribute name="fieldCount" type="INT-NUMBER" use="optional"/>
  </xs:complexType>
</xs:element>

Element TrainingInstances encapsulates the definition of the fields included in the training instances as well as their values.

Definitions

isTransformed: Used as a flag to determine whether or not the training instances have already been transformed. If isTransformed is "false", it indicates that the training data has not been transformed yet. If "true", it indicates that it has already been transformed.
recordCount: Defines the number of training instances or records. This number needs to match the number of instances defined in the element InlineTable or in the external data if TableLocator is used.
fieldCount: Defines the number of fields (features + targets). This number needs to match the number of InstanceField elements defined under InstanceFields.

Each InstanceFields element consists of:

<xs:element name="InstanceFields">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="InstanceField" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="InstanceField">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="xs:string" use="required"/>
    <xs:attribute name="column" type="xs:string" use="optional"/>
  </xs:complexType>
</xs:element>

The InstanceFields element serves as an envelope for all the fields included in the training instances. It encapsulates InstanceField elements.

Definitions

field: Contains the name of a DataField or a DerivedField (in case isTransformed is set to "true"). Can also contain the name of the case ID variable.
column: Defines the name of the tag or column used by element InlineTable. This attribute is required if element InlineTable is used to represent training data.

The element TrainingInstances offers a choice of PMML elements when it comes to representing the training data (feature vectors and class labels). These are:

InlineTable: Allows for the training instances to be part of the PMML document itself. When used in k-NN models, a row in an InlineTable should contain a sequence of elements representing the input fields. Each tag or column, as given by attribute column of element InstanceField, corresponds to a field name. The content of an element defines the field value.
TableLocator: Allows for the training data to be stored in an external table. Such a table can then be referenced by the TableLocator element which implements a kind of URL for tables.

For more information on elements InlineTable and TableLocator, refer to Taxonomy.

Comparison Measure and Compare Function

When two records are compared, the distance is of interest. The distance measure can be computed by a combination of an 'inner' function and an 'outer' function. The inner function compares two single field values and the outer function computes an aggregation over all fields.

The ComparisonMeasure element is used to define the distance or similarity measure used to find the k-nearest neighbors.

Each field has a compareFunction attribute, this is either defined as default in element ComparisonMeasure or it can be defined per KNNInput.

For more information on element ComparisonMeasure and attribute compareFunction, refer to Clustering Model.

Input Fields

The KNNInputs element works as an envelope. It encapsulates several KNNInput elements which define the fields used to query the k-NN model, one KNNInput element per field.

<xs:element name="KNNInputs">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="KNNInput" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="KNNInput">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>     
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="fieldWeight" type="REAL-NUMBER" default="1"/>        
    <xs:attribute name="compareFunction" type="COMPARE-FUNCTION"/>
  </xs:complexType>
</xs:element>

Definitions

field: Contains the name of a DataField or a DerivedField. If a DerivedField is used and isTransformed is false, the training instances will also need to be transformed together with the k-NN input.
fieldWeight: Defines the importance factor for the field. It is used in the comparison functions to compute the comparison measure. The value must be a number greater than 0. The default value is 1.0.
compareFunction: See above.

Scoring Procedure

For each input vector in the query data set, the k-nearest neighbors are determined against the instances defined in element TrainingInstances. The continuousScoringMethod and categoricalScoringMethod attributes of element NearestNeighborModel are defined based on the target class of the k-neighbors. As such, scoring for each query is performed by one of the following methods:

average: continuous targets, predicted target is the average of targets for the k-nearest neighbors.
median: continuous targets, predicted target is the median of targets for the k-nearest neighbors.
weightedAverage: continuous targets, predicted target is the weighted average of targets for the k-nearest neighbors. The weights are proportional to the inverse of the distance from each k-neighbor to the query point.
majorityVote: categorical targets, predicted target corresponds to the category with the highest frequency of occurrence among the k-nearest neighbors. In case of a tie, the category with the largest number of cases in the training data is the winner. If multiple categories are tied on the largest number of cases in the training data, then the category with the smallest data value (in lexical order) among the tied categories is the winner.
weightedMajorityVote: categorical targets, predicted target corresponds to the category with the highest weighted frequency of occurrence among the k-nearest neighbors. The weights are proportional to the inverse of the distance from each k-neighbor to the query point.

Note that PMML allows for more than one target and different scoring methods can be applied to categorical and continuous targets.

<xs:simpleType name="CONT-SCORING-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="median"/>      
    <xs:enumeration value="average"/>
    <xs:enumeration value="weightedAverage"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="CAT-SCORING-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="majorityVote"/>
    <xs:enumeration value="weightedMajorityVote"/>
  </xs:restriction>
</xs:simpleType>

NOTE: A distance of zero (D = 0) requires special attention when using weightedMajorityVote or weightedAverage since it results in a divide by zero exception. Therefore, the k-NN model incorporates a threshold parameter that specifies a default value to use to avoid problems with distance of zero. This should be used as follows:

case weight = 1 / (D + threshold)

Scoring Example 1: Continuous predictors and two targets: continuous and categorical

Here is an example for NearestNeighborModel for the Iris dataset using an InlineTable. The field "species" is a MiningField with usageType="target" as well as the field "species_class". They represent the dependent variables from the training data set.

<PMML xmlns="https://www.dmg.org/PMML-4_2" version="4.2">
  <Header copyright="Copyright (c) 2011, DMG.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="petal length" optype="continuous" dataType="double"/>
    <DataField name="petal width" optype="continuous" dataType="double"/>
    <DataField name="sepal length" optype="continuous" dataType="double"/>
    <DataField name="sepal width" optype="continuous" dataType="double"/>
    <DataField name="species" optype="continuous" dataType="double"/>
    <DataField name="species_class" optype="categorical" dataType="string"/>     
  </DataDictionary>
  <NearestNeighborModel modelName="KNN IrisGardens" continuousScoringMethod="average" categoricalScoringMethod="majorityVote" numberOfNeighbors="3" functionName="mixed">
    <MiningSchema>
      <MiningField name="petal length"/>
      <MiningField name="petal width"/>
      <MiningField name="sepal length"/>
      <MiningField name="sepal width"/>
      <MiningField name="species" usageType="target"/>
      <MiningField name="species_class" usageType="target"/>         
    </MiningSchema>
    <Output>
      <OutputField targetField="species" dataType="double" feature="predictedValue" name="output_1" optype="continuous"/>  
      <OutputField targetField="species_class" dataType="string" feature="predictedValue" name="output_2" optype="categorical"/>
    </Output>     
    <TrainingInstances recordCount="149" fieldCount="5" isTransformed="false">
      <InstanceFields>
        <InstanceField field="petal length" column="petal_length"/>
        <InstanceField field="petal width" column="petal_width"/>
        <InstanceField field="sepal length" column="sepal_length"/>
        <InstanceField field="sepal width" column="sepal_width"/>
        <InstanceField field="species" column="target_species"/>
        <InstanceField field="species_class" column="target_class"/>            
      </InstanceFields>
      <InlineTable>
        <row>
          <sepal_length>4.9</sepal_length>
          <sepal_width>3.0</sepal_width>
          <petal_length>1.4</petal_length>
          <petal_width>0.2</petal_width>
          <target_species>10</target_species>
          <target_class>Iris-setosa</target_class>            
        </row>
        <row>
          <sepal_length>4.7</sepal_length>
          <sepal_width>3.2</sepal_width>
          <petal_length>1.3</petal_length>
          <petal_width>0.2</petal_width>
          <target_species>10</target_species>
          <target_class>Iris-setosa</target_class>                
        </row>
        <!-- ... -->
        <row>
          <sepal_length>7.0</sepal_length>
          <sepal_width>3.2</sepal_width>
          <petal_length>4.7</petal_length>
          <petal_width>1.24</petal_width>
          <target_species>20</target_species>
          <target_class>Iris-versicolor</target_class>                
        </row>
        <!-- ... -->
        <row>
          <sepal_length>6.3</sepal_length>
          <sepal_width>3.3</sepal_width>
          <petal_length>6.0</petal_length>
          <petal_width>2.5</petal_width>
          <target_species>30</target_species>
          <target_class>Iris-virginica</target_class>                
        </row>
      </InlineTable>
    </TrainingInstances>
    <ComparisonMeasure kind="distance">
      <squaredEuclidean/>
    </ComparisonMeasure>      
    <KNNInputs>
      <KNNInput field="petal length" compareFunction="absDiff"/>
      <KNNInput field="petal width" compareFunction="absDiff"/>
      <KNNInput field="sepal length" compareFunction="absDiff"/>
      <KNNInput field="sepal width" compareFunction="absDiff"/>
    </KNNInputs>
  </NearestNeighborModel>
</PMML>

Scoring Example 1: procedure and results

We will use the above example to illustrate the steps that should be followed in the scoring process. Say the following case (obs) must be scored:

obs = (sepal length = 5.1, sepal width = 3.5, petal length = 1.4, petal width = 0.2)

The distance D between this observation and all 149 training instances are calculated.

Minimum distances for closest k = 3 neighbors

D1 = 0.0100 = (5.1 - 5.1)^2 + (3.5 - 3.5)^2 + (1.4 - 1.4)^2 + (0.3 - 0.2)^2 , continuous target = 10, categorical target = "Iris-setosa"
D2 = 0.0200 = (5.0 - 5.1)^2 + (3.6 - 3.5)^2 + (1.4 - 1.4)^2 + (0.2 - 0.2)^2 , continuous target = 10, categorical target = "Iris-setosa"
D3 = 0.0200 = (5.1 - 5.1)^2 + (3.4 - 3.5)^2 + (1.5 - 1.4)^2 + (0.2 - 0.2)^2 , continuous target = 10, categorical target = "Iris-setosa"

Given that the continuousScoringMethod for the continuous target is "average", the predicted value for the target "species" is (10 + 10 + 10) / 3 = 10 and given that the categoricalScoringMethod for the categorical target is "majorityVote", the predicted value for the target "species_class" is "Iris-setosa".

Note that if one of the k-neighbors were to have a continuous target = 20 (instead of = 10), the average would be (10 + 10 + 20) / 3 = 13.33, a valid value given that the target field "species" is continuous.

obs = (sepal length = 5.9, sepal width = 3.0, petal length = 5.1, petal width = 1.8)

The distance D between this observation and all 149 training instances are calculated.

Minimum distances for closest k = 3 neighbors

D1 = 0.0800 = (6.1 - 5.9)^2 + (3.0 - 3.0)^2 + (4.9 - 5.1)^2 + (1.8 - 1.8)^2 , continuous target = 20, categorical target = "Iris-versicolor"
D2 = 0.1000 = (6.0 - 5.9)^2 + (3.0 - 3.0)^2 + (4.8 - 5.1)^2 + (1.8 - 1.8)^2 , continuous target = 20, categorical target = "Iris-versicolor"
D3 = 0.1100 = (5.8 - 5.9)^2 + (2.7 - 3.0)^2 + (5.1 - 5.1)^2 + (1.9 - 1.8)^2 , continuous target = 20, categorical target = "Iris-versicolor"

Given that the continuousScoringMethod is "average", the predicted value for the target "species" is (20 + 20 + 20) / 3 = 20 and given that the categoricalScoringMethod is "majorityVote", the predicted value for the target "species_class" is "Iris-versicolor".

Note that the element Output specifies the two predicted values in the example. The first, "output_1" refers to target field "species", while the second, "output_2" refer to target field "species_class".

Scoring Example 2: Categorical predictor and categorical target with internal case ID variable

Here is an example for NearestNeighborModel for the Census dataset using an InlineTable. The field "income" is a MiningField with usageType="target", thus it represents the dependent variable from the training data set.

<PMML xmlns="https://www.dmg.org/PMML-4_2" version="4.2">
  <Header copyright="Copyright (c) 2011, DMG.org"/>
  <DataDictionary numberOfFields="4">
    <DataField name="marital status" optype="categorical" dataType="string">
      <Value value="s"/>
      <Value value="d"/>
      <Value value="m"/>
    </DataField>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="dependents" optype="continuous" dataType="double"/>
    <DataField name="income" optype="categorical" dataType="string">
      <Value value="Low"/>
      <Value value="Middle"/>
      <Value value="High"/>
    </DataField>         
  </DataDictionary>
  <NearestNeighborModel modelName="KNN Census2000" categoricalScoringMethod="majorityVote" numberOfNeighbors="3" functionName="classification" instanceIdVariable="ID" threshold="0.001">
    <MiningSchema>
      <MiningField name="marital status"/>
      <MiningField name="age"/>
      <MiningField name="dependents"/>
      <MiningField name="income" usageType="target"/>
    </MiningSchema>
    <Output>
      <OutputField dataType="string" feature="predictedValue" name="output_1" optype="categorical"/>  
      <OutputField dataType="string" feature="entityId" name="neighbor1" rank="1" optype="categorical"/>  
      <OutputField dataType="string" feature="entityId" name="neighbor2" rank="2" optype="categorical"/>  
      <OutputField dataType="string" feature="entityId" name="neighbor3" rank="3" optype="categorical"/>  
    </Output>         
    <LocalTransformations>
      <DerivedField name="norm_age" optype="continuous" dataType="double">
        <NormContinuous field="age">
          <LinearNorm orig="0" norm="0"/>
          <LinearNorm orig="45" norm="0.5"/>
          <LinearNorm orig="105" norm="1"/>
        </NormContinuous>
      </DerivedField>
      <DerivedField name="married" optype="continuous" dataType="double">
        <NormDiscrete field="marital status" value="m"/>
      </DerivedField>
      <DerivedField name="divorced" optype="continuous" dataType="double">
        <NormDiscrete field="marital status" value="d"/>
      </DerivedField>
      <DerivedField name="single" optype="continuous" dataType="double">
        <NormDiscrete field="marital status" value="s"/>
      </DerivedField>
    </LocalTransformations>
    <TrainingInstances recordCount="200" fieldCount="5" isTransformed="false">
      <InstanceFields>
        <InstanceField field="ID" column="ID"/>
        <InstanceField field="marital status" column="ms"/>
        <InstanceField field="age" column="age"/>
        <InstanceField field="dependents" column="deps"/>
        <InstanceField field="income" column="inc"/>
      </InstanceFields>
      <InlineTable>
        <row> <ID>1</ID> <ms>m</ms> <age>33.0</age> <deps>4</deps> <inc>Low</inc> </row>
        <row> <ID>2</ID> <ms>s</ms> <age>25.0</age> <deps>3</deps> <inc>Low</inc> </row>
        <!-- ... -->
        <row> <ID>11</ID> <ms>m</ms> <age>38.0</age> <deps>2</deps> <inc>Middle</inc> </row>
        <!-- ... -->
        <row> <ID>200</ID> <ms>m</ms> <age>45.0</age> <deps>1</deps> <inc>High</inc> </row>
      </InlineTable>
    </TrainingInstances>
    <ComparisonMeasure kind="distance">
      <squaredEuclidean/>
    </ComparisonMeasure>      
    <KNNInputs>
      <KNNInput field="norm_age" compareFunction="absDiff"/>
      <KNNInput field="married" compareFunction="absDiff"/>
      <KNNInput field="divorced" compareFunction="absDiff"/>
      <KNNInput field="single" compareFunction="absDiff"/>
      <KNNInput field="dependents" compareFunction="absDiff"/>
    </KNNInputs>
  </NearestNeighborModel>
</PMML>

Note that since the TrainingInstances have not been transformed (isTransformed="false"), they will have to be transformed together with the KNNInputs as per transformations defined in elements TransformationDictionary or LocalTransformations before scoring takes place. As outlined in the Output element, scoring results will contain the predicted category for income and also ID values for the 3 nearest neighbors.

e-mail

info at dmg.org