PMML 4.2 - Nearest Neighbors

k-Nearest Neighbors (k-NN) is an instance-based learning algorithm. In a k-NN model, a hypothesis or generalization is built from the training data directly at the time a query is made to the system. The prediction is based on the K training instances closest to the case being scored. Therefore, all training cases have to be stored, which may be problematic when the amount of data is large. This model has the ability to store the data directly in PMML using InlineTable or elsewhere using the TableLocator element defined in the Taxonomy document.

A k-NN model can have one or more target variables or no targets. When one or more targets are present, the predicted value is computed based on the target values of the nearest neighbors. When no targets are present, the model specifies a case ID variable for the training data. In this way, one can easily obtain the IDs of the K closest training cases (nearest neighbors).

A k-NN model consists of four major parts:
XSD

<xs:element name="NearestNeighborModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="ModelExplanation" minOccurs="0"/>
      <xs:element ref="Targets" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="TrainingInstances"/>
      <xs:element ref="ComparisonMeasure"/>
      <xs:element ref="KNNInputs"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string"/>
    <xs:attribute name="numberOfNeighbors" type="INT-NUMBER" use="required"/>
    <xs:attribute name="continuousScoringMethod" type="CONT-SCORING-METHOD" default="average"/>
    <xs:attribute name="categoricalScoringMethod" type="CAT-SCORING-METHOD" default="majorityVote"/>
    <xs:attribute name="instanceIdVariable" type="xs:string"/>
    <xs:attribute name="threshold" type="REAL-NUMBER" default="0.001"/>
    <xs:attribute name="isScorable" type="xs:boolean" default="true"/>
  </xs:complexType>
</xs:element>

Model Attributes
Main Elements
Training Instances

The element TrainingInstances consists of:

<xs:element name="TrainingInstances">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="InstanceFields"/>
      <xs:choice>
        <xs:element ref="TableLocator"/>
        <xs:element ref="InlineTable"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="isTransformed" type="xs:boolean" default="false"/>
    <xs:attribute name="recordCount" type="INT-NUMBER" use="optional"/>
    <xs:attribute name="fieldCount" type="INT-NUMBER" use="optional"/>
  </xs:complexType>
</xs:element>

Element TrainingInstances encapsulates the definition of the fields included in the training instances as well as their values.

Definitions
<xs:element name="InstanceFields">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="InstanceField" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="InstanceField">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="xs:string" use="required"/>
    <xs:attribute name="column" type="xs:string" use="optional"/>
  </xs:complexType>
</xs:element>

The InstanceFields element serves as an envelope for all the fields included in the training instances. It encapsulates InstanceField elements.

Definitions
For more information on elements InlineTable and TableLocator, refer to Taxonomy.

Comparison Measure and Compare Function

When two records are compared, the distance is of interest. The distance measure can be computed by a combination of an 'inner' function and an 'outer' function. The inner function compares two single field values and the outer function computes an aggregation over all fields. The ComparisonMeasure element is used to define the distance or similarity measure used to find the k-nearest neighbors. Each field has a compareFunction attribute, which is either defined as a default in element ComparisonMeasure or defined per KNNInput. For more information on element ComparisonMeasure and attribute compareFunction, refer to Clustering Model.

Input Fields

The KNNInputs element works as an envelope. It encapsulates several KNNInput elements which define the fields used to query the k-NN model, one KNNInput element per field.

<xs:element name="KNNInputs">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="KNNInput" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="KNNInput">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="fieldWeight" type="REAL-NUMBER" default="1"/>
    <xs:attribute name="compareFunction" type="COMPARE-FUNCTION"/>
  </xs:complexType>
</xs:element>

Definitions
Scoring Procedure

For each input vector in the query data set, the k-nearest neighbors are determined against the instances defined in element TrainingInstances. The continuousScoringMethod and categoricalScoringMethod attributes of element NearestNeighborModel determine how the prediction is computed from the target values of the k-neighbors. As such, scoring for each query is performed by one of the following methods:
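As a rough illustration of what the scoring methods compute, the sketch below combines the target values of the k nearest neighbors in Python. The function names (case_weights, score_continuous, score_categorical) are hypothetical and not part of PMML. The weighted variants are assumed to use the case weight 1 / (D + threshold) described in the note on zero distances, and tie-breaking in the vote is an assumption, since it is not specified here.

```python
from collections import defaultdict
from statistics import mean, median


def case_weights(distances, threshold=0.001):
    """Weight each neighbor by 1 / (D + threshold); threshold guards against D = 0."""
    return [1.0 / (d + threshold) for d in distances]


def score_continuous(values, distances, method="average", threshold=0.001):
    """Combine the continuous target values of the k nearest neighbors."""
    if method == "average":
        return mean(values)
    if method == "median":
        return median(values)
    if method == "weightedAverage":
        weights = case_weights(distances, threshold)
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)
    raise ValueError(f"unknown continuous scoring method: {method}")


def score_categorical(labels, distances, method="majorityVote", threshold=0.001):
    """Pick the category with the largest (optionally weighted) vote."""
    if method == "majorityVote":
        weights = [1.0] * len(labels)
    elif method == "weightedMajorityVote":
        weights = case_weights(distances, threshold)
    else:
        raise ValueError(f"unknown categorical scoring method: {method}")
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    # Assumption: on a tie, the category seen first among the neighbors wins.
    return max(totals, key=totals.get)
```

For instance, score_continuous([10, 10, 20], [0.1, 0.2, 0.3]) averages the three target values to roughly 13.33. A neighbor at distance zero does not raise an error in the weighted methods because the threshold keeps the denominator positive.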
Note that PMML allows for more than one target and different scoring methods can be applied to categorical and continuous targets.

<xs:simpleType name="CONT-SCORING-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="median"/>
    <xs:enumeration value="average"/>
    <xs:enumeration value="weightedAverage"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="CAT-SCORING-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="majorityVote"/>
    <xs:enumeration value="weightedMajorityVote"/>
  </xs:restriction>
</xs:simpleType>

NOTE: A distance of zero (D = 0) requires special attention when using weightedMajorityVote or weightedAverage since it results in a division by zero. Therefore, the k-NN model incorporates a threshold parameter that specifies a default value to use to avoid problems with a distance of zero. This should be used as follows:

case weight = 1 / (D + threshold)

Scoring Example 1: Continuous predictors and two targets: continuous and categorical

Here is an example for NearestNeighborModel for the Iris dataset using an InlineTable. The fields "species" and "species_class" are both MiningFields with usageType="target". They represent the dependent variables from the training data set.
<PMML xmlns="https://www.dmg.org/PMML-4_2" version="4.2">
  <Header copyright="Copyright (c) 2011, DMG.org"/>
  <DataDictionary numberOfFields="6">
    <DataField name="petal length" optype="continuous" dataType="double"/>
    <DataField name="petal width" optype="continuous" dataType="double"/>
    <DataField name="sepal length" optype="continuous" dataType="double"/>
    <DataField name="sepal width" optype="continuous" dataType="double"/>
    <DataField name="species" optype="continuous" dataType="double"/>
    <DataField name="species_class" optype="categorical" dataType="string"/>
  </DataDictionary>
  <NearestNeighborModel modelName="KNN IrisGardens" continuousScoringMethod="average" categoricalScoringMethod="majorityVote" numberOfNeighbors="3" functionName="mixed">
    <MiningSchema>
      <MiningField name="petal length"/>
      <MiningField name="petal width"/>
      <MiningField name="sepal length"/>
      <MiningField name="sepal width"/>
      <MiningField name="species" usageType="target"/>
      <MiningField name="species_class" usageType="target"/>
    </MiningSchema>
    <Output>
      <OutputField targetField="species" dataType="double" feature="predictedValue" name="output_1" optype="continuous"/>
      <OutputField targetField="species_class" dataType="string" feature="predictedValue" name="output_2" optype="categorical"/>
    </Output>
    <TrainingInstances recordCount="149" fieldCount="6" isTransformed="false">
      <InstanceFields>
        <InstanceField field="petal length" column="petal_length"/>
        <InstanceField field="petal width" column="petal_width"/>
        <InstanceField field="sepal length" column="sepal_length"/>
        <InstanceField field="sepal width" column="sepal_width"/>
        <InstanceField field="species" column="target_species"/>
        <InstanceField field="species_class" column="target_class"/>
      </InstanceFields>
      <InlineTable>
        <row>
          <sepal_length>4.9</sepal_length>
          <sepal_width>3.0</sepal_width>
          <petal_length>1.4</petal_length>
          <petal_width>0.2</petal_width>
          <target_species>10</target_species>
          <target_class>Iris-setosa</target_class>
        </row>
        <row>
          <sepal_length>4.7</sepal_length>
          <sepal_width>3.2</sepal_width>
          <petal_length>1.3</petal_length>
          <petal_width>0.2</petal_width>
          <target_species>10</target_species>
          <target_class>Iris-setosa</target_class>
        </row>
        <!-- ... -->
        <row>
          <sepal_length>7.0</sepal_length>
          <sepal_width>3.2</sepal_width>
          <petal_length>4.7</petal_length>
          <petal_width>1.24</petal_width>
          <target_species>20</target_species>
          <target_class>Iris-versicolor</target_class>
        </row>
        <!-- ... -->
        <row>
          <sepal_length>6.3</sepal_length>
          <sepal_width>3.3</sepal_width>
          <petal_length>6.0</petal_length>
          <petal_width>2.5</petal_width>
          <target_species>30</target_species>
          <target_class>Iris-virginica</target_class>
        </row>
      </InlineTable>
    </TrainingInstances>
    <ComparisonMeasure kind="distance">
      <squaredEuclidean/>
    </ComparisonMeasure>
    <KNNInputs>
      <KNNInput field="petal length" compareFunction="absDiff"/>
      <KNNInput field="petal width" compareFunction="absDiff"/>
      <KNNInput field="sepal length" compareFunction="absDiff"/>
      <KNNInput field="sepal width" compareFunction="absDiff"/>
    </KNNInputs>
  </NearestNeighborModel>
</PMML>

Scoring Example 1: procedure and results

We will use the above example to illustrate the steps that should be followed in the scoring process. Say the following case (obs) must be scored:

obs = (sepal length = 5.1, sepal width = 3.5, petal length = 1.4, petal width = 0.2)

The distances D between this observation and all 149 training instances are calculated.

Minimum distances for closest k = 3 neighbors
Given that the continuousScoringMethod for the continuous target is "average", the predicted value for the target "species" is (10 + 10 + 10) / 3 = 10. Given that the categoricalScoringMethod for the categorical target is "majorityVote", the predicted value for the target "species_class" is "Iris-setosa". Note that if one of the k-neighbors were to have a continuous target = 20 (instead of = 10), the average would be (10 + 10 + 20) / 3 = 13.33, a valid value given that the target field "species" is continuous.

obs = (sepal length = 5.9, sepal width = 3.0, petal length = 5.1, petal width = 1.8)

The distances D between this observation and all 149 training instances are calculated.

Minimum distances for closest k = 3 neighbors
Given that the continuousScoringMethod is "average", the predicted value for the target "species" is (20 + 20 + 20) / 3 = 20. Given that the categoricalScoringMethod is "majorityVote", the predicted value for the target "species_class" is "Iris-versicolor".

Note that the element Output specifies the two predicted values in the example. The first, "output_1", refers to target field "species", while the second, "output_2", refers to target field "species_class".

Scoring Example 2: Categorical predictor and categorical target with internal case ID variable

Here is an example for NearestNeighborModel for the Census dataset using an InlineTable. The field "income" is a MiningField with usageType="target", thus it represents the dependent variable from the training data set.

<PMML xmlns="https://www.dmg.org/PMML-4_2" version="4.2">
  <Header copyright="Copyright (c) 2011, DMG.org"/>
  <DataDictionary numberOfFields="4">
    <DataField name="marital status" optype="categorical" dataType="string">
      <Value value="s"/>
      <Value value="d"/>
      <Value value="m"/>
    </DataField>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="dependents" optype="continuous" dataType="double"/>
    <DataField name="income" optype="categorical" dataType="string">
      <Value value="Low"/>
      <Value value="Middle"/>
      <Value value="High"/>
    </DataField>
  </DataDictionary>
  <NearestNeighborModel modelName="KNN Census2000" categoricalScoringMethod="majorityVote" numberOfNeighbors="3" functionName="classification" instanceIdVariable="ID" threshold="0.001">
    <MiningSchema>
      <MiningField name="marital status"/>
      <MiningField name="age"/>
      <MiningField name="dependents"/>
      <MiningField name="income" usageType="target"/>
    </MiningSchema>
    <Output>
      <OutputField dataType="string" feature="predictedValue" name="output_1" optype="categorical"/>
      <OutputField dataType="string" feature="entityId" name="neighbor1" rank="1" optype="categorical"/>
      <OutputField dataType="string" feature="entityId" name="neighbor2" rank="2" optype="categorical"/>
      <OutputField dataType="string" feature="entityId" name="neighbor3" rank="3" optype="categorical"/>
    </Output>
    <LocalTransformations>
      <DerivedField name="norm_age" optype="continuous" dataType="double">
        <NormContinuous field="age">
          <LinearNorm orig="0" norm="0"/>
          <LinearNorm orig="45" norm="0.5"/>
          <LinearNorm orig="105" norm="1"/>
        </NormContinuous>
      </DerivedField>
      <DerivedField name="married" optype="continuous" dataType="double">
        <NormDiscrete field="marital status" value="m"/>
      </DerivedField>
      <DerivedField name="divorced" optype="continuous" dataType="double">
        <NormDiscrete field="marital status" value="d"/>
      </DerivedField>
      <DerivedField name="single" optype="continuous" dataType="double">
        <NormDiscrete field="marital status" value="s"/>
      </DerivedField>
    </LocalTransformations>
    <TrainingInstances recordCount="200" fieldCount="5" isTransformed="false">
      <InstanceFields>
        <InstanceField field="ID" column="ID"/>
        <InstanceField field="marital status" column="ms"/>
        <InstanceField field="age" column="age"/>
        <InstanceField field="dependents" column="deps"/>
        <InstanceField field="income" column="inc"/>
      </InstanceFields>
      <InlineTable>
        <row>
          <ID>1</ID>
          <ms>m</ms>
          <age>33.0</age>
          <deps>4</deps>
          <inc>Low</inc>
        </row>
        <row>
          <ID>2</ID>
          <ms>s</ms>
          <age>25.0</age>
          <deps>3</deps>
          <inc>Low</inc>
        </row>
        <!-- ... -->
        <row>
          <ID>11</ID>
          <ms>m</ms>
          <age>38.0</age>
          <deps>2</deps>
          <inc>Middle</inc>
        </row>
        <!-- ... -->
        <row>
          <ID>200</ID>
          <ms>m</ms>
          <age>45.0</age>
          <deps>1</deps>
          <inc>High</inc>
        </row>
      </InlineTable>
    </TrainingInstances>
    <ComparisonMeasure kind="distance">
      <squaredEuclidean/>
    </ComparisonMeasure>
    <KNNInputs>
      <KNNInput field="norm_age" compareFunction="absDiff"/>
      <KNNInput field="married" compareFunction="absDiff"/>
      <KNNInput field="divorced" compareFunction="absDiff"/>
      <KNNInput field="single" compareFunction="absDiff"/>
      <KNNInput field="dependents" compareFunction="absDiff"/>
    </KNNInputs>
  </NearestNeighborModel>
</PMML>

Note that since the TrainingInstances have not been transformed (isTransformed="false"), they must be transformed together with the KNNInputs, as per the transformations defined in elements TransformationDictionary or LocalTransformations, before scoring takes place. As outlined in the Output element, scoring results will contain the predicted category for income and also the ID values of the 3 nearest neighbors.
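To make the transformation and distance steps of Example 2 more concrete, here is a minimal Python sketch. It assumes that NormContinuous interpolates linearly between its LinearNorm points, that NormDiscrete yields 1 when the field equals the given value and 0 otherwise, and that squaredEuclidean with absDiff sums fieldWeight * |x - y|^2 over the KNNInput fields. All function names and the query record are hypothetical, and only the four rows visible in the InlineTable are used; the elided rows are not reproduced.

```python
def norm_age(age):
    """Piecewise-linear map through the LinearNorm points (0, 0), (45, 0.5), (105, 1)."""
    points = [(0.0, 0.0), (45.0, 0.5), (105.0, 1.0)]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= age <= x1:
            return y0 + (age - x0) * (y1 - y0) / (x1 - x0)
    # Outlier treatment is outside the scope of this sketch.
    raise ValueError("age outside the range covered by the LinearNorm points")


def transform(record):
    """Derive the five KNNInput fields from a raw (untransformed) record."""
    status = record["marital status"]
    return {
        "norm_age": norm_age(record["age"]),
        "married": 1.0 if status == "m" else 0.0,
        "divorced": 1.0 if status == "d" else 0.0,
        "single": 1.0 if status == "s" else 0.0,
        "dependents": float(record["dependents"]),
    }


def squared_euclidean(a, b, field_weights=None):
    """Sum over all fields of fieldWeight * absDiff(a, b) squared (fieldWeight defaults to 1)."""
    field_weights = field_weights or {}
    return sum(field_weights.get(f, 1.0) * abs(a[f] - b[f]) ** 2 for f in a)


def nearest_ids(query, training, k=3):
    """Return the case IDs of the k training rows closest to the query."""
    q = transform(query)
    ranked = sorted(training, key=lambda row: squared_euclidean(q, transform(row)))
    return [row["ID"] for row in ranked[:k]]


# The four rows shown in the InlineTable of Example 2.
training = [
    {"ID": "1", "marital status": "m", "age": 33.0, "dependents": 4, "income": "Low"},
    {"ID": "2", "marital status": "s", "age": 25.0, "dependents": 3, "income": "Low"},
    {"ID": "11", "marital status": "m", "age": 38.0, "dependents": 2, "income": "Middle"},
    {"ID": "200", "marital status": "m", "age": 45.0, "dependents": 1, "income": "High"},
]

# Hypothetical query record, not taken from the specification.
query = {"marital status": "m", "age": 40.0, "dependents": 2}
print(nearest_ids(query, training))
```

With only these four rows, the three closest cases to the hypothetical query are IDs 11, 200 and 2. Because "dependents" enters the distance unnormalized, it contributes far more than norm_age, which stays within [0, 1]; a fieldWeight on the KNNInput elements would be one way to rebalance this.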