PMML 4.4 - Anomaly Detection Models

Anomaly detection (also called outlier detection) is the identification of items, events or observations that do not conform to an expected pattern or to the other items in a data set. Traditional approaches are distance-based or density-based; common measures are the distance to the k nearest neighbors or the count of points within a fixed radius. These methods, however, cannot handle data sets with regions of different densities and do not scale well to large data. Other algorithms have been proposed that handle such cases better; the PMML standard currently supports three of them:

Isolation Forest
One Class SVM
Clustering mean distance based anomaly detection model

Other models can also be used if their scoring follows PMML standard rules.

Isolation Forest detects anomalies by isolating instances, without relying on any distance or density measure. Isolating an anomalous observation in a random forest is easier because only a few conditions are needed to separate it from the normal observations, whereas isolating a normal observation requires more conditions. An anomaly score can therefore be derived from the number of conditions required to separate a given observation. Anomalies are more likely to be isolated close to the root of an isolation tree, while normal points are more likely to be isolated at the deeper end of the tree, so anomalies have short path lengths. The path length of a point is the number of edges it traverses in a tree from the root node until the traversal terminates at an external node. Shorter path lengths indicate anomalies.

The path length must be normalized to give the anomaly score; scores closer to 1 are more anomalous. The required parameters for this model are thus the number of trees, the tree height limit, and the sample data size used to calculate the path length normalization constant.

One Class SVMs (OCSVM) are used for novelty and anomaly detection. The support vector model is trained on data that has only one class, the “normal” class. It infers the properties of normal cases and from these properties can predict which examples are unlike the normal examples. This is useful for anomaly detection because the scarcity of training examples is what defines anomalies: typically there are very few examples of network intrusion, fraud, or other anomalous behavior. The SVM model is trained to maximize the distance between the origin and the support vector hyperplane, so the decision value for anomalous data points is negative.

Clustering mean distance based anomaly detection models compare the distance (or similarity measure) of a case to its cluster center with an average distance (similarity) for that cluster. If the distance is significantly higher (or similarity significantly lower) than average, the case is considered anomalous, with its anomaly index defined as the ratio of the distances/similarities (with special treatment for the average distance of 0).

The PMML schema defines a new model type, AnomalyDetectionModel, which can have other PMML models as sub-elements. AnomalyDetectionModel itself is not allowed as a sub-element. Thus, with minor additions, a MiningModel element representing a random forest can be used to represent isolation forests; a SupportVectorMachineModel element can be used to represent One-Class SVM models; and a ClusteringModel can be used to represent a clustering mean distance based anomaly detection model. An OutputField element with feature="decision" can be used to output a boolean indicating whether the case is an anomaly, with the predicted value of the Anomaly Detection model being some measure of the anomaly.

XML Schema

All variations of Anomaly Detection Models use pre-existing model definitions.

<xs:element name="AnomalyDetectionModel">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="MiningSchema"/>
        <xs:element ref="Output" minOccurs="0"/>
        <xs:element ref="LocalTransformations" minOccurs="0"/>
        <xs:element ref="ModelVerification" minOccurs="0"/>
        <xs:group ref="MODEL-ELEMENT"/>
        <xs:element ref="MeanClusterDistances" minOccurs="0"/>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string"/>
      <xs:attribute name="algorithmName" type="xs:string"/>
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
      <xs:attribute name="algorithmType" type="ALGORITHM-TYPE" use="required"/>
      <xs:attribute name="sampleDataSize" type="xs:string"/>
      <xs:attribute name="isScorable" type="xs:boolean" default="true"/>   
   </xs:complexType>
  </xs:element>

  
<xs:simpleType name="ALGORITHM-TYPE">
    <xs:restriction base="xs:string">
      <xs:enumeration value="iforest"/>
      <xs:enumeration value="ocsvm"/>
      <xs:enumeration value="clusterMeanDist"/>
	  <xs:enumeration value="other"/>
    </xs:restriction>
  </xs:simpleType>
 
  
<xs:element name="MeanClusterDistances">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:group ref="NUM-ARRAY"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

The anomaly model currently reuses one of the already defined model types from PMML, hence it contains a single MODEL-ELEMENT as the main child element. Any model other than AnomalyDetectionModel can be used, but at present specific scoring is described for three models. These are indicated in the algorithmType attribute:

iforest indicates an isolation forest model which uses a MiningModel element;
ocsvm indicates a one-class SVM model which corresponds to a SupportVectorMachineModel element;
clusterMeanDist indicates a clustering mean distance based anomaly detection model;
other stands for any other model.

sampleDataSize is a required parameter for isolation forest models. It is the dataset size used to train the forest and is needed to normalize the tree search depth.

The MeanClusterDistances element contains an array of non-negative real values; it is required when the algorithm type is clusterMeanDist. The length of the array must equal the number of clusters in the model, and its values are the mean distances/similarities to the center for each cluster.
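
The two attribute and element requirements above can be checked mechanically. The following is a minimal sketch in Python, not part of the PMML standard: the function names and the two reduced XML fragments are hypothetical, and only the rules stated in the text (sampleDataSize for iforest; a MeanClusterDistances array of non-negative values with one entry per cluster for clusterMeanDist) are tested.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def child_elements(parent, name):
    """Direct children of `parent` whose local tag name equals `name`."""
    return [c for c in parent if local_name(c.tag) == name]


def check_anomaly_model(model):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    algorithm = model.get("algorithmType")

    if algorithm == "iforest" and model.get("sampleDataSize") is None:
        problems.append("iforest requires sampleDataSize")

    if algorithm == "clusterMeanDist":
        mean_dists = child_elements(model, "MeanClusterDistances")
        if not mean_dists:
            problems.append("clusterMeanDist requires MeanClusterDistances")
            return problems
        arrays = child_elements(mean_dists[0], "Array")
        values = [float(v) for v in (arrays[0].text or "").split()] if arrays else []
        if any(v < 0 for v in values):
            problems.append("mean distances must be non-negative")
        for clustering in child_elements(model, "ClusteringModel"):
            n_clusters = len(child_elements(clustering, "Cluster"))
            if len(values) != n_clusters:
                problems.append(
                    f"expected {n_clusters} mean distances, found {len(values)}"
                )
    return problems


# Hypothetical fragments, reduced to the parts the checks look at.
missing_size = ET.fromstring(
    '<AnomalyDetectionModel functionName="regression" algorithmType="iforest"/>'
)
short_array = ET.fromstring(
    '<AnomalyDetectionModel functionName="regression" algorithmType="clusterMeanDist">'
    "<ClusteringModel><Cluster/><Cluster/><Cluster/></ClusteringModel>"
    '<MeanClusterDistances><Array n="2" type="real">0.165 0.211</Array></MeanClusterDistances>'
    "</AnomalyDetectionModel>"
)
print(check_anomaly_model(missing_size))
print(check_anomaly_model(short_array))
```

Matching on local tag names keeps the sketch independent of the namespace URI used by a given document; a schema validator would still be needed for everything the XSD itself enforces.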

An example of an isolation forest model

The tree depths are saved as score attributes in the trees; this is a necessary feature as this algorithm sometimes modifies the tree depth from an integer to a normalized value. It is therefore much simpler if the equivalent depth is simply read off as the score as opposed to inferring it. The 'raw' path length is the averaged predicted value of each tree which is then normalized to give the anomaly score, the prediction of the model. One can then use the result of the Apply function of the OutputField with feature="decision" to determine if this anomaly score should be categorized as an anomaly or not.

<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
    <Header copyright="2019 dmg.org"/>
    <DataDictionary numberOfFields="5">
        <DataField name="class" optype="categorical" dataType="string">
           <Value value="setosa"/>
           <Value value="versicolor"/>
           <Value value="virginica"/>
        </DataField>
        <DataField name="sepal_length" optype="continuous" dataType="double"/>
        <DataField name="sepal_width" optype="continuous" dataType="double"/>
        <DataField name="petal_length" optype="continuous" dataType="double"/>
        <DataField name="petal_width" optype="continuous" dataType="double"/>
    </DataDictionary>
    <AnomalyDetectionModel functionName="regression" algorithmType="iforest" modelName="IsolationForests" sampleDataSize="5">
        <MiningSchema>
            <MiningField name="sepal_length" usageType="active"/>
            <MiningField name="petal_length" usageType="active"/>
            <MiningField name="petal_width" usageType="active"/>
        </MiningSchema>
        <Output>
            <OutputField name="anomalyScore" optype="continuous" dataType="float" feature="predictedValue"/>
            <OutputField name="anomaly" optype="categorical" dataType="boolean" feature="decision">
                <Apply function="greaterThan">
                    <FieldRef field="anomalyScore"/>
                    <Constant dataType="double">0.422</Constant>
                </Apply>
            </OutputField>
        </Output>
        <MiningModel functionName="regression" modelName="iforest_iris_pmml">
            <MiningSchema>
                <MiningField name="sepal_length" usageType="active"/>
                <MiningField name="petal_length" usageType="active"/>
                <MiningField name="petal_width" usageType="active"/>
            </MiningSchema>
            <Output>
                <OutputField name="avg_path_length" optype="continuous" dataType="double" feature="predictedValue"/>
            </Output>
            <Segmentation multipleModelMethod="average">
                <Segment id="Seg_1">
                    <True/>
                    <TreeModel functionName="regression" missingValueStrategy="nullPrediction" noTrueChildStrategy="returnLastPrediction" splitCharacteristic="multiSplit" modelName="SegmentModel_1">
                        <MiningSchema>
                            <MiningField name="petal_length"/>
                            <MiningField name="sepal_length"/>
                            <MiningField name="petal_width"/>
                        </MiningSchema>
                        <Node id="Seg1_Nod_1" score="2.0">
                            <True/>
                            <Node id="Seg1_Nod_1.1" score="3.0">
                                <SimplePredicate field="petal_length" operator="lessOrEqual" value="1.7228131956992732"/>
                                <Node id="Seg1_Nod_1.1.1" score="4.0">
                                    <SimplePredicate field="sepal_length" operator="lessOrEqual" value="4.772875397423331"/>
                                </Node>
                                <Node id="Seg1_Nod_1.1.2" score="4.0">
                                    <SimplePredicate field="sepal_length" operator="greaterThan" value="4.772875397423331"/>
                                </Node>
                            </Node>
                            <Node id="Seg1_Nod_1.2" score="3.0">
                                <SimplePredicate field="petal_length" operator="greaterThan" value="1.7228131956992732"/>
                                <Node id="Seg1_Nod_1.2.1" score="4.0">
                                    <SimplePredicate field="petal_width" operator="lessOrEqual" value="1.7714"/>
                                </Node>
                                <Node id="Seg1_Nod_1.2.2" score="4.1544313298030655">
                                    <SimplePredicate field="petal_width" operator="greaterThan" value="1.7714"/>
                                </Node>
                            </Node>
                        </Node>
                    </TreeModel>
                </Segment>
                <Segment id="Seg_2">
                    <True/>
                    <TreeModel functionName="regression" missingValueStrategy="nullPrediction" noTrueChildStrategy="returnLastPrediction" splitCharacteristic="multiSplit" modelName="SegmentModel_2">
                        <MiningSchema>
                            <MiningField name="petal_width"/>
                        </MiningSchema>
                        <Node id="Seg2_Nod_1" score="2.0">
                            <True/>
                            <Node id="Seg2_Nod_1.1" score="3.0">
                                <SimplePredicate field="petal_width" operator="lessOrEqual" value="0.8001738992731421"/>
                                <Node id="Seg2_Nod_1.1.1" score="4.0">
                                    <SimplePredicate field="petal_width" operator="lessOrEqual" value="0.5381419370928949"/>
                                    <Node id="Seg2_Nod_1.1.1.1" score="5.0">
                                        <SimplePredicate field="petal_width" operator="lessOrEqual" value="0.25857816"/>
                                    </Node>
                                    <Node id="Seg2_Nod_1.1.1.2" score="5.0">
                                        <SimplePredicate field="petal_width" operator="greaterThan" value="0.25857816"/>
                                    </Node>
                                </Node>
                                <Node id="Seg2_Nod_1.1.2" score="4.0">
                                    <SimplePredicate field="petal_width" operator="greaterThan" value="0.5381419370928949"/>
                                    <Node id="Seg2_Nod_1.1.2.1" score="5.0">
                                        <SimplePredicate field="petal_width" operator="lessOrEqual" value="0.25857816"/>
                                    </Node>
                                    <Node id="Seg2_Nod_1.1.2.2" score="5.0">
                                        <SimplePredicate field="petal_width" operator="greaterThan" value="0.25857816"/>
                                    </Node>
                                </Node>
                            </Node>
                            <Node id="Seg2_Nod_1.2" score="3.0">
                                <SimplePredicate field="petal_width" operator="greaterThan" value="0.8001738992731421"/>
                            </Node>
                        </Node>
                    </TreeModel>
                </Segment>
            </Segmentation>
        </MiningModel>
    </AnomalyDetectionModel>
</PMML>

The predicted value is calculated as usual for a mining model. The anomaly score is then the normalized value of that output, defined as 2^(-predictedValue/c(n)), where n is the sampleDataSize and c(n) = 2*H(n-1) - (2*(n-1)/n). H(x) may be reasonably approximated as ln(x) + 0.57721566 (Euler's constant). Finally, the input is classified as an anomaly if the anomaly score is greater than the threshold value defined by the Apply function of the OutputField anomaly.
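
The normalization can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, and rejecting n <= 1 is an assumption made here because ln(n-1) is undefined in that case; the text does not say how such values should be treated.

```python
import math

EULER_GAMMA = 0.57721566  # approximation of Euler's constant quoted in the text


def harmonic_approx(x):
    """Approximate the harmonic number H(x) as ln(x) + Euler's constant."""
    return math.log(x) + EULER_GAMMA


def path_length_normalizer(n):
    """c(n) = 2*H(n-1) - 2*(n-1)/n for a sample of n training records."""
    if n <= 1:
        # Assumption: the approximation needs n - 1 >= 1, so smaller n is rejected.
        raise ValueError("sampleDataSize must be greater than 1")
    return 2.0 * harmonic_approx(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(avg_path_length, sample_data_size):
    """Normalized score 2^(-avg_path_length / c(n)); nearer 1 is more anomalous."""
    return 2.0 ** (-avg_path_length / path_length_normalizer(sample_data_size))


print(path_length_normalizer(5))
print(anomaly_score(3.5, 5))
```

Because the exponent is never positive, the score stays in the interval (0, 1], and shorter average path lengths push it toward 1.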

As an example, assume the input data is:

petal_length = 1.5, petal_width = 5.8, sepal_length = 4.6

Traversing the two trees as usual gives the predictions 4.0 and 3.0. The prediction avg_path_length of the mining model is therefore their average, 3.5.
Using the equations described earlier, c(5) = 2*(ln(4)+0.57721566) - (2*4/5) = 2.32702004, so the prediction anomalyScore of the AnomalyDetectionModel is: 2^(-3.5/c(5)) = 2^(-3.5/2.32702004) = 2^(-1.50406956) = 0.35255749
Since 0.35255749 is less than the threshold 0.422, the prediction anomaly is FALSE. The input data is not an anomaly.
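
The tree traversal in this example can be reproduced with a small Python sketch. The node representation and helper names are hypothetical; the split values and scores are copied from the two TreeModel segments above. The sketch assumes no missing inputs and applies noTrueChildStrategy="returnLastPrediction" by returning the current node's score when no child predicate holds.

```python
from statistics import mean


def le(field, value):
    """Predicate for operator="lessOrEqual"."""
    return lambda row: row[field] <= value


def gt(field, value):
    """Predicate for operator="greaterThan"."""
    return lambda row: row[field] > value


def node(score, predicate=None, children=()):
    """A tree node; the root has no predicate (it is always true)."""
    return {"score": score, "predicate": predicate, "children": list(children)}


def score_tree(current, row):
    """Descend into the first child whose predicate holds; stop when none does."""
    for child in current["children"]:
        if child["predicate"](row):
            return score_tree(child, row)
    return current["score"]  # a leaf, or the last prediction if no child matched


tree1 = node(2.0, children=[
    node(3.0, le("petal_length", 1.7228131956992732), [
        node(4.0, le("sepal_length", 4.772875397423331)),
        node(4.0, gt("sepal_length", 4.772875397423331)),
    ]),
    node(3.0, gt("petal_length", 1.7228131956992732), [
        node(4.0, le("petal_width", 1.7714)),
        node(4.1544313298030655, gt("petal_width", 1.7714)),
    ]),
])

tree2 = node(2.0, children=[
    node(3.0, le("petal_width", 0.8001738992731421), [
        node(4.0, le("petal_width", 0.5381419370928949), [
            node(5.0, le("petal_width", 0.25857816)),
            node(5.0, gt("petal_width", 0.25857816)),
        ]),
        node(4.0, gt("petal_width", 0.5381419370928949), [
            node(5.0, le("petal_width", 0.25857816)),
            node(5.0, gt("petal_width", 0.25857816)),
        ]),
    ]),
    node(3.0, gt("petal_width", 0.8001738992731421)),
])

row = {"petal_length": 1.5, "petal_width": 5.8, "sepal_length": 4.6}
path_lengths = [score_tree(tree, row) for tree in (tree1, tree2)]
avg_path_length = mean(path_lengths)  # multipleModelMethod="average"
print(path_lengths, avg_path_length)
```

Storing the depth as the node score, as the example model does, is what lets the traversal simply return a score instead of counting edges.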

An example of an OCSVM model:

<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
    <Header copyright="2019 dmg.org"/>
    <DataDictionary numberOfFields="5">
        <DataField name="class" optype="categorical" dataType="string">
           <Value value="setosa"/>
           <Value value="versicolor"/>
           <Value value="virginica"/>
        </DataField>
        <DataField name="sepal_length" optype="continuous" dataType="double"/>
        <DataField name="sepal_width" optype="continuous" dataType="double"/>
        <DataField name="petal_length" optype="continuous" dataType="double"/>
        <DataField name="petal_width" optype="continuous" dataType="double"/>
    </DataDictionary>
    <AnomalyDetectionModel functionName="regression" algorithmType="ocsvm" modelName="OneClassSVM">
        <MiningSchema>
            <MiningField name="sepal_length" usageType="active"/>
            <MiningField name="sepal_width" usageType="active"/>
            <MiningField name="petal_length" usageType="active"/>
            <MiningField name="petal_width" usageType="active"/>
        </MiningSchema>
        <Output>
            <OutputField name="anomalyScore" optype="continuous" dataType="float" feature="predictedValue"/>
            <OutputField name="anomaly" optype="categorical" dataType="boolean" feature="decision">
                <Apply function="lessThan">
                    <FieldRef field="anomalyScore"/>
                    <Constant dataType="double">0</Constant>
                </Apply>
            </OutputField>
        </Output>
        <SupportVectorMachineModel functionName="regression" modelName="ocsvm_iris_pmml">
            <MiningSchema>
                <MiningField name="sepal_length"/>
                <MiningField name="sepal_width"/>
                <MiningField name="petal_length"/>
                <MiningField name="petal_width"/>
            </MiningSchema>
            <Output>
                <OutputField dataType="double" feature="predictedValue" name="svm_out" optype="continuous"/>
            </Output>
            <LinearKernelType/>
            <VectorDictionary>
                <VectorFields>
                    <FieldRef field="sepal_length"/>
                    <FieldRef field="sepal_width"/>
                    <FieldRef field="petal_length"/>
                    <FieldRef field="petal_width"/>
                </VectorFields>
                <VectorInstance id="3">
                    <Array type="real">5.5 4.2 1.4 0.2</Array>
                </VectorInstance>
                <VectorInstance id="8">
                    <Array type="real">4.4 3.0 1.3 0.2</Array>
                </VectorInstance>
            </VectorDictionary>
            <SupportVectorMachine>
                <SupportVectors>
                    <SupportVector vectorId="3"/>
                    <SupportVector vectorId="8"/>
                </SupportVectors>
                <Coefficients absoluteValue="-8.33">
                    <Coefficient value="0.5"/>
                    <Coefficient value="0.499"/>
                </Coefficients>
            </SupportVectorMachine>
        </SupportVectorMachineModel>
    </AnomalyDetectionModel>
</PMML>

This is a typical SVM model; the only new feature is the anomaly output field. In this example the threshold is set to 0, which is reasonable because OCSVM models are trained so that the decision value for anomalies is negative.

As an example, assume the input values are 0, 0.5, 1.0 and 2.0 for the mining fields in sequential order. Given the kernel function K as the linear kernel function, the predictedValue is:

sum_{i=1}^{n} α_i * K(x, x_i) + b

= 0.5*K((0,0.5,1.0,2.0) , (5.5,4.2,1.4,0.2)) + 0.499*K((0,0.5,1.0,2.0) , (4.4,3.0,1.3,0.2)) -8.33

= 0.5*(0*5.5 + 0.5*4.2 + 1.0*1.4 + 2*0.2) + 0.499*(1.5 + 1.3 + 0.4) - 8.33

= 0.5*3.9 + 0.499*3.2 - 8.33

= -4.783

As this value is less than the threshold value of 0.0, the output anomaly is TRUE: the input is anomalous.
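
The same calculation can be written as a short Python sketch. The variable and function names are hypothetical; the support vectors, coefficients and the absoluteValue (used here as the offset b) are copied from the example model, and the kernel is the plain dot product of LinearKernelType.

```python
def linear_kernel(x, y):
    """Linear kernel: the dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))


# Values copied from the VectorDictionary and Coefficients elements above.
support_vectors = [(5.5, 4.2, 1.4, 0.2), (4.4, 3.0, 1.3, 0.2)]
coefficients = [0.5, 0.499]
intercept = -8.33  # the absoluteValue attribute, b in the formula


def decision_value(x):
    """Sum of coefficient * K(x, support vector), plus the offset b."""
    weighted = (
        alpha * linear_kernel(x, sv)
        for alpha, sv in zip(coefficients, support_vectors)
    )
    return sum(weighted) + intercept


x = (0.0, 0.5, 1.0, 2.0)  # sepal_length, sepal_width, petal_length, petal_width
score = decision_value(x)
is_anomaly = score < 0.0  # mirrors the lessThan Apply in the anomaly OutputField
print(round(score, 4), is_anomaly)
```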

An example of a Clustering-based model:

<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
    <Header copyright="2019 dmg.org"/>
    <DataDictionary numberOfFields="5">
        <DataField name="class" optype="categorical" dataType="string">
           <Value value="setosa"/>
           <Value value="versicolor"/>
           <Value value="virginica"/>
        </DataField>
        <DataField name="sepal_length" optype="continuous" dataType="double"/>
        <DataField name="sepal_width" optype="continuous" dataType="double"/>
        <DataField name="petal_length" optype="continuous" dataType="double"/>
        <DataField name="petal_width" optype="continuous" dataType="double"/>
    </DataDictionary>
    <AnomalyDetectionModel functionName="regression" algorithmType="clusterMeanDist" modelName="AnomalyOnKmeans">
        <MiningSchema>
            <MiningField name="sepal_length" usageType="active"/>
            <MiningField name="sepal_width" usageType="active"/>
            <MiningField name="petal_length" usageType="active"/>
            <MiningField name="petal_width" usageType="active"/>
        </MiningSchema>
        <Output>
            <OutputField name="anomalyScore" optype="continuous" dataType="float" feature="predictedValue"/>
            <OutputField name="anomaly" optype="categorical" dataType="boolean" feature="decision">
                <Apply function="greaterThan">
                    <FieldRef field="anomalyScore"/>
                    <Constant dataType="double">2.0</Constant>
                </Apply>
            </OutputField>
        </Output>  
        <ClusteringModel algorithmName="KMeans" functionName="clustering" modelClass="centerBased" modelName="K-Means" numberOfClusters="3">
            <MiningSchema>
                <MiningField highValue="7.9" importance="0.466315129248955" lowValue="4.3" missingValueReplacement="6.1" missingValueTreatment="asMedian" name="sepal_length" outliers="asExtremeValues" usageType="active"/>       
                <MiningField highValue="4.4" importance="0.217549102595577" lowValue="2.0" missingValueReplacement="3.2" missingValueTreatment="asMedian" name="sepal_width" outliers="asExtremeValues" usageType="active"/>       
                <MiningField highValue="6.9" importance="1" lowValue="1.0" missingValueReplacement="3.95" missingValueTreatment="asMedian" name="petal_length" outliers="asExtremeValues" usageType="active"/>
                <MiningField highValue="2.5" importance="0.852549535371169" lowValue="0.1" missingValueReplacement="1.3" missingValueTreatment="asMedian" name="petal_width" outliers="asExtremeValues" usageType="active"/>
            </MiningSchema>
            <LocalTransformations>
                <DerivedField dataType="double" name="cluster0" optype="continuous">
                    <NormContinuous field="sepal_length">
                        <LinearNorm norm="0" orig="4.3"/>
                        <LinearNorm norm="1" orig="7.9"/>
                    </NormContinuous>
                </DerivedField>
                <DerivedField dataType="double" name="cluster1" optype="continuous">
                    <NormContinuous field="sepal_width">
                        <LinearNorm norm="0" orig="2"/>
                        <LinearNorm norm="1" orig="4.4"/>
                    </NormContinuous>
                </DerivedField>
                <DerivedField dataType="double" name="cluster2" optype="continuous">
                    <NormContinuous field="petal_length">
                        <LinearNorm norm="0" orig="1"/>
                        <LinearNorm norm="1" orig="6.9"/>
                    </NormContinuous>
                </DerivedField>
                <DerivedField dataType="double" name="cluster3" optype="continuous">
                    <NormContinuous field="petal_width">
                        <LinearNorm norm="0" orig="0.1"/>
                        <LinearNorm norm="1" orig="2.5"/>
                    </NormContinuous>
                </DerivedField>
            </LocalTransformations>
            <ComparisonMeasure kind="distance">
                <euclidean/>
            </ComparisonMeasure>
            <ClusteringField compareFunction="absDiff" field="cluster0" isCenterField="true"/>
            <ClusteringField compareFunction="absDiff" field="cluster1" isCenterField="true"/>
            <ClusteringField compareFunction="absDiff" field="cluster2" isCenterField="true"/>
            <ClusteringField compareFunction="absDiff" field="cluster3" isCenterField="true"/>
            <Cluster name="1" size="50">
                <Array n="4" type="real">0.196111 0.590833 0.0786441 0.06</Array>
                <Covariances>
                    <Matrix kind="diagonal">
                         <Array n="4" type="real">0.009587112622827 0.025204790249433 0.000864869935334 0.001995464852608</Array>
                    </Matrix>
                </Covariances>
            </Cluster>
            <Cluster name="2" size="39">
                <Array n="4" type="real">0.707265 0.450855 0.797045 0.824786</Array>
                <Covariances>
                    <Matrix kind="diagonal">
                        <Array n="4" type="real">0.019486929574651 0.013603051431999 0.007748638163371 0.013722540860699</Array>
                    </Matrix>
                </Covariances>
            </Cluster>
            <Cluster name="3" size="61">
                <Array n="4" type="real">0.441257 0.307377 0.575715 0.54918</Array>
                <Covariances>
                    <Matrix kind="diagonal">
                        <Array n="4" type="real">0.015537509276125 0.014940042501518 0.007976321106145 0.012876631754705</Array>
                    </Matrix>
                </Covariances>
            </Cluster>
        </ClusteringModel>   
        <MeanClusterDistances>
            <Array n="3" type="real"> 0.165 0.211 0.210</Array>
        </MeanClusterDistances>  
   </AnomalyDetectionModel>
</PMML>

LocalTransformations shows that the inputs were normalized. The mean distances from cases to their assigned cluster centers are 0.165, 0.211 and 0.210, and those values are placed in the MeanClusterDistances element. For a case with input values 5.1, 3.5, 1.4, 0.2, the distance to the closest (first) center is 0.048, so the anomaly score is 0.048/0.165 = 0.291. As this is below the threshold of 2.0, the case is not an anomaly. For case 61 with inputs 5.0, 2.0, 3.5, 1.0, the assigned cluster is the third one with distance 0.457, so the anomaly score is 0.457/0.210 = 2.176. As this is above the threshold of 2.0, this case is an anomaly.
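
The two scores can be reproduced with the following Python sketch. The names are hypothetical; the normalization ranges, cluster centers and mean distances are copied from the example model. Two simplifications are assumptions of this sketch: the outlier clamping declared in the MiningSchema is ignored because both cases lie inside the declared ranges, and a mean distance of 0 is not handled because the special treatment is not spelled out in the text.

```python
import math

# (orig at norm=0, orig at norm=1) for each input, from LocalTransformations.
ranges = [(4.3, 7.9), (2.0, 4.4), (1.0, 6.9), (0.1, 2.5)]
centers = [
    (0.196111, 0.590833, 0.0786441, 0.06),
    (0.707265, 0.450855, 0.797045, 0.824786),
    (0.441257, 0.307377, 0.575715, 0.54918),
]
mean_cluster_distances = [0.165, 0.211, 0.210]
threshold = 2.0


def normalize(case):
    """Linear rescaling of each input to the [0, 1] range used by the clusters."""
    return [(v - lo) / (hi - lo) for v, (lo, hi) in zip(case, ranges)]


def euclidean(a, b):
    """Euclidean distance with the absDiff compare function."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def anomaly_score(case):
    """Return (cluster index, distance to its center, distance / mean distance)."""
    z = normalize(case)
    distances = [euclidean(z, center) for center in centers]
    k = min(range(len(centers)), key=distances.__getitem__)
    return k, distances[k], distances[k] / mean_cluster_distances[k]


for case in [(5.1, 3.5, 1.4, 0.2), (5.0, 2.0, 3.5, 1.0)]:
    k, dist, score = anomaly_score(case)
    print(k + 1, round(dist, 3), round(score, 3), score > threshold)
```

The unrounded distances differ slightly from the three-decimal values quoted in the text, so the printed ratios agree with 0.291 and 2.176 only to about three decimal places.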
