PMML 4.4 - Anomaly Detection Models

Anomaly detection (also called outlier detection) is the identification of items, events or observations that do not conform to an expected pattern or to the other items in a data set. Traditional approaches are distance-based or density-based. Common ways to define distance or density are the distance to the k nearest neighbors or the count of points within a given fixed radius. These methods, however, cannot handle data sets with regions of different densities and do not scale well to large data. Other algorithms have been proposed that handle such cases better; the PMML standard at this time supports three such algorithms: Isolation Forest, One-Class SVM, and clustering mean distance based anomaly detection.
Other models can also be used if their scoring follows PMML standard rules.

Isolation Forest is an approach that detects anomalies by isolating instances, without relying on any distance or density measure. The logic is that anomalous observations are easier to isolate in a random forest, as only a few conditions are needed to separate them from the normal observations, whereas isolating normal observations requires more conditions. An anomaly score can therefore be calculated from the number of conditions required to separate a given observation. Anomalies are more likely to be isolated close to the root of an isolation tree, whereas normal points are more likely to be isolated at the deeper end of the tree; anomalies are more susceptible to isolation and hence have short path lengths. The path length of a point is the number of edges it traverses in a tree from the root node until the traversal terminates at an external node. Anomaly is measured by path length: shorter paths indicate anomaly. The path length must be normalized, and normalized values closer to 1 are more anomalous. The required parameters for this model are thus the number of trees, the tree height limit and the sample data size used to calculate the path length normalization constant.

One-Class SVMs (OCSVM) are used for novelty and anomaly detection. The support vector model is trained on data that has only one class, the “normal” class. It infers the properties of normal cases and from these properties can predict which examples are unlike the normal examples. This is useful for anomaly detection because the scarcity of training examples is what defines anomalies: typically there are very few examples of network intrusion, fraud, or other anomalous behavior. The SVM model is trained to maximize the distance between the origin and the support vector hyperplane. Therefore, the decision value for anomalous data points is negative.
Clustering mean distance based anomaly detection models compare the distance (or similarity measure) of a case to its cluster center with the average distance (similarity) for that cluster. If the distance is significantly higher (or the similarity significantly lower) than average, the case is considered anomalous, with its anomaly index defined as the ratio of the distances/similarities (with special treatment for an average distance of 0).

The PMML schema defines a new model type, AnomalyDetectionModel, which can have PMML models as sub-elements. An AnomalyDetectionModel itself is not allowed as a sub-element. Thus, with minor additions, a MiningModel element representing a random forest can be used to represent isolation forests; a SupportVectorMachineModel element can be used to represent One-Class SVM models; and a ClusteringModel can be used to represent a clustering mean distance based anomaly detection model. An OutputField element with feature="decision" can be used to output a boolean indicating whether the case is an anomaly, with the predicted value of the Anomaly Detection model being some measure of the anomaly.

XML Schema

All variations of Anomaly Detection Models use pre-existing model definitions.
<xs:element name="AnomalyDetectionModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:group ref="MODEL-ELEMENT"/>
      <xs:element ref="MeanClusterDistances" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string"/>
    <xs:attribute name="algorithmName" type="xs:string"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmType" type="ALGORITHM-TYPE" use="required"/>
    <xs:attribute name="sampleDataSize" type="xs:string"/>
    <xs:attribute name="isScorable" type="xs:boolean" default="true"/>
  </xs:complexType>
</xs:element>

<xs:simpleType name="ALGORITHM-TYPE">
  <xs:restriction base="xs:string">
    <xs:enumeration value="iforest"/>
    <xs:enumeration value="ocsvm"/>
    <xs:enumeration value="clusterMeanDist"/>
    <xs:enumeration value="other"/>
  </xs:restriction>
</xs:simpleType>

<xs:element name="MeanClusterDistances">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="NUM-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The anomaly model currently reuses one of the already defined model types from PMML, hence it contains a single MODEL-ELEMENT as the main child element. Any model other than AnomalyDetectionModel can be used, but at present specific scoring is described for three models. These are indicated in the algorithmType attribute: iforest for isolation forests, ocsvm for One-Class SVM models, and clusterMeanDist for clustering mean distance based models. The value other covers any other model.
sampleDataSize is a required parameter for isolation forest models. It is the size of the data set used to train the forest and is needed to normalize the tree search depth.

The MeanClusterDistances element contains an array of non-negative real values; it is required when the algorithm type is clusterMeanDist. The length of the array must equal the number of clusters in the model, and its values are the mean distances/similarities to the center for each cluster.

An example of an isolation forest model

The tree depths are saved as score attributes in the trees; this is necessary because the algorithm sometimes modifies the tree depth from an integer to a normalized value. It is therefore much simpler to read the equivalent depth off as the score than to infer it. The 'raw' path length is the average of the predicted values of the trees, which is then normalized to give the anomaly score, the prediction of the model. The result of the Apply function of the OutputField with feature="decision" then determines whether this anomaly score is categorized as an anomaly.
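The normalization step can be sketched in a few lines. This is a minimal illustration, assuming the isolation forest normalization 2^-(predictedValue/c(n)) with c(n) = 2*H(n-1) - (2*(n-1)/n) and H(x) approximated as ln(x) + 0.57721566; the function names are hypothetical and not part of PMML, and rejecting sample sizes below 2 is a choice made for this sketch only.

```python
import math

EULER_GAMMA = 0.57721566  # Euler's constant used in the H(x) approximation


def harmonic_approx(x: float) -> float:
    """Approximate the harmonic number H(x) as ln(x) + Euler's constant."""
    return math.log(x) + EULER_GAMMA


def normalization_constant(sample_data_size: int) -> float:
    """c(n) = 2*H(n-1) - 2*(n-1)/n, where n is the sampleDataSize."""
    n = sample_data_size
    if n < 2:
        # Assumption of this sketch: ln(n-1) is undefined for n < 2,
        # so such sizes are rejected instead of being special-cased.
        raise ValueError("sampleDataSize must be at least 2")
    return 2.0 * harmonic_approx(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(avg_path_length: float, sample_data_size: int) -> float:
    """Normalize the averaged path length into a score in (0, 1]."""
    c_n = normalization_constant(sample_data_size)
    return 2.0 ** (-avg_path_length / c_n)


def is_anomaly(score: float, threshold: float) -> bool:
    """Mirror a greaterThan decision OutputField: True if score > threshold."""
    return score > threshold


if __name__ == "__main__":
    # Hypothetical averaged path lengths, scored with sampleDataSize = 5.
    for path_length in (2.5, 4.0):
        score = anomaly_score(path_length, 5)
        print(path_length, round(score, 4), is_anomaly(score, 0.422))
```

Shorter averaged path lengths give scores closer to 1, so the decision compares the score against the threshold with greaterThan.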
<PMML xmlns="https://www.dmg.org/PMML-4_4" version="4.4">
  <Header copyright="2019 dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="class" optype="categorical" dataType="string">
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
    <DataField name="sepal_length" optype="continuous" dataType="double"/>
    <DataField name="sepal_width" optype="continuous" dataType="double"/>
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="petal_width" optype="continuous" dataType="double"/>
  </DataDictionary>
  <AnomalyDetectionModel functionName="regression" algorithmType="iforest" modelName="IsolationForests" sampleDataSize="5">
    <MiningSchema>
      <MiningField name="sepal_length" usageType="active"/>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
    </MiningSchema>
    <Output>
      <OutputField name="anomalyScore" optype="continuous" dataType="float" feature="predictedValue"/>
      <OutputField name="anomaly" optype="categorical" dataType="boolean" feature="decision">
        <Apply function="greaterThan">
          <FieldRef field="anomalyScore"/>
          <Constant dataType="double">0.422</Constant>
        </Apply>
      </OutputField>
    </Output>
    <MiningModel functionName="regression" modelName="iforest_iris_pmml">
      <MiningSchema>
        <MiningField name="sepal_length" usageType="active"/>
        <MiningField name="petal_length" usageType="active"/>
        <MiningField name="petal_width" usageType="active"/>
      </MiningSchema>
      <Output>
        <OutputField name="avg_path_length" optype="continuous" dataType="double" feature="predictedValue"/>
      </Output>
      <Segmentation multipleModelMethod="average">
        <Segment id="Seg_1">
          <True/>
          <TreeModel functionName="regression" missingValueStrategy="nullPrediction" noTrueChildStrategy="returnLastPrediction" splitCharacteristic="multiSplit" modelName="SegmentModel_1">
            <MiningSchema>
              <MiningField name="petal_length"/>
              <MiningField name="sepal_length"/>
            </MiningSchema>
            <Node id="Seg1_Nod_1" score="2.0">
              <True/>
              <Node id="Seg1_Nod_1.1" score="3.0">
                <SimplePredicate field="petal_length" operator="lessOrEqual" value="1.7228131956992732"/>
                <Node id="Seg1_Nod_1.1.1" score="4.0">
                  <SimplePredicate field="sepal_length" operator="lessOrEqual" value="4.772875397423331"/>
                </Node>
                <Node id="Seg1_Nod_1.1.2" score="4">
                  <SimplePredicate field="sepal_length" operator="greaterThan" value="4.772875397423331"/>
                </Node>
              </Node>
              <Node id="Seg1_Nod_1.2" score="3.0">
                <SimplePredicate field="petal_length" operator="greaterThan" value="1.7228131956992732"/>
                <Node id="Seg1_Nod_1.2.1" score="4.0">
                  <SimplePredicate field="petal_width" operator="lessOrEqual" value="1.7714"/>
                </Node>
                <Node id="Seg1_Nod_1.2.2" score="4.1544313298030655">
                  <SimplePredicate field="petal_width" operator="greaterThan" value="1.7714"/>
                </Node>
              </Node>
            </Node>
          </TreeModel>
        </Segment>
        <Segment id="Seg_2">
          <True/>
          <TreeModel functionName="regression" missingValueStrategy="nullPrediction" noTrueChildStrategy="returnLastPrediction" splitCharacteristic="multiSplit" modelName="SegmentModel_2">
            <MiningSchema>
              <MiningField name="petal_width"/>
            </MiningSchema>
            <Node id="Seg2_Nod_1" score="2.0">
              <True/>
              <Node id="Seg2_Nod_1.1" score="3.0">
                <SimplePredicate field="petal_width" operator="lessOrEqual" value="0.8001738992731421"/>
                <Node id="Seg2_Nod_1.1.1" score="4.0">
                  <SimplePredicate field="petal_width" operator="lessOrEqual" value="0.5381419370928949"/>
                  <Node id="Seg2_Nod_1.1.1.1" score="5.0">
                    <SimplePredicate field="petal_width" operator="lessOrEqual" value="0.25857816"/>
                  </Node>
                  <Node id="Seg2_Nod_1.1.1.2" score="5.0">
                    <SimplePredicate field="petal_width" operator="greaterThan" value="0.25857816"/>
                  </Node>
                </Node>
                <Node id="Seg2_Nod_1.1.2" score="4.0">
                  <SimplePredicate field="petal_width" operator="greaterThan" value="0.5381419370928949"/>
                  <Node id="Seg2_Nod_1.1.2.1" score="5.0">
                    <SimplePredicate field="petal_width" operator="lessOrEqual" value="0.25857816"/>
                  </Node>
                  <Node id="Seg2_Nod_1.1.2.2" score="5.0">
                    <SimplePredicate field="petal_width" operator="greaterThan" value="0.25857816"/>
                  </Node>
                </Node>
              </Node>
              <Node id="Seg2_Nod_1.2" score="3.0">
                <SimplePredicate field="petal_width" operator="greaterThan" value="0.8001738992731421"/>
              </Node>
            </Node>
          </TreeModel>
        </Segment>
      </Segmentation>
    </MiningModel>
  </AnomalyDetectionModel>
</PMML>

The predicted value is calculated as usual for a mining model. The anomaly score is then the normalized value of that output. This normalization is defined as 2^-(predictedValue/c(n)), where n is the sampleDataSize and c(n) = 2*H(n-1) - (2*(n-1)/n). H(x) may be reasonably approximated as ln(x) + 0.57721566 (Euler's constant). Finally, the input is classified as an anomaly if this anomaly score is greater than the threshold value defined by the Apply function of the OutputField anomaly. As an example, assume the input data is:
An example of an OCSVM model:

<PMML xmlns="https://www.dmg.org/PMML-4_4" version="4.4">
  <Header copyright="2019 dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="class" optype="categorical" dataType="string">
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
    <DataField name="sepal_length" optype="continuous" dataType="double"/>
    <DataField name="sepal_width" optype="continuous" dataType="double"/>
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="petal_width" optype="continuous" dataType="double"/>
  </DataDictionary>
  <AnomalyDetectionModel functionName="regression" algorithmType="ocsvm" modelName="OneClassSVM">
    <MiningSchema>
      <MiningField name="sepal_length" usageType="active"/>
      <MiningField name="sepal_width" usageType="active"/>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
    </MiningSchema>
    <Output>
      <OutputField name="anomalyScore" optype="continuous" dataType="float" feature="predictedValue"/>
      <OutputField name="anomaly" optype="categorical" dataType="boolean" feature="decision">
        <Apply function="lessThan">
          <FieldRef field="anomalyScore"/>
          <Constant dataType="double">0</Constant>
        </Apply>
      </OutputField>
    </Output>
    <SupportVectorMachineModel functionName="regression" modelName="ocsvm_iris_pmml">
      <MiningSchema>
        <MiningField name="sepal_length"/>
        <MiningField name="sepal_width"/>
        <MiningField name="petal_length"/>
        <MiningField name="petal_width"/>
      </MiningSchema>
      <Output>
        <OutputField dataType="double" feature="predictedValue" name="svm_out" optype="continuous"/>
      </Output>
      <LinearKernelType/>
      <VectorDictionary>
        <VectorFields>
          <FieldRef field="sepal_length"/>
          <FieldRef field="sepal_width"/>
          <FieldRef field="petal_length"/>
          <FieldRef field="petal_width"/>
        </VectorFields>
        <VectorInstance id="3">
          <Array type="real">5.5 4.2 1.4 0.2</Array>
        </VectorInstance>
        <VectorInstance id="8">
          <Array type="real">4.4 3.0 1.3 0.2</Array>
        </VectorInstance>
      </VectorDictionary>
      <SupportVectorMachine>
        <SupportVectors>
          <SupportVector vectorId="3"/>
          <SupportVector vectorId="8"/>
        </SupportVectors>
        <Coefficients absoluteValue="-8.33">
          <Coefficient value="0.5"/>
          <Coefficient value="0.499"/>
        </Coefficients>
      </SupportVectorMachine>
    </SupportVectorMachineModel>
  </AnomalyDetectionModel>
</PMML>

It is a typical SVM model. The only new feature is the anomaly output field. In this example the threshold is set to 0, which is reasonable because OCSVM models are trained so that distances to anomalies are negative. As an example, assume the input values are 0, 0.5, 1.0 and 2.0 for the mining fields in sequential order. With K the linear kernel function, x the input vector, and v3 and v8 the two support vectors, the predictedValue is:

0.5 * K(x, v3) + 0.499 * K(x, v8) - 8.33 = 0.5 * 3.9 + 0.499 * 3.2 - 8.33 = -4.7832
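A minimal sketch of how the decision value for this input could be computed, assuming the usual SVM scoring form in which the predicted value is the coefficient-weighted sum of kernel values over the support vectors plus the absoluteValue offset. The function and variable names are illustrative and not part of PMML; the numbers are taken from the example model above.

```python
from typing import Sequence


def linear_kernel(x: Sequence[float], v: Sequence[float]) -> float:
    """Linear kernel: the dot product of the input and a support vector."""
    return sum(a * b for a, b in zip(x, v))


def ocsvm_decision_value(
    x: Sequence[float],
    support_vectors: Sequence[Sequence[float]],
    coefficients: Sequence[float],
    absolute_value: float,
) -> float:
    """Sum of coefficient * K(x, support vector), plus the offset.

    Assumption of this sketch: absoluteValue is applied as an additive
    constant, as in ordinary SVM regression scoring.
    """
    weighted = sum(
        c * linear_kernel(x, v) for c, v in zip(coefficients, support_vectors)
    )
    return weighted + absolute_value


# Values copied from the VectorInstance and Coefficients elements.
SUPPORT_VECTORS = [[5.5, 4.2, 1.4, 0.2], [4.4, 3.0, 1.3, 0.2]]
COEFFICIENTS = [0.5, 0.499]
ABSOLUTE_VALUE = -8.33

# Input in mining-field order: sepal_length, sepal_width, petal_length, petal_width.
x = [0.0, 0.5, 1.0, 2.0]
score = ocsvm_decision_value(x, SUPPORT_VECTORS, COEFFICIENTS, ABSOLUTE_VALUE)
anomaly = score < 0.0  # mirrors the lessThan decision OutputField
print(score, anomaly)
```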
Because this value is less than the threshold value of 0.0, the output anomaly is TRUE: the input is anomalous.

An example of a Clustering-based model:

<PMML xmlns="https://www.dmg.org/PMML-4_4" version="4.4">
  <Header copyright="2019 dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="class" optype="categorical" dataType="string">
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
    <DataField name="sepal_length" optype="continuous" dataType="double"/>
    <DataField name="sepal_width" optype="continuous" dataType="double"/>
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="petal_width" optype="continuous" dataType="double"/>
  </DataDictionary>
  <AnomalyDetectionModel functionName="regression" algorithmType="clusterMeanDist" modelName="AnomalyOnKmeans">
    <MiningSchema>
      <MiningField name="sepal_length" usageType="active"/>
      <MiningField name="sepal_width" usageType="active"/>
      <MiningField name="petal_length" usageType="active"/>
      <MiningField name="petal_width" usageType="active"/>
    </MiningSchema>
    <Output>
      <OutputField name="anomalyScore" optype="continuous" dataType="float" feature="predictedValue"/>
      <OutputField name="anomaly" optype="categorical" dataType="boolean" feature="decision">
        <Apply function="greaterThan">
          <FieldRef field="anomalyScore"/>
          <Constant dataType="double">2.0</Constant>
        </Apply>
      </OutputField>
    </Output>
    <ClusteringModel algorithmName="KMeans" functionName="clustering" modelClass="centerBased" modelName="K-Means" numberOfClusters="3">
      <MiningSchema>
        <MiningField highValue="7.9" importance="0.466315129248955" lowValue="4.3" missingValueReplacement="6.1" missingValueTreatment="asMedian" name="sepal_length" outliers="asExtremeValues" usageType="active"/>
        <MiningField highValue="4.4" importance="0.217549102595577" lowValue="2.0" missingValueReplacement="3.2" missingValueTreatment="asMedian" name="sepal_width" outliers="asExtremeValues" usageType="active"/>
        <MiningField highValue="6.9" importance="1" lowValue="1.0" missingValueReplacement="3.95" missingValueTreatment="asMedian" name="petal_length" outliers="asExtremeValues" usageType="active"/>
        <MiningField highValue="2.5" importance="0.852549535371169" lowValue="0.1" missingValueReplacement="1.3" missingValueTreatment="asMedian" name="petal_width" outliers="asExtremeValues" usageType="active"/>
      </MiningSchema>
      <LocalTransformations>
        <DerivedField dataType="double" name="cluster0" optype="continuous">
          <NormContinuous field="sepal_length">
            <LinearNorm norm="0" orig="4.3"/>
            <LinearNorm norm="1" orig="7.9"/>
          </NormContinuous>
        </DerivedField>
        <DerivedField dataType="double" name="cluster1" optype="continuous">
          <NormContinuous field="sepal_width">
            <LinearNorm norm="0" orig="2"/>
            <LinearNorm norm="1" orig="4.4"/>
          </NormContinuous>
        </DerivedField>
        <DerivedField dataType="double" name="cluster2" optype="continuous">
          <NormContinuous field="petal_length">
            <LinearNorm norm="0" orig="1"/>
            <LinearNorm norm="1" orig="6.9"/>
          </NormContinuous>
        </DerivedField>
        <DerivedField dataType="double" name="cluster3" optype="continuous">
          <NormContinuous field="petal_width">
            <LinearNorm norm="0" orig="0.1"/>
            <LinearNorm norm="1" orig="2.5"/>
          </NormContinuous>
        </DerivedField>
      </LocalTransformations>
      <ComparisonMeasure kind="distance">
        <euclidean/>
      </ComparisonMeasure>
      <ClusteringField compareFunction="absDiff" field="cluster0" isCenterField="true"/>
      <ClusteringField compareFunction="absDiff" field="cluster1" isCenterField="true"/>
      <ClusteringField compareFunction="absDiff" field="cluster2" isCenterField="true"/>
      <ClusteringField compareFunction="absDiff" field="cluster3" isCenterField="true"/>
      <Cluster name="1" size="50">
        <Array n="4" type="real">0.196111 0.590833 0.0786441 0.06</Array>
        <Covariances>
          <Matrix kind="diagonal">
            <Array n="4" type="real">0.009587112622827 0.025204790249433 0.000864869935334 0.001995464852608</Array>
          </Matrix>
        </Covariances>
      </Cluster>
      <Cluster name="2" size="39">
        <Array n="4" type="real">0.707265 0.450855 0.797045 0.824786</Array>
        <Covariances>
          <Matrix kind="diagonal">
            <Array n="4" type="real">0.019486929574651 0.013603051431999 0.007748638163371 0.013722540860699</Array>
          </Matrix>
        </Covariances>
      </Cluster>
      <Cluster name="3" size="61">
        <Array n="4" type="real">0.441257 0.307377 0.575715 0.54918</Array>
        <Covariances>
          <Matrix kind="diagonal">
            <Array n="4" type="real">0.015537509276125 0.014940042501518 0.007976321106145 0.012876631754705</Array>
          </Matrix>
        </Covariances>
      </Cluster>
    </ClusteringModel>
    <MeanClusterDistances>
      <Array n="3" type="real"> 0.165 0.211 0.210</Array>
    </MeanClusterDistances>
  </AnomalyDetectionModel>
</PMML>
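The anomaly index for a clustering mean distance model can be sketched as follows. This is an illustration only: it assumes a distance measure (not a similarity), works on field values that are already normalized as in the LocalTransformations of the example, and uses a plain Euclidean distance with equal field weights. The text mentions special treatment for an average distance of 0 without giving the rule here, so the zero-mean branch below is a labeled placeholder, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Cluster centers and mean distances copied from the example model.
CENTERS: List[List[float]] = [
    [0.196111, 0.590833, 0.0786441, 0.06],
    [0.707265, 0.450855, 0.797045, 0.824786],
    [0.441257, 0.307377, 0.575715, 0.54918],
]
MEAN_DISTANCES: List[float] = [0.165, 0.211, 0.210]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance built from per-field absolute differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def anomaly_index(
    case: Sequence[float],
    centers: Sequence[Sequence[float]],
    mean_distances: Sequence[float],
) -> Tuple[float, int]:
    """Return (anomaly index, position of the nearest cluster).

    The index is the distance to the nearest center divided by that
    cluster's mean distance.
    """
    distances = [euclidean(case, center) for center in centers]
    nearest = min(range(len(centers)), key=distances.__getitem__)
    d, mean_d = distances[nearest], mean_distances[nearest]
    if mean_d == 0.0:
        # Placeholder, NOT the rule from the standard: treat any positive
        # distance as infinitely anomalous and a zero distance as normal.
        return (math.inf if d > 0.0 else 0.0), nearest
    return d / mean_d, nearest


if __name__ == "__main__":
    # A hypothetical normalized case lying far from every center.
    index, cluster = anomaly_index([0.9, 0.9, 0.1, 0.1], CENTERS, MEAN_DISTANCES)
    print(cluster, index, index > 2.0)  # decision uses greaterThan 2.0
```

An index near 1 means the case sits at a typical distance from its cluster center; the example model flags a case once the index exceeds 2.0.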