PMML 4.0 - Naïve Bayes
Naïve Bayes uses Bayes' Theorem, combined with a ("naive") presumption of
conditional independence, to predict the value of a target (output),
from evidence given by one or more predictor (input) fields.
Given a categorical target field T with possible values
T1,...Tm, and predictor fields
I1,...In, with values (in the current record) of
I1*,...In*, the probability that the target T has value
Ti, given the values of the predictors, is derived as follows:
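\[
\begin{aligned}
P(T_i \mid I_1^*, \ldots, I_n^*)
  &= \frac{P(T_i)\, P(I_1^*, \ldots, I_n^* \mid T_i)}{P(I_1^*, \ldots, I_n^*)}
  && \text{by Bayes' theorem} \\
  &\approx \frac{P(T_i) \prod_j P(I_j^* \mid T_i)}{P(I_1^*, \ldots, I_n^*)}
  && \text{by the conditional independence assumption} \\
  &= \frac{P(T_i) \prod_j P(I_j^* \mid T_i)}{\sum_k \left( P(T_k) \prod_j P(I_j^* \mid T_k) \right)} \\
  &= \frac{L_i}{\sum_k L_k}
  && \text{defining likelihood } L_k = P(T_k) \prod_j P(I_j^* \mid T_k)
\end{aligned}
\]
\[
\begin{aligned}
L_i &= P(T_i) \prod_j P(I_j^* \mid T_i) \\
    &= \frac{\mathrm{count}[T_i]}{\sum_k \mathrm{count}[T_k]}
       \prod_j \frac{\mathrm{count}[I_j^* T_i] \,/\, \sum_k \mathrm{count}[T_k]}
                    {\mathrm{count}[T_i] \,/\, \sum_k \mathrm{count}[T_k]} \\
    &\propto \mathrm{count}[T_i] \prod_j \frac{\mathrm{count}[I_j^* T_i]}{\mathrm{count}[T_i]}
    && \text{removing factors of } \textstyle\sum_k \mathrm{count}[T_k] \text{ common to all } L
\end{aligned}
\]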
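The count-based form of the likelihood can be sketched in a few lines of Python. This is an illustration only: it assumes Python 3.8 or later (for math.prod), the names likelihood and posteriors are hypothetical, and the toy counts are invented for the example rather than taken from any model.

```python
from math import prod

def likelihood(target_count, pair_counts):
    """L_i proportional to count[T_i] * prod_j(count[I_j* T_i] / count[T_i])."""
    return target_count * prod(c / target_count for c in pair_counts)

def posteriors(target_counts, pair_counts_by_target):
    """Normalize the likelihoods so that they sum to 1: P(T_i | ...) = L_i / sum_k L_k."""
    likelihoods = {
        target: likelihood(target_counts[target], pair_counts_by_target[target])
        for target in target_counts
    }
    total = sum(likelihoods.values())
    return {target: value / total for target, value in likelihoods.items()}

# Hypothetical toy data: two target values, two predictors in the current record.
target_counts = {"yes": 6, "no": 4}
# For each target value, the co-occurrence counts of the record's predictor values.
pair_counts_by_target = {"yes": [3, 2], "no": [1, 2]}

result = posteriors(target_counts, pair_counts_by_target)
print(result)
```

Because only the ratio of likelihoods matters, the common factor Sumk count[Tk] never has to be computed. Zero counts are not handled in this sketch.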
A count of zero requires special attention. Without adjustment, a count of zero
would exercise an absolute veto over any likelihood in which that count appears
as a factor. Therefore, the Bayes model incorporates a threshold parameter that
specifies a default (usually very small) probability to use in lieu of
P(Ij* | Ti) when count[Ij* Ti] is zero.
A second adaptation, this one to missing values in the training data, involves
the denominator count[Ti] in the conditional-probability terms. Accuracy
improves if the denominator for P(Ij* | Ti) is replaced
by the sum Sumk count[Ijk Ti], that is, the
sum of the counts of co-occurrences of target value Ti with any
(non-missing) value Ijk of item Ij.
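The two adjustments can be combined in one small helper. The following Python sketch is an assumption about how a consumer might implement them, not a normative procedure; the function name conditional_probability and the nested-dictionary layout are hypothetical. The counts are the "no of claims" pair counts from the example model given later in this document.

```python
def conditional_probability(pair_counts_for_field, observed_value, target, threshold):
    """Estimate P(Ij* | Ti) with both adjustments applied.

    pair_counts_for_field maps each input value to {target value: count}.
    """
    count = pair_counts_for_field.get(observed_value, {}).get(target, 0)
    if count == 0:
        # Zero count: use the model's threshold instead of a probability of 0.
        return threshold
    # Denominator: co-occurrences of the target value with ANY non-missing
    # value of this field, rather than count[Ti].
    denominator = sum(
        counts.get(target, 0) for counts in pair_counts_for_field.values()
    )
    return count / denominator

# "no of claims" pair counts from the example model in this document.
no_of_claims = {
    "0":  {"100": 4698, "500": 623,  "1000": 1259, "5000": 550, "10000": 40},
    "1":  {"100": 3526, "500": 1798, "1000": 227,  "5000": 152, "10000": 40},
    "2":  {"100": 225,  "500": 10,   "1000": 9,    "5000": 0,   "10000": 10},
    ">2": {"100": 112,  "500": 5,    "1000": 1,    "5000": 1,   "10000": 8},
}

print(conditional_probability(no_of_claims, "2", "100", 0.001))   # 225 / 8561
print(conditional_probability(no_of_claims, "2", "5000", 0.001))  # threshold
```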
Naïve Bayes models require that each field (whether target or predictor) be
discretized so that for each field, only a small, finite number of values are
considered by the model.
In sum, a Naïve Bayes model requires the following parameters and counts:
- An attribute threshold specifies the probability to use in lieu of
P(Ij* | Ti) when count[Ij* Ti]
is zero.
- An element TargetValueCounts lists, for each value Ti of the
target field, the number of occurrences of that target value in the
training data, i.e. count[Ti].
- For each predictor field Ii, for each discrete value
Iij of that field, an element PairCounts lists, for each value
Tk of the target field, the number of occurrences of that
predictor value jointly with that target value, i.e.
count[IijTk].
A NaiveBayesModel essentially defines a set of matrices. For each input field
there is a matrix which contains the frequency counts of an input value with
respect to a target value.
|        |     | Target value                                          |
|        |     | t1            | t2            | t3            | ... |
|        |     | count[t1]     | count[t2]     | count[t3]     | ... |
| Input1 | i11 | count[i11,t1] | count[i11,t2] | count[i11,t3] | ... |
|        | i12 | count[i12,t1] | count[i12,t2] | count[i12,t3] | ... |
|        | ... | ...           | ...           | ...           | ... |
| Input2 | i21 | count[i21,t1] | count[i21,t2] | count[i21,t3] | ... |
|        | i22 | count[i22,t1] | count[i22,t2] | count[i22,t3] | ... |
|        | i23 | count[i23,t1] | count[i23,t2] | count[i23,t3] | ... |
|        | ... | ...           | ...           | ...           | ... |
| Input3 | ... | ...           | ...           | ...           | ... |
XSD
<xs:element name="NaiveBayesModel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="MiningSchema"/>
<xs:element ref="Output" minOccurs="0" />
<xs:element ref="ModelStats" minOccurs="0"/>
<xs:element ref="ModelExplanation" minOccurs="0"/>
<xs:element ref="Targets" minOccurs="0" />
<xs:element ref="LocalTransformations" minOccurs="0" />
<xs:element ref="BayesInputs" />
<xs:element ref="BayesOutput" />
<xs:element ref="ModelVerification" minOccurs="0"/>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="modelName" type="xs:string" />
<xs:attribute name="threshold" type="REAL-NUMBER" use="required"/>
<xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
<xs:attribute name="algorithmName" type="xs:string" />
</xs:complexType>
</xs:element>
Bayes Inputs
The BayesInputs element contains one or more BayesInput elements.
<xs:element name="BayesInputs">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element maxOccurs="unbounded" ref="BayesInput" />
</xs:sequence>
</xs:complexType>
</xs:element>
Bayes Input
Each BayesInput contains the counts pairing the discrete values of that
field with those of the target field. Each BayesInput for a continuous field
also defines how the continuous values are encoded as discrete bins.
(Discretization is achieved using DerivedField; only the Discretize mapping
for DerivedField may be invoked here.)
<xs:element name="BayesInput">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element minOccurs="0" ref="DerivedField" />
<xs:element maxOccurs="unbounded" ref="PairCounts" />
</xs:sequence>
<xs:attribute name="fieldName" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
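To illustrate how a continuous value might be mapped to a discrete bin before its PairCounts are looked up, here is a minimal Python sketch. It handles only intervals with closure="closedOpen", the function name discretize and the tuple layout are hypothetical, and the bin edges follow the "age of car" example given later in this document.

```python
def discretize(value, bins):
    """Return the binValue of the first closedOpen interval containing value.

    bins is a list of (bin_value, left_margin, right_margin) tuples; a margin
    of None means the interval is unbounded on that side.
    Returns None when no interval matches.
    """
    for bin_value, left, right in bins:
        at_or_above_left = left is None or value >= left   # closed on the left
        below_right = right is None or value < right       # open on the right
        if at_or_above_left and below_right:
            return bin_value
    return None

# Bins as in the "age of car" example: [0, 1), [1, 5), [5, +inf)
age_of_car_bins = [("0", 0, 1), ("1", 1, 5), ("2", 5, None)]

print(discretize(0.5, age_of_car_bins))
print(discretize(1, age_of_car_bins))
print(discretize(12.0, age_of_car_bins))
```

The returned bin value is what gets matched against the value attribute of PairCounts.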
Bayes Output
BayesOutput contains the counts associated with the values of the target
field.
<xs:element name="BayesOutput">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="TargetValueCounts" />
</xs:sequence>
<xs:attribute name="fieldName" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
Pair Counts
PairCounts lists, for a field Ii's discrete value Iij,
the TargetValueCounts that pair the value Iij with each value of the target
field.
<xs:element name="PairCounts">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="TargetValueCounts" />
</xs:sequence>
<xs:attribute name="value" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
Target Value Counts
TargetValueCounts lists the counts associated with each value of the target
field. However, a TargetValueCount whose count is zero may be omitted.
Within BayesOutput, TargetValueCounts lists the total count of occurrences of
each target value.
Within PairCounts, TargetValueCounts lists, for each target value, the count
of the joint occurrences of that target value with a particular discrete input
value.
<xs:element name="TargetValueCounts">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element maxOccurs="unbounded" ref="TargetValueCount" />
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="TargetValueCount">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="value" type="xs:string" use="required" />
<xs:attribute name="count" type="REAL-NUMBER" use="required" />
</xs:complexType>
</xs:element>
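Because a TargetValueCount whose count is zero may be omitted, a reader of the model has to fill in the missing entries. The Python sketch below shows one way to do that with the standard library's xml.etree.ElementTree. The function name read_target_value_counts is hypothetical, and the fragment is written without an XML namespace for brevity; in a full PMML document the element names would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

def read_target_value_counts(element, target_values):
    """Return {target value: count}, using 0.0 for any omitted TargetValueCount."""
    counts = {value: 0.0 for value in target_values}
    for tvc in element.findall("TargetValueCount"):
        counts[tvc.get("value")] = float(tvc.get("count"))
    return counts

# Fragment with the zero-count entry for "5000" omitted.
fragment = """
<TargetValueCounts>
  <TargetValueCount value="100" count="225"/>
  <TargetValueCount value="500" count="10"/>
  <TargetValueCount value="1000" count="9"/>
  <TargetValueCount value="10000" count="10"/>
</TargetValueCounts>
""".strip()

targets = ["100", "500", "1000", "5000", "10000"]
counts = read_target_value_counts(ET.fromstring(fragment), targets)
print(counts)
```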
Scoring procedure
Given an input vector such as (i12, i23, i31), the probability for class t1 is
computed as
- P(t1 | i12,i23,i31) = L1 / (L1 + L2 + L3)
with
- L1 = count[t1] * count[i12,t1]/count[t1] *
count[i23,t1]/count[t1] * count[i31,t1]/count[t1]
- L2 = count[t2] * count[i12,t2]/count[t2] *
count[i23,t2]/count[t2] * count[i31,t2]/count[t2]
- L3 = count[t3] * count[i12,t3]/count[t3] *
count[i23,t3]/count[t3] * count[i31,t3]/count[t3]
When scoring, missing values are simply ignored. That is, the
conditional-probability factor associated with a missing predictor field is
omitted. For example, given an input vector with missing values (-,i23,-) the
probability for class t1 is computed as
- P(t1 | -,i23,-) = L1 / (L1 + L2 + L3)
with
- L1 = count[t1] * count[i23,t1]/count[t1]
- L2 = count[t2] * count[i23,t2]/count[t2]
- L3 = count[t3] * count[i23,t3]/count[t3]
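The scoring rule above, including the treatment of missing predictors, can be sketched in Python as follows. This is a simplified illustration: it uses count[t] as the denominator exactly as in the formulas above, it does not apply the threshold for zero counts, and the function name score, the dictionary layout, and the toy counts are all hypothetical.

```python
def score(record, target_counts, pair_counts):
    """Return {target: probability} for one record.

    record maps field name -> observed value, with None marking a missing value.
    pair_counts[field][value][target] is the joint count of value and target.
    """
    likelihoods = {}
    for target, n_target in target_counts.items():
        likelihood = n_target
        for field, value in record.items():
            if value is None:
                # Missing predictor: its conditional-probability factor is omitted.
                continue
            likelihood *= pair_counts[field][value][target] / n_target
        likelihoods[target] = likelihood
    total = sum(likelihoods.values())
    return {target: value / total for target, value in likelihoods.items()}

# Hypothetical toy model with two inputs and two target values.
target_counts = {"t1": 10, "t2": 5}
pair_counts = {
    "Input1": {"i11": {"t1": 4, "t2": 1}},
    "Input2": {"i21": {"t1": 5, "t2": 5}},
}

complete = score({"Input1": "i11", "Input2": "i21"}, target_counts, pair_counts)
partial = score({"Input1": None, "Input2": "i21"}, target_counts, pair_counts)
print(complete)
print(partial)
```

When every predictor is missing, the result reduces to the relative frequencies of the target values.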
Scoring procedure, example
<?xml version="1.0" ?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="Copyright (c) 2009, DMG.org"/>
<DataDictionary numberOfFields="5">
<DataField name="gender" optype="categorical" dataType="string">
<Value value="female"/>
<Value value= "male"/>
</DataField>
<DataField name="no of claims" optype="categorical" dataType="string">
<Value value= "0"/>
<Value value= "1"/>
<Value value= "2"/>
<Value value=">2"/>
</DataField>
<DataField name="domicile" optype="categorical" dataType="string">
<Value value="suburban"/>
<Value value= "urban"/>
<Value value= "rural"/>
</DataField>
<DataField name="age of car" optype="continuous" dataType="double"/>
<DataField name="amount of claims" optype="categorical" dataType="integer">
<Value value= "100"/>
<Value value= "500"/>
<Value value= "1000"/>
<Value value= "5000"/>
<Value value="10000"/>
</DataField>
</DataDictionary>
<NaiveBayesModel modelName="NaiveBayes Insurance"
functionName="classification" threshold="0.001">
<MiningSchema>
<MiningField name="gender"/>
<MiningField name="no of claims"/>
<MiningField name="domicile"/>
<MiningField name="age of car"/>
<MiningField name="amount of claims" usageType="predicted"/>
</MiningSchema>
<BayesInputs>
<BayesInput fieldName="gender">
<PairCounts value="male">
<TargetValueCounts>
<TargetValueCount value= "100" count="4273"/>
<TargetValueCount value= "500" count="1321"/>
<TargetValueCount value= "1000" count= "780"/>
<TargetValueCount value= "5000" count= "405"/>
<TargetValueCount value="10000" count= "42"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value="female">
<TargetValueCounts>
<TargetValueCount value= "100" count="4325"/>
<TargetValueCount value= "500" count="1212"/>
<TargetValueCount value= "1000" count= "742"/>
<TargetValueCount value= "5000" count= "292"/>
<TargetValueCount value="10000" count= "48"/>
</TargetValueCounts>
</PairCounts>
</BayesInput>
<BayesInput fieldName="no of claims">
<PairCounts value="0">
<TargetValueCounts>
<TargetValueCount value= "100" count="4698"/>
<TargetValueCount value= "500" count= "623"/>
<TargetValueCount value= "1000" count="1259"/>
<TargetValueCount value= "5000" count= "550"/>
<TargetValueCount value="10000" count= "40"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value="1">
<TargetValueCounts>
<TargetValueCount value= "100" count="3526"/>
<TargetValueCount value= "500" count="1798"/>
<TargetValueCount value= "1000" count= "227"/>
<TargetValueCount value= "5000" count= "152"/>
<TargetValueCount value="10000" count= "40"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value="2">
<TargetValueCounts>
<TargetValueCount value= "100" count="225"/>
<TargetValueCount value= "500" count= "10"/>
<TargetValueCount value= "1000" count= "9"/>
<TargetValueCount value= "5000" count= "0"/>
<TargetValueCount value="10000" count= "10"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value=">2">
<TargetValueCounts>
<TargetValueCount value= "100" count="112"/>
<TargetValueCount value= "500" count= "5"/>
<TargetValueCount value= "1000" count= "1"/>
<TargetValueCount value= "5000" count= "1"/>
<TargetValueCount value="10000" count= "8"/>
</TargetValueCounts>
</PairCounts>
</BayesInput>
<BayesInput fieldName="domicile">
<PairCounts value="suburban">
<TargetValueCounts>
<TargetValueCount value= "100" count="2536"/>
<TargetValueCount value= "500" count= "165"/>
<TargetValueCount value= "1000" count= "516"/>
<TargetValueCount value= "5000" count= "290"/>
<TargetValueCount value="10000" count= "42"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value="urban">
<TargetValueCounts>
<TargetValueCount value= "100" count="1679"/>
<TargetValueCount value= "500" count= "792"/>
<TargetValueCount value= "1000" count= "511"/>
<TargetValueCount value= "5000" count= "259"/>
<TargetValueCount value="10000" count= "30"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value="rural">
<TargetValueCounts>
<TargetValueCount value= "100" count="2512"/>
<TargetValueCount value= "500" count="1013"/>
<TargetValueCount value= "1000" count= "442"/>
<TargetValueCount value= "5000" count= "137"/>
<TargetValueCount value="10000" count= "21"/>
</TargetValueCounts>
</PairCounts>
</BayesInput>
<BayesInput fieldName="age of car">
<DerivedField optype="categorical" dataType="string">
<Discretize field="age of car">
<DiscretizeBin binValue="0">
<Interval closure="closedOpen" leftMargin="0" rightMargin="1"/>
</DiscretizeBin>
<DiscretizeBin binValue="1">
<Interval closure="closedOpen" leftMargin="1" rightMargin="5"/>
</DiscretizeBin>
<DiscretizeBin binValue="2">
<Interval closure="closedOpen" leftMargin="5"/>
</DiscretizeBin>
</Discretize>
</DerivedField>
<PairCounts value="0">
<TargetValueCounts>
<TargetValueCount value= "100" count="927"/>
<TargetValueCount value= "500" count="183"/>
<TargetValueCount value= "1000" count="221"/>
<TargetValueCount value= "5000" count= "50"/>
<TargetValueCount value="10000" count= "10"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value="1">
<TargetValueCounts>
<TargetValueCount value= "100" count="830"/>
<TargetValueCount value= "500" count="182"/>
<TargetValueCount value= "1000" count= "51"/>
<TargetValueCount value= "5000" count= "26"/>
<TargetValueCount value="10000" count= "6"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value="2">
<TargetValueCounts>
<TargetValueCount value= "100" count="6251"/>
<TargetValueCount value= "500" count="1901"/>
<TargetValueCount value= "1000" count= "919"/>
<TargetValueCount value= "5000" count= "623"/>
<TargetValueCount value="10000" count= "71"/>
</TargetValueCounts>
</PairCounts>
</BayesInput>
</BayesInputs>
<BayesOutput fieldName="amount of claims">
<TargetValueCounts>
<TargetValueCount value= "100" count="8723"/>
<TargetValueCount value= "500" count="2557"/>
<TargetValueCount value= "1000" count="1530"/>
<TargetValueCount value= "5000" count= "709"/>
<TargetValueCount value="10000" count= "100"/>
</TargetValueCounts>
</BayesOutput>
</NaiveBayesModel>
</PMML>
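Given an input vector (gender = "male", no of claims = "2", domicile =
(missing), age of car = "1"), the probability for class "1000" is computed as
- P("1000" | "male", "2", -, "1") = L2 / (L0 + L1 + L2 + L3 + L4)
with the likelihoods below. Each denominator is the sum of the pair counts
over all values of that field for the given target value, the missing
domicile contributes no factor, and the zero count for no of claims = "2"
with target "5000" is replaced by the threshold 0.001: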
- L0 = 8723 * 4273/8598 * 225/8561 * 830/8008
- L1 = 2557 * 1321/2533 * 10/2436 * 182/2266
- L2 = 1530 * 780/1522 * 9/1496 * 51/1191
- L3 = 709 * 405/697 * .001 * 26/699
- L4 = 100 * 42/90 * 10/98 * 6/87
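The five likelihoods above can be evaluated and normalized directly. The short Python sketch below only transcribes the arithmetic of this example; the variable names are illustrative.

```python
# Likelihoods copied term by term from the worked example above.
likelihoods = {
    "100":   8723 * 4273 / 8598 * 225 / 8561 * 830 / 8008,
    "500":   2557 * 1321 / 2533 * 10 / 2436 * 182 / 2266,
    "1000":  1530 * 780 / 1522 * 9 / 1496 * 51 / 1191,
    "5000":  709 * 405 / 697 * 0.001 * 26 / 699,   # 0.001 is the threshold
    "10000": 100 * 42 / 90 * 10 / 98 * 6 / 87,
}

# Normalize so that the probabilities over all target values sum to 1.
total = sum(likelihoods.values())
posteriors = {target: value / total for target, value in likelihoods.items()}

for target, probability in posteriors.items():
    print(f"P({target!r} | male, 2, -, 1) = {probability:.4f}")
```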