Data Mining Group - Naive Bayes

PMML 2.0 -- Naive Bayes Models

Naive Bayes uses Bayes' Theorem, combined with a ("naive") presumption of conditional independence, to predict the value of a target (output) field from evidence given by one or more predictor (input) fields.

Given a categorical target field T with possible values T_1,...,T_m, and predictor fields I_1,...,I_n with values (in the current record) of I_1*,...,I_n*, the probability that the target T has value T_i, given the values of the predictors, is derived as follows:

P(T_i | I_1*,...,I_n*)

= P(T_i) P(I_1*,...,I_n* | T_i) / P(I_1*,...,I_n*), by Bayes' theorem

~ P(T_i) Product_j P(I_j* | T_i) / P(I_1*,...,I_n*), by the conditional independence assumption

= P(T_i) Product_j P(I_j* | T_i) / Sum_k ( P(T_k) Product_j P(I_j* | T_k) )

= L_i / Sum_k L_k, defining the likelihood L_k = P(T_k) Product_j P(I_j* | T_k)

Each likelihood L_i is estimated from frequency counts in the training data:

L_i = P(T_i) Product_j P(I_j* | T_i)

= (count[T_i] / Sum_k count[T_k]) Product_j ( (count[I_j*T_i] / Sum_k count[T_k]) / (count[T_i] / Sum_k count[T_k]) )

~ count[T_i] Product_j (count[I_j*T_i] / count[T_i]), dropping the factor 1 / Sum_k count[T_k], which is common to all L_k and cancels in L_i / Sum_k L_k

A count of zero requires special attention. Without adjustment, a count of zero would exercise an absolute veto over a likelihood in which that count appears as a factor. Therefore, the Bayes model incorporates a threshold parameter that specifies a default (usually very small) probability to use in lieu of P(I_j*| T_k) when count[I_j*T_k] is zero.
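As a minimal sketch of the zero-count rule, the Python function below returns the threshold whenever the pair count is zero and the count ratio otherwise. The function name and its arguments are illustrative and not part of PMML; the numbers are taken from the example model at the end of this document.

```python
def conditional_probability(pair_count, target_count, threshold):
    """Estimate P(I_j* | T_k) from counts.

    A zero pair count is replaced by the model's threshold so that it
    cannot force the whole likelihood to zero.
    """
    if pair_count == 0:
        return threshold
    return pair_count / target_count


# Target "5000" occurs 709 times; it never co-occurs with no of claims = "2".
p_zero = conditional_probability(0, 709, threshold=0.001)
# Target "5000" co-occurs 405 times with gender = "male".
p_nonzero = conditional_probability(405, 709, threshold=0.001)
```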

A second adaptation, this one to missing values in the training data, involves the denominator count[T_i] in the conditional-probability terms. Accuracy improves if the denominator for P(I_j*| T_i) is replaced by the sum Sum_k count[I_jkT_i], that is, the sum of the counts of co-occurrences of target value T_i with any (non-missing) value of item I_j.
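The adjusted denominator can be sketched as follows. This is an illustration under assumed data: the field, its values "a" and "b", and all counts are hypothetical, and the function name is not part of PMML. The sketch also assumes the target value co-occurs with at least one value of the field, so the denominator is non-zero.

```python
def adjusted_conditional_probability(field_pair_counts, input_value, target_value):
    """P(input_value | target_value) with the denominator taken as the sum of
    pair counts over all non-missing values of the field."""
    numerator = field_pair_counts[input_value].get(target_value, 0)
    denominator = sum(
        counts.get(target_value, 0) for counts in field_pair_counts.values()
    )
    return numerator / denominator


# Hypothetical field: t1 occurs 100 times in the training data, but the field
# is missing in 20 of those records, so only 80 co-occurrences are counted.
field_pair_counts = {
    "a": {"t1": 30, "t2": 10},
    "b": {"t1": 50, "t2": 20},
}
p_adjusted = adjusted_conditional_probability(field_pair_counts, "a", "t1")  # 30 / 80
p_unadjusted = 30 / 100  # same numerator over count[t1]
```

With the adjusted denominator, the conditional probabilities over all values of the field sum to one for each target value, which is not the case when records with missing values inflate count[T_i].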

Naive Bayes models require that each field (whether target or predictor) be discretized so that for each field, only a small, finite number of values are considered by the model.

In sum, a Naive Bayes model requires the following parameter and counts:

An attribute threshold specifies the probability to use in lieu of P(I_j*| T_k) when count[I_j*T_k] is zero.
An element TargetValueCounts lists, for each value T_i of the target field, the number of occurrences of that target value in the training data, i.e. count[T_i].
For each predictor field I_i, for each discrete value I_ij of that field, an element PairCounts lists, for each value T_k of the target field, the number of occurrences of that predictor value jointly with that target value, i.e. count[I_ijT_k].

A NaiveBayesModel essentially defines a set of matrices. For each input field there is a matrix which contains the frequency counts of an input value with respect to a target value.

                   Target value
                   t1              t2              t3              ...
                   count[t1]       count[t2]       count[t3]       ...
Input1    i11      count[i11,t1]   count[i11,t2]   count[i11,t3]   ...
          i12      count[i12,t1]   count[i12,t2]   count[i12,t3]   ...
          ...      ...             ...             ...             ...
Input2    i21      count[i21,t1]   count[i21,t2]   count[i21,t3]   ...
          i22      count[i22,t1]   count[i22,t2]   count[i22,t3]   ...
          i23      count[i23,t1]   count[i23,t2]   count[i23,t3]   ...
          ...      ...             ...             ...             ...
Input3    ...      ...             ...             ...             ...

Scoring procedure

Given an input vector like (i12,i23,i31) the probability for class t1 is computed as
P(t1|i12,i23,i31) = L1 / (L1 + L2 + L3)
with
   L1 = count[t1] * count[i12,t1]/count[t1] * count[i23,t1]/count[t1] * count[i31,t1]/count[t1]
   L2 = count[t2] * count[i12,t2]/count[t2] * count[i23,t2]/count[t2] * count[i31,t2]/count[t2]
   L3 = count[t3] * count[i12,t3]/count[t3] * count[i23,t3]/count[t3] * count[i31,t3]/count[t3]

When scoring, missing values are simply ignored. That is, the conditional-probability factor associated with a missing predictor field is omitted. For example, given an input vector with missing values (-,i23,-) the probability for class t1 is computed as
P(t1|-,i23,-) = L1 / (L1 + L2 + L3)
with
   L1 = count[t1] * count[i23,t1]/count[t1]
   L2 = count[t2] * count[i23,t2]/count[t2]
   L3 = count[t3] * count[i23,t3]/count[t3]
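The scoring rule, including the treatment of missing values, can be sketched in Python as below. This is an illustration rather than a reference implementation: the function name, the dictionary layout and the counts are assumptions made for the example, and a missing value is represented by None.

```python
def score(target_counts, pair_counts, record, threshold):
    """Return P(target | record) for every target value.

    target_counts: {target_value: count}
    pair_counts:   {field: {input_value: {target_value: count}}}
    record:        {field: input_value or None}; None marks a missing value
    """
    likelihoods = {}
    for target, target_count in target_counts.items():
        likelihood = float(target_count)
        for field, value in record.items():
            if value is None:
                continue  # missing predictor: its factor is omitted
            pair_count = pair_counts[field][value].get(target, 0)
            # A zero (or omitted) count is replaced by the threshold.
            likelihood *= threshold if pair_count == 0 else pair_count / target_count
        likelihoods[target] = likelihood
    total = sum(likelihoods.values())
    return {t: l / total for t, l in likelihoods.items()}


# Hypothetical counts for two input fields and two target values.
target_counts = {"t1": 10, "t2": 30}
pair_counts = {
    "Input1": {"i12": {"t1": 5, "t2": 3}},
    "Input2": {"i23": {"t1": 2, "t2": 15}},
}
full = score(target_counts, pair_counts, {"Input1": "i12", "Input2": "i23"}, 0.001)
partial = score(target_counts, pair_counts, {"Input1": None, "Input2": "i23"}, 0.001)
```

For the complete record the likelihoods are 10 * 5/10 * 2/10 and 30 * 3/30 * 15/30; for the record with Input1 missing they reduce to 10 * 2/10 and 30 * 15/30.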

DTD


<!ELEMENT NaiveBayesModel (Extension*, MiningSchema, ModelStats?,
                           BayesInputs, BayesOutput, Extension* )>

<!ATTLIST NaiveBayesModel
     modelName                CDATA                          #IMPLIED
     threshold                %REAL-NUMBER;                  #REQUIRED
     functionName             %MINING-FUNCTION;              #REQUIRED
     algorithmName            CDATA                          #IMPLIED
>

<!ELEMENT BayesInputs ( BayesInput+ )>

Bayes Input

Each BayesInput contains the counts pairing the discrete values of that field with those of the target field. A BayesInput for a continuous field also defines how the continuous values are encoded as discrete bins. (Discretization is achieved using DerivedField; only the Discretize mapping for DerivedField may be invoked here.)


<!ELEMENT BayesInput ( Extension*, DerivedField?, PairCounts+ ) >

<!ATTLIST BayesInput
    fieldName                 CDATA                          #REQUIRED
>

Bayes Output

BayesOutput contains the counts associated with the values of the target field.


<!ELEMENT BayesOutput ( Extension*, TargetValueCounts )>
<!ATTLIST BayesOutput
    fieldName                 CDATA                          #REQUIRED
>

Pair Counts

PairCounts lists, for a field I_i's discrete value I_ij, the TargetValueCounts that pair the value I_ij with each value of the target field.


<!ELEMENT PairCounts ( TargetValueCounts )>

<!ATTLIST PairCounts
     value                    CDATA                         #REQUIRED
>

Target Value Counts

TargetValueCounts lists the counts associated with each value of the target field. However, a TargetValueCount whose count is zero may be omitted.

Within BayesOutput, TargetValueCounts lists the total count of occurrences of each target value.

Within PairCounts, TargetValueCounts lists, for each target value, the count of the joint occurrences of that target value with a particular discrete input value.


<!ELEMENT TargetValueCounts ( TargetValueCount+ )>

<!ELEMENT TargetValueCount EMPTY >

<!ATTLIST TargetValueCount
    value                     CDATA                          #REQUIRED
    count                     %REAL-NUMBER;                  #REQUIRED
>

Example model


<?xml version="1.0" ?>
<PMML version="2.0">
<Header copyright="Copyright (c) 2001, DMG.org"/>
<DataDictionary numberOfFields="5">

<DataField name="gender" optype="categorical">
    <Value value="female"/>
    <Value value=  "male"/>
</DataField>

<DataField name="no of claims" optype="categorical">
    <Value value= "0"/>
    <Value value= "1"/>
    <Value value= "2"/>
    <Value value=">2"/>
</DataField>

<DataField name="domicile" optype="categorical">
    <Value value="suburban"/>
    <Value value=   "urban"/>
    <Value value=   "rural"/>
</DataField>

<DataField name="age of car" optype="continuous"/>

<DataField name="amount of claims" optype="categorical">
    <Value value=  "100"/>
    <Value value=  "500"/>
    <Value value= "1000"/>
    <Value value= "5000"/>
    <Value value="10000"/>
</DataField>

</DataDictionary>

<NaiveBayesModel 
        modelName="NaiveBayes Insurance" 
        functionName="classification"
        threshold="0.001">
         
<MiningSchema>
    <MiningField name="gender"/>
    <MiningField name="no of claims"/>
    <MiningField name="domicile"/>
    <MiningField name="age of car"/>
    <MiningField name="amount of claims" usageType="predicted"/>
</MiningSchema>

<BayesInputs>
    <BayesInput fieldName="gender">

<PairCounts value="male">

    <TargetValueCounts>
        <TargetValueCount value=  "100" count="4273"/>
        <TargetValueCount value=  "500" count="1321"/>
        <TargetValueCount value= "1000" count= "780"/>
        <TargetValueCount value= "5000" count= "405"/>
        <TargetValueCount value="10000" count=  "42"/>
    </TargetValueCounts>

</PairCounts>

<PairCounts value="female">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="4325"/>
        <TargetValueCount value=  "500" count="1212"/>
        <TargetValueCount value= "1000" count= "742"/>
        <TargetValueCount value= "5000" count= "292"/>
        <TargetValueCount value="10000" count=  "48"/>
    </TargetValueCounts>
</PairCounts>

</BayesInput>

<BayesInput fieldName="no of claims">

<PairCounts value="0">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="4698"/>
        <TargetValueCount value=  "500" count= "623"/>
        <TargetValueCount value= "1000" count="1259"/>
        <TargetValueCount value= "5000" count= "550"/>
        <TargetValueCount value="10000" count=  "40"/>
    </TargetValueCounts>
</PairCounts>

<PairCounts value="1">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="3526"/>
        <TargetValueCount value=  "500" count="1798"/>
        <TargetValueCount value= "1000" count= "227"/>
        <TargetValueCount value= "5000" count= "152"/>
        <TargetValueCount value="10000" count=  "40"/>
    </TargetValueCounts>
</PairCounts>

<PairCounts value="2">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="225"/>
        <TargetValueCount value=  "500" count= "10"/>
        <TargetValueCount value= "1000" count=  "9"/>
        <TargetValueCount value= "5000" count=  "0"/>
        <TargetValueCount value="10000" count= "10"/>
    </TargetValueCounts>
</PairCounts>

<PairCounts value=">2">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="112"/>
        <TargetValueCount value=  "500" count=  "5"/>
        <TargetValueCount value= "1000" count=  "1"/>
        <TargetValueCount value= "5000" count=  "1"/>
        <TargetValueCount value="10000" count=  "8"/>
    </TargetValueCounts>
</PairCounts>

</BayesInput>

<BayesInput fieldName="domicile">

<PairCounts value="suburban">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="2536"/>
        <TargetValueCount value=  "500" count= "165"/>
        <TargetValueCount value= "1000" count= "516"/>
        <TargetValueCount value= "5000" count= "290"/>
        <TargetValueCount value="10000" count=  "42"/>
    </TargetValueCounts>
</PairCounts>

<PairCounts value="urban">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="1679"/>
        <TargetValueCount value=  "500" count= "792"/>
        <TargetValueCount value= "1000" count= "511"/>
        <TargetValueCount value= "5000" count= "259"/>
        <TargetValueCount value="10000" count=  "30"/>
     </TargetValueCounts>
</PairCounts>

<PairCounts value="rural">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="2512"/>
        <TargetValueCount value=  "500" count="1013"/>
        <TargetValueCount value= "1000" count= "442"/>
        <TargetValueCount value= "5000" count= "137"/>
        <TargetValueCount value="10000" count=  "21"/>
    </TargetValueCounts>
</PairCounts>

</BayesInput>

<BayesInput fieldName="age of car">
    <DerivedField>
        <Discretize field="age of car">
            <DiscretizeBin binValue="0">
                <Interval closure="closedOpen" leftMargin="0" 
                   rightMargin="1"/>
            </DiscretizeBin>
            <DiscretizeBin binValue="1">
                <Interval closure="closedOpen" leftMargin="1" 
                   rightMargin="5"/>
            </DiscretizeBin>
            <DiscretizeBin binValue="2">
                <Interval closure="closedOpen" leftMargin="5"/>
            </DiscretizeBin>
        </Discretize>
     </DerivedField>

<PairCounts value="0">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="927"/>
        <TargetValueCount value=  "500" count="183"/>
        <TargetValueCount value= "1000" count="221"/>
        <TargetValueCount value= "5000" count= "50"/>
        <TargetValueCount value="10000" count= "10"/>
     </TargetValueCounts>
</PairCounts>

<PairCounts value="1">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="830"/>
        <TargetValueCount value=  "500" count="182"/>
        <TargetValueCount value= "1000" count= "51"/>
        <TargetValueCount value= "5000" count= "26"/>
        <TargetValueCount value="10000" count=  "6"/>
    </TargetValueCounts>
</PairCounts>

<PairCounts value="2">
    <TargetValueCounts>
        <TargetValueCount value=  "100" count="6251"/>
        <TargetValueCount value=  "500" count="1901"/>
        <TargetValueCount value= "1000" count= "919"/>
        <TargetValueCount value= "5000" count= "623"/>
        <TargetValueCount value="10000" count=  "71"/>
    </TargetValueCounts>
</PairCounts>

</BayesInput>
</BayesInputs>

<BayesOutput fieldName="amount of claims">

<TargetValueCounts>
    <TargetValueCount value=  "100" count="8723"/>
    <TargetValueCount value=  "500" count="2557"/>
    <TargetValueCount value= "1000" count="1530"/>
    <TargetValueCount value= "5000" count= "709"/>
    <TargetValueCount value="10000" count= "100"/>
</TargetValueCounts>

</BayesOutput>
</NaiveBayesModel>
</PMML>
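As an illustration of how the counts in such a model might be read, the sketch below uses Python's standard xml.etree.ElementTree module on a trimmed fragment of the example (one input field, two target values, no MiningSchema, so the fragment is not valid against the DTD). The function name read_counts and the returned dictionary layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the example model, for illustration only.
FRAGMENT = """
<NaiveBayesModel modelName="NaiveBayes Insurance"
                 functionName="classification" threshold="0.001">
  <BayesInputs>
    <BayesInput fieldName="gender">
      <PairCounts value="male">
        <TargetValueCounts>
          <TargetValueCount value="100" count="4273"/>
          <TargetValueCount value="500" count="1321"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
  </BayesInputs>
  <BayesOutput fieldName="amount of claims">
    <TargetValueCounts>
      <TargetValueCount value="100" count="8723"/>
      <TargetValueCount value="500" count="2557"/>
    </TargetValueCounts>
  </BayesOutput>
</NaiveBayesModel>
"""


def read_value_counts(element):
    """Map each TargetValueCount's value to its count within a TargetValueCounts element."""
    return {
        tvc.get("value"): float(tvc.get("count"))
        for tvc in element.findall("TargetValueCount")
    }


def read_counts(model):
    """Return (threshold, target_counts, pair_counts) for a NaiveBayesModel element."""
    threshold = float(model.get("threshold"))
    target_counts = read_value_counts(model.find("BayesOutput/TargetValueCounts"))
    pair_counts = {}
    for bayes_input in model.findall("BayesInputs/BayesInput"):
        by_value = {}
        for pair in bayes_input.findall("PairCounts"):
            by_value[pair.get("value")] = read_value_counts(pair.find("TargetValueCounts"))
        pair_counts[bayes_input.get("fieldName")] = by_value
    return threshold, target_counts, pair_counts


threshold, target_counts, pair_counts = read_counts(ET.fromstring(FRAGMENT))
```

Because a TargetValueCount whose count is zero may be omitted, a consumer of these dictionaries should treat a missing target value as a count of zero.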

Scoring procedure, example

Given an input vector (gender = "male", no of claims = "2", domicile = (missing), age of car in bin "1"), the probability for class "1000" is computed as
P("1000"| "male", "2", -,"1" ) = L2 / (L0 + L1 + L2 + L3 + L4)
with
   L0 = 8723 * 4273/8723 * 225/8723 * 830/8723
   L1 = 2557 * 1321/2557 * 10/2557 * 182/2557
   L2 = 1530 * 780/1530 * 9/1530 * 51/1530
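   L3 = 709 * 405/709 * 0.001 * 26/709
   L4 = 100 * 42/100 * 10/100 * 6/100

Here L0, ..., L4 are the likelihoods for the target values "100", "500", "1000", "5000" and "10000". The factor for domicile is omitted because that value is missing, and L3 uses the threshold 0.001 because the pair count for no of claims = "2" with target "5000" is zero.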
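The arithmetic of this example can be checked with a short Python sketch. The counts are copied from the example model; the variable names are illustrative, and the domicile field is left out because its value is missing.

```python
threshold = 0.001

# count[T_k] from BayesOutput.
target_counts = {"100": 8723, "500": 2557, "1000": 1530, "5000": 709, "10000": 100}

# Pair counts for the observed (non-missing) input values.
observed_pair_counts = [
    {"100": 4273, "500": 1321, "1000": 780, "5000": 405, "10000": 42},  # gender = "male"
    {"100": 225, "500": 10, "1000": 9, "5000": 0, "10000": 10},         # no of claims = "2"
    {"100": 830, "500": 182, "1000": 51, "5000": 26, "10000": 6},       # age of car, bin "1"
]

likelihoods = {}
for target, target_count in target_counts.items():
    likelihood = float(target_count)
    for counts in observed_pair_counts:
        pair_count = counts[target]
        # The zero count for target "5000" is replaced by the threshold.
        likelihood *= threshold if pair_count == 0 else pair_count / target_count
    likelihoods[target] = likelihood

total = sum(likelihoods.values())
posterior = {t: l / total for t, l in likelihoods.items()}
print(posterior["1000"])
```

The likelihood for "100" dominates the sum, so the probability printed for class "1000" is small, a little over one percent.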