PMML 4.4.1 - Naïve Bayes

Naïve Bayes uses Bayes' Theorem, combined with a ("naive") presumption of conditional independence, to predict the value of a target (output), from evidence given by one or more predictor (input) fields.

Given a categorical target field T with possible values T₁,...T_m, and predictor fields I₁,...I_n, with values (in the current record) of I_1*,...I_n*, the probability that the target T has value T_i, given the values of the predictors, is derived as follows:

P(T_i | I_1*,...I_n*)

= P(T_i) P(I_1*,...I_n* | T_i) / P(I_1*,...I_n*) by Bayes' theorem

~ P(T_i) Product_j P(I_j* | T_i) / P(I_1*,...I_n*) by the conditional independence assumption

= P(T_i) Product_j P(I_j* | T_i) / Sum_k ( P(T_k) Product_j P(I_j* | T_k) )

= L_i / Sum_k L_k, defining likelihood L_k = P(T_k) Product_j P(I_j* | T_k)
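
The last step of the derivation, dividing each likelihood by the sum over all target values, can be sketched in Python as follows. The function name posteriors and the sample numbers are illustrative assumptions, not part of PMML.

```python
def posteriors(likelihoods):
    """Turn per-target likelihoods L_k into probabilities L_i / Sum_k L_k."""
    total = sum(likelihoods.values())
    return {target: value / total for target, value in likelihoods.items()}


# Hypothetical likelihoods for three target values (made-up numbers).
example = posteriors({"t1": 1.0, "t2": 3.0, "t3": 1.0})
```

Because only the ratios matter, any factor shared by all L_k can be dropped before normalizing without changing the result.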

Naïve Bayes models require the target field to be discretized so that a finite number of values are considered by the model. Predictor fields, on the other hand, may be either discrete or continuous. For a discrete field, probabilities are estimated from the number of occurrences of each of its values in the training data. For example, the probability of the target T having value T_i is:

P(T_i) = count[T_i] / Sum_k count[T_k]

For a continuous field, on the other hand, we assume it has a probability distribution representable by a Gaussian distribution with mean μ and variance σ² or a Poisson distribution with mean μ.

A Gaussian distribution would imply that the probability of the continuous variable having value I_j* given a target value T_i is:

P(I_j* | T_i) = exp(-(I_j* - μ_ij)² / (2σ_ij²)) / sqrt(2πσ_ij²)

while a Poisson distribution with mean μ_ij implies:

P(I_j* | T_i) = μ_ij^(I_j*) exp(-μ_ij) / (I_j*)!
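
As a minimal sketch, the two formulas above translate into Python as shown below. The function names are assumptions made for illustration. The Poisson variant assumes the observed value is a non-negative integer, since it is used as an exponent and in a factorial.

```python
import math


def gaussian_density(x, mean, variance):
    """Gaussian density with the given mean and variance, evaluated at x."""
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)


def poisson_probability(x, mean):
    """Poisson probability of observing the non-negative integer x."""
    return mean ** x * math.exp(-mean) / math.factorial(x)
```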

In this way, for example, a Gaussian distribution will give

L_i = P(T_i) Product_j P(I_j* | T_i)

= (count[T_i] / Sum_k count[T_k]) Product_{j[categorical]} (count[I_j* T_i] / count[T_i]) Product_{j[continuous]} (exp(-(I_j* - μ_ij)² / (2σ_ij²)) / sqrt(2πσ_ij²))

~ count[T_i] Product_{j[categorical]} (count[I_j* T_i] / count[T_i]) Product_{j[continuous]} (exp(-(I_j* - μ_ij)² / (2σ_ij²)) / sqrt(2πσ_ij²)), removing the factor Sum_k count[T_k] common to all L_i

where μ_ij is the mean of the distribution of predictor I_j given target value T_i. Similarly, σ_ij² is the variance of the distribution of predictor I_j given target value T_i.

A count of zero requires special attention. Without adjustment, a count of zero would exercise an absolute veto over a likelihood in which that count appears as a factor. Therefore, the Bayes model incorporates a threshold parameter that specifies a default (usually very small) probability to use in lieu of P(I_j* | T_i) when count[I_j* T_i] is zero. Similarly, since the probability given by a continuous distribution can come arbitrarily close to 0, the same threshold parameter is used as the probability of the continuous variable when the calculated probability of the distribution falls below that value.
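
The two uses of the threshold described above might be sketched as follows. The function names are hypothetical, and the counts in the usage note are made up for illustration.

```python
def discrete_probability(pair_count, denominator, threshold):
    """Conditional probability from counts; a zero count falls back to the threshold."""
    if pair_count == 0:
        return threshold
    return pair_count / denominator


def continuous_probability(density, threshold):
    """Distribution value, floored at the threshold so it never drops below it."""
    return max(density, threshold)
```

For instance, discrete_probability(0, 50, 0.001) yields the threshold 0.001 rather than 0, so a single unseen combination no longer forces the whole likelihood to zero.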

A second adaptation, to missing values in the training data, involves the denominator count[T_i] in the conditional-probability terms. Accuracy improves if the denominator for P(I_j* | T_i) is replaced by the sum Sum_k count[I_jk T_i], that is, the sum of the counts of co-occurrences of target value T_i with any (non-missing) value I_jk of predictor I_j.
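
A small sketch of this replacement denominator follows, assuming the pair counts of one predictor are held in a nested dictionary keyed first by predictor value and then by target value. Both the layout and the numbers are assumptions made for illustration.

```python
def conditional_denominator(pair_counts, target):
    """Sum the co-occurrence counts of one target value over all predictor values."""
    return sum(counts.get(target, 0) for counts in pair_counts.values())


# Hypothetical counts for one discrete predictor; the zero count for
# ("female", "t2") is simply left out.
pair_counts = {
    "male": {"t1": 40, "t2": 10},
    "female": {"t1": 50},
}
denominator_t1 = conditional_denominator(pair_counts, "t1")
denominator_t2 = conditional_denominator(pair_counts, "t2")
```

Records whose predictor value was missing in training contribute to count[T_i] but to none of the pair counts, which is why this sum can be smaller than count[T_i].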

In sum, a Naïve Bayes model requires the following attributes and elements:

A threshold attribute specifies the probability to use in lieu of P(I_j* | T_i) when count[I_j* T_i] is zero for categorical fields or when the calculated probability of the distribution falls below the threshold for continuous fields.
An element TargetValueCounts lists, for each value T_i of the target field, the number of occurrences of that target value in the training data, i.e. count[T_i].
For each discrete predictor field I_i, for each value I_ij of that field, an element PairCounts lists, for each value T_k of the target field, the number of occurrences of that predictor value jointly with that target value, i.e. count[I_ijT_k].
For each continuous predictor field I_i, an element TargetValueStats lists for each value T_k of the target field, the statistical measures of the joint distribution of that predictor field, i.e. mean[T_kj] and variance[T_kj].

A NaiveBayesModel essentially defines a set of matrices. For each input field there is a matrix which contains the frequency counts of each discrete input value or, in the case of a Gaussian distribution, the mean and the variance of the distribution of the continuous input for each target value.

                         Target value
                         t1                             t2                             t3                             ...
                         count[t1]                      count[t2]                      count[t3]                      ...
Discrete Input1    i11   count[i11,t1]                  count[i11,t2]                  count[i11,t3]                  ...
                   i12   count[i12,t1]                  count[i12,t2]                  count[i12,t3]                  ...
                   ...   ...                            ...                            ...                            ...
Discrete Input2    i21   count[i21,t1]                  count[i21,t2]                  count[i21,t3]                  ...
                   i22   count[i22,t1]                  count[i22,t2]                  count[i22,t3]                  ...
                   i23   count[i23,t1]                  count[i23,t2]                  count[i23,t3]                  ...
                   ...   ...                            ...                            ...                            ...
Continuous Input3  i3    mean[i3,t1],variance[i3,t1]    mean[i3,t2],variance[i3,t2]    mean[i3,t3],variance[i3,t3]    ...

XSD

<xs:element name="NaiveBayesModel">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="MiningSchema"/>
        <xs:element ref="Output" minOccurs="0"/>
        <xs:element ref="ModelStats" minOccurs="0"/>
        <xs:element ref="ModelExplanation" minOccurs="0"/>
        <xs:element ref="Targets" minOccurs="0"/>
        <xs:element ref="LocalTransformations" minOccurs="0"/>
        <xs:element ref="BayesInputs"/>
        <xs:element ref="BayesOutput"/>
        <xs:element ref="ModelVerification" minOccurs="0"/>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string"/>
      <xs:attribute name="threshold" type="REAL-NUMBER" use="required"/>
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
      <xs:attribute name="algorithmName" type="xs:string"/>
      <xs:attribute name="isScorable" type="xs:boolean" default="true"/>
    </xs:complexType>
  </xs:element>

The isScorable attribute indicates whether the model is valid for scoring. If this attribute is true or if it is missing, then the model should be processed normally. However, if the attribute is false, then the model producer has indicated that this model is intended for information purposes only and should not be used to generate results. In order to be valid PMML, all required elements and attributes must be present, even for non-scoring models. For more details, see General Structure.

Bayes Inputs

The BayesInputs element contains one or more BayesInput elements.

<xs:element name="BayesInputs">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="BayesInput" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Bayes Input

For a discrete field, each BayesInput contains the counts pairing the discrete values of that field with those of the target field.

For a continuous field, the BayesInput element lists the distributions obtained for that field with each value of the target field. BayesInput may also be used to define how continuous values are encoded as discrete bins. (Discretization is achieved using DerivedField; only the Discretize mapping for DerivedField may be invoked here.)

<xs:element name="BayesInput">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:choice>
          <xs:element ref="TargetValueStats" minOccurs="1" maxOccurs="1"/>
          <xs:sequence>
            <xs:element ref="DerivedField" minOccurs="0" maxOccurs="1"/>
            <xs:element ref="PairCounts" minOccurs="1" maxOccurs="unbounded"/>
          </xs:sequence> 
        </xs:choice>        
      </xs:sequence>
      <xs:attribute name="fieldName" type="FIELD-NAME" use="required"/>
    </xs:complexType>
  </xs:element>

Note that a BayesInput element encompasses either one TargetValueStats element or one or more PairCounts elements. Element DerivedField can only be used in conjunction with PairCounts.
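
The choice described in this note can be checked programmatically. The sketch below uses Python's xml.etree.ElementTree on stripped-down, namespace-free fragments. The fragments are deliberately incomplete (they omit the required child content) and the function name bayes_input_kind is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def bayes_input_kind(bayes_input):
    """Return "continuous" or "discrete" for a namespace-free BayesInput element."""
    has_stats = bayes_input.find("TargetValueStats") is not None
    has_pairs = bayes_input.find("PairCounts") is not None
    has_derived = bayes_input.find("DerivedField") is not None
    if has_stats == has_pairs:
        raise ValueError("expected either TargetValueStats or PairCounts")
    if has_stats and has_derived:
        raise ValueError("DerivedField is only allowed together with PairCounts")
    return "continuous" if has_stats else "discrete"


# Simplified fragments, for illustration only.
discrete = ET.fromstring(
    '<BayesInput fieldName="gender"><PairCounts value="male"/></BayesInput>'
)
continuous = ET.fromstring(
    '<BayesInput fieldName="age of individual"><TargetValueStats/></BayesInput>'
)
```

In a complete PMML document the elements live in the PMML namespace, so the tag names passed to find would need the namespace prefix.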

Bayes Output

BayesOutput contains the counts associated with the values of the target field.

<xs:element name="BayesOutput">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="TargetValueCounts"/>
      </xs:sequence>
      <xs:attribute name="fieldName" type="FIELD-NAME" use="required"/>
    </xs:complexType>
  </xs:element>

Target Value Stats

TargetValueStats serves as the envelope for element TargetValueStat. It is used for a continuous input field I_i to define statistical measures associated with each value of the target field. As defined in CONTINUOUS-DISTRIBUTION-TYPES, different distribution types can be used to represent such measures. For Bayes models, these are restricted to Gaussian and Poisson distributions. For more details on these continuous distribution types, please refer to BaselineModel.

<xs:element name="TargetValueStats">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="TargetValueStat" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  
<xs:element name="TargetValueStat">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:group ref="CONTINUOUS-DISTRIBUTION-TYPES" minOccurs="1"/>
      </xs:sequence>
      <xs:attribute name="value" type="xs:string" use="required"/>      
    </xs:complexType>
  </xs:element>

Pair Counts

PairCounts lists, for a field I_i's discrete value I_ij, the TargetValueCounts that pair the value I_ij with each value of the target field.

<xs:element name="PairCounts">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="TargetValueCounts"/>
      </xs:sequence>
      <xs:attribute name="value" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>

Target Value Counts

TargetValueCounts lists the counts associated with each value of the target field. However, a TargetValueCount whose count is zero may be omitted.

Within BayesOutput, TargetValueCounts lists the total count of occurrences of each target value.

Within PairCounts, TargetValueCounts lists, for each target value, the count of the joint occurrences of that target value with a particular discrete input value.

<xs:element name="TargetValueCounts">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="TargetValueCount" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  
<xs:element name="TargetValueCount">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="value" type="xs:string" use="required"/>
      <xs:attribute name="count" type="REAL-NUMBER" use="required"/>
    </xs:complexType>
  </xs:element>
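
As an illustration of how these elements might be consumed, the sketch below reads a PairCounts fragment into a dictionary with Python's xml.etree.ElementTree. The fragment is written without the PMML namespace for brevity, and the function name read_target_value_counts is hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<PairCounts value="male">
  <TargetValueCounts>
    <TargetValueCount value="100" count="4273"/>
    <TargetValueCount value="500" count="1321"/>
  </TargetValueCounts>
</PairCounts>
"""


def read_target_value_counts(element):
    """Map each target value to its count (REAL-NUMBER, read here as float)."""
    counts = {}
    for item in element.iter("TargetValueCount"):
        counts[item.get("value")] = float(item.get("count"))
    return counts


male_counts = read_target_value_counts(ET.fromstring(FRAGMENT))
```

Because a TargetValueCount with a count of zero may be omitted, a consumer would typically treat a target value that is absent from the dictionary as having count 0.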

Scoring procedure

Given an input vector like (i₁₂,i₂₃,i₃), where i₁₂,i₂₃ are discrete variables and i₃ is a continuous variable, the probability for class t₁ is computed as

P(t₁ | i₁₂,i₂₃,i₃) = L₁ / (L₁ + L₂ + L₃)

with

L₁ = count[t₁] * count[i₁₂,t₁]/count[t₁] * count[i₂₃,t₁]/count[t₁] * exp(-(i₃ - mean[i₃,t₁])² / (2*variance[i₃,t₁])) / sqrt(2π*variance[i₃,t₁])

L₂ = count[t₂] * count[i₁₂,t₂]/count[t₂] * count[i₂₃,t₂]/count[t₂] * exp(-(i₃ - mean[i₃,t₂])² / (2*variance[i₃,t₂])) / sqrt(2π*variance[i₃,t₂])

L₃ = count[t₃] * count[i₁₂,t₃]/count[t₃] * count[i₂₃,t₃]/count[t₃] * exp(-(i₃ - mean[i₃,t₃])² / (2*variance[i₃,t₃])) / sqrt(2π*variance[i₃,t₃])

Note that we used a Gaussian distribution for scoring continuous variables. The scoring procedure would need to be adjusted if the Poisson distribution were to be used.

When scoring, missing values are simply ignored. That is, the conditional-probability factor associated with a missing predictor field is omitted. For example, given an input vector with missing values (-,i₂₃,-) the probability for class t₁ is computed as

P(t₁ | -,i₂₃,-) = L₁ / (L₁ + L₂ + L₃)

with

L₁ = count[t₁] * count[i₂₃,t₁]/count[t₁]

L₂ = count[t₂] * count[i₂₃,t₂]/count[t₂]

L₃ = count[t₃] * count[i₂₃,t₃]/count[t₃]
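
The omission of factors for missing predictors can be sketched as follows for discrete inputs, using the same simple denominator count[t] as the formulas above. The function name, the use of None to mark a missing value, and the counts are assumptions made for illustration. The threshold handling for zero counts is left out here to keep the sketch short.

```python
def likelihood(target_count, pair_counts_for_record):
    """count[t] times count[i,t]/count[t] for every non-missing discrete input."""
    value = float(target_count)
    for pair_count in pair_counts_for_record:
        if pair_count is None:  # missing predictor value: omit its factor
            continue
        value *= pair_count / target_count
    return value


# Hypothetical record (-, i23, -) for a target value seen 10 times,
# 4 of them together with i23.
only_second_input = likelihood(10, [None, 4, None])
```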

Scoring procedure, example

<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.4">
    <Header copyright="Copyright (c) 2013, DMG.org"/>
    <DataDictionary numberOfFields="6">
      <DataField name="age of individual" optype="continuous" dataType="double"/>
      <DataField name="gender" optype="categorical" dataType="string">
        <Value value="female"/>
        <Value value="male"/>
      </DataField>
      <DataField name="no of claims" optype="categorical" dataType="string">
        <Value value="0"/>
        <Value value="1"/>
        <Value value="2"/>
        <Value value="&gt;2"/>
      </DataField>
      <DataField name="domicile" optype="categorical" dataType="string">
        <Value value="suburban"/>
        <Value value="urban"/>
        <Value value="rural"/>
      </DataField>
      <DataField name="age of car" optype="continuous" dataType="double"/>
      <DataField name="amount of claims" optype="categorical" dataType="integer">
        <Value value="100"/>
        <Value value="500"/>
        <Value value="1000"/>
        <Value value="5000"/>
        <Value value="10000"/>
      </DataField>
    </DataDictionary>
    <NaiveBayesModel modelName="NaiveBayes Insurance" functionName="classification" threshold="0.001">
      <MiningSchema>
        <MiningField name="age of individual"/>
        <MiningField name="gender"/>
        <MiningField name="no of claims"/>
        <MiningField name="domicile"/>
        <MiningField name="age of car"/>
        <MiningField name="amount of claims" usageType="target"/>
      </MiningSchema>
      <BayesInputs>
        <BayesInput fieldName="age of individual">
          <TargetValueStats>
            <TargetValueStat value="100">
              <GaussianDistribution mean="32.006" variance="0.352"/>
            </TargetValueStat>
            <TargetValueStat value="500">
              <GaussianDistribution mean="24.936" variance="0.516"/>
            </TargetValueStat>
            <TargetValueStat value="1000">
              <GaussianDistribution mean="24.588" variance="0.635"/>
            </TargetValueStat>
            <TargetValueStat value="5000">
              <GaussianDistribution mean="24.428" variance="0.379"/>
            </TargetValueStat>
            <TargetValueStat value="10000">
              <GaussianDistribution mean="24.770" variance="0.314"/>
            </TargetValueStat>
          </TargetValueStats>         
        </BayesInput>
        <BayesInput fieldName="gender">
          <PairCounts value="male">
            <TargetValueCounts>
              <TargetValueCount value="100" count="4273"/>
              <TargetValueCount value="500" count="1321"/>
              <TargetValueCount value="1000" count="780"/>
              <TargetValueCount value="5000" count="405"/>
              <TargetValueCount value="10000" count="42"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="female">
            <TargetValueCounts>
              <TargetValueCount value="100" count="4325"/>
              <TargetValueCount value="500" count="1212"/>
              <TargetValueCount value="1000" count="742"/>
              <TargetValueCount value="5000" count="292"/>
              <TargetValueCount value="10000" count="48"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
        <BayesInput fieldName="no of claims">
          <PairCounts value="0">
            <TargetValueCounts>
              <TargetValueCount value="100" count="4698"/>
              <TargetValueCount value="500" count="623"/>
              <TargetValueCount value="1000" count="1259"/>
              <TargetValueCount value="5000" count="550"/>
              <TargetValueCount value="10000" count="40"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="1">
            <TargetValueCounts>
              <TargetValueCount value="100" count="3526"/>
              <TargetValueCount value="500" count="1798"/>
              <TargetValueCount value="1000" count="227"/>
              <TargetValueCount value="5000" count="152"/>
              <TargetValueCount value="10000" count="40"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="2">
            <TargetValueCounts>
              <TargetValueCount value="100" count="225"/>
              <TargetValueCount value="500" count="10"/>
              <TargetValueCount value="1000" count="9"/>
              <TargetValueCount value="5000" count="0"/>
              <TargetValueCount value="10000" count="10"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="&gt;2">
            <TargetValueCounts>
              <TargetValueCount value="100" count="112"/>
              <TargetValueCount value="500" count="5"/>
              <TargetValueCount value="1000" count="1"/>
              <TargetValueCount value="5000" count="1"/>
              <TargetValueCount value="10000" count="8"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
        <BayesInput fieldName="domicile">
          <PairCounts value="suburban">
            <TargetValueCounts>
              <TargetValueCount value="100" count="2536"/>
              <TargetValueCount value="500" count="165"/>
              <TargetValueCount value="1000" count="516"/>
              <TargetValueCount value="5000" count="290"/>
              <TargetValueCount value="10000" count="42"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="urban">
            <TargetValueCounts>
              <TargetValueCount value="100" count="1679"/>
              <TargetValueCount value="500" count="792"/>
              <TargetValueCount value="1000" count="511"/>
              <TargetValueCount value="5000" count="259"/>
              <TargetValueCount value="10000" count="30"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="rural">
            <TargetValueCounts>
              <TargetValueCount value="100" count="2512"/>
              <TargetValueCount value="500" count="1013"/>
              <TargetValueCount value="1000" count="442"/>
              <TargetValueCount value="5000" count="137"/>
              <TargetValueCount value="10000" count="21"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
        <BayesInput fieldName="age of car">
          <DerivedField optype="categorical" dataType="string">
            <Discretize field="age of car">
              <DiscretizeBin binValue="0">
                <Interval closure="closedOpen" leftMargin="0" rightMargin="1"/>
              </DiscretizeBin>
              <DiscretizeBin binValue="1">
                <Interval closure="closedOpen" leftMargin="1" rightMargin="5"/>
              </DiscretizeBin>
              <DiscretizeBin binValue="2">
                <Interval closure="closedOpen" leftMargin="5"/>
              </DiscretizeBin>
            </Discretize>
          </DerivedField>
          <PairCounts value="0">
            <TargetValueCounts>
              <TargetValueCount value="100" count="927"/>
              <TargetValueCount value="500" count="183"/>
              <TargetValueCount value="1000" count="221"/>
              <TargetValueCount value="5000" count="50"/>
              <TargetValueCount value="10000" count="10"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="1">
            <TargetValueCounts>
              <TargetValueCount value="100" count="830"/>
              <TargetValueCount value="500" count="182"/>
              <TargetValueCount value="1000" count="51"/>
              <TargetValueCount value="5000" count="26"/>
              <TargetValueCount value="10000" count="6"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="2">
            <TargetValueCounts>
              <TargetValueCount value="100" count="6251"/>
              <TargetValueCount value="500" count="1901"/>
              <TargetValueCount value="1000" count="919"/>
              <TargetValueCount value="5000" count="623"/>
              <TargetValueCount value="10000" count="71"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
      </BayesInputs>
      <BayesOutput fieldName="amount of claims">
        <TargetValueCounts>
          <TargetValueCount value="100" count="8723"/>
          <TargetValueCount value="500" count="2557"/>
          <TargetValueCount value="1000" count="1530"/>
          <TargetValueCount value="5000" count="709"/>
          <TargetValueCount value="10000" count="100"/>
        </TargetValueCounts>
      </BayesOutput>
    </NaiveBayesModel>
  </PMML>

Given an input vector (age of individual = "24", gender = "male", no of claims = "2", domicile = (missing), age of car = "1") the probability for class "1000" is computed as

P("1000" | "24", "male", "2", -,"1" ) = L₂ / (L₀ + L₁ + L₂ + L₃ + L₄)

with

L₀ = 8723 * 0.001 * 4273/8598 * 225/8561 * 830/8008

L₁ = 2557 * [exp(-(24 - 24.936)² / (2 * 0.516) ) / sqrt(2π * 0.516)] * 1321/2533 * 10/2436 * 182/2266

L₂ = 1530 * [exp(-(24 - 24.588)² / (2 * 0.635) ) / sqrt(2π * 0.635)] * 780/1522 * 9/1496 * 51/1191

L₃ = 709 * [exp(-(24 - 24.428)² / (2 * 0.379) ) / sqrt(2π * 0.379)] * 405/697 * 0.001 * 26/699

L₄ = 100 * [exp(-(24 - 24.770)² / (2 * 0.314) ) / sqrt(2π * 0.314)] * 42/90 * 10/98 * 6/87
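
The worked example can be reproduced with a short Python sketch. The counts, means, variances and denominators are copied from the model and the formulas above, while the data layout and function names are assumptions made for illustration. The threshold replaces both the zero pair count for target value "5000" and the very small Gaussian value for target value "100", and the missing domicile field contributes no factor.

```python
import math

THRESHOLD = 0.001

# Target value -> (count, Gaussian mean, Gaussian variance of "age of individual").
TARGETS = {
    "100": (8723, 32.006, 0.352),
    "500": (2557, 24.936, 0.516),
    "1000": (1530, 24.588, 0.635),
    "5000": (709, 24.428, 0.379),
    "10000": (100, 24.770, 0.314),
}

# Observed discrete value -> target value -> (pair count, sum of the pair
# counts of that field for the target value).
DISCRETE = {
    "gender=male": {"100": (4273, 8598), "500": (1321, 2533),
                    "1000": (780, 1522), "5000": (405, 697), "10000": (42, 90)},
    "no of claims=2": {"100": (225, 8561), "500": (10, 2436),
                       "1000": (9, 1496), "5000": (0, 699), "10000": (10, 98)},
    "age of car=1": {"100": (830, 8008), "500": (182, 2266),
                     "1000": (51, 1191), "5000": (26, 699), "10000": (6, 87)},
}


def gaussian(x, mean, variance):
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def likelihood(target, age):
    count, mean, variance = TARGETS[target]
    value = count * max(gaussian(age, mean, variance), THRESHOLD)
    for table in DISCRETE.values():
        pair_count, denominator = table[target]
        value *= THRESHOLD if pair_count == 0 else pair_count / denominator
    return value


likelihoods = {target: likelihood(target, 24.0) for target in TARGETS}
total = sum(likelihoods.values())
probabilities = {target: value / total for target, value in likelihoods.items()}
print(probabilities["1000"])
```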
