Data Mining Group - Text Models

PMML 3.2 - Text Models

Components

A text model consists of six major parts:

model attributes;
dictionary of terms or TextDictionary;
corpus of text documents;
document-term matrix;
text model normalization;
text model similarity.


  <xs:element name="TextModel">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="MiningSchema" />
        <xs:element ref="Output" minOccurs="0" />
        <xs:element ref="ModelStats" minOccurs="0" />
        <xs:element ref="Targets" minOccurs="0" />
        <xs:element ref="LocalTransformations" minOccurs="0" />
        <xs:element ref="TextDictionary" />
        <xs:element ref="TextCorpus" />
        <xs:element ref="DocumentTermMatrix" />
        <xs:element ref="TextModelNormalization" minOccurs="0" />
        <xs:element ref="TextModelSimiliarity" minOccurs="0" />
        <xs:element ref="ModelVerification" minOccurs="0"/>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="modelName" type="xs:string" />
      <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
      <xs:attribute name="algorithmName" type="xs:string" />
      <xs:attribute name="numberOfTerms" type="xs:integer" use="required" />
      <xs:attribute name="numberOfDocuments" type="xs:integer" use="required" />
    </xs:complexType>
  </xs:element>

The attribute modelName specifies the name of the text model.
The attribute functionName can be classification, regression, or clustering, depending on the type of text model.
The attributes numberOfTerms and numberOfDocuments give the number of terms and the number of documents, respectively. The TextDictionary, TextCorpus, and DocumentTermMatrix sections must all be consistent with these values.

Model Attributes

Text models sometimes rely on attributes that need to be normalized. Any such transformations can be defined in the LocalTransformations element.
MiningField information (in MiningSchema) must be present for each active variable. For each active MiningField, an element of type UnivariateStats (see ModelStats) holds information about the overall (background) population. This includes (required) DiscrStats or ContStats, which include possible field values and interval boundaries. Optionally, statistical information is included for the background data.

Text Dictionary

Next we consider the TextDictionary, the second major component of the text model. The TextDictionary contains a list of terms, which the subsequent sections reference via index mapping. Optionally, a taxonomy can be defined on the terms.


  <xs:element name="TextDictionary">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="Taxonomy" minOccurs="0" />
        <xs:group ref="STRING-ARRAY"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

The length of the array must match numberOfTerms in the TextModel element. Otherwise the PMML document is invalid.

Text Corpus of Documents

The third major component of a text model is the corpus of text documents. TextCorpus consists of a number of TextDocuments. Again, index-mapping is used for later references.


  <xs:element name="TextCorpus">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="TextDocument" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

The number of TextDocuments must match numberOfDocuments in the TextModel element. Otherwise the PMML document is invalid.
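
The two count constraints above can be checked mechanically. The following is a minimal sketch, not part of the specification: the fragment is a hypothetical, trimmed-down TextModel with the namespace omitted, check_counts is an illustrative helper name, and terms are assumed to be separated by whitespace (quoted multi-word array entries are not handled).

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down TextModel fragment (no namespace, no MiningSchema).
PMML_FRAGMENT = """
<TextModel modelName="tm1" functionName="regression"
           numberOfTerms="5" numberOfDocuments="3">
  <TextDictionary>
    <Array type="string">four score seven perish earth</Array>
  </TextDictionary>
  <TextCorpus>
    <TextDocument id="d1"/>
    <TextDocument id="d2"/>
    <TextDocument id="d3"/>
  </TextCorpus>
</TextModel>
"""


def check_counts(text_model):
    """Return a list of consistency problems; an empty list means the counts agree."""
    problems = []
    declared_terms = int(text_model.get("numberOfTerms"))
    declared_docs = int(text_model.get("numberOfDocuments"))

    # Assumption: array entries are whitespace-separated single words.
    array = text_model.find("TextDictionary/Array")
    terms = (array.text or "").split() if array is not None else []
    if len(terms) != declared_terms:
        problems.append(
            f"numberOfTerms={declared_terms}, but the dictionary lists {len(terms)} terms"
        )

    documents = text_model.findall("TextCorpus/TextDocument")
    if len(documents) != declared_docs:
        problems.append(
            f"numberOfDocuments={declared_docs}, but the corpus lists {len(documents)} documents"
        )
    return problems


model = ET.fromstring(PMML_FRAGMENT.strip())
print(check_counts(model))
```

A real document would carry the PMML namespace, so the element paths would need to be namespace-qualified.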

Each TextDocument in the TextCorpus consists of a unique ID and several attributes describing the document.


  <xs:element name="TextDocument">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="id" type="xs:string" use="required" />
      <xs:attribute name="name" type="xs:string" use="optional" />
      <xs:attribute name="length" type="INT-NUMBER" use="optional" />
      <xs:attribute name="file" type="xs:string" use="optional" />
    </xs:complexType>
  </xs:element>

id is used for later references to the document.
name is a human readable label.
length is the length of the document in bytes.
file refers to the name of the document in the file system.

Document Term Matrix

The fourth major component of a text model is the DocumentTermMatrix.


  <xs:element name="DocumentTermMatrix">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="Matrix"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

The matrix provides an overview of the frequencies of the terms as they appear in the individual documents. The documents are given by the rows and must match the respective number as specified in numberOfDocuments. The columns refer to the terms as specified in TextDictionary and must also match the respective number. Index-matching is used for all references; this means that the n-th column refers to the n-th entry in the array in the TextDictionary. Furthermore, the n-th row refers to the n-th document as specified in the TextCorpus section.
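
To make the index matching concrete, here is a small sketch that expands sparse matrix cells into dense rows and pairs each row with a document and each column with a term. The cell values and names are taken from the example later in this document. The helper name expand_matrix is illustrative, and treating the row and col indices as 1-based is an assumption inferred from that example.

```python
def expand_matrix(n_rows, n_cols, cells, diag_default=0.0, off_diag_default=0.0):
    """Expand sparse (row, col, value) cells into a dense list of rows.

    Assumption: row and col are 1-based, as the example's MatCell entries suggest.
    """
    dense = [
        [diag_default if r == c else off_diag_default for c in range(n_cols)]
        for r in range(n_rows)
    ]
    for row, col, value in cells:
        if not (1 <= row <= n_rows and 1 <= col <= n_cols):
            raise ValueError(f"cell ({row}, {col}) lies outside a {n_rows}x{n_cols} matrix")
        dense[row - 1][col - 1] = value
    return dense


terms = ["four", "score", "seven", "perish", "earth"]   # TextDictionary order
doc_ids = ["d1", "d2", "d3"]                            # TextCorpus order
cells = [(1, 3, 2), (2, 1, 4), (2, 5, 1), (3, 2, 3)]    # (row, col, count)

# Rows must match the number of documents, columns the number of terms.
matrix = expand_matrix(len(doc_ids), len(terms), cells)

for doc_id, row in zip(doc_ids, matrix):
    counts = {term: count for term, count in zip(terms, row) if count}
    print(doc_id, counts)
# d1 {'seven': 2}
# d2 {'four': 4, 'earth': 1}
# d3 {'score': 3}
```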

Normalization

We turn to the fifth major component of the model: how to normalize the document term matrix. We use the normalization notation introduced in T. G. Kolda and D. P. O'Leary, A Semi-Discrete Matrix Decomposition of Latent Semantic Indexing in Information Retrieval, ACM Transactions on Information Systems, Volume 16, 1998, pages 322-346, and described in Michael W. Berry and Murray Browne, Understanding Search Engines, SIAM, Philadelphia, 1999, pages 36-39. This notation is based upon specifying local term weights, global term weights, document normalization, and a metric for document similarity. The metric for document similarity is described in the next section.


  <xs:element name="TextModelNormalization" >
    <xs:complexType >
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="localTermWeights" default="termFrequency" >
        <xs:simpleType >
          <xs:restriction base="xs:string" >
            <xs:enumeration value="termFrequency" />
            <xs:enumeration value="binary" />
            <xs:enumeration value="logarithmic" />
            <xs:enumeration value="augmentedNormalizedTermFrequency" />
          </xs:restriction >
        </xs:simpleType >
      </xs:attribute >
      <xs:attribute name="globalTermWeights" default="inverseDocumentFrequency" >
        <xs:simpleType >
          <xs:restriction base="xs:string" >
            <xs:enumeration value="inverseDocumentFrequency" />
            <xs:enumeration value="none" />
            <xs:enumeration value="GFIDF" />
            <xs:enumeration value="normal" />
            <xs:enumeration value="probabilisticInverse" />
          </xs:restriction >
        </xs:simpleType >
      </xs:attribute >
      <xs:attribute name="documentNormalization" default="none" >
        <xs:simpleType >
          <xs:restriction base="xs:string" >
            <xs:enumeration value="none" />
            <xs:enumeration value="cosine" />
          </xs:restriction >
        </xs:simpleType >
      </xs:attribute >
    </xs:complexType >
  </xs:element >

This element defines the normalization of the document term matrix in terms of three components: the local term weights, the global term weights, and the document normalization.
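
This page does not spell out the weighting formulas themselves. The sketch below assumes the definitions usually given for this notation in the references cited above; the function names, the use of the natural logarithm, and the handling of terms that occur in no document or in every document are all assumptions made for illustration, so check them against those references before relying on them.

```python
import math

GLOBAL_KINDS = {"none", "inverseDocumentFrequency", "GFIDF", "normal", "probabilisticInverse"}


def local_weight(kind, f, max_f_in_doc):
    """Local weight of one raw count f; max_f_in_doc is the largest count in that document."""
    present = 1.0 if f > 0 else 0.0
    if kind == "termFrequency":
        return float(f)
    if kind == "binary":
        return present
    if kind == "logarithmic":
        return math.log(1.0 + f)  # assumption: natural logarithm
    if kind == "augmentedNormalizedTermFrequency":
        return 0.5 * (present + f / max_f_in_doc) if max_f_in_doc else 0.0
    raise ValueError(f"unknown localTermWeights: {kind}")


def global_weight(kind, column, n_docs):
    """Global weight of one term, given its column of raw counts over all documents."""
    if kind not in GLOBAL_KINDS:
        raise ValueError(f"unknown globalTermWeights: {kind}")
    if kind == "none":
        return 1.0
    doc_freq = sum(1 for f in column if f > 0)  # documents containing the term
    if doc_freq == 0:
        return 0.0  # assumption: a term that never occurs gets weight 0
    if kind == "inverseDocumentFrequency":
        return math.log(n_docs / doc_freq)
    if kind == "GFIDF":
        return sum(column) / doc_freq
    if kind == "normal":
        return 1.0 / math.sqrt(sum(f * f for f in column))
    if doc_freq == n_docs:
        return 0.0  # assumption: avoid log(0) when every document has the term
    return math.log((n_docs - doc_freq) / doc_freq)  # probabilisticInverse


def normalize(matrix, local="termFrequency", global_="inverseDocumentFrequency",
              document="none"):
    """Apply local weights, global weights, and document normalization to dense rows."""
    if document not in ("none", "cosine"):
        raise ValueError(f"unknown documentNormalization: {document}")
    g = [global_weight(global_, col, len(matrix)) for col in zip(*matrix)]
    result = []
    for row in matrix:
        max_f = max(row) if row else 0
        weighted = [local_weight(local, f, max_f) * g[j] for j, f in enumerate(row)]
        if document == "cosine":
            norm = math.sqrt(sum(w * w for w in weighted))
            if norm > 0:
                weighted = [w / norm for w in weighted]
        result.append(weighted)
    return result


# Raw counts from the example later in this document (rows d1..d3, five terms).
MATRIX = [[0, 0, 2, 0, 0], [4, 0, 0, 0, 1], [0, 3, 0, 0, 0]]
print(normalize(MATRIX, "binary", "none", "none"))
```

The default arguments of normalize mirror the attribute defaults in the schema above.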

Similarity

We now turn to the final major component of the text model: how to measure similarity between two documents.


  <xs:element name="TextModelSimiliarity">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="similarityType" >
        <xs:simpleType >
          <xs:restriction base="xs:string" >
            <xs:enumeration value="euclidean" />
            <xs:enumeration value="cosine" />
          </xs:restriction >
        </xs:simpleType >
      </xs:attribute >
    </xs:complexType>
  </xs:element>

similarityType specifies how the similarity between two documents is calculated: euclidean or cosine.
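
As an illustration of the two similarityType values, the sketch below computes both measures for a pair of term vectors. The function names are illustrative. Note that the Euclidean measure is a distance, so smaller values mean more similar documents; how a consumer should turn it into a ranking is not stated here and is left as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two term vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical binary term vectors over a five-term dictionary.
doc_a = [1, 0, 0, 0, 1]
doc_b = [1, 0, 1, 0, 0]
print(cosine_similarity(doc_a, doc_b))   # about 0.5: one shared term out of two each
print(euclidean_distance(doc_a, doc_b))  # about 1.414: the vectors differ in two positions
```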

Example

Here is a simple example:


  <?xml version="1.0" encoding="UTF-8"?>
  <PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Header copyright="DMG.org" />
    <DataDictionary numberOfFields="6">
      <DataField name="t1" optype="continuous" dataType="double" />
      <DataField name="t2" optype="continuous" dataType="double" />
      <DataField name="t3" optype="continuous" dataType="double" />
      <DataField name="t4" optype="continuous" dataType="double" />
      <DataField name="t5" optype="continuous" dataType="double" />
      <DataField name="sim" optype="continuous" dataType="double" />
    </DataDictionary>
    <TextModel modelName="tm1" functionName="regression" numberOfTerms="5"
            numberOfDocuments="3" >
      <MiningSchema>
        <MiningField name="t1" />
        <MiningField name="t2" />
        <MiningField name="t3" />
        <MiningField name="t4" />
        <MiningField name="t5" />
        <MiningField name="sim" usageType="predicted" />
      </MiningSchema>
      <TextDictionary>
        <Array type="string">four score seven perish earth</Array>
      </TextDictionary>
      <TextCorpus>
        <TextDocument
            id="d1"
            name = "Gettysburg Address"
            length = "272"
            file = "gettysburg.txt" />
        <TextDocument
            id="d2"
            name = "Letter to Henry Pierce"
            length = "622"
            file = "pierce.txt" />
        <TextDocument
            id="d3"
            name = "Letter to Grace Bedell"
            length = "88"
            file = "bedell.txt" />
      </TextCorpus>
      <DocumentTermMatrix>
        <Matrix nbRows="3" nbCols="5" diagDefault="0" offDiagDefault="0">
          <MatCell row="1" col="3">2</MatCell>
          <MatCell row="2" col="1">4</MatCell>
          <MatCell row="2" col="5">1</MatCell>
          <MatCell row="3" col="2">3</MatCell>
        </Matrix>
      </DocumentTermMatrix>
      <TextModelNormalization localTermWeights="binary" globalTermWeights="none" documentNormalization="none" />
      <TextModelSimiliarity similarityType="cosine" />
    </TextModel>
  </PMML>

Scoring

Scoring proceeds as follows:

Given a collection of words, from a document or a query string, construct the corresponding binary array over the terms in the TextDictionary: entry n is 1 if the n-th term occurs in the input and 0 otherwise. This gives the query vector.
Calculate the similarity between the query vector and each row of the DocumentTermMatrix, using the similarity function specified in similarityType.
The document with the highest similarity is the winner.
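
The scoring steps above can be sketched as follows for the example model (binary local weights, no global weights, no document normalization, cosine similarity). This is an illustrative sketch rather than a reference implementation: the query string is hypothetical, and splitting on whitespace with lower-casing is an assumed tokenization that the specification does not prescribe.

```python
import math

TERMS = ["four", "score", "seven", "perish", "earth"]  # TextDictionary order
DOC_IDS = ["d1", "d2", "d3"]                           # TextCorpus order
# Rows of the example DocumentTermMatrix after binary local weighting.
MATRIX = [
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]


def query_vector(words, terms):
    """Binary array over the dictionary: 1 where the term occurs in the input."""
    present = {word.lower() for word in words}
    return [1 if term in present else 0 for term in terms]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(text):
    """Return the best-matching document id and all similarities."""
    query = query_vector(text.split(), TERMS)  # assumed tokenization
    sims = {doc_id: cosine(query, row) for doc_id, row in zip(DOC_IDS, MATRIX)}
    best = max(sims, key=sims.get)  # ties resolve to the earliest document
    return best, sims


best, sims = score("four score years on earth")
print(best)  # d2
print(sims)
```

Here the query vector is [1, 1, 0, 0, 1]; it shares two terms with d2, one with d3, and none with d1, so d2 wins.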