PMML 3.2 - Text Models
Components
A Text model consists of six major parts:
- model attributes;
- dictionary of terms or TextDictionary;
- corpus of text documents;
- document-term matrix;
- text model normalization;
- text model similarity.
<xs:element name="TextModel">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="MiningSchema" />
<xs:element ref="Output" minOccurs="0" />
<xs:element ref="ModelStats" minOccurs="0" />
<xs:element ref="Targets" minOccurs="0" />
<xs:element ref="LocalTransformations" minOccurs="0" />
<xs:element ref="TextDictionary" />
<xs:element ref="TextCorpus" />
<xs:element ref="DocumentTermMatrix" />
<xs:element ref="TextModelNormalization" minOccurs="0" />
<xs:element ref="TextModelSimiliarity" minOccurs="0" />
<xs:element ref="ModelVerification" minOccurs="0"/>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="modelName" type="xs:string" />
<xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
<xs:attribute name="algorithmName" type="xs:string" />
<xs:attribute name="numberOfTerms" type="xs:integer" use="required" />
<xs:attribute name="numberOfDocuments" type="xs:integer" use="required" />
</xs:complexType>
</xs:element>
The attribute functionName can be classification, regression, or clustering, depending on the type of text model.
numberOfTerms and numberOfDocuments give the number of terms in the TextDictionary and the number of documents in the TextCorpus. The other model sections must be consistent with these values.
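The consistency requirement can be checked mechanically. The following is a minimal sketch, not part of the specification: the function name check_counts and the namespace-free fragment are hypothetical, and the sketch assumes that dictionary terms are separated by whitespace and that matrix rows correspond to documents and columns to terms, as in the example in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free TextModel fragment used only for illustration.
FRAGMENT = """
<TextModel functionName="regression" numberOfTerms="5" numberOfDocuments="3">
  <TextDictionary>
    <Array type="string">four score seven perish earth</Array>
  </TextDictionary>
  <TextCorpus>
    <TextDocument id="d1"/>
    <TextDocument id="d2"/>
    <TextDocument id="d3"/>
  </TextCorpus>
  <DocumentTermMatrix>
    <Matrix nbRows="3" nbCols="5"/>
  </DocumentTermMatrix>
</TextModel>
"""


def check_counts(model):
    """Return a list of messages describing count mismatches (empty if none)."""
    n_terms = int(model.get("numberOfTerms"))
    n_docs = int(model.get("numberOfDocuments"))

    # Assumption: terms are whitespace-separated, with no quoted multi-word terms.
    terms = model.find("TextDictionary/Array").text.split()
    docs = model.findall("TextCorpus/TextDocument")
    matrix = model.find("DocumentTermMatrix/Matrix")

    problems = []
    if len(terms) != n_terms:
        problems.append(f"dictionary has {len(terms)} terms, expected {n_terms}")
    if len(docs) != n_docs:
        problems.append(f"corpus has {len(docs)} documents, expected {n_docs}")

    # Assumption: rows are documents and columns are terms.
    # The dimensions are only checked when the attributes are present.
    nb_rows = matrix.get("nbRows")
    nb_cols = matrix.get("nbCols")
    if nb_rows is not None and int(nb_rows) != n_docs:
        problems.append(f"matrix has {nb_rows} rows, expected {n_docs}")
    if nb_cols is not None and int(nb_cols) != n_terms:
        problems.append(f"matrix has {nb_cols} columns, expected {n_terms}")
    return problems


print(check_counts(ET.fromstring(FRAGMENT)))
```

A real PMML document declares a default namespace, so the element paths would need to be namespace-qualified before this sketch could be applied to it.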
Model Attributes
Text models sometimes rely on attributes that must be normalized first.
Such transformations can be defined in the LocalTransformations element.
MiningField information (in MiningSchema) must be present for each active
variable. For each active MiningField, an element of type UnivariateStats
(see ModelStats) holds information about the overall (background)
population. This includes the required DiscrStats or ContStats, which
list the possible field values and interval boundaries. Optionally,
statistical information is included for the background data.
Text Dictionary
Next we consider the TextDictionary, which is the second major
component of the text model. The TextDictionary contains a list of terms,
which the subsequent sections reference by index.
Optionally, a taxonomy can be defined on the terms.
<xs:element name="TextDictionary">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="Taxonomy" minOccurs="0" />
<xs:group ref="STRING-ARRAY"/>
</xs:sequence>
</xs:complexType>
</xs:element>
Text Corpus of Documents
The third major component of a text model is the corpus of text
documents. TextCorpus consists of a number of TextDocuments.
Again, index-mapping is used for later references.
<xs:element name="TextCorpus">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="TextDocument" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
</xs:element>
Each TextDocument in the TextCorpus consists of a unique ID and
several attributes describing the document.
<xs:element name="TextDocument">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="id" type="xs:string" use="required" />
<xs:attribute name="name" type="xs:string" use="optional" />
<xs:attribute name="length" type="INT-NUMBER" use="optional" />
<xs:attribute name="file" type="xs:string" use="optional" />
</xs:complexType>
</xs:element>
name is a human readable label.
length is the length of the document in bytes.
file refers to the name of the document in the file system.
Document Term Matrix
The fourth major component of a text model is the DocumentTermMatrix.
<xs:element name="DocumentTermMatrix">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
<xs:element ref="Matrix"/>
</xs:sequence>
</xs:complexType>
</xs:element>
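The Matrix element can store the document-term counts sparsely, as in the example in this document, where only non-zero cells appear as MatCell elements. The following is a minimal sketch of expanding such a matrix into dense rows. The function name to_dense is hypothetical. The sketch assumes 1-based row and col indices (as the example suggests, since row="3" appears with nbRows="3") and assumes that unspecified cells take diagDefault on the diagonal and offDiagDefault elsewhere.

```python
import xml.etree.ElementTree as ET

# Sparse matrix copied from the example in this document (namespace omitted).
MATRIX_XML = """
<Matrix nbRows="3" nbCols="5" diagDefault="0" offDiagDefault="0">
  <MatCell row="1" col="3">2</MatCell>
  <MatCell row="2" col="1">4</MatCell>
  <MatCell row="2" col="5">1</MatCell>
  <MatCell row="3" col="2">3</MatCell>
</Matrix>
"""


def to_dense(matrix):
    """Expand a sparse Matrix element into a list of rows (documents x terms)."""
    n_rows = int(matrix.get("nbRows"))
    n_cols = int(matrix.get("nbCols"))
    diag = float(matrix.get("diagDefault", "0"))
    off = float(matrix.get("offDiagDefault", "0"))

    # Start from the default values, then overwrite the listed cells.
    dense = [[diag if r == c else off for c in range(n_cols)]
             for r in range(n_rows)]
    for cell in matrix.findall("MatCell"):
        r = int(cell.get("row")) - 1  # assumption: indices are 1-based
        c = int(cell.get("col")) - 1
        dense[r][c] = float(cell.text)
    return dense


for row in to_dense(ET.fromstring(MATRIX_XML)):
    print(row)
```

Each resulting row holds the term counts of one document, in the order of the terms in the TextDictionary.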
Normalization
We turn to the fifth major component of the model: how to normalize
the document term matrix. We use the normalization notation
introduced in T. G. Kolda and D. P. O'Leary, A Semi-Discrete Matrix Decomposition
for Latent Semantic Indexing in Information Retrieval, ACM Transactions
on Information Systems, Volume 16, 1998, pages 322-346 and described in
Michael W. Berry and Murray Browne, Understanding Search Engines, SIAM,
Philadelphia, 1999, pages 36-39. This notation is based upon specifying
local term weights, global term weights, document normalization, and
a metric for document similarity. The metric for document similarity is
described in the next section.
<xs:element name="TextModelNormalization" >
<xs:complexType >
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="localTermWeights" default="termFrequency" >
<xs:simpleType >
<xs:restriction base="xs:string" >
<xs:enumeration value="termFrequency" />
<xs:enumeration value="binary" />
<xs:enumeration value="logarithmic" />
<xs:enumeration value="augmentedNormalizedTermFrequency" />
</xs:restriction >
</xs:simpleType >
</xs:attribute >
<xs:attribute name="globalTermWeights" default="inverseDocumentFrequency" >
<xs:simpleType >
<xs:restriction base="xs:string" >
<xs:enumeration value="inverseDocumentFrequency" />
<xs:enumeration value="none" />
<xs:enumeration value="GFIDF" />
<xs:enumeration value="normal" />
<xs:enumeration value="probabilisticInverse" />
</xs:restriction >
</xs:simpleType >
</xs:attribute >
<xs:attribute name="documentNormalization" default="none" >
<xs:simpleType >
<xs:restriction base="xs:string" >
<xs:enumeration value="none" />
<xs:enumeration value="cosine" />
</xs:restriction >
</xs:simpleType >
</xs:attribute >
</xs:complexType >
</xs:element >
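The schema above names the weighting schemes but does not give their formulas. The following is a minimal sketch of how a weighted matrix could be computed from raw term counts. Everything in it is an assumption rather than part of the specification: the function name normalize is hypothetical, the formulas are the commonly used ones (binary as a presence indicator, logarithmic as log10(1 + f), inverse document frequency as log10(N / df)), and only a subset of the enumerated values is covered.

```python
import math


def normalize(matrix, local="termFrequency",
              global_="inverseDocumentFrequency", doc_norm="none"):
    """Apply local weights, global weights and document normalization.

    matrix is a list of rows (documents) of raw term counts.
    The formulas are assumptions; unsupported values raise ValueError.
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0

    def local_weight(f):
        if local == "termFrequency":
            return float(f)
        if local == "binary":
            return 1.0 if f > 0 else 0.0
        if local == "logarithmic":
            return math.log10(1.0 + f)  # assumption: base-10 logarithm
        raise ValueError(f"local weight not covered by this sketch: {local}")

    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]

    def global_weight(j):
        if global_ == "none":
            return 1.0
        if global_ == "inverseDocumentFrequency":
            # Terms that occur in no document get weight 0 to avoid division by zero.
            return math.log10(n_docs / df[j]) if df[j] else 0.0
        raise ValueError(f"global weight not covered by this sketch: {global_}")

    gw = [global_weight(j) for j in range(n_terms)]
    weighted = [[local_weight(row[j]) * gw[j] for j in range(n_terms)]
                for row in matrix]

    if doc_norm == "cosine":
        # Scale each non-zero document row to unit Euclidean length.
        for row in weighted:
            norm = math.sqrt(sum(v * v for v in row))
            if norm > 0:
                row[:] = [v / norm for v in row]
    elif doc_norm != "none":
        raise ValueError(f"document normalization not covered: {doc_norm}")
    return weighted


counts = [[0, 0, 2, 0, 0], [4, 0, 0, 0, 1], [0, 3, 0, 0, 0]]
print(normalize(counts, local="binary", global_="none", doc_norm="none"))
```

The keyword defaults mirror the default attribute values declared in the schema. The values augmentedNormalizedTermFrequency, GFIDF, normal and probabilisticInverse are deliberately left out; their definitions should be taken from the cited references.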
Similarity
We now turn to the final major component of the text model: how to
measure similarity between two documents.
<xs:element name="TextModelSimiliarity">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="similarityType" >
<xs:simpleType >
<xs:restriction base="xs:string" >
<xs:enumeration value="euclidean" />
<xs:enumeration value="cosine" />
</xs:restriction >
</xs:simpleType >
</xs:attribute >
</xs:complexType>
</xs:element>
Example
Here is a simple example:
<?xml version="1.0" encoding="UTF-8"?>
<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header copyright="DMG.org" />
<DataDictionary numberOfFields="6">
<DataField name="t1" optype="continuous" dataType="double" />
<DataField name="t2" optype="continuous" dataType="double" />
<DataField name="t3" optype="continuous" dataType="double" />
<DataField name="t4" optype="continuous" dataType="double" />
<DataField name="t5" optype="continuous" dataType="double" />
<DataField name="sim" optype="continuous" dataType="double" />
</DataDictionary>
<TextModel modelName="tm1" functionName="regression" numberOfTerms="5"
numberOfDocuments="3" >
<MiningSchema>
<MiningField name="t1" />
<MiningField name="t2" />
<MiningField name="t3" />
<MiningField name="t4" />
<MiningField name="t5" />
<MiningField name="sim" usageType="predicted" />
</MiningSchema>
<TextDictionary>
<Array type="string">four score seven perish earth</Array>
</TextDictionary>
<TextCorpus>
<TextDocument
id="d1"
name = "Gettysburg Address"
length = "272"
file = "gettysburg.txt" />
<TextDocument
id="d2"
name = "Letter to Henry Pierce"
length = "622"
file = "pierce.txt" />
<TextDocument
id="d3"
name = "Letter to Grace Bedell"
length = "88"
file = "bedell.txt" />
</TextCorpus>
<DocumentTermMatrix>
<Matrix nbRows="3" nbCols="5" diagDefault="0" offDiagDefault="0">
<MatCell row="1" col="3">2</MatCell>
<MatCell row="2" col="1">4</MatCell>
<MatCell row="2" col="5">1</MatCell>
<MatCell row="3" col="2">3</MatCell>
</Matrix>
</DocumentTermMatrix>
<TextModelNormalization localTermWeights="binary" globalTermWeights="none" documentNormalization="none">
</TextModelNormalization>
<TextModelSimiliarity
similarityType= "cosine">
</TextModelSimiliarity>
</TextModel>
</PMML>
Scoring
Scoring proceeds as follows:
- Given a collection of words, from a document or query string, construct the corresponding binary array over the TextDictionary: entry i is 1 if term i occurs among the words and 0 otherwise. This gives the query vector.
- Calculate the similarity for each document; use the similarity function specified in similarityType to calculate the similarity between the query vector and each row in the DocumentTermMatrix.
- The document with the highest similarity is the winner.
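The steps above can be sketched as follows for the example model, which uses binary local weights, no global weights, no document normalization, and cosine similarity. This is an illustrative sketch only: the names query_vector, cosine and score are hypothetical, the matrix is the example's DocumentTermMatrix after binary weighting, and term matching is assumed to be a case-insensitive exact match. Only the cosine similarity type is covered.

```python
import math

# Terms and documents from the example model.
TERMS = ["four", "score", "seven", "perish", "earth"]
DOC_IDS = ["d1", "d2", "d3"]

# Example DocumentTermMatrix after binary weighting (rows = documents).
DTM = [
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]


def query_vector(words, terms):
    """Binary array: 1 where a dictionary term occurs among the words."""
    present = {w.lower() for w in words}  # assumption: case-insensitive match
    return [1 if t in present else 0 for t in terms]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(words):
    """Return (document id, similarity) of the most similar document."""
    q = query_vector(words, TERMS)
    sims = [cosine(q, row) for row in DTM]
    best = max(range(len(sims)), key=sims.__getitem__)  # first row wins ties
    return DOC_IDS[best], sims[best]


print(score(["four", "perish", "earth"]))
```

For the query words four, perish and earth, only document d2 shares any terms with the query (four and earth), so d2 is the winner. Ties are resolved here in favor of the first document, which is a choice of this sketch and not something the text above prescribes.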