PMML 4.0 - Text Models
Components
A Text model consists of six major parts:
- model attributes;
- dictionary of terms or TextDictionary;
- corpus of text documents;
- document-term matrix;
- text model normalization;
- text model similarity.
<xs:element name="TextModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="ModelExplanation" minOccurs="0"/>
      <xs:element ref="Targets" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="TextDictionary"/>
      <xs:element ref="TextCorpus"/>
      <xs:element ref="DocumentTermMatrix"/>
      <xs:element ref="TextModelNormalization" minOccurs="0"/>
      <xs:element ref="TextModelSimiliarity" minOccurs="0"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string"/>
    <xs:attribute name="numberOfTerms" type="xs:integer" use="required"/>
    <xs:attribute name="numberOfDocuments" type="xs:integer" use="required"/>
  </xs:complexType>
</xs:element>
The attribute modelName specifies the name of the text model.
The attribute functionName can be classification, regression, or clustering,
depending on the type of text model.
The attributes numberOfTerms and numberOfDocuments give the number of terms
and the number of documents. The corresponding model sections (TextDictionary,
TextCorpus, and DocumentTermMatrix) must be consistent with these values.
Model Attributes
Text models sometimes rely on attributes that need to be normalized or
otherwise transformed. Such transformations can be defined in the
LocalTransformations element.
MiningField information (in MiningSchema) must be present for each active
variable. For each active MiningField, an element of type UnivariateStats
(see ModelStats) holds information about the overall (background)
population.
This includes the required DiscrStats or ContStats, which contain the
possible field values and interval boundaries. Statistical information
for the background data can optionally be included.
Text Dictionary
The TextDictionary is the second major component of the text model.
It contains a list of terms, which subsequent sections reference by
index mapping. Optionally, a taxonomy can be defined on the terms.
<xs:element name="TextDictionary">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Taxonomy" minOccurs="0"/>
      <xs:group ref="STRING-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
The length of the array must match numberOfTerms in the
TextModel element. Otherwise the PMML document is invalid.
Text Corpus of Documents
The third major component of a text model is the corpus of text
documents. TextCorpus consists of a number of TextDocuments.
Again, index-mapping is used for later references.
<xs:element name="TextCorpus">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="TextDocument" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
The number of TextDocuments must match numberOfDocuments in the
TextModel element. Otherwise the PMML document is invalid.
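The two count constraints above (array length against numberOfTerms, number of TextDocuments against numberOfDocuments) can be checked mechanically. The following is only a minimal sketch in Python using the standard library; the function name check_counts and the SAMPLE document are hypothetical, and splitting the array on whitespace is a simplifying assumption that ignores quoted multi-word terms.

```python
import xml.etree.ElementTree as ET


def check_counts(pmml_text):
    """Return a list of consistency problems; an empty list means none found."""
    root = ET.fromstring(pmml_text)
    # Reuse whatever namespace the root element carries, e.g. "{uri}PMML".
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    model = root.find(f"{ns}TextModel")
    n_terms = int(model.get("numberOfTerms"))
    n_docs = int(model.get("numberOfDocuments"))

    # Simplification (assumption): terms are split on whitespace, so quoted
    # multi-word array entries are not handled.
    array = model.find(f"{ns}TextDictionary/{ns}Array")
    terms = (array.text or "").split()
    docs = model.findall(f"{ns}TextCorpus/{ns}TextDocument")

    problems = []
    if len(terms) != n_terms:
        problems.append(
            f"numberOfTerms={n_terms} but TextDictionary has {len(terms)} terms"
        )
    if len(docs) != n_docs:
        problems.append(
            f"numberOfDocuments={n_docs} but TextCorpus has {len(docs)} documents"
        )
    return problems


# Hypothetical, stripped-down document used only to exercise the check.
SAMPLE = """
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <TextModel functionName="regression" numberOfTerms="5" numberOfDocuments="3">
    <TextDictionary>
      <Array type="string">four score seven perish earth</Array>
    </TextDictionary>
    <TextCorpus>
      <TextDocument id="d1"/>
      <TextDocument id="d2"/>
      <TextDocument id="d3"/>
    </TextCorpus>
  </TextModel>
</PMML>
""".strip()

if __name__ == "__main__":
    print(check_counts(SAMPLE))
```

A full validator would also check the dimensions of the DocumentTermMatrix; this sketch covers only the two counts discussed so far.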
Each TextDocument in the TextCorpus consists of a unique ID and
several attributes describing the document.
<xs:element name="TextDocument">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required"/>
    <xs:attribute name="name" type="xs:string" use="optional"/>
    <xs:attribute name="length" type="INT-NUMBER" use="optional"/>
    <xs:attribute name="file" type="xs:string" use="optional"/>
  </xs:complexType>
</xs:element>
id is used to reference the document later.
name is a human-readable label.
length is the length of the document in bytes.
file refers to the name of the document in the file system.
Document Term Matrix
The fourth major component of a text model is the DocumentTermMatrix.
<xs:element name="DocumentTermMatrix">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Matrix"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
The matrix holds the frequencies of the terms as they appear in the
individual documents. The rows correspond to the documents, and the number
of rows must match numberOfDocuments. The columns correspond to the terms
specified in TextDictionary, and the number of columns must match
numberOfTerms. Index mapping is used for all references: the n-th column
refers to the n-th entry in the array in the TextDictionary, and the n-th
row refers to the n-th document as specified in the TextCorpus section.
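To make the index mapping concrete, the sketch below expands a sparse Matrix element into a dense list of rows, so that row n lines up with the n-th TextDocument and column n with the n-th dictionary term. It is an illustration only: the function name dense_matrix is hypothetical, the fragment is written without a namespace for brevity, and the 1-based row and col indices and the diagDefault/offDiagDefault attributes are assumed to behave as in the example later in this document.

```python
import xml.etree.ElementTree as ET


def dense_matrix(matrix_xml):
    """Expand a sparse <Matrix> fragment into a list of rows of floats."""
    m = ET.fromstring(matrix_xml)
    n_rows, n_cols = int(m.get("nbRows")), int(m.get("nbCols"))
    diag = float(m.get("diagDefault", "0"))
    off = float(m.get("offDiagDefault", "0"))

    # Start from the default values, then overwrite the explicitly listed cells.
    rows = [[diag if r == c else off for c in range(n_cols)] for r in range(n_rows)]
    for cell in m.findall("MatCell"):
        r, c = int(cell.get("row")), int(cell.get("col"))
        rows[r - 1][c - 1] = float(cell.text)  # assumed 1-based indices
    return rows


# Fragment mirroring the matrix used in the example section.
MATRIX = """
<Matrix nbRows="3" nbCols="5" diagDefault="0" offDiagDefault="0">
  <MatCell row="1" col="3">2</MatCell>
  <MatCell row="2" col="1">4</MatCell>
  <MatCell row="2" col="5">1</MatCell>
  <MatCell row="3" col="2">3</MatCell>
</Matrix>
""".strip()

if __name__ == "__main__":
    for row in dense_matrix(MATRIX):
        print(row)
```

With the dense form, checking the matrix against numberOfDocuments and numberOfTerms reduces to comparing the number of rows and the length of each row.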
Normalization
We turn to the fifth major component of the model: how to normalize
the document term matrix. We use the normalization notation
introduced in T. G. Kolda and D. P. O'Leary, A Semi-Discrete Matrix Decomposition
of Latent Semantic Indexing in Information Retrieval, ACM Transactions
on Information Systems, Volume 16, 1998, pages 322-346 and described in
Michael W. Berry and Murray Browne, Understanding Search Engines, SIAM,
Philadelphia, 1999, pages 36-39. This notation is based upon specifying
local term weights, global term weights, document normalization, and
a metric for document similarity. The metric for document similarity is
described in the next section.
<xs:element name="TextModelNormalization">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="localTermWeights" default="termFrequency">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="termFrequency"/>
          <xs:enumeration value="binary"/>
          <xs:enumeration value="logarithmic"/>
          <xs:enumeration value="augmentedNormalizedTermFrequency"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="globalTermWeights" default="inverseDocumentFrequency">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="inverseDocumentFrequency"/>
          <xs:enumeration value="none"/>
          <xs:enumeration value="GFIDF"/>
          <xs:enumeration value="normal"/>
          <xs:enumeration value="probabilisticInverse"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="documentNormalization" default="none">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="none"/>
          <xs:enumeration value="cosine"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>
This element defines the normalization of the document term matrix in
terms of three components: the local term weights, the global term
weights, and the document normalization.
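This document names the weighting schemes but does not spell out their formulas, so the following Python sketch rests on assumptions. It uses definitions commonly given in the information retrieval literature for three of the local term weights (the raw frequency f, an indicator of f > 0, and log(1 + f) with the natural logarithm) and for cosine document normalization (dividing each row by its Euclidean length). Global term weights are treated as "none", and augmentedNormalizedTermFrequency is deliberately left out. The function names are hypothetical.

```python
import math


def local_weight(f, scheme):
    """Apply an assumed local term weight to a raw term frequency f."""
    if scheme == "termFrequency":
        return float(f)
    if scheme == "binary":
        return 1.0 if f > 0 else 0.0
    if scheme == "logarithmic":
        return math.log(1.0 + f)  # natural log is an assumption
    raise ValueError(f"scheme not covered by this sketch: {scheme}")


def normalize_rows(matrix, local="termFrequency", document="none"):
    """Weight each document row; global term weights are treated as 'none'."""
    out = []
    for row in matrix:
        weighted = [local_weight(f, local) for f in row]
        if document == "cosine":
            length = math.sqrt(sum(x * x for x in weighted))
            if length > 0:  # leave all-zero rows untouched
                weighted = [x / length for x in weighted]
        out.append(weighted)
    return out


if __name__ == "__main__":
    dtm = [[0, 0, 2, 0, 0], [4, 0, 0, 0, 1], [0, 3, 0, 0, 0]]
    print(normalize_rows(dtm, local="binary"))
    print(normalize_rows(dtm, document="cosine"))
```

The exact formulas for all enumerated values should be taken from the references cited above rather than from this sketch.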
Similarity
We now turn to the final major component of the text model: how to
measure similarity between two documents.
<xs:element name="TextModelSimiliarity">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="similarityType">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="euclidean"/>
          <xs:enumeration value="cosine"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>
similarityType specifies how similarities are calculated.
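As an illustration of the two enumerated values, the sketch below gives the standard textbook definitions of cosine similarity and Euclidean distance between two term vectors. The function names are hypothetical. Note that the Euclidean measure is a distance, so smaller values mean more similar vectors; how a consumer should turn it into a similarity is not stated here and is left open in this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    print(cosine_similarity([1, 0, 0, 0, 1], [4, 0, 0, 0, 1]))
    print(euclidean_distance([1, 0, 0, 0, 1], [4, 0, 0, 0, 1]))
```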
Example
Here is a simple example:
<?xml version="1.0" encoding="UTF-8"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Header copyright="DMG.org"/>
  <DataDictionary numberOfFields="6">
    <DataField name="t1" optype="continuous" dataType="double"/>
    <DataField name="t2" optype="continuous" dataType="double"/>
    <DataField name="t3" optype="continuous" dataType="double"/>
    <DataField name="t4" optype="continuous" dataType="double"/>
    <DataField name="t5" optype="continuous" dataType="double"/>
    <DataField name="sim" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TextModel modelName="tm1" functionName="regression" numberOfTerms="5"
             numberOfDocuments="3">
    <MiningSchema>
      <MiningField name="t1"/>
      <MiningField name="t2"/>
      <MiningField name="t3"/>
      <MiningField name="t4"/>
      <MiningField name="t5"/>
      <MiningField name="sim" usageType="predicted"/>
    </MiningSchema>
    <TextDictionary>
      <Array type="string">four score seven perish earth</Array>
    </TextDictionary>
    <TextCorpus>
      <TextDocument
        id="d1"
        name="Gettysburg Address"
        length="272"
        file="gettysburg.txt"/>
      <TextDocument
        id="d2"
        name="Letter to Henry Pierce"
        length="622"
        file="pierce.txt"/>
      <TextDocument
        id="d3"
        name="Letter to Grace Bedell"
        length="88"
        file="bedell.txt"/>
    </TextCorpus>
    <DocumentTermMatrix>
      <Matrix nbRows="3" nbCols="5" diagDefault="0" offDiagDefault="0">
        <MatCell row="1" col="3">2</MatCell>
        <MatCell row="2" col="1">4</MatCell>
        <MatCell row="2" col="5">1</MatCell>
        <MatCell row="3" col="2">3</MatCell>
      </Matrix>
    </DocumentTermMatrix>
    <TextModelNormalization localTermWeights="binary" globalTermWeights="none" documentNormalization="none">
    </TextModelNormalization>
    <TextModelSimiliarity
      similarityType="cosine">
    </TextModelSimiliarity>
  </TextModel>
</PMML>
Scoring
Scoring proceeds as follows:
- Given a collection of words, from a document or query string, construct
the respective binary array according to the TextDictionary.
This gives a query vector.
- Calculate the similarity for each document; use the
similarity function specified in similarityType to calculate the
similarity between the query vector and each row in the DocumentTermMatrix.
- The document with the highest similarity is the winner.
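The scoring steps above can be sketched end to end in Python using the terms and matrix from the example. This is a hedged illustration rather than a reference implementation: the names TERMS, DOC_IDS, DTM, query_vector, and score are hypothetical, queries are tokenized by the caller and matched case-insensitively (an assumption), the raw matrix rows are compared without any further weighting, cosine similarity is used as in the example, and ties go to the first document.

```python
import math

# Data copied from the example: dictionary terms, document ids, dense matrix.
TERMS = ["four", "score", "seven", "perish", "earth"]
DOC_IDS = ["d1", "d2", "d3"]
DTM = [
    [0, 0, 2, 0, 0],
    [4, 0, 0, 0, 1],
    [0, 3, 0, 0, 0],
]


def query_vector(words, terms):
    """Step 1: binary array marking which dictionary terms occur in the query."""
    present = {w.lower() for w in words}
    return [1 if t in present else 0 for t in terms]


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def score(words):
    """Steps 2 and 3: similarity to every row, then pick the best document."""
    q = query_vector(words, TERMS)
    sims = [cosine(q, row) for row in DTM]
    best = max(range(len(sims)), key=sims.__getitem__)  # first maximum wins ties
    return DOC_IDS[best], sims


if __name__ == "__main__":
    winner, sims = score("four earth".split())
    print(winner, sims)
```

For the query "four earth", only the second row shares terms with the query vector, so d2 is returned as the winner.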