PMML 4.3 - Text Models (Deprecated in PMML 4.2)

NOTE: The element TextModel has been deprecated as of PMML 4.2, because the element TextIndex (see Transformations) allows existing model elements to replicate the functionality that TextModel provides. For more on deprecation, see Conformance.


A text model consists of six major parts:

  • model attributes;
  • dictionary of terms or TextDictionary;
  • corpus of text documents;
  • document-term matrix;
  • text model normalization;
  • text model similarity.
<xs:element name="TextModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="ModelExplanation" minOccurs="0"/>
      <xs:element ref="Targets" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0"/>
      <xs:element ref="TextDictionary"/>
      <xs:element ref="TextCorpus"/>
      <xs:element ref="DocumentTermMatrix"/>
      <xs:element ref="TextModelNormalization" minOccurs="0"/>
      <xs:element ref="TextModelSimiliarity" minOccurs="0"/>
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string"/>
    <xs:attribute name="numberOfTerms" type="xs:integer" use="required"/>
    <xs:attribute name="numberOfDocuments" type="xs:integer" use="required"/>
    <xs:attribute name="isScorable" type="xs:boolean" default="true"/>
  </xs:complexType>
</xs:element>

The attribute modelName specifies the name of the text model.

The attribute functionName may be classification, regression, or clustering, depending on the type of text model.

The attributes numberOfTerms and numberOfDocuments give the number of terms and the number of documents in the model. The TextDictionary, TextCorpus, and DocumentTermMatrix sections must all be consistent with these values.

  • isScorable: This attribute indicates if the model is valid for scoring. If this attribute is true or if it is missing, then the model should be processed normally. However, if the attribute is false, then the model producer has indicated that this model is intended for information purposes only and should not be used to generate results. In order to be valid PMML, all required elements and attributes must be present, even for non-scoring models. For more details, see General Structure.

Model Attributes

Text models sometimes rely on attributes that must be normalized or otherwise transformed before use. Such transformations can be defined in the LocalTransformations element.

MiningField information (in MiningSchema) must be present for each active variable. For each active MiningField, an element of type UnivariateStats (see ModelStats) holds information about the overall (background) population. This includes (required) DiscrStats or ContStats, which include possible field values and interval boundaries. Optionally, statistical information is included for the background data.

Text Dictionary

The TextDictionary is the second major component of the text model. It contains a list of terms, which the subsequent sections reference via index mapping. Optionally, a taxonomy can be defined on the terms.

<xs:element name="TextDictionary">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Taxonomy" minOccurs="0"/>
      <xs:group ref="STRING-ARRAY"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The length of the array must match numberOfTerms in the TextModel element. Otherwise the PMML document is invalid.

Text Corpus of Documents

The third major component of a text model is the corpus of text documents. TextCorpus consists of a number of TextDocuments. Again, index-mapping is used for later references.

<xs:element name="TextCorpus">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="TextDocument" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The number of TextDocuments must match numberOfDocuments in the TextModel element. Otherwise the PMML document is invalid.
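The two count constraints (array length against numberOfTerms, and TextDocument count against numberOfDocuments) can be checked mechanically. The following is a minimal sketch, not part of the specification. The helper name check_counts and the namespace-free fragment are assumptions made for illustration, and terms are assumed to be whitespace-separated without quoted spaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free TextModel fragment used only for illustration.
PMML_FRAGMENT = """
<TextModel modelName="tm1" functionName="regression"
           numberOfTerms="5" numberOfDocuments="3">
  <TextDictionary>
    <Array type="string">four score seven perish earth</Array>
  </TextDictionary>
  <TextCorpus>
    <TextDocument id="d1"/>
    <TextDocument id="d2"/>
    <TextDocument id="d3"/>
  </TextCorpus>
</TextModel>
"""


def check_counts(text_model):
    """Return a list of problems; an empty list means the counts agree."""
    problems = []
    # Assumption: terms are separated by whitespace and contain no quoted spaces.
    terms = text_model.find("TextDictionary/Array").text.split()
    documents = text_model.findall("TextCorpus/TextDocument")
    if len(terms) != int(text_model.get("numberOfTerms")):
        problems.append("TextDictionary length does not match numberOfTerms")
    if len(documents) != int(text_model.get("numberOfDocuments")):
        problems.append("TextDocument count does not match numberOfDocuments")
    return problems


problems = check_counts(ET.fromstring(PMML_FRAGMENT))
print(problems)
```

A real document would declare the PMML namespace, in which case the element paths passed to find and findall would need to be namespace-qualified.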

Each TextDocument in the TextCorpus consists of a unique ID and several attributes describing the document.

<xs:element name="TextDocument">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required"/>
    <xs:attribute name="name" type="xs:string" use="optional"/>
    <xs:attribute name="length" type="INT-NUMBER" use="optional"/>
    <xs:attribute name="file" type="xs:string" use="optional"/>
  </xs:complexType>
</xs:element>

id is the unique identifier used to reference the document later.

name is a human-readable label.

length is the length of the document in bytes.

file refers to the name of the document in the file system.

Document Term Matrix

The fourth major component of a text model is the DocumentTermMatrix.

<xs:element name="DocumentTermMatrix">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Matrix"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The matrix holds the frequency of each term in each document. Rows correspond to documents, and the number of rows must equal numberOfDocuments. Columns correspond to the terms in the TextDictionary, and the number of columns must equal numberOfTerms. Index matching is used for all references: the n-th column refers to the n-th entry of the array in the TextDictionary, and the n-th row refers to the n-th document in the TextCorpus.
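To make the index matching concrete, the sketch below expands a sparse Matrix element into a dense list of rows (one row per document, one column per term). It is an illustration under stated assumptions, not a reference implementation: the helper name to_dense is hypothetical, the fragment has no namespace, and MatCell indices are taken to be 1-based, which is inferred from the example in this section where row runs up to nbRows and col up to nbCols.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free Matrix fragment matching the example in this section.
MATRIX_XML = """
<Matrix nbRows="3" nbCols="5" diagDefault="0" offDiagDefault="0">
  <MatCell row="1" col="3">2</MatCell>
  <MatCell row="2" col="1">4</MatCell>
  <MatCell row="2" col="5">1</MatCell>
  <MatCell row="3" col="2">3</MatCell>
</Matrix>
"""


def to_dense(matrix):
    """Expand a sparse Matrix element into a list of rows of floats."""
    n_rows = int(matrix.get("nbRows"))
    n_cols = int(matrix.get("nbCols"))
    diag = float(matrix.get("diagDefault", "0"))
    off_diag = float(matrix.get("offDiagDefault", "0"))
    # Start from the default values, then overwrite the cells listed explicitly.
    dense = [[diag if r == c else off_diag for c in range(n_cols)]
             for r in range(n_rows)]
    for cell in matrix.findall("MatCell"):
        # Assumption: row and col are 1-based, so shift to 0-based list indices.
        row = int(cell.get("row")) - 1
        col = int(cell.get("col")) - 1
        dense[row][col] = float(cell.text)
    return dense


dense = to_dense(ET.fromstring(MATRIX_XML))
for row in dense:
    print(row)
```

Here dense[0] describes the first TextDocument and dense[0][2] is the frequency of the third dictionary term in that document.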


Text Model Normalization

The fifth major component of the model specifies how to normalize the document-term matrix. We use the normalization notation introduced in T. G. Kolda and D. P. O'Leary, A Semi-Discrete Matrix Decomposition of Latent Semantic Indexing in Information Retrieval, ACM Transactions on Information Systems, Volume 16, 1998, pages 322-346, and described in Michael W. Berry and Murray Browne, Understanding Search Engines, SIAM, Philadelphia, 1999, pages 36-39. This notation specifies local term weights, global term weights, document normalization, and a metric for document similarity. The metric for document similarity is described in the next section.

<xs:element name="TextModelNormalization">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="localTermWeights" default="termFrequency">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="termFrequency"/>
          <xs:enumeration value="binary"/>
          <xs:enumeration value="logarithmic"/>
          <xs:enumeration value="augmentedNormalizedTermFrequency"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="globalTermWeights" default="inverseDocumentFrequency">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="inverseDocumentFrequency"/>
          <xs:enumeration value="none"/>
          <xs:enumeration value="GFIDF"/>
          <xs:enumeration value="normal"/>
          <xs:enumeration value="probabilisticInverse"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="documentNormalization" default="none">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="none"/>
          <xs:enumeration value="cosine"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

This element defines how the document-term matrix is normalized, in terms of three components: the local term weights, the global term weights, and the document normalization.
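The schema names the weighting schemes but does not spell out their formulas. The sketch below shows how the three components might combine, assuming the definitions commonly associated with the cited notation: the local weight is computed per cell, the global weight per term, and the document normalization per row. The function names, the use of natural logarithms, and the handling of empty rows and unused terms are assumptions. Only the global schemes none and inverseDocumentFrequency are covered; GFIDF, normal, and probabilisticInverse are left out.

```python
import math


def local_weight(freq, max_freq_in_doc, scheme):
    """Local weight of one term frequency within one document (assumed formulas)."""
    present = 1.0 if freq > 0 else 0.0
    if scheme == "termFrequency":
        return float(freq)
    if scheme == "binary":
        return present
    if scheme == "logarithmic":
        return math.log(1.0 + freq)  # natural log is an assumption
    if scheme == "augmentedNormalizedTermFrequency":
        if max_freq_in_doc == 0:
            return 0.0  # empty document: no term carries weight
        return (present + freq / max_freq_in_doc) / 2.0
    raise ValueError("unknown localTermWeights: " + scheme)


def normalize(dtm, local="termFrequency",
              global_="inverseDocumentFrequency", doc_norm="none"):
    """Weight a document-term matrix given as rows (documents) of frequencies."""
    n_docs = len(dtm)
    n_terms = len(dtm[0]) if dtm else 0
    # One global weight per term (column).
    weights = []
    for t in range(n_terms):
        doc_freq = sum(1 for row in dtm if row[t] > 0)
        if global_ == "none":
            weights.append(1.0)
        elif global_ == "inverseDocumentFrequency":
            # Terms that occur in no document get weight 0 (assumption).
            weights.append(math.log(n_docs / doc_freq) if doc_freq else 0.0)
        else:
            raise ValueError("global scheme not covered by this sketch: " + global_)
    result = []
    for row in dtm:
        max_freq = max(row, default=0)
        weighted = [weights[t] * local_weight(f, max_freq, local)
                    for t, f in enumerate(row)]
        if doc_norm == "cosine":
            length = math.sqrt(sum(w * w for w in weighted))
            if length > 0:
                weighted = [w / length for w in weighted]
        elif doc_norm != "none":
            raise ValueError("unknown documentNormalization: " + doc_norm)
        result.append(weighted)
    return result


dtm = [[0, 0, 2, 0, 0], [4, 0, 0, 0, 1], [0, 3, 0, 0, 0]]
binary_rows = normalize(dtm, local="binary", global_="none", doc_norm="none")
print(binary_rows)
```

The default arguments mirror the defaults declared in the schema (termFrequency, inverseDocumentFrequency, none).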


Text Model Similarity

The final major component of the text model specifies how to measure the similarity between two documents. Note that the schema spells the element name TextModelSimiliarity:

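<xs:element name="TextModelSimiliarity">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="similarityType">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="euclidean"/>
          <xs:enumeration value="cosine"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>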

similarityType specifies how similarities are calculated.
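For illustration, the two similarityType values can be sketched as plain vector functions. The function names are hypothetical. The text does not say how a Euclidean distance is turned into a similarity, so this sketch simply returns the distance and assumes that a smaller distance means a more similar document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```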


Here is a simple example:

<PMML xmlns="" version="4.3">
  <Header copyright=""/>
  <DataDictionary numberOfFields="6">
    <DataField name="t1" optype="continuous" dataType="double"/>
    <DataField name="t2" optype="continuous" dataType="double"/>
    <DataField name="t3" optype="continuous" dataType="double"/>
    <DataField name="t4" optype="continuous" dataType="double"/>
    <DataField name="t5" optype="continuous" dataType="double"/>
    <DataField name="sim" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TextModel modelName="tm1" functionName="regression" numberOfTerms="5" numberOfDocuments="3">
    <MiningSchema>
      <MiningField name="t1"/>
      <MiningField name="t2"/>
      <MiningField name="t3"/>
      <MiningField name="t4"/>
      <MiningField name="t5"/>
      <MiningField name="sim" usageType="target"/>
    </MiningSchema>
    <TextDictionary>
      <Array type="string">four score seven perish earth</Array>
    </TextDictionary>
    <TextCorpus>
      <TextDocument id="d1" name="Gettysburg Address" length="272" file="gettysburg.txt"/>
      <TextDocument id="d2" name="Letter to Henry Pierce" length="622" file="pierce.txt"/>
      <TextDocument id="d3" name="Letter to Grace Bedell" length="88" file="bedell.txt"/>
    </TextCorpus>
    <DocumentTermMatrix>
      <Matrix nbRows="3" nbCols="5" diagDefault="0" offDiagDefault="0">
        <MatCell row="1" col="3">2</MatCell>
        <MatCell row="2" col="1">4</MatCell>
        <MatCell row="2" col="5">1</MatCell>
        <MatCell row="3" col="2">3</MatCell>
      </Matrix>
    </DocumentTermMatrix>
    <TextModelNormalization localTermWeights="binary" globalTermWeights="none" documentNormalization="none"/>
    <TextModelSimiliarity similarityType="cosine"/>
  </TextModel>
</PMML>


Scoring proceeds as follows:

  1. Given a collection of words, from a document or query string, construct the respective binary array according to the TextDictionary. This gives a query vector.
  2. Calculate the similarity for each document; use the similarity function specified in similarityType to calculate the similarity between the query vector and each row in the DocumentTermMatrix.
  3. The document with the highest similarity is the winner.
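The three scoring steps can be sketched in Python using the dictionary, documents, and matrix from the example. This is an illustration under assumptions rather than a normative procedure. The names TERMS, DOC_IDS, DTM, and score are hypothetical, tokenization is a simple whitespace split, and the binary local weighting is assumed to apply to the matrix rows as well as to the query.

```python
import math

# Data copied from the example: dictionary terms, document ids, and the
# dense form of the document-term matrix (rows = documents, columns = terms).
TERMS = ["four", "score", "seven", "perish", "earth"]
DOC_IDS = ["d1", "d2", "d3"]
DTM = [
    [0, 0, 2, 0, 0],
    [4, 0, 0, 0, 1],
    [0, 3, 0, 0, 0],
]


def binary_vector(words):
    """Step 1: mark which dictionary terms occur in the query words."""
    present = {w.lower() for w in words}
    return [1.0 if term in present else 0.0 for term in TERMS]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(words):
    query = binary_vector(words)
    # Assumption: binary local weights are applied to the matrix rows too.
    rows = [[1.0 if f > 0 else 0.0 for f in row] for row in DTM]
    # Step 2: similarity between the query vector and each document row.
    sims = {doc_id: cosine(query, row) for doc_id, row in zip(DOC_IDS, rows)}
    # Step 3: the document with the highest similarity wins.
    winner = max(sims, key=sims.get)
    return winner, sims


winner, sims = score("shall not perish from the earth".split())
print(winner, sims)
```

With this query only "perish" and "earth" match the dictionary, so d2 is the only document with a nonzero similarity. When several documents tie, max returns the first one in document order, which is a choice of this sketch and not something the text prescribes.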