PMML 2.1 - General Structure of a PMML Document

PMML uses XML to represent mining models. The structure of the models is described by an XML Schema. One or more mining models can be contained in a PMML document. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:


<?xml version="1.0" ?>
<PMML version="2.1"
  xmlns="http://www.dmg.org/PMML-2_1" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >

  <Header copyright="Example.com"/>
  <DataDictionary> ... </DataDictionary>

  ... a model ...

</PMML>


The namespaces in the PMML Schema itself are defined as:
 

<xs:schema 
  xmlns:xs='http://www.w3.org/2001/XMLSchema'
  targetNamespace='http://www.dmg.org/PMML-2_1'
  xmlns='http://www.dmg.org/PMML-2_1'
  elementFormDefault='unqualified'>

Although a PMML document must be valid with respect to the PMML XSD, a document must not require a validating parser, which would load external entities. In addition to being a valid XML document, a valid PMML document must obey a number of further rules which are described at various places in the PMML specification. See also the conformance rules for valid PMML documents, producers, and consumers.

The root element of a PMML document must have type PMML.


 <xs:element name='PMML'>
  <xs:complexType>
   <xs:sequence>
    <xs:element ref='Header'/>
    <xs:element ref='MiningBuildTask' minOccurs='0' maxOccurs='1'/>
    <xs:element ref='DataDictionary'/>
    <xs:element ref='TransformationDictionary' minOccurs='0' maxOccurs='1'/>
    <xs:sequence minOccurs='0' maxOccurs='unbounded'>
     <xs:choice>
      <xs:element ref='TreeModel'/>
      <xs:element ref='NeuralNetwork'/>
      <xs:element ref='ClusteringModel'/>
      <xs:element ref='RegressionModel'/>
      <xs:element ref='GeneralRegressionModel'/>
      <xs:element ref='NaiveBayesModel'/>
      <xs:element ref='AssociationModel'/>
      <xs:element ref='SequenceModel'/>
     </xs:choice>
    </xs:sequence>
    <xs:element ref='Extension' minOccurs='0' maxOccurs='unbounded'/>
   </xs:sequence>
   <xs:attribute name='version' type='xs:string' use='required'/>
  </xs:complexType>
 </xs:element>

A PMML document can contain more than one model. If the application system provides a means of selecting models by name and if the PMML consumer specifies a model name, then that model is used; otherwise the first model is used.

A PMML 2.1 compliant system is not required to provide model selection by name.
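
The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the specification: the function name select_model is hypothetical, the model bodies in the sample document are omitted (so the sample is not valid PMML), and raising LookupError for an unknown name is a choice made here because PMML 2.1 does not say what a consumer should do in that case.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.dmg.org/PMML-2_1}"

# Model elements allowed as children of PMML, taken from the schema above.
MODEL_ELEMENTS = (
    "TreeModel", "NeuralNetwork", "ClusteringModel", "RegressionModel",
    "GeneralRegressionModel", "NaiveBayesModel", "AssociationModel",
    "SequenceModel",
)
MODEL_TAGS = {NS + name for name in MODEL_ELEMENTS}


def select_model(root, model_name=None):
    """Return the model called model_name, else the first model, else None."""
    models = [child for child in root if child.tag in MODEL_TAGS]
    if model_name is not None:
        for model in models:
            if model.get("modelName") == model_name:
                return model
        # Assumption: the behavior for an unknown name is not defined by PMML.
        raise LookupError("no model named %r" % model_name)
    return models[0] if models else None


# Sample document; model content is left out for brevity (hypothetical data).
DOC = """
<PMML version="2.1" xmlns="http://www.dmg.org/PMML-2_1">
  <Header copyright="Example.com"/>
  <DataDictionary/>
  <TreeModel modelName="first" functionName="classification"/>
  <RegressionModel modelName="second" functionName="regression"/>
</PMML>
"""

root = ET.fromstring(DOC)
default_model = select_model(root)            # no name given: first model
named_model = select_model(root, "second")    # name given: that model
```

A consumer that does not support selection by name would simply always call select_model(root) without a name.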

The list of mining models in a PMML document may even be empty. The document can be used to carry the initial metadata before an actual model is computed. A PMML document containing no model is not meant to be useful for a PMML consumer.

For PMML version 2.1, the attribute version must have the value 2.1.
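
A consumer might check the root element and the version attribute before reading anything else. The following Python sketch shows one way to do this with the standard library; the function name check_root and the sample document are assumptions made for illustration, and the check covers only the root element, not validity against the full schema.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-2_1"


def check_root(xml_text):
    """Return a list of problems found in the root element (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    # ElementTree reports namespaced tags as "{namespace}localname".
    if root.tag != "{%s}PMML" % PMML_NS:
        problems.append("unexpected root element: %s" % root.tag)
    if root.get("version") != "2.1":
        problems.append("version must be 2.1, found %r" % root.get("version"))
    return problems


# Hypothetical sample; the DataField attributes are assumed, not taken
# from this section of the specification.
DOC = (
    '<PMML version="2.1" xmlns="http://www.dmg.org/PMML-2_1">'
    '<Header copyright="Example.com"/>'
    '<DataDictionary><DataField name="age" optype="continuous"/></DataDictionary>'
    '</PMML>'
)

problems = check_root(DOC)
```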

The element MiningBuildTask can contain any XML value describing the configuration of the training run that produced the model instance. This information is not directly needed in a PMML consumer, but in many cases it is helpful for maintenance and for visualization of the model. The particular content structure of MiningBuildTask is not defined by PMML 2.1. However, this element would be the natural container for task specifications as defined by other mining standards, e.g., in SQL or Java.


 <xs:element name='MiningBuildTask'>
  <xs:complexType>
   <xs:sequence>
    <xs:element ref='Extension' minOccurs='0' maxOccurs='unbounded'/>
   </xs:sequence>
  </xs:complexType>
 </xs:element>

The fields in the DataDictionary and in the TransformationDictionary taken together are identified by unique names. Other elements in the models can refer to these fields by name. Multiple models in one PMML document can share the same fields in the TransformationDictionary. Nevertheless, a model can also define its 'own' derived fields. Various models use DerivedField elements directly in the definition of the model. For example, DerivedField elements appear inline in the input layer of neural networks.
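
The uniqueness requirement can be checked mechanically. The Python sketch below is an illustration only. It assumes that fields appear as DataField children of DataDictionary and as DerivedField children of TransformationDictionary, each carrying a name attribute; those element definitions belong to other parts of the specification and are not shown in this section. The function name duplicate_field_names and the sample data are hypothetical, and the DerivedField in the sample has no content, so the sample is not valid PMML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "{http://www.dmg.org/PMML-2_1}"


def duplicate_field_names(root):
    """Return the sorted field names used more than once across both dictionaries."""
    names = []
    data_dict = root.find(NS + "DataDictionary")
    if data_dict is not None:
        names += [f.get("name") for f in data_dict.findall(NS + "DataField")]
    trans_dict = root.find(NS + "TransformationDictionary")
    if trans_dict is not None:
        names += [f.get("name") for f in trans_dict.findall(NS + "DerivedField")]
    # Fields without a name attribute are ignored by this check.
    counts = Counter(name for name in names if name is not None)
    return sorted(name for name, count in counts.items() if count > 1)


DOC = """
<PMML version="2.1" xmlns="http://www.dmg.org/PMML-2_1">
  <Header copyright="Example.com"/>
  <DataDictionary>
    <DataField name="age" optype="continuous"/>
    <DataField name="income" optype="continuous"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="income"/>
  </TransformationDictionary>
</PMML>
"""

duplicates = duplicate_field_names(ET.fromstring(DOC))
```

Here the name income is used in both dictionaries, so it is reported as a duplicate.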

Certain types of PMML models such as neural networks or logistic regression can be used for different purposes. That is, some instances implement prediction of numeric values, while others can be used for classification. Therefore, PMML defines five different mining functions. Each model has an attribute 'functionName' which specifies the mining function.


  <xs:simpleType name="MINING-FUNCTION">
     <xs:restriction base='xs:string'>
      <xs:enumeration value='associationRules'/>
      <xs:enumeration value='sequences'/>
      <xs:enumeration value='classification'/>
      <xs:enumeration value='regression'/>
      <xs:enumeration value='clustering'/>
     </xs:restriction>
  </xs:simpleType>

For all PMML models, the structure of the top-level model element is similar to the template of ExampleModel below:


 <xs:element name='ExampleModel'>
  <xs:complexType>
   <xs:sequence>
    <xs:element ref='Extension' minOccurs='0' maxOccurs='unbounded'/>
    <xs:element ref='MiningSchema'/>
    <xs:element ref='ModelStats' minOccurs='0' maxOccurs='1'/>
    ...
    <xs:element ref='Extension' minOccurs='0' maxOccurs='unbounded'/>
   </xs:sequence>
   <xs:attribute name='modelName' type='xs:string' use='optional'/>
   <xs:attribute name='functionName' type='MINING-FUNCTION' use='required'/>
   <xs:attribute name='algorithmName' type='xs:string' use='optional'/>
  </xs:complexType>
 </xs:element>

 <xs:element name='MiningSchema'>
  <xs:complexType>
   <xs:sequence>
    <xs:element ref='Extension' minOccurs='0' maxOccurs='unbounded'/>
    <xs:element ref='MiningField' maxOccurs='unbounded'/>
   </xs:sequence>
  </xs:complexType>
 </xs:element>

 <xs:element name='ModelStats'>
  <xs:complexType>
   <xs:sequence>
    <xs:element ref='Extension' minOccurs='0' maxOccurs='unbounded'/>
    <xs:element ref='UnivariateStats' maxOccurs='unbounded'/>
   </xs:sequence>
  </xs:complexType>
 </xs:element>

The non-empty list of mining fields defines a mining schema. The univariate statistics contain global statistics on (a subset of) the mining fields. Other model-specific elements follow after ModelStats in the content of ExampleModel. For a list of models that have been defined in PMML 2.1, see the element PMML above.

modelName: the value in modelName identifies the model with a unique name in the context of the PMML file. This attribute is not required. Users reading models from a PMML file are free to manage the naming of models at their discretion.

functionName and algorithmName describe the kind of mining model, e.g., whether it is intended to be used for clustering or for classification. The algorithm name can be any description for the specific algorithm that produced the model. This attribute is for information only.

The naming conventions for PMML are as follows. ElementNames are in mixed case, first letter uppercase. The attributeNames are in mixed case, first letter lowercase. The enumConstants are in mixed case, first letter lowercase. The simpleTypes are all uppercase. The character '-' is used sparingly in order to avoid confusion with mathematical notation.

Extension Mechanism

The PMML schema contains a mechanism for extending the content of a model. Extension elements are included in the content definition of many element types. These extension elements have a content model of ANY, allowing considerable freedom in the nature of the extensions. They also support name-value pairs. For example, one use of the extension mechanism could be to associate display information for a particular tool.


 <xs:element name='Extension'>
   <xs:complexType>
    <xs:complexContent mixed="true">
    <xs:restriction base="xs:anyType">
      <xs:sequence>
       <xs:any processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence> 
      <xs:attribute name='extender' type='xs:string' use='optional'/>
      <xs:attribute name='name' type='xs:string' use='optional'/>
      <xs:attribute name='value' type='xs:string' use='optional'/>
     </xs:restriction>
    </xs:complexContent>
   </xs:complexType>
 </xs:element>


The extension data should be part of the content within the elements of type Extension. Nevertheless, the attribute name can be used to specify the kind of the extension. For example, an Extension for storing a special set of statistical indicators could be written as <Extension extender="foo.com" name="StatsMomentum">...</Extension>. If you know HTML tags, you will notice a similarity to the attributes name and content of the META tag.

With XML 1.0 one can add attribute declarations to given elements, without changing an external schema. An XML parser may give a warning but a document which uses the additional attributes can be valid. That is, if the standard PMML schema contains an element TreeNode then a document may declare additional attributes on TreeNode. PMML adopts this rule but attribute names must have prefix 'x-' in order to make an extension obvious. The same convention is used for vendor specific element types which can be contained in an Extension element; the tag name must start with 'X-'.

This convention also helps to avoid conflicts with possible future extensions to standard PMML. If a document uses local namespaces, then the name of the namespace should not start with 'PMML' or 'DMG' or any variant of these names with lowercase characters. They are reserved for future use in PMML.
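
As an illustration of the 'X-' convention for vendor-specific elements, the Python sketch below lists the children of Extension elements whose tag name does not start with 'X-'. It is a minimal check under stated assumptions: the function name nonconforming_extension_children and the vendor elements X-Momentum and Display are hypothetical, and the corresponding check for 'x-' attributes is left out because it would need the full list of standard attributes for each element.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def nonconforming_extension_children(root):
    """Return tag names inside Extension elements that lack the 'X-' prefix."""
    bad = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        if local_name(element.tag) != "Extension":
            continue
        for child in element:
            if isinstance(child.tag, str) and not local_name(child.tag).startswith("X-"):
                bad.append(local_name(child.tag))
    return bad


# Hypothetical vendor content: X-Momentum follows the convention,
# Display does not.
DOC = """
<PMML version="2.1" xmlns="http://www.dmg.org/PMML-2_1">
  <Header copyright="Example.com"/>
  <DataDictionary/>
  <Extension extender="foo.com" name="StatsMomentum">
    <X-Momentum order="3" value="0.12"/>
    <Display color="red"/>
  </Extension>
</PMML>
"""

bad_names = nonconforming_extension_children(ET.fromstring(DOC))
```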

Basic data types and entities:

The definition


  <xs:simpleType name='NUMBER'>
   <xs:restriction base='xs:double'>
   </xs:restriction>
  </xs:simpleType>

is commonly used for distinguishing numeric values from other data. Numbers may have a leading sign, fractions, and an exponent. In addition to NUMBER there are a couple of more specific definitions:


  <xs:simpleType name='INT-NUMBER'>
   <xs:restriction base='xs:integer'>
   </xs:restriction>
  </xs:simpleType>

An INT-NUMBER must be an integer, with no fraction or exponent.


  <xs:simpleType name='REAL-NUMBER'>
   <xs:restriction base='xs:double'>
   </xs:restriction>
  </xs:simpleType>

A REAL-NUMBER can be any number; it covers the C/C++ types 'float', 'long', and 'double'. Scientific notation, e.g., 1.23e4, is allowed.

PMML uses the character '.' as decimal point in the representation of REAL-NUMBER values.


  <xs:simpleType name='PROB-NUMBER'>
   <xs:restriction base='xs:decimal'>
   </xs:restriction>
  </xs:simpleType>

A PROB-NUMBER is a REAL-NUMBER between 0.0 and 1.0 usually describing a probability.


  <xs:simpleType name='PERCENTAGE-NUMBER'>
   <xs:restriction base='xs:decimal'>
   </xs:restriction>
  </xs:simpleType>

A PERCENTAGE-NUMBER is a REAL-NUMBER between 0.0 and 100.0.

Note that these entities do not force the XML parser to check the data types. However, they still define requirements for a valid PMML document.
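
Because the range limits are not expressed in the schema, a consumer that wants to enforce them has to check them itself. The following Python sketch shows one possible set of checks. The function names are hypothetical. It follows the schema fragments above, in which PROB-NUMBER and PERCENTAGE-NUMBER restrict xs:decimal; under that reading exponent notation is rejected for these two types, which is an interpretation made here rather than something this section states explicitly.

```python
import re

# Lexical forms: an optional sign followed by digits (INT-NUMBER), or a
# decimal literal without exponent (xs:decimal, the base of PROB-NUMBER
# and PERCENTAGE-NUMBER in the schema fragments above).
_INT = re.compile(r"[+-]?[0-9]+")
_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")


def is_int_number(text):
    """True if text is an integer with no fraction or exponent."""
    return _INT.fullmatch(text.strip()) is not None


def _decimal_in_range(text, low, high):
    text = text.strip()
    return _DECIMAL.fullmatch(text) is not None and low <= float(text) <= high


def is_prob_number(text):
    """True if text is a decimal between 0.0 and 1.0 inclusive."""
    return _decimal_in_range(text, 0.0, 1.0)


def is_percentage_number(text):
    """True if text is a decimal between 0.0 and 100.0 inclusive."""
    return _decimal_in_range(text, 0.0, 100.0)
```

For example, is_prob_number("0.25") holds, while is_prob_number("1.5") and is_int_number("1e3") do not.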

Many elements contain references to input fields. PMML does not use IDREF to represent field names because field names are not necessarily valid XML identifiers. However, given the definition


  <xs:simpleType name='FIELD-NAME'>
   <xs:restriction base='xs:string'>
   </xs:restriction>
  </xs:simpleType>

then references to input fields will be obvious from the schema syntax. Note that a model can refer to two kinds of input fields. One is the set of MiningField elements in the MiningSchema. The other is the set of DerivedField elements in the TransformationDictionary.

Compact arrays of values

Instances of mining models often contain sets with a large number of values. The type Array is defined as a container structure which implements arrays of numbers and strings in a fairly compact way.


  <xs:complexType name='ArrayType' mixed='true'>
   <xs:attribute name='n' type='INT-NUMBER' use='optional'/>
   <xs:attribute name='type' use='optional'>
    <xs:simpleType>
     <xs:restriction base='xs:string'>
      <xs:enumeration value='int'/>
      <xs:enumeration value='real'/>
      <xs:enumeration value='string'/>
     </xs:restriction>
    </xs:simpleType>
   </xs:attribute>
  </xs:complexType>


 <xs:element name='Array'  type='ArrayType'>
 </xs:element>

The content of 'Array' is a blank-separated sequence of values; multiple blanks are as good as one blank. The attribute 'n' determines the number of elements in the sequence. If n is given, it must match the number of values in the content; otherwise the PMML document is invalid. The attribute 'type' indicates the data type of the values in the array. This attribute is optional because in many cases the data type is implied by the context where the array is used. String values may be enclosed within double quotes ", which are not considered to be part of the value. If a string value contains the " character, then it must be escaped by a backslash character \. This is the same escaping mechanism as used in C/C++.

Here is an example:


        <Array n="3" type="int"> 1 22 3
        </Array>
        <Array n="3" type="string">
        ab  "a b"   "with \"quotes\" "
        </Array>

The second array contains the three strings 'ab', 'a b', and 'with "quotes" '.
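
The rules for Array content can be turned into a small tokenizer. The Python sketch below is one possible reading of those rules, not a reference implementation. The function name parse_array and its parameters are hypothetical. It assumes that a backslash is interpreted only inside double quotes and that it makes the following character literal; the text above only defines the escape for the " character.

```python
def parse_array(content, n=None, value_type=None):
    """Split Array content into values, honoring quotes and backslash escapes."""
    values = []
    i = 0
    length = len(content)
    while i < length:
        ch = content[i]
        if ch.isspace():
            i += 1  # any run of blanks acts as a single separator
        elif ch == '"':
            i += 1  # skip the opening quote
            buf = []
            while i < length and content[i] != '"':
                if content[i] == "\\" and i + 1 < length:
                    i += 1  # drop the backslash, keep the escaped character
                buf.append(content[i])
                i += 1
            if i >= length:
                raise ValueError("unterminated quoted string")
            i += 1  # skip the closing quote
            values.append("".join(buf))
        else:
            start = i
            while i < length and not content[i].isspace():
                i += 1
            values.append(content[start:i])
    if n is not None and n != len(values):
        raise ValueError("n=%d but found %d values" % (n, len(values)))
    if value_type == "int":
        return [int(v) for v in values]
    if value_type == "real":
        return [float(v) for v in values]
    return values


# The two arrays from the example above.
ints = parse_array(" 1 22 3\n", n=3, value_type="int")
strings = parse_array(r'ab  "a b"   "with \"quotes\" "', n=3, value_type="string")
```

A mismatch between n and the number of values raises ValueError, which corresponds to the document being invalid.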

If there is a value for the length attribute n, then the number of entries in the content must match this value; otherwise the PMML document is not valid. Similar to the entities for different types of numbers, we define entities for arrays which should have a specific content type. Again, each of these entities maps to the same XML element, Array.


  <xs:group name='NUM-ARRAY'>
    <xs:choice>
      <xs:element ref='Array'/>
    </xs:choice>
  </xs:group>

A NUM-ARRAY is an array of numbers.

The following entities define arrays which contain integers, reals, or strings.


  <xs:group name='INT-ARRAY'>
    <xs:choice>
      <xs:element ref='Array'/>
    </xs:choice>
  </xs:group>

  <xs:group name='REAL-ARRAY'>
    <xs:choice>
      <xs:element ref='Array'/>
    </xs:choice>
  </xs:group>

  <xs:group name='STRING-ARRAY'>
    <xs:choice>
      <xs:element ref='Array'/>
    </xs:choice>
  </xs:group>
