PMML 4.0 - General Structure of a PMML Document
PMML uses XML to represent mining models. The structure of the models is described by an XML Schema. One or more mining models can be contained in a PMML document. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:
The namespaces in the PMML Schema itself are defined as:
Note that because of the namespace declaration in its current form, PMML cannot be mixed with content of a different namespace.
Although a PMML document must be valid with respect to the PMML XSD, a document must not require a validating parser, which would load external entities. In addition to being a valid XML document, a valid PMML document must obey a number of further rules which are described at various places in the PMML specification. See also the conformance rules for valid PMML documents, producers, and consumers.
The root element of a PMML document must have type PMML.
A PMML document can contain more than one model. If the application system provides a means of selecting models by name and if the PMML consumer specifies a model name, then that model is used; otherwise the first model is used.
A PMML compliant system is not required to provide model selection by name.
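The selection rule above can be sketched in a few lines. This is only an illustration, not a reference implementation. The namespace URI, the deliberately incomplete sample document, and the helper names `list_models`, `select_model` and `local_name` are assumptions of this sketch. It also assumes that model elements can be recognised by their functionName attribute, and it raises an error when a requested name is missing, a case the text does not cover.

```python
import xml.etree.ElementTree as ET

# Assumption: PMML 4.0 namespace URI (the declaration is not reproduced in this text).
PMML_NS = "http://www.dmg.org/PMML-4_0"

# Hypothetical, deliberately incomplete document: the models have no content.
SAMPLE = f"""<PMML version="4.0" xmlns="{PMML_NS}">
  <Header/>
  <DataDictionary numberOfFields="0"/>
  <TreeModel modelName="first" functionName="classification"/>
  <RegressionModel modelName="second" functionName="regression"/>
</PMML>"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return element.tag.rsplit("}", 1)[-1]


def list_models(root):
    """Return the top-level children that look like models.

    Heuristic of this sketch: every model carries a functionName attribute,
    while Header, DataDictionary and similar elements do not.
    """
    return [child for child in root if "functionName" in child.attrib]


def select_model(root, model_name=None):
    """Pick a model by name if a name is given, otherwise the first model."""
    models = list_models(root)
    if not models:
        return None  # a document may carry no model at all
    if model_name is None:
        return models[0]
    for model in models:
        if model.get("modelName") == model_name:
            return model
    # Design choice of this sketch: an unknown name is reported as an error.
    raise LookupError(f"no model named {model_name!r}")


root = ET.fromstring(SAMPLE)
print(local_name(select_model(root)))            # first model in document order
print(local_name(select_model(root, "second")))  # model selected by name
```

A consumer that offers no selection by name would simply always call `select_model(root)`.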
The list of mining models in a PMML document may even be empty. The document can be used to carry the initial metadata before an actual model is computed. A PMML document containing no model is not meant to be useful for a PMML consumer.
For PMML 4.0 the attribute version must have the value 4.0.
The element MiningBuildTask can contain any XML value describing the configuration of the training run that produced the model instance. This information is not directly needed by a PMML consumer, but in many cases it is helpful for maintenance and visualization of the model. PMML does not define the content structure of MiningBuildTask. However, this element would be the natural container for task specifications as defined by other mining standards, e.g., in SQL or Java.
The fields in the DataDictionary and in the TransformationDictionary taken together are identified by unique names. Other elements in the models can refer to these fields by name. Multiple models in one PMML document can share the same fields in the TransformationDictionary. Nevertheless, a model can also define its 'own' derived fields in the element LocalTransformations. Furthermore, various models use DerivedField elements directly in the definition of the model. For example, DerivedFields appear inline in the input layer of neural networks.
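A consumer might check the uniqueness requirement along the following lines. This is a minimal sketch. The sample fragment is hypothetical and incomplete (no namespace, and derived fields without expressions), and the function name `duplicate_field_names` is an assumption of this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; "age" is deliberately defined twice.
SAMPLE = """<PMML version="4.0">
  <DataDictionary>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="logIncome" optype="continuous" dataType="double"/>
    <DerivedField name="age" optype="continuous" dataType="double"/>
  </TransformationDictionary>
</PMML>"""


def duplicate_field_names(root):
    """Return the names used more than once across both dictionaries."""
    names = []
    for dictionary, field in (
        ("DataDictionary", "DataField"),
        ("TransformationDictionary", "DerivedField"),
    ):
        for element in root.findall(f"./{dictionary}/{field}"):
            names.append(element.get("name"))
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


print(duplicate_field_names(ET.fromstring(SAMPLE)))  # ['age']
```

Fields defined in LocalTransformations or inline in a model are not covered by this check.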
Certain types of PMML models such as neural networks or logistic regression can be used for different purposes. That is, some instances implement prediction of numeric values, while others can be used for classification. Therefore, PMML defines five different mining functions. Each model has an attribute functionName which specifies the mining function.
For all PMML models, the structure of the top-level model element is similar to the ExampleModel template.
A non-empty list of mining fields defines a mining schema.
The output element gives a list of result values and internal results such as confidences or probabilities that can be computed by the model. The univariate statistics contain global statistics on (a subset of the) mining fields. The targets section holds more information on the target values and accompanying information like prior probabilities, optypes and the like. LocalTransformations holds derived fields that are local to the model. Other model specific elements follow after that, in the content of ExampleModel. Finally, the ModelVerification part gives sample data and results of the model so consumers can instantly validate.
modelName: the value in modelName identifies the model with a unique name in the context of the PMML file. This attribute is not required. Consumers of PMML models are free to manage the names of the models at their discretion.
functionName and algorithmName describe the kind of mining model, e.g., whether it is intended to be used for clustering or for classification. The algorithm name is free-form text and can be any description of the specific algorithm that produced the model. This attribute is for information only.
Although rare, it is possible for a classification model to identify more than one "winning" outcome. In these instances, PMML does not define a tie-breaking procedure but recommends that the category appearing first in the predictor's DataField be selected.
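The recommended tie-break can be sketched as follows. The function name `pick_winner` and its two arguments are assumptions of this example. `scores` maps each category to the score the model assigned it, and `category_order` lists the categories in the order they appear in the DataField. Ties are detected by exact equality, which a real consumer may want to relax for floating-point scores.

```python
def pick_winner(scores, category_order):
    """Return the best-scoring category; ties go to the one listed first."""
    best = max(scores.values())
    for category in category_order:
        if scores.get(category) == best:
            return category
    raise ValueError("no scored category appears in category_order")


# "yes" and "no" tie; "no" is listed first in the (hypothetical) DataField.
print(pick_winner({"yes": 0.4, "no": 0.4, "maybe": 0.2}, ["no", "yes", "maybe"]))  # no
```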
The naming conventions for PMML are:
The PMML schema contains a mechanism for extending the content of a model. Extension elements should be present as the first child in all elements and groups defined in PMML. This way it is possible to place information in the Extension elements which affects how the remaining entries are treated. The main element in each model should have Extension elements as the first and the last child for maximum flexibility.
If a document uses local namespaces, then the name of the namespace should not start with PMML or DMG or any variant of these names with lowercase characters. They are reserved for future use in PMML.
Up to PMML 2.1, extension attributes could be added to all elements in PMML if the prefix x- was used. This mechanism is deprecated; extension elements should be used instead. PMML documents with extension attributes using the old convention are still considered to be valid PMML. However, note that PMML documents containing old-style x- extension attributes will not validate against the XML Schema, but one can use an XSL transformation to remove all x- extension attributes and obtain an XML document that will validate.
Examples
An extension attribute format can be added to a DataField:
An extension element DataFieldSource can be added to a DataField in the PCDATA section like this:
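The text suggests an XSL transformation for removing old-style x- attributes before validation. The same clean-up can also be sketched in Python, which is a substitute technique and not the one named in the text. The sample fragment, its x-format value, and the function name `strip_x_attributes` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment carrying one old-style extension attribute.
SAMPLE = """<DataDictionary numberOfFields="1">
  <DataField name="foo" optype="continuous" dataType="double" x-format="%5.2f"/>
</DataDictionary>"""


def strip_x_attributes(root):
    """Remove every attribute whose name starts with 'x-'; return the count."""
    removed = 0
    for element in root.iter():
        # Collect the names first so the dict is not modified while iterating.
        for name in [attr for attr in element.attrib if attr.startswith("x-")]:
            del element.attrib[name]
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
print(strip_x_attributes(root))  # 1
print(ET.tostring(root, encoding="unicode"))
```

Information carried by the removed attributes is lost, so a producer that still needs it would move it into Extension elements instead.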
Basic data types and entities
NUMBER is commonly used for distinguishing numeric values from other data. Numbers may have a leading sign, fractions, and an exponent. The type float in XML Schema supports numbers represented as INF, -INF, and NaN. These tokens are not allowed for NUMBER. In addition to NUMBER there are a couple of more specific types, which act like subtypes of NUMBER:
An INT-NUMBER must be an integer, no fractions or exponent.
A REAL-NUMBER can be any number covered by the C/C++ types float, long or double. Scientific notation, e.g., 1.23e4, is allowed. Literals INF, -INF, and NaN are not supported.
PMML uses the character '.' as decimal point in the representation of REAL-NUMBER values.
A PROB-NUMBER is a REAL-NUMBER between 0.0 and 1.0, usually describing a probability.
A PERCENTAGE-NUMBER is a REAL-NUMBER between 0.0 and 100.0.
Note that these entities do not force the XML parser to check the data types. However, they still define requirements for a valid PMML document.
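Because the parser does not enforce these types, a consumer has to check them itself. The following is a minimal sketch of such checks. The exact lexical forms accepted (for example ".5" or "1.") and the inclusive range bounds are assumptions of this sketch, and the function names are hypothetical.

```python
import re

# Optional sign, digits with an optional fraction, optional exponent.
# INF, -INF and NaN do not match, as the text requires.
_REAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?")
_INT = re.compile(r"[+-]?[0-9]+")


def is_real_number(text):
    return _REAL.fullmatch(text) is not None


def is_int_number(text):
    # No fraction and no exponent allowed.
    return _INT.fullmatch(text) is not None


def is_prob_number(text):
    # Assumption: both bounds are inclusive.
    return is_real_number(text) and 0.0 <= float(text) <= 1.0


def is_percentage_number(text):
    # Assumption: both bounds are inclusive.
    return is_real_number(text) and 0.0 <= float(text) <= 100.0


for token in ("1.23e4", "-7", "INF", "NaN", "1,5"):
    print(token, is_real_number(token))
```

Note that "1,5" is rejected because '.' is the only decimal point PMML accepts.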
Many elements contain references to input fields. PMML does not use IDREF to represent field names because field names are not necessarily valid XML identifiers.
However, given the definition
Instances of mining models often contain sets with a large number of values. The type Array is defined as a container structure which implements arrays of numbers and strings in a fairly compact way.
The content of Array is a blank-separated sequence of values; multiple consecutive blanks are equivalent to a single blank. The attribute n determines the number of elements in the sequence. If n is given, it must match the number of values in the content; otherwise the PMML document is invalid. The attribute type is required since parsing an array is simpler if the type of the values in the content is specified up-front. This is particularly true for SAX-based parsing. In many cases the type of the values is known from the context where the Array appears. But there are also cases where Arrays can be mixed, e.g., in the statistics elements. String values may be enclosed within double quotes ", which are not considered to be part of the value. If a string value contains the double quote character ", then it must be escaped by a backslash character \ (the same escaping mechanism as used in C/C++).
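The parsing rules above can be sketched as a small tokenizer. This is an illustration only. The function name `parse_array` is hypothetical, the type names "int", "real" and "string" are assumed values of the type attribute, and only the backslash-quote escape described in the text is handled.

```python
def parse_array(content, array_type, n=None):
    """Split Array content into values and convert them according to array_type."""
    tokens = []
    i, length = 0, len(content)
    while i < length:
        if content[i].isspace():
            i += 1  # any run of blanks separates values
            continue
        quoted = content[i] == '"'
        if quoted:
            i += 1  # skip the opening quote; it is not part of the value
        token = []
        closed = not quoted
        while i < length:
            ch = content[i]
            if ch == "\\" and i + 1 < length and content[i + 1] == '"':
                token.append('"')  # escaped quote becomes a literal quote
                i += 2
            elif quoted and ch == '"':
                i += 1  # closing quote ends the value
                closed = True
                break
            elif not quoted and ch.isspace():
                break
            else:
                token.append(ch)
                i += 1
        if not closed:
            raise ValueError("unterminated quoted string")
        tokens.append("".join(token))

    if n is not None and n != len(tokens):
        raise ValueError(f"n={n} but content holds {len(tokens)} values")

    convert = {"int": int, "real": float, "string": str}[array_type]
    return [convert(token) for token in tokens]


print(parse_array("1   22 3", "int", n=3))
print(parse_array('ab  "a b"   "with \\"quotes\\" "', "string", n=3))
```

Passing `n` is optional in this sketch, mirroring the rule that a mismatch only matters when n is given.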
Similar to the entities for different types of numbers we define entities for arrays which should have a specific content type. Again, these entities just map to a single XML
A special case of arrays are sparse arrays which only store elements with non-zero values.
The array 0 3 0 0 42 0 0 can be written like this:
The array 0 0 0 0 0 0 can be written like this:
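The idea behind the sparse form of these two arrays can be sketched as a pair of conversion functions. The function names are hypothetical, and the sketch assumes that positions are counted from 1 and that the total length n is stored alongside the non-zero entries.

```python
def to_sparse(dense):
    """Return (n, indices, entries), keeping only the non-zero elements."""
    indices = [i for i, value in enumerate(dense, start=1) if value != 0]
    entries = [dense[i - 1] for i in indices]
    return len(dense), indices, entries


def to_dense(n, indices, entries, default=0):
    """Rebuild the full array; positions not listed get the default value."""
    if len(indices) != len(entries):
        raise ValueError("indices and entries differ in length")
    dense = [default] * n
    for i, value in zip(indices, entries):
        if not 1 <= i <= n:
            raise ValueError(f"index {i} outside 1..{n}")
        dense[i - 1] = value
    return dense


print(to_sparse([0, 3, 0, 0, 42, 0, 0]))  # (7, [2, 5], [3, 42])
print(to_sparse([0, 0, 0, 0, 0, 0]))      # (6, [], [])
```

An all-zero array therefore needs nothing but its length.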
In order to save space, a matrix can be stored as a diagonal or even sparse matrix.
The matrix is internally represented as a sequence of Arrays or MatCells. If arrays are used, then each array contains elements of one row in the matrix.
The actual representation is triggered by the kind of the Matrix:
Evaluating a matrix element M(i,j) proceeds as follows: