PMML 4.3 - General Structure

PMML uses XML to represent mining models. The structure of the models is described by an XML Schema. One or more mining models can be contained in a PMML document. A PMML document is an XML document with a root element of type PMML. The general structure of a PMML document is:

<?xml version="1.0"?>
<PMML version="4.3"
  xmlns="https://www.dmg.org/PMML-4_3"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Header copyright="Example.com"/>
  <DataDictionary> ... </DataDictionary>
  ... a model ...
</PMML>

The namespaces in the PMML Schema itself are defined as:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="https://www.dmg.org/PMML-4_3"
  xmlns="https://www.dmg.org/PMML-4_3"
  elementFormDefault="unqualified">

Note that because of the namespace declaration in its current form, PMML cannot be mixed with content of a different namespace.

Although a PMML document must be valid with respect to the PMML XSD, a document must not require a validating parser, which would load external entities. In addition to being a valid XML document, a valid PMML document must obey a number of further rules which are described at various places in the PMML specification. See also the conformance rules for valid PMML documents, producers, and consumers.

The root element of a PMML document must have type PMML.

<xs:element name="PMML">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Header"/>
      <xs:element ref="MiningBuildTask" minOccurs="0"/>
      <xs:element ref="DataDictionary"/>
      <xs:element ref="TransformationDictionary" minOccurs="0"/>
      <xs:sequence minOccurs="0" maxOccurs="unbounded">
        <xs:group ref="MODEL-ELEMENT"/>
      </xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="version" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>

<xs:group name="MODEL-ELEMENT">
  <xs:choice>
    <xs:element ref="AssociationModel"/>
    <xs:element ref="BayesianNetworkModel"/>
    <xs:element ref="BaselineModel"/>
    <xs:element ref="ClusteringModel"/>
    <xs:element ref="GaussianProcessModel"/>
    <xs:element ref="GeneralRegressionModel"/>
    <xs:element ref="MiningModel"/>
    <xs:element ref="NaiveBayesModel"/>
    <xs:element ref="NearestNeighborModel"/>
    <xs:element ref="NeuralNetwork"/>
    <xs:element ref="RegressionModel"/>
    <xs:element ref="RuleSetModel"/>
    <xs:element ref="SequenceModel"/>
    <xs:element ref="Scorecard"/>
    <xs:element ref="SupportVectorMachineModel"/>
    <xs:element ref="TextModel"/>
    <xs:element ref="TimeSeriesModel"/>
    <xs:element ref="TreeModel"/>
  </xs:choice>
</xs:group>

A PMML document can contain more than one model. If the application system provides a means of selecting models by name and if the PMML consumer specifies a model name, then that model is used; otherwise the first model is used. A PMML-compliant system is not required to provide model selection by name.

The list of mining models in a PMML document may even be empty. The document can be used to carry the initial metadata before an actual model is computed. A PMML document containing no model is not meant to be useful for a PMML consumer.

For PMML 4.3 the attribute version must have the value 4.3.

The element MiningBuildTask can contain any XML value describing the configuration of the training run that produced the model instance.
This information is not directly needed by a PMML consumer, but in many cases it is helpful for maintenance and visualization of the model. The particular content structure of MiningBuildTask is not defined by PMML. However, this element would be the natural container for task specifications as defined by other mining standards, e.g., in SQL or Java.

<xs:element name="MiningBuildTask">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

In general, field names in PMML should be unique. Avoiding name duplication is good practice since it makes life easier for consumers; moreover, with few exceptions, certain field names cannot be duplicated under any circumstances (e.g., DerivedFields in the TransformationDictionary). For more information on field names, see Scope of Fields.

Certain types of PMML models such as neural networks or logistic regression can be used for different purposes. That is, some instances implement prediction of numeric values, while others can be used for classification. Therefore, PMML defines seven different mining functions. Each model has an attribute functionName which specifies the mining function.

<xs:simpleType name="MINING-FUNCTION">
  <xs:restriction base="xs:string">
    <xs:enumeration value="associationRules"/>
    <xs:enumeration value="sequences"/>
    <xs:enumeration value="classification"/>
    <xs:enumeration value="regression"/>
    <xs:enumeration value="clustering"/>
    <xs:enumeration value="timeSeries"/>
    <xs:enumeration value="mixed"/>
  </xs:restriction>
</xs:simpleType>

For all PMML models the structure of the top-level model element is similar to the template of ExampleModel as below:

<xs:element name="ExampleModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="MiningSchema"/>
      <xs:element ref="Output" minOccurs="0"/>
      <xs:element ref="ModelStats" minOccurs="0"/>
      <xs:element ref="Targets" minOccurs="0"/>
      <xs:element ref="LocalTransformations" minOccurs="0" />
      ...
      <xs:element ref="ModelVerification" minOccurs="0"/>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="modelName" type="xs:string" use="optional"/>
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required"/>
    <xs:attribute name="algorithmName" type="xs:string" use="optional"/>
  </xs:complexType>
</xs:element>

A non-empty list of mining fields defines a mining schema. The output
element gives a list of result values and internal results such as
confidences or probabilities that can be computed by the model. The
univariate statistics contain global statistics on (a subset of the) mining
fields. The targets section holds more information on the target values and
accompanying information like prior probabilities, optypes and the like.
LocalTransformations holds derived fields that are local to the model.
Other model specific elements follow after that, in the content of
ExampleModel. Finally, the ModelVerification part gives sample data and
results of the model so consumers can instantly validate it.

modelName: the value in modelName identifies the model with a unique name in the context of the PMML file. This attribute is not required. Consumers of PMML models are free to manage the names of the models at their discretion.

functionName and algorithmName describe the kind of mining model, e.g., whether it is intended to be used for clustering or for classification. The algorithm name is free-form and can be any description for the specific algorithm that produced the model. This attribute is for information only.

Ties

Although rare, it is possible for classification models to identify more than one "winning" outcome. In these instances, PMML doesn't define a tie-breaking procedure but recommends that the category appearing first in the predictor's DataField be selected.

Naming Conventions

The naming conventions for PMML are:

Extension Mechanism

The PMML schema contains a mechanism for extending the content of a model. Extension elements should be present as the first child in all elements and groups defined in PMML. This way it is possible to place information in the Extension elements which affects how the remaining entries are treated. The main element in each model should have Extension elements as the first and the last child for maximum flexibility.

<xs:element name="Extension">
  <xs:complexType>
    <xs:complexContent mixed="true">
      <xs:restriction base="xs:anyType">
        <xs:sequence>
          <xs:any processContents="skip" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute name="extender" type="xs:string" use="optional"/>
        <xs:attribute name="name" type="xs:string" use="optional"/>
        <xs:attribute name="value" type="xs:string" use="optional"/>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:element>

These extension elements have a content model of ANY, where
vendor specific extension elements can be included. However, element types
must start with X-. This convention helps to avoid conflicts with
possible future extensions to standard PMML. If a document uses local namespaces, then the name of the namespace should not start with PMML or DMG or any variant of these names with lowercase characters. They are reserved for future use in PMML.

Up to PMML 2.1, extension attributes could be added to all elements in PMML if the prefix x- was used. This mechanism is deprecated; extension elements should be used instead. PMML documents with extension attributes using the old convention are still considered to be valid PMML. However, note that PMML documents containing old-style x- extension attributes will not validate in XML Schema, but one can use an XSL transformation to remove all x- extension attributes and obtain an XML document that will validate.

Examples

An extension attribute format can be added to a DataField like this:

<DataField name="foo" dataType="double" optype="continuous">
  <Extension name="format" value="%9.2f"/>
</DataField>

An extension element DataFieldSource can be added to a DataField in the PCDATA section like this:

<DataField name="foo" dataType="double" optype="continuous">
  <Extension>
    <DataFieldSource sourceKnown="yes">
      <Source>derivedFromInput</Source>
    </DataFieldSource>
  </Extension>
</DataField>

Basic data types and entities

The definition

<xs:simpleType name="NUMBER">
  <xs:restriction base="xs:double">
  </xs:restriction>
</xs:simpleType>

is commonly used for distinguishing numeric values from other data. Numbers may have a leading sign, fractions, and an exponent. The type float in XML Schema supports numbers represented as INF, -INF, and NaN. These tokens are not allowed for NUMBER.

In addition to NUMBER there are a couple of more specific types, which are like subtypes of NUMBER:

<xs:simpleType name="INT-NUMBER">
  <xs:restriction base="xs:integer">
  </xs:restriction>
</xs:simpleType>

An INT-NUMBER must be an integer, with no fraction or exponent.
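
Because the schema types themselves do not reject INF, -INF, or NaN, a consumer may want to check the lexical form of NUMBER and INT-NUMBER values itself. The following is a minimal sketch of such a check, assuming the lexical rules stated in the prose above (optional sign, optional fraction, optional exponent). The function names `is_number` and `is_int_number` are hypothetical and not part of PMML.

```python
import re

# Assumed lexical form of NUMBER: optional sign, digits with an optional
# fraction, optional exponent. INF, -INF and NaN never match this pattern.
_NUMBER_RE = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")

# Assumed lexical form of INT-NUMBER: optional sign followed by digits only.
_INT_NUMBER_RE = re.compile(r"[+-]?\d+")


def is_number(text: str) -> bool:
    """Return True if text looks like a PMML NUMBER (hypothetical helper)."""
    return _NUMBER_RE.fullmatch(text.strip()) is not None


def is_int_number(text: str) -> bool:
    """Return True if text looks like a PMML INT-NUMBER (hypothetical helper)."""
    return _INT_NUMBER_RE.fullmatch(text.strip()) is not None
```

For example, `is_number("1.23e4")` accepts scientific notation, while `is_number("NaN")` and `is_int_number("4.0")` are both rejected.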

<xs:simpleType name="REAL-NUMBER">
  <xs:restriction base="xs:double">
  </xs:restriction>
</xs:simpleType>

A REAL-NUMBER can be any number covered by the C/C++ types float, long or double. Scientific notation, e.g., 1.23e4, is allowed. Literals INF, -INF, and NaN are not supported. PMML uses the character '.' as decimal point in the representation of REAL-NUMBER values.

<xs:simpleType name="PROB-NUMBER">
  <xs:restriction base="xs:double">
  </xs:restriction>
</xs:simpleType>

A PROB-NUMBER is a REAL-NUMBER between 0.0 and 1.0, usually describing a probability.

<xs:simpleType name="PERCENTAGE-NUMBER">
  <xs:restriction base="xs:double">
  </xs:restriction>
</xs:simpleType>

A PERCENTAGE-NUMBER is a REAL-NUMBER between 0.0 and 100.0.

Note that these entities do not force the XML parser to check the data types. However, they still define requirements for a valid PMML document.

Many elements contain references to input fields. PMML does not use IDREF to represent field names because field names are not necessarily valid XML identifiers. However, given the definition

<xs:simpleType name="FIELD-NAME">
  <xs:restriction base="xs:string">
  </xs:restriction>
</xs:simpleType>

then references to input fields will be obvious from the schema syntax. Note that a model can refer to two kinds of input fields. One is the set of MiningFields in the MiningSchema. The others are the DerivedFields as defined in TransformationDictionary or LocalTransformations. Further note that field names, like all other elements of PMML and in XML in general, are case sensitive.

Plain Arrays of Values

Instances of mining models often contain sets with a large number of values. The type Array is defined as a container structure which implements arrays of numbers and strings in a fairly compact way.

<xs:complexType name="ArrayType" mixed="true">
  <xs:attribute name="n" type="INT-NUMBER" use="optional"/>
  <xs:attribute name="type" use="required">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="int"/>
        <xs:enumeration value="real"/>
        <xs:enumeration value="string"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>
</xs:complexType>

<xs:element name="Array" type="ArrayType"/>

The content of Array is a blank-separated sequence of values; multiple blanks are as good as one blank. The attribute n determines the number of elements in the sequence. If n is given it must match the number of values in the content, otherwise the PMML document is invalid. The attribute type is required since parsing an array is simpler if the type of the values in the content is specified up-front. This is particularly true for SAX-based parsing. In many cases the type of the values is known from the context where the Array appears. But there are also cases where Arrays can be mixed, e.g., in the statistics elements.

String values may be enclosed within double quotes ", which are not considered to be part of the value. If a string value contains the double quote character ", then it must be escaped by a backslash character \ (that is the same escaping mechanism as used in C/C++).

Example:

<Array n="3" type="int">1 22 3</Array>
<Array n="3" type="string">ab "a b" "with \"quotes\" "</Array>

The second array contains the three strings 'ab', 'a b', and 'with "quotes" '.

Similar to the entities for different types of numbers we define entities for arrays which should have a specific content type. Again, these entities just map to a single XML markup.
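
The content rules for Array given above (blank-separated values, optional double quotes around strings, backslash-escaped quotes, and the check against n) can be sketched in Python as follows. This is only an illustrative sketch: the helper names `split_array_content` and `parse_array` are hypothetical, and the handling of malformed input (for example an unterminated quote) is not specified by the text and is left out.

```python
import xml.etree.ElementTree as ET


def split_array_content(text: str) -> list[str]:
    """Split Array content on blanks, honouring double quotes and backslash-escaped quotes."""
    values, current = [], []
    in_quotes, has_token = False, False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] == '"':
            current.append('"')        # an escaped quote is part of the value
            has_token = True
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes  # enclosing quotes are not part of the value
            has_token = True           # so that "" yields an empty string
        elif ch.isspace() and not in_quotes:
            if has_token:              # runs of blanks count as one separator
                values.append("".join(current))
            current, has_token = [], False
        else:
            current.append(ch)
            has_token = True
        i += 1
    if has_token:
        values.append("".join(current))
    return values


def parse_array(xml_text: str) -> list:
    """Parse one <Array> element into a list of int, float or str values."""
    elem = ET.fromstring(xml_text)
    converters = {"int": int, "real": float, "string": str}
    convert = converters[elem.get("type")]   # the type attribute is required
    values = [convert(v) for v in split_array_content(elem.text or "")]
    n = elem.get("n")
    if n is not None and int(n) != len(values):
        # The text above states that such a document is invalid.
        raise ValueError(f"n={n} does not match {len(values)} values")
    return values
```

Applied to the two example arrays, `parse_array` yields the integers 1, 22, 3 and the three strings listed above.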

<xs:group name="NUM-ARRAY">
  <xs:choice>
    <xs:element ref="Array"/>
  </xs:choice>
</xs:group>

<xs:group name="INT-ARRAY">
  <xs:choice>
    <xs:element ref="Array"/>
  </xs:choice>
</xs:group>

<xs:group name="REAL-ARRAY">
  <xs:choice>
    <xs:element ref="Array"/>
  </xs:choice>
</xs:group>

<xs:group name="STRING-ARRAY">
  <xs:choice>
    <xs:element ref="Array"/>
  </xs:choice>
</xs:group>

A NUM-ARRAY is an array of numbers. The other entities define arrays which contain integers, reals or strings.

Sparse Arrays of Values

A special case of arrays are sparse arrays which only store elements with non-zero values.

<xs:element name="INT-SparseArray">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Indices" minOccurs="0"/>
      <xs:element ref="INT-Entries" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="n" type="INT-NUMBER" use="optional"/>
    <xs:attribute name="defaultValue" type="INT-NUMBER" use="optional" default="0"/>
  </xs:complexType>
</xs:element>

<xs:element name="REAL-SparseArray">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Indices" minOccurs="0"/>
      <xs:element ref="REAL-Entries" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="n" type="INT-NUMBER" use="optional"/>
    <xs:attribute name="defaultValue" type="REAL-NUMBER" use="optional" default="0"/>
  </xs:complexType>
</xs:element>

<xs:element name="Indices">
  <xs:simpleType>
    <xs:list itemType="xs:int"/>
  </xs:simpleType>
</xs:element>

<xs:element name="INT-Entries">
  <xs:simpleType>
    <xs:list itemType="xs:int"/>
  </xs:simpleType>
</xs:element>

<xs:element name="REAL-Entries">
  <xs:simpleType>
    <xs:list itemType="xs:double"/>
  </xs:simpleType>
</xs:element>

The attribute n specifies the length of the sparse array,
which is especially useful in case the last entries are not explicitly
specified. defaultValue can be used to specify an arbitrary default
value for all positions which are not specified by the two arrays.

Examples:

The array 0 3 0 0 42 0 0 can be written like this:

<INT-SparseArray n="7">
  <Indices>2 5</Indices>
  <INT-Entries>3 42</INT-Entries>
</INT-SparseArray>

The array 0 0 0 0 0 0 0 can be written like this:

<INT-SparseArray n="7"/>

Matrix

In order to save space, a matrix can be stored as a diagonal or even sparse matrix.

<xs:element name="Matrix">
  <xs:complexType>
    <xs:choice minOccurs="0">
      <xs:group ref="NUM-ARRAY" maxOccurs="unbounded"/>
      <xs:element ref="MatCell" maxOccurs="unbounded"/>
    </xs:choice>
    <xs:attribute name="kind" use="optional" default="any">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="diagonal"/>
          <xs:enumeration value="symmetric"/>
          <xs:enumeration value="any"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="nbRows" type="INT-NUMBER" use="optional"/>
    <xs:attribute name="nbCols" type="INT-NUMBER" use="optional"/>
    <xs:attribute name="diagDefault" type="REAL-NUMBER" use="optional"/>
    <xs:attribute name="offDiagDefault" type="REAL-NUMBER" use="optional"/>
  </xs:complexType>
</xs:element>

<xs:element name="MatCell">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="row" type="INT-NUMBER" use="required"/>
        <xs:attribute name="col" type="INT-NUMBER" use="required"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

The matrix is internally represented as a sequence of Arrays or
MatCells. If arrays are used, then each array contains elements of
one row in the matrix. The actual representation is triggered by the kind of the Matrix:
Evaluating a matrix element M(i,j) proceeds as follows:
The matrix

0 0 0 42 0
0 1 0  0 0
5 0 0  0 0
0 0 0  0 7
0 0 9  0 0

can be written with one Array per row:

<Matrix nbRows="5" nbCols="5">
  <Array type="real">0 0 0 42 0</Array>
  <Array type="real">0 1 0 0 0</Array>
  <Array type="real">5 0 0 0 0</Array>
  <Array type="real">0 0 0 0 7</Array>
  <Array type="real">0 0 9 0 0</Array>
</Matrix>

or as a sparse matrix using MatCell elements:

<Matrix diagDefault="0" offDiagDefault="0">
  <MatCell row="1" col="4">42</MatCell>
  <MatCell row="2" col="2">1</MatCell>
  <MatCell row="3" col="1">5</MatCell>
  <MatCell row="4" col="5">7</MatCell>
  <MatCell row="5" col="3">9</MatCell>
</Matrix>

Non-Scoring Models

Finding a good data mining model is often a process of trial and error. It is not unusual for a data mining algorithm to fail in its attempt to generate a model that is worthy of deployment. This is especially true during the early exploratory phase of the process, when a wide variety of variables are iteratively tested in search of the handful of features in the data that can be exploited to meet a specific goal. Or, more fundamentally, most data mining algorithms have requirements that must be met in order to operate properly. If, say, there is an insufficient amount of data or if there is a problem within the data, the algorithm may not produce a model at all. Alternatively, many data mining tools include features that will automatically eliminate variables that do not meet certain criteria or enforce a minimum model quality requirement before allowing a model to be deployed.

PMML includes many features that help users understand the quality of their models, including Statistics and Model Explanation. These descriptive elements are useful for valid models and they can be even more valuable when trying to understand a failed modeling attempt. Ironic as this may seem, there is value in PMML's ability to represent both good and bad models, especially in systems where PMML is the only interface between the module generating the model and the module consuming it.
But this also requires that the consumer can tell the difference between PMML that contains a valid model and PMML that should not be used for scoring.

For example, consider the case where all the independent variables for a regression model failed to meet the minimum importance criteria. The usageType of the MiningField for each variable could be set to supplementary instead of active and the producer could include UnivariateStats about each MiningField, statistics that would provide valuable descriptive information about why that variable was eliminated. Alternatively, if the model did not meet some minimum criterion, the producer could include explanatory details in Model Explanation. In these cases, the producer generating valid PMML would generate a regression model with no independent variables and no intercept, a "y = 0" model. A consumer would have no way of knowing that it should not generate valid scores from such a model. And, if the consumer deployed this model for scoring, its users would have no way of knowing that the 0 scores should not be used.

While PMML does contain a MiningBuildTask element that can be used to describe the results of training, consumers are not required to process this element. In fact, prior to PMML 4.1, there was no way to produce syntactically valid PMML that did not contain a model, and there was no way to tell the consumer not to score that model.

Therefore, in PMML 4.1, an optional attribute isScorable was added to each PMML model element. If this attribute is true (which is the default if this attribute is absent), then the model should be processed normally. However, if the attribute is set to false, then the model producer has indicated that this model is intended for information purposes only and should not be used to generate results. Models with this attribute set to false are called "non-scoring" models. Producers who only generate models that are valid for scoring are unaffected by this change.
But producers that wish to generate PMML that contains a non-scoring model should set this attribute to false as a clear indication that the model is not intended for scoring.

Model consumers can choose not to deploy non-scoring models or deploy them only for visualization and not scoring. Alternatively, consumers that deploy a non-scoring model for scoring need to ensure that scoring always generates an invalid result. This should be the same result a model would generate if the model received an unhandled invalid input (an invalid value that is not handled by invalid value treatment, see MiningField for more information about invalidValueTreatment). By definition in PMML, any operation on an invalid input results in an invalid output. Similarly, any non-scoring model must only generate invalid results.

The PMML XSD contains required elements and attributes that must be present for the PMML to be valid, even for non-scoring models. Setting isScorable to false does not eliminate the need to meet XSD requirements in order for PMML to be considered valid. For example, each model element must contain a MiningSchema and there can be additional requirements for each model type (e.g., Regression models must have at least one RegressionTable, Trees must have one Node, etc.). For more details about the XSD requirements for non-scoring models, see the description of the isScorable attribute for each model type.
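
A consumer that wants to skip non-scoring models could check isScorable before deployment. The sketch below is one possible way to do this with Python's standard library. It assumes that the attribute uses the usual XML Schema boolean spellings ("false" or "0" for false), and it uses a hypothetical heuristic, namely the presence of the required functionName attribute, to recognize model elements among the children of the root. Neither assumption is stated in the text above.

```python
import xml.etree.ElementTree as ET


def is_scorable(model_elem: ET.Element) -> bool:
    """Return False only when isScorable is explicitly false; absent means true."""
    value = model_elem.get("isScorable", "true")
    # Assumption: xs:boolean lexical forms, where "false" and "0" mean false.
    return value.strip() not in ("false", "0")


def scorable_models(pmml_text: str) -> list[str]:
    """List the element names of top-level models that may be used for scoring."""
    root = ET.fromstring(pmml_text)
    names = []
    for child in root:
        tag = child.tag.split("}")[-1]           # drop the namespace, if any
        if child.get("functionName") is None:
            continue  # hypothetical heuristic: only model elements carry functionName
        if is_scorable(child):
            names.append(tag)
    return names
```

A consumer that deploys a non-scoring model anyway would still have to return an invalid result for every input, as described above. This sketch only covers the decision not to deploy.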