PMML 4.1 - Scope of Fields

In programming languages, scope is a context used to define the visibility and accessibility of variables in different parts of the program. This is especially important when the same variable name is used in different places within the program, so that name conflicts are resolved in ways that, generally, ensure module independence and avoid side-effects.

PMML variables, called fields, typically have a pre-established scope. Historically, name conflicts have rarely been a problem because the standard discourages the re-use of names. The concept first appeared in PMML version 2.0, when Transformations were introduced:

"PMML 2.0: DerivedField's in the TransformationDictionary together with DataField's in the DataDictionary must have unique names."

Since MiningFields shared the names of their DataFields, this statement eliminated the possibility of name conflicts. When PMML 3.0 introduced LocalTransformations, the statement was expanded so that name conflicts remained impossible:

"PMML 3.0: DerivedFields in the TransformationDictionary or LocalTransformations together with DataFields in the DataDictionary must have unique names."

PMML 3.0 also introduced an approach using ensembles of regression and tree sub-models called "Model Composition", with a new model type, MiningModel, that employed ResultFields for delivering results between sub-models. Scope was difficult to implement because the regression and tree sub-models are not proper model elements: they lack key features such as a MiningSchema. [Note: This document does not cover the Model Composition approach since its use has been deprecated in PMML 4.1.]

PMML 4.0 introduced a new type of MiningModel that used "Segmentation" for creating sub-models that could be organized as ensembles (a collection of sub-models used in parallel). These sub-models, called Segments, employed the same model elements used in the single model PMML documents and therefore allowed greater flexibility with respect to re-using field names:

"PMML 4.0: The fields in the DataDictionary and in the TransformationDictionary taken together are identified by unique names. Other elements in the models can refer to these fields by name. Multiple models on one PMML document can share the same fields in the TransformationDictionary. Nevertheless, a model can also define its 'own' derived fields in the element LocalTransformations."

Figure 1 shows the flow of data through a PMML 4.1 document, including multiple model elements (top-level and MiningModel sub-models).

Figure 1: PMML Scope Diagram

Just as programming languages have rules that keep variable names unambiguous across a program, PMML has rules that govern scope. But unlike general-purpose languages such as C and Java, in which a variable's scope follows from where and how it is declared (e.g., global or local), fields in PMML are fixed into collections with pre-defined scope:

The DataDictionary and TransformationDictionary are definitions that are visible to all model elements, so their scope is global in nature. Generally, the standard has always required that these fields have unique names.
Each model element (both top-level and sub-model) contains a required MiningSchema element and an optional LocalTransformations element. These elements process and manipulate data as it flows through the model element. While appearing local in nature, scope comes into play when there are model elements within other model elements, as is the case with MiningModels.
Model elements contain Outputs which externalize the variety of model results that can be produced from the model element. Outputs are global in nature since they are referenced from outside of the model element's scope.
Model elements also contain Targets and VerificationFields, which refer to existing fields rather than defining new ones.

The following five rules are key to understanding the scope of fields in a PMML document:

Forward referencing of fields is not allowed.
- All fields in PMML must be defined before they are referenced.
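
The ordering rule above can be checked mechanically. The following is a minimal sketch, not part of the standard: it assumes a namespace-free TransformationDictionary fragment with hypothetical field names, and it only follows FieldRef elements (other elements that reference fields by attribute are ignored for brevity).

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment used for illustration only.
# "ratio" refers to "scaledAge" before "scaledAge" is defined.
FRAGMENT = """
<TransformationDictionary>
  <DerivedField name="ratio" optype="continuous" dataType="double">
    <Apply function="/">
      <FieldRef field="income"/>
      <FieldRef field="scaledAge"/>
    </Apply>
  </DerivedField>
  <DerivedField name="scaledAge" optype="continuous" dataType="double">
    <Apply function="/">
      <FieldRef field="age"/>
      <Constant>100</Constant>
    </Apply>
  </DerivedField>
</TransformationDictionary>
"""


def forward_references(dictionary_xml, known_fields):
    """Return (derived_field, referenced_field) pairs that break definition order."""
    defined = set(known_fields)
    problems = []
    for derived in ET.fromstring(dictionary_xml).findall("DerivedField"):
        for ref in derived.iter("FieldRef"):
            if ref.get("field") not in defined:
                problems.append((derived.get("name"), ref.get("field")))
        # A field becomes referenceable only after its own definition.
        defined.add(derived.get("name"))
    return problems


problems = forward_references(FRAGMENT, known_fields={"income", "age"})
print(problems)  # [('ratio', 'scaledAge')]
```

Swapping the two DerivedField elements in the fragment would make the list empty, since "scaledAge" would then be defined before "ratio" refers to it.
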
DerivedFields in the TransformationDictionary are definitions only. Similar to a dictionary that contains the definition of words that appear elsewhere in documents, the TransformationDictionary contains the definition of transformations which are only instantiated when referenced elsewhere in the PMML document. This approach allows a DerivedField to be specified once but used in multiple model elements within the PMML document, which can reduce the size of PMML documents that contain multiple model elements.
- A DerivedField from the TransformationDictionary is instantiated in a model element when it is referenced in the model element, either directly or transitively through other DerivedFields.
  - Example of referencing directly and transitively: The TransformationDictionary contains DerivedField "B" which references a DerivedField "A" from earlier in the TransformationDictionary. If "B" is directly referenced in a model element and therefore becomes instantiated in that model element, then "A" is also instantiated in the model element because, while "A" may not be directly referenced, it is transitively referenced via "B".
- When they are instantiated inside a model element, DerivedFields from the TransformationDictionary are instantiated as if they appear before any other DerivedFields defined within the LocalTransformations element.
- When they are instantiated inside a model element, DerivedFields from the TransformationDictionary can only reference:
  1. MiningFields from the MiningSchema of that model element.
  2. Other DerivedFields which appear earlier in the TransformationDictionary.
- DerivedFields from the TransformationDictionary which are not referenced (directly or transitively) in a model element are not instantiated in that model element.
- Whether they are instantiated in a particular PMML document or not, all DerivedFields from the TransformationDictionary must be valid PMML. For example, DerivedFields in the TransformationDictionary that have circular references or apply undefined functions are not valid PMML, even if those DerivedFields are never instantiated in a model element.
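
Direct and transitive instantiation, as described in the bullets above, amounts to a reachability computation over the references between DerivedFields. The sketch below illustrates this under simplifying assumptions: the TransformationDictionary is modeled as a plain dict in document order, and the names `instantiated_fields`, "A", "B", "C" and "x" are hypothetical.

```python
def instantiated_fields(dictionary, referenced):
    """Names of TransformationDictionary fields instantiated in one model element.

    dictionary: dict mapping each DerivedField name to the list of field names
                it references, in document order.
    referenced: field names that the model element references directly.
    """
    needed = set()
    stack = [name for name in referenced if name in dictionary]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        # Follow references to other dictionary fields (transitive references).
        stack.extend(dep for dep in dictionary[name] if dep in dictionary)
    # Report in dictionary order, i.e. the order in which they are defined.
    return [name for name in dictionary if name in needed]


# Hypothetical dictionary: "B" references "A"; "C" is unrelated.
dictionary = {
    "A": ["x"],
    "B": ["A"],
    "C": ["x"],
}
print(instantiated_fields(dictionary, referenced={"B"}))  # ['A', 'B']
print(instantiated_fields(dictionary, referenced={"x"}))  # []
```

In the first call, "A" is instantiated although the model element never names it, and "C" is not instantiated at all; the sketch says nothing about validity, which the last bullet requires of every DerivedField regardless.
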
The MiningSchema is the "GateKeeper".
- Within a model element, all data must flow in through its MiningSchema.
- In this way, the MiningSchema is the boundary between model elements, a distinction that defines the scope of fields in the PMML document.
- Because MiningFields can treat values that are missing, invalid, or outliers, a value treated at the top level (i.e., one that does not pass through the MiningSchema "asIs") does not reach sub-models in its original form.
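
The gatekeeper behavior can be illustrated with a small simulation. This is a sketch under stated assumptions, not the standard's processing model: a MiningSchema is reduced to a hypothetical mapping from MiningField name to a missing-value replacement (None meaning the value passes "asIs"), a missing value is modeled as None, and invalid-value and outlier treatment are left out.

```python
def through_mining_schema(record, schema):
    """Return the record as seen inside a model element.

    schema: hypothetical mapping of MiningField name -> replacement for a
            missing value, or None to pass the value through unchanged.
    Fields that are not listed in the schema do not enter the model element.
    """
    seen = {}
    for name, replacement in schema.items():
        value = record.get(name)
        if value is None and replacement is not None:
            value = replacement
        seen[name] = value
    return seen


record = {"age": None, "income": 52000.0, "zip": "94110"}

# Top-level model: replaces a missing "age" with 40, passes "income" as is.
top_level = through_mining_schema(record, {"age": 40, "income": None})
# Sub-model: its MiningSchema draws on the parent's fields, not the raw record.
sub_model = through_mining_schema(top_level, {"age": None})

print(top_level)  # {'age': 40, 'income': 52000.0}
print(sub_model)  # {'age': 40}
```

The sub-model never observes that "age" was missing, and "zip" never enters either model element because no MiningField names it.
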
The MiningSchema of a particular model can refer to any fields available in the enclosing scope of its parent model.
- For top-level models, the MiningSchema can refer only to DataFields in the DataDictionary.
- For sub-models, the MiningSchema can refer to any fields defined within the parent model's scope, including:
  - The MiningFields of the parent model
  - DerivedFields defined in the LocalTransformations of the parent model
  - For a sequence of sub-models (MiningModels that use Segmentation's modelChain feature, introduced in PMML 4.1), OutputFields defined in Segments that appear above/earlier in the parent MiningModel. Note: The sub-models comprising a sequence are to be ordered in such a way that each is defined after any other sub-models on which it depends.
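
For a modelChain, the set of fields visible to each Segment grows as earlier Segments contribute their OutputFields. The following sketch illustrates only the three sources listed above; the classes `Segment` and `MiningModel` and all field names are hypothetical stand-ins, not PMML elements or any library's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for a Segment and the model it wraps."""
    mining_fields: List[str]
    output_fields: List[str] = field(default_factory=list)


@dataclass
class MiningModel:
    """Hypothetical stand-in for a MiningModel that uses modelChain."""
    mining_fields: List[str]
    local_derived_fields: List[str]
    segments: List[Segment]


def visible_to_segment(parent, index):
    """Field names that the MiningSchema of segment `index` may refer to."""
    visible = list(parent.mining_fields) + list(parent.local_derived_fields)
    for earlier in parent.segments[:index]:
        visible.extend(earlier.output_fields)
    return visible


def out_of_scope(parent):
    """(segment index, field name) pairs for MiningFields outside the enclosing scope."""
    return [
        (i, name)
        for i, seg in enumerate(parent.segments)
        for name in seg.mining_fields
        if name not in visible_to_segment(parent, i)
    ]


chain = MiningModel(
    mining_fields=["age", "income"],
    local_derived_fields=["logIncome"],
    segments=[
        # Refers to "score2", which is only produced by a later segment.
        Segment(mining_fields=["age", "score2"], output_fields=["score1"]),
        Segment(mining_fields=["logIncome", "score1"], output_fields=["score2"]),
    ],
)
print(visible_to_segment(chain, 1))  # ['age', 'income', 'logIncome', 'score1']
print(out_of_scope(chain))           # [(0, 'score2')]
```

The first segment's use of "score2" is flagged because the segment that produces it appears later in the chain, which is the ordering requirement stated in the note above.
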
In general, field names in PMML should be unique. Avoiding name duplication is a good practice since it makes life easier for consumers and, as outlined below, certain field names cannot be duplicated under any circumstances (e.g., DerivedFields in the TransformationDictionary). However, because of the nature of MiningModels, two reasonable exceptions can be made to allow duplicating field names:
1. DerivedField names in LocalTransformations can be duplicated across different sibling sub-models, provided the name differs from the name of every other field in its scope.
2. OutputField elements can be duplicated across different sibling sub-models, provided they are identical to each other (same name, same output feature, same data type, etc.).

In summary, these are the rules that govern the naming of fields in PMML:

The names of DataFields in the DataDictionary must differ from the names of all other DataFields and from the names of DerivedFields in the TransformationDictionary.
The names of MiningFields cannot be re-used within the same model element.
The names of DerivedFields in the TransformationDictionary must be unique across the entire PMML document.
The names of DerivedFields in the LocalTransformations must differ from all other names in their scope.
The names of OutputFields must differ from the names of all other fields in the PMML document, unless the OutputField element is identically duplicated across sibling sub-models.
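
Two of these naming rules lend themselves to a simple check. The sketch below is illustrative only and covers just the first and last rules; the function names are hypothetical, and an OutputField definition is reduced to a tuple of two attributes (feature, dataType), whereas "identical" in the rule covers the whole element.

```python
from collections import Counter


def duplicate_global_names(data_fields, dictionary_derived_fields):
    """Names used more than once across the DataDictionary and TransformationDictionary."""
    counts = Counter(list(data_fields) + list(dictionary_derived_fields))
    return sorted(name for name, n in counts.items() if n > 1)


def conflicting_sibling_outputs(sibling_outputs):
    """OutputField names that sibling sub-models reuse with differing definitions.

    sibling_outputs: one dict per sibling sub-model, mapping an OutputField
    name to a tuple of its attributes, here (feature, dataType).
    """
    seen = {}
    conflicts = set()
    for outputs in sibling_outputs:
        for name, definition in outputs.items():
            # setdefault stores the first definition and returns it thereafter.
            if seen.setdefault(name, definition) != definition:
                conflicts.add(name)
    return sorted(conflicts)


print(duplicate_global_names(["age", "income"], ["logIncome", "age"]))  # ['age']

siblings = [
    {"score": ("predictedValue", "double")},
    {"score": ("predictedValue", "double")},  # identical duplicate: allowed
    {"score": ("probability", "double")},     # same name, different feature
]
print(conflicting_sibling_outputs(siblings))  # ['score']
```
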
