PMML 4.1 - Scope of Fields
In programming languages, scope is a context used to define the visibility
and accessibility of variables in different parts of the program. This is
especially important when the same variable name is used in different places
within the program, so that name conflicts are resolved in ways that,
generally, ensure module independence and avoid side-effects.
PMML variables, called fields, typically have a pre-established scope.
Historically, name conflicts have rarely been a problem because the
standard discourages the re-use of names. The first mention of this concept
appeared in PMML version 2.0 when Transformations were first introduced,
"PMML 2.0: DerivedField's in the TransformationDictionary
together with DataField's in the DataDictionary must have unique
names."
Since MiningFields shared the name of their DataFields,
this statement eliminated the possibility of name conflicts. With the
introduction of LocalTransformations in PMML 3.0, this statement was
expanded to still prevent the possibility of name conflicts,
"PMML 3.0: DerivedFields in the TransformationDictionary or
LocalTransformations together with DataFields in the DataDictionary must
have unique names."
PMML 3.0 also introduced an approach using ensembles of regression and
tree sub-models called "Model Composition", with a new model type
MiningModel that employed ResultFields for delivering
results between sub-models. Scope was difficult to implement because the
regression and tree sub-models aren't proper model elements as they lack key
features like a MiningSchema. [Note: This document does not cover
the Model Composition approach since its use has been deprecated in PMML
4.1]
PMML 4.0 introduced a new type of MiningModel that used
"Segmentation" for creating sub-models that could be organized as ensembles
(a collection of sub-models used in parallel). These sub-models, called
Segments, employed the same model elements used in the single model PMML
documents and therefore allowed greater flexibility with respect to re-using
field names:
"PMML 4.0: The fields in the DataDictionary and in the
TransformationDictionary taken together are identified by unique names.
Other elements in the models can refer to these fields by name. Multiple
models on one PMML document can share the same fields in the
TransformationDictionary. Nevertheless, a model can also define its 'own'
derived fields in the element LocalTransformations."
Figure 1 shows the flow of data through a PMML 4.1 document, including
multiple model elements (top-level and MiningModel sub-models).

Figure 1: PMML Scope Diagram
Similar to how software languages have rules ensuring clarity of variable
names across a program, PMML has rules that govern scope as well. But unlike
imperative programming languages like C and Java, which allow any variable to
take a variety of scopes (e.g., global, friend and local), fields in PMML are
fixed into collections with pre-defined scope:
- The DataDictionary and TransformationDictionary are
definitions that are visible to all model elements, so their scope is
global in nature. Generally, the standard has always required that these
fields have unique names.
- Each model element (both top-level and sub-model) contains a required
MiningSchema element and an optional LocalTransformations
element. These elements process and manipulate data as it flows through the
model element. While appearing local in nature, scope comes into play when
there are model elements within other model elements, as is the case with
MiningModels.
- Model elements contain Outputs which externalize the variety
of model results that can be produced from the model element. Outputs are
global in nature since they are referenced from outside of the model
element's scope.
- Model elements also contain Targets and
VerificationFields which refer to other fields and are not new
fields.
The following five rules are key to understanding the scope of fields in
a PMML document:
- Forward referencing of fields is not allowed.
  - All fields in PMML must be defined before they are referenced.
- DerivedFields in the TransformationDictionary are
definitions only. Similar to a dictionary that contains the definition of
words that appear elsewhere in documents, the
TransformationDictionary contains the definition of
transformations which are only instantiated when referenced elsewhere in
the PMML document. This approach allows a DerivedField to be
specified once but used in multiple model elements within the PMML
document, which can reduce the size of PMML documents that contain
multiple model elements.
  - A DerivedField from the TransformationDictionary
    is instantiated in a model element when it is referenced in the model
    element, either directly or transitively through other
    DerivedFields.
    - Example of referencing directly and transitively:
      The TransformationDictionary contains
      DerivedField "B" which references a DerivedField
      "A" from earlier in the TransformationDictionary. If "B"
      is directly referenced in a model element and therefore
      becomes instantiated in that model element, then "A" is also
      instantiated in the model element because, while "A" may not be
      directly referenced, it is transitively referenced via
      "B".
  - When they are instantiated inside a model element,
    DerivedFields from the TransformationDictionary are
    instantiated as if they appear before any other DerivedFields
    defined within the LocalTransformations element.
  - When they are instantiated inside a model element,
    DerivedFields from the TransformationDictionary can
    only reference:
    - MiningFields from the MiningSchema of that
      model element.
    - Other DerivedFields which appear earlier in the
      TransformationDictionary.
  - DerivedFields from the TransformationDictionary
    which are not referenced (directly or transitively) in a model are not
    instantiated in that model element.
  - Whether they are instantiated in a particular PMML document or not,
    all DerivedFields from the TransformationDictionary
    must be valid PMML. For example, DerivedFields in the
    TransformationDictionary that have circular references or
    apply undefined functions are not valid PMML, even if those
    DerivedFields are never instantiated in a model element.
- The MiningSchema is the "GateKeeper".
  - Within a model element, all data must flow in through its
    MiningSchema.
  - In this way, the MiningSchema is the boundary between
    model elements, a distinction that defines the scope of fields in the
    PMML document.
  - Since MiningFields have features for manipulating data
    values that are missing, invalid or considered an outlier, values
    handled at the top-level (i.e., values that don't pass through the
    MiningSchema "asIs") will not make it to sub-models.
- The MiningSchema of a particular model can refer to any fields
available in the enclosing scope of its parent model.
  - For top-level Models, their MiningSchema can refer only to
    DataFields in the DataDictionary.
  - For sub-models, their MiningSchema can refer to any fields
    defined within the parent model's scope, including:
    - The MiningFields of the parent model.
    - DerivedFields defined in the
      LocalTransformations of the parent model.
    - For a sequence of sub-models (MiningModels that use
      Segmentation's modelChain feature, introduced in PMML
      4.1), OutputFields defined in Segments that
      appear above/earlier in the parent MiningModel. Note: The
      sub-models comprising a sequence are to be ordered in such a way
      that each is defined after any other sub-models on which it
      depends.
- In general, field names in PMML should be unique. Avoiding name
duplication is a good practice since it makes life easier for consumers
and, as outlined below, certain field names cannot be duplicated under any
circumstances (e.g., DerivedFields in the
TransformationDictionary). However, because of the nature of
MiningModels, two reasonable exceptions can be made to allow
duplicating field names:
  - DerivedField names in LocalTransformations can be
    duplicated across different sibling sub-models, provided the name is
    unique from the name of any field in its scope.
  - OutputField elements can be duplicated across different
    sibling sub-models, provided they are identical to each other (same
    name, same output feature, same data type, etc.).
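The instantiation rule for TransformationDictionary fields can be illustrated with a minimal Python sketch. It is not part of the standard and does not parse PMML. The TransformationDictionary is modeled, as an assumption, by an ordered list of (name, referenced names) pairs, and the function name instantiate_from_dictionary is hypothetical. Given the MiningFields of one model element and the names that element references directly, the sketch collects every dictionary field that is referenced directly or transitively. It rejects forward references and references to names that are not visible in that model element.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical stand-in for a TransformationDictionary: an ordered list of
# (DerivedField name, names of the fields its expression references).
Dictionary = List[Tuple[str, List[str]]]


def instantiate_from_dictionary(
    dictionary: Dictionary,
    mining_fields: Set[str],
    directly_referenced: Iterable[str],
) -> List[str]:
    """Return the dictionary fields instantiated in one model element.

    The result is in dictionary order, so every field appears after the
    dictionary fields it depends on.
    """
    definitions = dict(dictionary)
    position = {name: i for i, (name, _) in enumerate(dictionary)}
    needed: Set[str] = set()
    stack = [name for name in directly_referenced if name in definitions]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        for ref in definitions[name]:
            if ref in definitions:
                # Only fields defined earlier in the dictionary may be used.
                if position[ref] >= position[name]:
                    raise ValueError(f"{name!r} makes a forward reference to {ref!r}")
                stack.append(ref)
            elif ref not in mining_fields:
                # Anything else must come in through the MiningSchema.
                raise ValueError(f"{name!r} references {ref!r}, which is not in scope")
    return [name for name, _ in dictionary if name in needed]


# "B" references "A"; "C" is never referenced, so it is not instantiated.
example = [("A", ["x"]), ("B", ["A"]), ("C", ["y"])]
print(instantiate_from_dictionary(example, {"x", "y"}, ["B"]))  # ['A', 'B']
```

Note that this sketch only validates the fields that are actually instantiated. As stated above, a real validator would also have to check the fields that are never instantiated, since every DerivedField in the TransformationDictionary must be valid PMML.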
In summary, these are the rules that govern the naming of fields in PMML:
- The names of DataFields in the DataDictionary must be
unique from the names of any other DataFields and the names of
DerivedFields in the TransformationDictionary.
- The names of MiningFields cannot be re-used within the same
model element.
- The names of DerivedFields in the
TransformationDictionary must be unique across the entire PMML
document.
- The names of DerivedFields in the
LocalTransformations must be unique from any other names in their
scope.
- The names of OutputFields must be unique from the names of any
other fields in the PMML document unless the OutputField element
is identically duplicated across sibling sub-models.
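As a rough illustration of how a consumer might check some of these naming rules, the following Python sketch works on plain lists of field names rather than on a parsed PMML document. The function names are hypothetical. The sketch also assumes that the names "in scope" for a LocalTransformations element are the model element's own MiningFields, the TransformationDictionary fields, and the other DerivedFields in the same LocalTransformations element. The rules for OutputFields are not covered.

```python
from collections import Counter
from typing import Iterable, List


def global_name_conflicts(data_fields: Iterable[str], dictionary_fields: Iterable[str]) -> List[str]:
    """Names used more than once across the DataDictionary and TransformationDictionary."""
    counts = Counter(list(data_fields) + list(dictionary_fields))
    return sorted(name for name, n in counts.items() if n > 1)


def local_name_conflicts(
    mining_fields: Iterable[str],
    dictionary_fields: Iterable[str],
    local_fields: Iterable[str],
) -> List[str]:
    """LocalTransformations names that clash with a name in scope (as assumed above)."""
    visible = set(mining_fields) | set(dictionary_fields)
    seen = set()
    conflicts = set()
    for name in local_fields:
        if name in visible or name in seen:
            conflicts.add(name)
        seen.add(name)
    return sorted(conflicts)


# "age" is defined in both dictionaries, which is never allowed.
print(global_name_conflicts(["age", "income"], ["logIncome", "age"]))  # ['age']

# Two sibling sub-models may each define a local field named "scaled",
# because each sub-model is checked against its own scope.
print(local_name_conflicts(["age"], ["logIncome"], ["scaled"]))     # []
print(local_name_conflicts(["income"], ["logIncome"], ["scaled"]))  # []
```

Checking each sub-model separately is what lets the sibling exception fall out naturally: a name only conflicts when it clashes with something visible from the same model element.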