 PMML 4.4 - Definition and Application of Functions

PMML provides a number of predefined functions that support fine-grained transformations such as changing characters to upper case or converting date and time values to strings. The predefined functions are built into PMML because they cannot be defined by expressions in PMML itself or because a definition would be too complex.

Without support for such functions an application would have to perform the transformations before using a PMML model. The transformations that were applied when the model was created must be equivalent to the transformations when the model is applied to new data. By integrating some of the transformations directly into the PMML model, the definition and execution of the data flow becomes less error-prone.

PMML also supports the definition of new functions that have other PMML expressions in the function body. Such a function represents a parameterized expression. The semantics of applying a 'user-defined' function in PMML are:

1. substitute the formal function parameters by the actual argument values, and then
2. replace the function application by the new expression.

That is, function definitions are just a means of writing certain expressions in a more compact way.
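
The substitution can be illustrated with a minimal sketch. The function name DOUBLEIT, its parameter x, and the field income are hypothetical names chosen for illustration; they do not appear elsewhere in this document. The sketch shows a definition, one application, and the expression that the application is equivalent to after substitution.

```xml
<!-- hypothetical user-defined function: DOUBLEIT(x) = x * 2 -->
<DefineFunction name="DOUBLEIT" optype="continuous" dataType="double">
  <ParameterField name="x" optype="continuous" dataType="double"/>
  <Apply function="*">
    <FieldRef field="x"/>
    <Constant dataType="double">2</Constant>
  </Apply>
</DefineFunction>

<!-- application of the function to the (assumed) field "income" -->
<Apply function="DOUBLEIT">
  <FieldRef field="income"/>
</Apply>

<!-- equivalent expression: the formal parameter "x" is substituted by "income" -->
<Apply function="*">
  <FieldRef field="income"/>
  <Constant dataType="double">2</Constant>
</Apply>
```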

A function can be applied to one or more other expressions such as constants, fields or results of transformations, see the group EXPRESSION in Transformations.html. When a function is applied, the actual arguments are identified by position. A function application itself is a PMML transformation expression. That is, there can be nested invocations of functions. Unless otherwise specified, a function will return a missing result if any of the inputs are missing.
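
As a sketch of a nested invocation, the following fragment computes the mean of two fields by applying the built-in function "+" inside the built-in function "/". The field names score1 and score2 and the DerivedField name MeanScore are assumptions made for this illustration. The arguments are identified by position, so the sum is the dividend and the constant is the divisor.

```xml
<DerivedField name="MeanScore" dataType="double" optype="continuous">
  <!-- outer application: first argument / second argument -->
  <Apply function="/">
    <!-- first argument is itself a function application -->
    <Apply function="+">
      <FieldRef field="score1"/>
      <FieldRef field="score2"/>
    </Apply>
    <!-- second argument is a constant -->
    <Constant dataType="double">2</Constant>
  </Apply>
</DerivedField>
```

With the default behavior described above, a missing value in either score1 or score2 makes the inner result missing, and therefore the outer result as well.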

To allow a single function specification to be applicable to multiple dataTypes, functions are assumed to inherit the dataType of the current input parameters unless otherwise specified. For example, the built-in function "+" can be applied to the integer, float, or double dataTypes. When the input parameters have multiple dataTypes, the least restrictive dataType will be inherited by default. An explicit dataType for the function needs to be defined if the PMML producer expects the default dataType inheritance to be overridden. The inheritance precedence for mixed type input parameters, from least restrictive to most restrictive, is as follows:

1. string
2. double
3. float
4. integer

For example, if an integer and a double parameter are used with the "+" function, by default the output dataType will be double. The various date, time and dateTime dataTypes, as well as the boolean dataType, cannot be mixed with other types in a defined function without explicitly defining the expected output dataType.
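
A minimal sketch of this default inheritance follows. The function name ADDFEE and the parameter names count and fee are hypothetical. Because the DefineFunction element carries no dataType attribute, the result type is inherited from the parameters.

```xml
<!-- hypothetical function; dataType is omitted, so the result type is inherited -->
<DefineFunction name="ADDFEE" optype="continuous">
  <ParameterField name="count" optype="continuous" dataType="integer"/>
  <ParameterField name="fee" optype="continuous" dataType="double"/>
  <!-- integer + double: the least restrictive type, double, is inherited -->
  <Apply function="+">
    <FieldRef field="count"/>
    <FieldRef field="fee"/>
  </Apply>
</DefineFunction>
```

A producer that expects a different result type would add an explicit dataType attribute to DefineFunction.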

For ParameterFields, both dataType and optype are optional. When the specified dataType and the dataType of the argument expression match, the expected behavior is straightforward. However, if the expression does not match the specified dataType, or if the dataType is not specified, the expected behavior needs further clarification. Similar to the handling of the function output, when the dataType is not specified, ParameterFields are assumed to inherit the dataType of the expression used as the argument. For example, if the argument for a ParameterField is a FieldRef expression referring to the prodgroup field defined in the DataDictionary, the ParameterField inherits the dataType of prodgroup.

In cases where the specified dataType does not match the expression, implicit casts are allowed in the direction of less restriction, according to the precedence list above. If the dataType for the ParameterField is more restrictive than the expression dataType, the PMML document is not valid since it is not clear in advance that the cast can be properly made. For example, an integer expression can be implicitly cast to a double by a ParameterField dataType, but a double expression cannot be implicitly cast to an integer.
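
The following sketch illustrates an allowed implicit cast. The function name HALF, the parameter name x, and the field itemCount (assumed to be declared with dataType="integer" in the DataDictionary) are hypothetical.

```xml
<DefineFunction name="HALF" optype="continuous">
  <!-- parameter is declared as double -->
  <ParameterField name="x" optype="continuous" dataType="double"/>
  <Apply function="/">
    <FieldRef field="x"/>
    <Constant dataType="double">2</Constant>
  </Apply>
</DefineFunction>

<!-- allowed: the integer argument is implicitly cast to double (less restrictive) -->
<Apply function="HALF">
  <FieldRef field="itemCount"/>
</Apply>
```

The reverse case, passing a double expression to a ParameterField declared with dataType="integer", would make the document invalid according to the rule above.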

The ParameterField element also allows an optional displayName attribute which describes the parameter field in potentially greater detail. It has no effect on scoring, but can be used in reports generated by PMML consumers.

Schema

The XML Schema for the definition and application of functions is
<xs:element name="DefineFunction">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="ParameterField" minOccurs="1" maxOccurs="unbounded"/>
<xs:group ref="EXPRESSION"/>
</xs:sequence>
<xs:attribute name="name" type="xs:string" use="required"/>
<xs:attribute name="optype" type="OPTYPE" use="required"/>
<xs:attribute name="dataType" type="DATATYPE"/>
</xs:complexType>
</xs:element>

<xs:element name="ParameterField">
<xs:complexType>
<xs:attribute name="name" type="FIELD-NAME" use="required"/>
<xs:attribute name="optype" type="OPTYPE"/>
<xs:attribute name="dataType" type="DATATYPE"/>
<xs:attribute name="displayName" type="xs:string"/>
</xs:complexType>
</xs:element>

<xs:element name="Apply">
<xs:complexType>
<xs:sequence>
<xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
<xs:group ref="EXPRESSION" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="function" type="xs:string" use="required"/>
<xs:attribute name="mapMissingTo" type="xs:string"/>
<xs:attribute name="defaultValue" type="xs:string"/>
<xs:attribute name="invalidValueTreatment" type="INVALID-VALUE-TREATMENT-METHOD" default="returnInvalid"/>
</xs:complexType>
</xs:element>

The DefineFunction is used to define new (user-defined) functions as variations or compositions of existing functions or transformations. The function's name must be unique and must not conflict with other function names, either defined by PMML or other user-defined functions. The EXPRESSION in the content of DefineFunction is the function body that actually defines the meaning of the new function. The function body must not refer to fields other than the parameter fields.

The element Apply defines the application of a function. The function itself is identified by name with the function attribute. The actual parameters of the function application are given in the content of the element. Each actual argument value is given by an EXPRESSION and is mapped by position to the corresponding formal parameter in the function definition.

The optional attribute mapMissingTo defines the output value for the cases when any of the function inputs are missing. This means that if it is specified and any of the input values of the function are missing, then the function is not applied at all and the mapMissingTo value is returned instead. This is useful when the applied function cannot handle missing values. In contrast, the optional attribute defaultValue defines an output value for the case when the function returns a missing value. In other words, when a defaultValue is provided, the function is applied first, and if it produces a missing value the defaultValue is returned instead.

The application of a function may sometimes yield invalid results (e.g. a division by zero). As in the case of a MiningField, the attribute invalidValueTreatment can be used to specify how such invalid values should be treated. This attribute’s default value returnInvalid causes the model to return a value indicating an invalid result. On the other hand, the invalidValueTreatment value asMissing replaces the invalid value with a missing value. Note that, in this case, if a defaultValue is also specified then the default value is returned instead of a missing value. Finally, in the context of an Apply, the value asIs is equivalent to returnInvalid as an invalid value cannot be propagated further.

Output table for Apply

('*' stands for any combination, empty cell stands for no value)

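function input(s)                | mapMissingTo    | defaultValue | invalidValueTreatment | function output | apply output
at least one missing             | map_missing_val | *            | *                     | not computed    | map_missing_val
no input value is missing OR mapMissingTo is empty | *            | *                     | out_val         | out_val
no input value is missing OR mapMissingTo is empty | default_val  | *                     |                 | default_val
no input value is missing OR mapMissingTo is empty |              | returnInvalid         | invalid         | returnInvalid
no input value is missing OR mapMissingTo is empty |              | asMissing             | invalid         |
no input value is missing OR mapMissingTo is empty | default_val  | asMissing             | invalid         | default_val

(In the last five rows, the first cell spans the function input(s) and mapMissingTo columns.)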
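
The interplay of the three attributes can be sketched with a division. The field names numerator and denominator, the DerivedField name Ratio, and the attribute values -1 and 0 are assumptions chosen for this illustration.

```xml
<DerivedField name="Ratio" dataType="double" optype="continuous">
  <Apply function="/" mapMissingTo="-1" defaultValue="0" invalidValueTreatment="asMissing">
    <FieldRef field="numerator"/>
    <FieldRef field="denominator"/>
  </Apply>
</DerivedField>
```

Read against the table above: if numerator or denominator is missing, the division is not computed and the result is -1. If both are present and the denominator is zero, the invalid result is treated as missing because of asMissing, and the defaultValue 0 is returned. Otherwise the result is the quotient.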

Example applying a built-in function

Data cleansing is one of the common tasks done in preparing data for mining. Some of these operations can be supported directly in a PMML model. The following example demonstrates how to convert string values to upper case by applying the built-in function uppercase.

Assuming that the original input data contains names of product groups and the names are provided in the field "prodgroup", we define a new DerivedField named PGNorm where all values use upper case characters.

<DerivedField name="PGNorm" dataType="string" optype="categorical">
<Apply function="uppercase">
<FieldRef field="prodgroup"/>
</Apply>
</DerivedField>

That is, when the value of the field prodgroup is, e.g., "Non-Food", the value of the field PGNorm becomes "NON-FOOD".

A DerivedField can contain a transformation expression such as MapValues or Discretize. The element Apply is just another transformation expression.

Example of user-defined function

A DerivedField can be defined by a possibly complex transformation. If a certain transformation has to be applied to multiple fields it makes sense to encapsulate the definition of the transformation expression in a function and then apply the function multiple times. This reduces the complexity and the size of PMML models.

New user-defined functions can be specified in a model using the element DefineFunction in the TransformationDictionary.

Example:

<TransformationDictionary>

<!-- define a new function called "AMPM" -->
<DefineFunction name="AMPM" dataType="string" optype="categorical">
<!-- result type is "string" -->

<!-- declaration of formal parameters -->
<ParameterField name="TimeVal" optype="continuous" dataType="integer" displayName="Time value"/>
<!-- there can be more than one parameter field -->

<!-- The function body can be any expression-->
<!-- Parameter names are used like field names in the expression -->

<Discretize field="TimeVal">  <!-- uses name of parameter field -->
<DiscretizeBin binValue="AM">
<Interval closure="closedClosed" leftMargin="0" rightMargin="43199"/>
</DiscretizeBin>
<DiscretizeBin binValue="PM">
<Interval closure="closedOpen" leftMargin="43200" rightMargin="86400"/>
</DiscretizeBin>
</Discretize>
</DefineFunction>

<!-- use function "AMPM" in a DerivedField -->
<DerivedField name="Shift" dataType="string" optype="categorical">
<Apply function="AMPM">
<FieldRef field="StartTime"/>
</Apply>
</DerivedField>

<!-- extract the hour from a time value -->
<DerivedField name="StartHour" dataType="string" optype="categorical">
<Apply function="formatDatetime">
<FieldRef field="StartTime"/>
<Constant>%H</Constant>
</Apply>
</DerivedField>

</TransformationDictionary>

We assume that the field StartTime is defined with dataType="timeSeconds". An actual time value, "09:39:02", would be represented as the number 34742, that is, the number of seconds since midnight at the given point in time. The transformation in the function AMPM maps this value to the string "AM". This value becomes the actual value of the field Shift. The input field StartTime is also used in the definition of the DerivedField StartHour. This categorical field has the actual value "09", produced by the date formatting function.
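
Since the point of a user-defined function is reuse, the same function can be applied to further fields. The following sketch assumes a second field EndTime, also with dataType="timeSeconds", and a hypothetical DerivedField name EndShift; it would be placed inside the same TransformationDictionary.

```xml
<!-- reuse of the function "AMPM" for a second (assumed) field -->
<DerivedField name="EndShift" dataType="string" optype="categorical">
  <Apply function="AMPM">
    <FieldRef field="EndTime"/>
  </Apply>
</DerivedField>
```

For an EndTime value of 61200 (17:00:00), the second DiscretizeBin of AMPM applies and EndShift has the value "PM".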

Note that we use the notation <Constant>%H</Constant> for constants. This notation is shorter and easier to handle than the combination <Constant><Value value="%H"/></Constant>.

In general, the application of a function looks like:

<Apply function="MyFunc" xmlns="http://www.dmg.org/PMML-4_4">
<!-- parameter expression 1 -->
<!-- parameter expression 2 -->
<!--    ...    -->
<!-- parameter expression n -->
</Apply>
The expressions are mapped by position to the arguments in the function definition.
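
As a concrete sketch of positional mapping with several arguments, the built-in function if takes a condition, a value for the true case, and a value for the false case, in that order. The field name age and the constants are assumptions for this illustration.

```xml
<Apply function="if">
  <!-- argument 1: the condition, itself a function application -->
  <Apply function="greaterThan">
    <FieldRef field="age"/>
    <Constant dataType="integer">17</Constant>
  </Apply>
  <!-- argument 2: result when the condition is true -->
  <Constant dataType="string">adult</Constant>
  <!-- argument 3: result when the condition is false -->
  <Constant dataType="string">minor</Constant>
</Apply>
```

Swapping the second and third expressions would swap the two results, since no argument names are used.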

Another example:

<DefineFunction name="STATEGROUP" dataType="string" optype="categorical">
<ParameterField name="#1" optype="categorical" dataType="string"/>
<MapValues outputColumn="Region">
<FieldColumnPair field="#1" column="State"/>
<InlineTable>
<row><State>CA</State><Region>West</Region></row>
<row><State>OR</State><Region>West</Region></row>
<row><State>NC</State><Region>East</Region></row>
</InlineTable>
</MapValues>
</DefineFunction>
<DerivedField name="Group" dataType="string" optype="categorical">
<Apply function="STATEGROUP">
<FieldRef field="State"/>
</Apply>
</DerivedField>

The new function named STATEGROUP accepts one argument. The definition uses the transformation MapValues. For example, if the function is applied to the value "CA", the result is the string "West". The example also defines a new DerivedField named Group as the result of applying the function STATEGROUP to the field State.
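
The attributes mapMissingTo and defaultValue can be combined with a user-defined function in the same way as with a built-in one. In the following sketch the DerivedField name GroupOrOther and the values "Unknown" and "Other" are hypothetical. It relies on MapValues returning a missing value when no row matches and no default is given in the MapValues element, as is the case in the definition of STATEGROUP above.

```xml
<DerivedField name="GroupOrOther" dataType="string" optype="categorical">
  <!-- missing State: "Unknown"; State without a matching row: "Other" -->
  <Apply function="STATEGROUP" mapMissingTo="Unknown" defaultValue="Other">
    <FieldRef field="State"/>
  </Apply>
</DerivedField>
```

Under these assumptions, a State value of "CA" still yields "West", a State value such as "TX" that is not in the table yields "Other", and a missing State yields "Unknown" without the function being applied.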
