
## PMML 4.3 - Definition and Application of Functions

PMML provides a number of predefined functions that support fine-grained transformations such as changing characters to upper case or converting date and time values to strings. The predefined functions are built into PMML because they cannot be defined by expressions in PMML itself or because a definition would be too complex.

Without support for such functions, an application would have to perform the transformations before using a PMML model. The transformations that were applied when the model was created must be equivalent to the transformations applied when the model is used on new data. By integrating some of the transformations directly into the PMML model, the definition and execution of the data flow become less error-prone.

PMML also supports the definition of new functions that have other PMML expressions in the function body. Such a function represents a parameterized expression. The semantics of applying a 'user-defined' function in PMML are:

1. substitute the formal function parameters by the actual argument values, and then
2. replace the function application by the new expression.

That is, the function definitions are just a means for writing certain expressions in a more compact way.
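
The substitution rule can be illustrated with a small sketch. The function name `percentOf`, its parameters `part` and `whole`, and the fields `sold` and `stock` are assumptions made for this example only; they are not defined by PMML. The sketch shows a definition, one application, and the expression that the application stands for after substitution.

```xml
<!-- Hypothetical function: percentOf(part, whole) = part / whole * 100 -->
<DefineFunction name="percentOf" optype="continuous" dataType="double">
  <ParameterField name="part" optype="continuous" dataType="double"/>
  <ParameterField name="whole" optype="continuous" dataType="double"/>
  <Apply function="*">
    <Apply function="/">
      <FieldRef field="part"/>
      <FieldRef field="whole"/>
    </Apply>
    <Constant dataType="double">100</Constant>
  </Apply>
</DefineFunction>

<!-- Application: sold is bound to part, stock is bound to whole -->
<Apply function="percentOf">
  <FieldRef field="sold"/>
  <FieldRef field="stock"/>
</Apply>

<!-- Equivalent expression after substituting the parameters -->
<Apply function="*">
  <Apply function="/">
    <FieldRef field="sold"/>
    <FieldRef field="stock"/>
  </Apply>
  <Constant dataType="double">100</Constant>
</Apply>
```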

A function can be applied to one or more other expressions such as constants, fields, or results of transformations (see the group `EXPRESSION` in Transformations). When a function is applied, the actual arguments are identified by position. A function application is itself a PMML transformation expression, so function invocations can be nested. Unless otherwise specified, a function returns a missing result if any of its inputs are missing.
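
As a minimal sketch of a nested invocation, the fragment below passes a constant and the result of an inner application to the built-in function `concat`. The field `prodgroup` is taken from the example later in this document; the prefix `PG-` is an arbitrary value chosen for illustration.

```xml
<!-- concat("PG-", uppercase(prodgroup)) -->
<Apply function="concat">
  <!-- first argument: a constant -->
  <Constant dataType="string">PG-</Constant>
  <!-- second argument: the result of another function application -->
  <Apply function="uppercase">
    <FieldRef field="prodgroup"/>
  </Apply>
</Apply>
```

Because arguments are identified by position, swapping the two child expressions would place the prefix after the product group name.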

In order to allow a single function specification to be applicable to multiple `dataType`s, functions are assumed to inherit the `dataType` of their input parameters unless otherwise specified. For example, the built-in function `"+"` can be applied to the `integer`, `float`, or `double` `dataType`s. When the input parameters have different `dataType`s, the least restrictive `dataType` is inherited by default. An explicit `dataType` for the function needs to be defined if the PMML producer expects the default `dataType` inheritance to be overridden. The inheritance precedence for mixed-type input parameters, from least to most restrictive, is as follows:

1. string
2. double
3. float
4. integer

For example, if an `integer` and a `double` parameter are used with the `"+"` function, by default the output `dataType` will be `double`. The various `date`, `time` and `dateTime` `dataType`s, as well as the `boolean` `dataType`, cannot be mixed with other types in a defined function without explicitly defining the expected output `dataType`.
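
The following sketch contrasts inherited and explicit result types. The function names `addFee` and `addCounts` and all parameter names are hypothetical and serve only to illustrate the rule described above.

```xml
<!-- No dataType attribute: the result type is inherited from the parameters.
     integer + double gives double under the precedence list above. -->
<DefineFunction name="addFee" optype="continuous">
  <ParameterField name="quantity" optype="continuous" dataType="integer"/>
  <ParameterField name="fee" optype="continuous" dataType="double"/>
  <Apply function="+">
    <FieldRef field="quantity"/>
    <FieldRef field="fee"/>
  </Apply>
</DefineFunction>

<!-- Explicit dataType: both parameters are integer, so the inherited type
     would be integer; the attribute declares the result as double instead. -->
<DefineFunction name="addCounts" optype="continuous" dataType="double">
  <ParameterField name="a" optype="continuous" dataType="integer"/>
  <ParameterField name="b" optype="continuous" dataType="integer"/>
  <Apply function="+">
    <FieldRef field="a"/>
    <FieldRef field="b"/>
  </Apply>
</DefineFunction>
```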

For `ParameterField`s, both `dataType` and `optype` are optional. When the specified `dataType` and the expression `dataType` match, the expected behavior is straightforward. However, if the expression does not match the specified `dataType`, or the `dataType` is not specified, further clarification on the expected behavior is needed. Similar to the handling of the function output, when the `dataType` is not specified, `ParameterField`s are assumed to inherit the `dataType` of the expression used for the definition. For example, if a `ParameterField` is specified as a `FieldRef` expression referring to the prodgroup field defined in the `DataDictionary`, the `ParameterField` for the function inherits the `dataType` of prodgroup.

In cases where the specified `dataType` does not match the expression, implicit casts are allowed in the direction of less restriction, according to the precedence list above. If the `dataType` for the `ParameterField` is more restrictive than the expression `dataType`, the PMML document is not valid since it is not clear in advance that the cast can be properly made. For example, an `integer` expression can be implicitly cast to a `double` by a `ParameterField` `dataType`, but a `double` expression cannot be implicitly cast to an `integer`.
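
A sketch of the two cases, under the assumption that the data dictionary contains an `integer` field `visits` and a `double` field `price`. The field names and the functions `half` and `nextCount` are hypothetical.

```xml
<!-- Assumed DataDictionary entries -->
<DataField name="visits" optype="continuous" dataType="integer"/>
<DataField name="price" optype="continuous" dataType="double"/>

<DefineFunction name="half" optype="continuous" dataType="double">
  <ParameterField name="x" optype="continuous" dataType="double"/>
  <Apply function="/">
    <FieldRef field="x"/>
    <Constant dataType="double">2</Constant>
  </Apply>
</DefineFunction>

<!-- Valid: the integer expression is implicitly cast to the double parameter -->
<Apply function="half">
  <FieldRef field="visits"/>
</Apply>

<DefineFunction name="nextCount" optype="continuous" dataType="integer">
  <ParameterField name="n" optype="continuous" dataType="integer"/>
  <Apply function="+">
    <FieldRef field="n"/>
    <Constant dataType="integer">1</Constant>
  </Apply>
</DefineFunction>

<!-- Not valid: the double expression would have to be narrowed to integer -->
<Apply function="nextCount">
  <FieldRef field="price"/>
</Apply>
```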

### Schema

The XML Schema for the definition and application of functions is:

```xml
<xs:element name="DefineFunction">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="ParameterField" minOccurs="1" maxOccurs="unbounded"/>
      <xs:group ref="EXPRESSION"/>
    </xs:sequence>
    <xs:attribute name="name" type="xs:string" use="required"/>
    <xs:attribute name="optype" type="OPTYPE" use="required"/>
    <xs:attribute name="dataType" type="DATATYPE"/>
  </xs:complexType>
</xs:element>

<xs:element name="ParameterField">
  <xs:complexType>
    <xs:attribute name="name" type="xs:string" use="required"/>
    <xs:attribute name="optype" type="OPTYPE"/>
    <xs:attribute name="dataType" type="DATATYPE"/>
  </xs:complexType>
</xs:element>

<xs:element name="Apply">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="EXPRESSION" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="function" type="xs:string" use="required"/>
    <xs:attribute name="mapMissingTo" type="xs:string"/>
    <xs:attribute name="defaultValue" type="xs:string"/>
    <xs:attribute name="invalidValueTreatment" type="INVALID-VALUE-TREATMENT-METHOD" default="returnInvalid"/>
  </xs:complexType>
</xs:element>
```

The element `DefineFunction` is used to define new (user-defined) functions as variations or compositions of existing functions or transformations. The function's name must be unique and must not conflict with other function names, whether defined by PMML or by other user-defined functions. The `EXPRESSION` in the content of `DefineFunction` is the function body that actually defines the meaning of the new function. The function body must not refer to fields other than the parameter fields.

The element `Apply` defines the application of a function. The function itself is identified by name with the `function` attribute. The actual parameters of the function application are given in the content of the element. Each actual argument value is given by an `EXPRESSION`, and the arguments are mapped by position to the formal parameters in the corresponding function definition.

The optional attribute `mapMissingTo` defines the output value for the cases when any of the function inputs are missing. This means that if it is specified and any of the input values of the function are missing, then the function is not applied at all and the `mapMissingTo` value is returned instead. This is useful when the applied function cannot handle missing values. In contrast, the optional attribute `defaultValue` defines an output value for the case when the function returns a missing value. In other words, when a `defaultValue` is provided, the function is applied first, and if it produces a missing value the `defaultValue` is returned instead.
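
The two attributes can be combined in one application, as in the sketch below. It reuses the function `AMPM` and the field `StartTime` from the user-defined function example in this document; the replacement strings `no time` and `unknown` are assumptions chosen for illustration.

```xml
<DerivedField name="Shift" dataType="string" optype="categorical">
  <!-- StartTime missing: AMPM is not applied, the result is "no time" -->
  <!-- StartTime present but AMPM returns a missing value
       (for instance, no interval matches): the result is "unknown" -->
  <Apply function="AMPM" mapMissingTo="no time" defaultValue="unknown">
    <FieldRef field="StartTime"/>
  </Apply>
</DerivedField>
```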

The application of a function may sometimes yield invalid results (e.g. a division by zero). As in the case of a `MiningField`, the attribute invalidValueTreatment can be used to specify how such invalid values should be treated. This attribute’s default value `returnInvalid` causes the model to return a value indicating an invalid result. On the other hand, the `invalidValueTreatment` value `asMissing` replaces the invalid value with a missing value. Note that, in this case, if a defaultValue is also specified then the default value is returned instead of a missing value. Finally, in the context of an Apply, the value `asIs` is equivalent to `returnInvalid` as an invalid value cannot be propagated further.
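
As a hedged sketch of this behavior, the fragment below guards a division against a zero divisor. The field names `revenue` and `units` and the derived field name `UnitPrice` are hypothetical.

```xml
<DerivedField name="UnitPrice" dataType="double" optype="continuous">
  <!-- units = 0 makes the division invalid; asMissing turns the invalid
       result into a missing value, which defaultValue then replaces with 0 -->
  <Apply function="/" invalidValueTreatment="asMissing" defaultValue="0">
    <FieldRef field="revenue"/>
    <FieldRef field="units"/>
  </Apply>
</DerivedField>
```

Without the `defaultValue` attribute, the same application would yield a missing value when the division is invalid.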

#### Output table for Apply

('*' stands for any combination, empty cell stands for no value)

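| function input(s) | mapMissingTo | defaultValue | invalidValueTreatment | function output | apply output |
| --- | --- | --- | --- | --- | --- |
| at least one missing | map_missing_val | * | * | not computed | map_missing_val |
| no input value is missing OR mapMissingTo is empty | (see left) | * | * | out_val | out_val |
| no input value is missing OR mapMissingTo is empty | (see left) | default_val | * | | default_val |
| no input value is missing OR mapMissingTo is empty | (see left) | | returnInvalid | invalid | returnInvalid |
| no input value is missing OR mapMissingTo is empty | (see left) | | asMissing | invalid | |
| no input value is missing OR mapMissingTo is empty | (see left) | default_val | asMissing | invalid | default_val |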

### Example applying a built-in function

Data cleansing is one of the common tasks done in preparing data for mining. Some of these operations can be supported directly in a PMML model. The following example demonstrates how to convert string values to upper case by applying the built-in function uppercase.

Assuming that the original input data contains names of product groups and the names are provided in the field "prodgroup", we define a new `DerivedField` named PGNorm where all values use upper case characters.

```xml
<DerivedField name="PGNorm" dataType="string" optype="categorical">
  <Apply function="uppercase">
    <FieldRef field="prodgroup"/>
  </Apply>
</DerivedField>
```

That is, when the value of the field prodgroup is, e.g., "Non-Food", the value of the field PGNorm becomes "NON-FOOD".

A `DerivedField` can contain a transformation expression such as `MapValues` or `Discretize`. The element `Apply` is just another transformation expression.

### Example of user-defined function

A `DerivedField` can be defined by a possibly complex transformation. If a certain transformation has to be applied to multiple fields it makes sense to encapsulate the definition of the transformation expression in a function and then apply the function multiple times. This reduces the complexity and the size of PMML models.

New user-defined functions can be specified in a model using the element `DefineFunction` in the `TransformationDictionary`.

#### Example:

```xml
<TransformationDictionary>

  <!-- define a new function called "AMPM" -->
  <DefineFunction name="AMPM" dataType="string" optype="categorical">
    <!-- result type is "string" -->

    <!-- declaration of formal parameters -->
    <ParameterField name="TimeVal" optype="continuous" dataType="integer"/>
    <!-- there can be more than one parameter field -->

    <!-- The function body can be any expression -->
    <!-- Parameter names are used like field names in the expression -->

    <Discretize field="TimeVal">  <!-- uses name of parameter field -->
      <DiscretizeBin binValue="AM">
        <Interval closure="closedClosed" leftMargin="0" rightMargin="43199"/>
      </DiscretizeBin>
      <DiscretizeBin binValue="PM">
        <Interval closure="closedOpen" leftMargin="43200" rightMargin="86400"/>
      </DiscretizeBin>
    </Discretize>
  </DefineFunction>

  <!-- use function "AMPM" in a DerivedField -->
  <DerivedField name="Shift" dataType="string" optype="categorical">
    <Apply function="AMPM">
      <FieldRef field="StartTime"/>
    </Apply>
  </DerivedField>

  <!-- extract the hour from a time value -->
  <DerivedField name="StartHour" dataType="string" optype="categorical">
    <!-- built-in formatDatetime takes the input value first, then the pattern -->
    <Apply function="formatDatetime">
      <FieldRef field="StartTime"/>
      <Constant>%H</Constant>
    </Apply>
  </DerivedField>

</TransformationDictionary>
```

We assume that the field StartTime is defined with `dataType="timeSeconds"`. An actual time value such as "09:39:02" is then represented as the number 34742, that is, the number of seconds elapsed since midnight. The transformation in the function AMPM maps this value to the string "AM", which becomes the value of the field Shift. The input field StartTime is also used in the definition of the `DerivedField` StartHour. This categorical field has the value "09", produced by the date formatting function.
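
Since AMPM is a function, it can be applied to further fields without repeating the `Discretize` definition. The sketch below assumes a second field `EndTime`, also with `dataType="timeSeconds"`; both `EndTime` and the derived field name `EndShift` are hypothetical.

```xml
<!-- EndTime is a hypothetical second field with dataType="timeSeconds" -->
<DerivedField name="EndShift" dataType="string" optype="categorical">
  <Apply function="AMPM">
    <FieldRef field="EndTime"/>
  </Apply>
</DerivedField>
```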

Note that we use the notation `<Constant>%H</Constant>` for constants. This notation is shorter and easier to handle than the combination `<Constant><Value value="%H"/></Constant>`.

In general, the application of a function looks like:

```xml
<Apply function="MyFunc">
  <!-- parameter expression 1 -->
  <!-- parameter expression 2 -->
  <!-- ... -->
  <!-- parameter expression n -->
</Apply>
```

The expressions are mapped by position to the arguments in the function definition.

#### Another example:

```xml
<DefineFunction name="STATEGROUP" dataType="string" optype="categorical">
  <ParameterField name="#1" optype="categorical" dataType="string"/>
  <MapValues outputColumn="Region">
    <FieldColumnPair field="#1" column="State"/>
    <InlineTable>
      <row><State>CA</State><Region>West</Region></row>
      <row><State>OR</State><Region>West</Region></row>
      <row><State>NC</State><Region>East</Region></row>
    </InlineTable>
  </MapValues>
</DefineFunction>
```

```xml
<DerivedField name="Group" dataType="string" optype="categorical">
  <Apply function="STATEGROUP">
    <FieldRef field="State"/>
  </Apply>
</DerivedField>
```

The new function with name STATEGROUP accepts one argument. The definition uses the transformation `MapValues`. For example, if the function is applied with a value "CA", the result is the string `West`. The example also defines a new `DerivedField` with name Group as the result of applying the function STATEGROUP to the field State.
