PMML 4.1 - Definition and Application of Functions

PMML provides a number of predefined functions that support fine-grained transformations such as changing characters to upper case or converting date and time values to strings. The predefined functions are built into PMML because they cannot be defined by expressions in PMML itself or because a definition would be too complex.

Without support for such functions an application would have to perform the transformations before using a PMML model. The transformations that were applied when the model was created must be equivalent to those applied when the model is used on new data. By integrating some of the transformations directly into the PMML model, the definition and execution of the data flow become less error-prone.

PMML also supports the definition of new functions that have other PMML expressions in the function body. Such a function represents a parametrized expression. The semantics of applying a 'user-defined' function in PMML are:

  1. substitute the formal function parameters by the actual argument values, and then
  2. replace the function application by the new expression.
That is, the function definitions are just a means for writing certain expressions in a more compact way.
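
The substitution can be illustrated with a minimal sketch that uses the elements DefineFunction and Apply described in the Schema section below. The function name twice and the field income are hypothetical and serve only as illustration.

<!-- hypothetical function: multiplies its argument by 2 -->
<DefineFunction name="twice" optype="continuous" dataType="double">
  <ParameterField name="x" optype="continuous" dataType="double"/>
  <Apply function="*">
    <FieldRef field="x"/>
    <Constant>2</Constant>
  </Apply>
</DefineFunction>

<!-- application of the function to the (assumed) field income -->
<Apply function="twice">
  <FieldRef field="income"/>
</Apply>

<!-- equivalent expression after substituting income for x -->
<Apply function="*">
  <FieldRef field="income"/>
  <Constant>2</Constant>
</Apply>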

A function can be applied to one or more other expressions such as constants, fields or results of transformations, see the group EXPRESSION in Transformations.html. When a function is applied, the actual arguments are identified by position. A function application itself is a PMML transformation expression. That is, there can be nested invocations of functions.
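
As a sketch of a nested invocation, the following expression computes price * quantity + shipping using the built-in functions "*" and "+". The field names price, quantity and shipping are assumptions made for this illustration. The arguments of each Apply are matched by position, so the inner Apply is the first argument of "+".

<Apply function="+">
  <!-- first argument: the result of another function application -->
  <Apply function="*">
    <FieldRef field="price"/>
    <FieldRef field="quantity"/>
  </Apply>
  <!-- second argument: a plain field reference -->
  <FieldRef field="shipping"/>
</Apply>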

In order to allow a single function specification to be applicable for multiple dataTypes, functions are assumed to inherit the dataType of the current input parameters unless otherwise specified. For example, the built-in function "+" can be applied to the integer, float, or double dataTypes. When the input parameters have multiple dataTypes, the least restrictive dataType will be inherited by default. An explicit dataType for the function needs to be defined if the PMML producer wants to override the default dataType inheritance. The inheritance precedence for mixed type input parameters is as follows:

  1. string
  2. double
  3. float
  4. integer

For example, if an integer and a double parameter are used with the "+" function, by default the output dataType will be a double. The various date, time and dateTime as well as boolean dataTypes cannot be mixed with other types in a defined function without explicitly defining the expected output dataType.
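
The default inheritance might look as follows in a function definition. This is a sketch only: the function name total and the parameter names units and surcharge are hypothetical. Because DefineFunction carries no dataType attribute and the parameters are an integer and a double, the result is taken to be a double.

<!-- no dataType attribute: the result dataType is inherited -->
<DefineFunction name="total" optype="continuous">
  <ParameterField name="units" optype="continuous" dataType="integer"/>
  <ParameterField name="surcharge" optype="continuous" dataType="double"/>
  <!-- integer + double: double is less restrictive, so the result is double -->
  <Apply function="+">
    <FieldRef field="units"/>
    <FieldRef field="surcharge"/>
  </Apply>
</DefineFunction>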

For ParameterFields, both dataType and optype are optional. When the specified dataType and expression dataType match, the expected behavior is straightforward. However, if the expression does not match the specified dataType or the dataType is not specified, further clarification on the expected behavior is needed. Similar to the handling of the function output, when the dataType is not specified, ParameterFields are assumed to inherit the dataType of the expression used for the definition. For example, if a ParameterField is specified as a FieldRef expression referring to the prodgroup field defined in the DataDictionary, the ParameterField for the function inherits the dataType of prodgroup.

In cases where the specified dataType does not match the expression, implicit casts are allowed in the direction of less restriction, according to the precedence list above. If the dataType for the ParameterField is more restrictive than the expression dataType, the PMML document is not valid since it is not clear in advance that the cast can be properly made. For example, an integer expression can be implicitly cast as a double by a ParameterField dataType, but a double expression cannot be implicitly cast as an integer.
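
The two cases can be sketched as follows. The functions halve and successor are hypothetical, and the fields visits (integer) and amount (double) are assumed to be declared in the DataDictionary.

<DefineFunction name="halve" optype="continuous" dataType="double">
  <ParameterField name="x" optype="continuous" dataType="double"/>
  <Apply function="/">
    <FieldRef field="x"/>
    <Constant>2</Constant>
  </Apply>
</DefineFunction>

<!-- valid: the integer field visits is implicitly cast to double -->
<Apply function="halve">
  <FieldRef field="visits"/>
</Apply>

<DefineFunction name="successor" optype="continuous" dataType="integer">
  <ParameterField name="n" optype="continuous" dataType="integer"/>
  <Apply function="+">
    <FieldRef field="n"/>
    <Constant>1</Constant>
  </Apply>
</DefineFunction>

<!-- not valid: the double field amount cannot be implicitly cast to integer -->
<Apply function="successor">
  <FieldRef field="amount"/>
</Apply>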

Schema

The XML Schema for the definition and application of functions is
<xs:element name="DefineFunction">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="ParameterField" minOccurs="1" maxOccurs="unbounded"/>
      <xs:group ref="EXPRESSION"/>
    </xs:sequence>
    <xs:attribute name="name" type="xs:string" use="required"/>
    <xs:attribute name="optype" type="OPTYPE" use="required"/>
    <xs:attribute name="dataType" type="DATATYPE"/>
  </xs:complexType>
</xs:element>

<xs:element name="ParameterField">
  <xs:complexType>
    <xs:attribute name="name" type="xs:string" use="required"/>
    <xs:attribute name="optype" type="OPTYPE"/>
    <xs:attribute name="dataType" type="DATATYPE"/>
  </xs:complexType>
</xs:element>

<xs:element name="Apply">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="EXPRESSION" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="function" type="xs:string" use="required"/>
    <xs:attribute name="mapMissingTo" type="xs:string"/>
    <xs:attribute name="invalidValueTreatment" type="INVALID-VALUE-TREATMENT-METHOD" default="returnInvalid"/>
  </xs:complexType>
</xs:element>

The DefineFunction is used to define new (user-defined) functions as variations or compositions of existing functions or transformations. The function's name must be unique and must not conflict with other function names, either defined by PMML or other user-defined functions. The EXPRESSION in the content of DefineFunction is the function body that actually defines the meaning of the new function. The function body must not refer to fields other than the parameter fields.

The element Apply defines the application of a function. The function itself is identified by name with the function attribute. The actual parameters of the function application are given in the content of the element. Each actual argument is given by an EXPRESSION, and the arguments are mapped by position to the formal parameters in the corresponding function definition.

The optional attribute mapMissingTo defines the result value for the cases when the computed value for the function is a missing value. If it is not specified and the function produces a missing value then the result is a missing value. Note that there are functions (e.g. isMissing) that can produce a non-missing value for a missing input. For this reason, the value of the mapMissingTo attribute is used when the output of the function is missing, and not necessarily when any of the inputs is missing. This is contrary to other expressions (e.g. MapValues) where the value of this attribute is used in the presence of missing inputs.
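
A minimal sketch of mapMissingTo, using the hypothetical fields revenue and units: assuming that the built-in arithmetic functions return a missing value whenever one of their inputs is missing, the division below yields a missing value if either field is missing, and the Apply then returns 0 instead.

<DerivedField name="UnitPrice" dataType="double" optype="continuous">
  <!-- a missing result of the division is replaced by 0 -->
  <Apply function="/" mapMissingTo="0">
    <FieldRef field="revenue"/>
    <FieldRef field="units"/>
  </Apply>
</DerivedField>

In contrast, in the following application the attribute never takes effect, because isMissing produces a non-missing value even when units is missing:

<Apply function="isMissing" mapMissingTo="false">
  <FieldRef field="units"/>
</Apply>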

The application of a function may sometimes yield invalid results (e.g. a division by zero). The attribute invalidValueTreatment can be used to specify how such invalid values should be treated, as in the case of a MiningField. The default value returnInvalid causes the model to return a value indicating an invalid result. The value asMissing replaces the invalid value with a missing value. Finally, in the context of an Apply, the value asIs is equivalent to returnInvalid as an invalid value cannot be propagated further.
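
For instance, a division might be guarded as in the following sketch. The field names clicks and impressions are assumptions for illustration. When impressions is 0, the division yields an invalid result, which asMissing turns into a missing value.

<DerivedField name="ClickRate" dataType="double" optype="continuous">
  <!-- an invalid result (e.g. division by zero) becomes a missing value -->
  <Apply function="/" invalidValueTreatment="asMissing">
    <FieldRef field="clicks"/>
    <FieldRef field="impressions"/>
  </Apply>
</DerivedField>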

Example applying a built-in function

Data cleansing is one of the common tasks done in preparing data for mining. Some of these operations can be supported directly in a PMML model. The following example demonstrates how to convert string values to upper case by applying the built-in function uppercase.

Assuming that the original input data contains names of product groups and the names are provided in the field "prodgroup", we define a new DerivedField named PGNorm where all values use upper case characters.

<DerivedField name="PGNorm" dataType="string" optype="categorical">
  <Apply function="uppercase">
    <FieldRef field="prodgroup"/>
  </Apply>
</DerivedField>

That is, when the value of the field prodgroup is, e.g., "Non-Food", the value of the field PGNorm becomes "NON-FOOD".

A DerivedField can contain a transformation expression such as MapValues or Discretize. The element Apply is just another transformation expression.

Example of user-defined function

A DerivedField can be defined by a possibly complex transformation. If a certain transformation has to be applied to multiple fields it makes sense to encapsulate the definition of the transformation expression in a function and then apply the function multiple times. This reduces the complexity and the size of PMML models.

New user-defined functions can be specified in a model using the element DefineFunction in the TransformationDictionary.

Example:

<TransformationDictionary>

  <!-- define a new function called "AMPM" -->
  <DefineFunction name="AMPM" dataType="string" optype="categorical">
    <!-- result type is "string" -->

    <!-- declaration of formal parameters -->
    <ParameterField name="TimeVal" optype="continuous" dataType="integer"/>
    <!-- there can be more than one parameter field -->

    <!-- The function body can be any expression-->
    <!-- Parameter names are used like field names in the expression -->

    <Discretize field="TimeVal">  <!-- uses name of parameter field -->
      <DiscretizeBin binValue="AM">
        <Interval closure="closedClosed" leftMargin="0" rightMargin="43199"/>
      </DiscretizeBin>
      <DiscretizeBin binValue="PM">
        <Interval closure="closedOpen" leftMargin="43200" rightMargin="86400"/>
      </DiscretizeBin>
    </Discretize>
  </DefineFunction>

  <!-- use function "AMPM" in a DerivedField -->
  <DerivedField name="Shift" dataType="string" optype="categorical">
    <Apply function="AMPM">
      <FieldRef field="StartTime"/>
    </Apply>
  </DerivedField>

  <!-- extract the hour from a time value -->
  <DerivedField name="StartHour" dataType="string" optype="categorical">
    <Apply function="format-datetime">
      <Constant>%H</Constant>
      <FieldRef field="StartTime"/>
    </Apply>
  </DerivedField>

</TransformationDictionary>

We assume that the field StartTime is defined with dataType="timeSeconds". An actual time value such as "09:39:02" is represented as the number 34742, that is, the number of seconds since midnight. The transformation in the function AMPM maps this value to the string "AM", which becomes the value of the field Shift. The input field StartTime is also used in the definition of the DerivedField StartHour. This categorical field has the value "09", produced by the date formatting function.

Note that we use the notation <Constant>%H</Constant> for constants. This notation is shorter and easier to handle than the combination <Constant><Value value="%H"/></Constant>.

In general, the application of a function looks like:

<Apply function="MyFunc">
  <!-- parameter expression 1 -->
  <!-- parameter expression 2 -->
  <!-- ... -->
  <!-- parameter expression n -->
</Apply>
The expressions are mapped by position to the arguments in the function definition.

Another example:

<DefineFunction name="STATEGROUP" dataType="string" optype="categorical">
  <ParameterField name="#1" optype="categorical" dataType="string"/>
  <MapValues outputColumn="Region">
    <FieldColumnPair field="#1" column="State"/>
    <InlineTable>
      <row><State>CA</State><Region>West</Region></row>
      <row><State>OR</State><Region>West</Region></row>
      <row><State>NC</State><Region>East</Region></row>
    </InlineTable>
  </MapValues>
</DefineFunction>
<DerivedField name="Group" dataType="string" optype="categorical">
  <Apply function="STATEGROUP">
    <FieldRef field="State"/>
  </Apply>
</DerivedField>

The new function with name STATEGROUP accepts one argument. The definition uses the transformation MapValues. For example, if the function is applied to the value "CA", the result is the string "West". The example also defines a new DerivedField with name Group as the result of applying the function STATEGROUP to the field State.
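
Once defined, the same function can be applied to several fields, which is the main motivation for encapsulating a transformation. The following sketch assumes two hypothetical input fields, ShipState and BillState, that hold state codes.

<!-- ShipState and BillState are assumed fields, not part of the example above -->
<DerivedField name="ShipRegion" dataType="string" optype="categorical">
  <Apply function="STATEGROUP">
    <FieldRef field="ShipState"/>
  </Apply>
</DerivedField>
<DerivedField name="BillRegion" dataType="string" optype="categorical">
  <Apply function="STATEGROUP">
    <FieldRef field="BillState"/>
  </Apply>
</DerivedField>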
