Data Mining Group - Functions

PMML 3.2 - Definition and Application of Functions

PMML provides a number of predefined functions that support fine-grained transformations such as changing characters to upper case or converting date and time values to strings. The predefined functions are built into PMML because they cannot be defined by expressions in PMML itself or because a definition would be too complex.

Without support for such functions, an application would have to perform the transformations before using a PMML model. The transformations applied when the model was created must be equivalent to those applied when the model is used on new data. By integrating some of the transformations directly into the PMML model, the definition and execution of the data flow become less error-prone.

PMML also supports the definition of new functions that have other PMML expressions in the function body. Such a function represents a parameterized expression. The semantics of applying a 'user-defined' function in PMML is:

1. substitute the actual argument values for the formal function parameters, and then
2. replace the function application by the new expression.

That is, the function definitions are just a means for writing certain expressions in a more compact way.
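
As an illustrative sketch of this substitution, consider the fragment below. The function name toFahrenheit, the parameter name c, and the field TempC are hypothetical and are not defined by PMML. The fragment shows a definition, one application of it, and the expression that the application stands for.


  <DefineFunction name="toFahrenheit" optype="continuous" dataType="double">
    <ParameterField name="c" optype="continuous" dataType="double" />
    <!-- function body: c * 1.8 + 32 -->
    <Apply function="+">
      <Apply function="*">
        <FieldRef field="c" />
        <Constant>1.8</Constant>
      </Apply>
      <Constant>32</Constant>
    </Apply>
  </DefineFunction>

  <!-- application of the function to the (assumed) field TempC -->
  <Apply function="toFahrenheit">
    <FieldRef field="TempC" />
  </Apply>

  <!-- equivalent expression after substituting TempC for the parameter c -->
  <Apply function="+">
    <Apply function="*">
      <FieldRef field="TempC" />
      <Constant>1.8</Constant>
    </Apply>
    <Constant>32</Constant>
  </Apply>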

A function can be applied to one or more other expressions such as constants, fields, or results of transformations (see the group EXPRESSION in Transformations.html). When a function is applied, the actual arguments are identified by position. A function application is itself a PMML transformation expression, so function invocations can be nested.

In order to allow a single function specification to apply to multiple dataTypes, functions are assumed to inherit the dataType of the current input parameters unless otherwise specified. For example, the built-in function "+" can be applied to the integer, float, or double dataTypes. When the input parameters have different dataTypes, the least restrictive dataType is inherited by default. An explicit dataType for the function needs to be defined if the PMML producer wants to override the default dataType inheritance. The inheritance precedence for mixed-type input parameters, from least to most restrictive, is as follows:

string

double

float

integer

For example, if an integer and a double parameter are used with the "+" function, the output dataType will by default be double. The date, time, and dateTime dataTypes, as well as the boolean dataType, cannot be mixed with other types in a defined function without explicitly defining the expected output dataType.
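
The default inheritance can be sketched as follows. The function name addValues and the fields Count (assumed to be integer) and Price (assumed to be double) are hypothetical. This is only one reading of the rule above, not a normative example.


  <!-- no dataType attribute: the result inherits the least restrictive
       dataType among the actual arguments -->
  <DefineFunction name="addValues" optype="continuous">
    <ParameterField name="a" optype="continuous" />
    <ParameterField name="b" optype="continuous" />
    <Apply function="+">
      <FieldRef field="a" />
      <FieldRef field="b" />
    </Apply>
  </DefineFunction>

  <!-- Count is integer and Price is double, so the result is double -->
  <DerivedField name="Total" optype="continuous" dataType="double">
    <Apply function="addValues">
      <FieldRef field="Count" />
      <FieldRef field="Price" />
    </Apply>
  </DerivedField>

Under the same assumptions, applying addValues to two integer fields would yield an integer. A producer that always wants a double result would add dataType="double" to the DefineFunction element.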

For ParameterFields, both dataType and optype are optional. When the specified dataType and the dataType of the expression match, the expected behavior is straightforward. However, if the expression does not match the specified dataType, or if the dataType is not specified, further clarification of the expected behavior is needed. Similar to the handling of the function output, when the dataType is not specified, ParameterFields are assumed to inherit the dataType of the expression used for the definition. For example, if a ParameterField is specified as a FieldRef expression referring to the "prodgroup" field defined in the DataDictionary, the ParameterField for the function inherits the dataType of "prodgroup".

In cases where the specified dataType does not match the expression, implicit casts are allowed in the direction of less restriction, according to the precedence list above. If the dataType for the ParameterField is more restrictive than the expression dataType, the PMML document is not valid, since it is not clear in advance that the cast can be properly made. For example, an integer expression can be implicitly cast as a double by a ParameterField dataType, but a double expression cannot be implicitly cast as an integer.
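
A short sketch of the two cases, using the hypothetical function names half and isLarge and the hypothetical fields Count (assumed integer) and Price (assumed double):


  <DefineFunction name="half" optype="continuous" dataType="double">
    <ParameterField name="x" optype="continuous" dataType="double" />
    <Apply function="/">
      <FieldRef field="x" />
      <Constant>2</Constant>
    </Apply>
  </DefineFunction>

  <!-- valid: the integer field Count is implicitly cast to double -->
  <Apply function="half">
    <FieldRef field="Count" />
  </Apply>

  <DefineFunction name="isLarge" optype="categorical" dataType="string">
    <ParameterField name="n" optype="continuous" dataType="integer" />
    <Discretize field="n">
      <DiscretizeBin binValue="small">
        <Interval closure="openOpen" rightMargin="100" />
      </DiscretizeBin>
      <DiscretizeBin binValue="large">
        <Interval closure="closedOpen" leftMargin="100" />
      </DiscretizeBin>
    </Discretize>
  </DefineFunction>

  <!-- not valid: the double field Price would need a cast to the more
       restrictive dataType integer -->
  <Apply function="isLarge">
    <FieldRef field="Price" />
  </Apply>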

Schema

The XML Schema for the definition and application of functions is:


  <xs:element name="DefineFunction">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="ParameterField" minOccurs="1" maxOccurs="unbounded" />
        <xs:group ref="EXPRESSION" />
      </xs:sequence>
      <xs:attribute name="name" type="xs:string" use="required"/>
      <xs:attribute name="optype" type="OPTYPE" use="required"/>
      <xs:attribute name="dataType" type="DATATYPE" />
    </xs:complexType>
  </xs:element>

  <xs:element name="ParameterField">
    <xs:complexType>
      <xs:attribute name="name" type="xs:string" use="required" />
      <xs:attribute name="optype" type="OPTYPE" />
      <xs:attribute name="dataType" type="DATATYPE" />
    </xs:complexType>
  </xs:element>

  <xs:element name="Apply">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:group ref="EXPRESSION" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="function" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>

The element Apply defines the application of a function. The function itself is identified by name in attribute function. The actual parameters of the function application are given in the content of the element. Each actual argument value is given by an EXPRESSION. The actual arguments are mapped by position to the formal parameters in the corresponding function definition.

The EXPRESSION in the content of DefineFunction is the function body that actually defines the meaning of the new function.

The function's name must be unique: it must not conflict with the name of any other function, whether built into PMML or user-defined.

The function body must not refer to fields other than the parameter fields.
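
As a sketch of this restriction, a value that the body needs from another field is passed in as an additional argument rather than referenced directly. The function names scaleByInvalid and scaleBy and the fields Price and Factor are hypothetical.


  <!-- not allowed: the body refers to Factor, which is not a parameter -->
  <DefineFunction name="scaleByInvalid" optype="continuous" dataType="double">
    <ParameterField name="amount" optype="continuous" dataType="double" />
    <Apply function="*">
      <FieldRef field="amount" />
      <FieldRef field="Factor" />
    </Apply>
  </DefineFunction>

  <!-- allowed: the factor is declared as a second parameter -->
  <DefineFunction name="scaleBy" optype="continuous" dataType="double">
    <ParameterField name="amount" optype="continuous" dataType="double" />
    <ParameterField name="factor" optype="continuous" dataType="double" />
    <Apply function="*">
      <FieldRef field="amount" />
      <FieldRef field="factor" />
    </Apply>
  </DefineFunction>

  <!-- the field Factor is supplied by the caller -->
  <Apply function="scaleBy">
    <FieldRef field="Price" />
    <FieldRef field="Factor" />
  </Apply>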

Example applying a built-in function

Data cleansing is one of the common tasks done in preparing data for mining. Some of these operations can be supported directly in a PMML model. The following example demonstrates how to convert string values to upper case by applying the built-in function upper-case. Assuming that the original input data contains names of product groups and the names are provided in the field "prodgroup", we define a new DerivedField named PGNorm where all values use upper case characters.


  <DerivedField name="PGNorm" >
    <Apply function="upper-case" >
      <FieldRef field="prodgroup" />
    </Apply>
  </DerivedField>

That is, when the value of the field prodgroup is, e.g., Non-Food, the value of the field PGNorm becomes NON-FOOD.

A DerivedField can contain a transformation expression such as MapValues or Discretize. The element Apply is just another transformation expression.

Example of a user-defined function

A DerivedField can be defined by a possibly complex transformation. If a certain transformation has to be applied to multiple fields it makes sense to encapsulate the definition of the transformation expression in a function and then apply the function multiple times. This reduces the complexity and the size of PMML models.

New user-defined functions can be specified in a model using the element DefineFunction in the TransformationDictionary.

Example:


  <TransformationDictionary>
  ...

  <!-- define a new function called "AMPM" -->
  <DefineFunction name="AMPM" optype="categorical" dataType="string">
    <!-- result type is "string" -->

    <!-- declaration of formal parameters -->
    <ParameterField name="TimeVal" optype="continuous" dataType="integer" />
    <!-- there can be more than one parameter field -->

    <!-- The function body can be any expression-->
    <!-- Parameter names are used like field names in the expression -->

    <Discretize field="TimeVal">  <!-- uses name of parameter field -->
      <DiscretizeBin binValue="AM">
        <Interval closure="closedClosed" leftMargin="0" rightMargin="43199" />
      </DiscretizeBin>
      <DiscretizeBin binValue="PM">
         <Interval closure="closedOpen" leftMargin="43200" rightMargin="86400"/>
      </DiscretizeBin>
    </Discretize>
  </DefineFunction>

  <!-- use function "AMPM" in a DerivedField -->
  <DerivedField name="Shift" optype="categorical">
    <Apply function="AMPM">
        <FieldRef field="StartTime"/>
    </Apply>
  </DerivedField>

  <!-- extract the hour from a time value -->
  <DerivedField name="StartHour" optype="categorical" >
    <Apply function="format-datetime" >
      <Constant>%H</Constant>
      <FieldRef field="StartTime"/>
    </Apply>
  </DerivedField>
  ...
  </TransformationDictionary>

We assume that the field StartTime is defined with dataType="timeSeconds". An actual time value 09:39:02 would be represented as the number 34742, that is, the number of seconds since midnight. The transformation in the function AMPM maps this value to the string AM, which becomes the actual value of the field Shift. The input field StartTime is also used in the definition of the DerivedField StartHour. This categorical field has the actual value 09 produced by the date formatting function.

Note that we use the notation <Constant>%H</Constant> for constants. This notation is shorter and easier to handle than the combination <Constant><Value value="%H"/></Constant>.

In general, the application of a function looks like this:


  <Apply function="MyFunc" >
    parameter expression 1
    parameter expression 2
       ...  
    parameter expression n
  </Apply>

The expressions are mapped by position to the arguments in the function definition.
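
To make the positional mapping concrete, here is a small sketch with two parameters. The function name ratio and the fields Revenue and Cost are hypothetical.


  <DefineFunction name="ratio" optype="continuous" dataType="double">
    <ParameterField name="numerator" optype="continuous" dataType="double" />
    <ParameterField name="denominator" optype="continuous" dataType="double" />
    <Apply function="/">
      <FieldRef field="numerator" />
      <FieldRef field="denominator" />
    </Apply>
  </DefineFunction>

  <!-- first argument binds to numerator, second to denominator:
       the result is Revenue / Cost -->
  <Apply function="ratio">
    <FieldRef field="Revenue" />
    <FieldRef field="Cost" />
  </Apply>

Swapping the two FieldRef elements would compute Cost / Revenue instead, since parameter names play no role at the point of application.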

Another example:


  <DefineFunction name="STATEGROUP" optype="categorical" dataType="string" >
    <ParameterField name="#1" optype="categorical" dataType="string" />
    <MapValues outputColumn="Region" >
      <FieldColumnPair field="#1" column="State" />
      <InlineTable>
         <row><State>CA</State><Region>West</Region></row>
         <row><State>OR</State><Region>West</Region></row>
         <row><State>NC</State><Region>East</Region></row>
      </InlineTable>
    </MapValues>
  </DefineFunction>

  <DerivedField name="Group" optype="categorical" >
     <Apply function="STATEGROUP" >
         <FieldRef field="State"/>
     </Apply>
  </DerivedField>

The new function with name STATEGROUP accepts one argument. The definition uses the transformation MapValues. For example, if the function is applied to the value CA, the result is the string West.
The example also defines a new DerivedField with name Group as the result of applying the function STATEGROUP to the field State.