Using Functions

PMML 3.0 - Definition and application of functions

PMML provides a number of predefined functions that support fine-grained transformations such as changing characters to upper case or converting date and time values to strings. The predefined functions are built into PMML because they cannot be defined by expressions in PMML itself or because a definition would be too complex.

Without support for such functions, an application would have to perform the transformations before using a PMML model. The transformations that were applied when the model was created must be equivalent to the transformations applied when the model is used on new data. By integrating some of the transformations directly into the PMML model, the definition and execution of the data flow become less error-prone.

PMML also supports the definition of new functions whose function body is another PMML expression. Such a function represents a parameterized expression. Applying a 'user-defined' function in PMML means:

  1. substitute the formal function parameters by the actual argument values, and then
  2. replace the function application by the new expression.
That is, the function definitions are just a means for writing certain expressions in a more compact way.
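
The two steps can be illustrated with a minimal sketch. The function name "SIGN", the parameter name "x" and the field "balance" are hypothetical and serve only as an illustration. The sketch assumes that an Interval without a left or right margin is unbounded on that side.

  <!-- hypothetical definition: the body uses only the parameter field "x" -->
  <DefineFunction name="SIGN" dataType="string">
    <ParameterField name="x" optype="continuous" />
    <Discretize field="x">
      <DiscretizeBin binValue="negative">
        <Interval closure="openOpen" rightMargin="0" />
      </DiscretizeBin>
      <DiscretizeBin binValue="non-negative">
        <Interval closure="closedOpen" leftMargin="0" />
      </DiscretizeBin>
    </Discretize>
  </DefineFunction>

  <!-- application of the function to the field "balance" -->
  <Apply function="SIGN">
    <FieldRef field="balance" />
  </Apply>

  <!-- equivalent expression: "x" is substituted by "balance" and the -->
  <!-- function application is replaced by the resulting function body -->
  <Discretize field="balance">
    <DiscretizeBin binValue="negative">
      <Interval closure="openOpen" rightMargin="0" />
    </DiscretizeBin>
    <DiscretizeBin binValue="non-negative">
      <Interval closure="closedOpen" leftMargin="0" />
    </DiscretizeBin>
  </Discretize>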

A function can be applied to one or more other expressions such as constants, fields or results of transformations; see the group EXPRESSION in Transformations.html. When a function is applied, the actual arguments are identified by position. A function application is itself a PMML transformation expression, so function invocations can be nested.
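
A nested invocation might look like the following sketch. It assumes the built-in function upper-case and the user-defined function "STATEGROUP" from the examples further below. The combination itself is only an illustration: the inner application normalizes the value of the field "State" before the outer function maps it to a region.

  <DerivedField name="Group" optype="categorical" >
    <Apply function="STATEGROUP" >
      <!-- the result of the inner application is the argument of STATEGROUP -->
      <Apply function="upper-case" >
        <FieldRef field="State" />
      </Apply>
    </Apply>
  </DerivedField>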

Schema

The XML Schema for the definition and application of functions is

  <xs:element name="DefineFunction">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:element ref="ParameterField" minOccurs="1" maxOccurs="unbounded" />
        <xs:group ref="EXPRESSION" />
      </xs:sequence>
      <xs:attribute name="name" type="xs:string" use="required"/>
      <xs:attribute name="optype" type="OPTYPE" />
      <xs:attribute name="dataType" type="DATATYPE" />
    </xs:complexType>
  </xs:element>

  <xs:element name="ParameterField">
    <xs:complexType>
      <xs:attribute name="name" type="xs:string" use="required" />
      <xs:attribute name="optype" type="OPTYPE" use="required" />
    </xs:complexType>
  </xs:element>

  <xs:element name="Apply">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
        <xs:group ref="EXPRESSION" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
      <xs:attribute name="function" type="xs:string" />
    </xs:complexType>
  </xs:element>

The element Apply defines the application of a function. The function itself is identified by a name. The actual parameters of the function application are given in the content of the element. Each actual argument value is given by an EXPRESSION. The actual arguments are mapped by position to the formal parameters in the corresponding function definition.

The EXPRESSION in the content of DefineFunction is the function body that actually defines the meaning of the new function.

The function body must not refer to fields other than the parameter fields.
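
The following sketch illustrates this restriction. The names "NETPRICE", "price", "discount", "ListPrice" and "Discount" are hypothetical, and the sketch assumes that "-" is available as a built-in arithmetic function. A value that the body needs in addition to the first parameter has to be passed as a further parameter; the actual arguments are then mapped by position.

  <!-- NOT allowed: "Discount" is not a parameter field of the function -->
  <DefineFunction name="NETPRICE" dataType="double">
    <ParameterField name="price" optype="continuous" />
    <Apply function="-">
      <FieldRef field="price" />
      <FieldRef field="Discount" />
    </Apply>
  </DefineFunction>

  <!-- allowed: the second value is declared as an additional parameter -->
  <DefineFunction name="NETPRICE" dataType="double">
    <ParameterField name="price" optype="continuous" />
    <ParameterField name="discount" optype="continuous" />
    <Apply function="-">
      <FieldRef field="price" />
      <FieldRef field="discount" />
    </Apply>
  </DefineFunction>

  <!-- "ListPrice" is mapped to "price", "Discount" is mapped to "discount" -->
  <Apply function="NETPRICE">
    <FieldRef field="ListPrice" />
    <FieldRef field="Discount" />
  </Apply>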

Example applying a built-in function

Data cleansing is one of the common tasks in preparing data for mining. Some of these operations can be supported directly in a PMML model. The following example demonstrates how to convert string values to upper case by applying the built-in function upper-case. Assuming that the original input data contains names of product groups and the names are provided in the field "prodgroup", we define a new derived field named "PGNorm" where all values use upper case characters.

  <DerivedField name="PGNorm" >
    <Apply function="upper-case" >
      <FieldRef field="prodgroup" />
    </Apply>
  </DerivedField>

That is, when the value of the field "prodgroup" is, e.g., "Non-Food" the value of the field "PGNorm" becomes "NON-FOOD".

A DerivedField can contain a transformation expression such as <MapValues> or <Discretize>. The element <Apply> is just another transformation expression.

Example for user-defined function

A derived field can be defined by a possibly complex transformation. If a certain transformation has to be applied to multiple fields it makes sense to encapsulate the definition of the transformation expression in a function and then apply the function multiple times. This reduces the complexity and the size of PMML models.

New 'user-defined' functions can be specified in a model using the element DefineFunction in the transformation dictionary.

Examples:


  <TransformationDictionary>
  ...

  <!-- define a new function called "AMPM" -->
  <DefineFunction name="AMPM" dataType="string">
    <!-- result type is "string" -->

    <!-- declaration of formal parameters -->
    <ParameterField name="TimeVal" optype="continuous" />
    <!-- there can be more than one parameter field -->

    <!-- The function body can be any expression-->
    <!-- Parameter names are used like field names in the expression -->

    <Discretize field="TimeVal">  <!-- uses name of parameter field -->
      <DiscretizeBin binValue="AM">
        <Interval closure="closedClosed" leftMargin="0" rightMargin="43199" />
      </DiscretizeBin>
      <DiscretizeBin binValue="PM">
         <Interval closure="closedOpen" leftMargin="43200" rightMargin="86400"/>
      </DiscretizeBin>
    </Discretize>
  </DefineFunction>

  <!-- use function "AMPM" in a DerivedField -->
  <DerivedField name="Shift" optype="categorical">
    <Apply function="AMPM">
        <FieldRef field="StartTime"/>
    </Apply>
  </DerivedField>

  <!-- extract the hour from a time value -->
  <DerivedField name="StartHour" optype="categorical" >
    <Apply function="format-datetime" >
      <Constant>%H</Constant>
      <FieldRef field="StartTime"/>
    </Apply>
  </DerivedField>
  ...
  </TransformationDictionary>

Example: We assume that the field "StartTime" is defined with dataType="timeSeconds". An actual time value '09:39:02' is then represented as the number 34742, that is, the number of seconds since midnight at the given point in time. The transformation in the function AMPM maps this value to the string "AM", which becomes the actual value of the field "Shift". The input field "StartTime" is also used in the definition of the derived field "StartHour". This categorical field has the actual value "09", produced by the date formatting function.
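
The number 34742 follows from converting hours and minutes to seconds:

  9 * 3600 + 39 * 60 + 2 = 32400 + 2340 + 2 = 34742

This value lies within the first bin [0, 43199] of the function AMPM, so the result is "AM". The second bin starts at 43200 = 12 * 3600, that is, at noon.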

Note that we use the notation <Constant>%H</Constant> for constants. This notation is shorter and easier to handle than the combination <Constant><Value value="%H"/></Constant>.

In general, the application of a function looks like


  <Apply function="MyFunc" >
    parameter expression 1
    parameter expression 2
       ...  
    parameter expression n
  </Apply>

The expressions are mapped by position to the arguments in the function definition.

Another example:
The new function with name "STATEGROUP" accepts one argument. The definition uses the transformation MapValues. For example, if the function is applied with a value "CA" the result is the string "West". The example also defines a new derived field with name "Group" as the result of applying the function "STATEGROUP" to the field "State".


  <DefineFunction name="STATEGROUP" dataType="string" >
    <ParameterField name="#1" optype="categorical" />
    <MapValues outputColumn="Region" optype="categorical">
      <FieldColumnPair field="#1" column="State"/>
      <InlineTable>
         <row><State>CA</State><Region>West</Region></row>
         <row><State>OR</State><Region>West</Region></row>
         <row><State>NC</State><Region>East</Region></row>
      </InlineTable>
    </MapValues>
  </DefineFunction>

  <DerivedField name="Group" optype="categorical" >
     <Apply function="STATEGROUP" >
         <FieldRef field="State"/>
     </Apply>
  </DerivedField>
