PMML 3.2 - Definition and Application of Functions
PMML provides a number of predefined functions that support fine-grained transformations such as changing characters to upper case or converting date and time values to strings. The predefined functions are built into PMML because they cannot be defined by expressions in PMML itself or because a definition would be too complex.
Without support for such functions, an application would have to perform the transformations before using a PMML model. The transformations applied when the model is used on new data must be equivalent to those applied when the model was created. By integrating some of the transformations directly into the PMML model, the definition and execution of the data flow become less error-prone.
PMML also supports the definition of new functions whose body consists of other PMML expressions. Such a function represents a parameterized expression. The semantics of applying a 'user-defined' function in PMML are:
- substitute the formal function parameters by the actual argument values, and then
- replace the function application by the new expression.
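The two steps can be sketched with a small fragment. The function name TWICE, the parameter name "x", and the field name "income" are assumptions made for illustration only; they are not defined by PMML. The body uses the built-in function "+", and the last expression shows what the application should be equivalent to after substitution.

```xml
<!-- hypothetical user-defined function: TWICE(x) = x + x -->
<DefineFunction name="TWICE" optype="continuous" dataType="double">
  <ParameterField name="x" optype="continuous" dataType="double" />
  <Apply function="+">
    <FieldRef field="x" />
    <FieldRef field="x" />
  </Apply>
</DefineFunction>

<!-- the function application as written in the model -->
<Apply function="TWICE">
  <FieldRef field="income" />
</Apply>

<!-- equivalent expression after substituting "income" for "x" -->
<Apply function="+">
  <FieldRef field="income" />
  <FieldRef field="income" />
</Apply>
```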
A function can be applied to one or more other expressions such as constants, fields, or results of transformations; see the group EXPRESSION in Transformations.html. When a function is applied, the actual arguments are identified by position. A function application is itself a PMML transformation expression, so function invocations can be nested.
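As a minimal sketch of a nested invocation, the following expression computes (a + b) + c with the built-in function "+". The field names "a", "b", and "c" are hypothetical numeric fields assumed for this illustration. The inner Apply is the first argument of the outer Apply, and "c" is the second.

```xml
<!-- (a + b) + c: arguments are matched by position -->
<Apply function="+">
  <Apply function="+">
    <FieldRef field="a" />
    <FieldRef field="b" />
  </Apply>
  <FieldRef field="c" />
</Apply>
```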
To allow a single function specification to be applicable to multiple dataTypes, functions are assumed to inherit the dataType of the current input parameters unless otherwise specified. For example, the built-in function "+" can be applied to the integer, float, or double dataTypes. When the input parameters have multiple dataTypes, the least restrictive dataType is inherited by default. An explicit dataType for the function needs to be defined if the PMML producer wants to override the default dataType inheritance. The inheritance precedence for mixed type input parameters is as follows:
For example, if an integer and a double parameter are used with the "+" function, by default the output dataType will be a double. The various date, time and dateTime as well as boolean dataTypes cannot be mixed with other types in a defined function without explicitly defining the expected output dataType.
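The default inheritance and an explicit override might look as follows. This is a sketch under assumptions: the function names SUM2 and SUM2_AS_DOUBLE and the parameter names "p1" and "p2" are hypothetical. With SUM2, applying the function to an integer and a double argument would yield a double by default, as described above. With SUM2_AS_DOUBLE, the producer states the result dataType explicitly.

```xml
<!-- no dataType on the function or its parameters:
     the result dataType is inherited from the actual arguments -->
<DefineFunction name="SUM2" optype="continuous">
  <ParameterField name="p1" />
  <ParameterField name="p2" />
  <Apply function="+">
    <FieldRef field="p1" />
    <FieldRef field="p2" />
  </Apply>
</DefineFunction>

<!-- explicit dataType on the function:
     the result is declared as double regardless of the default inheritance -->
<DefineFunction name="SUM2_AS_DOUBLE" optype="continuous" dataType="double">
  <ParameterField name="p1" />
  <ParameterField name="p2" />
  <Apply function="+">
    <FieldRef field="p1" />
    <FieldRef field="p2" />
  </Apply>
</DefineFunction>
```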
For ParameterFields, both dataType and optype are optional. When the specified dataType and the expression dataType match, the expected behavior is straightforward. However, if the expression does not match the specified dataType, or if the dataType is not specified, the expected behavior needs further clarification. Similar to the handling of the function output, when the dataType is not specified, ParameterFields are assumed to inherit the dataType of the expression used for the definition. For example, if a ParameterField is specified as a FieldRef expression referring to the "prodgroup" field defined in the DataDictionary, the ParameterField for the function inherits the dataType of "prodgroup".
In cases where the specified dataType does not match the expression, implicit casts are allowed in the direction of less restriction, according to the precedence list above. If the dataType of the ParameterField is more restrictive than the expression dataType, the PMML document is not valid, since it is not clear in advance that the cast can be made properly. For example, an integer expression can be implicitly cast to a double by a ParameterField dataType, but a double expression cannot be implicitly cast to an integer.
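The cast rule can be illustrated with a short sketch. The function name PLUS1, the parameter name "v", and the field names "visits" (assumed to be an integer field) and "amount" (assumed to be a double field) are hypothetical and serve only as an illustration.

```xml
<!-- hypothetical function whose parameter is declared as double -->
<DefineFunction name="PLUS1" optype="continuous" dataType="double">
  <ParameterField name="v" optype="continuous" dataType="double" />
  <Apply function="+">
    <FieldRef field="v" />
    <Constant>1</Constant>
  </Apply>
</DefineFunction>

<!-- valid: the integer field "visits" is implicitly cast to double,
     which is a cast in the direction of less restriction -->
<Apply function="PLUS1">
  <FieldRef field="visits" />
</Apply>

<!-- not valid: if "v" were declared with dataType="integer" and the
     double field "amount" were passed, the document would be invalid,
     because double to integer is not an allowed implicit cast -->
```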
Schema
The XML Schema for the definition and application of functions is
<xs:element name="DefineFunction">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      <xs:element ref="ParameterField" minOccurs="1" maxOccurs="unbounded" />
      <xs:group ref="EXPRESSION" />
    </xs:sequence>
    <xs:attribute name="name" type="xs:string" use="required"/>
    <xs:attribute name="optype" type="OPTYPE" use="required"/>
    <xs:attribute name="dataType" type="DATATYPE" />
  </xs:complexType>
</xs:element>
<xs:element name="ParameterField">
  <xs:complexType>
    <xs:attribute name="name" type="xs:string" use="required" />
    <xs:attribute name="optype" type="OPTYPE" />
    <xs:attribute name="dataType" type="DATATYPE" />
  </xs:complexType>
</xs:element>
<xs:element name="Apply">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      <xs:group ref="EXPRESSION" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
    <xs:attribute name="function" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>
The EXPRESSION in the content of DefineFunction is the function body that actually defines the meaning of the new function.
The function's name must be unique and must not conflict with other function names either defined by PMML or other user-defined functions.
The function body must not refer to fields other than the parameter fields.
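The restriction on the function body can be sketched as follows. The function names BAD_UPPER and GOOD_UPPER and the parameter name "s" are hypothetical; "upper-case" and "prodgroup" are the names used elsewhere in this document. The first definition would violate the rule because its body refers to a field that is not a parameter field; the second refers only to its own parameter.

```xml
<!-- not valid: the body refers to "prodgroup", which is not a parameter field -->
<DefineFunction name="BAD_UPPER" optype="categorical" dataType="string">
  <ParameterField name="s" optype="categorical" dataType="string" />
  <Apply function="upper-case">
    <FieldRef field="prodgroup" />
  </Apply>
</DefineFunction>

<!-- valid: the body refers only to the parameter field "s" -->
<DefineFunction name="GOOD_UPPER" optype="categorical" dataType="string">
  <ParameterField name="s" optype="categorical" dataType="string" />
  <Apply function="upper-case">
    <FieldRef field="s" />
  </Apply>
</DefineFunction>
```

The field "prodgroup" can still be processed by passing it as the actual argument when GOOD_UPPER is applied.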
Example applying a built-in function
Data cleansing is one of the common tasks done in preparing data for mining. Some of these operations can be supported directly in a PMML model. The following example demonstrates how to convert string values to upper case by applying the built-in function upper-case. Assuming that the original input data contains names of product groups and the names are provided in the field "prodgroup", we define a new DerivedField named PGNorm where all values use upper case characters.
<DerivedField name="PGNorm" >
  <Apply function="upper-case" >
    <FieldRef field="prodgroup" />
  </Apply>
</DerivedField>
A DerivedField can contain a transformation expression such as MapValues or Discretize. The element Apply is just another transformation expression.
Example for user-defined function
A DerivedField can be defined by a possibly complex transformation. If a certain transformation has to be applied to multiple fields it makes sense to encapsulate the definition of the transformation expression in a function and then apply the function multiple times. This reduces the complexity and the size of PMML models.
New user-defined functions can be specified in a model using the element DefineFunction in the TransformationDictionary.
<!-- define a new function called "AMPM" -->
<DefineFunction name="AMPM" optype="categorical" dataType="string">
  <!-- result type is "string" -->
  <!-- declaration of formal parameters -->
  <ParameterField name="TimeVal" optype="continuous" dataType="integer" />
  <!-- there can be more than one parameter field -->
  <!-- The function body can be any expression -->
  <!-- Parameter names are used like field names in the expression -->
  <Discretize field="TimeVal"> <!-- uses name of parameter field -->
    <DiscretizeBin binValue="AM">
      <Interval closure="closedClosed" leftMargin="0" rightMargin="43199" />
    </DiscretizeBin>
    <DiscretizeBin binValue="PM">
      <Interval closure="closedOpen" leftMargin="43200" rightMargin="86400"/>
    </DiscretizeBin>
  </Discretize>
</DefineFunction>
<!-- use function "AMPM" in a DerivedField -->
<DerivedField name="Shift" optype="categorical">
  <Apply function="AMPM">
    <!-- argument: an expression supplying the time value -->
  </Apply>
</DerivedField>
<!-- extract the hour from a time value -->
<DerivedField name="StartHour" optype="categorical" >
  <Apply function="format-datetime" >
    <!-- first argument: an expression supplying the time value -->
    <Constant>HH</Constant>
  </Apply>
</DerivedField>
Note that we use the notation <Constant>HH</Constant> for constants. This notation is shorter and easier to handle than the combination <Constant><Value value="HH"/></Constant>.
In general, the application of a function looks like
<Apply function="MyFunc" >
  parameter expression 1
  parameter expression 2
  ...
  parameter expression n
</Apply>
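<DefineFunction name="STATEGROUP" optype="categorical" dataType="string" >
  <ParameterField name="#1" optype="categorical" dataType="string" />
  <MapValues outputColumn="Region" >
    <FieldColumnPair field="#1" column="State" />
    <InlineTable><row><State>CA</State><Region>West</Region></row></InlineTable>
  </MapValues>
</DefineFunction>
<DerivedField name="Group" optype="categorical" >
  <Apply function="STATEGROUP" >
    <FieldRef field="State" />
  </Apply>
</DerivedField>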
The new function with name STATEGROUP accepts one argument. The definition uses the transformation MapValues. For example, if the function is applied to the value "CA", the result is the string "West".
The example also defines a new DerivedField with name Group as the result of applying the function STATEGROUP to the field State.
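Because the mapping is encapsulated in a function, it can be reused for other fields without repeating the MapValues table. The following sketch assumes a second field named "ShipState" that holds state codes; both that field name and the DerivedField name "ShipGroup" are hypothetical.

```xml
<!-- reuse STATEGROUP for another (hypothetical) state-valued field -->
<DerivedField name="ShipGroup" optype="categorical" dataType="string">
  <Apply function="STATEGROUP">
    <FieldRef field="ShipState" />
  </Apply>
</DerivedField>
```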