PMML 3.0 - Definition and application of functions
PMML provides a number of predefined functions that support fine-grained transformations such as changing characters to upper case or converting date and time values to strings. The predefined functions are built into PMML because they cannot be defined by expressions in PMML itself or because a definition would be too complex.
Without support for such functions, an application would have to perform the transformations itself before using a PMML model. The transformations applied when the model is used on new data must be equivalent to those applied when the model was created. Integrating some of the transformations directly into the PMML model makes the definition and execution of the data flow less error-prone.
PMML also supports the definition of new functions that have other PMML expressions as the function body. Such a function represents a parameterized expression. The semantics of applying a 'user-defined' function in PMML is as follows:
- substitute the formal function parameters by the actual argument values, and then
- replace the function application by the new expression.
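As a minimal sketch of these two steps, consider a hypothetical function "Twice" applied to a hypothetical field "Income". The names "Twice", "x" and "Income" are illustrative assumptions, and the sketch assumes that a built-in arithmetic function "*" is available. The elements DefineFunction, ParameterField and Apply are specified in the Schema section.

```xml
<!-- hypothetical definition: Twice(x) = 2 * x -->
<DefineFunction name="Twice" optype="continuous" dataType="double">
  <ParameterField name="x" optype="continuous"/>
  <Apply function="*"> <!-- assumes a built-in function "*" -->
    <Constant>2</Constant>
    <FieldRef field="x"/> <!-- refers to the formal parameter -->
  </Apply>
</DefineFunction>

<!-- application of the function to the hypothetical field "Income" -->
<Apply function="Twice">
  <FieldRef field="Income"/>
</Apply>

<!-- expression with the same meaning: "x" is substituted by the argument
     and the application is replaced by the function body -->
<Apply function="*">
  <Constant>2</Constant>
  <FieldRef field="Income"/>
</Apply>
```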
A function can be applied to one or more other expressions such as constants, fields, or results of transformations (see the group EXPRESSION in Transformations.html). When a function is applied, the actual arguments are matched to the parameters by position. A function application is itself a PMML transformation expression, so function invocations can be nested.
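To illustrate positional arguments and nesting, the following sketch expresses (a + b) / 2. The field names "a" and "b" are hypothetical, and the built-in arithmetic functions "+" and "/" are assumed to be available with their usual meaning. Under that assumption the order of the arguments matters for "/": swapping the two child expressions would compute 2 / (a + b) instead.

```xml
<!-- (a + b) / 2, with hypothetical fields "a" and "b" -->
<Apply function="/">
  <!-- first argument: a nested function application -->
  <Apply function="+">
    <FieldRef field="a"/>
    <FieldRef field="b"/>
  </Apply>
  <!-- second argument: a constant -->
  <Constant>2</Constant>
</Apply>
```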
Schema
The XML Schema for the definition and application of functions is
<xs:element name="DefineFunction">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      <xs:element ref="ParameterField" minOccurs="1" maxOccurs="unbounded" />
      <xs:group ref="EXPRESSION" />
    </xs:sequence>
    <xs:attribute name="name" type="xs:string" use="required"/>
    <xs:attribute name="optype" type="OPTYPE" />
    <xs:attribute name="dataType" type="DATATYPE" />
  </xs:complexType>
</xs:element>
<xs:element name="ParameterField">
  <xs:complexType>
    <xs:attribute name="name" type="xs:string" use="required" />
    <xs:attribute name="optype" type="OPTYPE" use="required" />
  </xs:complexType>
</xs:element>
<xs:element name="Apply">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded" />
      <xs:group ref="EXPRESSION" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
    <xs:attribute name="function" type="xs:string" />
  </xs:complexType>
</xs:element>
The EXPRESSION in the content of DefineFunction is the function body that actually defines the meaning of the new function.
The function body must not refer to fields other than the parameter fields.
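One way to satisfy this restriction is to pass every value the body needs as an argument. The following sketch uses hypothetical names ("Total", "price", "quantity", "UnitPrice", "UnitsSold") and assumes that a built-in arithmetic function "*" is available. The body refers only to the two parameter fields; the data fields appear only where the function is applied.

```xml
<DefineFunction name="Total" optype="continuous" dataType="double">
  <ParameterField name="price" optype="continuous"/>
  <ParameterField name="quantity" optype="continuous"/>
  <!-- the body refers to "price" and "quantity" only; a FieldRef to a
       data field such as "UnitsSold" would not be allowed here -->
  <Apply function="*"> <!-- assumes a built-in function "*" -->
    <FieldRef field="price"/>
    <FieldRef field="quantity"/>
  </Apply>
</DefineFunction>

<!-- the data fields are supplied as arguments, in parameter order -->
<Apply function="Total">
  <FieldRef field="UnitPrice"/>
  <FieldRef field="UnitsSold"/>
</Apply>
```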
Example applying a built-in function
Data cleansing is one of the common tasks in preparing data for mining. Some of these operations can be supported directly in a PMML model. The following example demonstrates how to convert string values to upper case by applying the built-in function upper-case. Assuming that the original input data contains names of product groups in the field "prodgroup", we define a new derived field named "PGNorm" in which all values use upper-case characters.
<DerivedField name="PGNorm">
  <Apply function="upper-case">
    <FieldRef field="prodgroup"/>
  </Apply>
</DerivedField>
A DerivedField can contain a transformation expression such as <MapValues> or <Discretize>. The element <Apply> is just another transformation expression.
Example for user-defined function
A derived field can be defined by a possibly complex transformation. If a certain transformation has to be applied to multiple fields it makes sense to encapsulate the definition of the transformation expression in a function and then apply the function multiple times. This reduces the complexity and the size of PMML models.
New 'user-defined' functions can be specified in a model using the element DefineFunction in the transformation dictionary.
Examples:
<TransformationDictionary>
  ...
  <!-- define a new function called "AMPM" -->
  <DefineFunction name="AMPM" dataType="string">
    <!-- result type is "string" -->
    <!-- declaration of formal parameters -->
    <ParameterField name="TimeVal" optype="continuous" />
    <!-- there can be more than one parameter field -->
    <!-- The function body can be any expression -->
    <!-- Parameter names are used like field names in the expression -->
    <Discretize field="TimeVal"> <!-- uses name of parameter field -->
      <DiscretizeBin binValue="AM">
        <Interval closure="closedClosed" leftMargin="0" rightMargin="43199" />
      </DiscretizeBin>
      <DiscretizeBin binValue="PM">
        <Interval closure="closedOpen" leftMargin="43200" rightMargin="86400"/>
      </DiscretizeBin>
    </Discretize>
  </DefineFunction>
  <!-- use function "AMPM" in a DerivedField -->
  <DerivedField name="Shift" optype="categorical">
    <Apply function="AMPM">
      <FieldRef field="StartTime"/>
    </Apply>
  </DerivedField>
  <!-- extract the hour from a time value -->
  <DerivedField name="StartHour" optype="categorical" >
    <Apply function="format-datetime" >
      <Constant>%H</Constant>
      <FieldRef field="StartTime"/>
    </Apply>
  </DerivedField>
  ...
</TransformationDictionary>
Note that the example uses the notation <Constant>%H</Constant> for constants. This notation is shorter and easier to handle than the combination <Constant><Value value="%H"/></Constant>.
In general, the application of a function looks like
<Apply function="MyFunc" >
  parameter expression 1
  parameter expression 2
  ...
  parameter expression n
</Apply>
Another example:
The new function named "STATEGROUP" accepts one argument. Its definition uses the transformation MapValues. For example, if the function is applied to the value "CA", the result is the string "West". The example also defines a new derived field named "Group" as the result of applying the function "STATEGROUP" to the field "State".
<DefineFunction name="STATEGROUP" dataType="string" >
  <ParameterField name="#1" optype="categorical" />
  <MapValues outputColumn="Region" optype="categorical">
    <FieldColumnPair field="#1" column="State"/>
    <InlineTable>
      <row><State>CA</State><Region>West</Region></row>
      <row><State>OR</State><Region>West</Region></row>
      <row><State>NC</State><Region>East</Region></row>
    </InlineTable>
  </MapValues>
</DefineFunction>
<DerivedField name="Group" optype="categorical" >
  <Apply function="STATEGROUP" >
    <FieldRef field="State"/>
  </Apply>
</DerivedField>
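Because the mapping is encapsulated in a function, it can be reused for other fields without repeating the table. As a sketch, assume the data also contains a field "ShipState"; this field and the derived field name "ShipGroup" are hypothetical and not part of the example above. A second derived field can then apply the same function:

```xml
<!-- reuse of "STATEGROUP" for a second, hypothetical input field -->
<DerivedField name="ShipGroup" optype="categorical">
  <Apply function="STATEGROUP">
    <FieldRef field="ShipState"/>
  </Apply>
</DerivedField>
```

With the table defined in "STATEGROUP", a "ShipState" value of "OR" would yield "West" in "ShipGroup".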