PMML 4.4.1 - Transformation Dictionary and Derived Fields

At various places the mining models use simple functions in order to map user data to values that are easier to use in the specific model. For example, neural networks internally work with numbers, usually in the range from 0 to 1. Numeric input data are mapped to the range [0..1], and categorical fields are mapped to series of 0/1 indicators.

PMML defines various kinds of simple data transformations:

  • Normalization: map values to numbers, the input can be continuous or discrete.
  • Discretization: map continuous values to discrete values.
  • Value mapping: map discrete values to discrete values.
  • Text Indexing: derive a frequency-based value for a given term.
  • Functions: derive a value by applying a function to one or more parameters
  • Aggregation: summarize or collect groups of values, e.g., compute average.
  • Lag: use a previous value of the given input field.

The corresponding XML elements appear as content of a surrounding DerivedField element, which provides a common wrapper for the various mappings. They can also appear at several places in the definition of specific models such as neural network or Naïve Bayes models. Transformed fields have a name so that statistics and the model can refer to these fields.

The transformations in PMML do not cover the full set of preprocessing functions which may be needed to collect and prepare the data for mining. There are too many variations of preprocessing expressions. Instead, the PMML transformations represent expressions that are created automatically by a mining system. A typical example is the normalization of input values in neural networks. Similarly, a discretization might be constructed by a mining system that computes quantile ranges in order to transform skewed data.
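
As a minimal illustration (a hypothetical fragment, not taken from the specification's own examples; the field names are assumptions), a TransformationDictionary could hold a single derived field that simply passes through an existing input field via a FieldRef:

<TransformationDictionary>
  <!-- derived field "income" is just a renamed view of the input field ANNUAL_INCOME -->
  <DerivedField name="income" optype="continuous" dataType="double">
    <FieldRef field="ANNUAL_INCOME"/>
  </DerivedField>
</TransformationDictionary>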

<xs:group name="EXPRESSION">
  <xs:choice>
    <xs:element ref="Constant"/>
    <xs:element ref="FieldRef"/>
    <xs:element ref="NormContinuous"/>
    <xs:element ref="NormDiscrete"/>
    <xs:element ref="Discretize"/>
    <xs:element ref="MapValues"/>
    <xs:element ref="TextIndex"/>
    <xs:element ref="Apply"/>
    <xs:element ref="Aggregate"/>
    <xs:element ref="Lag"/>
  </xs:choice>
</xs:group>

<xs:element name="TransformationDictionary">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="DefineFunction" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="DerivedField" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="LocalTransformations">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="DerivedField" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="DerivedField">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="EXPRESSION"/>
      <xs:element ref="Value" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="name" type="FIELD-NAME"/>
    <xs:attribute name="displayName" type="xs:string"/>
    <xs:attribute name="optype" type="OPTYPE" use="required"/>
    <xs:attribute name="dataType" type="DATATYPE" use="required"/>
  </xs:complexType>
</xs:element>

If a DerivedField is contained in TransformationDictionary or LocalTransformations, then the name attribute is required. For DerivedFields which are contained inline in models, name is optional.

The TransformationDictionary allows for transformations to be defined once and used by any model element in the PMML document. For information on the use and naming of DerivedFields in the TransformationDictionary, see Scope of Fields.

The transformation expression in the content of DerivedField defines how the values of the new field are computed.

The attribute optype is needed in order to eliminate cases where the resulting type is not known. If there is a value mapping in a DerivedField, it is not known how to interpret the output. A value mapping might look like this:

"cat" -> "0.1"
"dog" -> "0.2"
"elephant" -> "0.3"
etc.
But it is not known whether 0.1 has to be interpreted as a number or as a string (=categorical value). Hence optype is required in order to make parsing and interpretation of models simpler.

A DerivedField may have a list of Value elements that define the ordering of the values for an ordinal field. The attribute property must not be used for Value elements within a DerivedField. That is, the list cannot specify values that are interpreted as missing or invalid.
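
For instance, a derived ordinal field might list its value ordering explicitly (a hypothetical sketch; the field and value names are assumptions):

<DerivedField name="sizeCode" optype="ordinal" dataType="string">
  <FieldRef field="size"/>
  <!-- the order of the Value elements defines the ordering small < medium < large -->
  <Value value="small"/>
  <Value value="medium"/>
  <Value value="large"/>
</DerivedField>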


Constant

Constant values can be used in expressions which have multiple arguments. The actual value of a constant is given by the content of the element. For example, <Constant>1.05</Constant> represents the number 1.05. The dataType of a Constant may optionally be specified. If the ParameterField definition includes a dataType, the Constant inherits the dataType specified in the ParameterField. If the dataType is specified neither in the ParameterField definition nor in the Constant element, it is inferred from the content of the element: a Constant that consists solely of digits is treated as an integer; if a decimal point is present, the Constant is treated as a double. (Note that the default type for a number with a decimal point used to be float; it was changed to double in PMML 4.4.) A scientific notation expression such as 3.14e+5 or 3.14E-5 is also treated as a double. The presence of any non-numeric characters (other than e, E, -, + in scientific notation expressions) results in the Constant being treated as a string. If the content is empty, the constant is interpreted as an empty string, unless dataType is something other than "string", in which case it is taken as a missing value of the specified type. Conflicting dataType specifications are resolved as specified in the Functions document.

If the missing attribute is true, the constant will be taken as a missing value, regardless of the content.
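
The following fragments illustrate these rules (hypothetical values chosen for illustration only):

<Constant>42</Constant>                          <!-- no decimal point: inferred as integer -->
<Constant>3.14</Constant>                        <!-- decimal point present: inferred as double -->
<Constant dataType="string">3.14</Constant>      <!-- explicit dataType overrides inference -->
<Constant dataType="double" missing="true"/>     <!-- missing="true": a missing double, content ignored -->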

In addition to the above the following three values are recognized and taken as equivalent to their IEEE values (see https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/#float):

NaN
notANumber. This is the value returned from such meaningless expressions as the logarithm of a negative number. It should not be used as a generic missing value, as missing values are properly those which are unknown, indeterminate, or non-applicable. All arithmetic expressions involving NaN will evaluate to NaN.
INF
positiveInfinity. This is an infinite number greater than any finite numeric value.
-INF
negativeInfinity. This is an infinite number less than any finite numeric value.
These three constants are case insensitive and only meaningful as floating point values (dataType equal to "float" or "double"). However, if dataType is set to its default value of "string", then the literal strings themselves are denoted. No other data type is accepted for these constants.

A Brief Summary of Operations Involving Infinite Numbers

It should be noted that infinity is not a single number, but a class of numbers. Thus, infinity times two is infinity, but the first infinity is not equal to the second. This is reflected in the rules detailed below.

  • Except as detailed below, -INF is to be taken as the opposite of INF. Both are infinite numbers.
  • The sum of an infinite number and a finite number is infinite with the sign of the infinite added.
  • The sum of two infinite numbers of the same sign is infinite.
  • The sum of two infinite numbers of opposite signs is indeterminate and will return a missing value.
  • The difference between an infinite number and a finite number is infinite.
  • The difference between two infinite numbers of the same sign is indeterminate and will return a missing value.
  • The difference between two infinite numbers of opposite signs is infinite.
  • The product of an infinite number and a non-zero finite number is infinite with the sign being determined by the signs of the factors.
  • The product of an infinite number and zero is indeterminate and therefore will return a missing value.
  • An infinite number divided by a finite number is infinite.
  • An infinite number divided by another of the same sign is indeterminate and therefore will return a missing value.
  • An infinite number divided by another of the opposite sign is infinite with the same sign as the numerator.
  • A number greater than 1 raised to INF returns INF.
  • 1 or -1 raised to INF is indeterminate, and therefore returns a missing value.
  • A number greater than -1 and less than 1 raised to INF returns 0.
  • A number less than -1 raised to INF is indeterminate and therefore returns a missing value.
  • A number raised to -INF is the reciprocal of the same number raised to INF.
  • INF raised to a positive power returns INF.
  • An infinite number raised to the zeroth power is indeterminate and therefore returns a missing value.
  • An infinite number raised to a negative power returns 0.
  • -INF raised to an odd positive integer returns -INF.
  • -INF raised to an even positive integer returns INF.
  • Except as above, -INF raised to any power is undefined, and therefore returns NaN.
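
As a sketch of how these values appear in practice (a hypothetical fragment assuming the built-in "+" function and a numeric input field x), a constant positive infinity can be combined with a finite value; by the rules above, the sum of an infinite number and a finite number is infinite:

<Apply function="+">
  <Constant dataType="double">INF</Constant>   <!-- positive infinity -->
  <FieldRef field="x"/>                        <!-- any finite numeric input; the result is INF -->
</Apply>
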
<xs:element name="Constant">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="dataType" type="DATATYPE"/>
		<xs:attribute name="missing" type="xs:boolean" default="false"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

FieldRef

Field references are simply pass-throughs to fields previously defined in the DataDictionary, a DerivedField, or a result field. For example, they are used in clustering models in order to define center coordinates for fields that don't need further normalization.

<xs:element name="FieldRef">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="mapMissingTo" type="xs:string"/>
  </xs:complexType>
</xs:element>

A missing input will produce a missing result. The optional attribute mapMissingTo may be used to map a missing result to the value specified by the attribute. If the attribute is not present, the result remains missing.
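
For example (a hypothetical fragment; the field names are assumptions), a derived field could pass an input through while substituting a default for missing values:

<DerivedField name="stateFilled" optype="categorical" dataType="string">
  <FieldRef field="state" mapMissingTo="unknown"/>   <!-- a missing input becomes "unknown" -->
</DerivedField>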

Normalization

The elements for normalization provide a basic framework for mapping input values to specific value ranges, usually the numeric range [0 .. 1]. Normalization is used, e.g., in neural networks and clustering models.

<xs:element name="NormContinuous">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="LinearNorm" minOccurs="2" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="mapMissingTo" type="NUMBER"/>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="outliers" type="OUTLIER-TREATMENT-METHOD" default="asIs"/>
  </xs:complexType>
</xs:element>

<xs:element name="LinearNorm">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="orig" type="NUMBER" use="required"/>
    <xs:attribute name="norm" type="NUMBER" use="required"/>
  </xs:complexType>
</xs:element>

NormContinuous: defines how to normalize an input field by piecewise linear interpolation. The mapMissingTo attribute defines the value the output is to take if the input is missing. If the mapMissingTo attribute is not specified, then missing input values produce a missing result.

The sequence of LinearNorm elements defines a sequence of points for a piecewise linear interpolation function. The sequence must contain at least two elements. Within NormContinuous the LinearNorm elements must be strictly sorted by ascending value of orig. Given two points (a1, b1) and (a2, b2) such that there is no other point (a3, b3) with a1 < a3 < a2, the normalized value is

b1 + (x - a1)/(a2 - a1) * (b2 - b1)   for a1 ≤ x ≤ a2.

[Figure: piecewise interpolated normalization, extrapolated over the undefined range]

Missing input values are mapped to missing output.
If the input value is not within the range [a1..an], then it is treated according to outliers (compare with outliers in MiningSchema):

asIs: Extrapolates the normalization from the nearest interval.

asMissingValues: Maps to a missing value.

asExtremeValues: Maps to the next value from the nearest interval so that the function is continuous.

The graph above shows the default behavior with an extrapolation for values less than a1 or greater than a3.

The graph below depicts a mapping where outliers are mapped using asExtremeValues:

[Figure: piecewise interpolated normalization with min/max mapping on the undefined range]

NormContinuous can be used to implement simple normalization functions such as the z-score transformation (X - m) / s, where m is the mean value and s is the standard deviation.

<NormContinuous field="X">
  <LinearNorm orig="0" norm="-m/s"/>
  <LinearNorm orig="m" norm="0"/>
</NormContinuous>

In this example we assume that outliers are treated "asIs".
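
Another common use is a min-max normalization to the range [0..1]. The sketch below (hypothetical field name and range) clips outliers to the extreme values rather than extrapolating:

<NormContinuous field="Age" outliers="asExtremeValues">
  <LinearNorm orig="18" norm="0"/>   <!-- assumed minimum maps to 0 -->
  <LinearNorm orig="90" norm="1"/>   <!-- assumed maximum maps to 1; values outside [18..90] are clipped -->
</NormContinuous>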

Normalize discrete values

Many mining models encode string values into numeric values in order to perform mathematical computations. For example, regression and neural network models often split categorical and ordinal fields into multiple dummy fields. This kind of normalization is supported in PMML by the element NormDiscrete.

<xs:element name="NormDiscrete">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="value" type="xs:string" use="required"/>
    <xs:attribute name="mapMissingTo" type="NUMBER"/>
  </xs:complexType>
</xs:element>

An element (f, v) defines that the unit has value 1.0 if the value of input field f is v, otherwise it is 0.

The set of NormDiscrete instances which refer to a certain input field define a fan-out function which maps a single input field to a set of normalized fields.

If the input value is missing and the attribute mapMissingTo is not specified then the result is a missing value as well. If the input value is missing and the attribute mapMissingTo is specified then the result is the value of the attribute mapMissingTo.
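
For instance, a categorical field color with values red, green and blue could be fanned out into three 0/1 indicator fields (a hypothetical sketch; the field and value names are assumptions, and missing inputs are mapped to 0 here):

<DerivedField name="color_red" optype="continuous" dataType="double">
  <NormDiscrete field="color" value="red" mapMissingTo="0"/>
</DerivedField>
<DerivedField name="color_green" optype="continuous" dataType="double">
  <NormDiscrete field="color" value="green" mapMissingTo="0"/>
</DerivedField>
<DerivedField name="color_blue" optype="continuous" dataType="double">
  <NormDiscrete field="color" value="blue" mapMissingTo="0"/>
</DerivedField>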

Discretization

Discretization of numerical input fields is a mapping from continuous to discrete values using intervals.

<xs:element name="Discretize">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="DiscretizeBin" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="mapMissingTo" type="xs:string"/>
    <xs:attribute name="defaultValue" type="xs:string"/>
    <xs:attribute name="dataType" type="DATATYPE"/>
  </xs:complexType>
</xs:element>

<xs:element name="DiscretizeBin">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Interval"/>
    </xs:sequence>
    <xs:attribute name="binValue" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>

The attribute field defines the name of the input field. The elements DiscretizeBin define a set of mappings from an interval_i to a binValue_i. The value of the DerivedField is binValue_i if the input value is contained in interval_i for some i.

Two intervals may be mapped to the same categorical value but the mapping for each numerical input value must be unique, i.e., the intervals must be disjoint. The intervals should cover the complete range of input values.

Decision table for Discretize

('*' stands for any combination)

input value   matching interval   defaultValue    mapMissingTo    => result
val           Interval_i          *               *               => binValue_i
val           none                someVal         *               => someVal
val           none                not specified   *               => missing
missing       *                   *               someVal         => someVal
missing       *                   *               not specified   => missing

Example:

A definition such as:

<Discretize field="Profit">
  <DiscretizeBin binValue="negative">
    <Interval closure="openOpen" rightMargin="0"/>
    <!-- left margin is -infinity by default -->
  </DiscretizeBin>
  <DiscretizeBin binValue="positive">
    <Interval closure="closedOpen" leftMargin="0"/>
    <!-- right margin is +infinity by default -->
  </DiscretizeBin>
</Discretize>

takes the field Profit as input and maps values less than 0 to negative and other values to positive. A missing value for Profit is mapped to a missing value.

In SQL this definition corresponds to a CASE expression

CASE When "Profit" < 0 Then 'negative' When "Profit" >= 0 Then 'positive' End

Discretize can also be used to create a missing value indicator for a continuous variable. In this case, the DiscretizeBin element is superfluous and can be dropped.

Example:

<DerivedField name="Age_mis" displayName="Age missing or not" optype="categorical" dataType="string">
  <Discretize field="Age" mapMissingTo="Missing" defaultValue="Not missing"/>
</DerivedField>

Map Values

Any discrete value can be mapped to any, possibly different, discrete value by listing the pairs of values. This list is implemented by a table, so it can be given inline as a sequence of XML rows or as a reference to an external table. The same technique is used for a Taxonomy because the tables can become quite large.

<xs:element name="MapValues">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="FieldColumnPair"/>
      <xs:choice minOccurs="0">
        <xs:element ref="TableLocator"/>
        <xs:element ref="InlineTable"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="mapMissingTo" type="xs:string"/>
    <xs:attribute name="defaultValue" type="xs:string"/>
    <xs:attribute name="outputColumn" type="xs:string" use="required"/>
    <xs:attribute name="dataType" type="DATATYPE"/>
  </xs:complexType>
</xs:element>

<xs:element name="FieldColumnPair">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="column" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>

The types InlineTable and TableLocator are defined in the Taxonomy schema.

The mapMissingTo attribute defines the value the output column is to take if any of the input columns are missing.

Different string values can be mapped to one value, but it is an error if the table entries used for matching are not unique. The value mapping may be partial, i.e., if an input value does not match a value in the mapping table, then the result can be a missing value. See the decision table below for the possible combinations.

Decision table for MapValues

('*' stands for any combination)

input value   matching value   defaultValue    mapMissingTo    => result
val           in row i         *               *               => outputColumn in row i
val           none             someVal         *               => someVal
val           none             not specified   *               => missing
missing       *                *               someVal         => someVal
missing       *                *               not specified   => missing

Example:

A definition such as

<MapValues outputColumn="longForm">
  <FieldColumnPair field="gender" column="shortForm"/>
  <InlineTable>
    <row><shortForm>m</shortForm><longForm>male</longForm>
    </row>
    <row><shortForm>f</shortForm><longForm>female</longForm>
    </row>
  </InlineTable>
</MapValues>

maps abbreviations in the field gender to their corresponding full words. That is, m is mapped to male and f is mapped to female.

In SQL this definition corresponds to a CASE expression

CASE "gender" When 'm' Then 'male' When 'f' Then 'female' End

Example:

Here is an example for the multi-dimensional case, mapping from state and band to salary:

state   band 1   band 2
MN      10,000   20,000
IL      12,000   23,000
NY      20,000   30,000

The corresponding PMML would look like this:

<DerivedField dataType="double" optype="continuous">
  <MapValues outputColumn="out" dataType="integer">
    <FieldColumnPair field="BAND" column="band"/>
    <FieldColumnPair field="STATE" column="state"/>
    <InlineTable>
      <row>
        <band>1</band>
        <state>MN</state>
        <out>10000</out>
      </row>
      <row>
        <band>1</band>
        <state>IL</state>
        <out>12000</out>
      </row>
      <row>
        <band>1</band>
        <state>NY</state>
        <out>20000</out>
      </row>
      <row>
        <band>2</band>
        <state>MN</state>
        <out>20000</out>
      </row>
      <row>
        <band>2</band>
        <state>IL</state>
        <out>23000</out>
      </row>
      <row>
        <band>2</band>
        <state>NY</state>
        <out>30000</out>
      </row>
    </InlineTable>
  </MapValues>
</DerivedField>

Example:

The MapValues element can be used to create missing value indicators for categorical variables. In this case, only one FieldColumnPair needs to be specified and the mapping table (TableLocator or InlineTable) can be omitted.
<DerivedField name="LSTROPEN_MIS" optype="categorical" dataType="string">
  <MapValues mapMissingTo="Missing" defaultValue="Not missing" outputColumn="none">
    <FieldColumnPair field="LSTROPEN" column="none"/>
  </MapValues>
</DerivedField>

Extracting term frequencies from text

To leverage textual input in a PMML model, we can use a TextIndex expression to extract frequency information from the text input field for a given term. The TextIndex element fully configures how the text input should be indexed, including case sensitivity, normalization and other settings. It contains a single nested EXPRESSION element giving the term value to look for, which will usually be a simple Constant.

<xs:element name="TextIndex">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="TextIndexNormalization" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="EXPRESSION"/>
    </xs:sequence>
    <xs:attribute name="textField" type="FIELD-NAME" use="required"/>
    <xs:attribute name="localTermWeights" default="termFrequency">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="termFrequency"/>
          <xs:enumeration value="binary"/>
          <xs:enumeration value="logarithmic"/>
          <xs:enumeration value="augmentedNormalizedTermFrequency"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="isCaseSensitive" type="xs:boolean" default="false"/>
    <xs:attribute name="maxLevenshteinDistance" type="xs:integer" default="0"/>
    <xs:attribute name="countHits" default="allHits">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="allHits"/>
          <xs:enumeration value="bestHits"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="wordSeparatorCharacterRE" type="xs:string" default="\s+"/>
    <xs:attribute name="tokenize" type="xs:boolean" default="true"/>
  </xs:complexType>
</xs:element>

The TextIndex element fully configures how the text in textField should be processed and translated into a frequency metric for a particular term of interest. The actual frequency metric to be returned is defined through the localTermWeights attribute. The options are described in more detail by T. G. Kolda and D. P. O'Leary, A Semi-Discrete Matrix Decomposition of Latent Semantic Indexing in Information Retrieval, ACM Transactions on Information Systems, Volume 16, 1998, pages 322-346.

  • termFrequency: use the number of times the term occurs in the document (x = freq_i).
  • binary: use 1 if the term occurs in the document or 0 if it doesn't (x = χ(freq_i)).
  • logarithmic: take the logarithm (base 10) of 1 + the number of times the term occurs in the document (x = log(1 + freq_i)).
  • augmentedNormalizedTermFrequency: this formula adds to the binary frequency a "normalized" component expressing the frequency of a term relative to the highest frequency of terms observed in that document (x = 0.5 * (χ(freq_i) + freq_i / max_k(freq_k))).

The isCaseSensitive attribute defines whether or not the case used in the text input must exactly match the case used in the terms for which the frequency should be returned. Through maxLevenshteinDistance, small spelling mistakes can be accommodated, accepting a certain number of character additions, omissions or replacements. To capture compound words more easily, the wordSeparatorCharacterRE attribute can be used to pass a regular expression containing possible word separator characters. For example, if "[\s\-]" is passed, the strings "user-friendly" and "user friendly" would both match the term "user friendly". More complex normalization operations can be addressed through the TextIndexNormalization element. Note that the default value of wordSeparatorCharacterRE ("\s+") matches one or more whitespace characters. The wordSeparatorCharacterRE attribute applies to the given text unless attribute tokenize is set to "false".

When attribute tokenize is set to "true", which is its default value, wordSeparatorCharacterRE should be applied to the text input field as well as to the given term. This process results in one or more tokens, from which leading and trailing punctuation should then be removed. The punctuation-free tokenized term can then be described by one word or a sequence of words, depending on the number of resulting tokens.

To calculate a term's frequency, count the number of "hits" in the text for the particular term according to the value of countHits. A "hit" is defined as an occurrence of the term in the input text, meeting the case requirements defined through isCaseSensitive within the maxLevenshteinDistance. For example, assume the given term is "brown fox" and the input text is "The quick browny foxy jumps over the lazy dog. The brown fox runs away and to be with another brown foxy.". If attribute maxLevenshteinDistance is set to 1, the text "browny foxy" will not be considered a "hit" since its Levenshtein distance is 2 (the sum of the Levenshtein distances for the words "browny" and "foxy" after the input text has been tokenized). The text "brown fox" that appears next is a "hit" since its Levenshtein distance is 0. The text "brown foxy." is also a "hit" since its Levenshtein distance is 1. Note that the punctuation has been removed before computing the Levenshtein distance.

The number of "hits" in the text can be counted in two different ways. These are:

  • allHits: count all hits
  • bestHits: count all hits with the lowest Levenshtein distance

For example, if the input text is defined as "I have a doog. My dog is white. The doog is friendly" and the attribute maxLevenshteinDistance is set to 1, the number of hits for term "dog" will be 3 if attribute countHits is set to "allHits" and 1 if it is set to "bestHits".

An example

<MiningSchema>
  <MiningField name="myTextField"/>
  ...
</MiningSchema>
<LocalTransformations>
  <DerivedField name="sunFrequency">
    <TextIndex textField="myTextField" localTermWeights="termFrequency"
       isCaseSensitive="false" maxLevenshteinDistance="1" >
      <Constant>sun</Constant>
    </TextIndex>
  </DerivedField>
  ...

In the above example, the DerivedField sunFrequency will contain the number of hits for the term "sun" in the input field myTextField, regardless of case and with at most one spelling mistake. For example, if the value of myTextField is "The Sun was setting while the captain's son reached the bounty island, minutes after their ship had sunk to the bottom of the ocean", sunFrequency will be 3 as "Sun", "son" and "sunk" all match the term "sun" with a Levenshtein distance of 0 or 1. If the maximum Levenshtein distance were to be 0, only "Sun" would have matched.

Predictive models using text as an input are likely to be looking for more than a single term. Therefore, it is often convenient to define the TextIndex element just once inside a DefineFunction element and then invoke it with Apply elements as shown in the following example.

...
<TransformationDictionary>
  <DefineFunction name="myIndexFunction">
    <ParameterField name="text"/>
    <ParameterField name="term"/>
    <TextIndex textField="text" localTermWeights="termFrequency"
       isCaseSensitive="false" maxLevenshteinDistance="1" >
      <FieldRef field="term"/>
    </TextIndex>
  </DefineFunction>
</TransformationDictionary>
...
<MiningSchema>
  <MiningField name="myTextField"/>
  ...
</MiningSchema>
<LocalTransformations>
  <DerivedField name="sunFrequency">
    <Apply function="myIndexFunction">
      <FieldRef field="myTextField"/>
      <Constant>sun</Constant>
    </Apply>
  </DerivedField>
  <DerivedField name="rainFrequency">
    <Apply function="myIndexFunction">
      <FieldRef field="myTextField"/>
      <Constant>rain</Constant>
    </Apply>
  </DerivedField>
  <DerivedField name="windFrequency">
    <Apply function="myIndexFunction">
      <FieldRef field="myTextField"/>
      <Constant>wind</Constant>
    </Apply>
  </DerivedField>
  ...

Normalizing text input

While the Levenshtein distance is useful to cover small spelling mistakes, it's not necessarily suitable to capture different forms of the same term, such as cases of nouns, conjugations of verbs or even synonyms. One or more TextIndexNormalization elements can be nested within a TextIndex to normalize input text into a more term-friendly form.

<xs:element name="TextIndexNormalization">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:choice minOccurs="0">
        <xs:element ref="TableLocator"/>
        <xs:element ref="InlineTable"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="inField" type="xs:string" default="string"/>
    <xs:attribute name="outField" type="xs:string" default="stem"/>
    <xs:attribute name="regexField" type="xs:string" default="regex"/>
    <xs:attribute name="recursive" type="xs:boolean" default="false"/>
    <xs:attribute name="isCaseSensitive" type="xs:boolean"/>
    <xs:attribute name="maxLevenshteinDistance" type="xs:integer"/>
    <xs:attribute name="wordSeparatorCharacterRE" type="xs:string"/>
    <xs:attribute name="tokenize" type="xs:boolean"/>
  </xs:complexType>
</xs:element>

A TextIndexNormalization element offers more advanced ways of normalizing text input into a more controlled vocabulary that corresponds to the terms being used in invocations of this indexing function. The normalization operation is defined through a translation table, specified through a TableLocator or InlineTable element.

If an entry in the inField column is encountered in the text, it is replaced by the value in the outField column. If there is a regexField column and its value for that row is true, the string in the inField column should be treated as a PCRE regular expression. For regular expression rows, the attributes maxLevenshteinDistance and isCaseSensitive are ignored.

By default, the translation table is applied once, applying each row once from top to bottom. If the recursive flag is set to true, the normalization table is reapplied until none of its rows causes a change to the input text. If multiple TextIndexNormalization elements are defined, they are applied in the order in which they appear; that is, the output of applying one TextIndexNormalization element serves as the input to the next. This makes it possible, for example, to have a first normalization step that takes care of morphological translations to get each word into some base form (stemming), a second step that combines synonyms by applying some sort of taxonomy, and perhaps a third step that looks for particular sequences of normalized tokens.

By default, the TextIndexNormalization element inherits the values for isCaseSensitive, maxLevenshteinDistance, wordSeparatorCharacterRE and tokenize from the TextIndex element, but they can be overridden per TextIndexNormalization element.

For each TextIndexNormalization element, wordSeparatorCharacterRE does not apply to the inField and outField columns for regular expression rows. However, it applies to both inField and outField for non-regular expression rows.

An example with normalization

...
<TransformationDictionary>
  <DefineFunction name="myIndexFunction" optype="continuous">
    <ParameterField name="reviewText"/>
    <ParameterField name="term"/>
    <TextIndex textField="reviewText" localTermWeights="binary" isCaseSensitive="false">
      <TextIndexNormalization inField="string" outField="stem" regexField="regex">
        <InlineTable>
          <row>
            <string>interfaces?</string>
            <stem>interface</stem>
            <regex>true</regex>
          </row>
          <row>
            <string>is|are|seem(ed|s?)|were</string>
            <stem>be</stem>
            <regex>true</regex>
          </row>
          <row>
            <string>user friendl(y|iness)</string>
            <stem>user_friendly</stem>
            <regex>true</regex>
          </row>
        </InlineTable>
      </TextIndexNormalization>

      <TextIndexNormalization inField="re" outField="feature" regexField="regex">
        <InlineTable>
          <row>
            <re>interface be (user_friendly|well designed|excellent)</re>
            <feature>ui_good</feature>
            <regex>true</regex>
          </row>
        </InlineTable>
      </TextIndexNormalization>
      <FieldRef field="term"/>

    </TextIndex>
  </DefineFunction>
</TransformationDictionary>
...
<MiningSchema>
  <MiningField name="Review"/>
  ...
</MiningSchema>
<LocalTransformations>
  <DerivedField name="isGoodUI">
    <Apply function="myIndexFunction">
      <FieldRef field="Review"/>
      <Constant>ui_good</Constant>
    </Apply>
  </DerivedField>
  ...

For example, when processing the text fragment "Testing the app for a few days convinced me the interfaces are excellent!", applying the first normalization block yields "Testing the app for a few days convinced me the interface be excellent!". Applying the second block then yields "Testing the app for a few days convinced me the ui_good!", which will produce a frequency value of 1 for the "isGoodUI" field.

Aggregations

Association rules and sequences refer to sets of items. These sets can be defined by an aggregation over sets of input records. The records are grouped together by one of the fields, and the values in this grouping field partition the sets of records for an aggregation. This corresponds to conventional aggregation in SQL with a GROUP BY clause. Input records with a missing value in the groupField are simply ignored. This behavior is similar to that of SQL aggregate functions in the presence of NULL values.

<xs:element name="Aggregate">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="function" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="count"/>
          <xs:enumeration value="sum"/>
          <xs:enumeration value="average"/>
          <xs:enumeration value="min"/>
          <xs:enumeration value="max"/>
          <xs:enumeration value="multiset"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="groupField" type="FIELD-NAME"/>
    <xs:attribute name="sqlWhere" type="xs:string"/>
  </xs:complexType>
</xs:element>

A definition such as:

<Aggregate field="item" function="multiset" groupField="transaction"/>

builds sets of item values; for each transaction, i.e., for each value of the field transaction, there is one set of items.
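
Similarly (a hypothetical sketch; the field names are assumptions), the average purchase amount per customer could be derived with:

<Aggregate field="amount" function="average" groupField="customerID"/>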

Lag

A "lag" is here defined as the value of the given input field a fixed number of records prior to the current one, assuming there are that many; and optionally assuming that they are part of the same block of records (aka "case history"), as defined by the values of one or more input fields (the "block specifiers"). If the desired value is not present, for a given record, the lag will be set to missing.

Warning!

Lags are only meaningful if the data are already sorted in the desired order (normally chronologically within case histories). It should not be assumed that a consumer will do the requisite sorting.

<xs:element name="Lag">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="BlockIndicator" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="required"/>
      <xs:attribute name="n" type="xs:positiveInteger" default="1"/>
      <xs:attribute name="aggregate" default="none">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="none"/>
            <xs:enumeration value="avg"/>          
            <xs:enumeration value="max"/>
            <xs:enumeration value="median"/>
            <xs:enumeration value="min"/>
            <xs:enumeration value="product"/>
            <xs:enumeration value="sum"/>
            <xs:enumeration value="stddev"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>

  
<xs:element name="BlockIndicator">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    </xs:complexType>
  </xs:element>
  • field in a Lag element is the name of the input field to be "lagged". In a BlockIndicator element, it is the name of a field used to define the block of records to be searched. A block thus defined consists of one or more consecutive records in which the indicator fields have the same values.
  • n indicates that the value to be used is to be taken from the nth record prior to the current one, if there are that many. If one or more BlockIndicator elements are defined, that record must be part of the same block as the current one. If either condition does not hold, the value of the DerivedField is set to missing. n must be a positive integer. If not specified, it is set to 1 (indicating the record just before the current one).
  • aggregate is an optional attribute that defines an aggregate operation over the n lagged values. The allowed values are "avg", "max", "median", "min", "product", "sum" and "stddev". All operations except "stddev" are described in Built-in Functions; "stddev" stands for standard deviation. The default is "none", which means a regular lag expression is performed. The data type of the field must be numeric if an aggregate operation other than "none" is selected.

Simple Lag Example

  <Lag field="Receipts"/>

In this example, the value returned will be that of Receipts in the record immediately prior to the current one, except that on the first record, it will be missing.

Slightly More Complex Lag Example

  <Lag field="Receipts" n="2"/>

Here, the value returned will be that of Receipts on the record just before the previous one, except that on the first two records, it will be missing.

Example Lag Within a Block

  <Lag field="AmtPaid" n="1">
    <BlockIndicator field="CustomerID"/>
  </Lag>
  

In this example, the value returned will be that of AmtPaid on the next previous record, if it has the same value of CustomerID as the current one, otherwise it will be missing.

Example with Aggregate Operation

  <Lag field="AmtPaid" n="3" aggregate="sum"/>
  

In this case, the returned value will be the sum of the three previous AmtPaid values. The table below lists a sequence of records to give more detail. For instance, the result of the lag expression for Record_5 is 9. Note that the first record does not contain a value for the lag; Record_2 contains only the value of the first record, and Record_3 contains only the sum of the first two records.

Record Number   AmtPaid   AmtPaid_Lag
Record_1        1
Record_2        2         1
Record_3        3         3
Record_4        4         6
Record_5        5         9