PMML 4.2 - Transformation Dictionary and Derived Fields

At various places the mining models use simple functions in order to map user data to values that are easier to use in the specific model. For example, neural networks internally work with numbers, usually in the range from 0 to 1. Numeric input data are mapped to the range [0..1], and categorical fields are mapped to series of 0/1 indicators.

PMML defines various kinds of simple data transformations:

Normalization: map values to numbers; the input can be continuous or discrete.
Discretization: map continuous values to discrete values.
Value mapping: map discrete values to discrete values.
Text Indexing: derive a frequency-based value for a given term.
Functions: derive a value by applying a function to one or more parameters.
Aggregation: summarize or collect groups of values, e.g., compute average.

The corresponding XML elements appear as content of a surrounding markup DerivedField, which provides a common element for the various mappings. They can also appear at several places in the definition of specific models such as neural network or Naïve Bayes models. Transformed fields have a name such that statistics and the model can refer to these fields.

The transformations in PMML do not cover the full set of preprocessing functions which may be needed to collect and prepare the data for mining. There are too many variations of preprocessing expressions. Instead, the PMML transformations represent expressions that are created automatically by a mining system. A typical example is the normalization of input values in neural networks. Similarly, a discretization might be constructed by a mining system that computes quantile ranges in order to transform skewed data.

<xs:group name="EXPRESSION">
  <xs:choice>
    <xs:element ref="Constant"/>
    <xs:element ref="FieldRef"/>
    <xs:element ref="NormContinuous"/>
    <xs:element ref="NormDiscrete"/>
    <xs:element ref="Discretize"/>
    <xs:element ref="MapValues"/>
    <xs:element ref="TextIndex"/>        
    <xs:element ref="Apply"/>
    <xs:element ref="Aggregate"/>
  </xs:choice>
</xs:group>

<xs:element name="TransformationDictionary">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="DefineFunction" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="DerivedField" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="LocalTransformations">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="DerivedField" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="DerivedField">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="EXPRESSION"/>
      <xs:element ref="Value" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="name" type="FIELD-NAME"/>
    <xs:attribute name="displayName" type="xs:string"/>
    <xs:attribute name="optype" type="OPTYPE" use="required"/>
    <xs:attribute name="dataType" type="DATATYPE" use="required"/>
  </xs:complexType>
</xs:element>

If a DerivedField is contained in TransformationDictionary or LocalTransformations, then the name attribute is required. For DerivedFields which are contained inline in models, name is optional.

The TransformationDictionary allows for transformations to be defined once and used by any model element in the PMML document. For information on the use and naming of DerivedFields in the TransformationDictionary, see Scope of Fields.

The transformation expression in the content of DerivedField defines how the values of the new field are computed.
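
As a minimal sketch of the naming rule above, the fragment below shows a named DerivedField in a TransformationDictionary next to an unnamed DerivedField placed inline in a neural network input. The field names income and logIncome are hypothetical, and the surrounding NeuralInput markup is assumed from the neural network schema rather than defined in this document.

```xml
<!-- Defined once at document level: the name attribute is required here -->
<TransformationDictionary>
  <DerivedField name="logIncome" optype="continuous" dataType="double">
    <Apply function="ln">
      <FieldRef field="income"/>
    </Apply>
  </DerivedField>
</TransformationDictionary>

<!-- Inline inside a model element: the name attribute may be omitted -->
<NeuralInput id="0">
  <DerivedField optype="continuous" dataType="double">
    <FieldRef field="logIncome"/>
  </DerivedField>
</NeuralInput>
```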

The attribute optype is needed in order to eliminate cases where the resulting type is not known. If there is a value mapping in a DerivedField, it is not known how to interpret the output. A value mapping might look like this:

"cat" -> "0.1"
"dog" -> "0.2"
"elephant" -> "0.3"
etc.

But it is not known whether 0.1 has to be interpreted as a number or as a string (=categorical value). Hence optype is required in order to make parsing and interpretation of models simpler.
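
To illustrate, the same mapping table can be declared in two ways, and only optype and dataType tell the consumer which reading is intended. This is a sketch; the field name animal, the column names and the derived field names are hypothetical.

```xml
<!-- Read as numbers: the result is 0.1, 0.2 or 0.3 -->
<DerivedField name="animalScore" optype="continuous" dataType="double">
  <MapValues outputColumn="score">
    <FieldColumnPair field="animal" column="name"/>
    <InlineTable>
      <row><name>cat</name><score>0.1</score></row>
      <row><name>dog</name><score>0.2</score></row>
      <row><name>elephant</name><score>0.3</score></row>
    </InlineTable>
  </MapValues>
</DerivedField>

<!-- Same table read as category labels: the result is the string "0.1", "0.2" or "0.3" -->
<DerivedField name="animalCode" optype="categorical" dataType="string">
  <MapValues outputColumn="score">
    <FieldColumnPair field="animal" column="name"/>
    <InlineTable>
      <row><name>cat</name><score>0.1</score></row>
      <row><name>dog</name><score>0.2</score></row>
      <row><name>elephant</name><score>0.3</score></row>
    </InlineTable>
  </MapValues>
</DerivedField>
```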

A DerivedField may have a list of Value elements that define the ordering of the values for an ordinal field. The attribute property must not be used for Value elements within a DerivedField. That is, the list cannot specify values that are interpreted as missing or invalid.
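
A possible ordinal DerivedField might look as follows. The field names and bin boundaries are made up for illustration; the Value elements list the categories from lowest to highest and carry no property attribute.

```xml
<DerivedField name="ProfitLevel" optype="ordinal" dataType="string">
  <Discretize field="Profit">
    <DiscretizeBin binValue="low">
      <Interval closure="openOpen" rightMargin="0"/>
    </DiscretizeBin>
    <DiscretizeBin binValue="medium">
      <Interval closure="closedOpen" leftMargin="0" rightMargin="1000"/>
    </DiscretizeBin>
    <DiscretizeBin binValue="high">
      <Interval closure="closedOpen" leftMargin="1000"/>
    </DiscretizeBin>
  </Discretize>
  <!-- Ordering of the derived values: low < medium < high -->
  <Value value="low"/>
  <Value value="medium"/>
  <Value value="high"/>
</DerivedField>
```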

Constant

Constant values can be used in expressions which have multiple arguments. The actual value of a constant is given by the content of the element. For example, <Constant>1.05</Constant> represents the number 1.05. The dataType of a Constant can optionally be specified. If the ParameterField definition includes a dataType, the Constant inherits the dataType specified in the ParameterField. If the dataType is specified neither in the ParameterField definition nor in the Constant element, it is inferred from the content of the element: a Constant that consists solely of digits is treated as an integer, and if a decimal point is present among the digits the Constant is treated as a float. The presence of any non-numeric characters results in the Constant being treated as a string. Conflicting dataType specifications are resolved as specified in the Functions document.

<xs:element name="Constant">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="dataType" type="DATATYPE"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>
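
The inference rules described above can be sketched with a few fragments. The field name price is hypothetical, and the comments restate the rules from the text rather than adding new ones.

```xml
<!-- A Constant as the second argument of a multiplication -->
<Apply function="*">
  <FieldRef field="price"/>
  <!-- digits with a decimal point: inferred as a float -->
  <Constant>1.05</Constant>
</Apply>

<!-- digits only: inferred as an integer -->
<Constant>42</Constant>

<!-- non-numeric characters: inferred as a string -->
<Constant>sun</Constant>

<!-- dataType given explicitly, so no inference from the content is needed -->
<Constant dataType="string">1.05</Constant>
```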

FieldRef

Field references are simply pass-throughs to fields previously defined in the DataDictionary, a DerivedField, or a result field. For example, they are used in clustering models in order to define center coordinates for fields that don't need further normalization.

<xs:element name="FieldRef">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="mapMissingTo" type="xs:string"/>
  </xs:complexType>
</xs:element>

A missing input will produce a missing result. The optional attribute mapMissingTo may be used to map a missing result to the value specified by the attribute. If the attribute is not present, the result remains missing.
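
For instance, a pass-through that replaces missing values with a fixed value could be written as below. The names income and incomeFilled are hypothetical.

```xml
<DerivedField name="incomeFilled" optype="continuous" dataType="double">
  <!-- income is passed through unchanged; a missing income becomes 0 -->
  <FieldRef field="income" mapMissingTo="0"/>
</DerivedField>
```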

Normalization

The elements for normalization provide a basic framework for mapping input values to specific value ranges, usually the numeric range [0 .. 1]. Normalization is used, e.g., in neural networks and clustering models.

<xs:element name="NormContinuous">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="LinearNorm" minOccurs="2" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="mapMissingTo" type="NUMBER"/>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="outliers" type="OUTLIER-TREATMENT-METHOD" default="asIs"/>
  </xs:complexType>
</xs:element>

<xs:element name="LinearNorm">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="orig" type="NUMBER" use="required"/>
    <xs:attribute name="norm" type="NUMBER" use="required"/>
  </xs:complexType>
</xs:element>

NormContinuous: defines how to normalize an input field by piecewise linear interpolation. The mapMissingTo attribute defines the value the output is to take if the input is missing. If the mapMissingTo attribute is not specified, then missing input values produce a missing result.

The sequence of LinearNorm elements defines a sequence of points for a piecewise linear interpolation function. The sequence must contain at least two elements. Within NormContinuous the LinearNorm elements must be strictly sorted by ascending value of orig. Given two adjacent points (a1, b1) and (a2, b2), that is, two points such that there is no other point (a3, b3) with a1 < a3 < a2, the normalized value is

b1 + (x - a1) / (a2 - a1) * (b2 - b1)   for a1 ≤ x ≤ a2.

[Figure: piecewise interpolated normalization, extrapolated on the undefined range]

Missing input values are mapped to missing output.
If the input value is not within the range [a1..an], then it is treated according to outliers (compare with outliers in MiningSchema):

asIs: Extrapolates the normalization from the nearest interval.

asMissingValues: Maps to a missing value.

asExtremeValues: Maps to the next value from the nearest interval so that the function is continuous.

The graph above shows the default behavior with an extrapolation for values less than a1 or greater than a3.

The graph below depicts a mapping where outliers are mapped using asExtremeValues:

[Figure: piecewise interpolated normalization with min/max on the undefined range]
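
As a worked sketch of the interpolation and outlier rules, consider the following definition with three points. The field name age and all numbers are made up for illustration.

```xml
<NormContinuous field="age" outliers="asExtremeValues" mapMissingTo="0.5">
  <LinearNorm orig="0" norm="0"/>
  <LinearNorm orig="50" norm="0.25"/>
  <LinearNorm orig="100" norm="1"/>
</NormContinuous>
<!--
  age = 25      gives 0 + (25 - 0) / (50 - 0) * (0.25 - 0)        = 0.125
  age = 75      gives 0.25 + (75 - 50) / (100 - 50) * (1 - 0.25)  = 0.625
  age = 120     lies above the last point and gives 1 (asExtremeValues)
  age = -10     lies below the first point and gives 0 (asExtremeValues)
  age missing   gives 0.5 (mapMissingTo)
-->
```

With outliers="asMissingValues" the inputs 120 and -10 would instead produce a missing value, and with the default asIs they would be extrapolated along the nearest segment.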

NormContinuous can be used to implement simple normalization functions such as the z-score transformation (X - m) / s, where m is the mean value and s is the standard deviation.

<NormContinuous field="X">
  <LinearNorm orig="0" norm="-m/s"/>
  <LinearNorm orig="m" norm="0"/>
</NormContinuous>

In this example we assume that outliers are treated "asIs".
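
With concrete numbers the same pattern might look like this, assuming a hypothetical mean m = 50 and standard deviation s = 10, so that -m/s = -5.

```xml
<NormContinuous field="X" outliers="asIs">
  <!-- (0 - 50) / 10 = -5 -->
  <LinearNorm orig="0" norm="-5"/>
  <!-- (50 - 50) / 10 = 0 -->
  <LinearNorm orig="50" norm="0"/>
</NormContinuous>
<!--
  X = 70 lies outside [0..50] and is extrapolated from the only interval:
  -5 + (70 - 0) / (50 - 0) * (0 - (-5)) = -5 + 7 = 2, which equals (70 - 50) / 10.
-->
```

Because the z-score is a single straight line, two points are enough, and the asIs extrapolation extends that line to all inputs.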

Normalize discrete values

Many mining models encode string values into numeric values in order to perform mathematical computations. For example, regression and neural network models often split categorical and ordinal fields into multiple dummy fields. This kind of normalization is supported in PMML by the element NormDiscrete.

<xs:element name="NormDiscrete">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="method" fixed="indicator">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="indicator"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="value" type="xs:string" use="required"/>
    <xs:attribute name="mapMissingTo" type="NUMBER"/>
  </xs:complexType>
</xs:element>

With the indicator method, an element (f, v) defines that the unit has value 1.0 if the value of input field f is v, otherwise it is 0.

The set of NormDiscrete instances which refer to a certain input field define a fan-out function which maps a single input field to a set of normalized fields.

If the input value is missing and the attribute mapMissingTo is not specified then the result is a missing value as well. If the input value is missing and the attribute mapMissingTo is specified then the result is the value of the attribute mapMissingTo.
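
A fan-out of a categorical field into indicator fields could be sketched as follows. The field name color, its three values and the derived field names are hypothetical.

```xml
<DerivedField name="colorIsRed" optype="continuous" dataType="double">
  <NormDiscrete field="color" value="red" mapMissingTo="0"/>
</DerivedField>
<DerivedField name="colorIsGreen" optype="continuous" dataType="double">
  <NormDiscrete field="color" value="green" mapMissingTo="0"/>
</DerivedField>
<DerivedField name="colorIsBlue" optype="continuous" dataType="double">
  <NormDiscrete field="color" value="blue" mapMissingTo="0"/>
</DerivedField>
<!--
  color = "green"  gives colorIsRed = 0, colorIsGreen = 1, colorIsBlue = 0
  color missing    gives 0 for all three fields, because mapMissingTo is set
-->
```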

Discretization

Discretization of numerical input fields is a mapping from continuous to discrete values using intervals.

<xs:element name="Discretize">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="DiscretizeBin" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="mapMissingTo" type="xs:string"/>
    <xs:attribute name="defaultValue" type="xs:string"/>
    <xs:attribute name="dataType" type="DATATYPE"/>
  </xs:complexType>
</xs:element>

<xs:element name="DiscretizeBin">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="Interval"/>
    </xs:sequence>
    <xs:attribute name="binValue" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>

The attribute field defines the name of the input field. The elements DiscretizeBin define a set of mappings from an interval_i to a binValue_i. The value of the DerivedField is binValue_i if the input value is contained in interval_i for some i.

Two intervals may be mapped to the same categorical value but the mapping for each numerical input value must be unique, i.e., the intervals must be disjoint. The intervals should cover the complete range of input values.

Decision table for Discretize

('*' stands for any combination)

input value	matching interval	defaultValue	mapMissingTo	=>	result
val	Interval_i	*	*	=>	binValue_i
val	none	someVal	*	=>	someVal
val	none	not specified	*	=>	missing
missing	*	*	someVal	=>	someVal
missing	*	*	not specified	=>	missing
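
The rows of the decision table can be exercised with a definition whose intervals deliberately leave a gap. This is a sketch with hypothetical field names and boundaries.

```xml
<DerivedField name="TempClass" optype="categorical" dataType="string">
  <Discretize field="Temperature" mapMissingTo="unknown" defaultValue="mild">
    <DiscretizeBin binValue="cold">
      <Interval closure="openOpen" rightMargin="10"/>
    </DiscretizeBin>
    <DiscretizeBin binValue="hot">
      <Interval closure="closedOpen" leftMargin="25"/>
    </DiscretizeBin>
  </Discretize>
</DerivedField>
<!--
  Temperature = 5        matches the first interval   and gives "cold"
  Temperature = 30       matches the second interval  and gives "hot"
  Temperature = 18       matches no interval          and gives "mild" (defaultValue)
  Temperature missing                                 gives "unknown" (mapMissingTo)
-->
```

Without the defaultValue attribute, the input 18 would produce a missing result; without mapMissingTo, a missing input would stay missing.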

Example:

A definition such as:

<Discretize field="Profit">
  <DiscretizeBin binValue="negative">
    <Interval closure="openOpen" rightMargin="0"/>
    <!-- left margin is -infinity by default -->
  </DiscretizeBin>
  <DiscretizeBin binValue="positive">
    <Interval closure="closedOpen" leftMargin="0"/>
    <!-- right margin is +infinity by default -->
  </DiscretizeBin>
</Discretize>

takes the field Profit as input and maps values less than 0 to negative and other values to positive. A missing value for Profit is mapped to a missing value.

In SQL this definition corresponds to a CASE expression

CASE WHEN "Profit" < 0 THEN 'negative' WHEN "Profit" >= 0 THEN 'positive' END

Discretize can also be used to create a missing value indicator for a continuous variable. In this case, the DiscretizeBin element is superfluous and can be dropped.

Example:

<DerivedField name="Age_mis" displayName="Age missing or not" optype="categorical" dataType="string">
  <Discretize field="Age" mapMissingTo="Missing" defaultValue="Not missing"/>
</DerivedField>

Map Values

Any discrete value can be mapped to any possibly different discrete value by listing the pairs of values. This list is implemented by a table, so it can be given inline by a sequence of XML markups or by a reference to an external table. The same technique is used for a Taxonomy because the tables can become quite large.

<xs:element name="MapValues">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="FieldColumnPair"/>
      <xs:choice minOccurs="0">
        <xs:element ref="TableLocator"/>
        <xs:element ref="InlineTable"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="mapMissingTo" type="xs:string"/>
    <xs:attribute name="defaultValue" type="xs:string"/>
    <xs:attribute name="outputColumn" type="xs:string" use="required"/>
    <xs:attribute name="dataType" type="DATATYPE"/>
  </xs:complexType>
</xs:element>

<xs:element name="FieldColumnPair">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="column" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>

The types InlineTable and TableLocator are defined in the Taxonomy schema.

The mapMissingTo attribute defines the value the output column is to take if any of the input columns are missing.

Different string values can be mapped to one value, but it is an error if the table entries used for matching are not unique. The value mapping may be partial: if an input value does not match a value in the mapping table, then the result can be a missing value. See the decision table below for the possible combinations.

Decision table for MapValues

('*' stands for any combination)

input value	matching value	defaultValue	mapMissingTo	=>	result
val	in row i	*	*	=>	outputColumn in row i
val	none	someVal	*	=>	someVal
val	none	not specified	*	=>	missing
missing	*	*	someVal	=>	someVal
missing	*	*	not specified	=>	missing
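
A partial mapping that uses both defaultValue and mapMissingTo might look like the sketch below. The field name country, the column names and the table contents are hypothetical.

```xml
<DerivedField name="region" optype="categorical" dataType="string">
  <MapValues outputColumn="region" mapMissingTo="not given" defaultValue="other">
    <FieldColumnPair field="country" column="code"/>
    <InlineTable>
      <row><code>DE</code><region>Europe</region></row>
      <row><code>FR</code><region>Europe</region></row>
      <row><code>JP</code><region>Asia</region></row>
    </InlineTable>
  </MapValues>
</DerivedField>
<!--
  country = "FR"    matches a row   and gives "Europe"
  country = "BR"    matches no row  and gives "other" (defaultValue)
  country missing                   gives "not given" (mapMissingTo)
-->
```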

Example:

A definition such as

<MapValues outputColumn="longForm">
  <FieldColumnPair field="gender" column="shortForm"/>
  <InlineTable>
    <row><shortForm>m</shortForm><longForm>male</longForm>
    </row>
    <row><shortForm>f</shortForm><longForm>female</longForm>
    </row>
  </InlineTable>
</MapValues>

maps abbreviations in the field gender to their corresponding full words. That is, m is mapped to male and f is mapped to female.

In SQL this definition corresponds to a CASE expression

CASE "gender" WHEN 'm' THEN 'male' WHEN 'f' THEN 'female' END

Example:

Here is an example for the multi-dimensional case, mapping from state and band to salary:

state	band 1	band 2
MN	10,000	20,000
IL	12,000	23,000
NY	20,000	30,000

Respective PMML would look like this:

<DerivedField dataType="double" optype="continuous">
  <MapValues outputColumn="out" dataType="integer">
    <FieldColumnPair field="BAND" column="band"/> 
    <FieldColumnPair field="STATE" column="state"/> 
    <InlineTable>
      <row>
        <band>1</band> 
        <state>MN</state> 
        <out>10000</out> 
      </row>
      <row>
        <band>1</band> 
        <state>IL</state> 
        <out>12000</out> 
      </row>
      <row>
        <band>1</band> 
        <state>NY</state> 
        <out>20000</out> 
      </row>
      <row>
        <band>2</band> 
        <state>MN</state> 
        <out>20000</out> 
      </row>
      <row>
        <band>2</band> 
        <state>IL</state> 
        <out>23000</out> 
      </row>
      <row>
        <band>2</band> 
        <state>NY</state> 
        <out>30000</out> 
      </row>
    </InlineTable>
  </MapValues>
</DerivedField>

Example:

The MapValues element can be used to create missing value indicators for categorical variables. In this case, only one FieldColumnPair needs to be specified and the column attribute can be omitted.

<DerivedField name="LSTROPEN_MIS" optype="categorical" dataType="string">
  <MapValues mapMissingTo="Missing" defaultValue="Not missing" outputColumn="none">
    <FieldColumnPair field="LSTROPEN" column="none"/>
  </MapValues>
</DerivedField>

Extracting term frequencies from text

To leverage textual input in a PMML model, we can use a TextIndex expression to extract frequency information from the text input field, for a given term. The TextIndex element fully configures how the text input should be indexed, including case sensitivity, normalization and other settings. It has a single EXPRESSION element nested within containing the term value to look for, which will usually be a simple Constant.

<xs:element name="TextIndex">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="TextIndexNormalization" minOccurs="0" maxOccurs="unbounded"/>
      <xs:group ref="EXPRESSION"/>
    </xs:sequence>
    <xs:attribute name="textField" type="FIELD-NAME" use="required"/>
    <xs:attribute name="localTermWeights" default="termFrequency">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="termFrequency"/>
          <xs:enumeration value="binary"/>
          <xs:enumeration value="logarithmic"/>
          <xs:enumeration value="augmentedNormalizedTermFrequency"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="isCaseSensitive" type="xs:boolean" default="false"/>
    <xs:attribute name="maxLevenshteinDistance" type="xs:integer" default="0"/>
    <xs:attribute name="countHits" default="allHits">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="allHits"/>
          <xs:enumeration value="bestHits"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="wordSeparatorCharacterRE" type="xs:string" default="\s"/>
    <xs:attribute name="tokenize" type="xs:boolean" default="true"/>    
  </xs:complexType>
</xs:element>

The TextIndex element fully configures how the text in textField should be processed and translated into a frequency metric for a particular term of interest. The actual frequency metric to be returned is defined through the localTermWeights attribute. The options are described in more detail by T. G. Kolda and D. P. O'Leary, A Semi-Discrete Matrix Decomposition of Latent Semantic Indexing in Information Retrieval, ACM Transactions on Information Systems, Volume 16, 1998, pages 322-346.

termFrequency: use the number of times the term occurs in the document (x = freq_i).
binary: use 1 if the term occurs in the document or 0 if it doesn't (x = χ(freq_i)).
logarithmic: take the logarithm (base 10) of 1 + the number of times the term occurs in the document. (x = log(1 + freq_i))
augmentedNormalizedTermFrequency: this formula adds to the binary frequency a "normalized" component expressing the frequency of a term relative to the highest frequency of terms observed in that document (x = 0.5 * (χ(freq_i) + (freq_i / max_k(freq_k))) )
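
As a worked illustration of the four options, suppose a term occurs 9 times in a document whose most frequent term occurs 18 times (both numbers are made up). Then termFrequency gives 9, binary gives 1, logarithmic gives log10(1 + 9) = 1, and augmentedNormalizedTermFrequency gives 0.5 * (1 + 9 / 18) = 0.75. A definition selecting the logarithmic weight could be sketched as follows; the field names and the optype and dataType values are assumptions.

```xml
<DerivedField name="sunLogFrequency" optype="continuous" dataType="double">
  <!-- 9 occurrences of "sun" in myTextField would yield log10(1 + 9) = 1 -->
  <TextIndex textField="myTextField" localTermWeights="logarithmic">
    <Constant>sun</Constant>
  </TextIndex>
</DerivedField>
```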

The isCaseSensitive attribute defines whether or not the case used in the text input should exactly match the case usage in the terms for which the frequency should be returned. Through maxLevenshteinDistance, small spelling mistakes can be accommodated, accepting a certain number of character additions, omissions or replacements. See also this article. To capture compound words more easily, the wordSeparatorCharacterRE attribute can be used to pass a regular expression containing possible word separator characters. For example, if "[\s\-]" is passed, the strings "user-friendly" and "user friendly" would both match the term "user friendly". More complex normalization operations can be addressed through the TextIndexNormalization element. Note that the default value for attribute wordSeparatorCharacterRE is considered to be the space character. The wordSeparatorCharacterRE attribute applies to the given text unless attribute tokenize is set to "false".

When attribute tokenize is set to "true", which is its default value, wordSeparatorCharacterRE should be applied to the text input field as well as the given term. This process will result in one or more tokens, from which leading and trailing punctuation should then be removed. The punctuation-free tokenized term can then be described by one word or a sequence of words, depending on the number of resulting tokens.
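
The compound-word case described above could be expressed as in the following sketch. The field names and the optype and dataType values are assumptions.

```xml
<DerivedField name="userFriendlyFrequency" optype="continuous" dataType="integer">
  <!-- Whitespace and hyphens both separate words, so "user-friendly"
       and "user friendly" in reviewText both count as hits for the term -->
  <TextIndex textField="reviewText" localTermWeights="termFrequency"
     wordSeparatorCharacterRE="[\s\-]">
    <Constant>user friendly</Constant>
  </TextIndex>
</DerivedField>
```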

To calculate a term's frequency, count the number of "hits" in the text for the particular term according to the value of countHits. A "hit" is defined as an occurrence of the term in the input text that meets the case requirements defined through isCaseSensitive and lies within the maxLevenshteinDistance. For example, suppose the given term is "brown fox", the input text is "The quick browny foxy jumps over the lazy dog. The brown fox runs away and to be with another brown foxy.", and attribute maxLevenshteinDistance is set to 1. The text "browny foxy" is not considered a "hit" since its Levenshtein distance is 2 (the sum of the Levenshtein distances for the words "browny" and "foxy" after the input text has been tokenized). The text "brown fox" that is retrieved next is a "hit" since the Levenshtein distance is 0. The text "brown foxy." is also a "hit" since the Levenshtein distance is 1. Note that the punctuation has been removed before computing the Levenshtein distance.

The number of "hits" in the text can be counted in two different ways. These are:

allHits: count all hits
bestHits: count all hits with the lowest Levenshtein distance

For example, if the input text is defined as "I have a doog. My dog is white. The doog is friendly" and the attribute maxLevenshteinDistance is set to 1, the number of hits for term "dog" will be 3 if attribute countHits is set to "allHits" and 1 if it is set to "bestHits".
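
The "dog" example could be written as the sketch below. The field names and the optype and dataType values are assumptions; the hit counts in the comment are the ones stated in the text above.

```xml
<DerivedField name="dogBestHits" optype="continuous" dataType="integer">
  <!-- For "I have a doog. My dog is white. The doog is friendly":
       countHits="bestHits" counts only the exact match "dog" (distance 0), giving 1;
       countHits="allHits" would also count both "doog" tokens (distance 1), giving 3 -->
  <TextIndex textField="myTextField" localTermWeights="termFrequency"
     maxLevenshteinDistance="1" countHits="bestHits">
    <Constant>dog</Constant>
  </TextIndex>
</DerivedField>
```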

An example

<MiningSchema>
  <MiningField name="myTextField"/>
  ...
</MiningSchema>
<LocalTransformations>
  <DerivedField name="sunFrequency">
    <TextIndex textField="myTextField" localTermWeights="termFrequency" 
       isCaseSensitive="false" maxLevenshteinDistance="1" >
      <Constant>sun</Constant>
    </TextIndex>
  </DerivedField>
  ...

In the above example, the DerivedField sunFrequency will contain the number of hits for the term "sun" in the input field myTextField, regardless of case and with at most one spelling mistake. For example, if the value of myTextField is "The Sun was setting while the captain's son reached the bounty island, minutes after their ship had sunk to the bottom of the ocean", sunFrequency will be 3 as "Sun", "son" and "sunk" all match the term "sun" with a Levenshtein distance of 0 or 1. If the maximum Levenshtein distance were to be 0, only "Sun" would have matched.

Predictive models using text as an input are likely to be looking for more than a single term. Therefore, it is often convenient to define the TextIndex element just once inside a DefineFunction element and then invoke it with Apply elements as shown in the following example.

...
<TransformationDictionary>
  <DefineFunction name="myIndexFunction">
    <ParameterField name="text"/>   
    <ParameterField name="term"/>
    <TextIndex textField="text" localTermWeights="termFrequency" 
       isCaseSensitive="false" maxLevenshteinDistance="1" >
      <FieldRef field="term"/>
    </TextIndex>
  </DefineFunction>
</TransformationDictionary>
...
<MiningSchema>
  <MiningField name="myTextField"/>
  ...
</MiningSchema>
<LocalTransformations>
  <DerivedField name="sunFrequency">
    <Apply function="myIndexFunction">
      <FieldRef field="myTextField"/> 
      <Constant>sun</Constant>            
    </Apply>
  </DerivedField>
  <DerivedField name="rainFrequency">
    <Apply function="myIndexFunction">
      <FieldRef field="myTextField"/>      
      <Constant>rain</Constant>
    </Apply>
  </DerivedField>
  <DerivedField name="windFrequency">
    <Apply function="myIndexFunction">
      <FieldRef field="myTextField"/>  
      <Constant>wind</Constant>         
    </Apply>
  </DerivedField>
  ...

Normalizing text input

While the Levenshtein distance is useful to cover small spelling mistakes, it's not necessarily suitable to capture different forms of the same term, such as cases of nouns, conjugations of verbs or even synonyms. One or more TextIndexNormalization elements can be nested within a TextIndex to normalize input text into a more term-friendly form.

<xs:element name="TextIndexNormalization">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      <xs:choice minOccurs="0">
        <xs:element ref="TableLocator"/>
        <xs:element ref="InlineTable"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="inField" type="xs:string" default="string"/>
    <xs:attribute name="outField" type="xs:string" default="stem"/>
    <xs:attribute name="regexField" type="xs:string" default="regex"/>
    <xs:attribute name="recursive" type="xs:boolean" default="false"/>
    <xs:attribute name="isCaseSensitive" type="xs:boolean"/>
    <xs:attribute name="maxLevenshteinDistance" type="xs:integer"/>
    <xs:attribute name="wordSeparatorCharacterRE" type="xs:string"/>
    <xs:attribute name="tokenize" type="xs:boolean"/>    
  </xs:complexType>
</xs:element>

A TextIndexNormalization element offers more advanced ways of normalizing text input into a more controlled vocabulary that corresponds to the terms being used in invocations of this indexing function. The normalization operation is defined through a translation table, specified through a TableLocator or InlineTable element.

If an input in the inField column is encountered in the text, it is replaced by the value in the outField column. If there is a regexField column and its value for that row is true, the string in the inField column should be treated as a PCRE regular expression. For regular expression rows, attributes maxLevenshteinDistance and isCaseSensitive are ignored.

By default, the translation table is applied once, applying each row once from top to bottom. If the recursive flag is set to true, the normalization table is reapplied until none of its rows causes a change to the input text. If multiple TextIndexNormalization elements are defined, they are applied in the order in which they appear. That is, if more than one TextIndexNormalization element is defined, the output of applying a TextIndexNormalization element will serve as the input to the next TextIndexNormalization element. This makes it possible, for example, to have a first normalization step that takes care of morphological translations to get each word into some base form (stemming), a second normalization step that combines synonyms by applying some sort of taxonomy, and perhaps a third step that looks for particular sequences of normalized tokens.

By default, the TextIndexNormalization element inherits the values for isCaseSensitive, maxLevenshteinDistance and wordSeparatorCharacterRE from the TextIndex element, but they can be overridden per TextIndexNormalization element.

For each TextIndexNormalization element, wordSeparatorCharacterRE does not apply to the inField and outField columns for regular expression rows. However, it applies to both inField and outField for non-regular expression rows.
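
The override mechanism might be used as in the following sketch, where the surrounding TextIndex is case-insensitive and tolerant of one spelling mistake, while one normalization table requires an exact, case-sensitive match. The field name, the table contents and the term are hypothetical; the table relies on the default column names string and stem and has no regex column, so its row is a non-regular-expression row.

```xml
<TextIndex textField="reviewText" isCaseSensitive="false" maxLevenshteinDistance="1">
  <!-- Overrides the inherited settings for this table only:
       the acronym "IT" is replaced, the pronoun "it" is left alone -->
  <TextIndexNormalization isCaseSensitive="true" maxLevenshteinDistance="0">
    <InlineTable>
      <row>
        <string>IT</string>
        <stem>information_technology</stem>
      </row>
    </InlineTable>
  </TextIndexNormalization>
  <Constant>information_technology</Constant>
</TextIndex>
```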

An example with normalization

...
<TransformationDictionary>
  <DefineFunction name="myIndexFunction">
    <ParameterField name="reviewText"/>
    <ParameterField name="term"/>
    <TextIndex textField="reviewText" localTermWeights="binary" isCaseSensitive="false">
    
      <TextIndexNormalization inField="string" outField="stem" regexField="regex">
        <InlineTable>
          <row>
            <string>interfaces?</string>
            <stem>interface</stem>
            <regex>true</regex>
          </row>
          <row>
            <string>is|are|seem(ed|s?)|were</string>
            <stem>be</stem>
            <regex>true</regex>
          </row>
          <row>
            <string>user friendl(y|iness)</string>
            <stem>user_friendly</stem>
            <regex>true</regex>
          </row>
        </InlineTable>
      </TextIndexNormalization>

      <TextIndexNormalization inField="re" outField="feature" regexField="regex">
        <InlineTable>
          <row>
            <re>interface be (user_friendly|well designed|excellent)</re>
            <feature>ui_good</feature>
            <regex>true</regex>
          </row>
        </InlineTable>
      </TextIndexNormalization>
      
      <FieldRef field="term"/>

    </TextIndex>
  </DefineFunction>
</TransformationDictionary>
...
<MiningSchema>
  <MiningField name="Review"/>
  ...
</MiningSchema>
<LocalTransformations>
  <DerivedField name="isGoodUI">
    <Apply function="myIndexFunction">
      <FieldRef field="Review"/>
      <Constant>ui_good</Constant>      
    </Apply>
  </DerivedField>
  ...

For example, when processing the text fragment "Testing the app for a few days convinced me the interfaces are excellent!", applying the first normalization block yields "Testing the app for a few days convinced me the interface be excellent!". Applying the second block then yields "Testing the app for a few days convinced me the ui_good!", which will produce a frequency value of 1 for the "isGoodUI" field.

Aggregations

Association rules and sequences refer to sets of items. These sets can be defined by an aggregation over sets of input records. The records are grouped together by one of the fields and the values in this grouping field partition the sets of records for an aggregation. This corresponds to the conventional aggregation in SQL with a GROUP BY clause. Input records with a missing value in the groupField are simply ignored. This behavior is similar to that of aggregate functions in the presence of NULL values in SQL.

<xs:element name="Aggregate">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="field" type="FIELD-NAME" use="required"/>
    <xs:attribute name="function" use="required">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="count"/>
          <xs:enumeration value="sum"/>
          <xs:enumeration value="average"/>
          <xs:enumeration value="min"/>
          <xs:enumeration value="max"/>
          <xs:enumeration value="multiset"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
    <xs:attribute name="groupField" type="FIELD-NAME"/>
    <xs:attribute name="sqlWhere" type="xs:string"/>
  </xs:complexType>
</xs:element>

A definition such as:

<Aggregate field="item" function="multiset" groupField="transaction"/>

builds sets of item values; for each transaction, i.e. for each value in the field transaction there is one set of items.
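
The numeric aggregation functions follow the same pattern. As a sketch with hypothetical field names, the definition below computes one total per customer, comparable to SUM("amount") with GROUP BY "customer" in SQL; records with a missing customer value are ignored.

```xml
<DerivedField name="totalSpend" optype="continuous" dataType="double">
  <Aggregate field="amount" function="sum" groupField="customer"/>
</DerivedField>
```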
