PMML 4.1 - Output fields

The Output element describes a set of result values that can be returned from a model. In particular, OutputField elements specify names, types and rules for calculating specific result features. This information can be used while writing an output table. The Output section in the model specifies names for columns in an output table and describes how to compute the corresponding values.

Example

  <Output>
    <OutputField name="P_responseYes" optype="continuous" dataType="double" targetField="response" feature="probability" value="YES"/>
    <OutputField name="P_responseNo" optype="continuous" dataType="double" targetField="response" feature="probability" value="NO"/>
    <OutputField name="I_response" optype="categorical" dataType="string" targetField="response" feature="predictedValue"/>
    <OutputField name="U_response" optype="categorical" dataType="string" targetField="response" feature="predictedDisplayValue"/>
  </Output>

If a model contains this Output element, a PMML consumer could map an input table to an output table with columns named P_responseYes, P_responseNo, etc. The values for P_responseYes are determined as the probability that the target field named response has the value YES.
Schema:

  <xs:element name="Output">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="OutputField" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="OutputField">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:sequence minOccurs="0" maxOccurs="1">
          <xs:element ref="Decisions" minOccurs="0" maxOccurs="1"/>
          <xs:group ref="EXPRESSION" minOccurs="1" maxOccurs="1"/>
        </xs:sequence>
      </xs:sequence>
      <xs:attribute name="name" type="FIELD-NAME" use="required"/>
      <xs:attribute name="displayName" type="xs:string"/>
      <xs:attribute name="optype" type="OPTYPE"/>
      <xs:attribute name="dataType" type="DATATYPE"/>
      <xs:attribute name="targetField" type="FIELD-NAME"/>
      <xs:attribute name="feature" type="RESULT-FEATURE"/>
      <xs:attribute name="value" type="xs:string"/>
      <xs:attribute name="ruleFeature" type="RULE-FEATURE" default="consequent"/>
      <xs:attribute name="algorithm" default="exclusiveRecommendation">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="recommendation"/>
            <xs:enumeration value="exclusiveRecommendation"/>
            <xs:enumeration value="ruleAssociation"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="rank" type="INT-NUMBER" default="1"/>
      <xs:attribute name="rankBasis" default="confidence">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="confidence"/>
            <xs:enumeration value="support"/>
            <xs:enumeration value="lift"/>
            <xs:enumeration value="leverage"/>
            <xs:enumeration value="affinity"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="rankOrder" default="descending">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="descending"/>
            <xs:enumeration value="ascending"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="isMultiValued" default="0"/>
      <xs:attribute name="segmentId" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="RESULT-FEATURE">
    <xs:restriction base="xs:string">
      <xs:enumeration value="predictedValue"/>
      <xs:enumeration value="predictedDisplayValue"/>
      <xs:enumeration value="transformedValue"/>
      <xs:enumeration value="decision"/>
      <xs:enumeration value="probability"/>
      <xs:enumeration value="affinity"/>
      <xs:enumeration value="residual"/>
      <xs:enumeration value="standardError"/>
      <xs:enumeration value="clusterId"/>
      <xs:enumeration value="clusterAffinity"/>
      <xs:enumeration value="entityId"/>
      <xs:enumeration value="entityAffinity"/>
      <xs:enumeration value="warning"/>
      <xs:enumeration value="ruleValue"/>
      <xs:enumeration value="reasonCode"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="Decisions">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element ref="Decision" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="businessProblem" type="xs:string"/>
      <xs:attribute name="description" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="Decision">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Extension" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="value" type="xs:string" use="required"/>
      <xs:attribute name="displayValue" type="xs:string"/>
      <xs:attribute name="description" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="RULE-FEATURE">
    <xs:restriction base="xs:string">
      <xs:enumeration value="antecedent"/>
      <xs:enumeration value="consequent"/>
      <xs:enumeration value="rule"/>
      <xs:enumeration value="ruleId"/>
      <xs:enumeration value="confidence"/>
      <xs:enumeration value="support"/>
      <xs:enumeration value="lift"/>
      <xs:enumeration value="leverage"/>
      <xs:enumeration value="affinity"/>
    </xs:restriction>
  </xs:simpleType>

The attribute name specifies the name of the OutputField. The name itself does not define how the output values are computed.
For information on the naming of OutputFields, see Scope of Fields.

The dataType of an OutputField element specifies the default column type. optype can be used to indicate admissible operations on the values. A clusterId field, for example, can have integer as its dataType, but categorical as its optype. For details, see the description of DataDictionary.

The attribute targetField must refer either to a MiningField of type predicted or to a Target in the Targets section. targetField is required when the model predicts multiple fields.

If the attribute feature is not specified, then the output value is a copy of the target field value. In this case, the target field is a required input for scoring in addition to the active fields in the MiningSchema.

The attribute value contains the raw value, not the displayValue, of a category of the target field. The target field is either specified in targetField or implied as the only target field in the model. This attribute is used only in conjunction with result features referring to an individual target value, for example probability. If the attribute value is not specified and the value of the attribute feature is probability, then the probability of the predicted categorical value should be returned as an output. Otherwise, value indicates the category for which a probability is returned.

An output field may contain an association rule or any of its properties. ruleFeature specifies which feature of an association rule to return.

The attribute algorithm applies only to Association Rules models, and specifies which scoring algorithm to use when computing the output value.

The attribute rank is used to specify which item from a set of outputs should be selected. It is used by Scorecards and Association Rules models. Scorecards can return multiple reason codes depending on the complexity of the scorecard. If the scorecard calls for multiple reason codes to be given, rank="1" returns the top reason code. To return other reason codes, rank must be set to the appropriate index. In a similar manner, Association Rules models can return multiple result values as a result of the scoring process. In this case, if the result consists of three items, the default behavior (rank="1") is for the first item to be returned as the result (assuming isMultiValued has a value of "0"; see below for more information about isMultiValued). To return other items, rank must be set to the appropriate index.

The attribute rankBasis is used to specify which criterion is used to sort the multiple result values. For instance, the results could be sorted by the confidence, support or lift of the rules. The sorting order is determined by the rankOrder attribute.
The default behavior (rankOrder="descending") indicates that the rules with the highest value of the rankBasis criterion appear first in the sorted list. In this case, setting rank="1" would return the first rule from the sorted list, which would be the highest ranked rule. Note that the attributes isMultiValued, rankBasis and rankOrder apply only to Association Rules models.

The attribute isMultiValued indicates that the output can represent multiple output values. If the value of the attribute is "1", then the rank value indicates the number of output values that should be returned: a positive value indicates the number of output values to be returned, based on the rankBasis and rankOrder, while a zero value indicates that all output values are to be returned. If the value of isMultiValued is "0" (default), then rank indicates a particular output, as defined above.

The Output element allows for post-processing of output fields. This can take a generic form through the use of EXPRESSION (see Transformations for more information). Expressions are only allowed for some result features. For details, see section Result Features below. The attribute field in an EXPRESSION must always refer to the name of an OutputField in Output. References to OutputFields must not be cyclic.

The attribute segmentId is applicable to MiningModels which utilize Segmentation. This attribute provides an alternative approach to deliver results from Segments which avoids having to specify Outputs within each Segment. If the segmentId attribute matches the id attribute of a Segment within the scope of this model element (the model element containing this OutputField's parent Output element), and if the predicate of that Segment is true, then this OutputField returns the specified result feature from the sub-model contained within the specified Segment. If there is no Segment matching segmentId, or if the predicate of the matching Segment evaluates to false, then by convention the result delivered by this OutputField is missing. Note that any Segment element's id attribute that is referenced by a segmentId attribute in an OutputField must be unique within the scope of this Segmentation element and any child Segmentation elements that are contained in nested MiningModels.

Result Features

The meaning of the feature identifier is:

predictedValue: Select the raw predicted value. More than one OutputField element can have the predictedValue feature only if the model predicts more than one field.

Details can be found in the description of the individual models.

Decisions

Decisions is used in conjunction with an EXPRESSION for output fields with result feature decision. Its attribute businessProblem names the problem or question for which a decision is proposed by application of the data mining model. The attribute description can describe the decision problem in more detail. The element contains an element Decision for every possible value of the decision. The value is a decision value as returned by the EXPRESSION. The displayValue is a string, which may be used by applications to refer to that decision. The attribute description can describe the decision in more detail.

Rule Features

The meaning of the ruleFeature attribute, when feature is set to ruleValue, is:
Examples

Below, a sequence of three examples shows how to use expressions together with OutputFields for post-processing of a predicted value, from simple rescaling to a business decision with the use of a threshold value.

Example 1

Suppose a regression model has the following OutputField elements:

  <Output>
    <OutputField name="RawResult" optype="continuous" dataType="double" feature="predictedValue"/>
    <OutputField name="FinalResult" optype="continuous" dataType="double" feature="transformedValue">
      <NormContinuous field="RawResult">
        <LinearNorm orig="-100" norm="-304"/>
        <LinearNorm orig="100" norm="324"/>
      </NormContinuous>
    </OutputField>
  </Output>

In essence, this describes the rescaling function f(x) = 10 + 3.14*x in the range between -100 and 100. With a predicted value of 8, the final derived result would be 35.12.

Example 2

  <Output>
    <OutputField name="RawResult" optype="continuous" dataType="double" feature="predictedValue"/>
    <OutputField name="FinalResult" optype="continuous" dataType="double" feature="transformedValue">
      <Apply function="round">
        <NormContinuous field="RawResult">
          <LinearNorm orig="-100" norm="-21.4"/>
          <LinearNorm orig="-10" norm="-21.4"/>
          <LinearNorm orig="10.5" norm="42.97"/>
          <LinearNorm orig="100" norm="42.97"/>
        </NormContinuous>
      </Apply>
    </OutputField>
  </Output>

In addition to the rescale function from the previous example, we now have upper and lower limits as well as rounding. Suppose the model returns a value of 8. The limits will not show effect, and after rescaling and rounding the final result will be 35. If the predicted value was 12.97, the upper limit would take effect and the maximum value 10.5 would be taken instead. After rescaling and rounding, the final derived result is 43.

Min and max can always be handled using a piecewise linear transformation (NormContinuous).
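The arithmetic in Examples 1 and 2 can be checked with a small sketch of piecewise linear interpolation. This is an illustration only, not a PMML consumer: the helper names norm_continuous and round_half_up are hypothetical, values outside the listed orig range are assumed to be extrapolated along the nearest segment, and rounding is assumed to go to the nearest integer with halves rounded up.

```python
import math
from bisect import bisect_right


def norm_continuous(x, points):
    """Interpolate x through (orig, norm) pairs sorted by ascending orig.

    Assumption: inputs outside the first/last orig are extrapolated along
    the nearest segment; a real consumer may treat outliers differently.
    """
    origs = [orig for orig, _ in points]
    # Index of the segment that contains x, clamped to the outer segments.
    i = min(max(bisect_right(origs, x) - 1, 0), len(points) - 2)
    (o0, n0), (o1, n1) = points[i], points[i + 1]
    return n0 + (x - o0) * (n1 - n0) / (o1 - o0)


def round_half_up(value):
    """Round to the nearest integer (assumed behavior of the round function)."""
    return math.floor(value + 0.5)


# LinearNorm pairs copied from Example 1 and Example 2.
EXAMPLE_1 = [(-100, -304), (100, 324)]
EXAMPLE_2 = [(-100, -21.4), (-10, -21.4), (10.5, 42.97), (100, 42.97)]

if __name__ == "__main__":
    print(round(norm_continuous(8, EXAMPLE_1), 2))             # 35.12
    print(round_half_up(norm_continuous(8, EXAMPLE_2)))        # 35
    print(round_half_up(norm_continuous(12.97, EXAMPLE_2)))    # 43
```

In Example 2 the two flat outer segments are what implement the limits: any input above 10.5 falls on the segment from 10.5 to 100, where the norm value stays at 42.97.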
For a lower limit of min and an upper limit of max, with linear behavior governed by f*x + c in between, the following example defines the transformation:

  <NormContinuous field="X">
    <LinearNorm orig="VMIN" norm="f*min+c"/>
    <LinearNorm orig="min" norm="f*min+c"/>
    <LinearNorm orig="max" norm="f*max+c"/>
    <LinearNorm orig="VMAX" norm="f*max+c"/>
  </NormContinuous>

Note that the following inequalities must hold: VMIN < min < max < VMAX.

Example 3

Building on the previous example, the current one shows how the final result, obtained after the rescale function takes place, can be compared with a given threshold of value 30 to determine a business decision where values greater than 30 yield a positive response.

  <Output>
    <OutputField name="RawResult" optype="continuous" dataType="double" feature="predictedValue"/>
    <OutputField name="FinalResult" optype="continuous" dataType="double" feature="transformedValue">
      <Apply function="round">
        <NormContinuous field="RawResult">
          <LinearNorm orig="-100" norm="-21.4"/>
          <LinearNorm orig="-10" norm="-21.4"/>
          <LinearNorm orig="10.5" norm="42.97"/>
          <LinearNorm orig="100" norm="42.97"/>
        </NormContinuous>
      </Apply>
    </OutputField>
    <OutputField name="BusinessDecision" optype="categorical" dataType="string" feature="decision">
      <Decisions businessProblem="Should the outstanding amount be collected?"
                 description="The decision depends on the likelihood to get the money and the cost to try.">
        <Decision value="waive" description="Waive any existing conditions on case and approve."/>
        <Decision value="refer" description="Keep conditions and refer case for further scrutiny."/>
      </Decisions>
      <Apply function="if">
        <Apply function="greaterThan">
          <FieldRef field="FinalResult"/>
          <Constant>30</Constant>
        </Apply>
        <!--THEN-->
        <Constant>waive</Constant>
        <!--ELSE-->
        <Constant>refer</Constant>
      </Apply>
    </OutputField>
  </Output>

Outputs Per Model Type

This table shows which outputs are allowed for each type of model defined in PMML.
Please note that this table can change as new scoring procedures are added in future releases of PMML:

Allowable Outputs based on Model Type (ok = Valid Output, X = Not Applicable)

For Model Composition, the Output values should reflect those of the last model in the calculation.

Note that the feature identifier residual is useful only if the model is used on test data that contains target values. It is straightforward to compute the residual on numeric data as the prediction error. In the case of categorical data, the residual is based on differences of probability values. For example, assume a classification model to predict the labels Y and N. For some row in the test data the actual value may be Y and the predicted value is Y with a probability of 0.8. Here the target value is Y, the category whose probability is reported. The term [actual value = target value] maps to 1.0 and the residual is the difference between 1.0 and the probability, i.e. 1.0 - 0.8 = 0.2. For some other row the actual value may be N. Assuming the predicted value and probability are the same as before, we have [actual value = target value] = 0.0 and residual = 0.0 - 0.8 = -0.8.
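The categorical residual described above can be written as a short sketch. The function name residual_categorical and its parameters are hypothetical and only mirror the worked example with the labels Y and N; the numeric case is left out because the text does not fix the sign convention of the prediction error.

```python
def residual_categorical(actual_value, target_value, probability):
    """Residual for a categorical target.

    The indicator [actual value = target value] is 1.0 when the actual label
    equals the category whose probability is reported, and 0.0 otherwise.
    The residual is that indicator minus the reported probability.
    """
    indicator = 1.0 if actual_value == target_value else 0.0
    return indicator - probability


if __name__ == "__main__":
    # Predicted value Y with probability 0.8 in both rows.
    print(round(residual_categorical("Y", "Y", 0.8), 2))  # 0.2
    print(round(residual_categorical("N", "Y", 0.8), 2))  # -0.8
```

A positive residual means the model assigned too little probability to a category that was actually observed; a negative residual means it assigned probability to a category that was not observed.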