Introduction
============
This dataset is aimed at finding flaws in PMML export implementations.
In terms of data mining, the data makes no sense at all, since the values are
randomly distributed and in no way meant to be correlated. If you obtain a
meaningful model, you most probably did something wrong.


Column Description
==================

column         SQL type (train/apply)         description (list of possible values)
-------------------------------------------------------------------------------
SGL_VAL        CHAR(1)/CHAR(2)                single value (A)
BI_VAL         CHAR(1)/CHAR(2)                two values (A, B)
TRI_VAL        CHAR(1)/CHAR(2)                three values (A, B, C)
TRAIL_BLANKS   CHAR(2)/CHAR(3)                has trailing blanks (which are missing from
                                              the application dataset) ("A", "A ", "A  ")
LEAD_BLANKS    CHAR(3)/CHAR(3)                varying amount of leading blanks ("A", " A", "  A")
EMPTY_STR      CHAR(1)/CHAR(2)                regular and empty strings (A, B, C, "")
CAP_STR        CHAR(1)/CHAR(2)                alternates in capitalization (a, A)
INT100         SMALLINT/DECIMAL(7,1)          integers between -100 and 100
INT_0_1        SMALLINT/DECIMAL(7,1)          0 or 1 only
CONST1         SMALLINT/DECIMAL(7,1)          constantly 1
NORM_0_1       DECIMAL(18,17)/DECIMAL(23,17)  normally distributed between 0 and 1
NORM100        DECIMAL(20,17)/DECIMAL(23,17)  normally distributed between -100 and 100

In addition, exploitTrain.csv has the following columns intended as targets for
prediction algorithms:

TARGET_BI      CHAR(1)                        two values (a, b)
TARGET_TRI     CHAR(1)                        three values (a, b, c)
TARGET_INT100  SMALLINT                       integers between -100 and 100
TARGET_0_1     SMALLINT                       0 or 1
TARGET_CONT    DECIMAL(18,15)                 continuous


Note that the types differ between the training and the apply dataset. This is
because various tests use irregular values or strings that do not occur before
the apply dataset.
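
To illustrate why the apply dataset needs wider types, the following Python
sketch lists the values that occur in an apply column but not in the
corresponding training column. The function name and the sample values are
hypothetical. The comparison is deliberately exact, so blanks and
capitalization count as differences.

```python
def unseen_values(train_values, apply_values):
    """Return apply values never seen in training, using exact string comparison."""
    seen = set(train_values)
    # No strip() or lower(): "A", "A " and "a" are treated as different values.
    return sorted(set(apply_values) - seen)


# Hypothetical sample values, not taken from the actual CSV files.
train = ["A", "A "]
apply_ = ["A", "A  ", "a"]
print(unseen_values(train, apply_))
```

A consumer of the exported model that silently trims or case-folds such values
would hide exactly the differences this dataset is meant to expose.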


Usage
=====

Train with exploitTrain.csv. In general, always use all fields; if the algorithm
discards a field, try to force its inclusion in the model.
Predictive algorithms should build a model for each of the respective TARGET_*
columns.
Apply using exploitApply.csv. Note: The numeric fields in exploitApply.csv
should be specified as floating point if possible!
exploitApply.csv tries to perform every evil exploit one can think of. Some
might not have a solution in PMML yet.
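
The preparation steps above could be sketched in Python as follows, using only
the standard library. The sketch assumes that the CSV files start with a header
row carrying the column names listed earlier, which this document does not
state; all helper names are hypothetical. Values are read as strings so that
blanks survive, and the numeric fields are parsed as floating point.

```python
import csv
import io

# Numeric input columns as listed in the column description.
NUMERIC_FIELDS = ("INT100", "INT_0_1", "CONST1", "NORM_0_1", "NORM100")


def split_columns(fieldnames):
    """Separate the TARGET_* columns from the input columns."""
    targets = [name for name in fieldnames if name.startswith("TARGET_")]
    inputs = [name for name in fieldnames if not name.startswith("TARGET_")]
    return inputs, targets


def parse_numeric(text):
    """Parse a numeric field as float; return None if float() cannot parse it."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def read_rows(stream):
    """Read CSV rows as dicts of strings, then convert the numeric fields."""
    reader = csv.DictReader(stream)
    rows = []
    for row in reader:
        for name in NUMERIC_FIELDS:
            if name in row:
                row[name] = parse_numeric(row[name])
        rows.append(row)
    return reader.fieldnames, rows


# Hypothetical two-row sample standing in for exploitTrain.csv.
sample = io.StringIO('TRAIL_BLANKS,INT100,TARGET_BI\n"A ",42,a\n"A",,b\n')
fieldnames, rows = read_rows(sample)
inputs, targets = split_columns(fieldnames)
for target in targets:
    # One model per target column; the training call itself depends on the tool.
    print("would train on", inputs, "to predict", target)
```

Mapping unparsable numeric text to None is only one possible choice; a tool
under test may instead need to see the raw string in order to exercise its
handling of invalid values.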

Here is an overview of what is tested in which lines. It is strongly recommended
to debug in order of appearance!

lines        tests
-------------------------------------------------------------------------------
  1 - 475    regular data
476 - 511    missing values
512 - 525    invalid values (cat)
526 - 531    floats instead of integers (num)
532 - 556    outliers
557 - 563    lowercase where not used before
564 - 571    use trailing blanks where not used before
572 - 579    use leading blanks where not used before
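
When a scoring run fails on a particular line, a small lookup such as the
following Python sketch can tell which test that line belongs to. The ranges
are copied from the table above; the function name is hypothetical, and
whether the numbering counts a header line is not stated here, so treat the
line numbers as given in the table.

```python
# Line ranges copied from the overview table (inclusive on both ends).
TEST_RANGES = [
    (1, 475, "regular data"),
    (476, 511, "missing values"),
    (512, 525, "invalid values (cat)"),
    (526, 531, "floats instead of integers (num)"),
    (532, 556, "outliers"),
    (557, 563, "lowercase where not used before"),
    (564, 571, "use trailing blanks where not used before"),
    (572, 579, "use leading blanks where not used before"),
]


def category_for_line(line_no):
    """Return the test description for a line of exploitApply.csv, or None."""
    for first, last, description in TEST_RANGES:
        if first <= line_no <= last:
            return description
    return None  # line number outside the documented ranges


print(category_for_line(530))
```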


Contact
=======
If you have any kind of feedback, questions or comments, you can contact me at
raspl@de.ibm.com (Stefan Raspl).