Introduction
============
This dataset is aimed at finding flaws in PMML export implementations. In terms
of data mining, the data makes no sense at all, since the values are randomly
distributed and in no way meant to be correlated. If you obtain a meaningful
model, you most probably did something wrong.

Column Description
==================
column         SQL type (train/apply)         description (list of possible values)
-------------------------------------------------------------------------------
SGL_VAL        CHAR(1)/CHAR(2)                single value (A)
BI_VAL         CHAR(1)/CHAR(2)                two values (A, B)
TRI_VAL        CHAR(1)/CHAR(2)                three values (A, B, C)
TRAIL_BLANKS   CHAR(2)/CHAR(3)                has trailing blanks (which are missing
                                              from the application dataset)
                                              ("A", "A ", "A ")
LEAD_BLANKS    CHAR(3)/CHAR(3)                varying amount of leading blanks
                                              ("A", " A", " A")
EMPTY_STR      CHAR(1)/CHAR(2)                regular and empty strings (A, B, C, "")
CAP_STR        CHAR(1)/CHAR(2)                alternates in capitalization (a, A)
INT100         SMALLINT/DECIMAL(7,1)          integers between -100 and 100
INT_0_1        SMALLINT/DECIMAL(7,1)          0 or 1 only
CONST1         SMALLINT/DECIMAL(7,1)          constantly 1
NORM_0_1       DECIMAL(18,17)/DECIMAL(23,17)  normally distributed between 0 and 1
NORM100        DECIMAL(20,17)/DECIMAL(23,17)  normally distributed between -100 and 100

In addition, exploitTrain.csv has the following columns intended as targets for
prediction algorithms:

TARGET_BI      CHAR(1)                        two values (a, b)
TARGET_TRI     CHAR(1)                        three values (a, b, c)
TARGET_INT100  SMALLINT                       integers between -100 and 100
TARGET_0_1     SMALLINT                       0 or 1
TARGET_CONT    DECIMAL(18,15)                 continuous

Note that the types differ between the training and the apply dataset. This is
due to various tests that use irregular values or strings which do not occur
before the apply dataset.

Usage
=====
Train with exploitTrain.csv. In general, always use all fields; if the algorithm
discards a field, try to force its inclusion in the model. Predictive algorithms
should build a model for each of the respective TARGET_* columns.
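The training rule above (one model per TARGET_* column, with every other field
kept as an input) can be sketched in Python as follows. This is only an
illustration: it assumes the CSV file is comma-separated and starts with a
header row carrying the column names from the table, which this README does not
state. The function names plan_models and read_header are hypothetical, and the
inline sample stands in for exploitTrain.csv.

```python
import csv
import io

TARGET_PREFIX = "TARGET_"


def read_header(csv_file):
    """Return the first CSV row as column names (assumes a header row)."""
    return next(csv.reader(csv_file))


def plan_models(header):
    """Pair each TARGET_* column with all non-target columns as inputs."""
    targets = [name for name in header if name.startswith(TARGET_PREFIX)]
    # Keep every non-target field, including constant ones such as CONST1,
    # because the README asks for all fields to be used.
    features = [name for name in header if not name.startswith(TARGET_PREFIX)]
    return [(target, features) for target in targets]


# Inline stand-in for exploitTrain.csv; the header layout is an assumption.
sample = io.StringIO(
    "SGL_VAL,CONST1,NORM_0_1,TARGET_BI,TARGET_CONT\n"
    "A,1,0.5,a,1.25\n"
)

for target, features in plan_models(read_header(sample)):
    print(target, "<-", features)
```

Each pair would then be handed to whichever training routine is under test, one
run per target.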

Apply using exploitApply.csv.
Note: The numeric fields in exploitApply.csv should be specified as floating
point if possible!

exploitApply.csv tries to perform every evil exploit one can think of. Some
might not have a solution in PMML yet. The following overview shows what is
tested in which lines. It is strongly recommended to debug in order of
appearance!

lines        tests
-------------------------------------------------------------------------------
  1 - 475    regular data
476 - 511    missing values
512 - 525    invalid values (categorical fields)
526 - 531    floats instead of integers (numeric fields)
532 - 556    outliers
557 - 563    lowercase where not used before
564 - 571    trailing blanks where not used before
572 - 579    leading blanks where not used before

Contact
=======
If you have any feedback, questions, or comments, you can contact me at
raspl@de.ibm.com (Stefan Raspl).