A Genetic Algorithm-Based Approach for Building Accurate Decision Trees
by
Z. Fu, Fannie Mae
Bruce Golden, University of Maryland
S. Lele, University of Maryland
S. Raghavan, University of Maryland
Edward Wasil, American University
Presented at INFORMS National Meeting Pittsburgh, November 2006
2
A Definition of Data Mining
Exploration and analysis of large quantities of data
By automatic or semi-automatic means
To discover meaningful patterns and rules
These patterns allow a company to
Better understand its customers
Improve its marketing, sales, and customer support
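Decision trees, the model family used throughout this talk, recursively split the data on the attribute values that best separate the classes. As a minimal illustration (pure Python, invented customer data; not from the talk), here is how a single split might be chosen by Gini impurity:

```python
# Illustrative sketch only: choosing one decision-tree split by Gini impurity.
# The feature ("annual spend") and labels below are invented for demonstration.

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n              # fraction of positives
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(xs, ys):
    """Return (threshold, weighted_gini) minimizing impurity for one feature."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical feature: annual spend; label: responded to an offer (1) or not (0)
spend = [300, 2500, 1800, 150, 3200, 400]
responded = [0, 1, 1, 0, 1, 0]
threshold, impurity = best_split(spend, responded)
print(threshold, impurity)  # -> 400 0.0
```

A full tree-induction algorithm such as C4.5 applies this kind of split search recursively at every node, over every attribute.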
Note: Average time from ten replications. The left-most column gives the size of the scoring set.
21
In general, GAIT outperforms Aggregate-Initial, which outperforms Whole-Training, which outperforms Logistic Regression, which outperforms Best-Initial
The improvement of GAIT over non-GAIT procedures is statistically significant in all three experiments
Regardless of where you start, GAIT produces highly accurate decision trees
We experimented with a second data set with approximately 50,000 observations and 14 demographic variables; the results were the same
Computational Results
22
We increase the size of the training and scoring sets while the size of the test set remains the same
Six combinations are used from the marketing data set
Scalability
Percent (%)   Training Set Size   Scoring Set Size
99            310,000             124,000
72            240,000              96,000
51            160,000              64,000
25             80,000              64,000
3              10,000               4,000
1               3,000               1,500
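One plausible way to draw such splits (assumed mechanics, not the authors' code) is to shuffle the data once, hold out a fixed test set, and carve fractional training and scoring sets from the remainder:

```python
# Sketch of drawing train/score/test splits of varying sizes while the test
# set stays fixed. The function and its parameters are hypothetical.
import random

def make_splits(data, train_frac, score_frac, test_size, seed=0):
    """Shuffle once; carve off a fixed test set, then fractional train/score sets."""
    rng = random.Random(seed)
    pool = list(data)
    rng.shuffle(pool)
    test = pool[:test_size]
    rest = pool[test_size:]
    n_train = int(len(rest) * train_frac)
    n_score = int(len(rest) * score_frac)
    return rest[:n_train], rest[n_train:n_train + n_score], test

records = list(range(100_000))   # stand-in for the marketing records
train, score, test = make_splits(records, 0.25, 0.10, test_size=5_000)
print(len(train), len(score), len(test))  # -> 23750 9500 5000
```

Because the test records are removed before the fractions are taken, every train/score combination is evaluated against the same held-out test set.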
23
[Figure: Classification Accuracy vs. Training/Scoring Set Size. Plots classification accuracy (%) against percentage (%) of total data size for GAIT, Whole-Training, Logistic Regression, and Best-Initial; the accuracy axis runs from 75% to 82%.]
24
[Figure: Computing Time for Training and Scoring. Plots computing time (minutes) against percentage (%) of total data size for GAIT, Best-Initial, Whole-Training, and Logistic Regression; the time axis runs from 0 to 400 minutes.]
25
GAIT generates more accurate decision trees than Logistic Regression, Whole-Training, Aggregate-Initial, and Best-Initial
GAIT scales up reasonably well
GAIT (using only 3% of the data) outperforms Logistic Regression, Best-Initial, and Whole-Training (using 99% of the data) and takes less computing time
Computational Results
26
GAIT generates high-quality decision trees
GAIT can be used effectively on very large data sets
The key to the success of GAIT seems to be the combined use of sampling, genetic algorithms, and C4.5 (a very fast decision-tree package from the machine-learning community)
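To make that combination concrete, here is a heavily simplified sketch of the genetic-algorithm idea. Everything below is assumed for illustration: the real system seeds its population with C4.5 trees built on samples (random trees are used here instead), and its actual crossover, mutation, and selection operators are not specified in these slides.

```python
# Heavily simplified GA-over-decision-trees sketch (illustrative assumptions
# only): evolve a population of trees, using accuracy on a scoring set as
# fitness and subtree swaps as crossover, with elitism across generations.
import copy
import random

class Node:
    """A tree node: internal (feature, threshold) or leaf (label)."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.label = left, right, label

    def predict(self, x):
        if self.label is not None:
            return self.label
        child = self.left if x[self.feature] <= self.threshold else self.right
        return child.predict(x)

def random_tree(rng, n_features, depth):
    if depth == 0 or rng.random() < 0.3:
        return Node(label=rng.randint(0, 1))
    return Node(feature=rng.randrange(n_features),
                threshold=rng.uniform(0.0, 1.0),
                left=random_tree(rng, n_features, depth - 1),
                right=random_tree(rng, n_features, depth - 1))

def nodes_of(t):
    out = [t]
    if t.label is None:
        out += nodes_of(t.left) + nodes_of(t.right)
    return out

def crossover(a, b, rng):
    """Graft a random subtree of b onto a copy of a (one simple GA operator)."""
    child = copy.deepcopy(a)
    internals = [n for n in nodes_of(child) if n.label is None]
    if not internals:
        return child
    target = rng.choice(internals)
    donor = copy.deepcopy(rng.choice(nodes_of(b)))
    if rng.random() < 0.5:
        target.left = donor
    else:
        target.right = donor
    return child

def accuracy(tree, X, y):
    return sum(tree.predict(x) == yi for x, yi in zip(X, y)) / len(y)

rng = random.Random(42)
# Synthetic scoring set: label is 1 exactly when the first feature exceeds 0.5
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [1 if x[0] > 0.5 else 0 for x in X]

pop = [random_tree(rng, n_features=2, depth=3) for _ in range(20)]
for generation in range(15):
    pop.sort(key=lambda t: accuracy(t, X, y), reverse=True)
    elites = pop[:5]                      # elitism: the best trees survive
    children = [crossover(rng.choice(elites), rng.choice(pop), rng)
                for _ in range(len(pop) - len(elites))]
    pop = elites + children

best = max(pop, key=lambda t: accuracy(t, X, y))
print(round(accuracy(best, X, y), 3))
```

In GAIT itself the fitness evaluation on a scoring set plays the role shown here, while C4.5 supplies the high-quality starting trees; elitism guarantees the best tree found never gets worse from one generation to the next.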