Data Mining and Soft Computing
Session 2: Data Preparation

Francisco Herrera
Research Group on Soft Computing and Intelligent Information Systems (SCI2S)
Dept. of Computer Science and A.I., University of Granada, Spain
Email: [email protected]
http://sci2s.ugr.es
http://decsai.ugr.es/~herrera
Summary
1. Introduction to Data Mining and Knowledge Discovery
2. Data Preparation
3. Introduction to Prediction, Classification, Clustering and Association
4. Introduction to Soft Computing. Focusing our attention on Fuzzy Logic and Evolutionary Computation
5. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and Knowledge Extraction based on Evolutionary Learning
6. Genetic Fuzzy Systems: State of the Art and New Trends
7. Some Advanced Topics: Imbalanced Data Sets, Subgroup Discovery, Data Complexity
8. Final Talk: How Must I Do My Experimental Study? Design of Experiments in Data Mining/Computational Intelligence. Using Non-parametric Tests. Some Cases of Study.
Data Preparation

Outline
• Introduction
• Preprocessing
• Data Reduction: Discretization, Feature Selection, Instance Selection
• Ex.: Instance Selection and Decision Trees
• Concluding Remarks
Data Preparation in KDD: Introduction

[Figure: the KDD process. Selection extracts target data from the raw data; preprocessing and cleaning produce processed data; data mining finds patterns; interpretation and evaluation turn the patterns into knowledge.]
Data Preparation in KDD: Introduction

Step 1: Goal Identification → Defined Goals
Step 2: Create Target Data (from a Data Warehouse, Transactional Database or Flat File) → Target Data
Step 3: Data Preprocessing → Cleansed Data
Step 4: Data Transformation → Transformed Data
Step 5: Data Mining → Data Model
Step 6: Interpretation & Evaluation
Step 7: Taking Action
Data Preparation in KDD: Introduction

D. Pyle, 1999, p. 90:
“The fundamental purpose of data preparation is to manipulate and transform raw data so that the information content enfolded in the data set can be exposed, or made more easily accessible.”

Dorian Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers, 1999.
Introduction

Data preparation takes up an important part of the time in a KDD process.
Data Preparation

Outline
• Introduction
• Preprocessing
• Data Reduction: Discretization, Feature Selection, Instance Selection
• Ex.: Instance Selection and Decision Trees
• Concluding Remarks
Preprocessing

Why Data Preprocessing?
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – A data warehouse needs consistent integration of quality data
Preprocessing

Major Tasks in Data Preprocessing
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Data transformation: normalization and aggregation
• Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
• Data discretization: part of data reduction but of particular importance, especially for numerical data
Preprocessing

[Figure: the four preprocessing tasks: data cleaning, data integration, data transformation, data reduction.]
Preprocessing

Data Cleaning
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data

References:
W. Kim, B. Choi, E.-K. Hong, S.-K. Kim. A Taxonomy of Dirty Data. Data Mining and Knowledge Discovery 7, 81-99, 2003.
R.K. Pearson. Mining Imperfect Data: Dealing with Contamination and Incomplete Records. SIAM, 2005.
Preprocessing

Data Cleaning: Missing Values
• Reasons for missing values
  – Information is not collected (e.g., people decline to give their age and weight)
  – Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
  – Eliminate data objects
  – Estimate missing values
  – Ignore the missing value during analysis
  – Replace with all possible values (weighted by their probabilities)
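Two of these strategies can be sketched in a few lines of plain Python (record fields and sample values invented for illustration): eliminating incomplete data objects, and estimating a missing numeric value by the attribute mean.

```python
# Sketch of two missing-value strategies: drop incomplete records,
# or impute a missing numeric attribute with its mean.

def drop_incomplete(records):
    """Eliminate data objects that contain any missing value (None)."""
    return [r for r in records if None not in r.values()]

def impute_mean(records, attr):
    """Replace missing values of a numeric attribute with its mean."""
    known = [r[attr] for r in records if r[attr] is not None]
    mean = sum(known) / len(known)
    return [dict(r, **{attr: mean if r[attr] is None else r[attr]})
            for r in records]

people = [
    {"age": 30, "weight": 70.0},
    {"age": None, "weight": 80.0},   # declined to give age
    {"age": 50, "weight": None},     # declined to give weight
]

print(len(drop_incomplete(people)))          # 1 complete record survives
print(impute_mean(people, "age")[1]["age"])  # 40.0, the mean of 30 and 50
```

Deletion is only safe when few records are incomplete; mean imputation keeps the record but shrinks the attribute's variance, which is why the slide also lists probability-weighted replacement.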
Preprocessing

Data Cleaning: Noisy Data. Outliers
[Figure: data with outlying points.]
Preprocessing

Data Cleaning: Noisy Data. Smooth by fitting the data to regression functions.
[Figure: a regression line through the (x, y) data; the observed value Y1 at X1 is replaced by its fitted value Y1'.]
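The smoothing idea in the figure can be sketched in plain Python (the data values below are made up for illustration): fit a least-squares line to the (x, y) pairs, then replace each noisy observation with its fitted value, as Y1 is replaced by Y1'.

```python
# Regression-based smoothing sketch: fit y = a*x + b by least squares,
# then replace each observed y with its fitted value on the line.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx  # slope a, intercept b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]            # noisy observations
a, b = fit_line(xs, ys)
smoothed = [a * x + b for x in xs]   # each Y1 replaced by its fitted Y1'
```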
Preprocessing

Data Cleaning: Inconsistent Data
Example: Age = “42” but Birth date = “03/07/1997”.
Preprocessing

Data Integration
[Figure: data from servers, databases and other sources is combined by extraction and aggregation into a data warehouse.]

Reference:
E. Schallehn, K. Sattler, G. Saake. Efficient Similarity-based Operations for Data Integration. Data and Knowledge Engineering 48:3, 351-387, 2004.
Preprocessing

Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling
• Attribute/feature construction: new attributes constructed from the given ones

Reference:
T.Y. Lin. Attribute Transformation for Data Mining I: Theoretical Explorations. International Journal of Intelligent Systems 17, 213-222, 2002.
Preprocessing

Data Transformation
• Min-max normalization:
  v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A
• Z-score normalization:
  v' = (v − mean_A) / stand_dev_A
• Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
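The three formulas translate directly into small Python functions; the sample values in the calls below are illustrative only.

```python
# The three normalizations as plain Python functions. v is a single value
# of attribute A; the statistics (min, max, mean, std) come from the data.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """v' = (v - min_A)/(max_A - min_A) * (new_max - new_min) + new_min"""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """v' = (v - mean_A) / stand_dev_A"""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """v' = v / 10^j, with j the smallest integer making max(|v'|) < 1."""
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return v / 10 ** j

print(min_max(73600, 12000, 98000))      # value scaled into [0, 1]
print(z_score(73600, 54000, 16000))      # 1.225
print(decimal_scaling(-986, 986))        # -0.986 (j = 3)
```

Min-max preserves the shape of the distribution but is sensitive to outliers (they define min_A and max_A); z-score is the usual choice when the range is unknown or outliers are present.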
Preprocessing

Data Reduction
A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set.
Data reduction obtains a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results.
[Figure: target data reduced to a much smaller data set.]
Preprocessing

Data Reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Numerosity reduction
– Discretization and concept hierarchy generation
Data Preparation

Outline
• Introduction
• Preprocessing
• Data Reduction: Discretization, Feature Selection, Instance Selection
• Ex.: Instance Selection and Decision Trees
• Concluding Remarks
Data Reduction

[Figure: data reduction families: feature selection, discretization, instance selection, squashing.]
Data Reduction: Squashing

Reference:
A. Owen. Data Squashing by Empirical Likelihood. Data Mining and Knowledge Discovery 7, 101-113, 2003.
Data Reduction: Discretization

Reference:
H. Liu, F. Hussain, C.L. Tan, M. Dash. Discretization: An Enabling Technique. Data Mining and Knowledge Discovery 6, 393-423, 2002.
Discretization

• Three types of attributes:
  – Nominal: values from an unordered set
  – Ordinal: values from an ordered set
  – Continuous: real numbers
• Discretization: divide the range of a continuous attribute into intervals
  – Some classification algorithms only accept categorical attributes, e.g. most versions of Naïve Bayes, CHAID
  – Reduce data size by discretization
  – Prepare for further analysis

(Slide source: Data Mining: Concepts and Techniques, May 30, 2009)
Discretization

• Divide the range of a continuous (numeric) attribute into intervals
• Store only the interval labels
• Important for association rules and classification
• In practice, “almost-equal” height binning is used, which avoids clumping and gives more intuitive breakpoints
• Additional considerations:
  – don't split frequent values across bins
  – create separate bins for special values (e.g. 0)
• Equal Width is the simplest, good for many classes
  – it can fail miserably for unequal distributions
• Equal Height gives better results
How else can we discretize?

• Class-dependent discretization can be better for classification
  – Note: decision trees build discretization on the fly
  – Naïve Bayes requires initial discretization
• Many other methods exist …
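The Equal Width vs. Equal Height contrast above can be sketched in plain Python (assuming distinct values; the sample data is made up). On skewed data, equal width clumps almost everything into one bin, exactly the failure mode the slide warns about, while equal frequency keeps the bins balanced.

```python
# Equal Width vs. Equal Height (equal-frequency) binning sketch.

def equal_width(values, k):
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    # bin index 0..k-1 for each value
    return [min(int((v - lo) / w), k - 1) for v in values]

def equal_frequency(values, k):
    rank = {v: i for i, v in enumerate(sorted(values))}
    n = len(values)
    # each bin receives ~n/k values by rank
    return [rank[v] * k // n for v in values]

data = [1, 2, 3, 4, 5, 100]          # one extreme value skews the range
print(equal_width(data, 3))          # [0, 0, 0, 0, 0, 2] -- clumping
print(equal_frequency(data, 3))      # [0, 0, 1, 1, 2, 2] -- balanced
```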
Discretization Using Class Labels

• Entropy-based approach
[Figure: entropy-based discretization of a two-dimensional data set, with 3 categories for both x and y vs. 5 categories for both x and y.]
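The entropy-based idea can be illustrated with a small sketch (sample data invented for illustration): among all candidate cut points of a numeric attribute, pick the one that minimizes the weighted class entropy of the two resulting intervals. Recursive methods such as Fayyad and Irani's MDLP repeat exactly this step.

```python
# Entropy-based choice of a single cut point for a numeric attribute.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def best_cut(xs, ys):
    """Return the cut point minimizing weighted class entropy."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(1, n):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbors
        if best is None or e < best[0]:
            best = (e, cut)
    return best[1]

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_cut(xs, ys))  # 6.5 -> separates the two classes perfectly
```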
Discretization Without Using Class Labels

[Figure: the same data discretized by equal interval width, by equal frequency, and by K-means.]
Data Reduction: Feature Selection

References:
H. Liu, H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic, 1998.
H. Liu, H. Motoda (Eds.). Feature Extraction, Construction, and Selection: A Data Mining Perspective. Kluwer Academic, 1998.
Feature Selection

• Another way to reduce the dimensionality of data
• Redundant features
  – duplicate much or all of the information contained in one or more other attributes
  – Example: the purchase price of a product and the amount of sales tax paid
• Irrelevant features
  – contain no information that is useful for the data mining task at hand
  – Example: students' ID is often irrelevant to the task of predicting students' GPA
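The redundancy case can be made concrete with a small sketch (invented sample values): two numeric attributes that are perfectly correlated, such as a purchase price and the sales tax computed from it, carry the same information, so one of them can be dropped.

```python
# Detecting a redundant feature with the Pearson correlation coefficient.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

price = [10.0, 25.0, 40.0, 80.0]
tax = [p * 0.07 for p in price]   # tax is a fixed fraction of price

# |correlation| close to 1 -> one of the two attributes is redundant
print(pearson(price, tax))        # 1.0
```

Correlation only catches linear redundancy; selection methods such as those in the Liu and Motoda books use richer subset-evaluation measures.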
Reference: J.R. Cano, F. Herrera, M. Lozano. Using Evolutionary Algorithms as Instance Selection for Data Reduction in KDD: An Experimental Study. IEEE Trans. on Evolutionary Computation 7:6 (2003) 561-575.
Instance Selection

MODEL: Training set selection
[Figure: the data set (D) is split into a training set (TR) and a test set (TS); a prototype selection algorithm reduces TR to a selected subset (TSS); the DM algorithm then builds the model from TSS.]
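As a concrete instance of the prototype-selection box in this scheme, here is a sketch of Hart's classic Condensed Nearest Neighbour rule in plain Python (a much simpler method than the evolutionary selection of the cited paper, with invented sample data): keep only the instances that 1-NN needs in order to classify the training set correctly.

```python
# Condensed Nearest Neighbour (Hart, 1968) instance selection sketch.

def nearest_label(store, x):
    """Label of the stored instance nearest to x (squared Euclidean)."""
    return min(store,
               key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[1]

def cnn_select(data):
    """data: list of (point, label). Returns the selected subset (TSS)."""
    store = [data[0]]
    changed = True
    while changed:                       # repeat until a full clean pass
        changed = False
        for point, label in data:
            if nearest_label(store, point) != label:
                store.append((point, label))   # keep misclassified instance
                changed = True
    return store

data = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"),
        ((5.0,), "b"), ((5.1,), "b"), ((5.2,), "b")]
selected = cnn_select(data)
print(len(selected))  # far fewer instances than the original six
```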
Instance Selection

Large databases: the stratification strategy.
[Figure: the data set is split into disjoint strata T1 … Tt; instance selection (IS) on each stratum yields selected subsets SS1 … SSt, which are combined into training sets TR1 … TRt with corresponding test sets TS1 … TSt.]

Reference: J.R. Cano, F. Herrera, M. Lozano. Stratification for Scaling Up Evolutionary Prototype Selection. Pattern Recognition Letters 26:7 (2005) 953-963.
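The splitting step of this strategy can be sketched in a few lines (a simple random split for illustration; the cited paper additionally preserves the class distribution inside each stratum): shuffle the data and deal it into t disjoint strata of almost equal size, so instance selection can run on each stratum independently.

```python
# Stratification sketch: deal a large data set into t disjoint strata.
import random

def stratify(data, t, seed=0):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    return [shuffled[i::t] for i in range(t)]  # t disjoint strata

data = list(range(1000))
strata = stratify(data, 4)   # four strata of 250 instances each
```

Because each stratum is far smaller than the whole data set, an expensive selection algorithm (e.g. an evolutionary one) becomes tractable on each piece.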
Data Preparation

Outline
• Introduction
• Preprocessing
• Data Reduction: Discretization, Feature Selection, Instance Selection
• Ex.: Instance Selection and Decision Trees
• Concluding Remarks
Ex.: Instance Selection and Decision Trees

Feature: color
[Figure: a decision node on the feature “color”, with branches green, yellow and red.]

J.R. Cano, F. Herrera, M. Lozano. Evolutionary Stratified Training Set Selection for Extracting Classification Rules with Trade-off Precision-Interpretability. Data and Knowledge Engineering 60 (2007) 90-108.
Instance Selection and Decision Trees

Example: decision tree over the attribute set {A1, A2, A3, A4, A5, A6}.
[Figure: a tree that tests A4 at the root and A1 and A6 below it, with leaves Class 1 and Class 2.]
→ Reduced attribute set: {A1, A4, A6}
Instance Selection and Decision Trees

• Decision trees: comprehensibility requires small trees.
• Pruning: we can cut/eliminate nodes.
• Instance selection strategies allow us to build decision trees for large databases while reducing the tree size.
• Instance selection can increase decision tree interpretability: a reduced number of rules, with a reduced number of variables per rule.
Instance Selection and Decision Trees

Adult Data Set (2 classes, 30132 instances, 14 variables):

                                      C4.5    IS-CHC/C4.5
  Rule numbers                         359              5
  Variables per rule                    14              3
  Rule confidence N(Cond,Class)/N    0.003          0.167

Instance selection allows us to obtain more interpretable rule sets (a low number of rules and variables).
Data Preparation

Outline
• Introduction
• Preprocessing
• Data Reduction: Discretization, Feature Selection, Instance Selection
• Ex.: Instance Selection and Decision Trees
• Concluding Remarks
Concluding Remarks

Data preprocessing is a necessity when we work with real applications.

[Figure: from target data, Data Preparation (cleaning, integration, transformation, reduction) feeds the data mining models (association rules, classification, prediction, clustering); the resulting patterns are interpreted, visualized and validated to obtain knowledge and interpretability.]

Bibliography: http://sci2s.ugr.es/keel (List of references by Specific Areas)
Concluding Remarks

Advantage: data preparation allows us to apply the data mining algorithms in a quicker and simpler way, obtaining high-quality models: high precision and/or high interpretability.

It is not all advantage: data preparation is not a structured area with a specific methodology for managing a new problem. Every problem may need a different preprocessing process, using different tools.
Concluding Remarks

Summary
“Good data preparation is key to producing valid and reliable models.”
• Data preparation is a big issue for both warehousing and mining.
• Data preparation includes:
  – data cleaning and data integration
  – data reduction and data transformation
• A lot of methods have been developed, but this is still an active area of research.
• The cooperation between DM algorithms and data preparation methods is an interesting and active area.
Bibliography

Dorian Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
Mamdouh Refaat. Data Preparation for Data Mining Using SAS. Morgan Kaufmann, 2006.
Tamraparni Dasu, Theodore Johnson. Exploratory Data Mining and Data Cleaning. Wiley, 2003.
Bibliography of the Research Group SCI2S

S. García, J.R. Cano, F. Herrera. A Memetic Algorithm for Evolutionary Prototype Selection: A Scaling Up Approach. Pattern Recognition 41:8 (2008) 2693-2709, doi:10.1016/j.patcog.2008.02.006.
J.R. Cano, S. García, F. Herrera. Subgroup Discovery in Large Size Data Sets Preprocessed Using Stratified Instance Selection for Increasing the Presence of Minority Classes. Pattern Recognition Letters 29 (2008) 2156-2164, doi:10.1016/j.patrec.2008.08.001.
J.R. Cano, F. Herrera, M. Lozano. Evolutionary Stratified Training Set Selection for Extracting Classification Rules with Trade-off Precision-Interpretability. Data and Knowledge Engineering 60 (2007) 90-108, doi:10.1016/j.datak.2006.01.008.
J.R. Cano, F. Herrera, M. Lozano. Stratification for Scaling Up Evolutionary Prototype Selection. Pattern Recognition Letters 26 (2005) 953-963, doi:10.1016/j.patrec.2004.09.043.
J.R. Cano, F. Herrera, M. Lozano. Using Evolutionary Algorithms as Instance Selection for Data Reduction in KDD: An Experimental Study. IEEE Trans. on Evolutionary Computation 7:6 (2003) 561-575, doi:10.1109/TEVC.2003.819265.

Available at: http://sci2s.ugr.es/publications/byAll.php