Data Mining Techniques and Applications, 1st edition, Hongbo Du, ISBN 978-1-84480-891-5, © 2010 Cengage Learning. Chapter Two: Principles of data mining
Apr 01, 2015
Chapter Overview
• The process of data mining
• Approaches of data mining
• Categories of data mining problems
• Information patterns to be discovered
• Overview of data mining solutions
• Importance of evaluation
• Undertaking a data mining task in Weka
• Review of basic concepts in statistics and probability
Data Mining Process
[Figure: Input Data → Preparing Input Data → Mining Patterns → Post-processing Patterns → Output Patterns. Legend: a data mining stage; flow of control from one stage to the next stage; flow of control from one stage to the previous stage; repetition of the tasks at one stage.]
Data Mining Process
• Preparation
[Figure: Original Data sets → Collected Data set → Target Data set → Pre-Processed Data set → Formatted Data set]
  – Original to collected: integrating data; getting necessary data details
  – Collected to target: selecting relevant features; selecting relevant records
  – Target to pre-processed: data cleaning; dealing with unknown data; data transformation
  – Pre-processed to formatted: formatting data into a form acceptable to the mining tool
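The preparation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, the selection threshold and the choice of mean substitution for unknown values are all invented here and are not taken from the slides.

```python
import csv
import io
from statistics import mean

# Hypothetical collected records (invented); None marks an unknown value.
collected = [
    {"id": 1, "name": "A", "age": 22, "units": 360},
    {"id": 2, "name": "B", "age": None, "units": 360},
    {"id": 3, "name": "C", "age": 24, "units": 90},
]

# Selecting relevant features and relevant records gives the target data set.
features = ["age", "units"]
target = [{f: r[f] for f in features} for r in collected if r["units"] >= 300]

# Dealing with unknown data: replace a missing age with the mean of the
# known ages (one possible strategy among many).
known_ages = [r["age"] for r in target if r["age"] is not None]
fill = mean(known_ages)
preprocessed = [
    {**r, "age": r["age"] if r["age"] is not None else fill} for r in target
]

# Formatting the data into a form a mining tool can read (CSV text here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=features)
writer.writeheader()
writer.writerows(preprocessed)
formatted = buffer.getvalue()
print(formatted)
```

In practice each step is revisited several times, which is what the backward arrows in the process diagram express.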
Data Mining Process
• Mining
  – Determining data mining tasks
  – Assigning roles to data for certain tasks
  – Selecting data mining solution(s) for each task
  – Setting necessary parameters for the solution
  – Collecting result patterns
[Figure: Formatted Data set → Mining solutions with parameter settings: Solution1 (p1, p2, …, pn), Solution2 (t1, t2, …, tr), Solution3 (w1, w2, …, wm) → Patterns]
Data Mining Process
• Post-processing
  – Pattern evaluation
  – Pattern selection
  – Pattern interpretation
[Figure: Patterns → evaluation criteria (accept / reject) → Valid Patterns → selection criteria → Selected Patterns → Pattern Interpretation → Knowledge learnt]
Data Mining Process
• Roles of participants in data mining
  – Participants include:
    • Data miners / data analysts: main participants of a DM project
    • Domain experts: main collaborators of a DM project
    • Decision makers: clients of a DM project
  – Risk of human bias in the discovery process
  – Important roles of domain experts
    • Pattern interpretation (for usefulness)
    • Pattern evaluation (for significance)
    • Mining options (for suitable tasks, limited)
    • Advice on data pre-processing (for suitable operations, limited)
  – Balancing the strengths of human and machine
Data Mining Approaches
• Hypothesis testing approach
  – Top-down, led by a hypothesis statement
  – Procedure:
    1. Forming a hypothesis statement
    2. Collecting and selecting data of relevance
    3. Conducting data analysis and collecting patterns
    4. Interpreting the patterns to accept/reject the hypothesis
• Discovery approach
  – Bottom-up, without a hypothesis in mind
  – Procedure:
    1. Collecting and preparing data of interest
    2. Conducting data analysis and discovering possible patterns
    3. Evaluating the importance and interestingness of the patterns
Data Mining Approaches
• Discovery approach (cont'd)
  – Directed discovery (supervised learning):
    • Certain aspects of the outcome, i.e. the goal, of the discovery have been specified. The discovery is to find those patterns that satisfy the goal, e.g. patterns relating to the outcome of a class variable.
  – Undirected discovery (unsupervised learning):
    • There is no specification of the goal of the discovery. The discovery is to find patterns of some kind of significance, e.g. associative links among some attribute values.
Data Mining: Problems & Patterns
• Classification
  – Construct a classification model to determine the class of a given record
[Figure (a) Model Development Phase: Example Data Set → Model Construction Method → Classification Model]
[Figure (b) Model Use Phase: Unseen Data Record with undetermined class (input features, class?) → Classification Model → Data Record with the determined class (input features, class Ci)]
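The two phases can be illustrated with a deliberately simple model: a lookup of the most common class for each value of a single input feature. This is a minimal sketch, not one of the methods named in the slides; the function names `build_model` and `classify` and the example records are invented for illustration.

```python
from collections import Counter, defaultdict


def build_model(records, feature, target):
    """Model development phase: learn the most common class per feature value."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[feature]][record[target]] += 1
    return {value: classes.most_common(1)[0][0] for value, classes in counts.items()}


def classify(model, record, feature, default=None):
    """Model use phase: determine the class of an unseen record."""
    return model.get(record[feature], default)


# Hypothetical example data set (invented for illustration only).
examples = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
]

model = build_model(examples, feature="outlook", target="play")
unseen = {"outlook": "sunny"}  # record with undetermined class
predicted = classify(model, unseen, feature="outlook")
print(model, predicted)
```

The model is built once from the example data set and can then be applied to any number of unseen records, which is the separation the two phases describe.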
Data Mining: Problems & Patterns
• Various forms of classification models
  – Instance space
  – Neural network
  – Decision tree
  – List of ordered classification rules
  – Function (linear regression)
  – Many more …
Data Mining: Problems & Patterns
• Cluster detection
  – Measure similarity among data objects and group them into clusters accordingly
[Figure: Input data points → Clustering Method → Cluster Memberships of Data Points]
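As one concrete instance of a clustering method, the sketch below runs a basic k-means procedure on one-dimensional points, using absolute difference as the (dis)similarity measure. The data points and the initial centroids are assumptions chosen for illustration, and `k_means_1d` is a hypothetical helper rather than a library function.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Group 1-D points around centroids by repeated assignment and update."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # memberships have stabilised
            break
        centroids = new_centroids
    return centroids, clusters


# Illustrative input data points and starting centroids (assumed values).
points = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]
centroids, clusters = k_means_1d(points, centroids=[20, 45])
print(centroids)
print(clusters)
```

The output of the method is exactly the "cluster memberships of data points": each input point ends up in one of the returned lists.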
Data Mining: Problems & Patterns
• Forms of clustering results
  – Clusters of various shapes
  – Ellipse-shaped clusters
  – Hierarchical clustering results
Data Mining: Problems & Patterns
• Association rule mining
  – Discover significant relationships between data objects
[Figure: Association Mining Method → rules of the form X → Y]
• Various associations
  – Between values, e.g. Apple → Coke
  – Between categories of values, e.g. Food → Magazine
  – Between values of attributes, e.g. Married:yes → OwnHouse:yes
  – Over time period, e.g. year 1: Database → year 2: Data Mining
Data Mining: Problems & Patterns
• An example

StudentID  Gender  Country  Major Subject  Age  TotalUnits  Degree Class
1          M       UK       Computing      22   360         1st Class
2          F       UK       Computing      21   360         2nd Lower
3          M       FRANCE   Psychology     24   345         2nd Lower
4          M       SPAIN    Accounting     23   360         1st Class
5          F       UK       Psychology     22   300         Pass
6          F       USA      History        30   345         2nd Upper
7          M       UK       Computing      35   360         1st Class
8          F       FRANCE   Psychology     25   360         3rd Class
9          F       GERMANY  History        23   360         2nd Upper
10         M       UK       Accounting     22   360         1st Class
11         M       SPAIN    History        20   345         2nd Upper
12         F       UK       Law            45   300         Pass

Classification model? Clusters? Association rules?
Data Mining Solutions: An Overview
• Classification solutions
  – Decision tree, e.g. ID3
  – k nearest neighbour (kNN), e.g. PEBLS
  – Rules, e.g. Sequential Cover
  – Bayes' theorem, e.g. Naïve Bayes
  – Artificial neural network
• Clustering solutions
  – Partition-based methods, e.g. K-means
  – Hierarchical methods, e.g. agglomeration
  – Density-based methods, e.g. DBSCAN
  – Model-based methods, e.g. Expectation-Maximisation
  – Graph-based methods, e.g. Chameleon
Data Mining Solutions: An Overview
• Association rule solutions
  – Greedy methods, e.g. Apriori
  – Graph-based methods, e.g. FP-Growth
  – Methods for various associations
    • Boolean associations
    • Generalised associations (multi-level associations)
    • Quantitative associations (multidimensional associations)
    • Sequential associations (sequential patterns)
Because one type of data mining problem can often be transformed into another type, some solutions for one type can also be applied to another.
Evaluation of Patterns
• Importance of evaluating result patterns
  – A classification model must be accurate enough to be credible
  – Clusters must genuinely exist
  – Association rules must be strong enough to be believed
  – Data descriptions must be general enough to cover a large part of the data set
How do we evaluate the discovered patterns?
Evaluation of Patterns
• Possible measures of interestingness
  – Objective measures based on data and pattern
    • Conciseness of pattern, e.g. minimum description length
    • Coverage, e.g. coverage for classification rules
    • Reliability, e.g. accuracy of a classification model
    • Peculiarity, e.g. measures of difference from the norm
    • Diversity, e.g. tendency of clusters
  – Subjective measures based on domain knowledge
    • Novelty
    • Surprisingness
    • Usefulness
    • Applicability
Evaluation of Patterns
• Commonly used measures
  – Accuracy rate or error rate for classification models (see section 6.5.1)
    • True positive
    • False positive
    • False negative
  – Quality of clusters (see section 4.5.1)
    • Quality of a cluster
    • Overall quality of all clusters
  – Strengths of associations (see sections 8.1.2 and 8.6)
    • Support
    • Confidence
    • Lift
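To make the association measures concrete, the sketch below computes support, confidence and lift for a rule X → Y using their commonly used definitions: support is the fraction of transactions containing all items of the rule, confidence is support(X and Y) / support(X), and lift is confidence / support(Y). The transactions are invented for illustration, and the book's sections 8.1.2 and 8.6 remain the reference for the exact formulations used there.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Support of X and Y together, relative to the support of X alone."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Confidence of X -> Y relative to how often Y occurs anyway."""
    return confidence(x, y, transactions) / support(y, transactions)


# Hypothetical shopping baskets (invented for illustration only).
transactions = [
    {"Apple", "Coke"},
    {"Apple", "Coke", "Bread"},
    {"Apple"},
    {"Coke"},
    {"Bread"},
]

x, y = {"Apple"}, {"Coke"}
rule_support = support(x | y, transactions)
rule_confidence = confidence(x, y, transactions)
rule_lift = lift(x, y, transactions)
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests that X and Y occur together more often than they would if they were independent; a lift close to 1 suggests the rule carries little information.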
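Data Mining in Weka Explorer
• The roadmap
[Figure: roadmap through the Weka Explorer screens, from the Preprocess tab page to the Cluster, Classify and Associate tab pages, with steps labelled (1), (2) and (3); the Tree Visualiser window is shown alongside the Classify tab page.]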
Data Mining in Weka Explorer
• Preprocess
[Screenshot of the Preprocess tab page, with the following callouts:]
  – Open data set from different sources
  – Generate random data set
  – Save data set into a file
  – Display & edit data
  – Filters for pre-processing
  – Data summary
  – Attribute display, selection & removal from the opened data set
  – Selected attribute summary
  – Selected attribute visualisation
  – Visualise all attributes
  – Feedback messages
Data Mining in Weka Explorer
• Classify (as an example)
[Screenshot of the Classify tab page, with the following callouts:]
  – Method selection & parameter setting
  – Test option setting
  – Task list, with a menu of options available on right click
  – Result display window
Data Mining in Weka Explorer
• Classify (as an example)
[Screenshots with the following callouts:]
  – Method list
  – Selecting a specific method
  – Selecting & changing parameters
Data Mining in Weka Explorer
• Visualisation
[Figures: an example decision tree; a scatter plot of data objects of different classes]
Probability & Statistics: A Brief Review
• Where are probability and statistics used?
  – Patterns found from data are probabilistic in nature
  – In various measures of evaluation, e.g. the confidence measure of association rules
  – In the data exploration stage for better understanding, e.g. maximum, minimum, mean, variance, skewness
  – During the mining process to assist the discovery of patterns, e.g. information gain for decision tree induction
  – As a part of patterns, e.g. naïve Bayes, Gaussian mixture model
  – In the comparison of patterns, e.g. a classification model with significantly better accuracy
Probability & Statistics: A Brief Review
• Probability and conditional probability
  – Probability of an event, P(E), and its meaning when P(E) = 0, P(E) = 1 and 0 < P(E) < 1
  – Probabilities of multiple events: P(E and F), and P(E or F) = P(E) + P(F) - P(E and F)
  – Mutually exclusive events: P(E and F) = 0, and therefore P(E or F) = P(E) + P(F)
  – Conditional probability of event E given event F: P(E|F) = P(E and F)/P(F)
  – Independent events: P(E and F) = P(E)P(F), and P(E|F) = P(E)
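These rules can be checked by enumerating the outcomes of a fair six-sided die. The die and the events E, F and G below are assumptions introduced for illustration only; exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed example)
E = {2, 4, 6}                  # the roll is even
F = {4, 5, 6}                  # the roll is greater than 3
G = {1}                        # the roll is 1, which excludes E


def p(event):
    """Probability of an event as the share of equally likely outcomes."""
    return Fraction(len(event), len(outcomes))


# Addition rule: P(E or F) = P(E) + P(F) - P(E and F)
p_or = p(E | F)
addition_rule_holds = p_or == p(E) + p(F) - p(E & F)

# Mutually exclusive events: P(E and G) = 0, so P(E or G) = P(E) + P(G)
exclusive_rule_holds = p(E & G) == 0 and p(E | G) == p(E) + p(G)

# Conditional probability: P(E|F) = P(E and F) / P(F)
p_e_given_f = p(E & F) / p(F)

# Independence would require P(E and F) = P(E)P(F); it fails for E and F.
independent = p(E & F) == p(E) * p(F)

print(p_or, p_e_given_f, independent)
```

Here P(E|F) = 2/3 differs from P(E) = 1/2, which is another way of seeing that E and F are not independent.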
Probability & Statistics: A Brief Review
• Probability & conditional probability (example)

StudentID  Gender  Country  Major Subject  Age  TotalUnits  Degree Class
1          M       UK       Computing      22   360         1st Class
2          F       UK       Computing      21   360         2nd Lower
3          M       FRANCE   Psychology     24   345         2nd Lower
4          M       SPAIN    Accounting     23   360         1st Class
5          F       UK       Psychology     22   300         Pass
6          F       USA      History        30   345         2nd Upper
7          M       UK       Computing      35   360         1st Class
8          F       FRANCE   Psychology     25   360         3rd Class
9          F       GERMANY  History        23   360         2nd Upper
10         M       UK       Accounting     22   360         1st Class
11         M       SPAIN    History        20   345         2nd Upper
12         F       UK       Law            45   300         Pass

P(Gender = M) = 6/12 = 1/2
P(Gender = M and Gender = F) = 0
P(Gender = M or Gender = F) = 1
P(Gender = F | Country = UK) = 1/2
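The four probabilities can be reproduced by counting rows of the example table. The sketch below keeps only the Gender and Country columns; the helper `probability` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction

# (Gender, Country) for students 1 to 12 in the example table.
students = [
    ("M", "UK"), ("F", "UK"), ("M", "FRANCE"), ("M", "SPAIN"),
    ("F", "UK"), ("F", "USA"), ("M", "UK"), ("F", "FRANCE"),
    ("F", "GERMANY"), ("M", "UK"), ("M", "SPAIN"), ("F", "UK"),
]


def probability(event, given=lambda s: True):
    """Share of students satisfying event among those satisfying given."""
    pool = [s for s in students if given(s)]
    return Fraction(sum(1 for s in pool if event(s)), len(pool))


p_male = probability(lambda s: s[0] == "M")
p_male_and_female = probability(lambda s: s[0] == "M" and s[0] == "F")
p_male_or_female = probability(lambda s: s[0] == "M" or s[0] == "F")
p_female_given_uk = probability(lambda s: s[0] == "F", given=lambda s: s[1] == "UK")

print(p_male, p_male_and_female, p_male_or_female, p_female_given_uk)
```

Restricting the pool to the six UK students before counting is what the conditional probability P(Gender = F | Country = UK) expresses.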
Probability & Statistics: A Brief Review
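• Probability distribution of random variables
  – Discrete random variable: P(X = x)
  – Continuous random variable: P(a ≤ X < b)
[Figure: probability distribution plots, with the 68% and 95% regions marked on the continuous distribution curve]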
Probability & Statistics: A Brief Review
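• Basic Statistics
  – Sample mean, median and mode
      $\bar{x} = \frac{\sum x_i}{n}$
      For the Age attribute of the example data: $\bar{x}_{age} = 26$, $median_{age} = 23$, $mode_{age} = 22$
  – Variance and standard deviation
      $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, $s = \sqrt{s^2}$
      For the Age attribute: $s^2_{age} = 53.636$, $s_{age} = 7.324$
  – Skewness
      $skewness = \frac{3(\bar{x} - Median)}{s}$
      For the Age attribute: $skewness_{age} = \frac{3(26 - 23)}{7.324} = 1.229$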
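The figures for the Age attribute can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) denominator. The skewness line follows the median-based formula 3(mean - median)/s; it is written out by hand because the module has no skewness function.

```python
from statistics import mean, median, mode, stdev, variance

# Age column of the twelve students in the example table.
ages = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]

age_mean = mean(ages)
age_median = median(ages)
age_mode = mode(ages)
age_variance = variance(ages)  # sample variance, divides by n - 1
age_stdev = stdev(ages)        # square root of the sample variance
age_skewness = 3 * (age_mean - age_median) / age_stdev

print(age_mean, age_median, age_mode)
print(round(age_variance, 3), round(age_stdev, 3), round(age_skewness, 3))
```

The positive skewness reflects the long right tail created by the two oldest students (35 and 45), which pull the mean above the median.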
Probability & Statistics: A Brief Review
• Confidence interval estimate
  – The sample mean is only an estimate of the true mean of the data population.
  – Central limit theorem: sample means follow a normal distribution in which:
    a. The mean is the true population mean $\mu_X$
    b. The standard deviation is $\sigma_X / \sqrt{n}$
  – Based on the central limit theorem, and using the sample standard deviation in place of the true one, the following expression is used to estimate the interval for the true mean at a confidence level of $1 - \alpha$:
      $P\left(\bar{x} - t\,\frac{s}{\sqrt{n}} \le \mu_X \le \bar{x} + t\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha$
    where $t$ is the t-distribution value for $\alpha/2$ and $n - 1$ degrees of freedom.
Probability & Statistics: A Brief Review
• Confidence interval estimate (example)
For this data set, $n = 12$, $\bar{x}_{age} = 26$ and $s_{age} = 7.324$. At a confidence level of 95%, i.e. $1 - \alpha = 0.95$ and $\alpha/2 = 0.025$, with $n - 1 = 11$, $t = 2.201$. The interval estimate is:
  $P\left(26 - 2.201 \times \frac{7.324}{\sqrt{12}} \le \mu_{age} \le 26 + 2.201 \times \frac{7.324}{\sqrt{12}}\right) = 0.95$
The interval is estimated as [21.347, 30.653] at a confidence level of 95%.
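The interval can be recomputed from the raw ages. In this sketch the t value 2.201 is taken from the slide rather than computed, because Python's standard library has no t-distribution; the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Age column of the twelve students in the example table.
ages = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]

n = len(ages)
x_bar = mean(ages)
s = stdev(ages)   # sample standard deviation (n - 1 denominator)
t_value = 2.201   # from the slide: alpha/2 = 0.025, n - 1 = 11

half_width = t_value * s / sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width

print(round(lower, 3), round(upper, 3))
```

A larger sample would shrink the interval in two ways: the divisor sqrt(n) grows, and the t value for more degrees of freedom becomes smaller.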
Probability & Statistics: A Brief Review
• Hypothesis testing
  – As an introduction to statistical inference and statistical significance.
  – Procedure:
    a. Forming null and alternative hypotheses
    b. Deciding the level of significance p
    c. Determining a test statistic and calculating its value
    d. Comparing the calculated value against the known value and deciding if the null hypothesis should be rejected
Probability & Statistics: A Brief Review
• Hypothesis testing (example)
  – Assuming the population mean $\mu_{age} = 25$
  – Hypotheses:
      Null: $\bar{x}_{age} = \mu_{age}$
      Alternative: $\bar{x}_{age} \ne \mu_{age}$
  – Calculating the statistic t as:
      $t = \frac{\bar{x}_{age} - \mu_{age}}{s_{age} / \sqrt{n}} = \frac{26 - 25}{7.324 / \sqrt{12}} = 0.473$
    This is less than $t = 2.201$ for $p/2 = 0.025$ and $n - 1 = 11$.
  – Conclusion: the null hypothesis is not rejected, i.e. the difference between the sample mean and the population mean is insignificant.
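The test statistic and the decision can be recomputed as follows. As in the slide, the assumed population mean is 25 and the critical value 2.201 is taken as given rather than computed; the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Age column of the twelve students in the example table.
ages = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]

mu_age = 25          # assumed population mean (from the slide)
t_critical = 2.201   # from the slide: p/2 = 0.025, n - 1 = 11

# t = (sample mean - population mean) / (s / sqrt(n))
t_statistic = (mean(ages) - mu_age) / (stdev(ages) / sqrt(len(ages)))

# Two-sided test: reject only if |t| exceeds the critical value.
reject_null = abs(t_statistic) > t_critical

print(round(t_statistic, 3), reject_null)
```

Because the statistic falls well inside the range from -2.201 to 2.201, the data give no grounds for rejecting the null hypothesis at this significance level.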
Chapter Summary
• The data mining process involves preparation of data, mining of patterns and post-processing of the patterns.
• Top-down and bottom-up approaches are both useful. The discovery approach can be directed or undirected.
• Three main streams of data mining tasks and various forms of patterns and models are introduced.
• Specific solutions are required for specific types of problems.
• The importance of evaluation of patterns must be appreciated.
• The normal procedure for conducting data mining in Weka is explained.
• Some important basic concepts in probability and statistics are reviewed.
References
Read Chapter 2 of Data Mining Techniques and Applications
Useful further references
Han, J. and Kamber, M. (2006), Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers, Chapter 1
Berry, M. J. A. and Linoff, G. (2004), Data Mining Techniques: For Marketing, Sales and Customer Relationship Management, 2nd ed. Wiley Computer Publishing, Chapters 1 – 2