Databases and Data Mining Lecture 3: Descriptive Data Mining Peter van der Putten (putten_at_liacs.nl)

Dec 19, 2015


Databases and Data Mining

Lecture 3: Descriptive Data Mining

Peter van der Putten (putten_at_liacs.nl)

Course Outline

• Objective
  – Understand the basics of data mining
  – Gain understanding of the potential for applying it in the bioinformatics domain
  – Hands on experience
• Schedule

Date       Time            Room
4-Nov-05   13.45 - 15.30   174       Lecture
18-Nov-05  13.45 - 15.30   413       Lecture
           15.45 - 17.30   306/308   Practical Assignments
25-Nov-05  13.45 - 15.30   413       Lecture
2-Dec-05   13.45 - 15.30   413       Lecture
           15.45 - 17.30   306/308   Practical Assignments

• Evaluation
  – Practical assignment (2nd) plus take home exercise
• Website
  – http://www.liacs.nl/~putten/edu/dbdm05/

Agenda Today: Descriptive Data Mining

• Before Starting to Mine….
• Descriptive Data Mining
  – Dimension Reduction & Projection
  – Clustering
    • Hierarchical clustering
    • K-means
    • Self organizing maps
  – Association rules
    • Frequent item sets
    • Association Rules
    • APRIORI
    • Bio-informatics case: FSG for frequent subgraph discovery

Before starting to mine….

• Pima Indians Diabetes Data
  – X = body mass index
  – Y = age


Before starting to mine….

• Attribute Selection
  – This example: InfoGain by Attribute
  – Keep the most important ones

[Bar chart: InfoGain per attribute, horizontal axis from 0.00 to 0.20. Attributes shown:]
• Plasma glucose concentration at 2 hours in an oral glucose tolerance test
• Body mass index (weight in kg/(height in m)^2)
• Age (years)
• 2-Hour serum insulin (mu U/ml)
• Triceps skin fold thickness (mm)
• Number of times pregnant
• Diabetes pedigree function
• Diastolic blood pressure (mm Hg)
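A minimal sketch of how InfoGain by attribute could be computed and used for ranking, assuming nominal (already binned) attribute values. The toy columns `glucose_bin` and `noise` are hypothetical and are not taken from the Pima data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Class entropy minus the weighted class entropy after splitting on the attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [c for x, c in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: one informative attribute, one uninformative attribute.
glucose_bin = ["high", "high", "low", "low", "high", "low"]
noise = ["a", "b", "a", "b", "a", "b"]
diabetes = ["yes", "yes", "no", "no", "yes", "no"]

gains = {
    "glucose_bin": info_gain(glucose_bin, diabetes),
    "noise": info_gain(noise, diabetes),
}
# Rank attributes by gain; "keep the most important ones" means keeping the top of this list.
ranked = sorted(gains, key=gains.get, reverse=True)
```

This is a uni-variate filter: each attribute is scored on its own, without involving a learner.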

Before starting to mine….

• Types of Attribute Selection
  – Uni-variate versus multivariate (subset selection)
    • The fact that attribute x is a strong uni-variate predictor does not necessarily mean it will add predictive power to a set of predictors already used by a model
  – Filter versus wrapper
    • Wrapper methods involve the subsequent learner (classifier or other) in the selection; filter methods score attributes without it

Dimension Reduction

• Projecting high dimensional data into a lower dimension
  – Principal Component Analysis
  – Independent Component Analysis
  – Fisher Mapping, Sammon’s Mapping etc.
  – Multi Dimensional Scaling

• See Pattern Recognition Course (Duin)


Data Mining Tasks: Clustering

[Scatter plot: x-axis e.g. age, y-axis e.g. weight]

Clustering is the discovery of groups in a set of instances

Groups are different, instances in a group are similar

In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user

In >3 dimensions this is not possible

Clustering Techniques

• Hierarchical algorithms
  – Agglomerative
  – Divisive
• Partition based clustering
  – K-Means
  – Self Organizing Maps / Kohonen Networks
• Probabilistic Model based
  – Expectation Maximization / Mixture Models

Hierarchical clustering

• Agglomerative / Bottom up
  – Start with single-instance clusters
  – At each step, join the two closest clusters
  – Method to compute distance between cluster x and y: single linkage (distance between the closest points in cluster x and y), average linkage (average distance between all points), complete linkage (distance between the furthest points), centroid (distance between the cluster centroids)
  – Distance measure: Euclidean, Correlation etc.
• Divisive / Top Down
  – Start with all data in one cluster
  – Split into two clusters based on category utility
  – Proceed recursively on each subset
• Both methods produce a dendrogram
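The agglomerative procedure can be sketched as below. This is a naive illustration with Euclidean distance and a choice of single, complete or average linkage; the function names and the sample points are assumptions made for the example, not part of the lecture material.

```python
import math

def linkage_distance(a, b, method="single"):
    """Distance between clusters a and b (lists of points) under a linkage rule."""
    pairwise = [math.dist(p, q) for p in a for q in b]
    if method == "single":      # closest pair of points
        return min(pairwise)
    if method == "complete":    # furthest pair of points
        return max(pairwise)
    if method == "average":     # average over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(points, n_clusters, method="single"):
    """Start with single-instance clusters and repeatedly join the two closest."""
    clusters = [[p] for p in points]
    merges = []  # merge history: the information a dendrogram would draw
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 1.2), (8.0, 8.0), (8.3, 7.9), (1.2, 0.8)]
clusters, merges = agglomerate(points, n_clusters=2)
```

Running with `n_clusters=1` records the full merge history, which corresponds to the complete dendrogram.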

Levels of Clustering

[Figure: levels of clustering, with the Divisive and Agglomerative directions marked]

Dunham, 2003

Hierarchical Clustering Example

• Clustering Microarray Gene Expression Data
  – Gene expression measured using microarrays, studied under a variety of conditions
  – On budding yeast Saccharomyces cerevisiae
  – Efficiently groups together genes of known similar function
• Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro

Hierarchical Clustering Example

• Method
  – Genes are the instances, samples the attributes!
  – Agglomerative
  – Distance measure = correlation
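One common way to turn correlation into a distance is 1 minus the Pearson correlation. The sketch below assumes that form (the exact measure used in the cited study may differ), with hypothetical expression profiles: one list per gene, one value per sample.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(x, y)

# Hypothetical profiles (genes are the instances, samples the attributes).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # opposite trend

d_ab = correlation_distance(gene_a, gene_b)
d_ac = correlation_distance(gene_a, gene_c)
```

Unlike Euclidean distance, this treats `gene_a` and `gene_b` as identical because only the shape of the profile matters, not its scale.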

Simple Clustering: K-means

• Pick a number (k) of cluster centers (at random)
  • Cluster centers are sometimes called codes, and the k codes a codebook
• Assign every item to its nearest cluster center
  • For instance by Euclidean distance
• Move each cluster center to the mean of its assigned items
• Repeat until convergence
  • Change in cluster assignments less than a threshold

KDnuggets
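The steps above can be sketched in a few lines. Two assumptions are made for illustration: the initial centers are drawn from the data points themselves, and convergence means that no item changes cluster (a threshold of zero). The sample points are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # pick k cluster centers at random
    assignment = None
    for _ in range(max_iter):
        # Assign every item to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:      # no item changed cluster: converged
            break
        assignment = new_assignment
        # Move each cluster center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                       # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Hypothetical data: two well separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = kmeans(points, k=2)
```

The result depends on the random initialisation in general; on data this well separated every start leads to the same two clusters.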

K-means example, step 1

Initially distribute codes randomly in pattern space

[Figure: codes k1, k2, k3 in the X-Y pattern space]

KDnuggets

K-means example, step 2

Assign each point to the closest code

[Figure: codes k1, k2, k3 in the X-Y pattern space]

KDnuggets

K-means example, step 3

Move each code to the mean of all its assigned points

[Figure: old and new positions of codes k1, k2, k3 in the X-Y pattern space]

KDnuggets

K-means example, step 2

Repeat the process: reassign the data points to the codes

Q: Which points are reassigned?

[Figure: codes k1, k2, k3 in the X-Y pattern space]

KDnuggets

K-means example

Repeat the process: reassign the data points to the codes

Q: Which points are reassigned?

[Figure: codes k1, k2, k3 in the X-Y pattern space]

KDnuggets

K-means example

Re-compute cluster means

[Figure: codes k1, k2, k3 in the X-Y pattern space]

KDnuggets

K-means example

Move cluster centers to cluster means

[Figure: codes k1, k2, k3 in the X-Y pattern space]

KDnuggets

K-means clustering summary

Advantages
• Simple, understandable
• Items automatically assigned to clusters

Disadvantages
• Must pick number of clusters beforehand
• All items forced into a cluster
• Sensitive to outliers

Extensions
• Adaptive k-means
• K-medoids (the center is the medoid, a median-like actual data point, instead of the mean)
  – 1, 2, 3, 4, 100: average 22, median 3
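The 1, 2, 3, 4, 100 example can be checked directly. The `medoid` helper below is an illustrative definition for one-dimensional data (the item with the smallest total distance to all other items), not a full k-medoids implementation.

```python
import statistics

values = [1, 2, 3, 4, 100]

mean = statistics.mean(values)      # pulled toward the outlier 100
median = statistics.median(values)  # unaffected by the outlier

def medoid(items):
    """The item with the smallest total distance to all other items."""
    return min(items, key=lambda c: sum(abs(c - x) for x in items))

center = medoid(values)
```

Because the medoid is always one of the data points, a single extreme value cannot drag the cluster center away from the bulk of the cluster.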

Biological Example

• Clustering of yeast cell images
  – Two clusters are found
  – Left cluster primarily cells with thick capsule, right cluster thin capsule
    • Caused by media, proxy for sick vs healthy

Self Organizing Maps (Kohonen Maps)

• Claim to fame
  – Simplified models of cortical maps in the brain
  – Things that are near in the outside world link to areas near in the cortex
  – For a variety of modalities: touch, motor, …. up to echolocation
  – Nice visualization
• From a data mining perspective:
  – SOMs are simple extensions of k-means clustering
  – Codes are connected in a lattice
  – In each iteration, codes neighboring the winning code in the lattice are also allowed to move
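The "k-means plus lattice" view can be sketched as follows. For brevity the lattice here is a one-dimensional chain rather than a 2-D grid, and the learning-rate schedule, the neighbor step size and the random data are all assumptions chosen for illustration.

```python
import math
import random

def train_som(data, n_codes, epochs=50, lr=0.5, radius=1, seed=0):
    """Train a 1-D chain of codes; a 2-D lattice would use grid neighbors instead."""
    rng = random.Random(seed)
    dim = len(data[0])
    codes = [[rng.random() for _ in range(dim)] for _ in range(n_codes)]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)            # decaying learning rate (an assumption)
        for x in rng.sample(data, len(data)):       # visit the data in random order
            winner = min(range(n_codes), key=lambda i: math.dist(x, codes[i]))
            for i in range(n_codes):
                if abs(i - winner) <= radius:       # winner and its lattice neighbors move
                    step = rate if i == winner else rate / 2
                    codes[i] = [c + step * (xi - c) for c, xi in zip(codes[i], x)]
    return codes

# Hypothetical data: 200 points drawn uniformly from the unit square.
data_rng = random.Random(1)
data = [[data_rng.random(), data_rng.random()] for _ in range(200)]
codes = train_som(data, n_codes=10)
```

With `radius=0` only the winning code moves, which reduces the update to an online form of k-means; the lattice neighborhood is what lets neighboring codes end up near each other in pattern space.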

SOM

[Figure: 10x10 SOM, Gaussian Distribution]

SOM example

Famous example: Phonetic Typewriter

• SOM lattice below left is trained on spoken letters; after convergence, codes are labeled
• Creates a ‘phonotopic’ map
• Spoken word creates a sequence of labels

Famous example: Phonetic Typewriter

• Criticism
  – Topology preserving property is not used, so why use SOMs and not adaptive k-means for instance?
    • K-means could also create a sequence
    • This is true for most SOM applications!
  – Is using clustering for classification optimal?

Bioinformatics Example: Clustering GPCRs

• Clustering G Protein Coupled Receptors (GPCRs) [Samsanova et al, 2003, 2004]
• Important drug target, function often unknown


Association Rules Outline

• What are frequent item sets & association rules?

• Quality measures
  – support, confidence, lift

• How to find item sets efficiently?
  – APRIORI

• How to generate association rules from an item set?

• Biological examples

KDnuggets

Market Basket Example

TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

• Frequent item set
  • {MILK, BREAD} = 4
• Association rule
  • {MILK, BREAD} => {EGGS}
  • Frequency / importance = 2 (‘Support’)
  • Quality = 50% (‘Confidence’)

Gene Expression Example

ID   Expressed Genes in Sample
1    GENE1, GENE2, GENE5
2    GENE1, GENE3, GENE5
3    GENE2
4    GENE8, GENE9
5    GENE8, GENE9, GENE10
6    GENE2, GENE8
7    GENE9, GENE10
8    GENE2
9    GENE11

• What genes are expressed (‘active’) together?
  • Interaction / regulation
  • Similar function
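The market basket numbers can be reproduced with a few set operations. The helper names `support_count` and `confidence` are chosen for this sketch; the data is the nine-transaction table from the slide.

```python
baskets = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)

freq = support_count({"MILK", "BREAD"}, baskets)
rule_support = support_count({"MILK", "BREAD", "EGGS"}, baskets)
rule_conf = confidence({"MILK", "BREAD"}, {"EGGS"}, baskets)
```

The same functions apply unchanged to the gene expression table if each sample is represented as the set of genes expressed in it.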

Association Rule Definitions

• Set of items: $I = \{I_1, I_2, \ldots, I_m\}$
• Transactions: $D = \{t_1, t_2, \ldots, t_n\}$, $t_j \subseteq I$
• Itemset: $\{I_{i1}, I_{i2}, \ldots, I_{ik}\} \subseteq I$
• Support of an itemset: Percentage of transactions which contain that itemset.
• Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.

Dunham, 2003

Frequent Item Set Example

I = {Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread, PeanutButter} is 60%

Dunham, 2003

Association Rule Definitions

• Association Rule (AR): implication $X \Rightarrow Y$ where $X, Y \subseteq I$ and $X, Y$ are disjoint ($X \cap Y = \emptyset$)
• Support of AR (s) $X \Rightarrow Y$: Percentage of transactions that contain $X \cup Y$
• Confidence of AR ($\alpha$) $X \Rightarrow Y$: Ratio of number of transactions that contain $X \cup Y$ to the number that contain $X$

Dunham, 2003
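Written out as formulas, the two definitions read as follows. This is a restatement under assumed notation: $D$ is the set of transactions, $s$ denotes support and $\alpha$ denotes confidence.

```latex
s(X \Rightarrow Y) = \frac{\left|\{\, t \in D : X \cup Y \subseteq t \,\}\right|}{|D|},
\qquad
\alpha(X \Rightarrow Y) = \frac{\left|\{\, t \in D : X \cup Y \subseteq t \,\}\right|}{\left|\{\, t \in D : X \subseteq t \,\}\right|}
```

Both share the same numerator, so the confidence equals the support of $X \cup Y$ divided by the support of $X$.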


Association Rules Ex (cont’d)

Dunham, 2003

Association Rule Problem

• Given a set of items $I = \{I_1, I_2, \ldots, I_m\}$ and a database of transactions $D = \{t_1, t_2, \ldots, t_n\}$ where $t_i = \{I_{i1}, I_{i2}, \ldots, I_{ik}\}$ and $I_{ij} \in I$, the Association Rule Problem is to identify all association rules $X \Rightarrow Y$ with a minimum support and confidence.
• NOTE: Support of $X \Rightarrow Y$ is the same as support of $X \cup Y$.

Dunham, 2003

Association Rules Example

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

• Q: Given frequent set {A,B,E}, what association rules have minsup = 2 and minconf = 50%?

Qualify:
A, B => E : conf = 2/4 = 50%
A, E => B : conf = 2/2 = 100%
B, E => A : conf = 2/2 = 100%
E => A, B : conf = 2/2 = 100%

Don’t qualify:
A => B, E : conf = 2/6 = 33% < 50%
B => A, E : conf = 2/7 = 29% < 50%
__ => A, B, E : conf = 2/9 = 22% < 50%

KDnuggets
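The answer can be verified by enumerating every split of {A,B,E} into antecedent and consequent. The function name `rules_from` is an assumption for this sketch; the transactions are the nine rows of the table in this example.

```python
from itertools import combinations

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def count(itemset):
    """Number of transactions containing the itemset (the empty set matches all)."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from(itemset, minconf):
    """All rules antecedent => (itemset - antecedent) with confidence >= minconf."""
    itemset = frozenset(itemset)
    found = []
    for size in range(len(itemset)):              # size 0 is the empty antecedent
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            conf = count(itemset) / count(antecedent)
            if conf >= minconf:
                found.append((sorted(antecedent), sorted(itemset - antecedent), conf))
    return found

qualifying = rules_from({"A", "B", "E"}, minconf=0.5)
```

Every rule derived from one itemset has the same support (here 2), so only the confidence needs to be checked per rule.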

Solution Association Rule Problem

• First, find all frequent itemsets with sup >= minsup
  – Exhaustive search won’t work
    • Assume we have a set of m items: 2^m subsets!
  – Exploit the subset property (APRIORI algorithm)
• For every frequent item set, derive rules with confidence >= minconf

KDnuggets

Finding itemsets: next level

• Apriori algorithm (Agrawal & Srikant)
• Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ..
  – Subset Property: If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  – In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
  => Compute k-item set by merging (k-1)-item sets

KDnuggets
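A compact, unoptimised sketch of the level-wise idea: merge frequent (k-1)-item sets into k-item candidates, prune with the subset property, then count. The function name `apriori` and the sample transactions (the nine rows from the A to E example) are choices made for this illustration; real implementations use smarter candidate generation and counting.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support count} for all itemsets with count >= minsup."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= minsup}
    frequent = {s: count(s) for s in level}
    k = 2
    while level:
        # Join: merge pairs of frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: by the subset property, every (k-1)-item subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c for c in candidates if count(c) >= minsup}
        frequent.update({c: count(c) for c in level})
        k += 1
    return frequent

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
frequent = apriori(transactions, minsup=2)
```

The prune step is what avoids the exhaustive search: a candidate such as (A B D) is discarded without scanning the data because its subset (A D) is not frequent.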

An example

• Given: five three-item sets
  (A B C), (A B D), (A C D), (A C E), (B C D)
• Candidate four-item sets:
  (A B C D) Q: OK?
  A: Yes, because all 3-item subsets are frequent
  (A C D E) Q: OK?
  A: No, because (C D E) is not frequent

KDnuggets
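The two checks in this example can be reproduced by listing, for each candidate, which of its three-item subsets are missing from the frequent sets. The helper name `missing_subsets` is an assumption for this sketch.

```python
from itertools import combinations

# The five frequent three-item sets given in the example.
frequent_3 = {frozenset(s) for s in ("ABC", "ABD", "ACD", "ACE", "BCD")}

def missing_subsets(candidate, frequent_smaller):
    """Return the (k-1)-item subsets of the candidate that are not frequent."""
    k = len(candidate)
    subsets = (frozenset(s) for s in combinations(sorted(candidate), k - 1))
    return [s for s in subsets if s not in frequent_smaller]

abcd_missing = missing_subsets("ABCD", frequent_3)  # empty: candidate is kept
acde_missing = missing_subsets("ACDE", frequent_3)  # non-empty: candidate is pruned
```

For (A C D E) one missing subset is enough to prune it; besides (C D E), the subset (A D E) is also absent from the given list.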

From Frequent Itemsets to Association Rules

• Q: Given frequent set {A,B,E}, what are possible association rules?
  – A => B, E
  – A, B => E
  – A, E => B
  – B => A, E
  – B, E => A
  – E => A, B
  – __ => A,B,E (empty rule), or true => A,B,E

KDnuggets

Example: Generating Rules from an Itemset

• Frequent itemset from golf data:
  Humidity = Normal, Windy = False, Play = Yes (4)
• Seven potential rules:

Rule                                                               Confidence
If Humidity = Normal and Windy = False then Play = Yes             4/4
If Humidity = Normal and Play = Yes then Windy = False             4/6
If Windy = False and Play = Yes then Humidity = Normal             4/6
If Humidity = Normal then Windy = False and Play = Yes             4/7
If Windy = False then Humidity = Normal and Play = Yes             4/8
If Play = Yes then Humidity = Normal and Windy = False             4/9
If True then Humidity = Normal and Windy = False and Play = Yes    4/12

KDnuggets

Example: Generating Rules

• Rules with support > 1 and confidence = 100%:

     Association rule                                        Sup.  Conf.
1    Humidity=Normal Windy=False => Play=Yes                 4     100%
2    Temperature=Cool => Humidity=Normal                     4     100%
3    Outlook=Overcast => Play=Yes                            4     100%
4    Temperature=Cool Play=Yes => Humidity=Normal            3     100%
...  ...                                                     ...   ...
58   Outlook=Sunny Temperature=Hot => Humidity=High          2     100%

• In total: 3 rules with support four, 5 with support three, and 50 with support two

KDnuggets


Weka associations: output

KDnuggets

Extensions and Challenges

• Extra quality measure: Lift
  – The lift of an association rule I => J is defined as:
    • lift = P(J|I) / P(J)
    • Note, P(I) = (support of I) / (no. of transactions)
    • Ratio of confidence to expected confidence
  – Interpretation:
    • if lift > 1, then I and J are positively correlated
    • if lift < 1, then I and J are negatively correlated
    • if lift = 1, then I and J are independent
• Other measures for interestingness
  – A => B, B => C, but not A => C
• Efficient algorithms
• Known Problem
  – What to do with all these rules? How to exploit / make useful / actionable?

KDnuggets
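Lift follows directly from support counts. The sketch below applies the definition to the nine market basket transactions used earlier in the lecture; the function name `lift` and the two example rules are chosen for illustration.

```python
baskets = [
    {"MILK", "BREAD", "EGGS"}, {"BREAD", "SUGAR"}, {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"}, {"MILK", "CEREAL"}, {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"}, {"MILK", "BREAD", "CEREAL", "EGGS"}, {"MILK", "BREAD", "CEREAL"},
]

def lift(antecedent, consequent, transactions):
    """lift = P(J|I) / P(J), with probabilities estimated from transaction counts."""
    n = len(transactions)
    p_i = sum(1 for t in transactions if antecedent <= t) / n
    p_j = sum(1 for t in transactions if consequent <= t) / n
    p_ij = sum(1 for t in transactions if (antecedent | consequent) <= t) / n
    return (p_ij / p_i) / p_j

# {MILK, BREAD} => {EGGS}: confidence 0.5 against an expected confidence of 2/9.
lift_eggs = lift({"MILK", "BREAD"}, {"EGGS"}, baskets)
# {MILK} => {CEREAL}: confidence 4/6 against an expected confidence of 6/9.
lift_cereal = lift({"MILK"}, {"CEREAL"}, baskets)
```

Here the first rule has lift 2.25 (positively correlated) and the second has lift 1 (independent), even though the second rule has the higher confidence.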

Biomedical Application: Head and Neck Cancer Example

 1. ace27=0 fiveyr=alive 381 ==> tumorbefore=0 372    conf:(0.98)
 2. gender=M ace27=0 467 ==> tumorbefore=0 455    conf:(0.97)
 3. ace27=0 588 ==> tumorbefore=0 572    conf:(0.97)
 4. tnm=T0N0M0 ace27=0 405 ==> tumorbefore=0 391    conf:(0.97)
 5. loc=LOC7 tumorbefore=0 409 ==> tnm=T0N0M0 391    conf:(0.96)
 6. loc=LOC7 442 ==> tnm=T0N0M0 422    conf:(0.95)
 7. loc=LOC7 gender=M tumorbefore=0 374 ==> tnm=T0N0M0 357    conf:(0.95)
 8. loc=LOC7 gender=M 406 ==> tnm=T0N0M0 387    conf:(0.95)
 9. gender=M fiveyr=alive 633 ==> tumorbefore=0 595    conf:(0.94)
10. fiveyr=alive 778 ==> tumorbefore=0 726    conf:(0.93)

Bioinformatics Application

• The idea of association rules has been customized for bioinformatics applications
• In biology it is often interesting to find frequent structures rather than items
  – For instance protein or other chemical structures
• Solution: Mining Frequent Patterns
  – FSG (Kuramochi and Karypis, ICDM 2001)
  – gSpan (Yan and Han, ICDM 2002)
  – CloseGraph (Yan and Han, KDD 2002)

FSG: Mining Frequent Patterns

FSG Algorithm for finding frequent subgraphs

Frequent Subgraph Examples: AIDS Data

• Compounds are active, inactive or moderately active (CA, CI, CM)

Predictive Subgraphs

• The three most discriminating sub-structures for the PTC, AIDS, and Anthrax datasets

FSG References

• Frequent Sub-structure Based Approaches for Classifying Chemical Compounds. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. ICDM 2003
• An Efficient Algorithm for Discovering Frequent Subgraphs. Michihiro Kuramochi and George Karypis. IEEE TKDE
• Automated Approaches for Classifying Structures. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. BIOKDD 2002
• Discovering Frequent Geometric Subgraphs. Michihiro Kuramochi and George Karypis. ICDM 2002
• Frequent Subgraph Discovery. Michihiro Kuramochi and George Karypis. 1st IEEE Conference on Data Mining 2001

Recap

• Before Starting to Mine….
• Descriptive Data Mining
  – Dimension Reduction & Projection
  – Clustering
    • Hierarchical clustering
    • K-means
    • Self organizing maps
  – Association rules
    • Frequent item sets
    • Association Rules
    • APRIORI
    • Bio-informatics case: FSG for frequent subgraph discovery
• Next week
  – Bioinformatics Data Mining Cases / Lab Session / Take Home Exercise