Top Banner
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me slides are at: http://www.macs.hw.ac.uk/~dwcorne/T Advanced Database Systems F24DS2 / F29AT2 About Data Mining
22

These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to.

Apr 01, 2015

Download

Documents

Hudson Willey
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Advanced Database SystemsF24DS2 / F29AT2

About Data Mining

Page 2: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Data Mining has Many meanings

There are lots of things you can do with a database:

1. Access data via straightforward queries, to answer straightforward questions about instances. E.g.:

• ``What is Ellen McArthur’s home phone number?” • “What is the ISBN number of Eats, shoots and leaves, by Lynne Truss?”, • “What grade did Larry Page get for the Internet

module??”• “Give me a list of all pages on the www that contain the

phrase “fried egg sandwich”.

Page 3: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Data Mining has Many meanings (cont…)

2. Generate simple reports about data via straightforward queries, to answer questions about sets of instances. E.g.: • ``How many of our customers are called “Trevor?”” • “Which of our books has been borrowed more times in

the last month than Eats, shoots and leaves, by Lynne Truss?”, • “Which student has the highest average marks?”• “ What percentage of house owners also own a car?”.

Page 4: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Data Mining has Many meanings (cont…)

3. Generate complex/and/or comprehensive statistical reports about the database as a whole, to summarise and understand the data – this is what tends to be done in the Analysis stage of Data Cleaning.. E.g.: • For each field, generate a histogram of the values • Run one or more clustering algorithms to find the

clusters in the data.

Page 5: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Data Mining has Many meanings (cont…)

4. Build Predictive models that may then be useful for business or research. For example: • Based on stock market data, we can construct a model

that attempts to predict tomorrow’s Dow Index closing price, given the previous few days’ prices.

• Based on blood test data from past patients, we can construct a model that attempts to predict whether or not a patient is developing hepatitis.

• Based on historic data on vibrations, we can build a model that tries to predict beforehand if an aircraft wing is likely to fail.

Page 6: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Data Mining has Many meanings (cont…)

5. Discover INTERESTING and USEFUL rules that are

`hidden’ in the data. For example:• An analysis of supermarket basket data will show a

surprising amount of baskets that contain both beer and nappies..

• Analysis of crime records data may find that the violent crimes rate in newcastle seems to reduce significantly whenever the violent crimes rate in Sunderland increases significantly. .

Page 7: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

All Together

1. Accessing

2. Reporting

3. Clustering/Histograms

4. Predictive models

5. Discovery of interesting/surprising things

Page 8: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Data Mining

1. Accessing2. Reporting3. Clustering/Histograms4. Predictive models5. Discovery of interesting/surprising things

When you hear the term `data mining’, it can mean any of 2, 3, 4 and 5. In business/industry, `2’ and `3’ are called data mining. In academia we usually take data mining to mean mainly `4’ and `5’

Page 9: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Notes on 3/4

1. Accessing

2. Reporting

3. Clustering/Histograms

4. Predictive models

5. Discovery of interesting/surprising things

These are the things that you would look at more closely in a machine learning course. The predictive models are things like neural networks, decision trees and rulesets

Page 10: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

What DM means for us1. Accessing2. Reporting3. Clustering/Histograms4. Predictive models5. Discovery of interesting/surprising things

So, 3 and 4 are dealt with in another course. 5 could be an entire MSc course on its own, but that is what DM means for us.

. In particular, we take a small bite of it that is relevant to practical discovery of interesting things in very large DBs. We look at a fast algorithm that can discover interesting rules in transaction databases, and that is a component in several advanced commercial systems..

Page 11: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

First, some important motivational/explanatory notes

Why do we need something like `type 5’ data mining at all? Couldn’t the `beer and nappies’ thing have been found by types 2 or 3 DM?

The next slide shows a tiny `supermarket basket’ database. E.g. Record 11 is a customer who bought eggs and glue only; record 12 Records a transaction where the basket contained only apples.

Page 12: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

1 1 1 1 1 12 1 1 13 1 1 1

4 1 1 1

5 1 16 1 17 1 1 1

8 1 1

9 1 110 1 111 1 112 1

13 1 1

14 1 1

15 1 116 117 1 118 1 1 1 1 119 1 1 1 1 120 1

ID apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream

Page 13: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Numbers

Our example DB has 20 records of supermarket transactions, from a supermarket that only sells 9 things

One month in a large supermarket with five stores spread around a reasonably sized city might easily yield a DB of 20,000,000 baskets, each containing a set of products from a pool of around 1,000

Page 14: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Rules

A `rule’ is something like this:If a basket contains apples and cheese, then it

also contains beerAny such rule has two associated measures:1. confidence – when the `if’ part is true,

how often is the `then’ bit true?2. coverage or support – how much of the

database contains the `if’ part?

Page 15: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Example:

What is the confidence and coverage of:

If the basket contains beer and cheese, then it also contains honey

2/20 of the records contain both beer and cheese, so coverage is 10%

Of these 2, 1 contains honey, so confidence is 50%

Is that interesting ? Is that useful ? What makes a rule interestingor useful?

Page 16: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Interesting/Useful rules

Statistically, anything that is interesting is something that happens significantly more than you would expect by chance.

E.g. basic statistical analysis of basket data may show that 10% of baskets contain bread, and 4% of baskets contain washing-up powder. I.e:– There is a probability 0.1 that a basket contains bread.

– There is a probability 0.04 that a basket contains washing-up powder.

Page 17: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Bread and washing up powder

What is the probability of a basket containing both bread and washing-up powder? The laws of probability say:

If these two things are independent, chance is 0.1 * 0.04 = 0.004

That is, we would expect 0.4% of baskets to contain both bread and washing up powder

Page 18: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Interesting means surprising

We therefore have a prior expectation that just 4 in 1,000 baskets should contain both bread and washing up powder.

If we investigate, and discover that really it is 20 in 1,000 baskets, then we will be very surprised. It tells us that:– Something is going on in shoppers’ minds: bread and

washing-up powder are connected in some way.– There may be ways to exploit this discovery … put the

powder and bread at opposite ends of the supermarket?

Page 19: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Finding surprising rules

Suppose we ask `what is the most surprising rule in this database? This would be, presumably, a rule whose accuracy is more different from its expected accuracy than any others. But it also has to have a suitable level of coverage, or else it may be just a statistical blip, and/or unexploitable.

Looking only at rules of the form: if basket contains X and Y, then it also contains Z… our realistic numbers tell us that there may be around 500,000,000

distinct possible rules. For each of these we need to work out its accuracy and coverage, by trawling through a database of around 20,000,000 basket records. … c 1016 operations …

Yes, it’s easy to use `type 2’ DM, say, to work out the confidence and coverage of a given rule. But type 5 DM is all about searching through, somehow, 500,000,000 (or usually immensely more) rules to sniff out what may be the interesting ones.

Page 20: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Here are some interesting onesin our mini basket DB:

• If a basket contains glue, then it also contains either beer or eggs

confidence: 100% ; coverage 25%

• If a basket contains apples and dates, then it also contains honey

confidence 100% ; coverage 20%

Page 21: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

What this lecture was about

• The many different meanings of data mining

• Warming up for the next lecture, via gentle discussion on transaction databases, rules, confidence, coverage, and what it takes for a rule to be interesting.

Page 22: These slides are at: dwcorne/Teaching/ David Corne, room EM G.39, x 3410, dwcorne@macs.hw.ac.uk / any questions, feel free to.

David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me

The

se s

lide

s ar

e at

: htt

p://

ww

w.m

acs.

hw.a

c.uk

/~dw

corn

e/T

each

ing/

Next

A classic fast algorithm for finding useful rules in large databases,