CSCI6405 Fall 2003
Data Mining and Data Warehousing
9 October 2003

Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: [email protected]
Teaching Assistant: Christopher Jordan, Email: [email protected]
Office Hours: TR, 1:30 - 3:00 PM


Lectures Outline

Part I: Overview on DM and DW
1. Introduction (Ch1)    Ass1 Due: Sep 23 Tue
2. Data preprocessing (Ch3)

Part II: DW and OLAP
3. Data warehousing and OLAP (Ch2)    Ass2: Sep 23 – Oct 14

Part III: Data Mining Methods/Algorithms
4. Data mining primitives (Ch4)
5. Classification data mining (Ch7)    Ass3: Oct 7 – Oct 21
6. Association data mining (Ch6)    Ass4: Oct 21 – Nov 5
7. Characterization data mining (Ch5)
8. Clustering data mining (Ch8)

Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)

Project Presentations
Project Due: Dec 8


3. DATA PREPROCESSING (Ch3)

• Data Preprocessing (DPP) Concept
• Major Tasks of DPP
• A DPP Case Study
• Summary


Why Is Data Preprocessing Important?

• No quality data, no quality mining results!
• Quality decisions must be based on quality data
  - e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse


Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility


Why Data Preprocessing?

• Raw data have errors and inconsistencies (Data cleaning)
• Data need to be integrated from different sources, and a uniform format is needed (Data integration and transformation)
• Irrelevant data should be removed (Data reduction)
• Domain knowledge should be added into the prepared data (Discretization and concept hierarchy generation)

Major Tasks of DPP

• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization
  - Part of data reduction but with particular importance, especially for numerical data


Why data cleaning?

Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation=“”
• noisy: containing errors or outliers
  - e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
  - e.g., Age=“42” Birthday=“03/07/1997”
  - e.g., Was rating “1,2,3”, now rating “A, B, C”


Why is data dirty?

• Incomplete data comes from
  - n/a data value when collected
  - different considerations between the time when the data was collected and when it is analyzed
  - human/hardware/software problems
• Noisy data comes from the process of data
  - collection
  - entry
  - transmission
• Inconsistent data comes from
  - different data sources
  - functional dependency violation


E.g. Data normalization for clustering mining

E.g., for clustering mining of a customer database: DB (Age, Income, Credit)

The distance between two data points (customers C1 and C2, with attributes a1 = Age, a2 = Income, a3 = Credit):

d = ((C1_a1 - C2_a1)^2 + (C1_a2 - C2_a2)^2 + (C1_a3 - C2_a3)^2)^(1/2)

                             Age   Income   Credit
Customer1:                   32    40,000   10,000
Customer2:                   24    30,000    2,000
Difference:                   8    10,000    8,000
Normalized (scale factor):    1    1/1000   1/1000
Difference (rescaled):        8        10        8

If we scale all the attributes to the same order of magnitude, we obtain a reliable distance measure between the different records.
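The effect of rescaling on the distance can be checked with a short calculation. The following is a minimal sketch in Python using the two customer records and the divisors (1, 1000, 1000) from the slide; the function names `euclidean` and `rescale` are made up for this illustration.

```python
import math

# Customer records from the slide: (age, income, credit).
customer1 = (32, 40_000, 10_000)
customer2 = (24, 30_000, 2_000)

# Divisors from the slide: age is kept, income and credit are divided by 1000.
divisors = (1, 1000, 1000)


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rescale(point, divs):
    """Divide each attribute by its divisor."""
    return tuple(value / d for value, d in zip(point, divs))


raw_distance = euclidean(customer1, customer2)
scaled_distance = euclidean(rescale(customer1, divisors),
                            rescale(customer2, divisors))

print(f"raw distance:    {raw_distance:.2f}")
print(f"scaled distance: {scaled_distance:.2f}")
```

Without rescaling, the age difference contributes almost nothing to the distance; after rescaling, all three attributes contribute on a comparable scale (differences of 8, 10, and 8).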


A DPP Case Study

Data mining task:
- Mining clusters of clients for a magazine publisher database.
- …

Data preparation for clustering: cleaned, integrated, normalized, numerical valued data, etc.

Business background: The publisher sells five types of magazine: on cars, houses, sports, music, and comics. The aim of the data mining is to find new, interesting clusters of clients in order to set up a marketing exercise. The business is interested in questions such as "What is the typical profile of a reader of a car magazine?", "Is there any correlation between an interest in cars and an interest in comics?" ...


1. Data Selection

The database should contain the records of subscription data of the magazines.

• It should be a selection of operational data from the publisher's invoicing system and contain information about people who have subscribed to a magazine
• The records consist of: client number, name, address, date of subscription, and type of magazine
• In order to facilitate the DM process, a copy of this operational data is drawn and stored in a separate database (refer to Table 1)


Table 1. Original data

Client number  Name     Address           Date purchase made  Magazine purchased
23003          Johnson  1 Downing Street  04-15-94            car
23003          Johnson  1 Downing Street  06-21-93            music
23003          Johnson  1 Downing Street  05-30-92            comic
23009          Clinton  2 Boulevard       01-01-01            comic
23013          King     3 High Road       02-30-95            sports
23019          Jonson   1 Downing Street  01-01-01            house


2. Data Cleaning: remove duplications

Duplication of records:

In an operational client database, some clients may be represented by several records. Possible causes include:

- negligence, such as people making typing errors
- clients moving from one place to another without notifying a change of address
- cases in which people deliberately spell their names incorrectly or give incorrect information about themselves to avoid a negative decision ... (Refer to Table 2)


Table 2. De-duplication

Client number  Name     Address           Date purchase made  Magazine purchased
23003          Johnson  1 Downing Street  04-15-94            car
23003          Johnson  1 Downing Street  06-21-93            music
23003          Johnson  1 Downing Street  05-30-92            comic
23009          Clinton  2 Boulevard       01-01-01            comic
23013          King     3 High Road       02-30-95            sports
23003          Johnson  1 Downing Street  01-01-01            house


De-duplication

E.g., the records of Mr. Johnson and Mr. Jonson in the database have different client numbers but the same address, which is a strong indication that they are the same person.

Of course, we can never be sure of this, but a de-duplication algorithm using pattern analysis techniques could identify the situation and present it to a user to make a decision.

This type of pollution gives a company the impression that it has more clients than is in fact the case.

De-duplication: the duplicated records may be identified by a pattern recognition algorithm and then corrected.
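One way to flag such candidates is sketched below in Python. It is only an illustration, not the algorithm the slides have in mind: it assumes that "same address plus similar name" is the pattern of interest, uses the standard-library `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold is an arbitrary choice. The candidates are reported for a user to decide on, as the slide suggests.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Distinct (client number, name, address) combinations from Table 1.
clients = [
    (23003, "Johnson", "1 Downing Street"),
    (23009, "Clinton", "2 Boulevard"),
    (23013, "King", "3 High Road"),
    (23019, "Jonson", "1 Downing Street"),
]

NAME_THRESHOLD = 0.8  # hypothetical cut-off; would need tuning on real data


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(records, threshold=NAME_THRESHOLD):
    """Return pairs of client numbers that share an address and have similar names."""
    candidates = []
    for (id_a, name_a, addr_a), (id_b, name_b, addr_b) in combinations(records, 2):
        if id_a == id_b:
            continue  # same client number, nothing to merge
        same_address = addr_a.strip().lower() == addr_b.strip().lower()
        score = name_similarity(name_a, name_b)
        if same_address and score >= threshold:
            candidates.append((id_a, id_b, round(score, 2)))
    return candidates


duplicates = candidate_duplicates(clients)
for id_a, id_b, score in duplicates:
    print(f"possible duplicate: {id_a} and {id_b} (name similarity {score})")
```

Comparing every pair of records is quadratic in the number of clients, so a real system would probably first group records by address or postal code before comparing names.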


2. Data Cleaning: correct domain inconsistency

Domain inconsistency: pollution caused by wrong domain values, which are not consistent with the definitions.

E.g., in the example table, the date 01-01-01 means 1 January 1901 (the company did not even exist at that time).

In some databases, analysis shows an unexpectedly high number of people born on 11 November: when people are forced to fill in a birth date on a screen and they either do not know it or do not want to divulge it, they are inclined to type in '11-11-11'.

Untrue values of this kind can be disastrous in a data mining context.

If information is unknown (NULL), it should be represented as such in the database.
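As a rough illustration of this cleaning step, the Python sketch below counts repeated date values and replaces known placeholder dates with `None` (standing in for NULL). The set of placeholder values is an assumption taken from the two cases mentioned above. Recovering the true value, as Table 3 does for one Johnson record, needs another data source and is not attempted here.

```python
from collections import Counter

# Purchase dates from Table 2, in row order.
purchase_dates = ["04-15-94", "06-21-93", "05-30-92",
                  "01-01-01", "02-30-95", "01-01-01"]

# Placeholder values people type when the real date is unknown.
# This set is an assumption for the example, based on the slide's two cases.
SENTINEL_DATES = {"01-01-01", "11-11-11"}


def repeated_values(values, min_count=2):
    """Values that occur at least min_count times; candidates for inspection."""
    counts = Counter(v for v in values if v is not None)
    return {v: n for v, n in counts.items() if n >= min_count}


def null_out_sentinels(values, sentinels=SENTINEL_DATES):
    """Replace placeholder dates with None so they are treated as unknown."""
    return [None if v in sentinels else v for v in values]


suspicious = repeated_values(purchase_dates)
cleaned = null_out_sentinels(purchase_dates)

print("suspicious:", suspicious)
print("cleaned:   ", cleaned)
```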


Table 3. Domain consistency

Client number  Name     Address           Date purchase made  Magazine purchased
23003          Johnson  1 Downing Street  04-15-94            car
23003          Johnson  1 Downing Street  06-21-93            music
23003          Johnson  1 Downing Street  05-30-92            comic
23009          Clinton  2 Boulevard       NULL                comic
23013          King     3 High Road       02-30-95            sports
23003          Johnson  1 Downing Street  12-20-94            house


3. Data Integration (Enrichment)

Suppose that we have purchased extra information about our clients consisting of date of birth, income, amount of credit, and whether or not an individual owns a car or a house. (Refer to Table 4)

* Where data is missing, you have to make a deliberate decision either to overlook it or to delete it. A general rule states that any deletion of data must be a conscious decision, after a thorough analysis of the possible consequences.


Table 4. Additional data available for enrichment

Client name  Date of birth  Income   Credit   Car owner  House owner
Johnson      04-13-76       $18,500  $17,800  no         no
Clinton      10-20-71       $36,000  $26,600  yes        no


Table 5. Enriched table

Client number  Name     Date of birth  Income   Credit   Car owner  House owner  Address           Date purchase made  Magazine purchased
23003          Johnson  04-13-76       $18,500  $17,800  no         no           1 Downing Street  04-15-94            car
23003          Johnson  04-13-76       $18,500  $17,800  no         no           1 Downing Street  06-21-93            music
23003          Johnson  04-13-76       $18,500  $17,800  no         no           1 Downing Street  05-30-92            comic
23009          Clinton  10-20-71       $36,000  $26,600  yes        no           2 Boulevard       NULL                comic
23013          King     NULL           NULL     NULL     NULL       NULL         NULL              02-30-95            sports
23003          Johnson  04-13-76       $18,500  $17,800  no         no           1 Downing Street  12-20-94            house
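The enrichment step amounts to a left join of the cleaned subscription records with the purchased client data. The Python sketch below shows one way this could look, assuming the client name is the join key (the purchased data in Table 4 carries no client number). The field names and the `enrich` helper are invented for the example, and only three of the subscription rows are included. Clients without a match, such as King, get `None` (NULL) in the added columns.

```python
# Cleaned subscription rows from Table 3 (one row per client shown here):
# (client number, name, address, purchase date, magazine).
subscriptions = [
    (23003, "Johnson", "1 Downing Street", "04-15-94", "car"),
    (23009, "Clinton", "2 Boulevard", None, "comic"),
    (23013, "King", "3 High Road", "02-30-95", "sports"),
]

# Purchased client data from Table 4, keyed by client name.
extra_by_name = {
    "Johnson": {"birth_date": "04-13-76", "income": 18500, "credit": 17800,
                "car_owner": "no", "house_owner": "no"},
    "Clinton": {"birth_date": "10-20-71", "income": 36000, "credit": 26600,
                "car_owner": "yes", "house_owner": "no"},
}

EXTRA_FIELDS = ("birth_date", "income", "credit", "car_owner", "house_owner")


def enrich(rows, extra):
    """Left join: keep every subscription row, add extra fields where known."""
    enriched = []
    for number, name, address, purchase_date, magazine in rows:
        info = extra.get(name, {})
        record = {
            "client_number": number,
            "name": name,
            "address": address,
            "purchase_date": purchase_date,
            "magazine": magazine,
        }
        for field in EXTRA_FIELDS:
            record[field] = info.get(field)  # None (NULL) when there is no match
        enriched.append(record)
    return enriched


enriched = enrich(subscriptions, extra_by_name)
for record in enriched:
    print(record)
```

Joining on a name is fragile, since two different clients may share one; a real project would prefer a key that identifies the client uniquely.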


4. Data Reduction

Remove the columns and rows which are not valuable to the DM process. In Table 6, the column NAME and the row with multiple NULL values are removed from the database.

In a real DM project, most of the tables are collected from operational data; a lot of desirable data is missing, and most of it is impossible to retrieve.
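The column and row removal described above can be sketched in a few lines of Python. The rule used here, "drop a row when it has more than one NULL", is an assumption chosen so that the King row is removed while the Clinton row (one NULL) is kept, as in Table 6. The rows carry only a subset of the Table 5 columns to keep the example short.

```python
# A subset of the columns of Table 5, one row per client; None stands for NULL.
rows = [
    {"client_number": 23003, "name": "Johnson", "income": 18500,
     "address": "1 Downing Street", "purchase_date": "04-15-94", "magazine": "car"},
    {"client_number": 23009, "name": "Clinton", "income": 36000,
     "address": "2 Boulevard", "purchase_date": None, "magazine": "comic"},
    {"client_number": 23013, "name": "King", "income": None,
     "address": None, "purchase_date": "02-30-95", "magazine": "sports"},
]


def reduce_table(table, drop_columns=("name",), max_nulls=1):
    """Drop the given columns, then drop rows with more than max_nulls NULLs."""
    reduced = []
    for row in table:
        kept = {key: value for key, value in row.items() if key not in drop_columns}
        null_count = sum(value is None for value in kept.values())
        if null_count <= max_nulls:
            reduced.append(kept)
    return reduced


reduced = reduce_table(rows)
for row in reduced:
    print(row)
```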


Table 6. Table with column and row removed

Client number  Date of birth  Income   Credit   Car owner  House owner  Address           Date purchase made  Magazine purchased
23003          04-13-76       $18,500  $17,800  no         no           1 Downing Street  04-15-94            car
23003          04-13-76       $18,500  $17,800  no         no           1 Downing Street  06-21-93            music
23003          04-13-76       $18,500  $17,800  no         no           1 Downing Street  05-30-92            comic
23009          10-20-71       $36,000  $26,600  yes        no           2 Boulevard       NULL                comic
23003          04-13-76       $18,500  $17,800  no         no           1 Downing Street  12-20-94            house


4. Data Reduction (cont)

In some cases, especially fraud detection, lack of information can be a valuable indication of interesting patterns. Up to this point, the process phase has consisted of mainly simple SQL operations.


5. Data transformation

For most databases, the information provided is much too detailed to be used as input to data mining algorithms (such as the data in Table 6).

Apply the following coding steps:
1. Address to region
2. Birth date to age
3. Divide income by 1000
4. Divide credit by 1000
5. Convert cars yes-no to 1-0
6. Convert purchase date to month numbers starting from 1990
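The coding steps above can be sketched as a single function over one row of Table 6. This Python version rests on several assumptions that the slides do not state: the reference date used to compute age (chosen so that a 1976 birth date gives age 20), the address-to-region lookup, and the field names. It also codes house ownership the same way as car ownership, which is what the coded table shows.

```python
from datetime import date

REFERENCE_DATE = date(1996, 12, 31)  # hypothetical "today" used to compute age
REGION_BY_ADDRESS = {"1 Downing Street": 1, "2 Boulevard": 1}  # hypothetical lookup


def parse_mdy(text):
    """Parse an MM-DD-YY string from the tables, assuming the year is 19YY."""
    month, day, year = (int(part) for part in text.split("-"))
    return date(1900 + year, month, day)


def age_on(birth, today=REFERENCE_DATE):
    """Age in whole years on the reference date."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def month_number(purchase, base_year=1990):
    """Months counted from the start of base_year; January 1990 is month 1."""
    return (purchase.year - base_year) * 12 + purchase.month


def encode(row):
    """Apply coding steps 1 to 6 to one row."""
    purchase = row["purchase_date"]
    return {
        "client_number": row["client_number"],
        "age": age_on(parse_mdy(row["birth_date"])),
        "income": row["income"] / 1000,
        "credit": row["credit"] / 1000,
        "car_owner": 1 if row["car_owner"] == "yes" else 0,
        "house_owner": 1 if row["house_owner"] == "yes" else 0,
        "region": REGION_BY_ADDRESS[row["address"]],
        "month_of_purchase": (None if purchase is None
                              else month_number(parse_mdy(purchase))),
        "magazine": row["magazine"],
    }


row = {"client_number": 23003, "birth_date": "04-13-76", "income": 18500,
       "credit": 17800, "car_owner": "no", "house_owner": "no",
       "address": "1 Downing Street", "purchase_date": "04-15-94",
       "magazine": "car"}
coded = encode(row)
print(coded)
```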


Table 7. An intermediate coding stage

Client number  Age  Income  Credit  Car owner  House owner  Region  Month of purchase  Magazine purchased
23003          20   18.5    17.8    0          0            1       52                 car
23003          20   18.5    17.8    0          0            1       42                 music
23003          20   18.5    17.8    0          0            1       29                 comic
23009          25   36.0    26.6    1          0            1       NULL               comic
23003          20   18.5    17.8    0          0            1       48                 house


Table 8. The final table

Client number  Age  Income  Credit  Car owner  House owner  Region  Car magazine  House magazine  Sports magazine  Music magazine  Comic magazine
23003          20   18.5    17.8    0          0            1       1             1               0                1               1
23009          25   36.0    26.6    1          0            1       0             0               0                0               1
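The final table has one row per client, with one 0/1 column per magazine type in place of the separate subscription rows. A minimal Python sketch of that flattening step follows, using only the client number and magazine columns of the intermediate coded table; the `flatten` helper is a name made up for this example.

```python
MAGAZINES = ("car", "house", "sports", "music", "comic")

# (client number, magazine) pairs from the intermediate coded table.
coded_rows = [
    (23003, "car"),
    (23003, "music"),
    (23003, "comic"),
    (23009, "comic"),
    (23003, "house"),
]


def flatten(rows, magazines=MAGAZINES):
    """Build one record per client with a 0/1 flag for each magazine type."""
    flags = {}
    for client, magazine in rows:
        # Start every client with all flags at 0, then set the ones seen.
        client_flags = flags.setdefault(client, dict.fromkeys(magazines, 0))
        client_flags[magazine] = 1
    return flags


flat = flatten(coded_rows)
for client, client_flags in flat.items():
    print(client, client_flags)
```

The per-client attributes (age, income, and so on) are the same on every row of a client, so they can be carried over unchanged from any one of that client's rows.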


Summary

• Data preparation is a major issue and the most time-consuming process for both mining and warehousing
• Data preparation includes
  - Data cleaning, integration, transformation, reduction, discretization, etc.
• Many DPP tools have been developed, but it is still an active research area because of the effort needed for