Special Topics in Data Mining

Special Topics in Data Mining. Direct Objectives To learn data mining techniques To see their use in real-world/research applications To get an understanding.

Dec 23, 2015


Special Topics in Data Mining


Special Topics in Data Mining

Direct Objectives
• To learn data mining techniques
• To see their use in real-world/research applications
• To get an understanding of the methodological principles behind data mining
• To be able to read about data mining in the popular press with a critical eye
• To implement & use data mining models using DM software


Special Topics in Data Mining

Grade Structure

Review Paper & Presentation: 30%
Final Project Implementation & Presentation: 40%
Final Project Paper: 30%


Special Topics in Data Mining

Data Mining in Specific Fields for the Review Paper

• Data Mining in Security
• Data Mining in Telecommunications and Control
• Text and Web Mining
• Data Mining in Biomedicine and Science
• Data Mining for Insurance
• Data Mining in Banking and Commercial
• Data Mining in Sales Marketing and Finance
• Data Mining in Business


What is Data Mining?

Not well defined, since data mining is a confluence of multiple disciplines.
No one can agree on what data mining is! In fact, the experts have very different descriptions.
Different fields have different views of what data mining is (and different terminology!).


What is Data Mining?

Data mining is a confluence of multiple disciplines:
• Database Technology
• Statistics
• Machine Learning
• Visualization
• Information Science
• Other Disciplines


What is Data Mining?

“finding interesting structure (patterns, statistical models, relationships) in data bases.” - Fayyad, Chaudhuri and Bradley

“the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” - Fayyad


What is Data Mining?

“a knowledge discovery process of extracting previously unknown, actionable information from very large data bases” - Zornes

“a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.” - Edelstein


What is Data Mining?

Data mining is the process of extracting hidden patterns from data.

Data mining is the process of discovering new patterns from large data sets involving methods from statistics and artificial intelligence but also database management.

“data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner” - Hand, Mannila, Smyth


What is Data Mining?

Knowledge Discovery in Databases (KDD)

Data mining is also popularly known as Knowledge Discovery in Databases (KDD)... The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps (from Zaiane):

• Data cleaning: ...
• Data integration: ...
• Data selection: ...
• Data transformation: ...
• Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: ...
• Knowledge representation: ...
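The seven step names above can be lined up as a small pipeline. This is only a sketch under assumptions: the slide elides what each step does, so the function bodies, the toy purchase records, and the `min_count` threshold are all invented for illustration.

```python
from collections import Counter

# Hypothetical toy sources; every value here is invented for illustration.
store_a = [{"customer": "c1", "item": "Milk "}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c3", "item": "bread"}, {"customer": "c1", "item": "milk"}]


def clean(records):
    """Data cleaning (assumed meaning): drop records with a missing item."""
    return [r for r in records if r["item"] is not None]


def integrate(*sources):
    """Data integration (assumed meaning): combine several sources."""
    return [r for source in sources for r in source]


def select(records):
    """Data selection (assumed meaning): keep the attribute relevant to the task."""
    return [r["item"] for r in records]


def transform(items):
    """Data transformation (assumed meaning): put values in a consistent form."""
    return [item.strip().lower() for item in items]


def mine(items):
    """Data mining: extract a very simple pattern, here item frequencies."""
    return Counter(items)


def evaluate(patterns, min_count=2):
    """Pattern evaluation (assumed meaning): keep patterns above a threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}


def represent(patterns):
    """Knowledge representation (assumed meaning): readable summary lines."""
    return [f"{item} was bought {n} times" for item, n in sorted(patterns.items())]


# Run the steps in the order listed on the slide.
integrated = integrate(clean(store_a), clean(store_b))
report = represent(evaluate(mine(transform(select(integrated)))))
print(report)
```

In practice the process is iterative, as the slide notes: a poor result at the evaluation step usually sends you back to selection or transformation.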



What is Data Mining?

Software

Can use any software you like, but you must know how to input, manipulate, graph, and analyze data.

SAS, Weka, SPSS, Systat, Enterprise Miner, JMP, Minitab, Matlab, SQL Server



Data Data Data

• It’s all about the data - where does it come from?
– WWW
– Genes
– Business processes/transactions
– Telecommunications and networking
– Medical imagery
– Government, demographics (data.gov!)
– Sensor networks
– Sports


What is Data?

• Collection of objects and their attributes

• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature

• A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance

• Attribute values are numbers or symbols assigned to an attribute

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Columns are attributes; rows are objects.


Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
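A minimal sketch of how record data like this table might be held in code: each record is a dictionary with the same fixed set of keys. The key names and the choice to store income in thousands are assumptions made for the example.

```python
from collections import Counter

# Key names are illustrative; income is stored in thousands (125K -> 125).
FIELDS = ("tid", "refund", "marital_status", "taxable_income_k", "cheat")
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Every record (object) has the same fixed set of attributes.
records = [dict(zip(FIELDS, row)) for row in ROWS]

cheaters = [r["tid"] for r in records if r["cheat"] == "Yes"]
status_counts = Counter(r["marital_status"] for r in records)
mean_income_k = sum(r["taxable_income_k"] for r in records) / len(records)

print(cheaters, dict(status_counts), mean_income_k)
```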


Document Data

• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
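The term-vector idea can be sketched in a few lines. The sentence below is hypothetical, and splitting on whitespace is a simplifying assumption (real text needs punctuation handling and usually stemming).

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


# A subset of the terms shown on the slide.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]

# Hypothetical document, invented for illustration.
document = "The coach told the team to play play play and score"

vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```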


Transaction Data

• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Milk, Bread
3    Water, Coke, Diaper, Milk
4    Water, Bread, Diaper, Milk
5    Coke, Diaper, Milk
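One way to work with the transactions above is to store each one as a set of items. The sketch below also computes support (the fraction of transactions containing an item set), a standard market-basket measure that the slide itself does not define; it is included here only as an illustration.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Milk", "Bread"},
    3: {"Water", "Coke", "Diaper", "Milk"},
    4: {"Water", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


# Count how often each pair of items is bought together.
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(items), 2)
)

print(support({"Diaper", "Milk"}, transactions))
print(pair_counts.most_common(3))
```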


Transaction Data

Web logs, phone calls…

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
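Before log records like these can be mined, each one has to be split into named fields. The sketch below parses three of the records shown above. The field names are assumptions: the slide gives only raw values, so the labels are a guess at what each position means.

```python
from collections import Counter

# Field names are assumptions for illustration; the slide shows only raw values.
FIELDS = [
    "client_ip", "user", "date", "time", "service", "server", "server_ip",
    "time_taken", "bytes_in", "bytes_out", "status", "os_status",
    "method", "target", "parameters",
]

RAW_LOG = """\
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
"""


def parse_record(line):
    """Split one comma-separated log record into a field -> value dict."""
    values = [v.strip() for v in line.strip().rstrip(",").split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


records = [parse_record(line) for line in RAW_LOG.splitlines() if line.strip()]

# Two simple summaries: requests per status code, and targets that failed.
status_counts = Counter(r["status"] for r in records)
failed_targets = sorted({r["target"] for r in records if r["status"] == "404"})

print(status_counts, failed_targets)
```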


Graph Data

• Examples: Generic graph and HTML Links

[Figure: generic graph]

<a href="papers/papers.html#bbbb">Data Mining </a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning </a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations </a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers
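HTML links form a graph: pages are nodes and each link is a directed edge. A rough sketch using Python's standard `html.parser` follows. The source page name `index.html` is hypothetical, since the slide does not say which page contains these links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical name for the page that holds the links.
SOURCE_PAGE = "index.html"

html = (
    '<a href="papers/papers.html#bbbb">Data Mining </a>'
    '<li><a href="papers/papers.html#aaaa">Graph Partitioning </a>'
)

collector = LinkCollector()
collector.feed(html)

# Adjacency sets: page -> pages it links to (fragment after '#' dropped).
graph = {}
for href in collector.links:
    target = href.split("#")[0]
    graph.setdefault(SOURCE_PAGE, set()).add(target)

print(collector.links)
print(graph)
```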


Ordered Data

• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCCCGCAGGGCCCGCCCCGCGCCGTCGAGAAGGGCCCGCCTGGCGGGCGGGGGGAGGCGGGGCCGCCCGAGCCCAACCGAGTCCGACCAGGTGCCCCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAGGCCAAGTAGAACACGCGAAGCGCTGGGCTGCCTGCTGCGACCAGGG
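In ordered data the position of each symbol matters, not just how often it occurs. As a small illustration, the sketch below takes the first 12 bases of the sequence above and counts both single bases and overlapping 2-base windows (k-mers). The choice of k = 2 and of the 12-base prefix is arbitrary.

```python
from collections import Counter

# First 12 bases of the sequence shown on the slide.
sequence = "GGTTCCGCCTTC"


def kmer_counts(seq, k):
    """Count every overlapping window of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


base_counts = Counter(sequence)
pairs = kmer_counts(sequence, 2)

# Fraction of bases that are G or C.
gc_content = (base_counts["G"] + base_counts["C"]) / len(sequence)

print(base_counts, pairs, gc_content)
```

Shuffling the sequence leaves `base_counts` unchanged but generally changes `pairs`, which is why order-aware features are needed for this kind of data.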


Time Series Data


Spatio-Temporal Data

Average Monthly Temperature of land and ocean


Relational Data

128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
…,

128.195.36.195, Doe, John, 12 Main St, 973-462-3421, Madison, NJ, 07932
114.12.12.25, Trank, Jill, 11 Elm St, 998-555-5675, Chester, NJ, 07911
…

07911, Chester, NJ, 07954, 34000, , 40.65, -74.12
07932, Madison, NJ, 56000, 40.642, -74.132
…

• Most large data sets are stored in relational databases
• Special data query language: SQL
• Commercial vendors: Oracle, Microsoft, IBM
• Good open source versions: MySQL, PostgreSQL
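The tables above are related through shared values: the IP address links the web log to the customer table. Below is a minimal SQL sketch using Python's built-in `sqlite3` module and an in-memory database. The table and column names are assumptions, and only a few of the slide's columns are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table and column names are illustrative; the slide shows only raw rows.
conn.executescript("""
    CREATE TABLE weblog (client_ip TEXT, method TEXT, target TEXT);
    CREATE TABLE customer (
        ip TEXT, last_name TEXT, first_name TEXT, city TEXT, zip TEXT
    );
""")

conn.executemany("INSERT INTO weblog VALUES (?, ?, ?)", [
    ("128.195.36.195", "GET", "/top.html"),
    ("128.195.36.195", "POST", "/spt/main.html"),
    ("128.200.39.17", "GET", "/spt/images/bk1.jpg"),
])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", [
    ("128.195.36.195", "Doe", "John", "Madison", "07932"),
    ("114.12.12.25", "Trank", "Jill", "Chester", "07911"),
])

# Join the two tables on IP address and count requests per known customer.
rows = conn.execute("""
    SELECT c.first_name, c.last_name, c.city, COUNT(*) AS requests
    FROM weblog AS w
    JOIN customer AS c ON c.ip = w.client_ip
    GROUP BY c.ip
    ORDER BY c.last_name
""").fetchall()

print(rows)
conn.close()
```

Requests from 128.200.39.17 drop out of the result because no customer row matches that address; a `LEFT JOIN` from `weblog` would keep them.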


Data Quality

• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data


Noise

• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone and “snow” on a television screen

[Figure: two panels, “Two Sine Waves” and “Two Sine Waves + Noise”]
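The two-panel figure can be approximated numerically. This sketch sums two sine waves and then adds Gaussian noise. The frequencies, sample count, noise level, and seed are all assumed values, since the slide does not give them.

```python
import math
import random


def two_sine_waves(n=200, f1=1.0, f2=3.0):
    """Sum of two sine waves sampled at n points (frequencies are assumed)."""
    return [
        math.sin(2 * math.pi * f1 * t / n) + math.sin(2 * math.pi * f2 * t / n)
        for t in range(n)
    ]


def add_noise(signal, sigma=0.3, seed=0):
    """Add zero-mean Gaussian noise; a fixed seed keeps the result repeatable."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


original = two_sine_waves()
noisy = add_noise(original)

# Root-mean-square difference: how far the noise moved the original values.
rms_error = math.sqrt(
    sum((a - b) ** 2 for a, b in zip(original, noisy)) / len(original)
)
print(round(rms_error, 3))
```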


Outliers

• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
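One common convention for flagging outliers in a single numeric attribute is the 1.5 × IQR rule. The slide does not prescribe any particular method, so treat this as just one possible sketch. The values are the taxable incomes (in thousands) from the record data table earlier in these slides.

```python
from statistics import quantiles

# Taxable incomes (in thousands) from the record data table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]


def iqr_outliers(values, factor=1.5):
    """Return values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(incomes)
print(outliers)
```

Whether a flagged value is an error or the most interesting record in the data set is a judgment call that depends on the application.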


Missing Values

• Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

• Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
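The first three handling strategies can be sketched on a toy attribute. The ages are hypothetical, `None` stands for a missing value, and filling with the mean is only one of many possible estimates.

```python
from statistics import mean

# Hypothetical ages; None marks a value that was not collected.
ages = [23, None, 31, 40, None, 26]


def eliminate(values):
    """Eliminate data objects: drop every entry with a missing value."""
    return [v for v in values if v is not None]


def estimate(values):
    """Estimate missing values: fill each gap with the mean of the known ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def mean_ignoring_missing(values):
    """Ignore missing values during analysis: compute over known values only."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


print(eliminate(ages))
print(estimate(ages))
print(mean_ignoring_missing(ages))
```

Elimination shrinks the data set, while estimation keeps its size but can understate the spread of the attribute.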


Duplicate Data

• Data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues
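A small sketch of the "same person with multiple email addresses" case: records are grouped under a normalised name so that exact and near duplicates collapse into one entry. The contact records are hypothetical, and matching on name alone is a simplifying assumption (two different people can share a name).

```python
# Hypothetical contact records merged from two sources.
contacts = [
    {"name": "John Doe", "email": "john.doe@example.com"},
    {"name": "john  doe", "email": "jdoe@example.org"},
    {"name": "Jill Trank", "email": "jill@example.com"},
    {"name": "John Doe", "email": "john.doe@example.com"},
]


def normalise(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def merge_duplicates(records):
    """Group email addresses under one normalised name per person."""
    merged = {}
    for record in records:
        key = normalise(record["name"])
        merged.setdefault(key, set()).add(record["email"].lower())
    return merged


people = merge_duplicates(contacts)
for name, emails in sorted(people.items()):
    print(name, sorted(emails))
```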


Examples of Data Mining Successes
• Market Basket (WalMart)
• Recommender Systems (Amazon.com)
• Fraud Detection in Telecommunications (AT&T)
• Target Marketing / CRM
• Financial Markets
• DNA Microarray analysis
• Web Traffic / Blog analysis