Introduction to Data Mining, Dr. Sushil Kulkarni, Jai Hind College ([email protected])

Introduction to Data Mining

Nov 17, 2014


This lecture gives various definitions of data mining and explains why data mining is needed. It also gives examples of classification, clustering, and association rules.
Page 1: Introduction to Data Mining

Introduction to Data Mining

Dr. Sushil Kulkarni, Jai Hind College ([email protected])

Page 2: Introduction to Data Mining

Road Map

• Introduction to Databases
• A Problem and a Solution
• What Is Data Mining?
• Goal of Data Mining
• What Is (Not) Data Mining?
• Convergence of 3 Key Technologies
• Data Mining Functions
• Kinds of Data Mining Problems

Page 3: Introduction to Data Mining

What is a Database?

A database is any organized collection of data.

Page 4: Introduction to Data Mining

Examples: Co-workers

Page 5: Introduction to Data Mining

Examples: Patient Information

Page 6: Introduction to Data Mining

Examples: Airline Reservation System

Page 7: Introduction to Data Mining

Data vs. Information

• What is data?
– Data is raw, unprocessed facts.

• What is information?
– Information is data that has been organized and communicated in a coherent and meaningful manner.
– Data is converted into information, and information is converted into knowledge.
– Knowledge is information evaluated and organized so that it can be used purposefully.

Page 8: Introduction to Data Mining

Why do we need a database?

• Keep records of our:
– Clients
– Staff
– Volunteers

• Keep a record of activities and interventions

• Keep sales records

• Develop reports

• Perform research

Page 9: Introduction to Data Mining

Purpose of a Database System

To transform: Data → Information → Knowledge → Action

Page 10: Introduction to Data Mining

Database

• Database: A shared collection of logically related data (and a description of this data), designed to meet the information needs of an organization.

• Database Management System (DBMS): A software system that enables users to define, create, and maintain the database and that provides controlled access to it.

Page 11: Introduction to Data Mining

Who does it, and how?

• A Database Management System (DBMS) does this job.

• Software tools: Access, FileMaker, Lotus Notes, Oracle, SQL Server, ...

• A DBMS includes tools to add, modify, or delete data in the database, ask questions (queries) about the stored data, and produce reports summarizing selected contents.
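The add, modify, delete, query, and report operations listed above can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: SQLite is not one of the tools named on the slide, and the `clients` table and its rows are made up for illustration.

```python
import sqlite3

# In-memory database; the "clients" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Add data.
conn.executemany(
    "INSERT INTO clients (name, city) VALUES (?, ?)",
    [("Aditi", "Mumbai"), ("Rahul", "Pune"), ("Meera", "Mumbai")],
)

# Modify data.
conn.execute("UPDATE clients SET city = ? WHERE name = ?", ("Nashik", "Rahul"))

# Delete data.
conn.execute("DELETE FROM clients WHERE name = ?", ("Meera",))

# Query: which clients are in Mumbai?
mumbai_clients = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM clients WHERE city = ? ORDER BY name", ("Mumbai",)
    )
]

# Report: number of clients per city.
report = dict(conn.execute("SELECT city, COUNT(*) FROM clients GROUP BY city"))

print(mumbai_clients)
print(report)
```

Every answer here is a precise subset or summary of what was stored, which is the behaviour the later slides contrast with data mining.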

Page 12: Introduction to Data Mining

Let’s jump to Data Mining

• With this background, we will now see what data mining is.

Page 13: Introduction to Data Mining

A Problem …

• You are a marketing manager at a brokerage company
– Problem: churn is too high
  > Turnover is 40% (after the six-month introductory period ends)
– Customers receive incentives (average cost: ₹160) when an account is opened
– Giving new incentives to everyone who might leave is very expensive (as well as wasteful)
– Bringing back a customer after they leave is both difficult and costly

Page 14: Introduction to Data Mining

A Solution …

• One month before the introductory period ends, predict which customers will leave
• If you want to keep a customer who is predicted to churn, offer them something based on their predicted value
  > Customers who are not predicted to churn need no attention
• If you don’t want to keep the customer, do nothing
• How can you predict future behavior?
  > Tarot cards
  > Magic 8 Ball

Page 15: Introduction to Data Mining

KDD Process

• Knowledge discovery in databases (KDD) is a multistep process of finding useful information and patterns in data.

• Data Mining is the use of algorithms to extract information and patterns derived by the KDD process.

• Many texts treat KDD and Data Mining as the same process, but it is also possible to think of Data Mining as the discovery part of KDD.

Page 16: Introduction to Data Mining

Steps of KDD Process

Page 17: Introduction to Data Mining

Steps of KDD Process

1. Selection (data extraction): obtain data from heterogeneous data sources such as databases, data warehouses, the World Wide Web, or other information repositories.

2. Preprocessing (data cleaning): incomplete, noisy, or inconsistent data is cleaned. Missing data may be ignored or predicted; erroneous data may be deleted or corrected.

3. Transformation (data integration): combine data from multiple sources into a coherent store. Data can be encoded in common formats, normalized, or reduced.

Page 18: Introduction to Data Mining

Steps of KDD Process

4. Data mining: apply algorithms to the transformed data and extract patterns.

5. Pattern interpretation/evaluation:
– Pattern evaluation: evaluate the interestingness of the resulting patterns, or apply interestingness measures to filter the discovered patterns.
– Knowledge presentation: present the mined knowledge; visualization techniques can be used.
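The five KDD steps can be sketched as a chain of small functions. This is a toy illustration only: the records, the mean-imputation choice, the min-max normalization, and the threshold "pattern" are all assumptions made for the example, not methods prescribed by the lecture.

```python
# Hypothetical raw records from two "sources"; None marks a missing value.
source_a = [{"customer": "c1", "spend": 120.0}, {"customer": "c2", "spend": None}]
source_b = [{"customer": "c3", "spend": 300.0}, {"customer": "c4", "spend": 180.0}]


def select(*sources):
    """1. Selection: gather records from several sources."""
    return [dict(rec) for src in sources for rec in src]


def preprocess(records):
    """2. Preprocessing: predict missing values with the mean (one option of several)."""
    known = [r["spend"] for r in records if r["spend"] is not None]
    mean = sum(known) / len(known)
    return [{**r, "spend": mean if r["spend"] is None else r["spend"]} for r in records]


def transform(records):
    """3. Transformation: min-max normalize spend into [0, 1]."""
    lo = min(r["spend"] for r in records)
    hi = max(r["spend"] for r in records)
    return [{**r, "spend_norm": (r["spend"] - lo) / (hi - lo)} for r in records]


def mine(records, threshold=0.5):
    """4. Data mining: a deliberately trivial pattern, customers above a threshold."""
    return [r["customer"] for r in records if r["spend_norm"] > threshold]


def evaluate(pattern, records):
    """5. Evaluation: keep the pattern only if it covers a non-trivial share of the data."""
    share = len(pattern) / len(records)
    return pattern if 0 < share < 1 else []


records = transform(preprocess(select(source_a, source_b)))
high_spenders = evaluate(mine(records), records)
print(high_spenders)
```

In a real project the mining step would be a classification, clustering, or association rule algorithm; the point here is only how the output of each step feeds the next.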

Page 19: Introduction to Data Mining

What Is Data Mining?

Some Definitions

• “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” (Piatetsky-Shapiro)

• "...the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, ... or data streams." (Han, pg xxi)

• “...the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful...” (Witten, pg 5)

• “...finding hidden information in a database.” (Dunham, pg 3)

• “...the process of employing one or more computer learning techniques to automatically analyse and extract knowledge from data contained within a database.” (Roiger, pg 4)

Page 20: Introduction to Data Mining

Why Data Mining?

• That all sounds ... complicated. Why should I learn about Data Mining?

• What's wrong with just a relational database? Why would I want to go through these extra [complicated] steps?

• Isn't it expensive? It sounds like it takes a lot of skill, programming, computational time and storage space.

• Where's the benefit?

• Data Mining isn't just a cute academic exercise; it has very profitable real-world uses. Practically all large companies and many governments perform data mining as part of their planning and analysis.

Page 21: Introduction to Data Mining

Goal of Data Mining

• Simplification and automation of the overall statistical process, from data source(s) to model application
• Changed over the years
  > Statistician replace data to a model
  > Many different data mining algorithms / tools available
  > Statistical expertise required to build intelligence into the software

Page 22: Introduction to Data Mining

Data Mining is …

Page 23: Introduction to Data Mining

What is (not) Data Mining?

What is Data Mining?

– Discovering that certain names are more common in certain locations of Mumbai (Kulkarni, Shah, Iyer, …)
– Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)

What is not Data Mining?

– Looking up a phone number in a phone directory
– Querying a Web search engine for information about “Amazon”

Page 24: Introduction to Data Mining

DB vs. DM Processing

Database processing:
• Query: well defined; SQL
• Data: operational data
• Output: precise; a subset of the database

Data mining processing:
• Query: poorly defined; no precise query language
• Data: not operational data
• Output: fuzzy; not a subset of the database

Page 25: Introduction to Data Mining

Convergence of 3 key Technologies

Page 26: Introduction to Data Mining

1. Increasing Computing Power

• Moore’s law: computing power doubles roughly every 18 months
• Powerful workstations became common
• Cost-effective servers (SMPs) provide parallel processing to the mass market
• Interesting tradeoff:
  > Small number of large analyses vs. large number of small analyses

Page 27: Introduction to Data Mining

1. The Data Explosion

• The rate of data creation is accelerating each year. In 2003, UC Berkeley estimated that the previous year generated 5 exabytes of data, of which 92% was stored on electronically accessible media.
– Mega < Giga < Tera < Peta < Exa ...
– All the data in all the books in the US Library of Congress is ~136 terabytes, so 2002 produced the equivalent of about 37,000 new Libraries of Congress.

• VLBI Telescopes produce 16 Gigabytes of data every second.

• Google searches 18 billion+ accessible web pages.
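The "37,000 Libraries of Congress" figure can be checked with a line of arithmetic. The sketch below assumes decimal prefixes (1 terabyte = 10^12 bytes, 1 exabyte = 10^18 bytes), which the slide does not state.

```python
# Decimal (SI) prefixes are an assumption; the slide does not say which convention it uses.
TERABYTE = 10**12
EXABYTE = 10**18

data_2002 = 5 * EXABYTE               # UC Berkeley estimate quoted above
library_of_congress = 136 * TERABYTE  # size quoted above

ratio = data_2002 / library_of_congress
print(round(ratio))
```

The ratio comes out a little under 37,000, which matches the rounded figure on the slide.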

Page 28: Introduction to Data Mining

1. The Data Explosion: Implications

• As the amount of data increases, the proportion of information decreases.

• As more and more data is generated automatically, we need to find automatic solutions to turn those stored raw results into information.

• Companies need to turn stored data into profit ... Otherwise why are they storing it?

Page 29: Introduction to Data Mining

2. Improved Data Collection and Management

• Data Collection → Access → Navigation → Mining
• The more data the better (usually)

Page 30: Introduction to Data Mining

3. Statistical & Machine Learning Algorithms

• Techniques have often been waiting for computing technology to catch up
• Statisticians were already doing “manual data mining”
• Good machine learning is just the intelligent application of statistical processes
• A lot of data mining research has focused on tweaking existing techniques to get small percentage gains

Page 31: Introduction to Data Mining

3. Data/Information/Knowledge/Wisdom

• For example, a data mining application may tell you that there is a correlation between buying music magazines and beer, but it doesn't tell you how to use that knowledge. Should you put the two close together to reinforce the tendency, or should you put them far apart, since people will buy them anyway and thus stay in the store longer?

• Data mining can help managers plan strategies for a company, but it does not give them the strategies.

Page 32: Introduction to Data Mining

Data Mining Functions

• All data mining functions can be thought of as attempting to find a model to fit the data.

• Each function needs criteria for preferring one model over another.

• Each function needs a technique to compare the data.

• Two types of model:
– Predictive models predict unknown values based on known data
– Descriptive models identify patterns in data

Page 33: Introduction to Data Mining

Data mining Functions

Page 34: Introduction to Data Mining

Predictive Model

• A “black box” that makes predictions about the future based on information from the past and present
• A large number of inputs is usually available

Page 35: Introduction to Data Mining

Kinds of Data Mining Problems

Database:
– Find all customers who have purchased milk
– Find all credit applicants with Aditi as first name
– Identify customers who have purchased more than ₹10,000 in the last month

Data Mining:
– Find all items which are frequently purchased with milk (association rules)
– Find all credit applicants who are poor credit risks (classification)
– Identify customers with similar buying habits (clustering)
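The difference between the two kinds of question can be seen on a tiny example. The customers and baskets below are hypothetical; the first question is a plain lookup, while the second summarizes what tends to occur together with milk.

```python
from collections import Counter

# Hypothetical purchase records: customer -> items bought in one basket.
baskets = {
    "Aditi": {"milk", "bread", "butter"},
    "Rahul": {"milk", "bread"},
    "Meera": {"tea", "sugar"},
    "Kiran": {"milk", "tea", "bread"},
}

# Database-style question: the answer is a precise subset of the stored data.
milk_buyers = sorted(c for c, items in baskets.items() if "milk" in items)

# Data-mining-style question: which items tend to be bought together with milk?
co_purchased = Counter(
    item
    for items in baskets.values() if "milk" in items
    for item in items if item != "milk"
)

print(milk_buyers)
print(co_purchased.most_common(1))
```

The second answer is not a row or subset of the database; it is a pattern derived from it, and how frequent "frequently" must be is a choice the analyst has to make.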

Page 36: Introduction to Data Mining

Kinds of Data Mining Problems

• Classification

• Clustering

• Association Rules

Page 37: Introduction to Data Mining

Classification

Classification Model

Page 38: Introduction to Data Mining

Definition of Classification Problem

Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.

Page 39: Introduction to Data Mining

Example: Credit Card

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125 Cr          No
2    No      Married         100 Cr          No
3    No      Single          70 Cr           No
4    Yes     Married         120 Cr          No
5    No      Divorced        95 Cr           Yes
6    No      Married         60 Cr           No
7    Yes     Divorced        220 Cr          No
8    No      Single          85 Cr           Yes
9    No      Married         75 Cr           No
10   No      Single          90 Cr           Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75 Cr           ?
Yes     Married         50 Cr           ?
No      Married         150 Cr          ?
Yes     Divorced        90 Cr           ?
No      Single          40 Cr           ?
No      Married         80 Cr           ?

[Figure: the Training Set is used to learn a classifier (the Model), which is then applied to the Test Set.]
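One possible model for this example is a small set of hand-written decision rules. The sketch below is not learned by an algorithm and is not taken from the slides: the rules, and in particular the income threshold of 80, are assumptions chosen because they happen to fit all ten training rows.

```python
# Training set from the table above: (refund, marital status, taxable income, cheat).
training_set = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Test set from the table above; the class label is unknown.
test_set = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]


def classify(refund, marital_status, income, threshold=80):
    """Hand-written decision rules; the income threshold of 80 is an assumption."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income >= threshold else "No"


# Fraction of training rows for which the rules reproduce the known label.
training_accuracy = sum(
    classify(r, m, i) == label for r, m, i, label in training_set
) / len(training_set)

# The mapping f: D -> C applied to the unlabeled test rows.
predictions = [classify(*row) for row in test_set]
print(training_accuracy, predictions)
```

Fitting the training set perfectly says little about how well the rules generalize; a learning algorithm would also be evaluated on labeled data it had not seen.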

Page 40: Introduction to Data Mining

Another Example ...

• Which group does this object belong to?

Group 1: Delia
Group 2: Roses
Target Object

(Experiment reported in Cognitive Science, 2002)

Page 41: Introduction to Data Mining

Resemblance

• People classify things by finding other items that are similar and have already been classified.

• For example: Is a new species a bird? Does it have the same attributes as lots of other birds? If so, then it's probably a bird too.

• A combination of rote memorization and the notion of 'resembles'.

• Although kiwis can't fly like most other birds, they resemble birds more than they resemble other types of animals.

• So the problem is to find which instances most closely resemble the instance to be classified.
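Classifying by resemblance can be sketched as a nearest-neighbor lookup. The animals, the four binary attributes, and the choice of counting differing attributes as the distance are all assumptions made for this illustration.

```python
# Hypothetical binary attributes: (has_feathers, lays_eggs, can_fly, has_fur).
classified = {
    "sparrow": ((1, 1, 1, 0), "bird"),
    "eagle": ((1, 1, 1, 0), "bird"),
    "dog": ((0, 0, 0, 1), "mammal"),
    "bat": ((0, 0, 1, 1), "mammal"),
}


def distance(a, b):
    """Number of attributes on which two instances differ (Hamming distance)."""
    return sum(x != y for x, y in zip(a, b))


def classify(instance):
    """Return the class of the already-classified instance that resembles it most."""
    nearest = min(classified.values(), key=lambda item: distance(instance, item[0]))
    return nearest[1]


kiwi = (1, 1, 0, 0)  # feathers, lays eggs, cannot fly, no fur
print(classify(kiwi))
```

The kiwi differs from the stored birds on one attribute (flight) but from the mammals on three or four, so it is classified as a bird even though it cannot fly.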

Page 42: Introduction to Data Mining

A Few More Examples

• Loan companies can “give you results in minutes” by classifying you as a good or a bad credit risk, based on your personal information and a large supply of previous, similar customers.

• Cell phone companies can classify customers into those likely to leave, who hence need enticement, and those likely to stay regardless.

• The data generated by airplane engines can be used to determine when they need to be serviced. By discovering the patterns that are indicative of problems, companies can service working engines less often (increasing profit) and discover faults before they materialise (increasing safety).

Page 43: Introduction to Data Mining

Clustering

• Classification is supervised learning -- the supervision comes from labeling the instances with the class.

• Clustering is unsupervised learning -- there are no predefined class labels and no training set.

• So a clustering algorithm needs to assign a cluster to each instance such that objects in the same cluster are more similar to each other than to objects in other clusters.

Page 44: Introduction to Data Mining

Clustering

• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups

• The goal is to find the most 'natural' groupings of the instances.

- Within a cluster: maximize similarity between instances (intra-cluster distances are minimized).

- Between clusters: minimize similarity between instances (inter-cluster distances are maximized).
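The two criteria can be made concrete by measuring distances for a given grouping. In this sketch the 2-D points and their cluster labels are hypothetical, and Euclidean distance is an assumed choice of (dis)similarity measure.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points, each with an assumed cluster label.
points = {
    (1.0, 1.0): "A", (1.5, 2.0): "A", (2.0, 1.5): "A",
    (8.0, 8.0): "B", (8.5, 9.0): "B", (9.0, 8.5): "B",
}

# Split all pairwise distances into within-cluster and between-cluster lists.
intra, inter = [], []
for p, q in combinations(points, 2):
    (intra if points[p] == points[q] else inter).append(dist(p, q))

mean_intra = sum(intra) / len(intra)
mean_inter = sum(inter) / len(inter)
print(mean_intra, mean_inter)
```

A good clustering of these points gives a small mean intra-cluster distance and a large mean inter-cluster distance; swapping a few labels between the groups would push the two numbers toward each other.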

Page 45: Introduction to Data Mining

Clustering

• For example, we might have the following data:

• Where the axes are two dimensions and shape is a third, nominal attribute.

Page 46: Introduction to Data Mining

Clustering

• A clustering algorithm might find three clusters:

• Even though there are some squares and circles mixed together.

Page 47: Introduction to Data Mining

Outliers

[Figure: Cluster 1, Cluster 2, and outliers]

Page 48: Introduction to Data Mining

What is a natural grouping among these objects?

Possible groupings: School Employees, Tatkare’s Family, Males, Females

Clustering is subjective

Page 49: Introduction to Data Mining

What is Similarity?

“The quality or state of being similar; likeness; resemblance; as, a similarity of features.” (Webster's Dictionary)

Similarity is hard to define, but “we know it when we see it”.

The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.

Page 50: Introduction to Data Mining

Clustering Problem

• Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 <= j <= k.

• A cluster, Kj, contains precisely those tuples mapped to it.

• Unlike the classification problem, clusters are not known a priori.
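One way to construct such a mapping f is the k-means procedure, sketched below. The slides do not name a specific algorithm, so k-means is a swapped-in choice here; the data points, the Euclidean distance, and the simplistic initialization (the first k tuples as centers) are likewise assumptions.

```python
from math import dist


def k_means(D, k, iterations=10):
    """Return a mapping f: index i of tuple t_i -> cluster number in 1..k."""
    centers = list(D[:k])  # simplistic initialization: the first k tuples
    f = {}
    for _ in range(iterations):
        # Assign each tuple to the cluster whose center is nearest.
        f = {
            i: min(range(k), key=lambda j: dist(t, centers[j])) + 1
            for i, t in enumerate(D)
        }
        # Move each center to the mean of the tuples mapped to it.
        for j in range(k):
            members = [D[i] for i in f if f[i] == j + 1]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return f


# Hypothetical 2-D tuples.
D = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (8.5, 9.0), (2.0, 1.5), (9.0, 8.5)]
f = k_means(D, k=2)

# Each cluster K_j contains precisely the tuples mapped to j.
clusters = {j: [D[i] for i in f if f[i] == j] for j in (1, 2)}
print(f)
print(clusters)
```

No class labels are supplied anywhere; the only inputs are the tuples and k, which is the sense in which the clusters are not known a priori.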

Page 51: Introduction to Data Mining

Applications

• Marketing: Discover consumer groups based on their purchasing habits

• City Planning: Identify groups of buildings by type, value, location

Page 52: Introduction to Data Mining

Applications

• Image Processing: Identify clusters of similar images (e.g., horses)

• Biological: Discover groups of plants/animals with similar properties

Page 53: Introduction to Data Mining

Applications

• Given:
– A source of textual documents
– A similarity measure, e.g., how many words are common in these documents

• Find:
– Several clusters of documents that are relevant to each other

[Figure: a document source and a similarity measure feed a clustering system, which outputs groups of documents.]
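The similarity measure mentioned above, the number of words two documents have in common, can be sketched in a few lines. The three documents are hypothetical, and splitting on whitespace after lowercasing is an assumed, deliberately crude way to extract words.

```python
def common_words(doc_a, doc_b):
    """Similarity measure: number of distinct words the two documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


# Hypothetical documents.
docs = {
    "d1": "the amazon rainforest is home to many species",
    "d2": "many species live in the rainforest canopy",
    "d3": "amazon ships books and electronics online",
}

# Similarity of every unordered pair of documents.
pairs = {
    (a, b): common_words(docs[a], docs[b])
    for a in docs for b in docs if a < b
}
most_similar = max(pairs, key=pairs.get)
print(pairs)
print(most_similar)
```

A clustering system would use these pairwise scores to group d1 with d2 rather than with d3, even though d1 and d3 both mention "amazon".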

Page 54: Introduction to Data Mining

Association Rules

• A common application is market basket analysis, which identifies:
(1) which items are frequently sold together at a supermarket
(2) how to arrange items on shelves and which items should be promoted together

Page 55: Introduction to Data Mining

Association Rule Discovery

Page 56: Introduction to Data Mining

Association Rule Discovery

• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Page 57: Introduction to Data Mining

Association Rule Discovery

Market basket:

Rule form: “Body → Head [support, confidence]”.

buys(X, 'beer') → buys(X, 'snacks') [1%, 60%]

(a) Confidence: of the customers who purchased 'beer', 60% also purchased 'snacks'
(b) Support: 1% of all transactions contain the items 'beer' and 'snacks' together
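Support and confidence can be computed directly for the two rules discovered in the five-transaction table shown earlier. The function names below are chosen for this sketch; the definitions follow (a) and (b) above.

```python
# Transactions from the five-row TID/Items table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(body, head):
    """Among transactions containing body, the fraction that also contain head."""
    return support(body | head) / support(body)


for body, head in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(
        sorted(body), "-->", sorted(head),
        f"[support={support(body | head):.0%}, confidence={confidence(body, head):.0%}]",
    )
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, so the rule has 60% support and 75% confidence; for {Diaper, Milk} --> {Beer} the counts are two of five and two of three.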

Page 58: Introduction to Data Mining

The Weka is a strong brown bird that is native to New Zealand and grows to about the same size as a chicken. The Weka was once fairly common on the North and South Islands of New Zealand, but over the years its numbers have declined heavily on the North Island due to major damage to its habitat.

Page 59: Introduction to Data Mining

• Three graphical user interfaces
– “The Explorer” (exploratory data analysis)
– “The Experimenter” (experimental environment)
– “The KnowledgeFlow” (new process-model-inspired interface)

WEKA is available at http://www.cs.waikato.ac.nz/ml/weka

Page 60: Introduction to Data Mining

References

• Witten, Ian and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, 2005

• Dunham, Margaret H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003

Page 61: Introduction to Data Mining

References: Yahoo Group

• ‘dbmsnotes’ - http://tech.groups.yahoo.com/group/dbmsnotes/
