William M. Pottenger, Ph.D. Computing the Future of Data Mining. An Introduction to Data Mining. Visit to Messiah College, September 4, 2006. Computer Science & Engineering Department. www.cse.lehigh.edu/~billp
Page 1: An Introduction to Data Mining

William M. Pottenger, Ph.D.

Computing the Future of Data Mining

An Introduction to Data Mining

Visit to Messiah College

September 4, 2006

William M. Pottenger, Ph.D.

Computer Science & Engineering Department

www.cse.lehigh.edu/~billp

Page 2: An Introduction to Data Mining

Knowledge Workers are Overwhelmed

• The users of software tools and computers are domain experts, NOT computer science professionals

– Too much data

– Too much technology

– Not enough useful information

Page 3: An Introduction to Data Mining

Data Mining Roots: A Confluence of Multiple Disciplines

• Database Systems, Data Warehouses, and OLAP

• Machine Learning

• Information Theory & Statistics

• Mathematical Programming

• Visualization

• High Performance Computing

• …

• Algorithms have been known for a while … Google™

Page 4: An Introduction to Data Mining

Data Mining: On What Kind of Data?

• Relational Databases

• Data Warehouses

• Transactional Databases

• Advanced Database Systems

– Object-Relational

– Text

– Heterogeneous: Legacy, Distributed, …

– WWW

• … the Bible!

Page 5: An Introduction to Data Mining

Why Do We Need Data Mining?

• Leverage organization’s data assets

– Only a small portion (typically 5%-10%) of the collected data is ever analyzed

– Data that may never be analyzed continues to be collected, at great expense, out of concern that something that may prove important in the future would otherwise be missed

– Growth rates of data preclude the traditional “manual intensive” approach: automated data fusion techniques based on data mining are needed

Page 6: An Introduction to Data Mining

Why Do We Need Data Mining?

• As databases and problems grow, supporting the decision support process with traditional query languages becomes infeasible

– Many queries of interest are difficult to state in a query language (the query formulation problem)

– “find all cases of fraud”

– “find all individuals likely to buy a FORD Expedition”

– “find all documents that are similar to this customer’s problem”
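
To make the query formulation problem concrete, here is a minimal Python sketch (not from the slides). An exact-match query is easy to state, but "similar to this customer's problem" needs a similarity measure. The sketch assumes bag-of-words cosine similarity, which is only one common choice; the documents, the problem text, and the function names are invented for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, docs):
    """Return names of documents sharing any word with the query, best first."""
    scored = [(cosine_similarity(query, text), name) for name, text in docs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


# Hypothetical support documents, invented for this example.
documents = {
    "doc1": "scanner produces blank pages after driver update",
    "doc2": "invoice total does not match purchase order",
    "doc3": "blank pages from scanner after installing new driver",
}
problem = "scanner prints blank pages after driver install"

ranked = rank_by_similarity(problem, documents)
print(ranked)
```

Neither matching document contains the customer's exact wording, which is why a plain keyword or equality query would be hard to write for this request.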

Page 7: An Introduction to Data Mining

What (exactly) is Data Mining?

• Let’s take a few moments and consider this question. Is it:

– Knowledge Discovery?

– Knowledge Management?

– Information Retrieval?

– On-line Analytic Processing (OLAP)?

– Machine Learning?

– Decision Support?

– Process Modeling/Control?

– …

Page 8: An Introduction to Data Mining

Definitions

• Data mining is the application of computer technology and machine learning algorithms to discover patterns, anomalies, trends, and knowledge from data.

– SGI Mineset Product Description

• Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.

– Data Mining by Witten and Frank

• Data mining, also popularly referred to as knowledge discovery in databases (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories.

– Data Mining: Concepts and Techniques by Han and Kamber

Page 9: An Introduction to Data Mining

What is Text Mining?

• Swanson (’91) posed problem: Migraine headaches (M)

– stress associated with M

– stress leads to loss of magnesium

– calcium channel blockers prevent some M

– magnesium is a natural calcium channel blocker

– spreading cortical depression (SCD) implicated in M

– high levels of magnesium inhibit SCD

– M patients have high platelet aggregability

– magnesium can suppress platelet aggregability

• All extracted from medical journal titles

Slide reused with permission of Marti Hearst @ UCB

Page 10: An Introduction to Data Mining

Gathering Evidence

[Diagram: migraine linked to stress, CCB, PA, and SCD, each of which is linked to magnesium]

Slide reused with permission of Marti Hearst @ UCB

Page 11: An Introduction to Data Mining

Novel Discovery: Magnesium & Migraines!

[Diagram: migraine connected to magnesium through stress, CCB, PA, and SCD]

Slide reused with permission of Marti Hearst @ UCB

No single author knew/wrote about this connection… this distinguishes Text Mining from Information Retrieval.
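
The chain of reasoning on the preceding text mining slides can be sketched in Python. This is an illustrative simplification of Swanson-style linking, not his actual procedure: the term pairs are hand-copied from the slide rather than extracted from journal titles, and the function names are hypothetical. The sketch looks for intermediate terms that are linked both to migraine and to magnesium even though no pair links the two directly.

```python
from collections import defaultdict

# Term pairs hand-copied from the slide; each pair stands for a
# connection reported somewhere in the literature.
links = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]


def build_graph(pairs):
    """Build an undirected graph: term -> set of directly linked terms."""
    graph = defaultdict(set)
    for left, right in pairs:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def bridging_terms(graph, start, end):
    """Return the terms linked to both start and end, in sorted order."""
    return sorted(graph.get(start, set()) & graph.get(end, set()))


graph = build_graph(links)
directly_linked = "magnesium" in graph.get("migraine", set())
bridges = bridging_terms(graph, "migraine", "magnesium")

print("direct link reported:", directly_linked)
print("bridging terms:", bridges)
```

A retrieval system would return the individual sources for each pair; the candidate connection only appears once the pairs are combined.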

Page 12: An Introduction to Data Mining

Why Use Data Mining?

• Data mining will become much more important, and companies will throw away nothing about their customers because it will be so valuable. If you’re not doing this, you’re out of business.

– Arno Penzias, Chief Scientist @ Bell Labs

• We are deluged by data – scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has become a precious resource.

– Jim Gray, Microsoft Research in preface to Data Mining by Han and Kamber

• Necessity is the mother of invention

– Unknown

Page 13: An Introduction to Data Mining

How is Data Mining Used?

• Direct Marketing

• Customer Acquisition

• Customer Retention

• Cross-selling

• Trend Analysis

• Fraud Detection

• Forecasting in Financial Markets

• Process Modeling

• Process Control

• …

Page 14: An Introduction to Data Mining

But What is Data Mining (Really)?

Data Mining: A Process

Copyright © 1997 Stiftelsen Østfoldforskning: Used with permission

Page 15: An Introduction to Data Mining

An Example of Data Mining in Process Modeling and Control at HP

• Quality Assurance troubleshooting

– KnowledgeSeeker Decision Tree Data Mining Tool identified critical factors impacting production of HP IIc Color Scanner

• Process control

– KnowledgeSeeker Decision Tree Data Mining Tool derived rules necessary to identify situations where process was about to go out of control.

Page 16: An Introduction to Data Mining

How Do Decision Trees Work?

Decision trees predict results but also tell about structure.
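
As a rough illustration of that point, the Python sketch below picks the single best split by Gini impurity, a basic step in many decision tree learners. It is not the KnowledgeSeeker algorithm, which the slides do not describe; Gini impurity is an assumed criterion, and the production records, attribute names, and function names are all invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def group_by(rows, attribute, target):
    """Group the target labels by the value of one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return groups


def split_impurity(rows, attribute, target):
    """Weighted Gini impurity after splitting the rows on one attribute."""
    groups = group_by(rows, attribute, target)
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


def best_split(rows, attributes, target):
    """Attribute whose split leaves the purest groups (the tree's root)."""
    return min(attributes, key=lambda a: split_impurity(rows, a, target))


def majority_rules(rows, attribute, target):
    """Map each value of the attribute to its most common outcome."""
    groups = group_by(rows, attribute, target)
    return {value: Counter(g).most_common(1)[0][0] for value, g in groups.items()}


# Invented production records, for illustration only.
rows = [
    {"temperature": "high", "shift": "day", "outcome": "defect"},
    {"temperature": "high", "shift": "night", "outcome": "defect"},
    {"temperature": "high", "shift": "day", "outcome": "defect"},
    {"temperature": "normal", "shift": "day", "outcome": "ok"},
    {"temperature": "normal", "shift": "night", "outcome": "ok"},
    {"temperature": "normal", "shift": "night", "outcome": "ok"},
]

root = best_split(rows, ["temperature", "shift"], "outcome")
rules = majority_rules(rows, root, "outcome")
print("split on:", root)
print("rules:", rules)
```

The chosen attribute is the structure part: it shows which factor separates the outcomes best. The rules derived from it are the prediction part.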

Page 17: An Introduction to Data Mining

Be right back …

A Demonstration of Data Mining

Featuring

KnowledgeSEEKER by Angoss Knowledge Engineering

Page 18: An Introduction to Data Mining

Examples of Commercial Data Mining Systems

• IBM’s DB2 Intelligent Miner

– www.ibm.com/software/data/iminer

• SAS Institute’s Enterprise Miner

– www.sas.com/products/miner

• SPSS’s Clementine

– www.spss.com/clementine

• Angoss’ KnowledgeSeeker

– http://www.angoss.com/products/seeker.php

• Plus many more …

Page 19: An Introduction to Data Mining

Asymptopia

We are always given finite amounts of data … and rarely do we reach asymptopia. Asymptopia is the mythical land, the data miners 'utopia', where the amount of data is infinite and all algorithms converge and all users are satisfied ... Naturally, asymptopia can be reached only in the limit.

Ron Kohavi Nuggets 96:21 (www.kdnuggets.com)