Top Banner
Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology
57

Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

Dec 27, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

Knowledge Discovery in Databases

Chung-Chian Hsu

National Yunlin University of Science and Technology

Page 2: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

From Data Mining to Knowledge Discovery: An Overview

U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth

Advances in Knowledge Discovery and Data Mining, pp.1-36, 1996

Page 3: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

3

Knowledge Discovery in Databases

Non-trivial process of identify valid, novel, potentially useful, and ultimately understandable patterns in data

Page 4: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

4

Related Research Fields

• Machine Learning

• Pattern Recognition

• Databases

• Statistics

• Artificial Intelligence

• Knowledge Acquisition for Expert Systems

• Data Visualization

Page 5: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

5

DataTarget Data

Preprocessed Data

Transformed Data

Patterns

Knowledge

SelectionPreprocessing

Transformation

Data Mining

Interpretation/Evaluation

Page 6: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

6

The KDD Process

• Interactive and iterative, involving numerous steps with many decisions being made by the user– Selection– Preprocessing: noise, missing data, error– Transformation: reduction, projection– Data mining– Interpretation and evaluation

Page 7: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

7

An Overview of Data Mining Methods

• Data Mining– a step in KDD process consisting of particular data

mining algorithms that, under some acceptable computational efficiency limitations, produces a particular enumeration of patterns

– involves fitting models to, or determining patterns from observed data

Note: a pattern can be thought of as instantiation of a modele.g. f(x)= 3x2 + x is a pattern whereas f(x) = x2+ x

is considered a model

Page 8: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

8

Primary Goals of Data Mining

• Prediction– Using some variables or fields in database to predict

unknown or future values of other variables of interest

e.g. education, job, sex and salary

• Description– Finding human-interpretable patterns describing the

data

e.g. mean, variance, mode, distribution

Page 9: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

9

Primary Tasks of Data Mining -1

• Classification:– Learning a function that classify a data item into one of

several predefined classes

• Regression– Learning a function that maps a data item to a real-valued

predication variable

• Clustering– Identify a finite set of categories or clusters to describe the

data

• Summarization– Finding a compact description for a subset of data

Page 10: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

10

Primary Tasks of Data Mining -2

• Dependency Modeling– Finding a model which describes significant

dependencies between variables

• Change and Deviation Detection– Discovering the most significant changes in the data

from previously measured or normative values

Page 11: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

11

Components of Data Mining Algorithms -1

• Model Representation– Describing discoverable patterns

• Search Method– Parameter search:

• Search for parameters optimizing model evaluation criteria given observed data and a fixed model representation

– Model search: • Model representation is changed so that a family of models are

considered.

• For each specific model representation, parameter search method is instantiated to evaluate the quality of that particular model

Page 12: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

12

Components of Data Mining Algorithms -2

• Model Evaluationestimates how well a particular pattern meet the

criteria of the KDD process

Page 13: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

13

Popular Data Mining Methods

• Decision trees and rules // propositional logic power

• Nonlinear regression methods; e.g. Neural network

• Example-based methods: e.g. nearest-neighbor classification, regression algorithms, cas

e-based reasoning

• Probabilistic graphical dependency modelse.g. baysian network

• Genetic algorithms• Inductive logic programming // first-order logic power

Page 14: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

14

Day Outlook Temperature Humidity Wind PlayTennisD1 Sunny Hot High Weak NoD2 Sunny Hot High Strong NoD3 Overcast Hot High Weak YesD4 Rain Mild High Weak YesD5 Rain Cool Normal Weak YesD6 Rain Cool Normal Strong NoD7 Overcast Cool Normal Strong YesD8 Sunny Mild High Weak NoD9 Sunny Cool Normal Weak YesD10 Rain Mild Normal Weak YesD11 Sunny Mild Normal Strong YesD12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak YesD14 Rain Mild High Strong No

Page 15: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

15

Outlook

Sunny RainOvercast

High Normal

No Yes

Strong Weak

No Yes

YesHumidity Wind

Page 16: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

16

Decision Tree Learning (cont’d)

• Inductive rules from the previous examples

1. IF Outlook= Sunny AND Humidity= Normal,

THEN PlayTennis

2. IF Outlook= Overcast,

THEN PlayTennis

3. IF Outlook= Rain AND Wind= Weak,

THEN PlayTennis

Page 17: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

17

Person Age Sex Income CustomerAnn 32 F 10 000 yesJoan 53 F 1000 000 yesMary 27 F 20 000 noJane 55 F 20 000 yesBob 30 M 100 000 yesJack 50 M 200 000 yes

Husband WifeBob AnnJack Jane

Page 18: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

18

Inductive logic programming

IF Married(Person, Spouse) AND Income(Person) >= 100 000

THEN Potential-Customer(Spouse)

Page 19: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

19

Challenges for KDD -1

• Larger Databases– can’t fit in main memory at one time

– solutions: sampling, approximation methods, parallel processing, …

• High Dimensionality– increase size of search space for model induction in a combin

atorially explosive manner

– increase chances that learner will find spurious patterns that are not valid in general

– solutions: use prior knowledge to identify irrelevant variables, ...

Page 20: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

20

Challenges for KDD -2

• Overfitting– good performance on training data, but poor performance on

real data– solutions: cross-validation, regularization, other statistical str

ategies

• Changing Data and Knowledge– changing may make previously discovered patterns invalid– solutions: incremental methods for updating the patterns, …

• Missing and Noisy Data– solutions: statistical strategies to identify hidden variables an

d dependencies

Page 21: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

21

Challenges for KDD -3

• Complex Relationships between Fields– most of algorithms developed for simple attribute-value

records– require algorithms to deal with hierarchically structured

attributes or values, relations between attributes, ...

• Understanding of Patterns– make the discoveries more understandable by humans– solutions: graphical representations, natural language

generations, information visualization, ...

• User Interaction and Prior Knowledge– encoding domain knowledge into learning systems

• Integration with Other Systems– integration with spreadsheet, DBMS, visualization tools..

Page 22: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

Mining Business Databases

R. Brachman, T. Khabaza, W. Kloesgen, G. Piatetsky-Shapiro, and E. Simoudis

Communication of ACM, V.39, No.11, 1996, pp.42-48

Page 23: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

23

Goals of KDD

• Verification• Discovery

– predication• regression• classification

– description• clustering• summarization• visualization• change and deviation detection

Page 24: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

24

Tools Supporting KDD -1

• Generic, single-task– classification

• decision-tree• neural-network• example-based• rule-discovery

• Generic, multitask– Clementine– IMACS– MLC++– MOBAL– Recon

Page 25: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

25

Tools Supporting KDD -2

• Domain-specific– Opportunity Explorer– IBM Advanced Scout– Interactive Data Exploration and Analysis

Page 26: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

26

Representative Applications

• Marketing

• Financial Investment

• Fraud Detection

• Manufacturing and Production

• Network Management

Page 27: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

27

Marketing -1

• Predicting the size of TV audiences– using neural networks and rule induction– examine factors relating audience size

• Analyzing supermarket sales data– Coverstory and Spotlight: producing reports, using natural la

nguage and graphics, on the most significant changes in a particular product volume and share broken down by region, product type, etc.

– Opportunity Explorer– Management Discovery Tool

• summarization, trend analysis, change analysis, and measure and segment comparison

Page 28: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

28

Marketing -2

• Market basket analysis– association rules

• associations between different products bought by the customer

– Lucent Technology’s NicheWorks• clustered purchases to be visualized intuitively

Page 29: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

29

Financial Investment Management

• Fidelity Stock Selector– uses neural network to selection investments

• LBS Capital Management– uses expert systems, neural nets, and GAs to manag

e portfolios worth $600 million

• Carlberg & Associates– uses a neural network model for predicting Standard

& Poor’s 500 Index– surprisingly successful

Page 30: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

30

Fraud Detection

• FALCON– using neural network shell– detect suspicious credit card transactions

• FAIS– detect money-laundering activity from financial

transactions

• Telecommunication– AT&T’s system detecting international calling fraud– GTE and NYNEX: detecting cellular cloning fraud

Page 31: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

31

Manufacturing and Production

• Prospective KDD application– controlling and scheduling technical production processes

• Main advantage– high cost savings

• Key challenges– representation and exploitation of time and location as well

as model levels, such as quality, process, and control

• KDD in an Europe chemical company– assist in production process for polymeric plastics

• CASSIOPEE– diagnose and predict problems in Boeing 737

Page 32: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

32

Network Management

• Used for– filtering redundant alarms

– locating problems in the network

– predicting severe faults

• Telecommunication Alarm Sequence Analyzer (TASA)– by University of Helsinki

– locating frequently occurring alarm episodes

– presenting them as rules

– integrating into alarm-handling software

Page 33: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

33

Other Areas -1

• Health Care: KEFIR• determine most interesting deviations• explain key deviations• generate recommendations

• Data Quality– verify financial trading data– detect error

• NBA Basketball Games: IBM Advanced Scout– helps coaches to discover valuable patterns for

improvements in their strategy

Page 34: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

34

Other Areas -2

• Discovery Agentsthese systems ask the user to specify a profile of interest and se

arch for related information among a wide variety of public domain and proprietary sources

– Firefly asks user their opinion of several music pieces and then suggests other music that user may like <http://www.ffly.com/>.

– NewsHound automatically search information from a wide variety of sources, including newspapers and wire services, and email relevant documents directly to user <http://www.sjmercury.com/hound/>

Page 35: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

35

Application Development

• Practical application affected by– insufficient training– inadequate tool support– data unavailability– overabundance of patterns– changing and time-oriented data– spatially oriented data– complex data types– scalability

Page 36: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

36

Potential for KDD Applications

• Suitable Domains– information-rich– have a changing environment– not already have existing models– require knowledge-based decisions– high payoff for the right decisions

Page 37: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

37

References

1. Fayyad, U., G. Piatetsky-Shapiro and P. Smyth, “From Data Mining to knowledge Discovery: an Overview”, Advances in Knowledge Discovery and Data Mining, pp. 1-36, 1996.

2. Brachman, R., T. Khabaza, W. Kloesgen, G. Piatetsky-Shapiro and E. Simoudis, “Mining business Databases”, Communication of ACM, Vol. 39, No. 11, 1996.

Page 38: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

Data Mining and Knowledge Discovery: Making Sense Out of Data

U. Fayyad

IEEE Expert, Oct. 1996, pp.20-25

Page 39: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

39

The KDD Process

• Selection and sampling

• Preprocessing and cleaning

• Transformation and reduction

• Data mining

• Visualization and evaluation

Page 40: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

40

Major Related Fields -1

• Statistics

• Machine Learning

• Artificial Intelligence and Reasoning With Uncertainty

• Databases

• Knowledge Acquisition

• Pattern Recognition

Page 41: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

41

Major Related Fields -2

• Information Retrieval

• Visualization

• Intelligent Agents for Distributed and Multimedia Environments

• Digital Library

• Management Information Systems

Page 42: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

42

Statistics

• Plays an important role in– data selection and sampling– data mining– evaluation of extracted knowledge

• Statistics focus on – evaluation of model fit to data– hypothesis testing

Page 43: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

43

Statistics• Dealt primarily with small data sets and addressing

small sample problems• Focused mostly on theoretical aspects of techniques

and models. Search received little emphasis.• Require

– an a priori model – significant domain knowledge of the data as well as of the

underlying mathematics for proper use and interpretation.

• Issues have only recently begun to receive attention– interfaces to databases – dealing with massive data sets– techniques for efficient data management

Page 44: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

44

Pattern Recognition

• The major applicable techniques – classification learning– clustering

• Contributions distinguished from statistics by emphasis on– computational algorithms, – more sophisticated data structures– more search

Page 45: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

45

Machine Learning

• Contribute to the data-mining step– representation– selection of variables

• Also focus on– discovering structure in data– empirical laws to describe observations

Page 46: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

46

Database and Data Warehousing

• On-line analytical processing– providing new ways of manipulating and analyzing data usin

g multidimensional methods

• Data warehousing concerned with– schemes and methods of integrating legacy DBs, on-line tran

saction DBs, and nonhomogeneous RDBMSs

• Data warehousing involves– storage– data selection– data cleaning– infrastructure for updating the warehouse

Page 47: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

Reality Check for Data Mining

E. Simoudis

IEEE Expert, Oct. 1996, pp. 26-33

Page 48: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

48

Two Forms of Data Mining

• Verification-driven data mining– query and reporting– multidimensional analysis– statistical analysis

• Discovery-driven data mining– predictive modeling– database segmentation– link analysis– deviation detection

Page 49: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

Data Mining: Machine Learning, Statistics, and Databases

H. Mannila

IEEE, 1996

Page 50: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

50

The KDD Process

• Understanding the Domain

• Preparing the Data Set– could take up to 80% of the time

• Discovering Patterns (Data Mining)– efficient discovery of fairly simple patterns

• Postprocessing of Discovered Patterns

• Putting the Results Into Use

Page 51: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

51

Data Mining, Machine Learning and Statistics -1

• Aim: locating interesting regularities, patterns or concepts form

empirical data

• Machine Learning and KDD– M.L.

• focus on learning or induction step• assume something to learn• difficult for humans to learn

– KDD• emphasis the whole process• not assume any sensible structure behind the data• analyst can do if he had the time• huge amount of data, number of attributes

Page 52: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

52

Data Mining, Machine Learning and Statistics -2

• Statistics and KDD– Statistics

• gradually moved to model selection

• hypothesis testing

– KDD• extensive use of M.L. methods

• volume of data

• role of computational complexity issues

Page 53: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

53

Database and Data Mining

• Association rule, rXB, where XR and BR\X

• Applications– sales data: (transactions, products)– student database: (students, courses)– publications: (publications, publications)– behavior: (measurements, conditions)

• frequency and confidence of r• Discovery of association rule

– the task of finding all rules XB such that the frequency and the confidence exceed thresholds.

Page 54: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

54

Finding Episodes from Sequences

• A sequence of events (e, t) /* e: event, t: time• Applications

– telecommunication networks– process monitoring– quality control– user interface studies– epidemiology

• Time series data analysis in Statistics suited only for small numbers of event types

• An example: a telecommunication network alarm DB– ~ 200 events– 1000-10000 alarms per day

Page 55: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

55

Finding Episodes from Sequences• An episode

– a partially ordered set of events– e.g. events A and B occurs before C

• A slice rt of r of width W• An episode occurs in rt

• Frequent episode– occurs in sufficiently many slices

• A generic data mining task

• A generic data mining algorithmg}interestin is and in offenly sufficient occurs |{),( pdpPpPdPI

Page 56: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

56

Condensed Representations

• A Typical Phenomenon– evaluate other patterns closely resemble p– a solution: caching of previous results

• Condensed Representation– a data structure making queries more efficient

e.g. number of pattern occurrences, data cube

Page 57: Knowledge Discovery in Databases Chung-Chian Hsu National Yunlin University of Science and Technology.

57

Some Research Directions

• Pattern Query Specification Languages and Optimization Techniques

• Condensed Representations of Patterns• Caching Strategies• Combinations of D.M. And Statistics

Techniques• Using Background Knowledge• Tools for Selecting, Grouping, and Visualizing

Discovered Knowledge