Intelligent Data Analysis and Probability Inference Data Mining : Intelligent Data Analysis for Knowledge Discovery Prof. Yike Guo Dept. of Computing Imperial.

Intelligent Data Analysis and Probability Inference

Data Mining : Intelligent Data Analysis for Knowledge Discovery

Prof. Yike Guo

Dept. of Computing, Imperial College


Course Overview

• Goal
– Basic Concepts of Data Mining
– Data Mining Techniques
– Data Mining Applications
– Future Research Trends on Data Mining

• Reference Books
– Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber
– Advances in Knowledge Discovery and Data Mining, U. M. Fayyad and G. Piatetsky-Shapiro, AAAI/MIT Press, 1996
– Predictive Data Mining: A Practical Guide, Sholom M. Weiss and Nitin Indurkhya, Morgan Kaufmann Publishers, Inc., 1997
– Intelligent Data Analysis, Springer, 1999
– Post-genome Informatics, Minoru Kanehisa, Oxford University Press, 2000


What does the data say?

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No


Turning Data into Knowledge

[Figure: growth of data between 1965 and 2000, plotted on a logarithmic scale from 0.001 to 100 000. Horizontal axis: Year. Vertical axis: Amount (x 1000). Series: transistors / chip, DNA sequences, mapped human genes, 3-D structures, MEDLINE records. Annotations: MEDLINE, G5, MeSH.]

What does the data say?



What Is Data Mining?

• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

• Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• What is not data mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs


• Data: a set of facts F (records in a database)

• Pattern: an expression E in a language L describing the data in a subset F_E of F, where E is simpler than the enumeration of all the facts in F_E. F_E is also called a class, and E is also called a model or knowledge.

• Data mining process: data mining is a multi-step process involving multiple choices, iteration and evaluation. It is non-trivial since there is no closed-form solution, and it always involves intensive search.

• Validity: E is true (with high probability) for F

• Useful: patterns are not trivial inductive properties of the data

• Understandable: patterns should be understandable by data owners, to aid in understanding the data/domain


Why Data Mining

• Limitation of traditional database querying:
– Most queries of interest to data owners are difficult to state in a query language
  • “find me all records indicating fraud” => “tell me the characteristics of fraud” (summarisation)
  • “find me who is likely to buy product X” (classification problem)
  • “find all records that are similar to records in table X” (clustering problem)
– Supporting analysis and decision making with traditional (SQL) queries becomes infeasible (the query formulation problem).


Relational Database Revisited

• Terabyte databases, consisting of billions of records, are becoming common
• The relational data model is the de facto standard
• A relational database: a set of relations
• A relation: a set of homogeneous tuples
• Relations are created, updated and queried using SQL
• Query = keyword-based search

    SELECT telephone_number
    FROM telephone_book
    WHERE last_name = 'Smith'


SQL : Relational Querying Language

• Provides a well-defined set of operations: scan, join, insert, delete, sort, aggregate, union, difference

• Scan -- applies a predicate P to relation R
    For each tuple tr from R
        if P(tr) is true, tr is inserted in the output stream

• Join -- composes two relations R and S
    For each tuple tr from R
        For each tuple ts from S
            if the join attribute of tr equals the join attribute of ts
                form an output tuple by concatenating tr and ts
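The scan and join loops above can be written out as a minimal Python sketch. Relations are modelled as lists of dictionaries, which is an assumption made for illustration; the function names, the `muid` attribute and the sample rows are hypothetical, and a real DBMS would use far more efficient join algorithms than this nested loop.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


def scan(relation: List[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Yield every tuple of the relation for which the predicate holds."""
    for tr in relation:
        if predicate(tr):
            yield tr


def nested_loop_join(r: List[Row], s: List[Row], attr: str) -> Iterator[Row]:
    """Pair every tuple of r with every tuple of s that agrees on attr."""
    for tr in r:
        for ts in s:
            if tr[attr] == ts[attr]:
                # "Concatenate" the two tuples; the shared attribute appears once.
                yield {**tr, **ts}


# Hypothetical sample relations, used only to exercise the two operations.
papers = [
    {"muid": 1, "journal": "J. Example", "year": 1999},
    {"muid": 2, "journal": "J. Sample", "year": 2000},
]
authors = [
    {"muid": 1, "author": "Author 1-1"},
    {"muid": 1, "author": "Author 1-2"},
    {"muid": 2, "author": "Author 2-1"},
]

recent = list(scan(papers, lambda t: t["year"] >= 2000))
joined = list(nested_loop_join(papers, authors, "muid"))
```

The join visits every pair of tuples, so its cost grows with the product of the two relation sizes.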


Relational database. A table (relation) is a set, and the three basic table operations shown here are extensions of the standard set operations.

[Figure: a papers table (rows Paper 1, Paper 2, Paper 3, Paper 4, ...; columns MUID, Journal, Volume, Pages, Year) and an authors table (rows Author 1-1, Author 1-2, Author 2-1, Author 2-2, Author 2-3, Author 3-1, ...; columns MUID, Author), illustrating SELECT, PROJECT and JOIN. The joined table has columns MUID, Journal, Volume, Pages, Year, Author.]


The Query Formulation Problem

• It is not solvable via query optimisation
• It has not received much attention in the database field or in traditional statistical approaches
• These problems are inductive in nature: learning from data rather than searching the data
• The natural solution is a train-by-example approach that constructs inductive models as the answers

Consider the query :

What kinds of weather condition are suitable for playing tennis ?
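One way to approach such a query inductively is to ask which attribute of the weather table best separates "Yes" from "No" days. The sketch below uses information gain, the criterion used by ID3-style decision tree learners; that choice is an assumption made here for illustration, since the slides do not name a method. The rows are the 14 days from the weather table earlier in these slides, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Wind")

# The 14 days of the weather table; the last field is Play Tennis.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Reduction in label entropy obtained by splitting on one column."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)  # attribute that best separates Yes from No
overcast_labels = {row[-1] for row in ROWS if row[0] == "Overcast"}
```

On this table Outlook has the largest gain, and every Overcast day is a "Yes" day, which is the kind of simple, understandable pattern an inductive model can return as the answer.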


Why Data Mining Now

• Data Explosion
– Business data: organisations such as supermarket chains, credit card companies, investment banks, government agencies, etc. routinely generate daily volumes of 100MB of data
– Scientific data: scientific and remote sensing instruments collect data at rates of gigabytes per day, far beyond human analysis abilities.

• Data Wasting
– Only a small portion (5% - 10%) of the collected data is ever analysed
– Data that may never be analysed continues to be collected, at great expense.

• We are drowning in data, but starving for knowledge!


Steps of a KDD Process

• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation:
– find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining:
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation:
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge


Data Mining and Decision Support

• Data Warehousing: create/select target database
• Sampling: choose data for building models
• Data Cleaning: supply missing values, eliminate noisy data
• Data Reduction and Projection: derive useful features, dimensionality reduction
• Data Mining: choose data mining tasks, choose data mining methods to extract patterns/knowledge
• Model Test and Evaluation: test the accuracy of the model, consistency check, model refinement

Decision Support

Machine Learning Technologies


Data Warehousing

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” --- W. H. Inmon

• A data warehouse is
– A decision support database that is maintained separately from the organization’s operational databases.
– It integrates data from multiple heterogeneous sources to support the continuing need for structured and/or ad-hoc queries, analytical reporting, and decision support.


Modeling Data Warehouses

• Modeling data warehouses: dimensions & measurements

– Star schema: A single object (fact table) in the middle connected radially to a number of objects (dimension tables).

– Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.

– Fact constellations: Multiple fact tables share dimension tables.

• Storage of selected summary tables:

– Independent summary table storing pre-aggregated data, e.g., total sales by product by year.

– Encoding aggregated tuples in the same fact table and the same dimension tables.


Example of Star Schema

[Figure: a central Sales Fact Table surrounded by four dimension tables.]

• Sales Fact Table
– Keys: Time_Key, Product_Key, Store_Key, Location_Key
– Measures: unit_sales, dollar_sales, Yen_sales
• Time Dimension Table: many time attributes
• Product Dimension Table: many product attributes
• Store Dimension Table: many store attributes
• Location Dimension Table: many location attributes
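A star schema is queried by following the keys of each fact row out to its dimension tables and then aggregating a measure. The sketch below builds a pre-aggregated summary table of dollar sales by product category and year. The key and measure names follow the slide; the attribute names `category` and `year`, the sample rows and the function name are hypothetical.

```python
from collections import defaultdict

# Dimension tables, keyed by their surrogate keys (sample data is made up).
time_dim = {
    1: {"year": 1999, "quarter": "Q4"},
    2: {"year": 2000, "quarter": "Q1"},
}
product_dim = {
    10: {"name": "PC", "category": "Hardware"},
    11: {"name": "Editor", "category": "Software"},
}

# Fact table: one row per sale, holding dimension keys plus measures.
sales_fact = [
    {"Time_Key": 1, "Product_Key": 10, "unit_sales": 3, "dollar_sales": 3000.0},
    {"Time_Key": 1, "Product_Key": 11, "unit_sales": 5, "dollar_sales": 250.0},
    {"Time_Key": 2, "Product_Key": 10, "unit_sales": 2, "dollar_sales": 2000.0},
    {"Time_Key": 2, "Product_Key": 10, "unit_sales": 1, "dollar_sales": 1000.0},
]


def sales_by_category_and_year(facts, products, times):
    """Join each fact to its dimensions and total dollar_sales per group."""
    totals = defaultdict(float)
    for fact in facts:
        category = products[fact["Product_Key"]]["category"]
        year = times[fact["Time_Key"]]["year"]
        totals[(category, year)] += fact["dollar_sales"]
    return dict(totals)


summary = sales_by_category_and_year(sales_fact, product_dim, time_dim)
```

Storing `summary` as its own table corresponds to the independent summary table option; later queries at that level of detail can read it instead of rescanning the fact table.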


OLAP: On-Line Analytical Processing

• A multidimensional, LOGICAL view of the data.

• Interactive analysis of the data: drill, pivot, slice and dice, filter.

• Summarization and aggregations at every dimension intersection.

• Retrieval and display of data in 2-D or 3-D crosstabs, charts, and graphs, with easy pivoting of the axes.

• Analytical modeling: deriving ratios, variance, etc. and involving measurements or numerical data across many dimensions.

• Forecasting, trend analysis, and statistical analysis.

• Requirement: Quick response to OLAP queries.


OLAP Architecture

• Logical architecture:
– OLAP view: multidimensional and logical presentation of the data in the data warehouse/mart to the business user.
– Data store technology: the technology options for how and where the data is stored.

• Three service components:
– data store services,
– OLAP services, and
– user presentation services.

• Two data store architectures:
– Multidimensional data store: Multidimensional OLAP (MOLAP).
– Relational data store: Relational OLAP (ROLAP).


Multidimensional Data

• Sales volume as a function of product, month, and region

[Figure: data cube with axes Product, Region and Month.]

Dimensions: Product, Location, Time

Hierarchical summarization paths:
– Product: Industry > Category > Product
– Location: Region > Country > City > Office
– Time: Year > Quarter > Month or Week > Day


Construction of Data Cubes

[Figure: data cube with dimensions Amount (0-20K, 20-40K, 40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, sum) and Discipline (Comp_Method, ..., Database, sum). One highlighted cell is labelled "All Amount, Comp_Method, B.C.".]

• Each dimension contains a hierarchy of values for one attribute
• A cube cell stores aggregate values, e.g., count, sum, max, etc.
• A “sum” cell stores dimension summation values.
• Sparse-cube technology and MOLAP/ROLAP integration.
• “Chunk”-based multi-way aggregation and single-pass computation.
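The role of the "sum" cells can be illustrated with a naive cube builder: every row contributes its measure to each cell obtained by keeping a dimension value or replacing it with "sum". This is only a sketch of what a cube contains, not the chunk-based multi-way aggregation mentioned above; the row values and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import product

ALL = "sum"  # marker for a dimension that has been summed out


def build_cube(rows, dimensions, measure):
    """Aggregate the measure into every cell, including all 'sum' cells."""
    cube = defaultdict(float)
    for row in rows:
        # For each dimension, the row feeds its own value and the 'sum' slot.
        choices = [(row[d], ALL) for d in dimensions]
        for cell in product(*choices):
            cube[cell] += row[measure]
    return dict(cube)


# Made-up rows using dimension values that appear on the slide.
grants = [
    {"discipline": "Comp_Method", "province": "B.C.", "amount": 20.0},
    {"discipline": "Comp_Method", "province": "Ontario", "amount": 35.0},
    {"discipline": "Database", "province": "B.C.", "amount": 50.0},
]

cube = build_cube(grants, ("discipline", "province"), "amount")
```

Each row updates 2^d cells for d dimensions, which is why practical systems rely on sparse storage and smarter aggregation orders.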


A Star-Net Query Model

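[Figure: star-net with one radial line per dimension; the points on each line are its levels of abstraction.]

• Shipping Method: AIR-EXPRESS, TRUCK
• Customer Orders: ORDER, CONTRACTS
• Customer
• Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
• Organization: SALES PERSON, DISTRICT, DIVISION
• Promotion
• Geography: DISTRICT, REGION, COUNTRY
• Time: DAILY, QTRLY, ANNUALLY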


Decision Support with Data Warehouse

• Ad hoc queries: Q: How many customers do we have in London? A: 32776

• Report and spreadsheet

• OLAP: Q: What are the sales figures for Y in the different regions?

• Statistics: Q: Is there a relation between age and buying behaviour? A: Older clients buy more

• Data mining: Q: What factors influence buying behaviour?
– A1: Young men in sports cars buy 3 times as much audio equipment (clustering/regression)
– A2: Older women with dark hair more often buy rinse (classification)
– A3: Buyers of cars are also the buyers of houses (association)

[Figure: decision tree testing Wage, Age (Old / Middle / Young) and Hair color, with branch labels B, W, L, H and leaves Y / N.]


Data Mining Functionalities (1)

• Concept description: characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
– contains(T, “computer”) => contains(T, “software”) [1%, 75%]
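The bracketed figures attached to an association rule are its support (the fraction of transactions containing every item in the rule) and its confidence (among transactions containing the antecedent, the fraction that also contain the consequent). A minimal sketch of both measures follows; the baskets and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Made-up market baskets for the rule: computer => software.
baskets = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]

rule_support = support(baskets, {"computer", "software"})    # 2 of 5 baskets
rule_conf = confidence(baskets, {"computer"}, {"software"})  # 2 of 3 computer baskets
```

A rule is usually reported only when both values exceed user-chosen minimum thresholds.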


Data Mining Functionalities (2)

• Classification and Prediction
– Finding models (functions) that describe and distinguish classes or concepts for future prediction
– E.g., classify countries based on climate, or classify cars based on gas mileage
– Presentation: decision-tree, classification rule, neural network
– Prediction: predict some unknown or missing numerical values

• Cluster analysis
– Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
– Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity


Data Mining Functionalities (3)

• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis

• Other pattern-directed or statistical analyses
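As a small illustration of outlier analysis, the sketch below flags values that lie far from the mean, measured in sample standard deviations (a z-score test). The slides do not prescribe a method, so the z-score, the threshold of 2.0, the sample call durations and the function name are all assumptions made for illustration.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations from the mean. Needs at least two values with some spread."""
    centre = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - centre) / spread > threshold]


# Made-up call durations in minutes; one call is far longer than the rest.
durations = [3, 4, 5, 4, 3, 5, 4, 4, 3, 120]
outliers = zscore_outliers(durations)
```

An extreme value inflates both the mean and the standard deviation, so on small samples a robust statistic such as the median absolute deviation is often a safer choice.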


Example Data Mining Applications

• Commercial:
– Fraud detection: identify fraudulent transactions
– Loan approval: establish the creditworthiness of a customer requesting a loan
– Investment analysis: predict a portfolio's return on investment
– Marketing and sales data analysis: identify potential customers; establish the effectiveness of a sales campaign

• Medical:
– Drug effect analysis: learn drug effects from patient records
– Disease causality analysis

• Political policy:
– Election policy: people’s voting patterns
– Social policy: tax/benefit policy

• Manufacturing:
– Manufacturing process analysis: identify the causes of manufacturing problems
– Experiment result analysis: summarise experiment results and create predictive models


Market Analysis and Management (1)

• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

• Determine customer purchasing patterns over time
– Conversion of a single to a joint bank account: marriage, etc.

• Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information


Market Analysis and Management (2)

• Customer profiling

– data mining can tell you what types of customers buy what products (clustering or classification)

• Identifying customer requirements

– identifying the best products for different customers

– use prediction to find what factors will attract new customers

• Provides summary information

– various multidimensional summary reports

– statistical summary information (data central tendency and variation)


Fraud Detection and Management (1)

• Applications
– widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

• Approach
– use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

• Examples
– auto insurance: detect a group of people who stage accidents to collect on insurance
– money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients, rings of doctors and rings of references


Fraud Detection and Management (2)

• Detecting inappropriate medical treatment
– Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

• Detecting telephone fraud
– Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
– British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

• Retail
– Analysts estimate that 38% of retail shrink is due to dishonest employees.


Related Fields:

• Machine learning: Inductive reasoning

• Statistics: Sampling, Statistical Inference, Error Estimation

• Pattern recognition: Neural Networks, Clustering

• Knowledge Acquisition, Statistical Expert Systems

• Data Visualisation

• Databases: OLAP, Parallel DBMS, Deductive Databases

• Data Warehousing: collection and cleaning of transactional data for on-line retrieval
