DBMS support of the Data Mining Advisor : S.-Y. Hwang Ph.D D954020005 Tsung-Hsien Yang D954020006 Shi-Hwao Wang 1/22/2008
Dec 24, 2015

Lee Lambert
DBMS support of the Data Mining

Advisor: S.-Y. Hwang, Ph.D.
D954020005 Tsung-Hsien Yang
D954020006 Shi-Hwao Wang

1/22/2008

Agenda

Introduction to Data Mining
The Promise of Data Mining
KDD Process
Data Mining Algorithms
Data Mining Modeling and Language
Conclusion

Introduction to Data Mining

The explosive growth of data: from terabytes to petabytes

Major sources of abundant data
  Business: Web, e-commerce, transactions, stocks, …
  Science: remote sensing, bioinformatics, scientific simulation, …
  Society and everyone: news, digital cameras, YouTube

Data collection and data availability
  Automated data collection tools, database systems, Web, computerized society

What Is Data Mining?

Data mining: discovering interesting patterns from large amounts of data

Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: Is everything “data mining”? The following are not:
  Simple search and query processing
  (Deductive) expert systems

The Promise of Data Mining

Database analysis and decision support
  Market analysis and management
    Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and management

Other applications
  Text mining (news group, email, documents) and Web analysis

Knowledge Discovery (KDD) Process

Data mining: core of the knowledge discovery process

[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Data preprocessing

[Figure: a Data Mining Management System (DMMS) hosting a mining model, with four steps]
  Define a model
  Train the model (input: Training Data)
  Test the model (input: Test Data)
  Prediction using the model (input: Prediction Input Data)

Data Mining Algorithms

Decision Trees
Naïve Bayesian
Clustering
Sequence Clustering
Association Rules
Neural Network
Time Series
Support Vector Machines
…

Data Mining Function

Classification (attribute)

Estimation (regression)

Prediction (time series)

Association (cross selling)

Clustering (segmentation)

Data Mining Algorithms

[Table: algorithms (Decision Trees, Naïve Bayes, Clustering, Seq. Clustering, Time Series, Association Rules, Neural Network) against tasks (Classification, Regression, Segmentation, Assoc. Analysis, Anomaly Detect., Seq. Analysis, Time Series). Check marks flag an algorithm as the first choice or second choice for a task; the individual cell values are not recoverable.]

Data Mining Language

New challenges in data mining API
  Large spectrum of applications: embedded to interactive BI
  Interoperability between different DM providers (engines) and DM consumers (tools)
  Data independence between content representation (trees, attributes, networks, etc.) and data mining task (prediction, scoring, etc.)

Requirements
  Algorithm-neutral
  Task-oriented (specification of what we need, rather than how to do it)
  Vendor-neutral
  Flexible, extensible, declarative/self-contained

Sound familiar? Yes, SQL

DMX Approach

Data Mining Extensions (DMX) to SQL

Table vs. Mining Model

              TABLE                        MINING MODEL
schema        Column definition            Attribute (variable) definition
contains      Rows                         Patterns, knowledge, cases
operations    DDL (create, drop, alter)    Create/drop/alter a model
              DML (insert, delete)         Train (populate) a model
              Query (select)               Prediction/browsing a model

Typical DM Process Using DMX

[Figure: a Data Mining Management System (DMMS) hosting a mining model]
  Define a model: CREATE MINING MODEL ….
  Train a model (input: Training Data): INSERT INTO dmm ….
  Prediction using a model (input: Prediction Input Data): SELECT … FROM dmm PREDICTION JOIN …

Defining a DM Model

Defines
  Shape of “training cases” (top-level entity being modeled)
  Input/output attributes (variables): type, distribution
  Algorithms and parameters

Example

CREATE MINING MODEL CollegePlanModel (
    StudentID LONG KEY,
    Gender TEXT DISCRETE,
    ParentIncome LONG NORMAL CONTINUOUS,
    Encouragement TEXT DISCRETE,
    CollegePlans TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees (complexity_penalty = 0.5)

Training a DM Model: Simple

INSERT INTO CollegePlanModel (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
OPENROWSET('<provider>', '<connection>',
    'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
     FROM CollegePlansTrainData')

Prediction Using a DM Model: PREDICTION JOIN

SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN
    OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ

[Figure: the CPModel and NewStudents column sets (ID, Gender, IQ, and Plan) matched by the prediction join]

Classification Model Definition

CREATE MINING MODEL CPClass (
    StudentID LONG KEY,
    Gender TEXT DISCRETE,
    ParentIncome LONG CONTINUOUS,
    Encouragement TEXT DISCRETE,
    CollegePlans TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees

Classification (cont)

Find the new students whose predicted class (CollegePlans) is ‘Yes’ with confidence > 0.8

SELECT StudentID, PredictProbability(CPClass.CollegePlans)
FROM CPClass PREDICTION JOIN
    OPENROWSET('<provider>', '<connection>',
        'SELECT * FROM NewStudents') AS t
ON t.Gender = CPClass.Gender AND
   t.ParentIncome = CPClass.ParentIncome AND
   t.Encouragement = CPClass.Encouragement
WHERE CPClass.CollegePlans = 'Yes' AND
      PredictProbability(CPClass.CollegePlans) > 0.8

Regression Model Definition

CREATE MINING MODEL CustCredit (
    CustID LONG KEY,
    Gender TEXT DISCRETE,
    Age LONG CONTINUOUS REGRESSOR,
    Income LONG CONTINUOUS REGRESSOR,
    Credit DOUBLE CONTINUOUS PREDICT
) USING Microsoft_Decision_Trees

Regression (cont)

Predict the Credit score (and its standard deviation) for new customer data entered from the web form.

SELECT CustCredit.Credit, PredictStdev(CustCredit.Credit)
FROM CustCredit PREDICTION JOIN
    (SELECT 'Female' AS Gender, 30 AS Age, 50000 AS Income) AS t
ON t.Gender = CustCredit.Gender AND
   t.Age = CustCredit.Age AND
   t.Income = CustCredit.Income

Segmentation Model Definition

CREATE MINING MODEL CPCluster (
    StudentID LONG KEY,
    Gender TEXT DISCRETE,
    ParentIncome LONG CONTINUOUS,
    Encouragement TEXT DISCRETE,
    CollegePlans TEXT DISCRETE
) USING Microsoft_Clustering

Segmentation (cont.)

Find cluster and its probability for each student

SELECT StudentID, $Cluster, ClusterProbability()
FROM CPCluster PREDICTION JOIN
    OPENROWSET('<provider>', '<connection>',
        'SELECT * FROM NewStudents') AS t
ON t.Gender = CPCluster.Gender AND
   t.ParentIncome = CPCluster.ParentIncome AND
   t.Encouragement = CPCluster.Encouragement AND
   t.CollegePlans = CPCluster.CollegePlans

Association Prediction Model Definition

CREATE MINING MODEL FavMovieModel (
    ID LONG KEY,
    MaritalStatus TEXT DISCRETE,
    FavMovies TABLE PREDICT (
        Title TEXT KEY
    )
) USING Microsoft_Decision_Trees

Association Prediction (cont)

As a web application, find 5 best recommendations for a customer whose shopping cart contains ‘Star Wars’ and ‘Matrix’.

SELECT FLATTENED PredictAssociation(FavMovieModel.FavMovies, INCLUDE_STATISTICS, 5)
FROM FavMovieModel NATURAL PREDICTION JOIN
    (SELECT 'Single' AS MaritalStatus,
            (SELECT 'Star Wars' AS Title UNION
             SELECT 'Matrix' AS Title) AS FavMovies) AS t

Sequence Prediction

Model Definition

CREATE MINING MODEL WebSeqModel (
    Session LONG KEY,
    PageSeq TABLE PREDICT (
        SeqID LONG KEY SEQUENCE,
        Page TEXT DISCRETE
    )
) USING Microsoft_Sequence_Clustering

Sequence Prediction (cont)

Show the next 2 steps that a web visitor who visited ‘home’ and then ‘news’ is going to take. For each step, show the top 5 candidate pages with the highest probability.

SELECT FLATTENED
    (SELECT $Sequence,
            TopCount(PredictHistogram(Page), $Probability, 5)
     FROM PredictSequence(WebSeqModel.PageSeq, 2))
FROM WebSeqModel NATURAL PREDICTION JOIN
    (SELECT (SELECT 1 AS SeqID, 'home' AS Page UNION
             SELECT 2 AS SeqID, 'news' AS Page) AS PageSeq) AS t
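The query above keeps, for each predicted step, the five candidate pages with the highest probability. A minimal Python sketch of that "top N by probability" selection follows. The function name top_count, the page names, and the probabilities are all hypothetical illustrations, not output from any real model or the DMX implementation.

```python
import heapq

def top_count(histogram, n):
    """Return the n (page, probability) pairs with the highest probability.

    `histogram` maps a candidate page to its predicted probability; this is
    only a stand-in for what a prediction histogram might contain.
    """
    return heapq.nlargest(n, histogram.items(), key=lambda item: item[1])

# Hypothetical histogram for one predicted step (values are made up).
next_page_histogram = {
    "sports": 0.30,
    "weather": 0.25,
    "business": 0.20,
    "tech": 0.10,
    "travel": 0.08,
    "home": 0.07,
}

top5 = top_count(next_page_histogram, 5)
for page, probability in top5:
    print(page, probability)
```

Using a heap-based selection avoids sorting the whole histogram when only a few candidates are needed.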

Time-Series Prediction

Model Definition

CREATE MINING MODEL StockModel (
    Symbol TEXT KEY,
    DateRecorded DATE KEY TIME,
    OpeningQuote DOUBLE CONTINUOUS,
    ClosingQuote DOUBLE CONTINUOUS
) USING Microsoft_Time_Series

Time-Series Prediction (cont)

Predict next five days of MSFT stock closing quotes.

SELECT FLATTENED PredictTimeSeries(StockModel.ClosingQuote, 5)
FROM StockModel
WHERE StockModel.Symbol = 'MSFT'

Major Issues in Data Mining

Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed and incremental mining methods
  Integration of the discovered knowledge with existing knowledge: knowledge fusion

User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts
  Domain-specific data mining & invisible data mining
  Protection of data security, integrity, and privacy

Data Mining Vendors

SAS (Enterprise Miner)
IBM (DB2 Intelligent Miner)
Oracle (ODM option to Oracle 10g)
SPSS (Clementine)
Insightful (Insightful Miner)
KXEN (Analytic Framework)
Prudsys (Discoverer and its family)
Microsoft (SQL Server 2005)
Angoss (KnowledgeServer and its family)
DBMiner (DBMiner)
Many others

Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases toward the top]

Layers, from top to bottom:
  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Data Mining Modeling and Language

Problem description
  Two powerful tools
    Database management systems
    Efficient and effective data mining algorithms and frameworks
  Generally, this work asks:
    “How can we merge the two?”
    “How can we integrate data mining more closely with traditional database systems, particularly querying?”

Three Different Answers

MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)

DMQL: A Data Mining Query Language for Relational Databases (Han et al, Simon Fraser University)

Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al, Microsoft)

MSQL

Focus on association rules

Seeks to provide a language both to selectively generate rules and, separately, to query the rule base

Expressive rule generation language, and techniques for optimizing some commands

MSQL

Get-Rules and Select-Rules Queries

The Get-Rules operator generates rules over elements of argument class C that satisfy the conditions described in the “where” clause

[Project Body, Consequent, confidence, support]

GetRules(C) [as R1]

[into <rulebase_name>]

[where <conds>]

[sql-group-by clause]

[using-clause]

MSQL

<conds> may contain a number of conditions, including:
  Restrictions on the attributes in the body or consequent
    “rule.body HAS {(Job = ‘Doctor’)}”
    “rule1.consequent IN rule2.body”
    “rule.consequent IS {Age = *}”
  Pruning conditions (restrict by support, confidence, or size)
  Stratified or correlated subqueries

in, has, and is are rule subset, superset, and equality respectively
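The subset, superset, and equality readings of in, has, and is can be sketched with ordinary Python sets. This is only an illustration under stated assumptions: rule bodies and consequents are modelled as sets of (attribute, value) descriptors, the helper names are hypothetical, and wildcard descriptors such as Age = * are not handled.

```python
# Rule bodies and consequents modelled as sets of (attribute, value) pairs.
# The helper names below are illustrative only; they are not MSQL syntax.

def has(a, b):
    """a HAS b: a is a superset of b."""
    return frozenset(a) >= frozenset(b)

def is_in(a, b):
    """a IN b: a is a subset of b."""
    return frozenset(a) <= frozenset(b)

def is_equal(a, b):
    """a IS b: a and b contain exactly the same descriptors."""
    return frozenset(a) == frozenset(b)

# Hypothetical rule body with two descriptors.
body = {("Job", "Doctor"), ("Age", "40-49")}
doctor = {("Job", "Doctor")}

print(has(body, doctor))       # body contains the Job = Doctor descriptor
print(is_in(doctor, body))     # the single descriptor is a subset of body
print(is_equal(body, doctor))  # the two sets differ
```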

MSQL

GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients) R2
    where Support > .05 and Confidence > .7
    and R2.Body HAS R1.Body)

Retrieve all rules with a descriptor of the form “Age = *” in the body, except when another rule that also meets the support and confidence thresholds has a body containing a superset of those descriptors

MSQL

Correlated

GetRules(C) R1
where <pruning-conds>
and not exists ( GetRules(C) R2
    where <same pruning-conds>
    and R2.Body HAS R1.Body)

Stratified

GetRules(C) R1
where <pruning-conds>
and consequent is {(X=*)}
and consequent in (SelectRules(R2)
    where consequent is {(X=*)})

MSQL

Nested Get-Rules Queries and their optimization

Stratified (non-correlated) queries are evaluated “bottom-up.” The subquery is evaluated first, and replaced with its results in the outer query.

Correlated queries are evaluated either top-down or bottom-up (like “loop-unfolding”), and there are rules for choosing between the two options

MSQL

Top-Down Evaluation

For each rule produced by the outer query, evaluate the inner query

Outer:
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7

Inner:
not exists ( GetRules(Patients) R2
    where Support > .05 and Confidence > .7
    and R2.Body HAS R1.Body)

MSQL

Bottom-Up Evaluation

For each rule produced by the inner query, evaluate the outer query

Inner:
not exists ( GetRules(Patients) R2
    where Support > .05 and Confidence > .7
    and R2.Body HAS R1.Body)

Outer:
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
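The two evaluation orders for the correlated query can be contrasted in a small Python sketch. It is a toy under explicit assumptions, not the optimizer described in the MSQL work: rules are held in an in-memory list, HAS is read as a proper superset so that a rule does not eliminate itself, and the Rule class, function names, and sample rules are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    body: frozenset        # set of (attribute, value) descriptors
    consequent: frozenset
    support: float
    confidence: float

def passes(rule, min_support=0.05, min_confidence=0.7):
    """Pruning conditions shared by the outer and inner queries."""
    return rule.support > min_support and rule.confidence > min_confidence

def outer_ok(rule):
    """Outer query: pruning conditions plus some Age descriptor in the body."""
    return passes(rule) and any(attr == "Age" for attr, _ in rule.body)

def top_down(rules):
    """For each outer rule, scan the inner rules for a dominating body."""
    result = []
    for r1 in rules:
        if not outer_ok(r1):
            continue
        # Assumption: HAS is treated as a proper superset (frozenset ">").
        dominated = any(passes(r2) and r2.body > r1.body for r2 in rules)
        if not dominated:
            result.append(r1)
    return result

def bottom_up(rules):
    """For each inner rule, strike out the outer candidates it dominates."""
    candidates = [r for r in rules if outer_ok(r)]
    for r2 in rules:
        if passes(r2):
            candidates = [r1 for r1 in candidates if not r2.body > r1.body]
    return candidates

# Hypothetical rule base (all values are made up for illustration).
yes = frozenset({("Heart_Disease", "yes")})
r_a = Rule(frozenset({("Age", "30-39")}), yes, 0.10, 0.80)
r_b = Rule(frozenset({("Age", "30-39"), ("Job", "Doctor")}), yes, 0.06, 0.75)
r_c = Rule(frozenset({("Age", "40-49")}), yes, 0.20, 0.90)
r_d = Rule(frozenset({("Age", "40-49"), ("Job", "Nurse")}), yes, 0.03, 0.90)
r_e = Rule(frozenset({("Job", "Doctor")}), yes, 0.30, 0.90)
rules = [r_a, r_b, r_c, r_d, r_e]

print(top_down(rules) == bottom_up(rules))
```

Both orders return the same rules here; they differ in which side drives the loop, which is what the choice between the two strategies trades on.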

DMQL

Commands specify the following:
  The set of data relevant to the data mining task (the training set)
  The kinds of knowledge to be discovered
    Generalized relation
    Characteristic rules
    Discriminant rules
    Classification rules
    Association rules

DMQL

Commands specify the following:
  Background knowledge
    Concept hierarchies based on attribute relationships, etc.
  Various thresholds
    Minimum support, confidence, etc.

DMQL

Syntax

use database <database_name>
{use hierarchy <hierarchy_name> for <attribute>}
<rule_spec>
related to <attr_or_agg_list>
from <relation(s)>
[where <conditions>]
[order by <order list>]
{with [<kinds of>] threshold = <threshold_value> [for <attribute(s)>]}

What each clause does:
  use hierarchy: specify background knowledge
  <rule_spec>: specify rules to be discovered
  related to: relevant attributes or aggregations
  from / where / order by: collect the set of relevant data to mine
  with threshold: specify threshold parameters

DMQL

use database Hospital

find association rules as Heart_Health

related to Salary, Age, Smoker, Heart_Disease

from Patient_Financial f, Patient_Medical m

where f.ID = m.ID and m.age >= 18

with support threshold = .05

with confidence threshold = .7
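The support and confidence thresholds in this query have a simple counting interpretation, sketched below in Python. The patient records, attribute values, and function names are hypothetical and are not taken from any real Hospital database; the sketch computes the two measures for a single candidate rule and does not search for rules.

```python
def support(rows, items):
    """Fraction of rows that contain every descriptor in `items`."""
    items = set(items)
    return sum(items <= row for row in rows) / len(rows)

def confidence(rows, body, consequent):
    """support(body and consequent together) divided by support(body)."""
    body_support = support(rows, body)
    if body_support == 0:
        return 0.0
    return support(rows, set(body) | set(consequent)) / body_support

# Hypothetical joined patient records, one set of descriptors per patient.
rows = [
    {("Smoker", "yes"), ("Heart_Disease", "yes")},
    {("Smoker", "yes"), ("Heart_Disease", "yes")},
    {("Smoker", "yes"), ("Heart_Disease", "no")},
    {("Smoker", "no"), ("Heart_Disease", "no")},
]

body = {("Smoker", "yes")}
consequent = {("Heart_Disease", "yes")}

rule_support = support(rows, body | consequent)
rule_confidence = confidence(rows, body, consequent)

# Thresholds from the query above: support > .05 and confidence > .7
keep_rule = rule_support > 0.05 and rule_confidence > 0.7
print(rule_support, rule_confidence, keep_rule)
```

In this made-up data the rule clears the support threshold but not the confidence threshold, so it would be dropped.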

DMQL

DMQL provides a display in command to view resulting rules, but no advanced way to query them

Suggests that a GUI interface might aid in the presentation of these results in different forms (charts, graphs, etc.)

OLE DB for DM

An extension to the OLE DB interface for Microsoft SQL Server

Seeks to support the following ideas:
  Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm
  Populate the model using the training data
  Predict attributes for new data using the populated model
  Browse the mining model (not fully addressed because it varies a lot by model type)

OLE DB for DM

Defining a Mining Model
  Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model

Populating the Model
  Pull the information into a single rowset using views, and train the model using the data and algorithm specified

OLE DB for DM

Using the mining model to predict
  Defines a new operator, prediction join. A model may be used to make predictions on datasets by taking the prediction join of the mining model and the data set.
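The define, populate, and predict cycle can be mimicked in a few lines of Python to show what a prediction join produces. This is a toy analogy only: the ToyMiningModel class, its method names, and the sample data are hypothetical, and the "model" is a majority vote per combination of input values rather than a decision tree or any algorithm shipped with SQL Server.

```python
from collections import Counter, defaultdict

class ToyMiningModel:
    """Toy stand-in for a mining model: majority vote per input combination."""

    def __init__(self, input_columns, predict_column):
        # "Define": name the input attributes and the attribute to predict.
        self.input_columns = input_columns
        self.predict_column = predict_column
        self._votes = defaultdict(Counter)

    def train(self, cases):
        # "Populate": consume training cases; only counts are stored.
        for case in cases:
            key = tuple(case[c] for c in self.input_columns)
            self._votes[key][case[self.predict_column]] += 1

    def prediction_join(self, rows):
        # "Predict": match each input row to the model on the input columns
        # and attach the predicted value with its estimated probability.
        for row in rows:
            key = tuple(row[c] for c in self.input_columns)
            votes = self._votes.get(key)
            if votes:
                label, count = votes.most_common(1)[0]
                probability = count / sum(votes.values())
            else:
                label, probability = None, 0.0
            yield {**row, self.predict_column: label, "probability": probability}

# Hypothetical training cases and new rows.
model = ToyMiningModel(["Gender", "Encouragement"], "CollegePlans")
model.train([
    {"Gender": "Female", "Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Gender": "Female", "Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Gender": "Female", "Encouragement": "Encouraged", "CollegePlans": "No"},
    {"Gender": "Male", "Encouragement": "Not Encouraged", "CollegePlans": "No"},
])
new_students = [
    {"StudentID": 1, "Gender": "Female", "Encouragement": "Encouraged"},
    {"StudentID": 2, "Gender": "Male", "Encouragement": "Not Encouraged"},
]
predictions = list(model.prediction_join(new_students))
for p in predictions:
    print(p["StudentID"], p["CollegePlans"], p["probability"])
```

As with the INSERT statement in this approach, training here does not store the cases themselves; only the learned counts remain in the model.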

OLE DB for DM

CREATE MINING MODEL [Heart_Health Prediction] (
    ID Int Key,
    Age Int,
    Smoker Int,
    Salary Double discretized,
    HeartAttack Int PREDICT    -- prediction column
) USING Microsoft_Decision_Trees

Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm.

OLE DB for DM

INSERT INTO [Heart_Health Prediction] (Age, Smoker, Salary, HeartAttack)
OPENROWSET('<provider>', '<connection>',
    'SELECT Age, Smoker, Salary, HeartAttack
     FROM Patient_Medical M, Patient_Financial F
     WHERE M.ID = F.ID')

The INSERT represents using a tuple for training the model (not actually inserting it into the rowset).

OLE DB for DM

SELECT T.ID, H.HeartAttack
FROM [Heart_Health Prediction] H
PREDICTION JOIN
    OPENROWSET('<provider>', '<connection>',
        'SELECT ID, Age, Smoker, Salary
         FROM Patient_Medical M, Patient_Financial F
         WHERE M.ID = F.ID') AS T
ON H.Age = T.Age AND H.Smoker = T.Smoker AND H.Salary = T.Salary

Prediction join connects the model and an actual data table to make predictions

Key Ideas

Important to have an API for creating and manipulating data mining models

The data is already in the DBMS, so it makes sense to do the data mining where the data is

Applications already use SQL, so a SQL extension seems logical

Key Ideas

Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, ODBDM)

Need a method of querying the models (MSQL)

Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (ODBDM)