Top Banner
Role of Mathematics / Statistics Role of Mathematics / Statistics in DATA MINING in DATA MINING By By Dr.S.Sridhar, Ph.D.(JNUD), Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany) RACI(Paris, NICE), RMR(USA), RZFM(Germany) RIEEEProc., RIETCom., LMISTE, LMCSI RIEEEProc., RIETCom., LMISTE, LMCSI Vice Chancellor-DSI-RAK Vice Chancellor-DSI-RAK
48
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: role of maths sat in DM

Role of Mathematics / Statistics Role of Mathematics / Statistics in DATA MININGin DATA MINING

By By

Dr.S.Sridhar, Ph.D.(JNUD),Dr.S.Sridhar, Ph.D.(JNUD),RACI(Paris, NICE), RMR(USA), RZFM(Germany)RACI(Paris, NICE), RMR(USA), RZFM(Germany)

RIEEEProc., RIETCom., LMISTE, LMCSIRIEEEProc., RIETCom., LMISTE, LMCSIVice Chancellor-DSI-RAKVice Chancellor-DSI-RAK

Page 2: role of maths sat in DM

Data Mining DefinitionData Mining Definition

Finding hidden information in a Finding hidden information in a databasedatabase

Fit data to a modelFit data to a model

Similar termsSimilar terms

Page 3: role of maths sat in DM

Data Mining AlgorithmData Mining Algorithm

Objective: Fit Data to a ModelObjective: Fit Data to a Model– DescriptiveDescriptive– PredictivePredictive

Preference – Technique to choose the Preference – Technique to choose the best modelbest model

Search – Technique to search the dataSearch – Technique to search the data– ““Query”Query”

Page 4: role of maths sat in DM

Database Processing vs. Data Database Processing vs. Data Mining ProcessingMining Processing

QueryQuery– Well definedWell defined– SQLSQL

QueryQuery– Poorly definedPoorly defined– No precise query languageNo precise query language

DataData– Operational dataOperational data

OutputOutput– PrecisePrecise– Subset of databaseSubset of database

DataData– Not operational dataNot operational data

OutputOutput– FuzzyFuzzy– Not a subset of databaseNot a subset of database

Page 5: role of maths sat in DM

Query ExamplesQuery Examples DatabaseDatabase

Data MiningData Mining

– Find all customers who have purchased milkFind all customers who have purchased milk

– Find all items which are frequently purchased Find all items which are frequently purchased with milk. (association rules)with milk. (association rules)

– Find all credit applicants with last name of KUMAR.Find all credit applicants with last name of KUMAR.– Identify customers who have purchased more Identify customers who have purchased more than INR10,000 in the last month.than INR10,000 in the last month.

– Find all credit applicants who are poor credit Find all credit applicants who are poor credit risks. (classification)risks. (classification)– Identify customers with similar buying habits. Identify customers with similar buying habits. (Clustering)(Clustering)

Page 6: role of maths sat in DM

Data Mining Models and TasksData Mining Models and Tasks

Page 7: role of maths sat in DM

Basic Data Mining TasksBasic Data Mining Tasks Classification Classification maps data into predefined groups maps data into predefined groups

or classesor classes– Supervised learningSupervised learning– Pattern recognitionPattern recognition– PredictionPrediction

RegressionRegression is used to map a data item to a real is used to map a data item to a real valued prediction variable.valued prediction variable.

Clustering Clustering groups similar data together into groups similar data together into clusters.clusters.– Unsupervised learningUnsupervised learning– SegmentationSegmentation– PartitioningPartitioning

Page 8: role of maths sat in DM

Basic Data Mining Tasks Basic Data Mining Tasks (cont’d)(cont’d)

Summarization Summarization maps data into subsets with maps data into subsets with associated simple descriptions.associated simple descriptions.– CharacterizationCharacterization– GeneralizationGeneralization

Link AnalysisLink Analysis uncovers relationships among uncovers relationships among data.data.– Affinity AnalysisAffinity Analysis– Association RulesAssociation Rules– Sequential Analysis determines sequential Sequential Analysis determines sequential

patterns.patterns.

Page 9: role of maths sat in DM

Ex: Time Series AnalysisEx: Time Series Analysis Example: Stock MarketExample: Stock Market Predict future valuesPredict future values Determine similar patterns over timeDetermine similar patterns over time Classify behaviorClassify behavior

Page 10: role of maths sat in DM

Data Mining vs. KDDData Mining vs. KDD

Knowledge Discovery in Databases Knowledge Discovery in Databases (KDD):(KDD): process of finding useful process of finding useful information and patterns in data.information and patterns in data.

Data Mining:Data Mining: Use of algorithms to Use of algorithms to extract the information and patterns extract the information and patterns derived by the KDD process. derived by the KDD process.

Page 11: role of maths sat in DM

KDD ProcessKDD Process

Selection:Selection: Obtain data from various sources. Obtain data from various sources. Preprocessing:Preprocessing: Cleanse data. Cleanse data. Transformation:Transformation: Convert to common format. Convert to common format.

Transform to new format.Transform to new format. Data Mining:Data Mining: Obtain desired results. Obtain desired results. Interpretation/Evaluation:Interpretation/Evaluation: Present results Present results

to user in meaningful manner.to user in meaningful manner.

Page 12: role of maths sat in DM

Data Mining DevelopmentData Mining Development•Similarity Measures•Hierarchical Clustering•IR Systems•Imprecise Queries•Textual Data•Web Search Engines

•Bayes Theorem•Regression Analysis•EM Algorithm•K-Means Clustering•Time Series Analysis

•Neural Networks•Decision Tree Algorithms

•Algorithm Design Techniques•Algorithm Analysis•Data Structures

•Relational Data Model•SQL•Association Rule Algorithms•Data Warehousing•Scalability Techniques

Page 13: role of maths sat in DM

KDD IssuesKDD Issues

Human InteractionHuman Interaction OverfittingOverfitting InterpretationInterpretation Visualization Visualization Large DatasetsLarge Datasets High DimensionalityHigh Dimensionality

Page 14: role of maths sat in DM

KDD Issues (cont’d)KDD Issues (cont’d)

Multimedia DataMultimedia Data Missing DataMissing Data Irrelevant DataIrrelevant Data Noisy DataNoisy Data Changing DataChanging Data IntegrationIntegration ApplicationApplication

Page 15: role of maths sat in DM

Data Mining MetricsData Mining Metrics

UsefulnessUsefulness Return on Investment (ROI)Return on Investment (ROI) AccuracyAccuracy Space/TimeSpace/Time

Page 16: role of maths sat in DM

Database Perspective on Data Database Perspective on Data MiningMining

ScalabilityScalability Real World DataReal World Data UpdatesUpdates Ease of UseEase of Use

Page 17: role of maths sat in DM

Visualization TechniquesVisualization Techniques

GraphicalGraphical GeometricGeometric Icon-basedIcon-based Pixel-basedPixel-based HierarchicalHierarchical

Page 18: role of maths sat in DM

DB & OLTP SystemsDB & OLTP Systems SchemaSchema

– (ID,Name,Address,Salary,JobNo)(ID,Name,Address,Salary,JobNo) Data ModelData Model

– ERER– RelationalRelational

TransactionTransaction Query:Query:

SELECT NameSELECT NameFROM TFROM TWHERE Salary > 100000WHERE Salary > 100000

DM: Only imprecise queriesDM: Only imprecise queries

Page 19: role of maths sat in DM

Fuzzy Sets and LogicFuzzy Sets and Logic Fuzzy Set:Fuzzy Set: Set membership function is a real valued Set membership function is a real valued

function with output in the range [0,1].function with output in the range [0,1]. f(x): Probability x is in F.f(x): Probability x is in F. 1-f(x): Probability x is not in F.1-f(x): Probability x is not in F. EX:EX:

– T = {x | x is a person and x is tall}T = {x | x is a person and x is tall}– Let f(x) be the probability that x is tallLet f(x) be the probability that x is tall– Here f is the membership functionHere f is the membership function

DM: DM: Prediction and classification are fuzzy.Prediction and classification are fuzzy.

Page 20: role of maths sat in DM

Fuzzy SetsFuzzy Sets

Page 21: role of maths sat in DM

Classification/Prediction is Classification/Prediction is FuzzyFuzzy

Loan

Amnt

Simple Fuzzy

Accept Accept

RejectReject

Page 22: role of maths sat in DM

Information Retrieval Information Retrieval

Information Retrieval (IR):Information Retrieval (IR): retrieving desired retrieving desired information from textual data.information from textual data.

Library ScienceLibrary Science Digital LibrariesDigital Libraries Web Search EnginesWeb Search Engines Traditionally keyword basedTraditionally keyword based Sample query:Sample query:

Find all documents about “data mining”.Find all documents about “data mining”.

DM: Similarity measures; DM: Similarity measures; Mine text/Web data.Mine text/Web data.

Page 23: role of maths sat in DM

IR Query Result Measures IR Query Result Measures and Classificationand Classification

IR Classification

Page 24: role of maths sat in DM

Relational View of DataRelational View of Data

ProdID LocID Date Quantity UnitPrice 123 Dallas 022900 5 25 123 Houston 020100 10 20 150 Dallas 031500 1 100 150 Dallas 031500 5 95 150 Fort

Worth 021000 5 80

150 Chicago 012000 20 75 200 Seattle 030100 5 50 300 Rochester 021500 200 5 500 Bradenton 022000 15 20 500 Chicago 012000 10 25 1

Page 25: role of maths sat in DM

Cube view of DataCube view of Data

Page 26: role of maths sat in DM

Aggregation HierarchiesAggregation Hierarchies

Page 27: role of maths sat in DM

Star SchemaStar Schema

Page 28: role of maths sat in DM

Data WarehousingData Warehousing

““Subject-oriented, integrated, time-variant, nonvolatile” Subject-oriented, integrated, time-variant, nonvolatile” William InmonWilliam Inmon

Operational Data:Operational Data: Data used in day to day needs of Data used in day to day needs of company.company.

Informational Data:Informational Data: Supports other functions such as Supports other functions such as planning and forecasting.planning and forecasting.

Data mining tools often access data warehouses rather Data mining tools often access data warehouses rather than operational data.than operational data.

DM: May access data in warehouse.DM: May access data in warehouse.

Page 29: role of maths sat in DM

OLAP OperationsOLAP Operations

Single Cell Multiple Cells Slice Dice

Roll Up

Drill Down

Page 30: role of maths sat in DM

StatisticsStatistics Simple descriptive modelsSimple descriptive models Statistical inference:Statistical inference: generalizing a model generalizing a model

created from a sample of the data to the entire created from a sample of the data to the entire dataset.dataset.

Exploratory Data Analysis:Exploratory Data Analysis: – Data can actually drive the creation of the Data can actually drive the creation of the

modelmodel– Opposite of traditional statistical view.Opposite of traditional statistical view.

Data mining targeted to business userData mining targeted to business user

DM: Many data mining methods come DM: Many data mining methods come from statistical techniques. from statistical techniques.

Page 31: role of maths sat in DM

Pattern Matching Pattern Matching (Recognition)(Recognition)

Pattern Matching:Pattern Matching: finds occurrences of finds occurrences of a predefined pattern in the data.a predefined pattern in the data.

Applications include speech recognition, Applications include speech recognition, information retrieval, time series information retrieval, time series analysis.analysis.

DM: Type of classification.DM: Type of classification.

Page 32: role of maths sat in DM

Data Mining Techniques OutlineData Mining Techniques Outline

StatisticalStatistical– Point EstimationPoint Estimation– Models Based on SummarizationModels Based on Summarization– Bayes TheoremBayes Theorem– Hypothesis TestingHypothesis Testing– Regression and CorrelationRegression and Correlation

Similarity MeasuresSimilarity Measures Decision TreesDecision Trees Neural NetworksNeural Networks

– Activation FunctionsActivation Functions

Genetic AlgorithmsGenetic Algorithms

Goal:Goal: Provide an overview of basic data Provide an overview of basic data mining techniquesmining techniques

Page 33: role of maths sat in DM

Point EstimationPoint Estimation Point Estimate:Point Estimate: estimate a population estimate a population

parameter.parameter. May be made by calculating the parameter for a May be made by calculating the parameter for a

sample.sample. May be used to predict value for missing data.May be used to predict value for missing data. Ex: Ex:

– R contains 100 employeesR contains 100 employees– 99 have salary information99 have salary information– Mean salary of these is $50,000Mean salary of these is $50,000– Use $50,000 as value of remaining employee’s Use $50,000 as value of remaining employee’s

salary. salary. Is this a good idea?Is this a good idea?

Page 34: role of maths sat in DM

Estimation ErrorEstimation Error

Bias: Bias: Difference between expected value and Difference between expected value and actual value.actual value.

Mean Squared Error (MSE):Mean Squared Error (MSE): expected value expected value of the squared difference between the of the squared difference between the estimate and the actual value:estimate and the actual value:

Why square?Why square? Root Mean Square Error (RMSE)Root Mean Square Error (RMSE)

Page 35: role of maths sat in DM

Jackknife EstimateJackknife Estimate Jackknife Estimate:Jackknife Estimate: estimate of parameter is estimate of parameter is

obtained by omitting one value from the set of obtained by omitting one value from the set of observed values.observed values.

Ex: estimate of mean for X={xEx: estimate of mean for X={x1, … , x, … , xn}}

Page 36: role of maths sat in DM

Maximum Likelihood Maximum Likelihood Estimate (MLE)Estimate (MLE)

Obtain parameter estimates that maximize Obtain parameter estimates that maximize the probability that the sample data occurs for the probability that the sample data occurs for the specific model.the specific model.

Joint probability for observing the sample Joint probability for observing the sample data by multiplying the individual probabilities. data by multiplying the individual probabilities. Likelihood function: Likelihood function:

Maximize L.Maximize L.

Page 37: role of maths sat in DM

Models Based on SummarizationModels Based on Summarization

Visualization:Visualization: Frequency distribution, mean, variance, Frequency distribution, mean, variance, median, mode, etc.median, mode, etc.

Box Plot:Box Plot:

Page 38: role of maths sat in DM

Scatter DiagramScatter Diagram

Page 39: role of maths sat in DM

Bayes TheoremBayes Theorem

Posterior Probability:Posterior Probability: P(hP(h1|x|xi)) Prior Probability:Prior Probability: P(h P(h1)) Bayes Theorem:Bayes Theorem:

Assign probabilities of hypotheses given a data Assign probabilities of hypotheses given a data value.value.

Page 40: role of maths sat in DM

Hypothesis TestingHypothesis Testing

Find model to explain behavior by Find model to explain behavior by creating and then testing a hypothesis creating and then testing a hypothesis about the data.about the data.

Exact opposite of usual DM approach.Exact opposite of usual DM approach. HH0 0 – Null hypothesis; Hypothesis to be – Null hypothesis; Hypothesis to be

tested.tested. HH1 1 – Alternative hypothesis– Alternative hypothesis

Page 41: role of maths sat in DM

Chi Squared StatisticChi Squared Statistic

O – observed valueO – observed value E – Expected value based on hypothesis.E – Expected value based on hypothesis.

Ex: Ex: – O={50,93,67,78,87}O={50,93,67,78,87}– E=75E=75– 22=15.55 and therefore significant=15.55 and therefore significant

Page 42: role of maths sat in DM

RegressionRegression

Predict future values based on past Predict future values based on past valuesvalues

Linear RegressionLinear Regression assumes linear assumes linear relationship exists.relationship exists.

y = cy = c00 + c + c11 x x11 + … + c + … + cnn x xnn

Find values to best fit the dataFind values to best fit the data

Page 43: role of maths sat in DM

Linear RegressionLinear Regression

Page 44: role of maths sat in DM

CorrelationCorrelation

Examine the degree to which the values Examine the degree to which the values for two variables behave similarly.for two variables behave similarly.

Correlation coefficient r:Correlation coefficient r:• 1 = perfect correlation1 = perfect correlation• -1 = perfect but opposite correlation-1 = perfect but opposite correlation• 0 = no correlation0 = no correlation

Page 45: role of maths sat in DM

Distance MeasuresDistance Measures

Measure dissimilarity between objectsMeasure dissimilarity between objects

Page 46: role of maths sat in DM

Twenty Questions GameTwenty Questions Game

Page 47: role of maths sat in DM

Decision Tree ExampleDecision Tree Example

Page 48: role of maths sat in DM

<<<<……Thank U……>>>><<<<……Thank U……>>>>

For more details visit my site atFor more details visit my site at http://drsridhar.tripod.com

For your queries, email to me :For your queries, email to me : [email protected]

Reference Book onReference Book on

"Datamining" (ISBN 81-7758-785-4) "Datamining" (ISBN 81-7758-785-4)

48