7/25/2019  Data Mining Introduction (Dunham)
DATA MINING
Introductory and Advanced Topics
Part I

Source:
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.

Data Mining Outline
PART I:
- Introduction
- Related Concepts
- Data Mining Techniques
PART II:
- Classification
- Clustering
- Association Rules
PART III:
- Web Mining
- Spatial Mining
- Temporal Mining

Introduction Outline
- Define data mining
- Data mining vs. databases
- Basic data mining tasks
- Data mining development
- Data mining issues
Goal: Provide an overview of data mining.

Introduction
- Data is growing at a phenomenal rate
- Users expect more sophisticated information
- How? UNCOVER HIDDEN INFORMATION: DATA MINING

Data Mining Definition
- Finding hidden information in a database
- Fit data to a model
- Similar terms:
  - Exploratory data analysis
  - Data driven discovery
  - Deductive learning

Database Processing vs. Data Mining Processing

            Database                      Data Mining
  Query     Well defined; SQL             Poorly defined; no precise query language
  Data      Operational data              Not operational data
  Output    Precise; subset of database   Fuzzy; not a subset of database

Query Examples

Database:
- Find all credit applicants with last name of Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.

Data Mining:
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)

Data Mining Models and Tasks

Basic Data Mining Tasks
- Classification maps data into predefined groups or classes.
  - Supervised learning
  - Pattern recognition
  - Prediction
- Regression is used to map a data item to a real-valued prediction variable.
- Clustering groups similar data together into clusters.
  - Unsupervised learning
  - Segmentation
  - Partitioning

Basic Data Mining Tasks (cont'd)
- Summarization maps data into subsets with associated simple descriptions.
  - Characterization
  - Generalization
- Link Analysis uncovers relationships among data.
  - Affinity Analysis
  - Association Rules
  - Sequential Analysis determines sequential patterns.

Ex: Time Series Analysis
Example: Stock Market
- Predict future values
- Determine similar patterns over time
- Classify behavior

Data Mining vs. KDD
- Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
- Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process
- Selection: obtain data from various sources.
- Preprocessing: cleanse data.
- Transformation: convert to common format; transform to new format.
- Data Mining: obtain desired results.
- Interpretation/Evaluation: present results to user in meaningful manner.
Modified from [FPSS96C]

KDD Process Ex: Web Log
- Selection: select log data (dates and locations) to use
- Preprocessing:
  - Remove identifying URLs
  - Remove error logs
- Transformation: sessionize (sort and group)
- Data Mining:
  - Identify and count patterns
  - Construct data structure
- Interpretation/Evaluation: identify and display frequently accessed sequences.
- Potential User Applications:
  - Cache prediction
  - Personalization

Data Mining Development
- Similarity Measures; Hierarchical Clustering
- IR Systems; Imprecise Queries; Textual Data; Web Search Engines
- Bayes Theorem; Regression Analysis; EM Algorithm

KDD Issues
- Human Interaction
- Overfitting
- Outliers
- Interpretation
- Visualization
- Large Datasets
- High Dimensionality

KDD Issues (cont'd)
- Multimedia Data
- Missing Data
- Irrelevant Data
- Noisy Data
- Changing Data
- Integration
- Application

Social Implications of DM
- Privacy
- Profiling
- Unauthorized use

Data Mining Metrics
- Usefulness
- Return on Investment (ROI)
- Accuracy
- Space/Time

Database Perspective on Data Mining
- Scalability
- Real World Data
- Updates
- Ease of Use

Visualization Techniques
- Graphical
- Geometric
- Icon-based

Related Concepts Outline
- Database/OLTP Systems
- Fuzzy Sets and Logic
- Information Retrieval (Web Search Engines)
- Dimensional Modeling
- Data Warehousing
- OLAP/DSS
- Statistics
- Machine Learning
- Pattern Matching
Goal: Examine some areas which are related to data mining.

DB & OLTP Systems
- Schema: (ID, Name, Address, Salary, JobNo)
- Data Model: ER; Relational
- Transaction
- Query:
  SELECT Name
  FROM T
  WHERE Salary > 100000
DM: Only imprecise queries

Fuzzy Sets and Logic
- Fuzzy Set: set membership function is a real-valued function with output in the range [0,1].
- f(x): probability x is in F.

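The membership idea can be sketched in a few lines. The fuzzy set "tall" below, with its 150 cm and 190 cm cutoffs and linear ramp, is an invented example for illustration, not one from the slides:

```python
def tall_membership(height_cm: float) -> float:
    """Degree to which a height belongs to the fuzzy set 'tall':
    0 below 150 cm, 1 above 190 cm, linear ramp in between."""
    if height_cm <= 150:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 150) / 40.0

print(tall_membership(150))  # 0.0
print(tall_membership(170))  # 0.5
print(tall_membership(200))  # 1.0
```

Unlike a crisp set, where membership is 0 or 1, intermediate heights here belong to "tall" to an intermediate degree.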
Fuzzy Sets

Classification/Prediction is Fuzzy
(Figure: loan amount vs. accept/reject decision, comparing a simple crisp threshold with fuzzy accept/reject membership.)

Information Retrieval
- Information Retrieval (IR): retrieving desired information from textual data.
  - Library Science
  - Digital Libraries
  - Web Search Engines
- Traditionally keyword based
- Sample query: Find all documents about "data mining".
DM: Similarity measures; mine text/Web data.

Information Retrieval (cont'd)
- Similarity: measure of how close a query is to a document.
- Documents which are "close enough" are retrieved.
- Metrics:
  Precision = |Relevant and Retrieved| / |Retrieved|
  Recall    = |Relevant and Retrieved| / |Relevant|

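The two metrics above translate directly into set arithmetic. The document IDs below are made up for illustration:

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision = |Relevant and Retrieved| / |Retrieved|
    Recall    = |Relevant and Retrieved| / |Relevant|"""
    hit = retrieved & relevant            # relevant AND retrieved
    return len(hit) / len(retrieved), len(hit) / len(relevant)

# 4 documents retrieved, 3 actually relevant, 2 in common:
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5"})
print(p)  # 0.5
print(r)  # 0.666...
```

High precision means few irrelevant documents were returned; high recall means few relevant documents were missed.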
IR Query Result Measures and Classification
(Figure: IR classification of query results.)

Dimensional Modeling
- View data in a hierarchical manner, more as business executives might
- Useful in decision support systems and mining
- Dimension: collection of logically related attributes; axis for modeling data.
- Facts: data stored
- Ex: Dimensions: products, locations, date
  Facts: quantity, unit price
DM: May view data as dimensional.

Relational View of Data

  ProdID  LocID       Date    Quantity  UnitPrice
  123     Dallas      022900  5         25
  123     Houston     020100  10        20
  150     Dallas      031500  1         100
  150     Dallas      031500  5         95
  150     Fort Worth  021000  5         80
  150     Chicago     012000  20        75
  200     Seattle     030100  5         50
  300     Rochester   021500  200       5
  500     Bradenton   022000  15        20
  500     Chicago     012000  10        25

Dimensional Modeling Queries
- Roll Up: more general dimension
- Drill Down: more specific dimension
- Dimension (Aggregation) Hierarchy
- SQL uses aggregation
- Decision Support Systems (DSS): computer systems and tools to assist managers in making decisions and solving problems.

Cube View of Data

Aggregation Hierarchies

Star Schema

Data Warehousing
- Subject-oriented

Operational vs. Informational

                 Operational Data     Data Warehouse
  Application    OLTP                 OLAP
  Use            Precise queries      Ad hoc
  Temporal       Snapshot             Historical
  Modification   Dynamic              Static
  Orientation    Application          Business
  Data           Operational values   Integrated
  Size           Gigabits             Terabits
  Level          Detailed             Summarized
  Access         Often                Less often
  Response       Few seconds          Minutes
  Data Schema    Relational           Star/Snowflake

OLAP
- Online Analytic Processing (OLAP): provides more complex queries than OLTP.
- On-Line Transaction Processing (OLTP): traditional database/transaction processing.
- Dimensional data; cube view
- Visualization of operations:
  - Slice: examine a sub-cube.

OLAP Operations
- Single Cell
- Multiple Cells
- Slice
- Dice
- Roll Up
- Drill Down

Statistics
- Simple descriptive models
- Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
- Exploratory Data Analysis:
  - Data can actually drive the creation of the model
  - Opposite of traditional statistical view
- Data mining targeted to business user
DM: Many data mining methods come from statistical techniques.

Machine Learning
- Machine Learning: area of AI that examines how to write programs that can learn.
- Often used in classification and prediction
- Supervised Learning: learns by example.
- Unsupervised Learning: learns without knowledge of correct answers.
- Machine learning often deals with small static datasets.
DM: Uses many machine learning techniques.

Pattern Matching (Recognition)
- Pattern Matching: finds occurrences of a predefined pattern in the data.
- Applications include speech recognition, information retrieval, time series analysis.
DM: Type of classification.

DM vs. Related Topics

  Area      Query     Data              Results  Output
  DB/OLTP   Precise   Database          Precise  DB objects or aggregation
  IR        Precise   Documents         Vague    Documents
  OLAP      Analysis  Multidimensional  Precise  DB objects or aggregation
  DM        Vague     Preprocessed      Vague    KDD objects

Data Mining Techniques Outline
- Statistical
  - Point Estimation
  - Models Based on Summarization
  - Bayes Theorem
  - Hypothesis Testing
  - Regression and Correlation
- Similarity Measures
- Decision Trees
- Neural Networks
  - Activation Functions
- Genetic Algorithms
Goal: Provide an overview of basic data mining techniques.

Point Estimation
- Point Estimate: estimate a population parameter.
- May be made by calculating the parameter for a sample.
- May be used to predict value for missing data.
- Ex:
  - R contains 100 employees
  - 99 have salary information
  - Mean salary of these is $50,000
  - Use $50,000 as value of remaining employee's salary.
  - Is this a good idea?

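The estimate-then-impute step above is a one-liner in code. The four salaries below are a made-up miniature of the 100-employee example:

```python
# Known salaries (hypothetical data; the slide's example has 99 such values).
salaries = [40000.0, 55000.0, 60000.0, 45000.0]

# Point estimate of the population mean: the sample mean.
point_estimate = sum(salaries) / len(salaries)
print(point_estimate)  # 50000.0

# Impute the missing employee's salary with the point estimate.
completed = salaries + [point_estimate]
```

Whether this is a good idea depends on how representative the sample is; a single outlier in a small sample can badly distort the estimate.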
Jackknife Estimate
- Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of observed values.
- Ex: estimate of mean for x1, …, xn

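The leave-one-out idea can be sketched for the mean. The three data points are an invented example; for the mean, averaging the leave-one-out estimates recovers the ordinary sample mean:

```python
def jackknife_means(xs):
    """Leave-one-out estimates of the mean: omit each x_i in turn."""
    n = len(xs)
    return [(sum(xs) - x) / (n - 1) for x in xs]

xs = [1.0, 4.0, 7.0]
loo = jackknife_means(xs)
jackknife_estimate = sum(loo) / len(loo)  # average of leave-one-out means
print(loo)                 # [5.5, 4.0, 2.5]
print(jackknife_estimate)  # 4.0
```

The spread of the leave-one-out values also indicates how sensitive the estimate is to any single observation.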
Maximum Likelihood Estimate (MLE)
- Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
- Joint probability for observing the sample data is obtained by multiplying the individual probabilities. Likelihood function:
  L(Θ | x1, …, xn) = ∏ f(xi | Θ), for i = 1, …, n
- Maximize L.

MLE Example
- Coin toss five times: {H, H, H, H, T}
- Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:
  (0.5)^4 (0.5) = 0.03125
- However, if the probability of a H is 0.8 then:
  (0.8)^4 (0.2) = 0.08192

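The coin example can be checked numerically. The grid search below is a brute-force stand-in for the calculus that maximizes L; with four heads in five tosses the maximum lands at p = 4/5:

```python
def likelihood(p: float) -> float:
    """Likelihood of observing H,H,H,H,T when P(H) = p."""
    return p**4 * (1 - p)

print(likelihood(0.5))  # 0.03125
print(likelihood(0.8))  # 0.08192 (approximately)

# Brute-force search over a grid of candidate p values:
best = max((i / 1000 for i in range(1001)), key=likelihood)
print(best)  # 0.8
```

So among all candidate values of P(H), 0.8 makes the observed sequence most probable, which is exactly the MLE idea.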
Expectation-Maximization (EM)
- Solves estimation with incomplete data.
- Obtain initial estimates for parameters.
- Iteratively use estimates for missing data and continue until convergence.

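The iterate-until-convergence loop can be illustrated with a deliberately tiny problem: estimating a mean when one value is missing. This is an informal sketch of the E-step/M-step rhythm, not the full EM algorithm; the data and starting estimate are assumptions:

```python
observed = [3.0, 5.0, 10.0]  # known values (hypothetical)
n_missing = 1                # one value is missing
mu = 0.0                     # initial estimate of the mean

for _ in range(100):
    # E-step (informal): fill the missing value with the current estimate.
    filled = observed + [mu] * n_missing
    # M-step (informal): re-estimate the mean from the completed data.
    new_mu = sum(filled) / len(filled)
    if abs(new_mu - mu) < 1e-9:  # continue until convergence
        break
    mu = new_mu

print(mu)  # converges to 6.0, the mean of the observed values
```

Each pass uses the current parameter estimate to complete the data, then uses the completed data to improve the estimate, exactly the loop described above.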
EM Example

EM Algorithm

Bayes Theorem
- Posterior Probability: P(h1 | xi)
- Prior Probability: P(h1)
- Bayes Theorem:
  P(h1 | xi) = P(xi | h1) P(h1) / P(xi)
- Assign probabilities of hypotheses given a data value.

Bayes Theorem Example
- Credit authorizations (hypotheses): h1 authorize purchase, h2 authorize after further identification, h3 do not authorize, h4 do not authorize but contact police
- Assign twelve data values for all combinations of credit and income:

              1    2    3    4
  Excellent   x1   x2   x3   x4
  Good        x5   x6   x7   x8
  Bad         x9   x10  x11  x12

- From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.

' " l ( )d*
7/25/2019 Data Mining Introductionduncam
55/83
'ayes "#a$le(cont)d* %raining Data?
ID Income Credit Ca!! +i6 S Excellent h6 xS! K 7ood h6 xO
K ! Excellent h6 x!S K 7ood h6 xO S 7ood h6 xN
T ! Excellent h6 x!
O K (ad h! x66N ! (ad h! x6"
L K (ad hK x666" 6 (ad hS xL
Bayes Example (cont'd)
- Calculate P(xi | hj) and P(xi)
- Ex: P(x7 | h1) = 2/6; P(x4 | h1) = 1/6; P(x2 | h1) = 2/6; P(x8 | h1) = 1/6; P(xi | h1) = 0 for all other xi.
- Predict the class for x4:
  - Calculate P(hj | x4) for all hj.
  - Place x4 in class with largest value.
  - Ex: P(h1 | x4) = P(x4 | h1) P(h1) / P(x4) = (1/6)(0.6)/0.1 = 1. So x4 is in class h1.

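The whole worked example can be reproduced by counting. The sketch below estimates the prior, evidence, and conditional probabilities from the ten training tuples and then applies Bayes theorem:

```python
from collections import Counter

# Training data from the example: (class, data value) pairs.
training = [("h1", "x4"), ("h1", "x7"), ("h1", "x2"), ("h1", "x7"),
            ("h1", "x8"), ("h1", "x2"), ("h2", "x11"), ("h2", "x10"),
            ("h3", "x11"), ("h4", "x9")]

classes = Counter(h for h, _ in training)  # counts for priors P(h)
values = Counter(x for _, x in training)   # counts for evidence P(x)
pairs = Counter(training)                  # joint counts for P(x | h)
n = len(training)

def posterior(h: str, x: str) -> float:
    """P(h | x) = P(x | h) P(h) / P(x), all estimated by counting."""
    p_h = classes[h] / n
    p_x = values[x] / n
    p_x_given_h = pairs[(h, x)] / classes[h]
    return p_x_given_h * p_h / p_x

print(posterior("h1", "x4"))  # (1/6)(0.6)/0.1 = 1.0
```

The posterior of 1.0 matches the hand calculation on the slide: every training tuple with value x4 belongs to h1.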
Regression
- Predict future values based on past values
- Linear Regression assumes a linear relationship exists:
  y = c0 + c1 x1 + … + cn xn
- Find coefficient values to best fit the data

Linear Regression

Correlation
- Examine the degree to which the values for two variables behave similarly.
- Correlation coefficient r:
  - 1 = perfect correlation
  - -1 = perfect but opposite correlation
  - 0 = no correlation

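The coefficient is the covariance scaled by both standard deviations, which pins it to [-1, 1]. The two toy variable pairs below are assumptions chosen to hit the extremes:

```python
def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0  (perfect correlation)
print(correlation([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect opposite)
```

Values near 0 indicate no linear relationship; note that r only measures linear association.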
Similarity Measures
- Determine similarity between two objects.
- Similarity characteristics
- Alternatively, a distance measure measures how unlike or dissimilar objects are.

Similarity Measures

Distance Measures
- Measure dissimilarity between objects

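Two standard distance measures, Euclidean and Manhattan, can be written directly from their definitions; the points (0, 0) and (3, 4) are an invented example:

```python
def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Manhattan (city-block) distance: summed absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

A small distance corresponds to high similarity, so either measure can serve as the dissimilarity counterpart of a similarity measure.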
Twenty Questions Game

Decision Trees
- Decision Tree (DT):
  - Tree where the root and each internal node is labeled with a question.
  - The arcs represent each possible answer to the associated question.
  - Each leaf node represents a prediction of a solution to the problem.
- Popular technique for classification; a leaf node indicates the class to which the corresponding tuple belongs.

Decision Tree Example

Decision Trees
- A Decision Tree Model is a computational model consisting of three parts:
  - Decision Tree
  - Algorithm to create the tree
  - Algorithm that applies the tree to data
- Creation of the tree is the most difficult part.
- Processing is basically a search similar to that in a binary search tree (although a DT may not be binary).

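The "apply the tree to data" part is the easy search described above: follow the arc matching each answer until a leaf is reached. The weather-style tree and attribute names below are invented for illustration, not from the slides:

```python
# A node is either a class label (leaf) or (question_attribute, {answer: subtree}).
tree = ("outlook", {
    "sunny": ("windy", {"true": "no", "false": "yes"}),
    "rainy": "no",
    "overcast": "yes",
})

def classify(node, tuple_):
    """Walk from the root, following the arc matching the tuple's answer."""
    while not isinstance(node, str):   # internal node: ask its question
        attr, branches = node
        node = branches[tuple_[attr]]  # follow the matching arc
    return node                        # leaf: predicted class

print(classify(tree, {"outlook": "sunny", "windy": "false"}))  # yes
```

Note the tree is not binary here: the root has three arcs, one per possible answer, which is exactly why the search only resembles a binary search tree.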
Decision Tree Algorithm

DT Advantages/Disadvantages
- Advantages:
  - Easy to understand.
  - Easy to generate rules.
- Disadvantages:
  - May suffer from overfitting.
  - Classifies by rectangular partitioning.
  - Does not easily handle nonnumeric data.
  - Can be quite large; pruning is necessary.

Neural Networks
- Based on observed functioning of the human brain.
- Artificial Neural Networks (ANN)
- Our view of neural networks is very simplistic.
- We view a neural network (NN) from a graphical viewpoint.
- Alternatively, a NN may be viewed from the perspective of matrices.
- Used in pattern recognition, speech recognition, computer vision, and classification.

Neural Networks
- Neural Network (NN) is a directed graph F = <V, A> with vertices V = {1, 2, …, n} and arcs A = {<i, j> | 1 <= i, j <= n}, with the following restrictions:
  - V is partitioned into a set of input nodes, hidden nodes, and output nodes.
  - The vertices are also partitioned into layers.
  - Any arc <i, j> must have node i in layer h - 1 and node j in layer h.

Neural Network Example

NN Node

NN Activation Functions
- Functions associated with nodes in the graph.
- Output may be in the range [-1, 1] or [0, 1].

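Two common activation functions illustrate the two output ranges: the logistic sigmoid maps any input into (0, 1), and tanh maps into (-1, 1):

```python
import math

def sigmoid(s: float) -> float:
    """Logistic activation: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def tanh_act(s: float) -> float:
    """Hyperbolic tangent activation: squashes any input into (-1, 1)."""
    return math.tanh(s)

print(sigmoid(0.0))   # 0.5
print(tanh_act(0.0))  # 0.0
```

In both cases the input s is typically the weighted sum of the node's incoming arc values, and the function's output is what the node passes along its outgoing arcs.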
NN Activation Functions

NN Learning
- Propagate input values through the graph.
- Compare output to desired output.
- Adjust weights in the graph accordingly.

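The three steps above can be sketched with the smallest possible "network": one linear node with one weight. The training pairs, learning rate, and error-times-input update rule are assumptions for illustration, not the full backpropagation algorithm:

```python
# One linear node trained on (input, desired-output) pairs: output = w * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical; true w is 2
w, rate = 0.0, 0.05

for _ in range(200):
    for x, desired in samples:
        output = w * x             # propagate input through the "graph"
        error = desired - output   # compare output to desired output
        w += rate * error * x      # adjust the weight accordingly

print(round(w, 3))  # converges to about 2.0
```

A multi-layer network repeats the same compare-and-adjust cycle, with the error propagated backwards through the layers to update every weight.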
Neural Networks
- A Neural Network Model is a computational model consisting of three parts:
  - Neural Network graph
  - Learning algorithm that indicates how learning takes place.
  - Recall techniques that determine how information is obtained from the network.
- We will look at propagation as the recall technique.

NN Advantages
- Learning
- Can continue learning even after the training set has been applied.
- Easy parallelization
- Solves many problems

NN Disadvantages
- Difficult to understand
- May suffer from overfitting
- Structure of graph must be determined a priori.
- Input values must be numeric.
- Verification difficult.

Genetic Algorithms
- Optimization search type algorithms.
- Creates an initial feasible solution and iteratively creates new "better" solutions.
- Based on human evolution and survival of the fittest.
- Must represent a solution as an individual.
- Individual: string I = I1, I2, …, In where Ij is in a given alphabet A.
- Each character Ij is called a gene.
- Population: set of individuals.

Genetic Algorithms
- A Genetic Algorithm (GA) is a computational model consisting of five parts:
  - A starting set of individuals, P.
  - Crossover: technique to combine two parents to create offspring.
  - Mutation: randomly change an individual.
  - Fitness: determine the best individuals.
  - Algorithm which applies the crossover and mutation techniques to P iteratively, using the fitness function to determine the best individuals in P to keep.

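The crossover and mutation operators are simple string manipulations on the gene strings defined above. The single-point crossover variant and the binary parents below are illustrative assumptions; many other crossover schemes exist:

```python
import random

def crossover(p1: str, p2: str, point: int) -> tuple:
    """Single-point crossover: swap the tails of two parent strings."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(individual: str, alphabet: str, rng: random.Random) -> str:
    """Randomly replace one gene with a symbol from the alphabet."""
    i = rng.randrange(len(individual))
    return individual[:i] + rng.choice(alphabet) + individual[i + 1:]

offspring = crossover("00000", "11111", 2)
print(offspring)  # ('00111', '11000')
```

A full GA would repeatedly apply these operators to the population and keep the fittest offspring, per the five-part definition above.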
Crossover Examples

Genetic Algorithm

GA Advantages/Disadvantages
- Advantages:
  - Easily parallelized
- Disadvantages:
  - Difficult to understand and explain to end users.
  - Abstraction of the problem and method to represent individuals is quite difficult.
  - Determining the fitness function is difficult.
  - Determining how to perform crossover and mutation is difficult.