Data Mining Introductionduncam

Feb 28, 2018

  • 7/25/2019 Data Mining Introductionduncam

    1/83

    DATA MINING

    Introductory and Advanced Topics

    Part I

    Source:

    Margaret H. Dunham

    Department of Computer Science and Engineering

    Southern Methodist University

    Companion slides for the text by Dr. M. H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

    Data Mining Outline

    PART I
        Introduction
        Related Concepts
        Data Mining Techniques
    PART II
        Classification
        Clustering
        Association Rules
    PART III
        Web Mining
        Spatial Mining
        Temporal Mining

    Introduction Outline

    Define data mining
    Data mining vs. databases
    Basic data mining tasks
    Data mining development
    Data mining issues

    Goal: Provide an overview of data mining.

    Introduction

    Data is growing at a phenomenal rate
    Users expect more sophisticated information
    How?

    UNCOVER HIDDEN INFORMATION
    DATA MINING

    Data Mining Definition

    Finding hidden information in a database
    Fit data to a model
    Similar terms
        Exploratory data analysis
        Data driven discovery
        Deductive learning

    Database Processing vs. Data Mining Processing

    Database processing
        Query: well defined; SQL
        Data: operational data
        Output: precise; subset of database

    Data mining processing
        Query: poorly defined; no precise query language
        Data: not operational data
        Output: fuzzy; not a subset of database

    Query Examples

    Database
        Find all credit applicants with last name of Smith.
        Identify customers who have purchased more than $10,000 in the last month.
        Find all customers who have purchased milk.

    Data Mining
        Find all credit applicants who are poor credit risks. (classification)
        Identify customers with similar buying habits. (clustering)
        Find all items which are frequently purchased with milk. (association rules)

    Data Mining Models and Tasks

    Basic Data Mining Tasks

    Classification maps data into predefined groups or classes.
        Supervised learning
        Pattern recognition
        Prediction
    Regression is used to map a data item to a real valued prediction variable.
    Clustering groups similar data together into clusters.
        Unsupervised learning
        Segmentation
        Partitioning

    Basic Data Mining Tasks (cont'd)

    Summarization maps data into subsets with associated simple descriptions.
        Characterization
        Generalization
    Link Analysis uncovers relationships among data.
        Affinity Analysis
        Association Rules
        Sequential Analysis determines sequential patterns.

    Ex: Time Series Analysis

    Example: Stock Market
    Predict future values
    Determine similar patterns over time
    Classify behavior

    Data Mining vs. KDD

    Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
    Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.

    KDD Process

    Selection: Obtain data from various sources.
    Preprocessing: Cleanse data.
    Transformation: Convert to common format. Transform to new format.
    Data Mining: Obtain desired results.
    Interpretation/Evaluation: Present results to user in meaningful manner.

    Modified from [FPSS96C]

    KDD Process Ex: Web Log

    Selection:
        Select log data (dates and locations) to use
    Preprocessing:
        Remove identifying URLs
        Remove error logs
    Transformation:
        Sessionize (sort and group)
    Data Mining:
        Identify and count patterns
        Construct data structure
    Interpretation/Evaluation:
        Identify and display frequently accessed sequences.
    Potential User Applications:
        Cache prediction
        Personalization
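
    The transformation and mining steps of the web log example can be sketched as follows. This is a minimal illustration, not the method from the text: the record layout `(user, timestamp, page)`, the 30-minute session timeout, and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def sessionize(log, timeout=1800):
    """Sort and group (user, timestamp, page) records into sessions.

    A new session starts when two consecutive requests from the same user
    are more than `timeout` seconds apart (an assumed cutoff).
    """
    by_user = defaultdict(list)
    for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))

    sessions = []
    for hits in by_user.values():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def count_pairs(sessions):
    """Count how often one page is followed directly by another."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


# Hypothetical log records: (user, seconds since start, page)
log = [
    ("u1", 0, "/home"), ("u1", 60, "/products"), ("u1", 5000, "/home"),
    ("u2", 10, "/home"), ("u2", 70, "/products"),
]
sessions = sessionize(log)
pairs = count_pairs(sessions)
```

    Counting consecutive page pairs is only the simplest pattern; longer sequences would need a richer data structure.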

    Data Mining Development

    Similarity Measures
    Hierarchical Clustering
    IR Systems
    Imprecise Queries
    Textual Data
    Web Search Engines
    Bayes Theorem
    Regression Analysis
    EM Algorithm
    K

    KDD Issues

    Human Interaction
    Overfitting
    Outliers
    Interpretation
    Visualization
    Large Datasets
    High Dimensionality

    KDD Issues (cont'd)

    Multimedia Data
    Missing Data
    Irrelevant Data
    Noisy Data
    Changing Data
    Integration
    Application

    Social Implications of DM

    Privacy
    Profiling
    Unauthorized use

    Data Mining Metrics

    Usefulness
    Return on Investment (ROI)
    Accuracy
    Space/Time

    Database Perspective on Data Mining

    Scalability
    Real World Data
    Updates
    Ease of Use

    Visualization Techniques

    Graphical
    Geometric
    Icon

    Related Concepts Outline

    Database/OLTP Systems
    Fuzzy Sets and Logic
    Information Retrieval (Web Search Engines)
    Dimensional Modeling
    Data Warehousing
    OLAP/DSS
    Statistics
    Machine Learning
    Pattern Matching

    Goal: Examine some areas which are related to data mining.

    DB & OLTP Systems

    Schema
        (ID, Name, Address, Salary, JobNo)
    Data Model
        ER
        Relational
    Transaction
    Query:
        SELECT Name
        FROM T
        WHERE Salary > 100000

    DM: Only imprecise queries

    Fuzzy Sets and Logic

    Fuzzy Set: Set membership function is a real valued function with output in the range [0,1].
    f(x): Probability x is in F.
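
    A membership function with output in [0,1] can be sketched as below. The fuzzy set "tall", the linear ramp, and the thresholds of 160 cm and 190 cm are hypothetical choices for illustration; the crisp version is shown only for contrast.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Degree of membership in the fuzzy set 'tall', a value in [0, 1].

    Below `low` the membership is 0, above `high` it is 1, and it rises
    linearly in between. Both thresholds are assumed values.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


def crisp_tall(height_cm, cutoff=175.0):
    """Ordinary (non-fuzzy) set membership: either in the set or not."""
    return 1.0 if height_cm >= cutoff else 0.0


memberships = [tall_membership(h) for h in (150, 175, 200)]
```

    The fuzzy version distinguishes someone just over the cutoff from someone far above it, which the crisp version cannot do.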

    Fuzzy Sets

    Classification/Prediction is Fuzzy

    [Figure: Accept/Reject decision against loan amount; panels: Simple, Fuzzy]

    Information Retrieval

    Information Retrieval (IR): retrieving desired information from textual data.
        Library Science
        Digital Libraries
        Web Search Engines
    Traditionally keyword based
    Sample query:
        Find all documents about "data mining".

    DM: Similarity measures; Mine text/Web data.

    Information Retrieval (cont'd)

    Similarity: measure of how close a query is to a document.
    Documents which are "close enough" are retrieved.
    Metrics:
        Precision = |Relevant and Retrieved| / |Retrieved|
        Recall = |Relevant and Retrieved| / |Relevant|
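
    The two metrics translate directly into set operations. The sketch below is illustrative only; the document identifiers are made up, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query.

    precision = |relevant and retrieved| / |retrieved|
    recall    = |relevant and retrieved| / |relevant|
    """
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 3 documents retrieved, 2 of them relevant,
# out of 4 relevant documents in the collection.
precision, recall = precision_recall(
    relevant={"d1", "d2", "d3", "d4"},
    retrieved={"d2", "d3", "d5"},
)
```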

    IR Query Result Measures and Classification

    [Figure panels: IR, Classification]

    Dimensional Modeling

    View data in a hierarchical manner more as business executives might
    Useful in decision support systems and mining
    Dimension: collection of logically related attributes; axis for modeling data.
    Facts: data stored
    Ex:
        Dimensions: products, locations, date
        Facts: quantity, unit price

    DM: May view data as dimensional.

    Relational View of Data

    ProdID  LocID       Date    Quantity  UnitPrice
    123     Dallas      022900  5         25
    123     Houston     020100  10        20
    150     Dallas      031500  1         100
    150     Dallas      031500  5         95
    150     Fort Worth  021000  5         80
    150     Chicago     012000  20        75
    200     Seattle     030100  5         50
    300     Rochester   021500  200       5
    500     Bradenton   022000  15        20
    500     Chicago     012000  10        25

    Dimensional Modeling Queries

    Roll Up: more general dimension
    Drill Down: more specific dimension
    Dimension (Aggregation) Hierarchy
    SQL uses aggregation
    Decision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems.
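
    Roll up and drill down amount to aggregating the same facts at different levels of a dimension. The sketch below uses rows as read from the Relational View of Data slide (column order ProdID, LocID, Date, Quantity, UnitPrice); the function name and the choice to sum Quantity are assumptions for illustration.

```python
from collections import defaultdict

# (ProdID, LocID, Date, Quantity, UnitPrice)
rows = [
    (123, "Dallas", "022900", 5, 25),
    (123, "Houston", "020100", 10, 20),
    (150, "Dallas", "031500", 1, 100),
    (150, "Dallas", "031500", 5, 95),
    (150, "Fort Worth", "021000", 5, 80),
    (150, "Chicago", "012000", 20, 75),
    (200, "Seattle", "030100", 5, 50),
    (300, "Rochester", "021500", 200, 5),
    (500, "Bradenton", "022000", 15, 20),
    (500, "Chicago", "012000", 10, 25),
]


def total_quantity(rows, key):
    """Sum the Quantity fact for each value of the grouping key."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Drill down: the more specific level (product and location).
by_product_city = total_quantity(rows, key=lambda r: (r[0], r[1]))
# Roll up: the more general level (product only).
by_product = total_quantity(rows, key=lambda r: r[0])
```

    In SQL the same roll up would be a `GROUP BY` with fewer columns.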

    Cube View of Data

    Aggregation Hierarchies

    Star Schema

    Data Warehousing

    Subject

    Operational vs. Informational

                  Operational Data    Data Warehouse
    Application   OLTP                OLAP
    Use           Precise Queries     Ad Hoc
    Temporal      Snapshot            Historical
    Modification  Dynamic             Static
    Orientation   Application         Business
    Data          Operational Values  Integrated
    Size          Gigabits            Terabits
    Level         Detailed            Summarized
    Access        Often               Less Often
    Response      Few Seconds         Minutes
    Data Schema   Relational          Star/Snowflake

    OLAP

    Online Analytic Processing (OLAP): provides more complex queries than OLTP.
    OnLine Transaction Processing (OLTP): traditional database/transaction processing.
    Dimensional data; cube view
    Visualization of operations:
        Slice: examine sub

    OLAP Operations

    Single Cell    Multiple Cells    Slice    Dice
    Roll Up
    Drill Down

    Statistics

    Simple descriptive models
    Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
    Exploratory Data Analysis:
        Data can actually drive the creation of the model
        Opposite of traditional statistical view.
    Data mining targeted to business user

    DM: Many data mining methods come from statistical techniques.

    Machine Learning

    Machine Learning: area of AI that examines how to write programs that can learn.
    Often used in classification and prediction
    Supervised Learning: learns by example.
    Unsupervised Learning: learns without knowledge of correct answers.
    Machine learning often deals with small static datasets.

    DM: Uses many machine learning techniques.

    Pattern Matching (Recognition)

    Pattern Matching: finds occurrences of a predefined pattern in the data.
    Applications include speech recognition, information retrieval, time series analysis.

    DM: Type of classification.

    DM vs. Related Topics

    Area     Query     Data              Results  Output
    DB/OLTP  Precise   Database          Precise  DB Objects or Aggregation
    IR       Precise   Documents         Vague    Documents
    OLAP     Analysis  Multidimensional  Precise  DB Objects or Aggregation
    DM       Vague     Preprocessed      Vague    KDD Objects

    Data Mining Techniques Outline

    Statistical
        Point Estimation
        Models Based on Summarization
        Bayes Theorem
        Hypothesis Testing
        Regression and Correlation
    Similarity Measures
    Decision Trees
    Neural Networks
        Activation Functions
    Genetic Algorithms

    Goal: Provide an overview of basic data mining techniques.

    Point Estimation

    Point Estimate: estimate a population parameter.
    May be made by calculating the parameter for a sample.
    May be used to predict value for missing data.
    Ex:
        R contains 100 employees
        99 have salary information
        Mean salary of these is $50,000
        Use $50,000 as value of remaining employee's salary.
        Is this a good idea?
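
    Filling a missing value with the sample mean can be sketched in a few lines. The salary figures below are hypothetical and the list is far smaller than the 100 employees in the example; `None` marks the missing value.

```python
from statistics import mean

# Hypothetical salaries; None marks the employee with no salary on record.
salaries = [42_000, 55_000, None, 61_000, 48_000]

known = [s for s in salaries if s is not None]
estimate = mean(known)  # point estimate of the population mean

# Use the estimate as the value for every missing salary.
filled = [estimate if s is None else s for s in salaries]
```

    Whether this is a good idea depends on the data: the mean says nothing about how much individual salaries vary, so the filled value may be far from the true one.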

    Jackknife Estimate

    Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of observed values.
    Ex: estimate of mean for X = {x1, ..., xn}
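
    For the mean, the leave-one-out idea can be sketched as follows. The function names and sample values are illustrative assumptions; averaging the leave-one-out means is one common way to combine them, not necessarily the form used in the text.

```python
def leave_one_out_means(values):
    """Mean of the sample with each value omitted in turn."""
    n = len(values)
    total = sum(values)
    return [(total - v) / (n - 1) for v in values]


def jackknife_mean(values):
    """Average of the leave-one-out means."""
    estimates = leave_one_out_means(values)
    return sum(estimates) / len(estimates)


data = [1, 4, 4, 7]  # hypothetical observations
loo = leave_one_out_means(data)
estimate = jackknife_mean(data)
```

    For the mean the result equals the ordinary sample mean; the spread of the leave-one-out values indicates how sensitive the estimate is to single observations.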

    Maximum Likelihood Estimate (MLE)

    Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
    Joint probability for observing the sample data is obtained by multiplying the individual probabilities.
    Likelihood function:
        L(Θ | x1, ..., xn) = f(x1 | Θ) f(x2 | Θ) ... f(xn | Θ)
    Maximize L.

    MLE Example

    Coin toss five times: {H,H,H,H,T}
    Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:
        L = (0.5)^5 = 0.03125
    However if the probability of a H is 0.8 then:
        L = (0.8)^4 (0.2) = 0.08192
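
    The coin-toss likelihoods can be checked numerically by multiplying the individual probabilities, assuming independent tosses. The function name is an assumption; the closed-form estimate (fraction of heads) is the standard maximum likelihood estimate for a coin.

```python
def likelihood(p_heads, tosses):
    """Probability of observing `tosses` if P(H) = p_heads, tosses independent."""
    result = 1.0
    for toss in tosses:
        result *= p_heads if toss == "H" else 1.0 - p_heads
    return result


tosses = ["H", "H", "H", "H", "T"]
fair = likelihood(0.5, tosses)    # perfect coin
biased = likelihood(0.8, tosses)  # P(H) = 0.8

# For a coin, the likelihood is maximized at the observed fraction of heads.
p_mle = tosses.count("H") / len(tosses)
```

    The sequence is more likely under P(H) = 0.8 than under a fair coin, and 0.8 is also the value that maximizes the likelihood for this sample.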

    Expectation-Maximization (EM)

    Solves estimation with incomplete data.
    Obtain initial estimates for parameters.
    Iteratively use estimates for missing data and continue until convergence.
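
    The iteration can be sketched for the simplest case, estimating a mean when some values are missing. The data, the initial guess, and the convergence tolerance are assumptions chosen for illustration.

```python
def em_mean(observed, n_missing, initial=0.0, tol=1e-6, max_iter=1000):
    """Estimate the mean of a sample in which `n_missing` values are unknown."""
    n = len(observed) + n_missing
    mu = initial
    for _ in range(max_iter):
        # Expectation: stand in the current estimate for each missing value.
        expected_total = sum(observed) + n_missing * mu
        # Maximization: re-estimate the mean from the completed data.
        new_mu = expected_total / n
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Hypothetical sample: four observed values and two missing ones.
mu = em_mean([1, 5, 10, 4], n_missing=2, initial=3.0)
```

    Here the estimates move from 3.0 toward 5, the mean of the observed values, which is the fixed point of the update.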

    EM Example

    EM Algorithm

    Bayes Theorem

    Posterior Probability: P(h1|xi)
    Prior Probability: P(h1)
    Bayes Theorem:
        P(h1|xi) = P(xi|h1) P(h1) / P(xi)
    Assign probabilities of hypotheses given a data value.

    Bayes Theorem Example

    Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police
    Assign twelve data values for all combinations of credit and income:

                   1    2    3    4
        Excellent  x1   x2   x3   x4
        Good       x5   x6   x7   x8
        Bad        x9   x10  x11  x12

    From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.

    Bayes Example (cont'd)

    Training Data:

        ID  Income  Credit     Class  xi
        1   4       Excellent  h1     x4
        2   3       Good       h1     x7
        3   2       Excellent  h1     x2
        4   3       Good       h1     x7
        5   4       Good       h1     x8
        6   2       Excellent  h1     x2
        7   3       Bad        h2     x11
        8   2       Bad        h2     x10
        9   3       Bad        h3     x11
        10  1       Bad        h4     x9

    Bayes Example (cont'd)

    Calculate P(xi|hj) and P(xi)
    Ex: P(x7|h1) = 2/6; P(x4|h1) = 1/6; P(x2|h1) = 2/6; P(x8|h1) = 1/6; P(xi|h1) = 0 for all other xi.
    Predict the class for x4:
        Calculate P(hj|x4) for all hj.
        Place x4 in class with largest value.
        Ex: P(h1|x4) = (P(x4|h1))(P(h1)) / P(x4) = (1/6)(0.6) / 0.1 = 1.
        x4 in class h1.
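
    The worked example can be reproduced by counting over the training data. The sketch below lists each training row as a (class, data value) pair as read from the training data slide; the function and variable names are assumptions.

```python
from collections import Counter

# (class, data value) for training rows 1 to 10
training = [
    ("h1", "x4"), ("h1", "x7"), ("h1", "x2"), ("h1", "x7"), ("h1", "x8"),
    ("h1", "x2"), ("h2", "x11"), ("h2", "x10"), ("h3", "x11"), ("h4", "x9"),
]


def posterior(h, x, data):
    """P(h | x) = P(x | h) P(h) / P(x), with all terms estimated by counting."""
    n = len(data)
    class_counts = Counter(c for c, _ in data)
    value_counts = Counter(v for _, v in data)
    joint_counts = Counter(data)
    if value_counts[x] == 0:
        raise ValueError(f"{x} does not occur in the training data")
    prior = class_counts[h] / n                      # P(h)
    conditional = joint_counts[(h, x)] / class_counts[h]  # P(x | h)
    evidence = value_counts[x] / n                   # P(x)
    return conditional * prior / evidence


hypotheses = ["h1", "h2", "h3", "h4"]
p_h1_x4 = posterior("h1", "x4", training)
predicted = max(hypotheses, key=lambda h: posterior(h, "x4", training))
```

    A value such as x11, which appears under both h2 and h3, gives a posterior of 0.5 for each, so the prediction is less clear-cut than for x4.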

    Regression

    Predict future values based on past values
    Linear Regression assumes linear relationship exists.
        y = c0 + c1 x1 + ... + cn xn
    Find values to best fit the data
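
    For a single predictor, y = c0 + c1 x, the best-fit coefficients have a closed form under the least squares criterion. Least squares is an assumed choice of "best fit" here, and the data points are made up.

```python
def fit_line(xs, ys):
    """Least squares estimates (c0, c1) for the model y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1


# Hypothetical past values that lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
c0, c1 = fit_line(xs, ys)

# Predict a future value from the fitted line.
prediction = c0 + c1 * 5
```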

    Linear Regression

    Correlation

    Examine the degree to which the values for two variables behave similarly.
    Correlation coefficient r:
        1 = perfect correlation
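
    The coefficient can be computed as below, assuming r is the usual Pearson correlation coefficient (the formula itself does not survive in this transcript). The sample values are hypothetical.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r_same = correlation([1, 2, 3, 4], [2, 4, 6, 8])      # move together
r_opposite = correlation([1, 2, 3, 4], [8, 6, 4, 2])  # move in opposite directions
```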

    Similarity Measures

    Determine similarity between two objects.
    Similarity characteristics:
    Alternatively, distance measures measure how unlike or dissimilar objects are.

    Similarity Measures

    Distance Measures

    Measure dissimilarity between objects
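
    Two widely used distance measures are sketched below. Euclidean and Manhattan distance are assumed here as representative choices; the formulas on the original slide do not survive in this transcript.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
```

    Both return 0 for identical objects and grow as the objects become less alike.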

    Twenty Questions Game

    Decision Trees

    Decision Tree (DT):
        Tree where the root and each internal node is labeled with a question.
        The arcs represent each possible answer to the associated question.
        Each leaf node represents a prediction of a solution to the problem.
    Popular technique for classification; leaf node indicates class to which the corresponding tuple belongs.

    Decision Tree Example

    Decision Trees

    A Decision Tree Model is a computational model consisting of three parts:
        Decision Tree
        Algorithm to create the tree
        Algorithm that applies the tree to data
    Creation of the tree is the most difficult part.
    Processing is basically a search similar to that in a binary search tree (although DT may not be binary).
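
    Applying a tree to data is a walk from the root to a leaf. The sketch below covers only that third part; the tree is built by hand, and the attributes (`credit`, `income`) and class labels are hypothetical.

```python
class Node:
    """An internal node (question plus branches) or a leaf (label only)."""

    def __init__(self, question=None, branches=None, label=None):
        self.question = question        # function: record -> answer
        self.branches = branches or {}  # answer -> child Node
        self.label = label              # class label, set only on leaves


def classify(node, record):
    """Follow the arc matching each answer until a leaf is reached."""
    while node.label is None:
        node = node.branches[node.question(record)]
    return node.label


# Hand-built, non-binary tree: the root has three arcs.
tree = Node(
    question=lambda r: r["credit"],
    branches={
        "Excellent": Node(label="accept"),
        "Bad": Node(label="reject"),
        "Good": Node(
            question=lambda r: r["income"] >= 3,
            branches={True: Node(label="accept"), False: Node(label="reject")},
        ),
    },
)

result = classify(tree, {"credit": "Good", "income": 2})
```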

    Decision Tree Algorithm

    DT Advantages/Disadvantages

    Advantages:
        Easy to understand.
        Easy to generate rules
    Disadvantages:
        May suffer from overfitting.
        Classifies by rectangular partitioning.
        Does not easily handle nonnumeric data.
        Can be quite large; pruning is necessary.

    Neural Networks

    Based on observed functioning of human brain.
    (Artificial Neural Networks (ANN))
    Our view of neural networks is very simplistic.
    We view a neural network (NN) from a graphical viewpoint.
    Alternatively, a NN may be viewed from the perspective of matrices.
    Used in pattern recognition, speech recognition, computer vision, and classification.

    Neural Networks

    Neural Network (NN) is a directed graph F = <V, A> with vertices V = {1, 2, ..., n} and arcs A = {<i, j> | 1 <= i, j <= n}, with the following restrictions:
        V is partitioned into a set of input nodes, V_I, hidden nodes, V_H, and output nodes, V_O.
        The vertices are also partitioned into layers
        Any arc <i, j> must have node i in layer h

    Neural Network Example

    NN Node

    NN Activation Functions

    Functions associated with nodes in graph.
    Output may be in range [

    NN Activation Functions

    NN Learning

    Propagate input values through graph.
    Compare output to desired output.
    Adjust weights in graph accordingly.
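
    The three learning steps can be sketched with a single node. This is a deliberately small illustration: one node with a step activation function and the perceptron update rule are assumed choices, and the training samples (logical OR) are made up.

```python
def step(z):
    """Step activation function: output is 0 or 1."""
    return 1 if z > 0 else 0


def predict(weights, bias, inputs):
    """Propagate input values through the single node."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(samples, epochs=20, rate=0.1):
    """Learn weights for one node from (inputs, desired output) samples."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)  # propagate
            error = target - output                  # compare to desired output
            # Adjust weights in proportion to the error and the input.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in samples]
```

    A single node can only learn linearly separable targets; networks with hidden nodes need a different update rule such as backpropagation.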

    Neural Networks

    A Neural Network Model is a computational model consisting of three parts:
        Neural Network graph
        Learning algorithm that indicates how learning takes place.
        Recall techniques that determine how information is obtained from the network.
    We will look at propagation as the recall technique.

    NN Advantages

    Learning
    Can continue learning even after training set has been applied.
    Easy parallelization
    Solves many problems

    NN Disadvantages

    Difficult to understand
    May suffer from overfitting
    Structure of graph must be determined a priori.
    Input values must be numeric.
    Verification difficult.

    Genetic Algorithms

    Optimization search type algorithms.
    Creates an initial feasible solution and iteratively creates new "better" solutions.
    Based on human evolution and survival of the fittest.
    Must represent a solution as an individual.
    Individual: string I = I1, I2, ..., In where Ij is in given alphabet A.
    Each character Ij is called a gene.
    Population: set of individuals.

    Genetic Algorithms

    A Genetic Algorithm (GA) is a computational model consisting of five parts:
        A starting set of individuals, P.
        Crossover: technique to combine two parents to create offspring.
        Mutation: randomly change an individual.
        Fitness: determine the best individuals.
        Algorithm which applies the crossover and mutation techniques to P iteratively, using the fitness function to determine the best individuals in P to keep.
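
    The five parts can be sketched on a toy problem. Everything specific below is an assumption for illustration: the alphabet {0, 1}, a fitness function that counts 1s, single-point crossover, a 5% mutation rate, and keeping the better half of the population each generation.

```python
import random


def fitness(individual):
    """Toy fitness: the number of 1 genes."""
    return sum(individual)


def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two parents."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(individual, rng, rate=0.05):
    """Flip each gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in individual]


def evolve(population, generations, rng):
    """Iteratively keep the fittest half and refill with mutated offspring."""
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: size // 2]
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            children.extend(mutate(c, rng) for c in crossover(a, b, rng))
        population = (parents + children)[:size]
    return max(population, key=fitness)


rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(12)] for _ in range(10)]
initial_best = max(fitness(p) for p in population)
best = evolve(population, generations=30, rng=rng)
```

    Because the fittest individuals are carried over unchanged, the best fitness never decreases from one generation to the next.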

    Crossover Examples

    Genetic Algorithm

    GA Advantages/Disadvantages

    Advantages
        Easily parallelized
    Disadvantages
        Difficult to understand and explain to end users.
        Abstraction of the problem and method to represent individuals is quite difficult.
        Determining fitness function is difficult.
        Determining how to perform crossover and mutation is difficult.