A STUDY OF DATA CLUSTERING AND THE K-MEANS FAMILY OF ALGORITHMS

Apr 06, 2018

hosyky
Transcript
  • 8/3/2019 A Study of Data Clustering and the K-Means Family of Algorithms

    A STUDY OF DATA CLUSTERING AND THE K-MEANS FAMILY OF ALGORITHMS

    DATA CLUSTERING

    Data clustering is a task in data mining.

    Clustering lets us organize data into a system so that it is no longer fragmented.

    For a large, fragmented database, clustering is essential and practically indispensable.

    THE PURPOSE OF CLUSTERING

    The purpose of data clustering is to discover the structure of the data so that a large data collection can be divided into meaningful groups.

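    REQUIREMENTS OF DATA CLUSTERING

    Clustering groups the data so that objects within the same cluster are similar to one another, while objects in different clusters are dissimilar.

    The notion of similarity is defined by the user. It is determined from the attributes that describe each object, and it is usually measured as a distance between objects.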

    REQUIREMENTS OF DATA CLUSTERING

    Ability to cluster incrementally, independently of the input data

    Ability to handle high-dimensional data

    Ability to cluster under constraints

    Interpretability and usability

    CLASSIFICATION OF CLUSTERING METHODS

    Partitioning: partitions are created and then evaluated according to some criterion.

    Hierarchical: the set of data objects is decomposed hierarchically according to some criterion.

    Density-based: based on connectivity and density functions.

    Grid-based: based on a multiple-level granularity structure.

    Model-based: a model is hypothesized for each cluster, and its parameters are then adjusted so that the model best fits the data objects of that cluster.

    METHODS FOR EVALUATING DATA CLUSTERING

    External validation

    Evaluates the clustering result against a structure specified in advance for the data set.

    Measures: Rand statistic, Jaccard coefficient, Fowlkes and Mallows index

    Internal validation

    Evaluates the clustering result in terms of quantities that involve the vectors of the data set itself (the proximity matrix).

    Measures: Hubert's statistic, Silhouette index, Dunn's index, ...

    Relative validation

    Evaluates the clustering result by comparing it with other clustering results obtained with different parameter settings.

    Criteria for evaluating and selecting the best clustering result

    - Compactness: the objects within a cluster should be close to one another.

    - Separation: the clusters should be far from one another.
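    As an illustration of external validation, the sketch below computes the Rand statistic and the Jaccard coefficient by counting object pairs. It assumes the usual pair-counting definitions (a pair is counted by whether both labelings put its two objects together or apart); the function names and the toy labels are made up for this example.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count object pairs by how the reference and the clustering treat them.

    ss: same group in both, sd: same in truth only,
    ds: same in pred only,  dd: different in both.
    """
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            ss += 1
        elif same_truth:
            sd += 1
        elif same_pred:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_statistic(truth, pred):
    """Fraction of pairs on which the two labelings agree."""
    ss, sd, ds, dd = pair_counts(truth, pred)
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard_coefficient(truth, pred):
    """Like the Rand statistic, but ignoring pairs separated in both labelings.

    Undefined (division by zero) if no pair is grouped together anywhere.
    """
    ss, sd, ds, _ = pair_counts(truth, pred)
    return ss / (ss + sd + ds)


# Hypothetical reference structure and clustering result for six objects.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(rand_statistic(truth, pred), jaccard_coefficient(truth, pred))
```

    Both measures lie in [0, 1], and larger values mean closer agreement with the reference structure.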

    METHODS FOR EVALUATING DATA CLUSTERING

    Evaluation by entropy (the value is small when the clustering quality is good)

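    $$\mathrm{Entropy}(I) = -\sum_i p_i \sum_j p_{ij} \log(p_{ij}) = -\sum_i \frac{n_i}{n} \sum_j \frac{n_{ij}}{n_i} \log\left(\frac{n_{ij}}{n_i}\right)$$

    where $p_i = n_i / n$ and $p_{ij} = n_{ij} / n_i$.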
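    A minimal sketch of the entropy measure follows. The slide does not define the counts, so this example assumes the common reading: n is the number of objects, n_i the size of cluster i, and n_ij the number of objects in cluster i that carry reference class j. The natural logarithm is also an assumption, since the slide does not state the base.

```python
import math
from collections import Counter


def clustering_entropy(clusters, classes):
    """Entropy of a clustering against reference classes (assumed definitions).

    clusters[t] is the cluster of object t, classes[t] its reference class.
    """
    n = len(clusters)
    members = {}
    for cluster, label in zip(clusters, classes):
        members.setdefault(cluster, []).append(label)

    total = 0.0
    for labels in members.values():
        n_i = len(labels)
        inner = 0.0
        for n_ij in Counter(labels).values():
            p_ij = n_ij / n_i
            inner += p_ij * math.log(p_ij)
        total += (n_i / n) * inner  # weight each cluster by p_i = n_i / n
    return -total


# Pure clusters give 0; clusters that mix two classes evenly give log(2).
print(clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))
print(clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"]))
```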

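    ISSUES TO BE ADDRESSED

    Representing data types

    + We consider only the data types that are needed for clustering.

    + We define d(i,j) as the distance between two objects i and j, with the following properties:

    $d(i,j) \ge 0$

    $d(i,i) = 0$

    $d(i,j) = d(j,i)$

    $d(i,j) \le d(i,k) + d(k,j)$, where k is any point other than i and j.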

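    ISSUES TO BE ADDRESSED

    Objects i and j are represented by the vectors x and y. The similarity between i and j is computed with the following formula:

    $x = (x_1, \dots, x_p)$

    $y = (y_1, \dots, y_p)$

    $$s(x, y) = \frac{x_1 y_1 + \dots + x_p y_p}{\sqrt{x_1^2 + \dots + x_p^2}\,\sqrt{y_1^2 + \dots + y_p^2}}$$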
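    The similarity formula above is the cosine of the angle between the two vectors. A small sketch, with a function name chosen only for this example:

```python
import math


def cosine_similarity(x, y):
    """s(x, y) = (x . y) / (|x| * |y|) for two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Raises ZeroDivisionError if either vector is all zeros.
    return dot / (norm_x * norm_y)


print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # parallel vectors
```

    Orthogonal vectors have similarity 0 and parallel vectors have similarity 1 (up to floating-point rounding), regardless of their lengths.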

    ISSUES TO BE ADDRESSED

    Interval-scaled variables/attributes

    + Deviation

    + Distance

    + Z-score measurement

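    $$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$

    $$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$$

    $$z_{if} = \frac{x_{if} - m_f}{s_f}$$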
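    The standardization can be sketched as follows for a single attribute f. The function name is hypothetical; the computation follows the three formulas directly (mean m_f, mean absolute deviation s_f, standardized value z_if).

```python
def standardize(values):
    """Return the z-scores of one attribute using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                         # mean of the attribute
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    # s_f is 0 for a constant attribute, which would raise ZeroDivisionError.
    return [(x - m_f) / s_f for x in values]


print(standardize([2, 4, 6, 8]))
```

    For the values 2, 4, 6, 8 the mean is 5 and the mean absolute deviation is 2, so the standardized values are -1.5, -0.5, 0.5 and 1.5.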

    ISSUES TO BE ADDRESSED

    Formulas for distance measures

    + Minkowski distance

    + Manhattan distance

    + Euclidean distance

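    Manhattan:

    $$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

    Euclidean:

    $$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$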
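    The three distance measures can be sketched with one function, using the standard definition of the Minkowski distance with exponent q, d(i,j) = (sum over f of |x_if - x_jf|^q)^(1/q). Manhattan is the case q = 1 and Euclidean the case q = 2. The function names are chosen for this example.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q (q >= 1) between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


def manhattan(x, y):
    return minkowski(x, y, 1)


def euclidean(x, y):
    return minkowski(x, y, 2)


i, j, k = (0, 0), (3, 4), (1, 5)
print(manhattan(i, j), euclidean(i, j))
# Spot check of the triangle inequality d(i,j) <= d(i,k) + d(k,j).
print(euclidean(i, j) <= euclidean(i, k) + euclidean(k, j))
```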

    ISSUES TO BE ADDRESSED

    Binary variables/attributes

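                    Object j
                    1       0       sum
    Object i   1    a       b       a + b
               0    c       d       c + d
             sum    a + c   b + d   p

    Simple matching coefficient (if the variable is symmetric):

    $$d(i,j) = \frac{b + c}{a + b + c + d}$$

    Jaccard coefficient (if the variable is asymmetric):

    $$d(i,j) = \frac{b + c}{a + b + c}$$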
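    A short sketch of both coefficients, computed from the contingency counts a, b, c and d of two objects described by 0/1 attributes. The function names and the sample vectors are illustrative only.

```python
def contingency(x, y):
    """Return (a, b, c, d) for two binary vectors x (object i) and y (object j)."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching_dissimilarity(x, y):
    """Symmetric case: 1-1 and 0-0 matches both count as agreement."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(x, y):
    """Asymmetric case: 0-0 matches are ignored.

    Undefined (division by zero) if both vectors are all zeros.
    """
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c)


x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 0, 1, 0, 0]
print(simple_matching_dissimilarity(x, y), jaccard_dissimilarity(x, y))
```

    With these vectors a = 1, b = 1, c = 1 and d = 3, so the simple matching dissimilarity is 2/6 and the Jaccard dissimilarity is 2/3. The three shared zeros make the objects look closer only in the symmetric case.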

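    ISSUES TO BE ADDRESSED

    Variables/attributes of mixed types

    $$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

    If $x_{if}$ or $x_{jf}$ is missing, then $\delta_{ij}^{(f)} = 0$; otherwise $\delta_{ij}^{(f)} = 1$.

    f (variable/attribute): binary (nominal)

    $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise

    f: interval-scaled (Minkowski, Manhattan, Euclidean)

    f: ordinal or ratio-scaled

    Compute the ranks $r_{if}$ and

    $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

    so that $z_{if}$ can be treated as interval-scaled.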
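    The mixed-type formula can be sketched as below. Several details are assumptions made for this example and are not stated on the slide: missing values are represented by None, an interval-scaled attribute contributes its absolute difference divided by the attribute's range, and the per-attribute settings are passed in a hypothetical `params` list (the range for interval attributes, the number of levels M_f for ordinal ones).

```python
def ordinal_to_interval(rank, num_levels):
    """z = (r - 1) / (M - 1): map a rank in 1..M onto [0, 1]."""
    return (rank - 1) / (num_levels - 1)


def mixed_dissimilarity(x, y, kinds, params):
    """Weighted average of per-attribute dissimilarities, skipping missing values."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue  # delta = 0: the attribute does not contribute
        if kind == "nominal":
            d_f = 0.0 if x[f] == y[f] else 1.0
        elif kind == "interval":
            d_f = abs(x[f] - y[f]) / params[f]  # assumed: normalize by the range
        elif kind == "ordinal":
            z_x = ordinal_to_interval(x[f], params[f])
            z_y = ordinal_to_interval(y[f], params[f])
            d_f = abs(z_x - z_y)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d_f
        denominator += 1.0  # delta = 1
    if denominator == 0:
        raise ValueError("no attribute is present in both objects")
    return numerator / denominator


kinds = ["nominal", "interval", "ordinal"]
params = [None, 10.0, 3]  # range 10 for attribute 1, three levels for attribute 2
print(mixed_dissimilarity(["red", 2.0, 1], ["blue", 7.0, 3], kinds, params))
print(mixed_dissimilarity(["red", 2.0, 1], ["blue", None, 3], kinds, params))
```

    In the second call the interval attribute is missing for one object, so only the nominal and ordinal attributes enter the average.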

    THE SIGNIFICANCE OF CLUSTERING

    Once the data are clustered, we can analyze each cluster in depth to discover hidden information that supports decision making.

    DATA CLUSTERING ALGORITHMS

    There are many data clustering algorithms; typical examples are the k-means algorithm and hierarchical clustering.

    We will study the K-means algorithm for data clustering.

    THE K-MEANS ALGORITHM

    INPUT: A database of n objects and the number of clusters k.

    OUTPUT: The clusters Ci (i = 1, ..., k) such that the criterion function E reaches its minimum value.

    Step 1: Initialization

    Choose k objects mj (j = 1, ..., k) from the data set as the initial centroids of the k clusters (the choice may be random or based on experience).

    Step 2: Compute distances

    For each object Xi (1…
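    The slide text breaks off during step 2, so the sketch below fills in the remaining steps with the standard k-means (Lloyd) iteration rather than the author's exact wording: assign each object to its nearest centroid, recompute each centroid as the mean of its cluster, and stop when the assignment no longer changes. It also assumes that the criterion function E is the sum of squared Euclidean distances from each object to its centroid, and that initial centroids are drawn at random.

```python
import random


def squared_distance(x, m):
    return sum((a - b) ** 2 for a, b in zip(x, m))


def k_means(points, k, max_iter=100, seed=0):
    """Return (assignment, centroids, error) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct objects as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster: converged
        assignment = new_assignment
        # Update: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    error = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, error


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids, error = k_means(points, k=2)
print(assignment, centroids, error)
```

    On this toy data set the two well-separated groups of three points end up in separate clusters.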

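    THE K-MEANS ALGORITHM

    The computational complexity is O(n·k·d·t·T), where:

    n is the number of data objects

    k is the number of clusters

    d is the number of dimensions

    t is the number of iterations

    T is the time needed for one basic operation such as addition, subtraction, multiplication or division.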

    THE K-MEANS ALGORITHM

    Advantages: K-means is a simple clustering method, so it can be applied to large data sets.

    Disadvantages: K-means applies only to data with numeric attributes and discovers only spherical clusters. It is also very sensitive to noise and outliers in the data, and it depends heavily on its input parameters.

    THE K-MEANS ALGORITHM

    If the initial centroids are far from the natural cluster centroids, the quality of the k-means result is very low; that is, the discovered clusters deviate strongly from the clusters that actually exist in the data.

    In practice there is no optimal method for choosing the input parameters. The most common approach is to experiment with different values of k and then choose the best solution.
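    One way to act on this advice is sketched below: for each candidate k, run k-means from several random initializations and keep the run with the smallest error. This is an illustration under assumptions, not the author's procedure; the helper names are hypothetical and the error is again assumed to be the sum of squared distances to the centroids. Because this error normally keeps shrinking as k grows, it cannot pick k by itself; the printed values are meant to be compared by eye or combined with a validity index.

```python
import random


def sq_dist(x, m):
    return sum((a - b) ** 2 for a, b in zip(x, m))


def lloyd(points, centroids, max_iter=100):
    """Run the assign/update loop from the given starting centroids."""
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    error = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return error, labels, centroids


def best_of_restarts(points, k, restarts=10, seed=0):
    """Keep the lowest-error result over several random initializations."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        start = [list(p) for p in rng.sample(points, k)]
        result = lloyd(points, start)
        if best is None or result[0] < best[0]:
            best = result
    return best


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
errors = {k: best_of_restarts(points, k)[0] for k in (1, 2, 3)}
for k, e in errors.items():
    print(k, round(e, 3))
```

    On this toy data the error drops sharply from k = 1 to k = 2, which matches the two visible groups.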

    THE K-MEANS ALGORITHM

    To date, many algorithms that inherit the idea of k-means have been applied in data mining to handle very large data sets. They are used widely and effectively; examples include k-medoid, PAM, CLARA, CLARANS and k-prototypes, among others.