Top Banner

of 22

Thuật Toán Cây Quyết Định C4.5

Jul 17, 2015

Download

Documents

Kim Yen
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript

Thut Ton Cy Quyt nh C4.5Sinh vin: Lu Cng T Ngi hng dn: V Tin Thnh

1

Outline Thut ton Cy quyt nh nh ngha. Xy dng cy quyt nh. c im cy quyt nh.

Thut ton C4.5 Lch s Pseudocode, cc cng thc v v d. C ch chng qu va d liu, chng thiu d liu thuc tnh. Chuyn sang lut ng dng. C4.5 v See5/C5.0 Hng pht trin.2

Thut ton cy quyt nh nh ngha: Cy quyt nh l biu quyt nh pht trin c cu trc dng cy: Gc: Node trn cng cy. Node trong: biu din 1 kim tra hoc 1 thuc tnh n Node l: biu din lp. Nhnh: Kt qu kim tra ca node trnGc

Node L

Node Trong Nhnh

Node L

Node L

3

V d cy quyt nh4

Thut ton cy quyt nh Xy dng cy quyt nh gm 2 bc: Pht trin cy quyt nh: i t gc, n cc nhnh, pht trin quy np theo hnh thc chia tr. Chn thuc tnh tt nht bng mt o nh trc Pht trin cy bng vic thm cc nhnh tng ng vi tng gi tr ca thuc tnh chn Sp xp, phn chia tp d liu o to ti node con Nu cc v d c phn lp r rng th dng. Ngc li: lp li bc 1 ti bc 4 cho tng node con

Ct ta cy: nhm n gin ha, khi qut ha cy, tng chnh xc5

Thut ton cy quyt nh VD: thut ton Hunt s dng trong C4.5, CDP... S={S1,S2,,Sn} l tp d liu o to C={C1,C2,,Cm} l tp cc lp TH1: Si (i=1n) thuc v Cj => Cy quyt nh l 1 l ng Cj. TH2: S thuc v nhiu lp trong C. Chn 1 test trn thuc tnh n c nhiu gi tr O={O1,..Ok} (k thng bng 2). Test t gc ca cy, mi Oi to thnh 1 nhnh, chia S thnh cc tp con c gi tr thuc tnh = Oi. quy cho tng tp con => cy quyt nh gm nhiu nhnh, mi nhnh tng ng vi Oi.

6

Thut ton cy quyt nh im mnh ca cy quyt nh: Sinh ra cc quy tc hiu c: chuyn i c sang ting Anh hoc SQL. Thc thi trong lnh vc hng quy tc. D dng tnh ton trong khi phn lp. X l vi thuc tnh lin tc v ri rc. Th hin r rng nhng thuc tnh tt nht: phn chia d liu t gc.

im yu ca cy quyt nh: D xy ra li khi c nhiu lp: do ch thao tc vi cc lp c gi tr dng nh phn. Chi ph tnh ton t hc: do phi i qua nhiu node n node l cui cng

7

Thut ton C4.5 L s pht trin t CLS v ID3. ID3 (Quinlan, 1979)- 1 h thng n gin ban u cha khong 600 dng lnh Pascal Nm 1993, J. Ross Quinlan pht trin thnh C4.5 vi 9000 dng lnh C. Hin ti: phin bn See5/C5.0. T tng thut ton: Hunt, chin lc pht trin theo su.

8

Thut ton C4.5 Pseudocode: Kim tra case c bn Vi mi thuc tnh A tm thng tin nh vic tch thuc tnh A Chn a_best l thuc tnh m o la chn thuc tnh tt nht Dng a_best lm thuc tnh cho node chia ct cy. quy trn cc danh sch ph c to ra bi vic phn chia theo a_best, v thm cc node ny nh l con ca node(1) ComputerClassFrequency(T); (2) if OneClass or FewCases return a leaf; Create a decision node N; (3) ForEach Attribute A ComputeGain(A); (4) N.test=AttributeWithBestGain; (5) if (N.test is continuous) find Threshold; (6) ForEach T' in the splitting of T (7) If ( T' is Empty ) Child of N is a leaf else (8) Child of N=FormTree(T'); (9) ComputeErrors of N; return N9

Thut ton C4.5 o la chn thuc tnh tt nht: information gain v gain ratio Tn sut cc case Sj thuc v gi tr phn lp Cj |Sj| RF (Cj, S) = |S| Ch s thng tin cn thit cho s phn lp: I(S)

S={S1,S2,,Sx)10

Thut ton C4.5 o la chn thuc tnh tt nht: Information gain: Test B chia S={S ,S ,,S )1 2 t

Test B s c chn nu c G(S, B) t gi tr ln nht. Thng tin tim nng (potential information) ca bn thn mi phn hoch:

Gain ratio = G(S, B) / P(S, B) lnnht => chn test B

11

Thut ton C4.5 o o o o o V d: s1 (yes) 9 case,s2 (no) 5 case I(S) = I(s1,s2) = I(9, 5) = 0.940 A = age =>S={S1,S2,S3} S1 (age30). I (S1): s11(yes & ng dng vi database nh ( tn s li lp li 4% vi database 20000 cases). C c ch x l thiu, li hoc qu va d liu. Lut to ra n gin.15

Thut ton C4.5 ng dng vo bi ton phn lp d liu: Bc 1 (Hc): xy dng m hnh m t tp d liu; khi nim bit Input: tp d liu c cu trc c to m t bng cc thuc tnh Output: Cc lut IfThen

Bc 2 (Phn loi): da trn m hnh xy dng phn lp d liu mi: i t gc n cc nt l nhm rt ra lp ca i tng cn xt.16

Thut ton C4.5 ng dng vo bi ton phn lp d liu: X l vi d liu thuc tnh lin tc: S dng kim tra dng nh phn: value(V) < h vi h l hng s ngng (threshold) h c tm bng cch: Quick sort sp xp cc case trong S theo cc gi tr ca thuc tnh lin tc V ang xt =>V = {v1, v2, , vm} hi = (vi + v(i+1))/2. Test phn chia d liu:V hi => chia V thnh V1={v1,v2,, vi} v V2 = {vi+1, vi+2, , vm} v c hi (i=1m-1) Tnh Information gain hay Gain ratio vi tng hi. Ngng c gi tr ca Information gain hay Gain ratio ln nht s c chn lm ngng phn chia ca thuc tnh .

17

C4.5 v C5.0 Lut: C5.0 chnh xc hn, nhanh hn, tn t b nh hn.Blue: C5.0

Cy quyt nh: nhanh hn, nh hn

18

C4.5 v C5.0 Boost: to v kt hp nhiu lp phn loi tng chnh xc d onBlue: C5.0

Kiu d liu mi: vd ngy,thng

19

Hng nghin cu Cy n nh: Tn s li ca cy c xy dng t data case c cu trc thp hn nhiu so vi data case khng nhn thy c. VD: vi 20k case c cu trc, t l li l 4%, Cng 20k case v c 1 case khng c kim tra, t l li 11,7%

Yu cu t ra: xy dng cy m t l li l xp x nhau cho c 2 trng hp.

Phn ly cy phc tp: C th chia ct cy phc tp thnh cc cy nh, n gin m kt qu khng i ?20

Ti liu tham kho Nguyn Th Thy Linh (2005). Thut ton phn lp cy quyt nh, Kha lun tt nghip i hc, Trng i hc Cng ngh, 2005. [WKQ08] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu , Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg (2008). Top 10 algorithms in data mining, Knowl Inf Syst (2008) 14:137 http://rulequest.com/see5-comparison.html http://en.wikipedia.org/wiki/ID3_algorithm http://en.wikipedia.org/wiki/C4.5_algorithm http://en.wikipedia.org/wiki/Decision_tree21

Cm n thy, anh ch v cc bn theo di!

22