ROUGH SETS and FCA Foundations and Case Studies of Feature Subset Selection and Knowledge Structure Formation DOMINIK ŚLĘZAK www.infobright.com www.infobright.org

DOMINIK ŚLĘZAK infobright infobright

Jan 25, 2016

Transcript
Page 1: DOMINIK  ŚLĘZAK infobright infobright

ROUGH SETS and FCA

Foundations and Case Studies of Feature Subset Selection and Knowledge Structure Formation

DOMINIK ŚLĘZAK www.infobright.com www.infobright.org

Page 2:

Contents

• Rough Sets & Feature Selection
– Association Reducts
– Conceptual Reducts
– Building Ensembles
– Towards Clustering

• Rough Sets & Infobright Story
– Rough & Granular Computation
– Knowledge Structure Formation

Page 3:

Rough Sets

• Rough set theory, proposed by Z. Pawlak in 1982, is an approximate reasoning model

• In applications, it focuses on derivation of approximate knowledge from databases

• It provides good results in domains such as Web analysis, finance, industry, multimedia, medicine, and bioinformatics

Page 4:

Decision Systems

    Outlook   Temp.  Humid.  Wind    Sport?
1   Sunny     Hot    High    Weak    No
2   Sunny     Hot    High    Strong  No
3   Overcast  Hot    High    Weak    Yes
4   Rain      Mild   High    Weak    Yes
5   Rain      Cold   Normal  Weak    Yes
6   Rain      Cold   Normal  Strong  No
7   Overcast  Cold   Normal  Strong  Yes
8   Sunny     Mild   High    Weak    No
9   Sunny     Cold   Normal  Weak    Yes
10  Rain      Mild   Normal  Weak    Yes
11  Sunny     Mild   Normal  Strong  Yes
12  Overcast  Mild   High    Strong  Yes
13  Overcast  Hot    Normal  Weak    Yes
14  Rain      Mild   High    Strong  No

IF (H=Normal) AND (T=Mild) THEN (S=Yes)

It corresponds to a data block included in the positive region of the decision class “Yes”
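
The rule above can be checked directly against the table. The following is a minimal Python sketch, not taken from the slides: the names `TABLE`, `blocks` and `decisions` are illustrative, and objects are numbered 1 to 14 as in the table.

```python
from collections import defaultdict

ATTRS = ["Outlook", "Temp", "Humid", "Wind"]
# The 14 objects of the decision system; the last column is Sport?.
TABLE = [line.split() for line in """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()]


def blocks(attrs):
    """Group 1-based object numbers by their values on the chosen attributes."""
    groups = defaultdict(list)
    for number, row in enumerate(TABLE, start=1):
        groups[tuple(row[ATTRS.index(a)] for a in attrs)].append(number)
    return dict(groups)


def decisions(objects):
    """Set of Sport? values taken by the given objects."""
    return {TABLE[n - 1][-1] for n in objects}


# Data block matching the rule's condition (H=Normal) AND (T=Mild).
rule_block = blocks(["Humid", "Temp"])[("Normal", "Mild")]
# The rule holds if every object in the block has decision Yes.
rule_holds = decisions(rule_block) == {"Yes"}
print(rule_block, rule_holds)
```

The block consists of objects 10 and 11, both with Sport? = Yes, which is why it lies inside the positive region of the class “Yes”.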

Page 5:

Lower & Upper Approximations

Rules and Approximations

    Outlook   Temp.  Humid.  Wind    Sport?
1   Sunny     Hot    High    Weak    No
2   Sunny     Hot    High    Strong  No
3   Overcast  Hot    High    Weak    Yes
4   Rain      Mild   High    Weak    Yes
5   Rain      Cold   Normal  Weak    Yes
6   Rain      Cold   Normal  Strong  No
7   Overcast  Cold   Normal  Strong  Yes
8   Sunny     Mild   High    Weak    No
9   Sunny     Cold   Normal  Weak    Yes
10  Rain      Mild   Normal  Weak    Yes
11  Sunny     Mild   Normal  Strong  Yes
12  Overcast  Mild   High    Strong  Yes
13  Overcast  Hot    Normal  Weak    Yes
14  Rain      Mild   High    Strong  No

POS(Sport?|B)
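
As an illustration of lower and upper approximations and of POS(Sport?|B), here is a small Python sketch. The slide does not say which B it uses; B = {Outlook, Humid} is chosen here purely as an example, and all function names are illustrative.

```python
ATTRS = ["Outlook", "Temp", "Humid", "Wind"]
TABLE = [line.split() for line in """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()]


def partition(attrs):
    """Indiscernibility classes (sets of 1-based object numbers) induced by attrs."""
    groups = {}
    for number, row in enumerate(TABLE, start=1):
        key = tuple(row[ATTRS.index(a)] for a in attrs)
        groups.setdefault(key, set()).add(number)
    return list(groups.values())


def approximations(attrs, target):
    """Lower approximation: classes inside target; upper: classes meeting target."""
    lower, upper = set(), set()
    for block in partition(attrs):
        if block <= target:
            lower |= block
        if block & target:
            upper |= block
    return lower, upper


B = ["Outlook", "Humid"]  # example choice, not specified on the slide
yes = {n for n, row in enumerate(TABLE, start=1) if row[-1] == "Yes"}
no = set(range(1, len(TABLE) + 1)) - yes

lower_yes, upper_yes = approximations(B, yes)
lower_no, upper_no = approximations(B, no)
# The positive region is the union of the lower approximations of all decision classes.
positive_region = lower_yes | lower_no
print(sorted(positive_region))
```

For this B, the objects with Outlook = Rain fall between the lower and upper approximation of “Yes”, so they stay outside the positive region.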

Page 6:

Feature Reduction (Selection)

• Reducts: optimal attribute subsets that approximate the pre-defined target concepts, or the whole data source, well enough

• Feature reduction is one of the steps in the knowledge discovery in databases process

• In real-world situations, we may agree to slightly decrease the quality, if it leads to a significantly simpler knowledge model
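
A brute-force Python sketch of exact decision reducts for the Sport? table may help make the notion concrete. It assumes the strictest criterion (the positive region must be fully preserved), whereas the slides also allow approximate criteria; the names `pos` and `reducts` are illustrative.

```python
from itertools import combinations

ATTRS = ["Outlook", "Temp", "Humid", "Wind"]
TABLE = [line.split() for line in """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()]


def pos(attrs):
    """Positive region: objects whose indiscernibility class has a single decision."""
    groups = {}
    for number, row in enumerate(TABLE, start=1):
        key = tuple(row[ATTRS.index(a)] for a in attrs)
        groups.setdefault(key, []).append(number)
    return {n for members in groups.values()
            if len({TABLE[m - 1][-1] for m in members}) == 1
            for n in members}


full = pos(ATTRS)
reducts = []
# Visit subsets by increasing size, so any subset containing an already
# found reduct is not minimal and gets skipped.
for size in range(len(ATTRS) + 1):
    for subset in combinations(ATTRS, size):
        if pos(subset) == full and not any(set(r) <= set(subset) for r in reducts):
            reducts.append(subset)
print(reducts)
```

Exhaustive search is only feasible for a handful of attributes, which is one reason the later slides turn to heuristic search.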

Page 7:

Discernibility (O = Outlook, T = Temp., H = Humid., W = Wind; "-" marks a shaded cell)

    1    2    3    4    5    6    7    8    9    10   11   12   13   14
1   -    -    -    -    -    -    -    -    -    -    -    -    -    -
2   -    -    -    -    -    -    -    -    -    -    -    -    -    -
3   O    OW   -    -    -    -    -    -    -    -    -    -    -    -
4   OT   OTW  -    -    -    -    -    -    -    -    -    -    -    -
5   OTH  OTHW -    -    -    -    -    -    -    -    -    -    -    -
6   -    -    OTHW THW  W    -    -    -    -    -    -    -    -    -
7   OTHW OTH  -    -    -    O    -    -    -    -    -    -    -    -
8   -    -    OT   O    OTH  -    OTHW -    -    -    -    -    -    -
9   TH   THW  -    -    -    OW   -    TH   -    -    -    -    -    -
10  OTH  OTHW -    -    -    TW   -    OH   -    -    -    -    -    -
11  THW  TH   -    -    -    OT   -    HW   -    -    -    -    -    -
12  OTW  OT   -    -    -    OTH  -    OW   -    -    -    -    -    -
13  OH   OHW  -    -    -    OTW  -    OTH  -    -    -    -    -    -
14  -    -    OTW  W    THW  -    OTH  -    OTHW HW   OH   O    OTHW -
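
The entries of such a discernibility matrix can be computed as in the sketch below. It assumes that a cell is filled only for pairs of objects with different Sport? values and that attributes are abbreviated by their initials; `discern` and `matrix` are illustrative names.

```python
ATTRS = ["Outlook", "Temp", "Humid", "Wind"]
TABLE = [line.split() for line in """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()]


def discern(i, j):
    """Initials of attributes that differ between objects i and j.

    Returns None when both objects share the same decision, since such
    pairs need not be discerned.
    """
    a, b = TABLE[i - 1], TABLE[j - 1]
    if a[-1] == b[-1]:
        return None
    # zip stops after the four condition attributes, so Sport? is not compared.
    return {name[0] for name, x, y in zip(ATTRS, a, b) if x != y}


# Lower triangle only: rows i, columns j < i.
matrix = {(i, j): discern(i, j) for i in range(1, 15) for j in range(1, i)}
print(matrix[(3, 1)], matrix[(9, 6)])
```

Every filled cell lists the attributes of which at least one has to be kept if the corresponding pair of objects is to remain discernible.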

Page 8:

Association Reducts

1. {a,b,c} {d,e}
2. {a,b,d,f} {c,e}
3. {a,b,f} {e}
4. {a,c,e} {b,d}
5. {a,c,f} {d}
6. {a,d,e} {b,c}
7. {a,d,f} {c}
8. {a,e,f} {b}
9. {b,c,d} {a,e}
10. {b,d,e} {a,c}
11. {b,e,f} {a}
12. {c,d,f} {a}
13. {c,e,f} {a,b,d}

    a  b  c  d  e  f
u1  1  1  1  1  1  1
u2  0  0  0  1  1  1
u3  1  0  1  1  0  1
u4  0  1  0  0  0  0
u5  1  0  0  0  0  1
u6  1  1  1  1  1  0
u7  0  1  1  0  1  2
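
Each listed pair can be read as a dependency: objects that agree on the first set also agree on the second. The Python sketch below checks that reading on the table above; it does not test minimality, and `determines` and `LISTED` are illustrative names.

```python
ATTRS = "abcdef"
# Rows u1..u7 of the table, one character per attribute a..f.
U = {
    "u1": "111111", "u2": "000111", "u3": "101101", "u4": "010000",
    "u5": "100001", "u6": "111110", "u7": "011012",
}


def determines(C, D):
    """True if objects that agree on all attributes in C also agree on D."""
    seen = {}
    for row in U.values():
        key = tuple(row[ATTRS.index(x)] for x in C)
        val = tuple(row[ATTRS.index(x)] for x in D)
        # setdefault stores the first D-value seen for this C-value.
        if seen.setdefault(key, val) != val:
            return False
    return True


# The 13 pairs from the slide, written as strings of attribute names.
LISTED = [
    ("abc", "de"), ("abdf", "ce"), ("abf", "e"), ("ace", "bd"), ("acf", "d"),
    ("ade", "bc"), ("adf", "c"), ("aef", "b"), ("bcd", "ae"), ("bde", "ac"),
    ("bef", "a"), ("cdf", "a"), ("cef", "abd"),
]
all_hold = all(determines(C, D) for C, D in LISTED)
print(all_hold)
```

For example, `determines("abc", "f")` is false because u1 and u6 agree on a, b, c but differ on f, which is consistent with f being absent from the right-hand side of the first pair.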

Page 9:

Association Reducts as Association Rules in InDiscernibility Tables

pair  a  b  c  d  e  f
12    0  0  0  1  1  1
13    1  0  1  1  0  1
14    0  1  0  0  0  0
15    1  0  0  0  0  1
16    1  1  1  1  1  0
17    0  1  1  0  1  0
23    0  1  0  1  0  1
24    1  0  1  0  0  0
25    0  1  1  0  0  1
26    0  0  0  1  1  0
27    1  0  0  0  1  0
34    0  0  0  0  1  0
35    1  1  0  0  1  1
36    1  0  1  1  0  0
37    0  0  1  0  0  0
45    0  0  1  1  1  0
46    0  1  0  0  0  1
47    1  1  0  1  0  0
56    1  0  0  0  0  0
57    0  0  0  1  0  0
67    0  1  1  0  1  0
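
The indiscernibility table can be derived from the seven objects u1..u7: each row describes a pair of objects, with 1 where the two objects share an attribute value and 0 otherwise. The sketch below builds it and checks one association rule; `ROWS`, `table` and `rule_holds` are illustrative names.

```python
from itertools import combinations

ATTRS = "abcdef"
# Objects u1..u7, one character per attribute a..f.
ROWS = ["111111", "000111", "101101", "010000", "100001", "111110", "011012"]

# One row per unordered pair (i, j): 1 if u_i and u_j agree on the attribute.
table = {
    (i + 1, j + 1): tuple(int(x == y) for x, y in zip(ROWS[i], ROWS[j]))
    for i, j in combinations(range(len(ROWS)), 2)
}


def rule_holds(C, D):
    """Association rule C => D with confidence 1 in the indiscernibility table."""
    return all(
        all(row[ATTRS.index(d)] for d in D)
        for row in table.values()
        if all(row[ATTRS.index(c)] for c in C)
    )


print(table[(1, 2)], rule_holds("abc", "de"))
```

A rule such as {a,b,c} ⇒ {d,e} holding for every row of this table expresses the same fact as the dependency between the two attribute sets in the original data.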

Page 10:

Most Interesting Reducts

• Given association reduct (C,D), we evaluate it with the value F(|C|,|D|)

• Function F: N × N → R should satisfy:

IF n1 < n2 THEN F(n1,m) > F(n2,m)

IF m1 < m2 THEN F(n,m1) < F(n,m2)

• F(|C|,|D|) is maximized subject to # from the space of approximation parameters

• Such maximization problem is NP-hard
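
To make the role of F concrete, the sketch below ranks the 13 association reducts listed earlier with one hypothetical choice, F(n,m) = m − n. This particular F is an assumption made only for illustration; any function decreasing in |C| and increasing in |D| would fit the two conditions.

```python
# Association reducts (C, D) from the earlier slide, as strings of attribute names.
REDUCTS = [
    ("abc", "de"), ("abdf", "ce"), ("abf", "e"), ("ace", "bd"), ("acf", "d"),
    ("ade", "bc"), ("adf", "c"), ("aef", "b"), ("bcd", "ae"), ("bde", "ac"),
    ("bef", "a"), ("cdf", "a"), ("cef", "abd"),
]


def F(n, m):
    """Hypothetical evaluation: smaller |C| = n and larger |D| = m score higher."""
    return m - n


scores = {(C, D): F(len(C), len(D)) for C, D in REDUCTS}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Under this choice the pair ({c,e,f}, {a,b,d}) comes out on top, since three attributes determine the three remaining ones.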

Page 11:

What can # actually mean?

1) |POS(d|B)|

2) Disc(d|B) = Disc(B∪{d}) – Disc(B), where Disc(X) = |{(u1,u2): X(u1)≠X(u2)}|

3) Relative Gain R(d|B) =

4) Entropy H(d|B) = H(B∪{d}) – H(B)
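
Measures 2) and 4) can be computed on the Sport? table as in the following sketch. It assumes that Disc counts ordered pairs, as the set notation suggests, and that H is Shannon entropy in bits; the function names are illustrative.

```python
from collections import Counter
from math import log2

COLS = ["Outlook", "Temp", "Humid", "Wind", "Sport"]
TABLE = [line.split() for line in """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()]


def values(attrs):
    """Value vectors X(u) of all objects on the attribute list attrs."""
    return [tuple(row[COLS.index(a)] for a in attrs) for row in TABLE]


def disc(attrs):
    """Disc(X): number of ordered pairs (u1, u2) with X(u1) != X(u2)."""
    counts = Counter(values(attrs))
    n = len(TABLE)
    return n * n - sum(c * c for c in counts.values())


def entropy(attrs):
    """H(X): Shannon entropy (in bits) of the partition induced by attrs."""
    counts = Counter(values(attrs))
    n = len(TABLE)
    return -sum(c / n * log2(c / n) for c in counts.values())


def disc_gain(B, d="Sport"):
    return disc(B + [d]) - disc(B)


def cond_entropy(B, d="Sport"):
    return entropy(B + [d]) - entropy(B)


print(disc_gain(["Outlook"]), round(cond_entropy([]), 3))
```

Both quantities drop to zero for B = {Outlook, Temp, Wind}, because that attribute set already determines Sport? on this table.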

Page 12:

Conceptual Reducts

(empty,empty)

({3,7,12,13},{O})

({1-3,7,9,12,13},{O,T})

({1-3,7-9,11-13},{O,H})

({3-7,10,12-14},{O,W})

({10,11,13},{T,H})

({2,5,9},{T,W})

({5,9,10,13},{H,W})

({1-3,7-13},{O,T,H})

({1-14},{O,T,W})

({1-14},{O,H,W})

({2,5,9-11,13},{H,T,W})

    Outlook   Temp.  Humid.  Wind    Sport?
1   Sunny     Hot    High    Weak    No
2   Sunny     Hot    High    Strong  No
3   Overcast  Hot    High    Weak    Yes
4   Rain      Mild   High    Weak    Yes
5   Rain      Cold   Normal  Weak    Yes
6   Rain      Cold   Normal  Strong  No
7   Overcast  Cold   Normal  Strong  Yes
8   Sunny     Mild   High    Weak    No
9   Sunny     Cold   Normal  Weak    Yes
10  Rain      Mild   Normal  Weak    Yes
11  Sunny     Mild   Normal  Strong  Yes
12  Overcast  Mild   High    Strong  Yes
13  Overcast  Hot    Normal  Weak    Yes
14  Rain      Mild   High    Strong  No

Reduct as a pair (X,B), where X ⊆ U, POS(B) = X, and POS(C) ⊊ X for any C ⊊ B
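
The definition can be applied by brute force to the Sport? table. The sketch below enumerates every attribute subset B and keeps (POS(B), B) whenever each proper subset of B has a strictly smaller positive region; `pos` and `conceptual_reducts` are illustrative names.

```python
from itertools import combinations

ATTRS = ["Outlook", "Temp", "Humid", "Wind"]
TABLE = [line.split() for line in """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()]


def pos(attrs):
    """Positive region of Sport? with respect to attrs, as a frozenset."""
    groups = {}
    for number, row in enumerate(TABLE, start=1):
        key = tuple(row[ATTRS.index(a)] for a in attrs)
        groups.setdefault(key, []).append(number)
    return frozenset(n for members in groups.values()
                     if len({TABLE[m - 1][-1] for m in members}) == 1
                     for n in members)


def conceptual_reducts():
    """All pairs (X, B) with POS(B) = X and POS(C) strictly inside X for C inside B."""
    result = []
    for size in range(len(ATTRS) + 1):
        for B in combinations(ATTRS, size):
            X = pos(B)
            proper = (C for k in range(size) for C in combinations(B, k))
            if all(pos(C) < X for C in proper):  # "<" is strict set inclusion
                result.append((X, B))
    return result


found = conceptual_reducts()
for X, B in found:
    print(sorted(X), B)
```

On this table the enumeration yields twelve pairs, starting with the empty pair and ending with the three-attribute sets.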

Page 13:

Reduct „Lattice”

empty            empty
3,7,12,13        O
1-3,7,9,12,13    O,T
1-3,7-9,11-13    O,H
3-7,10,12-14     O,W
10,11,13         T,H
2,5,9            T,W
5,9,10,13        H,W
1-3,7-13         O,T,H
1-14             O,T,W
1-14             O,H,W
2,5,9-11,13      H,T,W

Page 14:

Most Interesting Reducts

• Given conceptual reduct (X,B), we evaluate it with the value F(|X|,|B|)

• Function F: N × N → R should satisfy:

IF n1 < n2 THEN F(n1,m) < F(n2,m)

IF m1 < m2 THEN F(n,m1) > F(n,m2)

• So we should maximize F(|X|,|B|) or...

• ... shall we rather search for ensembles?

Page 15:

“Good” Ensembles of Reducts

• Reducts with minimal cardinalities (or rules)

• Reducts with minimal pairwise intersections

[Diagram labels: R1, R2, R3, ATTRIBUTES]

Challenge: How to modify the existing attribute reduction methods to search for such „good” ensembles?
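
One simple way to read the two criteria is as a selection problem over already computed reducts. The sketch below picks k reducts with the smallest total pairwise intersection, breaking ties by total cardinality. The candidate reducts are hypothetical, and exhaustive search over combinations is used only because the example is tiny.

```python
from itertools import combinations

# Hypothetical candidate reducts over attributes a..f (not taken from the slides).
CANDIDATES = [frozenset("abc"), frozenset("abde"), frozenset("cde"), frozenset("ef")]


def overlap(ensemble):
    """Sum of pairwise intersection sizes within an ensemble of reducts."""
    return sum(len(r & s) for r, s in combinations(ensemble, 2))


def best_ensemble(candidates, k):
    """Choose k reducts with minimal overlap, then minimal total cardinality."""
    return min(
        combinations(candidates, k),
        key=lambda ens: (overlap(ens), sum(len(r) for r in ens)),
    )


best = best_ensemble(CANDIDATES, 2)
print([sorted(r) for r in best], overlap(best))
```

A search method that produces such ensembles directly, rather than filtering a precomputed list, is the challenge posed on the slide.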

Page 16:

Hybrid Genetic Algorithm (1)

• Genetic part, where each chromosome encodes a permutation of the attributes

• Heuristic part, where permutations are put into the following algorithm:

1. LET LEFT = A

2. FOR i = 1 TO |A| REPEAT

   a. LET LEFT ← LEFT \ {a_σ(i)}

   b. IF NOT LEFT # d THEN UNDO (a)

3. EVALUATE REDUCT LEFT
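
The heuristic part can be sketched in Python as follows. The sketch assumes that the criterion "LEFT # d" means that LEFT preserves the positive region of the full attribute set on the Sport? table; the genetic part, which would supply the permutations, is left out.

```python
ATTRS = ["Outlook", "Temp", "Humid", "Wind"]
TABLE = [line.split() for line in """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()]


def pos(attrs):
    """Positive region of Sport? with respect to attrs."""
    groups = {}
    for number, row in enumerate(TABLE, start=1):
        key = tuple(row[ATTRS.index(a)] for a in attrs)
        groups.setdefault(key, []).append(number)
    return {n for members in groups.values()
            if len({TABLE[m - 1][-1] for m in members}) == 1
            for n in members}


def reduce_along(permutation):
    """Try to drop attributes in the given order; keep those that are needed."""
    left = list(ATTRS)
    target = pos(ATTRS)
    for a in permutation:
        trial = [x for x in left if x != a]
        # Assumed criterion: LEFT must still yield the full positive region.
        if pos(trial) == target:
            left = trial  # removal accepted; otherwise it is undone
    return left


print(reduce_along(["Outlook", "Temp", "Humid", "Wind"]))
print(reduce_along(["Humid", "Temp", "Outlook", "Wind"]))
```

Different permutations lead to different reducts, which is what lets the genetic part explore the space of reducts by evolving permutations.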

Page 17:

Hybrid Genetic Algorithm (2)

1. LET (LEFT,RIGHT) = (∅,A)

2. FOR i = 1 TO |U|+|A| REPEAT

   IF σ(i) ∈ {1,...,|U|} THEN

      IF u_σ(i) ∈ POS(RIGHT) THEN

         LET LEFT ← LEFT ∪ {u_σ(i)}

   IF σ(i) ∈ {|U|+1,...,|U|+|A|} THEN

      IF POS(RIGHT \ {a_σ(i)}) ⊇ LEFT THEN

         LET RIGHT ← RIGHT \ {a_σ(i)}

3. EVALUATE REDUCT (LEFT,RIGHT)
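
A Python sketch of this second procedure on the Sport? table is given below. It assumes that a permutation is passed as a mixed list of object numbers (integers) and attribute names (strings), which is one possible encoding and not necessarily the one used by the author.

```python
ATTRS = ["Outlook", "Temp", "Humid", "Wind"]
TABLE = [line.split() for line in """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()]


def pos(attrs):
    """Positive region of Sport? with respect to attrs."""
    groups = {}
    for number, row in enumerate(TABLE, start=1):
        key = tuple(row[ATTRS.index(a)] for a in attrs)
        groups.setdefault(key, []).append(number)
    return {n for members in groups.values()
            if len({TABLE[m - 1][-1] for m in members}) == 1
            for n in members}


def conceptual_reduct_along(permutation):
    """Process objects (ints) and attributes (strs) in the given order."""
    left, right = set(), list(ATTRS)
    for item in permutation:
        if isinstance(item, int):
            # Object step: add it if the current attributes classify it.
            if item in pos(right):
                left.add(item)
        else:
            # Attribute step: drop it if all collected objects stay covered.
            trial = [a for a in right if a != item]
            if pos(trial) >= left:
                right = trial
    return left, right


objects = list(range(1, 15))
print(conceptual_reduct_along(objects + ATTRS))
print(conceptual_reduct_along(ATTRS + objects))
```

Placing all objects first reproduces an ordinary decision reduct, placing all attributes first gives the empty pair, and mixed orders reach the intermediate pairs.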

Page 18:

Reduct „Lattice” once more

empty            empty
3,7,12,13        O
1-3,7,9,12,13    O,T
1-3,7-9,11-13    O,H
3-7,10,12-14     O,W
10,11,13         T,H
2,5,9            T,W
5,9,10,13        H,W
1-3,7-13         O,T,H
1-14             O,T,W
1-14             O,H,W
2,5,9-11,13      H,T,W

Page 19:

Feature Clustering / Selection

[Diagram labels: CLUSTERS OF ATTRIBUTES, REDUCTS WITH CLUSTER REPRESENTATIVES, FEEDBACK]

• Frequent occurrence of representatives in reducts yields splitting clusters

• Rare occurrence of pairs of close representatives yields merging clusters

Grużdź, Ihnatowicz, Ślęzak: Interactive gene clustering – a case study of breast cancer microarray data. Inf. Systems Frontiers 8 (2006).

Page 20:

How about groups of rows (1)

Data-based knowledge models, classifiers...

Database indices, data partitioning, data sorting...

Difficulty with fast updates of structures...

    Outlook   Temp.  Humid.  Wind    Sport?
1   Sunny     Hot    High    Weak    No
2   Sunny     Hot    High    Strong  No
3   Overcast  Hot    High    Weak    Yes
4   Rain      Mild   High    Weak    Yes
5   Rain      Cold   Normal  Weak    Yes
6   Rain      Cold   Normal  Strong  No
7   Overcast  Cold   Normal  Strong  Yes
8   Sunny     Mild   High    Weak    No
9   Sunny     Cold   Normal  Weak    Yes
10  Rain      Mild   Normal  Weak    Yes
11  Sunny     Mild   Normal  Strong  Yes
12  Overcast  Mild   High    Strong  Yes
13  Overcast  Hot    Normal  Weak    Yes
14  Rain      Mild   High    Strong  No

Page 21:

How about groups of rows (2)

[Diagram labels: Query, OUT, min, max, sum, Nulls, match, pattern, ???]

Page 22:

Infobright’s Technology

Page 23:

Two-Level Computing

Large Data (10TB) & Mixed Workloads

Page 24:

SELECT MAX(A) FROM T WHERE B>15;

[Diagram labels: STEP 1, STEP 2, STEP 3, DATA]
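
The slide shows only the query and three steps. One plausible reading, sketched below under stated assumptions, is that per-pack min/max statistics are used first and row data is touched last. The pack contents, the names `PACKS` and `STATS`, and the exact split into steps are all hypothetical and not taken from Infobright's implementation.

```python
# Hypothetical packs: each holds a slice of table T's columns A and B.
PACKS = [
    {"A": [3, 9, 4], "B": [1, 7, 12]},
    {"A": [8, 2, 6], "B": [20, 30, 16]},
    {"A": [5, 7, 1], "B": [10, 40, 18]},
    {"A": [11, 2, 10], "B": [3, 25, 14]},
]
# Rough level: per-pack (min, max) statistics, assumed to be cheap to consult.
STATS = [{c: (min(p[c]), max(p[c])) for c in p} for p in PACKS]


def max_a_where_b_greater(threshold):
    """Return (answer, indices of packs whose rows had to be read)."""
    relevant, suspect = [], []
    # Step 1: classify packs using only the min/max of B.
    for i, s in enumerate(STATS):
        lo, hi = s["B"]
        if lo > threshold:
            relevant.append(i)   # every row satisfies B > threshold
        elif hi > threshold:
            suspect.append(i)    # some rows may satisfy it
        # otherwise the pack is irrelevant and ignored
    # Step 2: relevant packs give a lower bound on MAX(A) without row access.
    best = max((STATS[i]["A"][1] for i in relevant), default=None)
    # Step 3: read only suspect packs whose max(A) could still beat the bound.
    opened = []
    for i in suspect:
        if best is not None and STATS[i]["A"][1] <= best:
            continue
        opened.append(i)
        for a, b in zip(PACKS[i]["A"], PACKS[i]["B"]):
            if b > threshold and (best is None or a > best):
                best = a
    return best, opened


answer, opened = max_a_where_b_greater(15)
print(answer, opened)
```

In this toy data only one of the four packs has to be read row by row; the others are resolved from their statistics alone.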

Page 25:

Knowledge Structures (Nodes)

Order Detail Table – assume many more rows

Order Number  Order Date  Part ID  Quantity  $Amt
005           20070214    234      500       1500.00
005           20070214    334      125       250.25
006           20070215    334      100       212.50

Supplier/Part Table – assume many more rows

Supplier ID  Effective Date  Expiry Date  Part ID  Description
A456         20050315        Null         234      Pre-measured coffee packets – gold blend
A456         20061201        Null         235      Pre-measured coffee packets – silver blend
A456         20060501        Null         334      4-cup Cone coffee filters; quantity 50

        Pack 1  Pack 2
Pack 1  0       1
Pack 2  1       0
Pack 3  0       0
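
The 0/1 matrix can be read as a pack-to-pack structure for the join on Part ID: an entry is 1 when the two packs may contain a matching pair of values. The sketch below builds such a matrix from hypothetical pack contents chosen to reproduce the matrix above; which table supplies the rows and which the columns is an assumption.

```python
# Hypothetical Part ID values stored in each pack of the two tables.
ORDER_PACKS = [[234, 235], [334, 335], [500, 501]]
SUPPLIER_PACKS = [[334, 336], [234, 240]]


def pack_to_pack(left_packs, right_packs):
    """matrix[i][j] = 1 if left pack i and right pack j share at least one value."""
    return [
        [int(bool(set(left) & set(right))) for right in right_packs]
        for left in left_packs
    ]


matrix = pack_to_pack(ORDER_PACKS, SUPPLIER_PACKS)
for row in matrix:
    print(row)
```

During a join, pairs of packs marked 0 can be skipped without reading any rows.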

Page 26:
Page 27:
Page 28:

DATA – Best Inspiration

• New Objectives
• New Schemas
• New Volumes
• New Queries
• New Types
• New KNs
• ...

Page 29:

References (Unfinished List)

• D. Ślęzak, J. Wróblewski, V. Eastwood, P. Synak: Brighthouse - An Analytic Data Warehouse for Ad-hoc Queries. VLDB 2008: 1337-1345.

• D. Ślęzak: Rough Sets and Few-Objects-Many-Attributes Problem - The Case Study of Analysis of Gene Expression Data Sets. FBIT 2007: 437-440.

• D. Ślęzak: Rough Sets and Functional Dependencies in Data - Foundations of Association Reducts. To appear.

• ......

Page 30:

THANK YOU!!

www.infobright.com www.infobright.org
