ROUGH SETS and FCA
Foundations and Case Studies of Feature Subset Selection and Knowledge Structure Formation
DOMINIK ŚLĘZAK
www.infobright.com  www.infobright.org
Jan 25, 2016
Contents
• Rough Sets & Feature Selection
  – Association Reducts
  – Conceptual Reducts
  – Building Ensembles
  – Towards Clustering
• Rough Sets & Infobright Story
  – Rough & Granular Computation
  – Knowledge Structure Formation
Rough Sets
• Rough set theory, proposed by Z. Pawlak in 1982, is a model of approximate reasoning
• In applications, it focuses on the derivation of approximate knowledge from databases
• It provides good results in domains such as Web analysis, finance, industry, multimedia, medicine, and bioinformatics
Decision Systems

Outlook Temp. Humid. Wind Sport?
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cold Normal Weak Yes
6 Rain Cold Normal Strong No
7 Overcast Cold Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cold Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
IF (H=Normal) AND (T=Mild) THEN (S=Yes)
This rule corresponds to a data block included in the positive region of the decision class “Yes”
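The block-based reading of such rules can be illustrated in Python. This is a minimal sketch over the slide's table; the function and variable names are ours:

```python
from collections import defaultdict

# The 14-object decision table from the slide (last column = Sport?).
DATA = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = {i + 1: line.split() for i, line in enumerate(DATA.splitlines())}
ATTR = {"Outlook": 0, "Temp": 1, "Humid": 2, "Wind": 3}

def blocks(B):
    """Indiscernibility classes: objects identical on every attribute in B."""
    groups = defaultdict(set)
    for u, row in ROWS.items():
        groups[tuple(row[ATTR[a]] for a in B)].add(u)
    return list(groups.values())

def lower(B, decision):
    """Lower approximation: union of B-blocks fully inside the class."""
    cls = {u for u, row in ROWS.items() if row[4] == decision}
    return {u for blk in blocks(B) if blk <= cls for u in blk}

def upper(B, decision):
    """Upper approximation: union of B-blocks intersecting the class."""
    cls = {u for u, row in ROWS.items() if row[4] == decision}
    return {u for blk in blocks(B) if blk & cls for u in blk}
```

For B = {Temp, Humid}, the block {10, 11} of (Mild, Normal) objects lies entirely inside the lower approximation of “Yes”, which is exactly what makes the rule above certain.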
Lower & Upper Approximations: Rules and Approximations
[Same decision table as above, with the B-blocks forming the positive region POS(Sport?|B) highlighted]
Feature Reduction (Selection)
• Reducts: optimal attribute subsets that approximate the pre-defined target concepts, or the whole data source, well enough
• Feature reduction is one of the steps in the knowledge discovery in databases process
• In real-world situations, we may agree to slightly decrease the quality, if it leads to a significantly simpler knowledge model
[Discernibility matrix for the 14-object table: entry (i,j) lists the attributes among O, T, H, W that discern objects i and j; pairs with equal decision values are left blank]
Discernibility
Association Reducts

1. {a,b,c} ⇒ {d,e}
2. {a,b,d,f} ⇒ {c,e}
3. {a,b,f} ⇒ {e}
4. {a,c,e} ⇒ {b,d}
5. {a,c,f} ⇒ {d}
6. {a,d,e} ⇒ {b,c}
7. {a,d,f} ⇒ {c}
8. {a,e,f} ⇒ {b}
9. {b,c,d} ⇒ {a,e}
10. {b,d,e} ⇒ {a,c}
11. {b,e,f} ⇒ {a}
12. {c,d,f} ⇒ {a}
13. {c,e,f} ⇒ {a,b,d}
a b c d e f
u1 1 1 1 1 1 1
u2 0 0 0 1 1 1
u3 1 0 1 1 0 1
u4 0 1 0 0 0 0
u5 1 0 0 0 0 1
u6 1 1 1 1 1 0
u7 0 1 1 0 1 2
pair a b c d e f      pair a b c d e f
12 0 0 0 1 1 1 34 0 0 0 0 1 0
13 1 0 1 1 0 1 35 1 1 0 0 1 1
14 0 1 0 0 0 0 36 1 0 1 1 0 0
15 1 0 0 0 0 1 37 0 0 1 0 0 0
16 1 1 1 1 1 0 45 0 0 1 1 1 0
17 0 1 1 0 1 0 46 0 1 0 0 0 1
23 0 1 0 1 0 1 47 1 1 0 1 0 0
24 1 0 1 0 0 0 56 1 0 0 0 0 0
25 0 1 1 0 0 1 57 0 0 0 1 0 0
26 0 0 0 1 1 0 67 0 1 1 0 1 0
27 1 0 0 0 1 0
Association Reducts as Association Rules in Indiscernibility Tables
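This reading can be checked mechanically. A Python sketch over the 7-object table above (names are ours):

```python
from itertools import combinations

# The 7-object table u1..u7 over attributes a..f from the slide.
U = {
    1: (1, 1, 1, 1, 1, 1),
    2: (0, 0, 0, 1, 1, 1),
    3: (1, 0, 1, 1, 0, 1),
    4: (0, 1, 0, 0, 0, 0),
    5: (1, 0, 0, 0, 0, 1),
    6: (1, 1, 1, 1, 1, 0),
    7: (0, 1, 1, 0, 1, 2),
}
IDX = {a: i for i, a in enumerate("abcdef")}

def indiscernibility_table():
    """One row per object pair; entry 1 iff the pair agrees on the attribute."""
    return {(x, y): tuple(int(U[x][i] == U[y][i]) for i in range(6))
            for x, y in combinations(sorted(U), 2)}

def rule_holds(C, D):
    """Association rule C -> D with confidence 1 in the indiscernibility
    table: every pair agreeing on all of C must also agree on all of D."""
    for row in indiscernibility_table().values():
        if all(row[IDX[a]] for a in C) and not all(row[IDX[a]] for a in D):
            return False
    return True
```

For instance, rule_holds("bef", "a") confirms reduct 11 above, while rule_holds("af", "e") shows that dropping b from reduct 3 breaks the rule (the pair (u1, u3) agrees on a and f but not on e).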
Most Interesting Reducts
• Given an association reduct (C,D), we evaluate it with the value F(|C|,|D|)
• The function F: N × N → R should satisfy:
  IF n1 < n2 THEN F(n1,m) > F(n2,m)
  IF m1 < m2 THEN F(n,m1) < F(n,m2)
• F(|C|,|D|) is maximized subject to a constraint # taken from the space of approximation parameters
• This maximization problem is NP-hard
What can # actually mean?
1) |POS(d|B)|
2) Disc(d|B) = Disc(B ∪ {d}) – Disc(B), where Disc(X) = |{(u1,u2): X(u1) ≠ X(u2)}|
3) Relative Gain R(d|B) =
4) Entropy H(d|B) = H(B ∪ {d}) – H(B)
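The candidate meanings of # can be illustrated in Python. A minimal sketch on a tiny hypothetical table (all names are ours):

```python
from itertools import combinations
from math import log2

# A tiny hypothetical table: condition attributes a, b and decision d.
T = {
    1: {"a": 0, "b": 0, "d": "No"},
    2: {"a": 0, "b": 1, "d": "Yes"},
    3: {"a": 1, "b": 1, "d": "Yes"},
    4: {"a": 1, "b": 1, "d": "No"},
}

def disc(X):
    """Disc(X): number of object pairs discerned by the attribute set X."""
    return sum(1 for u, v in combinations(T, 2)
               if any(T[u][x] != T[v][x] for x in X))

def disc_gain(B):
    """Disc(d|B) = Disc(B u {d}) - Disc(B): extra pairs discerned by d."""
    return disc(set(B) | {"d"}) - disc(set(B))

def entropy(X):
    """Entropy of the partition induced by value vectors over X."""
    counts = {}
    for u in T:
        key = tuple(T[u][x] for x in sorted(X))
        counts[key] = counts.get(key, 0) + 1
    n = len(T)
    return -sum(c / n * log2(c / n) for c in counts.values())

def entropy_gain(B):
    """H(d|B) = H(B u {d}) - H(B)."""
    return entropy(set(B) | {"d"}) - entropy(set(B))
```

On this table, {a} discerns 4 of the 6 object pairs, adding d discerns all 6, so Disc(d|{a}) = 2; the entropy gain H(d|{a}) is 1 bit.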
Conceptual Reducts

(∅, ∅)
({3,7,12,13}, {O})
({1-3,7,9,12,13}, {O,T})
({1-3,7-9,11-13}, {O,H})
({3-7,10,12-14}, {O,W})
({10,11,13}, {T,H})
({2,5,9}, {T,W})
({5,9,10,13}, {H,W})
({1-3,7-13}, {O,T,H})
(1-14, {O,T,W})
(1-14, {O,H,W})
({2,5,9-11,13}, {H,T,W})
Reduct as a pair (X,B), where X ⊆ U, POS(B) = X, and POS(C) ⊊ X for any C ⊊ B
Reduct „Lattice”

(∅, ∅)
({3,7,12,13}, {O})
({1-3,7,9,12,13}, {O,T})
({1-3,7-9,11-13}, {O,H})
({3-7,10,12-14}, {O,W})
({10,11,13}, {T,H})
({2,5,9}, {T,W})
({5,9,10,13}, {H,W})
({1-3,7-13}, {O,T,H})
(1-14, {O,T,W})
(1-14, {O,H,W})
({2,5,9-11,13}, {H,T,W})
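Under this definition, the whole lattice can be enumerated by brute force over attribute subsets. A Python sketch over the slide's table, with O, T, H, W abbreviating the four attributes (names are ours):

```python
from collections import defaultdict
from itertools import combinations

# The slide's decision table; O, T, H, W abbreviate the four attributes.
DATA = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = {i + 1: r.split() for i, r in enumerate(DATA.splitlines())}
ATTR = {"O": 0, "T": 1, "H": 2, "W": 3}

def pos(B):
    """Positive region: objects lying in decision-homogeneous B-blocks."""
    groups = defaultdict(list)
    for u, row in ROWS.items():
        groups[tuple(row[ATTR[a]] for a in B)].append(u)
    return {u for blk in groups.values()
            if len({ROWS[v][4] for v in blk}) == 1 for u in blk}

def conceptual_reducts():
    """All pairs (X, B) with X = POS(B) and POS(C) a proper subset of X
    for every proper subset C of B."""
    attrs = sorted(ATTR)
    found = []
    for r in range(len(attrs) + 1):
        for B in combinations(attrs, r):
            X = pos(B)
            if all(pos(C) < X
                   for k in range(len(B)) for C in combinations(B, k)):
                found.append((frozenset(X), frozenset(B)))
    return found
```

On this table the search returns exactly the 12 pairs listed above, including (∅, ∅) and excluding the full attribute set, whose positive region is already reached by {O,T,W}.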
Most Interesting Reducts
• Given a conceptual reduct (X,B), we evaluate it with the value F(|X|,|B|)
• The function F: N × N → R should satisfy:
  IF n1 < n2 THEN F(n1,m) < F(n2,m)
  IF m1 < m2 THEN F(n,m1) > F(n,m2)
• So we should maximize F(|X|,|B|) or...
• ... shall we rather search for ensembles?
“Good” Ensembles of Reducts
• Reducts with minimal cardinalities (or minimal rules)
• Reducts with minimal pairwise intersections

[Venn diagram: three reducts R1, R2, R3 drawn over the set of ATTRIBUTES]

Challenge: how to modify the existing attribute reduction methods to search for such „good” ensembles?
Hybrid Genetic Algorithm (1)
• Genetic part, where each chromosome encodes a permutation of the attributes
• Heuristic part, where each permutation of attributes a(1),...,a(|A|) is decoded by the following algorithm:

1. LET LEFT = A
2. FOR i = 1 TO |A| REPEAT
   (a) LET LEFT ← LEFT \ {a(i)}
   (b) IF NOT (LEFT # d) THEN UNDO (a)
3. EVALUATE REDUCT LEFT
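One way to read the heuristic part in Python, with the criterion # instantiated as preservation of the positive region (a sketch over the earlier table; names are ours):

```python
from collections import defaultdict

# The slide's decision table; O, T, H, W abbreviate the four attributes.
DATA = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = {i + 1: r.split() for i, r in enumerate(DATA.splitlines())}
ATTR = {"O": 0, "T": 1, "H": 2, "W": 3}

def pos(B):
    """Positive region of the decision w.r.t. attribute set B."""
    groups = defaultdict(list)
    for u, row in ROWS.items():
        groups[tuple(row[ATTR[a]] for a in B)].append(u)
    return {u for blk in groups.values()
            if len({ROWS[v][4] for v in blk}) == 1 for u in blk}

def reduct_from_permutation(perm):
    """Decode a chromosome (attribute permutation): try to drop each
    attribute in turn; undo the drop if the positive region shrinks."""
    left = set(ATTR)
    target = pos(left)
    for a in perm:
        if pos(left - {a}) == target:   # removal keeps the criterion
            left -= {a}
    return left
```

The genetic part evolves the permutations; every decoded LEFT is guaranteed to be a decision reduct, and different permutations reach different reducts, e.g. {O,T,W} and {O,H,W} on this table.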
Hybrid Genetic Algorithm (2)
1. LET (LEFT, RIGHT) = (∅, A)
2. FOR i = 1 TO |U|+|A| REPEAT
   IF σ(i) ∈ {1,...,|U|} THEN
      IF u(i) ∈ POS(RIGHT) THEN LET LEFT ← LEFT ∪ {u(i)}
   IF σ(i) ∈ {|U|+1,...,|U|+|A|} THEN
      IF POS(RIGHT \ {a(i)}) ⊇ LEFT THEN LET RIGHT ← RIGHT \ {a(i)}
3. EVALUATE REDUCT (LEFT, RIGHT)
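A possible reading of this second algorithm in Python, under the assumption that the permutation σ interleaves objects (integers) and attributes (strings) and that the invariant LEFT ⊆ POS(RIGHT) is maintained throughout (a sketch; names are ours):

```python
from collections import defaultdict

# The slide's decision table; O, T, H, W abbreviate the four attributes.
DATA = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = {i + 1: r.split() for i, r in enumerate(DATA.splitlines())}
ATTR = {"O": 0, "T": 1, "H": 2, "W": 3}

def pos(B):
    """Positive region of the decision w.r.t. attribute set B."""
    groups = defaultdict(list)
    for u, row in ROWS.items():
        groups[tuple(row[ATTR[a]] for a in B)].append(u)
    return {u for blk in groups.values()
            if len({ROWS[v][4] for v in blk}) == 1 for u in blk}

def pair_from_permutation(perm):
    """Decode a permutation over objects (ints) and attributes (strings):
    objects try to join LEFT, attributes try to leave RIGHT, keeping
    LEFT a subset of POS(RIGHT) at all times."""
    left, right = set(), set(ATTR)
    for x in perm:
        if isinstance(x, str):           # attribute: drop if LEFT stays covered
            if left <= pos(right - {x}):
                right -= {x}
        elif x in pos(right):            # object: add if currently positive
            left.add(x)
    return left, right
```

For example, a permutation that visits all 14 objects first and then the attributes O, T, H, W yields the pair (1-14, {O,H,W}) from the lattice above.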
Reduct „Lattice” once more

(∅, ∅)
({3,7,12,13}, {O})
({1-3,7,9,12,13}, {O,T})
({1-3,7-9,11-13}, {O,H})
({3-7,10,12-14}, {O,W})
({10,11,13}, {T,H})
({2,5,9}, {T,W})
({5,9,10,13}, {H,W})
({1-3,7-13}, {O,T,H})
(1-14, {O,T,W})
(1-14, {O,H,W})
({2,5,9-11,13}, {H,T,W})
[Diagram: CLUSTERS OF ATTRIBUTES, REDUCTS WITH CLUSTER REPRESENTATIVES, and FEEDBACK between them]

Grużdź, Ihnatowicz, Ślęzak: Interactive gene clustering – a case study of breast cancer microarray data. Information Systems Frontiers 8 (2006).
Feature Clustering / Selection
• Frequent occurrence of representatives in reducts yields splitting clusters
• Rare occurrence of pairs of close representatives yields merging clusters
How about groups of rows (1)

• Data-based knowledge models, classifiers...
• Database indices, data partitioning, data sorting...
• Difficulty with fast updates of structures...
How about groups of rows (2)

[Diagram: a query is matched (OUT) against groups of rows using rough values such as min, max, sum, number of Nulls, and pattern matches, instead of exact data]
Infobright’s Technology
Two-Level Computing
Large Data (10TB) & Mixed Workloads

SELECT MAX(A) FROM T WHERE B>15;

[Diagram: the query is resolved over DATA in three passes: STEP 1, STEP 2, STEP 3]
Knowledge Structures (Nodes)
Order Detail Table – assume many more rows:

Order Number  Order Date  Part ID  Quantity  $Amt
005           20070214    234      500       1500.00
005           20070214    334      125       250.25
006           20070215    334      100       212.50

Supplier/Part Table – assume many more rows:

Supplier ID  Effective Date  Expiry Date  Part ID  Description
A456         20050315        Null         234      Pre-measured coffee packets – gold blend
A456         20061201        Null         235      Pre-measured coffee packets – silver blend
A456         20060501        Null         334      4-cup Cone coffee filters; quantity 50
        Pack 1  Pack 2
Pack 1  0       1
Pack 2  1       0
Pack 3  0       0
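The pack-level pruning behind such knowledge nodes can be sketched as follows. The per-pack statistics here are hypothetical and all names are ours; the example shows how SELECT MAX(A) FROM T WHERE B>15 can be answered while decompressing as few packs as possible:

```python
# Hypothetical per-pack (min, max) statistics for columns A and B.
PACKS = [
    {"A": (1, 9),  "B": (20, 30)},   # every row passes the filter B > 15
    {"A": (5, 50), "B": (0, 10)},    # no row passes the filter
    {"A": (2, 40), "B": (10, 25)},   # some rows may pass
]

def classify(pack, threshold=15):
    """Rough classification of a pack against the condition B > threshold."""
    lo, hi = pack["B"]
    if lo > threshold:
        return "relevant"      # condition holds for all rows in the pack
    if hi <= threshold:
        return "irrelevant"    # condition holds for no rows in the pack
    return "suspect"           # must be decompressed and scanned

def packs_to_decompress(threshold=15):
    """Suspect packs whose A-max could still beat the exact lower bound
    obtained from relevant packs alone."""
    labels = [classify(p, threshold) for p in PACKS]
    best = max((p["A"][1] for p, l in zip(PACKS, labels) if l == "relevant"),
               default=float("-inf"))
    return [i for i, (p, l) in enumerate(zip(PACKS, labels))
            if l == "suspect" and p["A"][1] > best]
```

Here the irrelevant pack is skipped outright, the relevant pack contributes its max without decompression, and only the suspect pack needs an exact scan, since its A-max (40) exceeds the bound (9) established by the relevant pack.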
DATA – Best Inspiration
• New Objectives
• New Schemas
• New Volumes
• New Queries
• New Types
• New KNs
• ...
References (Unfinished List)
• D. Ślęzak, J. Wróblewski, V. Eastwood, P. Synak: Brighthouse - An Analytic Data Warehouse for Ad-hoc Queries. VLDB 2008: 1337-1345.
• D. Ślęzak: Rough Sets and Few-Objects-Many-Attributes Problem - The Case Study of Analysis of Gene Expression Data Sets. FBIT 2007: 437-440.
• D. Ślęzak: Rough Sets and Functional Dependencies in Data - Foundations of Association Reducts. To appear.
• ......