ROUGH SETS and FCA
Foundations and Case Studies of Feature Subset Selection and Knowledge Structure Formation
DOMINIK ŚLĘZAK
www.infobright.com  www.infobright.org
Jan 25, 2016
Contents
• Rough Sets & Feature Selection
  – Association Reducts
  – Conceptual Reducts
  – Building Ensembles
  – Towards Clustering
• Rough Sets & Infobright Story
  – Rough & Granular Computation
  – Knowledge Structure Formation
Rough Sets
• Rough set theory, proposed by Z. Pawlak in 1982, is a model of approximate reasoning
• In applications, it focuses on the derivation of approximate knowledge from databases
• It provides good results in domains such as Web analysis, finance, industry, multimedia, medicine, and bioinformatics
Decision Systems

Outlook Temp. Humid. Wind Sport?
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cold Normal Weak Yes
6 Rain Cold Normal Strong No
7 Overcast Cold Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cold Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
IF (H=Normal) AND (T=Mild) THEN (S=Yes)
This rule corresponds to a data block included in the positive region of the decision class “Yes”
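The block-based reading of such rules can be illustrated in Python. This is a minimal sketch over the slide's table; the function and variable names are ours:

```python
from collections import defaultdict

# The 14-object decision table from the slide (last column = Sport?).
DATA = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = {i + 1: line.split() for i, line in enumerate(DATA.splitlines())}
ATTR = {"Outlook": 0, "Temp": 1, "Humid": 2, "Wind": 3}

def blocks(B):
    """Indiscernibility classes: objects identical on every attribute in B."""
    groups = defaultdict(set)
    for u, row in ROWS.items():
        groups[tuple(row[ATTR[a]] for a in B)].add(u)
    return list(groups.values())

def lower(B, decision):
    """Lower approximation: union of B-blocks fully inside the class."""
    cls = {u for u, row in ROWS.items() if row[4] == decision}
    return {u for blk in blocks(B) if blk <= cls for u in blk}

def upper(B, decision):
    """Upper approximation: union of B-blocks intersecting the class."""
    cls = {u for u, row in ROWS.items() if row[4] == decision}
    return {u for blk in blocks(B) if blk & cls for u in blk}
```

For B = {Temp, Humid}, the block {10, 11} of (Mild, Normal) objects lies entirely inside the lower approximation of “Yes”, which is exactly what makes the rule above certain.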
Lower & Upper Approximations: Rules and Approximations
[Same decision table as above, with the B-blocks forming the positive region POS(Sport?|B) highlighted]
Feature Reduction (Selection)
• Reducts: optimal attribute subsets that approximate the pre-defined target concepts, or the whole data source, well enough
• Feature reduction is one of the steps in the knowledge discovery in databases process
• In real-world situations, we may agree to slightly decrease the quality, if it leads to a significantly simpler knowledge model
[Discernibility matrix for the 14-object table: entry (i,j) lists the attributes among O, T, H, W that discern objects i and j; pairs with equal decision values are left blank]
Discernibility
Association Reducts

1. {a,b,c} ⇒ {d,e}
2. {a,b,d,f} ⇒ {c,e}
3. {a,b,f} ⇒ {e}
4. {a,c,e} ⇒ {b,d}
5. {a,c,f} ⇒ {d}
6. {a,d,e} ⇒ {b,c}
7. {a,d,f} ⇒ {c}
8. {a,e,f} ⇒ {b}
9. {b,c,d} ⇒ {a,e}
10. {b,d,e} ⇒ {a,c}
11. {b,e,f} ⇒ {a}
12. {c,d,f} ⇒ {a}
13. {c,e,f} ⇒ {a,b,d}
a b c d e f
u1 1 1 1 1 1 1
u2 0 0 0 1 1 1
u3 1 0 1 1 0 1
u4 0 1 0 0 0 0
u5 1 0 0 0 0 1
u6 1 1 1 1 1 0
u7 0 1 1 0 1 2
pair a b c d e f      pair a b c d e f
12 0 0 0 1 1 1 34 0 0 0 0 1 0
13 1 0 1 1 0 1 35 1 1 0 0 1 1
14 0 1 0 0 0 0 36 1 0 1 1 0 0
15 1 0 0 0 0 1 37 0 0 1 0 0 0
16 1 1 1 1 1 0 45 0 0 1 1 1 0
17 0 1 1 0 1 0 46 0 1 0 0 0 1
23 0 1 0 1 0 1 47 1 1 0 1 0 0
24 1 0 1 0 0 0 56 1 0 0 0 0 0
25 0 1 1 0 0 1 57 0 0 0 1 0 0
26 0 0 0 1 1 0 67 0 1 1 0 1 0
27 1 0 0 0 1 0
Association Reducts as Association Rules in Indiscernibility Tables
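This reading can be checked mechanically. A Python sketch over the 7-object table above (names are ours):

```python
from itertools import combinations

# The 7-object table u1..u7 over attributes a..f from the slide.
U = {
    1: (1, 1, 1, 1, 1, 1),
    2: (0, 0, 0, 1, 1, 1),
    3: (1, 0, 1, 1, 0, 1),
    4: (0, 1, 0, 0, 0, 0),
    5: (1, 0, 0, 0, 0, 1),
    6: (1, 1, 1, 1, 1, 0),
    7: (0, 1, 1, 0, 1, 2),
}
IDX = {a: i for i, a in enumerate("abcdef")}

def indiscernibility_table():
    """One row per object pair; entry 1 iff the pair agrees on the attribute."""
    return {(x, y): tuple(int(U[x][i] == U[y][i]) for i in range(6))
            for x, y in combinations(sorted(U), 2)}

def rule_holds(C, D):
    """Association rule C -> D with confidence 1 in the indiscernibility
    table: every pair agreeing on all of C must also agree on all of D."""
    for row in indiscernibility_table().values():
        if all(row[IDX[a]] for a in C) and not all(row[IDX[a]] for a in D):
            return False
    return True
```

For instance, rule_holds("bef", "a") confirms reduct 11 above, while rule_holds("af", "e") shows that dropping b from reduct 3 breaks the rule (the pair (u1, u3) agrees on a and f but not on e).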
Most Interesting Reducts
• Given an association reduct (C,D), we evaluate it with the value F(|C|,|D|)
• The function F: N × N → R should satisfy:
  IF n1 < n2 THEN F(n1,m) > F(n2,m)
  IF m1 < m2 THEN F(n,m1) < F(n,m2)
• F(|C|,|D|) is maximized subject to a constraint # taken from the space of approximation parameters
• This maximization problem is NP-hard
What can # actually mean?
1) |POS(d|B)|
2) Disc(d|B) = Disc(B ∪ {d}) – Disc(B), where Disc(X) = |{(u1,u2): X(u1) ≠ X(u2)}|
3) Relative Gain R(d|B) =
4) Entropy H(d|B) = H(B ∪ {d}) – H(B)
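The candidate meanings of # can be illustrated in Python. A minimal sketch on a tiny hypothetical table (all names are ours):

```python
from itertools import combinations
from math import log2

# A tiny hypothetical table: condition attributes a, b and decision d.
T = {
    1: {"a": 0, "b": 0, "d": "No"},
    2: {"a": 0, "b": 1, "d": "Yes"},
    3: {"a": 1, "b": 1, "d": "Yes"},
    4: {"a": 1, "b": 1, "d": "No"},
}

def disc(X):
    """Disc(X): number of object pairs discerned by the attribute set X."""
    return sum(1 for u, v in combinations(T, 2)
               if any(T[u][x] != T[v][x] for x in X))

def disc_gain(B):
    """Disc(d|B) = Disc(B u {d}) - Disc(B): extra pairs discerned by d."""
    return disc(set(B) | {"d"}) - disc(set(B))

def entropy(X):
    """Entropy of the partition induced by value vectors over X."""
    counts = {}
    for u in T:
        key = tuple(T[u][x] for x in sorted(X))
        counts[key] = counts.get(key, 0) + 1
    n = len(T)
    return -sum(c / n * log2(c / n) for c in counts.values())

def entropy_gain(B):
    """H(d|B) = H(B u {d}) - H(B)."""
    return entropy(set(B) | {"d"}) - entropy(set(B))
```

On this table, {a} discerns 4 of the 6 object pairs, adding d discerns all 6, so Disc(d|{a}) = 2; the entropy gain H(d|{a}) is 1 bit.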
Conceptual Reducts

(∅, ∅)
({3,7,12,13}, {O})
({1-3,7,9,12,13}, {O,T})
({1-3,7-9,11-13}, {O,H})
({3-7,10,12-14}, {O,W})
({10,11,13}, {T,H})
({2,5,9}, {T,W})
({5,9,10,13}, {H,W})
({1-3,7-13}, {O,T,H})
(1-14, {O,T,W})
(1-14, {O,H,W})
({2,5,9-11,13}, {H,T,W})
Reduct as a pair (X,B), where X ⊆ U, POS(B) = X, and POS(C) ⊊ X for any C ⊊ B
Reduct „Lattice”

(∅, ∅)
({3,7,12,13}, {O})
({1-3,7,9,12,13}, {O,T})
({1-3,7-9,11-13}, {O,H})
({3-7,10,12-14}, {O,W})
({10,11,13}, {T,H})
({2,5,9}, {T,W})
({5,9,10,13}, {H,W})
({1-3,7-13}, {O,T,H})
(1-14, {O,T,W})
(1-14, {O,H,W})
({2,5,9-11,13}, {H,T,W})
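Under this definition, the whole lattice can be enumerated by brute force over attribute subsets. A Python sketch over the slide's table, with O, T, H, W abbreviating the four attributes (names are ours):

```python
from collections import defaultdict
from itertools import combinations

# The slide's decision table; O, T, H, W abbreviate the four attributes.
DATA = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = {i + 1: r.split() for i, r in enumerate(DATA.splitlines())}
ATTR = {"O": 0, "T": 1, "H": 2, "W": 3}

def pos(B):
    """Positive region: objects lying in decision-homogeneous B-blocks."""
    groups = defaultdict(list)
    for u, row in ROWS.items():
        groups[tuple(row[ATTR[a]] for a in B)].append(u)
    return {u for blk in groups.values()
            if len({ROWS[v][4] for v in blk}) == 1 for u in blk}

def conceptual_reducts():
    """All pairs (X, B) with X = POS(B) and POS(C) a proper subset of X
    for every proper subset C of B."""
    attrs = sorted(ATTR)
    found = []
    for r in range(len(attrs) + 1):
        for B in combinations(attrs, r):
            X = pos(B)
            if all(pos(C) < X
                   for k in range(len(B)) for C in combinations(B, k)):
                found.append((frozenset(X), frozenset(B)))
    return found
```

On this table the search returns exactly the 12 pairs listed above, including (∅, ∅) and excluding the full attribute set, whose positive region is already reached by {O,T,W}.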
Most Interesting Reducts
• Given a conceptual reduct (X,B), we evaluate it with the value F(|X|,|B|)
• The function F: N × N → R should satisfy:
  IF n1 < n2 THEN F(n1,m) < F(n2,m)
  IF m1 < m2 THEN F(n,m1) > F(n,m2)
• So we should maximize F(|X|,|B|) or...
• ... shall we rather search for ensembles?
“Good” Ensembles of Reducts
• Reducts with minimal cardinalities (or minimal rules)
• Reducts with minimal pairwise intersections

[Venn diagram: three reducts R1, R2, R3 drawn over the set of ATTRIBUTES]

Challenge: how to modify the existing attribute reduction methods to search for such „good” ensembles?
Hybrid Genetic Algorithm (1)
• Genetic part, where each chromosome encodes a permutation of the attributes
• Heuristic part, where each permutation of attributes a(1),...,a(|A|) is decoded by the following algorithm:

1. LET LEFT = A
2. FOR i = 1 TO |A| REPEAT
   (a) LET LEFT ← LEFT \ {a(i)}
   (b) IF NOT (LEFT # d) THEN UNDO (a)
3. EVALUATE REDUCT LEFT
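One way to read the heuristic part in Python, with the criterion # instantiated as preservation of the positive region (a sketch over the earlier table; names are ours):

```python
from collections import defaultdict

# The slide's decision table; O, T, H, W abbreviate the four attributes.
DATA = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = {i + 1: r.split() for i, r in enumerate(DATA.splitlines())}
ATTR = {"O": 0, "T": 1, "H": 2, "W": 3}

def pos(B):
    """Positive region of the decision w.r.t. attribute set B."""
    groups = defaultdict(list)
    for u, row in ROWS.items():
        groups[tuple(row[ATTR[a]] for a in B)].append(u)
    return {u for blk in groups.values()
            if len({ROWS[v][4] for v in blk}) == 1 for u in blk}

def reduct_from_permutation(perm):
    """Decode a chromosome (attribute permutation): try to drop each
    attribute in turn; undo the drop if the positive region shrinks."""
    left = set(ATTR)
    target = pos(left)
    for a in perm:
        if pos(left - {a}) == target:   # removal keeps the criterion
            left -= {a}
    return left
```

The genetic part evolves the permutations; every decoded LEFT is guaranteed to be a decision reduct, and different permutations reach different reducts, e.g. {O,T,W} and {O,H,W} on this table.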
Hybrid Genetic Algorithm (2)
1. LET (LEFT, RIGHT) = (∅, A)
2. FOR i = 1 TO |U|+|A| REPEAT
   IF σ(i) ∈ {1,...,|U|} THEN
      IF u(i) ∈ POS(RIGHT) THEN LET LEFT ← LEFT ∪ {u(i)}
   IF σ(i) ∈ {|U|+1,...,|U|+|A|} THEN
      IF POS(RIGHT \ {a(i)}) ⊇ LEFT THEN LET RIGHT ← RIGHT \ {a(i)}
3. EVALUATE REDUCT (LEFT, RIGHT)
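A possible reading of this second algorithm in Python, under the assumption that the permutation σ interleaves objects (integers) and attributes (strings) and that the invariant LEFT ⊆ POS(RIGHT) is maintained throughout (a sketch; names are ours):

```python
from collections import defaultdict

# The slide's decision table; O, T, H, W abbreviate the four attributes.
DATA = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cold Normal Weak Yes
Rain Cold Normal Strong No
Overcast Cold Normal Strong Yes
Sunny Mild High Weak No
Sunny Cold Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = {i + 1: r.split() for i, r in enumerate(DATA.splitlines())}
ATTR = {"O": 0, "T": 1, "H": 2, "W": 3}

def pos(B):
    """Positive region of the decision w.r.t. attribute set B."""
    groups = defaultdict(list)
    for u, row in ROWS.items():
        groups[tuple(row[ATTR[a]] for a in B)].append(u)
    return {u for blk in groups.values()
            if len({ROWS[v][4] for v in blk}) == 1 for u in blk}

def pair_from_permutation(perm):
    """Decode a permutation over objects (ints) and attributes (strings):
    objects try to join LEFT, attributes try to leave RIGHT, keeping
    LEFT a subset of POS(RIGHT) at all times."""
    left, right = set(), set(ATTR)
    for x in perm:
        if isinstance(x, str):           # attribute: drop if LEFT stays covered
            if left <= pos(right - {x}):
                right -= {x}
        elif x in pos(right):            # object: add if currently positive
            left.add(x)
    return left, right
```

For example, a permutation that visits all 14 objects first and then the attributes O, T, H, W yields the pair (1-14, {O,H,W}) from the lattice above.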
Reduct „Lattice” once more

(∅, ∅)
({3,7,12,13}, {O})
({1-3,7,9,12,13}, {O,T})
({1-3,7-9,11-13}, {O,H})
({3-7,10,12-14}, {O,W})
({10,11,13}, {T,H})
({2,5,9}, {T,W})
({5,9,10,13}, {H,W})
({1-3,7-13}, {O,T,H})
(1-14, {O,T,W})
(1-14, {O,H,W})
({2,5,9-11,13}, {H,T,W})
[Diagram: CLUSTERS OF ATTRIBUTES, REDUCTS WITH CLUSTER REPRESENTATIVES, and FEEDBACK between them]

Grużdź, Ihnatowicz, Ślęzak: Interactive gene clustering – a case study of breast cancer microarray data. Information Systems Frontiers 8 (2006).
Feature Clustering / Selection
• Frequent occurrence of representatives in reducts yields splitting clusters
• Rare occurrence of pairs of close representatives yields merging clusters
How about groups of rows (1)

• Data-based knowledge models, classifiers...
• Database indices, data partitioning, data sorting...
• Difficulty with fast updates of structures...
How about groups of rows (2)

[Diagram: a query is matched (OUT) against groups of rows using rough values such as min, max, sum, number of Nulls, and pattern matches, instead of exact data]
Infobright’s Technology
Two-Level Computing
Large Data (10TB) & Mixed Workloads

SELECT MAX(A) FROM T WHERE B>15;

[Diagram: the query is resolved over DATA in three passes: STEP 1, STEP 2, STEP 3]
Knowledge Structures (Nodes)
Order Detail Table – assume many more rows:

Order Number  Order Date  Part ID  Quantity  $Amt
005           20070214    234      500       1500.00
005           20070214    334      125       250.25
006           20070215    334      100       212.50

Supplier/Part Table – assume many more rows:

Supplier ID  Effective Date  Expiry Date  Part ID  Description
A456         20050315        Null         234      Pre-measured coffee packets – gold blend
A456         20061201        Null         235      Pre-measured coffee packets – silver blend
A456         20060501        Null         334      4-cup Cone coffee filters; quantity 50
        Pack 1  Pack 2
Pack 1  0       1
Pack 2  1       0
Pack 3  0       0
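The pack-level pruning behind such knowledge nodes can be sketched as follows. The per-pack statistics here are hypothetical and all names are ours; the example shows how SELECT MAX(A) FROM T WHERE B>15 can be answered while decompressing as few packs as possible:

```python
# Hypothetical per-pack (min, max) statistics for columns A and B.
PACKS = [
    {"A": (1, 9),  "B": (20, 30)},   # every row passes the filter B > 15
    {"A": (5, 50), "B": (0, 10)},    # no row passes the filter
    {"A": (2, 40), "B": (10, 25)},   # some rows may pass
]

def classify(pack, threshold=15):
    """Rough classification of a pack against the condition B > threshold."""
    lo, hi = pack["B"]
    if lo > threshold:
        return "relevant"      # condition holds for all rows in the pack
    if hi <= threshold:
        return "irrelevant"    # condition holds for no rows in the pack
    return "suspect"           # must be decompressed and scanned

def packs_to_decompress(threshold=15):
    """Suspect packs whose A-max could still beat the exact lower bound
    obtained from relevant packs alone."""
    labels = [classify(p, threshold) for p in PACKS]
    best = max((p["A"][1] for p, l in zip(PACKS, labels) if l == "relevant"),
               default=float("-inf"))
    return [i for i, (p, l) in enumerate(zip(PACKS, labels))
            if l == "suspect" and p["A"][1] > best]
```

Here the irrelevant pack is skipped outright, the relevant pack contributes its max without decompression, and only the suspect pack needs an exact scan, since its A-max (40) exceeds the bound (9) established by the relevant pack.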
DATA – Best Inspiration
• New Objectives
• New Schemas
• New Volumes
• New Queries
• New Types
• New KNs
• ...
References (Unfinished List)
• D. Ślęzak, J. Wróblewski, V. Eastwood, P. Synak: Brighthouse - An Analytic Data Warehouse for Ad-hoc Queries. VLDB 2008: 1337-1345.
• D. Ślęzak: Rough Sets and Few-Objects-Many-Attributes Problem - The Case Study of Analysis of Gene Expression Data Sets. FBIT 2007: 437-440.
• D. Ślęzak: Rough Sets and Functional Dependencies in Data - Foundations of Association Reducts. To appear.
• ......