Page 1
HUMAN-CENTRIC
DATA EXPLORATION
PART 1/5: MOTIVATION, BACKGROUND & OUTLINE
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
Based on joint work with many others (see references and final slide of this lecture)
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
Page 2
SUBJECTIVITY
AND VISUALIZATION
PART 1/5: MOTIVATION, BACKGROUND & OUTLINE
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
Page 4
SUBJECTIVITY = KEY
Three motivating examples:
1. Frequent itemset mining
‒ Individually frequent items are probably frequent together
2. Visualizing high-dimensional data
‒ Outliers cause high variance, so why maximize it in PCA?
‒ Interaction is key
3. Graph embedding
‒ High-degree nodes are probably embedded centrally
Page 5
ASSOCIATION ANALYSIS / ITEMSET MINING
KDD abstracts dataset (documents × words):

Subjective interestingness ranking          |  Support × size (area) ranking
(prior info on row & column sums)           |
Itemset                               #docs |  Itemset              #docs
svm, support, machin, vector             25 |  data, paper            389
state, art                               39 |  algorithm, propose     246
unlabelled, labelled, supervised, learn  10 |  data, mine             312
associ, rule, mine                       36 |  base, method           202
gene, express                            25 |  result, show           196
frequent, itemset                        28 |  problem                373
large, social, network, graph            15 |  data, set              279
column, row                              13 |  approach               330
algorithm, order, magnitud, faster       12 |  model                  301
paper, propos, algorithm, real,          27 |  present                296
  synthetic, data                           |
Page 6
ASSOCIATION ANALYSIS / ITEMSET MINING
KDD abstracts dataset (documents × words); left ranking repeated from the previous slide:

Subjective interestingness ranking          |  Subjective interestingness ranking
(prior info on row & column sums)           |  (additionally, prior info on keyword tiles)
Itemset                               #docs |  Itemset                      #docs
svm, support, machin, vector             25 |  art, state                      39
state, art                               39 |  row, column, algorithm          12
unlabelled, labelled, supervised, learn  10 |  unlabelled, labelled, data      14
associ, rule, mine                       36 |  answer, question                18
gene, express                            25 |  precis, recal                   14
Page 7
VISUALIZING HIGH-DIMENSIONAL DATA
Page 8
CONDITIONAL NETWORK EMBEDDINGS
8
Page 9
EXPLORING DATA
The search for interesting patterns in data, with zillions of 'interestingness measures' (a.k.a. objective functions, quality functions, utility functions, cost functions, ...):
• Association analysis: frequency, lift, confidence, leverage, coverage, ...
• Dimensionality reduction: PCA, ICA, projection pursuit, Laplacian Eigenmaps, t-SNE, LLE, ...
• Graph embedding: Node2Vec, Path2Vec, MetaPath2Vec, ...
• Clustering: k-means clustering, hierarchical clustering, mixture of Gaussians, spectral clustering, ...
• Community detection: stochastic block modelling, modularity, k-cores, quasi-cliques, dense subgraphs, ...
• Privacy-preserving data publishing: discernibility, generalization height, average group size, ...
• ...
Page 10
THE CHALLENGE
Zillions of interestingness measures: good & bad
‒ Good: more options!
‒ Bad: hard to see the forest for the trees...
Challenge:
‒ Formalise true interestingness!
‒ With minimal user interaction
‒ Without requiring user expertise
Page 11
MOTIVATING EXAMPLE
Community detection:
What makes for an interesting community?
‒ Densely connected?
‒ Large?
‒ Few neighbours outside community?
‒ Unrelated to certain known 'affiliations'?
‒ ...
11
Page 12
THE FORSIED APPROACH: SUBJECTIVITY!
12
[Diagram: the data mining researcher defines Interestingness(pattern) based on the data alone]
Page 13
THE FORSIED APPROACH: SUBJECTIVITY!
13
[Diagram: the data mining researcher defines Interestingness(pattern, analyst), which depends on both the data and the data analyst]
Interestingness = subjective
Page 14
MOTIVATING EXAMPLE
Community detection:
User states expectations / beliefs
‒ Formalized as a 'background distribution'
Any 'pattern' that contrasts with this and is easy to describe
= subjectively interesting
14
Page 16
OUTLINE
Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data, including high-dimensional data visualization (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)
16
Page 17
17
Feel free to interrupt
for questions anytime
Page 18
HUMAN-CENTRIC
DATA EXPLORATION
PART 2/5: THE FORSIED FRAMEWORK
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
Page 20
GENERIC FRAMEWORK
3
De Bie, KDD 2011
De Bie, DAMI 2011
Page 21
FORSIED(*) framework
(*) Formalizing Subjective Interestingness in Exploratory Data mining

[Diagram: Data and a model of the user's beliefs (the 'background distribution' P over the data space Ω) together determine which Patterns are shown. A pattern restricts the data: x ∈ Ω′ ⊆ Ω. As patterns are shown to the user, P evolves: P → P′ → P′′ (with Ω ⊇ Ω′ ⊇ Ω′′). Interestingness is subjective: it depends on the user.]

Example instantiation:
• Data: the adjacency matrix of a graph under study
• Patterns: the claim that a specified set of nodes is densely connected
• Prior beliefs: the degrees of the nodes, a known block structure, ...
• Interestingness: subjective information density
• Overlapping communities!

Interestingness(Ω′, P) = InformationContent(Ω′, P) / DescriptionLength(Ω′)
InformationContent(Ω′, P) = −log P(Ω′)
In short: SI = IC / DL
Page 22
THE FINE PRINT

Initial background distribution P?
‒ The maximum entropy distribution subject to the prior belief constraints:
  max_P E_{X∼P}[−log P(X)]  s.t.  E_{X∼P}[f(X)] = c_f  (∀f)

Updated background distribution P′ given a pattern x ∈ Ω′?
‒ P conditioned onto the event x ∈ Ω′:
  P′(Ω′′) = P(Ω′′ ∩ Ω′) / P(Ω′)
  ⇒ −log P′(x) = −log P(x) + log P(Ω′)
  i.e. the information content in the data after the pattern equals the information content in the data before the pattern, minus the information content of the pattern under P.

Description length?
‒ Smaller if the pattern is a better explanation
‒ Essentially problem-dependent
Page 23
WHY MAXIMUM ENTROPY / CONDITIONING?
‒ Most unbiased estimate (informal): no bias other than the constraints
‒ Assumes a cautious / pessimistic user: a user who expects to be very surprised
‒ Leads to the most robust estimate of the true subjective information content: information content estimated with the MaxEnt P will never differ much from the information content w.r.t. the user's true prior beliefs
6
Page 24
A FIRST INSTANTIATION: COMMUNITY DETECTION
7
van Leeuwen, De Bie, Spyropoulou, Mesnage, MLj, 2016
Page 25
COMMUNITY DETECTION IN NETWORKS
Data: graph (adjacency matrix; edge indicator variables)
Prior beliefs: 1. overall density, or 2. vertex degrees
MaxEnt distribution:
  P(A) = ∏_{i>j} P_{i,j}(a_ij)
  P_{i,j}(a_ij) = exp(a_ij · (λ_i + λ_j)) / (1 + exp(λ_i + λ_j))
[Figure: adjacency matrix; cells where P_{i,j}(a_ij) is small vs. large]
Page 26
COMMUNITY DETECTION IN NETWORKS
Data: graph
Prior beliefs: 1. overall density, or 2. vertex degrees
Pattern: a dense subgraph, i.e. Σ_{i,j∈subgraph} a_ij ≥ k
[Figure: adjacency matrix; cells where P_{i,j}(a_ij) is small vs. large]
Page 27
COMMUNITY DETECTION IN NETWORKS
Data: graph
Prior beliefs: 1. overall density, or 2. vertex degrees
Pattern: dense subgraphs
Interestingness: −log P(pattern) / DescriptionLength(pattern)
[Figure: adjacency matrix; cells where P_{i,j}(a_ij) is small vs. large]
Page 28
COMMUNITY DETECTION IN NETWORKS
Data: graph
Prior beliefs: 1. overall density, or 2. vertex degrees
Pattern: dense subgraphs
Interestingness: density vs. size; under prior belief 2, preferably low-degree nodes
[Figure: most interesting subgraph given prior 1 vs. given prior 2]
Page 29
COMMUNITY DETECTION IN NETWORKS
Data: graph
Prior beliefs: 1. overall density, or 2. vertex degrees
Pattern: dense subgraphs
Interestingness: density vs. size; under prior belief 2, preferably low-degree nodes
Hill-climbing for search; update P after each pattern
[Figure: most interesting subgraph given prior 1 vs. given prior 2]
Page 30
COMMUNITY DETECTION IN NETWORKS
Data: graph
Prior beliefs: 1. overall density, or 2. vertex degrees
Pattern: dense subgraphs
Interestingness: density vs. size; under prior belief 2, preferably low-degree nodes
Hill-climbing for search; update P after each pattern
[Figure: communities found in a social network, labelled by music genre: Rock, Trance, Indie, Bhangra, Gospel, Country, Hip hop / grime, Afro pop, UK garage, Hip hop]
Page 31
TAKE-AWAYS
1. What is the data?
2. Determine a suitable pattern syntax
3. What are the prior beliefs? (= what is irrelevant to the user?)
   Compute the background distribution P using maximum entropy
4. Formulate the subjective interestingness:
   Interestingness(Ω′, P) = InformationContent(Ω′, P) / DescriptionLength(Ω′),
   with InformationContent(Ω′, P) = −log P(Ω′)
5. Design an algorithm to optimize it
6. Find out how to condition the background distribution on a pattern
Page 32
THE BACKGROUND DISTRIBUTION: MAXENT
15
Page 33
MAXENT MODEL S.T. DEGREE BELIEFS
A = adjacency matrix with entry a_ij in row i and column j

max_P  Σ_A −P(A) · log P(A)                        (entropy)
s.t.   Σ_A P(A) · Σ_{j=1:n} a_ij = d_i, ∀i = 1:n    (expected degree constraints)
       Σ_A P(A) = 1                                 (normalization)

Convex!

Lagrangian, with Lagrange multipliers λ_i and μ:
L(P, λ, μ) = Σ_A −P(A) log P(A) + Σ_{i=1:n} λ_i (Σ_A P(A) · Σ_{j=1:n} a_ij − d_i) + μ (Σ_A P(A) − 1)
Page 34
MAXENT MODEL S.T. DEGREE BELIEFS
Optimality condition: ∂L/∂P(A) = 0, where
∂L/∂P(A) = −log P(A) − 1 + Σ_{i,j=1:n} λ_i a_ij + μ = 0
Page 35
MAXENT MODEL S.T. DEGREE BELIEFS
From ∂L/∂P(A) = −log P(A) − 1 + Σ_{i,j=1:n} λ_i a_ij + μ = 0, it follows that:

P(A) = exp(μ − 1) · exp(Σ_{i,j=1:n} λ_i a_ij)
     = (1/Z(λ)) · exp(Σ_{i>j} (λ_i + λ_j) a_ij)
     = (1/Z(λ)) · ∏_{i>j} exp((λ_i + λ_j) · a_ij)
     = ∏_{i>j} exp((λ_i + λ_j) · a_ij) / (1 + exp(λ_i + λ_j))
     = ∏_{i>j} P_{i,j}(a_ij)

A product of independent Bernoulli distributions!
Thanks to the fact that each prior belief constraint is on a (weighted) sum of the a_ij.
Page 39
MAXENT MODEL S.T. DEGREE BELIEFS
To find the optimal values of the Lagrange multipliers, solve the dual:
  min_λ L(P, λ)
where P is given as:
  P(A) = ∏_{i>j} exp((λ_i + λ_j) · a_ij) / (1 + exp(λ_i + λ_j))
After some calculations:
  min_λ Σ_{i>j} log(1 + exp(λ_i + λ_j)) − Σ_{i=1:n} λ_i d_i
Page 40
MAXENT MODEL S.T. DEGREE BELIEFS
  min_λ Σ_{i>j} log(1 + exp(λ_i + λ_j)) − Σ_{i=1:n} λ_i d_i
Can be solved using gradient descent:
  ∂/∂λ_k [ Σ_{i>j} log(1 + exp(λ_i + λ_j)) − Σ_{i=1:n} λ_i d_i ]
  = Σ_{i≠k} exp(λ_i + λ_k) / (1 + exp(λ_i + λ_k)) − d_k
  = (expected degree of node k) − (required expected degree of node k)
Lots of computational speed-ups possible...
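This dual can be minimized with a few lines of gradient descent. The sketch below is not from the slides; the function name, step size, and iteration count are illustrative. It fits the λ_i for a small graph and returns the resulting edge probabilities P_{i,j}(1) = σ(λ_i + λ_j):

```python
import numpy as np

def fit_degree_maxent(degrees, lr=0.1, n_iter=10000):
    """Fit the MaxEnt model P(A) = prod_{i>j} Bernoulli(p_ij),
    with p_ij = sigmoid(lam_i + lam_j), so that the expected
    degrees match the given degrees."""
    d = np.asarray(degrees, dtype=float)
    n = len(d)
    lam = np.zeros(n)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(lam[:, None] + lam[None, :])))
        np.fill_diagonal(p, 0.0)           # no self-loops
        grad = p.sum(axis=1) - d           # expected minus required degree
        lam -= lr * grad
    # recompute p with the final lambdas
    p = 1.0 / (1.0 + np.exp(-(lam[:, None] + lam[None, :])))
    np.fill_diagonal(p, 0.0)
    return lam, p

# toy example: 4-node graph with (expected) degrees 1, 1, 2, 2
lam, p = fit_degree_maxent([1, 1, 2, 2])
```

Since the dual is convex, plain gradient descent suffices here; the speed-ups mentioned on the slide (e.g. exploiting that many nodes share the same degree) matter only for large graphs.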
Page 41
TAKE-AWAYS
Constraints on the expected value of weighted sums:
  E_{A∼P}[ Σ_{i,j∈I} f_ij a_ij ] = c,
where the f_ij and c are constants and I is a set of index pairs,
lead to convenient product distributions.
Other examples for graphs:
‒ Overall density (trivial)
‒ Densities of particular blocks (e.g. a block of nodes with the same affiliation)
‒ Assortativity (approximately)
‒ ...
24
Page 42
THE INTERESTINGNESS
25
Page 43
INFORMATION CONTENT
Information content: InformationContent(pattern, P) = −log P(pattern)
Pattern: "the number of edges between a given set of nodes W ⊆ V is larger than or equal to a specified k_W". A bit tricky...
Cliques as a special case: "the set of nodes W ⊆ V forms a clique". Then:
  P(pattern) = ∏_{i>j∈W} P_{i,j}(1)
So:
  InformationContent(pattern, P) = −log P(pattern) = −Σ_{i>j∈W} log P_{i,j}(1)
Larger if |W| is larger and if the P_{i,j}(1) for i, j ∈ W are smaller.
26
Page 44
INFORMATION CONTENT
Pattern (general case): "the number of edges between a given set of nodes W ⊆ V is larger than or equal to a specified k_W"
Probability of at least k_W successes in n_W = |W|·(|W|−1)/2 Bernoulli trials?
Approximated by:
  P(pattern) ≈ exp(−n_W · KL(k_W/n_W ∥ p_W)),
where p_W is the average probability P_{i,j}(1) over the pairs i, j ∈ W.
And thus:
  InformationContent(pattern, P) = −log P(pattern) ≈ n_W · KL(k_W/n_W ∥ p_W)
Larger if |W| (and thus n_W) is larger, p_W is smaller, and k_W is larger.
27
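The KL-based approximation above is cheap to evaluate. A minimal sketch (function names are illustrative; the background rate p_W is assumed to lie strictly between 0 and 1, while the empirical rate may hit the boundary):

```python
import math

def kl_bernoulli(q, p):
    """KL(q || p) between Bernoulli success rates, in nats."""
    if q == 0.0:
        return math.log(1.0 / (1.0 - p))
    if q == 1.0:
        return math.log(1.0 / p)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def information_content(k_W, n_W, p_W):
    """IC of 'at least k_W of the n_W possible edges within W are present'
    when each is present with (average) background probability p_W:
    IC = -log P(pattern) ~= n_W * KL(k_W/n_W || p_W)."""
    return n_W * kl_bernoulli(k_W / n_W, p_W)
```

For example, 9 edges among 10 pairs against a background rate of 0.2 gives an IC of about 11.5 nats, while observing exactly the expected 2 edges gives an IC of 0: no surprise, no information.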
Page 45
DESCRIPTION LENGTH
For cliques: describe the set W, i.e. a constant (to describe |W|) plus a term linear in |W| (to describe its elements):
  DescriptionLength(pattern) = α|W| + β
For dense subgraphs: the constant β also describes k_W.
Page 46
INTERESTINGNESS
Putting things together (cliques):
  Interestingness(pattern, P) = (−Σ_{i>j∈W} log P_{i,j}(1)) / (α|W| + β)
A bit more complex for general dense subgraphs:
  Interestingness(pattern, P) ≈ (n_W · KL(k_W/n_W ∥ p_W)) / (α|W| + β)
Hard to optimize!
‒ Exact search for small graphs
‒ Effective hill climber for large graphs
29
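For the clique case, the full subjective interestingness score is a one-liner. A sketch (names and the default α, β values are illustrative; `P_edge` is assumed to hold the background probabilities P_{i,j}(1)):

```python
import math

def clique_interestingness(W, P_edge, alpha=1.0, beta=1.0):
    """Subjective interestingness of the pattern 'W is a clique':
    IC = -sum over pairs in W of log P_ij(1); DL = alpha*|W| + beta.
    P_edge[i][j] is the background probability of edge (i, j)."""
    W = list(W)
    ic = 0.0
    for a in range(len(W)):
        for b in range(a + 1, len(W)):
            ic += -math.log(P_edge[W[a]][W[b]])
    dl = alpha * len(W) + beta
    return ic / dl
```

Only the search over candidate sets W is hard; scoring any single candidate is trivial, which is what makes hill climbing practical.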
Page 47
TAKE-AWAYS
No compromises w.r.t. interestingness
Often leads to hard search problems
Question: is this intrinsic to genuine subjective
interestingness?
30
Page 48
UPDATING THE BACKGROUND DISTRIBUTION
31
Page 49
UPDATING THE BACKGROUND DISTRIBUTION
Given a pattern, update the background distribution by conditioning on the pattern.
Easy to do for cliques W:
‒ set P′_{i,j}(a_ij = 1) = 1 for i, j ∈ W
Fast to approximate for (non-clique) dense subgraphs W:
‒ set P′_{i,j}(a_ij) ∝ P_{i,j}(a_ij) · exp(λ_W a_ij) for i, j ∈ W, with λ_W chosen such that the expected number of edges within W is k_W
Remains a product of Bernoullis
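Both update rules can be sketched in a few lines. This is not from the slides; it assumes the background distribution is stored as a symmetric matrix `P` of edge probabilities, and it leaves the (one-dimensional, monotone) search for λ_W to the caller:

```python
import numpy as np

def condition_on_clique(P, W):
    """Condition the product-of-Bernoullis background model on the pattern
    'the nodes in W form a clique': edges inside W become certain."""
    Q = P.copy()
    idx = np.array(sorted(W))
    Q[np.ix_(idx, idx)] = 1.0
    np.fill_diagonal(Q, 0.0)   # keep self-loop probabilities at zero
    return Q

def tilt_edge(p, lam):
    """Exponential tilt of one Bernoulli edge probability,
    P'(a=1) proportional to p * exp(lam); this is the per-edge update
    used for (non-clique) dense subgraph patterns."""
    return p * np.exp(lam) / ((1 - p) + p * np.exp(lam))
```

With lam = 0 the tilt leaves the probability unchanged; increasing lam pushes it towards 1, so λ_W can be found by bisection on the expected edge count.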
Page 50
TAKE-AWAYS
Updating can be trivial
Otherwise, often easy to do approximately
33
Page 51
HUMAN-CENTRIC
DATA EXPLORATION
PART 3/5: BINARY MATRICES, GRAPHS, RELATIONAL DATA
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
Page 53
OUTLINE OF THIS PART
(Community detection)
Itemsets
Relational patterns
Connecting trees
Network embedding
3
Page 55
ITEMSETS
Data: binary matrix X ∈ {0,1}^{m×n}:
Beer Diapers Lipstick Carrier SUM
Alice 1 1 1 3
Bob 1 1 1 3
Charlie 1 1 2
Denise 1 1 2
Eve 1 1 2
Frankie 1 1 2
SUM 4 3 2 5
De Bie, DMKD 2011
Page 56
ITEMSETS
Data: binary matrix:
Prior beliefs: row and column sums
6
𝑿 ∈ {0,1}𝑚×𝑛
Beer Diapers Lipstick Carrier SUM
Alice 1 1 1 3
Bob 1 1 1 3
Charlie 1 1 2
Denise 1 1 2
Eve 1 1 2
Frankie 1 1 2
SUM 4 3 2 5
De Bie, DMKD 2011
Page 57
ITEMSETS
Data: binary matrix X ∈ {0,1}^{m×n} (toy matrix as above)
Prior beliefs: row and column sums
Background distribution P:
‒ nonnegative and properly normalized
‒ has the correct marginals
Many solutions!?
‘Unbiased’ distribution: Maximum Entropy distribution
De Bie, DMKD 2011
Page 58
ITEMSETS
Data: binary matrix X ∈ {0,1}^{m×n} (toy matrix as above)
Prior beliefs: row and column sums
Background distribution P, the 'unbiased' choice: the maximum entropy distribution
  P(X) = ∏_{i,j} P_{i,j}(x_ij),   P_{i,j}(x_ij) = exp(x_ij · (μ_i + λ_j)) / (1 + exp(μ_i + λ_j))
Convex optimization problem
De Bie, DMKD 2011
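This background model with row and column margin constraints can be fitted by gradient descent on its convex dual, analogously to the degree case in Part 2. A sketch (not from the slides; step size and iteration count are illustrative), using the margins of the toy matrix:

```python
import numpy as np

def fit_margin_maxent(row_sums, col_sums, lr=0.1, n_iter=20000):
    """MaxEnt model over m x n binary matrices with given expected row
    and column sums: P(x_ij = 1) = sigmoid(mu_i + lam_j)."""
    r = np.asarray(row_sums, dtype=float)
    c = np.asarray(col_sums, dtype=float)
    mu = np.zeros(len(r))
    lam = np.zeros(len(c))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(mu[:, None] + lam[None, :])))
        mu -= lr * (p.sum(axis=1) - r)    # row-margin residuals
        lam -= lr * (p.sum(axis=0) - c)   # column-margin residuals
    return 1.0 / (1.0 + np.exp(-(mu[:, None] + lam[None, :])))

# margins of the slide's toy matrix (6 customers x 4 items)
p = fit_margin_maxent([3, 3, 2, 2, 2, 2], [4, 3, 2, 5])
```

Note the model is determined only up to shifting all μ_i up and all λ_j down by the same constant; the probabilities p are unique.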
Page 59
ITEMSETS
Data: binary matrix:
Prior beliefs: uniform at observed density
X ∈ {0,1}^{m×n} (toy matrix as above; 14 ones in 24 cells)
  P_{i,j} = 14/24 ∀i,j,  with P_{i,j}(x_ij) = exp(x_ij · λ) / (1 + exp λ) and λ = 0.3365
De Bie, DMKD 2011
Page 60
ITEMSETS
Data: binary matrix:
Prior beliefs: row and column sums
Patterns: tiles (set of rows and columns)
X ∈ {0,1}^{m×n} (toy matrix as above)
Large support may not be interesting?
De Bie, DMKD 2011
Page 61
ITEMSETS
Data: binary matrix:
Prior beliefs: row and column sums
Patterns: tiles
X ∈ {0,1}^{m×n} (toy matrix as above)
Large surface may not be interesting?
De Bie, DMKD 2011
Page 62
ITEMSETS
Data: binary matrix:
Prior beliefs: row and column sums
Patterns: tiles
X ∈ {0,1}^{m×n} (toy matrix as above)
Less expected (smaller row and column margins): bingo!
De Bie, DMKD 2011
Page 63
ITEMSETS
Data: binary matrix:
Prior beliefs: row and column sums
Patterns: tiles
As stated, SI = IC / DL:
  IC(Z) = −log Pr(Z) = Σ_{(i,j)∈Z} −log p_{i,j}
  DL(Z) = a · (#rows + #columns) + b
X ∈ {0,1}^{m×n} (toy matrix as above)
Less expected (smaller row and column margins): bingo!
De Bie, DMKD 2011
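The resulting score for one tile is then straightforward to compute. A sketch (names and the default a, b values are illustrative; `P` holds the background probabilities p_{i,j} of a one in each cell):

```python
import math

def tile_interestingness(rows, cols, P, a=1.0, b=1.0):
    """SI of a tile of ones covering the given rows and columns:
    IC(Z) = sum over cells (i,j) in the tile of -log p_ij,
    DL(Z) = a * (#rows + #columns) + b."""
    ic = sum(-math.log(P[i][j]) for i in rows for j in cols)
    dl = a * (len(rows) + len(cols)) + b
    return ic / dl
```

Cells the background model already considers likely (p_{i,j} close to 1) contribute almost nothing to IC, which is exactly why tiles with small margins win over high-support tiles.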
Page 64
ITEMSETS
Data: binary matrix X ∈ {0,1}^{m×n} (toy matrix as above)
Prior beliefs: row and column sums
Patterns: tiles
Iterate: we specified that there are ones in the found tile, so the background distribution is updated accordingly
De Bie, DMKD 2011
Page 65
ITEMSETS
Data: binary matrix:
Prior beliefs: row and column sums
Patterns: tiles
X ∈ {0,1}^{m×n} (toy matrix as above)
Iterate
De Bie, DMKD 2011
Page 66
ITEMSETS
Data: binary matrix:
Prior beliefs: row and column sums
Patterns: tiles
X ∈ {0,1}^{m×n} (toy matrix as above)
Iterate
De Bie, DMKD 2011
Page 67
ITEMSETS
KDD abstracts dataset (documents × words): subjective interestingness ranking (prior info on row & column sums) vs. ranking by support × size (area); same result tables as on Page 5.
De Bie, DMKD 2011
Page 68
ITEMSETS
KDD abstracts dataset (documents × words): subjective interestingness ranking with prior info on row & column sums only vs. with additional prior info on keyword tiles; same result tables as on Page 6.
De Bie, DMKD 2011
Page 69
ITEMSETS
Data: binary matrix:
Prior beliefs: row and column sums
Background distribution 𝑃 Patterns: tiles of ones and zeros
19
𝑿 ∈ {0,1}𝑚×𝑛
Beer Diapers Lipstick Carrier SUM
Alice 1 1 1 3
Bob 1 1 1 3
Charlie 1 1 2
Denise 1 1 2
Eve 1 1 2
Frankie 1 1 2
SUM 4 3 2 5
Extension: noisy tiles
Kontonasios & De Bie, SDM 2010
IC is straightforward
DL depends on the skew
(entropy) of the distribution of
ones and zeros within the tile
Page 70
ITEMSETS
Algorithmic approach: ?
Not studied extensively, but a special case of relational patterns (see next instantiation)
Interesting result: if we can mine the best pattern at every iteration, then the greedy set of tiles achieves at least a fraction 1 − 1/e (≈ 0.63) of the optimal total IC at that (cumulative) DL.
Kontonasios & De Bie, SDM 2010
Page 71
RELATIONAL PATTERNS
21
Page 72
RELATIONAL PATTERN MINING
Data: relational database
Pattern: connected complete subgraphs
Prior beliefs: degree of each node in each
relationship
Customers Items Attributes
Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)
Guns, Aknin, Lijffijt, De Bie (ICDM 2016)
Page 73
RELATIONAL PATTERN MINING
Data: relational database
Pattern: connected complete subgraphs
Prior beliefs: degree of each node in each
relationship
Customers Items Attributes
Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)
Guns, Aknin, Lijffijt, De Bie (ICDM 2016)
Prior factorizes over relationships
⇒ equivalent to the itemset case
Page 74
Users Films
Genres
Actors
RELATIONAL PATTERN MINING
RMiner
Page 75
RMINER
Algorithmic approach: enumerate + rank
Based on fixpoint-enumeration (Boley et al. 2010)
[Figure: example relational database with entity sets A1–A3, B1–B3, C1–C2]
Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
RMINER
Algorithmic approach: enumerate + rank
[Figure: the enumeration illustrated step by step on the example database with entities A1–A3, B1–B3, C1–C2]
1) Branch on any entity
2) Compute closure
   - Add entities that are in all supersets
   - If an invalid entity is added, backtrack
   - Repeat until no valid candidates
   - If there are invalid candidates: stop
   - Else: output the pattern, then backtrack and declare the entity invalid
Etc. etc.
Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
Page 86
RELATIONAL PATTERN MINING
Extensions: N-RMiner handles n-ary relations; P-N-RMiner additionally handles structured attribute values (numeric, circadian, taxonomy elements)
[Figure: schema with numeric, circadian, and taxonomy-element entity types linked by an n-ary relation]
Page 87
P-N-RMINER: FISHER DATA
37
Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)
Page 88
CP-RMINER
Algorithmic approach: branch & bound in constraint programming (CP), finding the top-1 pattern
38
Guns, Aknin, Lijffijt, De Bie (ICDM 2016)
Page 89
CONNECTING TREES
39
Page 90
CONNECTING SUBTREES
Data: graph G = (V, E) with E ⊆ V × V; V known, E unknown
[Figure: example graph with vertices A–I]
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)
Page 91
CONNECTING SUBTREES
Data: graph G = (V, E) with E ⊆ V × V; V known, E unknown
Prior beliefs: degrees of the vertices, (batch) time order
[Figure: example graph with vertices A–I]
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)
Page 92
CONNECTING SUBTREES
Data: graph G = (V, E) with E ⊆ V × V; V known, E unknown
Prior beliefs: degrees of the vertices, (batch) time order
Patterns: a subtree connecting the query vertices Q
[Figure: example graph with vertices A–I]
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)
Page 93
CONNECTING SUBTREES
Data: graph G = (V, E) with E ⊆ V × V; V known, E unknown
Prior beliefs: degrees of the vertices, (batch) time order
Patterns: a subtree connecting the query vertices Q
Which is more interesting?
[Figure: three candidate subtrees connecting the query vertices; one is unsurprising, one offers no information compression, and the third is most interesting, since E and C have small in-degree]
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)
Page 94
TIME DIFFERENCE PRIOR
Consider a citation network: papers arrive in batches (per year), and earlier papers cannot (or can rarely) cite newer papers.
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)
Page 95
TIME DIFFERENCE PRIOR
Consider a citation network: papers arrive in batches (per year), and earlier papers cannot (or can rarely) cite newer papers.
Limited increase in computational cost to fit the background distribution.
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)
Page 96
EXAMPLE ON CITATION DATA
Query: 3 recent best papers from ACM SIGKDD
[Figure: connecting subtrees found under a uniform prior vs. under a time-and-degree prior]
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)
Page 97
CONNECTING SUBTREES
Algorithmic approach: greedily construct a tree with
maximum depth k
Not so straightforward!
We investigated various heuristics
47
Page 98
CONNECTING SUBTREES
Empirically: best strategy depends on size of query set Q
48
Page 99
CONNECTING SUBTREES
[Figure: results on NIPS/PODS author networks, uniform prior vs. degree prior]
Experiment repeated from Akoglu et al. (SDM 2013)
Page 100
NETWORK EMBEDDING
50
Page 101
CONDITIONAL NETWORK EMBEDDINGS
Data: a graph G with adjacency matrix A
Pattern: a metric embedding X
‒ Probabilistic info about the graph
‒ P(‖x_i − x_j‖ | a_ij) = half-normal
Prior beliefs: P_{i,j}(a_ij)
‒ overall density
‒ degrees
‒ block structure
‒ assortativity
‒ ...
Find the ML embedding: max_X P(G | X)
Kang, Lijffijt, De Bie (ICLR 2019)
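At the heart of CNE is the combination of the half-normal distance likelihoods with the prior P_{i,j} via Bayes' rule. A sketch of the per-edge posterior (not the authors' code; the spread parameters s1 < s2 and the function name are illustrative):

```python
import numpy as np

def edge_posterior(d, prior, s1=1.0, s2=2.0):
    """CNE-style posterior probability of an edge given the embedding
    distance d: distances of linked pairs are modelled half-normally with
    spread s1, of non-linked pairs with larger spread s2, and the two
    likelihoods are combined with the prior P_ij via Bayes' rule."""
    like1 = np.exp(-d**2 / (2 * s1**2)) / s1   # half-normal density, a_ij = 1
    like0 = np.exp(-d**2 / (2 * s2**2)) / s2   # half-normal density, a_ij = 0
    # the common sqrt(2/pi) normalization cancels in the ratio
    return like1 * prior / (like1 * prior + like0 * (1 - prior))
```

Small distances raise the edge probability above the prior, large distances lower it below the prior; the ML embedding X is then found by gradient ascent on the product of these terms over all node pairs.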
Page 102
EXAMPLE ON STUDENTDB
52
Page 104
CONDITIONAL NETWORK EMBEDDINGS
54
Page 105
CONDITIONAL NETWORK EMBEDDINGS
Algorithmic approach: gradient descent with
estimated gradient (positive and negative sampling)
55
Page 106
SUMMARY PART 3
56
Page 107
SUMMARY
‒ An informative prior can be useful in many settings
‒ For binary data (all previous examples), fitting and updating the background model is computationally easy
‒ Mining subjectively interesting patterns is challenging; (for now) tailor-made algorithms are necessary
57
Page 108
HUMAN-CENTRIC
DATA EXPLORATION
PART 4/5: NUMERIC AND MIXED DATA
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
Page 110
OUTLINE OF THIS PART
Attributed subgraphs
Subgroup discovery in real-valued (target) data
Dimensionality reduction
Time series
3
Page 111
ATTRIBUTED SUBGRAPHS
4
Page 112
COHESIVE SUBGRAPHS
Data: attributed graph
[Figure: example graph with vertices A–I]

Vertex  Shops  Events  Pubs
A         1      5      3
B         5      0      2
C         2      3     10
D         1      6      4
E         4      0      2
F         3      1      5
G         6      2      9
H         2      1      3
I         0      1      3

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
[Presented and can be found online at the MLG @ KDD 2018 workshop]
Page 113
COHESIVE SUBGRAPHS
Data: attributed graph
Prior beliefs: global attribute statistics
[Graph and attribute table as above]
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
[Presented and can be found online at MLG @ KDD 2018 workshop]
Page 114
COHESIVE SUBGRAPHS
Data: attributed graph
Prior beliefs: global attribute statistics
Patterns: cohesive subgraphs with exceptional attributes (CSEA)
[Graph and attribute table as above; highlighted subgraph: A, C, D]
Pattern: these locations have many events
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
[Presented and can be found online at MLG @ KDD 2018 workshop]
Page 115
COHESIVE SUBGRAPHS
Data: attributed graph
Prior beliefs: global attribute statistics
Patterns: cohesive subgraphs with exceptional attributes (CSEA)
[Graph and attribute table as above; highlighted subgraph: F, G, H]
Pattern: these locations have many pubs & shops
A pattern is easy to interpret if it is local. How to quantify this?
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
[Presented and can be found online at MLG @ KDD 2018 workshop]
Page 116
COHESIVE SUBGRAPHS
Data: attributed graph
Prior beliefs: global attribute statistics
Patterns: cohesive subgraphs with exceptional attributes (CSEA)
[Graph and attribute table as above; highlighted subgraph: around G]
Pattern: vertices around G have many pubs & shops (easy to describe)
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
[Presented and can be found online at MLG @ KDD 2018 workshop]
Page 117
COHESIVE SUBGRAPHS
Data: attributed graph
Prior beliefs: global attribute statistics
Patterns: cohesive subgraphs with exceptional attributes (CSEA)
[Graph and attribute table as above; highlighted subgraph: around C]
Pattern: vertices around C have many events (easy to describe)
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
[Presented and can be found online at MLG @ KDD 2018 workshop]
Page 118
COHESIVE SUBGRAPHS
Data: attributed graph
Prior beliefs: global attribute statistics
Patterns: cohesive subgraphs with exceptional attributes (CSEA)
‒ Like subgroup discovery
‒ Describe the vertex set with a rule: an intersection of neighbourhoods, minus exceptions
‒ Attributes below / above a threshold, as compared to expectation
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
[Presented and can be found online at MLG @ KDD 2018 workshop]
Page 119
COHESIVE SUBGRAPHS
Data: attributed graph
Prior beliefs: global attribute statistics
Vertex Shops Events Pubs
A 1 5 3
B 5 0 2
C 2 3 10
D 1 6 4
E 4 0 2
F 3 1 5
G 6 2 9
H 2 1 3
I 0 1 3
Vertex Shops Events Pubs
A 0.5 … …B 0.04 … …C 0.2 … …D 0.8 … …E 0.06 … …F 0.13 … …G 0.07 … …H 0.1 … …I 1.0 … …
Attribute values Interestingness
Background distribution:
- Geometric per cell
- Using row/column margins
(values are for illustration only)
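A minimal sketch of this kind of background model, assuming a geometric distribution per cell whose mean is derived from the row/column margins (the exact parameterization in the paper may differ; numbers are illustrative):

```python
import numpy as np

# Sketch of a margin-based background model: one geometric distribution per
# cell, with the cell's expected count derived from the row/column margins.
# The parameterization is illustrative, not the exact model from the paper.

def geometric_tail(counts):
    counts = np.asarray(counts, dtype=float)
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    mean = row * col / counts.sum()      # expected count per cell
    p = 1.0 / (1.0 + mean)               # geometric on {0,1,2,...} with that mean
    return (1.0 - p) ** counts           # P(X >= observed): small = surprising

counts = np.array([[1, 5, 3],
                   [5, 0, 2],
                   [2, 3, 10]])
surprise = geometric_tail(counts)        # low values flag exceptional cells
```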
Page 121
COHESIVE SUBGRAPHS
Data: attributed graph
Prior beliefs: global attribute statistics
Patterns: cohesive subgraphs with exceptional attributes (CSEA)
𝑃1: +food
𝑃2: +professional, +nightlife,
+outdoors, +college
𝑃3: +nightlife, +food, -college
Page 122
COHESIVE SUBGRAPHS
Data: attributed graph
Prior beliefs: global attribute statistics
Patterns: cohesive subgraphs with exceptional attributes (CSEA)
Page 123
COHESIVE SUBGRAPHS
Algorithmic approach: we tried
Option 1: enumerate with P-N-RMiner, then rank
Option 2: dedicated branch-and-bound algorithm
Page 124
SUBGROUP DISCOVERY (EXCEPTIONAL MODEL MINING)
Page 125
LOCATION & SPREAD PATTERNS
Data:
‒ Meta-data: any type
‒ Target data: real-valued matrix
Prior beliefs: mean and variance statistics
‒ Typically overall, but can be for subsets
Patterns: description with
‒ mean vector, or
‒ projection and magnitude of variance
𝑿 ∈ ℝ𝑚×𝑛
[Scatter plot (a): Attribute 1 vs. Attribute 2, highlighting objects with property x, objects with property y, and objects with properties z1 and z2]
Lijffijt, Kang, Duivesteijn, Puolamäki,
Oikarinen, De Bie (ICDE 2018)
Page 126
SINGLE TARGET DIMENSION: CRIME IN US
UCI Crime data:
violent crime rate
(per 1k pop)
Description: areas with high incidence of unmarried mothers (coded
by CBS as percentage illegitimate)
Target: high average crime rate
Page 127
ECOLOGY: PRESENT SPECIES AS TARGETS
Description of pattern (a): “mean temperature in March ≤ −1.68 °C”
Description of pattern (b): “average monthly rainfall in August ≤ 47.62 mm”
Description of pattern (c): “average monthly rainfall in October ≤ 45.25 mm and mean temperature of wettest quarter ≥ 16.32 °C”
Mean of target attributes for (a): −Wood mouse; +Mountain hare, moose, red-backed vole, wood lemming
Page 128
GERMAN POLITICS VS. DEMOGRAPHICS
Description of pattern (a): "few children". Target: LEFT is popular.
Description of pattern (b): "large mid-aged pop". Target: GREEN relatively popular.
Description of pattern (c): "many children". Target: LEFT is unpopular.
Page 129
GERMAN POLITICS VS. DEMOGRAPHICS
Moreover, target of pattern (a): “SPD and CDU negatively correlated”. Due to the popularity of LEFT, SPD and CDU are in tougher competition here.
Page 130
SUBGROUP DISCOVERY
For reference only
Prior:
IC/DL:
Page 131
SUBGROUP DISCOVERY
Algorithmic approach:
Find descriptions using beam search
‒ Fairly standard in SD; Implementation from
Cortana (Meeng & Knobbe, BeneLearn 2011)
‒ Using the SI objective directly
For projections
‒ Manifold learning problem (ManOpt toolbox)
Interesting results on speeding up the update of the background distribution
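The beam search over descriptions can be sketched generically. Everything below (the candidate conditions, the quality function, and the width/depth values) is a made-up toy, not the Cortana implementation or the SI objective itself:

```python
# Generic beam-search sketch for growing subgroup descriptions, in the spirit
# of the approach above. Candidates and the quality function are invented.

def beam_search(candidates, score, width=3, depth=2):
    beam = [()]                              # start from the empty description
    best = (float('-inf'), ())
    for _ in range(depth):
        pool = [d + (c,) for d in beam for c in candidates if c not in d]
        pool.sort(key=score, reverse=True)   # keep the `width` best extensions
        beam = pool[:width]
        if beam and score(beam[0]) > best[0]:
            best = (score(beam[0]), beam[0])
    return best

candidates = ['age>40', 'income<30k', 'urban']

def score(desc):
    # Made-up quality: reward two specific conditions, penalize longer descriptions
    return len(set(desc) & {'age>40', 'urban'}) - 0.1 * len(desc)

best_score, best_desc = beam_search(candidates, score)
```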
Page 132
DIMENSIONALITY REDUCTION
Page 133
PROJECTION PATTERNS
Data: real-valued matrix
Prior beliefs: global mean and (co-)variance structure
Patterns: projections
𝑿 ∈ ℝ𝑚×𝑛
Page 134
DATA PROJECTIONS
De Bie, Lijffijt, Santos-Rodriguez, Kang
(ESANN 2016)
Kang, Lijffijt, Santos-Rodriguez, De Bie
(KDD 2016, DMKD 2018)
Puolamäki, Kang, Lijffijt, De Bie (ECMLPKDD 2016)
Kang, Puolamäki, Lijffijt, De Bie (ECMLPKDD 2016)
Puolamäki, Oikarinen, Kang, Lijffijt, De Bie (ICDE 2018)
Finding informative projections
Accounting for user feedback
Page 135
SI COMPONENT ANALYSIS (SICA)
Problem is parametrized by a resolution parameter
Go from density to probability for projections
De Bie, Lijffijt, Santos-Rodriguez, Kang (ESANN 2016)
Kang, Lijffijt, Santos-Rodriguez, De Bie (KDD 2016, DMKD 2018)
Page 136
SICA
Effect of the prior beliefs:
Expectation on mean/variance → PCA objective
Expectation on magnitude of variance → more robust variant of PCA (which we call t-PCA)
Graph of point similarities → next slides
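The first special case can be illustrated directly: with prior expectations only on the mean and covariance, the most informative projection coincides with the top principal component, i.e. an ordinary eigenvalue problem on the empirical covariance. This sketch shows only that PCA reduction on synthetic data, not the full SICA objective:

```python
import numpy as np

# Illustration of the PCA special case: with prior expectations on the mean
# and covariance, the most informative projection reduces to ordinary PCA,
# i.e. an eigenvalue problem on the empirical covariance (synthetic data).

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # axis 0 dominates
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)

eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
w = eigvecs[:, -1]                        # top principal direction
projection = Xc @ w                       # the most informative 1-D view here
```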
Page 137
SICA GRAPH PRIOR
Page 138
SICA GRAPH PRIOR
German voting percentages per district, accounting for the east-west divide
Page 139
SI DATA EXPLORER (SIDE)
https://users.ugent.be/~bkang/software/side_dev/entry.html
Page 140
SI DATA EXPLORER (SIDE)
Page 141
SI DATA EXPLORER (SIDE)
Page 142
SICA/SIDE
Algorithmic approach:
SICA (uniform/graph) are eigenvalue problems
SICA t-PCA and SIDE use manifold learning toolbox
Page 143
C-T-SNE
Non-linear dimensionality reduction
Based on t-SNE
Prior beliefs: known clusters
Result: c-t-SNE visualization may make new clusters salient
Bo Kang, Dario Garcia-Garcia, Jefrey Lijffijt, Raul Santos-Rodriguez, Tijl De Bie (arXiv, 2019)
Page 145
TIME SERIES MOTIFS
Data: time series
Prior beliefs: mean, var, co-var (first order difference)
𝑿 ∈ ℝ1×𝑛
Deng, Lijffijt, Kang, De Bie (Entropy, 2019)
Page 146
TIME SERIES MOTIFS
Data: time series
Prior beliefs: mean, var, co-var (first order difference)
Patterns: motif template
𝑿 ∈ ℝ1×𝑛
Page 147
TIME SERIES MOTIFS
IC quantifies the reduction in uncertainty
Likelihood of data increases by inserting template in
background distribution for matched locations
Due to co-variance, expectations change globally
Updating is computationally (somewhat) costly
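The likelihood-gain idea can be illustrated with a deliberately simplified model: an i.i.d. Gaussian background (the covariance prior on first-order differences from the paper is not modelled here) and a Gaussian template with a hypothetical tolerance `tpl_sigma`. The IC is then the total log-likelihood gain from explaining the matched windows by the template instead of the background:

```python
import math

# Simplified illustration of the likelihood-gain view of motif IC.
# Background: i.i.d. Gaussian (a simplification of the slides' priors).
# Template match: Gaussian around the template values, tolerance tpl_sigma.

def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def motif_ic(series, template, starts, bg_mu, bg_sigma, tpl_sigma=0.1):
    """Total log-likelihood gain from explaining matched windows by the template."""
    ic = 0.0
    for s in starts:
        for j, t in enumerate(template):
            x = series[s + j]
            ic += normal_logpdf(x, t, tpl_sigma) - normal_logpdf(x, bg_mu, bg_sigma)
    return ic

series = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0, 0.0]
ic = motif_ic(series, template=[1.0, 2.0, 1.0], starts=[1, 5], bg_mu=1.0, bg_sigma=1.0)
# ic > 0: the template explains the two matched windows better than background
```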
Page 148
TIME SERIES MOTIFS
Algorithmic approach:
Construct template from 3 or 4 subsequences using
constraint programming and relaxed objective
Greedily add subsequences using exact objective
Prune dissimilar subsequences from search after
branching and selection of initial set
Page 149
TIME SERIES MOTIFS
Data: time series
Prior beliefs: mean, var, co-var (first order difference)
Patterns: motif template
𝑿 ∈ ℝ1×𝑛
Page 151
AND MORE
Past:
‒ Data clustering
‒ Biclustering
‒ Exceptional model mining / subgroup discovery
‒ Time series segments
Ongoing / future:
‒ Backbone of a network
‒ Insightful summaries of an attributed network
‒ Network embeddings
‒ ...
with all past and current members of the FORSIED team and Jilles
Vreeken, Antonis Matakos, Dario Garcia-Garcia, Siegfried Nijssen,...
Page 152
HUMAN-CENTRIC
DATA EXPLORATION
PART 5/5: ADVANCED TOPICS, OUTLOOK & CONCLUSIONS
Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net 1
Page 153
OUTLINE
Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)
Page 155
RELATED WORK
MDL/compression
MAP estimation
Hypothesis testing / randomization techniques
‘Subjective interestingness’ research by Tuzhilin, Silberschatz, Padmanabhan (90s)
Page 156
LIMITATIONS
Not all new information is interesting!
The FORSIED framework does not directly address this
Importance of feedback in combination with FORSIED
(see work on dimensionality reduction)
Description length
How to determine?
Is shorter really always better?
Page 157
FORSIED’S ORIGINS, AND OUTLOOK
Page 158
FORSIED’S ORIGINS
Remodiscovery, 2006
DISTILLER, 2009
Page 159
FORSIED’S ORIGINS
Page 160
FORSIED’S ORIGINS
MINI, ECML-PKDD 2007
TKDD, 2007
KDD, 2009
DAMI, 2014
Page 161
FORSIED’S ORIGINS
Since 2017 also Jefrey’s Pegasus2 fellowship, and a large and a small FWO grant
Page 162
OUTLOOK
Improve theoretical understanding:
‒ Estimating the background distribution (information geometry)
‒ Cognitive aspects (cognitive science)
‒ User interface (human-computer interaction)
‒ Visualization (visual analytics)
‒ Algorithmic aspects (optimisation theory)
‒ Safeguarding sensitive information & fairness
More instantiations:
‒ Data types (linked data / knowledge graphs!) / pattern types / prior belief types
Applications:
‒ Bioinformatics
‒ Web and social media mining
Page 163
DATA MINING WITHOUT SPILLING THE BEANS
Page 164
PRIVACY-PRESERVING DATA PUBLISHING
Anonymization insufficient to protect sensitive attributes (linkage attack)
Generalization!
ZIP D.O.B. Sex Diagnosis
94701 01/02/1968 F Healthy
94701 06/03/1990 F Obesity
94702 11/08/1991 M Healthy
94703 03/09/1979 M Prostate cancer
94703 07/10/1951 F Healthy
94704 10/02/1973 M Obesity
94705 20/12/2001 F Obesity
ZIP D.O.B. Sex Full name
94701 01/02/1968 F Mary Smith
94701 06/03/1990 F Patricia Johnson
94702 11/08/1991 M James Jones
94703 03/09/1979 M John Brown
94703 07/10/1951 F Linda Davis
94704 10/02/1973 M Robert Miller
94705 20/12/2001 F Barbara Wilson
Anonymized patient database Voting records database
Quasi-identifiers
ZIP D.O.B. Sex Diagnosis
94701 ‘51-’01 F Healthy
94701 ‘51-’01 F Obesity
94702-5 ‘51-’01 M Healthy
94702-5 ‘51-’01 M Prostate cancer
94702-5 ‘51-’01 F Healthy
94702-5 ‘51-’01 M Obesity
94702-5 ‘51-’01 F Obesity
Page 165
PRIVACY-PRESERVING DATA PUBLISHING
𝒌-anonymity: minimum equivalence class size ≥ 𝑘
‒ Homogeneity attack
‒ Background knowledge attack
𝒍-diversity: ≥ 𝑙 sensitive attribute values well represented in each equivalence class
‒ Hard to achieve for imbalanced data
‒ Skewness attack
‒ Similarity attack
𝒕-closeness: sensitive attribute value distribution in equivalence classes same as in overall data
Source: https://www.linkedin.com/pulse/dont-throw-baby-out-bathwater-didi-gurfinkel/
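These definitions are easy to check mechanically. A minimal sketch on the generalized table from the earlier slide, using distinct-value 𝑙-diversity (the simplest variant; "well represented" notions are stricter):

```python
from collections import defaultdict

# Mechanical check of k-anonymity and distinct-value l-diversity on the
# generalized table from the earlier slide.

rows = [  # (quasi-identifiers, sensitive attribute)
    (('94701',   "'51-'01", 'F'), 'Healthy'),
    (('94701',   "'51-'01", 'F'), 'Obesity'),
    (('94702-5', "'51-'01", 'M'), 'Healthy'),
    (('94702-5', "'51-'01", 'M'), 'Prostate cancer'),
    (('94702-5', "'51-'01", 'F'), 'Healthy'),
    (('94702-5', "'51-'01", 'M'), 'Obesity'),
    (('94702-5', "'51-'01", 'F'), 'Obesity'),
]

groups = defaultdict(list)  # equivalence classes keyed by quasi-identifiers
for qi, sensitive in rows:
    groups[qi].append(sensitive)

k = min(len(v) for v in groups.values())        # k-anonymity level
l = min(len(set(v)) for v in groups.values())   # distinct l-diversity
```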
Page 167
OTHER KINDS OF SENSITIVE INFORMATION
Existence of a tight community in a network
Page 168
OTHER KINDS OF SENSITIVE INFORMATION
Existence of a tight community in a network
Existence of a cluster in data
Frequency of particular items / size of particular
transactions in a database of purchases
Preserve this while:
publishing generalized version of database,
identifying dense subgraphs,
finding clusters,
mining frequent itemsets, etc
Data mining patterns
Page 169
GENERAL STRATEGY
Data: 𝒙. Data mining goal: reveal as much as possible about 𝒙.
Sensitive aspects: 𝑓(𝒙) ∈ Φ, e.g.:
‒ the sensitive attributes’ values
‒ density of a specified subgraph
‒ existence of a tight cluster
‒ frequencies of all items
Goal: reveal as little as possible about 𝑓(𝒙).
Updating 𝑃 → 𝑃′ results in updating 𝑃𝑓 → 𝑃𝑓′
More complex than conditioning! 𝑃𝑓(𝑓(𝒙)) can be larger or smaller than 𝑃𝑓′(𝑓(𝒙))
(data) 𝒙 → 𝑓(𝒙) (sensitive aspects)
Considering the user’s prior beliefs! ‘Subjective’ measures
[Diagram: 𝑓 maps the data space Ω (distribution 𝑃𝑥) to Φ (distribution 𝑃𝑓), and the restricted space Ω′ to Φ′ = 𝑓(Ω′)]
Page 170
TRADING-OFF TWO THINGS
1. Subjective information content of a pattern
2. A criterion on the background distribution about the sensitive aspects:
‒ Information content left in the sensitive aspects (surprise in the actual value of the sensitive attributes): −log 𝑃𝑓′(𝑓(𝒙))
‒ Entropy of 𝑃𝑓′ (uncertainty about the sensitive attributes): −𝐸[log 𝑃𝑓′(𝑓(𝒙))]
‒ Knowledge gained about the actual value of the sensitive aspects: −log(𝑃𝑓(𝑓(𝒙)) / 𝑃𝑓′(𝑓(𝒙)))
‒ Degree of belief that the sensitive aspects are within a specified set Φ∗ ⊆ Φ: 𝑃𝑓′(Φ∗)
‒ ...
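For a discrete sensitive aspect these criteria are one-liners. The distributions below are invented for illustration; `p_before` and `p_after` play the roles of 𝑃𝑓 and 𝑃𝑓′ (the analyst's belief before and after seeing a pattern):

```python
import math

# Toy computation of the trade-off criteria for a discrete sensitive aspect.
# p_before / p_after stand in for P_f and P_f'; all numbers are invented.

p_before = {'low': 0.5, 'mid': 0.3, 'high': 0.2}
p_after  = {'low': 0.7, 'mid': 0.2, 'high': 0.1}
true_value = 'low'

surprise_left  = -math.log(p_after[true_value])                   # -log P'_f(f(x))
entropy_after  = -sum(p * math.log(p) for p in p_after.values())  # uncertainty left
knowledge_gain = -math.log(p_before[true_value] / p_after[true_value])
belief_in_set  = p_after['low'] + p_after['mid']                  # P'_f(Phi*)
```

Here `knowledge_gain` is positive because the pattern increased the analyst's belief in the true sensitive value, exactly the quantity one would want to keep small.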
Page 172
PRIVACY-PRESERVING DATA PUBLISHING
• Random synthetic dataset:
• 5 real-valued quasi-identifiers,
generalization through intervals
• 1 sensitive attribute, 3 possible values
• 1 other attribute, 3 possible values
• 100 data records
• Trade-off:
• information about data (other & sensitive attributes)
• knowledge gained about sensitive attribute
• Generalize quasi-attributes → 5 equivalence classes
• Ensure the maximum information content about any
sensitive attribute value is small
Conditional distributions within
the 5 equivalence classes over
the 3 sensitive attribute values
Conditional distributions within
the 5 equivalence classes over
the 3 other attribute values
Joint conditional distribution of the sensitive (rows) and other (columns) attributes, within 5 equivalence classes
QI (quasi-identifiers), e.g. zip code, D.O.B.
SA (sensitive attribute), e.g. sexual orientation, ethnicity
OA (other attribute), e.g. sense of well-being, productivity
Page 173
DENSE SUBGRAPHS WITHOUT SPILLING BEANS
[Four adjacency-matrix plots: initially; after both community patterns; after both community patterns and a deception pattern; after both community patterns, partially concealed]
• Random network:
• 2 non-overlapping
communities
• A 3rd community
overlapping both
• The 3rd is sensitive
• Analyst should
remain surprised by its presence
• Task:
• Identify (non-)dense subgraphs
• Without spilling the beans on the 3rd
community
• Approaches (result from general strategy):
• Deceive
• Conceal
Page 174
TAKE-AWAYS
FORSIED ideas can be used for quantifying sensitive
information disclosure
Key point: sensitive information disclosure is subjective
More work needed to understand how to make this
practical
Page 176
OVERALL CONCLUSIONS
A generic approach for designing methods for exploring data
‒ Several successes
‒ Sometimes more challenging, mostly due to algorithmic issues
Key take-away:
‒ Model what’s not interesting (= prior beliefs), show what’s complementary (= subjectively interesting)
‒ Using information theory
New horizons?
‒ Privacy and sensitive information
Page 177
www.forsied.net 26
“Data Mining without Spilling the Beans: Preserving more than Privacy alone” – research project funded by the FWO
Tijl De Bie, Jefrey Lijffijt
“Exploring Data: Theoretical Foundations and Applications to Web, Multimedia, and Omics Data” – Odysseus project funded by the FWO
Tijl De Bie
“Formalizing Subjective Interestingness in Data mining” – ERC project FORSIED
Tijl De Bie
“Personalised, interactive, and visual exploratory mining of patterns in complex data” – FWO [Pegasus]2 Marie Skłodowska-Curie Fellowship
Jefrey Lijffijt
Acknowledgements go to
Page 178
Jefrey Lijffijt, Bo Kang, Wouter Duivesteijn, Achille Aknin, Holly Silk, Raul Santos-Rodriguez, Eirini Spyropoulou, Akis Kontonasios, Paolo Simeone, Robin Vandaele, Florian Adriaens, Tijl De Bie, Xi Chen, Junning (Lemon) Deng, Ahmad Mel, Alexandru Mara, Maryam Fanaeepour, Maarten Buyl
We are recruiting!
www.forsied.net / aida.ugent.be
+ lots of collaborators...
Page 179
THANKS!
TIME FOR Q&A
Page 180
MINING SUBJECTIVELY
INTERESTING PATTERNS IN DATA
SUPPLEMENTARY MATERIAL
Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net 1
Page 182
Adriaens, Lijffijt, De Bie: Subjectively Interesting Connecting Trees. ECML/PKDD (2) 2017: 53-69
De Bie, Lijffijt, Santos-Rodriguez, Kang: Informative Data Projections: A Framework and Two Examples. ESANN 2016: 635-640
De Bie: Subjective Interestingness in Exploratory Data Mining. IDA 2013: 19-31
De Bie: Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min.
Knowl. Discov. 23(3): 407-446 (2011)
De Bie: An information theoretic framework for data mining. KDD 2011: 564-572
De Bie: Subjectively Interesting Alternative Clusters. MultiClust@ECML/PKDD 2011: 43-54
Deng, Lijffijt, Kang, De Bie: Subjectively Interesting Motifs in Time Series, AALTD@ECML/PKDD 2018
Guns, Aknin, Lijffijt, De Bie: Direct Mining of Subjectively Interesting Relational Patterns. ICDM 2016: 913-918
Page 183
Kang, Lijffijt, De Bie: Conditional Network Embeddings. Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.
Kang, Lijffijt, Santos-Rodríguez, De Bie: SICA: subjectively interesting component analysis. Data Min. Knowl. Discov. 32(4): 949-987 (2018)
Kang, Lijffijt, Santos-Rodriguez, De Bie: Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations. KDD 2016: 1615-1624
Kang, Puolamäki, Lijffijt, De Bie: A Tool for Subjective and Interactive Visual Data Exploration. ECML/PKDD (3) 2016: 3-7
Kontonasios, De Bie: Subjectively interesting alternative clusterings. Machine Learning 98(1-2): 31-56 (2015)
Kontonasios, De Bie: An Information-Theoretic Approach to Finding Informative Noisy Tiles in Binary Databases. SDM 2010: 153-164
Kontonasios, Vreeken, De Bie: Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data. ECML/PKDD (2) 2013: 256-271
Page 184
Kontonasios, Vreeken, De Bie: Maximum Entropy Modelling for Assessing Results on Real-Valued Data. ICDM 2011:
350-359
Van Leeuwen, De Bie, Spyropoulou, Mesnage: Subjective interestingness of subgraph patterns. Machine Learning
105(1): 41-75 (2016)
Lijffijt, Kang, Duivesteijn, Puolamäki, Oikarinen, De Bie: Subjectively Interesting Subgroup Discovery on Real-valued
Targets. IEEE ICDE 2018 : to appear
Lijffijt, Spyropoulou, Kang, De Bie: P-N-RMiner: a generic framework for mining interesting structured relational
patterns. I. J. Data Science and Analytics 1(1): 61-76 (2016)
Lijffijt, Spyropoulou, Kang, De Bie: P-N-RMiner: A generic framework for mining interesting structured relational
patterns. DSAA 2015: 1-10
Puolamäki, Kang, Lijffijt, De Bie: Interactive Visual Data Exploration with Subjective Feedback. ECML/PKDD (2) 2016:
214-229
Page 185
Puolamäki, Oikarinen, Kang, Lijffijt, De Bie: Interactive Visual Data Exploration with Subjective Feedback: An
Information-Theoretic Approach. IEEE ICDE 2018 : to appear
Spyropoulou, De Bie, Boley: Interesting pattern mining in multi-relational data. Data Min. Knowl. Discov. 28(3): 808-849
(2014)
Spyropoulou, De Bie, Boley: Mining Interesting Patterns in Multi-relational Data with N-ary Relationships. Discovery
Science 2013: 217-232
Page 186
LINKS TO SOFTWARE
Page 187
R-MINER(S)
Original (fastest for full enumeration):
https://bitbucket.org/BristolDataScience/rminer/
N-RMiner (supports n-ary):
https://bitbucket.org/BristolDataScience/n-rminer/
P-N-RMiner (supports structured attributes):
https://bitbucket.org/BristolDataScience/p-n-rminer/
CP-RMiner (top 1 RMiner pattern, iteratively, fast):
https://bitbucket.org/ghentdatascience/cp/
Page 188
CONNECTING TREES
https://bitbucket.org/ghentdatascience/interestingtreespublic/
Page 189
DENSE SUBGRAPHS (COMMUNITIES)
http://patternsthatmatter.org/software.php#ssgminer
Page 190
NETWORK EMBEDDING
https://bitbucket.org/ghentdatascience/cne-public/
Page 191
ATTRIBUTED SUBGRAPHS
SIAS-Miner: http://goo.gl/ZxsvbX
Page 192
SUBGROUP DISCOVERY
https://bitbucket.org/ghentdatascience/sisd-public/
Page 193
DIMENSIONALITY REDUCTION
SICA:
http://users.ugent.be/~bkang/software/sica/sica.zip
SIDE (online tool):
http://users.ugent.be/~bkang/software/side_dev/index.html
SIDE (MaxEnt R version): http://kaip.iki.fi/sider.html
CLIPPR: https://bitbucket.org/ghentdatascience/clippr/
Page 194
ANYTHING MISSING?
Not all (source) code has been published, please ask if
you are interested in something that is missing!
15