
HUMAN-CENTRIC

DATA EXPLORATION

PART 1/5: MOTIVATION, BACKGROUND & OUTLINE

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

Based on joint work with many others (see references and final slide of this lecture)

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

SUBJECTIVITY

AND VISUALIZATION

PART 1/5: MOTIVATION, BACKGROUND & OUTLINE

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 2

MOTIVATION

3

SUBJECTIVITY = KEY

Three motivating examples:

1. Frequent itemset mining

‒ Individually frequent items = probably frequent together

2. Visualizing high-dimensional data

‒ Outliers = high variance, so why maximize it in PCA?

‒ Interaction = key

3. Graph embedding

‒ High degree nodes = probably embedded centrally

4

ASSOCIATION ANALYSIS / ITEMSET MINING

KDD abstracts dataset (documents × words):

| Subjective interestingness ranking (prior info on row & column sums) | #docs | Ranking by support × size (area) | #docs |
|---|---|---|---|
| svm, support, machin, vector | 25 | data, paper | 389 |
| state, art | 39 | algorithm, propose | 246 |
| unlabelled, labelled, supervised, learn | 10 | data, mine | 312 |
| associ, rule, mine | 36 | base, method | 202 |
| gene, express | 25 | result, show | 196 |
| frequent, itemset | 28 | problem | 373 |
| large, social, network, graph | 15 | data, set | 279 |
| column, row | 13 | approach | 330 |
| algorithm, order, magnitud, faster | 12 | model | 301 |
| paper, propos, algorithm, real, synthetic, data | 27 | present | 296 |


KDD abstracts dataset (documents × words):

| Subjective interestingness ranking (prior info on row & column sums) | #docs | Subjective interestingness ranking (additional prior info on keyword tiles) | #docs |
|---|---|---|---|
| svm, support, machin, vector | 25 | art, state | 39 |
| state, art | 39 | row, column, algorithm | 12 |
| unlabelled, labelled, supervised, learn | 10 | unlabelled, labelled, data | 14 |
| associ, rule, mine | 36 | answer, question | 18 |
| gene, express | 25 | precis, recal | 14 |

CONDITIONAL NETWORK EMBEDDINGS

8

EXPLORING DATA

The search for interesting patterns in data:

• Association analysis: frequency, lift, confidence, leverage, coverage, ...
• Dimensionality reduction: PCA, ICA, projection pursuit, Laplacian Eigenmaps, t-SNE, LLE, ...
• Graph embedding: Node2Vec, Path2Vec, MetaPath2Vec, ...
• Clustering: k-means clustering, hierarchical clustering, mixture of Gaussians, spectral clustering, ...
• Community detection: stochastic block modelling, modularity, k-cores, quasi-cliques, dense subgraphs, ...
• Privacy-preserving data publishing: discernibility, generalization height, average group size, ...
• ...

Zillions of ‘interestingness measures’, a.k.a. objective functions, quality functions, utility functions, cost functions.

THE CHALLENGE

Zillions of interestingness measures = good & bad
‒ Good: more options!
‒ Bad: the trees & the forest…

Challenge:
‒ Formalise true interestingness!
‒ With minimal user interaction
‒ Without requiring user expertise

MOTIVATING EXAMPLE

Community detection:

What makes for an interesting community?

‒ Densely connected?

‒ Large?

‒ Few neighbours outside community?

‒ Unrelated to certain known ‘affiliations’?
‒ …

11

THE FORSIED APPROACH: SUBJECTIVITY!

12

Data → Data mining researcher → Interestingness(pattern)

THE FORSIED APPROACH: SUBJECTIVITY!

13

Data → Data mining researcher → Interestingness(pattern)

Data → Data analyst → Interestingness(pattern, analyst)

Interestingness = subjective

MOTIVATING EXAMPLE

Community detection:

User states expectations / beliefs
‒ Formalized as a ‘background distribution’

Any ‘pattern’ that contrasts with this and is easy to describe = subjectively interesting

14

OUTLINE

15

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

16

17

Feel free to interrupt for questions anytime

HUMAN-CENTRIC

DATA EXPLORATION

PART 2/5: THE FORSIED FRAMEWORK

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

2

GENERIC FRAMEWORK

3

De Bie, KDD 2011

De Bie, DAMI 2011

FORSIED(*) framework

[Diagram: the data x lives in a data space Ω, over which the model of the user's beliefs (the "background distribution" P) is defined; each pattern restricts the data to x ∈ Ω′ ⊂ Ω and updates the background distribution accordingly (Ω ⊃ Ω′ ⊃ Ω′′, with P → P′ → P′′); P evolves as patterns are shown to the user. Data → Patterns → User: subjective!]

(*) Formalizing Subjective Interestingness in Exploratory Data mining

• Data: the adjacency matrix

of a graph under study

• Patterns: the claim that a

specified set of nodes are

densely connected

• Prior beliefs: the degrees of

the nodes, a known block

structure,...

• Interestingness: subjective

information density

• Overlapping communities!

$$\mathrm{Interestingness}(\Omega', P) = \frac{\mathrm{InformationContent}(\Omega', P)}{\mathrm{DescriptionLength}(\Omega')}, \qquad \mathrm{InformationContent}(\Omega', P) = -\log P(\Omega')$$

In short: SI = IC / DL.

THE FINE PRINT

Initial background distribution P?
‒ Maximum entropy distribution subject to the prior belief constraints:

$$\max_P\ E_{X\sim P}\big[-\log P(X)\big] \quad \text{s.t.} \quad E_{X\sim P}\big[f(X)\big] = c_f \ \ (\forall f)$$

Updated background distribution P′ given pattern x ∈ Ω′?
‒ P conditioned on the event x ∈ Ω′:

$$P'(\Omega'') = \frac{P(\Omega'' \cap \Omega')}{P(\Omega')} \quad\Rightarrow\quad -\log P'(\boldsymbol{x}) = -\log P(\boldsymbol{x}) + \log P(\Omega')$$

(information content in the data after the pattern = information content in the data before the pattern, minus the information content of the pattern under P)

Description length?
‒ Smaller if the pattern is a better explanation
‒ Essentially problem-dependent

WHY MAXIMUM ENTROPY / CONDITIONING?

Most unbiased estimate
‒ Informal... no bias other than the constraints

Assume a cautious / pessimistic user
‒ A user who expects to be very surprised

Leads to the most robust estimate of the true subjective information content
‒ Information content estimated with the maxent P will never differ much from the information content w.r.t. the true prior beliefs of the user

6

A FIRST INSTANTIATION: COMMUNITY DETECTION

7

van Leeuwen, De Bie, Spyropoulou, Mesnage, MLj, 2016

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

MaxEnt distribution (over the adjacency matrix A, whose entries a_ij are edge indicator variables):

$$P(\mathbf{A}) = \prod_{i>j} P_{i,j}(a_{ij}), \qquad P_{i,j}(a_{ij}) = \frac{\exp\big(a_{ij}\cdot(\lambda_i + \lambda_j)\big)}{1 + \exp(\lambda_i + \lambda_j)}$$

[Figure: adjacency matrix; cells where $P_{i,j}(a_{ij})$ is small vs. large]

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs, i.e.

$$\sum_{i,j\,\in\,\text{subgraph}} a_{ij} \;\ge\; k$$

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs

Interestingness:

$$\frac{-\log P(\text{pattern})}{\mathrm{DescriptionLength}(\text{pattern})}$$

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs

Interestingness: density vs. size (under prior 2., preferably low-degree nodes)

[Figure: most interesting community given prior 1. vs. given prior 2.]

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs

Interestingness: density vs. size (under prior 2., preferably low-degree nodes)

Hill-climbing for search; update P after each pattern

[Figure: most interesting community given prior 1. vs. given prior 2.]

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs

Interestingness: density vs. size (under prior 2., preferably low-degree nodes)

Hill-climbing for search; update P after each pattern

[Figure: communities found in a music social network, labelled Rock, Trance, Indie, Bhangra, Gospel, Country, Hip hop / grime, Afro pop, UK garage, Hip hop]

TAKE-AWAYS

1. What is the data?

2. Determine suitable pattern syntax

3. What are the prior beliefs? (= what is irrelevant to user?)

Compute background distribution 𝑷 using maximum entropy

4. Formulate subjective interestingness:

5. Design an algorithm to optimize it

6. Find out how to condition background distribution on a pattern

14

$$\mathrm{Interestingness}(\Omega', P) = \frac{\mathrm{InformationContent}(\Omega', P)}{\mathrm{DescriptionLength}(\Omega')}, \qquad \mathrm{InformationContent}(\Omega', P) = -\log P(\Omega')$$

THE BACKGROUND DISTRIBUTION: MAXENT

15

MAXENT MODEL S.T. DEGREE BELIEFS

$$\max_P\ \sum_{\mathbf{A}} -P(\mathbf{A})\log P(\mathbf{A}) \quad \text{s.t.} \quad \sum_{\mathbf{A}} P(\mathbf{A})\sum_{j=1:n} a_{ij} = d_i \ \ (\forall i = 1{:}n), \qquad \sum_{\mathbf{A}} P(\mathbf{A}) = 1$$

Convex! ($\mathbf{A}$ = adjacency matrix with entry $a_{ij}$ in row $i$ and column $j$.)

Lagrangian, with Lagrange multipliers $\lambda_i$ (one per expected-degree constraint) and $\mu$ (normalization); the first term is the entropy:

$$L(P, \boldsymbol{\lambda}, \mu) = \sum_{\mathbf{A}} -P(\mathbf{A})\log P(\mathbf{A}) + \sum_{i=1:n} \lambda_i\Big(\sum_{\mathbf{A}} P(\mathbf{A})\sum_{j=1:n} a_{ij} - d_i\Big) + \mu\Big(\sum_{\mathbf{A}} P(\mathbf{A}) - 1\Big)$$

MAXENT MODEL S.T. DEGREE BELIEFS

Optimality condition:

$$\frac{\partial}{\partial P(\mathbf{A})} L(P, \boldsymbol{\lambda}, \mu) = -\log P(\mathbf{A}) - 1 + \sum_{i,j=1:n} \lambda_i a_{ij} + \mu = 0$$

So:

$$P(\mathbf{A}) = \exp(\mu - 1)\cdot\exp\Big(\sum_{i,j=1:n} \lambda_i a_{ij}\Big) = \frac{1}{Z(\boldsymbol{\lambda})}\,\exp\Big(\sum_{i>j} (\lambda_i + \lambda_j)\, a_{ij}\Big) = \prod_{i>j} \frac{\exp\big((\lambda_i + \lambda_j)\, a_{ij}\big)}{1 + \exp(\lambda_i + \lambda_j)} = \prod_{i>j} P_{i,j}(a_{ij})$$

A product of independent Bernoulli distributions! Thanks to the fact that each prior belief constraint is on a (weighted) sum of the $a_{ij}$.

MAXENT MODEL S.T. DEGREE BELIEFS

To find the optimal values of the Lagrange multipliers, solve the dual:

$$\min_{\boldsymbol{\lambda}}\ L(P, \boldsymbol{\lambda}), \quad \text{where } P \text{ is given as } \ P(\mathbf{A}) = \prod_{i>j} \frac{\exp\big((\lambda_i + \lambda_j)\, a_{ij}\big)}{1 + \exp(\lambda_i + \lambda_j)}$$

After some calculations:

$$\min_{\boldsymbol{\lambda}}\ \sum_{i>j} \log\big(1 + \exp(\lambda_i + \lambda_j)\big) - \sum_{i=1:n} \lambda_i d_i$$

22

MAXENT MODEL S.T. DEGREE BELIEFS

$$\min_{\boldsymbol{\lambda}}\ \sum_{i>j} \log\big(1 + \exp(\lambda_i + \lambda_j)\big) - \sum_{i=1:n} \lambda_i d_i$$

Can be solved using gradient descent:

$$\frac{\partial}{\partial \lambda_k}\Big[\sum_{i>j} \log\big(1 + \exp(\lambda_i + \lambda_j)\big) - \sum_{i=1:n} \lambda_i d_i\Big] = \underbrace{\sum_{i=1:n} \frac{\exp(\lambda_i + \lambda_k)}{1 + \exp(\lambda_i + \lambda_k)}}_{\text{expected degree of node } k} - \underbrace{d_k}_{\text{required expected degree of node } k}$$

Lots of computational speed-ups possible...

23
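As a sanity check of the dual above, here is a minimal NumPy sketch (the function name and fixed step size are our own, illustrative choices) that fits the Lagrange multipliers by plain gradient descent and returns the resulting edge-probability matrix:

```python
import numpy as np

def fit_degree_maxent(degrees, lr=0.05, n_iter=5000, tol=1e-8):
    """Fit lambda for the maxent model with expected-degree constraints
    by gradient descent on the dual
        min_l  sum_{i>j} log(1 + exp(l_i + l_j)) - sum_i l_i d_i.
    Returns lambda and the matrix of edge probabilities P_ij(1)."""
    d = np.asarray(degrees, dtype=float)
    n = len(d)
    lam = np.zeros(n)
    for _ in range(n_iter):
        s = lam[:, None] + lam[None, :]        # l_i + l_j for every pair
        p = 1.0 / (1.0 + np.exp(-s))           # expected value of a_ij
        np.fill_diagonal(p, 0.0)               # no self-loops
        grad = p.sum(axis=1) - d               # expected minus required degree
        lam -= lr * grad
        if np.abs(grad).max() < tol:           # all degree constraints met
            break
    return lam, p

lam, P = fit_degree_maxent([3, 2, 2, 1])       # toy 4-node degree sequence
```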

TAKE-AWAYS

Constraints on the expected value of weighted sums,

$$E_{\mathbf{A}\sim P}\Big[\sum_{i,j\in I} f_{ij}\, a_{ij}\Big] = c,$$

where the $f_{ij}$ and $c$ are constants and $I$ is a set of indices,

lead to convenient product distributions

Other examples for graphs:
‒ Overall density (trivial)
‒ Densities of particular blocks (e.g. a block of nodes with the same affiliation)
‒ Assortativity (approximately)
‒ ...

24

THE INTERESTINGNESS

25

INFORMATION CONTENT

Information content:

$$\mathrm{InformationContent}(\text{pattern}, P) = -\log P(\text{pattern})$$

Pattern: "the number of edges between a given set of nodes W ⊆ V is larger than or equal to a specified k_W". A bit tricky...

Cliques as a special case: "the set of nodes W ⊆ V forms a clique". Then:

$$P(\text{pattern}) = \prod_{i>j \in W} P_{i,j}(1)$$

So:

$$\mathrm{InformationContent}(\text{pattern}, P) = -\log P(\text{pattern}) = -\sum_{i>j \in W} \log P_{i,j}(1)$$

Larger if |W| is larger and if the $P_{i,j}(1)$ for $i, j \in W$ are smaller.

26

INFORMATION CONTENT

Pattern (general case): "the number of edges between a given set of nodes W ⊆ V is larger than or equal to a specified k_W"

Probability of at least $k_W$ successes in $n_W = \binom{|W|}{2}$ Bernoulli trials? Approximated by:

$$P(\text{pattern}) \approx \exp\Big(-n_W\, \mathrm{KL}\Big(\frac{k_W}{n_W}\,\Big\|\, p_W\Big)\Big)$$

where $p_W$ is the average probability $P_{i,j}(1)$ over the edges between $i, j \in W$. And thus:

$$\mathrm{InformationContent}(\text{pattern}, P) = -\log P(\text{pattern}) \approx n_W\, \mathrm{KL}\Big(\frac{k_W}{n_W}\,\Big\|\, p_W\Big)$$

Larger if |W| (and thus $n_W$) is larger, $p_W$ is smaller, and $k_W$ is larger.

27
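A small, self-contained sketch of this computation (helper names are our own; `P` is the matrix of edge probabilities $P_{i,j}(1)$ under the background distribution, e.g. as returned by the earlier fitting sketch):

```python
import numpy as np

def kl_bernoulli(q, p):
    """KL(q || p) between two Bernoulli distributions."""
    eps = 1e-12
    q = np.clip(q, eps, 1 - eps)
    p = np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def information_content(W, k_W, P):
    """IC of 'at least k_W edges among node set W', using the
    approximation IC ~ n_W * KL(k_W / n_W || p_W) from this slide."""
    W = list(W)
    pairs = [(i, j) for a, i in enumerate(W) for j in W[a + 1:]]
    n_W = len(pairs)                              # |W| choose 2 trials
    p_W = np.mean([P[i, j] for i, j in pairs])    # average edge probability
    return n_W * kl_bernoulli(k_W / n_W, p_W)
```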

DESCRIPTION LENGTH

For cliques: describe the set W. A constant (to describe |W|) plus a term linear in |W| (to describe its elements):

$$\mathrm{DescriptionLength}(\text{pattern}) = \alpha\,|W| + \beta$$

For dense subgraphs: the constant $\beta$ also describes $k_W$.

28

INTERESTINGNESS

Putting things together:

$$\mathrm{Interestingness}(\text{pattern}, P) = \frac{-\sum_{i>j\in W} \log P_{i,j}(1)}{\alpha\,|W| + \beta}$$

A bit more complex for general dense subgraphs:

$$\mathrm{Interestingness}(\text{pattern}, P) \approx \frac{n_W\, \mathrm{KL}(k_W/n_W \,\|\, p_W)}{\alpha\,|W| + \beta}$$

Hard to optimize!
‒ Exact search for small graphs
‒ Effective hill climber for large graphs

29
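Putting IC and DL together in code, a minimal sketch (reusing `information_content` from the previous snippet; `alpha` and `beta` are the user-chosen DL constants):

```python
def subjective_interestingness(W, k_W, P, alpha=1.0, beta=1.0):
    """SI = IC / DL for a dense-subgraph pattern, with
    DL(pattern) = alpha * |W| + beta."""
    return information_content(W, k_W, P) / (alpha * len(W) + beta)
```

The search then looks for the node set W (and threshold k_W) that maximizes this ratio, e.g. with the hill climber mentioned above.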

TAKE-AWAYS

No compromises w.r.t. interestingness

Often leads to hard search problems

Question: is this intrinsic to genuine subjective

interestingness?

30

UPDATING THE BACKGROUND DISTRIBUTION

31

UPDATING THE BACKGROUND DISTRIBUTION

Given a pattern, update the background distribution by conditioning on the pattern.

Easy to do for cliques W:
‒ Set $P'_{i,j}(a_{ij} = 1) = 1$ for $i, j \in W$

Fast to approximate for (non-clique) dense subgraphs W:
‒ Set $P'_{i,j}(a_{ij}) \propto P_{i,j}(a_{ij}) \cdot \exp(\lambda_W a_{ij})$ for $i, j \in W$, such that the expected density of W equals $k_W$

Remains a product of Bernoullis.

32
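A minimal sketch of the clique case (function name is ours; `P` is the edge-probability matrix as before):

```python
def condition_on_clique(P, W):
    """Condition the background distribution on a clique pattern:
    every edge inside W is now known to be present, so its
    probability becomes 1; all other edges are untouched."""
    P = P.copy()
    for i in W:
        for j in W:
            if i != j:
                P[i, j] = 1.0
    return P
```

For the non-clique case one would instead scale each $P_{i,j}$ inside W by $\exp(\lambda_W a_{ij})$ and renormalize, tuning $\lambda_W$ so that the expected density of W matches $k_W$.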

TAKE-AWAYS

Updating can be trivial

Otherwise, often easy to do approximately

33

HUMAN-CENTRIC

DATA EXPLORATION

PART 3/5: BINARY MATRICES, GRAPHS, RELATIONAL DATA

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

2

OUTLINE OF THIS PART

(Community detection)

Itemsets

Relational patterns

Connecting trees

Network embedding

3

ITEMSETS

4

ITEMSETS

Data: binary matrix:

5

𝑿 ∈ {0,1}𝑚×𝑛

Beer Diapers Lipstick Carrier SUM

Alice 1 1 1 3

Bob 1 1 1 3

Charlie 1 1 2

Denise 1 1 2

Eve 1 1 2

Frankie 1 1 2

SUM 4 3 2 5

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

6

𝑿 ∈ {0,1}𝑚×𝑛


De Bie, DMKD 2011

ITEMSETS

7

𝑿 ∈ {0,1}𝑚×𝑛

Beer Diapers Lipstick Carrier SUM

Alice 1 1 1 3

Bob 1 1 1 3

Charlie 1 1 2

Denise 1 1 2

Eve 1 1 2

Frankie 1 1 2

SUM 4 3 2 5

Data: binary matrix:

Prior beliefs: row and column sums

Background distribution P:
- Nonnegative and properly normalized
- Has correct marginals

Many solutions!?

‘Unbiased’ distribution: Maximum Entropy distribution

De Bie, DMKD 2011


ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Background distribution 𝑃

8

𝑿 ∈ {0,1}𝑚×𝑛


$$P(\mathbf{X}) = \prod_{i,j} P_{i,j}(x_{ij}), \qquad P_{i,j}(x_{ij}) = \frac{\exp\big(x_{ij}\cdot(\mu_i + \lambda_j)\big)}{1 + \exp(\mu_i + \lambda_j)}$$

De Bie, DMKD 2011

A convex optimization problem!

‘Unbiased’ distribution: Maximum Entropy distribution
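A minimal NumPy sketch of the fit (our own illustrative function, with a fixed step size): gradient descent on the dual drives the row multipliers $\mu_i$ and column multipliers $\lambda_j$ until the expected margins match the observed ones:

```python
import numpy as np

def fit_margin_maxent(X, lr=0.1, n_iter=5000):
    """Maxent background model for a binary matrix with row- and
    column-sum prior beliefs: P_ij(1) = sigmoid(mu_i + lambda_j)."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    mu, lam = np.zeros(m), np.zeros(n)
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(mu[:, None] + lam[None, :])))
        mu  -= lr * (P.sum(axis=1) - X.sum(axis=1))  # match row sums
        lam -= lr * (P.sum(axis=0) - X.sum(axis=0))  # match column sums
    return P
```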

ITEMSETS

Data: binary matrix:

Prior beliefs: uniform at observed density

9

𝑿 ∈ {0,1}𝑚×𝑛De Bie, DMKD 2011


$$P_{i,j} = \frac{14}{24}\ \ \forall\, i,j, \qquad P_{i,j}(x_{ij}) = \frac{\exp(x_{ij}\cdot\lambda)}{1 + \exp(\lambda)}, \qquad \lambda = \log\frac{14}{10} = 0.3365$$

(the example matrix has 14 ones in 24 cells, so $e^{\lambda}/(1+e^{\lambda}) = 14/24$ gives $\lambda = \log 1.4$)

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles (set of rows and columns)

10

𝑿 ∈ {0,1}𝑚×𝑛


Large support may not be interesting?

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

11

𝑿 ∈ {0,1}𝑚×𝑛


Large surface may not be interesting?

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

12

𝑿 ∈ {0,1}𝑚×𝑛


Less expected: smaller row and column margins. Bingo!

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

As stated SI = IC / DL

13

𝑿 ∈ {0,1}𝑚×𝑛


De Bie, DMKD 2011

$$\mathrm{IC}(Z) = -\log \Pr(Z) = \sum_{i,j \in Z} -\log p_{i,j}, \qquad \mathrm{DL}(Z) = a\,(\#\text{rows} + \#\text{columns}) + b$$
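A minimal sketch computing this SI score for a tile (our own helper; `P` is the cell-probability matrix, e.g. from `fit_margin_maxent` above):

```python
import math

def tile_interestingness(rows, cols, P, a=1.0, b=1.0):
    """SI of a tile of ones Z = rows x cols:
    IC(Z) = sum of -log P_ij(1) over the tile's cells,
    DL(Z) = a * (#rows + #columns) + b."""
    ic = sum(-math.log(P[i][j]) for i in rows for j in cols)
    dl = a * (len(rows) + len(cols)) + b
    return ic / dl
```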


ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

14

$\mathbf{X} \in \{0,1\}^{m\times n}$

Iterate

De Bie, DMKD 2011

We specified there are ones here

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

15

𝑿 ∈ {0,1}𝑚×𝑛


Iterate

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

16

𝑿 ∈ {0,1}𝑚×𝑛


Iterate

De Bie, DMKD 2011

ITEMSETS

KDD abstracts dataset (documents × words):

| Subjective interestingness ranking (prior info on row & column sums) | #docs | Ranking by support × size (area) | #docs |
|---|---|---|---|
| svm, support, machin, vector | 25 | data, paper | 389 |
| state, art | 39 | algorithm, propose | 246 |
| unlabelled, labelled, supervised, learn | 10 | data, mine | 312 |
| associ, rule, mine | 36 | base, method | 202 |
| gene, express | 25 | result, show | 196 |
| frequent, itemset | 28 | problem | 373 |
| large, social, network, graph | 15 | data, set | 279 |
| column, row | 13 | approach | 330 |
| algorithm, order, magnitud, faster | 12 | model | 301 |
| paper, propos, algorithm, real, synthetic, data | 27 | present | 296 |

De Bie, DMKD 2011


ITEMSETS

KDD abstracts dataset (documents × words):

| Subjective interestingness ranking (prior info on row & column sums) | #docs | Subjective interestingness ranking (additional prior info on keyword tiles) | #docs |
|---|---|---|---|
| svm, support, machin, vector | 25 | art, state | 39 |
| state, art | 39 | row, column, algorithm | 12 |
| unlabelled, labelled, supervised, learn | 10 | unlabelled, labelled, data | 14 |
| associ, rule, mine | 36 | answer, question | 18 |
| gene, express | 25 | precis, recal | 14 |

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Background distribution P

Patterns: tiles of ones and zeros

19

𝑿 ∈ {0,1}𝑚×𝑛


Extension: noisy tiles

Kontonasios & De Bie, SDM 2010

IC is straightforward

DL depends on the skew (entropy) of the distribution of ones and zeros within the tile

ITEMSETS

Algorithmic approach: ?

Not studied extensively, but a special case of relational patterns (see the next instantiation)

Interesting result: if we can mine the best pattern at every iteration, then this greedy procedure approximates the total IC of the best set of tiles at that (cumulative) DL to within a factor of 1 − 1/e (≈ 0.63).

20

Kontonasios & De Bie, SDM 2010

RELATIONAL PATTERNS

21

RELATIONAL PATTERN MINING

Data: relational database

Pattern: connected complete subgraphs

Prior beliefs: degree of each node in each

relationship

Customers Items Attributes

Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)

Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)

Guns, Aknin, Lijffijt, De Bie (ICDM 2016)

RELATIONAL PATTERN MINING

Data: relational database

Pattern: connected complete subgraphs

Prior beliefs: degree of each node in each

relationship

Customers Items Attributes

Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)

Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)

Guns, Aknin, Lijffijt, De Bie (ICDM 2016)

Prior factorizes over relationships

equivalent to itemsets

Users Films

Genres

Actors

RELATIONAL PATTERN MINING

RMiner

RMINER

Algorithmic approach: enumerate + rank

Based on fixpoint-enumeration (Boley et al. 2010)

25

[Figure: example relational dataset with entities A1–A3, B1–B3, C1–C2]

Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)


RMINER

Algorithmic approach: enumerate + rank

35

[Figure: enumeration over entities A1–A3, B1–B3, C1–C2]

1) Branch on any entity

2) Compute closure

- Add entities that are in all supersets

- If invalid added, backtrack

- Repeat until no valid candidates

- If invalid candidates

- Stop

- Else

- Output pattern

- Backtrack and declare entity invalid

Etc. etc.

Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
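The closure step is the heart of this enumeration. A minimal sketch on the itemset special case (our own illustrative helper; RMiner generalizes this fixpoint to multi-relational data):

```python
def closure(entity_set, transactions):
    """Closure of a set of entities: everything contained in every
    transaction that supports the set. RMiner branches on an entity,
    computes this closure, and backtracks whenever an entity that was
    declared invalid earlier enters the closure."""
    covering = [t for t in transactions if entity_set <= t]
    if not covering:
        return None                      # the set occurs nowhere
    return frozenset.intersection(*covering)

transactions = [frozenset("abc"), frozenset("abd"), frozenset("ab")]
print(closure(frozenset("a"), transactions))   # frozenset({'a', 'b'})
```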

RELATIONAL PATTERN MINING

[Figure: N-RMiner handles n-ary relations; P-N-RMiner additionally handles structured attributes such as numeric values, circadian values, and taxonomy elements]

P-N-RMINER: FISHER DATA

37

Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)

CP-RMINER

Algorithmic approach: branch & bound in CP (top 1)

38

Guns, Aknin, Lijffijt, De Bie (ICDM 2016)

CONNECTING TREES

39

CONNECTING SUBTREES

Data: graph $G = (V, E)$, with $V$ known and $E \subseteq V \times V$ unknown

40

[Figure: example graph with nodes A–I]

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)

CONNECTING SUBTREES

Data: graph $G = (V, E)$, with $V$ known and $E \subseteq V \times V$ unknown

Prior beliefs: degree of vertices, (batch) time order

41

[Figure: example graph with nodes A–I]

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)

CONNECTING SUBTREES

Data: graph $G = (V, E)$, with $V$ known and $E \subseteq V \times V$ unknown

Prior beliefs: degree of vertices, (batch) time order

Patterns: subtree connecting query vertices Q

42

[Figure: example graph with nodes A–I; query vertices highlighted]

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)

CONNECTING SUBTREES

Data: graph $G = (V, E)$, with $V$ known and $E \subseteq V \times V$ unknown

Prior beliefs: degree of vertices, (batch) time order

Patterns: subtree connecting query vertices Q

Which is more interesting?

43

[Figure: three candidate connecting subtrees; one is unsurprising, one offers no information compression, and one is most interesting, since E and C have small in-degree]

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)

TIME DIFFERENCE PRIOR

Consider citation network

Papers arrive in batches

(per year)

Earlier papers cannot /

rarely cite newer papers

44

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)

TIME DIFFERENCE PRIOR

Consider citation network

Papers arrive in batches

(per year)

Earlier papers cannot /

rarely cite newer papers

Limited increase in

computational cost to fit

background distribution

45

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)

EXAMPLE ON CITATION DATA

46

3 recent best papers from ACM SIGKDD

Uniform prior:

Time and degree prior:

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)

CONNECTING SUBTREES

Algorithmic approach: greedily construct a tree with

maximum depth k

Not so straightforward!

We investigated various heuristics

47

CONNECTING SUBTREES

Empirically: best strategy depends on size of query set Q

48

CONNECTING SUBTREES

49

Uniform Degree prior

NIPS/PODS authors

Repeated from Akoglu et al. (SDM 2013)

NETWORK EMBEDDING

50

CONDITIONAL NETWORK EMBEDDINGS

51

Data: a graph $G$ with adjacency matrix $\mathbf{A}$

Pattern: a metric embedding $\mathbf{X}$
‒ Probabilistic info about the graph
‒ $P(\|\mathbf{x}_i - \mathbf{x}_j\| \mid a_{ij})$ = half-normal

Prior beliefs: $P_{i,j}(a_{ij})$
‒ overall density
‒ degrees
‒ block structure
‒ assortativity
‒ ...

Find the maximum-likelihood embedding:

$$\max_{\mathbf{X}}\ P(G \mid \mathbf{X})$$

Kang, Lijffijt, De Bie

(ICLR 2019)
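A minimal sketch of the probabilistic ingredient (function name and spread parameters are our own, illustrative choices): connected pairs get a half-normal distance distribution with a smaller spread than unconnected pairs, and Bayes' rule combines this with the prior $P_{i,j}(a_{ij})$:

```python
import numpy as np
from scipy.stats import halfnorm

def edge_posterior(xi, xj, prior_ij, s1=1.0, s0=2.0):
    """P(a_ij = 1 | embedding distance) via Bayes' rule, with
    half-normal distance likelihoods (spread s1 < s0, so connected
    pairs are expected to lie closer together)."""
    d = np.linalg.norm(xi - xj)
    num = halfnorm.pdf(d, scale=s1) * prior_ij          # a_ij = 1
    den = num + halfnorm.pdf(d, scale=s0) * (1 - prior_ij)
    return num / den
```

Maximizing $P(G \mid \mathbf{X})$ then pulls connected pairs together and pushes unconnected pairs apart, relative to what the prior already explains.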

EXAMPLE ON STUDENTDB

52

53

CONDITIONAL NETWORK EMBEDDINGS

54

CONDITIONAL NETWORK EMBEDDINGS

Algorithmic approach: gradient descent with

estimated gradient (positive and negative sampling)

55

SUMMARY PART 3

56

SUMMARY

An informative prior can be useful in many settings

For binary data (all previous examples), fitting and updating the background model is computationally easy

Mining SI patterns is challenging; (for now) tailor-made algorithms are necessary

57

HUMAN-CENTRIC

DATA EXPLORATION

PART 4/5: NUMERIC AND MIXED DATA

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

2

OUTLINE OF THIS PART

Attributed subgraphs

Subgroup discovery in real-valued (target) data

Dimensionality reduction

Time series

3

ATTRIBUTED SUBGRAPHS

4

COHESIVE SUBGRAPHS

Data: attributed graph

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

[Figure: example attributed graph with vertices A–I]

| Vertex | Shops | Events | Pubs |
|--------|-------|--------|------|
| A | 1 | 5 | 3 |
| B | 5 | 0 | 2 |
| C | 2 | 3 | 10 |
| D | 1 | 6 | 4 |
| E | 4 | 0 | 2 |
| F | 3 | 1 | 5 |
| G | 6 | 2 | 9 |
| H | 2 | 1 | 3 |
| I | 0 | 1 | 3 |

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics


Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

[Figure: highlighted subgraph {A, C, D} in the example graph]

Pattern: these locations have many events

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

[Figure: highlighted subgraph {F, G, H} in the example graph]

Pattern: these locations have many pubs & shops

A pattern is easy to interpret if it is local. How to quantify this?

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

[Figure: highlighted vertices around G]

Pattern: vertices around G have many pubs & shops. Easy to describe!

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

[Figure: highlighted vertices around C]

Pattern: vertices around C have many events. Easy to describe!

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

‒ Like subgroup discovery

‒ Describe vertex set with rule

‒ Intersection of neighbourhoods

‒ Minus exceptions

‒ Attributes below / above threshold

‒ As compared to expectation

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Attribute values:

| Vertex | Shops | Events | Pubs |
|--------|-------|--------|------|
| A | 1 | 5 | 3 |
| B | 5 | 0 | 2 |
| C | 2 | 3 | 10 |
| D | 1 | 6 | 4 |
| E | 4 | 0 | 2 |
| F | 3 | 1 | 5 |
| G | 6 | 2 | 9 |
| H | 2 | 1 | 3 |
| I | 0 | 1 | 3 |

Interestingness (values are for illustration only):

| Vertex | Shops | Events | Pubs |
|--------|-------|--------|------|
| A | 0.5 | … | … |
| B | 0.04 | … | … |
| C | 0.2 | … | … |
| D | 0.8 | … | … |
| E | 0.06 | … | … |
| F | 0.13 | … | … |
| G | 0.07 | … | … |
| H | 0.1 | … | … |
| I | 1.0 | … | … |

Background distribution:
‒ Geometric per cell
‒ Using row/column margins

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]
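A minimal sketch of the per-cell surprise under such a geometric background (our own helper; the actual model ties the success probability to the row/column margins):

```python
def cell_surprise(k, p):
    """Probability of a count of at least k under a geometric background
    on {0, 1, 2, ...} with success probability p: P(X >= k) = (1 - p)**k.
    The smaller this probability, the more surprising the cell."""
    return (1.0 - p) ** k

print(cell_surprise(6, 0.4))   # e.g. a count of 6 against a mean of 1.5
```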


COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

14

𝑃1: +food

𝑃2: +professional, +nightlife,

+outdoors, +college

𝑃3: +nightlife, +food, -college

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Algorithmic approach: we tried

Option 1: enumerate with P-N-RMiner, then rank

Option 2: dedicated branch-and-bound algorithm

16

SUBGROUP DISCOVERY (EXCEPTIONAL MODEL MINING)

17

LOCATION & SPREAD PATTERNS

Data:

‒ Meta-data: any type

‒ Target data: real valued matrix

Prior beliefs: mean and variance statistics

‒ Typically overall, but can be for subsets

Patterns: description with

‒ mean vector, or

‒ projection and magnitude of variance

18

𝑿 ∈ ℝ𝑚×𝑛

[Figure (a): scatter of Attribute 1 vs. Attribute 2, with groups of objects having property x, property y, and properties z1 and z2]

Lijffijt, Kang, Duivesteijn, Puolamäki,

Oikarinen, De Bie (ICDE 2018)

SINGLE TARGET DIMENSION: CRIME IN US

UCI Crime data:

violent crime rate

(per 1k pop)

Description: areas with high incidence of unmarried mothers (coded

by CBS as percentage illegitimate)

Target: high average crime rate

19

ECOLOGY: PRESENT SPECIES AS TARGETS

Description of pattern (a): "mean temperature in March ≤ −1.68 °C"
Description of pattern (b): "average monthly rainfall in August ≤ 47.62 mm"
Description of pattern (c): "average monthly rainfall in October ≤ 45.25 mm and mean temperature of wettest quarter ≥ 16.32 °C"

Mean of target attributes for (a): −wood mouse; +mountain hare, moose, red-backed vole, wood lemming

20

GERMAN POLITICS VS. DEMOGRAPHICS

Description of pattern (a): "few children". Target: LEFT is popular.

Description of pattern (b): "large mid-aged pop". Target: GREEN is relatively popular.

Description of pattern (c): "many children". Target: LEFT is unpopular.

21

GERMAN POLITICS VS. DEMOGRAPHICS

Moreover, the target of pattern (a): "SPD and CDU negatively correlated". Due to the popularity of LEFT, SPD and CDU are in tougher competition here.

22

SUBGROUP DISCOVERY

For reference only

Prior:

IC/DL:

23

SUBGROUP DISCOVERY

Algorithmic approach:

Find descriptions using beam search

‒ Fairly standard in SD; Implementation from

Cortana (Meeng & Knobbe, BeneLearn 2011)

‒ Using the SI objective directly

For projections

‒ Manifold learning problem (ManOpt toolbox)

Interesting results to speed-up update to bg. distrib.

24
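Beam search itself is generic; a minimal sketch (our own, not the Cortana implementation) of the idea of keeping only the `width` best descriptions per refinement level:

```python
def beam_search(start, refine, score, width=10, depth=3):
    """Greedy beam search: keep the `width` best descriptions at each
    level, refine them, and return the best description seen overall."""
    beam = sorted(start, key=score, reverse=True)[:width]
    best = max(beam, key=score)
    for _ in range(depth):
        beam = sorted({r for d in beam for r in refine(d)},
                      key=score, reverse=True)[:width]
        if not beam:
            break
        best = max([best] + beam, key=score)
    return best
```

Here `score` would be the SI objective used directly, and `refine` would add one condition to a subgroup description.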

DIMENSIONALITY REDUCTION

25

PROJECTION PATTERNS

Data: Real valued matrix

Prior beliefs: global mean and (co-)variance structure

Patterns: projections

26

𝑿 ∈ ℝ𝑚×𝑛

DATA PROJECTIONS

De Bie, Lijffijt, Santos-Rodriguez, Kang

(ESANN 2016)

Kang, Lijffijt, Santos-Rodriguez, De Bie

(KDD 2016, DMKD 2018)

Puolamäki, Kang, Lijffijt, De Bie (ECMLPKDD 2016)

Kang, Puolamäki, Lijffijt, De Bie (ECMLPKDD 2016)

Puolamäki, Oikarinen, Kang, Lijffijt, De Bie (ICDE 2018)

Finding informative

projections

Accounting for

user feedback

SI COMPONENT ANALYSIS (SICA)

Problem is parametrized by a resolution parameter

Go from density to probability for projections

28

De Bie, Lijffijt, Santos-Rodriguez, Kang (ESANN 2016)

Kang, Lijffijt, Santos-Rodriguez, De Bie (KDD 2016, DMKD 2018)

SICA

Effect of the prior beliefs:
‒ Expectation on mean/variance ⇒ the PCA objective
‒ Expectation on the magnitude of variance ⇒ a more robust variant of PCA (which we call t-PCA)
‒ Graph of point similarities ⇒ next slides

29

SICA GRAPH PRIOR

30

SICA GRAPH PRIOR

German voting percentages per district, account for

east-west divide

31

SI DATA EXPLORER (SIDE)

32

https://users.ugent.be/~bkang/software/side_dev/entry.html

SI DATA EXPLORER (SIDE)

33

SI DATA EXPLORER (SIDE)

34

SICA/SIDE

Algorithmic approach:

SICA (uniform/graph) are eigenvalue problems

SICA t-PCA and SIDE use manifold learning toolbox

35

C-T-SNE

36

Non-linear dimensionality reduction, based on t-SNE

Prior beliefs: known clusters

Result: the c-t-SNE visualization may make new clusters salient

Bo Kang, Dario Garcia Garcia, Jefrey Lijffijt, Raul Santos Rodriguez, Tijl De Bie (arXiv, 2019)

TIME SERIES

37

TIME SERIES MOTIFS

Data: time series

Prior beliefs: mean, var, co-var (first order difference)

38

$\mathbf{X} \in \mathbb{R}^{1 \times n}$

Deng, Lijffijt, Kang, De Bie (Entropy, 2019)

TIME SERIES MOTIFS

Data: time series

Prior beliefs: mean, var, co-var (first order difference)

Patterns: motif template

39

$\mathbf{X} \in \mathbb{R}^{1 \times n}$

Deng, Lijffijt, Kang, De Bie (Entropy, 2019)

TIME SERIES MOTIFS

IC quantifies the reduction in uncertainty

Likelihood of data increases by inserting template in

background distribution for matched locations

Due to co-variance, expectations change globally

Updating computationally (somewhat) costly

40

TIME SERIES MOTIFS

Algorithmic approach:

Construct template from 3 or 4 subsequences using

constraint programming and relaxed objective

Greedily add subsequences using exact objective

Prune dissimilar subsequences from search after

branching and selection of initial set

41

TIME SERIES MOTIFS

Data: time series

Prior beliefs: mean, var, co-var (first order difference)

Patterns: motif template

42

$\mathbf{X} \in \mathbb{R}^{1 \times n}$

Deng, Lijffijt, Kang, De Bie (Entropy, 2019)

AND MORE

43

AND MORE

Past:
‒ Data clustering
‒ Biclustering
‒ Exceptional model mining / subgroup discovery
‒ Time series segments

Ongoing / future:
‒ Backbone of a network
‒ Insightful summaries of an attributed network
‒ Network embeddings
‒ ...

44

with all past and current members of the FORSIED team and Jilles

Vreeken, Antonis Matakos, Dario Garcia-Garcia, Siegfried Nijssen,...

HUMAN-CENTRIC

DATA EXPLORATION

PART 5/5: ADVANCED TOPICS, OUTLOOK & CONCLUSIONS

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

2

RELATED WORK

3

RELATED WORK

MDL/compression

MAP estimation

Hypothesis testing / randomization techniques

‘Subjective interestingness’ research by Tuzhilin, Silberschatz, Padmanabhan (90's)

4

LIMITATIONS

Not all new information is interesting!

The FORSIED framework does not directly address this

Importance of feedback in combination with FORSIED

(see work on dimensionality reduction)

Description length

How to determine?

Is shorter really always better?

5

FORSIED’S ORIGINS, AND OUTLOOK

6

FORSIED’S ORIGINS

7

Remodiscovery, 2006

DISTILLER, 2009

FORSIED’S ORIGINS

8

FORSIED’S ORIGINS

9

MINI, ECML-PKDD 2007

TKDD, 2007

KDD, 2009

DAMI, 2014

FORSIED’S ORIGINS

10

Since 2017 also Jefrey’s Pegasus2 fellowship, and a large and a small FWO grant

OUTLOOK

Improve theoretical understanding:
‒ Estimating the background distribution (information geometry)
‒ Cognitive aspects (cognitive science)
‒ User interface (human computer interaction)
‒ Visualization (visual analytics)
‒ Algorithmic aspects (optimisation theory)
‒ Safeguarding sensitive information & fairness

More instantiations:
‒ Data types (linked data / knowledge graphs!) / pattern types / prior belief types

Applications:
‒ Bioinformatics
‒ Web and social media mining

11

DATA MINING WITHOUT SPILLING THE BEANS

12

PRIVACY-PRESERVING DATA PUBLISHING

Anonymization insufficient to protect sensitive attributes (linkage attack)

Generalization!

13

Anonymized patient database:

| ZIP | D.O.B. | Sex | Diagnosis |
|-----|--------|-----|-----------|
| 94701 | 01/02/1968 | F | Healthy |
| 94701 | 06/03/1990 | F | Obesity |
| 94702 | 11/08/1991 | M | Healthy |
| 94703 | 03/09/1979 | M | Prostate cancer |
| 94703 | 07/10/1951 | F | Healthy |
| 94704 | 10/02/1973 | M | Obesity |
| 94705 | 20/12/2001 | F | Obesity |

Voting records database:

| ZIP | D.O.B. | Sex | Full name |
|-----|--------|-----|-----------|
| 94701 | 01/02/1968 | F | Mary Smith |
| 94701 | 06/03/1990 | F | Patricia Johnson |
| 94702 | 11/08/1991 | M | James Jones |
| 94703 | 03/09/1979 | M | John Brown |
| 94703 | 07/10/1951 | F | Linda Davis |
| 94704 | 10/02/1973 | M | Robert Miller |
| 94705 | 20/12/2001 | F | Barbara Wilson |

Quasi-identifiers: ZIP, D.O.B., Sex

Generalized patient database:

| ZIP | D.O.B. | Sex | Diagnosis |
|-----|--------|-----|-----------|
| 94701 | '51-'01 | F | Healthy |
| 94701 | '51-'01 | F | Obesity |
| 94702-5 | '51-'01 | M | Healthy |
| 94702-5 | '51-'01 | M | Prostate cancer |
| 94702-5 | '51-'01 | F | Healthy |
| 94702-5 | '51-'01 | M | Obesity |
| 94702-5 | '51-'01 | F | Obesity |

PRIVACY-PRESERVING DATA PUBLISHING

k-anonymity: minimum size of an equivalence class ≥ k
‒ Homogeneity attack
‒ Background knowledge attack

l-diversity: ≥ l sensitive attribute values well represented in each equivalence class
‒ Hard to achieve for imbalanced data
‒ Skewness attack
‒ Similarity attack

t-closeness: sensitive attribute value distribution in each equivalence class same as in the overall data

14

Source: https://www.linkedin.com/pulse/dont-throw-baby-out-bathwater-didi-gurfinkel/


OTHER KINDS OF SENSITIVE INFORMATION

Existence of a tight community in a network

16

OTHER KINDS OF SENSITIVE INFORMATION

Existence of a tight community in a network

Existence of a cluster in data

Frequency of particular items / size of particular

transactions in a database of purchases

Preserve this while:

publishing generalized version of database,

identifying dense subgraphs,

finding clusters,

mining frequent itemsets, etc

17

Data mining patterns

GENERAL STRATEGY

Data: $\boldsymbol{x}$. Data mining goal: reveal as much as possible about $\boldsymbol{x}$.

Sensitive aspects: $f(\boldsymbol{x}) \in \Phi$, e.g.
‒ the sensitive attributes' values
‒ density of a specified subgraph
‒ existence of a tight cluster
‒ frequencies of all items

Goal: reveal as little as possible about $f(\boldsymbol{x})$.

Updating $P \to P'$ results in updating $P_f \to P_f'$
‒ More complex than conditioning!
‒ $P_f(f(\boldsymbol{x}))$ can be larger or smaller than $P_f'(f(\boldsymbol{x}))$

Considering the user's prior beliefs! ⇒ ‘Subjective’ measures

18

[Diagram: the data space Ω, with pattern subset Ω′ and background distribution P, maps through f to the sensitive space Φ, with Φ′ = f(Ω′) and induced distribution P_f]

TRADING-OFF TWO THINGS

1. Subjective information content of a pattern
2. A criterion on the background distribution about the sensitive aspects:

‒ Information content left in the sensitive aspects (surprise in the actual value of the sensitive attributes): $-\log P_f'(f(\boldsymbol{x}))$
‒ Entropy of $P_f'$ (uncertainty about the sensitive attributes): $-E_{F\sim P_f'}\big[\log P_f'(F)\big]$
‒ Knowledge gained about the actual value of the sensitive aspects: $-\log \dfrac{P_f(f(\boldsymbol{x}))}{P_f'(f(\boldsymbol{x}))}$
‒ Degree of belief that the sensitive aspects are within a specified set $\Phi^* \subseteq \Phi$: $P_f'(\Phi^*)$
‒ ...

19
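A minimal sketch computing two of these criteria (our own helpers; `pf_old` / `pf_new` map each possible sensitive value to its probability before / after the update):

```python
import math

def residual_surprise(pf_new, f_x):
    """Information content left in the sensitive aspects: -log P'_f(f(x)).
    Large = the analyst would still be surprised by the true value."""
    return -math.log(pf_new[f_x])

def knowledge_gained(pf_old, pf_new, f_x):
    """-log(P_f(f(x)) / P'_f(f(x))): positive when the update made the
    true sensitive value more plausible, i.e. information leaked."""
    return math.log(pf_new[f_x]) - math.log(pf_old[f_x])
```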

EXAMPLES

20

PRIVACY-PRESERVING DATA PUBLISHING

21

• Random synthetic dataset:

• 5 real-valued quasi-identifiers,

generalization through intervals

• 1 sensitive attribute, 3 possible values

• 1 other attribute, 3 possible values

• 100 data records

• Trade-off:

• information about data (other & sensitive attributes)

• knowledge gained about sensitive attribute

• Generalize quasi-attributes ⇒ 5 equivalence classes
• Ensure the maximum information content about any sensitive attribute value is small

[Figure: conditional distributions within the 5 equivalence classes over the 3 sensitive attribute values; over the 3 other attribute values; and the joint conditional distribution of the sensitive (rows) and other (columns) attributes]

QI = quasi-identifiers (e.g. zip code, DOB); SA = sensitive attribute (e.g. sexual orientation, ethnicity); OA = other attribute (e.g. sense of well-being, productivity)

DENSE SUBGRAPHS WITHOUT SPILLING BEANS

22

[Figure: background distribution over a 40×40 adjacency matrix: initially; after both community patterns; after both community patterns and a deception pattern; after both community patterns partially concealed]

• Random network:
  • 2 non-overlapping communities
  • A 3rd community overlapping both
  • The 3rd is sensitive: the analyst should remain surprised by its presence
• Task:
  • Identify (non-)dense subgraphs
  • Without spilling the beans on the 3rd community
• Approaches (result from the general strategy):
  • Deceive
  • Conceal

TAKE-AWAYS

FORSIED ideas can be used for quantifying sensitive

information disclosure

Key point: sensitive information disclosure is subjective

More work needed to understand how to make this

practical

23

CONCLUSIONS

24

OVERALL CONCLUSIONS

A generic approach for designing methods for exploring data
‒ Several successes
‒ Sometimes more challenging, mostly due to algorithmic issues

Key take-away:
‒ Model what's not interesting (= prior beliefs), show what's complementary (= subjectively interesting)
‒ Using information theory

New horizons?
‒ Privacy and sensitive information

25

www.forsied.net 26

“Data Mining without Spilling the Beans: Preserving more than Privacy alone”Research project funded by the FWO

Tijl De Bie, Jefrey Lijffijt

“Exploring Data: Theoretical Foundations and Applications to Web, Multimedia, and Omics Data”Odysseus project funded by the FWO

Tijl De Bie

“Formalizing Subjective Interestingness in Data mining”ERC project FORSIED

Tijl De Bie

“Personalised, interactive, and visual exploratory mining of patterns in complex data”FWO [Pegasus]2 Marie Skłodowska-Curie Fellowship

Jefrey Lijffijt

Acknowledgements go to

27

Jefrey Lijffijt

Bo Kang

Wouter

Duivesteijn

Achille Aknin

Holly Silk

Raul

Santos-Rodriguez

Eirini Spyropoulou

Akis Kontonasios

Paolo Simeone

Robin Vandaele

Florian Adriaens

Tijl De Bie

Xi Chen

Junning (Lemon)

Deng

Ahmad Mel

Alexandru Mara

We are recruiting!

www.forsied.net / aida.ugent.be

+ lots of collaborators...

Maryam Fanaeepour

Maarten Buyl

THANKS!

TIME FOR Q&A

28

MINING SUBJECTIVELY

INTERESTING PATTERNS IN DATA

SUPPLEMENTARY MATERIAL

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

REFERENCES

2

Adriaens, Lijffijt, De Bie: Subjectively Interesting Connecting Trees. ECML/PKDD (2) 2017: 53-69

De Bie, Lijffijt, Santos-Rodriguez, Kang: Informative Data Projections: A Framework and Two Examples. ESANN 2015 :

435-640

De Bie: Subjective Interestingness in Exploratory Data Mining. IDA 2013: 19-31

De Bie: Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min.

Knowl. Discov. 23(3): 407-446 (2011)

De Bie: An information theoretic framework for data mining. KDD 2011: 564-572

De Bie: Subjectively Interesting Alternative Clusters. MultiClust@ECML/PKDD 2011: 43-54

Deng, Lijffijt, Kang, De Bie: Subjectively Interesting Motifs in Time Series, AALTD@ECML/PKDD 2018

Guns, Aknin, Lijffijt, De Bie: Direct Mining of Subjectively Interesting Relational Patterns. ICDM 2016: 913-918

3

Kang, Lijffijt, De Bie: Conditional Network Embeddings. Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.

Kang, Lijffijt, Santos-Rodríguez, De Bie: SICA: subjectively interesting component analysis. Data Min. Knowl. Discov. 32(4): 949-987 (2018)

Kang, Lijffijt, Santos-Rodriguez, De Bie: Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations. KDD 2016: 1615-1624

Kang, Puolamäki, Lijffijt, De Bie: A Tool for Subjective and Interactive Visual Data Exploration. ECML/PKDD (3) 2016: 3-7

Kontonasios, De Bie: Subjectively interesting alternative clusterings. Machine Learning 98(1-2): 31-56 (2015)

Kontonasios, De Bie: An Information-Theoretic Approach to Finding Informative Noisy Tiles in Binary Databases. SDM 2010: 153-164

Kontonasios, Vreeken, De Bie: Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data. ECML/PKDD (2) 2013: 256-271

4

Kontonasios, Vreeken, De Bie: Maximum Entropy Modelling for Assessing Results on Real-Valued Data. ICDM 2011:

350-359

Van Leeuwen, De Bie, Spyropoulou, Mesnage: Subjective interestingness of subgraph patterns. Machine Learning

105(1): 41-75 (2016)

Lijffijt, Kang, Duivesteijn, Puolamäki, Oikarinen, De Bie: Subjectively Interesting Subgroup Discovery on Real-valued

Targets. IEEE ICDE 2018 : to appear

Lijffijt, Spyropoulou, Kang, De Bie: P-N-RMiner: a generic framework for mining interesting structured relational

patterns. I. J. Data Science and Analytics 1(1): 61-76 (2016)

Lijffijt, Spyropoulou, Kang, De Bie: P-N-RMiner: A generic framework for mining interesting structured relational

patterns. DSAA 2015: 1-10

Puolamäki, Kang, Lijffijt, De Bie: Interactive Visual Data Exploration with Subjective Feedback. ECML/PKDD (2) 2016:

214-229

5

Puolamäki, Oikarinen, Kang, Lijffijt, De Bie: Interactive Visual Data Exploration with Subjective Feedback: An

Information-Theoretic Approach. IEEE ICDE 2018 : to appear

Spyropoulou, De Bie, Boley: Interesting pattern mining in multi-relational data. Data Min. Knowl. Discov. 28(3): 808-849

(2014)

Spyropoulou, De Bie, Boley: Mining Interesting Patterns in Multi-relational Data with N-ary Relationships. Discovery

Science 2013: 217-232

6

LINKS TO SOFTWARE

7

R-MINER(S)

Original (fastest for full enumeration):

https://bitbucket.org/BristolDataScience/rminer/

N-RMiner (supports n-ary):

https://bitbucket.org/BristolDataScience/n-rminer/

P-N-RMiner (support structured attributes):

https://bitbucket.org/BristolDataScience/p-n-rminer/

CP-RMiner (top 1 RMiner pattern, iteratively, fast):

https://bitbucket.org/ghentdatascience/cp/

8

CONNECTING TREES

https://bitbucket.org/ghentdatascience/interestingtreesp

ublic/

9

DENSE SUBGRAPHS (COMMUNITIES)

http://patternsthatmatter.org/software.php#ssgminer

10

NETWORK EMBEDDING

https://bitbucket.org/ghentdatascience/cne-public/

11

ATTRIBUTED SUBGRAPHS

SIAS-Miner: http://goo.gl/ZxsvbX

12

SUBGROUP DISCOVERY

https://bitbucket.org/ghentdatascience/sisd-public/

13

DIMENSIONALITY REDUCTION

SICA:

http://users.ugent.be/~bkang/software/sica/sica.zip

SIDE (online tool):

http://users.ugent.be/~bkang/software/side_dev/index.

html

SIDE (MaxEnt R version): http://kaip.iki.fi/sider.html

CLIPPR: https://bitbucket.org/ghentdatascience/clippr/

14

ANYTHING MISSING?

Not all (source) code has been published; please ask if you are interested in something that is missing!

15
