
HUMAN-CENTRIC

DATA EXPLORATION

PART 1/5: MOTIVATION, BACKGROUND & OUTLINE

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

Based on joint work with many others (see references and final slide of this lecture)

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

SUBJECTIVITY

AND VISUALIZATION

PART 1/5: MOTIVATION, BACKGROUND & OUTLINE

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 2

MOTIVATION

3

SUBJECTIVITY = KEY

Three motivating examples:

1. Frequent itemset mining

‒ Individually frequent items = probably frequent together

2. Visualizing high-dimensional data

‒ Outliers = high variance, so why maximize it in PCA?

‒ Interaction = key

3. Graph embedding

‒ High degree nodes = probably embedded centrally

4

ASSOCIATION ANALYSIS / ITEMSET MINING

KDD abstracts dataset (documents × words):

| Subjective interestingness ranking (prior info on row & column sums) | #docs | Ranking by support × size (area) | #docs |
|---|---|---|---|
| svm, support, machin, vector | 25 | data, paper | 389 |
| state, art | 39 | algorithm, propose | 246 |
| unlabelled, labelled, supervised, learn | 10 | data, mine | 312 |
| associ, rule, mine | 36 | base, method | 202 |
| gene, express | 25 | result, show | 196 |
| frequent, itemset | 28 | problem | 373 |
| large, social, network, graph | 15 | data, set | 279 |
| column, row | 13 | approach | 330 |
| algorithm, order, magnitud, faster | 12 | model | 301 |
| paper, propos, algorithm, real, synthetic, data | 27 | present | 296 |


KDD abstracts dataset (documents × words):

| Subjective interestingness ranking (prior info on row & column sums) | #docs | Subjective interestingness ranking (additional prior info on keyword tiles) | #docs |
|---|---|---|---|
| svm, support, machin, vector | 25 | art, state | 39 |
| state, art | 39 | row, column, algorithm | 12 |
| unlabelled, labelled, supervised, learn | 10 | unlabelled, labelled, data | 14 |
| associ, rule, mine | 36 | answer, question | 18 |
| gene, express | 25 | precis, recal | 14 |

CONDITIONAL NETWORK EMBEDDINGS

8

EXPLORING DATA

The search for interesting patterns in data:

• Association analysis: frequency, lift, confidence, leverage, coverage, ...
• Dimensionality reduction: PCA, ICA, projection pursuit, Laplacian Eigenmaps, t-SNE, LLE, ...
• Graph embedding: Node2Vec, Path2Vec, MetaPath2Vec, ...
• Clustering: k-means clustering, hierarchical clustering, mixture of Gaussians, spectral clustering, ...
• Community detection: stochastic block modelling, modularity, k-cores, quasi-cliques, dense subgraphs, ...
• Privacy-preserving data publishing: discernibility, generalization height, average group size, ...
• ...

Zillions of ‘interestingness measures’, a.k.a. objective functions, quality functions, utility functions, cost functions.

THE CHALLENGE

Zillions of interestingness measures = good & bad
‒ Good: more options!
‒ Bad: the trees & the forest…

Challenge:
‒ Formalise true interestingness!
‒ With minimal user interaction
‒ Without requiring user expertise

MOTIVATING EXAMPLE

Community detection:

What makes for an interesting community?

‒ Densely connected?

‒ Large?

‒ Few neighbours outside community?

‒ Unrelated to certain known ‘affiliations’?
‒ …

11

THE FORSIED APPROACH: SUBJECTIVITY!

12

Data → Data mining researcher → Interestingness(pattern)

THE FORSIED APPROACH: SUBJECTIVITY!

13

Data → Data mining researcher → Interestingness(pattern)

Data → Data analyst → Interestingness(pattern, analyst)

Interestingness = subjective

MOTIVATING EXAMPLE

Community detection:

User states expectations / beliefs
‒ Formalized as a ‘background distribution’

Any ‘pattern’ that contrasts with this and is easy to describe = subjectively interesting

14

OUTLINE

15

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

16

17

Feel free to interrupt for questions anytime

HUMAN-CENTRIC

DATA EXPLORATION

PART 2/5: THE FORSIED FRAMEWORK

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

2

GENERIC FRAMEWORK

3

De Bie, KDD 2011

De Bie, DAMI 2011

FORSIED(*) framework

[Diagram: the data x lives in a data space Ω, over which the model of the user's beliefs (the "background distribution" P) is defined; each pattern restricts the data to x ∈ Ω′ ⊂ Ω and updates the background distribution accordingly (Ω ⊃ Ω′ ⊃ Ω′′, with P → P′ → P′′); P evolves as patterns are shown to the user. Data → Patterns → User: subjective!]

(*) Formalizing Subjective Interestingness in Exploratory Data mining

• Data: the adjacency matrix

of a graph under study

• Patterns: the claim that a

specified set of nodes are

densely connected

• Prior beliefs: the degrees of

the nodes, a known block

structure,...

• Interestingness: subjective

information density

• Overlapping communities!

$$\mathrm{Interestingness}(\Omega', P) = \frac{\mathrm{InformationContent}(\Omega', P)}{\mathrm{DescriptionLength}(\Omega')}, \qquad \mathrm{InformationContent}(\Omega', P) = -\log P(\Omega')$$

In short: SI = IC / DL.

THE FINE PRINT

Initial background distribution P?
‒ Maximum entropy distribution subject to the prior belief constraints:

$$\max_P\ E_{X\sim P}\big[-\log P(X)\big] \quad \text{s.t.} \quad E_{X\sim P}\big[f(X)\big] = c_f \ \ (\forall f)$$

Updated background distribution P′ given pattern x ∈ Ω′?
‒ P conditioned on the event x ∈ Ω′:

$$P'(\Omega'') = \frac{P(\Omega'' \cap \Omega')}{P(\Omega')} \quad\Rightarrow\quad -\log P'(\boldsymbol{x}) = -\log P(\boldsymbol{x}) + \log P(\Omega')$$

(information content in the data after the pattern = information content in the data before the pattern, minus the information content of the pattern under P)

Description length?
‒ Smaller if the pattern is a better explanation
‒ Essentially problem-dependent

WHY MAXIMUM ENTROPY / CONDITIONING?

Most unbiased estimate
‒ Informal... no bias other than the constraints

Assume a cautious / pessimistic user
‒ A user who expects to be very surprised

Leads to the most robust estimate of the true subjective information content
‒ Information content estimated with the maxent P will never differ much from the information content w.r.t. the true prior beliefs of the user

6

A FIRST INSTANTIATION: COMMUNITY DETECTION

7

van Leeuwen, De Bie, Spyropoulou, Mesnage, MLj, 2016

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

MaxEnt distribution (over the adjacency matrix A, whose entries a_ij are edge indicator variables):

$$P(\mathbf{A}) = \prod_{i>j} P_{i,j}(a_{ij}), \qquad P_{i,j}(a_{ij}) = \frac{\exp\big(a_{ij}\cdot(\lambda_i + \lambda_j)\big)}{1 + \exp(\lambda_i + \lambda_j)}$$

[Figure: adjacency matrix; cells where $P_{i,j}(a_{ij})$ is small vs. large]

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs, i.e.

$$\sum_{i,j\,\in\,\text{subgraph}} a_{ij} \;\ge\; k$$

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs

Interestingness:

$$\frac{-\log P(\text{pattern})}{\mathrm{DescriptionLength}(\text{pattern})}$$

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs

Interestingness: density vs. size (under prior 2., preferably low-degree nodes)

[Figure: most interesting community given prior 1. vs. given prior 2.]

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs

Interestingness: density vs. size (under prior 2., preferably low-degree nodes)

Hill-climbing for search; update P after each pattern

[Figure: most interesting community given prior 1. vs. given prior 2.]

COMMUNITY DETECTION IN NETWORKS

Data: Graph

Prior beliefs:
1. Overall density
2. or: Vertex degrees

Pattern: dense subgraphs

Interestingness: density vs. size (under prior 2., preferably low-degree nodes)

Hill-climbing for search; update P after each pattern

[Figure: communities found in a music social network, labelled Rock, Trance, Indie, Bhangra, Gospel, Country, Hip hop / grime, Afro pop, UK garage, Hip hop]

TAKE-AWAYS

1. What is the data?

2. Determine suitable pattern syntax

3. What are the prior beliefs? (= what is irrelevant to user?)

Compute background distribution 𝑷 using maximum entropy

4. Formulate subjective interestingness:

5. Design an algorithm to optimize it

6. Find out how to condition background distribution on a pattern

14

$$\mathrm{Interestingness}(\Omega', P) = \frac{\mathrm{InformationContent}(\Omega', P)}{\mathrm{DescriptionLength}(\Omega')}, \qquad \mathrm{InformationContent}(\Omega', P) = -\log P(\Omega')$$

THE BACKGROUND DISTRIBUTION: MAXENT

15

MAXENT MODEL S.T. DEGREE BELIEFS

$$\max_P\ \sum_{\mathbf{A}} -P(\mathbf{A})\log P(\mathbf{A}) \quad \text{s.t.} \quad \sum_{\mathbf{A}} P(\mathbf{A})\sum_{j=1:n} a_{ij} = d_i \ \ (\forall i = 1{:}n), \qquad \sum_{\mathbf{A}} P(\mathbf{A}) = 1$$

Convex! ($\mathbf{A}$ = adjacency matrix with entry $a_{ij}$ in row $i$ and column $j$.)

Lagrangian, with Lagrange multipliers $\lambda_i$ (one per expected-degree constraint) and $\mu$ (normalization); the first term is the entropy:

$$L(P, \boldsymbol{\lambda}, \mu) = \sum_{\mathbf{A}} -P(\mathbf{A})\log P(\mathbf{A}) + \sum_{i=1:n} \lambda_i\Big(\sum_{\mathbf{A}} P(\mathbf{A})\sum_{j=1:n} a_{ij} - d_i\Big) + \mu\Big(\sum_{\mathbf{A}} P(\mathbf{A}) - 1\Big)$$

MAXENT MODEL S.T. DEGREE BELIEFS

Optimality condition:

$$\frac{\partial}{\partial P(\mathbf{A})} L(P, \boldsymbol{\lambda}, \mu) = -\log P(\mathbf{A}) - 1 + \sum_{i,j=1:n} \lambda_i a_{ij} + \mu = 0$$

So:

$$P(\mathbf{A}) = \exp(\mu - 1)\cdot\exp\Big(\sum_{i,j=1:n} \lambda_i a_{ij}\Big) = \frac{1}{Z(\boldsymbol{\lambda})}\,\exp\Big(\sum_{i>j} (\lambda_i + \lambda_j)\, a_{ij}\Big) = \prod_{i>j} \frac{\exp\big((\lambda_i + \lambda_j)\, a_{ij}\big)}{1 + \exp(\lambda_i + \lambda_j)} = \prod_{i>j} P_{i,j}(a_{ij})$$

A product of independent Bernoulli distributions! Thanks to the fact that each prior belief constraint is on a (weighted) sum of the $a_{ij}$.

MAXENT MODEL S.T. DEGREE BELIEFS

To find the optimal values of the Lagrange multipliers, solve the dual:

$$\min_{\boldsymbol{\lambda}}\ L(P, \boldsymbol{\lambda}), \quad \text{where } P \text{ is given as } \ P(\mathbf{A}) = \prod_{i>j} \frac{\exp\big((\lambda_i + \lambda_j)\, a_{ij}\big)}{1 + \exp(\lambda_i + \lambda_j)}$$

After some calculations:

$$\min_{\boldsymbol{\lambda}}\ \sum_{i>j} \log\big(1 + \exp(\lambda_i + \lambda_j)\big) - \sum_{i=1:n} \lambda_i d_i$$

22

MAXENT MODEL S.T. DEGREE BELIEFS

$$\min_{\boldsymbol{\lambda}}\ \sum_{i>j} \log\big(1 + \exp(\lambda_i + \lambda_j)\big) - \sum_{i=1:n} \lambda_i d_i$$

Can be solved using gradient descent:

$$\frac{\partial}{\partial \lambda_k}\Big[\sum_{i>j} \log\big(1 + \exp(\lambda_i + \lambda_j)\big) - \sum_{i=1:n} \lambda_i d_i\Big] = \underbrace{\sum_{i=1:n} \frac{\exp(\lambda_i + \lambda_k)}{1 + \exp(\lambda_i + \lambda_k)}}_{\text{expected degree of node } k} - \underbrace{d_k}_{\text{required expected degree of node } k}$$

Lots of computational speed-ups possible...

23
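As a sanity check of the dual above, here is a minimal NumPy sketch (the function name and fixed step size are our own, illustrative choices) that fits the Lagrange multipliers by plain gradient descent and returns the resulting edge-probability matrix:

```python
import numpy as np

def fit_degree_maxent(degrees, lr=0.05, n_iter=5000, tol=1e-8):
    """Fit lambda for the maxent model with expected-degree constraints
    by gradient descent on the dual
        min_l  sum_{i>j} log(1 + exp(l_i + l_j)) - sum_i l_i d_i.
    Returns lambda and the matrix of edge probabilities P_ij(1)."""
    d = np.asarray(degrees, dtype=float)
    n = len(d)
    lam = np.zeros(n)
    for _ in range(n_iter):
        s = lam[:, None] + lam[None, :]        # l_i + l_j for every pair
        p = 1.0 / (1.0 + np.exp(-s))           # expected value of a_ij
        np.fill_diagonal(p, 0.0)               # no self-loops
        grad = p.sum(axis=1) - d               # expected minus required degree
        lam -= lr * grad
        if np.abs(grad).max() < tol:           # all degree constraints met
            break
    return lam, p

lam, P = fit_degree_maxent([3, 2, 2, 1])       # toy 4-node degree sequence
```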

TAKE-AWAYS

Constraints on the expected value of weighted sums,

$$E_{\mathbf{A}\sim P}\Big[\sum_{i,j\in I} f_{ij}\, a_{ij}\Big] = c,$$

where the $f_{ij}$ and $c$ are constants and $I$ is a set of indices,

lead to convenient product distributions

Other examples for graphs:
‒ Overall density (trivial)
‒ Densities of particular blocks (e.g. a block of nodes with the same affiliation)
‒ Assortativity (approximately)
‒ ...

24

THE INTERESTINGNESS

25

INFORMATION CONTENT

Information content:

$$\mathrm{InformationContent}(\text{pattern}, P) = -\log P(\text{pattern})$$

Pattern: "the number of edges between a given set of nodes W ⊆ V is larger than or equal to a specified k_W". A bit tricky...

Cliques as a special case: "the set of nodes W ⊆ V forms a clique". Then:

$$P(\text{pattern}) = \prod_{i>j \in W} P_{i,j}(1)$$

So:

$$\mathrm{InformationContent}(\text{pattern}, P) = -\log P(\text{pattern}) = -\sum_{i>j \in W} \log P_{i,j}(1)$$

Larger if |W| is larger and if the $P_{i,j}(1)$ for $i, j \in W$ are smaller.

26

INFORMATION CONTENT

Pattern (general case): "the number of edges between a given set of nodes W ⊆ V is larger than or equal to a specified k_W"

Probability of at least $k_W$ successes in $n_W = \binom{|W|}{2}$ Bernoulli trials? Approximated by:

$$P(\text{pattern}) \approx \exp\Big(-n_W\, \mathrm{KL}\Big(\frac{k_W}{n_W}\,\Big\|\, p_W\Big)\Big)$$

where $p_W$ is the average probability $P_{i,j}(1)$ over the edges between $i, j \in W$. And thus:

$$\mathrm{InformationContent}(\text{pattern}, P) = -\log P(\text{pattern}) \approx n_W\, \mathrm{KL}\Big(\frac{k_W}{n_W}\,\Big\|\, p_W\Big)$$

Larger if |W| (and thus $n_W$) is larger, $p_W$ is smaller, and $k_W$ is larger.

27
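A small, self-contained sketch of this computation (helper names are our own; `P` is the matrix of edge probabilities $P_{i,j}(1)$ under the background distribution, e.g. as returned by the earlier fitting sketch):

```python
import numpy as np

def kl_bernoulli(q, p):
    """KL(q || p) between two Bernoulli distributions."""
    eps = 1e-12
    q = np.clip(q, eps, 1 - eps)
    p = np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def information_content(W, k_W, P):
    """IC of 'at least k_W edges among node set W', using the
    approximation IC ~ n_W * KL(k_W / n_W || p_W) from this slide."""
    W = list(W)
    pairs = [(i, j) for a, i in enumerate(W) for j in W[a + 1:]]
    n_W = len(pairs)                              # |W| choose 2 trials
    p_W = np.mean([P[i, j] for i, j in pairs])    # average edge probability
    return n_W * kl_bernoulli(k_W / n_W, p_W)
```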

DESCRIPTION LENGTH

For cliques: describe the set W. A constant (to describe |W|) plus a term linear in |W| (to describe its elements):

$$\mathrm{DescriptionLength}(\text{pattern}) = \alpha\,|W| + \beta$$

For dense subgraphs: the constant $\beta$ also describes $k_W$.

28

INTERESTINGNESS

Putting things together:

$$\mathrm{Interestingness}(\text{pattern}, P) = \frac{-\sum_{i>j\in W} \log P_{i,j}(1)}{\alpha\,|W| + \beta}$$

A bit more complex for general dense subgraphs:

$$\mathrm{Interestingness}(\text{pattern}, P) \approx \frac{n_W\, \mathrm{KL}(k_W/n_W \,\|\, p_W)}{\alpha\,|W| + \beta}$$

Hard to optimize!
‒ Exact search for small graphs
‒ Effective hill climber for large graphs

29
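Putting IC and DL together in code, a minimal sketch (reusing `information_content` from the previous snippet; `alpha` and `beta` are the user-chosen DL constants):

```python
def subjective_interestingness(W, k_W, P, alpha=1.0, beta=1.0):
    """SI = IC / DL for a dense-subgraph pattern, with
    DL(pattern) = alpha * |W| + beta."""
    return information_content(W, k_W, P) / (alpha * len(W) + beta)
```

The search then looks for the node set W (and threshold k_W) that maximizes this ratio, e.g. with the hill climber mentioned above.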

TAKE-AWAYS

No compromises w.r.t. interestingness

Often leads to hard search problems

Question: is this intrinsic to genuine subjective

interestingness?

30

UPDATING THE BACKGROUND DISTRIBUTION

31

UPDATING THE BACKGROUND DISTRIBUTION

Given a pattern, update the background distribution by conditioning on the pattern.

Easy to do for cliques W:
‒ Set $P'_{i,j}(a_{ij} = 1) = 1$ for $i, j \in W$

Fast to approximate for (non-clique) dense subgraphs W:
‒ Set $P'_{i,j}(a_{ij}) \propto P_{i,j}(a_{ij}) \cdot \exp(\lambda_W a_{ij})$ for $i, j \in W$, such that the expected density of W equals $k_W$

Remains a product of Bernoullis.

32
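A minimal sketch of the clique case (function name is ours; `P` is the edge-probability matrix as before):

```python
def condition_on_clique(P, W):
    """Condition the background distribution on a clique pattern:
    every edge inside W is now known to be present, so its
    probability becomes 1; all other edges are untouched."""
    P = P.copy()
    for i in W:
        for j in W:
            if i != j:
                P[i, j] = 1.0
    return P
```

For the non-clique case one would instead scale each $P_{i,j}$ inside W by $\exp(\lambda_W a_{ij})$ and renormalize, tuning $\lambda_W$ so that the expected density of W matches $k_W$.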

TAKE-AWAYS

Updating can be trivial

Otherwise, often easy to do approximately

33

HUMAN-CENTRIC

DATA EXPLORATION

PART 3/5: BINARY MATRICES, GRAPHS, RELATIONAL DATA

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

2

OUTLINE OF THIS PART

(Community detection)

Itemsets

Relational patterns

Connecting trees

Network embedding

3

ITEMSETS

4

ITEMSETS

Data: binary matrix:

5

𝑿 ∈ {0,1}𝑚×𝑛

Beer Diapers Lipstick Carrier SUM

Alice 1 1 1 3

Bob 1 1 1 3

Charlie 1 1 2

Denise 1 1 2

Eve 1 1 2

Frankie 1 1 2

SUM 4 3 2 5

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

6

𝑿 ∈ {0,1}𝑚×𝑛


De Bie, DMKD 2011

ITEMSETS

7

𝑿 ∈ {0,1}𝑚×𝑛

Beer Diapers Lipstick Carrier SUM

Alice 1 1 1 3

Bob 1 1 1 3

Charlie 1 1 2

Denise 1 1 2

Eve 1 1 2

Frankie 1 1 2

SUM 4 3 2 5

Data: binary matrix:

Prior beliefs: row and column sums

Background distribution P:
- Nonnegative and properly normalized
- Has correct marginals

Many solutions!?

‘Unbiased’ distribution: Maximum Entropy distribution

De Bie, DMKD 2011


ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Background distribution 𝑃

8

𝑿 ∈ {0,1}𝑚×𝑛


$$P(\mathbf{X}) = \prod_{i,j} P_{i,j}(x_{ij}), \qquad P_{i,j}(x_{ij}) = \frac{\exp\big(x_{ij}\cdot(\mu_i + \lambda_j)\big)}{1 + \exp(\mu_i + \lambda_j)}$$

De Bie, DMKD 2011

A convex optimization problem!

‘Unbiased’ distribution: Maximum Entropy distribution
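A minimal NumPy sketch of the fit (our own illustrative function, with a fixed step size): gradient descent on the dual drives the row multipliers $\mu_i$ and column multipliers $\lambda_j$ until the expected margins match the observed ones:

```python
import numpy as np

def fit_margin_maxent(X, lr=0.1, n_iter=5000):
    """Maxent background model for a binary matrix with row- and
    column-sum prior beliefs: P_ij(1) = sigmoid(mu_i + lambda_j)."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    mu, lam = np.zeros(m), np.zeros(n)
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(mu[:, None] + lam[None, :])))
        mu  -= lr * (P.sum(axis=1) - X.sum(axis=1))  # match row sums
        lam -= lr * (P.sum(axis=0) - X.sum(axis=0))  # match column sums
    return P
```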

ITEMSETS

Data: binary matrix:

Prior beliefs: uniform at observed density

9

𝑿 ∈ {0,1}𝑚×𝑛De Bie, DMKD 2011


$$P_{i,j} = \frac{14}{24}\ \ \forall\, i,j, \qquad P_{i,j}(x_{ij}) = \frac{\exp(x_{ij}\cdot\lambda)}{1 + \exp(\lambda)}, \qquad \lambda = \log\frac{14}{10} = 0.3365$$

(the example matrix has 14 ones in 24 cells, so $e^{\lambda}/(1+e^{\lambda}) = 14/24$ gives $\lambda = \log 1.4$)

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles (set of rows and columns)

10

𝑿 ∈ {0,1}𝑚×𝑛


Large support may not be interesting?

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

11

𝑿 ∈ {0,1}𝑚×𝑛


Large surface may not be interesting?

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

12

𝑿 ∈ {0,1}𝑚×𝑛


Less expected: smaller row and column margins. Bingo!

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

As stated SI = IC / DL

13

𝑿 ∈ {0,1}𝑚×𝑛


De Bie, DMKD 2011

$$\mathrm{IC}(Z) = -\log \Pr(Z) = \sum_{i,j \in Z} -\log p_{i,j}, \qquad \mathrm{DL}(Z) = a\,(\#\text{rows} + \#\text{columns}) + b$$
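A minimal sketch computing this SI score for a tile (our own helper; `P` is the cell-probability matrix, e.g. from `fit_margin_maxent` above):

```python
import math

def tile_interestingness(rows, cols, P, a=1.0, b=1.0):
    """SI of a tile of ones Z = rows x cols:
    IC(Z) = sum of -log P_ij(1) over the tile's cells,
    DL(Z) = a * (#rows + #columns) + b."""
    ic = sum(-math.log(P[i][j]) for i in rows for j in cols)
    dl = a * (len(rows) + len(cols)) + b
    return ic / dl
```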


ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

14

$\mathbf{X} \in \{0,1\}^{m\times n}$

Iterate

De Bie, DMKD 2011

We specified there are ones here

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

15

𝑿 ∈ {0,1}𝑚×𝑛


Iterate

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Patterns: tiles

16

𝑿 ∈ {0,1}𝑚×𝑛


Iterate

De Bie, DMKD 2011

ITEMSETS

KDD abstracts dataset (documents × words):

| Subjective interestingness ranking (prior info on row & column sums) | #docs | Ranking by support × size (area) | #docs |
|---|---|---|---|
| svm, support, machin, vector | 25 | data, paper | 389 |
| state, art | 39 | algorithm, propose | 246 |
| unlabelled, labelled, supervised, learn | 10 | data, mine | 312 |
| associ, rule, mine | 36 | base, method | 202 |
| gene, express | 25 | result, show | 196 |
| frequent, itemset | 28 | problem | 373 |
| large, social, network, graph | 15 | data, set | 279 |
| column, row | 13 | approach | 330 |
| algorithm, order, magnitud, faster | 12 | model | 301 |
| paper, propos, algorithm, real, synthetic, data | 27 | present | 296 |

De Bie, DMKD 2011


ITEMSETS

KDD abstracts dataset (documents × words):

| Subjective interestingness ranking (prior info on row & column sums) | #docs | Subjective interestingness ranking (additional prior info on keyword tiles) | #docs |
|---|---|---|---|
| svm, support, machin, vector | 25 | art, state | 39 |
| state, art | 39 | row, column, algorithm | 12 |
| unlabelled, labelled, supervised, learn | 10 | unlabelled, labelled, data | 14 |
| associ, rule, mine | 36 | answer, question | 18 |
| gene, express | 25 | precis, recal | 14 |

De Bie, DMKD 2011

ITEMSETS

Data: binary matrix:

Prior beliefs: row and column sums

Background distribution P

Patterns: tiles of ones and zeros

19

𝑿 ∈ {0,1}𝑚×𝑛


Extension: noisy tiles

Kontonasios & De Bie, SDM 2010

IC is straightforward

DL depends on the skew (entropy) of the distribution of ones and zeros within the tile

ITEMSETS

Algorithmic approach: ?

Not studied extensively, but a special case of relational patterns (see the next instantiation)

Interesting result: if we can mine the best pattern at every iteration, then this greedy procedure approximates the total IC of the best set of tiles at that (cumulative) DL to within a factor of 1 − 1/e (≈ 0.63).

20

Kontonasios & De Bie, SDM 2010

RELATIONAL PATTERNS

21

RELATIONAL PATTERN MINING

Data: relational database

Pattern: connected complete subgraphs

Prior beliefs: degree of each node in each

relationship

Customers Items Attributes

Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)

Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)

Guns, Aknin, Lijffijt, De Bie (ICDM 2016)

RELATIONAL PATTERN MINING

Data: relational database

Pattern: connected complete subgraphs

Prior beliefs: degree of each node in each

relationship

Customers Items Attributes

Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)

Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)

Guns, Aknin, Lijffijt, De Bie (ICDM 2016)

Prior factorizes over relationships

equivalent to itemsets

Users Films

Genres

Actors

RELATIONAL PATTERN MINING

RMiner

RMINER

Algorithmic approach: enumerate + rank

Based on fixpoint-enumeration (Boley et al. 2010)

25

[Figure: example relational dataset with entities A1–A3, B1–B3, C1–C2]

Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)


RMINER

Algorithmic approach: enumerate + rank

35

[Figure: enumeration over entities A1–A3, B1–B3, C1–C2]

1) Branch on any entity

2) Compute closure

- Add entities that are in all supersets

- If invalid added, backtrack

- Repeat until no valid candidates

- If invalid candidates

- Stop

- Else

- Output pattern

- Backtrack and declare entity invalid

Etc. etc.

Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
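The closure step is the heart of this enumeration. A minimal sketch on the itemset special case (our own illustrative helper; RMiner generalizes this fixpoint to multi-relational data):

```python
def closure(entity_set, transactions):
    """Closure of a set of entities: everything contained in every
    transaction that supports the set. RMiner branches on an entity,
    computes this closure, and backtracks whenever an entity that was
    declared invalid earlier enters the closure."""
    covering = [t for t in transactions if entity_set <= t]
    if not covering:
        return None                      # the set occurs nowhere
    return frozenset.intersection(*covering)

transactions = [frozenset("abc"), frozenset("abd"), frozenset("ab")]
print(closure(frozenset("a"), transactions))   # frozenset({'a', 'b'})
```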

RELATIONAL PATTERN MINING

[Figure: N-RMiner handles n-ary relations; P-N-RMiner additionally handles structured attributes such as numeric values, circadian values, and taxonomy elements]

P-N-RMINER: FISHER DATA

37

Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)

CP-RMINER

Algorithmic approach: branch & bound in CP (top 1)

38

Guns, Aknin, Lijffijt, De Bie (ICDM 2016)

CONNECTING TREES

39

CONNECTING SUBTREES

Data: graph $G = (V, E)$, with $V$ known and $E \subseteq V \times V$ unknown

40

[Figure: example graph with nodes A–I]

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)

CONNECTING SUBTREES

Data: graph $G = (V, E)$, with $V$ known and $E \subseteq V \times V$ unknown

Prior beliefs: degree of vertices, (batch) time order

41

[Figure: example graph with nodes A–I]

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)

CONNECTING SUBTREES

Data: graph $G = (V, E)$, with $V$ known and $E \subseteq V \times V$ unknown

Prior beliefs: degree of vertices, (batch) time order

Patterns: subtree connecting query vertices Q

42

[Figure: example graph with nodes A–I; query vertices highlighted]

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)

CONNECTING SUBTREES

Data: graph $G = (V, E)$, with $V$ known and $E \subseteq V \times V$ unknown

Prior beliefs: degree of vertices, (batch) time order

Patterns: subtree connecting query vertices Q

Which is more interesting?

43

[Figure: three candidate connecting subtrees; one is unsurprising, one offers no information compression, and one is most interesting, since E and C have small in-degree]

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)

TIME DIFFERENCE PRIOR

Consider citation network

Papers arrive in batches

(per year)

Earlier papers cannot /

rarely cite newer papers

44

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)

TIME DIFFERENCE PRIOR

Consider citation network

Papers arrive in batches

(per year)

Earlier papers cannot /

rarely cite newer papers

Limited increase in

computational cost to fit

background distribution

45

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)

EXAMPLE ON CITATION DATA

46

3 recent best papers from ACM SIGKDD

Uniform prior:

Time and degree prior:

Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)

CONNECTING SUBTREES

Algorithmic approach: greedily construct a tree with

maximum depth k

Not so straightforward!

We investigated various heuristics

47

CONNECTING SUBTREES

Empirically: best strategy depends on size of query set Q

48

CONNECTING SUBTREES

49

Uniform Degree prior

NIPS/PODS authors

Repeated from Akoglu et al. (SDM 2013)

NETWORK EMBEDDING

50

CONDITIONAL NETWORK EMBEDDINGS

51

Data: a graph $G$ with adjacency matrix $\mathbf{A}$

Pattern: a metric embedding $\mathbf{X}$
‒ Probabilistic info about the graph
‒ $P(\|\mathbf{x}_i - \mathbf{x}_j\| \mid a_{ij})$ = half-normal

Prior beliefs: $P_{i,j}(a_{ij})$
‒ overall density
‒ degrees
‒ block structure
‒ assortativity
‒ ...

Find the maximum-likelihood embedding:

$$\max_{\mathbf{X}}\ P(G \mid \mathbf{X})$$

Kang, Lijffijt, De Bie

(ICLR 2019)
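A minimal sketch of the probabilistic ingredient (function name and spread parameters are our own, illustrative choices): connected pairs get a half-normal distance distribution with a smaller spread than unconnected pairs, and Bayes' rule combines this with the prior $P_{i,j}(a_{ij})$:

```python
import numpy as np
from scipy.stats import halfnorm

def edge_posterior(xi, xj, prior_ij, s1=1.0, s0=2.0):
    """P(a_ij = 1 | embedding distance) via Bayes' rule, with
    half-normal distance likelihoods (spread s1 < s0, so connected
    pairs are expected to lie closer together)."""
    d = np.linalg.norm(xi - xj)
    num = halfnorm.pdf(d, scale=s1) * prior_ij          # a_ij = 1
    den = num + halfnorm.pdf(d, scale=s0) * (1 - prior_ij)
    return num / den
```

Maximizing $P(G \mid \mathbf{X})$ then pulls connected pairs together and pushes unconnected pairs apart, relative to what the prior already explains.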

EXAMPLE ON STUDENTDB

52

53

CONDITIONAL NETWORK EMBEDDINGS

54

CONDITIONAL NETWORK EMBEDDINGS

Algorithmic approach: gradient descent with

estimated gradient (positive and negative sampling)

55

SUMMARY PART 3

56

SUMMARY

An informative prior can be useful in many settings

For binary data (all previous examples), fitting and updating the background model is computationally easy

Mining SI patterns is challenging; (for now) tailor-made algorithms are necessary

57

HUMAN-CENTRIC

DATA EXPLORATION

PART 4/5: NUMERIC AND MIXED DATA

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

2

OUTLINE OF THIS PART

Attributed subgraphs

Subgroup discovery in real-valued (target) data

Dimensionality reduction

Time series

3

ATTRIBUTED SUBGRAPHS

4

COHESIVE SUBGRAPHS

Data: attributed graph

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

[Figure: example attributed graph with vertices A–I]

| Vertex | Shops | Events | Pubs |
|--------|-------|--------|------|
| A | 1 | 5 | 3 |
| B | 5 | 0 | 2 |
| C | 2 | 3 | 10 |
| D | 1 | 6 | 4 |
| E | 4 | 0 | 2 |
| F | 3 | 1 | 5 |
| G | 6 | 2 | 9 |
| H | 2 | 1 | 3 |
| I | 0 | 1 | 3 |

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics


Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

[Figure: highlighted subgraph {A, C, D} in the example graph]

Pattern: these locations have many events

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

[Figure: highlighted subgraph {F, G, H} in the example graph]

Pattern: these locations have many pubs & shops

A pattern is easy to interpret if it is local. How to quantify this?

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

[Figure: highlighted vertices around G]

Pattern: vertices around G have many pubs & shops. Easy to describe!

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

[Figure: highlighted vertices around C]

Pattern: vertices around C have many events. Easy to describe!

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

‒ Like subgroup discovery

‒ Describe vertex set with rule

‒ Intersection of neighbourhoods

‒ Minus exceptions

‒ Attributes below / above threshold

‒ As compared to expectation

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Attribute values:

| Vertex | Shops | Events | Pubs |
|--------|-------|--------|------|
| A | 1 | 5 | 3 |
| B | 5 | 0 | 2 |
| C | 2 | 3 | 10 |
| D | 1 | 6 | 4 |
| E | 4 | 0 | 2 |
| F | 3 | 1 | 5 |
| G | 6 | 2 | 9 |
| H | 2 | 1 | 3 |
| I | 0 | 1 | 3 |

Interestingness (values are for illustration only):

| Vertex | Shops | Events | Pubs |
|--------|-------|--------|------|
| A | 0.5 | … | … |
| B | 0.04 | … | … |
| C | 0.2 | … | … |
| D | 0.8 | … | … |
| E | 0.06 | … | … |
| F | 0.13 | … | … |
| G | 0.07 | … | … |
| H | 0.1 | … | … |
| I | 1.0 | … | … |

Background distribution:
‒ Geometric per cell
‒ Using row/column margins

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]
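A minimal sketch of the per-cell surprise under such a geometric background (our own helper; the actual model ties the success probability to the row/column margins):

```python
def cell_surprise(k, p):
    """Probability of a count of at least k under a geometric background
    on {0, 1, 2, ...} with success probability p: P(X >= k) = (1 - p)**k.
    The smaller this probability, the more surprising the cell."""
    return (1.0 - p) ** k

print(cell_surprise(6, 0.4))   # e.g. a count of 6 against a mean of 1.5
```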


COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

14

𝑃1: +food

𝑃2: +professional, +nightlife,

+outdoors, +college

𝑃3: +nightlife, +food, -college

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Data: attributed graph

Prior beliefs: global attribute statistics

Patterns: cohesive subgraphs with exceptional attributes (CSEA)

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)

[Presented and can be found online at MLG @ KDD 2018 workshop]

COHESIVE SUBGRAPHS

Algorithmic approach: we tried

Option 1: enumerate with P-N-RMiner, then rank

Option 2: dedicated branch-and-bound algorithm

16

SUBGROUP DISCOVERY (EXCEPTIONAL MODEL MINING)

17

LOCATION & SPREAD PATTERNS

Data:

‒ Meta-data: any type

‒ Target data: real valued matrix

Prior beliefs: mean and variance statistics

‒ Typically overall, but can be for subsets

Patterns: description with

‒ mean vector, or

‒ projection and magnitude of variance

18

𝑿 ∈ ℝ𝑚×𝑛

[Figure (a): scatter of Attribute 1 vs. Attribute 2, with groups of objects having property x, property y, and properties z1 and z2]

Lijffijt, Kang, Duivesteijn, Puolamäki,

Oikarinen, De Bie (ICDE 2018)

SINGLE TARGET DIMENSION: CRIME IN US

UCI Crime data:

violent crime rate

(per 1k pop)

Description: areas with high incidence of unmarried mothers (coded

by CBS as percentage illegitimate)

Target: high average crime rate

19

ECOLOGY: PRESENT SPECIES AS TARGETS

Description of pattern (a): "mean temperature in March ≤ −1.68 °C"
Description of pattern (b): "average monthly rainfall in August ≤ 47.62 mm"
Description of pattern (c): "average monthly rainfall in October ≤ 45.25 mm and mean temperature of wettest quarter ≥ 16.32 °C"

Mean of target attributes for (a): −wood mouse; +mountain hare, moose, red-backed vole, wood lemming

20

GERMAN POLITICS VS. DEMOGRAPHICS

Description of pattern (a): "few children". Target: LEFT is popular.

Description of pattern (b): "large mid-aged pop". Target: GREEN is relatively popular.

Description of pattern (c): "many children". Target: LEFT is unpopular.

21

GERMAN POLITICS VS. DEMOGRAPHICS

Moreover, the target of pattern (a): "SPD and CDU negatively correlated". Due to the popularity of LEFT, SPD and CDU are in tougher competition here.

22

SUBGROUP DISCOVERY

For reference only

Prior:

IC/DL:

23

SUBGROUP DISCOVERY

Algorithmic approach:

Find descriptions using beam search

‒ Fairly standard in SD; Implementation from

Cortana (Meeng & Knobbe, BeneLearn 2011)

‒ Using the SI objective directly

For projections

‒ Manifold learning problem (ManOpt toolbox)

Interesting results to speed-up update to bg. distrib.

24
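Beam search itself is generic; a minimal sketch (our own, not the Cortana implementation) of the idea of keeping only the `width` best descriptions per refinement level:

```python
def beam_search(start, refine, score, width=10, depth=3):
    """Greedy beam search: keep the `width` best descriptions at each
    level, refine them, and return the best description seen overall."""
    beam = sorted(start, key=score, reverse=True)[:width]
    best = max(beam, key=score)
    for _ in range(depth):
        beam = sorted({r for d in beam for r in refine(d)},
                      key=score, reverse=True)[:width]
        if not beam:
            break
        best = max([best] + beam, key=score)
    return best
```

Here `score` would be the SI objective used directly, and `refine` would add one condition to a subgroup description.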

DIMENSIONALITY REDUCTION

25

PROJECTION PATTERNS

Data: Real valued matrix

Prior beliefs: global mean and (co-)variance structure

Patterns: projections

26

𝑿 ∈ ℝ𝑚×𝑛

DATA PROJECTIONS

De Bie, Lijffijt, Santos-Rodriguez, Kang

(ESANN 2016)

Kang, Lijffijt, Santos-Rodriguez, De Bie

(KDD 2016, DMKD 2018)

Puolamäki, Kang, Lijffijt, De Bie (ECMLPKDD 2016)

Kang, Puolamäki, Lijffijt, De Bie (ECMLPKDD 2016)

Puolamäki, Oikarinen, Kang, Lijffijt, De Bie (ICDE 2018)

Finding informative

projections

Accounting for

user feedback

SI COMPONENT ANALYSIS (SICA)

Problem is parametrized by a resolution parameter

Go from density to probability for projections

28

De Bie, Lijffijt, Santos-Rodriguez, Kang (ESANN 2016)

Kang, Lijffijt, Santos-Rodriguez, De Bie (KDD 2016, DMKD 2018)

SICA

Effect of the prior beliefs:
‒ Expectation on mean/variance ⇒ the PCA objective
‒ Expectation on the magnitude of variance ⇒ a more robust variant of PCA (which we call t-PCA)
‒ Graph of point similarities ⇒ next slides

29

SICA GRAPH PRIOR

30

SICA GRAPH PRIOR

German voting percentages per district, account for

east-west divide

31

SI DATA EXPLORER (SIDE)

32

https://users.ugent.be/~bkang/software/side_dev/entry.html

SI DATA EXPLORER (SIDE)

33

SI DATA EXPLORER (SIDE)

34

SICA/SIDE

Algorithmic approach:

SICA (uniform/graph) are eigenvalue problems

SICA t-PCA and SIDE use manifold learning toolbox

35

C-T-SNE

36

Non-linear dimensionality reduction, based on t-SNE

Prior beliefs: known clusters

Result: the c-t-SNE visualization may make new clusters salient

Bo Kang, Dario Garcia Garcia, Jefrey Lijffijt, Raul Santos Rodriguez, Tijl De Bie (arXiv, 2019)

TIME SERIES

37

TIME SERIES MOTIFS

Data: time series

Prior beliefs: mean, var, co-var (first order difference)

38

$\mathbf{X} \in \mathbb{R}^{1 \times n}$

Deng, Lijffijt, Kang, De Bie (Entropy, 2019)

TIME SERIES MOTIFS

Data: time series

Prior beliefs: mean, var, co-var (first order difference)

Patterns: motif template

39

$\mathbf{X} \in \mathbb{R}^{1 \times n}$

Deng, Lijffijt, Kang, De Bie (Entropy, 2019)

TIME SERIES MOTIFS

IC quantifies the reduction in uncertainty

Likelihood of data increases by inserting template in

background distribution for matched locations

Due to co-variance, expectations change globally

Updating computationally (somewhat) costly

40

TIME SERIES MOTIFS

Algorithmic approach:

Construct template from 3 or 4 subsequences using

constraint programming and relaxed objective

Greedily add subsequences using exact objective

Prune dissimilar subsequences from search after

branching and selection of initial set

41

TIME SERIES MOTIFS

Data: time series

Prior beliefs: mean, var, co-var (first order difference)

Patterns: motif template

42

$\mathbf{X} \in \mathbb{R}^{1 \times n}$

Deng, Lijffijt, Kang, De Bie (Entropy, 2019)

AND MORE

43

AND MORE

Past:
‒ Data clustering
‒ Biclustering
‒ Exceptional model mining / subgroup discovery
‒ Time series segments

Ongoing / future:
‒ Backbone of a network
‒ Insightful summaries of an attributed network
‒ Network embeddings
‒ ...

44

with all past and current members of the FORSIED team and Jilles

Vreeken, Antonis Matakos, Dario Garcia-Garcia, Siegfried Nijssen,...

HUMAN-CENTRIC

DATA EXPLORATION

PART 5/5: ADVANCED TOPICS, OUTLOOK & CONCLUSIONS

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

OUTLINE

Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

2

RELATED WORK

3

RELATED WORK

MDL/compression

MAP estimation

Hypothesis testing / randomization techniques

‘Subjective interestingness’ research by Tuzhilin, Silberschatz, Padmanabhan (90's)

4

LIMITATIONS

Not all new information is interesting!

The FORSIED framework does not directly address this

Importance of feedback in combination with FORSIED

(see work on dimensionality reduction)

Description length

How to determine?

Is shorter really always better?

5

FORSIED’S ORIGINS, AND OUTLOOK

6

FORSIED’S ORIGINS

7

Remodiscovery, 2006

DISTILLER, 2009

FORSIED’S ORIGINS

8

FORSIED’S ORIGINS

9

MINI, ECML-PKDD 2007

TKDD, 2007

KDD, 2009

DAMI, 2014

FORSIED’S ORIGINS

10

Since 2017 also Jefrey’s Pegasus2 fellowship, and a large and a small FWO grant

OUTLOOK

Improve theoretical understanding:
‒ Estimating the background distribution (information geometry)
‒ Cognitive aspects (cognitive science)
‒ User interface (human computer interaction)
‒ Visualization (visual analytics)
‒ Algorithmic aspects (optimisation theory)
‒ Safeguarding sensitive information & fairness

More instantiations:
‒ Data types (linked data / knowledge graphs!) / pattern types / prior belief types

Applications:
‒ Bioinformatics
‒ Web and social media mining

11

DATA MINING WITHOUT SPILLING THE BEANS

12

PRIVACY-PRESERVING DATA PUBLISHING

Anonymization insufficient to protect sensitive attributes (linkage attack)

Generalization!

13

Anonymized patient database:

| ZIP | D.O.B. | Sex | Diagnosis |
|-----|--------|-----|-----------|
| 94701 | 01/02/1968 | F | Healthy |
| 94701 | 06/03/1990 | F | Obesity |
| 94702 | 11/08/1991 | M | Healthy |
| 94703 | 03/09/1979 | M | Prostate cancer |
| 94703 | 07/10/1951 | F | Healthy |
| 94704 | 10/02/1973 | M | Obesity |
| 94705 | 20/12/2001 | F | Obesity |

Voting records database:

| ZIP | D.O.B. | Sex | Full name |
|-----|--------|-----|-----------|
| 94701 | 01/02/1968 | F | Mary Smith |
| 94701 | 06/03/1990 | F | Patricia Johnson |
| 94702 | 11/08/1991 | M | James Jones |
| 94703 | 03/09/1979 | M | John Brown |
| 94703 | 07/10/1951 | F | Linda Davis |
| 94704 | 10/02/1973 | M | Robert Miller |
| 94705 | 20/12/2001 | F | Barbara Wilson |

Quasi-identifiers: ZIP, D.O.B., Sex

Generalized patient database:

| ZIP | D.O.B. | Sex | Diagnosis |
|-----|--------|-----|-----------|
| 94701 | '51-'01 | F | Healthy |
| 94701 | '51-'01 | F | Obesity |
| 94702-5 | '51-'01 | M | Healthy |
| 94702-5 | '51-'01 | M | Prostate cancer |
| 94702-5 | '51-'01 | F | Healthy |
| 94702-5 | '51-'01 | M | Obesity |
| 94702-5 | '51-'01 | F | Obesity |

PRIVACY-PRESERVING DATA PUBLISHING

k-anonymity: minimum size of an equivalence class ≥ k
‒ Homogeneity attack
‒ Background knowledge attack

l-diversity: ≥ l sensitive attribute values well represented in each equivalence class
‒ Hard to achieve for imbalanced data
‒ Skewness attack
‒ Similarity attack

t-closeness: sensitive attribute value distribution in each equivalence class same as in the overall data

14

Source: https://www.linkedin.com/pulse/dont-throw-baby-out-bathwater-didi-gurfinkel/


OTHER KINDS OF SENSITIVE INFORMATION

Existence of a tight community in a network

16

OTHER KINDS OF SENSITIVE INFORMATION

Existence of a tight community in a network

Existence of a cluster in data

Frequency of particular items / size of particular

transactions in a database of purchases

Preserve this while:

publishing generalized version of database,

identifying dense subgraphs,

finding clusters,

mining frequent itemsets, etc

17

Data mining patterns

GENERAL STRATEGY

Data: $\boldsymbol{x}$. Data mining goal: reveal as much as possible about $\boldsymbol{x}$.

Sensitive aspects: $f(\boldsymbol{x}) \in \Phi$, e.g.
‒ the sensitive attributes' values
‒ density of a specified subgraph
‒ existence of a tight cluster
‒ frequencies of all items

Goal: reveal as little as possible about $f(\boldsymbol{x})$.

Updating $P \to P'$ results in updating $P_f \to P_f'$
‒ More complex than conditioning!
‒ $P_f(f(\boldsymbol{x}))$ can be larger or smaller than $P_f'(f(\boldsymbol{x}))$

Considering the user's prior beliefs! ⇒ ‘Subjective’ measures

18

[Diagram: the data space Ω, with pattern subset Ω′ and background distribution P, maps through f to the sensitive space Φ, with Φ′ = f(Ω′) and induced distribution P_f]

TRADING-OFF TWO THINGS

1. Subjective information content of a pattern
2. A criterion on the background distribution about the sensitive aspects:

‒ Information content left in the sensitive aspects (surprise in the actual value of the sensitive attributes): $-\log P_f'(f(\boldsymbol{x}))$
‒ Entropy of $P_f'$ (uncertainty about the sensitive attributes): $-E_{F\sim P_f'}\big[\log P_f'(F)\big]$
‒ Knowledge gained about the actual value of the sensitive aspects: $-\log \dfrac{P_f(f(\boldsymbol{x}))}{P_f'(f(\boldsymbol{x}))}$
‒ Degree of belief that the sensitive aspects are within a specified set $\Phi^* \subseteq \Phi$: $P_f'(\Phi^*)$
‒ ...

19
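A minimal sketch computing two of these criteria (our own helpers; `pf_old` / `pf_new` map each possible sensitive value to its probability before / after the update):

```python
import math

def residual_surprise(pf_new, f_x):
    """Information content left in the sensitive aspects: -log P'_f(f(x)).
    Large = the analyst would still be surprised by the true value."""
    return -math.log(pf_new[f_x])

def knowledge_gained(pf_old, pf_new, f_x):
    """-log(P_f(f(x)) / P'_f(f(x))): positive when the update made the
    true sensitive value more plausible, i.e. information leaked."""
    return math.log(pf_new[f_x]) - math.log(pf_old[f_x])
```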

EXAMPLES

20

PRIVACY-PRESERVING DATA PUBLISHING

21

• Random synthetic dataset:

• 5 real-valued quasi-identifiers,

generalization through intervals

• 1 sensitive attribute, 3 possible values

• 1 other attribute, 3 possible values

• 100 data records

• Trade-off:

• information about data (other & sensitive attributes)

• knowledge gained about sensitive attribute

• Generalize quasi-attributes ⇒ 5 equivalence classes
• Ensure the maximum information content about any sensitive attribute value is small

[Figure: conditional distributions within the 5 equivalence classes over the 3 sensitive attribute values; over the 3 other attribute values; and the joint conditional distribution of the sensitive (rows) and other (columns) attributes]

QI = quasi-identifiers (e.g. zip code, DOB); SA = sensitive attribute (e.g. sexual orientation, ethnicity); OA = other attribute (e.g. sense of well-being, productivity)

DENSE SUBGRAPHS WITHOUT SPILLING BEANS

22

[Figure: background distribution over a 40×40 adjacency matrix: initially; after both community patterns; after both community patterns and a deception pattern; after both community patterns partially concealed]

• Random network:
  • 2 non-overlapping communities
  • A 3rd community overlapping both
  • The 3rd is sensitive: the analyst should remain surprised by its presence
• Task:
  • Identify (non-)dense subgraphs
  • Without spilling the beans on the 3rd community
• Approaches (result from the general strategy):
  • Deceive
  • Conceal

TAKE-AWAYS

FORSIED ideas can be used for quantifying sensitive

information disclosure

Key point: sensitive information disclosure is subjective

More work needed to understand how to make this

practical

23

CONCLUSIONS

24

OVERALL CONCLUSIONS

A generic approach for designing methods for exploring data
‒ Several successes
‒ Sometimes more challenging, mostly due to algorithmic issues

Key take-away:
‒ Model what's not interesting (= prior beliefs), show what's complementary (= subjectively interesting)
‒ Using information theory

New horizons?
‒ Privacy and sensitive information

25

www.forsied.net 26

“Data Mining without Spilling the Beans: Preserving more than Privacy alone”Research project funded by the FWO

Tijl De Bie, Jefrey Lijffijt

“Exploring Data: Theoretical Foundations and Applications to Web, Multimedia, and Omics Data”Odysseus project funded by the FWO

Tijl De Bie

“Formalizing Subjective Interestingness in Data mining”ERC project FORSIED

Tijl De Bie

“Personalised, interactive, and visual exploratory mining of patterns in complex data”FWO [Pegasus]2 Marie Skłodowska-Curie Fellowship

Jefrey Lijffijt

Acknowledgements go to

27

Jefrey Lijffijt

Bo Kang

Wouter

Duivesteijn

Achille Aknin

Holly Silk

Raul

Santos-Rodriguez

Eirini Spyropoulou

Akis Kontonasios

Paolo Simeone

Robin Vandaele

Florian Adriaens

Tijl De Bie

Xi Chen

Junning (Lemon)

Deng

Ahmad Mel

Alexandru Mara

We are recruiting!

www.forsied.net / aida.ugent.be

+ lots of collaborators...

Maryam Fanaeepour

Maarten Buyl

THANKS!

TIME FOR Q&A

28

MINING SUBJECTIVELY

INTERESTING PATTERNS IN DATA

SUPPLEMENTARY MATERIAL

Tijl De Bie – slides in collaboration with Jefrey Lijffijt
Ghent University

DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)

RESEARCH GROUP IDLAB

www.forsied.net 1

REFERENCES

2

Adriaens, Lijffijt, De Bie: Subjectively Interesting Connecting Trees. ECML/PKDD (2) 2017: 53-69

De Bie, Lijffijt, Santos-Rodriguez, Kang: Informative Data Projections: A Framework and Two Examples. ESANN 2015 :

435-640

De Bie: Subjective Interestingness in Exploratory Data Mining. IDA 2013: 19-31

De Bie: Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min.

Knowl. Discov. 23(3): 407-446 (2011)

De Bie: An information theoretic framework for data mining. KDD 2011: 564-572

De Bie: Subjectively Interesting Alternative Clusters. MultiClust@ECML/PKDD 2011: 43-54

Deng, Lijffijt, Kang, De Bie: Subjectively Interesting Motifs in Time Series, AALTD@ECML/PKDD 2018

Guns, Aknin, Lijffijt, De Bie: Direct Mining of Subjectively Interesting Relational Patterns. ICDM 2016: 913-918

3

Kang, Lijffijt, De Bie: Conditional Network Embeddings. Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.

Kang, Lijffijt, Santos-Rodríguez, De Bie: SICA: subjectively interesting component analysis. Data Min. Knowl. Discov. 32(4): 949-987 (2018)

Kang, Lijffijt, Santos-Rodriguez, De Bie: Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations. KDD 2016: 1615-1624

Kang, Puolamäki, Lijffijt, De Bie: A Tool for Subjective and Interactive Visual Data Exploration. ECML/PKDD (3) 2016: 3-7

Kontonasios, De Bie: Subjectively interesting alternative clusterings. Machine Learning 98(1-2): 31-56 (2015)

Kontonasios, De Bie: An Information-Theoretic Approach to Finding Informative Noisy Tiles in Binary Databases. SDM 2010: 153-164

Kontonasios, Vreeken, De Bie: Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data. ECML/PKDD (2) 2013: 256-271

4

Kontonasios, Vreeken, De Bie: Maximum Entropy Modelling for Assessing Results on Real-Valued Data. ICDM 2011:

350-359

Van Leeuwen, De Bie, Spyropoulou, Mesnage: Subjective interestingness of subgraph patterns. Machine Learning

105(1): 41-75 (2016)

Lijffijt, Kang, Duivesteijn, Puolamäki, Oikarinen, De Bie: Subjectively Interesting Subgroup Discovery on Real-valued

Targets. IEEE ICDE 2018 : to appear

Lijffijt, Spyropoulou, Kang, De Bie: P-N-RMiner: a generic framework for mining interesting structured relational

patterns. I. J. Data Science and Analytics 1(1): 61-76 (2016)

Lijffijt, Spyropoulou, Kang, De Bie: P-N-RMiner: A generic framework for mining interesting structured relational

patterns. DSAA 2015: 1-10

Puolamäki, Kang, Lijffijt, De Bie: Interactive Visual Data Exploration with Subjective Feedback. ECML/PKDD (2) 2016:

214-229

5

Puolamäki, Oikarinen, Kang, Lijffijt, De Bie: Interactive Visual Data Exploration with Subjective Feedback: An

Information-Theoretic Approach. IEEE ICDE 2018 : to appear

Spyropoulou, De Bie, Boley: Interesting pattern mining in multi-relational data. Data Min. Knowl. Discov. 28(3): 808-849

(2014)

Spyropoulou, De Bie, Boley: Mining Interesting Patterns in Multi-relational Data with N-ary Relationships. Discovery

Science 2013: 217-232

6

LINKS TO SOFTWARE

7

R-MINER(S)

Original (fastest for full enumeration):

https://bitbucket.org/BristolDataScience/rminer/

N-RMiner (supports n-ary):

https://bitbucket.org/BristolDataScience/n-rminer/

P-N-RMiner (support structured attributes):

https://bitbucket.org/BristolDataScience/p-n-rminer/

CP-RMiner (top 1 RMiner pattern, iteratively, fast):

https://bitbucket.org/ghentdatascience/cp/

8

CONNECTING TREES

https://bitbucket.org/ghentdatascience/interestingtreesp

ublic/

9

DENSE SUBGRAPHS (COMMUNITIES)

http://patternsthatmatter.org/software.php#ssgminer

10

NETWORK EMBEDDING

https://bitbucket.org/ghentdatascience/cne-public/

11

ATTRIBUTED SUBGRAPHS

SIAS-Miner: http://goo.gl/ZxsvbX

12

SUBGROUP DISCOVERY

https://bitbucket.org/ghentdatascience/sisd-public/

13

DIMENSIONALITY REDUCTION

SICA:

http://users.ugent.be/~bkang/software/sica/sica.zip

SIDE (online tool):

http://users.ugent.be/~bkang/software/side_dev/index.

html

SIDE (MaxEnt R version): http://kaip.iki.fi/sider.html

CLIPPR: https://bitbucket.org/ghentdatascience/clippr/

14

ANYTHING MISSING?

Not all (source) code has been published; please ask if you are interested in something that is missing!

15
