Efficient Algorithms for Association Finding and Frequent Association Pattern Mining
Gong Cheng, Daxin Liu, Yuzhong Qu
Websoft Research Group, National Key Laboratory for Novel Software Technology, Nanjing University, China
Feb 09, 2017

Page 1: Efficient Algorithms for Association Finding and Frequent Association Pattern Mining

Gong Cheng, Daxin Liu, Yuzhong Qu
Websoft Research Group
National Key Laboratory for Novel Software Technology
Nanjing University, China

Page 2: Background and motivation

• To suggest friends, recognize suspected terrorists, answer questions … based on massive graph data

Page 3: Problem statement

An association connecting a set of query entities is a minimal subgraph that
• contains all the query entities, and
• is connected.

[Figure: example graph over the entities Alice, Bob, Chris, Dan, Ellen, Frank, Paper-A, Paper-B, ISWC, and COLD, linked by the relations isAuthorOf, correspondingAuthor, acceptedAt, attended, reviewer, and knows]

Page 6: Problem statement

Because an association is a minimal connected subgraph containing all the query entities, it is
• tree-structured, and
• its leaves ⊆ query entities (every leaf is a query entity).

[Figure: example graph over the entities Alice, Bob, Chris, Dan, Ellen, Frank, Paper-A, Paper-B, ISWC, and COLD]
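
The definition can be checked mechanically. The sketch below is only an illustration: the function name `is_association`, the triple format `(u, label, v)`, and the toy arcs are assumptions, not something given on the slides. It ignores arc direction and tests three things: the subgraph is connected, it is a tree, and every leaf is a query entity.

```python
from collections import defaultdict

def is_association(triples, query):
    """Check whether a list of (u, label, v) arcs is an association for `query`.

    Arc direction is ignored. Assumes the subgraph has at least one arc.
    """
    adj = defaultdict(set)
    for u, _label, v in triples:
        adj[u].add(v)
        adj[v].add(u)
    vertices = set(adj)
    if not vertices or not set(query) <= vertices:
        return False  # some query entity is missing

    # Connectivity: depth-first search from an arbitrary vertex.
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if seen != vertices:
        return False

    # A connected graph is a tree iff it has |V| - 1 edges.
    if len(triples) != len(vertices) - 1:
        return False

    # Minimality: every leaf (degree 1) must be a query entity.
    return all(v in query for v in vertices if len(adj[v]) == 1)

# Toy arcs read off the example figure (endpoints are an assumption).
tree = [
    ("Alice", "isAuthorOf", "Paper-A"),
    ("Paper-A", "acceptedAt", "ISWC"),
    ("ISWC", "reviewer", "Bob"),
    ("Chris", "attended", "ISWC"),
]
print(is_association(tree, {"Alice", "Bob", "Chris"}))
```

Adding an extra arc such as Alice knows Dan makes Dan a leaf that is not a query entity, so the enlarged subgraph is no longer minimal and the check fails.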

Page 7: Problem statement

1. How to efficiently find associations in a possibly very large graph?
2. How to help users explore a possibly large set of associations that have been found?

[Figure: example graph over the entities Alice, Bob, Chris, Dan, Ellen, Frank, Paper-A, Paper-B, ISWC, and COLD]

Page 8: Problem statement

The two questions correspond to two tasks:
1. Association finding: how to efficiently find associations in a possibly very large graph.
2. Frequent association pattern mining: how to help users explore a possibly large set of associations that have been found.

Page 9: Association finding: Problem

• To find all the associations having a limited diameter
  (Diameter = greatest distance between any pair of vertices)

[Figure: example graph over the entities Alice, Bob, Chris, Dan, Ellen, Frank, Paper-A, Paper-B, ISWC, and COLD]

Page 10: Association finding: Basic solution

[Figure: an association connecting Alice, Bob, and Chris through Paper-A and ISWC (relations isAuthorOf, acceptedAt, reviewer, attended), shown next to the separate paths it is composed of, which meet at ISWC]

An association corresponds to a set of paths.

Page 11: Association finding: Basic solution

1. Path finding
2. Path merging
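
The two steps can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: all names (`paths_from`, `find_associations`, `max_len`) are hypothetical, arc direction and labels are ignored, at most one arc is assumed between any two entities, paths are merged only when they end at a common vertex, and the diameter of the merged result is not checked. Duplicates are removed here by comparing edge sets, a simple stand-in for the canonical codes introduced later in the deck.

```python
from collections import defaultdict
from itertools import product

def build_adj(triples):
    adj = defaultdict(set)
    for u, _label, v in triples:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def paths_from(adj, start, max_len):
    """Step 1: all simple paths from `start` with at most `max_len` edges."""
    result = []
    def extend(path):
        result.append(tuple(path))
        if len(path) - 1 == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                path.append(nxt)
                extend(path)
                path.pop()
    extend([start])
    return result

def is_tree_with_query_leaves(edge_set, query):
    degree = defaultdict(int)
    for edge in edge_set:
        for vertex in edge:
            degree[vertex] += 1
    # Paths sharing an end vertex form a connected graph,
    # so it is a tree iff it has |V| - 1 edges.
    if len(edge_set) != len(degree) - 1:
        return False
    return all(v in query for v, d in degree.items() if d == 1)

def find_associations(triples, query, max_len):
    """Step 2: merge one path per query entity at each common end vertex."""
    adj = build_adj(triples)
    by_end = []  # one dict per query entity: end vertex -> paths
    for q in query:
        ends = defaultdict(list)
        for path in paths_from(adj, q, max_len):
            ends[path[-1]].append(path)
        by_end.append(ends)
    found = set()
    for end in set(by_end[0]).intersection(*by_end[1:]):
        for combo in product(*(ends[end] for ends in by_end)):
            edge_set = frozenset(
                frozenset(pair) for path in combo for pair in zip(path, path[1:])
            )
            if is_tree_with_query_leaves(edge_set, set(query)):
                found.add(edge_set)  # identical edge sets collapse here
    return found

# Toy arcs (endpoints are an assumption based on the example figure).
triples = [
    ("Alice", "isAuthorOf", "Paper-A"),
    ("Paper-A", "acceptedAt", "ISWC"),
    ("ISWC", "reviewer", "Bob"),
    ("Chris", "attended", "ISWC"),
    ("Alice", "knows", "Dan"),
]
print(len(find_associations(triples, ["Alice", "Bob", "Chris"], max_len=2)))
```

On this toy graph the same association is produced twice, once from paths meeting at ISWC and once from paths meeting at Paper-A, which is why a deduplication step is needed.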

Page 12: Association finding: Optimization

• Distance-based search space pruning

[Figure: example graph over the entities Alice, Bob, Chris, Dan, Ellen, Frank, Paper-A, Paper-B, ISWC, and COLD]

• DiameterConstraint ≤ 3
• Length(Alice→Dan) = 1
• Distance(Dan, Bob) = 4
Length + Distance = 1 + 4 = 5 > DiameterConstraint, so the search does not continue from the path Alice→Dan.
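
The pruning test on this slide amounts to one comparison per remaining query entity. A minimal sketch, assuming a caller-supplied `distance` function; the name `should_prune` and its parameters are illustrative only.

```python
def should_prune(path_length, end_vertex, other_query_entities, distance, diameter_limit):
    """Return True if the slide's condition Length + Distance > DiameterConstraint
    holds for at least one of the other query entities."""
    return any(
        path_length + distance(end_vertex, q) > diameter_limit
        for q in other_query_entities
    )

# Numbers from the slide: Length(Alice→Dan) = 1 and Distance(Dan, Bob) = 4.
known_distances = {("Dan", "Bob"): 4}

def distance(u, v):
    return known_distances[(u, v)]

print(should_prune(1, "Dan", ["Bob"], distance, diameter_limit=3))  # True, since 1 + 4 > 3
```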

Page 13: Association finding: Optimization

• Distance computation
  • Materializing offline computed results: O(V²) space
  • Online computing: O(E) time per pair
  • Using distance oracle: a space-time trade-off

[Figure: example graph over the entities Alice, Bob, Chris, Dan, Ellen, Frank, Paper-A, Paper-B, ISWC, and COLD]
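
The first two options can be illustrated with breadth-first search on an unweighted graph. This is a sketch under assumptions: the adjacency format (every vertex is a key mapping to a list of neighbours) and the function names are hypothetical, and the distance oracle itself is not shown because the slides do not say which one is used.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable vertex (one BFS, O(E) time)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def online_distance(adj, u, v):
    """Online computing: run a BFS for each requested pair."""
    return bfs_distances(adj, u).get(v, float("inf"))

def materialize(adj):
    """Offline materialization: one BFS per vertex, stored as a table (up to V² entries)."""
    return {u: bfs_distances(adj, u) for u in adj}

# Hypothetical toy graph: a chain a-b-c-d and an isolated vertex e.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": []}
table = materialize(adj)
print(online_distance(adj, "a", "d"), table["a"]["d"])
```

A distance oracle sits between these two extremes: it stores less than the full table and answers a query faster than a fresh BFS.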

Page 14: Association finding: Deduplication

[Figure: the association connecting Alice, Bob, and Chris through Paper-A and ISWC, shown next to one set of paths that produces it]

An association corresponds to a set of paths.

Page 15: Association finding: Deduplication

[Figure: the same association, shown next to another set of paths that also produces it]

Another set of paths yields a duplicate of the same association.

Page 16: Association finding: Deduplication

• Canonical code

[Figure: the association connecting Alice, Bob, and Chris through Paper-A and ISWC, viewed as a tree rooted at Alice]

code(Tree(Alice)) = Alice,isAuthorOf,code(Tree(Paper-A))$
                  = … Paper-A,acceptedAt,code(Tree(ISWC))$ …
                  = … ISWC,reviewer,code(Tree(Bob)),~attended,code(Tree(Chris))$ …
                  = … Bob$ …

(Assuming Bob precedes Chris)
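
The recursive code above can be sketched in a few lines. Assumptions, none of which are stated on the slide: arcs are `(u, label, v)` triples, a reversed arc is written with a leading `~` as in the slide's `~attended`, the root is passed in by the caller, and sibling subtrees are ordered by the name of the child entity (the slide only says that Bob precedes Chris).

```python
from collections import defaultdict

def canonical_code(triples, root):
    """Serialize a tree-shaped association into a string that does not depend
    on the order in which its arcs were discovered."""
    adj = defaultdict(list)
    for u, label, v in triples:
        adj[u].append((v, label))
        adj[v].append((u, "~" + label))  # traversing the arc backwards

    def code(vertex, parent):
        parts = [vertex]
        for child, label in sorted(adj[vertex]):  # fixed order of subtrees
            if child != parent:
                parts.append(label)
                parts.append(code(child, vertex))
        return ",".join(parts) + "$"

    return code(root, None)

arcs = [
    ("Alice", "isAuthorOf", "Paper-A"),
    ("Paper-A", "acceptedAt", "ISWC"),
    ("ISWC", "reviewer", "Bob"),
    ("Chris", "attended", "ISWC"),
]
print(canonical_code(arcs, "Alice"))
# Alice,isAuthorOf,Paper-A,acceptedAt,ISWC,reviewer,Bob$,~attended,Chris$$$$
```

Two sets of paths that describe the same association yield the same string, so duplicates can be detected by comparing codes, for example by keeping the codes in a hash set.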

Page 17: Frequent association pattern mining

• Association pattern: a conceptual abstraction that summarizes a group of associations

[Figure: an association (Alice, Paper-A, ISWC, Bob, Chris) and its association pattern, obtained by replacing each intermediate entity with its class (Paper-A → Paper, ISWC → Conference)]

Page 18: Frequent association pattern mining

• Problem: to mine all the association patterns matched by more than a threshold proportion of associations

[Figure: an association and its association pattern (intermediate entity → class)]

Page 19: Frequent association pattern mining

• Basic solution: calculating the frequency of an association pattern = counting the occurrences of its canonical code

[Figure: an association and its association pattern (intermediate entity → class)]
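
Once every association has been mapped to the canonical code of its pattern, mining reduces to counting strings. A minimal sketch, assuming the codes are already computed; the function name `frequent_patterns` and the placeholder codes are made up for illustration.

```python
from collections import Counter

def frequent_patterns(pattern_codes, threshold):
    """Return the patterns matched by more than `threshold` (a proportion in
    [0, 1]) of the associations, with their frequencies.

    `pattern_codes` holds one canonical pattern code per association.
    """
    counts = Counter(pattern_codes)
    total = len(pattern_codes)
    return {
        code: count / total
        for code, count in counts.items()
        if count / total > threshold
    }

# Hypothetical codes for four associations; three share the same pattern.
codes = ["pattern-1", "pattern-1", "pattern-2", "pattern-1"]
print(frequent_patterns(codes, threshold=0.5))  # {'pattern-1': 0.75}
```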

Page 20: Frequent association pattern mining

• Canonical code

[Figure: association pattern rooted at Alice with two isAuthorOf branches, each leading to a Paper vertex that is acceptedAt a Conference vertex; the leaves are Bob (reviewer) and Chris (attended)]

code(Tree(Alice)) = Alice,isAuthorOf,code(Tree(Paper)),isAuthorOf,code(Tree(Paper))$

Paper equals Paper. So which subtree should go first?
(Canonical code may not be unique!)

Page 21: Frequent association pattern mining

• Canonical code

code(Tree(Alice)) = Alice,isAuthorOf,code(Tree(Paper)),isAuthorOf,code(Tree(Paper))$

Each subtree uses its smallest leaf as its proxy to be compared.
Leaves are query entities, and sibling subtrees have different leaves, so equality would never happen.
(Canonical code is now unique!)

(Assuming Bob precedes Chris)
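
The tie-breaking rule can be sketched as follows. The representation is an assumption made for illustration: the pattern is given as a rooted tree with hypothetical node ids (`p1`, `c1`, ...) so that two vertices can share the label Paper, and `children` maps a node id to its `(edge label, child id)` pairs. Sibling subtrees are ordered by the smallest leaf label they contain.

```python
def pattern_code(children, label, root):
    """Canonical code of a rooted pattern tree, ordering sibling subtrees by
    the smallest leaf label found in each subtree."""
    def smallest_leaf(node):
        kids = children.get(node, [])
        if not kids:
            return label[node]
        return min(smallest_leaf(child) for _edge, child in kids)

    def code(node):
        parts = [label[node]]
        ordered = sorted(children.get(node, []), key=lambda ec: smallest_leaf(ec[1]))
        for edge, child in ordered:
            parts += [edge, code(child)]
        return ",".join(parts) + "$"

    return code(root)

# Hypothetical reading of the slide's pattern: two Paper branches under Alice.
label = {"a": "Alice", "p1": "Paper", "p2": "Paper",
         "c1": "Conference", "c2": "Conference", "b": "Bob", "c": "Chris"}
children = {
    "a": [("isAuthorOf", "p2"), ("isAuthorOf", "p1")],
    "p1": [("acceptedAt", "c1")], "c1": [("reviewer", "b")],
    "p2": [("acceptedAt", "c2")], "c2": [("~attended", "c")],
}
print(pattern_code(children, label, "a"))
```

Because Bob precedes Chris, the branch ending in Bob is always emitted first, regardless of the order in which the two Paper branches are listed.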

Page 22: Experiments

• Datasets
  • LinkedMDB: 1M vertices and 2M arcs
  • DBpedia (2015-04): 4M vertices and 15M arcs
• Parameter settings
  • Diameter constraint (λ): 2, 4
  • Number of query entities (n): 2, 3, 4, 5
• Test queries
  • 1,000 random sets of query entities under each setting of λ and n
• Hardware configuration
  • 3.3GHz CPU, 24GB memory
  • Data graphs: in memory
  • Distance oracles: on disk

Page 23: Experiments

• Results: Association finding

BSC: Basic solution (no pruning)
PRN: Optimized solution (distance-based pruning)
PRN-1: Optimized solution (distance-based pruning except for the last level of search)

Page 24: Experiments

• Results: Frequent association pattern mining
  • LinkedMDB
    • <10,000 associations: <21ms
    • 13,531 associations: 68ms
  • DBpedia
    • <10,000 associations: <65ms
    • 1,198,968 associations: 2909ms

Page 25: Takeaway messages

• Subgraph finding and mining are faster than we expected.
• Consider distance oracles and canonical codes in your own research.
