Similarity on DBpedia UIMR PhD student: Samantha Lam Supervisor: Conor Hayes
Similarity on DBpedia

Jun 15, 2015

Samantha Lam

Overview on the notion of similarity and methods for defining similarity on DBpedia.
Transcript
Page 1: Similarity on DBpedia

Similarity on DBpedia
UIMR

PhD student: Samantha Lam
Supervisor: Conor Hayes

Page 3: Similarity on DBpedia

Similarity

How similar are the following films?

(Unsatisfactory) answer: it depends!

Page 4: Similarity on DBpedia

DBpedia Graph

Films are nodes on DBpedia.

Some things about DBpedia:

Big, rich, dense Knowledge Base

→ 3.77m nodes, 400m edges (EN)

Lots of prior work (as we shall see...)

But very heterogeneous - vocabularies, categories

It is a graph

Page 7: Similarity on DBpedia

Similarity in general

Cognitive Science - Tversky (1977) - psychology - featural.

E.g. film: genre, language, director

Modelling of human thought, semantic relations: how do we relate things to each other? (Collins & Quillian, 1969)
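The featural view can be made concrete with Tversky's ratio model, which scores two items by their shared features against the features unique to each. The sketch below is illustrative only: the feature sets are hypothetical (not taken from DBpedia), and the weights `alpha` and `beta` are free parameters chosen here for the example.

```python
def tversky_index(a, b, alpha=0.5, beta=0.5):
    """Tversky ratio-model similarity between two feature sets.

    alpha and beta weight the features unique to a and to b respectively;
    alpha = beta = 1 gives the Jaccard index, alpha = beta = 0.5 the Dice
    coefficient.
    """
    a, b = set(a), set(b)
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 0.0


# Hypothetical feature sets, for illustration only.
film_x = {"genre:animation", "language:ja", "director:A"}
film_y = {"genre:animation", "language:ja", "director:B"}
film_z = {"genre:action", "language:en", "director:C"}

print(tversky_index(film_x, film_y))  # two of three features shared
print(tversky_index(film_x, film_z))  # nothing shared
```

Which features count (genre, language, director, ...) and how they are weighted is exactly where the "it depends" comes from.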

Page 8: Similarity on DBpedia

Semantic

The notion of semantic networks is derived from the hierarchical semantic memory model [Collins & Quillian, 1969]

Page 9: Similarity on DBpedia

Semantic Similarity

Different techniques:

Word frequency: Latent semantic analysis (doesn’t actually use semantic net structure)

Rada (1989) - average shortest path length

Resnik (1999) - information content of the least common subsumer (LCS)

Unfortunately...

Word frequency N/A

Often assumes hierarchical/tree structure of taxonomy/ontology. (Both Rada and Resnik assume taxonomy is an is-A hierarchy)
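To see why both measures lean on an is-A tree, here is a minimal sketch on a toy taxonomy. The concept names and instance counts are hypothetical. For a single pair of concepts, the path-based distance reduces to the number of edges on the tree path through their least common subsumer, and the Resnik-style score is the information content of that subsumer.

```python
import math

# Hypothetical is-A tree (child -> parent); names are illustrative only.
PARENT = {
    "Work": None,
    "Film": "Work",
    "Book": "Work",
    "AnimatedFilm": "Film",
    "LiveActionFilm": "Film",
}

# Hypothetical instance counts attached directly to each concept.
COUNT = {"Work": 0, "Film": 10, "Book": 50, "AnimatedFilm": 15, "LiveActionFilm": 25}


def ancestors(c):
    """Return [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def lcs(a, b):
    """Least common subsumer: the deepest concept that subsumes both a and b."""
    seen = set(ancestors(a))
    return next(c for c in ancestors(b) if c in seen)


def path_distance(a, b):
    """Edge count on the tree path from a to b via their LCS."""
    common = lcs(a, b)
    return ancestors(a).index(common) + ancestors(b).index(common)


def information_content(c):
    """-log p(c), where p(c) counts instances of c and all its descendants."""
    total = sum(COUNT.values())
    subsumed = sum(n for x, n in COUNT.items() if c in ancestors(x))
    return -math.log(subsumed / total)


def resnik_similarity(a, b):
    """Information content of the least common subsumer."""
    return information_content(lcs(a, b))


print(path_distance("AnimatedFilm", "LiveActionFilm"))
print(resnik_similarity("AnimatedFilm", "LiveActionFilm"))
```

Both functions rely on every concept having exactly one parent; on a graph with many link types and no single hierarchy, `lcs` is no longer well defined.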

Page 11: Similarity on DBpedia

Semantic Similarity

Remember, DBpedia is not as ‘neat’:

(Image source: http://www.visualdataweb.org/relfinder/)

Page 16: Similarity on DBpedia

On DBpedia/Wikipedia

Recent applications:

Gabrilovich & Markovitch (2007) - express text as a weighted vector of Wikipedia articles, Explicit Semantic Analysis (ESA)

Milne & Witten (2008) - the Wikipedia Link-based Measure - similarity of neighbours

Passant (2010) - Linked Data Semantic Distance ← uses paths!

Mirizzi et al. (2012) use DBpedia for movie recommendation using a Vector Space Model
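As a rough illustration of a link-based distance, the sketch below follows the spirit of the simplest, direct-link variant of a Linked Data distance: the more distinct properties connect two resources directly (in either direction), the smaller the distance. It is a simplification under stated assumptions, not a full reproduction of Passant's measure (the indirect and weighted variants are omitted), and the triples are hypothetical rather than real DBpedia data.

```python
# Hypothetical triples (subject, property, object); not real DBpedia data.
TRIPLES = [
    ("film_a", "sequel", "film_b"),
    ("film_b", "prequel", "film_a"),
    ("film_a", "director", "person_x"),
]


def direct_links(triples, a, b):
    """Number of distinct properties linking a directly to b."""
    return len({p for (s, p, o) in triples if s == a and o == b})


def direct_link_distance(triples, a, b):
    """1 when a and b share no direct link, shrinking as links accumulate."""
    return 1.0 / (1 + direct_links(triples, a, b) + direct_links(triples, b, a))


print(direct_link_distance(TRIPLES, "film_a", "film_b"))
print(direct_link_distance(TRIPLES, "film_a", "film_c"))
```

Extending the count from direct links to longer paths is what makes this family of measures a natural fit for a graph such as DBpedia.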

Page 17: Similarity on DBpedia

Similarity

Important:

Properties can be related to each other

type 1, e.g. influenced

node, e.g. director

type 2, e.g. collaborated with

node type 2, e.g. film

Page 18: Similarity on DBpedia

Network Similarity

Social Network Analysis

Established field - notions of influence, centrality, rank etc.

Often applied to small networks

Note: Ranking is often based on similarity

Page 19: Similarity on DBpedia

Network Similarity

Homogeneous network measures:

PageRank - Brin & Page (1998) - random surfer with teleportation

SimRank - Jeh & Widom (2002) - iteratively ‘inherits’ rank of neighbours

σact - Thiel & Berthold (2010) - node similarities from spreading activation with a decay factor
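To illustrate the "inherits from neighbours" idea, here is a naive SimRank iteration on a toy graph: two distinct nodes are similar to the extent that their in-neighbours are similar, damped by a decay constant. The graph, the decay value of 0.8 and the iteration count are assumptions made for this example; the quadratic loop is only suitable for very small graphs.

```python
def simrank(in_neighbours, decay=0.8, iterations=10):
    """Naive SimRank. in_neighbours maps each node to a list of its in-neighbours."""
    nodes = list(in_neighbours)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0  # every node is maximally similar to itself
                    continue
                ia, ib = in_neighbours[a], in_neighbours[b]
                if not ia or not ib:
                    new[(a, b)] = 0.0  # no in-neighbours, nothing to inherit
                    continue
                # Average similarity over all pairs of in-neighbours, then decay.
                total = sum(sim[(i, j)] for i in ia for j in ib)
                new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Hypothetical toy graph: "hub" links to both "x" and "y"; "z" is isolated.
graph = {"hub": [], "x": ["hub"], "y": ["hub"], "z": []}
scores = simrank(graph)
print(scores[("x", "y")])
print(scores[("x", "z")])
```

Note that the update treats every edge alike, which is the homogeneity assumption that matters when the links carry different meanings.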

Page 22: Similarity on DBpedia

Network Similarity

Heterogeneous network measures:

PathSim - Sun et al. (2011) - count instances of a ‘meta-path’ (specific link pattern)
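A small sketch of the meta-path idea, assuming a symmetric Film-Person-Film meta-path and binary film-to-person links. PathSim normalises the number of path instances between two objects by the number of path instances from each object back to itself. The film and person identifiers below are hypothetical.

```python
def path_count(links, x, y):
    """Number of Film-Person-Film path instances between films x and y.

    With binary links this is simply the number of people the films share.
    """
    return len(links[x] & links[y])


def pathsim(links, x, y):
    """PathSim score along the Film-Person-Film meta-path."""
    denom = path_count(links, x, x) + path_count(links, y, y)
    return 2 * path_count(links, x, y) / denom if denom else 0.0


# Hypothetical film -> people links, for illustration only.
links = {
    "film_a": {"p1", "p2"},
    "film_b": {"p2", "p3"},
    "film_c": {"p4"},
}

print(pathsim(links, "film_a", "film_b"))
print(pathsim(links, "film_a", "film_c"))
```

The meta-path here was picked by hand; a different pattern (for example Film-Category-Film) would give a different score for the same pair.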

Page 23: Similarity on DBpedia

Network Similarity

Applicability to DBpedia:

PageRank, SimRank - N/A - they assume homogeneous links!

Spreading Activation - possible with constraints

Apply PathSim - but how to learn such meta-paths?

Another idea:

Count node-disjoint paths.

Why? View each path as one distinct ‘reason’.
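One way to count node-disjoint paths is a unit-capacity maximum flow in which every intermediate node is split into an "in" and an "out" copy, so that it can be used by at most one path. The sketch below is a generic textbook construction, not necessarily how the counts in this talk were produced. It assumes an undirected edge list without duplicates and distinct source and target nodes; the example graph is hypothetical.

```python
from collections import defaultdict, deque


def count_node_disjoint_paths(edges, source, target):
    """Maximum number of source-target paths sharing no intermediate node."""
    cap = defaultdict(int)   # residual capacities on the split graph
    adj = defaultdict(set)   # adjacency, including reverse (residual) arcs

    def add_arc(u, v, c):
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)

    nodes = {n for e in edges for n in e}
    unlimited = len(nodes)   # large enough not to constrain the endpoints
    for n in nodes:
        # Splitting n into n_in -> n_out with capacity 1 means "use at most once".
        add_arc((n, "in"), (n, "out"), unlimited if n in (source, target) else 1)
    for u, v in edges:       # an undirected edge becomes one arc per direction
        add_arc((u, "out"), (v, "in"), 1)
        add_arc((v, "out"), (u, "in"), 1)

    s, t = (source, "out"), (target, "in")
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        v = t
        while parent[v] is not None:   # push one unit of flow along the path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


# Hypothetical graph: two routes via "a" and "b"; "c" only leads back into "a".
edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t"), ("s", "c"), ("c", "a")]
print(count_node_disjoint_paths(edges, "s", "t"))
```

In this example the route through "c" adds no new ‘reason’, because it has to reuse node "a".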

Page 25: Similarity on DBpedia

Similarity

          Totoro   GITS   Matrix
Totoro        44      1        0
GITS           1     35        2
Matrix         0      2       58

Totoro – GITS:
Category:Anime films

GITS – Matrix:
Category:Brain-computer interfacing in fiction
Matrix → Category:The Matrix (franchise) → Category:Media franchises ← GITS

Page 27: Similarity on DBpedia

Similarity

How similar are the following films?

Answer: it still depends - on the path you take

Page 28: Similarity on DBpedia

Summary

Similarity, useful concept in many areas, hard to define

how are films similar?

DBpedia, richly linked KB

film information available here

→ Problem: How to define similarity on DBpedia?

Past methods - don’t exploit linkedness

Network analysis methods can aid this

Test trial with node-disjoint paths: GITS is more similar to Matrix than to Totoro

Page 31: Similarity on DBpedia

Ongoing/Future Work

Mining DBpedia as Network

Analyse structured and related data

Similarity as complement to – reasoning, retrieval, querying

Also useful in NLP, recommender systems, knowledge discovery

→ Examples: work we do in UIMR

Page 33: Similarity on DBpedia

Ioana Hulpus (2011/2012)

Graph-based topic analysis with the support of Linked Data

Page 35: Similarity on DBpedia

Benjamin Heitmann (2011/2012)

Spreading activation for cross-domain recommendation

Page 36: Similarity on DBpedia

Challenges/Discussion

Challenges:

Topology of DBpedia graph

Standard SNA measures for homogeneous networks, e.g. density, degree distribution - how to apply to DBpedia?

What does a path actually mean?

Which subgraphs to use?

How do metrics vary with different subgraphs, e.g. different ontologies/categories?

Scalability (not a problem, but a challenge)

Evaluation - how do we confirm something is similar?

Thanks for listening! Questions/Suggestions?
