The Web of Data as a Complex System - First insight into its multi-scale network properties. Christophe Guéret, Shenghui Wang, and Stefan Schlobach. Department of Computer Science, Network Institute, Vrije Universiteit Amsterdam

ECCS 2010

May 09, 2015

Page 1: ECCS 2010

The Web of Data as a Complex System

- First insight into its multi-scale network properties

Christophe Guéret, Shenghui Wang, and Stefan Schlobach 

Department of Computer Science, Network Institute, Vrije Universiteit Amsterdam

Page 2: ECCS 2010

Outline

 • What is the Web of Data?

 • How complex is the Web of Data?

 • A new way of seeing the Web of Data

 • What have we found?

 • What are the challenges?

Page 3: ECCS 2010

What is the Web of Data?

The Semantic Web is a web of data
-- http://www.w3.org/2001/sw/

Linked Data is a sub-topic of the Semantic Web. The term Linked Data is used to describe a method of exposing, sharing, and connecting data via dereferenceable URIs on the Web.
-- http://en.wikipedia.org/wiki/Linked_Data

Linked Data is about using the Web to connect related data that wasn't previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.
-- http://linkeddata.org/

Page 4: ECCS 2010

Four principles of Linked Data

1. Use URIs to identify things.

2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.

3. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.

4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.

-- Tim Berners-Lee

Page 5: ECCS 2010

An example of linked data

[Figure: graph linking the nodes http://dbpedia.org/resource/Amsterdam, http://dbpedia.org/resource/City, http://umbel.org/umbel/ne/wikipedia/Amsterdam and http://www.freebase.com/view/en/abraham_pais through the link types http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2002/07/owl#sameAs and http://dbpedia.org/ontology/birthPlace]

• Nodes are shared across statements
• The links have some meaning
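The two observations above can be made concrete with a small sketch. It writes the URIs from the slide as (subject, predicate, object) triples and counts how many statements touch each node. Which node is connected to which is an assumption here, since the transcript keeps the URIs but not the arrows of the figure.

```python
from collections import Counter

# Link types (predicates) taken from the slide.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
BIRTH_PLACE = "http://dbpedia.org/ontology/birthPlace"

AMSTERDAM = "http://dbpedia.org/resource/Amsterdam"

# ASSUMPTION: the wiring of nodes and links below is a plausible reading
# of the figure, not something the transcript confirms.
triples = [
    (AMSTERDAM, RDF_TYPE, "http://dbpedia.org/resource/City"),
    (AMSTERDAM, OWL_SAME_AS, "http://umbel.org/umbel/ne/wikipedia/Amsterdam"),
    ("http://www.freebase.com/view/en/abraham_pais", BIRTH_PLACE, AMSTERDAM),
]

# Count how many statements mention each node, as subject or as object.
node_usage = Counter()
for subject, _predicate, obj in triples:
    node_usage[subject] += 1
    node_usage[obj] += 1

# Nodes that appear in more than one statement are the "shared" ones.
shared_nodes = [node for node, count in node_usage.items() if count > 1]
```

Under this reading, the DBpedia URI for Amsterdam is the only shared node: it is what joins the three statements into one graph.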

Page 6: ECCS 2010

Since 2006, people have been creating linked data

Page 7: ECCS 2010

October 2007

Page 8: ECCS 2010

July 2009

Page 9: ECCS 2010

Evolution of the Web of Data

Page 10: ECCS 2010

The WoD is a complex system!

• More than 260 extremely heterogeneous datasets
  o general-purpose datasets, such as DBpedia
  o domain-oriented datasets, such as Bio2RDF
  o government data, music data, geological data, social network data, etc.

• Nearly 50 billion RDF triples
  o Nearly 50 billion links within the datasets
  o More than 800 million links between the datasets

• Rich semantics embedded in the data
  o data points are typed
  o links are typed
  o links are what make the statements useful

Page 11: ECCS 2010

[Figure: graph with the nodes Christophe, VU Amsterdam, Amsterdam and The Netherlands, connected by links labelled workIn and isLocatedIn]

The links have explicit semantics, so a reasoning process can deduce additional, implicit links
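A minimal sketch of how implicit links can appear, using the node and link names from the figure. Two things are assumptions: the explicit triples chosen below (the figure's arrows are not preserved in the transcript) and the inference rule itself, "if X workIn Y and Y isLocatedIn Z, then X workIn Z", which is only one plausible rule and is not stated on the slide.

```python
def infer_work_in(triples):
    """Return the workIn triples implied by the assumed rule:
    X workIn Y and Y isLocatedIn Z  =>  X workIn Z.
    The rule is applied repeatedly until nothing new is derived."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for (x, p1, y) in list(known):
            if p1 != "workIn":
                continue
            for (y2, p2, z) in list(known):
                if p2 == "isLocatedIn" and y2 == y:
                    derived = (x, "workIn", z)
                    if derived not in known:
                        known.add(derived)
                        changed = True
    # Only the newly derived (implicit) links are returned.
    return known - set(triples)


# ASSUMPTION: a plausible set of explicit statements for the figure.
explicit = [
    ("Christophe", "workIn", "VU Amsterdam"),
    ("VU Amsterdam", "isLocatedIn", "Amsterdam"),
    ("Amsterdam", "isLocatedIn", "The Netherlands"),
]
implicit = infer_work_in(explicit)
```

With these inputs the sketch derives two implicit links, so the network seen after reasoning has more edges than the one that was published.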

Page 12: ECCS 2010

People are trying to use the WoD

Billion Triple Challenges since 2008

"The specific goal of the Billion Triples Track is to demonstrate the scalability of applications as well as to encourage the development of applications that can deal with Web data. We stress that the goal of this is not to be a benchmarking effort between triple stores, but rather to demonstrate applications that can scale to a Web scale using realistic Web-quality data."
-- http://challenge.semanticweb.org/

Page 13: ECCS 2010

The WoD itself should be robust

• Are there central hubs whose failure would lead to a loss of connectivity?

 • The WoD is designed for automated agents, which are less able to recover from a loss of connectivity.

 • The robustness of the WoD should be ensured

 • Up till now, the WoD could be studied, searched and maintained like a classical database

Page 14: ECCS 2010

Network analysis

A new way of seeing the WoD 

What network analysis tells us

Page 15: ECCS 2010

A new way of seeing the WoD

Consider the WoD as a network

Page 16: ECCS 2010

Applying network analysis over the WoD

• Average path length

• Degree distribution

• Strongly connected components

• Degree centrality

• Betweenness centrality

• Closeness centrality
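As an illustration of three of the measures listed above, here is a small self-contained sketch for an undirected, unweighted toy graph. The normalisations are the common textbook ones and are an assumption; the talk does not say which variants or which tool were used. Betweenness centrality and strongly connected components are left out to keep the sketch short.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {node: len(nbs) / (n - 1) for node, nbs in adj.items()}


def closeness_centrality(adj):
    """Number of reachable nodes divided by the sum of distances to them."""
    result = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        total = sum(dist.values())
        result[node] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs that are connected."""
    total, pairs = 0, 0
    for node in adj:
        for other, d in bfs_distances(adj, node).items():
            if other != node:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: one hub connected to three leaves.
toy = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
toy_degree = degree_centrality(toy)
toy_closeness = closeness_centrality(toy)
toy_apl = average_path_length(toy)
```

In this star-shaped toy graph the hub scores highest on both centralities, which is the pattern the robustness question is about.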

Page 17: ECCS 2010

Scales of observation of the WoD: 1. Graph scale

Page 18: ECCS 2010

Graph-scale WoD network

• Each dataset is a node

• Edges are weighted, directed connections between the datasets
  o if there is at least one triple having a subject within dataset 1 and an object within dataset 2, then there is an edge from dataset 1 to dataset 2.
  o the number of such triples is the weight of the edge.
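The construction described above can be sketched as follows. How a URI is assigned to a dataset is not given on the slide; taking the host name of the URI as the dataset, and ignoring links that stay inside one dataset, are both assumptions of this sketch. The second DBpedia-to-UMBEL triple is made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def dataset_of(uri):
    """ASSUMPTION: the dataset of a URI is identified by its host name."""
    return urlparse(uri).netloc


def build_dataset_graph(triples):
    """Map each directed dataset pair (source, target) to its edge weight,
    i.e. the number of triples whose subject lies in the source dataset
    and whose object lies in the target dataset."""
    weights = Counter()
    for subject, _predicate, obj in triples:
        if urlparse(obj).scheme not in ("http", "https"):
            continue  # the object is a literal, not a node in another dataset
        source, target = dataset_of(subject), dataset_of(obj)
        if source != target:  # ASSUMPTION: intra-dataset links are skipped
            weights[(source, target)] += 1
    return weights


SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
example_triples = [
    ("http://dbpedia.org/resource/Amsterdam", SAME_AS,
     "http://umbel.org/umbel/ne/wikipedia/Amsterdam"),
    # Hypothetical triple, added only to show a weight above 1.
    ("http://dbpedia.org/resource/City", SAME_AS,
     "http://umbel.org/umbel/ne/wikipedia/City"),
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://dbpedia.org/resource/City"),
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/2000/01/rdf-schema#label", "Amsterdam"),
]
dataset_graph = build_dataset_graph(example_triples)
```

Here only the two sameAs triples cross a dataset boundary, so the result is a single edge from dbpedia.org to umbel.org with weight 2.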

     

Page 19: ECCS 2010

• 110 nodes with 350 edges
• Average path length is 2.16
• 50 components

Page 20: ECCS 2010

A degree of 7 is the critical point beyond which the network is no longer scale-free.

Page 21: ECCS 2010

Top central nodes

Betweenness centrality

Node             Value
DBpedia          0.332
DBLP Berlin      0.108
DBLP (RKB)       0.100
DBLP Hannover    0.097
FOAF profiles    0.075

Closeness centrality

Node             Value
DBpedia          0.762
Geonames         0.614
Drug Bank        0.576
Linked MDB       0.544
Flickr wrappr    0.526

Degree centrality

Node             Value
DBpedia          0.505
UniProt          0.266
DBLP (RKB)       0.266
ACM (RKB)        0.229
GeneID           0.211

Every centrality has a specific meaning...

Page 22: ECCS 2010

Scales of observation of the WoD: 2. Triple scale

Page 23: ECCS 2010

Triple-scale WoD network

• We took 10 million triples from the dataset crawled from the WoD and provided by the Billion Triple Challenge 2009

• This "BTC" network is defined as G=(V, (E, L)), where
  o V is a set of nodes, and each node is a URI or a literal
  o E is a set of edges
  o L is a set of labels, each label characterising a relation between nodes

• We applied a few strategies to aggregate data for comparison.

Page 24: ECCS 2010

Network                   Nodes   Edges   Average path length   Components
BTC                       605K    860K    2.15                  602K
BTC aggregated            14K     31K     2.80                  7K
BTC aggregated + filter   37      91      1.88                  17

Triple-scale network and its aggregations
• BTC aggregated: triples are aggregated by the domain names
• BTC aggregated + filter: only domain names shared with the graph-scale network
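To show what the "Components" column measures, here is a minimal sketch that counts connected components with a union-find structure. It ignores edge direction (weakly connected components), which is an assumption: the talk also lists strongly connected components among its measures, and those would need a different algorithm. The toy data is made up.

```python
def count_components(nodes, edges):
    """Count connected components, treating every edge as undirected."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent pointers to the representative, halving the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    return len({find(node) for node in nodes})


# Hypothetical toy network: one chain a-b-c plus two isolated nodes.
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c")]
toy_components = count_components(toy_nodes, toy_edges)
```

A component count close to the node count, as in the BTC row of the table, means that most nodes sit in very small fragments.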

Page 25: ECCS 2010

Degree distribution

[Figure: degree distribution plots for BTC and BTC aggregated, annotated "Power-law distribution"]
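A rough sketch of how a degree distribution can be checked against a power law: count how many nodes have each degree, then fit a straight line to log(count) against log(degree). This is a quick visual-style diagnostic, not a rigorous statistical test, and it is not necessarily the procedure used for the plots in the talk.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree k to the number of nodes that have degree k."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def loglog_slope(distribution):
    """Least-squares slope of log(count) against log(degree).
    A power law count ~ k^(-gamma) gives a slope of about -gamma."""
    xs = [math.log(k) for k in distribution]
    ys = [math.log(c) for c in distribution.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical star network: one hub of degree 3 and three leaves of degree 1.
star_distribution = degree_distribution(
    [("hub", "a"), ("hub", "b"), ("hub", "c")]
)

# Synthetic distribution that follows count = 8 / k exactly.
synthetic_slope = loglog_slope({1: 8, 2: 4, 4: 2, 8: 1})
```

For the synthetic distribution the fitted slope is -1, matching the exponent it was built with.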

Page 26: ECCS 2010

Top central nodes:

Page 27: ECCS 2010

The next steps

 Open challenges

Ongoing research activities at VUA

Page 28: ECCS 2010

Challenges:

• Existence of implicit links

“Semantic virus”

[Figure: the example graph with the nodes Christophe, VU Amsterdam, Amsterdam and The Netherlands and the links workIn and isLocatedIn, extended with a node Asia and an additional isLocatedIn link]

Page 29: ECCS 2010

Challenges:

• Multi-relation links
  • FOAF (social networks + personal information)
  • SIOC (relations characterising blogs)
  • SWRC (describing research work)
  • …

Different filterings produce different networks. The centrality status of nodes changes with respect to the network considered.

• Dynamics
  • Data will be continuously added and linked.
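The remark that different filterings produce different networks can be sketched as follows: the same set of triples is filtered by vocabulary namespace, and the most connected node differs between the resulting networks. The namespaces are the published FOAF and SWRC ones; the `ex:` nodes and the triples are made up for illustration.

```python
from collections import Counter

FOAF = "http://xmlns.com/foaf/0.1/"
SWRC = "http://swrc.ontoware.org/ontology#"


def filter_by_vocabulary(triples, namespace):
    """Keep only the links whose predicate belongs to the given vocabulary."""
    return [(s, o) for s, p, o in triples if p.startswith(namespace)]


def top_by_degree(edges):
    """Node with the highest degree in the filtered network."""
    degree = Counter()
    for s, o in edges:
        degree[s] += 1
        degree[o] += 1
    return degree.most_common(1)[0][0]


# Hypothetical data: a small social network and a small publication network
# that share some of their nodes.
toy_triples = [
    ("ex:alice", FOAF + "knows", "ex:bob"),
    ("ex:alice", FOAF + "knows", "ex:carol"),
    ("ex:paper1", SWRC + "author", "ex:bob"),
    ("ex:paper2", SWRC + "author", "ex:bob"),
    ("ex:paper3", SWRC + "author", "ex:carol"),
]

top_in_foaf = top_by_degree(filter_by_vocabulary(toy_triples, FOAF))
top_in_swrc = top_by_degree(filter_by_vocabulary(toy_triples, SWRC))
```

In the FOAF-only network the most connected node is ex:alice, while in the SWRC-only network it is ex:bob, so any statement about "the central node" depends on which relations are kept.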

Page 30: ECCS 2010

“sameAs” networks

Page 31: ECCS 2010

Monitoring and Improving the WoD

• Linked data is meant to be browsed, jumping from one resource to another

• The presence of hubs is critical for the paths
• Create alternate paths to be used in case of failure
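To illustrate why hubs are critical for paths and what an alternate path buys, here is a small sketch that searches for a shortest path while treating some nodes as failed. It is a plain breadth-first search on made-up data, not the link-recommendation method of the paper cited below.

```python
from collections import deque


def shortest_path(adj, source, target, failed=frozenset()):
    """Breadth-first shortest path from source to target that never
    visits a failed node. Returns None when no such path exists."""
    if source in failed or target in failed:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adj.get(node, ()):
            if neighbour not in previous and neighbour not in failed:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical network in which every path from A to B goes through "hub".
fragile = {"A": ["hub"], "hub": ["A", "B"], "B": ["hub"]}
normal_route = shortest_path(fragile, "A", "B")
route_after_failure = shortest_path(fragile, "A", "B", failed={"hub"})

# The same network with one alternate path A - C - B added.
robust = {"A": ["hub", "C"], "hub": ["A", "B"], "C": ["A", "B"], "B": ["hub", "C"]}
alternate_route = shortest_path(robust, "A", "B", failed={"hub"})
```

In the fragile network the failure of the hub disconnects A from B, whereas the added links through C keep B reachable.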

 

Guéret, Groth, van Harmelen, Schlobach, "Finding the Achilles Heel of the Web of Data: using network analysis for link-recommendation", ISWC2010 - To appear

Page 32: ECCS 2010

We need to study more!

{cgueret, swang, schlobac}@few.vu.nl