Dec 21, 2015

Page 1: How is the Semantic Web Being Used? An Analysis of the Billion Triples Challenge Corpus Mike Dean Principal Engineer BBN Technologies mdean@bbn.com 1.

How is the Semantic Web Being Used? An Analysis of the Billion Triples Challenge Corpus

Mike Dean
Principal Engineer
BBN Technologies
mdean@bbn.com


Assumptions

• Technology – Intermediate
  – Familiarity with RDF and OWL
• Interest in
  – Semantic Web usage patterns
  – Semantic Web Challenge


Presenter Background

• Principal Engineer at BBN Technologies (1984-present)
• Principal Investigator for DARPA Agent Markup Language (DAML) Integration and Transition (2000-2005)
  – Chaired the Joint US/EU Committee that developed DAML+OIL and SWRL
• Developer and/or Principal Investigator for many Semantic Web tools, datasets, and applications (2000-present)
• Member of the W3C RDF Core, Web Ontology, and Rule Interchange Format Working Groups
  – Co-editor of the W3C OWL Reference
• Member of the Semantic Web Challenge Advisory Board since its inception
• Local co-chair for ISWC2009
• Other SemTech presentations
  – Semantic Query: Solving the Needs of a Net-Centric Data Sharing Environment (2007, w/ Matt Fisher)
  – Semantic Queries and Mediation in a RESTful Architecture (2008, w/ John Gilman and Matt Fisher)
  – Use of SWRL for Ontology Translation (2008)
  – Semantic Web @ BBN: Application to the Digital Whitewater Challenge (2009, w/ John Hebeler)


Semantic Web Challenge

• Founded in 2003 by Michel Klein and Ubbo Visser
• Demonstrate the value of the Semantic Web through applications
• Submissions evaluated according to a set of minimal requirements and additional desirable features
• Has become an annual event at International Semantic Web Conferences
  – 22 submissions in 2008


2008 Billion Triples Challenge

• A new Semantic Web Challenge track in 2008
  – Do “something interesting” with a large subset of a billion provided triples
  – Co-chaired by Jim Hendler and Peter Mika
• 12 real web data sets
  – Not a scientific sample
  – Enough to be interesting and probably representative
  – Stable snapshot
• Our analysis initially arose from discussing a possible application
  – We now know “yes, there is enough data to support what we wanted to do”
  – Tools and techniques should be generally applicable to other corpora


2008 Billion Triples Corpus

Data Set    Format         Triples         URLs   Size      Composition
Webscope    WARC        82,768,342    1,979,022   2.7 GB    Heterogeneous
Falcon      WARC        32,512,340      541,518   834 MB    Heterogeneous
Swoogle     WARC       174,981,639    1,468,766   3.2 GB    Heterogeneous
Watson      WARC        59,750,019      130,701   267 MB    Heterogeneous
SWSE-1      WARC        30,346,451      194,259   4 GB      Heterogeneous
SWSE-2      WARC        60,504,716      389,107   2.4 GB    Heterogeneous
DBpedia     tar.gz     110,241,463           29   1.9 GB    Homogeneous
Geonames    WARC        69,778,255    6,668,395   3.4 GB    Homogeneous
SwetoDBLP   tar.gz      14,936,600            1   167 MB    Homogeneous
WordNet     tar.gz       1,942,887            1   17 MB     Homogeneous
Freebase    tar.gz      63,069,952            1   569 MB    Heterogeneous
US Census   tar.gz     445,752,172            1   3.3 GB    Homogeneous
TOTAL                1,146,584,836   11,371,801   22.8 GB

http://www.cs.vu.nl/~pmika/swc/btc.html


Data Set Characterization

• Metrics that can impact selection/tuning of KB implementations
  – Statement count
  – Number of classes and predicates
  – Statements per subject/predicate/object
  – Degree of interconnectedness (percentage of non-literal statements, with/without rdf:type)
  – RDFS and OWL reasoning employed
  – Use of reification


Analysis

• Stream processing of the compressed data set archives
  – Statement counts
  – Datatype, language, predicate, and type counts
    • Use of RDF, RDFS, OWL, FOAF, and other vocabularies
  – (May include duplicate statements)
• Load each dataset into its own Parliament KB
  – (Eliminates duplicates within dataset)
• (Both programs used code based on Peter Mika’s WARC example with the OpenRDF RIO parser and no inference)
• Process the statement and resource tables
  – Mark each node as resource and/or literal
  – URI, blank node, and literal counts
  – Chain length statistics and histograms
  – (Parliament worked very well here. Each operation took 1-736 seconds.)
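The single-pass counting step can be illustrated with a minimal Python sketch. It is not the code used for the analysis (which, per the slide, was built on Peter Mika's WARC example and the OpenRDF RIO parser). It assumes the input is already line-oriented N-Triples, uses a deliberately simplified regular expression rather than a full parser, and the names `stream_counts` and `stream_counts_gz` are hypothetical.

```python
import gzip
import re
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified N-Triples pattern (an assumption, not a conforming parser):
# subject term, predicate URI, then everything up to the final " ." as object.
TRIPLE = re.compile(r'^(\S+)\s+<([^>]*)>\s+(.*?)\s*\.\s*$')


def stream_counts(lines):
    """Count statements, predicates, and rdf:type objects in one pass."""
    statements = 0
    predicates = Counter()
    types = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = TRIPLE.match(line)
        if match is None:
            continue  # this sketch silently skips lines it cannot parse
        _subject, predicate, obj = match.groups()
        statements += 1
        predicates[predicate] += 1
        if predicate == RDF_TYPE:
            types[obj.strip("<>")] += 1
    return statements, predicates, types


def stream_counts_gz(path):
    """Run the same counts over a gzip-compressed file without unpacking it."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return stream_counts(handle)
```

Because nothing is retained except the counters, memory use depends on the number of distinct predicates and types rather than on the number of statements. Duplicate statements are counted each time they appear.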


Stream Processing

• Many Semantic Web tools provide streaming parsers rather than, or in addition to, model access
  – Analogous to XML SAX vs. DOM
• For suitable applications, this can be a lot faster than loading statements into a KB
• Streaming analysis of the 2009 corpus was performed at an overall rate of 103K statements/sec on a Mac laptop with a portable external disk
  – Compare to loading 10-20K statements/second on a server
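To put these rates in perspective, the following back-of-the-envelope sketch converts them into wall-clock time. It assumes the 2009 corpus size of 1,151,383,509 statements reported later in the deck and treats the quoted rates as constant; the helper name `hours_to_process` is hypothetical.

```python
CORPUS_2009_STATEMENTS = 1_151_383_509  # 2009 corpus size quoted in the deck


def hours_to_process(statements, rate_per_sec):
    """Wall-clock hours needed at a constant statements/second rate."""
    return statements / rate_per_sec / 3600.0


streaming = hours_to_process(CORPUS_2009_STATEMENTS, 103_000)
loading_fast = hours_to_process(CORPUS_2009_STATEMENTS, 20_000)
loading_slow = hours_to_process(CORPUS_2009_STATEMENTS, 10_000)

print(f"streaming at 103K/s: {streaming:.1f} h")
print(f"loading at 20K/s:    {loading_fast:.1f} h")
print(f"loading at 10K/s:    {loading_slow:.1f} h")
```

Under these assumptions, streaming takes roughly 3 hours, while loading at 10-20K statements/second takes roughly 16 to 32 hours.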


Classes and Predicates

Data Set    Classes   Predicates
Webscope        724          782
Falcon       19,660       29,248
Swoogle      33,318       33,981
Watson       13,660       18,091
SWSE-1          115        1,040
SWSE-2          104          625
DBpedia           4          288
Geonames          1           17
SwetoDBLP        11          145
WordNet          22           41
Freebase          0        5,008
US Census         8        1,682


Statements

• Statement (subject, predicate, object)
  – Resource object
    • rdf:type predicate
    • Other predicate
  – Literal object
    • rdf:datatype
    • Plain literal
      – xml:lang
      – Neither datatype nor language
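The taxonomy above assigns every statement to exactly one of five leaf categories. A minimal sketch of that assignment follows, assuming predicates are given as bare URIs and objects as N-Triples term strings; the function names and category labels are illustrative choices that mirror the column headings used in the statement tables, not identifiers from the original analysis code.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CATEGORIES = ("rdf:type", "rdf:resource", "rdf:datatype", "xml:lang", "neither")


def classify_statement(predicate, obj):
    """Return the leaf category for one statement.

    `obj` is assumed to be an N-Triples term: <uri>, _:blank, or a
    quoted literal with an optional ^^<datatype> or @lang suffix.
    """
    if not obj.startswith('"'):
        # Resource object: split on whether the predicate is rdf:type.
        return "rdf:type" if predicate == RDF_TYPE else "rdf:resource"
    suffix = obj[obj.rfind('"') + 1:]
    if suffix.startswith("^^"):
        return "rdf:datatype"
    if suffix.startswith("@"):
        return "xml:lang"
    return "neither"


def category_percentages(statements):
    """Percentage of (predicate, object) pairs falling in each category."""
    counts = Counter(classify_statement(p, o) for p, o in statements)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * counts[name] / total for name in CATEGORIES}
```

Since the categories are mutually exclusive and exhaustive, the percentages for a data set should sum to 100, up to rounding.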


Statement % (distinct values)

Dataset     rdf:type      rdf:resource   rdf:datatype   xml:lang   Neither
Webscope    24 (724)      32             3 (10)         14 (93)    27
Falcon      16 (19,660)   50             16 (72)        9 (252)    18
Swoogle     15 (33,318)   39             2 (87)         18 (280)   26
Watson      16 (13,660)   40             2 (79)         29 (162)   13
SWSE-1      13 (115)      53             0 (1)          32 (6)     1
SWSE-2      13 (104)      53             0 (1)          32 (15)    1
DBpedia     0 (4)         91             0 (6)          8 (1)      0
Geonames    10 (1)        49             0 (0)          1 (342)    41
SwetoDBLP   18 (11)       28             14 (4)         0 (0)      41
WordNet     24 (22)       30             0 (1)          46 (1)     0
Freebase    0 (0)         62             0 (0)          19 (169)   19
US Census   0 (8)         19             78 (2)         0 (0)      3


Resources and Literals

• Node
  – Resource
    • URI
    • Blank Node
  – Literal
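A node's kind can be read directly off its N-Triples syntax. The sketch below tallies distinct nodes by kind. It assumes that only terms in subject and object position count as nodes, which may differ from how Parliament's resource table was actually processed; `node_kind` and `node_percentages` are hypothetical names.

```python
from collections import Counter


def node_kind(term):
    """Classify an N-Triples term as 'URI', 'Blank Node', or 'Literal'."""
    if term.startswith("<"):
        return "URI"
    if term.startswith("_:"):
        return "Blank Node"
    if term.startswith('"'):
        return "Literal"
    raise ValueError(f"unrecognized term: {term!r}")


def node_percentages(triples):
    """Percentage of distinct subject/object nodes of each kind."""
    nodes = set()
    for subject, _predicate, obj in triples:
        nodes.add(subject)
        nodes.add(obj)
    counts = Counter(node_kind(node) for node in nodes)
    total = sum(counts.values())
    return {kind: 100.0 * counts[kind] / total
            for kind in ("URI", "Blank Node", "Literal")}
```

Each distinct node is counted once, no matter how many statements mention it, so these figures describe the shape of the graph rather than the statement mix.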


Node %

Data Set    URI   Blank Node   Literal
Webscope     24           53        23
Falcon       56           13        31
Swoogle      31           34        41
Watson       29           32        40
SWSE-1       39           36        25
SWSE-2       35           42        23
DBpedia      74            0        26
Geonames     45            0        55
SwetoDBLP    27           17        56
WordNet      55            0        45
Freebase     52            0        48
US Census     0           98         2


Chain Lengths

• How long are the linked-list chains used by Parliament?
  – How many statements share the same subject, predicate, or object?
• Histograms proved unwieldy
  – Presenting summary statistics instead
• rdf:type statements significantly impact results
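Read as "how many statements share a given subject, predicate, or object", a chain length is a group size, and the summary statistics are the mean and standard deviation of those sizes. The sketch below works on in-memory tuples rather than on Parliament's tables. It uses the population standard deviation, which is an assumption because the deck does not say which estimator was used, and `chain_stats` is a hypothetical name.

```python
from collections import Counter
from statistics import mean, pstdev

SUBJECT, PREDICATE, OBJECT = 0, 1, 2


def chain_stats(triples, position):
    """Mean and population std dev of statements per distinct value.

    `position` selects the subject, predicate, or object slot of each
    (subject, predicate, object) tuple.
    """
    chain_lengths = list(Counter(t[position] for t in triples).values())
    return mean(chain_lengths), pstdev(chain_lengths)


triples = [
    ("a", "p", "x"),
    ("a", "p", "y"),
    ("a", "q", "x"),
    ("b", "p", "x"),
]
for name, position in (("subject", SUBJECT),
                       ("predicate", PREDICATE),
                       ("object", OBJECT)):
    avg, spread = chain_stats(triples, position)
    print(f"{name}: mean={avg} std dev={spread}")
```

A standard deviation far larger than the mean points to a heavily skewed distribution, in which a few values (such as a common rdf:type object) account for very long chains.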


Mean chain lengths (std dev)

Data Set    Subject       Predicate               Object          Literal Object
Webscope    3.96 (9.77)   87,900 (722,575)        3.43 (2,170)    4.33 (659)
Falcon      4.22 (13)     983 (31,773)            2.56 (328)      2.31 (217)
Swoogle     5.65 (36)     4,464 (188,023)         3.27 (1,793)    3.38 (569)
Watson      5.58 (56)     3,040 (98,288)          2.87 (918)      2.91 (407)
SWSE-1      5.25 (15)     25,404 (289,000)        2.46 (1,138)    2.29 (187)
SWSE-2      5.37 (15)     83,773 (739,736)        2.89 (1,741)    2.87 (300)
DBpedia     15 (39)       300,855 (3,560,666)     3.84 (148)      1.17 (22)
Geonames    10.4 (1.66)   4,096,150 (3,167,048)   2.81 (1,623)    1.67 (15)
SwetoDBLP   5.63 (3.82)   103,009 (325,380)       2.93 (629)      2.36 (168)
WordNet     4.18 (2.04)   47,387 (100,907)        2.53 (295)      2.39 (271)
Freebase    4.45 (15)     12,329 (316,363)        2.79 (1,286)    1.83 (116)
US Census   5.39 (9.18)   265,005 (1,921,537)     5.29 (15,916)   227 (115,616)


RDF/RDFS/OWL Usage

• 80,309,558 rdf:type statements in 11 data sets
• 4,033,540 rdfs:subClassOf statements in 6 data sets
• 2,988,396 owl:Class instances in 6 data sets
• 1,492,214 rdf:_1 statements in 7 data sets
• 1,042,032 owl:Restriction instances in 5 data sets
• 480,771 owl:sameAs statements in 9 data sets
• 299,962 rdfs:Class instances in same 6 data sets as owl:Class
• 265,124 rdfs:domain statements in 6 data sets
• 252,175 rdfs:range statements in 6 data sets
• ~238,000 reified statements in 4 data sets
• 50,482 instances of rdf:Bag in 5 data sets
• 22,154 instances of owl:Ontology in 5 data sets
• 14,913 owl:imports statements in 3 data sets
• 83 rdf:_2000 statements in 3 data sets
• 1 rdf:_10763 statement in 1 data set
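The rdf:_1, rdf:_2000, and rdf:_10763 figures count uses of RDF container membership properties, which have the form rdf:_n for a positive integer n. One way to pick them out of a stream of predicate URIs is sketched below; the function names are hypothetical, and this is an illustration rather than the original counting code.

```python
import re
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# rdf:_n with n a positive decimal integer and no leading zeros.
MEMBERSHIP = re.compile(re.escape(RDF_NS) + r"_([1-9][0-9]*)")


def membership_index(predicate):
    """Return n if `predicate` is rdf:_n, otherwise None."""
    match = MEMBERSHIP.fullmatch(predicate)
    return int(match.group(1)) if match else None


def membership_histogram(predicates):
    """Count statements per container index across a predicate stream."""
    histogram = Counter()
    for predicate in predicates:
        index = membership_index(predicate)
        if index is not None:
            histogram[index] += 1
    return histogram
```

The largest key in the histogram is the highest container index in use, so a single rdf:_10763 statement means that at least one container has a member at position 10,763.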


Popular Vocabularies

• FOAF
  – 29,308,169 Person instances in 7 data sets
  – 25,864,527 knows statements in 6 data sets
• Dublin Core
  – 43,591,844 title statements in 7 data sets
  – 4,416,716 date statements in 6 data sets
• Geospatial
  – 7,075,380 wgs84_pos:lat statements in 9 data sets
  – 4,436 georss:point statements in 5 data sets
• SKOS
  – 6,619,912 subject statements in 4 data sets
  – 403,912 Concept instances in 4 data sets
• RSS 1.0
  – 2,893,750 item instances in 6 data sets
• OWL-S
  – 92 0.9-1.2 Profiles in 3 data sets
• OWL-Time
  – No usage?


Errors

• 95,937 Java exceptions
• Lots of bad languages and datatypes
• Lots of namespace/URI typos/confusion
• Slightly different statement counts, due to exceptions, duplicates, etc.
  – 1,063,616,774 statements (4% less)


Crawled Data

• Webscope, Falcon, Swoogle, Watson, SWSE-1, and SWSE-2 consisted of crawled data from a wide range of sites
  – Included some data I published in 2002


DBpedia

• Information extracted from Wikipedia pages
• Example

<http://dbpedia.org/resource/San_Jose%2C_California>
    rdfs:label "San Jose, California"@en ;
    dbpedia:officialName "City of San Jose"@en ;
    geo:lat "37.304"^^xsd:float ;
    geo:long "-121.873"^^xsd:float ;
    dbpedia:populationTotal "929936" ;
    dbpedia:areaLandSqMi "174.9" ;
    dbpedia:timezone <http://dbpedia.org/resource/Pacific_Time_Zone> ;
    foaf:homepage <http://www.sanjoseca.gov> ;
    foaf:img <http://upload.wikimedia.org/wikipedia/commons/3/3f/SJPan.jpg> ;
    foaf:page <http://en.wikipedia.org/wiki/San_Jose%2C_California> ;
    dbpedia:wikilink <http://dbpedia.org/resource/April_3> , ... ;
    owl:sameAs <http://sws.geonames.org/5392171/> ,
        <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose> .

• See http://dbpedia.org


Freebase

• Collections of curated datasets
  – RDF-like data model
  – Data exports available, but no standard mapping to RDF until rdf.freebase.com was announced at ISWC2008
    • Follows Linked Data principles
    • Standard RDF dump still not available
• Some anomalies in the corpus mappings affected statistics
  – Used freebase:type rather than rdf:type
  – Language codes had a prepended /, e.g. “/en”
  – freebase.org (a different site) should be freebase.com
• Example

<http://www.freebase.org/guid/9202a8c04000641f800000000006809a>
    <http://www.freebase.org/type/object/name> "San Jose, California"@/en ;
    <http://www.freebase.org/type/object/type> <http://www.freebase.org/location/citytown> ,
        <http://www.freebase.org/location/us_citytown> ;
    <http://www.freebase.org/location/citytown/founded> "1777-11-29" ;
    <http://www.freebase.org/location/location/area> "461.5" .

• See http://freebase.com and http://rdf.freebase.com


Geonames

• 8 million geographic names and locations
• Example

<http://sws.geonames.org/6484236/>
    a geonames:Feature ;
    geonames:featureClass geonames:S ;
    geonames:featureCode geonames:S.HTL ;
    geonames:inCountry <http://www.geonames.org/countries/#US> ;
    geonames:locationMap "http://www.geonames.org/6484236/the-fairmont-san-jose.html" ;
    geonames:name "The Fairmont San Jose" ;
    geonames:nearbyFeatures <http://sws.geonames.org/6484236/nearby.rdf> ;
    geonames:parentFeature <http://sws.geonames.org/5332921/> ;
    geo:lat "37.3326" ;
    geo:long "-121.8893" .

• See http://geonames.org


SwetoDBLP

• Metadata on publications in Computer Science (originally Databases and Logic Programming)
• Example

<http://dblp.uni-trier.de/rec/bibtex/conf/geos/KolasHD05>
    a opus:Article_in_Proceedings ;
    rdfs:label "Geospatial Semantic Web: Architecture of Ontologies." ;
    opus:author [
        a rdf:Seq ;
        rdf:_1 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/k/Kolas:Dave.html> ;
        rdf:_2 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/h/Hebeler:John.html> ;
        rdf:_3 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Dean:Mike.html>
    ] ;
    opus:isIncludedIn <http://dblp.uni-trier.de/rec/bibtex/conf/geos/2005> ;
    opus:book_title "GeoS" ;
    opus:year "2005"^^xsd:gYear ;
    opus:pages "183-194" ;
    dcelem:relation "http://www.informatik.uni-trier.de/~ley/db/conf/geos/geos2005.html#KolasHD05" ;
    opus:last_modified_date "2005-11-08"^^xsd:date .

<http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Dean:Mike.html>
    a foaf:Person ;
    foaf:name "Mike Dean" .

• See http://lsdis.cs.uga.edu/projects/semdis/swetodblp/


WordNet

• Lexical database of English, including multiple word senses and synonym sets
• Example

wn20instances:wordsense-semantic-adjective-1
    a wn20schema:AdjectiveWordSense ;
    rdfs:label "semantic"@en-us ;
    wn20schema:adjectivePertainsTo wn20instances:wordsense-semantics-noun-1 ;
    wn20schema:tagCount "3"@en-us ;
    wn20schema:word wn20instances:word-semantic .

wn20instances:word-semantic
    a wn20schema:Word ;
    wn20schema:lexicalForm "semantic"@en-us .

• See http://www.w3.org/2006/03/wn/wn20/


US Census

• 1 billion triples published by Joshua Tauberer in April 2007
• Highly tabular data
• Example

<http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose>
    a <http://www.rdfabout.com/rdf/schema/usgovt/Town> ;
    dc:title "San Jose" ;
    dcterms:hasPart <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/fruitdale> ,
        <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/seven_trees> , ... ;
    dcterms:isPartOf <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county> ;
    census:details <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/censustables> ;
    census:households 559949 ;
    census:landArea "1144714122 m^2" ;
    census:population 1621316 ;
    census:waterArea "20064384 m^2" ;
    geo:lat "37.318892" ;
    geo:long "-121.928244" .

• See http://www.rdfabout.com/demo/census/


2009 Corpus

• All crawled data, using Falcon-S, Sindice, Swoogle, SWSE, and Watson
• 1,151,383,509 statements in 116 chunks of 10 million
• Represented in NQuads format
  – Explicit source/context for each statement
  – No parsing errors
• See http://vmlion25.deri.ie/
  – Includes sampled statistics (which I found to be highly accurate)
• Sources by “Pay Level Domain”
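Because every NQuads statement carries its source as a fourth term, statements can be attributed to a domain in a single pass. The sketch below extracts that context URI and groups by an approximate pay level domain. It assumes every line has a context and simply takes the last two host labels, which is wrong for suffixes such as .co.uk (a real implementation would consult the Public Suffix List). All function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def context_of(nquad_line):
    """Return the context (fourth term) URI of one N-Quads line, or None.

    Assumes the line really has a context; for a plain triple ending in
    a URI object this would return the object instead.
    """
    body = nquad_line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    if not body.endswith(">"):
        return None
    return body[body.rfind("<") + 1:-1]


def naive_pay_level_domain(uri):
    """Approximate the pay level domain as the last two host labels."""
    host = urlsplit(uri).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def statements_per_domain(lines):
    """Count statements per approximate pay level domain of their source."""
    counts = Counter()
    for line in lines:
        context = context_of(line)
        if context is not None:
            counts[naive_pay_level_domain(context)] += 1
    return counts
```

Grouping by domain rather than by crawler answers the question of who publishes the data, independent of which search engine happened to collect it.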


LUBM

• The Lehigh University Benchmark (LUBM) is widely used for Semantic Web benchmarking
  – Synthetic data generated for a specified number of universities
• Example

<http://www.Department0.University0.edu/FullProfessor0>
    a ub:FullProfessor ;
    ub:doctoralDegreeFrom <http://www.University241.edu> ;
    ub:emailAddress "FullProfessor0@Department0.University0.edu" ;
    ub:mastersDegreeFrom <http://www.University875.edu> ;
    ub:name "FullProfessor0" ;
    ub:researchInterest "Research20" ;
    ub:teacherOf <http://www.Department0.University0.edu/GraduateCourse1> ,
        <http://www.Department0.University0.edu/Course0> ,
        <http://www.Department0.University0.edu/GraduateCourse0> ;
    ub:telephone "xxx-xxx-xxxx" ;
    ub:undergraduateDegreeFrom <http://www.University84.edu> ;
    ub:worksFor <http://www.Department0.University0.edu> .

• See http://swat.cse.lehigh.edu/projects/lubm/


Statement % (distinct values)

Dataset     rdf:type       rdf:resource   rdf:datatype   xml:lang   Neither
Webscope    24 (724)       32             3 (10)         14 (93)    27
Falcon      16 (19,660)    50             16 (72)        9 (252)    18
Swoogle     15 (33,318)    39             2 (87)         18 (280)   26
Watson      16 (13,660)    40             2 (79)         29 (162)   13
SWSE-1      13 (115)       53             0 (1)          32 (6)     1
SWSE-2      13 (104)       53             0 (1)          32 (15)    1
DBpedia     0 (4)          91             0 (6)          8 (1)      0
Geonames    10 (1)         49             0 (0)          1 (342)    41
SwetoDBLP   18 (11)        28             14 (4)         0 (0)      41
WordNet     24 (22)        30             0 (1)          46 (1)     0
Freebase    0 (0)          62             0 (0)          19 (169)   19
US Census   0 (8)          19             78 (2)         0 (0)      3
BTC 2009    12 (283,612)   57             1 (198)        7 (386)    22
LUBM 1      20 (15)        48             0 (0)          0 (0)      32


RDF/RDFS/OWL Usage

                           2008          2009
rdf:type             80,309,558   143,293,758
rdfs:subClassOf       4,033,540     2,712,766
owl:Class             2,988,396     2,680,081
rdf:_1                1,492,214       757,717
owl:Restriction       1,042,032       440,750
owl:sameAs              480,771     6,565,347
rdfs:Class              299,962       186,770
rdfs:domain             265,124       195,053
rdfs:range              252,175       187,746
reified statements     ~238,000      ~328,000
rdf:Bag                  50,482        47,843
owl:Ontology             22,154       445,994
owl:imports              14,913       212,731
rdf:_2000                    83         2,018
rdf:_10763                    1            43
rdf:_32061                    0           130
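A quick way to read the year-over-year table is to compute the 2009/2008 ratio for each term. The sketch below does this for a hand-picked subset of rows, with counts copied from the table; the selection and the helper name `growth_ratios` are illustrative only.

```python
# (2008 count, 2009 count) for a subset of rows from the table above.
USAGE = {
    "rdf:type": (80_309_558, 143_293_758),
    "rdfs:subClassOf": (4_033_540, 2_712_766),
    "owl:Restriction": (1_042_032, 440_750),
    "owl:sameAs": (480_771, 6_565_347),
    "owl:Ontology": (22_154, 445_994),
    "owl:imports": (14_913, 212_731),
}


def growth_ratios(usage):
    """Map each term to its 2009/2008 ratio, largest growth first."""
    ratios = {term: after / before for term, (before, after) in usage.items()}
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))


for term, ratio in growth_ratios(USAGE).items():
    print(f"{term:18s} x{ratio:.2f}")
```

Ratios above 1 indicate growth and ratios below 1 indicate decline. The two corpora were assembled differently, so the ratios reflect changes in corpus composition as well as changes in actual usage.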


Popular Vocabularies

                       2008         2009
foaf:Person      29,308,169   38,790,680
foaf:knows       25,864,527   35,811,115
dc:date           4,416,716   12,537,177
dc:title         43,591,844   22,326,441
wgs84_pos:lat     7,075,380    7,398,911
georss:point          4,436      367,291
skos:subject      6,619,912   18,257,337
skos:Concept        403,912      697,311
rss:item          2,893,750   13,687,021
owls:Profile             92          138


Corpus Composition

2008               2009
US Census   39%    rdfabout.com   2%
DBpedia     10%    dbpedia.org    35%
GeoNames     6%    geonames.org   11%
Freebase     6%    freebase.com   1%
OTHER       40%    OTHER          50%
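The 2008 column can be reproduced from the per-data-set triple counts given in the 2008 corpus table earlier in the deck. The sketch below does that arithmetic; the dictionary layout and the function name `composition` are illustrative, and the counts are copied from that table.

```python
# Triple counts per 2008 data set, as listed in the corpus table.
TRIPLES_2008 = {
    "Webscope": 82_768_342, "Falcon": 32_512_340, "Swoogle": 174_981_639,
    "Watson": 59_750_019, "SWSE-1": 30_346_451, "SWSE-2": 60_504_716,
    "DBpedia": 110_241_463, "Geonames": 69_778_255, "SwetoDBLP": 14_936_600,
    "WordNet": 1_942_887, "Freebase": 63_069_952, "US Census": 445_752_172,
}
NAMED = ("US Census", "DBpedia", "Geonames", "Freebase")


def composition(triples, named):
    """Rounded percentage share of each named data set, plus OTHER."""
    total = sum(triples.values())
    shares = {name: round(100 * triples[name] / total) for name in named}
    other = sum(count for name, count in triples.items() if name not in named)
    shares["OTHER"] = round(100 * other / total)
    return shares


print(composition(TRIPLES_2008, NAMED))
```

OTHER is computed from the remaining data sets rather than as 100 minus the named shares, so the rounded figures need not sum to exactly 100.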


Further Analysis

• Node level comparison of the 2009 corpus
• Increased factoring of rdf:type statements
  – How many rdf:type’s are associated with each resource?
• Overlap between 2008 and 2009 corpora
• Analysis and reporting by Pay Level Domain rather than dataset
  – By vocabulary (aggregated source vs. aggregated predicate/type)
• Drilldown into particular patterns, e.g. 32K element set/bag
• Additional graph metrics (e.g. diameter)


2008 Billion Triples Winners

• SemaPlorer: map-based exploration and visualization
• SearchWebDB: inexact keyword search
• MaRVIN: scalable reasoning from LarKC
• i-MoCo: storage and browsing of 250M+ triples with an iPhone application
• SAOR: Scalable Authoritative OWL Reasoning
• Virtuoso: sophisticated storage and querying


2009 Challenge

• Consider entering the Semantic Web Challenge

• Submissions due October 1

• Submissions will be presented and winners named at the 8th International Semantic Web Conference (ISWC2009) October 25-29 near Washington, DC


More Information

• Semantic Web Challenge
  – http://challenge.semanticweb.org
• Analysis Code and Raw Data
  – 2008: http://asio.bbn.com/2008/10/btc/
  – 2009: http://asio.bbn.com/2009/06/btc/
• ISWC2009
  – http://iswc2009.semanticweb.org
