DBPEDIA INSIDEOUT: AN INTRODUCTION TO THE MAJOR HUB FOR LINKED OPEN DATA Cristina Pattuelli, Pratt Institute March 16, 2015
Page 1: DBpedia InsideOut

DBPEDIA INSIDEOUT: AN INTRODUCTION TO THE MAJOR HUB FOR

LINKED OPEN DATA

Cristina Pattuelli, Pratt Institute

March 16, 2015

Page 2: DBpedia InsideOut

“DBpedia is the Semantic Web mirror of Wikipedia”

Page 3: DBpedia InsideOut

WHAT IT IS

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web in the form of Linked Open Data.

Page 4: DBpedia InsideOut

Source: http://lod-cloud.net/

THE STATE OF THE LOD CLOUD 2014

Page 5: DBpedia InsideOut

Source: http://lod-cloud.net/

THE STATE OF THE LOD CLOUD 2014

2011: 295 DATASETS 2014: 570 DATASETS (+93%)

Page 6: DBpedia InsideOut

Source: blog.classora.com/2012/10/10/describiendo-el-conocimiento-en-un-formato-estandar-para-la-web-semantica-rdf/

Page 7: DBpedia InsideOut

Connected with other Linked Datasets by 50 million RDF links

Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows

CENTRAL INTERLINKING HUB OF THE WEB OF DATA

Page 8: DBpedia InsideOut

Web of Data Browsing and Crawling Web Data Integration and Mashups

Page 9: DBpedia InsideOut

“Which albums did Miles Davis record with female instrumentalists?”
“Which populated places in Australia are below sea level?”
“What did Andy Warhol and Thelonious Monk have in common?”

Page 10: DBpedia InsideOut

PAEAN TO DBPEDIA

Multi-domain
Automatically evolving
Community consensus driven

Multilingual: >125 language editions
Accessible on the Web

Page 11: DBpedia InsideOut

DBPEDIA SEMANTICS

4.58 million “things” 583 million “facts”

Page 12: DBpedia InsideOut

“THINGS”

Each thing in the DBpedia dataset is identified by a URI of the form http://dbpedia.org/resource/Name, where Name is derived from the URL of the source Wikipedia article, which has the form

http://en.wikipedia.org/wiki/Name
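The naming rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming an English Wikipedia article URL of the form shown on the slide; the function name `wikipedia_to_dbpedia` is hypothetical and not part of any DBpedia tool.

```python
from urllib.parse import urlsplit

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
WIKI_PATH_PREFIX = "/wiki/"


def wikipedia_to_dbpedia(wikipedia_url: str) -> str:
    """Derive a DBpedia resource URI from a Wikipedia article URL.

    Hypothetical helper: it only reuses the article name found after
    "/wiki/" and performs no lookup against DBpedia itself.
    """
    path = urlsplit(wikipedia_url).path
    if not path.startswith(WIKI_PATH_PREFIX):
        raise ValueError(f"not a Wikipedia article URL: {wikipedia_url}")
    name = path[len(WIKI_PATH_PREFIX):]
    if not name:
        raise ValueError(f"no article name in URL: {wikipedia_url}")
    return DBPEDIA_RESOURCE + name


print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Billie_Holiday"))
# prints http://dbpedia.org/resource/Billie_Holiday
```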

Page 13: DBpedia InsideOut

http://dbpedia.org/page/Billie_Holiday

Dereferencing the URI DBpedia: Billie Holiday’s Green Page

Page 14: DBpedia InsideOut

http://en.wikipedia.org/wiki/Billie_Holiday

Page 15: DBpedia InsideOut

http://dbpedia.org/resource/Billie_Holiday
http://en.wikipedia.org/wiki/Billie_Holiday

Page 16: DBpedia InsideOut

DBPEDIA SEMANTICS

4.58 million “things” 583 million “facts”

Page 17: DBpedia InsideOut

“Facts” as RDF Triples

has name

Subject Predicate Object (Thing)

Billie Holiday

Page 18: DBpedia InsideOut

GENERATING FACTS FOR THE ENTITY BILLIE HOLIDAY

has name

Subject Predicate Object

S <http://dbpedia.org/resource/Billie_Holiday>

P <http://xmlns.com/foaf/0.1/name>

O “Billie Holiday”

Billie Holiday

Page 19: DBpedia InsideOut

S <http://dbpedia.org/resource/Billie_Holiday>

P dbpedia-owl:alias (<http://dbpedia.org/ontology/alias>)

O “Lady Day”

Page 20: DBpedia InsideOut

S <http://dbpedia.org/resource/Billie_Holiday>

P dbpedia-owl:occupation (<http://dbpedia.org/ontology/occupation>)

O <http://dbpedia.org/resource/Songwriter>
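The three facts about Billie Holiday can be written out as subject/predicate/object tuples and serialized one statement per line. The sketch below is illustrative only: the `Triple` type and `to_ntriples` function are hypothetical names, the dbpedia-owl prefix is expanded to http://dbpedia.org/ontology/, and the occupation object is assumed to use the /resource/ form, since that is how things are identified in the dataset.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str          # always a URI
    predicate: str        # always a URI
    obj: str              # a URI or a literal value
    obj_is_literal: bool  # True when obj is a plain string literal


BILLIE = "http://dbpedia.org/resource/Billie_Holiday"

# The three facts shown on the slides, with prefixes expanded (assumption).
FACTS = [
    Triple(BILLIE, "http://xmlns.com/foaf/0.1/name", "Billie Holiday", True),
    Triple(BILLIE, "http://dbpedia.org/ontology/alias", "Lady Day", True),
    Triple(BILLIE, "http://dbpedia.org/ontology/occupation",
           "http://dbpedia.org/resource/Songwriter", False),
]


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as a line in N-Triples style."""
    if triple.obj_is_literal:
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


for fact in FACTS:
    print(to_ntriples(fact))
```

Literals are quoted and URIs are wrapped in angle brackets, which is the distinction the slides draw between an object that is a value and an object that is another thing.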

Page 21: DBpedia InsideOut

CHARTING DBPEDIA

Extraction Mapping Categorization

Page 22: DBpedia InsideOut

HARVESTING FACTS

Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorization information, images, geo-coordinates, and links to external Web pages.

Page 23: DBpedia InsideOut

DBPEDIA COMPONENTS

Source: http://wiki.dbpedia.org/PHPframework

Page 24: DBpedia InsideOut

DBPEDIA COMPONENTS

Extractors turn a specific type of wiki markup into triples.

Page 25: DBpedia InsideOut

DBPEDIA COMPONENTS

Extractors turn a specific type of wiki markup into triples.

Page 26: DBpedia InsideOut

The core of DBpedia consists of an infobox extraction process. Infoboxes are templates contained in many Wikipedia articles. They are usually displayed in the top right corner of articles and contain factual information.

Page 27: DBpedia InsideOut

Infobox for MusicalArtist

Page 28: DBpedia InsideOut
Page 29: DBpedia InsideOut
Page 30: DBpedia InsideOut
Page 31: DBpedia InsideOut

INFOBOX EXTRACTION

Raw Infobox Extraction – create triples directly from the infobox data.
Mapping-based Infobox Extraction – mappings against the DBpedia Ontology.

Page 32: DBpedia InsideOut

RAW INFOBOX EXTRACTION

Generic, algorithm-based
Retains property names used in the infobox
Properties are identified by the dbpprop prefix.
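A toy version of raw extraction, for illustration only. It assumes a heavily simplified infobox with one `| key = value` pair per line and assumes the dbpprop prefix stands for http://dbpedia.org/property/; the sample markup and the `raw_extract` function are hypothetical, and the actual extraction framework handles far more (nested templates, links, units, datatypes).

```python
import re

PROPERTY_NS = "http://dbpedia.org/property/"   # assumed expansion of dbpprop
RESOURCE_NS = "http://dbpedia.org/resource/"

# Hypothetical, simplified infobox markup used only for this example.
SAMPLE_INFOBOX = """{{Infobox musical artist
| name       = Billie Holiday
| birth_name = Eleanora Fagan
| genre      = Jazz
}}"""

# One "| key = value" pair per line; anything else is ignored.
FIELD = re.compile(r"^\|\s*(?P<key>[^=|]+?)\s*=\s*(?P<value>.*?)\s*$")


def raw_extract(article_name, wikitext):
    """Turn each infobox field into a triple, keeping the key as written."""
    subject = RESOURCE_NS + article_name
    triples = []
    for line in wikitext.splitlines():
        match = FIELD.match(line)
        if match and match.group("value"):
            predicate = PROPERTY_NS + match.group("key")
            triples.append((subject, predicate, match.group("value")))
    return triples


for triple in raw_extract("Billie_Holiday", SAMPLE_INFOBOX):
    print(triple)
```

Because the key is copied verbatim into the property URI, two infoboxes that spell the same attribute differently yield two different properties.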

Page 33: DBpedia InsideOut

MAPPING-BASED INFOBOX EXTRACTION

Mapping of infobox data to community-curated DBpedia Ontology. Properties are identified by the dbpedia-owl prefix.

Page 34: DBpedia InsideOut
Page 35: DBpedia InsideOut

RAW INFOBOX EXTRACTION

Pros: Complete coverage of all the infobox attributes (not all the infoboxes have been mapped yet).
Cons: Lower data quality (synonyms are not resolved, e.g., placeOfBirth/birthPlace; high error rate in determining the datatype of an attribute value).

Page 36: DBpedia InsideOut

MAPPING-BASED INFOBOX EXTRACTION

Pros: Data is cleaner (typing resources, merging name variants, assigning specific datatypes to the values).
Cons: Not full coverage.

Of 4.58 million things, 4.22 million are classified in a consistent ontology.

Page 37: DBpedia InsideOut
Page 38: DBpedia InsideOut

Normalization of variant names
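As a rough illustration of what normalizing variant names means, the sketch below folds several infobox spellings into one ontology property. The mapping table, the sample infobox keys, and the `map_properties` function are all hypothetical; the real mappings are curated by the community and are far richer. The dbpedia-owl prefix is assumed to stand for http://dbpedia.org/ontology/.

```python
ONTOLOGY_NS = "http://dbpedia.org/ontology/"  # assumed expansion of dbpedia-owl

# Hypothetical mapping table: raw infobox key -> ontology property name.
PROPERTY_MAPPINGS = {
    "placeOfBirth": "birthPlace",
    "birth_place": "birthPlace",
    "birthPlace": "birthPlace",
    "birth_date": "birthDate",
    "born": "birthDate",
}


def map_properties(raw_facts):
    """Split raw infobox facts into mapped facts and unmapped keys.

    Mapped facts are keyed by a single ontology property URI; keys with
    no mapping are reported separately, which mirrors the coverage gap
    of the mapping-based approach.
    """
    mapped, unmapped = {}, []
    for key, value in raw_facts.items():
        target = PROPERTY_MAPPINGS.get(key)
        if target is None:
            unmapped.append(key)
        else:
            mapped[ONTOLOGY_NS + target] = value
    return mapped, unmapped


# Two hypothetical infoboxes that spell the same attribute differently.
infobox_a = {"placeOfBirth": "Philadelphia", "instrument": "Vocals"}
infobox_b = {"birth_place": "Red Bank"}

print(map_properties(infobox_a))
print(map_properties(infobox_b))
```

Both inputs end up under the same birthPlace property, while the unmapped `instrument` key is left out of the clean data.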

Page 39: DBpedia InsideOut

THE DBPEDIA ONTOLOGY

Cross-domain ontology
Large thematic coverage
Currently covers 685 classes, which form a subsumption hierarchy, and 2,795 different properties describing the classes (e.g., aircraftHelicopterAttack)
Shallow (≤ 5 levels)

Page 40: DBpedia InsideOut

THE DBPEDIA ONTOLOGY

Because the DBpedia Ontology is built upon infobox templates, its semantic structure suffers from a lack of logical consistency and presents significant semantic gaps in the hierarchy.

Page 41: DBpedia InsideOut

http://mappings.dbpedia.org/server/ontology/classes/

THE DOMAIN OF MUSIC IN THE DBPEDIA ONTOLOGY

Page 42: DBpedia InsideOut

Hierarchy is kept shallow (for the sake of visualization and navigation).
http://dbpedia.org/ontology/MusicalArtist

Page 43: DBpedia InsideOut

CATEGORIZING DBPEDIA

Page 44: DBpedia InsideOut

WIKIPEDIA CATEGORY SYSTEM

Wikipedia uses categories to group articles that share similar subjects.
Wikipedia categories are constantly evolving and currently number more than 740,000.
80.9 million links to Wikipedia categories.

Page 45: DBpedia InsideOut

WIKIPEDIA CATEGORY SYSTEM

Most categories are assigned manually by Wikipedia contributors and can be found listed as links at the bottom of a Wikipedia article.

Page 46: DBpedia InsideOut
Page 47: DBpedia InsideOut

CATEGORIZING PEOPLE

At least four categories:
• the year the person was born
• the year they died
• their nationality
• their reason for being notable

Page 48: DBpedia InsideOut

CATEGORIZATION OF PEOPLE

First sentence of an article: Billie Holiday (born Eleanora Fagan; April 7, 1915 – July 17, 1959) was an American jazz singer and songwriter.

Year born: Category:1915 births
Year died: Category:1959 deaths
Nationality: Category:American people
Reason for notability / Occupation: Category:Musicians
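The four-category pattern for people can be expressed as a small function. This is only a sketch of the naming convention visible in the Billie Holiday example; the `person_categories` function is hypothetical, and real category assignment is done by hand by Wikipedia contributors, not generated.

```python
def person_categories(birth_year, death_year, nationality, occupation):
    """Build the minimum set of category names for a person.

    The name patterns follow the Billie Holiday example only; they are
    an assumption, not a rule enforced by Wikipedia.
    """
    categories = [f"Category:{birth_year} births"]
    if death_year is not None:
        categories.append(f"Category:{death_year} deaths")
    categories.append(f"Category:{nationality} people")
    categories.append(f"Category:{occupation}")
    return categories


for name in person_categories(1915, 1959, "American", "Musicians"):
    print(name)
```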

Page 49: DBpedia InsideOut
Page 50: DBpedia InsideOut

WIKIPEDIA CATEGORY SYSTEM

Collaborative effort
Advantages → categories are continually updated to correspond with article content.
Disadvantages → lack of consistency in the hierarchical structure and “rather loose relatedness between articles” (Bizer et al., 2009).
“Messy hierarchy”

Page 51: DBpedia InsideOut

RE-CATEGORIZATION OF BILLIE HOLIDAY

(→External links: re-categorisation per Wikipedia:Categories for discussion/Log/2014 December 26, replaced: Category:American women composers → Category:American female composers) (undo)

(Robot - Moving category African-American female musicians to Category:African-American musicians per CFD at Wikipedia:Categories for discussion/Log/2013 January 10.)

Page 52: DBpedia InsideOut

WIKIPEDIA ONTOLOGY IN DBPEDIA

The hierarchical structure of the categories is represented in DBpedia by way of two different properties:
• dcterms:subject (relates an entity to a category)
• skos:broader (relates a child category to its parent category)
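To show how the two properties work together, the sketch below walks from an entity to its categories and then up the hierarchy. The triples are a hypothetical, hand-made sample and do not reproduce actual DBpedia data; `all_categories` is an illustrative helper name.

```python
from collections import deque

SUBJECT = "dcterms:subject"   # entity -> category
BROADER = "skos:broader"      # child category -> parent category

# Hypothetical sample data, written with prefixed names for brevity.
TRIPLES = [
    ("dbpedia:Billie_Holiday", SUBJECT, "category:American_jazz_singers"),
    ("category:American_jazz_singers", BROADER, "category:Jazz_singers"),
    ("category:Jazz_singers", BROADER, "category:Jazz_musicians"),
]


def objects(subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for (s, p, o) in TRIPLES if s == subject and p == predicate]


def all_categories(entity):
    """Collect the entity's categories and all of their ancestors.

    Breadth-first walk: start from dcterms:subject, then follow
    skos:broader upward. The seen list guards against cycles, which a
    loosely curated hierarchy may contain.
    """
    seen = []
    queue = deque(objects(entity, SUBJECT))
    while queue:
        category = queue.popleft()
        if category in seen:
            continue
        seen.append(category)
        queue.extend(objects(category, BROADER))
    return seen


print(all_categories("dbpedia:Billie_Holiday"))
```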

Page 53: DBpedia InsideOut

http://ensiwiki.ensimag.fr/images/f/fa/Dbpedia-relation-discovery-demo.pdf

The  Hierarchy  of  categories  between  “flower”  and  “cucumber”  

Page 54: DBpedia InsideOut

CATEGORY:JAZZ_MUSICIANS

http://dbpedia.org/page/Category:Jazz_musicians  

Page 55: DBpedia InsideOut

YAGO ONTOLOGY

A robust classification scheme with a deep hierarchical structure. Originally derived from the Wikipedia category system using the semantic lexicon WordNet.

Over 350,000 classes; 100 relationships Provides DBpedia data with coherence and structural consistency A taxonomic backbone

Page 56: DBpedia InsideOut

QUERYING DBPEDIA FOR LINKED JAZZ

Jazz Name Vocabulary
Personal name vocabulary in the form of RDF statements, each pairing the artist’s name with a Uniform Resource Identifier (URI).

<http://dbpedia.org/resource/Billie_Holiday> <http://xmlns.com/foaf/0.1/name> “Billie Holiday”

Page 57: DBpedia InsideOut

QUERYING DBPEDIA FOR LINKED JAZZ

DBpedia was initially queried for literal triples with a foaf:name predicate that satisfied the following criteria:
1. the entity must have an rdf:type of dbpedia-owl:MusicalArtist
2. the entity must have a dbpedia:genre property with the value dbpedia:Jazz.

Page 58: DBpedia InsideOut

QUERYING DBPEDIA FOR LINKED JAZZ

DBpedia was initially queried for literal triples with a foaf:name predicate that satisfied the following criteria:
1. the entity must have an rdf:type of dbpedia-owl:MusicalArtist
2. the entity must have a dbpedia:genre property with the value dbpedia:Jazz.

+ rdfs:label → name of the resource
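The criteria above could be expressed as a SPARQL query along the following lines. This is a reconstruction under stated assumptions, not the query the Linked Jazz project actually ran: the slide names the property as dbpedia:genre, while the sketch assumes the ontology property dbpedia-owl:genre, and it treats rdfs:label as an optional extra name. The Python wrapper only assembles the query text; sending it to a SPARQL endpoint is left out.

```python
# Prefix expansions assumed for this sketch.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "dbpedia": "http://dbpedia.org/resource/",
}

# Criteria 1 and 2, plus foaf:name and an optional rdfs:label.
QUERY_BODY = """SELECT DISTINCT ?artist ?name ?label WHERE {
  ?artist rdf:type dbpedia-owl:MusicalArtist ;
          dbpedia-owl:genre dbpedia:Jazz ;
          foaf:name ?name .
  OPTIONAL { ?artist rdfs:label ?label . }
}"""


def jazz_name_query() -> str:
    """Assemble the prefix declarations and the query body."""
    header = "\n".join(
        f"PREFIX {prefix}: <{uri}>" for prefix, uri in PREFIXES.items()
    )
    return header + "\n\n" + QUERY_BODY


print(jazz_name_query())
```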

Page 59: DBpedia InsideOut

QUERYING DBPEDIA FOR LINKED JAZZ

Prominent musicians whom we expected to find by querying on dbpedia:Jazz were not returned.
Example: “Count Basie”
- fell under dbpedia:Swing_music, dbpedia:Big_band_music and dbpedia:Piano_blues
- not under dbpedia:Jazz
This required us to revise our query method by expanding it to include additional relevant music genres.
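One way to widen the query to several genres is a SPARQL 1.1 VALUES clause, sketched below. The genre list contains only the genres named on this slide; the full list used by the project is not given here. The property dbpedia-owl:genre and the function name `expanded_genre_query` are assumptions made for illustration.

```python
# Genres named on the slide; any longer list would be an assumption.
GENRES = ["Jazz", "Swing_music", "Big_band_music", "Piano_blues"]

PREFIX_BLOCK = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"
    "PREFIX dbpedia: <http://dbpedia.org/resource/>\n"
)


def expanded_genre_query(genres):
    """Build a query that accepts any of the given genre resources."""
    if not genres:
        raise ValueError("at least one genre is required")
    values = " ".join(f"dbpedia:{genre}" for genre in genres)
    return (
        PREFIX_BLOCK
        + "\nSELECT DISTINCT ?artist ?name WHERE {\n"
        + f"  VALUES ?genre {{ {values} }}\n"
        + "  ?artist rdf:type dbpedia-owl:MusicalArtist ;\n"
        + "          dbpedia-owl:genre ?genre ;\n"
        + "          foaf:name ?name .\n"
        + "}"
    )


print(expanded_genre_query(GENRES))
```

With this form, an artist tagged only with dbpedia:Swing_music would match, which is the kind of case the Count Basie example describes.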

Page 60: DBpedia InsideOut

Name Extraction from DBpedia

Bootstrapping  &  Querying  

Page 61: DBpedia InsideOut
Page 62: DBpedia InsideOut

IN SUM

New type of knowledge representation environment
- constant state of flux.
- decentralized interplay of different descriptive and classification systems.
- it challenges our tolerance threshold for data quality and our traditional notion of authority control.

Page 63: DBpedia InsideOut

http://dbpedia.org/page/Billie_Holiday

LodLive

Visualizing DBpedia

Page 64: DBpedia InsideOut

Thank You!

@cristinapattuel [email protected]