Entity Enrichment and Consolidation in ARCOMEM. Elena Demidova (1), including slides by: Stefan Dietze (1), Diana Maynard (2), Thomas Risse (1), Wim Peters (2), Katerina Doka (3), Yannis Stavrakas (3). (1) L3S Research Center, Hannover, Germany; (2) University of Sheffield, UK; (3) IMIS, RC ATHENA, Athens, Greece
Arcomem training enrichment_advanced

May 10, 2015

This presentation on data enrichment is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
Transcript
Page 1: Arcomem training enrichment_advanced

Entity Enrichment and Consolidation in ARCOMEM

Elena Demidova (1), including slides by: Stefan Dietze (1), Diana Maynard (2), Thomas Risse (1), Wim Peters (2), Katerina Doka (3), Yannis Stavrakas (3)

(1) L3S Research Center, Hannover, Germany
(2) University of Sheffield, UK
(3) IMIS, RC ATHENA, Athens, Greece

Page 2: Arcomem training enrichment_advanced

The ARCOMEM approach

• Make use of the Social Web
  – Huge source of user-generated content
  – Wide range of articulation methods: from simple "I like it" buttons to complete articles
  – Represents the diversity of opinions of the public

• User activities are often triggered by
  – Events and related entities (e.g. sport events, celebrations, crises, news articles, persons, locations)
  – Topics (e.g. global warming, financial crisis, swine flu)

A semantic-aware and socially-driven preservation model is a natural way to go

Page 3: Arcomem training enrichment_advanced

ARCOMEM architecture

[Figure: ARCOMEM system architecture diagram. Labels in the diagram: Crawler, Online Processing, Offline Processing, Cross Crawl Analysis, Queue Management, Application-Aware Helper, Resource Selection & Prioritization, Resource Fetching, Intelligent Crawl Definition, Consolidation, Enrichment, GATE Offline Analysis, GATE Online Analysis, Social Web Analysis, Named Entity Evol. Recog., Extracted Social Web Information, Crawler Cockpit, ARCOMEM Storage, URLs, Relevance Analysis & Prioritization, Image/Video Analysis, Twitter Dynamics, WARC Export, WARC Files, Applications (Broadcaster Application, Parliament Application)]

The ARCOMEM system architecture foresees four processing levels: the crawler level, the online processing level, the offline processing level, and cross-crawl analysis.

Page 4: Arcomem training enrichment_advanced

ETOE offline processing chain

The processing chain depicted here describes all components involved in the offline processing of Web objects.

Page 5: Arcomem training enrichment_advanced

The extraction components for text

Aim
• Extraction of Entities, Topics, Events and Opinions (ETOEs) from
  – Web pages
  – Social Web (Twitter, YouTube, Facebook, …)

Challenges
• Entity recognition from degraded input sources (tweets etc.)
  – Advancing state-of-the-art NLP and text mining
  – Dynamics detection: evolution of terms/entities
• Semantic representation of Web objects and entities
  – Appropriate RDF schemas for ETOE and Web objects
  – Exploiting (Linked Open) Web data to enrich extracted ETOE
  – Entity classification (into events, locations, topics etc.) & consolidation

Page 6: Arcomem training enrichment_advanced

ETOE extraction with GATE: an example

[Figure: GATE extraction example; visible label: candidate multi-word term]

Page 7: Arcomem training enrichment_advanced

Data consolidation & integration problem

Data extracted by different components or during different processing cycles is not aligned => consolidation, disambiguation and correlation are required.

<Location>Greece</Location>
<Person>Venizelos</Person>
<Location>Griechenland</Location>
<Organisation>Greek Parliament</Organisation>

?

Page 8: Arcomem training enrichment_advanced

Data clustering & enrichment

Enrichment of entities with related references to Linked Data, particularly reference datasets (DBpedia, Freebase, …)
=> use enrichments for correlation/clustering/consolidation
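A minimal sketch of how enrichments can support consolidation: mentions that were linked to the same reference URI are treated as the same real-world entity. The mention list, the URIs assigned to each label, and the function name `consolidate` are assumptions made for illustration; they are not taken from an ARCOMEM crawl or from the ARCOMEM code base.

```python
from collections import defaultdict

# Hypothetical extraction output: (entity type, surface label, enrichment URI).
# The label-to-URI assignments are assumed for illustration only.
mentions = [
    ("Location", "Greece", "http://dbpedia.org/resource/Greece"),
    ("Location", "Griechenland", "http://dbpedia.org/resource/Greece"),
    ("Organisation", "Greek Parliament", "http://dbpedia.org/resource/Hellenic_Parliament"),
]


def consolidate(mentions):
    """Group surface labels by the enrichment URI they were linked to."""
    groups = defaultdict(set)
    for _entity_type, label, uri in mentions:
        groups[uri].add(label)
    return dict(groups)


consolidated = consolidate(mentions)
for uri, labels in sorted(consolidated.items()):
    print(uri, sorted(labels))
```

In this toy input, "Greece" and "Griechenland" end up in one group because both point to the same URI, even though the strings themselves do not match.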

Page 9: Arcomem training enrichment_advanced

Enrichment with DBpedia & Freebase

• DBpedia and Freebase are particularly well suited for three reasons: their vast size; the availability of disambiguation techniques that exploit the multilingual labels both datasets provide for individual data items; and the degree of interconnection between the two datasets, which allows a wealth of related information to be retrieved for a given item.

• For DBpedia, we use the DBpedia Spotlight service, which performs approximate string matching with an adjustable confidence level in the interval [0, 1]. We set the confidence to 0.6, a value chosen experimentally.

• For Freebase, we use structured queries that take into account the entity types extracted by GATE.
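The DBpedia Spotlight step can be sketched roughly as follows. The endpoint URL, the `text` and `confidence` query parameters, and the `Resources` / `@URI` / `@surfaceForm` keys of the JSON response reflect the public Spotlight web service as far as we know it and should be treated as assumptions; the helper names are hypothetical. The request is only prepared, not sent, so the sketch runs offline against a hand-written sample response.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint of the public DBpedia Spotlight service.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_spotlight_request(text, confidence=0.6):
    """Prepare (but do not send) an annotation request for `text`."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{SPOTLIGHT_URL}?{query}", headers={"Accept": "application/json"})


def extract_enrichments(response_body):
    """Return (surface form, DBpedia URI) pairs from a JSON response body."""
    payload = json.loads(response_body)
    return [(r["@surfaceForm"], r["@URI"]) for r in payload.get("Resources", [])]


request = build_spotlight_request("Trichet warns of systemic debt crisis")
print(request.full_url)

# Hand-written sample in the assumed response shape, not real service output.
sample = json.dumps({"Resources": [
    {"@surfaceForm": "Trichet",
     "@URI": "http://dbpedia.org/resource/Jean-Claude_Trichet"},
]})
print(extract_enrichments(sample))
```

Sending the prepared request with `urllib.request.urlopen(request)` would require network access and a reachable service. Raising the confidence value returns fewer, more certain annotations; lowering it returns more candidates.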


Page 10: Arcomem training enrichment_advanced

Enrichment for clustering & correlation: example

<Event>Trichet warns of systemic debt crisis</Event>
<Person>Jean Claude Trichet</Person>
<Organisation>ECB</Organisation>

Page 11: Arcomem training enrichment_advanced

Enrichment for clustering & correlation: example

<Event>Trichet warns of systemic debt crisis</Event>
<Person>Jean Claude Trichet</Person>
<Organisation>ECB</Organisation>

<Enrichment>http://dbpedia.org/resource/Jean-Claude_Trichet</Enrichment>
<Enrichment>http://dbpedia.org/resource/ECB</Enrichment>

Page 12: Arcomem training enrichment_advanced

Enrichment for clustering & correlation: example

<Event>Trichet warns of systemic debt crisis</Event>
<Person>Jean Claude Trichet</Person>
<Organisation>ECB</Organisation>

<Enrichment>http://dbpedia.org/resource/Jean-Claude_Trichet</Enrichment>
<Enrichment>http://dbpedia.org/resource/ECB</Enrichment>

=> dbpprop:office
   dbpedia:President_of_the_European_Central_Bank
   dbpedia:Governor_of_the_Banque_de_France

=> dcterms:subject
   category:Living_people
   category:Karlspreis_recipients
   category:Alumni_of_the_École_Nationale_d'Administration
   category:People_from_Lyon
   …
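One way to picture how such property values can be used for correlation is to compare the statements attached to two enrichments. The sketch below is an assumption-laden illustration: the first set reuses a subset of the values listed for Jean-Claude Trichet, the second set is invented, and the Jaccard overlap is a generic similarity measure swapped in for illustration, not the measure used in ARCOMEM.

```python
# Statements for Jean-Claude Trichet as listed on the slide (subset).
trichet = {
    ("dbpprop:office", "dbpedia:President_of_the_European_Central_Bank"),
    ("dbpprop:office", "dbpedia:Governor_of_the_Banque_de_France"),
    ("dcterms:subject", "category:Living_people"),
    ("dcterms:subject", "category:People_from_Lyon"),
}

# Hypothetical second enrichment, invented for illustration only.
other_person = {
    ("dbpprop:office", "dbpedia:President_of_the_European_Central_Bank"),
    ("dcterms:subject", "category:Living_people"),
    ("dcterms:subject", "category:People_from_Rome"),
}


def shared_statements(a, b):
    """Return the (property, value) pairs that both enrichments have."""
    return a & b


def jaccard(a, b):
    """Share of statements in common relative to all statements of both."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


shared = shared_statements(trichet, other_person)
overlap = jaccard(trichet, other_person)
print(sorted(shared))
print(overlap)
```

Very common values such as `category:Living_people` match for a large number of resources, so a practical measure would probably need to weight statements by how distinctive they are.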

Page 13: Arcomem training enrichment_advanced

ARCOMEM entities and enrichments - graph

[Figure: graph of ARCOMEM entities and their enrichments]

Nodes: entities/events (blue), DBpedia enrichments (green), Freebase enrichments (orange)

1013 clusters of correlated entities/events

Page 14: Arcomem training enrichment_advanced

ARCOMEM entities and enrichments - graph

Nodes: entities/events (blue), DBpedia enrichments (green), Freebase enrichments (orange)

1013 clusters of correlated entities/events
=> cluster expansion by considering related enrichments
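Clusters of this kind can be pictured as the connected components of a graph whose nodes are entities and enrichments and whose edges are entity-to-enrichment links. The following sketch computes such components with a small union-find structure. The link list is hypothetical (its labels reuse the Trichet example), and whether ARCOMEM computes its clusters this way is an assumption.

```python
from collections import defaultdict

# Hypothetical (entity, enrichment) links; the link structure is assumed.
links = [
    ("event:Trichet warns of systemic debt crisis", "dbpedia:Jean-Claude_Trichet"),
    ("event:Trichet warns of systemic debt crisis", "dbpedia:ECB"),
    ("person:Jean Claude Trichet", "dbpedia:Jean-Claude_Trichet"),
    ("organisation:ECB", "dbpedia:ECB"),
    ("location:Greece", "dbpedia:Greece"),
]


def find(parent, node):
    """Return the representative of node's component, shortening the path."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def cluster_entities(links):
    """Group entities that are connected through shared enrichments."""
    parent = {}
    for entity, enrichment in links:
        root_a, root_b = find(parent, entity), find(parent, enrichment)
        if root_a != root_b:
            parent[root_a] = root_b
    clusters = defaultdict(set)
    for entity in {entity for entity, _ in links}:
        clusters[find(parent, entity)].add(entity)
    return sorted((sorted(c) for c in clusters.values()), key=len, reverse=True)


clusters = cluster_entities(links)
for cluster in clusters:
    print(cluster)
```

Here the event, the person and the organisation fall into one cluster because they share enrichments, while the location stays on its own. Adding edges between related enrichments would merge further components, which is one reading of the cluster expansion step.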

Page 15: Arcomem training enrichment_advanced

Clustering of entities via enrichment relatedness

Discovery of "related" entities by discovering related enrichments:

(a) Retrieving possible paths between two enrichments (e.g. via RelFinder, http://www.visualdataweb.org/relfinder.php)
(b) Computing a relatedness measure (considering variables such as shortest path, number of paths, relationship types, number of directly connected edges of both enrichments, …)
(c) Clustering enrichments (entities) whose relatedness is above a certain threshold
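Steps (a) to (c) can be sketched on a toy graph as follows. Everything specific here is an assumption: the graph edges are invented, the scoring formula (each path contributes the inverse of its length) is only a stand-in that rewards short and numerous paths, and the threshold of 0.5 is arbitrary. The actual ARCOMEM measure also considers relationship types and node degrees, which this sketch ignores.

```python
from collections import deque

# Hypothetical undirected link graph between enrichments; edges are invented.
graph = {
    "dbpedia:Jean-Claude_Trichet": {"dbpedia:European_Central_Bank", "dbpedia:Banque_de_France"},
    "dbpedia:European_Central_Bank": {"dbpedia:Jean-Claude_Trichet", "dbpedia:Euro"},
    "dbpedia:Banque_de_France": {"dbpedia:Jean-Claude_Trichet", "dbpedia:Euro"},
    "dbpedia:Euro": {"dbpedia:European_Central_Bank", "dbpedia:Banque_de_France"},
    "dbpedia:Greece": set(),
}


def simple_paths(graph, start, goal, max_length):
    """(a) Enumerate loop-free paths with at most max_length edges."""
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            paths.append(path)
            continue
        if len(path) - 1 < max_length:
            for neighbour in sorted(graph[path[-1]]):
                if neighbour not in path:
                    queue.append(path + [neighbour])
    return paths


def relatedness(graph, a, b, max_length=3):
    """(b) Toy score: each path adds 1 / (number of edges). Assumed formula."""
    if a == b:
        raise ValueError("relatedness is only defined for two distinct nodes")
    return sum(1.0 / (len(p) - 1) for p in simple_paths(graph, a, b, max_length))


THRESHOLD = 0.5  # assumed cut-off
pairs = [
    ("dbpedia:Jean-Claude_Trichet", "dbpedia:Euro"),
    ("dbpedia:Jean-Claude_Trichet", "dbpedia:Greece"),
]
# (c) Keep only the pairs whose score reaches the threshold.
related_pairs = [(a, b) for a, b in pairs if relatedness(graph, a, b) >= THRESHOLD]
print(related_pairs)
```

Bounding the path length keeps the enumeration tractable, since the number of simple paths grows quickly in a densely linked dataset.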

Page 16: Arcomem training enrichment_advanced

RDF schema for the Knowledge Base

Relationships between ARCOMEM entities (ETOE etc.) and enrichments
RDF schema: http://www.gate.ac.uk/ns/ontologies/arcomem-data-model.rdf

Page 17: Arcomem training enrichment_advanced

Enrichment evaluation results

• Manual evaluation of 240 enrichment-entity pairs
• Available scores: 1 (correct), 0 (incorrect), 0.5 (vague or ambiguous relationship)

Entity type         Avg. score DBpedia   Avg. score Freebase   Avg. score total
arco:Event          0.71                 -                     0.71
arco:Location       0.81                 0.94                  0.88
arco:Money          0.67                 -                     0.67
arco:Organization   0.93                 1                     0.97
arco:Person         0.9                  0.89                  0.89
arco:Time           0.74                 -                     0.74
Total               0.79                 0.94                  0.87
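For readers who want to reproduce this kind of summary on their own data, the sketch below averages manual ratings per entity type and dataset. The ratings are made-up examples on the 1 / 0.5 / 0 scale described above, not the 240 pairs from the evaluation, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical manual ratings: (entity type, dataset, score).
ratings = [
    ("arco:Person", "DBpedia", 1.0),
    ("arco:Person", "DBpedia", 0.5),
    ("arco:Person", "Freebase", 1.0),
    ("arco:Location", "DBpedia", 0.0),
    ("arco:Location", "DBpedia", 1.0),
]


def average_scores(ratings):
    """Average the scores for each (entity type, dataset) combination."""
    buckets = defaultdict(list)
    for entity_type, dataset, score in ratings:
        buckets[(entity_type, dataset)].append(score)
    return {key: round(mean(scores), 2) for key, scores in buckets.items()}


averages = average_scores(ratings)
for key, value in sorted(averages.items()):
    print(key, value)
```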

Page 18: Arcomem training enrichment_advanced

Further reading

• Entity Extraction and Consolidation for Social Web Content Preservation. S. Dietze, D. Maynard, E. Demidova, T. Risse, W. Peters, K. Doka and Y. Stavrakas. SDA, volume 912 of CEUR Workshop Proceedings, pages 18-29. CEUR-WS.org, 2012.

• Can entities be friends? B. P. Nunes, R. Kawase, S. Dietze, D. Taibi, M. A. Casanova, W. Nejdl. Web of Linked Entities (WOLE2012), Workshop at the 11th International Semantic Web Conference (ISWC2012), Boston, US, 2012.

• Combining a co-occurrence-based and a semantic measure for entity linking. B. P. Nunes, S. Dietze, M. A. Casanova, R. Kawase, B. Fetahu, W. Nejdl. ESWC 2013 - 10th Extended Semantic Web Conference, 2013.

• Linked Data - The Story So Far. C. Bizer, T. Heath and T. Berners-Lee. Special Issue on Linked Data, International Journal on Semantic Web and Information Systems (IJSWIS), 2009.

Page 19: Arcomem training enrichment_advanced

THANK YOU

CONTACT DETAILS

Dr. Elena Demidova
L3S Research Center
+49 511 762 17732

[email protected]