This presentation on data enrichment is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
Entity Enrichment and Consolidation in ARCOMEM
Elena Demidova (1)
including slides by: Stefan Dietze (1), Diana Maynard (2), Thomas Risse (1), Wim Peters (2), Katerina Doka (3), Yannis Stavrakas (3)
(1) L3S Research Center, Hannover, Germany
(2) University of Sheffield, UK
(3) IMIS, RC ATHENA, Athens, Greece
The ARCOMEM approach
• Make use of the Social Web
– Huge source of user-generated content
– Wide range of articulation methods: from simple "I like it" buttons to complete articles
– Represents the diversity of opinions of the public
• User activities are often triggered by
– Events and related entities (e.g. Sport Events, Celebrations, Crises, News Articles, Persons, Locations)
– Topics (e.g. Global Warming, Financial Crisis, Swine Flu)
A semantic-aware and socially-driven preservation model is a natural way to go
ARCOMEM architecture
[Figure: ARCOMEM system architecture diagram. Levels: Crawler, Online Processing, Offline Processing, Cross Crawl Analysis. Component labels: Queue Management, Application-Aware Helper, Resource Selection & Prioritization, Resource Fetching, Intelligent Crawl Definition, GATE Online Analysis, GATE Offline Analysis, Social Web Analysis, Relevance Analysis & Prioritization, Image/Video Analysis, Consolidation, Enrichment, Named Entity Evolution Recognition, Twitter Dynamics, Extracted Social Web Information, URLs, Crawler Cockpit, ARCOMEM Storage, WARC Export, WARC Files, Applications (Broadcaster Application, Parliament Application).]
The ARCOMEM system architecture foresees four processing levels: the crawler level, the online processing level, the offline processing level and cross crawl analysis.
ETOE offline processing chain
The processing chain depicted here describes all components involved in the offline processing of Web objects.
The extraction components for text
• Aim
– Extraction of Entities, Topics, Events and Opinions (ETOEs) from Web pages and the Social Web (Twitter, YouTube, Facebook, …)
• Challenges
– Entity recognition from degraded input sources (tweets etc.)
– Advancing state-of-the-art NLP and text mining
– Dynamics detection: evolution of terms/entities
– Semantic representation of Web objects and entities
– Appropriate RDF schemas for ETOEs and Web objects
– Exploiting (Linked Open) Web data to enrich extracted ETOEs
Data clustering & enrichment
• Enrichment of entities with related references to Linked Data, particularly reference datasets (DBpedia, Freebase, …)
=> Use enrichments for correlation/clustering/consolidation
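One simple way to use enrichments for consolidation is to group extracted entities that point to the same Linked Data reference. The following is a minimal sketch of that idea, not the ARCOMEM implementation; the entity labels, the URIs paired with them and the function name `cluster_by_enrichment` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical (label, enrichment URI) pairs, as they might come out of
# an enrichment step. The data is illustrative only.
enriched_entities = [
    ("Jean Claude Trichet", "http://dbpedia.org/resource/Jean-Claude_Trichet"),
    ("Trichet", "http://dbpedia.org/resource/Jean-Claude_Trichet"),
    ("ECB", "http://dbpedia.org/resource/European_Central_Bank"),
    ("European Central Bank", "http://dbpedia.org/resource/European_Central_Bank"),
]


def cluster_by_enrichment(pairs):
    """Group entity labels that share the same enrichment URI."""
    clusters = defaultdict(set)
    for label, uri in pairs:
        clusters[uri].add(label)
    return dict(clusters)


clusters = cluster_by_enrichment(enriched_entities)
for uri, labels in clusters.items():
    print(uri, sorted(labels))
```

Here different surface forms of the same person or organisation end up in one cluster because they share an enrichment, which is the kind of correlation the slide refers to.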
Enrichment with DBpedia & Freebase
• DBpedia and Freebase are particularly well suited for three reasons: their vast size; the availability of disambiguation techniques that can exploit the variety of multilingual labels both datasets provide for individual data items; and the level of interconnectedness of both datasets, which allows a wealth of related information to be retrieved for a particular item.
• In the case of DBpedia, we make use of the DBpedia Spotlight service, which enables approximate string matching with an adjustable confidence level in the interval [0,1]. Based on experiments, we set the confidence to 0.6.
• For Freebase, we use structured queries, taking into account entity types extracted by GATE.
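As a rough illustration of the DBpedia step, the sketch below builds a request for the Spotlight `annotate` operation with the confidence threshold of 0.6 and reads the DBpedia URIs out of a JSON response. It assumes the public endpoint at api.dbpedia-spotlight.org and the `Resources` / `@URI` fields of its JSON output; the slides do not say which Spotlight deployment or response format ARCOMEM used, and the helper names are hypothetical. The request is only constructed, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the public DBpedia Spotlight endpoint, not necessarily
# the deployment used in ARCOMEM.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_spotlight_request(text, confidence=0.6):
    """Build (but do not send) an annotate request for the given text."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in the interval [0, 1]")
    query = urlencode({"text": text, "confidence": confidence})
    return Request(
        f"{SPOTLIGHT_URL}?{query}",
        headers={"Accept": "application/json"},
    )


def extract_uris(response_body):
    """Collect the DBpedia URIs from a Spotlight JSON response body."""
    data = json.loads(response_body)
    return [resource["@URI"] for resource in data.get("Resources", [])]


request = build_spotlight_request("Trichet warns of systemic debt crisis")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` and feeding the response body to `extract_uris` would yield candidate enrichments; raising the confidence returns fewer, more certain matches.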
<Event>Trichet warns of systemic debt crisis</Event>
<Person>Jean Claude Trichet</Person> <Organisation>ECB</Organisation>
• Manual evaluation of 240 enrichment-entity pairs
• Available scores: 1 (correct), 0 (incorrect), 0.5 (vague or ambiguous relationship)
Entity Type         Average score DBpedia   Average score Freebase   Average score total
arco:Event          0.71                    -                        0.71
arco:Location       0.81                    0.94                     0.88
arco:Money          0.67                    -                        0.67
arco:Organization   0.93                    1                        0.97
arco:Person         0.9                     0.89                     0.89
arco:Time           0.74                    -                        0.74
Total               0.79                    0.94                     0.87
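The per-type averages in a table like this can be derived from individual judgements on the 1 / 0.5 / 0 scale. The sketch below shows one way to do that; the judgement records are hypothetical and are not the 240 pairs from the evaluation, and the function name `average_scores` is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Scores allowed by the evaluation scheme: correct, vague/ambiguous, incorrect.
ALLOWED_SCORES = {1.0, 0.5, 0.0}

# Hypothetical judgements: (entity type, dataset, score). Illustrative only.
judgements = [
    ("arco:Person", "DBpedia", 1.0),
    ("arco:Person", "DBpedia", 0.5),
    ("arco:Person", "Freebase", 1.0),
    ("arco:Location", "Freebase", 0.0),
    ("arco:Location", "Freebase", 1.0),
]


def average_scores(records):
    """Average the manual scores per (entity type, dataset) pair."""
    grouped = defaultdict(list)
    for entity_type, dataset, score in records:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"unexpected score: {score}")
        grouped[(entity_type, dataset)].append(score)
    return {key: round(mean(scores), 2) for key, scores in grouped.items()}


averages = average_scores(judgements)
for (entity_type, dataset), value in sorted(averages.items()):
    print(entity_type, dataset, value)
```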
Further reading
• Entity Extraction and Consolidation for Social Web Content Preservation. S. Dietze, D. Maynard, E. Demidova, T. Risse, W. Peters, K. Doka and Y. Stavrakas. SDA, volume 912 of CEUR Workshop Proceedings, pages 18-29. CEUR-WS.org, 2012.
• Can entities be friends? B. P. Nunes, R. Kawase, S. Dietze, D. Taibi, M. A. Casanova, W. Nejdl. Web of Linked Entities (WOLE2012), Workshop at the 11th International Semantic Web Conference (ISWC2012), Boston, US, 2012.
• Combining a co-occurrence-based and a semantic measure for entity linking. B. P. Nunes, S. Dietze, M. A. Casanova, R. Kawase, B. Fetahu, W. Nejdl. ESWC 2013 - 10th Extended Semantic Web Conference, 2013.
• Linked Data - The Story So Far. C. Bizer, T. Heath and T. Berners-Lee. Special Issue on Linked Data, International Journal on Semantic Web and Information Systems (IJSWIS), 2009.
THANK YOU
CONTACT DETAILS
Dr. Elena Demidova
L3S Research Center
+49 511 762 17732