Mining Social Media with Linked Open Data, Entity Recognition and Event Extraction Leon Derczynski Kalina Bontcheva Third Workshop on Data Extraction and Object Search, Oxford, 7 July 2013


Jan 26, 2015


Leon Derczynski

Presented at the third DEOS workshop, http://diadem.cs.ox.ac.uk/deos13/

Social media presents itself as a context-rich source of big data, readily exhibiting volume, velocity and variety. Mining information from microblogs and other social media is a challenging, emerging research area. Unlike carefully authored news text and other longer content, social media text poses a number of new challenges because it is short, noisy, context-dependent, and dynamic.

This talk will first discuss how Linked Open Data (LOD) vocabularies (namely DBpedia and YAGO) have been used to help entity recognition and disambiguation in such content. We will introduce LODIE, the LOD-based extension of the widely used ANNIE open-source entity recognition system. LODIE also includes entity disambiguation (covering products, as well as names of persons, locations, and organisations) and has been developed as part of the TrendMiner and uComp projects. Quantitative evaluation results will be shown, including a comparison against other state-of-the-art methods and an analysis of how errors in upstream linguistic pre-processing (i.e. tokenisation and POS tagging) can affect disambiguation performance. Our results demonstrate the importance of adjusting approaches for this genre.

The second half of the talk will focus on fine-grained events in tweets. Awareness of temporal context in social media enables many interesting applications. We identify events using the TimeML schema, focusing on occurrences and actions. Challenges of event annotation will be discussed, as well as the development of a supervised event extractor specifically for social media. We evaluate this against traditional event annotation approaches (e.g. Evita, TIPSem).
Transcript
Page 1: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Mining Social Media with Linked Open Data, Entity Recognition and Event Extraction

Leon Derczynski, Kalina Bontcheva

Third Workshop on Data Extraction and Object Search, Oxford,

7 July 2013

Page 2: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction
Page 3: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Social Media = Big Data

Gartner ''3V'' definition:

1. Volume

2. Velocity

3. Variety

High volume & velocity of messages:

Twitter has ~20 000 000 users per month

They write ~500 000 000 messages per day

Massive variety: stock markets; earthquakes; social arrangements; … Bieber

Page 4: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

What resources do we have now?

Large, content-rich, connected, digital streams of human discourse

We transfer knowledge via communication

Sampling communication gives a sample of human knowledge

''You've only done that which you can communicate''

The metadata (time – place – imagery) gives a richer resource:

→ A sampling of human behaviour

Page 5: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Entity annotation components

Named entity recognition

dbpedia.org/resource/..... Michael_Jackson Michael_Jackson_(writer)

Linking entities

Page 6: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Named Entity Recognition

Goal is to find entities we might like to link

General accuracy on newswire: 89% F1

General accuracy on microblogs: 41% F1

L. Derczynski, D. Maynard, N. Aswani, K. Bontcheva. ''Microblog-Genre Noise and Impact on Semantic Annotation Accuracy.'' 24th ACM Conference on Hypertext and Social Media. 2013

Newswire: London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.

Microblog: Gotta dress up for london fashion week and party in style!!!

Page 7: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

NER difficulties

Rule-based systems get the bulk of entities (newswire 77% F1)

ML-based systems do well at the remainder (newswire 89% F1)

Small proportion of difficult entities

Many complex issues

Using improved pipeline:

ML struggles, even with in-genre data: 49% F1

Rules cut through microblog noise: 80% F1
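
The F1 figures quoted on these slides combine precision and recall into a single score. As a reminder of the arithmetic, here is a minimal sketch; the counts in the example are hypothetical and are not taken from the evaluation reported in the talk.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall for entity-level evaluation."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the calculation.
example = f1_score(true_positives=80, false_positives=20, false_negatives=20)
print(f"F1 = {example:.1%}")
```

Because F1 is a harmonic mean, a system with high precision but low recall (or the reverse) still scores poorly.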

Page 8: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Word-level linking performance

Dataset: Ritter NER + DBpedia URIs

Detect mentions of entity in tweets

Crowdsourced annotations

Expert gold standard

Discard after disagreement or ambiguity

We disambiguate mentions to DBpedia / Wikipedia (easy to map)

General performance: F1 81%

Page 9: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Word-level linking issues

Automatic annotation: Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing but with glitter!

Actual: Branching out from Lincoln park after dark(PROD) ... Hello "Russian Navy(PROD)", it's like the same thing but with glitter!

Clue in unusual collocations

+ ?

Page 10: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

LODIE: LOD-based Information Extraction

Uses DBpedia as reference knowledge graph

Why DBpedia?

Regularly updated (from Wikipedia)

Good source for named entities

A hierarchy of concepts

A capital is also a city, but not vice versa

Relations between concepts

Paris locatedIn France

ParisHilton bornIn NewYorkCity

Demo: http://demos.gate.ac.uk/trendminer/obie/
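
The concept hierarchy and the relations on this slide can be pictured as a small set of triples. The sketch below is a toy in-memory graph, not a DBpedia query: the `type` predicate, the class names and the `SUBCLASS_OF` table are illustrative assumptions rather than actual DBpedia ontology terms.

```python
# Toy knowledge graph. Class and property names are illustrative
# assumptions, not real DBpedia ontology identifiers.
SUBCLASS_OF = {"Capital": "City", "City": "Place"}

TRIPLES = {
    ("Paris", "type", "Capital"),
    ("Paris", "locatedIn", "France"),
    ("ParisHilton", "bornIn", "NewYorkCity"),
    ("NewYorkCity", "type", "City"),
}


def superclasses(cls):
    """Return cls followed by all of its ancestors in the hierarchy."""
    chain = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(entity, cls):
    """True if the entity has cls as a direct or inherited type."""
    return any(
        cls in superclasses(obj)
        for subj, pred, obj in TRIPLES
        if subj == entity and pred == "type"
    )


print(is_a("Paris", "City"))           # a capital is also a city
print(is_a("NewYorkCity", "Capital"))  # but a city is not necessarily a capital
```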

Page 11: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

LODIE: LOD-based Information Extraction

We increase recall by:

Deriving abbreviations from link anchor texts in Wikipedia

''She was born in <a href=''New_York_(city)''>NYC</a>''

Rank boosting terms using redirect pages

Matching NE candidates using wildcard queries (e.g. Burton upon Trent and Burton-on-Trent)
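
Two of these recall techniques are easy to illustrate. The sketch below collects (anchor text, link target) pairs from a snippet of Wikipedia-style HTML and matches candidate labels against a wildcard pattern. It uses only the Python standard library; the class name, helper name and label list are hypothetical, and this is not LODIE's implementation.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs, e.g. ("NYC", "New_York_(city)")."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append(("".join(self._text).strip(), self._href))
            self._href = None


def wildcard_matches(pattern, labels):
    """Return the labels that match a shell-style wildcard pattern."""
    return [label for label in labels if fnmatchcase(label, pattern)]


collector = AnchorCollector()
collector.feed('She was born in <a href="New_York_(city)">NYC</a>')
aliases = collector.pairs

# Hypothetical candidate labels for the wildcard query.
labels = ["Burton upon Trent", "Burton-on-Trent", "Burton Albion"]
matches = wildcard_matches("Burton*Trent", labels)
```

Each collected pair gives an alternative surface form for the linked page, which is how an abbreviation such as NYC can be tied to its entity.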

This makes disambiguation harder (more choices hurt precision)

Use joint inference over multiple similarity metrics:

String

Structure

General collocation

Ontology / conceptual graph

D. Damljanovic and K. Bontcheva ''Named Entity Disambiguation using Linked Data'', ESWC, 2012

Demo: http://demos.gate.ac.uk/trendminer/obie/
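
To make the idea of combining several similarity metrics concrete, here is a deliberately simplified sketch. It swaps in a plain weighted sum of two scores (string similarity and context-word overlap) in place of the joint inference described in the cited paper, and covers only two of the four metrics listed. The weights, the context words and the function names are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(mention, label):
    """Character-level similarity between the mention and a candidate label."""
    return SequenceMatcher(None, mention.lower(), label.lower()).ratio()


def context_similarity(context_words, candidate_words):
    """Jaccard overlap between tweet context and words describing a candidate."""
    a, b = set(context_words), set(candidate_words)
    return len(a & b) / len(a | b) if a | b else 0.0


def rank_candidates(mention, context_words, candidates, weights=(0.5, 0.5)):
    """Rank candidate URIs by a weighted sum of the two similarity scores."""
    scored = []
    for uri, (label, words) in candidates.items():
        score = (weights[0] * string_similarity(mention, label)
                 + weights[1] * context_similarity(context_words, words))
        scored.append((score, uri))
    return [uri for score, uri in sorted(scored, reverse=True)]


# Hypothetical descriptive words for the two candidates shown earlier in the talk.
candidates = {
    "Michael_Jackson": ("Michael Jackson", ["singer", "pop", "album", "thriller"]),
    "Michael_Jackson_(writer)": ("Michael Jackson", ["writer", "beer", "whisky", "journalist"]),
}
ranking = rank_candidates("michael jackson", ["new", "beer", "guide", "writer"], candidates)
```

Both candidates have the same label, so string similarity alone cannot separate them; the context overlap is what decides the ranking here.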

Page 12: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Social media contains events

How are events differently described in social media and news?

Conventional docs (e.g. newswire) have contextual info

Central event in distinct document segment (e.g. headline)

Location

Actors / participants

Causes

Outcomes

Similar prior events

This kind of description not found in social media

No editing guidelines

Often limited message length

Instead, events are fractured, with facets represented sparsely

Only 1-2 facets per message about the event

Page 13: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Event extraction

Social media streams are punctuated with descriptions of events

… Accompanied by event facets

''Obama is visiting Russia''

''The US president has not visited Putin before''

Many viewpoints on the same temporal entity

(like triples)

How can we extract these?

We use the TimeML definitions of events in text:

Minimal lexicalisation (i.e. annotate one word)

Event classes: we focus on ACTIONs and OCCURRENCEs
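
Minimal lexicalisation means that a single token carries the event annotation. As a sketch, the helper below wraps one token of an example sentence in a TimeML-style `EVENT` tag; the helper name, the `eid` value and the class assigned to "visiting" are assumptions for illustration.

```python
from xml.sax.saxutils import escape


def mark_event(tokens, event_index, event_class, eid="e1"):
    """Wrap exactly one token in a TimeML-style EVENT tag."""
    out = []
    for i, token in enumerate(tokens):
        text = escape(token)
        if i == event_index:
            text = f'<EVENT eid="{eid}" class="{event_class}">{text}</EVENT>'
        out.append(text)
    return " ".join(out)


# "visiting" is the single word annotated as the event mention.
annotated = mark_event(["Obama", "is", "visiting", "Russia"], 2, "OCCURRENCE")
print(annotated)
```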

Page 14: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Event extraction

How can we extract event mentions?

Conventional approaches are hybrid:

Statistical learning

Syntactic structures

Existing TimeML resources

TimeBank corpus (newswire)

Evita event extraction tool

Adapting to social media text

Negatively impacted by problems with NER

Short sentence structure

→ Use shallow linguistic techniques and fuzzy matches

Evita: F1 80.1

TIPSem: F1 81.4 (on well-formed text)

USFD Arcomem: F1 81.1 (noise-resilient)
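
One possible reading of "fuzzy matches" is approximate string matching of noisy tokens against a list of known event words. The sketch below illustrates that reading only; it is not the USFD Arcomem extractor, and the lexicon, the cutoff value and the function name are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical lexicon of event-denoting words.
EVENT_LEXICON = ["visiting", "arrived", "won", "announced"]


def fuzzy_event_tokens(tokens, cutoff=0.8):
    """Return (token, lexicon entry) pairs for tokens that nearly match an event word."""
    hits = []
    for token in tokens:
        close = get_close_matches(token.lower(), EVENT_LEXICON, n=1, cutoff=cutoff)
        if close:
            hits.append((token, close[0]))
    return hits


hits = fuzzy_event_tokens("obama visitin russia 2day".split())
```

A high cutoff keeps unrelated short words from matching, at the cost of missing heavily abbreviated forms.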

Page 15: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

LOD for event reassembly

What is needed to reassemble events from social media?

Identify mentions of the same event

Collect facets and integrate them

LOD gives unique identifiers for facet values

Many possible lexicalisations for the same event (run, control)

Identify co-referring mentions through:

Shared actors

Consistent facets (i.e. non-conflicting)

Lexical event similarity (e.g. WordNet)

This helps cluster mentions of the same event

Agglomerate facets

Final product: Event description grounded in linked open data
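
The reassembly steps on this slide can be sketched as a greedy clustering over event mentions. This is a minimal illustration under stated assumptions, not the authors' system: mentions are plain dictionaries, the `dbpedia:`-prefixed identifiers and facet names are hypothetical, and a hand-written synonym table stands in for a WordNet lookup.

```python
# Hypothetical synonym groups standing in for a WordNet similarity lookup.
SYNSETS = [{"visit", "visiting", "visited"}, {"run", "control"}]


def lexically_similar(a, b):
    """True if two event words are identical or share a synonym group."""
    return a == b or any(a in group and b in group for group in SYNSETS)


def facets_consistent(f1, f2):
    """Facets conflict only when both give different values for the same key."""
    return all(f1[key] == f2[key] for key in f1.keys() & f2.keys())


def same_event(m1, m2):
    """Apply the three cues: shared actors, consistent facets, lexical similarity."""
    return (bool(m1["actors"] & m2["actors"])
            and facets_consistent(m1["facets"], m2["facets"])
            and lexically_similar(m1["event"], m2["event"]))


def cluster_mentions(mentions):
    """Greedily group co-referring mentions and agglomerate their facets."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if all(same_event(mention, other) for other in cluster["mentions"]):
                cluster["mentions"].append(mention)
                cluster["facets"].update(mention["facets"])
                break
        else:
            clusters.append({"mentions": [mention], "facets": dict(mention["facets"])})
    return clusters


# Hypothetical mentions; identifiers are illustrative only.
mentions = [
    {"event": "visiting", "actors": {"dbpedia:Barack_Obama"},
     "facets": {"location": "dbpedia:Russia"}},
    {"event": "visit", "actors": {"dbpedia:Barack_Obama", "dbpedia:Vladimir_Putin"},
     "facets": {"host": "dbpedia:Vladimir_Putin"}},
    {"event": "run", "actors": {"dbpedia:Barack_Obama"}, "facets": {}},
]
clusters = cluster_mentions(mentions)
```

Because facet values are unique identifiers rather than strings, checking consistency reduces to an equality test on shared keys.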

Page 16: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Conclusion

Event extraction from social media using linked open data enables extraction of rich event descriptions

Page 17: Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction

Thank you!

Thank you for listening!

Do you have any questions?