NERD: an open source platform for extracting and disambiguating named entities in very diverse documents NLP-DBpedia 2013

Post on 10-May-2015

DESCRIPTION

"NERD: an open source platform for extracting and disambiguating named entities in very diverse documents" - Keynote Talk given at the NLP&DBpedia International Workshop (NLP&DBpedia), 22 October 2013

Transcript

NERD: an open source platform for extracting and disambiguating named entities in very diverse documents

Raphaël Troncy <raphael.troncy@eurecom.fr> Giuseppe Rizzo <giuseppe.rizzo@eurecom.fr>

What is a Named Entity Recognition task?

A task that aims to locate and classify, in a textual document, the names of persons, organizations, locations, brands and products, as well as numeric expressions such as times, dates, amounts of money and percentages

22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 2

Example

“I want to book a room in a hotel located in the heart of Paris, just a stone’s throw from the Eiffel Tower”

Eric Charton, “Named Entity Detection and Entity Linking in the Context of Semantic Web: Exploring the ambiguity question”

Part of Speech

Token   POS
I       PRP
want    VBP
to      TO
book    VB
a       DT
room    NN
in      IN
…       …
Paris   NNP

NER: What is Paris?
NEL: Which Paris are we talking about?

Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding”

What is Paris? Type Ambiguity

dbpedia-owl:Asteroid schema:City schema:Movie dbpedia-owl:Film

Named Entity Recognition (NER)

Token   POS   NER
I       PRP   O
want    VBP   O
to      TO    O
book    VB    O
a       DT    O
room    NN    O
in      IN    O
…       …     …
Paris   NNP   LOC

What is Paris? Name Ambiguity

Paris, Kentucky Paris, Maine Paris, Tennessee

Paris, France Paris, Idaho Paris, Ontario

Named Entity Linking (NEL)

Token   POS   NER   Link
I       PRP   O     O
want    VBP   O     O
to      TO    O     O
book    VB    O     O
a       DT    O     O
room    NN    O     O
in      IN    O     O
…       …     …     …
Paris   NNP   LOC   http://dbpedia.org/resource/Paris

NER Tools and Web APIs

Standalone software: GATE, Stanford CoreNLP, Temis

Web APIs

http://nerd.eurecom.fr/

Compare performances of NER and NEL tools
Understand strengths and weaknesses of different Web APIs
Adapt NER processing to different contexts
(Learn how to) combine NER (/ NEL) tools
Participate in various benchmarks

NERD: Named Entity Recognition and Disambiguation

What is NERD?

ontology (1), REST API (2), UI (3)

(1) http://nerd.eurecom.fr/ontology
(2) http://nerd.eurecom.fr/api/application.wadl
(3) http://nerd.eurecom.fr

Factual comparison of 10 Web NER tools

Tool | Language | Granularity | Entity position | Classification schema | Number of classes | Response format | Quota (calls/day)
Alchemy API | EN, FR, GR, IT, PT, RU, SP, SW | OEN | N/A | Alchemy | 324 | JSON, MicroF, XML, RDF | 30000
DBpedia Spotlight | EN, GR*, PT*, SP* | OEN | char offset | DBpedia, FreeBase, Schema.org | 320 | HTML, JSON, RDF, XML | unl
Evri | EN, IT | OED | N/A | Evri | 5 | HTML, JSON, RDF | 3000
Extractiv | EN | OEN | word offset | DBpedia | 34 | HTML, JSON, RDF, XML | 3000
Lupedia | EN, FR, IT | OEN | range of chars | DBpedia, LinkedMDB | 319 | HTML, JSON, RDFa, XML | unl
Open Calais | EN, FR, SP | OEN | char offset | Open Calais | 95 | JSON, MicroFormat | 50000
Saplo | EN, SW | OED | N/A | N/A | 5 | JSON | 1333
Wikimeta | EN, FR, SP | OEN | POS offset | ESTER | 7 | JSON, XML | unl
Yahoo! | EN | OEN | range of chars | Yahoo | 13 | JSON, XML | 5000
Zemanta | EN | OED | N/A | FreeBase | 81 | XML, JSON, RDF | 10000

Aligned the taxonomies used by the extractors

NERD Ontology

NERD type      Occurrence
Person         10
Organization   10
Country        6
Company        6
Location       6
Continent      5
City           5
RadioStation   5
Album          5
Product        5
...            ...

Building the NERD Ontology
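As a rough illustration of how such an occurrence table could be derived, the sketch below counts, for each NERD type, how many extractor taxonomies can be aligned to it. It assumes that the "Occurrence" column counts extractors, which the slide does not state explicitly. The extractor names and type sets in `ALIGNED_TYPES` are hypothetical placeholders, not the real alignment.

```python
from collections import Counter

# Hypothetical alignment: extractor name -> NERD types its taxonomy maps to.
# These sets are illustrative only and do not reproduce the real NERD alignment.
ALIGNED_TYPES = {
    "extractor_a": {"Person", "Organization", "Location", "City"},
    "extractor_b": {"Person", "Organization", "Country"},
    "extractor_c": {"Person", "Organization", "Location"},
}


def type_occurrences(aligned):
    """Count, for each NERD type, how many extractors can express it."""
    counts = Counter()
    for types in aligned.values():
        counts.update(types)  # a set, so each extractor adds at most 1 per type
    return counts


occurrences = type_occurrences(ALIGNED_TYPES)
# Print the most widely shared types first, ties broken alphabetically.
for nerd_type, n in sorted(occurrences.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{nerd_type:<14}{n}")
```

Types shared by every extractor would be natural candidates for the core of a common ontology, while rarely shared types need a case-by-case decision.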

NERD REST API

GET, POST, PUT, DELETE

/document /user /annotation/{extractor} /extraction /evaluation ...

JSON

"entities": [{
    "entity": "Tim Berners-Lee",
    "type": "Person",
    "uri": "http://dbpedia.org/resource/Tim_berners_lee",
    "nerdType": "http://nerd.eurecom.fr/ontology#Person",
    "startChar": 30,
    "endChar": 45,
    "confidence": 1,
    "relevance": 0.5
}]
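A client could consume such a response roughly as follows. This is a minimal sketch: the fragment from the slide is wrapped in braces so that it parses as a JSON object, and `endChar` is assumed to be exclusive so that `endChar - startChar` gives the mention length. The helper `summarize` is a hypothetical name and not part of the NERD API.

```python
import json

# Response fragment from the slide, wrapped in an object so it is valid JSON.
response_text = """
{"entities": [{
    "entity": "Tim Berners-Lee",
    "type": "Person",
    "uri": "http://dbpedia.org/resource/Tim_berners_lee",
    "nerdType": "http://nerd.eurecom.fr/ontology#Person",
    "startChar": 30,
    "endChar": 45,
    "confidence": 1,
    "relevance": 0.5
}]}
"""


def summarize(entity):
    """Return (surface form, short NERD type, mention length in characters)."""
    # Keep only the local name after '#' in the NERD ontology URI.
    short_type = entity["nerdType"].rsplit("#", 1)[-1]
    # Assumption: endChar is exclusive, so the difference is the length.
    length = entity["endChar"] - entity["startChar"]
    return entity["entity"], short_type, length


entities = json.loads(response_text)["entities"]
summaries = [summarize(e) for e in entities]
for label, short_type, length in summaries:
    print(f"{label} -> nerd:{short_type} ({length} chars)")
```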

Rizzo G., Troncy R. (2012), NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Web Extraction Tools. In: European chapter of the Association for Computational Linguistics (EACL'12), Avignon, France.

RDF

NERD meets NIF

Model documents through a set of strings dereferenceable on the Web

:offset_23107_23110 a str:String ;
    str:referenceContext :offset_0_26546 .

Map string to entity

:offset_23107_23110 sso:oen dbpedia:W3C .

Classification

dbpedia:W3C rdf:type nerd:Organization .
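The three statements on this slide (string, string-to-entity mapping, classification) can be generated from an extracted entity along these lines. This is a sketch that only formats Turtle-like text with the prefixes used on the slide. Prefix declarations are omitted, and the function name `nif_triples` and its parameters are hypothetical.

```python
def nif_triples(start, end, doc_length, entity_curie, nerd_type):
    """Return Turtle-like statements mirroring the three triples on the slide."""
    string_id = f":offset_{start}_{end}"     # the annotated substring
    context_id = f":offset_0_{doc_length}"   # the whole document string
    return [
        # 1. Model the substring and link it to its document context.
        f"{string_id} a str:String ; str:referenceContext {context_id} .",
        # 2. Map the string to the entity it denotes.
        f"{string_id} sso:oen {entity_curie} .",
        # 3. Classify the entity with a NERD ontology type.
        f"{entity_curie} rdf:type nerd:{nerd_type} .",
    ]


triples = nif_triples(23107, 23110, 26546, "dbpedia:W3C", "Organization")
print("\n".join(triples))
```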

Rizzo G, Troncy R., Hellmann S. and Bruemmer M. (2012), NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. In: (LDOW'12) Linked Data on the Web (WWW'12), Lyon, France.

NERD User Dashboard

NERD User Interface

History of NER benchmarks

CoNLL 2003 and CoNLL 2005
  schema (4 types): person, organization, location and miscellaneous

ACE 2004, ACE 2005 and ACE 2007
  schema (7 types): person, organization, location, facility, weapon, vehicle and geo-political entity
  entity recognition, co-ref, find relationships among entities extracted

TAC 2009 (Knowledge Base Track)
  schema (3 types): person, organization and location
  create a knowledge base from the named entities extracted

ETAPE 2012 (Named Entity Task)
  schema: Quaero (7 main types, 32 sub-types)

MSM 2013: tweet corpus!
  schema (4 types): person, organization, location, miscellaneous

ETAPE 2012 challenge

genre          train     dev      test     sources
TV news        7h 40m    1h 40m   1h 40m   BFM Story, Top Questions (LCP)
TV debates     10h 30m   5h 10m   5h 10m   Pile et Face, Ca vous regarde, Entre les lignes (LCP)
TV amusements  -         1h 05m   1h 05m   La place du village (TV8)

                      Train    Dev       Eval
Item length           26h      10h 55m   10h 55m
Nb files              44       15        15
Nb words              290517   91656     115511
Nb Named Entities     46763    14398     13055
Nb unique categories  33       33        33

NERD @ ETAPE (naïve combined strategy)

(eA1, tA1, URIA1, siA1, eiA1), (eA2, tA2, URIA2, siA2, eiA2), (eA3, tA3, URIA3, siA3, eiA3), ...
...
(eN1, tN1, URIN1, siN1, eiN1), (eN2, tN2, URIN2, siN2, eiN2)

extraction, cleaning, fusion

Fusion: when at least 2 extractors classify the same entity with a different type, we apply a preferred selection order (empirically defined): Wikimeta, AlchemyAPI, OpenCalais, Lupedia
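The fusion step can be sketched as follows. Several things here are assumptions rather than details from the slides: that each tuple holds (entity, type, URI, start index, end index), that "the same entity" means the same character span, and that an extractor outside the priority list ranks last. The function name `fuse` and the sample offsets and types are hypothetical.

```python
# Preferred selection order from the slide, highest priority first.
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]


def fuse(annotations, priority=PRIORITY):
    """Merge annotations given as (extractor, entity, type, uri, start, end).

    Annotations covering the same (start, end) span are treated as the same
    entity; the one from the highest-priority extractor is kept. If the
    extractors agree on the type, the kept type is the same anyway.
    """
    rank = {name: i for i, name in enumerate(priority)}
    best = {}
    for ann in annotations:
        extractor, start, end = ann[0], ann[4], ann[5]
        key = (start, end)
        current = best.get(key)
        # Extractors missing from the priority list rank last (assumption).
        if current is None or rank.get(extractor, len(priority)) < rank.get(
            current[0], len(priority)
        ):
            best[key] = ann
    return [best[key] for key in sorted(best)]


# Illustrative input with made-up offsets and types.
annotations = [
    ("alchemyapi", "Paris", "City", "http://dbpedia.org/resource/Paris", 58, 63),
    ("wikimeta", "Paris", "Location", "http://dbpedia.org/resource/Paris", 58, 63),
    ("opencalais", "Eiffel Tower", "Facility", None, 92, 104),
]
fused = fuse(annotations)
for ann in fused:
    print(ann)
```

A fixed priority list is easy to reason about, but it ignores how reliable each extractor is for a given type, which is one motivation for learning the combination instead.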

Participation at ETAPE (combined+ strategy)

(eA1, tA1, URIA1, siA1, eiA1), (eA2, tA2, URIA2, siA2, eiA2), ...
(eN1, tN1, URIN1, siN1, eiN1), (eN2, tN2, URIN2, siN2, eiN2)

ETAPE Train & Dev, Learned model, Created static rules, POS tagger, Apply rules

Fusion: conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia

(e1, t1, URI1, si1, ei1)

NERD Global results

             SLR       Precision  Recall   F-measure  %correct
combined     86.85%    35.31%     17.69%   23.44%     17.69%
combined+    188.81%   15.13%     28.40%   19.45%     28.40%

Combined+: the Eval corpus differs substantially from the Train & Dev corpora. The static rules do not fit the Eval corpus well and introduce classification noise.

Per-extractor results

                    SLR       Precision  Recall   F-measure  %correct
alchemyapi          37.71%    47.95%     5.45%    9.68%      5.45%
lupedia             39.49%    22.87%     1.56%    2.91%      1.56%
opencalais          37.47%    41.69%     3.53%    6.49%      3.53%
wikimeta            36.67%    19.40%     4.25%    6.95%      4.25%
combined (nerd)     86.85%    35.31%     17.69%   23.44%     17.69%
combined+ (nerd+)   188.81%   15.13%     28.40%   19.45%     28.40%

Learning How to Combine NER Extractors

NERD on CoNLL 2003 (NER task)

NERD on MSM 2013 (NER task)

NERD on MSM 2013 (NEL task)

Media Fragment Enricher: http://mfe.synote.org/mfe/

Linking pieces of knowledge

Named Entities for Video Classification

Workflow

Media Fragment Enricher Services
Media Fragment Enricher UI

Components: Metadata & timed-text; NERD Client; RDFizator; Triple Store; Categorization; Video and metadata preview; Video replay with subtitles and aligned NEs

1: Video URL
2: Metadata
3: Metadata
4: NERDify
5: Timed Text
6: NEs with time alignment (json)
7: RDFize (ttl)
8: Generate Category
9: SPARQL query

Channel signature based on NE distribution

LinkedTV: automatic annotations ...

... and enrichment for hypervideos

Cubism Expressionism

Fauvism

FACETS / PROPERTIES OF CONCEPT

CONCEPT IN PLAYER

CONTENT ENRICHMENT

Media Fragments and Annotations

nerd:Location Cafe Rick

nerd:Person H. Bogart

nerd:Person I. Bergman

nerd:Location Casablanca

Media Fragment URI 1.0 Chapters Scenes Shots etc…

http://data.linkedtv.eu/media/e2899e7f#t=840,900
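The `#t=840,900` suffix addresses a temporal fragment of the media resource, from second 840 to second 900. A minimal sketch of extracting that interval is shown below. It handles only plain seconds and not the other time formats allowed by Media Fragment URI 1.0, and the function name `temporal_fragment` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def temporal_fragment(uri):
    """Return (start, end) in seconds from a '#t=start,end' fragment, else None.

    Simplification: only plain numeric seconds are supported.
    """
    fragment = urlsplit(uri).fragment          # e.g. "t=840,900"
    values = parse_qs(fragment).get("t")       # e.g. ["840,900"]
    if not values:
        return None
    start_text, _, end_text = values[0].partition(",")
    start = float(start_text) if start_text else 0.0   # "t=,900" starts at 0
    end = float(end_text) if end_text else None        # "t=840" is open-ended
    return start, end


span = temporal_fragment("http://data.linkedtv.eu/media/e2899e7f#t=840,900")
print(span)
```

Annotations such as `nerd:Person` can then be attached to the interval rather than to the whole video.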

Enrichment and Hypervideos

nerd:Location Cafe Rick

nerd:Person H. Bogart

nerd:Person I. Bergman

nerd:Location Casablanca

nerd:Person E. Tierney

nerd:Location China

Locator

MediaResource

MediaFragment Annotation

Entity

URL (hyperlink)

Type

Media Fragment + Open Annotation + NERD

OffsetBasedString

Towards a Linked Media Layer

Enriching media with media from a closed collection (e.g. BBC archive)
  The MediaEval scenario (~1697 hours of archived BBC video)
  http://www.multimediaeval.org/mediaeval2013/hyper2013/

Enriching media with content from the open web
  LinkedTV scenarios: white-listed web sites for each program
  Media Collector for Social Media

Seed video enriched with web content rbbaktuell_20120809

nerd:Location Brandenburg

oa

Enrichments are Annotations too

Media Finder (named entities clustering)

Media Finder (zooming in a cluster)

Media Finder: http://mediafinder.eurecom.fr/

Live Topic Generation from Event Streams WWW 2013 Demo Session http://www.youtube.com/watch?v=8iRiwz7cDYY

Credits

Giuseppe Rizzo, Vuk Milicic, José Luis Redondo Garcia (EURECOM)

Thomas Steiner (Google Inc.)

Marieke van Erp (Free University of Amsterdam)

Yunjia Li (University of Southampton)

… and many other students

http://www.slideshare.net/troncy
