Natural Language Processing using Wikipedia
Rada Mihalcea, University of North Texas

Transcript
Page 1

Natural Language Processing using Wikipedia

Rada Mihalcea, University of North Texas

Page 2

Text Wikification

Finding key terms in documents and linking them to relevant encyclopedic information.

Page 3

Text Wikification (continued)

Motivation:
- Help Wikipedia contributors
- NLP applications (summarization, text categorization, metadata annotation, text similarity)
- Enrich educational materials
- Annotate web pages (semantic web)

A combined problem:
- Finding the important concepts: keyword extraction
- Finding the correct article: word sense disambiguation

Page 4

Wikification Pipeline

[Pipeline diagram:
Raw (hyper)text -> Decomposition -> Clean Text
Keyword Extraction: Candidate Extraction -> Candidate Ranking -> Text with selected keywords
Word Sense Disambiguation: Extract Sense Definitions from Sense Inventory; Knowledge-based (Lesk-like Definition Overlap) and Data-driven (Naive Bayes trained on Wikipedia), combined by Voting -> Annotated Text
Recomposition -> (Hyper)text with linked keywords]
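Read end to end, the diagram is a small function composition. Below is a runnable toy sketch under heavy simplifications: single-word candidates only, ranking by a precomputed keyphraseness table (see Page 7), and a trivial "first listed article" stand-in for the Lesk-like and Naive Bayes disambiguation components. All names and values here are illustrative, not from the talk.

```python
import re

def wikify(text, vocabulary, keyphraseness, ratio=0.06):
    """Toy Wikify! pipeline sketch: candidate extraction, candidate ranking,
    then sense disambiguation. `vocabulary` maps a surface phrase to its
    candidate article titles; `keyphraseness` maps a phrase to P(keyword | W)."""
    words = re.findall(r"[a-z]+", text.lower())
    candidates = {w for w in words if w in vocabulary}        # stage 1: extraction
    n = max(1, int(len(words) * ratio))                       # keep N = 6% of words
    keywords = sorted(candidates, key=lambda w: keyphraseness.get(w, 0.0),
                      reverse=True)[:n]                       # stage 2: ranking
    # Disambiguation placeholder: pick the first listed article per keyword;
    # the talk votes between Lesk-like overlap and a Naive Bayes classifier.
    return {kw: vocabulary[kw][0] for kw in keywords}

vocab = {"bar": ["bar (counter)", "bar (music)"], "coffee": ["coffee"]}
kp = {"coffee": 0.6, "bar": 0.3}                              # toy keyphraseness values
print(wikify("A quick coffee at the bar", vocab, kp))         # {'coffee': 'coffee'}
```

The following slides refine each stage in turn.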

Page 5

Keyword Extraction

Finding important words/phrases in raw text. A two-stage process:
1. Candidate extraction. Typical methods: n-grams, noun phrases.
2. Candidate ranking: rank the candidates by importance. Typical methods: unsupervised (information-theoretic); supervised (machine learning using positional and linguistic features).

Page 6

Keyword Extraction using Wikipedia: 1. Candidate Extraction

Semi-controlled vocabulary: Wikipedia article titles and anchor texts (surface forms), e.g. "USA", "U.S." = "United States of America". More than 2,000,000 terms/phrases. The vocabulary is broad (e.g., "the" and "a" are included).
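A minimal sketch of this extraction step, assuming the vocabulary has already been flattened into a Python set of lowercased titles and anchor texts; the greedy longest-match strategy is an illustrative choice, not necessarily the one used in the talk.

```python
def extract_candidates(tokens, vocabulary, max_len=4):
    """Stage 1 sketch: greedy longest-match lookup of n-grams against a
    controlled vocabulary of Wikipedia article titles and anchor texts."""
    candidates, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # prefer longer phrases
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in vocabulary:
                candidates.append(phrase)
                i += n
                break
        else:
            i += 1                                             # no match: skip one token
    return candidates

vocab = {"united states of america", "federal district"}
tokens = "The United States of America is a federal constitutional republic".split()
print(extract_candidates(tokens, vocab))   # ['united states of america']
```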

Page 7

Keyword Extraction using Wikipedia: 2. Candidate Ranking

- tf * idf, with the Wikipedia articles as the document collection
- Chi-squared independence of phrase and text: the degree to which a phrase appeared more times than expected by chance
- Keyphraseness:

$P(\mathrm{keyword} \mid W) \approx \frac{count(D_{key})}{count(D_W)}$

i.e., the probability that a term W is selected as a keyword (link), estimated as the number of Wikipedia articles where W appears as a link divided by the number of articles where W appears at all.
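A sketch of estimating keyphraseness from precomputed Wikipedia statistics; the two count tables and the minimum-frequency cutoff are assumptions for illustration.

```python
def keyphraseness(phrase, link_count, doc_count, min_docs=5):
    """P(keyword | W) ~ count(D_key) / count(D_W): how often the phrase was
    chosen as a link in the articles where it appeared at all."""
    appearances = doc_count.get(phrase, 0)
    if appearances < min_docs:       # too rare for a reliable estimate
        return 0.0
    return link_count.get(phrase, 0) / appearances

# Toy counts: frequent function words score near zero, good keyphrases high.
doc_count = {"cold war": 1500, "the": 2_000_000}
link_count = {"cold war": 1200, "the": 15}
print(keyphraseness("cold war", link_count, doc_count))   # 0.8
print(keyphraseness("the", link_count, doc_count))        # 7.5e-06
```

This is also why the broad vocabulary of Page 6 is harmless: phrases like "the" are in the vocabulary but receive near-zero keyphraseness.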

Page 8

Evaluations

Gold standard: 85 documents containing 7,286 links. The links were selected by Wikipedia users and have undergone Wikipedia's continuous editorial process. Extract N keywords from the ranking, with N = 6% of the number of words (e.g., a 500-word document yields 30 keywords).

Page 9

Results

[Bar chart: precision, recall, and F-measure (y-axis from 30% to 60%) for the three ranking methods: tf.idf, chi-squared, and keyphraseness.]

Page 10

Example: Keyword Extraction

The slide shows the same paragraph twice, side by side: once with automatically extracted annotations and once with the original Wikipedia annotations (the links themselves are not preserved in this transcript):

"The United States of America is a federal constitutional republic comprising fifty states and a federal district. The country is situated almost entirely in the western hemisphere: its forty-eight contiguous states and Washington, D.C., the capital district, lie in central North America between the Pacific and Atlantic Oceans, bordered by Canada to the north and Mexico to the south; the state of Alaska is in the northwest of the continent with Canada to its east, and the state of Hawaii is in the mid-Pacific."

Page 11

Wikification Pipeline

[The pipeline diagram from Page 4, repeated: keyword extraction has been covered, word sense disambiguation comes next.]

Page 12

Word Sense Disambiguation

Three sentences using the ambiguous word "bar" in different senses:

Channel: "A channel is also the natural or man-made deeper course through a reef, bar, bay, or any shallow body of water."

Meter: "Each bar has a 2-beat unit, a 5-beat unit, and a 3-beat unit, with a stress at the beginning of each unit."

Aida (café): "In most shops a quick coffee while standing up at the bar is possible."

Page 14

Wikipedia as a Sense Tagged Corpus

In most shops a quick coffee while standing up at the [[bar (counter) | bar]] is possible.

A channel is also the natural or man-made deeper course through a reef, [[bar (landform) | bar]], bay, or any shallow body of water.

Each [[bar (music) | bar]] has a 2-beat unit, a 5-beat unit, and a 3-beat unit, with a stress at the beginning of each unit.

Wikipedia links = Sense annotations
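Harvesting these annotations amounts to parsing piped links of the form [[label | surface]] out of the wiki markup. A minimal sketch (real markup has many more edge cases, and plain links like [[bar]] carry no disambiguated label):

```python
import re

PIPED_LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")

def sense_annotations(wikitext):
    """Yield (surface form, Wikipedia label) pairs from piped links."""
    for label, surface in PIPED_LINK.findall(wikitext):
        yield surface.strip(), label.strip()

s = "Each [[bar (music) | bar]] has a 2-beat unit, a 5-beat unit, and a 3-beat unit."
print(list(sense_annotations(s)))   # [('bar', 'bar (music)')]
```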

Page 15

Sense Inventory

Alternative 1: disambiguation webpages
- Do not include all possible annotations: [[measure (music) | bar]] is a valid annotation, but measure (music) is not listed on the disambiguation page
- Inconsistent identifiers of disambiguation pages: paper (disambiguation) vs. paper

Alternative 2: extract all link annotations (bar (counter), bar (music), bar (landform)) and map them to WordNet senses

Page 16

Building a Sense Tagged Corpus

Given an ambiguous word W:
1. Extract all the paragraphs in Wikipedia containing the ambiguous word W inside a link
2. Collect all the possible Wikipedia labels (the leftmost component of each link)
3. Map the Wikipedia labels to WordNet senses
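Steps 1 and 2 can be sketched directly on top of the link parser above; step 3, the mapping from Wikipedia labels to WordNet senses, is a separate mapping step not sketched here.

```python
import re
from collections import Counter

PIPED_LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")

def collect_wikipedia_labels(paragraphs, word):
    """Steps 1-2: keep paragraphs where `word` occurs as the surface form of
    a piped link, and count the Wikipedia labels it was linked with."""
    labels = Counter()
    for paragraph in paragraphs:
        for label, surface in PIPED_LINK.findall(paragraph):
            if surface.strip().lower() == word:
                labels[label.strip()] += 1
    return labels

paras = ["Each [[bar (music) | bar]] has a 2-beat unit.",
         "Standing at the [[bar (counter) | bar]] is possible."]
print(collect_wikipedia_labels(paras, "bar"))
# Counter({'bar (music)': 1, 'bar (counter)': 1})
```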

Page 17

An Example

Given ambiguous word W = BAR:
1. Extract all the paragraphs in Wikipedia containing the ambiguous word W inside a link: 1,217 paragraphs; after removing examples with the ambiguous plain link [[bar]]: 1,108 examples
2. Collect all the possible Wikipedia labels (the leftmost component of each link): 40 Wikipedia labels, e.g. bar (music); measure (music); musical notation
3. Map the Wikipedia labels to WordNet senses: 9 WordNet senses

Page 18

Word sense | Wikipedia labels | Wikipedia definition | WordNet definition
bar (counter) | bar_(counter) | The counter from which drinks are dispensed | A counter where you can obtain food or drink
bar (music) | bar_(music), measure_music, musical_notation | A period of music | Musical notation for a repeating pattern of musical beats
bar (landform) | bar_(landform) | A type of beach behind which lies a lagoon | A submerged (or partly submerged) ridge in a river or along a shore

Page 19

Supervised Word Sense Disambiguation

Local and topical features in a Naive Bayes classifier; good performance on Senseval-2 and Senseval-3 data.

Local features:
- Current word and part-of-speech
- Surrounding context of three words
- Collocational features

Topical features:
- Five keywords per sense, occurring at least three times

(Ng & Lee, 1996), (Lee & Ng, 2002)
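A hedged sketch of such a data-driven classifier with scikit-learn; the specific local and topical features from (Ng & Lee, 1996) are approximated by a plain bag of words over the surrounding context, and the training sentences are toy stand-ins for the Wikipedia-derived examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy sense-tagged contexts for the ambiguous word "bar", in the style of
# the examples harvested from Wikipedia links on the previous slides.
contexts = [
    "a quick coffee while standing up at the bar is possible",
    "drinks are dispensed over the bar to the customers",
    "each bar has a 2-beat unit with a stress at the beginning",
    "the bar is a unit used in musical notation",
]
senses = ["bar (counter)", "bar (counter)", "bar (music)", "bar (music)"]

wsd = make_pipeline(CountVectorizer(), MultinomialNB())
wsd.fit(contexts, senses)
print(wsd.predict(["a stress at the beginning of each bar"]))  # ['bar (music)']
```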

Page 20

Experiments on Senseval-2 / Senseval-3

Lexical sample WSD: 49 ambiguous nouns from Senseval-2 (29) and Senseval-3 (20).
- Remove the words with only one Wikipedia sense (e.g., detention)
- Remove the words with all Wikipedia senses mapped to one WordNet sense (e.g., the Wikipedia labels Roman church and Catholic church both map to the single WordNet sense Catholic church)

Final set: 30 nouns with Wikipedia labels mapped to at least two WordNet senses.

Page 21

Ten-fold cross validation:
- [WSD] Supervised word sense disambiguation on the Wikipedia sense-tagged corpora
- [MFS] Most frequent sense: choose the most frequent sense by default
- [Similarity] Similarity between the current example and the training data available for each sense
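Minimal sketches of the two baselines, reading "similarity" as simple token overlap between the test context and the pooled training contexts of each sense (the slide does not specify the measure):

```python
from collections import Counter

def most_frequent_sense(train_senses):
    """[MFS] Always predict the sense that is most frequent in training."""
    return Counter(train_senses).most_common(1)[0][0]

def similarity_sense(context, examples_by_sense):
    """[Similarity] Predict the sense whose pooled training contexts share
    the most tokens with the current example.
    `examples_by_sense` maps sense -> list of training contexts."""
    tokens = set(context.lower().split())
    return max(examples_by_sense,
               key=lambda s: len(tokens &
                                 set(" ".join(examples_by_sense[s]).lower().split())))

train = {"bar (music)": ["each bar has a beat"],
         "bar (counter)": ["drinks at the bar"]}
print(similarity_sense("a beat in each bar", train))   # bar (music)
```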

Page 22

Results on Senseval-2 / Senseval-3

Word | #s | #ex | MFS | Similarity | WSD
argument | 2 | 114 | 70.17% | 73.63% | 89.47%
arm | 2 | 291 | 61.85% | 69.31% | 84.87%
bank | 3 | 1074 | 97.20% | 97.20% | 97.20%
bar | 10 | 1108 | 47.38% | 68.09% | 83.12%
circuit | 4 | 327 | 85.32% | 85.62% | 87.15%
degree | 7 | 849 | 58.77% | 73.05% | 85.98%
stress | 3 | 565 | 53.27% | 54.28% | 86.37%
Average | 3.31 | 316 | 72.58% | 78.02% | 84.65%

Page 23

Some Notes

Words with no improvement:
- Small number of examples in Wikipedia: restraint (9), shelter (17)
- Skewed sense distributions: bank has 1,044 occurrences as "financial institution" and 30 occurrences as "river bank"

Different granularity:
- Coarser-grained senses in Wikipedia
- Missing senses: atmosphere: ambiance
- Coarse distinctions: grasp: act of grasping (#1) = hold (#2)
- Exceptions: dance performance, theatre performance

Page 24

Experiments on Wikipedia

All-words WSD as "link disambiguation": find the link assigned by the Wikipedia annotators.

Data set: the same data set used in the keyword evaluation, 85 documents containing 7,286 links.

Three methods:
- Supervised
- Similarity (unsupervised): measure the similarity of the context and the candidate article
- Combined: voting
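The combination is a plain vote over the individual methods' sense predictions. A minimal sketch; the tie-breaking rule here (fall back to the first method) is an assumption:

```python
from collections import Counter

def vote(predictions):
    """Majority vote over per-method sense predictions; on a tie, fall back
    to the first method's answer (an assumption, not from the talk)."""
    counts = Counter(predictions)
    top, n = counts.most_common(1)[0]
    if list(counts.values()).count(n) > 1:   # tie between methods
        return predictions[0]
    return top

print(vote(["bar (music)", "bar (music)", "bar (counter)"]))  # bar (music)
```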

Page 25

Results

[Bar chart: precision, recall, and F-measure (y-axis from 50% to 100%) for the random baseline, most frequent sense, Similarity, Supervised, and Combined methods.]

Page 26

Wikification

[The pipeline diagram from Page 4, repeated.]

Page 27

Wikify! system (http://lit.csci.unt.edu/~wikify/ or www.wikifyer.com)

Page 28

Overall System Evaluation
- Turing-like test
- Annotation of educational materials

Page 29

Turing-like Test

Given a Wikipedia article, decide whether it was annotated by humans or by our automated system. The slide shows the same paragraph twice, side by side: once with automatically extracted annotations and once with the original Wikipedia annotations (the links themselves are not preserved in this transcript):

"The United States of America is a federal constitutional republic comprising fifty states and a federal district. The country is situated almost entirely in the western hemisphere: its forty-eight contiguous states and Washington, D.C., the capital district, lie in central North America between the Pacific and Atlantic Oceans, bordered by Canada to the north and Mexico to the south; the state of Alaska is in the northwest of the continent with Canada to its east, and the state of Hawaii is in the mid-Pacific."

Page 30

Turing-like Test

- 20 test subjects (mixed background)
- 10 document pairs for each subject (side by side)
- Average accuracy: 57%
- Ideal case = 50% success rate (total confusion)

Page 31

Annotation of Educational Materials

Studies in cognitive science: "An important part of the learning process is the ability to connect the learning material to the prior knowledge of the learner" (Walter Kintsch, 1998).

The amount of required background material depends on the level of explicitness of the text and on the knowledge of the learner (low-knowledge vs. high-knowledge learners). We use the text wikifier to facilitate access to background knowledge.

Page 32

A History Test

- A test consisting of 14 multiple-choice questions from a quiz in an online history course at UNT
- Half the questions linked to Wikipedia, half left in their original format
- 60 students took the test; randomly, either the first or the last 7 questions were wikified
- Students were instructed that they were allowed to use any information they wanted to answer the questions, and that they were not required to use the Wikipedia links

Page 33

[Slide content not recoverable from this transcript.]

Page 34

Results

[Bar charts comparing Raw vs. Wikified questions: percentage of correct answers (y-axis 60% to 80%) and time taken (y-axis 50 to 80), with reported significance levels p < 0.1 and p < 0.05.]

Page 35

Lessons Learned

Wikipedia can be used as a source of evidence for text processing tasks:
- Keyword extraction
- Word sense disambiguation

Text wikification, linking documents to encyclopedic knowledge, can:
- Enrich educational materials
- Annotate web pages (semantic web)
- Support NLP applications: summarization, information retrieval, text categorization, text adaptation, topic identification, multilingual semantic networks

Page 36

Ongoing Work: Text Adaptation

Example text, "Planning for a Long Trip" (Magellan's Stories): "Serrao's letters helped build in my mind the location of the Spice Islands, which later became the destination for my great voyage. I asked the King of Portugal to support my journey, but he refused. After that, I begged the King of Spain. He was interested in my plan since Spain was looking for a better sea route to Asia than the Portuguese route around the southern tip of Africa. It was going to be hard to find sailors, though. None of the Spanish sailors wanted to sail with me because I was Portuguese."

Glossary pop-ups shown with the adapted text:
- Def: long trip with a specific objective, esp. by sea or air. En: trip, journey. Es: travesia, viaje
- Def: to travel by boat. En: navigate. Es: salir, navigar

Funded by the National Science Foundation under CAREER IIS-0747340, 2008-2013. Collaboration with Educational Testing Service (ETS).

Page 37

Ongoing Work: Topic Identification

Automatic identification of the topic/category of a text (e.g., computer science, psychology): books, learning objects.

Funded by the Texas Higher Education Coordinating Board and Google, 2008-2010.

Example: "The United States was involved in the Cold War."

Wikipedia concept and category scores:
- United States: 0.3793
- Cold War: 0.3111
- Vietnam War: 0.0023
- World War I: 0.0023
- Communism: 0.0027
- Ronald Reagan: 0.0027
- Mikhail Gorbachev: 0.0023
- Cat: Wars Involving the United States: 0.00779
- Cat: Global Conflicts: 0.00779

Page 38

Ongoing Work: Multilingual Semantic Networks

Funded by the National Science Foundation under IIS-1018613, 2010-2013.

Example sentence: "John Williams served as the principal conductor of the Boston Pops Orchestra."

[Semantic network diagram built from the sentence. Concept nodes with multilingual lexicalizations:
- COMPOSER. En: composer; Fr: compositeur; De: Komponist
- JOHN WILLIAMS. En: John Williams, Williams; Fr: John Williams; De: John Williams
- PIANIST. En: pianist; Fr: pianiste; De: Pianist
- MUSICIAN. En: musician; Fr: musicien; De: Musiker
- CONDUCTOR. En: conductor; Fr: chef d'orchestre; De: Dirigent
- CONDUCTOR OF THE BOSTON POPS ORCHESTRA. En: conductor of the Boston Pops Orchestra; Fr: chef d'orchestre de l'Orchestre Boston Pops; De: Dirigent des Boston Pops Orchestra
- ORCHESTRA. En: orchestra; Fr: orchestre; De: Orchester
- BOSTON POPS ORCHESTRA. En: Boston Pops Orchestra; Fr: Orchestre Boston Pops

The nodes are connected by typed edges: isA, instanceOf, partOf.]
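One way to picture the underlying data structure: concept nodes carrying per-language lexicalizations, joined by typed edges (isA, instanceOf, partOf). A minimal sketch, not the project's actual representation:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    labels: dict                                   # language code -> lexicalizations
    edges: list = field(default_factory=list)      # (relation, target Concept) pairs

musician = Concept({"en": ["musician"], "fr": ["musicien"], "de": ["Musiker"]})
composer = Concept({"en": ["composer"], "fr": ["compositeur"], "de": ["Komponist"]},
                   [("isA", musician)])
john = Concept({"en": ["John Williams", "Williams"], "fr": ["John Williams"]},
               [("instanceOf", composer)])

relation, target = john.edges[0]
print(relation, target.labels["fr"])               # instanceOf ['compositeur']
```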

Page 39

Thank You!

Questions?

Page 40

Wikipedia for Natural Language Processing

- Word similarity: (Strube & Ponzetto, 2006), (Gabrilovich & Markovitch, 2007)
- Text categorization: (Gabrilovich & Markovitch, 2006)
- Named entity disambiguation: (Bunescu & Pasca, 2006)

Page 41

Wikipedia vs. WordNet (Senseval)

Different granularity:
- Coarser-grained senses in Wikipedia
- Missing senses: atmosphere: ambiance
- Coarse distinctions: grasp: act of grasping (#1) = hold (#2)
- Exceptions: dance performance, theatre performance

Wikipedia vs. Senseval: different sense distributions (low sense distribution correlation, r = 0.51)

 | #s | #ex | MFS | LeskC | WSD
Senseval | 4.6 | 226 | 51.53% | 58.33% | 68.13%
Wikipedia | 3.31 | 316 | 72.58% | 78.02% | 84.65%

Page 42

Sense Disambiguation Learning Curve

Disambiguation accuracy using 10%, 20%, ... 100% of the data.

[Line chart: accuracy (y-axis 70 to 90) vs. fraction of the data (x-axis from 10% to 100%).]

Pages 43-46

[Section divider slides repeating the Text Wikification definition from Page 2.]

Page 47

Lexical Semantics

Find the meaning of all words in unrestricted text. Required for automatic machine translation, information retrieval, and text understanding.

SenseLearner: minimally supervised learning
- Senseval-2, Senseval-3, Semeval (Semeval @ ACL 2007)
- Publicly available: http://lit.csci.unt.edu/~senselearner

GWSD: unsupervised graph-based algorithms
- Random walks on text structures
- Find the most central meanings in a text
- http://lit.csci.unt.edu/index.php/Downloads

Funded by the National Science Foundation.

Page 48

Lexical Semantics

Lexical substitution: SubFinder
- Find semantically equivalent substitutes for a target word in a given context
- Combine corpus-based and knowledge-based approaches
- Combine monolingual and multilingual resources: WordNet, Encarta, bilingual dictionaries, large corpora
- Fared well in the Semeval 2007 lexical substitution task

TransFinder
- Find the translation of a target word in a given context
- Assist Hispanic students with the understanding of English texts
- Task at Semeval 2010

Funded by the National Science Foundation.

Page 49

Lexical Semantics

Text-to-text semantic similarity:
- Find whether two pieces of text contain the same information
- Useful for information retrieval (search engines) and text summarization
- Focus on automatic student answer grading: given the instructor answer and the student answer, assign a grade and identify potential misunderstandings and areas that need clarification

Funded by the National Science Foundation.
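A hedged sketch of the similarity core of such a grader, reduced here to TF-IDF cosine similarity (the actual work combines richer corpus-based and knowledge-based measures):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def answer_similarity(instructor_answer, student_answer):
    """Cosine similarity between TF-IDF vectors of the two answers,
    usable as one signal when assigning an automatic grade."""
    vectors = TfidfVectorizer().fit_transform([instructor_answer, student_answer])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

print(answer_similarity("A bar is a unit of musical time",
                        "a bar measures time in music"))
```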

Page 50

Metadata Annotation for Learning Object Repositories

- Learning object repositories support sharing and reuse of educational materials
- Identify keywords and related concepts for the automatic annotation of learning object repositories
- Keyword extraction using graph-based algorithms, with knowledge drawn from Wikipedia

Funded by the Texas Higher Education Coordinating Board (THECB).

Page 51

Sentiment and Subjectivity

Add subjectivity and sentiment labels to word senses. Important for the automatic analysis of political opinions, product reviews, and market research. Collaboration with Jan Wiebe, U. Pittsburgh.

- Automatic assignment of subjectivity to word senses
- Projection of subjectivity annotations and resources to other languages, via parallel texts / bilingual dictionaries or via machine translation
- Bootstrapping of subjectivity / sentiment seeds using propagation on graphs and word similarity

Funded by the National Science Foundation.

Page 52

Sentiment and Subjectivity

Affective text:
- Automatic annotation of emotions in text: anger, disgust, fear, joy, sadness, surprise
- Collaboration with Carlo Strapparava, IRST
- Large data sets constructed

Computational humour:
- Learning to recognize humour
- Identification of connections with other linguistic properties: affect, valence, semantic classes

Page 53

Text-to-image Synthesis

Language learning:
- Children
- Second (foreign) language
- People with language disorders

An international, language-independent knowledge base: pictures are transparent to languages.

Applications: pictorial translations ("Letters to my cousin").

Bridge the gap between research in image and text processing: image retrieval/classification and natural language.

Page 54

Typical entry in a dictionary:
- pipe, tobacco pipe: a tube with a small bowl at one end; used for smoking tobacco
- pipe, pipage, piping: a long tube made of metal or plastic that is used to carry water or oil or gas etc.
- pipe, tabor pipe: a tubular wind instrument

+ pictorial representations