A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources
Carmen Banea, Rada Mihalcea (University of North Texas); Janyce Wiebe (University of Pittsburgh)

Dec 28, 2015

Page 1

A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources

Carmen Banea, Rada Mihalcea
University of North Texas
carmenb@unt.edu, rada@cs.unt.edu

Janyce Wiebe
University of Pittsburgh
[email protected]

Page 2

Subjectivity analysis

Subjectivity analysis (opinions and sentiments)
Used in a wide variety of applications
  Tracking sentiment timelines in news (Lloyd et al., 2005)
  Review classification (Turney, 2002; Pang et al., 2002)
  Mining opinions from product reviews (Hu and Liu, 2004)
  Expressive text-to-speech synthesis (Alm et al., 2005)
  Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani, 2006)
  Question answering (Yu and Hatzivassiloglou, 2003)
Much work on subjectivity analysis has focused on English
  Other languages: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim and Hovy, 2006)

Page 3

Proportion of Languages on the Web

Source: internetworldstats.com (updated November 30, 2007)

Page 4

Objective

Develop a method for subjectivity analysis that
  Requires few electronic resources
  Can be easily ported to a new language
Applicable to the large number of languages that have scarce electronic resources

Page 5

Related Work

Tools that rely on manually or semi-automatically constructed lexicons
  Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006
  Enable efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in text
These tools assume the availability of advanced language processing tools:
  Syntactic parsers (Wiebe, 2000)
  Information extraction (Riloff and Wiebe, 2003)
  Broad-coverage rich lexical resources: WordNet (Esuli and Sebastiani, 2006)
Our approach relates most closely to the method of Turney (2002) for the construction of lexicons annotated for polarity
  We address the task of acquiring a subjectivity lexicon
  We rely on fewer, smaller-scale resources

Page 6

Our Method

Based on bootstrapping
Requires:
  A small seed set of subjective entries
  One or more electronic dictionaries
  A small training corpus (approx. 500,000 words)
Experiments focused on Romanian
  Applicable to other languages as well

Page 7

Bootstrapping Process

[Flowchart with the elements: seeds, query, online dictionary, candidate synonyms, fixed filtering, selected synonyms, decision "Max. no. of iterations?" (no / yes), variable filtering]
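The loop in this flowchart can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `lookup` and `keep` are hypothetical helpers standing in for the online dictionary query and the fixed filter, and the toy dictionary is invented for the example.

```python
from typing import Callable, Iterable, Set


def bootstrap(
    seeds: Iterable[str],
    lookup: Callable[[str], Set[str]],
    keep: Callable[[str], bool],
    max_iterations: int,
) -> Set[str]:
    """Grow a lexicon from seeds by repeated dictionary expansion.

    lookup(word) returns candidate synonyms for a word (hypothetical stand-in
    for the dictionary query); keep(word) is the fixed filter (hypothetical).
    """
    lexicon = set(seeds)
    frontier = set(seeds)
    for _ in range(max_iterations):
        candidates: Set[str] = set()
        for word in frontier:
            candidates |= lookup(word)
        # Fixed filtering: keep only new candidates that pass the filter.
        selected = {c for c in candidates if c not in lexicon and keep(c)}
        if not selected:
            break  # nothing new to expand
        lexicon |= selected
        frontier = selected  # selected synonyms become the next queries
    return lexicon


# Toy dictionary, invented for illustration only.
toy_dictionary = {
    "dulce": {"placut", "dulceag"},
    "placut": {"agreabil"},
    "dulceag": set(),
    "agreabil": set(),
}
result = bootstrap(
    seeds=["dulce"],
    lookup=lambda w: toy_dictionary.get(w, set()),
    keep=lambda w: w != "dulceag",  # arbitrary filter for the demo
    max_iterations=2,
)
print(sorted(result))
```

Only words selected in one iteration are queried in the next, so the number of iterations bounds how far the lexicon can drift from the seeds.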

Page 8

Seed Set

Category: sample entries (with their English translation)
  Noun: blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
  Verb: iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
  Adjective: frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
  Adverb: posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)

60 seeds, evenly sampled from verbs, nouns, adjectives and adverbs
Manually selected
Seed sources:
  11th-grade curriculum for Romanian Language and Literature
  Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)

Page 9

Expansion

Romanian dictionary: http://www.dexonline.ro
  Dictionaries for other languages are also available, or can be obtained from paper dictionaries through OCR

[Diagram with the elements: Seed, Definition, Candidate synonyms]

Candidates: all open-class words that have a definition in the dictionary and are longer than 3 letters
Diacritics are removed
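A rough sketch of the candidate extraction step, under stated assumptions: the open-class test is approximated by excluding a small closed-class word list (`closed_class`, hypothetical), `headwords` stands for the set of words that have a dictionary entry, and the sample definition is invented for illustration rather than quoted from dexonline.ro.

```python
import re
import unicodedata
from typing import Set


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'plăcut' becomes 'placut'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def extract_candidates(
    definition: str, headwords: Set[str], closed_class: Set[str]
) -> Set[str]:
    """Return words from a definition that qualify as candidates."""
    # Lowercase, drop diacritics, then split into alphabetic tokens.
    tokens = re.findall(r"[^\W\d_]+", strip_diacritics(definition.lower()))
    return {
        t
        for t in tokens
        if len(t) > 3  # longer than 3 letters
        and t not in closed_class  # crude open-class approximation (assumption)
        and t in headwords  # must have its own dictionary entry
    }


# Invented definition and word lists, for illustration only.
definition = "Care are gust plăcut, ca de zahăr"
headwords = {"care", "gust", "placut", "zahar"}
closed_class = {"care"}
candidates = extract_candidates(definition, headwords, closed_class)
print(sorted(candidates))
```

A real system would more likely rely on part-of-speech information from the dictionary itself than on a stop list.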

Page 10

Filtering

Candidates are filtered based on a measure of similarity with the original seeds
  We use Latent Semantic Analysis (LSA) (Dumais et al., 1988) trained on the SemCor corpus (Miller et al., 1993)
After each iteration, only candidates with an LSA score higher than a given threshold are selected for further expansion
Example:
  Seed: dulce (sweet)
  Candidate synonyms: cu gust dulce (sweet-tasting), placut (pleasant), dulceag (quasi-sweet)
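The threshold filter can be illustrated with a small sketch. It assumes each word already has a vector in the LSA space (the two-dimensional vectors below are invented stand-ins) and scores a candidate by its cosine similarity to the centroid of the seed vectors. The slides do not say how similarity to the seeds is aggregated, so the centroid is an assumption made only for this example.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(
    seed_vectors: List[Sequence[float]],
    candidate_vectors: Dict[str, Sequence[float]],
    threshold: float,
) -> Dict[str, float]:
    """Keep candidates whose similarity to the seed centroid exceeds threshold."""
    dims = len(seed_vectors[0])
    centroid = [
        sum(vec[i] for vec in seed_vectors) / len(seed_vectors) for i in range(dims)
    ]
    scores = {w: cosine(vec, centroid) for w, vec in candidate_vectors.items()}
    return {w: s for w, s in scores.items() if s > threshold}


# Invented toy vectors; real ones would come from an LSA model.
seed_vectors = [[1.0, 0.0], [0.8, 0.6]]
candidate_vectors = {"placut": [1.0, 0.2], "masa": [0.0, 1.0]}
kept = filter_candidates(seed_vectors, candidate_vectors, threshold=0.5)
print(sorted(kept))
```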

Page 11

Filtering

Several iterations of the bootstrapping process result in a subjectivity lexicon consisting of a ranked list of candidates in decreasing order of similarity to the original seeds
A variable filtering threshold can be used to further restrict the similarity, yielding a purer lexicon
Filtering parameters:
  Similarity threshold
  Number of iterations
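As a small illustration of the ranked list and the variable threshold, the sketch below sorts scored entries and then applies a stricter cut-off. The scores are invented, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_lexicon(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order entries by decreasing similarity to the seeds."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def apply_variable_threshold(
    ranked: List[Tuple[str, float]], threshold: float
) -> List[str]:
    """Keep only entries scoring strictly above the chosen threshold."""
    return [word for word, score in ranked if score > threshold]


# Invented similarity scores, for illustration only.
scores = {"gust": 0.51, "placut": 0.82, "dulceag": 0.64}
ranked = rank_lexicon(scores)
purer = apply_variable_threshold(ranked, threshold=0.6)
print(ranked)
print(purer)
```

Because the list is already ranked, raising the threshold simply truncates it, which makes it cheap to try several thresholds after bootstrapping has finished.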

Page 12

Lexicon Acquisition

Page 13

Evaluation

Rule-based classifier of subjectivity (Riloff and Wiebe, 2003)
  Subjective sentence: three or more subjective entries
  Objective sentence: two or fewer subjective entries
Gold standard data set (Mihalcea, Banea and Wiebe, 2007)
  504 sentences from five SemCor documents (manually translated into Romanian)
  Labeled by two annotators
  Agreement (all): 83% (κ=0.67)
  Agreement (uncertain removed): 89% (κ=0.77)
  Baseline: 54% (all subjective)
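The classification rule stated on this slide can be sketched in a few lines. The sketch assumes sentences are already tokenized and normalized to match lexicon entries, and it counts every matching token occurrence; the lexicon and sentences are toy data.

```python
from typing import Iterable, Set


def classify(tokens: Iterable[str], lexicon: Set[str]) -> str:
    """Label a sentence by counting lexicon entries among its tokens."""
    hits = sum(1 for t in tokens if t in lexicon)
    # Three or more subjective entries: subjective; two or fewer: objective.
    return "subjective" if hits >= 3 else "objective"


# Toy lexicon and token lists, for illustration only.
lexicon = {"frumos", "fericit", "iubi", "dulce"}
label_a = classify(["frumos", "fericit", "dulce", "masa"], lexicon)
label_b = classify(["frumos", "masa"], lexicon)
print(label_a, label_b)
```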

Page 14

Number of Iterations

F-measure for the bootstrapping subjectivity lexicon over 5 iterations and an LSA threshold of 0.5
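For reference, the F-measure is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the counts are hypothetical and do not come from the experiments reported here.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    if true_pos == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_pos / predicted
    recall = true_pos / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision 0.75, recall 0.60.
score = f_measure(true_pos=60, false_pos=20, false_neg=40)
print(round(score, 4))
```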

Page 15

Similarity Threshold

F-measure for the fifth bootstrapping iteration for varying LSA scores

Page 16

Comparison

Bootstrapping rule-based classifier: uses a 3,913-entry subjectivity lexicon obtained through 5 iterations and a similarity threshold of 0.5

Page 17

Conclusions

Our bootstrapping method uses few electronic resources:
  A small seed set
  One or more dictionaries
  A small corpus of half a million words
A large subjectivity lexicon of approx. 4,000 entries was extracted
Using an unsupervised rule-based classifier, a subjectivity F-measure of 66.20% and an overall F-measure of 61.69% can be achieved

Page 18

Questions?