Understanding User-generated Content on Social Media Platforms

Meena Nagarajan, Ph.D. Dissertation Defense
Kno.e.sis Center, College of Engineering and Computer Science, Wright State University
Dec 05, 2014

Transcript
Page 1: Meena Nagarajan Ph.D. Dissertation Defense

Understanding User-generated Content on Social Media

Meena Nagarajan

Ph.D. Dissertation Defense

Kno.e.sis Center, College of Engineering and Computer Science

Wright State University

1

Page 2: Meena Nagarajan Ph.D. Dissertation Defense

Introductions and Thank-you!

2

Page 3: Meena Nagarajan Ph.D. Dissertation Defense

Social Information Needs

• Facts, Networked Public Conversations, Opinions, Emotions, Preferences..

3

Page 4: Meena Nagarajan Ph.D. Dissertation Defense

Social Information Needs

• Can we use this information to assess a population’s preference?

• Can we study how these preferences propagate in a network of friends?

• Are such crowd-sourced preferences a good substitute for traditional polling methods?

4

Page 5: Meena Nagarajan Ph.D. Dissertation Defense

!" !"

Social Information Processing

• "Who says what, to whom, why, to what extent and with what effect?" [Laswell]

• Network: Social structure emerges from the aggregate of relationships (ties)

• People: poster identities, the active effort of accomplishing interaction

• Content: studying the content of communication.

ABOUTNESS of textual user-generated content via the lens of TEXT MINING

5

Page 6: Meena Nagarajan Ph.D. Dissertation Defense

Aboutness Of Text

• One among several terms used to express certain attributes of a discourse, text or document

• Characterizing what a document is about: what its content, subject, or topic matter is

• A central component of knowledge organization and information retrieval

• For machine and human consumption

6

Page 7: Meena Nagarajan Ph.D. Dissertation Defense

Aboutness & Subgoals in IE

• Named entity recognition

• Co-reference, anaphora resolution: "International Business Machines" and "IBM"; 'he' in a passage refers to the mention of 'John Smith'

• Terminology, key-phrase, lexical chain extraction

• Relationship and fact extraction: 'person works for organization'

7

Page 8: Meena Nagarajan Ph.D. Dissertation Defense

Text Mining and Aboutness

• Thesis focus: 'Aboutness' understanding via Text Mining

• Gleaning meaningful information from natural language text useful for particular purposes

• Indicators of thematic elements for aboutness

• via NER, Key phrase extraction

8

Page 9: Meena Nagarajan Ph.D. Dissertation Defense

Aboutness & The Role Of Context

• Extracting thematic elements: interpreting the individual elements in context.

• (a) I can hear bass sounds. (b) They like grilled bass.

• Typical context cues that are employed

• Word Associations, Linguistic Cues, Syntactic, Structural Cues, Knowledge Sources..

9
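To make these context cues concrete, here is a minimal sketch (Python; the hand-built association lists are illustrative assumptions, not from the thesis) of disambiguating the two 'bass' sentences with simple word-association cues:

# Minimal sketch: disambiguating "bass" using word-association context cues.
# The association lists below are hand-built for illustration only.
SENSE_ASSOCIATIONS = {
    "bass (music)": {"sound", "sounds", "hear", "guitar", "music", "loud"},
    "bass (fish)":  {"grilled", "caught", "fishing", "lake", "eat", "like"},
}

def disambiguate(sentence: str) -> str:
    """Pick the sense whose association set overlaps most with the sentence."""
    words = set(sentence.lower().replace(".", "").split())
    return max(SENSE_ASSOCIATIONS,
               key=lambda sense: len(SENSE_ASSOCIATIONS[sense] & words))

print(disambiguate("I can hear bass sounds"))   # -> bass (music)
print(disambiguate("They like grilled bass"))   # -> bass (fish)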

Page 10: Meena Nagarajan Ph.D. Dissertation Defense

10

1.2. THESIS CONTRIBUTIONS – ‘ABOUTNESS’ OF INFORMAL TEXT

User-generated content on Twitter during the 2009 Iran Election

show support for democracy in Iran: add green overlay to your Twitter avatar with 1-click - http://helpiranelection.com/

Twitition: Google Earth to update satellite images of Tehran #Iranelection http://twitition.com/csfeo @patrickaltoft

Set your location to Tehran and your time zone to GMT +3.30. Security forces are hunting for bloggers using location/timezone searches

User comments on music artist pages on MySpace

Your music is really bangin!

You’re a genius! Keep droppin bombs!

u doin it up 4 real. i really love the album.

hey just hittin you up showin love to one of chi-town’s own. MADD LOVE.

Comments on Weblogs about movies and video games

I decided to check out Wanted demo today even though I really did not like the movie

It was THE HANGOVER of the year..lasted forever..so I went to the movies..bad choice picking GI Jane worse now

Excerpt from a blog around the 2009 Health Care Reform debate

Hawaii’s Lessons - NY Times. In Hawaii’s Health System, Lessons for Lawmakers. Since 1974, Hawaii has required all employers to provide relatively generous health care benefits to any employee who works 20 hours a week or more. If health care legislation passes in Congress, the rest of the country may barely catch up. Lawmakers working on a national health care fix have much to learn from the past 35 years in Hawaii, President Obama’s native state. Among the most important lessons is that even small steps to change the system can have lasting effects on health. Another is that, once benefits are entrenched, taking them away becomes almost impossible. There have not been any serious efforts in Hawaii to repeal the law, although cheating by employers may be on the rise. But perhaps the most intriguing lesson from Hawaii has to do with costs. This is a state where regular milk sells for $8 a gallon, gasoline costs $3.60 a gallon and the median price of a home in 2008 was $624,000, the second-highest in the nation.

Figure 1.1: Examples of user-generated content from different social media platforms

Page 11: Meena Nagarajan Ph.D. Dissertation Defense

(Figure 1.1 repeated from the previous slide.)

Unmediated Interpersonal communication

Informal English Domain

Page 12: Meena Nagarajan Ph.D. Dissertation Defense

(Figure 1.1 and the previous bullets repeated.)

Context is implicit

Interactions between like-minded people

Page 13: Meena Nagarajan Ph.D. Dissertation Defense

(Figure 1.1 and the previous bullets repeated.)

Variations and creativity in expression

Properties of the medium

Page 14: Meena Nagarajan Ph.D. Dissertation Defense

(Figure 1.1 and the previous bullets repeated.)

One solution rarely fits all social media content

Page 15: Meena Nagarajan Ph.D. Dissertation Defense

Thesis Contributions

• Compensating for informal, highly variable language and a lack of context

• Examining the usefulness of multiple context cues for text mining algorithms

• Context cues: document corpus; syntactic, structural cues; social medium and external domain knowledge

• End goal: NER, Key Phrase Extraction

11

Page 16: Meena Nagarajan Ph.D. Dissertation Defense

Thesis Statements

• We show that, for two aboutness understanding tasks -- NER and Key Phrase Extraction:

• Multiple sources of contextual information can supplement and improve the reliability and performance of existing NLP/ML algorithms

• Improvements tend to be robust across domains and data sources

12

Page 17: Meena Nagarajan Ph.D. Dissertation Defense

13

Thesis Contributions. Task: Aboutness of text

Context Cues: In Content; Medium Metadata, Structural cues; External Knowledge Sources; Text Formality

Corpora: Weblogs; MySpace Music Forum

Tasks: NER - Movie Names; NER - Music Album/Track names

"I loved your music Yesterday!"

"It was THE HANGOVER of the year..lasted forever.. so I went to the movies..bad choice picking "GI Jane" worse now"

Page 18: Meena Nagarajan Ph.D. Dissertation Defense

13

Thesis Contributions. Task: Aboutness of text (diagram repeated)

New cues shown: Wikipedia Infoboxes; Word Associations from large corpora; Blog URL, Title, Post URL

Page 19: Meena Nagarajan Ph.D. Dissertation Defense

13

Thesis Contributions. Task: Aboutness of text (diagram repeated)

Cues for NER - Movie Names (Weblogs): Wikipedia Infoboxes; Word Associations from large corpora; Blog URL, Title, Post URL

Cues for NER - Music Album/Track names (MySpace Music Forum): Word associations from large corpora, POS Tags, Syntactic Dependencies; MusicBrainz, UrbanDictionary; Page URL

Page 20: Meena Nagarajan Ph.D. Dissertation Defense

14

Thesis Contributions. Task: Aboutness of text

Context Cues: In Content; Medium Metadata, Structural cues; External Knowledge Sources; Text Formality

Corpora: Twitter; Facebook, MySpace Forums

Tasks: Key Phrase Extraction; Key Phrase Elimination

4.1. KEY PHRASE EXTRACTION - ‘ABOUTNESS’ OF CONTENT

...document that are descriptive of its contents. The contributions made in this thesis fall under the second category of extracting key phrases that are explicitly present in the content and are also indicative of what the document is ‘about’. The focus of previous approaches to key phrase extraction has been on extracting phrases that summarize a document, e.g. a news article, a web page, a journal article or a book. In contrast, the focus of this thesis is not on summarizing a document generated by users on social media platforms but on extracting key phrases that are descriptive of information present in multiple observations (or documents) made by users about an entity, event or topic of interest.

The primary motivation is to obtain an abstraction of a social phenomenon that makes volumes of unstructured user-generated content easily consumable by humans and agents alike. As an example of the goals of our work, Table 4.1 shows key phrases extracted from online discussions around the 2009 Health Care Reform debate and the 2008 Mumbai terror attack, summarizing hundreds of user comments to give a sense of what the population cared about on a particular day.

2009 Health Care Reform | 2008 Mumbai Terror Attack
Health care debate | Foreign relations perspective
Healthcare staffing problem | Indian prime minister speech
Obamacare facts | UK indicating support
Healthcare protestors | Country of India
Party ratings plummet | Rejected evidence provided
Public option | Photographers capture images of Mumbai

Table 4.1: Showing summary key phrases extracted from more than 500 online posts on Twitter around two newsworthy events on a single day.

Solutions to key phrase extraction have ranged from unsupervised techniques based on heuristics to identify phrases, to supervised learning approaches that learn from human...

Page 21: Meena Nagarajan Ph.D. Dissertation Defense

14

Thesis Contributions. Task: Aboutness of text (diagram repeated)

New cues shown: n-grams for thematic cues; spatial, temporal metadata

Page 22: Meena Nagarajan Ph.D. Dissertation Defense

14

Thesis Contributions. Task: Aboutness of text (diagram repeated)

Cues: n-grams for thematic cues; spatial, temporal metadata; Word associations from large corpora; Page Title; Seeds from a Domain Knowledge base
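As a rough, hypothetical illustration of the extraction task on these slides, the sketch below scores n-grams from a batch of posts against a background corpus; the corpora, the discounting scheme, and the function names are assumptions for illustration, not the thesis algorithm:

# Minimal sketch: scoring n-grams as candidate key phrases over many posts.
# Frequency in the target posts is discounted by a background corpus, a
# crude stand-in for the thematic-cue and word-association signals above.
from collections import Counter

def ngrams(text, n):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def key_phrases(posts, background, n=2, top_k=5):
    fg = Counter(g for p in posts for g in ngrams(p, n))
    bg = Counter(g for p in background for g in ngrams(p, n))
    scored = {g: c / (1 + bg[g]) for g, c in fg.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

posts = ["the health care debate continues",
         "public option in the health care debate",
         "health care debate heats up"]
background = ["the weather is nice", "the debate club meets today"]
print(key_phrases(posts, background))  # e.g. ['health care', 'care debate', ...]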

Page 23: Meena Nagarajan Ph.D. Dissertation Defense

WHO, WHAT, WHEN, WHERE, WHY, HOW

15

Thesis Contributions: Building Social Intelligence Applications

Building on results of NER, Key Phrase Extraction

Page 24: Meena Nagarajan Ph.D. Dissertation Defense

WHO, WHAT, WHEN, WHERE, WHY, HOW (diagram repeated)

15

Thesis Contributions: Building Social Intelligence Applications

Building on results of NER, Key Phrase Extraction

Social Intelligence Applications:
1. Application of NER results: BBC Sound Index with IBM Almaden
2. Application of Key Phrase Extraction: Twitris @ Knoesis

Page 25: Meena Nagarajan Ph.D. Dissertation Defense

Thesis Significance, Impact

• Focuses on relatively less explored content aspects of expression on social media platforms

• Why text on social media is different from what most text mining applications have focused on

• Combination of top-down, bottom-up analysis for informal text

• Statistical NLP, ML algorithms over large corpora

• Models and rich knowledge bases in a domain

16

Page 26: Meena Nagarajan Ph.D. Dissertation Defense

TALK OUTLINE - In Detail

ABOUTNESS UNDERSTANDING

• Named Entity Identification in Informal Text

TALK OUTLINE - Overviews

• Topical Key Phrase Extraction from Informal Text

• Applications and Consequences of Understanding Content: Social Intelligence Applications

• BBC SoundIndex, Twitris

17

Page 27: Meena Nagarajan Ph.D. Dissertation Defense

18

Named Entity Recognition

"I loved your music Yesterday!"

"It was THE HANGOVER of the year..lasted forever.. so I went to the movies..bad choice picking "GI Jane" worse now"

Page 28: Meena Nagarajan Ph.D. Dissertation Defense

Thesis Contributions

19

Predominant Focus of Prior Work | Thesis Focus
Entity Type Focus: PER, LOC, ORGN, DATE, TIME.. [TREC] | Entity Type Focus: Cultural Entities
Method: Sequential Labeling | Method: Spot and Disambiguate (pre-supposed knowledge)
Document Types: Scientific Literature, News, Blogs (formal) | Document Types: Social Media Content, Blogs, MySpace Forums
Features: Word-Level Features, List-lookup Features, Documents and corpus features | Features: Word-Level Features, List-lookup Features, Documents and corpus features

Page 29: Meena Nagarajan Ph.D. Dissertation Defense

Cultural Named Entities

20

• NER focus in my work: Cultural Named Entities

• Names of books, music albums, films, video games, etc.

• The Lord of the Rings, Lips, Crash, Up, Wanted, Today, Twilight, Dark Knight...

• Common words in a language

Page 30: Meena Nagarajan Ph.D. Dissertation Defense

Characteristics of Cultural Entities

• Varied senses, several poorly documented

• Merry Christmas: covered by 60+ artists

• Star Trek: movies, TV series, media franchise.. and cuisines!!

• Changing contexts with recent events

• The Dark Knight: reference to Obama, health care reform

• Unrealistic expectations: comprehensive sense definitions, enumeration of contexts, labeled corpora for all senses..

21

Page 31: Meena Nagarajan Ph.D. Dissertation Defense

Characteristics of Cultural Entities

(Previous slide repeated.)

NER: Relaxing the closed-world sense assumptions

21

Page 32: Meena Nagarajan Ph.D. Dissertation Defense

Thesis Contributions

22

(Comparison table from the earlier Thesis Contributions slide repeated.)

Page 33: Meena Nagarajan Ph.D. Dissertation Defense

A Spot and Disambiguate Paradigm

23

• NER is generally a sequential prediction problem

• A state-of-the-art NER system achieves a 90.8 F1 score on the CoNLL-2003 NER shared task (PER, LOC, ORGN entities) [Lev Ratinov, Dan Roth]

• My approach: a Spot and Disambiguate paradigm

• Dictionary or list of entities we want to spot

• Disambiguate in context (natural language, domain knowledge cues)

• Binary Classification
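A minimal sketch of that loop, assuming a toy entity dictionary and a keyword-overlap stand-in for the learned binary classifier described in the thesis:

# Minimal sketch of the Spot-and-Disambiguate paradigm:
# 1) spot candidate mentions from a dictionary of target entities,
# 2) binary-classify each mention's context for the target sense.
# The keyword-overlap "classifier" is a placeholder, not the real model.
MOVIE_DICTIONARY = {"wanted", "up", "twilight"}
MOVIE_CONTEXT_WORDS = {"movie", "demo", "watched", "film", "theater"}

def spot(text):
    """Return (position, token) pairs for dictionary hits."""
    toks = text.lower().split()
    return [(i, t) for i, t in enumerate(toks) if t in MOVIE_DICTIONARY]

def is_movie_sense(tokens, pos, window=4):
    """Binary decision from context-word overlap (placeholder classifier)."""
    context = set(tokens[max(0, pos - window):pos + window + 1])
    return bool(context & MOVIE_CONTEXT_WORDS)

text = "I decided to check out the Wanted demo after I watched the movie"
toks = text.lower().split()
for pos, tok in spot(text):
    print(tok, "-> movie sense?", is_movie_sense(toks, pos))  # wanted -> True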

Page 34: Meena Nagarajan Ph.D. Dissertation Defense

Thesis Contributions

24

Predominant Focus of Prior Work | Thesis Focus
Entity Types: PER, LOC, ORGN, DATE, TIME | Entity Type Focus: Cultural Entities
Method: Sequential Labeling | Method: Spot and Disambiguate (pre-supposed knowledge)
Document Types: Scientific Literature, News, Blogs (formal) | Document Types: Informal Social Media Content, Blogs, MySpace Forums, Twitter, Facebook
Features: Word-Level Features, List-lookup Features, Documents and corpus features | Features: SENSE BIASED Word-Level Features, List-lookup Features, Documents and corpus features

Page 35: Meena Nagarajan Ph.D. Dissertation Defense

NER Algorithmic Contributions Supervised, Two Flavors

25

3.2. THESIS FOCUS - CULTURAL NER IN INFORMAL TEXT

(a) Multiple senses in the same music domain
Bands with a song “Merry Christmas”: 60
Songs with “Yesterday” in the title: 3,600
Releases of “American Pie”: 195
Artists covering “American Pie”: 31

(b) Multiple senses in different domains for the same movie entities
Twilight: Novel, Film, Short story, Albums, Places, Comics, Poem, Time of day
Transformers: Electronic device, Film, Comic book series, Album, Song, Toy Line
The Dark Knight: Nickname for comic superhero Batman, Film, Soundtrack, Video game, Themed roller coaster ride

Table 3.3: Challenging Aspects of Cultural Named Entities

3.2.3 Two Approaches to Cultural NER

In this thesis, we present two approaches to Cultural NER, both addressing different challenges in their identification. Cultural entities display two characteristic challenges related to their sense or meanings. Certain Cultural entities are so commonly used that they tend to have multiple senses in the same domain. The music industry is a great example of this scenario, where popular themes feature in several track/album titles of different artists. Table 3.3(a) shows examples of such cases; for example, there are more than 3,600 songs with the word ‘Yesterday’ in their title.

Connecting mentions of such entities in free text to their actual real-world references is rather challenging, especially in light of poor contextual information. If a user post mentioned the song ‘Merry Christmas’, as in “This new Merry Christmas tune is so good!”, it is non-trivial to disambiguate its reference to one among the 60 artists who have covered that song.

On the other hand, there are Cultural entities that span multiple domains. The phrase ‘The Hangover’ is a named entity in both the film and music domains. Movies that are based on novels or video games are great examples of such cases of sense ambiguity. Resolving the mention of ‘Wanted’ in Figure 3.2 as a reference to the video game entity (and not the movie reference) is a...

“I am watching Pattinson scenes in <movie id=2341>Twilight</movie> for the nth time.” “I spent a romantic evening watching the Twilight by the bay..”

“I love <artist id=357688>Lily's</artist> song <track id=8513722>smile</track>”.

Page 36: Meena Nagarajan Ph.D. Dissertation Defense

NER - Approach 1

26

Page 37: Meena Nagarajan Ph.D. Dissertation Defense

Approach 1: Multiple Senses, Multiple Domains

• When a Cultural entity appears in multiple senses across domains in the same corpus

27

3.3. CULTURAL NER – MULTIPLE SENSES ACROSS MULTIPLE DOMAINS

Title: Peter Cullen Talks Transformers: War for Cybertron

Recently, we heard legendary Transformers voice actor Peter Cullen talk not only about becoming a hero to millions for his portrayal of the heroic Autobot leader, Optimus Prime, but also about being the first person to play the role of video game icon Mario. But today, he focuses more on the recent Transformers video game release, War for Cybertron.

Following are some excerpts from an interview Cullen recently conducted with Techland. On how the Optimus Prime seen in War for Cybertron differs from the versions seen in other branches of the franchise and its multiverse...

Figure 3.1: Showing excerpt of a blog discussing two senses of the entity ‘Transformers’

...the second mention.

3.3.1 A Feature Based Approach to Cultural NER

In this work, we propose a new feature that represents the complexity of extracting particular entities. We hypothesize that knowing how hard it is to extract an entity is useful for learning better entity classifiers. With such a measure, entity extractors become ‘complexity aware’, i.e. they can respond differently to the extraction complexity associated with different target entities.

Suppose that we have two entities, one deemed easy to extract and the other more complex. When a classifier knows the extraction complexity of the entity, it may require more evidence (or apply more complex rules) in identifying the more complex entity compared to the easier target. Consider concretely a movie recognition system dealing with two movies, say, ‘The Curious Case of Benjamin Button’, a title appearing only in reference to a recent movie, and ‘Wanted’, a segment with wider senses and intentions. With comparable signals, a traditional NER system can only apply the same inference to both cases, whereas a ‘complexity aware’ system has the advantage of...

Page 38: Meena Nagarajan Ph.D. Dissertation Defense

Algorithm Preliminaries

• Problem Space

• Corpus: Weblogs; Distribution: unknown

• All senses of a cultural entity: unknown

• Problem Definition

• Input: a target sense (e.g., movie); a list of entities to be extracted

• Goal: disambiguating every entity's mention as related to the target sense or not

28

Page 39: Meena Nagarajan Ph.D. Dissertation Defense

Contribution: Improving NER - feature-based approach

• Improving classifiers using a novel feature: the "complexity of extraction" in a target sense

• Hypothesis: knowing how hard or easy it is to extract this entity in a particular sense will improve extraction accuracy of learners

29

Page 40: Meena Nagarajan Ph.D. Dissertation Defense

Contribution: Improving NER - feature-based approach

(Previous bullets repeated.)

• Making classifiers 'complexity aware'

• 'The Curious Case of Benjamin Button' vs. 'Wanted'

29
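As a sketch of how such a score could enter a learner, the snippet below simply appends a per-entity complexity value to each mention's feature vector before training; the feature values and scores are made up for illustration:

# Minimal sketch: making a classifier "complexity aware" by appending the
# entity's complexity-of-extraction score as one more feature.
# The base features and the score table are illustrative only.
COMPLEXITY = {"The Curious Case of Benjamin Button": 0.2, "Wanted": 0.9}

def featurize(mention_features, entity):
    """Base features for a mention, plus the entity-level complexity score."""
    return mention_features + [COMPLEXITY.get(entity, 0.5)]

# Two mentions with identical local evidence now look different to a learner:
print(featurize([1.0, 0.0, 1.0], "The Curious Case of Benjamin Button"))
print(featurize([1.0, 0.0, 1.0], "Wanted"))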

Page 41: Meena Nagarajan Ph.D. Dissertation Defense

Overview

List of movies to extract: The Curious Case of Benjamin Button, Twilight, Date Night, Death at a Funeral, The Last Song, Up, Angels and Demons

Sample Population: uncharacterized population (blog corpus), target sense (movies)

Page 42: Meena Nagarajan Ph.D. Dissertation Defense

Overview

(Previous slide repeated; the entity 'The Curious Case of Benjamin Button' is highlighted.)

Page 43: Meena Nagarajan Ph.D. Dissertation Defense

Overview

(Previous slide repeated.)

Entity | Complexity of Extraction
The Curious Case of Benjamin Button | 0.2

Page 44: Meena Nagarajan Ph.D. Dissertation Defense

Overview

(Previous slide repeated; the entity 'Date Night' is added to the table.)

Page 45: Meena Nagarajan Ph.D. Dissertation Defense

Overview

(Previous slide repeated.)

Entity | Complexity of Extraction
The Curious Case of Benjamin Button | 0.2
Date Night | 0.5

Page 46: Meena Nagarajan Ph.D. Dissertation Defense

Overview

(Previous slide repeated.)

Use Complexity of Extraction as a feature in named entity classifiers

Page 47: Meena Nagarajan Ph.D. Dissertation Defense

Overview

(Previous slide repeated.)

NOTE: An entity occurring in fewer varied senses (The Curious Case of Benjamin Button) could still have a high complexity of extraction if the distribution is skewed away from the sense of interest!

Page 48: Meena Nagarajan Ph.D. Dissertation Defense

Extraction in a Target Sense

• Complexity of extraction in a sense of interest = how much support the corpus provides for that sense

31

Page 49: Meena Nagarajan Ph.D. Dissertation Defense

Extraction in a Target Sense

(Previous bullet repeated.)

• How do we find this?

31

Page 50: Meena Nagarajan Ph.D. Dissertation Defense

Extraction in a Target Sense

(Previous bullet repeated.)

• How do we find this? Documents that mention the entity in word contexts that are biased to our sense of interest (language models)

31

Page 51: Meena Nagarajan Ph.D. Dissertation Defense

Extraction in a Target Sense

(Previous bullets repeated.)

• More documents imply more support, which implies the entity is easy to extract: a low complexity of extraction

31
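Read as a formula, one plausible instantiation (an assumption for illustration, not the thesis definition) is: complexity = 1 - the fraction of the entity's mention documents whose contexts match the sense-biased language model.

# Minimal sketch, assuming complexity = 1 - (sense-biased support fraction).
# sense_biased_docs: documents whose contexts match the sense-biased LM;
# mention_docs: all documents that mention the entity.
def complexity_of_extraction(sense_biased_docs, mention_docs):
    if mention_docs == 0:
        return 1.0  # no evidence at all: treat as maximally complex
    return 1.0 - sense_biased_docs / mention_docs

print(complexity_of_extraction(80, 100))  # 0.2 -> lots of support, easy
print(complexity_of_extraction(10, 100))  # 0.9 -> little support, hard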

Page 52: Meena Nagarajan Ph.D. Dissertation Defense

Support via Word Associations

• Co-occurring words alone won't cut it!

• Prolific discussion and comparison of different senses

• Co-occurrence based language models will give us everything unless we bias them to our sense (movies)

32

Page 53: Meena Nagarajan Ph.D. Dissertation Defense

Complexity of Extraction

• Goal: Complexity of Extraction in a target sense

• Subgoal: support in terms of sense-biased contexts in documents that mention the entity

• Step 1: Extract a sense-biased LM

• Step 2: Identify documents that mention the entity in the context of the sense-biased LM

33

Page 54: Meena Nagarajan Ph.D. Dissertation Defense

Knowledge Features to seed Sense-biased Word Association Gathering

34

• Sense Definition (hints) from Wikipedia Infoboxes

• Working definition: Sense is domain of interest

Page 55: Meena Nagarajan Ph.D. Dissertation Defense

Knowledge Features to seed Sense-biased Word Association Gathering

34

(Previous bullets repeated.)

• Use sense hints to derive contextual support

Lots of support and easy extraction imply a low 'complexity of extraction' score!

Page 56: Meena Nagarajan Ph.D. Dissertation Defense

Measuring ‘complexity of extraction’

• Step 1: Propagate sense evidence in contexts of e; extract a sense-biased language model (LM)

• random walks, distributional similarity approaches

• SPREADING ACTIVATION NETWORKS

35

Two step framework (unsupervised)

Page 57: Meena Nagarajan Ph.D. Dissertation Defense

Measuring ‘complexity of extraction’

(Previous slide repeated.)

Figure labels: sense hint nodes; sense-biased language model

Page 58: Meena Nagarajan Ph.D. Dissertation Defense

Overview

• Step 2: Clustering documents represented by sense-relatedness vectors

• CHINESE WHISPERS CLUSTERING

• Result: clustered documents in similar senses, not just similar words!

SenseRel | doc 1 | doc 2 | ... | doc n
sense LM term 1 | SenseRel(t1) | SenseRel(t1) | | SenseRel(t1)
sense LM term 2 | SenseRel(t2) | | |
sense LM term m | SenseRel(tm) | | | SenseRel(tm)
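A compact sketch of Chinese Whispers over documents represented as sense-relatedness vectors: documents become graph nodes, cosine similarity above a threshold gives edges, and each node repeatedly adopts its neighbors' most common cluster label. The vectors, threshold, and iteration count are illustrative assumptions:

# Minimal sketch of Chinese Whispers clustering over documents represented
# as sense-relatedness vectors (one weight per sense-LM term).
import math, random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def chinese_whispers(vectors, threshold=0.5, iterations=20, seed=0):
    rng = random.Random(seed)
    n = len(vectors)
    labels = list(range(n))  # every document starts in its own cluster
    edges = {i: [j for j in range(n)
                 if j != i and cosine(vectors[i], vectors[j]) > threshold]
             for i in range(n)}
    for _ in range(iterations):
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            if edges[i]:  # adopt the most common label among neighbors
                counts = {}
                for j in edges[i]:
                    counts[labels[j]] = counts.get(labels[j], 0) + 1
                labels[i] = max(counts, key=counts.get)
    return labels

docs = [[0.9, 0.8, 0.0], [0.8, 0.9, 0.1],   # movie-sense contexts
        [0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]   # other-sense contexts
print(chinese_whispers(docs))  # two clusters, e.g. [1, 1, 3, 3]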

Page 59: Meena Nagarajan Ph.D. Dissertation Defense

Constructing the SAN

Page 60: Meena Nagarajan Ph.D. Dissertation Defense

Constructing the SAN

Sense hints (e.g., from the Wikipedia infobox): J. J. Abrams, Damon Lindelof, Roberto Orci, Alex Kurtzman, Paramount Pictures, Chris Pine, Zachary Quinto, Eric Bana, Zoe Saldana, Karl Urban, John Cho, Anton Yelchin, Simon Pegg, Bruce Greenwood, Leonard Nimoy, Kirk, Spock, Nero, Pavel Chekov, Nyota Uhura, ..

Entity: Star Trek / Startrek

Page 61: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

indicative of being a Named Entity

Page 62: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock

indicative of being a Named Entity

Page 63: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

Top X keywords (by IDF) among the context surrounding (but excluding) the entity of interest; force-include sense-related words

Spock IMAX .. Kirk Karl Urban James .. canon Chris Pine libidinous

indicative of being a Named Entity
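A small sketch of this keyword-selection step, with a stubbed IDF table (a real system would compute IDF from a large corpus) and an assumed sense-hint list:

# Minimal sketch: pick the top-X context keywords by IDF around an entity
# mention (excluding the mention itself), force-including sense-hint words.
IDF = {"spock": 4.1, "imax": 3.9, "kirk": 4.0, "seats": 1.2,
       "perfect": 0.8, "check": 0.5, "viewing": 1.0}  # stubbed values
SENSE_HINTS = {"chris pine", "kirk", "spock"}

def context_keywords(tokens, entity_tokens, x=3):
    context = [t for t in tokens if t not in entity_tokens]
    top = sorted(set(context), key=lambda t: IDF.get(t, 0.0), reverse=True)[:x]
    forced = [t for t in context if t in SENSE_HINTS and t not in top]
    return top + forced

tokens = "viewing star trek imax check perfect seats spock kirk".split()
print(context_keywords(tokens, entity_tokens={"star", "trek"}))
# ['spock', 'kirk', 'imax'] (the sense hints already made the top-X)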

Page 64: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

Activation Network

indicative of being a Named Entity

Page 65: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of

indicative of being a Named Entity

Page 66: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

Graph fragment: nodes libidinous, Spock, imax, connected by co-occurrence edges of weight 1.

indicative of being a Named Entity

Page 67: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated, now highlighting another context window:)

effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock

(Graph fragment repeated.)

indicative of being a Named Entity

Page 68: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

Graph fragment grows: Chris Pine and Kirk join libidinous, Spock, and imax, with co-occurrence edges of weight 1.

indicative of being a Named Entity

Page 69: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

Repeat this procedure for all blogs. We end up with a connected SAN with some sense nodes and other words in the context of the entity.

indicative of being a Named Entity

Page 70: Meena Nagarajan Ph.D. Dissertation Defense

Node and Edge Semantics

• Pre-adjustment phase

• Node weights: sense nodes: 1; other nodes: 0.1

• Ambiguous sense nodes: alternate seeding methods, e.g., distributional similarity with unambiguous domain terms (movie, theatre, imax, cinemas)

• Edge weights: co-occurrence counts

38

Page 71: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

Figure: constructing the Spreading Activation Network G from words co-occurring with e in D; sense-hint vertices Y start at weight 1, other vertices X at 0.1. Example nodes: Eric Bana, Sulu, Romulan, movie, franchise, Chris Pine, J. J. Abrams, starship, seats.

38
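A minimal sketch of this construction, with sense-hint nodes starting at weight 1.0, other context words at 0.1, and edges carrying co-occurrence counts; the sample documents are illustrative:

# Minimal sketch: building the spreading activation network (SAN) from
# the per-document context keywords around the entity e.
from collections import defaultdict
from itertools import combinations

def build_san(context_docs, sense_hints):
    weights, cooc = {}, defaultdict(int)
    for doc_keywords in context_docs:
        for w in doc_keywords:
            weights[w] = 1.0 if w in sense_hints else 0.1
        for a, b in combinations(sorted(set(doc_keywords)), 2):
            cooc[(a, b)] += 1  # undirected edge, stored under a sorted key
    return weights, cooc

docs = [["spock", "imax", "libidinous"],
        ["chris pine", "kirk", "spock"]]
weights, cooc = build_san(docs, sense_hints={"spock", "kirk", "chris pine"})
print(weights["spock"], weights["imax"])  # 1.0 0.1
print(cooc[("imax", "spock")])            # 1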

Page 72: Meena Nagarajan Ph.D. Dissertation Defense

Propagating Sense Evidences

Pulse the sense nodes and spread the effect; as many pulses (iterations) as the number of sense nodes.

39

(SAN figure repeated.)

Page 73: Meena Nagarajan Ph.D. Dissertation Defense

Propagating Sense Evidences

At every iteration: a BFS walk starting at a sense node (weight 1); revisit nodes, not edges; amplify the weights of visited nodes:

W[j] = W[j] + (W[i] * co-occ[i, j] * α)

39

(SAN figure repeated.)
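A minimal sketch of one pulse, implementing the update rule above on a toy network; the value of α and the graph itself are illustrative:

# Minimal sketch of the propagation step: one BFS pulse per sense node,
# amplifying visited nodes with W[j] += W[i] * co-occ[i, j] * alpha,
# traversing each edge once per pulse ("revisiting nodes, not edges").
from collections import deque

def pulse(weights, edges, start, alpha=0.2):
    """One BFS walk from a sense node, spreading its activation."""
    seen_edges = set()
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j, count in edges[i].items():
            if (i, j) in seen_edges:
                continue
            seen_edges.add((i, j))
            seen_edges.add((j, i))
            weights[j] += weights[i] * count * alpha
            queue.append(j)
    return weights

# Toy network: "movie" is a sense hint (weight 1.0), others start at 0.1.
weights = {"movie": 1.0, "spock": 0.1, "seats": 0.1}
edges = {"movie": {"spock": 2, "seats": 1},
         "spock": {"movie": 2},
         "seats": {"movie": 1}}
print(pulse(weights, edges, "movie"))
# spock: 0.1 + 1.0*2*0.2 = 0.5 ; seats: 0.1 + 1.0*1*0.2 = 0.3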

Page 74: Meena Nagarajan Ph.D. Dissertation Defense

Propagating Sense Evidences

Collective spreading is controlled by the dampening factor α and co-occurrence thresholds.

39

(Update rule and SAN figure repeated.)

Page 75: Meena Nagarajan Ph.D. Dissertation Defense

Propagating Sense Evidences

Post propagation of sense evidences (Spreading Activation Theory): the figure now shows activated and non-activated vertices.

Final activated portions of the network indicate a word's relatedness to the sense = the sense-biased LM.

39

(Update rule and controls repeated.)

Page 76: Meena Nagarajan Ph.D. Dissertation Defense

Sense-biased LM

Entity: Star Trek (movie). 20 iterations (pulsed sense nodes). 900+ blogs, 35K+ words in the co-occurrence graph. 167 words in the LM.

40

Page 77: Meena Nagarajan Ph.D. Dissertation Defense

Sense-biased LM

(Previous slide repeated.)

Sense-biased Spreading Activation already lends one type of clustering (separation of words strongly related to our sense)

40

Page 78: Meena Nagarajan Ph.D. Dissertation Defense

(Previous slide repeated.)

Page 79: Meena Nagarajan Ph.D. Dissertation Defense

Documents D represented in terms of LM_e

(The example blog excerpt, shown once per document in the figure.)

d_i(LM_e) = { (w_1, LM_e(w_1)), ..., (w_x, LM_e(w_x)) }

Step 2: Clustering using the Extracted LM. Algorithmic Implementations

41

Vector Space Model. Typically: (word, TF-IDF score). Here: (word, sense-relatedness score).
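A small sketch of this representation: each document becomes a vector over the sense-biased LM's terms, weighted by sense-relatedness rather than TF-IDF, and a document with no LM overlap gets no representation. The scores are illustrative:

# Minimal sketch: representing a document over the sense-biased LM,
# d_i(LM_e) = {(w_1, LM_e(w_1)), ..., (w_x, LM_e(w_x))}, with the word's
# sense-relatedness score in place of the usual TF-IDF weight.
SENSE_LM = {"spock": 0.9, "kirk": 0.8, "imax": 0.6}  # term -> relatedness

def represent(doc):
    words = set(doc.lower().split())
    vec = {w: s for w, s in SENSE_LM.items() if w in words}
    return vec or None  # no LM overlap -> "No Representation"

print(represent("Chris Pine IS Kirk and Spock was great in IMAX"))
# {'spock': 0.9, 'kirk': 0.8, 'imax': 0.6}
print(represent("a romantic evening watching the twilight by the bay"))
# None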

Page 80: Meena Nagarajan Ph.D. Dissertation Defense

Step 2: Clustering using the Extracted LM (previous slide repeated)

Example documents:
http://realart.blogspot.com/2009/05/star-trek-balance-of-terror-from.html
http://susanisaacs.blogspot.com/2009/04/quantum-leap-convention.html
http://semioblog.blogspot.com/2009/01/retrofuturo-web.html
http://wilwheaton.net/2006/05/learn_to_swim.php

No Representation

Page 81: Meena Nagarajan Ph.D. Dissertation Defense

Step 2: Clustering using the Extracted LM

Clustering documents in D along the same dimensions as the propagation



Cluster scores are as high as the sense-relatedness scores of terms in the documents in those clusters

41


Page 82: Meena Nagarajan Ph.D. Dissertation Defense

Chinese Whispers

42

*[Biemann 2006] Biemann, C. (2006): Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. Proceedings of the HLT-NAACL-06 Workshop on Textgraphs-06, New York, USA.


• Randomized graph-clustering algorithm for undirected, weighted graphs (sketched below)
• Nodes are documents; edges represent dot-product similarity between documents
• Feature vector = language model from Step 1
• Partitions nodes, i.e. documents, based on maximum average similarity
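For intuition, a minimal sketch of the Chinese Whispers procedure (illustrative, not the dissertation's implementation; see Biemann 2006 for the full algorithm):

```python
import random

def chinese_whispers(nodes, edges, iterations=20):
    """Sketch of Chinese Whispers graph clustering (Biemann 2006).

    nodes: iterable of node ids (here, documents)
    edges: dict mapping (u, v) -> similarity weight, undirected
    Every node starts in its own class; in randomized order, each node
    adopts the class with the highest summed edge weight among its
    neighbors, until the labeling (approximately) stabilizes.
    """
    nodes = list(nodes)
    neighbors = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    labels = {n: i for i, n in enumerate(nodes)}  # one class per node
    for _ in range(iterations):
        random.shuffle(nodes)  # randomized update order
        for n in nodes:
            if not neighbors[n]:
                continue
            # total neighbor weight per class; adopt the strongest class
            scores = {}
            for m, w in neighbors[n].items():
                scores[labels[m]] = scores.get(labels[m], 0.0) + w
            labels[n] = max(scores, key=scores.get)
    return labels  # node -> cluster label
```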

Page 83: Meena Nagarajan Ph.D. Dissertation Defense

High and Low Scoring Clusters

43

++++++++++ CLUSTER (low-scoring) ++++++++++
Cluster 1032: 4 members, score = 7.46775436953479E-05
http://...mish-mashy-hodge-podge-with-no.html
  Keywords: adventure 0.0016552556767791 (1), movie 0.000477036040807386 (5)
http://perchedraven.blogspot.com/2006/06/reading-habits.html
  Keywords: adventure 0.0016552556767791 (1), movie 0.000477036040807386 (4)
http:/..9/05/27/new-york-times-crossword-will-shortz-corey-rubin/
  Keywords: adventure 0.0016552556767791 (1), movie 0.000477036040807386 (1)
http://....no-doubt-have-heard-by-now.html
  Keywords: adventure 0.0016552556767791 (1), movie 0.000477036040807386 (1)

++++++++++ CLUSTER (high-scoring) ++++++++++
Cluster 2382: 4 members, score = 0.130057194715825
http://torontomike.com/2009/05/advanced_screening_of_pixars_u.html
  Keywords: comedy 0.00256627265885554 (1), adventure 0.0016552556767791 (2), carl 0.020754281327519 (1), pete docter 0.0549975353578234 (1), carl fredricksen 0.133630375166943 (1), russell 0.116638105837327 (1), pixar 0.0048327182073532 (2), digital 5.70733953613266E-05 (1), disney 2.05783382714942E-06 (2), fredricksen 1 (1), docter 0.016306935328765 (1)
http://theplaylist.blogspot.com/2009/05/up-pixars-latest-is-profoundly.html
  Keywords: russell 0.116638105837327 (2), carl 0.020754281327519 (3), carl fredricksen 0.133630375166943 (1), pete docter 0.0549975353578234 (1), comedy 0.00256627265885554 (1), animation 0.0164047754350987 (1), pixar 0.0048327182073532 (7), film 1.32766200713231E-05 (4), adventure 0.0016552556767791 (1), movies 0.0399341006575118 (2)

Low-scoring cluster (1032): less evidence of relatedness to the target sense

High-scoring cluster (2382)

Page 84: Meena Nagarajan Ph.D. Dissertation Defense

From Clusters to Support: clustering documents in D along the same dimensions as propagation


44

Cluster scores are as high as the sense-relatedness scores of terms in the documents in those clusters

• A conservative estimate, a heuristic
• A = average strength of all clusters (average sense relatedness)
• C* = clusters with score >= A
• Support = number of documents in strongly sense-related clusters |C*| / number of documents mentioning the entity |D|


Page 86: Meena Nagarajan Ph.D. Dissertation Defense


The higher the proportion of documents in strongly sense-related clusters, the lower the entity's 'complexity of extraction' score

44


'complexity of extraction' of e = 1 - |C*| / |D|

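Putting the pieces together, a minimal sketch (illustrative names, assuming cluster scores come from the clustering step) of the support and complexity-of-extraction computation:

```python
def extraction_complexity(clusters, num_docs_mentioning_entity):
    """Sketch of the support / complexity-of-extraction heuristic.

    clusters: list of (score, member_docs) pairs, where score is the
    cluster's sense-relatedness strength and member_docs its documents.
    num_docs_mentioning_entity: |D|, documents mentioning the entity.
    """
    avg_strength = sum(s for s, _ in clusters) / len(clusters)  # A
    strong = [docs for s, docs in clusters if s >= avg_strength]  # C*
    docs_in_strong = sum(len(docs) for docs in strong)  # docs in C*
    support = docs_in_strong / num_docs_mentioning_entity  # |C*| / |D|
    return 1.0 - support  # complexity of extraction of e
```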

Page 87: Meena Nagarajan Ph.D. Dissertation Defense

Validating the Framework

45

• Intuitively, some (cultural) entities are harder to extract than others: 'Up' vs. 'The Time Traveler's Wife'
• For a list of X movies:
  • Selected blogs from the general corpus
  • Obtained sense definitions from Wikipedia
  • Generated support for the movie sense using the proposed framework

Page 88: Meena Nagarajan Ph.D. Dissertation Defense

46


“Up” and “Wanted” will arguably be harder to extract than a movie like “Angels and Demons” – clearly indicating that our approach for computing ‘extraction complexity’ is effective. A note on the confidence of these scores is presented in Section 4.2.

Table 1: Entities and their computed extraction complexities. Each row lists the entity and the variations used in obtaining D, with possible senses found on Wikipedia disambiguation pages, followed by its complexity of extraction.

Twilight (novel, film, time of day, albums..): 0.400
Up (film, relative direction, abbreviations, albums..): 0.352
Wanted (film, common verb in English, video game, music..): 0.161
Star Trek, Startrek (film, TV series, video game, media franchise..): 0.114
Transformers (toy line, film, comic series, electronic device..): 0.085
The Hangover (unpleasant feeling, film, band, song..): 0.072
The Dark Knight, Dark Knight (Batman's nickname, film, comic series, soundtrack..): 0.070
Angels and Demons, Angels & Demons (novel, movie, episode on The Blade: the series..): 0.066

4.2 NER Improvements

The underlying hypothesis behind this work is that knowing how hard it is to extract an entity will allow classifiers to use cues differently in deciding whether a spotted mention is a valid entity in the target sense or not. In this second set of experiments, we measure the usefulness of our feature in assisting a variety of supervised classifiers.

Labeling Data: We randomly selected documents one after another from the pool of documents D collected for all entities E in Experiment 1 and labeled every occurrence of entity e and its valid variations in a document as either a movie entity or not. We labeled a total of 1500 spots. We also observed 100% inter-annotator agreement between the two authors over a random sample of 15% of the labeled spots, indicating that labeling for the movie or not-movie sense is not hard in this data. Figure 3 shows statistics for the percentage of true positives found for each entity. The percentage of entity mentions that appear in the movie sense implicitly indicates how much support there is for the target movie sense of the entities. It is interesting, then, that this order closely matches the extraction-complexity ordering of the entities, an indication that the approach we use for extracting our feature is sound. In the process of random labeling, the entity "Angels and Demons" received only 10 labels and was discarded from this experiment.

Classifiers, Features, Experimental Setup: We used 3 different state-of-the-art entity classifiers for learning entity extractors – decision tree classifiers, bagging and boosting (using 6 nodes and stump base learners for boosting) [1, 4, 21]. The goal of using different classification models was to show that our measure is useful with different underlying prediction approaches rather than for the purpose of finding the most suitable classifier. We trained and tested the classifiers on an extensive list of features (Figure 4).

We used two well-known feature families: word-level features that indicate whether the spotted entity is capitalized, surrounded by quotes, etc., and contextual syntactic features that encode the Part-Of-Speech tags of words surrounding the entity. We also used knowledge-derived features that indicate whether words already known to be relevant to the target sense of the entity (the sense definition of e) are found in the document, the surrounding paragraph, the title of the post, or in the post or blog URL. The intuition is that the presence of such words strengthens the case for a valid mention. We also encoded similar features using the extracted language model LMe to test the usefulness of the new words we extracted as relevant to the target sense of the entity.

In addition to the basic word-level and syntactic features, we also measure the usefulness of our proposed feature against a strong ‘contextual entropy’ baseline. This baseline measures how specific a word is in terms of the contexts it occurs in. A general word will occur in varied contexts, and will have a high context entropy value [14]. High entropy in context distribution is an indication that extracting the entity in any sense might be hard. This baseline is very similar in spirit to our feature, except that our proposed measure identifies how hard it is to extract an entity in a particular target sense. We evaluated classifier accuracies in labeling test spots with and without our ‘complexity of extraction’ feature as a prior. Specifically, we used the following feature combinations:

a. Basic features: word-level, syntactic, knowledge features obtained from the sense definitions S and Sd.
b. Baseline: Basic + 'contextual entropy' feature as a prior.
c. Our measure: Basic + knowledge features obtained from the extracted LMe + 'complexity of extraction' feature as a prior.
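The 'contextual entropy' baseline in (b) can be sketched as the Shannon entropy of a word's context distribution (a minimal illustration under the assumption that contexts are, say, surrounding-word windows; not the paper's exact implementation):

```python
import math
from collections import Counter

def contextual_entropy(contexts):
    """Entropy of a word's observed context distribution.

    contexts: list of context identifiers seen around the word
    (e.g., surrounding-word windows). A general word occurs in many
    varied contexts and so receives high entropy, signaling that
    extracting it in any sense may be hard.
    """
    counts = Counter(contexts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```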

Figure 5 shows the precision-recall curves using the basic, baseline and our proposed measure for entity classification using the decision tree and boosting classifiers. We verified the stability of these results using 10-fold cross validation. We see better performance of our measure compared to both the basic setting and the strong 'contextual entropy' baseline. Notably, there is overwhelming improvement in entity extraction over traditional extractor settings (basic features). The stability of the suggested improvement is also confirmed across both classifiers.

We see significant improvements using the proposed feature and now turn to confirm that this is indeed a consistent pattern. Here, we show the averaged performance of binary classification over 100 runs, each run using different and random samples of training and test sets (obtained from 50-50 splits).

We measured the F-measure and accuracy of the classifiers using the basic, baseline and our proposed measure features. Accuracy is defined as the number of correct classifications (both true positives and negatives) divided by the total number of classifications. We use accuracy to represent general classification improvement – when we care about classifying both the correct and incorrect cases. The F-measure is the standard harmonic mean of precision and recall of classification results and we use it to represent information retrieval improvement – when we only care about our target sense. We report both of these metrics here for consistency with past literature.
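As a quick reference, the two metrics in code form (standard definitions, matching the text above):

```python
def accuracy(tp, tn, fp, fn):
    # correct classifications (true positives and negatives) over all cases
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```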

Figure 3: Labeled data (computed extraction complexities vs. hand-labeled % true positives)

Figure 4: Features used in judging NER improvements

Page 89: Meena Nagarajan Ph.D. Dissertation Defense

46

!

“Up” and “Wanted” will arguably be harder to extract than a movie like “Angels and Demons” – clearly indicating that our approach for computing ‘extraction complexity’ is effective. A note on the confidence of these scores is presented in Section 4.2.

Table 1 Entities and their computed extraction complexities

Entity and Variations used in obtaining D; (possible senses found in Wikipedia Disambiguation pages)

Complexity of Extraction

Twilight (novel, film, time of day, albums..) 0.4

Up (film, relative direction, abbreviations, albums..) 0.352

Wanted (film, common verb in English, video game, music..) 0.161

Star Trek, Startrek (film, tv series, video game, media franchise..) 0.114

Transformers (toy line, film, comic series, electronic device..) 0.085

The Hangover (unpleasant feeling, film, band, song.. ) 0.072

The Dark Knight, Dark Knight (Batman’s nickname, film, comic series, soundtrack..) 0.070

Angels and Demons, Angels & Demons (novel, movie, episode on The Blade: the series…) 0.066

4.2 NER Improvements The underlying hypothesis behind this work is that knowing how hard it is to extract an entity will allow classifiers to use cues differently in deciding whether a mention spotted is mentioned in a valid entity in a target sense or not. In this second set of experiments, we measure the usefulness of our feature in assisting a variety of supervised classifiers.

Labeling Data We randomly selected documents one after another from the pool of documents D collected for all entities E in Experiment 1 and labeled every occurrence of entity e and its valid variations in a document as either a movie entity or not. We labeled a total of 1500 spots. We also observed a 100% inter-annotator agreement between the two authors over a random sample of 15% of the labeled spots, indicating that labeling for movie or not-movie sense is not hard in this data. Figure 3 shows statistics for the percentage of true positives found for each entity. The percentage of entity mentions that appear in the movie sense implicitly indicates how much support there is for the target movie sense for the entities. It is interesting then that this order closely matches the extraction complexity ordering of the entities – an indication that the approach we use for extracting our feature is sound. In the process of random labeling the entity “Angels and Demons” received only 10 labels and was discarded from this experiment.

Classifiers, Features, Experimental Setup: We used 3 different state-of-the-art entity classifiers for learning entity extractors – decision tree classifiers, bagging and boosting (using 6 nodes and stump base learners for boosting) [1, 4, 21]. The goal of using different classification models was to show that our measure is useful with different underlying prediction approaches rather than for the purpose of finding the most suitable classifier. We trained and tested the classifiers on an extensive list of features (Figure 4).

We used two well-known features – word-level features that indicate whether the spotted entity is capitalized, surrounded by quotes etc., and contextual syntactic features that encode the Part- Of-Speech tags of words surrounding the entity. We also used knowledge-derived features that indicate whether words already

known to be relevant to the target sense of the entity (sense definition of e) are found in the document, surrounding paragraph, title of post or in the post or blog url. The intuition is that the presence of such words strengthens its case for being a valid mention. We also encoded similar features using the extracted language model LMe to test the usefulness of the new words we extracted as relevant to the target sense of the entity.

In addition to the basic word-level and syntactic features, we also measure the usefulness of our proposed feature against a strong ‘contextual entropy’ baseline. This baseline measures how specific a word is in terms of the contexts it occurs in. A general word will occur in varied contexts, and will have a high context entropy value [14]. High entropy in context distribution is an indication that extracting the entity in any sense might be hard. This baseline is very similar in spirit to our feature, except that our proposed measure identifies how hard it is to extract an entity in a particular target sense. We evaluated classifier accuracies in labeling test spots with and without our ‘complexity of extraction’ feature as a prior. Specifically, we used the following feature combinations:

a. Basic features: word-level, syntactic, knowledge features obtained from the sense definitions S and Sd. b. Baseline: Basic + ‘contextual entropy’ feature as a prior.

c. Our measure: Basic + knowledge features obtained from the extracted LMe + ‘complexity of extraction’ feature as a prior.

Figure 5 shows the precision-recall curves using the basic, baseline and our proposed measures for entity classification using the decision tree and boosting classifiers. We verified stability of these results using 10 fold cross validation. We see better performance of our measure compared to both the basic setting and the strong ‘contextual entropy’ baseline. Notably, there is overwhelming improvement in entity extraction over traditional extractor settings (basic features). The stability of the suggested improvement is also confirmed across both classifiers.

We see significant improvements using the proposed feature and now turn to confirm that this is indeed a consistent pattern. Here, we show the averaged performance of binary classification over 100 runs, each run using different and random samples of training and test sets (obtained from 50-50 splits).

We measured the F-measure and accuracy of the classifiers using the basic, baseline and our proposed measure features. Accuracy is defined as the number of correct classifications (both true positives and negatives) divided by the total number of classifications. We use accuracy to represent general classification improvement – when we care about classifying both the correct and incorrect cases. The F-measure is the standard harmonic mean of precision and recall of classification results and we use it to represent information retrieval improvement – when we only care about our target sense. We report both of these metrics here for consistency with past literature.

Figure 3: Labeled data – computed extraction complexities vs. the percentage of hand-labeled true positives.

Figure 4: Features used in judging NER improvements.

Page 90: Meena Nagarajan Ph.D. Dissertation Defense

NER Improvements
• Goal: evaluate existing NE classifiers with and without the 'complexity of extraction' feature
• Spot-and-disambiguate paradigm
• Decision tree classifiers, bagging and boosting
• Cultural entity: movie entities in weblogs
• Binary classification: 1500+ multiple-author annotations from a 2130K general blog corpus

Page 91: Meena Nagarajan Ph.D. Dissertation Defense

Feature Set

Basic:
  Boolean word-level features – first letter capitalized, all capitalized, in quotes
  Boolean contextual syntactic features – POS tags of words before and after the spot
  Knowledge features – Infobox sense definitions in the same blog, the same paragraph as the entity, the title, URL, blog URL

Baseline:
  Basic + contextual entropy prior

Our Measure:
  Basic + proposed 'complexity of extraction' prior
  Knowledge features – extracted sense-biased LM in the same blog, the same paragraph as the entity, the title, URL, blog URL

Page 95: Meena Nagarajan Ph.D. Dissertation Defense

PR Curves (10-fold cross-validation)
basic: word-level, syntactic, sense definitions
baseline: Basic + contextual entropy
our measure: Basic + sense-biased LM + 'complexity of extraction'

Page 96: Meena Nagarajan Ph.D. Dissertation Defense

Binary Classification (over 100 runs): F-measure, Accuracy
basic: word-level, syntactic, sense definitions
baseline: Basic + contextual entropy
our measure: Basic + sense-biased LM + 'complexity of extraction'

Basic features: average accuracy 74%; F-measure 69%
Proposed feature: +10% accuracy; +11% F-measure

Page 97: Meena Nagarajan Ph.D. Dissertation Defense

Entity-level Improvements: Basic vs. Proposed feature sets

Accuracy improvements: The Dark Knight (+26.9%) and The Hangover (+31%)

F-measure improvements: Up (+12.6%), The Dark Knight (+14.9%) and The Hangover (+16.5%)

Page 98: Meena Nagarajan Ph.D. Dissertation Defense

The Polluted LM of Twilight

• "I spent a romantic evening watching the Twilight…"
• "here are photos of the Twilight from the bay.." (photos turned up in the extracted LM: "red carpet photos of the Twilight crew")
• "I am re-reading Edward Cullen chapters of Twilight" mentions Cullen, a character in the movie.

The extracted sense-biased LM (words in bold) was used to derive knowledge features; pollution of this kind negatively impacted classifier results.

Page 99: Meena Nagarajan Ph.D. Dissertation Defense

Thesis Statement
• Knowledge features (Wikipedia Infobox/IMDB)
• From statistically significant co-occurrences to a sense-biased LM
• Baseline settings (state of the art for cultural entities): 69% F-score for traditional extractor settings
• For reference: 90.8 F-score on the CoNLL-2003 NER shared task (PER, LOC, ORG entities) [Lev Ratinov, Dan Roth]
• Average improvements: +11% (F-score)
• Generic methods / proposed feature

Page 100: Meena Nagarajan Ph.D. Dissertation Defense

APPLICATIONS


Page 101: Meena Nagarajan Ph.D. Dissertation Defense

Several Applications of this work

• Weak indicators for contextual browsing/search; reduced document set for manual labeling
• Steps 1, 2: clustered sense-biased documents
• IE pipeline: ignore an entity if its extraction complexity is high
• Step 1 (LM generation): unsupervised domain lexicon generation
  • restaurants and bars

Page 104: Meena Nagarajan Ph.D. Dissertation Defense

Related Terms for Topic Classification

Unsupervised generation of associated words in the Restaurants and Bars topic:

Step 1: Pulse on "Restaurant" (surfaces associated terms)
Step 2: Pulse on the top n surfaced terms from Step 1 (e.g., Review, Table, Reservation, Waiters)

Generated lexicon (sample): restaurant, waiters, tasty, waiter, dish, nutrition, review, cooking, reviews, tibits, vegetarian, chef, sweet, bourdain, waitress, reservations, lunch, dishes, sushi, cuisine, burger, taste, burgers, fries, french, wines, tapas, wineries, wine, café, huang, vietnamese, espresso, anhui, coffee, shops, hotels, cafes, diners, bars, called, hefei, menus, chefs, michelin, dine, establishments, tourist, eateries, chain, meals, culinary, stores, pubs, food, retail, chains, specialty, bakeries, vendors, fuyang, restaurants, entrees, appetizers, salads, menu, assignment, shopper, shoppers, service, delicious, meal, paleo, eating, booths, tables, buffet, shrimp, chopsticks, eat, micah, tierney, dinners, dinner, mkhulu, san, tex, mexican, italian, pizza, brunch, bar, dining, steak, place, seafood, servers, salad, hostess, chinese, sandwich, patrons, bakery, eatery, local, outdoor, diner, mcdonald, greek, fancy, ate, ordering, cheese, business, thai, sandy, dined, hotel, japanese, afternoon, celebrate, birthday, cafe, table, downtown, francisco, good, seating, taco, foods, mex, night, soup, gift, chicken, banquet, anniversary, themed, pizzas, recommend, don, priced, pancakes, burrito, famous, neighborhood, drinks, potato, dessert, sausage, restaurateur, tonight, nearby, german, morton's, casual, reception, kosher, ranch, favorite, servings, crab, appetizer, steaks, toilets, veggie, grilled, baked, pho, pasta, opened, wonderful, reservation, mussels, quaint, pancake, chinatown, foodies, oasis, swanky, kitchen, enjoyed, patio, work, upscale, friend, plate, cab, corner, coworkers, cooks, valentine, celebrated, arrive, stuffed, owners, discount, bistro, vegan, …

Page 107: Meena Nagarajan Ph.D. Dissertation Defense

NER - Approach 2


Page 108: Meena Nagarajan Ph.D. Dissertation Defense

Approach 2: Multiple Senses in the Same Domain

• When a cultural entity has several senses in the same domain
• Domain: music; senses: album, track, band, artist name

3.4 Cultural NER – Multiple Senses in the Same Domain

Compared to our first contribution in Cultural NER (Section 3.3), which focused on disambiguating entity names that span different domains (e.g., movies vs. video games), the focus of our second contribution, detailed in this chapter, is the identification of Cultural entities that appear in multiple senses even within the same domain.

The occurrence of a person's last name such as 'Clinton' in text, even if restricted to documents in the political domain, could refer to Bill, Hillary, or Chelsea Clinton. Figure 3.12 shows another example: the word 'Celebration' used as the name of a band, song, album, and track title by multiple artists in the music domain.

‘Celebration’ (song), a song by Kool & The Gang, notably covered by Kylie Minogue

‘Celebration’ (Voices With Soul song), the debut single from girl band, Voices With Soul

‘Celebration’, a song by Krokus from Hardware

‘Celebration’ (Simple Minds album), a 1982 album by Simple Minds

‘Celebration’ (Julian Lloyd Webber album), a 2001 album by Julian Lloyd Webber

‘Celebration’ (Madonna album), a 2009 greatest hits album by Madonna

‘Celebration’ (Madonna song), same-titled single by Madonna

‘Celebration’ (band), a Baltimore-based band

‘Celebration’ (Celebration album), a 2006 album by ‘Celebration’

‘Celebration’ (musical), a 1969 musical theater work by Harvey Schmidt and Tom Jones

Figure 3.12: Usages of the word 'Celebration' as the name of a band, hit song, album, and track title by multiple artists in the music domain.

The goal of the algorithm described in this chapter is fine-grained identification and disambiguation of such entities (in social media text) that have multiple real-world references within the same domain.

Page 109: Meena Nagarajan Ph.D. Dissertation Defense

Algorithm Preliminaries

• Problem space
  • Cultural entity: music albums, tracks – Smile (Lily Allen), Celebration (Madonna)..
• Corpus: MySpace comments
  • Context-poor utterances – "Happy 25th Lilly, Alfie is funny"
• Goal: semantic annotation of the named entity (w.r.t. MusicBrainz)

Page 111: Meena Nagarajan Ph.D. Dissertation Defense

Using a Knowledge Resource for NER

• 60 songs with 'Merry Christmas'
• 3600 songs with 'Yesterday'
• 195 releases of 'American Pie'
• 31 artists covering 'American Pie'

"Happy 25th! Loved your song Smile .." → semantic annotation

Using a domain knowledge base is not straightforward.

Page 113: Meena Nagarajan Ph.D. Dissertation Defense

Approach Overview

"This new Merry Christmas tune.. SO GOOD!"
Which 'Merry Christmas'? 'So Good' is also a song!

• Scoped relationship graphs
  • using context cues from the content, webpage title, URL..
  • reduce the potential entity spot size
• Generate candidate entities
• Spot and disambiguate

Page 115: Meena Nagarajan Ph.D. Dissertation Defense

Scoping via Real-world Restrictions

"I heart your new album Habits" → eliminate album releases that are not 'new' using metadata in MusicBrainz (a sketch of this filtering follows below).

From all of MusicBrainz (281,890 artists; 6,220,519 tracks) down to the tracks of one artist.

[Chart: precision of the spotter under increasingly loose restrictions – from a spotter trained on only one artist's songs, through restrictions such as 'artists whose first album was in the past N years', 'all artists who released an album in the past N years', 'at least N albums', and 'exactly N albums', down to the entire MusicBrainz taxonomy; precision spans several orders of magnitude.]

Restriction precisions closely follow the distribution of random restrictions, conforming loosely to a Zipf distribution.

Choosing which constraints to implement is simple – pick whatever is easiest first.
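A minimal sketch of this kind of scoping, assuming candidate releases have already been pulled from a MusicBrainz dump; the records, field names, and recency window are hypothetical:

```python
from datetime import date

# Hypothetical candidate records; a real system would pull these from a
# MusicBrainz dump or API. Field names here are illustrative only.
candidates = [
    {"artist": "Some Band",  "album": "Habits", "released": date(2010, 3, 16)},
    {"artist": "Other Band", "album": "Habits", "released": date(1998, 6, 1)},
]

def scope_new_albums(candidates, today=date(2010, 8, 1), window_days=365):
    """Keep only releases that could plausibly be called 'new'."""
    return [c for c in candidates
            if (today - c["released"]).days <= window_days]

print(scope_new_albums(candidates))  # the 1998 release is eliminated
```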

Page 119: Meena Nagarajan Ph.D. Dissertation Defense

Scoped Entity Lists

"this is bad news, ill miss you MJ"

• User comments are on MySpace artist pages (e.g., Madonna's tracks)
• Contextual restriction: artist name
• Assumption: no other artist/work is mentioned
• A naive spotter has the advantage of spotting all possible mentions (modulo spelling errors) – see the sketch below
• But it generates several false positives
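A minimal sketch of such a naive spotter over a scoped entity list; the track list and comment are illustrative:

```python
import re

def naive_spotter(comment, scoped_titles):
    """Spot every occurrence of a scoped title, case-insensitively.

    Spots all possible mentions (modulo spelling errors) and therefore
    produces false positives, e.g. a track title used as a common noun.
    """
    spots = []
    for title in scoped_titles:
        for m in re.finditer(r"\b" + re.escape(title) + r"\b",
                             comment, flags=re.IGNORECASE):
            spots.append((title, m.start(), m.end()))
    return spots

madonna_tracks = ["Celebration", "Frozen", "Music"]   # scoped entity list
# 'music' here is generic usage, yet it is still spotted: a false positive.
print(naive_spotter("Loved the new Celebration video. music 4ever!",
                    madonna_tracks))
```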

Page 121: Meena Nagarajan Ph.D. Dissertation Defense

Non-Music Mentions

• Challenge 1: several senses in the same domain
  • scoping relationship graphs narrows the possible senses
• Challenge 2: non-music mentions
  • "Got your new album Smile. Loved it!"
  • "Keep your SMILE on!" – valid mention?

Page 122: Meena Nagarajan Ph.D. Dissertation Defense

Hand-labeling – Fairly Subjective

• 1800+ spots in MySpace user comments from artist pages
• "Keep your SMILE on!" – good spot, bad spot, inconclusive?
• 4-way annotator agreements:
  • Madonna: 90% agreement
  • Rihanna: 84% agreement
  • Lily Allen: 53% agreement

Page 123: Meena Nagarajan Ph.D. Dissertation Defense

Supervised Learners

Table 6. Features used by the SVM learner (+ marks basic features, others are advanced; * features apply only to one-word-long spots)

Syntactic features (Notation-S):
  +POS tag of s (s.POS)
  POS tag of one token before s (s.POSb)
  POS tag of one token after s (s.POSa)
  Typed dependency between s and a sentiment word (s.POS-TDsent*)
  Typed dependency between s and a domain-specific term (s.POS-TDdom*)
  Boolean typed dependency between s and a sentiment word (s.B-TDsent*)
  Boolean typed dependency between s and a domain-specific term (s.B-TDdom*)

Word-level features (Notation-W):
  +Capitalization of spot s (s.allCaps)
  +Capitalization of the first letter of s (s.firstCaps)
  +s in quotes (s.inQuotes)

Domain-specific features (Notation-D):
  Sentiment expression in the same sentence as s (s.Ssent)
  Sentiment expression elsewhere in the comment (s.Csent)
  Domain-related term in the same sentence as s (s.Sdom)
  Domain-related term elsewhere in the comment (s.Cdom)

Table 7. Typed dependencies example
  Valid spot: "Got your new album Smile. Simply loved it!" Encoding: nsubj(loved-8, Smile-5), implying that Smile is the nominal subject of the expression loved.
  Invalid spot: "Keep your smile on. You'll do great!" Encoding: no typed dependency between smile and great.

Typed Dependencies: We also captured the typed dependency paths (grammatical relations) via the s.POS-TDsent and s.POS-TDdom features. These were obtained between a spot and co-occurring sentiment and domain-specific words by the Stanford parser [12] (see the example in Table 7). We also encode a boolean value indicating whether a relation was found at all, using the s.B-TDsent and s.B-TDdom features. This allows us to accommodate parse errors given the informal and often non-grammatical English in this corpus.

5.2 Data and Experiments

Our training and test data sets were obtained from the hand-tagged data (see Table 3). Positive and negative training examples were all spots that all four annotators had confirmed as valid or invalid respectively, for a total of 571 positive and 864 negative examples. Of these, we used 550 positive and 550 negative examples for training. The remaining spots were used for test purposes. Our positive and negative test sets comprised all spots that three annotators had confirmed as valid or invalid, i.e. had 75% agreement. We also included spots where 50% of the annotators had agreement on the validity of the …

Generic syntactic, spot-level, domain, knowledge features.

["Multimodal Social Intelligence in a Realtime Dashboard System", VLDB Journal, Special Issue on "Data Management and Mining on Social Networks and Social Media", 2010]
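The thesis obtained these relations with the Stanford parser; an analogous check can be sketched with spaCy (an assumption of this sketch, not the system's actual stack), flagging a spot that is grammatically tied to a sentiment word. Whether a given sentence fires depends on the parser's analysis:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def spot_sentiment_dependency(text, spot, sentiment_words):
    """Return True if the spot token depends on a sentiment word,
    e.g. as its nominal subject or direct object."""
    doc = nlp(text)
    for tok in doc:
        if (tok.text.lower() == spot.lower()
                and tok.head.text.lower() in sentiment_words
                and tok.dep_ in ("nsubj", "dobj")):
            return True
    return False

print(spot_sentiment_dependency("Got your new album Smile. Simply loved it!",
                                "Smile", {"loved"}))
print(spot_sentiment_dependency("Keep your smile on. You'll do great!",
                                "smile", {"great"}))
```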

Page 124: Meena Nagarajan Ph.D. Dissertation Defense


1. Sentiment expressions: slang sentiment gazetteer built using Urban Dictionary
2. Domain-specific terms: music, album, concert..

Page 125: Meena Nagarajan Ph.D. Dissertation Defense

Efficacy of Features

PR tradeoffs: choosing feature combinations depends on the end application's requirement.

• Precision intensive: 78–50
• Recall intensive: 42–91
• 90–35: identified 90% of valid spots; eliminated 35% of invalid spots

Thesis statement: feature combinations were the most stable and best performing; gazetteer-matched domain words and sentiment expressions proved to be useful.

Page 128: Meena Nagarajan Ph.D. Dissertation Defense

Dictionary Spotter + NLP

Step 1: Spot with the naive spotter, knowledge base restricted to an artist's tracks.
  Madonna's track spots: 23% precision.

Step 2: Disambiguate using NL features (SVM classifier).
  Madonna's track spots: ~60% precision.
  42–91: the "All features" setting.

[Chart: classifier accuracy over valid/invalid splits – precision and recall per feature setting for Lily Allen, Rihanna, and Madonna, plus recall across all three, starting from the naive spotter baseline.]

Page 131: Meena Nagarajan Ph.D. Dissertation Defense

Thesis Statements

• Highlights issues with using a domain knowledge base for an IE task
• Two-stage approach: chaining NL learners over the results of domain-model-based spotters
• Improves accuracy by up to a further 50%
  • allows the more time-intensive NLP analytics to run on less than the full set of input data

Page 134: Meena Nagarajan Ph.D. Dissertation Defense

APPLICATIONS


Page 135: Meena Nagarajan Ph.D. Dissertation Defense

BBC SoundIndex (IBM Almaden)
Pulse of the Online Music Populace
http://www.almaden.ibm.com/cs/projects/iis/sound/

Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth: "Multimodal Social Intelligence in a Real-Time Dashboard System", to appear in a special issue of the VLDB Journal on "Data Management and Mining for Social Networks and Social Media", 2010.

Page 136: Meena Nagarajan Ph.D. Dissertation Defense

ETL: domain metadata (artist/track); unstructured and structured metadata.

Page 137: Meena Nagarajan Ph.D. Dissertation Defense

UIMA Analytics Environment (ETL)

• Album/track identification [ISWC09]
• Sentiment identification
• Spam and off-topic comments

Examples: "U R $o Bad!"; "Thriller is my most fav MJ album"; "this is BAD news, miss u MJ"

POS-tagged: Thriller/NNP is/VBZ my/PRP$ most/RBS fav/JJ MJ/NN album/NN
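The tags above came from annotators inside the UIMA pipeline; a comparable tagging can be sketched with NLTK (an assumption for illustration, not the system's actual component):

```python
import nltk

# One-time model downloads (resource names can vary across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Thriller is my most fav MJ album")
print(nltk.pos_tag(tokens))
# e.g. [('Thriller', 'NNP'), ('is', 'VBZ'), ('my', 'PRP$'), ...]
```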

Page 138: Meena Nagarajan Ph.D. Dissertation Defense

Extracted concepts loaded into explorable data structures (ETL).

Page 139: Meena Nagarajan Ph.D. Dissertation Defense

What are 18-year-olds in London listening to?

Crowd-sourced preferences.

Page 141: Meena Nagarajan Ph.D. Dissertation Defense

Several Insights..

[Charts: spam vs. non-spam comment breakdown; negative vs. positive sentiment, with negative < 4%.]

Page 142: Meena Nagarajan Ph.D. Dissertation Defense

Predictive Power of Data

Table 7. Annotation statistics:
• 38% of total comments were spam
• 61% of total comments had positive sentiments
• 4% of total comments had negative sentiments
• 35% of total comments had no identifiable sentiments

As described in Section 8, the structured metadata (artist name, timestamp, etc.) and annotation results (spam/non-spam, sentiment, etc.) were loaded into the hypercube. The data represented by each cell of the cube is the number of comments for a given artist. The dimensionality of the cube depends on which variables we examine in our experiments: timestamp, age and gender of the poster, geography, and other factors are all dimensions in the hypercube, in addition to the measures derived from the annotators (spam, non-spam, number of positive sentiments, etc.).

For the purposes of creating a top-N list, all dimensions except artist name are collapsed. The cube is then sliced along the spam axis (to project only non-spam comments) and the comment counts are projected onto the artist-name axis. Since the percentage of negative comments was very small (4%), the top-N list was prepared by sorting artists on the number of non-spam comments they had received, independent of the sentiment scoring.

In Table 8 we show the top 10 most popular Billboard artists and the list generated by our analysis of MySpace for the week of the survey. While some top artists appear on both lists (e.g., Soulja Boy, Timbaland, 50 Cent, and Pink), there are important differences. In some cases, our MySpace Analysis list clearly identified rising artists before they reached the Top-10 list on Billboard (e.g., Fall Out Boy and Alicia Keys both climbed to #1 on Billboard.com shortly after we produced these lists). Overall, the Billboard.com list contains more artists with a long history and a large body of work (e.g., Kanye West, Fergie, Nickelback), whereas our MySpace Analysis list is more likely to identify "up and coming" artists. This is consistent with our expectations, particularly in light of the aforementioned industry reports which indicate that teenagers are the biggest music influencers (Mediamark, 2004).

Table 8. Billboard's top artists vs. our generated list:
  Billboard.com: Soulja Boy, Kanye West, Timbaland, Fergie, J. Holiday, 50 Cent, Keyshia Cole, Nickelback, Pink, Colbie Caillat
  MySpace Analysis: T.I., Soulja Boy, Fall Out Boy, Rihanna, Keyshia Cole, Avril Lavigne, Timbaland, Pink, 50 Cent, Alicia Keys

11.1.2 The Word on the Street

Using the above lists, we performed a casual preference poll of 74 people in the target demographic. We conducted a survey among students of an after-school program (Group 1), Wright State (Group 2), and Carnegie Mellon (Group 3). Group 1 comprised respondents between ages 8 and 15, while Groups 2 and 3 primarily comprised college students in the 17–22 age group. Table 9 shows statistics for the three survey groups.

Table 9. Survey group statistics:
  Group 1 (ages 8–15): 8 male, 9 female respondents
  Group 2 (ages 17–22): 21 male, 26 female respondents
  Group 3 (ages 17–22): 7 male, 3 female respondents

The survey was conducted as follows: the 74 respondents were asked to study the two lists shown in Table 8, one generated by Billboard and the other through the crawl of MySpace. They were then asked: "Which list more accurately reflects the artists that were more popular last week?" Their response, along with their age, gender, and the reason for preferring a list, was recorded. The sources used to prepare the lists were not shown to the respondents, so they would not be influenced by the popularity of MySpace or Billboard. In addition, we periodically switched the lists while conducting the study to avoid any bias based on which list was presented first.

11.1.3 Results

The raw results of our study immediately suggest the validity of the system, as can be seen in Table 10. The MySpace-generated list is preferred more than 2 to 1 over the Billboard list by our 74 test subjects, and the preference is consistently in favor of our list across all three survey groups. More exactly, 68.9 ± 5.4% of subjects prefer the SI-derived list to the Billboard list. Looking specifically at Group 1, the youngest survey group whose ages range from 8–15, our list is even more successful. Even with a smaller sample group (resulting in …

The user study indicated a 2:1 and up to 7:1 (younger age groups) preference for the MySpace list.

Billboard's Top 50 Singles chart during the week of Sept 22–28 '07 vs. MySpace popularity charts.

Challenging traditional polling methods!
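A minimal sketch of the slice-and-project step using pandas in place of the actual hypercube machinery; the records are toy data:

```python
import pandas as pd

# Toy comment-level records; the real system stored aggregated counts as
# cells in a hypercube with time, age, gender, and geography dimensions.
comments = pd.DataFrame([
    {"artist": "Soulja Boy", "spam": False, "sentiment": "pos"},
    {"artist": "T.I.",       "spam": False, "sentiment": "pos"},
    {"artist": "T.I.",       "spam": False, "sentiment": None},
    {"artist": "Soulja Boy", "spam": True,  "sentiment": "pos"},
])

# Slice along the spam axis, collapse every dimension except artist,
# and rank artists by non-spam comment volume (sentiment ignored,
# since negatives were only ~4% of comments).
top_n = (comments[~comments["spam"]]
         .groupby("artist").size()
         .sort_values(ascending=False)
         .head(10))
print(top_n)
```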


Page 144: Meena Nagarajan Ph.D. Dissertation Defense

INTERMISSION?
Up Next: Overview of Key Phrase Extraction, Applications, Conclusions

Page 145: Meena Nagarajan Ph.D. Dissertation Defense

Key Phrase Extraction

Page 146: Meena Nagarajan Ph.D. Dissertation Defense

Key Phrase Extraction – Aboutness Understanding

• Different from Information Extraction
• Extracting vs. assigning key phrases – thesis focus: key phrase extraction
• Prior work focus: extracting phrases that summarize a single document – a news article, a web page, a journal article, a book..
• Thesis focus: summarizing multiple documents (UGC) around the same event/topic of interest

Page 147: Meena Nagarajan Ph.D. Dissertation Defense

• Prominent discussions (key phrases) around the 2009 Health Care Reform debate and the 2008 Mumbai Terror Attack on one day

The contributions made in this thesis fall under the second category: extracting key phrases that are explicitly present in the content and are also indicative of what the document is 'about'. The focus of previous approaches to key phrase extraction has been on extracting phrases that summarize a document, e.g. a news article, a web page, a journal article or a book. In contrast, the focus of this thesis is not to summarize a single document generated by users on social media platforms but to extract key phrases that are descriptive of information present in multiple observations (or documents) made by users about an entity, event or topic of interest.

The primary motivation is to obtain an abstraction of a social phenomenon that makes volumes of unstructured user-generated content easily consumable by humans and agents alike. As an example of the goals of our work, Table 4.1 shows key phrases extracted from online discussions around the 2009 Health Care Reform debate and the 2008 Mumbai terror attack, summarizing hundreds of user comments to give a sense of what the population cared about on a particular day.

Table 4.1. Summary key phrases extracted from more than 500 online posts on Twitter around two newsworthy events on a single day:
  2009 Health Care Reform: health care debate; healthcare staffing problem; Obamacare facts; healthcare protestors; party ratings plummet; public option
  2008 Mumbai Terror Attack: foreign relations perspective; Indian prime minister speech; UK indicating support; country of India; rejected evidence provided; photographers capture images of Mumbai

Solutions to key phrase extraction have ranged from unsupervised techniques based on heuristics for identifying phrases to supervised learning approaches that learn from human …

Page 148: Meena Nagarajan Ph.D. Dissertation Defense

Key Phrase Extraction on Social Media Content

• Thesis focus: summarizing social perceptions via key phrase extraction
• Preserving/isolating the social behind the social data
• Accounting for redundancy, variability, and off-topic content

"Met up with mom for lunch, she looks lovely as ever, good genes .. Thanks Nike, I love my new Gladiators ..smooth as a feather. I burnt all the calories of Italian joy in one run.. if you are looking for good Italian food on Main, Buca is the place to go."

Page 149: Meena Nagarajan Ph.D. Dissertation Defense

Social and Cultural Logic in UGC

• Thematic components – similar messages convey similar ideas
• Space and time metadata – the role of community and geography in communication
• Poster attributes – age, gender, and socio-economic status reflect similar perceptions

Page 150: Meena Nagarajan Ph.D. Dissertation Defense

Feature Space (in prior work and in this thesis)

• Thesis focus: n-grams, spatio-temporal metadata (social components)
• Syntactic cues: in quotes, italics, bold; in document headers; phrases collocated with acronyms
• Document and structural cues: two-word phrases, appearing at the beginning of a document, frequency, presence in multiple similar documents, etc.
• Linguistic cues: stemmed form of a phrase, phrases that are simple and compound nouns in sentences, etc.

Page 151: Meena Nagarajan Ph.D. Dissertation Defense

Key Phrase Extraction Overview

1. User-generated content: textual component tc, temporal parameter tt, spatial parameter tg
2. Spatio-temporal clusters: δs event spatial bias, δt event temporal bias
3. Key phrase generation: n-gram generation
4. n-gram weighting: thematic, temporal and spatial scores
5. Off-topic key phrase elimination

Page 152: Meena Nagarajan Ph.D. Dissertation Defense


“President Obama in trying to regain control of the health-care debate will likely shift his pitch in September”

1-grams: President, Obama, in, trying, to, regain, ...

2-grams: “President Obama”, “Obama in”, “in trying”, “trying to”...

3-grams: “President Obama in”, “Obama in trying”; “in trying to”...
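A minimal sketch of the n-gram generation step over whitespace tokens (tokenization details in the actual system may differ):

```python
def ngrams(text, max_n=3):
    """Generate all 1..max_n-grams over whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

sentence = ("President Obama in trying to regain control of the "
            "health-care debate will likely shift his pitch in September")
print(ngrams(sentence)[:8])  # first few candidate 1-grams
```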

Page 153: Meena Nagarajan Ph.D. Dissertation Defense

Candidate n-grams: "President", "President Obama", "President Obama in"


Page 154: Meena Nagarajan Ph.D. Dissertation Defense

A descriptor is an n-gram weighted by (an illustrative sketch follows below):

• Thematic importance – TFIDF, stop words, noun phrases
  • redundancy: statistically discriminatory in nature
  • variability: contextually important
• Spatial importance – local vs. global popularity
• Temporal importance – always popular vs. currently trending

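The thesis defines precise thematic, spatial, and temporal scores; the sketch below is only one illustrative way to combine such signals, with hypothetical frequency inputs, not the thesis formula:

```python
def descriptor_score(tfidf, local_freq, global_freq, recent_freq, past_freq):
    """Illustrative combination of the three signals.

    - thematic: a TFIDF-style statistical importance score
    - spatial: local popularity relative to global popularity
    - temporal: current popularity relative to historical popularity
    """
    spatial = local_freq / (global_freq + 1.0)
    temporal = recent_freq / (past_freq + 1.0)
    return tfidf * (1.0 + spatial) * (1.0 + temporal)

# An n-gram trending now in one region outranks an always-popular one.
print(descriptor_score(tfidf=2.5, local_freq=80, global_freq=100,
                       recent_freq=60, past_freq=5))
```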

Page 155: Meena Nagarajan Ph.D. Dissertation Defense

Higher-order n-grams picked over lower-order n-grams (if same scores)

Page 156: Meena Nagarajan Ph.D. Dissertation Defense

Eliminating Off-topic Content [WISE2009]

• Frequency-based heuristics will not eliminate off-topic content that is ALSO POPULAR


Page 157: Meena Nagarajan Ph.D. Dissertation Defense

• “Yeah i know this a bit off topic but the other electronics forum is dead right now. im looking for a good camcorder, somethin not to large that can record in full HD only ones so far that ive seen are sonys”

• “Canon HV20. Great little cameras under $1000.”


Approach Overview

Page 158: Meena Nagarajan Ph.D. Dissertation Defense

Approach Overview

• Assume one or more seed words (from a domain knowledge base): C1 = ['camcorder']
• Extracted key words/phrases: C2 = ['electronics forum', 'hd', 'camcorder', 'somethin', 'ive', 'canon', 'little camera', 'canon hv20', 'cameras', 'offtopic']
• Gradually expand C1 by adding phrases from C2 that are strongly associated with C1
• Mutual-information-based algorithm [WISE2009] – see the sketch below
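A minimal sketch of one expansion pass, with a toy probability table standing in for real corpus counts; `cooccur` and the threshold are assumptions of this sketch, not the exact [WISE2009] algorithm:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information from (estimated) probabilities."""
    return math.log2(p_xy / (p_x * p_y))

def expand_seeds(seeds, candidates, cooccur, threshold=1.0):
    """Grow the seed set with candidates strongly associated with any seed.

    `cooccur(a, b)` returns estimated (p_ab, p_a, p_b) from corpus counts.
    One pass is shown; the full algorithm would iterate until no
    candidate clears the threshold.
    """
    expanded = set(seeds)
    for phrase in candidates:
        if any(pmi(*cooccur(phrase, s)) >= threshold for s in seeds):
            expanded.add(phrase)
    return expanded

# Toy probability table standing in for real corpus counts.
probs = {("canon hv20", "camcorder"): (0.020, 0.03, 0.25),
         ("offtopic",   "camcorder"): (0.001, 0.02, 0.25)}
cooccur = lambda a, b: probs.get((a, b), (1e-6, 0.01, 0.25))
print(expand_seeds({"camcorder"}, ["canon hv20", "offtopic"], cooccur))
```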

Page 159: Meena Nagarajan Ph.D. Dissertation Defense

Key Phrases & Aboutness – Evaluations

• Are the key phrases we extracted topical and good indicators of what the content is about?
• If so, a phrase should act as an effective index/search phrase and return relevant content
• Evaluation application: targeted content delivery

Page 160: Meena Nagarajan Ph.D. Dissertation Defense

Targeted Content Delivery – Evaluations

• 12K posts from MySpace and Facebook electronics forums
• Baseline phrases: Yahoo Term Extractor (YTE)
• Our method's phrases: key phrase extraction + off-topic elimination
• Targeted content from Google AdSense

Page 161: Meena Nagarajan Ph.D. Dissertation Defense

Targeted Content for Extracted Key Phrases

A. Advertisements generated for phrases identified by the Yahoo Term Extractor (YTE)
B. Advertisements generated for topical phrases extracted by our algorithm

Page 162: Meena Nagarajan Ph.D. Dissertation Defense

User Studies and Results

[Figure: user-study results – for the YTE baseline phrases and for the extracted topical key phrases, total ad impressions and the share of ads users picked as relevant to the post, with inter-evaluator agreement reported.]

Extracted, topical phrases yield 2X more relevant content.

Page 165: Meena Nagarajan Ph.D. Dissertation Defense

Thesis Statement

• TFIDF + social contextual cues yield more useful phrases that preserve social perceptions
• Corpus + seeds from a domain knowledge base eliminate off-topic phrases effectively

Page 166: Meena Nagarajan Ph.D. Dissertation Defense

APPLICATIONS


Page 167: Meena Nagarajan Ph.D. Dissertation Defense

http://twitris.knoesis.org/

Twitris (with Kno.e.sis) – the online pulse around newsworthy events (Mumbai Terror Attack '08, Health Care Debate 2010)

Meenakshi Nagarajan, Karthik Gomadam, Amit Sheth, Ajith Ranabahu, Raghava Mutharaju and Ashutosh Jadhav: "Spatio-Temporal-Thematic Analysis of Citizen-Sensor Data – Challenges and Experiences", Tenth International Conference on Web Information Systems Engineering, Oct 5–7, 2009: 539–553.

Page 168: Meena Nagarajan Ph.D. Dissertation Defense

Chatter around newsworthy events

Hundreds of tweets, Facebook posts, and blogs about a single event. Multiple narratives, strong opinions, breaking news..

Page 169: Meena Nagarajan Ph.D. Dissertation Defense

Preserving Social Perceptions

The Health Care Reform Debate

Page 170: Meena Nagarajan Ph.D. Dissertation Defense

Zooming in on Florida

Page 171: Meena Nagarajan Ph.D. Dissertation Defense

Summaries of Citizen Reports

Page 172: Meena Nagarajan Ph.D. Dissertation Defense

Zooming in on Washington

Page 173: Meena Nagarajan Ph.D. Dissertation Defense

Summaries of Citizen Reports

RT @WestWingReport: Obama reminds the faith-based groups "we're neglecting 2 live up 2 the call" of being R brother's keeper on #healthcare

Page 174: Meena Nagarajan Ph.D. Dissertation Defense

Providing Context

Twitris – socially influenced browsing
Ashu, Raghava, Wenbo, Pramod, Vinh, Karthik, Meena, Amit, and Ajith – Kno.e.sis Center, Wright State University

Spatial perspective: Opinion on the Iran Election from the US talks about oil economies, blogging; opinion on the Iran Election from Iran talks about theocracy, oppression, demonstration. Capture changing perceptions and issues of interest every day – 'legalize illegal immigrants' in the healthcare context on September 18.

Temporal perspective: Capture changing perceptions and issues of interest every day – 'Nobel is no more the news for Obama!' captured October 12.

Find resources related to social perceptions: news and Wikipedia articles to put extracted descriptors in context.

What does Twitris do? Twitris aggregates social perceptions from Twitter using a spatio-temporal-thematic approach: it captures what was said, when it was said, and where it was said; fetches resources from the Web to explore perceptions further; and lets you browse the Web for the issues that matter to people, using people's perceptions as the fulcrum.
✓ Exploit spatio-temporal semantics for thematic aggregation
✓ Analyze the anatomy of a tweet – "RT @m33na come back and check new events on twitris #twitris" (RT: a retweet or repost of a tweet; # hashtags: user-generated metadata; @: references to other users)
✓ Data from diverse sources (Twitter, news services, Wikipedia, and other Web resources)
✓ End-user application

Statistics from Twitris (unit: tweets): Healthcare (Aug 19 – Oct 20): 721K (US only); Obama (Oct 8 – 20): 312K (US only); H1N1 (Oct 5 – 20): 232K (US only); Iran Election (June 5 – Oct 20): 2.8M (worldwide).

Architecture: per-event crawlers with author-location and geocode lookups feed a data-processing stage – TFIDF-based descriptor extraction; spatio-temporal-thematic descriptor extraction; extracting storylines around descriptors via Twitter Search – through shared memory and data dumpers into the Twitris DB; the front end presents a concept cloud with Google News and DBpedia widgets driven by the selected term and its context.

Twitris internals in less than 140 characters: parallel crawling to scale; a data-processing pipeline streamlining Twitter, geocode services, and data analytics to handle heterogeneity; live resource aggregation; near real time (processing up to a day behind); spatio-temporally weighted text analytics.

The fourth estate perspective: culled-out user observations correlated well with mainstream media (news, blogs).

Caveats and future work: (1) handle Twitter constructs such as hashtags, retweets, mentions and replies better; (2) different visualization widgets, such as time series showing changing perceptions from a place for an event, and demographic-based visualizations; (3) sentiment analysis; (4) robust computing approaches (cloud, Hadoop); (5) Facebook Connect for sharing and personalization.

Check us out at http://twitris.knoesis.org – follow us @7w17r15 – become a Facebook fan and share Twitris with everyone.

A Tetris-like approach to Twitter to gather aggregated social signals.

[Screenshot caption: SOYLENT GREEN and the HEALTH CARE REFORM – information right where you need it!]

Page 175: Meena Nagarajan Ph.D. Dissertation Defense

CONCLUSIONS


Page 176: Meena Nagarajan Ph.D. Dissertation Defense

Contributions and Summary

• Thesis motivation: understand the characteristics of user-generated textual content on social media
• The thesis demonstrated that UGC is different – variability and lack of context affect the performance of text mining (TM) tasks
  • Example: 69% F-measure for NER on UGC

Page 177: Meena Nagarajan Ph.D. Dissertation Defense

Summary

• Described frameworks and implementations for how contextual knowledge from multiple sources can be integrated to supplement traditional NLP/ML algorithms
• Showed the effectiveness of these frameworks for NER and key phrase extraction on UGC
  • e.g., +11% average F-score improvement for NER in weblogs
• Building social intelligence applications

Page 178: Meena Nagarajan Ph.D. Dissertation Defense

Other Contributions – UGC Understanding Tasks and Facets (WHAT, WHY, HOW)

• Named Entity Recognition [ISWC2009]
• Key Phrase Extraction [WISE2009, WI2009]
• Semantic Document Classification [WWW2007]
• Intent Mining [WI2009]
• Network Effects [ICWSM2010]
• Gendered Language Usage in Online Self-Expression [ICWSM2009]
• Domain Models: disambiguating entities in merging ontologies; applications in conflict-of-interest detection [WWW2006, TWEB2008]

Page 183: Meena Nagarajan Ph.D. Dissertation Defense

Future Work

• The long-term outlook
  • research online social user interactions
  • build and design systems to understand and impact how society produces, consumes and shares data
• The near-term goals
  • transformative and robust ways of coding, analyzing and interpreting user observations

Page 184: Meena Nagarajan Ph.D. Dissertation Defense

Future Work

• Big data challenges & availability of domain models
• Computational social science
  • 'why' are we seeing what we are seeing
  • people-content-network interactions
  • building tools that close that loop; slicing and dicing of data, correlation with other media..

Page 190: Meena Nagarajan Ph.D. Dissertation Defense

THANK-YOU! Are there any questions?