CS 4705: Robust Semantics, Information Extraction, and Information Retrieval

Jan 27, 2016
Transcript
Page 1: Robust Semantics, Information Extraction, and Information Retrieval

CS 4705

Robust Semantics, Information Extraction, and Information Retrieval

Page 2: Robust Semantics, Information Extraction, and Information Retrieval

Problems with Syntax-Driven Semantics

• Syntactic structures often don’t fit semantic structures very well
  – Important semantic elements often distributed very differently in trees for sentences that mean ‘the same’

I like soup. Soup is what I like.

– Parse trees contain many structural elements not clearly important to making semantic distinctions

– Syntax-driven semantic representations are sometimes pretty verbose

V --> serves  {∃e,x,y Isa(e, Serving) ∧ Server(e, y) ∧ Served(e, x)}

Page 3: Robust Semantics, Information Extraction, and Information Retrieval

Alternatives?

• Semantic Grammars
• Information Extraction Techniques
• Information Retrieval --> Information Extraction

Page 4: Robust Semantics, Information Extraction, and Information Retrieval

Semantic Grammars

• Alternative to modifying syntactic grammars to deal with semantics too

• Define grammars specifically in terms of the semantic information we want to extract
  – Domain specific: rules correspond directly to entities and activities in the domain

    I want to go from Boston to Baltimore on Thursday, September 24th

  – Greeting --> {Hello | Hi | Um…}
  – TripRequest --> Need-spec travel-verb from City to City on Date
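As a rough illustration of how such a rule can be put to work, here is a minimal sketch that implements the TripRequest rule as a single pattern; the city list, date pattern, and slot names are assumptions for the example, not part of the lecture's grammar.

    import re

    # Minimal semantic-grammar sketch for the travel domain (illustrative only).
    # CITY and DATE are toy sub-grammars; a real system would use fuller lists.
    CITY = r"(Boston|Baltimore|New York|Denver)"
    DATE = r"((?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day(?:,\s+\w+\s+\d{1,2}(?:st|nd|rd|th)?)?)"

    # TripRequest --> Need-spec travel-verb from City to City on Date
    TRIP_REQUEST = re.compile(
        rf"I (?:want|need|would like) to (?:go|travel|fly) from {CITY} to {CITY} on {DATE}",
        re.IGNORECASE,
    )

    def parse_trip_request(utterance):
        """Return the frame (slot fillers) for a TripRequest, or None."""
        m = TRIP_REQUEST.search(utterance)
        if m is None:
            return None
        return {"from_city": m.group(1), "to_city": m.group(2), "date": m.group(3)}

    print(parse_trip_request(
        "I want to go from Boston to Baltimore on Thursday, September 24th"))
    # -> {'from_city': 'Boston', 'to_city': 'Baltimore', 'date': 'Thursday, September 24th'}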

Page 5: Robust Semantics, Information Extraction, and Information Retrieval

Predicting User Input

• Semantic grammars rely upon knowledge of the task and (sometimes) constraints on what the user can do, when
  – Allows them to handle very sophisticated phenomena

I want to go to Boston on Thursday.

I want to leave from there on Friday for Baltimore.

TripRequest --> Need-spec travel-verb from City on Date for City

Dialogue postulate maps filler for ‘from-city’ to pre-specified from-city
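One way to read the dialogue postulate, sketched below: keep a dialogue state across turns and, when a TripRequest leaves the from-city implicit (“there”), fill it from the city already established by the earlier request. The state keys and the resolution rule are assumptions for illustration.

    # Toy dialogue-state sketch (one illustrative reading of the postulate above).
    state = {"from_city": None, "to_city": None, "date": None}

    def update(state, frame):
        """Merge a new TripRequest frame into the dialogue state."""
        if frame.get("from_city") in (None, "there"):
            # 'there' refers back to the city established by the previous request
            frame["from_city"] = state.get("to_city")
        state.update({k: v for k, v in frame.items() if v is not None})
        return state

    # "I want to go to Boston on Thursday."
    update(state, {"to_city": "Boston", "date": "Thursday"})
    # "I want to leave from there on Friday for Baltimore."
    update(state, {"from_city": "there", "to_city": "Baltimore", "date": "Friday"})
    print(state)  # {'from_city': 'Boston', 'to_city': 'Baltimore', 'date': 'Friday'}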

Page 6: Robust Semantics, Information Extraction, and Information Retrieval

Drawbacks of Semantic Grammars

• Lack of generality
  – A new one for each application
  – Large cost in development time
• Can be very large, depending on how much coverage you want
• If users go outside the grammar, things may break disastrously

    I want to leave from my house.
    I want to talk to someone human.

Page 7: Robust Semantics, Information Extraction, and Information Retrieval

Information Extraction

• Another ‘robust’ alternative
• Idea is to ‘extract’ particular types of information from arbitrary text or transcribed speech
• Examples:

– Named entities: people, places, organizations, times, dates

– Telephone numbers

<Organization>MIPS</Organization> Vice President <Person>John Hime</Person>

• Domains: Medical texts, broadcast news, voicemail,...
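A toy sketch of this kind of tagging (not from the lecture): a regular expression for phone numbers plus a small hand-built gazetteer for people and organizations; the names and labels are assumed for the example.

    import re

    # Toy named-entity tagger: regex for phone numbers, gazetteer for names.
    PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
    GAZETTEER = {"MIPS": "Organization", "John Hime": "Person"}

    def tag_entities(text):
        text = PHONE.sub(lambda m: f"<Phone>{m.group(0)}</Phone>", text)
        for name, label in GAZETTEER.items():
            text = text.replace(name, f"<{label}>{name}</{label}>")
        return text

    print(tag_entities("MIPS Vice President John Hime can be reached at 212-555-1212."))
    # <Organization>MIPS</Organization> Vice President <Person>John Hime</Person>
    # can be reached at <Phone>212-555-1212</Phone>.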

Page 8: Robust Semantics, Information Extraction, and Information Retrieval

Appropriate where Semantic Grammars and Syntactic Parsers are Not

• Appropriate where information needs are very specific
  – Question answering systems, gisting of news or mail…
  – Job ads, financial information, terrorist attacks

• Input too complex and far-ranging to build semantic grammars

• But full-blown syntactic parsers are impractical
  – Too much ambiguity for arbitrary text
  – 50 parses or none at all
  – Too slow for real-time applications

Page 9: Robust Semantics, Information Extraction, and Information Retrieval

Information Extraction Techniques

• Often use a set of simple templates or frames with slots to be filled in from input text
  – Ignore everything else
  – My number is 212-555-1212.
  – The inventor of the wiggleswort was Capt. John T. Hart.
  – The king died in March of 1932.

• Context (neighboring words, capitalization, punctuation) provides cues to help fill in the appropriate slots
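A minimal slot-filling sketch over the three example sentences above; the template names and patterns are my own illustrations of the idea, not the lecture's.

    import re

    # Each template is a frame whose slots are filled by a pattern; everything
    # else in the input is ignored.
    TEMPLATES = {
        "PhoneNumber": re.compile(r"[Mm]y number is (?P<number>\d{3}-\d{3}-\d{4})"),
        "Invention":   re.compile(r"[Tt]he inventor of the (?P<invention>\w+) was "
                                  r"(?P<inventor>[A-Z][\w. ]*[a-z])\."),
        "Death":       re.compile(r"[Tt]he king died in (?P<month>[A-Z]\w+) of (?P<year>\d{4})"),
    }

    def fill_templates(text):
        """Return (template name, slot fillers) for every template that matches."""
        return [(name, m.groupdict())
                for name, pattern in TEMPLATES.items()
                for m in pattern.finditer(text)]

    text = ("My number is 212-555-1212. The inventor of the wiggleswort was "
            "Capt. John T. Hart. The king died in March of 1932.")
    for frame, slots in fill_templates(text):
        print(frame, slots)
    # PhoneNumber {'number': '212-555-1212'}
    # Invention {'invention': 'wiggleswort', 'inventor': 'Capt. John T. Hart'}
    # Death {'month': 'March', 'year': '1932'}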

Page 10: Robust Semantics, Information Extraction, and Information Retrieval

The IE Process

• Given a corpus and a target set of items to be extracted:
  – Clean up the corpus

– Tokenize it

– Do some hand labeling of target items

– Extract some simple features

• POS tags

• Phrase Chunks …

– Do some machine learning to associate features with target items, or derive this association by intuition

– Use e.g. FSTs, simple or cascaded, to iteratively annotate the input, eventually identifying the slot fillers
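As a rough sketch of the last step, the toy cascade below uses regular expressions as stand-ins for FSTs: each stage rewrites the text, and later stages match over the annotations produced by earlier ones. The labels and patterns are assumptions for illustration.

    import re

    # Each stage adds bracketed annotations; later stages build on earlier ones.
    STAGES = [
        ("NUM",   re.compile(r"\b\d{4}\b")),
        ("MONTH", re.compile(r"\b(January|February|March|April|May|June|July|"
                             r"August|September|October|November|December)\b")),
        # Combines the two earlier annotations into a larger DATE unit.
        ("DATE",  re.compile(r"<MONTH>\w+</MONTH> of <NUM>\d{4}</NUM>")),
    ]

    def annotate(text):
        for label, pattern in STAGES:
            text = pattern.sub(lambda m: f"<{label}>{m.group(0)}</{label}>", text)
        return text

    print(annotate("The king died in March of 1932."))
    # The king died in <DATE><MONTH>March</MONTH> of <NUM>1932</NUM></DATE>.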

Page 11: Robust Semantics, Information Extraction, and Information Retrieval

Some examples

• Semantic grammars
• Information extraction

Page 12: Robust Semantics, Information Extraction, and Information Retrieval

Information Retrieval

• How related to NLP?
  – Operates on language (speech or text)
  – Does it use linguistic information?
    • Stemming
    • Bag-of-words approach
  – Does it make use of document formatting?
    • Headlines, punctuation, captions

• Collection: a set of documents
• Term: a word or phrase
• Query: a set of terms

Page 13: Robust Semantics, Information Extraction, and Information Retrieval

But…what is a term?

• Stop list
• Stemming
• Homonymy, polysemy, synonymy
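A toy sketch of the first two points, assuming a tiny stop list and crude suffix stripping in place of a real stemmer such as Porter's; homonymy, polysemy, and synonymy are left untouched, which is exactly the remaining problem.

    STOP_LIST = {"the", "a", "an", "of", "is", "to", "in", "and", "what", "i"}
    SUFFIXES = ("ing", "ed", "es", "s")

    def stem(word):
        """Crude suffix stripping as a stand-in for a real stemmer."""
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    def terms(text):
        tokens = [t.strip(".,!?") for t in text.lower().split()]
        return [stem(t) for t in tokens if t and t not in STOP_LIST]

    print(terms("I like soup. Soup is what I like."))  # ['like', 'soup', 'soup', 'like']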

Page 14: Robust Semantics, Information Extraction, and Information Retrieval

Vector Space Model

• Simple versions represent documents and queries as feature vectors, one binary feature for each term in collection

• Is t in this document or query or not?

    D = (t1, t2, …, tn)
    Q = (t1, t2, …, tn)

• Similarity metric: how many terms does a query share with each candidate document?
• Weighted terms: term-by-document matrix

    D = (wt1, wt2, …, wtn)
    Q = (wt1, wt2, …, wtn)
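A minimal sketch of the binary version, with a toy three-document collection assumed for the example: each document and query reduces to a set of terms, and the similarity score is simply how many terms they share.

    collection = {
        "d1": "I like soup",
        "d2": "soup is what I like",
        "d3": "the king died in March",
    }

    def term_set(text):
        return set(text.lower().split())

    def overlap(query, doc):
        """Number of terms the query shares with the document."""
        return len(term_set(query) & term_set(doc))

    query = "soup I like"
    for doc_id, text in collection.items():
        print(doc_id, overlap(query, text))
    # d1 and d2 share all three query terms; d3 shares none.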

Page 15: Robust Semantics, Information Extraction, and Information Retrieval

• How do we compare the vectors?
  – Normalize each term weight by the number of terms in the document: how important is each t in D?

– Compute dot product between vectors to see how similar they are

– Cosine of angle: 1 = identity; 0 = no common terms

• How do we get the weights?
  – Term frequency (tf): how often does t occur in D?
  – Inverse document frequency (idf): log of (# docs in collection / # docs term t occurs in)
  – tf·idf weighting: the weight of term i in doc j is the product of the frequency of i in j with the idf of i in the collection

    idf_i = log(N / n_i)

    w_i,j = tf_i,j × idf_i
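A minimal sketch of tf·idf weighting and cosine comparison following the formulas above, over an assumed toy collection; query terms outside the collection vocabulary are simply dropped in this version.

    import math
    from collections import Counter

    docs = {
        "d1": "i like soup".split(),
        "d2": "soup is what i like".split(),
        "d3": "the king died in march".split(),
    }

    N = len(docs)
    vocab = sorted({t for words in docs.values() for t in words})
    n = {t: sum(t in words for words in docs.values()) for t in vocab}  # document frequency
    idf = {t: math.log(N / n[t]) for t in vocab}                        # idf_i = log(N / n_i)

    def tfidf_vector(words):
        """w_i,j = tf_i,j * idf_i for every term in the vocabulary."""
        tf = Counter(words)
        return [tf[t] * idf[t] for t in vocab]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    q = tfidf_vector("what soup".split())
    for doc_id, words in docs.items():
        print(doc_id, round(cosine(q, tfidf_vector(words)), 3))
    # d2 scores highest; d3 shares no terms with the query, so its cosine is 0.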

Page 16: Robust Semantics, Information Extraction, and Information Retrieval

Evaluating IR Performance

• Precision: #rel docs returned/total #docs returned -- how often are you right when you say this document is relevant?

• Recall: #rel docs returned/#rel docs in collection -- how many of the relevant documents do you find?

• F-measure combines P and R: F = 2PR / (P + R)
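A worked example with made-up numbers: suppose the system returns 8 documents, 6 of which are relevant, and the collection holds 10 relevant documents in total.

    relevant_returned = 6
    total_returned = 8
    relevant_in_collection = 10

    precision = relevant_returned / total_returned             # 6/8  = 0.75
    recall = relevant_returned / relevant_in_collection        # 6/10 = 0.60
    f_measure = 2 * precision * recall / (precision + recall)  # ≈ 0.667

    print(precision, recall, round(f_measure, 3))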

Page 17: Robust Semantics, Information Extraction, and Information Retrieval

Improving Queries

• Relevance feedback: users rate retrieved docs
• Query expansion: many techniques

– e.g. add top N docs retrieved to query

• Term clustering: cluster rows of terms to produce synonyms and add to query
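A sketch of the “add top-N docs retrieved to query” heuristic, assuming some ranking function (for example the tf·idf/cosine scorer sketched earlier) has already produced ranked_docs; the function and parameter names are illustrative.

    from collections import Counter

    def expand_query(query_terms, ranked_docs, top_n=2, new_terms=3):
        """Add the most frequent unseen terms from the top-N retrieved docs."""
        counts = Counter()
        for words in ranked_docs[:top_n]:
            counts.update(t for t in words if t not in query_terms)
        return list(query_terms) + [t for t, _ in counts.most_common(new_terms)]

    ranked_docs = ["soup is what i like".split(), "i like hot soup".split()]
    print(expand_query(["soup"], ranked_docs))
    # ['soup', 'i', 'like', 'is']  (order among equally frequent terms may vary)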

Page 18: Robust Semantics, Information Extraction, and Information Retrieval

IR Tasks

• Ad hoc retrieval: ‘normal’ IR
• Routing/categorization: assign a new doc to one of a predefined set of categories
• Clustering: divide a collection into N clusters
• Segmentation: segment text into coherent chunks
• Summarization: compress a text by extracting summary items
• Question-answering: find a stretch of text containing the answer to a question

Page 19: Robust Semantics, Information Extraction, and Information Retrieval

Summary

• Many approaches to ‘robust’ semantic analysis
  – Semantic grammars targeting particular domains

Utterance --> Yes/No Reply

Yes/No Reply --> Yes-Reply | No-Reply

Yes-Reply --> {yes, yeah, right, ok, “you bet”, …}

– Information extraction techniques targeting specific tasks

• Extracting information about terrorist events from news

– Information retrieval techniques --> more like NLP