Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 11: Natural Language Processing and IR. Semantics and Semantically-rich representations
Alexander Gelbukh, www.Gelbukh.com

Mar 27, 2015


Alexa Weeks

Previous Lecture: Conclusions

Syntactic structure is one of the intermediate representations of a text used for its processing
Helps text understanding
  Thus reasoning, question answering, ...
Directly helps POS tagging
  Resolves lexical ambiguity of part of speech
  But not WSD-type ambiguities
A big science in itself, with 50 (2000?) years of history

Previous Lecture: Research topics

Faster algorithms
  E.g. parallel
Handling linguistic phenomena not handled by current approaches
Ambiguity resolution!
  Statistical methods
  A lot can be done

Contents

Semantic representations
  Semantic networks
  Conceptual graphs
Simpler representations
  Head-Modifier pairs
Tasks beyond IR
  Question Answering
  Summarization
  Information Extraction
  Cross-language IR

Syntactic representation

A sequence of syntactic trees.

[Figure: syntactic trees; node labels: BE, SCIENCE, IMPORTANT, COUNTRY, WE, of, PAY, GOVERNMENT, ATTENTION, IT, MUCH]

Semantic analysis

[Figure: the linguistic processor and its components: morphological analyzer, syntactic parser, semantic analyzer; the semantic analysis stage is labeled]

Semantic representation

Complex structure of whole text

[Figure: semantic network with nodes SCIENCE, IMPORTANT, COUNTRY, WE, GOVERNMENT, ATTENTION and arcs labeled "is", "of", "gives", "for", "of", "for"]

Semantic representation

Expresses the (direct) meaning of the text
  Not what is implied
Free of the means of communication
  Morphological cases (transformed to semantic links)
  Word order, passive/active
  Sentences and paragraphs
  Pronouns (resolved)
Free of the means of expression
  Synonyms (reduced to a common ID)
  Lexical functions

Lexical Functions

The same meaning expressed by different words
The choice of the word is a function of other words
Few standard meanings
Example: Magn = “much”, “very”
  Strong wind, tea, desire
  Thick soup
  High temperature, potential, sea; highly expensive
  Hard work; hardcore porno
  Deep understanding, knowledge, appreciation

...Lexical Functions

“give”
  pay attention
  provide help
  adjudge a prize
  yield the floor
  confer a degree
  deliver a lecture
“get”
  attract attention
  obtain help
  receive a degree
  attend a lecture

...Syntagmatic lexical functions

In the semantic representation, such words are replaced by the function name:
  Magn wind, tea, desire
  Magn soup
  Magn temperature, potential, sea; Magn expensive
  Magn work; Magn porno
  Magn understanding, knowledge, appreciation
In different languages, different words are used...
  Russian: dense soup; Spanish: loaded tea, lend attention
...but the same function names.
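A minimal sketch of the replacement described above, assuming the collocations are stored in a hand-made lookup table. The table layout and the name `normalize` are hypothetical; the entries are just the English examples from the slides.

```python
# Hypothetical collocation table: (intensifier, keyword) pairs that express Magn.
# Only the English examples listed in the lecture are included.
MAGN_COLLOCATIONS = {
    ("strong", "wind"), ("strong", "tea"), ("strong", "desire"),
    ("thick", "soup"),
    ("high", "temperature"), ("high", "potential"), ("high", "sea"),
    ("hard", "work"),
    ("deep", "understanding"), ("deep", "knowledge"), ("deep", "appreciation"),
}


def normalize(modifier: str, keyword: str) -> tuple:
    """Replace a language-specific intensifier by the function name Magn."""
    pair = (modifier.lower(), keyword.lower())
    if pair in MAGN_COLLOCATIONS:
        return ("Magn", pair[1])
    return pair  # not a known Magn collocation: keep the words as they are


print(normalize("Strong", "tea"))   # Magn applied to "tea"
print(normalize("thick", "soup"))   # Magn applied to "soup"
print(normalize("strong", "soup"))  # not in the table, left unchanged
```

With one such table per language, "thick soup" and the Russian "dense soup" would both reduce to the same pair, which is what makes the representation comparable across wordings.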

Example: Translation

[Figure: translation between Language A and Language B at the text level, morphological level, syntactic level, and semantic level; at the top, "The Meaning, yet unreachable", marked with "?"]

...Paradigmatic lexical functions

Used for synonymic rephrasing
Need to reduce the meaning to a standard form
Example: Syn, hyponyms, hypernyms
  W → Syn(W)
  complex apparatus → complex mechanism
Example: Conv31, Conv24, ...
  A V B C → C Conv31(V) B A
  John sold the book to Mary for $5
  Mary bought the book from John for $5
  The book cost Mary $5
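The Conv31 rewriting can be sketched as a permutation of numbered participants plus a verb lookup. This is an illustration under assumptions: the `Situation` type, the one-entry lexicon, and the choice of "sell" as the standard form are all hypothetical; only the sell/buy pair comes from the slide.

```python
from typing import NamedTuple


class Situation(NamedTuple):
    verb: str
    actants: tuple  # participants in valency order: (A, B, C, ...)


# Hypothetical mini-lexicon: Conv31(sell) = buy, as in the slide's example.
CONV31 = {"sell": "buy"}


def conv31(s: Situation) -> Situation:
    """A V B C -> C Conv31(V) B A: swap participants 1 and 3, replace the verb."""
    a, b, c, *rest = s.actants
    return Situation(CONV31[s.verb], (c, b, a, *rest))


def to_standard(s: Situation) -> Situation:
    """Reduce to a standard form by rewriting every 'buy' as the matching 'sell'."""
    inverse = {v: k for k, v in CONV31.items()}
    if s.verb in inverse:
        a, b, c, *rest = s.actants
        return Situation(inverse[s.verb], (c, b, a, *rest))
    return s


sold = Situation("sell", ("John", "book", "Mary", "$5"))
bought = conv31(sold)
print(bought)               # the same event seen from Mary's side
print(to_standard(bought))  # back to the 'sell' form
```

After reduction to the standard form, the two sentences about John and Mary become identical structures, so a query phrased one way can match a document phrased the other way.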

Semantic network

Representation of the text as a directed graph
Nodes are situations and entities
Edges are participation of an entity in a situation
  Also a situation in a situation: begin reading a book, John died yesterday
A situation can be expressed with a noun:
  Professor delivered a lecture to students
  Professor “*lectured” to students
  Lecture on history, memorial to heroes
A node can participate in many situations!
  No division into sentences

Situations

Situations with different participants are different situations
  John reads a book and Mary reads a newspaper. He asks her whether the newspaper is interesting.
  Here two different situations of reading!
  But the same entities: John, Mary, newspaper, participating in different situations
Tense and number are described as situations
  John reads a book:
  Now (reading (John, book)) & quantity (book, one)

Semantic valencies

A situation has only a few participants (up to ~5)
Their meaning is usually very general
They are usually “naturally” ordered:
  Who (agent)
  What (patient, object)
  To whom (receiver)
  With what (instrument, ...)
  John sold the book to Mary for $5
So, in the network the outgoing arcs of a node are numbered
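One possible in-memory form of such a network, given here as a sketch rather than the lecture's actual data structure: each situation node keeps its outgoing arcs in a dictionary keyed by valency number. The class name, node identifiers, and method names are hypothetical.

```python
from collections import defaultdict


class SemanticNetwork:
    def __init__(self):
        self.labels = {}                # situation node -> predicate name
        self.arcs = defaultdict(dict)   # situation node -> {valency number: participant}

    def add_situation(self, node_id, predicate, *participants):
        """Add a situation; its outgoing arcs are numbered 1, 2, 3, ..."""
        self.labels[node_id] = predicate
        for number, participant in enumerate(participants, start=1):
            self.arcs[node_id][number] = participant
        return node_id

    def situations_of(self, node):
        """All situations in which the given node participates."""
        return sorted(s for s, args in self.arcs.items() if node in args.values())


net = SemanticNetwork()
sell = net.add_situation("s1", "sell", "John", "book", "Mary", "$5")
net.add_situation("s2", "Now", sell)                # tense: a situation over a situation
net.add_situation("s3", "quantity", "book", "one")  # number, also a situation

print(net.arcs["s1"][3])          # participant number 3 of the selling
print(net.situations_of("book"))  # the book takes part in two situations
```

Because the participant slots are numbered, "John sold the book to Mary" and "Mary sold the book to John" produce different networks even though they contain the same words.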

Semantic representation

Complex structure of whole text

[Figure: semantic network with numbered arcs; entity nodes SCIENCE, IMPORTANT, COUNTRY, WE, GOVERNMENT, ATTENTION; situation nodes Give, Possess, Quantity, and three Now nodes; arcs numbered 1 and 2]

Reasoning and common-sense info

One can reason on the network
  If John sold a book, he does not have it
For this, additional knowledge is needed!
A huge amount of knowledge is needed to reason
  A 9-year-old child knows some 10,000,000 simple facts
  Probably some of them can be inferred, but not (yet) automatically
  There were attempts to compile such knowledge manually
  There is a hope to compile it automatically...
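To make the point concrete, here is a toy forward-chaining loop with a single hand-written rule, the one from the slide. The rule format and predicate names are assumptions for illustration; a realistic system would need a very large rule base, which is exactly the difficulty noted above.

```python
# Each rule: (premise predicate, function that builds conclusions from its arguments).
# The only rule encoded is the slide's example: whoever sold a thing no longer has it.
RULES = [
    ("sell", lambda seller, thing, buyer: [("not_have", seller, thing)]),
]


def infer(facts):
    """Apply the rules until no new fact can be added."""
    known = set(facts)
    frontier = list(known)
    while frontier:
        predicate, *args = frontier.pop()
        for premise, conclude in RULES:
            if predicate == premise:
                for new_fact in conclude(*args):
                    if new_fact not in known:
                        known.add(new_fact)
                        frontier.append(new_fact)
    return known


facts = infer({("sell", "John", "book", "Mary")})
print(("not_have", "John", "book") in facts)
```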

Semantic representation

... and common-sense knowledge

[Figure: semantic network with nodes SCIENCE, IMPORTANT, COUNTRY, WE, GOVERNMENT, ATTENTION and arcs "is", "of", "gives", "for", extended with common-sense nodes Funding, Organization, Sector, Money and arcs "is a main form", "needs", "is a", "gives", "implies"]

Computer representation

Logical predicates
  Arcs are arguments
In AI, allows reasoning
In IR, can allow comparison even without reasoning

Conceptual Graphs

• A CG is a bipartite graph.
  Concept nodes represent entities, attributes, or events (actions).
  Relation nodes denote the kinds of relationships between the concept nodes.

[John](agnt)[love](ptnt)[Mary]
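A small sketch of a CG as a data structure, assuming that each relation node links exactly two concept nodes and points from the action to its participant. The class names, the `edges` helper, and the direction convention are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    label: str


@dataclass(frozen=True)
class Relation:
    label: str
    source: Concept
    target: Concept


@dataclass
class ConceptualGraph:
    concepts: set = field(default_factory=set)
    relations: set = field(default_factory=set)

    def relate(self, label, source, target):
        """Add a relation node together with the two concept nodes it connects."""
        self.concepts.update((source, target))
        self.relations.add(Relation(label, source, target))

    def edges(self):
        """Every edge joins a relation node and a concept node (bipartite graph)."""
        return [(r, c) for r in self.relations for c in (r.source, r.target)]


john, love, mary = Concept("John"), Concept("love"), Concept("Mary")
g = ConceptualGraph()
g.relate("agnt", love, john)   # John is the agent of love
g.relate("ptnt", love, mary)   # Mary is the patient of love

print(len(g.concepts), len(g.relations), len(g.edges()))
```

There are no concept-to-concept edges: two concepts are always connected through a relation node, which is what makes the graph bipartite.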

[Figure: a larger conceptual graph; concept nodes include program:{*}, analyze, logically, criteria, provide, use, Invariant:{*}, Implication:{*}, examine, approach, diagnosis, automatic, correction, error, logical; relation nodes include pnt, ptn, mnr, attr, for, of]

Use in IR

Restrict the search to specific situations
  Where John loves Mary, but not vice versa
or
Soften the comparison
  Approximate search
  Look for John loves Mary, get someone loves Mary
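Both uses can be sketched with situations stored as (agent, action, patient) triples. The triple format, the toy document collection, and the use of `None` as a wildcard slot are assumptions made for this illustration.

```python
# Hypothetical collection: one situation per document.
docs = {
    "d1": ("John", "love", "Mary"),
    "d2": ("Mary", "love", "John"),
    "d3": ("Peter", "love", "Mary"),
}


def matches(query, fact):
    """A query slot set to None matches anything (softened comparison)."""
    return all(q is None or q == f for q, f in zip(query, fact))


def search(query):
    return sorted(d for d, fact in docs.items() if matches(query, fact))


print(search(("John", "love", "Mary")))  # restricted: John loves Mary, not vice versa
print(search((None, "love", "Mary")))    # softened: someone loves Mary
```

A plain keyword query for "John", "love", "Mary" could not tell d1 from d2, since both contain the same three words.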

Obtaining from text

• “Algebraic formulation of flow diagrams”

• Algebraic|JJ formulation|NN of|IN flow|NN diagrams|NNS

• [[np, [n, [formulation, sg]], [adj, [algebraic]], [of, [np, [n, [diagram, pl]], [n_pos, [np, [n, [flow, sg]]]]]]]]

• [algebraically](manr)[formulate](ptn)[flow-diagram]

[Figure: TEXTS → Tagging → Parsing → Graph Generation → CGs]

Steps of comparison

• Determine the common elements (overlap) between the two graphs.
  Based on the CG theory
  Compatible common generalizations
• Measure their similarity.
  The similarity must be proportional to the size of their overlap.

An overlap

• Given two conceptual graphs G1 and G2, the set of their common generalizations O = {g1, g2,...,gn} is an overlap if:
  all common generalizations gi are compatible, and
  the set O is maximal.

An example of overlap

G1: [candidate:Gore](Agnt)[criticize](Ptnt)[candidate:Bush]
G2: [candidate:Bush](Agnt)[criticize](Ptnt)[candidate:Gore]

Two overlaps of G1 and G2, shown in panels (a) and (b) of the figure:
O1: [candidate](Agnt)[criticize](Ptnt)[candidate]
O2: [candidate:Gore] [criticize] [candidate:Bush]
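The relational overlap O1 can be reproduced with a deliberately simplified generalization step. This sketch assumes concepts are (type, referent) pairs, uses no type hierarchy, and does not check compatibility or maximality, so it is not the full procedure from CG theory; all names are hypothetical.

```python
def generalize(c1, c2):
    """Common generalization of two concepts given as (type, referent) pairs."""
    if c1[0] != c2[0]:
        return None                            # different types: nothing in common here
    return c1 if c1 == c2 else (c1[0], None)   # same type: keep it, drop the referent


def common_relations(g1, g2):
    """g1, g2: sets of (relation, head concept, dependent concept) triples."""
    result = set()
    for r1, h1, d1 in g1:
        for r2, h2, d2 in g2:
            h, d = generalize(h1, h2), generalize(d1, d2)
            if r1 == r2 and h is not None and d is not None:
                result.add((r1, h, d))
    return result


criticize = ("criticize", None)
gore, bush = ("candidate", "Gore"), ("candidate", "Bush")
G1 = {("agnt", criticize, gore), ("ptnt", criticize, bush)}
G2 = {("agnt", criticize, bush), ("ptnt", criticize, gore)}

for triple in sorted(common_relations(G1, G2)):
    print(triple)   # a candidate criticizes a candidate, names generalized away
```

The other overlap, O2, keeps the names Gore and Bush but has to give up both relations, because in G1 and G2 the two candidates play opposite roles.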

Similarity measure

• Conceptual similarity: indicates the amount of information contained in common concepts of G1 and G2.

Do they mention similar concepts?

• Relational similarity: indicates how similar the contexts of the common concepts in both graphs are.

Do they mention similar things about the common concepts?

Conceptual similarity

• Analogous to the Dice coefficient.

• Considers different weights for the different kinds of concepts.

• Considers the level of generalization of the common concepts (of the overlap).

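$$ s_c(G_1, G_2) = \frac{2 \sum_{c \in O} weight(c) \cdot \beta(\pi_{G_1} c, \pi_{G_2} c)}{\sum_{c \in G_1} weight(c) + \sum_{c \in G_2} weight(c)} $$

Here $\pi_{G_i} c$ is the concept of $G_i$ generalized by the overlap concept $c$, and $\beta$ reflects the level of generalization.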
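A minimal sketch of this Dice-like measure: twice the weighted sum over the overlap concepts, each discounted by a generalization factor, divided by the total concept weight of both graphs. The function name, the E/V/A kind tags, and the factor 0.8 for a concept that lost its referent are assumptions; 0.8 is chosen only because it is consistent with the values 0.86 and 0.84 reported in the example table later in the lecture.

```python
def conceptual_similarity(g1, g2, overlap, weight):
    """g1, g2: lists of concept kinds ("E" entity, "V" action, "A" attribute);
    overlap: list of (kind, generalization factor) pairs;
    weight: dict mapping a kind to its weight."""
    numerator = 2 * sum(weight[kind] * beta for kind, beta in overlap)
    denominator = sum(weight[k] for k in g1) + sum(weight[k] for k in g2)
    return numerator / denominator


# Gore criticizes Bush vs. Bush criticizes Gore: entity, action, entity in each graph.
G1 = G2 = ["E", "V", "E"]
# Overlap [candidate](agt)[criticize](pnt)[candidate]: both names were generalized
# away, so the entities get the assumed factor 0.8; the action is kept as is.
O1 = [("E", 0.8), ("V", 1.0), ("E", 0.8)]

s_equal = conceptual_similarity(G1, G2, O1, {"E": 1, "V": 1, "A": 1})
s_entities = conceptual_similarity(G1, G2, O1, {"E": 2, "V": 1, "A": 1})
print(s_equal, s_entities)
```

Raising the entity weight lowers the score here, because the entities are exactly the concepts that had to be generalized.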

Relational Similarity

• Analogous to the Dice coefficient.

• Considers just the neighbors of the common concepts.

• Considers different weights for the different kinds of conceptual relations.

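$$ s_r(G_1, G_2) = \frac{2 \sum_{r \in O} weight_O(r)}{\sum_{r \in N_O(G_1)} weight_{G_1}(r) + \sum_{r \in N_O(G_2)} weight_{G_2}(r)} $$

Here $N_O(G_i)$ is the set of relations of $G_i$ that are neighbors of the common concepts.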
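A sketch of the same Dice-like idea for relations: twice the weight of the relations kept in the overlap, divided by the weight of the relations that surround the common concepts in each graph. As a simplification, a single weight table is used for the overlap and for both graphs; the function and argument names are hypothetical.

```python
def relational_similarity(overlap_relations, neighbors_g1, neighbors_g2, weight):
    """Each argument is a list of relation types; weight maps a type to its weight."""
    numerator = 2 * sum(weight[r] for r in overlap_relations)
    denominator = (sum(weight[r] for r in neighbors_g1)
                   + sum(weight[r] for r in neighbors_g2))
    return numerator / denominator if denominator else 0.0


weight = {"agnt": 1, "ptnt": 1}
neighbors = ["agnt", "ptnt"]   # relations around the common concepts, same in G1 and G2

# Overlap that keeps both relations (candidate criticizes candidate).
print(relational_similarity(["agnt", "ptnt"], neighbors, neighbors, weight))
# Overlap that keeps the names but no relations.
print(relational_similarity([], neighbors, neighbors, weight))
```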

Similarity Measure

$$ s = s_c \cdot (a + b \cdot s_r) $$

• Combines the conceptual and relational similarities.
• Multiplicative combination: a similarity roughly proportional to each of the two components.
• Relational similarity has secondary importance: even if no common relations exist, the pieces of knowledge are still similar to some degree.
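The combination is a one-line computation. The sketch below takes s = s_c (a + b s_r), which is the form consistent with the values in the example table later in the lecture, and feeds it the (sc, sr) pairs from that table; the function name is hypothetical.

```python
def similarity(s_c, s_r, a, b):
    """Combined similarity: conceptual part scaled by (a + b * relational part)."""
    return s_c * (a + b * s_r)


# (s_c, s_r) pairs as reported for the two overlaps of the Gore/Bush example.
generalized = (0.86, 1.0)   # [candidate](agt)[criticize](pnt)[candidate]
named = (1.00, 0.0)         # [candidate:Bush] [criticize] [candidate:Gore]

for a, b in [(0.1, 0.9), (0.9, 0.1)]:
    print(a, b, similarity(*generalized, a, b), similarity(*named, a, b))
```

With s_r = 0 the result is a times s_c rather than zero, which is the "still similar to some degree" behavior: a sets the floor that remains when no relations are shared.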

Flexibility of the comparison

• Configurable by the user.
  Use different concept hierarchies.
  Designate the importance of the different kinds of concepts.
  Manipulate the importance of the conceptual and relational similarities.

Conditions     Effect
a > b          Focus on the conceptual similarities
b > a          Focus on the structural similarities
wE > wV, wA    Focus on the similarities among entities
wV > wE, wA    Focus on the similarities among actions
wA > wE, wV    Focus on the similarities among attributes

Example of the flexibility

Gore criticizes Bush vs. Bush criticizes Gore

Conditions                              Overlap                                           sc     sr   s
a = 0.1, b = 0.9, wE = wV = wA = 1      [candidate] (agt) [criticize] (pnt) [candidate]   0.86   1    0.86
                                        [candidate:Bush] [criticize] [candidate:Gore]     1.00   0    0.10
a = 0.9, b = 0.1, wE = wV = wA = 1      [candidate] (agt) [criticize] (pnt) [candidate]   0.86   1    0.86
                                        [candidate:Bush] [criticize] [candidate:Gore]     1.00   0    0.90
a = 0.5, b = 0.5, wE = 2, wV = wA = 1   [candidate] (agt) [criticize] (pnt) [candidate]   0.84   1    0.84
                                        [candidate:Bush] [criticize] [candidate:Gore]     1.00   0    0.50

An Experiment

• Use the collection CACM-3204 (articles of computer science).
  We built the conceptual graphs from the document titles.

Query: Description of a fast procedure for solving a system of linear equations.

[Figure: conceptual graph of the query, with concepts [Describe], [procedure], [solve], [system], [fast], [equation], [linear] and relations (obj), (for), (attr), (of)]

The results

• Focus on the structural similarity, basically on the one caused by the entities and attributes. (a = 0.3, b = 0.7, wE = wA = 10, wV = 1)
• One of the best matches: Description of a fast algorithm for copying list structures.

[Figure: conceptual graph of the match, with concepts [Describe], [fast], [algorithm], [copy], [list-structure] and relations (obj), (attr), (for), (obj)]

The results (2)

• Focus on the structural similarity, basically on the one caused by the entities and actions. (a = 0.3, b = 0.7, wE = wV = 10, wA = 1)
• One of the best matches: Solution of an overdetermined system of equations in the L1 norm.

[Figure: conceptual graph of the match, with concepts [Solve], [system], [overdetermined], [equation], [l1-norm] and relations (obj), (attr), (of), (in)]

Advantages of CGs

• Well-known strategies for text comparison (Dice coefficient) with new characteristics derived from the CGs structure.

• The similarity is a combination of two sources of similarity: the conceptual similarity and the relational similarity.

• Appropriate to compare small pieces of knowledge (other methods based on topical statistics do not work).

• Two interesting characteristics: uses domain knowledge and allows a direct influence of the user.

  Analyzes the similarity between two CGs from different points of view.
  Selects the best interpretation in accordance with the user's interests.

Simpler representations

Head-Modifier pairs
  John sold Mary an interesting book for a very low price
  John sold, sold Mary, sold book, sold for price, interesting book, low price
  A paper in CICLing-2004
Restrict your semantic representation to only two words
Shallow syntax
Semantics improves this representation
  Standard form: Mary bought → John sold, etc.
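A sketch of how the pairs listed above could be produced, assuming a dependency analysis of the sentence is already available. The triples below are written by hand (a real system would take them from a shallow parser), the relation names are hypothetical, and no pair is generated for "very" because the slide's list has none.

```python
# Hand-written (head, relation, modifier) triples for:
# "John sold Mary an interesting book for a very low price"
DEPENDENCIES = [
    ("sold", "subj", "John"),
    ("sold", "obj", "Mary"),
    ("sold", "obj", "book"),
    ("sold", "for", "price"),
    ("book", "attr", "interesting"),
    ("price", "attr", "low"),
]


def head_modifier_pairs(dependencies):
    pairs = []
    for head, relation, modifier in dependencies:
        if relation in ("subj", "attr"):
            pairs.append(f"{modifier} {head}")             # modifier precedes the head
        elif relation == "obj":
            pairs.append(f"{head} {modifier}")
        else:
            pairs.append(f"{head} {relation} {modifier}")  # keep the preposition
    return pairs


print(head_modifier_pairs(DEPENDENCIES))
```

Each pair can then be indexed like an ordinary term, so standard IR machinery applies while some of the syntactic structure is preserved.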

Tasks beyond IR: Question Answering

User information need
  An answer to a question
  Not a bunch of docs
Who won the Nobel Peace Prize in 1992? (35500 docs)

...QA

Answer: Rigoberta Menchú Tum
Logical methods:
  “Understand” the text
  Reason on it
  Construct the answer
  Generate the text expressing it
Statistical methods (no or little semantics)
  Look what word is repeated in the docs
  Perhaps try to understand something around it

...Better QA

What if the info is not in a single document?
Who is the queen of Spain?
  King of Spain is Juan Carlos
  Wife of Juan Carlos is Sofía
  (Wife of a king is a queen)
Logical reasoning may prove useful
In practice, the degree of “understanding” is not yet enough
We are working to improve it
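The queen-of-Spain example can be sketched as a two-step lookup over facts pooled from different documents. The fact format and the function name are assumptions; the single rule encoded is the one given in parentheses on the slide.

```python
# Facts extracted from two different documents (hypothetical extraction results).
doc1_facts = {("king_of", "Spain"): "Juan Carlos"}
doc2_facts = {("wife_of", "Juan Carlos"): "Sofía"}
facts = {**doc1_facts, **doc2_facts}


def queen_of(country, facts):
    """Rule from the slide: the wife of a king is a queen."""
    king = facts.get(("king_of", country))
    if king is None:
        return None
    return facts.get(("wife_of", king))


print(queen_of("Spain", facts))        # answer assembled from both documents
print(queen_of("Spain", doc1_facts))   # a single document is not enough
```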

Tasks beyond IR: Passage Extraction

If the answer is long: a story
  What do you know about wars between England and France?
Or if we cannot detect the simple answer
Then find short pieces of the text where the answer is
Can be done even with keywords:
  Find passages with many keywords
  (Kang et al. 2004): Choose passages with greatest vector similarity.
  Too short: few keywords, too long: normalized
  Awful quality
Reasoning can help
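A keyword-only baseline of the kind described above can be sketched with plain cosine similarity over raw term counts. This is a stand-in for illustration, not the method of the cited paper; the tokenizer, the function names, and the sample passages are assumptions.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_passage(query, passages):
    """Pick the passage whose term vector is closest to the query vector."""
    q = Counter(tokens(query))
    return max(passages, key=lambda p: cosine(q, Counter(tokens(p))))


passages = [
    "The Hundred Years War was fought between England and France.",
    "France is famous for its wine.",
    "England has many old universities.",
]
print(best_passage("wars between England and France", passages))
```

The length normalization in the cosine is what penalizes long passages, while very short ones simply contain few keywords. The matching is purely lexical: "wars" and "War" do not even count as the same term here, which hints at why the quality is poor.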

Tasks beyond IR: Summarization

And what if the answer is not in a short passage?
Summarize: say the same (without unimportant details) but in fewer words
Now: statistical methods
Reasoning can help

Tasks beyond IR: Information Extraction

Question answering on a massive basis
Fill a database with the answers
Example: what company bought what company and when?
  A database of three columns
Now: (statistical) patterns
Reasoning can help
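A pattern-based extractor for the three-column example can be sketched with one regular expression. The pattern, the sample text, and the company names are made up for illustration; real systems use many patterns, often learned statistically.

```python
import re

# One hand-written pattern: "<Buyer> bought|acquired <Company> in <year>".
PATTERN = re.compile(
    r"(?P<buyer>[A-Z]\w+) (?:bought|acquired) (?P<bought>[A-Z]\w+) in (?P<year>\d{4})"
)


def extract(text):
    """Return rows (buyer, bought company, year) for the three-column table."""
    return [(m["buyer"], m["bought"], int(m["year"])) for m in PATTERN.finditer(text)]


# Hypothetical input text with invented company names.
text = "Acme bought Globex in 1999. Later, Initech acquired Acme in 2003."
for row in extract(text):
    print(row)
```

A sentence such as "Globex was bought by Acme" would be missed by this pattern, which is where a semantic representation that abstracts from word order and voice would help.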

Cross-lingual IR

Question in one language, answer in another language
Or: question and summary of the answer in English, over a database in Chinese
Is a kind of translation, but simpler
  Thus can be done more reliably
A transformation into a semantic network can greatly help

Research topics

Recognition of the semantic structure
  Convert text to conceptual graphs
  All kinds of disambiguation
Shallow semantic representations
Application of semantic representations to specific tasks
Similarity measures on semantic representations
Reasoning and IR

Conclusions

Semantic representation gives meaning
Language-specific constructions used only in the process of communication are removed
Network of entities / situations and predicates
Allows for translation and logical reasoning
Can improve IR:
  Compare the query with the doc by meaning, not words
  Search for a specific situation
  Search for an approximate situation
  QA, summarization, IE
  Cross-lingual IR

Thank you!

Till June 15? 6 pm
Thesis presentation? Oral test?