Natural Language Processing, Lecture 1, Sudeshna Sarkar, 26 July 2007

Natural Language Processing

Feb 10, 2016

tanuja munde

Page 1: Natural Language Processing

04/22/23 1

Natural Language Processing

Lecture 1
Sudeshna Sarkar

26 July 2007

Page 2: Natural Language Processing

Notes adapted from Martin’s NLP slides

Page 3: Natural Language Processing

Text Books
Daniel Jurafsky and James H. Martin, "Speech and Language Processing", Prentice Hall, 2000.

Other References
James Allen, "Natural Language Understanding", Second edition, Pearson.
Christopher D. Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", The MIT Press, 1999.

Page 4: Natural Language Processing

Final Project

This will be a research-oriented project. The goal is to have a paper suitable for a conference submission.

These will preferably be done in groups.

Page 5: Natural Language Processing

Natural Language Processing

What is it?
We’re going to study what goes into getting computers to perform useful and interesting tasks involving human languages.

We will be secondarily concerned with the insights that such computational work gives us into human processing of language.

Page 6: Natural Language Processing

Why Should You Care?

Two trends
1. An enormous amount of knowledge is now available in machine-readable form as natural language text.
2. Conversational agents are becoming an important form of human-computer communication.

Page 7: Natural Language Processing

Major Topics
Words
Syntax
Meaning
Dialog and Discourse
Applications

Page 8: Natural Language Processing

Applications
First, what makes an application a language processing application (as opposed to any other piece of software)?
  An application that requires the use of knowledge about human languages.
Example: Is Unix wc (word count) a language processing application?

Page 9: Natural Language Processing

Applications

Word count?
When it counts words: Yes
  To count words you need to know what a word is. That’s knowledge of language.
When it counts lines and bytes: No
  Lines and bytes are computer artifacts, not linguistic entities.
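The contrast can be made concrete with a small sketch. The function names, the sample sentence, and the English-only token pattern below are assumptions made for illustration; the first function only approximates what Unix wc reports (whitespace-delimited words), and the second adds a tiny piece of knowledge about what an English word looks like.

```python
import re

def wc(text):
    """Approximate Unix wc: (lines, whitespace-delimited words, bytes)."""
    lines = text.count("\n")
    words = len(text.split())
    size = len(text.encode("utf-8"))
    return lines, words, size

def linguistic_word_count(text):
    """Count word tokens using a hypothetical, English-only rule:
    runs of letters/digits, optionally joined by an apostrophe or hyphen."""
    return len(re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text))

sample = "I'm sorry Dave -- I can't do that.\n"
print(wc(sample))                     # lines, words, bytes
print(linguistic_word_count(sample))  # the bare "--" is no longer counted as a word
```

Lines and bytes need no linguistic knowledge at all, while the two word counts already disagree on the stray "--", which is exactly the point of the slide.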

Page 10: Natural Language Processing

Big Applications
Question answering
Conversational agents
Summarization
Machine translation

Page 11: Natural Language Processing

Big Applications
These kinds of applications require a tremendous amount of knowledge of language.
Consider the following interaction with HAL, the computer from 2001: A Space Odyssey.

Page 12: Natural Language Processing

HAL

Dave: Open the pod bay doors, Hal.
HAL: I’m sorry Dave, I’m afraid I can’t do that.

Page 13: Natural Language Processing

What’s needed?

Speech recognition and synthesis
Knowledge of the English words involved
  What they mean
  How they combine (bay vs. pod bay)
How groups of words clump
  What the clumps mean

Page 14: Natural Language Processing

What’s needed?
Dialog
  It is polite to respond, even if you’re planning to kill someone.
  It is polite to pretend to want to be cooperative (I’m afraid I can’t…)

Page 15: Natural Language Processing

Real Example
What is the Fed’s current position on interest rates?

What or who is the “Fed”?
What does it mean for it to have a position?
How does “current” modify that?

Page 16: Natural Language Processing

Caveat

NLP has an AI aspect to it.
  We’re often dealing with ill-defined problems.
  We don’t often come up with perfect solutions/algorithms.
  We can’t let either of those facts get in our way.

Page 17: Natural Language Processing

Preparation

Basic algorithm and data structure analysis
Ability to program
Some exposure to logic
Exposure to basic concepts in probability
Familiarity with linguistics, psychology, and philosophy
Ability to write well in English

Page 18: Natural Language Processing

Topics: Linguistics

Word-level processing
Syntactic processing
Lexical and compositional semantics
Discourse and dialog processing

Page 19: Natural Language Processing

Topics: Techniques
Finite-state methods
Context-free methods
Augmented grammars
  Unification
  Logic
Probabilistic versions
Supervised machine learning

Page 20: Natural Language Processing

Topics: Applications
Small
  Spelling correction
Medium
  Word-sense disambiguation
  Named entity recognition
  Information retrieval
Large
  Question answering
  Conversational agents
  Machine translation

Page 21: Natural Language Processing

Commercial World
Lots of exciting stuff going on…
Some samples…
  Machine translation
  Question answering
  Buzz analysis

Page 22: Natural Language Processing

Google/Arabic

Page 23: Natural Language Processing

Google/Arabic Translation

Page 24: Natural Language Processing

Web Q/A

Page 25: Natural Language Processing

Summarization
Current web-based Q/A is limited to returning simple fact-like (factoid) answers (names, dates, places, etc.).
Multi-document summarization can be used to address more complex kinds of questions.
Circa 2002: What’s going on with the Hubble?

Page 26: Natural Language Processing

NewsBlaster Example
The U.S. orbiter Columbia has touched down at the Kennedy Space Center after an 11-day mission to upgrade the Hubble observatory. The astronauts on Columbia gave the space telescope new solar wings, a better central power unit and the most advanced optical camera. The astronauts added an experimental refrigeration system that will revive a disabled infrared camera. “Unbelievable that we got everything we set out to do accomplished,” shuttle commander Scott Altman said. Hubble is scheduled for one more servicing mission in 2004.

Page 27: Natural Language Processing

Weblog Analytics
Text mining weblogs, discussion forums, user groups, and other forms of user-generated media.
  Product marketing information
  Political opinion tracking
  Social network analysis
  Buzz analysis (what’s hot, what topics people are talking about right now)

Page 28: Natural Language Processing

Web Analytics

Page 29: Natural Language Processing

Umbria

Page 30: Natural Language Processing

Forms of Natural Language
The input/output of an NLP system can be:
  written text: newspaper articles, letters, manuals, prose, …
  speech: read speech (radio, TV, dictations), conversational speech, commands, …
To process written text, we need:
  lexical, syntactic, and semantic knowledge about the language
  discourse information, real-world knowledge
To process spoken language, we additionally need:
  speech recognition
  speech synthesis

Page 31: Natural Language Processing

Components of NLP
Natural Language Understanding
  Mapping the given input in the natural language into a useful representation.
  Different levels of analysis required: morphological analysis, syntactic analysis, semantic analysis, discourse analysis, …
Natural Language Generation
  Producing output in the natural language from some internal representation.
  Different levels of synthesis required: deep planning (what to say), syntactic generation
Which is harder?

Page 32: Natural Language Processing

Natural language understanding
Uncovering the mappings between the linear sequence of words (or phonemes) and the meaning that it encodes.
Representing this meaning in a useful (usually symbolic) representation.
By definition, heavily dependent on the target task:
  Words and structures mean different things in different contexts.
  The required target representation is different for different tasks.

Why is NLU hard?
The mapping between words, their linguistic structure, and the meaning that they encode is extremely complex and difficult to model and decompose.
Natural language is very ambiguous.
The goal of understanding is itself task dependent and very complex.

Page 33: Natural Language Processing

Why is NL Understanding hard?
Natural language is extremely rich in form and structure, and very ambiguous.
  How to represent meaning
  Which structures map to which meaning structures
Ambiguity: one input can mean many different things.
  Lexical (word-level) ambiguity -- different meanings of words
  Syntactic ambiguity -- different ways to parse the sentence
  Interpreting partial information -- how to interpret pronouns
  Contextual information -- the context of a sentence may affect the meaning of that sentence
Many inputs can mean the same thing.
Interaction among components of the input.
Noisy input (e.g. speech)

Page 34: Natural Language Processing

Knowledge of Language
Phonology – concerns how words are related to the sounds that realize them.

Morphology – concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language.

Syntax – concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of other phrases.

Semantics – concerns what words mean and how these meanings combine in sentences to form sentence meaning. The study of context-independent meaning.
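To make the idea of a morpheme more tangible, here is a minimal sketch of naive affix stripping. The affix lists, the minimum-stem-length rule, and the function name are all assumptions chosen for illustration; a real morphological analyzer would use a lexicon and spelling rules rather than blind string matching.

```python
# Hypothetical, tiny affix inventories (English only).
PREFIXES = ("un", "re")
SUFFIXES = ("ness", "ing", "ed", "s")

def segment(word):
    """Split a word into [prefix-, stem, -suffix] by naive affix stripping."""
    morphemes = []
    stem = word
    for prefix in PREFIXES:
        # Require at least three characters to remain after stripping.
        if stem.startswith(prefix) and len(stem) > len(prefix) + 2:
            morphemes.append(prefix + "-")
            stem = stem[len(prefix):]
            break
    suffix = None
    for candidate in SUFFIXES:
        if stem.endswith(candidate) and len(stem) > len(candidate) + 2:
            suffix = "-" + candidate
            stem = stem[:-len(candidate)]
            break
    morphemes.append(stem)
    if suffix:
        morphemes.append(suffix)
    return morphemes

for word in ("replaying", "kindness", "dogs", "reading"):
    print(word, segment(word))
```

The last word is included on purpose: without knowing that "read" is a stem, the sketch wrongly peels "re-" off "reading", which shows why morphology counts as knowledge of language and not just string manipulation.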

Page 35: Natural Language Processing

Knowledge of Language
Pragmatics – concerns how sentences are used in different situations and how use affects the interpretation of the sentence.

Discourse – concerns how the immediately preceding sentences affect the interpretation of the next sentence. For example, interpreting pronouns and interpreting the temporal aspects of the information.

World Knowledge – includes general knowledge about the world. What each language user must know about the other’s beliefs and goals.

Page 36: Natural Language Processing

Ambiguity
At last, a computer that understands you like your mother.
-- 1985 McDonnell-Douglas Ad

Different interpretations:
1. The computer understands you as well as your mother understands you.
2. The computer understands that you like your mother.
3. The computer understands you as well as it understands your mother.

Speech: … a computer that understands your lie cured mother …

Page 37: Natural Language Processing

Why is NLP difficult?
Because natural language is highly ambiguous.
Syntactic ambiguity
  The president spoke to the nation about the problem of drug use in the schools from one coast to the other.
  has 720 parses.
  Ex:
    “to the other” can attach to any of the previous NPs (ex. “the problem”), or the head verb: 6 places
    “from one coast” has 5 places to attach
    …
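One way to arrive at the figure of 720 is sketched below. It assumes that each prepositional phrase can attach to the head verb or to the noun phrase inside any earlier prepositional phrase, and that these choices multiply independently; that assumption reproduces the counts of 6 and 5 given above, but it ignores the constraint that attachments may not cross, so it should be read as the slide's back-of-the-envelope estimate rather than an exact parse count.

```python
from math import prod

# Prepositional phrases of the example sentence, in order of appearance.
pps = [
    "to the nation",
    "about the problem",
    "of drug use",
    "in the schools",
    "from one coast",
    "to the other",
]

def attachment_sites(index):
    """Head verb plus one NP for each earlier PP (assumption: the subject
    NP 'The president' is not counted as an attachment site)."""
    return 1 + index

counts = [attachment_sites(i) for i in range(len(pps))]
total = prod(counts)

for pp, count in zip(pps, counts):
    print(f"{pp!r}: {count} place(s)")
print("estimated parses:", total)
```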

Page 38: Natural Language Processing

Why is NLP difficult?
Word category ambiguity
  book --> verb? or noun?
Word sense ambiguity
  bank --> financial institution? building? or river side?
Words can mean more than the sum of their parts
  make up a story
Fictitious worlds
  People on Mars can fly.
Defining scope
  People like ice-cream.
  Does this mean that all (or some?) people like ice cream?
Language is changing and evolving
  I’ll email you my answer.
  This new S.U.V. has a compartment for your mobile phone.
  Googling, …

Page 39: Natural Language Processing

Resolve Ambiguities
We will introduce models and algorithms to resolve ambiguities at different levels.
part-of-speech tagging -- Deciding whether duck is a verb or a noun.
word-sense disambiguation -- Deciding whether make means create or cook.
lexical disambiguation -- Resolving part-of-speech ambiguity and resolving word-sense ambiguity are two important kinds of lexical disambiguation.
syntactic ambiguity -- “her duck” is an example of syntactic ambiguity, and can be addressed by probabilistic parsing.

Page 40: Natural Language Processing

Resolve Ambiguities (cont.)
I made her duck

[S [NP I] [VP [V made] [NP her] [NP duck]]]
[S [NP I] [VP [V made] [NP [DET her] [N duck]]]]
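The two analyses of "I made her duck" can be counted mechanically with a chart parser. The sketch below is a CKY-style parse counter over a toy grammar; the grammar, the lexicon, and the helper symbol VX (used to binarize VP -> V NP NP) are assumptions written to license exactly the two trees on this slide, not a grammar taken from the lecture.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),    # made [her duck]
    ("VX", "V", "NP"),    # made [her] ...
    ("VP", "VX", "NP"),   # ... [duck]
    ("NP", "DET", "N"),
]
LEXICON = {
    "I": ["NP"],
    "made": ["V"],
    "her": ["NP", "DET"],
    "duck": ["NP", "N"],
}

def count_parses(words, start="S"):
    """Return the number of distinct trees rooted in `start` over `words`."""
    n = len(words)
    # chart[i][j][X] = number of X-rooted trees spanning words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag in LEXICON.get(word, []):
            chart[i][i + 1][tag] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]

print(count_parses("I made her duck".split()))
```

A probabilistic parser would attach a probability to each rule and keep the highest-scoring tree instead of merely counting them.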

Page 41: Natural Language Processing

Dealing with Ambiguity
Three approaches:
  Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels.
  Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures.
    Syntax proposes/semantics disposes approach
  Probabilistic approaches based on making the most likely choices

Page 42: Natural Language Processing

Models to Represent Linguistic Knowledge
Different formalisms (models) are used to represent the required linguistic knowledge.
  State Machines -- FSAs, HMMs, ATNs, RTNs
  Formal Rule Systems -- Context Free Grammars, Unification Grammars, Probabilistic CFGs.
  Logic-based Formalisms -- first order predicate logic, some higher order logic.
  Models of Uncertainty -- Bayesian probability theory.
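As a small illustration of the first formalism in the list, a deterministic finite-state automaton (FSA) can be written as a transition table plus a loop. The language recognized here, a "b" followed by two or more "a"s and a final "!", is a toy choice made for this sketch; the state numbering and names are likewise assumptions.

```python
# Transition table of a deterministic FSA for the toy language  b a a+ !
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop: any number of further a's
    (3, "!"): 4,
}
START_STATE = 0
ACCEPTING_STATES = {4}

def accepts(string):
    """Return True if the FSA ends in an accepting state after reading `string`."""
    state = START_STATE
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING_STATES

for candidate in ("baa!", "baaaa!", "ba!", "baa"):
    print(candidate, accepts(candidate))
```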

Page 43: Natural Language Processing

Algorithms to Manipulate Linguistic Knowledge
We will use algorithms to manipulate the models of linguistic knowledge to produce the desired behavior.
  Most of the algorithms we will study are transducers and parsers.
  These algorithms construct some structure based on their input.
Since the language is ambiguous at all levels, these algorithms are never simple processes.
Most of the algorithms that will be used fall into the following categories:
  state space search
  dynamic programming
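For a concrete instance of dynamic programming, consider minimum edit distance, a standard building block for spelling correction. The sketch below assumes unit cost for insertion, deletion, and substitution (plain Levenshtein distance); other cost schemes are common, and the function name is a choice made here.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs, computed bottom-up."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]

print(edit_distance("intention", "execution"))
```

Each table cell is filled once from three already-computed neighbours, so the running time is proportional to the product of the two string lengths instead of growing exponentially with the number of possible edit sequences.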

Page 44: Natural Language Processing

Language and Intelligence
Turing Test

[Figure: Computer and Human, each communicating with a Human Judge]

Human Judge asks tele-typed questions to Computer and Human.
  Computer’s job is to act like a human.
  Human’s job is to convince Judge that he is not a machine.
  Computer is judged “intelligent” if it can fool the judge.
Judgment of intelligence is linked to appropriate answers to questions from the system.

Page 45: Natural Language Processing

NLP - an Inter-disciplinary Field
NLP borrows techniques and insights from several disciplines.
Linguistics: How do words form phrases and sentences? What constrains the possible meanings of a sentence?
Computational Linguistics: How is the structure of sentences identified? How can knowledge and reasoning be modeled?
Computer Science: Algorithms for automata and parsers.
Engineering: Stochastic techniques for ambiguity resolution.
Psychology: What linguistic constructions are easy or difficult for people to learn to use?
Philosophy: What is meaning, and how do words and sentences acquire it?

Page 46: Natural Language Processing

Some Buzz-Words
NLP – Natural Language Processing
CL – Computational Linguistics
SP – Speech Processing
HLT – Human Language Technology
NLE – Natural Language Engineering
SNLP – Statistical Natural Language Processing
Other Areas:
  Speech Generation, Text Generation, Speech Understanding, Information Retrieval,
  Dialogue Processing, Inference, Spelling Correction, Grammar Correction,
  Text Summarization, Text Categorization

Page 47: Natural Language Processing

Some NLP Applications
Machine Translation – Translation between two natural languages.
  Babel Fish translation system, Systran

Information Retrieval – Web search (uni-lingual or multi-lingual).

Query Answering/Dialogue – Natural language interface with a database system, or a dialogue system.

Report Generation – Generation of reports such as weather reports.

Other Applications – Grammar Checking, Spell Checking, Spell Corrector

Page 48: Natural Language Processing

The Big Picture

[Figure: diagram with the labels Source Language Speech Signal, Speech recognition, Source text, Analysis, Generation, Target text, Speech Synthesis, Target Language Speech Signal]

Page 49: Natural Language Processing

The Reductionist Approach

Source Language Analysis
  Text Normalization
  Morphological Analysis
  POS Tagging
  Parsing
  Semantic Analysis
  Discourse Analysis

Target Language Generation
  Text Rendering
  Morphological Synthesis
  Phrase Generation
  Role Ordering
  Lexical Choice
  Discourse Planning

Page 50: Natural Language Processing

Natural Language Understanding

Words
Morphological Analysis -> morphologically analyzed words (another step: POS tagging)
Syntactic Analysis -> syntactic structure
Semantic Analysis -> context-independent meaning representation
Discourse Processing -> final meaning representation

Page 51: Natural Language Processing

Natural Language Generation

Meaning representation
Utterance Planning -> meaning representations for sentences
Sentence Planning and Lexical Choice -> syntactic structures of sentences with lexical choices
Sentence Generation -> morphologically analyzed words
Morphological Generation -> words

Page 52: Natural Language Processing

Natural Language Generation
NLG is the process of constructing natural language outputs from non-linguistic inputs.
  It is the reverse process of NL understanding.
An NLG system may have two main parts:
  Discourse Planning -- what will be generated,
  Surface Realization -- realizes a sentence from its internal representation.
Lexical Choice
  selecting the correct words describing the concepts.
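The three notions on this slide can be separated in a very small template-based sketch. Everything specific here is hypothetical: the weather-record fields, the temperature thresholds, and the sentence template are invented for illustration, and real generators use far richer planning and grammar.

```python
def plan_content(record):
    """Discourse planning: decide what will be said (skip missing fields)."""
    return [key for key in ("city", "temp_c", "sky") if record.get(key) is not None]

def choose_word(temp_c):
    """Lexical choice: pick a word for the temperature concept.
    The thresholds are made up for this sketch."""
    if temp_c < 10:
        return "cold"
    if temp_c < 25:
        return "mild"
    return "hot"

def realize(record):
    """Surface realization: turn the planned content into an English sentence."""
    plan = plan_content(record)
    parts = []
    if "temp_c" in plan:
        parts.append(choose_word(record["temp_c"]))
    if "sky" in plan:
        parts.append(record["sky"])
    description = " and ".join(parts) if parts else "unremarkable"
    place = f" in {record['city']}" if "city" in plan else ""
    return f"It is {description}{place} today."

print(realize({"city": "Delhi", "temp_c": 31, "sky": "cloudy"}))
print(realize({"temp_c": 5}))
```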

Page 53: Natural Language Processing

Machine Translation
Machine Translation -- converting a text in language A into the corresponding text in language B (or speech).
Different Machine Translation architectures:
  interlingua based systems
  transfer based systems

How to acquire the required knowledge resources such as mapping rules and bi-lingual dictionary? By hand or acquire them automatically from corpora.

Example Based Machine Translation acquires the required knowledge (some of it or all of it) from corpora.

Page 54: Natural Language Processing

Some statistics (old)
Business e-mail sent per day in the US: 2.1 billion
First class mail per year: 107 billion
Text on Internet (2/99): > 6 TB
  Current: ?
  indexed: 16% (Lawrence and Giles, Nature 400, 1999)
Dialog (www.dialog.com): 9 TB
Average college library: 1 TB

Page 55: Natural Language Processing

Languages
Languages: 39,000 languages and dialects (22,000 dialects in India alone)
Top languages: Chinese/Mandarin (885M), Spanish (332M), English (322M), Bengali (189M), Hindi (182M), Portuguese (170M), Russian (170M), Japanese (125M)
Source: www.sil.org/ethnologue, www.nytimes.com
Internet: English (128M), Japanese (19.7M), German (14M), Spanish (9.4M), French (9.3M), Chinese (7.0M)
Usage: English (1999-54%, 2001-51%, 2003-46%, 2005-43%)
Source: www.computereconomics.com