Apr 2009 CLINT-LIN: Finite State Machinery. Computational Linguistics Introduction: Finite State Machinery and Language Description
Computational Linguistics Introduction

Jan 16, 2016

cohen cohen

Transcript
Page 1: Computational Linguistics Introduction

Apr 2009 CLINT-LIN: Finite State Machinery

Computational Linguistics Introduction

Finite State Machinery and Language Description

Page 2: Computational Linguistics Introduction

Acknowledgement

The material for this lecture is derived from a series of talks given by Dr. Ken Beesley (Xerox European Research Centre, Grenoble) in Malta, 2001.

Page 3: Computational Linguistics Introduction

Today’s Topics

• Finite State Technology

• Regular Languages and Relations

• Review of Set Theory

• Understand the mathematical operations that can be performed on such Languages.

• Understand how Languages, Relations, Regular Expressions, and Networks are interrelated.

Page 4: Computational Linguistics Introduction

What is Finite State Technology?

• Finite State Technology refers to a collection of techniques for applying Finite State Automata (FSA) to a range of linguistically motivated problems.

• Such techniques include:
  • Design of user languages for specifying FSA
  • Compilation of such languages into efficient transition networks
  • Development environments and runtime systems

Page 5: Computational Linguistics Introduction

What is Finite-State Technology Good For?

• Finite-state techniques cannot handle unbounded centre embedding, e.g.
  • the man the dog the cat bit followed ate.

• They are well suited to “lower-level” natural language processing such as:
  • Tokenization: what is the next word?
  • Spelling error detection: does the next word belong to a list?
  • Morphological/phonological analysis/generation
  • Shallow syntactic parsing and “chunking”

Page 6: Computational Linguistics Introduction

Tokenisation Problems

VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday.
(example from Mary Dalrymple, University of London)

• VfB Stuttgart, Manchester United
• succession
• 2-1
• Wednesday

• Finite state techniques provide a means to specify the language of words, thus defining what it means to be the next token.

• There are three ways to specify such languages
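As a rough illustration of what "specifying the language of words" can look like, here is a minimal Python sketch that uses a regular expression as the notation. The pattern, the short list of multiword names and the `tokenize` helper are assumptions made for this example; they are not part of the lecture material.

```python
import re

# Hypothetical token language: a few known multiword names, scores such
# as 2-1, words (optionally hyphenated), and single punctuation marks.
TOKEN = re.compile(
    r"VfB Stuttgart|Manchester United"   # multiword names listed explicitly
    r"|\d+-\d+"                          # scores such as 2-1
    r"|\w+(?:-\w+)*"                     # plain or hyphenated words
    r"|[^\w\s]"                          # any single punctuation character
)

def tokenize(text):
    """Return the tokens of `text`, scanning left to right."""
    return TOKEN.findall(text)

print(tokenize("VfB Stuttgart won 2-1 on Wednesday."))
```

The sketch keeps "success-ion" as one token but does not rejoin it into "succession"; deciding when a hyphen is a line-break artefact needs additional knowledge, such as a word list.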

Page 7: Computational Linguistics Introduction

Languages, Notations and Machines

[Diagram relating LANGUAGE (set of strings), NOTATION and MACHINE]

Page 8: Computational Linguistics Introduction

Languages, Notations and Machines

[Diagram relating FINITE STATE LANGUAGE, FINITE STATE NOTATION and FINITE STATE AUTOMATON]

Page 9: Computational Linguistics Introduction

FINITE STATE AUTOMATA: preliminary definition

A finite state automaton includes:
• A finite set of states
• A finite set of labelled transitions between states

Page 10: Computational Linguistics Introduction

Physical Machines with Finite States

The Lightswitch Machine

[Diagram: two states, OFF and ON, connected by transitions labelled UP and DOWN]

Page 11: Computational Linguistics Introduction

Physical Machines with Finite States

The Lightswitch Toggle Machine

[Diagram: two states, OFF and ON, connected in both directions by transitions labelled PUSH]

Page 12: Computational Linguistics Introduction

The Five Cent Machine

Problem:

• Assume you have one, two, and five cent pieces

• Design a finite state automaton which accepts exactly 5 cents.
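One possible solution can be sketched in Python, assuming that each state records the total inserted so far. The names `COINS`, `build_transitions` and `accepts` are illustrative choices for this sketch and do not come from the lecture.

```python
# States are the totals inserted so far; the coins form the alphabet.
COINS = {"1": 1, "2": 2, "5": 5}
START, FINAL = 0, 5

def build_transitions():
    """Return {(state, coin): next_state} for totals that do not exceed 5."""
    transitions = {}
    for state in range(FINAL):            # no arcs leave the final state
        for coin, value in COINS.items():
            if state + value <= FINAL:
                transitions[(state, coin)] = state + value
    return transitions

TRANSITIONS = build_transitions()

def accepts(coins):
    """True if the coin sequence drives the machine from START to FINAL."""
    state = START
    for coin in coins:
        if (state, coin) not in TRANSITIONS:
            return False                  # no arc for this coin: reject
        state = TRANSITIONS[(state, coin)]
    return state == FINAL
```

For example, `accepts("221")` and `accepts("5")` hold, while `accepts("222")` does not, because no arc allows the total to exceed five cents.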

Page 13: Computational Linguistics Introduction

The Cola Machine

• Need to enter 25 cents (USA) to get a drink

• Accepts the following coins:
  • Nickel = 5 cents
  • Dime = 10 cents
  • Quarter = 25 cents

• For simplicity, our machine needs exact change

• We will model only the coin-accepting mechanism

Page 14: Computational Linguistics Introduction

Physical Machines with Finite States

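The Cola Machine

[Diagram: states 0, 5, 10, 15, 20 and 25, where 0 is the Start State and 25 is the Final State. Each N arc advances the total by 5, each D arc advances it by 10, and a single Q arc leads from 0 directly to 25.]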

Page 15: Computational Linguistics Introduction

The Cola Machine Language

• List of all the sequences of coins accepted:
  • { Q, DDN, DND, NDD, DNNN, NDNN, NNDN, NNND, NNNNN }

• Think of the coins as SYMBOLS or CHARACTERS

• The set of symbols accepted is the ALPHABET of the machine

• Think of sequences of coins as WORDS or “strings”

• The set of words accepted by the machine is its LANGUAGE
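The language of the Cola Machine can also be enumerated mechanically. The following Python sketch assumes, as in the diagram, that states are the totals inserted so far and that an arc exists only when the total stays at or below 25 cents. The function name `cola_language` is an illustrative choice.

```python
COIN_VALUES = {"N": 5, "D": 10, "Q": 25}   # the alphabet and its values
START, FINAL = 0, 25

def cola_language(state=START, prefix=""):
    """Enumerate every coin string that leads from `state` to FINAL."""
    if state == FINAL:
        return {prefix}
    words = set()
    for coin, value in COIN_VALUES.items():
        if state + value <= FINAL:         # exact change only
            words |= cola_language(state + value, prefix + coin)
    return words

LANGUAGE = cola_language()
print(sorted(LANGUAGE, key=lambda word: (len(word), word)))
```

Because the network has no loops, the recursion terminates and the language is finite.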

Page 16: Computational Linguistics Introduction

FINITE STATE AUTOMATA: better definition

A finite state automaton includes:
• A finite set of states
  • Initial State
  • Final State(s)
• A finite set of labelled transitions between states
  • Labels are symbols from an alphabet
• Recognises a language
• Generates a language as well!

Page 17: Computational Linguistics Introduction

A Network that Accepts a One Word Language

c a n t o

Page 18: Computational Linguistics Introduction

A Network that Accepts a Three Word Language

[Network with three paths: c a n t o, t i g r e, m e s a]

Page 19: Computational Linguistics Introduction

Scaling Up the Network

• Imagine the same network expanded to handle three million words, all of them corresponding to valid words of a given language.

• We supply a word and ‘apply’ it to the network. If it is accepted by the network, then it is a valid word. Otherwise it does not belong to the language.

• This is the basis for a Spanish spelling error detector.

Page 20: Computational Linguistics Introduction

Looking Up a Word

[Network with three paths: c a n t o, t i g r e, m e s a. The input m e s a is applied to the network (“Apply”).]

Page 21: Computational Linguistics Introduction

Lookup Failure

Lookup succeeds when all input is consumed and a final state is reached. Lookup can fail because:

• Not all input is consumed ("libro", "tigra")
• Input is fully consumed but the state is not final ("cant")
• A final state is reached but there is still unconsumed input ("mesas")
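The lookup procedure and its failure cases can be sketched in Python as follows. This is a minimal illustration and not the runtime code discussed in the lecture: it builds a simple letter tree from the three-word language, and the names `build_network` and `lookup` are assumptions made for the example.

```python
def build_network(words):
    """Build a deterministic network as {state: {symbol: next_state}}."""
    transitions, finals = {0: {}}, set()
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in transitions[state]:
                new_state = len(transitions)
                transitions[state][symbol] = new_state
                transitions[new_state] = {}
            state = transitions[state][symbol]
        finals.add(state)                  # the state ending a word is final
    return transitions, finals

def lookup(network, word):
    """Return 'accepted' or a short reason for the failure."""
    transitions, finals = network
    state = 0
    for position, symbol in enumerate(word):
        if symbol not in transitions[state]:
            return f"rejected: input not consumed after {word[:position]!r}"
        state = transitions[state][symbol]
    if state in finals:
        return "accepted"
    return "rejected: input consumed but state is not final"

NETWORK = build_network(["canto", "tigre", "mesa"])
for candidate in ["mesa", "cant", "mesas", "libro", "tigra"]:
    print(candidate, "->", lookup(NETWORK, candidate))
```

In this sketch "mesas" is reported as unconsumed input, since the final state reached after "mesa" has no outgoing arc for the last "s".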

Page 22: Computational Linguistics Introduction

Shared Structure

[Network diagram with arcs labelled c, l, e, a, e, v, r, e, in which several words share states and arcs]

Page 23: Computational Linguistics Introduction

Transducers

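[Transducer diagram: one path whose arcs pair the symbols m e s a +Noun +Fem +Pl with the symbols m e s a 0 0 s. “Lookup” maps the string mesas to mesa+Noun+Fem+Pl; “Lookdown” maps mesa+Noun+Fem+Pl back to mesas.]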
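The two directions of application can be sketched in Python for the single path shown on the slide. This is an illustration under simplifying assumptions: a real transducer is a network with many paths, whereas the hypothetical `apply_path` helper below handles only one path, and 0 is treated as the empty symbol.

```python
EPSILON = "0"   # the slide writes the empty symbol as 0

# One path of (upper, lower) symbol pairs, read off the slide.
PATH = [("m", "m"), ("e", "e"), ("s", "s"), ("a", "a"),
        ("+Noun", EPSILON), ("+Fem", EPSILON), ("+Pl", "s")]

def apply_path(path, text, match_side):
    """Match `text` against one side of the path and emit the other side.

    match_side is 0 to match the upper (lexical) side and 1 to match the
    lower (surface) side. Returns the output string, or None on failure.
    """
    output, rest = [], text
    for pair in path:
        matched, emitted = pair[match_side], pair[1 - match_side]
        if matched != EPSILON:
            if not rest.startswith(matched):
                return None                # input does not match this arc
            rest = rest[len(matched):]
        if emitted != EPSILON:
            output.append(emitted)
    return "".join(output) if rest == "" else None

def lookup(surface):
    """Surface string to lexical string."""
    return apply_path(PATH, surface, match_side=1)

def lookdown(lexical):
    """Lexical string to surface string."""
    return apply_path(PATH, lexical, match_side=0)

print(lookup("mesas"))
print(lookdown("mesa+Noun+Fem+Pl"))
```

The same data structure serves both directions; only the side being matched changes.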

Page 24: Computational Linguistics Introduction

A Morphological Analyzer

[Diagram: a Transducer relating the form dogs to the analysis dog +n +pl]

Page 25: Computational Linguistics Introduction

A Morphological Analyzer

[Diagram: a Transducer relating the Surface Language to the Lexical Language]

Page 26: Computational Linguistics Introduction

A Quick Review of Set Theory

A set is a collection of objects.

[Diagram: a set containing A, B, D and E]

We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }.

There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc.

Page 27: Computational Linguistics Introduction

Uniqueness of Elements

You cannot have two or more ‘A’ elements in the same set.

[Diagram: a set containing A, B, D and E]

{ A, A, D, B, E } is just a redundant specification of the set { A, D, B, E }.

Page 28: Computational Linguistics Introduction

Cardinality of Sets

The Empty Set: { }

A Finite Set: { Norway, Denmark, Sweden }

An Infinite Set: e.g. the set of all positive integers

Page 29: Computational Linguistics Introduction

Simple Operations on Sets: Union

Set 1: { A, B, C }
Set 2: { D, E }

Union of Set 1 and Set 2: { A, B, C, D, E }

Page 30: Computational Linguistics Introduction

Simple Operations on Sets (2): Union

Set 1: { A, B, C }
Set 2: { C, D }

Union of Set 1 and Set 2: { A, B, C, D }

Page 31: Computational Linguistics Introduction

Simple Operations on Sets (3): Intersection

Set 1: { A, B, C }
Set 2: { C, D }

Intersection of Set 1 and Set 2: { C }

Page 32: Computational Linguistics Introduction

Simple Operations on Sets (4): Subtraction

Set 1: { A, B, C }
Set 2: { C, D }

Set 1 minus Set 2: { A, B }
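The set operations reviewed on the last few slides map directly onto Python's built-in `set` type. The sketch below takes Set 1 as { A, B, C } and Set 2 as { C, D }, as read from the diagrams; the variable names are illustrative.

```python
set_1 = {"A", "B", "C"}
set_2 = {"C", "D"}

union = set_1 | set_2            # members of either set
intersection = set_1 & set_2     # members of both sets
difference = set_1 - set_2       # members of Set 1 that are not in Set 2

# Order is not significant, and repeated elements collapse into one.
same_set = {"A", "D", "B", "E"} == {"E", "A", "D", "B"}
redundant = {"A", "A", "D", "B", "E"} == {"A", "D", "B", "E"}

print(sorted(union), sorted(intersection), sorted(difference))
```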

Page 33: Computational Linguistics Introduction

Formal Languages

Very Important Concept in Formal Language Theory:

A Language is just a Set of Words.

• We use the terms “word” and “string” interchangeably.

• A Language can be empty, have finite cardinality, or be infinite in size.

• You can union, intersect and subtract languages, just like any other sets.

Page 34: Computational Linguistics Introduction

Union of Languages (Sets)

Language 1: { dog, cat, rat }
Language 2: { elephant, mouse }

Union of Language 1 and Language 2: { dog, cat, rat, elephant, mouse }

Page 35: Computational Linguistics Introduction

Intersection of Languages (Sets)

Language 1: { dog, cat, rat }
Language 2: { elephant, mouse }

Intersection of Language 1 and Language 2: { } (the empty language)

Page 36: Computational Linguistics Introduction

Intersection of Languages (Sets)

Language 1: { dog, cat, rat }
Language 2: { rat, mouse }

Intersection of Language 1 and Language 2: { rat }

Page 37: Computational Linguistics Introduction

Subtraction of Languages (Sets)

Language 1: { dog, cat, rat }
Language 2: { rat, mouse }

Language 1 minus Language 2: { dog, cat }

Page 38: Computational Linguistics Introduction

Languages

• A language is a set of words (=strings).

• Words (strings) are composed of symbols (letters) that are “concatenated” together.

• At another level, words are composed of “morphemes”.

• In most natural languages, we concatenate morphemes together to form whole words.

For sets consisting of words (i.e. for Languages), the operation of concatenation is very important.

Page 39: Computational Linguistics Introduction

Concatenation of Languages

Root Language: { work, talk, walk }

Suffix Language: { 0, ing, ed, s } (0 denotes the empty string)

The concatenation of the Suffix Language after the Root Language:
{ work, working, worked, works, talk, talking, talked, talks, walk, walking, walked, walks }
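Concatenation of two languages pairs every word of the first with every word of the second. A minimal Python sketch, in which the function name `concatenate` is an illustrative choice and the empty string stands in for the 0 suffix:

```python
ROOTS = {"work", "talk", "walk"}
SUFFIXES = {"", "ing", "ed", "s"}    # "" plays the role of 0 on the slide

def concatenate(language_1, language_2):
    """Every word of language_1 followed by every word of language_2."""
    return {first + second for first in language_1 for second in language_2}

VERBS = concatenate(ROOTS, SUFFIXES)
print(sorted(VERBS))
```

Three roots and four suffixes give twelve distinct words here. In general the result can be smaller than the product of the two sizes, because different pairs may spell the same word.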

Page 40: Computational Linguistics Introduction

Languages and Networks

[Network diagrams: Network/Language 1 accepts the roots walk, work and talk; Network/Language 2 accepts the suffixes 0, s, ed and ing; a third network shows the concatenation of Network 1 and Network 2.]

Page 41: Computational Linguistics Introduction

Why is “Finite State” Computing so Interesting?

• Finite-state systems are mathematically elegant, easily manipulated and modifiable.

• Computationally efficient. Usually very compact.

• The programming we linguists do is declarative. We describe the facts of our natural language; i.e. we write grammars. We do not hack ad hoc code.

• The runtime code, which applies our systems to linguistic input, is already written and it is completely language-independent.

• Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate.

Page 42: Computational Linguistics Introduction

Languages, Notations and Machines

[Diagram relating FINITE STATE LANGUAGE, FINITE STATE NOTATION and FINITE STATE MACHINE]