LING 438/538 Computational Linguistics Sandiway Fong Lecture 10: 9/26

LING 438/538 Computational Linguistics

Sandiway Fong

Lecture 10: 9/26


Administrivia

• reminder– no class this Thursday


Last Time

• introduced Finite State Automata (FSA) and regular expressions (RE)
• formally equivalent
  – in terms of generative capacity or power
• Regular Languages can be described by Regular Grammars, FSA and Regular Expressions
• + left & right recursive rules
• regular grammar
  s --> [a],b.
  b --> [a],b.
  b --> [b],c.
  b --> [b].
  c --> [b],c.
  c --> [b].
• regular expression
  a+b+
• FSA
  [FSA diagram: s –a→ x, x –a→ x, x –b→ y, y –b→ y]
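The regular grammar above is written in Prolog's DCG notation, so it can be loaded and queried directly. The following is a minimal sketch, assuming a Prolog system with DCG support and phrase/2 (for example SWI-Prolog); the helper name accepts/1 is introduced here for illustration and is not from the slides.

```prolog
% Regular grammar for a+b+ in DCG notation (rules as listed on the slide).
s --> [a], b.
b --> [a], b.
b --> [b], c.
b --> [b].
c --> [b], c.
c --> [b].

% accepts(+String): hypothetical helper, succeeds if the list String
% is generated starting from nonterminal s.
accepts(String) :- phrase(s, String).
```

With this loaded, ?- accepts([a,a,b,b]). and ?- accepts([a,b]). should succeed, while ?- accepts([a]). and ?- accepts([b,a]). should fail, matching the regular expression a+b+.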


Last Time

• FSA
  – gave a formal definition
    • (Q,s,f,Σ,δ)
• many practical applications
  – can be encoded and run efficiently on a computer
  – implement regular expressions
  – compress large dictionaries
  – build morphological analyzers (suffixation)
    • see chapter 3 of textbook
  – speech recognizers
    • (Hidden) Markov models = FSA + probabilities


Today’s Lecture

• from the textbook
  – Chapter 2: Regular Expressions and Finite State Automata
• how to implement FSA
  – in Prolog
  – from first principles
  – two ways


Regular Expressions

• pattern-matching using regular expressions
  – important tool in automated searching
• popular implementations
  – Unix grep
    • returns lines matching a regular expression
    • standard part of all Unix-based systems
      – including MacOS X (command-line interface in Terminal)
    • many shareware/freeware implementations available for Windows XP
      – just Google and see...
  – wildcard search in Microsoft Word
    • limited version with differences in notation


Regular Expressions

• One of the most popular programs for searching files and returning lines that match a regular expression pattern is called GREP
  – name comes from Unix ed command g/re/p
  – “search globally for lines matching the regular expression, and print them”
  – [Source: http://en.wikipedia.org/wiki/Grep]


Regular Expressions: GNU grep

• terminology:
  – metacharacter
    • character with special meaning, not interpreted literally, e.g. ^ vs. a
    • must be quoted or escaped using the backslash \ to get literal meaning, e.g. \^
• excerpts from the manpage
  – A list of characters enclosed by [ and ] matches any single character in that list;
  – if the first character of the list is the caret ^ then it matches any character not in the list.
• Examples
  – the regular expression [0123456789] matches any single digit.
  – A range of characters may be specified by giving the first and last characters, separated by a hyphen.
    • [0-9]
    • [a-z]
    • [A-Za-z]
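To connect bracket expressions with the Prolog used later in this lecture, here is a rough sketch of how [0-9], [A-Za-z] and the negated list [^0-9] could be mimicked as DCG nonterminals that each consume one character. All predicate names (digit//0, letter//0, not_digit//0, char_in_range/3) are made up for this illustration; this is not how grep is implemented.

```prolog
% char_in_range(+C, +Lo, +Hi): hypothetical helper, true if the
% character C lies between Lo and Hi by character code.
char_in_range(C, Lo, Hi) :-
    char_code(C, X),
    char_code(Lo, L),
    char_code(Hi, H),
    L =< X, X =< H.

% digit//0 plays the role of [0-9]: consume one character in the range.
digit --> [C], { char_in_range(C, '0', '9') }.

% letter//0 plays the role of [A-Za-z]: either range is acceptable.
letter --> [C], { char_in_range(C, a, z) ; char_in_range(C, 'A', 'Z') }.

% not_digit//0 plays the role of [^0-9]: any one character outside the range.
not_digit --> [C], { \+ char_in_range(C, '0', '9') }.
```

For example, ?- phrase(digit, ['7']). and ?- phrase(letter, ['Q']). should succeed, and ?- phrase(not_digit, ['3']). should fail.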


Regular Expressions: grep

• excerpts from the manpage

– The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.

– The symbol \b matches the empty string at the edge of a word

– The symbols \< and \> respectively match the empty string at the beginning and end of a word.

– The period . matches any single character.

– Finally, certain named classes of characters are predefined.

• Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].

• For example, [[:alnum:]] (alphanumeric) means [0-9A-Za-z]

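– The symbol \w is a synonym for [[:alnum:]] and

– \W is a synonym for [^[:alnum:]].

  • (current GNU grep documentation gives \w as [_[:alnum:]], i.e. including the underscore, which agrees with the definition of word below)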

• terminology– word

• unbroken sequence of digits, underscores and letters


Regular Expressions: grep

• Excerpts from the manpage
  – A regular expression may be followed by one of several repetition operators:
    • ? The preceding item is optional and matched at most once.
    • * The preceding item will be matched zero or more times.
    • + The preceding item will be matched one or more times.
    • {n} The preceding item is matched exactly n times.
    • {n,} The preceding item is matched n or more times.
    • {n,m} The preceding item is matched at least n times, but not more than m times.
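As an aside in Prolog, the first four repetition operators can be sketched as DCG rules that take another nonterminal G as an argument. This assumes a system that supports call//1 inside DCG bodies (SWI-Prolog does); the names re_opt//1, re_star//1, re_plus//1 and re_exactly//2 are invented for this example, and G is assumed to consume at least one symbol so that the recursion terminates.

```prolog
% re_opt(G) corresponds to G? : zero or one occurrence of G.
re_opt(G) --> call(G).
re_opt(_) --> [].

% re_star(G) corresponds to G* : zero or more occurrences of G.
re_star(G) --> call(G), re_star(G).
re_star(_) --> [].

% re_plus(G) corresponds to G+ : one or more occurrences of G.
re_plus(G) --> call(G), re_star(G).

% re_exactly(N, G) corresponds to G{N} : exactly N occurrences of G.
re_exactly(0, _) --> [].
re_exactly(N, G) --> { N > 0, N1 is N - 1 }, call(G), re_exactly(N1, G).

% A sample item to repeat: the single symbol a.
a --> [a].
```

For instance, ?- phrase(re_plus(a), [a,a,a]). should succeed, ?- phrase(re_plus(a), []). should fail, and ?- phrase(re_exactly(2, a), [a,a]). should succeed.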


Regular Expressions: GNU grep

Excerpts from the manpage

• concatenation
  – Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
• disjunction
  – Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.


Regular Expressions: Examples

• Regular Expression
  – \b99
    • matches 99 in “there are 99 bottles …”
    • but not in “there are 299 bottles …”
  – Note:
    • $99 contains two words
    • so \b99 will match 99 here
• Regular Expression
  – beds?
    • examples
      – bed
      – beds


Regular Expressions: Examples

• example
  – guppy
  – guppies
• Regular Expression
  – gupp(y|ies)
  – | = disjunction
  – ( ) = parentheses indicate scope
• example
  – the
    • (whole word, case insensitive)
  – the25
• Regular Expression (pg. 29)
  – (^|[^a-zA-Z])[tT]he[^a-zA-Z]
  – ^ = beginning of line
  – [^ ] = negation
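Since regular expressions and regular grammars are equivalent in power, the pattern gupp(y|ies) can also be sketched as a small DCG, with the disjunction written as two rules for the same nonterminal. The names guppy//0, ending//0 and matches_guppy/1 are chosen here for illustration only. Note one difference from grep: this sketch tests whether the whole word matches, whereas grep looks for a match anywhere in a line.

```prolog
% gupp(y|ies) over a list of characters.
guppy --> [g,u,p,p], ending.

% The parenthesized disjunction (y|ies): one rule per alternative.
ending --> [y].
ending --> [i,e,s].

% matches_guppy(+Word): hypothetical helper that converts an atom
% to a character list and tests it against the grammar.
matches_guppy(Word) :-
    atom_chars(Word, Chars),
    phrase(guppy, Chars).
```

Here ?- matches_guppy(guppy). and ?- matches_guppy(guppies). should succeed, and ?- matches_guppy(guppys). should fail.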


Regular Expressions: Microsoft Word

• terminology:– wildcard search


Regular Expressions: Microsoft Word

Note: zero or more times is missing in Microsoft Word


Finite State Automata (FSA)

• more formally
  – (Q,s,f,Σ,δ)
    1. set of states (Q): {s,x,y} must be a finite set
    2. start state (s): s
    3. end state(s) (f): y
    4. alphabet (Σ): {a, b}
    5. transition function δ:
       signature: character × state → state
       1. δ(a,s)=x
       2. δ(a,x)=x
       3. δ(b,x)=y
       4. δ(b,y)=y

[FSA diagram: s –a→ x, x –a→ x, x –b→ y, y –b→ y; y is the end state]


Finite State Automata (FSA)

• directly implement the formal definition
  – define a predicate fsa/2
  – takes two arguments
  – S = a start state
  – L = string (as a list) we’re interested in testing
• Prolog code (for any FSA)
  – fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
  – fsa(E,[]) :- end_state(E).


Finite State Automata (FSA)

• Prolog code (for any FSA)
  – fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
  – fsa(E,[]) :- end_state(E).
• Facts (FSA-particular)
  – end_state(y).
  – transition(s,a,x).
  – transition(x,a,x).
  – transition(x,b,y).
  – transition(y,b,y).

[FSA diagram: s –a→ x, x –a→ x, x –b→ y, y –b→ y; y is the end state]

transition function δ: δ(a,s)=x δ(a,x)=x δ(b,x)=y δ(b,y)=y
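Because fsa/2 is an ordinary Prolog relation, the same code can also enumerate accepted strings when the list is left partly unbound. The sketch below repeats the slide's program so that it is self-contained and adds one helper, strings_of_length/2, which is a name invented here; it assumes the standard length/2 and findall/3 predicates.

```prolog
% General recognizer (as on the slide).
fsa(S, L) :- L = [C|M], transition(S, C, T), fsa(T, M).
fsa(E, []) :- end_state(E).

% Machine-specific facts (as on the slide).
end_state(y).
transition(s, a, x).
transition(x, a, x).
transition(x, b, y).
transition(y, b, y).

% strings_of_length(+N, -Strings): hypothetical helper collecting every
% string of length N accepted from start state s.
strings_of_length(N, Strings) :-
    length(L, N),                    % L is a list of N unbound symbols
    findall(L, fsa(s, L), Strings).  % let fsa/2 fill in the symbols
```

For example, ?- strings_of_length(3, Ss). should give Ss = [[a,a,b],[a,b,b]], the two strings of length 3 in a+b+.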


Finite State Automata (FSA)

• computation tree
  ?- fsa(s,[a,a,b]).
    ?- transition(s,a,T).  T=x
    ?- fsa(x,[a,b]).
      ?- transition(x,a,T').  T'=x
      ?- fsa(x,[b]).
        ?- transition(x,b,T'').  T''=y
        ?- fsa(y,[]).
          ?- end_state(y).
          Yes

fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
fsa(E,[]) :- end_state(E).


Finite State Automata (FSA)

• deterministic FSA (DFSA)
  – no ambiguity about where to go at any given state
• non-deterministic FSA (NDFSA)
  – no restriction on ambiguity (surprisingly, no increase in formal power)
• textbook
  – D-RECOGNIZE (FIGURE 2.13)
  – ND-RECOGNIZE (FIGURE 2.21)

fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
fsa(E,[]) :- end_state(E).


Finite State Automata (FSA)

• Prolog
  – no change in code
  – Prolog computation rule takes care of choice point management
• example
  – one change
  – “a” instead of “b” from x to y
  – non-deterministic
  – what regular language does this machine accept?

fsa(S,L) :- L = [C|M], transition(S,C,T), fsa(T,M).
fsa(E,[]) :- end_state(E).

[FSA diagram: s –a→ x, x –a→ x, x –a→ y, y –b→ y; y is the end state]
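A minimal sketch of this non-deterministic machine, reusing the general fsa/2 code unchanged and altering only one transition fact. In state x there are now two facts for the symbol a, and Prolog's backtracking tries the second when the first leads to failure.

```prolog
% General recognizer (unchanged from the deterministic case).
fsa(S, L) :- L = [C|M], transition(S, C, T), fsa(T, M).
fsa(E, []) :- end_state(E).

end_state(y).
transition(s, a, x).
transition(x, a, x).
transition(x, a, y).   % the one change: a instead of b from x to y
transition(y, b, y).
```

For ?- fsa(s,[a,a]). Prolog first tries the loop x –a→ x, fails because x is not an end state, then backtracks to x –a→ y and succeeds. Trying a few more queries, such as ?- fsa(s,[a,a,b]). (succeeds) and ?- fsa(s,[a,b]). (fails), is one way to work towards the answer to the question on the slide.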


Finite State Automata (FSA)

• another possible Prolog encoding strategy
  – define one predicate for each state
    • taking one argument (the input string)
    • consume input character
    • call next state with remaining input string
  – query
    • ?- s(L).
      call start state s


Finite State Automata (FSA)

– state s: (start state)
  • s([a|L]) :- x(L).
    match input string beginning with a and call state x with remainder of input
– state x:
  • x([a|L]) :- x(L).
  • x([b|L]) :- y(L).
– state y: (end state)
  • y([]).
  • y([b|L]) :- y(L).

[FSA diagram: s –a→ x, x –a→ x, x –b→ y, y –b→ y; y is the end state]


Finite State Automata (FSA)

example:
1. ?- s([a,a,b]).
2. ?- x([a,b]).
3. ?- x([b]).
4. ?- y([]).
Yes

s([a|L]) :- x(L).
x([a|L]) :- x(L).
x([b|L]) :- y(L).
y([]).
y([b|L]) :- y(L).


Finite State Automata (FSA)

• Note:
  – the non-deterministic properties of Prolog’s computation rule still apply here
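To illustrate the note, here is a sketch of the earlier non-deterministic machine (a instead of b from x to y) in the one-predicate-per-state encoding. The two clauses for x/1 with the same first symbol are where the choice point arises; this is an illustration, not code from the slides.

```prolog
% state s (start state)
s([a|L]) :- x(L).

% state x: two clauses for the same input symbol a
x([a|L]) :- x(L).   % stay in x
x([a|L]) :- y(L).   % or move on to y

% state y (end state)
y([]).
y([b|L]) :- y(L).
```

For ?- s([a,a]). the first x clause is tried and fails on the empty remainder, and Prolog then backtracks into the second x clause and succeeds via y([]).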


Finite State Automata (FSA)

example
1. ?- s([a,b,a]).
2. ?- x([b,a]).
3. ?- y([a]).
No

s([a|L]) :- x(L).
x([a|L]) :- x(L).
x([b|L]) :- y(L).
y([]).
y([b|L]) :- y(L).


Next Time

• bit more on FSA...
• read (if you haven’t yet)
  – Chapter 3: Morphology and Finite State Transducers