Computational Lexicology, Morphology and Syntax Diana Trandabăț Academic year 2017-2018

Aug 21, 2018


Computational Lexicology, Morphology and Syntax

Diana Trandabăț

Academic year 2017-2018

Today’s Topics

• Finite State Technology

• Regular Languages and Relations

• Review of Set Theory

• Understand the mathematical operations that can be performed on such Languages.

• Understand how Languages, Relations, Regular Expressions, and Networks are interrelated.

What is Finite State Technology?

• Finite State Technology refers to a collection of techniques for application of Finite State Automata (FSA) to a range of linguistically motivated problems.

• Such Techniques include

– Design of user languages for specifying FSA

– Compilation of such languages into efficient transition networks.

– Development environments and runtime systems

What is Finite-State Technology Good For?

• Finite-state techniques cannot handle center embedding

– the man the dog the cat bit followed ate.

• They are well suited to “lower-level” natural language processing such as

– Tokenization – what is the next word?

– Spelling error detection: does the next word belong to a list?

– Morphological/phonological analysis/generation

– Shallow syntactic parsing and “chunking”

Languages, Notations and Machines

[Diagram: three linked boxes labelled LANGUAGE (set of strings), NOTATION and MACHINE]

Languages, Notations and Machines

[Diagram: three linked boxes labelled FINITE STATE LANGUAGE, FINITE STATE NOTATION and FINITE STATE AUTOMATON]

FINITE STATE AUTOMATA: preliminary definition

• A finite state automaton includes:

• A finite set of states

• A finite set of labelled transitions between states

Physical Machines with Finite States

• The Lightswitch Machine

[Diagram: two states, OFF and ON, connected by transitions labelled UP and DOWN]

Physical Machines with Finite States

• The Lightswitch Toggle Machine

[Diagram: two states, OFF and ON, with a transition labelled PUSH in each direction]

The Five Cent Machine

• Problem:

• Assume you have one, two, and five cent pieces

• Design a finite state automaton which accepts exactly 5 cents.
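
One possible solution can be sketched in Python. The sketch assumes that each state records the amount inserted so far (0 to 5 cents) and that a coin which would overshoot 5 cents has no transition; the names COINS, TRANSITIONS and accepts are illustrative, not part of the course material.

```python
COINS = (1, 2, 5)   # the alphabet: one, two and five cent pieces
TARGET = 5          # the only final state

# Transition table: (state, coin) -> next state.
# A coin that would push the total past TARGET gets no arc.
TRANSITIONS = {
    (state, coin): state + coin
    for state in range(TARGET + 1)
    for coin in COINS
    if state + coin <= TARGET
}

def accepts(coins):
    """Return True if the coin sequence ends exactly in the final state."""
    state = 0  # start state: nothing inserted yet
    for coin in coins:
        if (state, coin) not in TRANSITIONS:
            return False  # no arc for this coin from this state
        state = TRANSITIONS[(state, coin)]
    return state == TARGET

print(accepts([2, 2, 1]))  # True
print(accepts([2, 2, 2]))  # False
```

Under these assumptions the machine has six states, one per running total.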

The Cola Machine

• Need to enter 25 cents (USA) to get a drink

• Accepts the following coins:

– Nickel = 5 cents

– Dime = 10 cents

– Quarter = 25 cents

• For simplicity, our machine needs exact change

• We will model only the coin-accepting mechanism

Physical Machines with Finite States

• The Cola Machine

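[Diagram: states 0, 5, 10, 15, 20 and 25, where 0 is the Start State and 25 is the Final State. Arcs labelled N (nickel), D (dime) and Q (quarter) lead from a state to the state whose total is 5, 10 or 25 cents higher.]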

The Cola Machine Language

• List of all the sequences of coins accepted:

– { Q, DDN, DND, NDD, DNNN, NDNN, NNDN, NNND, NNNNN }

• Think of the coins as SYMBOLS or CHARACTERS

• The set of symbols accepted is the ALPHABET of the machine

• Think of sequences of coins as WORDS or “strings”

• The set of words accepted by the machine is its LANGUAGE
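
The language of the Cola Machine can be enumerated by walking every path from the start state to the final state. The following is a minimal sketch, assuming states are running totals in cents; VALUES and accepted_sequences are illustrative names.

```python
VALUES = {"N": 5, "D": 10, "Q": 25}  # the alphabet and the value of each symbol

def accepted_sequences(target=25):
    """List every coin sequence that leads from state 0 to the final state."""
    results = []

    def walk(state, path):
        if state == target:          # final state reached: path is a word
            results.append(path)
            return
        for coin, value in VALUES.items():
            if state + value <= target:   # follow only arcs that exist
                walk(state + value, path + coin)

    walk(0, "")
    return results

language = accepted_sequences()
print(len(language), sorted(language))
```

The walk finds nine words: one with a quarter, three with two dimes and a nickel, four with one dime and three nickels, and one with five nickels.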

FINITE STATE AUTOMATA: better definition

• A finite state automaton includes:

• A finite set of states

– Initial State

– Final State(s)

• A finite set of labelled transitions between states

• Labels are symbols from an alphabet

• Recognises a language

• Generates a language as well!

A Network that Accepts a One Word Language

[Diagram: a linear network whose arcs are labelled c, a, n, t, o]

A Network that Accepts a Three Word Language

[Diagram: a network with three paths from the start state, spelling canto, tigre and mesa]

Scaling Up the Network

• Imagine the same network expanded to handle three million words, all of them corresponding to valid words of a given language.

• We supply a word and ‘apply’ it to the network. If it is accepted by the network, then it is a valid word. Otherwise it does not belong to the language.

• This is the basis for a Spanish spelling error detector.

Looking Up a Word

[Diagram: the same network for canto, tigre and mesa, with the input string m e s a being “applied” to it]

Lookup Failure

• Lookup succeeds when all input is consumed and a final state is reached. Lookup can fail because:

• Not all input is consumed ("libro", "tigra")

• Input is fully consumed but the state is not final ("cant")

• A final state is reached but there is still unconsumed input ("mesas")
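
The failure cases above can be reproduced with a small word network. This is a sketch only: it assumes the network is stored as a dictionary of transitions built from a word list, and build_network and lookup are illustrative names.

```python
def build_network(words):
    """Build a deterministic network with one path per word."""
    transitions = {}      # (state, symbol) -> next state
    finals = set()
    next_state = 1        # state 0 is the start state
    for word in words:
        state = 0
        for symbol in word:
            if (state, symbol) not in transitions:
                transitions[(state, symbol)] = next_state
                next_state += 1
            state = transitions[(state, symbol)]
        finals.add(state)
    return transitions, finals

def lookup(network, word):
    """Apply a word to the network and report success or the reason for failure."""
    transitions, finals = network
    state = 0
    for position, symbol in enumerate(word):
        if (state, symbol) not in transitions:
            return f"fail: input not consumed from position {position}"
        state = transitions[(state, symbol)]
    if state in finals:
        return "ok"
    return "fail: input consumed but state is not final"

network = build_network(["canto", "tigre", "mesa"])
for word in ["mesa", "libro", "tigra", "cant", "mesas"]:
    print(word, "->", lookup(network, word))
```

In this sketch "mesas" fails at position 4: the final state for "mesa" has been reached, but no arc consumes the remaining "s".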

Shared Structure

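[Diagram: a network in which several words share states and arcs; the arc labels visible are c, l, e, a, e, v, r, e]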

Transducers

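[Diagram: a transducer path whose upper side reads m e s a +Noun +Fem +Pl and whose lower side reads m e s a 0 0 s. “Lookup” maps the string mesas to mesa+Noun+Fem+Pl; “Lookdown” maps mesa+Noun+Fem+Pl back to mesas.]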
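
A transducer path can be sketched as a list of symbol pairs, one pair per arc. The sketch below assumes that 0 stands for the empty string, as in the diagram, and handles only a list of complete paths rather than a real network; PATH, lookup and lookdown are illustrative names.

```python
EPSILON = "0"  # the symbol that stands for "nothing" on one side of an arc

# One path through the transducer: (upper symbol, lower symbol) per arc.
PATH = [("m", "m"), ("e", "e"), ("s", "s"), ("a", "a"),
        ("+Noun", EPSILON), ("+Fem", EPSILON), ("+Pl", "s")]

def upper_side(path):
    return "".join(upper for upper, _ in path if upper != EPSILON)

def lower_side(path):
    return "".join(lower for _, lower in path if lower != EPSILON)

def lookup(surface, paths):
    """Analysis: match the lower side, return the upper side."""
    return [upper_side(p) for p in paths if lower_side(p) == surface]

def lookdown(lexical, paths):
    """Generation: match the upper side, return the lower side."""
    return [lower_side(p) for p in paths if upper_side(p) == lexical]

print(lookup("mesas", [PATH]))
print(lookdown("mesa+Noun+Fem+Pl", [PATH]))
```

The same data serves both directions, which is the sense in which a transducer can both analyze and generate.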

A Morphological Analyzer

[Diagram: a Transducer relating the word dogs to the analysis dog +n +pl]

A Morphological Analyzer

[Diagram: a Transducer relating the Surface Language to the Lexical Language]

A Quick Review of Set Theory

• A set is a collection of objects.

[Diagram: a set containing the objects A, B, D and E]

We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }.

There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc.

Uniqueness of Elements

• You cannot have two or more ‘A’ elements in the same set.

[Diagram: a set containing the objects A, B, D and E]

{ A, A, D, B, E } is just a redundant specification of the set { A, D, B, E }.

Cardinality of Sets

• The Empty Set: { }

• A Finite Set: e.g. { Norway, Denmark, Sweden }

• An Infinite Set: e.g. the set of all positive integers

Simple Operations on Sets: Union

Set 1: { A, B, C }

Set 2: { C, D }

Union of Set 1 and Set 2: { A, B, C, D }

Simple Operations on Sets: Intersection

Set 1: { A, B, C }

Set 2: { C, D }

Intersection of Set 1 and Set 2: { C }

Simple Operations on Sets: Subtraction

Set 1: { A, B, C }

Set 2: { C, D }

Set 1 minus Set 2: { A, B }
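
The three operations map directly onto Python's built-in set type. The sketch below reads the diagrams as Set 1 = { A, B, C } and Set 2 = { C, D }, which is an interpretation of the figures; the variable names are illustrative.

```python
set1 = {"A", "B", "C"}
set2 = {"C", "D"}

union = set1 | set2          # everything in either set
intersection = set1 & set2   # only what both sets share
difference = set1 - set2     # what is in set1 but not in set2

print(sorted(union), sorted(intersection), sorted(difference))

# Order and repetition do not matter in a set.
print({"A", "D", "B", "E"} == {"E", "A", "D", "B"})       # True
print({"A", "A", "D", "B", "E"} == {"A", "D", "B", "E"})  # True
```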

Formal Languages

Very Important Concept in Formal Language Theory:

A Language is just a Set of Words.

• We use the terms “word” and “string” interchangeably.

• A Language can be empty, have finite cardinality, or be infinite in size.

• You can union, intersect and subtract languages, just like any other sets.

Union of Languages (Sets)

Language 1: { dog, cat, rat }

Language 2: { elephant, mouse }

Union of Language 1 and Language 2: { dog, cat, rat, elephant, mouse }

Intersection of Languages (Sets)

Language 1: { dog, cat, rat }

Language 2: { elephant, mouse }

Intersection of Language 1 and Language 2: { } (the empty language)

Intersection of Languages (Sets)

Language 1: { dog, cat, rat }

Language 2: { rat, mouse }

Intersection of Language 1 and Language 2: { rat }

Subtraction of Languages (Sets)

Language 1: { dog, cat, rat }

Language 2: { rat, mouse }

Language 1 minus Language 2: { dog, cat }

Languages

• A language is a set of words (=strings).

• Words (strings) are composed of symbols (letters) that are “concatenated” together.

• At another level, words are composed of “morphemes”.

• In most natural languages, we concatenate morphemes together to form whole words.

• For sets consisting of words (i.e. for Languages), the operation of concatenation is very important.

Concatenation of Languages

Root Language: { work, talk, walk }

Suffix Language: { 0, ing, ed, s }, where 0 stands for the empty string

The concatenation of the Suffix language after the Root language: { work, working, worked, works, talk, talking, talked, talks, walk, walking, walked, walks }
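
Concatenating two languages means pairing every word of the first with every word of the second. A minimal sketch, assuming the suffix written 0 is the empty string; concatenate is an illustrative name.

```python
def concatenate(first, second):
    """Return every word formed by a word of `first` followed by a word of `second`."""
    return {x + y for x in first for y in second}

roots = {"work", "talk", "walk"}
suffixes = {"", "ing", "ed", "s"}   # "" plays the role of the 0 suffix

words = concatenate(roots, suffixes)
print(len(words), sorted(words))
```

With three roots and four suffixes the result has twelve words, including the bare roots contributed by the empty suffix.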

Languages and Networks

[Diagram: Network/Language 1 accepts the roots walk, work and talk; Network/Language 2 accepts the suffixes 0, s, ed and ing; a third network shows the concatenation of Network 1 and Network 2.]

Why is “Finite State” Computing so Interesting?

• Finite-state systems are mathematically elegant, easily manipulated and modifiable.

• They are computationally efficient and usually very compact.

• The programming that linguists do is declarative: we describe facts of a natural language, i.e. we write grammars. We do not hack ad hoc code.

• Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate.

Languages, Notations and Machines

[Diagram: three linked boxes labelled FINITE STATE LANGUAGE, FINITE STATE NOTATION and FINITE STATE MACHINE]

Regular Expressions

Abstract Definition

• ∅ (the empty language) is a regular expression

• ε (the empty string) is a regular expression

• if α ∈ Σ is a letter then α is a regular expression

• if Ψ and Φ are regular expressions then so are (Ψ + Φ) and (Ψ . Φ)

• if Φ is a regular expression then so is (Φ)*

• Nothing else is a regular expression
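
The inductive definition translates into one case per clause of a recursive matcher. The sketch below is an illustration, not an efficient implementation: it assumes regular expressions are written as nested tuples, and the tags "empty", "eps", "sym", "alt", "cat" and "star" are invented for this example.

```python
def ends(rx, s, i):
    """Return every position j such that rx matches s[i:j]."""
    kind = rx[0]
    if kind == "empty":              # the empty language matches nothing
        return set()
    if kind == "eps":                # the empty string matches without moving
        return {i}
    if kind == "sym":                # a single letter of the alphabet
        return {i + 1} if s[i:i + 1] == rx[1] else set()
    if kind == "alt":                # (Ψ + Φ): either operand
        return ends(rx[1], s, i) | ends(rx[2], s, i)
    if kind == "cat":                # (Ψ . Φ): first operand, then second
        return {k for j in ends(rx[1], s, i) for k in ends(rx[2], s, j)}
    if kind == "star":               # (Φ)*: zero or more repetitions
        reached, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(rx[1], s, j)} - reached
            reached |= frontier
        return reached
    raise ValueError(f"unknown expression type: {kind}")

def matches(rx, s):
    """Return True if rx matches the whole string s."""
    return len(s) in ends(rx, s, 0)

# ((a + b)* . c): any mix of a and b, followed by one c
rx = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "c"))
print(matches(rx, "abac"), matches(rx, "ab"))  # True False
```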

Searching Text

• Analysis of written texts often involves searching for (and subsequent processing of):

– a particular word

– a particular phrase

– a particular pattern of words involving gaps

• How can we specify the things we are searching for?

Regular Expressions and Matching

• A Regular Expression is a special notation used to specify the things we want to search for.

• We use regular expressions to define patterns.

• We then implement a matching operation m(<pattern>,<text>) which tries to match the pattern against the text.

Simple Regular Expressions

• Most ordinary characters match themselves.

• For example, the pattern sing exactly matches the string sing.

• In addition, regular expressions provide us with a set of special characters

The Wildcard Symbol

• The “.” symbol is called a wildcard: it matches any single character.

• For example, the expression s.ng matches sang, sing, song, and sung.

• Note that "." will match not only alphabetic characters, but also numeric and whitespace characters.

• Consequently, s.ng will also match non-words such as s3ng.
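
The wildcard can be tried out with Python's re module. In this sketch the matching operation m is assumed to require that the whole text matches the pattern, which is one possible reading of m(<pattern>,<text>); note also that in Python "." does not match a newline unless the re.DOTALL flag is given.

```python
import re

def m(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

words = ["sang", "sing", "song", "sung", "s3ng", "s ng", "string"]
matched = [w for w in words if m("s.ng", w)]
print(matched)  # every four-character candidate matches, "string" does not
```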

Assignment

• Draw the FSM which corresponds to s.ng

Repeated Wildcards

• We can also use the wildcard symbol for counting characters. For instance ....zy matches six-character strings that end in zy.

• The pattern t... will match, among others, the words that and term.

• It will also match the word sequence to a (since the second "." in the pattern can match the space character).
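
These counting patterns can be checked with Python's re module. The word lists below are made up for illustration.

```python
import re

# t... : a t followed by exactly three more characters of any kind
four = [w for w in ["that", "term", "to a", "the", "total"]
        if re.fullmatch("t...", w)]
print(four)

# ....zy : exactly four characters followed by zy
six = [w for w in ["breezy", "snazzy", "crazy", "12 3zy"]
       if re.fullmatch("....zy", w)]
print(six)
```

In "to a" the wildcards match o, the space and a in turn, so the space is consumed by the second wildcard.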

Optionality

• The “?” symbol indicates that the immediately preceding regular expression is optional. The regular expression colou?r matches both the British and American spellings, colour and color.

Repetition

• The "+" symbol indicates that the immediately preceding expression must occur at least once and may be repeated.

• For example, the regular expression coo+l matches cool, coool, and so on.

• This symbol is particularly effective when combined with the . symbol. For example, f.+f matches all strings of length greater than two that begin and end with the letter f (e.g. foolproof).

Repetition 2

• The “*” symbol indicates that the immediately preceding expression is both optional and repeatable.

• For example .*gnt.* matches all strings that contain gnt.
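
The three quantifiers "?", "+" and "*" can be compared side by side in Python. This is a sketch using whole-string matching; the helper full and the word lists are illustrative.

```python
import re

def full(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

optional = full("colou?r", ["color", "colour", "colouur"])     # u is optional
at_least_once = full("coo+l", ["col", "cool", "coool"])        # second o repeats
wrapped = full("f.+f", ["ff", "fif", "foolproof", "fool"])     # something between two f's
contains = full(".*gnt.*", ["sovereignty", "gnt", "giant"])    # gnt anywhere

print(optional, at_least_once, wrapped, contains)
```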

Character Class

• The [ ] notation, which enumerates the set of characters to be matched, is called a character class.

• For example, we can match any English vowel, but no consonant, using [aeiou].

• We can combine the [] notation with our notation for repeatability.

• For example, the expression p[aeiou]+t matches peat, poet, and pout.
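
A quick check of the character class example in Python, using whole-string matching; the word list is made up for illustration.

```python
import re

candidates = ["peat", "poet", "pout", "pit", "pt", "part"]
vowel_words = [w for w in candidates if re.fullmatch("p[aeiou]+t", w)]
print(vowel_words)
```

"pt" fails because "+" requires at least one vowel, and "part" fails because r is not in the class.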

The Choice Operator

• Often the choices we want to describe cannot be expressed at the level of individual characters.

• In such cases we use the choice operator "|" to indicate the alternate choices.

• The operands can be any expression.

• For instance, jack|gill will match either jack or gill.

Choice Operator 2

• Note that the choice operator has wide scope, so that abc|def is a choice between abc and def, and not between abcef and abdef.

• The latter choice must be written using parentheses: ab(c|d)ef
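
The scope of the choice operator can be verified in Python. The sketch uses whole-string matching; full is an illustrative helper.

```python
import re

def full(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["abc", "def", "abcef", "abdef"]
wide = full("abc|def", words)        # choice between abc and def
narrow = full("ab(c|d)ef", words)    # choice between c and d only
names = full("jack|gill", ["jack", "gill", "jackgill"])

print(wide, narrow, names)
```

In Python's syntax a space inside a pattern is an ordinary character, so the operands are written without surrounding spaces here.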

Ranges

• The [ ] notation is used to express a set of choices between individual characters.

• Instead of listing each character, it is also possible to express a range of characters, using the - operator.

• For example, [a-z] matches any lowercase letter.

Exercise

Write regular expressions matching

• All 1 digit numbers

• All 2 digit numbers

• All date expressions such as 12/12/1950

Ranges II

• Ranges can be combined with other operators.

• For example [A-Z][a-z]* matches words that have an initial capital letter followed by any number of lowercase letters.

• Ranges can be combined as in [A-Za-z] which matches any alphabetical character.
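
Both range examples can be tried in Python. The second pattern adds a "+" to [A-Za-z] so that it matches whole words; that addition and the word lists are assumptions made for the example.

```python
import re

def full(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

capitalised = full("[A-Z][a-z]*", ["Diana", "A", "diana", "McCoy"])
alphabetic = full("[A-Za-z]+", ["McCoy", "lex2018", "syntax"])

print(capitalised, alphabetic)
```

"McCoy" fails the first pattern because of its second capital letter, and "lex2018" fails the second because digits are outside both ranges.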

Assignment

• What does the following expression match?

• [b-df-hj-np-tv-z]+

Complementation

• The character class [b-df-hj-np-tv-z] allows us to match lowercase consonants.

• However, this expression is quite cumbersome.

• A better alternative is to say: let’s match anything which isn’t a vowel.

• To do this, we need a way of expressing complementation.

Complementation 2

• We do this using the symbol “^” as the first character within the class expression [ ].

• [^aeiou] is just like our earlier character class, except now the set of vowels is preceded by ^.

• The expression as a whole is interpreted as matching anything which fails to match [aeiou]

• In other words, it matches all lowercase consonants (plus all uppercase letters and non-alphabetic characters).
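
The effect of complementation is easy to see on single characters. A small Python check, with a made-up list of candidates:

```python
import re

candidates = ["a", "b", "E", "7", " ", "u"]
not_vowel = [c for c in candidates if re.fullmatch("[^aeiou]", c)]
print(not_vowel)
```

The uppercase E, the digit and the space all pass, because the class excludes only the five lowercase vowels.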

Complementation 3

• As another example, suppose we want to match any string which is enclosed by the HTML tags for boldface, namely <B> and </B>. We might try something like this: <B>.*</B>.

• This would successfully match <B>important</B>, but it would also match the whole of <B>important</B> and <B>urgent</B>, since the .* subpattern will happily match everything from important through to urgent, including the tags in between.

Complementation 4

• One way of ensuring that we only look at matched pairs of tags would be to use the expression <B>[^<]*</B>, where the character class matches anything other than a left angle bracket.

• Finally, note that character class complementation also works with ranges. Thus [^a-z] matches anything other than the lower case alphabetic characters a through z.
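
The difference between the two boldface patterns shows up when both are run over the same text. A sketch using Python's re.findall; the sample text is taken from the example above.

```python
import re

text = "<B>important</B> and <B>urgent</B>"

greedy = re.findall(r"<B>.*</B>", text)      # .* runs on to the last </B>
careful = re.findall(r"<B>[^<]*</B>", text)  # [^<]* stops at the next <

print(greedy)
print(careful)
```

The first pattern returns one long match covering the whole text, while the second returns the two tagged strings separately.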

Other Special Symbols

• Two other important symbols are “^” and “$”, which are used to anchor matches to the beginnings or ends of lines in a file.

• Note: “^” has two quite distinct uses: it is interpreted as complementation when it occurs as the first symbol within a character class [], and as matching the beginning of lines when it occurs elsewhere in a pattern.

Special Symbols 2

• As an example, [a-z]*s$ will match words ending in s that occur at the end of a line.

• Finally, consider the pattern ^$; this matches strings where no character occurs between the beginning and the end of a line — in other words, empty lines.
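
Both anchored patterns can be tried line by line in Python. The sketch assumes the file has already been split into lines, so that "^" and "$" refer to the start and end of each line; the sample lines are made up.

```python
import re

lines = ["he sings", "", "it rains today", "birds"]

# Words ending in s at the end of a line
final_s = [re.search(r"[a-z]*s$", line).group()
           for line in lines if re.search(r"[a-z]*s$", line)]

# Indices of empty lines
empty = [i for i, line in enumerate(lines) if re.search(r"^$", line)]

print(final_s, empty)
```

"rains" is not reported because it does not stand at the end of its line.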

Special Symbols 3

• Special characters like “.”, “*”, “+” and “$” give us powerful means to generalise over character strings.

• What if we want to match against a string which itself contains one or more special characters?

• An example would be the arithmetic statement $5.00 * ($3.05 + $0.85).

• In this case, we need to resort to the so-called escape character “\” (“backslash”).

• For example, to match a dollar amount of one dollar or more, we might use \$[1-9][0-9]*\.[0-9][0-9]
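
Running the escaped pattern over the arithmetic statement shows what it does and does not find. The second pattern, which also allows a leading zero, is an assumption added for comparison and is not part of the original example.

```python
import re

statement = "$5.00 * ($3.05 + $0.85)"

# The pattern from the example: first digit must be 1-9
amounts = re.findall(r"\$[1-9][0-9]*\.[0-9][0-9]", statement)

# Hypothetical variant: any digits before the decimal point
all_amounts = re.findall(r"\$[0-9]+\.[0-9][0-9]", statement)

print(amounts)
print(all_amounts)
```

Because the first digit must be between 1 and 9, the original pattern skips $0.85; without the backslashes, "$" and "." would be read as an anchor and a wildcard instead of literal characters.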

Summary

• Regular Expressions are a special notation.

• Regular expressions describe patterns which can be matched in text.

• A particular regular expression E stands for a set of strings. We can thus say that E describes a language.

Great!

See you next time!