
40-414 Compiler Design Implementation of Lexical Analysis

Jan 24, 2022

dariahiddleston
Page 1: 40-414 Compiler Design Implementation of Lexical Analysis

1

40-414 Compiler Design

Implementation of Lexical Analysis

Lecture 3

Page 2: 40-414 Compiler Design Implementation of Lexical Analysis

Prof. Aiken 2

Notation

• There is variation in regular expression notation

• At least one: A+ ≡ AA*

• Union: A | B ≡ A + B

• Option: A + ε ≡ A?

• Range: ‘a’ + ‘b’ + … + ‘z’ ≡ [a-z]

• Excluded range: complement of [a-z] ≡ [^a-z]
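
The equivalences above can be spot-checked with a small sketch. It uses Python's `re` module, which writes union as `|` rather than `+`; the helper name `same_language` and the sample strings are illustrative assumptions, and agreeing on a handful of samples is a sanity check, not a proof.

```python
import re

def same_language(p, q, samples):
    """True if patterns p and q accept exactly the same sample strings."""
    return all(bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples)

samples = ["", "a", "aa", "aaa", "b", "ab", "z", "A", "1"]

# At least one: A+ versus AA*
at_least_one = same_language(r"a+", r"aa*", samples)
# Option: A? versus A + epsilon (an empty alternative stands in for epsilon)
option = same_language(r"a?", r"a|", samples)
# Range: [a-c] versus the explicit union 'a' + 'b' + 'c'
char_range = same_language(r"[a-c]", r"a|b|c", samples)
# Excluded range: [^a-z] accepts single characters outside a..z
excluded = all(
    bool(re.fullmatch(r"[^a-z]", s)) == (len(s) == 1 and not ("a" <= s <= "z"))
    for s in samples
)
```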

Page 3: 40-414 Compiler Design Implementation of Lexical Analysis

Regular Expressions in Lexical Specification

• Last Lecture: a specification for the predicate s ∈ L(R), where L(R) is a set of strings

• But a yes/no answer is not enough!

• Instead: partition the input into tokens, e.g. c1c2c3 | c4c5c6c7 | …

• We adapt regular expressions to this goal

Page 4: 40-414 Compiler Design Implementation of Lexical Analysis

Regular Expressions => Lexical Spec. (1)

1. Write a rexp for the lexemes of each token

• Number = digit+

• Keyword = ‘if’ + ‘else’ + …

• Identifier = letter (letter + digit)*

• OpenPar = ‘(’

• …

Page 5: 40-414 Compiler Design Implementation of Lexical Analysis


Regular Expressions => Lexical Spec. (2)

2. Construct R, matching all lexemes for all tokens

R = Keyword + Identifier + Number + …

= R1 + R2 + …

Page 6: 40-414 Compiler Design Implementation of Lexical Analysis

Regular Expressions => Lexical Spec. (3)

3. Let input be x1…xn

   For 1 ≤ i ≤ n check

   x1…xi ∈ L(R)

4. If success, then we know that

   x1…xi ∈ L(Rj) for some j

5. Remove x1…xi from input and go to (3)

Page 7: 40-414 Compiler Design Implementation of Lexical Analysis

Ambiguities (1)

• There are ambiguities in the algorithm

• How much input is used? What if

  – x1…xi ∈ L(R) and also

  – x1…xk ∈ L(R), with k ≠ i

• Rule: Pick longest possible string in L(R)

  – The “maximal munch”

Page 8: 40-414 Compiler Design Implementation of Lexical Analysis

Ambiguities (2)

• Which token is used? What if

  – x1…xi ∈ L(Rj) and also

  – x1…xi ∈ L(Rk), with j ≠ k

• Rule: use rule listed first (j if j < k)

  – Treats “if” as a keyword, not an identifier

R = R1 + R2 + R3 + …

Keyword = ‘if’ + ‘else’ + …

Identifier = letter (letter + digit)*
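
The two disambiguation rules (longest match first, then the rule listed first) can be sketched as a small tokenizer. This is a minimal illustration, not the lecture's implementation: the `Whitespace` rule and the concrete character classes are assumptions added so the example runs, and Python's `re` stands in for the automata built later in the lecture.

```python
import re

# Rules in priority order; token names follow the slides, the patterns are assumed.
RULES = [
    ("Keyword", re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number", re.compile(r"[0-9]+")),
    ("OpenPar", re.compile(r"\(")),
    ("Whitespace", re.compile(r"[ \t\n]+")),  # assumption: not on the slides
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins (maximal munch);
            # on a tie the earlier rule is kept (rule listed first).
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

tokens = tokenize("if iffy 42(")
```

Here "if" is a Keyword because of the tie-break, while "iffy" is an Identifier because the longer match wins.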

Page 9: 40-414 Compiler Design Implementation of Lexical Analysis

Error Handling

• What if no rule matches a prefix of input?

  x1…xi ∉ L(Rj)

• Problem: Can’t just get stuck …

• Solution:

– Write a rule matching all “bad” strings

– Put it last (lowest priority)
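
A sketch of the catch-all idea, under the assumption of a toy specification with only two real rules: the `Error` rule matches every character the other rules cannot start with, and it is listed last so it never wins a tie. Rule names and patterns are illustrative.

```python
import re

RULES = [
    ("Number", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[a-z][a-z0-9]*")),
    ("Error", re.compile(r"[^0-9a-z]+")),  # matches all "bad" strings; lowest priority
]

def tokenize_with_errors(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        # Every character belongs to one of the three classes above,
        # so some rule always matches and the scanner never gets stuck.
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

result = tokenize_with_errors("ab#$12")
```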

Page 10: 40-414 Compiler Design Implementation of Lexical Analysis

Summary

• Regular expressions provide a concise notation for string patterns

• Use in lexical analysis requires small extensions

  – To resolve ambiguities

  – To handle errors

• Good algorithms known

  – Require only single pass over the input

  – Few operations per character (table lookup)

Page 11: 40-414 Compiler Design Implementation of Lexical Analysis

Finite Automata

• Regular expressions = specification

• Finite automata = implementation

• A finite automaton consists of

  – An input alphabet Σ

  – A finite set of states S

  – A start state n

  – A set of accepting states F ⊆ S

  – A set of transitions state →(input) state

Page 12: 40-414 Compiler Design Implementation of Lexical Analysis

Finite Automata

• Transition

  s1 →(a) s2

• Is read

  In state s1 on input “a” go to state s2

• If end of input and in accepting state => accept

• Otherwise => reject

  – Terminates in a state s that is NOT an accepting state (s ∉ F)

  – Gets stuck

Page 13: 40-414 Compiler Design Implementation of Lexical Analysis

Finite Automata State Graphs

[Figure: the graphical symbol for each of the four items below]

• A state

• The start state

• An accepting state

• A transition (an edge labeled with its input symbol, e.g. a)

Page 14: 40-414 Compiler Design Implementation of Lexical Analysis

A Simple Example

• A finite automaton that accepts only “1”

[Figure: state graph with states A and B and a single edge labeled 1]

• Accepts ‘1’

• Rejects ‘0’

• Rejects ‘10’

Page 15: 40-414 Compiler Design Implementation of Lexical Analysis

Another Simple Example

• A finite automaton accepting any number of 1’s followed by a single 0

• Alphabet: {0,1}

[Figure: state graph with states A and B; one edge labeled 1 and one edge labeled 0]

• Accepts ‘110’

• Rejects ‘100’
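
The acceptance rule (end of input in an accepting state; reject when stuck or when ending elsewhere) can be sketched for this automaton. The transition structure is an assumption read off the description "any number of 1's followed by a single 0": A loops on 1 and moves to the accepting state B on 0.

```python
def run_dfa(transitions, start, accepting, text):
    """Simulate a DFA given as a dict {(state, symbol): next_state}."""
    state = start
    for ch in text:
        if (state, ch) not in transitions:
            return False  # gets stuck: no transition for this input
        state = transitions[(state, ch)]
    return state in accepting  # accept only if input ends in an accepting state

# Assumed structure: A --1--> A, A --0--> B, with B accepting.
TRANSITIONS = {("A", "1"): "A", ("A", "0"): "B"}

accepts_110 = run_dfa(TRANSITIONS, "A", {"B"}, "110")
accepts_100 = run_dfa(TRANSITIONS, "A", {"B"}, "100")
```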

Page 16: 40-414 Compiler Design Implementation of Lexical Analysis

And Another Example

• Alphabet {0,1}

• What language does this recognize?

[Figure: finite automaton with six edges labeled 0 and 1; the state graph is not recoverable from the transcript]

Page 17: 40-414 Compiler Design Implementation of Lexical Analysis

And Another Example

[Figure: the same finite automaton, edges labeled 0 and 1]

Select the regular expression that denotes the same language as this finite automaton

• (0 + 1)*

• (1* + 0)(1 + 0)

• 1* + (01)* + (001)* + (000*1)*

• (0 + 1)*00

Page 19: 40-414 Compiler Design Implementation of Lexical Analysis

Epsilon Moves

• Another kind of transition: ε-moves

  A →(ε) B

• Machine can move from state A to state B without reading input

Page 20: 40-414 Compiler Design Implementation of Lexical Analysis

Deterministic and Nondeterministic Automata

• Deterministic Finite Automata (DFA)

  – One transition per input per state

  – No ε-moves

• Nondeterministic Finite Automata (NFA)

  – Can have multiple transitions for one input in a given state

  – Can have ε-moves

Page 21: 40-414 Compiler Design Implementation of Lexical Analysis

Execution of Finite Automata

• A DFA can take only one path through the state graph

  – Completely determined by input

• NFAs can choose

  – Whether to make ε-moves

  – Which of multiple transitions for a single input to take

Page 28: 40-414 Compiler Design Implementation of Lexical Analysis

Acceptance of NFAs

• An NFA can get into multiple states

[Figure: NFA with states A, B and C; edges labeled 0, 1, 0, 0]

• Input: 1 0 0

• Possible States: {A} {A, B} {A, B, C}

Rule: NFA accepts if it can get to a final state
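
Tracking the set of possible states can be sketched as follows. The edge list is an assumption: the figure did not survive, so the transitions below (A loops on 0 and 1, A goes to B on 0, B goes to C on 0, C final) were chosen to be consistent with the four edge labels and with the state sets {A}, {A, B}, {A, B, C} for input 1 0 0.

```python
def nfa_state_sets(delta, start, text):
    """Return the set of possible NFA states after each input symbol (no ε-moves)."""
    current = {start}
    history = []
    for ch in text:
        current = {t for s in current for t in delta.get((s, ch), set())}
        history.append(current)
    return history

# Assumed transitions, consistent with the state sets on the slide.
DELTA = {
    ("A", "1"): {"A"},
    ("A", "0"): {"A", "B"},
    ("B", "0"): {"C"},
}
FINALS = {"C"}

history = nfa_state_sets(DELTA, "A", "100")
accepted = bool(history[-1] & FINALS)  # accepts if it can get to a final state
```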

Page 29: 40-414 Compiler Design Implementation of Lexical Analysis

NFA vs. DFA (1)

• NFAs and DFAs recognize the same set of languages (regular languages)

• DFAs are faster to execute

  – There are no choices to consider

• NFAs are, in general, smaller

  – Sometimes exponentially smaller

Page 30: 40-414 Compiler Design Implementation of Lexical Analysis

NFA vs. DFA (2)

• For a given language NFA can be simpler than DFA

[Figure: an NFA and a DFA for the same language, edges labeled 0 and 1]

• DFA can be exponentially larger than NFA

Page 31: 40-414 Compiler Design Implementation of Lexical Analysis

Regular Expressions to Finite Automata

• High-level sketch

Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA

Page 32: 40-414 Compiler Design Implementation of Lexical Analysis

Regular Expressions to NFA (1)

• For each kind of rexp, define an NFA

  – Notation: NFA for rexp M [Figure: box labeled M]

• For ε [Figure: a single ε-move]

• For input a [Figure: a single transition labeled a]

Page 33: 40-414 Compiler Design Implementation of Lexical Analysis

Regular Expressions to NFA (2)

• For AB [Figure: the NFA for A followed by the NFA for B]

• For A + B [Figure: the NFAs for A and B as two alternative branches]

Page 34: 40-414 Compiler Design Implementation of Lexical Analysis

Regular Expressions to NFA (3)

• For A* [Figure: the NFA for A inside a loop that can also be skipped]

Page 35: 40-414 Compiler Design Implementation of Lexical Analysis

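Example of RegExp -> NFA conversion

• Consider the regular expression

  (1+0)*1

• The NFA is

[Figure: NFA with states A to J; labeled transitions C →(1) E, D →(0) F and I →(1) J, with the remaining edges unlabeled in the transcript]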

Page 36: 40-414 Compiler Design Implementation of Lexical Analysis

NFA to DFA: The Trick

• Simulate the NFA

• Each state of DFA = a non-empty subset of states of the NFA

• Start state = ε-closure of the start state of NFA

• Add a transition S →(a) S’ to DFA iff

  – S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well

• Final states = subsets that include at least one final state of NFA

Page 37: 40-414 Compiler Design Implementation of Lexical Analysis

ε-closure of a state

ε-closure(B) = {B, C, D}

ε-closure(G) = {A, B, C, D, G, H, I}
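
The ε-closure can be computed with a simple graph search. The ε-edges in `EPS` are an assumption: the NFA figure is lost, so they were reconstructed to be consistent with the two closures listed here and with the usual construction for (1+0)*1.

```python
def epsilon_closure(eps, states):
    """All states reachable from `states` using only ε-moves (including themselves)."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

# Assumed ε-moves of the NFA for (1+0)*1 (reconstructed, not taken from the figure).
EPS = {
    "A": {"B", "H"},
    "B": {"C", "D"},
    "E": {"G"},
    "F": {"G"},
    "G": {"A"},
    "H": {"I"},
}

closure_b = epsilon_closure(EPS, {"B"})
closure_g = epsilon_closure(EPS, {"G"})
```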

Page 38: 40-414 Compiler Design Implementation of Lexical Analysis

NFA -> DFA Example

[Figure: the NFA for (1+0)*1 with states A to J; edges labeled 1, 0 and 1]

Page 51: 40-414 Compiler Design Implementation of Lexical Analysis

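NFA -> DFA Example

[Figure: the NFA with states A to J (edges labeled 1, 0 and 1) and the resulting DFA with states ABCDHI, FGHIABCD and EJGHIABCD; the DFA has six edges, three labeled 0 and three labeled 1]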
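
The construction can be sketched as a worklist algorithm over subsets of NFA states. As before, the NFA edges are an assumption reconstructed from the transcript (the figure is lost); with them, the sketch yields the three subsets named on the slide.

```python
from collections import deque

# Assumed NFA for (1+0)*1: ε-moves and labeled moves (reconstructed).
EPS = {"A": {"B", "H"}, "B": {"C", "D"}, "E": {"G"}, "F": {"G"}, "G": {"A"}, "H": {"I"}}
SYM = {("C", "1"): {"E"}, ("D", "0"): {"F"}, ("I", "1"): {"J"}}
NFA_START, NFA_FINALS = "A", {"J"}

def closure(states):
    """ε-closure of a set of NFA states, as a hashable frozenset."""
    result, stack = set(states), list(states)
    while stack:
        for t in EPS.get(stack.pop(), ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return frozenset(result)

def subset_construction(alphabet="01"):
    start = closure({NFA_START})
    table, seen, work = {}, {start}, deque([start])
    while work:
        subset = work.popleft()
        for a in alphabet:
            moved = {t for s in subset for t in SYM.get((s, a), ())}
            target = closure(moved)
            if not target:
                continue  # DFA states are non-empty subsets only
            table[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    # A subset is final if it contains at least one final NFA state.
    finals = {s for s in seen if s & NFA_FINALS}
    return start, table, finals

start, table, finals = subset_construction()
```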

Page 52: 40-414 Compiler Design Implementation of Lexical Analysis

Implementation

• A DFA can be implemented by a 2D table T

  – One dimension is “states”

  – Other dimension is “input symbol”

  – For every transition Si →(a) Sk define T[i,a] = k

• DFA “execution”

  – If in state Si and input a, read T[i,a] = k and skip to state Sk

  – Very efficient

Page 53: 40-414 Compiler Design Implementation of Lexical Analysis

Table Implementation of a DFA

[Figure: DFA with states S, T and U; edges labeled 0 and 1]

      0   1
S     T   U
T     T   U
U     T   U
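
The table above translates directly into a lookup loop. One assumption is needed: the transcript does not say which state is accepting, so the sketch treats U as accepting (it is the state reached on every 1, matching a DFA for strings ending in 1).

```python
SYMBOLS = "01"            # column order of the table
STATES = ["S", "T", "U"]  # row order of the table

# T[i][a] = k, taken from the table: every state goes to T on 0 and to U on 1.
T = [
    [1, 2],  # S
    [1, 2],  # T
    [1, 2],  # U
]
ACCEPTING = {2}  # assumption: U is the accepting state

def accepts(text):
    state = 0  # start in S
    for ch in text:
        # One table lookup per character; raises ValueError on symbols outside {0,1}.
        state = T[state][SYMBOLS.index(ch)]
    return state in ACCEPTING

ends_in_one = accepts("101")
ends_in_zero = accepts("10")
```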

Page 54: 40-414 Compiler Design Implementation of Lexical Analysis


Implementation (Cont.)

• NFA -> DFA conversion is at the heart of tools such as flex

• But, DFAs can be huge

• In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations

Page 55: 40-414 Compiler Design Implementation of Lexical Analysis

DFA for recognizing two relational operators

• start → state 0

• 0 →(>) 6

• 6 →(=) 7: return(SYMBOL, >=)

• 6 →(other) 8*: return(SYMBOL, >)

* We’ve accepted “>” and have read an “other” character that must be unread, that is, the input pointer is moved one character back.

Page 56: 40-414 Compiler Design Implementation of Lexical Analysis

DFA of Pascal relational operators

• start → state 0

• 0 →(<) 1

• 1 →(=) 2: return(SYMBOL, <=)

• 1 →(>) 3: return(SYMBOL, <>)

• 1 →(other) 4*: return(SYMBOL, <)

• 0 →(=) 5: return(SYMBOL, =)

• 0 →(>) 6

• 6 →(=) 7: return(SYMBOL, >=)

• 6 →(other) 8*: return(SYMBOL, >)
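
A hand-coded scanner for these operators might look like the sketch below. The function name `scan_relop`, the `forward` index and the use of an empty string for end of input are illustrative assumptions; the point is the `forward -= 1` step, which unreads the "other" character in the starred states.

```python
def scan_relop(text, start=0):
    """Return ((SYMBOL, lexeme), next_position), or None if no operator starts here."""
    forward = start

    def next_char():
        nonlocal forward
        ch = text[forward] if forward < len(text) else ""  # "" plays the role of eof
        forward += 1
        return ch

    ch = next_char()
    if ch == "<":
        ch = next_char()
        if ch == "=":
            lexeme = "<="
        elif ch == ">":
            lexeme = "<>"
        else:
            forward -= 1  # starred state: unread the "other" character
            lexeme = "<"
    elif ch == "=":
        lexeme = "="
    elif ch == ">":
        ch = next_char()
        if ch == "=":
            lexeme = ">="
        else:
            forward -= 1  # starred state: unread the "other" character
            lexeme = ">"
    else:
        return None
    return ("SYMBOL", lexeme), forward

le = scan_relop("<=x")
lt = scan_relop("<x")
```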

Page 57: 40-414 Compiler Design Implementation of Lexical Analysis

DFA for recognizing id and keyword

• start → state 0

• 0 →(letter) 9

• 9 →(letter or digit) 9

• 9 →(other) 10*: return(get_token(), install_id())

• get_token() returns either a KEYWORD or ID based on the type of the token

• install_id(): if the token is an ID, its lexeme is inserted into the symbol table (only one record for each lexeme), and the lexeme of the token is returned
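
One way to realize this is sketched below. The names `get_token` and `install_id` come from the slide; their bodies, the keyword set and the list-based symbol table are assumptions. The loop stops before the "other" character instead of reading and unreading it, which has the same effect as the starred state.

```python
KEYWORDS = {"if", "else", "then"}  # hypothetical keyword set
symbol_table = []                  # one record per distinct identifier lexeme

def get_token(lexeme):
    """Return KEYWORD or ID based on the type of the token."""
    return "KEYWORD" if lexeme in KEYWORDS else "ID"

def install_id(lexeme):
    """Insert an identifier's lexeme into the symbol table once; return the lexeme."""
    if lexeme not in KEYWORDS and lexeme not in symbol_table:
        symbol_table.append(lexeme)
    return lexeme

def scan_word(text, start=0):
    if start >= len(text) or not text[start].isalpha():
        return None  # state 0 has no transition on a non-letter
    forward = start + 1
    while forward < len(text) and text[forward].isalnum():  # state 9 loop
        forward += 1
    lexeme = text[start:forward]
    return (get_token(lexeme), install_id(lexeme)), forward

first = scan_word("rate*60")
second = scan_word("if x")
third = scan_word("rate")
```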

Page 58: 40-414 Compiler Design Implementation of Lexical Analysis

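DFA of Pascal Unsigned Numbers

[Figure: DFA with states 0 and 11 to 17. Main path: start → 0 →(digit) 11 →(.) 12 →(digit) 13 →(E) 14 →(+ | -) 15 →(digit) 16 →(other) 17*. The figure also has four more edges labeled digit, one labeled E and two labeled other, whose endpoints are not recoverable from the transcript.]

17*: return(NUM, lexeme of the number)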
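
A hand-written recognizer for the same kind of lexeme is sketched below, assuming the pattern digit+ (. digit+)? (E (+|-)? digit+)?. It is not a transcription of the figure: it returns the longest valid prefix and simply leaves an incomplete fraction or exponent unread, which is one possible design choice.

```python
def scan_unsigned(text, start=0):
    """Return ((NUM, lexeme), next_position), or None if no number starts here."""
    n = len(text)

    def skip_digits(i):
        while i < n and "0" <= text[i] <= "9":
            i += 1
        return i

    end = skip_digits(start)
    if end == start:
        return None  # a number must begin with a digit

    # Optional fraction: '.' followed by at least one digit.
    if end < n and text[end] == ".":
        after = skip_digits(end + 1)
        if after > end + 1:
            end = after

    # Optional exponent: 'E', optional sign, at least one digit.
    if end < n and text[end] == "E":
        k = end + 1
        if k < n and text[k] in "+-":
            k += 1
        after = skip_digits(k)
        if after > k:
            end = after

    return ("NUM", text[start:end]), end

full = scan_unsigned("12.5E-3+")
no_fraction = scan_unsigned("12.")
```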

Page 59: 40-414 Compiler Design Implementation of Lexical Analysis

Lexical errors

• Some errors are beyond the power of the lexical analyzer to recognize:

  fi (a == f(x)) …

• However, it may be able to recognize errors like:

  d = 2r

• Such errors are recognized when no token pattern matches a character sequence

Page 60: 40-414 Compiler Design Implementation of Lexical Analysis

Error recovery

• Panic mode: successive characters are ignored until we reach a well-formed token

• Delete one character from the remaining input

• Insert a missing character into the remaining input

• Replace a character by another character

• Transpose two adjacent characters

Page 61: 40-414 Compiler Design Implementation of Lexical Analysis

Lexical Analyzer in Perspective (Revisited)

[Figure: source program → lexical analyzer → parser → output; the parser sends “get next token” to the lexical analyzer, which returns a token <type, attribute>; both use the symbol table]

position = initial + rate * 60

<id, 1> <op, => <id, 2> <op, +> <id, 3> <op, *> <num, 60>

Symbol Table

key   lexeme     type   …
1     position   real   …
2     initial    real   …
3     rate       real   …

Page 62: 40-414 Compiler Design Implementation of Lexical Analysis

Using Buffer to Enhance Efficiency

[Figure: input buffer in two halves, each filled by block I/O, holding “E = M * | C * * 2 eof”; the lexeme beginning pointer marks the start of the current token and the forward pointer scans ahead to find a pattern match]

if forward at end of first half then begin
    reload second half ;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half ;
    move forward to beginning of first half
end
else forward := forward + 1 ;

Page 63: 40-414 Compiler Design Implementation of Lexical Analysis

Algorithm: Buffered I/O with Sentinels

[Figure: the same two-half buffer with a sentinel eof closing each half, holding “E = M * eof | C * * 2 eof eof”; each half is filled by block I/O, with lexeme beginning and forward pointers as before. The second eof in the right half means there is no more input.]

forward := forward + 1 ;
if forward is at eof then begin
    if forward at end of first half then begin
        reload second half ;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half ;
        move forward to beginning of first half
    end
    else /* eof within buffer signifying end of input */
        terminate lexical analysis
end
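
The sentinel scheme can be sketched in Python as below. The class name `SentinelBuffer`, the tiny half size and the use of "\0" as the eof sentinel are assumptions for illustration; the sketch also adds one check the pseudocode leaves to the next iteration (a freshly reloaded half that is empty).

```python
import io

EOF = "\0"  # sentinel character, assumed never to occur in the source text

class SentinelBuffer:
    def __init__(self, stream, half_size=4):
        self.stream = stream
        self.n = half_size
        # Layout: [half 1: n chars][EOF][half 2: n chars][EOF]
        self.buf = [EOF] * (2 * self.n + 2)
        self._load(0)
        self.forward = -1  # the first advance() moves onto index 0

    def _load(self, start):
        """Block-read one half; pad with EOF if the input runs out."""
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def advance(self):
        """Move forward one character and return it, or None at end of input."""
        self.forward += 1
        if self.buf[self.forward] == EOF:  # the only test on the common path
            if self.forward == self.n:  # sentinel at end of first half
                self._load(self.n + 1)
                self.forward += 1
            elif self.forward == 2 * self.n + 1:  # sentinel at end of second half
                self._load(0)
                self.forward = 0
            else:
                return None  # eof within buffer: end of input
            if self.buf[self.forward] == EOF:
                return None  # reloaded half is empty: no more input
        return self.buf[self.forward]

reader = SentinelBuffer(io.StringIO("E = M*C**2"), half_size=4)
chars = []
while True:
    ch = reader.advance()
    if ch is None:
        break
    chars.append(ch)
text_read = "".join(chars)
```

Compared with the two-test version on the previous slide, the common case performs a single comparison per character.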

Page 64: 40-414 Compiler Design Implementation of Lexical Analysis

Question?

Consider the string abbbaacc. Which of the following lexical specifications produces the tokenization ab/bb/a/acc?

Choose all that apply

• c*, b+, ab, ac*

• a(b + c*), b+

• ab, b+, ac*

• b+, ab*, ac*

Page 65: 40-414 Compiler Design Implementation of Lexical Analysis

Question?

Using the lexical specification below, how is the string “dictatorial” tokenized?

dict (1)
dictator (2)
[a-z]* (3)
dictatorial (4)

Choose all that apply

• 1, 3

• 3

• 4

• 2, 3

Page 66: 40-414 Compiler Design Implementation of Lexical Analysis

Question?

Given the following lexical specification:

a(ba)*
b*(ab)*
abd
d+

Which of the following statements is true?

Choose all that apply

• babad will be tokenized as: bab/a/d

• ababdddd will be tokenized as: abab/dddd

• dddabbabab will be tokenized as: ddd/a/bbabab

• ababddababa will be tokenized as: ab/abd/d/ababa

Page 67: 40-414 Compiler Design Implementation of Lexical Analysis

Question?

Given the following lexical specification:

(00)*
01+
10+

Which strings are NOT successfully processed by this specification?

Choose all that apply

• 011110

• 01100100

• 01100110

• 0001101

Page 68: 40-414 Compiler Design Implementation of Lexical Analysis

Question?

Which of the following regular expressions generate the same language as the one recognized by this NFA?

[Figure: NFA, not recoverable from the transcript]

Choose all that apply

• (000)*(01)+

• 0(000)*1(01)*

• (000)*(10)+

• 0(00)*(10)*

• 0(000)*(01)*

Page 69: 40-414 Compiler Design Implementation of Lexical Analysis

Question?

Which of the following automata are DFA?

Choose all that apply

[Figure: candidate automata, not recoverable from the transcript]

Page 70: 40-414 Compiler Design Implementation of Lexical Analysis

Question?

Which of the following automata are NFA?

Choose all that apply

[Figure: candidate automata, not recoverable from the transcript]