CS502: Compiler Design Syntax Analysis (Cont.) Manas Thakur Fall 2020

Jan 27, 2022

Page 1: Syntax Analysis (Cont.) Manas Thakur

CS502: Compiler Design

Syntax Analysis (Cont.)

Manas Thakur

Fall 2020

Page 2: Syntax Analysis (Cont.) Manas Thakur


What next?

● Bottom-up parsing

● Why?

– BU parsers are more powerful than TD parsers

● Cover more kinds of grammars (e.g., no need to eliminate left recursion)

– More efficient as well

● Bad news: Slightly more complicated

● Good news: Well known parser generators exist

Seems like winter would never end!

parsing

Page 3: Syntax Analysis (Cont.) Manas Thakur


Bottom-Up Parsing

● Given a string, construct a parse tree by starting at the leaves and walking up to the root.

● The process is called reduction.

– Reduce a string w to the start symbol of the grammar.

– Recall derivation from top-down parsing?

Page 4: Syntax Analysis (Cont.) Manas Thakur


Reduction

● At each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of the production.

● Basically we are constructing a rightmost derivation in reverse!

● How to decide which substring to reduce?

Grammar:
  E → E+T | T
  T → T*F | F
  F → id

Reduction steps for id * id (the production used to reach each form is shown on the right):

  id * id
  F * id      (F → id)
  T * id      (T → F)
  T * F       (F → id)
  T           (T → T*F)
  E           (E → T)

Page 5: Syntax Analysis (Cont.) Manas Thakur


Handle pruning

● A handle is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation.

● Theorem: If G is unambiguous, then every right-sentential form has a unique handle.

● Why did we say “a handle” instead of “the handle”?

● BU parsing is essentially the problem of handle pruning.

Right Sentential Form    Handle    Reducing Production
id1 * id2                id1       F -> id
F * id2                  F         T -> F
T * id2                  id2       F -> id
T * F                    T * F     T -> T * F
T                        T         E -> T

Page 6: Syntax Analysis (Cont.) Manas Thakur


Shift-Reduce Parsing

● Uses a stack to perform bottom-up parsing

● Four actions:

– Shift: shift the next input symbol on top of stack

– Reduce: pop handle off the stack and push the corresponding non-terminal

– Accept: parsing successful

– Error: parsing failed

● The standard scheme used by LR parsers.

– L: left-to-right scanning of the input; R: rightmost derivation (constructed in reverse)

Page 7: Syntax Analysis (Cont.) Manas Thakur


LR Parsing Example

● A table guides the actions, based on the top of the stack and the next input symbol.

Stack          Input            Action
$              id1 * id2 $      shift
$ id1          * id2 $          reduce by F -> id
$ F            * id2 $          reduce by T -> F
$ T            * id2 $          shift
$ T *          id2 $            shift
$ T * id2      $                reduce by F -> id
$ T * F        $                reduce by T -> T * F
$ T            $                reduce by E -> T
$ E            $                accept

The job of every LR parser-construction algorithm is to construct the “action” table.

Page 8: Syntax Analysis (Cont.) Manas Thakur


LR Parsing Algorithms

● Simple LR or SLR

– Smallest class of grammars

– Smallest tables

– Simple, fast construction

● Canonical LR or CLR

– Largest set of grammars

– Largest tables

– Slow construction

● LookAhead LR or LALR

– Intermediate set of grammars

– Same number of states as SLR

– Faster construction than CLR

Page 9: Syntax Analysis (Cont.) Manas Thakur


LR(k) Items

● An LR(k) item is a pair [α, β], where

– α is a production with a • at some position in the RHS, marking how much of the RHS has been seen

– β is a lookahead string containing k symbols (terminals or $)

● Two cases of interest:

– LR(0) items for SLR table construction

– LR(1) items for CLR and LALR table construction

Page 10: Syntax Analysis (Cont.) Manas Thakur


Example of LR(0) items

● A → XYZ generates four LR(0) items:

– [A → •XYZ]

– [A → X•YZ]

– [A → XY•Z]

– [A → XYZ•]

● [A → •XYZ] indicates that the parser is looking for a string that can be derived from XYZ

● [A → XY•Z] indicates that the parser has seen a string derived from XY and is looking for one derivable from Z
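
The four items above can be enumerated mechanically by sliding the dot across the body of the production. The following Python sketch shows one possible encoding; the (head, body, dot) triple and the names lr0_items and show are assumptions made for this illustration, not something taken from the slides.

```python
def lr0_items(head, body):
    """Return all LR(0) items of head -> body as (head, body, dot) triples."""
    body = tuple(body)
    # The dot may sit before any symbol, or after the last one.
    return [(head, body, dot) for dot in range(len(body) + 1)]


def show(item):
    """Render an item in the bracketed notation used on the slide."""
    head, body, dot = item
    return "[" + head + " → " + "".join(body[:dot]) + "•" + "".join(body[dot:]) + "]"


items = lr0_items("A", ["X", "Y", "Z"])
for item in items:
    print(show(item))
```

A production with n symbols in its body always yields n + 1 items, which is why A → XYZ yields four.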

Page 11: Syntax Analysis (Cont.) Manas Thakur


CLOSURE

● Given an item [A → α • Bβ ], its closure contains the item and any other items that can generate legal substrings to follow α.

function CLOSURE(I)
  repeat
    if [A → α • Bβ] ∈ I
      add [B → •γ] to I, for every production B → γ
  until no more items can be added to I
  return I

Grammar G’ with an augmented production:
  E’ → E
  E → E+T | T
  T → T*F | F
  F → (E) | id

I = {[E’ → •E]}

CLOSURE(I) = I0:
  E’ → •E
  E → •E+T
  E → •T
  T → •T*F
  T → •F
  F → •(E)
  F → •id
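
As a rough illustration, CLOSURE can be written as a fixed-point loop over a set of items. This is a minimal Python sketch, assuming items are encoded as (head, body, dot) triples and the grammar as a list of (head, body) pairs; the names GRAMMAR and closure are choices made here for illustration only.

```python
# Augmented expression grammar G' from the slide, one (head, body) pair per production.
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]


def closure(items):
    """Repeat until no more items can be added, as in the pseudocode above."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            if dot == len(body):
                continue  # dot at the end: nothing follows it
            symbol = body[dot]
            # For a terminal, no production has it as head, so nothing is added.
            for h, b in GRAMMAR:
                if h == symbol and (h, b, 0) not in result:
                    result.add((h, b, 0))
                    changed = True
    return result


I0 = closure({("E'", ("E",), 0)})
for head, body, dot in sorted(I0):
    print(head, "→", " ".join(body[:dot] + ("•",) + body[dot:]))
```

Running it on {[E’ → •E]} yields the seven items of I0 (printed in sorted order rather than slide order).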

Page 12: Syntax Analysis (Cont.) Manas Thakur


GOTO

● Let I be a set of LR(0) items and X be a grammar symbol. Then, GOTO(I, X) is the closure of the set of all items

– [A → αX•β] such that [A → α•Xβ] ∈ I

Grammar:
  E’ → E
  E → E+T | T
  T → T*F | F
  F → (E) | id

I0:
  E’ → •E
  E → •E+T
  E → •T
  T → •T*F
  T → •F
  F → •(E)
  F → •id

GOTO(I0, E) = I1:
  E’ → E•
  E → E•+T

Classwork: Construct GOTO(I1, +).
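
GOTO can be sketched in the same style: advance the dot over X in every item that allows it, then take the closure. The Python below is a minimal sketch under the same assumed encoding ((head, body, dot) triples; the names GRAMMAR, closure and goto are illustrative), and it only reproduces the GOTO(I0, E) example from the slide.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]


def closure(items):
    """LR(0) closure of a set of (head, body, dot) items."""
    result = set(items)
    work = list(items)
    while work:
        head, body, dot = work.pop()
        if dot < len(body):
            for h, b in GRAMMAR:
                if h == body[dot] and (h, b, 0) not in result:
                    result.add((h, b, 0))
                    work.append((h, b, 0))
    return result


def goto(items, symbol):
    """Move the dot over `symbol` wherever possible, then close the result."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)


I0 = closure({("E'", ("E",), 0)})
I1 = goto(I0, "E")
print(sorted(I1))
```

Here I1 contains exactly [E’ → E•] and [E → E•+T]; the closure step adds nothing because the symbol after the dot is either absent or the terminal +.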

Page 13: Syntax Analysis (Cont.) Manas Thakur

LR(0) Automaton

Grammar:
  E’ → E
  E → E+T | T
  T → T*F | F
  F → (E) | id

States:
  I0:  E’ → •E;  E → •E+T;  E → •T;  T → •T*F;  T → •F;  F → •(E);  F → •id
  I1:  E’ → E•;  E → E•+T
  I2:  E → T•;  T → T•*F
  I3:  T → F•
  I4:  F → (•E);  E → •E+T;  E → •T;  T → •T*F;  T → •F;  F → •(E);  F → •id
  I5:  F → id•
  I6:  E → E+•T;  T → •T*F;  T → •F;  F → •(E);  F → •id
  I7:  T → T*•F;  F → •(E);  F → •id
  I8:  E → E•+T;  F → (E•)
  I9:  E → E+T•;  T → T•*F
  I10: T → T*F•
  I11: F → (E)•

Transitions:
  From I0: E to I1, T to I2, F to I3, ( to I4, id to I5
  From I1: + to I6, $ to accept
  From I2: * to I7
  From I4: E to I8, T to I2, F to I3, ( to I4, id to I5
  From I6: T to I9, F to I3, ( to I4, id to I5
  From I7: F to I10, ( to I4, id to I5
  From I8: + to I6, ) to I11
  From I9: * to I7


Page 15: Syntax Analysis (Cont.) Manas Thakur

Before we reach SLR

● We can build a parser simpler than SLR, using only LR(0) item sets, for the following grammar:
  E’ → E
  E → E+T | T
  T → (E) | id

States:
  I0: E’ → •E;  E → •E+T;  E → •T;  T → •(E);  T → •id
  I1: E’ → E•;  E → E•+T
  I2: E → T•
  I3: T → id•
  I4: T → (•E);  E → •E+T;  E → •T;  T → •(E);  T → •id
  I5: E → E+•T;  T → •(E);  T → •id
  I6: E → E•+T;  T → (E•)
  I7: E → E+T•
  I8: T → (E)•

Transitions:
  From I0: E to I1, T to I2, id to I3, ( to I4
  From I1: + to I5, $ to accept
  From I4: E to I6, T to I2, id to I3, ( to I4
  From I5: T to I7, id to I3, ( to I4
  From I6: + to I5, ) to I8

Page 16: Syntax Analysis (Cont.) Manas Thakur

Constructing LR(0) parsing table

● Construct the LR(0) item sets for G’

– G’ is G with an augmented start production S’ → S

● State i is constructed using set Ii

– [A → α•aβ] ∈ Ii and GOTO(Ii, a) = Ij ⇒ ACTION[i, a] ← “shift j”, ∀a ≠ $

– [A → α•] ∈ Ii, A ≠ S’ ⇒ ACTION[i, a] ← “reduce A → α”, ∀a

– [S’ → S•] ∈ Ii ⇒ ACTION[i, $] ← “accept”

● GOTO(Ii, A) = Ij ⇒ GOTO[i, A] ← j

● Set undefined entries in ACTION and GOTO to “error”

● Initial state of parser is the one constructed from CLOSURE([S’ → •S])

Page 17: Syntax Analysis (Cont.) Manas Thakur

LR(0) Parsing Table

Grammar:
  E’ → E
  E → E + T | T
  T → (E) | id

State   id         +          (          )          $          E    T
0       s3                    s4                               1    2
1                  s5                               accept
2       r(E→T)     r(E→T)     r(E→T)     r(E→T)     r(E→T)
3       r(T→id)    r(T→id)    r(T→id)    r(T→id)    r(T→id)
4       s3                    s4                               6    2
5       s3                    s4                                    7
6                  s5                    s8
7       r(E→E+T)   r(E→E+T)   r(E→E+T)   r(E→E+T)   r(E→E+T)
8       r(T→(E))   r(T→(E))   r(T→(E))   r(T→(E))   r(T→(E))

Page 18: Syntax Analysis (Cont.) Manas Thakur

Need for more powerful LR parsers

● LR(0) is too simple to cover many grammars.

● Doesn’t cover even our expression grammar:
  E’ → E
  E → E+T | T
  T → T*F | F
  F → (E) | id

● Recall the giant automaton:

– e.g.: s7 or r(E→T) on (I2, *)

– Called a shift-reduce conflict

– Similarly we can have reduce-reduce conflicts

– Multiply defined entries imply the grammar is not LR(0)

● Further reading: Section 4.5.4 (DB)

● Reason: LR(0) automata do not know on what next symbol to reduce, and end up adding too many reduce actions conservatively.
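
To make the conflict concrete, the sketch below builds the canonical collection of LR(0) item sets for both grammars seen so far and flags states that would receive multiply defined entries. It is a minimal Python sketch under assumptions made here: items are (head, body, dot) triples, the first production of each list is the augmented one, and the helper names (canonical_collection, lr0_conflicts) are illustrative.

```python
SIMPLE = [("E'", ("E",)), ("E", ("E", "+", "T")), ("E", ("T",)),
          ("T", ("(", "E", ")")), ("T", ("id",))]
EXPR = [("E'", ("E",)), ("E", ("E", "+", "T")), ("E", ("T",)),
        ("T", ("T", "*", "F")), ("T", ("F",)),
        ("F", ("(", "E", ")")), ("F", ("id",))]


def closure(grammar, items):
    result, work = set(items), list(items)
    while work:
        head, body, dot = work.pop()
        if dot < len(body):
            for h, b in grammar:
                if h == body[dot] and (h, b, 0) not in result:
                    result.add((h, b, 0))
                    work.append((h, b, 0))
    return frozenset(result)


def goto(grammar, items, symbol):
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(grammar, moved)


def canonical_collection(grammar):
    """All item sets reachable from CLOSURE([S' -> •S])."""
    states = [closure(grammar, {(grammar[0][0], grammar[0][1], 0)})]
    for state in states:  # the list grows while we iterate over it
        for x in sorted({b[d] for _, b, d in state if d < len(b)}):
            target = goto(grammar, state, x)
            if target not in states:
                states.append(target)
    return states


def lr0_conflicts(grammar, states):
    """Indices of states with a shift-reduce or reduce-reduce conflict."""
    nonterminals = {h for h, _ in grammar}
    augmented = grammar[0][0]
    bad = []
    for i, state in enumerate(states):
        reduces = [(h, b) for h, b, d in state if d == len(b) and h != augmented]
        shifts = [b[d] for _, b, d in state if d < len(b) and b[d] not in nonterminals]
        if len(reduces) > 1 or (reduces and shifts):
            bad.append(i)
    return bad


simple_states = canonical_collection(SIMPLE)
expr_states = canonical_collection(EXPR)
print(len(simple_states), lr0_conflicts(SIMPLE, simple_states))
print(len(expr_states), len(lr0_conflicts(EXPR, expr_states)))
```

State numbers produced by this sketch depend on exploration order and need not match the slide's I0, I1, ... labels. For the expression grammar, the two flagged states are the ones containing [E → T•] and [E → E+T•] next to [T → T•*F].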

Page 19: Syntax Analysis (Cont.) Manas Thakur

Constructing SLR parsing table

● Construct the LR(0) item sets for G’

– G’ is G with an augmented start production S’ → S

● State i is constructed using set Ii

– [A → α•aβ] ∈ Ii and GOTO(Ii, a) = Ij ⇒ ACTION[i, a] ← “shift j”, ∀a ≠ $

– [A → α•] ∈ Ii, A ≠ S’ ⇒ ACTION[i, a] ← “reduce A → α”, ∀a ∈ FOLLOW(A)

– [S’ → S•] ∈ Ii ⇒ ACTION[i, $] ← “accept”

● GOTO(Ii, A) = Ij ⇒ GOTO[i, A] ← j

● Set undefined entries in ACTION and GOTO to “error”

● Initial state of parser s0 is the one constructed from CLOSURE([S’ → •S])

The restriction of reduce actions to a ∈ FOLLOW(A) is the only addition w.r.t. the LR(0) algorithm!

Page 20: Syntax Analysis (Cont.) Manas Thakur

SLR Parsing Table

Grammar:
  E’ → E
  E → E + T | T
  T → T * F | F
  F → (E) | id

FOLLOW(E) = {+, ), $}
FOLLOW(T) = {+, *, ), $}
FOLLOW(F) = {+, *, ), $}

State   id         +          *          (          )          $          E    T    F
0       s5                               s4                               1    2    3
1                  s6                                          accept
2                  r(E→T)     s7                    r(E→T)     r(E→T)
3                  r(T→F)     r(T→F)                r(T→F)     r(T→F)
4       s5                               s4                               8    2    3
5                  r(F→id)    r(F→id)               r(F→id)    r(F→id)
6       s5                               s4                                    9    3
7       s5                               s4                                         10
8                  s6                               s11
9                  r(E→E+T)   s7                    r(E→E+T)   r(E→E+T)
10                 r(T→T*F)   r(T→T*F)              r(T→T*F)   r(T→T*F)
11                 r(F→(E))   r(F→(E))              r(F→(E))   r(F→(E))
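
The FOLLOW sets used for this table can be recomputed with the usual fixed-point iteration. The sketch below is a simplified Python version that assumes the grammar has no ε-productions (true for this expression grammar, but not in general); the names GRAMMAR, first_sets and follow_sets are illustrative.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def first_sets():
    """FIRST of each non-terminal; without ε only the first body symbol matters."""
    first = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            x = body[0]
            new = first[x] if x in NONTERMINALS else {x}
            if not new <= first[head]:
                first[head] |= new
                changed = True
    return first


def follow_sets(first):
    follow = {n: set() for n in NONTERMINALS}
    follow["E'"].add("$")  # end marker follows the augmented start symbol
    changed = True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            for i, sym in enumerate(body):
                if sym not in NONTERMINALS:
                    continue
                if i + 1 < len(body):
                    nxt = body[i + 1]
                    new = first[nxt] if nxt in NONTERMINALS else {nxt}
                else:
                    new = follow[head]  # sym ends the body: inherit FOLLOW(head)
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow


FOLLOW = follow_sets(first_sets())
for name in ("E", "T", "F"):
    print(name, sorted(FOLLOW[name]))
```

Because * is in FOLLOW(T) but not in FOLLOW(E), state 2 reduces by E → T only on +, ) and $, and is free to shift on *.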

Page 21: Syntax Analysis (Cont.) Manas Thakur

SLR Parsing Example

Grammar:
  E’ → E
  E → E + T | T
  T → T * F | F
  F → (E) | id

● Parse for id*id:

Stack        Symbols       Input        Action
0            $             id * id $    Shift to 5
0 5          $ id          * id $       Reduce by F → id
0 3          $ F           * id $       Reduce by T → F
0 2          $ T           * id $       Shift to 7
0 2 7        $ T *         id $         Shift to 5
0 2 7 5      $ T * id      $            Reduce by F → id
0 2 7 10     $ T * F       $            Reduce by T → T * F
0 2          $ T           $            Reduce by E → T
0 1          $ E           $            Accept

Shift si: Push the current input symbol and state si; advance the input pointer.
Reduce A → α: Pop |α| symbols and states; then push A and the state GOTO[s, A], where s is the state now on top of the stack.
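
The trace above can be replayed with a small table-driven driver. The Python sketch below hard-codes the SLR table for the expression grammar and keeps only the state stack; the production numbering, the dictionary layout and the names ACTION, GOTO and parse are assumptions made for this illustration.

```python
# Productions numbered for this sketch only: number -> (head, body length, text).
PRODUCTIONS = {
    1: ("E", 3, "E → E + T"), 2: ("E", 1, "E → T"),
    3: ("T", 3, "T → T * F"), 4: ("T", 1, "T → F"),
    5: ("F", 3, "F → (E)"), 6: ("F", 1, "F → id"),
}


def reduce_on(rule, lookaheads):
    """One reduce entry per single-character lookahead terminal."""
    return {a: ("reduce", rule) for a in lookaheads}


SHIFT_ID_PAREN = {"id": ("shift", 5), "(": ("shift", 4)}
ACTION = {
    0: SHIFT_ID_PAREN,
    1: {"+": ("shift", 6), "$": ("accept", None)},
    2: {**reduce_on(2, "+)$"), "*": ("shift", 7)},
    3: reduce_on(4, "+*)$"),
    4: SHIFT_ID_PAREN,
    5: reduce_on(6, "+*)$"),
    6: SHIFT_ID_PAREN,
    7: SHIFT_ID_PAREN,
    8: {"+": ("shift", 6), ")": ("shift", 11)},
    9: {**reduce_on(1, "+)$"), "*": ("shift", 7)},
    10: reduce_on(3, "+*)$"),
    11: reduce_on(5, "+*)$"),
}
GOTO = {0: {"E": 1, "T": 2, "F": 3}, 4: {"E": 8, "T": 2, "F": 3},
        6: {"T": 9, "F": 3}, 7: {"F": 10}}


def parse(tokens):
    """Return the list of actions taken; raise SyntaxError on an error entry."""
    stack, pos, trace = [0], 0, []
    tokens = list(tokens) + ["$"]
    while True:
        kind, arg = ACTION[stack[-1]].get(tokens[pos], ("error", None))
        if kind == "shift":
            stack.append(arg)
            pos += 1
            trace.append(f"shift to {arg}")
        elif kind == "reduce":
            head, size, text = PRODUCTIONS[arg]
            del stack[-size:]                    # pop |α| states
            stack.append(GOTO[stack[-1]][head])  # GOTO from the exposed state
            trace.append(f"reduce by {text}")
        elif kind == "accept":
            trace.append("accept")
            return trace
        else:
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {stack[-1]}")


for step in parse(["id", "*", "id"]):
    print(step)
```

The symbols column of the slide is not needed for parsing itself, since each state already determines the grammar symbol that led to it; it is omitted here to keep the sketch short.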

Page 22: Syntax Analysis (Cont.) Manas Thakur

A grammar that is not SLR

Grammar:
  S' → S
  S → L = R | R
  L → *R | id
  R → L

LR(0) item sets:
  I0: S' → •S;  S → •L = R;  S → •R;  L → •*R;  L → •id;  R → •L
  I1: S' → S•
  I2: S → L• = R;  R → L•
  I3: S → R•
  I4: L → id•
  I5: L → *•R;  R → •L;  L → •*R;  L → •id
  I6: S → L = •R;  R → •L;  L → •*R;  L → •id
  I7: L → *R•
  I8: R → L•
  I9: S → L = R•

● Consider I2 on ‘=’:

– Shift to I6

– Reduce using R → L (as = is in FOLLOW(R); how? Hint: S ⇒ L = R ⇒ *R = R)

– Conflict in the parsing table implies the grammar is not SLR(1)


Page 24: Syntax Analysis (Cont.) Manas Thakur

LR(1) Items

● Recall LR(k) items definition?

– An LR(k) item is a pair [α, β], where

● α is a production with a • at some position in the RHS, marking how much of the RHS has been seen

● β is a lookahead string containing k symbols (terminals or $)

● LR(1) items look like [A → X • YZ, a]

Page 25: Syntax Analysis (Cont.) Manas Thakur

CLOSURE1 and GOTO1

function CLOSURE1(I)
  repeat
    if [A → α • Bβ, a] ∈ I
      add [B → •γ, b] to I, where b ∈ FIRST(βa)
  until no more items can be added to I
  return I

function GOTO1(I, X)
  Let J be the set of items [A → αX•β, a] such that [A → α•Xβ, a] ∈ I
  return CLOSURE1(J)
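
The two functions can be tried out on the small grammar S’ → S, S → CC, C → cC | d. The Python sketch below is a simplified illustration: it assumes no ε-productions, so FIRST(βa) is decided by the first symbol of β (or by a when β is empty), and it uses a hand-written FIRST table. The four-field item encoding and all names are assumptions made here.

```python
GRAMMAR = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMINALS = {"S'", "S", "C"}
FIRST = {"S": {"c", "d"}, "C": {"c", "d"}}  # written by hand for this grammar


def first_of(beta, a):
    """FIRST(βa), valid only because no symbol of this grammar derives ε."""
    if not beta:
        return {a}
    x = beta[0]
    return FIRST[x] if x in NONTERMINALS else {x}


def closure1(items):
    """Items are (head, body, dot, lookahead) tuples."""
    result, work = set(items), list(items)
    while work:
        head, body, dot, a = work.pop()
        if dot < len(body) and body[dot] in NONTERMINALS:
            for b in first_of(body[dot + 1:], a):
                for h, rhs in GRAMMAR:
                    item = (h, rhs, 0, b)
                    if h == body[dot] and item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)


def goto1(items, x):
    moved = {(h, b, d + 1, a) for h, b, d, a in items if d < len(b) and b[d] == x}
    return closure1(moved)


I0 = closure1({("S'", ("S",), 0, "$")})
I2 = goto1(I0, "C")
for head, body, dot, a in sorted(I2):
    print(head, "→", " ".join(body[:dot] + ("•",) + body[dot:]), ",", a)
```

Items that differ only in the lookahead are stored separately here, so an item written with lookahead c/d in the slides' shorthand appears as two tuples.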

Page 26: Syntax Analysis (Cont.) Manas Thakur

LR(1) Automaton

Grammar:
  S’ → S
  S → C C
  C → c C | d

States:
  I0: S’ → •S, $;  S → •CC, $;  C → •cC, c/d;  C → •d, c/d
  I1: S’ → S•, $
  I2: S → C•C, $;  C → •cC, $;  C → •d, $
  I3: C → c•C, c/d;  C → •cC, c/d;  C → •d, c/d
  I4: C → d•, c/d
  I5: S → CC•, $
  I6: C → c•C, $;  C → •cC, $;  C → •d, $
  I7: C → d•, $
  I8: C → cC•, c/d
  I9: C → cC•, $

Transitions:
  From I0: S to I1, C to I2, c to I3, d to I4
  From I1: $ to accept
  From I2: C to I5, c to I6, d to I7
  From I3: C to I8, c to I3, d to I4
  From I6: C to I9, c to I6, d to I7

Note: states such as I3 and I6, I4 and I7, and I8 and I9 have the same LR(0) items, but different LR(1) items.

Page 27: Syntax Analysis (Cont.) Manas Thakur

LR(1) or Canonical LR (CLR) Parsing Table

Productions:
  0: S’ → S
  1: S → C C
  2: C → c C
  3: C → d

State   c     d     $        S    C
0       s3    s4             1    2
1                   accept
2       s6    s7                  5
3       s3    s4                  8
4       r3    r3
5                   r1
6       s6    s7                  9
7                   r3
8       r2    r2
9                   r2

Homework: Construct the LR(1) parser for our non-SLR grammar and verify that there is no shift-reduce conflict.

Page 28: Syntax Analysis (Cont.) Manas Thakur


LookAhead LR (LALR) Parsing

● LR(1) parsers have too many states compared to SLR parsers.

– For the C language, SLR would have a few hundred states

– For the C language, LR(1) would have a few thousand states

● How about merging states with the same LR(0) items (aka core)?

– Result: We get LALR parsers!

● A bit of history:

– Knuth invented LR in 1965, but it was considered impractical due to memory requirements.

– Frank DeRemer invented SLR and LALR in 1969 (LALR as part of his PhD thesis).

Page 29: Syntax Analysis (Cont.) Manas Thakur

LALR(1) Automaton

Grammar:
  S’ → S
  S → C C
  C → c C | d

Original LR(1) states:
  I0: S’ → •S, $;  S → •CC, $;  C → •cC, c/d;  C → •d, c/d
  I1: S’ → S•, $
  I2: S → C•C, $;  C → •cC, $;  C → •d, $
  I3: C → c•C, c/d;  C → •cC, c/d;  C → •d, c/d
  I4: C → d•, c/d
  I5: S → CC•, $
  I6: C → c•C, $;  C → •cC, $;  C → •d, $
  I7: C → d•, $
  I8: C → cC•, c/d
  I9: C → cC•, $

Merged states for LALR(1):
  I36: C → c•C, c/d/$;  C → •cC, c/d/$;  C → •d, c/d/$
  I47: C → d•, c/d/$
  I89: C → cC•, c/d/$
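
The merging step can be sketched by grouping LR(1) states on their core, i.e. on their items with the lookahead dropped. The Python below is a minimal sketch for the grammar above under simplifying assumptions (no ε-productions, hand-written FIRST table); practical LALR generators usually compute lookaheads without building the full LR(1) collection first, which this sketch does not attempt. All names are illustrative.

```python
from collections import defaultdict

GRAMMAR = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMINALS = {"S'", "S", "C"}
FIRST = {"S": {"c", "d"}, "C": {"c", "d"}}  # written by hand for this grammar


def first_of(beta, a):
    if not beta:
        return {a}
    return FIRST[beta[0]] if beta[0] in NONTERMINALS else {beta[0]}


def closure1(items):
    result, work = set(items), list(items)
    while work:
        head, body, dot, a = work.pop()
        if dot < len(body) and body[dot] in NONTERMINALS:
            for b in first_of(body[dot + 1:], a):
                for h, rhs in GRAMMAR:
                    item = (h, rhs, 0, b)
                    if h == body[dot] and item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)


def goto1(items, x):
    return closure1({(h, b, d + 1, a) for h, b, d, a in items
                     if d < len(b) and b[d] == x})


def lr1_states():
    states = [closure1({("S'", ("S",), 0, "$")})]
    for state in states:  # the list grows while we iterate over it
        for x in sorted({b[d] for _, b, d, _ in state if d < len(b)}):
            target = goto1(state, x)
            if target not in states:
                states.append(target)
    return states


def merge_by_core(states):
    """Union all LR(1) states that share the same set of LR(0) items."""
    merged = defaultdict(set)
    for state in states:
        core = frozenset((h, b, d) for h, b, d, _ in state)
        merged[core] |= state
    return list(merged.values())


states = lr1_states()
lalr = merge_by_core(states)
print(len(states), "LR(1) states,", len(lalr), "LALR(1) states")
```

For this grammar the ten LR(1) states collapse to seven, matching the merged states I36, I47 and I89 listed above.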

Page 30: Syntax Analysis (Cont.) Manas Thakur

LALR(1) Parsing Table

Productions:
  0: S’ → S
  1: S → C C
  2: C → c C
  3: C → d

State   c      d      $        S    C
0       s36    s47             1    2
1                     accept
2       s36    s47                  5
36      s36    s47                  89
47      r3     r3     r3
5                     r1
89      r2     r2     r2

Page 31: Syntax Analysis (Cont.) Manas Thakur


A few notes in passing

● LALR parsing tables are smaller than the corresponding LR(1) parsing tables.

● LALR parsers mimic LR parsers on correct inputs.

● On erroneous inputs, LALR may proceed with reductions while LR might have declared an error.

– However, eventually, LALR is guaranteed to report the error.

● Merging sets for LALR never generates SR conflicts, but can generate RR conflicts.

– Further reading: Section 4.7.4.

● Difference between SLR and LALR?

– Both have same LR(0) item sets!

– Difference lies in the lookahead.

● The lookaheads in LALR can be proved to be a subset of the FOLLOW sets in SLR.

Page 32: Syntax Analysis (Cont.) Manas Thakur


Using ambiguous grammars

● Ambiguous grammars should be used sparingly.

● However, they can sometimes feel more natural to write; e.g.:

  E → E + E | E * E | id

versus

  E → E+T | T
  T → T*F | F
  F → id

● Sometimes it is easier to resolve a resulting conflict by hard-coding:

– Higher priority to shift or reduce

– Higher priority to a certain reduce

● However, it is an ad-hoc way and is better avoided.

Page 33: Syntax Analysis (Cont.) Manas Thakur


Error handling in parsers

● Ignore till a synchronizing token (such as } or ;):

– Pop the stack

– Discard input symbols

– Resume parsing

● Attach semantic error actions to grammar rules

– Add tokens based on what is missing (e.g., closing parenthesis)

● Programmer-specified substitutions

– %change directive in some parser specifications

● Global error recovery

– Again more of theoretical interest

Page 34: Syntax Analysis (Cont.) Manas Thakur


The Big Grammatical Picture

Clicked from “Modern Compiler Implementation in Java” by Andrew W. Appel.