
Recursive-Descent Parsing

CS F331 Programming Languages
CSCE A331 Programming Language Concepts
Lecture Slides
Wednesday, February 15, 2017

Glenn G. Chappell
Department of Computer Science
University of Alaska [email protected]

© 2017 Glenn G. Chappell

continued

Review: Overview of Lexing & Parsing

Two phases:
§ Lexical analysis (lexing)
§ Syntax analysis (parsing)

The output of a parser is often an abstract syntax tree (AST). Specifications of these can vary.

15 Feb 2017 CS F331 / CSCE A331 Spring 2017

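[Figure: Parsing. Character stream (cout << ff(12.6);) → Lexer → lexeme stream (each lexeme labeled with a category such as id, op, lit, or punct) → Parser → AST or error. In the example AST, the root is binOp: << with two children: id: cout, and a funcCall node whose children are id: ff and numLit: 12.6.]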

Review: Introduction to Syntax Analysis - Categories of Parsers

Parsing algorithms can be divided into two broad categories.

Top-Down Parsing Algorithms
§ Go through the derivation from top to bottom, expanding nonterminals.
§ Usually produce a leftmost derivation.
§ Important subclass: LL (read Left-to-right, Leftmost derivation).
  § Common reason a top-down parser may not be LL: doing lookahead.
§ Often hand-coded.
§ Example algorithm we will look at: Recursive Descent.

Bottom-Up Parsing Algorithms
§ Go through the derivation from bottom to top, collapsing substrings.
§ Usually produce a rightmost derivation.
§ Important subclass: LR (read Left-to-right, Rightmost derivation).
  § Common reason a bottom-up parser may not be LR: doing lookahead.
§ Almost always automatically generated.
§ Example algorithm we will look at: Shift-Reduce.


Review: Introduction to Syntax Analysis - Categories of Grammars

Grammars that LL parsers can use are LL grammars.
Grammars that LR parsers can use are LR grammars.

[Figure: nested regions, from outermost to innermost: All Grammars, CFGs, LR Grammars, LL Grammars.]

Review: Recursive-Descent Parsing - Intro, How It Works, Example #1

Recursive Descent is a top-down, LL parsing algorithm.
§ There is one parsing function for each nonterminal.
§ A parsing function is responsible for parsing all strings that its nonterminal can be expanded into.

We wrote a Recursive-Descent parser based on Grammar 1.

Grammar 1
  item → “(” item “)”
  item → thing
  thing → ID
  thing → “%”

Our parser does not generate an AST.


See rdparser1.lua.

Grammar 1a
  item → “(” item “)”
       | thing
  thing → ID
        | “%”
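The "one parsing function per nonterminal" idea can be sketched in Lua as follows. This is only an illustration, not the contents of rdparser1.lua: it assumes that every lexeme is a single character and that an ID is a single letter, and the helper names current, matchID, and parsePrefix are made up for this example.

```lua
-- Sketch of a recognizer for Grammar 1a.
-- Assumptions: each lexeme is one character, and an ID is a single letter.
-- The names current, matchID, and parsePrefix are hypothetical.

local input, pos                 -- text being parsed; index of current lexeme

local function current()
    return input:sub(pos, pos)   -- "" once the input is used up
end

local function matchString(s)    -- consume the current lexeme if it equals s
    if current() == s then
        pos = pos + 1
        return true
    end
    return false
end

local function matchID()         -- consume the current lexeme if it is an ID
    if current():match("^%a$") then
        pos = pos + 1
        return true
    end
    return false
end

local function parse_thing()     -- thing → ID | "%"
    return matchID() or matchString("%")
end

local function parse_item()      -- item → "(" item ")" | thing
    if matchString("(") then
        return parse_item() and matchString(")")
    end
    return parse_thing()
end

-- Returns true if the input begins with a string that parses as an item.
function parsePrefix(text)
    input, pos = text, 1
    return parse_item()
end
```

With these definitions, parsePrefix("((x))") and parsePrefix("%") return true, while parsePrefix("((x)") returns false because the final “)” is missing.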


Review: Recursive-Descent Parsing - Handling Incorrect Input

Our parser might call an input correct when it could not parse the entire input.

Example: ((x)))

Two Solutions
§ Introduce a new end-of-input lexeme. Revise the grammar to include it.
§ After parsing, check whether the end of the input was reached.

Our parsers use the second solution.
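A minimal sketch of the second solution, under the assumption that each lexeme is one character and an ID is a single letter. The function name parse and its two return values (good, done) are choices made for this illustration; the course parsers may report the result differently.

```lua
-- Sketch: a Grammar 1 recognizer that also reports whether the end of the
-- input was reached. Assumption: one character per lexeme; ID = one letter.
function parse(text)
    local pos = 1

    local function match(pattern)    -- consume one character matching pattern
        if text:sub(pos, pos):match(pattern) then
            pos = pos + 1
            return true
        end
        return false
    end

    local function parse_item()      -- item → "(" item ")" | ID | "%"
        if match("^%($") then
            return parse_item() and match("^%)$")
        end
        return match("^[%a%%]$")
    end

    local good = parse_item()        -- did some prefix of the input parse?
    local done = pos > #text         -- was the end of the input reached?
    return good, done
end

-- The input is syntactically correct only if both values are true.
local good, done = parse("((x)))")
print(good, done)                    -- true    false
```

For ((x))) the parsing function succeeds on the prefix ((x)) but one “)” is left over, so the caller can reject the input.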


Review: Recursive-Descent Parsing - Example #2: More Complex [1/2]

We wrote a Recursive-Descent parser for the following more complex grammar, whose start symbol is still item.

Grammar 2
  item → “(” item “)”
       | thing
  thing → ID { ( “,” | “:” ) ID }
        | “%”
        | [ “*” “-” ] “[” item “]”

All strings in the old language are also in the new language. But now we can get strings like these:
§ ((a,b,c:d))
§ ((*-[([%])]))


Recall:
§ Braces mean optional, repeatable (0 or more).
§ Brackets mean optional (0 or 1).

Note the difference between the following: [ is grammar notation that begins an optional part, while “[” is a lexeme that appears in the input.

Review: Recursive-Descent Parsing - Example #2: More Complex [2/2]

Grammar 2
  item → “(” item “)” | thing
  thing → ID { ( “,” | “:” ) ID } | “%” | [ “*” “-” ] “[” item “]”

In a parsing function:

[ … ] Brackets (optional: 0 or 1) become a conditional.
§ Check for the possible initial lexemes inside the brackets. If found, parse everything inside the brackets. Otherwise skip the brackets.

{ … } Braces (optional, repeatable: 0 or more) become a loop.
§ Loop body: Check for the possible initial lexemes inside the braces. If not found, then exit loop, moving to just after the braces. If found, parse everything inside the braces, and then REPEAT.

TO DO
§ Write a Recursive-Descent parser based on Grammar 2.

Done. See rdparser2.cpp.
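The translation of brackets into a conditional and braces into a loop might look like the following in Lua. This is a sketch written for these notes, not the course's parser file; it assumes one character per lexeme and single-letter IDs, and the helper names peek, matchID, and parse are hypothetical.

```lua
-- Sketch of a recognizer for Grammar 2.
-- Assumption: one character per lexeme; an ID is a single letter.
local text, pos

local function peek()
    return text:sub(pos, pos)
end

local function matchString(s)
    if peek() == s then pos = pos + 1; return true end
    return false
end

local function matchID()
    if peek():match("^%a$") then pos = pos + 1; return true end
    return false
end

local parse_item   -- forward declaration: parse_thing and parse_item call each other

local function parse_thing()
    -- thing → ID { ( "," | ":" ) ID }
    if matchID() then
        -- Braces become a loop: repeat while the next lexeme can start the body.
        while matchString(",") or matchString(":") do
            if not matchID() then return false end
        end
        return true
    end
    -- thing → "%"
    if matchString("%") then return true end
    -- thing → [ "*" "-" ] "[" item "]"
    -- Brackets become a conditional: parse the body only if it has begun.
    if matchString("*") then
        if not matchString("-") then return false end
    end
    return matchString("[") and parse_item() and matchString("]")
end

function parse_item()   -- assigns to the local declared above
    -- item → "(" item ")" | thing
    if matchString("(") then
        return parse_item() and matchString(")")
    end
    return parse_thing()
end

-- True only if the whole input is an item.
function parse(input)
    text, pos = input, 1
    local good = parse_item()
    return good and pos > #text
end
```

Under these assumptions, parse accepts both ((a,b,c:d)) and ((*-[([%])])), and rejects inputs such as a, (an operator with no ID after it) and *[a] (a “*” with no “-”).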


Review: Recursive-Descent Parsing - Example #3: Expressions [1/5]

Now we bump up our standards. We wish to parse arithmetic expressions in their usual form, with variables, numeric literals, binary +, -, *, and / operators, and parentheses. When given a syntactically correct expression, our parser should return an abstract syntax tree (AST).
§ All operators will be binary and left-associative, so that, for example, “a + b + c” means “(a + b) + c”.
§ Precedence will be as usual, so that “a + b * c” means “a + (b * c)”.
§ Precedence and associativity may be overridden using parentheses: “(a + b) * c”.

Due to the limitations of our lexer, the expression “k-4” will need to be rewritten as “k - 4”.



We begin with the following grammar, with start symbol expr.

Grammar 3
  expr → term
       | expr ( “+” | “-” ) term
  term → factor
       | term ( “*” | “/” ) factor
  factor → ID
         | NUMLIT
         | “(” expr “)”

Grammar 3 encodes our associativity and precedence rules.



Below is part of a parsing function for nonterminal expr.

  function parse_expr()
      if parse_term() then
          return true
      elseif parse_expr() then
          …

What is wrong with this code?


§ First, if the call to parse_term returns false, then the position in the input may have changed. Fixing this requires backtracking, which can lead to extreme inefficiency.
§ But even if we can solve that, there is a more serious problem. Suppose parse_expr is called with input that does not begin with a valid term. What happens? Answer: infinite recursion!



In fact, without lookahead, it is impossible to write a Recursive-Descent parser for Grammar 3.





Recall that a Recursive-Descent parser requires an LL grammar. But Grammar 3 is not an LL grammar. Next we look at LL grammars. We return to the expression-parsing problem later.


Review: Recursive-Descent Parsing - LL Grammars: Properties [1/8]

An LL grammar is a CFG that can be handled by an LL parsing algorithm, such as Recursive Descent, if multiple-lexeme lookahead is not done.

Recall the origin of the name: these parsers handle their input in a strictly Left-to-right order, and they go through the steps required to generate a Leftmost derivation.

Now we look at some of the properties that an LL grammar must have.



Consider the following grammar.

Grammar A
  xx → xx “+” “b” | “a”

A parsing function would begin:

  function parse_xx()
      if parse_xx() then

We have recursion without a base-case check.

The trouble lies in the grammar. The right-hand side of the production for xx begins with xx. This is left recursion. It is not allowed in an LL grammar.
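To see the problem concretely, here is a small Lua sketch of that naive parsing function for Grammar A. The depth counter, MAX_DEPTH, and tryParse are additions made only for this demonstration, so that the runaway recursion is reported as an error instead of exhausting the stack.

```lua
-- Sketch: a naive parsing function for Grammar A (xx → xx "+" "b" | "a").
-- The depth counter and MAX_DEPTH exist only for this demonstration;
-- without them the recursion would continue until the stack overflows.
local MAX_DEPTH = 50
local depth = 0

local function parse_xx()
    depth = depth + 1
    if depth > MAX_DEPTH then
        error("left recursion: parse_xx called itself without consuming input")
    end
    -- The production begins with xx, so the first action is a recursive call.
    if parse_xx() then
        -- Never reached: matching "+" and "b" would go here.
        return true
    end
    return false
end

-- Runs parse_xx in protected mode; returns false plus the error message.
function tryParse()
    depth = 0
    return pcall(parse_xx)
end
```

Calling tryParse() returns false together with the error message, whatever the input is, because parse_xx never looks at a lexeme before calling itself again.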



Left recursion can be more subtle. Below is a variation on Grammar A.

Grammar Ax
  xx → yy “b” | “a”
  yy → xx “+”

Grammar Ax also contains left recursion. It is not LL.



The grammar below illustrates a more general problem.

Grammar B
  xx → “a” yy | “a” zz
  yy → “*”
  zz → “/”

We cannot even begin to write a Recursive-Descent parser for Grammar B. How would the code for function parse_xx begin? Should it take the first or second option? There is no way to tell, without lookahead.

We say the first production in Grammar B is not left-factored. An LL grammar can only contain left-factored productions.



Here is another problematic grammar.

Grammar C
  xx → yy | zz
  yy → “” | “a”
  zz → “” | “b”

In Grammar C, the empty string can be derived from either yy or zz. So if there is no more input, then there is no basis for making the yy-or-zz decision in the first production.



One last non-LL grammar.

Grammar D
  xx → yy “a”
  yy → “a” | “”

The strings “a” and “aa” lie in the language generated by Grammar D. But imagine a Recursive-Descent parser based on Grammar D, attempting to parse these strings. What would happen?



It turns out that the problems presented by Grammars A–D illustrate all the reasons a CFG might not be LL.

Fact.* Suppose that a context-free grammar G has the following three properties.
1. If A → α and A → β are distinct productions in G, then there do not exist two strings, one derived from α, the other derived from β, that begin with the same (terminal) symbol.
2. If A → α and A → β are distinct productions in G, then it is not the case that the empty string can be derived from both α and β.
3. If A → α and A → β are distinct productions in G, and the empty string can be derived from β, then there is no (terminal) symbol x that begins a string that can be derived from α, such that x can follow a string derived from A.
Then Grammar G is an LL grammar.

*Adapted from A.V. Aho, R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools, 1986, p. 192.


(1) does not hold for Grammars A, Ax, and B; (2) does not hold for Grammar C; and (3) does not hold for Grammar D.
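Property 1 can be checked mechanically. The Lua sketch below computes, for each nonterminal, the set of terminals that can begin a string derived from it (commonly called a FIRST set), and then tests whether two alternatives of the same nonterminal share such a terminal. The grammar representation and all function names here are assumptions made for this illustration; properties 2 and 3 are not checked.

```lua
-- Sketch: test property 1 of the Fact for a small grammar.
-- Assumptions: a grammar is a table mapping each nonterminal to a list of
-- alternatives; each alternative is a list of symbols; a symbol that is not
-- a key of the table is a terminal; an empty list is the empty string.

local EMPTY = ""   -- marker meaning "the empty string can be derived"

-- Terminals that can begin a string derived from a sequence of symbols.
local function firstOfSequence(grammar, first, symbols)
    local result = {}
    for _, sym in ipairs(symbols) do
        if grammar[sym] == nil then        -- terminal: it begins the string
            result[sym] = true
            return result
        end
        for t in pairs(first[sym]) do
            if t ~= EMPTY then result[t] = true end
        end
        if not first[sym][EMPTY] then      -- sym cannot vanish, so stop here
            return result
        end
    end
    result[EMPTY] = true                   -- every symbol can vanish
    return result
end

-- Compute the sets for all nonterminals by iterating to a fixed point.
local function computeFirst(grammar)
    local first = {}
    for nt in pairs(grammar) do first[nt] = {} end
    local changed = true
    while changed do
        changed = false
        for nt, alternatives in pairs(grammar) do
            for _, alt in ipairs(alternatives) do
                for t in pairs(firstOfSequence(grammar, first, alt)) do
                    if not first[nt][t] then
                        first[nt][t] = true
                        changed = true
                    end
                end
            end
        end
    end
    return first
end

-- Property 1 holds if no two alternatives of one nonterminal can begin
-- with the same terminal.
function satisfiesProperty1(grammar)
    local first = computeFirst(grammar)
    for _, alternatives in pairs(grammar) do
        for i = 1, #alternatives do
            local fi = firstOfSequence(grammar, first, alternatives[i])
            for j = i + 1, #alternatives do
                local fj = firstOfSequence(grammar, first, alternatives[j])
                for t in pairs(fi) do
                    if t ~= EMPTY and fj[t] then return false end
                end
            end
        end
    end
    return true
end

local grammarA = { xx = { {"xx", "+", "b"}, {"a"} } }
local grammarB = { xx = { {"a", "yy"}, {"a", "zz"} }, yy = { {"*"} }, zz = { {"/"} } }
print(satisfiesProperty1(grammarA))   -- false
print(satisfiesProperty1(grammarB))   -- false
```

Both alternatives for xx in Grammar A can begin with “a”, as can both alternatives for xx in Grammar B, which is why the check fails for each.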



In addition:

Fact. Suppose that G is an LL grammar. Then,
§ G is not ambiguous, and
§ G does not contain left recursion.

In general, when there is a choice to be made, an LL parser must be able to make that choice based on the current lexeme. If this cannot be done, then the grammar is not LL.

Now suppose—as in our expression-parsing example—that we wish to write a Recursive-Descent parser, but our grammar is not LL. What can we do about this?


Recursive-Descent Parsing: LL Grammars - Transforming [1/5]

If a grammar is not LL, this does not mean that the grammar must be completely useless as a basis for a Recursive-Descent parser. We might be able to transform the grammar into an LL grammar that generates the same language.

For example, here is Grammar A, which is not LL, along with an LL grammar that generates the same language.

Grammar A
  xx → xx “+” “b” | “a”

Grammar Aa
  xx → “a” yy
  yy → “” | “+” “b” yy
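Because Grammar Aa is LL, each parsing function can decide what to do by looking only at the current lexeme. A minimal Lua sketch, assuming one character per lexeme (the function name parse is chosen for this example):

```lua
-- Sketch of a recognizer for Grammar Aa.
-- Assumption: one character per lexeme.
function parse(text)
    local pos = 1

    local function matchString(s)
        if text:sub(pos, pos) == s then
            pos = pos + 1
            return true
        end
        return false
    end

    local function parse_yy()        -- yy → "" | "+" "b" yy
        if matchString("+") then     -- current lexeme selects the second option
            return matchString("b") and parse_yy()
        end
        return true                  -- otherwise derive the empty string
    end

    local function parse_xx()        -- xx → "a" yy
        return matchString("a") and parse_yy()
    end

    -- Correct only if xx parses and the end of the input was reached.
    return parse_xx() and pos > #text
end
```

Here parse("a") and parse("a+b+b") return true, while parse("a+") and parse("") return false. The recursion in parse_yy happens only after a “+” and a “b” have been consumed, so it always terminates.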



Here is Grammar B, which is not LL, along with an LL grammar that generates the same language.

Grammar B
  xx → “a” yy | “a” zz
  yy → “*”
  zz → “/”

Grammar Ba
  xx → “a” yy
  yy → “*” | “/”


Here is Grammar C, which is not LL, along with an LL grammar that generates the same language.

Grammar C
  xx → yy | zz
  yy → “” | “a”
  zz → “” | “b”

Grammar Ca
  xx → yy | zz | “”
  yy → “a”
  zz → “b”


And here is Grammar D, which is not LL, along with an LL grammar that generates the same language.

Grammar D
  xx → yy “a”
  yy → “a” | “”

Grammar Da
  xx → “a” yy
  yy → “a” | “”
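The difference between the two grammars shows up when we try to code them. In the Lua sketch below, parseD follows Grammar D and assumes that yy takes an “a” whenever it sees one (the other naive choice, always deriving the empty string, fails on “aa” instead). parseDa follows Grammar Da. Both assume one character per lexeme, and makeMatcher is a hypothetical helper.

```lua
-- Sketch comparing recognizers for Grammar D and Grammar Da.
-- Assumption: one character per lexeme. makeMatcher is a hypothetical helper.
local function makeMatcher(text)
    local pos = 1
    local function matchString(s)
        if text:sub(pos, pos) == s then
            pos = pos + 1
            return true
        end
        return false
    end
    local function atEnd()
        return pos > #text
    end
    return matchString, atEnd
end

-- Grammar D:  xx → yy "a"     yy → "a" | ""
function parseD(text)
    local matchString, atEnd = makeMatcher(text)
    matchString("a")                      -- yy: take "a" if present, else empty
    return matchString("a") and atEnd()   -- then xx requires a further "a"
end

-- Grammar Da: xx → "a" yy     yy → "a" | ""
function parseDa(text)
    local matchString, atEnd = makeMatcher(text)
    if not matchString("a") then return false end
    matchString("a")                      -- yy: optional second "a"
    return atEnd()
end

print(parseD("a"), parseD("aa"))          -- false   true
print(parseDa("a"), parseDa("aa"))        -- true    true
```

The string “a” is in the language, yet parseD rejects it: yy consumes the only “a”, leaving nothing for xx. With Grammar Da the required “a” comes first, so the optional one can be decided from the current lexeme alone.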



It is not at all uncommon to be faced with a grammar that is not LL, but that can be transformed easily to one that is LL. In particular, this is common in the specification of programming-language syntax.

Note, however, that there are context-free languages that cannot be generated by any LL grammar at all.


Recursive-Descent Parsing: Back to Example #3: Expressions - Left-Associativity [1/3]

Now we return to our expression grammar. It is given below. Recall that this is not an LL grammar.

Grammar 3
  expr → term
       | expr ( “+” | “-” ) term
  term → factor
       | term ( “*” | “/” ) factor
  factor → ID
         | NUMLIT
         | “(” expr “)”

An easy fix is to reorder the operands; for example, expr ( “+” | “-” ) term becomes term ( “+” | “-” ) expr. I will also use [ … ] to make the grammar more concise.



Here is the resulting grammar. This is an LL grammar.

Grammar 3a
  expr → term [ ( “+” | “-” ) expr ]
  term → factor [ ( “*” | “/” ) term ]
  factor → ID
         | NUMLIT
         | “(” expr “)”


But now we have a new problem: Grammar 3a is LL, but it encodes right-associative binary operators. We want our operators to be left-associative.
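The effect of right-associativity is easy to observe if the parsing functions compute values as they go. The following Lua sketch follows Grammar 3a directly. It is an illustration only: it assumes lexemes are separated by spaces, that every operand is a numeric literal (no IDs), and that the input is syntactically correct. The name evalRightAssoc is made up for this example.

```lua
-- Sketch: an evaluator that follows Grammar 3a directly.
-- Assumptions: lexemes are separated by spaces, and every operand is a
-- numeric literal, so each parsing function can simply return a value.
function evalRightAssoc(text)
    local lexemes = {}
    for lexeme in text:gmatch("%S+") do
        lexemes[#lexemes + 1] = lexeme
    end
    local pos = 1

    local function matchString(s)
        if lexemes[pos] == s then
            pos = pos + 1
            return true
        end
        return false
    end

    local parse_expr                 -- forward declaration

    local function parse_factor()    -- factor → NUMLIT | "(" expr ")"
        if matchString("(") then
            local value = parse_expr()
            assert(matchString(")"), "expected )")
            return value
        end
        local value = assert(tonumber(lexemes[pos]), "expected a number")
        pos = pos + 1
        return value
    end

    local function parse_term()      -- term → factor [ ( "*" | "/" ) term ]
        local left = parse_factor()
        if matchString("*") then return left * parse_term() end
        if matchString("/") then return left / parse_term() end
        return left
    end

    function parse_expr()            -- expr → term [ ( "+" | "-" ) expr ]
        local left = parse_term()
        if matchString("+") then return left + parse_expr() end
        if matchString("-") then return left - parse_expr() end
        return left
    end

    return parse_expr()
end

print(evalRightAssoc("10 - 4 - 3"))  -- 9, because it is grouped as 10 - (4 - 3)
```

With the usual left-associative reading, 10 - 4 - 3 is (10 - 4) - 3, which is 3. The grammar's shape, with the recursive call on the right, is what produces the wrong grouping.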

Fortunately, all is not lost …



Here is how we do it.

Grammar 3b
  expr → term { ( “+” | “-” ) term }
  term → factor { ( “*” | “/” ) factor }
  factor → ID
         | NUMLIT
         | “(” expr “)”

Grammar 3b is what we want. It is LL, and we can use it to parse left-associative binary operators.

However, we still need to generate an AST.

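Here is a parsing function for expr, based on Grammar 3b. It does not yet build an AST.

  function parse_expr()
      -- expr → term { ( "+" | "-" ) term }
      if not parse_term() then
          return false
      end

      while true do
          -- Leave the loop when the current lexeme is neither "+" nor "-"
          if not matchString("+") and not matchString("-") then
              break
          end

          -- An operator was consumed, so another term must follow
          if not parse_term() then
              return false
          end
      end

      return true
  end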

Recursive-Descent Parsing: Back to Example #3: Expressions - ASTs [1/7]

We want to write a parser that returns an abstract syntax tree (AST). First, we need to specify the format of an AST.

Recall that a parse tree, or concrete syntax tree, includes one leaf node for each lexeme in the input, and one non-leaf node for each nonterminal in the derivation.

However, an AST is more sparse. For example, below are reasonable ASTs for the expressions a + 2 and (a + 2) * b.

Lexemes that only guide parsing are omitted from an AST: semicolons to end statements, parentheses in expressions, etc.


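[Figure: two ASTs. For a + 2: root + with children a and 2. For (a + 2) * b: root * with children + (whose own children are a and 2) and b.]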


We need to represent these trees in Lua. Represent a single lexeme by its string form. If there is more than one node in an AST, then represent it as an array whose first item represents the root node and whose remaining items each represent one of the subtrees rooted at the child nodes, in order.

Here is the first AST above in Lua: { "+", "a", "2" }

And here is the second AST: { "*", { "+", "a", "2" }, "b" }
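One way to get a feel for this representation is to walk it. The following Lua sketch turns an AST in the array format back into a fully parenthesized string. The function name astToString is hypothetical, and the sketch assumes every array has exactly three items (an operator and two subtrees), as in the examples here.

```lua
-- Sketch: convert an AST in the array format to a parenthesized string.
-- Assumption: every array is { operator, left subtree, right subtree }.
function astToString(ast)
    if type(ast) == "string" then
        return ast                   -- a single node: the lexeme itself
    end
    -- ast[1] is the root (an operator); ast[2] and ast[3] are the subtrees
    return "(" .. astToString(ast[2]) .. " " .. ast[1] .. " "
               .. astToString(ast[3]) .. ")"
end

print(astToString({ "+", "a", "2" }))                  -- (a + 2)
print(astToString({ "*", { "+", "a", "2" }, "b" }))    -- ((a + 2) * b)
```

The parentheses in the output are generated from the tree structure; the AST itself stores no parenthesis lexemes.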




It is better to describe our ASTs in a way that does not require drawings of trees. So we specify the format of an AST for each line in our grammar.

Grammar 3b
(1) expr → term { ( “+” | “-” ) term }
(2) term → factor { ( “*” | “/” ) factor }
(3) factor → ID
(4)        | NUMLIT
(5)        | “(” expr “)”

(1) expr. If there is only a term, then the AST for the expr is the AST for the term. Otherwise, the AST is { OO, AA, BB }, where OO is the string form of the last operator, AA is the AST for everything before it, and BB is the AST for the last term.



A term is handled similarly.


(2) term. If there is only a factor, then the AST for the term is the AST for the factor. Otherwise, the AST is { OO, AA, BB }, where OO is the string form of the last operator, AA is the AST for everything before it, and BB is the AST for the last factor.



A factor has multiple options.


(3) factor: ID. AST for the factor: string form of the ID.
(4) factor: NUMLIT. AST for the factor: string form of the NUMLIT.
(5) factor: “(” expr “)”. AST for the factor: AST for the expr.

Applying the various rules, the AST for (a + 2) * b is { "*", { "+", "a", "2" }, "b" }

Each parsing function can now return a pair: a boolean and an AST. The boolean indicates a correct parse, as before. The AST is only valid if the boolean is true, in which case it will be in the specified format.





TO DO
§ Based on Grammar 3b, write a Recursive-Descent parser that produces an AST, as described.
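One possible shape for such a parser is sketched below in Lua. It is not the parser written in class. The lexer is a simplified stand-in (an ID is a letter or underscore followed by letters, digits, or underscores; a NUMLIT is a run of digits; any other non-space character is a one-character lexeme), and the names lex, matchPattern, parse_leftAssoc, and parse are assumptions made for this example. Each parsing function returns a pair: a boolean and an AST.

```lua
-- Sketch of a Recursive-Descent parser for Grammar 3b that builds ASTs.
-- The lexer below is a simplified stand-in, not the course lexer.
local function lex(text)
    local lexemes = {}
    local pos = 1
    while true do
        pos = text:find("%S", pos)               -- skip whitespace
        if not pos then break end
        local _, last = text:find("^[%a_][%w_]*", pos)              -- ID
        if not last then _, last = text:find("^%d+", pos) end       -- NUMLIT
        last = last or pos                       -- else a single character
        lexemes[#lexemes + 1] = text:sub(pos, last)
        pos = last + 1
    end
    return lexemes
end

function parse(text)
    local lexemes = lex(text)
    local pos = 1

    local function matchString(s)
        if lexemes[pos] == s then
            pos = pos + 1
            return true
        end
        return false
    end

    -- If the current lexeme matches the pattern, consume and return it.
    local function matchPattern(pattern)
        local lexeme = lexemes[pos]
        if lexeme and lexeme:match(pattern) then
            pos = pos + 1
            return lexeme
        end
        return nil
    end

    local parse_expr                          -- forward declaration

    local function parse_factor()
        -- (3), (4): the AST is the string form of the ID or NUMLIT
        local lexeme = matchPattern("^[%a_]") or matchPattern("^%d")
        if lexeme then return true, lexeme end
        -- (5): the AST for "(" expr ")" is the AST for the expr
        if matchString("(") then
            local good, ast = parse_expr()
            if good and matchString(")") then return true, ast end
        end
        return false, nil
    end

    -- Shared by expr and term: operand { ( op1 | op2 ) operand }
    local function parse_leftAssoc(parse_operand, op1, op2)
        local good, ast = parse_operand()
        if not good then return false, nil end
        while true do
            local op = lexemes[pos]
            if not matchString(op1) and not matchString(op2) then break end
            local good2, right = parse_operand()
            if not good2 then return false, nil end
            ast = { op, ast, right }          -- everything so far becomes AA
        end
        return true, ast
    end

    local function parse_term()               -- (2)
        return parse_leftAssoc(parse_factor, "*", "/")
    end

    function parse_expr()                     -- (1)
        return parse_leftAssoc(parse_term, "+", "-")
    end

    local good, ast = parse_expr()
    return good and pos > #lexemes, ast       -- also require end of input
end
```

For the input (a + 2) * b this returns true along with { "*", { "+", "a", "2" }, "b" }. For a - b - c the loop builds { "-", { "-", "a", "b" }, "c" }, which is the left-associative grouping. Wrapping the partial result inside each new array is what makes the operators left-associative, even though the grammar contains no left recursion.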



Recursive-Descent Parsing: Example #4: Better ASTs [1/4]

The ASTs we have specified are not quite what we want.

We need to know whether each node represents an operator, an identifier, etc. The lexer already figured this out, but then we did not store this information in the AST.

And there is other information we could store. For example, in many PLs, “-” can be either a binary operator (a - b) or a unary operator (-x). The lexer does not know which it is. But the parser knows, and the parser could return this information to its caller.



To give the caller additional information, we mark each node in the AST, indicating what kind of entity it is. So far, we have three kinds of nodes: binary operators, identifiers, and numeric literals. So we mark each node as being one of these three.


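[Figure: the AST for (a + 2) * b, before and after marking. Before: root * with children + (whose children are a and 2) and b. After: root binOp: * with children binOp: + (whose children are id: a and numLit: 2) and id: b.]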


In the Lua form of our ASTs, we can replace each string with a two-item array. The first item in the array will be one of three constants: BIN_OP, ID_VAL, or NUMLIT_VAL. The second item will be the string form of the lexeme.

  "/"    →  { BIN_OP, "/" }
  "abc"  →  { ID_VAL, "abc" }
  "123"  →  { NUMLIT_VAL, "123" }

So the AST for a + 2 changes as shown below.

  { "+", "a", "2" }  →  { { BIN_OP, "+" }, { ID_VAL, "a" }, { NUMLIT_VAL, "2" } }

TO DO
§ Rewrite the Recursive-Descent parser based on Grammar 3b, so that it produces the improved ASTs.
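As a small illustration of the marked format, the Lua sketch below converts one of the earlier string-based ASTs into the improved form. The constant values (1, 2, 3) and the function name markAST are assumptions made for this example. A real parser would more likely mark each node at the moment it creates it, using the category reported by the lexer, rather than guessing the kind from the string afterward.

```lua
-- Sketch: node-kind constants and a converter from the string-based ASTs
-- to the marked format. The constant values and markAST are assumptions.
BIN_OP     = 1
ID_VAL     = 2
NUMLIT_VAL = 3

function markAST(ast)
    if type(ast) == "table" then
        -- { op, left, right } becomes { { BIN_OP, op }, left', right' }
        return { { BIN_OP, ast[1] }, markAST(ast[2]), markAST(ast[3]) }
    elseif ast:match("^%d") then
        return { NUMLIT_VAL, ast }     -- a leaf that starts with a digit
    else
        return { ID_VAL, ast }         -- any other leaf is taken to be an ID
    end
end

local marked = markAST({ "+", "a", "2" })
-- marked is { { BIN_OP, "+" }, { ID_VAL, "a" }, { NUMLIT_VAL, "2" } }
```

Code that consumes the AST can then branch on the first item of each node instead of inspecting the lexeme text.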


