Module 9 – Top Down Parser – Pre-processing
In this module, the role of a parser in the context of a compiler is discussed. Types of parsers
and the pre-processing that is necessary for a top-down parser are dealt with in detail. The
pre-processing steps of eliminating left recursion and left factoring are covered with an
algorithm and examples.
9.1 Role of the Parser
The parser is typically integrated with the lexical phase of the compiler. This avoids a separate
pass just for parsing, although a compiler will still have more than one pass overall. Integration
of a parser with the lexical phase is depicted in figure 9.1 (a), and a more elaborate
representation is given in figure 9.1 (b). From figure 9.1 (a) it can be understood that the scanner
issues tokens to the parser, and the parser validates the token sequence and converts it to an
intermediate representation (IR).
Figure 9.1 (a) Integration of lexer and parser
Figure 9.1 (b) Interaction of lexer and parser
Figure 9.1 (b) gives more clarity on the interaction between the lexer and the parser. The lexer
scans the input code, tokenizes it, and issues a token to the parser whenever a "GetNextToken"
request is issued by the parser. The lexer records lexical errors during the scanning process,
while the parser records any syntax errors. The semantic phase is one more component of the
front end of the compiler, which checks for semantic errors. All these phases interact with the
symbol table.
9.2 Brief discussion on Grammar
Before we discuss the types of parsers, a brief discussion of grammars is necessary.
A grammar denotes the sentence structure of a particular language. Grammars are of 4 types:
type 0, type 1, type 2 and type 3. Type 0 grammars define the largest class of languages, while
type 3 grammars define the smallest. For compilers, type 2 grammars, otherwise called Context
Free Grammars, are used. A Context Free Grammar (CFG) is defined formally as a four tuple
(V, T, P, S) where V is the set of variables / non-terminals, T is the set of terminals that
constitute the strings, P is the set of productions, each with an LHS and an RHS where the LHS
can be replaced by the RHS to derive a string, and S is a special symbol called the start symbol,
which is an element of V. A string is said to belong to the language of a grammar if and only if
it can be derived from the start symbol.
The following conventions are typically used in the context of grammars.
• Terminals are specified using characters a, b, c, etc. Other specific terminals could be 0, 1,
id, +, -, *, etc.
• Non-terminals are specified using A, B, C, etc. Other specific non-terminals include expr,
term, stmt
• Grammar symbols are denoted X, Y, Z ∈ (V ∪ T)
• Strings of terminals are indicated using symbols u, v, w, x, y, z ∈ T*
• Strings of grammar symbols are denoted using Greek letters α, β, γ ∈ (V ∪ T)*
A derivation is the process of deriving a string from the start symbol of the grammar. A one-step
derivation is defined by
αAβ ⇒ αγβ (9.1)
where A → γ is a production in the grammar and ⇒ indicates one step of derivation. In
addition, we define the following terminology for the process of derivation:
• The derivation is leftmost (⇒lm) if α does not contain a non-terminal
• The derivation is rightmost (⇒rm) if β does not contain a non-terminal
• Transitive closure ⇒* (zero or more steps)
• Positive closure ⇒+ (one or more steps)
The language generated by G is defined by
L(G) = {w | S ⇒+ w} (9.2)
Consider the following grammar defined as ({E}, {+, *, (, ), -, id}, P, E) where P is defined as
follows:
E → E + E
E → E * E
E → ( E )
E → - E
E → id (9.3)
This grammar is a subset of the expression grammar over the terminals +, *, id, (, ). The terminal
"id" indicates an identifier while the other terminals are arithmetic operators. In this grammar,
strings are derived from the start symbol "E". Consider the following derivation:
E ⇒rm E + E ⇒rm E + id ⇒rm id + id (9.4)
In equation (9.4), "E" is first replaced with "E + E", and then each "E" is replaced with "id"
from right to left; hence this is a rightmost derivation, yielding the string "id + id".
Other possible derivations are
E ⇒ - E ⇒ - id (9.5)
E ⇒lm E * E ⇒lm id * E ⇒lm id * E + E ⇒lm id * id + E ⇒lm id * id + id (9.6)
Equation (9.6) applies leftmost derivation to derive the string "id * id + id". The process of
derivation can be depicted as a tree, called a derivation tree, where the LHS non-terminal of
each applied production is the parent and the symbols on the RHS of the production are its
children.
The grammar is used to derive strings. Given an input string, one can start from the start symbol
and apply one derivation after another to derive the string; this process is called top-down
derivation. On the other hand, given a string, if its symbols are successively combined (reduced)
and this yields the start symbol, the process is called bottom-up derivation. In either case, if the
string is derived from the start symbol, or if the reduction yields the start symbol of the
grammar, the string is said to belong to the grammar.
This process of derivation is used by the parsers to validate a string or a sequence of strings.
9.3 Types of Parsers
The functionality of the parser is to verify whether a sequence of tokens forms a correct
sentence structure. The validation is based on rules that have to be followed for constructing a
sentence. The rules are specified using one type of grammar. Using this grammar, when an
input sentence is given, the parser checks whether the input sentence belongs to the grammar.
There are various types of parsers. The following is one of the classifications of parsers:
• Universal Parsers
– Cocke-Younger-Kasami (CYK)
– Earley Parser
• Top-Down Parsers
• Bottom Up Parsers
9.3.1 Universal Parsers
The universal parsers CYK and Earley can parse any context-free grammar given to them. They
are typically used in natural language processing to validate the syntax of natural language
sentences using a predefined grammar for the natural language. However, these parsers are too
inefficient for a compiler, even though the sentence structure of programming languages is also
based on a context free grammar.
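To make the universal approach concrete, the CYK idea can be sketched in a few lines. The sketch below is our own illustration, not part of this module; it assumes the grammar is already in Chomsky Normal Form, where every production is either A → a or A → B C:

```python
def cyk_parse(tokens, unit_rules, pair_rules, start="S"):
    """Return True if `tokens` can be derived from `start`.

    unit_rules: list of (lhs, terminal) for productions A -> a
    pair_rules: list of (lhs, (B, C)) for productions A -> B C
    """
    n = len(tokens)
    # table[i][l] = set of non-terminals deriving tokens[i : i + l + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = {lhs for lhs, rhs in unit_rules if rhs == tok}
    for length in range(2, n + 1):            # length of the span
        for i in range(n - length + 1):       # start of the span
            for split in range(1, length):    # where the span is divided
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in pair_rules:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]
```

For instance, with the CNF grammar S → A B, A → a, B → b, the call `cyk_parse(["a", "b"], [("A", "a"), ("B", "b")], [("S", ("A", "B"))])` accepts the input. The triple loop makes the running time cubic in the input length, which is one reason CYK is avoided in compilers.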
9.3.2 Top Down Parsers
The top-down parser gets its name from the way the derivation tree is constructed, from the root
downwards, to validate a string against a particular grammar. If the input string can be derived
from the start symbol of the grammar, then we conclude that the string belongs to the grammar.
For example, consider the grammar
S → c A d
A → ab | a (9.7)
and let “S” be the start symbol and consider the input “cad”. The first step of the derivation is
given by the following
      S
    / | \
   c  A  d
Consider replacing “A” with one of its productions “ab” and this would result in the following
      S
    / | \
   c  A  d
     / \
    a   b
The derived string is read from the leaves of the derivation tree from left to right. Thus the
string formed is "cabd", and this does not match the input string. So we try the alternate
production for "A", and this results in the string "cad".
      S
    / | \
   c  A  d
      |
      a
Thus in top-down parsing, we try all possible productions, and if the derived string doesn't
match, we backtrack and try an alternate production to derive the string.
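The backtracking just described can be sketched as a hand-written recognizer for grammar (9.7). The function names below are our own; this is only an illustrative sketch, not the module's implementation:

```python
def match_A(s, pos):
    """Try each A-production in order; return the positions after A."""
    results = []
    if s[pos:pos + 2] == "ab":    # A -> a b, tried first
        results.append(pos + 2)
    if s[pos:pos + 1] == "a":     # A -> a, the alternative on backtrack
        results.append(pos + 1)
    return results

def match_S(s):
    """S -> c A d: succeed iff the whole input matches."""
    if s[:1] != "c":
        return False
    for after_A in match_A(s, 1):   # backtrack over A's alternatives
        if s[after_A:] == "d":      # the rest of the input must be 'd'
            return True
    return False
```

On input "cad" the first alternative A → ab fails to leave a "d", so the recognizer backtracks to A → a and succeeds; on "cabd" the first alternative succeeds immediately.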
As backtracking is a typical characteristic of top-down parsing, these parsers have to handle it
while validating sentences; such parsers are called recursive descent parsers. A variation of the
recursive descent parser is the predictive parser, which avoids backtracking. One such type of
predictive parser is the LL parser, where the first "L" stands for the input being scanned from
left to right and the second "L" stands for leftmost derivation.
9.3.3 Bottom Up Parsers
As already discussed, bottom-up parsing starts from the string and combines (reduces) its
symbols to reach the start symbol in a bottom-up fashion. Consider the same string "cad" for the
grammar given in (9.7).
c a d
c A d    (reducing "a" to A)
  S      (reducing "c A d" to S)
As the start symbol "S" could be reached, this string belongs to the grammar. LR parsers are
bottom-up parsers, where the "L" indicates the input being scanned from left to right and the
"R" indicates that a rightmost derivation (in reverse) is applied.
9.3.4 Property used by parsers
Top-down and bottom-up parsers parse the string based on the viable-prefix property. The
property states that the parser detects an error as soon as it has seen a prefix of the input that is
not a prefix of any string in the language, and it can then recover from the error before the
string is fully processed. This property is based on identifying the possible prefixes of all
strings that belong to a given context free grammar. All programming language constructs are
defined using context free grammars. For example, consider the following grammar with "stmt"
as the start symbol, {stmt, E} as non-terminals, {if, then, else, a, b} as terminals, and
productions defined as follows:
stmt → if E then stmt else stmt
stmt → if E then stmt
stmt → a
E → b
This grammar defines the CFG for the "if-else" programming construct. Context free grammars
are defined in this way for all programming constructs, and all strings that are part of the
programming language are based on these constructs. Hence, parsers are designed with the CFG
in mind. The following are some of the top-down and bottom-up parsers which we will look
into in this course.
• Top-Down parser - LL(k): Input is scanned from left-to-right, Leftmost derivation, k
tokens lookahead
• Bottom Up Parsers - LR(k): Input is scanned from left-to-right, Rightmost derivation, k
tokens lookahead
9.4 LL Parsers
LL parsers are non-recursive top-down parsers. The parsing action is driven by a table to avoid
backtracking. LL(k) parsers use k tokens of lookahead. In this module, the value of k is chosen
as 1, where we consider only 1 symbol of lookahead.
However, LL parsers cannot handle a left recursive grammar or a grammar that has left
factoring. Formally, a grammar is left recursive if
∃A ∈ V such that there is a derivation A ⇒+ Aα, for some string α ∈ (V ∪ T)*
For example, in the grammar A → Aα | β | γ, the production A → Aα is said to be left
recursive, as the left-hand side non-terminal is A and the first symbol on the right-hand side of
the production is also A.
On the other hand, when a non-terminal has two or more productions that share a common
prefix of grammar symbols, the grammar is said to have the left-factor property. Formally, a
grammar has the left-factor property if
∃α ∈ (V ∪ T)+ and strings β1, β2, …, βn such that there exist productions A → αβ1 | αβ2 | … | αβn
For example, consider a grammar involving the following productions:
A → αβ1 | αβ2 | … | αβn | γ
In this grammar, the prefix α is common to the first n productions; hence, while deriving a
string, one would not know which production to substitute. This involves a lot of backtracking,
and therefore the grammar should be left factored before LL(1) parsing.
9.4.1 Elimination of Left Recursion
This is the first step of the pre-processing that is necessary for a grammar to be parsed by an
LL(1) parser. As already discussed, if A → Aα is a production, it is a left recursive
A-production, and this needs to be removed from the grammar. The algorithm for removing left
recursion is given in algorithm 9.1.
LeftrecursionEliminate(G)
{
  Arrange the non-terminals in some order A1, A2, …, An
  for i = 1, …, n do
    for j = 1, …, i-1 do
      replace each production of the form
        Ai → Aj γ
      with
        Ai → δ1 γ | δ2 γ | … | δk γ
      where
        Aj → δ1 | δ2 | … | δk are all the current Aj-productions
    enddo
    eliminate the immediate left recursion among the Ai-productions
  enddo
}
Algorithm 9.1 Left recursion elimination
The first step arranges all the non-terminals in some order, starting from the start symbol. The
two nested loops pair every non-terminal with each non-terminal that precedes it in this order.
The logic behind the elimination algorithm is to substitute for any non-terminal that appears at
the start of another non-terminal's production, thus exposing any indirect left recursion as
immediate left recursion. After forming the new productions, the immediate left recursion is
eliminated using the following procedure. Consider a non-terminal A with the following left
recursive and non-left-recursive productions:
A → Aα1 | Aα2
A → β1 | β2
Left recursion is eliminated by converting the grammar to a right-recursive grammar using the
following procedure. A new non-terminal is introduced for every non-terminal that has a left
recursive production. The non-left-recursive productions are suffixed with this new
non-terminal, and the new non-terminal produces the suffix (the α part) of each left recursive
production's RHS, again suffixed by the new non-terminal, together with an ε-production:
A → β1 AR | β2 AR
AR → α1 AR | α2 AR | ε
As can be seen from the above set of productions, the grammar has now become right recursive
on the new non-terminal AR.
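The procedure above can be sketched in code. The helper below is our own illustration (using our own naming convention nt + "R" for the new non-terminal); productions are represented as lists of symbols, and the empty list stands for ε:

```python
def remove_immediate_left_recursion(nt, productions):
    """Split A's productions into A -> A alpha (left recursive) and
    A -> beta, then rebuild them as A -> beta AR, AR -> alpha AR | eps."""
    alphas = [p[1:] for p in productions if p and p[0] == nt]
    betas = [p for p in productions if not p or p[0] != nt]
    if not alphas:                        # no immediate left recursion
        return {nt: productions}
    new_nt = nt + "R"                     # fresh non-terminal, e.g. ER
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],  # [] is eps
    }
```

Applied to the productions E → E + T | T of Example 9.2, `remove_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])` yields E → T ER and ER → + T ER | ε, the right-recursive form described above.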
Example 9.1
Remove left recursion from the following grammar
A → B C | a
B → C A | A b
C → A B | C C | a
The non-terminals are arranged in the order A, B, C to check for direct and indirect left
recursion. The first iteration of "i" doesn't do anything, as "j" only runs up to "i-1".
i = 1: nothing to do
The second iteration considers the non-terminal B for removing left recursion. For this iteration
of “i" there is only one iteration for “j”.
i = 2, j = 1: B → C A | A b
becomes B → C A | B C b | a b
Substituting the A-productions (B C and a) into A b introduces an immediate left recursion in
the production B → B C b. This is removed by introducing a new non-terminal for B, called BR:
(imm) B → C A BR | a b BR
      BR → C b BR | ε
The third iteration of "i" has two iterations of "j", checking the indirect left recursion of C with
respect to A as well as B. As can be seen from the first iteration of "j", substituting for A does
not produce an immediate left recursion:
i = 3, j = 1: C → A B | C C | a
becomes C → B C B | a B | C C | a
With this as the basis, the second iteration of "j" substitutes for B, and this introduces a left
recursion:
i = 3, j = 2: C → B C B | a B | C C | a
becomes C → C A BR C B | a b BR C B | a B | C C | a
The immediate left recursion for C is removed by introducing a new non-terminal for C, called
CR:
(imm) C → a b BR C B CR | a B CR | a CR
      CR → A BR C B CR | C CR | ε
Example 9.2
Remove left recursion from the following grammar
E → E + T | T
T → T * F | F
F → ( E ) | id
The non-terminals E and T are directly left recursive. No T-production or F-production begins
with E, and no F-production begins with T, so the substitutions in the inner loop do nothing and
no indirect left recursion arises. Hence, the non-terminals E and T need two new non-terminals
E' and T', and the grammar with left recursion removed is given below:
• E → T E'
• E' → + T E' | ε
• T → F T'
• T' → * F T' | ε
• F → ( E ) | id
9.4.2 Left Factoring
After removing left recursion from the grammar, left factoring of the grammar is done. Left
factoring also involves introducing a new non-terminal, which produces the suffixes that follow
the common prefix. The existing non-terminal produces the common prefix followed by the
new non-terminal. In the following example, α is the common prefix and the new non-terminal
is AR; thus A is made to produce α AR along with its other productions, and AR produces all
the distinct suffixes of the factored productions of A.
Input grammar:
A → αβ1 | αβ2 | … | αβn | γ
Left-factored grammar:
A → α AR | γ
AR → β1 | β2 | … | βn
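One left-factoring step can likewise be sketched in code. The helpers below are our own illustration (again using nt + "R" for the new non-terminal); productions are lists of symbols, the empty list stands for ε, and for simplicity the sketch assumes at most one group of productions shares a first symbol:

```python
def common_prefix(seqs):
    """Longest list of symbols shared as a prefix by all of `seqs`."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) > 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor_once(nt, productions):
    """One left-factoring pass over the productions of `nt`."""
    groups = {}
    for p in productions:                     # group by first symbol
        groups.setdefault(p[0] if p else None, []).append(p)
    result = {nt: []}
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[nt].extend(group)          # nothing to factor here
        else:
            alpha = common_prefix(group)      # the common prefix
            new_nt = nt + "R"                 # fresh non-terminal, e.g. SR
            result[nt].append(alpha + [new_nt])
            result[new_nt] = [p[len(alpha):] for p in group]  # suffixes
    return result
```

Applied to Example 9.3 below, the two S-productions starting with i share the prefix i C t S, so S becomes S → i C t S SR | a with SR → ε | e S.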
Example 9.3
Left factor the following grammar.
S → iCtS | iCtSeS | a
C → b
This grammar does not have left recursion and hence can be directly left factored. A new
non-terminal S' is introduced; the common prefix iCtS of the first two S-productions is suffixed
with S', and the other production of S is retained. The new non-terminal S' produces the suffixes
of the productions that share the common prefix, which in this example are "ε" and "eS".
• S → iCtSS' | a
• S' → eS | ε
• C → b
The modified grammar is now ready for the implementation of LL(1) parser.
Summary
In this module, derivations and derivation trees for deriving a string using a Context Free
Grammar are discussed. The types of derivation and parsing and the various parsers are briefly
described. The pre-processing tasks of left recursion elimination and left factoring, which are
needed for the implementation of an LL(1) parser, are discussed. The next module will focus on
constructing the LL(1) parsing table from the pre-processed grammar.