Unit 6 Compilers
Transcript
Page 1: Unit 6

Unit 6: Compilers

Page 2: Unit 6

Introduction

A compiler is a program that reads a program written in one language, called the source language, and translates it into an equivalent program in another language, called the target language.

There are two parts of compilation: analysis and synthesis.
• Analysis: creates an intermediate representation of the source program (SP)
• Synthesis: constructs the desired target program

Diagram: the source program is fed to the compiler, which produces the target program; error messages are emitted along the way.

Page 3: Unit 6

Phases of a Compiler

A compiler operates in phases. In order, they are:
1. Lexical analyzer
2. Syntax analyzer
3. Semantic analyzer
4. Intermediate code generator
5. Code optimizer
6. Code generator

The symbol table manager and the error handler interact with all six phases. The input to the first phase is the source program; the output of the last is the target program.

Page 4: Unit 6

Phases of a Compiler: Lexical analyzer

Performs lexical analysis, also known as linear analysis or scanning. The stream of characters is read from left to right and grouped into tokens; white space is eliminated during lexical analysis.

For example, from the statement

position = initial + rate * 60

the following tokens are formed:
• The identifier position
• The assignment symbol =
• The identifier initial
• The plus sign
• The identifier rate
• The multiplication sign
• The number 60
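The grouping above can be sketched with a small scanner. This is an illustrative sketch, not a production lexer: the token names (`id`, `num`, `assign`, ...) and the `tokenize` helper are my own choices, built on Python's `re` module.

```python
import re

# Token classes for the example statement; white space is matched but dropped,
# as the slide describes.
TOKEN_SPEC = [
    ("num",    r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("assign", r"="),
    ("plus",   r"\+"),
    ("times",  r"\*"),
    ("ws",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Group the character stream, left to right, into (token, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "ws":          # eliminate white space
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("position = initial + rate * 60"))
```

Running this prints the same seven tokens the slide lists, in order.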

Page 5: Unit 6

Phases of a Compiler: Syntax analyzer

Performs syntax analysis, also known as hierarchical analysis or parsing. It involves grouping the tokens of the source program into grammatical phrases, which are then represented by a parse tree.

Page 6: Unit 6

Phases of a Compiler

Diagram: parse tree for position = initial + rate * 60. The root is an assignment statement with three children: the identifier position, the symbol =, and an expression. The expression expands to expression + expression; the left operand is the identifier initial, and the right operand is itself expression * expression, with the identifier rate and the number 60 as leaves.

Parse tree for position = initial + rate * 60

Page 7: Unit 6

Phases of a Compiler

A syntax tree is a compressed representation of a parse tree. The operators appear in the interior nodes, and the operands of an operator are the children of the node for that operator.

Diagram: syntax tree for position = initial + rate * 60. The root is =, with position as its left child and + as its right child; + has children initial and *, and * has children rate and 60.

Syntax tree for position = initial + rate * 60

Page 8: Unit 6

Phases of a Compiler: Semantic analyzer

Performs semantic analysis. It involves checking the source program for semantic errors and gathering type information. An important component is type checking.

Diagram: the syntax tree after semantic analysis. Because rate is a real, the integer 60 is converted by the operator inttoreal: the root = has children position and +, + has children initial and *, and * has children rate and inttoreal(60).

Page 9: Unit 6

Phases of a Compiler: Intermediate code generation

Intermediate code must have two properties: it must be easy to produce and easy to translate into the target program. It can be in different forms; one such form is "three-address code", which is like an assembly language. Three-address code consists of a sequence of instructions, each of which has at most three operands. For id1 = id2 + id3 * 60:

temp1 = inttoreal(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
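The generation of such code can be sketched as a bottom-up walk of the expression tree, emitting one instruction per operator. This is a minimal illustration, not the book's algorithm; the `gen` helper, the tuple encoding of the tree, and the temp-naming scheme are assumptions, and the inttoreal conversion is omitted here.

```python
def gen(node, code, counter):
    """Return the name holding node's value, appending instructions to code."""
    if isinstance(node, str):            # a leaf: an identifier or a constant
        return node
    op, left, right = node               # interior node: (operator, lhs, rhs)
    l = gen(left, code, counter)
    r = gen(right, code, counter)
    counter[0] += 1
    t = f"temp{counter[0]}"              # fresh temporary for this operator
    code.append(f"{t} = {l} {op} {r}")   # at most three operands per instruction
    return t

# id1 = id2 + id3 * 60
code, counter = [], [0]
result = gen(("+", "id2", ("*", "id3", "60")), code, counter)
code.append(f"id1 = {result}")
print("\n".join(code))
```

The walk visits the deepest operator first, so * is emitted before +, mirroring the sequence on the slide.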

Page 10: Unit 6

Phases of a Compiler: Code optimizer

Attempts to improve the intermediate code. For example,

temp1 = inttoreal(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3

is optimized to

temp1 = id3 * 60.0
id1 = id2 + temp1

Page 11: Unit 6

Phases of a Compiler: Code generator

Deals with the generation of target code, consisting of relocatable machine code or assembly code. For the running example:

MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

Here '#' marks a constant operand, the F suffix denotes floating-point instructions, and in each instruction the first operand is the source and the second is the destination.

Page 12: Unit 6

Phases of a Compiler: Symbol table management

A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. It allows us to find the record for each identifier and to store or retrieve data from that record.

Error detection and reporting: each phase can encounter errors, and each phase must deal with those errors.

Page 13: Unit 6

Lexical Analyzer

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

Diagram: interaction of the lexical analyzer with the parser. The lexical analyzer reads the source program and, each time the parser issues a "get next token" request, returns the next token; both components consult the symbol table.

Page 14: Unit 6

Lexical Analyzer: Tokens, Patterns and Lexemes

A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.

A token is an abstract symbol representing a kind of lexical unit, e.g., a keyword or an identifier.

A pattern is a description of the form that the lexemes of a token may take.

For example, in the statement

const pi = 3.1416;

the substring pi is a lexeme for the token identifier.

Page 15: Unit 6

Lexical Analyzer: Tokens, Patterns and Lexemes

As another example, in the C statement

printf("Total=%d\n", score);

both printf and score are lexemes matching the pattern for the token id, and "Total=%d\n" is a lexeme matching the pattern for the token literal.

In most programming languages, the following constructs are treated as tokens: keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as parentheses, commas, and semicolons.

Page 16: Unit 6

Lexical Analyzer: Tokens, Patterns and Lexemes

TOKEN     SAMPLE LEXEMES        INFORMAL DESCRIPTION OF PATTERN
const     const                 const
if        if                    if
relation  <, <=, =, <>, >, >=   < or <= or = or <> or > or >=
id        pi, count, D2         letter followed by letters and digits
num       3.1416, 0, 6.02E23    any numeric constant
literal   "core dumped"         any characters between " and " except "

Page 17: Unit 6

Lexical Analyzer: Specification of Tokens

Regular expressions are an important notation for specifying tokens.

Strings and languages:

The term alphabet or character class denotes any finite set of symbols; e.g., the set {0, 1} is the binary alphabet.

A string over some alphabet is a finite sequence of symbols drawn from that alphabet.

The term language denotes any set of strings over some fixed alphabet.

Page 18: Unit 6

Lexical Analyzer: Specification of Tokens — Operations on languages

There are several important operations, such as union, concatenation and closure, that can be applied to languages.

For example, let L be the set {A, B, ..., Z, a, b, ..., z} and D be the set {0, 1, ..., 9}. Then:
1. L ∪ D is the set of letters and digits
2. LD is the set of strings consisting of a letter followed by a digit
3. L4 is the set of all four-letter strings
4. L* is the set of all strings of letters, including ε, the empty string
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter
6. D+ is the set of all strings of one or more digits
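These operations can be checked directly on Python sets of strings. This is an illustrative sketch: `concat` and `closure_upto` are hypothetical helper names, and the Kleene closure is truncated to a finite length, since the full closure is an infinite set.

```python
from itertools import product
from string import ascii_uppercase, ascii_lowercase, digits

L = set(ascii_uppercase) | set(ascii_lowercase)   # {A, ..., Z, a, ..., z}
D = set(digits)                                   # {0, ..., 9}

def concat(X, Y):
    """Concatenation XY: every string of X followed by every string of Y."""
    return {x + y for x, y in product(X, Y)}

def closure_upto(X, n):
    """Finite slice of the Kleene closure X*: concatenations of up to n pieces."""
    result, layer = {""}, {""}                    # ε is always in X*
    for _ in range(n):
        layer = concat(layer, X)
        result |= layer
    return result

print(len(L | D))                         # L ∪ D: 62 letters and digits
print(len(concat(L, D)))                  # LD: 52 * 10 = 520 strings
print(len(closure_upto({"a", "b"}, 2)))   # ε plus all strings of length 1 and 2: 7
```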

Page 19: Unit 6

Lexical Analyzer: Specification of Tokens — Regular Expressions

An identifier is a letter followed by zero or more letters or digits, captured by the expression: letter (letter | digit)*

The | here means "or", the parentheses are used to group subexpressions, the star means "zero or more instances of" the parenthesized expression, and the juxtaposition of letter with the remainder of the expression means concatenation.

A regular expression is built up out of simpler regular expressions using a set of defining rules. Each regular expression r denotes a language L(r).

Page 20: Unit 6

Lexical Analyzer: Specification of Tokens — Regular Expressions (RE)

The rules that define the regular expressions over an alphabet ∑:
1. ε is a RE that denotes {ε}, the set containing the empty string.
2. If a is a symbol in ∑, then a is a RE that denotes {a}, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
   a) (r)|(s) is a RE denoting L(r) ∪ L(s)
   b) (r)(s) is a RE denoting L(r)L(s)
   c) (r)* is a RE denoting (L(r))*
   d) (r) is a RE denoting L(r)

A language denoted by a RE is said to be a regular set.

Page 21: Unit 6

Lexical Analyzer: Specification of Tokens — Regular Expressions

Example: let ∑ = {a, b}.
• The RE a|b denotes the set {a, b}.
• The RE (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two.
• The RE a* denotes the set of all strings of zero or more instances of a: {ε, a, aa, ...}.
• The RE (a|b)* denotes the set of all strings of zero or more instances of a or b.
• The RE a|a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b.
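The examples can be verified with Python's `re` module, whose syntax matches the notation here closely; `fullmatch` requires the whole string to belong to the denoted language, and ε corresponds to the empty string.

```python
import re

# Each assertion checks one of the example regular expressions above.
assert re.fullmatch(r"a|b", "a")
assert re.fullmatch(r"(a|b)(a|b)", "ab")
assert re.fullmatch(r"a*", "")            # a* accepts ε, the empty string
assert re.fullmatch(r"a*", "aaa")
assert re.fullmatch(r"(a|b)*", "abba")
assert re.fullmatch(r"a|a*b", "aab")      # zero or more a's followed by b
assert not re.fullmatch(r"a|a*b", "ba")   # not in the language
print("all regular-expression examples hold")
```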

Page 22: Unit 6

Lexical Analyzer: Specification of Tokens — Regular definitions

If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
...

where each di is a distinct name and each ri is a regular expression over the symbols in ∑ ∪ {d1, d2, ..., di-1}.

Example: consider the set of strings of letters and digits beginning with a letter. The regular definition for the set is

letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | 2 | ... | 9
id → letter ( letter | digit )*

Page 23: Unit 6

Lexical Analyzer: Recognition of Tokens

Consider the following grammar fragment:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions:

if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ (. digit+)? (E (+|-)? digit+)?

Page 24: Unit 6

Lexical Analyzer: Recognition of Tokens

REGULAR EXPRESSION   TOKEN   ATTRIBUTE VALUE
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE

Page 25: Unit 6

Lexical Analyzer: Finite Automata

A recognizer for a language is a program that takes as input a string x and answers ‘yes’ if x is a sentence of the language and ‘no’ otherwise.

A finite automaton can be deterministic or nondeterministic.

They are represented by transition graphs.

In these labeled directed graphs, the nodes are the states and the labeled edges represent the transition function.

Page 26: Unit 6

Lexical Analyzer: Finite Automata

Nondeterministic Finite Automata (NFA): a mathematical model consisting of:

1. a set of states S
2. a set of input symbols ∑ (the input alphabet)
3. a transition function move that maps state-symbol pairs to sets of states
4. a state s0 as the start or initial state
5. a set of states F as the final or accepting states

Page 27: Unit 6

Lexical Analyzer: Finite Automata — Nondeterministic Finite Automata

Example: the transition graph for an NFA that recognizes the language (a|b)*abb.

Set of states S = {0, 1, 2, 3}
Input alphabet ∑ = {a, b}
Initial state: 0
Accepting state: 3, indicated by a double circle

Diagram: start → state 0, which has a self-loop on both a and b; 0 → 1 on a, 1 → 2 on b, and 2 → 3 on b.

Page 28: Unit 6

Lexical Analyzer: Finite Automata — Nondeterministic Finite Automata

Transition table: there is a row for each state and a column for each input symbol. The entry for row i and symbol a is the set of states that can be reached by a transition from state i on input a.

State   a       b
0       {0,1}   {0}
1       -       {2}
2       -       {3}
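The transition table translates directly into a Python dict, and the NFA can be simulated by tracking the set of states it could currently be in. A sketch; the `accepts` helper is an illustrative name, and missing dict entries mean "no transition".

```python
# Transition table for the NFA recognizing (a|b)*abb.
nfa = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
start, accepting = 0, {3}

def accepts(word):
    """Simulate the NFA by following every possible transition in parallel."""
    states = {start}
    for sym in word:
        states = set().union(*(nfa.get((s, sym), set()) for s in states))
    return bool(states & accepting)       # accept if any final state is reached

print(accepts("abb"), accepts("aabb"), accepts("abab"))
```

This prints True True False: only strings ending in abb are accepted.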

Page 29: Unit 6

Lexical Analyzer: Finite Automata

Deterministic Finite Automata (DFA): a mathematical model in which

1. no state has an ε-transition, i.e., a transition on input ε, and
2. for each state s and input symbol a, there is at most one edge labeled a leaving s.

Since there is at most one transition from each state on any input, it is very easy to determine whether a DFA accepts an input string.

Page 30: Unit 6

Lexical Analyzer: Finite Automata — Conversion of NFA to DFA: Subset construction algorithm

Input: NFA N
Output: equivalent DFA D
Method: the operations used are:

operation        description
ε-closure(s)     set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)     set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)       set of NFA states to which there is a transition on input symbol a from some NFA state s in T

Page 31: Unit 6

Lexical Analyzer: Finite Automata — Conversion of NFA to DFA

Subset construction algorithm:

initially, ε-closure(s0) is the only state in D-states, and it is unmarked;
while there is an unmarked state T in D-states do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in D-states then
            add U as an unmarked state to D-states;
        Dtrans[T, a] := U
    end
end
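A runnable sketch of the algorithm, specialized to NFAs without ε-transitions (so ε-closure(T) is T itself), applied to the NFA for (a|b)*abb from the earlier slides. The function name and encoding are my own.

```python
def subset_construction(nfa, start, alphabet):
    """Build DFA states (sets of NFA states) and transitions from an NFA."""
    dstates = [frozenset({start})]            # ε-closure(s0), the only initial state
    unmarked = [frozenset({start})]
    dtrans = {}
    while unmarked:
        T = unmarked.pop()                    # mark T
        for a in alphabet:
            # U := move(T, a); ε-closure is the identity here (no ε-transitions)
            U = frozenset(set().union(*(nfa.get((s, a), set()) for s in T)))
            if U not in dstates:
                dstates.append(U)             # add U as an unmarked state
                unmarked.append(U)
            dtrans[(T, a)] = U
    return dstates, dtrans

nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
dstates, dtrans = subset_construction(nfa, 0, "ab")
print(len(dstates))    # 4 DFA states: {0}, {0,1}, {0,2}, {0,3}
```

The resulting DFA has a state for each reachable subset; {0,3} is accepting because it contains NFA state 3.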

Page 32: Unit 6

Lexical Analyzer: Finite Automata — From a regular expression to an NFA: Thompson's Construction

To convert a regular expression r over an alphabet ∑ into an NFA N accepting L(r):

Parse r into its constituent subexpressions, construct NFAs for each of the basic symbols in r, and combine them following the structure of r.

Page 33: Unit 6

Lexical Analyzer Generator

Lex is used to specify lexical analyzers for a variety of languages. The tool itself is referred to as the Lex compiler.

Creating a lexical analyzer with Lex:
1. A Lex source program lex.l is fed to the Lex compiler, which produces lex.yy.c.
2. lex.yy.c is fed to the C compiler, which produces a.out.
3. a.out reads an input stream and produces a sequence of tokens.

Page 34: Unit 6

Lexical Analyzer Generator: Lex specifications

A Lex program consists of three parts:

declarations
%%
translation rules
%%
auxiliary procedures

The declaration section includes declarations of variables, manifest constants and regular definitions.

The translation rules are of the form:

p1 {action1}
p2 {action2}
...  ...

Here, each pi is a regular expression and each actioni is a program fragment describing what action is to be taken when pattern pi matches a lexeme.

Page 35: Unit 6

Lexical Analyzer Generator: Design

Given a set of specifications, the lexical analyzer should look for lexemes. This is usually implemented using a finite automaton.

The lexical analyzer generator constructs a transition table for a finite automaton from the regular expression patterns in the lexical analyzer generator specification.

The lexical analyzer itself consists of a finite automaton simulator that uses this transition table to look for the regular expression patterns in the input buffer.

This can be implemented using an NFA or a DFA. The transition table for an NFA is considerably smaller than that for a DFA, but the DFA recognizes patterns faster than the NFA.

Page 36: Unit 6

Lexical Analyzer Generator: Design

Model of the Lex compiler:
a) Lex compiler: a Lex specification is fed to the Lex compiler, which produces a transition table.
b) Schematic lexical analyzer: an FA simulator reads lexemes from the input buffer, using the transition table to recognize them.

Page 37: Unit 6

Syntax Analysis

Every programming language has rules that prescribe the syntactic structure of well-formed programs. The syntax of programming language constructs can be described by context-free grammars or BNF (Backus-Naur Form) notation.

Grammars offer significant advantages:
• A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming language.
• From certain classes of grammars, we can automatically construct an efficient parser that determines whether a source program is syntactically well-formed.
• A properly designed grammar imparts a structure to a programming language that is useful for the translation of source programs.
• New constructs can be added to a language.

Page 38: Unit 6

Syntax Analysis

The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language. The parser should report syntax errors, if any.

Diagram: the lexical analyzer reads the source program and supplies a token to the parser on each "get next token" request; the parser builds a parse tree and passes it to the rest of the front end, which produces an intermediate representation. Both the lexical analyzer and the parser consult the symbol table.

Page 39: Unit 6

Syntax Analysis

Three general types of parsers for grammars:
• Universal parsing methods: can parse any grammar, but are too inefficient to use in production compilers.
• Top-down methods: build parse trees from the top (root) to the bottom (leaves).
• Bottom-up methods: start from the leaves and work up to the root.

Page 40: Unit 6

Context-free grammars

Consider a conditional statement defined by a rule such as: if S1 and S2 are statements and E is an expression, then "if E then S1 else S2" is a statement. In grammar notation:

stmt → if expr then stmt else stmt

A context-free grammar consists of terminals, non-terminals, a start symbol and productions.

Page 41: Unit 6

Context-free grammars

1. Terminals are the basic symbols from which strings are formed. The word "token" is a synonym for "terminal" when we are talking about grammars for programming languages.

2. Non-terminals are syntactic variables that denote sets of strings and help define the language generated by the grammar. They impose a hierarchical structure on the language.

3. In a grammar, one non-terminal is distinguished as the start symbol, and the set of strings it denotes is the language denoted by the grammar.

4. The productions of a grammar specify the manner in which terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal, followed by an arrow (→), followed by a string of non-terminals and terminals.

Page 42: Unit 6

Context-free grammars

Example: the grammar with the following productions:

expr → expr op expr
expr → (expr)
expr → -expr
expr → id
op → +
op → -
op → *
op → /

In this grammar, the terminal symbols are id, +, -, *, /, ( and ). The non-terminal symbols are expr and op, and expr is the start symbol.

The same grammar in compact notation:

E → EAE | (E) | -E | id
A → + | - | * | /

where E and A are the non-terminals, while id, +, -, *, /, ( and ) are the terminals.

Page 43: Unit 6

Derivation and Parse trees

• Consider the following grammar: E → E+E | E*E | (E) | -E | id
• E ==> -E is read as "E derives -E".
• We can take a single E and repeatedly apply productions in any order to obtain a sequence of replacements.
• For e.g., E ==> -E ==> -(E) ==> -(id)
• We call such a sequence of replacements a derivation of -(id) from E.

Page 44: Unit 6

Derivation and Parse trees

• Given a grammar G with start symbol S, we can use the relation ==>+ (derives in one or more steps) to define L(G), the language generated by G.
• A string of terminals w is in L(G) if and only if S ==>+ w; the string w is called a sentence of G.
• If S ==>* α, where α may contain non-terminals, then α is a sentential form of G.

Page 45: Unit 6

Derivation and Parse trees

• A parse tree may be viewed as a graphical representation for a derivation.
• Each interior node of a parse tree is labeled by a non-terminal.
• The leaves are labeled, from left to right, by non-terminals or terminals.
• For e.g., the derivation of -(id+id):

E ==> -E ==> -(E) ==> -(E + E) ==> -(id + E) ==> -(id + id)

Diagram: the corresponding parse tree. The root E has children - and E; that E has children (, E and ); the inner E has children E, + and E, whose leaves are id and id.

Page 46: Unit 6

Derivation and Parse trees

• Example: two derivations of id+id*id:

(a) E ==> E+E ==> id+E ==> id+E*E ==> id+id*E ==> id+id*id
(b) E ==> E*E ==> E+E*E ==> id+E*E ==> id+id*E ==> id+id*id

Diagram: the two corresponding parse trees. In (a), the root is E → E + E and the right operand expands to E * E; in (b), the root is E → E * E and the left operand expands to E + E.

Page 47: Unit 6

Ambiguity• A grammar that produces more than one parse tree for some

sentence is said to be ambiguous

• An ambiguous grammar is one that produces more than one leftmost or more than one rightmost derivation for the same sentence.

• Carefully writing the grammar can eliminate ambiguity.

Page 48: Unit 6

Elimination of Left Recursion

• Definition: a grammar is left recursive if it has a non-terminal A such that there is a derivation A ==>+ Aα for some string α.

• Top-down parsing methods cannot handle left-recursive grammars, so a transformation that eliminates left recursion is needed.

• A left-recursive pair of productions A → Aα | β can be replaced by the non-left-recursive productions:

A → βA'
A' → αA' | ε

Page 49: Unit 6

Elimination of Left Recursion

Algorithm:

Input: grammar G with no cycles or ε-productions.
Output: an equivalent grammar with no left recursion.
Method: apply the algorithm to G. Note that the resulting non-left-recursive grammar may have ε-productions.

1. Arrange the non-terminals in some order A1, A2, ..., An.
2. for i := 1 to n do begin
       for j := 1 to i - 1 do begin
           replace each production of the form Ai → Aj γ by the productions
           Ai → δ1 γ | δ2 γ | ... | δk γ,
           where Aj → δ1 | δ2 | ... | δk are all the current Aj-productions
       end;
       eliminate the immediate left recursion among the Ai-productions
   end

Page 50: Unit 6

Elimination of Left Recursion

No matter how many A-productions there are, we can eliminate immediate left recursion from them. First, we group the A-productions as

A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn

where no βi begins with an A. Then we replace the A-productions by

A → β1A' | β2A' | ... | βnA'
A' → α1A' | α2A' | ... | αmA' | ε

Page 51: Unit 6

Elimination of Left Recursion

Example: consider the following grammar:

E → E+T | T
T → T*F | F
F → (E) | id

Eliminating the immediate left recursion from the productions for E and then for T, we obtain:

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

Page 52: Unit 6

Left Factoring

Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive parsing. The basic idea applies when it is not clear which of two alternative productions to use to expand a non-terminal A.

For example: if A → αβ1 | αβ2 are two A-productions and the input begins with a non-empty string derived from α, we do not know whether to expand A to αβ1 or to αβ2.

We may defer the decision by expanding A to αA'. Then, after seeing the input derived from α, we expand A' to β1 or to β2:

A → αA'
A' → β1 | β2

Page 53: Unit 6

Left Factoring

Algorithm:

Input: grammar G.
Output: an equivalent left-factored grammar.
Method: for each non-terminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε, i.e., there is a non-trivial common prefix, replace all the A-productions

A → αβ1 | αβ2 | ... | αβn | γ

where γ represents all alternatives that do not begin with α, by

A → αA' | γ
A' → β1 | β2 | ... | βn

Here A' is a new non-terminal. Repeatedly apply this transformation until no two alternatives for a non-terminal have a common prefix.

Page 54: Unit 6

Left Factoring

Example: consider the following grammar:

S → iEtS | iEtSeS | a
E → b

Left-factored, it becomes:

S → iEtSS'
S' → eS | ε
E → b
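One step of the transformation can be sketched as follows. This is an assumption-laden sketch, not the book's algorithm: `left_factor` and `common_prefix` are hypothetical names, right sides are lists of symbols, [] encodes ε, and only the first group of alternatives sharing a first symbol is factored.

```python
from collections import defaultdict

def common_prefix(alts):
    """Longest prefix of symbols shared by every alternative in alts."""
    prefix = []
    for symbols in zip(*alts):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor(A, alternatives):
    """Pull the longest common prefix of a group of alternatives into A'."""
    groups = defaultdict(list)
    for alt in alternatives:
        groups[alt[0] if alt else ""].append(alt)
    for key, group in groups.items():
        if key and len(group) >= 2:                 # a non-trivial common prefix exists
            alpha = common_prefix(group)
            A2 = A + "'"
            gamma = [a for a in alternatives if a not in group]
            return {
                A: [alpha + [A2]] + gamma,          # A -> alpha A' | gamma
                A2: [g[len(alpha):] for g in group] # remainders; [] is ε
            }
    return {A: alternatives}

# S -> iEtS | iEtSeS | a  becomes  S -> iEtSS' | a,  S' -> ε | eS
g = left_factor("S", [["i", "E", "t", "S"], ["i", "E", "t", "S", "e", "S"], ["a"]])
print(g)
```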

Page 55: Unit 6

Parsing methods

The syntax analysis phase of a compiler verifies that the sequence of tokens extracted by the scanner represents a valid sentence in the grammar of the programming language.

There are two major parsing approaches: top-down and bottom-up.

In top-down parsing, you start with the start symbol and apply the productions until you arrive at the desired string.

In bottom-up parsing, you start with the string and reduce it to the start symbol by applying the productions backwards.

Page 56: Unit 6

Parsing methods

Consider the following grammar:

S → AB
A → aA | ε
B → b | bB

Here is a top-down parse of aaab:

S
AB      S → AB
aAB     A → aA
aaAB    A → aA
aaaAB   A → aA
aaaεB   A → ε
aaab    B → b

The top-down parse produces a leftmost derivation of the sentence.

Page 57: Unit 6

Parsing methods

Consider the following grammar:

S → AB
A → aA | ε
B → b | bB

A bottom-up parse works in reverse and prints out a rightmost derivation of the sentence:

aaab
aaaεb   (insert ε)
aaaAb   A → ε
aaAb    A → aA
aAb     A → aA
Ab      A → aA
AB      B → b
S       S → AB

Page 58: Unit 6

Top-Down Parsing

It is an attempt to find a leftmost derivation for an input string.

Recursive-descent parsing: we execute a set of recursive procedures to process the input. A procedure is associated with each non-terminal of the grammar. As we parse the input string, we call the procedures that correspond to the left-side non-terminals of the productions.

Consider the following grammar:

S → cAd
A → ab | a

and the input string w = cad.

Page 59: Unit 6

Top-Down Parsing

Constructing the parse tree:

Initially, the input pointer points to c, the first symbol of w. We use the first production for S to expand the tree, as shown in Fig. a.

The leftmost leaf, labeled c, matches the first symbol of w, so we advance the input pointer to a, the second symbol, and consider the next leaf, labeled A. We expand A using its first alternative, as shown in Fig. b.

We now have a match for a, so we advance the pointer to the third symbol, d. Since b does not match d, we report failure, go back to A, and reset the pointer to position 2.

Expanding A with the other alternative gives a complete match; since we have produced a parse tree for w, we halt and announce successful completion of parsing (Fig. c).

Fig. a: S with children c, A, d. Fig. b: the same tree with A expanded to a, b. Fig. c: the same tree with A expanded to a.
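The try-one-alternative-then-backtrack behavior can be sketched as two tiny procedures, one per non-terminal. This is a simplified illustration (the function names are mine, and backtracking is folded into trying the alternatives of A in order):

```python
def parse_A(s, i):
    """Try A -> ab first; on failure, backtrack and try A -> a. Return new position."""
    if s[i:i+2] == "ab":
        return i + 2
    if s[i:i+1] == "a":
        return i + 1
    return None                      # neither alternative matches

def parse_S(s):
    """S -> cAd: match c, call the procedure for A, then match d."""
    if s[:1] != "c":
        return False
    i = parse_A(s, 1)
    return i is not None and s[i:] == "d"

print(parse_S("cad"), parse_S("cabd"), parse_S("cd"))
```

For "cad", the first alternative of A fails and the second succeeds, mirroring the retry described above; the call prints True True False.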

Page 60: Unit 6

Top-Down Parsing

Predictive parser: a recursive-descent parser that needs no backtracking.

Transition diagrams for predictive parsers: to construct the transition diagram, first eliminate left recursion and then left factor the grammar. Then, for each non-terminal A:

1. create an initial and a final state
2. for each production A → X1X2...Xn, create a path from the initial to the final state, with edges labeled X1, X2, ..., Xn.

A predictive parser based on transition diagrams attempts to match terminal symbols against the input, and makes a potentially recursive procedure call whenever it has to follow an edge labeled by a non-terminal.

Page 61: Unit 6

Top-Down Parsing

Nonrecursive predictive parsing: it is possible to build a nonrecursive predictive parser by maintaining a stack explicitly, rather than implicitly via recursive calls.

Diagram: model of a nonrecursive predictive parser. The predictive parsing program reads from the input buffer (e.g., a + b $), maintains a stack of grammar symbols (e.g., X Y Z $), consults the parsing table M, and produces output.

Page 62: Unit 6

Top-Down Parsing: Nonrecursive Predictive Parsing

Input buffer: contains the string to be parsed, followed by $, the symbol used to indicate the end of the input string.

Stack: contains a sequence of grammar symbols with $ on the bottom. Initially, it contains the start symbol of the grammar on top of $.

Parsing table: a 2-D array M[A, a], where A is a non-terminal and a is a terminal or the symbol $.

The parser is controlled by a program that behaves as follows.

Page 63: Unit 6

Top-Down Parsing: Nonrecursive Predictive Parsing

The program considers X, the symbol on top of the stack, and a, the current input symbol. These two symbols determine the action. There are three possibilities:

1. If X = a = $, the parser halts and announces successful completion.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
3. If X is a non-terminal, the program consults entry M[X, a] of the parsing table M. This entry will be either an X-production of the grammar or an error entry. If, for example, M[X, a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on top). If M[X, a] = error, the parser calls an error recovery routine.

Page 64: Unit 6

Top-Down Parsing: Nonrecursive Predictive Parsing

Input: a string w and a parsing table M for grammar G.
Output: if w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method: initially, the stack holds $S with S on top, and w$ is in the input buffer.

set ip to point to the first symbol of w$;
repeat
    let X be the top stack symbol and a the symbol pointed to by ip;
    if X is a terminal or $ then
        if X = a then
            pop X from the stack and advance ip
        else error()
    else    /* X is a non-terminal */
        if M[X, a] = X → Y1Y2...Yk then begin
            pop X from the stack;
            push Yk, Yk-1, ..., Y1 onto the stack, with Y1 on top;
            output the production X → Y1Y2...Yk
        end
        else error()
until X = $

Page 65: Unit 6

Top-Down Parsing: Nonrecursive Predictive Parsing

Nonterminal   id        +           *           (         )        $
E             E → TE'                           E → TE'
E'                      E' → +TE'                         E' → ε   E' → ε
T             T → FT'                           T → FT'
T'                      T' → ε      T' → *FT'             T' → ε   T' → ε
F             F → id                            F → (E)

Parsing table M
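The driver loop above can be sketched directly against this table. The dict encoding of M, the `parse` helper, and the choice to return the output productions (a leftmost derivation) are my own; blank table entries become missing dict keys, i.e. errors.

```python
# Parsing table M for E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id.
# Right sides are lists; [] encodes an ε-production.
M = {
    ("E", "id"): ["T", "E'"],       ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],  ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],       ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],            ("F", "("): ["(", "E", ")"],
}
TERMINALS = {"id", "+", "*", "(", ")", "$"}

def parse(tokens):
    """Table-driven predictive parse; returns the productions used, or None on error."""
    stack = ["$", "E"]                   # start symbol on top of $
    tokens = tokens + ["$"]
    ip, output = 0, []
    while stack:
        X, a = stack.pop(), tokens[ip]
        if X in TERMINALS:
            if X != a:
                return None              # mismatch: error
            if X == "$":
                return output            # successful completion
            ip += 1                      # pop and advance
        else:
            rhs = M.get((X, a))
            if rhs is None:
                return None              # blank table entry: error
            output.append((X, rhs))
            stack.extend(reversed(rhs))  # push right side with leftmost symbol on top
    return None

print(parse(["id", "+", "id", "*", "id"]) is not None)
```

The returned list of productions, read in order, is exactly the leftmost derivation the algorithm outputs.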

Page 66: Unit 6

Top-Down Parsing: Nonrecursive Predictive Parsing

The construction of a predictive parser is aided by two functions associated with a grammar G: FIRST and FOLLOW.

If α is any string of grammar symbols, let FIRST(α) be the set of terminals that begin the strings derived from α. If α ==>* ε, then ε is also in FIRST(α).

Define FOLLOW(A), for a non-terminal A, to be the set of terminals a that can appear immediately to the right of A in some sentential form; that is, the set of terminals a such that there exists a derivation of the form S ==>* αAaβ for some α and β. If A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A).

Page 67: Unit 6

Top-Down Parsing: Nonrecursive Predictive Parsing

To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set:

1. If X is a terminal, then FIRST(X) is {X}.

2. If X → ε is a production, then add ε to FIRST(X).

3. If X is a non-terminal and X → Y1Y2...Yk is a production, then place a in FIRST(X) if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1); that is, Y1...Yi-1 ==>* ε. If ε is in FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X). Everything in FIRST(Y1) is surely in FIRST(X); if Y1 does not derive ε, then we add nothing more to FIRST(X), but if Y1 ==>* ε, then we add FIRST(Y2), and so on.

Page 68: Unit 6

Top-Down Parsing: Nonrecursive Predictive Parsing

To compute FOLLOW(A) for all non-terminals A, apply the following rules until nothing can be added to any FOLLOW set:

1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.

2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).

3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
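The two fixed-point computations can be sketched for the expression grammar used throughout. The encoding is my own (ε is represented by the empty string "", ε-productions by empty lists), and `first_of` is a hypothetical helper that extends FIRST to a string of symbols.

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],          # [] is the ε-production
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMS = set(GRAMMAR)

def first_of(seq, FIRST):
    """FIRST of a string of symbols: scan left to right while ε is derivable."""
    out = set()
    for X in seq:
        f = FIRST[X] if X in NONTERMS else {X}   # a terminal's FIRST is itself
        out |= f - {""}
        if "" not in f:
            return out
    out.add("")                                  # every symbol derived ε
    return out

FIRST = {A: set() for A in NONTERMS}
changed = True
while changed:                                   # iterate until nothing is added
    changed = False
    for A, prods in GRAMMAR.items():
        for p in prods:
            new = first_of(p, FIRST)
            if not new <= FIRST[A]:
                FIRST[A] |= new
                changed = True

FOLLOW = {A: set() for A in NONTERMS}
FOLLOW["E"].add("$")                             # rule 1: $ follows the start symbol
changed = True
while changed:
    changed = False
    for A, prods in GRAMMAR.items():
        for p in prods:
            for i, B in enumerate(p):
                if B in NONTERMS:
                    f = first_of(p[i+1:], FIRST)
                    new = (f - {""}) | (FOLLOW[A] if "" in f else set())
                    if not new <= FOLLOW[B]:
                        FOLLOW[B] |= new
                        changed = True

print(sorted(FIRST["E"]), sorted(FOLLOW["E'"]))
```

The sets this computes (e.g. FIRST(E) = {(, id}, FOLLOW(E') = {), $}) are exactly the ones needed to fill the parsing table M shown earlier.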

Page 69: Unit 6

Top-Down Parsing: Nonrecursive Predictive Parsing

Algorithm: Construction of predictive parsing table

Input: Grammar G

Output: Parsing table M

Method:

1. For each production A → α of the grammar, do steps 2 and 3.

2. For each terminal a in FIRST(α), add A → α to M[A, a].

3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].

4. Make each undefined entry of M be error.

Page 70: Unit 6

Top-Down Parsing: LL(1) grammars

A grammar whose parsing table has no multiply defined entries is said to be LL(1).

• The first "L" stands for scanning the input from left to right.
• The second "L" stands for producing a leftmost derivation.
• The "1" stands for using one input symbol of lookahead at each step to make parsing action decisions.

Properties:
• No ambiguous or left-recursive grammar can be LL(1).
• A grammar G is LL(1) if and only if, whenever A → α | β are two distinct productions of G, the following conditions hold:
  1. For no terminal a do both α and β derive strings beginning with a.
  2. At most one of α and β can derive the empty string.
  3. If β ==>* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).

Page 71: Unit 6

Top-Down Parsing: LL(1) grammars

Disadvantages:

The main difficulty in using predictive parsing is in writing a grammar for the source language

Although left recursion elimination and left factoring are easy to do, they make the resulting grammar hard to read and difficult to use for translation purposes.

To alleviate some of this difficulty, a common organization for a parser in a compiler is to use a predictive parser for control constructs and to use operator precedence for expressions.

Page 72: Unit 6

Bottom-Up Parsing:

It attempts to construct a parse tree for an input string beginning at the leaves and working up towards the root.

This is a process of reducing a string to the start symbol of the grammar: at each reduction step, a particular substring matching the right side of a production is replaced by the symbol on the left of that production.

Consider the grammar:

S → aABe
A → Abc | b
B → d

and the string w = abbcde. The sequence of reductions is:

abbcde
aAbcde
aAde
aABe
S

These reductions trace out a rightmost derivation in reverse.

Page 73: Unit 6

Bottom-Up Parsing: Shift-Reduce parser

Consider the following grammar: E → E+E | E*E | (E) | id, and the input string id1+id2*id3.

Right-sentential form   Handle   Reducing production
id1+id2*id3             id1      E → id
E+id2*id3               id2      E → id
E+E*id3                 id3      E → id
E+E*E                   E*E      E → E*E
E+E                     E+E      E → E+E
E

Reductions made by shift-reduce parser

Page 74: Unit 6

Bottom-Up Parsing:

Stack implementation of shift-reduce parsing: a stack is used to hold grammar symbols, and an input buffer holds the string w to be parsed. Initially, the stack is empty and w is the input:

Stack   Input
$       w$

The parser operates by shifting zero or more input symbols onto the stack until a handle β is on top of the stack. The parser then reduces β to the left side of the corresponding production, and repeats until the stack contains the start symbol and the input is empty:

Stack   Input
$S      $

A handle of a string is a substring that matches the right side of a production, and whose reduction to the non-terminal on the left side of that production represents one step of the reduction process.

Page 75: Unit 6

Bottom-Up Parsing:

Stack implementation of shift-reduce parsing:
There are four possible actions a shift-reduce parser can make:

Shift: the next input symbol is shifted onto the top of the stack.

Reduce: the parser knows the right end of the handle is at the top of the stack. It must then locate the left end of the handle within the stack and decide with what nonterminal to replace the handle.

Accept: the parser announces successful completion of parsing.

Error: the parser discovers that a syntax error has occurred and calls an error recovery routine.

Page 76: Unit 6

Bottom-Up Parsing:

Stack        Input           Action
$            id1+id2*id3$    Shift
$id1         +id2*id3$       Reduce by E → id
$E           +id2*id3$       Shift
$E+          id2*id3$        Shift
$E+id2       *id3$           Reduce by E → id
$E+E         *id3$           Shift
$E+E*        id3$            Shift
$E+E*id3     $               Reduce by E → id
$E+E*E       $               Reduce by E → E*E
$E+E         $               Reduce by E → E+E
$E           $               Accept
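The moves in this table can be reproduced by a small driver. The sketch below resolves the shift/reduce conflicts of the ambiguous grammar by assuming * binds tighter than + and both are left-associative (matching the table; parenthesized expressions are omitted for brevity):

```python
PREC = {"+": 1, "*": 2}   # assumed precedence: * binds tighter than +

def shift_reduce(tokens):
    """Shift-reduce parse of E -> E+E | E*E | id, left-associative."""
    stack, actions = [], []
    toks = tokens + ["$"]
    i = 0
    while True:
        # Reduce id -> E as soon as id reaches the top of the stack.
        if stack and stack[-1] == "id":
            stack[-1] = "E"
            actions.append("reduce E->id")
            continue
        # Reduce E op E when the lookahead does not bind tighter than op.
        if (len(stack) >= 3 and stack[-3] == "E" and stack[-1] == "E"
                and stack[-2] in PREC
                and (toks[i] not in PREC or PREC[toks[i]] <= PREC[stack[-2]])):
            op = stack[-2]
            stack[-3:] = ["E"]
            actions.append(f"reduce E->E{op}E")
            continue
        if toks[i] == "$":
            break                      # accept when the stack is just [E]
        stack.append(toks[i])
        i += 1
        actions.append("shift")
    return stack, actions

stack, actions = shift_reduce(["id", "+", "id", "*", "id"])
```

The resulting action list mirrors the table: five shifts, three reductions by E → id, then E → E*E before E → E+E.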

Page 77: Unit 6

Bottom-Up Parsing: Operator-Precedence parsing

Operator grammar: these grammars have the property that no production right side is ε or has two adjacent nonterminals.

Consider the following grammar for expressions:

E → EAE | (E) | -E | id
A → + | - | * | / | ↑

It is not an operator grammar, because the right side EAE has two adjacent nonterminals. If we substitute for A each of its alternatives, we obtain the operator grammar:

E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id

Page 78: Unit 6

Bottom-Up Parsing: Operator-Precedence parsing

We define three disjoint precedence relations, <., =. and .>, between certain pairs of terminals.

Relation    Meaning
a <. b      a “yields precedence to” b
a =. b      a “has the same precedence as” b
a .> b      a “takes precedence over” b

Operator-precedence relations for id, + and *:

      id    +     *     $
id          .>    .>    .>
+     <.    .>    <.    .>
*     <.    .>    .>    .>
$     <.    <.    <.

Page 79: Unit 6

Bottom-Up Parsing: Operator-Precedence parsing

Consider the string: id+id*id

The handle can then be found by the following process:
Scan the string from the left until the first .> is encountered.
Then scan backwards until a <. is encountered.
The handle contains everything to the left of the first .> and to the right of that <.

$ <. id .> + <. id .> * <. id .> $     each handle is id; reducing each id to E gives $E+E*E$

Ignoring the nonterminals leaves $+*$; inserting the relations gives
$ <. + <. * .> $                       the handle is E*E; reducing gives $E+E$

Ignoring the nonterminals leaves $+$; inserting the relations gives
$ <. + .> $                            the handle is E+E; reducing gives E

Page 80: Unit 6

Bottom-Up Parsing: Operator-Precedence parsing

Algorithm:
Set ip to point to the first symbol of w$
Repeat forever
    If $ is on top of the stack and ip points to $ then
        return
    Else begin
        Let a be the topmost terminal symbol on the stack
        and let b be the symbol pointed to by ip
        If a <. b or a =. b then begin
            Push b onto the stack
            Advance ip to the next input symbol
        end
        else if a .> b then
            Repeat
                Pop the stack
            Until the top stack terminal is related by <.
                to the terminal most recently popped
        else error()
    end
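A sketch of this algorithm in Python, using the id/+/* relation table given earlier. It only counts reductions (a real parser would also build a tree node for each handle), and since this sketch pushes only terminals, the top of the stack is always the topmost terminal:

```python
# Precedence relations for id, + and * transcribed from the earlier table.
REL = {
    "id": {"+": ".>", "*": ".>", "$": ".>"},
    "+":  {"id": "<.", "+": ".>", "*": "<.", "$": ".>"},
    "*":  {"id": "<.", "+": ".>", "*": ".>", "$": ".>"},
    "$":  {"id": "<.", "+": "<.", "*": "<."},
}

def op_precedence_parse(tokens):
    """Run the algorithm above; returns the number of reductions made."""
    stack = ["$"]
    toks = tokens + ["$"]
    i, reductions = 0, 0
    while True:
        a, b = stack[-1], toks[i]      # topmost terminal and lookahead
        if a == "$" and b == "$":
            return reductions          # accept
        rel = REL.get(a, {}).get(b)
        if rel in ("<.", "=."):        # shift
            stack.append(b)
            i += 1
        elif rel == ".>":              # reduce: pop the handle
            popped = stack.pop()
            while REL.get(stack[-1], {}).get(popped) != "<.":
                popped = stack.pop()
            reductions += 1
        else:
            raise SyntaxError(f"no relation between {a!r} and {b!r}")
```

For id+id*id this makes five reductions (three for the ids, then the handle between * relations, then the one between + relations), matching the handle-pruning trace above.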

Page 81: Unit 6

Bottom-Up Parsing: Operator-Precedence parsing

If operator θ1 has higher precedence than operator θ2, make θ1 .> θ2 and θ2 <. θ1.
e.g., if * has higher precedence than +, make * .> + and + <. *

If θ1 and θ2 are operators of equal precedence, then make θ1 .> θ2 and θ2 .> θ1 if the operators are left associative, or make θ1 <. θ2 and θ2 <. θ1 if the operators are right associative.
e.g., if + and - are left associative, then make + .> +, + .> -, - .> - and - .> +

Make θ <. id, id .> θ, θ <. (, ( <. θ, ) .> θ, θ .> ), θ .> $ and $ <. θ for all operators θ. Also let:
( =. )    $ <. (    $ <. id    ( <. (    id .> $    ) .> $    ( <. id    id .> )    ) .> )

Page 82: Unit 6

Bottom-Up Parsing: Operator-Precedence parsing

Consider the following grammar:

E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id

Assume:
↑ is of highest precedence and right associative,
* and / are of next highest precedence and left associative, and
+ and - are of lowest precedence and left associative.

      +     -     *     /     ↑     id    (     )     $
+     .>    .>    <.    <.    <.    <.    <.    .>    .>
-     .>    .>    <.    <.    <.    <.    <.    .>    .>
*     .>    .>    .>    .>    <.    <.    <.    .>    .>
/     .>    .>    .>    .>    <.    <.    <.    .>    .>
↑     .>    .>    .>    .>    <.    <.    <.    .>    .>
id    .>    .>    .>    .>    .>                .>    .>
(     <.    <.    <.    <.    <.    <.    <.    =.
)     .>    .>    .>    .>    .>                .>    .>
$     <.    <.    <.    <.    <.    <.    <.

Page 83: Unit 6

Bottom-Up Parsing: LR Parsers

It is used to parse a large class of context-free grammars.

LR(k) parsing: L is for left-to-right scanning of the input, R for constructing a rightmost derivation in reverse, and k for the number of lookahead input symbols.

Characteristics:
LR parsers can be constructed to recognize virtually all programming language constructs.
It is a general nonbacktracking shift-reduce parsing method.
It can detect a syntactic error as soon as it is possible to do so.

Drawback:
It is a lot of work to construct an LR parser by hand, hence a specialized tool - an LR parser generator - is required.

Page 84: Unit 6

Bottom-Up Parsing: LR Parsers

There are three techniques for constructing an LR parsing table for a grammar:

Simple LR (SLR): the easiest to implement but the least powerful.
Canonical LR: the most powerful and the most expensive.
Lookahead LR (LALR): intermediate in power; it works on most programming language grammars.

Page 85: Unit 6

Bottom-Up Parsing: LR Parsers

LR parsing algorithm:
It consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (action and goto).

The parsing program reads characters from an input buffer one at a time.

The program uses a stack to store a string of the form s0 X1 s1 X2 s2 … Xm sm, where sm is on top.

Each Xi is a grammar symbol and each si is a symbol called a state.

Each state symbol summarizes the information contained in the stack below it. The combination of the state symbol on top of the stack and the current input symbol is used to index the parsing table and determine the shift-reduce parsing decision.

Page 86: Unit 6

Bottom-Up Parsing: LR Parsers

The parsing table consists of two parts: a parsing action function action and a goto function goto.

The program behaves as follows:
It determines sm, the state currently on top of the stack, and ai, the current input symbol.

It consults action[sm, ai], the parsing action table entry for state sm and input ai, which can have one of four values:
Shift s, where s is a state
Reduce by a grammar production A → β
Accept, and
Error

The function goto takes a state and a grammar symbol and produces a state.

Page 87: Unit 6

Bottom-Up Parsing: LR Parsers

The configurations resulting after each of the four types of move are as follows:

If action[sm, ai] = shift s, the parser executes a shift move. Here the parser shifts both the current input symbol ai and the next state s, which is given in action[sm, ai], onto the stack; ai+1 becomes the current input symbol.

If action[sm, ai] = reduce A → β, then the parser executes a reduce move. Here the parser pops 2r symbols off the stack (where r is the length of β). The parser then pushes both A and s, the entry for goto[sm-r, A], onto the stack. The current input symbol is not changed in a reduce move.

If action[sm, ai] = accept, parsing is completed.

If action[sm, ai] = error, the parser has discovered an error and calls an error recovery routine.

Page 88: Unit 6

Bottom-Up Parsing: LR Parsing algorithm

Input: input string w and an LR parsing table with functions action and goto for G

Output: if w is in L(G), a bottom-up parse for w, otherwise an error indication

Method: Initially, the parser has s0 (the initial state) on its stack and w$ in the input buffer.

Example grammar:
(1) E → E+T
(2) E → T
(3) T → T*F
(4) T → F
(5) F → (E)
(6) F → id

Page 89: Unit 6

Bottom-Up Parsing: LR Parsers

State         action                              goto
        id    +     *     (     )     $      E    T    F
0       s5                s4                 1    2    3
1             s6                      acc
2             r2    s7          r2    r2
3             r4    r4          r4    r4
4       s5                s4                 8    2    3
5             r6    r6          r6    r6
6       s5                s4                      9    3
7       s5                s4                           10
8             s6                s11
9             r1    s7          r1    r1
10            r3    r3          r3    r3
11            r5    r5          r5    r5

Parsing table:
1. si means shift and stack state i
2. rj means reduce by the production numbered j
3. acc means accept
4. blank means error

Page 90: Unit 6

Bottom-Up Parsing: LR Parsers

       Stack              Input          Action
(1)    0                  id*id+id$      Shift
(2)    0 id 5             *id+id$        Reduce by F → id
(3)    0 F 3              *id+id$        Reduce by T → F
(4)    0 T 2              *id+id$        Shift
(5)    0 T 2 * 7          id+id$         Shift
(6)    0 T 2 * 7 id 5     +id$           Reduce by F → id
(7)    0 T 2 * 7 F 10     +id$           Reduce by T → T*F
(8)    0 T 2              +id$           Reduce by E → T
(9)    0 E 1              +id$           Shift
(10)   0 E 1 + 6          id$            Shift
(11)   0 E 1 + 6 id 5     $              Reduce by F → id
(12)   0 E 1 + 6 F 3      $              Reduce by T → F
(13)   0 E 1 + 6 T 9      $              Reduce by E → E+T
(14)   0 E 1              $              Accept

Moves of the LR parser on id*id+id
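The driver program together with the table above can be sketched in Python. ACTION and GOTO are transcribed directly from the parsing table, and the returned list of production numbers is the rightmost derivation in reverse:

```python
# Productions 1-6 as (left side, length of right side).
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
         4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

ACTION = {
    (0, "id"): "s5", (0, "("): "s4",
    (1, "+"): "s6", (1, "$"): "acc",
    (2, "+"): "r2", (2, "*"): "s7", (2, ")"): "r2", (2, "$"): "r2",
    (3, "+"): "r4", (3, "*"): "r4", (3, ")"): "r4", (3, "$"): "r4",
    (4, "id"): "s5", (4, "("): "s4",
    (5, "+"): "r6", (5, "*"): "r6", (5, ")"): "r6", (5, "$"): "r6",
    (6, "id"): "s5", (6, "("): "s4",
    (7, "id"): "s5", (7, "("): "s4",
    (8, "+"): "s6", (8, ")"): "s11",
    (9, "+"): "r1", (9, "*"): "s7", (9, ")"): "r1", (9, "$"): "r1",
    (10, "+"): "r3", (10, "*"): "r3", (10, ")"): "r3", (10, "$"): "r3",
    (11, "+"): "r5", (11, "*"): "r5", (11, ")"): "r5", (11, "$"): "r5",
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
        (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def lr_parse(tokens):
    """LR driver: returns the production numbers used, i.e. a
    rightmost derivation in reverse (a blank table entry is an error)."""
    stack, toks, i, output = [0], tokens + ["$"], 0, []
    while True:
        act = ACTION.get((stack[-1], toks[i]))
        if act is None:
            raise SyntaxError(f"error at symbol {toks[i]!r}")
        if act == "acc":
            return output
        if act[0] == "s":                      # shift symbol and state
            stack += [toks[i], int(act[1:])]
            i += 1
        else:                                  # reduce by production n
            lhs, rlen = PRODS[int(act[1:])]
            del stack[len(stack) - 2 * rlen:]  # pop 2r symbols
            stack += [lhs, GOTO[(stack[-1], lhs)]]
            output.append(int(act[1:]))

print(lr_parse(["id", "*", "id", "+", "id"]))  # [6, 4, 6, 3, 2, 6, 4, 1]
```

The printed sequence is exactly the reduce moves (2), (3), (7), (8), (11), (12), (13) in the trace above, with production numbers in place of productions.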

Page 91: Unit 6

Code Optimization

It aims at improving the execution efficiency of a program. This is achieved in two ways:

Redundancies in a program are eliminated.
Computations in a program are rearranged or rewritten to make it execute more efficiently.

Source Program → Front End → Intermediate representation (IR) → Optimization Phase → IR → Back End → Target Program

Page 92: Unit 6

Code Optimization techniques

Compile time evaluation:
Performing certain actions specified in the program during compilation itself, thereby reducing the execution time of the program.

When all operands in an operation are constants, the operation can be performed at compilation time. This is known as constant folding.

e.g., the assignment a = 3.14/2

can be replaced by a = 1.57, thereby eliminating a division operation at run time.
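Constant folding can be sketched with Python's standard `ast` module: a `NodeTransformer` collapses any operation whose operands are both constants. This is a minimal illustration handling only four arithmetic operators, not how a production compiler is organized:

```python
import ast
import operator

# Map AST operator node types to the functions that evaluate them.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

class Folder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)           # fold the subtrees first
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and type(node.op) in OPS):
            value = OPS[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value), node)
        return node

def fold(expr):
    """Constant-fold a Python expression string; names are left alone."""
    return ast.unparse(Folder().visit(ast.parse(expr, mode="eval")))

print(fold("3.14 / 2"))    # 1.57
print(fold("a + 2 * 3"))   # a + 6
```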

Page 93: Unit 6

Code Optimization techniques

Elimination of common subexpressions:
Common subexpressions are occurrences of expressions yielding the same value.

We can avoid recomputing such an expression if we can use the previously computed value.

Example:

Before:              After:
a = b+c              t = b+c
…                    a = t
x = b+c+5.2          …
                     x = t+5.2

Page 94: Unit 6

Code Optimization techniques

Dead Code Elimination:
Code which can be omitted from a program without affecting its results is called dead code. Dead code is detected by checking whether the value assigned in an assignment statement is used anywhere in the program.

Frequency Reduction:
Execution time of a program can be reduced by moving code from a part of the program which is executed very frequently to another part which is executed fewer times.

Before:                      After:
for i=1 to 100 do            x = 25*a;
begin                        for i=1 to 100 do
  z = i;                     begin
  x = 25*a;                    z = i;
  y = x+z;                     y = x+z;
end                          end

Page 95: Unit 6

Code Optimization techniques

Strength Reduction:
Replaces an occurrence of a time-consuming operation by an occurrence of a faster operation, e.g., replacement of a multiplication by an addition.

Before:                      After:
for i=1 to 10 do             itemp = 5;
begin                        for i=1 to 10 do
  ---                        begin
  k = i*5;                     ---
  ---                          k = itemp;
end                            ---
                               itemp = itemp+5;
                             end
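The two loops compute the same sequence of k values; a quick Python check of that equivalence (the bound 10 and constant 5 follow the example):

```python
def loop_with_multiply(n=10, c=5):
    """Original loop: one multiplication per iteration."""
    return [i * c for i in range(1, n + 1)]

def loop_with_addition(n=10, c=5):
    """Strength-reduced loop: the multiplication is replaced by a
    running value updated with one addition per iteration."""
    out, itemp = [], c
    for _ in range(1, n + 1):
        out.append(itemp)
        itemp += c
    return out

assert loop_with_multiply() == loop_with_addition()  # identical k values
```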

Page 96: Unit 6

Code Optimization techniques

Local and global optimization:
Local optimization: optimizing transformations are applied over small segments of a program consisting of a few statements.

Global optimization: optimizing transformations are applied over a program unit, i.e., over a function or a procedure.

Page 97: Unit 6

YACC

It is a parser generator. It stands for “yet another compiler-compiler”.

It is provided on Unix and is used to generate LALR parsers.

Yacc specification translate.y  →  Yacc compiler  →  y.tab.c
y.tab.c                         →  C compiler     →  a.out
Input                           →  a.out          →  output

Page 98: Unit 6

YACC

A YACC program has 3 parts:

declarations
%%
translation rules
%%
supporting C functions

Declarations part: there are two optional sections.
In the first section, we put ordinary C declarations, delimited by %{ and %}. The declarations part also contains the declarations of grammar tokens, e.g., the statement %token DIGIT

Page 99: Unit 6

YACC

The translation rules part:
Enclosed between %% and %%.
Each rule consists of a grammar production and the associated semantic action. For e.g.,

<left side> : <alt 1> {semantic action 1}
            | <alt 2> {semantic action 2}
            …
            | <alt n> {semantic action n}
            ;

In a YACC production, a quoted single character is taken to be a terminal symbol, and unquoted strings of letters and digits not declared to be tokens are taken to be nonterminals.

Page 100: Unit 6

YACC

A YACC semantic action is a sequence of C statements. The semantic action is performed whenever we reduce by the associated production.

e.g., for the two productions E → E+T | T:

expr : expr '+' term {$$ = $1 + $3;}
     | term
     ;

$$ refers to the attribute value associated with the nonterminal on the left.
$i refers to the value associated with the ith grammar symbol on the right.

Supporting C-routines part: a lexical analyzer by the name yylex() must be provided. Error recovery routines may be added.

Page 101: Unit 6

Syntax Directed Translation

There are 2 notations for associating semantic rules with productions: Syntax-directed definitions and translation schemes

Conceptually, with both syntax-directed definitions and translation schemes, we parse the input token stream, build the parse tree, and then traverse the tree as needed to evaluate the semantic rules at the parse tree nodes.

input string → parse tree → dependency graph → evaluation order for semantic rules

Page 102: Unit 6

Syntax Directed Definition

A syntax-directed definition is a generalization of a CFG in which each grammar symbol has an associated set of attributes, partitioned into two subsets called the synthesized and inherited attributes of that grammar symbol.

The value of an attribute at a parse tree node is defined by a semantic rule associated with the production used at that node.

The value of a synthesized attribute at a node is computed from the values of attributes at the children of that node.

The value of an inherited attribute at a node is computed from the values of attributes at the siblings and parent of that node.

Page 103: Unit 6

Syntax Directed Definition

Semantic rules set up dependencies between attributes that will be represented by a graph.

From the dependency graph, we derive an evaluation order for the semantic rules.

Evaluation of the semantic rules defines the values of the attributes at the nodes in the parse tree for the input string.

A parse tree showing the values of attributes at each node is called an annotated parse tree.

The process of computing the attribute values at the nodes is called annotating or decorating the parse tree.

Page 104: Unit 6

Syntax Directed Definition

Form of a Syntax Directed Definition

In a syntax directed definition, each grammar production A → α has associated with it a set of semantic rules of the form

b := f(c1, c2, …, ck), where f is a function, and either

b is a synthesized attribute of A and c1, c2, …, ck are attributes belonging to the grammar symbols of the production, or

b is an inherited attribute of one of the grammar symbols on the right side of the production, and c1, c2, …, ck are attributes belonging to the grammar symbols of the production.

In either case, we say that the attribute b depends on attributes c1, c2, …, ck. An attribute grammar is a syntax directed definition in which the functions in semantic rules cannot have side effects.

Page 105: Unit 6

Syntax Directed Definition

A syntax directed definition that uses synthesized attributes exclusively is said to be an S-attributed definition.

Example:

Production       Semantic rules
L → E n          print(E.val)
E → E1 + T       E.val = E1.val + T.val
E → T            E.val = T.val
T → T1 * F       T.val = T1.val * F.val
T → F            T.val = F.val
F → (E)          F.val = E.val
F → digit        F.val = digit.lexval
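The rules above can be sketched as a recursive-descent evaluator in which each function's return value plays the role of the synthesized val attribute (a sketch: the L → E n production and multi-digit numbers are omitted):

```python
def evaluate(s):
    """Compute E.val for the expression grammar; each function returns
    the synthesized val attribute of its nonterminal."""
    toks = list(s) + ["$"]
    pos = [0]
    def peek():
        return toks[pos[0]]
    def eat():
        pos[0] += 1
        return toks[pos[0] - 1]
    def F():                    # F -> (E) | digit
        if peek() == "(":
            eat()
            v = E()             # F.val = E.val
            eat()               # consume ")"
            return v
        return int(eat())       # F.val = digit.lexval
    def T():                    # T -> T1 * F | F
        v = F()
        while peek() == "*":
            eat()
            v = v * F()         # T.val = T1.val * F.val
        return v
    def E():                    # E -> E1 + T | T
        v = T()
        while peek() == "+":
            eat()
            v = v + T()         # E.val = E1.val + T.val
        return v
    return E()

print(evaluate("3*5+4"))  # 19
```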

Page 106: Unit 6

Syntax Directed Definition

Synthesized attributes: A syntax directed definition that uses synthesized attributes exclusively is said to be an S-attributed definition.

Annotated parse tree for 3*5+4n (children indented under their parent):

L
  E.val = 19
    E.val = 15
      T.val = 15
        T.val = 3
          F.val = 3
            digit.lexval = 3
        *
        F.val = 5
          digit.lexval = 5
    +
    T.val = 4
      F.val = 4
        digit.lexval = 4
  n

A parse tree is annotated by evaluating the semantic rules for the attributes at each node bottom up, from the leaves to the root.

Page 107: Unit 6

Syntax Directed Definition

Inherited Attributes:
They are useful for expressing the dependence of a programming language construct on the context in which it appears.

For e.g., to keep track of whether an identifier appears on the left or right side of an assignment in order to decide whether the address or the value of the identifier is needed.

Production       Semantic rules
D → T L          L.in = T.type
T → int          T.type = integer
T → real         T.type = real
L → L1 , id      L1.in = L.in; addtype(id.entry, L.in)
L → id           addtype(id.entry, L.in)
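The flow of L.in down the identifier list can be sketched in Python. `declare` is a hypothetical helper, not part of any compiler API: the loop plays the role of the L productions, handing the inherited type to each id and performing the equivalent of addtype:

```python
def declare(decl):
    """Build a symbol table from a declaration like 'real id1,id2,id3'."""
    symtab = {}
    ty, names = decl.split(maxsplit=1)
    t_type = {"int": "integer", "real": "real"}[ty]  # T.type
    l_in = t_type                                    # D -> T L: L.in = T.type
    for name in names.split(","):                    # L -> L1 , id: L1.in = L.in
        symtab[name.strip()] = l_in                  # addtype(id.entry, L.in)
    return symtab

print(declare("real id1,id2,id3"))
```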

Page 108: Unit 6

Syntax Directed Definition

Inherited Attributes:
Parse tree for the sentence: real id1,id2,id3 (children indented under their parent):

D
  T.type = real
    real
  L.in = real
    L.in = real
      L.in = real
        id1
      ,
      id2
    ,
    id3

Page 109: Unit 6

Syntax trees

An (abstract) syntax tree is a condensed form of a parse tree useful for representing language constructs.

In the syntax tree, operators and keywords do not appear as leaves, but rather are associated with the interior node that would be the parent of those leaves in the parse tree.

Example: 3*5+4

    +
   / \
  *   4
 / \
3   5
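One common in-memory form for such a tree is a nested tuple; the encoding below is a hypothetical illustration, not a representation the text prescribes:

```python
# Interior nodes carry the operator; operands appear only as leaves.
tree = ("+", ("*", 3, 5), 4)

def eval_tree(node):
    """Evaluate a syntax tree bottom-up (only + and * handled here)."""
    if isinstance(node, int):          # a leaf operand
        return node
    op, left, right = node
    l, r = eval_tree(left), eval_tree(right)
    return l + r if op == "+" else l * r

print(eval_tree(tree))  # 19, i.e. 3*5 + 4
```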