
CS2352 PRINCIPLES OF COMPILER DESIGN UNIT-I

INTRODUCTION

COMPILER

A compiler is a program that can read a program in one language — the source

language and translate it into an equivalent program in another language — the

target language.

An important role of the compiler is to report any errors in the source program

that it detects during the translation process.

COUSINS OF THE COMPILER

The preprocessor, assembler, linker and loader are referred to as the cousins of the

compiler.

In addition to a compiler, several other programs may be required to create an

executable target program.

A source program may be divided into modules stored in separate files. The task

of collecting the source program is sometimes entrusted to a separate program,

called a preprocessor.

The preprocessor may also expand shorthands, called macros, into source

language statements.

The modified source program is then fed to a compiler. The compiler may

produce an assembly-language program as its output, because assembly language

is easier to produce as output and is easier to debug.

The assembly language is then processed by a program called an assembler that

produces relocatable machine code as its output.

C.S.Anita, Assoc.Prof/CSE Page 1


Large programs are often compiled in pieces, so the relocatable machine code

may have to be linked together with other relocatable object files and library files

into the code that actually runs on the machine.

The linker resolves external memory addresses, where the code in one file may

refer to a location in another file.

The loader then puts together all of the executable object files into memory for

execution.

ANALYSIS – SYNTHESIS MODEL OF COMPILATION

The process of compilation has two parts, namely: Analysis and Synthesis.

Analysis: The analysis part breaks up the source program into constituent pieces

and creates an intermediate representation of the source program.

Synthesis: The synthesis part constructs the desired target program from the

intermediate representation.

The analysis part is often called the front end of the compiler; the synthesis

part is the back end of the compiler.


PHASES OF THE COMPILER

There are six phases of the compiler namely

1. Lexical Analysis

2. Syntax Analysis

3. Semantic Analysis

4. Intermediate Code Generation

5. Code Optimization

6. Code Generation


Lexical Analysis

The first phase of a compiler is called lexical analysis, linear analysis, or scanning.

The lexical analyzer reads the stream of characters making up the source program

and groups the characters into meaningful sequences called lexemes.


For each lexeme, the lexical analyzer produces as output a token of the form

(token-name, attribute-value)

that it passes on to the subsequent phase, syntax analysis.

For example, suppose a source program contains the assignment statement

position := initial + rate * 60

The characters in this assignment could be grouped into the following lexemes

and mapped into the following tokens passed on to the syntax analyzer:

1. The identifier position

2. The assignment symbol :=

3. The identifier initial

4. The plus sign

5. The identifier rate

6. The multiplication sign

7. The number 60

The blanks separating the characters are eliminated during lexical analysis
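For illustration, the grouping of characters into lexemes and tokens can be sketched with a tiny scanner. This is a minimal model in Python, not a production lexer; the token names and patterns are invented for this one example:

```python
import re

# Each pair is (token-name, pattern). Order matters: earlier patterns win ties.
TOKEN_SPEC = [
    ("NUM",    r"\d+"),
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN", r":="),
    ("PLUS",   r"\+"),
    ("MULT",   r"\*"),
    ("SKIP",   r"[ \t]+"),   # blanks are eliminated, not passed to the parser
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Group characters into lexemes and emit (token-name, attribute) pairs."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("position := initial + rate * 60"))
```

Running this on the assignment statement yields the seven lexemes listed above, in order, with the blanks removed.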

Syntax Analysis

The second phase of the compiler is syntax analysis or hierarchical analysis or

parsing.

The tokens from the lexical analyzer are grouped hierarchically into nested

collections with collective meaning.

This is represented using a parse tree. For example, for the assignment statement

position := initial + rate * 60

the parse tree is as follows


A syntax tree is a compressed representation of a parse tree in which each interior

node represents an operation and the children of the node represent the arguments

of the operation. The syntax tree for the above is as follows

Semantic Analysis

The semantic analyzer uses the syntax tree and the information in the symbol

table to check the source program for semantic consistency with the language

definition.

It also gathers type information and saves it in either the syntax tree or the symbol

table, for subsequent use during intermediate-code generation.

An important part of semantic analysis is type checking, where the compiler

checks that each operator has matching operands.


For example, a binary arithmetic operator may be applied to either a pair of

integers or to a pair of floating-point numbers. If the operator is applied to a

floating-point number and an integer, the compiler may convert the integer into a

floating-point number.

For the above syntax tree, applying type conversion and considering all the

identifiers to be real values, we get

Intermediate Code Generation

The intermediate representation should have two important properties:

it should be easy to produce and

it should be easy to translate into the target machine.

We consider an intermediate form called three-address code, which consists of a

sequence of assembly-like instructions with three operands per instruction.

Properties of three-address instructions.

1. Each three-address assignment instruction has at most one operator on the

right side.

2. The compiler must generate a temporary name to hold the value computed

by a three-address instruction.

3. Some three-address instructions may have fewer than three operands.
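Putting these three properties together, the assignment position := initial + rate * 60 (assuming all identifiers are floating-point, with id1, id2, id3 standing for position, initial, rate) might be translated into the following three-address sequence. Each instruction has at most one operator on the right side, t1 to t3 are compiler-generated temporaries, and the final copy instruction has fewer than three operands:

```
t1 := inttofloat(60)
t2 := id3 * t1
t3 := id2 + t2
id1 := t3
```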


Code Optimization

The machine-independent code-optimization phase attempts to improve the

intermediate code so that better target code will result.

There is a great variation in the amount of code optimization different compilers

perform. Those that do the most are called "optimizing compilers". A significant

amount of time is spent on this phase.

There are simple optimizations that significantly improve the running time of the

target program without slowing down compilation too much.

Code Generation

The code generator takes as input an intermediate representation of the source

program and maps it into the target language.

If the target language is machine code, registers or memory locations are selected

for each of the variables used by the program.

Then, the intermediate instructions are translated into sequences of machine

instructions that perform the same task.

Symbol-Table Management

An essential function of a compiler is to record the variable names used in the

source program and collect information about various attributes of each name.

These attributes may provide information about the storage allocated for a name,

its type, its scope (where in the program its value may be used), and in the case of

procedure names, such things as the number and types of its arguments, the

method of passing each argument (for example, by value or by reference), and the

type returned.

The symbol table is a data structure containing a record for each variable name,

with fields for the attributes of the name.

The data structure should be designed to allow the compiler to find the record for

each name quickly and to store or retrieve data from that record quickly.

Error Detection and Reporting


Each phase can encounter errors.

After detecting an error , a phase must be able to recover from the error so that

compilation can proceed and allow further errors to be detected.

A compiler which stops after detecting the first error is not useful.

Translation of an assignment statement
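The figure referenced here is the standard walk of the assignment position := initial + rate * 60 through all the phases. A sketch in outline (token names id1 to id3, temporaries t1 to t3, and the register code are illustrative):

```
position := initial + rate * 60

lexical analyzer     ->  id1 := id2 + id3 * 60
semantic analyzer    ->  syntax tree with inttofloat(60) inserted
intermediate code    ->  t1 := inttofloat(60)
                         t2 := id3 * t1
                         t3 := id2 + t2
                         id1 := t3
code optimizer       ->  t1 := id3 * 60.0
                         id1 := id2 + t1
code generator       ->  MOVF id3, R2
                         MULF #60.0, R2
                         MOVF id2, R1
                         ADDF R2, R1
                         MOVF R1, id1
```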


GROUPING OF PHASES

Front and Back Ends:


The phases are collected into a front end and a back end.

Front End:

Consists of those phases or parts of phases that depend primarily on the

source language and are largely independent of target machine.

Lexical and syntactic analysis, symbol table, semantic analysis and the

generation of intermediate code is included.

Certain amount of code optimization can be done by the front end.

It also includes error handling that goes along with each of these phases.

Back End:

Includes those portions of the compiler that depend on the target machine

and these portions do not depend on the source language.

It includes aspects of the code optimization phase, code generation, along with

the necessary error handling and symbol-table operations.

Passes:

Several phases of compilation are usually implemented in a single pass

consisting of reading an input file and writing an output file.

It is common for several phases to be grouped into one pass, and for the

activity of these phases to be interleaved during the pass.

Eg: Lexical analysis, syntax analysis, semantic analysis and intermediate

code generation might be grouped into one pass. If so, the token stream

after lexical analysis may be translated directly into intermediate code.

Reducing the number of passes:

It is desirable to have relatively few passes, since it takes time to read and

write intermediate files.

If we group several phases into one pass, we may be forced to keep the entire

program in memory, because one phase may need information in a

different order than a previous phase produces it.

The internal form of the program may be considerably larger than either

the source program or the target program, so this space may not be a

trivial matter.


COMPILER-CONSTRUCTION TOOLS

Some commonly used compiler-construction tools include

1. Parser generator

2. Scanner generator

3. Syntax-directed translation engine

4. Automatic code generator

5. Data flow engine

1. Parser generators

produce syntax analyzers from input that is based on context-free

grammar.

Earlier, syntax analysis consumed a large fraction of the running time of

a compiler and a large fraction of the intellectual effort of writing a

compiler.

This phase is now considered as one of the easiest to implement.

Many parser generators utilize powerful parsing algorithms that are

too complex to be carried out by hand.

2. Scanner generators

automatically generate lexical analyzers from a specification based

on regular expressions.

The basic organization of the resulting lexical analyzer is a finite

automaton.


3. Syntax-directed translation engines

produce collections of routines that walk a parse tree, generating

intermediate code.

The basic idea is that one or more “translations” are associated with

each node of the parse tree.

Each translation is defined in terms of translations at its neighbor

nodes in the tree.

4. Automatic code generators

take a collection of rules that define the translation of each

operation of the intermediate language into the machine language for

a target machine.

The rules must include sufficient detail to handle the

different possible access methods for data.

5. Data-flow analysis engines

facilitate the gathering of information about how values are transmitted from one

part of a program to each other part.

Data-flow analysis is a key part of code optimization.

THE ROLE OF THE LEXICAL ANALYZER


The main task of the lexical analyzer is to read the input characters of the source

program, group them into lexemes, and produce as output a sequence of tokens

for each lexeme in the source program.

The stream of tokens is sent to the parser for syntax analysis.

When the lexical analyzer discovers a lexeme constituting an identifier, it needs to

enter that lexeme into the symbol table.


Interactions between the lexical analyzer and the parser

Other functions of lexical analyzer

1. stripping out comments and whitespace (blank, newline, tab characters

that are used to separate tokens in the input).

2. Another task is correlating error messages generated by the compiler with

the source program. For instance, the lexical analyzer may keep track of

the number of newline characters seen, so it can associate a line number

with each error message.

Issues in Lexical Analysis


1. Simplicity of design is the most important consideration. The separation

of lexical and syntactic analysis often allows us to simplify at least one of

these tasks.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to

apply specialized techniques that serve only the lexical task, not the job of

parsing. In addition, specialized buffering techniques for reading input

characters can speed up the compiler significantly.

3. Compiler portability is enhanced.

Tokens, Patterns, and Lexemes

Token

A token is a pair consisting of a token name and an optional attribute value.

The token name is an abstract symbol representing a kind of lexical unit, e.g., a

particular keyword, or a sequence of input characters denoting an identifier. The

token names are the input symbols that the parser processes.

Pattern

A pattern is a description of the form that the lexemes of a token may take.

In the case of a keyword as a token, the pattern is just the sequence of characters that

form the keyword. For identifiers and some other tokens, the pattern is a more

complex structure that is matched by many strings.

Lexeme

A lexeme is a sequence of characters in the source program that matches the

pattern for a token and is identified by the lexical analyzer as an instance of that

token.


Examples of tokens

Lexical Errors

It is hard for a lexical analyzer to tell, without the aid of other components, that

there is a source-code error. For instance, in the following C statement

fi ( a == f(x) ) ...

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an

undeclared function identifier. Since fi is a valid lexeme for the token id, the

lexical analyzer must return the token id to the parser and let some other phase of

the compiler — probably the parser in this case — handle an error due to

transposition of the letters.

However, suppose a situation arises in which the lexical analyzer is unable to

proceed because none of the patterns for tokens matches any prefix of the

remaining input. The simplest recovery strategy is "panic mode" recovery. We

delete successive characters from the remaining input, until the lexical analyzer

can find a well-formed token at the beginning of what input is left.

Other possible error-recovery actions are:

1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.
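Panic-mode deletion can be sketched in a few lines; the token pattern below is an invented example standing in for the analyzer's full set of patterns:

```python
import re

# Assumed token patterns: identifiers, numbers, := and a few operators.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|\d+|:=|[+*()]")

def panic_mode(remaining):
    """Delete successive characters until a well-formed token starts the input.
    Returns the surviving input and the number of characters deleted."""
    deleted = 0
    while remaining and not TOKEN.match(remaining):
        remaining = remaining[1:]   # delete one character from the remaining input
        deleted += 1
    return remaining, deleted

print(panic_mode("#@!rate"))   # the three junk characters are discarded
```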

INPUT BUFFERING


We often have to look one or more characters beyond the next lexeme before we

can be sure we have the right lexeme.

Hence we introduce a two-buffer scheme that handles large lookaheads safely.

We then consider an improvement involving "sentinels" that saves time checking

for the ends of buffers.

Buffer Pairs

Each buffer is of the same size N, and N is usually the size of a disk block, e.g.,

4096 bytes.

Using one system read command we can read N characters into a buffer.

If fewer than N characters remain in the input file, then a special character,

represented by eof, marks the end of the source file.

Two pointers to the input are maintained:

1. Pointer lexeme_beginning, marks the beginning of the current lexeme,

whose extent we are attempting to determine.

2. Pointer forward scans ahead until a pattern match is found.

Once the next lexeme is determined, forward is set to the character at its right end.

Then, after the lexeme is recorded as an attribute value of a token returned to the

parser, lexeme_beginning is set to the character immediately after the lexeme just

found.

Advancing forward requires that we first test whether we have reached the end of

one of the buffers, and if so, we must reload the other buffer from the input, and

move forward to the beginning of the newly loaded buffer.

if forward at end of first half then begin

reload second half;

forward := forward + 1


end

else if forward at end of second half then begin

reload first half;

move forward to beginning of first half

end

else forward := forward + 1;

Code to advance forward pointer

Sentinels

For each character read, we make two tests: one for the end of the buffer, and one

to determine what character is read.

We can combine the buffer-end test with the test for the current character if we

extend each buffer to hold a sentinel character at the end.

The sentinel is a special character that cannot be part of the source program, and a

natural choice is the character eof.

Note that eof retains its use as a marker for the end of the entire input.

Any eof that appears other than at the end of a buffer means that the input is at an

end.

forward := forward + 1;

if forward ↑ = eof then begin

if forward at end of first half then begin

reload second half;

forward := forward + 1

end

else if forward at end of second half then begin

reload first half;

move forward to beginning of first half


end

else /* eof within a buffer signifying end of input */

terminate lexical analysis

end

Lookahead code with sentinels
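The buffer-pair scheme with sentinels can be modelled in Python. This is an illustrative sketch, not the book's code: the buffer size, class name, and the "\0" stand-in for eof are all assumptions (the sentinel must be a character that cannot occur in the source program):

```python
EOF = "\0"   # stand-in sentinel; assumed never to occur in the source text
N = 4        # tiny half-buffer size for illustration (normally a disk block, e.g. 4096)

class TwoBufferReader:
    """Model of the buffer pair with a sentinel slot at the end of each half."""
    def __init__(self, text):
        self.text, self.pos = text, 0
        self.buf = [EOF] * (2 * (N + 1))   # halves at [0..N] and [N+1..2N+1]
        self.forward = 0
        self._reload(0)                    # fill the first half

    def _reload(self, half):
        start = half * (N + 1)
        chunk = self.text[self.pos:self.pos + N]
        self.pos += len(chunk)
        for i in range(N):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF
        self.buf[start + N] = EOF          # the sentinel at the end of this half

    def advance(self):
        """One test per character: only on seeing eof do we check the buffer ends."""
        self.forward += 1
        if self.buf[self.forward] == EOF:
            if self.forward == N:              # end of first half
                self._reload(1)
                self.forward = N + 1
            elif self.forward == 2 * N + 1:    # end of second half
                self._reload(0)
                self.forward = 0
            else:
                return None                    # eof within a buffer: end of input
            if self.buf[self.forward] == EOF:
                return None                    # freshly loaded half is empty
        return self.buf[self.forward]

def chars(text):
    """Yield the characters of text as the lexical analyzer would see them."""
    reader = TwoBufferReader(text)
    c = reader.buf[0] if reader.buf[0] != EOF else None
    while c is not None:
        yield c
        c = reader.advance()
```

In the common case advance makes a single comparison per character; the tests for the two buffer ends run only when the sentinel is actually seen, which is the point of the scheme.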

SPECIFICATION OF TOKENS

Regular expressions are an important notation for specifying lexeme patterns.

Strings and Languages

An alphabet is a finite set of symbols.

A string over an alphabet is a finite sequence of symbols drawn from that

alphabet.

A language is any countable set of strings over some fixed alphabet.

In language theory, the terms "sentence" and "word" are often used as synonyms

for "string." The length of a string s, usually written |s|, is the number of

occurrences of symbols in s.

For example, banana is a string of length six. The empty string, denoted ε, is the

string of length zero.


Terms for Parts of Strings

The following string-related terms are commonly used:

1. A prefix of string s is any string obtained by removing zero or more symbols from

the end of s.

For example, ban, banana, and ε are prefixes of banana.

2. A suffix of string s is any string obtained by removing zero or more symbols from

the beginning of s.

For example, nana, banana, and ε are suffixes of banana.

3. A substring of s is obtained by deleting any prefix and any suffix from s.

For example, banana, nan, and ε are substrings of banana.

4. The proper prefixes, suffixes, and substrings of a string s are those prefixes,

suffixes, and substrings, respectively, of s that are neither ε nor s itself.

5. A subsequence of s is any string formed by deleting zero or more symbols from

not necessarily consecutive positions of s.

For example, baan is a subsequence of banana.
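These string-part definitions translate directly into code; a small sketch for checking the banana examples:

```python
def prefixes(s):
    # All prefixes of s, including ε (the empty string "") and s itself.
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    # All suffixes of s, including ε and s itself.
    return {s[i:] for i in range(len(s) + 1)}

def is_subsequence(t, s):
    # t is a subsequence of s: delete zero or more (not necessarily
    # consecutive) positions of s to obtain t.
    it = iter(s)
    return all(ch in it for ch in t)   # each `in` consumes the iterator

print(sorted(prefixes("ban")))   # ['', 'b', 'ba', 'ban']
```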

Operations on Languages

In lexical analysis, the most important operations on languages are union,

concatenation, and closure.


Definitions of operations on languages

Example:

Let L be the set of letters {A, B , . . . , Z, a, b , . . . , z} and let D be the set of

digits { 0 , 1 , . . . 9}.

L U D is the set of letters and digits

LD is the set of strings of length two, each consisting of one letter followed by

one digit.

L4 is the set of all 4-letter strings.

L* is the set of all strings of letters, including ε, the empty string.

L(L U D)* is the set of all strings of letters and digits beginning with a letter.

D+ is the set of all strings of one or more digits.
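The three operations can be sketched on small finite languages. Since L* is infinite, the closure here is truncated to a bounded string length; the function names and bound are assumptions of this sketch:

```python
from itertools import product

def union(L, M):
    # L U M
    return L | M

def concat(L, M):
    # LM = { st | s in L, t in M }
    return {s + t for s, t in product(L, M)}

def closure(L, max_len):
    # Kleene closure L*, truncated to strings of length <= max_len.
    result = {""}        # L0 = { ε }
    current = {""}
    while True:
        current = {s for s in concat(current, L) if len(s) <= max_len}
        if not current or current <= result:
            break
        result |= current
    return result

letters, digits = {"a", "b"}, {"0", "1"}
print(sorted(concat(letters, digits)))   # ['a0', 'a1', 'b0', 'b1']
```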

Regular Expressions

Each regular expression r denotes a language L(r).

Here are the rules that define the regular expressions over some alphabet Σ and

the languages that those expressions denote.

ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole

member is the empty string.

If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the

language with one string, of length one, with a in its one position.

Suppose r and s are regular expressions denoting languages L(r) and L(s),

respectively.


1. (r)|(s) is a regular expression denoting the language L(r) U L(s).

2. (r)(s) is a regular expression denoting the language L(r)L(s).

3. (r)* is a regular expression denoting (L(r))*.

4. (r) is a regular expression denoting L(r).

The unary operator * has highest precedence and is left associative.

Concatenation has second highest precedence and is left associative.

| has lowest precedence and is left associative.

A language that can be defined by a regular expression is called a regular set.

If two regular expressions r and s denote the same regular set, we say they are

equivalent and write r = s.

For instance, (a|b) = (b|a).

There are a number of algebraic laws for regular expressions, for example:

r | s = s | r ( | is commutative)

r | ( s | t ) = ( r | s ) | t ( | is associative)

( rs ) t = r ( st ) (concatenation is associative)

r ( s | t ) = rs | rt and ( s | t ) r = sr | tr (concatenation distributes over | )

εr = r and rε = r (ε is the identity for concatenation)

r* = ( r | ε )*

r** = r* ( * is idempotent)

Algebraic laws for regular expressions


Regular Definitions

Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet

of basic symbols, then a regular definition is a sequence of definitions of the form:

d1 → r1

d2 → r2

………

dn → rn

where:

1. Each di is a distinct name

2. Each ri is a regular expression over the alphabet Σ U {d1, d2, . . . , di-1}.

E.g: Identifiers is the set of strings of letters and digits beginning with a letter. Regular

definition for this set:

letter → A | B | …. | Z | a | b | …. | z

digit → 0 | 1 | …. | 9

id → letter ( letter | digit ) *
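The regular definition for identifiers carries over almost verbatim to a Python regular expression; a sketch (the helper name is invented):

```python
import re

# letter ( letter | digit )*  — the regular definition for identifiers
letter = "[A-Za-z]"
digit = "[0-9]"
ident = re.compile(f"{letter}({letter}|{digit})*")

def is_identifier(s):
    # fullmatch: the whole string must be generated by the definition
    return ident.fullmatch(s) is not None

print(is_identifier("rate60"), is_identifier("60rate"))
```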

Notational Shorthands

Certain constructs occur so frequently in regular expressions that it is convenient to

introduce notational shorthands for them.

1. One or more instances (+)

- The unary postfix operator + means “ one or more instances of”


- If r is a regular expression that denotes the language L(r), then (r)+ is a

regular expression that denotes the language (L(r))+

- Thus the regular expression a+ denotes the set of all strings of one or more

a’s.

- The operator + has the same precedence and associativity as the operator

*.

2. Zero or one instance ( ?)

- The unary postfix operator ? means “zero or one instance of”.

- The notation r? is a shorthand for r | ε.

- If r is a regular expression, then (r)? is a regular expression that denotes

the language L(r) U { ε }.

3. Character Classes.

- The notation [abc] where a, b and c are alphabet symbols denotes the

regular expression a | b | c.

- A character class such as [a-z] denotes the regular expression a | b | c | d |

…. | z.

- Identifiers can be described as strings generated by the regular expression,

[ A – Z a – z ] [ A – Z a – z 0 – 9 ] *

4. Regular Set

- A language denoted by a regular expression is said to be a regular set.

5. Non-regular Set

- A language which cannot be described by any regular expression.

E.g. the set of all strings of balanced parentheses and repeating strings cannot

be described by a regular expression. This set can be specified by a context-free

grammar.

Example 1:

Regular definition for identifiers in Pascal


letter → A | B | . . . | Z | a | b | . . . | z

digit  → 0 | 1 | 2 | . . . | 9

   id  → letter (letter | digit)*

Example 2:

Regular definition for unsigned numbers in Pascal

digit → 0 | 1 | 2 | . . . | 9

digits → digit digit*

optional-fraction → . digits | ε

optional-exponent → ( E ( + | - | ε ) digits ) | ε

num → digits optional-fraction optional-exponent
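The regular definition for unsigned numbers translates into one Python regular expression; a sketch, with each alternative annotated against the definition:

```python
import re

num = re.compile(r"""
    [0-9]+                 # digits
    (\.[0-9]+)?            # optional-fraction: . digits | ε
    (E[+-]?[0-9]+)?        # optional-exponent: ( E ( + | - | ε ) digits ) | ε
""", re.VERBOSE)

for s in ["6", "3.14", "6.336E4", "1.89E-4"]:
    print(s, bool(num.fullmatch(s)))   # all four are unsigned numbers
```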

Tokens, their patterns and attribute values

RECOGNITION OF TOKENS

It is done using finite automata (transition diagrams).

Consider the following grammar fragment


Transition Diagrams

As an intermediate step in the construction of a lexical analyzer, we first convert

patterns into stylized flowcharts, called "transition diagrams."

Transition diagrams have a collection of nodes or circles, called states.

Each state represents a condition that could occur during the process of scanning

the input looking for a lexeme that matches one of several patterns.

Edges are directed from one state of the transition diagram to another.

Each edge is labeled by a symbol or set of symbols.

Some important conventions about transition diagrams are:

1. Certain states are said to be accepting, or final. These states indicate that a

lexeme has been found. We always indicate an accepting state by a double

circle, and if there is an action to be taken — typically returning a token and

an attribute value to the parser — we shall attach that action to the accepting

state.


2. In addition, if it is necessary to retract the forward pointer one position (i.e.,

the lexeme does not include the symbol that got us to the accepting state), then

we shall additionally place a * near that accepting state.

3. One state is designated the start state, or initial state; it is indicated by an edge,

labeled "start," entering from nowhere.

4. The transition diagram always begins in the start state before any input

symbols have been read.

Transition diagram for relop

Transition diagram for unsigned numbers


Transition diagram for whitespace characters

Transition diagram for identifiers and keywords
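The identifier-and-keyword diagram can be simulated directly. This is an illustrative sketch: the keyword set is invented, and the * retraction from the conventions above shows up as the delimiter character being excluded from the lexeme:

```python
KEYWORDS = {"if", "then", "else", "while"}   # assumed keyword set

def get_id_or_keyword(input_chars, begin):
    """Simulate the transition diagram:
    start --letter--> state 1 --letter|digit--> state 1 --other--> accept*
    The accepting state is starred, so forward is retracted one position."""
    forward = begin
    if forward >= len(input_chars) or not input_chars[forward].isalpha():
        return None                       # no transition out of the start state
    forward += 1                          # start -> state 1 on a letter
    while forward < len(input_chars) and input_chars[forward].isalnum():
        forward += 1                      # stay in state 1 on letter | digit
    lexeme = input_chars[begin:forward]   # the delimiter is not part of
    token = ("KEYWORD" if lexeme in KEYWORDS else "ID", lexeme)
    return token, forward                 # the lexeme (retraction)

print(get_id_or_keyword("while x1 > 0", 0))   # (('KEYWORD', 'while'), 5)
```

A single diagram serves both identifiers and keywords: the lexeme is looked up in the keyword set only after the accepting state is reached.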

A LANGUAGE FOR SPECIFYING A LEXICAL ANALYZER

There is a tool called Lex, or in a more recent implementation Flex, that allows

one to specify a lexical analyzer by specifying regular expressions to describe

patterns for tokens.

The input notation for the Lex tool is referred to as the Lex language and the tool

itself is the Lex compiler.

The Lex compiler transforms the input patterns into a transition diagram and

generates code, in a file called lex.yy.c, that simulates this transition diagram.

An input file, which we call lex.l, is written in the Lex language and describes

the lexical analyzer to be generated.

The Lex compiler transforms lex.l to a C program, in a file that is always

named lex.yy.c.

The latter file is compiled by the C compiler into a file called a.out, as always.

The C-compiler output is a working lexical analyzer that can take a stream of

input characters and produce a stream of tokens.


Creating a lexical analyzer with Lex

Lex Specifications

A lex program has three parts

declarations

%%

translation rules

%%

auxiliary procedures

The translation rules are statements of the form

p1 {action1}

p2 {action2}

…

pn {actionn}

Example Lex program for a few tokens
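The pattern/action structure of the translation rules can be mimicked in Python. This is not a real Lex specification: the rules, token names, and the yylex name are invented for illustration, but the dispatch follows Lex's conventions (longest match wins, earlier rule breaks ties, the action sees the matched lexeme):

```python
import re

# Each rule is (pattern, action); the action receives the lexeme (like yytext)
# and returns a token, or None to discard the lexeme.
rules = [
    (r"if|then|else",          lambda text: (text.upper(), None)),
    (r"[A-Za-z][A-Za-z0-9]*",  lambda text: ("ID", text)),
    (r"[0-9]+",                lambda text: ("NUM", int(text))),
    (r"[ \t\n]+",              lambda text: None),   # skip whitespace
]
compiled = [(re.compile(p), a) for p, a in rules]

def yylex(source):
    pos, tokens = 0, []
    while pos < len(source):
        # Longest match among all rules; the earlier rule wins ties (as in Lex).
        m, action = max(
            ((r.match(source, pos), a) for r, a in compiled),
            key=lambda ma: ma[0].end() if ma[0] else -1,
        )
        if not m or m.end() == pos:
            raise SyntaxError(f"no pattern matches at position {pos}")
        token = action(m.group())
        if token is not None:
            tokens.append(token)
        pos = m.end()
    return tokens

print(yylex("if x1 then 42"))
```

Note how the keyword rule precedes the identifier rule: both match the lexeme then, and the tie goes to the rule listed first, exactly as in a Lex specification.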


Part-A


1. Define a preprocessor. What are the functions of a preprocessor? [AU MAY/JUN 2007(Reg 2004),NOV/DEC 2007(Reg

2004)]

A preprocessor produces input to compilers. A source program may

be divided into modules stored in separate files. The task of collecting

the source program is sometimes entrusted to a distinct program called

a preprocessor.

Its functions are Macro processing, file inclusion, rational

preprocessors and language extensions.

2. What are the issues in lexical analysis? [AU MAY/JUN 2007(Reg 2004)]

1. Simpler design

2. Compiler efficiency is improved

3. Compiler portability is enhanced

3. Define a symbol table. [AU NOV/DEC 2007(Reg 2004)]

The symbol table is a data structure containing a record for each

variable name, with fields for the attributes of the name.

4. Differentiate compiler and interpreter. [AU APR/MAY 2008(Reg 2004)]

Compiler

- It is a translator that translates a high-level language to a low-level language.

- It displays the errors after the whole program is executed.

Interpreter

- It is a translator that translates a high-level language to a low-level language.

- It checks line by line for errors.


5. Write short notes on buffer pair. [AU APR/MAY 2008(Reg 2004)]

Each buffer is of the same size N, and N is usually the size of a disk

block, e.g., 4096 bytes. Using one system read command we can read

N characters into a buffer. If fewer than N characters remain in the

input file, then a special character, represented by eof, marks the end

of the source file. Two pointers to the input buffer are maintained,

namely lexeme_beginning and forward.

6. What is a language processing system?

[AU NOV/DEC 2008(Reg 2004),MAY/JUNE 2012(Reg 2004)]

In a language processing system, the source program is first handled by a

preprocessor, then translated by the compiler into assembly language; the

assembler converts this into relocatable machine code, and the linker and

loader combine the relocatable object files and library files into the

code that actually runs on the machine.

7. What are the error recovery actions in a lexical analyzer?

[AU NOV/DEC 2008(Reg 2004),MAY/JUN 2012(Reg 2008)]


1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.

8. What are the issues to be considered in the design of a lexical analyzer? [AU MAY/JUN 2009 (Reg 2004)]

1. Simplicity of design
2. Compiler efficiency
3. Compiler portability

9. Define concrete and abstract syntax with example. [AU MAY/JUN 2009 (Reg 2004)]

Abstract syntax: what are the significant parts of the expression? Example: a sum expression has its two operand expressions as its significant parts.

Concrete syntax: what does the expression look like? Example: the same sum expression can be written in different ways:

2 + 3 -- infix

(+ 2 3) -- prefix


(2 3 +) -- postfix

10. What is a sentinel? What is its purpose? [AU NOV/DEC 2010 (Reg 2004), MAY/JUNE 2012 (Reg 2004)]

The sentinel is a special character that cannot be part of the source program; a natural choice is the character eof. eof retains its use as a marker for the end of the entire input: any eof that appears other than at the end of a buffer means that the input is at an end. The sentinel speeds up the lexical analyzer by combining the end-of-buffer test with the test for the current character.

11. Define lexeme and pattern. [AU NOV/DEC 2010 (Reg 2004)]

Lexeme
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Pattern
A pattern is a description of the form that the lexemes of a token may take.

12. What is an interpreter? [AU APR/MAY 2011 (Reg 2008)]

An interpreter is a translator that executes a high-level program directly, without producing a separate target program. During execution, it checks line by line for errors.


13. Define token and lexeme. [AU APR/MAY 2011 (Reg 2008)]

Token
A token is a pair consisting of a token name and an optional attribute value.

Lexeme
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

14. What is the role of the lexical analyzer? [AU NOV/DEC 2011 (Reg 2008)]

The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

15. Give the transition diagram of an identifier. [AU NOV/DEC 2011 (Reg 2008)]

start --letter--> (1) --letter or digit--> (1) --other--> (accept, retract one character)
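The identifier transition diagram can be simulated directly in C. This sketch assumes the classic letter(letter|digit)* pattern (underscores excluded), and the function name is illustrative:

```c
#include <ctype.h>

/* Simulate the identifier transition diagram:
 *   start --letter--> in_id --letter|digit--> in_id --other--> accept (retract)
 * Returns the length of the identifier lexeme at the start of s, or 0
 * if s does not begin with an identifier. */
int match_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]))    /* start state: must see a letter */
        return 0;
    int i = 1;
    while (isalnum((unsigned char)s[i]))  /* in_id state: loop on letters/digits */
        i++;
    return i;  /* the 'other' character is retracted; lexeme is s[0..i-1] */
}
```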


16. Mention a few cousins of the compiler. [AU MAY/JUN 2012 (Reg 2008)]

The following are the cousins of compilers:
i. Preprocessors
ii. Assemblers
iii. Loaders
iv. Link editors

17. What is a compiler?

A compiler is a program that reads a program written in one language, the source language, and translates it into an equivalent program in another language, the target language. As an important part of this translation process, the compiler reports to its user the presence of errors in the source program.

18. State some software tools that manipulate source programs.

i. Structure editors
ii. Pretty printers
iii. Static checkers
iv. Interpreters


19. What are the two main parts of compilation?

The two main parts are:
a) The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program.
b) The synthesis part constructs the desired target program from the intermediate representation.

20. How many phases does analysis consist of?

Analysis consists of three phases:
i. Linear analysis
ii. Hierarchical analysis
iii. Semantic analysis

21. State some compiler construction tools.

i. Parser generators
ii. Scanner generators
iii. Syntax-directed translation engines
iv. Automatic code generators
v. Data-flow engines

22. State the general phases of a compiler.

i) Lexical analysis
ii) Syntax analysis
iii) Semantic analysis
iv) Intermediate code generation


v) Code optimization

vi) Code generation

23. Give the transition diagram for whitespace characters.

start --blank, tab, or newline--> (1) --blank, tab, or newline--> (1) --other--> (accept, retract one character)
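The whitespace diagram loops on blank, tab, and newline and retracts on any other character. A minimal C sketch (the function name is illustrative):

```c
/* Simulate the whitespace transition diagram:
 *   start --blank|tab|newline--> (loop) --other--> accept (retract)
 * Returns how many whitespace characters were consumed at the start of s. */
int skip_whitespace(const char *s) {
    int i = 0;
    while (s[i] == ' ' || s[i] == '\t' || s[i] == '\n')
        i++;
    return i;  /* s + i now points at the retracted non-whitespace character */
}
```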

24. What is an assembler?

An assembler is a program that translates an assembly-language program into relocatable machine code.
