
P16CS32 – Compiler Design

Date : 18.08.2020

Topic : Introduction about Compiler

What is a Compiler?

o A compiler is a translator that converts a high-level language into machine language.

o The high-level language is written by a developer; the machine language is what the processor can understand.

o A compiler also detects and reports errors in the program.

o The main purpose of a compiler is to translate code written in one language into another without changing the meaning of the program.

o When you execute a program written in a HLL programming language, it executes in two parts.

o In the first part, the source program is compiled and translated into the object program (low-level language).

o In the second part, the object program is translated into the target program by the assembler.

Execution process of a source program in a compiler:

Source Program → Compiler → Object Program

Object Program → Assembler → Target Program

Language Processing System

We have learnt that any computer system is made of hardware and software. The

hardware understands a language, which humans cannot understand. So we write programs in

high-level language, which is easier for us to understand and remember. These programs are

then fed into a series of tools and OS components to get the desired code that can be used by

the machine. This is known as Language Processing System.

[Figure: the Language Processing System pipeline, whose main stages include the compiler and the assembler.]


The high-level language is converted into binary language in various phases. A compiler is a

program that converts high-level language to assembly language. Similarly, an assembler is a

program that converts the assembly language to machine-level language.

Let us first understand how a program, using a C compiler, is executed on a host machine.

o The user writes a program in C language (high-level language).

o The C compiler compiles the program and translates it to an assembly program (low-level language).

o An assembler then translates the assembly program into machine code (object code).

o A linker tool is used to link all the parts of the program together for execution (executable machine code).

o A loader loads all of them into memory and then the program is executed.

Before diving straight into the concepts of compilers, we should understand a few other

tools that work closely with compilers.

Preprocessor

A preprocessor, generally considered a part of the compiler, is a tool that produces input for compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.


Interpreter

An interpreter, like a compiler, translates high-level language into low-level machine language. The difference lies in the way they read the source code or input. A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, and translates the whole program, which may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to an intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an interpreter stops execution and reports it; whereas a compiler reads the whole program even if it encounters several errors.

Assembler

An assembler translates assembly language programs into machine code. The output

of an assembler is called an object file, which contains a combination of machine instructions

as well as the data required to place these instructions in memory.

Linker

A linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search for and locate referenced modules/routines in a program and to determine the memory location where these codes will be loaded, making the program instructions have absolute references.

Compilers and Interpreters

• “Compilation”

– Translation of a program written in a source language into a semantically

equivalent program written in a target language

• “Interpretation”

– Performing the operations implied by the source program


Loader

A loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.

Cross-compiler

A compiler that runs on platform (A) and is capable of generating executable code for

platform (B) is called a cross-compiler.

Source-to-source Compiler

A compiler that takes the source code of one programming language and translates

it into the source code of another programming language is called a source-to-source

compiler.

COMPILER DESIGN – ARCHITECTURE

A compiler can broadly be divided into two phases based on the way they compile.

Analysis Phase

Known as the front-end of the compiler, the analysis phase of the compiler reads

the source program, divides it into core parts, and then checks for lexical, grammar, and

syntax errors. The analysis phase generates an intermediate representation of the source

program and symbol table, which should be fed to the Synthesis phase as input.

Synthesis Phase

Known as the back-end of the compiler, the synthesis phase generates the target

program with the help of intermediate source code representation and symbol table.

A compiler can have many phases and passes.

o Pass : A pass refers to the traversal of a compiler through the entire program.

o Phase : A phase of a compiler is a distinguishable stage, which takes input from the

previous stage, processes and yields output that can be used as input for the next

stage. A pass can have more than one phase.


P16CS32 – Compiler Design

Date : 21.08.2020

Topic : Input Buffering, Specification of Tokens, Regular

Expressions, Transition Diagram

Input Buffering

The LA scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.

Buffering techniques:

1. Buffer pairs

2. Sentinels

The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the beginning point, until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read.

In practice, each buffering scheme adopts one convention: a pointer is either at the symbol last read or at the symbol it is ready to read.


The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see:

DECLARE (ARG1, ARG2, …, ARGn)

without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the figure above is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead traveled to the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that lookahead is limited.
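In code, the buffer-pair scheme with sentinels is usually organized along the following lines. This is a minimal C sketch, not taken from the notes: HALF, load, init_buffers and next_char are illustrative names, and a production scanner would also maintain the lexemeBegin pointer and distinguish a literal 0xFF input byte from the sentinel.

#include <stdio.h>

#define HALF 4096                       /* size of each buffer half */

static char buf[2 * HALF + 2];          /* two halves, each followed by a sentinel slot */
static char *forward;                   /* lookahead pointer */
static FILE *src;                       /* source file being scanned */

/* fill one half with input and plant an EOF sentinel after the last byte read */
static void load(char *half) {
    size_t n = fread(half, 1, HALF, src);
    half[n] = (char)EOF;
}

static void init_buffers(FILE *fp) {
    src = fp;
    load(buf);
    forward = buf;
}

/* advance the lookahead pointer; returns -1 at the real end of input */
static int next_char(void) {
    for (;;) {
        char c = *forward++;
        if (c != (char)EOF)
            return (unsigned char)c;            /* common case: one test per character */
        if (forward == buf + HALF + 1) {        /* sentinel ending the first half */
            load(buf + HALF + 1);               /* reload the second half */
        } else if (forward == buf + 2 * HALF + 2) { /* sentinel ending the second half */
            load(buf);                          /* wrap around: reload the first half */
            forward = buf;
        } else {
            return -1;                          /* EOF inside a half: end of input */
        }
    }
}

Because each half ends in a sentinel, the hot path tests a single character per input symbol instead of comparing two pointers.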


Specifications of Tokens

Let us understand how the language theory undertakes the following terms:

Alphabets

Any finite set of symbols {0,1} is a set of binary alphabets,

{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets, {a-z, A-Z} is a set of

English language alphabets.

Strings

Any finite sequence of alphabet symbols is called a string. The length of a string is the total number of occurrences of symbols in it, e.g., the length of the string tutorialspoint is 14 and is denoted by |tutorialspoint| = 14. A string having no symbols, i.e. a string of zero length, is known as the empty string and is denoted by ε (epsilon).

Special Symbols

A typical high-level language contains the following symbols:

Arithmetic symbols    Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation           Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment            =
Special assignment    +=, /=, *=, -=
Comparison            ==, !=, <, <=, >, >=
Preprocessor          #

Language

A language is considered as a finite set of strings over some finite set of alphabets.

Computer languages are considered as finite sets, and mathematically set operations can be

performed on them. Finite languages can be described by means of regular expressions.


The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language in hand. It searches for the patterns defined by the language rules.

REGULAR EXPRESSIONS

Regular expressions have the capability to express finite languages by defining a pattern

for finite strings of symbols. The grammar defined by regular expressions is known as regular

grammar.

The language defined by regular grammar is known as regular language.

Regular expression is an important notation for specifying patterns. Each pattern matches

a set of strings, so regular expressions serve as names for a set of strings. Programming language

tokens can be described by regular languages. The specification of regular expressions is an

example of a recursive definition. Regular languages are easy to understand and have efficient

implementation.

There are a number of algebraic laws that are obeyed by regular expressions, which can

be used to manipulate regular expressions into equivalent forms.

A token is either a single string or one of a collection of strings of a certain type. If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens.

Consider an identifier, which is defined to be a letter followed by zero or more letters or digits. In regular expression notation we would write:

Identifier = letter (letter | digit)*

The rules that define the regular expressions over an alphabet Σ are: ε is a regular expression; each symbol a in Σ is a regular expression; and if r and s are regular expressions, so are r | s (union), rs (concatenation), and r* (Kleene closure).
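As a quick sanity check of this definition, the same pattern can be tried against sample strings with the POSIX regex API; a minimal sketch (the sample strings are our own):

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* identifier = letter (letter | digit)*, anchored so the whole string must match */
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED) != 0) return 1;
    const char *samples[] = { "count", "x1", "9lives" };
    for (int i = 0; i < 3; i++)
        printf("%-8s %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "identifier" : "not an identifier");
    regfree(&re);
    return 0;
}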

Recognition of tokens:

We have learned how to express patterns using regular expressions. Now, we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε

expr → term relop term

     | term

term → id
     | number

In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:

ws → (blank | tab | newline)+

Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace. It is the following token that gets returned to the parser.

Representing valid tokens of a language in regular expressions

If x is a regular expression, then:

• x* means zero or more occurrences of x.

o i.e., it can generate { ε, x, xx, xxx, xxxx, … }

• x+ means one or more occurrences of x.

o i.e., it can generate { x, xx, xxx, xxxx, … }, i.e., x+ = x.x*

• x? means at most one occurrence of x.

o i.e., it can generate either { x } or { ε }.

Representing occurrence of symbols using regular expressions

letter = [a-z] or [A-Z]

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

sign = [+ | -]

Representing language tokens using regular expressions

Decimal = (sign)?(digit)+

Identifier = (letter)(letter | digit)*

relop → < | > | <= | >= | == | <>

<     relop   LT
<=    relop   LE
==    relop   EQ
<>    relop   NE

The only problem left with the lexical analyzer is how to verify the validity of a regular

expression used in specifying the patterns of keywords of a language. A well-accepted solution is

to use finite automata for verification.

TRANSITION DIAGRAM:

A transition diagram has a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.

Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.

If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.

Some important conventions about transition diagrams are:

1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle.

2. In addition, if it is necessary to retract the forward pointer one position, then we shall additionally place a * near that accepting state.

3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere. The transition diagram always begins in the start state before any input symbols have been used.


As an intermediate step in the construction of a LA, we first produce a stylized flowchart, called a transition diagram. Positions in a transition diagram are drawn as circles and are called states.

The above TD is for an identifier, defined to be a letter followed by any number of letters or digits. A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams. Each state gets a segment of code, as the sketch below illustrates.
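For example, the identifier diagram can be coded with one switch case per state; a minimal sketch (the function name is_identifier and the state numbering are our own):

#include <ctype.h>
#include <stdio.h>

/* Transition diagram for identifiers coded as a switch on the current state:
   state 0 = start, state 1 = inside an identifier (accepting). */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++) {
        switch (state) {
        case 0:                                  /* start: expect a letter */
            if (isalpha((unsigned char)*s)) state = 1;
            else return 0;
            break;
        case 1:                                  /* letters or digits may follow */
            if (!isalnum((unsigned char)*s)) return 0;
            break;
        }
    }
    return state == 1;                           /* accept iff letter (letter | digit)* */
}

int main(void) {
    printf("%d %d %d\n", is_identifier("count"), is_identifier("x1"), is_identifier("9lives"));
    return 0;                                    /* prints: 1 1 0 */
}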

FINITE AUTOMATA

• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a

sentence of that language, and “no” otherwise.

• We call the recognizer of the tokens as a finite automaton.

Finite set of states (Q)

Finite set of input symbols (Σ)

One Start state (q0)

Set of final states (qf)

Transition function (δ)

The transition function (δ) maps a pair consisting of a state and an input symbol to a state: δ : Q × Σ → Q.

Finite Automata Construction

Let L(r) be a regular language recognized by some finite automata (FA).

States : States of FA are represented by circles. State names are written inside circles.

Start state : The state from where the automaton starts is known as the start state. The start state has an arrow pointed towards it.

Intermediate states : All intermediate states have at least two arrows; one pointing to and another pointing out from them.

Final state : If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles.

Transition : The transition from one state to another happens when a desired symbol in the input is found. Upon transition, the automaton can either move to the next state or stay in the same state. Movement from one state to another is shown as a directed arrow, where the arrow points to the destination state. If the automaton stays in the same state, an arrow pointing from a state to itself is drawn.

Example : We assume FA accepts any three digit binary value ending in digit 1. FA = {Q(q0,

qf), Σ(0,1), q0, qf, δ}
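Such an automaton runs naturally as a transition table. A minimal C sketch follows; the state numbering is our own, not from the notes: states 0-2 count the digits read, state 4 accepts (three digits ending in 1), states 3 and 5 reject.

#include <stdio.h>

int accepts(const char *s) {
    static const int delta[6][2] = {
        /* '0' '1' */
        { 1, 1 },   /* 0: no digits read yet          */
        { 2, 2 },   /* 1: one digit read              */
        { 3, 4 },   /* 2: two digits read; third decides */
        { 5, 5 },   /* 3: three digits ending in 0 (reject) */
        { 5, 5 },   /* 4: three digits ending in 1 (accept) */
        { 5, 5 },   /* 5: dead state (wrong length)   */
    };
    int q = 0;
    for (; *s; s++) {
        if (*s != '0' && *s != '1') return 0;   /* not a binary symbol */
        q = delta[q][*s - '0'];
    }
    return q == 4;                              /* qf: accept three digits ending in 1 */
}

int main(void) {
    printf("%d %d %d\n", accepts("101"), accepts("100"), accepts("1011"));
    return 0;                                   /* prints: 1 0 0 */
}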


TYPES OF FINITE AUTOMATON

• A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)

• This means that we may use a deterministic or non-deterministic automaton as a lexical

analyzer.

• Both deterministic and non-deterministic finite automata recognize regular sets.

– deterministic – faster recognizer, but it may take more space

– non-deterministic – slower, but it may take less space

– Deterministic automata are widely used in lexical analyzers.

• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.

Non-Deterministic Finite Automaton (NFA)

• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:

o S - a set of states

o Σ - a set of input symbols (alphabet)

o move - a transition function move to map state-symbol pairs to sets of states.

o s0 - a start (initial) state

o F- a set of accepting states (final states)

• ε- transitions are allowed in NFAs. In other words, we can move from one state to

another one without consuming any symbol.

• An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.

Example:


Deterministic Finite Automaton (DFA)

• A Deterministic Finite Automaton (DFA) is a special form of a NFA.

• No state has an ε-transition.

• For each symbol a and state s, there is at most one edge labeled a leaving s, i.e., the transition function maps a state-symbol pair to a state (not to a set of states).

Example:


P16CS32 – Compiler Design

Date : 20.08.2020

Topic : Lexical Analysis

What is Lexical Analysis?

Lexical analysis is the first phase of the compiler; the component that performs it is also known as a scanner. It converts the high-level input program into a sequence of tokens.

The lexical analyzer breaks this syntax into a series of tokens. It removes any extra space

or comment written in the source code.

Programs that perform lexical analysis are called lexical analyzers or lexers. A lexer contains a tokenizer or scanner. If the lexical analyzer detects that a token is invalid, it generates an error. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when it demands it.

LEXICAL ANALYSIS

• reads and converts the input into a stream of tokens to be analyzed by parser.

• lexeme : a sequence of characters which comprises a single token.

• Lexical Analyzer →Lexeme / Token → Parser

Removal of White Space and Comments

• Remove white space(blank, tab, new line etc.) and comments

Constants

• Constants: For a while, consider only integers

What's a lexeme?

A lexeme is a sequence of characters that are included in the source program according to

the matching pattern of a token. It is nothing but an instance of a token.

In other words, the sequence of characters matched by a pattern to form the corresponding token, or a sequence of input characters that comprises a single token, is called a lexeme.

eg- “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;” .


What's a token?

The token is a sequence of characters which represents a unit of information in the source

program.

Example of tokens:

• Type token (id, number, real, . . . )

• Punctuation tokens (IF, void, return, . . . )

• Alphabetic tokens (keywords)

• Keywords : Examples – for, while, if, etc.

• Identifiers : Examples – variable names, function names, etc.

• Operators : Examples – '+', '++', '-', etc.

• Separators : Examples – ';', ',', etc.

(eg.) c=a+b*5;

Lexemes and tokens

Lexeme   Token
c        identifier
=        assignment symbol
a        identifier
+        + (addition symbol)
b        identifier
*        * (multiplication symbol)
5        5 (number)

What is Pattern?

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword used as a token, the pattern is simply the sequence of characters that forms the keyword.

• Example 1:

For the input 31 + 28, what is the output (token representation)?

input : 31 + 28

output : <num, 31> <+, > <num, 28>

num, + : tokens

31, 28 : attribute values (lexemes) of the integer token num


• Example 2:

input : count = count + increment;

output : <id,1> <=, > <id,1> <+, > <id,2> <;>

Symbol table

Token    Attribute (lexeme)
id, 1    count
id, 2    increment

How Lexical Analyzer functions

1. Tokenization i.e. Dividing the program into valid tokens.

2. Remove white space characters.

3. Remove comments.

4. It also provides help in generating error messages by providing row numbers and column

numbers.

For example, consider the program

int main()

{

// 2 variables

int a, b; a = 10; return 0;

}

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'

'a' '=' '10' ';' 'return' '0' ';' '}'

Above are the valid tokens.

As another example, consider the printf statement below.

printf ( "WELCOME" ) ;

There are 5 valid tokens in this printf statement: printf, (, "WELCOME", ), and ;.


Exercise 1:

Count number of tokens :

int main()

{

int a = 10, b = 20;

printf("sum is :%d",a+b);

return 0;

}

Answer: Total number of token: 27.

Exercise 2:

Count number of tokens :

int max(int i);

• The lexical analyzer first reads int, finds it to be valid, and accepts it as a token.

• max is read by it and found to be a valid function name after reading (.

• int is also a token, then again i is another token, and finally ;.

Answer: Total number of tokens: 7

int, max, (, int, i, ), ;

Lexical Analyzer Architecture: OR The role of the Lexical Analyzer:-

The main task of lexical analysis is to read input characters in the code and produce tokens.

The lexical analyzer scans the entire source code of the program. It identifies each token one by one. Scanners are usually implemented to produce tokens only when requested by a parser. Here is how this works:

[Figure: Source Program → Lexical Analyzer ⇄ Parser → to semantic analysis; the parser requests each token with getNextToken, and both components consult the Symbol Table.]


1. "Get next token" is a command which is sent from the parser to the lexical analyzer.

2. On receiving this command, the lexical analyzer scans the input until it finds the next token.

3. It returns the token to the parser.

Lexical Analyzer skips whitespaces and comments while creating these tokens. If any error is

present, then Lexical analyzer will correlate that error with the source file and line number.

The lexical analyzer performs the tasks given below:

• Helps to identify tokens and enter them into the symbol table

• Removes white spaces and comments from the source program

• Correlates error messages with the source program

• Helps to expand macros if they are found in the source program

• Reads input characters from the source program

A minimal sketch of this getNextToken interface is shown below.
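Here is a minimal hand-written getNextToken in C to make the interface concrete. Everything in it (the Token struct, the token kinds, the global cursor) is our own illustrative naming, not from the notes; it skips whitespace and // comments and returns one token per call:

#include <ctype.h>
#include <stdio.h>

typedef enum { TK_ID, TK_NUM, TK_OP, TK_EOF } TokenKind;
typedef struct { TokenKind kind; char text[64]; } Token;

static const char *src;                 /* cursor into the source text */

Token getNextToken(void) {
    Token t = { TK_EOF, "" };
    for (;;) {                          /* skip whitespace and // comments */
        while (isspace((unsigned char)*src)) src++;
        if (src[0] == '/' && src[1] == '/') {
            while (*src && *src != '\n') src++;
        } else break;
    }
    if (*src == '\0') return t;
    size_t n = 0;
    if (isalpha((unsigned char)*src)) { /* identifier: letter (letter | digit)* */
        t.kind = TK_ID;
        while (isalnum((unsigned char)*src) && n < 63) t.text[n++] = *src++;
    } else if (isdigit((unsigned char)*src)) {  /* number: digit+ */
        t.kind = TK_NUM;
        while (isdigit((unsigned char)*src) && n < 63) t.text[n++] = *src++;
    } else {                            /* single-character operator/separator */
        t.kind = TK_OP;
        t.text[n++] = *src++;
    }
    t.text[n] = '\0';
    return t;
}

int main(void) {
    src = "c = a + b * 5; // 2 variables";
    for (Token t = getNextToken(); t.kind != TK_EOF; t = getNextToken())
        printf("<%d, %s>\n", t.kind, t.text);   /* one token per parser request */
    return 0;
}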

Example of Lexical Analysis, Tokens, Non-Tokens

Consider the following code that is fed to Lexical Analyzer

#include <stdio.h>

int maximum(int x, int y) {

// This will compare 2 numbers

if (x > y)

return x;

else {


return y;

}

}

Examples of tokens created

Lexeme    Token
int       Keyword
maximum   Identifier
(         Operator
int       Keyword
x         Identifier
,         Operator
int       Keyword
y         Identifier
)         Operator
{         Operator
if        Keyword


Examples of non-tokens

Type                      Examples
Comment                   // This will compare 2 numbers
Pre-processor directive   #include <stdio.h>
Pre-processor directive   #define NUMS 8,9
Macro                     NUMS
Whitespace                \n, \b, \t

Need of Lexical Analyzer

• Simplicity of design of the compiler: the removal of white spaces and comments enables the syntax analyzer to work with efficient syntactic constructs.

• Compiler efficiency is improved: specialized buffering techniques for reading characters speed up the compilation process.

• Compiler portability is enhanced.

Issues in Lexical Analysis

Lexical analysis is the process of producing tokens from the source program. It has the following

issues:

• Lookahead

• Ambiguities

Lookahead

Lookahead is required to decide when one token will end and the next token will begin. Simple examples that have lookahead issues are i vs. if, and = vs. ==. Therefore a way to describe the lexemes of each token is required.

A way is needed to resolve ambiguities:

• Is if two variables i and f, or the keyword if?

• Is == two equal signs =, =, or a single ==?

• arr(5, 4) vs. fn(5, 4) in Ada (array reference syntax and function call syntax are similar).


Hence, the number of lookahead to be considered and a way to describe the lexemes of each

token is also needed.

Regular expressions are one of the most popular ways of representing tokens.

Ambiguities

The lexical analysis programs written with lex accept ambiguous specifications and choose the

longest match possible at each input point. Lex can handle ambiguous specifications. When more

than one expression can match the current input, lex chooses as follows:

• The longest match is preferred.

• Among rules which matched the same number of characters, the rule given first is preferred.

The sketch below shows the longest-match rule in hand-written form.
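A hand-coded scanner applies the same longest-match rule by testing two-character operators before one-character ones; a minimal sketch (the token codes and function name are illustrative):

/* Token codes for the relational operators; the numbering is illustrative. */
enum { LT, LE, EQ, NE, GT, GE, NONE };

/* Return the relop token starting at s, writing its length to *len.
   Two-character operators are tested first, so "<=" wins over "<". */
int relop_token(const char *s, int *len) {
    if (s[0] == '<' && s[1] == '=') { *len = 2; return LE; }
    if (s[0] == '<' && s[1] == '>') { *len = 2; return NE; }
    if (s[0] == '>' && s[1] == '=') { *len = 2; return GE; }
    if (s[0] == '=' && s[1] == '=') { *len = 2; return EQ; }
    if (s[0] == '<')                { *len = 1; return LT; }
    if (s[0] == '>')                { *len = 1; return GT; }
    *len = 0;
    return NONE;
}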

Lexical Errors

• A character sequence that cannot be scanned into any valid token is a lexical error.

• Lexical errors are uncommon, but they still must be handled by a scanner.

• Misspellings of identifiers, keywords, or operators are considered lexical errors.

Usually, a lexical error is caused by the appearance of some illegal character, mostly at the

beginning of a token.

Lexical error handling approaches

Lexical errors can be handled by the following actions:

• Deleting one character from the remaining input.

• Inserting a missing character into the remaining input.

• Replacing a character by another character.

• Transposing two adjacent characters

Error Recovery Schemes

• Panic mode recovery

• Local correction

❖ Source text is changed around the error point in order to get a correct text.

❖ Analyzer will be restarted with the resultant new text as input.

• Global correction

❖ It is an enhanced panic mode recovery.

❖ Preferred when local correction fails.

Panic mode recovery

In panic mode recovery, unmatched patterns are deleted from the remaining input, until the

lexical analyzer can find a well-formed token at the beginning of what input is left.

(eg.) For instance the string fi is encountered for the first time in a C program in the context:


fi (a == f(x))

A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.

Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.

Local correction

Local correction performs deletion/insertion and/or replacement of any number of symbols in the

error detection point.

(e.g.) In Pascal, given c[i] '=' ...;, the scanner deletes the first quote, because it cannot legally follow the closing bracket, and the parser replaces the resulting '=' by an assignment.

Most of the errors are corrected by local correction.

(eg.) The effects of lexical error recovery might well create a later syntax error, handled by the

parser. Consider

· · · for $tnight · · ·

The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then tnight is

scanned as an identifier.

In effect it results in

· · · fortnight · · ·

which will cause a syntax error. Such false errors are unavoidable, though a syntactic error-repair may help.

Advantages of Lexical analysis

• Lexical analyzer method is used by programs like compilers which can use the parsed

data from a programmer's code to create a compiled binary executable code

• It is used by web browsers to format and display a web page with the help of parsed data from JavaScript, HTML, CSS

• A separate lexical analyzer helps you to construct a specialized and potentially more

efficient processor for the task

Disadvantage of Lexical analysis

• You need to spend significant time reading the source program and partitioning it in the

form of tokens

• Some regular expressions are quite difficult to understand compared to PEG or EBNF

rules

• More effort is needed to develop and debug the lexer and its token descriptions

• Additional runtime overhead is required to generate the lexer tables and construct the

tokens


P16CS32 – Compiler Design

Date : 24.08.2020

Topic : Finite Automata and Conversion NFA into DFA

FINITE AUTOMATA

• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a

sentence of that language, and “no” otherwise.

• We call the recognizer of the tokens as a finite automaton.

Finite set of states (Q)

Finite set of input symbols (Σ)

One Start state (q0)

Set of final states (qf)

Transition function (δ)

• In other words, lexical analysis is essentially a process of recognizing different tokens from the source program. This process of recognition can be accomplished by building a classical model called a Finite State Machine (FSM) or a Finite Automaton (FA).

Finite Automata Construction

Let L(r) be a regular language recognized by some finite automata (FA).

States : States of FA are represented by circles. State names are written inside circles.

Start state : The state from where the automaton starts is known as the start state. The start state has an arrow pointed towards it.

Intermediate states : All intermediate states have at least two arrows; one pointing to and another pointing out from them.

Final state : If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles.


Transition : The transition from one state to another happens when a desired symbol in the input is found. Upon transition, the automaton can either move to the next state or stay in the same state. Movement from one state to another is shown as a directed arrow, where the arrow points to the destination state. If the automaton stays in the same state, an arrow pointing from a state to itself is drawn.

Example : We assume FA accepts any three digit binary value ending in digit 1. FA = {Q(q0,

qf), Σ(0,1), q0, qf, δ}

TYPES OF FINITE AUTOMATON

• A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)

• This means that we may use a deterministic or non-deterministic automaton as a lexical

analyzer.

• Both deterministic and non-deterministic finite automata recognize regular sets.

– deterministic – faster recognizer, but it may take more space

– non-deterministic – slower, but it may take less space

– Deterministic automata are widely used in lexical analyzers.

• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.

Non-Deterministic Finite Automaton (NFA)

• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:

o S - a set of states

o Σ - a set of input symbols (alphabet)

o move - a transition function move to map state-symbol pairs to sets of states.

o s0 - a start (initial) state

o F- a set of accepting states (final states)

• ε- transitions are allowed in NFAs. In other words, we can move from one state to

another one without consuming any symbol.


• An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.

Example:

Deterministic Finite Automaton (DFA)

• A Deterministic Finite Automaton (DFA) is a special form of a NFA.

• No state has an ε-transition.

• For each symbol a and state s, there is at most one edge labeled a leaving s, i.e., the transition function maps a state-symbol pair to a state (not to a set of states).

Example:


Conversion of NFA to DFA

We can convert from an NFA to a DFA using the subset construction. To perform this operation, let us define two functions:

• The ε-closure function takes a state and returns the set of states reachable from it on (zero or more) ε-transitions. Note that this will always include the state itself: we should be able to get from a state to any state in its ε-closure without consuming any input.

• The function move takes a state and a character, and returns the set of states reachable by one transition on this character.

We can generalize both these functions to apply to sets of states by taking the union of their applications to the individual states.

E.g., if A, B and C are states, move({A,B,C}, 'a') = move(A,'a') ∪ move(B,'a') ∪ move(C,'a').

ALGORITHM (the subset construction of a DFA from an NFA):

INPUT: An NFA N.

OUTPUT: A DFA D accepting the same language as N.

Method:

Operation      Description
ε-closure(s)   Set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T)   Set of NFA states reachable from some NFA state s in set T on ε-transitions alone; = ∪ (s in T) ε-closure(s).
move(T, a)     Set of NFA states to which there is a transition on input symbol a from some state s in T.

The start state of D is ε-closure(s0), and the accepting states of D are all those sets of N's states that include at least one accepting state of N. Initially, ε-closure(s0) is the only state in Dstates, and it is unmarked.

The subset construction algorithm

1. Create the start state of the DFA by taking the ε-closure of the start state of the NFA.

2. Perform the following for the new DFA state:

   For each possible input symbol:

   ▪ Apply move to the newly created state and the input symbol; this will return a set of states.

   ▪ Apply the ε-closure to this set of states, possibly resulting in a new set.

   ▪ This set of NFA states will be a single state in the DFA.

3. Each time we generate a new DFA state, we must apply step 2 to it. The process is complete when applying step 2 does not yield any new states.

4. The finish states of the DFA are those which contain any of the finish states of the NFA.

while (there is an unmarked state T in Dstates)
{
    mark T;
    for (each input symbol a)
    {
        U = ε-closure(move(T, a));
        if (U is not in Dstates)
            add U as an unmarked state to Dstates;
        Dtran[T, a] = U;
    }
}

Computing ε-closure(T)

push all states of T onto stack;
initialize ε-closure(T) to T;
while (stack is not empty)
{
    pop t, the top element, off the stack;
    for (each state u with an edge from t to u labeled ε)
        if (u is not in ε-closure(T))
        {
            add u to ε-closure(T);
            push u onto stack;
        }
}
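As a concrete rendering of this pseudocode, here is ε-closure over the Thompson NFA for (a | b)*abb used in the example below, with each set of states packed into a uint32_t bitset. The state numbering 0-10 follows that NFA; the bitset representation and names are our own:

#include <stdint.h>
#include <stdio.h>

/* eps[s] is the bitset of states reachable from state s by one ε-edge. */
static const uint32_t eps[11] = {
    /* 0 */ (1u<<1) | (1u<<7),
    /* 1 */ (1u<<2) | (1u<<4),
    /* 2 */ 0,
    /* 3 */ (1u<<6),
    /* 4 */ 0,
    /* 5 */ (1u<<6),
    /* 6 */ (1u<<1) | (1u<<7),
    /* 7 */ 0, /* 8 */ 0, /* 9 */ 0, /* 10 */ 0
};

/* ε-closure(T): push all states of T on a stack, then propagate ε-edges. */
uint32_t eps_closure(uint32_t T) {
    uint32_t closure = T;
    int stack[11], top = 0;
    for (int s = 0; s < 11; s++)
        if (T & (1u << s)) stack[top++] = s;
    while (top > 0) {
        int t = stack[--top];
        for (int u = 0; u < 11; u++)
            if ((eps[t] & (1u << u)) && !(closure & (1u << u))) {
                closure |= 1u << u;      /* add u to the closure */
                stack[top++] = u;        /* and explore its ε-edges too */
            }
    }
    return closure;
}

int main(void) {
    uint32_t A = eps_closure(1u << 0);   /* ε-closure({0}) */
    for (int s = 0; s < 11; s++)
        if (A & (1u << s)) printf("%d ", s);   /* prints: 0 1 2 4 7 */
    printf("\n");
    return 0;
}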

Simulating an NFA

S = ε-closure(s0);
c = nextChar();
while (c != eof)
{
    S = ε-closure(move(S, c));
    c = nextChar();
}
if (S ∩ F != ∅) return "yes";
else return "no";

Conversion of an NFA into a DFA:

The algorithm used for constructing a DFA from an NFA is called the subset construction algorithm.

NFA N for (a | b)*abb (input = NFA, output = DFA). The Thompson-construction NFA has states 0 through 10:

[Figure: NFA for (a | b)*abb]
0 --ε--> 1, 7;  1 --ε--> 2, 4;  2 --a--> 3;  4 --b--> 5;  3 --ε--> 6;  5 --ε--> 6;
6 --ε--> 1, 7;  7 --a--> 8;  8 --b--> 9;  9 --b--> 10 (accepting)

The starting state of the DFA is ε-closure(0) = {0, 1, 2, 4, 7}. Call it A; it is as yet unmarked.

1. Mark A = {0, 1, 2, 4, 7}.
   For input symbol a: move(A, a) = {3, 8}; ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B (new, unmarked).
   For input symbol b: move(A, b) = {5}; ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C (new, unmarked).

2. Mark B = {1, 2, 3, 4, 6, 7, 8}.
   For input symbol a: move(B, a) = {3, 8}; ε-closure({3, 8}) = B (already present).
   For input symbol b: move(B, b) = {5, 9}; ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D (new, unmarked).

3. Mark C = {1, 2, 4, 5, 6, 7}.
   For input symbol a: move(C, a) = {3, 8}; ε-closure({3, 8}) = B.
   For input symbol b: move(C, b) = {5}; ε-closure({5}) = C.

4. Mark D = {1, 2, 4, 5, 6, 7, 9}.
   For input symbol a: move(D, a) = {3, 8}; ε-closure({3, 8}) = B.
   For input symbol b: move(D, b) = {5, 10}; ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E (new, unmarked).

5. Mark E = {1, 2, 4, 5, 6, 7, 10}.
   For input symbol a: move(E, a) = {3, 8}; ε-closure({3, 8}) = B.
   For input symbol b: move(E, b) = {5}; ε-closure({5}) = C.

Transition table:

NFA state                 DFA state   a   b
{0, 1, 2, 4, 7}           A           B   C
{1, 2, 3, 4, 6, 7, 8}     B           B   D
{1, 2, 4, 5, 6, 7}        C           B   C
{1, 2, 4, 5, 6, 7, 9}     D           B   E
{1, 2, 4, 5, 6, 7, 10}    E           B   C

At this point all the states of the DFA are marked, so the construction is complete. State A is the start state, and state E, which contains state 10 of the NFA, is the only accepting state. Note that the DFA has one more state than it needs: states A and C have the same move function, and so can be merged. The figure shows the resulting DFA for (a | b)*abb. A table-driven C sketch of this DFA appears after the list of differences below.

Difference between NFA and DFA

1. Every DFA is an NFA, but not vice versa.

2. Both NFA and DFA have the same power, and each NFA can be translated into a DFA.

3. There can be multiple final states in both DFA and NFA.


4. NFA is more of a theoretical concept; DFA is used for lexical analysis in a compiler.

5. The transition function for a non-deterministic finite automaton, i.e. delta, is multi-valued, whereas for a DFA it is single-valued.

6. Checking membership is easy with a deterministic finite automaton, whereas it is difficult for a non-deterministic finite automaton.

7. Construction of a non-deterministic finite automaton is very easy, whereas for a DFA it is difficult.

8. The space required for a deterministic finite automaton is more, whereas for a non-deterministic finite automaton it is less.

9. Backtracking is allowed in a deterministic finite automaton, but it is not possible in every case for a non-deterministic finite automaton.

10. For every input and output we can construct a deterministic finite automaton machine, but it is not possible to construct an NFA machine for every input and output.

11. An NFA built by Thompson's construction has exactly one final state, while the DFA obtained from it can have more than one final state.
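To tie the construction together, here is a table-driven run of the DFA built above for (a | b)*abb; a minimal C sketch in which states A-E are indexed 0-4 (our own encoding):

#include <stdio.h>

int matches_abb(const char *s) {
    static const int delta[5][2] = {
        /* 'a' 'b' */
        { 1, 2 },   /* A: a -> B, b -> C */
        { 1, 3 },   /* B: a -> B, b -> D */
        { 1, 2 },   /* C: a -> B, b -> C */
        { 1, 4 },   /* D: a -> B, b -> E */
        { 1, 2 },   /* E: a -> B, b -> C */
    };
    int q = 0;                       /* start in state A */
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0;
        q = delta[q][*s - 'a'];
    }
    return q == 4;                   /* E is the only accepting state */
}

int main(void) {
    printf("%d %d %d\n", matches_abb("abb"), matches_abb("aabb"), matches_abb("ab"));
    return 0;                        /* prints: 1 1 0 */
}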


P16CS32 – Compiler Design

Date : 19.08.2020 Topic: Phases of the Compiler

Cousins of the Compiler

1. Preprocessor

The output of a preprocessor may be given as the input to compilers. The tasks performed by the preprocessor are given below:

Macro Processing – Preprocessors allow the user to define macros in the program. A macro is a set of instructions that can be used repeatedly in the program. This macro-processing task is done by preprocessors.

File Inclusion – The preprocessor also allows the user to include the header files which may be required by the program. For example, #include <stdio.h>.

By this statement the header file stdio.h is included and the user can make use of the functions defined in this header file. This task of the preprocessor is called file inclusion.
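A small C fragment showing both tasks; the macro SQUARE is our own illustrative example:

#include <stdio.h>              /* file inclusion: the preprocessor pastes stdio.h here */

#define SQUARE(x) ((x) * (x))   /* macro: expanded textually wherever it is used */

int main(void) {
    /* after macro processing this line reads: printf("%d\n", ((5) * (5))); */
    printf("%d\n", SQUARE(5));
    return 0;
}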

Rational Preprocessor – These processors augment older languages with modern flow-of-control and data-structuring facilities. Such processors provide the user with built-in macros for constructs like while-loops or if-statements.

2. ASSEMBLER

Programmers found it difficult to write or read programs in machine language. They began to use a mnemonic (symbolic) form for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language.

Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler program is called the source program; the output is a machine language translation (object program).

3. Loader and Link Editors

A linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search for and locate referenced modules/routines in a program and to determine the memory location where these codes will be loaded, making the program instructions have absolute references.


A loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.

INTERPRETER

An interpreter is a program that appears to execute a source program as if it were

machine language.

Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. JAVA also

uses interpreter.

The process of interpretation can be carried out in following phases.

o Lexical analysis

o Syntax analysis

o Semantic analysis

o Direct Execution

Advantages:

o Modification of user program can be easily made and implemented as execution

proceeds.

o The types of objects denoted by variables may change dynamically.

o Debugging a program and finding errors is a simplified task for a program used for interpretation.

o The interpreter for the language makes it machine independent.

Disadvantages:

o The execution of the program is slower.

o Memory consumption is more.

TRANSLATOR

A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role, error detection. Any violation of the HLL specification would be detected and reported to the programmer. The important roles of a translator are:

1. Translating the HLL program input into an equivalent machine language program.

2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.


LIST OF COMPILERS

1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers

PHASES OF COMPILER

A compiler operates in phases. A phase is a logically interrelated operation that takes

source program in one representation and produces output in another representation.

There are two phases of compilation.

o Analysis (machine independent / language dependent) – front end

o Synthesis (machine dependent / language independent) – back end

Compiler passes:

A collection of phases is done only once (single pass) or multiple times (multi pass)

• Single pass: usually requires everything to be defined before being

used in source program

• Multi pass: compiler may have to keep entire program representation

in memory

The compilation process is partitioned into a number of sub-processes called 'phases'.

A compiler can have many phases and passes.

Pass : A pass refers to the traversal of a compiler through the entire program.

Phase : A phase of a compiler is a distinguishable stage, which takes input from the previous

stage, processes and yields output that can be used as input for the next stage. A pass can

have more than one phase.


STRUCTURE OF THE COMPILER DESIGN

The compilation process is a sequence of various phases. Each phase takes input from its

previous stage, has its own representation of source program, and feeds its output to the next

phase of the compiler. Let us understand the phases of a compiler.

Lexical Analysis:-

The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:

<token-name, attribute-value>


LA or Scanners reads the source program one character at a time, carving the source

program into a sequence of atomic units called tokens.

Syntax Analysis:-

The next phase is called the syntax analysis or parsing. It takes the token produced by

lexical analysis as input and generates a parse tree (or syntax tree). In this phase, token

arrangements are checked against the source code grammar, i.e., the parser checks if the

expression made by the tokens is syntactically correct.

Semantic Analysis:-

Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example, it checks that values are assigned between compatible data types, and flags errors such as adding a string to an integer. Also, the semantic analyzer keeps track of identifiers, their types and expressions; whether identifiers are declared before use or not, etc. The semantic analyzer produces an annotated syntax tree as an output.

Intermediate Code Generations:-

The intermediate code generation uses the structure produced by the syntax analyzer to

create a stream of simple instructions. Many styles of intermediate code are possible. One

common style uses instruction with one operator and a small number of operands. The output of

the syntax analyzer is some representation of a parse tree. The intermediate code generation

phase transforms this parse tree into an intermediate language representation of the source

program.

After semantic analysis, the compiler generates an intermediate code of the source code

for the target machine. It represents a program for some abstract machine. It is in between the

high-level language and the machine language. This intermediate code should be generated in

such a way that it makes it easier to be translated into the target machine code.

An intermediate representation of the final machine language code is produced. This

phase bridges the analysis and synthesis phases of translation.

Code Optimization:-

The next phase does code optimization of the intermediate code. Optimization can be thought of as something that removes unnecessary code lines, and arranges the sequence of statements in order to speed up program execution without wasting resources (CPU, memory). This is an optional phase, intended to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate code program that does the same job as the original, but in a way that saves time and/or space.

a. Local Optimization:-

There are local transformations that can be applied to a program to make an improvement. For example,

If A > B goto L2
Goto L3
L2 :

can be replaced by the single statement

If A <= B goto L3

Another important local optimization is the elimination of common sub-expressions:

A := B + C + D
E := B + C + F

might be evaluated as

T1 := B + C
A := T1 + D
E := T1 + F

This takes advantage of the common sub-expression B + C.

b. Loop Optimization:-

Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered.
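A small C illustration of this loop improvement; the functions and variables are our own example, not from the notes:

/* Before: x * y is recomputed on every iteration, although it never
   changes inside the loop. */
void fill_before(int *a, int n, int x, int y) {
    for (int i = 0; i < n; i++)
        a[i] = x * y + i;
}

/* After loop optimization: the loop-invariant computation is moved to a
   point just before the loop is entered. */
void fill_after(int *a, int n, int x, int y) {
    int t = x * y;
    for (int i = 0; i < n; i++)
        a[i] = t + i;
}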

Code Generation:-

The code generator produces the object code by deciding on the memory locations for data, selecting code to access each datum, and selecting the registers in which each computation is to be done. Many computers have only a few high-speed registers in which computations can be performed quickly. A good code generator would attempt to utilize registers as efficiently as possible.

The last phase of translation is code generation. A number of optimizations to reduce the length of the machine language program are carried out during this phase. The output of the code generator is the machine language program for the specified computer.

In this phase, the code generator takes the optimized representation of the intermediate

code and maps it to the target machine language. The code generator translates the intermediate

code into a sequence of (generally) re-locatable machine code. Sequence of instructions of

machine code performs the task as the intermediate code would do.

Table Management OR Book-keeping OR Symbol Table:-

A compiler needs to collect information about all the data objects that appear in the source program.

The information about data objects is collected by the early phases of the compiler-

lexical and syntactic analyzers. The data structure used to record this information is called as

Symbol Table.

It is a data-structure maintained throughout all the phases of a compiler. All the

identifiers’ names along with their types are stored here. The symbol table makes it easier for the

compiler to quickly search the identifier record and retrieve it. The symbol table is also used for

scope management.

This is the portion to keep the names used by the program and records essential

information about each. The data structure used to record this information called a ‘Symbol

Table’.

Error Handling:-

One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error message should allow the programmer to determine exactly where the errors have occurred. Errors may occur in all of the phases of a compiler. Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.

The error handler is invoked when a flaw in the source program is detected. The output of the LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens together into syntactic structures called expressions. Expressions may further be combined to form statements. The syntactic structure can be regarded as a tree whose leaves are the tokens, called a parse tree.

The parser has two functions. It checks that the tokens from the lexical analyzer occur in patterns that are permitted by the specification for the source language. It also imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.

For example, if a program contains the expression A+/B, then after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of these two adjacent binary operators violates the formation rules of an expression. Syntax analysis makes explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped together.

For example, A/B*C has two possible interpretations:

1. divide A by B and then multiply by C, or

2. multiply B by C and then use the result to divide A.

Each of these two interpretations can be represented in terms of a parse tree.


Compilation Process of a source code through phases


Simply put:

• Lexical analysis (Scanning)

• Syntax Analysis (Parsing)

• Syntax Directed Translation

• Intermediate Code Generation

• Run-time environments

• Code Generation

• Machine Independent Optimization

The Phases of a Compiler


Compiler-Construction Tools

• Software development tools are available to implement one or more compiler phases

o Scanner generators

o Parser generators

o Syntax-directed translation engines

o Automatic code generators

o Data-flow engines

Compiler-Construction Tools

The compiler writer, like any software developer, can profitably use modern software development environments containing tools such as language editors, debuggers, version managers, profilers, test harnesses, and so on. In addition to these general software-development tools, other more specialized tools have been created to help implement various phases of a compiler.

These tools use specialized languages for specifying and implementing specific components, and many use quite sophisticated algorithms. The most successful tools are those that hide the details of the generation algorithm and produce components that can be easily integrated into the remainder of the compiler. Some commonly used compiler-construction tools include:

1. Parser generators that automatically produce syntax analyzers from a grammatical

description of a programming language.

2. Scanner generators that produce lexical analyzers from a regular-expression description of

the tokens of a language.

3. Syntax-directed translation engines that produce collections of routines for walking a parse

tree and generating intermediate code.


4. Code-generator generators that produce a code generator from a collection of rules for

translating each operation of the intermediate language into the machine language for a target

machine.

5. Data-flow analysis engines that facilitate the gathering of information about how values are

transmitted from one part of a program to each other part. Data-flow analysis is a key part of

code optimization.

6. Compiler-construction toolkits that provide an integrated set of routines for constructing

various phases of a compiler.