LEXICAL ANALYSIS AND TOKENIZATION OF SOURCE CODE

WRITTEN IN ‘C’ LANGUAGE

A SYNOPSIS

Minor Project Submitted in partial fulfillment of the requirement for Degree of Bachelor of Engineering in Information Technology Discipline

Submitted To

[RAJIV GANDHI PRODYOGIKI VISHWAVIDYALAYA, BHOPAL (M.P.)]

PROJECT ID – IT-013

Submitted By:

Jayati Naik (0111IT071045) Kuldeep Kumar Mishra (0111IT071048)

Meghendra Singh (0111IT071053)

Under The Guidance Of:

Prof. Ilyas Khan (Professor, Department Of Information Technology)

DEPARTMENT OF INFORMATION TECHNOLOGY TECHNOCRATS INSTITUTE OF TECHNOLOGY, BHOPAL

SESSION: 2009-10

Technocrats Institute of Technology, Bhopal (M.P.)

Department of Information Technology

CERTIFICATE

This is to certify that the work embodied in this synopsis entitled “Lexical Analysis and Tokenization of Source Code Written in ‘C’ Language”, being submitted by “Jayati Naik” (Roll No.: 0111IT071045), “Kuldeep Kumar Mishra” (Roll No.: 0111IT071048) & “Meghendra Singh” (Roll No.: 0111IT071053) in partial fulfillment of the requirement for the degree of “Bachelor of Engineering in Information Technology” discipline to “Rajiv Gandhi Praudyogiki Vishwavidyalaya, Bhopal (M.P.)” during the academic year 2009-10, is a record of a bona fide piece of work carried out by them under my supervision and guidance in the “Department of Information Technology”, Technocrats Institute of Technology, Bhopal (M.P.).

APPROVED & GUIDED BY:

Prof. Ilyas Khan (Professor, Department Of Information Technology)

FORWARDED BY:

(Prof. Shiv K Sahu)
Project Coordinator, Dept. of IT
TIT, Bhopal

(Prof. Vijay K Chaudhari)
Head of Information Technology
TIT, Bhopal

Technocrats Institute of Technology, Bhopal (M.P.)

Department Of Information Technology

DECLARATION

We, Jayati Naik, Kuldeep Kumar Mishra and Meghendra Singh, students of Bachelor of Engineering in Information Technology discipline, session 2009-2010, Technocrats Institute of Technology, Bhopal (M.P.), hereby declare that the work presented in this synopsis entitled “Lexical Analysis and Tokenization of Source Code Written in ‘C’ Language” is the outcome of our own work, is bona fide and correct to the best of our knowledge, and has been carried out taking care of engineering ethics. The work presented does not infringe any patented work and has not been submitted to any other university or anywhere else for the award of any degree or any professional diploma.

Date:

Jayati Naik
Enrollment No.: 0111IT071045

Kuldeep Kumar Mishra
Enrollment No.: 0111IT071048

Meghendra Singh
Enrollment No.: 0111IT071053

CONTENTS

Certificate
Declaration
Abstract
1.0 INTRODUCTION
1.1.0 Introduction to Lexical Grammar
1.2.0 Introduction to Token
1.3.0 How the scanner and tokenizer work
1.4.0 Platform used
2.0 PROPOSED METHODOLOGY
2.1.0 Block Diagram
2.2.0 Data Flow Diagram
2.3.0 Flow Chart
3.0 APPROACHED RESULT AND CONCLUSION
4.0 APPLICATIONS AND FUTURE WORK
REFERENCES

ABSTRACT

The lexical analyzer is responsible for scanning the source input file and translating lexemes (strings) into small objects that the compiler for a high-level language can easily process. These small values are often called “tokens”. The lexical analyzer is also responsible for converting sequences of digits into their numeric form, for processing other literal constants, for removing comments and white space from the source file, and for taking care of many other mechanical details. The lexical analyzer converts a stream of input characters into a stream of tokens. For tokenizing identifiers and keywords, we incorporate a symbol table that initially consists of the predefined keywords. The tokens are read from an input file, and the output file consists of all the tokens present in the input file along with their respective token values.

KEYWORDS: Lexeme, Lexical Analysis, Compiler, Parser, Token

1.0 INTRODUCTION

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner. A lexer often exists as a single function which is called by a parser or another function and works alongside other components to make compilation of a high-level language possible. This complete setup is what we call a compiler.

To define what a compiler is, one must first define what a translator is. A translator is a program that takes a program written in one language, also known as the source language, and outputs a program written in another language, known as the target language.

With the translator defined, a compiler can be defined as a translator whose source language is a high-level language such as Java or Pascal and whose target language is a low-level language such as machine or assembly language.

There are five parts of compilation (or phases of the compiler):

1.) Lexical Analysis
2.) Syntax Analysis
3.) Semantic Analysis
4.) Code Optimization
5.) Code Generation

Lexical Analysis is the act of taking an input source program and outputting a stream of tokens. This is done with the Scanner. The Scanner can also place identifiers into something called the symbol table or place strings into the string table. The Scanner can report trivial errors such as invalid characters in the input file.

Syntax Analysis is the act of taking the token stream from the scanner and comparing it against the rules and patterns of the specified language. Syntax Analysis is done with the Parser. The Parser produces a tree, which can come in many formats, but is referred to as the parse tree. It reports errors when the tokens do not follow the syntax of the specified language. Errors that the Parser can report are syntactical errors such as missing parentheses, semicolons, and keywords.

Semantic Analysis is the act of determining whether or not the parse tree is relevant and meaningful. The output is intermediate code, also known as an intermediate representation (or IR). Most of the time, this IR is closely related to assembly language, but it is machine independent. Intermediate code allows different code generators for different machines and promotes abstraction and portability across specific machine types and languages (probably the most famous example is Java’s byte-code and the JVM). Semantic Analysis finds more meaningful errors such as undeclared variables, type incompatibilities, and scope-resolution problems.

Code Optimization makes the IR more efficient. Code optimization is usually done in a sequence of steps. Some optimizations include code hoisting (moving constant values to better places within the code), redundant code discovery, and removal of useless code.

Code Generation is the final step in the compilation process. The input to the Code Generator is the IR and the output is machine language code.

1.1.0 Introduction to Lexical Grammar

The specification of a programming language will often include a set of rules which defines the lexer. These rules are usually called regular expressions, and they define the set of possible character sequences that are used to form tokens or lexemes. Whitespace (i.e., characters that are ignored) is also defined by these regular expressions.

1.2.0 Introduction to token

A token is a string of characters, categorized according to the rules as a symbol (e.g. IDENTIFIER, NUMBER, COMMA, etc.). The process of forming tokens from an input stream of characters is called tokenization, and the lexer categorizes them according to a symbol type. A token can look like anything that is useful for processing an input text stream or text file.

A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each '(' is matched with a ')'.

Consider this expression in the C programming language:

sum=3+2;

It is tokenized in the following table:

lexeme    token type
sum       Identifier
=         Assignment operator
3         Number
+         Addition operator
2         Number
;         End of statement
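
A minimal C sketch of how a scanner could produce such a lexeme/token-type listing for this expression is shown below. It is only an illustration of the idea; the fixed input string, the buffer size and the printed token names are assumptions, not the project’s actual code.

#include <stdio.h>
#include <ctype.h>

int main(void)
{
    const char *p = "sum=3+2;";   /* expression to be tokenized */
    char lexeme[32];

    while (*p != '\0') {
        int n = 0;
        if (isspace((unsigned char)*p)) {                       /* ignore white space */
            p++;
        } else if (isalpha((unsigned char)*p) || *p == '_') {   /* identifier */
            while ((isalnum((unsigned char)*p) || *p == '_') && n < 31)
                lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("%-8s Identifier\n", lexeme);
        } else if (isdigit((unsigned char)*p)) {                /* number */
            while (isdigit((unsigned char)*p) && n < 31)
                lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("%-8s Number\n", lexeme);
        } else if (*p == '=') { printf("%-8c Assignment operator\n", *p); p++; }
        else if (*p == '+')   { printf("%-8c Addition operator\n", *p);   p++; }
        else if (*p == ';')   { printf("%-8c End of statement\n", *p);    p++; }
        else                  { printf("%-8c Unknown\n", *p);             p++; }
    }
    return 0;
}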

Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is called “tokenizing”. If the lexer finds an invalid token, it will report an error.

Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.

1.3.0 How the scanner and tokenizer work

The first stage, the scanner, is usually based on a finite state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases, the first non-white-space character can be used to deduce the kind of token that follows, and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule, or longest match rule). In some languages the lexeme creation rules are more complicated and may involve backtracking over previously read characters.
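
As a small illustration of the longest-match idea, the following C sketch (an assumed example written for this synopsis, with hypothetical token names GREATER and GREATER_EQUAL) prefers the two-character operator ">=" over ">" whenever both match:

#include <stdio.h>

/* Return the token name for the operator at the start of s and store the
   number of characters consumed in *len; the two-character ">=" is tried
   first, so the longest possible lexeme always wins (maximal munch). */
static const char *match_relational(const char *s, int *len)
{
    if (s[0] == '>' && s[1] == '=') { *len = 2; return "GREATER_EQUAL"; }
    if (s[0] == '>')                { *len = 1; return "GREATER"; }
    *len = 0;
    return "NO_MATCH";
}

int main(void)
{
    int len;
    printf("%s\n", match_relational(">= 5", &len));   /* GREATER_EQUAL, len = 2 */
    printf("%s\n", match_relational("> 5", &len));    /* GREATER, len = 1 */
    return 0;
}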

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

Take, for example, the following string.

The quick brown fox jumps over the lazy dog

Unlike humans, a computer cannot intuitively 'see' that there are 9 words. To a computer this is only a series of 43 characters.

A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML, there are many ways to represent tokenized input:

<sentence>

<word>The</word>

<word>quick</word>

<word>brown</word>

<word>fox</word>

<word>jumps</word>

<word>over</word>

<word>the</word>

<word>lazy</word>

<word>dog</word>

</sentence>
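
A whitespace-based word tokenizer of this kind can be sketched in C with the standard strtok function; the code below is illustrative only and is not part of the project:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char sentence[] = "The quick brown fox jumps over the lazy dog";
    int count = 0;

    /* strtok splits the (modifiable) string on spaces; each call returns the
       next word token, or NULL when the input is exhausted. */
    for (char *word = strtok(sentence, " "); word != NULL; word = strtok(NULL, " ")) {
        printf("<word>%s</word>\n", word);
        count++;
    }
    printf("%d word tokens\n", count);   /* prints 9 for the sentence above */
    return 0;
}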

A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens, such as parentheses, do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)

For example, in the source code of a computer program the string

net_worth_future = (assets - liabilities);

might be converted (with whitespace suppressed) into the lexical token stream:

NAME "net_worth_future" EQUALS OPEN_PARENTHESIS NAME "assets" MINUS NAME "liabilities" CLOSE_PARENTHESIS SEMICOLON

Though it is possible and sometimes necessary to write a lexer by hand, lexers are often generated by automated tools. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed, or construct a state table for a finite state machine (which is plugged into template code for compilation and execution).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, a NAME token might be any English alphabetical character or an underscore, followed by any number of instances of any ASCII alphanumeric character or an underscore. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means “any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9”.
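
A hand-written C check equivalent to this pattern might look as follows; the function name is hypothetical and the sketch ignores any length limit a real lexer would impose:

#include <stdio.h>
#include <ctype.h>

/* Return 1 if s matches [a-zA-Z_][a-zA-Z_0-9]*, otherwise 0. */
static int is_name_token(const char *s)
{
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    for (int i = 1; s[i] != '\0'; i++)
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
            return 0;
    return 1;
}

int main(void)
{
    printf("%d\n", is_name_token("net_worth_future"));   /* 1 */
    printf("%d\n", is_name_token("3assets"));            /* 0: starts with a digit */
    return 0;
}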

Regular expressions and the finite state machines they generate are not powerful enough to handle recursive patterns, such as “n opening parentheses, followed by a statement, followed by n closing parentheses.” They are not capable of keeping count and verifying that n is the same on both sides, unless you have a finite set of permissible values for n. It takes a full-fledged parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end.
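
The counting idea can be sketched in C as below; because only one kind of bracket is involved, a simple depth counter stands in for the stack (illustrative only):

#include <stdio.h>

/* Return 1 if every '(' in s has a matching ')', otherwise 0. */
static int parens_balanced(const char *s)
{
    int depth = 0;                       /* current nesting depth of '(' */
    for (; *s != '\0'; s++) {
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (depth == 0)              /* a ')' with nothing left to match */
                return 0;
            depth--;
        }
    }
    return depth == 0;                   /* the "stack" must be empty at the end */
}

int main(void)
{
    printf("%d\n", parens_balanced("((a + b) * c)"));   /* 1 */
    printf("%d\n", parens_balanced("((a + b) * c"));    /* 0 */
    return 0;
}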

The Lex programming tool and its compiler are designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. This is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection uses hand-written lexers.

1.4.0 Platform used

In computing, C is a general-purpose computer programming language originally developed in 1972 by Dennis Ritchie at the Bell Telephone Laboratories to implement the Unix operating system.

Although C was designed for writing architecturally independent system software, it is also widely used for developing application software.

Worldwide, C is the first or second most popular language in terms of number of developer positions or publicly available code. It is widely used on many different software platforms, and there are few computer architectures for which a C compiler does not exist. C has greatly influenced many other popular programming languages, most notably C++, which originally began as an extension to C, and Java and C#, which borrow C's lexical conventions and operators.

i. Characteristics

Like most imperative languages in the ALGOL tradition, C has facilities for structured programming and allows lexical variable scope and recursion, while a static type system prevents many unintended operations. In C, all executable code is contained within functions. Function parameters are always passed by value. Pass-by-reference is achieved in C by explicitly passing pointer values. Heterogeneous aggregate data types (struct) allow related data elements to be combined and manipulated as a unit. C program source text is free-format, using the semicolon as a statement terminator (not a delimiter).

C also exhibits the following more specific characteristics (a short illustration in C follows the list):

- non-nestable function definitions
- variables may be hidden in nested blocks
- partially weak typing; for instance, characters can be used as integers
- low-level access to computer memory by converting machine addresses to typed pointers
- function and data pointers supporting ad hoc run-time polymorphism
- array indexing as a secondary notion, defined in terms of pointer arithmetic
- a preprocessor for macro definition, source code file inclusion, and conditional compilation
- complex functionality such as I/O, string manipulation, and mathematical functions consistently delegated to library routines
- a relatively small set of reserved keywords (originally 32, now 37 in C99)
- a large number of compound operators, such as += and ++
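
For instance, pointer-based pass-by-reference and the compound operators mentioned in the list can be seen in the following generic C fragment (a self-contained illustration, unrelated to the lexical analyzer code itself):

#include <stdio.h>

/* C passes arguments by value; passing a pointer lets the callee modify the
   caller's variable, which is how pass-by-reference is achieved in C. */
static void add_interest(double *balance, double rate)
{
    *balance *= 1.0 + rate;              /* compound operator *= */
}

int main(void)
{
    double balance = 100.0;
    int year = 0;

    add_interest(&balance, 0.05);
    year++;                              /* increment operator ++ */
    printf("after year %d: %.2f\n", year, balance);   /* after year 1: 105.00 */
    return 0;
}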

ii. Features

The relatively low-level nature of the language affords the programmer close control over what the computer does, while allowing special tailoring and aggressive optimization for a particular platform. This allows the code to run efficiently on very limited hardware, such as embedded systems.

iii. Turbo C++

Turbo C++ is a C++ compiler and integrated development environment (IDE) from Borland. The original Turbo C++ product line was put on hold after 1994, and was revived in 2006 as an introductory-level IDE, essentially a stripped-down version of their flagship C++ Builder. Turbo C++ 2006 was released on September 5, 2006 and is available in 'Explorer' and 'Professional' editions. The Explorer edition is free to download and distribute while the Professional edition is a commercial product. The Professional edition is no longer available for purchase from Borland.

Turbo C++ 3.0 was released in 1991 (shipping on November 20), and came in amidst expectations of the coming release of Turbo C++ for Microsoft Windows. Initially released as an MS-DOS compiler, 3.0 supported C++ templates, Borland's inline assembler, and generation of MS-DOS mode executables for both 8086 real mode and 286 protected mode (as well as the Intel 80186). 3.0 implemented AT&T C++ 2.1, the most recent version at the time. The separate Turbo Assembler product was no longer included, but the inline assembler could stand in as a reduced-functionality version.

2.0 PROPOSED METHODOLOGY

The aim of the project is to develop a lexical analyzer that can generate tokens for further processing by the compiler. The job of the lexical analyzer is to read the source program one character at a time and produce as output a stream of tokens. The tokens produced by the lexical analyzer serve as input to the next phase, the parser. Thus, the lexical analyzer’s job is to translate the source program into a form more conducive to recognition by the parser.

The goal of this program is to create tokens from the given input stream.
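
The read-one-character, emit-one-token loop described here might be sketched in C roughly as follows; the file names, the keyword list and the token categories are assumptions made for illustration and are not the project’s final code:

#include <stdio.h>
#include <ctype.h>
#include <string.h>

/* Small symbol table of predefined keywords (an illustrative subset only). */
static const char *keywords[] = { "int", "char", "float", "if", "else", "while", "return" };

static int is_keyword(const char *s)
{
    for (unsigned i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++)
        if (strcmp(s, keywords[i]) == 0)
            return 1;
    return 0;
}

int main(void)
{
    FILE *in = fopen("input.c", "r");       /* input source in 'C' */
    FILE *out = fopen("tokens.txt", "w");   /* tokenized output file */
    if (in == NULL || out == NULL) {
        printf("could not open input.c or tokens.txt\n");
        return 1;
    }

    int c = fgetc(in);
    while (c != EOF) {
        char lexeme[64];
        int n = 0;
        if (isspace(c)) {                    /* skip white space (comment removal would go here) */
            c = fgetc(in);
        } else if (isalpha(c) || c == '_') { /* keyword or identifier */
            while ((isalnum(c) || c == '_') && n < 63) {
                lexeme[n++] = (char)c;
                c = fgetc(in);
            }
            lexeme[n] = '\0';
            fprintf(out, "%s\t%s\n", lexeme, is_keyword(lexeme) ? "KEYWORD" : "IDENTIFIER");
        } else if (isdigit(c)) {             /* numeric constant */
            while (isdigit(c) && n < 63) {
                lexeme[n++] = (char)c;
                c = fgetc(in);
            }
            lexeme[n] = '\0';
            fprintf(out, "%s\tNUMBER\n", lexeme);
        } else {                             /* operators, punctuation and other specials */
            fprintf(out, "%c\tSPECIAL\n", c);
            c = fgetc(in);
        }
    }

    fclose(in);
    fclose(out);
    return 0;
}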

2.1.0 Block diagram

[Fig-1: Block Diagram for Lexical Analyzer: input source in ‘C’ → lexical analyser code → tokenized output file]

2.2.0 Data Flow Diagram

A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system. It differs from the flowchart as it shows the data flow instead of the control flow of the program. A data flow diagram can also be used for the visualization of data processing (structured design).

[Fig.-2: Zero Level Data Flow Diagram for Lexical Analyzer]

[Fig.-3: First Level Data Flow Diagram for Lexical Analyzer]

[Fig.-4: Second Level Data Flow Diagram for Lexical Analyzer]

2.3.0 Flow Chart

A flowchart is a common type of chart that represents an algorithm or a process, showing the steps as boxes of various kinds, and their order by connecting these with arrows. Flowcharts are used in analyzing, designing, documenting or managing a process or program in various fields.

Flowcharts are used in designing and documenting complex processes. Like other types of diagram, they help visualize what is going on and thereby help the viewer to understand a process, and perhaps also find flaws, bottlenecks, and other less-obvious features within it. There are many different types of flowcharts, and each type has its own repertoire of boxes and notational conventions. The two most common types of boxes in a flowchart are:

- a processing step, usually called an activity, denoted by a rectangular box
- a decision, usually denoted by a diamond

Key (flowchart symbols): START / END, DECISION, PROCESS, DISPLAY / OUTPUT, MANUAL INPUT

[Fig.-5: Flow Chart for Lexical Analyzer]

3.0 APPROACHED RESULT AND CONCLUSION

Lexical analysis is a stage in the compilation of any program. In this phase we generate tokens from the input stream of data. For performing this task we need a lexical analyzer.

So we are designing a lexical analyzer that will generate tokens from the given input in a high-level language statement. We have not used any database for storing the symbol table used in this project, as parsing the entire statement is beyond the scope of this project. This makes our lexical analyzer portable and independent of a DBMS. Although this reduces the number of keywords and special characters identifiable by the lexical analyzer and increases the length of the code, it also reduces the program complexity and increases the overall speed of the system.

The main features of this lexical analyzer can be summarized as:

- Simple implementation
- Fast lexical analysis
- Efficient resource utilization
- Portable

4.0 APPLICATIONS AND FUTURE WORK

This lexical analyzer can be used as a standalone string analysis tool, which can analyze a given set of strings and check their lexical correctness. It can also be used to analyze the string sequences delimited by white space in a C / C++ source code (*.c / *.cpp) file and output all the results in a text file, provided proper file-handling functionality is added to the source code of the lexical analyzer; this functionality will not be a part of the present project but will be available in an upgraded version, if time permits its development. Furthermore, the applications of a lexical analyzer include:

1. Text Editing
2. Text Processing
3. Pattern Matching
4. File Searching

An enhanced version of this lexical analyzer can be incorporated with a parser having the functionality of syntax-directed translation, to make a complete compiler in the future. The lexical assembly of the keywords and special characters can be appropriately modified in the source code to create a new high-level language like C++.

REFERENCES:

[1]. Yashwant Kanetkar, “Let Us C”, ISBN: 10:81-8333-163-7, eighth edition, pages 424-437.
[2]. www.wikipedia.org, http://en.wikipedia.org/wiki/Lexical_analysis
[3]. www.dragonbook.stanford.edu, http://dragonbook.stanford.edu/lecture-notes/Stanford-CS143/03-Lexical-Analysis.pdf
[4]. www.docstore.mik.ua, http://docstore.mik.ua/orelly/java/langref/ch02_01.htm
[5]. www.isi.edu, http://www.isi.edu/~pedro/Teaching/CSCI565-Spring10/Lectures/LexicalAnalysis.part3.6p.pdf
[6]. www.cs.berkeley.edu, http://www.cs.berkeley.edu/~hilfingr/cs164/public_html/lectures/note2.pdf
[7]. John E. Hopcroft, J. D. Ullman, “Introduction to Automata Theory, Languages, and Computation”, ISBN: 81-85015-96-1, eighteenth edition, page 9.
[8]. Pankaj Jalote, “An Integrated Approach To Software Engineering”, ISBN: 81-7319-271-5, second edition, pages 13-17.
