Compilation
With an emphasis on getting the job done quickly
Copyright © 2003-2015 – Curt Hill
Dec 13, 2015
Introduction
• One of the goals of these courses is to gain the ability to implement simple control languages
– This is what this presentation is about
• There are whole courses devoted to compiler theory and the construction of compilers and interpreters
– Just not here
Stages
• Compilers typically have several stages
• These may be run sequentially or concurrently
• These include:
– The lexical analyzer or scanner
– The syntax analyzer or parser
– The code generator
– The optimization routines
• In this class there is little concern for the last two
Lexical analyzer or scanner
• The front end for a parser
• Takes a string of characters and produces a sequence of tagged tokens
• The input contains things like comments, white space, and line breaks
– None of these make any difference to the parser
– Neither does input buffering, nor numerous other details
Why separate scanner and parser?
• This simplifies the parser
– Which is inherently more complicated
• We can optimize the scanner in different ways than the parser
• Separation makes both more modular
• The parser is mostly portable
– The scanner may or may not be, since it deals more directly with files
Scanner
• Lexical errors - there are just a few:
– Invalid format of a number or identifier name
– Unmatched two-part comments or quoted strings
– Character not in the alphabet
• The lexical analyzer might or might not actually do something with the symbol table
– Depends on the format of the token stream
The token stream
• A token is usually constructed from a record or class
• One item must contain the class of the token
– This usually assigns a number or enumeration to each reserved word and punctuation mark, and to various things like identifiers and constants
• Tokens may carry supplemental information as needed
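The token record described above can be sketched as a small Python class. The kind names and field layout here are illustrative assumptions, not a fixed design: the essential idea is a token class plus optional supplemental information.

```python
# A minimal token record: every token carries its class (kind) plus
# optional supplemental information such as a value and source location.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Kind(Enum):
    IDENT = auto()       # identifiers
    NUMBER = auto()      # numeric constants
    RESERVED = auto()    # reserved words
    PUNCT = auto()       # punctuation marks

@dataclass
class Token:
    kind: Kind                  # class of token
    lexeme: str                 # the exact characters matched
    value: Optional[int] = None # numeric constants carry their value
    line: int = 0               # source location, for error messages
    col: int = 0

t = Token(Kind.NUMBER, "5", value=5, line=6, col=8)
print(t.kind.name, t.value)
```

A reserved word would fill in only `kind` and `lexeme`; a numeric constant also carries `value`; the location fields let later stages pin errors to a line and column.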
Supplemental information
• A reserved word is defined sufficiently by its assigned number or enumeration
• A numeric constant needs to carry the actual value and possibly a type
• An identifier needs a canonical representation
– In languages that are not case sensitive this is usually every character converted to all upper or all lower case
– In case sensitive languages it is the exact representation of the name
Supplemental information
• Often the location of the token is also passed along, so that an error message may be pinned to a usable source location
– Line number and column position
• Not needed for parsing, but it helps the user determine how to fix the error
• The parser will merely ask for one token at a time, picking it off the token stream
– The parser sees no lines
Creating a lexical analyzer
• A lexical analyzer usually recognizes a Type 3 language
– A regular language
• Thus you can describe the token language rather simply
– Such as with regular expressions or a Finite State Automaton
• These are easy to code by hand
– There are programs that do so as well
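Because the token language is regular, a scanner can be written as a single alternation of regular expressions. The toy token set below (a few Pascal-like tokens) is an illustrative assumption; a real scanner would also report characters that match no pattern as lexical errors.

```python
import re

# Each token class is one regular expression; the whole scanner is
# their alternation, with named groups identifying which class matched.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),               # numeric constants
    ("IDENT",  r"[A-Za-z_]\w*"),      # identifiers and reserved words
    ("ASSIGN", r":="),                # tried before PUNCT so ":=" wins over ":"
    ("PUNCT",  r"[();,.:]"),          # punctuation marks
    ("SKIP",   r"[ \t\n]+|\{[^}]*\}") # white space and { } comments
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Yield (kind, lexeme) pairs; white space and comments are dropped."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(scan("x := 5; { a comment }")))
```

Note the ordering trick: `ASSIGN` is listed before `PUNCT`, so at a `:` the scanner prefers the longer `:=` token, which is exactly the kind of detail a hand-coded Finite State Automaton would handle with an extra state.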
Relationship
• It is possible to have the scanner operate like a preprocessor
– Read in a source code file and write out a file of tokens
• Usually it is just a function called by the parser to deliver the next token
• There is no reason why it could not be a co-routine as well
Generated Scanners
• There are programs that generate scanners from some type of formal language description
– The most famous of which is lex on UNIX systems
• Many parser generators either come with a scanner that is easy to modify for your particular grammar, or generate one while processing the syntax
Simple Example

Source Code:
program demo(input,output);
{ a comment }
var x:integer;
begin
  x := 5;
  writeln('This is x',x)
end.

Token Stream:
program ident demo ( ident input , ident output ) ; var ident x : ident integer ; begin ident x := numeric constant 5
The Parser
• Determines if the source file conforms to the language syntax
– Generates meaningful error messages
• Builds a structure suitable for later stages to operate on
– Mostly the code generator
– This structure is usually a parse tree
Parsers
• There are several types of parsers:
– Top down or bottom up
– Recursive descent
– LL
– LR
– Generated or table parsers
Top down
• Uses a leftmost derivation
• Builds the parse tree in a top down and left to right fashion
• Languages that may be parsed in this way are termed LL(N)
– The first L specifies a Left to right scan of the source code
– The second L specifies that the leftmost derivation is the one generated
– N is often one; it represents the number of symbols that need to be looked ahead at in order to decide which rule to use next
Example
• Consider a handout for this
– After the ident is either a comma or a right parenthesis
– The parser looks ahead to that item to determine what to do next
– At every fork in the syntax diagrams there is a look-ahead set of tokens
– We determine which way to go by determining whether the next symbol is in one set or the other
– If we never need more than a single look ahead then 1 is the constant
Top Down Again
• For most programming languages an LL(1) grammar exists
– This requires just a single look-ahead token
• Let's look through the handout and see how this works
Bottom up
• Looks at the leaves and works its way towards the root
• Bottom up parsers usually accept LR(n) languages
• Since they start at the bottom of the parse tree they use shift-reduce algorithms
• In this presentation we will skip how this actually works
LL and LR
• In theory you could parse a programming language in many ways other than LL or LR
– However, doing so makes the running time O(N³), which is not good for something run as often as a compiler
– By forcing a Left to right scan of the source (the first L) we can get O(N) compiles, which makes everyone much happier
Subsets
• LL(1) languages are a subset of LR(1)
• Hence for any LL(1) language there is an LR(1) grammar
• There may be an LR(1) language for which there is no LL(1) grammar
• There are two other classes, SLR and LALR, which are simplifications of LR
Commonly
• The LL, SLR and LALR parsers have been the dominant ones, because the tables needed by an LR parser could grow exponentially in the worst case
• Hence most of the table driven parsers were LL or LALR
• However, since that time quite a bit of work has been done and there are now some decent LR table parsers
Recursive descent parsers
• Since the grammar is recursive we can make our program follow the grammar
• For each production/non-terminal, generate a function that processes that non-terminal
• That function simply calls another function for each non-terminal on the RHS of the production
• We are less interested in these
Recursive Descent
• We have seen:
– LR(n)
– LL(n)
– LALR(n)
• The n determines how many tokens have to be looked at to decide which production is involved
• Most programming languages have an LL(1) grammar
– A recursive descent parser can look ahead just one token and then choose the right production
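A recursive descent sketch for the parameter-list fragment discussed in the look-ahead example: one function per non-terminal, with a single token of look-ahead picking the branch at the comma/right-parenthesis fork. The grammar `params -> "(" ident { "," ident } ")"` and the helper names are illustrative assumptions.

```python
# Recursive descent with one token of look-ahead.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def look(self):
        """Peek at the next token without consuming it (the look-ahead)."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        """Consume one token, or fail with a usable error message."""
        if self.look() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.look()!r}")
        self.pos += 1

    def ident(self):
        t = self.look()
        if t is None or not t.isidentifier():
            raise SyntaxError(f"expected identifier, got {t!r}")
        self.pos += 1
        return t

    def params(self):
        """params -> "(" ident { "," ident } ")" """
        self.expect("(")
        names = [self.ident()]
        while self.look() == ",":   # look-ahead: a comma means another ident
            self.expect(",")
            names.append(self.ident())
        self.expect(")")            # otherwise the list must close
        return names

print(Parser(["(", "input", ",", "output", ")"]).params())
```

After each identifier the next token is either `,` or `)`; since one token always decides the fork, this fragment is LL(1).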
Generated parsers
• These parsers are usually LALR(1) or LL(1), but a few are LR(1)
• There are a number of these available
– YACC on UNIX systems, which is LALR(1)
• These read in some form of a grammar
• They generate a series of tables that is used by a parser
• The scanner is also generated and then plugged into the parser
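The table idea can be shown in miniature. Below is a handwritten LL(1) table for the toy grammar `E -> n R` and `R -> + n R | (empty)`; it is an illustration of what a generator emits, not the output of any real tool. A table entry maps (non-terminal, look-ahead token) to the production to use.

```python
# A tiny table-driven LL(1) parser for:  E -> n R    R -> + n R | (empty)
TABLE = {
    ("E", "n"): ["n", "R"],
    ("R", "+"): ["+", "n", "R"],
    ("R", "$"): [],              # on end of input, R derives empty
}
NONTERMINALS = {"E", "R"}

def parse(tokens):
    """Return True if the token list matches the grammar above."""
    stack = ["$", "E"]           # predicted symbols, top of stack at the end
    tokens = tokens + ["$"]      # "$" marks end of input
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top in NONTERMINALS:
            rule = TABLE.get((top, look))
            if rule is None:
                return False     # no table entry: syntax error
            stack.extend(reversed(rule))   # predict the production's RHS
        elif top == look:
            i += 1               # terminal matched: consume the token
        else:
            return False
    return i == len(tokens)

print(parse(["n", "+", "n"]))
print(parse(["n", "+"]))
```

The parser itself is a fixed loop; only the tables change from grammar to grammar, which is why generators can emit them mechanically.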
How do these work?
• The scanner reads in so many tokens
• The parser then looks for tokens that fit the pattern of a production
– That is, it looks for what is on the RHS
• When it finds the pattern it does a reduction
– A reduction is moving right to left across a production
– Replace the RHS with the LHS
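The shift and reduce steps described above can be sketched for the toy grammar `E -> E + n | n`. Real LR parsers decide between shifting and reducing from generated tables; here the two reductions are hard-coded purely for illustration.

```python
# A minimal shift-reduce sketch for the grammar:  E -> E "+" "n" | "n"
def parse(tokens):
    stack = []
    tokens = list(tokens)
    while tokens or stack != ["E"]:
        if stack[-3:] == ["E", "+", "n"]:
            stack[-3:] = ["E"]           # reduce: replace RHS with LHS
        elif stack[-1:] == ["n"]:
            stack[-1:] = ["E"]           # reduce: n is an E by itself
        elif tokens:
            stack.append(tokens.pop(0))  # shift the next token onto the stack
        else:
            return False                 # stuck: syntax error
    return True                          # reduced to the distinguished symbol

print(parse(["n", "+", "n"]))
print(parse(["n", "+"]))
```

Each reduction is where a real parser would call a semantic routine, as the next slide describes; here the reductions only rewrite the stack.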
More
• When a reduction is done, a semantic routine is usually called
• This routine may do any of the following:
– Check semantic things
• Is a variable defined?
– Update the symbol table
– Generate code
– Do the things not possible with BNF
• Eventually we should be able to reduce to the distinguished symbol; then we are done
Finally
• Lexical analysis implements a finite state automaton
• Parsing implements a push down automaton
– In recursive descent the stack is the run-time stack of function calls
– In bottom up parsing the stack contains tokens
• As Software Engineers we are usually interested in generated parsers for their ease of construction