Compilation
With an emphasis on getting the job done quickly
Copyright © 2003-2015 – Curt Hill
Dec 13, 2015
Introduction
• One of the goals of these courses is to gain the ability to implement simple control languages
– This is what this presentation is about
• There are whole courses devoted to compiler theory and the construction of compilers and interpreters
– Just not here
Stages
• Compilers typically have several stages
• These may be run sequentially or concurrently
• These include:
– The lexical analyzer or scanner
– The syntax analyzer or parser
– The code generator
– The optimization routines
• In this class there is little concern for the last two
Lexical analyzer or scanner
• The front end for a parser
• Takes a string of characters and produces a sequence of tagged tokens
• The input contains things like comments, white space, and line breaks
– None of these make any difference to the parser
– Neither does input buffering, nor numerous other details
Why separate scanner and parser?
• This simplifies the parser
– Which is inherently more complicated
• We can optimize the scanner in different ways than the parser
• Separation makes both more modular
• The parser is mostly portable
– The scanner may or may not be, since it deals more directly with files
Scanner
• Lexical errors - there are just a few:
– Invalid format of a number or identifier name
– Unmatched two-part comments or quoted strings
– Character not in the alphabet
• The lexical analyzer might or might not actually do something with the symbol table
– Depends on the format of the token stream
The token stream
• A token is usually constructed from a record or class
• One item must contain the class of the token
– This usually assigns a number or enumeration to each reserved word and punctuation mark, and to various things like identifiers and constants
• Tokens may carry supplemental information as needed
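The token record described above can be sketched as a small Python class. The kind names and field layout here are illustrative assumptions, not a fixed design: the essential idea is a token class plus optional supplemental information.

```python
# A minimal token record: every token carries its class (kind) plus
# optional supplemental information such as a value and source location.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Kind(Enum):
    IDENT = auto()       # identifiers
    NUMBER = auto()      # numeric constants
    RESERVED = auto()    # reserved words
    PUNCT = auto()       # punctuation marks

@dataclass
class Token:
    kind: Kind                  # class of token
    lexeme: str                 # the exact characters matched
    value: Optional[int] = None # numeric constants carry their value
    line: int = 0               # source location, for error messages
    col: int = 0

t = Token(Kind.NUMBER, "5", value=5, line=6, col=8)
print(t.kind.name, t.value)
```

A reserved word would fill in only `kind` and `lexeme`; a numeric constant also carries `value`; the location fields let later stages pin errors to a line and column.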
Supplemental information
• A reserved word is defined sufficiently by its assigned number or enumeration
• A numeric constant needs to carry the actual value and possibly a type
• An identifier needs a canonical representation
– In languages that are not case sensitive this is usually every character converted to all upper or all lower case
– In case sensitive languages it is the exact representation of the name
Supplemental information
• Often the location of the token is also passed along, so that an error message may be pinned to a usable source location
– Line number and column position
• Not needed for parsing, but it helps the user determine how to fix the error
• The parser will merely ask for one token at a time, picking it off the token stream
– The parser sees no lines
Creating a lexical analyzer
• A lexical analyzer usually recognizes a Type 3 language
– A regular language
• Thus you can describe the token language rather simply
– Such as with regular expressions or a Finite State Automaton
• These are easy to code by hand
– There are programs that do so as well
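Because the token language is regular, a scanner can be written as a single alternation of regular expressions. The toy token set below (a few Pascal-like tokens) is an illustrative assumption; a real scanner would also report characters that match no pattern as lexical errors.

```python
import re

# Each token class is one regular expression; the whole scanner is
# their alternation, with named groups identifying which class matched.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),               # numeric constants
    ("IDENT",  r"[A-Za-z_]\w*"),      # identifiers and reserved words
    ("ASSIGN", r":="),                # tried before PUNCT so ":=" wins over ":"
    ("PUNCT",  r"[();,.:]"),          # punctuation marks
    ("SKIP",   r"[ \t\n]+|\{[^}]*\}") # white space and { } comments
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Yield (kind, lexeme) pairs; white space and comments are dropped."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(scan("x := 5; { a comment }")))
```

Note the ordering trick: `ASSIGN` is listed before `PUNCT`, so at a `:` the scanner prefers the longer `:=` token, which is exactly the kind of detail a hand-coded Finite State Automaton would handle with an extra state.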
Relationship
• It is possible to have the scanner operate like a preprocessor
– Read in a source code file and write out a file of tokens
• Usually it is just a function called by the parser to deliver the next token
• There is no reason why it could not be a co-routine as well
Generated Scanners
• There are programs that generate scanners from some type of formal language description
– The most famous of which is lex on UNIX systems
• Many parser generators either come with a scanner that is easy to modify for your particular grammar, or generate one while processing the syntax
Simple Example

Source Code:
program demo(input,output);
{ a comment }
var x:integer;
begin
  x := 5;
  writeln('This is x',x)
end.

Token Stream:
program ident demo ( ident input , ident output ) ; var ident x : ident integer ; begin ident x := numeric constant 5
The Parser
• Determines if the source file conforms to the language syntax
– Generates meaningful error messages
• Builds a structure suitable for later stages to operate on
– Mostly the code generator
– This structure is usually a parse tree
Parsers
• There are several types of parsers:
– Top down or bottom up
– Recursive descent
– LL
– LR
– Generated or table parsers
Top down
• Uses a leftmost derivation
• Builds the parse tree in a top down and left to right fashion
• Languages that may be parsed in this way are termed LL(N)
– The first L specifies a Left to right scan of the source code
– The second L specifies that the leftmost derivation is the one generated
– N is often one; it represents the number of symbols that need to be looked ahead at in order to decide which rule to use next
Example
• Consider a handout for this
– After the ident is either a comma or a right parenthesis
– The parser looks ahead to that item to determine what to do next
– At every fork in the syntax diagrams there is a look-ahead set of tokens
– We determine which way to go by determining whether the next symbol is in one set or the other
– If we never need more than a single look ahead then 1 is the constant
Top Down Again
• For most programming languages an LL(1) grammar exists
– This requires just a single look-ahead token
• Let's look through the handout and see how this works
Bottom up
• Looks at the leaves and works its way towards the root
• Bottom up parsers usually accept LR(n) languages
• Since they start at the bottom of the parse tree they use shift-reduce algorithms
• In this presentation we will skip how this actually works
LL and LR
• In theory you could parse a programming language in many ways other than LL or LR
– However, doing so makes the running time O(N³), which is not good for something run as often as a compiler
– By forcing a Left to right scan of the source (the first L) we can get O(N) compiles, which makes everyone much happier
Subsets
• LL(1) languages are a subset of LR(1)
• Hence for any LL(1) language there is an LR(1) grammar
• There may be an LR(1) language for which there is no LL(1) grammar
• There are two other classes, SLR and LALR, which are simplifications of LR
Commonly
• The LL, SLR and LALR parsers have been the dominant ones, because the tables needed by an LR parser could grow exponentially in the worst case
• Hence most of the table driven parsers were LL or LALR
• However, since that time quite a bit of work has been done and there are now some decent LR table parsers
Recursive descent parsers
• Since the grammar is recursive we can make our program follow the grammar
• For each production/non-terminal, generate a function that processes that non-terminal
• That function simply calls another function for each non-terminal on the RHS of the production
• We are less interested in these
Recursive Descent
• We have seen:
– LR(n)
– LL(n)
– LALR(n)
• The n determines how many tokens have to be looked at to decide which production is involved
• Most programming languages have an LL(1) grammar
– A recursive descent parser can look ahead just one token and then choose the right production
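A recursive descent sketch for the parameter-list fragment discussed in the look-ahead example: one function per non-terminal, with a single token of look-ahead picking the branch at the comma/right-parenthesis fork. The grammar `params -> "(" ident { "," ident } ")"` and the helper names are illustrative assumptions.

```python
# Recursive descent with one token of look-ahead.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def look(self):
        """Peek at the next token without consuming it (the look-ahead)."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        """Consume one token, or fail with a usable error message."""
        if self.look() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.look()!r}")
        self.pos += 1

    def ident(self):
        t = self.look()
        if t is None or not t.isidentifier():
            raise SyntaxError(f"expected identifier, got {t!r}")
        self.pos += 1
        return t

    def params(self):
        """params -> "(" ident { "," ident } ")" """
        self.expect("(")
        names = [self.ident()]
        while self.look() == ",":   # look-ahead: a comma means another ident
            self.expect(",")
            names.append(self.ident())
        self.expect(")")            # otherwise the list must close
        return names

print(Parser(["(", "input", ",", "output", ")"]).params())
```

After each identifier the next token is either `,` or `)`; since one token always decides the fork, this fragment is LL(1).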
Generated parsers
• These parsers are usually LALR(1) or LL(1), but a few are LR(1)
• There are a number of these available
– YACC on UNIX systems, which is LALR(1)
• These read in some form of a grammar
• They generate a series of tables that is used by a parser
• The scanner is also generated and then plugged into the parser
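The table idea can be shown in miniature. Below is a handwritten LL(1) table for the toy grammar `E -> n R` and `R -> + n R | (empty)`; it is an illustration of what a generator emits, not the output of any real tool. A table entry maps (non-terminal, look-ahead token) to the production to use.

```python
# A tiny table-driven LL(1) parser for:  E -> n R    R -> + n R | (empty)
TABLE = {
    ("E", "n"): ["n", "R"],
    ("R", "+"): ["+", "n", "R"],
    ("R", "$"): [],              # on end of input, R derives empty
}
NONTERMINALS = {"E", "R"}

def parse(tokens):
    """Return True if the token list matches the grammar above."""
    stack = ["$", "E"]           # predicted symbols, top of stack at the end
    tokens = tokens + ["$"]      # "$" marks end of input
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top in NONTERMINALS:
            rule = TABLE.get((top, look))
            if rule is None:
                return False     # no table entry: syntax error
            stack.extend(reversed(rule))   # predict the production's RHS
        elif top == look:
            i += 1               # terminal matched: consume the token
        else:
            return False
    return i == len(tokens)

print(parse(["n", "+", "n"]))
print(parse(["n", "+"]))
```

The parser itself is a fixed loop; only the tables change from grammar to grammar, which is why generators can emit them mechanically.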
How do these work?
• The scanner reads in so many tokens
• The parser then looks for tokens that fit the pattern of a production
– That is, it looks for what is on the RHS
• When it finds the pattern it does a reduction
– A reduction is moving right to left across a production
– Replace the RHS with the LHS
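The shift and reduce steps described above can be sketched for the toy grammar `E -> E + n | n`. Real LR parsers decide between shifting and reducing from generated tables; here the two reductions are hard-coded purely for illustration.

```python
# A minimal shift-reduce sketch for the grammar:  E -> E "+" "n" | "n"
def parse(tokens):
    stack = []
    tokens = list(tokens)
    while tokens or stack != ["E"]:
        if stack[-3:] == ["E", "+", "n"]:
            stack[-3:] = ["E"]           # reduce: replace RHS with LHS
        elif stack[-1:] == ["n"]:
            stack[-1:] = ["E"]           # reduce: n is an E by itself
        elif tokens:
            stack.append(tokens.pop(0))  # shift the next token onto the stack
        else:
            return False                 # stuck: syntax error
    return True                          # reduced to the distinguished symbol

print(parse(["n", "+", "n"]))
print(parse(["n", "+"]))
```

Each reduction is where a real parser would call a semantic routine, as the next slide describes; here the reductions only rewrite the stack.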
More
• When a reduction is done, a semantic routine is usually called
• This routine may do any of the following:
– Check semantic things
• Is a variable defined?
– Update the symbol table
– Generate code
– Do the things not possible with BNF
• Eventually we should be able to reduce to the distinguished symbol; then we are done
Finally
• Lexical analysis implements a finite state automaton
• Parsing implements a push down automaton
– In recursive descent the stack is the run-time stack of function calls
– In bottom up parsing the stack contains tokens
• As Software Engineers we are usually interested in generated parsers for their ease of construction