410/510 1 of 31
Week 1 – Lecture 1
• Introduction
• The Textbook
• Assessment
• Overview
Compiler Construction
The Big Picture
• In this course we will be constructing a compiler!
• Moving from a High Level Language to a Low Level Language
• Compilers are complex programs
  – Typically more than 10,000 lines of code
• They integrate aspects from many different areas of CS
  – Formal language theory, algorithms, data structures, high-level and low-level languages (obviously), user interaction (error reporting)
What is a compiler?
• A specialization of a language translator
• Usually in CS:
  – the Source is a high-level programming language
  – the Target is machine code for a microprocessor
[Diagram: Source language L1 → Target language L2; for example, C → x86 processor]
Applications of Compiler Techniques
• Potential Source languages include:
  – Natural languages (English, French, …)
  – Circuit layout languages
  – Mark-up languages (HTML, XML, …)
  – Command line languages (SQL interface)
• Potential Target languages include:
  – Natural languages
  – Printer drivers
  – Mark-up languages
• e.g. HTML to RTF converter
  – Could involve many of the aspects we will cover in compiler construction
Compilers for Programming Languages
• If we had 1 compiler for each {Source,Target} pair then we would have a lot of compilers!
[Diagram: Source Languages → Compilers → Target Languages, one compiler per pair]
Source Languages: C, Prolog, Java, Lisp, Haskell, C++, C#, Fortran, Pascal, Sather
Target Languages: x86 (MMX), JVM, PowerPC 750 (G3), ARM, SPARC, AMD K6
Modularity for Code Generation
[Diagram: Compilers: Source → Intermediate Representation → x86, ARM, G4]
• Compiler portability (man gcc lists the different target machines)
Modularity for Source Languages?
[Diagram: Compilers: Sources (C, Java, Prolog) → Intermediate Representation → Targets]
• Typically compilers only compile one source language
  – but the techniques used are very similar and are shared across different compilers
Typical Compiler
[Diagram: Source → Front-end (Analysis) → Intermediate Representation → Back-end (Synthesis) → Target; course timeline marked on the diagram: now, week 6]
• The Intermediate Representation is independent of the Source and Target languages
• Ideally:
  – For a new Source language, we can add a new front-end to an existing back-end
  – For a new Target language, we can add a new back-end to an existing front-end
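A rough count shows why the shared Intermediate Representation matters. This is a sketch under an idealised assumption: every front-end and every back-end agree on one IR. The symbols m (number of source languages) and n (number of target languages) are introduced here for illustration; the values 10 and 6 are the languages and targets listed on the earlier "Compilers for Programming Languages" slide.

```latex
\[
\underbrace{m \times n}_{\text{one compiler per pair}} = 10 \times 6 = 60
\qquad\text{versus}\qquad
\underbrace{m + n}_{\text{front-ends} + \text{back-ends}} = 10 + 6 = 16
\]
```

Adding an eleventh source language then costs one new front-end rather than six new compilers.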
Front End
• Knowledge about the source language
  – Lexical structure (tokens)
  – Syntax
    • Programming constructs
      – Conditionals, iteration, etc.
  – Semantics
    • Type checking
• Error reporting
  – UI component
    • Often basic (and unhelpful!)
    • May vary if part of an IDE or standalone
[Diagram: Source program → Lexical analyser → Syntax analyser → Semantic analyser, alongside the Symbol table and Error handler]
Lexical Analysis
Lexical tasks the compiler has to perform:
• group together the 3 characters ‘max’ to form the single variable identifier max
• group together the 2 characters ‘<=’ to form the single relational operator <= (less than or equal to)
    int max = 20, x;
    read(x);
    if ( x <= max )
        print('ok');
    else
        print('too big');
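The grouping described above can be sketched as a tiny hand-written scanner. This is only an illustration, not the course's lexer: the names `Token`, `TokenKind` and `next_token` are hypothetical, keywords such as `if` are simply returned as identifiers, and anything that is not a name, number or `<`/`>` operator is treated as one-character punctuation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, just enough for the slide's example. */
typedef enum { TOK_IDENT, TOK_NUMBER, TOK_RELOP, TOK_PUNCT, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];   /* the characters grouped into this token */
} Token;

/* Read one token starting at *src and advance *src past it.
   Names or numbers longer than 31 characters are split (sketch only). */
Token next_token(const char **src)
{
    Token tok = { TOK_END, "" };
    const char *p = *src;
    size_t n = 0;

    while (isspace((unsigned char)*p))
        p++;                                   /* skip white space */

    if (*p == '\0') {
        /* end of input: tok is already TOK_END */
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok.kind = TOK_IDENT;                  /* 'm','a','x' become max */
        while ((isalnum((unsigned char)*p) || *p == '_')
               && n < sizeof tok.text - 1)
            tok.text[n++] = *p++;
    } else if (isdigit((unsigned char)*p)) {
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p) && n < sizeof tok.text - 1)
            tok.text[n++] = *p++;
    } else if (*p == '<' || *p == '>') {
        tok.kind = TOK_RELOP;
        tok.text[n++] = *p++;
        if (*p == '=')                         /* '<' then '=' become <= */
            tok.text[n++] = *p++;
    } else {
        tok.kind = TOK_PUNCT;                  /* ( ) ; , = and so on */
        tok.text[n++] = *p++;
    }
    tok.text[n] = '\0';
    *src = p;
    return tok;
}
```

Calling `next_token` repeatedly on the text `if ( x <= max )` yields the identifier `if`, the punctuation `(`, the identifier `x`, the relational operator `<=`, the identifier `max`, the punctuation `)`, and finally the end marker.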
Syntactic Analysis
• Recognise the if … then … else structure
• Group the x <= max into a single expression with a relational operator
• Recognise the format of the variable declaration list
  – Such that x is correctly declared to be an int
• Loops, program blocks (begin…end)
• Arithmetic expressions, etc.
Semantic analysis
• Check that x <= max is a sensible thing to do
  – If x was a boolean and max a string then we would have a type error
• Check that the ‘20’ is in fact an integer and so can be assigned to an int
• And also (can be split over several phases)
  – Keep a note of all the variables used, so that every use of a variable refers to the same value (in memory)
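The two checks above can be sketched as small functions over type tags. The `Type` enum, the function names, and the rule that `<=` accepts only two ints are assumptions made for this illustration; a real language would define its own type rules.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical type tags for a toy language. */
typedef enum { TYPE_INT, TYPE_BOOL, TYPE_STRING, TYPE_ERROR } Type;

/* Assumed rule: <= needs two int operands and yields a boolean.
   Anything else (e.g. boolean <= string) is reported as a type error. */
Type check_less_equal(Type left, Type right)
{
    if (left == TYPE_INT && right == TYPE_INT)
        return TYPE_BOOL;
    return TYPE_ERROR;
}

/* Assumed rule: a value may initialise a variable only if the types match. */
bool can_assign(Type variable, Type value)
{
    return variable != TYPE_ERROR && variable == value;
}
```

With these rules, `x <= max` is accepted when both operands are ints, and the literal `20` may initialise `max` because both are `TYPE_INT`.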
Data Structures
• Stream of text as the source file
• Group together text into larger units from a limited set
• Nearly all programming constructs can be represented as tree structures
[Diagram: an If statement node with children: if, Boolean expression, statement, else, statement; the Boolean expression has children: expression, Relational operator, expression]
Data Structures
• Lexical Analyzer → Stream of tokens (enumerated type)
  – NUMBER OPERATOR NUMBER
• Syntax Analyzer / Parser → Tree of program structure
[Diagram: program node with children: if_statement, assignment, while_loop, output_statement]
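One possible C representation of such a tree is a node with a kind, some text, and child pointers. The names `Node`, `NodeKind`, `new_node` and `build_example` are hypothetical, and the fixed limit of three children is a simplification chosen to fit the if-statement example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { NODE_IF, NODE_RELOP, NODE_IDENT, NODE_CALL } NodeKind;

typedef struct Node {
    NodeKind kind;
    const char *text;        /* operator or name, e.g. "<=" or "max" */
    struct Node *child[3];   /* unused slots are NULL */
} Node;

Node *new_node(NodeKind kind, const char *text, Node *a, Node *b, Node *c)
{
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        abort();             /* out of memory: give up (sketch only) */
    n->kind = kind;
    n->text = text;
    n->child[0] = a;
    n->child[1] = b;
    n->child[2] = c;
    return n;
}

/* Count the nodes in a tree; an empty tree has none. */
int count_nodes(const Node *n)
{
    int total = 1;
    if (n == NULL)
        return 0;
    for (int i = 0; i < 3; i++)
        total += count_nodes(n->child[i]);
    return total;
}

void free_tree(Node *n)
{
    if (n == NULL)
        return;
    for (int i = 0; i < 3; i++)
        free_tree(n->child[i]);
    free(n);
}

/* Tree for: if ( x <= max ) print('ok'); else print('too big'); */
Node *build_example(void)
{
    Node *x = new_node(NODE_IDENT, "x", NULL, NULL, NULL);
    Node *max = new_node(NODE_IDENT, "max", NULL, NULL, NULL);
    Node *cond = new_node(NODE_RELOP, "<=", x, max, NULL);
    Node *then_stmt = new_node(NODE_CALL, "print 'ok'", NULL, NULL, NULL);
    Node *else_stmt = new_node(NODE_CALL, "print 'too big'", NULL, NULL, NULL);
    return new_node(NODE_IF, "if", cond, then_stmt, else_stmt);
}
```

The example tree has six nodes: the if node, the relational operator with its two operands, and the two print statements.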
Back-end
• Knowledge about target processor / virtual machine
  – Instruction set
    • ‘costs’ of different:
      – op-codes
      – instructions
  – Registers
  – Memory
[Diagram: Semantic analyser → Intermediate code generator → Code optimiser → Code generator, alongside the Symbol table manager and Error handler]
Putting it together
[Diagram: Compiler: Source program → Lexical analyser → Syntax analyser → Semantic analyser → Intermediate code generator → Code optimiser → Code generator, alongside the Symbol table and Error handler]
A language-processing system:
Skeletal source program → preprocessor → Source program → compiler → Target assembly program → assembler → Relocatable machine code → loader / link-editor → Absolute machine code
Grammars
• We define/describe HL languages with grammars
• A Grammar consists of:
  – T, set of Terminals
  – N, set of Non-terminals
    • N ∩ T = ∅
  – P, set of Productions α → β
    • Where α and β are members of (T ∪ N)*
  – S, special member of N, the Start symbol
• G = {T, N, P, S}
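As a worked illustration of the four components, here is a small grammar for sums of identifiers. The grammar is an example chosen for this note, not one taken from the course; the non-terminals E and F and the terminal id are assumed names.

```latex
\[
\begin{aligned}
T &= \{\, \texttt{id},\ \texttt{+},\ \texttt{(},\ \texttt{)} \,\} \\
N &= \{\, E,\ F \,\} \\
P &= \{\, E \to E \mathbin{\texttt{+}} F,\quad E \to F,\quad
        F \to \texttt{(}\, E \,\texttt{)},\quad F \to \texttt{id} \,\} \\
S &= E
\end{aligned}
\]
\[
E \Rightarrow E \mathbin{\texttt{+}} F \Rightarrow F \mathbin{\texttt{+}} F
  \Rightarrow \texttt{id} \mathbin{\texttt{+}} F
  \Rightarrow \texttt{id} \mathbin{\texttt{+}} \texttt{id}
\]
```

The second line derives the sentence id + id from the start symbol by applying one production at each step.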
Chomsky’s Grammar Hierarchy
Type 3 Regular Grammar
Type 2 Context Free Grammar
Type 1 Context-Sensitive Grammar
Type 0 Unrestricted Grammar
Grammars
• Type 0 (unrestricted): α → β
  – α and β are unrestricted sequences, α is not null
  – languages formed from Type 0 grammars can be recognised by non-deterministic Turing machines
• Type 1 (context sensitive): αAβ → αBβ
  – A becomes B in the context of α … β
  – Complex for computer analysis
Grammars
• Type 2 (context free)
  – A → γ
    • A is a Non-terminal
    • γ is a member of (T ∪ N)* (can be empty)
  – Equivalent to a push-down automaton
• Type 3 (regular)
  – A → wB, A → w (right linear)
    • w is a string of Terminals
    • A and B are Non-Terminals
  – Finite state automata
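To make the link between right-linear grammars and finite state automata concrete, here is a sketch for identifiers. The grammar in the comment and the names `State`, `step` and `is_identifier` are assumptions for this example: each non-terminal becomes a state, and each production A → wB becomes a transition from A to B on w.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Right-linear grammar assumed for this sketch (l = letter, d = digit):
 *   S -> l | l R
 *   R -> l | d | l R | d R
 * STATE_DEAD has no grammar counterpart: it records that no production
 * could match the input seen so far. */
typedef enum { STATE_S, STATE_R, STATE_DEAD } State;

/* One transition of the automaton. */
State step(State s, char c)
{
    unsigned char u = (unsigned char)c;
    switch (s) {
    case STATE_S: return isalpha(u) ? STATE_R : STATE_DEAD;
    case STATE_R: return isalnum(u) ? STATE_R : STATE_DEAD;
    default:      return STATE_DEAD;
    }
}

/* Accept if the whole string drives the automaton from S into R. */
bool is_identifier(const char *text)
{
    State s = STATE_S;
    for (; *text != '\0'; text++)
        s = step(s, *text);
    return s == STATE_R;
}
```

The automaton needs no stack or other memory beyond its current state, which is why regular grammars are cheap enough to use for recognising tokens.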
In a compiler
• Use the minimum complexity grammars that let us successfully cope with HL programming languages (and process them efficiently)
• Regular grammars (= regular expressions) in the Lexical Analysis phase
  – ‘recognise the words’
• Context-free grammars in the Syntax Analysis phase
  – ‘recognise the phrases’
  – define our HLL as a grammar based on the output of the Lexical Analysis
• Deal with context sensitivity in the Semantic Analysis phase
Overall Front-End View
Source program (text file)
  → Lexical Analyser (regular grammar; Flex)
  → tokens
  → Syntax Analyser (context-free grammar; Bison)
  → tree structure
  → Semantic Analyser
  → type-safe tree structure
  → Intermediate Representation
  → tree / linearized tree
  → Back-end
The Textbook
Compilers: Principles, Techniques & Tools
Aho, Sethi & Ullman
Addison-Wesley
(‘The Dragon Book’)
Assessment
• Building a compiler for a new language
• Front-end
  – Lexical analysis
  – Parsing
• Back-end
  – Generating assembler code
• Some formal and some practical
  – Formal more at the front-end
Programming & Tools
• Lexical analysis generator – lex / flex
• Parser generator – yacc / bison
• C / C++
  – To implement the remainder of the compiler
• Unix environment
  – makefiles will be useful for coordinating lex and yacc
Instant Compilation
• Consider the program:
    main()
    { int a = 3; a = a + 1; }
• Given a reasonably sensible assembly language, a hand-compilation might be:
    LDA #3
    STA 1
    LDA 1
    ADD a, #1
    STA 1
& an Instant Compiler could look like …
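switch (source_code_construct) {
case INT_DEC:
    print("LDA #", INT.value);   /* load the initial value */
    print("STA 1");              /* store it at address 1 */
    break;
case INT_ADD:
    print("LDA 1");              /* reload the variable */
    print("ADD a, #", ADD.value);
    print("STA 1");              /* store the result back */
    break;
} /* end switch */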
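A compilable version of the same idea might look like the sketch below. The `Construct` struct stands in for the slide's `INT.value` and `ADD.value`, and `instant_compile` writes into a caller-supplied buffer instead of printing; both are assumptions made so the sketch is self-contained. The assembly mnemonics are copied from the slide.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { INT_DEC, INT_ADD } ConstructKind;

typedef struct {
    ConstructKind kind;
    int value;   /* initial value for INT_DEC, addend for INT_ADD */
} Construct;

/* Write the assembly for one construct into buf.
   Returns what snprintf returns, or -1 for an unknown construct. */
int instant_compile(Construct c, char *buf, size_t size)
{
    switch (c.kind) {
    case INT_DEC:
        return snprintf(buf, size, "LDA #%d\nSTA 1\n", c.value);
    case INT_ADD:
        return snprintf(buf, size, "LDA 1\nADD a, #%d\nSTA 1\n", c.value);
    }
    return -1;
}
```

Compiling `{ INT_DEC, 3 }` followed by `{ INT_ADD, 1 }` reproduces the five instructions of the hand-compilation, with address 1 still hard-wired.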
The Problems ….
• Not efficient (the whole program could be compiled to LDA #4; STA 1)
• Only works for 1 variable
• Only works at one location in memory
  – (usually let assembler deal with symbolic addresses)
• Only has 2 programming constructs!
• Not even slightly portable:
  – 1 instruction set & 1 source language
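The single-variable, single-address limitation is what a symbol table removes. A minimal sketch, assuming a fixed-size table and addresses handed out in order of first use starting at 1; the names `SymbolTable` and `address_of` are hypothetical, and a real compiler would more likely emit symbolic addresses and leave the numbering to the assembler.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMBOLS 64

typedef struct {
    const char *name[MAX_SYMBOLS];   /* pointers are stored, not copied */
    int count;
} SymbolTable;

/* Return the address of a variable, allocating the next free address
   the first time the name is seen. Returns -1 if the table is full. */
int address_of(SymbolTable *table, const char *name)
{
    for (int i = 0; i < table->count; i++)
        if (strcmp(table->name[i], name) == 0)
            return i + 1;            /* already known */
    if (table->count == MAX_SYMBOLS)
        return -1;
    table->name[table->count++] = name;
    return table->count;             /* addresses start at 1 */
}
```

Looking up `a`, then `b`, then `a` again gives addresses 1, 2 and 1, so every use of a variable maps to the same location.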
More problems…
• No error reporting
  – type checking?
• Assumes:
  – Program is correct
  – Recognition of programming language constructs
    • int a = 3 → INT_DEC
  – Access to values
    • INT.value, ADD.value
  – 1:1 relationship between integers and memory locations
Solutions
• We can view compilers as a solution to all of these problems
• E.g.
  – Only compile correct programs to object code
  – Recognise all constructs in the language
  – Improve the efficiency of code
    • Execution speed
    • Memory usage
  – Meaningful error messages to the user
  – Cope with different target architectures
Why are compilers called compilers?
• In early compilers one of the main tasks was connecting the object program to
  – standard library functions, I/O devices
    • collecting information from different sources (e.g. libraries)
  – OS and processor dependent
• This is now performed by ‘linkers’
• Compile – ‘construct by collecting from different sources’