Compilation 0368-3133 Lecture 1: Introduction Lexical Analysis Noam Rinetzky 1

Jan 02, 2016


Theodora Ryan
Page 1: Compilation 0368-3133 Lecture 1: Introduction Lexical Analysis Noam Rinetzky 1.

Compilation 0368-3133

Lecture 1: Introduction

Lexical Analysis

Noam Rinetzky


Admin

• Lecturer: Noam Rinetzky
  – [email protected]
  – http://www.cs.tau.ac.il/~maon
• T.A.: Orr Tamir
• Textbooks:
  – Modern Compiler Design
  – Compilers: Principles, Techniques, and Tools
• Compiler project: 40%
  – 4.5 practical exercises
  – Groups of 3
• 1 theoretical exercise: 10%
  – Groups of 1
• Final exam: 50% (must pass)

Course Goals

• What is a compiler
• How does it work
• (Reusable) techniques & tools
• Programming language implementation
  – Runtime systems
• Execution environments
  – Assembly, linkers, loaders, OS


Introduction

Compilers: principles, techniques and tools

What is a Compiler?

“A compiler is a computer program that transforms source code written in a programming language (source language) into another language (target language).

The most common reason for wanting to transform source code is to create an executable program.”

--Wikipedia

[Figure: source text (txt, source language) → Compiler → executable code (exe, target language)]

Example: the compiler translates

    int a, b;
    a = 2;
    b = a*2 + 1;

into

    MOV R1,2
    SAL R1
    INC R1
    MOV R2,R1

What is a Compiler?

• Source languages: C, C++, Pascal, Java, PostScript, TeX, Perl, JavaScript, Python, Ruby, Prolog, Lisp, Scheme, ML, OCaml
• Target languages: IA32, IA64, SPARC, C, C++, Pascal, Java, Java Bytecode, …

High Level Programming Languages

• Imperative: Algol, PL1, Fortran, Pascal, Ada, Modula, C
  – Closely related to "von Neumann" computers
• Object-oriented: Simula, Smalltalk, Modula3, C++, Java, C#, Python
  – Data abstraction and 'evolutionary' form of program development
    • Class: an implementation of an abstract data type (data + code)
    • Objects: instances of a class
    • Inheritance + generics
• Functional: Lisp, Scheme, ML, Miranda, Hope, Haskell, OCaml, F#
• Logic programming: Prolog

More Languages

• Hardware description languages: VHDL
  – The program describes hardware components
  – The compiler generates hardware layouts
• Graphics and text processing: TeX, LaTeX, PostScript
  – The compiler generates page layouts
• Scripting languages: Shell, C-shell, Perl
  – Include primitive constructs from the current software environment
• Web/Internet: HTML, Telescript, Java, JavaScript
• Intermediate languages: Java bytecode, IDL

High Level Prog. Lang., Why?

Compiler vs. Interpreter

Compiler

• A program which transforms programs
• Input: a program (P)
• Output: an object program (O)
  – For any x, O(x) = P(x)

[Figure: source text P (txt) → Compiler → executable code O (exe)]

Compiling C to Assembly

The compiler translates

    int x;
    scanf("%d", &x);
    x = x + 1;
    printf("%d", x);

into

    add %fp,-8, %l1
    mov %l1, %o1
    call scanf
    ld [%fp-8],%l0
    add %l0,1,%l0
    st %l0,[%fp-8]
    ld [%fp-8], %l1
    mov %l1, %o1
    call printf

Running the compiled program on input 5 produces output 6.

Interpreter

• A program which executes a program
• Input: a program (P) + its input (x)
• Output: the computed output (P(x))

[Figure: source text (txt) and input → Interpreter → output]

Interpreting (running) .py programs

[Figure: the program below and the input 5 are fed to the Interpreter, which outputs 6]

    int x;
    scanf("%d", &x);
    x = x + 1;
    printf("%d", x);

Compiler vs. Interpreter

[Figure: Compiler: source code → (preprocessing) → executable code → (processing) → machine. Interpreter: source code → (preprocessing) → intermediate code → (processing) → interpreter.]

Compiled programs are usually more efficient than interpreted programs

The compiler translates

    scanf("%d", &x);
    y = 5;
    z = 7;
    x = x + y * z;
    printf("%d", x);

into

    add %fp,-8, %l1
    mov %l1, %o1
    call scanf
    mov 5, %l0
    st %l0,[%fp-12]
    mov 7,%l0
    st %l0,[%fp-16]
    ld [%fp-8], %l0
    ld [%fp-8],%l0
    add %l0, 35 ,%l0
    st %l0,[%fp-8]
    ld [%fp-8], %l1
    mov %l1, %o1
    call printf

Note that the product y * z has already been computed at compile time: the generated code adds the constant 35.

Compilers report input-independent possible errors

• Input program:

    scanf("%d", &y);
    if (y < 0)
        x = 5;
    ...
    if (y <= 0)
        z = x + 1;

• Compiler output:
  – "line 88: x may be used before set"

Interpreters report input-specific definite errors

• Input program:

    scanf("%d", &y);
    if (y < 0)
        x = 5;
    ...
    if (y <= 0)
        z = x + 1;

• Input data:
  – y = -1: x is set before it is used, so no error is reported
  – y = 0: x is used before it is set, so the error is reported when it occurs

Interpreter vs. Compiler

Interpreter:
• Conceptually simpler
  – "Defines" the prog. lang.
• Can provide more specific error reports
• Easier to port
• Faster response time
• [More secure]

Compiler:
• How do we know the translation is correct?
• Can report errors before input is given
• More efficient code
  – Compilation can be expensive
  – Moves computations to compile-time
• compile-time + execution-time < interpretation-time is possible

Concluding Remarks

• Both compilers and interpreters are programs written in a high-level language

• Compilers and interpreters share functionality

• In this course we focus on compilers

Ex 0: A Simple Interpreter

Toy compiler/interpreter

• Trivial programming language
• Stack machine
• Compiler/interpreter written in C
• Demonstrates the basic steps
• Textbook: Modern Compiler Design, Section 1.2

Conceptual Structure of a Compiler

[Figure: source text (txt) → Compiler → executable code (exe). Inside the compiler: Frontend (analysis) → Semantic Representation → Backend (synthesis). Phases: Lexical Analysis → Syntax Analysis (Parsing) → Semantic Analysis → Intermediate Representation (IR) → Code Generation.]

Structure of toy Compiler / interpreter

[Figure: source text (txt) → Frontend (analysis) → Semantic Representation → Backend (synthesis) → executable code (exe), or Execution Engine → output*. Phases: Lexical Analysis → Syntax Analysis (Parsing) → Semantic Analysis (NOP) → Intermediate Representation (AST) → Code Generation / Execution Engine.]

* Programs in our PL do not take input

Source Language

• Fully parenthesized expressions
• Arguments can be a single digit

    ✓ (4 + (3 * 9))
    ✗ 3 + 4 + 5
    ✗ (12 + 3)

    expression → digit | '(' expression operator expression ')'
    operator   → '+' | '*'
    digit      → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

The abstract syntax tree (AST)

• Intermediate program representation
• Defines a tree
  – Preserves program hierarchy
• Generated by the parser
• Keywords and punctuation symbols are not stored
  – Not relevant once the tree exists

Concrete syntax tree (parse tree) for 5*(a+b)

[Figure: the root expression has children number '5', '*', and expression. That expression has children '(', expression, ')'. The innermost expression has children identifier 'a', '+', and identifier 'b'.]

Abstract Syntax tree for 5*(a+b)

[Figure: '*' with children '5' and '+'; '+' has children 'a' and 'b'.]

Annotated Abstract Syntax tree

[Figure: the same tree with annotations. '*': type real, loc reg1. '+': type real, loc reg2. 'a': type real, loc sp+8. 'b': type real, loc sp+24. '5': type integer.]

Driver for the toy compiler/interpreter

    #include "parser.h"     /* for type AST_node */
    #include "backend.h"    /* for Process() */
    #include "error.h"      /* for Error() */

    int main(void) {
        AST_node *icode;

        if (!Parse_program(&icode)) Error("No top-level expression");
        Process(icode);

        return 0;
    }


Lexical Analysis

• Partitions the input into tokens
  – DIGIT
  – EOF
  – '*'
  – '+'
  – '('
  – ')'
• Each token has its representation
• Ignores whitespace

lex.h: Header File for Lexical Analysis

    /* Define class constants */
    /* Integers are used to encode characters + special codes */
    /* Values 0-255 are reserved for ASCII characters */
    #define EoF   256
    #define DIGIT 257

    typedef struct {
        int class;
        char repr;
    } Token_type;

    extern Token_type Token;
    extern void get_next_token(void);

Lexical Analyzer

    #include <stdio.h>      /* for getchar() */
    #include "lex.h"

    Token_type Token;       /* global variable */

    /* Returns 1 for layout (whitespace) characters, 0 otherwise. */
    static int Layout_char(int ch) {
        switch (ch) {
        case ' ': case '\t': case '\n': return 1;
        default: return 0;
        }
    }

    void get_next_token(void) {
        int ch;

        /* skip layout characters; stop at end of input */
        do {
            ch = getchar();
            if (ch < 0) {
                Token.class = EoF; Token.repr = '#';
                return;
            }
        } while (Layout_char(ch));

        if ('0' <= ch && ch <= '9') { Token.class = DIGIT; }
        else { Token.class = ch; }
        Token.repr = ch;
    }


Parser

• Invokes the lexical analyzer
• Reports syntax errors
• Constructs the AST

Parser Header File

    typedef int Operator;

    typedef struct _expression {
        char type;                          /* 'D' or 'P' */
        int value;                          /* for 'D' type expression */
        struct _expression *left, *right;   /* for 'P' type expression */
        Operator oper;                      /* for 'P' type expression */
    } Expression;

    typedef Expression AST_node;            /* the top node is an Expression */

    extern int Parse_program(AST_node **);

AST for (2 * ((3*4)+9))

[Figure: every node shows its type; 'P' nodes also show oper, left, and right. The root is P(*) with left child D(2) and right child P(+). P(+) has left child P(*) and right child D(9). The inner P(*) has children D(3) and D(4).]


Parser Environment

    #include <stdlib.h>     /* for malloc() and free() */
    #include "lex.h"
    #include "error.h"
    #include "parser.h"

    static Expression *new_expression(void) {
        return (Expression *)malloc(sizeof (Expression));
    }

    /* Not shown on the slide; Parse_expression() calls it on failure. */
    static void free_expression(Expression *expr) {
        free((void *)expr);
    }

    static int Parse_operator(Operator *oper_p);
    static int Parse_expression(Expression **expr_p);

    int Parse_program(AST_node **icode_p) {
        Expression *expr;

        get_next_token();       /* start the lexical analyzer */
        if (Parse_expression(&expr)) {
            if (Token.class != EoF) {
                Error("Garbage after end of program");
            }
            *icode_p = expr;
            return 1;
        }
        return 0;
    }

Top-Down Parsing

• Optimistically build the tree from the root to the leaves
• For every rule P → A1 A2 … An | B1 B2 … Bm
  – If A1 succeeds
    • If A2 succeeds & A3 succeeds & …
    • Else fail
  – Else if B1 succeeds
    • If B2 succeeds & B3 succeeds & …
    • Else fail
  – Else fail
• Recursive descent parsing
  – Simplified: no backtracking
• Can be applied to certain grammars

Parser

    static int Parse_expression(Expression **expr_p) {
        Expression *expr = *expr_p = new_expression();

        /* first alternative: a single digit */
        if (Token.class == DIGIT) {
            expr->type = 'D';
            expr->value = Token.repr - '0';
            get_next_token();
            return 1;
        }

        /* second alternative: '(' expression operator expression ')' */
        if (Token.class == '(') {
            expr->type = 'P';
            get_next_token();
            if (!Parse_expression(&expr->left)) { Error("Missing expression"); }
            if (!Parse_operator(&expr->oper)) { Error("Missing operator"); }
            if (!Parse_expression(&expr->right)) { Error("Missing expression"); }
            if (Token.class != ')') { Error("Missing )"); }
            get_next_token();
            return 1;
        }

        /* failed on both attempts */
        free_expression(expr);
        return 0;
    }

    static int Parse_operator(Operator *oper) {
        if (Token.class == '+') { *oper = '+'; get_next_token(); return 1; }
        if (Token.class == '*') { *oper = '*'; get_next_token(); return 1; }
        return 0;
    }


Semantic Analysis

• Trivial in our case
• No identifiers
• No procedures / functions
• A single type for all expressions

Intermediate Representation

[Figure: the AST for (2 * ((3*4)+9)) serves as the intermediate representation. The root is P(*) with children D(2) and P(+); P(+) has children P(*) and D(9); the inner P(*) has children D(3) and D(4).]

Alternative IR: 3-Address Code

    L1:
    _t0 = a
    _t1 = b
    _t2 = _t0 * _t1
    _t3 = d
    _t4 = _t2 - _t3
    GOTO L1

"Simple Basic-like programming language"
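As a rough illustration of how such a flat IR could be produced from the toy AST used in this lecture, here is a minimal sketch. The function name Gen_tac, the counter parameter, and the output stream parameter are assumptions made for this example and are not part of the course code; the Expression type is repeated from the parser header so that the block stands alone.

```c
#include <assert.h>
#include <stdio.h>

/* Same layout as the Expression type in the parser header file. */
typedef struct _expression {
    char type;                          /* 'D' (digit) or 'P' (parenthesized) */
    int value;                          /* for 'D' nodes */
    struct _expression *left, *right;   /* for 'P' nodes */
    int oper;                           /* for 'P' nodes: '+' or '*' */
} Expression;

/* Hypothetical helper: writes three-address code for expr to out and
 * returns the number of the temporary that holds its value.
 * *next_temp counts the temporaries allocated so far. */
int Gen_tac(const Expression *expr, int *next_temp, FILE *out) {
    if (expr->type == 'D') {
        int t = (*next_temp)++;
        fprintf(out, "_t%d = %d\n", t, expr->value);
        return t;
    }
    assert(expr->type == 'P');
    /* Operands first (post-order), then the operation itself. */
    int l = Gen_tac(expr->left, next_temp, out);
    int r = Gen_tac(expr->right, next_temp, out);
    int t = (*next_temp)++;
    fprintf(out, "_t%d = _t%d %c _t%d\n", t, l, expr->oper, r);
    return t;
}
```

For the AST of (2*((3*4)+9)), calling Gen_tac with a counter that starts at 0 would allocate seven temporaries, with the final result in _t6. Each instruction names at most two operands and one result, which is what makes the form "three-address".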


Code generation

• Stack-based machine
• Four instructions
  – PUSH n
  – ADD
  – MULT
  – PRINT

Code generation

    #include <stdio.h>      /* for printf() */
    #include "parser.h"
    #include "backend.h"

    /* Post-order walk: emit code for both operands, then the operator. */
    static void Code_gen_expression(Expression *expr) {
        switch (expr->type) {
        case 'D':
            printf("PUSH %d\n", expr->value);
            break;
        case 'P':
            Code_gen_expression(expr->left);
            Code_gen_expression(expr->right);
            switch (expr->oper) {
            case '+': printf("ADD\n"); break;
            case '*': printf("MULT\n"); break;
            }
            break;
        }
    }

    void Process(AST_node *icode) {
        Code_gen_expression(icode);
        printf("PRINT\n");
    }

Compiling (2*((3*4)+9))

    PUSH 2
    PUSH 3
    PUSH 4
    MULT
    PUSH 9
    ADD
    MULT
    PRINT

Executing Compiled Program

Generated Code Execution

Stack is the stack before the instruction executes, Stack' the stack after it. The top of the stack is listed first.

    Instruction   Stack      Stack'
    PUSH 2        (empty)    2
    PUSH 3        2          3 2
    PUSH 4        3 2        4 3 2
    MULT          4 3 2      12 2
    PUSH 9        12 2       9 12 2
    ADD           9 12 2     21 2
    MULT          21 2       42
    PRINT         42         (empty)
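The slides trace the generated code by hand; an execution engine for the four-instruction stack machine could look roughly like the sketch below. The Opcode and Instruction types, the fixed stack size, and the name Execute are assumptions introduced for this example. The slides only name the four mnemonics and do not show an engine.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical instruction encoding for PUSH n, ADD, MULT, PRINT. */
typedef enum { OP_PUSH, OP_ADD, OP_MULT, OP_PRINT } Opcode;

typedef struct {
    Opcode op;
    int arg;                /* used by OP_PUSH only */
} Instruction;

#define STACK_MAX 64

/* Runs n instructions; returns the last value printed (0 if none). */
int Execute(const Instruction *code, int n) {
    int stack[STACK_MAX];
    int sp = 0;             /* index of the next free slot */
    int last_printed = 0;

    for (int pc = 0; pc < n; pc++) {
        switch (code[pc].op) {
        case OP_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = code[pc].arg;
            break;
        case OP_ADD:        /* pop two operands, push their sum */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] = stack[sp - 1] + stack[sp];
            break;
        case OP_MULT:       /* pop two operands, push their product */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] = stack[sp - 1] * stack[sp];
            break;
        case OP_PRINT:      /* pop the top of the stack and print it */
            assert(sp >= 1);
            last_printed = stack[--sp];
            printf("%d\n", last_printed);
            break;
        }
    }
    return last_printed;
}

/* The code generated for (2*((3*4)+9)). */
const Instruction example_program[] = {
    {OP_PUSH, 2}, {OP_PUSH, 3}, {OP_PUSH, 4}, {OP_MULT, 0},
    {OP_PUSH, 9}, {OP_ADD, 0}, {OP_MULT, 0}, {OP_PRINT, 0},
};
```

Calling Execute(example_program, 8) walks through the same stack states as the trace and prints 42.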

Shortcuts

• Avoid generating machine code
• Use local assembler
• Generate C code


Interpretation

• Bottom-up evaluation of expressions
• The same interface as the compiler

Interpreter

    #include <stdio.h>      /* for printf() */
    #include "parser.h"
    #include "backend.h"

    static int Interpret_expression(Expression *expr) {
        switch (expr->type) {
        case 'D':
            return expr->value;
        case 'P': {
            /* evaluate both operands first, then apply the operator */
            int e_left = Interpret_expression(expr->left);
            int e_right = Interpret_expression(expr->right);
            switch (expr->oper) {
            case '+': return e_left + e_right;
            case '*': return e_left * e_right;
            }
            break;
        }
        }
        return 0;   /* not reached for a well-formed AST */
    }

    void Process(AST_node *icode) {
        printf("%d\n", Interpret_expression(icode));
    }

Interpreting (2*((3*4)+9))

[Figure: the AST for (2*((3*4)+9)) is evaluated bottom-up: 3*4 = 12, then 12+9 = 21, then 2*21 = 42.]

Summary: Journey inside a compiler

[Figure: pipeline Lexical Analysis → Syntax Analysis → Sem. Analysis → Inter. Rep. → Code Gen.]

Source text (txt):

    x = b*b – 4*a*c

Token Stream:

    <ID,"x"> <EQ> <ID,"b"> <MULT> <ID,"b"> <MINUS> <INT,4> <MULT> <ID,"a"> <MULT> <ID,"c">
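The lexical-analysis step for this example can be sketched as a small hand-written scanner. The type names, the Next_token function, and the 31-character lexeme limit are assumptions made for this illustration. In the course itself the scanner is generated by a tool from regular expressions rather than written by hand.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { T_ID, T_INT, T_EQ, T_MULT, T_MINUS, T_EOF, T_ERROR } TokenClass;

typedef struct {
    TokenClass class;
    char text[32];      /* lexeme for T_ID and T_INT; empty otherwise */
} Token;

/* Scans one token starting at *src and advances *src past it.
 * Lexemes longer than 31 characters are split (a simplification). */
Token Next_token(const char **src) {
    Token tok = { T_EOF, "" };
    const char *p = *src;
    size_t len = 0;

    while (isspace((unsigned char)*p)) p++;         /* skip layout */

    if (*p == '\0') {
        tok.class = T_EOF;
    } else if (isalpha((unsigned char)*p)) {        /* [A-Za-z][A-Za-z0-9]* */
        while (isalnum((unsigned char)*p) && len < sizeof tok.text - 1)
            tok.text[len++] = *p++;
        tok.class = T_ID;
    } else if (isdigit((unsigned char)*p)) {        /* [0-9]+ */
        while (isdigit((unsigned char)*p) && len < sizeof tok.text - 1)
            tok.text[len++] = *p++;
        tok.class = T_INT;
    } else {
        switch (*p++) {                             /* single-character tokens */
        case '=': tok.class = T_EQ; break;
        case '*': tok.class = T_MULT; break;
        case '-': tok.class = T_MINUS; break;
        default:  tok.class = T_ERROR; break;
        }
    }
    tok.text[len] = '\0';
    *src = p;
    return tok;
}
```

Calling Next_token repeatedly on the string "x = b*b - 4*a*c" (with an ASCII minus sign) yields the eleven tokens listed above, followed by T_EOF.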

Summary: Journey inside a compiler

Syntax Tree:

[Figure: syntax tree built from the token stream <ID,"x"> <EQ> <ID,"b"> <MULT> <ID,"b"> <MINUS> <INT,4> <MULT> <ID,"a"> <MULT> <ID,"c">. The root Statement has children ID 'x', EQ, and an expression. That expression has the form expression MINUS term; each product is parsed into term MULT factor nodes, with leaves 'b', 'b', '4', 'a', and 'c'.]

Summary: Journey inside a compiler

Abstract Syntax Tree:

[Figure: MINUS with two children. The left child is MULT('b', 'b'). The right child is MULT(MULT('4', 'a'), 'c').]

Annotated Abstract Syntax Tree:

[Figure: the same tree; every node has type int. Locations: 'a' at sp+8, 'b' at sp+16 (both occurrences), 'c' at sp+24, '4' is a constant. 4*a and (4*a)*c are held in R2; b*b and the MINUS result are held in R1.]

Intermediate Representation:

    R2 = 4*a
    R1 = b*b
    R2 = R2*c
    R1 = R1-R2

Assembly Code:

    MOV R2,(sp+8)
    SAL R2,2
    MOV R1,(sp+16)
    MUL R1,(sp+16)
    MUL R2,(sp+24)
    SUB R1,R2

Error Checking

• In every stage…
  – Lexical analysis: illegal tokens
  – Syntax analysis: illegal syntax
  – Semantic analysis: incompatible types, undefined variables, …
• Every phase tries to recover and proceed with compilation (why?)
  – Divergence is a challenge

The Real Anatomy of a Compiler

[Figure: source text (txt) → Process text input → characters → Lexical Analysis → tokens → Syntax Analysis → AST → Sem. Analysis → Annotated AST → Intermediate code generation → IR → Intermediate code optimization → IR → Code generation → Symbolic Instructions (SI) → Target code optimization → SI → Machine code generation → MI → Write executable output → executable code (exe)]

Optimizations

• "Optimal code" is out of reach
  – Many problems are undecidable or too expensive (NP-complete)
  – Use approximation and/or heuristics
• Loop optimizations: hoisting, unrolling, …
• Peephole optimizations
• Constant propagation
  – Leverage compile-time information to save work at runtime (pre-computation)
• Dead code elimination
  – Space
• …
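To make the idea of pre-computation concrete, here is a minimal sketch of constant folding, the evaluation step that becomes possible once constants have been propagated into an expression. The Node type and the Fold function are hypothetical and much simpler than a real optimizer; the toy language of this lecture has no variables, so the sketch adds a variable node kind for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { N_CONST, N_VAR, N_BINARY } NodeKind;

/* Hypothetical expression node with constants, variables, and binary operators. */
typedef struct Node {
    NodeKind kind;
    int value;                  /* N_CONST */
    char name;                  /* N_VAR */
    char oper;                  /* N_BINARY: '+' or '*' */
    struct Node *left, *right;  /* N_BINARY */
} Node;

/* Folds constant subtrees in place and returns the node.
 * Child nodes are not freed; the caller owns their storage. */
Node *Fold(Node *n) {
    if (n->kind != N_BINARY) return n;

    n->left = Fold(n->left);
    n->right = Fold(n->right);

    if (n->left->kind == N_CONST && n->right->kind == N_CONST) {
        int l = n->left->value, r = n->right->value;
        assert(n->oper == '+' || n->oper == '*');
        /* Both operands are known at compile time: compute the result now. */
        n->value = (n->oper == '+') ? l + r : l * r;
        n->kind = N_CONST;
        n->left = n->right = NULL;
    }
    return n;
}
```

Applied to the tree for x + 5 * 7, Fold rewrites the right operand into the constant 35 and leaves the addition in place, because x is only known at runtime. This mirrors the earlier compiled example, where the generated code adds the constant 35.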

Machine code generation

• Register allocation
  – Optimal register assignment is NP-complete
  – In practice, known heuristics perform well
• Assign variables to memory locations
• Instruction selection
  – Convert IR to actual machine instructions
• Modern architectures
  – Multicores
  – Challenging memory hierarchies
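One family of register-allocation heuristics treats the problem as graph coloring: variables that are live at the same time must not share a register. The sketch below shows a simple greedy coloring under assumed names (Color_greedy, MAX_VARS, the interference matrix). It is an illustration of the heuristic idea only, not the allocator of any particular compiler, and it reports failure where a real allocator would spill a variable to memory.

```c
#include <assert.h>

#define MAX_VARS 16

/* Greedy coloring of an interference graph.
 * interferes[i][j] != 0 means variables i and j are live at the same time.
 * Each variable gets the lowest-numbered register not used by an
 * already-colored neighbor. Returns the number of registers used,
 * or -1 if more than num_regs registers would be needed. */
int Color_greedy(int n, int interferes[MAX_VARS][MAX_VARS],
                 int num_regs, int reg_of[MAX_VARS]) {
    assert(n <= MAX_VARS && num_regs <= MAX_VARS);
    int used = 0;

    for (int v = 0; v < n; v++) {
        int taken[MAX_VARS] = {0};

        /* Mark the registers held by neighbors colored so far. */
        for (int u = 0; u < v; u++)
            if (interferes[v][u]) taken[reg_of[u]] = 1;

        int r = 0;
        while (r < num_regs && taken[r]) r++;
        if (r == num_regs) return -1;   /* out of registers */

        reg_of[v] = r;
        if (r + 1 > used) used = r + 1;
    }
    return used;
}
```

The result depends on the order in which variables are visited, which is one reason practical allocators add ordering heuristics on top of this basic scheme.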

And on a More General Note

To Compilers, and Beyond …

• Compiler construction is successful
  – Clear problem
  – Proper structure of the solution
  – Judicious use of formalisms
• Wider application
  – Many conversions can be viewed as compilation
• Useful algorithms


Judicious use of formalisms

• Regular expressions (lexical analysis)
• Context-free grammars (syntactic analysis)
• Attribute grammars (context analysis)
• Code generator generators (dynamic programming)
• But also some nitty-gritty programming

Use of program-generating tools

• Parts of the compiler are automatically generated from specification

[Figure: regular expressions → JLex → scanner; input program → scanner → stream of tokens]

[Figure: context-free grammar → Jcup → parser; stream of tokens → parser → syntax tree]

• Simpler compiler construction
  – Less error prone
  – More flexible
• Use of pre-canned tailored code
  – Use of dirty program tricks
• Reuse of specification

[Figure: specification → tool → (generated) code; input → (generated) code → output]

Page 100: Compilation 0368-3133 Lecture 1: Introduction Lexical Analysis Noam Rinetzky 1.

Compiler Construction Toolset

• Lexical analysis generators
– Lex, JLex
• Parser generators
– Yacc, Jcup
• Syntax-directed translators
• Dataflow analysis engines

Wide applicability

• Structured data can be expressed using context-free grammars
– HTML files
– PostScript
– TeX/dvi files
– …

Generally useful algorithms

• Parser generators
• Garbage collection
• Dynamic programming
• Graph coloring


How to write a compiler?

[Figure: three steps, built up over successive slides.
1. L2 compiler source (txt, written in L1) → L1 Compiler → executable L2 compiler (exe)
2. Program source (txt, written in L2) → L2 Compiler (the executable from step 1) → executable program (exe)
3. Input X → Program (the executable from step 2) → Output Y]


Bootstrapping a compiler

[Figure: three steps.
1. Simple L2 compiler source (txt, written in L1) → L1 Compiler → simple L2 executable compiler (exe)
2. Advanced L2 compiler source (txt, written in L2) → L2s Compiler (the simple executable from step 1) → inefficient advanced L2 executable compiler (exe)
3. Advanced L2 compiler source (X) → L2 Compiler (the inefficient executable from step 2) → efficient advanced L2 executable compiler (Y)]

Proper Design

• Simplify the compilation phase
– Portability of the compiler frontend
– Reusability of the compiler backend
• Professional compilers are integrated

[Figure: two diagrams. In the first, each source language (Java, C, Pascal, C++, ML) is connected directly to each target (Pentium, MIPS, Sparc). In the second, the same languages and targets are connected through a single IR]

Modularity

[Figure: frontends SL1, SL2, SL3 each translate a source language (txt) into a shared Semantic Representation; backends TL1, TL2, TL3 each translate that representation into an executable (exe) for one target]

Example source program:
    int a, b;
    a = 2;
    b = a*2 + 1;

Generated code for one target:
    MOV R1,2
    SAL R1
    INC R1
    MOV R2,R1

Generated code for another target:
    SET R1,2
    STORE #0,R1
    SHIFT R1,1
    STORE #1,R1
    ADD R1,1
    STORE #2,R1


Lexical Analysis

Modern Compiler Design: Chapter 2.1

Conceptual Structure of a Compiler

[Figure: Source text (txt) → Compiler: Frontend → Semantic Representation → Backend → Executable code (exe). Phases, in order: Lexical Analysis → Syntax Analysis (Parsing) → Semantic Analysis → Intermediate Representation (IR) → Code Generation. Lexical analysis is annotated "words", syntax analysis "sentences"]

What does Lexical Analysis do?

• Language: fully parenthesized expressions

Context-free language:
    Expr → Num | LP Expr Op Expr RP

Regular languages:
    Num → Dig | Dig Num
    Dig → ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’
    LP → ‘(’
    RP → ‘)’
    Op → ‘+’ | ‘*’

Input:  (   (   23   +   7    )   *   19   )
Kind:   LP  LP  Num  Op  Num  RP  Op  Num  RP

Each element of the output is a token, made of a Kind and a Value.

What does Lexical Analysis do?

• Partitions the input into a stream of tokens
– Numbers
– Identifiers
– Keywords
– Punctuation
• Tokens are usually represented as (kind, value) pairs
– (Num, 23)
– (Op, ‘*’)
• A token is a “word” in the source language
• A token is “meaningful” to the syntactical analysis
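The partitioning described above can be sketched as a small hand-written scanner for the fully parenthesized expression language. This is only an illustration under stated assumptions: the function name `scan` and the choice to store a number's value as a Python `int` are not from the lecture.

```python
def scan(text):
    """Partition text into (kind, value) tokens; whitespace is skipped."""
    digits = "0123456789"
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                      # consumed, never sent to the parser
        elif ch in digits:
            start = i
            while i < len(text) and text[i] in digits:
                i += 1                  # Num -> Dig | Dig Num
            tokens.append(("Num", int(text[start:i])))
        elif ch == "(":
            tokens.append(("LP", ch))
            i += 1
        elif ch == ")":
            tokens.append(("RP", ch))
            i += 1
        elif ch in "+*":
            tokens.append(("Op", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens


if __name__ == "__main__":
    for token in scan("( ( 23 + 7 ) * 19 )"):
        print(token)
```

The scanner only checks that each "word" is well formed; whether the parentheses nest correctly is left to the parser.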

From scanning to parsing

[Figure: program text ((23 + 7) * x) → Lexical Analyzer → token stream LP LP Num OP Num RP OP Id RP → Parser → either a syntax error or, if valid, an Abstract Syntax Tree: Op(*) with children Op(+) and Id(?), where Op(+) has children Num(23) and Num(7)]

Grammar:
    Expr → ... | Id
    Id → ‘a’ | ... | ‘z’

Why Lexical Analysis?

• Well, not strictly necessary, but …
– Regular languages ⊆ Context-free languages
• Simplifies the syntax analysis (parsing)
– And language definition
• Modularity
• Reusability
• Efficiency

Lecture goals

• Understand role & place of lexical analysis
• Lexical analysis theory
• Using program generating tools

Lecture Outline

✓ Role & place of lexical analysis
• What is a token?
• Regular languages
• Lexical analysis
• Error handling
• Automatic creation of lexical analyzers

What is a token? (Intuitively)

• A “word” in the source language
– Anything that should appear in the input to syntax analysis
    • Identifiers
    • Values
    • Language keywords
• Usually represented as a pair of (kind, value)

Example Tokens

Type      Examples
ID        foo, n_14, last
NUM       73, 00, 517, 082
REAL      66.1, .5, 5.5e-10
IF        if
COMMA     ,
NOTEQ     !=
LPAREN    (
RPAREN    )

Example Non Tokens

Type                      Examples
comment                   /* ignored */
preprocessor directive    #include <foo.h>
                          #define NUMS 5.6
macro                     NUMS
whitespace                \t, \n, \b, ‘ ’

Some basic terminology

• Lexeme (aka symbol) - a series of letters separated from the rest of the program according to a convention (space, semicolon, comma, etc.)
• Pattern - a rule specifying a set of strings. Example: “an identifier is a string that starts with a letter and continues with letters and digits”
– (Usually) a regular expression
• Token - a pair of (pattern, attributes)

Example

    void match0(char *s) /* find a zero */
    {
        if (!strncmp(s, "0.0", 3))
            return 0. ;
    }

Token stream:

    VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN
    LBRACE
    IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
    RETURN REAL(0.0) SEMI
    RBRACE
    EOF

Example Non Tokens

Type                      Examples
comment                   /* ignored */
preprocessor directive    #include <foo.h>
                          #define NUMS 5.6
macro                     NUMS
whitespace                \t, \n, \b, ‘ ’

• Lexemes that are recognized but get consumed rather than transmitted to the parser
– if
– i/*comment*/f

How can we define tokens?

• Keywords – easy!
– if, then, else, for, while, …
• Identifiers?
• Numerical Values?
• Strings?
• Characterize unbounded sets of values using a bounded description?

Lecture Outline

✓ Role & place of lexical analysis
✓ What is a token?
• Regular languages
• Lexical analysis
• Error handling
• Automatic creation of lexical analyzers

Regular languages

• Formal languages
– Σ = finite set of letters
– Word = sequence of letters
– Language = set of words
• Regular languages defined equivalently by
– Regular expressions
– Finite-state automata

Common format for reg-exps

Basic Patterns           Matching
x                        The character x
.                        Any character, usually except a new line
[xyz]                    Any of the characters x, y, z
[^x]                     Any character except x

Repetition Operators
R?                       An R or nothing (= optionally an R)
R*                       Zero or more occurrences of R
R+                       One or more occurrences of R

Composition Operators
R1R2                     An R1 followed by R2
R1|R2                    Either an R1 or R2

Grouping
(R)                      R itself

Examples

• ab*|cd? =
• (a|b)* =
• (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)* =
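One way to explore the three expressions above is to test candidate words against them. The sketch below uses Python's `re` module as a stand-in for the lecture's notation (an assumption; the course tools are JLex and Jcup), and the helper name `matches` is hypothetical. `re.fullmatch` asks whether the whole word belongs to the language.

```python
import re


def matches(pattern, word):
    """Return True if the entire word is in the language of pattern."""
    return re.fullmatch(pattern, word) is not None


if __name__ == "__main__":
    # ab*|cd? groups as (ab*)|(cd?): concatenation binds tighter than |
    for word in ["a", "abbb", "c", "cd", "abcd", "cdd"]:
        print("ab*|cd?", repr(word), matches(r"ab*|cd?", word))

    # (a|b)* : any word over {a, b}, including the empty word
    for word in ["", "abba", "abc"]:
        print("(a|b)*", repr(word), matches(r"(a|b)*", word))

    # digit strings, including the empty word
    for word in ["", "2016", "20x6"]:
        print("(0|...|9)*", repr(word), matches(r"(0|1|2|3|4|5|6|7|8|9)*", word))
```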

Escape characters

• What is the expression for one or more + symbols?
– (+)+ won’t work
– (\+)+ will
• A backslash \ before an operator turns it into a standard character
– \*, \?, \+, a\(b\+\*, (a\(b\+\*)+, …
• Double quotes surrounding text have the same effect
– “a(b+*”, “a(b+*”+
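The escaping rules above can be tried out in Python's `re` module, used here only as an illustration. Python has no lex-style double-quote syntax, so `re.escape` plays that role in this sketch; the helper name `compiles` is hypothetical.

```python
import re


def compiles(pattern):
    """Return True if pattern is a syntactically valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Stand-in for the quoted form: escape every operator in the literal text.
QUOTED = re.escape("a(b+*")

if __name__ == "__main__":
    print(compiles(r"(+)+"))     # unescaped + has nothing to repeat
    print(compiles(r"(\+)+"))    # escaped + is an ordinary character
    print(re.fullmatch(r"(\+)+", "+++") is not None)
    print(re.fullmatch(QUOTED, "a(b+*") is not None)
    # Counterpart of the quoted text followed by +: one or more repetitions.
    print(re.fullmatch(f"(?:{QUOTED})+", "a(b+*a(b+*") is not None)
```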

Shorthands

• Use names for expressions
– letter = a | b | … | z | A | B | … | Z
– letter_ = letter | _
– digit = 0 | 1 | 2 | … | 9
– id = letter_ (letter_ | digit)*
• Use hyphen to denote a range
– letter = a-z | A-Z
– digit = 0-9

Examples

• if = if
• then = then
• relop = < | > | <= | >= | = | <>
• digit = 0-9
• digits = digit+

Example

• A number is

number = ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+
         ( ε | . ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+
           ( ε | E ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+ ) )

• Using shorthands it can be written as

number = digits ( ε | .digits ( ε | E (ε|+|-) digits ) )
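The shorthand definition can be transcribed into Python's `re` syntax to see which strings it accepts. This is a sketch under assumptions: the names `DIGITS`, `NUMBER` and `is_number` are illustrative, and each empty alternative is written with the `?` operator. Because the exponent is nested inside the fractional part, a string such as 5E10 is not accepted by this particular definition.

```python
import re

DIGITS = r"[0-9]+"                                   # digits = digit+
# number = digits ( | .digits ( | E (|+|-) digits ) )
NUMBER = rf"{DIGITS}(?:\.{DIGITS}(?:E[+-]?{DIGITS})?)?"


def is_number(word):
    """Return True if the whole word matches the number definition."""
    return re.fullmatch(NUMBER, word) is not None


if __name__ == "__main__":
    for word in ["42", "3.14", "6.02E23", "6.02E-23", "5E10", ".5", "3."]:
        print(repr(word), is_number(word))
```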

Exercise 1 - Question

• Language of rational numbers in decimal representation (no leading or trailing zeros)
– 0
– 123.757
– .933333
– Not 007
– Not 0.30

Exercise 1 - Answer

• Language of rational numbers in decimal representation (no leading or trailing zeros)
– Digit = 1|2|…|9
– Digit0 = 0|Digit
– Num = Digit Digit0*
– Frac = Digit0* Digit
– Pos = Num | .Frac | 0.Frac | Num.Frac
– PosOrNeg = (ε|-)Pos
– R = 0 | PosOrNeg
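The answer can be checked mechanically by transcribing each named expression into Python's `re` syntax. This is a sketch, not part of the exercise: the variable names mirror the definitions above and `is_rational` is a hypothetical helper.

```python
import re

DIGIT = r"[1-9]"                                  # Digit = 1|2|...|9
DIGIT0 = r"[0-9]"                                 # Digit0 = 0|Digit
NUM = rf"{DIGIT}{DIGIT0}*"                        # no leading zero
FRAC = rf"{DIGIT0}*{DIGIT}"                       # no trailing zero
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
POS_OR_NEG = rf"-?{POS}"                          # (ε|-)Pos
R = rf"(?:0|{POS_OR_NEG})"


def is_rational(word):
    """Return True if the whole word is in the language R."""
    return re.fullmatch(R, word) is not None


if __name__ == "__main__":
    for word in ["0", "123.757", ".933333", "-12.5", "007", "0.30", "-0"]:
        print(repr(word), is_rational(word))
```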

Exercise 2 - Question

• Equal number of opening and closing brackets: [ⁿ]ⁿ = [], [[]], [[[]]], …

Exercise 2 - Answer

• Equal number of opening and closing brackets: [ⁿ]ⁿ = [], [[]], [[[]]], …
• Not regular
• Context-free
• Grammar: S ::= [] | [S]
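Since the language is context-free, a recognizer needs recursion (or a counter) rather than a finite automaton. The following is a minimal sketch of a recursive recognizer that follows the grammar S ::= [] | [S] directly; the function names `parse_s` and `in_language` are assumptions made for this illustration.

```python
def parse_s(text, pos=0):
    """Match S ::= [] | [S] starting at pos.

    Returns the position just after the match, or -1 if there is none.
    """
    if pos >= len(text) or text[pos] != "[":
        return -1
    if pos + 1 < len(text) and text[pos + 1] == "]":
        return pos + 2                       # S ::= []
    inner_end = parse_s(text, pos + 1)       # S ::= [S]
    if inner_end == -1 or inner_end >= len(text) or text[inner_end] != "]":
        return -1
    return inner_end + 1


def in_language(text):
    """Return True if the whole text is derived from S."""
    return parse_s(text) == len(text)


if __name__ == "__main__":
    for word in ["[]", "[[]]", "[[[]]]", "", "[[]", "[]]", "[][]"]:
        print(repr(word), in_language(word))
```

The recursion depth equals the nesting depth n, which is exactly the unbounded memory a finite-state automaton lacks.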

Challenge: Ambiguity

• if = if
• id = letter_ (letter_ | digit)*
• “if” is a valid word in the language of identifiers… so what should it be?
• How about the identifier “iffy”?
• Solution
– Always find the longest matching token
– Break ties using the order of definitions… first definition wins (=> list rules for keywords before identifiers)
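The two disambiguation rules above can be sketched as a small rule-driven tokenizer. This is an illustration under assumptions, not the way JLex works internally: the rule list `RULES`, the `WS` rule for skipping whitespace, and the function name `tokenize` are all made up for the example, and each rule is tried with Python's `re` module rather than a single combined automaton.

```python
import re

# Order matters: keywords are listed before identifiers so they win ties.
RULES = [
    ("IF", r"if"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),   # id = letter_ (letter_ | digit)*
    ("NUM", r"[0-9]+"),
    ("WS", r"[ \t\n]+"),                 # recognized but not emitted
]


def tokenize(text, rules=RULES):
    """Return (kind, lexeme) pairs using longest match, then rule order."""
    compiled = [(kind, re.compile(pattern)) for kind, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, regex in compiled:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if match and len(match.group()) > best_len:
                best_kind, best_len = kind, len(match.group())
        if best_kind is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_kind != "WS":
            tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    print(tokenize("if iffy 42"))
```

Here "if" matches both IF and ID with length 2, so the earlier rule wins, while "iffy" is claimed by ID because its match is longer.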

Creating a lexical analyzer

• Given a list of token definitions (pattern name, regex), write a program such that
– Input: String to be analyzed
– Output: List of tokens
• How do we build an analyzer?