CSCE 330 Programming Language Structures Chapter 3: Lexical and Syntactic Analysis

University of South Carolina, Department of Computer Science and Engineering

Fall 2009

Marco [email protected]

"Syntactic sugar causes cancer of the semicolon." A. Perlis




Contents

• 3.1 Chomsky Hierarchy
• 3.2 Lexical Analysis
• 3.3 Syntactic Analysis




3.1 Chomsky Hierarchy

• Regular grammar -- least powerful
• Context-free grammar (BNF)
• Context-sensitive grammar
• Unrestricted grammar




Regular Grammar

• Simplest; least powerful
• Equivalent to:
  – Regular expression
  – Finite-state automaton
• Right regular grammar: every production has the form A → ω B or A → ω, where ω ∈ T* and A, B ∈ N




Example

• Integer → 0 Integer | 1 Integer | ... | 9 Integer | 0 | 1 | ... | 9
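The right regular grammar for Integer can be read almost directly as code. The following is a minimal sketch, not part of the original slides; the class name IntegerGrammar and its methods are hypothetical. Each recursive call consumes one digit, mirroring Integer → d Integer, and the recursion stops on the last digit, mirroring Integer → d.

```java
// Hypothetical helper: recognizes the strings generated by
//   Integer → d Integer | d      (d is any digit 0..9)
final class IntegerGrammar {

    static boolean isInteger(String s) {
        return s != null && !s.isEmpty() && matchInteger(s, 0);
    }

    // Consume one digit, then either stop or continue with the rest.
    private static boolean matchInteger(String s, int pos) {
        char c = s.charAt(pos);
        if (c < '0' || c > '9') {
            return false;                      // no production starts with c
        }
        if (pos == s.length() - 1) {
            return true;                       // Integer → d
        }
        return matchInteger(s, pos + 1);       // Integer → d Integer
    }
}
```

Because the nonterminal only ever appears at the right end of a production, the recursion is a tail call and could be replaced by a simple loop, which is why a finite-state machine suffices.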




Regular Grammars

• Left regular grammar: equivalent
• Used in construction of tokenizers (scanners, lexers)
• Less powerful than context-free grammars
• Not a regular language: { aⁿ bⁿ | n ≥ 1 }, i.e., cannot balance: ( ), { }, begin end
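To see why { aⁿ bⁿ | n ≥ 1 } is out of reach for a finite-state machine, it helps to look at what a recognizer needs. The sketch below is an illustration only (the class name AnBn is hypothetical): it has to keep two counters that can grow without bound, which is exactly the kind of memory a fixed, finite set of states cannot provide.

```java
// Hypothetical recognizer for { a^n b^n | n >= 1 }.
final class AnBn {

    static boolean matches(String s) {
        int i = 0;
        int countA = 0;
        while (i < s.length() && s.charAt(i) == 'a') {   // count the leading a's
            i++;
            countA++;
        }
        int countB = 0;
        while (i < s.length() && s.charAt(i) == 'b') {   // count the b's that follow
            i++;
            countB++;
        }
        // All input must be consumed and the two counts must agree.
        return countA >= 1 && countA == countB && i == s.length();
    }
}
```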




Context-free Grammars

• BNF is a stylized form of CFG
• Equivalent to a pushdown automaton
• For a wide class of unambiguous CFGs, there are table-driven, linear-time parsers




Context-Sensitive Grammars

• Production: α → β, with |α| ≤ |β|
• α, β ∈ (N ∪ T)*
• i.e., left-hand side can be composed of strings of terminals and nonterminals




Undecidable Properties of CSGs

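• Given a string ω and grammar G: is ω ∈ L(G)? (This membership question is decidable for context-sensitive grammars, because productions never shrink the string; it is undecidable for unrestricted grammars.)
• Is L(G) non-empty? (Undecidable for CSGs.)
• Defn: Undecidable means that you cannot write a computer program that is guaranteed to halt to decide the question for all L(G).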




Unrestricted Grammar

• Equivalent to:
  – Turing machine
  – von Neumann machine
  – C++, Java

• That is, can compute any computable function.






Lexical Analysis

• Purpose: transform program representation
• Input: printable ASCII characters
• Output: tokens
• Discard: whitespace, comments

• Defn: A token is a logically cohesive sequence of characters representing a single symbol.




Example Tokens

• Identifiers
• Literals: 123, 5.67, 'x', true
• Keywords: bool char ...
• Operators: + - * / ...
• Punctuation: ; , ( ) { }




Other Sequences

• Whitespace: space tab
• Comments: // any-char* end-of-line
• End-of-line
• End-of-file




Why a Separate Phase?

• Simpler, faster machine model than parser

• 75% of time spent in lexer for non-optimizing compiler

• Differences in character sets
• End of line convention differs




Regular Expressions

• RegExpr      Meaning
• x            a character x
• \x           an escaped character, e.g., \n
• { name }     a reference to a name
• M | N        M or N
• M N          M followed by N
• M*           zero or more occurrences of M
• M+           one or more occurrences of M
• M?           zero or one occurrence of M
• [aeiou]      the set of vowels
• [0-9]        the set of digits
• .            any single character




Clite Lexical Syntax

• Category     Definition
• anyChar      [ -~]
• Letter       [a-zA-Z]
• Digit        [0-9]
• Whitespace   [ \t]
• Eol          \n
• Eof          \004
• Keyword      bool | char | else | false | float | if | int | main | true | while
• Identifier   {Letter}({Letter} | {Digit})*
• integerLit   {Digit}+
• floatLit     {Digit}+\.{Digit}+
• charLit      '{anyChar}'
• Operator     = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
• Separator    ; | . | { | } | ( | )
• Comment      // ({anyChar} | {Whitespace})* {Eol}
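Several of the Clite categories can be tried out directly with java.util.regex. The sketch below is an illustration, not part of the Clite definition: the class name CliteLexicalSyntax, the constant names, and the helper is(...) are hypothetical, and the named references such as {Letter} have been expanded by hand because Java regular expressions have no macro facility.

```java
import java.util.regex.Pattern;

// Hypothetical transcription of some Clite lexical categories into Java regexes.
final class CliteLexicalSyntax {
    // Identifier  {Letter}({Letter} | {Digit})*
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    // integerLit  {Digit}+
    static final Pattern INTEGER_LIT = Pattern.compile("[0-9]+");
    // floatLit    {Digit}+\.{Digit}+
    static final Pattern FLOAT_LIT = Pattern.compile("[0-9]+\\.[0-9]+");
    // charLit     '{anyChar}'
    static final Pattern CHAR_LIT = Pattern.compile("'[ -~]'");
    // Comment     // ({anyChar} | {Whitespace})* {Eol}
    static final Pattern COMMENT = Pattern.compile("//[ -~\\t]*\\n");

    // True when the whole string s belongs to the category described by p.
    static boolean is(Pattern p, String s) {
        return p.matcher(s).matches();
    }
}
```

Note that IDENTIFIER also matches keywords such as while; a lexer has to check the keyword list separately after matching.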




Generators

• Input: usually regular expression
• Output: table (slow), code
• C/C++: Lex, Flex
• Java: JLex




Finite State Automata

• Set of states: representation – graph nodes
• Input alphabet + unique end symbol
• State transition function: labelled (using alphabet) arcs in graph
• Unique start state
• One or more final states




Deterministic FSA

• Defn: A finite state automaton is deterministic if for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbol.




[Figure: a finite state automaton for identifiers]




Definitions

• A configuration on an FSA consists of a state and the remaining input.

• A move consists of traversing the arc exiting the state that corresponds to the leftmost input symbol, thereby consuming it. If there is no such arc, then:
  – If no input remains and the state is final, then accept.
  – Otherwise, error.




• An input is accepted if, starting with the start state, the automaton consumes all the input and halts in a final state.




Example

• (S, a2i$) ├ (I, 2i$)
            ├ (I, i$)
            ├ (I, $)
            ├ (F, )
• Thus: (S, a2i$) ├* (F, )
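The sequence of moves above can be reproduced with a small simulator. This is a sketch under assumptions: the automaton's diagram is not reproduced here, so the arcs (S to I on a letter, I to I on a letter or digit, I to F on the end symbol $) are inferred from the trace, and the class name IdentifierFsa and its methods are hypothetical.

```java
// Hypothetical simulator for the identifier automaton with states S, I, F.
final class IdentifierFsa {
    enum State { S, I, F }

    // One move: returns the state reached on 'symbol', or null if no arc matches.
    static State move(State state, char symbol) {
        switch (state) {
            case S:
                return isLetter(symbol) ? State.I : null;
            case I:
                if (isLetter(symbol) || isDigit(symbol)) {
                    return State.I;
                }
                return symbol == '$' ? State.F : null;
            default:
                return null;                 // F has no outgoing arcs
        }
    }

    // Accept if the automaton consumes all input and halts in the final state F.
    static boolean accepts(String input) {
        State state = State.S;
        for (int pos = 0; pos < input.length(); pos++) {
            state = move(state, input.charAt(pos));
            if (state == null) {
                return false;                // no arc for the leftmost symbol: error
            }
        }
        return state == State.F;
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

Since move returns at most one state for each state and symbol, the automaton is deterministic in the sense defined earlier.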




Some Conventions

• Explicit terminator used only for program as a whole, not each token.

• An unlabeled arc represents any other valid input symbol.

• Recognition of a token ends in a final state.

• Recognition of a non-token transitions back to start state.




• Recognition of end symbol (end of file) ends in a final state.
• Automaton must be deterministic.
  – Drop keywords; handle separately.
  – Must consider all sequences with a common prefix together.




Lexer Code

• Parser calls lexer whenever it needs a new token.

• Lexer must remember where it left off.
• Greedy consumption goes 1 character too far
  – peek function
  – pushback function
  – no symbol consumed by start state
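The pushback option can be illustrated with java.io.PushbackReader from the standard library, whose unread method returns a character to the stream. The sketch below is an illustration only; the class PushbackDemo and its methods are hypothetical. Reading a run of digits necessarily consumes the first non-digit, which is then pushed back so the next token starts at the right place.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo of the "one character too far" problem and the pushback fix.
final class PushbackDemo {

    // Reads a maximal run of digits and pushes back the character that ended it.
    static String readDigits(PushbackReader in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c >= '0' && c <= '9') {
            digits.append((char) c);
        }
        if (c != -1) {
            in.unread(c);                    // give back the character read too far
        }
        return digits.toString();
    }

    // Convenience wrapper: returns the first digit run and the character after it
    // (an empty string if the input ends there).
    static String[] firstRunAndNext(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            String digits = readDigits(in);
            int next = in.read();
            return new String[] { digits, next == -1 ? "" : String.valueOf((char) next) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The third option on the slide, where the start state consumes no symbol, avoids the pushback by keeping the lookahead character in a variable between calls.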




From Design to Code

private char ch = ' ';
public Token next ( ) {
    do {
        switch (ch) {
            ...
        }
    } while (true);
}




Remarks

• Loop only exited when a token is found

• Loop exited via a return statement.
• Variable ch must be global. Initialized to a space character.
• Exact nature of a Token irrelevant to design.




Translation Rules

• Traversing an arc from A to B:
  – If labeled with x: test ch == x
  – If unlabeled: else/default part of if/switch. If only arc, no test need be performed.
  – Get next character if A is not start state
• A node with an arc to itself is a do-while.
  – Condition corresponds to whichever arc is labeled.
• Otherwise the move is translated to an if/switch:
  – Each arc is a separate case.
  – Unlabeled arc is default case.
• A sequence of transitions becomes a sequence of translated statements.
• A complex diagram is translated by boxing its components so that each box is one node.
  – Translate each box using an outside-in strategy.




// Tests the parameter c (the flattened original tested the field ch by mistake).
private boolean isLetter(char c) {
    return c >= 'a' && c <= 'z' ||
           c >= 'A' && c <= 'Z';
}




private String concat(String set) {
    StringBuffer r = new StringBuffer("");
    do {
        r.append(ch);
        ch = nextChar( );
    } while (set.indexOf(ch) >= 0);
    return r.toString( );
}




public Token next( ) {
    do {
        if (isLetter(ch)) { // ident or keyword
            String spelling = concat(letters + digits);
            return Token.keyword(spelling);
        } else if (isDigit(ch)) { // int or float literal
            String number = concat(digits);
            if (ch != '.')
                return Token.mkIntLiteral(number);
            number += concat(digits);
            return Token.mkFloatLiteral(number);
        } else switch (ch) {
            case ' ': case '\t': case '\r': case eolnCh:
                ch = nextChar( ); break;
            case eofCh: return Token.eofTok;
            case '+': ch = nextChar( );
                return Token.plusTok;
            …
            case '&': check('&'); return Token.andTok;
            case '=': return chkOpt('=', Token.assignTok,
                                         Token.eqeqTok);
            …
        }
    } while (true);
}




Source and Tokens

Source:

// a first program
// with 2 comments
int main ( ) {
    char c;
    int i;
    c = 'h';
    i = c + 3;
} // main

Tokens:

• int
• main
• (
• )
• {
• char
• Identifier c
• ;
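The token list for the sample program can be reproduced with a very small lexer. The following is a simplified sketch, not the lexer developed on the preceding slides: it walks the source by index instead of using ch and nextChar, it handles only single-character operators, integer and character literals, and it reports tokens as strings. The class name MiniLexer and the "Identifier", "IntLiteral" and "CharLiteral" labels are choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified lexer for a subset of Clite.
final class MiniLexer {
    private static final List<String> KEYWORDS = Arrays.asList(
        "bool", "char", "else", "false", "float", "if", "int", "main", "true", "while");

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i);
            if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
                i++;                                            // discard whitespace
            } else if (src.startsWith("//", i)) {
                while (i < src.length() && src.charAt(i) != '\n') {
                    i++;                                        // discard comment
                }
            } else if (isLetter(ch)) {                          // identifier or keyword
                int start = i;
                while (i < src.length()
                        && (isLetter(src.charAt(i)) || isDigit(src.charAt(i)))) {
                    i++;
                }
                String spelling = src.substring(start, i);
                tokens.add(KEYWORDS.contains(spelling) ? spelling : "Identifier " + spelling);
            } else if (isDigit(ch)) {                           // integer literal
                int start = i;
                while (i < src.length() && isDigit(src.charAt(i))) {
                    i++;
                }
                tokens.add("IntLiteral " + src.substring(start, i));
            } else if (ch == '\'' && i + 2 < src.length() && src.charAt(i + 2) == '\'') {
                tokens.add("CharLiteral " + src.charAt(i + 1)); // character literal
                i += 3;
            } else {
                tokens.add(String.valueOf(ch));                 // one-character operator or separator
                i++;
            }
        }
        return tokens;
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

Keywords are recognized by looking up the spelling after the identifier automaton has finished, which is the "drop keywords; handle separately" convention stated earlier.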




JLex: A Lexical Analyzer Generator for Java

Definition of tokens (regular expressions) → JLex → Java file: scanner class (recognizes tokens)

We will look at an example JLex specification (adapted from the manual).

Consult the manual for details on how to write your own JLex specifications.




The JLex tool

Layout of JLex file:

user code (added to start of generated file)
%%
options
%{
user code (added inside the scanner class declaration)
%}
macro definitions
%%
lexical declaration

• User code is copied directly into the output class.
• JLex directives allow you to include code in the lexical analysis class, change names of various components, switch on character counting, line counting, manage EOF, etc.
• Macro definitions give names to useful regexps.
• Regular expression rules define the tokens to be recognised and the actions to be taken.




java.io.StreamTokenizer

• An alternative to JLex is to use the class StreamTokenizer from java.io
• The class recognizes 4 types of lexical elements (tokens):
  – number (sequence of decimal digits, optionally starting with the – (minus) sign and/or containing the decimal point)
  – word (sequence of letters and digits starting with a letter)
  – line separator
  – end of file
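A short experiment shows the four token types in practice. The sketch below uses the standard StreamTokenizer API (nextToken, ttype, nval, sval, eolIsSignificant, and the TT_* constants); the wrapper class StreamTokenizerDemo, its describe method, and the output labels are hypothetical. Characters that are neither part of a number nor part of a word are returned as ordinary single-character tokens.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: lists the tokens that StreamTokenizer finds in a string.
final class StreamTokenizerDemo {

    static List<String> describe(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true);               // report line separators as TT_EOL
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number " + st.nval);    // numbers are delivered as double
                        break;
                    case StreamTokenizer.TT_WORD:
                        out.add("word " + st.sval);
                        break;
                    case StreamTokenizer.TT_EOL:
                        out.add("eol");
                        break;
                    default:
                        out.add("char " + (char) st.ttype);  // ordinary character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.add("eof");
        return out;
    }
}
```

By default the tokenizer treats / as a comment character and quotes as string delimiters, so it needs further configuration before it can stand in for a programming-language lexer.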




Parsing

• Some terminology
• Different types of parsing strategies
  – bottom up
  – top down
• Recursive descent parsing
  – What is it
  – How to implement one given an EBNF specification
  – (How to generate one using tools – later)
• (Bottom up parsing algorithms)




Parsing: Some Terminology

• Recognition: to answer the question “does the input conform to the syntax of the language?”
• Parsing: recognition + determination of phrase structure (for example by generating AST data structures)
• (Un)ambiguous grammar: a grammar is unambiguous if there is at most one way to parse any input (i.e., for a syntactically correct program there is precisely one parse tree)




Different kinds of Parsing Algorithms

• Two big groups of algorithms can be distinguished:
  – bottom up strategies
  – top down strategies
• Example: parsing of “Micro-English”

Sentence ::= Subject Verb Object .
Subject  ::= I | a Noun | the Noun
Object   ::= me | a Noun | the Noun
Noun     ::= cat | mat | rat
Verb     ::= like | is | see | sees

Example sentences:

The cat sees the rat.
The rat sees me.
I like a cat
The rat like me.
I see the rat.
I sees a rat.




Top-down parsing

Input: The cat sees a rat .

The parse tree is constructed starting at the top (root).

[Figure: parse tree grown downward from Sentence to Subject Verb Object . and then to the leaves: Subject = The Noun (cat), Verb = sees, Object = a Noun (rat)]




Bottom up parsing

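Input: The cat sees a rat .

The parse tree “grows” from the bottom (leaves) up to the top (root).

[Figure: parse tree grown upward from the leaves: cat becomes Noun, The Noun becomes Subject, sees becomes Verb, rat becomes Noun, a Noun becomes Object, and finally Subject Verb Object . becomes Sentence]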




Top-Down vs. Bottom-Up parsing

LL analysis (top-down): Left-to-right scan, Leftmost derivation
• Scans string left to right
• Builds leftmost derivation
• Works by derivation, guided by look-ahead

LR analysis (bottom-up): Left-to-right scan, Rightmost derivation
• Scans string left to right
• Builds rightmost derivation (in reverse)
• Works by reduction, guided by look-ahead




Recursive Descent Parsing

• Recursive descent parsing is a straightforward top-down parsing algorithm.

• We will now look at how to develop a recursive descent parser from an EBNF specification.

• Idea: the parse tree structure corresponds to the “call graph” structure of parsing procedures that call each other recursively.







Define a procedure parseN for each non-terminal N

private void parseSentence();
private void parseSubject();
private void parseObject();
private void parseNoun();
private void parseVerb();





public class MicroEnglishParser {

    private TerminalSymbol currentTerminal;

    // Auxiliary methods will go here ...

    // Parsing methods will go here ...
}




Recursive Descent Parsing: Auxiliary Methods


public class MicroEnglishParser {

    private TerminalSymbol currentTerminal;

    private void accept(TerminalSymbol expected) {
        if (currentTerminal matches expected)
            currentTerminal = next input terminal;
        else
            report a syntax error
    }

    ...
}




Recursive Descent Parsing: Parsing Methods

Sentence ::= Subject Verb Object .

private void parseSentence() {
    parseSubject();
    parseVerb();
    parseObject();
    accept('.');
}





Subject ::= I | a Noun | the Noun

private void parseSubject() {
    if (currentTerminal matches 'I')
        accept('I');
    else if (currentTerminal matches 'a') {
        accept('a');
        parseNoun();
    } else if (currentTerminal matches 'the') {
        accept('the');
        parseNoun();
    } else
        report a syntax error
}





Noun ::= cat | mat | rat

private void parseNoun() {
    if (currentTerminal matches 'cat')
        accept('cat');
    else if (currentTerminal matches 'mat')
        accept('mat');
    else if (currentTerminal matches 'rat')
        accept('rat');
    else
        report a syntax error
}
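The pseudocode methods can be assembled into a runnable recognizer. The following is a minimal sketch under assumptions that are not in the slides: terminals are plain strings split on whitespace (with the period separated out), a syntax error is reported by throwing IllegalStateException, and the class is named MicroEnglishRecognizer with a static entry point recognizes. Terminals must be spelled exactly as in the grammar, so "the" is lower case.

```java
// Hypothetical runnable version of the Micro-English recursive descent recognizer.
final class MicroEnglishRecognizer {
    private final String[] terminals;
    private int position = 0;
    private String currentTerminal;          // null once all input is consumed

    private MicroEnglishRecognizer(String input) {
        terminals = input.replace(".", " . ").trim().split("\\s+");
        currentTerminal = terminals[0];
    }

    // True when the input is a Sentence of Micro-English and nothing is left over.
    static boolean recognizes(String input) {
        MicroEnglishRecognizer parser = new MicroEnglishRecognizer(input);
        try {
            parser.parseSentence();
            return parser.currentTerminal == null;
        } catch (IllegalStateException syntaxError) {
            return false;
        }
    }

    private void accept(String expected) {
        if (!expected.equals(currentTerminal)) {
            throw new IllegalStateException(
                "expected " + expected + " but found " + currentTerminal);
        }
        position++;
        currentTerminal = position < terminals.length ? terminals[position] : null;
    }

    // Sentence ::= Subject Verb Object .
    private void parseSentence() {
        parseSubject();
        parseVerb();
        parseObject();
        accept(".");
    }

    // Subject ::= I | a Noun | the Noun
    private void parseSubject() {
        if ("I".equals(currentTerminal)) accept("I");
        else if ("a".equals(currentTerminal)) { accept("a"); parseNoun(); }
        else if ("the".equals(currentTerminal)) { accept("the"); parseNoun(); }
        else throw new IllegalStateException("bad subject: " + currentTerminal);
    }

    // Object ::= me | a Noun | the Noun
    private void parseObject() {
        if ("me".equals(currentTerminal)) accept("me");
        else if ("a".equals(currentTerminal)) { accept("a"); parseNoun(); }
        else if ("the".equals(currentTerminal)) { accept("the"); parseNoun(); }
        else throw new IllegalStateException("bad object: " + currentTerminal);
    }

    // Noun ::= cat | mat | rat
    private void parseNoun() {
        if ("cat".equals(currentTerminal)) accept("cat");
        else if ("mat".equals(currentTerminal)) accept("mat");
        else if ("rat".equals(currentTerminal)) accept("rat");
        else throw new IllegalStateException("bad noun: " + currentTerminal);
    }

    // Verb ::= like | is | see | sees
    private void parseVerb() {
        if ("like".equals(currentTerminal)) accept("like");
        else if ("is".equals(currentTerminal)) accept("is");
        else if ("see".equals(currentTerminal)) accept("see");
        else if ("sees".equals(currentTerminal)) accept("sees");
        else throw new IllegalStateException("bad verb: " + currentTerminal);
    }
}
```

Note that "I sees a rat." is recognized: the grammar describes phrase structure only and does not enforce subject-verb agreement.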




Algorithm to convert EBNF into a RD parser

N ::= X

private void parseN() {
    parse X
}

• The conversion of an EBNF specification into a Java implementation for a recursive descent parser is so “mechanical” that it can easily be automated! => JavaCC “Java Compiler Compiler”
• We can describe the algorithm by a set of mechanical rewrite rules





parse ε
    // a dummy statement

parse N    where N is a non-terminal
    parseN();

parse t    where t is a terminal
    accept(t);

parse X Y
    parse X
    parse Y





parse X*
    while (currentToken.kind is in starters[X]) {
        parse X
    }

parse X|Y
    switch (currentToken.kind) {
        cases in starters[X]:
            parse X
            break;
        cases in starters[Y]:
            parse Y
            break;
        default:
            report syntax error
    }
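The two rules for X* and X|Y can be seen at work on a small grammar. The sketch below assumes an EBNF rule that is not in the slides, Sum ::= Digit ((+ | -) Digit)*, with single characters as tokens; the class SumParser and its methods are hypothetical. The repetition becomes a while loop guarded by the starters of its body, and the alternative becomes a switch. The parser also computes the value of the sum to make the result easy to check.

```java
// Hypothetical recursive descent parser for:  Sum ::= Digit ((+ | -) Digit)*
final class SumParser {
    private static final char END = '\0';    // stands in for end of input
    private final String input;
    private int pos = 0;

    private SumParser(String input) {
        this.input = input;
    }

    // Parses the whole input as a Sum and returns its value.
    static int evaluate(String input) {
        SumParser parser = new SumParser(input);
        int value = parser.parseSum();
        if (parser.pos != input.length()) {
            throw new IllegalArgumentException("unexpected input at position " + parser.pos);
        }
        return value;
    }

    private char current() {
        return pos < input.length() ? input.charAt(pos) : END;
    }

    private void accept(char expected) {
        if (current() != expected) {
            throw new IllegalArgumentException("expected " + expected + " at position " + pos);
        }
        pos++;
    }

    private int parseSum() {
        int value = parseDigit();
        // parse X*: loop while the current token is in starters[X] = { +, - }
        while (current() == '+' || current() == '-') {
            // parse X|Y: one case per alternative, chosen by its starter
            switch (current()) {
                case '+':
                    accept('+');
                    value += parseDigit();
                    break;
                case '-':
                    accept('-');
                    value -= parseDigit();
                    break;
                default:
                    throw new IllegalArgumentException("syntax error at position " + pos);
            }
        }
        return value;
    }

    private int parseDigit() {
        char c = current();
        if (c < '0' || c > '9') {
            throw new IllegalArgumentException("digit expected at position " + pos);
        }
        pos++;
        return c - '0';
    }
}
```

The scheme only works when the starter sets of the alternatives are disjoint, as they are here; otherwise the switch could not choose a branch from one token of look-ahead.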
