Lexical and Syntax Analysis (of Programming Languages) Lexical Analysis
Lexical and Syntax Analysis - University of York

May 11, 2018

What is Parsing?

String of characters (easy for humans to write) ⟶ Parser ⟶ Data structure (easy for programs to process)

A parser also checks that the input string is well-formed, and if not, rejects it.

PARSING = LEXICAL ANALYSIS + SYNTAX ANALYSIS

Lexical Analysis

Identifies the lexemes in a sentence.

Lexeme: a minimal meaningful unit of a language.

Converts each lexeme to a token.

Throws away ignorable text such as spaces, new-lines, and comments.

(Also known as “scanning”)

What is a token?

Every token has an identifier, used to denote the kind of lexeme that it represents, e.g.

Token identifier    Denotes
PLUS                a + operator
ASSIGN              a := operator
VAR                 a variable
NUM                 a number

Some tokens have a component value, conventionally written in parentheses after the identifier, e.g. VAR(foo), NUM(12).

Lexical Analysis

Stream of characters ⟶ Stream of tokens

Example input:

foo := 20 + bar

Example output:

VAR(foo), ASSIGN, NUM(20), PLUS, VAR(bar)
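The conversion from characters to tokens can be sketched in a few lines of Python. This is only an illustration: the name `tokenize`, the `TOKEN_SPEC` table and the concrete patterns are assumptions made for this sketch, not part of the slides.

```python
import re

# Token kinds follow the slides; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("NUM", r"[0-9]+"),
    ("VAR", r"[a-z][a-z0-9]*"),
    ("ASSIGN", r":="),
    ("PLUS", r"\+"),
    ("SKIP", r"\s+"),  # ignorable text: spaces and new-lines
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Turn a string of characters into a list of tokens (as strings)."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind = m.lastgroup
        if kind in ("NUM", "VAR"):
            tokens.append(f"{kind}({m.group()})")  # token with a component value
        elif kind != "SKIP":
            tokens.append(kind)                    # token without a value
        pos = m.end()
    return tokens

print(tokenize("foo := 20 + bar"))
# ['VAR(foo)', 'ASSIGN', 'NUM(20)', 'PLUS', 'VAR(bar)']
```

Ignorable text is matched like any other lexeme and then simply not emitted.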

Lexical Analysis

Lexemes are specified by regular expressions. For example:

digit = 0 | ... | 9
letter = a | ... | z
number = digit ⋅ digit*
variable = letter ⋅ (letter | digit)*

Example numbers: 1443634

Example variables: xfoofoo2x1y20
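As a rough check, the two definitions can be written with Python's `re` module. The character classes `[0-9]` and `[a-z]` stand in for `0 | ... | 9` and `a | ... | z`; the helper names `is_number` and `is_variable` are assumptions of this sketch.

```python
import re

NUMBER = re.compile(r"[0-9][0-9]*")             # digit ⋅ digit*
VARIABLE = re.compile(r"[a-z]([a-z]|[0-9])*")   # letter ⋅ (letter | digit)*

def is_number(s):
    # fullmatch requires the whole string to match, not just a prefix
    return NUMBER.fullmatch(s) is not None

def is_variable(s):
    return VARIABLE.fullmatch(s) is not None

print(is_number("1443634"), is_variable("x1y20"), is_variable("2x"))
# True True False
```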

REGULAR EXPRESSIONS

What exactly is a regular expression?

Notation

Alphabet Σ is the set of all characters that can appear in an input string.

If a string s matches a regular expression r, we write s ∿ r.

Language L(r) = { s ⦁ s ∿ r }, i.e. the set of all strings matching regular expression r.

We write s1s2 to denote the concatenation of strings s1 and s2.

Syntax

The syntax of regular expressions is defined by the following grammar, where x ranges over symbols in Σ.

r → ε
r → x
r → r ⋅ r
r → r | r
r → r*
r → ( r )
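One possible way to represent this grammar in a program is as nested tuples, one shape per production. The constructor names (`EPS`, `sym`, `seq`, `alt`, `star`) and the printer `show` are hypothetical, chosen only for this sketch.

```python
# Each regular expression is a tuple whose first element names the production.
EPS = ("eps",)                           # r → ε

def sym(x):                              # r → x
    return ("sym", x)

def seq(r1, r2):                         # r → r ⋅ r
    return ("seq", r1, r2)

def alt(r1, r2):                         # r → r | r
    return ("alt", r1, r2)

def star(r):                             # r → r*
    return ("star", r)

def show(r):
    """Print a regular expression, fully parenthesised (r → ( r ))."""
    tag = r[0]
    if tag == "eps":
        return "ε"
    if tag == "sym":
        return r[1]
    if tag == "seq":
        return f"({show(r[1])}⋅{show(r[2])})"
    if tag == "alt":
        return f"({show(r[1])}|{show(r[2])})"
    if tag == "star":
        return show(r[1]) + "*"
    raise ValueError(tag)

print(show(star(alt(seq(sym("a"), sym("b")), sym("c")))))
# ((a⋅b)|c)*
```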

Intuitive definition

Regular expression    Matching strings are
ε                     The empty string ε
x                     The singleton string x if x ∊ Σ
r1 | r2               Any string matching r1 or r2.
r1 ⋅ r2               Any string that can be split into substrings s1 and s2 such that s1 matches r1 and s2 matches r2
r*                    The empty string or any string that can be split into substrings s1...sn such that si matches r for all i in 1...n

Formal definition: Base cases

L(ε) = { ε }

L(x) = { x }, where x ∊ Σ

Formal definition: Choice and Sequence

L(r1 | r2) = L(r1) ∪ L(r2)

L(r1 ⋅ r2) = { s1s2 ⦁ s1 ∊ L(r1), s2 ∊ L(r2) }

Formal definition: Kleene closure

L(r^n) = { ε }, if n = 0
L(r^n) = L(r ⋅ r^(n-1)), if n > 0

L(r*) = ⋃ { L(r^n) ⦁ n ∊ { 0 ⋯ ∞ } }
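The formal definition can be turned into a small enumerator, sketched below. Since L(r*) is usually infinite, the sketch only collects strings up to a length bound. The tuple encoding of regular expressions and the name `language` are assumptions of this example.

```python
def language(r, max_len):
    """Strings of L(r) with length <= max_len, following the formal definition.

    r is a nested tuple: ("eps",), ("sym", x), ("seq", r1, r2),
    ("alt", r1, r2) or ("star", r1).
    """
    tag = r[0]
    if tag == "eps":                       # L(ε) = { ε }
        return {""}
    if tag == "sym":                       # L(x) = { x }
        return {r[1]} if max_len >= 1 else set()
    if tag == "alt":                       # union
        return language(r[1], max_len) | language(r[2], max_len)
    if tag == "seq":                       # all concatenations s1s2
        return {s1 + s2
                for s1 in language(r[1], max_len)
                for s2 in language(r[2], max_len)
                if len(s1 + s2) <= max_len}
    if tag == "star":
        # Union of L(r^n) for n = 0, 1, 2, ... until nothing new fits the bound.
        base = language(r[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:
            new = {s1 + s2 for s1 in frontier for s2 in base
                   if len(s1 + s2) <= max_len} - result
            result |= new
            frontier = new
        return result
    raise ValueError(tag)

A, B = ("sym", "a"), ("sym", "b")
print(sorted(language(("star", ("seq", A, B)), 4)))
# ['', 'ab', 'abab']
```

The empty Python string plays the role of ε.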

Example 1

Suppose Σ = { a, b }.

r                L(r)
a|b              { a, b }
(a|b)⋅(a|b)      { aa, ab, ba, bb }
a*               { ε, a, aa, aaa, ... }
(a⋅b)*           { ε, ab, abab, ... }
(a|b)*           { ε, a, b, ab, ba, ... }

Example 2

Example of a language that cannot be defined by a regular expression:

{ a^n b^n ⦁ n ∊ ℕ }

The set of strings containing n consecutive a symbols followed by n consecutive b symbols, for all n.
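Membership in this language is easy to test directly, as the sketch below shows, but the test relies on comparing two counts of unbounded size, which a regular expression has no means to do. The function name `is_anbn` is an assumption of this example.

```python
def is_anbn(s):
    """True if s is n copies of 'a' followed by n copies of 'b', for some n >= 0."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

print(is_anbn("aabb"), is_anbn("aab"), is_anbn(""))
# True False True
```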

Exercise 1

Characterise the languages defined by the following regular expressions:

a ⋅ (a|b)* ⋅ a

a* ⋅ b ⋅ a* ⋅ b ⋅ a* ⋅ b ⋅ a*

((ε|a) ⋅ b*)*

Proof rules: Base cases

The empty string ε matches ε.

[Empty]   ε ∿ ε

If x ∊ Σ then x matches x.

[Single]  x ∊ Σ
          -----
          x ∿ x

Proof rules: Sequence

If s1 matches r1 and s2 matches r2 then s1s2 matches r1 ⋅ r2.

[Seq]  s1 ∿ r1    s2 ∿ r2
       ------------------
         s1s2 ∿ r1 ⋅ r2

Proof rules: Choice

If s matches r1 then s matches r1 | r2.

[Or1]    s ∿ r1
       -----------
       s ∿ r1 | r2

If s matches r2 then s matches r1 | r2.

[Or2]    s ∿ r2
       -----------
       s ∿ r1 | r2

Proof rules: Kleene closure

If s matches ε then s matches r*.

[Kleene1]  s ∿ ε
           ------
           s ∿ r*

If s matches r ⋅ r* then s matches r*.

[Kleene2]  s ∿ r ⋅ r*
           ----------
             s ∿ r*

Exercise 2

Prove that the string

cab

matches the regular expression

((a⋅b)|c)*

Exercise 2

cab ∿ ((a⋅b)|c)*

⇐ { Kleene2 }

cab ∿ ((a⋅b)|c)⋅((a⋅b)|c)*

⇐ { Seq }

c ∿ (a⋅b)|c, ab ∿ ((a⋅b)|c)*

⇐ { Or2 }

c ∿ c, ab ∿ ((a⋅b)|c)*

⇐ { Single }

ab ∿ ((a⋅b)|c)*

Continued...

Exercise 2

ab ∿ ((a⋅b)|c)*

⇐ { Kleene2 }

ab ∿ ((a⋅b)|c)⋅((a⋅b)|c)*

⇐ { Seq }

ab ∿ (a⋅b)|c, ε ∿ ((a⋅b)|c)*

⇐ { Or1, Kleene1 }

ab ∿ (a⋅b)

⇐ { Seq }

a ∿ a, b ∿ b

⇐ { Single, Single }

true

Sound and Complete

The proof rules are:

Sound: if we can prove s ∿ r using the rules then s ∊ L(r).

Complete: if s ∊ L(r) then we can prove s ∿ r using the rules.

Prolog

If we define our proof rules in Prolog then we get a regular expression implementation.

[] ∿ [].
[X] ∿ X.
S ∿ R1!R2 :- S ∿ R1.
S ∿ R1!R2 :- S ∿ R2.
S ∿ R1.R2 :- append(S1, S2, S), S1 ∿ R1, S2 ∿ R2.
[] ∿ R*.
S ∿ R* :- append(S1, S2, S), S1=[X|Xs], S1 ∿ R, S2 ∿ R*.

NOTES:

Operator ! used to represent vertical bar.
Read “:-” as “if”.
Strings represented by lists of symbols.
Termination ensured by requiring S1 to be non-empty in final clause.
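The same idea can be sketched in Python: one case per proof rule, with every possible split of the string tried in turn. This is an illustrative translation, not the slides' code; the tuple encoding of regular expressions and the name `matches` are assumptions.

```python
def matches(s, r):
    """Decide s ∿ r by trying the proof rules, guessing every split point.

    r is a nested tuple: ("eps",), ("sym", x), ("seq", r1, r2),
    ("alt", r1, r2) or ("star", r1).
    """
    tag = r[0]
    if tag == "eps":                       # [Empty]
        return s == ""
    if tag == "sym":                       # [Single]
        return s == r[1]
    if tag == "alt":                       # [Or1], [Or2]
        return matches(s, r[1]) or matches(s, r[2])
    if tag == "seq":                       # [Seq]: try every split s = s1s2
        return any(matches(s[:i], r[1]) and matches(s[i:], r[2])
                   for i in range(len(s) + 1))
    if tag == "star":
        if s == "":                        # [Kleene1]
            return True
        # [Kleene2], with s1 non-empty so that the recursion terminates
        return any(matches(s[:i], r[1]) and matches(s[i:], r)
                   for i in range(1, len(s) + 1))
    raise ValueError(tag)

R = ("star", ("alt", ("seq", ("sym", "a"), ("sym", "b")), ("sym", "c")))
print(matches("cab", R), matches("ca", R))
# True False
```

As in the Prolog version, the first part of the split in the final case must be non-empty, otherwise the function would call itself on the same string forever.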

Prolog

Sadly the Prolog implementation is not very efficient:

when applying the proof rules by hand we used human intuition to know where to split the string;

Prolog does not have this intuition;

instead, Prolog guesses, trying all possible ways to split a string, and backtracks on failure.

Common extensions

Regular expression    is the same as
r+                    r⋅r*
r?                    ε|r
[c1c2c3-cn]           c1|c2|c3|...|cn
[^c1c2]               { x ⦁ x ∊ Σ, x ∉ {c1, c2} }
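These equivalences can be spot-checked with Python's `re` module, which supports the same extensions. The sketch below compares two patterns on every string up to a small length bound; this is a bounded test, not a proof, and the helper name `same_language` is an assumption.

```python
import re
from itertools import product

def same_language(p, q, alphabet="ab", max_len=4):
    """Compare patterns p and q on every string over alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True

print(same_language("a+", "aa*"))                      # r+ versus r⋅r*
print(same_language("a?", "(|a)"))                     # r? versus ε|r
print(same_language("[ab]", "a|b"))                    # character class
print(same_language("[^a]", "b|c", alphabet="abc"))    # negated class
```

In Python's syntax concatenation is written by juxtaposition and ε by an empty alternative.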

Escaping

What if Σ contains regular expression symbols such as

| * ⋅ ( + [ ?

We can escape such symbols by prefixing with a backslash:

\| \* \⋅ \( \[ \?

And if we want \ then write \\.

Example: \[*⋅\]* means zero or more left brackets followed by zero or more right brackets.
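The bracket example carries over directly to Python's `re` syntax, where concatenation is implicit. The sketch also uses the standard `re.escape` function, which adds the backslashes automatically.

```python
import re

# Zero or more left brackets followed by zero or more right brackets.
BRACKETS = re.compile(r"\[*\]*")

print(BRACKETS.fullmatch("[[]]]") is not None)   # True
print(BRACKETS.fullmatch("][") is not None)      # False

# re.escape turns every special symbol in a string into its escaped form,
# so the resulting pattern matches exactly that string.
literal = re.escape("a|b*")
print(re.fullmatch(literal, "a|b*") is not None)  # True
print(re.fullmatch(literal, "a") is not None)     # False
```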

Regular definitions

For convenience, we may wish to name a regular expression so that we can refer to it many times:

name = r

We write name but sometimes the notation {name} is used (e.g. in Flex). Example:

digit = 0 | ... | 9
number = digit ⋅ digit*
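A simple way to support named definitions is textual substitution, sketched below with the {name} notation. The `DEFS` table and the function `expand` are hypothetical, and the sketch assumes that definitions are not recursive.

```python
import re

# Hypothetical definition table, using the {name} notation for references.
DEFS = {
    "digit": "[0-9]",
    "number": "{digit}{digit}*",
}

def expand(pattern, defs):
    """Replace each {name} by its parenthesised definition until none remain."""
    ref = re.compile(r"\{([a-z]+)\}")
    while ref.search(pattern):
        pattern = ref.sub(lambda m: "(" + defs[m.group(1)] + ")", pattern)
    return pattern

print(expand("{number}", DEFS))
# (([0-9])([0-9])*)
```

The parentheses added around each definition keep operators such as * from binding to only part of the substituted text.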

Implementing regular expressions

How do we convert a regular expression r into an efficient program that prints YES when applied to any string in L(r) and NO in all other cases?

Two options:

By hand (LSA Lab 1)

Automatically (Chapters 3 & 4 of lecture notes, and LSA Lab 2)

Outline

Automatic conversion of regular expressions to efficient string-matching functions:

Step 1: RE ⟶ NFA
Step 2: NFA ⟶ DFA
Step 3: DFA ⟶ C Function

Acronym    Meaning
RE         Regular Expression
NFA        Non-deterministic Finite Automaton
DFA        Deterministic Finite Automaton

STEP 1: RE ⟢ NFA

Thompson’s construction

What is an NFA?

A directed graph with nodes denoting states and edges labelled with a symbol x ∊ Σ ∪ {ε} denoting transitions.

[Figure: diagram notation for a state s, the start state s, an accepting state s, a start state s that is also an accepting state, and a transition labelled x from state s1 to state s2.]

Meaning of an NFA

A string x1x2...xn is accepted by an NFA if there is a path labelled x1,x2,...,xn (including any number of ε transitions) from the start state to an accepting state.

Example of an NFA

The following NFA accepts exactly the strings that match the regular expression a⋅a* | b⋅b*.

[Figure: NFA with states 0, 1, 2, 3 and 4, two ε transitions, two transitions labelled a and two transitions labelled b.]
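Acceptance by an NFA can be sketched as a simulation that tracks the set of all states reachable so far. The transition table below is an assumption: it is one NFA for a⋅a* | b⋅b* using states 0 to 4, and may differ from the diagram in the slides. The names `eps_closure` and `accepts` are also assumptions.

```python
EPS = ""  # label used for ε transitions in this sketch

# Assumed transition table for an NFA accepting a⋅a* | b⋅b*.
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, "a"): {2},
    (3, "b"): {4},
    (4, "b"): {4},
}
START, ACCEPTING = 0, {2, 4}

def eps_closure(states, delta):
    """All states reachable from `states` using only ε transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(delta, start, accepting, string):
    """True if some path labelled by `string` leads to an accepting state."""
    current = eps_closure({start}, delta)
    for x in string:
        moved = set()
        for s in current:
            moved |= delta.get((s, x), set())
        current = eps_closure(moved, delta)
    return bool(current & accepting)

print(accepts(NFA, START, ACCEPTING, "aaa"), accepts(NFA, START, ACCEPTING, "ab"))
# True False
```

Tracking a set of states avoids having to guess which transition to follow.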

Thompson’s construction: Notation

Let N(r) be the NFA accepting exactly the set of strings in L(r).

We abstractly represent an NFA N(r) with start state s0 and final state sa by the diagram:

[Figure: a box labelled N(r) between start state s0 and final state sa.]

Thompson’s construction: Base cases

N(ε) = a single transition labelled ε from s0 to sa

N(x) = a single transition labelled x from s0 to sa, where x ∊ Σ

Thompson’s construction: Choice

[Figure: N(r|t) is built from N(r) and N(t), a start state s0, an accepting state sa and four ε transitions.]

Thompson’s construction: Sequence

[Figure: N(r ⋅ t) is built from N(r) followed by N(t), with start state s0 and accepting state sa.]

Thompson’s construction: Kleene closure

[Figure: N(r*) is built from N(r), a start state s0, an accepting state sa and four ε transitions.]
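The whole construction can be sketched as one recursive function that adds states and transitions for each kind of regular expression. This follows the textbook form of Thompson's construction and is an assumption-laden sketch: the sequence case links the two sub-automata with ε transitions, which may differ from the slides' diagram, and the tuple encoding and all function names are hypothetical.

```python
from itertools import count

EPS = ""  # label for ε transitions (an assumption of this sketch)

def edge(delta, s, x, t):
    delta.setdefault((s, x), set()).add(t)

def thompson(r, delta, fresh):
    """Add N(r) to delta and return its (start, accepting) states."""
    s0, sa = next(fresh), next(fresh)
    tag = r[0]
    if tag == "eps":
        edge(delta, s0, EPS, sa)
    elif tag == "sym":
        edge(delta, s0, r[1], sa)
    elif tag == "alt":
        for sub in (r[1], r[2]):
            b, e = thompson(sub, delta, fresh)
            edge(delta, s0, EPS, b)
            edge(delta, e, EPS, sa)
    elif tag == "seq":
        b1, e1 = thompson(r[1], delta, fresh)
        b2, e2 = thompson(r[2], delta, fresh)
        edge(delta, s0, EPS, b1)
        edge(delta, e1, EPS, b2)
        edge(delta, e2, EPS, sa)
    elif tag == "star":
        b, e = thompson(r[1], delta, fresh)
        edge(delta, s0, EPS, b)    # enter the loop body
        edge(delta, s0, EPS, sa)   # skip it: zero repetitions
        edge(delta, e, EPS, b)     # repeat
        edge(delta, e, EPS, sa)    # leave
    else:
        raise ValueError(tag)
    return s0, sa

def closure(states, delta):
    stack, seen = list(states), set(states)
    while stack:
        for t in delta.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(r, string):
    """Build N(r), then simulate it on the string."""
    delta = {}
    start, accept = thompson(r, delta, count())
    current = closure({start}, delta)
    for x in string:
        moved = set()
        for s in current:
            moved |= delta.get((s, x), set())
        current = closure(moved, delta)
    return accept in current

R = ("star", ("alt", ("seq", ("sym", "a"), ("sym", "b")), ("sym", "c")))
print(accepts(R, "cab"), accepts(R, "ca"))
# True False
```

Every case creates exactly one start state and one accepting state, which is what lets the sub-automata be plugged together without inspecting their insides.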

Exercise 3

Apply Thompson’s construction to the following regular expression.

((a⋅b)|c)*

Problem with NFAs

It is not straightforward to turn an NFA into a deterministic program because:

There may be many possible next-states for a given input.

Which one do we choose?

Try them all?

Idea: convert an NFA into a DFA: a DFA can be easily converted into an efficient executable program.
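To make the "try them all" option concrete, here is a minimal backtracking sketch in C. The NFA is a hypothetical one (not taken from the slides) for (a|b)* ⋅ a ⋅ b, with no ε-transitions, and the names Edge, nfa_accepts_from and nfa_accepts are assumptions made for the illustration. Each time several edges match the next symbol, the function tries them one after another and backs up when a choice fails.

```c
#include <stddef.h>

typedef struct { int from; char label; int to; } Edge;

/* Hypothetical NFA (not from the slides) for (a|b)* ⋅ a ⋅ b. */
static const Edge edges[] = {
    {0, 'a', 0}, {0, 'b', 0},   /* loop in state 0 */
    {0, 'a', 1},                /* second choice on 'a' from state 0 */
    {1, 'b', 2}
};
static const int accepting_state = 2;

/* Returns 1 if some sequence of choices starting in `state` accepts `input`. */
int nfa_accepts_from(int state, const char *input) {
    size_t i;
    if (*input == '\0') return state == accepting_state;
    for (i = 0; i < sizeof edges / sizeof edges[0]; i++) {
        if (edges[i].from == state && edges[i].label == *input &&
            nfa_accepts_from(edges[i].to, input + 1))
            return 1;           /* this choice led to acceptance */
    }
    return 0;                   /* every choice failed: backtrack */
}

/* State 0 is the start state. */
int nfa_accepts(const char *input) { return nfa_accepts_from(0, input); }
```

The same suffix of the input may be examined many times, once for each path that reaches it, which is why this approach is generally not efficient.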

STEP 2: NFA ⟶ DFA

The subset construction.

What is a DFA?

A deterministic finite automaton (DFA) is an NFA in which

there are no ε-transitions, and

for each state s and input symbol a there is at most one transition out of s labelled a.

Example of a DFA

The following DFA accepts exactly the strings that match the regular expression a ⋅ a* | b ⋅ b*.

[Figure: DFA with three states, 0, 1 and 2, and four transitions labelled a, a, b and b.]
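Because a DFA has at most one transition per state and symbol, it can be stored as a table and run with a single loop. The sketch below assumes one plausible reading of the figure, inferred from the regular expression rather than stated in the slides: 0 is the start state, 0 goes to 1 on a and to 2 on b, state 1 loops on a, state 2 loops on b, and states 1 and 2 accept. The names column, delta, accepting and dfa_accepts are hypothetical.

```c
enum { NSTATES = 3, DEAD = -1 };

/* Column for an input symbol, or -1 if it is outside the alphabet {a, b}. */
static int column(char c) { return c == 'a' ? 0 : c == 'b' ? 1 : -1; }

/* Assumed layout of the figure: 0 is the start state, 1 and 2 accept. */
static const int delta[NSTATES][2] = {
    /* state 0 */ { 1, 2 },
    /* state 1 */ { 1, DEAD },
    /* state 2 */ { DEAD, 2 }
};
static const int accepting[NSTATES] = { 0, 1, 1 };

/* Returns 1 if the DFA accepts `input`, 0 otherwise. */
int dfa_accepts(const char *input) {
    int state = 0;
    for (; *input != '\0'; input++) {
        int col = column(*input);
        if (col < 0) return 0;         /* symbol outside the alphabet */
        state = delta[state][col];
        if (state == DEAD) return 0;   /* no transition: reject */
    }
    return accepting[state];
}
```

Each input character costs one table lookup, so the running time is linear in the length of the input.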

NFA → DFA: key observation

After consuming an input string, an NFA can be in one of a set of states. Example 3:

[Figure: NFA with states 0, 1, 2 and 3 and transitions labelled a, b, ab and a.]

Input    States

aa       0, 1, 2

aba

aab

aaba

ε

NFA → DFA: key idea

Idea: construct a DFA in which each state corresponds to a set of NFA states.

After consuming a1 ⋯ an the DFA is in a state which corresponds to the set of states that the NFA can reach on input a1 ⋯ an.

Example 3, revisited

Input    NFA States    DFA State

aa       0, 1, 2       A

aba      0, 1          B

aab      0, 3          C

aaba     0, 1          B

ε        0             D

Create a DFA state corresponding to each set of NFA states.

Question: which states would be initial and final DFA states?
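The sets in the table can be reproduced by tracking every state the NFA might be in. The sketch below is an illustration only. It assumes that the NFA of Example 3 has the transitions 0 -a-> 0, 0 -b-> 0, 0 -a-> 1, 1 -a-> 2 and 2 -b-> 3, which is inferred from the table and not stated in the slides. Sets are stored as bit masks, and the names StateSet, step and reachable are hypothetical.

```c
/* Bit i of a set is 1 when NFA state i is in the set. */
typedef unsigned StateSet;

/* Assumed transitions for Example 3, inferred from the table:
   0 -a-> 0, 0 -b-> 0, 0 -a-> 1, 1 -a-> 2, 2 -b-> 3. No ε-transitions. */
static StateSet step(int state, char c) {
    switch (state) {
    case 0: return (c == 'a') ? ((1u << 0) | (1u << 1))
                 : (c == 'b') ? (1u << 0) : 0;
    case 1: return (c == 'a') ? (1u << 2) : 0;
    case 2: return (c == 'b') ? (1u << 3) : 0;
    default: return 0;
    }
}

/* Set of NFA states reachable from state 0 after consuming `input`. */
StateSet reachable(const char *input) {
    StateSet current = 1u << 0;
    for (; *input != '\0'; input++) {
        StateSet next = 0;
        int s;
        for (s = 0; s < 4; s++)
            if (current & (1u << s)) next |= step(s, *input);
        current = next;
    }
    return current;
}
```

Under these assumptions reachable("aab") has bits 0 and 3 set, matching the row for aab in the table.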

Notation

Operation        Description

ε-closure(s)     Set of NFA states reachable from NFA state s on zero or more ε-transitions.

ε-closure(T)     ⋃ ε-closure(s), taken over all s ∊ T.

move(T, a)       Set of NFA states to which there is a transition on symbol a from some state s in T.
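These operations might be sketched in C with bit-mask sets as follows. The NFA used here is hypothetical (not one from the slides), and the names StateSet, eps_edges, sym_edges, eps_closure and move are assumptions for the illustration.

```c
#define NSTATES 4                 /* hypothetical NFA with states 0..3 */
typedef unsigned StateSet;        /* bit i set <=> state i is in the set */

/* Hypothetical NFA, not from the slides:
   0 -ε-> 1, 1 -ε-> 2, 1 -a-> 3, 2 -a-> 2. */
static const StateSet eps_edges[NSTATES] = { 1u << 1, 1u << 2, 0, 0 };

static StateSet sym_edges(int s, char c) {
    if (s == 1 && c == 'a') return 1u << 3;
    if (s == 2 && c == 'a') return 1u << 2;
    return 0;
}

/* ε-closure(T): keep adding ε-successors until nothing new appears. */
StateSet eps_closure(StateSet T) {
    StateSet closure = T, previous;
    do {
        int s;
        previous = closure;
        for (s = 0; s < NSTATES; s++)
            if (closure & (1u << s)) closure |= eps_edges[s];
    } while (closure != previous);
    return closure;
}

/* move(T, a): union of the a-successors of every state in T. */
StateSet move(StateSet T, char a) {
    StateSet result = 0;
    int s;
    for (s = 0; s < NSTATES; s++)
        if (T & (1u << s)) result |= sym_edges(s, a);
    return result;
}
```

ε-closure(s) for a single state s is then eps_closure(1u << s). Note that move does not follow ε-transitions itself, which is why the two operations are usually composed as ε-closure(move(T, a)).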

Exercise 4

Consider the following NFA.

[Figure: NFA with states 0, 1, 2, 3 and 4 and transitions labelled a, c and ε.]

Compute:

ε-closure(0)

ε-closure({1, 2})

move({0,3}, a)

ε-closure(move({0,3}, a))

Subset construction: input and output

Input: an NFA N.

Output: a DFA D accepting the same language as N. Specifically, the set of states of D, termed Dstates, and its transition function Dtran that maps any state-symbol pair to a next state.

Subset construction: input and output

Each state in D is denoted by a subset of N's states.

To ensure termination, every state is either marked or unmarked.

Initially, Dstates contains a single unmarked start state ε-closure(s0) where s0 is the start state of N.

The accepting states of D are the states that contain at least one accepting state of N.

Subset construction: algorithm

while (there is an unmarked state T in Dstates) {
  mark T;
  for (each input symbol a) {
    U = ε-closure(move(T, a));
    Dtran[T, a] = U;
    if (U is not in Dstates)
      add U as unmarked state to Dstates;
  }
}
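A minimal C sketch of this loop, under stated assumptions, is given here. The NFA is a hypothetical one for a* over the alphabet {a, b} (not an NFA from the slides), sets of NFA states are bit masks, and Dtran stores indices into Dstates rather than the subsets themselves. Because new states are appended in order, a single index (called marked here) separates marked from unmarked states. The helper names find_or_add and is_accepting are hypothetical.

```c
#include <assert.h>

#define NSTATES 4                /* states of the hypothetical NFA */
#define NSYMS 2                  /* alphabet {a, b}: a is 0, b is 1 */
#define MAX_DSTATES (1 << NSTATES)
typedef unsigned StateSet;       /* bit i set <=> NFA state i is in the set */

/* Hypothetical NFA for a*, start state 0, accepting state 3:
   0 -ε-> 1, 0 -ε-> 3, 1 -a-> 2, 2 -ε-> 1, 2 -ε-> 3. */
static const StateSet eps_edges[NSTATES] = { 0xA, 0x0, 0xA, 0x0 };
static const StateSet sym_edges[NSTATES][NSYMS] = {
    { 0, 0 }, { 1u << 2, 0 }, { 0, 0 }, { 0, 0 }
};

static StateSet eps_closure(StateSet T) {
    StateSet closure = T, previous;
    do {
        int s;
        previous = closure;
        for (s = 0; s < NSTATES; s++)
            if (closure & (1u << s)) closure |= eps_edges[s];
    } while (closure != previous);
    return closure;
}

static StateSet move(StateSet T, int a) {
    StateSet result = 0;
    int s;
    for (s = 0; s < NSTATES; s++)
        if (T & (1u << s)) result |= sym_edges[s][a];
    return result;
}

StateSet Dstates[MAX_DSTATES];   /* Dstates[i]: NFA subset for DFA state i */
int Dtran[MAX_DSTATES][NSYMS];   /* Dtran[i][a]: index of the next DFA state */
int ndstates = 0;

/* Returns the index of U in Dstates, adding it (unmarked) if it is new. */
static int find_or_add(StateSet U) {
    int i;
    for (i = 0; i < ndstates; i++)
        if (Dstates[i] == U) return i;
    assert(ndstates < MAX_DSTATES);
    Dstates[ndstates] = U;
    return ndstates++;
}

void subset_construction(void) {
    int marked = 0, a;           /* states with index < marked are marked */
    ndstates = 0;
    find_or_add(eps_closure(1u << 0));      /* start state: ε-closure(s0) */
    while (marked < ndstates) {             /* an unmarked state T remains */
        StateSet T = Dstates[marked];
        for (a = 0; a < NSYMS; a++)
            Dtran[marked][a] = find_or_add(eps_closure(move(T, a)));
        marked++;                           /* T is now marked */
    }
}

/* A DFA state accepts if its subset contains NFA accepting state 3. */
int is_accepting(int d) { return (Dstates[d] >> 3) & 1u; }
```

In this sketch the empty set is treated like any other subset, so it becomes an explicit non-accepting state that every undefined transition leads to.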

Exercise 5

Convert the following NFA into a DFA by applying the subset construction algorithm.

[Figure: NFA whose visible state labels are 1, 2, 4, 5, 6, 7, 8, 9 and 10, with transitions labelled a, b, c and ε.]

Exercise 6

It is not obvious how to simulate an NFA in linear time with respect to the length of the input string.

But it may be converted to a DFA that can be simulated easily in linear time.

What's the catch? Can you think of any problems with the DFA produced by subset construction?

Caveats

The number of DFA states could be exponential in the number of NFA states!

The DFA produced is not necessarily minimal in its number of states. (A minimisation algorithm can be applied.)

Often no problem in practice.
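The exponential case can be observed on a standard family of examples that is not taken from the slides: the NFA for (a|b)* ⋅ a ⋅ (a|b)^(n-1), which accepts strings whose n-th symbol from the end is a. It has n + 1 states, yet the subset construction reaches 2^n DFA states. The sketch below only counts the reachable subsets. The function names and the bit-mask encoding are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned StateSet;       /* bit i set <=> NFA state i is in the set */

/* Hypothetical NFA family for (a|b)* ⋅ a ⋅ (a|b)^(n-1), states 0..n:
   state 0 loops on a and b, 0 -a-> 1, and i -a,b-> i+1 for 1 <= i < n. */
static StateSet move(StateSet T, int is_a, int n) {
    StateSet all = (1u << (n + 1)) - 1;
    StateSet shifted = ((T & ~1u) << 1) & all;  /* states 1..n-1 step forward */
    StateSet result = shifted | (T & 1u);       /* state 0 loops */
    if (is_a && (T & 1u)) result |= 1u << 1;    /* 0 -a-> 1 */
    return result;
}

/* Number of DFA states reached from {0}, or -1 if memory runs out. */
int count_dfa_states(int n) {
    size_t limit;
    char *seen;
    StateSet *work;
    int count = 0, top = 0, is_a;

    assert(n >= 1 && n <= 20);
    limit = (size_t)1 << (n + 1);
    seen = calloc(limit, 1);
    work = malloc(limit * sizeof *work);
    if (!seen || !work) { free(seen); free(work); return -1; }

    seen[1] = 1;
    work[top++] = 1u;                           /* start state {0} */
    while (top > 0) {
        StateSet T = work[--top];
        count++;
        for (is_a = 0; is_a <= 1; is_a++) {
            StateSet U = move(T, is_a, n);
            if (!seen[U]) { seen[U] = 1; work[top++] = U; }
        }
    }
    free(seen);
    free(work);
    return count;
}
```

There are no ε-transitions in this family, so ε-closure is the identity and is omitted.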

Homework Exercise

Convert the following NFA into a DFA by applying the subset construction algorithm.

[Figure: NFA with states 0, 1, 2 and 3 and transitions labelled a, b, ab and a.]

STEP 3: DFA ⟶ C CODE

Exercise 7

Implement the DFA

[Figure: DFA with states A, B, C and D and transitions labelled a, b and c.]

as a C function

int match(char *next) {
  ⋯
}

returning 1 if the string pointed to by next is accepted by the DFA and 0 otherwise.

int match(char* next) {

  goto A; /* start state */

  A: if (*next == '\0') return 1;
     if (*next == 'a') { next++; goto B; }
     if (*next == 'c') { next++; goto C; }
     return 0;

  B: if (*next == '\0') return 0;
     if (*next == 'b') { next++; goto D; }
     return 0;

  C: if (*next == '\0') return 1;
     if (*next == 'a') { next++; goto B; }
     if (*next == 'c') { next++; goto D; }
     return 0;

  D: if (*next == '\0') return 1;
     if (*next == 'a') { next++; goto B; }
     if (*next == 'c') { next++; goto D; }
     return 0;

}
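The same DFA could also be written in a table-driven style, which is one common alternative to labels and goto. This is a sketch, not the form used in the slides: the transitions and accepting states are read off the code above, the name match_table is hypothetical, and the index computation assumes that 'a', 'b' and 'c' have consecutive character codes, as they do in ASCII.

```c
enum { A, B, C, D, NSTATES, REJECT = -1 };

/* Transitions read off the goto-based code; columns are 'a', 'b', 'c'. */
static const int delta[NSTATES][3] = {
    /* A */ { B,      REJECT, C      },
    /* B */ { REJECT, D,      REJECT },
    /* C */ { B,      REJECT, D      },
    /* D */ { B,      REJECT, D      }
};

/* A, C and D return 1 at end of input in the goto-based code; B returns 0. */
static const int accepting[NSTATES] = { 1, 0, 1, 1 };

int match_table(const char *next) {
    int state = A;                                 /* start state */
    for (; *next != '\0'; next++) {
        if (*next < 'a' || *next > 'c') return 0;  /* outside the alphabet */
        state = delta[state][*next - 'a'];
        if (state == REJECT) return 0;             /* no transition */
    }
    return accepting[state];
}
```

The goto version encodes the current state in the program counter, while the table version keeps it in a variable. The table is easier to generate mechanically, and the goto form avoids the lookup.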

SUMMARY

Summary

In lexical analysis, the lexemes of the language are identified and converted into tokens.

Lexemes are typically specified by regular expressions.

Matching of regular expressions formalised by proof rules.

Defining proof rules in Prolog gives a simple but inefficient implementation.

Summary

Automatically converting regular expressions into efficient C code involves three main steps:

1. RE → NFA (Thompson's Construction)

2. NFA → DFA (Subset Construction)

3. DFA → C Function

Theory & Practice

In the next lecture, we will learn how to use a tool called Flex that puts the regular expression theory into practice.