Lexical Analysis Finite Automata - University of Michigan
web.eecs.umich.edu/~weimerw/2007-415/lectures/weimer-415-04.pdf
#1
Lexical Analysis
Finite Automata
(Part 1 of 2)
#2
Cunning Plan
• Informal Sketch of Lexical Analysis
– Identifies tokens from input string
– lexer : (char list) → (token list)
• Issues in Lexical Analysis
– Lookahead
– Ambiguity
• Specifying Lexers
– Regular Expressions
– Examples
#3
One-Slide Summary
• Lexical analysis turns a stream of characters
into a stream of tokens.
• Regular expressions are a way to specify sets
of strings. We use them to describe tokens.
#4
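Recall: The Structure of a
Compiler or Interpreter
[Figure: compiler/interpreter pipeline. Labels: Source, Lexical analysis, Tokens, Parsing, Interm. Language, Optimization, Code Gen., Machine Code, Run It!, with branches marked "Interpreter Only!" and "Compiler Only!". A "Today we start" marker appears in the diagram.]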
#5
Lexical Analysis
• What do we want to do? Example:
    if (i == j)
        z = 0;
    else
        z = 1;
• The input is just a sequence of characters:
    \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Goal: Partition input string into substrings
– And classify them according to their role
#6
What’s a Token?
• Output of lexical analysis is a list of tokens
• A token is a syntactic category
– In English:
noun, verb, adjective, …
– In a programming language:
Identifier, Integer, Keyword, Whitespace, …
• Parser relies on the token distinctions:
– e.g., identifiers are treated differently than keywords
#7
Tokens
• Tokens correspond to sets of strings.
• Identifier: strings of letters or digits,
starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of
blanks, newlines, and tabs
• OpenPar: a left-parenthesis
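One way to make "tokens correspond to sets of strings" concrete is to write each set as a membership test. The sketch below is only an illustration: the patterns are assumptions that paraphrase the slide's descriptions, the keyword list contains just the three keywords the slide names, and `TOKEN_SETS` and `classes_of` are hypothetical names.

```python
import re

# Assumed patterns that paraphrase the slide; not an official token list.
TOKEN_SETS = {
    "Identifier": r"[A-Za-z][A-Za-z0-9]*",  # letters or digits, starting with a letter
    "Integer": r"[0-9]+",                   # non-empty string of digits
    "Keyword": r"else|if|begin",            # only the keywords the slide names
    "Whitespace": r"[ \n\t]+",              # non-empty blanks, newlines, tabs
    "OpenPar": r"\(",                       # a left parenthesis
}

def classes_of(s):
    """Return every token class whose set of strings contains s."""
    return [name for name, pat in TOKEN_SETS.items() if re.fullmatch(pat, s)]

print(classes_of("z1"))  # in the Identifier set only
print(classes_of("1z"))  # in none of the sets
print(classes_of("if"))  # in two sets at once
```

Note that with these descriptions the string "if" belongs to both the Identifier set and the Keyword set, so the sets alone do not settle how to classify it.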
#8
Lexical Analyzer: Implementation
• An implementation must do two things:
1. Recognize substrings corresponding to
tokens
2. Return the value or lexeme of the token
– The lexeme is the substring
#9
Example
• Recall:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Token-lexeme pairs returned by the lexer:
– (Whitespace, “\t”)
– (Keyword, “if”)
– (OpenPar, “(”)
– (Identifier, “i”)
– (Relation, “==”)
– (Identifier, “j”)
– …
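A minimal sketch of a lexer that produces such token-lexeme pairs for the example input. It assumes one common convention for choosing among rules (take the longest match, and on a tie prefer the rule listed first); the slides have not committed to that rule at this point. The names `ClosePar`, `Assign` and `Semi` are hypothetical, since the slides do not name those tokens.

```python
import re

# Rule order is priority: Keyword is listed before Identifier on purpose.
# ClosePar, Assign and Semi are hypothetical names, not from the slides.
RULES = [
    ("Whitespace", r"[ \n\t]+"),
    ("Keyword", r"if|else|begin"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer", r"[0-9]+"),
    ("Relation", r"=="),
    ("Assign", r"="),
    ("OpenPar", r"\("),
    ("ClosePar", r"\)"),
    ("Semi", r";"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def lex(text):
    """Split text into (token, lexeme) pairs, reading left to right."""
    pairs, pos = [], 0
    while pos < len(text):
        best = None
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Keep the longest match; on a tie the earlier rule stays.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        pairs.append(best)
        pos += len(best[1])
    return pairs

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
for pair in lex(source)[:6]:
    print(pair)
```

Unlike the list on the slide, this sketch also reports the whitespace between tokens (for example the blank between "if" and "("), so its output interleaves Whitespace pairs with the pairs shown above.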
#10
Lexical Analyzer: Implementation
• The lexer usually discards “uninteresting”
tokens that don’t contribute to parsing.
• Examples: Whitespace, Comments
• Question: What happens if we remove all
whitespace and all comments prior to lexing?
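One way to explore the question is to compare the lexemes found before and after stripping whitespace from the raw characters. The sketch below uses a deliberately tiny pattern (an assumption made only for this demonstration, with the hypothetical names `TOKEN` and `lexemes`), not a full lexer.

```python
import re

# Assumed toy pattern: words, integers, "==", or any other single
# non-whitespace character.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|==|\S")

def lexemes(text):
    """Return the non-whitespace lexemes of text, left to right."""
    return TOKEN.findall(text)

original = "else\n\t\tz = 1;"
stripped = re.sub(r"\s+", "", original)  # whitespace removed BEFORE lexing

print(lexemes(original))  # ['else', 'z', '=', '1', ';']
print(lexemes(stripped))  # ['elsez', '=', '1', ';']
```

In this run, discarding whitespace after lexing keeps "else" and "z" apart, while removing it beforehand fuses them into the single word "elsez".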
#11
Lookahead
• Two important points:
1. The goal is to partition the string. This is
implemented by reading left-to-right,
recognizing one token at a time
2. “Lookahead” may be required to decide where
one token ends and the next token begins
– Even our simple example has lookahead issues
i vs. if
= vs. ==
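The two lookahead cases can be sketched as hand-written scanning helpers that peek at the next character before deciding where a token ends. This is an illustration under assumptions: `scan_equals`, `scan_word`, the `Assign` token name and the keyword set are all hypothetical.

```python
import string

KEYWORDS = {"if", "else", "begin"}                 # assumed keyword set
WORD_CHARS = string.ascii_letters + string.digits  # characters that extend a word

def scan_equals(text, pos):
    """text[pos] is '='. Peek one character ahead to choose '=' or '=='."""
    if pos + 1 < len(text) and text[pos + 1] == "=":
        return ("Relation", "=="), pos + 2
    return ("Assign", "="), pos + 1  # "Assign" is an assumed token name

def scan_word(text, pos):
    """text[pos] is a letter. Keep reading while the next character extends the word."""
    end = pos + 1
    while end < len(text) and text[end] in WORD_CHARS:
        end += 1
    lexeme = text[pos:end]
    kind = "Keyword" if lexeme in KEYWORDS else "Identifier"
    return (kind, lexeme), end

print(scan_word("if (i == j)", 0))  # the word only ends once a non-word character is seen
print(scan_equals("i == j", 2))     # one character of lookahead picks "=="
```

Each helper returns the token-lexeme pair together with the position where the next token begins, which is the "where does this token end" decision the slide describes.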
#12
Next We Need
• A way to describe the lexemes of each token
• A way to resolve ambiguities
– Is if two variables i and f?
– Is == two equal signs = =?
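To see that these ambiguities are real, the following sketch enumerates every way a string can be split into tokens under a small assumed rule set. The rules, the `Assign` name and the function `segmentations` are hypothetical and exist only for this illustration.

```python
import re

# Assumed token descriptions; "Assign" is a hypothetical name for a single '='.
RULES = [
    ("Keyword", r"if|else"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Relation", r"=="),
    ("Assign", r"="),
]

def segmentations(s):
    """Return every way to split s into a sequence of (token, lexeme) pairs."""
    if s == "":
        return [[]]
    results = []
    for end in range(1, len(s) + 1):
        prefix = s[:end]
        for name, pattern in RULES:
            if re.fullmatch(pattern, prefix):
                # Pair this first token with every split of the remainder.
                for rest in segmentations(s[end:]):
                    results.append([(name, prefix)] + rest)
    return results

for option in segmentations("if"):
    print(option)
for option in segmentations("=="):
    print(option)
```

With these rules, "if" has three readings (two identifiers, one keyword, or one identifier) and "==" has two, so the token descriptions by themselves do not pick a unique answer.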
#13
Regular Languages
• There are several formalisms for specifying
tokens
• Regular languages are the most popular
– Simple and useful theory
– Easy to understand
– Efficient implementations
#14
Languages
Def. Let Σ be a set of characters. A
language over Σ is a set of strings of
characters drawn from Σ
(Σ is called the alphabet)
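A small sketch of the definition, assuming a two-character alphabet chosen purely for illustration. `SIGMA`, `strings_over`, `over_alphabet` and `ENDS_IN_ZERO` are hypothetical names; the point is only that a language is a set whose members are strings built from the alphabet.

```python
from itertools import product

SIGMA = {"0", "1"}  # an example alphabet, chosen for illustration

def strings_over(sigma, max_len):
    """All strings of characters drawn from sigma, of length 0..max_len."""
    letters = sorted(sigma)
    return {"".join(p) for n in range(max_len + 1) for p in product(letters, repeat=n)}

def over_alphabet(s, sigma):
    """True if every character of s is drawn from sigma."""
    return all(ch in sigma for ch in s)

# A language over SIGMA is any set of such strings, for example the
# strings of length at most 3 that end in "0":
ENDS_IN_ZERO = {s for s in strings_over(SIGMA, 3) if s.endswith("0")}

print(sorted(ENDS_IN_ZERO))
print(over_alphabet("0102", SIGMA))  # "2" is not in the alphabet
```

Languages of interest are usually infinite, so in practice they are described by a rule (here, "ends in 0") rather than listed element by element; the length bound exists only to keep the example finite.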
#15
Examples of Languages
• Alphabet = English
characters
• Language = English
sentences
• Not every string of
English characters is an
English sentence
• Alphabet = ASCII
• Language = C programs
• Note: ASCII character
set is different from
English character set
#16
Notation
• Languages are sets of strings
• Need some notation for specifying which sets
we want
• For lexical analysis we care about regular
languages, which can be described using
regular expressions.
#17
Regular Expressions
and Regular Languages
• Each regular expression is a notation for a