CS 275 Automata and Formal Language Theory
Course Notes
Part II: The Recognition Problem (II)
Chapter II.2.: Basics of Regular Languages and Expressions
Anton Setzer (Based on a book draft by J. V. Tucker and K. Stephenson)
Dept. of Computer Science, Swansea University
http://www.cs.swan.ac.uk/~csetzer/lectures/automataFormalLanguage/current/index.html
April 29, 2016
The next grammar generates the postcodes of the form SA1 8PP or, in general, LLd dLL for digits d and capital letters L, without any leading zeros. We use the notation | as in BNF. We write ␣ for blank.
- Then, in the right-linear case, replace productions
  $A \to a_1 a_2 \cdots a_n B$
  with $n \geq 2$ by productions
  $A \to a_1 A_1,\quad A_1 \to a_2 A_2,\quad \ldots,\quad A_{n-1} \to a_n B$
  for some new nonterminals $A_i$.
- Full details can be found in the additional material.
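The replacement step above can be sketched in Python. This is only an illustration under assumptions that are not part of the notes: a production $A \to a_1 \cdots a_n B$ is represented as a tuple `(lhs, terminals, target)`, and the new nonterminals are given the hypothetical names `N1`, `N2`, ..., which are assumed not to clash with existing nonterminals.

```python
def split_long_productions(productions):
    """Replace each production A -> a1 a2 ... an B with n >= 2 by the chain
    A -> a1 A1, A1 -> a2 A2, ..., A(n-1) -> an B.

    A production is a tuple (lhs, terminals, target), where terminals is a
    string of terminal symbols and target is the nonterminal B.
    """
    result = []
    counter = 0
    for lhs, terminals, target in productions:
        if len(terminals) < 2:
            # Already of the required short form: keep it unchanged.
            result.append((lhs, terminals, target))
            continue
        current = lhs
        for symbol in terminals[:-1]:
            counter += 1
            fresh = f"N{counter}"  # hypothetical name for a new nonterminal
            result.append((current, symbol, fresh))
            current = fresh
        # The last terminal leads to the original target nonterminal B.
        result.append((current, terminals[-1], target))
    return result


print(split_long_productions([("A", "abc", "B")]))
```

Each long production is turned into a chain in which every step emits exactly one terminal, so the generated language is unchanged.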
II.2.1. Regular Languages (12.2)
Mixing of Left- and Right-Linear
Remark
In a regular grammar we are not allowed to mix left-linear and right-linear productions. Otherwise we could obtain languages that are context-free but not regular.
Example (Mixing Left/Right-Linear Rules)
The following grammar generates the language
L(G) = ?
which (as we will see later) is context-free but not regular.
grammar G
terminals a, b
nonterminals S, T
start symbol S
productions S → ab
            S → aT
            T → Sb
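One way to explore the question is to enumerate the words the grammar derives up to some length bound. The following is a minimal sketch, assuming the productions are stored in a dictionary and that single upper-case letters are the nonterminals. The function name `generate` and the bound `max_len` are choices of this sketch, not part of the notes.

```python
from collections import deque

# Productions of the example grammar G.
PRODUCTIONS = {"S": ["ab", "aT"], "T": ["Sb"]}


def generate(max_len):
    """Return all words of length <= max_len derivable from S, shortest first."""
    words = set()
    seen = {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, if any.
        idx = next((i for i, c in enumerate(form) if c in PRODUCTIONS), None)
        if idx is None:
            words.add(form)  # only terminals left: a word of L(G)
            continue
        for rhs in PRODUCTIONS[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # No production shortens a sentential form, so forms that are
            # already too long can be discarded.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(words, key=lambda w: (len(w), w))


print(generate(8))
```

Running it for a few bounds and looking at the shape of the words gives a good hint for the answer.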
II.2.2. Regular Expressions (13.8)
Operators for Forming Languages
Definition
Let $L_1, L_2, L \subseteq T^*$ be languages over the alphabet $T$.
1. The concatenation $L_1.L_2$ of $L_1$ and $L_2$ is defined as
   $L_1.L_2 := \{w_1 w_2 \mid w_1 \in L_1, w_2 \in L_2\}$
2. The union $L_1 \mid L_2$ of $L_1$ and $L_2$ is defined as
   $L_1 \mid L_2 := L_1 \cup L_2$
   The union is sometimes denoted by $+$.
3. The iteration or Kleene-star $L^*$ of $L$ is defined as
   $L^* := \{w_1 w_2 \cdots w_n \mid n \geq 0, w_1, \ldots, w_n \in L\}$
   Note that $\varepsilon \in L^*$.
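For finite languages the three operators can be tried out directly. The sketch below represents a language as a Python set of strings. Since $L^*$ is infinite in general, `star` only returns the words up to a length bound `max_len`; this bound is an assumption of the sketch, not part of the definition.

```python
def concat(l1, l2):
    """L1.L2: every word of L1 followed by every word of L2."""
    return {w1 + w2 for w1 in l1 for w2 in l2}


def union(l1, l2):
    """L1 | L2: set union."""
    return l1 | l2


def star(l, max_len):
    """The words of L* that have length <= max_len."""
    result = {""}      # n = 0 gives the empty word
    frontier = {""}
    while frontier:
        # Append one more word of L to the words found in the last round.
        frontier = {w + v for w in frontier for v in l
                    if len(w + v) <= max_len} - result
        result |= frontier
    return result


print(concat({"a", "ab"}, {"b"}))
print(union({"a"}, {"b"}))
print(star({"ab"}, 4))
```

Note that `star(set(), n)` is `{""}`: even for the empty language the empty word belongs to $L^*$.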
Regular Expressions
Regular expressions are denotations for languages formed from the empty language ∅ and from the languages {a}, where a is an element of the alphabet, by using the above-mentioned operations.
Regular Expressions
Definition
Let $T$ be an alphabet. We define the set of regular expressions over the alphabet $T$ inductively, together with the language $L(E)$ for each regular expression $E$.
- $\emptyset$ is a regular expression, $L(\emptyset) := \emptyset$.
- $\varepsilon$ is a regular expression, $L(\varepsilon) := \{\varepsilon\}$.
- For $a \in T$ we have that $a$ is a regular expression, $L(a) := \{a\}$. One usually writes a for the regular expression, when the symbol is a.
- If $E$, $F$ are regular expressions, then
  - $(E) \mid (F)$ is a regular expression, $L((E) \mid (F)) := L(E) \cup L(F)$.
  - $(E)(F)$ is a regular expression, $L((E)(F)) := L(E).L(F)$.
  - $(E)^*$ is a regular expression, $L((E)^*) := L(E)^*$.
We omit unnecessary brackets and usually write $E \mid F$ instead of $(E) \mid (F)$, $EF$ instead of $(E)(F)$, and $E^*$ instead of $(E)^*$, if there is no confusion.
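The inductive definition translates almost literally into a recursive data type with a recursive function for $L(E)$. The following is a minimal sketch. The class names and the length bound `max_len` (needed because $L(E)$ may be infinite) are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:      # the regular expression ∅
    pass


@dataclass(frozen=True)
class Eps:        # the regular expression ε
    pass


@dataclass(frozen=True)
class Sym:        # a single symbol a of the alphabet
    char: str


@dataclass(frozen=True)
class Alt:        # (E) | (F)
    left: object
    right: object


@dataclass(frozen=True)
class Cat:        # (E)(F)
    left: object
    right: object


@dataclass(frozen=True)
class Star:       # (E)*
    inner: object


def language(e, max_len):
    """The words of L(e) of length <= max_len, following the definition."""
    if isinstance(e, Empty):
        return set()
    if isinstance(e, Eps):
        return {""}
    if isinstance(e, Sym):
        return {e.char} if max_len >= 1 else set()
    if isinstance(e, Alt):
        return language(e.left, max_len) | language(e.right, max_len)
    if isinstance(e, Cat):
        return {u + v
                for u in language(e.left, max_len)
                for v in language(e.right, max_len)
                if len(u + v) <= max_len}
    if isinstance(e, Star):
        base = language(e.inner, max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {e!r}")


# (0 1) 0*
example = Cat(Cat(Sym("0"), Sym("1")), Star(Sym("0")))
print(sorted(language(example, 4)))
```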
Use of Regular Expressions
- We will usually omit writing $L(E)$, so write
  $(0\ 1)\ 0^*$
  instead of
  $L((0\ 1)\ 0^*)$
  which is
  $(\{0\}.\{1\}).(\{0\})^*$
  which is
  $\{0\,1\,0^n \mid n \in \mathbb{N}\}$ or $\{0\,1\,\underbrace{0 \cdots 0}_{n \text{ times}} \mid n \in \mathbb{N}\}$
- We will as well identify regular expressions which denote the same language. Therefore we can omit more brackets, e.g. we can write
  $0\ 1\ 0$
  instead of
  $(0\ 1)\ 0$
Use of Regular Expressions
- If the alphabet only contains single characters, we can omit the blank in concatenation, and write
  $010$ instead of $0\ 1\ 0$
- $*$ only refers to the last item, unless there are brackets:
  - $01^* = \{0(1^n) \mid n \in \mathbb{N}\}$
  - $(01)^* = \{(01)^n \mid n \in \mathbb{N}\}$
- Note that this is different from grammars (including BNF) where recursion is allowed. For instance we can have productions such as
  $S \to aSa$
  or in BNF
  <S> ::= a<S>a
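The scope of $*$ described above can be checked with Python's `re` module, which follows the same convention. The helper name `matches` is an assumption of this sketch; `re.fullmatch` requires the whole word to match.

```python
import re


def matches(pattern, word):
    """True if the whole word belongs to the language of the pattern."""
    return re.fullmatch(pattern, word) is not None


# 01* : a single 0 followed by any number of 1s
print(matches("01*", "0111"), matches("01*", "0101"))

# (01)* : any number of repetitions of the block 01
print(matches("(01)*", "0101"), matches("(01)*", "011"))
```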
Regular Expressions in Programming
- Regular expressions occur very often in programming.
- They occur in
  - Linux/Unix (command grep/egrep),
  - scripting languages (Perl, Python, Ruby)
    (one of the main innovations of Ruby over Python was an improved notation =~ for matching of regular expressions),
  - SQL.
- They are supported in most programming languages by libraries.
Notations for Regular Expressions
- One writes $[a_1 \cdots a_n]$ for $a_1 \mid \cdots \mid a_n$.
- One writes [a-z] for [a, b, c, ..., z], and similarly [0-9], [A-Z].
- [a-zA-Z] := [a-z] | [A-Z], [a-z?] := [a-z] | ? etc.
- One writes $L^+$ or L+ for $L\ L^*$, so $L^+ := \{w_1 \cdots w_n \mid n \geq 1, w_1, \ldots, w_n \in L\}$ is the set of words formed from $L$ by using at least one word in $L$.
- Question: Is $L^+$ the set of non-empty words formed from elements of $L$? Answer:
- Lots of other useful operators for constructing regular expressions have been defined.
- Each language has its own set of regular expressions (often using different notations), and its own syntax. Sometimes operators are introduced which go beyond regular languages.
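These notations can be tried out with Python's `re` module, as one example of a library syntax. The helper `matches` and the sample words are assumptions of this sketch. The last pattern uses a backreference (`\1`), which is an example of an operator that goes beyond regular languages: it describes the words $a^n b a^n$.

```python
import re


def matches(pattern, word):
    """True if the whole word belongs to the language of the pattern."""
    return re.fullmatch(pattern, word) is not None


# [a-zA-Z] denotes the same language as [a-z]|[A-Z]
for ch in ["q", "Q", "7"]:
    print(ch, matches("[a-zA-Z]", ch), matches("[a-z]|[A-Z]", ch))

# [a-z]+ denotes the same language as [a-z][a-z]*
for word in ["", "x", "xyz"]:
    print(repr(word), matches("[a-z]+", word), matches("[a-z][a-z]*", word))

# Backreference: \1 must repeat exactly what the first group matched.
print(matches(r"(a*)b\1", "aabaa"), matches(r"(a*)b\1", "aaba"))
```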
Example Use of Regular Expressions
- Assume you have files called automatatheorych1.tex, automatatheorych2.tex, automatatheorych3.tex, ... Concatenate all of them into one file:
  cssetzer@cs-svr1:> cat automatatheorych[0-9].tex > automatatheoryall.tex
- Process lines in a file containing entries separated by ",", and do something if the first field is a student number (a non-empty string consisting of digits only). Python code:
    import re

    # filename is assumed to hold the path of the input file
    regExpStud = re.compile(r'^[0-9]+$')  # + instead of *: reject an empty field
    with open(filename) as file:
        for line in file:
            a = line.split(',')
            if regExpStud.match(a[0]):
                print(a[1].rstrip('\n'))  # cut off trailing '\n'
Example WebLinks
Consider links in HTML pages of the form:
<a href="http://www.swan.ac.uk/">Swansea University</a>
displayed as
Swansea University
The set of weblinks can be defined as (in most languages ␣ would be written as a blank, " would be preceded by a \, and the whole string would be put into quotation marks):
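As an illustration of the remark about quoting, here is a rough Python sketch. The pattern is an assumption of this sketch and not the definition from the notes: it only accepts links whose address starts with http:// and whose text contains no further tags. Inside the double-quoted Python string every " of the pattern is preceded by a \.

```python
import re

# Hypothetical pattern for simple weblinks; blanks are written as blanks
# and each double quote is escaped as \" inside the string literal.
WEBLINK = re.compile("<a href=\"(http://[^\"]*)\">([^<]*)</a>")

sample = "<a href=\"http://www.swan.ac.uk/\">Swansea University</a>"
found = WEBLINK.fullmatch(sample)
if found:
    print("address:", found.group(1))
    print("text:   ", found.group(2))
```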
- In computer security one very often needs to check for occurrences of certain patterns.
- For instance, in order to locate a certain virus, which might consist of 3 pieces of code $s_1$, $s_2$, $s_3$ separated by some normal code, one could search for the regular expression
  $s_1 [a-z]^* s_2 [a-z]^* s_3$
  (How would you allow for $s_1$, $s_2$ and $s_3$ occurring in a different order?)
- Of course in general you need to check for much more sophisticated patterns.
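The search for three pieces separated by lower case letters can be sketched as follows. The function name and the sample pieces are hypothetical. `re.escape` is used so that the pieces are taken literally, even if they contain characters with a special meaning in regular expressions.

```python
import re


def contains_pieces(code, s1, s2, s3):
    """Search code for s1 [a-z]* s2 [a-z]* s3 (pieces taken literally)."""
    pattern = (re.escape(s1) + "[a-z]*" +
               re.escape(s2) + "[a-z]*" +
               re.escape(s3))
    return re.search(pattern, code) is not None


# Hypothetical pieces, for illustration only.
print(contains_pieces("xxevilabcpayloadzzrunyy", "evil", "payload", "run"))
print(contains_pieces("runabcpayloadevil", "evil", "payload", "run"))
```

Unlike a full match, `re.search` looks for an occurrence anywhere inside the text, which is what is needed when scanning code for a pattern.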
Usage of Regular Expressions in Computer Security
- In order to check that a password is safe enough, which might mean that it consists of digits and lower case characters and contains at least one digit and one lower case character, you would check whether it matches
  ( ([a-z] | [0-9])* [a-z] ([a-z] | [0-9])* [0-9] ([a-z] | [0-9])* ) |
  ( ([a-z] | [0-9])* [0-9] ([a-z] | [0-9])* [a-z] ([a-z] | [0-9])* )
  Of course you would usually use a much more sophisticated regular expression.
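The password expression can be written down in Python's `re` syntax as a direct transcription. The names `LD`, `PASSWORD` and `is_safe_enough` are assumptions of this sketch.

```python
import re

# ([a-z] | [0-9]), written as a non-capturing group
LD = "(?:[a-z]|[0-9])"

# Either a letter occurs before a digit, or a digit before a letter.
PASSWORD = re.compile(
    f"(?:{LD}*[a-z]{LD}*[0-9]{LD}*)|(?:{LD}*[0-9]{LD}*[a-z]{LD}*)"
)


def is_safe_enough(password):
    """True if the whole password matches the expression above."""
    return PASSWORD.fullmatch(password) is not None


for candidate in ["abc1", "1abc", "abc", "123", "Abc1", ""]:
    print(repr(candidate), is_safe_enough(candidate))
```

The two alternatives are needed because a regular expression has to spell out the order in which the required letter and the required digit occur.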
Usage of Regular Expressions in Computer Security
- Malicious patterns in requests from the outside can often be expressed as a regular expression; you then search the incoming stream for matches of that expression.
SQL Injection
- Regular expressions can be used to detect attempts of SQL injection.
- Example of SQL injection (from Wikipedia on SQL Injection): Assume the following statement in some code
  statement = "SELECT * FROM users WHERE name ='" + userName + "';"
- This statement is supposed to be sent to the SQL server. Then one checks the resulting entries for whether the supplied password matches, modulo encryption, one of the password entries for that user name.
- Assume an attacker tries to log in with user name
  ' or '1'='1
  Then the statement sent to the SQL server will be
  SELECT * FROM users WHERE name ='' or '1'='1';
  which matches all users.
SQL Injection
- This might allow the attacker to have the supplied password checked against the entries of all users, which makes a match, and hence a successful login, more likely.
- In order to avoid this kind of attack you can check whether the user name matches any malicious pattern. Such patterns can be expressed by regular expressions.
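The attack and a very simple check can be sketched in Python. The pattern `SUSPICIOUS` is a hypothetical example chosen for this sketch (quotes, semicolons, SQL comment markers and the word "or"); it is neither complete nor free of false positives.

```python
import re


def build_statement(user_name):
    """Build the statement by string concatenation, as in the example."""
    return "SELECT * FROM users WHERE name ='" + user_name + "';"


# Hypothetical pattern: quotes, semicolons, "--" and the word "or".
SUSPICIOUS = re.compile(r"['\";]|--|\bor\b", re.IGNORECASE)


def looks_malicious(user_name):
    """True if the user name contains one of the suspicious fragments."""
    return SUSPICIOUS.search(user_name) is not None


attack = "' or '1'='1"
print(build_statement(attack))
print(looks_malicious(attack), looks_malicious("alice"))
```

Such a filter also rejects harmless names like O'Brien. In practice pattern checks are usually combined with parameterised queries, in which the user name is never pasted into the statement text.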
Usage of Regular Expressions in Computer Security
- The above were just some (here very simple) examples of how regular expressions can be used in computer security to detect patterns corresponding to attacks on, or weaknesses of, a system.
Closure of Regular Languages
The main lemma for showing that regular expressions define regular languages is as follows:
Lemma (II.2.2.1.)
Let $G$, $G'$ be both left-linear grammars or both right-linear grammars. Then we can define grammars $G_1$, $G_2$, $G_3$, which are left-linear or right-linear respectively, s.t.
1. $L(G_1) = L(G) \mid L(G')$,
2. $L(G_2) = L(G).L(G')$,
3. $L(G_3) = L(G)^*$.
These grammars can be computed from $G$ and $G'$.
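To give an idea of why such grammars can be computed, here is a sketch of two common constructions for the right-linear case (union and concatenation). It is not necessarily the construction used in the additional material. A grammar is represented as a pair `(start, productions)`, and a production $A \to wB$ or $A \to w$ as a tuple `(A, w, B)` or `(A, w, None)`; this representation, all function names, and the fresh start symbol `S0` are assumptions of this sketch.

```python
def rename(grammar, suffix):
    """Make the nonterminals of the two grammars disjoint."""
    start, prods = grammar
    return start + suffix, [(a + suffix, w, None if b is None else b + suffix)
                            for a, w, b in prods]


def union(g1, g2):
    """Grammar for L(g1) | L(g2): a new start symbol copies both starts."""
    s1, p1 = rename(g1, "_1")
    s2, p2 = rename(g2, "_2")
    copies = [("S0", w, b) for a, w, b in p1 + p2 if a in (s1, s2)]
    return "S0", copies + p1 + p2


def concat(g1, g2):
    """Grammar for L(g1).L(g2): where g1 would stop, continue with g2."""
    s1, p1 = rename(g1, "_1")
    s2, p2 = rename(g2, "_2")
    linked = [(a, w, s2 if b is None else b) for a, w, b in p1]
    return s1, linked + p2


def words(grammar, max_len):
    """All words of length <= max_len generated by a right-linear grammar."""
    start, prods = grammar
    found, seen, todo = set(), {("", start)}, [("", start)]
    while todo:
        prefix, nonterminal = todo.pop()
        for a, w, b in prods:
            if a != nonterminal or len(prefix + w) > max_len:
                continue
            if b is None:
                found.add(prefix + w)
            elif (prefix + w, b) not in seen:
                seen.add((prefix + w, b))
                todo.append((prefix + w, b))
    return found


g = ("S", [("S", "a", "S"), ("S", "b", None)])    # generates a*b
h = ("S", [("S", "c", "S"), ("S", "d", None)])    # generates c*d
print(sorted(words(union(g, h), 2)))
print(sorted(words(concat(g, h), 3)))
```

The iteration $L(G)^*$ can be handled in a similar way, by letting terminating productions return to a new start symbol that also derives the empty word.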
Proof
A proof can be found in the additional material for this subsection.
Regular Expressions define Regular Languages
Lemma (II.2.2.2.)
Let $E$ be a regular expression. Then there exist both left-linear and right-linear grammars $G$, $G'$ s.t.
$L(E) = L(G) = L(G')$
$G$ and $G'$ can be computed from $E$.
Proof: By induction on the structure of $E$, using Lemma II.2.2.1 and the fact that the finite languages $\emptyset$, $\{\varepsilon\}$ and $\{a\}$ are regular. Full details can be found in the additional material.