CS 275 Automata and Formal Language Theory

Course Notes

Part II: The Recognition Problem (II)

Chapter II.2.: Basics of Regular Languages and Expressions

Anton Setzer

(Based on a book draft by J. V. Tucker and K. Stephenson)

Dept. of Computer Science, Swansea University

http://www.cs.swan.ac.uk/~csetzer/lectures/automataFormalLanguage/current/index.html

April 29, 2016

CS 275 Chapter II.2. 1/ 44


II.2.1. Regular Languages (12.2)

II.2.2. Regular Expressions (13.8)



Finite Languages are Regular

grammar G_{ab,aabb,aaabbb}

terminals a, b

nonterminals S

start symbol S

productions S −→ ab
            S −→ aabb
            S −→ aaabbb

This grammar is not regular, since in a regular grammar the right-hand side of a production may contain at most one terminal. But we can amend this:





terminals a, b

nonterminals S, S1, S2, S3, S4, S5, S6, S7, S8, S9

start symbol S

productions S −→ aS1, S1 −→ b
            S −→ aS2, S2 −→ aS3, S3 −→ bS4, S4 −→ b
            S −→ aS5, S5 −→ aS6, S6 −→ aS7, S7 −→ bS8, S8 −→ bS9, S9 −→ b




If one wishes, the above grammar can be optimised as follows:


terminals a, b

nonterminals S, S1, S3, S4, S7, S8, S9

start symbol S

productions S −→ aS1, S1 −→ b
            S1 −→ aS3, S3 −→ bS4, S4 −→ b
            S3 −→ aS7, S7 −→ bS8, S8 −→ bS9, S9 −→ b



Lemma II.2.1.1.

Lemma (II.2.1.1.)

All finite languages are regular, and a regular grammar for them can be computed.

Proof: Extend the example above.
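The idea of extending the example can be sketched in Python. This is a minimal illustration, not the proof from the course material: the function name regular_grammar_for and the representation of a production as a triple (lhs, terminal, successor) are assumptions made here. Each word gets its own chain of fresh nonterminals, exactly as in the unoptimised grammar above.

```python
def regular_grammar_for(words):
    """Build right-linear productions generating exactly the finite set `words`.

    A production is a triple (lhs, terminal, successor); successor is None
    for a terminating production A -> a. The empty word is represented by
    the production ("S", "", None), i.e. S -> ε.
    """
    productions = []
    counter = 0
    # dict.fromkeys removes duplicates while keeping the input order.
    for word in dict.fromkeys(words):
        if word == "":
            productions.append(("S", "", None))
            continue
        current = "S"
        for symbol in word[:-1]:
            counter += 1
            fresh = f"S{counter}"          # fresh nonterminal for the next step
            productions.append((current, symbol, fresh))
            current = fresh
        productions.append((current, word[-1], None))
    return productions


for lhs, terminal, successor in regular_grammar_for(["ab", "aabb", "aaabbb"]):
    print(lhs, "->", terminal + (successor or ""))
```

For the three words of the example this prints the twelve productions of the grammar with nonterminals S, S1, ..., S9.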



A Left-Linear Grammar for a^m b^n

The following left-linear grammar generates {a^m b^n | m, n ≥ 1}.

grammar G_{left-linear, a^m b^n}

terminals a, b

nonterminals S, T

start symbol S

productions S −→ Sb
            S −→ Tb
            T −→ Ta
            T −→ a



A Right-Linear Grammar for a^m b^n

The following right-linear grammar generates {a^m b^n | m, n ≥ 1}:

grammar G_{right-linear, a^m b^n}

terminals a, b

nonterminals S, T

start symbol S

productions S −→ aS
            S −→ aT
            T −→ bT
            T −→ b
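To experiment with the right-linear grammar, one can enumerate all words it derives up to a given length. The following is a small sketch under the assumption that every production emits exactly one terminal; the names words_up_to and RIGHT_LINEAR_AMBN are illustrative and not part of the course material.

```python
def words_up_to(productions, start, max_len):
    """Return all words of length <= max_len derivable from `start`.

    `productions` maps a nonterminal to a list of (terminal, successor)
    pairs; successor is None for a terminating production A -> a.
    """
    words = set()
    # Each work item is (terminals produced so far, current nonterminal).
    pending = [("", start)]
    while pending:
        prefix, nonterminal = pending.pop()
        for terminal, successor in productions[nonterminal]:
            word = prefix + terminal
            if len(word) > max_len:
                continue
            if successor is None:
                words.add(word)
            else:
                pending.append((word, successor))
    return words


RIGHT_LINEAR_AMBN = {
    "S": [("a", "S"), ("a", "T")],
    "T": [("b", "T"), ("b", None)],
}

generated = words_up_to(RIGHT_LINEAR_AMBN, "S", 5)
expected = {"a" * m + "b" * n
            for m in range(1, 5) for n in range(1, 5) if m + n <= 5}
print(generated == expected)
```

The comparison checks the claim only for words of length at most 5; it is a sanity check, not a proof.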



Right-Linear Grammar for Numbers

Here is a right-linear grammar for numbers without leading zeros. We use “|” as in BNF.

grammar GNumber

terminals 0, 1, . . . , 9

nonterminals Number, Digits

start symbol Number

productions Number −→ 0
            Number −→ 1 Digits | 2 Digits | · · · | 9 Digits
            Digits −→ 0 Digits | 1 Digits | · · · | 9 Digits
            Digits −→ ε



Right-Linear Grammar for Numbers

Why didn’t we use the following as in BNF?

grammar GNumber

terminals 0, 1, . . . , 9

nonterminals Number, Digit, NonZeroDigit, Digits

start symbol Number

productions Number −→ Digit | NonZeroDigit Digits
            Digits −→ Digit | Digit Digits
            Digit −→ 0 | NonZeroDigit
            NonZeroDigit −→ 1 | 2 | · · · | 9

Answer:



Right-Linear Grammar for Post Codes

The next grammar generates the postcodes of the form SA1 8PP or, in general, LLd dLL for digits d and capital letters L without any leading zeros. We use the notation | as in BNF. We write ␣ for a blank.



Right-Linear Grammar for Post Codes

grammar GPostcode

terminals 0, 1, . . . , 9, A, B, . . . , Z, ␣

nonterminals postcode, letter2, digit1, blank1, digit2, letter3, letter4

start symbol postcode

productions postcode −→ A letter2 | B letter2 | · · · | Z letter2
            letter2 −→ A digit1 | B digit1 | · · · | Z digit1
            digit1 −→ 0 blank1 | 1 blank1 | · · · | 9 blank1
            blank1 −→ ␣ digit2
            digit2 −→ 0 letter3 | 1 letter3 | · · · | 9 letter3
            letter3 −→ A letter4 | B letter4 | · · · | Z letter4
            letter4 −→ A | B | · · · | Z



Example Derivation

Here is a derivation of SA1␣8PP ∈ L(GPostcode):

postcode ⇒ S letter2

⇒ SA digit1

⇒ SA1 blank1

⇒ SA1␣ digit2

⇒ SA1␣8 letter3

⇒ SA1␣8P letter4

⇒ SA1␣8PP
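Because each nonterminal of GPostcode has exactly one possible successor, a derivation can be reconstructed mechanically from the word. The sketch below illustrates this; the names STEPS and derive are hypothetical, and the blank is typed as an ordinary space and only displayed as ␣.

```python
import string

# The nonterminals of G_Postcode in the order they are visited, each with
# the terminals it may emit.
STEPS = [
    ("postcode", string.ascii_uppercase),
    ("letter2", string.ascii_uppercase),
    ("digit1", string.digits),
    ("blank1", " "),
    ("digit2", string.digits),
    ("letter3", string.ascii_uppercase),
    ("letter4", string.ascii_uppercase),
]


def derive(word):
    """Return the sentential forms deriving `word`, or None if word is not in the language."""
    if len(word) != len(STEPS):
        return None
    forms = ["postcode"]
    for i, (symbol, (_, allowed)) in enumerate(zip(word, STEPS)):
        if symbol not in allowed:
            return None
        prefix = word[: i + 1].replace(" ", "␣")
        if i + 1 < len(STEPS):
            forms.append(f"{prefix} {STEPS[i + 1][0]}")   # terminals so far + next nonterminal
        else:
            forms.append(prefix)                           # last step: terminals only
    return forms


for form in derive("SA1 8PP"):
    print("⇒", form)
```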



Easier Proof that Postcodes are Regular

Can you give an easier proof that the language of postcodes is regular (both left-linear and right-linear)?



Multi-step Regular Grammars

• In general we can extend regular grammars by allowing productions such as

S −→ abB
B −→ aS
B −→ baS

So instead of having only one terminal symbol, we can have several.

• As long as we remain left-linear or right-linear
    • i.e. the terminal symbols are always to the right or always to the left of the non-terminal on the right hand side of a rule
  we obtain grammars which can be reduced to regular grammars.



Lemma II.2.1.2.

Lemma (II.2.1.2.)

1. Assume a grammar G which has only productions of the form

A −→ Bw or A −→ w

for some w ∈ T∗, A, B ∈ N. Then L(G) = L(G′) for some left-linear grammar G′, which can be computed from G.

2. Assume a grammar G which has only productions of the form

A −→ wB or A −→ w

for some w ∈ T∗, A, B ∈ N. Then L(G) = L(G′) for some right-linear grammar G′, which can be computed from G.



Multi-step Right-Linear/Left-Linear/Regular Grammars

We call grammars as above multi-step right-linear/left-linear/regular grammars.



Proof Idea for Lemma II.2.1.2.

• In the right-linear case, replace productions

A −→ a1 a2 · · · an B

with n ≥ 2 by productions

A −→ a1 A1, A1 −→ a2 A2, · · · , An−1 −→ an B

for some new nonterminals Ai.

• Full details can be found in the additional material.
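The replacement step of the proof idea can be sketched as follows. This is only an illustration of that single step under stated assumptions: productions are triples (lhs, word, tail) with tail None for A −→ w, fresh nonterminals are named by appending an underscore and a counter (assumed not to clash with existing names), and productions with fewer than two terminals are left untouched. The function name split_right_linear and the terminating production B −→ ba in the usage example are hypothetical.

```python
from itertools import count


def split_right_linear(productions):
    """Split every production A -> a1...an B (or A -> a1...an) with n >= 2
    into a chain of productions that emit one terminal each."""
    fresh = count(1)
    result = []
    for lhs, word, tail in productions:
        if len(word) <= 1:
            result.append((lhs, word, tail))   # already single-step
            continue
        current = lhs
        for symbol in word[:-1]:
            new = f"{lhs}_{next(fresh)}"       # new nonterminal A_i
            result.append((current, symbol, new))
            current = new
        result.append((current, word[-1], tail))
    return result


multi_step = [("S", "ab", "B"), ("B", "a", "S"), ("B", "ba", None)]
for lhs, word, tail in split_right_linear(multi_step):
    print(lhs, "->", word + (tail or ""))
```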



Mixing of Left- and Right-Linear

Remark

In a regular grammar we are not allowed to mix left-linear and right-linear productions. Otherwise we could obtain languages which are context-free but not regular.



Example (Mixing Left/Right-Linear Rules)

The following grammar generates the language L(G) = ? which (as we will see later) is context-free but not regular.

grammar G

terminals a, b

nonterminals S ,T

start symbol S

productions S −→ ab
            S −→ aT
            T −→ Sb
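To explore which words this grammar generates, one can enumerate its derivations up to a length bound. The sketch below assumes that nonterminals are single uppercase letters and that every sentential form contains at most one nonterminal, which holds for this grammar; the names derive_words and MIXED are illustrative.

```python
def derive_words(productions, start, max_len):
    """Collect the terminal words of length <= max_len derivable from `start`.

    A sentential form is a plain string such as "aSb"; uppercase letters
    are nonterminals, everything else is a terminal.
    """
    words, seen, pending = set(), {start}, [start]
    while pending:
        form = pending.pop()
        position = next((i for i, c in enumerate(form) if c.isupper()), None)
        if position is None:
            words.add(form)            # no nonterminal left: a derived word
            continue
        for rhs in productions[form[position]]:
            new_form = form[:position] + rhs + form[position + 1:]
            terminals = sum(1 for c in new_form if not c.isupper())
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                pending.append(new_form)
    return words


MIXED = {"S": ["ab", "aT"], "T": ["Sb"]}
for word in sorted(derive_words(MIXED, "S", 8), key=len):
    print(word)
```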



Operators for Forming Languages

Definition

Let L1, L2, L ⊆ T∗ be languages over the alphabet T.

1. The concatenation L1.L2 of L1 and L2 is defined as

L1.L2 := {w1w2 | w1 ∈ L1, w2 ∈ L2}

2. The union L1 | L2 of L1 and L2 is defined as

L1 | L2 := L1 ∪ L2

The union is sometimes denoted by +.

3. The iteration or Kleene-star L∗ of L is defined as

L∗ := {w1w2 · · · wn | n ≥ 0, w1, . . . , wn ∈ L}

Note that ε ∈ L∗.
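For finite languages the three operators can be tried out directly with Python sets of strings. This is a sketch with illustrative function names; since L∗ is usually infinite, the star is computed only up to an assumed length bound max_len.

```python
def concat(l1, l2):
    """L1.L2: every word of l1 followed by every word of l2."""
    return {w1 + w2 for w1 in l1 for w2 in l2}


def union(l1, l2):
    """L1 | L2."""
    return l1 | l2


def star_up_to(language, max_len):
    """The words of L* that have length <= max_len."""
    result = {""}        # n = 0 contributes the empty word ε
    frontier = {""}
    while frontier:
        # Append one more word of the language to the words found last round.
        frontier = {w + v for w in frontier for v in language
                    if len(w + v) <= max_len} - result
        result |= frontier
    return result


print(sorted(concat({"a", "ab"}, {"", "b"})))
print(sorted(star_up_to({"ab"}, 4)))
```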


Regular Expressions

Regular expressions are notations for languages formed from the empty language ∅ and from the languages {a}, where a is an element of the alphabet, by using the above-mentioned operations.



Regular Expressions

Definition

Let T be an alphabet. We define the set of regular expressions over the alphabet T inductively, together with the language L(E) for each regular expression E.

• ∅ is a regular expression, L(∅) := ∅.
• ε is a regular expression, L(ε) := {ε}.
• For a ∈ T, a is a regular expression, L(a) := {a}. One usually writes a for the regular expression, when the symbol is a.
• If E, F are regular expressions, then
    • (E) | (F) is a regular expression, L((E) | (F)) := L(E) ∪ L(F).
    • (E)(F) is a regular expression, L((E)(F)) := L(E).L(F).
    • (E)∗ is a regular expression, L((E)∗) := L(E)∗.

We omit unnecessary brackets and usually write E | F instead of (E) | (F), EF instead of (E)(F), E∗ instead of (E)∗, if there is no confusion.
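The inductive definition translates almost literally into a small data type with one case per clause. The following is a sketch, not part of the course material: the class names and the function language are assumptions, and L(E) is computed only up to a length bound because it may be infinite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:        # ∅
    pass


@dataclass(frozen=True)
class Epsilon:      # ε
    pass


@dataclass(frozen=True)
class Symbol:       # a, for a in T
    char: str


@dataclass(frozen=True)
class Union:        # (E) | (F)
    left: object
    right: object


@dataclass(frozen=True)
class Concat:       # (E)(F)
    left: object
    right: object


@dataclass(frozen=True)
class Star:         # (E)*
    inner: object


def language(expr, max_len):
    """Return the words of L(expr) that have length <= max_len."""
    if isinstance(expr, Empty):
        return set()
    if isinstance(expr, Epsilon):
        return {""}
    if isinstance(expr, Symbol):
        return {expr.char} if len(expr.char) <= max_len else set()
    if isinstance(expr, Union):
        return language(expr.left, max_len) | language(expr.right, max_len)
    if isinstance(expr, Concat):
        return {u + v
                for u in language(expr.left, max_len)
                for v in language(expr.right, max_len)
                if len(u + v) <= max_len}
    if isinstance(expr, Star):
        base = language(expr.inner, max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {expr!r}")


# The expression (0 1) 0*
example = Concat(Concat(Symbol("0"), Symbol("1")), Star(Symbol("0")))
print(sorted(language(example, 4)))   # ['01', '010', '0100']
```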



Use of Regular Expressions

• We will usually omit writing L(E), so write

(0 1) 0∗

instead of

L((0 1) 0∗)

which is

({0}.{1}).({0})∗

which is

{01 0^n | n ∈ N} or {01 0 · · · 0 (n times) | n ∈ N}

• We will as well identify regular expressions which denote the same language. Therefore we can omit more brackets, e.g. we can write

0 1 0

instead of

(0 1) 0



Use of Regular Expressions

• If the alphabet only contains single characters, we can omit the blank in concatenation, and write

010 instead of 0 1 0

• ∗ only refers to the last item, unless there are brackets:
    • 01∗ = {0 1^n | n ∈ N}
    • (01)∗ = {(01)^n | n ∈ N}
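The same binding rule for ∗ holds in the regular expression syntax of Python's re module, which makes it easy to check. A brief sketch, with matches as an illustrative helper name:

```python
import re


def matches(pattern, word):
    """True if the whole word belongs to the language of the pattern."""
    return re.fullmatch(pattern, word) is not None


print(matches(r"01*", "0111"))     # True: the star applies to 1 only
print(matches(r"01*", "0101"))     # False
print(matches(r"(01)*", "0101"))   # True: the star applies to the bracketed 01
print(matches(r"(01)*", "0111"))   # False
```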



Examples of Regular Expressions

• The set of non-zero digits is defined as

NonZeroDigit = 1 | 2 | · · · | 9

• The set of digits is defined as

Digit = 0 | NonZeroDigit

• The set of numbers without leading zero is

Number = 0 | (NonZeroDigit Digit∗)

• The set of capital letters is defined by

CapitalLetter = A | B | · · · | Z
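These named expressions can be mirrored in Python by building pattern strings from smaller ones. This is a sketch using the syntax of the re module; the constant names and the helper is_number are illustrative, and [1-9] is used as shorthand for 1 | 2 | · · · | 9.

```python
import re

NON_ZERO_DIGIT = "[1-9]"
DIGIT = f"(0|{NON_ZERO_DIGIT})"
NUMBER = f"(0|{NON_ZERO_DIGIT}{DIGIT}*)"


def is_number(text):
    """True if text is a number without leading zeros."""
    return re.fullmatch(NUMBER, text) is not None


for candidate in ["0", "275", "007", ""]:
    print(repr(candidate), is_number(candidate))
```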




• The set of module codes in this department is

CSModuleCodes = CS- (0 | 1 | 2 | 3 | M) Digit Digit




• The set of postcodes can be defined as

postcode = CapitalLetter CapitalLetter Digit ␣ Digit CapitalLetter CapitalLetter
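As a sketch, the module-code and postcode expressions can be written with Python's re module by concatenating pattern strings. The constant names are illustrative, and the blank ␣ is written as an ordinary space.

```python
import re

CAPITAL_LETTER = "[A-Z]"
DIGIT = "[0-9]"

# CS- (0 | 1 | 2 | 3 | M) Digit Digit
CS_MODULE_CODE = "CS-" + "(0|1|2|3|M)" + DIGIT + DIGIT

# CapitalLetter CapitalLetter Digit ␣ Digit CapitalLetter CapitalLetter
POSTCODE = (CAPITAL_LETTER + CAPITAL_LETTER + DIGIT + " "
            + DIGIT + CAPITAL_LETTER + CAPITAL_LETTER)

print(re.fullmatch(POSTCODE, "SA1 8PP") is not None)
print(re.fullmatch(CS_MODULE_CODE, "CS-275") is not None)
```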



Regular Expressions are Non-recursive

• Please note that regular expressions are non-recursive. For instance in

postcode = CapitalLetter CapitalLetter Digit ␣ Digit CapitalLetter CapitalLetter

“postcode” doesn’t occur on the right-hand side.

• Note that this is different from grammars (including BNF) where recursion is allowed. For instance we can have productions such as

S −→ aSa

or in BNF

〈S〉 ::= a〈S〉a



Regular Expressions in Programming

• Regular expressions occur very often in programming.
• They occur in
    • Linux/Unix (command grep/egrep),
    • scripting languages (Perl, Python, Ruby) (one of the main innovations of Ruby over Python was an improved notation ~ for matching of regular expressions),
    • SQL.
• They are supported in most programming languages by libraries.



Notations for Regular Expressions

• One writes [a1 · · · an] for a1 | · · · | an.

• One writes [a-z] for [a b c . . . z], similarly for [0-9], [A-Z].

• [a-zA-Z] := [a-z] | [A-Z], [a-z?] := [a-z] | ? etc.

• One writes L+ or L^+ for L L∗, so L+ := {w1 · · · wn | n ≥ 1, w1, . . . , wn ∈ L}, the set of words formed from L by using at least one word in L.

• Question: Is L+ the set of non-empty words formed from elements of L?
  Answer:

• Lots of other useful operators for constructing regular expressions have been defined.

• Each language has its own set of regular expressions (often using different notations), and its own syntax. Sometimes operators are introduced which go beyond regular languages.



Example Use of Regular Expressions

• Assume you have files called automatatheorych1.tex, automatatheorych2.tex, automatatheorych3.tex, . . . Concatenate all of them into one file:

    cssetzer@cs-svr1:> cat automatatheorych[0-9].tex > automatatheoryall.tex

• Process lines in a file containing entries separated by “,”, and do something if the first field is a student number (a string consisting of digits only). Python code:

    import re

    regExpStud = re.compile('^[0-9]*$')   # digits only (also matches the empty string)
    with open(filename) as file:          # filename: path of the input file
        for line in file:
            a = line.split(',')
            if regExpStud.match(a[0]):
                print(a[1][:-1])          # cut off trailing '\n'



Example WebLinks

Consider links in HTML pages of the form:

<a href="http://www.swan.ac.uk/">Swansea University</a>

displayed as

Swansea University

The set of weblinks can be defined as follows (in most languages ␣ would be written as a blank, " would be preceded by a \, and the whole string would be put into quotation marks):

webLinks = <a␣href="[a-zA-Z0-9/.:]∗">[a-zA-Z0-9␣]∗</a>

E.g. in Python one would write

weblinks = "<a href=\"[a-zA-Z0-9/.:]*\">[a-zA-Z0-9 ]*</a>"
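A short usage sketch shows how such a pattern can be applied with re.findall to extract all links from a piece of text. The variable page and its content are made up for illustration.

```python
import re

weblinks = "<a href=\"[a-zA-Z0-9/.:]*\">[a-zA-Z0-9 ]*</a>"

# Hypothetical page content used only for this example.
page = 'Visit <a href="http://www.swan.ac.uk/">Swansea University</a> today.'

# The pattern has no groups, so findall returns the complete matches.
links = re.findall(weblinks, page)
print(links)
```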



Usage of Regular Expressions in Computer Security

• In computer security one very often needs to check for occurrences of certain patterns.

• For instance in order to locate a certain virus, which might consist of 3 pieces of code s1, s2, s3, separated by some normal code, one could search for the regular expression

s1[a-z]∗s2[a-z]∗s3

(How do you obtain that s1, s2 and s3 might occur in different order?)

• Of course in general you need to check for much more sophisticated patterns.




• In order to check that a password is safe enough, which might mean that it consists of digits and lower case characters, with at least one digit and at least one lower case character, you check whether it matches

(([a-z] | [0-9])∗ [a-z] ([a-z] | [0-9])∗ [0-9] ([a-z] | [0-9])∗) |
(([a-z] | [0-9])∗ [0-9] ([a-z] | [0-9])∗ [a-z] ([a-z] | [0-9])∗)

Of course you would usually use a much more sophisticated regular expression.
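The password expression can be transcribed into Python's re syntax as follows. This is a sketch; the names CHAR, SAFE and is_safe are illustrative.

```python
import re

CHAR = "([a-z]|[0-9])"

# Either a letter occurs before a digit, or a digit occurs before a letter.
SAFE = (f"({CHAR}*[a-z]{CHAR}*[0-9]{CHAR}*)"
        f"|({CHAR}*[0-9]{CHAR}*[a-z]{CHAR}*)")


def is_safe(password):
    """True if password has only lower case letters and digits, and at least one of each."""
    return re.fullmatch(SAFE, password) is not None


for candidate in ["abc1", "1a", "abc", "123", "Abc1"]:
    print(candidate, is_safe(candidate))
```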




• Malicious patterns in requests from the outside can often be expressed as a regular expression; one then searches the incoming stream for matches of that expression.



SQL Injection

• Regular expressions can be used to detect attempts at SQL injection.
• Example of SQL injection (from Wikipedia on SQL Injection): Assume the following statement in a program:

statement = "SELECT * FROM users WHERE name = '" + userName + "';"

• This statement is supposed to be sent to the SQL server. Then one checks the resulting entries for whether the supplied password matches, modulo encryption, one of the password entries for that user name.

• Assume an attacker tries to log in with user name

' or '1'='1

Then the statement sent to the SQL server will be

SELECT * FROM users WHERE name = '' or '1'='1';

which matches all users.


SQL Injection

• This might allow the attacker to check whether the supplied password matches that of any user, which makes a match, and hence a successful login, more likely.

• In order to avoid this kind of attack you can check whether the user name matches any malicious pattern. Such patterns can be expressed by regular expressions.
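Such a check might be sketched in Python as follows. Both patterns are assumptions chosen for illustration, not rules from the course: one flags characters that are needed for the injection shown above, the other accepts only user names of an assumed harmless shape. In practice such pattern checks are usually combined with parameterised queries rather than used alone.

```python
import re

# Assumption: a quote, a semicolon or the SQL comment marker -- is suspicious.
SUSPICIOUS = re.compile(r"""['";]|--""")

# Assumption: legitimate user names are 1 to 32 letters, digits or underscores.
VALID_USER_NAME = re.compile(r"[A-Za-z0-9_]{1,32}")


def is_suspicious(user_name):
    """True if the user name contains one of the suspicious patterns."""
    return SUSPICIOUS.search(user_name) is not None


def is_valid(user_name):
    """True if the whole user name has the assumed harmless shape."""
    return VALID_USER_NAME.fullmatch(user_name) is not None


attack = "' or '1'='1"
print(is_suspicious(attack), is_valid(attack))
print(is_suspicious("csetzer"), is_valid("csetzer"))
```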




• The above were just some (here very simple) examples of how regular expressions can be used in computer security to detect certain patterns corresponding to attacks or weaknesses of a system.



Closure of Regular Languages

The main lemma for showing that regular expressions define regular languages is as follows:

Lemma (II.2.2.1.)

Let G, G′ be both left-linear grammars or both right-linear grammars. Then we can define left-linear (respectively right-linear) grammars Gi s.t.

1. L(G1) = L(G ) | L(G ′),

2. L(G2) = L(G ).L(G ′),

3. L(G3) = L(G )∗.

These grammars can be computed from G and G ′.
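For the first item of the lemma, one common construction can be sketched for right-linear grammars: rename the nonterminals so the two grammars are disjoint, then add a new start symbol that copies the start productions of both. This is an illustration under the assumption that every production emits exactly one terminal; it is not necessarily the construction used in the additional material, and all names (union_grammar, words_up_to, AMBN, C_PLUS) are hypothetical. Concatenation and iteration are not covered here.

```python
def union_grammar(g1, start1, g2, start2):
    """Right-linear grammar for L(g1) | L(g2); returns (productions, start)."""
    def tagged(grammar, tag):
        # Rename every nonterminal A to (tag, A) so the grammars cannot clash.
        return {
            (tag, lhs): [(t, None if nxt is None else (tag, nxt)) for t, nxt in rhs]
            for lhs, rhs in grammar.items()
        }

    combined = {**tagged(g1, 1), **tagged(g2, 2)}
    new_start = "S0"
    # The new start symbol may begin a derivation of either grammar.
    combined[new_start] = combined[(1, start1)] + combined[(2, start2)]
    return combined, new_start


def words_up_to(productions, start, max_len):
    """All words of length <= max_len derivable from `start`."""
    words, pending = set(), [("", start)]
    while pending:
        prefix, nonterminal = pending.pop()
        for terminal, successor in productions[nonterminal]:
            word = prefix + terminal
            if len(word) > max_len:
                continue
            if successor is None:
                words.add(word)
            else:
                pending.append((word, successor))
    return words


AMBN = {"S": [("a", "S"), ("a", "T")], "T": [("b", "T"), ("b", None)]}
C_PLUS = {"S": [("c", "S"), ("c", None)]}   # generates c, cc, ccc, ...

union, start = union_grammar(AMBN, "S", C_PLUS, "S")
print(sorted(words_up_to(union, start, 3)))
```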



Proof

A proof can be found in the additional material for this subsection.



Regular Expressions define Regular Languages

Lemma (II.2.2.2.)

Let E be a regular expression. Then there exist both left-linear and right-linear grammars G, G′ s.t.

L(E) = L(G) = L(G′)

G and G′ can be computed from E.

Proof: By Lemma II.2.2.1, and the fact that the finite languages ∅, {ε} and {a} are regular. Full details can be found in the Additional Material.
