
Introduction to regular expressions

Jun 24, 2015


Gianluca Costa

The slides of my brown-bag session dedicated to introducing regular expressions.
Transcript
Page 1: Introduction to regular expressions

Gianluca Costa

Introduction to regular expressions

Page 2: Introduction to regular expressions

Before starting

Regular expressions are a tool: it's up to you to use them wisely.

Like every tool, they require:

● Practice

● Tests

● Patience

Page 3: Introduction to regular expressions

Why “regular expressions”?

● 1956: mathematical definition of regular sets by Stephen Cole Kleene

● 1968: “Regular Expression Search Algorithm” - by Ken Thompson. Description of a regular expression compiler.

● Regular expressions employed in text editors. Introduction of the grep command.

Page 4: Introduction to regular expressions

Examples of text matching

● Given an IIS log, keep just the requests to the web app “/PicnicAPI”

● Perform LIKE queries on MongoDB

● Get the dir and basename of a file path

● Get the src attribute of an <img> tag

● Read a key-value file having “\” line continuations

Page 5: Introduction to regular expressions

Generalized problems

● Determine whether a pattern is contained in (matches) a given string

● Extract substrings from a matching string

● Replace one or more substrings

● Generalizable to files and streams

Page 6: Introduction to regular expressions

Regular expressions

Regular expressions describe text patterns.

For example:

“At least 3 digits, but not more than 5”.

Page 7: Introduction to regular expressions

A simple example

/\d{3,5}/

Matches “3482”, but not “Hello”
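A minimal Python sketch of this pattern, assuming the standard re module (the variable names are only illustrative). Since the pattern is not anchored, on a longer run of digits it simply matches the first five.

```python
import re

# "At least 3 digits, but not more than 5"
pattern = re.compile(r"\d{3,5}")

match_digits = pattern.search("3482")    # four digits: within the 3..5 range
match_hello = pattern.search("Hello")    # no digits at all
match_long = pattern.search("1234567")   # unanchored: takes the first 5 digits

print(match_digits.group())   # 3482
print(match_hello)            # None
print(match_long.group())     # 12345
```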

Page 8: Introduction to regular expressions

How to apply regexes

● Functions/classes provided by programming languages/frameworks

● Command-line tools (sed, awk, egrep, …)

● Other interfaces (eg: MongoDB queries)

Page 9: Introduction to regular expressions

Interactive testing

● http://regex101.com/ - currently provides a free multi-engine test environment, explaining your regex and showing the matches on a text.

● http://rubular.com/ - another regex test environment, targeting Ruby's flavour.

Page 10: Introduction to regular expressions

The dualism regex-target

The regular expression is applied to a string, to check for a match.

Both the regex and the string have their own cursor.

Which cursor drives the matching process?

Text: The quick br

Regex: qui

Page 11: Introduction to regular expressions

Engine types

● DFA

● Traditional NFA

● POSIX NFA

● Hybrid solutions

Page 12: Introduction to regular expressions

DFA

● Matching is driven by the cursor on the text

● Very fast matching

● Takes longer to compile

● Takes more memory

● Declarative regex

● Always returns the longest possible match.

Page 13: Introduction to regular expressions

Traditional NFA

● Matching is driven by the cursor on the regex

● Creates a stack of states, and performs backtracking

● Supports more language constructs

● Imperative regex

● Usually returns the first match found

● Employed by standard Java, .NET, Python, PHP, Perl, Ruby, …

Page 14: Introduction to regular expressions

POSIX NFA

● Very, very similar to traditional NFA, but returns the longest possible match.

● Further performance issues!

Page 15: Introduction to regular expressions

Hybrid solutions

Double engine: first-scan with DFA, then scan with NFA if required by the pattern.

Further implementations are possible.

Page 16: Introduction to regular expressions

Our target: NFAs

● DFAs are less common than NFAs, their syntax is almost a subset and they are generally simpler.

● We will concentrate on NFA regexes

Page 17: Introduction to regular expressions

Know your engine

There are common rules, but several engines.

Every engine has its own implementation.

You must know your engine. And write tests.

Page 18: Introduction to regular expressions

Regex basics

Literal text, such as /rain/, matches if and only if the string contains, somewhere, that sequence, matching character after character.

Page 19: Introduction to regular expressions

The first rule of matching

Matching starts from the leftmost character.

Therefore:

“The rainbow shines after the rain” /rain/
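The rule can be checked with a short Python sketch (standard re module; variable names are illustrative): the match returned is the “rain” inside “rainbow”, not the last word of the sentence.

```python
import re

text = "The rainbow shines after the rain"
m = re.search(r"rain", text)

# The leftmost occurrence wins: the "rain" inside "rainbow".
print(m.span())                  # (4, 8)
print(text[m.start():m.end()])   # rain
```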

Page 20: Introduction to regular expressions

The second rule of matching

The engine returns a success if and only if the regex cursor reaches the end of the regex.

Page 21: Introduction to regular expressions

Escaping characters

● Some characters (\, *, ?, +, ., (, ), [, ], {, }, |, ^, $, #) must be escaped when they are used literally

● Escape is performed by prepending “\”. For example: /\?/ to represent a literal “?”

● Where raw strings are not supported, a double escape might be required. In Java, the regex /\\\+/ becomes: “\\\\\\+”.
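As a sketch of the same idea in Python, which does support raw strings: the raw and the doubly escaped spellings below denote the same regex. The sample strings are made up for illustration; re.escape() is the standard helper for escaping a whole literal.

```python
import re

# Raw string: the regex \? is written as-is.
raw = r"\?"
# Ordinary string: the backslash itself must be escaped.
plain = "\\?"
same_regex = (raw == plain)
print(same_regex)                             # True

question = re.search(raw, "Ready?").group()
print(question)                               # ?

# The regex \\\+ : a literal backslash followed by a literal "+".
backslash_plus = re.compile(r"\\\+")
span = backslash_plus.search("a\\+b").span()  # the text is a\+b
print(span)                                   # (1, 3)

# re.escape() escapes every special character of a literal string.
escaped = re.escape("1+1=2?")
literal_ok = re.fullmatch(escaped, "1+1=2?") is not None
print(literal_ok)                             # True
```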

Page 22: Introduction to regular expressions

Escape sequences

● \r

● \n

● \v

● \f

● \t

● They work just like in C

Page 23: Introduction to regular expressions

Character classes

● [abc] = “a, b or c in this position”

● [a-z] = “a, b, c, …, z here”

● [A-Za-z] = “A, B, …, Z, a, b, …, z here”

● [A-Za-z0] = “A, …, Z, a, …, z, 0 here”

● [A-Z\-] = [-A-Z] = “A, …, Z or – here”

● What about accents? (é, è, …) And cedilla?

● Know your engine.

Page 24: Introduction to regular expressions

Negated character classes

● [^ab] = “Something not a and not b here”

● [^a-z] = “Something not a, b, c, …, z here”

● [^A-Za-c] = “Something not A, …, Z, a, …, c here”

● Negating a character set requires the existence of a character in that position, not belonging to the specified class.
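A small Python sketch of this last point (the sample words are chosen only for illustration): a negated class still has to consume one character, so a “q” at the very end of the text cannot satisfy /q[^u]/.

```python
import re

# [^u] must consume a character, so "q" at the end of the text fails.
at_end = re.search(r"q[^u]", "Iraq")
followed = re.search(r"q[^u]", "Iraqi")

print(at_end)             # None
print(followed.group())   # qi
```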

Page 25: Introduction to regular expressions

Common character classes

● \d = a digit

● \D = [^\d]

● \w = a letter, a digit or “_”

● \W = [^\w]

● \s = a space character

● \S = [^\s]

● . = any character except newline

Page 26: Introduction to regular expressions

What are letters and spaces?

● The answer depends on the encoding and on your engine.

● In ASCII, usually:

– \w = [A-Za-z0-9_]

– \s = [\r \n\t\f\v] (includes ASCII-32 common space)

● But what about Latin-1 or Unicode?

● Know your engine

Page 27: Introduction to regular expressions

Unicode character classes

● \uXXXX: matches the Unicode code point whose hex value is XXXX

● There should also be support for Unicode's categories and scripts, especially via \p

● Much more Unicode-related, non-standard features

● Know your engine

Page 28: Introduction to regular expressions

Capturing groups

● ( and ) define a capturing group

● Capturing groups are assigned a 1-based index, according to the position of their (

● /(\w+)bet/ tries to match a string and, if successful, creates a capturing group for the text matching \w+, having index 1

● If the above regex is applied to “alphabet”, it matches and its group 1 is “alpha”
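The “alphabet” example can be reproduced in Python as a quick sketch; the second pattern and its sample text are made up here to show how nested groups are numbered by their opening parenthesis.

```python
import re

m = re.search(r"(\w+)bet", "alphabet")
print(m.group(0))   # alphabet  (group 0 is the whole match)
print(m.group(1))   # alpha

# Groups are numbered by the position of their opening parenthesis.
nested = re.search(r"((\d+)-(\d+))", "pages 10-25")
print(nested.groups())   # ('10-25', '10', '25')
```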

Page 29: Introduction to regular expressions

Non-capturing groups

● Groups can just be used to clarify precedence: capturing is not always needed

● Skipping capturing can save memory and speed up the matching process

● To define a non-capturing group, use (?: and ).

● Therefore, /(?:\w+)bet/ is just like /\w+bet/, as no capturing is performed and this grouping alters precedence without effects.

Page 30: Introduction to regular expressions

Backreferences

● Backreference = the content of a capturing group that becomes part of the regex

● Use \N in your regex, replacing N with the index of the captured group in question

● For example: /(['"])\w+\1/ to pair single and double quotes

● Some engines support named capturing and backreferences
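A sketch of the quote-pairing regex in Python, whose re module is one of the engines with named groups: (?P<name>…) defines one and (?P=name) refers back to it. The group name quote and the sample strings are illustrative.

```python
import re

# \1 must repeat exactly what group 1 captured: the same kind of quote.
quoted = re.compile(r"""(['"])\w+\1""")

single = quoted.search("say 'hello'").group()
double = quoted.search('say "hello"').group()
mixed = quoted.search("say 'hello\"")   # opening ' closed by " : no match

print(single)   # 'hello'
print(double)   # "hello"
print(mixed)    # None

# The same regex with a named group and a named backreference.
named = re.compile(r"""(?P<quote>['"])\w+(?P=quote)""")
used_quote = named.search('say "hi"').group("quote")
print(used_quote)   # "
```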

Page 31: Introduction to regular expressions

Alternation

● Alternatives are separated by |

● For example: /alpha|beta/ means “alpha” or “beta”

● Alternation has very low precedence; its scope is the current group: use grouping to force precedence.

● For example: /A(?:pril|ugust)/ means “A”, followed by “pril” or “ugust”.
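To see why the group is needed, here is a short Python sketch comparing the grouped regex with an ungrouped variant (the ungrouped pattern is written only for contrast): without the group, the alternatives are “April” and “ugust”.

```python
import re

grouped = re.compile(r"A(?:pril|ugust)")
ungrouped = re.compile(r"April|ugust")   # "April" or "ugust": not the intent

august_ok = grouped.fullmatch("August") is not None
fragment_grouped = grouped.fullmatch("ugust") is not None
fragment_ungrouped = ungrouped.fullmatch("ugust") is not None

print(august_ok)            # True
print(fragment_grouped)     # False
print(fragment_ungrouped)   # True
```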

Page 32: Introduction to regular expressions

Alternation VS char classes

● A character class (asserted or negated) always matches one and only one character

● The branches of an alternation can be strings of any length (at least one character, to be consistent)

Page 33: Introduction to regular expressions

Matching in a DFA

/nice|cute/ applied to:

“Pandas are cute animals”

It scans the string, starting from P, and, at every character, tries to apply the regex.

In a DFA regex, the engine only chooses which regex components remain valid at a given position of the text cursor.

Page 34: Introduction to regular expressions

Matching in NFA

● NFA also keeps a stack of states!

● Each decision point saves a state in the stack

● State = position of the 2 cursors

● If a choice in the regex leads to no match, the engine backtracks (= pops a state from the stack and makes a different choice)

Page 35: Introduction to regular expressions

Backtracking

[Figure: tree of engine states S1 to S8, with steps numbered 1 to 11.]

Page 36: Introduction to regular expressions

Performance implications

● In NFA, a failure is returned only when all the regex paths have been explored

● NFA regexes must be written with performance in mind.

Page 37: Introduction to regular expressions

Alternation in NFA

● Ordered in most implementations.

● Affects what is matched and performance.

● Know your engine
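Python's re is one such ordered implementation, so a tiny sketch is enough to show that swapping the branches changes what is matched (the patterns are illustrative):

```python
import re

# The first branch that leads to an overall match wins.
short_first = re.search(r"a|ab", "ab").group()
long_first = re.search(r"ab|a", "ab").group()

print(short_first)   # a
print(long_first)    # ab
```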

Page 38: Introduction to regular expressions

Greedy quantifiers

● All quantifiers can be applied to single characters, classes or even groups

● * = any number of occurrences (even 0)

● ? = 0 or 1 occurrences

● + = 1 or more occurrences

● {n} = exactly n occurrences

● {m,n} = m to n occurrences (included)

● {m,} = at least m occurrences

Page 39: Introduction to regular expressions

First example of greedy quantifiers

● Let's consider the regex /be?(er|ar)/

● How is it applied to “I'd like a chocolate bar”?

● The regex cursor stays on “b” until the text cursor reaches its “b” too

● Then, the following regex paths are tried:

– be => b(er) => b(ar)

Page 40: Introduction to regular expressions

Greedy quantifiers and backtracking

● Consider the regex /.* are/

● Applied to: “Pandas are cute animals”

● .* will consume the whole text at first

● However, when reaching the end of the text, it stops matching and the regex cursor goes on.

Page 41: Introduction to regular expressions

Greedy quantifiers and backtracking (2)

● Now, “ “ can't match (no more text is available), so the engine backtracks!

● Some backtracking is performed, until the first available space is reached (between “cute” and “animals”)

● The regex cursor moves on to “a”, that matches the “a” in “animals”. But “r” doesn't match “n” => more backtracking!

Page 42: Introduction to regular expressions

Greedy quantifiers and backtracking (3)

● The failures and backtracking go on until the space between “are” and “cute”... “a” doesn't match the “c” in “cute” => backtracking, again!

● The next space is ok: it is followed by “are”, that matches the rest of the regex.

Page 43: Introduction to regular expressions

Pandas are cute animals! ^__^!

Page 44: Introduction to regular expressions

Lazy quantifiers

● Quantifiers become lazy if followed by a ?

● *?

● ??

● +?

● {m,n}?

● {m,}?

● {n} cannot be lazy: it indicates a precise n

Page 45: Introduction to regular expressions

Lazy quantifiers and backtracking

● When applying /.*? are/ to “Pandas are cute animals”, what happens?

● The engine must choose whether to apply .*? to “P”. But it's lazy, so the engine chooses to move the regex cursor forward

● The regex cursor goes on to “ “, but it doesn't match “P” so the engine backtracks

● The engine must now take the remaining path – applying .*? to “P”, which is viable

Page 46: Introduction to regular expressions

Lazy quantifiers and backtracking (2)

● This goes on until the first space in the text is reached: it matches the space in the regex, so the regex cursor can go on

● The matching process continues until the regex ends

● In this case, the match of greedy and lazy evaluation was the same – but the lazy quantifiers required less backtracking
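A Python sketch of the two walkthroughs. The second text, with two occurrences of “ are”, is an added example showing a case where greedy and lazy matches differ.

```python
import re

text = "Pandas are cute animals"
greedy = re.search(r".* are", text).group()
lazy = re.search(r".*? are", text).group()
print(greedy)   # Pandas are
print(lazy)     # Pandas are

# With two " are" in the text, greedy backtracks only to the last one,
# while lazy stops at the first one.
longer = "Pandas are cute and cats are cute"
greedy_long = re.search(r".* are", longer).group()
lazy_long = re.search(r".*? are", longer).group()
print(greedy_long)   # Pandas are cute and cats are
print(lazy_long)     # Pandas are
```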

Page 47: Introduction to regular expressions

Apply or skip? Greedy VS Lazy

● When a quantifier is encountered, the regex engine must choose whether to apply its element to the text or not

● Greedy quantifiers prefer the “apply” path whenever possible

● Lazy quantifiers prefer the “skip” path whenever possible

● Choosing greedy VS lazy quantifiers can impact performance and what is matched, but not the presence/absence of a match.

Page 48: Introduction to regular expressions

Greedy VS Lazy: an example

● Given the text “987”:

– /\d{1,3}/ matches the whole “987”: the greedy quantifier tries to consume as much as possible

– /\d{1,3}?/ matches just “9”: the lazy quantifier must honour the constraints (at least 1 match), but chooses to skip application whenever possible
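The “987” example, sketched in Python. The third pattern, with a trailing $, is an addition: it shows that a lazy quantifier still expands when the rest of the regex cannot match otherwise.

```python
import re

greedy = re.search(r"\d{1,3}", "987").group()
lazy = re.search(r"\d{1,3}?", "987").group()
print(greedy)   # 987
print(lazy)     # 9

# "$" forces the lazy quantifier to keep applying until the end of the text.
lazy_to_end = re.search(r"\d{1,3}?$", "987").group()
print(lazy_to_end)   # 987
```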

Page 49: Introduction to regular expressions

Atomic grouping

● (?> and ) define an atomic group

● All the states created within an atomic group are removed from the engine's stack as soon as the group closes

● Atomic groups are non-capturing, but can have capturing groups

● Atomic grouping can alter the match/failure result of a regex, as well as affecting performance

Page 50: Introduction to regular expressions

Possessive quantifiers

● Obtained by adding a “+” to greedy quantifiers

● Possessive quantifiers are equivalent to greedy quantifiers wrapped within an atomic group.

● For example: /\d++/ = /(?>\d+)/

Page 51: Introduction to regular expressions

Regex flags

● Regex engines can turn on/off features, for customized behaviour

● Enabling and disabling flags usually affects the whole regex, but some engines support flags on just regions.

● Flag manipulation is engine- and API-dependent

● Every engine has its own flags, but some are definitely common.

Page 52: Introduction to regular expressions

Most common regex flags

● Case insensitive

● Dot-all: . matches any character, including \n

● Multiline anchors: ^ and $ (see later) work on lines instead of the whole text

● Extended: whitespace, including newlines, is ignored unless escaped or within a character class; text from an unescaped # to the end of the line is a comment. More readable regexes.
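In Python these four flags are called re.IGNORECASE, re.DOTALL, re.MULTILINE and re.VERBOSE. The sketch below uses made-up patterns and texts just to show each flag's effect.

```python
import re

ignore_case = re.search(r"panda", "PANDA", re.IGNORECASE) is not None
dot_default = re.search(r"a.b", "a\nb")                       # "." skips \n
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

print(ignore_case)   # True
print(dot_default)   # None
print(dot_all)       # True
print(line_starts)   # ['one', 'two']

# Extended mode: whitespace is ignored and "#" starts a comment.
phone = re.compile(r"""
    \d{3}   # three digits
    -       # a literal dash
    \d{4}   # four digits
""", re.VERBOSE)
verbose_ok = phone.fullmatch("555-1234") is not None
print(verbose_ok)    # True
```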

Page 53: Introduction to regular expressions

Anchors

● Anchors do not consume text: they are basic conditions on the text cursor.

● They must be verified for the regex to match

Page 54: Introduction to regular expressions

Common anchors

● ^: the cursor is at the beginning of the text (of a line, in multiline mode)

● $: the cursor is at the end of the text (of a line, in multiline mode. And before or after \n? Know your engine).

● \A: the cursor is at the beginning of the text

● \Z: the cursor is at the end of the text

● \b: the cursor is at a word boundary (what's a word boundary? Know your engine)
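A Python sketch of these anchors on the rainbow sentence used earlier: the word boundaries and the end anchor skip the “rain” inside “rainbow” and select the final word instead.

```python
import re

text = "The rainbow shines after the rain"

plain = re.search(r"rain", text).start()
whole_word = re.search(r"\brain\b", text).start()
at_end = re.search(r"rain$", text).start()
at_start = re.search(r"^rain", text)
begins_with_the = re.search(r"\AThe", text) is not None

print(plain)             # 4   (inside "rainbow")
print(whole_word)        # 29  (the last word)
print(at_end)            # 29
print(at_start)          # None
print(begins_with_the)   # True
```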

Page 55: Introduction to regular expressions

Lookaround

● Lookaround = a regex-based condition on the text cursor. Can be positive (the regex must match) or negative (the regex must fail).

● Lookahead = a lookaround on the text following the cursor

● Lookbehind = a lookaround on the text preceding the cursor.

Page 56: Introduction to regular expressions

Lookaround notation

            Lookbehind      Lookahead

Positive    (?<=regex)      (?=regex)

Negative    (?<!regex)      (?!regex)

Page 57: Introduction to regular expressions

Lookaround basics

● Their position in the regex matters, as the other characters in the regex consume the text and make the text cursor shift forward.

● On the other hand, lookarounds do not consume text

● Juxtaposed lookarounds all apply, combined with a logical AND, to the position marked by the text cursor
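A Python sketch of these points, using made-up patterns: two juxtaposed lookaheads are both checked at the same position without consuming text, and lookbehinds test what precedes the cursor.

```python
import re

# Both lookaheads are evaluated at position 0, then .{6,} consumes the text:
# "contains a digit" AND "contains a lowercase letter" AND "6+ characters".
rule = re.compile(r"(?=.*\d)(?=.*[a-z]).{6,}")
mixed_ok = rule.fullmatch("abc123") is not None
letters_only = rule.fullmatch("abcdef") is not None
print(mixed_ok)       # True
print(letters_only)   # False

# Positive and negative lookbehind on the character before the number.
after_dollar = re.findall(r"(?<=\$)\d+", "cost $42 or 17")
not_after_dollar = re.findall(r"(?<!\$)\b\d+", "cost $42 or 17")
print(after_dollar)       # ['42']
print(not_after_dollar)   # ['17']
```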

Page 58: Introduction to regular expressions

Lookaround limitations

● Lookarounds behave like nested regexes having their own stack

● They are also called zero-length assertions

● Lookaheads can be full-fledged regexes

● Lookbehinds are usually much more restricted, depending on the engine

Page 59: Introduction to regular expressions

Lookarounds and the stack

● Each lookaround maintains its own stack, that gets deleted at the end of the lookaround.

● An important detail: capturing groups within lookarounds are considered capturing groups of the whole regex => their result is saved.

Page 60: Introduction to regular expressions

Lookahead + Backreference = Atomic group

● Lookaheads are full-fledged regexes with their own stack, which is thrown away.

● This is exactly like an atomic group, but the lookahead does not consume text

● However, capturing groups in a lookahead are stored by the regex => use a backreference to capture that text

● Therefore, for example: /(?=(\d+))\1/ = /(?>\d+)/
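The emulation can be tried in Python, whose lookaheads are not re-entered once they have matched (the re module accepts the (?>…) syntax itself only from version 3.11 on). The sketch adds a trailing \d to both patterns, an illustrative choice that makes the difference visible: the plain regex can give a digit back, the emulated atomic group cannot.

```python
import re

# Plain greedy: \d+ first takes "123", then gives back "3" for the final \d.
plain = re.search(r"\d+\d", "123")
print(plain.group())   # 123

# Emulated atomic group: the lookahead captures all the digits, \1 consumes
# them, and nothing can be given back, so the final \d never finds a digit.
emulated = re.search(r"(?=(\d+))\1\d", "123")
print(emulated)        # None
```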

Page 61: Introduction to regular expressions

Regexes and C#

● .NET encapsulates regexes in a class, System.Text.RegularExpressions.Regex

● Its constructor accepts the regex and, optionally, global flags

● C# supports verbatim strings (prefixed with @), which avoid the over-escaping often needed in Java.

Page 62: Introduction to regular expressions

Regexes and Java

● Java's regex class is java.util.regex.Pattern

● In lieu of a constructor, it's a static method, Pattern.compile(), that creates a regex

● It takes the regex and, optionally, the global flags

● In Java, the regex /\\test/ becomes “\\\\test”, because each “\” in the regex must be escaped in Java, too, for a total of 4 “\”.

Page 63: Introduction to regular expressions

Regexes in MongoDB

● MongoDB supports regexes

● Just use /regex/ (with slashes and without double quotes) as the right side of an equality assertion in your query

● Important: a regex could hit indexes on a field, but the best results are achieved when the regex starts with ^

Page 64: Introduction to regular expressions

Regexes in Python

● Python provides the standard module re

● To create a regex, just use re.compile(), that takes, as usual, the regex string and the optional global flags
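A minimal sketch of re.compile() covering the three generalized problems seen earlier (matching, extracting, replacing); the sample texts and the replacement string are made up for illustration.

```python
import re

# Compile once, with an optional flag, then reuse the object.
word = re.compile(r"(\w+)bet", re.IGNORECASE)

# 1. Match and extract a substring through a capturing group.
m = word.search("The ALPHABET song")
prefix = m.group(1)
print(prefix)     # ALPHA

# 2. Replace: \1 in the replacement refers to group 1.
replaced = word.sub(r"\1-bet", "alphabet")
print(replaced)   # alpha-bet

# 3. Find every non-overlapping match in a text.
numbers = re.findall(r"\d{3,5}", "12 345 678901")
print(numbers)    # ['345', '67890']
```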

Page 65: Introduction to regular expressions

Regexes in JavaScript

● In JavaScript, it's quite common to use this notation to create a regex object:

var regex = /regexPattern/

var regexWithFlags = /regexPattern/flags

● Alternatively, the RegExp class can be used

Page 66: Introduction to regular expressions

Final notes

● Don't forget that regexes must be kept simple, just like any other construct

● To achieve this result, a good knowledge of the text, as well as of the requirements, is needed.

● Write tests for your regexes

Page 67: Introduction to regular expressions

Further references

● “Mastering Regular Expressions” - by Jeffrey E. F. Friedl, published by O'Reilly Media

● http://regex101.com/

● http://rubular.com/

● http://www.regular-expressions.info/