Perl Regex

Oct 21, 2015

chintu_89

presentation on perl regex
Page 1: Perl Regex
Page 2: Perl Regex

Overview

Page 3: Perl Regex

Introduction

• Regular expressions are tiny programs in their own special language, built into Perl.

• They allow fast, flexible, and reliable string handling.

• A regular expression, often called a pattern in Perl, is a template that either matches or doesn’t match a given string.

• That is, there are infinitely many possible text strings, and a given pattern divides that infinite set into two groups: the ones that match and the ones that don’t.

• Don’t confuse regular expressions with shell filename-matching patterns, called globs, which are a different sort of pattern with their own rules.

Page 4: Perl Regex

Simple Pattern

• To match a pattern (regular expression) against the contents of $_, simply put the pattern between a pair of forward slashes (/).

$_ = "yabba dabba doo";

if (/abba/) {

print "It matched!\n";

}

• The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value.
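
The true-or-false result of a match can be used like any other value. The following is a minimal sketch, assuming a helper named has_abba that is not part of the original slides and exists only for illustration:

```perl
use strict;
use warnings;

# has_abba is a hypothetical helper, named here only for illustration.
# It returns 1 if /abba/ matches the given string and 0 otherwise.
sub has_abba {
    local $_ = shift;          # the match operator tests $_ by default
    return /abba/ ? 1 : 0;
}

print has_abba("yabba dabba doo"), "\n";   # prints 1
print has_abba("scooby doo"), "\n";        # prints 0
```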

Page 5: Perl Regex

Unicode Properties

• Unicode characters know something about themselves; they aren’t just sequences of bits.

• Instead of matching on a particular character, you can match a type of character.

• To match a particular property, you put the name in \p{PROPERTY}.

if (/\p{Space}/) { # 26 different possible characters

print "The string has some whitespace.\n";

}

if (/\p{Digit}/) { # 411 different possible characters

print "The string has a digit.\n";

}

• More properties are listed in the perluniprops documentation.
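
As a rough sketch of the two properties above, the following reports which of them a string contains. The helper name describe is an assumption made for this example, and it uses the binding operator =~ to match against a named variable instead of $_. The non-ASCII digits are written as \x{...} escapes so the example does not depend on the source file's encoding:

```perl
use strict;
use warnings;

# describe is a hypothetical helper for this sketch.
# It lists which of the two properties occur somewhere in the text.
sub describe {
    my ($text) = @_;
    my @found;
    push @found, 'space' if $text =~ /\p{Space}/;
    push @found, 'digit' if $text =~ /\p{Digit}/;
    return join ',', @found;
}

print describe("HAL 9000"), "\n";            # prints space,digit
# U+0669 and U+0660 are the Arabic-Indic digits nine and zero.
print describe("\x{669}\x{660}"), "\n";      # prints digit
```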

Page 6: Perl Regex

Meta-characters

• The dot (.) is a wildcard character: it matches any single character except a newline.

/bet.y/ matches betty, betsy, bet=y, bet.y

doesn’t match bety or betsey

• The dot always matches exactly one character.

• If you want the dot to match just a period, simply backslash it.

/3\.141/ matches 3.141596456

doesn’t match 3a141545

• If you mean a real backslash, use a pair of them.

$_ = 'a real \\ backslash';

if (/\\/) {

print "It matched!\n";

}
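
The claims above can be checked with a short sketch. It uses grep, which sets $_ to each list element in turn, so the bare match operator applies to each candidate. The variable names are chosen only for this example:

```perl
use strict;
use warnings;

# Keep only the candidates where the dot finds exactly one character.
my @dot_hits = grep { /bet.y/ } ('betty', 'betsy', 'bet=y', 'bety', 'betsey');
print "@dot_hits\n";       # prints betty betsy bet=y

# A backslashed dot matches only a literal period.
my @period_hits = grep { /3\.141/ } ('3.141596456', '3a141545');
print "@period_hits\n";    # prints 3.141596456

# A pair of backslashes in the pattern matches one real backslash.
my $has_backslash = 'a real \\ backslash' =~ /\\/ ? 1 : 0;
print "$has_backslash\n";  # prints 1
```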

Page 7: Perl Regex

Simple Quantifiers

• * -- zero or more occurrences

/fred\t*barney/ matches fredbarney, fred\tbarney, fred\t\tbarney

/fred.*barney/ matches fredbarney, fredabcd…barney

• + -- one or more occurrences

/fred\t+barney/ matches fred\tbarney, fred\t\tbarney

doesn’t match fredbarney

• ? -- zero or one occurrence

/bam-?bam/ matches bambam, bam-bam

doesn’t match bam-----bam
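
A small sketch of the three quantifiers, counting how many sample strings each pattern accepts. The sample list and variable names are assumptions made for illustration:

```perl
use strict;
use warnings;

my @inputs = ("fredbarney", "fred\tbarney", "fred\t\tbarney");

# In scalar context, grep returns the number of matching elements.
my $star = grep { /fred\t*barney/ } @inputs;   # zero or more tabs
my $plus = grep { /fred\t+barney/ } @inputs;   # at least one tab
print "$star $plus\n";                         # prints 3 2

# The question mark allows at most one hyphen.
my @bam = grep { /bam-?bam/ } ('bambam', 'bam-bam', 'bam-----bam');
print "@bam\n";                                # prints bambam bam-bam
```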

Page 8: Perl Regex

Grouping in Patterns

• Use parentheses (“( )”) to group parts of a pattern.

• So, parentheses are also meta-characters.

/fred+/ matches fred, fredd, fredddd

/(fred)+/ matches fred, fredfred, fredfredfred

/(fred)*/ matches hello, barney, fred, fredfred (any string at all, since zero occurrences are allowed)

• Using parentheses also makes Perl store the matched text in the special variables $1, $2, and so on. The number denotes the capture group.

$_ = "perl version is 5.14";

if(/perl version is (.*)/) {

print $1; #prints 5.14

}
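
With more than one pair of parentheses, each group fills its own variable. This is a minimal sketch, assuming a helper named split_version that does not appear in the slides:

```perl
use strict;
use warnings;

# split_version is a hypothetical helper for this sketch.
# It returns the text captured by the two groups, or an empty list on failure.
sub split_version {
    local $_ = shift;
    return /(.*) version is (.*)/ ? ($1, $2) : ();
}

my ($name, $version) = split_version("perl version is 5.14");
print "$name / $version\n";   # prints perl / 5.14
```

Copying $1 and $2 into named variables right after a successful match is a common habit, because the next successful match overwrites them.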

Page 9: Perl Regex

• Use back references to refer, later in the same pattern, to text that you matched in a pair of parentheses, called a capture group.

• You denote a back reference as a backslash followed by a number, like \1, \2, and so on.

$_ = "abba";

if (/(.)\1/) { # matches 'bb'

print "It matched same character next to itself!\n";

}

$_ = "yabba dabba doo";

if (/y(....) d\1/) {

print "It matched the same after y and d!\n";

}

Page 10: Perl Regex

$_ = "yabba dabba doo";

if (/y(.)(.)\2\1/) { # matches 'abba'

print "It matched after the y!\n";

}

• “How do I know which group gets which number?” Just count the opening parentheses from left to right, ignoring nesting.

$_ = "yabba dabba doo";

if (/y((.)(.)\3\2) d\1/) {

print "It matched!\n";

}
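
To see the numbering rule at work, the nested pattern above can be run in list context, where a successful match returns the captured groups in order. The variable names are chosen only for this sketch:

```perl
use strict;
use warnings;

# Group 1 opens first (the outer pair), then group 2, then group 3.
my ($g1, $g2, $g3) = "yabba dabba doo" =~ /y((.)(.)\3\2) d\1/;

print "group 1: $g1\n";   # prints group 1: abba
print "group 2: $g2\n";   # prints group 2: a
print "group 3: $g3\n";   # prints group 3: b
```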

Page 11: Perl Regex

• Consider the problem where you want to use a back reference next to a part of the pattern that is a number.

• In this regular expression, you want to use \1 to repeat the character you matched in the parentheses and follow that with the literal string 11:

$_ = "aa11bb";

if (/(.)\111/) {

print "It matched!\n";

}

Is that \1, \11, or \111?

Page 12: Perl Regex

• Starting with Perl 5.10, you can write \g{1} to disambiguate the back reference from the literal parts of the pattern:

use 5.010;

$_ = "aa11bb";

if (/(.)\g{1}11/) {

print "It matched!\n";

}

• With the \g{N} notation, you can also use negative numbers, which count backward from the current position: \g{-1} refers to the capture group just before it.

use 5.010;

$_ = "xaa11bb";

if (/(.)(.)\g{-1}11/) { # \g{-1} is the group just before it, here group 2

print "It matched!\n";

}
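
The absolute and relative forms can be compared side by side. This is a sketch with variable names made up for the example:

```perl
use 5.010;
use strict;
use warnings;

# Absolute: \g{1} repeats group 1, then the literal 11 follows.
my $absolute = "aa11bb" =~ /(.)\g{1}11/ ? 1 : 0;

# Relative: \g{-1} repeats the most recent group, which is group 2 here.
my $relative = "xaa11bb" =~ /(.)(.)\g{-1}11/ ? 1 : 0;

# No character is doubled right before 11, so this one fails.
my $no_repeat = "xay11bb" =~ /(.)(.)\g{-1}11/ ? 1 : 0;

say "$absolute $relative $no_repeat";   # prints 1 1 0
```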

Page 13: Perl Regex

Alternatives

• The vertical bar (|), often called “or” in this usage, means that if the part of the pattern on the left of the bar fails, the part on the right gets a chance to match.

/fred|barney|betty/ matches fred, barney, betty.

/fred( |\t)+barney/ matches if fred and barney are separated by spaces, tabs, or a mixture of the two.

/fred( +|\t+)barney/ matches if fred and barney are separated only by spaces or only by tabs, not by a mixture of the two.

/fred (and|or) barney/ matches fred and barney, fred or barney. It is the same as the pattern /fred and barney|fred or barney/.
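
The difference between the two separator patterns shows up with a mixed separator. A minimal sketch, with sample strings and variable names assumed for illustration:

```perl
use strict;
use warnings;

my @inputs = ("fred barney", "fred\tbarney", "fred \t barney", "fredbarney");

# Any mixture of spaces and tabs, at least one character long.
my $mixed = grep { /fred( |\t)+barney/ } @inputs;

# Only spaces or only tabs, never both.
my $uniform = grep { /fred( +|\t+)barney/ } @inputs;

print "$mixed $uniform\n";   # prints 3 2

# The parentheses around the alternatives also capture the one that matched.
my ($word) = "fred or barney" =~ /fred (and|or) barney/;
print "$word\n";             # prints or
```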

Page 14: Perl Regex

Character Classes

• A character class is a list of possible characters inside square brackets.

• It matches just one single character, but that one character may be any of the ones you list in the brackets.

[abcwxyz] matches a, b, c, w, x, y, z (any of those seven characters)

• You may specify a range of characters with a hyphen (-).

[a-cw-z] matches any letter from a to c or from w to z

[a-zA-Z0-9] matches any ASCII alphanumeric character

$_ = "The HAL-9000 requires authorization to continue.";

if (/HAL-[0-9]+/) {

print "The string mentions some model of HAL computer.\n";

}
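
Ranges and captures combine naturally. In this sketch the single-character test strings and the variable names are assumptions made for the example:

```perl
use strict;
use warnings;

# Each candidate is a single character, so the class is tested directly.
my @in_range = grep { /[a-cw-z]/ } ('b', 'm', 'x', 'q');
print "@in_range\n";   # prints b x

# Parentheses around the class capture the model number.
my ($model) = "The HAL-9000 requires authorization to continue." =~ /HAL-([0-9]+)/;
print "$model\n";      # prints 9000
```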

Page 15: Perl Regex

Character Class Shortcuts

• Some character classes appear so frequently that they have shortcuts.

• The character class for any digit is \d.

use 5.010; # enables say

$_ = 'The HAL-9000 requires authorization to continue.';

if (/HAL-[\d]+/) {

say 'The string mentions some model of HAL computer.';

}

• However, there are many more digits than the 0 to 9 that you may expect from ASCII, so that pattern will also match HAL-٩٠٠٠.

• Recognizing this problematic shift from ASCII to Unicode, Perl 5.14 adds the /a modifier: placed at the end of the match operator, it tells Perl to use the old ASCII interpretation.
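
The effect of /a can be sketched as follows. The Arabic-Indic digits are written as \x{...} escapes so the example does not depend on the source file's encoding, and the variable names are chosen only for illustration:

```perl
use 5.014;      # /a needs Perl 5.14 or later
use warnings;

# HAL- followed by the Arabic-Indic digits nine, zero, zero, zero.
my $arabic = "HAL-\x{669}\x{660}\x{660}\x{660}";

my $unicode_match = $arabic =~ /HAL-\d+/    ? 1 : 0;   # \d accepts Unicode digits
my $ascii_match   = $arabic =~ /HAL-\d+/a   ? 1 : 0;   # \d accepts only 0 to 9
my $plain_match   = "HAL-9000" =~ /HAL-\d+/a ? 1 : 0;  # ASCII digits still match

say "$unicode_match $ascii_match $plain_match";        # prints 1 0 1
```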

Page 16: Perl Regex

• \s matches any whitespace, which is almost the same as the Unicode property \p{Space} (the difference is the vertical tab, which \s matches only from Perl 5.18 on).

• \h matches only horizontal whitespace.

• \v matches only vertical whitespace.

• Taken together, \h and \v are the same as \p{Space}.

• The \R shortcut, introduced in Perl 5.10, matches any sort of line break, independent of operating system.

• \w matches a “word” character: the set [a-zA-Z0-9_] in the ASCII interpretation.
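
A few quick checks of these shortcuts, as a sketch with variable names invented for the example:

```perl
use 5.010;
use strict;
use warnings;

my $tab_is_h     = "\t" =~ /\h/ ? 1 : 0;   # a tab is horizontal whitespace
my $newline_is_h = "\n" =~ /\h/ ? 1 : 0;   # a newline is not
my $newline_is_v = "\n" =~ /\v/ ? 1 : 0;   # a newline is vertical whitespace
say "$tab_is_h $newline_is_h $newline_is_v";   # prints 1 0 1

# \R accepts both a Windows-style and a Unix-style line break.
my $crlf = "one\r\ntwo" =~ /one\Rtwo/ ? 1 : 0;
my $lf   = "one\ntwo"   =~ /one\Rtwo/ ? 1 : 0;
say "$crlf $lf";                               # prints 1 1

# The underscore is a word character; the hyphen is not.
my @word_chars = grep { /\w/ } ('a', '7', '_', '-');
say "@word_chars";                             # prints a 7 _
```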

Page 17: Perl Regex

Negating the Shortcuts

• To specify the characters you want to leave out, rather than the ones to include, use a caret (^).

• A caret (^) at the start of a character class (i.e., just inside the square brackets) negates the class.

[^def] matches any single character except one of those three.

[^n\-z] matches any character except n, hyphen, or z.

• To negate a shortcut, use its uppercase form.

\S matches any non-space

\D matches any non-digit

[\d\D] matches any digit or any non-digit, i.e., any character at all

[^\d\D] matches anything that’s not either a digit or a non-digit, i.e., nothing!
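
The negated classes can be tried on single-character strings. This is a sketch, and the sample lists and variable names are assumptions:

```perl
use strict;
use warnings;

my @not_def = grep { /[^def]/ }   ('d', 'e', 'x', '-');
print "@not_def\n";   # prints x -

my @not_nhz = grep { /[^n\-z]/ }  ('n', '-', 'z', 'a');
print "@not_nhz\n";   # prints a

my @samples = ('7', 'x', "\n");
my $anything = grep { /[\d\D]/ }  @samples;   # every sample has some character
my $nothing  = grep { /[^\d\D]/ } @samples;   # no character can satisfy this
print "$anything $nothing\n";                 # prints 3 0
```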

Page 18: Perl Regex
Page 19: Perl Regex

Matches with m//

• We put patterns in pairs of forward slashes, like /fred/. But this is actually a shortcut for m// (the pattern match operator).

• We may choose any pair of delimiters to quote the contents.

m(fred), m<fred>, m{fred}, m[fred], m,fred,, m!fred!, m^fred^

• The shortcut is that if you choose the forward slash as the delimiter, you may omit the initial m.

• Wisely choose a delimiter that doesn’t appear in your pattern.

m%http://% instead of /http:\/\// to match the initial "http://".
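
All of these spellings compile to the same pattern. In this sketch the URL is a placeholder chosen for illustration, and the variable names are assumptions:

```perl
use strict;
use warnings;

my $url = "http://www.example.com/";   # placeholder URL for this sketch

# Three spellings of the same pattern; only the first needs escaped slashes.
my $with_slashes = $url =~ /http:\/\// ? 1 : 0;
my $with_percent = $url =~ m%http://%  ? 1 : 0;
my $with_braces  = $url =~ m{http://}  ? 1 : 0;

print "$with_slashes $with_percent $with_braces\n";   # prints 1 1 1
```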

Page 20: Perl Regex

Match Modifiers

• Case-Insensitive Matching with /i

$_ = "Is Freddy there?";

if (/freddy/i) {

print "Yes Freddy is here\n";

}

Page 21: Perl Regex

• Matching Any Character with /s

– Using the /s modifier makes the dot (.) match any character, including a newline character.

– It behaves as if each dot were replaced with [\d\D], which matches anything.

– The effect can only be felt when the string has newline characters.

$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

if (/Barney.*Fred/s) {

print "That string mentions Fred after Barney!\n";

}

• Without the /s modifier, that match would fail, since the two names aren’t on the same line.

• What if you still want to match any character except a newline? You could use the character class [^\n] or, from Perl 5.12 on, the shortcut \N, which means the complement of \n.
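
The /i and /s modifiers and the \N shortcut can be compared in one sketch. The variable names are chosen only for this example:

```perl
use 5.012;      # \N needs Perl 5.12 or later
use warnings;

my $text = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

my $without_s = $text =~ /Barney.*Fred/  ? 1 : 0;   # dot stops at the newline
my $with_s    = $text =~ /Barney.*Fred/s ? 1 : 0;   # dot crosses newlines
say "$without_s $with_s";                           # prints 0 1

# Even under /s, \N never matches a newline, so the capture ends at the line end.
my ($rest_of_line) = $text =~ /I saw (\N*)/s;
say $rest_of_line;                                  # prints Barney

my $case_blind = "Is Freddy there?" =~ /freddy/i ? 1 : 0;
say $case_blind;                                    # prints 1
```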

• Many other modifiers are described in the perlop documentation. A few are covered below.

Page 22: Perl Regex

• Adding Whitespace with /x

– The /x modifier allows you to add arbitrary whitespace to a pattern, in order to make it easier to read.

/-?[0-9]+\.?[0-9]*/ # what is this doing?

/ -? [0-9]+ \.? [0-9]* /x # a little better

– Because /x allows whitespace inside the pattern, Perl ignores literal space or tab characters within the pattern.

– Use a backslashed space, \t, or (more commonly) \s, \s*, or \s+ when you want to match whitespace.

Page 23: Perl Regex

– Perl considers comments a type of whitespace, so you can put comments into that pattern to tell what you are trying to do:

/

-? # an optional minus sign

[0-9]+ # one or more digits before the decimal point

\.? # an optional decimal point

[0-9]* # some optional digits after the decimal point

/x # end of string

– Use the escaped character, \#, or the character class, [#], if you need to match a literal pound sign, since a bare # indicates the start of a comment.

/

[0-9]+ # one or more digits before the decimal point

[#] # literal pound sign

/x # end of string

Page 24: Perl Regex

– Be careful not to include the closing delimiter inside the comments, or it will prematurely terminate the pattern. This pattern ends before you think it does:

/

-? # with / without - <--- OOPS!

[0-9]+ # one or more digits before the decimal point

\.? # an optional decimal point

[0-9]* # some optional digits after the decimal point

/x # end of string
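
One way to sidestep the delimiter problem is to pick delimiters that are unlikely to appear in comments. The following is a minimal sketch, assuming a helper named find_number that is not in the original slides. Its comments avoid braces (the delimiters used here) and also avoid $ and @, because variables are still interpolated inside pattern comments:

```perl
use strict;
use warnings;

# find_number is a hypothetical helper for this sketch.
# It returns the first number found in the string, or undef if there is none.
sub find_number {
    local $_ = shift;
    return m{
        (            # capture the whole number
            -?       # an optional minus sign
            [0-9]+   # one or more digits before the decimal point
            \.?      # an optional decimal point
            [0-9]*   # some optional digits after the decimal point
        )
    }x ? $1 : undef;
}

print find_number("temp is -12.5 degrees"), "\n";   # prints -12.5

# Inside a character class, the pound sign is literal even under /x.
my $has_pound = "ticket 42#" =~ m{ [0-9]+ [#] }x ? 1 : 0;
print "$has_pound\n";                               # prints 1
```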

Page 25: Perl Regex

Combining Option Modifiers

• If you want to use more than one modifier on the same match, just put them both at the end (their order isn’t significant).

if (/barney.*fred/is) { # both /i and /s

print "That string mentions Fred after Barney!\n";

}

Or as a more expanded version with comments:

if (m{

barney # the little guy

.* # anything in between

fred # the loud guy

}isx) { # all three of /s and /i and /x

print "That string mentions Fred after Barney!\n";

}

Page 50: Perl Regex

Misc

• The trick with a good pattern is to not match more than you ever mean to match.