LIS651 lecture 4 regular expressions Thomas Krichel 2006-12-03

LIS651 lecture 4 regular expressions

Thomas Krichel

2006-12-03

remember DOS?

• DOS had the * character as a wildcard. If you said
DIR *.EXE
• it would list all the files ending with .EXE.
• Thus the * wildcard would mean “all characters except the dot”.
• Similarly, you could say
DEL *.*
• to delete all your files.

regular expression

• Is nothing but a fancy wildcard.
• There are various flavours of regular expressions.
– We will be using POSIX regular expressions here. They themselves come in two flavours
  • old-style
  • extended
  We study extended here, aka POSIX 1003.2.
– Perl regular expressions are more powerful and more widely used.
• POSIX regular expressions are accepted by both PHP and mySQL. Details are to follow.

pattern

• The regular expression describes a pattern of characters.

• Patterns are common in other circumstances.
– Query: ‘Krichel Thomas’ in Google
– Query: ‘"Thomas Krichel"’ in Google
– Dates are of the form yyyy-mm-dd.

pattern matching

• We say that a regular expression matches the string if an instance of the pattern described by the regular expression can be found in the string.

• Saying “matches in the string” may make it a little clearer.

• Sometimes people also say that the string matches the regular expression.

• I am confused.

metacharacters

• Instead of just giving the star * special meaning, in a regular expression all the following have special meaning
\ ^ $ . | ( ) * + { } ? [ ]
• Collectively, these characters are known as metacharacters. They don't stand for themselves but they mean something else.

• For example DEL *.EXE does not mean: delete the file "*.EXE". It means delete anything ending with .EXE.

metacharacters

• We are somehow already familiar with metacharacters.
– In XML < means start of an element. To use < literally, you have to use &lt;
– In PHP the "\n" does not mean backslash and then n. It means the newline character.

simple regular expressions

• Characters that are not metacharacters just simply mean themselves
‘good’ does not match in ‘Good Beer’
‘d B’ matches in ‘Good Beer’
‘dB’ does not match in ‘Good Beer’
‘Beer ’ does not match in ‘Good Beer’
• If there are several matches, the pattern will match at the first occurrence.
‘o’ matches in ‘Good Beer’
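The literal matches on this slide can be tried out in code. What follows is a minimal sketch in Python rather than PHP. Python's re module uses Perl-style rather than POSIX expressions, but plain characters behave the same way in both. The helper name matches_in is made up for this illustration.

```python
import re

def matches_in(regex, string):
    """Return True if the pattern is found anywhere in the string."""
    return re.search(regex, string) is not None

print(matches_in('good', 'Good Beer'))   # False: matching is case-sensitive
print(matches_in('d B', 'Good Beer'))    # True
print(matches_in('dB', 'Good Beer'))     # False: the blank is missing
print(matches_in('Beer ', 'Good Beer'))  # False: no blank follows Beer

# With several candidates, the first occurrence is the one reported.
first_o = re.search('o', 'Good Beer')
print(first_o.start())                   # 1, the position of the first o
```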

the backslash \ quote

• If you want to match a metacharacter in the string, you have to quote it with the backslash
‘a 6+ pack’ does not match in ‘a 6+ pack’
‘a 6\+ pack’ does match in ‘a 6+ pack’
‘\’ does not match in ‘a \ against boozing’
‘\\’ does match in ‘a \ against boozing’
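Quoting with the backslash can be checked with a short sketch. It is written in Python, not PHP, on the assumption that the backslash quotes metacharacters the same way in Python's Perl-style expressions as in POSIX ones. The raw string prefix r'' and re.escape() are Python conveniences that the slides do not cover.

```python
import re

text = 'a 6+ pack'

# Unquoted, + means "one or more of the preceding 6", so this fails.
unquoted = re.search(r'a 6+ pack', text)

# Quoted with a backslash, + stands for itself.
quoted = re.search(r'a 6\+ pack', text)

# re.escape() quotes every metacharacter of a literal string for us.
escaped = re.search(re.escape(text), text)

# A literal backslash has to be quoted with another backslash.
backslash = re.search(r'\\', r'a \ against boozing')

print(unquoted is None)       # True
print(quoted is not None)     # True
print(escaped is not None)    # True
print(backslash is not None)  # True
```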

other characters to be quoted

• Certain non-metacharacters also need to be quoted. These include some of the usual suspects
– \n the newline
– \r the carriage return
– \t the tabulation character

• But this quoting occurs by virtue of PHP, it is not part of the regular expression.

• Remember Sandford’s law.

anchor metacharacters ^ and $

• ^ matches at the beginning of the string.
• $ matches at the end of the string.
‘keeper’ matches in ‘beerkeeper’
‘keeper$’ matches in ‘beerkeeper’
‘^keeper’ does not match in ‘beerkeeper’
‘^$’ matches in ‘’
• Note that in a double-quoted string an expression starting with $ will be replaced by the variable's string value (or nothing if the variable has not been set).
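The anchor examples can be reproduced with a small sketch. It uses Python's re module as a stand-in for the PHP functions, so the warning about $ in double-quoted PHP strings does not arise here. The variable names are chosen for this illustration only.

```python
import re

word = 'beerkeeper'

anywhere = bool(re.search(r'keeper', word))    # True: found somewhere inside
at_end = bool(re.search(r'keeper$', word))     # True: the string ends with keeper
at_start = bool(re.search(r'^keeper', word))   # False: the string starts with beer
empty = bool(re.search(r'^$', ''))             # True: start and end coincide

print(anywhere, at_end, at_start, empty)
```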

character classes

• We can define a character class by grouping a list of characters between [ and ]
‘b[ie]er’ matches in ‘beer’
‘b[ie]er’ matches in ‘bier’
‘[Bb][ie]er’ matches in ‘Bier’
• Within a class, metacharacters need not be escaped. In the class only -, ] and ^ are metacharacters.

- in the character class

• Within a character class, the dash - becomes a metacharacter.
• You can use it to give a range, according to the sequence of characters in the character set you are using. It’s usually alphabetic
‘be[a-e]r’ matches in ‘beer’
‘be[a-e]r’ matches in ‘becr’
‘be[a-e]r’ does not match in ‘befr’
• If the dash - is the last character in the class, it is treated like an ordinary character.
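Character classes and ranges can be explored with the sketch below. It is written in Python, whose bracket syntax agrees with POSIX for the simple classes shown here. The test words ‘baer’, ‘be-r’ and ‘bedr’ are invented for this illustration.

```python
import re

# A class lists the characters allowed at one position.
beer_or_bier = re.compile(r'[Bb][ie]er')
class_hits = [w for w in ['beer', 'bier', 'Bier', 'baer'] if beer_or_bier.search(w)]
print(class_hits)   # ['beer', 'bier', 'Bier']

# A dash between two characters gives a range.
a_to_e = re.compile(r'be[a-e]r')
range_hits = [w for w in ['beer', 'becr', 'befr'] if a_to_e.search(w)]
print(range_hits)   # ['beer', 'becr']

# A dash at the end of the class is an ordinary character.
dash_last = re.compile(r'be[e-]r')
dash_hits = [w for w in ['beer', 'be-r', 'bedr'] if dash_last.search(w)]
print(dash_hits)    # ['beer', 'be-r']
```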

] in the character class

• ] gives you the end of the class. But if you put it first, it is treated like an ordinary character, because having it there otherwise would create an empty class, and that would make no sense.
‘be[],]r’ matches in ‘be]r’

^ in the character class

• If the caret ^ appears as the first element in the class, it negates the characters mentioned.
‘be[^i]r’ matches in ‘beer’
‘b[^ie]er’ does not match in ‘bier’
‘be[^a-e]r’ does match in ‘befr’
‘be[e^]r’ matches in ‘beer’
‘beer[^6-9]’ matches in ‘beer0’ to ‘beer5’
• Otherwise, it is an ordinary character.
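Negated classes can be checked in the same way. The sketch below uses Python's re module, which treats a leading caret inside brackets as POSIX does. The string ‘be^r’ is an extra test case that is not on the slide.

```python
import re

not_i = bool(re.search(r'be[^i]r', 'beer'))        # True: e is not i
not_ie = bool(re.search(r'b[^ie]er', 'bier'))      # False: i is excluded
not_a_to_e = bool(re.search(r'be[^a-e]r', 'befr')) # True: f is outside a-e

# When it is not in first position, the caret is an ordinary character.
literal_caret = bool(re.search(r'be[e^]r', 'be^r'))  # True

# Which single digits may follow 'beer' under beer[^6-9]?
accepted = [d for d in '0123456789' if re.search(r'beer[^6-9]', 'beer' + d)]

print(not_i, not_ie, not_a_to_e, literal_caret)
print(accepted)   # ['0', '1', '2', '3', '4', '5']
```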

standard character classes

• The following predefined classes exist. They are written inside a character class, as in ‘[[:digit:]]’.

[:alnum:] any alphanumeric characters

[:digit:] any digits

[:punct:] any punctuation characters

[:alpha:] any alphabetic characters (letters)

[:graph:] any graphic characters

[:space:] any space character (blank and \n, \r)

[:blank:] any blank character (space and tab)

[:lower:] any lowercase character

standard character classes

[:upper:] any uppercase character

[:cntrl:] any control character

[:print:] any printable character

[:xdigit:] any character for a hex number

• They are locale and operating system dependent.

• With this discussion we leave character classes.

The period . metacharacter

• The period matches any character except the newline \n.

• The reason why the \n is not counted is historic. In olden days matching was done line by line, because computers could not hold as much in memory.
‘.’ does not match in ‘’
‘^.$’ does not match in "\n"
‘^.$’ matches in ‘a’

alternative operator |

• This acts like an or
‘beer|wine’ matches in ‘beer’
‘beer|wine’ matches in ‘wine’
• Alternatives are performed last, i.e. each alternative extends as far as it can. ‘beer|wine glass’ means ‘beer’ or ‘wine glass’, not ‘beer glass’ or ‘wine glass’.

grouping with ( )

• You can use ( ) to group
‘(beer|wine) (glass|)’ matches in ‘beer glass’
‘(beer|wine) (glass|)’ matches in ‘wine glass’
‘(beer|wine) (glass|)’ matches in ‘beer ’
‘(beer|wine) (glass|)’ matches in ‘wine ’
‘(beer|wine) (glass(es|)|)’ matches in ‘beer glasses’
• Yes, groups can be nested.
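Alternation and grouping can be seen at work in the sketch below. It is written in Python, which tries alternatives from left to right, whereas POSIX prefers the longest match. The examples are chosen so that both rules give the same answer. The pattern ‘beer|wine glass’ is an extra example that shows how far an ungrouped alternative reaches.

```python
import re

# Without parentheses, | splits the whole expression:
# this reads as 'beer' or 'wine glass'.
ungrouped = re.search(r'beer|wine glass', 'beer glass').group(0)
print(ungrouped)   # beer

# Parentheses limit the reach of | and can be nested.
drink = re.compile(r'(beer|wine) (glass(es|)|)')
whole_matches = [drink.search(order).group(0)
                 for order in ['beer glass', 'wine glass', 'beer ', 'beer glasses']]
print(whole_matches)   # ['beer glass', 'wine glass', 'beer ', 'beer glasses']

# Each pair of parentheses also remembers what it matched.
first_group = drink.search('wine glasses').group(1)
print(first_group)     # wine
```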

repetition operators

• * means zero or more times what precedes it.
• + means one or more times what precedes it.
• ? means zero or one time what precedes it.
• The shortest preceding expression is used, i.e. either a single character or a group.
‘(beer )*’ matches in ‘’
‘(beer )?’ matches in ‘’
‘(beer )+’ matches in ‘beer beer beer’
‘be+r’ matches in ‘beer’
‘be+r’ does not match in ‘bebe’
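The repetition operators can be compared side by side. The following is a sketch in Python, assuming that *, + and ? behave alike in Python's re module and in POSIX extended expressions for these cases. The check of ‘(beer )+’ against the empty string is added for contrast.

```python
import re

star_on_empty = bool(re.search(r'(beer )*', ''))   # True: zero times is allowed
opt_on_empty = bool(re.search(r'(beer )?', ''))    # True: zero times is allowed
plus_on_empty = bool(re.search(r'(beer )+', ''))   # False: at least once is needed

# The group 'beer ' includes the blank, so only the first two beers are taken.
repeated = re.search(r'(beer )+', 'beer beer beer').group(0)

# Without a group, + applies to the single character before it.
plus_beer = bool(re.search(r'be+r', 'beer'))       # True
plus_bebe = bool(re.search(r'be+r', 'bebe'))       # False: no r follows the e

print(star_on_empty, opt_on_empty, plus_on_empty)
print(repr(repeated))   # 'beer beer '
print(plus_beer, plus_bebe)
```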

enumeration

• We can use {min,max} to give a minimum min and a maximum max. min and max are non-negative integers, and max can be left out to mean no upper limit.
‘be{1,3}r’ matches in ‘ber’
‘be{1,3}r’ matches in ‘beer’
‘be{1,3}r’ matches in ‘beeer’
‘be{1,3}r’ does not match in ‘beeeer’
• ? is just a shorthand for {0,1}
• + is just a shorthand for {1,}
• * is just a shorthand for {0,}
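The bounds and the three shorthands can be verified with a short sketch. It is written in Python, whose {min,max} syntax coincides with POSIX for these patterns. The word list is invented for the purpose.

```python
import re

words = ['br', 'ber', 'beer', 'beeer', 'beeeer']

# Between one and three e's are accepted.
one_to_three = [w for w in words if re.search(r'be{1,3}r', w)]
print(one_to_three)   # ['ber', 'beer', 'beeer']

# Each shorthand should agree with its spelled-out form on every word.
pairs = [(r'be?r', r'be{0,1}r'),
         (r'be+r', r'be{1,}r'),
         (r'be*r', r'be{0,}r')]
shorthands_agree = all(
    bool(re.search(short, w)) == bool(re.search(spelled, w))
    for short, spelled in pairs
    for w in words
)
print(shorthands_agree)   # True
```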

examples

• US zip code
^[0-9]{5}(-[0-9]{4})?$
• something like a current date in ISO form
^(20[0-9]{2})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$
• Something like a Palmer School course code
((DIS[89])|(LIS[5-9]))[0-9]{2}
• Something like an XML tag
</*[[:alpha:]]+ */*>
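The four example patterns can be tried out as follows. This is a sketch in Python, with three adjustments that should be read as assumptions about what the slide intends: the course code is written with balanced parentheses, the day part of the date accepts 01 to 31, and [A-Za-z] stands in for [[:alpha:]], which Python's re module does not support. The sample strings are invented.

```python
import re

ZIP = re.compile(r'^[0-9]{5}(-[0-9]{4})?$')
ISO_DATE = re.compile(r'^(20[0-9]{2})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$')
COURSE = re.compile(r'(DIS[89]|LIS[5-9])[0-9]{2}')
TAG = re.compile(r'</*[A-Za-z]+ */*>')

zip_hits = [z for z in ['11548', '11548-1300', '1154'] if ZIP.search(z)]
print(zip_hits)     # ['11548', '11548-1300']

# The three groups pick out year, month and day.
date_parts = ISO_DATE.search('2006-12-03').groups()
print(date_parts)   # ('2006', '12', '03')

course_hits = [c for c in ['LIS651', 'DIS805', 'LIS451'] if COURSE.search(c)]
print(course_hits)  # ['LIS651', 'DIS805']

tag_hits = [t for t in ['<p>', '</p>', '<br />', '<1>'] if TAG.search(t)]
print(tag_hits)     # ['<p>', '</p>', '<br />']
```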

not using posix regular expressions

• Do not use regular expressions when you want to accomplish a simple task for which there is a special PHP function already available.

• A special PHP function will usually do the specialized task more easily. Parsing and understanding the regular expression takes the machine time.

ereg()

• ereg(regex, string) searches for the pattern described in regex within the string string.
• It returns false if no match was found.
• If you call the function as ereg(regex, string, matches) the matches will be stored in the array matches. Thus matches will be a numeric array of the parts of the string that matched the groups (something in ()) of the regular expression. $matches[0] holds the whole match, and the first group match will be $matches[1].
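ereg() is specific to PHP and is no longer present in current PHP versions. As a rough analogue only, the sketch below uses Python's re.search(), whose match object plays the role of the matches array: group(0) is the whole match and group(1) the first parenthesised part. The sample order string is invented.

```python
import re

order = 'a glass of Bruch, please'

m = re.search(r'glass of (Karlsberg|Bruch)', order)
if m is None:
    # re.search() returns None where ereg() would return false.
    print('no match')
else:
    whole = m.group(0)   # comparable to $matches[0]
    beer = m.group(1)    # comparable to $matches[1]
    print(whole)         # glass of Bruch
    print(beer)          # Bruch

no_match = re.search(r'glass of (Karlsberg|Bruch)', 'a glass of water')
print(no_match)          # None
```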

ereg_replace

• ereg_replace ( regex, replacement, string ) searches for the pattern described in regex within the string string and replaces occurrences with replacement. It returns the replaced string.

• If replacement contains expressions of the form \\number, where number is an integer between 1 and 9, the number sub-expression is used.
$better_order=ereg_replace('glass of (Karlsberg|Bruch)', 'pitcher of \\1',$order);
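The replacement with a back-reference can be sketched in Python, where re.sub() takes the place of ereg_replace() and the argument order is pattern, replacement, string. In a Python raw string the reference to the first group is written \1. The sample order is invented.

```python
import re

order = 'one glass of Karlsberg and one glass of Bruch'

# \1 in the replacement is filled with whatever the first group matched.
better_order = re.sub(r'glass of (Karlsberg|Bruch)', r'pitcher of \1', order)

print(better_order)
# one pitcher of Karlsberg and one pitcher of Bruch
```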

split()

• split(regex, string, [max]) splits the string string at the occurrences of the pattern described by the regular expression regex. It returns an array. The matched pattern is not included.

• If the optional argument max is given, it means the maximum number of elements in the returned array. The last element then contains the unsplit rest of the string string.

• Use explode() if you are not splitting at a regular expression pattern. It is faster.
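Splitting at a pattern can be illustrated with Python's re.split(), used here as a stand-in for PHP's split(). One difference to keep in mind: the PHP argument max counts the elements returned, while Python's maxsplit counts the cuts made, so maxsplit corresponds to max minus one. The sample line is invented.

```python
import re

line = 'beer, wine;cider ,  stout'

# Split at a comma or semicolon with optional blanks around it.
parts = re.split(r' *[,;] *', line)
print(parts)       # ['beer', 'wine', 'cider', 'stout']

# One cut only: the last element holds the unsplit rest.
first_and_rest = re.split(r' *[,;] *', line, maxsplit=1)
print(first_and_rest)   # ['beer', 'wine;cider ,  stout']

# For a fixed separator a plain string split is enough,
# much as explode() is in PHP.
plain = 'beer,wine,cider'.split(',')
print(plain)       # ['beer', 'wine', 'cider']
```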

case-insensitive function

• eregi() does the same as ereg() but works case-insensitively.

• eregi_replace() does the same as ereg_replace() but works case-insensitively.

• spliti() does the same as split() but works case-insensitively.
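Python has no separate case-insensitive functions. As a sketch of the same idea, the re.IGNORECASE flag is passed to the ordinary functions, which here stand in for eregi(), eregi_replace() and spliti(). The sample strings are invented.

```python
import re

# Searching: compare with and without the flag.
sensitive = bool(re.search(r'good', 'Good Beer'))                   # False
insensitive = bool(re.search(r'good', 'Good Beer', re.IGNORECASE))  # True

# Replacing regardless of case.
replaced = re.sub(r'beer', 'wine', 'Good Beer', flags=re.IGNORECASE)

# Splitting at x or X.
pieces = re.split(r'x', '2X4x8', flags=re.IGNORECASE)

print(sensitive, insensitive)
print(replaced)   # Good wine
print(pieces)     # ['2', '4', '8']
```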

regular expressions in mySQL

• You can use POSIX regular expressions in mySQL in the SELECT command
SELECT … WHERE column REGEXP ‘regex’
• where column is the column to search and regex is a regular expression.

http://openlib.org/home/krichel

Thank you for your attention!

Please switch off machines b4 leaving!