Satisfy Your Technical Curiosity
Regular Expressions
Roy Osherove
www.iserializable.com
Methodology & Team System Expert
Sela Group
www.Sela.co.il
The hidden power language
Tools
http://tools.osherove.com
www.ISerializable.com
The Log File
Developer Problem: Make this log file useful
Old log file from a *nix system, whose entries are:
    Converted to and from various formats
    Searched by users
Format may change
Search fields can be added, removed or renamed at runtime
Date        CPUs|ram|cpu  HH:mm:ss  action    user    domain.machine
25/05/1998  1|00512|x86   21:49:12  [Search]  Anakin  Antler.Anita1
25/05/98    1|00512|x86   21:51:15  [Update]  Anakin  Antler.Anita1
26/05/1998  1|00256|x86   11:02:45  [Search]  Darth   Cydot.Uk.Gerry2k
26/05/98    1|00256|x86   11:12:49  [Update]  Darth   Cydot.Uk.Gerry2k
27/05/98    1|00512|x86   15:34:30  [Search]  Anakin  Anterl.Anita1
12/08/1998  2|01024|x86   10:14:53  [Search]  Obi     Monaco.Huarez
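As a rough illustration of what "making the log useful" can look like, a single entry can be taken apart with one pattern. This is a sketch in Python's `re` module rather than .NET. The group names and the `parse_log_line` helper are made up for this example, and Python writes named groups as `(?P<name>...)` where .NET uses `(?<name>...)`.

```python
import re

# Hypothetical pattern for the sample entries above; the field names are illustrative.
LOG_LINE = re.compile(
    r"(?P<date>\d{2}/\d{2}/(?:\d{4}|\d{2}))\s+"      # dd/MM/yyyy or dd/MM/yy
    r"(?P<cpus>\d+)\|(?P<ram>\d+)\|(?P<cpu>\w+)\s+"  # CPUs|ram|cpu
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"                # HH:mm:ss
    r"\[(?P<action>\w+)\]\s+"                        # [Search] or [Update]
    r"(?P<user>\w+)\s+"                              # user name
    r"(?P<host>[\w.]+)"                              # domain.machine
)

def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not fit."""
    match = LOG_LINE.fullmatch(line.strip())
    return match.groupdict() if match else None

entry = parse_log_line("25/05/98 1|00512|x86 21:51:15 [Update] Anakin Antler.Anita1")
print(entry)
```

Because each field has a name, adding, removing or renaming a search field becomes a change to the pattern rather than to the parsing code.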
About 15 minutes later…
Done.
About 45 minutes later…
Home early.
You can be home early too!
Regex is easier than you think
What are Regular Expressions?
A language to describe a language using “patterns”
Think SQL or XPath, but for text
Popularized by Perl and *nix shell scripting
Many variations and frameworks exist. Only one for .NET (for now)
Used in most languages
Common Regex Uses
Text Validation
    Phones, emails, addresses or any format requirement
Text Manipulation
    Transform text
Text Parsing
    Find in files, site scraping, data collection
What .NET brings to the table
Full object model
Extended syntax
Optimization techniques in the framework
.NET Regular Expressions
Show up in several places:
    In the classes of the System.Text.RegularExpressions namespace
    Via the RegularExpressionValidator validator control (for ASP.NET)
    Sprinkled in dozens of other places
        Browser capabilities filter
        In the WSDL <match> tag
        And many more
Key Classes within System.Text.RegularExpressions
Regex
    Contains the pattern and matching options
    Important methods:
        IsMatch() returns boolean
        Replace() returns a string
        Split() returns a string array
        …
Main Use: Validation, Splitting, Replacing text
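The three operations can be sketched as follows. This uses Python's `re` module as a stand-in for the .NET `Regex` class, so the function names differ from `IsMatch()`, `Replace()` and `Split()`. The sample text is invented for illustration.

```python
import re

# A compiled pattern object, loosely comparable to constructing a .NET Regex.
digits = re.compile(r"\d+")

text = "order 66, room 101"  # made-up input

has_digits = digits.search(text) is not None  # like IsMatch(): a boolean
masked = digits.sub("#", text)                # like Replace(): a new string
parts = re.split(r",\s*", text)               # like Split(): a list of strings

print(has_digits)  # True
print(masked)      # order #, room #
print(parts)       # ['order 66', 'room 101']
```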
The Process
[Diagram: a Pattern, the Input text and Options go into the Regex, which produces Matches, Splits or Replaced text]
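A small sketch of that flow, assuming Python's `re` module in place of the .NET classes: the same pattern and input produce different matches depending on the options passed in. The input text is invented, and `re.IGNORECASE` plays the role of .NET's `RegexOptions.IgnoreCase`.

```python
import re

pattern = r"\[search\]"                                  # the pattern (brackets escaped)
text = "[Search] Anakin\n[Update] Anakin\n[SEARCH] Obi"  # the input (made up)

strict = re.findall(pattern, text)                   # no options: case-sensitive
relaxed = re.findall(pattern, text, re.IGNORECASE)   # with an option: ignore case

print(strict)   # []
print(relaxed)  # ['[Search]', '[SEARCH]']
```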
Validation
Syntax
Match exact text as written in the pattern
‘a’ will match every ‘a’ in the text.
Except for special symbols:
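A short sketch of the difference between literal text and special symbols, written with Python's `re` module as an assumption (the same idea applies in .NET). The sample string is invented.

```python
import re

text = "banana: 1+1=2"  # made-up input

# A plain letter matches itself, once per occurrence.
all_a = re.findall("a", text)

# '+' is a special symbol ("one or more"), so "1+1" means "one or more 1s, then a 1".
unescaped = re.findall("1+1", text)

# Escaping the '+' with a backslash makes it match a literal plus sign.
escaped = re.findall(r"1\+1", text)

# re.escape() adds the backslashes for you when the text comes from elsewhere.
auto_escaped = re.findall(re.escape("1+1"), text)

print(all_a, unescaped, escaped, auto_escaped)
```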
Enclosing Alternatives with []
The square brackets allow you to specify a list of alternative values. Used in conjunction with the - operator, you can even specify character ranges.
[Cc]           Capital or lowercase c
[A-Z]          Any capital letter A through Z
[A-Za-z]       Any capital or lowercase letter
[0-9]          Any digit 0 through 9
[A-Za-z0-9]    Any letter or digit
[0-9.+&=%-]    Any digit or special char listed
Notice: no escape needed inside the brackets, as long as a literal - comes first or last (between two characters it is read as a range)
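The bracket forms can be tried out like this. The sketch assumes Python's `re` module and an invented sample string. Note that a hyphen between two characters is read as a range, so a literal hyphen is placed last in the class here.

```python
import re

text = "Call 555-0199 or mail a+b=c%"  # made-up input

capitals = re.findall(r"[A-Z]", text)      # any capital letter
digits = re.findall(r"[0-9]", text)        # any digit
specials = re.findall(r"[.+&=%-]", text)   # listed symbols; '-' last, so it is literal

# With the hyphen between '+' and '&' the class asks for a range that runs
# backwards ('+' sorts after '&'), which Python rejects as an error.
try:
    re.compile(r"[0-9.+-&=%]")
    backwards_range_accepted = True
except re.error:
    backwards_range_accepted = False

print(capitals, digits, specials, backwards_range_accepted)
```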
Controlling Expression Frequency with {}
The {} operators allow you to control the frequency of the preceding expression. The expression takes one of these two forms:
{occurrences}                      [A-Za-z]{3}
{MinOccurrences,MaxOccurrences}    [A-Za-z]{1,3}
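Both forms can be checked against a few strings. This is a sketch using Python's `re.fullmatch`, which requires the whole string to fit the pattern; the test strings are invented.

```python
import re

exactly_three = re.compile(r"[A-Za-z]{3}")    # exactly 3 letters
one_to_three = re.compile(r"[A-Za-z]{1,3}")   # between 1 and 3 letters

results = {
    "abc/exact": bool(exactly_three.fullmatch("abc")),   # 3 letters: fits
    "ab/exact": bool(exactly_three.fullmatch("ab")),     # too short
    "a/range": bool(one_to_three.fullmatch("a")),        # 1 letter: fits
    "abcd/range": bool(one_to_three.fullmatch("abcd")),  # too long
}
print(results)
```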
Basic Frequency Operators
?    0 or 1
*    0 or more
+    1 or more
So, 3+ will match 3, 33, 3333 but not 45, 678.
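The `3+` example can be verified directly, along with `?` and `*`. This sketch assumes Python's `re` module and whole-string matching; the `colou?r` and `ab*` patterns are extra illustrations that do not come from the slides.

```python
import re

candidates = ["3", "33", "3333", "45", "678"]

# '+' : one or more of the preceding expression
threes = [s for s in candidates if re.fullmatch(r"3+", s)]

# '?' : zero or one (hypothetical example: optional 'u')
optional_u = [s for s in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", s)]

# '*' : zero or more (hypothetical example: 'a' followed by any number of 'b')
any_bs = [s for s in ["a", "ab", "abbb", "b"] if re.fullmatch(r"ab*", s)]

print(threes, optional_u, any_bs)
```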
Wildcard Operator: .
. matches any non-newline character
    Unless single-line mode (RegexOptions.Singleline) has been turned on for the pattern
Examples:
    A.$ would match a capital A followed by exactly one character at the end of the text.
        Will not match Abc
    A.+ would match a capital A followed by one or more non-newline characters
    \.htm.? would match ".htm" followed by an optional non-newline character
Backslash (\) escapes characters that have reserved meanings in regular expressions
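The wildcard examples can be exercised as below. The sketch assumes Python's `re` module, where the option that lets `.` match a newline is called `re.DOTALL` (the .NET counterpart is `RegexOptions.Singleline`). The file names are invented.

```python
import re

one_char_at_end = re.search(r"A.$", "Ab") is not None   # 'A' + exactly one char at the end
too_long = re.search(r"A.$", "Abc") is not None         # two chars follow the 'A'

greedy = re.search(r"A.+", "Abc").group()               # 'A' + one or more chars

html_ext = re.search(r"\.htm.?", "index.html").group()  # optional extra character present
htm_ext = re.search(r"\.htm.?", "index.htm").group()    # optional extra character absent

# By default '.' stops at a newline; the DOTALL option lifts that restriction.
across_newline_default = re.search(r"A.B", "A\nB") is not None
across_newline_dotall = re.search(r"A.B", "A\nB", re.DOTALL) is not None

print(one_char_at_end, too_long, greedy, html_ext, htm_ext)
print(across_newline_default, across_newline_dotall)
```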
Convenience Expressions
\d    Any digit
\D    Any non-digit (it must still match one character)
\s    Any whitespace character (such as a space or tab)
\S    Any character other than a whitespace character
\w    Any letter, digit or underscore
\W    Any character other than a letter, digit or underscore
Many more: Unicode, hex values, negative lookarounds…
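Each shorthand and its uppercase opposite can be compared side by side. This is a sketch with Python's `re` module and two tiny invented inputs; note that `\w` also accepts the underscore.

```python
import re

sample = "a1 b2"   # made-up input with letters, digits and a space
word = "a_1!"      # made-up input with an underscore and punctuation

digits = re.findall(r"\d", sample)        # digits only
non_digits = re.findall(r"\D", sample)    # everything that is not a digit
spaces = re.findall(r"\s", sample)        # whitespace only
non_spaces = re.findall(r"\S", sample)    # everything that is not whitespace
word_chars = re.findall(r"\w", word)      # letters, digits and underscore
non_word_chars = re.findall(r"\W", word)  # everything else

print(digits, non_digits, spaces, non_spaces, word_chars, non_word_chars)
```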
Quick Quiz!
[A-Za-z]{3}
    3 capital or lowercase letters
    Abc, abc, aBC, 1bc
[A-Z][a-z]{2,4}
    A capital letter followed by at least 2 but not more than 4 lowercase letters
    Abc, Acbde, abcde, ABcde
\w{3,8}\.\w{3}
    3 to 8 alphanumeric characters, followed by a dot and 3 alphanumerics
    Filename.txt, d0main.com, 1234.567, 34.456
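One way to check the quiz answers is to run each pattern against its candidates. This sketch uses Python's `re.fullmatch` and assumes the quiz means the whole string must fit the pattern. With an unanchored search some candidates would pass on a substring, for example `ABcde` contains `Bcde`. The `whole_string_matches` helper is made up for this example.

```python
import re

quiz = {
    r"[A-Za-z]{3}": ["Abc", "abc", "aBC", "1bc"],
    r"[A-Z][a-z]{2,4}": ["Abc", "Acbde", "abcde", "ABcde"],
    r"\w{3,8}\.\w{3}": ["Filename.txt", "d0main.com", "1234.567", "34.456"],
}

def whole_string_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

answers = {pattern: whole_string_matches(pattern, cands) for pattern, cands in quiz.items()}

for pattern, matched in answers.items():
    print(pattern, "->", matched)
```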
Splitting and Manipulating
The Spammer
Key Classes within System.Text.RegularExpressions (2)
MatchCollection - Match
    MatchCollection stores all the matches found
GroupCollection - Group
CaptureCollection - Capture
Regex.Match() returns Match
Regex.Matches() returns MatchCollection
…
Main Use: Parsing, searching, collecting data
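The single-match and all-matches calls have close counterparts in Python's `re` module, which is used here as a stand-in for the .NET classes. One difference to keep in mind: Python's `search` returns `None` when nothing is found, while .NET's `Regex.Match()` returns a `Match` object whose `Success` property is false. The input line is assembled from the sample log for illustration.

```python
import re

text = "21:49:12 [Search] Anakin, 21:51:15 [Update] Anakin"
time_pattern = re.compile(r"\d{2}:\d{2}:\d{2}")

# One match object, roughly what Regex.Match() gives you.
first = time_pattern.search(text)

# Every match, roughly what Regex.Matches() collects into a MatchCollection.
all_matches = list(time_pattern.finditer(text))

print(first.group(), first.start())
print([(m.group(), m.start()) for m in all_matches])
```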
Simple parsing
Parsing for emails
Grouping (the coolest part)
Grouping (pay attention!)
Groups give us object models
HTML [email protected]
Create a capture hierarchy and use it in code
[\w\.\-]+@[\w\.\-]+\.\w{2,5}
(?<userName>[\w\.\-]+)@(?<domain>[\w\.\-]+\.\w{2,5})
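The named groups can then be read back by name in code. This sketch uses Python's `re` module, where named groups are written `(?P<name>...)` instead of the .NET form `(?<name>...)`. The rest of the pattern follows the slide, and the addresses are placeholders invented for the example.

```python
import re

# Same structure as the grouped pattern above, in Python's named-group syntax.
EMAIL = re.compile(r"(?P<userName>[\w.\-]+)@(?P<domain>[\w.\-]+\.\w{2,5})")

text = "Contact someone@example.com or admin@mail.example.org today."  # made-up input

# Each match exposes its groups by name, giving a small object model per email.
found = [(m.group("userName"), m.group("domain")) for m in EMAIL.finditer(text)]

for user_name, domain in found:
    print(user_name, "at", domain)
```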
Grouping Emails & The Regulator
Regulazy
Build simple expressions by example
No syntax knowledge needed
Free
Tools.osherove.com
When not to use Regex
When it’s easier and more readable to do it otherwise
Not just because it’s “cool”
Hard to read
Steep learning curve
Hard to maintain
“Sometimes, when confronted with a problem, you might decide to solve it with Regular Expressions for the wrong reasons. Now you’ve got two problems.”
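A small, hypothetical case where plain string code is arguably the better choice: checking a file extension. The sketch is in Python and the file name is invented. Both versions give the same answer, but the second needs no pattern knowledge to read.

```python
import re

filename = "report.HTM"  # made-up input

# Regex version: works, but the reader has to decode the pattern and the option.
is_htm_regex = re.search(r"\.htm$", filename, re.IGNORECASE) is not None

# Plain string version: same result, easier to read and maintain.
is_htm_plain = filename.lower().endswith(".htm")

print(is_htm_regex, is_htm_plain)
```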
Summary
Amazing parsing flexibility
Good skill to have anywhere
Can save you time and nerves
With power comes responsibility
Weigh the pros and cons before using
Resources
The Regulator: tools.osherove.com
Regulazy: tools.osherove.com
Regexlib.com: Regex archive (http://www.regexlib.com) + Cheat Sheet
http://www.regular-expressions.info
Roy Osherove: [email protected]: www.iserializable.com
Thank you!
Questions?
Roy Osherove: [email protected]: www.iserializable.com