Regular Expressions in R Houston R Users Group 10.05.2011 Ed Goodwin twitter: @egoodwintx
Page 1: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Regular Expressions in R

Houston R Users Group

10.05.2011

Ed Goodwin

twitter: @egoodwintx

Page 2: Eag 201110-hrugregexpresentation-111006104128-phpapp02

What is a Regular Expression?

Regexes are an extremely flexible tool for finding and replacing text. They can easily be applied globally across a document or dataset, or specifically to individual strings.

Page 3: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Example

Data:

LastName, FirstName, Address, Phone

Baker, Tom, 123 Unit St., 555-452-1324

Smith, Matt, 456 Tardis St., 555-326-4567

Tennant, David, 567 Torchwood Ave., 555-563-8974

Regular Expression to Convert “St.” to “Street”:

gsub("St\\.", "Street", data[i])

*Note the double backslash “\\” to escape the ‘.’
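The substitution on this slide can be tried on the sample records directly. The sketch below assumes the rows are held in a character vector named `data` (the slide does not show how `data` is built), and relies on `gsub()` being vectorized, so no loop over `data[i]` is needed.

```r
# Assumed container for the slide's sample rows: one string per record.
data <- c(
  "Baker, Tom, 123 Unit St., 555-452-1324",
  "Smith, Matt, 456 Tardis St., 555-326-4567",
  "Tennant, David, 567 Torchwood Ave., 555-563-8974"
)

# "St\\." is the R string for the regex St\. (S, t, then a literal period).
cleaned <- gsub("St\\.", "Street", data)
print(cleaned)

# Without the escape, "." matches any character, so "Sto" in "Stone" is hit too.
unescaped <- gsub("St.", "Street", "12 Stone St.")
print(unescaped)
```

The second call shows why the escape matters: the unescaped pattern also rewrites the start of "Stone".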

Page 4: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Benefits of Regex

• Flexible (can be applied globally or specifically across data)

• Terse (very powerful scripting template)

• Portable (sort of) across languages

• Rich history

Page 5: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Disadvantages of regex

• Non-intuitive

• Easy to make errors (unintended consequences)

• Difficult to robustly debug

• Various flavors may cause portability issues.

Page 6: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Why do this in R?

• Easier to locate all code in one place

• (Relatively) Robust regex tools

• May be the only tool available

• Familiarity

Page 7: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Other alternatives?

• Perl

• Python

• Java

• Ruby

• Others (grep, sed, awk, bash, csh, ksh, etc.)

Page 8: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Components of a Regular Expression

• Characters

• Metacharacters

• Character classes

Page 9: Eag 201110-hrugregexpresentation-111006104128-phpapp02

The R regex functions

Note: all functions are in the base package

Function: Purpose

strsplit(): breaks apart strings at predefined points

grep(): returns a vector of indices where a pattern is matched

grepl(): returns a logical vector (TRUE/FALSE) for each element of the data

sub(): replaces one pattern with another at first matching location

gsub(): replaces one pattern with another at every matching location

regexpr(): returns an integer vector giving the starting position of the first match, along with a match.length attribute giving the length of the matched text.

gregexpr(): returns a list of integer vectors giving the starting positions of all matches, each with a match.length attribute giving the lengths of the matched text.

Page 10: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Metacharacter Symbols

Modifier Meaning

^ anchors expression to beginning of target

$ anchors expression to end of target

. matches any single character except newline

| separates alternative patterns

[] accepts any of the enclosed characters

[^] accepts any characters but the ones enclosed in brackets

() groups patterns together for assignment or constraint

* matches zero or more occurrences of preceding entity

? matches zero or one occurrences of preceding entity

+ matches one or more occurrences of preceding entity

{n} matches exactly n occurrences of preceding entity

{n,} matches at least n occurrences of preceding entity

{n,m} matches n to m occurrences of preceding entity

\ interpret succeeding character as literal

Source: “Data Manipulation with R”. Spector, Phil. Springer, 2008. page 92.
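A few of the metacharacters in the table can be checked with `grepl()`. This is a small sketch on a made-up vector `x` (the words are illustrative and not from the slides).

```r
# Hypothetical test words for trying out anchors and quantifiers.
x <- c("cat", "cart", "ct", "caat", "scatter")

starts_with_cat <- grepl("^cat", x)    # ^ anchors to the beginning
ends_with_cat   <- grepl("cat$", x)    # $ anchors to the end
optional_a      <- grepl("ca?t", x)    # ? allows zero or one "a"
any_number_a    <- grepl("ca*t", x)    # * allows zero or more "a"
exactly_two_a   <- grepl("ca{2}t", x)  # {2} requires exactly two "a"
any_middle_char <- grepl("c.t", x)     # . matches any single character

# One row per pattern, one column per word.
print(rbind(starts_with_cat, ends_with_cat, optional_a,
            any_number_a, exactly_two_a, any_middle_char))
```

Note that the unanchored patterns match anywhere in the string, which is why "scatter" matches several of them.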

Page 11: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Examples

[A-Za-z]+ matches one or more alphabetic characters

.* matches zero or more of any character up to the newline

.*\\.\\* matches zero or more characters followed by a literal .*

(July? ) Accept ‘Jul’ or ‘July’ but not ‘Julyy’. Note the space.

(abc|123) Match “abc” or “123”

[abc|123] Match a single a, b, c, 1, 2, 3 or a literal ‘|’. Inside brackets the ‘|’ is not an alternation operator.

^(From|Subject|Date): Matches lines starting with “From:” or “Subject:” or “Date:”
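The example patterns above can be verified with `grepl()`. The input strings below are made up for illustration; the `^...$` anchors around the grouping and bracket patterns are added here only so that each test string must match in full.

```r
# "(July? )": the "y" is optional, and the trailing space is required.
july <- grepl("(July? )", c("Jul 4", "July 4", "Julyy 4"))

# Grouping with | is alternation between whole alternatives...
candidates <- c("abc", "123", "a", "|")
group_alt <- grepl("^(abc|123)$", candidates)

# ...while inside brackets every character, including |, is a literal member.
char_class <- grepl("^[abc|123]$", candidates)

# Header lines: the pattern must start at the beginning of the string.
headers <- grepl("^(From|Subject|Date):",
                 c("From: a@b.c", "To: d@e.f", "Date: today", " Subject: hi"))

print(list(july = july, group_alt = group_alt,
           char_class = char_class, headers = headers))
```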

Page 12: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Let’s work through some examples...

Data

LastName, FirstName, Address, Phone

Baker, Tom, 123 Unit St., 555-452-1324

Smith, Matt, 456 Tardis St., 555-326-4567

Tennant, David, 567 Torchwood Ave., 555-563-8974

1. Locate all phone numbers.

2. Locate all addresses.

3. Locate all addresses ending in ‘Street’ (including abbreviations).

4. Read in full names, reverse the order and remove the comma.
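One possible way to approach the four tasks is sketched below. It is not the presenter's solution: the vector name `records`, the choice to treat the address as the third comma-separated field, and the specific patterns are all assumptions made for illustration.

```r
# Assumed input: the data rows from the slide, without the header line.
records <- c(
  "Baker, Tom, 123 Unit St., 555-452-1324",
  "Smith, Matt, 456 Tardis St., 555-326-4567",
  "Tennant, David, 567 Torchwood Ave., 555-563-8974"
)

# 1. Phone numbers: three digits, dash, three digits, dash, four digits.
phones <- regmatches(records, regexpr("[0-9]{3}-[0-9]{3}-[0-9]{4}", records))

# 2. Addresses: split each record on ", " and take the third field.
fields <- strsplit(records, ", ")
addresses <- sapply(fields, `[`, 3)

# 3. Addresses ending in "Street" or its abbreviation "St.".
street_addresses <- grep("(St\\.|Street)$", addresses, value = TRUE)

# 4. Names: capture last and first name, then emit them reversed, no comma.
full_names <- sub("^([^,]+), ([^,]+),.*$", "\\2 \\1", records)

print(phones)
print(addresses)
print(street_addresses)
print(full_names)
```

Splitting on ", " works here because no field in the sample contains a comma followed by a space; real address data would likely need a proper CSV parser.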

Page 13: Eag 201110-hrugregexpresentation-111006104128-phpapp02

So how would you write the regular expression to match a calendar date in format “mm/dd/yyyy” or “mm.dd.yyyy”?

Page 14: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Regex to identify date format?

What’s wrong with

“[0-9]{2}(.|/)[0-9]{2}(.|/)[0-9]{4}” ?

Or with

“[1-12](.|/)[1-31](.|/)[0001-9999]” ?
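Running both candidate patterns against a few made-up strings makes the problems visible. The variable names and test strings below are illustrative only.

```r
# First candidate: the "." is unescaped and there are no anchors or range checks.
bad1 <- "[0-9]{2}(.|/)[0-9]{2}(.|/)[0-9]{4}"
any_separator <- grepl(bad1, "10x05y2011")      # "." accepts any character
no_range      <- grepl(bad1, "99/99/0000")      # digits are not range checked
no_anchors    <- grepl(bad1, "abc10/05/20111")  # extra text around the date

# Second candidate: brackets define a set of single characters, not a
# numeric range, so [1-12] is just the set {1, 2}.
bad2 <- "[1-12](.|/)[1-31](.|/)[0001-9999]"
one_char_each <- grepl(bad2, "1/2/3")
twelve <- grepl("^[1-12]$", "12")   # two characters cannot match one class
two    <- grepl("^[1-12]$", "2")

print(c(any_separator, no_range, no_anchors, one_char_each, twelve, two))
```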

Page 15: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Dates are not an easy problem because they are not a simple text pattern

Best bet is to validate the textual pattern (mm.dd.yyyy) and then pass to a separate function to validate the date (leap years, odd days in month, etc.)

"^(1[0-2]|0[1-9])(\\.|/)(3[0-1]|[1-2][0-9]|0[1-9])(\\.|/)([0-9]{4})$"
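The two-step approach can be sketched as follows. The helper name `is_valid_date` is hypothetical, and using `as.Date()` with an explicit format as the calendar check is one choice among several; it returns `NA` for strings that do not correspond to a real date.

```r
# Textual pattern from the slide: mm, separator, dd, separator, yyyy.
date_pattern <- "^(1[0-2]|0[1-9])(\\.|/)(3[0-1]|[1-2][0-9]|0[1-9])(\\.|/)([0-9]{4})$"

# Hypothetical helper: step 1 checks the text shape, step 2 checks the calendar.
is_valid_date <- function(x) {
  looks_ok <- grepl(date_pattern, x)
  # Normalize "." separators to "/" so a single format string can be used.
  parsed <- as.Date(gsub("\\.", "/", x), format = "%m/%d/%Y")
  looks_ok & !is.na(parsed)
}

checks <- is_valid_date(c("02/29/2012", "02/29/2011", "13/01/2011", "10.05.2011"))
print(checks)
```

February 29 passes the regex in both years, but only 2012 is a leap year, so the calendar step rejects the 2011 date.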

Page 16: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Supported flavors of regex in R

• POSIX 1003.2

• Perl

Perl is the more robust of the two. POSIX has a few idiosyncrasies handling ‘\’ that may trip you up.
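As a hedged illustration of what the Perl flavor adds, the sketch below uses two features that R's documentation describes for `perl = TRUE`: lookahead assertions and the `\\U` case-conversion escape in replacements. The input string is made up.

```r
# Hypothetical input string.
x <- "price: 100 USD"

# Lookahead: match digits only when followed by " USD", without consuming it.
amount <- regmatches(x, regexpr("[0-9]+(?= USD)", x, perl = TRUE))

# \\U in the replacement upper-cases the captured text (perl = TRUE only).
capitalized <- gsub("^([a-z])", "\\U\\1", x, perl = TRUE)

print(amount)
print(capitalized)
```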

Page 17: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Usage Patterns

• Data validation

• String replace on dataset

• String identify in dataset (subset of data)

• Pattern arithmetic (how prevalent is string in data?)

• Error prevention/detection
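Several of these usage patterns reduce to a single `grepl()` call. The sketch below uses a made-up `phones` vector to show validation, subsetting, and a simple prevalence count.

```r
# Hypothetical phone column; the second entry is missing a dash.
phones <- c("555-452-1324", "555-3264567", "555-563-8974")

# Data validation: TRUE where the whole string has the expected shape.
valid <- grepl("^[0-9]{3}-[0-9]{3}-[0-9]{4}$", phones)

# String identification: keep only the matching subset.
good_phones <- phones[valid]

# Pattern arithmetic: how many entries match, and what share of the data.
n_valid <- sum(valid)
share_valid <- mean(valid)

print(good_phones)
print(c(n_valid = n_valid, share_valid = share_valid))
```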

Page 18: Eag 201110-hrugregexpresentation-111006104128-phpapp02

The R regex functions

Note: all functions are in the base package

Function: Purpose

strsplit(): breaks apart strings at predefined points

grep(): returns a vector of indices where a pattern is matched

grepl(): returns a logical vector (TRUE/FALSE) for each element of the data

sub(): replaces one pattern with another at first matching location

gsub(): replaces one pattern with another at every matching location

regexpr(): returns an integer vector giving the starting position of the first match, along with a match.length attribute giving the length of the matched text.

gregexpr(): returns a list of integer vectors giving the starting positions of all matches, each with a match.length attribute giving the lengths of the matched text.

Page 19: Eag 201110-hrugregexpresentation-111006104128-phpapp02

strsplit( )

Definition:
strsplit(x, split, fixed=FALSE, perl=FALSE, useBytes=FALSE)

Example:
str <- "This is some dummy data to parse x785y8099"
strsplit(str, "[ xy]", perl=TRUE)

Result:
[[1]]
 [1] "This" "is" "some" "dumm" "" "data" "to" "parse" ""
[10] "785" "8099"
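The empty strings in the result appear wherever two delimiters are adjacent ("y " and " x"). One way to avoid them, sketched here as a variation rather than as the slide's intent, is to let the delimiter class repeat with `+`. The variable name `txt` is used instead of `str` only to avoid masking the base function `str()`.

```r
txt <- "This is some dummy data to parse x785y8099"

# "[ xy]+" treats a run of spaces, x's and y's as a single delimiter.
pieces <- strsplit(txt, "[ xy]+", perl = TRUE)[[1]]
print(pieces)
```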

Page 20: Eag 201110-hrugregexpresentation-111006104128-phpapp02

grep( )

Definition:
grep(pattern, x, ignore.case=FALSE, perl=FALSE, value=FALSE, fixed=FALSE, useBytes=FALSE, invert=FALSE)

Example:
str <- "This is some dummy data to parse x785y8099"
grep("[a-z][0-9]{3}[a-z][0-9]{4}", str, perl=TRUE, value=TRUE)

Result:
[1] "This is some dummy data to parse x785y8099"
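On a vector with more than one element, the difference between the `value` and `invert` arguments becomes clearer. The `ids` vector below is made up for illustration.

```r
# Hypothetical inputs: only the first and third contain the full code pattern.
ids <- c("x785y8099", "no code here", "a123b4567", "x78y8099")
code_pattern <- "[a-z][0-9]{3}[a-z][0-9]{4}"

match_index  <- grep(code_pattern, ids)                 # positions of matches
match_values <- grep(code_pattern, ids, value = TRUE)   # the matching strings
non_matches  <- grep(code_pattern, ids, value = TRUE, invert = TRUE)

print(match_index)
print(match_values)
print(non_matches)
```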

Page 21: Eag 201110-hrugregexpresentation-111006104128-phpapp02

grepl( )

Definition:
grepl(pattern, x, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)

Example:
str <- "This is some dummy data to parse x785y8099"
grepl("[a-z][0-9]{3}[a-z][0-9]{4}", str, perl=TRUE)

Result:
[1] TRUE

Page 22: Eag 201110-hrugregexpresentation-111006104128-phpapp02

sub( )

Definition:
sub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)

Example:
str <- "This is some dummy data to parse x785y8099"
sub("dummy(.* )([a-z][0-9]{3}).([0-9]{4})", "awesome\\1\\2H\\3", str, perl=TRUE)

Result:
[1] "This is some awesome data to parse x785H8099"

Page 23: Eag 201110-hrugregexpresentation-111006104128-phpapp02

gsub( )

Definition:
gsub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)

Example:
str <- "This is some dummy data to parse x785y8099 you dummy"
gsub("dummy", "awesome", str, perl=TRUE)

Result:
[1] "This is some awesome data to parse x785y8099 you awesome"

Page 24: Eag 201110-hrugregexpresentation-111006104128-phpapp02

regexpr( )

Definition:
regexpr(pattern, text, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)

Example:
duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"
regexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)

Result:
[1] 1
attr(,"match.length")
[1] 4
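The start position and `match.length` attribute are enough to cut the matched text out of the string. The sketch below searches for "goose" instead of "duck" (a change made here for illustration) and shows both the manual `substring()` route and base R's `regmatches()` helper.

```r
duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"

m <- regexpr("goose", duckgoose)
start <- as.integer(m)             # starting position of the first match
len   <- attr(m, "match.length")   # number of characters matched

# Manual extraction from position and length.
by_hand <- substring(duckgoose, start, start + len - 1)

# regmatches() does the same bookkeeping.
by_helper <- regmatches(duckgoose, m)

print(c(start = start, len = len))
print(by_hand)
```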

Page 25: Eag 201110-hrugregexpresentation-111006104128-phpapp02

gregexpr( )

Definition:
gregexpr(pattern, text, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)

Example:
duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"
gregexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)

Result:
[[1]]
[1] 1 7 13 26 32 45 51
attr(,"match.length")
[1] 4 4 4 4 4 4 4
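Because `gregexpr()` returns a list with one element per input string, counting or extracting matches takes one more step. The sketch below pairs it with `regmatches()`; the "swan" search is an added illustration of the no-match case, where `gregexpr()` reports a single -1 rather than an empty vector.

```r
duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"

m <- gregexpr("duck", duckgoose, ignore.case = TRUE)
starts <- as.integer(m[[1]])               # start position of every match
ducks  <- regmatches(duckgoose, m)[[1]]    # the matched text itself
n_ducks <- length(ducks)

# No match: m[[1]] would be -1 (length 1), but regmatches() returns
# character(0), so counting via regmatches() is safe.
n_swans <- length(regmatches(duckgoose, gregexpr("swan", duckgoose))[[1]])

print(starts)
print(c(n_ducks = n_ducks, n_swans = n_swans))
```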

Page 26: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Problem Solving & Debugging

• Remember that regexes are greedy by default. They will try to grab the largest matching string possible unless constrained.

• Dummy data - small datasets

• Unit testing - testthat, etc.

• Build up regex complexity incrementally
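The greedy behavior mentioned in the first bullet is easy to reproduce on a small piece of dummy data. The HTML-like string below is made up; the sketch compares a greedy quantifier with two common ways of constraining it.

```r
# Hypothetical dummy data with two tag pairs.
html <- "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" runs to the last ">" in the string.
greedy <- regmatches(html, regexpr("<.+>", html))

# Non-greedy quantifier: ".+?" stops at the first ">" that allows a match.
non_greedy <- regmatches(html, regexpr("<.+?>", html, perl = TRUE))

# Negated class: "[^>]+" cannot cross a ">" at all.
negated <- regmatches(html, regexpr("<[^>]+>", html))

print(greedy)
print(non_greedy)
print(negated)
```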

Page 27: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Best Practices for Regex in R

• Store regex string as variable to pass to function

• Try to make the regex as exact as possible (avoid lazy matching)

• Pick one type of regex syntax and stick with it (POSIX or Perl)

• Document all regexes in code with liberal comments

• Use cat() to verify regex string

• Test, test, and test some more
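A few of these practices can be combined in a short sketch: the pattern is stored in a commented variable, inspected with `cat()`, and then passed to a function. The variable names and the sample address are illustrative.

```r
# Regex for the abbreviation "St." : S, t, then a literal period.
# In R source the backslash must itself be escaped, hence "St\\.".
street_pattern <- "St\\."

# cat() shows the string as the regex engine will see it, with a single
# backslash; print() would show the escaped R source form instead.
cat(street_pattern, "\n")

# The engine receives 4 characters: S, t, \, and .
pattern_length <- nchar(street_pattern)

# Pass the stored pattern to the function instead of retyping it.
result <- gsub(street_pattern, "Street", "123 Unit St.")
print(result)
```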

Page 28: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Regex Workflow

• Define initial data pattern

• Define desired data pattern

• Define transformation steps

• Incremental iteration to desired regex

• Testing & QA

Page 29: Eag 201110-hrugregexpresentation-111006104128-phpapp02

Regex Resources

• http://regexpal.com/ - online regex tester

• Data Manipulation with R. Spector, Phil. Springer, 2008.

• Regular Expression Cheat Sheet. http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/

• Regular Expressions Cookbook. Goyvaerts, Jan and Levithan, Steven. O’Reilly, 2009.

• Mastering Regular Expressions. Friedl, Jeffrey E.F. O’Reilly, 2006.

• Twitter: @RegexTip - regex tips and tricks