Regular Expressions

Regular Expressions. Tokenizing strings When you read a sentence, your mind breaks it into tokens individual words and punctuation marks that convey meaning.

Jan 03, 2016


Jemimah Bryant

Regular Expressions


Tokenizing strings

When you read a sentence, your mind breaks it into tokens: individual words and punctuation marks that convey meaning.

String method split breaks a String into component tokens and returns an array of Strings.

Tokens are separated by delimiters, typically white-space characters such as space, tab, newline and carriage return.

Other characters can also be used as delimiters to separate tokens.
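
A minimal sketch of tokenizing with split, first on runs of white space and then on a comma delimiter. The class name SplitDemo, its helper methods and the sample strings are assumptions made only for illustration.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on one or more white-space characters (space, tab, newline, carriage return).
    // trim() avoids an empty first token when the input starts with white space.
    static String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }

    // Uses a non-white-space delimiter: a comma followed by optional white space.
    static String[] tokenizeCsv(String line) {
        return line.split(",\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("When you read\ta sentence\r\nit splits")));
        System.out.println(Arrays.toString(tokenizeCsv("red, green,blue")));
    }
}
```

The argument to split is itself a regular expression, which is why "\\s+" can treat a tab or a carriage return plus newline as a single delimiter.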


Regular expressions

A regular expression is a specially formatted String describing a search pattern, useful for validating input.

One application is constructing a compiler. Large and complex regular expressions are used to this end.

If the program code does not match the regular expression, the compiler knows that there is a syntax error.


Regular Expressions (cont’d)

String method matches receives a String specifying the regular expression, matches the entire contents of the String object on which it is called against the regular expression, and returns a boolean indicating whether the match succeeded.

A regular expression consists of literal characters and special symbols.
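
As a small illustration, the sketch below mixes literal characters (a and c) with one special symbol (*). The class name MatchesDemo and the pattern "ab*c" are assumptions chosen for the example, not part of the original slides.

```java
class MatchesDemo {
    // "a" and "c" are literal characters; "*" is a special symbol meaning
    // "zero or more of the preceding element" (here, the letter b).
    static boolean isAbc(String s) {
        return s.matches("ab*c");
    }

    public static void main(String[] args) {
        System.out.println(isAbc("ac"));     // zero b's
        System.out.println(isAbc("abbbc"));  // three b's
        System.out.println(isAbc("abcd"));   // trailing d: the whole string must match
    }
}
```

Note that matches succeeds only when the whole String fits the pattern, so "abcd" is rejected even though it starts with a matching prefix.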


Character classes

A character class represents a group of characters, written either as a bracketed set (see the table) or as an escape sequence such as \d.

It matches a single character in the search object.

Construct        Description
[abc]            a, b, or c (simple class)
[^abc]           Any character except a, b, or c (negation)
[a-zA-Z]         a through z, or A through Z, inclusive (range)
[a-d[m-p]]       a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]     d, e, or f (intersection)
[a-z&&[^bc]]     a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z] (subtraction)
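
The constructs in the table can be tried out with a short sketch like the following. The class CharClassDemo and its helper matchesClass are hypothetical names used only for this illustration.

```java
class CharClassDemo {
    // Tests whether the input, taken as a whole, is matched by the given character class.
    // Because a character class matches exactly one character, longer inputs fail.
    static boolean matchesClass(String charClass, String input) {
        return input.matches(charClass);
    }

    public static void main(String[] args) {
        System.out.println(matchesClass("[abc]", "b"));          // simple class
        System.out.println(matchesClass("[^abc]", "d"));         // negation
        System.out.println(matchesClass("[a-d[m-p]]", "n"));     // union
        System.out.println(matchesClass("[a-z&&[def]]", "e"));   // intersection
        System.out.println(matchesClass("[a-z&&[^bc]]", "b"));   // subtraction: b is excluded
        System.out.println(matchesClass("[abc]", "ab"));         // two characters: no match
    }
}
```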


Common Matching Symbols

Regular Expression   Description
.                    Matches any character
^regex               regex must match at the beginning of the line
regex$               regex must match at the end of the line
[abc]                Set definition; matches the letter a or b or c
[abc][vz]            Set definition; matches a or b or c followed by either v or z
[^abc]               A "^" as the first character inside [] negates the set; matches any character except a or b or c
[a-d1-7]             Ranges; matches one letter between a and d or one digit from 1 to 7, so it will not match the two-character string d1
X|Z                  Finds X or Z
XZ                   Finds X directly followed by Z
$                    Checks if a line end follows
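
A brief sketch exercising several of these symbols. It uses Matcher.find so that a pattern may match anywhere in the input, which makes the effect of the anchors "^" and "$" visible. The class SymbolsDemo and the helper found are assumed names for illustration.

```java
import java.util.regex.Pattern;

class SymbolsDemo {
    // Returns true if the regex matches some part of the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "This is it";
        System.out.println(found("^This", text));       // anchored at the beginning
        System.out.println(found("^is", text));         // "is" occurs, but not at the beginning
        System.out.println(found("it$", text));         // anchored at the end
        System.out.println(found("[abc][vz]", "xbz"));  // b followed by z
        System.out.println(found("X|Z", "abZ"));        // alternation
        System.out.println(found("XZ", "X Z"));         // X is not directly followed by Z
        System.out.println("d1".matches("[a-d1-7]"));   // a set matches a single character only
    }
}
```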


Ranges

Ranges in characters are determined by the letters' integer values.

Ex: "[A-Za-z]" matches all uppercase and lowercase letters.

The range "[A-z]" matches all letters and also matches those characters (such as [ and \) with an integer value between uppercase A and lowercase z.
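
The difference between the two ranges can be checked with a sketch like this one. The class RangeDemo and its method names are assumptions introduced for the example.

```java
class RangeDemo {
    // Loose range: every character whose integer value lies between 'A' (65) and 'z' (122).
    static boolean inLooseRange(String ch) {
        return ch.matches("[A-z]");
    }

    // Strict range: uppercase and lowercase letters only.
    static boolean isLetter(String ch) {
        return ch.matches("[A-Za-z]");
    }

    public static void main(String[] args) {
        // 'Z' is 90 and 'a' is 97, so '[' (91) and '\' (92) fall inside [A-z].
        System.out.println((int) 'Z' + " " + (int) '[' + " " + (int) '\\' + " " + (int) 'a');
        System.out.println(inLooseRange("["));
        System.out.println(isLetter("["));
    }
}
```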


Grouping

Parts of a regex can be grouped using "()". In a replacement string, "$" followed by the group number refers to that group.

Example: removing whitespace between a word character and "." or ","

    String pattern = "(\\w)(\\s+)([\\.,])";
    System.out.println(str.replaceAll(pattern, "$1$3"));
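
A runnable version of the grouping example might look like the sketch below. The class GroupingDemo, the method tighten and the sample sentence are assumptions added for illustration; the pattern and replacement are the ones from the slide.

```java
class GroupingDemo {
    // Group 1: a word character; group 2: the white space to drop; group 3: "." or ",".
    // The replacement "$1$3" keeps groups 1 and 3 and discards group 2.
    static String tighten(String str) {
        String pattern = "(\\w)(\\s+)([\\.,])";
        return str.replaceAll(pattern, "$1$3");
    }

    public static void main(String[] args) {
        System.out.println(tighten("Hello , world ."));
    }
}
```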


Negative look-ahead

It is used to exclude a pattern and is defined via (?!pattern).

Example: a(?!b) matches a if a is not followed by b.
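
The effect of a(?!b) can be seen in a short sketch. The class LookaheadDemo, the helper countMatches and the sample input are assumed for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Counts how many times the regex matches within the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "ab ac a";
        // The a in "ab" is skipped; the a in "ac" and the final a are matched.
        System.out.println(countMatches("a(?!b)", input));
        // The look-ahead does not consume the following character, so only the a is replaced.
        System.out.println(input.replaceAll("a(?!b)", "X"));
    }
}
```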


Predefined character classes

Construct   Description
.           Any character (may or may not match line terminators)
\d          A digit: [0-9]
\D          A non-digit: [^0-9]
\s          A whitespace character: [ \t\n\x0B\f\r]
\S          A non-whitespace character: [^\s]
\w          A word character: [a-zA-Z_0-9]
\W          A non-word character: [^\w]


Matches Method: Examples

Validating a first name:

    firstName.matches("[A-Z][a-zA-Z]*");

Validating a first name (one word, or two words separated by a single white-space character):

    "([a-zA-Z]+|[a-zA-Z]+\\s[a-zA-Z]+)"

The character "|" matches the expression to its left or to its right. "Hi (John|Jane)" matches both "Hi John" and "Hi Jane".

Validating a Zip code:

    "\\d{5}"
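
The three validation patterns can be wrapped in small helper methods, as in this sketch. The class ValidateDemo and its method names are assumptions for illustration; the patterns are taken from the slide.

```java
class ValidateDemo {
    // An uppercase letter followed by any number of letters.
    static boolean isFirstName(String s) {
        return s.matches("[A-Z][a-zA-Z]*");
    }

    // One word, or two words separated by exactly one white-space character.
    static boolean isOneOrTwoWords(String s) {
        return s.matches("([a-zA-Z]+|[a-zA-Z]+\\s[a-zA-Z]+)");
    }

    // Exactly five digits.
    static boolean isZip(String s) {
        return s.matches("\\d{5}");
    }

    public static void main(String[] args) {
        System.out.println(isFirstName("Jane") + " " + isFirstName("jane"));
        System.out.println(isOneOrTwoWords("Mary Ann") + " " + isOneOrTwoWords("Mary Ann Lee"));
        System.out.println(isZip("02134") + " " + isZip("2134"));
        System.out.println("Hi Jane".matches("Hi (John|Jane)"));
    }
}
```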


Split Method: examples

public class RegexTestStrings {
    public static final String EXAMPLE_TEST =
        "This is my small example " + "string which I'm going to " + "use for pattern matching.";

    public static void main(String[] args) {
        System.out.println(EXAMPLE_TEST.matches("\\w.*"));

        String[] splitString = EXAMPLE_TEST.split("\\s+");
        System.out.println(splitString.length); // Should be 14

        for (String string : splitString) {
            System.out.println(string);
        }

        // Replace all whitespace with tabs
        System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
    }
}


RegEx examples

// Returns true if the string matches exactly "true"
public boolean isTrue(String s) {
    return s.matches("true");
}

// Returns true if the string matches exactly "true" or "True"
public boolean isTrueVersion2(String s) {
    return s.matches("[tT]rue");
}

// Returns true if the string matches exactly "true" or "True"
// or "yes" or "Yes"
public boolean isTrueOrYes(String s) {
    return s.matches("[tT]rue|[yY]es");
}

// Returns true if the string contains "true" anywhere
public boolean containsTrue(String s) {
    return s.matches(".*true.*");
}


RegEx examples (cont’d)

// Returns true if the string consists of exactly three letters
public boolean isThreeLetters(String s) {
    return s.matches("[a-zA-Z]{3}");
}

// Returns true if the string does not have a digit at the beginning
public boolean isNoNumberAtBeginning(String s) {
    return s.matches("^[^\\d].*");
}

// Returns true if the string consists of any number of word characters,
// none of which is b
public boolean isIntersection(String s) {
    return s.matches("([\\w&&[^b]])*");
}


Pattern and Matcher classes

Java provides package java.util.regex, which helps developers manipulate regular expressions.

Class Pattern represents a regular expression.

Class Matcher contains a search pattern and a CharSequence object.

If a regular expression is to be used only once, use static method matches of class Pattern, which accepts a regular expression and a search object and returns a boolean value.
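
For a one-off check, the static Pattern.matches call can be sketched as follows. The class PatternMatchesDemo, the method isClockTime and the pattern "\\d{2}:\\d{2}" are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class PatternMatchesDemo {
    // One-off check: two digits, a colon, two digits.
    // The search object is a CharSequence, so a StringBuilder works as well as a String.
    static boolean isClockTime(CharSequence input) {
        return Pattern.matches("\\d{2}:\\d{2}", input);
    }

    public static void main(String[] args) {
        System.out.println(isClockTime("09:30"));
        System.out.println(isClockTime(new StringBuilder("9:30")));
    }
}
```

Each call to Pattern.matches compiles the regular expression again, which is why this form suits expressions that are used only once.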


Pattern and Matcher classes (cont’d)

If a regular expression is used more than once, use static method compile of class Pattern to create a specific Pattern object based on the regular expression.

Use the resulting Pattern object to call the method matcher, which receives a CharSequence to search and returns a Matcher.

Finally, use the following methods of the obtained Matcher: find, group, lookingAt, replaceFirst, and replaceAll.


Methods of Matcher

The dot character "." in a regular expression matches any single character except a newline character.

Matcher method find attempts to match a piece of the search object to the search pattern. Each call to this method starts at the point where the last call ended, so multiple matches can be found.

Matcher method lookingAt always starts from the beginning of the search object and succeeds only if the pattern matches there. Unlike matches, it does not require the entire search object to match.
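
The behaviour of find and lookingAt can be compared in a sketch like the one below. The class MatcherMethodsDemo and its helpers findAll and startsWithMatch are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherMethodsDemo {
    // Collects every match; each find() call resumes where the previous one ended.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // True only if the pattern matches starting at the first character of the input.
    static boolean startsWithMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1 b22 c333"));
        System.out.println(startsWithMatch("\\d+", "12 apples")); // digits at the start
        System.out.println(startsWithMatch("\\d+", "a1 b22"));    // digits occur, but later
        System.out.println("a\nb".matches("a.b"));                // "." does not match a newline
    }
}
```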


Pattern and Matcher example

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestPatternMatcher {
    public static final String EXAMPLE_TEST =
        "This is my small example string which I'm going to "
        + "use for pattern matching.";

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\w+");
        Matcher matcher = pattern.matcher(EXAMPLE_TEST);

        // Print the position and text of every run of word characters
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }

        // Replace all whitespace with tabs
        Pattern replace = Pattern.compile("\\s+");
        Matcher matcher2 = replace.matcher(EXAMPLE_TEST);
        System.out.println(matcher2.replaceAll("\t"));
    }
}


Appendix

More examples of Regular Expressions in Java


Validating a username

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UsernameValidator {
    private Pattern pattern;
    private Matcher matcher;

    // 3 to 15 characters: lowercase letters, digits, underscore or hyphen
    private static final String USERNAME_PATTERN = "^[a-z0-9_-]{3,15}$";

    public UsernameValidator() {
        pattern = Pattern.compile(USERNAME_PATTERN);
    }

    /**
     * Validate username with regular expression
     *
     * @param username username for validation
     * @return true valid username, false invalid username
     */
    public boolean validate(final String username) {
        matcher = pattern.matcher(username);
        return matcher.matches();
    }
}

Examples of usernames that don't match: mk (too short, min 3 chars); w@lau ("@" not allowed)


Validating image file extension

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageValidator {
    private Pattern pattern;
    private Matcher matcher;

    // A name without white space, ending in .jpg, .png, .gif or .bmp (case-insensitive)
    private static final String IMAGE_PATTERN = "([^\\s]+(\\.(?i)(jpg|png|gif|bmp))$)";

    public ImageValidator() {
        pattern = Pattern.compile(IMAGE_PATTERN);
    }

    /**
     * Validate image with regular expression
     *
     * @param image image for validation
     * @return true valid image, false invalid image
     */
    public boolean validate(final String image) {
        matcher = pattern.matcher(image);
        return matcher.matches();
    }
}


Time in 12 Hours Format validator

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Time12HoursValidator {
    private Pattern pattern;
    private Matcher matcher;

    // Hour 1-12, minutes 00-59, optional white space, then am or pm (case-insensitive)
    private static final String TIME12HOURS_PATTERN =
        "(1[012]|[1-9]):[0-5][0-9](\\s)?(?i)(am|pm)";

    public Time12HoursValidator() {
        pattern = Pattern.compile(TIME12HOURS_PATTERN);
    }

    /**
     * Validate time in 12 hours format with regular expression
     *
     * @param time time for validation
     * @return true valid time format, false invalid time format
     */
    public boolean validate(final String time) {
        matcher = pattern.matcher(time);
        return matcher.matches();
    }
}


Validating date

Date format validation: (0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)

(                  # start of group #1
  0?[1-9]          # 01-09 or 1-9
  |                # ..or
  [12][0-9]        # 10-19 or 20-29
  |                # ..or
  3[01]            # 30, 31
)                  # end of group #1
/                  # followed by a "/"
(                  # start of group #2
  0?[1-9]          # 01-09 or 1-9
  |                # ..or
  1[012]           # 10, 11, 12
)                  # end of group #2
/                  # followed by a "/"
(                  # start of group #3
  (19|20)\\d\\d    # 19[0-9][0-9] or 20[0-9][0-9]
)                  # end of group #3
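
Following the style of the other validators in this appendix, the date pattern could be used as in the sketch below. The class DateValidator and the helper yearOf are assumptions added for illustration. The pattern only checks the shape of the date, so a string such as 31/02/2015 is still accepted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateValidator {
    // Compiled once because the pattern is reused for every validation.
    private static final Pattern DATE_PATTERN =
        Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

    // True if the whole string has the form dd/mm/yyyy with plausible ranges.
    static boolean validate(String date) {
        return DATE_PATTERN.matcher(date).matches();
    }

    // Returns the year (group #3) of a valid date, or null if the date is invalid.
    static String yearOf(String date) {
        Matcher matcher = DATE_PATTERN.matcher(date);
        return matcher.matches() ? matcher.group(3) : null;
    }

    public static void main(String[] args) {
        System.out.println(validate("31/12/2015"));
        System.out.println(validate("32/01/2015")); // day out of range
        System.out.println(validate("15/13/2015")); // month out of range
        System.out.println(yearOf("03/01/2016"));
    }
}
```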