Regular expressions 4 Day 9 - 9/15/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

Jan 14, 2016


Rolf Jenkins

Regular expressions 4

Day 9 - 9/15/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University


Course organization

15-Sept-2014, NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/


Review

The quiz was the review.


4.3.4. Summary table

meta-character  matches               name          notes
a|b             a or b                disjunction
(ab)            a and b               grouping      findall only outputs what is in (); use (?:ab) to output the rest of the pattern too
[ab]            a or b                range         [a-z] lowercase, [A-Z] uppercase, [0-9] digits
[^a]            all but a             negation
a{m,n}          from m to n of a      repetition    no space after the comma
a{n}            a number n of a       repetition
^a              a at start of S
a$              a at end of S
a+              one or more of a                    a+? lazy +
a*              zero or more of a     Kleene star   a*? lazy *
a?              with or without a     optionality   a?? lazy ?
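A few rows of the table can be checked directly in the interpreter. The sketch below is illustrative only: the short test strings are made up for the purpose and do not come from the course materials.

```python
import re

# Disjunction and negation
either = re.findall(r'a|b', 'abc')           # 'a' or 'b'
not_a = re.findall(r'[^a]', 'abc')           # anything but 'a'

# Repetition: exactly two, then from one to two (no space inside the braces)
pairs = re.findall(r'a{2}', 'aaaaa')
one_or_two = re.findall(r'a{1,2}', 'aaaaa')

# Greedy + takes as much as it can; lazy +? takes as little as it can
greedy = re.findall(r'a+', 'aaa')
lazy = re.findall(r'a+?', 'aaa')

# findall() outputs only the group when the pattern has (); (?:) avoids that
captured = re.findall(r'(ab)+', 'ababab')
whole = re.findall(r'(?:ab)+', 'ababab')

for result in (either, not_a, pairs, one_or_two, greedy, lazy, captured, whole):
    print(result)
```

Comparing the last two results shows why the non-capturing form matters: with (ab)+ only the group is returned, while with (?:ab)+ the whole match is.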


§4. Regular expressions 4

There is a bit more to say.


Open Spyder


Sample string

>>> import re
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''


4.4. Character classes

class  abbreviates      name             notes
\w     [a-zA-Z0-9_]     alphanumeric     it’s really alphanumeric and underscore, but we are lazy
\W     [^a-zA-Z0-9_]                     not alphanumeric
\d     [0-9]            digit
\D     [^0-9]                            not a digit
\s     [ \t\v\n\r\f]    whitespace
\S     [^ \t\v\n\r\f]                    not whitespace
\t                      horizontal tab
\v                      vertical tab
\n                      newline
\r                      carriage return
\f                      form-feed
\b                      word boundary
\B                                       not a word boundary
\A     ^
\Z     $
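As a rough check of the table, the sketch below applies some of the classes to a short test string. The string is invented for illustration. One caveat: in Python 3, \w applied to an ordinary string also matches non-ASCII letters unless the re.ASCII flag is given, so the [a-zA-Z0-9_] abbreviation is a simplification.

```python
import re

# A made-up test string with a tab and a final newline
text = 'LING 3820 &\t6820, day_9\n'

digits = re.findall(r'\d+', text)          # runs of digits
words = re.findall(r'\w+', text)           # letters, digits and underscore
spaces = re.findall(r'\s', text)           # each whitespace character
non_words = re.findall(r'[^\w\s]', text)   # neither \w nor whitespace

print(digits)
print(words)
print(spaces)
print(non_words)
```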


4.4.2. Raw string notation with r''

Python processes the string that holds a regular expression just like any other string, before the re module ever sees it. This can lead to unexpected results with class meta-characters, because the backslash that they incorporate is also used by Python for its own escape sequences.

For instance, we just met a class meta-character \b, which marks the edge of a word. It will be extremely useful for us, but it happens to overlap with Python’s own escape sequence for the backspace character, \b.
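The clash can be seen by writing the same pattern with and without the r prefix. This is a minimal sketch; the short test string and the variable names are chosen here for illustration.

```python
import re

text = 'to thine own self be true'

plain = '\bto\b'    # Python turns each \b into a backspace character
raw = r'\bto\b'     # the backslashes survive, so re sees word boundaries

print(len(plain), len(raw))        # the plain string is shorter
print(re.findall(plain, text))     # looks for backspace + 'to' + backspace
print(re.findall(raw, text))       # looks for the whole word 'to'
```

The plain pattern finds nothing because the text contains no backspace characters, while the raw pattern finds the word.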


Raw text

The way to resolve this ambiguity is to prefix an r to a regular expression. The r marks the regular expression as raw text, so Python does not process it for special characters. The previous example is augmented with the raw text notation below:

>>> re.findall(r'\b\w\w\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(r'\b\w{2}\b', S)
['to', 'be', 'it', 'as', 'be', 'to']


More raw text

As a further illustration, what do you think the non-alphanumeric characters in the Shakespeare text are?

>>> re.findall(r'\W', S)
[' ', ' ', ':', ' ', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']


Practice

4.3.5. Further practice of variable-length matching

4.6. Further practice

Practice with answers on a different page


§5. Lists 1

There is a bit more to say.


Introduction

In working with re.findall(), you have seen many instances of a collection of strings held within square brackets, such as the one below:

>>> S = '''This above all: to thine own self be true,

... And it must follow, as the night the day,

... Thou canst not then be false to any man.'''

>>> re.findall(r'\b[a-zA-Z]{4}\b', S)

['This', 'self', 'true', 'must', 'Thou', 'then']


Definition of list

A list in Python is a sequence of objects delimited by square brackets, []. The objects are separated by commas. Consider this sentence from Shakespeare’s A Midsummer Night’s Dream represented as a list:

>>> L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind', '.']
>>> type(L)
>>> type(L[0])

L is a list of strings. You may think that a string is also a list of characters, and you would be correct for ordinary English, but in pythonic English, the word ‘list’ refers exclusively to a sequence of objects delimited by square brackets.
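The distinction between a list and a string can be checked in code. The sketch below assumes Python 3 and uses isinstance() and list(), which the slides do not mention, to make the point.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

print(type(L).__name__)       # the container is a list
print(type(L[0]).__name__)    # each member is a string

word = L[0]
print(isinstance(word, list))  # a string is not a list in the Python sense

# list() builds a real list of one-character strings from a string
letters = list(word)
print(letters)
```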


An example with numerical objects

>>> i = 2
>>> type(i)
>>> I = [0,1,i,3]
>>> type(I)
>>> type(I[0])
>>> n = 2.3
>>> type(n)
>>> N = [2.0,2.1,2.2,n]
>>> type(N)
>>> type(N[0])


Most of the string methods work just as well on lists

>>> len(L)
>>> sorted(L)
>>> set(L)
>>> sorted(set(L))
>>> len(sorted(set(L)))
>>> L+'!'
>>> len(L+'!')
>>> L*2
>>> len(L*2)
>>> L.count('the')
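For reference, here is a sketch of what several of these calls produce in Python 3, with L set to the Midsummer Night's Dream list from the definition slide. Note that adding the string '!' to a list raises a TypeError, since a list can only be concatenated with another list; the variable names below are chosen for illustration.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

n_tokens = len(L)              # all items, repeats included
types = sorted(set(L))         # unique items in sorted order
n_types = len(types)
n_doubled = len(L * 2)         # repetition doubles the list
n_the = L.count('the')

# A list plus a string fails; a list plus a list works
try:
    L + '!'
    concat_error = None
except TypeError as err:
    concat_error = err

n_extended = len(L + ['!'])

print(n_tokens, n_types, n_doubled, n_the, n_extended)
print(types)
print('TypeError:', concat_error)
```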


String methods work on lists, cont.

>>> L.count('Love')
>>> L.count('love')
>>> L.index('with')
>>> L.rindex('with')
>>> L[2:]
>>> L[:2]
>>> L[-2:]
>>> L[:-2]
>>> L[2:-2]
>>> L[-2:2]
>>> L[:]
>>> L[:-1]+['!']
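A sketch of the results in Python 3, again with L as the Midsummer Night's Dream list. One of the calls does not carry over from strings: lists have no rindex() method, so L.rindex('with') raises an AttributeError. The workaround shown for finding the last position is one possible approach, not something taken from the slides.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

upper = L.count('Love')        # counting is case-sensitive
lower = L.count('love')
first_with = L.index('with')   # position of the first 'with'

# Lists lack rindex(); search the reversed list and convert the position
has_rindex = hasattr(L, 'rindex')
last_with = len(L) - 1 - L[::-1].index('with')

head = L[:2]                   # first two items
tail = L[-2:]                  # last two items
empty = L[-2:2]                # start lies after stop, so nothing is selected
copy = L[:]                    # a new list with the same items
swapped = L[:-1] + ['!']       # replace the final period

print(upper, lower, first_with, has_rindex, last_with)
print(head, tail, empty)
print(swapped)
```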


Q1

MIN 5.0 AVG 9.5 MAX 10.0


Next time

More on lists
