Text statistics 7 Day 30 - 11/05/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

Dec 18, 2015


Kelley Cain

Text statistics 7

Day 30 - 11/05/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University


Course organization

03-Nov-2014NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control


Final project


Open Spyder


Review


ConditionalFreqDist

>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
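As a rough illustration of what happens to the (category, word) pairs, a conditional frequency distribution can be pictured as one word counter per condition. The sketch below uses only the Python standard library and a made-up list of pairs; `toy_pairs` and `build_cfd` are hypothetical names for this illustration, not part of NLTK, and the code is not meant to reproduce NLTK's own implementation.

```python
from collections import Counter, defaultdict

# Made-up (condition, sample) pairs standing in for the
# (category, word) pairs built from the Brown corpus.
toy_pairs = [
    ("news", "the"), ("news", "will"), ("news", "the"),
    ("romance", "could"), ("romance", "the"), ("romance", "could"),
]

def build_cfd(pairs):
    """Group samples by condition and count them: one Counter per condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

toy_cfd = build_cfd(toy_pairs)
print(sorted(toy_cfd))                    # the conditions
print(toy_cfd["news"]["the"])             # count of 'the' under 'news'
print(toy_cfd["romance"].most_common(1))  # most frequent sample under 'romance'
```

Looking up a condition first and a sample second mirrors the order of the pairs: the first element of each pair is the condition, the second is what gets counted.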


Conditional frequency distribution


A more interesting example

                 can  could  may  might  must  will
news              93     86   66     38    50   389
religion          82     59   78     12    54    71
hobbies          268     58  131     22    83   264
sci fi            16     49    4     12     8    16
romance           74    193   11     51    45    43
humor             16     30    8      8     9    13


Conditions = categories, sample = modal verbs

# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies',
...        'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might',
...        'must', 'will']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)
...            if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()


cfd.tabulate()

                 can  could  may  might  must  will
news              93     86   66     38    50   389
religion          82     59   78     12    54    71
hobbies          268     58  131     22    83   264
science_fiction   16     49    4     12     8    16
romance           74    193   11     51    45    43
humor             16     30    8      8     9    13
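One way to read the table is to ask which modal dominates each category. The sketch below hard-codes the counts from the table so that it runs without the corpus; `counts` and `top_modal` are hypothetical names, and the share it reports is a share of the six listed modals only, not of all words in the category.

```python
modals = ["can", "could", "may", "might", "must", "will"]

# Counts copied from the tabulated output, one row per category.
counts = {
    "news":            [93, 86, 66, 38, 50, 389],
    "religion":        [82, 59, 78, 12, 54, 71],
    "hobbies":         [268, 58, 131, 22, 83, 264],
    "science_fiction": [16, 49, 4, 12, 8, 16],
    "romance":         [74, 193, 11, 51, 45, 43],
    "humor":           [16, 30, 8, 8, 9, 13],
}

def top_modal(category):
    """Return the most frequent modal in a row and its share of that row."""
    row = counts[category]
    best = max(range(len(modals)), key=lambda i: row[i])
    return modals[best], row[best] / sum(row)

for category in counts:
    modal, share = top_modal(category)
    print(f"{category:<17}{modal:<7}{share:.2f}")
```

Dividing by the row total makes categories of different sizes easier to compare than the raw counts alone.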


cfd.plot()


Another example

The task is to find the frequency of 'America' and 'citizen' in NLTK's corpus of presidential inaugural addresses:

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']


cfd2.plot()


First try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if w.lower() in keys]
cfd2 = ConditionalFreqDist(keyYear)
cfd2.plot()


cfd2.plot()


Second try

from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# The condition is the matching key k, so all word forms that
# start with the same key are counted together.
keyYear = [(k, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           for k in keys
           if w.lower().startswith(k)]
cfd3 = ConditionalFreqDist(keyYear)
cfd3.plot()
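The difference between the first and second tries comes down to the filter: an exact match on the lower-cased token versus a prefix match. The sketch below shows this on a handful of made-up tokens (`toy_words` is hypothetical; in the slides the tokens come from `inaugural.words(title)`), so it can be run without the corpus.

```python
keys = ["america", "citizen"]

# Made-up tokens for illustration only.
toy_words = ["America", "American", "Americans", "citizen", "citizens", "city"]

# First try: the lower-cased token must equal a key exactly.
exact = [w for w in toy_words if w.lower() in keys]

# Second try: the lower-cased token only has to start with a key,
# and the key itself is kept alongside the token.
prefix = [(k, w) for w in toy_words for k in keys if w.lower().startswith(k)]

print(exact)
print(prefix)
```

The prefix filter picks up inflected and derived forms such as 'Americans' and 'citizens', which the exact filter misses.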


cfd3.plot()


Stemming


Third try

from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if stemmer.stem(w) in keys]
cfd4 = ConditionalFreqDist(keyYear)
cfd4.plot()
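The third try keeps a word when its stem is one of the keys. To see that filtering logic without loading NLTK, the sketch below swaps the Snowball stemmer for a hypothetical `toy_stem` that only lower-cases a word and strips a final plural -s. This is a deliberately crude stand-in: a real stemmer applies many more rules, so results on the corpus will differ.

```python
def toy_stem(word):
    """Lower-case a word and strip a plural '-s' (crude stand-in, not Snowball)."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

keys = ["america", "citizen"]

# Made-up tokens for illustration only.
toy_words = ["America", "Americans", "citizen", "Citizens", "citizenship"]

# Keep a token when its (toy) stem is one of the keys.
matches = [w for w in toy_words if toy_stem(w) in keys]
print(matches)
```

With this toy rule 'Citizens' reduces to 'citizen' and is kept, while 'Americans' reduces to 'american', which is not a key, so it is dropped.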


cfd4.plot()


Twitter

Next time
