
Programming for Linguists

An Introduction to Python, 22/12/2011

Ex. 1) Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of “men”, “women”, and “people” in each document. What has happened to the usage of these words over time?

Feedback

import nltk
from nltk.corpus import state_union

cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in state_union.fileids()
    for word in state_union.words(fileid))

fileids = state_union.fileids()
search_words = ["men", "women", "people"]
cfd.tabulate(conditions=fileids, samples=search_words)
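To answer the question about change over time, a plot is more revealing than the table. The following extension is not part of the original feedback; it is a minimal sketch that conditions on the target word and uses the year prefix of each file name (the state_union files are named like 1945-Truman.txt):

# plot how often each search word occurs per address, by year
cfd_years = nltk.ConditionalFreqDist(
    (word.lower(), fileid[:4])
    for fileid in state_union.fileids()
    for word in state_union.words(fileid)
    if word.lower() in search_words)
cfd_years.plot()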

Ex 2) According to Strunk and White's Elements of Style, the word “however”, used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. Use the concordance tool to study actual usage of this word in 5 NLTK texts.

import nltk
from nltk.book import *

texts = [text1, text2, text3, text4, text5]
for text in texts:
    # concordance() prints its results itself and returns None,
    # so no extra print statement is needed
    text.concordance("however")
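Note that concordance() takes optional width and lines parameters (79 characters and 25 lines by default), so a wider view of the context is possible, for example:

# widen the context window and allow more hits per text
text1.concordance("however", width=100, lines=40)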

Ex 3) Create a corpus of your own containing at least 10 files of text fragments. You can use your own texts, texts from the internet, etc. Write a program that investigates the usage of modal verbs in this corpus using the frequency distribution tool, and plot the 10 most frequent words.

import nltk
import re
from nltk.corpus import PlaintextCorpusReader

corpus_root = "/Users/claudia/my_corpus"
#corpus_root = "C:\Users\..."

my_corpus = PlaintextCorpusReader(corpus_root, '.*')
words = my_corpus.words()

cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in my_corpus.fileids()
    for word in my_corpus.words(fileid))

fileids = my_corpus.fileids()
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=fileids, samples=modals)

# frequency distribution over the whole corpus, keeping only tokens that
# start with a letter or digit; filtering before building the FreqDist
# avoids removing items from a list while looping over it
fd = nltk.FreqDist(w for w in words if re.match(r'[a-zA-Z0-9]', w))
# plot the 10 most frequent words
fd.plot(10)

Ex 1) Choose a website. Read it into Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending in ‘ing’ and sort it by its values in decreasing order.

Ex 2) Write the raw text from the previous exercise to an output file.

import nltk
import re
from urllib import urlopen

url = "website"   # placeholder: fill in the URL of the chosen website

htmltext = urlopen(url).read()
rawtext = nltk.clean_html(htmltext)
rawtext2 = rawtext.lower()
tokens = nltk.wordpunct_tokenize(rawtext2)
my_text = nltk.Text(tokens)   # wrap the tokens in a Text object

wordlist_ing = [w for w in tokens if re.search(r'^.*ing$', w)]

freq_dict = {}
for word in wordlist_ing:
    if word not in freq_dict:
        freq_dict[word] = 1
    else:
        freq_dict[word] = freq_dict[word] + 1

from operator import itemgetter
sorted_wordlist_ing = sorted(freq_dict.iteritems(), key=itemgetter(1), reverse=True)
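sorted_wordlist_ing is now a list of (word, count) pairs in decreasing order of frequency. As a quick usage check (not part of the original feedback), the top of the list can be printed like this:

# print the ten most frequent words ending in 'ing'
for word, count in sorted_wordlist_ing[:10]:
    print word, count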

Ex 2)

output_file = open("dir/output.txt", "w")
output_file.write(str(rawtext2) + "\n")
output_file.close()
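One caveat the feedback does not cover: in Python 2, writing text that contains non-ASCII characters can raise a UnicodeEncodeError. A sketch of a safer variant, assuming the page is encoded as UTF-8, uses the codecs module:

import codecs
unicode_text = rawtext2.decode("utf-8")   # byte string -> unicode
output_file = codecs.open("dir/output.txt", "w", encoding="utf-8")
output_file.write(unicode_text + u"\n")
output_file.close()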

Ex 3) Write a script that performs the same classification task as we saw today, using word bigrams as features instead of single words. A sketch of one possible solution follows below.
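The slides do not reproduce the classification task from the session, so the script below is a sketch under an assumption: it takes the standard Naive Bayes document-classification example from the NLTK book (positive versus negative movie_reviews) and only swaps the feature extractor from single words to word bigrams:

import nltk
import random
from nltk.corpus import movie_reviews

# documents paired with their category labels (pos/neg)
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# use the 2000 most frequent bigrams in the corpus as candidate features
# (in NLTK 2, FreqDist.keys() is sorted by decreasing frequency)
fd_bigrams = nltk.FreqDist(nltk.bigrams(movie_reviews.words()))
bigram_features = fd_bigrams.keys()[:2000]

def document_features(document):
    document_bigrams = set(nltk.bigrams(document))
    features = {}
    for bigram in bigram_features:
        features['contains(%s %s)' % bigram] = (bigram in document_bigrams)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)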

Some Mentioned Issues

Loading your own corpus in NLTK with no subcategories:

import nltk
from nltk.corpus import PlaintextCorpusReader

loc = "/Users/claudia/my_corpus"        #Mac
#loc = r"C:\Users\claudia\my_corpus"    #Windows 7 (uncomment the one you need)

my_corpus = PlaintextCorpusReader(loc, ".*")


Loading your own corpus in NLTK with subcategories:

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

loc = "/Users/claudia/my_corpus"        #Mac
#loc = r"C:\Users\claudia\my_corpus"    #Windows 7

my_corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt',
                                             cat_pattern=r'(cat1|cat2)/.*')
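Once the corpus is loaded, the categories can be used to slice it. A minimal usage sketch, where cat1 is one of the hypothetical category directories from the pattern above:

# list the categories and pull out the files and words of one category
print my_corpus.categories()
print my_corpus.fileids(categories='cat1')
print my_corpus.words(categories=['cat1'])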

Dispersion Plot

A dispersion plot shows the location of a word in the text: how many words from the beginning it appears.
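For example, with the texts from nltk.book loaded, the classic example from the NLTK book plots four words across the Inaugural Address Corpus:

# positional dispersion of four words in text4
text4.dispersion_plot(["citizens", "democracy", "freedom", "America"])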

Exercises

Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the text, and converts the words to lowercase. You can get a list of all punctuation marks with:

import string
print string.punctuation

import nltk, string

def strip(filepath):
    f = open(filepath, 'r')
    text = f.read()
    f.close()
    tokens = nltk.wordpunct_tokenize(text)
    # lowercase every token and drop the tokens that are punctuation marks;
    # building a new list avoids removing items from a list while looping over it
    return [t.lower() for t in tokens if t not in string.punctuation]
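A quick check of the function on any plain-text file (the file name here is only an illustration):

# hypothetical file name, just to demonstrate the call
print strip("example.txt")[:20]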

If you want to analyse a text but filter out a stop list first (e.g. containing “the”, “and”,…), you need to make two dictionaries: one with all words from your text and one with all words from the stop list. Then you subtract the second from the first. Write a function subtract(d1, d2) that takes dictionaries d1 and d2 and returns a new dictionary containing all the keys from d1 that are not in d2. You can set the values to None.

def subtract(d1, d2):
    d3 = {}
    for key in d1.keys():
        if key not in d2:
            d3[key] = None
    return d3

Let’s try it out:

import nltk
from nltk.book import *
from nltk.corpus import stopwords

d1 = {}
for word in text7:
    d1[word] = None

wordlist = stopwords.words("english")
d2 = {}
for word in wordlist:
    d2[word] = None

rest_dict = subtract(d1, d2)
wordlist_min_stopwords = rest_dict.keys()

Questions?

Evaluation Assignment

Deadline: 23/01/2012
Conversation in the week of 23/01/2012
If you need any explanation about the content of the assignment, feel free to e-mail me.

Further Reading

Since this was only a short introduction to programming in Python, if you want to expand your programming skills further, see chapters 15 – 18 on object-oriented programming in:

Think Python: How to Think Like a Computer Scientist
NLTK book
Official Python documentation: http://www.python.org/doc/

There is a newer version of Python available (Python 3), but it is not (yet) compatible with NLTK.

Our research group: CLiPS (Computational Linguistics and Psycholinguistics Research Center)
http://www.clips.ua.ac.be/
Our projects: http://www.clips.ua.ac.be/projects

Happy holidays and good luck with your exams!
