Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib Jimmy Lai r97922028 [at] ntu.edu.tw http://tw.linkedin.com/pub/jimmy-lai/27/4a/536 2013/02/17

May 06, 2015

Jimmy Lai

Big data analysis relies on handy tools that make it easy to gain insight from data. In this talk, the speaker demonstrates a data mining flow for text classification built from Python tools. The flow consists of feature extraction and selection, model training and tuning, and evaluation. The tools used are Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.
Transcript
Page 1: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Jimmy Lai

r97922028 [at] ntu.edu.tw http://tw.linkedin.com/pub/jimmy-lai/27/4a/536

2013/02/17

Page 3: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Fast prototyping - IPython Notebook

• Write Python code in the browser:

– Exploit the remote server resources

– View the graphical results in web page

– Sketch code pieces as blocks

– See http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for a fuller introduction.

Page 4: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Demo Code

• Demo Code: ipython_demo/text_classification_demo.ipynb in https://bitbucket.org/noahsark/slideshare

• IPython Notebook:

– Install

$ pip install ipython

– Execution (under the ipython_demo directory)

$ ipython notebook --pylab=inline

– Open the notebook in a browser, e.g. http://127.0.0.1:8888

Page 5: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Machine learning classification

• $X_i = [x_1, x_2, \ldots, x_n]$, $x_n \in \mathbb{R}$

• $y_i \in \mathbb{N}$

• $\text{dataset} = (X, Y)$

• $\text{classifier } f: y_i = f(X_i)$

Page 6: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Text classification

Feature Generation

Feature Selection

Classification Model Training

Model Parameter Tuning

Page 7: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Dataset: 20 newsgroups dataset

Structured Data (header):

From: [email protected] (zhenghao yeh)
Subject: Re: Newsgroup Split
Organization: University of Southern California, Los Angeles, CA
Lines: 18
Distribution: world
NNTP-Posting-Host: caspian.usc.edu

Text (body):

In article <[email protected]>, [email protected] (Chris Herringshaw) writes:
|> Concerning the proposed newsgroup split, I personally am not in favor of
|> doing this. I learn an awful lot about all aspects of graphics by reading
|> this group, from code to hardware to algorithms. I just think making 5
|> different groups out of this is a wate, and will only result in a few posts
|> a week per group. I kind of like the convenience of having one big forum
|> for discussing all aspects of graphics. Anyone else feel this way?
|> Just curious.
|>
|>
|> Daemon
|>

I agree with you. Of cause I'll try to be a daemon :-)

Yeh
USC

Page 8: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Dataset in sklearn

• sklearn.datasets

– Toy datasets

– Download data from http://mldata.org repository

• Data format of classification problem

– Dataset

• data: [raw_data or numerical]

• target: [int]

• target_names: [str]
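As a minimal sketch of this layout, the snippet below loads a toy dataset bundled with scikit-learn and inspects the three fields listed above. The choice of load_iris is an assumption made only because it needs no download; the talk itself works with the 20 newsgroups data.

```python
from sklearn.datasets import load_iris

# load_iris ships with scikit-learn, so no network access is needed.
dataset = load_iris()

print(dataset.data.shape)          # numerical feature matrix
print(dataset.target[:5])          # integer class labels
print(list(dataset.target_names))  # label names, indexed by target value
```

Text datasets follow the same layout, except that data holds raw strings instead of numbers.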

Page 9: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Feature extraction from structured data (1/2)

• Count how often each header keyword occurs, then select the frequent keywords as features: ['From', 'Subject', 'Organization', 'Distribution', 'Lines']

• E.g. From: [email protected] (where's my thing)

Subject: WHAT car is this!?

Organization: University of Maryland, College Park

Distribution: None

Lines: 15

Keyword        Count
Distribution   2549
Summary        397
Disclaimer     125
File           257
Expires        116
Subject        11612
From           11398
Keywords       943
Originator     291
Organization   10872
Lines          11317
Internet       140
To             106
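The keyword counting could be sketched roughly as follows. This is an illustration and not the code from the demo notebook: the regular expression, the function name count_header_keywords, and the two sample articles (with made-up addresses) are all assumptions.

```python
import re
from collections import Counter

# Assumed pattern: a line that starts with "Keyword:" (letters and hyphens).
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z-]*):\s*(.*)$")

def count_header_keywords(articles):
    """Count in how many articles each "Keyword:" line prefix appears."""
    counts = Counter()
    for article in articles:
        seen = set()
        for line in article.splitlines():
            match = HEADER_LINE.match(line)
            if match:
                seen.add(match.group(1))
        counts.update(seen)  # each keyword counted once per article
    return counts

# Hypothetical sample articles, shaped like the 20 newsgroups posts.
articles = [
    "From: alice@example.com\nSubject: WHAT car is this!?\n"
    "Organization: University of Maryland, College Park\nLines: 15\n\nbody text",
    "From: bob@example.com\nSubject: Re: Newsgroup Split\n"
    "Distribution: world\nLines: 18\n\nbody text",
]
counts = count_header_keywords(articles)
print(counts.most_common())
```

Keywords that occur in most articles are the natural candidates for features; rare ones carry little signal.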

Page 10: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Feature extraction from structured data (2/2)

• Separate structured data and text data

– Text data starts after the "Lines:" header

• Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer

• E.g.

[{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
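The example mapping can be reproduced with a short sketch like the one below. Passing sparse=False is a choice made here so that the result prints as a plain array; by default DictVectorizer returns a sparse matrix.

```python
from sklearn.feature_extraction import DictVectorizer

records = [{"a": 1, "b": 1}, {"c": 1}]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(records)

print(vectorizer.feature_names_)  # column order of the matrix
print(matrix)
```

Each distinct key becomes one column, and keys missing from a record are filled with zero.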

Page 11: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Text Feature extraction in sklearn

• sklearn.feature_extraction.text

• CountVectorizer

– Transform articles into token-count matrix

• TfidfVectorizer

– Transform articles into token-TFIDF matrix

• Usage:

– fit(): construct token dictionary given dataset

– transform(): generate numerical matrix
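A minimal sketch of the fit() and transform() steps, using three made-up sentences in place of real articles:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus standing in for newsgroup articles.
articles = [
    "the car is fast",
    "the graphics card is fast",
    "graphics code and algorithms",
]

count_vectorizer = CountVectorizer()
count_vectorizer.fit(articles)                  # build the token dictionary
counts = count_vectorizer.transform(articles)   # sparse token-count matrix

# fit_transform() runs both steps in one call.
tfidf = TfidfVectorizer().fit_transform(articles)

print(sorted(count_vectorizer.vocabulary_))
print(counts.toarray())
print(tfidf.shape)
```

Both vectorizers return SciPy sparse matrices with one row per article and one column per token.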

Page 12: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Text Feature extraction

• Analyzer

– Preprocessor: str -> str

• Default: lowercase

• Extra: strip_accents – handle Unicode characters

– Tokenizer: str -> [str]

• Default: re.findall(r"(?u)\b\w\w+\b", string)

– Analyzer: str -> [str]

1. Call preprocessor and tokenizer

2. Filter stopwords

3. Generate n-gram tokens
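The three analyzer steps can be imitated in plain Python. The sketch below is a simplified stand-in and not scikit-learn's own implementation; the stop-word list and the function names are assumptions made for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # tokens of two or more word characters
STOP_WORDS = {"the", "is", "of"}              # tiny illustrative stop-word list

def preprocess(text):
    """Preprocessor: str -> str (lowercase only)."""
    return text.lower()

def tokenize(text):
    """Tokenizer: str -> [str]."""
    return TOKEN_PATTERN.findall(text)

def analyze(text, ngram_range=(1, 2)):
    """Analyzer: preprocess, tokenize, drop stop words, then build n-grams."""
    tokens = [t for t in tokenize(preprocess(text)) if t not in STOP_WORDS]
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

result = analyze("WHAT car is this!?")
print(result)  # ['what', 'car', 'this', 'what car', 'car this']
```

Because stop words are removed before the n-grams are built, "car" and "this" end up adjacent in a bigram even though "is" stood between them in the original sentence.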

Page 13: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Page 14: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Feature Selection

• Decrease the number of features:

– Reduce the resource usage for faster learning

– Remove the most common and the rarest tokens (words that carry little information):

• Parameters for the Vectorizer:

– max_df

– min_df

– max_features
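As a rough illustration of these parameters, the sketch below applies them to a made-up three-document corpus. The thresholds are chosen only to make the effect visible on such a tiny corpus and are not the values used in the talk.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; "the" appears in every document.
articles = [
    "the car is fast",
    "the graphics card is fast",
    "the graphics code and algorithms",
]

# Drop tokens found in more than 90% of documents or in fewer than 2 documents.
filtered = CountVectorizer(max_df=0.9, min_df=2).fit(articles)
kept_tokens = sorted(filtered.vocabulary_)
print(kept_tokens)  # ['fast', 'graphics', 'is']

# Keep only the 2 tokens with the highest total count over the corpus.
limited = CountVectorizer(max_features=2).fit(articles)
print(sorted(limited.vocabulary_))
```

A float max_df or min_df is read as a fraction of documents, while an integer is read as an absolute document count.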

Page 15: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Classification Model Training

• Common classifiers in sklearn:

– sklearn.linear_model

– sklearn.svm

• Usage:

– fit(X, Y): train the model

– predict(X): get predicted Y
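The fit() and predict() calls look the same for classifiers from both modules. The sketch below uses LogisticRegression and LinearSVC on a made-up, clearly separable toy dataset; the data and the choice of these two classifiers are assumptions for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy numerical features: two well-separated groups of points.
X_train = [[0, 0], [0, 1], [5, 5], [5, 6]]
Y_train = [0, 0, 1, 1]
X_new = [[0, 0.5], [5, 5.5]]

predictions = {}
for model in (LogisticRegression(), LinearSVC()):
    model.fit(X_train, Y_train)                  # train the model
    name = type(model).__name__
    predictions[name] = model.predict(X_new).tolist()  # get predicted Y

print(predictions)
```

In the text classification flow, X would be the matrix produced by a vectorizer and Y the newsgroup label of each article.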

Page 16: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Cross Validation

• When tuning model parameters, let each article serve as training data and as testing data in turn, so that the parameters are not fitted to a few specific articles.

– from sklearn.model_selection import KFold (the sklearn.cross_validation module shown on the original slide was removed in scikit-learn 0.20)

– for train_index, test_index in KFold(n_splits=2).split(range(10)):

• train_index = [5 6 7 8 9]

• test_index = [0 1 2 3 4]
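A runnable sketch of the two-fold split over ten samples, assuming a scikit-learn release that provides sklearn.model_selection (0.18 or later):

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # stand-in for ten articles

# Without shuffling, KFold cuts the samples into contiguous blocks.
folds = [
    (train_index.tolist(), test_index.tolist())
    for train_index, test_index in KFold(n_splits=2).split(samples)
]

for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```

Every sample lands in the test set exactly once, so averaging a score over the folds uses all of the data for evaluation.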

Page 17: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Performance Evaluation

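• $\text{precision} = \dfrac{tp}{tp + fp}$

• $\text{recall} = \dfrac{tp}{tp + fn}$

• $\text{f1 score} = 2 \cdot \dfrac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$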

• sklearn.metrics

– precision_score

– recall_score

– f1_score
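To make the formulas concrete, the sketch below computes the three scores directly from counts of true positives (tp), false positives (fp) and false negatives (fn). The helper name and the example counts are made up; the sklearn.metrics functions listed above take the true and predicted label arrays instead of counts.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 correct hits, 2 false alarms, 4 misses.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

The F1 score is the harmonic mean of precision and recall, so it stays low unless both are reasonably high.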

Source: http://en.wikipedia.org/wiki/Precision_and_recall

Page 18: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Visualization

1. Matplotlib

2. plot() function of Series, DataFrame
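A small sketch of the second option, plotting a pandas Series as a bar chart. The five values are taken from the header keyword counts in this talk; the Agg backend is an assumption so that the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside IPython Notebook
import pandas as pd

keyword_counts = pd.Series(
    {"Subject": 11612, "From": 11398, "Lines": 11317,
     "Organization": 10872, "Distribution": 2549}
)

# Series.plot() delegates to matplotlib and returns the Axes object.
ax = keyword_counts.plot(kind="bar", title="Header keyword counts")
ax.set_ylabel("Count")
```

Inside a notebook with inline plotting the chart appears below the cell; in a script, ax.figure.savefig("keyword_counts.png") would write it to a file.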

Page 19: Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Experiment Result

• Future work:

– Feature selection by statistics or dimension reduction

– Parameter tuning

– Ensemble models
