Semi-supervised classification for natural language processing
Rushdi Shams, Department of Computer Science, University of Western Ontario, London, Canada
Jan 19, 2015

This presentation describes semi-supervised learning and its application to natural language processing tasks.

Transcript
Page 1: Semi-supervised classification for natural language processing

SEMI-SUPERVISED CLASSIFICATION FOR

NATURAL LANGUAGE PROCESSING

Rushdi Shams

Department of Computer Science

University of Western Ontario, London, Canada.

[email protected]

Page 2: Semi-supervised classification for natural language processing

2

PRESENTATION AT A GLANCE

• Semi-supervised learning
  – Problems solved by semi-supervised learning
  – Types
  – How it works
  – When it works

• Semi-supervised learning for NLP
  – Parsing
  – Text classification
  – Summarization
  – Biomedical tasks

• Conclusions

Page 3: Semi-supervised classification for natural language processing

3

SEMI-SUPERVISED LEARNING

• Traditional supervised classifiers use only labeled data for their training
  – Labeled data are expensive, difficult to obtain, and time consuming to produce

• Real-life problems have large amounts of unlabeled data

• Semi-supervised learning combines unlabeled data with labeled data

Page 4: Semi-supervised classification for natural language processing

4

SEMI-SUPERVISED LEARNING PROBLEMS

(1) Learn from labeled data

(2) Apply the learned model to unlabeled data to label them

(3) If the labeling is confident, learn from the data in (1) and (2)

(4) Apply the resulting model to unseen unlabeled data

Steps (1)–(3) correspond to transductive learning; adding step (4) makes the setting inductive (a minimal sketch of this loop follows below).
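As a concrete illustration of steps (1)–(4), here is a minimal self-training-style sketch in Python. The LogisticRegression base learner and the 0.9 confidence threshold are assumptions for illustration, not choices made in the presentation.

```python
# Minimal sketch of the loop above: learn from labeled data, pseudo-label
# the unlabeled pool, keep only confident labels, retrain, and return a
# model that can be applied to unseen data (the inductive step).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_rounds=10):
    X_l, y_l = X_labeled.copy(), y_labeled.copy()
    X_u = X_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)                       # (1) learn from labeled data
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)          # (2) apply the model to unlabeled data
        confident = proba.max(axis=1) >= threshold   # (3) keep confident pseudo-labels
        if not confident.any():
            break
        pseudo_labels = clf.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo_labels])
        X_u = X_u[~confident]
    return clf                                   # (4) use clf.predict on unseen data
```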

Page 5: Semi-supervised classification for natural language processing

5

SEMI-SUPERVISED LEARNING PROBLEMS

• Transductive learning is like a take-home exam
  – How good is the model assumption when applied to the unlabeled data after training on labeled data?

• Inductive learning is like an in-class exam
  – How good is the model assumption when applied to unseen data after training on labeled + unlabeled data?

Page 6: Semi-supervised classification for natural language processing

6

SCOPES OF SEMI-SUPERVISED LEARNING

• Like traditional learning methods, semi-supervised learning can be used for
  – Classification
  – Regression, and
  – Clustering

Page 7: Semi-supervised classification for natural language processing

7

HOW DOES SEMI-SUPERVISED CLASSIFICATION WORK?

Page 8: Semi-supervised classification for natural language processing

8

TYPES OF SEMI-SUPERVISED LEARNING

• Generative Learning
• Discriminative Learning

• Self-Training
• Co-Training
• Active Learning
  – These describe how to use generative and discriminative learning

Page 9: Semi-supervised classification for natural language processing

9

GENERATIVE VS DISCRIMINATIVE MODELS

[Diagram: how discriminative models and generative models treat a labeled example (x, y)]

Page 10: Semi-supervised classification for natural language processing

10

GENERATIVE VS DISCRIMINATIVE MODELS

• Imagine your task is to classify a speech sample by its language

• You can do it by
  1. Learning each language and then classifying the speech using the knowledge you just gained
  2. Determining the differences between the linguistic models, without learning the languages, and then classifying the speech

• (1) is how Generative models work and (2) is how Discriminative models work

Page 11: Semi-supervised classification for natural language processing

11

GENERATIVE VS DISCRIMINATIVE MODELS

• Discriminative models predict the label y from the training example x, i.e., they model P(y|x) directly

• Using Bayes' Theorem, we get*

  P(y|x) = P(x|y) P(y) / P(x)

• This is the equation we use in generative models

* P(x) can be ignored since we are interested in finding the argmax over y
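To make the distinction concrete, here is a small hedged example: scikit-learn's GaussianNB plays the generative role (it estimates P(y) and P(x|y) and applies Bayes' theorem), while LogisticRegression plays the discriminative role (it models P(y|x) directly). The toy dataset and classifier choices are illustrative assumptions, not part of the presentation.

```python
# Generative vs. discriminative classifiers on the same toy data.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

generative = GaussianNB().fit(X, y)                           # estimates P(y) and P(x|y)
discriminative = LogisticRegression(max_iter=1000).fit(X, y)  # estimates P(y|x) directly

print(generative.predict_proba(X[:3]))       # P(y|x) obtained via Bayes' theorem
print(discriminative.predict_proba(X[:3]))   # P(y|x) modeled directly
```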

Page 12: Semi-supervised classification for natural language processing

12

GENERATIVE VS DISCRIMINATIVE MODELS

• Discriminative models
  – Model the conditional probability P(y|x), used to determine class boundaries
  – Examples: Transductive SVM, graph-based methods
  – Caveat: cannot be used without considering P(x)

• Generative models
  – Model the joint probability P(x,y); for any given y, we can generate its x
  – Examples: EM algorithm, self-learning
  – Caveat: difficult when the estimates of P(x|y) are inadequate

Page 13: Semi-supervised classification for natural language processing

13

GENERATIVE VS DISCRIMINATIVE MODELS

• The class-conditional term is a probability density function
  – For a Gaussian distribution, it is a function of the mean vector and covariance matrix

• The mean vector and covariance matrix can be tuned to maximize this term
  – Use Maximum Likelihood Estimation (MLE) to find them

• The tuning can then be refined with the EM algorithm (a small sketch follows below)

• Different algorithms use different techniques depending on the distribution of the data
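The following sketch illustrates this pipeline under stated assumptions: Gaussian class-conditionals are fit by MLE on the labeled data, then scikit-learn's GaussianMixture runs EM over labeled plus unlabeled inputs. The synthetic data and the choice to initialize the components at the labeled class means are assumptions for illustration.

```python
# MLE of Gaussian class-conditionals on labeled data, then EM (GaussianMixture)
# to refine the mean vectors and covariance matrices using unlabeled data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_labeled = np.array([0] * 20 + [1] * 20)
X_unlabeled = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])

# MLE on the labeled data alone: one mean vector per class
means_init = np.array([X_labeled[y_labeled == c].mean(axis=0) for c in (0, 1)])

# EM on labeled + unlabeled inputs, started from the labeled-data estimates
gmm = GaussianMixture(n_components=2, means_init=means_init, random_state=0)
gmm.fit(np.vstack([X_labeled, X_unlabeled]))
print(gmm.means_)                           # refined mean vectors
print(gmm.predict_proba(X_unlabeled[:3]))   # soft component responsibilities
```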

Page 14: Semi-supervised classification for natural language processing

14

IS THERE A FREE LUNCH?

• “Unlabeled data are abundant, therefore semi-supervised learning is a good idea”
  – Not always!

• To succeed, one needs to spend a reasonable amount of effort designing good models / features / kernels / similarity functions

Page 15: Semi-supervised classification for natural language processing

15

IS THERE A FREE LUNCH?

• Success requires matching the problem structure with the model assumption

• P(x) is associated with the label prediction of an unlabeled data point x

• Algorithms like the Transductive SVM (TSVM) assume that the decision boundary should avoid regions with high P(x)

Page 16: Semi-supervised classification for natural language processing

16

IS THERE A FREE LUNCH?

• “If the data come from highly overlapping Gaussians, then the decision boundary goes right through the densest region” – this violates the TSVM assumption
  – Expectation-Maximization (EM) performs better in this case

• Another example: a Hidden Markov Model with unlabeled data also does not work!

Page 17: Semi-supervised classification for natural language processing

17

SELF-TRAINING

Page 18: Semi-supervised classification for natural language processing

18

CO-TRAINING

• Given labeled data L and unlabeled data U

• Create two labeled datasets L1 and L2 from L using views 1 and 2

Page 19: Semi-supervised classification for natural language processing

19

CO-TRAINING

Page 20: Semi-supervised classification for natural language processing

20

CO-TRAINING

• Learn classifier f(1) using L1 and classifier f(2) using L2

• Apply f(1) and f(2) to the unlabeled data pool U to predict labels

• Predictions are made using only each classifier's own set (view) of features

• Add the K most confident predictions of f(1) to L2

• Add the K most confident predictions of f(2) to L1

Page 21: Semi-supervised classification for natural language processing

21

CO-TRAINING

• Remove these examples from the unlabeled pool

• Re-train f(1) using L1 and f(2) using L2

• Like self-training, but with two classifiers teaching each other

• Finally, use voting or averaging to make predictions on the test data (see the sketch below)
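For concreteness, here is a minimal co-training sketch of the loop above. The GaussianNB base learners, K = 5, and the way the two feature views are passed in as column-index lists are assumptions for illustration; the slides do not fix these choices.

```python
# Minimal co-training sketch: two classifiers, each trained on its own view,
# teach each other their K most confident pseudo-labels each round.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X_l, y_l, X_u, view1, view2, k=5, rounds=10):
    L1 = (X_l[:, view1], y_l.copy())
    L2 = (X_l[:, view2], y_l.copy())
    U = X_u.copy()
    f1, f2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        if len(U) == 0:
            break
        f1.fit(*L1)
        f2.fit(*L2)
        p1 = f1.predict_proba(U[:, view1])       # each classifier uses only its own view
        p2 = f2.predict_proba(U[:, view2])
        top1 = np.argsort(p1.max(axis=1))[-k:]   # K most confident predictions of f(1)
        top2 = np.argsort(p2.max(axis=1))[-k:]   # K most confident predictions of f(2)
        # f(1)'s confident examples teach f(2), and vice versa
        L2 = (np.vstack([L2[0], U[top1][:, view2]]),
              np.concatenate([L2[1], f1.classes_[p1[top1].argmax(axis=1)]]))
        L1 = (np.vstack([L1[0], U[top2][:, view1]]),
              np.concatenate([L1[1], f2.classes_[p2[top2].argmax(axis=1)]]))
        U = np.delete(U, np.union1d(top1, top2), axis=0)  # remove used examples from the pool
    return f1, f2  # average their predicted probabilities on the test data

# Usage idea: view1 and view2 could be the first and second halves of the feature columns.
```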

Page 22: Semi-supervised classification for natural language processing

22

CO-TRAINING: CAVEATS

1. Each view alone is sufficient to make good classifications, given enough labeled data.

2. The two algorithms perform well, given enough labeled data.

Page 23: Semi-supervised classification for natural language processing

23

ACTIVE LEARNING

Page 24: Semi-supervised classification for natural language processing

24

WHICH METHOD SHOULD I USE?

• There is no direct answer!– Ideally one should use a method whose

assumptions fit the problem structure

• Do the classes produce well clustered data?– If yes, then use EM

• Do the features naturally split into two sets?– If yes, then use co-training

• Is it true that two points with similar features tend to be in the same class?– If yes, then use graph-based methods

Page 25: Semi-supervised classification for natural language processing

25

WHICH METHOD SHOULD I USE?

• Already using SVM?– If yes, then TSVM is a natural extension!

• Is the existing supervised classifier complicated and hard to modify?– If yes, then use self-training

Page 26: Semi-supervised classification for natural language processing

26

SEMI-SUPERVISED CLASSIFICATION FOR NLP

• Parsing
• Text classification
• Summarization
• Biomedical tasks

Page 27: Semi-supervised classification for natural language processing

27

EFFECTIVE SELF-TRAINING FOR PARSING

David McClosky, Eugene Charniak, and Mark Johnson
Brown University

Proceedings of HLT-NAACL 2006

Page 28: Semi-supervised classification for natural language processing

28

INTRODUCTION

• Self-trained a two phase parser-reranker system with readily available data

• The self-trained model gained a 1.1% improvement in F-score over the previous best result– F-score reported is 92.1%

Page 29: Semi-supervised classification for natural language processing

29

METHODS

• The Charniak parser is used for the initial parsing

• It produces the 50 best parses

• A MaxEnt re-ranker is used to re-rank the parses
  – It exploits over a million features

Page 30: Semi-supervised classification for natural language processing

30

DATASETS

• Penn Treebank sections 2-21 for training
  – About 40k WSJ sentences

• Penn Treebank section 23 for testing

• Penn Treebank section 24 for held-out validation

• Unlabeled data were collected from the North American News Text Corpus (NANC)
  – 24 million LA Times sentences

Page 31: Semi-supervised classification for natural language processing

31

RESULTS

• The authors experimented with and without using the re-ranker as they added unlabelled sentences–With the re-ranker the parser performs well

• The improvement is about 1.1% F-score– The self-trained parser contributes 0.8%

and – The re-ranker contributes 0.3%

Page 32: Semi-supervised classification for natural language processing

32

LIMITATIONS

• The work did not restrict more accurately parsed sentences to be included in the training data.

• Speed is similar to Charniak parser but requires a little bit more memory.

• Unlabeled data from one domain (LA times) and labeled data from a different domain (WSJ) affects self-training– The question is remained unanswered

Page 33: Semi-supervised classification for natural language processing

33

SEMI-SUPERVISED SPAM FILTERING: DOES IT WORK?

Mona Mojdeh and Gordon V. Cormack
University of Waterloo

Proceedings of SIGIR 2008

Page 34: Semi-supervised classification for natural language processing

34

INTRODUCTION

• “Semi-supervised learning methods work well for spam filtering when the source of the available labeled examples differs from that of the messages to be classified” [2006 ECML/PKDD challenge]

• The authors reproduced this work and found the opposite result

Page 35: Semi-supervised classification for natural language processing

35

BACKGROUND

• ECML/PKDD Challenge
  – Delayed Feedback (see the sketch after this slide):
    • The filters are first trained on emails T1
    • They then classify some test emails t1
    • They are then trained again on the emails T1 + t1
    • This continues for the entire dataset
    • Best (1-AUC) is 0.01%
  – Cross-user Train:
    • Train on one set of emails and test on a different set of emails
    • The emails are extracted from the same dataset
    • Best (1-AUC) is 0.1%
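The following is a minimal sketch of the delayed-feedback protocol described above: train on the messages seen so far, classify the next batch, then fold that batch (with its true labels) back into the training set. The Naive Bayes filter and the toy message batches are assumptions for illustration, not the challenge's actual filters or data.

```python
# Delayed-feedback evaluation loop with a simple bag-of-words Naive Bayes filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

batches = [  # each batch: (messages, labels) with 1 = spam, 0 = ham (toy data)
    (["cheap meds now", "meeting at noon"], [1, 0]),
    (["win a free prize", "lunch tomorrow?"], [1, 0]),
    (["project status update", "free prize inside"], [0, 1]),
]

vectorizer = CountVectorizer()
vectorizer.fit([m for msgs, _ in batches for m in msgs])  # fixed vocabulary for the sketch

train_msgs, train_labels = list(batches[0][0]), list(batches[0][1])
for msgs, labels in batches[1:]:
    clf = MultinomialNB().fit(vectorizer.transform(train_msgs), train_labels)
    print(clf.predict(vectorizer.transform(msgs)))  # classify the incoming batch
    train_msgs += msgs                              # delayed feedback: true labels arrive,
    train_labels += labels                          # so the batch joins the training data
```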

Page 36: Semi-supervised classification for natural language processing

36

BACKGROUND

• Best performing filters:
  – SVM and Transductive SVM (TSVM)
  – Dynamic Markov Compression (DMC)
  – Logistic regression with self-training

Page 37: Semi-supervised classification for natural language processing

37

BACKGROUND

• TREC Spam Track Challenge
  – Filters are trained with publicly available emails
  – Filters are then tested on emails collected from user inboxes

Page 38: Semi-supervised classification for natural language processing

38

METHODS AND MATERIALS

• TREC 2007 dataset
  – Delayed Feedback:
    • First 10,000 messages for training
    • Next 60,000 messages divided into six batches (each containing 10,000 messages)
    • The last 5,000 messages for testing
  – Cross-user Train:
    • 30,338 messages from particular user inboxes for training
    • 45,081 messages from other users for evaluation

Page 39: Semi-supervised classification for natural language processing

39

RESULTS: DELAYED FEEDBACK VS CROSS-USER

[Charts: results for the delayed-feedback and cross-user settings]

Page 40: Semi-supervised classification for natural language processing

40

RESULTS: CROSS-CORPUS

• The first 10,000 messages from the TREC 2005 corpus

• The TREC 2007 corpus, split into 10,000-message segments

Page 41: Semi-supervised classification for natural language processing

41

EXTRACTIVE SUMMARIZATION USING SUPERVISED AND SEMI-SUPERVISED LEARNING

Kam-Fai Wong, Mingli Wu, and Wenjie Li*
The Chinese University of Hong Kong
The Hong Kong Polytechnic University*

Proceedings of COLING 2008

Page 42: Semi-supervised classification for natural language processing

42

INTRODUCTION

• Used co-training to combine labeled and unlabeled data

• Demonstrated that the extractive summaries obtained from co-training are comparable to summaries produced by supervised methods and by humans

Page 43: Semi-supervised classification for natural language processing

43

METHOD

• The authors used four kinds of features:
  1. Surface
  2. Relevance
  3. Event, and
  4. Content

• Supervised setup
  – Support Vector Machine

• Co-training setup
  – Probabilistic SVM (PSVM)
  – Naive Bayes

Page 44: Semi-supervised classification for natural language processing

44

DATASETS

• The DUC-2001 dataset was used

• It contains 30 clusters of documents
  – Each cluster contains documents on a particular topic

• 308 documents in total

• For each cluster, human summaries are provided
  – 50, 100, 200 and 400-word summaries

• For each document, human summaries are provided as well
  – 100-word summaries

Page 45: Semi-supervised classification for natural language processing

45

RESULTS: FEATURE SELECTION

• ROUGE-1, ROUGE-2 and ROUGE-L scores were used as evaluation measures (a small example of computing them follows below)

• The ROUGE-1 score for the human summaries was 0.422
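For readers unfamiliar with these measures, here is a small example of computing ROUGE-1, ROUGE-2 and ROUGE-L with the rouge-score package (installable via pip). The package choice and the example sentences are assumptions; the slides do not name the authors' evaluation toolkit.

```python
# Computing ROUGE-1, ROUGE-2 and ROUGE-L for a candidate summary against a reference.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the cat sat on the mat"
candidate = "a cat was sitting on the mat"
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(name, round(result.fmeasure, 3))   # F-measure for each ROUGE variant
```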

Page 46: Semi-supervised classification for natural language processing

46

RESULTS: EFFECT OF UNLABELED DATA

More labeled data produced a better F-score

Page 47: Semi-supervised classification for natural language processing

47

RESULTS: SUPERVISED VS SEMI-SUPERVISED

Page 48: Semi-supervised classification for natural language processing

48

RESULTS: EFFECT OF SUMMARY LENGTH

Page 49: Semi-supervised classification for natural language processing

49

LIMITATIONS

• Co-training is done on the same feature space– Violates the primary hypothesis of co-

training

• The strength of features was determined only using PSVM–We have no knowledge on the

performance of Supervised Naive Bayes on the features

Page 50: Semi-supervised classification for natural language processing

50

SEMI-SUPERVISED CLASSIFICATION FOR EXTRACTING PROTEIN INTERACTION SENTENCES USING DEPENDENCY PARSING

Gunes Erkan, Arzucan Ozgur, and Dragomir Radev
University of Michigan

Proceedings of EMNLP-CoNLL 2007

Page 51: Semi-supervised classification for natural language processing

51

INTRODUCTION

• Produces dependency trees for each sentence• Analyzes the paths between two protein names

in the parse trees• Using machine learning techniques, according

to the paths, the sentences are labeled (gold standard)

• Given the paths, cosine similarity and edit distance are used to find out interactions between the proteins

• Semi-supervised algorithms perform better than their supervised versions by a wide margin
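The slides state that cosine similarity and edit distance over dependency paths serve as the similarity measures. The following sketch computes both for two paths treated as token sequences; the example paths and token format are hypothetical, not taken from the paper.

```python
# Cosine similarity and edit distance between two dependency paths (token sequences).
import math
from collections import Counter

def cosine_similarity(path1, path2):
    a, b = Counter(path1), Counter(path2)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def edit_distance(path1, path2):
    # classic dynamic-programming Levenshtein distance over path tokens
    prev = list(range(len(path2) + 1))
    for i, t1 in enumerate(path1, 1):
        curr = [i]
        for j, t2 in enumerate(path2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (t1 != t2)))
        prev = curr
    return prev[-1]

path_a = ["PROT1", "nsubj", "interacts", "prep_with", "PROT2"]  # hypothetical paths
path_b = ["PROT1", "nsubj", "binds", "prep_to", "PROT2"]
print(cosine_similarity(path_a, path_b), edit_distance(path_a, path_b))
```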

Page 52: Semi-supervised classification for natural language processing

52

INTRODUCTION

• The first semi-supervised approach in the problem domain

• The first approach that utilizes information beyond syntactic parses

Page 53: Semi-supervised classification for natural language processing

53

METHOD

• Four algorithms were used
  1. Support Vector Machine (SVM)
  2. K-Nearest Neighbor (KNN)
  3. Transductive SVM (TSVM)
  4. Harmonic Functions

• The Stanford dependency parser is used to generate the parse trees

Page 54: Semi-supervised classification for natural language processing

54

DATASETS

• Sentences from two datasets are annotated based on their dependency trees (using supervised techniques)
  – AIMED
  – Christine Brun (CB)

Page 55: Semi-supervised classification for natural language processing

55

RESULTS: AIMED DATASET

Page 56: Semi-supervised classification for natural language processing

56

RESULTS: CB DATASET

Page 57: Semi-supervised classification for natural language processing

57

RESULTS: EFFECT OF TRAINING DATA SIZE (AIMED)

• With small training data, semi-supervised algorithms are better

• SVM performs poorly with less training data

Page 58: Semi-supervised classification for natural language processing

58

RESULTS: EFFECT OF TRAINING DATA SIZE (CB)

• KNN performs the worst when more labeled data is available

• With larger training data, SVM performs comparably to the semi-supervised algorithms

Page 59: Semi-supervised classification for natural language processing

59

LIMITATIONS

• Transductive SVM is susceptible to the distribution of the labelled data– The distribution was not tested

• AIMED has class imbalance problem– TSVM is affected by this problem

Page 60: Semi-supervised classification for natural language processing

60

HOW MUCH UNLABELED DATA IS USED?

Page 61: Semi-supervised classification for natural language processing

61

CONCLUSIONS

• Semi-supervised learning has been a clear success in domains like natural language processing

• The success depends on
  – Matching the problem at hand with the model assumption
  – Careful observation of the distribution of the data
  – Careful selection of algorithms

Page 62: Semi-supervised classification for natural language processing

62

CONCLUSIONS

• Apart from these fundamental conditions, to get success with semi-supervised learning we need to examine the followings—– Proportion of labeled and unlabeled data (No

definite answer)– Effect of dependency of features (with fewer

labeled examples, use fewer dependent features)– Noise in the labeled data (easier) and unlabeled

data (difficult) (Overall, noise has less effect on semi-supervised learning)

– Difference in domains of labeled and unlabeled data (Transfer learning or self-taught learning)

Page 63: Semi-supervised classification for natural language processing

63

CONCLUSIONS