Semi-supervised classification for natural language processing
Rushdi Shams, Department of Computer Science, University of Western Ontario, London, Canada
Jan 19, 2015

This presentation describes semi-supervised learning and its application to natural language processing tasks.

Transcript
Page 1: Semi-supervised classification for natural language processing

SEMI-SUPERVISED CLASSIFICATION FOR

NATURAL LANGUAGE PROCESSING

Rushdi Shams

Department of Computer Science

University of Western Ontario, London, Canada.

[email protected]

Page 2: Semi-supervised classification for natural language processing

2

PRESENTATION AT A GLANCE

• Semi-supervised learning
  – Problems solved by semi-supervised learning
  – Types
  – How it works
  – When it works

• Semi-supervised learning for NLP
  – Parsing
  – Text classification
  – Summarization
  – Biomedical tasks

• Conclusions

Page 3: Semi-supervised classification for natural language processing

3

SEMI-SUPERVISED LEARNING

• Traditional supervised classifiers use only labeled data for their training
  – Labeled data are expensive, difficult to obtain, and time consuming to produce

• Real-life problems have large amounts of unlabeled data

• Semi-supervised learning combines unlabeled data with labeled data

Page 4: Semi-supervised classification for natural language processing

4

SEMI-SUPERVISED LEARNING PROBLEMS

(1) Learn from labeled data

(2) Apply the learned model to unlabeled data to label them

(3) If the labeling is confident, learn from the data in (1) and (2)

(4) Apply the resulting model to unseen unlabeled data

Steps (1)–(3) correspond to transductive learning; adding step (4) makes the setting inductive (a minimal sketch of this loop follows below).
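As a concrete illustration of steps (1)–(4), here is a minimal self-training-style sketch in Python. The LogisticRegression base learner and the 0.9 confidence threshold are assumptions for illustration, not choices made in the presentation.

```python
# Minimal sketch of the loop above: learn from labeled data, pseudo-label
# the unlabeled pool, keep only confident labels, retrain, and return a
# model that can be applied to unseen data (the inductive step).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_rounds=10):
    X_l, y_l = X_labeled.copy(), y_labeled.copy()
    X_u = X_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)                       # (1) learn from labeled data
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)          # (2) apply the model to unlabeled data
        confident = proba.max(axis=1) >= threshold   # (3) keep confident pseudo-labels
        if not confident.any():
            break
        pseudo_labels = clf.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo_labels])
        X_u = X_u[~confident]
    return clf                                   # (4) use clf.predict on unseen data
```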

Page 5: Semi-supervised classification for natural language processing

5

SEMI-SUPERVISED LEARNING PROBLEMS

• Transductive learning is like a take-home exam
  – How good is the model assumption when applied to the unlabeled data after training on labeled data?

• Inductive learning is like an in-class exam
  – How good is the model assumption when applied to unseen data after training on labeled + unlabeled data?

Page 6: Semi-supervised classification for natural language processing

6

SCOPES OF SEMI-SUPERVISED LEARNING

• Like traditional learning methods, semi-supervised learning can be used for
  – Classification
  – Regression, and
  – Clustering

Page 7: Semi-supervised classification for natural language processing

7

HOW DOES SEMI-SUPERVISED CLASSIFICATION WORK?

Page 8: Semi-supervised classification for natural language processing

8

TYPES OF SEMI-SUPERVISED LEARNING

• Generative Learning
• Discriminative Learning

• Self-Training
• Co-Training
• Active Learning
  – These describe how to use generative and discriminative learning

Page 9: Semi-supervised classification for natural language processing

9

GENERATIVE VS DISCRIMINATIVE MODELS

[Diagram: how discriminative models and generative models treat a labeled example (x, y)]

Page 10: Semi-supervised classification for natural language processing

10

GENERATIVE VS DISCRIMINATIVE MODELS

• Imagine your task is to classify a speech sample by its language

• You can do it by
  1. Learning each language and then classifying the speech using the knowledge you just gained
  2. Determining the differences between the linguistic models, without learning the languages, and then classifying the speech

• (1) is how Generative models work and (2) is how Discriminative models work

Page 11: Semi-supervised classification for natural language processing

11

GENERATIVE VS DISCRIMINATIVE MODELS

• Discriminative models predict the label y from the training example x, i.e., they model P(y|x) directly

• Using Bayes' Theorem, we get*

  P(y|x) = P(x|y) P(y) / P(x)

• This is the equation we use in generative models

* P(x) can be ignored since we are interested in finding the argmax over y
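To make the distinction concrete, here is a small hedged example: scikit-learn's GaussianNB plays the generative role (it estimates P(y) and P(x|y) and applies Bayes' theorem), while LogisticRegression plays the discriminative role (it models P(y|x) directly). The toy dataset and classifier choices are illustrative assumptions, not part of the presentation.

```python
# Generative vs. discriminative classifiers on the same toy data.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

generative = GaussianNB().fit(X, y)                           # estimates P(y) and P(x|y)
discriminative = LogisticRegression(max_iter=1000).fit(X, y)  # estimates P(y|x) directly

print(generative.predict_proba(X[:3]))       # P(y|x) obtained via Bayes' theorem
print(discriminative.predict_proba(X[:3]))   # P(y|x) modeled directly
```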

Page 12: Semi-supervised classification for natural language processing

12

GENERATIVE VS DISCRIMINATIVE MODELS

• Discriminative models
  – Model the conditional probability P(y|x), used to determine class boundaries
  – Examples: Transductive SVM, graph-based methods
  – Caveat: cannot be used without considering P(x)

• Generative models
  – Model the joint probability P(x,y); for any given y, we can generate its x
  – Examples: EM algorithm, self-learning
  – Caveat: difficult when the estimates of P(x|y) are inadequate

Page 13: Semi-supervised classification for natural language processing

13

GENERATIVE VS DISCRIMINATIVE MODELS

• The class-conditional term is a probability density function
  – For a Gaussian distribution, it is a function of the mean vector and covariance matrix

• The mean vector and covariance matrix can be tuned to maximize this term
  – Use Maximum Likelihood Estimation (MLE) to find them

• The tuning can then be refined with the EM algorithm (a small sketch follows below)

• Different algorithms use different techniques depending on the distribution of the data
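The following sketch illustrates this pipeline under stated assumptions: Gaussian class-conditionals are fit by MLE on the labeled data, then scikit-learn's GaussianMixture runs EM over labeled plus unlabeled inputs. The synthetic data and the choice to initialize the components at the labeled class means are assumptions for illustration.

```python
# MLE of Gaussian class-conditionals on labeled data, then EM (GaussianMixture)
# to refine the mean vectors and covariance matrices using unlabeled data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_labeled = np.array([0] * 20 + [1] * 20)
X_unlabeled = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])

# MLE on the labeled data alone: one mean vector per class
means_init = np.array([X_labeled[y_labeled == c].mean(axis=0) for c in (0, 1)])

# EM on labeled + unlabeled inputs, started from the labeled-data estimates
gmm = GaussianMixture(n_components=2, means_init=means_init, random_state=0)
gmm.fit(np.vstack([X_labeled, X_unlabeled]))
print(gmm.means_)                           # refined mean vectors
print(gmm.predict_proba(X_unlabeled[:3]))   # soft component responsibilities
```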

Page 14: Semi-supervised classification for natural language processing

14

IS THERE A FREE LUNCH?

• “Unlabeled data are abundant, therefore semi-supervised learning is a good idea”
  – Not always!

• To succeed, one needs to spend a reasonable amount of effort designing good models / features / kernels / similarity functions

Page 15: Semi-supervised classification for natural language processing

15

IS THERE A FREE LUNCH?

• Success requires matching the problem structure with the model assumption

• P(x) is associated with the label prediction of an unlabeled data point x

• Algorithms like the Transductive SVM (TSVM) assume that the decision boundary should avoid regions with high P(x)

Page 16: Semi-supervised classification for natural language processing

16

IS THERE A FREE LUNCH?

• “If the data come from highly overlapping Gaussians, then the decision boundary goes right through the densest region” – this violates the TSVM assumption
  – Expectation-Maximization (EM) performs better in this case

• Another example: a Hidden Markov Model with unlabeled data also does not work!

Page 17: Semi-supervised classification for natural language processing

17

SELF-TRAINING

Page 18: Semi-supervised classification for natural language processing

18

CO-TRAINING

• Given labeled data L and unlabeled data U

• Create two labeled datasets L1 and L2 from L using views 1 and 2

Page 19: Semi-supervised classification for natural language processing

19

CO-TRAINING

Page 20: Semi-supervised classification for natural language processing

20

CO-TRAINING

• Learn classifier f(1) using L1 and classifier f(2) using L2

• Apply f(1) and f(2) to the unlabeled data pool U to predict labels

• Predictions are made using only each classifier's own set (view) of features

• Add the K most confident predictions of f(1) to L2

• Add the K most confident predictions of f(2) to L1

Page 21: Semi-supervised classification for natural language processing

21

CO-TRAINING

• Remove these examples from the unlabeled pool

• Re-train f(1) using L1 and f(2) using L2

• Like self-training, but with two classifiers teaching each other

• Finally, use voting or averaging to make predictions on the test data (see the sketch below)
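For concreteness, here is a minimal co-training sketch of the loop above. The GaussianNB base learners, K = 5, and the way the two feature views are passed in as column-index lists are assumptions for illustration; the slides do not fix these choices.

```python
# Minimal co-training sketch: two classifiers, each trained on its own view,
# teach each other their K most confident pseudo-labels each round.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X_l, y_l, X_u, view1, view2, k=5, rounds=10):
    L1 = (X_l[:, view1], y_l.copy())
    L2 = (X_l[:, view2], y_l.copy())
    U = X_u.copy()
    f1, f2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        if len(U) == 0:
            break
        f1.fit(*L1)
        f2.fit(*L2)
        p1 = f1.predict_proba(U[:, view1])       # each classifier uses only its own view
        p2 = f2.predict_proba(U[:, view2])
        top1 = np.argsort(p1.max(axis=1))[-k:]   # K most confident predictions of f(1)
        top2 = np.argsort(p2.max(axis=1))[-k:]   # K most confident predictions of f(2)
        # f(1)'s confident examples teach f(2), and vice versa
        L2 = (np.vstack([L2[0], U[top1][:, view2]]),
              np.concatenate([L2[1], f1.classes_[p1[top1].argmax(axis=1)]]))
        L1 = (np.vstack([L1[0], U[top2][:, view1]]),
              np.concatenate([L1[1], f2.classes_[p2[top2].argmax(axis=1)]]))
        U = np.delete(U, np.union1d(top1, top2), axis=0)  # remove used examples from the pool
    return f1, f2  # average their predicted probabilities on the test data

# Usage idea: view1 and view2 could be the first and second halves of the feature columns.
```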

Page 22: Semi-supervised classification for natural language processing

22

CO-TRAINING: CAVEATS

1. Each view alone is sufficient to make good classifications, given enough labeled data.

2. The two algorithms perform well, given enough labeled data.

Page 23: Semi-supervised classification for natural language processing

23

ACTIVE LEARNING

Page 24: Semi-supervised classification for natural language processing

24

WHICH METHOD SHOULD I USE?

• There is no direct answer!– Ideally one should use a method whose

assumptions fit the problem structure

• Do the classes produce well clustered data?– If yes, then use EM

• Do the features naturally split into two sets?– If yes, then use co-training

• Is it true that two points with similar features tend to be in the same class?– If yes, then use graph-based methods

Page 25: Semi-supervised classification for natural language processing

25

WHICH METHOD SHOULD I USE?

• Already using SVM?– If yes, then TSVM is a natural extension!

• Is the existing supervised classifier complicated and hard to modify?– If yes, then use self-training

Page 26: Semi-supervised classification for natural language processing

26

SEMI-SUPERVISED CLASSIFICATION FOR NLP

• Parsing
• Text classification
• Summarization
• Biomedical tasks

Page 27: Semi-supervised classification for natural language processing

27

EFFECTIVE SELF-TRAINING FOR PARSING

David McClosky, Eugene Charniak, and Mark Johnson
Brown University

Proceedings of HLT-NAACL 2006

Page 28: Semi-supervised classification for natural language processing

28

INTRODUCTION

• Self-trained a two phase parser-reranker system with readily available data

• The self-trained model gained a 1.1% improvement in F-score over the previous best result– F-score reported is 92.1%

Page 29: Semi-supervised classification for natural language processing

29

METHODS

• The Charniak parser is used for the initial parsing

• It produces the 50 best parses

• A MaxEnt re-ranker is used to re-rank the parses
  – It exploits over a million features

Page 30: Semi-supervised classification for natural language processing

30

DATASETS

• Penn Treebank sections 2-21 for training
  – About 40k WSJ sentences

• Penn Treebank section 23 for testing

• Penn Treebank section 24 for held-out validation

• Unlabeled data were collected from the North American News Text Corpus (NANC)
  – 24 million LA Times sentences

Page 31: Semi-supervised classification for natural language processing

31

RESULTS

• The authors experimented with and without using the re-ranker as they added unlabelled sentences–With the re-ranker the parser performs well

• The improvement is about 1.1% F-score– The self-trained parser contributes 0.8%

and – The re-ranker contributes 0.3%

Page 32: Semi-supervised classification for natural language processing

32

LIMITATIONS

• The work did not restrict more accurately parsed sentences to be included in the training data.

• Speed is similar to Charniak parser but requires a little bit more memory.

• Unlabeled data from one domain (LA times) and labeled data from a different domain (WSJ) affects self-training– The question is remained unanswered

Page 33: Semi-supervised classification for natural language processing

33

SEMI-SUPERVISED SPAM FILTERING: DOES IT WORK?

Mona Mojdeh and Gordon V. Cormack
University of Waterloo

Proceedings of SIGIR 2008

Page 34: Semi-supervised classification for natural language processing

34

INTRODUCTION

• “Semi-supervised learning methods work well for spam filtering when the source of the available labeled examples differs from that of the messages to be classified” [2006 ECML/PKDD challenge]

• The authors reproduced this work and found the opposite result

Page 35: Semi-supervised classification for natural language processing

35

BACKGROUND

• ECML/PKDD Challenge
  – Delayed Feedback (see the sketch after this slide):
    • The filters are first trained on emails T1
    • They then classify some test emails t1
    • They are then trained again on the emails T1 + t1
    • This continues for the entire dataset
    • Best (1-AUC) is 0.01%
  – Cross-user Train:
    • Train on one set of emails and test on a different set of emails
    • The emails are extracted from the same dataset
    • Best (1-AUC) is 0.1%
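The following is a minimal sketch of the delayed-feedback protocol described above: train on the messages seen so far, classify the next batch, then fold that batch (with its true labels) back into the training set. The Naive Bayes filter and the toy message batches are assumptions for illustration, not the challenge's actual filters or data.

```python
# Delayed-feedback evaluation loop with a simple bag-of-words Naive Bayes filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

batches = [  # each batch: (messages, labels) with 1 = spam, 0 = ham (toy data)
    (["cheap meds now", "meeting at noon"], [1, 0]),
    (["win a free prize", "lunch tomorrow?"], [1, 0]),
    (["project status update", "free prize inside"], [0, 1]),
]

vectorizer = CountVectorizer()
vectorizer.fit([m for msgs, _ in batches for m in msgs])  # fixed vocabulary for the sketch

train_msgs, train_labels = list(batches[0][0]), list(batches[0][1])
for msgs, labels in batches[1:]:
    clf = MultinomialNB().fit(vectorizer.transform(train_msgs), train_labels)
    print(clf.predict(vectorizer.transform(msgs)))  # classify the incoming batch
    train_msgs += msgs                              # delayed feedback: true labels arrive,
    train_labels += labels                          # so the batch joins the training data
```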

Page 36: Semi-supervised classification for natural language processing

36

BACKGROUND

• Best performing filters:
  – SVM and Transductive SVM (TSVM)
  – Dynamic Markov Compression (DMC)
  – Logistic regression with self-training

Page 37: Semi-supervised classification for natural language processing

37

BACKGROUND

• TREC Spam Track Challenge
  – Filters are trained with publicly available emails
  – Filters are then tested on emails collected from user inboxes

Page 38: Semi-supervised classification for natural language processing

38

METHODS AND MATERIALS

• TREC 2007 dataset
  – Delayed Feedback:
    • First 10,000 messages for training
    • Next 60,000 messages divided into six batches (each containing 10,000 messages)
    • The last 5,000 messages for testing
  – Cross-user Train:
    • 30,338 messages from particular user inboxes for training
    • 45,081 messages from other users for evaluation

Page 39: Semi-supervised classification for natural language processing

39

RESULTS: DELAYED FEEDBACK VS CROSS-USER

[Charts: results for the delayed-feedback and cross-user settings]

Page 40: Semi-supervised classification for natural language processing

40

RESULTS: CROSS-CORPUS

• The first 10,000 messages from the TREC 2005 corpus

• The TREC 2007 corpus, split into 10,000-message segments

Page 41: Semi-supervised classification for natural language processing

41

EXTRACTIVE SUMMARIZATION USING SUPERVISED AND SEMI-SUPERVISED LEARNING

Kam-Fai Wong, Mingli Wu, and Wenjie Li*
The Chinese University of Hong Kong
The Hong Kong Polytechnic University*

Proceedings of COLING 2008

Page 42: Semi-supervised classification for natural language processing

42

INTRODUCTION

• Used co-training to combine labeled and unlabeled data

• Demonstrated that the extractive summaries obtained from co-training are comparable to summaries produced by supervised methods and by humans

Page 43: Semi-supervised classification for natural language processing

43

METHOD

• The authors used four kinds of features:
  1. Surface
  2. Relevance
  3. Event, and
  4. Content

• Supervised setup
  – Support Vector Machine

• Co-training setup
  – Probabilistic SVM (PSVM)
  – Naive Bayes

Page 44: Semi-supervised classification for natural language processing

44

DATASETS

• The DUC-2001 dataset was used

• It contains 30 clusters of documents
  – Each cluster contains documents on a particular topic

• 308 documents in total

• For each cluster, human summaries are provided
  – 50, 100, 200 and 400-word summaries

• For each document, human summaries are provided as well
  – 100-word summaries

Page 45: Semi-supervised classification for natural language processing

45

RESULTS: FEATURE SELECTION

• ROUGE-1, ROUGE-2 and ROUGE-L scores were used as evaluation measures (a small example of computing them follows below)

• The ROUGE-1 score for the human summaries was 0.422
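For readers unfamiliar with these measures, here is a small example of computing ROUGE-1, ROUGE-2 and ROUGE-L with the rouge-score package (installable via pip). The package choice and the example sentences are assumptions; the slides do not name the authors' evaluation toolkit.

```python
# Computing ROUGE-1, ROUGE-2 and ROUGE-L for a candidate summary against a reference.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the cat sat on the mat"
candidate = "a cat was sitting on the mat"
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(name, round(result.fmeasure, 3))   # F-measure for each ROUGE variant
```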

Page 46: Semi-supervised classification for natural language processing

46

RESULTS: EFFECT OF UNLABELED DATA

More labeled data produced a better F-score

Page 47: Semi-supervised classification for natural language processing

47

RESULTS: SUPERVISED VS SEMI-SUPERVISED

Page 48: Semi-supervised classification for natural language processing

48

RESULTS: EFFECT OF SUMMARY LENGTH

Page 49: Semi-supervised classification for natural language processing

49

LIMITATIONS

• Co-training is done on the same feature space– Violates the primary hypothesis of co-

training

• The strength of features was determined only using PSVM–We have no knowledge on the

performance of Supervised Naive Bayes on the features

Page 50: Semi-supervised classification for natural language processing

50

SEMI-SUPERVISED CLASSIFICATION FOR EXTRACTING PROTEIN INTERACTION SENTENCES USING DEPENDENCY PARSING

Gunes Erkan, Arzucan Ozgur, and Dragomir Radev
University of Michigan

Proceedings of EMNLP-CoNLL 2007

Page 51: Semi-supervised classification for natural language processing

51

INTRODUCTION

• Produces dependency trees for each sentence• Analyzes the paths between two protein names

in the parse trees• Using machine learning techniques, according

to the paths, the sentences are labeled (gold standard)

• Given the paths, cosine similarity and edit distance are used to find out interactions between the proteins

• Semi-supervised algorithms perform better than their supervised versions by a wide margin
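The slides state that cosine similarity and edit distance over dependency paths serve as the similarity measures. The following sketch computes both for two paths treated as token sequences; the example paths and token format are hypothetical, not taken from the paper.

```python
# Cosine similarity and edit distance between two dependency paths (token sequences).
import math
from collections import Counter

def cosine_similarity(path1, path2):
    a, b = Counter(path1), Counter(path2)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def edit_distance(path1, path2):
    # classic dynamic-programming Levenshtein distance over path tokens
    prev = list(range(len(path2) + 1))
    for i, t1 in enumerate(path1, 1):
        curr = [i]
        for j, t2 in enumerate(path2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (t1 != t2)))
        prev = curr
    return prev[-1]

path_a = ["PROT1", "nsubj", "interacts", "prep_with", "PROT2"]  # hypothetical paths
path_b = ["PROT1", "nsubj", "binds", "prep_to", "PROT2"]
print(cosine_similarity(path_a, path_b), edit_distance(path_a, path_b))
```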

Page 52: Semi-supervised classification for natural language processing

52

INTRODUCTION

• The first semi-supervised approach in the problem domain

• The first approach that utilizes information beyond syntactic parses

Page 53: Semi-supervised classification for natural language processing

53

METHOD

• Four algorithms were used
  1. Support Vector Machine (SVM)
  2. K-Nearest Neighbor (KNN)
  3. Transductive SVM (TSVM)
  4. Harmonic Functions

• The Stanford dependency parser is used to generate the parse trees

Page 54: Semi-supervised classification for natural language processing

54

DATASETS

• Sentences from two datasets are annotated based on their dependency trees (using supervised techniques)
  – AIMED
  – Christine Brun (CB)

Page 55: Semi-supervised classification for natural language processing

55

RESULTS: AIMED DATASET

Page 56: Semi-supervised classification for natural language processing

56

RESULTS: CB DATASET

Page 57: Semi-supervised classification for natural language processing

57

RESULTS: EFFECT OF TRAINING DATA SIZE (AIMED)

• With small training data, semi-supervised algorithms are better

• SVM performs poorly with less training data

Page 58: Semi-supervised classification for natural language processing

58

RESULTS: EFFECT OF TRAINING DATA SIZE (CB)

• KNN performs the worst when more labeled data is available

• With larger training data, SVM performs comparably to the semi-supervised algorithms

Page 59: Semi-supervised classification for natural language processing

59

LIMITATIONS

• Transductive SVM is susceptible to the distribution of the labelled data– The distribution was not tested

• AIMED has class imbalance problem– TSVM is affected by this problem

Page 60: Semi-supervised classification for natural language processing

60

HOW MUCH UNLABELED DATA IS USED?

Page 61: Semi-supervised classification for natural language processing

61

CONCLUSIONS

• Semi-supervised learning has been a clear success in domains like natural language processing

• The success depends on
  – Matching the problem at hand with the model assumption
  – Careful observation of the distribution of the data
  – Careful selection of algorithms

Page 62: Semi-supervised classification for natural language processing

62

CONCLUSIONS

• Apart from these fundamental conditions, to get success with semi-supervised learning we need to examine the followings—– Proportion of labeled and unlabeled data (No

definite answer)– Effect of dependency of features (with fewer

labeled examples, use fewer dependent features)– Noise in the labeled data (easier) and unlabeled

data (difficult) (Overall, noise has less effect on semi-supervised learning)

– Difference in domains of labeled and unlabeled data (Transfer learning or self-taught learning)

Page 63: Semi-supervised classification for natural language processing

63

CONCLUSIONS