Top Banner
Bootstrapping Goals: Utilize a minimal amount of (initial) supervision Obtain learning from many unlabeled examples (vs. selective sampling) General scheme: 1. Initial supervision – seed examples for training an initial model or seed model itself (indicative features) 2. Classify corpus with seed model 3. Add most confident classifications to training and iterate. Exploits feature redundancy in examples
37

Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Dec 22, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Bootstrapping• Goals:

– Utilize a minimal amount of (initial) supervision

– Obtain learning from many unlabeled examples (vs. selective sampling)

• General scheme:1. Initial supervision – seed examples for training an

initial model or seed model itself (indicative features)

2. Classify corpus with seed model

3. Add most confident classifications to training and iterate.

• Exploits feature redundancy in examples

Page 2: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Bootstrapping Decision List for Word Sense Disambiguation

• Yarowsky (1995) scheme• Relies on “one sense per collocation”1. Initialize seed model with all collocations in the sense

dictionary definition – an initial decision list for each sense.– Alternatively – train from seed examples

2. Classify all examples with current model – Any example containing a feature for some sense

3. Compute odds ratio for each feature-sense combination; if above threshold then add the feature to the decision list

4. Optional: add the “one sense per discourse constraint” to add/filter examples

5. Iterate (2)

Page 3: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

“One Sense per Collocation”

Page 4: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Results• Applies the “one sense per discourse” constraint after final

classification – classify all word occurrences in a document by the majority sense

• Evaluation on ~37,000 examples:

Page 5: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Co-training for Name Classification

• Collins and Singer, EMNLP 1999• A bootstrapping approach that relies on a

(natural) partition of the feature space to two different sub-spaces

• The bootstrapping process iterates by switching sub-space in each iteration

• Name classification task: person, organization, location

• Features for decision list:1. Spelling features (full, contains, cap, non-alpha)2. Context features (words, syntactic type)

Page 6: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Seed Features

• Initialized with 0.9999 score for the decision list

Page 7: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

DL-CoTrain Algorithm

1. Initialize n=5 (max. #rules learned in an iteration)2. Initialize spelling decision list with seed3. Classify corpus by spelling rules (where applicable)4. Induce contextual rule list from classified

examples; keep at most the n most frequent rules whose score>θ for each class

5. Classify corpus by contextual rules6. Induce spelling rules, as in step 4, and add to list7. If n<2500 then n n+5, and return to step 3.

Otherwise, classify corpus by combined list and induce a final classifier from these labeled examples

Page 8: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Co-Train vs. Yarowsky’s Algorithm

• Collins & Singer implemented a cautious version of Yarowsky’s algorithm, where n rules are added at each iteration

• Same accuracy (~91%) (as well as a boosting-based bootstrapping algorithm)

• Abney (ACL 2002) provides some analysis for both algorithm, showing that they are based on different independence assmptions

Page 9: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Clustering and Unsupervised Disambiguation

• Word sense disambiguation without labeled training or other information sources

• Cannot label to predefined senses (there are none), so try to find “natural” senses

• Use clustering methods to divide different contexts of a word into “sensible” classes

• Other applications of clustering:– Thesaurus construction, document clustering– Forming word classes for generalization in

language modeling and disambiguation

Page 10: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Clustering

Page 11: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Clustering Techniques

• Hierarchical:– Bottom-up (agglomerative)– Top-down (divisive)

• Flat (non-hierarchical)– Usually iterative– Soft (vs. hard) clustering

• “Degree” of membership

Page 12: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Comparison

Hierarchical:• Preferable for detailed

data analysis• More information

provided• No clearly preferred

algorithm• Less efficient (at least

O(n2))

Non-hierarchical:• Preferable if efficiency

is important or lots of data

• K-means is the simplest method and often good enough

• If no Euclidean space, can use EM instead

Page 13: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Hierarchical Clustering

and but in on with for at from of to as

Page 14: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Similarity Functions

• sim(c1,c2) = similarity between clusters – Defined over pairs of individual elements, and

(possibly inductively) over cluster pairs

• Values typically between 0 and 1

For hierarchical clustering, require:• Monotonicity:

For all possible clusters,

min[sim(c1,c2),sim(c1,c3)] sim(c1,c2 c3)

Merging two clusters cannot increase similarity!

Page 15: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Input: Set X = {x1,…,xn} of objectsSimilarity function sim: 2X 2X

for i 1 to n do:ci {xi}

C {ci, ..., cn}j n + 1while |C| > 1 do:

(ca, cb) arg max(cu,cv) sim(cu,cv)cj ca cb; Trace merge to construct cluster hierarchy

C (C \ {ca,cb}) {cj}j++

Bottom-Up Clustering

Page 16: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Types of Cluster Pair Similarity

single link similarity of most similar (nearest) members

complete link similarity of most dissimilar (remote) members

- property of unified cluster

group average average pairwise similarity of members in unified cluster

- property of unified cluster

Page 17: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Single Link Clustering (“chaining” risk)

Page 18: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Complete Link Clustering

• “Global view” yields tighter clusters – usually more suitable in NLP

Page 19: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Efficiency of Merge Step

• Maintain pairwise similarities between clusters, and maximal similarity per cluster:

• Single link update - O(n):– sim(c1 c2, c3) = max(sim(c1,c3),sim(c2,c3))

• Complete link update - O(n log(n)):– Have to compare all pairwise similarities with

all points in merged cluster to find minimum

Page 20: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Group Average Clustering

• A “mid-point” between single and complete link – produces relatively tight clusters

• Efficiency depends on choice of similarity metric• Computing average similarity for a newly formed

cluster from scratch is O(n2) – Goal – incremental computation

• Represent objects as m-dimensional unit vectors (length-normalized), and use cosine for similarity:

yxyx

yx

yx

yxyxyx

i ii i

i ii

22

),cos(),(sim

Page 21: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Cosine Similarity

3 2 1

cos 1 > cos 2 > cos 3

Page 22: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Efficient Cosine Similarity• Average cosine similarity within a cluster c (to be

maximized after merge):

• Define sum vector for cluster:

cx cyx

yxcc

cA

),(sim

)1(

1)(

cx

xcs

)(

)1(

)()()(

)()1(

)()1(

)()(

cc

ccscscA

ccAcc

xxcAcc

yxcscs

cx

cx cy

Page 23: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

• On merge, update sum:

• Then update A for all new pairs of clusters• Complexity: O(n) per iteration (assuming constant

dimensionality); Overall for algorithm: O(n2)

)1)((

)())()(())()(()(

)1(

))(())(()(

2

baba

abababa cccc

cccscscscsccA

cc

ccscscA

)()()()( babaj cscsccscs

Page 24: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Input: Set X = {x1,…,xn} of objects

Coherence measure coh: 2XSplitting function split: 2X 2X 2X

C { X }

j 1

while C contains a cluster larger than 1 element do:

ca arg mincu coh(cu)

(cj+1,cj+2) split(ca)

C (C \ ca) {cj+1,cj+2}

j += 2

Top-Down Clustering

Page 25: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Top-Down Clustering (cont.)• Coherence measure – can be any type of cluster

quality function, including those used for agglomerative clustering– Single link – maximal similarity in cluster– Complete link – minimal similarity in cluster– Group average – average pairwise similarity

• Split – can be handled as a clustering problem on its own, aiming to form two clusters– Can use hierarchical clustering, but often more natural

to use non-hierarchical clustering (see next)

• May provide a more “global” view of the data (vs. the locally greedy bottom-up procedure)

Page 26: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Non-Hierarchical Clustering

• Iterative clustering:– Start with initial (random) set of clusters

– Assign each object to a cluster (or clusters)

– Re-compute cluster parameters• E.g. centroids:

– Stop when clustering is “good”

• Q: How many clusters?

cx

xcc

csc

||

1

||

)()(

Page 27: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

K-means AlgorithmInput: Set X = {x1,…,xn} of objects

Distance measure d: X X Mean function : 2XX

Select k initial cluster centers f1, ..., fk

while not finished do:

for all clusters cj do:

cj { xi | fj = arg minf d(xi, f) }

for all means fj do:

fj (cj)

Complexity: O(n), assuming constant number of iterations

Page 28: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

K-means Clustering

Page 29: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Example

Clustering of words from NY Times using cooccurring words

1. ballot, polls, Gov, seats

2. profit, finance, payments

3. NFL, Reds, Sox, inning, quarterback, score

4. researchers, science

5. Scott, Mary, Barbara, Edward

Page 30: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Buckshot Algorithm

• Often, random initialization for K-means works well

• If not:– Randomly choose points

– Run hierarchical group average clustering:O(( )2)=O(n)

– Use those cluster means as starting points for K-means

– O(n) total complexity

• Scatter/Gather (Cutting, Karger, Pedersen, 1993)– An interactive clustering-based browsing scheme for

text collections and search results (constant time improvements)

n

n

Page 31: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

The EM Algorithm• Soft clustering method to solve

* = arg max P(X | m())– Soft clustering – probabilistic association between

clusters and objects ( - the model parameters)

Note: Any occurrence of the data consists of:– Observable variables: The objects we see

• Words in each context• Word sequence in POS tagging

– Hidden variables: Which cluster generated which object

• Which sense generates each context• Underlying POS sequence

• Soft clustering can capture ambiguity (words) and multiple topics (documents)

Page 32: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Two Principles

Expectation: If we knew we could compute the expectation of the hidden variables (e.g, probability of x belonging to some cluster)

Maximization: If we knew the hidden structure, we could compute the maximum likelihood value of

Page 33: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

EM for Word-sense “Discrimination”Cluster the contexts (occurrences) of an ambiguous word,

where each cluster constitutes a “sense”:

• Initialize randomly model parameters P(vj | sk) and P(sk)

• E-step: Compute P(sk | ci) for each context ci and sense sk

• M-step: Re-estimate the model parameters P(vj | sk) and P(sk), for context words and senses

• Continue as long as log-likelihood of all corpus contexts 1 ≤ i ≤ I increases (EM – guaranteed to increase in each step till convergence to a maximum, possibly local):

i kkki

i kkki sPscPsPscP )()|(log)()|(log

Page 34: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

E-Step• To compute P(ci | sk), use naive Bayes assumption:

• For each context i & each sense sk, estimate the posterior probability hik = P(sk | ci) (an expected “count” of the sense for ci), using Bayes rule:

' '' )()|(

)()|(

k kki

kkiik sPscP

sPscPh

kj sv

kjki svPscP )|()|(

Page 35: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

M-StepRe-estimate parameters using maximum-

likelihood estimation:

I

h

h

hsP

h

hsvP

i ik

k iik

i ikk

j cvcik

cvc ik

kj

iji

iji

)(

)|(

:

:

Page 36: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Decision Procedure

Assign senses by (same as Naïve Bayes):

• Can adjust the pre-determined number of senses k to get finer or coarser distinctions– Bank as physical location vs. abstract corporation

• If adding more senses doesn’t increase log-likelihood much, then stop

ij

kcv

kjksi svPsPws )|(log)(logmaxarg)(

Page 37: Bootstrapping Goals: –Utilize a minimal amount of (initial) supervision –Obtain learning from many unlabeled examples (vs. selective sampling) General.

Results (Schütze 1998)Word Sense Accuracysuit lawsuit 95 0

suit you wear 96 0motion physical movement 85 1

proposal for action 88 13train line of railroad cars 79 19

to teach 55 33

• Works better for topic-related senses (given the broad-context features used)

• Improved IR performance by 14% - representing both query and document as senses, and combining results with word-based retrieval