
Using Random Forests Language Models in IBM RT-04 CTS
Peng Xu (CLSP, The Johns Hopkins University) and Lidia Mangu (IBM T.J. Watson Research Center)

Dec 28, 2015

Transcript
Page 1:

Using Random Forests Language Models in IBM RT-04 CTS

Peng Xu (1) and Lidia Mangu (2)

1. CLSP, The Johns Hopkins University   2. IBM T.J. Watson Research Center

March 24, 2005

Page 2:

n-gram Smoothing

Smoothing: take some probability mass away from seen n-grams and distribute it among unseen n-grams.

Over 10 different smoothing techniques were proposed in the literature.

Interpolated Kneser-Ney: consistently the best performance. [Chen & Goodman, 1998]
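For reference, a hedged sketch of interpolated Kneser-Ney for a bigram history, in the standard form from Chen & Goodman (1998); the notation below is ours, not taken from the slides:

    P_{KN}(w \mid w_{-1}) = \frac{\max\bigl(c(w_{-1} w) - D,\ 0\bigr)}{c(w_{-1})} + \frac{D \, N_{1+}(w_{-1}\,\bullet)}{c(w_{-1})}\, P_{KN}(w)

Here c(.) is a training count, D is an absolute discount, and N_{1+}(w_{-1} •) is the number of distinct word types seen after w_{-1}; higher orders recurse into this lower-order estimate.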

Page 3:

More Data…

"There's no data like more data."

[Berger & Miller, 1998] Just-in-time language model.
[Zhu & Rosenfeld, 2001] Estimate n-gram counts from the web.
[Banko & Brill, 2001] Effort should be directed toward data collection instead of learning algorithms.
[Keller et al., 2002] n-gram counts from the web correlate reasonably well with BNC data.
[Bulyko et al., 2003] Web text sources are used for language modeling.
[RT-04] University of Washington web data for language modeling.

Page 4:

More Data

Is more data the solution to data sparseness?

The web has “everything”: but web data is noisy.
The web does NOT have everything: language models built from web data still suffer from data sparseness.
  [Zhu & Rosenfeld, 2001] In 24 random web news sentences, 46 out of 453 trigrams were not covered by AltaVista.
In-domain training data is not always easy to get.

Do better smoothing techniques still matter when the training data is millions of words?

Page 5:

Outline

Motivation
Random Forests for Language Modeling
  Decision Tree Language Models
  Random Forests Language Models
Experiments
  Perplexity
  Speech Recognition: IBM RT-04 CTS
Limitations
Conclusions

Page 6:

Dealing With Sparseness in n-grams

Clustering: combine words into groups of words.
  All components still need to use smoothing. [Goodman, 2001]

Decision trees: cluster histories into equivalence classes.
  Appealing idea, but negative results were reported. [Potamianos & Jelinek, 1997]

Maximum entropy: use n-grams as features in an exponential model.
  Almost no difference in performance from interpolated Kneser-Ney models. [Chen & Rosenfeld, 1999]

Neural networks: represent words with real-valued vectors.
  These models rely on interpolation with Kneser-Ney models to get superior performance. [Bengio, 1999]

Page 7:

Our Motivation

A better smoothing technique is desirable.

Better use of the available data is often important!

Improvements in smoothing should help other means of dealing with the data sparseness problem.

Page 8:

Our Approach

Extend the appealing idea of history clustering from decision trees.
Overcome the problems in decision tree construction…

…by using Random Forests!

Page 9:

Decision Tree Language Models

Decision trees: equivalence classification of histories.
Each leaf is specified by the answers to a series of questions that lead from the root to that leaf.
Each leaf corresponds to a subset of the histories; thus the histories are partitioned (i.e., classified).
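In symbols (our notation, a sketch rather than the slide's): if Φ maps a history to the leaf it reaches, the decision tree trigram model is

    P_{DT}(w_0 \mid w_{-2}, w_{-1}) = P\bigl(w_0 \mid \Phi(w_{-2}, w_{-1})\bigr)

so all histories that fall into the same leaf share one predictive distribution.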

Page 10:

Construction of Decision Trees

Data driven: decision trees are constructed on the basis of training data.

The construction requires:
1. The set of possible questions
2. A criterion evaluating the desirability of questions
3. A construction stopping rule or post-pruning rule

Page 11:

Decision Tree Language Models: An Example

Example: trigrams (w-2, w-1, w0)

Questions about history positions: "Is w-i ∈ S?" and "Is w-i ∈ Sc?"; there are two positions for a trigram.
Each pair (S, Sc) defines a possible split of a node, and therefore of the training data; S and Sc are complements with respect to the training data.
A node gets less data than its ancestors.
(S, Sc) are obtained by an exchange algorithm.

Page 12:

Decision Tree Language Models: An Example

Training data: aba, aca, bcb, bbb, ada

Root node: histories {ab, ac, bc, bb, ad}, counts a:3 b:2
  "Is the first word in {a}?" → leaf with histories {ab, ac, ad}, counts a:3 b:0
  "Is the first word in {b}?" → leaf with histories {bc, bb}, counts a:0 b:2

New event 'adb' in test → the {a} leaf; new event 'bdb' in test → the {b} leaf.
New event 'cba' in test: stuck! (its first history word 'c' is in neither set)
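A minimal Python sketch of this toy example (illustrative only; the data layout and names are assumptions, not the authors' implementation):

    # Toy decision-tree history clustering for trigrams, matching the example above.
    from collections import Counter

    training = ["aba", "aca", "bcb", "bbb", "ada"]   # each string is one trigram

    # (history, predicted word) pairs: the history is the first two words.
    events = [(t[:2], t[2]) for t in training]

    # One split on the first word of the history: S = {'a'}, Sc = {'b'}.
    leaves = {"S": Counter(), "Sc": Counter()}
    for hist, word in events:
        if hist[0] in {"a"}:
            leaves["S"][word] += 1
        elif hist[0] in {"b"}:
            leaves["Sc"][word] += 1

    def classify(history):
        """Return the leaf reached by a test history, or None if no question applies."""
        if history[0] in {"a"}:
            return "S"
        if history[0] in {"b"}:
            return "Sc"
        return None   # e.g. history 'cb' from test event 'cba': stuck!

    print(leaves)          # S -> {a: 3}, Sc -> {b: 2}, matching the node counts above
    print(classify("ad"))  # 'S'  (test event 'adb')
    print(classify("bd"))  # 'Sc' (test event 'bdb')
    print(classify("cb"))  # None (test event 'cba' is stuck)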

Page 13:

Construction of Decision Trees: Our Approach

Grow a decision tree to maximum depth using the training data.

Questions are obtained automatically as the tree is constructed.

Use training-data likelihood to evaluate questions.

Perform no smoothing during growing.

Prune the fully grown decision tree to maximize heldout-data likelihood.

Incorporate KN smoothing during pruning.

Page 14:

Smoothing Decision Trees

Using ideas similar to interpolated Kneser-Ney smoothing (a sketch of such an estimate is given below).

Note: not all histories in one node are smoothed in the same way; only leaves are used as equivalence classes.
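The equation on this slide is missing from the transcript. A hedged sketch of a Kneser-Ney-style estimate over a leaf, in our notation (the exact form used by the authors may differ):

    P_{DT}(w \mid h) = \frac{\max\bigl(c(\Phi(h), w) - D,\ 0\bigr)}{c(\Phi(h))} + \frac{D \, N_{1+}(\Phi(h)\,\bullet)}{c(\Phi(h))}\, P_{KN}(w \mid w_{-1})

Because the lower-order term P_{KN}(w | w_{-1}) depends on w_{-1}, two histories in the same leaf are generally smoothed differently, which is what the note above points out.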

Page 15:

Problems with Decision Trees

Training data fragmentation: as the tree is developed, questions are selected on the basis of less and less data.

Optimality: the exchange algorithm is greedy, and so is the tree-growing algorithm.

Overtraining and undertraining:
  Deep trees fit the training data well but will not generalize well to new test data.
  Shallow trees are not sufficiently refined.

Page 16:

Amelioration: Random Forests

Breiman applied the idea of random forests to relatively small problems. [Breiman, 2001]

Using different random samples of the data and randomly chosen subsets of questions, construct K decision trees.

Apply a test datum x to all the decision trees, producing classes y1, y2, …, yK.

Accept the plurality decision.
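The plurality rule's formula is missing from the transcript; a standard statement (our notation) is:

    \hat{y} = \arg\max_{y} \sum_{k=1}^{K} \mathbf{1}\{ y_k = y \}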

Page 17:

Example of a Random Forest

[Figure: three decision trees, T1, T2, and T3.]

An example x is run through all three trees and classified according to the plurality vote of this random forest.

Page 18:

Random Forests for Language Modeling

Two kinds of randomness:
  Selection of the position to ask about (alternatives: position 1, position 2, or the better of the two).
  Random initialization of the exchange algorithm.

100 decision trees: the i-th tree estimates P_DT(i)(w0 | w-2, w-1).

The final estimate is the average over all trees.
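Written out, with M = 100 trees as on the slide:

    P_{RF}(w_0 \mid w_{-2}, w_{-1}) = \frac{1}{M} \sum_{i=1}^{M} P_{DT}^{(i)}(w_0 \mid w_{-2}, w_{-1}), \qquad M = 100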

Page 19:

Experiments

Perplexity (PPL), defined below:

UPenn Treebank part of WSJ: about 1 million words for training and heldout (90%/10%), 82 thousand words for test

Normalized text
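The PPL formula on the slide did not survive the transcript; the standard definition over a test set of N words (trigram conditioning shown for concreteness) is:

    \mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-2}, w_{i-1}) \right)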

Page 20:

Experiments: Aggregating

Considerable improvement already with 10 trees!

Page 21:

Embedded Random Forests

Smoothing a decision tree:

Better smoothing: embedding!
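Both equations on this slide are missing from the transcript. A hedged reading, consistent with the later bullet "Embedding in the Fisher RF": in a plain random forest each tree's leaf distribution backs off to a Kneser-Ney lower-order n-gram, whereas in an embedded random forest it backs off to a lower-order random forest (our notation, a sketch):

    plain RF:     lower-order model is P_{KN}(w \mid w_{-1})
    embedded RF:  lower-order model is P_{RF}(w \mid w_{-1}), itself a random forest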

Page 22:

Speech Recognition Experiments

Word error rate by lattice rescoring with the IBM 2004 Conversational Telephony System for Rich Transcription.

Fisher data: 22 million words.
WEB data: 525 million words, collected using frequent Fisher n-grams as queries.
Other data: Switchboard, Broadcast News, etc.

Lattice language model: 4-gram with interpolated Kneser-Ney smoothing, pruned to 3.2 million unique n-grams.

Test set: DEV04

Page 23:

Speech Recognition Experiments

Baseline: KN 4-gram
110 random DTs
Sampling data without replacement
Fisher+WEB: linear interpolation (written out after the table below)
Embedding in the Fisher RF, no embedding in the WEB RF

             Fisher 4-gram   Fisher+WEB 4-gram
  KN         14.1%           13.7%
  RF         13.5%           13.1%
  p-value    <0.001          <0.001
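The Fisher+WEB combination above is a linear interpolation of the two language models; in standard form (the interpolation weight λ and how it was tuned are not given in the transcript):

    P(w \mid h) = \lambda \, P_{\mathrm{Fisher}}(w \mid h) + (1 - \lambda) \, P_{\mathrm{WEB}}(w \mid h)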

Page 24:

Practical Limitations of the RF Approach

Memory:
  Decision tree construction uses much more memory, so it is not easy to realize the performance gain when the training data is really large.
  Because we have over 100 trees, the final model becomes too large to fit into memory.
  Computing probabilities in parallel incurs extra cost in online computation.

Effective language model compression or pruning remains an open question.

Page 25:

Conclusions: Random Forests

New RF language modeling approach.
A more general LM: the random forest LM generalizes decision tree LMs, which generalize n-gram LMs.
Randomized history clustering.
Good generalization: better n-gram coverage, less biased toward the training data.
Significant improvements in the IBM RT-04 CTS system on DEV04.

Page 26:

Thank you!