Translingual Topic Tracking with PRISE Gina-Anne Levow and Douglas W. Oard University of Maryland February 28, 2000

Translingual Topic Tracking with PRISE

Gina-Anne Levow and Douglas W. Oard

University of Maryland

February 28, 2000

Roadmap

• The signal to noise perspective

• Our topic tracking system

• Boosting signal

• Reducing noise

• Future directions

Translingual Tracking Challenges

• Segmentation of text adds noise
  – Unknown words

• Transcription of speech adds noise
  – Unknown words
  – Easily confused words (e.g., homophones)

• Translation adds noise
  – Vocabulary mismatch with ASR / segmentation
  – Incorrect translation selection

Improving the Signal to Noise Ratio

• Translation coverage
  – Enrich the term list using large dictionaries

• Translation selection
  – Statistical evidence from comparable corpora

• Enriching indexing vocabulary
  – Add related terms from comparable corpora

• Score normalization
  – Learn source dependence from dry-run collection
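A minimal sketch of how source dependence might be learned from a dry-run collection, assuming raw tracking scores are standardized per source class. The function names, the use of a z-score, and the sample numbers are illustrative assumptions; the slides do not state the formula that was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_source_stats(dry_run_scores):
    """Estimate (mean, spread) of raw scores for each source class.

    dry_run_scores: iterable of (source_class, raw_score) pairs taken
    from a held-out dry-run collection (hypothetical input format).
    """
    by_source = defaultdict(list)
    for source, score in dry_run_scores:
        by_source[source].append(score)
    stats = {}
    for source, scores in by_source.items():
        spread = pstdev(scores)
        # Guard against a zero spread so normalization never divides by zero.
        stats[source] = (mean(scores), spread if spread > 0 else 1.0)
    return stats


def normalize(score, source, stats):
    """Map a raw score onto a scale that is comparable across sources."""
    mu, sigma = stats[source]
    return (score - mu) / sigma


stats = learn_source_stats([
    ("NYT", 2.0), ("NYT", 4.0),
    ("Man. Speech", 0.5), ("Man. Speech", 1.5),
])
```

With statistics like these, a raw score of 1.5 on Mandarin speech and a raw score of 4.0 on NYT text land on the same normalized value, so a single decision threshold can be applied to both.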

Preview

• Focusing on noise alone is not enough
  – Signal boosting is a big win

• Baseline: Systran
  – Goal: choose the best single translation

• Two signal-boosting strategies beat Systran
  – Choose the best two translations
  – Add related terms for indexing
      • (found in related documents)

Improvements Since TDT-2

• Weight selection
  – PRISE “bm25idf”

• Query representation:
  – Vector of 180 most selective terms by χ² test

• Two-pass normalization
  – Source-specific, 5 source classes
      • NYT, APW, Eng. Speech, Man. Text, Man. Speech
  – Topic-specific
      • Average of example story scores
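The query representation above can be sketched as follows, assuming the χ² statistic is computed from a 2×2 table of term presence against on-topic (example) versus off-topic stories. The function names, the input format, and the choice to keep only positively associated terms are assumptions made for illustration, not details from the slides.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: on-topic stories containing the term
    b: off-topic stories containing the term
    c: on-topic stories without the term
    d: off-topic stories without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_terms(on_topic, off_topic, k=180):
    """Return up to k terms most strongly associated with the topic.

    on_topic, off_topic: lists of tokenized stories (lists of strings).
    """
    on_sets = [set(story) for story in on_topic]
    off_sets = [set(story) for story in off_topic]
    vocab = set().union(*on_sets) if on_sets else set()
    scored = []
    for term in vocab:
        a = sum(term in s for s in on_sets)
        b = sum(term in s for s in off_sets)
        c = len(on_sets) - a
        d = len(off_sets) - b
        # Keep only terms that are more frequent on-topic than off-topic.
        if a * d > b * c:
            scored.append((chi_square(a, b, c, d), term))
    # Highest statistic first; alphabetical order breaks ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


on_topic = [["quake", "rescue", "the"], ["quake", "aid", "the"]]
off_topic = [["election", "the"], ["market", "the"], ["market", "vote"]]
query_terms = select_terms(on_topic, off_topic, k=3)
```

In this toy data "quake" appears in every on-topic story and no off-topic story, so it ranks first, while a common word such as "the" scores near the bottom.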

[Figure: two panels, Mandarin (All Sources) and English (All Sources), each comparing source-independent and source-dependent normalization]

Translingual Approaches

• Indexing strategies (boosting signal)
  – Post-translation document expansion
  – n-best translation

• Translation tweaks (reducing noise)
  – Enriched bilingual term list
  – Corpus-based translation selection
  – Pre-translation Mandarin stopword removal

Translingual Runs

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
1    LDC        Brown        -                   -                   1
2*   Combined   Brown        -                   -                   1
3    Combined   TDT          -                   -                   1
4*   Combined   TDT          Removed             -                   1
5    Combined   TDT          Removed             -                   2
6*   Combined   TDT          Removed             Yes                 1
7    Systran    -            -                   -                   1

(* = official run scored by NIST)

Document Expansion

[Diagram: document expansion data flow. Labels, in extraction order: BN, NWT (Mandarin); Word-to-Word Translation; Comp. English Corpus; PRISE; Top 5; ASR Transcript; NMSU Segmenter; Term Selection; PRISE; BN, NWT (English); Results; Query Vector; Documents to Index; Single Document]

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
4    Combined   TDT          Removed             -                   1
6    Combined   TDT          Removed             Applied             1

[Figure: results on Mandarin Newswire Text]

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
4    Combined   TDT          Removed             -                   1
6    Combined   TDT          Removed             Applied             1

[Figure: results on Mandarin Broadcast News]

Why Document Expansion Works

• Story-length objects provide useful context

• Ranked retrieval finds signal amid the noise

• Selective terms discriminate among documents
  – Enrich index with high IDF terms from top documents

• Similar strategies work well in other applications
  – TREC-7 SDR [Singhal et al., 1998]
  – CLIR query translation [Ballesteros & Croft, 1997]
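The expansion step described on this slide can be sketched as follows: the noisy translated story is used as a query against a comparable English corpus, and high-IDF terms from the top-ranked documents are appended before indexing. A simple TF-IDF ranker stands in for PRISE here, and the function names and the number of added terms are assumptions; the "Top 5" cutoff is the label shown in the Document Expansion diagram.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Inverse document frequency for every term in a tokenized corpus."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def rank(query, corpus, idf):
    """Rank corpus documents against the query (stand-in for PRISE)."""
    query_terms = set(query)
    scored = []
    for index, doc in enumerate(corpus):
        tf = Counter(doc)
        score = sum(tf[t] * idf.get(t, 0.0) for t in query_terms)
        if score > 0:
            scored.append((score, index))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]


def expand(doc, corpus, top_docs=5, new_terms=10):
    """Append the highest-IDF unseen terms from the top-ranked documents."""
    idf = idf_table(corpus)
    hits = rank(doc, corpus, idf)[:top_docs]
    candidates = {t for i in hits for t in corpus[i]} - set(doc)
    best = sorted(candidates, key=lambda t: (-idf[t], t))[:new_terms]
    return list(doc) + best


comparable = [
    ["earthquake", "taiwan", "rescue", "teams"],
    ["earthquake", "taiwan", "aftershock"],
    ["stock", "market", "rally"],
    ["market", "election"],
]
translated_story = ["taiwan", "earthquake", "big"]
expanded = expand(translated_story, comparable, new_terms=2)
```

Only the two earthquake stories match the query, so the added terms come from them and nothing from the unrelated market stories leaks into the index.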

n-best Translation

• We generally used 1-best translation
  – Highest unigram frequency in comparable corpus

• Tried 2-best: two highest-ranked translations
  – Duplicating unique translations where necessary

• Should reduce miss rate
  – But at what cost in false alarms?
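A minimal sketch of n-best selection as described above: candidates are ranked by unigram frequency, and a term with fewer than n distinct translations has its translations repeated so that every source term contributes n tokens. The term list, counts, and function names are hypothetical, and dropping terms with no translation is an assumption.

```python
def n_best(candidates, unigram_count, n=1):
    """Pick the n most frequent translations, repeating when too few exist."""
    ranked = sorted(set(candidates),
                    key=lambda w: (-unigram_count.get(w, 0), w))
    chosen = ranked[:n]
    # Pad by cycling through the ranked list so each term yields n tokens.
    while chosen and len(chosen) < n:
        chosen.append(ranked[len(chosen) % len(ranked)])
    return chosen


def translate(tokens, term_list, unigram_count, n=1):
    """Word-to-word translation of a segmented Mandarin token sequence."""
    output = []
    for token in tokens:
        # Tokens missing from the term list are dropped (assumption).
        output.extend(n_best(term_list.get(token, []), unigram_count, n))
    return output


term_list = {"事件": ["event", "incident"], "总统": ["president"]}
counts = {"incident": 40, "event": 25, "president": 20}

one_best = translate(["事件", "总统"], term_list, counts, n=1)
two_best = translate(["事件", "总统"], term_list, counts, n=2)
```

Here `one_best` keeps only "incident" for the first term, while `two_best` keeps both alternatives and repeats "president", so the term with a single translation is not underweighted relative to the term with two.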

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
4    Combined   TDT          Removed             -                   1
5    Combined   TDT          Removed             -                   2

[Figure: results on Mandarin Newswire Text]

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
4    Combined   TDT          Removed             -                   1
5    Combined   TDT          Removed             -                   2

[Figure: results on Mandarin Broadcast News]

Comparison With Systran

• Used baseline translations provided by LDC
  – Untranslated words not used
  – No document expansion

• Systran produces 1-best translations
  – Natural comparison is with our 2-best run

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
7,7  Systran    -            -                   -                   1
5,5  Combined   TDT          Removed             -                   2

[Figure: results on Mandarin Newswire Text]

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
7,7  Systran    -            -                   -                   1
5,5  Combined   TDT          Removed             -                   2

[Figure: results on Mandarin Broadcast News]

Bilingual Term List Enrichment

• Two sources of candidate translations
  – LDC Chinese-English term list (version 2)
  – CETA (Optilex) dictionary
      • >250K entries, hand-built from >250 sources

• Merging strategy
  – Used only general-purpose sources in CETA
  – Filtered out definitions
  – Removed parenthetical clauses
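The merging strategy above might look like the following sketch. The slides do not say how definitions were detected, so treating any gloss longer than a few words as a definition is an assumption, as are the data layout, the source tags, and the function names.

```python
import re

# Matches a parenthetical clause together with any whitespace before it.
PAREN = re.compile(r"\s*\([^)]*\)")


def clean_gloss(gloss, max_words=3):
    """Strip parentheticals; return None for glosses that look like definitions."""
    gloss = PAREN.sub("", gloss)
    gloss = re.sub(r"\s+", " ", gloss).strip().lower()
    if not gloss or len(gloss.split()) > max_words:
        return None  # Long glosses are treated as definitions (assumption).
    return gloss


def merge_term_lists(ldc, ceta, general_sources, max_words=3):
    """Combine the LDC list with filtered CETA entries.

    ldc:  {headword: [gloss, ...]}
    ceta: {headword: [(gloss, source_tag), ...]}  (hypothetical layout)
    """
    merged = {head: list(glosses) for head, glosses in ldc.items()}
    for head, entries in ceta.items():
        for gloss, source in entries:
            if source not in general_sources:
                continue  # Skip specialized dictionaries.
            cleaned = clean_gloss(gloss, max_words)
            if cleaned is None:
                continue
            bucket = merged.setdefault(head, [])
            if cleaned not in bucket:
                bucket.append(cleaned)
    return merged


ldc = {"事件": ["event"]}
ceta = {
    "事件": [
        ("incident (political)", "general"),
        ("event", "general"),
        ("a thing that happens, especially one of importance", "general"),
    ],
    "总统": [("president", "general")],
    "核": [("nucleus", "physics")],
}
combined = merge_term_lists(ldc, ceta, general_sources={"general"})
```

The result adds "incident" to the existing LDC entry, contributes a new headword from CETA, and leaves out both the long definition and the entry from the specialized source.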

Term List Statistics

Term List  Mandarin Headwords  Mandarin Entries
Combined   195,078             341,187
CETA       91,602              169,067
LDC        127,924             187,130

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
1    LDC        Brown        -                   -                   1
2    Combined   Brown        -                   -                   1

[Figure: two panels, Broadcast News and Newswire Text]

Translation Preference

• Unigram statistics guided translation selection
  – Minimize effect of rare translations, misspellings, …

• Based on dry run stories and rolling update
  – Backoff to balanced corpus for unknown words
      • Brown corpus: variety of genres

• Compared with use of balanced corpus alone
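A sketch of unigram-based translation preference with a rolling update and backoff, under the assumption that side-corpus counts take priority and balanced-corpus counts decide only when the side corpus cannot separate the candidates. The class name, method names, and counts are hypothetical.

```python
from collections import Counter


class UnigramPreference:
    """Prefer translations that are frequent in a comparable side corpus."""

    def __init__(self, backoff_counts):
        self.side = Counter()                    # grows as stories arrive
        self.backoff = Counter(backoff_counts)   # e.g. balanced-corpus counts

    def update(self, story_tokens):
        """Rolling update: fold a newly seen English story into the counts."""
        self.side.update(story_tokens)

    def best(self, candidates):
        """Return the preferred translation among the candidates."""
        # Side-corpus count is compared first; the balanced-corpus count
        # decides when the side counts are equal, including when all are zero.
        # Sorting first makes remaining ties resolve alphabetically.
        return max(sorted(candidates),
                   key=lambda w: (self.side[w], self.backoff[w]))


pref = UnigramPreference({"incident": 5, "event": 9, "affair": 2})
before = pref.best(["incident", "event"])   # decided by backoff counts
pref.update(["incident", "incident", "report"])
after = pref.best(["incident", "event"])    # decided by side-corpus counts
```

Before any stories are seen the balanced-corpus counts pick "event"; once the side corpus has seen "incident", that evidence takes over.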

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
2    Combined   Brown        -                   -                   1
3    Combined   TDT          -                   -                   1

[Figure: results on Mandarin Newswire Text]

Pre-Translation Stopword Removal

• Common words don’t help retrieval much
  – But mistranslations might hurt

• We built a Mandarin stopword list
  – Processed dictionary to identify function words
  – Added the top 300 words in LDC frequency list
  – Filtered by two speakers of Mandarin

• Suppressed translation of stopwords
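The stopword steps above can be sketched as follows. Manual filtering by Mandarin speakers is modeled as a set of vetoed candidates, and the word lists, term list, and function names are illustrative assumptions.

```python
def build_stopwords(function_words, frequency_ranked, vetoed=(), top_n=300):
    """Union of dictionary function words and the most frequent words,
    minus any candidates rejected during manual review."""
    candidates = set(function_words) | set(frequency_ranked[:top_n])
    return candidates - set(vetoed)


def translate_without_stopwords(tokens, stopwords, term_list):
    """1-best word-to-word translation that skips Mandarin stopwords."""
    output = []
    for token in tokens:
        if token in stopwords:
            continue  # Suppress translation entirely.
        output.extend(term_list.get(token, [])[:1])
    return output


stopwords = build_stopwords(
    function_words=["的", "了"],
    frequency_ranked=["的", "是", "中国"],
    vetoed=[],
    top_n=2,
)
term_list = {"中国": ["china"], "的": ["of"], "总统": ["president"]}
english = translate_without_stopwords(["中国", "的", "总统"], stopwords, term_list)
```

Because the stopword is removed before lookup, its possibly misleading translation never reaches the index.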

Run  Term List  Side Corpus  Mandarin Stopwords  Document Expansion  nBest
3    Combined   TDT          -                   -                   1
4    Combined   TDT          Removed             -                   1

[Figure: results on Mandarin Newswire Text]

Summary

• 3 techniques produced improvements:
  – Source-dependent normalization
  – Post-translation document expansion
  – n-best translation

• 3 techniques had little effect:
  – Bilingual term list enrichment
  – Comparable-corpus-based translation preference
  – Pre-translation stopword removal

Future Directions

• Statistical significance
  – Can this be added to the scoring software?

• Pre-translation document expansion
  – An effective approach in CLIR query translation

• Further experiments with n-best translation
  – Probably using a weighted strategy

• Structured translation [Pirkola, 1998]
  – Some concern about efficiency, though
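A simplified sketch of the idea behind structured translation, as commonly described: all translations of one source term are treated as a single synonym class, so term frequency is summed over the alternatives and document frequency counts documents containing any of them. The plain tf × idf weighting and all names below are assumptions for illustration, not the PRISE weighting or a definitive rendering of the cited method.

```python
import math


def structured_tf_df(translations, docs):
    """Per-document tf and overall df for one synonym class.

    translations: alternative translations of a single source term
    docs: list of tokenized documents
    """
    alternatives = set(translations)
    tfs = [sum(1 for token in doc if token in alternatives) for doc in docs]
    df = sum(1 for tf in tfs if tf > 0)
    return tfs, df


def structured_scores(query_groups, docs):
    """Score documents with one weight per source term, not per translation."""
    n = len(docs)
    scores = [0.0] * n
    for alternatives in query_groups:
        tfs, df = structured_tf_df(alternatives, docs)
        if df == 0:
            continue
        idf = math.log((n + 1) / df)
        for index, tf in enumerate(tfs):
            scores[index] += tf * idf
    return scores


docs = [["incident", "report"], ["event", "event"], ["market"]]
tfs, df = structured_tf_df({"incident", "event"}, docs)
scores = structured_scores([{"incident", "event"}], docs)
```

Computing the class-level document frequency requires scanning the postings of every alternative for every source term, which is one way the efficiency concern noted above could arise.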

Where is the Perfect TDT System?

Run TDT-4 in Nova Scotia!

[Figure labels: Maryland, Penn, BBN]