UMass Amherst at TDT 2003
James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts Amherst
What we did
• Tasks
  • Story Link Detection
  • Topic Tracking
  • New Event Detection
  • Cluster Detection
• Dictionary translation of Arabic stories
• Native language comparisons
• Adaptive tracking
• Relevance models
ROI motivation
• Analyzed vector space similarity measures
  • Failed to distinguish between similar topics
  • e.g. two “health care” stories from different topics
    • Different locations and individuals
    • Similarity dominated by “health care” terms: drugs, cost, coverage, plan, prescription
• Possible solution: first categorize stories
  • Different category implies different topics (mostly true)
  • Use within-category statistics
    • “health care” may be less confusing
• Rules of Interpretation (ROI) provide natural categories
ROI intuition
• Each document in the corpus is classified into one of the ROI categories.
• Stories in different ROIs are less likely to be in the same topic.
• If two stories belong to different ROIs, we should trust their similarities less.
ROI tagged corpus
[Figure: story pairs in an ROI tagged corpus]
• Same ROI: sim_new(s1, s2) = sim_old(s1, s2)
• Different ROIs: sim_new(s1, s2) < sim_old(s1, s2)
ROI classifiers
• Naïve Bayes
• BoosTexter [Schapire and Singer, 2000]
  • Decision tree classifier
  • Generates and combines simple rules
  • Features are terms with tf as weights
• Used most likely single class
  • Explored using the distribution over all classes
  • Unable to do so successfully
Training Data for Classification
• Experiments: train on TDT-2, test on TDT-3
• Submissions: train on TDT-2 plus TDT-3
• Training data prepared the same way
  • Stories in each topic tagged with topic’s ROI
  • Remove duplicate stories (in topics with the same ROI)
  • Remove all stories with more than one ROI
    • Worst case: a single story relevant to…
      • Chinese Labor Activists, with ROI Legal/Criminal Cases
      • Blair Visits China in October, with ROI Political/Diplomatic Mtgs.
      • China will not allow Opposition Parties, with ROI Miscellaneous
• Experiments with removing named entities for training
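The three preparation steps above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary-based inputs, and the toy topic and story identifiers are hypothetical and do not come from the TDT corpora or the UMass system.

```python
from collections import defaultdict


def prepare_training_data(topic_roi, topic_stories):
    """Return {story_id: roi} for stories that carry exactly one ROI.

    topic_roi:     {topic_id: roi_label}            (hypothetical input format)
    topic_stories: {topic_id: iterable of story ids}
    """
    rois_per_story = defaultdict(set)
    for topic, stories in topic_stories.items():
        for story in stories:
            # Tag the story with its topic's ROI. Using a set collapses
            # duplicates that come from topics sharing the same ROI.
            rois_per_story[story].add(topic_roi[topic])
    # Remove every story that ended up with more than one ROI.
    return {
        story: next(iter(rois))
        for story, rois in rois_per_story.items()
        if len(rois) == 1
    }


# Toy data: story "b" appears in topics with two different ROIs.
topic_roi = {
    "t1": "Legal/Criminal Cases",
    "t2": "Political/Diplomatic Mtgs.",
    "t3": "Legal/Criminal Cases",
}
topic_stories = {"t1": ["a", "b"], "t2": ["b", "c"], "t3": ["a"]}
training = prepare_training_data(topic_roi, topic_stories)
```

In this toy run, story "a" is kept once even though it appears in two topics (both share one ROI), while story "b" is dropped because it carries two ROIs.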
Naïve Bayes vs. BoosTexter
• Similar classification accuracy
  • Overall accuracy is the same
  • Errors are substantially different
• Our training results (TDT-3)
  • BoosTexter beat Naïve Bayes for SLD and NED
  • BoosTexter used in most tasks for submission
• Evaluation results:
  • In Link Detection, Naïve Bayes was more useful
ROI classes in link detection
• Given story pair and their estimated ROIs
  • If estimated ROIs are the same, leave score alone
  • If they are different, reduce score
    • Reduced to 1/3 of original value, based on training runs
• Used four different ROI classifiers
  • ROI-BT, ne: BoosTexter with named entities
  • ROI-BT, no-ne: BoosTexter without named entities
  • ROI-NB, ne: Naïve Bayes with named entities
  • ROI-NB, no-ne: Naïve Bayes without named entities
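The score adjustment described on this slide can be sketched in a few lines. The 1/3 factor is taken from the slide; the function name, argument names, and example labels are assumptions made for illustration.

```python
# From the slide: a mismatched pair is reduced to 1/3 of its original value.
ROI_MISMATCH_FACTOR = 1.0 / 3.0


def roi_adjusted_score(sim_old, roi_1, roi_2, factor=ROI_MISMATCH_FACTOR):
    """Leave the score alone when the estimated ROIs agree, else scale it down."""
    if roi_1 == roi_2:
        return sim_old
    return sim_old * factor


same = roi_adjusted_score(0.6, "Elections", "Elections")
different = roi_adjusted_score(0.6, "Elections", "Accidents")
```

Here `same` keeps the original similarity and `different` is scaled by the mismatch factor, so a pair must be considerably more similar to be linked when its estimated ROIs disagree.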
Training effectiveness (TDT-3)
Story Link Detection: minimum normalized cost for various types of databases

                 1Dcos    4Dcos    UDcos
original         0.3536   0.2556   0.3254
ROI-BT, ne       0.2959   0.2360   0.2748
ROI-BT, no-ne    0.4600   0.3670   0.4246
ROI-NB, ne       0.3724   0.3047   0.3380
ROI-NB, no-ne    0.4072   0.3269   0.3718
Evaluation results
Story Link Detection: minimum normalized cost for various types of databases

                 1Dcos    4Dcos    UDcos
original         0.2472   0.1983   0.2439
ROI-BT, ne       0.3090   0.2587   0.2938
ROI-BT, no-ne    0.3220   0.2649   0.3020
ROI-NB, ne       0.2867   0.2407   0.2697
ROI-NB, no-ne    0.2937   0.2463   0.2738
ROI for tracking
• Compare story to centroid of topic
  • Built from training stories
• If ROI does not match, drop score based on how
Comparing multilingual stories
• Baseline
  • All stories converted to English
  • Using provided machine translations
• New approaches
  • Dictionary translation of Arabic stories
  • Native language comparisons
  • Adaptation in tracking
Dictionary Translation of Arabic
• Probabilistic translation model
  • Each Arabic word has multiple English translations
  • Obtain P(e|a) from UN Arabic-English parallel corpus
• Forms a pseudo-story in English representing the Arabic story
  • Can get large due to multiple translations per word
  • Keep English words whose summed probabilities are the greatest
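A minimal sketch of building such a pseudo-story, assuming the translation model is available as a nested dictionary of P(e|a) values. The function name, the `max_terms` cutoff, the placeholder tokens `a1` and `a2`, and the probabilities are all hypothetical; the slides do not state how many English words were kept.

```python
from collections import Counter


def pseudo_story(arabic_tokens, p_e_given_a, max_terms=2):
    """Build an English pseudo-story from Arabic tokens.

    p_e_given_a: {arabic_word: {english_word: P(e|a)}}  (assumed format)
    max_terms:   hypothetical cutoff on the number of English words kept
    """
    weights = Counter()
    for a in arabic_tokens:
        # Every candidate translation contributes its probability mass.
        for e, p in p_e_given_a.get(a, {}).items():
            weights[e] += p
    # Keep only the English words with the greatest summed probabilities.
    return dict(weights.most_common(max_terms))


# Toy model with placeholder Arabic tokens (not real dictionary entries).
model = {
    "a1": {"president": 0.7, "chief": 0.3},
    "a2": {"president": 0.2, "head": 0.8},
}
story = pseudo_story(["a1", "a2"], model)
```

Summing over all occurrences means an English word that several Arabic words translate to accumulates weight, which is what lets the cutoff discard unlikely translations.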
Language specific comparisons
• Language representations:
  • Arabic: CP1256 encoding and light stemming
  • English: stopped and stemmed with kstem
  • Chinese: segmented if necessary and overlapping bigrams
• Linking Task:
  • If stories in same language, use that language
  • All other comparisons done using all stories translated into English
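Two pieces of this slide lend themselves to a short sketch: overlapping character bigrams for Chinese, and the rule for choosing the language in which a story pair is compared. Function names and the language labels are assumptions for illustration; stemming, stopping, and segmentation are not shown.

```python
def overlapping_bigrams(text):
    """Return overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def comparison_language(lang_1, lang_2):
    """Compare in the native language only when both stories share it."""
    return lang_1 if lang_1 == lang_2 else "english"


bigrams = overlapping_bigrams("北京大学")
native = comparison_language("arabic", "arabic")
fallback = comparison_language("arabic", "mandarin")
```

A four-character string yields three bigrams, each sharing one character with its neighbor, so no word boundary has to be decided before indexing.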
Adaptation in tracking
• Adaptation
  • Stories added to topic when high similarity score
• Establish topic representation in each language as soon as added story in that language appears
• Similarity of Arabic story compared to Arabic topic representation, etc.
Cross-Lingual Link Detection Results

Translation    Minimum Cost     Minimum Cost     Cost
Condition      TDT-3            TDT-4            TDT-4
1DcosIDF       0.3536           0.2472           0.2523
UDcosIDF       0.3254 (-8%)     0.2439 (-1%)     0.2597
4DcosIDF       0.2556 (-28%)    0.1983 (-20%)    0.2000
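The percentages in parentheses appear to be relative changes in minimum cost against the 1DcosIDF baseline. The short check below reproduces them from the table values; the function name is hypothetical and the interpretation is inferred from the numbers rather than stated on the slide.

```python
def relative_change_percent(cost, baseline):
    """Relative change of a cost against the baseline, rounded to whole percent."""
    return round((cost - baseline) / baseline * 100)


# Minimum costs copied from the table (baseline is 1DcosIDF).
tdt3 = {"1D": 0.3536, "UD": 0.3254, "4D": 0.2556}
tdt4 = {"1D": 0.2472, "UD": 0.2439, "4D": 0.1983}

changes = {
    ("TDT-3", "UD"): relative_change_percent(tdt3["UD"], tdt3["1D"]),
    ("TDT-3", "4D"): relative_change_percent(tdt3["4D"], tdt3["1D"]),
    ("TDT-4", "UD"): relative_change_percent(tdt4["UD"], tdt4["1D"]),
    ("TDT-4", "4D"): relative_change_percent(tdt4["4D"], tdt4["1D"]),
}
```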
Translation Conditions:
• 1DcosIDF: baseline; all stories in English, using the provided translations.
• UDcosIDF: all stories in English, but using dictionary translation of Arabic.
• 4DcosIDF: a pair of stories is compared in the native language if both stories are in the same language; otherwise they are compared in English, using the dictionary translation of Arabic.
• ADcosIDF: baseline plus adaptation. A story is added to the centroid vector if its similarity score exceeds the adapting threshold; the vector is limited to its top 100 terms, and at most 100 stories can be added to the centroid.
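The adaptation condition can be sketched as a small class. The two limits of 100 come from the slide; everything else is an assumption: the class and method names, the use of plain cosine over term-weight dictionaries, combining vectors by summing term weights, and the threshold value in the example.

```python
import math
from collections import Counter

MAX_TERMS = 100   # from the slide: vector limited to top 100 terms
MAX_ADDED = 100   # from the slide: at most 100 stories added


def cosine(u, v):
    """Cosine similarity between two {term: weight} mappings."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class AdaptiveTopic:
    """Hypothetical topic centroid that absorbs high-scoring stories."""

    def __init__(self, seed_vector, adapt_threshold):
        self.centroid = Counter(seed_vector)
        self.adapt_threshold = adapt_threshold
        self.added = 0

    def score_and_adapt(self, story_vector):
        score = cosine(self.centroid, story_vector)
        if score > self.adapt_threshold and self.added < MAX_ADDED:
            # Assumption: combine by summing term weights.
            self.centroid.update(story_vector)
            # Truncate the centroid to its highest-weighted terms.
            self.centroid = Counter(dict(self.centroid.most_common(MAX_TERMS)))
            self.added += 1
        return score


topic = AdaptiveTopic({"health": 1.0, "care": 1.0}, adapt_threshold=0.5)
on_topic = topic.score_and_adapt({"health": 1.0, "care": 1.0, "drug": 1.0})
off_topic = topic.score_and_adapt({"election": 1.0})
```

In this toy run the first story scores above the threshold and contributes the term "drug" to the centroid, while the second story scores zero and leaves the centroid unchanged.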