1
Natural Language Processing @ Emory
Eugene Agichtein, Math & Computer Science and CCI
Andrew Post, CCI and Biomedical Engineering (?)
2
Projects in the IR Lab (Agichtein Lab)
– Patterns in Text (Author Behavior)
– Patterns in Search (Searcher Behavior)
– Structuring Information in Bio- and Medical text
– Discovering Implicit Networks: Entity, Relation, and Event Extraction
– Content Creation and Discovery in Social Media
– Understanding Searcher Inference and Decision Process
– Question Answering
3
NLP & Text Mining Projects in IRLab
EMText: Information Extraction from Text in Electronic Medical Records
Other projects:
– Collaborative filtering for Med. Literature
– Recognizing textual entailment (TAC 2008 RTE track)
– Web-scale semantic network extraction
4
Information Extraction From EMR Text
Electronic Medical Records (EMRs) contain important metadata for analysis, data mining, and decision support
– Example: a patient who has had diabetes should have a different interpretation of MPI results; it depends on how long, how severe, and how long since it has been controlled
– This information often resides in the text of the EMR (physician/nurse reports, notes, discharge summaries)
Challenges:
– Access to data
– Inconsistent information
– Little or no manually labeled data
5
I2B2 NLP 2008 Obesity Challenge (SUNY/MIT/Partners Healthcare)
Participated in the I2B2 2008 NLP Obesity Challenge
– The Challenge: to build systems that will correctly replicate the textual and intuitive judgments of the obesity experts on obesity and [15] co-morbidities based on the narrative patient records.
Our approach: machine learning over lexical, semantic, and statistical features
– Words, phrases, UMLS terms in text
– Negation
– Corpus co-occurrence statistics
– SVM, boosting, TBL to combine predictions
Outcome:
– Much room for improvement exists for both accuracy and efficiency; great learning experience
7
User Behavior: The 3rd Dimension of the Web
Amount exceeds web content and structure
– Published: 4Gb/day; Social Media: 10Gb/day
– Page views: 100Gb/day
[Andrew Tomkins, Yahoo! Search, 2007]
8
Web search user behavior: goldmine of noisy data
[Figure: Relative clickthrough for queries with known relevant results in position 1 and 3, respectively. X-axis: Result Position (1, 2, 3, 5, 10). Y-axis: Relative Click Frequency. Series: All queries, PTR=1, PTR=3.]
Higher clickthrough at top non-relevant than at top relevant document
9
Approach: go beyond clickthrough/download counts
Presentation
– ResultPosition: Position of the URL in current ranking
– QueryTitleOverlap: Fraction of query terms in result title
Clickthrough
– DeliberationTime: Seconds between query and first click
– ClickFrequency: Fraction of all clicks landing on page
– ClickDeviation: Deviation from expected click frequency
Browsing
– DwellTime: Result page dwell time
– DwellTimeDeviation: Deviation from expected dwell time for query
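A minimal sketch of how a few of the clickthrough features above could be derived from a click log. The input layout (timestamps in seconds, a list of clicked URLs per query) and the function names are assumptions made for illustration, not the lab's actual pipeline. The deviation is taken as observed minus expected frequency, which is one plausible reading of "deviation from expected".

```python
from collections import Counter


def deliberation_time(query_ts, click_ts_list):
    """Seconds between the query and the first click, or None if there was no click."""
    if not click_ts_list:
        return None
    return min(click_ts_list) - query_ts


def click_frequency(clicked_urls):
    """Fraction of all clicks for a query that landed on each URL."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return {url: n / total for url, n in counts.items()}


def click_deviation(observed_freq, expected_freq):
    """Observed click frequency minus the expected frequency (assumed definition)."""
    return observed_freq - expected_freq
```

The dwell-time features could follow the same pattern, with the expected value estimated per query from the log.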
10
Example results: Predicting User Preferences
[Figure: Precision (0.6 to 0.8) vs. Recall (0 to 0.4) for SA+N, CD, UserBehavior, and Baseline]
• Baseline < SA+N < CD << UserBehavior
• Rich user behavior features result in dramatic improvement
11
User Behavior Complements Content and Web Topology
[Figure: Precision (0.45 to 0.7) at K = 1, 3, 5, 10 for RN, RN+All, BM25, and BM25+All]
Method                     P@1     Gain
RN (Content + Links)       0.632
RN + All (User Behavior)   0.693   0.061 (10%)
BM25                       0.525
BM25+All                   0.687   0.162 (31%)
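The Gain column can be checked with a few lines of arithmetic: the absolute gain is the difference in P@1, and the percentage is that difference relative to the method without user behavior features. The helper name `gain` is hypothetical.

```python
def gain(baseline, improved):
    """Return (absolute gain, relative gain in percent) of improved over baseline."""
    absolute = improved - baseline
    return absolute, 100.0 * absolute / baseline


# P@1 values taken from the table above.
for name, base, new in [("RN", 0.632, 0.693), ("BM25", 0.525, 0.687)]:
    absolute, relative = gain(base, new)
    print(f"{name}: +{absolute:.3f} ({relative:.0f}%)")
```

Rounded to whole percentages, this gives the 10% and 31% reported on the slide.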
12
Instrumenting the Emory Library and Beyond
Evaluate effectiveness of search/discovery with behavioral metrics (task-specific)
– Perform aggregate, longitudinal studies
Develop tools for usability studies "in the wild"
– Scale (hundreds/thousands of "participants")
– Realistic behavior and tasks
– On-demand playback of "interesting" sessions
Unified analysis/query framework for internal and external resource access and usage statistics
– Web-based query and statistics interface
– Access auditing, privacy, anonymity enforced
13
Emory User Behavior Analysis System (EUBA)
EUBA:
– Client-side instrumentation (Firefox toolbar)
– Data mining/machine learning components
– Log DB management system, web-based interface for querying, playback, annotation
Plan: to release the system to research/library community (Q2 2009)?
14
Simple features
Basic Features
– Trajectory length
– Horizontal range
– Vertical range
[Figure: a mouse trajectory annotated with its horizontal range, vertical range, and trajectory length]
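The three basic features can be sketched as follows, assuming a trajectory is recorded as a non-empty list of (x, y) screen coordinates. The function name and the dictionary keys are illustrative choices, not part of EUBA.

```python
import math


def basic_features(points):
    """Compute basic features of a mouse trajectory given as [(x, y), ...]."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Trajectory length: sum of distances between consecutive samples.
    length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    return {
        "trajectory_length": length,
        "horizontal_range": max(xs) - min(xs),
        "vertical_range": max(ys) - min(ys),
    }
```

For example, the path (0, 0) → (3, 4) → (3, 0) has trajectory length 9, horizontal range 3, and vertical range 4.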
15
Intelligent Information Access Lab
http://ir.mathcs.emory.edu/
Mouse Movement Representation Features
Second representation:
– 5 segments: initial, early, middle, late, and end
– Each segment: speed, acceleration, rotation, slope, etc.
[Figure: a mouse trajectory divided into segments 1 to 5]
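A minimal sketch of the segment-based representation, computing only the mean speed of each segment. The slide does not say how the five segments are delimited, so splitting into contiguous chunks with a near-equal number of steps is an assumption, as are the (t, x, y) sample layout and all names. The trajectory is assumed to be non-empty.

```python
import math

SEGMENT_NAMES = ["initial", "early", "middle", "late", "end"]


def split_into_segments(samples, n_segments=5):
    """Split samples into contiguous chunks with a near-equal number of steps.

    Adjacent chunks share their boundary sample so no movement is dropped.
    """
    n_steps = len(samples) - 1
    bounds = [round(i * n_steps / n_segments) for i in range(n_segments + 1)]
    return [samples[a:b + 1] for a, b in zip(bounds, bounds[1:])]


def mean_speed(segment):
    """Path length divided by elapsed time for a list of (t, x, y) samples."""
    dist = sum(math.dist(p[1:], q[1:]) for p, q in zip(segment, segment[1:]))
    elapsed = segment[-1][0] - segment[0][0]
    return dist / elapsed if elapsed > 0 else 0.0


def segment_speeds(samples):
    """Map each segment name to the mean speed within that segment."""
    segments = split_into_segments(samples, len(SEGMENT_NAMES))
    return {name: mean_speed(seg) for name, seg in zip(SEGMENT_NAMES, segments)}
```

Acceleration, rotation, and slope could be added per segment in the same way, each as a function of the samples in one chunk.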
16
Summary of Experimental Results
Client-side behavior mining significantly outperforms aggregate, server-side measures for user intent detection and satisfaction tasks
Can be used even if user does not generate server-trackable action (e.g., click or download)
Feasible to perform inference on search instance vs. aggregating across different users/searchers
17
Outline
Overview of Intelligent Information Access Lab Research
– Information retrieval & extraction, text mining, and data integration
– User behavior modeling, interactions, and collaborative filtering
Mining User-generated content
Current and Future Collaborations
20
Some goals of mining social media
Find high-quality content
Find relevant and high quality content
Use millions of interactions to
– Understand complex information needs
– Model subjective information seeking
– Understand cultural dynamics
36
Lifecycle of a Question
[Flowchart: a user chooses a category, composes the question, and opens it. Other users post answers, which receive positive and negative votes. The asker examines the answers: "Find the answer?" If yes, the asker closes the question, chooses the best answers, and gives ratings. If no, the question is closed by the system and the best answer is chosen by voters.]
37
Yahoo! Answers: The Good News
Active community of millions of users in many countries and languages
Accumulated a great number of questions and answers
Effective for subjective information needs
– Great forum for socialization/chat
(Can be) invaluable for hard-to-find information not available on web
39
Yahoo! Answers: The Bad News
May have to wait a long time to get a satisfactory answer
May never obtain a satisfying answer
[Figure: Time to close a question (hours) for sample question categories. Y-axis: time to close, 0 to 40 hours. X-axis: categories 1 to 10, listed below.]
1. 2006 FIFA World Cup
2. Optical
3. Poetry
4. Football (American)
5. Scottish Football (Soccer)
6. Medicine
7. Winter Sports
8. Special Education
9. General Health Care
10. Outdoor Recreation
40
The Problem of Asker Satisfaction
Given a question submitted by an asker in CQA, predict whether the user will be satisfied with the answers contributed by the community.
– Where "Satisfied" is defined as:
  The asker personally has closed the question AND
  Selected the best answer AND
  Provided a rating of at least 3 "stars" for the best answer
– Otherwise, the asker is "Unsatisfied"
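The definition above translates directly into a labeling rule. The sketch below assumes each question record carries three fields; the record type and its field names are hypothetical and do not reflect the Yahoo! Answers data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Question:
    closed_by_asker: bool               # the asker personally closed the question
    best_answer_chosen_by_asker: bool   # the asker (not voters) selected the best answer
    best_answer_rating: Optional[int]   # stars given to the best answer, if any


def is_satisfied(q: Question) -> bool:
    """Apply the three-part "Satisfied" definition; anything else is "Unsatisfied"."""
    return (
        q.closed_by_asker
        and q.best_answer_chosen_by_asker
        and q.best_answer_rating is not None
        and q.best_answer_rating >= 3
    )
```

A question closed by the system, or one whose best answer received fewer than 3 stars, is labeled unsatisfied under this rule.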
41
Satisfaction Prediction Framework
Approach: Classification algorithms from machine learning
[Diagram: feature groups (Question Features, Answer Features, Asker History Features, Answerer History Features, Category Features, Textual Features) feed a classifier (Support Vector Machines, Decision Tree, Boosting, Naïve Bayes), which outputs "asker is satisfied" or "asker is not satisfied"]
42
Question-Answer Features
Q: length, posting time…
QA: length, KL divergence
Q: Votes
Q: Terms
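The slide does not spell out how the KL divergence feature is computed. One possible reading, sketched below, compares the term distribution of the question with that of an answer. Whitespace tokenization, add-one smoothing, and the direction KL(question || answer) are all assumptions of this sketch.

```python
import math
from collections import Counter


def kl_divergence(question_text, answer_text):
    """KL(P_question || P_answer) over lowercase whitespace tokens.

    Add-one smoothing over the joint vocabulary keeps every probability non-zero.
    """
    q_counts = Counter(question_text.lower().split())
    a_counts = Counter(answer_text.lower().split())
    vocab = set(q_counts) | set(a_counts)
    q_total = sum(q_counts.values()) + len(vocab)
    a_total = sum(a_counts.values()) + len(vocab)
    kl = 0.0
    for term in vocab:
        p = (q_counts[term] + 1) / q_total
        q = (a_counts[term] + 1) / a_total
        kl += p * math.log(p / q)
    return kl
```

Identical texts give a divergence of 0, and the value grows as the answer's vocabulary drifts away from the question's.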
44
Category Features
CA: Average time to close a question
CA: Average # answers per question
CA: Average asker rating
CA: Average voter rating
CA: Average # questions per hour
CA: Average # answers per hour
Category         #Q    #A    #A per Q   Satisfied   Avg asker rating   Time to close by asker
General Health   134   737   5.46       70.4%       4.49               1 day and 13 hours
45
Classification Algorithms
Weka implementation
– http://www.cs.waikato.ac.nz/ml/weka
Decision Tree
– C4.5: confidence factor 0.05. Ross Quinlan (1993)
– RandomForest: Leo Breiman (2001)
Support Vector Machine: J. Platt (1999)
Boosting (AdaBoost): Yoav Freund, Robert E. Schapire (1996)
Naïve Bayes: George H. John, Pat Langley (1995)
46
Methods
Heuristic: # answers
Baseline: Simply predicts the majority class (satisfied).
ASP_SVM: Our system with the SVM classifier
ASP_C4.5: with the C4.5 classifier
ASP_RandomForest: with the RandomForest classifier
ASP_Boosting: with the AdaBoost algorithm combining weak learners
ASP_NaiveBayes: with the Naive Bayes classifier
47
Evaluation metrics
Precision
– The fraction of the predicted satisfied asker information needs that were indeed rated satisfactory by the asker.
Recall
– The fraction of all rated satisfied questions that were correctly identified by the system.
F-score
– The harmonic mean of the Precision and Recall measures,
– Computed as 2*(precision*recall)/(precision+recall)
Accuracy
– The overall fraction of instances classified correctly into the proper class.
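The four metrics can be computed from paired lists of true and predicted labels, with "satisfied" as the positive class. This is a minimal sketch; the function name and the label strings are illustrative, and the experiments themselves used Weka.

```python
def evaluate(y_true, y_pred, positive="satisfied"):
    """Return precision, recall, F-score, and accuracy for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-score as defined on the slide: 2*(precision*recall)/(precision+recall).
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}
```

With 2 true positives, 1 false positive, and 1 false negative out of 5 questions, precision, recall, and F-score are all 2/3 and accuracy is 3/5.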
48
Dataset
Crawled from Yahoo! Answers in early 2008
Data is available at http://ir.mathcs.emory.edu/
Questions   Answers     Askers    Categories   % Satisfied
216,170     1,963,615   158,515   100          50.7%
49
Dataset (cont.)
Realistic prediction task: given an asker's previous history, we try to predict satisfaction with her current (most recent) question
[Diagram labels: 216,170 questions, 1,963,615 answers, 158,515 askers, 100 categories; most recent 10,000 questions; random 5000 questions; training; test; randomize]
50
Dataset Statistics
Category                  #Q     #A      #A per Q   Satisfied   Avg asker rating   Time to close by asker
2006 FIFA World Cup(TM)   1194   35659   329.86     55.4%       2.63               47 minutes
Mental Health             151    1159    7.68       70.9%       4.30               1 day and 13 hours
Mathematics               651    2329    3.58       44.5%       4.48               33 minutes
Diet & Fitness            450    2436    5.41       68.4%       4.30               1.5 days
Asker satisfaction varies significantly across different categories.
#Q, #A, Time to close… -> Asker Satisfaction
51
Human Satisfaction Prediction
Truth: asker's rating
A random sample of 130 questions
Annotated by researchers to calibrate the asker satisfaction
– Agreement: 0.82
– F1: 0.45
52
Human Satisfaction Prediction (Cont'd): Amazon Mechanical Turk
A service provided by Amazon. Workers submit responses to a Human Intelligence Task (HIT) for a small fee
HIT:
– Used the same 130 questions
– For each question, list the best answer, as well as four other answers ordered by votes
– Five independent raters for each question.
– Agreement: 0.9; F1: 0.61.
– Best accuracy achieved when at least 4 out of 5 raters predicted asker to be "satisfied" (otherwise, labeled as "unsatisfied").
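The aggregation rule in the last bullet is a thresholded vote over the five raters. A minimal sketch, with a hypothetical function name and label strings:

```python
def aggregate_label(votes, min_satisfied=4):
    """Label a question "satisfied" only if at least min_satisfied raters said so."""
    n_satisfied = sum(v == "satisfied" for v in votes)
    return "satisfied" if n_satisfied >= min_satisfied else "unsatisfied"
```

Note that this is stricter than a simple majority: three "satisfied" votes out of five still yield "unsatisfied".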
54
Comparison of Classifiers (F-score)
Classifier         With Text   Without Text   Selected Features
ASP_SVM            0.69        0.72           0.62
ASP_C4.5           0.75        0.76           0.77
ASP_RandomForest   0.70        0.74           0.68
ASP_Boosting       0.67        0.67           0.67
ASP_NB             0.61        0.65           0.58
Human              0.61
Baseline           0.66
C4.5 is the most effective classifier in this task
Human F1 performance is lower than the naïve baseline!
55
F1 (Satisfied) with varying training sizes
ASP_C4.5 substantially outperforms others
2000 questions is sufficient to achieve 0.75 F1
56
Features by Information Gain (Satisfied)
0.14219 Q: Askers' previous rating
0.13965 Q: Average past rating by asker
0.10237 UH: Member since (interval)
0.04878 UH: Average # answers for past Q
0.04878 UH: Previous Q resolved for the asker
0.04381 CA: Average asker rating for the category
0.04306 UH: Total number of answers received
0.03274 CA: Average voter rating
0.03159 Q: Question posting time
0.02840 CA: Average # answers per Q
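Information gain scores like those above measure how much knowing a feature reduces uncertainty about the satisfied/unsatisfied label. The sketch below computes it for a discrete feature. The base-2 logarithm is an assumption, the function names are illustrative, and continuous features such as ratings would first need to be discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that perfectly separates a balanced two-class sample has a gain of 1 bit; a feature independent of the label has a gain of 0.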
57
"Offline" vs. "Online" Prediction
Offline prediction:
– All features (question, answer, asker & category)
– F1: 0.77
Online prediction:
– all answer features
– question features (stars, #comments, sum of votes…)
– F1: 0.74
58
Feature Ablation
                              Precision   Recall   F1
Selected features             0.80        0.73     0.77
No question-answer features   0.76        0.74     0.75
No answerer features          0.76        0.75     0.75
No category features          0.75        0.76     0.75
No asker features             0.72        0.69     0.71
No question features          0.68        0.72     0.70
Asker & Question features are most important.
Answer quality/Answerer expertise/Category characteristics:
– may not be important
– caring or supportive answers might be preferred sometimes
59
Satisfaction with varying experience
Group together questions from askers with the same number of previous questions
Accuracy of prediction increases dramatically
Reaching F1 of 0.9 for askers with >= 5 questions
60
Summary
Asker satisfaction is predictable
– Can achieve higher than human accuracy by exploiting history
User's experience is important
General model: one-size-fits-all
– 2000 questions for training model are enough
Current work
– Personalized satisfaction prediction
– Y. Liu, E. Agichtein. You've Got Answers: Towards Personalized Models for Predicting Success in Community Question Answering (ACL 2008)
61
ACL08
Textual features only become helpful for users with more than 20 questions
Personalized classifier achieves surprisingly good accuracy
For users with only 1 previous question, personalized classifiers work very well
Simple strategy of grouping users by number of previous questions is even more effective than other methods for users with a moderate amount of history
For users with few questions, non-textual features are dominant
For users with lots of questions, textual features are more significant
Other tasks
Subjectivity, sentiment analysis
– B. Li, Y. Liu, and E. Agichtein, CoCQA: Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation, in EMNLP 2008
Discourse analysis
Cross-cultural comparisons
CQA vs. web search comparison
65
Outline
Overview of Intelligent Information Access Lab Research
– Information retrieval & extraction, text mining, and data integration
– User behavior modeling, interactions, and collaborative filtering
Mining User-generated content
Current and Future Research