Introduction to Information Retrieval & Web Search Kevin Duh Johns Hopkins University June 2019
Acknowledgments

These slides draw heavily from these excellent sources:
• Paul McNamee's JSALT 2018 tutorial:
  – https://www.clsp.jhu.edu/wp-content/uploads/sites/75/2018/06/2018-06-19-McNamee-JSALT-IR-Soup-to-Nuts.pdf
• Doug Oard's Information Retrieval Systems course at UMD
  – http://users.umiacs.umd.edu/~oard/teaching/734/spring18/
• Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge U. Press, 2008.
  – https://nlp.stanford.edu/IR-book/information-retrieval-book.html
• W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2009.
  – http://ciir.cs.umass.edu/irbook/
"I never waste memory on things that can easily be stored and retrieved from elsewhere." -- Albert Einstein

Image source: Einstein 1921 by F. Schmutzer, https://en.wikipedia.org/wiki/Albert_Einstein#/media/File:Einstein_1921_by_F_Schmutzer_-_restoration.jpg
What is Information Retrieval (IR)?

1. "Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, & retrieval of information." (Gerard Salton, IR pioneer, 1968)
2. Information retrieval focuses on the efficient recall of information that satisfies a user's information need.

QUERY: NullPointerException randomize() FastMath
INFO NEED: I need to understand why I'm getting a NullPointerException when calling randomize() in the FastMath library
→ Web documents that may be relevant
Information Hierarchy

• Data: raw material of information
• Information: data organized & presented in context
• Knowledge: information that can be acted upon
• Wisdom
(more refined and abstract as you move up the hierarchy)

From Doug Oard's slides: http://users.umiacs.umd.edu/~oard/teaching/734/spring18/
Databases vs. IR

                        | Database                                  | IR
What we're retrieving   | Structured data. Clear semantics based    | Unstructured data. Free text with
                        | on a formal model.                        | metadata. Videos, images, music.
Queries we're posing    | Unambiguous, formally defined queries.    | Vague, imprecise queries.
Results we get          | Exact. Always correct in a formal sense.  | Sometimes relevant, sometimes not.

From Doug Oard's slides: http://users.umiacs.umd.edu/~oard/teaching/734/spring18/

Note: From a user's perspective, the distinction may be seamless, e.g. asking Siri a question about nearby restaurants with good reviews.
IR System

User with Information Need → Query → Representation Function → Query Representation
Documents → Representation Function → Document Representation → INDEX
Scoring Function (compares the query representation against document representations in the index) → Returned Hits

Lecture outline:
(1) Indexing
(2) Query Processing
(3) Scoring
(4) Evaluation
(5) Web Search: additional challenges
Index vs. Grep

• Say we have a collection of Shakespeare plays, and we want to find all plays that contain:
• Grep: start at the 1st play, read everything, and filter out anything that doesn't match the criteria (a linear scan over ~1M words)
• Index (a.k.a. Inverted Index): build an index data structure off-line; quick lookup at query time.
These examples/figures are from: Manning, Raghavan, Schütze, Introduction to Information Retrieval, CUP, 2008.
QUERY: Brutus AND Caesar AND NOT Calpurnia

The Shakespeare collection as a Term-Document Incidence Matrix
Matrix element (t, d) is 1 if term t occurs in document d, and 0 otherwise.

Answer: "Antony and Cleopatra" (d = 1) and "Hamlet" (d = 4)
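This query can be answered by bitwise operations on the rows of the incidence matrix. A minimal sketch, with incidence vectors as in the book's figure (only the three query terms shown; the bit-numbering convention here is my own):

```python
# Boolean retrieval over a term-document incidence matrix.
# Each term maps to a bit-vector: reading left to right, bit d is 1
# iff the term occurs in play d.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    0b110100,   # occurs in plays 0, 1, 3
    "Caesar":    0b110111,   # occurs in all plays except The Tempest
    "Calpurnia": 0b010000,   # occurs only in Julius Caesar
}

ALL = (1 << len(plays)) - 1   # mask covering all six plays

# Brutus AND Caesar AND NOT Calpurnia = bitwise AND with a complement
hits = incidence["Brutus"] & incidence["Caesar"] & (ALL & ~incidence["Calpurnia"])

# Decode: the leftmost bit corresponds to play 0
answer = [plays[d] for d in range(len(plays))
          if hits & (1 << (len(plays) - 1 - d))]
print(answer)  # ['Antony and Cleopatra', 'Hamlet']
```

This works for a toy collection, but the matrix is far too sparse to materialize at web scale, which is exactly what motivates the inverted index below.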
Inverted Index Data Structure

Each term (t) maps to a postings list of document ids (d), e.g. "Brutus" occurs in d = 1, 2, 4 ... Importantly, each postings list is a sorted list.
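Building such an index is a single pass over the collection. A minimal sketch (whitespace tokenization only; the doc ids and texts are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted postings list of doc ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # toy tokenizer
            index[term].add(doc_id)
    # Postings are kept sorted by doc id -- this is what makes the
    # linear-merge intersection algorithm possible.
    return {t: sorted(ids) for t, ids in index.items()}

docs = {1: "Brutus killed Caesar",
        2: "Brutus and Calpurnia",
        4: "noble Brutus Caesar"}
index = build_inverted_index(docs)
print(index["brutus"])  # [1, 2, 4]
```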
Efficient algorithm for List Intersection (for Boolean conjunctive "AND" operators)

QUERY: Brutus AND Calpurnia

Walk two pointers (p1, p2) down the two sorted postings lists, always advancing the pointer that sits on the smaller document id.
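The two-pointer merge can be sketched as follows (postings taken from the book's Brutus/Calpurnia figure):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # doc id is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer on the smaller id
        else:
            j += 1
    return answer

# Brutus AND Calpurnia
print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]
```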
Time and Space Tradeoffs

• Time complexity at query time:
  – Linear scan over postings: O(L1 + L2), where Lt is the length of the postings list for term t
  – vs. grep through all documents: O(N), with L << N
• Time complexity at index time:
  – O(N) for one pass through the collection
  – Additional issue: efficiently adding/deleting documents
• Space complexity (example setup):
  – Dictionary: hash table or trie in RAM
  – Postings: arrays on disk
Quiz: How would you process these queries? Which terms do you intersect first?
Think: What terms to process first? How to handle OR and NOT?

QUERY: Brutus AND Caesar AND Calpurnia
QUERY: Brutus AND (Caesar OR Calpurnia)
QUERY: Brutus AND Caesar AND NOT Calpurnia
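One standard answer (following the IR book): for a conjunction, intersect the rarest terms first so intermediate results stay small; OR is a union (merge) of postings, and AND NOT filters out documents appearing in the negated term's postings. A sketch of the conjunctive case, with postings borrowed from the earlier figures:

```python
def conjunctive_query(index, terms):
    """AND together several terms, intersecting the shortest (rarest)
    postings lists first so intermediate results stay as small as possible."""
    postings = sorted((index[t] for t in terms if t in index), key=len)
    if len(postings) < len(terms):
        return []                  # some term occurs nowhere: empty result
    result = postings[0]
    for p in postings[1:]:
        s = set(p)
        result = [d for d in result if d in s]
        if not result:
            break                  # early exit once the intersection is empty
    return result

index = {
    "brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
# Calpurnia has the shortest postings list, so it is processed first
print(conjunctive_query(index, ["brutus", "caesar", "calpurnia"]))  # [2]
```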
Optional meta-data in the inverted index

• Skip pointers: for faster intersection, at the cost of extra space. A skip pointer lets the intersection jump several postings ahead when the skipped-over entries cannot possibly match.
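A sketch of intersection with skips. Real indexes store the skip pointers in the postings list itself; here we simply assume sqrt(L)-spaced skips and compute the jump targets on the fly:

```python
import math

def intersect_with_skips(p1, p2):
    """Postings intersection where each pointer may jump a whole skip
    ahead when the skip target does not overshoot the other pointer."""
    s1 = max(1, int(math.sqrt(len(p1))))   # skip spacing ~ sqrt(L)
    s2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1            # take the skip: everything skipped is smaller
            else:
                i += 1             # ordinary advance
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]
```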
Optional meta-data in the inverted index

• Position of each term occurrence in the document: enables phrasal queries

QUERY: "to be or not to be"

Each entry stores the term (t) with its document frequency, and positional postings; e.g. a term occurs in document d = 4 with term frequency 5, at positions 17, 191, 291, 430, 434.
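A positional index and phrase matching can be sketched like this (toy tokenizer, made-up documents): a phrase matches when the k-th phrase word occurs at offset +k from some start position of the first word.

```python
from collections import defaultdict

def build_positional_index(docs):
    """{term: {doc_id: [positions]}} -- postings record where each term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return doc ids containing the exact phrase."""
    terms = phrase.lower().split()
    if any(t not in index for t in terms):
        return []
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])        # must contain every term at all
    hits = []
    for d in sorted(candidates):
        starts = set(index[terms[0]][d])
        for k, t in enumerate(terms[1:], start=1):
            positions = set(index[t][d])
            starts = {p for p in starts if p + k in positions}
        if starts:                         # some start position survived
            hits.append(d)
    return hits

docs = {1: "to be or not to be", 2: "not to be or to be"}
index = build_positional_index(docs)
print(phrase_query(index, "to be or not to be"))  # [1]
```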
Index construction and management

• Dynamic indexes
  – e.g. searching Twitter vs. a static document collection
• Distributed solutions
  – MapReduce, Hadoop, etc.
  – Fault tolerance
• Pre-computing components of the score function
→ Many interesting technical challenges!
[IR system diagram, repeated]
(1) Indexing: we covered this. Next up: the representation function.
Representing a Document as a Bag-of-Words (but what words?)

The QUICK, brown foxes jumped over the lazy dog!
↓ Tokenization
The / QUICK / , / brown / foxes / jumped / over / the / lazy / dog / !
↓ Stop word removal, stemming, normalization
quick / brown / fox / jump / over / lazi / dog
↓ Index
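The pipeline above can be sketched as below. The stemmer here is deliberately naive (a real system would use, e.g., the Porter stemmer, whose rules produce the slide's "lazi"; this crude version leaves "lazy" untouched), and the stop word list is a tiny illustrative one:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and"}   # tiny illustrative list

def naive_stem(term):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("es", "ed", "ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def analyze(text):
    # tokenize + case-normalize, then remove stop words and stem
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(analyze("The QUICK, brown foxes jumped over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jump', 'over', 'lazy', 'dog']
```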
Issues in Document Representation

• Language-specific challenges
• Polysemy & synonyms:
  – "bank" has multiple senses; should they be represented the same?
  – should "jet" and "airplane" be treated as the same term?
• Acronyms, numbers, document structure
• Morphology
  – Central Siberian Yupik morphology example from E. Chen & L. Schwartz, LREC 2018: http://dowobeha.github.io/papers/lrec18.pdf
[IR system diagram, repeated]
(2) Query Processing
Query Representation

• Of course, the query string must go through the same tokenization, stop word removal, and normalization process as the documents.
• But we can do more, especially for free-text queries, to guess the user's intent & information need.
Keyword search vs. Conceptual search

• Keyword search / Boolean retrieval:
  – the answer is exact; it must satisfy these terms
  BOOLEAN QUERY: Brutus AND Caesar AND NOT Calpurnia
• Conceptual search (or just "search", as in Google):
  – the answer may not need to exactly match these terms
  – note this naming may not be standard
  FREE-TEXT QUERY: Brutus assassinate Caesar reasons
Query Expansion for "conceptual" search

• Add terms to the query representation
  – Exploit knowledge bases, WordNet, user query logs

ORIGINAL FREE-TEXT QUERY: Brutus assassinate Caesar reasons
EXPANDED QUERY: Brutus assassinate kill Caesar reasons why
Pseudo-Relevance Feedback

• Query expansion by iterative search:
  1. Run the original query through the IR system → Returned Hits v1
  2. Add words extracted from these hits to the query
  3. Run the expanded query through the IR system → Returned Hits v2

ORIGINAL QUERY: Brutus assassinate Caesar reasons
EXPANDED QUERY: Brutus assassinate Caesar reasons + Ides of March
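One round of pseudo-relevance feedback can be sketched as below. The `toy_search` function and its hit texts are made up to mimic the slide's example; any IR system returning ranked document texts could stand in for it:

```python
from collections import Counter

STOP = {"the", "of", "on", "and", "was", "a", "in"}   # tiny illustrative list

def pseudo_relevance_feedback(search, query, k_docs=10, n_terms=2):
    """Run the query, assume the top-k hits are relevant, and append
    their most frequent new content words to the query."""
    hits = search(query)[:k_docs]
    counts = Counter(t for doc in hits for t in doc.lower().split())
    known = set(query.lower().split()) | STOP
    expansion = [t for t, _ in counts.most_common() if t not in known][:n_terms]
    return query + " " + " ".join(expansion)

# hypothetical search engine whose top hits mention the Ides of March
def toy_search(q):
    return ["Caesar was assassinated on the Ides of March",
            "Brutus and the Ides of March"]

print(pseudo_relevance_feedback(toy_search, "Brutus assassinate Caesar reasons"))
# Brutus assassinate Caesar reasons ides march
```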
[IR system diagram, repeated]
(3) Scoring
Motivation for scoring documents

• For keyword search, all returned documents satisfy the query and are considered equally relevant.
• For conceptual search:
  – there may be too many returned documents
  – relevance is a gradation
  → score documents and return a ranked list
TF-IDF Scoring Function

• Given query q and document d:

score(q, d) = Σ_{t ∈ q} tf_{t,d} × idf_t,   where idf_t = log(N / df_t)

– tf_{t,d}: term frequency (raw count) of t in d
– df_t: number of documents with ≥ 1 occurrence of t
– N: total number of documents
– idf_t: inverse document frequency
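The TF-IDF score can be sketched directly (a toy implementation; real systems precompute df and store tf in the index rather than rescanning the collection):

```python
import math
from collections import Counter

def tfidf_score(query_terms, docs, doc_id):
    """score(q, d) = sum over t in q of tf(t, d) * log(N / df(t)).
    docs: list of token lists, one per document."""
    N = len(docs)
    counts = Counter(docs[doc_id])   # term frequencies in this document
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d)   # document frequency of t
        if df:                                # absent terms contribute 0
            score += counts[t] * math.log(N / df)
    return score

docs = [["brutus", "caesar"], ["caesar", "calpurnia"], ["calpurnia"]]
# 'brutus' is rarer than 'caesar', so it contributes more to the score
print(tfidf_score(["brutus", "caesar"], docs, 0))
```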
Vector-Space Model View

• View documents (d) & queries (q) each as vectors:
  – each vector element represents a term
  – its value is the TF-IDF of that term in d or q
• The score function can then be viewed as, e.g., the cosine similarity between the two vectors.
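Cosine similarity over sparse term-weight vectors can be sketched as below (the weights in the example are hypothetical tf-idf values, chosen only to illustrate the geometry):

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q  = {"brutus": 1.1, "caesar": 0.4}    # hypothetical query vector
d1 = {"brutus": 2.2, "caesar": 0.8}    # same direction as q -> cosine 1.0
d2 = {"calpurnia": 1.4}                # no shared terms -> cosine 0.0
print(cosine(q, d1), cosine(q, d2))
```

Note that cosine similarity depends only on vector direction, not length, which is one reason it is preferred over a raw dot product: long documents are not automatically favored.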
Alternative Scoring Functions: BM25

score(q, d) = Σ_{t ∈ q} idf_t × [ tf_{t,d} × (k1 + 1) ] / [ tf_{t,d} + k1 × (1 − b + b × |D| / avgdl) ]

– idf_t: inverse document frequency of query term t
– tf_{t,d}: frequency of query term t in document d
– |D| / avgdl: document length divided by the average document length (document length ratio)
– k1, b: tunable hyperparameters (k1 controls saturation of tf; b controls document-length bias)
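A sketch of the BM25 formula. The slide leaves idf_t unspecified, so the smoothed variant log((N − df + 0.5)/(df + 0.5) + 1) is assumed here (one standard choice, used e.g. by Lucene); k1 and b are set to commonly used defaults:

```python
import math
from collections import Counter

def bm25(query_terms, docs, doc_id, k1=1.2, b=0.75):
    """BM25 score of docs[doc_id] for the query. docs: list of token lists."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    counts = Counter(docs[doc_id])
    dl_ratio = len(docs[doc_id]) / avgdl           # |D| / avgdl
    score = 0.0
    for t in set(query_terms):
        df = sum(1 for d in docs if t in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed idf
        tf = counts[t]
        # tf saturates as it grows (k1), long documents are penalized (b)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl_ratio))
    return score

docs = [["brutus", "caesar"], ["caesar", "caesar"], ["noble"]]
print(bm25(["brutus"], docs, 0))   # positive: doc 0 contains 'brutus'
print(bm25(["brutus"], docs, 2))   # 0.0: doc 2 does not
```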
[IR system diagram, repeated]
(4) Evaluation
Evaluation: How good/bad is my IR system?

• Evaluation is important to:
  – compare two IR systems
  – decide whether our IR system is ready for deployment
  – identify research challenges
• Two ingredients for a trustworthy evaluation:
  – an answer key
  – a meaningful metric: given a query q, the returned ranked list, and the answer key, it computes a number
Precision and Recall

Contingency table for one query (A, B, C, D are document counts):

                    relevant    not relevant
  retrieved            A             B
  not retrieved        C             D

precision = A / (A + B)
recall    = A / (A + C)
average precision = area under the precision-recall curve
(precision on the y-axis, 0-100%; recall on the x-axis, 0-100%)

B: "Type one errors" / "errors of commission" / "false positives"
C: "Type two errors" / "errors of omission" / "false negatives"
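Set-based precision and recall follow directly from the contingency table (the doc ids in the example are made up):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall from the contingency table above."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # true positives (cell A)
    precision = a / len(retrieved) if retrieved else 0.0   # A / (A + B)
    recall    = a / len(relevant)  if relevant  else 0.0   # A / (A + C)
    return precision, recall

p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d9"})
print(p, r)   # 0.5 and 2/3
```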
From Paul McNamee's JSALT 2018 tutorial slides
Issues with Precision and Recall

• We often don't know the true recall value:
  – for a large collection, it is impossible to have annotators read all documents to assess their relevance to a query
• Precision and recall evaluate sets, rather than ranked lists

We'll introduce Mean Average Precision (MAP) here. Note that IR evaluation is a deep field, worth another lecture by itself!
[Precision-recall plot for the example below: precision (y-axis, 0-100%) vs. recall (x-axis, 0-100%)]

Example for 1 query: precision & recall at different positions in the ranked list.

10 relevant documents: Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Ranked list: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3

• The first ranked document, d123, is relevant, which is 10% of the total relevant. Therefore precision at the 1/10 = 10% recall level is 1/1 = 100%.
• The next relevant document, d56, gives 2/3 = 66% precision at the 2/10 = 20% recall level.
• The remaining relevant documents give precisions of 3/6, 4/10, and 5/15 at the 30%, 40%, and 50% recall levels.

Average Precision (AP): (1/1 + 2/3 + 3/6 + 4/10 + 5/15) / 5 = 0.58
Mean Average Precision (MAP): mean of AP over multiple queries

From Paul McNamee's JSALT 2018 tutorial slides
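The slide's AP calculation can be sketched as below. Note it averages over the relevant documents actually retrieved (5 here); the common TREC variant divides by the total number of relevant documents (10) instead, which would give 0.29:

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where a relevant
    document appears, as computed on the slide."""
    relevant = set(relevant)
    precisions = []
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

ranked = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
          "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
relevant = ["d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"]
print(round(average_precision(ranked, relevant), 2))  # 0.58
```

MAP is then simply the mean of `average_precision` over a set of evaluation queries.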
[IR system diagram, repeated]
(5) Web Search: additional challenges
"A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory." -- Vannevar Bush (1945)

Image source: original illustration of the Memex from the Life reprint of "As We May Think", https://history-computer.com/Internet/Dreamers/Bush.html
Some history

• 1945: Vannevar Bush writes about the MEMEX
• 1975: Microsoft founded
• 1981: IBM PC
• 1989: Tim Berners-Lee invents the WWW
• 1992: 1M internet hosts, but only 50 websites
• 1994: Yahoo founded, builds an online directory
• 1995: AltaVista indexes 15M web pages
• 1998: Google founded
• 2004: Google IPO

From Paul McNamee's JSALT 2018 tutorial slides
Web Search: a sample of challenges & opportunities

• Crawling
  – infrastructure to handle scale
  – where to crawl, how often: freshness, the Deep Web
• Web document characteristics:
  – hypertext structure, HTML tags
  – diverse types of information
  – dealing with Search Engine Optimization (SEO)
• Large user base
  – long tail of queries
  – exploiting query logs and click logs
  – user interface research (including voice search)
• Advertising ecosystem, etc.
Crawling: Basic algorithm

• Start with a set of known pages in the queue
• Repeat: (1) pop the queue, (2) download & parse the page, (3) push discovered URLs onto the queue

From Doug Oard's slides: http://users.umiacs.umd.edu/~oard/teaching/734/spring18/
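The basic algorithm is a breadth-first traversal. A sketch without any networking: `fetch(url)` is a stand-in for downloading and parsing a page, and the toy "web" is an adjacency list. Real crawlers also need politeness delays, robots.txt handling, URL canonicalization, and revisit (freshness) policies:

```python
from collections import deque

def crawl(seeds, fetch, max_pages=100):
    """Basic breadth-first crawler: fetch(url) returns a page's out-links."""
    frontier = deque(seeds)        # queue of known-but-unvisited pages
    seen = set(seeds)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()           # (1) pop the queue
        order.append(url)
        for link in fetch(url):            # (2) download & parse the page
            if link not in seen:           # (3) push newly discovered URLs
                seen.add(link)
                frontier.append(link)
    return order

# toy web as an adjacency list instead of real HTTP
web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(crawl(["a"], lambda u: web.get(u, [])))  # ['a', 'b', 'c', 'd']
```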
Exploiting link structure: PageRank

• Pages with more in-links have more authority
• Acts as a "prior" document score
• Can be viewed as the probability of a random surfer landing on a page

Image source: Illustration of PageRank by Felipe Micaroni Lalli, https://en.wikipedia.org/wiki/File:PageRank-hi-res.png
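The random-surfer view translates into a simple power iteration; the tiny link graph below is made up for illustration, and the damping factor 0.85 is the commonly cited default:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank: a page's score is the probability that a
    random surfer lands on it, following links with probability `damping`
    and teleporting to a uniformly random page otherwise."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start uniform
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}   # teleportation mass
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share               # p passes authority to q
            else:                                 # dangling page: spread mass
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(links)
# page 'a', with two in-links, ends up with the highest score
print(ranks)
```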
Diversity of user queries

• "20-25% of the queries we will see today, we have never seen before" – Udi Manber (Google VP, May 2007)
• A. Broder, in "A taxonomy of Web search" (2002), classifies user queries as:
  – Informational
  – Navigational
  – Transactional