Introduction to Information Retrieval & Web Search Kevin Duh Johns Hopkins University June 2019

Apr 06, 2023

Page 1: Introduction to Information Retrieval & Web Search

Introduction to Information Retrieval & Web Search

Kevin Duh, Johns Hopkins University

June 2019

Page 2: Introduction to Information Retrieval & Web Search

Acknowledgments

These slides draw heavily from these excellent sources:

•  Paul McNamee’s JSALT 2018 tutorial:
–  https://www.clsp.jhu.edu/wp-content/uploads/sites/75/2018/06/2018-06-19-McNamee-JSALT-IR-Soup-to-Nuts.pdf

•  Doug Oard’s Information Retrieval Systems course at UMD
–  http://users.umiacs.umd.edu/~oard/teaching/734/spring18/

•  Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge U. Press, 2008.
–  https://nlp.stanford.edu/IR-book/information-retrieval-book.html

•  W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2009.
–  http://ciir.cs.umass.edu/irbook/

Page 3: Introduction to Information Retrieval & Web Search

I never waste memory on things that can easily be stored and retrieved from elsewhere. -- Albert Einstein

Image source: Einstein 1921 by F Schmutzer https://en.wikipedia.org/wiki/Albert_Einstein#/media/File:Einstein_1921_by_F_Schmutzer_-_restoration.jpg

Page 4: Introduction to Information Retrieval & Web Search

What is Information Retrieval (IR)?

1.  Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, & retrieval of information. (Gerard Salton, IR pioneer, 1968)

2.  Information retrieval focuses on the efficient recall of information that satisfies a user’s information need.

Page 5: Introduction to Information Retrieval & Web Search

QUERY: NullPointerException randomize() FastMath

INFO NEED: I need to understand why I’m getting a NullPointerException when calling randomize() in the FastMath library

Web documents that may be relevant

Page 6: Introduction to Information Retrieval & Web Search

Information Hierarchy

Data: raw material of information

Information: data organized & presented in context

Knowledge: info that can be acted upon

Wisdom

More refined and abstract

From Doug Oard’s slides: http://users.umiacs.umd.edu/~oard/teaching/734/spring18/

Page 7: Introduction to Information Retrieval & Web Search

Databases vs. IR

What we’re retrieving
–  Database: Structured data. Clear semantics based on formal model.
–  IR: Unstructured data. Free text with metadata. Videos, images, music.

Queries we’re posing
–  Database: Unambiguous formally defined queries.
–  IR: Vague, imprecise queries.

Results we get
–  Database: Exact. Always correct in a formal sense.
–  IR: Sometimes relevant, sometimes not.

From Doug Oard’s slides: http://users.umiacs.umd.edu/~oard/teaching/734/spring18/

Note: From a user perspective, the distinction may be seamless, e.g. asking Siri a question about nearby restaurants w/ good reviews

Page 8: Introduction to Information Retrieval & Web Search

Structure of IR System & Tutorial Overview

Page 9: Introduction to Information Retrieval & Web Search

[Figure: IR System diagram. Components: User with Information Need; Query; Documents; a Representation Function for the query and one for the documents; Query Representation; Document Representation (the INDEX); Scoring Function; Returned Hits.]

Page 10: Introduction to Information Retrieval & Web Search

[Figure: IR System diagram (Query, Documents, Representation Functions, Query Representation, Document Representation / INDEX, Scoring Function, Returned Hits), annotated with the tutorial outline.]

(1) Indexing
(2) Query Processing
(3) Scoring
(4) Evaluation
(5) Web Search: additional challenges

Page 11: Introduction to Information Retrieval & Web Search

Index vs Grep

•  Say we have a collection of Shakespeare plays
•  We want to find all plays that contain:

QUERY: Brutus AND Caesar AND NOT Calpurnia

•  Grep: Start at 1st play, read everything and filter out whatever doesn’t match the criteria (linear scan, 1M words)

•  Index (a.k.a. Inverted Index): build index data structure off-line. Quick lookup at query-time.

These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008

Page 12: Introduction to Information Retrieval & Web Search

The Shakespeare collection as Term-Document Incidence Matrix

Matrix element (t,d) is: 1 if term t occurs in document d, 0 otherwise

These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008

Page 13: Introduction to Information Retrieval & Web Search

The Shakespeare collection as Term-Document Incidence Matrix

QUERY: Brutus AND Caesar AND NOT Calpurnia

Answer: “Antony and Cleopatra” (d=1), “Hamlet” (d=4)

These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008
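A minimal sketch of how this Boolean query can be evaluated on the incidence matrix. The matrix image is not part of this transcript, so the 0/1 rows below are taken from the corresponding figure in the Manning et al. textbook; the function name is made up for illustration.

```python
# Incidence rows for three terms over six plays. Values come from the
# Manning et al. textbook figure, not from this transcript.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
INCIDENCE = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def brutus_and_caesar_and_not_calpurnia():
    """Evaluate the Boolean query one document (column) at a time."""
    hits = []
    for d, play in enumerate(PLAYS):
        if (INCIDENCE["Brutus"][d] and INCIDENCE["Caesar"][d]
                and not INCIDENCE["Calpurnia"][d]):
            hits.append((d + 1, play))  # document ids are 1-based on the slide
    return hits

print(brutus_and_caesar_and_not_calpurnia())
# [(1, 'Antony and Cleopatra'), (4, 'Hamlet')]
```

The matrix has one cell per (term, document) pair and most cells are 0, which is why the following slides store only the 1 entries as an inverted index.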

Page 14: Introduction to Information Retrieval & Web Search

Inverted Index Data Structure

[Figure: each term (t) points to a list of document ids (d), e.g. “Brutus” occurs in d=1,2,4... Importantly, it’s a sorted list.]

These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008
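As a rough illustration of the data structure, the sketch below builds a term-to-sorted-postings mapping with a plain Python dict. The toy documents and the whitespace tokenizer are assumptions made for this example only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            postings[term].add(doc_id)
    # Sorting each postings list enables the linear-time merge used for AND.
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy collection, not the Shakespeare data.
toy_docs = {
    1: "brutus killed caesar",
    2: "caesar and calpurnia",
    4: "brutus spoke",
}
index = build_inverted_index(toy_docs)
print(index["brutus"])  # [1, 4]
print(index["caesar"])  # [1, 2]
```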

Page 15: Introduction to Information Retrieval & Web Search

Efficient algorithm for List Intersection (for Boolean conjunctive “AND” operators)

QUERY: Brutus AND Calpurnia

[Figure: the two sorted postings lists, walked in step by Pointer p1 and Pointer p2.]

These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008
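The two-pointer merge can be sketched as follows. It assumes both postings lists are sorted in increasing document id; the example lists are illustrative values in the style of the textbook figure.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # doc id is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller doc id
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]   # illustrative postings
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```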

Page 16: Introduction to Information Retrieval & Web Search

Time and Space Tradeoffs

•  Time complexity at query-time:
–  Linear scan over postings
–  O(L1 + L2) where Lt is length of posting for term t
–  vs. grep through all documents O(N), L << N

•  Time complexity at index-time:
–  O(N) for one pass through collection
–  Additional issue: efficient adding/deleting documents

•  Space complexity (example setup):
–  Dictionary: Hash/Trie in RAM
–  Postings: Array on disk

Page 17: Introduction to Information Retrieval & Web Search

Quiz: How would you process these queries?

Which terms do you intersect first?

QUERY: Brutus AND Caesar AND Calpurnia

QUERY: Brutus AND (Caesar OR Calpurnia)

QUERY: Brutus AND Caesar AND NOT Calpurnia

Think: What terms to process first? How to handle OR, NOT?

Page 18: Introduction to Information Retrieval & Web Search

Optional meta-data in inverted index

•  Skip pointers: For faster intersection, but extra space

[Figure: postings lists with skip pointers, walked by Pointer p1 and Pointer p2.]

These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008

Page 19: Introduction to Information Retrieval & Web Search

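Optional meta-data in inverted index

•  Position of term in document: Enables phrasal queries

QUERY: “to be or not to be”

[Figure: positional index entry showing the term (t), its document frequency, and per-document positions; e.g. the term occurs in document d=4 with term frequency of 5, at positions 17, 191, 291, 430, 434.]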
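One way a positional index can answer a phrasal query is sketched below: find documents containing every term, then check for consecutive positions. The nested-dict layout and the toy positions are assumptions for illustration, and the sketch assumes every phrase term is present in the index.

```python
def phrase_match(positional_index, phrase_terms):
    """Return sorted doc ids where the terms occur at consecutive positions.

    positional_index: term -> {doc_id: sorted list of positions}
    """
    # Candidate documents must contain every term of the phrase.
    candidates = set(positional_index[phrase_terms[0]])
    for term in phrase_terms[1:]:
        candidates &= set(positional_index[term])

    hits = []
    for doc_id in sorted(candidates):
        for start in positional_index[phrase_terms[0]][doc_id]:
            # The k-th phrase term must appear exactly k positions after start.
            if all(start + k in positional_index[term][doc_id]
                   for k, term in enumerate(phrase_terms)):
                hits.append(doc_id)
                break
    return hits

# Hypothetical toy index: only document 4 contains the full phrase.
toy_index = {
    "to":  {4: [17, 21], 7: [3]},
    "be":  {4: [18, 22], 7: [9]},
    "or":  {4: [19]},
    "not": {4: [20]},
}
print(phrase_match(toy_index, ["to", "be", "or", "not", "to", "be"]))  # [4]
```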

Page 20: Introduction to Information Retrieval & Web Search

Index construction and management

•  Dynamic index
–  Searching Twitter vs. static document collection

•  Distributed solutions
–  MapReduce, Hadoop, etc.
–  Fault tolerance

•  Pre-computing components for score function

→ Many interesting technical challenges!

Page 21: Introduction to Information Retrieval & Web Search

[Figure: IR System diagram with the annotations “(1) Indexing: We covered this” and “Next up”.]

Page 22: Introduction to Information Retrieval & Web Search

Representing a Document as a Bag-of-words (but what words?)

The QUICK, brown foxes jumped over the lazy dog!

Tokenization

The / QUICK / , / brown / foxes / jumped / over / the / lazy / dog / !

Stopword removal, Stemming, Normalization

quick / brown / fox / jump / over / lazi / dog

Index
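The pipeline above can be approximated in a few lines. This is only a sketch: the stopword list and the suffix rules in `crude_stem` are toy assumptions standing in for a real stopword list and stemmer, and punctuation is dropped during tokenization rather than kept as separate tokens.

```python
import re

STOPWORDS = {"the"}  # toy stopword list, an assumption for this example

def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    if token.endswith("xes"):
        return token[:-2]            # foxes -> fox
    if token.endswith("ed") and len(token) > 4:
        return token[:-2]            # jumped -> jump
    if token.endswith("y") and len(token) > 3:
        return token[:-1] + "i"      # lazy -> lazi
    return token

def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The QUICK, brown foxes jumped over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'dog']
```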

Page 23: Introduction to Information Retrieval & Web Search

Issues in Document Representation

•  Language-specific challenges
•  Polysemy & Synonyms:
–  “bank” in multiple senses, represented the same?
–  “jet” and “airplane” should be same?

•  Acronyms, Numbers, Document structure
•  Morphology

Central Siberian Yupik morphology example from E. Chen & L. Schwartz, LREC 2018: http://dowobeha.github.io/papers/lrec18.pdf

Page 24: Introduction to Information Retrieval & Web Search

[Figure: IR System diagram, highlighting (2) Query Processing.]

Page 25: Introduction to Information Retrieval & Web Search

Query Representation

•  Of course, the query string must go through the same tokenization, stopword removal and normalization process as the documents

•  But we can do more, esp. for free-text queries
–  to guess user’s intent & information need

Page 26: Introduction to Information Retrieval & Web Search

Keyword search vs. Conceptual search

•  Keyword search / Boolean retrieval:

BOOLEAN QUERY: Brutus AND Caesar AND NOT Calpurnia

–  Answer is exact, must satisfy these terms

•  Conceptual search (or just “search” like Google)

FREE-TEXT QUERY: Brutus assassinate Caesar reasons

–  Answer may not need to exactly match these terms
–  Note this naming may not be standard

Page 27: Introduction to Information Retrieval & Web Search

Query Expansion for “conceptual” search

•  Add terms to the query representation
–  Exploit knowledge base, WordNet, user query logs

ORIGINAL FREE-TEXT QUERY: Brutus assassinate Caesar reasons

EXPANDED QUERY: Brutus assassinate kill Caesar reasons why
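A minimal sketch of dictionary-based expansion that reproduces the example. The `RELATED_TERMS` table is hand-written and hypothetical; in practice the related terms would come from a resource such as a knowledge base, WordNet, or query logs.

```python
# Hypothetical hand-written expansion table for this example only.
RELATED_TERMS = {
    "assassinate": ["kill"],
    "reasons": ["why"],
}

def expand_query(terms):
    """Append related terms right after each original query term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(RELATED_TERMS.get(term, []))
    return expanded

print(expand_query(["Brutus", "assassinate", "Caesar", "reasons"]))
# ['Brutus', 'assassinate', 'kill', 'Caesar', 'reasons', 'why']
```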

Page 28: Introduction to Information Retrieval & Web Search

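Pseudo-Relevance Feedback

•  Query expansion by iterative search

ORIGINAL QUERY: Brutus assassinate Caesar reasons

EXPANDED QUERY: Brutus assassinate Caesar reasons + Ides of March

[Figure: the original query goes through the IR System to give Returned Hits v1; words extracted from these hits are added to form the expanded query, which goes through the IR System again to give Returned Hits v2.]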
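The feedback loop might look like the sketch below. Everything here is a toy assumption: `toy_search` ranks by simple term overlap, the documents are invented, and new terms are chosen by raw frequency in the top hits, whereas real systems use more careful term selection.

```python
from collections import Counter

STOPWORDS = {"the", "on", "of", "a", "is"}  # toy list for this example
DOCS = [
    "brutus stabbed caesar on the ides of march",
    "caesar died on the ides",
    "the tempest is a play",
]

def toy_search(query_terms):
    """Rank documents by how many query terms they contain."""
    scored = [(sum(t in doc.split() for t in query_terms), doc) for doc in DOCS]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0]

def pseudo_relevance_feedback(query_terms, search, top_k=2, n_new_terms=2):
    """Expand the query with frequent terms from the top-ranked hits."""
    first_pass_hits = search(query_terms)[:top_k]   # "Returned Hits v1"
    counts = Counter()
    for doc_text in first_pass_hits:
        counts.update(t for t in doc_text.split()
                      if t not in query_terms and t not in STOPWORDS)
    new_terms = [t for t, _ in counts.most_common(n_new_terms)]
    return query_terms + new_terms

expanded = pseudo_relevance_feedback(["brutus", "caesar"], toy_search)
print(expanded)              # original terms plus terms mined from the hits
print(toy_search(expanded))  # second pass, i.e. "Returned Hits v2"
```

The top hits are assumed relevant without asking the user, which is why the feedback is called "pseudo".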

Page 29: Introduction to Information Retrieval & Web Search

[Figure: IR System diagram, highlighting (3) Scoring.]

Page 30: Introduction to Information Retrieval & Web Search

Motivation for scoring documents

•  For keyword search, all documents returned should satisfy query, and are equally relevant

•  For conceptual search:
–  May have too many returned documents
–  Relevance is a gradation

→ Score documents and return a ranked list

Page 31: Introduction to Information Retrieval & Web Search

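TF-IDF Scoring Function

•  Given query q and document d, the score combines, over the terms t in q:
–  Term frequency: raw count of t in d
–  Inverse document frequency: compares the total number of documents with the number of documents with >= 1 occurrence of t

[Figure: the TF-IDF formula, annotated with the labels above.]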
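The formula image did not survive in this transcript, so the sketch below assumes one common form of the score: the sum over query terms of tf(t, d) · log(N / df_t). Other variants (log-scaled tf, smoothed idf) exist; the toy documents are invented.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Sum tf(t, d) * log(N / df_t) over the query terms (assumed variant)."""
    tf = Counter(doc_tokens)              # raw counts of each term in d
    score = 0.0
    for t in query_terms:
        df = doc_freq.get(t, 0)           # number of docs containing t
        if df == 0:
            continue                      # term never occurs in the collection
        score += tf[t] * math.log(num_docs / df)
    return score

# Hypothetical toy collection.
docs = [
    ["brutus", "killed", "caesar"],
    ["caesar", "caesar", "rome"],
    ["rome", "burned"],
]
doc_freq = Counter(t for d in docs for t in set(d))
scores = [tf_idf_score(["brutus", "caesar"], d, doc_freq, len(docs)) for d in docs]
print(scores)
```

The rare term "brutus" contributes more per occurrence than the more common "caesar", so the first document ranks highest here.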

Page 32: Introduction to Information Retrieval & Web Search

Vector-Space Model View

•  View documents (d) & queries (q) each as vectors,
–  Each vector element represents a term
–  whose value is the TF-IDF of that term in d or q

•  Score function can be viewed as e.g. Cosine Similarity between vectors

These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008
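A small sketch of cosine similarity over sparse vectors, here represented as dicts from term to weight (a representation chosen for this example). The weights could be TF-IDF values; the numbers below are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an empty vector
    return dot / (norm_u * norm_v)

query_vec = {"brutus": 1.1, "caesar": 0.4}                 # made-up weights
doc_vec = {"brutus": 1.1, "caesar": 0.4, "killed": 1.1}
print(cosine_similarity(query_vec, doc_vec))
```

Because the vectors are length-normalized, a long document does not score higher merely for containing more terms.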

Page 33: Introduction to Information Retrieval & Web Search

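Alternative Scoring Functions: BM25

$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}_t \times \frac{\mathrm{tf}_{t,d} \cdot (k_1 + 1)}{\mathrm{tf}_{t,d} + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

–  q: Query; d: Document
–  idf_t: Inverse Document Frequency of query term
–  tf_{t,d}: Frequency of query term in document
–  |D| / avgdl: Document length ratio
–  Tunable Hyperparameters: k1 (Saturation for tf), b (Document length bias)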
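The BM25 formula translates fairly directly into code. In this sketch the idf values are passed in as a precomputed dict, and the defaults k1=1.2 and b=0.75 are commonly used settings assumed here, not values given on the slide.

```python
def bm25_score(query_terms, doc_tokens, idf, avgdl, k1=1.2, b=0.75):
    """BM25 score of one document for a query.

    idf: dict term -> inverse document frequency (assumed precomputed)
    avgdl: average document length in the collection
    """
    doc_len = len(doc_tokens)                      # |D|
    length_norm = 1 - b + b * doc_len / avgdl      # document length ratio term
    score = 0.0
    for t in query_terms:
        tf = doc_tokens.count(t)
        if tf == 0:
            continue
        score += idf.get(t, 0.0) * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# With |D| == avgdl and tf == 1 the fraction equals 1, so the score is the idf.
print(bm25_score(["x"], ["x", "y"], {"x": 1.0}, avgdl=2))
```

As tf grows, each term's contribution approaches idf · (k1 + 1) but never exceeds it, which is the saturation effect controlled by k1.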

Page 34: Introduction to Information Retrieval & Web Search

[Figure: IR System diagram, highlighting (4) Evaluation.]

Page 35: Introduction to Information Retrieval & Web Search

Evaluation: How good/bad is my IR?

•  Evaluation is important:
–  Compare two IR systems
–  Decide whether our IR is ready for deployment
–  Identify research challenges

•  Two Ingredients for a trustworthy evaluation:
–  Answer Key
–  A Meaningful Metric: given query q, returned ranked list, and answer key, computes a number

Page 36: Introduction to Information Retrieval & Web Search

Precision and Recall

                  relevant    not relevant
retrieved            A             B
not retrieved        C             D

precision = A / (A + B)

recall = A / (A + C)

average precision = area under curve

[Figure: precision-recall curve; precision on the vertical axis and recall on the horizontal axis, each from 0% to 100%.]

–  B: “Type one errors”, “Errors of commission”, “False positives”
–  C: “Type two errors”, “Errors of omission”, “False negatives”

From Paul McNamee’s JSALT 2018 tutorial slides
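For a retrieved set and an answer key, the two definitions reduce to a few set operations. The document ids in the example are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against an answer key."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)                 # cell A: retrieved and relevant
    precision = a / len(retrieved) if retrieved else 0.0   # A / (A + B)
    recall = a / len(relevant) if relevant else 0.0        # A / (A + C)
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d5"})
print(p, r)  # 0.5 and 2/3
```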

Page 37: Introduction to Information Retrieval & Web Search

Issues with Precision and Recall

•  We often don’t know true recall value
–  For large collection, impossible to have annotator read all documents to assess relevance of a query

•  Focused on evaluating sets, rather than ranked lists

We’ll introduce Mean Average Precision (MAP) here. Note that IR evaluation is a deep field, worth another lecture by itself!

Page 38: Introduction to Information Retrieval & Web Search

Example for 1 query: precision & recall at different positions in ranked list

10 relevant: Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}

Ranked list: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3

Precision at each relevant document in the ranked list: 1/1 (d123), 2/3 (d56), 3/6 (d9), 4/10 (d25), 5/15 (d3)

[Figure: precision (vertical axis, 0 to 100) plotted against recall (horizontal axis, 0 to 100).]

•  First ranked doc d123 is relevant, which is 10% of the total relevant. Therefore precision at the 1/10 = 10% recall level is 1/1 = 100%

•  Next relevant d56 gives us 2/3 ≈ 67% precision at the 2/10 = 20% recall level

Average Precision (AP): (1/1 + 2/3 + 3/6 + 4/10 + 5/15) / 5 = 0.58

Mean Average Precision (MAP): Mean of AP over multiple queries

From Paul McNamee’s JSALT 2018 tutorial slides
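The AP computation for this example can be checked with a short sketch. It follows the slide's arithmetic and divides by the number of relevant documents that appear in the ranked list (5); note that many definitions divide by the total number of relevant documents instead (10 here), which would give a lower value.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears.

    Divides by the number of relevant docs found in the list, as on the slide.
    """
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranked = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
          "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
print(round(average_precision(ranked, relevant), 2))  # 0.58
```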

Page 39: Introduction to Information Retrieval & Web Search

[Figure: IR System diagram, highlighting (5) Web Search: additional challenges.]

Page 40: Introduction to Information Retrieval & Web Search

A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory. -- Vannevar Bush (1945)

Image Source: Original illustration of the Memex from the Life reprint of “As We May Think” https://history-computer.com/Internet/Dreamers/Bush.html

Page 41: Introduction to Information Retrieval & Web Search
Page 42: Introduction to Information Retrieval & Web Search

Some history

•  1945: Vannevar Bush writes about MEMEX
•  1975: Microsoft founded
•  1981: IBM PC
•  1989: Tim Berners-Lee invents WWW
•  1992: 1M internet hosts, but only 50 web sites
•  1994: Yahoo founded, builds online directory
•  1995: AltaVista indexes 15M web pages
•  1998: Google founded
•  2004: Google IPO

From Paul McNamee’s JSALT 2018 tutorial slides

Page 43: Introduction to Information Retrieval & Web Search

Web Search: a sample of challenges & opportunities

•  Crawling
–  Infrastructure to handle scale
–  Where to crawl, how often: Freshness, Deep Web

•  Web document characteristics:
–  Hypertext structure, HTML tags
–  Diverse types of information
–  Dealing with Search Engine Optimization (SEO)

•  Large User base
–  Long-tail of queries
–  Exploiting query logs and click logs
–  User interface research (including voice search)

•  Advertising ecosystem, etc.

Page 44: Introduction to Information Retrieval & Web Search

Crawling: Basic algorithm

•  Start with a set of known pages in the queue
•  Repeat: (1) pop queue, (2) download & parse page, (3) push discovered URL on queue

From Doug Oard’s slides: http://users.umiacs.umd.edu/~oard/teaching/734/spring18/
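The queue-based loop can be sketched without any network access by crawling an in-memory link graph. `TOY_WEB` and the `fetch_links` callback are hypothetical stand-ins for downloading and parsing a page; the `seen` set is an addition to the basic algorithm so that no URL is queued twice.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    queue = deque(seed_urls)      # start with a set of known pages
    seen = set(seed_urls)         # avoids re-queueing a URL (not on the slide)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()                 # (1) pop queue
        visited.append(url)
        for link in fetch_links(url):         # (2) download & parse page
            if link not in seen:
                seen.add(link)
                queue.append(link)            # (3) push discovered URL
    return visited

# Hypothetical link graph standing in for the Web.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))  # ['a', 'b', 'c', 'd']
```

A production crawler would also need politeness delays, robots.txt handling, and a revisit policy for freshness, none of which are modeled here.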

Page 45: Introduction to Information Retrieval & Web Search

Crawling: Basic algorithm

[Figure: illustration of the basic crawling algorithm.]

Page 46: Introduction to Information Retrieval & Web Search

Bowtie link structure of the Web, circa 2000

Page 47: Introduction to Information Retrieval & Web Search

Exploiting link structure: PageRank

–  Pages with more in-links have more authority
–  “Prior” document score
–  Can be viewed as probability of a random surfer landing on a page

Image source: Illustration of PageRank by Felipe Micaroni Lalli https://en.wikipedia.org/wiki/File:PageRank-hi-res.png
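The random-surfer view can be sketched with power iteration. The damping factor 0.85 is a conventional choice assumed here, the toy graph is invented, and the sketch assumes every link target also appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for PageRank on a dict page -> list of out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start from uniform
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: three pages link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
print(max(ranks, key=ranks.get))  # c
```

The resulting values sum to 1, so each can be read as the probability that the random surfer is on that page.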

Page 48: Introduction to Information Retrieval & Web Search

Diversity of user queries

•  “20-25% of the queries we will see today, we have never seen before” – Udi Manber (Google VP, May 2007)

•  A. Broder in A taxonomy of Web search (2002) classifies user queries as:
–  Informational
–  Navigational
–  Transactional

Page 49: Introduction to Information Retrieval & Web Search

To Sum Up

Page 50: Introduction to Information Retrieval & Web Search

[Figure: IR System diagram (Query, Documents, Representation Functions, Query Representation, Document Representation / INDEX, Scoring Function, Returned Hits), annotated with the topics covered.]

(1) Indexing
(2) Query Processing
(3) Scoring
(4) Evaluation
(5) Web Search: additional challenges