
Information Retrieval · Information Retrieval vs. Relational Databases § Information Retrieval(in contrast) § datais mostly unstructuredwith no precise semantics § information

Jan 17, 2020


dariahiddleston

Information Retrieval (WS 2018/2019)

Klaus Berberich ([email protected])


0. Organization


Lectures / Exercises

§ Lectures and exercises will take place on

  § Monday 08:15 – 09:45 in room 7110

  § Thursday 08:15 – 09:45 in room 7110

§ The detailed schedule of when exercises will be discussed will be available on the course website

Information Retrieval / Chapter 0: Organization


Website

§ There is a website with all details about this course

  § https://swl.htwsaar.de/lehre/ws18/ir/

§ On the website you will find slides, exercise sheets, and the datasets for the programming assignments

§ Some areas of the website will be password protected

  § Username: ir

  § Password: 7110


Exercises / Programming Assignments

§ There will be four exercises, with problems that you can solve on paper, and four programming assignments, for which you need to write code

§ In the programming assignments, we will develop our own little search engine and evaluate how well it works

§ It is up to you whether you hand in a solution to these


Bonus Points

§ By submitting solutions to the exercises and programming assignments, you can obtain up to 30 bonus points (percent)

§ These are the rules for obtaining bonus points

  § you can submit by e-mail in teams of up to three people

  § you have to submit by the deadline on the exercise sheet

  § you have to pass the exam at the end of the lecture period

  § 50% in exam and 30 bonus points = 80% in exam (i.e., 2.0)

  § 20% in exam and 30 bonus points = 20% in exam (i.e., you fail)

  § bonus points are only valid this semester


Exam

§ There will be a written exam in the last session of this course, i.e., February 7th from 08:15 until 09:45

§ The exam will take 90 minutes and you are allowed to bring three handwritten sheets of DIN-A4 paper with your own notes as well as a non-programmable pocket calculator


Literature

§ C. D. Manning, P. Raghavan, and H. Schütze: Introduction to Information Retrieval, Cambridge University Press, 2008 [Online]

§ W. B. Croft, D. Metzler, and T. Strohman: Search Engines – Information Retrieval in Practice, Pearson Education, 2009 [Online]


Agenda

§ 1. Introduction

§ 2. Natural Language Preprocessing

§ 3. Retrieval Models

§ 4. IR-System Implementation

§ 5. Evaluation

§ 6. Web Search

§ 7. Semantic Search

Information Retrieval / Chapter 1: Introduction


1. Introduction


Information Retrieval is Everywhere


What is Information Retrieval?


Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)

Manning et al. [1]


What is Information Retrieval?

§ Information Retrieval (IR) is about finding content, e.g.:

§ articles (e.g., scientific reports, newspaper articles)

§ office documents (e.g., letters or spreadsheets)

§ multimedia content (e.g., images or videos)

§ web pages, e-mails, social media profiles, etc.


What is Information Retrieval?

§ Contents have little or no predefined structure (in contrast to tuples in relational databases)

§ simple text documents in natural language

§ HTML documents with some markup (e.g., for headers)

§ semi-structured documents (e.g., XML or JSON)


What is Information Retrieval?

§ IR seeks to satisfy an information need of a human user

§ information need is often vague (e.g., learn about robotics) and expressed as one or multiple queries (e.g., introduction robotics)

§ only the human user can say whether a document is relevant


What is Information Retrieval?

§ Document collections can be very large and dynamic

  § 100,000 documents on a desktop computer

  § 10,000,000 articles in a newspaper archive

  § >> 1,000,000,000,000 web pages on the World Wide Web

  § 500,000,000 tweets per day on Twitter


Information Retrieval is Interdisciplinary


[Figure: IR shown in relation to Computer Science, Natural Language Processing, and Library and Information Science]


Historical Background

§ Libraries (dating back to 3000 B.C.)

  § organize contents in catalogues according to author, publication year, or keywords

  § categorize content using a classification scheme (e.g., Dewey Decimal Classification)

§ Vannevar Bush’s idea of a Memex (1945)

  § serves as a memory extender

  § foresees storage, cross-linking, and retrieval of contents


Historical Background

§ SMART system developed by Salton et al. (1960s)

  § full-text indexing and result ranking

  § brought the user "in the loop" by asking for relevance feedback

§ TREC and other benchmark initiatives (since 1990s)

  § reusable testbeds with documents, information needs, and relevance judgments


Historical Background

§ Google (1998)

  § improved web search by making use of the link structure of the World Wide Web with their PageRank algorithm

§ Learning to Rank (since 2000s)

  § observe user behavior (who clicks on what for which query) and use Machine Learning to rank documents in response to a query

  § progress in recent years through the use of Deep Learning, which avoids extensive feature engineering


Information Retrieval vs. Relational Databases

§ Relational databases

  § data has a predetermined schema with attributes that have precise semantics

  § SQL provides a query language which allows expressing information needs with precise semantics


Customers

CustomerId  FirstName  LastName
13          Paul       McCartney
14          John       Lennon

SELECT *
FROM Customers
WHERE LastName LIKE 'Mc%'
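The precise semantics of such a query can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module. The function name find_customers and the choice of SQLite are assumptions made for illustration, and the filter is applied to the LastName column of the table shown on the slide.

```python
import sqlite3


def find_customers(pattern):
    """Return the rows of a small in-memory Customers table whose
    LastName matches the given SQL LIKE pattern (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Customers ("
            "CustomerId INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT)"
        )
        # The two example tuples from the slide.
        conn.executemany(
            "INSERT INTO Customers VALUES (?, ?, ?)",
            [(13, "Paul", "McCartney"), (14, "John", "Lennon")],
        )
        rows = conn.execute(
            "SELECT * FROM Customers WHERE LastName LIKE ? ORDER BY CustomerId",
            (pattern,),
        ).fetchall()
    finally:
        conn.close()
    return rows


print(find_customers("Mc%"))  # [(13, 'Paul', 'McCartney')]
```

Every row either satisfies the predicate or it does not, so there is no notion of a partially relevant result here. That is the contrast to information retrieval drawn on the next slide.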


Information Retrieval vs. Relational Databases

§ Information Retrieval (in contrast)

  § data is mostly unstructured with no precise semantics

  § information need (e.g., learn about robotics) is often vague and expressed as a query (e.g., introduction robotics)

These technologies are used to develop machines that can substitute for humans and replicate human actions. Robots can be used in many situations and for lots of purposes, but today many are used in dangerous environments (including bomb detection and deactivation), manufacturing processes, or where humans cannot survive (e.g. in space). Robots can take on any form but some are made to resemble humans in appearance.

Source: https://en.wikipedia.org/wiki/Robotics


Important Questions

§ How can we preprocess natural language texts, e.g., to merge different forms of the same word (e.g., house and houses) and detect sentence boundaries? (Chapter 2: Natural Language Preprocessing)

§ How can we formally model documents and queries and decide which documents are most likely to satisfy the user’s information need? (Chapter 3: Retrieval Models)


Important Questions

§ How can we quickly return results for a specific query? (Chapter 4: IR-System Implementation)

§ How can we determine whether our IR system returns good results or whether it is better than another system? (Chapter 5: Evaluation)


Important Questions

§ How can we leverage specifics of the World Wide Web such as markup and hyperlinks? (Chapter 6: Web Search)

§ How can we make use of natural language processing techniques that better understand documents to improve the search experience? (Chapter 7: Semantic Search)


Preliminary Answer: Boolean Retrieval

§ Convert each document into a bag of words by converting it to lower case and splitting at whitespace

§ Queries are Boolean expressions over known words

These technologies are used to develop machines that can substitute for humans and replicate human actions. Robots can be used in many situations and for lots of purposes, but today many are used in dangerous environments (including bomb detection and deactivation), manufacturing processes, or where humans cannot survive (e.g. in space). Robots can take on any form but some are made to resemble humans in appearance.

Source: https://en.wikipedia.org/wiki/Robotics

{ technologies, develop, human, humans, robots, robots, … }

robots AND humans AND NOT (science AND fiction)
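The conversion into a bag of words can be sketched in a few lines. This is only an illustration of the lower-casing and whitespace-splitting step described above. The function name bag_of_words and the shortened example text are assumptions, not part of the course material.

```python
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split it at whitespace, and count each word."""
    return Counter(text.lower().split())


doc = "Robots can be used in many situations. Robots can take on any form."
bag = bag_of_words(doc)

print(bag["robots"])  # 2
print(bag["form."])   # 1
```

Note that "form." keeps its trailing period. Splitting at whitespace alone leaves punctuation attached to words, which is one reason a more careful preprocessing step is needed.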


Preliminary Answer: Boolean Retrieval

§ Documents seen as assignments to Boolean variables

§ A document matches a query if the corresponding Boolean expression evaluates to True on its value assignment

§ An obvious shortcoming of this simple retrieval model is that there is no ranking of result documents

{ technologies, develop, human, humans, robots, robots, … }

technologies = True
develop = True
human = True
. . .
science = False
fiction = False
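Matching a document against a Boolean query can be sketched as follows. The nested-tuple query representation and the function name evaluate are assumptions chosen for brevity, since the slides do not prescribe a representation. A word that occurs in the document corresponds to a variable set to True, and every other word to False.

```python
def evaluate(query, words):
    """Evaluate a Boolean query on the set of words of one document.

    A query is either a word (str) or a tuple ("AND" | "OR" | "NOT", ...).
    """
    if isinstance(query, str):
        return query in words  # variable is True iff the word occurs
    op, *args = query
    if op == "AND":
        return all(evaluate(a, words) for a in args)
    if op == "OR":
        return any(evaluate(a, words) for a in args)
    if op == "NOT":
        return not evaluate(args[0], words)
    raise ValueError(f"unknown operator: {op}")


# robots AND humans AND NOT (science AND fiction)
query = ("AND", "robots", "humans", ("NOT", ("AND", "science", "fiction")))

doc_a = {"technologies", "develop", "human", "humans", "robots"}
doc_b = {"robots", "humans", "science", "fiction"}

print(evaluate(query, doc_a))  # True
print(evaluate(query, doc_b))  # False
```

The result is a plain True or False per document, which makes the shortcoming noted above concrete: all matching documents are equally good as far as this model is concerned.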


Preliminary Answer: Inverted Index

§ We can build an index structure to speed up retrieval of documents that contain a specific word

§ Inverted index (also: inverted file) consists of

  § dictionary with all known words

  § posting lists with details about word occurrences

human → d1 d4 d6 d9 d11

robot → d4 d5 d7
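A minimal in-memory version of this structure can be sketched as below. The function name build_index and the three toy documents are made up for illustration. Each posting here holds only a document id, which is the simplest possible choice of "details about word occurrences".

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the sorted list of ids of documents containing it.

    docs: dict from document id to document text.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # The dict keys play the role of the dictionary, the values are posting lists.
    return {word: sorted(ids) for word, ids in postings.items()}


docs = {1: "human robot interaction", 2: "robot arm", 3: "human anatomy"}
index = build_index(docs)

print(index["human"])  # [1, 3]
print(index["robot"])  # [1, 2]
```

Looking up a word is now a single dictionary access instead of a scan over all documents.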


Preliminary Answer: Inverted Index

§ While conceptually simple, there are a lot of details to consider when implementing an inverted index

  § which information should be stored in the postings?

  § how can we compress the inverted index to save space, but also to speed up reading it from disk?

  § how can we efficiently process queries using an inverted index, maybe without reading the entire posting lists?
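Two of these questions can be illustrated on the posting lists from the previous slide (human: d1 d4 d6 d9 d11, robot: d4 d5 d7). The sketch below shows one common option for each: a linear merge to intersect sorted posting lists for an AND query, and gap (delta) encoding as a first step towards compression. Both are standard textbook techniques and are named here as assumptions, not as the methods this course will necessarily use.

```python
def intersect(a, b):
    """Intersect two sorted posting lists with a linear merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def to_gaps(postings):
    """Keep the first id, then store differences between consecutive ids."""
    return [cur - prev for cur, prev in zip(postings, [0] + postings[:-1])]


def from_gaps(gaps):
    """Invert to_gaps by accumulating the differences."""
    out, total = [], 0
    for gap in gaps:
        total += gap
        out.append(total)
    return out


human = [1, 4, 6, 9, 11]
robot = [4, 5, 7]

print(intersect(human, robot))  # [4]
print(to_gaps(human))           # [1, 3, 2, 3, 2]
```

Gaps are smaller numbers than the original ids, so a variable-length integer encoding can store them in fewer bytes. The merge stops as soon as one list is exhausted, so the longer list is not necessarily read to the end.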


Preliminary Answer: Precision and Recall

§ Let us assume that we know for all documents in our collection whether they are relevant or not to a query

§ We can distinguish between documents that are returned as results for the query by our system and documents that are not returned


Preliminary Answer: Precision and Recall

§ This gives us four different categories of documents

  § Relevant Results (true positives)

  § Irrelevant Results (false positives)

  § Relevant Non-Results (false negatives)

  § Irrelevant Non-Results (true negatives)

[Figure: documents marked tp, fp, fn, and tn, grouped by the two sets "Relevant Documents" and "Result"]
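The four categories follow directly from set operations on the relevant documents and the returned results. The sketch below uses a made-up collection of ten document ids, and the function name confusion_counts is an assumption for illustration.

```python
def confusion_counts(collection, relevant, returned):
    """Count tp, fp, fn, tn for one query.

    collection: set of all document ids
    relevant:   set of ids judged relevant to the query
    returned:   set of ids returned by the system
    """
    tp = len(relevant & returned)               # relevant results
    fp = len(returned - relevant)               # irrelevant results
    fn = len(relevant - returned)               # relevant non-results
    tn = len(collection - relevant - returned)  # irrelevant non-results
    return tp, fp, fn, tn


collection = set(range(1, 11))  # hypothetical documents 1..10
relevant = {1, 2, 3, 4}
returned = {3, 4, 5}

print(confusion_counts(collection, relevant, returned))  # (2, 1, 2, 5)
```

The four counts always add up to the size of the collection, which is a quick sanity check when computing them by hand.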


Preliminary Answer: Precision and Recall

§ Precision measures the system’s ability to return only relevant results

§ Recall measures the system’s ability to return all relevant results


\[ \text{Precision} = \frac{\#tp}{\#tp + \#fp} = \frac{\#\,\text{Relevant Results}}{\#\,\text{Results}} \]

\[ \text{Recall} = \frac{\#tp}{\#tp + \#fn} = \frac{\#\,\text{Relevant Results}}{\#\,\text{Relevant Documents}} \]
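Both measures can be computed from the counts of true positives, false positives, and false negatives. The sketch below uses made-up counts. Returning 0.0 when the denominator is zero is an assumption of this example, since the slides do not say how that case should be handled.

```python
def precision(tp, fp):
    """Fraction of returned results that are relevant."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0  # 0.0 by convention (assumption)


def recall(tp, fn):
    """Fraction of relevant documents that were returned."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0  # 0.0 by convention (assumption)


# Hypothetical query: 3 results returned, 2 of them relevant,
# and 2 further relevant documents that were not returned.
tp, fp, fn = 2, 1, 2

print(round(precision(tp, fp), 3))  # 0.667
print(recall(tp, fn))               # 0.5
```

Returning more documents can only keep recall the same or raise it, but it tends to lower precision, which is why the two measures are reported together.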
