1 Searching through the Internet Dr. Eslam Al Maghayreh Computer Science Department Yarmouk University

Dec 30, 2015


Dina Short
Transcript
Page 1:

Searching through the Internet

Dr. Eslam Al Maghayreh
Computer Science Department
Yarmouk University

Page 2:

Outline

  Introduction
  Information Retrieval
  Indexing
  Smarter Internet Searching
  Examples

Page 3:

Introduction

The Internet holds an enormous quantity of information:
  billions of web pages
  thousands of newsgroups

Two questions face any information seeker:
  (1) How can I find what I want?
  (2) How can I know that what I find is any good?

Page 4:

Information Retrieval

Goal = find documents relevant to an information need from a large document set.

[Figure: IR system diagram. Labels: Info. need, Query, IR system, Document collection, Retrieval, Answer list]

Page 5:

Example

[Figure labels: Google, Web]

Page 6:

Search Engine

Consists of:
  the interface you use to type in a query,
  an index of Web sites that the query is matched against, and
  a software program (called a spider or bot) that goes out on the Web and gets new sites for the index.
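The three parts named on this slide can be sketched in a few lines of Python. This is only a toy illustration: the in-memory WEB dictionary, the example URLs, and the function names are all assumptions made for the sketch, and a real spider would fetch pages over HTTP instead.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("yarmouk university computer science", ["b.example", "c.example"]),
    "b.example": ("information retrieval course", ["a.example"]),
    "c.example": ("flower arrangement", []),
    "d.example": ("unlinked page", []),
}

def crawl(seed):
    """Spider: follow links breadth-first from the seed, visiting each page once."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        pages[url] = text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Index: map each word to the set of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Interface: return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    return sorted(set.intersection(*(index.get(w, set()) for w in words)))

INDEX = build_index(crawl("a.example"))
print(search(INDEX, "yarmouk university"))
print(search(INDEX, "unlinked"))  # d.example is never linked to, so the spider never finds it
```

Note that a page no other page links to stays out of the index, which is one reason a crawler-built collection is never complete.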

Page 7:

IR problem

First applications: in libraries (1950s)

  ISBN: 0-201-12227-8
  Author: Salton, Gerard
  Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
  Publisher: Addison-Wesley
  Date: 1989
  Content: <Text>

External attributes and internal attribute (content)
  Search by external attributes = search in a database (DB)
  IR: search by content

Page 8:

Possible approaches

1. String matching (linear search in documents)
     Slow

2. Indexing
     Fast
     Flexible to further improvement
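The difference between the two approaches can be made concrete with a small sketch. The three-document collection below is invented for illustration, and the "index" is a plain inverted index (word to document IDs), which is one common way to index text, not necessarily the structure the slides have in mind.

```python
from collections import defaultdict

# Hypothetical three-document collection, used only for illustration.
DOCS = {
    "d1": "computer architecture and computer design",
    "d2": "computer network protocols",
    "d3": "flower arrangement basics",
}

def linear_search(docs, word):
    """String matching: scan every document at query time."""
    return sorted(doc_id for doc_id, text in docs.items() if word in text.split())

def build_index(docs):
    """Indexing: record once, for each word, which documents contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def index_search(index, word):
    """Look the word up directly instead of rescanning the documents."""
    return sorted(index.get(word, set()))

INDEX = build_index(DOCS)
print(linear_search(DOCS, "computer"))
print(index_search(INDEX, "computer"))
```

Both functions return the same documents. The linear search pays the scanning cost on every query, while the index pays it once when it is built.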

Page 9:

[Figure: IR process. Labels: Documents, Query, Indexing (applied to both), Document Representation, Query Representation, Index, Comparison Function, Results]

Page 10:

Main problems in IR

Query evaluation (or retrieval process)
  To what extent does a document correspond to a query?

System evaluation
  How good is a system?
  Are the retrieved documents relevant? (precision)
  Are all the relevant documents retrieved? (recall)
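Precision and recall as described here can be computed from two sets: what the system retrieved and what is actually relevant. The sketch below uses the standard definitions; the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved documents / all retrieved documents
    recall    = relevant retrieved documents / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 3 documents truly relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here half of the retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).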

Page 11:

Document indexing

Goal = find the important meanings and create an internal representation.

Factors to consider:
  Accuracy in representing meanings (semantics)
  Exhaustiveness (cover all the contents)
  Ease of manipulation by computer

What is the best representation of contents?
  Word: good coverage, not precise
  Phrase: poor coverage, more precise
  Concept: poor coverage, precise

[Figure: Coverage (Recall) versus Accuracy (Precision) for Word, Phrase, and Concept]

Page 12:

Keyword selection and weighting

How to select important keywords?
  Simple method: use middle-frequency words
  Search engines usually disregard minor words such as "the", "and", and "to"

[Figure: frequency and informativity plotted against word rank (1, 2, 3, …), with Max. and Min. marked]
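A minimal sketch of the "middle-frequency" idea: drop the minor words, count what is left, and keep only the words whose counts fall inside a band. The stop-word list and the thresholds (min_count, max_count) are arbitrary assumptions for this example; the slides do not specify them.

```python
from collections import Counter

# Assumed stop-word list; real search engines use longer ones.
STOP_WORDS = {"the", "and", "to", "of", "a", "in", "is"}

def select_keywords(text, min_count=2, max_count=3):
    """Keep words that are neither too rare nor too frequent in the text."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    return sorted(w for w, c in counts.items() if min_count <= c <= max_count)

TEXT = ("the network and the computer network to the computer "
        "network network router router cable")
print(select_keywords(TEXT))
```

In this text "network" is too frequent and "cable" too rare under the chosen thresholds, so only the middle-frequency words are kept.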

Page 13:

Result of indexing

Each document is represented by a set of weighted keywords (terms):
  D1 {(t1, w1), (t2, w2), …}

e.g.
  D1 {(comput, 0.2), (architect, 0.3), …}
  D2 {(comput, 0.1), (network, 0.5), …}
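One simple way to produce such weighted keyword sets is to use relative term frequency as the weight. That weighting scheme, the stop-word list, and the function name are assumptions for this sketch; the slides do not say how the weights in the example were obtained, and the sketch does not stem words (so it yields "computer" rather than "comput").

```python
from collections import Counter

def index_document(text, stop_words=frozenset({"the", "and", "to", "of"})):
    """Return {term: weight}, where weight is the term's share of the kept words."""
    terms = [w for w in text.lower().split() if w not in stop_words]
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}

D1 = index_document("the computer and the architecture of the computer")
for term, weight in sorted(D1.items()):
    print(f"({term}, {weight:.2f})")
```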

Page 14:

Retrieval

The problems underlying retrieval:

Retrieval model
  How is a document represented with the selected keywords?
  How are document and query representations compared to calculate a score?

Page 15:

Vector space model

Vector space = all the keywords encountered
  <t1, t2, t3, …, tn>

Document
  D = <a1, a2, a3, …, an>
  ai = weight of ti in D

Query
  Q = <b1, b2, b3, …, bn>
  bi = weight of ti in Q

R(D,Q) = Sim(D,Q)
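To see how weighted keyword sets become vectors, the sketch below builds the shared term list <t1, …, tn> from all terms encountered and then fills in a weight of 0 for terms a document or query does not contain. D1 and D2 reuse the weights from the indexing example; the query Q and the helper names are invented for illustration.

```python
def build_vocabulary(*weighted_term_maps):
    """All keywords encountered, in a fixed (sorted) order: <t1, t2, ..., tn>."""
    return sorted({term for weights in weighted_term_maps for term in weights})

def to_vector(weights, vocabulary):
    """Vector <a1, ..., an>: ai is the weight of ti, or 0.0 if ti is absent."""
    return [weights.get(term, 0.0) for term in vocabulary]

D1 = {"comput": 0.2, "architect": 0.3}
D2 = {"comput": 0.1, "network": 0.5}
Q = {"comput": 1.0, "network": 1.0}  # hypothetical query weights

VOCAB = build_vocabulary(D1, D2, Q)
print(VOCAB)
print(to_vector(D1, VOCAB))
print(to_vector(Q, VOCAB))
```

Once documents and the query share the same term order, any similarity function Sim(D,Q) can compare them position by position.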

Page 16:

Matrix representation

        t1     t2     t3     …    tn
  D1    a11    a12    a13    …    a1n
  D2    a21    a22    a23    …    a2n
  D3    a31    a32    a33    …    a3n
  …
  Dm    am1    am2    am3    …    amn
  Q     b1     b2     b3     …    bn

[Figure labels: Term vector space, Document space]

Page 17:

Some formulas for Sim

Dot product:
$$\mathrm{Sim}(D,Q) = \sum_i (a_i * b_i)$$

Cosine:
$$\mathrm{Sim}(D,Q) = \frac{\sum_i (a_i * b_i)}{\sqrt{\sum_i a_i^2 * \sum_i b_i^2}}$$

Dice:
$$\mathrm{Sim}(D,Q) = \frac{2 \sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2}$$

Jaccard:
$$\mathrm{Sim}(D,Q) = \frac{\sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i (a_i * b_i)}$$

[Figure: vectors D and Q drawn on axes t1 and t2]
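The four similarity measures (dot product, cosine, Dice, Jaccard) translate directly into code over weight vectors. The sketch below uses the standard vector forms of these measures; the example vectors D and Q are made up, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def dot(a, b):
    """Dot product: sum of ai * bi."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def dice(a, b):
    """Twice the dot product over the sum of the squared lengths."""
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    """Dot product over (sum of squared lengths minus the dot product)."""
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0

D = [1.0, 1.0, 0.0]  # hypothetical document vector
Q = [1.0, 0.0, 0.0]  # hypothetical query vector
for name, sim in [("dot", dot), ("cosine", cosine), ("dice", dice), ("jaccard", jaccard)]:
    print(f"{name}: {sim(D, Q):.4f}")
```

Unlike the raw dot product, the other three measures are normalized: each returns 1.0 when a vector is compared with itself.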

Page 18:

(Classic) Presentation of results

The query evaluation result is a list of documents, sorted by their similarity to the query.

E.g.
  doc1  0.67
  doc2  0.65
  doc3  0.54
  …
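Producing such a list is a sort by score in descending order. A minimal sketch, assuming the similarity scores have already been computed and stored in a dictionary (the SCORES values reuse the numbers from the example):

```python
def rank(scores):
    """Return (document, score) pairs, highest similarity first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

SCORES = {"doc3": 0.54, "doc1": 0.67, "doc2": 0.65}
for doc_id, score in rank(SCORES):
    print(f"{doc_id}  {score:.2f}")
```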

Page 19:

IR on the Web

  No stable document collection (spider, crawler)
  Duplication
  Huge number of documents
  Multimedia documents
  Multilingual problem
  …

Page 20:

Tips for smarter Internet searching

Use unique, specific terms.

Use the minus operator (-) to narrow the search:
  yarmouk -university

Use quotation marks to match the consecutive words of a phrase, such as "flower arrangement".

Enter a short question, such as "what time is it in amman?", "3.55*4.5-11 =", "who is the king of england?", or "what is the distance between the sun and earth?"

Page 21:

Smarter Internet Searching

inurl:test results
  Only test must be found in the web address (URL); results may appear anywhere on the page.

allinurl:test results
  Both test AND results must be found in the web address.

define:
  Provides definitions of the words, gathered from various online sources.
  Example: define: search engine

Page 22:

Smarter Internet Searching

allintext:
  Sometimes you get pages that do not contain your search term or phrase.
  Why? Because Google also returns pages that merely link to the target page.
  Use allintext: to get only those pages that contain your search terms in their text.

Page 23:

Smarter Internet Searching

allinanchor:
  Returns only pages whose incoming links contain your search terms in the link text, even if the terms do not appear on the pages themselves.
  This is the opposite of allintext.

site:
  Limits your search to a specific web site.
  Example:
    students site:yu.edu.jo
    students site:yu.edu.jo filetype:pdf
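The operators covered so far (quoted phrases, the minus operator, site:, filetype:) are just pieces of text in the query string, so a query can be assembled programmatically. The helper names below are hypothetical, and the base search URL with its q parameter is an assumption about the engine's web interface; only urllib.parse.urlencode is a real standard-library call.

```python
from urllib.parse import urlencode

def build_query(terms, phrase=None, exclude=(), site=None, filetype=None):
    """Assemble a query string from plain terms and optional operators."""
    parts = list(terms)
    if phrase:
        parts.append(f'"{phrase}"')                   # exact phrase in quotes
    parts.extend(f"-{word}" for word in exclude)      # minus operator
    if site:
        parts.append(f"site:{site}")                  # restrict to one web site
    if filetype:
        parts.append(f"filetype:{filetype}")          # restrict to one file type
    return " ".join(parts)

def search_url(query, base="https://www.google.com/search"):
    """Encode the query as a URL (the base address is an assumption)."""
    return f"{base}?{urlencode({'q': query})}"

query = build_query(["students"], site="yu.edu.jo", filetype="pdf")
print(query)
print(search_url(query))
```

The first line printed matches the second example on the slide; the URL form shows how spaces and colons are escaped when the query travels in a web address.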

Page 24:

Smarter Internet Searching

Don't use common words and punctuation.
  Exception: keep common words and punctuation marks when searching for a specific phrase inside quotes.

Most search engines do not distinguish between uppercase and lowercase.

Make the most of AutoComplete.

Page 25:

Smarter Internet Searching

The wildcard operator (*): Google calls it the fill-in-the-blank operator. For example, amusement * will return pages with amusement and any other term(s) the Google search engine deems relevant.

Using a wildcard (*) to stand for a character does not work in Google: cat* returns the same results as cat.

Page 26:

Smarter Internet Searching

Related sites:
  For example, related:www.yu.edu.jo can be used to find sites similar to the Yarmouk University site.

Specific file type:
  For example: information retrieval filetype:ppt

Page 27:

Examples

Searching for papers
  YU library
  Google Scholar

Searching for instructor resources
  Morgan Kaufmann
  Pearson

Page 28:

Examples

Searching for books to buy
  Amazon.com
  Ebay.com

Searching for items to buy
  Electronics: bestbuy.com

Searching for hotels
  Expedia.com
  Priceline.com
  Booking.com

Page 29:

Examples

Regional search
  Google jo

Searching for images
  Google images

Searching for a job
  Jobsinacademia.net
  Academickeys.com

Page 30:

The End.