ITEC547 Text Mining Web Technologies Search Engines

Dec 26, 2015


Tyler Simpson
Page 1: ITEC547 Text Mining Web Technologies Search Engines.

ITEC547 Text Mining

Web Technologies

Search Engines

Page 2: ITEC547 Text Mining Web Technologies Search Engines.

Outline of Presentation

1 Early Search Engines
2 Indexing Text for Search
3 Indexing Multimedia
4 Queries
5 Searching an Index

Page 3: ITEC547 Text Mining Web Technologies Search Engines.

1 Early Search Engines

History, Problems, Solutions …

Page 4: ITEC547 Text Mining Web Technologies Search Engines.

Search Engines

• Open Text (1995-1997)
• Magellan (1995-2001)
• Infoseek (Go) (1995-2001)
• Snap (NBCi) (1997-2001)
• Direct Hit (1998-2002)
• Lycos (1994; reborn 1999)
• WebCrawler (1994; reborn 2001)
• Yahoo (1994; reborn 2002)
• Excite (1995; reborn 2001)
• HotBot (1996; reborn 2002)
• Ask Jeeves (1998; reborn 2002)
• Teoma (2000-2001)
• AltaVista (1995- )
• LookSmart (1996- )
• Overture (1998- )


Page 5: ITEC547 Text Mining Web Technologies Search Engines.

Information Retrieval

• The indexing and retrieval of textual documents.

• Searching for pages on the World Wide Web is the most recent and perhaps most widely used IR application.

• Concerned first with retrieving documents that are relevant to a query.

• Concerned second with retrieving them efficiently from large sets of documents.

Page 6: ITEC547 Text Mining Web Technologies Search Engines.

Typical IR Task

• Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to the query.

Page 7: ITEC547 Text Mining Web Technologies Search Engines.

Typical IR System Architecture

[Diagram: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]

Page 8: ITEC547 Text Mining Web Technologies Search Engines.

EARLY SEARCH ENGINES

• Initially used in academic or specialized domains.
– Legal and specialized domains consume a large amount of textual info
• Use of expensive proprietary hardware and software
– High computational and storage requirements
• Boolean query model
• Iterative search model
– Fetch documents in many steps


Page 9: ITEC547 Text Mining Web Technologies Search Engines.

Medline of National Library of Medicine

• Developed in the late 1960s and made available in 1971
• Based on inverted file organization
• Boolean query language
– Queries broken down into numbered segments
– Results of one query segment are fed into the next query segment
• Each user assigned a time slot
– If the cycle is not completed in the time slot, the most recent results are returned
• Query and browse operations performed as separate steps
– Following a query, results are viewed
– Modifications start a new query-browse cycle

Page 10: ITEC547 Text Mining Web Technologies Search Engines.

Dialog

• Broader subject content
• Specialized collections of data, available for a fee
• Boolean query
– Each term numbered and executed separately, then combined
– Word patterns
– Proximity operator W for multiword queries

Page 11: ITEC547 Text Mining Web Technologies Search Engines.

2 Indexing Text for Search

Reduce retrieval time, improve hit accuracy

Page 12: ITEC547 Text Mining Web Technologies Search Engines.

Why Index

• Simplest approach: search the text sequentially
– The collection must be small
• Static, semi-static index
• Inverted index
– A mapping from content, such as words or numbers, to its locations in a database file, in a document, or in a set of documents.
• Documents/Positions in Documents/Weight
• Fuzzy/Stemming/Stopwords

Page 13: ITEC547 Text Mining Web Technologies Search Engines.

Example

• T0 : "it is what it is"
• T1 : "what is it"
• T2 : "it is a banana"

• "a": {2}
• "banana": {2}
• "is": {0, 1, 2}
• "it": {0, 1, 2}
• "what": {0, 1}

Inverted Index
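
The example above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes whitespace tokenization and lowercasing, and the function name `build_inverted_index` is made up for illustration, not taken from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: words are separated by whitespace, case is ignored.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
for word in sorted(index):
    print(word, index[word])
```

Document IDs start at 0 here, so the three texts are numbered 0, 1 and 2, matching the sets listed on the slide.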

Page 14: ITEC547 Text Mining Web Technologies Search Engines.

Example

• "a": {(2, 2)}
• "banana": {(2, 3)}
• "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
• "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
• "what": {(0, 2), (1, 0)}

• T0 : "it is what it is"
• T1 : "what is it"
• T2 : "it is a banana"

Full Inverted Index
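
A full (positional) index stores a (document, position) pair for every occurrence. The sketch below rebuilds the table above and shows one thing positions make possible, a two-word phrase lookup. Whitespace tokenization and the helper names `build_positional_index` and `phrase_docs` are assumptions for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return dict(index)

def phrase_docs(index, first, second):
    """Documents in which `second` occurs immediately after `first`."""
    second_hits = set(index.get(second, []))
    return sorted({d for d, p in index.get(first, []) if (d, p + 1) in second_hits})

docs = ["it is what it is", "what is it", "it is a banana"]
positional = build_positional_index(docs)
print(positional["is"])
print(phrase_docs(positional, "what", "is"))
```

The same positions could also support the proximity queries mentioned elsewhere in the deck, by comparing position differences instead of requiring adjacency.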

Page 15: ITEC547 Text Mining Web Technologies Search Engines.

Inverted Index

Page 16: ITEC547 Text Mining Web Technologies Search Engines.

Inverted Index

Page 17: ITEC547 Text Mining Web Technologies Search Engines.

Google Index

• A unique DocId associated with each URL
• Hit: a word occurrence
– wordID: 24-bit number
– Word position
– Font size relative to the rest of the document
– Plain hit : in the document
– Fancy hit : in the URL, title, anchor text, meta tags
• Word occurrences of a web page are distributed across a set of barrels

Page 18: ITEC547 Text Mining Web Technologies Search Engines.

Architecture of the 1st Google Engine

Page 19: ITEC547 Text Mining Web Technologies Search Engines.

Architecture of the 1st Google Engine

Page 20: ITEC547 Text Mining Web Technologies Search Engines.

Architecture of the 1st Google Engine

Page 21: ITEC547 Text Mining Web Technologies Search Engines.

3 Indexing Multimedia

Broadcast and compress for seamless delivery

Page 22: ITEC547 Text Mining Web Technologies Search Engines.

Indexing Multimedia

• Forming an index for multimedia
– Use context : surrounding text
– Add manual description
– Analyze automatically and attach a description

Page 23: ITEC547 Text Mining Web Technologies Search Engines.

4 Queries

Page 24: ITEC547 Text Mining Web Technologies Search Engines.

Queries

• Keywords
• Proximity
• Patterns
• Phrases
• Ranges
• Weights of keywords
• Spelling mistakes

Page 25: ITEC547 Text Mining Web Technologies Search Engines.

Queries

• Boolean query
– No relevance measure
– May be hard to understand
• Multimedia query
– Find images of Everest
– Find x-rays showing the human rib cage
– Find companies whose stock prices have similar patterns

Page 26: ITEC547 Text Mining Web Technologies Search Engines.

Relevance

• Relevance is a subjective judgment and may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her intended use of the information (information need).


Page 27: ITEC547 Text Mining Web Technologies Search Engines.

Keyword Search

• Simplest notion of relevance is that the query string appears verbatim in the document.

• Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
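
The two notions of relevance above can be contrasted in a short sketch. It assumes lowercased, whitespace-separated words; the function names are hypothetical, and the bag-of-words score is just a raw count, not a tuned ranking function.

```python
from collections import Counter

def verbatim_match(query, document):
    """Strict notion: the query string appears as-is in the document."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: total count of query words in the document, in any order."""
    counts = Counter(document.lower().split())
    return sum(counts[word] for word in query.lower().split())

docs = ["it is what it is", "what is it", "it is a banana"]
query = "what is it"
for doc in docs:
    print(repr(doc), verbatim_match(query, doc), bag_of_words_score(query, doc))
```

Only the second document contains the query verbatim, yet the first one gets the highest bag-of-words score because the query words occur in it more often.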


Page 28: ITEC547 Text Mining Web Technologies Search Engines.

Problems with Keywords

• May not retrieve relevant documents that include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
– “bat” (baseball vs. mammal)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)


Page 29: ITEC547 Text Mining Web Technologies Search Engines.

Relevance Feedback

• User enters query terms
– Keywords may be weighted or not
• Links returned
– Choose the relevant and irrelevant ones
• If there is no negative feedback, the second term is 0
• T’s are terms from the relevant and irrelevant sets marked by the user
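
The weighting formula this slide refers to is not part of the transcript. As a rough illustration only, the sketch below assumes a Rocchio-style update, which is one common way to do relevance feedback and not necessarily the formula used on the slide. The coefficients `alpha`, `beta`, `gamma` and the function name are assumptions.

```python
from collections import Counter

def rocchio_update(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Re-weight query terms from user feedback (assumed Rocchio-style update).

    query:      dict mapping term -> weight
    relevant:   list of documents (token lists) the user marked relevant
    irrelevant: list of documents (token lists) the user marked irrelevant
    """
    updated = Counter({term: alpha * w for term, w in query.items()})
    for docs, sign, coeff in ((relevant, 1, beta), (irrelevant, -1, gamma)):
        if not docs:
            continue  # no feedback of this kind: the term contributes 0
        for doc in docs:
            for term, count in Counter(doc).items():
                updated[term] += sign * coeff * count / len(docs)
    # Keep only terms that still carry positive weight.
    return {term: w for term, w in updated.items() if w > 0}

q = {"apple": 1.0}
pos_only = rocchio_update(q, [["apple", "fruit"]], [])
both = rocchio_update(q, [["apple", "fruit"]], [["apple", "computer"]])
print(pos_only)
print(both)
```

With no irrelevant documents marked, the negative term drops out entirely, which mirrors the remark above about missing negative feedback.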

Page 30: ITEC547 Text Mining Web Technologies Search Engines.

5 Searching an Index

Page 31: ITEC547 Text Mining Web Technologies Search Engines.

Searching an Inverted Index

• Tokenize the query, search index vocabulary for each query token

• Get a list of documents associated with each token

• Combine the list of documents using constraints specified in the query
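
The three steps above can be sketched as follows. The sketch assumes the index is a plain dict from word to a list of document IDs, and that the only query constraints are "all words" (AND) or "any word" (OR); the function name `search` and the `mode` parameter are made up for illustration.

```python
def search(index, query, mode="and"):
    """Look up each query token and combine the document lists."""
    tokens = query.lower().split()                       # 1. tokenize the query
    postings = [set(index.get(t, ())) for t in tokens]   # 2. one document set per token
    if not postings:
        return []
    combine = set.intersection if mode == "and" else set.union
    return sorted(combine(*postings))                    # 3. apply the query constraint

index = {"a": [2], "banana": [2], "is": [0, 1, 2], "it": [0, 1, 2], "what": [0, 1]}
print(search(index, "what is it"))
print(search(index, "banana what", mode="or"))
```

A token that is missing from the vocabulary yields an empty set, so an AND query containing it returns nothing.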

Page 32: ITEC547 Text Mining Web Technologies Search Engines.

Google Search

1. Tokenize query and remove stopwords
2. Translate the query words into wordIDs using the lexicon
3. For every wordID get the list of documents from the short inverted barrel and build a composite set of documents
4. Scan the composite list of documents
   i. Skip to next document if the current document does not match
   ii. Compute a rank using query and features
   iii. If no more documents go to step 3 and use full inverted barrels to find more docs
   iv. If there are sufficient # of docs go to step 5
5. Sort the final Document List by rank

Page 33: ITEC547 Text Mining Web Technologies Search Engines.

How are results ranked?

• Weight type
• Location: title, URL, anchor, body
• Size: relative font size
• Capitalization
• Count occurrences
• Closeness (proximity)

Page 34: ITEC547 Text Mining Web Technologies Search Engines.

Evaluation

• Response time
• Quality
• Recall : % of correct items that are selected
• Precision : % of selected items that are correct
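
As a small worked example of the two quality measures, the sketch below treats "selected" as the set of returned document IDs and "correct" as the set of truly relevant ones. The function name and the sample IDs are illustrative assumptions.

```python
def precision_recall(selected, correct):
    """Return (precision, recall) as fractions between 0 and 1."""
    selected, correct = set(selected), set(correct)
    hits = len(selected & correct)  # items that are both selected and correct
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall

# The engine returned documents 1-4; documents 2, 4 and 5 are actually relevant.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are correct (precision 50%), and two of the three correct documents were returned (recall about 67%).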

Page 35: ITEC547 Text Mining Web Technologies Search Engines.

Ranking Algorithms : Hyperlink

• Popularity Ranking
• Rank “popular” documents higher among set of documents with specific keywords.
• Determining “Popularity”
– Access rate?
   • How to get accurate data?
– Bookmarks?
   • Might be private?
– Links to related pages?
   • Using web crawler to analyze external links.

Page 36: ITEC547 Text Mining Web Technologies Search Engines.

Popularity/Prestige

• Transfer of prestige
– a link from a popular page x to a page y is treated as conferring more prestige to page y than a link from a not-so-popular page z.
• Count of In-links/Out-links

Page 37: ITEC547 Text Mining Web Technologies Search Engines.

Hypertext Induced Topic Search (HITS)

• The HITS algorithm:
– compute popularity using set of related pages only.
• Important web pages : cited by other important web pages or a large number of less-important pages
• Initially all pages have same importance

Page 38: ITEC547 Text Mining Web Technologies Search Engines.

Hubs and Authorities

• Hub - A page that stores links to many related pages
– may not in itself contain actual information on a topic
• Authority - A page that contains actual information on a topic
– may not store links to many related pages
• Each page gets a prestige value as a hub (hub-prestige), and another prestige value as an authority (authority-prestige).

Page 39: ITEC547 Text Mining Web Technologies Search Engines.

Hubs and Authorities in Twitter

Page 40: ITEC547 Text Mining Web Technologies Search Engines.

Hubs and Authorities algorithm

1. Locate and build the subgraph
2. Assign initial values to hub and authority scores of each node
3. Run a loop till convergence
   i. Set the authority score of node x to the sum of the hub scores of all nodes y that link to x
   ii. Set the hub score of node x to the sum of the authority scores of all nodes y that x links to
   iii. Normalize the hub and authority scores of all nodes
   iv. Check for convergence. Is the difference < threshold?
4. Return the list of nodes sorted in descending order of hub and authority scores
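
The loop above can be sketched in Python as follows. Several details are assumptions rather than something the slides specify: the graph is a dict from node to the list of nodes it links to, all scores start at 1.0, the hub update uses the freshly computed authority scores, normalization is by Euclidean length, and convergence is measured as the summed absolute change.

```python
import math

def hits(links, max_iter=100, tol=1e-8):
    """Return (hub, authority) score dicts for a link graph."""
    nodes = set(links) | {y for ys in links.values() for y in ys}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # i. authority(x) = sum of hub scores of the nodes linking to x
        new_auth = {n: 0.0 for n in nodes}
        for y, targets in links.items():
            for x in targets:
                new_auth[x] += hub[y]
        # ii. hub(x) = sum of authority scores of the nodes x links to
        new_hub = {x: sum(new_auth[y] for y in links.get(x, ())) for x in nodes}
        # iii. normalize both score vectors to unit length
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_auth = {n: v / a_norm for n, v in new_auth.items()}
        new_hub = {n: v / h_norm for n, v in new_hub.items()}
        # iv. stop once the scores barely change
        change = sum(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes)
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth

links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

In this toy graph `a1` is linked from both hubs, so it ends up as the top authority, while `h1` links to both authorities and ends up as the top hub.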

Page 41: ITEC547 Text Mining Web Technologies Search Engines.

Page Rank Algorithm

• Ranks based on citation statistics– In/out links

• Rank of a page depends on the ranks of the pages that link to it.

Page 42: ITEC547 Text Mining Web Technologies Search Engines.

Page Rank Algorithm

1. Locate and build the subgraph
2. Save the number of out-links from every node in an array
3. Assign a default PageRank to all nodes
4. Run a loop till convergence
   i. Compute a new PageRank score for every node: sum, over every node that links to it, that node's PageRank score divided by its number of out-links, then add the default rank source
   ii. Check for convergence. Is the difference between new and old PageRank < threshold?
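
A minimal sketch of this loop is given below. The slides do not state a damping factor or the size of the rank source, so the usual choice of a damping factor `damping = 0.85` with a rank source of `(1 - damping) / N` is an assumption here, as are the function name and the dict-of-lists graph format.

```python
def pagerank(links, damping=0.85, max_iter=100, tol=1e-8):
    """Return a dict of PageRank scores for a link graph."""
    nodes = sorted(set(links) | {y for ys in links.values() for y in ys})
    out_count = {n: len(links.get(n, ())) for n in nodes}  # step 2: out-link counts
    rank = {n: 1.0 / len(nodes) for n in nodes}            # step 3: default PageRank
    rank_source = (1.0 - damping) / len(nodes)             # assumed default rank source
    for _ in range(max_iter):                              # step 4: iterate
        new_rank = {n: rank_source for n in nodes}
        for y in nodes:
            for x in links.get(y, ()):
                # Each page y passes its rank on, split evenly over its out-links.
                new_rank[x] += damping * rank[y] / out_count[y]
        change = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if change < tol:
            break
    # Note: pages without out-links simply lose their rank in this sketch.
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
print(sorted(rank.items(), key=lambda item: item[1], reverse=True))
```

In this three-page graph, `c` receives links from both `a` and `b` and comes out on top; `a` follows closely because it receives all of `c`'s rank.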

Page 43: ITEC547 Text Mining Web Technologies Search Engines.

But wait… There’s Homework!
1- Explain web crawling and the general architecture of a web crawler.
2- What is the use of robots.txt?
3- Find a web crawler code and explain how it can be used to collect information on ?
4- Crawl the social media to collect emu related info. (if you want bonus)