Lecture 5: EITN01 Web Intelligence and Information Retrieval
Anders Ardö
EIT – Electrical and Information Technology, Lund University
February 14, 2013
A. Ardö, EIT Lecture 5: EITN01 Web Intelligence and Information Retrieval February 14, 2013 1 / 65
Outline
1 Reiteration
2 Web search
3 Web search engines
4 Web robots, crawler
5 Focused Web crawling
6 Web search vs Browsing
7 Privacy, Filter bubble
Previous lecture
LSI (Latent Semantic Indexing) - concepts
The term-document matrix is decomposed into three other matrices of a special form by use of Singular Value Decomposition (SVD)
The matrices show a breakdown of the original relationships into linearly independent components
Many of these components are very small and can be ignored, leading to an approximate model that contains fewer dimensions
SVM (Support Vector Machines) - classification
LSI - reduced SVD
Reduce dimensionality => retain only the k largest singular values
Saves space
M x N matrix A (term/document) reduced SVD:
A ≈ A_k = U_k Σ_k V_k^T
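As an illustration (not from the slides), the reduced SVD can be computed with NumPy; the small term-document matrix below is made up for the example:

```python
import numpy as np

# A small made-up M x N term-document matrix (M = 4 terms, N = 3 documents).
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # retain only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # A ≈ A_k = U_k Σ_k V_k^T

# By the Eckart-Young theorem, A_k is the best rank-k approximation of A,
# and the Frobenius-norm error equals the dropped singular value s[2].
print(np.linalg.matrix_rank(A_k))  # 2
```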
Text classification
Goal: classify documents into predefined categories
Examples:
Subject classification: ’business’, ’sports’, ’engineering’, ...
Review classification: ’positive’ or ’negative’
Web page classification: ’Personal homepage’ or others
Approach: supervised machine learning (⇒ SVM)
Each predefined category needs a set of training documents
From the training sets, train a classifier
Use the classifier to classify new documents
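The train-then-classify pipeline can be sketched with scikit-learn (an assumption; the lecture does not prescribe a library), with a tiny invented training set for two of the example categories:

```python
# Sketch of supervised text classification, assuming scikit-learn is
# installed; the training documents and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["stock markets fell sharply today",
              "the team won the championship game",
              "quarterly profits beat expectations",
              "the striker scored twice in the final"]
train_labels = ["business", "sports", "business", "sports"]

vectorizer = TfidfVectorizer()          # documents -> tf-idf vectors
X_train = vectorizer.fit_transform(train_docs)

classifier = LinearSVC()                # train an SVM classifier
classifier.fit(X_train, train_labels)

# use the trained classifier on a new document
X_new = vectorizer.transform(["profits and markets are up"])
print(classifier.predict(X_new))        # likely ['business']
```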
Automated Classification technologies
Machine learning methods
Statistical models (Bayes, SVM, ...)
ANN (artificial neural networks)
Information Retrieval methods
Clustering (no predefined categories)
SVM
Support vectors
SVM maximizes the margin around the separating hyperplane
Decision function specified by support vectors (from training examples)
Quadratic programming problem
Hot text classification method
Lecture 5 agenda
Chapters 2, 11, 12 in “Modern Information Retrieval”
1 Reiteration
2 Web search
3 Web search engines
4 Web robots, crawler
5 Focused Web crawling
6 Web search vs Browsing
7 Privacy, Filter bubble
Outline
1 Reiteration
2 Web search
3 Web search engines
4 Web robots, crawler
5 Focused Web crawling
6 Web search vs Browsing
7 Privacy, Filter bubble
Why Web search ...
Explosion of (digital) information within all types of information collections
Harder and harder to follow the information flow
A faster way to find relevant information when it is needed
Challenges:
Distributed, dynamic data
Large volume
Unstructured, heterogeneous data
Size of the Web
No one knows
Estimates (text pages):
2005: ’more than 11.5 billion’
2007: ’more than 20 billion’
2010: ’20 - 55 billion’
Google claims to know of 10^12 unique URLs (text, images, ...)
Important questions
How do I find relevant information?
How do I navigate the digital information landscape?
How to structure and organize information to ease knowledge extraction?
How to create collections, properly organized, with relevant material?
How to keep collections updated?
Outline
1 Reiteration
2 Web search
3 Web search engines
4 Web robots, crawler
5 Focused Web crawling
6 Web search vs Browsing
7 Privacy, Filter bubble
Search Engine - Basic structure
[Diagram: basic search-engine structure. A Web robot crawls the Web over HTTP and collects pages into a database; a Web browser sends a query over HTTP to a CGI script, which answers from the database through the interface. Concerns: size, efficiency, response time.]
Software crawling the web (much like a human clicking on links)
Collecting all found web pages into a database (IR system)
Offering a web interface to that database
Size of search engines
Not published
Guesses: 1 - 20 - 50 billion pages
Overlap between search engines is small, ≈ 5 - 10 %
Google
Started in the late 1990s
Estimated 450,000 low-cost commodity servers (2006)
1 trillion links to web pages (July 2008)
“over 8 billion web pages”
Estimate: 40 billion pages?
Goal is to index all the world’s data
Google Flu Trends
Google Servers
From Jeff Dean http://www.odbms.org/download/dean-keynote-ladis2009.pdf
Sideline - Large server-clusters
The Joys of Real Hardware
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packetloss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for dns
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.
From Jeff Dean http://www.odbms.org/download/dean-keynote-ladis2009.pdf
Twitter
Broadcast what’s on your mind
Max 140 chars
27.3 M tweets per day (November 2009)
250 M tweets per day (October 2011)
Twitter moods (J. Bollen, H. Mao, X. Zeng: “Twitter mood predicts the stock market”, http://arxiv.org/abs/1010.3003)
Search engine examples
Google, Bing, Yahoo
Search Engine - Application
[Diagram: search engine as an application. A Web browser speaks HTTP to a Web server running a CGI script (CGI/HTML); the script queries the database of Web pages, e.g. via SRU/XML over HTTP (or Z39.50, ASN.1, ...).]
Overlap between search engines
Comparing Google, Yahoo!, and Ask Jeeves, using 10,316 queries and the hits from the first result page.
Search results: only in 1 engine: 85 %; shared by 2: 12 %; in all 3: 3 %
The metasearch engine Dogpile found 68 % of all results.
Amanda Spink, Bernard J. Jansen, Vinish Kathuria, Sherry Koshman (2006), "Overlap among major web search engines", Internet Research, Vol. 16, Iss. 4, pp. 419-426, ISSN 1066-2243, DOI: 10.1108/10662240610690034
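The unique/shared percentages can be computed by counting, over the union of all results, how many engines returned each URL; the three result sets below are hypothetical:

```python
# Hypothetical first-page result sets from three engines for one query,
# showing how "only in 1 / shared by 2 / in all 3" shares are computed.
from collections import Counter

results = [
    {"u1", "u2", "u3", "u4"},  # engine A
    {"u2", "u5", "u6"},        # engine B
    {"u2", "u4", "u7"},        # engine C
]

counts = Counter(url for engine in results for url in engine)
total = len(counts)  # distinct URLs over all engines

only_in_1 = sum(c == 1 for c in counts.values()) / total
shared_by_2 = sum(c == 2 for c in counts.values()) / total
in_all_3 = sum(c == 3 for c in counts.values()) / total
print(only_in_1, shared_by_2, in_all_3)
```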
Meta Search Engine - Application
MetaSearch Engine
It is software that simultaneously searches several individual search engines,
collects, reviews, and ranks their answers,
and gives them back to the user in a merged/condensed form.
Metasearch engines are no better than the quality of the search-engine databases they obtain results from.
MetaSearch engines
Simultaneously search several individual search engines
Query translation
Result merging:
Simple merge
Duplicate detection
Check availability of page
tf-idf/similarity ranking
Position based
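Two of the merging strategies listed, duplicate detection and position-based ranking, can be combined in a few lines; `merge_results`, the scoring rule, and the URLs are illustrative, not from the slides:

```python
# Sketch of position-based result merging with duplicate detection;
# engine result lists and URLs are made up.
def merge_results(result_lists):
    """Merge ranked URL lists from several engines.

    Each URL gets a score 1/(rank+1) per engine that returned it;
    duplicates are detected by URL and their scores are summed.
    """
    scores = {}
    for urls in result_lists:
        for rank, url in enumerate(urls):
            scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

merged = merge_results([
    ["http://a.example/", "http://b.example/"],   # engine 1
    ["http://b.example/", "http://c.example/"],   # engine 2
])
print(merged)  # b.example ranks first: found by both engines
```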
MetaSearch Engine examples
Yippy, Dogpile, DuckDuckGo
Special (Vertical) search engines
Prices - ex: Prisjakt, PriceRunner, ...
http://www.pricerunner.co.uk/
http://www.prisjakt.nu/
Jobs - ex: freejobsearch, jobspider, ...
http://freejobsearch.org/
http://www.jobspider.com/
Housing - ex: rightmove, hemnet, bovision, ...
http://www.rightmove.co.uk/
http://www.hemnet.se/
http://bovision.se/
... and so on ...
Other Search Engines
Wolfram Alpha
Wolfram|Alpha introduces a fundamentally new way to get knowledge and answers — not by searching the web, but by doing dynamic computations based on a vast collection of built-in data, algorithms, and methods.
From http://www.wolframalpha.com/about.html
Wolfram Alpha example
Web Robot - Problems
Network failures
Erroneous URLs
Unreachable servers
Password protection
Spider traps
Recursive URLs
Character set encodings
Same page, different URLs - deduplication
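One standard mitigation for the "same page, different URLs" problem is to canonicalize URLs before deduplication; the normalization rules below are an illustrative subset, not a complete scheme:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Normalize a URL so trivially different forms compare equal.

    Illustrative subset of rules: lowercase the host, drop a default
    port, treat an empty path as "/", and drop the fragment.
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[: -len(":80")]   # drop the default HTTP port
    if not path:
        path = "/"                       # empty path means the root page
    return urlunsplit((scheme, netloc, path, query, ""))

print(canonicalize("HTTP://Example.COM:80"))  # http://example.com/
```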
Web Robot - More Problems
Hidden Web
Databases
Dynamic scripts
... ?
Web Robot - Traversal algorithms
Depth first (stack, LIFO queue)
Breadth first (FIFO queue)
Relevance order (how?)
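The traversal order falls out of the data structure holding the frontier; a minimal sketch (the URLs and relevance scores are made up):

```python
# Sketch (not from the slides): the frontier's data structure
# determines the crawler's traversal order.
from collections import deque
import heapq

frontier = deque(["url1", "url2", "url3"])  # discovered in this order

next_dfs = frontier[-1]  # depth first: LIFO stack, take the newest URL
next_bfs = frontier[0]   # breadth first: FIFO queue, take the oldest URL

# relevance order: a priority queue keyed on an (assumed) relevance
# score; heapq is a min-heap, so scores are negated to pop the most
# relevant URL first
scored = [(-0.3, "url1"), (-0.8, "url2"), (-0.5, "url3")]
heapq.heapify(scored)
next_relevant = heapq.heappop(scored)[1]

print(next_dfs, next_bfs, next_relevant)  # url3 url1 url2
```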
Outline
1 Reiteration
2 Web search
3 Web search engines
4 Web robots, crawler
5 Focused Web crawling
6 Web search vs Browsing
7 Privacy, Filter bubble
Focused Crawling
[Diagram: focused-crawler architecture. Seed URLs initialize the frontier (a list of unvisited pages). A URL is taken from the frontier and the Web page is fetched and analyzed; links within the focus pass the focus filter back into the frontier, and pages within the focus are saved to the database (a repository of visited pages); material not in focus is discarded.]
Focus: Domain, Project, Country, Region, Topic, Subject
Topic-specific Web-crawling
Problem: construct a topic-specific search engine (ex. Carnivorous plants)
Solution: make a Web crawler walk through the Internet and collect all pages with the topic ’Carnivorous plants’
Easier said than done!
Conditions
Page is about Carnivorous plants
=⇒ automated subject classification
There are many pages on the Internet
=⇒ where to start?
=⇒ look only at interesting links
=⇒ take the most important pages first
Automated Classification technologies
Machine learning methods
Statistical models (Bayes, SVM, ...)
ANN (artificial neural networks)
Information Retrieval methods
Clustering (no predefined categories)
Topic Filter
Conditions
Page is about Carnivorous plants
=⇒ automated subject classification
There are many pages on the Internet
=⇒ where to start?
=⇒ look only at interesting links
=⇒ take the most important pages first
Internet is Big
[Diagram: the crawl loop. Start from a first page; if OK, save it; choose a new page among its links; check whether the new page is OK; save and continue.]
Basic Algorithm
Add good start pages (seeds) to the frontier
LOOP:
Choose a page among the links
Page OK?
Save the page
Add all its links to the frontier
Go to LOOP
Save (databases):
All relevant pages (search-engine database)
All analyzed pages (seen pages)
All new links (frontier)
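The basic algorithm can be sketched as a short function; `fetch_links` and `is_relevant` are hypothetical stand-ins for the HTTP fetch and the topic filter:

```python
# Minimal sketch (not a production crawler) of the basic algorithm.
def crawl(seeds, fetch_links, is_relevant, max_pages=100):
    frontier = list(seeds)        # unvisited URLs
    seen = set(seeds)             # all analyzed pages
    collected = []                # relevant pages (search-engine database)
    while frontier and len(collected) < max_pages:
        url = frontier.pop(0)     # choose a page (here: breadth first)
        if is_relevant(url):
            collected.append(url) # save the page
        for link in fetch_links(url):
            if link not in seen:  # add new links to the frontier
                seen.add(link)
                frontier.append(link)
    return collected

# Usage on a tiny made-up link graph; page "b" is out of focus.
links = {"seed": ["a", "b"], "a": ["b", "c"]}
pages = crawl(["seed"], lambda u: links.get(u, []), lambda u: u != "b")
print(pages)  # ['seed', 'a', 'c']
```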
Focused Crawling
[Diagram: focused-crawler architecture, repeated: seed URLs fill the frontier of unvisited pages; fetched pages are analyzed, in-focus links go back into the frontier, and in-focus pages are saved to the database of visited pages.]
Problems I
Which new page?
Problems II
Isolated pages
Problems III
Non-relevant pages (“blocking”)
Conditions
Page is about Carnivorous plants
=⇒ automated subject classification
There are many pages on the Internet
=⇒ where to start?
=⇒ look only at interesting links
=⇒ take the most important pages first
Compromises
Precision/recall
Completeness/speed
Outline
1 Reiteration
2 Web search
3 Web search engines
4 Web robots, crawler
5 Focused Web crawling
6 Web search vs Browsing
7 Privacy, Filter bubble
Browsing
No idea how to formulate a query
Willing to invest some time
Structure: flat vs hierarchy
Manual vs automatic classification
Lack of a standard classification/terminology
Precision - NOT recall
Browsing vs search
Search:
LOTS of data
Unstructured
Unrelated items clutter results
Browsing:
Small amounts of data
Hierarchically structured
Quality assessed
Browsing examples
Dmoz (ODP), Yahoo! Directory
Outline
1 Reiteration
2 Web search
3 Web search engines
4 Web robots, crawler
5 Focused Web crawling
6 Web search vs Browsing
7 Privacy, Filter bubble
Filter bubble
What do search engines or social sites know about me?
At least location, search history, click history, likes, and more ...
They personalize what is shown (search results, ...) using this info
Show us what we want/like to see - algorithmically
... and not what is relevant (who decides that?)
Problem?
Filter bubble example I
From http://www.thefilterbubble.com/what-is-the-internet-hiding-lets-find-out
Filter bubble example II
From http://www.thefilterbubble.com/what-is-the-internet-hiding-lets-find-out
Google: you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.
Facebook: you grant us a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook (IP License).
Privacy
Search history, clicks, photos, documents, comments, ...
leads to a profile
that can be used for ads or sold, or even stolen,
which might lead to it ending up in unwanted places
and being used against you
Beware!
Be aware!
Read:
T. Berners-Lee, “Long Live the Web: A Call for Continued Open Standards and Neutrality”, Scientific American, November 22, 2010.
http://www.scientificamerican.com/article.cfm?id=long-live-the-web