Lecture 5: EITN01 Web Intelligence and Information Retrieval

Anders Ardö

EIT – Electrical and Information Technology, Lund University

February 14, 2013

A. Ardö, EIT Lecture 5: EITN01 Web Intelligence and Information Retrieval February 14, 2013 1 / 65

Outline

1 Reiteration

2 Web search

3 Web search engines

4 Web robots, crawler

5 Focused Web crawling

6 Web search vs Browsing

7 Privacy, Filter bubble

Previous lecture

LSI (Latent Semantic Indexing) - concepts
  The term-document matrix is decomposed into three other matrices of a special form by use of Singular Value Decomposition (SVD)
  The matrices show a breakdown of the original relationships into linearly independent components
  Many of these components are very small and can be ignored - leading to an approximate model that contains fewer dimensions.

SVM (Support Vector Machines) - classification

LSI - reduced SVD

Reduce dimensionality => retain only the k largest singular values
Saved space

M x N matrix A (term/document) reduced SVD:

$A \approx A_k = U_k \Sigma_k V_k^T$
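The "saved space" can be made concrete by writing out the shapes of the three factors. This is only a bookkeeping sketch: it assumes $\Sigma_k$ is stored as its k diagonal values and compares against a dense A (real term-document matrices are usually sparse), and the example sizes are made up for illustration.

```latex
\begin{aligned}
A &\in \mathbb{R}^{M \times N}, \quad
U_k \in \mathbb{R}^{M \times k}, \quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
V_k \in \mathbb{R}^{N \times k} \\
\text{storage}(A) &= MN, \qquad
\text{storage}(A_k) = Mk + k + Nk = k\,(M + N + 1) \\
\text{e.g. } M &= 10^5,\ N = 10^4,\ k = 100: \quad
MN = 10^9, \qquad k\,(M + N + 1) = 100 \cdot 110001 \approx 1.1 \cdot 10^7
\end{aligned}
```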

LSI - Concept extraction

use rows of $\Sigma_k^{-1} U_k^T$ as concepts

Concept 1                  Concept 2
carlstrom       0.0354     regia            0.0523
rick            0.0354     oct              0.0521
amelnx          0.0354     chrisp           0.0314
advmar          0.0354     problems         0.0273
cuttings        0.0329     pm               0.0265
september       0.0322     ip-forum         0.0264
miller          0.0265     stratification   0.0261
re             -0.0287     uk              -0.0267
wants          -0.0303     bladderwort     -0.0361
aquatic        -0.0363     cuttings        -0.0317
rotundifolia   -0.0371
bladderwort    -0.0605

HARD to interpret

Text classification

Goal: classify documents into predefined categories
Examples
  Subject classification: 'business', 'sports', 'engineering', ...
  Review classification: 'positive' or 'negative'
  Web page classification: 'Personal homepage' or others

Approach: supervised machine learning (⇒ SVM)
  Each predefined category needs a set of training documents
  From training sets train a classifier
  Use classifier to classify new documents
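The train-then-classify workflow above can be sketched in a few lines. Note the swap: the slides point to SVM, but to stay short and dependency-free this sketch uses a multinomial naive Bayes classifier instead. The function names, the whitespace tokenizer and the four training reviews are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Train from (text, category) pairs; returns the model as a dict."""
    doc_counts = Counter()                 # documents per category
    term_counts = defaultdict(Counter)     # term frequencies per category
    vocabulary = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for term in tokenize(text):
            term_counts[category][term] += 1
            vocabulary.add(term)
    return {"docs": doc_counts, "terms": term_counts, "vocab": vocabulary}


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    best_category, best_score = None, -math.inf
    for category, n_docs in model["docs"].items():
        counts = model["terms"][category]
        total_terms = sum(counts.values())
        score = math.log(n_docs / total_docs)          # class prior
        for term in tokenize(text):
            if term in model["vocab"]:                 # ignore unseen terms
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((counts[term] + 1) / (total_terms + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training set: one small set of documents per category.
TRAINING = [
    ("great plot and wonderful acting", "positive"),
    ("wonderful film great fun", "positive"),
    ("boring plot and terrible acting", "negative"),
    ("terrible film boring and slow", "negative"),
]
model = train(TRAINING)
```

With this toy model, `classify(model, "great fun")` picks 'positive' and `classify(model, "slow and boring")` picks 'negative'; a real classifier needs far more training documents per category.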

Automated Classification technologies

Machine learning methods
  Statistical models (Bayes, SVM, ...)
  ANN
Information Retrieval methods
  Clustering (no predefined categories)
Library Science methods
  String matching + Thesaurus

SVM

[Figure: support vectors]

SVM maximizes the margin around the separating hyperplane
Decision function specified by support vectors (from training examples)
Quadratic programming problem

A popular method for text classification
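For reference, the three points above correspond to the standard hard-margin formulation, written out below. It assumes linearly separable training examples $x_i$ with labels $y_i \in \{-1, +1\}$; the symbols $w$, $b$ and $\alpha_i$ do not appear on the slides and are the usual textbook notation.

```latex
\begin{aligned}
&\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n \\
&\text{margin width} = \frac{2}{\lVert w \rVert}, \qquad
f(x) = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i\, y_i\, (x_i \cdot x) + b\Big)
\end{aligned}
```

Minimizing $\lVert w \rVert$ maximizes the margin, the objective is quadratic with linear constraints (hence a quadratic programming problem), and only the support vectors ($\alpha_i > 0$) contribute to the decision function $f$.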

Lecture 5 agenda

Chapters 2, 11, 12 in “Modern Information Retrieval”

1 Reiteration

2 Web search

3 Web search engines

4 Web robots, crawler

5 Focused Web crawling

6 Web search vs Browsing

7 Privacy, Filter bubble

Why Web search ...

Explosion of (digital) information within all types of information collections
Harder and harder to follow information flow
Faster way to find relevant information when it's needed
Challenges
  Distributed, dynamic data
  Large volume
  Unstructured, heterogeneous data

Size of the Web

no one knows
estimates (text pages)
  2005 'more than 11.5 billion'
  2007 'more than 20 billion'
  2010 '20 - 55 billion'

Google claims to know of $10^{12}$ unique URLs (text, images, ...)

Important questions

How do I find relevant information?
How do I navigate the digital information landscape?
How to structure and organize information to ease knowledge extraction?
How to create collections, properly organized, with relevant material?
How to keep collections updated?

Search Engine - Basic structure

[Figure: search engine block diagram. Labels: The Web, Web robot, Web pages, Database, Interface (CGI-script), Web browser, Query, Answer, HTTP. Concerns: size, efficiency, response time]

software crawling the web (much like a human clicking on links)
collect all found web-pages into a database (IR system)
offer a web-interface to that database
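The "database (IR system)" part of this structure can be illustrated with a toy inverted index. This is a minimal sketch, not how any particular engine works: the pages are hard-coded stand-ins for what a web robot would collect, queries are plain boolean AND without ranking, and all names and URLs are made up.

```python
from collections import defaultdict


def build_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return URLs containing every query term (boolean AND), sorted."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)


# Hypothetical pages standing in for what a web robot has collected.
PAGES = {
    "http://example.org/a": "carnivorous plants eat insects",
    "http://example.org/b": "garden plants need water",
    "http://example.org/c": "insects in the garden",
}
index = build_index(PAGES)
```

Here `search(index, "plants")` returns pages a and b, and `search(index, "garden insects")` returns only page c. The web interface would wrap `search` behind an HTTP endpoint.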

Size of search engines

not published
guesses 1 - 20 - 50 billion pages
overlap between search engines is small ≈ 5 - 10 %

Google

started in the late 1990s
Estimated 450,000 low-cost commodity servers (2006)
1 trillion links to web pages (July 2008)
“over 8 billion web pages”
estimate 40 billion pages?
goal is to index all the world’s data
Google Flu Trends

Google Servers

From Jeff Dean http://www.odbms.org/download/dean-keynote-ladis2009.pdf

Sideline - Large server-clusters

The Joys of Real Hardware
Typical first year for a new cluster:

  ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  ~1 network rewiring (rolling ~5% of machines down over 2-day span)
  ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  ~5 racks go wonky (40-80 machines see 50% packetloss)
  ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  ~12 router reloads (takes out DNS and external vips for a couple minutes)
  ~3 router failures (have to immediately pull traffic for an hour)
  ~dozens of minor 30-second blips for dns
  ~1000 individual machine failures
  ~thousands of hard drive failures
  slow disks, bad memory, misconfigured machines, flaky machines, etc.

Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.

From Jeff Dean http://www.odbms.org/download/dean-keynote-ladis2009.pdf

Twitter

broadcast what’s on your mind
max 140 chars
27.3 M tweets per day (November, 2009)
250 M tweets per day (October, 2011)
Twitter moods
  (J. Bollen, H. Mao, X. Zeng: “Twitter mood predicts the stock market” http://arxiv.org/abs/1010.3003)

Search engine examples

Google, Bing, Yahoo

Search Engine - Application

[Figure: search engine used from applications. Labels: Web browser, Web server, CGI-script, Database, Web pages. Protocols and formats: HTTP, CGI/HTML, SRU/XML, (Z39.50 ...), (ASN, ...)]

Overlap between search engines

Compare Google, Yahoo, and Ask Jeeves.
Using 10316 queries and hits from first result page.

Search results
Only in 1    Shared by 2    In all 3
85 %         12 %           3 %

MetaSearch engine Dogpile found 68 % of all results.

Amanda Spink, Bernard J. Jansen, Vinish Kathuria, Sherry Koshman, (2006) "Overlap among major web search engines", Internet Research, Vol. 16 Iss: 4, pp. 419 - 426, ISSN: 1066-2243
DOI: 10.1108/10662240610690034

Meta Search Engine - Application

MetaSearch Engine

software that simultaneously searches several individual search engines
collects, reviews and ranks their answers
and gives them back to the user in a merged/condensed form
the results are no better than the quality of the search engine databases they are obtained from

MetaSearch engines

Simultaneously search several individual search engines
Query translation
Result merging
  Simple merge
  Duplicate detection
  Check availability of page
  tf-idf/similarity ranking
  Position based
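Result merging with duplicate detection and position-based ranking can be sketched as follows. The scoring rule (sum of 1/rank over all engines that return a URL) is just one plausible position-based scheme chosen for illustration, not something the slides prescribe. The duplicate detection is deliberately crude, and the engine result lists are invented.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge ranked URL lists; a URL's score is the sum of 1/rank over engines."""
    scores = defaultdict(float)
    for results in result_lists:
        seen = set()                        # ignore repeats within one engine
        for rank, url in enumerate(results, start=1):
            key = url.rstrip("/").lower()   # crude duplicate detection
            if key in seen:
                continue
            seen.add(key)
            scores[key] += 1.0 / rank
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical answers from two search engines for the same query.
ENGINE_A = ["http://a.example/", "http://b.example/", "http://c.example/"]
ENGINE_B = ["http://b.example", "http://d.example/", "http://a.example/"]
merged = merge_results([ENGINE_A, ENGINE_B])
```

Pages returned by both engines (a and b) end up ahead of pages returned by only one, which is the usual intuition behind position-based merging.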

MetaSearch Engine examples

Yippy, Dogpile, DuckDuckGo

Special (Vertical) search engines

prices
  ex: prisjakt, PriceRunner, ...
  http://www.pricerunner.co.uk/
  http://www.prisjakt.nu/
jobs
  ex: freejobsearch, jobspider, ...
  http://freejobsearch.org/
  http://www.jobspider.com/
Housing
  ex: rightmove, hemnet, bovision, ...
  http://www.rightmove.co.uk/
  http://www.hemnet.se/
  http://bovision.se/
... and so on ...

Other Search Engines

Wolfram Alpha

Wolfram|Alpha introduces a fundamentally new way to get knowledge and answers - not by searching the web, but by doing dynamic computations based on a vast collection of built-in data, algorithms, and methods.
From http://www.wolframalpha.com/about.html

Wolfram Alpha example

Web Robot - Basic architecture

Spider, Crawler, Robot, agent, ...

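[Figure: crawler loop. Seed URLs fill the frontier (list of unvisited pages); Get URL from the frontier, Fetch the Web page from the Web, Analyze it, Save it in the database (repository of visited pages), and add the extracted links to the frontier as URLs]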
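The loop in the figure (get URL, fetch, analyze, save, add links) can be sketched without any network access. Everything here is an assumption made for illustration: the in-memory `TOY_WEB` replaces real HTTP fetching and HTML parsing, and the function names are invented.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://seed.example/": ("start page", ["http://a.example/", "http://b.example/"]),
    "http://a.example/": ("page a", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page b", ["http://seed.example/"]),
    "http://c.example/": ("page c", []),
}


def fetch(url):
    """Stand-in for an HTTP GET; returns (text, links) or None if unreachable."""
    return TOY_WEB.get(url)


def crawl(seeds, max_pages=100):
    """Run the get-URL / fetch / analyze / save loop; return the repository."""
    frontier = deque(seeds)          # list of unvisited pages
    seen = set(seeds)                # every URL ever put on the frontier
    repository = {}                  # visited pages: URL -> text
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()     # get URL
        page = fetch(url)            # fetch Web page
        if page is None:
            continue                 # unreachable server or erroneous URL
        text, links = page           # analyze (here the links are given)
        repository[url] = text       # save
        for link in links:
            if link not in seen:     # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return repository


repository = crawl(["http://seed.example/"])
```

The `seen` set is what keeps the link cycle between the seed and page b from looping forever; a real robot would also need politeness delays and error handling.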

Web Robot - Types

(From R. Baeza-Yates, B. Ribeiro-Neto: "Modern Information Retrieval", 2nd Ed, 2010)

Web Robot - Ethics

Important - BE NICE
Do not overload network or server
Robot exclusion protocol
  check for http://www.foobar.com/robots.txt
  HTML meta-tag ROBOTS

robots.txt:
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /DATA/
  Disallow: /Images/

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
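A robot can check the exclusion rules above with Python's standard `urllib.robotparser` module. In this sketch the rules are parsed from a string so nothing is fetched from the network; the agent name `ExampleBot` and the helper `allowed` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /DATA/
Disallow: /Images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse the rules without network access


def allowed(url, agent="ExampleBot"):
    """True if the robots.txt rules let this (hypothetical) agent fetch url."""
    return parser.can_fetch(agent, url)
```

With these rules `allowed("http://www.foobar.com/index.html")` is true, while anything under `/cgi-bin/`, `/DATA/` or `/Images/` is refused. A live crawler would instead call `set_url(...)` and `read()` once per host and cache the result.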

Web Robot - Problems

Network failures
Erroneous URLs
Unreachable servers
Password protection
Spider traps
Recursive URLs
Character set encodings
Same page - different URLs - deduplication
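The last problem, the same page reachable under different URLs, is partly solved by normalizing URLs before they enter the frontier. The sketch below handles only a few syntactic variants (letter case of scheme and host, default ports, empty path, fragments). The chosen rules are assumptions for illustration, and catching pages with identical content under unrelated URLs would additionally need content hashing.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Return a canonical form so trivially different URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()        # hostnames are case-insensitive
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"            # keep only non-default ports
    path = parts.path or "/"                     # empty path means the root
    # Drop the fragment: it addresses a position inside the same page.
    return urlunsplit((scheme, host, path, parts.query, ""))


def deduplicate(urls):
    """Keep the first URL of every group that normalizes to the same form."""
    seen, unique = set(), []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

For example, `http://www.foobar.com`, `HTTP://WWW.Foobar.com:80/` and `http://www.foobar.com/#top` all normalize to `http://www.foobar.com/`.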

Web Robot - More Problems

Hidden Web
  Databases
  Dynamic scripts
  ... ?

Web Robot - Traversal algorithms

Depth first (Stack, LIFO queue)
Breadth first (FIFO queue)
Relevance order (How?)
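The first two traversal orders differ only in which end of the frontier the next page is taken from. The sketch below shows this on a tiny hand-made link graph; the graph and the function name are invented for illustration.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}


def traverse(seed, depth_first):
    """Visit pages reachable from seed; the frontier discipline sets the order."""
    frontier = deque([seed])
    visited = []
    while frontier:
        # LIFO (stack) gives depth first, FIFO (queue) gives breadth first.
        page = frontier.pop() if depth_first else frontier.popleft()
        if page in visited:
            continue
        visited.append(page)
        frontier.extend(LINKS[page])
    return visited
```

Breadth first visits A, B, C, D, E, F (level by level), while depth first visits A, C, F, B, E, D (following one branch to its end before backing up). Relevance order would replace the deque with a priority queue keyed on some relevance estimate.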

Focused Crawling

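[Figure: the crawler loop extended with focus filters. Seed URLs fill the frontier (list of unvisited pages); Get URL, Fetch Web page from the Web, Analyze; a focus filter saves pages within the focus in the database (repository of visited pages) and drops pages not in focus; a URL focus filter is applied to the extracted links]

Focus:
  Domain
  Project
  Country
  Region
  Topic
  Subject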

Topic-specific Web-crawling

Problem
  Construct a topic specific search-engine (ex. Carnivorous plants)
Solution
  Make a Web-crawler walk through Internet and collect all pages with topic 'Carnivorous plants'

easier said than done!

Conditions

Page is about Carnivorous plants
  ⇒ automated subject classification
There are many pages on the Internet
  ⇒ where to start?
  ⇒ look only at interesting links
  ⇒ take the most important pages first

Topic Filter

Internet is Big

[Figure: crawl tree. First page: OK, save; choose among its links; for each new page ask "Page OK?" and save it if so]

Basic Algorithm

Add good start pages (seeds) to frontier
LOOP:
  Choose a page among links
  Page OK?
    Save page
    Add all links to frontier
  Go to LOOP

Save (database(s)):
  All relevant pages (search engine database)
  All analyzed pages (seen pages)
  All new links (frontier)
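The basic algorithm can be sketched as a best-first crawl over a toy web. Several things here are assumptions, not part of the slides: the "Page OK?" test is a simple share of topic words standing in for a real subject classifier, links inherit the score of the page they were found on as their priority, and the pages, threshold and names are invented.

```python
import heapq

# Hypothetical toy web: URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("carnivorous plants overview", ["sundew", "cars"]),
    "sundew": ("sundew is a carnivorous plant", ["bladderwort"]),
    "cars": ("used cars for sale", ["trucks"]),
    "bladderwort": ("bladderwort aquatic carnivorous plant", []),
    "trucks": ("trucks for sale", []),
}
TOPIC_TERMS = {"carnivorous", "plant", "plants", "sundew", "bladderwort"}


def relevance(text):
    """Fraction of words that are topic terms (stand-in for a real classifier)."""
    words = text.lower().split()
    return sum(word in TOPIC_TERMS for word in words) / len(words)


def focused_crawl(seeds, threshold=0.3):
    """Basic algorithm: choose a page, test it, save it, add its links."""
    frontier = [(-1.0, url) for url in seeds]   # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)                           # all queued or analyzed pages
    relevant = []                               # search engine database
    while frontier:
        _, url = heapq.heappop(frontier)        # choose a page among links
        text, links = TOY_WEB[url]
        score = relevance(text)
        if score < threshold:                   # page OK?
            continue                            # not in focus: drop its links too
        relevant.append(url)                    # save page
        for link in links:                      # add all links to frontier
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant, seen
```

`focused_crawl(["seed"])` saves seed, sundew and bladderwort. The off-topic page cars is analyzed but not saved, and its link to trucks is never followed, so any relevant page reachable only through cars would be missed.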

Problems I

Which new page?

Problems II

Isolated pages

Problems III

Non-relevant pages “blocking”

Compromises

Precision/recall
completeness/speed
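The precision/recall compromise can be made concrete for a focused crawl. The function below is a small sketch that assumes the set of truly relevant pages is known, which in practice is only true for a test collection; the function and argument names are made up.

```python
def precision_recall(collected, relevant):
    """Return (precision, recall) of a crawl given the truly relevant pages."""
    collected, relevant = set(collected), set(relevant)
    hits = len(collected & relevant)
    precision = hits / len(collected) if collected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If a crawl collects pages a, b, c, d and the relevant pages are a, b, e, then precision is 2/4 and recall is 2/3. A stricter topic filter tends to raise precision while lowering recall, and a looser one does the opposite.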

Browsing

No idea how to formulate a query
Willing to invest some time
Structure: flat vs hierarchy
  Manual vs automatic classification
  Lack of standard classification/terminology

Precision - NOT recall

Browsing vs search

Search
  LOTS of data
  Unstructured
  Unrelated items clutter results

Browsing
  Small amounts of data
  Hierarchically structured
  Quality assessed

Browsing examples

Dmoz (ODP), Yahoo! Directory

Filter bubble

What do search engines or social sites know about me?
  At least location, search history, click history, likes, and more ...
Personalize what’s shown (search results, ...) using this info
  Show us what we want/like to see - algorithmically
  ... and not what’s relevant (who decides that?)

Problem?

Filter bubble example I

From http://www.thefilterbubble.com/what-is-the-internet-hiding-lets-find-out

Filter bubble example II

From http://www.thefilterbubble.com/what-is-the-internet-hiding-lets-find-out

ToS-DR

Terms-of-Service – Didn’t Read; http://tos-dr.info/

you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.
Facebook: you grant us a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook (IP License).

Privacy

Search history, clicks, photos, documents, comments, ...
  leads to a profile
  that can be used by ads or sold, or even stolen
  which might lead to it ending up in unwanted places
  and used against you

Beware!

Be aware!

The Web - future

????
Infinity i-Kitchen - intelligent fridge runs Linux
http://www.geek.com/articles/chips/this-intelligent-fridge-runs-linux-on-an-arm-chip-20101126/

Read:
T. Berners-Lee, “Long Live the Web: A Call for Continued Open Standards and Neutrality”, Scientific American, November 22, 2010.
http://www.scientificamerican.com/article.cfm?id=long-live-the-web