Chapter 5 Locating Information on the WWW Wednesday, October 16, 13

FIT5 Ch. 5, CIS 110 13F

Dec 14, 2014

mh-108

Ch.5 presentation from Fluency w/Information Technology, 5ed (Pearson)
Transcript
Page 1: FIT5 Ch. 5, CIS 110 13F

Chapter 5: Locating Information on the WWW

Wednesday, October 16, 13

Page 2: FIT5 Ch. 5, CIS 110 13F

Copyright © 2013 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

How a Search Engine Works: A. The Web Crawler

• software robots (called spiders or bots)

=> spiders crawl the web to build an index (keywords & web pages)

TOKEN   URL
cat     www.cat.com, icanhascheezburger.com
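
A minimal sketch of the kind of index the spiders build, assuming two tiny crawled pages. The page texts and the `build_index` helper are invented for illustration; only the token "cat" and the two URLs come from the slide.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text (texts invented for illustration).
pages = {
    "www.cat.com": "cat equipment and machines",
    "icanhascheezburger.com": "funny cat pictures",
}

def build_index(pages):
    """Map each token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in text.lower().split():
            index[token].add(url)
    return index

index = build_index(pages)
print(sorted(index["cat"]))
```

Looking up "cat" returns both URLs, matching the TOKEN/URL table on the slide.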

Page 3: FIT5 Ch. 5, CIS 110 13F

How a Search Engine Works: the Web Crawler

• Web crawler: a program that indexes content on the web

• Algorithm:
– Start from one "seed" page
– Extract all links on that page
– Follow each link to find new pages
– Extract all links from new pages
– Keep going ...
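
The crawl steps above can be sketched as a breadth-first traversal. This is an illustration only: the `LINKS` dictionary is a hypothetical stand-in for fetching a page and extracting its links, so nothing here touches the network.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> links found on it.
LINKS = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["b.example", "c.example"],
    "b.example": ["seed.example"],
    "c.example": [],
}

def crawl(seed, get_links):
    """Visit every page reachable from the seed, following links breadth-first."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:      # skip pages already discovered
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("seed.example", lambda page: LINKS.get(page, []))
print(visited)
```

The `seen` set is what keeps "keep going ..." from looping forever when pages link back to each other.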

Page 4: FIT5 Ch. 5, CIS 110 13F

Page 5: FIT5 Ch. 5, CIS 110 13F

How a Search Engine Works: B. The Query Processor

• user enters search terms (keywords)
• query processor looks up word in index
• returns hit list

• create index in advance
• store in RAM

=> fast query response
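
A small sketch of why building the index in advance pays off, assuming three made-up pages. `scan_search` reads every page at query time, while `index_search` does one in-memory dictionary lookup; both names are hypothetical.

```python
# Hypothetical pages; a real engine would hold billions.
pages = {
    "p1.example": "human powered flight",
    "p2.example": "powered boats",
    "p3.example": "bird flight",
}

def scan_search(pages, word):
    """No index: read every page at query time."""
    return {url for url, text in pages.items() if word in text.split()}

# Built once, in advance, and kept in memory.
index = {}
for url, text in pages.items():
    for token in text.split():
        index.setdefault(token, set()).add(url)

def index_search(index, word):
    """With an index: one dictionary lookup returns the hit list."""
    return index.get(word, set())

print(sorted(index_search(index, "flight")))
```

Both functions return the same hit list; the difference is that the indexed version does its work once, before any query arrives.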

Page 6: FIT5 Ch. 5, CIS 110 13F

Multiword Searches: set intersection
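
A sketch of a multiword search as a set intersection, assuming hit lists have already been looked up in the index. The words and URLs are invented for illustration.

```python
# Hypothetical hit lists taken from an index: token -> set of URLs.
index = {
    "red": {"a.example", "b.example", "c.example"},
    "fish": {"b.example", "c.example", "d.example"},
    "wine": {"c.example", "e.example"},
}

def multiword_search(index, query):
    """Return the pages that appear in the hit list of every query word."""
    hit_lists = [index.get(word, set()) for word in query.lower().split()]
    if not hit_lists:
        return set()
    return set.intersection(*hit_lists)

print(sorted(multiword_search(index, "red fish")))
```

Each added word can only shrink the result, since a page must appear in every hit list to survive the intersection.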

Page 7: FIT5 Ch. 5, CIS 110 13F

Multiword Searches: set intersection

Page 8: FIT5 Ch. 5, CIS 110 13F

Power of Indexed Search

• Search engines can look at billions of Web pages and return an answer in less than a fifth of a second

Page 9: FIT5 Ch. 5, CIS 110 13F

Data Centers

• Search Index is RAM-resident
– RAM 100,000x faster than disk

– Hennessy/Patterson (4ed) memory access times:
» Register: 250 ps
» L1 Cache: 1 ns
» RAM: 100 ns
» Hard Disk: 10 ms (SSD Flash: 100 µs)
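
The "100,000x" figure can be checked from the access times quoted above. A minimal sketch, with the times written as whole picoseconds so the arithmetic is exact; the `slowdown` helper is a hypothetical name.

```python
# Access times quoted on the slide, in picoseconds.
ACCESS_TIME_PS = {
    "register": 250,               # 250 ps
    "l1_cache": 1_000,             # 1 ns
    "ram": 100_000,                # 100 ns
    "hard_disk": 10_000_000_000,   # 10 ms
}

def slowdown(slower, faster):
    """How many times longer one access takes than the other."""
    return ACCESS_TIME_PS[slower] // ACCESS_TIME_PS[faster]

disk_vs_ram = slowdown("hard_disk", "ram")
ram_vs_register = slowdown("ram", "register")
print(disk_vs_ram, ram_vs_register)
```

10 ms divided by 100 ns is 100,000, which is where the slide's disk-versus-RAM ratio comes from.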

=> Data Centers: a growth industry in Oregon

• Why? Data Centers as Information Substations

Page 10: FIT5 Ch. 5, CIS 110 13F

Google’s Data Centers

– Google’s facility in The Dalles is only one of two dozen, which stretch from Silicon Valley to Dublin.
– #servers: 1,000,000 - 2,000,000
• 2 exabytes of hard disk storage, enough to copy the web
• “The Indexed Web contains at least 3.59 billion pages (Tuesday, 15 October, 2013).”
• 8 petabytes of RAM
– Field Trip: Google’s Data Centers

Page 11: FIT5 Ch. 5, CIS 110 13F

datacenterknowledge.com

• rapid growth in data center electricity use from 2000 to 2005

• slowed significantly from 2005 to 2010

• 2010: total electricity use by all data centers about 1.3% of all electricity use for the world (2% for the US)

=> Google’s entire global data center network: 220 megawatts

Page 12: FIT5 Ch. 5, CIS 110 13F

Data Center Energy Efficiency

• PUE (power usage effectiveness)

• standard from Green Grid consortium

• measures how much power goes directly to computing vs. cooling, lighting, etc.

• Score of 1: no power goes to the extra costs

• 1.5 means that ancillary services consume half as much power as the computing equipment itself (one third of the total)
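
A worked sketch of the PUE score, using the standard definition (total facility power divided by the power that reaches the computing equipment). The function names and kilowatt figures are hypothetical.

```python
def pue(it_power_kw, overhead_power_kw):
    """Power usage effectiveness: total facility power / IT equipment power."""
    return (it_power_kw + overhead_power_kw) / it_power_kw

def overhead_share(pue_score):
    """Fraction of total facility power spent on cooling, lighting, etc."""
    return (pue_score - 1) / pue_score

print(pue(1000, 0))      # no power goes to the extra costs
print(pue(1000, 500))    # overhead is half of the computing load
print(overhead_share(1.5))
```

A facility drawing 1000 kW for computing and 500 kW for everything else scores 1.5, so the overhead is half the computing load but one third of the total bill.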

Page 14: FIT5 Ch. 5, CIS 110 13F

What Search Engines Look At

– Title: <title> element contains key words
– Anchor text: <a> element, describes the page it links to
– Landing page: <a> element, the page it connects to
– Meta: a <meta> tag in the head section, often used for key words
– Alt attributes: <img> element attribute gives a textual description
– Content: text on the page
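
A sketch of pulling those signals out of a page with Python's standard `html.parser` module. The `SignalExtractor` class and the sample HTML are invented for illustration; this is not how any particular search engine does it.

```python
from html.parser import HTMLParser

class SignalExtractor(HTMLParser):
    """Collect title, link targets with anchor text, meta keywords, alt text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []        # (landing page URL, anchor text) pairs
        self.keywords = ""
        self.alts = []
        self._in_title = False
        self._href = None      # href of the <a> currently open, if any
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and "href" in attrs:
            self._href, self._anchor = attrs["href"], []
        elif tag == "meta" and attrs.get("name") == "keywords":
            self.keywords = attrs.get("content", "")
        elif tag == "img" and attrs.get("alt"):
            self.alts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor.append(data)

# Hypothetical page, invented for illustration.
html = """<html><head><title>Cat Facts</title>
<meta name="keywords" content="cats, kittens"></head>
<body><a href="http://www.cat.com">big machines</a>
<img src="cat.jpg" alt="a sleeping cat"></body></html>"""

parser = SignalExtractor()
parser.feed(html)
print(parser.title, parser.links, parser.keywords, parser.alts)
```

Note that the anchor text ("big machines") is recorded against the landing page's URL, since it describes the page being linked to rather than the page it sits on.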

Page 15: FIT5 Ch. 5, CIS 110 13F

Page Rank Algorithm: Pioneered by Google

• PageRank works like a voting system

– If page A links to page B, A’s link adds to B’s importance

– Pages linked-to by many pages have a high page rank

– Links from pages with a high page ranking are ranked as more important
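
The voting idea can be sketched as a simplified PageRank iteration. Assumptions to note: the four-page link graph is made up, and the damping factor of 0.85 is the commonly cited value rather than anything stated on the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank on a dict: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its vote over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: A, C and D all link to B; B links to C.
links = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

B ranks highest because three pages link to it. C has only one inbound link, but it comes from the high-ranked B, so C ends up well above A and D.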

Page 16: FIT5 Ch. 5, CIS 110 13F

Field Trip: Basic Search

• Google Search Education

http://bit.ly/16ZW6Ow

Page 17: FIT5 Ch. 5, CIS 110 13F

Advanced Search: Logic Ops

• logic operator: AND
– human AND powered AND flight
– AND-queries: hits have all words

• logic operator: OR
– marshmallow OR strawberry OR chocolate
– OR-queries: hits have at least one word

• logic operator: NOT
– tigers AND NOT baseball
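
The three operators map onto set operations over hit lists. A minimal sketch, assuming invented hit lists for the slide's example words:

```python
# Hypothetical hit lists: word -> set of pages containing it.
index = {
    "tigers": {"zoo.example", "detroit.example", "wildlife.example"},
    "baseball": {"detroit.example", "mlb.example"},
    "marshmallow": {"candy.example"},
    "strawberry": {"fruit.example", "candy.example"},
}

def hits(word):
    """Hit list for one word; empty if the word is not indexed."""
    return index.get(word, set())

and_hits = hits("tigers") & hits("baseball")         # AND: pages with all words
or_hits = hits("marshmallow") | hits("strawberry")   # OR: pages with at least one word
not_hits = hits("tigers") - hits("baseball")         # AND NOT: drop pages with the word

print(sorted(not_hits))
```

"tigers AND NOT baseball" keeps the zoo and wildlife pages and drops the one that also mentions baseball.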

Page 18: FIT5 Ch. 5, CIS 110 13F

Combining Logical Operators

(marshmallow OR strawberry) AND sundae

• logic operators work like arithmetic: parentheses group the operations that are evaluated first

• Google also uses a minus sign (-) as an abbreviation for NOT

– http://www.powersearchingwithgoogle.com/course/ps/assets/PowerSearchingQuickReference.pdf
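
A short sketch of why the parentheses matter, again using made-up hit lists. Moving the parentheses turns the slide's query into a different one with different hits.

```python
# Hypothetical hit lists: word -> set of pages containing it.
marshmallow = {"m-sundae.example", "smores.example"}
strawberry = {"s-sundae.example", "jam.example"}
sundae = {"m-sundae.example", "s-sundae.example", "plain-sundae.example"}

# (marshmallow OR strawberry) AND sundae
grouped = (marshmallow | strawberry) & sundae

# marshmallow OR (strawberry AND sundae): a different query
regrouped = marshmallow | (strawberry & sundae)

print(sorted(grouped))
print(sorted(regrouped))
```

The first grouping returns only sundae pages; the second also lets in the s'mores page, which mentions marshmallow but no sundae.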

Page 19: FIT5 Ch. 5, CIS 110 13F

Site Search

• Many sites offer the opportunity to perform a site search

• e.g., try this Google search:

Google chief economist Hal Varian, site:uoregon.edu

Page 20: FIT5 Ch. 5, CIS 110 13F

Field Trip: Power Search

• Google Search Education

http://www.powersearchingwithgoogle.com/

Page 22: FIT5 Ch. 5, CIS 110 13F

Cloud Storage

• Facebook: 300 petabytes (PB)
• Microsoft Hotmail: 100 PB
• Microsoft SkyDrive: 10 PB
• Amazon S3: 900 PB
• Dropbox: 40 PB

Page 23: FIT5 Ch. 5, CIS 110 13F

Ch. 5: Assessment

Learning Outcomes - Know the following
