ITIS 1210: Introduction to Web-Based Information Systems
Internet Research Two: How Search Engines Rank Pages & Constructing Complex Searches
How Do Search Engines Crawl?
Gathering data from the Web is like browsing:
1. Visit a page.
2. Record all the words on the page.
3. Choose a link you haven't seen/recorded.
4. Click on the link.
Repeat 8 billion times.
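The four steps above can be sketched as a short loop. This is only an illustration, not how any real search engine is implemented: the tiny in-memory "Web" (`TOY_WEB`), the page names, and the `crawl` function are all made up for this example, and no network access is involved.

```python
from collections import deque

# Hypothetical four-page "Web": page name -> (text on the page, links on the page).
TOY_WEB = {
    "home": ("welcome to the solar energy association", ["about", "news"]),
    "about": ("about the association in portland", ["home"]),
    "news": ("solar news from oregon", ["about", "events"]),
    "events": ("energy events in the northwest", []),
}


def crawl(start, web):
    """Visit every reachable page once and record the words found on it."""
    index = {}                 # page -> set of words recorded on that page
    seen = {start}             # pages already visited or already queued
    frontier = deque([start])  # pages we know about but have not visited yet
    while frontier:
        page = frontier.popleft()        # 1. visit a page
        text, links = web[page]
        index[page] = set(text.split())  # 2. record all the words on the page
        for link in links:               # 3. choose links not seen before
            if link not in seen:
                seen.add(link)
                frontier.append(link)    # 4. "click" on the link later
    return index


index = crawl("home", TOY_WEB)
for page, words in index.items():
    print(page, sorted(words))
```

The `seen` set and the `frontier` queue are one simple way to keep track of pages already handled and pages still waiting to be visited.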
Crawling the Web
One person with a Web browser, following one link per second.
How long does it take to browse the surface Web (8 billion pages)?
Crawling the Web
How many people would it take to crawl the surface Web in a week, if each person follows one link per second (with no sleep)?
One week ≈ six hundred thousand seconds
Eight billion / six hundred thousand ≈ thirteen thousand people
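The arithmetic behind both crawling questions can be checked in a few lines. The 8 billion page count and the one-link-per-second rate come from the slides; the constant names are chosen for this sketch, and leap years are ignored. One person works out to roughly 250 years, and a one-week crawl to roughly thirteen thousand people.

```python
PAGES = 8_000_000_000                  # surface Web size used on the slides
SECONDS_PER_WEEK = 7 * 24 * 60 * 60    # 604,800 seconds
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years

# One person, one link per second: seconds needed equals the number of pages.
years_for_one_person = PAGES / SECONDS_PER_YEAR

# Many people, one week each: divide the pages by the seconds in a week.
people_for_one_week = PAGES / SECONDS_PER_WEEK

print(f"One person: about {years_for_one_person:.0f} years")
print(f"One week: about {people_for_one_week:,.0f} people")
```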
Challenges:
  Remembering where you've been
  Remembering where you haven't been
  Storing all the data
A (small) Server Farm
The Deep Web
Not all pages get crawled:
  Private pages on intranets (company networks)
  Pages that people don't want crawled
  Dynamic content pages (from databases)
Dynamic content pages make the size of the Internet effectively infinite!
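For the "pages that people don't want crawled" case, the slides do not name a mechanism. One widely used convention is a robots.txt file that well-behaved crawlers check before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

# Ask whether a crawler (any user agent) may fetch each URL.
allowed = parser.can_fetch("*", "https://example.com/index.html")
blocked = parser.can_fetch("*", "https://example.com/private/report.html")

print("index.html may be crawled:", allowed)
print("private/report.html may be crawled:", blocked)
```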
Dynamic Content Example
zillow.com (won't be indexed)
Identifying High Quality Web Pages
Google has ranked billions of Web pages by "quality".
You enter your search terms:
  UNC Charlotte HCI
Google finds the highest quality page associated with these search terms.
Google Pagerank
Pretend you're surfing the Web randomly.
To move from page to page you could:
1) type in an address (www.sis.uncc.edu), which includes using a bookmark
OR
2) follow a link.
Pagerank measures how likely you are to reach a particular page through random surfing (either 1 or 2). The main idea is that links to your page from important web pages indicate that your page is important.
Computing Pagerank (what's the probability of getting to this page?)
Q = Web page
A, B, C, ... = Pages pointing to Q
L(A), L(B), L(C), ... = number of links on each page
d represents the relative chance of following a link to page Q and 1-d represents the relative chance of going directly to page Q (via typing in the address or using a bookmark):
Usually these are: d = 0.9 (1-d) = 0.1
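The formula itself appears to have been an image and did not survive in this transcript. A commonly stated form of Pagerank that is consistent with the definitions above is sketched below. The symbols PR(·) for a page's rank and N for the total number of pages are not in the transcript and are assumptions here; some presentations omit the division by N.

```latex
PR(Q) = \frac{1-d}{N} + d\left(\frac{PR(A)}{L(A)} + \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \cdots\right)
```

Read this way, the first term is the chance of jumping directly to Q, and each fraction is the share of a linking page's rank passed along one of its L links.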
Computing Pagerank
Pretend the Web has only four pages:
Compute new estimates from the old until the estimates stop changing. Note that this gives the same answer as the traditional algebraic approach, but this way scales better.
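The "compute new estimates from the old" loop might look like the sketch below. The four-page link structure from the slide's figure is not in the transcript, so the `LINKS` graph here is hypothetical. The update rule is the commonly stated probability form of Pagerank, with a (1-d)/N direct-jump term, which is an assumption about what the slide showed. d = 0.9 follows the slides.

```python
# Hypothetical four-page Web: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, d=0.9, tolerance=1e-10, max_rounds=1000):
    """Repeat the Pagerank update until the estimates stop changing."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}  # initial estimate: all equal
    for _ in range(max_rounds):
        new_ranks = {}
        for page in links:
            # Each page linking here passes on its rank divided by its link count.
            inbound = sum(
                ranks[src] / len(out) for src, out in links.items() if page in out
            )
            new_ranks[page] = (1 - d) / n + d * inbound
        change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if change < tolerance:
            break
    return ranks


ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(page, round(score, 4))
```

In this toy graph, page C is linked to by three pages and ends up ranked first, while page D, which nothing links to, keeps only the direct-jump share.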
How Does Google Use Pagerank?
You enter search terms, such as "UNC Charlotte HCI"
Google finds all the pages that have all those words on them
Of all those pages, Google will list the ones with the highest page rank first, but…
…other 'magic ingredients' are used by Google: trade secrets of their algorithms.
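The two steps described above (keep pages containing all the words, then list the highest Pagerank first) can be sketched as follows. The page names, word sets, and scores are invented for illustration, and the "magic ingredients" are deliberately left out.

```python
# Hypothetical pages: name -> (words on the page, Pagerank score).
PAGES = {
    "hci-lab": ({"unc", "charlotte", "hci", "lab"}, 0.30),
    "campus-map": ({"unc", "charlotte", "map"}, 0.50),
    "hci-course": ({"unc", "charlotte", "hci", "course"}, 0.45),
}


def search(query, pages):
    """Return pages containing every query word, highest Pagerank first."""
    terms = set(query.lower().split())
    # Step 1: keep only pages that have all the search terms on them.
    matches = [name for name, (words, _) in pages.items() if terms <= words]
    # Step 2: order the matching pages by Pagerank, highest first.
    return sorted(matches, key=lambda name: pages[name][1], reverse=True)


results = search("UNC Charlotte HCI", PAGES)
print(results)
```

Note that "campus-map" has the highest score but is not returned, because it lacks one of the search terms.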
Introduction
Basic queries are somewhat limited
  One or two keywords
  Simple relationships
  Limited syntax
Complex queries provide more power
  Keywords & phrases can be connected to form more complex relationships
  Search filters can be employed to limit results
Boolean operators may be
  Allowed on main page
  Confined to Advanced search pages
Some engines use symbols instead
  + for AND
  - for NOT
  No space between sign and word:
    +solar +energy -windmill
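A query in the symbol style above could be interpreted roughly like this. The function names are made up for the sketch, and treating a bare word (no sign) as required is an assumption; real engines differ in how they handle unsigned words.

```python
def parse_symbol_query(query):
    """Split a '+word -word' query into required and excluded keyword sets."""
    required, excluded = set(), set()
    for token in query.split():
        if token.startswith("-"):
            excluded.add(token[1:].lower())          # - means NOT
        else:
            required.add(token.lstrip("+").lower())  # + (or no sign) means AND
    return required, excluded


def matches(page_words, required, excluded):
    """A page matches if it has every required word and no excluded word."""
    return required <= page_words and not (excluded & page_words)


required, excluded = parse_symbol_query("+solar +energy -windmill")
print(matches({"solar", "energy", "panels"}, required, excluded))
print(matches({"solar", "energy", "windmill"}, required, excluded))
```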
Narrowing Searches with AND
AND
  Limits results
  Forces inclusion of a stop word
Indicates that all keywords must be found on the Web page
Adding more ANDed keywords limits the search more
Results should be more relevant because the keyword list has expanded
Narrowing Searches with AND
Example: "solar energy association" AND Portland
[Venn diagram labels: WWW, Solar energy association, Portland]
Narrowing Searches with AND
Example: Henry +I is the same as "Henry I"
[Venn diagram labels: WWW, Henry I]
Expanding Searches with OR
OR expands results
  Useful if you didn't get enough returns from your first search
  The more keywords you add, the more results you should get
Every page returned must have at least one of the keywords on it
  Good to use when you have synonyms
Expanding Searches with OR
Example: oregon OR northwest
[Venn diagram labels: WWW, oregon, northwest]
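The Venn diagrams for AND and OR correspond to set intersection and set union. The sketch below uses a small, made-up inverted index (keyword to the set of page IDs containing it) to show why adding ORed keywords can only grow the result while adding ANDed keywords can only shrink it.

```python
# Hypothetical inverted index: keyword -> IDs of pages containing it.
INDEX = {
    "oregon": {1, 2, 5},
    "northwest": {2, 3},
    "portland": {2, 5, 7},
}


def search_or(*keywords):
    """Pages containing at least one keyword (set union)."""
    result = set()
    for keyword in keywords:
        result |= INDEX.get(keyword, set())
    return result


def search_and(*keywords):
    """Pages containing every keyword (set intersection)."""
    sets = [INDEX.get(keyword, set()) for keyword in keywords]
    return set.intersection(*sets) if sets else set()


print(sorted(search_or("oregon", "northwest")))
print(sorted(search_and("oregon", "northwest")))
```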
Restricting Queries with AND NOT
AND NOT excludes the keyword that follows NOT
  Limits your search
  Produces fewer results
Useful if first search returns irrelevant results
  Use AND NOT to get rid of those results
Restricting Queries with AND NOT
Equivalent forms:
  cats AND NOT dogs
  cats AND-NOT dogs
  cats NOT dogs
  cats -dogs
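One way to see that these forms are equivalent is to reduce each to the same pair: one keyword to keep and one to exclude. The helper below is a hypothetical sketch that handles only this simple two-keyword pattern. It also accepts an en dash in place of the minus sign, since word processors often substitute one.

```python
import re

# Matches: word, then one of the four exclusion operators, then word.
EXCLUSION = re.compile(r"(\w+)\s+(?:AND\s+NOT\s+|AND-NOT\s+|NOT\s+|-)(\w+)")


def normalize_exclusion(query):
    """Return (included, excluded) for a two-keyword exclusion query."""
    query = query.strip().replace("\u2013", "-")  # treat an en dash as a minus
    match = EXCLUSION.fullmatch(query)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    return match.group(1), match.group(2)


forms = ["cats AND NOT dogs", "cats AND-NOT dogs", "cats NOT dogs", "cats -dogs"]
for form in forms:
    print(form, "->", normalize_exclusion(form))
```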
Restricting Queries with AND NOT
Example: "solar energy association" AND NOT portland
Boolean operators allow you to focus a search
Any logical combination of operators is allowed
If it makes sense when spoken like a sentence it's probably OK to use
Order of operations is usually left to right
  Use parentheses to organize terms
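The effect of left-to-right evaluation versus parentheses can be seen with a small example. The inverted index and the evaluator below are hypothetical; the point is only that the same three keywords give different results depending on grouping.

```python
# Hypothetical inverted index: keyword -> IDs of pages containing it.
INDEX = {
    "solar": {1, 2, 3},
    "wind": {3, 4},
    "portland": {2, 4, 5},
}


def pages(word):
    return INDEX.get(word, set())


def evaluate_left_to_right(query):
    """Evaluate 'a OP b OP c ...' strictly left to right (OP is AND or OR)."""
    tokens = query.split()
    result = pages(tokens[0])
    for operator, word in zip(tokens[1::2], tokens[2::2]):
        if operator == "AND":
            result = result & pages(word)
        elif operator == "OR":
            result = result | pages(word)
        else:
            raise ValueError(f"unknown operator: {operator}")
    return result


# Left to right: (solar OR wind) AND portland
left_to_right = evaluate_left_to_right("solar OR wind AND portland")

# With parentheses: solar OR (wind AND portland)
grouped = pages("solar") | (pages("wind") & pages("portland"))

print(sorted(left_to_right))
print(sorted(grouped))
```

Because engines may differ in how they order operations, writing the parentheses explicitly removes the ambiguity.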