
Everything you wanted to know about crawling, but didn't know where to ask

Aug 31, 2014



SEO by the Sea

 
Transcript
Page 1: Everything you wanted to know about crawling, but didn't know where to ask

Local Search

(Including Importance Metrics and Link Merging)

Everything you wanted to know about Crawling*

*But Didn't Know Where to Ask

Agile SEO Meetup – South Jersey, Monday, September 10, 2012, 7:00 PM to 9:00 PM

Bill Slawski
Webimax
@bill_slawski

Page 2

In the Early Days of the Web, there was an invasion

Page 3

Robots

Page 4

Spiders

Via Thomas Shahan - http://www.flickr.com/photos/opoterser/

Page 5

Crawlers

Page 6

Invaded pages across the World Wide Web

Page 7

The Robots Mailing List was formed to solve the problem!

Page 8

Led by a young Martijn Koster, its members developed the robots.txt protocol

Page 9

Which Asked Robots to be Polite

Page 10

And Not Melt Down Internet Servers
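That politeness is easy to honor in practice. A minimal sketch using Python's standard-library robots.txt parser; the rules, URLs, and "MyPoliteBot" user agent are made up for illustration:

```python
from urllib import robotparser

# An illustrative robots.txt, supplied inline so the sketch needs no network
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A polite robot checks before fetching, and respects the crawl delay
allowed = rp.can_fetch("MyPoliteBot", "http://www.example.com/public/page.html")
blocked = rp.can_fetch("MyPoliteBot", "http://www.example.com/private/page.html")
delay = rp.crawl_delay("MyPoliteBot")  # seconds to wait between requests
```

In a real crawler you would call `rp.set_url(".../robots.txt")` and `rp.read()` to fetch the live file instead of parsing inline rules.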

Page 11

A student at Stanford named Lawrence Page went on to co-author a paper on how robots might crawl web pages to index important pages first.

http://ilpubs.stanford.edu:8090/347/1/1998-51.pdf

Page 12

<<Insert Subliminal Advertisement Here>>

Page 13

Important Web Pages
1. Contain words similar to a query that starts the crawl
2. Have a high backlink count
3. Have a high PageRank
4. Have a high forward link count
5. Are in or are close to the root directory for sites

Image via Fir0002/Flagstaffotos under http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License_1.2
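Those metrics might be folded into a single crawl-priority score. A hypothetical sketch; the weights, the metric field names, and the `url_depth` helper are all illustrative, not taken from the paper:

```python
from urllib.parse import urlparse

def url_depth(url):
    """Number of path segments below the site root (root = 0)."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/") if seg])

def importance(page):
    """page is a dict of metrics; a higher score means crawl sooner.
    The weights here are arbitrary placeholders."""
    score = 0.0
    score += 1.0 * page.get("query_similarity", 0.0)   # words similar to the driving query
    score += 0.5 * page.get("backlink_count", 0)       # count of pages linking in
    score += 2.0 * page.get("pagerank", 0.0)           # PageRank estimate
    score += 0.25 * page.get("forward_link_count", 0)  # count of links out
    score += 1.0 / (1 + url_depth(page["url"]))        # closeness to the root
    return score
```

A crawl frontier could then be a priority queue ordered by this score, so important pages are fetched first.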

Page 14

So most crawlers will not only be polite, but will also hunt down important pages first.

Page 15

Search engines filed patents on how they might crawl and collect content found on Web pages, including collecting URLs and the anchor text associated with them.

<a href="http://www.hungryrobots.com">Feed Me</a>

http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7308643
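Collecting URL/anchor-text pairs like the one above can be sketched with Python's standard-library HTML parser (the class name is my own, not from any patent):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # (url, anchor_text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

collector = AnchorCollector()
collector.feed('<a href="http://www.hungryrobots.com">Feed Me</a>')
```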

Page 16

Also, in one embodiment, the robots are configured to not follow "permanent redirects". Thus, when a robot encounters a URL that is permanently redirected to another URL, the robot does not automatically retrieve the document at the target address of the permanent redirect.
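The behavior that embodiment describes, recording the target of a 301 rather than fetching it on the spot, might look like this. The function and parameter names are hypothetical stand-ins for a real HTTP client and crawl frontier:

```python
def handle_response(url, status, location, frontier):
    """Sketch of the redirect handling described above.

    On a 301 ("permanent redirect"), the robot does not retrieve the
    target immediately; it queues the target URL for later scheduling.
    Non-redirected URLs are processed normally.
    """
    if status == 301 and location:
        frontier.append(location)  # note the target; don't fetch it now
        return None
    return url
```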

Page 17

“Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.”*

*Google Webmaster Guidelines - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769

Page 18

Google’s Webmaster Guidelines make crawlers look pretty unsophisticated, and incapable of much more than the simple Lynx browser…

…But we have signs that crawlers can be smarter than that, and Microsoft introduced a Vision-based Page Segmentation Algorithm in 2003. Both Google and Yahoo have also published patents and papers that describe smarter crawlers. IBM filed a patent for a crawler in 2000 that is smarter than most browsers today.

Page 19

VIPS: a Vision-based Page Segmentation Algorithm - http://research.microsoft.com/apps/pubs/default.aspx?id=70027

Page 20

http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7519902

Page 21

Link Merging

Web Site Structure Analysis - http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7861151

• S-nodes – Structural link blocks: organizational and navigational link blocks, repeated across pages with the same layout and reflecting the organization of the site. They are often lists of links that don’t usually contain other content elements such as text.

• C-nodes – Content link blocks, grouped together by some kind of content association, such as relating to the same topic or sub-topic. These blocks usually point to information resources and aren’t likely to be repeated across more than one page.

• I-nodes – Isolated links: links on a page that aren’t part of a link group and may be only loosely related to each other, for example by appearing together within the same paragraph of text. Each link on a page might be treated as an individual i-node, or a page’s isolated links might be grouped together as one i-node.

Page 22

Crawling and Self Help

Page 23

Canonical = Best!

There can be only one:
http://example.com
http://www.example.com
http://example.com/
http://www.example.com/
https://example.com
https://www.example.com
https://example.com/
https://www.example.com/
http://example.com/index.htm
http://www.example.com/index.htm
https://example.com/index.htm
https://www.example.com/index.htm
http://example.com/INDEX.htm
http://www.example.com/INDEX.htm
https://example.com/INDEX.htm
https://www.example.com/INDEX.htm
http://example.com/Index.htm
http://www.example.com/Index.htm
https://example.com/Index.htm
https://www.example.com/Index.htm
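Collapsing variants like these onto one form can be sketched as a normalization function. Picking https + www + a lowercase path as the "one" here is an arbitrary policy for illustration, not a rule from any search engine:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Map duplicate URL variants onto a single canonical form.
    The chosen policy (https, www, lowercase path, index.htm folded
    into its directory) is illustrative."""
    scheme, netloc, path, query, frag = urlsplit(url)
    netloc = netloc.lower()
    if not netloc.startswith("www."):
        netloc = "www." + netloc
    path = path.lower()
    # Treat index.htm as equivalent to the directory it lives in
    if path.endswith("/index.htm"):
        path = path[: -len("index.htm")]
    if not path:
        path = "/"
    return urlunsplit(("https", netloc, path, query, ""))
```

Every variant in the list above maps to the same string, which is the point: the crawler should only ever see one of them.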

Page 24

Canonical Link Element

<link rel="canonical" href="http://example.com/page.html"/>

Page 25

Rel=“prev” & rel=“next”

On the first page, http://www.example.com/article?story=abc&page=1:

<link rel="next" href="http://www.example.com/article?story=abc&page=2" />

On the second page, http://www.example.com/article?story=abc&page=2:

<link rel="prev" href="http://www.example.com/article?story=abc&page=1" />
<link rel="next" href="http://www.example.com/article?story=abc&page=3" />

On the third page, http://www.example.com/article?story=abc&page=3:

<link rel="prev" href="http://www.example.com/article?story=abc&page=2" />
<link rel="next" href="http://www.example.com/article?story=abc&page=4" />

And on the last page, http://www.example.com/article?story=abc&page=4:

<link rel="prev" href="http://www.example.com/article?story=abc&page=3" />
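The pattern in the four snippets above is mechanical, first page gets only a next, last page only a prev, so it can be generated. A sketch (the helper name is mine; the URL pattern is the one from the slide):

```python
def pagination_links(base, total, page):
    """Emit the rel="prev"/rel="next" link elements for page `page`
    of a `total`-page series whose URLs follow `base` + "&page=N"."""
    links = []
    if page > 1:
        links.append('<link rel="prev" href="%s&page=%d" />' % (base, page - 1))
    if page < total:
        links.append('<link rel="next" href="%s&page=%d" />' % (base, page + 1))
    return links

base = "http://www.example.com/article?story=abc"
```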

Page 26

Paginated Product Pages

Page 27

Paginated Article Pages

Page 28

View All Pages

Option 1

• Normal prev/next sequence
• Self-referential canonicals (each page points to its own URL)
• Noindex meta element on the View All page

Option 2

• Normal prev/next sequence
• Canonicals (all pages use the View All page URL)

http://googlewebmastercentral.blogspot.com/2011/09/view-all-in-search-results.html

Page 29

Rel=“hreflang”

Page 30

Rel=“hreflang”

HTML link element.

In the HTML <head> section of http://www.example.com/, add a link element pointing to the Spanish version of that webpage at http://es.example.com/, like this:

<link rel="alternate" hreflang="es" href="http://es.example.com/" />

HTTP header.

If you publish non-HTML files (like PDFs), you can use an HTTP header to indicate a different language version of a URL:

Link: <http://es.example.com/>; rel="alternate"; hreflang="es"

Sitemap.

Instead of using markup, you can submit language version information in a Sitemap.

Page 31

Rel=“hreflang” XML Sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/english/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/deutsch/" />
    <xhtml:link rel="alternate" hreflang="de-ch" href="http://www.example.com/schweiz-deutsch/" />
    <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/english/" />
  </url>
</urlset>

Page 32

XML Sitemap

Page 33

XML Sitemap

• Use canonical links
• Remove 404s
• Don’t set priority past 1 week
• If more than 50,000 URLs, use multiple Sitemaps and a Sitemap index
• Validate with an XML Sitemap validator
• Include a Sitemap statement in robots.txt

http://www.sitemaps.org/
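The 50,000-URL rule above can be sketched as a chunking step plus a sitemap index. The function names and file URLs are illustrative; the XML follows the sitemaps.org index format:

```python
def chunk_urls(urls, size=50000):
    """Split a URL list into sitemap-sized chunks (max 50,000 each)."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def sitemap_index(sitemap_urls):
    """Build a sitemap index file listing each child sitemap's URL."""
    entries = "".join(
        "  <sitemap><loc>%s</loc></sitemap>\n" % u for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        "%s</sitemapindex>\n" % entries
    )
```

The robots.txt statement from the last bullet is then a single line, e.g. `Sitemap: http://example.com/sitemap_index.xml`, pointing at the index.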

Page 34

Next, we study which of the two crawl systems, Sitemaps and Discovery, sees URLs first. We conduct this test over a dataset consisting of over five billion URLs that were seen by both systems.

According to the most recent statistics at the time of the writing, 78% of these URLs were seen by Sitemaps first, compared to 22% that were seen through Discovery first.

Crawling vs. XML Sitemaps

Sitemaps: Above and Beyond the Crawl of Duty – http://www.shuri.org/publications/www2009_sitemaps.pdf

Page 36

Questions?

Bill Slawski
Webimax

@bill_slawski