2009
What can you do with logs?
PART 1: THE WHY
Getting logs
Analysing Logs
Processing Logs
PART 2: THE HOW
What does a log look like?
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "GET /my_homepage HTTP/1.1" 200 2262 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Reading from left to right, the fields are:
1. IP Address: 123.65.150.10
2. Timestamp: [23/Aug/2010:03:50:59 +0000]
3. Request type: GET
4. Page requested (here, the homepage): /my_homepage
5. Protocol: HTTP/1.1
6. Status Code: 200
7. Size of the page (in bytes): 2262
8. User Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
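As a rough illustration (not part of the original deck), the line above can be split into those fields with a short Python sketch; the regex assumes the common "combined" log format shown, and the field names are our own labels:

```python
import re

# Combined Log Format: IP, identd, user, [timestamp], "request",
# status, bytes, "referrer", "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] '
        '"GET /my_homepage HTTP/1.1" 200 2262 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html)"')

# Each named group becomes one field in a dictionary
fields = LOG_PATTERN.match(line).groupdict()
```

Real log sets vary (extra fields, missing quotes), so a production parser would need to be tested against your own server's format.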
1. teflsearch.com
2. teflsearch.com/job-results
3. teflsearch.com/job-results/country/china
4. teflsearch.com/job-advert3455
Hi {x},
I'm {x} from {y} and we've been asked to do some log analysis to better understand how Google is behaving on the website. I was hoping you could help with some questions about the log set-up (as well as with getting the logs!).
What we'd ideally like is 3-6 months of historical logs for the website. Our goal is to look at all the different pages search engines are crawling on our website, discover where they're spending their time, the status code errors they're finding, etc.
There are also some things that are really helpful for us to know when getting logs.
Do the logs have any personal information in them?
We're only concerned with the various search crawler bots like Google and Bing; we don't need any logs from users, so any logs with emails, telephone numbers, etc. can be removed.
Do you have any sort of caching which would create separate sets of logs?
Is there anything like Varnish running on the server, or a CDN, which might create logs in a different location to the rest of your server? If so, we'll need those logs as well as the ones from the server. (We're only concerned about a CDN if it's caching pages or serving from the same hostname; if you're just using Cloudflare, for example, to cache external images then we don't need it.)
Are there any sub-parts of your site which log to a different place?
Have you got anything like an embedded WordPress blog which logs to a different location? If so, we'll need those logs as well.
Do you log the hostname?
It's really useful for us to be able to see the hostname in the logs. By default a lot of common server logging set-ups don't log the hostname, so if it's not turned on, it would be very useful to turn it on now for any future analysis.
Is there anything else we should know?
Best,
{x}
Email for a developer
1. Ask powerful questions
2. Repeatable
3. Scalable
4. Combine with crawl data
5. Easy to set up
6. Easy to learn
What do we want from analysing our logs?
What can you do with logs?
PART 1: THE WHY
Getting logs
Analysing Logs
Processing Logs
PART 2: THE HOW
Format the logs so we can import them into BigQuery
Separate the Googlebot logs from all the other logs
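A minimal sketch of both steps, assuming the logs are in the combined format shown earlier: BigQuery accepts newline-delimited JSON, so one JSON object per request is a convenient import format. The regex and field names here are our own assumptions, not the deck's exact script:

```python
import json
import re

# Assumed combined log format; adjust to your server's actual format
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<user_agent>[^"]*)"'
)

def googlebot_ndjson(lines):
    """Yield one newline-delimited JSON row per Googlebot request,
    skipping lines that do not parse or are not Googlebot."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and "Googlebot" in match.group("user_agent"):
            yield json.dumps(match.groupdict())

# Usage:
# with open("access.log") as src, open("googlebot.ndjson", "w") as out:
#     for row in googlebot_ndjson(src):
#         out.write(row + "\n")
```

Note that a user-agent string can be spoofed; verifying suspect IPs with a reverse DNS lookup against googlebot.com is a common extra check before trusting the data.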
Our first SQL query
SELECT DATE(timestamp) AS date, COUNT(*) AS number_of_requests
FROM [mydata.log_analysis]
GROUP BY date
How long does it take for a page to be discovered after being published?
What are the top 20 combinations of page_path_1 & page_path_2 folders crawled by Google over the time period of our logs?
Which pages have requests from Googlebot which don't appear in our crawl?
What are the top non-canonical pages being crawled?
Which are the most crawled parameters on the website?
How often are the most visited parameters crawled each day?
Which directories have the most 301 & 404 status codes?
Which pages are crawled with parameters and without parameters?
Which pages are only partly downloaded?
How many hits does each section get, when the sections are classified in an external dataset?
What percentage of a directory was crawled over the past 30 days?
What is the total number of requests across two different time periods?
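To illustrate the logic behind one of these questions, pages crawled with and without parameters, here is a hedged Python sketch over already-parsed request paths (in the deck these questions are answered with BigQuery SQL; this is only the underlying idea):

```python
from collections import defaultdict
from urllib.parse import urlsplit

def split_by_parameters(paths):
    """Group request paths by their parameter-free page, recording
    whether each page was crawled with parameters, without, or both."""
    seen = defaultdict(set)
    for path in paths:
        parts = urlsplit(path)
        seen[parts.path].add("with" if parts.query else "without")
    return seen

# Hypothetical sample paths from the teflsearch.com examples
paths = [
    "/job-results",
    "/job-results?country=china",
    "/job-advert3455",
]
```

Pages appearing in both sets (like "/job-results" above) are the interesting ones: they are being crawled in parameterised and plain form, which often signals wasted crawl budget.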