© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Logging at Scale Alex Smith - @alexjs Solutions Architect April 2016

Log Analysis At Scale

Apr 14, 2017

Page 1: Log Analysis At Scale

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Logging at Scale

Alex Smith - @alexjs

Solutions Architect

April 2016

Page 2: Log Analysis At Scale

Logging is difficult

Page 3: Log Analysis At Scale

I thought I knew this

Page 4: Log Analysis At Scale

No Users

5.2m users

(~80k rps)

Page 5: Log Analysis At Scale

But.

It is really difficult

Page 6: Log Analysis At Scale

Problems

• Storage (Temporary)

• Capture

• Storage (Permanent)

• Visualisation

Page 7: Log Analysis At Scale

Stealing Content…

‘Your First 10m Users’

ARC301 – re:Invent 2015

http://bitly.com/2015arc301

- Joel Williams

AWS Solutions Architect

Page 8: Log Analysis At Scale

>1 User

• Amazon Route 53 for DNS

• A single Elastic IP

• A single Amazon EC2 instance

• With full stack on this host

• Web app

• Database

• Management

• And so on…

[Diagram: User, Amazon Route 53, Elastic IP, Amazon EC2 instance. Source: ARC301]

Page 9: Log Analysis At Scale

>1 User

• A single place to read logs


Page 10: Log Analysis At Scale

>1 User

• A single place to read logs from


Page 11: Log Analysis At Scale

@alexjs hacks – top URLs

# awk -F\" '{print $2}' access_log \
| awk '{print $2}' \
| sort | uniq -c | sort -rn

11208 /

3287 /2016/04/23/welcome
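The same count can be sketched in Python for cases where the shell pipeline gets awkward. This is a minimal sketch, assuming Apache combined-format lines; the function name `top_urls` and the sample lines are invented for illustration.

```python
from collections import Counter


def top_urls(lines, n=10):
    """Count request paths in Apache combined-format log lines."""
    counts = Counter()
    for line in lines:
        # Same idea as awk -F\" '{print $2}': the request is the first quoted field.
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request field on this line
        request = parts[1].split()  # e.g. ['GET', '/', 'HTTP/1.1']
        if len(request) >= 2:
            counts[request[1]] += 1
    return counts.most_common(n)


# Invented sample lines, for illustration only.
sample = [
    '203.0.113.7 - - [23/Apr/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/7.47"',
    '203.0.113.7 - - [23/Apr/2016:10:00:01 +0000] "GET /2016/04/23/welcome HTTP/1.1" 200 2048 "-" "curl/7.47"',
    '203.0.113.8 - - [23/Apr/2016:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/7.47"',
]

for path, hits in top_urls(sample):
    print(hits, path)
```

In practice the lines would come from an open file handle on access_log rather than a list.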

Page 12: Log Analysis At Scale

@alexjs hacks – HTTP response codes

# awk '{print $9}' access_log \
| sort | uniq -c | sort -rn

19307 200

1239 404

120 503

1 416

Page 13: Log Analysis At Scale

@alexjs hacks - top User-Agents

# awk -F\" '{print $6}' access_log | sort | uniq -c | sort -rn

3774 Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; Microsoft; Lumia 640 XL)

2949 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36

2928 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36

2900 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.5.2171.95 Safari/537.36

Page 14: Log Analysis At Scale

@alexjs hacks – requests per second (realtime)

# tail -F access_log \
| perl -e 'while (<>) {$l++;if (time > $e) {$e=time;print"$l\n";$l=0}}'

1

1

68

99

912

424

http://bitly.com/bashlps
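The Perl one-liner counts lines as they arrive. For a log that has already been written, a similar per-second figure can be derived from the timestamps inside the file instead, which is a different technique. A minimal Python sketch, assuming Apache combined format with the timestamp in square brackets and English month abbreviations; the function name and sample lines are invented for illustration.

```python
from collections import Counter
from datetime import datetime


def requests_per_second(lines):
    """Bucket log lines by the second recorded in their [timestamp] field."""
    per_second = Counter()
    for line in lines:
        start = line.find('[')
        end = line.find(']', start)
        if start == -1 or end == -1:
            continue  # no timestamp on this line
        stamp = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
        per_second[stamp] += 1
    return sorted(per_second.items())


# Invented sample lines, for illustration only.
sample = [
    '203.0.113.7 - - [23/Apr/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/7.47"',
    '203.0.113.7 - - [23/Apr/2016:10:00:01 +0000] "GET /a HTTP/1.1" 200 2048 "-" "curl/7.47"',
    '203.0.113.8 - - [23/Apr/2016:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/7.47"',
]

for second, count in requests_per_second(sample):
    print(second.isoformat(), count)
```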

Page 15: Log Analysis At Scale

Users >1000

[Diagram: User, Amazon Route 53, ELB, and two Availability Zones, each with a Web Instance and an RDS DB Instance (one Active, one Standby, Multi-AZ)]

Page 16: Log Analysis At Scale

Real Life

Page 17: Log Analysis At Scale

Users >1 million+

[Diagram: User, Amazon Route 53, Amazon CloudFront, Amazon S3, ELB, four Web Instances, two Worker Instances, two Internal App Instances, RDS DB Instance (Active, Multi-AZ) with two Read Replicas, DynamoDB, ElastiCache, Amazon SQS, Amazon SES, Lambda, Amazon CloudWatch, Availability Zone. Source: ARC301]

Page 18: Log Analysis At Scale

[Diagram: User, Amazon Route 53, Elastic IP, Amazon EC2 instance. Source: ARC301]

Page 22: Log Analysis At Scale

Problems

• Storage (Temporary)

• Capture

• Storage (Permanent)

• Visualisation

Page 25: Log Analysis At Scale

Where the logs are written (AWS)

• Local memory

• Ephemeral Volumes

• EBS Volumes

• gp2

• st1/sc1

Page 26: Log Analysis At Scale

Problems

• Storage (Temporary)

• Capture

• Storage (Permanent)

• Visualisation

• Insight

Page 28: Log Analysis At Scale

Three Problems of Persistence

• Somewhere to stage

• Somewhere to live

• Somewhere to search

Page 29: Log Analysis At Scale

To NoSQL, or not to NoSQL?

- Joel

Page 30: Log Analysis At Scale

Some folks won’t like this,

but…

Page 31: Log Analysis At Scale

Start with SQL databases (even MPP SQL)

Page 32: Log Analysis At Scale

Why start with SQL?

• Established and well-worn technology.

• Lots of existing code, communities, books, and tools.

• You aren’t going to break SQL DBs in your first 10 million users. No, really, you won’t.*

• Clear patterns to scalability (especially in analytics)

*Unless you are doing something SUPER peculiar with the data or you have MASSIVE amounts of it.

…but even then SQL will have a place in your stack.

Page 33: Log Analysis At Scale

Ah ha! You said

massive!

- Joel (again)

Page 34: Log Analysis At Scale

Why might you need NoSQL?

• Super low-latency applications

• Metadata-driven datasets

• Highly nonrelational data

• Need schema-less data constructs*

• Massive amounts of data (again, in the TB range)

• Rapid ingest of data (thousands of records/sec)

*Need != “It’s easier to do dev without schemas”

Page 37: Log Analysis At Scale

Three Problems of Persistence

• Somewhere to stage

• Somewhere to live

• Somewhere to search

Page 38: Log Analysis At Scale

Log Dispatcher Architecture Revisited

[Diagram: four App Servers; Kinesis Firehose; Amazon S3 (JSON); two Log Index nodes (ElasticSearch); Visualisation]

Page 39: Log Analysis At Scale

Amazon S3

• Simple Storage Service

• Canonical logging target for ELB, CloudFront, etc.

• Virtually unlimited amounts of storage

• Support for Lambda operations

• Very fast – ideal for feeding other services (Redshift, EMR/Hadoop)

• Data can be automatically pushed here from Amazon Kinesis Firehose

Page 40: Log Analysis At Scale

Three Problems of Persistence

• Somewhere to stage

• Somewhere to live

• Long tail

• Somewhere to search

Page 41: Log Analysis At Scale

Redshift

• PostgreSQL-based MPP database

• Petabyte-scale data warehousing

• Choice of nodes

• Dense compute

• Dense storage

• Already compatible with your existing BI tools

Up to 128 nodes at 2PB

~256PB/cluster

Page 42: Log Analysis At Scale

Three Problems of Persistence

• Somewhere to stage

• Somewhere to live

• Somewhere to search

(streaming data)

Page 43: Log Analysis At Scale

Amazon ElasticSearch Service

• ElasticSearch

• Popular/Open Source

• Commonly used for log and clickstream analysis

• Managed Solution

• We prepackage Kibana

• Integrated with IAM, Firehose, etc.

Page 44: Log Analysis At Scale

Three Problems of Persistence

• Somewhere to stage

• Somewhere to live

• Somewhere to search

(streaming data)

Page 45: Log Analysis At Scale

Demo: Storage!

Page 46: Log Analysis At Scale

ElasticSearch Index Mapping

curl -XPUT 'https://search-loggingatscale-demo-[...].us-east-1.es.amazonaws.com/blog-apache-combined' -d '
{
  "mappings": {
    "blog-apache-combined": {
      "properties": {
        "datetime": {
          "type": "date",
          "format": "dd/MMM/yyyy:HH:mm:ss Z"
        },
        "agent": {
          "type": "string",
          "index": "not_analyzed"
        }, [...]
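Why mark "agent" as not_analyzed? An analyzed string field is split into tokens, so a "top user agents" aggregation would rank fragments such as "mozilla" rather than whole User-Agent strings. The Python sketch below only illustrates that effect: `crude_tokens` is a rough, hypothetical stand-in for an analyzer, not ElasticSearch's actual one, and the sample strings are shortened for illustration.

```python
import re
from collections import Counter

# Shortened sample User-Agent strings, for illustration only.
agents = [
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36",
]


def crude_tokens(text):
    # Rough stand-in for an analyzer: lowercase, then split on non-word characters.
    return re.findall(r"\w+", text.lower())


# "Analyzed": counts are per token, so the top entries are fragments.
analyzed = Counter(token for agent in agents for token in crude_tokens(agent))

# "not_analyzed": counts are per exact string, which is what a top-N report needs.
not_analyzed = Counter(agents)

print(analyzed.most_common(3))
print(not_analyzed.most_common(1))
```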

Page 48: Log Analysis At Scale

Problems

• Storage (Temporary)

• Capture

• Storage (Permanent)

• Visualisation

Page 49: Log Analysis At Scale

How do I get my data in anyway?

Page 50: Log Analysis At Scale

Logging Architecture

[Diagram: four App Servers; two Log Aggregators (Kafka/Kinesis/MQ); two Log Index/Persist nodes (ElasticSearch, etc); Visualisation]

Page 51: Log Analysis At Scale

Logging Architecture

[Diagram: four App Servers; two Log Aggregators (Kafka/Kinesis/MQ); two ElasticSearch nodes; Visualisation]

Page 52: Log Analysis At Scale

Amazon Kinesis

• Firstly, a massively scalable, low cost way to send JSON objects to a 'stream' hosted by AWS

• Users can write applications (using KCL) to take data from it and parse/evaluate

• Apps can be written in Java, Lambda (Node, Python, Java), etc.

Page 53: Log Analysis At Scale

Amazon Kinesis: New Features (re:Invent 2015)

Kinesis Streams

• What was previously Kinesis

• Still very customisable, for innovative stream workloads

• Users still write an app to parse data from the stream

Kinesis Firehose

• Fully managed data ingest service

• Provision end point

• Send data to end point

• ???

• Data!

• Outputs to S3, Redshift, ElasticSearch Service

• (And can do two at once)
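The "send data to end point" step can be sketched in Python. With boto3, a Firehose client exposes put_record(DeliveryStreamName=..., Record={"Data": ...}); the sketch below takes the client as a parameter and uses a fake one so it runs without AWS access. The helper name `send_log_event`, the `FakeFirehose` class, the stream name and the event fields are all assumptions for illustration.

```python
import json


def send_log_event(client, stream_name, event):
    """Send one log event (a dict) to a Firehose delivery stream as a JSON line."""
    # Newline-delimit the records so they stay separable once delivered.
    payload = (json.dumps(event) + "\n").encode("utf-8")
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )


class FakeFirehose:
    """Hypothetical stand-in for boto3.client("firehose"), for local runs only."""

    def __init__(self):
        self.records = []

    def put_record(self, DeliveryStreamName, Record):
        self.records.append((DeliveryStreamName, Record["Data"]))
        return {"RecordId": str(len(self.records))}


fake = FakeFirehose()
send_log_event(fake, "blog-apache-combined", {"request": "GET / HTTP/1.1", "response": "200"})
print(fake.records)
```

Against a real account, `fake` would be replaced by boto3.client("firehose") with a delivery stream that already exists.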

Page 54: Log Analysis At Scale

Amazon Kinesis: New Features (Apr 2016)

Amazon Kinesis Agent

• Standalone Java application from AWS

• Collect and send logs to Kinesis Firehose

• Built-in:

• File rotation

• Failure retries

• Checkpoints

• Integrated with CloudWatch for alerting

Page 55: Log Analysis At Scale

Amazon Kinesis Agent

• Multiple input options

• SINGLELINE

• CSVTOJSON

• LOGTOJSON


• Hoorah!
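To make the LOGTOJSON idea concrete, here is a rough Python sketch of turning one Apache combined-format line into a JSON object. It is not the agent's implementation. The field names "datetime" and "agent" match the index mapping shown earlier; the remaining names, the regular expression and the sample line are assumptions for illustration.

```python
import json
import re

# Assumed pattern for the Apache combined log format.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<datetime>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<response>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def log_to_json(line):
    """Return the line as a JSON string, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    return json.dumps(match.groupdict())


# Invented sample line, for illustration only.
line = '203.0.113.7 - - [23/Apr/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/7.47"'
print(log_to_json(line))
```

All values are left as strings here; the "datetime" value keeps its original text so it can be parsed downstream using the date format declared in the mapping.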

Page 56: Log Analysis At Scale

Demo: Local Capture + Dispatch

Page 57: Log Analysis At Scale

S3

Page 58: Log Analysis At Scale

ElasticSearch

Page 59: Log Analysis At Scale

Problems

• Storage (Temp)

• Capture

• Storage (Perm)

• Visualisation

Page 60: Log Analysis At Scale

Kibana

• Pre-packaged with Amazon ElasticSearch Service

• Easy to manage with freeform data

• Dashboards!

Page 61: Log Analysis At Scale

Your existing BI tools

• As before – your data exists on S3 (JSON)

• S3 -> Redshift

• Commission a Redshift cluster with IAM roles

• Write a manifest of the files to load (JSON)

• Issue a load

• Redshift is PgSQL compatible

• Drivers exist for many tools
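The manifest and load steps can be sketched as follows. A Redshift manifest is a JSON document with an "entries" list of S3 URLs, and the load is a COPY statement that points at it. This Python sketch only builds the two strings; the bucket, table name and role ARN are hypothetical, and the SQL is assembled by plain string formatting for illustration, so inputs are assumed to be trusted.

```python
import json


def build_manifest(s3_urls):
    """Build a Redshift manifest listing the files to load."""
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_sql(table, manifest_url, iam_role_arn):
    """Build a COPY statement that loads JSON files named in a manifest."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS JSON 'auto' MANIFEST;"
    )


# Hypothetical names, for illustration only.
manifest = build_manifest([
    "s3://example-log-bucket/2016/04/23/part-0000.json",
    "s3://example-log-bucket/2016/04/23/part-0001.json",
])
copy_sql = build_copy_sql(
    "access_logs",
    "s3://example-log-bucket/manifests/2016-04-23.manifest",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)

print(manifest)
print(copy_sql)
```

The manifest would be uploaded to S3 first, and the COPY statement then run through any PostgreSQL-compatible client connected to the cluster.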

Page 62: Log Analysis At Scale

Demo: visualisation! (Kibana)

Page 63: Log Analysis At Scale

Problems

• Storage (Temporary)

• Capture

• Storage (Permanent)

• Visualisation

• Insight

Page 64: Log Analysis At Scale

Recap / Lessons / Next

• Logging is really hard.

• Use tools like Amazon Kinesis Firehose, Kinesis Agent and ElasticSearch Service to make it easier

• Reuse data, tools and people where possible

Page 65: Log Analysis At Scale

Lessons

Don’t be big data dog

Use the right tools at the right time

Page 66: Log Analysis At Scale

Q&A

Twitter

@alexjs

LinkedIn

https://sg.linkedin.com/in/alexjs

Email

[email protected]

Page 67: Log Analysis At Scale

Thank You!