Page 1: Log Data Analysis Platform by Valentin Kropov

LOG DATA ANALYSIS PLATFORM

May, 2015

Agenda

1) User-Group Introduction

2) Problem Statement

3) Log Data Analysis System Overview

4) Task Analysis

5) Solution Architecture

6) Trade-off Analysis

7) Automation

8) Performance Testing

9) Outcome & Plans

PROBLEM STATEMENT

Demo Lab: Why did we start this project?

1) Increase Internal Experience

2) Create Reference Solution w/o NDA Limitations

3) Get Playground for Tests

4) Provide Demo Environment for Customers (using their data)

5) Decrease Time to Market (by introducing automation)

LOG DATA ANALYSIS PLATFORM : OVERVIEW

Log Data Analysis Platform Details

Key Facts:
• ~270-300 Web Servers
• Log Types: HTTPD Access logs, Error logs, Application Server Servlet logs, OS Service Logs
• ~500K events per minute
• 150 GB of data per day

Technologies:
• Flume
• Hadoop/HDFS, MapReduce
• Hive, Impala
• Oozie
• Elasticsearch, Kibana 3
• Tableau Analytics platform
• Puppet + Vagrant
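The key facts above imply a few derived figures that are handy for sizing. The sketch below works them out under two assumptions that are not on the slide: the ~500K events per minute rate is sustained around the clock, and "GB" means decimal gigabytes. Treat the results as rough orders of magnitude only.

```python
# Figures quoted on the slide.
EVENTS_PER_MINUTE = 500_000        # "~500K events per minute"
BYTES_PER_DAY = 150 * 10**9        # "150GB of data per day" (decimal GB assumed)
WEB_SERVERS = (270, 300)           # "~270-300 Web Servers"

MINUTES_PER_DAY = 24 * 60

# Assumption: the per-minute rate holds for the whole day.
events_per_second = EVENTS_PER_MINUTE / 60
events_per_day = EVENTS_PER_MINUTE * MINUTES_PER_DAY
avg_bytes_per_event = BYTES_PER_DAY / events_per_day

# Average load contributed by a single web server, for both ends of the range.
per_server_events_per_second = [events_per_second / n for n in WEB_SERVERS]

print(f"events per second (all servers): {events_per_second:,.0f}")
print(f"events per day:                  {events_per_day:,}")
print(f"average bytes per event:         {avg_bytes_per_event:.1f}")
print(f"events per second per server:    "
      f"{per_server_events_per_second[1]:.1f} to {per_server_events_per_second[0]:.1f}")
```

If the quoted rate is a peak rather than an average, the real average event size would be larger than this estimate.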

Log Data Examples

Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome

Vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 0 0 305416 260688 29160 2356920 2 2 4 1 0 0 6 1 92 2 0

iostat:
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu: %user %nice %system %iowait %steal %idle
 5.68 0.00 0.52 2.03 0.00 91.76
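To show how an access-log line like the one above turns into a structured event, here is a minimal Python sketch. It is an illustration only, not the parsing used in the platform (the slides do not show that). The names `CLF_PATTERN` and `parse_access_line` are made up for this example, and the pattern covers only the plain Common Log Format seen in the sample.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "method path protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Month abbreviations such as "Oct" are parsed with the current locale;
    # an English or C locale is assumed here.
    event["time"] = datetime.strptime(event["time"], "%d/%b/%Y:%H:%M:%S %z")
    event["status"] = int(event["status"])
    # Apache writes "-" when no response body was sent.
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = parse_access_line(sample)
print(event["host"], event["user"], event["method"], event["path"], event["status"], event["size"])
```

Lines that do not match (for example, error-log or vmstat output) return `None`, so each log type would need its own pattern.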

TASK ANALYSIS

Architecture Drivers: Use Cases

Architecture Drivers: Quality Attributes (1/3)

Architecture Drivers: Quality Attributes (2/3)

Architecture Drivers: Quality Attributes (3/3)

Architecture Drivers: Limitations

Demo Lab: Marketecture

SOLUTION ARCHITECTURE

Solution Architecture

[Diagram: three layers (Batch Layer, Speed Layer, Serving Layer). Components shown: Apache HTTP Servers, Data Stream, Raw Data Storage, Precomputing (Static Batch Views, Ad-hoc Batch Views), Real-Time Processing and Aggregations, Real-Time Views, Dashboard/Search, Corporate BI Tool. Legend: layer boundary; data flow (with direction indicated); query flow.]

Avro as a Raw Data Storage file format

Parquet as a Batch Views file format

Star schema as a Batch Views data model
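To make the "star schema" idea concrete, the sketch below builds a tiny fact table with two dimension tables and runs one aggregate query. Everything in it is an assumption for illustration: the table and column names, the sample rows, and the use of SQLite as a stand-in. On the platform the batch views are Parquet files, and the slides do not show the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table (request counts per minute) surrounded by dimension tables.
conn.executescript("""
CREATE TABLE dim_server (server_id INTEGER PRIMARY KEY, hostname TEXT);
CREATE TABLE dim_status (status_id INTEGER PRIMARY KEY, http_status INTEGER, status_class TEXT);
CREATE TABLE fact_requests (
    server_id     INTEGER REFERENCES dim_server(server_id),
    status_id     INTEGER REFERENCES dim_status(status_id),
    minute        TEXT,
    request_count INTEGER,
    bytes_sent    INTEGER
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO dim_server VALUES (?, ?)",
                 [(1, "web-01"), (2, "web-02")])
conn.executemany("INSERT INTO dim_status VALUES (?, ?, ?)",
                 [(1, 200, "2xx"), (2, 404, "4xx")])
conn.executemany("INSERT INTO fact_requests VALUES (?, ?, ?, ?, ?)", [
    (1, 1, "2015-05-01T10:00", 120, 279120),
    (1, 2, "2015-05-01T10:00", 3, 900),
    (2, 1, "2015-05-01T10:00", 95, 220970),
])

# A typical BI-style query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT s.hostname, st.status_class, SUM(f.request_count)
    FROM fact_requests f
    JOIN dim_server s  ON s.server_id = f.server_id
    JOIN dim_status st ON st.status_id = f.status_id
    GROUP BY s.hostname, st.status_class
    ORDER BY s.hostname, st.status_class
""").fetchall()

for hostname, status_class, total in rows:
    print(hostname, status_class, total)
```

The point of the layout is that BI queries only ever join the narrow dimension tables to one wide fact table, which suits columnar formats such as Parquet.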

Architecture: Flume Topology

Batch ETL

TRADE-OFF ANALYSIS

Distribution Selection

Hive Stinger vs Impala

Compression Ratio

Access Speed

AUTOMATION

Automation (saves time and money)

80%: Development and Debugging (Local Development)

20%: F&P Testing, Demo (Cloud Development)

vagrant up

Automation Process

Phase: VM Provisioning
Tool: Vagrant
Notes:
• Supports: VirtualBox, VMware ESX, Amazon AWS

Phase: VM Bootstrapping
Tool: Puppet
Notes:
• Installs Cloudera Manager, Cloudera Distribution Hadoop, Elasticsearch + Kibana, Flume, MicroStrategy, Log Generator.
• Creates Cluster using Cloudera Manager API.

Phase: Configure ETL and BI
Tool: Puppet
Notes:
• Configures Flume, Oozie, Elasticsearch, Impala, Hive, MicroStrategy Dashboards

Phase: Integration Tests
Tool: Puppet
Notes:
• Generates Workload and ensures data goes through.
• Checks Logs for errors.
• Calculates timing/throughput.
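The three integration-test checks listed above (data goes through, no errors in logs, timing/throughput) can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `summarize_run`, its parameters, and the sample log lines are hypothetical, and the real tests are driven by Puppet in ways the slides do not show.

```python
def summarize_run(events_sent, events_received, elapsed_seconds, log_lines):
    """Summarize one test run: delivery check, throughput, and error lines found."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Simplistic error detection: any log line containing the word ERROR.
    error_lines = [line for line in log_lines if "ERROR" in line]
    return {
        "all_delivered": events_received == events_sent,
        "throughput_per_second": events_received / elapsed_seconds,
        "error_lines": error_lines,
    }

# Made-up numbers for a 10-second run.
report = summarize_run(
    events_sent=42_000,
    events_received=42_000,
    elapsed_seconds=10.0,
    log_lines=["INFO sink started", "ERROR channel is full"],
)
print(report)
```

A run would count as passed only when `all_delivered` is true and `error_lines` is empty.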

PERFORMANCE TESTING

Log Generator

1 Thread can generate:
• 4200 events / second (File source)
• 5500 events / second (TCP source)
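Given the per-thread rates above, a quick calculation shows how many generator threads a target load would need. The sketch assumes throughput scales linearly with the number of threads, which the slides do not claim. The function name `threads_needed` is made up for this example.

```python
import math

# Per-thread generation rates quoted on the slide (events per second).
EVENTS_PER_SECOND_PER_THREAD = {"file": 4200, "tcp": 5500}

def threads_needed(target_events_per_minute, source):
    """Smallest thread count that reaches the target, assuming linear scaling."""
    per_thread_per_minute = EVENTS_PER_SECOND_PER_THREAD[source] * 60
    return math.ceil(target_events_per_minute / per_thread_per_minute)

# 500K events per minute is the production rate quoted in the key facts.
for target in (200_000, 500_000):
    for source in ("file", "tcp"):
        print(f"{target:>7,} events/min via {source}: {threads_needed(target, source)} thread(s)")
```

Under this assumption a single thread covers 200K events per minute, and two threads cover the 500K events per minute quoted for production.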

Accurate Sizing

[Diagram: event rates of 20k/min, 50k/min, 100k/min and 200k/min; label: Calculator!]

OUTCOME & PLANS

Outcome

1) Demo lab, playground, testing platform (in 1 hour)

2) Sizing Calculator

3) Helped to get 3 new customers (one is really, really huge)

4) Strategic Partnership with Cloudera

5) Tons of experience and fun

Plans

1) Add support for other Hadoop Distributions (Hortonworks, MapR)

2) Make Project Open-Source

Thank You!

SoftServe US Office
One Congress Plaza, 111 Congress Avenue, Suite 2700
Austin, TX 78701
Tel: 512.516.8880

Contacts
Valentyn [email protected]: 866.687.3588 x4341