Where Is Your Data?: An Introduction to Problems and Bottlenecks in Data Systems John Joo, Program Director David Drummond, Program Director Insight Data Engineering

Aug 08, 2015

Page 1

Where Is Your Data?: An Introduction to Problems and Bottlenecks in Data Systems

John Joo, Program Director
David Drummond, Program Director

Insight Data Engineering

Page 2

Page 3

Program mentors are data engineers from top technology companies including:

Page 4

Goals

• Understand the different components of the tech stack at a high level.

• Understand the hardware bottlenecks that dictate the tech stack.

• Understand the tech stacks that are generally used for different types of companies, and why.

Page 5

Computing basics

Page 6

• Various ports (I/O): up to ~10 GB/s

• CPU (processor): ~1 GHz

• Hard drive (storage): ~250 GB

• RAM (memory): ~8 GB

Page 8

• Various ports (I/O): up to ~10 GB/s

• RAM (memory): ~8 GB

• CPU (processor): ~1 GHz

• Hard drive (storage): ~250 GB

Network, Processing, Storage

Page 9

What does this look like for a business?

Page 10

Page 11

Data @ Point of Sale

• 1 transaction → 2 kB

• What did Customer buy?

• How much did Customer spend?

• When did Customer make this transaction?

Page 12

Daily Data @ Individual Store

• ~50,000 transactions / store / day → 100 MB

• Servers at back of store

• What items were sold today?

• What was our revenue for today?

• How much was refunded today?

• What do we need to do to restock for tomorrow?

Page 13

Yearly Data @ Individual Store

• 20 million transactions → 40 GB / year

• What are some seasonal trends in purchased items?

• How should we target our coupons or advertisements to local customers?

• Who were the most efficient employees?

• Should the store’s hours change depending on the time of year?
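The daily and yearly volumes on these slides follow from simple multiplication. A minimal sketch, assuming decimal units (1 kB = 1,000 bytes) and a 365-day year; the constant and function names are illustrative and do not come from the slides:

```python
# Back-of-the-envelope data volumes for a single store.
# Assumptions (not from the slides): decimal units and a 365-day year.
KB = 1_000
MB = 1_000 * KB
GB = 1_000 * MB

BYTES_PER_TRANSACTION = 2 * KB   # from the slides: 1 transaction -> 2 kB
TRANSACTIONS_PER_DAY = 50_000    # from the slides: ~50,000 / store / day


def daily_bytes(transactions_per_day: int = TRANSACTIONS_PER_DAY) -> int:
    """Bytes one store generates per day."""
    return transactions_per_day * BYTES_PER_TRANSACTION


def yearly_bytes(transactions_per_day: int = TRANSACTIONS_PER_DAY, days: int = 365) -> int:
    """Bytes one store generates per year."""
    return daily_bytes(transactions_per_day) * days


if __name__ == "__main__":
    print(f"daily:  {daily_bytes() / MB:.0f} MB")
    print(f"yearly: {yearly_bytes() / GB:.1f} GB")
```

Under these assumptions the estimate is 100 MB per day and 36.5 GB per year, which the slides round to about 20 million transactions and 40 GB.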

Page 14

• Various ports (I/O): up to ~10 GB/s

• RAM (memory): ~8 GB

• CPU (processor): ~1 GHz

• Hard drive (storage): ~250 GB
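One way to read this hardware slide next to the store data is to ask which tier of a single machine can hold a given dataset. The sketch below is only an illustration: the capacities are the slide's figures, while the decimal units and the `placement` helper are assumptions.

```python
# Where does a dataset fit on the single machine from the slide?
# Capacities come from the slide; decimal units are an assumption.
MB = 1_000_000
GB = 1_000 * MB

RAM_BYTES = 8 * GB       # RAM (memory) ~8 GB
DISK_BYTES = 250 * GB    # Hard drive (storage) ~250 GB


def placement(dataset_bytes: int) -> str:
    """Return the fastest tier of one machine that can hold the dataset."""
    if dataset_bytes <= RAM_BYTES:
        return "memory"
    if dataset_bytes <= DISK_BYTES:
        return "disk"
    return "more than one machine"


if __name__ == "__main__":
    print("one store, one day: ", placement(100 * MB))
    print("one store, one year:", placement(40 * GB))
```

With the slide figures, a day of store data fits in memory, while a year of store data (40 GB) exceeds the ~8 GB of RAM but still fits on the ~250 GB drive.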

Page 15

Yearly Data @ All Stores

• 7 billion transactions → 10 TB / year

• Requires data centers

• What national sales campaigns should we run? Ads, coupons, commercials, web.

• What should the CEO's compensation be?

• Where should we open Supercenters, Discount Stores, Neighborhood Stores, Walmart Expresses?

• What music should we play in the stores?

Page 16

Complete Historic Data @ All Stores

• 16 years (1992 - 2008)

• 1 trillion transactions → 2.5 PB

• Data centers

  • “Area 71” in Caverna, Missouri

    • 125,000-square-foot

    • 460 TB

  • Colorado Springs

    • 210,000-square-foot

    • $100 million

[Image: Area 71]
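The company-wide figures can be sanity-checked the same way as the single-store ones. This is a rough sketch assuming decimal units and the 2 kB per transaction quoted earlier; `volume_bytes` is an illustrative name. The slides quote rounded figures, so the point is agreement in order of magnitude, not exact equality.

```python
# Compare the slide figures with straight multiplication by 2 kB per transaction.
# Decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes) are an assumption.
KB, TB, PB = 10**3, 10**12, 10**15
BYTES_PER_TRANSACTION = 2 * KB   # from the slides


def volume_bytes(transactions: int) -> int:
    """Total bytes for a given number of transactions."""
    return transactions * BYTES_PER_TRANSACTION


if __name__ == "__main__":
    yearly = volume_bytes(7_000_000_000)         # 7 billion transactions / year
    historic = volume_bytes(1_000_000_000_000)   # 1 trillion transactions
    print(f"all stores, one year: {yearly / TB:.0f} TB (slide: ~10 TB)")
    print(f"all stores, 16 years: {historic / PB:.1f} PB (slide: 2.5 PB)")
```

Straight multiplication gives 14 TB per year and 2 PB over the full history, the same order of magnitude as the slide's 10 TB and 2.5 PB.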

Page 17

• Various ports (I/O)

• RAM (memory)

• CPU (processor)

• Hard drive (storage)

Network, Processing, Storage

Page 18

Bottlenecks in Data Systems

Proper data system design should consider these limiting bottlenecks:

• Loading data into the CPU and memory

• Finding data on the disk

• Moving data across the network

Page 20

Bottlenecks: Loading Data

• All data that is processed must be loaded into the CPU

[Diagram: Disk Storage, Memory, CPU; axes: Price, Speed]

• Solution: Distributed computing with ample memory
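To give a feel for what "distributed computing with ample memory" implies, the sketch below counts how many machines would be needed to hold a dataset entirely in RAM. It reuses the ~8 GB of RAM and the 10 TB yearly volume from earlier slides; the `usable_fraction` parameter and the function name are assumptions added for illustration.

```python
import math

GB = 10**9
TB = 10**12


def machines_needed(dataset_bytes: int, ram_per_machine_bytes: int,
                    usable_fraction: float = 1.0) -> int:
    """Machines required to keep the whole dataset in memory.

    usable_fraction is a hypothetical knob for the share of RAM that is
    actually available to data (the rest goes to the OS and programs).
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable = ram_per_machine_bytes * usable_fraction
    return math.ceil(dataset_bytes / usable)


if __name__ == "__main__":
    # 10 TB / year for all stores, ~8 GB of RAM per machine (slide figures).
    print(machines_needed(10 * TB, 8 * GB))
    print(machines_needed(10 * TB, 8 * GB, usable_fraction=0.5))
```

With the slide figures this comes to 1,250 machines if all RAM is usable and 2,500 if only half is, which is why larger memory per machine matters as much as the number of machines.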

Page 22

Bottlenecks: Finding Data

• Finding a new file on disk (known as a random seek)

• Solution: SSD and structuring data in the order it is accessed

[Diagram: Actuator arm with head that reads from disk; Beginning of Desired File; End of Desired File]
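The cost of random seeks can be estimated with a simple model: every jump to a new location costs a fixed seek time, and reading itself proceeds at the disk's sequential rate. The seek time and throughput below are assumed, typical-order values for a spinning disk, not figures from the slides.

```python
# Rough cost model for reading from a spinning disk.
# SEEK_SECONDS and THROUGHPUT are assumed values, not slide figures.
SEEK_SECONDS = 0.010        # assumed time to move the head to a new location
THROUGHPUT = 100_000_000    # assumed sequential read rate, bytes per second


def read_seconds(total_bytes: int, seeks: int) -> float:
    """Time to read total_bytes when the head must reposition `seeks` times."""
    return seeks * SEEK_SECONDS + total_bytes / THROUGHPUT


if __name__ == "__main__":
    day_bytes = 50_000 * 2_000   # one store-day of 2 kB transactions (slides)
    print(f"stored in access order: {read_seconds(day_bytes, seeks=1):.2f} s")
    print(f"scattered on disk:      {read_seconds(day_bytes, seeks=50_000):.0f} s")
```

Under these assumed numbers, reading one store-day laid out in access order takes about a second, while the same data scattered so that every transaction needs its own seek takes several hundred seconds. That gap is the motivation for SSDs, which have no moving head, and for storing data in the order it is read.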

Page 24

Bottlenecks: Moving Data

• Moving data from machine to machine over a network

• Solution: Keeping data close to the processors
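A quick estimate shows why moving raw data is costly compared with moving a small result. The sketch takes the "up to ~10 GB/s" port figure from the hardware slide as a best case; real links between machines are often slower. The 1 MB summary size is a hypothetical value chosen only for contrast.

```python
GB = 10**9
TB = 10**12

# Best-case bandwidth from the hardware slide ("up to ~10 GB/s"), in bytes/s.
PORT_BANDWIDTH = 10 * GB


def transfer_seconds(n_bytes: float, bandwidth_bytes_per_s: float = PORT_BANDWIDTH) -> float:
    """Time to push n_bytes through a link, ignoring latency and overhead."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return n_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    raw = 10 * TB          # one year of data for all stores (slide figure)
    summary = 1_000_000    # hypothetical 1 MB aggregated result
    print(f"ship the raw data:   {transfer_seconds(raw):.0f} s")
    print(f"ship only a summary: {transfer_seconds(summary) * 1000:.1f} ms")
```

Even at the best-case rate, shipping 10 TB takes on the order of a thousand seconds, while a result computed where the data lives moves in well under a second.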

Page 25

Bottlenecks: Example

• Processing a 2 kB transaction in memory, sequentially and randomly on disk, or across the network

[Figure: ratio labels 100:1, 200:1, 50:1]

Page 26

Tech Stacks for Companies

Depending on your growth plans:

• Single system with small data

• Distributed data center with large data

• Renting computers for flexibility

Page 27

Small Firms with Small Data

• Example: Small medical firm with slow growth

• Pros: Easy to maintain, data locality, inexpensive

• Cons: Difficult to grow quickly, risky, not ideal for analysis

Page 29

Small Firms with Small Data

Page 30

Large Firms with Stable Growth

• Example: Facebook with steadily growing data centers

• Pros: Economies of scale, redundancy, innovative design

• Cons: Upfront capital, dedicated maintenance

• >100 PB of data

• 7 PB / day

• 1 kW / TB

• ~$20 / TB / month
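The storage figures on this slide translate into a monthly bill with one multiplication. A minimal sketch, assuming decimal units (1 PB = 1,000 TB) and treating ">100 PB" as exactly 100 PB, so the result is a lower bound; the function name is illustrative.

```python
TB_PER_PB = 1_000  # decimal units, an assumption


def monthly_storage_cost_usd(stored_pb: float, usd_per_tb_month: float = 20.0) -> float:
    """Monthly storage cost at a flat price per TB per month."""
    return stored_pb * TB_PER_PB * usd_per_tb_month


if __name__ == "__main__":
    # >100 PB at ~$20 / TB / month (slide figures), taken as a lower bound.
    print(f"${monthly_storage_cost_usd(100):,.0f} per month")
```

At these figures the lower bound is $2,000,000 per month, which gives a sense of the scale at which owning data centers and paying upfront capital becomes worthwhile.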

Page 31

Start-Ups with Exponential Growth

• Example: AirBnB - rent processing and storage from AWS

• Pros: Scales easily, no maintenance, no upfront capital

• Cons: Expensive in the long run, depend on data provider

• 50 GB / day

• $20-50 / TB / month
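"Expensive in the long run" can be made concrete by accumulating the rental bill month by month. The sketch below uses the slide's 50 GB per day and $20-50 per TB per month, and assumes that data is never deleted, that the ingest rate stays constant, that months have 30 days, and that each month is billed on the total stored at its end. All of those are simplifying assumptions, not facts from the slides.

```python
GB_PER_TB = 1_000  # decimal units, an assumption


def cumulative_cost_usd(months: int, usd_per_tb_month: float,
                        gb_per_day: float = 50, days_per_month: int = 30) -> float:
    """Total rent paid over `months` when stored data grows at a constant rate."""
    total = 0.0
    stored_tb = 0.0
    for _ in range(months):
        stored_tb += gb_per_day * days_per_month / GB_PER_TB  # new data this month
        total += stored_tb * usd_per_tb_month                 # bill on everything kept
    return total


if __name__ == "__main__":
    for rate in (20, 50):
        print(f"${rate}/TB/month -> ${cumulative_cost_usd(12, rate):,.0f} over 12 months")
```

Because the bill applies to everything stored so far, the cumulative cost grows roughly with the square of the number of months even at a constant ingest rate, and faster still under the exponential growth the slide describes.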

Page 32

Start-Ups with Exponential Growth

• Example: Netflix - AWS fails on Christmas Eve

• Con: You can rent the computers, but you own the failure

Page 33

Data Pipeline

• Ingestion: Gathering data in a reliable way

• File System: Storing the unstructured data redundantly

• Batch Processing: Processing the data in large batches at the data center

• Realtime Processing: Processing live streaming data reliably

• Database: Organizing data for quick access
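The five pipeline stages can be illustrated with a toy, single-process sketch. Everything here is a stand-in chosen for illustration: a Python list plays the file system, a dict plays the database, and the record format (`store,amount`) is hypothetical. Real pipelines use dedicated systems for each stage.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]


def ingest(raw_lines: Iterable[str]) -> Iterator[Record]:
    """Ingestion: parse 'store,amount' lines and skip malformed ones."""
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue
        store, amount = parts
        try:
            yield {"store": store, "amount": float(amount)}
        except ValueError:
            continue


def batch_revenue(records: Iterable[Record]) -> Dict[str, float]:
    """Batch processing: total revenue per store over all stored records."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["store"]] += record["amount"]
    return dict(totals)


def running_total(records: Iterable[Record]) -> Iterator[float]:
    """Realtime processing: emit an updated total after every record."""
    total = 0.0
    for record in records:
        total += record["amount"]
        yield total


def run_pipeline(raw_lines: Iterable[str]) -> Tuple[List[Record], Dict[str, float], List[float]]:
    stored = list(ingest(raw_lines))      # "file system": keep the parsed records
    database = batch_revenue(stored)      # "database": results keyed for quick lookup
    live = list(running_total(stored))    # realtime view of the same records
    return stored, database, live


if __name__ == "__main__":
    sample = ["A,2.50", "B,4.00", "bad line", "A,1.50"]
    stored, database, live = run_pipeline(sample)
    print(database)
    print(live)
```

The sketch keeps each stage as a separate function so that the batch and realtime paths read the same ingested records, mirroring how the two processing stages on the slide sit side by side.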

Page 34

Conclusion

• Understand the different components of the tech stack at a high level

• Understand the hardware bottlenecks that dictate the tech stack

• Understand the tech stacks that are generally used for different types of companies, and why