Top Banner
One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group
24

One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Dec 31, 2015

Download

Documents

Kory Crawford
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP

Randy GuckPrincipal EngineerDell Software Group

Page 2: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

What is Doradus? Storage and query service

Leverages Cassandra NoSQL DB

Pure Java

- Stateless

- Embeddable or standalone

Open source: Apache 2.0 License

30 Doradus: The Tarantula Nebula Source: Hubble Space Telescope

Page 3: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Why Use Doradus? Easy to use: no client driver

Spider storage manager

- Good for unstructured data

OLAP storage manager

- Near real time data warehousing

Compared to Cassandra alone:

- Data model, searching, analytics

Compared to Hadoop:

- Fast data loads and queries

- Dense storage: less hardware

Cassandra

Data

Applications

REST API

OLAPSpide

r

Doradus

Page 4: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

A Multi-Node Cluster

Doradus

Cassandra

Data

Node 2

Cassandra

Data

Doradus

Cassandra

Data

Node 1 Node 3

Applications

REST API

Secondary Doradusinstances are optional

Page 5: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Why Did We Build Doradus OLAP? Some tough customer requirements:

- Statistical queries most important

- Need to scan millions of objects/second

- User-customizable “insights” = millions of possible queries

Couldn’t use indexes, pre-computed queries, etc.

Disk physics

- ~100's of random reads/second

- ~1000's of serial reads/second

Needed a radically new approach!

Page 6: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Doradus OLAP Combines ideas from:

- Online Analytical Processing: data arranged in static cubes

- Columnar databases: Column-oriented storage and compression

- NoSQL databases: Sharding

Features:

- Fast loading: up to 500K objects/second/node

- Dense storage: 1 billion objects in 2 GB!

- Fast cube merging: typically seconds

- No indexes!

Page 7: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Example: Message Tracking Schema

Message Participant Address

PersonManager

Employees

PersonAddress

AttachmentsMessage

ParticipantsMessage

AddressParticipants

Attachment

Page 8: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

DQL Object Queries Builds on Lucene syntax

- Full text queries

Adds link paths

- Directed graph searches

- Quantifiers and filters

- Transitive searches

Other features

- Stateless paging

- Sorting

Examples:

- LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]

- ALL(Participants).ANY(Address.WHERE(Email='*.gmail.com')).Person.Department : support

- Employees^(4).Office='San Jose’

Page 9: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

DQL Aggregate Queries Metric functions

- COUNT, AVERAGE, MIN, MAX, DISTINCT, ...

Multi-level grouping

Grouping functions

- BATCH, BOTTOM, FIRST, LAST, LOWER, SETS, TERMS, TOP, TRUNCATE, UPPER, WHERE, ...

Examples:

- metric=COUNT(*), AVERAGE(Size), MIN(Participants.Address.Person.Birthdate)

- metric=DISTINCT(Attachments.Extension);groups=Tags,

Participants.Address.Person.Department;query=Attachments.Size > 100000

- metric=AVERAGE(Size);groups=TOP(10,Participants.Address.Email)

Page 10: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

OLAP Data Loading

EventsEventsEvents

EventsEventsPeople

EventsEventsComputers

EventsEventsDomains

Sources

Page 11: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

OLAP Data Loading

Batch 1

EventsEventsEvents

EventsEventsPeople

EventsEventsComputers

EventsEventsDomains

Batch 2

Batch 3

...

Sources Batches

Batch 4

Page 12: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

OLAP Data Loading

Batch 1

EventsEventsEvents

EventsEventsPeople

EventsEventsComputers

EventsEventsDomains

Batch 2

Batch 3

...

2014-03-01

2014-02-28

2014-02-27

Sources Batches Shards

Batch 4

Merge

Page 13: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

OLAP Data Loading

Batch 1

EventsEventsEvents

EventsEventsPeople

EventsEventsComputers

EventsEventsDomains

Batch 2

Batch 3

...

2014-03-01

2014-02-28

2014-02-27

Sources Batches Shards OLAP Store

Batch 4

Merge

Page 14: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Storing Batches

Field ValuesID 5amhvv7J2otBu48Z6PE5cA 7CgvDf5mOU78jNVc58eu cZpz2q4Jf8Rc2HK9Cg08 ...

Size 48120 5435 24220 ...

SendDate 1280246462000 1279354872112 1279357261413 ...

Priority 0 0 1 ...

Subject.txt ballades encash nautch colloquy geared

nettlier outdoors culvert hypothec winder

stolons ungot guiding rupiahs outgone

...

Subject 1 2 0 ...

...

Data is sorted by object ID and stored as columnar, compressed blobs

Key ColumnsEmail/Message/2014-03-01/{Batch GUID}/ID [compressed data]

Email/Message/2014-03-01/{Batch GUID}/Size [compressed data]

Email/Message/2014-03-01/{Batch GUID}/SendDate [compressed data]

... ......

OLAP Table

Field Value Arrays

Compressed rows

Page 15: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Merging Batches

Key Columns

Email/Message/2014-03-01/ID [compressed data]

Email/Message/2014/03-01/Size [compressed data]

Email/Message/2014-03-01/SendDate [compressed data]

... ...

Email/Person/2014-03-01/ID [compressed data]

Email/Person/2014-03-01/FirstName [compressed data]

Email/Person/2014-03-01/LastName [compressed data]

... ...

Email/Address/2014-03-01/ID [compressed data]

Email/Address/2014-03-01/Person [compressed data]

Email/Address/2014/-03-01/Message [compressed data]

... ...

Email/Message/2014-02-28/ID [compressed data]

Email/Message/2014-02-28/Size [compressed data]

...

Batch #1: Shard 2014-03-01Message Table

ID ...

Size ...

SendDate ...

...

Batch #2: Shard 2014-03-01Message Table

ID ...

Size ...

SendDate ...

...

...

OLAP Store

Person Table

ID ...

FirstName ...

Lastname ...

...

Address Table

ID ...

Person ...

Messages ...

...

Person Table

ID ...

FirstName ...

Lastname ...

...

Address Table

ID ...

Person ...

Messages ...

...

Message table dataShard 2014-03-01

Person table dataShard 2014-03-01

Address table dataShard 2014-03-01

Data for other shards

Page 16: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Does Merging Take Long?

Page 17: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

OLAP Query Execution Example query:

- Count messages with Size between 1000-10000 andHasBeenSent=false in shards 2014-03-01 to 2014-03-31

How many rows are read?

- 2 fields x 31 shards = 62 rows

- Typically represents millions of values

Value arrays are scanned in memory

Physical rows are read on “cold” start only

- Multiple caching levels for “warm” and “hot” data

Page 18: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

1 Billion Objects in 2GB?Example Security Event (CSV format):

Fixed Fields Variable FieldsComputer Name MAILSERVER18 1 MAILSERVER18$Log Name Security 2Time Stamp Sun, 22 Jan 2013 08:09:50 UTC 3 WorkstationType Success Audit 4 (0x0,0x142999A)Source Security 5 3Category Logon/Logoff 6 KerberosEvent ID 540 7 KerberosUser Domain NT AUTHORITYUser Name SYSTEM  User SID S-1-5-18

MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,"Logon/Logoff", 540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos

Page 19: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Events Schema

EventsInsertio

nStrings

Fields:• ComputerName (text)• LogName (text)• Timestamp (timestamp)• Type (text)• Source (text)

Fields:• Index (integer)• Value (text)• Event (link)

Count: 115 Million Count: 880 Million

Params

Event (inverse)

• Category (text)• EventID (integer)• UserDomain (text)• UserSID (text)• Params (link)

Page 20: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Event Schema Load Load stats:

Total shards: 860

Total events: 114,572,247

Total ins strings: 879,529,753

Total objects: 994,102,000

Total load time: 2 hours, 2 minutes, 36 seconds (MacBook Air)

Space usage::nodetool -h localhost status

Datacenter: datacenter1

=======================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address Load Owns Host ID Token Rack

UN 127.0.0.1 1.96 GB 100.0% 860887ef-2027-431a-a425-c67a9445d0e6 -9176223118562734495 rack1

Page 21: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Demo 1) Count all Events in all shards

- 860 shards => 115M events

2) Find the top 5 hours-of-the-day when certain privileged events fail:

- Event IDs are any of 577, 681, 529

- Event type is ‘Failure Audit’

- Insertion string 8 is (0x0,0x3E7)

- Event occurred in first half of 2005 (181 shards)

Page 22: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Doradus OLAP Summary Advantages:

Simple REST API

All fields are searchable without indexes

Ad-hoc statistical searches

Support for graph-based queries

Near real time data warehousing

Dense storage = less hardware

Horizontally scalable when needed

Page 23: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Doradus OLAP Summary Good for applications where data:

Is continuous/streaming

Is structured to semi-structured

Can be loaded in batches

Is partitionable, especially by time

Is typically queried in a subset of shards

Emphasizes statistical queries

Page 24: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group.

Thank You! Where to find Doradus

- Source: github.com/dell-oss/Doradus

- Downloads: search.maven.org

Contact me

- [email protected]

- @randyguck