Heterogeneous Persistence: A guide for the modern DBA
Marcos Albe, Jervin Real, Ryan Lowe, Liz Van Dijk

Apr 16, 2017

Page 1: Heterogenous Persistence

Heterogeneous Persistence
A guide for the modern DBA

Marcos Albe
Jervin Real
Ryan Lowe
Liz Van Dijk

Page 2: Heterogenous Persistence

Introduction

Hello everyone

Page 3: Heterogenous Persistence

Introduction

MySQL everyone?

Page 4: Heterogenous Persistence

Introduction

Memcached?

Page 5: Heterogenous Persistence

Agenda
● Introduction
● Why a single DBMS is not enough
● What makes a DBMS
● Different flavors of DBMS
● Top picks

Page 6: Heterogenous Persistence

Why one DBMS is not enough

"If you feel things are not efficient in your code, it is likely that you are suffering from a poor choice or design of data structures" ~ Anonymous

Page 7: Heterogenous Persistence

Why one DBMS is not enough
● Different data structures
● Different access patterns
● Different consistency and durability requirements
● Different scaling needs
● Different budgets
● Theoretical fundamentalism

Page 8: Heterogenous Persistence

Why one DBMS is not enough

A more concrete example
OLAP -vs- OLTP

Page 9: Heterogenous Persistence

OLAP -vs- OLTP

Page 10: Heterogenous Persistence

PROs
● No SPOF
● Workload-optimized services
● Easier to scale*

CONs
● Additional complexity
● Operational needs (additional staffing)
● Cost ($$$)*

Page 11: Heterogenous Persistence

La Carte
● Key Value Stores
  ○ Memcached
  ○ MemcacheDB
  ○ Redis
  ○ Riak KV
  ○ Cassandra
  ○ Amazon's DynamoDB
● Graph
  ○ Neo4j
  ○ OrientDB
  ○ Titan
  ○ Virtuoso
  ○ ArangoDB
● Relational
  ○ MySQL
  ○ PostgreSQL
● Time Series
  ○ InfluxDB
  ○ Graphite
  ○ OpenTSDB
  ○ Blueflood
  ○ Prometheus
● Columnar
  ○ Vertica
  ○ Infobright
  ○ Amazon Redshift
  ○ Apache HBase
● Document
  ○ MongoDB
  ○ Couchbase
● Fulltext
  ○ Sphinx
  ○ Lucene/Solr

Page 12: Heterogenous Persistence

What makes a DB?

Page 13: Heterogenous Persistence

General Criteria
● Specialty
● Cost
● API/Interfaces
● Scalability
● CAP
● ACID
● Secondary Features

Page 14: Heterogenous Persistence

What makes a DBMS: General
● Licensing
● Language support
● OS support
● Community & workforce
● Tools ecosystem

Page 15: Heterogenous Persistence

What makes a DBMS: Fundamental Features
● Data Architecture
  ○ Logical data model
  ○ Physical data model
● Standards adherence (where defined)
● Atomicity
● Consistency
● Isolation
● Durability
● Referential integrity
● Transactions
● Locking
● Crash recovery
● Unicode support

Page 16: Heterogenous Persistence

What makes a DBMS: Fundamental Features cont.
● Interface / connectors / protocols
● Sequences / auto-incrementals / atomic counters
● Conditional entry updates
● MapReduce
● Compression
● In-memory
● Availability
● Concurrency handling
● Scalability
● Embeddable
● Backups

Page 17: Heterogenous Persistence

What makes a DBMS: querying capabilities
● CRUD
● Union
● Intersect
● JOIN (inner, outer)
● Inner selects
● Merge joins
● Common Table Expressions
● Windowing Functions
● Parallel Query
● Subqueries
● Aggregation
● Derived tables

Page 18: Heterogenous Persistence

What makes a DBMS: programmatic capabilities
● Cursors
● Triggers
● Stored procedures
● Functions
● Views
● Materialized views
● Virtual columns
● UDF
● XML/JSON/YAML support

Page 19: Heterogenous Persistence

What makes a DBMS: sizing limits
● Database (sum of table sizes)
● Number of tables
● Individual table size
● Variable-length column size
● Row width
● Columns per row
● Row count
● Column name
● Blob size
● Char
● Numeric
● Date (min / max)

Page 20: Heterogenous Persistence

What makes a DBMS: indexing
● B-Tree
● Full text indexing
● Hash
● Bitmap
● Expression
● Partials
● Reverse
● GiST
● GIS indexing
● Composite keys
● Graph support

Page 21: Heterogenous Persistence

What makes a DBMS: high availability
● Replication
● Failover
● Clustering
● CAP choice

Page 22: Heterogenous Persistence

What makes a DBMS: scalability

Partitioning
● Range
● Hash
● Range+hash
● List
● Expression
● Sub-partitioning

Sharding
● By key
● By table
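Three of the partitioning schemes listed above (range, hash, list) can be sketched as plain routing functions. This is only an illustration: the partition names, year boundaries, and region lists below are assumptions made for the example, not taken from any particular DBMS.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for the hash scheme
REGIONS = {"eu": ["DE", "FR"], "amer": ["US", "BR"]}  # assumed list-partition members


def range_partition(order_year):
    """Range: contiguous intervals of the partitioning value map to one partition."""
    if order_year < 2015:
        return "p_before_2015"
    if order_year < 2017:
        return "p_2015_2016"
    return "p_2017_onward"


def hash_partition(customer_id):
    """Hash: a stable hash of the value, modulo the partition count."""
    return zlib.crc32(str(customer_id).encode("utf-8")) % NUM_PARTITIONS


def list_partition(country):
    """List: explicit membership lists decide the partition."""
    for name, members in REGIONS.items():
        if country in members:
            return f"p_{name}"
    return "p_other"


print(range_partition(2016), hash_partition(42), list_partition("FR"))
```

Range and list partitioning keep related rows together, which helps pruning; hash partitioning spreads rows evenly but gives up locality.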

Page 23: Heterogenous Persistence

What makes a DBMS: supported data types
● Integer
● Floating point
● Decimal
● String
● Binary
● Date/time
● Boolean
● Set
● Enumeration
● Blob
● Clob
● JSON/XML/YAML (as native types)

Page 24: Heterogenous Persistence

What makes a DBMS: security features
● Authentication methods
● Access Control Lists
● Pluggable Authentication Modules support
● Encryption at rest
● Encryption over the wire
● User proxy

Page 25: Heterogenous Persistence

What makes a DBMS: workload
● Data organization model: unstructured, semi-structured, structured
● Data model (schema) stability: static, stable, dynamic, highly dynamic
● Writes: append-only, append-mostly, updates-only, updates-mostly
● Reads: full scans, range scans, multi-range scans, point reads
● Reads by age: new only, new mostly, old only, old mostly, whole range
● Reads by complexity: simple, related, deeply nested relations, ...

Page 26: Heterogenous Persistence

ACID vs BASE

ACID
● Atomic
● Consistent
● Isolated
● Durable

BASE
● Basically Available
● Soft state
● Eventual consistency

Page 27: Heterogenous Persistence

CAP Theorem
● Consistency
● Availability
● Partition tolerance

Page 28: Heterogenous Persistence

Relational Databases

Page 29: Heterogenous Persistence

Relational Databases

Page 30: Heterogenous Persistence

Relational Databases: write anomalies

Page 31: Heterogenous Persistence

Relational Databases: write anomalies

Page 32: Heterogenous Persistence

Relational Databases: normalization

Page 33: Heterogenous Persistence

Relational Databases: normalization

Page 34: Heterogenous Persistence

Relational Databases: query language

// Pseudocode: without a query language, the application scans every row itself.
results = new Array();
table = open('mydata');
while (row = table.fetch()) {
    if (row.x > 100) {
        results.push(row);
    }
}

Page 35: Heterogenous Persistence

Relational Databases: query language

SELECT * FROM mydata WHERE x > 100;
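The contrast between the two slides (scan and filter by hand versus stating the condition in SQL) can be tried out with a small sketch. It assumes an in-memory SQLite database and a hypothetical `mydata` table with columns `id` and `x`; the sample values are made up for the example.

```python
import sqlite3

# Hypothetical table mirroring the slides: "mydata" with a numeric column x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mydata (id INTEGER PRIMARY KEY, x INTEGER)")
conn.executemany(
    "INSERT INTO mydata (id, x) VALUES (?, ?)",
    [(1, 50), (2, 150), (3, 100), (4, 300)],
)

# Procedural approach: fetch every row and filter in application code.
procedural = [
    row for row in conn.execute("SELECT id, x FROM mydata ORDER BY id") if row[1] > 100
]

# Declarative approach: state the condition and let the engine decide how to find the rows.
declarative = conn.execute(
    "SELECT id, x FROM mydata WHERE x > 100 ORDER BY id"
).fetchall()

print(procedural == declarative)
```

Both approaches return the same rows; the difference is that the declarative form lets the engine use an index or another access path instead of reading everything.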

Page 36: Heterogenous Persistence

Relational Databases: JOINs
SELECT o.order_id AS `Order`,
       CONCAT(c.customer_name, ' (', c.customer_email, ')') AS Customer,
       GROUP_CONCAT(i.item_name),
       SUM(i.item_price)
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN items AS i ON i.item_id = oi.item_id
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY o.order_id, c.customer_name, c.customer_email;
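To see a query of this shape run end to end, here is a minimal sketch using SQLite instead of MySQL (an assumption made so the example is self-contained). SQLite has no `CONCAT` function in older versions, so the `||` operator is used; the table definitions and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT, customer_email TEXT);
CREATE TABLE items (item_id INTEGER PRIMARY KEY, item_name TEXT, item_price REAL);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customers(customer_id));
CREATE TABLE order_items (order_id INTEGER REFERENCES orders(order_id), item_id INTEGER REFERENCES items(item_id));
INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com');
INSERT INTO items VALUES (10, 'keyboard', 40.0), (11, 'mouse', 15.5);
INSERT INTO orders VALUES (100, 1);
INSERT INTO order_items VALUES (100, 10), (100, 11);
""")

# One output row per order: customer label, item names, and order total.
rows = conn.execute("""
SELECT o.order_id,
       c.customer_name || ' (' || c.customer_email || ')' AS customer,
       GROUP_CONCAT(i.item_name) AS item_names,
       SUM(i.item_price) AS total
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN items AS i ON i.item_id = oi.item_id
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY o.order_id, c.customer_name, c.customer_email
""").fetchall()

order_id, customer, item_names, total = rows[0]
print(order_id, customer, sorted(item_names.split(",")), total)
```

The four normalized tables are stitched back together at query time, which is exactly the flexibility (and the cost) the later PROs/CONs slide refers to.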

Page 37: Heterogenous Persistence

Relational Databases: good use cases
● Highly-structured data with complex querying needs
● Projects that need very high data durability and guarantees of database-level consistency and integrity
● Simple projects with limited data growth and a limited number of entities
● Projects that must meet PCI DSS, HIPAA, or similar security requirements
● Analysis of portions of larger Big Data stores
● Projects where duplicated data volumes would be a problem

Page 38: Heterogenous Persistence

Relational Databases: bad use cases
● Unstructured data
● Deep Hierarchies / Nested -> XML
● Deep recursion
● Ever-growing datasets; projects that are basically logging data
● Projects recording time-series
● Reporting on massive datasets

Page 39: Heterogenous Persistence

Relational Databases: bad use cases
● Projects supporting extreme concurrency

● Projects supporting massive data intake

● Queues

● Cache storage

Page 40: Heterogenous Persistence

PROs
● Very mature
● Abundant workforce
● ACID guarantees
● Referential integrity
● Highly expressive query language
● Ubiquitous

CONs
● Rigid schema
● Difficult to scale horizontally
● Expensive writes
● JOIN bombs

Page 41: Heterogenous Persistence

Relational Databases: MySQL

Page 42: Heterogenous Persistence

Relational Databases: MySQL
● Well known / mature / extensive documentation
● GPLv2 + commercial license for OEMs, ISVs and VARs
● Client libraries for about every programming language
● Many different engines
● SQL/ACID impose scalability limits
● Asynchronous / Semi-synchronous / Virtually synchronous replication
● Can be AP or CP depending on replication model

Page 43: Heterogenous Persistence

PROs
● Open source
● Mature and ubiquitous
● ACID
● Choice of AP or CP
● Highly available
● Abundant tooling and expertise
● General purpose; likely good for starting anything you want

CONs
● Difficult to shard
● Replication issues
● Not 100% standards compliant
● Storage-engine-imposed limitations
● General purpose; no silver-bullet solutions for scaling!

Page 44: Heterogenous Persistence

Relational Databases: PostgreSQL

Page 45: Heterogenous Persistence

Relational Databases: PostgreSQL
● Mature / adequate documentation
● PostgreSQL License (similar to BSD/MIT)
● Client libraries for about every programming language
● Highly standards compliant
● SQL/ACID impose scalability limits
● Asynchronous / Semi-synchronous replication
● Virtually synchronous replication via 3rd party
● Can be AP or CP depending on replication model

Page 46: Heterogenous Persistence

PROs CONs

PROs
● Open source
● Mature and stable
● ACID
● Lots of advanced features
● Vacuum

CONs
● Difficult to shard
● Operations feel like an afterthought
● Less forgiving
● Vacuum

Page 47: Heterogenous Persistence

K/V Stores

Page 48: Heterogenous Persistence
Page 49: Heterogenous Persistence

CRUD

● CREATE
● READ
● UPDATE
● DELETE

Page 50: Heterogenous Persistence

HASHING
● Computers: 0, 1, 2, …, n - 1
● Key Value Pair: (k, v)

(k, v) => hash(k) mod n
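The placement rule `hash(k) mod n` can be sketched in a few lines. This is an illustration under assumptions: MD5 is used only as a convenient stable hash, and the key names and server counts are made up. It also shows why the rule is fragile: growing the cluster from 4 to 5 servers remaps most keys.

```python
import hashlib


def node_for(key: str, n: int) -> int:
    """Map a key to one of n servers (numbered 0..n-1) using hash(k) mod n."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n


keys = [f"user:{i}" for i in range(1000)]  # hypothetical keys
before = {k: node_for(k, 4) for k in keys}  # cluster of 4 servers
after = {k: node_for(k, 5) for k in keys}   # same keys after adding a 5th server

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed servers")
```

With a cache in front of a database, every remapped key is a cache miss, which is one way to trigger the thundering herd shown a few slides later.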

Page 51: Heterogenous Persistence
Page 52: Heterogenous Persistence
Page 53: Heterogenous Persistence
Page 54: Heterogenous Persistence

THUNDERING HERD

Page 55: Heterogenous Persistence

CONSISTENT HASHING

Page 56: Heterogenous Persistence

CONSISTENT HASHING
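A minimal sketch of consistent hashing, assuming a hash ring with virtual nodes (one common way to implement it; the slides do not say which variant they depict). The `HashRing` class, the `cache-a` through `cache-e` server names, and the choice of 100 virtual nodes are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each server is placed on the ring at many pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(1000)]
old_ring = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
new_ring = HashRing(["cache-a", "cache-b", "cache-c", "cache-d", "cache-e"])

moved = [k for k in keys if old_ring.node_for(k) != new_ring.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys changed servers")
```

Unlike `hash(k) mod n`, only the keys that land on the new server move; everything else stays where it was.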

Page 57: Heterogenous Persistence

K/V Stores - Good Use Cases

● Lots of data
  ○ Usually easily horizontally scalable
● Object cache in front of RDBMS
  ○ Memcached, anyone?
● High concurrency
  ○ Very simple locking model
● Massive small-data intake
● Simple data access patterns
  ○ CRUD on PK access
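The "object cache in front of RDBMS" use case is often implemented as a cache-aside read path. The sketch below is an assumption-laden illustration: a plain dict stands in for the K/V store, SQLite stands in for the RDBMS, and the `users` table, `get_user`, and `rename_user` are hypothetical names.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Ada')")

cache = {}               # stand-in for a K/V store such as Memcached
stats = {"db_reads": 0}  # counts how often a read falls through to the database


def get_user(user_id):
    """Cache-aside read: try the cache by key, fall back to the RDBMS on a miss."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    row = db.execute("SELECT id, name FROM users WHERE id = ?", (user_id,)).fetchone()
    if row is not None:
        cache[key] = row
    return row


def rename_user(user_id, name):
    """Write to the RDBMS, then invalidate the cached copy."""
    db.execute("UPDATE users SET name = ? WHERE id = ?", (name, user_id))
    cache.pop(f"user:{user_id}", None)


first = get_user(1)    # miss: read from the database, then cached
second = get_user(1)   # hit: served from the cache
rename_user(1, "Grace")
third = get_user(1)    # miss again after invalidation
print(first, second, third, stats["db_reads"])
```

Every access goes through a single key, which is why this workload fits the "CRUD on PK access" pattern so well.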

Page 63: Heterogenous Persistence

K/V Stores - Bad Use Cases

● Durability and consistency*
● Complex data access patterns
● Non-PK access*
● Operations*
  ○ Complex systems fail in complex ways

Page 68: Heterogenous Persistence

SIMPLE FAILURE

Page 69: Heterogenous Persistence

COMPLICATED FAILURE

Page 70: Heterogenous Persistence

EXAMPLE K/V STORES

● Memcached
● MemcacheDB
● Redis*
● Riak KV
● Cassandra*
● Amazon DynamoDB*

Page 77: Heterogenous Persistence

PROs
● Highly scalable
● Simple access patterns

CONs
● Operational complexities
● Limited access patterns

Page 78: Heterogenous Persistence

Key Value Stores - Questions?

Page 79: Heterogenous Persistence

Columnar Databases

Page 80: Heterogenous Persistence

Columnar Data Layout

● Row-oriented
001:10,Smith,Joe,40000;
002:12,Jones,Mary,50000;
003:11,Johnson,Cathy,44000;
004:22,Jones,Bob,55000;
...

● Column-oriented
10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;
...
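The two layouts can be mimicked with ordinary Python containers. This is a sketch only: the column names (`rowid`, `emp_id`, `last_name`, `first_name`, `salary`) are assumptions, since the slide shows values without naming the columns.

```python
rows = [
    (1, 10, "Smith", "Joe", 40000),
    (2, 12, "Jones", "Mary", 50000),
    (3, 11, "Johnson", "Cathy", 44000),
    (4, 22, "Jones", "Bob", 55000),
]
columns = ("rowid", "emp_id", "last_name", "first_name", "salary")  # assumed names

# Row-oriented: each record's values sit next to each other.
row_store = list(rows)

# Column-oriented: each column's values sit next to each other.
column_store = {name: [row[i] for row in rows] for i, name in enumerate(columns)}

# Aggregating one column walks over every whole record in the row store...
total_from_rows = sum(row[4] for row in row_store)
# ...but only over one contiguous list in the column store.
total_from_columns = sum(column_store["salary"])

print(total_from_rows, total_from_columns)
```

The same trade-off runs the other way for writes: inserting one record appends to a single place in the row store but touches every list in the column store.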

Page 81: Heterogenous Persistence

Columnar Data Layout
● Row-oriented Read Approach

[Figure: memory pages 1-4 holding whole rows (10 Smith Joe 40000; 12 Jones Mary 50000; 11 Johnson Cathy 44000). Legend: what we want to read, read operation, memory page.]

Page 82: Heterogenous Persistence

Columnar Data Layout
● Column-oriented Read Approach

[Figure: memory pages 1-4 holding whole columns (10 12 11 22; Smith Jones Johnson; Joe Mary Cathy Bob). Legend: what we want to read, read operation, memory page.]

Page 83: Heterogenous Persistence

Columnar Databases - Considerations
● Buffering and compression can help to reduce the impact of writes, but writes should still be avoided when possible
  ○ Usually, an ETL process should be put in place to prepare data for analysis in a column-based format
● Covering indexes in row-based stores could provide similar benefits, but only up to a point → index maintenance work can become too expensive
● Column-based stores are self-indexing and more disk-space efficient
● SQL can be used for most column-based stores

Page 87: Heterogenous Persistence

Columnar Database - Good use cases
● Suitable for read-mostly or read-intensive, large data repositories
● Good for full table / large range reads
● Good for unstructured problems where “good” indexes are hard to forecast
● Good for re-creatable datasets
● Good for structured data

Page 92: Heterogenous Persistence

Columnar Database - Bad use cases
● Not good for “SELECT *” queries or queries fetching most of the columns
● Not good for writes
● Not good for mixed read/write
● Bad for unstructured data

Page 96: Heterogenous Persistence

Columnar Database - Examples
● Infobright (ICE)
● Vertica
● Amazon Redshift
● Apache HBase

Page 100: Heterogenous Persistence

Columnar - Questions?

Page 101: Heterogenous Persistence

Graph Databases

Page 102: Heterogenous Persistence
Page 103: Heterogenous Persistence

Graph Databases - Good Use Cases

● Highly Connected Data
  ○ Network & IT Operations, Recommendations, Fraud Detection, Social Networking, Identity & Access Management, Geo Routing, Insurance Risk Analysis, Counter Terrorism
● Millions or Billions of Records
  ○ Relational databases can also solve this problem at a smaller scale
● Re-Creatable Data Set
  ○ Keep as much as possible outside of the critical path
● Structured Data
  ○ You cannot graph a relationship unless you can define it
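For a feel of what "highly connected data" queries look like, here is a small breadth-first traversal over an adjacency list. It is a sketch in plain Python rather than a graph database; the `follows` graph and the `within_hops` helper are hypothetical. In a relational schema the same question typically needs one self-join per hop.

```python
from collections import deque

# Hypothetical social graph as an adjacency list: who follows whom.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first search returning {node: hop count} for nodes within max_hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]
    return distances


reachable = within_hops(follows, "alice", 2)
print(reachable)
```

The traversal only works because the relationship ("follows") is explicitly defined, which echoes the last point on the slide.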

Page 109: Heterogenous Persistence

Graph Databases - Bad Use Cases

● Unstructured Data
  ○ You cannot graph a relationship if you cannot define it
● Non-Connected Data
  ○ Graphiness is important here
● Highly Concurrent RW Workloads
  ○ Performance breaks down
● Anything in the Critical OLTP Path*
  ○ I'm not only talking about writes here
● Ever-Growing Data Set

Page 115: Heterogenous Persistence

Example Graph Databases

● Neo4j
● OrientDB
● Titan
● Virtuoso
● ArangoDB

Page 121: Heterogenous Persistence
Page 122: Heterogenous Persistence

THE CODE

Page 123: Heterogenous Persistence

PROs
● Solves a very specific (and hard) data problem
● Learning curve not bad for developer usage
● Data analysts’ dream

CONs
● Very little operational expertise for hire
● Little community and virtually no tooling for administration and operations
● Big mismatch in paradigm vs RDBMS; hard to switch for DBAs
● Hard/expensive to scale horizontally
● Writes are computationally expensive

Page 124: Heterogenous Persistence

Graph Databases - Questions?

Page 125: Heterogenous Persistence

Time Series

Page 126: Heterogenous Persistence

ID: {timestamp, value}

db1-threads: {1460928171, 6}
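The `ID: {timestamp, value}` shape maps naturally onto an append-only sequence per series. The toy store below is an in-memory sketch, not how any of the listed products work internally; `TinySeriesStore` and its method names are hypothetical, and points are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class TinySeriesStore:
    """Toy in-memory time-series store: append at the tail, expire from the head."""

    def __init__(self):
        self._series = defaultdict(deque)

    def append(self, series_id, timestamp, value):
        # Assumes points arrive in timestamp order (sequential appends).
        self._series[series_id].append((timestamp, value))

    def expire_before(self, series_id, cutoff):
        # Old data is dropped from the opposite end (the beginning).
        points = self._series[series_id]
        while points and points[0][0] < cutoff:
            points.popleft()

    def range(self, series_id, start, end):
        return [(t, v) for t, v in self._series[series_id] if start <= t <= end]


store = TinySeriesStore()
store.append("db1-threads", 1460928171, 6)
store.append("db1-threads", 1460928181, 8)
store.append("db1-threads", 1460928191, 7)
store.expire_before("db1-threads", 1460928181)

recent = store.range("db1-threads", 1460928180, 1460928200)
print(recent)
```

Writes only ever touch the tail and deletes only the head, which is the access pattern the following slides describe.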

Page 127: Heterogenous Persistence

Time Series - Good Use Cases

● Uh … Time Series Data
● Write-mostly (95%+) - Sequential Appends
● Rare updates, rarer still to the distant past
● Deletes occur at the opposite end (the beginning)
● Data does not fit in memory

Page 133: Heterogenous Persistence

Time Series - Bad Use Cases

● Uh … Not Time Series Data
● Small data

Page 134: Heterogenous Persistence

Example Time Series Databases

● InfluxDB
● Graphite
● OpenTSDB
● Blueflood
● Prometheus

Page 136: Heterogenous Persistence
Page 141: Heterogenous Persistence

PROs
● Solves a very specific (big) data problem
● Well-defined and finite data access patterns

CONs
● Terrible query semantics

Page 142: Heterogenous Persistence

Time Series - Questions?

Page 143: Heterogenous Persistence

Document Stores

Page 144: Heterogenous Persistence

Document Stores: Document Oriented

Page 145: Heterogenous Persistence

Document Stores: Document Oriented

Page 146: Heterogenous Persistence

Document Stores: Flexible Schema

Page 147: Heterogenous Persistence

Document Stores: Flexible Schema

Page 148: Heterogenous Persistence

Document Stores: Flexible Schema

Page 149: Heterogenous Persistence

Document Stores: Flexible Schema

Page 150: Heterogenous Persistence

Document Stores: Flexible Schema

Page 151: Heterogenous Persistence

Document Stores: Scalable by Design

[Diagram: three shards, each with one primary and two replicas]

Page 152: Heterogenous Persistence

Document Stores: Scalable By Design

[Diagram: three instances, each with one shard and two replicas]

Page 153: Heterogenous Persistence

Document Stores

Page 154: Heterogenous Persistence

Document Stores: MongoDB

Page 155: Heterogenous Persistence

Document Stores: MongoDB
● Sharding and replication for dummies!
● Pluggable storage engines for distinct workloads
  ○ Different locking behaviors
● Excellent compression options with PerconaFT, RocksDB, WiredTiger
● On-disk encryption (Enterprise Advanced)
● In-memory storage engine (Beta)
● Connectors for all major programming languages
● Sharding- and replica-aware connectors
● Geospatial functions
● Aggregation framework
● ... and a lot more, except being transactional

Page 166: Heterogenous Persistence

Document Stores: MongoDB > Use Cases
● Catalogs
● Analytics/BI (BI Connector on 3.2)
● Time series

Page 169: Heterogenous Persistence

Document Stores: Couchbase

Page 170: Heterogenous Persistence

Document Stores: Couchbase
● MongoDB - more or less
● Global Secondary Indexes are exciting: they produce localized secondary indexes for low-latency queries (Multi-Dimensional Scaling)
● Drop-in replacement for Memcached

Page 173: Heterogenous Persistence

Document Stores: Couchbase > Use Cases
● Internet of Things (direct or indirect receiver/pipeline)
● Mobile data persistence via Couchbase Mobile, e.g. field devices with unstable connections and local/close-proximity ingestion points
● Distributed K/V store

Page 176: Heterogenous Persistence

Document Store: Questions?

Page 177: Heterogenous Persistence

Fulltext Search

Page 178: Heterogenous Persistence

Fulltext Search: Inverted Index
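
To make the idea concrete, here is a minimal sketch of an inverted index in plain Python. It is not how Lucene or Sphinx implement theirs; the tokenizer, the function names and the sample documents are all assumptions made for illustration. The index maps each term to the set of documents containing it, so a query becomes a set intersection instead of a scan over every document.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "MySQL is a relational database",
    2: "Elasticsearch is a search engine built on Lucene",
    3: "Sphinx adds full text search to MySQL",
}
index = build_inverted_index(docs)
print(sorted(search(index, "mysql search")))  # [3]
```

Real engines add relevance scoring, stemming, positions for phrase queries and compressed on-disk postings on top of this basic structure.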

Page 179: Heterogenous Persistence

Fulltext Search: Search in a Box

Page 180: Heterogenous Persistence

Fulltext Search: Optimized Out
● Optimized for taking data out - little optimization for getting data in

https://flic.kr/p/abeTEw

Page 181: Heterogenous Persistence

Fulltext Search: Structured/Non-Structured Data

Page 182: Heterogenous Persistence

Fulltext Search

Page 183: Heterogenous Persistence

Fulltext Search: Elasticsearch
● Lucene based
● RESTful interface - JSON in, JSON out
● Flexible schema
● Automatic sharding and replication (NDB like)
● Reasonable defaults
● Extension model
● Written in Java; JVM limitations apply, e.g. GC
● ELK - Elasticsearch + Logstash + Kibana
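
As a rough illustration of "JSON in, JSON out", the sketch below builds the path and JSON body for a simple `match` query against the `_search` endpoint. The index name `articles` and the field `title` are hypothetical, and the request is only constructed, not sent, so no cluster is assumed.

```python
import json


def build_search_request(index_name, field, text, size=10):
    """Return the (path, JSON body) pair for a simple full text match query."""
    path = "/{}/_search".format(index_name)
    body = {
        "query": {"match": {field: text}},
        "size": size,
    }
    return path, json.dumps(body, sort_keys=True)


# "articles" and "title" are hypothetical names used for illustration.
path, body = build_search_request("articles", "title", "heterogeneous persistence")
print("GET", path)
print(body)
```

Any HTTP client can send this body to a node; the response comes back as JSON too, with the matching documents under `hits`.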

Page 193: Heterogenous Persistence

Fulltext Search: Elasticsearch > Use Cases
● Log analysis - ELK Stack, e.g. Netflix
● Full text search, e.g. GitHub, Wikipedia, StackExchange, etc.
● https://www.elastic.co/use-cases
  ○ Sentiment analysis
  ○ Personalized experience
  ○ etc.

Page 194: Heterogenous Persistence

Fulltext Search: Solr
● Lucene based
● Quite cryptic query interface - Innovator’s Dilemma
● Support for SQL-based queries in 6.1
● Structured schema; data types need to be predefined
● Written in Java; JVM limitations apply, e.g. GC
● Near-realtime indexing - DIH
● Rich document handling - PDF, doc[x]
● SolrCloud support for sharding and replication

Page 203: Heterogenous Persistence

Fulltext Search: Solr > Use Cases
● Search and Relevancy
● Recommendation Engine
● Spatial Search

Page 205: Heterogenous Persistence

Fulltext Search: Sphinx Search
● Structured data
● MySQL protocol - SphinxQL
● Durable indexes via binary logs
● Realtime indexes via MySQL queries
● Distributed index for scaling
● No native support for replication; done externally, e.g. via rsync
● Very good documentation
● Fastest full indexing/reindexing [?]

Page 213: Heterogenous Persistence

Fulltext Search: Sphinx Search > Use Cases
● Real-time full text + basic geo functions
● Above with dependency, or to simplify access with SphinxQL or even the Sphinx storage engine for MySQL

Page 215: Heterogenous Persistence

Search - Questions?

Page 216: Heterogenous Persistence

Docker Is Your Friend

Page 217: Heterogenous Persistence

Docker Is Your Friend

Relational

● https://github.com/docker-library/mysql
● https://github.com/docker-library/postgres

Key Value

● https://github.com/docker-library/memcached
● https://github.com/docker-library/redis
● https://github.com/docker-library/cassandra
● https://github.com/hectcastro/docker-riak (https://docs.docker.com/engine/examples/running_riak_service/)

Page 218: Heterogenous Persistence

Docker Is Your Friend

Graph

● https://github.com/neo4j/docker-neo4j
● https://github.com/orientechnologies/orientdb-docker
● https://github.com/arangodb/arangodb-docker
● https://github.com/tenforce/docker-virtuoso (non-official)
● https://hub.docker.com/r/itzg/titandb/~/dockerfile/ (non-official)
● https://github.com/phani1kumar/docker-titan (non-official)

Full Text

● https://github.com/docker-solr/docker-solr/
● https://github.com/stefobark/sphinxdocker

Page 219: Heterogenous Persistence

Docker Is Your Friend

Time series

● https://github.com/tutumcloud/influxdb (non-official)
● https://hub.docker.com/r/sitespeedio/graphite/ (non-official)
● https://github.com/rackerlabs/blueflood/tree/master/demo/docker
● https://hub.docker.com/r/petergrace/opentsdb-docker/ (non-official)
● https://hub.docker.com/r/opower/opentsdb/ (non-official)
● Both OpenTSDB images via http://opentsdb.net/docs/build/html/resources.html
● https://prometheus.io/docs/introduction/install/#using-docker
● https://github.com/prometheus/prometheus/blob/master/Dockerfile

Page 221: Heterogenous Persistence