Bulk Loading into Cassandra

Feb 12, 2017

Technology

Brian Hess
Transcript
Page 1: Bulk Loading into Cassandra

Bulk Loading into Cassandra

Page 2: Bulk Loading into Cassandra

What are we talking about today?

• Problem statement

• Possible Solutions

– cqlsh COPY FROM

– Custom code using SSTable formatted files

– Java CQL INSERTs

• Test Results

• Unloading considerations

2 © 2015. All Rights Reserved.

Page 3: Bulk Loading into Cassandra

The problem is simple…

Page 4: Bulk Loading into Cassandra

Load a pile of files into Cassandra

• Where do the files start out?

– “On my Laptop/Server’s local file system”

• The focus today!

– “In HDFS (or another DFS)”

• Consider using Spark – not the topic today

– “In an NFS mount”

• Consider using Spark – not the topic today

Page 5: Bulk Loading into Cassandra

The Options

• The “Front Door”

– cqlsh COPY FROM

– Java program loading via INSERT statements and executeAsync()

• Or the language of your choice: C/C++, Python, C#, etc.

• The “Side Door”

– Leverage “streaming” via sstableloader

– Need to create SSTables via Java and CQLSSTableWriter

• No other language choice

Page 6: Bulk Loading into Cassandra

The Front Door: CQL INSERT

Page 7: Bulk Loading into Cassandra

Cassandra Write Path

[Diagram: the Cassandra write path, showing the Coordinator, Commit Log, Memtable, and SSTables inside Cassandra, with arrows labeled “Periodically” and “Synchronously”]

Page 8: Bulk Loading into Cassandra

Cassandra Clients

• Load Balancing

– Prepared Statements

– Token-aware routing

– Round-robin

• Connections per Cassandra host

– The driver connects to every Cassandra host in the “local” data center

• Synchronous / Asynchronous Execution

– How many “in-flight queries”?

• Consistency Level

• Does not require all nodes to be online

– Standard Cassandra rules apply: hinted handoff, etc.
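
The “in-flight queries” question above can be illustrated with a small, driver-free sketch. It caps the number of outstanding asynchronous requests with a semaphore. Everything here is an assumption made for illustration: `fake_insert` is a hypothetical stand-in for a driver's asynchronous INSERT, and the thread pool only simulates the driver's I/O. It is not how any particular driver or cassandra-loader implements its window.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fake_insert(row):
    """Hypothetical stand-in for an asynchronous INSERT sent through a driver."""
    return row


def load_rows(rows, max_in_flight=4, workers=8):
    """Submit every row while keeping at most max_in_flight requests outstanding."""
    slots = threading.Semaphore(max_in_flight)
    lock = threading.Lock()
    state = {"current": 0, "peak": 0, "done": 0}

    def task(row):
        with lock:
            state["current"] += 1
            state["peak"] = max(state["peak"], state["current"])
        try:
            fake_insert(row)
        finally:
            with lock:
                state["current"] -= 1
                state["done"] += 1
            slots.release()  # free a slot so the reader can submit another row

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for row in rows:
            slots.acquire()  # blocks the reader while the window is full
            pool.submit(task, row)
    # Leaving the "with" block waits for all submitted tasks to finish.
    return state["done"], state["peak"]


if __name__ == "__main__":
    done, peak = load_rows(range(1000), max_in_flight=16)
    print(f"rows sent: {done}, peak in flight: {peak}")
```

A larger window keeps the cluster busier but also queues more work on the coordinators, which is the trade-off the question on the slide points at.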

Page 9: Bulk Loading into Cassandra

Cqlsh COPY FROM

• Command-line CQL tool

• Ships with Cassandra

– Can be run from a client machine (versions must match)

• Built in Python

– In 2.1 leverages the Python driver

– No “token aware routing” yet

• Only makes connection to one coordinator

– Does not round-robin

• Executes CQL INSERTs asynchronously

Page 10: Bulk Loading into Cassandra

Java client – e.g., cassandra-loader (https://github.com/brianmhess/cassandra-loader)

• Java program leveraging the Java CQL driver

– Java driver (and others) provided by DataStax

• Connects to every node in the cluster

– Potentially multiple times per node

• Variety of driver options

– Load balancing

• TokenAwarePolicy, DCAwareRoundRobinPolicy, etc.

– Connections per host

– Consistency Level, etc.

• Asynchronous execution

– Or synchronous, e.g., for “DDL” operations

– Aside: cassandra-loader uses asynchronous execution (no DDL)

Page 11: Bulk Loading into Cassandra

The Side Door: “Streaming”

Page 12: Bulk Loading into Cassandra

Streaming – the Client

• A connection to each Cassandra node

– Along with token range information

• For each file

– Read records

– Determine which nodes own the token range for this record

– Send the record to those nodes
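
The “determine which nodes own the token range” step can be sketched with a sorted ring and a binary search. This is a toy model under stated assumptions: `toy_token` is an MD5-based hash used only for illustration (it is not Cassandra's Murmur3 partitioner), the ring and node names are made up, and replicas are chosen by simply walking clockwise, which ignores racks and data centers.

```python
import bisect
import hashlib


def toy_token(partition_key):
    """Toy 64-bit token. NOT the Murmur3 partitioner; for illustration only."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def replicas_for(token, ring, replication_factor):
    """Return the nodes that would hold a record with this token.

    ring is a list of (token, node) pairs sorted by token. A node owns the
    range that ends at its token, so the primary owner is the first entry
    whose token is >= the record's token, wrapping around the ring.
    Additional replicas are the next distinct nodes clockwise.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    owners = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == replication_factor:
            break
    return owners


if __name__ == "__main__":
    # Hypothetical three-node ring.
    ring = [(-(2**62), "node-a"), (0, "node-b"), (2**62, "node-c")]
    key = "user-0001"
    print(key, "->", replicas_for(toy_token(key), ring, replication_factor=2))
```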

[Diagram: sstableloader]

Page 13: Bulk Loading into Cassandra

Streaming – the Cluster

[Diagram: SSTable file]

• Receive records from the client

– First write out to SSTable file

– Read file back in to create various Cassandra objects

• The “Primary Index” – in-memory index for “shortcuts”

• Any secondary indices defined on this table

• Any materialized views defined on this table (in 3.0)

– Move on to next SSTable file and repeat

[Diagram labels: 2i, MV, Primary Index]

Page 14: Bulk Loading into Cassandra

Streaming

• Streaming requires all nodes to be online

– Because sstableloader will connect to each node

• Can “blacklist” nodes to skip

– sstableloader will not stream to those nodes

– Must know which nodes to skip up front, e.g., via nodetool status

– SSTables won’t be streamed later to offline nodes when they come online

• No “streaming hints”

• To get data to offline nodes, you must repair

• Streaming also requires SSTables to start with

– Use the CQLSSTableWriter Java class to create SSTables

Page 15: Bulk Loading into Cassandra

The test

• Delimited files

– Different size rows: 100 bytes, 1KB, 10KB, 1MB

– Same “schema”: 12-byte TEXT, 8-byte BIGINT, rest in a TEXT

• 12-byte TEXT is the partition key, BIGINT is unique and the clustering column

– Each file is 1GB – 20 files to load

• Larger rows means fewer rows-per-file

– Parallel execution of commands is allowed – use all the cores

• Hardware/Software

– 8x i2.2xlarge nodes for Cassandra running DSE 4.7.3

– 1x r3.xlarge node as the client
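
To make “larger rows means fewer rows-per-file” concrete, the rough arithmetic can be sketched as follows. The numbers are approximations under assumptions not stated on the slide: decimal units (1 GB = 10**9 bytes, 1 KB = 1000 bytes) and no allowance for delimiters or newlines.

```python
# Assumed decimal units; the slide does not say whether GB/KB are decimal or binary.
ROW_SIZES = {"100B": 100, "1KB": 1_000, "10KB": 10_000, "1MB": 1_000_000}
FILE_BYTES = 1_000_000_000  # each file is 1 GB
NUM_FILES = 20              # 20 files to load


def approx_rows(row_bytes, file_bytes=FILE_BYTES, num_files=NUM_FILES):
    """Return (rows per file, rows across all files), ignoring delimiter overhead."""
    rows_per_file = file_bytes // row_bytes
    return rows_per_file, rows_per_file * num_files


if __name__ == "__main__":
    for label, size in ROW_SIZES.items():
        per_file, total = approx_rows(size)
        print(f"{label:>5}: ~{per_file:,} rows/file, ~{total:,} rows in total")
```

Under these assumptions the 100-byte test loads roughly 10,000 times as many rows as the 1MB test for the same 20 GB of input, which is worth keeping in mind when comparing rows/s against MB/s.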

Page 16: Bulk Loading into Cassandra

The Contenders

1. CQLSSTableWriter + sstableloader

– Wrote a Java program to take delimited files to SSTables

– Use command-line sstableloader to load

– Need to combine the times of both

2. Cqlsh COPY FROM

– Use DSE 4.7.3 (not started) on the client

3. cassandra-loader

– https://github.com/brianmhess/cassandra-loader

– Java CQL driver client

Page 17: Bulk Loading into Cassandra

Experiment Execution Details

• CQLSSTableWriter + sstableloader

– Leverage “make -j 8” to run 8 at a time

– Time CQLSSTableWriter and then time sstableloader

• Cqlsh COPY FROM

– Leverage “make -j 2” to run 2 at a time

– Running more than 2 caused timeouts and errors

• cassandra-loader

– 8 threads (“-numThreads 8”)

– unlogged batches of size 4 (“-batchSize 4”)

– 10000 queries in flight (“-numFutures 10000”)

– Exception for 100-byte test

• 10 threads

• unlogged batches of size 20

• 50000 queries in flight

Page 18: Bulk Loading into Cassandra

Results

[Charts: Duration (s), Rows/s, and Data Rate (MB/s) versus row size (100B, 1KB, 10KB, 1MB), each comparing cassandra-loader, sstablewriter+sstableloader, and copy]

Page 19: Bulk Loading into Cassandra

Observations

• Java executeAsync() was fastest in nearly all tests

– Except the 100-byte test, where it was a close second

– cassandra-loader means no custom code

• CQLSSTableWriter+sstableloader works better for smaller records

– Performance eroded as record size increased

– Custom Java program for each format

• Cqlsh COPY FROM was never the winner

– Second place in the 10KB/row test

– Could not handle the 1MB/row test (ERROR)

Page 20: Bulk Loading into Cassandra

To Batch or Not To Batch

• Varying opinions on unlogged batches

• Batching puts more load on the coordinator

– The coordinator gets the list of INSERTs and executes each one

– Not all INSERTs will be “owned” by the coordinator

– Essentially, the client offloads work to the coordinator

• Batching means fewer queries to the cluster

– Since the INSERTs are bundled into one query
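
The “fewer queries to the cluster” point is simple arithmetic, sketched below. The helper names are hypothetical, and the grouping is purely by arrival order; the slides do not say how cassandra-loader groups rows, so this is not a description of that tool.

```python
from itertools import islice


def batches(rows, batch_size):
    """Yield consecutive lists of up to batch_size rows, in arrival order."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def queries_needed(num_rows, batch_size):
    """Number of requests sent when rows are grouped into batches (ceiling division)."""
    return -(-num_rows // batch_size)


if __name__ == "__main__":
    for size in (1, 4, 20):
        print(f"batch size {size:>2}: {queries_needed(1_000_000, size):,} queries")
```

Because rows are grouped without regard to partition key, each batch may contain INSERTs owned by several different nodes, which is exactly the extra coordinator work described above.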

Page 21: Bulk Loading into Cassandra

Batch test

• 10 delimited files

– 10 BIGINT columns – one is partition key, one is clustering column

• Use cassandra-loader

– Vary the -batchSize argument

• Measure

– Throughput – Rows/sec

– Latency – 95th Percentile
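
The two measurements can be computed as in the sketch below. The slides do not say which percentile definition was used, so the nearest-rank method here is an assumption, and the sample latencies are made-up values for illustration only.

```python
import math


def rows_per_sec(num_rows, elapsed_seconds):
    """Throughput: rows loaded divided by wall-clock seconds."""
    return num_rows / elapsed_seconds


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical per-query latencies in milliseconds.
    latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 17, 19]
    print("95th percentile latency (ms):", percentile(latencies_ms, 95))
    print("rows/sec:", rows_per_sec(1_000_000, 12.5))
```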

Page 22: Bulk Loading into Cassandra

Results

• Observation: Increasing batch size

– Increases throughput (to a point)

– Increases latency

• Neither is surprising…

[Charts: Rows/sec and 95th Percentile Latency (ms) versus batch size (1, 2, 4, 6, 8, 12, 16, 24, 32, 64, 128)]

Page 23: Bulk Loading into Cassandra

Bulk Unloading

• The Problem

– Get all that data from Table X out to file(s)

• Refinement

– Where’s the data going?

• Local FS

• Distributed FS (e.g., HDFS) – Use Spark (or Hadoop, if you have to)?

Page 24: Bulk Loading into Cassandra

• The “Front Door”

– CQL SELECT

• There is no “Side Door”

Page 25: Bulk Loading into Cassandra

Parallel unload

• Split token range into pieces

– Need the set of splits to cover and not overlap

• Cassandra drivers provide that

– Need each split to be completely within one node

• So each extract is able to talk only to one Cassandra node

• Optimization step – not necessary

– Same approach as Spark and Hadoop (and others)

• Connection / Query per “split”

– Export to a different file

• Optimize paging size

– Reduce overhead for decompression

• Consistency Level
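
A minimal sketch of the “cover and not overlap” requirement: it cuts the full token range into equal, contiguous pieces and builds one SELECT per piece. It assumes the Murmur3 partitioner's range of -2**63 to 2**63 - 1, and the table and column names are hypothetical. It does not align splits to node ownership (the optional optimization above); a real tool would take that information from the driver's cluster metadata.

```python
MIN_TOKEN = -2**63      # assumed: Murmur3 partitioner token range
MAX_TOKEN = 2**63 - 1


def split_token_range(num_splits, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Return num_splits inclusive (start, end) pairs that cover [lo, hi] with no overlap."""
    total = hi - lo + 1
    if not 1 <= num_splits <= total:
        raise ValueError("num_splits must be between 1 and the size of the range")
    splits = []
    start = lo
    for i in range(num_splits):
        end = lo + total * (i + 1) // num_splits - 1
        splits.append((start, end))
        start = end + 1
    return splits


def split_query(table, partition_key, start, end):
    """Build the CQL text for one split; each split would be exported to its own file."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE token({partition_key}) >= {start} AND token({partition_key}) <= {end}"
    )


if __name__ == "__main__":
    # Hypothetical keyspace, table, and partition key names.
    for start, end in split_token_range(4):
        print(split_query("ks.table_x", "id", start, end))
```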

Page 26: Bulk Loading into Cassandra

Available Tools

• Cqlsh COPY TO

– Built into Cassandra command-line tool cqlsh

– Leverages the Python driver

– Recent improvements: CASSANDRA-9304

• Parallel export, etc

• cassandra-unloader

– Part of the https://github.com/brianmhess/cassandra-loader project

– Delimited file options, just like cassandra-loader

– Parallel export

Page 27: Bulk Loading into Cassandra

Performance

• From the CASSANDRA-9304 ticket: “A small benchmark was done on a table of 10M rows inside of a Vagrant box with 8 cores. The table was created using the following command `tools/bin/cassandra-stress write n=10M -rate threads=50`.

The original single proc version took about 30 minutes to export the table. The multi proc version takes about 7 minutes. Brian Hess's cassandra-unloader takes a little over 2 minutes.”

• Summary (approximate times from the ticket):

– Pre-9304 COPY TO: about 30 minutes

– Post-9304 COPY TO: about 7 minutes

– cassandra-unloader: a little over 2 minutes
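
For a rough sense of scale, the quoted times can be turned into export rates and speedups. The inputs are the rounded figures from the ticket (10M rows; 30, 7, and 2 minutes), so the outputs are estimates only; in particular, “a little over 2 minutes” is treated as exactly 2.

```python
ROWS = 10_000_000  # table size quoted in the ticket

# Rounded durations in minutes, as quoted in the ticket.
MINUTES = {
    "COPY TO (pre-9304)": 30,
    "COPY TO (post-9304)": 7,
    "cassandra-unloader": 2,
}


def rows_per_second(rows, minutes):
    """Average export rate for the whole run."""
    return rows / (minutes * 60)


def speedup(baseline_minutes, minutes):
    """How many times faster than the baseline run."""
    return baseline_minutes / minutes


if __name__ == "__main__":
    baseline = MINUTES["COPY TO (pre-9304)"]
    for tool, minutes in MINUTES.items():
        rate = rows_per_second(ROWS, minutes)
        print(f"{tool}: ~{rate:,.0f} rows/s, ~{speedup(baseline, minutes):.1f}x the baseline")
```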

Page 28: Bulk Loading into Cassandra

Summary

• Bulk Loading

– CQL asynchronous INSERTs are your best bet

• Simplicity, performance (almost always), configurability, low/no coding

– CQLSSTableWriter requires a custom Java application

– sstableloader requires all nodes to be online

• Operational consideration

• Batching

– Can improve throughput at the cost of latency

• Bulk Unloading

– Parallel export via splitting token range

– Use CQL, there is no “side door”

Page 29: Bulk Loading into Cassandra

Thank you