
Cassandra Data Modeling - Practical Considerations @ Netflix

Jan 27, 2015


The Cassandra community has consistently requested that we cover C* schema design concepts. This presentation goes in depth on the following topics:

- Schema design
- Best Practices
- Capacity Planning
- Real World Examples
Transcript
Page 1: Cassandra Data Modeling - Practical Considerations @ Netflix

Cassandra Data modeling

Practical considerations

Nitish Korla

Page 2: Cassandra Data Modeling - Practical Considerations @ Netflix

Why Cassandra?

• High Availability / Fully distributed
• Scalability (Linear)
• Write performance
• Simple to install and operate
• Multi-region replication support (bi-directional)

Page 3: Cassandra Data Modeling - Practical Considerations @ Netflix

Cassandra footprint @ Netflix

• 60+ Cassandra clusters
• 1600+ nodes holding 100+ TB data
• AWS 500 IOPS -> 100,000 IOPS
• Streaming data completely persisted in Cassandra

• Related Open Source Projects
– Cassandra/Astyanax: in-house committer
– Priam: Cassandra Automation
– Test Tools: jmeter
– http://github.com/netflix

Page 4: Cassandra Data Modeling - Practical Considerations @ Netflix

Data Model

keyspace > column family > row > column (name, value, timestamp)

Cassandra          RDBMS Equivalent
KEYSPACE           DATABASE/SCHEMA
COLUMN FAMILY      TABLE
ROW                ROW
FLEXIBLE COLUMNS   DEFINED COLUMNS

Page 5: Cassandra Data Modeling - Practical Considerations @ Netflix

Data Model

[Diagram: two example sets of rows.]

• Columns sorted by comparator: rows 356 and 54, with columns such as name (Paul, kim), group (34567, 34566), sex (male, female), status (single) and zip (94538)
• Composite columns, sorted by composite comparators: row "population" with columns US:CA:Fremont (54353), US:CA:Hayward (34343) and US:CA:San Jose (987556)
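A minimal sketch of how composite column names sort, assuming each name is a (country, state, city) tuple. Python's tuple ordering stands in for Cassandra's composite comparator here. The function names are hypothetical, and pairing each city with a population value is one reading of the slide.

```python
# Composite column names modelled as tuples. Python compares tuples
# component by component, which mimics a composite comparator.
population_row = {
    ("US", "CA", "San Jose"): 987556,
    ("US", "CA", "Fremont"): 54353,
    ("US", "CA", "Hayward"): 34343,
}


def sorted_columns(row):
    """Return (name, value) pairs ordered by composite column name."""
    return sorted(row.items(), key=lambda item: item[0])


def slice_by_prefix(row, prefix):
    """Return the columns whose composite name starts with `prefix`."""
    return [(name, value) for name, value in sorted_columns(row)
            if name[:len(prefix)] == prefix]


ordered = sorted_columns(population_row)
ca_cities = slice_by_prefix(population_row, ("US", "CA"))
```

Because the columns are kept in sorted order, a prefix such as ("US", "CA") selects one contiguous slice of the row.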

Page 6: Cassandra Data Modeling - Practical Considerations @ Netflix

Do your Homework

① Understand your application requirements

② Identify your access patterns

③ Model around these access patterns

④ Denormalization is your new friend but…

⑤ Benchmark – Avoid Surprises

Cost of getting it wrong is high

Page 7: Cassandra Data Modeling - Practical Considerations @ Netflix

Example 1 : Edge Service

Page 8: Cassandra Data Modeling - Practical Considerations @ Netflix

Edge Services Data Model

[Diagram: the edge service cluster reads script rows from Cassandra.]

ROWID                              ALLOCATION   ACTIVE      SCRIPT
(Format) Script_location_version   000          YES OR NO   text
alloc/xyz/jkl_1                    000          yes         text
alloc/xyl/jkl_2                    111          yes         text
alloc/xyl/jkl_3                    222          yes         text

Page 9: Cassandra Data Modeling - Practical Considerations @ Netflix

Edge Service Anti patterns

• High concurrency: Edge servers auto scale
• Range scans: Read all data
• Large payload: ~1MB of data

Very high read latency / unstable Cassandra

Page 10: Cassandra Data Modeling - Practical Considerations @ Netflix

Solution: inverted index

[Diagram: the client issues two numbered requests (1, 2) against the Script_index and scripts column families.]

scripts

ROWID             ALLOCATION   ACTIVE   SCRIPT
alloc/xyz/jkl_1   000          yes      text
alloc/xyl/jkl_2   111          yes      text
alloc/xyl/tml_3   222          yes      text

Script_index

ROWID     /xyz/jkl   /xyz/jzp   /xyz/plm   /xyz/tml   /xyz/urs   /xyz/zjkl
Index_1   1          2          1          3          1          2

Page 11: Cassandra Data Modeling - Practical Considerations @ Netflix

Inverted Index considerations

• Column name can be used as a row key placeholder

• Hotspots!!
• Sharding
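The points above can be sketched with two in-memory dictionaries standing in for the scripts and Script_index column families. This is an illustration under assumptions, not the presenter's implementation: the shard count, the CRC-based shard choice and all function names are hypothetical.

```python
import zlib

NUM_SHARDS = 4  # hypothetical: number of index rows the index is spread over

# In-memory stand-ins for column families: {row_key: {column_name: value}}
scripts = {}
script_index = {}


def index_row_key(location):
    """Pick an index row (shard) deterministically from the script location."""
    shard = zlib.crc32(location.encode("utf-8")) % NUM_SHARDS
    return "Index_%d" % shard


def publish_script(location, version, allocation, text):
    """Write the script row, then point the index at the new version."""
    row_key = "%s_%d" % (location, version)
    scripts[row_key] = {"allocation": allocation, "active": "yes", "script": text}
    # The index column name is a placeholder for the script row key.
    script_index.setdefault(index_row_key(location), {})[location] = version


def fetch_script(location):
    """Two point reads (index row, then script row) instead of a range scan."""
    version = script_index[index_row_key(location)][location]
    return scripts["%s_%d" % (location, version)]


publish_script("/xyz/jkl", 1, "000", "text v1")
publish_script("/xyz/jkl", 2, "111", "text v2")
current = fetch_script("/xyz/jkl")
```

Spreading the index over several rows keeps a single index row from becoming the hotspot that every edge server reads.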

Page 12: Cassandra Data Modeling - Practical Considerations @ Netflix

Other possible improvement

• Textual Data
• Think compression

Upcoming features
- Hadoop integration
- Solr

Page 13: Cassandra Data Modeling - Practical Considerations @ Netflix

Example 2: Ratings

Page 14: Cassandra Data Modeling - Practical Considerations @ Netflix

RDBMS -> CASSANDRA

user
• id (primary key)
• name
• alias
• email

movie
• id (primary key)
• title
• description

user_movie_rating
• id (primary key)
• userId (foreign key)
• movieId (foreign key)
• rating

Relationships: user 1 to ∞ user_movie_rating, movie 1 to ∞ user_movie_rating

Queries
• Get email of userId 123
• Get title and description of movieId 222
• List all movie names and corresponding ratings for userId 123
• List all users and corresponding rating for movieId 222

Page 15: Cassandra Data Modeling - Practical Considerations @ Netflix

CASSANDRA MODEL

user (row key: userId)

ROWID   name           alias      email
123     Nitish Korla   buckwild   [email protected]

movie (row key: movieId)

ROWID   title       description
223     Find Nemo   Good luck with that

ratingsByMovie (row key: movieId, column name: userId, value: rating)

ROWID   334   455   544   633   789   999
222     2     5     1     2     2     3

ratingsByUser (row key: userId)

ROWID   222:rating   222:title   534:rating   534:title      888:rating   888:title
123     4            rockstar    2            Finding Nemo   1            Top Guns

Sequence?
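A rough sketch of how the four queries map onto the denormalized column families, using plain dictionaries in place of Cassandra. The function names, the e-mail address and the movie descriptions are placeholders introduced for illustration. The `rate` function writes to both rating column families, which is the price of denormalizing.

```python
# In-memory stand-ins for the four column families: {row_key: {column: value}}
users = {123: {"name": "Nitish Korla", "alias": "buckwild",
               "email": "user123@example.com"}}  # placeholder address
movies = {
    222: {"title": "rockstar", "description": "placeholder description"},
    534: {"title": "Finding Nemo", "description": "placeholder description"},
}
ratings_by_user = {}
ratings_by_movie = {}


def rate(user_id, movie_id, rating):
    """Denormalized write: one logical rating lands in two column families."""
    row = ratings_by_user.setdefault(user_id, {})
    row[(movie_id, "rating")] = rating
    row[(movie_id, "title")] = movies[movie_id]["title"]  # title copied in
    ratings_by_movie.setdefault(movie_id, {})[user_id] = rating


def email_of(user_id):
    return users[user_id]["email"]


def movie_details(movie_id):
    movie = movies[movie_id]
    return movie["title"], movie["description"]


def movies_rated_by(user_id):
    """One row read returns titles and ratings together, no join needed."""
    row = ratings_by_user.get(user_id, {})
    movie_ids = sorted({movie_id for movie_id, _ in row})
    return [(row[(m, "title")], row[(m, "rating")]) for m in movie_ids]


def users_who_rated(movie_id):
    return dict(ratings_by_movie.get(movie_id, {}))


rate(123, 222, 4)
rate(123, 534, 2)
```

Each query is answered from a single row, at the cost of keeping the copied titles in ratingsByUser in step with the movie column family.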

Page 16: Cassandra Data Modeling - Practical Considerations @ Netflix

Example 3 : Viewing History

Page 17: Cassandra Data Modeling - Practical Considerations @ Netflix

Viewing History

• Row ID: Subscriber_id
• Column name format: <Timeuuid> : <movieid> (for example 1234454545 : 5466)
• Column value: playback/bookmark related serialized data

[Diagram: several subscriber rows, each holding many <timeuuid>_<movieid> columns (for example 3454545_5634534, 3454546_5, 3454547_5, 3454555_9, 3454560_9, 3454580_9), each with a JSON value.]

Page 18: Cassandra Data Modeling - Practical Considerations @ Netflix

Viewing History compression

• Row ID: Subscriber_id
• Column name format: <Timeuuid>_<movieid> (for example 1234454545_5466, 1234454546_5466, 1234454547_5466, 1234454548_5466)
• Column value: playback/bookmark related serialized data

Steps shown on the slide:
• Re-sort by movie id: Movie_id:[{playbackevent1, playbackevent2 ...... }], Movie_id:[{playbackevent1, playbackevent2 ...... }], ...
• Compress data
• Store in separate column family

Results:
• Reduced data size by 7 times
• Operational processes improved by 10 times
• Money saved: $,$$$,$$$
• Improvement in app read latency
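The re-sort and compress steps can be sketched as follows. This assumes JSON for serialization and zlib for compression, neither of which is named on the slide. The function names and the event fields are hypothetical.

```python
import json
import zlib
from collections import defaultdict


def compress_history(columns):
    """Group a subscriber's events by movie id, then compress the result.

    `columns` maps (time_id, movie_id) column names to event dicts.
    """
    by_movie = defaultdict(list)
    for (time_id, movie_id), event in sorted(columns.items()):
        by_movie[str(movie_id)].append({"time": time_id, "event": event})
    payload = json.dumps(by_movie, sort_keys=True).encode("utf-8")
    return zlib.compress(payload)


def decompress_history(blob):
    """Inverse of compress_history: returns {movie_id: [events...]}."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical live row: 200 bookmark events spread over two movies.
live_columns = {
    (1234454545 + i, 5466 if i % 2 == 0 else 9545): {"bookmark_sec": i * 10}
    for i in range(200)
}
blob = compress_history(live_columns)   # value for the separate column family
restored = decompress_history(blob)
```

Grouping by movie id places similar events next to each other before compression, and the subscriber's history becomes one blob instead of many small columns.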

Page 19: Cassandra Data Modeling - Practical Considerations @ Netflix

Think Data Archival

• Data stores in Netflix grow exponentially
• Have a process in place to archive data
– DSE
– Moving to a separate column family
– Moving to a separate cluster (non SSD)
– Setting right expectations w.r.t. latencies with historical data
• Cassandra TTLs

Page 20: Cassandra Data Modeling - Practical Considerations @ Netflix

Example 4 : Personalized recommendations

Page 21: Cassandra Data Modeling - Practical Considerations @ Netflix

read-modify-write pattern

• Data read and written back (even if data was not modified)
• Large BLOBs

Cassandra under IO pressure: peak traffic – compaction yet to run – high read latency

Page 22: Cassandra Data Modeling - Practical Considerations @ Netflix

read-modify-write pattern

• Do you really need to read data?
• Avoid write if data has not changed – SSTable creation – immutable SSTables created at backend
• Write with a new row key (limit SSTable scans). TTL data
• If a batch process, throttle the write rate to let compactions catch up
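Two of these suggestions, skipping unchanged writes and throttling a batch writer, can be sketched as below. A dictionary stands in for the column family, and the class, its digest cache and the fixed-interval throttle are all assumptions made for illustration.

```python
import hashlib
import time


class ThrottledWriter:
    """Skips writes whose payload is unchanged and paces the rest."""

    def __init__(self, store, max_writes_per_second, sleep=time.sleep):
        self.store = store                  # dict standing in for a column family
        self.digests = {}                   # row key -> digest of last written blob
        self.min_interval = 1.0 / max_writes_per_second
        self.sleep = sleep                  # injectable so tests need not wait
        self.writes = 0
        self.skipped = 0

    def write(self, row_key, blob):
        """Return True if the blob was written, False if it was unchanged."""
        digest = hashlib.sha256(blob).hexdigest()
        if self.digests.get(row_key) == digest:
            self.skipped += 1               # no new SSTable data for this row
            return False
        self.store[row_key] = blob
        self.digests[row_key] = digest
        self.writes += 1
        self.sleep(self.min_interval)       # crude pacing between writes
        return True


pauses = []
writer = ThrottledWriter({}, max_writes_per_second=100, sleep=pauses.append)
results = [writer.write("user:123", b"recs-v1"),
           writer.write("user:123", b"recs-v1"),
           writer.write("user:123", b"recs-v2")]
```

Comparing digests held by the writer avoids a read from Cassandra just to find out whether the data changed.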

Page 23: Cassandra Data Modeling - Practical Considerations @ Netflix

Useful Tools

• Cassandra real-time metrics
• Capture schema changes (automatically)

Page 24: Cassandra Data Modeling - Practical Considerations @ Netflix

Observations

• Cassandra scales linearly without any noticeable degradation to the running cluster
• Self-healing: minimal operational noise
• Developers
– Mindset needs to shift from normalization to denormalization
– Need to have a reasonable understanding of Cassandra architecture
– Enjoy the schema change flexibility. No more DDL locks/DBA dependency

Page 25: Cassandra Data Modeling - Practical Considerations @ Netflix

Questions

Page 26: Cassandra Data Modeling - Practical Considerations @ Netflix

Reading from Cassandra

[Diagram: a client read involving the row cache, the key cache, the memtable and multiple sstables.]

Page 27: Cassandra Data Modeling - Practical Considerations @ Netflix

Writing to Cassandra

[Diagram: a client write goes to the commit log (disk) and the memtable (memory); the memtable is flushed to sstables. Replication factor: 3.]
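A heavily simplified, single-node sketch of the write and read paths on the last two slides. The row cache, key cache, replication and compaction are left out, and the class, its method names and the tiny flush threshold are hypothetical.

```python
class Node:
    """Toy model: commit log + memtable + immutable sstables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of every write
        self.memtable = {}        # key -> (value, timestamp), in memory
        self.sstables = []        # flushed snapshots, treated as immutable
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into a new sstable and start a fresh memtable."""
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        """Check the memtable and every sstable; the newest timestamp wins."""
        tables = [self.memtable] + self.sstables
        candidates = [table[key] for table in tables if key in table]
        if not candidates:
            return None
        return max(candidates, key=lambda cell: cell[1])[0]


node = Node(memtable_limit=2)
node.write("a", "v1", timestamp=1)
node.write("b", "x", timestamp=2)   # memtable reaches the limit and is flushed
node.write("a", "v2", timestamp=3)  # newer version of "a" lives in the memtable
```

In this model a key rewritten many times is spread over several sstables until compaction merges them, which is the read cost that the read-modify-write slides warn about.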