Time-Series Data in MongoDB on a Budget
Peter Schwaller, Senior Director Server Engineering, Percona
Santa Clara, California | April 23rd – 25th, 2018


May 08, 2020


Peter Schwaller – Senior Director Server Engineering, Percona
Santa Clara, California | April 23rd – 25th, 2018

Time-Series Data in MongoDB on a Budget

TIME SERIES DATA in MongoDB on a Budget


What is Time-Series Data?

Characteristics:

• Arriving data is stored as a new value as opposed to overwriting existing values

• Usually arrives in time order

• Accumulated data size grows over time

• Time is the primary means of organizing/accessing the data

Time Series Data in MONGODB on a Budget


Why MongoDB?

• General purpose database

• Specialized Time-Series DBs do exist

• Do not use the MMAPv1 storage engine


Data Retention Options

• Purge old entries
    • Set up MongoDB index with TTL option (be careful if this index is your shard key)

• Aggregate data and store summaries
    • Create summary document, delete original raw data
    • Huge compression possible (seconds->minutes->hours->days->months->years)

• Measurement buckets
    • Store all entries for a time window in a single document
    • Avoids storing duplicate metadata

• Individual Documents for Each Measurement
    • Useful when data is sparse or intermittent (e.g., events rather than sensors)
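
As a rough illustration of the measurement-bucket and summary options, the Python sketch below groups raw readings into one document per sensor per time window and then collapses each bucket into a summary. The field names (sensor, start, values, t, v), the one-hour window, and the _id layout are assumptions made for this example; they are not taken from the talk.

```python
from datetime import datetime, timezone


def bucket_start(ts, window_seconds=3600):
    """Round a timezone-aware timestamp down to the start of its window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % window_seconds, tz=timezone.utc)


def build_buckets(measurements, window_seconds=3600):
    """Group (sensor, timestamp, value) tuples into one document per window.

    Metadata such as the sensor name is stored once per bucket instead of
    once per measurement.
    """
    buckets = {}
    for sensor, ts, value in measurements:
        start = bucket_start(ts, window_seconds)
        bucket = buckets.setdefault(
            (sensor, start),
            {
                "_id": f"{sensor}:{start.isoformat()}",  # hypothetical _id layout
                "sensor": sensor,
                "start": start,
                "values": [],
            },
        )
        bucket["values"].append({"t": ts, "v": value})
    return list(buckets.values())


def summarize(bucket):
    """Collapse a bucket into a summary document (count/min/max/avg)."""
    values = [entry["v"] for entry in bucket["values"]]
    return {
        "_id": bucket["_id"],
        "sensor": bucket["sensor"],
        "start": bucket["start"],
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }


# Made-up readings: two fall in the 10:00 window, one in the 11:00 window.
readings = [
    ("s1", datetime(2018, 4, 23, 10, 5, tzinfo=timezone.utc), 70.0),
    ("s1", datetime(2018, 4, 23, 10, 45, tzinfo=timezone.utc), 80.0),
    ("s1", datetime(2018, 4, 23, 11, 1, tzinfo=timezone.utc), 90.0),
]
buckets = build_buckets(readings)
summaries = [summarize(b) for b in buckets]
```

Repeating the same idea with a larger window on the summaries gives the seconds-to-minutes-to-hours style of rollup; the window size is a tuning knob rather than a fixed recommendation.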


Potential Problems with Data Collection

• Duplicate entries
    • Utilize unique index in MongoDB to reject duplicate entries

• Delayed

• Out of order


Problems with Delayed and Out-of-Order Entries

• Alert/Event generation

• Incremental Backup


Enable Streaming of Data

• Add recordedTime field (in addition to existing field with timestamp)

• Utilize $currentDate feature of db.collection.update()
    $currentDate: { recordedTime: true }

• You cannot use this field as a shard key!

• Requires use of update instead of insert
    • Which in turn requires specification of _id field
    • Consider constructing your _id to solve the duplicate entries issue at the same time

Allows applications to reliably process each document once and only once.
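
One way to picture the recordedTime approach is the Python sketch below, which only builds the filter and update documents and needs no server. The _id layout (sensor name plus measurement time), the sensor, time, and speed fields, and the helper names are assumptions made for the example.

```python
from datetime import datetime, timezone


def build_upsert(sensor, ts, fields):
    """Return (filter, update) arguments for an upsert of one measurement.

    The _id is derived from the sensor name and the measurement time
    (a hypothetical layout), so re-sending the same measurement targets
    the same document instead of creating a duplicate.
    """
    doc_id = f"{sensor}:{ts.isoformat()}"
    update = {
        "$set": {"sensor": sensor, "time": ts, **fields},
        # The server fills in recordedTime from its own clock at write time.
        "$currentDate": {"recordedTime": True},
    }
    return {"_id": doc_id}, update


def newer_than(last_recorded):
    """Query filter for documents recorded after a consumer's checkpoint."""
    return {"recordedTime": {"$gt": last_recorded}}


ts = datetime(2018, 4, 23, 10, 5, tzinfo=timezone.utc)
filter_doc, update_doc = build_upsert("s1", ts, {"speed": 81.5})
# With a driver, these might be passed as:
#   collection.update_one(filter_doc, update_doc, upsert=True)
checkpoint_filter = newer_than(datetime(2018, 4, 23, 10, 0, tzinfo=timezone.utc))
```

Note that in this sketch a re-sent measurement also refreshes recordedTime, so a consumer polling with newer_than would see it again; whether that is acceptable depends on the application.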

Accessing Your Data
It’s only *mostly* write-only.


Create Appropriate Indexes

• Avoid collection scans!
    • Consider using: db.adminCommand( { setParameter: 1, notablescan: 1 } )

• Avoid queries that might as well be collection scans

• Create the indexes you need (but no more)

• Don’t depend on index intersection

• Don’t over index
    • Each index can take up a lot of disk/memory
    • Consider using partial indexes
      { partialFilterExpression: { speed: { $gt: 75.0 } } }
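
The partial index option might be created from Python roughly as follows. The function expects an object that behaves like a PyMongo Collection; the helper name and the stand-in RecordingCollection class are invented here so the call can be shown without a running server.

```python
def create_high_speed_index(collection):
    """Create an index on speed that only covers documents with speed > 75.0.

    `collection` is expected to behave like a PyMongo Collection.
    """
    return collection.create_index(
        [("speed", 1)],
        partialFilterExpression={"speed": {"$gt": 75.0}},
    )


class RecordingCollection:
    """Stand-in that records create_index calls instead of contacting a server."""

    def __init__(self):
        self.calls = []

    def create_index(self, keys, **options):
        self.calls.append((keys, options))
        # Mimic the default index name format, e.g. "speed_1".
        return "_".join(f"{field}_{direction}" for field, direction in keys)


fake = RecordingCollection()
index_name = create_high_speed_index(fake)
```

Keep in mind that a query generally has to include a condition compatible with the filter expression (here, a speed above 75.0) for the planner to consider a partial index.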


Check Your Indexes

• Use .explain() liberally

• Check which indexes are actually used:
    db.collection.aggregate( [ { $indexStats: {} } ] )
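
A small helper along these lines could turn the $indexStats output into a list of candidate indexes to drop. It assumes a PyMongo-like collection object; the FakeCollection class and its sample statistics are made up so the sketch runs without a server.

```python
def unused_indexes(collection):
    """Return names of indexes whose $indexStats access counter is zero."""
    stats = collection.aggregate([{"$indexStats": {}}])
    return sorted(
        entry["name"]
        for entry in stats
        # The _id index is mandatory, so it is never a drop candidate.
        if entry["name"] != "_id_" and int(entry["accesses"]["ops"]) == 0
    )


class FakeCollection:
    """Stand-in returning made-up $indexStats documents."""

    def aggregate(self, pipeline):
        return iter(
            [
                {"name": "_id_", "accesses": {"ops": 0}},
                {"name": "time_1", "accesses": {"ops": 120}},
                {"name": "sensor_1_time_1", "accesses": {"ops": 0}},
                {"name": "speed_1", "accesses": {"ops": 0}},
            ]
        )


candidates = unused_indexes(FakeCollection())
```

The counters are kept per server and start over when it restarts, so a zero is only meaningful after the workload has run for a representative period.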

Adding Data
Getting the Speed You Need


API Methods

• Insert array
    database[collection].insert(doc_array)

• Insert unordered bulk
    bulk = database[collection].initialize_unordered_bulk_op()
    bulk.insert(doc)  # loop here
    bulk.execute()

• Upsert unordered bulk
    bulk = database[collection].initialize_unordered_bulk_op()
    bulk.find({"_id": doc["_id"]}).upsert().update_one({"$set": doc})  # loop here
    bulk.execute()

• Insert single
    database[collection].insert(doc)

• Upsert single
    database[collection].update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
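
The "# loop here" comments imply feeding documents to the bulk methods in batches. A driver-independent sketch of that batching is shown below; flush stands for whatever function wraps the chosen API method, and the helper names are hypothetical. It is written against a plain callable because the legacy insert() and initialize_unordered_bulk_op() calls were removed in PyMongo 4.0. The default of 1000 mirrors the batch size reported later in the benchmark environment.

```python
from itertools import islice


def chunked(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_in_batches(docs, flush, batch_size=1000):
    """Send documents to `flush` one batch at a time; return the batch count.

    `flush` is any callable taking a list of documents, for example a
    function wrapping one of the bulk methods listed above.
    """
    batches = 0
    for batch in chunked(docs, batch_size):
        flush(batch)
        batches += 1
    return batches


sent = []  # collects each batch instead of writing to a database
docs = ({"_id": i, "value": i * 2} for i in range(2500))
batch_count = write_in_batches(docs, sent.append, batch_size=1000)
```

Keeping the batch size a parameter makes it easy to vary during benchmarking.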


Relative Performance

[Bar chart: Comparison of API Methods. Y-axis: Docs/Sec (0 to 40,000). Bars: Insert Array, Insert Unordered Bulk, Update Unordered Bulk, Insert Single, Update Single.]

Benchmarks… and other lies.
Answering, “Why can’t I just use a gigantic HDD RAID array?”


Benchmark Environment

• VMs
    • 4 core Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
    • 8 GB RAM
    • Sandisk Ultra II 960GB SSD
    • WD 5TB 7200rpm HDD

• MongoDB
    • 3.4.13
    • WiredTiger
    • 4GB Cache
    • Snappy collection compression
    • Standalone server (no replica set, no mongos)

• Data
    • 178 bytes per document in 6 fields
    • 3 indexes (2 compound)
    • Disk usage: 40% storage, 60% indexes
    • Using update unordered bulk method, 1000 docs per bulk.execute()


Benchmark SSD vs. HDD

[Bar chart: Inserts/Sec (y-axis 0 to 10,000) for SSD vs. HDD.]

SSD Benchmark
60 Minutes

SSD Benchmark
0:30-1:00

HDD Benchmark
0:30-1:30

HDD Benchmark
0:30-8:45 (42M documents)

HDD Benchmark
Last Hour

SSD Benchmark
0:30-2:10 (42M documents)

Benchmark SSD vs. HDD
Last Hour

[Bar chart: Inserts/Sec (y-axis 0 to 10,000) for SSD vs. HDD.]


96 Hour Test


TL;DR

• Don’t trust someone else’s benchmarks (especially mine!)

• Benchmark using your own “schema” and indexes

• Artificially accelerate the point at which index size exceeds available memory


Time Series Data in MongoDB on a BUDGET


Replica Set Rollout Options

• Follow standard advice
    • 3 server replica sets (Primary, Secondary, Secondary)
    • Every replica set server on its own hardware
    • Disk mirroring

• Cost cutting options
    • Primary, Secondary, Arbiter
    • Locate multiple replica set servers on the same hardware (but NOT from the SAME replica set)
    • No disk mirroring (how many copies do you really need?)

• “I love downtime and don’t care about my data”
    • Single instance servers instead of replica sets
    • RAID0 (“no wasted disk space!”)
    • No backups

Storing Lots of Data
“Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.”


Conventional Sharding

• Non-sharded data kept in default replica set

• Shard key hashed on timestamp to evenly distribute data

• Pros:
    • Increases insert rate
    • Arbitrarily large data storage

• Cons:
    • All shard replica sets should have comparable hardware
    • All shards start thrashing at the same time
    • Expanding means a LOT of rebalancing


Data Access Patterns

• New writes are always very recent

• Reads are almost always of recent data

• Reads of old data are “intuitively” slower

… let’s take advantage of that.


Sharding by Zone

• Non-sharded data kept in default replica set

• Most recent time-series data stored in “fast” replica set

• Older time-series data stored in “slow” replica sets

• Pros:
    • Pay for speed where we need it
    • Swap “fast” to “slow” before thrashing kills performance
    • “Infinite” data size

• Cons:
    • Ceiling on insert speed


Prerequisites for Zone Sharding

• Sharded cluster configured (config replica set, mongos, etc)

• Existing replica set rsmain (primary shard) contains your normal (not time-series) data

• TimeSeries collection with an index on “time”

• New replica set for time-series data (e.g., rs001) added as a shard


Initial Zone Ranges

• Run on mongos:
    use admin
    sh.enableSharding('DBName')
    sh.shardCollection('DBName.TimeSeries', { time: 1 })
    sh.addShardTag('rsmain', 'future')
    sh.addShardTag('rs001', 'ts001')
    sh.addTagRange('DBName.TimeSeries', { time: new Date("2099-01-01") }, { time: MaxKey }, 'future')
    sh.addTagRange('DBName.TimeSeries', { time: MinKey }, { time: new Date("2099-01-01") }, 'ts001')
    // sh.splitAt('DBName.TimeSeries', { "time": new Date("2099-01-01") })

Adding a New Time-Series Replica Set
Step 1 – Create new Replica Set

• When?
    • Well before you run out of available fast storage
    • Before your input capacity is lowered too close to your needs

• Where?
    • On the same server with fast storage as the current time-series replica set

• Run on mongos:
    use admin
    db.runCommand({ addShard: "rs002/hostname:port", name: "rs002" })
    sh.addShardTag('rs002', 'ts002')
    var configdb = db.getSiblingDB("config");
    configdb.tags.update({ tag: "ts001" }, { $set: { 'max.time': new ISODate("2018-04-26") } })
    sh.addTagRange('DBName.TimeSeries', { time: new Date("2018-04-26") }, { time: new Date("2099-01-01") }, 'ts002')
    // sh.splitAt('DBName.TimeSeries', { "time": new ISODate("2018-04-26") })
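
To make the effect of these commands on the zone ranges easier to follow, here is a pure-Python model of the tag ranges before and after rs002 is added. It does not talk to MongoDB: datetime.min and datetime.max stand in for MinKey and MaxKey, and the function names are made up for the example. Ranges are treated as inclusive at the lower bound and exclusive at the upper bound.

```python
from datetime import datetime

# Stand-ins for MongoDB's MinKey and MaxKey in this pure-Python model.
MIN_KEY = datetime.min
MAX_KEY = datetime.max


def split_zone(ranges, old_tag, new_tag, boundary):
    """Cap `old_tag` at `boundary` and give the remainder to `new_tag`.

    `ranges` is a list of (min, max, tag) tuples with inclusive min and
    exclusive max.
    """
    result = []
    for lower, upper, tag in ranges:
        if tag == old_tag:
            if not lower < boundary < upper:
                raise ValueError("boundary must fall inside the old zone")
            result.append((lower, boundary, old_tag))
            result.append((boundary, upper, new_tag))
        else:
            result.append((lower, upper, tag))
    return result


def zone_for(ranges, ts):
    """Return the tag of the zone whose range contains `ts`."""
    for lower, upper, tag in ranges:
        if lower <= ts < upper:
            return tag
    raise LookupError("no zone covers this timestamp")


far_future = datetime(2099, 1, 1)
# Initial layout: everything before 2099 goes to ts001, the rest to 'future'.
initial = [(MIN_KEY, far_future, "ts001"), (far_future, MAX_KEY, "future")]
# After step 1: ts001 ends at 2018-04-26 and ts002 takes over from there.
updated = split_zone(initial, "ts001", "ts002", datetime(2018, 4, 26))
```

In this model a document timestamped before the boundary still maps to ts001, while one at or after the boundary maps to ts002, which matches the behaviour described in the next step.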

Adding a New Time-Series Replica Set
Step 2 – Wait before Relocation

• Initially nothing changes – all data is added into previous replica set

• Eventually, new entries match the min.time of the new replica set and will be stored there

• How long to wait before relocation?
    • Make sure you don’t fill up your fast storage
    • How far back in time do “normal” queries go?
        - Queries to previous replica set will get slower after relocation

Adding a New Time-Series Replica Set
Step 3 – Relocate to Slow Storage

• Follow standard procedure for moving replica set

• Multiple server instances can share same server/storage
    • Use unique ports
    • Set --wiredTigerCacheSizeGB appropriately


Pause for Questions


Wrap Up

1. Determine your anticipated time-series data rate

2. Mock up a benchmark app matching your use-case
    • Focus on indexed fields and their cardinality

3. Benchmark on a single server
    • Fast storage
    • Limited memory to accelerate index thrashing
    • Ensure benchmarks run long enough

4. Iterate adjusting the following tradeoffs:
    • single vs bulk/array
    • upsert vs insert
    • size of bulk/array insert/upsert
    • if using measurement buckets, adjust size of bucket

5. If you achieve your needed data rate, use shard tags to push old data to slower (cheaper) servers
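
A benchmark app for steps 2 to 4 could start from a loop like the one below, which reports throughput per time window so that a late drop (for example, once indexes outgrow memory) is not averaged away. The function and parameter names are invented for this sketch; write_batch would wrap whichever API method is being tested, and the fake clock only exists so the demo runs instantly.

```python
import time


def run_benchmark(write_batch, make_batch, duration_s, window_s, clock=time.monotonic):
    """Call write_batch repeatedly and report docs/sec per time window.

    Returns a list of docs/sec figures, one per completed window.
    """
    rates = []
    start = window_start = clock()
    docs_in_window = 0
    while True:
        batch = make_batch()
        write_batch(batch)
        docs_in_window += len(batch)
        now = clock()
        if now - window_start >= window_s:
            rates.append(docs_in_window / (now - window_start))
            window_start, docs_in_window = now, 0
        if now - start >= duration_s:
            return rates


class FakeClock:
    """Deterministic clock for the demo: advances one second per reading."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        self.now += 1.0
        return self.now


written = []  # stands in for the database
rates = run_benchmark(
    write_batch=written.extend,
    make_batch=lambda: [{"value": i} for i in range(10)],
    duration_s=4,
    window_s=2,
    clock=FakeClock(),
)
```

In a real run the default clock would be used, with a duration of hours rather than seconds and documents shaped like the production schema and indexes.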


Rate My Session


Thank You Sponsors!!


Thank You!