Monitoring Cassandra: Don't Miss a Thing (Alain Rodriguez, The Last Pickle) | C* Summit 2016

Jan 06, 2017

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License

MONITORING CASSANDRA: DON'T MISS A THING!

Alain Rodriguez

Introduction

About The Last Pickle

About The Last Pickle and Alain Rodriguez

Poll time!

• How many of you are Cassandra users?

• … are not using any monitoring tool?

• … are using nodetool and logs at least?

• … plugged in a tool to build dashboards from Cassandra metrics?

• … are really happy with the set of dashboards you built?

CTO’s call…

CTO:

“Hi Alain, WTF is happening with Cassandra?”

Many possible outcomes to this discussion…

Without monitoring

Me: “What are you talking about? Anyway, ask the Devs about it!”

Without monitoring - but I know the context and am a bash expert!

Me: “It is probably due to the switch to incremental repairs, let me check”

Check = Look at the 42 nodes (or more…?)

Me: “I'll call you back, let me dig into this”

With monitoring

Me: “I had a look. Latency went through the roof, let's investigate this”

Minutes later…

“I investigated this and a client query should be rewritten”

With good alerting - no call from the CTO!

Me: “CTO FYI, it looks like latency went through the roof. We can fix it by doing X.”

CTO: “Hi Alain, we didn't even notice the latency, good catch, let's fix it!”

Why and what to monitor?

Why monitor?

1 - Understand patterns & Detect Anomalies

2 - Troubleshoot & Fix issues

3 - Understand internals & Optimize

What to monitor: Detect Anomalies!

• Be minimalist: a small set of charts, in one dashboard!

• What is necessary and sufficient to detect any issue

• Think about your own KPIs

• Set alerts on this dashboard and/or keep an eye on it.

Overview Dashboard
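
Setting an alert on an overview chart can be sketched in a few lines. This is a minimal illustration and not something taken from the talk: the function name, the 50 ms limit, the three-sample window, and the latency values are all hypothetical. The rule fires only when the metric stays above the limit for several consecutive samples, so a single spike does not trigger it.

```python
from typing import Optional, Sequence


def first_sustained_breach(
    samples_ms: Sequence[float],
    threshold_ms: float,
    sustained_samples: int,
) -> Optional[int]:
    """Return the index at which the alert would fire, or None if it never fires.

    The alert fires once `sustained_samples` consecutive samples exceed
    `threshold_ms`. Any sample at or below the threshold resets the streak.
    """
    streak = 0
    for index, value in enumerate(samples_ms):
        streak = streak + 1 if value > threshold_ms else 0
        if streak >= sustained_samples:
            return index
    return None


# Hypothetical p99 read latencies in milliseconds, one sample per minute.
p99_read_ms = [12, 11, 14, 95, 13, 80, 120, 150, 140, 15]

# Hypothetical KPI: p99 above 50 ms for three minutes in a row.
fired_at = first_sustained_breach(p99_read_ms, threshold_ms=50, sustained_samples=3)
print(fired_at)
```

With these sample values the isolated spike at index 3 is ignored and the rule fires at index 7, the third consecutive sample above the limit. The threshold and window are the parts to derive from your own KPIs.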

What to monitor: Troubleshoot!

• Maximise useful information

• Themed dashboards (Read path, Client connections, …)

• Reuse charts and metrics, when relevant

• Missing something? Add it!

What to monitor: Optimize!

• Mostly the same dashboards as for troubleshooting

• Having a good set of dashboards highlights bottlenecks

• Check for impacts of recent tuning or operations
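
Checking the impact of a tuning change amounts to comparing a metric before and after the change. The sketch below is only an illustration under assumptions: the function name and the latency values are hypothetical, and comparing medians is one simple choice among many.

```python
from statistics import median
from typing import Sequence


def tuning_impact(before: Sequence[float], after: Sequence[float]) -> float:
    """Relative change of the median between two windows.

    A result of -0.25 means the median dropped by 25% after the change.
    """
    baseline = median(before)
    if baseline == 0:
        raise ValueError("baseline median is zero; relative change is undefined")
    return (median(after) - baseline) / baseline


# Hypothetical read latencies (ms) sampled before and after a tuning change.
before_ms = [40, 42, 38, 41, 39]
after_ms = [30, 31, 29, 30, 32]

change = tuning_impact(before_ms, after_ms)
print(f"median latency changed by {change:+.0%}")
```

The median is used here because it is less sensitive to a few outliers than the mean. For latency, repeating the comparison on a high percentile is usually worthwhile as well.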

Monitoring limits

Monitoring limits: No relevant dashboard

• Tons of metrics available from the agent / reporter

Documentation: http://cassandra.apache.org/doc/latest/operating/metrics.html (Thanks to Jake Luciani - @tjake)

• Building dashboards is not straightforward; many options are available

• Double expertise needed

• No end-user pressure, unlike for a new feature

• Often it is no one's responsibility to build charts

Monitoring limits: Monitoring can fail

• Do not trust monitoring 100%

• When in doubt, double-check!

Monitoring limits: Tip for DataStax OpsCenter

• Trouble with DataStax OpsCenter in the past

• OpsCenter data should be stored outside the production C* cluster

Building dashboards

Building dashboards: Some existing pluggable solutions

Commercial

Free

Building dashboards: Existing expert sets of dashboards

• No “out of the box” and “complete” dashboards available

• Sematext-SPM is the winner from this perspective

Building dashboards: Templates idea origin

• TLP = Consulting company = many Cassandra clusters

• Need to build dashboards for many clients

• I am lazy, I do not like repeating tasks

=> Decided to build templates and share them with the community

Building dashboards: Overview Dashboard!

• ADD screenshot when done with this dashboard.

Building dashboards: Read path dashboard!

• ADD screenshot when done with this dashboard.

Building dashboards: Ongoing work

• This is ongoing work

• Templates will be shared and maintained by TLP (+ Monitoring provider)

• Corresponding reporter / agent configuration will be shared

• We will explain our choices and provide metrics in use

Conclusion

Make sure you are actually monitoring something!

• Monitoring is important for QoS and saving money.

• Having a monitoring tool doesn't mean things are monitored

Always ask yourself:

• Was I able to detect the latest outages?

• Would I now be able to detect outages I faced in the past?

• Do I find enough info to easily troubleshoot issues I face?

• Am I able to see impacts when I tune Cassandra?

Thank you

Questions?