Simple Practices in Performance Monitoring and Evaluation
Schubert Zhang, 2016.3.24
(Posted Jan 22, 2018)

Transcript
Page 1: Simple practices in performance monitoring and evaluation

Simple Practices in Performance Monitoring and Evaluation

Schubert Zhang 2016.3.24

Page 2: Simple practices in performance monitoring and evaluation

SLA

Service Level Agreements

https://en.wikipedia.org/wiki/Service-level_agreement

SLAs commonly include segments to address: a definition of services, performance measurement, problem management, customer duties,

warranties, disaster recovery, termination of agreement.

Page 3: Simple practices in performance monitoring and evaluation

• API / IM SLA

• Performance

• Performance-oriented SLA

Page 4: Simple practices in performance monitoring and evaluation

Metrics make an SLA measurable: a performance SLA is defined in terms of performance metrics.

Performance Metrics

e.g. 1: API

• (99%)

e.g. 2: Call Center

• Abandonment Rate: Percentage of calls abandoned while waiting to be answered.

• ASA (Average Speed to Answer): Average time it takes for a call to be answered by the service desk.

• TSF (Time Service Factor): Percentage of calls answered within a definite timeframe, e.g., 80% in 20 seconds.

• FCR (First-Call Resolution): Percentage of incoming calls resolved without a callback and without the caller having to call the helpdesk again to finish resolving the case.

• TAT (Turn-Around Time): Time taken to complete a certain task.

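As a sketch, the call-center metrics above could be computed from simple call records. The `Call` record and its fields here are hypothetical, just enough to express the definitions:

```java
import java.util.List;

/** Sketch: computing call-center SLA metrics from simple call records (hypothetical fields). */
class CallMetrics {
    /** One call: wait time in seconds, whether it was abandoned, whether it was resolved first-call. */
    record Call(double waitSeconds, boolean abandoned, boolean firstCallResolved) {}

    /** Abandonment Rate: percentage of calls abandoned while waiting to be answered. */
    static double abandonmentRate(List<Call> calls) {
        long abandoned = calls.stream().filter(Call::abandoned).count();
        return 100.0 * abandoned / calls.size();
    }

    /** ASA: average wait time of the calls that were actually answered. */
    static double averageSpeedToAnswer(List<Call> calls) {
        return calls.stream().filter(c -> !c.abandoned())
                .mapToDouble(Call::waitSeconds).average().orElse(0);
    }

    /** TSF: percentage of answered calls picked up within the threshold (e.g. 20 s). */
    static double timeServiceFactor(List<Call> calls, double thresholdSeconds) {
        long answered = calls.stream().filter(c -> !c.abandoned()).count();
        long within = calls.stream()
                .filter(c -> !c.abandoned() && c.waitSeconds() <= thresholdSeconds).count();
        return answered == 0 ? 0 : 100.0 * within / answered;
    }
}
```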

Page 5: Simple practices in performance monitoring and evaluation

Benchmarking

The quality of a service must be measured, evaluated, and benchmarked,

and we must have a set of approaches for benchmarking.

Page 6: Simple practices in performance monitoring and evaluation

Metrics to be monitored

Page 7: Simple practices in performance monitoring and evaluation

Throughput

QPS TPS CPS

in seconds, in minutes, in hours …

Page 8: Simple practices in performance monitoring and evaluation

Concurrency

Page 9: Simple practices in performance monitoring and evaluation

Latency

Response Time, Round-Trip Time (RTT), …

Average, Median, Min., Max., Percentile, …
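These statistics can be computed directly from a batch of latency samples. A sketch using the nearest-rank percentile definition (an assumption, since the deck does not fix one; class and method names are illustrative):

```java
import java.util.Arrays;

/** Sketch: summary statistics over measured latencies (e.g. milliseconds). */
class LatencyStats {
    /** Nearest-rank percentile, p in (0, 100]: smallest sample with at least p% of samples at or below it. */
    static double percentile(double[] latencies, double p) {
        double[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    static double average(double[] latencies) {
        return Arrays.stream(latencies).average().orElse(0);
    }

    static double median(double[] latencies) {
        return percentile(latencies, 50);
    }
}
```

Note how one slow outlier pulls the average up while the median stays put, which is why percentiles matter for latency SLAs.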

Page 10: Simple practices in performance monitoring and evaluation

Quantile / Percentile

See the Google Sawzall paper.

Page 11: Simple practices in performance monitoring and evaluation

A Summary of these Concepts

[Diagram: Clients 1…N issue requests to a server that handles them with a pool of worker threads; Throughput, Latency, and Concurrency are the quantities measured between clients and server.]
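The three quantities in this picture are linked in steady state by Little's law (not stated on the slide, but the standard relation): concurrency ≈ throughput × latency. A minimal sketch:

```java
/** Sketch: Little's law ties the three benchmark quantities together in steady state. */
class LittleLaw {
    /** Average number of in-flight requests implied by a throughput (ops/s) and a latency (s). */
    static double impliedConcurrency(double throughputOpsPerSec, double latencySeconds) {
        return throughputOpsPerSec * latencySeconds;
    }
}
```

For example, ~80K ops/s at ~0.5 ms average latency implies roughly 40 requests in flight; a cross-check like this helps validate benchmark numbers.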

Page 12: Simple practices in performance monitoring and evaluation

A Real-World Example

Page 13: Simple practices in performance monitoring and evaluation

Example 1: The Amazon Dynamo Paper

Page 14: Simple practices in performance monitoring and evaluation
Page 15: Simple practices in performance monitoring and evaluation
Page 16: Simple practices in performance monitoring and evaluation

[Latency figures from the Dynamo paper, reported as an Average and a 99.9% quantile]

Page 17: Simple practices in performance monitoring and evaluation

Example 2: An Evaluation Report for a NoSQL DB

Cassandra

Page 18: Simple practices in performance monitoring and evaluation

Benchmark for Writes (Write API): cluster overview, Throughput, Latency

• Each node runs 6 clients (threads), 54 clients in total.

• Each client generates random CDRs for 50 million users/phone-numbers, and puts them into DaStor one by one.
  – KeySpace: 50 million
  – Size of a CDR: Thrift-compacted encoding, ~200 bytes

✓ Throughput: average ~80K ops/s; per-node average ~9K ops/s
✓ Latency: average ~0.5 ms
• Bottleneck: network (and memory)

Page 19: Simple practices in performance monitoring and evaluation

Benchmark for Read API

• Each node runs 8 clients (threads), 72 clients in total.

• Each client randomly uses a user-id/phone-number out of the 50-million space, to get its recent 20 CDRs (one page) from DaStor.

• All clients read CDRs of a same day/bucket.

[Histogram: percentage of read ops per 100 ms latency bucket (0%–25%), annotated with the average and the 97% quantile]

✓ Throughput: average ~140 ops/s; per-node average ~16 ops/s
✓ Latency: average ~500 ms, 97% < 2 s (SLA)
• Bottleneck: disk IO (random seek); CPU load is very low

Page 20: Simple practices in performance monitoring and evaluation

Total & Delta

Total: the cumulative value of a counter since start. Delta: its change over each monitoring interval.
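Assuming the usual meaning of these terms (a monotonically increasing total versus its per-interval change), turning a total into a rate can be sketched as:

```java
/** Sketch: converting a cumulative total counter into per-interval deltas and rates. */
class DeltaMeter {
    private long lastTotal;

    /** Given the new cumulative total and the interval length, returns the rate (ops/s) for that interval. */
    double rate(long currentTotal, double intervalSeconds) {
        long delta = currentTotal - lastTotal; // change since the last sample
        lastTotal = currentTotal;
        return delta / intervalSeconds;
    }
}
```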

Page 21: Simple practices in performance monitoring and evaluation

Generate the metrics and monitor them

Page 22: Simple practices in performance monitoring and evaluation

• On the server side:

• Add an operation count and record the time cost for every client call.

• For every monitor interval, pull or push the current Throughput and Latency to the monitoring tool (Ganglia/Zabbix) or console.

• Throughput = sum of counts / time interval

• Latency = sum of latencies / sum of counts (average), plus max, min, quantiles, …
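The per-interval bookkeeping above can be sketched as a small thread-safe accumulator (names are illustrative, not from the original code):

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: server-side per-interval counters. Each handled call records its cost;
 *  a monitor thread snapshots and resets the counters every interval. */
class IntervalMetrics {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong costNanos = new AtomicLong();

    /** Called once per client call with its measured time cost. */
    void record(long nanos) {
        count.incrementAndGet();
        costNanos.addAndGet(nanos);
    }

    /** Called every monitor interval; returns {throughput in ops/s, average latency in ms}. */
    double[] snapshotAndReset(double intervalSeconds) {
        long n = count.getAndSet(0);
        long cost = costNanos.getAndSet(0);
        double throughput = n / intervalSeconds;
        double avgLatencyMs = n == 0 ? 0 : cost / 1e6 / n;
        return new double[] { throughput, avgLatencyMs };
    }
}
```

The snapshot values are what would be pushed to Ganglia/Zabbix or printed to the console each interval.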

Code in Gitlab and Gerrit

Page 23: Simple practices in performance monitoring and evaluation

Code for Spring Project

Page 24: Simple practices in performance monitoring and evaluation

• Java

• JMX (Java Management Extensions, a simple example at https://github.com/schubertzhang/jsketch)

• javaagent (java -javaagent:jarpath[=options], which runs the agent's premain method before main)

• jmxetric (use JMX and javaagent to display metrics to Ganglia, https://github.com/schubertzhang/jmxetric)

• Ganglia

• Zabbix

• …
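A minimal standard-MBean sketch of the JMX piece, assuming the usual naming pattern (the `ObjectName` and class names here are hypothetical, not taken from jsketch); once registered, an agent like jmxetric can read the attribute and forward it to Ganglia:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Sketch: exposing a metric through JMX as a standard MBean. */
public class JmxDemo {
    /** Standard MBean interface: JMX exposes the getter as a read-only attribute. */
    public interface ThroughputMBean {
        double getOpsPerSecond();
    }

    public static class Throughput implements ThroughputMBean {
        private volatile double opsPerSecond;
        public void update(double v) { opsPerSecond = v; }
        public double getOpsPerSecond() { return opsPerSecond; }
    }

    /** Registers the MBean with the platform MBean server under a hypothetical name. */
    public static Throughput register() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Throughput mbean = new Throughput();
        server.registerMBean(mbean, new ObjectName("demo:type=Throughput"));
        return mbean;
    }
}
```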

Page 25: Simple practices in performance monitoring and evaluation

Ganglia, Zabbix, etc.

Page 26: Simple practices in performance monitoring and evaluation

Performance Benchmark Programming

Demo: test and evaluate the Throughput and Latency of http://www.fangdd.com
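A benchmark driver along these lines can be sketched as follows; the `task` is a placeholder (a no-op in the test), and for the real demo an HTTP GET against the target site would be plugged in:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of a benchmark driver: N concurrent clients run a task repeatedly,
 *  recording total operations and total latency. */
class Bench {
    /** Returns {throughput in ops/s, average latency in ms}. */
    static double[] run(int clients, int opsPerClient, Runnable task) throws Exception {
        AtomicLong totalNanos = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        long start = System.nanoTime();
        for (int c = 0; c < clients; c++) {
            pool.submit(() -> {
                for (int i = 0; i < opsPerClient; i++) {
                    long t0 = System.nanoTime();
                    task.run();                          // the operation under test
                    totalNanos.addAndGet(System.nanoTime() - t0);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        double elapsedSec = (System.nanoTime() - start) / 1e9;
        long ops = (long) clients * opsPerClient;
        double throughput = ops / elapsedSec;
        double avgLatencyMs = totalNanos.get() / 1e6 / ops;
        return new double[] { throughput, avgLatencyMs };
    }
}
```

For percentile latencies, each per-operation cost would be kept (or fed into a histogram) rather than only summed.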

Page 27: Simple practices in performance monitoring and evaluation

Demo Time …

Page 28: Simple practices in performance monitoring and evaluation

demo screenshots

Page 29: Simple practices in performance monitoring and evaluation

demo screenshots

[Screenshot: histogram of the response-time distribution, annotated with the Average and the 95% quantile]

The long tail …

Page 30: Simple practices in performance monitoring and evaluation

Statistical Monitoring for Outliers

usually used for troubleshooting

Page 31: Simple practices in performance monitoring and evaluation

Captured from UTStarcom mSwitch R5 system, Guangxi Site, 2004.

The magic matrix:

Page 32: Simple practices in performance monitoring and evaluation

• e.g. Redis, Memcache

• Just add instrumentation at a single point; very low cost

• Very …

• Logs: ELK

Page 33: Simple practices in performance monitoring and evaluation

Heavy Logs & ELK

It’s another topic!

Page 34: Simple practices in performance monitoring and evaluation

Thank You!