Monitoring Swift
OpenStack Summit, Austin 2016
Adam Takvam, Sr. Systems Engineer
Martin Lanner, Engagement Manager
April 28, 2016
2 |SwiftStack Confidential
Overview
• Problems
  - Usage intelligence
  - Capacity planning
  - Operational health
  - Audit trails
• Background
  - Methods: logs + system metrics
  - Interpretation of metrics
  - Actions: thresholds + alerting
• Swift key monitoring concepts
  - What to monitor?
  - How to monitor
• Monitoring methods (demos)
  - Logging: ELK
  - Trending/Forecasting: Prometheus + Grafana
  - System monitoring: Zabbix
It’s Linux!
Properties of Swift
• Distributed system
• Extremely durable through replication or erasure coding
• No single point of failure
• Even distribution of data
• Resilient
• Self-healing capabilities
• Can take a lot of abuse and negligence
Anatomy of a Monitoring Solution
• Agent: Gathers metrics on a host and either pushes or advertises them
  - Logstash
  - Prometheus Node Exporter
  - Zabbix Agent
  - Nagios NRPE
• Aggregation engine: Collects metrics from agents and provides an API with access to aggregated metric values
  - Nagios
  - Zabbix
  - Elasticsearch
  - Prometheus
• Visualizer: Renders graphs in a human-friendly format for easy comprehension of system state
  - Kibana
  - Grafana
• Alerting: Uses metric thresholds to trigger alerts when metrics fall outside an acceptable range
  - Alertmanager
  - PagerDuty
Developing a Monitoring Strategy
Forms of Monitoring
• System utilization: CPU, memory, disk I/O, network, auditing cycles, replicator timing
• Performance: Transaction latency
• Errors: Invalid requests or states
• Outages: Service failures
• Feature usage: Understand CRUD operations and traffic patterns
• Audit trail: Who did what, and when?
Monitoring Lifecycle
• Measurement
• Reporting
• Characterization
• Thresholds
• Alerting
• Root cause analysis
• Remediation
  - Manual
  - Automated
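The characterization, threshold, and alerting steps of the lifecycle can be sketched in a few lines. This is only an illustration: the function names, the "mean plus k standard deviations" rule, and the default k=3 are assumptions chosen for the example, not something the slides prescribe.

```python
import statistics
from typing import List, Sequence, Tuple


def derive_threshold(baseline: Sequence[float], k: float = 3.0) -> float:
    """Characterization -> threshold.

    Assumption for this sketch: "normal" is the baseline mean, and anything
    more than k population standard deviations above it is out of range.
    """
    return statistics.mean(baseline) + k * statistics.pstdev(baseline)


def breaches(samples: Sequence[float], threshold: float) -> List[Tuple[int, float]]:
    """Threshold -> alerting: return (index, value) for samples above the threshold."""
    return [(i, v) for i, v in enumerate(samples) if v > threshold]


if __name__ == "__main__":
    # Hypothetical latency measurements (ms) from a quiet period.
    baseline_ms = [8.0, 12.0, 9.0, 11.0, 10.0]
    limit = derive_threshold(baseline_ms)
    for index, value in breaches([10.0, 11.5, 42.0], limit):
        print(f"sample {index}: {value} ms exceeds {limit:.1f} ms")
```

In practice the aggregation engine (Zabbix triggers, Prometheus rules) would evaluate the threshold; the sketch only shows how a baseline turns into an alert condition.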
Examples of monitoring methods
• ELK: Usage intelligence
  - Who?
  - Agents
  - HTTP response codes
  - Errors
  - Audit trails
• Prometheus: Capacity planning
  - Data growth
  - Trending analytics
• Zabbix: Operational health
  - Network
  - CPU
  - RAM
Key concepts for monitoring Swift
• Cluster full
  - df
  - Data growth
  - Capacity planning
• Networking
  - Availability
  - Saturation
• Proxy state
  - CPU
  - /healthcheck
• Auditing cycles
• Replication cycle timing
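As a rough illustration of the "cluster full" check, the sketch below does in Python what `df` does per device. The `/srv/node/*` glob is taken from the mount points used in the alert rule later in the deck. The function names and the 0.8 fill threshold are assumptions for the example.

```python
import glob
import shutil
from typing import Dict, List


def device_fill_ratios(mount_glob: str = "/srv/node/*") -> Dict[str, float]:
    """Return used/total for every path matching mount_glob (like `df`, per device)."""
    ratios = {}
    for mount in sorted(glob.glob(mount_glob)):
        usage = shutil.disk_usage(mount)
        ratios[mount] = usage.used / usage.total if usage.total else 0.0
    return ratios


def devices_over(ratios: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """List devices whose fill ratio is at or above the (assumed) threshold."""
    return [mount for mount, ratio in sorted(ratios.items()) if ratio >= threshold]


if __name__ == "__main__":
    for mount in devices_over(device_fill_ratios()):
        print(f"{mount} is at or above 80% full")
```

A point-in-time check like this only covers the `df` bullet; data growth and capacity planning need the values recorded over time, which is where a trending tool such as Prometheus comes in.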
Load balancer health checks against Swift proxy servers
• Most load balancers run ICMP checks against all IPs in their pools by default
• Also, consider configuring the load balancer to run HTTP checks (over TCP) against Swift’s /healthcheck endpoint
Example:
demo@demo:~$ curl http://swift.swiftstack.oss/healthcheck
OK
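The same probe a load balancer performs can be sketched with the Python standard library. The helper name `proxy_is_healthy` and the 2-second timeout are assumptions for illustration; the only behavior taken from the slide is that a healthy proxy answers `/healthcheck` with the body `OK`.

```python
import http.client
import urllib.error
import urllib.request


def proxy_is_healthy(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET <base_url>/healthcheck answers 200 with body "OK"."""
    url = base_url.rstrip("/") + "/healthcheck"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8", "replace").strip()
            return resp.status == 200 and body == "OK"
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or non-2xx status:
        # treat all of them as "take this proxy out of the pool".
        return False


if __name__ == "__main__":
    # Host name from the slide's curl example.
    print(proxy_is_healthy("http://swift.swiftstack.oss"))
```

Unlike an ICMP ping, this check fails when the host is up but the proxy service itself is not answering.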
Audit trails with ELK
Object size distribution
Distribution of CRUD operations over time
Zabbix triggers for Swift
Zabbix node memory usage
Zabbix drive utilization events
Disk I/O
Object Replicator Operations
Prometheus + Grafana trending and forecasting
Alerting
Example:
ALERT StorageCritical24Hours
  IF sum(predict_linear(node_filesystem_free{job="swiftstack",mountpoint=~"/srv/node/.*"}[1d], 24*3600)) < sum(node_filesystem_size{job="swiftstack",mountpoint=~"/srv/node/.*"}) * 0.2
  FOR 1h
  LABELS {
    group="storage_admin",
    severity="critical"
  }
Translation: Send a critical alert to all members of the storage_admin group if the total available storage capacity is projected to be less than 20% of the total storage capacity within the next 24 hours and that forecast has held true for at least 1 hour, recalculating every 5 minutes (per server config, not shown).
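To make the forecast in the alert rule concrete, here is a minimal sketch of the idea behind `predict_linear`: fit a least-squares line to recent free-space samples and extrapolate it forward. It is not Prometheus's implementation. The function names are hypothetical, and the sketch assumes "now" is the timestamp of the last sample.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, free bytes)


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Least-squares line through samples, evaluated horizon_s after the last one."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon_s) + intercept


def storage_critical(per_device: Sequence[Sequence[Sample]],
                     total_size_bytes: float,
                     horizon_s: float = 24 * 3600,
                     min_free_fraction: float = 0.2) -> bool:
    """Mirror the rule's condition: projected free space < 20% of total size."""
    projected_free = sum(predict_linear(s, horizon_s) for s in per_device)
    return projected_free < total_size_bytes * min_free_fraction


if __name__ == "__main__":
    # Hypothetical device losing 1 GB of free space per hour over the last day.
    gb = 10 ** 9
    history = [(h * 3600.0, (100 - h) * gb) for h in range(25)]
    print(storage_critical([history], total_size_bytes=400 * gb))
```

The `FOR 1h` clause is not modeled here; an equivalent would be to require the condition to hold on every evaluation for an hour before notifying anyone.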
Q&A / Demo
Thank you!