OpenBMC - Platform Telemetry Neeraj Ladkani – [email protected] Open Source Firmware Conference 2019

Oct 16, 2021

Page 1: Open Source Firmware Conference 2019 OpenBMC - Platform ...

OpenBMC - Platform Telemetry

Neeraj Ladkani – [email protected]

Open Source Firmware Conference 2019

Page 2: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Cloud Telemetry Conundrums

• The rise and rapid evolution of data analytics, AI, and machine learning workloads have a significant impact on cloud hardware design.

• Commercial cloud infrastructure requires high availability and needs state-of-the-art telemetry to build fail-safe models and predict failures.

• The BMC's role has evolved from a legacy hardware management service to a central intelligent controller serving cloud control plane operations.

Page 3: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Specialization with Standardization

• Processors
  • Processor errors and CPU crash dump
• Memory
  • Memory correctable and uncorrectable errors
• IO
  • PCIe correctable and uncorrectable errors
  • SMART data for disks
• Add-on cards and custom silicon
  • Thermal data
  • Vendor-specific telemetry
• Host subsystem
  • OS heartbeat
  • Network link status
• Power supply
  • Fault history
  • Energy storage attributes
  • Consumption history
• BMC
  • Firmware stats
  • Request and response history
  • BMC CPU/memory/flash stability
• Mainboard HW
  • Hot swap controller faults
  • Voltage regulator faults

Page 4: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Objective

• Standardize telemetry model

• Design a configurable BMC telemetry and health monitoring framework for OpenBMC platforms (hardware, thermal, power, BMC, and custom)

• Provide a generic interface to remotely access the metric data using both a push and pull model.

Page 5: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Possible Solutions

• Custom daemons for every subsystem and custom IPMI/Redfish to push telemetry information
  • Use native binary blobs and OEM URIs

• Custom methods to specify telemetry parameters such as metric definitions, sensing intervals, and triggers

Page 6: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Telemetry Collection Subsystem

• Use “collectd” to collect metrics.

• Subsystem owners can write or provide “collectd” plugins to collect metrics (Hardware as a Service).

• Integrate the IPMI and Redfish subsystems with collectd through intermediate translation services.

• collectd supports aggregation of metric data, which enables space-efficient storage.
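
One low-effort way for a subsystem owner to feed values into collectd is a script run by collectd's Exec plugin, which reads `PUTVAL` lines in collectd's plain-text protocol from the script's standard output. The following is a minimal sketch of that idea, not the integration proposed in the talk. The host name `bmc0`, the plugin name `hotswap`, and the function `read_hotswap_temperature` are hypothetical and exist only for illustration.

```python
def putval_line(host, plugin, type_name, value, interval=10,
                plugin_instance=None, type_instance=None, timestamp=None):
    """Format one PUTVAL line in collectd's plain-text protocol.

    The identifier has the form host/plugin[-instance]/type[-instance].
    A timestamp of "N" tells collectd to use the current time.
    """
    plugin_part = plugin if plugin_instance is None else f"{plugin}-{plugin_instance}"
    type_part = type_name if type_instance is None else f"{type_name}-{type_instance}"
    identifier = f"{host}/{plugin_part}/{type_part}"
    when = "N" if timestamp is None else str(int(timestamp))
    return f'PUTVAL "{identifier}" interval={interval} {when}:{value}'


def read_hotswap_temperature():
    """Hypothetical stand-in for a subsystem-specific sensor read."""
    return 41.5


if __name__ == "__main__":
    # A real Exec-plugin script would loop and sleep for `interval` seconds;
    # this sketch emits a single sample and exits.
    print(putval_line("bmc0", "hotswap", "temperature",
                      read_hotswap_temperature(), type_instance="inlet"),
          flush=True)
```

The type (`temperature` here) has to be one that collectd knows from its types database; a plugin for a new kind of value would need a matching type definition.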

Page 7: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Redfish Telemetry Model

• Use the standard Redfish telemetry model (credit: Paul Vancil)

• Flexible, extensible, and complete for OpenBMC client interfaces

• Supports push (Redfish event model) and pull (event logs) models

• Supports triggers for specific scenarios

Page 8: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Redfish Telemetry – Sample Metric Report

Source: https://www.dmtf.org/documents/redfish-spmf/redfish-telemetry-white-paper-010a
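
To give a rough idea of what a metric report looks like to a client, the sketch below builds a heavily reduced report as a Python dictionary and reads values back from it. It is an illustration under assumptions, not the sample from the slide: only a few properties of the Redfish MetricReport schema are shown, and the report ID `BmcHealth`, the metric IDs, and all values are hypothetical.

```python
import json


def build_metric_report(report_id, samples):
    """Build a reduced MetricReport-like document.

    `samples` is a list of (metric_id, value, timestamp) tuples.
    The Redfish schema carries MetricValue as a string, so values
    are converted with str().
    """
    return {
        "@odata.type": "#MetricReport.v1_0_0.MetricReport",
        "@odata.id": f"/redfish/v1/TelemetryService/MetricReports/{report_id}",
        "Id": report_id,
        "Name": f"{report_id} metric report",
        "MetricValues": [
            {"MetricId": metric_id, "MetricValue": str(value), "Timestamp": timestamp}
            for metric_id, value, timestamp in samples
        ],
    }


def last_listed_values(report):
    """Return the last value listed for each MetricId, parsed as float."""
    values = {}
    for entry in report["MetricValues"]:
        values[entry["MetricId"]] = float(entry["MetricValue"])
    return values


if __name__ == "__main__":
    # Hypothetical samples; IDs and numbers are made up for illustration.
    report = build_metric_report("BmcHealth", [
        ("BmcCpuLoad", 1.0, "2019-09-03T10:00:00Z"),
        ("BmcCpuLoad", 1.5, "2019-09-03T10:00:30Z"),
        ("BmcMemoryUsedMiB", 4.8, "2019-09-03T10:00:30Z"),
    ])
    print(json.dumps(report, indent=2))
    print(last_listed_values(report))
```

In a pull model a client would fetch such a document from the report's URI; in a push model the same payload would arrive through an event subscription.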

Page 9: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Get Involved

• Workgroup call (bi-weekly)
  https://github.com/openbmc/openbmc/wiki/Platform-telemetry-and-health-monitoring-Work-Group

• Community requirements
  https://docs.google.com/spreadsheets/d/12gMMXB9r_WfWDf5wz-Z_zXsz6RNheC6p2LKp7HePAEE/edit?usp=sharing

• Design proposals
  https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/22257
  https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/23758
  https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/24357

Page 10: Open Source Firmware Conference 2019 OpenBMC - Platform ...

OpenBMC Metrics Collection

Proposal + Progress on Collectd Integration

Kun Yi ([email protected])

Open Source Firmware Conference 2019

Page 11: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Context: What are "metrics"?

● "a degree to which a software system or process possesses some property" -- Wikipedia
● Timeseries data
● Sensor value is a good example for BMC systems
● Metrics enable monitoring system data at scale, such as:
  ○ How much performance improves across the fleet when the BMCs are updated?
  ○ How many times do BMCs report thermal throttling on a group of machines running a heavy load?

Page 12: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Context: Characteristics of Metrics

Characteristics                  Metric        Log          Event
Generally numeric                Yes           No           Maybe
Time interval                    Regular       Irregular    Irregular
Urgent                           No            No           Maybe
Target                           Automation    Human        Human/Automation
Impact of losing a data point    Low           Medium       High

Examples:
● Metric: CPU load, memory usage, disk usage, daemon restart count, system uptime, ...
● Log: dmesg/kmesg, systemd journal, rsyslog, ...
● Event: catastrophic GPIO, sensor value over threshold, system disk full, ...

Page 13: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Context: Characteristics of Metrics

Existing OpenBMC collection, by category:
● Metric: ad hoc solutions
● Log: systemd journal, rsyslog, Redfish logging
● Event: IPMI SEL, Redfish events

Page 14: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Context: Collectd and RRDTool

● Collectd [1]
  ○ Metrics collection daemon
  ○ Written in C
  ○ Highly configurable
  ○ Over 100 plugins available
  ○ Supports various data formats including CSV and RRD
  ○ Used in OpenWRT

● RRDTool [2]
  ○ Based on RRD format
  ○ Includes shared library "libRRD"

Page 15: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Context: Round Robin Database (RRD)

● Stores data in a circular buffer
● Automatically aggregates data according to configuration
● Constant size

[Figure: RRD format illustration. Updates arrive as Primary Data Points (PDP); a Consolidation Function (CF) combines them into Consolidated Data Points (CDP), which are stored in a Round Robin Archive (RRA).]

Credit: Gabriel Matute
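
The three properties above (circular buffer, automatic aggregation, constant size) can be sketched in a few lines. This is a conceptual toy, not the RRD file format or the librrd API: the class name `RoundRobinArchive` and its parameters are made up for illustration, and real RRDs also handle timestamps, unknown values, and multiple archives.

```python
from collections import deque


class RoundRobinArchive:
    """Toy archive: consolidates every `steps_per_cdp` primary data points
    (PDPs) into one consolidated data point (CDP) and keeps only the most
    recent `rows` CDPs."""

    def __init__(self, steps_per_cdp, rows, consolidate):
        self.steps_per_cdp = steps_per_cdp
        self.consolidate = consolidate    # the consolidation function (CF)
        self.pending = []                 # PDPs waiting to be consolidated
        self.rows = deque(maxlen=rows)    # fixed-size circular buffer of CDPs

    def update(self, pdp):
        self.pending.append(pdp)
        if len(self.pending) == self.steps_per_cdp:
            # The oldest CDP is dropped automatically once the buffer is full.
            self.rows.append(self.consolidate(self.pending))
            self.pending = []


def average(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    rra = RoundRobinArchive(steps_per_cdp=2, rows=3, consolidate=average)
    for value in [1, 3, 5, 7, 9, 11, 13, 15]:
        rra.update(value)
    print(list(rra.rows))  # [6.0, 10.0, 14.0]
```

Eight updates produce four CDPs, but only the latest three survive, so storage stays constant no matter how long the collector runs.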

Page 16: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Design: Requirements

● Must be able to persist certain critical metrics
● Resource-friendly
  ○ Trade-off between storage and amount of data to persist
  ○ Persist only the important data
  ○ External program can scrape from BMC frequently
● Common, simple interface for instrumenting

Page 17: Open Source Firmware Conference 2019 OpenBMC - Platform ...

System Diagram (Illustrative)

Page 18: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Progress

● Created proof-of-concept
  ○ Collect BMC load and memory usage using Collectd plugins
  ○ Use OEM IPMI command to transfer data to the host
  ○ Host translates data to feed into other collection frameworks

● Preliminary study on resource consumption
  ○ Default bitbake recipe for rrdtool includes too many dependencies

Page 19: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Progress: Resource Consumption

● Tested based on OpenBMC 2.7, ARMv7a
● Image size
  ○ By default, the rrdtool recipe includes perl, python, graphics libs, ...
  ○ Building default rrdtool+collectd takes >7MB of flash space after xz compression
  ○ Building the minimally required recipe trims it down to 2.6MB

● CPU/Memory
  ○ With a few metrics being collected, memory consumption is ~4.8MB
  ○ CPU usage is ~1%
  ○ Will increase with the number of metrics being collected

● RRD file size
  ○ 23KB for 1 metric updated every 30s and kept for a day
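
The RRD file size figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes one archive that keeps every sample without consolidation, 8 bytes per stored value (RRD stores values as doubles), and ignores the file header; these are assumptions for the estimate, not measured details from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rra_payload_bytes(update_interval_s, retention_s, bytes_per_value=8):
    """Estimate the data payload of one archive: rows * bytes per value."""
    rows = retention_s // update_interval_s
    return rows * bytes_per_value


if __name__ == "__main__":
    # 86400 s / 30 s = 2880 rows; 2880 * 8 bytes = 23040 bytes
    print(rra_payload_bytes(30, SECONDS_PER_DAY))  # 23040
```

Under these assumptions the payload is about 23 KB, consistent with the reported file size, and it scales linearly with the number of metrics and the retention period.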

Page 20: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Future

● Configurable RRDtool recipe to drop unnecessary dependencies
● More code into librrd+ (librrd C++ wrapper)
● Look into generating events
● Look into tagging metrics
  ○ RRD file has no intrinsic string meta fields
  ○ "Collectd is moving (slowly but calmly) towards implementing arbitrary key/value attributes attached to each value."
● Redfish Telemetry Metric Report
  ○ Current proposal of JSON definition [3]

Page 21: Open Source Firmware Conference 2019 OpenBMC - Platform ...

References

[1] Collectd: https://github.com/collectd/collectd
[2] RRDtool: https://github.com/oetiker/rrdtool-1.x
[3] DMTF Redfish API JSON definition: https://redfish.dmtf.org/schemas/v1/MetricReportDefinition.v1_2_0.json

Page 22: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Credits

Gabriel Matute for his awesome work as an intern!

Page 23: Open Source Firmware Conference 2019 OpenBMC - Platform ...

Questions?