Open Source Firmware Conference 2019 OpenBMC - Platform ...

Post on 16-Oct-2021


OpenBMC - Platform Telemetry

Neeraj Ladkani – neladk@microsoft.com

Open Source Firmware Conference 2019

• The rise and rapid evolution of data analytics, AI, and machine learning workloads has a significant impact on cloud hardware design.

• Commercial cloud infrastructure requires high availability and needs state-of-the-art telemetry to build and predict failsafe models.

• The BMC's role has evolved from a legacy hardware management service to a central intelligent controller serving cloud control plane operations.

Cloud Telemetry Conundrums

Specialization with Standardization

• Processors
  • Processor errors and CPU crash dump
• Memory
  • Memory correctable and uncorrectable errors
• IO
  • PCIe correctable and uncorrectable errors
  • SMART data for disks
• Add-on cards and custom silicon
  • Thermal data
  • Vendor-specific telemetry
• Host subsystem
  • OS heartbeat
  • Network link status
• Power supply
  • Fault history
  • Energy storage attributes
  • Consumption history
• BMC
  • Firmware stats
  • Request and response history
  • BMC CPU/memory/flash stability
• Mainboard HW
  • Hot swap controller faults
  • Voltage regulator faults

Objective

• Standardize telemetry model

• Design a configurable BMC telemetry and health monitoring framework for OpenBMC platforms (hardware, thermal, power, BMC, and custom)

• Provide a generic interface to remotely access the metric data using both a push and pull model.

Possible Solutions

• Custom daemons for every subsystem, with custom IPMI/Redfish interfaces to push telemetry information
  • Use native binary blobs and OEM URIs

• Custom methods to specify telemetry parameters such as metric definitions, sensing intervals, and triggers

Telemetry Collection Subsystem

• Use “collectd” for collecting metrics.

• “collectd” plugins can be written or provided by subsystem owners to collect metrics (Hardware as a Service).

• Integrate the IPMI and Redfish subsystems with collectd using intermediate translation services.

• Supports aggregation of metrics data, which enables space-efficient storage of data.

Redfish Telemetry Model

• Use the standard Redfish telemetry model (Credit: Paul Vancil)

• Flexible, extensible, and complete for OpenBMC client interfaces

• Supports both a push model (Redfish event model) and a pull model (event logs)

• Supports Triggers for specific scenarios
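As a rough illustration of what a numeric trigger does, the sketch below fires when a reading crosses a threshold in the rising direction. The threshold names follow the Redfish convention (UpperWarning, UpperCritical), but the class, the functions, and the sample temperatures are hypothetical and are not taken from the talk or from any OpenBMC code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class NumericThreshold:
    """Hypothetical stand-in for one numeric threshold of a trigger."""
    name: str       # e.g. "UpperCritical"
    reading: float  # threshold value


def crossed_upward(th: NumericThreshold, previous: float, current: float) -> bool:
    # Fire only on the transition, not on every sample above the threshold.
    return previous <= th.reading < current


def evaluate(thresholds: Sequence[NumericThreshold],
             samples: Sequence[float]) -> List[Tuple[int, str]]:
    """Return (sample index, threshold name) for every upward crossing."""
    events = []
    for i in range(1, len(samples)):
        for th in thresholds:
            if crossed_upward(th, samples[i - 1], samples[i]):
                events.append((i, th.name))
    return events


# Illustrative values only.
thresholds = [NumericThreshold("UpperWarning", 80.0),
              NumericThreshold("UpperCritical", 95.0)]
temps = [70.0, 82.0, 85.0, 97.0, 90.0]
events = evaluate(thresholds, temps)
print(events)
```

Firing on the transition rather than on the level keeps a sensor that sits above its limit from producing an event on every sample.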

Redfish Telemetry – Sample Metric Report

Source: https://www.dmtf.org/documents/redfish-spmf/redfish-telemetry-white-paper-010a
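The original sample report on this slide was an image and did not survive. As a stand-in, here is a minimal sketch that assembles a dictionary in the general shape of a Redfish MetricReport. The property names (Id, Name, MetricReportDefinition, MetricValues, MetricId, MetricValue, Timestamp) come from the Redfish schema; the report ID, metric IDs, URIs, values, and timestamp are made up for illustration.

```python
import json
from datetime import datetime, timezone


def build_metric_report(report_id, definition_uri, samples):
    """Assemble a MetricReport-shaped dict from (metric_id, value, timestamp) tuples."""
    return {
        "@odata.type": "#MetricReport.v1_0_0.MetricReport",
        "@odata.id": f"/redfish/v1/TelemetryService/MetricReports/{report_id}",
        "Id": report_id,
        "Name": f"Metric report {report_id}",
        "MetricReportDefinition": {"@odata.id": definition_uri},
        "MetricValues": [
            {
                "MetricId": metric_id,
                # Redfish carries metric values as strings.
                "MetricValue": str(value),
                "Timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            }
            for metric_id, value, ts in samples
        ],
    }


# Hypothetical metrics and timestamp, for illustration only.
ts = datetime(2019, 9, 3, 12, 0, 0, tzinfo=timezone.utc)
report = build_metric_report(
    "BmcHealth",
    "/redfish/v1/TelemetryService/MetricReportDefinitions/BmcHealth",
    [("BmcCpuLoad", 0.42, ts), ("BmcMemoryUsedKiB", 4915, ts)],
)
print(json.dumps(report, indent=2))
```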

Get Involved

• Workgroup call (bi-weekly): https://github.com/openbmc/openbmc/wiki/Platform-telemetry-and-health-monitoring-Work-Group

• Community requirements: https://docs.google.com/spreadsheets/d/12gMMXB9r_WfWDf5wz-Z_zXsz6RNheC6p2LKp7HePAEE/edit?usp=sharing

• Design proposals:
  • https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/22257
  • https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/23758
  • https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/24357

OpenBMC Metrics Collection

Proposal + Progress on Collectd Integration

Kun Yi (kunyi@google.com)

Open Source Firmware Conference 2019

Context: What are "metrics"?

● "a degree to which a software system or process possesses some property" -- Wikipedia

● Time-series data
● Sensor value is a good example for BMC systems
● Metrics enable monitoring system data at scale, such as:
  ○ How much does performance improve across the fleet when the BMCs are updated?
  ○ How many times do BMCs report thermal throttling on a group of machines running a heavy load?

Context: Characteristics of Metrics

Characteristics | Metric | Log | Event
Generally numeric | Yes | No | Maybe
Time interval | Regular | Irregular | Irregular
Urgent | No | No | Maybe
Target | Automation | Human | Human/Automation
Impact of losing a data point | Low | Medium | High
Examples | CPU load, memory usage, disk usage, daemon restart count, system uptime, ... | dmesg/kmesg, systemd journal, rsyslog, ... | catastrophic GPIO, sensor value over threshold, system disk full, ...
Existing OpenBMC collection | Ad hoc solutions | systemd journal, rsyslog, Redfish logging | IPMI SEL, Redfish events

Context: Collectd and RRDTool

● Collectd [1]
  ○ Metrics collection daemon
  ○ Written in C
  ○ Highly configurable
  ○ Over 100 plugins available
  ○ Supports various data formats, including CSV and RRD
  ○ Used in OpenWRT

● RRDTool [2]
  ○ Based on the RRD format
  ○ Includes the shared library "libRRD"

Context: Round Robin Database (RRD)

● Stores data in a circular buffer
● Automatically aggregates data according to configuration
● Constant size

[Figure: RRD format illustration. Labels: Updates, Primary Data Point (PDP), Consolidation Function (CF), Consolidated Data Point (CDP), Round Robin Archive (RRA).]

Credit: Gabriel Matute
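The round-robin idea can be sketched in a few lines. This is a toy model, not librrd: the class and method names are hypothetical, and only the concepts (primary data points, a consolidation function, a fixed number of rows that wrap around) mirror the RRD format described above.

```python
from statistics import mean


class RoundRobinArchive:
    """Toy RRA: consolidates every `steps` primary data points (PDPs)
    into one consolidated data point (CDP) held in a fixed-size ring."""

    def __init__(self, rows, steps, cf=mean):
        self.rows = rows            # number of CDP slots; never grows
        self.steps = steps          # PDPs per CDP
        self.cf = cf                # consolidation function, e.g. mean or max
        self.cdps = [None] * rows
        self.next_slot = 0
        self.pending = []

    def update(self, pdp):
        self.pending.append(pdp)
        if len(self.pending) == self.steps:
            # Overwrite the oldest slot once the ring is full.
            self.cdps[self.next_slot] = self.cf(self.pending)
            self.next_slot = (self.next_slot + 1) % self.rows
            self.pending = []

    def values(self):
        """Stored CDPs from oldest to newest, skipping unused slots."""
        ordered = self.cdps[self.next_slot:] + self.cdps[:self.next_slot]
        return [v for v in ordered if v is not None]


rra = RoundRobinArchive(rows=3, steps=2, cf=max)
for value in [1, 5, 2, 3, 9, 4, 7, 8]:
    rra.update(value)
print(rra.values())  # [3, 9, 8]: the first CDP (5) has been overwritten
```

Because the ring never grows, storage cost is fixed at creation time, which is what makes the format attractive on a flash-constrained BMC.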

Design: Requirements

● Must be able to persist certain critical metrics
● Resource-friendly
  ○ Trade-off between storage and amount of data to persist
  ○ Persist only the important data
  ○ External program can scrape from BMC frequently
● Common, simple interface for instrumenting

System Diagram (Illustrative)

Progress

● Created proof-of-concept
  ○ Collect BMC load and memory usage using Collectd plugins
  ○ Use OEM IPMI command to transfer data to the host
  ○ Host translates data to feed into other collection frameworks

● Preliminary study on resource consumption
  ○ Default bitbake recipe for rrdtool includes too many dependencies
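To make the proof-of-concept data flow concrete, here is a small sketch that parses the Linux /proc/loadavg and /proc/meminfo text formats and packs the result into a compact binary blob that the host could unpack. It does not reproduce the actual Collectd plugins or the OEM IPMI command, which the slides do not show; the blob layout and all function names are assumptions made for illustration.

```python
import struct

# Hypothetical blob layout: little-endian float32 load, uint32 used memory in KiB.
BLOB_FORMAT = "<fI"


def parse_loadavg(text):
    """Return the 1-minute load average from /proc/loadavg contents."""
    return float(text.split()[0])


def parse_meminfo_used_kib(text):
    """Return MemTotal - MemAvailable (KiB) from /proc/meminfo contents."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])
    return fields["MemTotal"] - fields["MemAvailable"]


def pack_sample(load, used_kib):
    """BMC side: encode one sample for transfer."""
    return struct.pack(BLOB_FORMAT, load, used_kib)


def unpack_sample(blob):
    """Host side: decode the sample before feeding another framework."""
    return struct.unpack(BLOB_FORMAT, blob)


# Sample text in the format of the real files; the values are illustrative.
loadavg_text = "0.25 0.10 0.05 1/80 1234\n"
meminfo_text = (
    "MemTotal:         240000 kB\n"
    "MemFree:          100000 kB\n"
    "MemAvailable:     180000 kB\n"
)
blob = pack_sample(parse_loadavg(loadavg_text),
                   parse_meminfo_used_kib(meminfo_text))
print(len(blob), unpack_sample(blob))
```

On a real BMC the two strings would come from reading /proc/loadavg and /proc/meminfo directly.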

Progress: Resource Consumption

● Tested on OpenBMC 2.7, ARMv7a
● Image size
  ○ By default, the rrdtool recipe includes perl, python, graphics libs, ...
  ○ Building default rrdtool+collectd takes >7MB of flash space after xz compression
  ○ Building the minimally required recipe trims it down to 2.6MB

● CPU/Memory
  ○ With a few metrics being collected, memory consumption is ~4.8MB
  ○ CPU usage is ~1%
  ○ Will increase with the number of metrics being collected

● RRD file size
  ○ 23KB for 1 metric updated every 30s and kept for a day
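The 23KB figure is consistent with a back-of-the-envelope estimate. The sketch below assumes one archive with one row per 30-second update, one 8-byte double per stored value, and no header overhead; these are simplifying assumptions, not details stated in the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rrd_rows(update_interval_s, retention_s):
    """Rows needed to keep one value per update for the retention period."""
    return retention_s // update_interval_s


def rrd_data_bytes(rows, data_sources=1, bytes_per_value=8):
    """Approximate archive payload size; header overhead is ignored (assumption)."""
    return rows * data_sources * bytes_per_value


rows = rrd_rows(30, SECONDS_PER_DAY)
size = rrd_data_bytes(rows)
print(rows, size)  # 2880 rows, 23040 bytes (about 23KB)
```

The same arithmetic shows why consolidation matters: keeping a week at 30-second resolution would cost seven times as much, while a second archive of hourly averages for a week needs only 168 rows.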

Future

● Configurable RRDtool recipe to drop unnecessary dependencies
● More code into librrd+ (librrd C++ wrapper)
● Look into generating events
● Look into tagging metrics
  ○ RRD file has no intrinsic string meta fields
  ○ "Collectd is moving (slowly but calmly) towards implementing arbitrary key/value attributes attached to each value."
● Redfish Telemetry Metric Report
  ○ Current proposal of JSON definition [3]

References

[1] Collectd: https://github.com/collectd/collectd
[2] RRDtool: https://github.com/oetiker/rrdtool-1.x
[3] DMTF Redfish API JSON definition: https://redfish.dmtf.org/schemas/v1/MetricReportDefinition.v1_2_0.json

Credits

Gabriel Matute for his awesome work as an intern!

Questions?
