RAS: What is it? Why do we need it?
Harb Abdulhamid (Qualcomm), Fu Wei (Red Hat), Yazen Ghannam (AMD)
ENGINEERS AND DEVICES WORKING TOGETHER
What is it?
● Reliability
○ Computation needs to be correct and reliable.
○ Failures and errors need to be detected and reported.
○ Computation needs to fail when an error is not handled.
● Availability
○ System needs to remain available as long as possible.
○ Errors should be corrected and failures handled so that operation can continue.
● Serviceability
○ System should provide information to the administrator to aid in system servicing.
○ Service time needs to be minimized to maximize uptime.
Why do we need it?
● Increase in system uptime (productivity)
● Less time spent debugging bad or failing hardware (productivity/cost)
● Fewer hardware replacement calls (cost/mindshare)
Hardware Architecture (How do we do it?)
● x86: Machine Check Exceptions (MCE) & Machine Check Architecture (MCA)
○ Architectural features/extensions.
○ Defines a register set that can be used for multiple devices (IMPORTANT!).
○ Poll for correctable errors.
○ APIC LVT or SMI interrupts for correctable thresholding and deferred errors.
○ MCE for uncorrectable errors.
● PCI-E: Advanced Error Reporting (AER)
○ Similar concepts to MCE/MCA.
● Implementation-specific features
○ ECC in memory controllers
○ ECC in I/O RAMs
○ Poison/bad data markers
○ Flooding I/O links (e.g. Sync Flood)
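To make "poll for correctable errors" concrete, here is a minimal sketch of decoding an x86 MCi_STATUS value and scanning banks for corrected errors. The flag positions follow the architectural MCi_STATUS layout; `read_status` is a hypothetical callback standing in for a privileged MSR read, and vendor-specific fields (such as deferred-error flags) are deliberately left out.

```python
from dataclasses import dataclass

# Architectural flag bits in an x86 MCi_STATUS register.
BIT_VAL = 63    # register holds a valid error
BIT_OVER = 62   # an earlier error was overwritten
BIT_UC = 61     # error was not corrected by hardware
BIT_ADDRV = 58  # MCi_ADDR holds the failing address
BIT_PCC = 57    # processor context may be corrupt


@dataclass
class BankStatus:
    valid: bool
    overflow: bool
    uncorrected: bool
    addr_valid: bool
    context_corrupt: bool
    mca_error_code: int  # low 16 bits of the status register


def decode_status(raw: int) -> BankStatus:
    """Split a raw 64-bit MCi_STATUS value into named fields."""
    def bit(n: int) -> bool:
        return bool((raw >> n) & 1)

    return BankStatus(
        valid=bit(BIT_VAL),
        overflow=bit(BIT_OVER),
        uncorrected=bit(BIT_UC),
        addr_valid=bit(BIT_ADDRV),
        context_corrupt=bit(BIT_PCC),
        mca_error_code=raw & 0xFFFF,
    )


def poll_banks(read_status, bank_count):
    """Yield (bank, status) for every bank holding a valid, corrected error.

    read_status is a hypothetical callable: bank index -> raw status value.
    """
    for bank in range(bank_count):
        status = decode_status(read_status(bank))
        if status.valid and not status.uncorrected:
            yield bank, status
```

Uncorrectable errors are skipped here on purpose: per the slide, those arrive as a machine check exception rather than through polling.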
Platform Firmware (How do we do it?)
● Platform Firmware has intimate knowledge of the system and can handle RAS
features not available through standardized mechanisms.
● Privileged code runs on the main cores or a separate microcontroller.
● Can mask registers from OS view and handle interrupts.
● Handling can be done without OS’s knowledge and information can be
exposed to OS if desired.
● Preferably, will use a standard mechanism, like ACPI, to inform the OS of errors.
● Can directly inform sysadmin of errors using sideband communications like a
baseboard management controller (BMC).
● Can pinpoint bad hardware for easy replacement.
Kernel (How do we do it?)
● Error Detection and Correction (EDAC) for system-specific handling and decoding.
● ISA-specific handling in /arch.
● Drivers for PCI-E AER and ACPI.
● Ideally, most RAS code in the Kernel would be obsoleted by Platform Firmware
handling of errors.
● Kernel could then be only responsible for reporting errors received through
standard mechanisms (e.g. ACPI).
● Kernel could also perform error handling relevant at the kernel-level (e.g. killing
processes or retiring bad/poisoned pages).
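As an illustration of what the EDAC layer exposes, the following sketch reads the per-memory-controller corrected and uncorrected error counters from sysfs. It assumes the usual Linux EDAC layout (`mc*/ce_count` and `mc*/ue_count` under `/sys/devices/system/edac/mc`); the function name and the `root` parameter are illustrative only, and the function returns an empty result on systems without an EDAC driver loaded.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} from EDAC sysfs counters.

    The default path is the usual Linux location; pass another root to
    point the function at a different tree.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers with missing or unreadable counters
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```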
User-space (How do we do it?)
● Mcelog
○ Generally considered obsolete.
○ x86 only.
○ Reads data from /dev/mcelog.
● Rasdaemon
○ More active.
○ Can be updated to handle various platforms.
○ Reads data from Kernel tracepoints.
○ Can effectively obsolete EDAC modules for error decoding.
ACPI (How do we do it?)
● We’ll get into this next...
ACPI APEI BERT
● Scenario: Record errors in an emergency (OS crash/reset)
● BERT: Boot Error Record Table
● Mechanism: Report unhandled errors that occurred in a previous boot.
○ WHERE are the error records
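The "WHERE" question for BERT is answered by two fields that follow the standard 36-byte ACPI table header: the Boot Error Region length and its physical address. The sketch below parses a raw BERT image under that assumed layout and verifies the ACPI checksum. The function name is illustrative; on Linux the raw table can typically be read as root from `/sys/firmware/acpi/tables/BERT` when the platform provides one.

```python
import struct

# Standard ACPI table header: signature, length, revision, checksum,
# OEM ID, OEM table ID, OEM revision, creator ID, creator revision.
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")  # 36 bytes
# BERT body: Boot Error Region Length, Boot Error Region (physical address).
BERT_BODY = struct.Struct("<IQ")               # 12 bytes


def parse_bert(table: bytes) -> dict:
    """Parse a raw BERT image and return where the boot error region lives."""
    if len(table) < ACPI_HEADER.size + BERT_BODY.size:
        raise ValueError("table too short to be a BERT")
    fields = ACPI_HEADER.unpack_from(table, 0)
    signature, length = fields[0], fields[1]
    if signature != b"BERT":
        raise ValueError(f"unexpected signature {signature!r}")
    if length < ACPI_HEADER.size + BERT_BODY.size or length > len(table):
        raise ValueError("header length does not match the data")
    # All bytes of an ACPI table, checksum included, must sum to 0 mod 256.
    if sum(table[:length]) & 0xFF != 0:
        raise ValueError("bad ACPI checksum")
    region_length, region_address = BERT_BODY.unpack_from(table, ACPI_HEADER.size)
    return {
        "oem_id": fields[4].decode("ascii", "replace").strip(),
        "region_length": region_length,
        "region_address": region_address,
    }
```

The region itself holds the error records written during the previous boot; reading that physical memory is a kernel or firmware job and is not attempted here.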
UEFI spec CPER
ACPI APEI BERT
ACPI APEI HEST
● Scenario: Record errors at runtime (OS can still work)
● HEST: Hardware Error Source Table
● Mechanism: A standardized way for platforms to describe their error
sources, using Error Source Structures:
○ HOW to inform
○ WHERE are the error records
○ WHEN records can be freed
ACPI APEI HEST
● Error Source Structure:
○ For IA-32: MCE/CMC/NMI
○ For PCI: AER Root Port/Endpoint/Bridge
○ Generic Hardware: GHES v1/v2
● For ARM64: GHES v2
○ HOW to inform: Notification Structure
○ WHERE are the error records: Error Status Address (GAS: Generic Address Structure)
○ WHEN records can be freed: Read Ack Register
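The WHERE and WHEN parts of GHES v2 can be sketched in a few lines. The example assumes the ACPI layouts of the 12-byte Generic Address Structure and the 20-byte Generic Error Status Block header, and models the acknowledgement as a read-modify-write using the Read Ack Preserve and Read Ack Write masks from the GHES v2 entry. All function names are illustrative, and no real hardware register is touched.

```python
import struct

GAS = struct.Struct("<BBBBQ")           # space id, bit width, bit offset, access size, address
STATUS_HEADER = struct.Struct("<IIIII")  # block status, raw offset, raw length, data length, severity

SEVERITY = {0: "recoverable", 1: "fatal", 2: "corrected", 3: "none"}


def parse_gas(buf: bytes, offset: int = 0) -> dict:
    """Decode a Generic Address Structure (the WHERE of an error source)."""
    space_id, bit_width, bit_offset, access_size, address = GAS.unpack_from(buf, offset)
    return {
        "space_id": space_id,  # 0 means system memory
        "bit_width": bit_width,
        "bit_offset": bit_offset,
        "access_size": access_size,
        "address": address,
    }


def parse_status_header(buf: bytes) -> dict:
    """Decode the fixed header of a Generic Error Status Block."""
    block_status, raw_off, raw_len, data_len, severity = STATUS_HEADER.unpack_from(buf, 0)
    return {
        "block_status": block_status,
        "raw_data_offset": raw_off,
        "raw_data_length": raw_len,
        "data_length": data_len,
        "severity": SEVERITY.get(severity, "unknown"),
    }


def ack_value(current: int, preserve_mask: int, write_mask: int) -> int:
    """Compute the value to write to the Read Ack Register.

    Bits set in preserve_mask keep their current value; bits set in
    write_mask are set. Writing the result tells firmware the record
    has been consumed and its buffer may be reused (the WHEN).
    """
    return (current & preserve_mask) | write_mask
```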
ACPI APEI HEST
ACPI APEI ERST
● Scenario: Record and retrieve errors in persistent storage
● ERST: Error Record Serialization Table
● Mechanism: Abstracted operations; provides the details necessary to
communicate with on-board persistent storage
● Plan B: Use the UEFI runtime variable services to carry out error record
persistence operations
ACPI APEI EINJ
● Scenario: Test the OSPM error handling stack
● EINJ: Error Injection Table
● Mechanism: Abstracted operations; provides a generic interface through
which OSPM can inject hardware errors into the platform without requiring
platform-specific software.
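On Linux, the EINJ table is usually driven from user space through a debugfs directory. The sketch below assumes that interface (the files `available_error_type`, `error_type`, `param1`, `param2` and `error_inject` under `/sys/kernel/debug/apei/einj`), which is not described in these slides; the function name and its parameters are illustrative. Running it for real requires root, the EINJ driver, and firmware that supports injection.

```python
from pathlib import Path


def inject_error(error_type, einj_dir="/sys/kernel/debug/apei/einj",
                 address=None, mask=None):
    """Request one error injection through the (assumed) EINJ debugfs files.

    error_type is the numeric type, e.g. 0x8 for a correctable memory error.
    address and mask are optional and map to param1 and param2.
    """
    base = Path(einj_dir)
    # Each line looks like "0x00000008<TAB>Memory Correctable".
    available = set()
    for line in (base / "available_error_type").read_text().splitlines():
        if line.strip():
            available.add(int(line.split()[0], 16))
    if error_type not in available:
        raise ValueError(f"error type {error_type:#x} not offered by the platform")

    (base / "error_type").write_text(f"{error_type:#x}\n")
    if address is not None:
        (base / "param1").write_text(f"{address:#x}\n")
    if mask is not None:
        (base / "param2").write_text(f"{mask:#x}\n")
    # Any write to error_inject triggers the injection.
    (base / "error_inject").write_text("1\n")
```

Checking `available_error_type` first keeps the sketch from requesting an error class the platform never advertised.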
RAS on ARM64
● Architectural support for RAS is not available, but it is not needed.
● In other words, there is no need to follow the same historical path as other
architectures.
● Focus should be on Platform Firmware handling of errors.
● Reporting should be through standard methods like ACPI.
● Will possibly need to implement kernel-relevant error handling based on
information received from Platform Firmware.
Current Work
● Add support for ACPI RAS features.
● Testing Platform Firmware to OS interface.
● No platform-specific RAS feature testing.
● Using modified QEMU for testing.
Future Work
● Finish ACPI implementation.
● Investigate kernel handling of poisoned pages and processes.
● Investigate I/O-related error handling in the Kernel.
Demo
Thank You
#LAS16
For further information: www.linaro.org
LAS16 keynotes and videos on: connect.linaro.org