LAS16-200: Firmware Summit - RAS: What is it? Why do we need it?

RAS: What is it? Why do we need it?

Harb Abdulhamid (Qualcomm)

Fu Wei (Red Hat)

Yazen Ghannam (AMD)


What is it?

● Reliability

○ Computation needs to be correct and reliable.

○ Failures and errors need to be detected and reported.

○ Computation needs to fail when an error is not handled.

● Availability

○ System needs to remain available as long as possible.

○ Errors should be corrected and failures handled so that operation can continue.

● Serviceability

○ System should provide information to the administrator to aid in system servicing.

○ Service time needs to be minimized to maximize uptime.
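
The link between service time and uptime can be made concrete with the usual steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is only an illustration; the function names and the example figures are assumptions, not numbers from the talk.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + downtime)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: same failure rate, two different service times.
    for mttr in (8.0, 1.0):
        a = availability(10_000.0, mttr)
        print(f"MTTR {mttr:4.1f} h: availability {a:.6f}, "
              f"downtime {downtime_hours_per_year(a):.2f} h/year")
```

With the failure rate held constant, shortening the repair time is the only term that moves, which is the serviceability argument in numeric form.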


Why do we need it?

● Increase in system uptime (productivity)

● Less time spent debugging bad or failing hardware (productivity/cost)

● Fewer hardware replacement calls (cost/mindshare)


Hardware Architecture (How do we do it?)

● x86: Machine Check Exceptions (MCE) & Machine Check Architecture (MCA)

○ Architectural features/extensions.

○ Defines a register set that can be used for multiple devices (IMPORTANT!).

○ Poll for correctable errors.

○ APIC LVT or SMI interrupts for correctable thresholding and deferred errors.

○ MCE for uncorrectable errors.

● PCI-E: Advanced Error Reporting (AER)

○ Similar concepts to MCE/MCA.

● Implementation-specific features

○ ECC in memory controllers

○ ECC in I/O RAMs

○ Poison/bad data markers

○ Flooding I/O links (e.g. Sync Flood)
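
To illustrate what a shared register set buys, here is a minimal sketch that decodes the architecturally defined upper bits of an x86 MCi_STATUS value. The bit positions (VAL, OVER, UC, EN, MISCV, ADDRV, PCC, and the two 16-bit error code fields) follow the architectural layout; the function names and the sample value are hypothetical, and vendor-specific bits are deliberately left out.

```python
from typing import Dict

# Architectural MCi_STATUS bits shared across banks.
MCI_STATUS_VAL = 1 << 63    # register contains a valid error
MCI_STATUS_OVER = 1 << 62   # an earlier error was overwritten
MCI_STATUS_UC = 1 << 61     # error was not corrected by hardware
MCI_STATUS_EN = 1 << 60     # error reporting was enabled
MCI_STATUS_MISCV = 1 << 59  # MCi_MISC holds valid data
MCI_STATUS_ADDRV = 1 << 58  # MCi_ADDR holds a valid address
MCI_STATUS_PCC = 1 << 57    # processor context may be corrupt


def decode_mci_status(status: int) -> Dict[str, object]:
    """Split a 64-bit MCi_STATUS value into its common fields."""
    return {
        "valid": bool(status & MCI_STATUS_VAL),
        "overflow": bool(status & MCI_STATUS_OVER),
        "uncorrected": bool(status & MCI_STATUS_UC),
        "enabled": bool(status & MCI_STATUS_EN),
        "misc_valid": bool(status & MCI_STATUS_MISCV),
        "addr_valid": bool(status & MCI_STATUS_ADDRV),
        "context_corrupt": bool(status & MCI_STATUS_PCC),
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


def classify(status: int) -> str:
    """Coarse classification used only for this illustration."""
    fields = decode_mci_status(status)
    if not fields["valid"]:
        return "no error logged"
    return "uncorrected" if fields["uncorrected"] else "corrected"


if __name__ == "__main__":
    sample = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 0x0135
    print(classify(sample), decode_mci_status(sample))
```

Because every bank uses the same layout, one decoder like this can serve caches, memory controllers, and other devices that report through MCA.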


Platform Firmware (How do we do it?)

● Platform Firmware has intimate knowledge of the system and can handle RAS features not available through standardized mechanisms.

● Privileged code runs on the main cores or a separate microcontroller.

● Can mask registers from OS view and handle interrupts.

● Handling can be done without the OS’s knowledge, and information can be exposed to the OS if desired.

● Preferably, will use a standard mechanism, like ACPI, to inform the OS of errors.

● Can directly inform the sysadmin of errors using sideband communications like a baseboard management controller (BMC).

● Can pinpoint bad hardware for easy replacement.


Kernel (How do we do it?)

● Error Detect and Correct (EDAC) for system-specific handling and decoding.

● ISA-specific handling in /arch.

● Drivers for PCI-E AER and ACPI.

● Ideally, most RAS code in the Kernel would be obsoleted by Platform Firmware handling of errors.

● Kernel could then be only responsible for reporting errors received through standard mechanisms (e.g. ACPI).

● Kernel could also perform error handling relevant at the kernel-level (e.g. killing processes or retiring bad/poisoned pages).
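
As a small illustration of what the kernel already reports today, the sketch below reads the per-memory-controller corrected and uncorrected error counters that Linux EDAC drivers expose in sysfs (`ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mcN/`). The function name and the returned dictionary shape are assumptions made for this example.

```python
from pathlib import Path
from typing import Dict


def read_edac_counts(
    edac_root: str = "/sys/devices/system/edac/mc",
) -> Dict[str, Dict[str, int]]:
    """Return {"mc0": {"ce": n, "ue": m}, ...} for each memory controller.

    Controllers whose counters cannot be read are skipped; an empty
    dict means no EDAC driver is loaded (or the path does not exist).
    """
    counts: Dict[str, Dict[str, int]] = {}
    for mc_dir in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            ce = int((mc_dir / "ce_count").read_text().strip())
            ue = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue
        counts[mc_dir.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

The root path is a parameter so the function can be pointed at a test directory on machines without an EDAC driver.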


User-space (How do we do it?)

● mcelog

○ Generally considered obsolete.

○ x86 only.

○ Reads data from /dev/mcelog.

● rasdaemon

○ More active.

○ Can be updated to handle various platforms.

○ Reads data from Kernel tracepoints.

○ Can effectively obsolete EDAC modules for error decoding.


ACPI (How do we do it?)

● We’ll get into this next...


ACPI APEI BERT

● Scenario: record errors in an emergency (OS crash/reset)

● BERT: Boot Error Record Table

● Mechanism: report unhandled errors that occurred in a previous boot.

○ WHERE are the error records
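
The "WHERE" in BERT comes down to two fields that follow the standard 36-byte ACPI table header: a 4-byte Boot Error Region Length and an 8-byte Boot Error Region physical address. A minimal parsing sketch follows; the `Bert` tuple and `parse_bert` name are hypothetical helpers for this example, not an existing API.

```python
import struct
from typing import NamedTuple

# Standard ACPI table header: signature, length, revision, checksum,
# OEM ID, OEM table ID, OEM revision, creator ID, creator revision.
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")  # 36 bytes
BERT_BODY = struct.Struct("<IQ")               # region length, region address


class Bert(NamedTuple):
    table_length: int
    region_length: int   # size of the Boot Error Region in bytes
    region_address: int  # physical address of the Boot Error Region


def parse_bert(table: bytes) -> Bert:
    """Validate a raw BERT table and extract the error region location."""
    if len(table) < ACPI_HEADER.size + BERT_BODY.size:
        raise ValueError("table too short for a BERT")
    signature, length = ACPI_HEADER.unpack_from(table, 0)[:2]
    if signature != b"BERT":
        raise ValueError(f"unexpected signature {signature!r}")
    if length > len(table):
        raise ValueError("header length exceeds supplied data")
    # All bytes of an ACPI table must sum to zero modulo 256.
    if sum(table[:length]) & 0xFF:
        raise ValueError("bad checksum")
    region_length, region_address = BERT_BODY.unpack_from(table, ACPI_HEADER.size)
    return Bert(length, region_length, region_address)
```

On Linux the raw table is typically readable by root at `/sys/firmware/acpi/tables/BERT` when the platform provides one; the region it points to holds the error status block left over from the previous boot.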


UEFI spec CPER




ACPI APEI HEST

● Scenario: record errors at runtime (OS can still work)

● HEST: Hardware Error Source Table

● Mechanism: describes a standardized mechanism platforms may use to describe their error sources by Error Source Structure:

○ HOW to inform

○ WHERE are the error records

○ WHEN records can be freed


ACPI APEI HEST

● Error Source Structure:

○ For IA-32: MCE/CMC/NMI

○ For PCI: AER Root Port/Endpoint/Bridge

○ Generic Hardware: GHES v1/v2

● For ARM64: GHES v2

○ HOW to inform: Notification Structure

○ WHERE are the error records: Error Status Address (GAS: Generic Address Structure)

○ WHEN records can be freed: Read Ack Register
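
The WHERE and WHEN parts of GHES v2 can be sketched in a few lines: the Error Status Address and the Read Ack Register are both 12-byte Generic Address Structures, and the acknowledgement is a read-modify-write that keeps the bits selected by the Read Ack Preserve mask and sets the bits in the Read Ack Write mask. The names below are hypothetical, and actual register access is left out; this only computes the value the OS would write back.

```python
import struct
from typing import NamedTuple

# Generic Address Structure: space ID, bit width, bit offset,
# access size, 64-bit address.
GAS = struct.Struct("<BBBBQ")  # 12 bytes

U64_MASK = 0xFFFF_FFFF_FFFF_FFFF


class GenericAddress(NamedTuple):
    space_id: int     # 0 = system memory, 1 = system I/O, ...
    bit_width: int
    bit_offset: int
    access_size: int  # 1 = byte, 2 = word, 3 = dword, 4 = qword
    address: int


def parse_gas(raw: bytes, offset: int = 0) -> GenericAddress:
    """Decode one Generic Address Structure from raw table bytes."""
    return GenericAddress(*GAS.unpack_from(raw, offset))


def read_ack_value(current: int, ack_reg: GenericAddress,
                   preserve_mask: int, write_mask: int) -> int:
    """Value to write to the Read Ack Register after consuming a record.

    Bits covered by preserve_mask keep their current value; bits in
    write_mask are set. Both masks are shifted by the register's
    bit offset.
    """
    shift = ack_reg.bit_offset
    value = current & (preserve_mask << shift)
    value |= write_mask << shift
    return value & U64_MASK
```

Once this value is written, firmware knows the error status block has been consumed and may reuse it for the next record.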




ACPI APEI ERST

● Scenario: record and retrieve errors in persistent storage

● ERST: Error Record Serialization Table

● Mechanism: an operation abstraction that provides the details necessary to communicate with on-board persistent storage

● Plan B: use the UEFI runtime variable services to carry out error record persistence operations


ACPI APEI EINJ

● Scenario: test the OSPM error handling stack

● EINJ: Error Injection Table

● Mechanism: an operation abstraction that provides a generic interface through which OSPM can inject hardware errors into the platform without requiring platform-specific software.
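
On Linux, the EINJ table is driven through debugfs files under `/sys/kernel/debug/apei/einj/` (`available_error_type`, `error_type`, `param1`, `param2`, `error_inject`). The sketch below wraps that sequence; the function names, the default mask, and the example address are assumptions for illustration, and real use needs root, the `einj` module, and firmware that supports the requested error type.

```python
from pathlib import Path
from typing import Set

EINJ_ROOT = "/sys/kernel/debug/apei/einj"
MEM_CORRECTABLE = 0x8  # EINJ error type: Memory Correctable


def supported_error_types(einj_root: str = EINJ_ROOT) -> Set[int]:
    """Parse available_error_type; each line starts with a hex type value."""
    text = (Path(einj_root) / "available_error_type").read_text()
    return {int(line.split()[0], 16) for line in text.splitlines() if line.strip()}


def inject_error(error_type: int, address: int,
                 mask: int = 0xFFFFFFFFFFFFF000,
                 einj_root: str = EINJ_ROOT) -> None:
    """Request injection of error_type at a physical address.

    param1 carries the address and param2 the address mask, as used
    for memory error types.
    """
    if error_type not in supported_error_types(einj_root):
        raise ValueError(f"error type {error_type:#x} not offered by the platform")
    root = Path(einj_root)
    (root / "error_type").write_text(f"{error_type:#x}\n")
    (root / "param1").write_text(f"{address:#x}\n")
    (root / "param2").write_text(f"{mask:#x}\n")
    # Writing any value here triggers the injection.
    (root / "error_inject").write_text("1\n")
```

A call such as `inject_error(MEM_CORRECTABLE, 0x12345000)` would then exercise the reporting path from firmware through HEST/GHES to the kernel and user space; the address here is a placeholder and must be valid RAM on the machine under test.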


RAS on ARM64

● Architectural support for RAS is not available, but it is not needed.

● In other words, no need to follow the same historical path as other architectures.

● Focus should be on Platform Firmware handling of errors.

● Reporting should be through standard methods like ACPI.

● Will possibly need to implement kernel-relevant error handling based on information received from Platform Firmware.


Current Work

● Add support for ACPI RAS features.

● Testing Platform Firmware to OS interface.

● No platform-specific RAS feature testing.

● Using modified QEMU for testing.


Future Work

● Finish ACPI implementation.

● Investigate kernel handling of poisoned pages and processes.

● Investigate I/O-related error handling in the Kernel.


Demo

Thank You

#LAS16

For further information: www.linaro.org

LAS16 keynotes and videos on: connect.linaro.org
