Task 2410.001
CLASH - Cross-Layer Accelerated Self-Healing: Circadian Rhythms for Resilient Electronic Systems
Xinfei Guo, Alec Roelke, Mircea R. Stan
ECE Department, University of Virginia, Charlottesville, VA
2016 SLD Review

Overview

Wearout Issues
BTI, HCI, TDDB, EM, etc.
Increase design margin and worsen metrics
Cross-layer issues
Both reversible and irreversible parts

Previous Solutions
Tolerate - Design for the worst case
Compensate - Dynamically adapt to wearout
Slow - Reduce the stress during operation
Passive Recovery

This Project
Repair wearout completely
Accelerated & Active Recovery
Circadian Rhythms for FULL recovery
Cross-Layer Implementations

Wearout Accelerated Self-Healing [DAC ’14]

Main Idea
Sleep → Proactive Recovery
Some of the effects of wearout (e.g. BTI and EM) can be reversed by applying techniques such as high temperature, negative voltage, UV light, or reverse current. These techniques enable effective accelerated self-healing.

Test Setup
Commercial 40nm FPGA chips
Accelerated Testing Methodology
Knobs: V, T, AC/DC, Sleep/Active
[Figure: test setup. A chip inside a thermal chamber, connected to an interface board for data sampling. On chip: a 75-LUT Circuit Under Test (CUT) with En and rst inputs, and a 16-b counter driven by fref with output Cout.]

Results
[Figure: measured frequency (MHz, axis from 24.5 to 26) during accelerated stress for 48 hours followed by accelerated & active recovery for 12 hours; 72.4% recovered.]
An example where about 72.4% of wearout is recovered by accelerated self-healing techniques in only ¼ of the stress time (measured).
[Figure: accelerated self-healing space exploration (model). Recovery time after 12-hour constant stress under regular operating conditions (no accelerated stress). Values shown: 23.2 min, 57.8 min, 344.3 min, 104.5 min.]

“Sleep When Getting Tired” [ASPDAC ’16]

Main Idea
The boundary between reversible & irreversible wearout is “soft”
Irreversible wearout can be recovered through acceleration
Frequency dependency of accelerated wearout & recovery
“Sleep when getting tired” to FULLY avoid the irreversible wearout
Negative “turbo” boost at the system level

Results
Recovery under different “circadian rhythms”
Reduction of design margin: >60X
Average performance improvement: close to the fresh state through the whole lifetime
[Figure: illustration of “Negative Turbo Boost”. Frequency versus time (hours, days, years, end of life), alternating Active and Sleep periods. Labels: Fresh, “Turbo-boost”, Negative, Design Margin, Average Frequency, Wearout, No recovery.]

Circuit-level implementations

Negative Voltages
A charge-pump negative voltage generator
[Figure: charge-pump schematic. Vdd, clocks clk1 and clk2, capacitors C1 and C2, charging and charge-redistribution phases, output Vout.]
[Figure: simulated output. Labels: -300.6mV, 638ns, 4.36mV, 4.33mV. Ripple: 1.45%. Clock frequency = 66.7MHz.]
[Figure: sleep-mode negative voltage applied to logic blocks. Labels: vdd, vdd_high, Sleep, Sleep En, Boosted Vdd Generator, Negative Voltage, Power Gating Block.]

High Temperatures
Wearout-aware Power Gating
[Figure: eight cores (Core 1 to Core 8) with a shared L3 cache; sleeping cores (“Zzzzzz...”) receive heat.]
[Figure: heating element next to core logic and its critical path. Labels: 2X, 1X, Accelerated Recovery, Heating Element, Core Logic, output.]

Embeddable Wearout Sensors [SELSE ’15]
Metastable element based
Track both wearout and (accelerated) recovery
Track multiple paths simultaneously
Be aware of wearout-induced path reranking
Used together with multiple dynamic management
Embedded in an ASIC design flow
[Figure: sensor timing with track, poll1, poll2, poll3, discharge and ref phases for path0 to path3; percentages shown: 10%, 5%, 2% for path0 and 5% for path1, path2 and path3; outputs outa and outb.]
[Figure: old scan cell (FF with D, Scan_in, SE, clk, rst, Q and a MUX) versus new scan cell, which adds a SENSOR and a sel-controlled MUX. Closed-loop: to adaptive solutions or accelerated active recovery. Open-loop: to user.]

A CLASH System [JVLSI]
Utilize core redundancy in the dark silicon era
Optimal scheduling/load balancing
Introduce Accelerated & Active Recovery as a new design knob for cross-layer resilience
Tradeoff between lifetime, power, performance

Core-level implementations
Active blocks healing inactive elements (e.g. Dark Silicon & Redundant Resources)
[Figure: a potential implementation of Cross-layer Accelerated Self-Healing in a NoC system. Labels: core sensors, router, Scheduler & Active Recovery Blocks, sensor outputs from cores, proactive accelerated recovery applied to sleep cores, active cores, heat for accelerated recovery, -V, Active Recovery EN.]
[Figure: Cross-layer Accelerated Self-Healing. Layers: Circuit, Architecture, System, Applications. Labels: Wearout Sensors, Heating Elements, Negative Voltage Generator, Accelerated Recovery EN, Dark Silicon, Redundant Resources, Program Counters (+1), Virtual Sensors, Load Balancer, Proactive Scheduler, Core Allocation.]
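The headline numbers on the poster can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and rests on two assumptions that the poster does not state: that ripple is defined as the ripple amplitude divided by the magnitude of the output voltage, and that "recovered wearout" is the regained share of the frequency lost during stress. The function names and the three example frequencies are hypothetical; only 4.36 mV, -300.6 mV, 48 hours, and 12 hours come from the poster.

```python
def ripple_percent(ripple_mv: float, vout_mv: float) -> float:
    """Ripple as a percentage of the output magnitude (assumed definition)."""
    if vout_mv == 0:
        raise ValueError("output voltage must be non-zero")
    return 100.0 * abs(ripple_mv) / abs(vout_mv)


def recovered_fraction(f_fresh: float, f_stressed: float, f_recovered: float) -> float:
    """Share of the stress-induced frequency loss that recovery regained (assumed definition)."""
    loss = f_fresh - f_stressed
    if loss <= 0:
        raise ValueError("stressed frequency must be below fresh frequency")
    return (f_recovered - f_stressed) / loss


# Values read from the poster: 4.36 mV ripple on a -300.6 mV output.
ripple = ripple_percent(4.36, -300.6)
print(f"ripple = {ripple:.2f}%")

# Values read from the poster: 48 h of accelerated stress, 12 h of recovery.
time_ratio = 12 / 48
print(f"recovery time / stress time = {time_ratio}")

# Hypothetical frequencies in MHz, chosen only to illustrate the formula;
# they are NOT measurements from the poster.
example = recovered_fraction(f_fresh=26.0, f_stressed=25.0, f_recovered=25.724)
print(f"recovered fraction (hypothetical inputs) = {example:.3f}")
```

Under the assumed ripple definition, the 4.36 mV label is the one that reproduces the reported 1.45%, and the 12-hour recovery period is one quarter of the 48-hour stress period.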