QED: Effective Post-Silicon Validation and Debug
Eshan Singh, David Lin. PI: Subhasish Mitra. Robust Systems Group, Stanford University.

Highlights
QED stands for Quick Error Detection.
Structured and effective: 10^9X quicker detection, 4X coverage.
Automatically localizes logic bugs: no failure reproduction, no simulation.
Broadly applicable: cores, uncore, power management, logic and electrical bugs, accelerators.

Post-Silicon Validation Critical
Pre-silicon verification is inadequate.
“Post-silicon cost & complexity rising faster than design cost” (S. Yerramilli, V.P., Intel)
Flow: Design, Pre-silicon Verification, Post-silicon Validation, High Volume Fab.
Figure: post-silicon bug count by year. Source: Intel.

Localization Dominates Cost
Detect bugs: run tests (OS, games).
Localize bugs: debug time of 1-4 weeks per bug.
Root-cause and fix.

Long Error Detection Latency Challenge
Localization timeline: test execution, error occurred, error detected. The error detection latency is the time from the error occurring to the error being detected.
Ideal: ~1,000 cycles. Reality: ~billions of cycles.

Quick Error Detection (QED)
QED transforms the original tests (Test 1, Test 2, ..., Test N) into a family of QED tests (QED Test 1, QED Test 2, ..., QED Test N).
Wide variety, diversity. Systematic, automated.
Error detection latency: guaranteed short. Coverage: improved.
Software and hardware approaches.
Figure: detected error count (normalized to QED) versus error detection latency (clock cycles), in bins 0-10K and 1-10 billion, for QED and No-QED. Annotations: 10^6X, 4X.

The QED family:
Software-only QED: no hardware modifications; bugs inside processor cores, bugs inside uncore components, bugs from power-management features.
Hybrid QED: non-programmable accelerators; logic bugs and electrical bugs.
Symbolic QED: automatically localizes logic bugs; no additional hardware.
Fast QED: 0.4% area overhead; very low runtimes.

Other panel titles in this part of the poster: Electrical Bugs; Intel® 48-Core SCC; Symbolic QED Results; Fast QED using Hardware Support.

Symbolic QED
Fully automated logic bug localization using Bounded Model Checking (BMC).
No trace buffers, hence no area overhead.
Effective for large SoCs.
No failure reproduction, no simulation.
Collaborator: Prof. Clark Barrett (NYU).
QED check: Ra is an original register and Ra’ is the corresponding duplicated register; Ra ≠ Ra’ means ERROR DETECTED.
Traditional debug: weeks to months; long bug traces.
Automatic S-QED: 20 mins. to 7 hours; 3- to 22-cycle bug traces.

IEEE TCAD comments (QED paper): “All reviewers agree this will be a classic paper for years to come.” “I will personally pay for page charges if you promise to thank me (anonymously) when you win a major award for this paper!”
Intel (Nagib Hakim, PE): “QED is revolutionary... Intel is in the process of implementing a prototype of QED. This would enable a whole slew of applications.”
AMD (Jeff Rearick, Senior Fellow): QED: “magical thinking needed” in ETS keynote.
Freescale (Sharad Kumar, Manager): “We evaluated QED & are adopting in our tools flow for multi-core debug.”

QED Transformation Examples
Figure: each core (Core 1, Core 2, ..., Core N) runs its test with <PLC mem[1..N]> blocks inserted along the way.

Duplicated instructions with checks (primed names are the duplicated registers and memory locations):

    A’=A  B’=B  C’=C
    A  = B  * 2
    A’ = B’ * 2
    Check(A==A’)

    D’=D  E’=E  F’=F  G’=G  H’=H
    E  = F  * G
    E’ = F’ * G’
    Check(E==E’)
    H  = D  + E
    H’ = D’ + E’
    Check(H==H’)

    E’=E  I’=I  J’=J  K’=K
    I  = E  / 2
    I’ = E’ / 2
    Check(I==I’)
    Load J  ← mem[7]
    Load J’ ← mem[7’]
    Check(J==J’)
    K  = J  + 1
    K’ = J’ + 1
    Check(K==K’)

Duplicated stores under locks:

    Lock(1,1’)
    Store mem[1]  ← C
    Store mem[1’] ← C’
    Unlock(1,1’)

    Lock(5,5’)
    Store mem[5]  ← H
    Store mem[5’] ← H’
    Unlock(5,5’)

Proactive Load and Check (PLC)
<PLC mem[1..N]>, run by ALL cores and ALL threads:

    for ALL i, i’:
        Lock(i)
        Lock(i’)
        Load X  ← mem[i]
        Load X’ ← mem[i’]
        Check(X == X’)
        Unlock(i’)
        Unlock(i)

Control Flow Tracking Using Software Signatures
Check inserted at the entry of the block with signature #5, whose legal predecessors have signatures #3 and #4:

    if ((last_signature == #3) or (last_signature == #4)):
        last_signature = #5
    else:
        ERROR_DETECTED!
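The signature check above can be tried out in a few lines of Python. This is only a sketch under stated assumptions, not the poster's implementation: the table `LEGAL_PREDECESSORS`, the exception `ControlFlowError`, and the helpers `enter_block` and `first_violation` are hypothetical names, and only the entry for signature 5 (predecessors 3 and 4) comes from the pseudocode; the other edges of the control-flow graph are invented for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical control-flow graph: block signature -> legal predecessor signatures.
# Only the entry for 5 ({3, 4}) is taken from the poster; the rest is assumed.
LEGAL_PREDECESSORS: Dict[int, Set[int]] = {
    1: set(),   # entry block: may only start a run
    2: {1},
    3: {2},
    4: {2},
    5: {3, 4},
}


class ControlFlowError(Exception):
    """Raised when a block is entered from an illegal predecessor."""


def enter_block(last_signature: Optional[int], signature: int) -> int:
    """Run the signature check for one block and return the new last_signature."""
    legal = LEGAL_PREDECESSORS[signature]
    if last_signature is None:
        ok = not legal                 # only the entry block may come first
    else:
        ok = last_signature in legal   # same test as the pseudocode's if-condition
    if not ok:
        raise ControlFlowError(f"illegal transfer {last_signature} -> {signature}")
    return signature


def first_violation(path: List[int]) -> Optional[int]:
    """Return the index of the first illegal block entry in path, or None."""
    last: Optional[int] = None
    for index, signature in enumerate(path):
        try:
            last = enter_block(last, signature)
        except ControlFlowError:
            return index
    return None
```

Calling `first_violation([1, 2, 3, 5])` walks a legal path, while `first_violation([1, 2, 5])` models a control-flow error that jumps from block 2 straight into block 5; the check fires at the entry of block 5, which is what keeps the detection latency short.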
Figure (CFCSS-V): control-flow graph of Block 1 to Block 5 with CFCSS-V checks attached to the blocks (<Block 5> is the block body that follows its signature check); an illegal transfer is flagged “Block 5: ERROR!”.

Freescale SoC Logic Bug
Error detection latency (cycles): Original 15 billion; QED 9.

Figure (test setup): Random Instruction Test Generator; Core 0, Core 1, Core 2, Core 3, ..., Core N on an interconnection network; shared caches, memory controllers, accelerators, other uncore components.

Figure: cumulative memory bugs detected (0% to 100%) versus error detection latency (cycles, 100 to 10 billion), QED versus original test. Annotation: 10^6X improved.

8-Core Industrial Test
EDL is the error detection latency.
QED: median EDL 392, maximum EDL 3k.
Original test: median EDL 10M, maximum EDL 100M.
Figure: cumulative bugs detected (0% to 100%) versus error detection latency (clock cycles, 100 to >100M). Annotations: 10^4X, 2X. Adjacent panel label: Power Management Bugs.

Fast QED
10^5X quicker detection, 2X coverage.
No intrusiveness. Runtime: 1.04X to 6X. MBIST reuse.
Core, uncore, power management bugs.
0.05% to 0.4% area impact.
Figure: error detection latency (cycles) and area cost (0.05% to 0.4%) versus PLC-H checkers count.

Panel label: Uncore Bugs.
Figure labels: No boot, Pass; 48 processor cores, 0.9V, 800 MHz; legend: QED unique detect, QED enhanced detect, QED quick detect.

Figure: cumulative bugs detected (0% to 100%) versus error detection latency (cycles, 100 to 10M). Annotations: 10^4X, 2X. Adjacent panel labels: Difficult Logic Bugs; QED Techniques.
Original: median EDL 241k, maximum EDL 10M.
QED: median EDL 675, maximum EDL 8k.

Hybrid QED
Accelerator validation and debug using high-level synthesis.
Collaborator: Prof. Deming Chen (UIUC).
Hybrid QED: mean EDL = 705 cycles. Original: mean EDL = 124k cycles. 10^2X improved.
Figure: coverage (percentage, 0% to 100%) versus error detection latency (cycles, 1 to 10M).

Symbolic QED Results
Bug trace length (cycles), Original: min. 722, mean 1.9M, max. 11M.
Bug trace length (cycles), Symbolic QED: min. 13, mean 20, max. 29.
Figure: cumulative bugs detected (0% to 100%) versus bug trace length (cycles, 0 to >10M). Annotations: 10^6X, 2X.

Symbolic QED flow: the BMC tool runs automatically, overnight. Inputs: (1) the “universal” property (QED check) plus the initial state; (2) partial instances plus QED modules. Output: logic bugs localized.
“Universal” Property: QED Check
What property should the BMC tool check? CMP Ra == Ra’.
QED checks are compositional, not design/implementation specific, and preserved across partial instances, unlike traditional properties.

Partial Instantiation
How to ensure the design fits in the BMC tool? Systematically instantiate only the modules needed to activate the bug; the BMC tool then finds a bug trace.
Full design: Core 0 to Core 7, L2 Bank 0 to L2 Bank 7, Memory controller 0 to Memory controller 3, I/O controllers, crossbar interconnect.
Reduce instances, keeping at least 1 core, and run each. Partial instances shown in the figure:
    Core 0, L2 Bank 0, crossbar interconnect
    Core 0, L2 Bank 0, Memory controller 0, crossbar interconnect
    Core 0, Core 1, L2 Bank 0, Memory controller 0, crossbar interconnect
Outcome labels in the figure: No Trace Found, Trace Found, Trace Found, Best Localization.
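The partial-instantiation loop (reduce instances, keep at least 1 core, run each, take the best localization) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' tool: `best_localization`, `has_core` and `make_fake_bmc` are hypothetical names, the BMC run is replaced by a stand-in that reports a placeholder trace whenever the instance contains every module of an assumed bug location, and "best" is taken here to mean the smallest instance that still yields a trace.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence, Tuple

Instance = FrozenSet[str]   # a partial instance is a set of module names
Trace = List[str]           # placeholder for a bug trace


def has_core(instance: Instance) -> bool:
    """Enforce the 'keep at least 1 core' rule."""
    return any(name.startswith("Core ") for name in instance)


def best_localization(
    candidates: Sequence[Instance],
    run_bmc: Callable[[Instance], Optional[Trace]],
) -> Optional[Tuple[Instance, Trace]]:
    """Run each candidate; return the smallest instance with a trace, if any."""
    best: Optional[Tuple[Instance, Trace]] = None
    for instance in candidates:
        if not has_core(instance):
            continue
        trace = run_bmc(instance)
        if trace is None:
            continue                      # "No Trace Found"
        if best is None or len(instance) < len(best[0]):
            best = (instance, trace)      # "Trace Found", smaller than before
    return best


def make_fake_bmc(bug_modules: Instance) -> Callable[[Instance], Optional[Trace]]:
    """Stand-in for a BMC run (assumption): a trace exists only if all
    modules needed to activate the bug are instantiated."""
    def run_bmc(instance: Instance) -> Optional[Trace]:
        return ["placeholder bug trace"] if bug_modules <= instance else None
    return run_bmc


# The three partial instances listed on the poster.
CANDIDATES: List[Instance] = [
    frozenset({"Core 0", "L2 Bank 0", "Crossbar interconnect"}),
    frozenset({"Core 0", "L2 Bank 0", "Memory controller 0", "Crossbar interconnect"}),
    frozenset({"Core 0", "Core 1", "L2 Bank 0", "Memory controller 0",
               "Crossbar interconnect"}),
]

# Assumed bug location, chosen only to exercise the loop.
ASSUMED_BUG: Instance = frozenset({"Core 0", "Memory controller 0"})

result = best_localization(CANDIDATES, make_fake_bmc(ASSUMED_BUG))
```

With the assumed bug location, the first instance yields no trace and the other two do, so `result` holds the second instance. A real flow would replace `make_fake_bmc` with an actual BMC invocation on the instantiated modules.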