Programming Tools for Embedded Multicore Jakob Engblom Technical Marketing Manager – Simics Wind River [email protected] | http://blogs.windriver.com/engblom/
Disclaimer
These are my personal views on multicore and embedded
Nothing in this presentation should be interpreted as indicating plans (or the lack of plans) for Wind River products or product features.
Wind River – Programming Embedded Multicore – ICES Seminar, 2010-11-24
Embedded Multicore
Some Advantages
Software Dominates Development
(Chart: number of gates per chip vs. lines of code per chip, 1960 to 2020, log scale from 1 to 10^12. Gates/chip: 2x every 18 months. SW/chip: 2x every 10 months. SW productivity: 2x every 5 years. The result is a software-dominated systems industry.)
Embedded Multicore Advantage
When it comes to multicore, the embedded tools field has certain advantages:
– Embedded debug tools tend to be better at dealing with timing errors and at debugging low-level code
– Operating-system–application interfaces have better debug support
– Hardware-supported debug goes far beyond what desktops and servers can do
– External debug tools are OS-aware
Debuggers and tools are starting to catch up, adding awareness of cores, systems, threads, domains, …
– But it gets pretty complex pretty quickly…
Multiple Context Debugging
Multiple targets:
• One Wind River Workbench instance
• Target manager
• Multiple simultaneous connections, including shared connections
• Multiple OS types supported simultaneously
• Multiple target processors supported simultaneously

Multiple contexts:
• Core, process, or thread
• Each context has its own set of views: source, stack, registers

Processes/threads:
• Qualify breakpoints on a process or a specific thread
• Stop the entire process or an individual thread

Target boards may be any mix of physical, logical, or virtual boards, and any mix of uniprocessors and multicore running SMP or UP, with Hypervisor, VxWorks, Wind River Linux, or bare-metal software.

(Diagram: a host system running Workbench, connected to multiple target systems including function processors and control processors.)
Hardware Trace: On-Chip Trace
– An added feature of the hardware; it costs some chip area, and some designers – and some customers – do not consider it worth the cost
– Mostly for processors and their buses
– Being added for other parts of the system as they become more important
– Performance counters are common in complex devices today
– Interface bandwidth limitations can put a limit on effectiveness
(Diagram: a board with flash, DDR RAM, and an SoC containing two CPUs with L1 caches, a shared L2 cache, a memory interface, and peripherals such as Ethernet, PIC, timer, and serial. Trace points (T) sit on the cores, caches, and memory interface; performance counters (P) sit in the devices.)
Hardware Triggering: Cross-Triggering
– Coordination across the chip
– Cause an action in one place based on events occurring elsewhere in the system: stop execution, start tracing, stop tracing, interrupt, ...
– Requires logic on the chip; basically, it is a little on-chip programmable supervisor processor
Conclusion: wise users buy hardware with good debug support
(Diagram: the same board and SoC, with cross-trigger blocks (B) attached to the cores, caches, memory interface, and peripherals.)
Trace, Trace, Trace
There seems to be a growing consensus that trace is a key tool for debugging large-scale multicore software:
– Software stacks are adding tracing as a feature
– Hardware supports extracting traces from software
– Hardware actually traces its own operation
– Simulators have hooks for getting data and key events out
Trace is the only way to get an overview of the system. Trace long runs…
– Trace processing and automated analysis of the data stream is a key technology for the future; manual inspection does not suffice
… and drop back to a debugger around a problem.
http://jakob.engbloms.se/archives/1251
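Automated trace processing can be as simple as scanning the record stream for anomalies instead of reading it by eye. A minimal sketch, assuming a hypothetical trace format of (timestamp, core, event) tuples – the format, names, and threshold are illustrative, not any particular tool's:

```python
# Minimal sketch of automated trace analysis: scan a stream of
# (timestamp, core, event) records and flag unusually long idle
# gaps per core. The record format is hypothetical.

def find_gaps(trace, threshold):
    """Yield (core, gap_start, gap_end) for per-core gaps longer than threshold."""
    last_seen = {}
    for timestamp, core, event in trace:
        if core in last_seen and timestamp - last_seen[core] > threshold:
            yield core, last_seen[core], timestamp
        last_seen[core] = timestamp

trace = [
    (0, 0, "run"), (10, 1, "run"),
    (20, 0, "run"), (500, 1, "run"),   # core 1 silent for 490 cycles
    (30, 0, "run"),
]
gaps = list(find_gaps(sorted(trace), threshold=100))
print(gaps)  # [(1, 10, 500)]
```

A real analysis pass would run filters like this over millions of records, then hand the flagged time window to a debugger.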
Overhead vs Efficiency
Common complaint about debug hooks in hardware and software: it costs too much power / performance / throughput / chip area / money / …
Cary Millsap, "Thinking Clearly About Performance, Part 2", Communications of the ACM, Oct 2010. http://mags.acm.org/communications/201010#pg40
Embedded Multicore
Software Architecture and Hypervisors
More Than Just SMP
(Diagram: four primary configurations. "Traditional": a single core running one OS. SMP: one OS spanning core 1 and core 2. AMP: one OS per core. Core virtualization: multiple OSes sharing one core on top of a hypervisor.)
OS: could be VxWorks, Wind River Linux, or another executive or OS.
Combinations of these primary configurations can be used to create more advanced configurations.
Example: Consolidation with Hypervisor
(Diagram: three separate units – Unit 1, a single-core board running OS 1 and App 1; Unit 2, a single-core board running a bare-metal application; Unit 3, a multicore board running OS 3 and App 3 – consolidated into one unit on multicore hardware under the Wind River Hypervisor.)
Single-core apps keep running as single-core, avoiding the risk of breakage due to true concurrency.
The hypervisor provides isolation between guests; the virtual boards keep running as-is.
Single hardware = easier to manage, reduced manufacturing cost, and more units fit in the same space. Most of the multicore gain with very limited pain!
Example: Back to Basics
(Diagram: multicore hardware under the Wind River Hypervisor, with a control-plane OS doing management and control on one core, and a network-stack WRE on each of the remaining cores.)
WRE – Wind River Executive. There is a clear trend toward sub-RTOS "executives" that provide very high performance for applications with no need for a full OS, typically one per core.
The hypervisor can simplify the coordination between OS instances and provide a simpler programming interface for a WRE.
Simics and Multicore
Debug
Wind River Simics: Full Simulation of Any Electronic System
An adaptive virtual platform that enables customers to define, develop, and deploy electronic systems more efficiently.
Markets: Aerospace and Defense; Industrial and Medical; Mobile and Consumer; Network Equipment; Automotive
System-Level Features
• Checkpoint and restore
• Multicore, processor, board
• Real-world connections
• Repeatable fault injection on any system component
• Scripting
• Mixed endianness, word sizes, heterogeneity
con0.wait-for-string "$"
con0.record-start
con0.input "./ptest.elf 5\n"
con0.wait-for-string "."
$r = con0.record-stop
if ($r == "fail.") {
    echo "test failed"
}
Full-System Insight
Simics: The Hypervisor Is Just Another Software Stack
(Diagram: Simics runs on a 32/64-bit PC host under Linux or Windows, simulating the multicore hardware; the Wind River Hypervisor and its guests – OS 1 with App 1, a bare-metal application, and OS 3 with App 3 – run on top, unmodified.)
Simics Debugging Features
• Synchronous stop for the entire system
• Determinism and repeatability
• Reverse execution
• Unlimited and powerful breakpoints
• Trace anything
• Insight into all devices

break -x 0x0000->0x1F00
break-io uart0
break-exception int13
Repeatability and Reverse Debugging
Repeat any run trivially
– No need to rerun and hope for the bug to reoccur
Stop and go back in time
– No rerunning the program from the start
– Breakpoints and watchpoints backward in time
– Investigate exactly what happened this time
This control and reliable repeatability is very powerful for parallel code.
(Diagram: on physical hardware, only some runs reproduce an error – discover bug; rerun, bug doesn't show up; rerun, bug doesn't show up; rerun, different bug; rerun, initial bug occurs. On virtual hardware, debugging is much easier – discover bug; reverse execute and find the source of the bug.)
http://blogs.windriver.com/engblom/2010/09/deterministic-but-unpredictable.html
Transporting Bugs
A developer D creates a piece of software and passes it on for testing and use.
The software user finds a bug and needs to report it to the developer; this makes him or her the reporter R.
The developer and the reporter both use a virtual platform to run the software.
The reporter uses virtual-platform checkpointing to pass the bug to the developer. This ensures perfect replication, and that the complete target state is communicated.
(Diagram legend: virtual-platform checkpoint; software package, load, or configuration; hardware configuration or reconfiguration.)
Replaying Target Stimuli
(Diagram: a run passes through checkpoints R0 "Boot...", RC "Configure...", and then R1, R2, ... Rn while running tests; many different tests can be started from the configured checkpoint. Inputs occurring after the last checkpoint was taken, but before the bug hits, are recorded. A checkpoint merge, plus the recording of the last few inputs, forms the contents of the bug report.)
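The checkpoint-plus-recording scheme works because the simulation itself is deterministic: given the same starting state and the same timed inputs, every replay is identical. A toy Python sketch of that property – the state, step function, and input log are purely illustrative:

```python
# Toy model of deterministic replay: a "target" whose next state
# depends only on its current state, the virtual time, and any
# recorded input at that time. Replaying the same (time, input)
# log from the same initial state always reproduces the same run.

def run(initial_state, input_log, step):
    """Replay recorded (time, value) inputs against a deterministic step function."""
    state = initial_state
    inputs = dict(input_log)
    for t in range(max(inputs) + 1):
        state = step(state, t, inputs.get(t))
    return state

def step(state, t, inp):
    # Deterministic toy state update; no hidden sources of nondeterminism.
    return state * 31 + t + (inp or 0)

log = [(2, 7), (5, 13)]        # the "recording of last few inputs"
first = run(1, log, step)      # checkpointed state = 1
again = run(1, log, step)
assert first == again          # identical runs, every time
```

On real hardware, interrupt timing and bus contention make the input log effectively unrepeatable, which is exactly what the virtual platform removes.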
Debug Multicore Hang: Problem
A multithreaded program, stable on the existing system. The OS changed; the hardware and the rest of the software stack did not. The program started to freeze occasionally (1 run in 20).
– The change of OS exposed a latent bug in the code
The reporter captured the bug as a checkpoint + script, and passed the checkpoint and script to the developer for analysis.
Before: MPC8641, 8 cores; Linux 2.6.23; glibc 2.5.1; Rule30_threaded.elf
After: MPC8641, 8 cores; Linux 2.6.27 (Wind River Linux 3.0); glibc 2.5.1; Rule30_threaded.elf
Debug Multicore Hang: Debug
Reproduction of the bug was trivial with the checkpoint and script.
The developer used OS awareness and source-code debugging to set breakpoints inside the target program:
– On data accesses to the shared work queue used by all threads
– Unintrusive: this does not change the behavior of the target system in any way
A custom script catches the breakpoints:
– Diagnostics: the state of the queue (read target memory, perform calculations), the queue control variable being accessed, the source line, and the thread ID
– Run for both successful and failing runs -> spotted the difference
(Setup: Simics with OS awareness, source-code debug, a custom script, and debug information for the binary program, held outside the target.)
Debug Multicore Hang: Example Diagnostic Output

[bp] Thread 918, writing variable empty with value 1.
     At rule30_packet_queue_get, line 157
     Prev. state: Done: 1 Empty: 0 Full: 0 Tail: 0 Head: 0 Elems: 0
[bp] Thread 918, writing variable full with value 0.
     At rule30_packet_queue_get, line 158
     Prev. state: Done: 1 Empty: 1 Full: 0 Tail: 0 Head: 0 Elems: 0
...
[bp] Thread 921, writing variable done with value 1.
     At rule30_packet_queue_signal_done, line 62
     Prev. state: Done: 0 Empty: 0 Full: 0 Tail: 0 Head: 98 Elems: 2

The Bug

68  // - It only wakes up one thread...
69  pthread_cond_signal (&(q->notEmpty));
70  // To be correct:
71  //pthread_cond_broadcast (&(q->notEmpty));
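This bug class – signalling a condition variable with "wake one" semantics when every waiter needs the wakeup – exists in any threading API. A minimal, illustrative Python sketch (not the original Rule30 code): with cond.notify() in place of cond.notify_all(), only one worker would wake and the others would sleep forever, just as in the C program above.

```python
import threading

# Lost-wakeup demo: several consumers wait on one condition variable
# for a "done" announcement. The fix (mirroring pthread_cond_broadcast)
# is to wake every waiter with notify_all().

cond = threading.Condition()
done = False
woken = []

def worker(ident):
    with cond:
        while not done:          # re-check predicate on every wakeup
            cond.wait()
        woken.append(ident)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

with cond:
    done = True
    # Buggy version: cond.notify() would wake only one waiter.
    cond.notify_all()            # correct: wake all waiters

for t in threads:
    t.join()
print(sorted(woken))  # [0, 1, 2] - all three workers woke up
```

pthread_cond_broadcast() in the C fix is exactly the equivalent of notify_all() here.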
Analyzer Looking at the Program
• Nice speedup with 1 to 3 worker threads
• With four worker threads, the program uses only two cores
• With five worker threads, the efficiency is horrible, and two of the worker threads are left hanging!
Simics and Multicore
Evaluating Software Scalability on Flexible Virtual Hardware
Scalable AMP Hardware
• Scales to any number of cores
• Configurable in several dimensions: clock frequency, number of cores, local and global memory sizes, access delay, and contention to global memory
(Diagram: a scalable virtual Power Architecture multicore machine – PPC440 cores, each with local memory, an interrupt controller, and a serial port, connected through an interrupt network and a global shared memory.)
Varying Memory Latency of Shared Memory
Parallel processing benchmark:
– Shared memory restricted to a single access port, with high latencies
– Testing two different transfer modes: 1 packet and 4 packets per transmission
– The scalability is quite different
(Chart: "Memory Speed Impact" – performance relative to one worker node, for 1 to 9 worker nodes, comparing perfect memory against 100-, 200-, and 500-cycle single-port memory, each with 1 and with 4 packets per transmission.)
Simics and Multicore
Speeding up development by smart tricks
OS Prototyping: Xtratum Timebase
A multicore OS needs consistent time across all cores
– The first task in development is to establish such a timebase
– On hardware, tricky timing loops are needed
With Simics, we can prototype using scripting
– Mark the time-sync point with a magic instruction
– A script triggers on the magic instruction and resets all local times to the same value
A complex but non-value-added task becomes trivial
– Shortens the time to the interesting experiments using Simics
http://www.tentech.ca/index.php/2010/09/easy-multi-core-powerpc-timebase-synchronization-with-simics/
The Code: OS and Script

static void __VBOOT synchronize_clocks(void)
{
    if (0 == GET_CPU_ID()) {
        MAGIC(4);
    }
    BarrierWait(&g_smpPartitionInitBarrier);
}

def synchronize_ppc_timebase():
    # Get the number of CPUs from system 0, using some assumptions
    num_cpus = conf.sim.cpu_info[0][1]
    # Iterate through all the cores
    for cpu_id in range(num_cpus):
        cpu = getattr(conf, "cpu%d" % cpu_id)
        # Simply reset the timebase
        cpu.tbu = 0
        cpu.tbl = 0
    print "Synchronized the CPU timebases at cpu0 cycle count %ld" % SIM_cycle_count(conf.cpu0)