ESC 2012
Retargeting Embedded Software Stacks for Many-Core Systems
Sumant Tambe, Ph.D. Software Research Engineer, Real-Time Innovations
Agenda
- What's happening in the many-core world? New challenges
- Collaborative research: Real-Time Innovations (RTI) and the University of North Carolina
- What we (RTI) do, briefly!
- Research: Retargeting the embedded software stack for many-core systems
  ○ Components
  ○ Scalable multi-core scheduling
  ○ Scalable communication
  ○ Middleware modernization
Single-core → Multi-core → Many-core
[Figure: single-core, multi-core, and many-core architectures joined by on-chip interconnects, alongside the transistor-count trend over time]
- CPU clock speed and power consumption hit a wall circa 2004
- Hundreds of cores are available today
Application Domains Using Multi-core
- Defense
- Transportation
- Financial trading
- Telecommunications
- Factory automation
- Traffic control
- Medical imaging
- Simulation
Grand Challenge and Prize
- Scalable applications: running faster with more cores
Inhibitors
- Embedded software stack (OS, middleware, and apps) not designed for more than a handful of cores
  ○ One core maxed out while the others idle!
  ○ Overuse of communication via shared memory
  ○ Severe cache-coherence overhead
- Advanced techniques known only to experts
  ○ Programming languages and paradigms
- Lack of design and debugging tools
Trends in concurrent programming (1/7)
Heterogeneous computing
- Instruction sets
  ○ Single: in-order, out-of-order
  ○ Multiple (heterogeneous): embedded system-on-chip combining DSPs, microcontrollers, and general-purpose microprocessors
- Memory
  ○ Uniform cache access
  ○ Uniform RAM access
  ○ Non-uniform cache access
  ○ Non-uniform RAM access
  ○ Disjoint RAM
Source: Herb Sutter, keynote @ AMD Fusion Developer Summit, 2011
Trends in concurrent programming (2/7)
Message-passing instead of shared memory
  "Do not communicate by sharing memory. Instead, share memory by communicating." – Google Go documentation
- Costs less than shared memory
- Scales better on many-core
  ○ Shown on up to 80 cores
- Easier to verify and debug
- Bypasses cache coherence
- Data locality is very important
Source: Andrew Baumann et al., "The Multikernel: A New OS Architecture for Scalable Multicore Systems", SOSP '09 (small data, messages sent to a single server)
Source: Silas Boyd-Wickizer et al., "Corey: An Operating System for Many Cores", OSDI '08
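The Go slogan translates directly to C++. Below is a minimal sketch, not from the slides or any particular library (Channel, send, and receive are illustrative names): a tiny blocking queue through which threads share memory by communicating instead of locking shared state directly.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal illustrative "channel": threads exchange owned values instead
// of mutating shared state. Hypothetical helper, not an RTI or DDS API.
template <typename T>
class Channel {
public:
    void send(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        ready_.notify_one();
    }
    T receive() {  // blocks until a message is available
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }
private:
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable ready_;
};
```

Because each value is moved through the channel, the receiver owns it outright: there is no shared mutable state left to race on.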
Trends in concurrent programming (3/7)
Shared-Nothing Partitioning: Data partitioning (see the sketch below)
  ○ Single Instruction, Multiple Data (SIMD)
  ○ a.k.a. "sharding" in DB circles
  ○ Matrix multiplication on GPGPUs
  ○ Content-based filters on stock symbols ("IBM", "MSFT", "GOOG")
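A hedged sketch of the shared-nothing idea in C++ (parallel_sum and the partitioning scheme are illustrative assumptions, not from the slides): each worker owns a disjoint slice of the data, so no locks are needed at all.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Shared-nothing data partitioning (illustrative): each worker owns a
// disjoint slice and a private result slot, so no locking is required.
long long parallel_sum(const std::vector<int>& data, unsigned workers) {
    std::vector<long long> partial(workers, 0);  // adjacent slots may
                                                 // false-share; padding
                                                 // fixes that (see later)
    std::vector<std::thread> pool;
    const std::size_t chunk = data.size() / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end =
            (w == workers - 1) ? data.size() : begin + chunk;
        pool.emplace_back([&, begin, end, w] {
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```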
Trends in concurrent programming (4/7)
Shared-Nothing Partitioning: Functional partitioning (see the pipeline sketch after this list)
  ○ E.g., Staged Event-Driven Architecture (SEDA)
  ○ Split an application into an n-stage pipeline
  ○ Each stage executes concurrently
  ○ Explicit communication channels between stages
      Channels can be monitored for bottlenecks
  ○ Used in Cassandra, Apache ServiceMix, etc.
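A minimal sketch of the SEDA shape, reusing the hypothetical Channel<T> from trend (2/7); Event, the doubling "work", and the sentinel protocol are all illustrative assumptions.

```cpp
#include <thread>

// Illustrative two-stage pipeline in the SEDA spirit. Each stage runs
// concurrently; the explicit channel between them can be monitored for
// backlog to locate bottlenecks. Assumes the Channel<T> sketch above.
struct Event { int value; bool done = false; };

void run_two_stage(Channel<Event>& source) {
    Channel<Event> between;                 // stage-1 -> stage-2 channel
    std::thread stage1([&] {
        for (;;) {
            Event e = source.receive();
            if (!e.done) e.value *= 2;      // this stage's "work"
            between.send(e);
            if (e.done) break;              // sentinel shuts the stage down
        }
    });
    std::thread stage2([&] {
        for (;;) {
            Event e = between.receive();
            if (e.done) break;
            // ... deliver e.value downstream ...
        }
    });
    stage1.join();
    stage2.join();
}
```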
Trends in concurrent programming (5/7)
Erlang-Style Concurrency (Actor Model)
Concurrency-Oriented Programming (COP)
- Fast asynchronous messaging
- Selective message reception
- Copying message-passing semantics (shared-nothing concurrency)
- Process monitoring
- Fast process creation/destruction
- Ability to support >> 10,000 concurrent processes with largely unchanged characteristics
Source: http://ulf.wiger.net
Trends in concurrent programming (6/7)
Consistency via Safely Shared Resources
- Replacing coarse-grained locking with fine-grained locking (sketched below)
- Using wait-free primitives
- Using cache-conscious algorithms
- Exploiting application-specific data locality
- New programming APIs
  ○ OpenCL, PPL, AMP, etc.
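Two of these techniques in a hedged C++ sketch (record_hit, bump, and the 16-stripe layout are illustrative choices, not from the slides): a lock-free counter via std::atomic, and lock striping as a fine-grained alternative to one table-wide mutex.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <mutex>

// Lock-free increment: no mutex, just an atomic read-modify-write.
std::atomic<long> hits{0};
void record_hit() { hits.fetch_add(1, std::memory_order_relaxed); }

// Lock striping: guard small groups of buckets with separate mutexes
// so threads touching different keys rarely contend.
constexpr std::size_t kStripes = 16;
std::array<std::mutex, kStripes> stripe_locks;
std::array<long, kStripes> buckets{};

void bump(std::size_t key) {
    const std::size_t s = key % kStripes;   // contend on one stripe only
    std::lock_guard<std::mutex> lock(stripe_locks[s]);
    ++buckets[s];
}
```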
Trends in concurrent programming (7/7)
Effective concurrency patterns: Wizardry instruction manuals!
Explicit multi-threading: Too much to worry about!
Source: POSA2: Patterns for Concurrent, Parallel, and Distributed Systems, Dr. Doug Schmidt
1. The Pillars of Concurrency (Aug 2007)
2. How Much Scalability Do You Have or Need? (Sep 2007)
3. Use Critical Sections (Preferably Locks) to Eliminate Races (Oct 2007)
4. Apply Critical Sections Consistently (Nov 2007)
5. Avoid Calling Unknown Code While Inside a Critical Section (Dec 2007)
6. Use Lock Hierarchies to Avoid Deadlock (Jan 2008)
7. Break Amdahl’s Law! (Feb 2008)
8. Going Super-linear (Mar 2008)
9. Super Linearity and the Bigger Machine (Apr 2008)
10. Interrupt Politely (May 2008)
11. Maximize Locality, Minimize Contention (Jun 2008)
12. Choose Concurrency-Friendly Data Structures (Jul 2008)
13. The Many Faces of Deadlock (Aug 2008)
14. Lock-Free Code: A False Sense of Security (Sep 2008)
15. Writing Lock-Free Code: A Corrected Queue (Oct 2008)
16. Writing a Generalized Concurrent Queue (Nov 2008)
17. Understanding Parallel Performance (Dec 2008)
18. Measuring Parallel Performance: Optimizing a Concurrent Queue (Jan 2009)
19. volatile vs. volatile (Feb 2009)
20. Sharing Is the Root of All Contention (Mar 2009)
21. Use Threads Correctly = Isolation + Asynchronous Messages (Apr 2009)
22. Use Thread Pools Correctly: Keep Tasks Short and Non-blocking (Apr 2009)
23. Eliminate False Sharing (May 2009)
24. Break Up and Interleave Work to Keep Threads Responsive (Jun 2009)
25. The Power of “In Progress” (Jul 2009)
26. Design for Many-core Systems (Aug 2009)
27. Avoid Exposing Concurrency – Hide It Inside Synchronous Methods (Oct 2009)
28. Prefer structured lifetimes – local, nested, bounded, deterministic (Nov 2009)
29. Prefer Futures to Baked-In “Async APIs” (Jan 2010)
30. Associate Mutexes with Data to Prevent Races (May 2010)
31. Prefer Using Active Objects Instead of Naked Threads (June 2010)
32. Prefer Using Futures or Callbacks to Communicate Asynchronous Results (August 2010)
33. Know When to Use an Active Object Instead of a Mutex (September 2010)
Source: Effective Concurrency, Herb Sutter
Threads are hard!
Source: MSDN Magazine, Joe Duffy
- Forgotten Synchronization
- Incorrect Granularity
- Read and Write Tearing
- Lock-Free Reordering
- Lock Convoys
- Two-Step Dance
- Priority Inversion
- Data Races
- Deadlock
- Atomicity Violation
- Order Violation
Patterns for Achieving Safety:
- Immutability
- Purity
- Isolation
Collaborative Research!
- Prof. James Anderson, University of North Carolina, IEEE Fellow
- Real-Time Innovations, Sunnyvale, CA
Research funded by OSD: Scalable Communication and Scheduling for Many-Core Systems
Integrating Enterprise Systems with Edge Systems
[Diagram: a Data-Centric Messaging Bus (RTPS) connects the enterprise system and the edge system through per-application connectors and adapters. A web service exchanges GetTempRequest/GetTempResponse via a SOAP adapter; a TemperatureSensor publishes Temperature via a socket adapter; a JMS app receives Temp via a JMS adapter; an SQL app receives Temp via a DB adapter.]
Data-Centric Messaging
Based on the DDS standard (OMG). DDS = Data Distribution Service.
DDS:
- is an API specification for real-time systems
- provides a publish-subscribe paradigm
- provides quality-of-service tuning
- uses an interoperable wire protocol (RTPS, the real-time publish-subscribe wire protocol)
[Diagram: RTI Data Distribution Service implements the DDS standard: a standards-based API for application developers and an open protocol (RTPS) for interoperability]
DDS Communication Model
Provides a "Global Data Space" that is accessible to all interested applications (see the publisher sketch below).
- Data objects are addressed by domain, topic, and key
- Subscriptions are decoupled from publications
- Contracts are established by means of QoS
- Automatic discovery and configuration
[Diagram: publishing and subscribing participants around a Global Data Space containing instances such as Alarm and (Track,1), (Track,2), (Track,3)]
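As a concrete taste of the model, here is a hedged sketch of a publisher using the OMG DDS C++ PSM (the standard C++ API that RTI also implements). The Temperature type, its sensor_id/degrees fields, and the topic name are assumptions drawn from the integration diagram above; real code starts from an IDL type definition, and all QoS tuning is omitted.

```cpp
#include <dds/dds.hpp>   // OMG DDS C++ PSM umbrella header

// Hedged sketch: publish a sample into the DDS "Global Data Space".
// Temperature is assumed to be an IDL-generated type whose key field
// (sensor_id) identifies the data object (instance).
int main() {
    dds::domain::DomainParticipant participant(0);      // join domain 0
    dds::topic::Topic<Temperature> topic(participant, "Temperature");
    dds::pub::Publisher publisher(participant);
    dds::pub::DataWriter<Temperature> writer(publisher, topic);

    Temperature sample;
    sample.sensor_id("sensor-42");   // key: addresses the instance
    sample.degrees(21.5);
    writer.write(sample);            // discovered subscribers receive it
    return 0;
}
```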
Data-Centric vs. Message-Centric Design
Data-centric: the infrastructure does understand your data
- What data schema(s) will be used
- Which objects are distinct from which other objects
- What their lifecycles are
- How to attach behavior (e.g., filters, QoS) to individual objects
- Example technologies: DDS API, RTPS (DDSI) protocol
Message-centric: the infrastructure does not understand your data
- Opaque contents vary from message to message
- No object identity; messages are indistinguishable
- Ad-hoc lifecycle management
- Behaviors can only apply to the whole data stream
- Example technologies: JMS API, AMQP protocol
Re-enabling the Free Lunch, Easily!
Positioning applications to run faster on machines with more cores: enabling the free lunch!
Three pillars of concurrency:
- Coarse-grained parallelism (functional partitioning)
- Fine-grained parallelism (running a 'for' loop in parallel)
- Reducing the cost of resource sharing (improved locking)
Scalable Communication and Scheduling for Many-Core Systems: Objectives
- Create a Component Framework for Developing Scalable Many-core Applications
- Develop Many-Core Resource Allocation and Scheduling Algorithms
- Investigate Efficient Message-Passing Mechanisms for Component Dataflow
- Architect DDS Middleware to Improve Internal Concurrency
- Demonstrate the ideas using a prototype
Component-based Software Engineering
Facilitates separation of concerns:
- Functional partitioning to enable MIMD-style parallelism
- Managed resource allocation and scheduling algorithms
- Ease of application lifecycle management
Component-based design:
- Naturally aligned with functional partitioning (pipeline)
- Components are modular, cohesive, loosely coupled, and independently deployable
- Message-passing communication
- Isolation of state
- Shared-nothing concurrency
- Ease of validation
Lifecycle management:
- Application design
- Deployment
- Resource allocation
- Scheduling
Deployment and configuration:
- Placement based on data-flow dependencies
- Cache-conscious placement on cores
Component-based Software Engineering
[Diagram: a component assembly is mapped, via a transformation grounded in formal models, onto the many-core scheduling algorithms below]
Scheduling Algorithms for Many-core
Academic research partner: Real-Time Systems Group, Prof. James Anderson, University of North Carolina at Chapel Hill
- Processing Graph Method (PGM)
- Clustered scheduling on many-core
[Diagram: a processing graph (nodes G1 through G7) mapped N nodes to M cores on a Tilera TILEPro64 multi-core processor. Source: Tilera.com]
Scheduling Algorithms for Many-core
Key requirements:
- Efficiently utilize the processing capacity within each cluster
- Minimize data movement across clusters
- Exploit data locality
A many-core processor is an on-chip distributed system!
- Cores are addressable
- Send messages to other cores directly
- On-chip networks (interconnect)
  ○ MIT RAW = 4 networks
  ○ Tilera iMesh = 6 networks
  ○ On-chip switches, routing algorithms, packet switching, multicast!, deadlock prevention
- Sending messages to a distant core takes longer
E.g., Tilera iMesh architecture. Source: Tilera.com
Message-passing over shared-memory
Two key issues: performance and correctness
Performance:
- Shared memory does not scale on many-core
- Full-chip cache coherence is expensive
  ○ Too much power
  ○ Too much bandwidth
  ○ Not all cores need to see the update
  ○ Data stalls reduce performance
Source: Ph.D. defense of Natalie Enright Jerger
Message-passing over shared-memory
Correctness:
- Hard to achieve with explicit threading (even with task-based libraries)
- Lock-based programs are not composable:
  "Perhaps the most fundamental objection [...] is that lock-based programs do not compose: correct fragments may fail when combined. For example, consider a hash table with thread-safe insert and delete operations. Now suppose that we want to delete one item A from table t1, and insert it into table t2; but the intermediate state (in which neither table contains the item) must not be visible to other threads. Unless the implementer of the hash table anticipates this need, there is simply no way to satisfy this requirement. [...] In short, operations that are individually correct (insert, delete) cannot be composed into larger correct operations." —Tim Harris et al., "Composable Memory Transactions", Section 2: Background, pg. 2
- Message-passing is composable
- Easy to verify and debug: observe the in/out messages only
Component Dataflow using DDS Entities
Core-Interconnect Transport for DDS
- RTI DDS supports many transports for messaging: UDP, TCP, shared memory, zero-copy, etc.
- In the future: a "core-interconnect transport"!
- Tilera provides the Tilera Multicore Components (TMC) library
- A higher-level library for MIT RAW is in progress
Erlang-Style Concurrency: A Panacea?
Actor Model: the OO programming of the concurrency world
Concurrency-Oriented Programming (COP)
- Fast asynchronous messaging
- Selective message reception
- Copying message-passing semantics (shared-nothing concurrency)
- Process monitoring
- Fast process creation/destruction
- Ability to support >> 10,000 concurrent processes with largely unchanged characteristics
Source: http://ulf.wiger.net
Actors using Data-Centric Messaging?
- Fast asynchronous messaging
  ○ < 100 microseconds latency
  ○ Vendor-neutral but old (2006) results
  ○ Source: Ming Xiong et al., Vanderbilt University
- Selective message reception (see the filter sketch after this list)
  ○ Standard DDS data partitioning: domains, partitions, topics
  ○ Content-based filtered topics (e.g., "key == 0xabcd")
  ○ Time-based filters, query conditions, sample states, etc.
- Copying message-passing semantics
- "Process" monitoring
- Fast "process" creation/destruction
- >> 10,000 concurrent "processes"
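To make selective reception concrete, here is a hedged sketch of a content-based filter in the OMG DDS C++ PSM. Quote, its symbol field, and the topic names are hypothetical; the filter expression uses the SQL-like subset defined by the DDS specification, and vendor documentation governs the exact grammar.

```cpp
#include <dds/dds.hpp>

// Hedged sketch: a reader that only ever sees samples matching a
// content filter, so "message selection" happens in the middleware.
void subscribe_ibm_quotes() {
    dds::domain::DomainParticipant participant(0);
    dds::topic::Topic<Quote> quotes(participant, "StockQuotes");

    // Samples whose symbol field is not "IBM" are filtered out before
    // delivery (and, with some transports, before sending at all).
    dds::topic::ContentFilteredTopic<Quote> ibm_only(
        quotes, "IBMQuotes", dds::topic::Filter("symbol = 'IBM'"));

    dds::sub::Subscriber subscriber(participant);
    dds::sub::DataReader<Quote> reader(subscriber, ibm_only);
    // ... take() samples from reader as they arrive ...
}
```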
RTI RESEARCH
Middleware Modernization
Event-handling patterns:
- Reactor
  ○ Offers coarse-grained concurrency control
- Proactor (asynchronous I/O)
  ○ Decouples threading from concurrency
Concurrency patterns:
- Leader/Followers
  ○ Enhances CPU cache affinity, minimizes locking overhead, reduces latency
- Half-Sync/Half-Async
  ○ Faster low-level system services
Middleware Modernization
Effective Concurrency (Sutter):
- Concurrency-friendly data structures
  ○ Fine-grained locking in linked lists
  ○ Skip lists for fast parallel search
  ○ But compactness is important too! See the Going Native 2012 keynote by Dr. Stroustrup, slide #45 (vector vs. list): std::vector beats std::list in insertion and deletion! Reason: linear search dominates, and compact = cache-friendly.
- Data locality
  ○ A first-class design concern
  ○ Avoid false sharing (see the sketch below)
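A minimal sketch of the usual false-sharing fix (PaddedCounter and the 64-byte figure are illustrative; 64 is a common cache-line size, and C++17 later standardized std::hardware_destructive_interference_size for this):

```cpp
#include <atomic>

// Pad and align per-worker counters so that two cores incrementing
// different counters never ping-pong the same cache line.
struct alignas(64) PaddedCounter {   // 64 = assumed cache-line size
    std::atomic<long> value{0};
};
PaddedCounter per_worker[16];        // one slot per worker thread
```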
- Lock-free data structures (e.g., Java's ConcurrentHashMap)
  ○ Inventing a new one will earn you a Ph.D.!
- Processor affinity and load balancing
  ○ E.g., pthread_setaffinity_np (sketched below)
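A small sketch of pinning with the function named on the slide (pin_to_core is an illustrative wrapper; the call itself is Linux/glibc-specific, hence the _np suffix):

```cpp
#define _GNU_SOURCE        // exposes pthread_setaffinity_np on glibc
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to the given core so its working set stays in
// that core's caches. Returns 0 on success or an errno-style code.
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```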
Concluding Remarks
Scalable Communication and Scheduling for Many-Core Systems research:
- Create a Component Framework for Developing Scalable Many-core Applications
- Develop Many-Core Resource Allocation and Scheduling Algorithms
- Investigate Efficient Message-Passing Mechanisms for Component Dataflow
- Architect DDS Middleware to Improve Internal Concurrency
Thank you!
Questions?