
Developing Software in a Multicore & Multiprocessor World
Tool-based approach for finding complex concurrency issues and endian incompatibilities

To keep pace with customer demands for more functionality and speed, software teams are moving away from single processor architectures at a rapid rate. In particular, embedded devices that used to have one chip to perform a constrained set of tasks are now working in heterogeneous processor environments where processors are used for network connectivity, multi-media, and a whole variety of requirements. According to new data from VDC Research, this trend is only expected to accelerate: engineers expect that in two years' time, the number of single processor projects will drop by half.

The business impact of this growing complexity is stark: multicore and multiprocessor software projects are 4.5X more expensive, have 25% longer schedules, and require almost 3X as many software engineers.1

One area in particular where this growing complexity can have a dramatic impact on cost and schedule overruns is in the area of software testing and code inspection. A multicore/processor environment can add exponential complexity to effectively identifying errors in software. There are two areas in particular that have the ability to drag the productivity of a software team through the floor: concurrency errors and endian incompatibilities.

This whitepaper will discuss these types of issues in detail, explain how Klocwork®'s source code analysis engine, Klocwork Truepath®, can be used to address them, and walk through two examples of these problems in prominent open source projects.

GWYN FISHER, CTO | WHITE PAPER | SEPTEMBER 2010

Figure 1 | Processing Architecture Used in the Current Project and Expected in Next Two Years (Percent of Respondents)

                                   Current Project    Expected in 2 Years
Single processor                        61.8%                30.1%
Multi-processor                         20.8%                20.6%
Multicore                                9.3%                21.4%
Multicore and multiprocessor             5.2%                19.4%
Don't know                               2.9%                 8.5%

1 VDC Research, “Next Generation Embedded Hardware Architectures: Driving Onset of Project Delays, Cost Overruns, and Software Development Challenges”, September 2010.


Tackling Concurrency Issues and Endian Incompatibilities with Klocwork Truepath

Concurrency Issues

Source code analysis is a process by which the expected, or predicted, behavior of a program at runtime is exercised along every conceivable control flow path, so that aberrant situations can be found, diagnosed, and described to the author in a way that makes them simple to fix. In the typical course of events, no timing or ordering information, other than that inherent in the control flow graph, is interpreted or required for this analysis to take place.

Concurrency issues pose a complex set of challenges for analysis, as they do require timing or ordering information to be promoted into the control flow graph. Some are obviously less difficult to find than others, such as threads that reserve locks and perform time-consuming activities before releasing them. This type of behavior, whilst not leading to a critical failure such as a deadlock, can lead to frustration on the part of the end user of the software, for example in the face of an unresponsive device.

The more complex type of concurrency issues, such as deadlocks, require an additional type of analysis over-and-above that performed when finding non-order-related bugs such as memory leaks or buffer overruns. In this case, we must perform two different types of analysis: one that gathers and propagates lock lifecycle behavior, and another that can analyze the whole program space and find conflicts in this behavior.
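
To make that second, whole-program step concrete, here is a toy sketch. It is in no way Klocwork's implementation; it simply illustrates the idea that lock dependencies gathered across the program can be treated as edges of a graph ("lock B is acquired while lock A is held"), with a deadlock candidate reported wherever those edges form a cycle:

#include <stdio.h>
#include <string.h>

#define NLOCKS 4

static const char *names[NLOCKS] = { "Lock1", "Lock2", "Lock3", "Lock4" };
static int edge[NLOCKS][NLOCKS];   /* edge[a][b] != 0 means: b is acquired while a is held */

static void report_cycles(void)
{
    int reach[NLOCKS][NLOCKS];
    memcpy(reach, edge, sizeof reach);

    /* transitive closure over the gathered dependencies */
    for( int k = 0; k < NLOCKS; k++ )
        for( int i = 0; i < NLOCKS; i++ )
            for( int j = 0; j < NLOCKS; j++ )
                if( reach[i][k] && reach[k][j] )
                    reach[i][j] = 1;

    /* a lock that can transitively depend on itself sits on a lock-order cycle */
    for( int i = 0; i < NLOCKS; i++ )
        if( reach[i][i] )
            printf("possible deadlock: %s participates in a lock-order cycle\n", names[i]);
}

int main(void)
{
    edge[0][1] = 1;   /* somewhere: Lock2 is acquired while Lock1 is held */
    edge[1][0] = 1;   /* elsewhere: Lock1 is acquired while Lock2 is held */
    report_cycles();
    return 0;
}

In practice each dependency is also qualified by the control flow conditions under which it can actually occur, which is where the symbolic logic questions described below come in.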

Klocwork Truepath makes this possible via the addition of a new concurrency analysis engine to its existing tool chain, as shown in Figure 2. Data relating to lock lifecycles is gathered by the normal analysis engine; once this has been produced for all modules in the system, the whole program space is then analyzed by the new concurrency analysis engine so that loops in the lock lifecycle graph, which equate to deadlocks, can be found.

Consider a function that operates as follows:

lock_t Lock1, Lock2;

void foo(int x)
{
    if( x & 1 ) {
        lock(Lock1);
        lock(Lock2);
    }
    else
        lock(Lock1);
}

You can easily see by inspection that when passed an odd number as its parameter, this function defines a dependency of Lock2 upon Lock1. Failing an odd parameter, Lock1 is still reserved, but this time there is no dependency of Lock2 upon Lock1 at the local scope, although there may still remain that dependency (or another) at an inter-procedural scope.

Therefore, we have two discrete types of questions to ask when performing the analysis:

1. Symbolic logic questions:
   a. Is there a valid control flow that gets us to call function foo() with an odd parameter?
   b. Is there a valid control flow that results in foo() being called with an even parameter, followed by a call to another function that results in another lock (e.g. Lock2) being reserved before Lock1 is released?

2. Lock dependency questions:
   a. If either of these is so, is there any other situation in the program's natural control flow whereby a counter-dependency of Lock1 upon Lock2 can be reached, potentially resulting in a deadlock?

Figure 2 | Klocwork Truepath tool chain provides concurrency analysis engine after control flow graph analysis and build emulation.

Compile
  • Emulate native build
  • Build control flow graph
Symbolic logic
  • Analyze control flow graph
  • Perform dataflow analysis
Concurrency
  • Analyze lock dependencies


The first type of question is answered by Klocwork Truepath’s symbolic logic engine during the normal course of program analysis, just as any other type of defect is analyzed for inter-procedural data flows that can or cannot occur.

The second type of question is then answered by the concurrency analysis engine, fed by the collection of all possible dependencies within the program space. The result is what tends to be a small set of incredibly difficult to find (manually), and insanely difficult to understand (without a tool) deadlock scenarios that developers can triage and fix very quickly within the natural course of their implementation tasks.
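
As a concrete illustration of such a counter-dependency (a hypothetical companion to the foo() sketch above, not code taken from the paper, and assuming an unlock() primitive to match the paper's lock()), a second function that reserves the same two locks in the opposite order is all it takes to set up a deadlock scenario:

extern lock_t Lock1, Lock2;   /* the locks declared alongside foo() above */

void bar(void)
{
    lock(Lock2);              /* Lock2 first ... */
    lock(Lock1);              /* ... then Lock1: a counter-dependency of Lock1 upon Lock2 */

    /* ... do some work ... */

    unlock(Lock1);
    unlock(Lock2);
}

/* If one thread calls foo() with an odd argument while another thread calls bar(),
   each can end up holding one lock while waiting forever for the other. */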

Endian Incompatibilities

Whilst it may be true that there are 10 kinds of people in the world, a switch from a little endian platform to a big endian platform will muddy that impression considerably. An advisor of ours recently informed me with glee that he’d finally “set his MSB” (having passed his 64th birthday), but store that in nibble representation on an unexpected endian architecture and he’d be regressing to the nursery once more.

In short, endian representations affect how the host processor stores integral types in memory. Considering 32-bit integers, each of which consists of four bytes of memory, the processor can choose to read and write those four bytes in a variety of orders, although traditionally only two are used:

» Little endian, in which the bytes are written in the order 0, 1, 2, 3
» Big endian, in which the bytes are written in the order 3, 2, 1, 0

This picture becomes slightly muddied if the processor actually writes words at a time (this is mostly a fairly historical representation now, but we mention it for completeness), and applies its endian assumptions to each word:

» Little endian still writes bytes in the order 0, 1, 2, 3
» Big endian, however, may now write bytes in the order 1, 0, 3, 2
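
The difference is easy to observe directly. The following small sketch (not from the paper) copies a 32-bit value into a byte buffer and prints the order in which the host actually laid the bytes out in memory:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint32_t value = 0x00010203;           /* byte 0 (least significant) is 0x03, byte 3 is 0x00 */
    unsigned char bytes[4];

    memcpy(bytes, &value, sizeof value);   /* capture the in-memory representation */

    /* A little endian host prints "03 02 01 00"; a big endian host prints "00 01 02 03". */
    printf("%02x %02x %02x %02x\n", bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}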

However the processor stores and reads such types is entirely at its own discretion and the business of nobody else. Until, that is, the developer directs the processor to write such data into a medium for transmission, as opposed to storage in memory.

Transmission media, which could be sockets, files, pipes, or any other inter-processor vector (e.g. interrupts that cause data to be written to the PCI-Express interface, or to the serial bus, or…), are addressed by the processor in exactly the same way as memory unless specifically told to do otherwise.

Thus, a big endian processor will write a 32-bit integer onto a socket in byte order 3, 2, 1, 0. If the CPU on the other end of the socket uses a little endian architecture, then obviously a value written onto the socket will be interpreted completely differently when read. For example, a 32-bit value of 29 (0x0000001D), written by a big endian processor and read by a little endian processor, will be interpreted as 486,539,264 (0x1D000000), which is not a small correction by any means.
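
For the sceptical, the reinterpretation above is easy to verify with a quick byte swap (a throwaway check, not part of the paper):

#include <stdio.h>
#include <stdint.h>

static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

int main(void)
{
    printf("%u\n", bswap32(29u));   /* prints 486539264, i.e. 0x1D000000 */
    return 0;
}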

Preparing a program for use with heterogeneous processor architectures therefore involves finding every integral type that ever hits a transmission vector that could legitimately target another processor and ensuring that the read/write operation involved transforms the data into / from a neutral representation that both sides agree on. In a program of any size at all, obviously this is a non-trivial task.
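
One common shape for that neutral representation, sketched here purely for illustration (the helper names are invented, not part of any particular library), is to write and read each integral type byte by byte in an agreed order, independent of what either host does in memory:

#include <stdint.h>

/* Hypothetical helpers: serialize a 32-bit value most-significant-byte first. */
void put_u32_be(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)(v);
}

uint32_t get_u32_be(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}

The htonl()/ntohl() family used later in this paper is simply a standardized form of the same agreement.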

Klocwork Truepath can help developers in this task as it now includes the ability to validate type representation usage symmetrically as those types cross transmission vector boundaries. That is, the data flow engine within Klocwork Truepath automatically validates that types that are written directly to a transmission vector are subject to host-to-neutral format transformation before the write operation takes place. Likewise, integral types read from a transmission vector are tracked to ensure that they are appropriately transformed prior to the first attempted usage on the host.

For example, consider the following function:

void foo(int sock)
{
    int x;

    /* x is transmitted exactly as it sits in memory, i.e. in host byte order */
    for( x = 0; x < 256; x++ )
        if( send(sock, &x, sizeof(int)) < sizeof(int) )
            return;
}


This simple function makes the basic assumption that the reader on the other end of its socket has the same processor architecture as the sender. This might be true, or more accurately it might be true today, but what designer can ever look far enough into the future to know that it will always be true, regardless of market shifts, great ideas that marketing interns have, etc.

Klocwork Truepath, upon analysis of this function, will point out:

Value ‘x’ is used in host byte order, but should be used in environment/network byte order.

A developer versed in inter-architectural development will naturally modify this function to transform the value of the variable ‘x’ prior to transmission:

void foo(int sock)
{
    int x, xt;

    for( x = 0; x < 256; x++ ) {
        xt = htonl(x);   // … or some other suitable form
        if( send(sock, &xt, sizeof(int)) < sizeof(int) )
            return;
    }
}

Likewise when it comes to reading information across a transmission vector, Klocwork Truepath traces the data flow of any received integral types to ensure, in exactly the opposite way to sending, that any such values are transformed to host format prior to their first usage.
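
A sketch of the receiving side (again illustrative, not taken from the paper) is the mirror image: the value comes off the wire in the agreed network order and is converted with ntohl() before anything consumes it:

#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* Hypothetical helper: read one 32-bit length field sent in network byte order. */
int read_length(int sock, uint32_t *len_out)
{
    uint32_t wire;

    if( recv(sock, &wire, sizeof wire, MSG_WAITALL) != (ssize_t)sizeof wire )
        return -1;                  /* error or short read */

    *len_out = ntohl(wire);         /* network order -> host order, before first use */
    return 0;
}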

Open Source Case Studies

Lock Contention: SQLite ca. 2006

Long since addressed by the developers of this great open source project, a deadlock was reported in the execution of the database engine and was traced to code that was specifically intended to guard against such an occurrence (as is usually the case). The bug was complicated to understand, and the eventual fix resulted in an almost total rewrite of the offending module; without a tool such as Klocwork Insight™ it would have required days or perhaps weeks of intense manual debugging and thought-modeling, yet this very nasty bug was found and correctly described by Klocwork Truepath during an analysis that took mere minutes.

Consider the requirement to implement a simplistic singleton recursive lock capability within an environment that doesn’t support such constructs. Using reference counting, we can quite simply guard the underlying non-recursive lock and manage its lifecycle appropriately. Of course, this being a parallel world, we need to use another lock to guard the reference count that we’re using to guard the real lock, making the implementation just a bit more complicated.

The design of this might look something like the following example:

lock_t lock1, lock2;
int refCount = 0;

void enter()
{
    reserve_lock(lock1);
    if( refCount == 0 )
        reserve_lock(lock2);
    release_lock(lock1);
    refCount++;
}

void leave()
{
    reserve_lock(lock1);
    refCount--;
    if( refCount == 0 )
        release_lock(lock2);
    release_lock(lock1);
}


Now I can call enter() multiple times, simulating some of the capabilities of a true recursive lock, and as long as I remember to call leave() an equal number of times the lifecycle of the underlying non-recursive lock is managed correctly:

void foo()
{
    // real lock is reserved
    enter();

    if( i_really_want_to ) {
        // only the reference count is affected
        enter();
        leave();
    }

    // now the real lock is released
    leave();
}

Now consider the requirement to implement an abstraction over thread-specific data storage. To ensure safety when allocating such a structure, the database engine uses the singleton recursive lock described above to protect its activities with an implementation that simplifies as follows:

int tlsCreated = 0;

data_t* create_data()
{
    static data_t* tls;

    enter();
    if( tlsCreated == 0 )
        tls = create_thread_data();
    tlsCreated = 1;
    leave();

    init_data(tls);
    return tls;
}

On simple inspection this appears quite correct: it calls leave() the same number of times as enter() and thus should be considered well behaved. Unfortunately, life in the parallel world is rarely simple to analyze, and this case is certainly more complicated than it first appears.

Consider a two core CPU executing two threads, both calling create_data() at very slight offsets in time.

The first thread — let’s call our threads Thread 1 and Thread 2 — begins executing create_data() and successfully calls the enter() function. This results in the underlying lock, lock 2, being reserved to Thread 1:

Thread 1
create_data()
  enter()
    refCount = 0
    reserve(lock1)
    reserve(lock2)
    release(lock1)
    refCount = 1


Now let’s assume that Thread 2 begins its execution of create_data() during the time that Thread 1 is active, and before it releases lock 1:

Thread 1                            Thread 2
create_data()
  enter()
    refCount = 0
    reserve(lock1)
    reserve(lock2)
                                    create_data()
                                      enter()
    release(lock1)
                                        reserve(lock1)

One further assumption makes the scenario whole: Thread 1 is interrupted at this moment by the operating system, losing its time on chip. Crucially, this happens before the reference count is updated. (Check the implementation of enter() and you’ll see that the author unfortunately left the reference count update outside of the region that lock 1 is supposed to guard.) As the reference count will therefore still read zero for Thread 2, it will attempt to reserve lock 2, resulting in Thread 2 blocking (as lock 2 is already owned by Thread 1):

Thread 1                            Thread 2
create_data()
  enter()
    refCount = 0
    reserve(lock1)
    reserve(lock2)
                                    create_data()
                                      enter()
    release(lock1)
                                        reserve(lock1)
    interrupted
                                        refCount = 0
                                        reserve(lock2)
                                        blocked

Upon return from interrupt, Thread 1 is released and resumes execution where it left off, incrementing the reference count and returning from the enter() function. Its execution of create_data() continues, leading to a call to the leave() function, which unfortunately attempts to reserve lock 1 before doing anything else:

Thread 1                            Thread 2
create_data()
  enter()
    refCount = 0
    reserve(lock1)
    reserve(lock2)
                                    create_data()
                                      enter()
    release(lock1)
                                        reserve(lock1)
    interrupted
                                        refCount = 0
                                        reserve(lock2)
                                        blocked
    refCount = 1
    return
  leave()
    reserve(lock1)
    blocked

Because Thread 2 currently owns lock 1 and is blocked waiting on lock 2, Thread 1 will now block on its own attempt to reserve lock 1.

In short, this is a classic lock-order inversion caused by a poorly guarded data item which, when subject to a race condition (being read by one thread whilst in the process of being updated by another), causes one thread to reserve locks in order while the other thread attempts to reserve them out of order, resulting in a deadlock.

With the race condition fixed, this singleton will operate correctly, although as previously described the author actually chose to completely rewrite this module, providing a more useful re-entrant mutual exclusion capability for multiple threads, i.e. removing the singleton semantic.
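
For illustration only, and not the fix the SQLite authors ultimately shipped (as noted, they rewrote the module), the race in the sketch above disappears if the reference count is updated while lock 1 is still held, so that every thread observes a consistent count and both locks are always taken in the same order:

void enter()
{
    reserve_lock(lock1);
    if( refCount == 0 )
        reserve_lock(lock2);
    refCount++;                 /* updated inside the lock1 critical section */
    release_lock(lock1);
}

With this change, a second thread can no longer read a stale count of zero and attempt to reserve lock 2 while another thread still owns it.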


Figure 3 | Source listing from SQLite

Figure 4 | Control flow description from Klocwork Truepath


Endian Design Assumptions: PostgreSQL

In contrast to the situation described in relation to SQLite, the findings in this case study don’t point to bugs in software as much as they do to limited design decisions and the impact they have on how software is then constructed.

Specifically, when designing a multi-process application, the architect is faced with the fundamental decision of whether all of those processes are going to be supported on one chip, or whether, for the sake of scale or pure flexibility, the software will support being deployed and executed on multiple chips / hosts / devices at once.

In the case of PostgreSQL, one of the processes detached from the main kernel is the statistics collector, something that acts more or less as a performance monitor, allowing the DBA to understand what’s going on within the kernel without necessarily impacting the performance of the kernel whilst running reports or monitors against those statistics. This provides a nice analog for a typical set of application-layer processes that need to interact with each other, but which, due to design, could be implemented to operate on either the same CPU / host or a completely different one.

To implement this “low touch” collection and reporting mechanism, the PostgreSQL designers chose to fork() a process, presumably on the same CPU or multi-CPU package, and then use an asynchronous socket to transmit data from the kernel process to the collector. Using the pgstat application, the DBA can then interact with whatever the child process has collected at any point in time.

All of this is encoded within the module src/backend/postmaster/pgstat.c.

Because of the way that this fundamental decision was taken in this particular case, the designer chose to encode data transmission between the kernel and the collector using host-native representation. The assumption is simple to see in all its glory: the data member msg.msg_hdr.m_size is read and used directly off the wire, in what could be, but isn’t in this case, network order.
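
The pattern being described looks, in spirit, something like the following sketch (hypothetical code, not the actual PostgreSQL source; only the field names msg_hdr and m_size are taken from the diagnostics below):

#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical message layout, for illustration only. */
typedef struct
{
    int m_type;
    int m_size;
} SketchMsgHdr;

typedef struct
{
    SketchMsgHdr msg_hdr;
    char         m_data[1000];
} SketchMsg;

static void collector_loop_sketch(int sock)
{
    SketchMsg msg;

    while( recv(sock, &msg, sizeof msg, 0) > 0 )
    {
        /* m_size is consumed exactly as it arrived: host-native on the sender's side */
        int len = msg.msg_hdr.m_size;
        (void)len;   /* ... process len bytes of m_data ... */
    }
}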

Now let’s assume that a new generation of designers revisit this decision and instead place emphasis on scale and flexibility over ease of implementation. They decide to place the statistics collector process on an arbitrary node in the hardware design, rather than on the same node as the kernel process.

With this decision in place, the assumption that network byte order and host byte order are the same can no longer be made in general. Porting to this new assumption set could take significant time, both for developers and for the test crew, faced with putting together a matrix of CPUs / hosts that embody the plethora of representations we can expect to support in the field.

With a tool-driven approach, however, this entire effort collapses to a single analysis pass, taking minutes in total, that reports exactly what’s involved. In this case, the designers would be faced with the following endian vulnerabilities that would need to be addressed (along with the obvious logistical issues around how to place the process on the right host/CPU, of course):

pgstats.c: line 1988: function pgstat_recvbuffer() Value ‘msg.msg_hdr.m_size’ is used in network order.

pgstats.c: line 1443: function pgstat_send() Value ‘*msg’ is used in host byte order.

Figure 5 | Data representation analysis in action


These two simple issues might be thought of as the whole problem domain. However, looking further into what this module is capable of, certain information can be persisted across sessions using a statistics file. If we extend the decision to spawn the collector process on heterogeneous hardware by also allowing different instantiations of that process, on different hardware, to share the persisted file, then the persistent data itself must be made endian safe:

pgstats.c: line 2556: function pgstat_read_statsfile() Value ‘format_id’ is used in environment byte order. Similar errors can be found on line(s): 2610, 2684, 2717, 2740.

pgstats.c: line 2312: function pgstat_write_statsfile() Value ‘format_id’ is used in host byte order. Similar errors can be found on line(s): 2351, 2384, 2411, 2412.
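
One way such findings might be addressed, sketched here purely for illustration (this is not the PostgreSQL implementation), is to persist fields like the format id in an agreed byte order and convert on the way in and out:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htonl / ntohl */

/* Hypothetical helpers for an endian-safe statistics file. */
static int write_format_id(FILE *fp, uint32_t format_id)
{
    uint32_t neutral = htonl(format_id);        /* host -> agreed (network) order */
    return fwrite(&neutral, sizeof neutral, 1, fp) == 1 ? 0 : -1;
}

static int read_format_id(FILE *fp, uint32_t *format_id)
{
    uint32_t neutral;

    if( fread(&neutral, sizeof neutral, 1, fp) != 1 )
        return -1;

    *format_id = ntohl(neutral);                /* agreed order -> host, before first use */
    return 0;
}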

Armed with this information, the designer can make all required updates to remove endian vulnerability from their code in one pass.

Conclusion

The complexity of this problem domain is vast, so there’s no one solution, tool, or approach that will address all your problems. Development teams need to equip themselves with good tools, smart design assumptions, and even smarter developers to reconcile the feature race demanded by the market with the underlying platform complexity it implies. When it comes to selecting a tool, source code analysis should be on your shortlist, as it offers a compelling mix of scalability, flexibility, and the ability to address a broad set of issues that will help you ensure the overall security and reliability of your code.

About the Author

Gwyn Fisher is the CTO of Klocwork and is responsible for guiding the company’s technical direction and strategy. With nearly 20 years of global technology experience, Gwyn brings a valuable combination of vision, experience, and direct insight into the developer perspective. With a background in formal grammars and computational linguistics, Gwyn has spent much of his career working in the search and natural language domains, holding senior executive positions with companies like Hummingbird, Fulcrum Technologies, PC DOCS and LumaPath. At Klocwork, Gwyn has returned to his original passion, compiler theory, and is leveraging his experience and knowledge of the developer mindset to move the practical domain of static analysis to the next level.

About Klocwork

Klocwork® helps developers create more secure and reliable software. Our tools analyze source code on-the-fly, simplify peer code reviews, and extend the life of complex software. Over 900 customers, including the biggest brands in the mobile device, consumer electronics, medical technologies, telecom, military and aerospace sectors, have made Klocwork part of their software development process. Thousands of software developers, architects, and development managers rely on our tools every day to improve their productivity while creating better software.

IN THE UNITED STATES:
15 New England Executive Park
Burlington, MA 01803

IN CANADA:
30 Edgewater Street, Suite 114
Ottawa, ON K2L 1V8

t: 1.866.556.2967  f: 613.836.9088  www.klocwork.com

© Klocwork Inc. All rights reserved. Klocwork and Klocwork Truepath are registered trademarks of Klocwork Inc.