
A SIMD APPROACH TO LARGE-SCALE REAL-TIME SYSTEM AIR TRAFFIC CONTROL USING ASSOCIATIVE PROCESSOR AND CONSEQUENCES FOR PARALLEL COMPUTING

A dissertation submitted to Kent State University in partial

fulfillment of the requirements for the degree of Doctor of Philosophy

by

Man Yuan

August 2012

Dissertation written by

Man Yuan

B.S., Hefei University of Technology, China 2001

M.S., University of Western Ontario, Canada 2003

Ph.D., Kent State University, 2012

Approved by

Dr. Johnnie W. Baker, Chair, Doctoral Dissertation Committee

Dr. Lothar Reichel, Members, Doctoral Dissertation Committee

Dr. Mikhail Nesterenko

Dr. Ye Zhao

Dr. Richmond Nettey

Accepted by

Dr. Javed I. Khan, Chair, Department of Computer Science

Dr. Raymond Craig, Dean, College of Arts and Sciences


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
Acknowledgements
1 Introduction
2 Background Information
2.1 Flynn’s Taxonomy and Classification
2.1.1 Multiple Instruction Stream Multiple Data Stream (MIMD)
2.1.2 Single Instruction Stream Multiple Data Stream (SIMD)
2.2 Real-Time Systems
2.3 Air Traffic Control (ATC)
2.3.1 Task Characteristics
2.3.2 The worst-case environment of ATC
2.3.3 ATC Tasks
2.4 An Associative Processor for ATC
2.5 OpenMP
2.5.1 Avoid False Sharing
2.5.2 Optimize Barrier Use
2.5.3 Avoid the Ordered Construct
2.5.4 Avoid Large Critical Regions
2.5.5 Maximize Parallel Regions
2.5.6 Avoid Parallel Regions in Inner Loops
2.5.7 Improve Load Balance
2.5.8 Using Compiler Features to Improve Performance
3 Survey of Literature
3.1 Previous Work on ATC
3.2 Classical MIMD Real-Time Scheduling Theory
3.3 Other Similar Work
4 AP Solution to ATC
4.1 Overview of AP Solution
4.2 Advantages of AP over MIMD for ATC
4.3 AP properties
5 Emulating the AP on the ClearSpeed CSX600
5.1 Overview of ClearSpeed CSX600
5.2 Emulating the AP on the ClearSpeed CSX600
6 ATC System Design
6.1 ATC Data Flow
6.2 Static Scheduling
6.3 Scaling up from CSX600 to a Realistic Size AP
7 AP Solution for ATC Tasks Implemented on CSX600
7.1 Report Correlation and Tracking
7.2 Cockpit Display
7.3 Controller Display Update
7.4 Automatic Voice Advisory
7.5 Sporadic Requests
7.6 Conflict Detection and Resolution (CD&R)
7.6.1 Conflict Detection
7.6.2 Conflict Resolution
7.7 Terrain Avoidance
7.8 Final Approach (Runways)
7.9 Comparison of AP and MIMD solutions for ATC Tasks
7.10 Implementation of Algorithm Issues
8 Experimental Results
8.1 Experimental Setup
8.2 Experiment Results
8.2.1 Comparison of Performance on STI, MIMD and CSX600
8.2.2 Comparison of Performance on MIMD, CSX600 and STARAN
8.2.3 How close to the AP is our CSX600 emulation of the AP
8.2.4 Comparison of Predictability on CSX600 and MIMD
8.2.5 Comparison of Missing Deadlines on CSX600 and MIMD
8.2.6 Timings for 8 Tasks
8.2.7 Video and Actual STARAN Demos
9 Observations and Consequences of the Preceding Results
10 Conclusions and Future Work
BIBLIOGRAPHY


LIST OF FIGURES

1 SIMD Model
2 AP Architecture
3 Fork-Join Task Model
4 A Possible AP Architecture for ATC
5 CSX600 accelerator board
6 MTAP architecture of CSX600
7 Static schedule of ATC tasks implemented in AP
8 Track/Report correlation
9 Search box size for flight maneuver
10 Conflict detection
11 Timing of tracking task.
12 Timing of terrain avoidance task.
13 Timing of CD&R task.
14 Comparison of tracking task of MTI, CSX600 and STARAN.
15 Comparison of terrain avoidance task of MTI, CSX600 and STARAN.
16 Comparison of CD&R task of MTI, CSX600 and STARAN.
17 Comparison of display processing task of MTI, CSX600 and STARAN.
18 Time of tracking for number of aircraft from 10 to 384.
19 Time of CD&R for number of aircraft from 10 to 384.
20 Comparison of MTI, STARAN and Super-CSX600 with at most one aircraft per PE.
21 Predictability of execution times for report correlation and tracking task.
22 Predictability of execution times for terrain avoidance task.
23 Predictability of execution times for CD&R task.
24 Number of iterations missing deadlines when scheduling tasks.
25 Overall ATC System Design.


LIST OF TABLES

1 Timings of Associative Functions
2 Static Schedule for ATC Tasks
3 Statically Scheduled Solution Time for Worst Case Environment of ATC
4 Data Structures for Radar Reports
5 Data Structures for Tracks
6 Comparison of Required ATC Operations
7 Performance of One Flight/PE
8 Performance of Eight Tasks for Ten Flights/PE


Acknowledgements

I would like to sincerely thank my advisor, Dr. Johnnie W. Baker, for his constant support and encouragement throughout my Ph.D. study. He has been the advisor for my research work throughout these years and also the mentor for my professional career development. Without his encouragement and guidance, this dissertation would have been impossible.

My special thanks go to the many people who have helped me at various stages of this work, including Will Meilander, Frank Drews, Sue Peti, and Marcy Curtiss. In particular, Professor Will Meilander has contributed valuable information for my dissertation. He has read through all my publications and provided many useful comments.

I would also like to thank my dissertation committee members for their willingness to serve on the committee and for their valuable time.

On a personal note, I would like to express my gratitude to my parents, Zhen Yuan and Xiaohang Xu, for their love and emotional support. Finally, I would like to express my thanks to my girlfriend, Sijin Zhang, for her moral support and for the incredible amount of patience she had with me so that I could eventually realize my dream.


Abstract

This dissertation has two complementary focuses. The first is a solution to a large-scale real-time system, air traffic control (ATC), using an enhanced SIMD machine model called an associative processor (AP). The second is the comparison of this implementation with a multiprocessor implementation and the implications of that comparison. This dissertation demonstrates how one application, ATC, can be implemented more easily, more simply, and more efficiently on an AP than is generally possible on other types of traditional hardware. The AP implementation of ATC takes advantage of its deterministic hardware to use static scheduling. Our solution differs from previous ATC systems, which are designed for MIMD computers and have a great deal of difficulty meeting the predictability requirements for ATC; these requirements are critical for meeting the strict certification standards required for safety-critical software components. The proposed AP solution supports accurate predictions of worst-case execution times and guarantees that all deadlines are met. Furthermore, the software developed for the AP model is much simpler and smaller than the corresponding current ATC software. Because the associative processor is built from SIMD hardware, it is considerably cheaper and simpler than the MIMD hardware currently used to support ATC. While APs were used for ATC-type applications earlier, they are no longer available. We therefore use a ClearSpeed CSX600 accelerator to emulate the AP solution on an ATC prototype consisting of eight data-intensive real-time ATC tasks. Its performance is evaluated in terms of execution time and predictability and is compared with an 8-core multiprocessor (MP) using OpenMP. Our extensive experiments show that the AP implementation meets all deadlines, while the MP regularly misses a large number of deadlines. In addition, the AP code is similar in size to sequential code for the same tasks and avoids all of the additional support software needed on an MP to handle dynamic scheduling, load balancing, shared resource management, race conditions, false sharing, and so on. At this point, essentially only MIMD systems are built. Many of the advantages of using an AP to solve the ATC problem would carry over to other applications, and AP solutions for a wide variety of applications are cited in this dissertation. Applications that involve a high degree of data parallelism, such as database management, text processing, image processing, graph processing, bioinformatics, weather modeling, and managing Unmanned Aircraft Systems (UAS, or drones), are good candidates for AP solutions. This raises the issue of whether we should routinely consider using non-multiprocessor hardware like the AP for applications where substantially simpler software solutions will normally exist. It also raises the question of whether the use of both AP and MIMD hardware in the same system could provide more versatility and efficiency. Either the AP or the MIMD could serve as the primary system and hand off jobs it cannot handle efficiently to the other system.

CHAPTER 1

Introduction

The Air Traffic Control (ATC) system is a typical real-time system that continuously monitors, examines, and manages space conditions for thousands of flights. It processes large volumes of data that change dynamically due to reports by sensors, pilots, and controllers, and it gives the best estimate of the position, speed, and heading of every aircraft in the environment at all times. The ATC software consists of multiple real-time tasks that must be completed in time to meet their individual deadlines. By performing these tasks, the ATC system keeps timely and correct information concerning positions, velocities, contiguous space conditions, etc., for all flights under control. Since any lateness or failure of task completion could have disastrous consequences, the time constraints for tasks are critical [1–3].

In the past, solutions to this problem have been implemented on multicomputer systems, with records for the various aircraft stored in memory accessible to the processors in the system. The dispersed nature of this dynamic ATC system, and the necessity of maintaining data integrity while providing rapid access to this data by multiple instruction streams (MIS), increase the difficulty and complexity of handling air traffic control. It is difficult for these MIMD implementations to satisfy the predictability standards that are critical for meeting the strict certification standards needed to ensure safety for critical software components. Massive efforts have been devoted to finding an efficient MIMD solution to the ATC problem since 1963, but none has been satisfactory yet [4–7]. To a large degree, this is because the software used by MIMDs in solving real-time problems involves solutions to many difficult problems such as dynamic scheduling and load balancing [6]. It is currently widely accepted that large real-time problems such as ATC do not have a polynomial-time MIMD solution [8, 9]. Even though many researchers have put extensive effort into producing a reasonably good solution to the ATC problem (e.g., using heuristic algorithms), the current ATC system has repeatedly failed to meet the requirements of the ATC automation system. It has frequently been reported that the current ATC system periodically loses radar data without any warning and loses track of aircraft in the sky. This problem has even occurred with Air Force One [3–5, 7].

Instead of using the traditional MIMD approach, we implement a prototype of the ATC system on an enhanced SIMD hardware system called an associative processor (AP), or associative SIMD, where interactions are much simpler and more efficiently controlled, due in large part to avoiding the need to coordinate the interactions of multiple instruction streams. The AP supports some additional constant-time operations, e.g., broadcasting, associative searches, AND/OR reductions, and maximum/minimum reductions (assuming the word length is a constant) [4,10]; details are given in Section 2.4. It has been shown that an efficient polynomial-time solution to the real-time ATC problem can be obtained with an AP [3–7]. In addition, associative SIMD computers have been built that support the AP model, can support this polynomial-time solution to ATC, and can meet the required deadlines. The first AP system was designed by Kenneth Batcher and implemented in the Goodyear Aerospace STARAN computer, which was specifically designed to support ATC [11–13]. A second-generation STARAN called the ASPRO [14], also an AP designed at Goodyear Aerospace, was used extensively by the Navy for Airborne Air Defense Systems applications for many years [15]. Normal SIMDs, by contrast, cannot handle ATC because they cannot input and output radar sensor data efficiently. STARAN and ASPRO overcame this I/O limitation with the flip network of their Multidimensional Access (MDA) memory; details are in Section 2.4.
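To make the associative operations named above concrete, the following is a minimal sequential sketch in Python. It is an illustration only, not the STARAN or CSX600 implementation: the class `AssociativeEmulator`, the record type `Track`, and its fields are hypothetical names introduced here. On a real AP each operation is assumed to complete in constant time in hardware, whereas this emulation loops over the records.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """One aircraft record held in the local memory of one PE (hypothetical fields)."""
    flight_id: int
    altitude_ft: int


class AssociativeEmulator:
    """Sequential emulation of an AP that stores one record per processing element."""

    def __init__(self, records):
        self.records = list(records)

    def search(self, predicate):
        """Associative search: every PE tests its own record and reports a responder bit."""
        return [predicate(r) for r in self.records]

    def any_responder(self, mask):
        """Global OR reduction over the responder bits."""
        return any(mask)

    def all_respond(self, mask):
        """Global AND reduction over the responder bits."""
        return all(mask)

    def maximum(self, mask, key):
        """Maximum reduction over the responding PEs (None if nobody responds)."""
        values = [key(r) for r, m in zip(self.records, mask) if m]
        return max(values) if values else None

    def minimum(self, mask, key):
        """Minimum reduction over the responding PEs (None if nobody responds)."""
        values = [key(r) for r, m in zip(self.records, mask) if m]
        return min(values) if values else None

    def broadcast(self, mask, field, value):
        """Broadcast: the control unit sends one value to every responding PE."""
        for r, m in zip(self.records, mask):
            if m:
                setattr(r, field, value)
```

A search returns one responder bit per PE, and the reductions and the broadcast then operate only on the responders, which mirrors the search-then-act style of associative programming.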

We use the ClearSpeed CSX600 to emulate an AP and implement eight key ATC tasks for our ATC prototype, namely report correlation and tracking, cockpit display, controller display update, sporadic requests, automatic voice advisory, terrain avoidance, conflict detection and resolution (CD&R), and final approach (runway optimization). The selection of the specific tasks was based on information in the following references: the FAA Grants for Aviation Research Program Solicitation [16], the FAA’s NextGen Implementation Plan [17], and the FAA 1963 ATC Specifications [18]. Reference [18] indicates that the tasks we selected are similar to the ones selected in 1963 for implementation by industries interested in competing for the job of implementing ATC for the FAA. Reference [16] describes the FAA’s efforts to improve the aircraft capacity of the airspace while maintaining high safety standards, as well as aircraft safety technology for conflict detection and resolution. These references indicate that tasks of the type we have selected are key to ATC implementation. Further, the NextGen plan [17] discusses the new standards for ATC, including developing capabilities in traffic flow management, dynamic airspace configuration, separation assurance, super-density operations, and airport surface operations. The purpose is to achieve a safe, efficient, and high-capacity airspace system. The following subtopics in [17] address current research that needs improvement: comprehensive analysis of uncertainties in the National Airspace System; air traffic management functional allocation using advanced computing and networking; and guaranteeing air-to-air and air-to-ground safety. The types of activities discussed there will require tasks similar to the ones we have selected for implementation, which suggests that our eight ATC tasks capture most of the ATC workload and that nothing major that may impact ATC performance is missing. In our careful search of the literature on ATC, the tasks that kept recurring were all included in the tasks we chose for benchmarking our prototype.

Several of our previous papers [4–6,19] have used the AP to manage ATC computation for an ATC sector. Similar SIMD parallel approaches have been used for collision avoidance between multiple agents in real-time simulations by S. Guy et al. [20]. We used the ClearSpeed CSX600 to emulate an AP in our previous work [21]. However, the assumed maximum number of aircraft being tracked is 4000 IFR (instrument flight rules) aircraft and 10000 VFR (visual flight rules) aircraft, for a total of 14000 aircraft [21]. Because the CSX600 emulation tool can only process a small number of aircraft, an ideal AP system would have to be much larger than the CSX600. Our paper [21] implemented two ATC tasks on the CSX600, namely report correlation and tracking, and conflict detection and resolution (CD&R), and compared the performance of only one task, report correlation and tracking, on both the CSX600 and a MIMD. However, in the comparison tests, only one task was executed repeatedly, without any competition from other tasks. Also, no adjustment was included for the greater combined computational speed of the MIMD. In our most recent papers [22, 23], we not only implement the eight tasks on the CSX600, but also schedule them on both the CSX600 and an 8-core MIMD, with the fastest host-only version implemented using OpenMP [24].


Our results show that some deadlines are regularly missed by these tasks when implemented on a MIMD, while the AP implementation meets all deadlines for the tasks that can be statically scheduled. The results indicate that the AP emulation has much better scalability, efficiency, and predictability than the MIMD implementation. Moreover, the proposed AP solution supports accurate predictions of worst-case execution times and guarantees that all deadlines are met. In contrast, a MIMD can only support average-case timings, not worst-case timings, and cannot guarantee that all deadlines are always met. Also, the software used by the AP is much simpler and smaller than the corresponding current ATC software, as the MIMD software must also support additional activities such as dynamic scheduling and load balancing, which are not needed by the AP. The solution used on the AP is very similar to a sequential solution for this problem, both in style and in code size. As a result, the size of this software solution is only a very small fraction of the size of the MIMD solutions to the ATC problem. This results in a dramatic drop in the cost of both creating and maintaining this software compared to the MIMD solutions that have been given to the ATC problem. An important consequence of these features is that the V&V (Validation and Verification) process will be considerably simpler than for current ATC software. Moreover, the AP hardware is considerably cheaper, simpler, and easier to build than the MIMD hardware currently used to support ATC. Furthermore, the AP solution to ATC can be applied to many other real-time problems, dynamic database problems, Unmanned Aircraft Systems (UASs), and many other applications. Thus we should consider SIMD solutions for the many parallel computing problems for which SIMD or AP solutions can be simpler and more efficient than MIMD solutions. A large portion of our research has been presented in our most recent journal paper [23].

This dissertation is organized into ten chapters. Chapter 2 introduces background information and terminology used throughout this dissertation. Section 2.1 introduces the two most important parallel architectures: MIMD in Section 2.1.1 and SIMD in Section 2.1.2. Section 2.2 introduces the terminology of real-time processing. Section 2.3 describes the air traffic control (ATC) problem, including worst-case assumptions, constraints of the system, and examples of ATC tasks. Section 2.4 gives an overview of the associative processor (AP) and its properties, such as constant-time associative operations. The basics of OpenMP and techniques for improving the performance of OpenMP programs are introduced in Section 2.5. Chapter 3 surveys previous related research on ATC and compares the pros and cons of each approach. Chapter 4 gives an overview of the AP solution to ATC, discusses the advantages of the AP over MIMD, and discusses some issues concerning the properties of the AP. Chapter 5 gives an overview of the CSX600 architecture and programming concepts, and discusses emulating the AP on the CSX600. Chapter 6 presents the overall system design and static scheduling. Chapter 7 describes the algorithms for the eight key ATC real-time tasks implemented on the CSX600 and how they can be scaled up to a realistic-size ATC system. Experimental results are presented in Chapter 8: a comparison of the worst-case times of some of the most important ATC tasks on an 8-core MIMD, the CSX600, and the AP; a comparison of predictability and missed deadlines; and timings for the eight tasks. The observations and consequences of the preceding results are in Chapter 9. Chapter 10 gives the final conclusions and discusses some work planned for the future.

CHAPTER 2

Background Information

This chapter introduces the terminology and background information used in this dissertation. Some basic knowledge of parallel algorithms is introduced in [25–27].

2.1 Flynn’s Taxonomy and Classification

Flynn’s taxonomy is the best-known classification scheme for parallel computers, in which a computer’s category depends on the parallelism it exhibits in its instruction stream and data stream. A process consists of a sequence of instructions (the instruction stream, denoted I) that manipulates a sequence of operands (the data stream, denoted D), and the instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M). Hence Flynn’s classification results in four categories: SISD, MISD, MIMD, and SIMD [26,28]. Because SISD and MISD are not relevant to this dissertation, they are not discussed further.

2.1.1 Multiple Instruction Stream Multiple Data Stream (MIMD)

MIMD is generally considered to be the most important class and includes most computers currently being built. Processors are asynchronous, since they can independently execute different instructions on different data sets simultaneously. Communications are handled either through shared memory, in which case the machines are called multiprocessors, or by message passing, in which case they are called multicomputers [25, 26, 29, 30]. MIMDs have been considered by most researchers to be the most important and least restricted class of computers compared with SIMD computers. However, there are numerous difficult problems for MIMDs in real-time applications, some of which are NP-hard [6,8,9]; they include dynamic scheduling, load balancing, race conditions, data dependency, non-determinism, deadlocks, etc. See Garey and Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness” [8], and related references [31–34]. MIMD parallel programming is normally optimized for average-case performance, and typically the worst case is not considered [8, 9].

2.1.2 Single Instruction Stream Multiple Data Stream (SIMD)

Most early parallel computers had a SIMD-style design. Figure 1 shows the abstract model of SIMD. A SIMD has a single instruction stream (IS), also called the control unit, which stores the program to be executed. Each processing element (PE) contains an arithmetic logic unit (ALU) to perform arithmetic and logical operations, and a local memory that holds only data, not the program. The IS compiles the program and broadcasts the executable commands to the PEs. The PEs are very simple and are essentially ALUs. All active PEs execute the same instructions synchronously, but on different data [25, 26, 29, 30].
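The SIMD execution model described above can be sketched in a few lines of Python. This is a sequential illustration under stated assumptions, not a model of any particular machine: the class `SIMDMachine` and its method names are hypothetical, each PE holds a single data word, and lockstep execution is emulated by applying one broadcast instruction to every active PE before the next instruction is issued.

```python
class SIMDMachine:
    """Sequential emulation of the SIMD model: one control unit, many simple PEs."""

    def __init__(self, local_data):
        # Each PE keeps only data in its local memory (one word per PE here).
        self.memory = list(local_data)
        # Activity mask: one bit per PE; inactive PEs ignore broadcast instructions.
        self.active = [True] * len(self.memory)

    def set_mask(self, predicate):
        """Every PE evaluates the same test on its own data to decide if it stays active."""
        self.active = [predicate(x) for x in self.memory]

    def broadcast(self, instruction):
        """The control unit broadcasts one instruction; all active PEs apply it in lockstep."""
        self.memory = [
            instruction(x) if is_active else x
            for x, is_active in zip(self.memory, self.active)
        ]
```

Because there is a single stream of broadcast instructions, the control flow of a program stays in one place, and the only per-PE state besides the data is the activity bit.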

There are three basic types of parallel systems that have been called SIMD [26, 29]. The first is the traditional SIMD, which includes the Goodyear Aerospace MPP, the Thinking Machines CM-2, and the MasPar. These parallel computers perform the same operation on multiple data values simultaneously. The second type is called vector machines or vector processing machines. They involve the use of pipelined processors and include the large CRAY vector machines. The third and most recent SIMD type consists of systems that are sometimes collectively called short vector machines. They evolved from desktop computers as these became more powerful and capable of supporting real-time gaming and video programming. Our focus here is strictly on the traditional SIMD type. Of these three types, only the associative SIMDs can process sequences of both data-parallel and scalar operations efficiently and switch between them with no conversion or low-level synchronization overhead.

Figure 1: SIMD Model

MIMD computers are generally believed to be more powerful and to have a better performance/cost ratio than SIMD computers. Therefore, current research in various application areas, including real-time systems, has put a great deal of emphasis on the use of MIMDs [3,9,35–37]. As discussed in Section 2.1.1, MIMDs face many difficulties in applications. However, the use of SIMD computation can avoid essentially all of these problems [3, 6, 7, 12, 30]. Since only one control unit exists in a SIMD, there is no need for synchronization between instruction streams. The major advantage of SIMDs is their simplicity: their programs are easy to write, debug, and optimize. One of the complaints by programmers about parallel computers is that they are difficult to program. This is not true of SIMDs. Because of the synchronous nature of SIMDs, the flow of control is sequential and is at exactly one place in the program text at any time. There is no need to consider numerous process interactions. Additionally, there is only one program to write and debug, and data are usually distributed among the PEs naturally. This dissertation uses an enhanced SIMD to solve a large-scale real-time problem and shows its advantages over MIMD.

2.2 Real-Time Systems

A real-time system is distinguished from a traditional computing system in that its computations must meet specified timing requirements. A real-time system executes tasks to ensure not only their logical correctness but also their temporal correctness. Otherwise, the system fails no matter how accurate a computation result is [3,9,34–37]. A real-time task is “an executable entity of work that, at minimum, is characterized by a worst-case computation time and a time constraint or deadline” [35]. A job is “an instance of a task or an invocation of a task” [35]. When a task is invoked, a job is generated to execute this task with the given conditions. The release time of a task is the time at which the task (or a job of the task) is ready to be executed. A real-time task can be either periodic, meaning it is activated (released) at a regular interval (period), or aperiodic, meaning it is activated irregularly at some unknown and possibly unbounded interval. Aperiodic tasks have either soft or no deadlines, whereas sporadic tasks, which are also activated irregularly, have hard deadlines. Typically, real-time scheduling can be static or dynamic. Static scheduling refers to the case in which the scheduling algorithm has complete a priori knowledge of all incoming tasks and their constraints, such as deadlines, computation times, shared resource access, and future release times. In contrast, in dynamic scheduling, the system has knowledge about the currently active tasks but not about new task arrivals. Thus the scheduling algorithm has to be designed to change its task schedule over time.
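The task, job, and release-time terminology above can be captured in a small Python sketch. The names `Task`, `Job`, and `release_jobs`, and any numbers used with them, are illustrative assumptions rather than definitions taken from the cited references; the sketch assumes a periodic task whose first job is released at time 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A periodic real-time task (hypothetical representation)."""
    name: str
    wcet: float      # worst-case computation time
    deadline: float  # time constraint, measured from the release of each job
    period: float    # regular interval between releases


@dataclass(frozen=True)
class Job:
    """One invocation (instance) of a task."""
    task: Task
    release_time: float

    @property
    def absolute_deadline(self):
        # A job must finish no later than its release time plus the task deadline.
        return self.release_time + self.task.deadline


def release_jobs(task, horizon):
    """Return the jobs of a periodic task released in the interval [0, horizon)."""
    jobs = []
    k = 0
    while k * task.period < horizon:
        jobs.append(Job(task, k * task.period))
        k += 1
    return jobs
```

With complete knowledge of every `Task` ahead of time, all release times can be enumerated this way before the system starts, which is the information that static scheduling relies on.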

A system period P is the time during which the set of all tasks must be completed

[9, 35–37]. The system deadline D is the constraint time for a system period. A task

deadline d is the time constraint for an individual task. Lateness of a task is the time

elapsed between the task release and its execution. Deadlines can be hard, firm, or soft. A

hard deadline is a time constraint whose violation results in disastrous

consequences. A firm deadline is a time constraint whose violation

produces useless results but causes no severe consequences. A soft deadline is a time

constraint after which the result produced is degraded but

may still be useful for a limited time period. Because it is not tolerable to miss deadlines

in ATC systems, we consider only hard deadlines in our work. According to [9], a static

schedule can be used if the sum of the task times for the system period is less than

the system deadline D; otherwise, a static schedule is not feasible.
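
The feasibility condition quoted from [9] can be written as a short check. The following is a minimal sketch, assuming worst-case task times and the deadline D are given in the same time unit; the function name static_schedule_feasible and its parameters are illustrative and do not come from the dissertation.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the summed worst-case task times are strictly less than the
 * system deadline D (the condition quoted from [9]), and 0 otherwise.
 * The name and signature are hypothetical, chosen only for illustration. */
int static_schedule_feasible(const double *task_times, size_t n, double deadline)
{
    assert(task_times != NULL || n == 0);

    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += task_times[i];   /* accumulate worst-case time of each task */
    }
    return total < deadline;
}
```

For example, three tasks with assumed worst-case times of 0.10, 0.15 and 0.20 seconds fit a 0.5 second deadline, whereas raising the first task to 0.30 seconds makes the static schedule infeasible.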

A primary concern for a real-time system is how to schedule these tasks to ensure they

meet various constraints including deadlines. Depending on the criterion, such as

minimizing the sum of all task computation times or minimizing the maximum lateness,

different scheduling algorithms can be designed [3, 9, 34–37]. An optimal scheduling

algorithm is one that may fail to meet a deadline only if no other scheduling algorithm can

meet the deadline [34, 38, 39]. There are three scheduling approaches. The first

is the cyclic executive, which uses static analysis and static scheduling [9, 35]. All

tasks, times and priorities are given before system startup, and it is time-driven, i.e.,

schedules are computed and hardcoded before system startup. We use this approach for

our AP solution to ATC. The second approach is fixed priority scheduling, which uses static

analysis and dynamic scheduling. All tasks, times and priorities are given before system

startup, and it is priority-driven, i.e., the schedule is constructed

by the OS scheduler at run time, e.g., Rate Monotonic (RM) scheduling [38, 39]. The

third is dynamic priority scheduling, which assigns priorities based on the current

state of the system. Examples are the Least Completion Time (LCT), Earliest Deadline

First (EDF), and Least Laxity First (LLF) algorithms, where laxity is the deadline minus

the computation time minus the current time [9, 35–37].
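
To make the difference between two of the dynamic priority rules concrete, the sketch below selects the next job under EDF and under LLF, using the laxity formula given above. It is only an illustration under simple assumptions: the Job structure, the function names, and the linear scan over ready jobs are hypothetical and are not taken from the cited schedulers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical job record: absolute deadline and computation time. */
typedef struct {
    double deadline;
    double computation;
} Job;

/* Laxity as defined in the text: deadline - computation time - current time. */
double laxity(const Job *job, double now)
{
    return job->deadline - job->computation - now;
}

/* EDF: index of the ready job with the earliest deadline. */
size_t pick_edf(const Job *jobs, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (jobs[i].deadline < jobs[best].deadline) {
            best = i;
        }
    }
    return best;
}

/* LLF: index of the ready job with the least laxity at time `now`. */
size_t pick_llf(const Job *jobs, size_t n, double now)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (laxity(&jobs[i], now) < laxity(&jobs[best], now)) {
            best = i;
        }
    }
    return best;
}
```

The two rules can disagree: a job with a later deadline but a long computation time may have the least laxity and therefore be chosen by LLF while EDF picks a different job.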

Although the LCT, EDF, LLF and RM scheduling algorithms can be optimal [9, 35, 38, 39],

as real-time systems become larger and tasks become more sophisticated, a single

processor can no longer execute them while guaranteeing real-time properties. The

current trend is to use MIMD systems to schedule real-time tasks [40]. Unfortunately,

it is a long-established theoretical result that almost all real-time scheduling

problems on MIMDs are NP-complete [8, 9, 32, 39]. No optimal scheduling algorithm

has been discovered for almost all cases on MIMD systems. Researchers

have therefore been driven to use heuristics to design scheduling algorithms on MIMD systems.

However, these heuristics are usually expensive in that they consume a lot of computation

time, and sometimes additional hardware support such as a scheduling chip is needed.

Moreover, the use of heuristics causes systems to be inherently unpredictable [3, 8, 39].

2.3 Air Traffic Control (ATC)

The Air Traffic Control (ATC) system is a real-time system that continuously moni-

tors, examines, and manages space conditions for thousands of flights by processing large

volumes of data that are dynamically changing due to reports by sensors, pilots, and

controllers, and gives the best estimate of position, speed and heading of every aircraft

in the environment at all times [1–3]. The ATC software consists of multiple real-time

tasks that must be completed in time to meet their individual deadlines. By performing

various tasks, the ATC system maintains timely and correct information concerning positions,

velocities, contiguous space conditions, etc., for all flights under control. Since any lateness

or failure of task completion could cause disastrous consequences, time constraints for

tasks are critical.

Data inputs come mainly from radar reports and are stored in a real-time database.

In our working prototype of the ATC system, a set of radar reports arrives every 0.5

seconds. Multiple reports may return redundant data on some aircraft. All tasks must be

completed before new report data arrive for the next cycle.

In order to present our work more clearly, we briefly describe our analytical prototype

of the ATC system and ATC tasks in this section. ATC task characteristics and the

worst-case ATC environment, which is based on real-world applications, are listed first; then ATC

tasks and task examples are explained. Some details have been discussed in [4–6].

2.3.1 Task Characteristics

Besides time constraints, there are other assumptions for ATC tasks in our ATC

system. Before specific ATC tasks are described in Section 2.3.3, we give general char-

acteristics of an ATC task [3, 4].

• All tasks are periodic. Although each task has its own deadline, all of them must

be completed by the system deadline D, or the system period P.

• All deadlines are known at the task release time.

• Aperiodic jobs are handled by a special task in a particular time slot in every cycle.

• Tasks are independent in that there is no synchronization between them nor shared

resources. The static schedule fixes release times and deadlines for each task.

• Overhead costs for interrupt handling are included in each task cost.

• Task execution is non-preemptive.

• Task deadlines are all hard and critical. None can be preempted or deleted without

possible adverse effects.

2.3.2 The worst-case environment of ATC

The U.S. National Airspace System (NAS) is divided into 20 airspace regions called ATC en route

centers in the lower (or contiguous) U.S. states plus one each for Hawaii and Alaska.

Each center is divided into sectors. A controller may control one or more sectors in a

center. In the ATC center we consider, there are 600 controllers. The assumed maximum

number of aircraft being tracked by one air traffic control center is 4,000 controlled IFR

(Instrument Flight Rules) aircraft in the current center and 10,000 uncontrolled

VFR (Visual Flight Rules) aircraft and adjacent-center controlled IFR aircraft, for a total of

14,000 aircraft [3–5]. Because of the multiplicity of sensors, we assume there are 12,000

radar reports coming into the system per second.
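
Combining this report rate with the 0.5 second reporting cycle of the prototype gives the number of radar reports that must be handled in each system period. The helper below is a small sketch of that arithmetic; the function name reports_per_period and the use of milliseconds for the period are assumptions made for illustration.

```c
#include <assert.h>

/* Number of radar reports arriving in one period, given the arrival rate in
 * reports per second and the period length in milliseconds (hypothetical
 * helper, integer arithmetic). */
long reports_per_period(long reports_per_second, long period_ms)
{
    assert(reports_per_second >= 0 && period_ms >= 0);
    return reports_per_second * period_ms / 1000;
}
```

With 12,000 reports per second and a 500 ms period, this yields 6,000 reports per cycle in the assumed worst case.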

2.3.3 ATC Tasks

An ATC system has to process a large amount of data coming from radar sensors

and provide accurate flight information under tight time constraints. The various processing

procedures can be grouped into the following eight major tasks [4, 5]: report correlation and

tracking, cockpit display, controller display update, sporadic requests, automatic voice

advisory, terrain avoidance, conflict detection and resolution (CD&R) and final approach

(runway optimization). The 8 tasks are used to represent the benchmark of ATC, as can

be seen in references: FAA Grants for Aviation Research Program Solicitation [16], FAA’s

NextGen Implementation Plan [17], and the FAA 1963 ATC Specifications [18]. Refer-

ence [18] indicates that the tasks we selected were similar to the ones selected in 1963 for

implementation by industries interested in competing for the job of implementing ATC

for FAA. Reference [16] states the FAA’s effort to improve the capacity of the airspace

while maintaining high safety standards, aircraft safety technology for conflict detection

and resolution. These indicate that the tasks of the type that we have selected are key

to ATC implementation. Further, the NextGen plan [17] discusses the new standards

for ATC including developing capabilities in traffic flow management, dynamic airspace

configuration, separation assurance, super density operations, and airport surface opera-

tions. The purpose is to achieve a safe, efficient and high-capacity airspace system. The

following subtopics in [17] address current research that needs innovation: comprehensive

analysis of uncertainties in the National Airspace System; air traffic management

functional allocation using advanced computing and networking; guarantee safety of

air-to-air and air-to-ground. The type of activities discussed there will require use of tasks

similar to the ones that we have selected for implementation, which suggests that our eight

ATC tasks already capture most of the workload of ATC and that nothing major is

missing that may impact ATC performance. In our careful search of the literature for

publications on ATC, the tasks that kept recurring

were all included in the tasks we chose for our benchmark.

2.4 An Associative Processor for ATC

An associative processor (AP) [4, 13] is a SIMD system with several additional useful

hardware enhancements that simplify supporting the real-time ATC system. The name

“associative” is due to the computer’s ability to locate items in the memory of PEs by content

rather than location. Figure 2 shows the architecture of the AP. It has one control unit, also

called the instruction stream, and thousands of PEs (cells). Each PE has its own arithmetic

logic unit (ALU) and memory. The first AP system was designed by Kenneth Batcher

and implemented in the Goodyear Aerospace STARAN computer, which was specifically

designed to support ATC [11–13]. A second generation STARAN called the ASPRO [14],

also an AP designed at Goodyear Aerospace, was used extensively by the Navy for

Airborne Air Defense Systems applications for over ten years [15].

Figure 2: AP Architecture

The hardware enhancement that is required for an AP is a broadcast/reduction network

such as the one described in [10]. This network supports the execution of the following

operations, often called the associative operations, in constant time. A list of the associative

operations follows [10, 15, 41, 42]:

• Global MAX and MIN: finds all instances where a maximal (or minimal) value

is stored by a processor at a fixed memory location. The processors whose value

is maximal (respectively minimal) are active at the end of this operation and are

called responders. The remaining processors participating in this activity are called

non-responders.

• AND and OR: finds the global reduction of Boolean values stored in the same

memory location of each active processor.

• Associative search: finds all instances where a search pattern is matched by the

content stored by a processor at a fixed memory location. The active processors

whose content matches the search pattern are the responders and the processors

which did not match the search pattern are the non-responders.

• Any-Responder: the ANY operation determines if there is at least one responder

after an associative search.

• Pick-One: selects one responder from the set of the responding processors. It is

implemented on ClearSpeed using GET and NEXT operations.

• Broadcast data or instructions from the control unit to PEs.
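
The semantics of several of these operations can be illustrated with a sequential emulation. The sketch below is not AP code: each loop over PEs stands in for a single constant-time hardware step on the broadcast/reduction network, and all function names, the integer field type, and the choice of the lowest-numbered responder for Pick-One are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Associative search: every active PE whose field equals `pattern`
 * becomes a responder (1); all other PEs are non-responders (0). */
void assoc_search(const int *field, const int *active, int pattern,
                  int *responder, size_t n)
{
    for (size_t pe = 0; pe < n; pe++) {
        responder[pe] = active[pe] && field[pe] == pattern;
    }
}

/* Any-Responder: 1 if at least one PE responded, else 0. */
int any_responder(const int *responder, size_t n)
{
    for (size_t pe = 0; pe < n; pe++) {
        if (responder[pe]) {
            return 1;
        }
    }
    return 0;
}

/* Pick-One: index of one responder (here, the first), or -1 if none. */
long pick_one(const int *responder, size_t n)
{
    for (size_t pe = 0; pe < n; pe++) {
        if (responder[pe]) {
            return (long)pe;
        }
    }
    return -1;
}

/* Global MAX: marks every active PE holding the maximal value as a
 * responder and returns that value. Requires at least one active PE. */
int global_max(const int *field, const int *active, int *responder, size_t n)
{
    int found = 0;
    int max = 0;
    for (size_t pe = 0; pe < n; pe++) {
        if (active[pe] && (!found || field[pe] > max)) {
            max = field[pe];
            found = 1;
        }
    }
    assert(found);
    for (size_t pe = 0; pe < n; pe++) {
        responder[pe] = active[pe] && field[pe] == max;
    }
    return max;
}
```

A typical sequence is a search followed by Any-Responder and Pick-One. On a sequential machine each step costs time proportional to the number of PEs, which is exactly the cost that the AP hardware removes.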

We introduce a new term, jobset, for the AP [3, 4]. Observe that an AP is a

set processor, as each of its instructions operates on a set of data. As a result, the AP can

execute a set of jobs involving the same basic operations on different data simultaneously.

This set of jobs is called a jobset. In contrast to the common understanding of a job

as an instance of a task on a multiprocessor (MP), in the AP a task is considered to be a sequence of

jobsets.

2.5 OpenMP

OpenMP is a parallel programming model for shared memory and distributed shared memory

multiprocessors [43–46]. It consists of a set of compiler directives and library routines,

and it makes it relatively easy to create multi-threaded applications in Fortran, C and C++. In

this dissertation, we focus on the fork-join programming model employed by the OpenMP

system. To the best of our knowledge, prior work has not seriously studied the fork-join

task model in the context of real-time systems [47]. Fork-join is a popular parallel

programming paradigm employed in systems such as Java and OpenMP. The basic

fork-join tasks are shown in Figure 3.

Figure 3: Fork-Join Task Model

Each basic fork-join task begins as a single master thread that executes sequentially

until it encounters the first fork construct, where it splits into multiple parallel threads

which execute the parallelizable part of the computation. After the parallel execution

region, a join construct is used to synchronize and terminate the parallel threads, and

resume the master execution thread. This structure of fork and join can be repeated mul-

tiple times within a job execution. In this dissertation, we use the basic fork-join task

model as described above. Many real-time systems, such as radar tracking, autonomous

driving, and video surveillance, exhibit a data parallel nature that lends itself easily to the

fork-join model. As the problem sizes scale and processor speeds saturate, the only way

to meet task deadlines in such systems would be to parallelize the computation. We focus

on preemptive fixed-priority scheduling algorithms; dynamic priority scheduling for

fork-join tasks is beyond the current scope of this work. We also restrict our attention to

tasks whose parallel regions use as many threads as there are cores in our MIMD system.

Although the number of threads in a parallel region can be dynamically adjusted through the

OMP_NUM_THREADS environment variable in specific implementations of OpenMP,

we do not consider such a task model in this work.
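
A single fork-join job of the kind described above can be sketched in C with OpenMP as follows. This is a minimal illustration, not code from our implementation: the function name fork_join_job, the scaling computation, and the final summation are placeholders for the sequential and parallel segments of a real task. Compiled without OpenMP support, the pragma is ignored and the same result is produced sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* One fork-join job: a sequential segment, one parallel segment, and a
 * final sequential segment. Returns the sum of the scaled values. */
double fork_join_job(const double *x, double *y, size_t n)
{
    assert(x != NULL && y != NULL);

    /* Sequential segment executed by the master thread. */
    double scale = 2.0;

    /* Fork: loop iterations are divided among the threads of the team.
     * The implicit barrier at the end of the loop acts as the join. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        y[i] = scale * x[i];
    }

    /* The master thread resumes sequential execution after the join. */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += y[i];
    }
    return sum;
}
```

The fork and join pair may be repeated several times within one job by adding further parallel loops between sequential segments.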

While it is relatively easy to quickly write a correctly functioning program in OpenMP,

it is more difficult to write a program with good performance. There are some

techniques to improve the performance [43, 44].

2.5.1 Avoid False Sharing

An important efficiency concern is false sharing, which can also severely restrict scal-

ability. It is a side effect of the cache-line granularity of cache coherence implemented

in shared memory systems. The cache coherency mechanism keeps track of the status

of cache lines by appending “state bits” to indicate whether the data on the cache line

is still valid or is outdated. Any time a cache line is modified, the cache coherence mechanism

notifies other caches holding a copy of the same line that their copy is invalid. If data from

that line is needed, a new updated copy must be fetched. A problem is that the state bits

do not keep track of which part of the line is outdated, but indicate that the whole line is

outdated. As a result, a processor cannot detect which part of the line is still valid and

instead requests a new copy of the entire line. Consequently, when two threads update

different data elements in the same cache line, they interfere with each other. This effect

is known as false sharing.

False sharing is likely to significantly impact performance under the following con-

ditions: shared data is modified by multiple threads; the access pattern is such that

multiple threads modify the same cache lines; these modifications occur in rapid succes-

sion. An extreme case of false sharing is illustrated by the example shown in Algorithm 1.

Suppose all threads contain a copy of the vector a with 8 elements in it and thread 0

modifies a[0]. Everything in the cache line with a[0] is invalidated, so a[1], . . . , a[7] are

not accessible until the cache is updated in all threads, even though their data has not

changed and is valid. So threads cannot access a[i] for i > 0 until each of their cache

lines containing a[0], . . . , a[7] has been updated.

Algorithm 1 Example of false sharing

    #pragma omp parallel for shared(Nthreads, a) schedule(static, 1)
    for (int i = 0; i < Nthreads; i++) {
        a[i] += i;
    }

In this case, we can solve the problem by padding the array, dimensioning it

as a[n][8] and changing the indexing from a[i] to a[i][0]. Accesses to different elements

a[i][0] are now separated by a cache line, and updating an array element a[i][0] does not

invalidate a[j][0] for i ≠ j. Although this array padding works well, it is a low-level

solution that depends on the size of the cache line and may not be portable. In a more

efficient implementation, the variable a is declared and used as a private variable for

each thread instead of having thread i access the global array position a[i]. In general,

using private data instead of shared data significantly reduces the risk of false sharing.

Unlike padding, this is a portable solution.

2.5.2 Optimize Barrier Use

Even efficiently implemented barriers are expensive. The nowait clause can be used

to eliminate the implicit barrier at the end of several constructs, including the loop construct.

This must be done safely, paying particular attention to the order of reads and writes to the

same portion of memory. A recommended strategy is to first ensure that the OpenMP

program works correctly and then use the nowait clause where possible, carefully

inserting explicit barriers at points in the program where they are needed.
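
As an illustration of a safe use of nowait, the sketch below runs two loops that touch disjoint arrays, so a thread that finishes its share of the first loop may start on the second without waiting. The function name scale_and_shift and the two array updates are hypothetical.

```c
#include <assert.h>

/* Doubles every element of a[] and adds 1 to every element of b[].
 * The two loops are independent, so the barrier after the first loop
 * can be removed with nowait. */
void scale_and_shift(double *a, long n, double *b, long m)
{
    assert(n >= 0 && m >= 0);

    #pragma omp parallel
    {
        #pragma omp for nowait         /* safe: the next loop never touches a[] */
        for (long i = 0; i < n; i++) {
            a[i] *= 2.0;
        }

        #pragma omp for                /* implicit barrier kept at the end */
        for (long j = 0; j < m; j++) {
            b[j] += 1.0;
        }
    }
}
```

If the second loop read a[], the nowait clause would have to be removed or an explicit barrier inserted between the loops.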

2.5.3 Avoid the Ordered Construct

The ordered construct ensures that the corresponding blocks of code within a parallel

loop are executed in the order of the loop iterations. This construct is expensive to

implement. The runtime system has to keep track of which iterations have finished and

possibly keep threads in wait state until their result is needed, which slows program

execution. The ordered construct can often be avoided, e.g., it may be better to wait

and perform I/O outside of the parallelized loop.

2.5.4 Avoid Large Critical Regions

A critical region is used to ensure that no two threads execute a piece of code

simultaneously. It can be used when the order in which threads perform the computation is

not important. However, the more code in the critical region, the longer

threads needing this critical region will have to wait. Therefore the programmer should

minimize the amount of code enclosed within a critical region. If a critical region is very

large, program performance will become poor. An alternative is to rewrite the piece of

code and separate, as far as possible, those computations that cannot lead to data races

from those operations that do need to be protected.
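
The following sketch shows this separation on a small example. The per-element computation, which cannot cause a data race, is performed outside the critical region, and only the update of the shared maximum is protected. The function names expensive and max_of_squares are placeholders, and squaring stands in for a costlier computation.

```c
#include <assert.h>

/* Stand-in for a costly, race-free computation (hypothetical). */
static double expensive(double x)
{
    return x * x;
}

/* Returns the largest value of expensive(x[i]) over i = 0..n-1. */
double max_of_squares(const double *x, long n)
{
    assert(n > 0);

    double best = expensive(x[0]);

    #pragma omp parallel for
    for (long i = 1; i < n; i++) {
        double v = expensive(x[i]);    /* outside the critical region */

        #pragma omp critical
        {
            if (v > best) {            /* only the shared update is protected */
                best = v;
            }
        }
    }
    return best;
}
```

Placing the call to expensive inside the critical region would still be correct, but it would serialize nearly all of the work.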

2.5.5 Maximize Parallel Regions

Indiscriminate use of parallel regions may lead to suboptimal performance. There are

significant overheads associated with starting and terminating parallel regions. Large

parallel regions offer more opportunities for using data in cache and provide a bigger

context for other compiler optimizations. For example, if multiple parallel loops exist,

we must choose whether to enclose each loop in an individual parallel region or create

one parallel region encompassing all of them.
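
A sketch of the second option is given below: two work-sharing loops are enclosed in one parallel region, so the cost of starting and terminating a region is paid once rather than twice. The function name update_fields and the array updates are illustrative assumptions.

```c
#include <assert.h>

/* Increments a[] and then sets b[i] = 2 * a[i], using a single parallel
 * region that contains both work-sharing loops. */
void update_fields(double *a, double *b, long n)
{
    assert(n >= 0);

    #pragma omp parallel               /* one fork and one join for both loops */
    {
        #pragma omp for
        for (long i = 0; i < n; i++) {
            a[i] = a[i] + 1.0;
        }
        /* The implicit barrier above is required: the next loop reads a[]. */

        #pragma omp for
        for (long i = 0; i < n; i++) {
            b[i] = 2.0 * a[i];
        }
    }
}
```

Writing each loop with its own parallel for directive would give the same result but would create and terminate two parallel regions.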

2.5.6 Avoid Parallel Regions in Inner Loops

Another common technique to improve performance is to move parallel regions

out of the innermost loops. Otherwise, we repeatedly incur the overheads of the parallel

construct. For example, in the loop nest shown in Algorithm 2, the overheads of the

parallel region are incurred n^2 times. A more efficient solution is shown in Algorithm 3.

The #pragma omp parallel for construct is split into its constituent directives and the

#pragma omp parallel construct is moved to enclose the entire loop nest. The

#pragma omp for construct remains at the inner loop level. Depending on the amount of

work performed in the innermost loop, this can noticeably reduce the parallel construct overheads.

Algorithm 2 Parallel region embedded in a loop nest

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            #pragma omp parallel for
            for (int k = 0; k < n; k++) {
                { ... }
            }
        }
    }

Algorithm 3 Parallel region moved outside of the loop nest

    #pragma omp parallel
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            #pragma omp for
            for (int k = 0; k < n; k++) {
                { ... }
            }
        }
    }

2.5.7 Improve Load Balance

In some parallel algorithms, there is a wide variation in the amount of work threads

have to do. When this happens, threads wait at the next synchronization point until the

slowest thread arrives. One way to avoid this problem is to use the schedule clause with a

non-static schedule. The drawback is that the dynamic and guided workload distribution

schedules have higher overheads than the static scheme. However, if the load imbalance is

severe enough, this cost is offset by the more flexible allocation of work to the threads.

It is a good idea to experiment with these schemes, as well as with various values for the

chunk size. More generally, a rule for MIMD parallelism is to overlap

computation and communication so that the total time taken is less than the sum of the

times to do each of these. In OpenMP, a dynamic schedule can be used for this purpose

and can lead to a significant speedup: a thread performing I/O joins the others once it has

finished reading data and shares in any computations that remain at that time. It

will not cause them to wait unless they have performed all of the work by the time the

I/O is done. After this, one thread writes out the results, another thread starts reading,

and the others can immediately move on to the computation.
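
A loop with an uneven workload, where a non-static schedule may help, can be sketched as follows. In this hypothetical example iteration i performs i + 1 inner steps, so a static distribution gives the threads that receive the last iterations far more work. The function name triangular_work and the chunk size of 4 are assumptions, and the chunk size is a value to be tuned experimentally.

```c
#include <assert.h>

/* Fills row_sum[i] with 0 + 1 + ... + i and returns the sum of all rows.
 * The cost of iteration i grows with i, so the load is unbalanced. */
long triangular_work(long n, long *row_sum)
{
    assert(n >= 0);

    /* Chunks of 4 iterations are handed out to threads as they become idle. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (long i = 0; i < n; i++) {
        long s = 0;
        for (long j = 0; j <= i; j++) {
            s += j;
        }
        row_sum[i] = s;
    }

    long total = 0;
    for (long i = 0; i < n; i++) {
        total += row_sum[i];
    }
    return total;
}
```

Replacing schedule(dynamic, 4) with schedule(guided) or schedule(static) changes only how iterations are assigned, not the result, which makes it easy to compare the schemes.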

2.5.8 Using Compiler Features to Improve Performance

Compilers use a variety of analyses to determine which optimization techniques can be applied.

They also apply a variety of techniques to reduce the number of operations performed and

reorder code to better exploit the hardware. Once the correctness of a numerical result is

assured, it is worth experimenting with compiler options to squeeze out improved performance. We

used the compiler optimization flags '-mfpmath=387,sse -msse3 -ffast-math -O3' to enable

Intel’s Streaming SIMD Extensions (SSE), a SIMD instruction set extension to

Intel’s x86 architecture.

CHAPTER 3

Survey of Literature

3.1 Previous Work on ATC

The current airspace has a rigid, centralized structure, and aircraft are required to

follow predefined airways or instructions given by controllers [1, 2, 48]. The Federal

Aviation Administration (FAA) has put tremendous effort into finding a predictable and

reliable system to achieve free flight, which would allow pilots to choose the best path to

minimize fuel consumption and time delay rather than following pre-selected flight cor-

ridors [1,2,49]. This requires clear and unambiguous methods for maintaining safe sepa-

ration between aircraft. Therefore, conflict detection and resolution (CD&R) emerges as

a critical issue for the implementation of free flight. In the past, solutions to this prob-

lem have been implemented on multicomputer systems with records for various aircraft

distributed over the memory of the processors in this system. The distributed nature of

this dynamic ATC system and the necessity of maintaining data integrity while providing

rapid access to this data by multiple instruction streams (IS) increases the difficulty and

complexity of handling air traffic control. It is difficult for these MIMD implementations

to satisfy reasonable predictability standards that are critical for meeting the strict

certification standards required of safety-critical software components. Massive efforts

have been devoted to finding an efficient MIMD solution to the ATC problems for more

than forty years [4–6].

The performance of all CD&R algorithms available depends on aircraft state esti-

mation according to the comprehensive survey of Kuchar and Yang [50]. The current

ATC system is based on a surveillance system that uses data from radar measurements

to track aircraft. In the early versions of this system, human operators manually fol-

lowed the blips on the video displays, but the increasing number of aircraft necessitated

the development of automated tracking algorithms. The current algorithms in use for

ATC tracking are based on constant gain Kalman filters, known as α-β and α-β-γ

filters [51, 52]. The major problem with a single Kalman filter is that it does not

predict well when the aircraft makes an unanticipated change of flight mode such as making

a maneuver, accelerating, etc. [53–56]. Many adaptive state estimation algorithms have

been proposed [57–60]. The Interacting Multiple Model (IMM) algorithm [61, 62] runs

two or more Kalman filters, matched to different modes of the system, in parallel.

It uses a weighted sum of the estimates from the bank of Kalman filters to compute the

state estimate. IMM and its variants have been applied to single and multiple aircraft

tracking problems in [57]. However, it becomes inaccurate for tracking multiple aircraft

as the number of aircraft increases. Current MIMD implementations of this algorithm

are computationally very intensive. Hwang et al. [54] proposed that the flight mode

likelihood function can be used to improve the estimation results of the IMM algorithm.

The likelihood function uses the mean of the residual produced by each Kalman filter.

A heuristic algorithm that evaluates correlation error values has been shown to

provide better results than the Kalman filter [56].
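
For readers unfamiliar with the α-β filter mentioned above, the sketch below shows one update step of its standard textbook form for a single coordinate: the state is predicted forward with constant velocity, and the position and velocity are then corrected by fixed gains applied to the residual. It is not the tracking code of any deployed ATC system; the structure and function names and the gain values used in any call are illustrative assumptions.

```c
#include <assert.h>

/* Track state for one coordinate: estimated position and velocity. */
typedef struct {
    double x;
    double v;
} TrackState;

/* One alpha-beta filter step.
 *   z     measured position from the latest report
 *   dt    time since the previous update (must be positive)
 *   alpha fixed position gain, beta fixed velocity gain */
void alpha_beta_update(TrackState *s, double z, double dt,
                       double alpha, double beta)
{
    assert(s != 0 && dt > 0.0);

    double x_pred = s->x + s->v * dt;     /* constant-velocity prediction */
    double residual = z - x_pred;         /* measurement minus prediction */

    s->x = x_pred + alpha * residual;     /* corrected position */
    s->v = s->v + (beta / dt) * residual; /* corrected velocity */
}
```

Because the gains are constant, the filter reacts slowly when the aircraft changes flight mode, which is the weakness of single fixed-gain filters noted above.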

To the best of our knowledge, all existing conflict detection algorithms are based on

the continuous state information of the aircraft (see Kuchar and Yang [50] for a compre-

hensive survey of the CD&R algorithms). Menon et al. [63] formulate conflict resolution

as a multi-participant optimal control problem. Using parameter optimization and state

constrained trajectory optimization, they compute a conflict resolution trajectory for

two different cost functions: deviation from the original trajectory and a linear combi-

nation of total flight time and fuel consumption. Their method results in 3D optimal

multiple-aircraft conflict resolution. In general, the optimization process is

computationally intensive and is difficult to implement in real time. In [64], Krozel et al. propose

one centralized strategy that is controller-oriented and two decentralized strategies that

are user-oriented. In the centralized approach, a central agent analyzes the trajectories of

the aircraft and determines resolutions. In the two decentralized strategies, each aircraft

resolves its own conflicts as they are detected. In [49], Yang and Kuchar propose a con-

flict alerting logic based on sensor and trajectory uncertainties, with conflict probability

based on Monte Carlo simulation. Chiang et al. [65] propose CD&R algorithms from

the perspective of computational geometry. Paielli and Erzberger [66] and Prandini et

al. [67] propose analytic algorithms for computing the probability of conflict. Many of the

algorithms consider only two aircraft. For example, Krozel et al. [64] show that neither

their centralized nor their decentralized CD&R algorithms can guarantee safety for multiple

aircraft as the number of aircraft grows. Furthermore, many algorithms propose

optimization schemes that are not guaranteed to complete within real-time deadlines.

Due to the increasing number of FAA problems, the FAA is inviting proposals for new and

efficient conflict detection and resolution technologies [68].

3.2 Classical MIMD Real-Time Scheduling Theory

There is a body of classical multiprocessor real-time scheduling theory [9, 37, 40]. John

Stankovic et al. [9]: “...complexity results show that most real-time multiprocessing

scheduling is NP-hard.” Mark Klein et al. [38]: “One guiding principle in real-time system

resource management is predictability. The ability to determine for a given set of tasks

whether the system will be able to meet all the timing requirements of those tasks.”;

“...most realistic problems incorporating practical issues... are NP-hard.” Garey, Graham

and Johnson [31]: “...all but a few schedule optimization problems are considered

insoluble... . For these scheduling problems, no efficient optimization algorithm has been

found, and indeed, none is expected.”

John Stankovic et al. [9, 32, 34] and Garey et al. [8, 31] have identified a number of

difficulties in scheduling periodic real-time tasks on multiprocessors. Examples

include dynamic scheduling, load balancing, race conditions, data dependencies, shared

resource management, sorting, indexing, and cache and memory coherency problems.

Garey et al. [8, 31] have shown them to be multiprocessor NP-hard problems. In [8], a

multiprocessor is defined as a parallel computer that uses message passing or shared

memory, so it is essentially a MIMD. To avoid confusion, in this dissertation we treat

multiprocessor and MIMD as identical terms.

3.3 Other Similar Work

There is also related work outside ATC. Most results on real-time

scheduling on MIMD focus on the sequential programming model, where the problem

is to schedule many sequential real-time tasks on multiple processor cores. Parallel

programming models introduce a new dimension to this problem, where jobs may be

split into parallel execution segments at specific points [47]. S. Guy et al. [20] used

SIMD parallel approaches for collision avoidance algorithms between multiple agents in

real-time simulations. K. Park et al. [69] used CUDA to evaluate the performance of

image processing algorithms. The NVIDIA GPU has many SIMD PE groups on its chip,

and their approach shares many ideas with ours.

CHAPTER 4

AP Solution to ATC

4.1 Overview of AP Solution

Instead of a MIMD system, we implement the ATC system on an associative processor (AP), where

interactions are much simpler and more efficiently controlled, due in large part to avoiding the

need to coordinate the interactions of multiple instruction streams. Our previous papers

have detailed the solution [3–6, 19, 21, 70]. Similar SIMD parallel approaches have been

used for collision avoidance algorithms between multiple agents for real-time simulations

by S. Guy et al. [20].

All records for each aircraft will be stored in a single processor. Initially, assume each

processor stores the records for at most one aircraft, since the memory size and speed

of processors in a large SIMD are typically small due to cost restrictions. For ATC tasks,

an AP with n processors can execute n instances of the same task in essentially the same

time as it takes to execute one instance of this task. As long as there is no more than

one aircraft per processor, the running time for the AP remains essentially the same as the

number of aircraft increases.
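
The one-aircraft-per-processor layout can be pictured with the sequential emulation below. The record fields, the position extrapolation step, and the function name extrapolate_all are hypothetical and serve only to illustrate the idea: on an AP the loop over PEs does not exist, because the control unit broadcasts the instruction sequence once and every PE holding an aircraft record executes it on its own data in lockstep.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-PE aircraft record (one aircraft per PE at most). */
typedef struct {
    double x, y;      /* position */
    double vx, vy;    /* velocity */
    int in_use;       /* 1 if this PE currently holds an aircraft record */
} AircraftRecord;

/* Sequential emulation of one broadcast step: every PE that holds a record
 * advances its aircraft position by dt. On an AP, all PEs do this at once. */
void extrapolate_all(AircraftRecord *pe, size_t num_pes, double dt)
{
    assert(pe != NULL || num_pes == 0);

    for (size_t i = 0; i < num_pes; i++) {   /* implicit on the AP hardware */
        if (pe[i].in_use) {
            pe[i].x += pe[i].vx * dt;
            pe[i].y += pe[i].vy * dt;
        }
    }
}
```

In the emulation the running time grows with the number of PEs, whereas on the AP the time for this step is independent of how many PEs hold aircraft, which is the property described above.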

4.2 Advantages of AP over MIMD for ATC

Since SIMDs have only one instruction stream (IS) or control unit, control-type com-

munication between processors is completely eliminated. Communication between the

IS and processors occurs in constant time using its broadcast/reduction network. Data

communication between processors is completely deterministic and a tight upper bound

can be calculated for the worst case. As a result, a numerical worst-case time required for

communications can be accurately calculated. The AP has some important advantages

over MIMD, including low-overhead synchronization, deterministic hardware, much faster

communication, predictable worst-case running time, much wider “memory to processor”

bandwidth, and elimination of the following: race conditions, data dependencies, shared

resource management, sorting, indexing, and cache and memory coherency problems.

When solving a problem, MIMD systems often use software to solve additional prob-

lems repeatedly which is not needed in a sequential solution of the original problem.

Examples of these types of this ”additional software” are dynamic scheduling, load bal-

ancing, shared resource management, memory and cache coherency management, pre-

emption, synchronization, priority inversion handling, sorting, indexing, multi-tasking

and multi-threading management software, etc [6]. Several of these types of difficulties

were identified in scheduling periodic real-time tasks on MIMD in [47]. A number of these

are problems that have been shown to be multiprocessor NP-hard problems(e.g., [8]).

In [8], a multiprocessor is defined as a parallel computer that uses message passing

or shared memory, so it is essentially a MIMD. To avoid confusion, in this paper we will

treat multiprocessor and MIMD as interchangeable terms. SIMDs do not ordinarily

need to use this additional software. Most sequential solutions to a problem can

usually be used to create a similar AP solution with roughly the same number of lines

of code. AP solutions are often shorter and simpler than the sequential solutions, as the

constant-time associative search AP property can be used to eliminate the use of sorting

and linked lists to organize and locate items. Also, the additional AP properties often


allow simpler and more efficient solutions for problems than with a SIMD that is not an

AP.

4.3 AP properties

We discuss some issues regarding AP properties in this section. One is whether the

methods used for data transfer in the ClearSpeed emulator are similar to data transfer

between PEs and control unit memory in an AP. The issue really comes down to whether

the AP we plan to use can make the necessary data transfers in sufficient time, as there

are no transfer properties assumed for AP. According to [11, 71–78], the AP model is a

general purpose computational model and does not make any assumptions regarding how

the interconnection network is used. Often, the use of the interconnection network in an

algorithm can be avoided by using the above associative operations, leaving the running

time of this algorithm independent of the network used. Also, the AP does not make

any assumptions regarding how data is transferred from the AP to outside buffers or

between the control unit memory and the parallel memory. Clearly, these transfer times

depend on the data size, and this increases as the application size (e.g., the number of

aircraft) increases. While it is very efficient to transfer data between the mono and PE

memory in ClearSpeed, we do not know if this rate of transfer would scale if a much

larger ClearSpeed accelerator is built. We must point out the important fact that most

and probably all of the non-AP SIMDs that have been proposed and built could not

handle the ATC requirement because of I/O limitations. However, the previously built

AP machines STARAN [11–13] and ASPRO [14] overcame the I/O limitation by the

MDA (multidimensional access) memory [4, 11–13, 41] and the flip network (see Figure 4).


Figure 4: A Possible AP Architecture for ATC

The flip network is a corner turning network that provides access to a slice of memory

in a SIMD PE for I/O purposes [11, 41, 73]. The flip network in STARAN and ASPRO

was the key to providing high speed I/O movement. The placement of the flip network

between processors and their memory allowed very fast movement of data between an

outside buffer and the ASPRO/STARAN PE parallel memory. Corner-turned data and

the assignment of one record per processor allows multiple PEs to work together to

transfer large amounts of data rapidly between PEs and memories. The flip network and

MDA allow efficient movement of a record in one PE to an outside buffer. More about

this can be found in [41]. Moreover, Jerry Potter describes an alternate design in [41]

that allows multiple records to be transferred to an outside buffer. So data transfer

speed is not a bottleneck for building an associative processor to accommodate realistic

size ATC problems. See references [4, 11, 71–75, 78, 79] to find more information about

how to implement AP hardware.

Another issue involving AP properties is how the constant time operations can be

supported in an AP and how we can be sure they actually run in constant time. The ATC

problem is essentially a dynamic database problem. Much of the data in this database

changes rapidly, with many parts being updated every one-half second. In order to be able


to update and process this data, we need to have some functions that allow us to access,

update, change, and add or delete records in this dynamic database very rapidly. In order

to support these rapid database operations, we require that these operations execute in

constant time. These are the additional functions we require a SIMD to possess in order

for it to be called an associative processor. Here "constant time" is defined to be the

time that the sequential RAM model requires for a PE to access its memory, the time for

AND or OR operations, the time for addition or multiplication of word length numbers,

etc. However, there is an additional issue here that does not arise for sequential constant

time operations, namely that these functions may involve reductions of vectors with a

component value from each processor. Clearly, if the number of processors is allowed to

increase without bound, such constant time reductions are impossible. However, if we

limit the number of processors to a practical size (e.g., bound by the number of atoms

in the observable universe), it is shown in [10] that all of the constant time functions

required for an associative processor are possible. However, these functions cannot use a

typical interconnection network, as use of these will require non-constant time. Instead

these constant time functions can be supported by use of a binary (or n-ary for n > 2)

broadcast-reduction tree with nodes that have very limited computational capabilities. In

fact, this is the way these constant time operations are supported in both STARAN and

ASPRO, which use a 4-ary tree for broadcasts and another 4-ary tree for reductions.

The broadcasting/reduction network on the STARAN is constructed using a group of

resolver circuits [11,71,72,75]. It is important to keep in mind that all of the above basic

operations are implemented in hardware. These features make the AP model a unique

parallel computation model that is powerful and feasible. More information about this


can be found in [4, 5, 10, 11].
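To make the bounded-size argument concrete, the following sketch computes the depth of a k-ary reduction tree and performs a level-by-level reduction. It is only an illustration in Python: the names `tree_depth` and `tree_reduce` are hypothetical, the arity of 4 is taken from the STARAN/ASPRO description above, and the real trees are hardware circuits rather than software.

```python
def tree_depth(num_pes: int, arity: int = 4) -> int:
    """Number of levels a k-ary tree needs to combine num_pes leaf values."""
    depth = 0
    span = 1
    while span < num_pes:
        span *= arity
        depth += 1
    return depth


def tree_reduce(values, op=max, arity: int = 4):
    """Combine one value per PE level by level, as a k-ary tree would.

    `op` must accept a list (max, min, any, all and sum all qualify).
    Returns the reduced value and the number of tree levels traversed.
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        # Each tree node combines up to `arity` children in one step.
        level = [op(level[i:i + arity]) for i in range(0, len(level), arity)]
        steps += 1
    return level[0], steps
```

Under these assumptions, 14,000 PEs need a 4-ary tree of depth 7, and 96 PEs need depth 4. The number of levels grows only logarithmically with the number of PEs, which is why a fixed hardware tree can be treated as a constant-time device for any practical machine size.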

The key to AP hardware design is [76–81]: 1) high memory-to-PE bandwidth, and

2) low level synchronous operation supported by the i) elimination of branches in low

level loops, and ii) the elimination of low level barrier synchronism. In the STARAN and

the ASPRO, these abilities were supported by corner turning the data and assigning one

record per processor, the multi-dimensional array memory (and flip network) and mask

register hardware. Specifically, the mask register allows additional parallel searching and

processing on subsets of the original search with no branching for special cases. After

the search phase of a search, process, retrieve (or SPR) cycle is complete, associative

SIMDs can either continue data parallel processing or select a single record to process

sequentially with essentially no overhead. These attributes allow associative SIMDs to

process all the records in a file using data parallelism. More information about this

and other issues addressed in this section can be found in the book titled Associative

Computing by Jerry Potter [41].
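The search, process, retrieve cycle and the role of the mask can be sketched as follows. This is a sequential Python emulation with one list element per PE, not the hardware mechanism: the function names are hypothetical (modeled on the AnyResponder and PickOne operations named in this dissertation), and on an AP each function would be a single parallel step.

```python
def associative_search(records, predicate):
    """Search phase: every PE tests its own record; the result is a mask."""
    return [predicate(rec) for rec in records]


def any_responder(mask):
    """True if at least one PE responded to the search."""
    return any(mask)


def pick_one(mask):
    """Select a single responder (here the lowest-numbered PE), or None."""
    for pe, responded in enumerate(mask):
        if responded:
            return pe
    return None


def masked_update(records, mask, update):
    """Process phase: only PEs whose mask bit is set apply the update."""
    return [update(rec) if bit else rec for rec, bit in zip(records, mask)]


# Example: one altitude (in feet) per PE.
altitudes = [31000, 9000, 12000, 4000]
low = associative_search(altitudes, lambda h: h < 10000)
climbed = masked_update(altitudes, low, lambda h: h + 1000)
```

In this example the mask is `[False, True, False, True]`, so only PEs 1 and 3 change their value, and no PE takes a branch that the others do not.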

CHAPTER 5

Emulating the AP on the ClearSpeed CSX600

5.1 Overview of ClearSpeed CSX600

The ClearSpeed accelerator board shown in Figure 5 is a PCI-X card equipped with

two CSX600 coprocessors. The CSX600 board is a multi-core processor with two CSX600

coprocessors, each with 96 processing elements (PEs) connected in the form of a one-

dimensional array. At present, we are only using one of the two coprocessors in order

to obtain a more SIMD-like environment. This multi-core section is called a multi-

threaded array processor (MTAP) core, and the architecture is shown in Figure 6. The

programmer only has to provide a single instruction stream, and the instructions and

data are dispatched to the execution units that have two parts: one is the mono unit that

functions as a control unit and processes sequential instructions, and the other is the poly unit

that has 96 PEs. At each step, all active PEs execute the same command synchronously

on their individual data. Each PE has its own local memory of 6 Kbytes, a dual 64-bit

FPU, its own ALU, integer MAC, registers and I/O. The PEs operate at a clock speed

of 210 MHz. The aggregate bandwidth of all PEs is specified to be 96 Gbytes/s, which is

for on-chip memory. Since the parallel bandwidth between the PEs and their memory is

(number of PEs) × (memory-PE bandwidth of each PE), this bandwidth increases as the

number of PEs increases. This provides an extremely wide bandwidth for SIMDs with a

large number of PEs. This allows SIMDs to avoid the von Neumann bottleneck, since a

large number of PEs can access their memory in the same step without any slowdown due


Figure 5: CSX600 accelerator board

Figure 6: MTAP architecture of CSX600

to message passing or shared memory access time. Further information on the hardware

architecture can be found in the documentation [82].

The ClearSpeed accelerator provides the Cn language as the programming interface

for the CSX600 processors. It is very similar to the standard C programming language.

The main difference is that it introduces two types of variables, namely mono and poly

variables. The mono variables are equivalent to common C variables and are used by

the control unit. A poly variable is a parallel variable, with the ith value stored in the ith

processor; moreover, all these values are stored in the same memory location in their

respective processor. Further information on the associated software can be found in the

documentation [83, 84]. Here we will use three of the library functions for data transfer

on the ClearSpeed. The command memcpym2p copies from mono to poly memory and

the command memcpyp2m copies from poly to mono memory. The third one, swazzle,


exchanges data between adjacent PEs using the swazzle network, which is a ring network

connecting all the processors together. More details of usages of library functions can be

found in the documentation [84, 85].

5.2 Emulating the AP on the ClearSpeed CSX600

Because the CSX600 coprocessor is a SIMD, only the associative functions introduced

in Section 2.4 need to be emulated. These associative functions have been implemented

efficiently on the CSX600, mostly by Dr. Kevin Schaffer, in both the Cn language

introduced in Section 5.1 and the assembly language supported by the CSX600 [19, 21–23]: AND

and OR reductions across a Boolean poly variable; associative search across a poly variable;

AnyResponder (following an associative search); MAX and MIN reductions across

an integer poly variable; PickOne (following use of AnyResponder). The source code for

the ASC library for the ClearSpeed Cn language is posted at [86], and [79–81]

explain the library in more detail. To evaluate their running time, we store 30 records

in each PE and perform each associative operation once for each of these 30 records. The

timing results in microseconds (µsec or 10^-6 second) are shown in Table 1. The NA in the

table means that either they cannot be implemented in assembly or they are assembler

commands. Although they are not constant time, they are very efficient, and establish

that we can efficiently emulate an AP using CSX600. The reason that these functions are

not constant time on ClearSpeed is that they are not supported by a broadcast-reduction

network, but instead with the swazzle (or ring) network. Clearly, the time to use this

network increases linearly as the number of processors increases.
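The linear cost of a ring-based reduction can be illustrated with a small emulation. This is a sketch only, assuming one value per PE and a ring that shifts every value to the neighboring PE in one step; `ring_max` is a hypothetical name and this is not the code of the ASC library cited above.

```python
def ring_max(poly):
    """MAX reduction using only neighbor-to-neighbor shifts on a ring.

    Returns the maximum and the number of shift steps used.
    """
    n = len(poly)
    best = list(poly)        # each PE's running maximum
    travelling = list(poly)  # value currently passing through each PE
    steps = 0
    for _ in range(n - 1):
        # Shift every travelling value one position around the ring.
        travelling = travelling[-1:] + travelling[:-1]
        best = [max(b, t) for b, t in zip(best, travelling)]
        steps += 1
    return best[0], steps
```

With 96 PEs this emulation needs 95 shift steps before every PE has seen every value, and the step count grows in direct proportion to the number of PEs. A broadcast-reduction tree avoids this growth, which is the distinction drawn in the paragraph above.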


Table 1: Timings of Associative Functions

Associative function    Timing in Cn (µsec)    Timing in assembly (µsec)
max                     5.257                  3.654
min                     5.257                  3.654
AND                     7.024                  NA
OR                      7.364                  NA
associative search      29.2                   NA
any                     0.281                  NA
nany                    15.147                 8.245
get                     13.116                 7.876
next                    13.032                 7.816
broadcast               100                    NA

CHAPTER 6

ATC System Design

6.1 ATC Data Flow

The overall system design for an ATC system is shown in Figure 25. The executive

box controls the single instruction stream of AP using static scheduling. All control

paths are from ATC in the executive box to all of the tasks, e.g., report correlation

and tracking, etc. Controller input simulates sporadic requests, e.g., weather change,

controller input, etc. Radar report data are simulated by data from data lines to two

modems and transferred from the host to the CSX600 PEs. Tracks are simulated from

flight plans in the PEs. The radar reports and tracks are used for report correlation

and tracking task. The outputs of tracking task are used for cockpit display, controller

display update, terrain avoidance, conflict detection and resolution (CD&R) and final

optimization. The results of terrain avoidance, CD&R and final approach are used for

cockpit display and controller display update. The results of terrain avoidance and CD&R

are used for automatic voice advisory that transfers results to automatic voice advisory

driver in the host and produces voice output. The resolution advisories of CD&R task

are sent to controllers.

6.2 Static Scheduling

The eight ATC real-time tasks can be statically scheduled on the CSX600 using

the static schedule for the AP discussed in [4, 21–23]. The eight tasks and the periods


Period    0.5 sec    1 sec       4 sec    8 sec
1         T1         T2,T3,T4    -        -
2         T1         -           T5       -
3         T1         T2,T3,T4    -        -
4         T1         -           -        -
5         T1         T2,T3,T4    -        -
6         T1         -           -        T6
7         T1         T2,T3,T4    -        -
8         T1         -           -        T7
9         T1         T2,T3,T4    -        -
10        T1         -           T5       -
11        T1         T2,T3,T4    -        -
12        T1         -           -        -
13        T1         T2,T3,T4    -        -
14        T1         -           -        T8
15        T1         T2,T3,T4    -        -
16        T1         -           -        -

Table 2: Static Schedule for ATC Tasks

used for each are (1) report correlation and tracking is executed every 0.5 second; (2)

cockpit display, (3) controller display update and (4) sporadic requests are executed every

second; (5) automatic voice advisory is executed every 4 seconds; (6) terrain avoidance,

(7) conflict detection and resolution, and (8) final approach are executed every 8 seconds.

An 8 second period is split into 16 one-half second periods. Task 1 is executed during

each half-second period. Tasks 2, 3 and 4 are executed during the 1st, 3rd, 5th, 7th,

9th, 11th, 13th, and 15th half-second periods. Task 5 is executed in the 2nd and 10th

half-second periods. Task 6 is executed in the 6th, task 7 is executed in the 8th and

task 8 is executed in the 14th half-second period. Table 2 shows the static schedule more

intuitively.
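The schedule described above is small enough to write down as a lookup table. The following Python sketch encodes it; the names `SCHEDULE`, `tasks_in_slot` and `build_table` are hypothetical, and the slot assignments are copied from the text rather than derived by a scheduling algorithm.

```python
# Half-second slots (1-based) in which each task runs during one
# 8-second major period, taken from the schedule described in the text.
SCHEDULE = {
    "T1": range(1, 17),      # every half-second slot
    "T2": range(1, 17, 2),   # odd slots: 1, 3, ..., 15
    "T3": range(1, 17, 2),
    "T4": range(1, 17, 2),
    "T5": (2, 10),           # every 4 seconds
    "T6": (6,),              # once per 8-second major period
    "T7": (8,),
    "T8": (14,),
}


def tasks_in_slot(slot: int):
    """Return the tasks released in the given half-second slot (1..16)."""
    return [task for task, slots in SCHEDULE.items() if slot in slots]


def build_table():
    """Map each of the 16 slots to its list of tasks."""
    return {slot: tasks_in_slot(slot) for slot in range(1, 17)}
```

Because the table is fixed before run time, the executive only has to index it with the current slot number; no run-time scheduling decision is needed.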

Figure 7 illustrates the static scheduling of ATC tasks. Time slots for individual tasks

within a system major period (8 seconds) are carefully tailored. Each of them provides


Figure 7: Static schedule of ATC tasks implemented in AP

sufficient time for the worst-case execution of the complete set of system tasks in the

ATC system. The periodic tasks are run at their release times and each is completed by

its deadline.

6.3 Scaling up from CSX600 to a Realistic Size AP

We use the present ClearSpeed System to show that our ATC solution is feasible.

However, our purpose is to propose building an AP whose size is appropriate for ATC,

not a larger CSX600 board or multiple CSX600 boards because some of the disadvantages

in MIMD systems mentioned in subsection 4.2 may show up in a system with multiple

CSX600 boards. As described in Chapter 1, our ideal AP has at least 14,000 PEs with

only one aircraft per PE. Moreover, the memory size and speed of the PEs and of the

control unit can be chosen to optimize the ability of this large AP to handle the realistic

size ATC system. Although we are unable to prove it is possible to build this ideal

AP using our CSX600 simulation, STARAN and ASPRO have already met the goal

of supporting 14,000 aircraft simultaneously [4, 5, 11]. A similar processor, the MPP

[11, 71–75], with 16,384 PEs, was delivered in 1982. While the MPP was not an AP as it


did not have the hardware reduction network, this feature could easily have been included.

Even larger SIMD machines have been built. In particular, Thinking Machines CM-2

was delivered in 1987 and had 64K processors. Paracel developed a parallel processor

which was generally believed to be a SIMD that had one million processors.

Figure 25 shows the overall system design and data flow based on AP architecture.

Our solutions to ATC tasks for the CSX600 in Chapter 7 are easy to implement on

the AP with 14,000 PEs. References [4, 5, 11] give timing results for a previous AP with

16,000 PEs. For example, Table 3 in [4] shows that the STARAN AP executed a similar

set of ATC procedures in 4.52 seconds, leaving 3.48 seconds unused. That

table is simplified to be Table 3 in this paper, which provides worst case execution times

for statically scheduled ATC tasks with 14,000 total aircraft, 12,000 sensor reports/sec,

6,000 controllers and an 8 second major period. The data transfer of large records in real-time

is not their bottleneck because an AP can bring data in and out efficiently (please

see details in Section 2.4). Additional reasons that the AP model can solve the ATC

problem efficiently are its constant time associative operations, SIMD execution style,

and extremely wide memory bandwidth. These have eliminated a number of difficulties

that have to be handled in MIMD architectures. So an AP with 14,000 PEs using modern

technology can be built and can easily meet all of the deadlines, as this has already been

done in the past with older technology.


Table 3: Statically Scheduled Solution Time for Worst Case Environment of ATC

Tasks                              Period (sec)    Proc Time (sec)
Report Correlation & Tracking      0.5             1.44
Cockpit Display                    1.0             0.72
Controller Display Update          1.0             0.72
Aperiodic Requests (200/sec)       1.0             0.4
Automatic Voice Advisory           4.0             0.36
Terrain Avoidance                  8.0             0.32
Conflict Detection & Resolution    8.0             0.36
Final Approach (100 runways)       8.0             0.2
Summation of tasks in a period                     4.52
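The totals quoted for Table 3 can be checked with a few lines of arithmetic. The sketch below copies the processing times from the table and uses exact fractions to avoid floating-point rounding; the variable names are hypothetical, and the interpretation of each entry as processing time per 8 second major period follows the surrounding text.

```python
from fractions import Fraction

# Processing times (seconds per 8-second major period) copied from Table 3.
PROC_TIME = {
    "Report Correlation & Tracking": "1.44",
    "Cockpit Display": "0.72",
    "Controller Display Update": "0.72",
    "Aperiodic Requests": "0.4",
    "Automatic Voice Advisory": "0.36",
    "Terrain Avoidance": "0.32",
    "Conflict Detection & Resolution": "0.36",
    "Final Approach": "0.2",
}
MAJOR_PERIOD = Fraction(8)

total = sum(Fraction(t) for t in PROC_TIME.values())   # busy time per major period
slack = MAJOR_PERIOD - total                           # unused time per major period
utilization = total / MAJOR_PERIOD                     # fraction of the period in use
```

The entries sum to 4.52 seconds, which leaves 3.48 seconds of the 8 second major period unused, a utilization of about 56.5 percent.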

CHAPTER 7

AP Solution for ATC Tasks Implemented on CSX600

This chapter describes the algorithms for each of the eight real-time tasks, report

correlation and tracking, cockpit display, controller display update, sporadic requests,

automatic voice advisory, terrain avoidance, conflict detection and resolution (CD&R)

and final approach (runway optimization). The solutions are implemented on the CSX600

architecture, but it is easy to scale up from 96 PEs to an AP with 14,000 PEs using similar

algorithms. The best algorithm will always depend on the exact architecture of the AP.

For instance, the use of the swazzle network in CSX600 is more efficient here but in

a true AP, ordinarily each radar report would be broadcast from the host to all PEs

simultaneously, as this can be done in constant time on a true AP.

7.1 Report Correlation and Tracking

Report correlation and tracking is a task that correlates aircraft data about position

as reported by radar and the predicted data of established tracks for the aircraft in the

system [3–5]. A track is the best estimate of position and velocity of each aircraft under

observation. This task is present in many command and control real-time systems and is

executed every 0.5 second, so it is a major limitation in ATC performance. If total time

consumed is considered, this is easily the ATC task that consumes the most time, as it

is performed much more frequently than the other tasks. It is challenging because each

report has to be compared with each track, and some aircraft are changing flight mode.


The main idea of this task is from [3–5, 87]. Since all these data are stored in a

shared relational database, the input data are two database relations: aircraft data from

radar reports (Relation R) and the predicted positional data of established tracks for the

aircraft (Relation T). The correlated radar reports are used to smooth the position and

velocity of the tracks to obtain the next estimate of position and velocity. Given an

unordered set of tracks, each report record must be evaluated with every track record in

the system to assure a match (correlation) is not missed. Multiple matches are treated

differently from unique matches, and any report records that do not match a track record

are entered as new tentative tracks.

We use the SIMD architecture to perform this task. First,

we create a box for each report and each track to accommodate uncertainties of report

and track. The two database relations R and T contain information about the x and y

coordinates of points in 3-D space. The objective of the task is to determine the join of

the two relations as the intersection of two boxes, which is a many-to-many join. Each

box from R is evaluated for intersection with every box in T until all boxes from R have

been compared with all boxes in T.
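The box intersection join can be written compactly. The following Python sketch is an illustration only: the function names are hypothetical, each record is reduced to a tuple of center coordinates and box half-width, and boxes that merely touch are counted as intersecting, which is an assumption the text does not settle.

```python
def boxes_intersect(xr, yr, r, xt, yt, j):
    """True if the report box (half-width r) overlaps the track box (half-width j).

    Two axis-aligned squares overlap when their centers are no farther
    apart than the sum of their half-widths on both axes.
    """
    return abs(xr - xt) <= r + j and abs(yr - yt) <= r + j


def correlate(reports, tracks):
    """Many-to-many join: list every (report index, track index) overlap."""
    return [
        (ri, ti)
        for ri, (xr, yr, r) in enumerate(reports)
        for ti, (xt, yt, j) in enumerate(tracks)
        if boxes_intersect(xr, yr, r, xt, yt, j)
    ]
```

For example, with reports `[(0, 0, 0.5), (10, 10, 0.5)]` and tracks `[(0.8, 0.3, 0.5), (5, 5, 0.5)]`, only the first report and the first track correlate. The nested loop here is sequential; on a SIMD machine the inner loop over tracks is what the PEs perform in parallel.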

As presented in [19, 21], the data structures of R and T are shown in Table 4 and

Table 5, in which positions of next period are predicted by smoothed current positions

and velocities. Let T have columns X, Y , j, X1, Y1, and q, and R have columns X, Y , r

and k. Here (X, Y ) is the position of aircraft on the records of T or R; j and r are used

to give the sizes of boxes developed for aircrafts in T and R, respectively. (X1, Y1) in T

is the reported position for the track record that is based on the current correlated radar


Table 4: Data Structures for Radar Reports

Attribute Type Comments

report id int Report identity

r float Report box size

X float X position

Y float Y position

Hr(tk) int Altitude

Match count int Number of correlated tracks

Match id int ID of the correlated track

k int Correlation flag

Figure 8: Track/Report correlation

report. Both q and k are flags set during the correlation procedure. As shown in Figure 8,

a box is developed around each point (X, Y) in T with the four corner points (X ± j, Y ± j).

A similar box is developed around each point (X, Y) in R with the four corner points

(X ± r, Y ± r). Here r > 0 is based on uncertainties in the radar report, and j > 0 is based

on uncertainties of each track, to compensate for turning or larger errors in the radar report.

Initially j = 0.5 nautical mile for all tracks, and r = 0.5 nautical mile for all radar reports.

Each box of the radar report in R is compared with each box of the track record in T. If

an intersection is found between one report box and one track box, for example, r2 and t2

in Figure 8, then the report data is entered into X1 and Y1 of the correlated record in T

and a correlation flag is set in column k, the match count of this report is incremented,


Table 5: Data Structures for Tracks

Attribute Type Comments

ID int Flight identity

q int Track state(seven values)

C int Error measures(3 values)

j float Track box size

report id int ID of correlated report

X float Current X position

Y float Current Y position

Ht int Current altitude

Vx(tk) float Current X velocity

Vy(tk) float Current Y velocity

X1 float X position of correlated report

Y1 float Y position of correlated report

its ID is entered into the correlated track’s report id, and the ID of the track is entered

into the radar report’s match id. We calculate the distance between them, which is the

track’s shortest distance. If two tracks correlate to the same radar report, i.e., the radar

report’s match count > 1, then this radar report is discarded. If a second radar report

correlates with the same track, calculate the distance between the track and radar report

and call it the track’s current distance; if it is less than shortest distance, update the

shortest distance to be this distance, and record this report’s position as a candidate

report position for this track. Next, another radar report is broadcast to all PEs to

compare to all of the track records using the same procedure above. After all report

boxes have been compared with all track boxes, set error measure C = 1 for the tracks

that have correlated reports. The details of the smoothing process have been explained

in [4, 5, 19, 21, 87].

When all report boxes have been compared with all track boxes, a track that is not


Figure 9: Search box size for flight maneuver

produced by noise might not correlate to any reports because the flight that it corresponds

to is accelerating, turning, or maneuvering, or because of greater noise in the report. We increase

the track box size for any track that has not correlated with a radar report to increase its

probability of intersecting a radar report box, as shown in Figure 9. First we double the box

sizes of tracks that do not correlate with any reports, i.e., j = j × 2. Next, we apply the same

algorithm to compare them with uncorrelated reports whose match count is 0. The

error measure C is set to 2 for tracks that correlate with reports in this round. The process

is repeated for all unmatched radar reports, which have the "not match flag" of column

k set in R. When no intersections are found, the process is repeated with j set to three times its original value.

If there still remain unmatched radar reports, new tentative tracks are started for the

reports in order to detect arrival of any new flights. If reports are due to noise, they

usually will be dropped in a few periods (several seconds) based on further evaluation of

other radar reports. The details of this task are shown in Algorithm 4. This task is done

both accurately and within deadline because of the SIMD features of AP.
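The ring-based procedure used on the CSX600 can be emulated sequentially to check that every report meets every track. The Python sketch below is a simplification under stated assumptions: the dictionary keys and the function names are hypothetical, the shortest-distance bookkeeping and the discarding of reports that match several tracks are omitted, and the box multipliers 1, 2 and 3 are applied to the original track box size.

```python
def overlaps(report, track, scale):
    """Box test with the track half-width multiplied by `scale`."""
    reach = report["r"] + scale * track["j"]
    return (abs(report["x"] - track["x"]) <= reach
            and abs(report["y"] - track["y"]) <= reach)


def correlate_rounds(pe_tracks, pe_reports, scales=(1, 2, 3)):
    """Run the three correlation rounds; return IDs of reports left unmatched."""
    n = len(pe_tracks)  # number of PEs on the ring (96 on the CSX600)
    for error_measure, scale in enumerate(scales, start=1):
        # Only tracks and reports still unmatched at the start of a round take part.
        for tracks in pe_tracks:
            for trk in tracks:
                trk["open"] = trk["report_id"] is None
        for reports in pe_reports:
            for rep in reports:
                rep["open"] = rep["match_count"] == 0
        for _ in range(n):
            # Every PE compares the reports it currently holds with its own tracks.
            for tracks, reports in zip(pe_tracks, pe_reports):
                for rep in reports:
                    for trk in tracks:
                        if rep["open"] and trk["open"] and overlaps(rep, trk, scale):
                            rep["match_count"] += 1
                            trk["report_id"] = rep["id"]
                            trk["C"] = error_measure
            # Swazzle step: each PE passes its reports to the next PE on the ring.
            pe_reports = pe_reports[-1:] + pe_reports[:-1]
    return [rep["id"] for reports in pe_reports for rep in reports
            if rep["match_count"] == 0]
```

After n shifts the reports have visited all n PEs, so each round costs n compare-and-shift steps regardless of how the reports were initially distributed. Reports returned by `correlate_rounds` are the ones that would start new tentative tracks.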

If a track created for a potentially new aircraft does not correlate with any aircraft

after several passes through Algorithm 4, it is viewed as being due to noise and dropped.

Note that if this algorithm were to be executed on an AP, it would need to be modified so


Algorithm 4 Algorithm for Aircraft Tracking

1: Radar reports are transferred from host to mono memory, then distributed from mono to PEs.
2: for i = 1 → 96 in parallel do
3:     Boxes are created around each radar report and each track in each PE to accommodate report and track uncertainties.
4:     Check intersection of each report box with every track box in each PE.
5:     If there is an intersection, the radar report and the track are correlated. The match count of this report is incremented, which indicates that it correlates with one track, and its ID and positions are entered into the correlated track’s record.
6:     All radar reports in each PE are transferred to next PE using the ring/swazzle network, and steps 3 to 6 are repeated.
7: end for
8: After the 96 iterations, all reports have been compared with all tracks. A track that is not produced by noise might not correlate to any reports because the aircraft that it corresponds to is maneuvering.
9: Double the box sizes of tracks that have not correlated with any reports to increase their probability of intersecting a report box and repeat steps of for loop above to compare them with uncorrelated reports.
10: Triple the original box sizes of tracks that have not correlated yet, and run the algorithm again.
11: After 3 rounds, if there are still any uncorrelated reports, they are used to start new tracks.

that each radar report would be broadcast in constant time to all PEs and processed prior

to broadcasting the next radar report to the PEs. The modification of the algorithms

in this section to run on an AP will be easy but will depend on the exact architecture

of the AP. It is noted that, in the AP solution, each report record is tested with every

track record in one set operation that requires constant time. That is, the overall AP

time is O(n) where n is the number of reports in this period. On the other hand, the

MIMD process is O(n^2) because it has |R| × |T| processes, assuming O(1) processors can

access and update needed data from the dynamic database and work on this problem

simultaneously. In the worst case situation we anticipate 12,000 reports against 14,000

tracks. For the AP this is 12,000 operations which is O(n), while in the MIMD it is

R(T − 1)/2 or 8.4 × 10^7 which is O(n^2). Note that this number does not include


many operations that are needed by MIMD but not by AP: dynamic scheduling, load

balancing, data distribution, processor assignment, mutual exclusion of data access, etc.
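The operation counts quoted above follow directly from the worst-case figures. The short Python sketch below reproduces them; the variable names are hypothetical, and the pairwise formula R(T − 1)/2 is taken from the text as given.

```python
reports = 12_000   # worst-case radar reports per period (from the text)
tracks = 14_000    # worst-case tracks (from the text)

ap_steps = reports                           # one broadcast-and-compare step per report
pairwise_ops = reports * (tracks - 1) // 2   # the R(T - 1)/2 count quoted in the text
ratio = pairwise_ops / ap_steps              # pairwise comparisons per AP step
```

This gives 83,994,000 pairwise comparisons, which rounds to the 8.4 × 10^7 figure in the text, against 12,000 set operations on the AP.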

7.2 Cockpit Display

First the associative operation PickOne is used to select one aircraft. Next, the

broadcast operation is used to broadcast the x, y and altitude coordinates of the plane

picked in the previous step. For each of its aircraft records, each processor computes

the x-distance, y-distance and altitude distance between the location of its aircraft and

the location of the aircraft broadcast. Then the processor identifies the aircraft that are

approaching this aircraft. This is done by using the conflict detection algorithm covered

in Section 7.6.1 to find all aircraft that will be within 2 × 2 nm in x and y and within

1,000 feet in altitude in 2 minutes. Next, we transfer these selected aircraft’s identity, x and

y positions, altitude, velocity, heading, conflict information, etc. to the display server.

Finally, we use the conflict resolution algorithm in Section 7.6 to obtain a conflict advisory

and transfer it to the server. This algorithm uses the SIMD architecture to parallelize

the computation and to improve its efficiency.

7.3 Controller Display Update

This task is similar to cockpit display. We transfer the updated flight identity, positions,

altitude, speed, heading, etc. from PEs to the ClearSpeed server, which plays the

role of the controller display in this simulation. It uses the SIMD architecture to speed

up the display process.


7.4 Automatic Voice Advisory

Automatic Voice Advisory (AVA) automatically advises an uncontrolled flight (VFR)

of near term conditions of other aircraft and terrain by voice. This task is simulated by

having the ClearSpeed server print advisories of conflict detection and resolution, terrain

avoidance tasks, etc. For example, if there is an aircraft that is approaching the aircraft

called, the message might be "aircraft at 4 miles ahead, 4,500 feet above, in 1 minute"; if

the aircraft called is heading for terrain, the message might be "terrain, 4 miles, 3,100

feet ahead". We use an AP style of computation to do this efficiently.

7.5 Sporadic Requests

Sporadic requests include information requests or changes in data. For example,

aircraft have to avoid an area that has bad weather, aircraft make maneuvers to avoid bad

weather, or controllers make a request for runway usage, etc. This task is executed once

every second. Although the requests are not processed immediately, they are processed

very quickly. We simulate this task as follows. We first process the next unprocessed

sporadic task (assuming there is more than one). If it is to divert aircraft so they will

miss a storm area, all affected aircraft will be processed using the associative operation

PickOne to select one aircraft to redirect at a time.

7.6 Conflict Detection and Resolution (CD&R)

7.6.1 Conflict Detection

This paper considers a conflict to occur when two aircraft are predicted to be within

a distance of three nautical miles in x and y and within 1,000 feet in altitude. To

assure timely evaluation we let the detection cycle be eight seconds, and we determine


the possibility of a future conflict between any pairs of aircraft within a 5 minute "look

ahead" period (i.e., 300 seconds) [3–5].

All IFR flights (defined in Section 2.3.2) are evaluated for their future relative space

positions between each other; and each of these IFR flights is also evaluated against

all VFR flights (defined in Section 2.3.2) for their future space positions. In the worst

case situation we anticipate 4,000 IFR records and 10,000 VFR records according to

Section 2.3.2. This means that every 8 seconds we must evaluate each of the 4,000 IFR

flights with the other 13,999 IFR and VFR flights in the worst environment. The process

is essentially a recursive join on the track relation where the best estimate of each flight’s

future position is projected as an envelope into future time. The envelope has some

uncertainty in its future position that is a function of the track state. It is modified by

adding 1.5 miles to each x, y edge of the future position (to provide a 3 mile minimum

miss distance) and 1, 000 feet in the altitude (to provide a 2, 000 feet minimum miss

height). For each future space envelope, its data is broadcast to all remaining PEs first.

Then an intersection of this envelope with every other space envelope in the environment

can be checked simultaneously in constant time. First, all future flight envelopes are

generated simultaneously, which takes O(1) steps. Then, the first ”trial envelope” of

the 4, 000 IFR future flight envelopes is compared for possible conflict with all the other

13, 999 flight envelopes for a look-ahead period of 5 minutes. Thus the equivalence of

maximum 13, 999 jobs is completed simultaneously in this jobset. This occurs because

each of the 14, 000 records in the track table is simultaneously available to each of the

14, 000 PEs that are active in the AP. The next operation selects the second trial envelope

and repeats the conflict tests against the remaining 13, 998 tracks. When a trial envelope

57

has been tested, it is marked ”done”, and future trial envelopes will exclude all prior trial

tracks. When the last of the 4, 000 trial envelopes is tested, the conflict detection will

have finished.

We implement the algorithm on the CSX600 to emulate the AP solution. The input data comes from the track records in the PEs. We copy each track's ID, 3D position Xt, Yt and Ht, and velocity Vxt and Vyt at time t to the variables ID, Xc, Yc, Hc, Vxc and Vyc, respectively, in the poly structure trial in the PEs. We initialize each trial's time_till, which is the aircraft's earliest collision time with any other aircraft, to 300.00. The intuitive idea is to compare each trial aircraft with all the other ones, but we use the CSX600 architecture to parallelize the computation. The details of the conflict detection algorithm are shown in Algorithm 5 (see Figure 10), known as Batcher's algorithm.

Algorithm 5 Algorithm for Conflict Detection

 1: for i = 1 → 96 in parallel do
 2:   if for each trial and track record in each PE, their flight IDs are different and altitudes are within 1000 feet then
 3:     Project their positions into 5 minutes ahead, add 1.5 to each x and y coordinate to provide a 3.0 minimum miss distance in each dimension. The x dimension case is shown in Figure 10.
 4:     Calculate the min_x, max_x, min_y and max_y for minimum and maximum intersection times in x and y dimensions, as shown in equations 1, 2, 3 and 4.
 5:     Find the largest minimum time time_min and smallest maximum time time_max across the two dimensions using equations 5 and 6.
 6:     If time_min is less than time_max, there is a potential conflict between the aircraft whose ID is trial.ID and another aircraft whose ID is track.ID.
 7:     If time_min is less than trial.time_till, then trial.time_till is updated to time_min.
 8:   end if
 9:   All trial records in each PE are passed along the swazzle/ring network to the next PE and the above steps are repeated to calculate trial.time_till(trial, track).
10: end for
11: After 96 iterations, all trial records have been compared with all track records. The time_till of each trial is its soonest collision time with another track.
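The claim in step 11, that 96 ring shifts bring every trial record past every track record, can be checked with a small emulation. This is a sketch assuming one trial and one track per PE and a one-directional shift; the function name is hypothetical and the real swazzle network is not modeled.

```python
def ring_all_pairs(num_pes):
    """Emulate the ring pass with one trial and one track record per PE.

    Returns the set of (trial, track) pairs that met in some PE.
    """
    tracks = list(range(num_pes))   # track i stays in PE i
    trials = list(range(num_pes))   # trial i starts in PE i
    met = set()
    for _ in range(num_pes):
        # Every PE compares its resident track with the visiting trial.
        for pe in range(num_pes):
            met.add((trials[pe], tracks[pe]))
        # Shift all trials one PE along the ring (the last PE wraps to the first).
        trials = [trials[-1]] + trials[:-1]
    return met


pairs = ring_all_pairs(96)
```

The resulting set holds all 96 × 96 combinations, including each record meeting its own copy, which the ID test in step 2 of the algorithm filters out.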


Figure 10: Conflict detection

The following formulas are used in Algorithm 5:

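\[ \mathrm{min\_x} = \frac{|\mathrm{trial.Xc} - \mathrm{track.Xt}| - 3}{|\mathrm{trial.Vxc} - \mathrm{track.Vxt}|} \tag{1} \]

\[ \mathrm{max\_x} = \frac{|\mathrm{trial.Xc} - \mathrm{track.Xt}| + 3}{|\mathrm{trial.Vxc} - \mathrm{track.Vxt}|} \tag{2} \]

\[ \mathrm{min\_y} = \frac{|\mathrm{trial.Yc} - \mathrm{track.Yt}| - 3}{|\mathrm{trial.Vyc} - \mathrm{track.Vyt}|} \tag{3} \]

\[ \mathrm{max\_y} = \frac{|\mathrm{trial.Yc} - \mathrm{track.Yt}| + 3}{|\mathrm{trial.Vyc} - \mathrm{track.Vyt}|} \tag{4} \]

\[ \mathrm{time\_min} = \max\{\mathrm{min\_x}, \mathrm{min\_y}\} \tag{5} \]

\[ \mathrm{time\_max} = \min\{\mathrm{max\_x}, \mathrm{max\_y}\} \tag{6} \]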
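As a worked illustration, equations 1 to 6 can be evaluated for a single trial/track pair as sketched below. The code follows the equations literally, including the absolute values, so it does not distinguish approaching from receding aircraft. The dictionary field names and the handling of zero relative speed are assumptions made here; the source does not specify that case.

```python
import math

SEPARATION_NM = 3.0  # minimum miss distance used in equations (1)-(4)


def axis_window(pos_trial, pos_track, vel_trial, vel_track, sep=SEPARATION_NM):
    """Equations (1)-(4) for one axis: (minimum, maximum) intersection time."""
    gap = abs(pos_trial - pos_track)
    closing = abs(vel_trial - vel_track)
    if closing == 0.0:
        # Assumption (not in the source): with no relative motion the gap
        # never changes, so the window is either always open or empty.
        return (0.0, math.inf) if gap <= sep else (math.inf, -math.inf)
    return (gap - sep) / closing, (gap + sep) / closing


def conflict_window(trial, track):
    """Equations (5)-(6): return (time_min, time_max, potential conflict?)."""
    min_x, max_x = axis_window(trial["x"], track["x"], trial["vx"], track["vx"])
    min_y, max_y = axis_window(trial["y"], track["y"], trial["vy"], track["vy"])
    time_min = max(min_x, min_y)
    time_max = min(max_x, max_y)
    return time_min, time_max, time_min < time_max


# Hypothetical pair: positions in nm, velocities in nm per time unit.
trial = {"x": 0.0, "y": 0.0, "vx": 1.0, "vy": 0.5}
track = {"x": 13.0, "y": 7.0, "vx": -1.0, "vy": -0.5}
result = conflict_window(trial, track)
```

For this pair the x window is (5, 8) and the y window is (4, 10), so time_min = 5 and time_max = 8, and a potential conflict is reported because time_min is less than time_max.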

In the AP solution, each IFR record is tested against every other IFR record and every VFR record in one set operation that requires constant time, as described above. So the overall AP process requires at most 4,000 operations, which is O(n) steps, where n is the number of IFR flights in this period. On the other hand, the MIMD process is O(n^2) because it has IFR × (IFR − 1)/2 + IFR × VFR processes, assuming O(1) processors can access and update needed data from the dynamic database and work on this problem simultaneously. For the MIMD this is 4,000 × 3,999/2 + 4,000 × 10,000 = 47,998,000, which is O(n^2). The AP will complete the entire set of jobs in at most 4,000 steps due to the simultaneous execution of all jobs in each jobset. Note that this count does not include many operations that are needed by MIMD but not by the AP: dynamic scheduling, load balancing, data distribution, processor assignment, mutual exclusion of data access, etc.
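The two counts can be reproduced by evaluating the expression IFR × (IFR − 1)/2 + IFR × VFR directly and by enumerating the trial-envelope procedure on a small case. This is a sketch; the function names are hypothetical.

```python
def mimd_pair_checks(num_ifr, num_vfr):
    """IFR-IFR pairs plus IFR-VFR pairs, each checked once."""
    return num_ifr * (num_ifr - 1) // 2 + num_ifr * num_vfr


def pairs_by_enumeration(num_ifr, num_vfr):
    """Count pair tests trial by trial, excluding trials already marked done."""
    total = 0
    for trial in range(num_ifr):
        remaining_ifr = num_ifr - 1 - trial
        total += remaining_ifr + num_vfr
    return total


def ap_set_operations(num_ifr):
    """One constant-time set operation per IFR trial envelope."""
    return num_ifr


worst_case_pairs = mimd_pair_checks(4000, 10000)
worst_case_sets = ap_set_operations(4000)
```

With 4,000 IFR and 10,000 VFR flights this gives 47,998,000 pairwise tests against 4,000 set operations.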

7.6.2 Conflict Resolution

We use the SIMD architecture of the CSX600 to do conflict resolution accurately and efficiently. Conflict resolution occurs as quickly as possible after conflict detection has been executed, so that each aircraft has an updated time_till value. In practice, conflicts should be reasonably rare. First, we find the aircraft with the shortest conflict time. This is the trial aircraft that will make the initial heading change. Next, the PEs evaluate in parallel different trial trajectories for the trial aircraft. These trajectories are created by changing the direction of the current trajectory of the trial aircraft both clockwise and counterclockwise in increments of 5 degrees, up to a maximum change of 30 degrees. The various trajectories are rotated through all processors using the ring (or swazzle) network, and the conflict detection algorithm is used to determine the time before each trajectory collides with another aircraft. Finally, the trajectory with the maximum collision time is selected as the best heading change that the trial aircraft can make. If more than one best heading change is found, the one with the minimal degree is used. If two have the same degree of change (e.g., +10 and −10 degrees), one of them is chosen arbitrarily. Although this procedure is not guaranteed in theory to resolve every conflict, it works very well in our simulation: in our tests, all conflicts were resolved after several rounds. In an actual implementation, any unresolved conflicts could be resolved by changing the altitude of a plane that still has a conflict after Algorithm 6 is executed.
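The candidate generation and the selection rule described above can be sketched as follows. The sign convention (positive angles counterclockwise) and the function names are assumptions made for illustration, and the collision times in the example are hypothetical rather than measured.

```python
import math


def candidate_heading_changes(step=5, limit=30):
    """Return [5, -5, 10, -10, ..., 30, -30] in degrees."""
    changes = []
    for magnitude in range(step, limit + 1, step):
        changes.extend([magnitude, -magnitude])
    return changes


def rotate_velocity(vx, vy, degrees):
    """Rotate a velocity vector; positive angles are counterclockwise
    (a convention assumed here, not taken from the source)."""
    a = math.radians(degrees)
    return (vx * math.cos(a) - vy * math.sin(a),
            vx * math.sin(a) + vy * math.cos(a))


def best_heading_change(collision_times):
    """Pick the change with the latest collision time; ties go to the
    smallest turn, then to whichever candidate was listed first."""
    return max(collision_times,
               key=lambda change: (collision_times[change], -abs(change)))


changes = candidate_heading_changes()
# Hypothetical collision times (seconds) for four of the candidates.
times = {5: 120.0, -5: 300.0, 10: 300.0, -10: 300.0}
best = best_heading_change(times)
```

Here three candidates reach the full 300 second look-ahead, and the smallest turn among them (−5 degrees) is selected.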

The algorithm for conflict resolution is described in Algorithm 6. The input data are the track and trial records in the PEs. Each PE can have 1 to 17 track and trial records. Algorithm 6 is based on the CSX600, not an AP. The AP could handle collision avoidance in a much simpler manner, due to its ability to do a constant-time broadcast [3–6]. For an AP, conflict resolution can be included as part of conflict detection. Proceeding one aircraft at a time, each selected aircraft's trajectory information is broadcast (in constant time) to all other aircraft, and any potential conflict is immediately corrected as follows. The heading of the selected aircraft is altered by an increment of perhaps 10 degrees and immediately rechecked for a conflict. This procedure continues until a conflict-free path with respect to all other aircraft is found, with the heading being altered by a maximum of 30 degrees. This method is more efficient on an AP since the broadcast of one aircraft's trajectory information can be done in constant time.

Algorithm 6 Algorithm for Conflict Resolution

 1: In parallel, find the minimum time_till of all trial records sequentially in each PE.
 2: Compute the global minimum time_till by taking the minimum of the local time_till value in each PE.
 3: Use the pick-one command to select a processor whose local time_till is equal to the global time_till, and transfer the trajectory record of an aircraft in this PE whose time_till is minimal to the mono memory.
 4: Create an array projectedpath of 12 records (projectedpath[0], ..., projectedpath[11]) in mono memory, copy the best trial aircraft's ID and positions to each of these records, and initialize collision_time, the soonest collision time with other aircraft, to ∞.
 5: Each projectedpath[i] represents a path where the best aircraft makes a different heading change, from 5 to 30 degrees in increments of 5 degrees, alternating between left and right (e.g., 5, −5, 10, −10 degrees, etc.). This allows numerous different possible paths for the trial aircraft to be evaluated in parallel.
 6: Transfer projectedpath[0], ..., projectedpath[11] from mono memory to the PEs in parallel, with each PE receiving one projectedpath[i].
 7: for i = 1 → 96 do
 8:   Each projectedpath is compared to all aircraft records in each PE with a different ID (so that the trial aircraft will not be compared to itself) unless their altitudes differ by more than 1000 feet.
 9:   Calculate minimum and maximum intersection times in x and y dimensions using the earlier conflict detection equations 1, 2, 3 and 4.
10:   Use equations 5 and 6 to get time_min and time_max. If time_min is less than time_max, there is a potential conflict between the aircraft whose ID is the trial ID and this path.
11:   Check whether time_min is less than the collision_time of this projectedpath; if so, this projectedpath.collision_time is updated to time_min. If no conflict is found, the time is not updated and nothing needs to be done.
12:   The projectedpath records in each PE are then passed to the next PE along the ring network to compare with the records in the neighbor PE.
13: end for
14: After 96 iterations, all projectedpath records have been compared to all the other aircraft.
15: Find the maximum projectedpath.collision_time across all PEs. This path is the best scenario.
16: Change the best trial aircraft's x and y velocity according to the best scenario path, display the resolution advisory and change the flight plan in the host.

7.7 Terrain Avoidance

A terrain record is a box-shaped outline that encloses a terrain feature, together with the feature's height; e.g., a TV tower is a 1.0 by 1.0 nm box with a height equal to 3,100 feet. All terrains and tracks are entered in each PE. Terrain avoidance is as challenging as the report correlation and tracking task. The terrain avoidance algorithm in Algorithm 7 is similar to the conflict detection algorithm. The challenge is its computational intensity, and we use the CSX600 architecture again to speed up this computation.

Algorithm 7 Algorithm for Terrain Avoidance

 1: for i = 1 → 96 do
 2:   if for each terrain and track record in each PE, the track record's height is lower than the terrain record's then
 3:     Project the track record's position to 2 minutes in the future, add 1.5 to each x and y edge of the future position to provide a 3.0 minimum miss distance. The terrain records are 1.0 by 1.0 nm boxes.
 4:     Calculate the minimum and maximum intersection times in both x and y dimensions.
 5:     Set time_min to be the larger of the two minimum intersection times in the x and y dimensions from step 4. Similarly, set time_max to be the smaller of the two maximum intersection times in the x and y dimensions from step 4.
 6:     if time_min < time_max then
 7:       There is a potential conflict between the track and the terrain.
 8:     end if
 9:   end if
10:   All track records in each PE are passed to the next PE and steps 2 to 8 are repeated.
11: end for
12: After 96 iterations, all track records have been compared with all terrain records for terrain avoidance.

7.8 Final Approach (Runways)

The final approach task is to optimize runway usage. Each flight has a flight plan that specifies its departure terminal, planned departure time, destination terminal, and planned arrival time. The runways in the region being managed by an ATC system can be distributed among the processors, with each processor managing the information for the runways assigned to it. Here, we assume that there are 96 runways in the sector being managed by this ATC system and assign one runway to each processor. The PE assigned to a runway collects the aircraft departure times from this runway and the arrival times to this runway and sorts the times. The PEs then instruct the aircraft to increase or decrease their speed to optimize runway usage and fuel cost. This last step is currently done manually by controllers.

7.9 Comparison of AP and MIMD solutions for ATC Tasks

We assume that there is a different PE for each flight and that the data for each of the n flights is stored in its PE. We apply static scheduling in our AP solution to the ATC problem, which is fundamentally different from the heuristic scheduling normally used in MIMD solutions [3–5]. Static scheduling is possible for three reasons: the simultaneous execution of thousands of jobs in each jobset, the wider data access bandwidth (a 200 to 500 times increase), and the ability to predict the worst-case execution time for the AP, all of which have been explained in section 2.4. In contrast, like most other real-time systems, the ATC problem cannot be efficiently scheduled using MIMD. One of the main reasons is that an MIMD approach uses dynamic scheduling to schedule all the tasks. The data keeps changing, so the MIMD system cannot determine, a priori, how many records will actually participate in the computation of a task or how much actual computation time a task will require in a cycle. In the AP, since tasks are executed as jobsets, there is no need to differentiate between a jobset with one record and a jobset with thousands of records, since both require the same amount of time. Thus, the number of records in a jobset is simply a "don't care" parameter. All set operation times are based on the worst-case assumption. This makes it possible to schedule ATC tasks statically in the AP solution.

Table 6 shows the comparison of the complexities of AP and MIMD solutions for ATC tasks [4, 5]. Note that we do not include the MIMD software for dynamic scheduling, load balancing, data management overhead, etc., and we assume that the number of processors in MIMD is much smaller than the number of PEs in AP.

Table 6: Comparison of Required ATC Operations

Tasks                                  MIMD      AP
Report Correlation & Tracking          O(n^2)    O(n)
Track Smooth and Predict               O(n)      O(1)
Flight Plan Update and Conformance     O(n)      O(1)
Cockpit Display                        O(n^2)    O(n)
Controller Display Update              O(n^2)    O(n)
Aperiodic Requests (200/sec)           O(n)      O(1)
Automatic Voice Advisory               O(n)      O(1)
Terrain Avoidance                      O(n^2)    O(n)
Conflict Detection                     O(n^2)    O(n)
Conflict Resolution                    O(n^2)    O(n)
Final Approach                         O(n^2)    O(n)
Coordinate Transform                   O(n)      O(1)

Furthermore, the MIMD solutions include the costs of running a dynamic scheduling algorithm, resource sharing due to concurrency, processor assignment, data distribution, concurrency control, mutual exclusion of data access, indexing and sorting, etc. [6]. These costs not only complicate ATC software and algorithm design, but also dramatically increase the size, complexity, and processing time of the system. On the other hand, our AP approach provides a much simpler solution to the ATC problem: the static scheduling assures the predictability of system performance, and the design procedure has been greatly simplified by the table-driven scheduler shown in Table 2 and Figure 7. The static schedule is both reliable and adaptable and can easily be modified to incorporate system changes as needed.

7.10 Implementation of Algorithm Issues

One issue that needs attention is the implementation of these algorithms on real machines. Why does conflict detection have 96 iterations? Why are there 96 runways in final approach? If we replace 96 runways with 100, will our approach still outperform the MIMD approach? Since our ClearSpeed accelerator has 96 processors on a SIMD chip, the execution of ATC tasks is faster if at most one record of each type is stored on each processor. For example, if there are at most 96 aircraft being tracked at any time, then these can be processed simultaneously in data-parallel fashion. However, if there are 100 planes being tracked, then at least 4 processors will have to manage the records for two planes each. When executing the tracking algorithm, all PEs will first check their first plane's location against a radar location, and then the four PEs managing two aircraft records will check their second plane's location against the radar location. While these 4 processors check their second plane's location, the other 92 processors without a second aircraft record are inactive. The result is that having even one processor contain two aircraft records will roughly double the time required to process the aircraft records. Likewise, having two runway records stored on one processor will typically double the amount of time needed to process the runway records. It is more efficient to have at most one record of each type on each processor so that all records of the same type can be processed simultaneously. Of course, if the processors are fast enough and have sufficient memory, then it may be reasonable to assign more than one record of each type to processors. More generally, if the maximum number of aircraft records in any processor is k, then the time to execute the tracking algorithm will be about k times larger than if there were at most 1 record per processor. As a result, when a new aircraft record is created, it should be assigned to a processor with a minimum number of current records of that same type. This can be easily and efficiently accomplished using the AP associative function MIN: each processor can keep track of the number of records of each type that it stores, and the constant-time functions can be used to quickly locate a processor with a minimum number of records of that type where the new record can be stored.
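The allocation rule and its effect on the number of sequential passes can be sketched as follows. This is a sequential emulation: Python's `min` and `list.index` stand in for the constant-time associative MIN and responder selection, and the function names are hypothetical.

```python
def assign_record(counts):
    """Place a new record on a PE currently holding the fewest records."""
    fewest = min(counts)           # emulates the associative MIN
    pe = counts.index(fewest)      # emulates picking one responder
    counts[pe] += 1
    return pe


def passes_needed(num_records, num_pes=96):
    """Sequential passes per task when records are placed with assign_record.

    The value is the maximum number of records on any PE, i.e. the factor k
    by which the per-task time grows.
    """
    counts = [0] * num_pes
    for _ in range(num_records):
        assign_record(counts)
    return max(counts)


one_pass = passes_needed(96)      # one record on every PE
two_passes = passes_needed(100)   # four PEs hold a second record
```

With this placement the number of passes equals the ceiling of the record count divided by 96, so 100 records already require two passes even though only four PEs hold a second record.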

CHAPTER 8

Experimental Results

This chapter describes the results of a set of preliminary experiments that were conducted to achieve four different goals. First, the experiments provide a proof of concept for the proposed ATC system implementation based on the CSX600. Second, the performance and scalability of the proposed approach are evaluated by comparing the CSX600-based implementation with the fastest host-only version using OpenMP [24] on a state-of-the-art multiprocessor server system featuring a total of 8 cores. Similar experiments using OpenMP have been used in image processing algorithms on GPUs [69]. Third, these experiments provide some initial evidence for the claim that the proposed AP-based ATC system implementation exhibits greater efficiency and a much higher degree of predictability than the MIMD-based solution. Fourth, we show that our model of ATC, which consists of eight selected tasks, can meet the deadlines for the hard real-time ATC tasks.

8.1 Experimental Setup

We are creating a solution for a prototype of ATC, since our implementation cannot manage the number of aircraft in a real-world situation. In order to have information about flights that we can control, we simulate the real-world situation by generating aircraft flights in a three-dimensional airspace of 1,024 by 1,024 nautical miles (nm) and 1,000 to 10,000 feet in altitude. The initial positions and realistic velocities of the aircraft are generated randomly, and the trajectories of aircraft consist of a constant velocity mode and a coordinated turn mode. A random amount of error is added to the location of each plane to create the radar reports. Since we control this entire process, we can generate different numbers of aircraft to test algorithms and to test the limits on the number of aircraft that can be processed fast enough to meet deadlines. Unlike with live FAA flight data, when two aircraft are on a collision course, we can alter the flight path of one aircraft to eliminate this problem.

The emulation of the AP solution of ATC uses the ClearSpeed CSX600 accelerator board and is implemented in the Cn language as described in Section 5.1. The alternative implementation was on an Intel dual-processor Xeon E5410 quad-core 2.33 GHz system with 32 GB of main memory and 2×6 MB of L2 cache (for each CPU). This system has a total of 8 cores. The implementation was done in C using the gcc compiler version 4.1.3. The two machines have similar computing power, since their peak performance on some embarrassingly parallel computations [46], e.g., the Mandelbrot set [43, 83], is almost the same, so our comparison is fair. In this section, MIMD will usually be used to refer to this specific machine. Occasionally, it will refer to a generic MIMD, but these instances should be clear from the context. A single-threaded implementation on a single core of the multicore is called STI. We denote the multi-threaded implementation on the multicore by MTI. We used the compiler optimization flags '-mfpmath=387,sse -msse3 -ffast-math -O3' to enable Intel's Streaming SIMD Extensions (SSE) for both the single-threaded and the multi-threaded version of the code. We have carefully tuned the OpenMP codes to achieve good performance, e.g., to avoid false sharing and cache ping-pong, maximize parallel regions, ensure the efficient use of cache, and avoid poor load balance and high lock contention. The threads were specifically designed to avoid locking whenever possible. Additional information about the techniques used to obtain highly efficient OpenMP programs is given in Section 2.5.

8.2 Experiment Results

8.2.1 Comparison of Performance on STI, MIMD and CSX600

We scheduled the tasks of report correlation and tracking, terrain avoidance and CD&R on both machines. Report correlation and tracking is executed every 0.5 second, terrain avoidance every 1 second, and the CD&R task every 2 seconds. The deadline for each task occurs at the end of each of its periods. None of the tasks can start executing before its release time, and all of them must complete their executions before their deadlines. Each approach was executed for a varying number of aircraft, ranging from 96 to 1824 in increments of 96. For each number of aircraft in the given range, each approach was executed 100 times. The major period is 2 seconds and contains one CD&R period, two terrain avoidance periods, and four report correlation and tracking periods. If any of the three tasks misses a deadline during a major period, we say that the execution has missed its deadline during this period.
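The structure of the major period, and the rule that one late job makes the whole major period count as missed, can be illustrated with a small sketch. The non-preemptive run-in-release-order policy and the execution times below are simplifying assumptions for illustration only; they are not the schedule or the measurements used in the experiments.

```python
from functools import reduce
from math import gcd

# Task periods from the text, in milliseconds.
PERIODS_MS = {"tracking": 500, "terrain": 1000, "cdr": 2000}


def major_period(periods):
    """Least common multiple of all task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods.values())


def jobs_per_major_period(periods):
    """Number of releases of each task within one major period."""
    major = major_period(periods)
    return {task: major // period for task, period in periods.items()}


def missed_tasks(exec_ms, periods):
    """Run jobs back to back in release order (shortest period first on ties)
    and return the tasks with at least one job finishing after its deadline."""
    major = major_period(periods)
    jobs = sorted(
        (release, periods[task], task)
        for task in periods
        for release in range(0, major, periods[task])
    )
    clock, missed = 0, set()
    for release, period, task in jobs:
        clock = max(clock, release) + exec_ms[task]  # no start before release
        if clock > release + period:                 # deadline = end of period
            missed.add(task)
    return sorted(missed)


# Hypothetical execution times in milliseconds.
fast = {"tracking": 100, "terrain": 100, "cdr": 200}
slow = {"tracking": 300, "terrain": 300, "cdr": 300}
counts = jobs_per_major_period(PERIODS_MS)
```

In this sketch the major period is treated as missed whenever `missed_tasks` returns a non-empty list.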

We execute the three tasks under all three approaches: CSX600, MTI and STI. The comparisons of maximum execution times are shown in Figure 11, Figure 12 and Figure 13. From these results we can see that MTI takes more time than the CSX600, and that the additional time required for the MTI implementation increases more rapidly than for the CSX600 as the number of aircraft increases. This is because the MIMD solution has more overhead and higher costs arising from inconsistency, dynamic scheduling, load balancing, etc. Because the STI times are so much larger than the other two, it is very hard to distinguish the bottom two curves in these three figures. All of these graphs show a significant speedup of the MTI implementation over the sequential STI implementation, which indicates that the parallel code for MTI is well written. Section 8.2.2 shows the difference between MTI and SIMD more clearly in its figures.

Figure 11: Timing of tracking task.

Figure 12: Timing of terrain avoidance task.

8.2.2 Comparison of Performance on MIMD, CSX600 and STARAN

Timing results for the tracking and correlation, CD&R, terrain avoidance and display processing tasks on the STARAN are given in [4, 5, 11]. The AP in the following four figures stands for the STARAN specifically. The STARAN takes only 0.11 second to execute the tracking task for 2,000 aircraft; Figure 14 compares timings of the tracking task for the MIMD, CSX600 and STARAN. It takes only 0.1 second to execute the terrain avoidance task for 2,000 aircraft on the STARAN; Figure 15 compares timings of the terrain avoidance task for the MIMD, CSX600 and STARAN. It takes only 0.28 second to execute the CD&R task for 2,000 aircraft on the STARAN; Figure 16 compares timings of the CD&R task for the MIMD, CSX600 and STARAN. It takes only 0.16 second to execute the display processing (controller display update) task for 2,000 aircraft on the STARAN; Figure 17 compares timings of this task for the MIMD, CSX600 and STARAN. Although we do not have a modern AP system to test, we believe that, with today's advanced architecture, it would execute the ATC tasks more quickly and accurately than the STARAN, and at least as predictably.

Figure 13: Timing of CD&R task.

Figure 14: Comparison of tracking task of MTI, CSX600 and STARAN.

The comparisons between MTI and CSX600 in Figures 14, 15, 16, and 17 are fair, as both are small systems and are doing the entire job. However, the comparisons between the CSX600 and the STARAN are not fair, as the STARAN has only one aircraft per PE, while the CSX600 has an increasing number of aircraft per PE. The CSX600 curves in the three graphs above make it appear that this system is more like a MIMD than the STARAN, due to the fact that the number of records per processor is increasing. Section 8.2.3 explains how to compare the STARAN with the CSX600 fairly and demonstrates how closely our CSX600 emulates the STARAN.

Figure 15: Comparison of terrain avoidance task of MTI, CSX600 and STARAN.

Figure 16: Comparison of CD&R task of MTI, CSX600 and STARAN.

8.2.3 How close to the AP is our CSX600 emulation of the AP

This section illustrates how close the performance of our CSX600 emulation of the AP is to AP performance. Figure 18 and Figure 19 show the timings of the tracking and CD&R tasks as the total number of aircraft increases from 10 to 384, i.e., as the number of aircraft per PE on the CSX600 goes through 1, 2, 3 and 4. We can see that the execution time is basically linear as long as the maximum number of aircraft per PE does not change, and that there is a jump whenever the maximum number of aircraft per PE increases. Moreover, the slope of each of the line segments is reasonably close to the slope of the STARAN over that interval. These results show that our CSX600 emulation is closely tied to the STARAN, and that it emulates the STARAN solution best when the maximum number of aircraft in each PE is 1.

Figure 17: Comparison of display processing task of MTI, CSX600 and STARAN.

Figure 18: Time of tracking for number of aircraft from 10 to 384.

As explained in section 8.2.2, the comparison between the STARAN and the CSX600 is unfair, so we provide a fairer way to compare them. We use the CSX600 to emulate a Super-CSX600 with enough processors to match the upper bound on the number of planes, so that there is at most one aircraft per PE on it. The CSX600 still pays a price for having to do extra moves between mono memory and parallel memory in emulating the Super-CSX600. The CD&R execution time for the Super-CSX600 is obtained by adding the non-parallel computation time to the quotient of the parallel execution time divided by the maximum number of aircraft per PE. The results in Figure 20 show that the Super-CSX600 curve is basically linear and reasonably similar to that of the STARAN, but the cost increases faster for the former due to greater overhead, such as data transfer between mono memory and processors, the non-constant cost of the associative functions, and hardware disadvantages compared to the STARAN. There is a significant difference between the AP and the Super-CSX600, but it is dwarfed by the difference between their curves and the MTI curves.
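The Super-CSX600 estimate described above is a one-line formula, sketched here for concreteness. The function name and the timing values in the example are hypothetical; they are not measurements from the experiments.

```python
def super_csx600_time(parallel_time, serial_time, max_aircraft_per_pe):
    """Estimated execution time with at most one aircraft per PE.

    The parallel part of the measured CSX600 time is divided by the maximum
    number of aircraft per PE, and the non-parallel part is added unchanged.
    """
    if max_aircraft_per_pe < 1:
        raise ValueError("max_aircraft_per_pe must be at least 1")
    return serial_time + parallel_time / max_aircraft_per_pe


# Hypothetical measurement: 0.40 s parallel and 0.02 s non-parallel time
# with up to 4 aircraft per PE.
estimate = super_csx600_time(parallel_time=0.40, serial_time=0.02,
                             max_aircraft_per_pe=4)
```

With one aircraft per PE the estimate reduces to the measured time, which is the case in which the CSX600 matches the STARAN configuration most closely.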

Figure 19: Time of CD&R for number of aircraft from 10 to 384.

Figure 20: Comparison of MTI, STARAN and Super-CSX600 with at most one aircraft per PE.

8.2.4 Comparison of Predictability on CSX600 and MIMD

The preceding results demonstrate that the performance of the CSX600 is better than that of STI and MTI. We next focus on the predictability of the execution times, which is an important factor in guaranteeing that all the ATC processing can be performed within the specified time bounds. We measure the Coefficient of Variation (COV), which is a common normalized measure of dispersion and is defined as the ratio of the standard deviation to the mean. The smaller the value, the lower the variance and so the higher the predictability. Unlike the standard deviation, the COV is dimensionless, so we consider it a better tool for comparing predictability than the standard deviation. The COV values of the three tasks are shown in Figures 21, 22 and 23. Note that the y-axis in these figures uses a logarithmic scale. The results clearly show that the COV values for the CSX600 are several orders of magnitude below the ones for MTI.

Figure 21: Predictability of execution times for report correlation and tracking task.

Figure 22: Predictability of execution times for terrain avoidance task.

In addition, the increased variability of MIMD is not compensated for by its running time. As shown in Figures 14, 15 and 16, the MIMD running time increases much more rapidly than that of the CSX600 and the AP. Again, this is because of the advantages of AP over MIMD mentioned in Section 4.2 above. These observations hold not only for fewer than 8,000 aircraft, but also at large scales such as 14,000 aircraft.
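For reference, the COV of a set of timing samples can be computed as sketched below. The use of the population standard deviation is an assumption (the text does not say which variant was used), and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Ratio of the standard deviation to the mean (dimensionless).

    Assumption: the population standard deviation is used.
    """
    m = mean(samples)
    if m == 0:
        raise ValueError("COV is undefined when the mean is zero")
    return pstdev(samples) / m


# Hypothetical timing samples in seconds.
steady = [0.10, 0.10, 0.10, 0.10]
jittery = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
cov_steady = coefficient_of_variation(steady)
cov_jittery = coefficient_of_variation(jittery)
```

The second sample set has mean 5 and standard deviation 2, giving a COV of 0.4, while identical samples give a COV of 0. Because the ratio is dimensionless, values for tasks with very different running times can be compared directly.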

8.2.5 Comparison of Missing Deadlines on CSX600 and MIMD

Last, we compare the number of major periods that have missed their deadlines for both CSX600 and MTI. The scheduling was described at the beginning of this section. The results are shown in Figure 24. As the number of aircraft increases, the number of deadlines missed by the MTI execution increases dramatically. However, the CSX600 does not miss a deadline in any major period.

8.2.6 Timings for 8 Tasks

Figure 23: Predictability of execution times for CD&R task.

Figure 24: Number of iterations missing deadlines when scheduling tasks.

In this section we show that the ClearSpeed CSX600 can perform all the required tasks in our ATC prototype and meet all deadlines in each major period. Table 7 shows the performance of one flight per PE, which is the closest scenario to the AP. The execution time (in seconds) is the time spent executing a task once. The processing time (in seconds) is the total time spent on that task in an 8-second period. All tasks complete within their deadlines. The total time used is 0.25508 seconds, which is only 3.19% of the available 8 seconds.
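The utilization figure quoted above can be reproduced directly from the processing times reported in Table 7. The sketch below does so; the dictionary layout and variable names are illustrative, while the numbers are those given in the table.

```python
# Processing time (seconds) spent on each task during one 8-second major
# period, as reported in Table 7 (one flight per PE).
proc_time = {
    "Report Correlation & Tracking": 0.08832,
    "Cockpit Display": 0.02177,
    "Controller Display Update": 0.02209,
    "Sporadic Requests": 0.01244,
    "Automatic Voice Advisory": 0.01088,
    "Terrain Avoidance": 0.0782,
    "Conflict Detection & Resolution": 0.01301,
    "Final Approach (96 runways)": 0.00837,
}

MAJOR_PERIOD = 8.0  # seconds

total = sum(proc_time.values())
utilization = total / MAJOR_PERIOD

print(f"total = {total:.5f} s")            # total = 0.25508 s
print(f"utilization = {utilization:.2%}")  # utilization = 3.19%
```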

Table 7: Performance of One Flight/PE

Tasks                              Exec Time (Seconds)   Proc Time (Seconds)
Report Correlation & Tracking      0.00552               0.08832
Cockpit Display                    0.00272               0.02177
Controller Display Update          0.00276               0.02209
Sporadic Requests                  0.00155               0.01244
Automatic Voice Advisory           0.00544               0.01088
Terrain Avoidance                  0.00782               0.0782
Conflict Detection & Resolution    0.01301               0.01301
Final Approach (96 runways)        0.00837               0.00837
Total                                                    0.25508

Table 8: Performance of Eight Tasks for Ten Flights/PE

Tasks                              Exec Time (Seconds)   Proc Time (Seconds)
Report Correlation & Tracking      0.3409                5.4544
Cockpit Display                    0.1705                1.364
Controller Display Update          0.1825                1.46
Sporadic Requests                  0.0987                0.7896
Automatic Voice Advisory           0.1586                0.3172
Terrain Avoidance                  0.2386                0.2386
Conflict Detection & Resolution    0.5182                0.5182
Final Approach (96 runways)        0.2598                0.2598
Total                                                    10.4018

Table 8 lists the worst-case timings of the eight tasks for 10 aircraft per PE running on the CSX600, which is the maximum number that can be scheduled because each PE has only 6K of memory. The total processing time is 10.4018 seconds, while the maximum time required by ATC is typically 10 seconds [2, 4, 5, 17, 68]. Since a 10.4-second cycle is only slightly longer than a 10-second cycle, the performance results are relatively good and demonstrate the scalability of this implementation on the CSX600. We can conclude that the CSX600 implementation can meet all deadlines that can be statically scheduled. The ClearSpeed CSX700 provides a larger SIMD system with 192 processors and should be able to double the number of aircraft supported by the CSX600. With its faster 250 MHz clock, the CSX700 should either meet or come close to meeting a 10-second deadline for execution of the major period.
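The margin by which the ten-flights-per-PE schedule overruns a 10-second cycle, and a rough estimate of the effect of a faster clock, can be worked out as follows. This is a back-of-the-envelope sketch only. It assumes that running time scales inversely with clock frequency and that the per-PE workload is unchanged. It takes the CSX700 clock to be roughly 20% faster than that of the CSX600, the figure mentioned in Chapter 10. None of these assumptions was measured here.

```python
total_csx600 = 10.4018   # seconds, total processing time from Table 8
deadline = 10.0          # seconds, typical ATC cycle time cited in the text

overrun = total_csx600 - deadline
print(f"overrun = {overrun:.4f} s ({overrun / deadline:.2%} of the cycle)")

# Assumption: running time is inversely proportional to clock frequency.
clock_speedup = 1.20     # assumed: CSX700 clock about 20% faster
estimated_csx700 = total_csx600 / clock_speedup
print(f"estimated CSX700 time = {estimated_csx700:.2f} s")
print("meets 10 s deadline:", estimated_csx700 <= deadline)
```

Under these assumptions the estimate falls below 10 seconds, which is consistent with the expectation stated above but does not replace a measurement on the actual hardware.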

The reason that we use 96 runways in Table 3 is that 96 is the maximum we can use without increasing the CSX600 running time of this task by a factor of about 2. This larger number also demonstrates the advantage of using a SIMD for a more computationally intensive problem and should be enough to include all of the runways in the airfields under the control of an ATC center. A bound of 100 runways is used in both [4] and [5]. We have also tested executions for 10 runways, which is more practical for an airport-sized application (as opposed to an entire ATC sector). The running time is almost the same. Although the smaller problem gains nothing from the extra capacity, the CSX600 has low power consumption, small size, and low weight, so this is not a significant waste of resources.

8.2.7 Video and Actual STARAN Demos

Under a contract with the FAA, Goodyear Aerospace demonstrated an experimental associative processor in 1971 in Knoxville, TN. This unit used a content-addressable memory built with magnetic wire to demonstrate automatic tracking, conflict detection and resolution, terrain avoidance, and automatic voice advisory using only about 4,500 instructions. It was a predecessor to the STARAN.

As a result of the successful performance of the experimental magnetic SIMD at Knoxville, Goodyear developed a production AP named STARAN. The STARAN system had array modules containing 256 PEs and 1028 bits of storage per PE. The ATC software for it was written by five people in five months and consisted of 4,017 AP assembly instructions and about 1,600 instructions for the control processor [13]. The system was delivered for demonstration in 1972 at the International Air Exposition at Dulles Field, where it performed as a control center processor. The Edwards AFB radar was used as radar input, and all radar data was tracked automatically.

A video made from the 16 mm film taken in 1972 at the International Air Exposition at Dulles Field is available for viewing at [88]. The video shows automatic primary and beacon radar tracking, conflict detection and resolution, and display processing. The demonstration uses 256 flight plans to develop the primary and beacon radar reports. Initially, a ten-second major period was used, but later the system clock was sped up to process major periods in one tenth of a second. Speeding up the system clock to 100 times real time is the equivalent of tracking 25,600 aircraft in real time. Both the real-time run and the 100-times speedup are captured on the video. What was demonstrated by STARAN in 1972 cannot be done in any ATC system in the world today.

CHAPTER 9

Observations and Consequences of the Preceding Results

This chapter focuses on the theoretical aspects of this dissertation, i.e., the contributions of our AP approach to the bigger picture of high performance computing and parallel and distributed computing, which are currently very active areas. We will show that the idea of using an AP to solve ATC problems is useful for many other data-intensive problems, and that computational methods in many other areas can benefit from it.

A recent paper in IEEE Computer [89] observes that the widely accepted RAM model for sequential computation is the reason for the huge progress in sequential computation over the past 60 years, and that a single widely accepted model for parallel computation could bring similar success to parallel computation. It describes several features that this parallel model and its target architectures should have, including encouraging parallel solutions to applications that are easy to use, energy efficient, scalable, and highly portable. It identifies the pitfalls and problems to avoid in software production as data dependences, race conditions, load balancing, nondeterminism, false sharing, deadlocks, and non-predictability, and it favors paradigms that support the construction of easy-to-use, productive, energy-efficient, scalable, and portable parallel applications. Unfortunately, most programmers do not have a good understanding of the latest developments in hardware design, which is an essential skill for multicore and particularly many-core programming.


Sequential programming is a deterministic and predictable process that arises intuitively from the way programmers solve problems using algorithms. Parallel programming should be as simple and intuitive as sequential programming. General programmers should not be required to have an in-depth understanding of the latest developments in hardware design in order to write efficient programs. The current complexity of designing high-quality parallel programs makes redesigning and rewriting applications both time-consuming and expensive. As a result, most legacy applications are still waiting to be parallelized. These and related issues are discussed further in [89].

AP programming has all of the advantages mentioned above. Moreover, it is easy for programmers because there is only one instruction stream at any time. Programmers need not worry about the extra work required by MIMD, so AP programming is less error-prone than multicore and particularly many-core programming. Since few SIMDs have been built since the 1980s, most of today’s computer professionals have very limited knowledge of how to design software, algorithms, or programs for SIMDs. Essentially none of them even know what an AP is. They do not think in terms of SIMD or AP solutions to problems, and are unaware of how to solve problems using these machines.

The associative computational model ASC introduced in [15] was designed to support algorithm development for the AP and satisfies all of the above desired characteristics. While the name ASC in that paper originally applied to both APs and multi-APs (which allow multiple instruction streams), the name ASC was later restricted to the single instruction stream AP parallel computer, and the model for multi-APs was called MASC. Comparisons between the power of these models and other well-known computational models are given in [42,90,91]. A possible implementation of a multi-threaded associative processor is investigated in [79].

Next, we consider whether the AP is primarily useful in solving only ATC-type problems, or whether it can also solve a wide range of other problems. A language (also called ASC) with a primer, compiler, and Windows emulator has been designed for ASC [41,92]. Many algorithms have been designed for the ASC model, and many software projects have been developed for AP platforms (often using the ASC language), in order to demonstrate the versatility of the ASC model and the AP system. These include graph algorithms [15,41,93], convex hull [94–98], string matching [99,100], sequence alignment [101,102], NP-complete problems [103,104], databases [105,106], parallel compilers [107–109], support for functional and logic-based languages [110–112], support for expert system languages [113–115], and computer architecture [76–81]. In [11], Batcher lists fast Fourier transforms, sonar post-processing, string search, file processing, air traffic control, image processing, data management, position locating and reporting, bulk filtering, and radar processing as STARAN applications, and for the first five of these provides a comparison of the STARAN’s performance with several other systems. Additional applications are discussed in [41]. The MPP was specifically designed and used for image processing. This wide range of AP algorithms and software demonstrates the versatility of the AP across a wide range of applications.

MIMDs have to solve additional problems as part of their solution to the original problem, and this requires a lot of extra work. In addition to the problems discussed earlier, supporting the integrity and the ACID properties of a database requires substantial additional work on a MIMD. This requires additional software that typically involves many of the same types of support discussed earlier, such as synchronization, dynamic scheduling, critical sections of code, sorting, and indexing. It may also involve a substantial increase in communication. All of these problems become much more difficult to handle when the database is dynamic, as records typically have to be added or transferred frequently and massive updates occur frequently, e.g., every 0.5 seconds. In contrast, the AP (and SIMDs in general) stores all information about an object such as an aircraft in the local memory of the processor to which this object is assigned. This processor executes essentially all of the computations regarding that object, so updates to the database involving this object only require changes to the local memory of that processor.
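The contrast drawn above can be illustrated with a small sequential emulation. In this sketch each list element stands for one PE holding one aircraft record. A single instruction (the position update) is applied to every record, and each update touches only that PE's own fields, so no locks, shared indexes, or inter-processor messages are involved. The record fields and the function name are hypothetical, the units are arbitrary, and a Python loop only mimics what an AP does in lockstep.

```python
from dataclasses import dataclass


@dataclass
class AircraftRecord:
    """One aircraft's data, held entirely in a single PE's local memory."""
    x: float   # position (arbitrary units)
    y: float
    vx: float  # velocity (units per second)
    vy: float


def advance_all(pes, dt):
    """Apply the same update to every PE's record (emulated SIMD step)."""
    for rec in pes:          # on an AP, all PEs execute this step together
        rec.x += rec.vx * dt
        rec.y += rec.vy * dt


pes = [
    AircraftRecord(x=0.0, y=0.0, vx=0.25, vy=0.0),
    AircraftRecord(x=5.0, y=5.0, vx=0.0, vy=-0.25),
]
advance_all(pes, dt=0.5)     # one 0.5-second update interval
print(pes[0].x, pes[1].y)    # 0.125 4.875
```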

The graphs in Chapter 8 of this dissertation show that the magnitude of this extra work dramatically increases the time required by MIMDs on ATC-type problems. However, there is nothing highly unusual about the solution to ATC problems except the hard deadlines. While the heavy real-time nature of the ATC problem and the extremely dynamic database that must be maintained exacerbate the extra work that MIMDs are required to do, they primarily increase the magnitude of this extra work. It is still the case that for most non-trivial applications, MIMDs have to use a large amount of extra software and do a large amount of extra work when compared to SIMDs.

Because of the simplicity of the AP hardware, the hardware cost of building a large AP should be much lower than that of building a MIMD capable of solving the same problems. While many applications may be reasonable to solve on a MIMD, for important large-scale problems where a MIMD solution would be expensive and challenging to develop, it may be considerably cheaper and much more satisfactory to have an AP of appropriate size built to handle the problem. The history of the ATC problem provides an excellent justification for considering this alternative approach. The ongoing MIMD ATC solution is widely known for its repeated shortcomings, in spite of the fact that a very large-scale project to redevelop this system has been launched about once every ten years for the past 50 years.

CHAPTER 10

Conclusions and Future Work

In this dissertation, we provided an efficient and scalable solution for Air Traffic Control (ATC) problems on the associative processor (AP), emulated on the ClearSpeed CSX600 system. We established the feasibility of this solution by showing that it runs on the ClearSpeed CSX600 emulation of the AP, which implies that it also works on an AP system. Without many of the details of the earlier AP solution, we were able to recreate a large portion of that earlier implementation.

The contributions of this dissertation are summarized as follows. First, the AP implementation has much better scalability and efficiency than the MIMD implementation. This implementation uses static scheduling, which is possible due to the deterministic SIMD architecture. The AP ATC software also avoids the need for solutions (or calls to library functions that provide them) to the following types of problems: load balancing, shared resource management, synchronization, preemption management, sorting, indexing, cache and memory coherency management, false sharing, priority inversion handling, race conditions, data dependencies, and deadlocks. In particular, it avoids the use of solutions or approximate solutions to any of the numerous multiprocessor NP-hard problems of the type given in [8]. The solution is very similar to a sequential solution for this problem, both in style and in code size. As a result, the size of this software is only a very small fraction of the size of the MIMD solutions to the ATC problem. This results in a dramatic drop in the cost of both creating and maintaining this software when compared to the MIMD solutions that have been given to the ATC problem.

Second, the proposed AP solution supports accurate and meaningful predictions of worst-case execution times and will guarantee that all deadlines are met. In contrast, MIMD systems optimize average-case running time and have highly unpredictable worst-case running times. These contributions can provide major help in meeting the goals of the FAA’s NextGen Plan: fly more aircraft, more safely, more precisely, more efficiently, and with less fuel [17].

Third, the AP hardware easily scales to handle larger problem sizes, either by increasing the maximum number of records stored in each processor (which increases the running time) or by building a larger AP computer with more processors. Based on the simplicity of the architecture of the STARAN and ASPRO, which are the only APs that have been built, building a larger AP system should be both easy and much cheaper than building a MIMD system of comparable size. The CM-2 Connection Machine was a SIMD computer built by Thinking Machines in the late 1980s that had over 64K processors. Paracel developed a parallel processor, generally believed to be a SIMD, that had one million processors. It therefore seems reasonable to expect that an AP with a million processors could be built today.

Fourth, Validation and Verification (V&V) of the resulting software is simpler than for current MIMD software.
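The two scaling options in the third point above trade processors against records per processor. The following sketch shows that arithmetic. It assumes 96 PEs for the CSX600 (its published PE count, which is not restated in this chapter) and 192 for the CSX700, and the helper name `records_per_pe` is hypothetical.

```python
import math


def records_per_pe(num_aircraft, num_pes):
    """Records each PE must hold when aircraft are spread evenly over the PEs."""
    return math.ceil(num_aircraft / num_pes)


print(records_per_pe(960, 96))    # 10
print(records_per_pe(1920, 96))   # 20: more records per PE, longer running time
print(records_per_pe(1920, 192))  # 10: more PEs, same load per PE
```

Doubling the number of aircraft on the same hardware doubles the records each PE must process, while doubling the number of PEs keeps the per-PE load unchanged.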

The ATC problem has requirements similar to those of most embedded real-time problems with periodic tasks and hard deadlines, such as military command and control, autonomous driving, video surveillance, image processing, simulations of complex physical systems (e.g., weather forecasting, molecular modeling), and massive data processing. The AP’s ability to easily handle the ATC problem suggests that it could also handle many other real-time problems, both large and small. As a result, this research is relevant not only to the ATC problem but also to numerous other important applications that involve real-time problems with hard deadlines. For example, command and control systems such as an air defense system would be natural candidates, as would other embedded real-time systems.

The ATC problem is a large dynamic database problem. The AP excels at handling ATC because it was designed to handle real-time dynamic database activities rapidly. For example, locating records with a particular property, reading a value from such a record, changing a value in such a record, and determining whether a record with a certain property exists are all actions that can be done in constant time. By use of the flip network, large pieces of records can be moved into parallel memory or copied from parallel memory in constant time, making it possible to enter these records and to ship them elsewhere rapidly. The AP’s capability of handling dynamic databases makes it very useful for numerous other applications.
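The associative operations listed above can be sketched with a sequential emulation. Each list element stands for the record held by one PE. A search broadcasts a predicate and yields a responder mask, after which existence tests, reads, and masked updates act on the responders. The record layout and function names are illustrative and are not the STARAN or ASC interface. The loops below take time linear in the number of records, whereas on an AP each step is a single broadcast operation across all PEs.

```python
def search(records, predicate):
    """Associative search: one boolean responder flag per PE."""
    return [predicate(rec) for rec in records]


def any_responder(mask):
    """Does any record with the searched property exist?"""
    return any(mask)


def read_first(records, mask, field):
    """Read a field from the first responder, or None if there is none."""
    for rec, hit in zip(records, mask):
        if hit:
            return rec[field]
    return None


def update_responders(records, mask, field, value):
    """Masked write: only responding PEs change their local record."""
    for rec, hit in zip(records, mask):
        if hit:
            rec[field] = value


# Hypothetical aircraft records, one per emulated PE.
records = [
    {"id": "AC101", "altitude": 31000, "alert": False},
    {"id": "AC202", "altitude": 9000, "alert": False},
    {"id": "AC303", "altitude": 8500, "alert": False},
]

low = search(records, lambda r: r["altitude"] < 10000)
print(any_responder(low))                     # True
print(read_first(records, low, "id"))         # AC202
update_responders(records, low, "alert", True)
print([r["alert"] for r in records])          # [False, True, True]
```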

A natural application area is controlling fleets of Unmanned Aircraft Systems (UASs). In fact, the ATC system developed here with the ClearSpeed CSX600 could be expanded to manage flight control for a small fleet of 480 to 960 UASs (depending on the number of tasks performed, the size of the records stored, etc.) in applications such as: (1) patrolling areas such as international borders and reporting unusual activities; (2) early identification of forest fires and, during actual fires, maintaining updated information about regions that are burnt, threatened, etc.; and (3) surveillance of agricultural crops and performing functions such as spraying when needed or turning water on at various locations when plants need water. The ClearSpeed CSX700, which provides a larger SIMD system with 192 processors and a roughly 20% faster processor clock, should more than double the capabilities of the CSX600. As ClearSpeed SIMD accelerators are well known for their very small power consumption, size, and weight, it should be possible to build an AP with similar characteristics. For example, it should be possible to build an AP that requires less power than a light bulb and is light enough to be easily deployed in the field while still being able to control 1000 UASs in the immediate vicinity. Since UASs will soon be permitted to enter airspace controlled by the FAA, many avionics experts expect the total number of aircraft in the skies to increase rapidly as the anticipated civilian use of UASs explodes. Since the demands on our ATC system are likely to increase at a much more rapid pace than in the past, the FAA should quickly initiate an investigation into the use of APs for ATC and how best to integrate APs into the FAA system.

Chapters 9 and 10 above argue that the AP approach is novel and contributes to many other applications. We now discuss its limitations. The AP is ideal for solutions that have a lot of data parallelism, especially at large scale, while MIMD fits distributed applications. For example, as noted at the end of Section 8.2.6, it takes almost the same time to process 10 runways as 96. A natural way to overcome this limitation is to combine an AP and a MIMD in the same system. ClearSpeed boards have been used extensively as accelerators for MIMD systems. In several cases, supercomputers increased their ranking in the Top 500 by adding multiple ClearSpeed accelerators to their system. MIMD processors could hand off problems that the SIMD accelerator could compute more efficiently and perform other work while waiting for the ClearSpeed to return the solution. It may be useful to investigate whether the use of both AP and MIMD hardware in the same system as co-equal partners could also prove beneficial. Either system could serve as the main system, with the option of handing off to the other system problems that it could not handle efficiently. Such a system would need to be able either to convert from one mode to the other efficiently or to transfer data from one system to the other efficiently. Such a combination might be useful in building a system that could handle a wider range of problems efficiently. It also might be useful in large systems (e.g., exascale) that would need to handle a variety of very large applications.

A possibly important extension to our current research would be to consider an implementation of ATC on NVIDIA GPU hardware using CUDA; these chips contain many groups of SIMD PEs. The NVIDIA technology, including the latest Fermi chip, has a lot in common with the MTAP approach of ClearSpeed. Implementing the CSX600 ATC algorithms on this architecture would show whether it can serve as another useful platform for this project and for the types of real-time applications mentioned earlier. Furthermore, using a CUDA GPU, we can investigate whether the shortcomings or bottlenecks of the AP can be reduced. For example, GPUs have evolved into highly parallel multicore systems allowing very efficient manipulation of large blocks of data. This design is more effective than general-purpose CPUs for algorithms in which large blocks of data are processed in parallel. We will investigate whether ATC can run well using CUDA and whether the bottlenecks can be reduced. Another potential project is to investigate implementing our prototype on other parallel systems, e.g., Cray systems such as their vector processors, IBM’s Cell processor, and Convex systems with FPGA reconfigurable hardware.

BIBLIOGRAPHY

[1] S. Kahne and I. Frolow, “Air traffic management: Evolution with technology,” IEEE Control Systems Magazine, vol. 16, no. 4, pp. 12–21, November 1996.

[2] M. Nolan, Fundamentals of Air Traffic Control, 3rd ed. Wadsworth: Brooks/Cole, 1998.

[3] M. Jin, “Evaluating the power of the parallel MASC model using simulations and real-time applications,” Ph.D. dissertation, Department of Computer Science, Kent State University, August 2004.

[4] W. Meilander, M. Jin, and J. Baker, “Tractable real-time air traffic control automation,” in Proc. of the 14th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Cambridge, MA, November 2002, pp. 483–488.

[5] W. Meilander, J. Baker, and M. Jin, “Predictable real-time scheduling for air traffic control,” in Fifteenth International Conference on Systems Engineering, August 2002, pp. 533–539.

[6] W. Meilander, J. Baker, and M. Jin, “Importance of SIMD computation reconsidered,” in Proc. of the 17th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), Nice, France, April 2003.

[7] W. Meilander, J. Potter, K. Liszka, and J. Baker, “Real-time scheduling in command and control,” in Proc. of the 1999 Midwest Workshop on Parallel Processing, August 1999.

[8] M. Garey and D. Johnson, Computers and Intractability: a Guide to the Theory of NP-completeness. New York: W.H. Freeman, 1979.

[9] J. A. Stankovic, M. Spuri, M. Di Natale, and G. Buttazzo, “Implications of classical scheduling results for real-time systems,” IEEE Computer, pp. 16–25, June 1995.

[10] M. Jin, J. Baker, and K. Batcher, “Timings for associative operations on the MASC model,” in Proc. of the 15th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), San Francisco, CA, April 2001, pp. 193–200.

[11] K. Batcher, “STARAN parallel processor system hardware,” in National Computer Conference and Exposition (AFIPS74), New York, NY, May 1974, pp. 405–410.

[12] W. Meilander, “STARAN an associative approach to multiprocessing,” in Multiprocessor Systems, Infotech State of the Art Reports, Infotech International, 1976, pp. 347–372.


[13] J. A. Rudolph, “A production implementation of an associative array processor - STARAN,” in The Fall Joint Computer Conference (FJCC), Los Angeles, CA, December 1972.

[14] W. Meilander, “ASPRO-VME hardware/architecture,” ER3418-5, LORAL Defense Systems, June 1992.

[15] J. Potter, J. Baker, S. Scott, A. Bansal, C. Leangsuksun, and C. Asthagiri, “ASC: An associative-computing paradigm,” Computer, vol. 27, no. 11, pp. 19–25, 1994.

[16] “FAA grants for aviation research program solicitation no. FAA-06-01,” 2011. [Online]. Available: http://www.tc.faa.gov/logistics/grants/solicitation/97solict.doc

[17] “FAA’s NextGen implementation plan,” 2011. [Online]. Available: http://www.faa.gov/nextgen/media/ng2011_implementation_plan.pdf

[18] “Federal Aviation Agency 1963 ATC specifications,” 1963. [Online]. Available: http://www.cs.kent.edu/~parallel/papers/FAA1963ATCSpecifications.pdf

[19] M. Yuan, J. Baker, F. Drews, and W. Meilander, “Efficient implementation of air traffic control (ATC) using the ClearSpeed CSX620 system,” in Proc. of the 21st IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Cambridge, MA, November 2009, pp. 353–360.

[20] S. Guy, J. Chhugani, C. Kim, N. Satish, M. Lin, D. Manocha, and P. Dubey, “ClearPath: highly parallel collision avoidance for multi-agent simulation,” in ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA). ACM, August 2009, pp. 177–187.

[21] M. Yuan, J. Baker, F. Drews, L. Neiman, and W. Meilander, “An efficient associative processor solution to an air traffic control problem,” in Large Scale Parallel Processing (LSPP) IEEE Workshop at the International Parallel and Distributed Processing Symposium (IPDPS), Atlanta, GA, April 2010.

[22] M. Yuan, J. Baker, W. Meilander, and K. Schaffer, “Scalable and efficient associative processor solution to guarantee real-time requirements for air traffic control systems,” in Large Scale Parallel Processing (LSPP) IEEE Workshop at the International Parallel and Distributed Processing Symposium (IPDPS), Shanghai, China, May 2012, pp. 1682–1689.

[23] M. Yuan, J. Baker, and W. Meilander, “Comparisons of air traffic control implementations on an associative processor with a MIMD and consequences for parallel computing,” Journal of Parallel and Distributed Computing (JPDC), 2012, accepted, to appear.

[24] “OpenMP website,” 2010. [Online]. Available: http://openmp.org/wp/

[25] S. Akl, Parallel Computing: Models and Methods. New York: Prentice Hall, 1997.

[26] M. Quinn, Parallel Programming in C with MPI and OpenMP. McGraw-Hill, 2004.


[27] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms, 1st ed. McGraw Hill and MIT Press, 1990, chapter 30 on Parallel Algorithms.

[28] M. Flynn, “Some computer organizations and their effectiveness,” IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948–960, 1972.

[29] M. Quinn, Parallel Computing: Theory and Practice. McGraw-Hill, 1994.

[30] W. Chantamas, “A multiple associative computing model to support the execution of data parallel branches using the manager-worker paradigm,” Ph.D. dissertation, Department of Computer Science, Kent State University, December 2009.

[31] M. Garey, R. Graham, and D. Johnson, “Performance guarantees for scheduling algorithms,” Operations Research, vol. 26, no. 1, pp. 3–21, Jan-Feb 1978.

[32] J. Stankovic, “Misconceptions about real-time computing,” IEEE Computer, vol. 21, no. 10, pp. 17–25, Oct. 1988.

[33] J. Stankovic, S. Son, and J. Hansson, “Misconceptions about real-time databases,” Computer, June 1999.

[34] J. Stankovic, M. Spuri, K. Ramamritham, and G. Buttazzo, Deadline Scheduling for Real-time Systems. Kluwer Academic Publishers, 1998.

[35] J. W. Liu, Real-Time Systems. Prentice Hall, 2000.

[36] G. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications, 2nd ed. New York: Springer Science, 2005.

[37] C. Murthy and G. Manimaran, Resource Management in Real-time Systems and Networks. MIT Press, 2001.

[38] M. Klein, J. Lehoczky, and R. Rajkumar, “Rate-monotonic analysis for real-time industrial computing,” IEEE Computer, pp. 24–33, 1994.

[39] C. L. Liu and J. Layland, “Scheduling algorithms for multiprogramming in a hard-real-time environment,” Journal of the ACM, vol. 20, no. 1, 1973.

[40] J. Anderson, J. Calandrino, and U. Devi, “Real-time scheduling on multicore platforms,” in Proc. of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, San Jose, CA, April 2006, pp. 179–190.

[41] J. Potter, Associative Computing: A Programming Paradigm for Massively Parallel Computers. New York: Plenum Press, 1992.

[42] J. Trahan, M. Jin, W. Chantamas, and J. Baker, “Relating the power of the multiple associative computing model (MASC) to that of reconfigurable bus-based models,” Journal of Parallel and Distributed Computing, Elsevier Publishers, vol. 70, pp. 458–466, May 2010.

[43] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald, Parallel Programming in OpenMP, 1st ed. Morgan Kaufmann, 2000.


[44] B. Chapman, G. Jost, and R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2007.

[45] H. Casanova, A. Legrand, and Y. Robert, Parallel Algorithms. CRC Press, 2009.

[46] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, 1999.

[47] K. Lakshmanan, S. Kato, and R. Rajkumar, “Scheduling parallel real-time tasks on multi-core processors,” in Proc. of the 31st IEEE Real-Time Systems Symposium (RTSS’10), San Diego, CA, December 2010, pp. 259–268.

[48] I. Hwang, “Air traffic surveillance and control using hybrid estimation and protocol-based conflict resolution,” Ph.D. dissertation, Stanford University, 2003.

[49] L. Yang and J. Kuchar, “Prototype conflict alerting logic for free flight,” AIAA Journal of Guidance, Control, and Dynamics, vol. 20, no. 4, pp. 768–773, July-August 1997.

[50] J. Kuchar and L. Yang, “A review of conflict detection and resolution modeling methods,” IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 4, pp. 179–189, 2000.

[51] Y. Bar-Shalom and T. Fortmann, Tracking and Data Association. Academic Press, 1988.

[52] H. Blom, R. Hogendoorn, and B. van Doorn, “Design of a multisensor tracking system for advanced air traffic control,” in Multitarget-Multisensor Tracking: Application and Advances, Y. Bar-Shalom, Ed., vol. 2. Artech House, 1990, pp. 31–63.

[53] I. Hwang, H. Balakrishnan, K. Roy, and C. Tomlin, “Multiple-target tracking and identity management in clutter for air traffic control,” in Proceedings of the AACC American Control Conference, Boston, MA, June 2004.

[54] I. Hwang, J. Hwang, and C. Tomlin, “Flight-mode-based aircraft conflict detection using a residual-mean interacting multiple model algorithm,” in Proceedings of the AIAA Guidance, Navigation, and Control Conference, Austin, TX, August 2003.

[55] I. Hwang and C. Tomlin, “Protocol-based conflict resolution for finite information horizon,” in Proceedings of the AACC American Control Conference, Anchorage, AK, May 2002.

[56] K. Liu, “Composition of Kalman and heuristic tracking algorithms for air traffic control,” Master’s thesis, Department of Computer Science, Kent State University, August 1999.

[57] E. Mazor, A. Averbuch, Y. Bar-Shalom, and J. Dayan, “Interacting multiple model methods in tracking: A survey,” IEEE Transactions on Aerospace and Electronic Systems, vol. 34, no. 1, pp. 103–123, 1998.


[58] D. Lainiotis, “Partitioning: A unifying framework for adaptive systems i: Estima-tion,” in Proceedings of the IEEE, vol. 64, August 1976, pp. 1126–1142.

[59] Y. Bar-Shalom and X. Li, Estimation and Tracking: Principles, Techniques andSoftware. Boston, Massachusetts: Artech House, 1993.

[60] D. Sworder and J. Boyd, “Estimation problems in hybrid systems,” in CambridgeUniversity Press, 1999.

[61] H. Blom and Y. Bar-Sharlom, “The interacting multiple model algorithm for sys-tems with markovian switching coefficients,” IEEE Transactions on AutomaticControl, vol. 33, no. 8, pp. 780–783, August 1988.

[62] X. Li and Y. Bar-Shalom, “Design of an interacting multiple model algorithm for air traffic control tracking,” IEEE Transactions on Control Systems Technology, vol. 1, no. 3, pp. 186–194, September 1993.

[63] P. Menon, G. Sweriduk, and B. Sridhar, “Optimal strategies for free flight air traffic conflict resolution,” Journal of Guidance, Control, and Dynamics, vol. 22, no. 2, pp. 202–211, 1999.

[64] J. Krozel, M. Peters, K. Bilimoria, C. Lee, and J. Mitchell, “System performance characteristics of centralized and decentralized air traffic separation strategies,” in Fourth USA/Europe Air Traffic Management Research and Development Seminar, 2001.

[65] Y.-J. Chiang, J. Klosowski, C. Lee, and J. Mitchell, “Geometric algorithms for conflict detection/resolution in air traffic management,” in 36th IEEE Conference on Decision and Control, San Diego, CA, December 1997, pp. 1835–1840.

[66] R. Paielli and H. Erzberger, “Conflict probability estimation for free flight,” Journal of Guidance, Control, and Dynamics, vol. 20, no. 3, pp. 588–596, 1997.

[67] M. Prandini, J. Hu, J. Lygeros, and S. Sastry, “A probabilistic approach to aircraft conflict detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 4, pp. 199–219, 2000.

[68] “FAA grants for aviation research program solicitation,” 2011. [Online]. Available: http://www.tc.faa.gov/logistics/grants/

[69] K. Park, N. Singhal, M. Lee, S. Cho, and C. Kim, “Design and performance evaluation of image processing algorithms on GPUs,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 91–104, January 2011.

[70] S. Reddaway, W. Meilander, J. Baker, and J. Kidman, “Overview of air traffic control using an SIMD COTS system,” in Proc. of the International Parallel and Distributed Processing Symposium (IPDPS’05), Denver, CO, April 2005.

[71] K. Batcher, “STARAN/RADCAP hardware architecture,” in Sagamore Computer Conf. on Parallel Processing, 1973, pp. 147–152.

[72] ——, “The multi-dimensional access memory in STARAN,” in Sagamore Computer Conf. on Parallel Processing, 1975, pp. 167–168.

[73] ——, “The flip network in STARAN,” in International Conf. on Parallel Processing, 1976, pp. 65–71.

[74] ——, “STARAN series E,” in International Conf. on Parallel Processing, 1977, pp. 140–143.

[75] ——, “The multi-dimensional access memory in STARAN,” IEEE Transactions on Computers, vol. C-26, no. 2, pp. 174–177, Feb 1977.

[76] H. Wang and R. Walker, “Implementing a scalable ASC processor,” in Proc. of the 17th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), Nice, France, April 2003, abstract on page 267, full text on CDROM.

[77] ——, “Implementing a multiple-instruction stream associative MASC processor,” in Proc. of the 18th International Conference on Parallel and Distributed Computing and Systems (PDCS), Dallas, Texas, November 2006, pp. 460–465.

[78] R. Walker, J. Potter, Y. Wang, and M. Wu, “Implementing associative processing: Rethinking earlier architectural decisions,” in Proceedings of the 15th International Parallel and Distributed Processing Symposium (Workshop on Massively Parallel Processing), San Francisco, CA, April 2001, abstract on p. 195, full text on accompanying CDROM.

[79] K. Schaffer and R. Walker, “A prototype multithreaded associative SIMD processor,” in Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS), Workshop on Advances in Parallel and Distributed Computing Models (APDCM), Long Beach, CA, March 2007, p. 228.

[80] ——, “Using hardware multithreading to overcome broadcast/reduction latency in an associative SIMD processor (extended version),” Parallel Processing Letters, vol. 18, no. 4, pp. 491–509, Dec 2008.

[81] ——, “Using hardware multithreading to overcome broadcast/reduction latency in an associative SIMD processor,” in Proc. 22nd Int’l Parallel and Distributed Processing Symp. (IPDPS), Workshop on Large-Scale Parallel Processing (LSPP), Miami, FL, April 2008, p. 289.

[82] “ClearSpeed Technology plc. ClearSpeed whitepaper: CSX processor architecture,” 2007. [Online]. Available: http://www.clearspeed.com/docs/resources/

[83] “ClearSpeed Technology plc. ClearSpeed whitepaper: ClearSpeed software description,” 2007. [Online]. Available: https://support.clearspeed.com/documents/

[84] “ClearSpeed Technology plc. CSX600 runtime software user guide,” 2007. [Online]. Available: https://support.clearspeed.com/documents/

[85] J. Gustafson and B. Greer, “ClearSpeed whitepaper: Accelerating the Intel Math Kernel Library,” Tech. Rep., 2007. [Online]. Available: http://www.clearspeed.com/docs/resources/ClearSpeedIntelWhitepaperFeb07.pdf

[86] K. Schaffer, “ASC library for ClearSpeed,” 2012. [Online]. Available: http://www.cs.kent.edu/~kschaffe/asc/

[87] E. Eddey and W. Meilander, “Application of an associative processor to aircraft tracking,” in Proceedings of the Sagamore Computer Conference on Parallel Processing. Springer-Verlag, Aug 1974, pp. 417–428.

[88] “The video of the performance of the STARAN at Dulles is posted at,” 2012, this is a reproduction of part of the original 16mm film made in the 1980’s. If a complete professional restoration is feasible, it will also be posted here. [Online]. Available: http://www.cs.kent.edu/~jbaker/ATC/ and XXXX at Elsevier site.

[89] A. Marowka, “Back to thin-core massively parallel processors,” IEEE Computer Journal, vol. 44, no. 12, pp. 49–54, December 2011.

[90] W. Chantamas, J. Baker, and M. Scherger, “An extension of the ASC language compiler to support multiple instruction streams in the MASC model using the manager-worker paradigm,” in Proc. of the 2006 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2006), June 2006, pp. 521–527.

[91] W. Chantamas and J. Baker, “A multiple associative model to support branches in data parallel applications using the manager-worker paradigm,” in Proc. of the 19th International Parallel and Distributed Processing Symposium (WMPP Workshop), April 2005, pp. 266–273.

[92] J. Potter, ASC Software, 1992, includes a Primer, Windows Compiler, and Windows Emulator. [Online]. Available: http://www.cs.kent.edu/~parallel/

[93] M. Jin and J. Baker, “Two graph algorithms on an associative computing model,” in International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, June 2007, 7 pages.

[94] M. Atwah and J. Baker, “An associative dynamic convex hull algorithm,” in Proc. of the Tenth IASTED International Conference on Parallel and Distributed Computing and Systems, Las Vegas, NV, October 1998, pp. 250–254.

[95] ——, “An associative implementation of a parallel convex hull algorithm,” in Proc. of the 15th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), San Francisco, CA, April 2001, abstract on page 64, full text on CDROM.

[96] ——, “An associative static and dynamic convex hull algorithm,” in Proc. of the 16th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), Ft. Lauderdale, FL, April 2002, abstract on page 249, full text on CDROM.

[97] M. Atwah, J. Baker, and S. Akl, “An associative implementation of Graham’s convex hull algorithm,” in Proc. of the Seventh IASTED International Conference on Parallel and Distributed Computing and Systems, Washington D.C., October 1995, pp. 273–276.

[98] ——, “An associative implementation of classical convex hull algorithm,” in Proc. of the Eighth IASTED International Conference on Parallel and Distributed Computing and Systems, Chicago, IL, October 1996, pp. 435–438.

[99] M. Esenwein, “String matching algorithms for an associative computer,” Master’s thesis, Department of Computer Science, Kent State University, 1995.

[100] M. Esenwein and J. Baker, “VLCD string matching for associative computing and multiple broadcast mesh,” in Proc. of the IASTED International Conference on Parallel and Distributed Computing and Systems, October 1997, pp. 69–74.

[101] S. Steinfadt and J. Baker, “SWAMP: Smith-Waterman using associative massive parallelism,” in IEEE Workshop on Parallel and Distributed Scientific and Engineering Computing, 2008 International Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 2008.

[102] S. Steinfadt, M. Scherger, and J. Baker, “A local sequence alignment algorithm using an associative model of parallel computation,” in Proc. of IASTED Computational and Systems Biology (CASB 2006), Dallas, TX, Nov 2006, pp. 38–43.

[103] D. Ulm and J. Baker, “Solving a 2D knapsack problem on an associative computer augmented with a linear network,” in Proc. of the International Conference on Parallel and Distributed Processing Techniques and Applications, Sunnyvale, CA, Aug 1996, pp. 29–32.

[104] J. Lee, “Developing parallel SIMD algorithms for the traveling salesman problem,” Master’s thesis, Department of Computer Science, Kent State University, November 1989.

[105] P. Berra, “Some problems in associative processor applications to database management,” in Proceedings of the National Computer Conference and Exposition, May 1974, pp. 1–5.

[106] P. Berra and E. Oliver, “The role of associative array processors in database machine architecture,” IEEE Transactions on Computers, no. 4, pp. 53–61, 1979.

[107] C. Asthagiri and J. Potter, “Associative parallel lexing,” in Proceedings of the 6th International Parallel Processing Symposium (IPPS), March 1992, pp. 466–469.

[108] ——, “Parallel compilation on associative processors,” in Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT), North-Holland Publishing Co, The Netherlands, 1994, pp. 315–318.

[109] ——, “Parallel context-sensitive compilation,” Software-Practice and Experience, vol. 24, no. 9, pp. 801–822, Sept 1994.

[110] A. Bansal and J. Potter, “Exploiting data parallelism for efficient execution of logic programs with large knowledge bases,” in Proceedings of the 2nd International IEEE Conference on Tools for Artificial Intelligence, 1990, pp. 674–681.

[111] B. Reed, “An implementation of LISP on a SIMD parallel processor,” in First annual aerospace applications of AI, Dayton, OH, 1985, pp. 81–90.

[112] G. Steele and W. Hillis, “Connection Machine Lisp: Fine-grained parallel symbolic processing,” in Proceedings of the 1986 ACM conference on LISP and Functional Programming (LFP). New York, NY: ACM, 1986, pp. 279–297.

[113] T. Hasten, “An OPS5 implementation on a SIMD computer,” Master’s thesis, Department of Computer Science, Kent State University, 1987.

[114] J. Potter, M. Rivett, and T. Hasten, “Rule-based systems on SIMD computers,” in Proceedings of ROBEXS, 1987, pp. 198–204.

[115] B. Reed, “The ASPRO parallel inference engine (P.I.E.): A real time production rule system,” Tech. Rep. 85-6048, 1985.

Figure 25: Overall ATC System Design.