
The Open Unified Technical Framework High Performance Computing Run Time Infrastructure for the High-Level Architecture

Jeffrey S. Steinman, Ph.D., Craig N. Lammers, Maria E. Valinski,

Wendy L. Steinman, Mitchel W. Steinman, Gary E. Blank

WarpIV Technologies, Inc. San Diego, CA

[email protected] [email protected] [email protected] [email protected] [email protected] [email protected]

ABSTRACT

The Open Unified Technical Framework (OpenUTF) is a layered architecture that facilitates next-generation, scalable, composable, and cognitive M&S on a wide variety of computational platforms, network configurations, and operating systems. The WarpIV Kernel is the open-source reference implementation of the OpenUTF. Its high-speed communications and optimistic discrete-event engine provide scalable execution. The OpenUTF architecture supports legacy simulations running in both real time and logical time using the High-Level Architecture (HLA) interface standard. This paper primarily focuses on the logical-time HLA algorithms implemented in the WarpIV Kernel. All HLA services leverage the (a) WarpIV high-speed communications and (b) optimistic event management infrastructure when running in logical time. The HPC-RTI guarantees repeatable execution for federations operating in logical time using tie-breaking mechanisms and robust optimistic event-management algorithms. True zero-lookahead execution is also supported without creating deadlocks, introducing time-creeping, or requiring epsilon fudge factors to nudge time forward. In addition, a novel auxiliary time abstraction is featured by the HPC-RTI that allows multi-event zero-lookahead transactions to start and complete as a single unit before any other event across the federation is logically processed. This paper first provides a high-level introduction to the HPC-RTI, including its historical background. Then, this paper outlines the overall design of the HPC-RTI, including its abstract time representation, destination-based reliable multicast services, various performance optimizations, and important flow-control techniques that are unique to the WarpIV Kernel implementation. Some preliminary benchmark performance results are briefly provided under a variety of hardware, network, and software configurations for specific test cases. The primary benchmark configuration consists of a federation containing 252 federates executing on a 21-machine cloud with 12 cores on each machine.

ABOUT THE AUTHORS

JEFFREY S. STEINMAN, President and CEO of WarpIV Technologies, Inc., received his Ph.D. in High Energy Elementary Particle Physics from UCLA in 1988. Upon graduation, Dr. Steinman began work at the Jet Propulsion Laboratory/CalTech in the area of parallel and distributed supercomputing, where he created the Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) framework as the next-generation replacement for the Time Warp Operating System (TWOS). This work transitioned to industry in 1996 when Dr. Steinman formed the High-Performance Computing Division at Metron, and then later in 2000 when he joined RAM Laboratories as Vice President, Chief Scientist, and Director of High Performance Computing Programs. Dr. Steinman launched his own company, WarpIV Technologies, Inc., in 2005, where his focus has been the development and standardization of new high-performance parallel and distributed computing technologies based on the open source WarpIV Kernel and the Open Unified Technical Framework (OpenUTF). Dr. Steinman was the chair of the Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG), has five patents in the area of high-performance computing, and has published more than 80 papers in the field of M&S including “The Roadmap” for M&S.


INTRODUCTION

The idea for developing a High-Level Architecture (HLA) High Performance Computing Run Time Infrastructure (HPC-RTI) layered on top of an optimistic Parallel Discrete Event Simulation (PDES) framework began in the late 1990s and early 2000s when HLA was emerging as a standard (IEEE 1516-2000, IEEE 1516.1-2000, IEEE 1516.2-2000, IEEE 1516.3). The first implementation of the HPC-RTI (1998-2002) leveraged the Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) framework, which at the time represented the state of the art in optimistic parallel and distributed discrete-event simulation engine technology (Steinman, 1993, Steinman et al., 1999, Blank et al., 2000, Clark et al., 2002). SPEEDES had been used on several large Department of Defense (DoD) simulation programs involving Missile Defense and Joint service simulations (JSIMS, 2003).

The SPEEDES-based HPC-RTI demonstrated the feasibility of synchronizing logical-time federations using an existing optimistic rollback-based event-scheduling infrastructure (Fujimoto, 2000, Jefferson, 1985) to (a) send and receive interactions, and (b) coordinate object creation, object discovery, attribute updates, attribute reflections, object deletion, and object removal. Rollbacks transparently occurred inside the SPEEDES-based optimistic HPC-RTI as it processed its internal events while coordinating the operation of conservative (non-rollbackable) time-managed federates using software interfaces specified by the HLA standard (Fujimoto, 1997, Kuhl et al., 2000).

Despite its overall success as a prototype RTI, the SPEEDES-based approach suffered from several noteworthy limitations. First, SPEEDES itself was becoming obsolete and would later be replaced by the much-improved WarpIV Kernel, which has since become the open source reference implementation of the Open Unified Technical Framework (OpenUTF) (Steinman et al., 2010, Steinman, 2013). Second, the SPEEDES-based approach did not provide a lightweight optimized real-time execution mode. Its approach towards real-time operation was to run in logical time, hopefully faster than real time, and then synchronize logical time to the wall clock at Global Virtual Time (GVT) updates (Steinman et al., 1995). Third, no special rigor was provided to ensure repeatable execution of simultaneous federation events. Most RTIs in existence today, including the original HPC-RTI based on SPEEDES, simply return simultaneous events to federates in the order that their messages are received. While SPEEDES did provide rigor in processing its internal events in a robust and repeatable manner, it did not coordinate the messages flowing between each federate and SPEEDES through the RTI interface with the same rigor.

To address the important limitations of the SPEEDES-based HPC-RTI, a new implementation was developed in 2002 based on the WarpIV Kernel. The initial focus was to support real-time operation as a thin interface layer to the native OpenUTF High Speed Communications (OpenUTF-HSC) library, which combines (a) platform-independent scalable shared-memory read and write techniques for communications between federates operating on the same machine with (b) reliable network communications (TCP/IP using standard socket calls) between federates that reside on different machines (Steinman, 2005, Steinman, 2007, Steinman et al., 2012). One of the key features of the OpenUTF-HSC is its ability to support reliable destination-based multicast services, which map directly to the publish and subscribe data distribution services for objects and interactions specified in the HLA standard. Because messages are always delivered reliably, there is no reason to employ heartbeat entity state-update techniques, even when running in real time.


While real-time performance is extremely important for many HLA federation use cases, the real strength of the HPC-RTI is its unique ability to support federations operating in logical time, where guaranteed repeatability and scalable run-time performance can be critical for validating analysis studies. The logical-time algorithms developed for the WarpIV-based HPC-RTI applied the many lessons learned from the previous SPEEDES-based effort. This time, however, much more attention was paid to the abstract representation of logical time and to the synchronization mechanisms residing between the HPC-RTI and each time-managed federate, to absolutely ensure repeatable operation for all HLA services. This includes robust time tags for messages sent, messages received, and subscriptions, along with logical time-throttling barriers used for synchronization, to ensure that all tie-breaking fields in the abstract time representation (1) are unique, (2) never violate causality by going backwards, and (3) do not needlessly serialize performance (e.g., cause time creep or create deadlocks) even when zero-lookahead settings are used. Every operation in the HPC-RTI occurs at a unique and repeatable time, independent of the run-time performance characteristics of each machine and/or the network used to support the federation.

The HPC-RTI heavily leverages technology previously published on the OpenUTF, as implemented in the WarpIV Kernel. Numerous references to this previous work are provided throughout this paper, which include (a) synchronizing parallel and distributed simulations using optimistic time management with adaptable flow control, (b) scalable publish and subscribe data distribution services, (c) high-speed communications algorithms, and (d) the HLA standard itself. In addition, the OpenUTF specifies many important M&S capabilities that are not critical to the focus of this paper (e.g., model composability, artificial intelligence constructs, five-dimensional execution, process-model language extensions, rollback framework and its collection of utilities, motion algorithms, coordinate system transformations, math utilities, visualization and analysis tools, etc.) so they will not be discussed here.

TECHNICAL FEATURES

This section describes the important technical solutions that were developed and/or utilized by the HPC-RTI. For those not familiar with the parallel and distributed simulation technology of the OpenUTF and its open source WarpIV Kernel implementation, the following references provide a good introduction to the complex issues and solutions concerning time synchronization of parallel and distributed simulations (PDMS-SSG, 2008, Steinman et al., 2010, Steinman, 2015).

Unique Properties of the HPC-RTI

The HPC-RTI is different from other RTIs in that once the federation starts up, it does not allow new federates to join. Federates that resign early continue running in the background until all federates resign, which then cleanly brings down the federation execution. In addition, a federate crash occurring during execution intentionally brings down the entire federation automatically. The HPC-RTI considers the federation to be a single parallel execution rather than a disjoint set of loosely coupled applications. It would be far too tedious in a large federation involving hundreds of federates (or more) to manually bring each federate down when one crashes; global shutdown of each federate is automatic. This shift in focus is key to providing scalability concerning CPU usage, messages, and data storage. In this regard, no single node (or federate) manages the federation. All nodes are on an equal footing.

Representation of Abstract Time

Time-managed parallel and distributed simulations that provide repeatable operation require an abstract time representation with automated tie-breaking fields to ensure that each event occurs uniquely in time. This is accomplished in the OpenUTF by the abstract time class that contains the following data members listed below in the order of their tie-breaking properties.

1. NET_DOUBLE Time
2. NET_INT Priority1
3. NET_INT Priority2
4. NET_INT Counter
5. NET_INT UniqueId
6. NET_INT AuxCounter
7. NET_INT AuxUniqueId
8. NET_DOUBLE SuperCounter


NET_DOUBLE and NET_INT data types are special C++ classes that abstract machine storage representations when operating in distributed heterogeneous (e.g., mixed big endian and little endian) environments. When operating in homogeneous environments (e.g., all machines based on Intel or AMD processors are little endian) these data types are simply defined to be their native double precision and integer representations to maximize efficiency.
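To make the layout concrete, the following is a minimal illustrative C++ sketch of such a time class (the member names follow the list above, but the code itself is an assumption made for illustration, not the actual WarpIV Kernel source). Tie-breaking reduces to a lexicographic comparison over the fields in the order listed:

    #include <tuple>

    // In a homogeneous environment the NET_* types collapse to native types.
    typedef double NET_DOUBLE;
    typedef int    NET_INT;

    // Illustrative sketch of the abstract time class; the real WarpIV class
    // differs, but the lexicographic tie-breaking order is as described above.
    struct AbstractTime {
      NET_DOUBLE Time;         // physical time in seconds
      NET_INT    Priority1;    // user-controlled tie-breaker
      NET_INT    Priority2;    // used by the HPC-RTI for synchronization
      NET_INT    Counter;      // per-object incrementing event counter
      NET_INT    UniqueId;     // id of the scheduling simulation object
      NET_INT    AuxCounter;   // auxiliary counter for zero-lookahead transactions
      NET_INT    AuxUniqueId;  // auxiliary unique id for transactions
      NET_DOUBLE SuperCounter; // per-node event-instance bookkeeping

      // Two events are ordered by the first field that differs, in
      // tie-breaking order.
      bool operator<(const AbstractTime& rhs) const {
        return std::tie(Time, Priority1, Priority2, Counter, UniqueId,
                        AuxCounter, AuxUniqueId, SuperCounter) <
               std::tie(rhs.Time, rhs.Priority1, rhs.Priority2, rhs.Counter,
                        rhs.UniqueId, rhs.AuxCounter, rhs.AuxUniqueId,
                        rhs.SuperCounter);
      }
    };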

The first field in the abstract time is the actual physical time itself. Normally, Time is represented in seconds past midnight at the prime meridian (i.e., Greenwich time) of a specified day in a specified year. As a double, the Time field provides a very high-resolution 15-digit value for the actual physical time of events that are scheduled and processed. To put this in context, a year consists of 31,536,000 seconds, which has 8 digits. So, a simulation modeling activity over a year-long period still has seven more digits of resolution (i.e., resolution better than a microsecond). Representing physical time with a double is generally accurate enough to support all mainstream simulation use cases for DoD purposes.

Priority1 and Priority2 are provided to break ties when the physical time fields of two events are the same. Note that in special circumstances, Priority1 could be used as a 32-bit extension to the Time field to provide additional time resolution if needed. Normally, models operating in the OpenUTF set (or more likely ignore) these two fields to control the processing order of simultaneous events. The HPC-RTI sets the Priority2 field in scheduled events and time-barrier mechanisms as part of its federation-wide synchronization to ensure repeatability and causality without creating deadlocks within the federation. One of the reasons why two integer priority fields are specified is to maintain 8-byte alignment throughout the abstract time representation without compilers inserting arbitrary four-byte padding within the time class structure.

The Counter and UniqueId fields follow the priority fields in tie-breaking order and are automatically set after events are scheduled. Each simulation object (the HPC-RTI locally creates four of these per federate) has a unique identifier that is established during startup. Every event scheduled by a simulation object automatically assigns the UniqueId in its time tag to its unique value. This means that every simulation object in WarpIV generates events with a different abstract time from every other simulation object. Repeatability then boils down to ensuring that each simulation object itself generates unique time values. This uniqueness is accomplished with the Counter tie-breaking field. At startup, each simulation object’s Counter is initialized to zero. The Counter is automatically incremented by the simulation object as it schedules events. Thus, because all simulation objects produce unique time tags from each other (due to the UniqueId tie-breaking field) and because each simulation object itself generates unique times (due to its incrementing Counter), collectively this means that all events scheduled by simulation objects have unique times.

One subtle situation involving Counters must be addressed, and it is easily understood with an example. Suppose a simulation object’s Counter is 50 and it receives an event from another simulation object with Time = 100.0 and Counter = 75. When the event is processed, it would be possible for the event to schedule a new event with Time also set to 100.0 (i.e., a zero-lookahead event). If the Counter of the simulation object were simply used, it would schedule the new event with a Counter of 50, which would be less than 75. In other words, the new event would go backwards (in terms of tie-breaking sequencing) in time, which would violate causality. The fix to this problem is simple. Whenever the Counter of an event being processed is greater than the Counter stored in the simulation object, the simulation object bumps its Counter to the event’s Counter plus one. In this manner, causality is never violated while still guaranteeing unique time tags for all events scheduled within the simulation. Because events are processed optimistically, which means that they can roll back, the Counter managed by each simulation object must be represented as a rollbackable data type.
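A sketch of this bumping rule follows (again illustrative; in the WarpIV Kernel the Counter would be a rollbackable data type rather than a plain int):

    // Illustrative sketch of the Counter-bump rule.
    struct SimulationObject {
      int uniqueId;
      int counter = 0;  // initialized to zero at startup

      // Called while processing an incoming event.
      void ProcessEvent(const AbstractTime& eventTime) {
        // Bump past the incoming event's Counter so that any zero-lookahead
        // event scheduled from here never goes backwards in tie-breaking order.
        if (eventTime.Counter > counter) {
          counter = eventTime.Counter + 1;
        }
        // ... event-processing logic would go here ...
      }

      // Stamps the tie-breaking fields of a newly scheduled event.
      AbstractTime Schedule(double time) {
        AbstractTime tag{};
        tag.Time = time;
        tag.Counter = counter++;   // unique, monotonically increasing per object
        tag.UniqueId = uniqueId;   // unique per simulation object
        return tag;
      }
    };

In the example above, receiving the event with Counter = 75 bumps the object’s Counter from 50 to 76, so the new zero-lookahead event is stamped with 76 and never sorts before the event that caused it.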

The next two fields, AuxCounter and AuxUniqueId, behave exactly like the Counter and UniqueId fields except that they are used in zero-lookahead transactions that require multiple events to complete. The goal is to ensure that all time-tagged events in a multi-event transaction are assigned time-tag values that are less than those of all other events previously scheduled in the simulation. In other words, the transaction logically completes before the time tag of every other event in the simulation. This is effectively accomplished by locking down the first five tie-breaking fields within the transaction, as established by the originator of the transaction. The two auxiliary tie-breaking fields then ensure uniqueness for the series of events operating within the transaction.
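A hedged sketch of this stamping scheme is shown below; the function and parameter names are hypothetical:

    // Illustrative sketch: stamp an event that belongs to a zero-lookahead
    // multi-event transaction. The first five tie-breaking fields are frozen
    // to the originating event's values; only the auxiliary fields vary, so
    // the whole transaction sorts as a single unit in logical time.
    AbstractTime NextTransactionTime(const AbstractTime& origin,
                                     int auxUniqueId, int& auxCounter) {
      AbstractTime tag = origin;      // lock Time through UniqueId
      tag.AuxCounter = auxCounter++;  // unique within this participant
      tag.AuxUniqueId = auxUniqueId;  // unique per participating object
      return tag;
    }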

Interactions and object attribute updates are good examples of transactions. The sending federate schedules multiple events that are delivered to each subscribing federate. If the lookahead is zero, then each subscribing federate receives its event before any other event is logically processed in the federation. Hence, auxiliary time with zero lookahead solves the ambiguity problem of federates receiving simultaneous events in mixed, but repeatable, time order.

Furthermore, when running in logical time with zero lookahead, the auxiliary time capability solves the ownership management problem where two federates could potentially update the attributes of an object at the same time, thereby creating an ambiguity concerning the actual attribute values. With auxiliary time, this situation can never occur because whichever federate modifies the attribute first in its transaction pushes the new value to every other subscribing federate before any other federate attempts to modify the same attribute. Auxiliary tie-breaking fields could provide a significant step forward in simplifying how objects share data that are dynamically modified over the course of a simulation run. Rather than federates fighting over who gets to own the attribute, the problem shifts to simply who has write privileges (similar to private, protected, and public privileges in C++ to access data members and methods within classes) to modify the value. An alternative strategy might be to provide access keys to certain subscribers so that they can unlock write privileges that enable them to modify certain attributes of discovered objects.

Finally, the SuperCounter is essentially an integer counter (represented as a double to extend its dynamic range to avoid overflow for simulations scheduling more than 2 billion events per node) that uniquely identifies the instance of the event itself. Each node (i.e., federate) manages its own SuperCounter, which is strictly a bookkeeping mechanism that is used to track event instances within the optimistic event-processing infrastructure. The SuperCounter does not need to be rollbackable.

Operator overloading in C++ allows the abstract time class to behave in numerical expressions as its double-precision Time field. Operator overloading in this manner is extremely useful when (a) computing mathematical expressions that involve the current event time, and/or (b) when computing the time for new events being scheduled.
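For instance, a conversion operator (sketched below on a reduced class, as an assumption about one way to provide this behavior) is enough to let time values participate directly in arithmetic:

    #include <cassert>

    // Sketch: an implicit conversion operator lets the abstract time class
    // behave as its double-precision Time field in numerical expressions.
    struct EventTime {
      double Time;  // physical time field; tie-breaking fields omitted
      operator double() const { return Time; }
    };

    int main() {
      EventTime now{100.0};
      double tNew = now + 30.0;  // 'now' converts to its Time field
      assert(tNew == 130.0);
      return 0;
    }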

Time Barriers

Time barrier mechanisms prevent Global Virtual Time (GVT) from going beyond a specified time (Steinman, 1993, Steinman, 2005). Each node can create as many time barriers as necessary to throttle the advancement of GVT, which is updated periodically in a globally synchronized manner involving all nodes. Each node performs the following steps when updating GVT:

1. Receive all messages and antimessages that are still in transit
2. Determine TB as the minimum time of all existing barriers on the node
3. Determine TL as the time of the next unprocessed local event on the node
4. Determine LVT (Local Virtual Time) as the minimum of TB and TL
5. Determine GVT as the global minimum of LVT values across all nodes

Time barriers are critical in coordinating time requests with events that federates may schedule through the HPC-RTI. When a federate requests advancement to a particular time, T, an agreement is essentially made between the HPC-RTI and the federate that it will not schedule any new federation events with time tags less than the requested time plus the federate’s lookahead, L, which can be zero. Each federate may use a different lookahead value. When using the time advance request service, each federate can also advance time with its own time step without requiring global lock-step operation across the federation.
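A minimal sketch of the per-node portion of this GVT update follows (illustrative only; GlobalMin stands in for whatever collective reduction the communication layer actually provides, and step 1 is assumed to be handled by the messaging layer):

    #include <algorithm>
    #include <limits>
    #include <vector>

    // Stand-in for the collective minimum across all nodes; a real system
    // would use the communication fabric (this single-node stub just
    // returns its input).
    static double GlobalMin(double localValue) { return localValue; }

    // Steps 2-5 of the GVT update for one node. Step 1 (receiving all
    // in-transit messages and antimessages) is assumed to happen first.
    double UpdateGvt(const std::vector<double>& barrierTimes,
                     double nextUnprocessedEventTime) {
      const double inf = std::numeric_limits<double>::infinity();

      // Step 2: TB = minimum time of all existing barriers on this node.
      double tb = barrierTimes.empty()
                      ? inf
                      : *std::min_element(barrierTimes.begin(), barrierTimes.end());

      // Step 3: TL = time of the next unprocessed local event.
      double tl = nextUnprocessedEventTime;

      // Step 4: LVT = min(TB, TL).
      double lvt = std::min(tb, tl);

      // Step 5: GVT = global minimum of LVT across all nodes.
      return GlobalMin(lvt);
    }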

When each federate makes a request to time T, a time barrier is created inside the federate’s HPC-RTI node with time T + L, meaning that GVT is not permitted to go beyond that time because the federate could potentially schedule a federation event with time T + L. Of course, if the HPC-RTI has an event to deliver to the federate with an earlier time value than what was requested, the federate will be granted that time rather than the requested time when the next event request HLA service is used.

Each node (representing a federate) within the federation uses its time barriers to coordinate the advancement of time in a decentralized manner that eliminates time creeping (i.e., federation-wide time stepping by the smallest lookahead value used in the federation, rather than jumping to the next event; a critical optimization when lookahead is small) while promoting maximal performance and ensuring complete repeatability.

Barrier times are established from the requested time plus lookahead, with one additional step: Priority2 is also incremented to ensure that events coming from the federate always have increasing time tags. The time tags of all events scheduled by the federate back to the HPC-RTI are further checked and adjusted if necessary so that their time values are always greater than or equal to the barrier time and are unique. Keep in mind that multiple events could be scheduled by the federate after it has received a grant. Additional care is taken to (a) increment the Counter, and (b) set a UniqueId based on the federate for all of its scheduled events.

High-level Software Design

A high-level conceptual design of the HPC-RTI executing with four HLA federates (i.e., nodes) is shown in Figure 1. Each HLA federate operates independently with its own unique simulation engine and specific processing strategy (e.g., real-time, time-stepped, or discrete-event). The HPC-RTI follows the original standard HLA 1.3 interface specification, which permits each federate to coordinate its activities with the other federates participating in the federation.

Figure 1: High-level design of the HPC-RTI.

Within the WarpIV Kernel, each federate is paired with four internal simulation objects that represent the federate as its proxy. Two simulation objects coordinate sending and receiving object attributes, while the other two coordinate sending and receiving interactions. Separating these four operations enhances scalable performance by significantly reducing rollbacks from event collisions. When a federate schedules an event for the RTI, the event is actually scheduled inside the WarpIV Kernel for the federate’s proxy object-sender or interaction-sender representation, which then initiates the internal data distribution transaction. As previously mentioned, the time barriers enforce causality and repeatability rules to ensure that all time tags are unique and causal, and will not cause deadlocks or time-creeping artifacts to occur even when lookahead is zero.

Figure 2 illustrates the basic flow of messages occurring between an optimistic simulation engine and a conservative simulation engine (e.g., the federate) using the HLA standard. This example assumes usage of the next event request service (i.e., discrete-event). The top event timeline represents the federate’s event queue, while the bottom timeline represents optimistically processed events within the simulation engine that could potentially be rolled back.

Figure 2: Rules for coordinating messages in logical time that are going into and coming out of the HPC-RTI.

Outgoing messages within the HPC-RTI are queued for delivery to the federate when GVT is greater than or equal to their corresponding event time tags. In other words, the messages are queued for delivery once the receive event is committed and therefore cannot roll back. Each outgoing message creates a local time barrier using its time tag + Lookahead to prevent GVT from advancing too far into the future. The barrier is automatically removed once the event is committed and queued for delivery. Of course, it is possible that the federate’s actual next event ought to occur at TminQ, which would happen if its time value were less than Tnext. Also, notice that as previously described, the barrier time, Tbarrier, is automatically set to Tnext + L, which also throttles GVT from advancing too far while eliminating time creeping.

Time advanced (but possibly not yet granted), Tadv, can be determined as the minimum of TminQ, Tnext, and GVT. It is possible for no outgoing messages to exist and that GVT is less than Tnext. In this case, a new time grant cannot be issued to the federate because it is unknown whether its next event is its true next event or if another event within the HPC-RTI having an earlier time tag will subsequently arrive. Time, according to the next event request service in the HLA standard, is only advanced when Tadv is equal to TminQ or Tnext. Because all events have unique time tags, only a single message at a time between grants is returned to the federate when the next event request service is used.

[Figure 1 annotations: Legacy models and WarpIV models achieve bridged interoperability. Each WarpIV Kernel, HPC-RTI, and HLA federate operates as a single process; multiple instances execute on networks of multicore machines connected through high-speed communication.]

[Figure 2 annotations: Events are processed optimistically within the WarpIV Kernel simulation engine, but conservatively within each HLA federate simulation engine. Key relations: Tadv = Min(TminQ, Tnext, GVT); Local Time = Min(Tbarrier, TminQ + L, LVT); GVT = GlobalMin(LVT); Tbarrier = Tgrant + L.]


The tick operation does one of three things. First, if there is an outgoing message in the queue with time less than Tnext, the tick operation delivers it to the federate and then grants time if next event request is used. Second, if GVT is greater than Tnext, it grants time to Tnext, which allows the federate to process its next internal event. Finally, if neither of the first two conditions is true, it performs a GVT cycle, which processes events and receives messages in collaboration with all of the other nodes. After each cycle, GVT will potentially advance, possibly resulting in new committed messages in the outgoing message queue. It is important to note that GVT cycles are only performed when time cannot be granted anywhere across the federation. Minimizing the number of GVT cycles and the overheads of these synchronization messages is absolutely crucial for maintaining high performance. In principle, GVT should advance by at least the minimum lookahead value provided by the federates when flow-control restrictions are removed. In any case, time never creeps, but rather jumps to at least the earliest requested time.
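The decision logic reduces to a three-way branch, sketched below with illustrative names (not the HPC-RTI API):

    // Illustrative sketch of the three-way tick decision described above.
    enum class TickAction { DeliverMessage, GrantTime, RunGvtCycle };

    TickAction Tick(bool outgoingMessageEarlierThanTnext,
                    double gvt, double tNext) {
      if (outgoingMessageEarlierThanTnext) {
        return TickAction::DeliverMessage;  // deliver; grant if next event request is used
      }
      if (gvt > tNext) {
        return TickAction::GrantTime;       // federate processes its next internal event
      }
      return TickAction::RunGvtCycle;       // collaborate with other nodes so GVT can advance
    }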

Federation Launching and Startup

It is imperative for each federate to be mapped to the same NodeId across multiple runs to guarantee repeatability, because the NodeId affects subtle tie-breaking fields in the simulation time. This assumes that (a) the scenario is unchanged, and (b) pseudo-random number generation seeds are also unchanged. The NodeId is used to determine the UniqueId of the federate’s proxy simulation objects, which thereby affects the tie-breaking fields in all internally scheduled event times. A change in the federate-to-node mapping therefore potentially affects event ordering.

A Parallel Application Launcher (PAL) provided with the HPC-RTI can launch federation applications as (1) a standalone program that reads a Run Time Class (RTC) description file, which specifies how and where to run each application, or (2) an API that user programs can call directly to specify everything necessary to launch each application in the parallel execution. The PAL determines which applications run on which machines, assigns their NodeIds in a repeatable manner, and then rapidly launches them across machines in parallel. Each federate is launched concurrently without introducing serialization bottlenecks, so federation startup times are short.

For example, a large federation consisting of 252 federates executing on 21 machines (each with 12 cores) can be fully launched and complete all of its initialization steps (i.e., create federation execution, join federation, enable time management, complete all of its publish and subscribe requirements) in just a few seconds.

There is no need to use sync points during initialization because startups are automatically synchronized when federates join the federation. Also, publications and subscriptions are time managed and repeatable when running in logical time, which eliminates the need to wait until subscriptions have completed before registering objects or sending interactions. Objects discovered by federates are always provided with the latest attribute values. This further eliminates the need to request attribute updates, which in other RTIs produces an undesirable artifact across the federation as new subscriptions by late joiners are dynamically made.

PERFORMANCE OPTIMIZATIONS

A number of unique and noteworthy performance optimizations have been developed for the HPC-RTI. Three of these optimizations are discussed below.

Multicast Event Scheduling

Optimistic discrete-event simulation engines must provide intricate bookkeeping mechanisms to schedule events between simulation objects knowing that if the scheduling event is rolled back due to the reception of a straggler message, it must subsequently retract all of the events it scheduled. Antimessages chase down and cancel all messages that were generated from events during rollbacks. Of course, late arriving antimessages trying to cancel previously sent messages that have already been processed could generate further rollbacks and antimessages, which can quickly cascade to create instabilities. Flow control to address both rollback and antimessage potential instabilities is discussed in the next optimization technique.

Event scheduling in traditional optimistic simulation engines utilizes unicast messaging to coordinate message and antimessage delivery between simulation objects. However, the OpenUTF high-speed communications framework supports destination-based multicast message delivery mechanisms. The optimistic event-scheduling mechanism within the WarpIV Kernel was upgraded to utilize multicasting when scheduling events that distribute data to multiple nodes, such as (1) object creation and discovery, (2) object deletion and removal, (3) reflection of updated attributes, and (4) sending and receiving interactions.

Consider, for example, a federation consisting of ten federates, where each federate subscribes to a particular interaction. With unicast, nine interaction messages would be individually sent from a federate to each other federate, and a rollback would result in nine antimessages having to be sent. With multicast, a single interaction message is distributed to all subscribing federates, and only a single multicast antimessage would be sent during a rollback.

The OpenUTF-HSC defines very specific communication services for hosting scalable M&S. As an open standard, optimized implementations could be developed for specific platform and/or network configurations. As shown in Figure 3, the native OpenUTF-HSC implemented in the WarpIV Kernel efficiently integrates message passing and parallel programming constructs through shared memory and network services (Steinman, 2007, Steinman et al., 2012). It was designed to support scalable M&S operating on clouds and HPC platforms.

All messages going between machines pass through distributed servers that deliver messages via standard network protocols. Executing multiple distributed servers reduces internal processing bottlenecks that potentially arise when just a single server is used. The configuration used in reporting performance in this paper placed 12 distributed servers on a separate machine with four gigabit Ethernet connections to a high-speed Cisco switch. Of course, other configurations are possible, such as having a single distributed server on each machine coordinating the sending of messages to other machines.

Figure 3: High-speed communications framework coordinates message passing and parallel programming constructs between machines through a distributed server and between nodes residing on the same machine using shared memory.

There are several benefits of having all network messages flow through servers rather than have every federate directly connect with every other federate. First, this approach provides a huge reduction in the number of socket interconnections. Large federations using the HPC-RTI launch with almost no delay because connections are fast and scalable. Second, each server is dedicated to handle its share of the messages, which can reduce processing time from senders and receivers when their local network buffers become full and blocking occurs. Third, it is actually surprising how fast each server can process multicast or broadcast messages, especially when high-speed switches are used.

Figure 4: Unicast and multicast message-passing example showing eight federates operating on two multicore computers.

Scaling this analysis up to larger hardware configurations results in dramatically improved performance. WarpIV Technologies, Inc. has recently purchased a 264-core rack-mounted cloud computer for benchmarking and testing the HPC-RTI. Each rack-mounted machine has 2 Xeon CPUs, each configured with 6 cores. A total of 21 rack-mounted machines providing 252 cores for use in a federation execution are connected through a high-speed gigabit Ethernet Cisco router with an additional rack-mounted machine also connected to the Cisco router that runs distributed servers. Each rack-mounted machine has four network ports that can be combined into a single logical port. Because the Cisco router only has 48 connections, each of the 21 machines combines two of its ports into a logical port, while the machine running 12 distributed servers utilizes all four of its ports.

[Figure 3 elements: parallel applications communicate through a shared-memory server on each machine and through distributed servers between machines.]

[Figure 4 annotations: Unicast requires 3 shared-memory messages plus 8 network messages; multicast requires 2 shared-memory messages plus 2 network messages.]


Figure 4 depicts the message flow for unicast vs. multicast message-passing approaches using servers. In this example, the federate executing on Node 1 broadcasts to all other federates executing on the other nodes. Notice (top) the unicast approach requires eight network messages plus three shared memory messages. The multicast approach (bottom) requires only two network messages and two multicasted shared-memory messages.

Expanding this to a larger configuration involving 252 federates, a broadcast message is reliably multicasted to 251 subscribing federates with only 21 network messages and 21 multicasted shared-memory messages. In contrast, the unicast approach going through a distributed server would require 480 network messages and 11 shared-memory messages. Because communications through shared memory are significantly faster than communications over the network, the important performance distinction is the reduction in network messages. In this configuration, unicasting sends 22.86 times more network messages than multicasting. Recent benchmarks show that 6,600,000 short messages per second can be delivered using multicasting across 252 federates.
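These counts follow directly from the topology; the short program below (a sketch, with the topology values taken from the text) reproduces the arithmetic:

    #include <cstdio>

    // Reproduces the unicast vs. multicast message-count arithmetic for the
    // benchmark topology (one federate broadcasts to all 251 others).
    int main() {
      const int machines = 21;          // federate machines (servers are separate)
      const int coresPerMachine = 12;
      const int localReceivers = coresPerMachine - 1;                // 11 on sender's machine
      const int remoteReceivers = (machines - 1) * coresPerMachine;  // 240 elsewhere

      // Unicast through a server: each remote receiver costs two network hops
      // (sender -> server, server -> receiver); local receivers use shared memory.
      int unicastNetwork = 2 * remoteReceivers;   // 480
      int unicastShared = localReceivers;         // 11

      // Multicast: one message to the server, one relay per remote machine,
      // and one shared-memory multicast per machine.
      int multicastNetwork = 1 + (machines - 1);  // 21
      int multicastShared = machines;             // 21

      std::printf("unicast: %d network + %d shared\n", unicastNetwork, unicastShared);
      std::printf("multicast: %d network + %d shared\n", multicastNetwork, multicastShared);
      std::printf("network ratio: %.2f\n",
                  static_cast<double>(unicastNetwork) / multicastNetwork);  // 22.86
      return 0;
    }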

Consider yet a larger and more modern configuration. AMD plans to release its 32-core (with full simultaneous multithreading) Zen Naples chip in the second quarter of 2017. Imagine a cloud configuration with each rack-mounted machine having two of these AMD CPUs plugged into its motherboard. A total of 16 machines would provide 1,024 cores, with each core capable of hosting an HPC-RTI federate as part of a large HLA federation. A 17th machine could host 64 distributed servers.

Using the same analysis as before, a single federate distributing an interaction to the other federates would result in 16 network messages (i.e., 1 to the server and 15 for relaying the message to the other machines) and 16 shared-memory multicast messages. In comparison, the unicast approach would result in 1920 network messages and 63 shared memory messages. Now, unicasting sends 120 times more network messages than multicasting. Assuming that the same Cisco switch is used (with the same throughput) and that the network is the bottleneck, this new configuration is expected to reliably deliver more than 35,000,000 messages per second. These numbers go far beyond anything imaginable with pure network-based reliable unicast communications, even if every federate had a direct network connection to every other federate, which would cut down the message traffic by a factor of 2. Also, keep in mind that the network messages in the OpenUTF-HSC are delivered reliably using TCP/IP, thereby eliminating the need for heartbeat message sending strategies, which for real-time federations further reduces network traffic.

The distributed server communications infrastructure used in the WarpIV Kernel implementation of the OpenUTF-HSC may actually be the ideal cloud-based or HPC message-passing strategy for emerging large-scale multicore processing systems that will be available in the not-so-distant future (Manycore Computing Workshop, 2007).

Flow Control for Cascading Rollbacks

Optimistic simulations, without appropriate flow control to mitigate the undesirable effects of rollbacks and antimessages, can become unstable due to cascading rollbacks spanning multiple levels of recursion hops in the retraction process. Several flow control techniques are uniquely provided to determine (a) when to release event messages that are generated as events are processed, and (b) when to throttle event processing itself to mitigate the effects of runaway nodes that get too far ahead of the other nodes in their event processing. These techniques are well documented in previously published papers and will not be described here (Steinman, 1993, Steinman et al., 1995, Steinman, 2005, Steinman et al., 2010, Steinman, 2013).

However, one very important technique has proven to be quite useful in providing critical flow control for very large federations. This technique (shown in Figure 5) completely eliminates cascading rollbacks and antimessages, resulting in a fully stable execution of large time-managed federations operating with the HPC-RTI.

Figure 5: The top figure (Case 1) shows a first event scheduling a second event, where GVT is less than the time of the first event. Because the first event could be rolled back, the second event holds back its messages. Antimessage sequences become limited to a single recursion hop. The bottom figure (Case 2) shows that it is safe to release those messages generated by the second event because the first event is committed due to GVT advancing beyond its event time. In other words, the second event cannot be retracted from an antimessage. This throttles antimessages to one hop.


The technique is straightforward and easy to understand. The header of each event message stores not only the time tag of the scheduled event (Te), but also the time of the event that scheduled it (T0). After each event is processed, a determination is made whether to (a) immediately release any messages that might have been generated by the event, or (b) hold those generated messages back for release later, when the risk of rollback is reduced. Unsent messages for events that are rolled back are simply discarded, which has minimal impact on performance in comparison to messages that were sent (but shouldn’t have been) and then need to be retracted by antimessages. Cascading rollbacks are completely eliminated by only sending messages when GVT is greater than T0; messages are otherwise held back. Note that this flow control for when messages are released has no bearing on repeatability. Regardless of when messages are released, events are still processed in a repeatable time-ordered manner.
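A sketch of the release rule (illustrative names, not the WarpIV API):

    // Each event message header carries Te (the scheduled event's time) and
    // T0 (the time of the event that scheduled it).
    struct EventMessageHeader {
      double te;  // time tag of the scheduled event
      double t0;  // time of the scheduling event
    };

    // Release the messages an event generated only once its scheduling event
    // is committed; the scheduling event can no longer roll back, so any
    // antimessage sequence is limited to a single hop.
    bool OkToSendGeneratedMessages(const EventMessageHeader& h, double gvt) {
      return gvt > h.t0;  // otherwise hold the messages back
    }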

Split Lookahead

Lookahead has a significant effect on the performance of both (a) time-managed federations and (b) optimistic and conservative parallel discrete-event simulations (Steinman, 1994). The use of lookahead to help synchronize federates across a federation is handled differently from lookahead used within the internal optimistic event processing of the HPC-RTI itself. In the first case, lookahead is applied to time barriers that throttle GVT updates. In the second case, lookahead adds time to internally scheduled events that go across nodes to distribute data (i.e., object attributes and interactions), which in turn reduces rollbacks. To the federate, lookahead is a single value that is specified when enabling time management. Yet to the HPC-RTI, the single lookahead value could be split into two separate lookahead values as shown in Figure 6.

Figure 6: Split lookahead divides the federate’s lookahead into two parts that are then used to (1) establish time barriers that operate between the federate and the HPC-RTI, and (2) internal event-scheduling lookahead that is used within WarpIV to reduce rollbacks and antimessages.

It is important to note that the two separate lookahead values used by the HPC-RTI must still add up to the single value provided by the federate to produce the same overall lookahead effect.

Define LF as the lookahead specified by the federate, LB as the lookahead used to establish time barriers, and LR as the lookahead used by the HPC-RTI to process events internally. Then LF = LB + LR.

Another way to specify how lookahead divides into two individual lookaheads is to use a Split Lookahead Ratio (SLR) whose value lies between 0.0 and 1.0. In this way, LB = SLR x LF, and LR = (1.0 - SLR) x LF. For true consistency, round-off errors due to machine precision must be carefully accounted for so that LF is exactly equal to LB + LR when used in expressions.
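A sketch of this split (illustrative; the round-off guard reflects the consistency requirement just described):

    // Illustrative sketch of split lookahead. LB throttles GVT through time
    // barriers; LR is added to internally scheduled events to reduce rollbacks.
    struct SplitLookahead {
      double lb;  // barrier lookahead (LB)
      double lr;  // internal event-scheduling lookahead (LR)
    };

    SplitLookahead Split(double lf, double slr /* in [0.0, 1.0] */) {
      SplitLookahead s;
      s.lb = slr * lf;
      s.lr = lf - s.lb;  // complement of LB
      // Round-off guard: nudge LB so that LB + LR reproduces LF exactly,
      // as the consistency requirement above demands.
      if (s.lb + s.lr != lf) {
        s.lb = lf - s.lr;
      }
      return s;
    }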

Consider two extremes. In the first case, if SLR were 0.0, then this would effectively provide all of the lookahead inside the optimistic event processing of the HPC-RTI. While this would reduce the number of rollbacks, it would also overly throttle the advancement of GVT due to the zero lookahead used in constructing time barriers, resulting in an excessive number of GVT cycles. In the second case, if SLR were 1.0, then this would effectively provide all of the lookahead in the time barriers, resulting in fewer GVT cycles. But, there would be no lookahead within the HPC-RTI to reduce rollbacks. Both of these extremes produce suboptimal performance. The optimal solution is somewhere in between these two extremes. Observations from a variety of benchmarks show that a reasonable value for SLR is 0.5. While the optimal value for SLR likely varies across federation sizes and their message sending/receiving characteristics, using 0.5 for SLR appears to always produce respectable performance across a variety of machine configurations.

Preliminary performance results from the same 252-federate configuration previously used for real-time benchmarking, now executing in logical time, show that 2,200,000 messages can be delivered across the federation per second when lookahead is comparable to update rates. Results are perfectly repeatable even when federates receive multiple events with the same logical time tag. More detailed performance results showing performance sensitivity to lookahead, SLR, cascading rollbacks, and message-passing risk flow control settings will be provided in a future publication. A number of optimizations (based on space-time data structures and touch-depend events) have been made within the WarpIV data distribution mechanisms to virtually eliminate all rollbacks. These critical logical-time optimizations will also be reported in a future publication.

[Figure 6 annotations: logical-time timelines showing Federate A scheduling an event through HPC-RTI (A) and Federate B receiving it through HPC-RTI (B). Green: Split Lookahead Ratio = 0.0 (serializes federates; excessive GVT cycles). Red: Split Lookahead Ratio = 0.5 (equal concurrency; just right). Blue: Split Lookahead Ratio = 1.0 (serializes HPC-RTI; excessive rollbacks).]


MISCELLANEOUS FEATURES AND LIMITATIONS

While the HPC-RTI supports the most widely used bread-and-butter services of the HLA 1.3 interface specification, it is not yet a fully compliant RTI. For example, exception handling and reporting of every problematic condition is not yet implemented. The HPC-RTI also does not yet support the Data Distribution Management (DDM) and Ownership Management (OWM) services. These two service categories are straightforward to implement and, unlike in other RTIs, they would be fully time managed and repeatable when eventually provided by the HPC-RTI. These two services could be developed in the near future if there is demand for them.

In addition, the Management Object Model (MOM) services, along with their assumed run-time data capturing mechanisms, are not yet implemented. The HPC-RTI has its own run-time data capturing mechanisms, which are considerably different from, and more detailed than, the MOM specification.

The HPC-RTI does not currently support the save and restore Federation Management services that are used for implementing periodic checkpoints and performing federation-wide restarts. Again, this would be straightforward using the automatic persistence data storage capabilities that are built into the WarpIV Kernel.

Finally, the HPC-RTI does not currently support dynamic joins and resigns. An HPC-RTI federation runs as a parallel application, which is ideal for constructive operation in clouds or HPC platforms. An alternative mechanism based on the External Modeling Framework (EMF) could support dynamic joins and resigns with those remote federates that need to connect externally to the central parallel federation (Steinman et al., 2012). Of course, the EMF wrapper interface would still hold to the HLA standard.

Despite the limitations and “features” of the current HPC-RTI, it offers some additional important capabilities that are not provided by other RTIs.

Machine clocks are automatically synchronized during startup. Federations operating in real time that require high-precision timing services could potentially utilize these capabilities. For benchmarking purposes, all messages are internally tagged with their send times, which the HPC-RTI uses to measure latencies for received messages. As a run-time option, users can automatically produce performance plots at the end of a federation execution showing message latency over time, by message size in bytes, and as statistical histogram distributions.

The HPC-RTI provides a rigorous data logging service that has proven to be very helpful in hunting down bugs or repeatability issues within individual federates and/or across the federation. When enabled, the HPC-RTI meticulously logs everything it does within each federate across the federation. A single logical-time-sorted and federation-wide merged log file is created showing everything going into the RTI and everything coming out of the RTI. Several command-line tools are provided to compare two log files and show their differences. This valuable capability has been utilized several times to hunt down repeatability problems created by individual federates. Typical mistakes made by federates include sending messages while waiting for time grants, using the wall clock to influence message sending rates or random number generation, using process Ids inappropriately for federate names or as data, etc.

PERFORMANCE BENCHMARKING

A rigorous HPC-RTI benchmarking effort is currently in progress to study the behavior of the HPC-RTI under a variety of test configurations operating in both real-time and logical time. Testing will measure the effects of the performance optimization algorithms discussed in this paper (e.g., destination-based multicasting vs. unicasting for event scheduling, cascading rollbacks coupled with other flow control mechanisms, and the effects of split lookahead). Results will be published in a subsequent technical paper and detailed report.

REAL-TIME PERFORMANCE PREDICTION

A parallel simulation of the HPC-RTI operating in real time has been constructed within the WarpIV Kernel simulation engine. The goal of this simulation is to predict performance for very large federations operating on a wide variety of platforms and network configurations. Models in the simulation represent (1) each federate sending and receiving interactions and object attribute updates, (2) distributed servers receiving and forwarding messages, (3) communications through shared memory, and (4) a high-speed switch with a variable number of ports connected to each machine via Ethernet. Each simulated federate and server operates on a simulated dedicated CPU that behaves as a resource in a queuing model. Similarly, the high-speed switch is also modeled as a complex queuing system. Performance measurements from actual HPC-RTI runs with up to 252 federates are used to determine the queuing parameters for sending messages through the network, passing through the switch, and propagating through shared memory. Results of this effort will be published in a subsequent technical paper and detailed report.
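As a rough illustration of the modeling approach, the following minimal C++ sketch treats a CPU (or switch port) as a FIFO queuing resource whose service times would be calibrated from measured HPC-RTI runs. The names (Job, FifoResource) and the simple FIFO discipline are assumptions for illustration only; the actual models are considerably more detailed.

    // Minimal sketch of a resource in a queuing model: a job waits until
    // the resource is free, then holds it for its measured service time.

    #include <algorithm>

    struct Job { double arrivalTime; double serviceTime; };

    struct FifoResource {
        double busyUntil = 0.0; // time the resource next becomes free

        // Returns the job's departure time from this resource.
        double process(const Job& job) {
            double start = std::max(job.arrivalTime, busyUntil);
            busyUntil = start + job.serviceTime;
            return busyUntil;
        }
    };

A message's predicted end-to-end latency is then obtained by chaining such resources (sending CPU, network link, switch, receiving CPU), with each stage's departure time feeding the next stage's arrival time.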

SUMMARY AND CONCLUSIONS

The HPC-RTI represents a major step forward in supporting scalable, high-speed HLA federations operating in distributed networks, clouds, and HPC environments. Its scalable real-time and logical-time performance takes full advantage of the OpenUTF-HSC capabilities and of optimistic event processing with adaptable flow-control mechanisms. While the technical solutions described in this paper are general and could be applied within other RTI implementations, the guaranteed repeatability and scalable optimistic processing mechanisms described here are extremely complex and therefore largely unique to the HPC-RTI. Supporting repeatable execution is important for federations performing analysis; without repeatability, verifying and validating constructive federations is next to impossible because their results cannot be reproduced.

Another very important capability provided by the HPC-RTI is that it can act as a bridge, allowing legacy federates that are capable of working with HLA to integrate with the OpenUTF as a first step toward modernization. Models from such federates could then be incrementally extracted and re-implemented to operate natively within the OpenUTF, where performance would be substantially improved. Interoperability between models running natively within the OpenUTF and HLA federates has already been successfully demonstrated. The HPC-RTI therefore provides an incremental modernization strategy that allows existing simulations capable of supporting the HLA standard to continue providing utility to their user communities while establishing a path forward for hosting next-generation capabilities within the OpenUTF.

REFERENCES

Blank Gary, Steinman Jeffrey, Shupe Scott, and Wallace Jeff, 2000. “Design and Implementation of the HPC-RTI for the High Level Architecture in SPEEDES 0.81.” In Proceedings of the 2000 Spring Simulation Interoperability Workshop. Paper 00S-SIW-153.

Clark Joe, Capella Sebastian, Bailey Chris, Steinman Jeff, and Peterson Larry, 2002. “The Development of an HLA Compliant High Performance Computing Run-time Infrastructure.” In Proceedings of the 2002 Spring Simulation Interoperability Workshop. Paper 02S-SIW-016.

Fujimoto, R. 1997. “Zero Lookahead and Repeatability in the High Level Architecture.” In Proceedings of the 1997 Spring Simulation Interoperability Workshop (SIW).

Fujimoto R., 2000. “Parallel and Distributed Simulation Systems.” John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012.

HLA Framework and Rules, IEEE Std 1516-2000.

HLA Interface Specification, IEEE Std 1516.1-2000.

HLA Object Model Template, IEEE Std 1516.2-2000.

HLA Federation Development and Execution Process, IEEE Std 1516.3.

Jefferson D., 1985. “Virtual Time.” ACM Transactions on Programming Languages and Systems, Vol. 7, No. 3, Pages 404-425.

Kuhl Frederick, Weatherly Richard, and Dahmann Judith, 2000. “Creating Computer Simulation Systems, An Introduction to the High Level Architecture.” Prentice Hall PTR, Upper Saddle River, NJ 07458.

Joint Simulation System (JSIMS) Common Component Simulation Engine (CCSE) Software Design Document (SDD), 2003.

Manycore Computing Workshop, 2007. Sponsored by Microsoft. Participation was by invitation only. http://science.officeisp.net/ManycoreComputingWorkshop07/Program.aspx.

Parallel and Distributed Modeling and Simulation Standing Study Group (PDMS-SSG) Terms of Reference (TOR), 2008.

Steinman Jeff, 1993. “Breathing Time Warp.” In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS93). Vol. 23, No. 1, Pages 109-118.

Steinman Jeff, 1994. “Discrete-Event Simulation and the Event Horizon.” In Proceedings of the 1994 Parallel and Distributed Simulation Conference. Pages 39-49.

Steinman Jeff, Nicol David, Wilson Linda, and Lee Craig, 1995. “Global Virtual Time and Distributed Synchronization.” In Proceedings of the 1995 Parallel and Distributed Simulation Conference. Pages 139-148.

Steinman Jeff, et al., 1999. “Design of the HPC RTI for the High Level Architecture.” In Proceedings of the 1999 Fall Simulation Interoperability Workshop. Paper 99F-SIW-071.

Steinman Jeff, et al., 1999. “The SPEEDES-Based Run-Time Infrastructure for the High-Level Architecture on High-Performance Computers.” In Proceedings of the High Performance Computing 1999 Conference, Grand Challenges in Computer Simulation. Pages 255-266.

Steinman Jeff, 2005. “WarpIV Kernel: Real Time HPC-RTI Prototype.” In Proceedings of the 2005 Spring Simulation Interoperability Workshop. Paper 05S-SIW-071.

Steinman Jeff, 2005. “The WarpIV Simulation Kernel.” In Proceedings of the 2005 Principles of Advanced and Distributed Simulation (PADS) Conference.

Steinman Jeff, 2007. “WarpIV Kernel: High Speed Communications.” In Proceedings of the Fall 2007 Simulation Interoperability Workshop. Paper 07F-SIW-045.

Steinman Jeff, Lammers Craig, Valinski Maria, and Steinman Wendy, 2010. “Scalable Publish and Subscribe Data Distribution.” In Proceedings of the Fall 2010 Simulation Interoperability Workshop. Paper 10F-SIW-027.

Steinman Jeff, Lammers Craig, Valinski Maria, and Steinman Wendy, 2010. “Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG) DRAFT REPORT Volume 1: PDMS Technology.” Submitted to the SISO Standards Activities Committee (SAC), 14 April 2010.

Steinman Jeff, Lammers Craig, Valinski Maria, and Steinman Wendy, 2012. “The OpenUTF External Modeling Framework (EMF).” In Proceedings of the Spring 2012 Simulation Interoperability Workshop. Paper 12S-SIW-034.

Steinman Jeff, Valinski Maria, and Koop Matthew, 2012. “Native and MPI Implementations of the Open Unified Technical Framework High Speed Communications Layer (OpenUTF-HSC).” In Proceedings of the Fall 2012 Simulation Interoperability Workshop. Paper 12F-SIW-026.

Steinman Jeff, 2013. “The Roadmap.” In Proceedings of the Spring 2013 Simulation Interoperability Workshop. Paper 13S-SIW-047.