Emulation


Hi, John,

Here are the 14 major metrics that I feel a design team must consider when deciding on what specific emulator to use on their project (if any):

 1. Price/Gate
 2. Initialization and Dedicated Support
 3. Capacity
 4. Primary Target Designs
 5. Speed Range
 6. Partitioning
 7. Compile Time
 8. Visibility
 9. Debug
10. Virtual Platform API
11. Transactor Availability
12. Verification Language and Native Support
13. Number of Users
14. Memory Capacity

Below I explain in detail the impact and pitfalls each metric will have on your team's emulation decision.

    - Jim Hogan, Vista Ventures LLC, Los Gatos, CA

---- ---- ---- ---- ---- ---- ----

1. Price/Gate

The actual cost of an emulator typically runs from 1-5 cents-per-gate for higher capacity emulators -- both processor-based and FPGA-based. There is usually some recurring cost for software and maintenance.

The lower capacity FPGA-based emulators are typically priced as separate HW and SW components. The HW consists of off-the-shelf FPGA prototyping boards that typically cost 0.25 to 1 cent-per-gate. The SW that provides emulation features on top of the prototyping boards is typically priced like software simulators.
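To make the scale concrete, here is a quick back-of-the-envelope sketch in Python (my own arithmetic; the 100 M gate design size is an assumption I picked, and the per-gate prices are the ranges quoted above):

    # Rough emulator cost estimate from a cents-per-gate figure (illustrative).
    def emulator_cost_usd(design_gates, cents_per_gate):
        return design_gates * cents_per_gate / 100.0   # cents -> dollars

    design_gates = 100e6   # assumed 100 M gate SoC

    print(emulator_cost_usd(design_gates, 1))     # high-capacity box, low end:  $1.0 M
    print(emulator_cost_usd(design_gates, 5))     # high-capacity box, high end: $5.0 M
    print(emulator_cost_usd(design_gates, 0.25))  # FPGA prototyping HW only:    $250 K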

2. Initialization and Dedicated Support

When assessing an emulator's total cost of ownership, the cost of dedicated human support is a key factor, especially as it is an ongoing expense. The large processor-based emulators virtually always require at least one support person, if not a team, dedicated to emulator support. It's not due to any inherent issues with emulators; it's just the sheer scope and magnitude of the verification projects being tackled.

The large FPGA-based emulators' need for dedicated support is similar, also depending on the size and complexity of the emulated designs and the number of end users involved.

Further, the time to initialize an emulator and set up models can take 6 months or more. Transactors may need to be developed to connect test benches to the design-under-test (DUT). They are also needed to connect host-based virtual platforms to emulators and FPGA prototyping systems. This is part of a growing trend to improve the accuracy and performance of virtual platforms for performance modeling and pre-silicon software development.

The most complex transactors - such as PCIe transactors - can take more than a year to develop. They can be the source of functionality and performance bugs that can delay the emulation project even longer. And test bench re-partitioning between the host and emulator may also be required to achieve acceptable performance.

3. Capacity

There are subtle issues around an emulator's capacity; a given capacity can be delivered in multiple ways. First, there is the straightforward capacity measurement in terms of total number of gates, which currently ranges from 2 million to 2 billion gates for emulators.

Second, there is the granularity in terms of the number of devices (ASICs or FPGAs), boards, or boxes that are used to reach a given capacity.

Processor-based emulators are architected to deliver their capacity in a seamless way, so that it looks monolithic to users. With FPGA-based emulators, if the product is a vertically integrated box, it should also appear monolithic to the user. Synopsys-EVE reaches its 1 B gate capacity in a monolithic way by connecting multiple emulator boxes.

A vendor may hide the emulator granularity from the customer, but higher granularity generates more communication overhead, which generally degrades performance. This is true whether the emulator is processor-based or FPGA-based. For instance, if you were emulating 10 M gates and it fit on a single FPGA, it could run approximately 5X faster than if the same 10 M gate capacity were to be divided over multiple FPGAs.

Low-to-mid FPGA-based emulators typically expose more granularity to the end user. However, an offset to this is that these boxes tend to take advantage of the increased capacity of newer FPGA devices more quickly than the mid-to-high-capacity FPGA-based emulators.

FPGAs are among the first chips manufactured on each leading-edge Moore's Law process node. New FPGA product cycles run 12 to 18 months. In contrast, custom chip cycles are at least 4 years, and the cost of development, which has generally been increasing for custom ASICs, is borne by the custom-processor-based emulation vendor.

4. Primary Target Designs

SoCs are the dominant workhorse for systems companies today. The target design sweet spot for each emulator type is typically defined by the capacity of the emulator as it relates to your design size.

As mentioned earlier, complexity is one of the factors driving more mainstream designs into requiring emulation. Control complexity can require many verification cycles. Emulation applications range from CPUs, GPUs, application processors such as video, audio, security and datapath, to IP blocks and subsystems.

5. Speed Range

Emulator speed is measured in cycles-per-second. Processor-based speeds range from 100 K to 4 M cycles/sec, while FPGA-based range from 500 K to 50 M cycles/sec, depending on the number of devices.

6. Partitioning

Partitioning is harder than it sounds. Partitioning would be easy if your design could be broken up into latency-insensitive blocks, which would eliminate any strict latency requirements between partitions. But that's hardly ever the case.

- Processor-based emulators have many processing units, so large designs must be partitioned across these units. Processor-based systems basically run software using many cores, and the partitioning is more or less transparent to the user.

- FPGA-based systems are hardware-centric, based on multiple FPGAs. Due to the FPGA boundaries you must first split the design into reasonably sized pieces, so that each sub-system will fit on one FPGA device. Unfortunately, you then usually end up needing more logical connections between partitioned sections of the design than there are physical wires between FPGAs to make those connections! These logic connections can exceed the physical connections by 2X to 100X.

So when you stitch your partitioned design elements together to connect them, rather than just doing a simple logical connection, you must multiplex signal pins over the FPGA connections. This adds substantially to complexity, particularly if this process is not completely automated. The entire process takes additional time and effort that opens the possibility for errors to be injected in partitioning or reconnecting.
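To see why that multiplexing hurts speed, here is a toy Python calculation (a simplified model of my own, not any vendor's partitioning flow): when the logical signals crossing a cut outnumber the physical wires, each wire must be time-division multiplexed, and the emulation clock can advance only once all the time slots have been transferred.

    import math

    # Toy model of inter-FPGA pin multiplexing (illustrative assumptions only).
    def tdm_ratio(logical_signals, physical_wires):
        """Number of time slots each physical wire must carry per emulated cycle."""
        return math.ceil(logical_signals / physical_wires)

    physical_wires  = 400     # assumed wires available between two FPGAs
    logical_signals = 8000    # assumed signals crossing the partition cut

    print(tdm_ratio(logical_signals, physical_wires))   # 20 slots per wire
    # Each emulated cycle now needs ~20 transfer slots across the cut, so the
    # partitioned design runs roughly 20X slower than an unpartitioned one
    # before any other overhead -- which is why cut quality matters so much.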

A deeply partitioned design can be very difficult to debug manually or semi-automatically. Design groups sometimes give up on the task because they can't get it right. Certain debug features don't work as well in partitioned systems; debugging a single FPGA is easier.

This debug concern only applies to FPGA-based emulators; processor-based emulators have complete visibility across all of their processors.

Most processor-based and larger FPGA-based emulation vendors try to automatically take care of partitioning such that the end user doesn't have to deal with it.

- Automatic partitioning is rule-based and is assumed to be correct-by-construction. If you have a problem with your partitioning, it's a support call to the emulation vendor.

- The greatest partitioning risk occurs in the smaller FPGA-based emulator group. Their (in)ability to partition should be carefully assessed -- particularly when partitioning across more than two FPGAs. Some of these vendors use an independent partitioning tool from Auspy; however, partitioning is inherently never completely push button.

Thankfully, Moore's Law has now made it possible to have 30 M gates in just two FPGAs. This neutralizes the partitioning issue for a substantial fraction of emulation projects. If no messy multi-part partitioning is needed, there's no problem. A single partition between only two devices is straightforward.

7. Compile Time

Emulator compile time is the total time to prep a job for execution on your emulation system, including synthesis and routing. For FPGA-based emulators, compilation time is primarily determined by the FPGA routing tools. Further, some FPGA-based emulators are starting to provide hierarchical routing -- which enables incremental compiles.

Another variable for compile time is whether your design partitioning between the devices is automated or not.

- For processor-based emulators this design partitioning is automated and quite complete, and that time is included in the compile time. The compile runs on a single workstation.

- For FPGA-based emulators partitioning can be quite variable; your total compile time can more than double if the partitioning is semi-automated and requires user interaction.

Another factor for compile time is whether it's parallelizable - i.e. whether it can be broken into independent jobs to be run on separate workstations. For FPGA-based emulators, once your partitioning is complete, your FPGA compilation can be sped up significantly by running each FPGA compile on a separate workstation. That's not the case for processor-based emulators.
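Here is a small sketch of that trade-off in Python (my own illustrative numbers, not measured data): once partitioning is done, the per-FPGA builds are independent jobs, so wall-clock time is set by the slowest FPGA plus the serial partitioning step.

    # Illustrative compile-time model for an FPGA-based emulator.
    partition_hours    = 1.5                    # serial partitioning step (assumed)
    fpga_compile_hours = [3.0, 2.5, 4.0, 3.5]   # per-FPGA synthesis + P&R (assumed)

    sequential = partition_hours + sum(fpga_compile_hours)   # one workstation: 14.5 h
    parallel   = partition_hours + max(fpga_compile_hours)   # one job per machine: 5.5 h

    print(sequential, parallel)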

8. Visibility

Visibility is the ability to see signals inside a design. You have full visibility with SW simulation, because simulation is software and the program state is inside the simulator; it is very flexible and you can log any state.

For FPGA-based emulators, there's static and dynamic visibility. Static visibility refers to signal probes that are defined at compile time. They require FPGA resources and run fast but usually cover only a small subset of a design's signals. If you need to change that subset of signals, you need to recompile. Dynamic visibility refers to signal probes that do not need to be defined at compile time. These do not require additional FPGA resources and run much slower than static probes, but cover a much larger subset of a design's signals.

- The signal visibility with processor-based emulators is basically the same as simulation. That's because all signal states reside as a software addressable register somewhere in the emulation processor array.

- With FPGA-based emulators, to see the internal signals, you must route them out through some type of multiplexing network to the pins. This adds physical gate overhead whenever you take a signal and connect to it through an I/O multiplexer to the pins of the FPGA. The multiplexer and wires create trees of logic that add area overhead.

Additionally, those gates and wires add a performance overhead; every cycle requires capturing the state and eventually sending it to a host for storage. If you try to probe every signal, your effective design capacity can go down by a factor of 2-5. You must navigate this carefully, or your design won't fit.
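To put rough numbers on that capacity and bandwidth cost, here is a back-of-the-envelope Python sketch (the probe count, emulation speed, and per-FPGA capacity are my own assumptions; the overhead factor is within the 2-5X range above):

    # Rough probe-overhead arithmetic for FPGA-based emulation (illustrative).
    nominal_capacity_gates = 20e6   # assumed usable gates per FPGA without probes
    probe_overhead_factor  = 3      # within the 2-5X range quoted above

    print(nominal_capacity_gates / probe_overhead_factor)   # ~6.7 M gates left for the DUT

    # Probe data that must eventually reach the host:
    probed_signals = 10_000         # assumed number of probed 1-bit signals
    cycles_per_sec = 1e6            # assumed emulation speed with probes enabled
    print(probed_signals * cycles_per_sec / 8 / 1e9)        # ~1.25 GB/s of raw waveform data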

However, the signal visibility gap is narrowing between processor-based and FPGA-based emulators based on two key factors:

- FPGA capacity improvements provide increased capacity for static probing inside of FPGAs.

- Xilinx has built a feature into their FPGAs which offers dynamic probes for register states. This Xilinx feature enables better multiplexing in the chip to access the signals without overhead.

With effort, the individual emulator vendors can all but eliminate the area overhead, which then means less time recompiling to be able to see a different set of signals. Some emulator vendors are taking advantage of this.

The bottom line is that emulation vendors can be assessed on the area and performance hit they incur for any given number of probes, as well as on whether they offer dynamic probing.

9. Debug

With the exponential rise in complexity, debug is essential -- as you can see in the chart below, teams spend 1.6X more of their time debugging (42%) than they do developing testbenches (26%) or writing/running tests (26%).

A robust debug capability is critical, and processor-based emulators today have debug capabilities that approach that of SW simulator debug.

Fundamental emulator debug capabilities can include:

- Breakpoints. The ability to pause an emulation run based on event triggers.

- Assertions support. It flags when assertions, or logic statements that define the intended behavior of a design, are violated.

- Simulation hot-swap. The ability to automatically transfer execution to a connected simulator for more in-depth debug, in the form of greater visibility and control.

- Software debug. It can run software debuggers on the embedded code being executed by the processor(s).

Debug is an area undergoing significant innovation and should be assessed beyond the simple list I show in the comparison chart.

10. Virtual Platform API

Hybrid platforms consist of emulators co-simulating with virtual platforms, where a virtual platform simulates a whole chip on a workstation by eliminating detail from the hardware and plugging C-models together.

Given the prevalence of virtual platforms, it is important for the emulator to have a standard virtual platform API, so that if an engineer doesn't have a virtual platform C-model for a component, they can plug the component into the emulator instead.

For example, you may have an RTL USB 3.0, but no C model. An emulator with a virtual platform API could allow you to co-simulate with your emulator running the USB RTL at up to 4 M cycles/sec. In contrast, the RTL might run at only 5 K cycles/sec in a SW simulator.

11. Transactor Availability

Transactors facilitate the communication between the emulator and other platforms with different levels of abstraction. Transactors are critical to making emulators work in that they remove the speed bottlenecks associated with co-emulation.

- Transactors convert transactions coming from outside the emulator to bit-level interfaces inside the emulator and vice versa; as emulators move to the mainstream, they need to communicate with simulators and virtual platforms more frequently.

- Transactors also serve as interfaces when you connect inputs and outputs to the I/O of a design (SoC) under test (DUT) in an emulator. The transactors let you go from the I/Os on the DUT to anything communicating with it: a host simulator, a C testbench, or possibly another I/O board bringing in a live connection, such as Ethernet. An example is shown below.

Source: Mentor Graphics

- One definition of a transactor is anything that moves something from one abstraction level to another. For example, in the virtual platform space, transaction-level modeling (TLM) is standard, whereas in the RTL space, bit-level interfaces are the norm.

- Hardware transactors eliminate traffic over co-emulation links to optimize performance. They do this by translating a compound transaction into a bit level handshake exploded in space and time on the emulator where it's more efficiently handled. For example, a single transaction over the co-emulation link to "move a 128-word block of memory from location A to location B" is exploded into a sequence of 128 bit-level interface operations in the emulator.
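Here is a toy Python sketch of that idea (illustrative only -- not any vendor's transactor API): one compound transaction crosses the co-emulation link, and the transactor on the emulator side expands it into the per-word operations the DUT's bus actually sees.

    from dataclasses import dataclass

    # Toy model of a memory-move transactor (hypothetical names, for illustration).
    @dataclass
    class MoveTransaction:       # what travels once over the co-emulation link
        src_addr: int
        dst_addr: int
        num_words: int

    def expand_to_bus_ops(txn):
        """Explode one link-level transaction into word-level DUT operations."""
        return [("MOVE_WORD", txn.src_addr + i, txn.dst_addr + i)
                for i in range(txn.num_words)]

    ops = expand_to_bus_ops(MoveTransaction(src_addr=0x1000, dst_addr=0x2000,
                                            num_words=128))
    print(len(ops))   # 128 word-level operations from a single link transaction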

The overall emulator performance is very sensitive to transactor quality. For example, an optimized transactor may run at 1 M cycles/sec compared with only 100 K cycles/sec for an unoptimized one -- a 10X performance delta.

The individual emulator vendors provide off-the-shelf transactor libraries for standard interfaces such as memory and high-speed I/O; these standard transactors can be reused across different designs. However, custom transactors must often be developed for each design.

Some projects may require 10-20 transactors, of which 5-10 may be custom, ranging from UARTs to proprietary system bus interconnects. End-to-end performance optimization of these custom transactors is hard -- the time can range from 1 to 12 months. Furthermore, different transactor versions can be required to support multiple emulation platforms.

12. Verification Language and Native Support

Regardless of transactor quality, sometimes you must move selected components -- models or testbenches -- over to a SW simulator. To do this efficiently, your emulator should ideally support your existing languages.

- Processor-based emulators typically have broad native support for verification languages. This is because the emulation vendors often also provide simulators -- an inherent advantage.

- FPGA-based emulators don't natively support languages such as C++ or SystemC; they are typically limited to synthesizable Verilog and VHDL. The drawback is that developing synthesizable Verilog/VHDL takes more time than writing in a higher-level language. Another item can severely limit emulator performance: test bench components that run too slowly on the host or generate too much traffic on the link between the host and the emulator. Synthesizable test benches running in the emulator can reduce this host-to-emulator bottleneck by 5X to 10X.

13. Number of Users

The bigger an SoC, generally the larger the engineering team and the more geographically dispersed it is. Therefore, the maximum number of users that can run an emulation at the same time on partitioned elements of the SoC should be taken into consideration.

14. Memory Capacity

Processor-based emulators have a high memory capacity, up to 1 TB. This memory is used more like the memory in a simulator; mapping DUT memories is transparent and usually not an issue.

In contrast, DUT memory on FPGA-based emulators is more explicitly mapped to hard macro memory blocks in the FPGA devices. The largest FPGA devices have about 50 M bits per device; this capacity is usually enough for the memory required in DUT partitions. Sometimes a DUT memory does not map well into the FPGA device memory, in which case the FPGA-based emulator can use special constructs to map the DUT memory into on-board DRAM. It is important to know ahead of time whether an FPGA-based emulator has memory configuration limitations.

Below I have mapped top-level information for each vendor, according to the emulation metrics I mentioned earlier. I derived the information for the snapshot below from each vendor's website plus my general accumulated knowledge to date.

- Category 1. Emulators are based on application-specific processors. Cadence Palladium's processor is implemented in an ASIC-structured custom fabric. Mentor Veloce's processor is implemented in a custom FPGA fabric.

- Category 2. Emulation with a standard FPGA product at its core. Synopsys-EVE is currently the player in this sector.

- Category 3. Other emulators with HW based on a standard FPGA product. The primary differentiator between category 2 and 3 is capacity; however, there are other differences as shown below. Aldec, Bluespec, Cadence RPP, Dini Group, S2C, and HyperSilicon are primary vendors in this segment.

The emulator best suited to the designer's problem is defined by what problem they are trying to solve. Source: EVE, Embedded Computing, 2010

The optimal choice lies in the intersection of a number of factors, with one example outlined above.

Emulation Vendors
- Category 1: Cadence Palladium, Mentor Veloce
- Category 2: Synopsys EVE Zebu
- Category 3: Other -- Aldec, Bluespec, Cadence RPP, HyperSilicon

Emulator Architecture
- Category 1: custom silicon, custom board, custom box (32 M to 2 B gates)
- Category 2: off-the-shelf FPGA, custom board, custom box (25 M to 200 M gates)
- Category 3: off-the-shelf FPGA, off-the-shelf board, off-the-shelf box (2 M to 50 M gates)

Price/Gate
- Category 1: 2 - 5 cents
- Category 2: 0.5 - 2 cents
- Category 3: 0.25 - 1 cent

Dedicated Support
- Category 1: yes
- Category 2: mixed
- Category 3: no

Design Capacity
- Category 1: claims up to 2 billion gates; typical usage 100 M to 1 B gates
- Category 2: claims up to 1 billion gates; typical usage 25 M to 200 M gates
- Category 3: claims up to 50+ million gates; typical usage 2 M to 25 M gates

Primary Target Designs
- Category 1: SoCs of 100 M to 1 B gates; large CPUs, GPUs, multi-chip systems, application processors
- Category 2: SoCs from 25 M to 200 M gates
- Category 3: IP blocks, sub-systems, and SoCs from 2 M to 25 M gates

Speed Range (cycles/sec)
- Category 1: 100 K to 2 M
- Category 2: 500 K to 5 M
- Category 3: 500 K to 20 M

Compile Time
- Category 1: 10 - 30 M gates/hour; single workstation (Palladium) or PC farm (Veloce); includes automated partitioning time; parallelizable: yes
- Category 2: 25 M - 100 M gates/hour on a PC farm; proprietary software for fast FPGA partitioning, synthesis and P&R; parallelizable: yes
- Category 3: 1 M - 15 M gates/hour on a PC farm; constrained by FPGA vendor synthesis and P&R times; doesn't include partitioning time; parallelizable: yes

Partitioning
- Category 1: automated
- Category 2: automated
- Category 3: semi-automated; partitioning time depends on the number of FPGAs, ranging from 30 minutes to 4 hours

Visibility
- Category 1: full visibility; at-speed probe capture
- Category 2: static and dynamic probes; at-speed probe capture
- Category 3: static and dynamic probes (vendor dependent); at-speed probe capture (vendor dependent)

Debug
- Category 1: breakpoints, assertions, simulation hot-swap, SW debug
- Category 2: breakpoints, assertions, simulation hot-swap, SW debug
- Category 3: breakpoints, assertions, simulation hot-swap, SW debug

Virtual Platform API
- Category 1: yes
- Category 2: yes
- Category 3: varies by vendor

Transactor Availability
- Category 1: standard/off-the-shelf: good; custom: developed ad hoc
- Category 2: standard/off-the-shelf: good; custom: developed ad hoc
- Category 3: standard/off-the-shelf: mixed; custom: developed ad hoc

Verification Language (Native Support)
- Category 1: C++, SystemC, Specman e, SystemVerilog, OVM, SVA, PSL, OVL
- Category 2: synthesizable Verilog, VHDL, SystemVerilog
- Category 3: synthesizable Verilog, VHDL, SystemVerilog

Memory Capacity
- Category 1: up to 1 TB
- Category 2: up to 200 GB
- Category 3: up to 32 GB

Number of Users
- Category 1: 1 to 512 users
- Category 2: 1 to 49 users
- Category 3: 1 user

Here is my quick summary of the different emulation vendors for 2013.

Category 1:

- Cadence Palladium. Hats off to Cadence for being pioneers in emulation and sustaining innovation to maintain a very competitive product year-over-year.

- Mentor Veloce. Their revenue numbers show emulation is a growing segment for them. (See ESNUG 510 #7.) Clearly Wally and Greg have been investing heavily in emulation.

Category 2:

- Synopsys EVE Zebu. This has been the choice for companies and design groups doing mid-size SoCs or blocks for emulation. It is no secret that Intel was an EVE customer. (See ESNUG 508 #6.) My expectation is that with the Synopsys acquisition, EVE will now move upstream to challenge Cadence and Mentor at the high end.

Category 3:

- Aldec HES-DVM. The company initially grew out of providing system emulation/simulation using FPGAs for eventual implementation in FPGAs. FPGAs will continue to be a choice for system designers with low volumes, including the mil-aero world. Will they try to move into the SoC market?

- Bluespec Semu. Bluespec expanded their emulation footprint in March with a new FPGA-based, desktop form-factor verification and hybrid emulator. They emphasize low cost, ease of use, fast deployment using third-party FPGA boards, dynamic hardware debug (no re-instrumentation or re-synthesis), and a C API to integrate SystemC/C/C++ models and test benches. Bluespec claims to need only one day of setup.

- Cadence RPP. The Cadence FPGA-based Rapid Prototyping Platform is an FPGA-based prototyper for early software development and high-performance system validation. While not positioned as an emulator (see ESNUG 517 #6), it uses the core technology of FPGA-based emulators and confirms the need for boxes with higher performance and lower cost than processor-based emulators for pre-silicon software development.

- The Dini Group. An established leader in FPGA boards for prototyping and emulation. The Dini Group consistently delivers high quality, high capacity boards with the shortest time-to-market for leading edge FPGAs.

- S2C. Offers FPGA boards, software, and IP for system-level design verification and acceleration. Their new boards are based on 14 M gate Xilinx FPGAs.

- HyperSilicon. Company to watch from mainland China, focusing on FPGA prototyping boards. Offers boards similar to S2C.

My conclusion is that emulation has indeed gone mainstream. Its growth extends from the rise of the SoC as the cornerstone of system hardware, with its associated multiple SW functions. What's also helped emulation grow is its better debug, increased FPGA sizes, and its newer ability to handle complex designs.

---- ---- ---- ---- ---- ---- ----

The reason for my report was to analyze the segment and try to put some order to the market place. I'm not the only source of info on this. I'd like to invite the DeepChip readers to feel free to add their perspective and to update the charts and data I have gathered.

Functional verification primarily comprises:

- software simulation,
- simulation acceleration,
- FPGA prototyping, and
- emulation.

Let's look at each of them briefly, and then compare some of the elements that relate to the speed ranges of each approach.

---- ---- ---- ---- ---- ---- ----

SOFTWARE SIMULATORS

A simulator is a software program that simulates an abstract model of a particular system: it takes an input representation of the product or circuit, written in a hardware description language, compiles it, and executes it.

A system model typically includes processor cores, peripheral devices, memories, interconnection buses, hardware accelerators and I/O interfaces.

Simulation is the basis of all functional verification. It spans the full range of detail from transistor-level simulation like SPICE to Transaction Level Modeling (TLM) using C/SystemC. Simulation should be used wherever it's up to the task -- it's easiest to use and the most general purpose.

But SW simulation hits a speed wall as the size and detail of the circuit description increases. Moore's Law continues to give us more transistors per chip, but transistor speed is flattening out. So while computers are shipping with increasing numbers of microprocessor cores, the operating frequencies are stuck in the 2 - 3 GHz range. Since SW simulators (which run on these computers) don't effectively utilize more than a handful of PC cores, performance degrades significantly for large circuits. It would take decades just to boot an operating system running on an SoC being simulated in a logic SW simulator.
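A quick sanity check on that claim, using speeds in the ranges discussed in this report (the cycle count for an OS boot and the simulator speed are my own assumptions):

    # Rough wall-clock time to run a fixed number of design cycles (illustrative).
    def days_to_run(cycles, cycles_per_sec):
        return cycles / cycles_per_sec / 86400.0

    cycles_for_os_boot = 2e9   # assumed cycle count to boot an OS on the SoC

    print(days_to_run(cycles_for_os_boot, 2))     # SW simulator at 2 cycles/sec: ~11,600 days (~32 years)
    print(days_to_run(cycles_for_os_boot, 1e6))   # emulator at 1 M cycles/sec:   ~0.02 days (~33 minutes)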

Simulation acceleration, emulation, and FPGA prototyping are all solutions to get around show-stopping slow PC simulation speeds for large designs. They all attempt to parallelize simulation onto larger numbers of processing units. This ranges from two orders of magnitude (e.g. hundreds of GPU processing elements) to nine orders of magnitude (billions of FPGA gates).

---- ---- ---- ---- ---- ---- ----

SIMULATION ACCELERATION

Simulation acceleration executes a design written in a hardware description language, such as Verilog or VHDL, according to a verification specification. The results are the same as simulation, but arrive faster.

- Often simulation accelerators will use hardware such as GPUs (e.g. NVIDIA Kepler) or FPGAs with embedded processors.

- Simulation acceleration involves mapping the synthesizable portion of the design into a hardware platform specifically designed to increase performance by evaluating the HDL constructs in parallel. The remaining portions of the simulation are not mapped into hardware, but run in a software simulator on a PC/workstation.

- The software simulator works in conjunction with the hardware platform to exchange simulation data. Acceleration removes most of the simulation events from the slow PC software simulator and runs them in parallel on other HW to increase performance.

The final acceleration performance is determined by:

1) Percentage of the simulation that is left running in software;
2) Number of I/O signals communicating between the PC/workstation and the hardware engine;
3) Communication channel latency and bandwidth; and
4) The amount of visibility enabled for the hardware being accelerated.
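A simple Amdahl-style estimate in Python (my own illustrative model, not vendor data) shows how strongly the first factor -- the portion left running in software -- limits the overall gain:

    # Amdahl-style speedup estimate for simulation acceleration (illustrative).
    def overall_speedup(sw_fraction, hw_speedup, link_overhead_fraction=0.0):
        # sw_fraction: share of the original runtime left in the SW simulator
        # hw_speedup: speedup of the portion mapped into hardware
        # link_overhead_fraction: host<->hardware traffic cost, as a share of
        #                         the original runtime (assumed)
        new_time = sw_fraction + (1.0 - sw_fraction) / hw_speedup + link_overhead_fraction
        return 1.0 / new_time

    print(overall_speedup(0.05, 1000, 0.02))   # ~14X overall despite 1000X hardware
    print(overall_speedup(0.01, 1000, 0.00))   # ~91X when almost everything is mapped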

---- ---- ---- ---- ---- ---- ----

EMULATORS

An emulator maps an entire design into gates or Boolean macros that are then executed on the emulator's implementation fabric (parallel Boolean processors or FPGA gates) such that the emulated behavior exactly matches the cycle-by-cycle behavior of the actual system.

- Processor-based emulator. The design under test is mapped to special purpose Boolean processors.

- FPGA-based emulator. The design under test is mapped to FPGA gates as processing elements.

Elsewhere in this report, I go into more detail on emulation, including: emulation drivers; metrics to evaluate emulation; and a top-level comparison chart of commercial emulation systems against those metrics.

---- ---- ---- ---- ---- ---- ----

FPGA PROTOTYPING

An FPGA prototype is the implementation of the SoC or IC design on an FPGA. The prototype environment is real, with real input and output streams. Thus the FPGA prototyping platform can provide full verification for hardware, firmware, and application software design functionality.

Some problems associated with FPGA prototyping are:

- Debug Confusion: Because you mapped your design into an FPGA, you can expect to spend some extra time debugging it, to identify problems that are relevant ONLY to your prototype, but that are not necessarily bugs inside your actual design.

- Partitioning: Your design must be partitioned across multiple FPGAs. Further, sometimes repartitioning may be necessary when design changes are made. Partitioning challenges can also apply to emulation, so I discuss them in the emulation metrics section.

- Timing (Impedance) Mismatches: If your FPGA prototype connects to real-world interfaces, such as Ethernet or PCIe, then you have to ensure it is capable of supporting the interface. That is, mismatched timing can sometimes be a problem. This can involve "speed bridging" between the FPGA and the interface.

If your design can fit into a few FPGAs, and you have adequate support, then FPGA prototyping can be very effective -- especially when real-time performance is vital.

---- ---- ---- ---- ---- ---- ----

A BASIC COMPARISON

Below I characterize the various types of hardware-assisted verification approaches by computational element, granularity (number of computational elements), speed per computational element, cycles-per-second on a 100 M gate design, and vendors:

- SW simulation: x86 cores; under 16 elements; 3 GHz; under 1 cycle/sec; Cadence Incisive/NC-Sim, Synopsys VCS, Mentor Questa

- Simulation acceleration: GPU processing elements; 100's; 1 GHz; 10 to 1,000 cycles/sec; Rocketick

- Processor-based emulation: custom processors; 1000's; under 1 GHz; 100 K to 2 M cycles/sec; Cadence Palladium

- FPGA-based emulation: FPGA gates; millions; 1 MHz - 100 MHz; 500 K - 2 M cycles/sec; Mentor Veloce, Synopsys EVE-Zebu, Bluespec, Aldec, Cadence RPP, HyperSilicon

- FPGA prototyping: FPGA gates; millions; 1 MHz - 100 MHz; 500 K - 20 M cycles/sec; Synopsys HAPS, internal boards

Notice that any given approach's processing-element speed is inversely related to the approach's ultimate performance in cycles/sec. This is basically a reflection that, for functional verification, concurrency (more computing elements) trumps clock speed (speed per computing element).

I put in a list of the commercially announced vendors that I am aware of.

The graphic below roughly shows how each of these basic approaches compares relative to simulation design frequency and design size. Notice emulation's 1,000X to 1,000,000X faster run time over SW simulation as your design size goes from 10 K gates to 20 M gates. Also notice emulation's capacity of 2 B gates, while SW simulation and acceleration both top out at around 20 M gates -- a 100X difference in capacity.

The rest of my analysis will expand on the emulation segment.

EMULATION IS BECOMING UBIQUITOUS

Emulation emerged in the early 1990s. In its early days, emulation was primarily deployed by a few large corporations to verify large system-level designs -- specifically for microprocessors and graphics processors. The turning point was when audio and video recordings were digitized. Where possible, designs needed to be verified for both functionality and software behavior against voluminous real data before committing to silicon.

By 2013, we are now moving toward emulation becoming ubiquitous:

- Current mobile devices have hundreds of applications. The vast majority have audio and video use as one of their system appliance "must haves".

- Subsystems now are the same size as what entire designs and processors were in the 90s. Teams are now using emulators to verify many of these larger, more complex blocks.

- Because of compute-speed and capacity gains, emulation is becoming an attractive option for more mainstream verification tasks; such as verifying individual IP blocks in the low millions of gates.

- Emulation is even now needed for smaller, under 1 million gate designs -- if there is a lot of control complexity with a large number of cycles -- such as with an H.265 encoder/decoder. In fact, simulating high density videos on an H.265 decoder would be impractical (because it would take weeks to do) if it couldn't be simulated in seconds in an emulator.

It's safe to say that simulation is vital for functional verification. However software simulation (by itself) is way too slow to capture events that occur when an application runs for even a short time on an actual instance of hardware. Emulation is the only practical way to get usefully long SW application runtimes on a hardware instance.

---- ---- ---- ---- ---- ---- ----

IT'S NOW A SW DOMINATED WORLD:

I see some of the same current top level drivers for emulation as I did for On-Chip Communications Networks: SoC growth is being driven by mobile consumer devices, with the corresponding pressures for:

- Smaller die size and reduced cost
- More functionality
- Increased battery life
- True gigahertz performance

For years, hardware verification was primarily about meeting the hardware design specification. Intel built the processor, Microsoft built the operating system, and Phoenix gave you BIOS. The software conformed to the hardware; meaning that the software developer lived within the constraints of the hardware design.

Fast forward to a 2013 system design world. You'll see it's now software dominated. Intel, Microsoft and Phoenix are still around but struggling; the system design community is now serving consumers by way of Google Android, ARM, and their SW ecosystems.

Typical SoC: dozens of HW blocks, but millions of lines of code. Notice that there are millions of lines of code for roughly a dozen hardware subsystems in a typical SoC. (Source: Texas Instruments)

---- ---- ---- ---- ---- ---- ----

'GOODNESS' IS ABOUT A DESIRED BEHAVIOR

While consumers are not explicitly aware of it, their expectations are high with regard to how well the hardware and software work together.

Software is about the experience, not functionality. The software dictates the behavior of an SoC -- in other words, software now forces constraints on the hardware architecture specification.

Verification encompasses more than "does it do what the specification says?". It now asks "does the specification deliver the end user experience that I'm looking for?". The goal is not just to build the design right, but to build the system that behaves correctly.

How do you ensure your design is up to its intended use? The answer is by using increasingly capable emulation systems.

Source: Bluespec, Inc.

You load the operating system and observe the behavior on the hardware; software developers run their application on the virtual machine.

---- ---- ---- ---- ---- ---- ----

IP USE AND REUSE IS EXPLODING

Because today's designs are getting so big and so complex, they're next to impossible to design without using 3rd party IP, IP from fabs, and internal IP reuse. Semico predicts an average 2013 design will have close to 90 IP cores. This will only grow -- and the more IP that's (re)used, the more each block will demand simulation compute resources -- increasing the need for emulation speed and capacities.

Source: International Business Strategies, 2012

And with each node shrink, 90nm, 65nm, 45nm, 28nm, 20nm, the average number of IP cores used grows -- plus the cost-to-design grows.

---- ---- ---- ---- ---- ---- ----

SW SIMULATION HAS HIT A WALL

SW simulators (like VCS or NC-Sim or Questa) all parallelize across only a small number of x86 processors, so their performance has been scaling only as fast as x86 frequency, which has flattened out. SW simulators have become inadequate for verification within cores and core-to-core.

Emulation is becoming ubiquitous where SW simulation hits a wall. Emulation runs orders of magnitude faster than SW simulation, and its ability to run software in megahertz rather than kilohertz (or even hertz) allows teams to do verification that is otherwise time-prohibitive.

---- ---- ---- ---- ---- ---- ----

EMULATION AND SW SIMULATION COMPLEMENT EACH OTHER

SW simulation is great in the early phases of verification:

- Designers and verification engineers are constantly finding bugs, changing the design, and re-verifying. In this phase, the fast compilation time that SW simulation offers for each new run is critical. And, they're not at the point yet where you need to load the operating system.

- If you have a fixed function or hard-wired integrated circuit, you can just run your Verilog/VHDL to verify that your design does what the specification describes.

What drives emulation use:

- The one downside to emulation is it takes a measurable compile time. But emulation's 1,000X to 1,000,000X faster run time wins out over SW simulation when the number of cycles that must be run offsets the longer compilation time for emulation. As engineers progress through the design cycle, they must run more and more cycles to identify new bugs.

- The reduced total cost of emulation ownership is another factor helping it grow. Current emulation system cost can go as low as 0.25 to 0.50 cents per gate.

- Even smaller companies can now use it. Emulation used to just be the domain of big companies with huge support teams -- they might spend $2 M on an emulator, and then find they need a service contract for $1 M to make it work.

Many of today's emulation systems are fairly plug-and-play, with far reduced support requirements. I go into more detail regarding some of these costs of ownership factors in the metrics section of this report.

- Increased static and dynamic probing inside today's much bigger FPGAs has helped emulation grow, too.
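Following up on the compile-time versus run-time trade-off in the first bullet above, the break-even point is easy to sketch in Python (all of the speeds and compile times below are my own illustrative assumptions):

    # Break-even between SW simulation and emulation (illustrative assumptions).
    sim_cycles_per_sec = 100      # assumed SW simulation speed on a large design
    emu_cycles_per_sec = 1e6      # assumed emulation speed
    sim_compile_hours  = 0.25     # assumed simulator compile time
    emu_compile_hours  = 8.0      # assumed emulator compile time

    def total_hours(cycles, cycles_per_sec, compile_hours):
        return compile_hours + cycles / cycles_per_sec / 3600.0

    for cycles in (1e6, 1e7, 1e8, 1e9):
        print(cycles,
              total_hours(cycles, sim_cycles_per_sec, sim_compile_hours),
              total_hours(cycles, emu_cycles_per_sec, emu_compile_hours))
    # At 1e6 cycles simulation still wins (~3 h vs ~8 h); by 1e8 cycles the
    # simulator needs ~278 h while the emulator finishes in about 8 h.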

---- ---- ---- ---- ---- ---- ----

THE PARTITIONING PROBLEM HAS RECEDED

One significant obstacle to adoption, particularly for FPGA-based emulators, has been the partitioning requirement (also called mapping). A designer would have to partition his register transfer level design into sections which fit inside each FPGA and then connect all the sections via the FPGAs' I/Os. It was a tremendous challenge to balance the optimal sizing of the partitioned elements against managing the number of FPGA connections within the limits of the available I/Os.

Partitioning software such as what Auspy Development offers has helped.

However, it does not entirely eliminate the manual effort. The partitioning process can result in unnatural design hierarchies, and critical timing paths that cross FPGA boundaries can mandate re-partitioning.

One major technology enabler behind the proliferation of emulators relates to Moore's Law: It's now possible to have 30 million gates in just 2 FPGAs.

This higher gate-count-per-FPGA means less design partitioning is necessary. For example, below we see two off-the-shelf FPGA boards.

Sources: The Dini Group and Aldec

- The first board has 6 FPGAs; each FPGA is a Virtex-6 LX550T capable of supporting 4 M gates for a total board capacity of 24 M gates.

- The second board has only 2 FPGAs; each FPGA is a Virtex-7 7V2000T capable of supporting 14 M gates for a total board capacity of 28 M gates.

Cutting a design in half is much easier and safer than cutting a design into 6 parts. This lessens a major adoption obstacle for FPGA-based emulators, helping them to close the gap with custom-processor-based emulators, as well as making emulation more attractive for smaller designs.

Higher capacity FPGAs also push down the cost of the emulation systems; for example, some systems now use off-the-shelf FPGAs at $4 K per FPGA, reducing their core hardware cost. (Further, lower capacity emulators can often add next generation FPGAs shortly after they are released by Xilinx and Altera.)

---- ---- ---- ---- ---- ---- ----

An expanded version of the comparison above, again by computational element, granularity (number of computational elements), speed per computational element, cycles-per-second on a 100 M gate design, and vendors:

- SW simulation: x86 cores; under 16 elements; 3 GHz; under 1 cycle/sec; Cadence Incisive/NC-Sim, Synopsys VCS, Mentor Questa

- Simulation acceleration: GPU processing elements; 100's; 1 GHz; 10 to 1,000 cycles/sec; Rocketick (NVIDIA: 17X)

- Verification acceleration: mix of host and emulator; millions; under 1 GHz; 40 to 10,000 cycles/sec; Cadence Palladium XP + RTL sim, Mentor Veloce + RTL sim, Synopsys Zebu Server + RTL sim

- Processor-based emulation: custom processors; actually 100s of 1000s to millions of elements; under 1 GHz; 100 K to 2 M cycles/sec (processor-based scales better with design size than FPGA-based); Cadence Palladium

- FPGA-based emulation: FPGA gates; millions; 1 MHz - 100 MHz; 500 K - 2 M cycles/sec (does not scale well with design size, so reaching the M range at 100 MG is very unlikely; debug causes further slowdown); Mentor Veloce, Synopsys EVE-Zebu

- FPGA prototyping: FPGA gates; millions; 1 MHz - 100 MHz; 2 M - 50 M cycles/sec, sometimes up to 100 M; Synopsys HAPS, Cadence RPP, DINI, Aldec, S2C, HOENS, Hitech Global, ProDesign

Palladium P64 vs. Veloce2 Quattro

- Granularity: Palladium P64: 4 MG to 256 MG, resulting in much better utilization; Veloce2 Quattro: 16 MG to 256 MG (MG = million gates)

- Number of users: Palladium P64: 64 users; Veloce2 Quattro: 16 users

- Speed: Palladium P64: 2 MHz (as per datasheet), scaling with design size; Veloce2 Quattro: 1 - 1.5 MHz (as per datasheet), degrading with design size due to architecture

- Capacity: Palladium P64: 256 MG nominal, 90% to 100% utilization => 256 MG actual capacity; Veloce2 Quattro: 256 MG nominal, 60% to 75% utilization => 200 MG actual capacity