The Co-Design of Virtual Machines Using Reconfigurable Hardware

by

Kenneth Blair Kent
B.Sc. (Hons.), Memorial University of Newfoundland, 1996
M.Sc., University of Victoria, 1999

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

in the Department of Computer Science

We accept this dissertation as conforming to the required standard

Dr. M. Serra, Supervisor (Department of Computer Science)
Dr. M. Cheng, Member (Department of Computer Science)
Dr. N. Horspool, Member (Department of Computer Science)
Dr. K. Li, Outside Member (Department of Electrical and Computer Engineering)
Dr. R. McLeod, External Examiner (University of Manitoba, Department of Electrical and Computer Engineering)

University of Victoria

All rights reserved. This dissertation may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.
4.2 The Process of Partitioning .......................................................... 39
4.2.1 Partitioning Approaches ........................................................... 40
4.2.2 Exploitations of Virtual Machine Partitioning ......................... 41
4.2.3 Partitioning Heuristics .............................................................. 43
4.3 Software Partition ........................................................................ 46
4.3.1 Loading Data from the Constant Pool ...................................... 46
4.3.2 Field Accesses of Classes and Objects ..................................... 47
4.3.3 Method Invocation .................................................................... 47
4.3.4 Quick Method Invocation ......................................................... 47
4.3.5 Exceptions ................................................................................ 48
4.3.6 Object Creation ........................................................................ 48
4.3.7 Array Creation .......................................................................... 48
4.3.8 Storing to a Reference Array .................................................... 49
4.3.9 Type Checking .......................................................................... 49
4.3.10 Monitors ................................................................................. 49
4.3.11 Accessing the Jump Table ...................................................... 50
4.3.12 Wide Indexing ........................................................................ 50
4.3.13 Long Mathematical Operations .............................................. 50
4.3.14 Returning from a Method ....................................................... 51
4.3.15 Operating System Support ..................................................... 51
4.3.16 Software and Hardware Coordination .................................... 51
4.4.2.1 Array Accessing .................................................................... 56
4.4.2.2 Length of Arrays ................................................................... 57
4.4.3 Full Partition ............................................................................. 57
4.4.3.1 Quick Loading Data from the Constant Pool ........................ 57
4.4.3.2 Quick Field Accesses in Classes and Objects ....................... 58
5.2 Development Environment .......................................................... 61
5.2.1 Hot-II Development Environment ............................................ 63
Appendix D Co-Design Benchmark Results ..............................................157
D.1 Compress Benchmark ............................................................... 157
D.1.1 Benchmark with Communication Included ........................... 157
D.1.2 Benchmark with Communication Excluded .......................... 159
D.2 Db Benchmark .......................................................................... 161
D.2.1 Benchmark with Communication Included ........................... 161
D.2.2 Benchmark with Communication Excluded .......................... 163
D.3 Mandelbrot Benchmark ............................................................ 165
D.3.1 Benchmark with Communication Included ........................... 165
D.3.2 Benchmark with Communication Excluded .......................... 167
D.4 Queen Benchmark .................................................................... 169
D.4.1 Benchmark with Communication Included ........................... 169
D.4.2 Benchmark with Communication Excluded .......................... 171
D.5 Raytrace Benchmark ................................................................ 173
D.5.1 Benchmark with Communication Included ........................... 173
D.5.2 Benchmark with Communication Excluded .......................... 175
Table 5.1. Ackerman function timings in clock cycles. ...............................................................85
Table 5.2. Minimal performance increase factors for each of the benchmarks based on cycle counts without consideration for clock rates. ..................................................89
Table 7.2. Constant pool caching efficiency measurements.....................................................118
Table 7.3. Percentage of original execution times with full partitioning scheme and 1:5 FPGA:Host ratio, including communication delays...............................................119
Table 7.4. Average number of hardware cycles/context switch for each benchmark.......121
Table 7.5. Optimal performance increases under ideal conditions.........................................122
Table 7.6. Instruction support and density for various benchmarks. .....................................123
Table A.1. Java bytecode data collection for five benchmark applications..........................139
Table B.1. Specification of Java virtual machine instruction set between partitioning schemes. ...............................................................................................................................150
Acknowledgments
Many people contributed to the completion of this work. Special thanks to my
supervisor, Dr. Serra. I am sure I withered away a few years of her lifespan in trying to
complete this degree. To Jon, who enjoyed having me as a master's student so much
that he recommended me to Micaela for the Ph.D. That must say something! I also want to thank Dr.
Li for his help over the last few months to finish the loose ends.
To the VLSI group, which suffered through many of my presentations as I
gave dry runs for conferences and invited talks; especially Duncan, for the
motivation of seeing who would finish first. To the Graduate Students Society for
keeping the lounge open every Friday; there was no better place to escape from the
research at the end of the week. To Sean, for giving me a personal demonstration of
when you should stop drinking, and Barry, for showing me when NOT to ride a bike!
Thanks to my good buddy Gord, with whom, by rough calculation, I have shared 24
kegs of beer and a few bottles of scotch over 6 years. What else can I say but ...
wow!!! No wonder people go to the bathroom so often when drinking.
Last but not least, to my family. Without their constant mocking about my being
underworked and a student for life, I never would have aspired to make the jump to
becoming a glorified permanent student who gets paid ... a university professor :)
for my family
CHAPTER 1
Introduction
This dissertation examines the merging of three problems that exist in computing today. The first problem is the slow performance of virtual machines, which, with the increasing importance of the internet, have become popular for providing a homogeneous platform. The second problem is moving reconfigurable computing from the application-specific domain into a new general purpose computing platform. The third problem is that of the instance-specific techniques used to develop hardware/software co-designed solutions to systems, in this case specifically to virtual machines; this is attributed to the complexity and variety of the co-designed systems being developed. This dissertation investigates using reconfigurable computing in a co-designed system to alleviate some performance issues of virtual machines.
Homogeneous computing techniques have become increasingly important with the growth in internet usage and services, which continues at an exponential rate [57]. A popular means of providing a homogeneous platform is through the use of a virtual machine. This solution is desirable since it guarantees a common platform while allowing users to maintain their preferred heterogeneous hardware underneath. The drawback, however, is the inherently slow performance caused by adding another layer of abstraction between the end application and the underlying computing devices.
A tremendous amount of research has been performed on virtual machines and how to improve their performance [2,3,8,18,19,30,41,75,79,81,94,101,116]. Techniques have spanned all aspects of the execution paradigm, including better source code and compilation techniques, just-in-time compilation, and replacing software with hardware. Some of these techniques have provided respectable performance increases and are commonly used in virtual machine implementations, while others have not reached the mainstream. While the gap in performance has decreased, there is still a performance loss from execution on a virtual computing platform.
Despite this, virtual machines are used in many contexts and applications ranging
from large scale complete general purpose computing platforms to low-level specific
embedded systems. Within these, a virtual machine’s features and capabilities must be
adjusted to reflect the support provided by and required of the environment. This work
strives towards providing a full implementation of a general purpose abstract virtual
machine within the context of the desktop workstation.
The implementation of the full virtual machine, as opposed to a subset of it, is desirable since it allows a demonstration of the effectiveness of using a reconfigurable computing device in a general purpose computing platform. This raises issues such as the partitioning of the virtual machine between hardware and software, the dynamic run-time decisions about where to execute a given code segment, and the necessary communication requirements. Reducing the problem to examining a subset of a virtual machine that exists only in hardware would remove this investigation.
There currently exist a variety of approaches to providing a computing platform such as a virtual machine, including a dedicated hardware processor, a co-processor specific to the platform, and a full software implementation. While each of these has its merits, each also has disadvantages. The dedicated processor and co-processor solutions are costly if the fabricated hardware requires replacement to adapt to changes in the virtual machine specification. This is in addition to the complexities encountered in either incorporating virtual machine support into an existing platform, or adding support for other platforms within the virtual machine itself. The software-only solution provides desirable flexibility and maintainability, but suffers in performance.
With the development of systems that incorporate both hardware and software components, there is a need for methodologies to assist the process. Hardware components have traditionally been expensive and time consuming to develop. As such, the traditional expectation has been that software, with its inherent flexibility, will adapt to suit the needs of the hardware resources. With the emergence of flexible reconfigurable hardware, the scope of possibilities widens considerably.
Hardware/software co-design is the cooperative design of both hardware and software for a specific system. Encompassing the full design process, it is concerned with many aspects, from the partitioning of the system between hardware and software through to system integration and testing. To aid in the process, many tools, techniques, and methodologies have been proposed and examined. However, due to the wide range of co-designed systems, no single detailed approach or tool solution exists. There is a general process that co-designed systems follow, but it usually requires substantial customizing to be applicable in practice to a diversity of systems. This dissertation focuses on
hardware/software co-design for virtual machines, not for all systems.
The co-designed solution here differs in that it provides an implementation that attempts to incorporate the advantages of the previous methodologies. This is accomplished by cleverly dividing the virtual machine specification between a hardware and a software partition. Both of these partitions are then realized in their respective environments through the utilization of the system processor and a reconfigurable logic device. The result is a new virtual machine architecture, depicted in Figure 1.1, where each partition is supported by a different resource. The software and memory are provided through the general purpose CPU and RAM available on the local host. The hardware, however, is provided through a reconfigurable computing device.
[Figure 1.1 New co-designed virtual machine architecture overview: software (host processor), hardware (reconfigurable), and memory.]

Reconfigurable computing is an emerging research area which utilizes programmable hardware devices to provide an inexpensive custom hardware solution to a problem.
Devices exist such that a user can develop a hardware design using software tools and then program the device to provide the implementation, which becomes the custom hardware. Once the hardware design is completed, programming the device requires only microseconds. Typically, the problems addressed to date have been instance specific and narrowly focused, due to the limited capabilities of the programmable devices themselves and the environments within which they exist. While the approach presented here is focused only on virtual machines, it supports multiple applications executing within the platform. The previous, more narrowly focused use has led to the predominant use of instance-specific techniques for the design and implementation of solutions. The techniques in this dissertation attempt to be more general and can be applied to the co-design of most virtual machines.
The potential advantages of reconfigurable computing have been great enough to solicit a high level of interest [12,91]. Reconfigurable devices are seen as a cheap alternative to custom hardware. This, coupled with reprogrammability, allows for quicker time to market, iterative development, and backwards compatibility. These features suggest that reconfigurable computing will become even more pervasive in the future.
Reconfigurable computing has been used in many small application-specific instances to increase performance [15,82,84]. The idea of using reconfigurable computing to address the slow performance of virtual machines is new. Virtual machines are used primarily to satisfy the requirement of having a common platform across architectures. An immediate way to guarantee a common platform is simply to have everyone use the same underlying hardware architecture. While this may be an ideal scenario, it is not a cost effective or feasible solution. Using reconfigurable technologies to provide a virtual machine is potentially more cost effective than the traditional Application Specific Integrated Circuit (ASIC) approach for providing a common underlying hardware architecture. Instead of replacing the underlying hardware with a new platform, the user simply reconfigures to the desired new platform [45]. While the success of such an approach to providing virtual machines is unknown, there are obvious conjectures that are interesting to explore.
This dissertation describes a different approach to computing for virtual machines through hardware/software co-design and the utilization of reconfigurable hardware. It provides guidelines and several algorithms that focus on important phases of the co-design process, such as partitioning, the design of the components for flexibility, and the interface linking them together. From this research, results are gathered concerning the support required for success, as well as measurements of the performance that can be attained through this solution.
1.1 Research Contributions
There are three major research contributions of this dissertation: an advancement towards a new general computing paradigm and architecture; a set of guidelines and algorithms for applying the general hardware/software co-design process to the specific virtual machine class of problems; and an assessment of the potential advantages of using co-design as an implementation approach for virtual machines. The remainder of this section discusses each of these contributions in more detail.
The first contribution is to make advances towards a new view of a general computing platform and architecture. This approach provides a computing platform supported by both hardware and software components through a static partitioning of instructions. Because the partitions overlap, a decision can be made at run-time as to the location of execution for a user application. Reconfigurable technologies to date have focused at the application level; this dissertation examines reconfigurable computing at the operating system and computer architecture level, which allows applications to be written without knowledge of the specialized hardware while still receiving its benefits.
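The run-time choice between overlapping partitions can be illustrated with a simple look-ahead test over the instruction stream. This is a sketch only: the `PartitionedDispatcher` class, its integer opcode encoding, and the look-ahead window are hypothetical, not the dissertation's actual mechanism.

```java
import java.util.Set;

// Sketch of run-time dispatch for an overlapping hardware/software partition.
// The opcode encoding and the look-ahead policy are hypothetical illustrations.
public class PartitionedDispatcher {
    // Opcodes implemented in the reconfigurable hardware partition.
    private final Set<Integer> hardwareOpcodes;

    public PartitionedDispatcher(Set<Integer> hardwareOpcodes) {
        this.hardwareOpcodes = hardwareOpcodes;
    }

    // Decide, at run time, whether the upcoming code segment should execute
    // in hardware. Because the partitions overlap, every opcode can always
    // fall back to the software partition instead.
    public boolean runInHardware(int[] code, int pc, int lookahead) {
        for (int i = pc; i < Math.min(pc + lookahead, code.length); i++) {
            if (!hardwareOpcodes.contains(code[i])) {
                return false; // an unsupported opcode would force a switch back
            }
        }
        return true; // the whole window is hardware-supported
    }
}
```

The application itself never consults this dispatcher; the decision is made below the application layer, which is what allows programs to benefit from the specialized hardware without being written for it.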
The second contribution is to outline a set of guidelines to assist in the transition of
a virtual machine into this new computing paradigm, which must efficiently utilize the
existing general purpose processor and the new reconfigurable resources. A significant
component of this utilization is the dynamic selection of application regions to execute in
the hardware partition. The partitioning scheme used to determine the opcodes that form
the hardware component is critical to the outcome. Any partitioning strategy used must
deal with the challenges of resource constraints, such as design space and memory, as
well as implementation costs.
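One plausible way a static partitioning scheme could trade off execution benefit against a design-space constraint is a knapsack-style greedy heuristic: rank opcodes by execution frequency per unit of hardware area and fill the FPGA budget in that order. The opcode names, frequencies, and area units below are invented for illustration; the dissertation's own heuristics (section 4.2.3) may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Knapsack-style greedy heuristic for static partitioning: select the opcodes
// with the best (execution frequency / hardware area) ratio until the FPGA
// area budget is exhausted. All figures are illustrative placeholders.
public class GreedyPartitioner {
    public static List<String> select(Map<String, Long> frequency,
                                      Map<String, Integer> areaCost,
                                      int areaBudget) {
        List<String> candidates = new ArrayList<>(frequency.keySet());
        // Highest benefit-per-area first.
        candidates.sort((a, b) -> Double.compare(
                (double) frequency.get(b) / areaCost.get(b),
                (double) frequency.get(a) / areaCost.get(a)));
        List<String> hardware = new ArrayList<>();
        int used = 0;
        for (String op : candidates) {
            if (used + areaCost.get(op) <= areaBudget) {
                hardware.add(op);   // assign this opcode to the hardware partition
                used += areaCost.get(op);
            }
        }
        return hardware;
    }
}
```

A greedy ratio heuristic is only one point in the design space; it ignores, for example, the communication cost of crossing the partition boundary, which a more refined strategy must also weigh.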
Co-design is new and interesting, but it has been used mainly for embedded systems, where the typical implementation implies closely connected software and hardware portions and a well-defined interface. A general process for co-design has been established, but the process is generic, to suit all systems. This leaves the co-designer with little direction for addressing each of the steps within the co-design process. Steps such as partitioning become more focused only when restricted to a particular and narrow domain of application. In this research, specific techniques are applied within each of the process steps for virtual machines, both to obtain better performance and to provide a more systematic approach to co-design in the context of virtual machines.
There are different ways of tackling this idea, for example using a co-processor, an approach that has been very successful in graphics and video streaming. Such a solution utilizes a static partitioning strategy, where the hardware implements specialized instructions or functionalities, and it is inflexible because of that static partitioning. Likewise, an implementation using a custom ASIC co-processor also lacks flexibility and is potentially costly. Instead, the use of reconfigurable hardware can provide greater flexibility at a potentially lower cost. This flexibility is reflected in the division of the virtual machine's functionalities between hardware and software, the interface between the divisions, and the dynamic decision process for when to move execution between hardware and software at run-time, made possible because the software partition maintains full functionality. Each of these concerns is addressed, and the solutions can be transferred to other virtual computing paradigms. The general co-design process is described in section 3.2.
Within this approach designed for the class of virtual machines, several issues and ideas are addressed, including:
• A partitioning strategy for dividing the virtual machine between hardware and software.
• The idea of overlapping the hardware and software partitions to allow for selective dynamic context switching. Three algorithms are presented, along with a demonstration of the importance of context switching execution between the partitions.
• A generic hardware design that can be adapted and manipulated for other virtual computing platforms.
• An analysis of the performance of the co-design solution as applied to the Java virtual machine.
• Lastly, a set of simulated benchmarks that quantifies the performance prediction.
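One way a context-switching decision could weigh the cost of moving execution between partitions is an amortization test: switch only when the expected savings over a run of hardware-supported instructions outweighs the round-trip communication cost. The `SwitchPolicy` class and its cycle parameters below are hypothetical illustrations, not the three algorithms presented later.

```java
// Sketch of one possible context-switch policy: move execution to the
// hardware partition only when the expected cycle savings over a run of
// hardware-supported instructions outweighs the round-trip communication
// cost. All cycle figures are hypothetical parameters, not measured values.
public class SwitchPolicy {
    private final long switchCostCycles;   // cost of one host<->FPGA transfer
    private final long savingPerOpCycles;  // cycles saved per opcode in hardware

    public SwitchPolicy(long switchCostCycles, long savingPerOpCycles) {
        this.switchCostCycles = switchCostCycles;
        this.savingPerOpCycles = savingPerOpCycles;
    }

    // A run must amortize two transfers: entering and later leaving the
    // hardware partition.
    public boolean worthSwitching(int runLength) {
        return (long) runLength * savingPerOpCycles > 2 * switchCostCycles;
    }
}
```

Such a test makes explicit why the communication interface between the partitions matters: the higher the transfer cost, the longer a run of hardware-supported instructions must be before switching pays off.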
The third contribution is to assess the potential performance increase of virtual machines implemented using hardware/software co-design, dependent on the underlying hardware resources. Specifically, the Java virtual machine is used as an example. This includes an examination of the effects that the physical resources of the system and the characteristics of the virtual machine's applications have on overall performance. A requirements analysis is also performed on the hardware support needed to provide a suitable environment for a co-designed virtual machine. This analysis includes such factors as the memory, communication, and FPGA requirements for this approach to succeed.
1.2 Dissertation Overview
This dissertation follows the use of hardware/software co-design for virtual machines. A detailed discussion of the motivation for co-design, and of the advantages and disadvantages of this approach in comparison to other popular methods of implementation for virtual machines, is in chapter two. Chapter three provides background on hardware/software co-design as well as on reconfigurable computing and programmable hardware devices.
With the foundation set, the proposed application of hardware/software co-design to
virtual machines is described in chapter four, covering the partitioning of the virtual
machine between the hardware and software components. The next two chapters, five and
six, discuss the hardware and software designs of the virtual machine respectively. These
designs encapsulate the interface between the partitions. Each of these chapters discusses
co-design as it applies to virtual machines in general, and to the example case study of
Java in particular.
Finally, chapter seven of the dissertation discusses some of the results realized
through the co-design solution. This includes an analysis of some of the results obtainable
through co-design as well as the requirements of the development environment. Chapter
eight concludes the dissertation with a summary and a brief description of some future
work that can evolve.
CHAPTER 2
Virtual Machines
2.1 Introduction
This chapter discusses the motivation for and the new concept of co-designing virtual machines, clarifying the idea and its context. The concept of a virtual machine is presented, along with the advantages and disadvantages of this computing platform approach. Several common techniques for implementing virtual machines within a general purpose workstation are presented, along with their advantages and disadvantages, and the co-design solution proposed in this dissertation is compared against them. Finally, the chapter concludes with a discussion of the Java virtual machine (the example virtual machine used throughout the dissertation) and its suitability for portraying the approach.
2.2 Virtual Machines
There have been many virtual machines used to support and promote different platforms of execution. The term was first introduced in 1959 to describe IBM's new VM operating system [76]. In the 1970s, a virtual machine was implemented for Smalltalk which supported a very high level object-oriented abstraction of the underlying computer [76]. A virtual machine is defined as a self-contained operating environment that behaves as if it is a separate computer [52]. In more concrete terms, the virtual machine is a software implementation that lies between the application and the operating system. As such, it is an application that executes other applications. Figure 2.1 shows both an application running directly on top of the operating system (on the left), and an application running on top of a virtual machine.
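The phrase "an application that executes other applications" can be made concrete with a toy stack-based interpreter: the program being run is just data handed to a host application. The three-opcode instruction set here is invented purely for illustration and is far simpler than any real virtual machine's.

```java
// A toy stack-based virtual machine: a program is data (an int array of
// opcodes and operands) executed by this host application. The three-opcode
// instruction set is an invented illustration, not any real bytecode.
public class ToyVM {
    static final int PUSH = 0, ADD = 1, MUL = 2;

    public static int run(int[] program) {
        int[] stack = new int[64];
        int sp = 0; // stack pointer
        int pc = 0; // program counter
        while (pc < program.length) {
            switch (program[pc++]) {
                case PUSH: stack[sp++] = program[pc++]; break;
                case ADD:  stack[sp - 2] += stack[sp - 1]; sp--; break;
                case MUL:  stack[sp - 2] *= stack[sp - 1]; sp--; break;
            }
        }
        return stack[sp - 1]; // top of stack holds the result
    }
}
```

For example, `ToyVM.run(new int[]{PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL})` computes (2 + 3) * 4. Every operation of the guest program passes through the `switch` in the host, which is precisely the extra layer of abstraction, and of cost, that the rest of this chapter discusses.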
An advantage of virtual machines over a traditional hardware architecture with an operating system is system independence. The virtual machine provides a consistent interface for application programs despite the potentially wide range of underlying hardware architectures and operating systems. This allows application developers to provide only one software binary implementation. The key benefits include:
1. Drastically reduces the costs of providing multiple versions of software
across varying platforms.
2. Supports better application development through application portability, a
uniform computing model, and a higher level of programming abstraction.
3. Provides a homogeneous execution platform for distributed computing on
a heterogeneous network.
4. Resolves issues of differing libraries and interfaces between target environments.
5. Provides the ability for a common security model.
There are other minor advantages, such as the low cost of not needing specialized hardware. For these reasons, virtual machines are a good choice for providing a homogeneous computing platform.
However, there is a downside to providing an execution environment as a virtual
machine. Because programs running in a virtual machine are abstracted from the specific
system, they often cannot take advantage of any special system features. A key example
of this is graphics, where specialized hardware acceleration is common due to the high performance demands of games and other applications.

[Figure 2.1 Software virtual machine execution layers of abstraction: an application running directly on the operating system and native hardware (left), versus an application running on a virtual machine above the operating system and native hardware (right).]

It is common today for hardware architectures to provide custom
graphics support; for example, the Intel processor offers MMX technology and AMD provides the 3DNow! instruction extension [55,1]. While both of these strive to meet the same goal, their approaches are somewhat different, and so are their interfaces to this specialized support. With applications executing within a contained, platform independent virtual machine, the applications are prevented from accessing this support directly.
This separation of the application from the underlying system is responsible for the critical drawback of a virtual machine: its performance. Applications that execute on a virtual machine are not as fast as fully compiled applications that execute directly. The reason for this is the extra layer of abstraction between the application and the underlying hardware: any action requested by an application must be interpreted by the virtual machine before being executed. In addition, the virtual machine itself requires execution time to perform maintenance duties such as memory management and security checking. All of these factors contribute to the overall slow performance of applications within virtual environments.
With the increasing demand for a homogeneous computing environment, generated
by the internet, and the increasing performance of computers, the use of virtual machines
for computing platforms is more prominent despite some poor performance. New virtual
computing platforms such as the Java virtual machine and the .NET common language
runtime promote this network computing model [17].
2.3 Virtual Machine Implementation Techniques
There are many different approaches to implementing a virtual machine. Some of the more traditional approaches are a software interpreter, just-in-time compilation, a dedicated native processor, or a custom hybrid processor optimized to support the virtual platform [43,117]. There are also other, less conventional techniques, mostly targeted at a specific application within the virtual machine rather than the virtual machine itself [18]. Each of these methodologies has advantages and disadvantages. The following sub-sections outline the benefits and pitfalls of each approach, followed by a description of the benefits of co-design, which presents the co-design solution as an alternative for the desktop workstation environment.
2.3.1 Software Interpreter
A software interpreter is the most common form of implementation for a virtual machine. A driving force behind this is that software meets the common demands and features desired of a virtual machine. Typically, virtual machines are "virtual" because users desire portability across different hardware platforms, want a cheap platform, and require backward compatibility as the platform grows into a more stable environment. A software computing platform has traditionally been the most appropriate means of satisfying these requirements.
The software implementation is the cheapest and quickest means by which the virtual machine can evolve from concept, through prototyping and research, into an end product. The currently popular Java virtual machine is an example of this evolution. It originally began as a platform for cable TV switchboxes and continually developed and grew into the general purpose computing platform that it is today [24]. Since it was first released as a general purpose computing platform in 1995, the Java platform has undergone four major revisions and numerous other minor editions [103]. Software provides suitable features for this evolution, mainly through its vast set of cheap development tools and its flexibility with respect to the underlying hardware architectures. The flexibility and configurability that software provides for analyzing the virtual machine yield insights that help develop efficient and suitable implementation ideas. This flexibility is also invaluable while the virtual machine has not yet matured and is changing through continuous revisions. The ability to easily update and release a new version is important during this stage of the virtual machine's life cycle.
Unfortunately, this is the point where software-based implementation becomes a
burden on the end virtual machine. A software interpreter is a great mechanism for devel-
oping and analyzing the virtual machine, however, its lack of performance hinders the
virtual machine from being used for computing intensive applications. The extra layer of
interpretation in execution is too costly in performance. As can be seen from Figure 2.1,
with a software implementation of the virtual machine, there is the extra layer of abstrac-
tion above the host operating system. This extra layer, while providing a standard inter-
face to the underlying hardware, also forbids access to any special capabilities of the
operating system or hardware architecture. In a typical application developed for the
hardware platform, the virtual machine layer does not exist. Instead, the application has
more direct access to the hardware and its special capabilities. There are also advantages
of this abstraction level, as it also acts as a “sandbox”, protecting from illegal access to
other applications and preventing the host operating system from crashing as a result of
the virtual machine application [72].
Performance concerns grow even greater when the operating system is capable of
multi-tasking, since the operating system then shares the hardware resources with
other applications, possibly equal in priority to the virtual machine itself.
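The interpretation overhead described above can be illustrated with a minimal fetch-decode-execute loop. The sketch below is illustrative only: the opcodes, stack layout, and class name are invented for the example and belong to no real instruction set.

```java
// Minimal sketch of a software interpreter's inner loop. Every virtual
// instruction pays for a fetch, a decode (the switch), and operand-stack
// bookkeeping -- overhead that a natively compiled program does not incur.
public class TinyInterpreter {
    // Illustrative opcodes (not from any real instruction set).
    static final byte PUSH = 0, ADD = 1, HALT = 2;

    public static int run(byte[] code) {
        int[] stack = new int[64];
        int sp = 0, pc = 0;
        while (true) {
            byte op = code[pc++];                                 // fetch
            switch (op) {                                         // decode
                case PUSH: stack[sp++] = code[pc++]; break;       // execute
                case ADD:  stack[sp - 2] += stack[sp - 1]; sp--; break;
                case HALT: return stack[sp - 1];
                default:   throw new IllegalStateException("bad opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        byte[] program = {PUSH, 2, PUSH, 3, ADD, HALT};
        System.out.println(run(program)); // prints 5
    }
}
```

Even in this toy, each virtual addition costs several native instructions of dispatch, which is the "extra layer of interpretation" referred to above.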
2.3.2 Just-In-Time Technology
A common technique that has been used to increase the performance of software
implementations for virtual machines is that of just-in-time (JIT), or hot-spot, compilers.
This technique utilizes the fact that a significant amount of the time during execution is
spent executing a small fragment of the overall application. This technology attempts to
identify these fragments of the application during runtime and compile them into native
code, thus allowing the application to perform faster since it can avoid software interpret-
ing and execute natively [94,103]. Given the correct code fragments of the application to
JIT, the application can almost become a native application. This technique has shown
high levels of performance increase for many virtual computing platforms [103,94].
There are several challenges that just-in-time technologies face. Two factors are
identification of the time critical regions of the application and compilation of the virtual
platform code to execute in the native architecture. Identifying the time critical sections
of an application is difficult since it is dependent on the specific application and requires
monitoring the application during execution. Some of the original just-in-time compilers
used for Java attempted to compile all of an application's methods during loading, but
this resulted in large memory requirements and in compilation of code that is sometimes
only used once [119]. Moreover, depending on the input to a given application, the time
critical sections can change. Finally, once identified, compiling the time critical sections
of the application into native code is often a challenging task. This is especially true when
the virtual and native machines differ significantly in architectures. Manipulating the
application to represent it in the supported native instruction set can present a problem
[94]. All of this effort must be performed quickly, as time spent performing the just-in-
time compilation weighs against the performance gains obtained.
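One common policy for identifying time critical methods is an invocation counter. The sketch below shows such a policy in isolation; the threshold value, the use of a method name as the key, and the class itself are assumptions for illustration, not the design of any particular JIT compiler.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an invocation-counter JIT policy: a method is interpreted
// until its counter crosses a threshold, after which it is (conceptually)
// compiled once and dispatched to native code from then on.
public class JitPolicy {
    static final int HOT_THRESHOLD = 1000;   // assumed tuning parameter
    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, Boolean> compiled = new HashMap<>();

    // Returns true when the call should go through compiled code.
    public boolean onInvoke(String method) {
        if (compiled.getOrDefault(method, false)) return true;
        int n = counts.merge(method, 1, Integer::sum);
        if (n >= HOT_THRESHOLD) {
            compiled.put(method, true);      // stand-in for actual compilation
            return true;
        }
        return false;                        // keep interpreting
    }
}
```

The threshold embodies the trade-off described above: set too low, compilation time is wasted on cold code; set too high, hot code runs interpreted for too long.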
2.3.3 Native Processor
When a virtual machine is in high use and performance is of primary importance, it
is common for the platform to become native. For this, a custom processor is developed
based on the instruction set of the virtual platform. This contributes towards providing
higher performance capabilities for the platform’s applications. A key trade-off for this
performance is the loss of flexibility as well as performance for other computing lan-
guages and paradigms [20]. With a native processor, there is less flexibility in evolving
and revising the platform while keeping the proper backwards compatibility. Customiz-
ing the architecture for a specific computing platform or language also causes problems
for executing other platforms and languages. An example of this is the recent picoJava
processor [19,27]. While the specific processor does provide performance gains over soft-
ware emulation, the performance of other computing platforms, such as the execution of
C programs, suffers because the Java specific platform does not offer suitable features as
would another general purpose processor [20].
Another concern that arises from having a native processor for the virtual machine
platform is the support of other platforms. One reason for having various platforms is
because each platform offers different features and capabilities. Using a native processor
may include the features that are desirable for one platform while losing the necessary
characteristics for another. Changing the native processor may be suitable for a dedicated
environment, but not for a general purpose environment where the native processor must
meet a common ground between all supported platforms. In the context of this research,
namely a desktop workstation, the use of a native processor for the targeted virtual
machine is not considered desirable.
There are many examples of virtual platforms becoming actual hardware platforms,
such as the Lisp machine, the Pascal processor, and other computer architectures for such
languages as Algol and Smalltalk [39,105,92,22,90,51,77]. Each of these language spe-
cific platforms is capable of providing performance increases simply because the archi-
tecture is targeted to the language and its computing paradigm. For example, the Lisp
machine utilizes the fact that the language is stack based, and hence so too is the architec-
ture. This is also true for more current and emerging computing platforms such as Java
[2,95,99,58,65,117]. These specific examples, despite their demonstration of a
performance increase over software implementations, have not been adopted as commonplace
solutions. One contribution to this outcome is the high costs associated with specialized
hardware. In most cases, there is not a sufficient demand for performance on these plat-
forms to warrant the costs.
2.3.4 Hybrid Processor
A hybrid processor attempts to provide greater performance for multiple platforms
by providing a native processor that is based on the combination of the platforms merged
together. This approach in theory provides the best of all the incorporated platforms to
accelerate execution for each virtual machine [29,33]. There has been considerable
research into hybrid processors to specifically enhance the support of Java execution
[3,4,8-10,30-32,79,80]. There are, however, some drawbacks with this approach. Incorpo-
rating multiple virtual machines can result in a very complex design that may be very
challenging to implement. Such factors as design space and cost also arise, sometimes
making this approach impractical.
Having each platform directly supported in the underlying native processor may
lead to increased performance. Again, several drawbacks may offset these performance
gains. There exist many different platforms with many different philosophies that
are not always compatible. Trying to incorporate platforms with a mix of philosophies
can result in a system where each platform is hindered by the other(s). With the vast num-
ber of platform architectures, it is probable that the platforms will have conflicting fea-
tures. Having the scenario of compromising the performance of one platform to improve
another is never desirable and often intolerable.
2.4 Co-Designing Virtual Machines
The previous section described several methodologies commonly used to imple-
ment a virtual machine: pure software-based and pure hardware-based, with both native
and hybrid instruction sets. Each of these methodologies has its benefits and its costs.
This section instead discusses the idea of co-designed virtual machines using a reconfig-
urable device.
Virtual machines are typically software implementations of a hardware architec-
ture plus supporting software management or operating system. Backward compatibility,
cost, and portability issues are common reasons for providing a platform as a virtual
machine. By having the specified machine in software it can be cheaply implemented and
run on top of, without affecting, many existing host platforms. The motivation behind co-
designing a virtual machine is to increase the performance of the virtual machine’s execu-
tion through hardware support. In this dissertation, the hardware support is provided
through the use of a reconfigurable hardware device, namely a Field Programmable Gate
Array (FPGA).
There are two parts that make up a virtual machine: a low-level instruction set, and
a high level operating system. The idea of co-designing virtual machines is based on sup-
porting each part of the virtual machine by the most desirable approach. Thus, providing
the low-level instruction set of the virtual machine in hardware, i.e. the FPGA, and the
high level operating system in software, i.e. the host processor, is desirable. For the co-
designed solution, an abstract depiction of the conceptual architecture for implementa-
tion is depicted in Figure 2.2.
This architecture is seen as desirable as each part is delivered through technologies
that provide a high level of performance while still maintaining flexibility. The co-design
approach, though simple in concept, faces the new challenge of integrating the hardware
and software components. This requires the careful design of the interface between them.
Architecturally, both of these computing elements are connected via buses to the memory
unit, and to each other. Ideally, there are three separate buses, but sharing a common bus
is possible. This allows for close shared execution between the two devices on one
execution task.
There is the issue of a bottleneck caused by the accessing of the memory region by
both the FPGA and the host processor. This can result in a significant issue which is not
addressed here in detail. Chapter 7 does however consider the effects of memory access-
ing bandwidth, as well as other hardware architectural features.
This approach was used in the past, but mainly for specific processing purposes and
not for a general computing virtual machine [64]. Configurable computing has been
broadly used in embedded computing and telecommunications to address such problems
as high-speed adaptive routing, encryption and decryption, and cellular base station
management [68]. The co-design idea here is to implement a portion of the virtual machine in
hardware using reconfigurable hardware technology [62], addressing a more general problem.
While the idea of using reconfigurable hardware for application acceleration or for
providing an embedded system platform is not unique, using reconfigurable hardware
within the desktop workstation to support virtual computing platforms is rather novel.
This concept is intriguing since the same hardware resources can be used for not just one
virtual platform, but for several virtual machines, or for any other process. The ability to
reconfigure the underlying hardware to specifically support the computing platform offers
many advantages. Most importantly, this paradigm for computing may provide a solution
Figure 2.2 Abstract architecture for co-designed virtual machine.
to the performance problem of software-based virtual machine implementations.
For this to be viable, a co-design flow needs to be developed to assist the implemen-
tation. There exists a general co-design process, but it is too general for virtual machines.
There is little direction provided to assist in how to partition the virtual machine, how to
design the hardware and software components, or what comprises the interface between
them. While assistance for these stages may not be possible for all co-designed systems in
general, it may be possible for virtual machines as a class of problems. Currently, there
exists no assistance for this class of problems beyond the support available for co-
designed systems in general, or for embedded systems more narrowly focused in a
domain. This dissertation will address this problem, by presenting techniques and guide-
lines that can be used specifically for directing the co-design of virtual machines. The
next section discusses in depth the foreseen benefits of a co-designed virtual machine.
2.5 Benefits of a Co-Designed Virtual Machine
A major benefit of any implementation approach for virtual machines is the ability
to change and extend the implementation for revisions to the virtual machine’s specifica-
tion. The use of a co-designed virtual machine promotes this flexibility through the recon-
figurability of the hardware architecture. Revising the implementation is arguably no
more difficult than that of changing a full software implementation. This is not the case,
however, when a dedicated ASIC co-processor or hybrid processor is used. In these
instances, changing the hardware can be a high cost venture. The recent Java virtual
machine is an example of this. From a software implementation of the virtual machine,
the Java platform has undergone four major and several minor implementation revisions,
the specification of the virtual machine itself has been revised once, and the Java proces-
sor, picoJava, has undergone a major revision as well [103,72,109,99]. This demonstrates
the importance of having a flexible implementation that can be easily changed to accom-
modate revisions in the virtual machine.
When using a hardware device to provide a service there is always a concern
regarding availability. Even if a hardware device exists to provide the service desired, is
the device suitable for the user? Assuming a custom ASIC co-processor were available,
one needs a different type for each different type of virtual machine. It could be envi-
sioned that the host system would contain a general purpose processor along with several
dedicated co-processors on the system mainboard. Is the computing platform for each of
these dedicated co-processors used often enough to justify having dedicated hardware
resources? This is especially true if the performance demand for a particular virtual
machine is low, thus causing a high cost for hardware support. For a dedicated ASIC co-
processor or hybrid processor solution this can be an issue. Using reconfigurable hard-
ware, the same hardware can be used to support multiple computing platforms, thus
amortizing the cost of having this hardware. The cost associated with having a reconfig-
urable device is much less dramatic when several computing platforms can be supported.
It can be envisioned that as each virtual machine is requested by an application, the sys-
tem will reconfigure the hardware to the appropriate virtual machine and then execute.
Thus, only one general processor and one reconfigurable device can theoretically support
an unlimited number of virtual machine types. Moreover, such a reconfigurable
coprocessor can support any number of other configurations for any other application.
Cost is always an issue raised when discussing the value of various means of imple-
mentation. This is a rather subjective area to argue when discussing the effort involved to
fulfill the implementation. Past research experience shows that a software implementation
is easier than a hardware implementation because of its flexibility, so a software and JIT
solution would potentially be easier to complete than a co-designed solution. The co-
designed solution, however, is arguably easier to implement than the hybrid solution
which requires integration with a secondary computing platform, and the dedicated co-
processor which involves fabricating the solution.
Often, a computing platform is supported through a virtual machine because it has
an embedded architecture that differs from the native architecture. To attempt to merge
the two computing platforms together to support both paradigms is a very challenging
and often counterproductive process. Some platforms simply cannot be easily merged
based on their underlying fundamental architectures. The co-designed solution avoids this
by having the embedded architecture of the virtual machine supported within its own
computing element. This allows the hardware support for the virtual machine to be opti-
mized for its platform, without compromising support for another. This is an advantage
that the co-designed solution provides over the hybrid processor.
The just-in-time compiler solution in some sense performs the complement of the
co-designed approach. The JIT technology transforms the application from the virtual
machine instruction set to the native instruction set of the host processor. Conversely, the
co-designed approach changes the native instruction set to make the application native. In
this sense the co-designed virtual machine has the advantage that the transformation takes
place at compile-time when the reconfigurable device is programmed, while the JIT
transformation takes place at run-time after the time critical section is identified.
When providing a virtual machine through a software emulation environment, the
time critical section of the virtual machine is optimized to take advantage of the underly-
ing hardware architecture to improve performance. When examining the software imple-
mentation of the Java virtual machine, it can be seen that the time critical loop of fetching
and executing instructions is optimized specifically for each hardware platform [103]. It
has both SPARC and Intel architecture modules for that specific component of the virtual
machine. This adds complexity when providing the virtual machine through a new plat-
form as this module is customized for the new underlying hardware architecture. Often
the hardware component of the co-designed virtual machine, which is provided through
reconfigurable logic, overlaps the platform specific components of the software only
solution. In this case, a significant portion of the platform dependencies are removed.
With less need to port platform dependent implementation components of the virtual
machine between platforms, the porting process becomes much simpler. Thus, when a
virtual machine is co-designed for one general desktop platform it can more easily be
manipulated for all desktop platforms.
In some aspects, the co-processor solution and the co-designed approach are very
similar. Both provide additional hardware resources that target specific needs of a com-
puting platform to improve the performance. There are however, three main differences
that separate these approaches. First, the co-designed solution discussed here utilizes
reconfigurable technology. This reduces the cost of hardware resources and allows
support for multiple virtual machines as discussed previously. Secondly, the co-processor is
designed to work as an add-on to the general purpose processor. Control flow is dictated
by the CPU and the co-processor just performs fine grained tasks that are requested of it.
The co-designed approach views the added hardware support as an equal processing unit
and as such it contributes to the control flow of an application's execution. This does,
however, add complexity to the design that may be unnecessary. Thirdly, the co-designed
solution goes beyond simply providing additional hardware support, but addresses the
synergy between the added hardware support and the whole virtual machine. This is seen
later in the dissertation in the discussion of what support to provide in hardware/soft-
ware, where to execute a block of instructions, and how to design the software to work
seamlessly with the hardware support. In a typical co-processor, these issues are not
addressed and instead the design focuses on providing just a standard interface to the co-
processor.
Finally, a major benefit of the co-designed solution is the use of two computing
devices. With the addition of a hardware device, it is now possible to execute two flows
of execution simultaneously. In the simplest of circumstances, this can execute a virtual
machine application and any other arbitrary application in parallel. If, however, the virtual
machine being used supports multi-threading, this can result in two threads within
the virtual machine executing in parallel. This can result in a further performance gain,
but is not addressed in this dissertation.
2.6 Java Virtual Machine
For this research on co-designing virtual machines, it is as important to show the
application of the approach to a case study virtual machine as it is to describe the
approach itself. While all of the ideas are applicable to virtual machines in general, the
use of a concrete virtual machine allows for some insight into the potential performance
that can be gained through co-design. The use of a case study is also beneficial in exam-
ining some of the more detailed aspects of the co-design approach and from that abstract-
ing the results to form some additional general guidelines to follow in the process. For
this, a case study virtual machine must be chosen and it was decided to use the Java
virtual machine. That is, the Java virtual machine as used within the desktop environment and not
that of a Java platform for use in embedded computing [104,60,69]. While both targets
share some similar problems, they both contain issues that are unique to their usage
[74,78,83].
The Java programming language is a general-purpose object-oriented language
[5,40]. The Java platform was initially developed to address the problems of building
software for networked devices. To provide this support for different types of devices, it
was decided to provide the Java language on top of a virtual machine [71]. The Java vir-
tual machine is the cornerstone of the Java platform. It is the virtual machine that allows
the Java platform to be both hardware architecture and operating system independent.
A key reason for the choice of the Java virtual machine as a case study is its popu-
larity [6]. While it is not crucial that the example virtual machine be popular, it does pro-
vide certain characteristics that are desirable. The amount of Java code that exists from
application developers makes finding and selecting test benchmarks easier. The popular-
ity has also forced Java to mature and become a stable computing environment. The Java
virtual machine is relatively stable and the source code for the software implementation is
freely available for research use. This can be used to avoid implementing supporting
characteristics of the virtual machine that are not important to the research and the co-
design solution, but still allow for a complete virtual machine to be explored. The Java
virtual machine also has a well-defined and freely available specification document of the
virtual machine, as well as substantial reference and specification documents of the Java
based picoJava processor [72,101]. All of this documentation can contribute to making
the co-design process more complete and the implementation more straightforward.
The Java virtual machine is also a good example because of its history. The Java
virtual machine was originally designed very cleanly and precisely. It was internally
developed at Sun Microsystems and encountered many revisions internally before its ini-
tial release as a general purpose computing platform [24]. Supporting the argument is the
fact that original Java programs written when the platform was initially released will still
execute on the latest virtual machine. Since the inception of Java, the platform has been
provided through the virtual machine and there have been several books written
concerning its specifications. While the Java platform has encountered revisions, the actual Java
virtual machine engine has not evolved very much since its original design and specifica-
tion. Only one specification revision has been made, which avoids the issue of virtual
machine versioning and legacy issues that become part of the virtual machine specifica-
tion.
Finally, the extent to which the Java virtual machine has been researched makes it
an interesting example. Many different methodologies have been used on the Java virtual
machine to increase its performance [21]. Since its introduction, the Java virtual machine
has been the target of research because of its slow performance and hindrance to the Java
platform [41].
2.6.1 Benchmark Tests
To validate this research, it is important to analyze the performance of the virtual
machine for several applications that represent a general range of domains. For the Java
virtual machine, there exists a standard set of benchmarks for verification of an imple-
mentations performance. These are the SpecJVM benchmarks [49]. This suite consists of
the following eight tests:
• check - A simple program to test various features of the JVM to ensure that
it provides a suitable environment for Java programs. Such features include
array indexing, class creation and invocation, and basic operations and
control flow.
• compress - Modified Lempel-Ziv method (LZW) finds common substrings
and replaces them with a variable size code. This is deterministic, and can
be done on the fly.
• jess - the Java Expert Shell System is based on NASA's CLIPS expert shell
system. The benchmark workload solves a set of puzzles commonly used
with CLIPS.
• db - Performs multiple database functions on a memory resident database.
It reads in a 1 MB file which contains records with names, addresses and
phone numbers of entities and a batch file which contains a stream of oper-
ations to perform on the records in the database.
• javac - This is the Java compiler from the JDK 1.0.2.
• mpegaudio - This is an application that decompresses audio files that con-
form to the ISO MPEG Layer-3 audio specification. The workload consists
of about 4MB of audio data.
• mtrt - This is a variant of raytrace, a raytracer that works on a scene depict-
ing a dinosaur, where two threads each render the scene in the input file
which is 340KB in size.
• jack - A Java parser generator that is based on the Purdue Compiler Con-
struction Tool Set (PCCTS). This is an early version of what is now called
JavaCC.
In addition to these benchmarks, it was decided to provide two more applications
that are known to be compute intensive. These tests are:
• queens - A programming solution for the n-queens problem. It uses a tree-
parsing approach of recursively placing pieces, but trimming away incor-
rect solutions at the first sign of failure.
• mandelbrot - Generates a 320x240 picture of the Mandelbrot set with a
maximum iteration of 2000 for each pixel in the graph.
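The core of the mandelbrot test amounts to the standard escape-time iteration. The sketch below is a reconstruction in spirit only; the actual benchmark source is not reproduced here, and the class and method names are invented for illustration.

```java
// Escape-time iteration for one point c = (cr, ci) of the complex plane:
// iterate z = z^2 + c until |z| exceeds 2 or the iteration cap is reached.
// This per-pixel loop is what makes the benchmark compute intensive.
public class MandelPoint {
    static final int MAX_ITER = 2000;    // cap quoted in the test description

    public static int iterations(double cr, double ci) {
        double zr = 0, zi = 0;
        for (int i = 0; i < MAX_ITER; i++) {
            if (zr * zr + zi * zi > 4.0) return i;   // escaped the radius-2 disk
            double t = zr * zr - zi * zi + cr;       // real part of z^2 + c
            zi = 2 * zr * zi + ci;                   // imaginary part of z^2 + c
            zr = t;
        }
        return MAX_ITER;                 // point assumed to be in the set
    }
}
```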
Each benchmark has various properties and is designed to test various features. Not
all of these benchmarks are used through the examination of the case study co-designed
Java virtual machine. This is due to some of the characteristics that the applications pos-
sess. One such feature is multithreading, which raises difficulties in a simulation environ-
ment. There are also licensing issues for some applications that prevent the manipulation
of the Java bytecode. This prevents the co-designed Java virtual machine from perform-
ing its dynamic run-time analysis of when to switch execution between partitions.
Additional tests were developed through the course of the research work to investi-
gate local effects and characteristics of the hardware design. These tests were comprised
of the subset of bytecodes that were supported by the hardware design under test. As
such, the tests themselves are relatively small in size, but are focused on the features they
exercise. These tests are:
• Loop counter - A simple for loop used to gauge maximal performance
increase.
• Fibonacci - An iterative program that will compute the nth Fibonacci num-
ber.
• Ackermann - A recursive program that will compute the Ackermann function
for a given input combination.
• Bubble sort - Uses the bubble sort algorithm to sort an array of numbers.
This test examines the effects of increased bytecode size.
• Insertion sort - Implementation of insertion sort algorithm used to analyze
the effects of high levels of memory access.
Each of these tests is chosen to examine specific design issues that will be presented later
in the dissertation.
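As an example of how small and focused these tests are, the iterative Fibonacci test reduces to a tight loop of local-variable arithmetic, exercising the basic load, store, add, and branch bytecodes that a hardware partition would support. The code below is an illustrative reconstruction, not the actual test source.

```java
// Iterative Fibonacci: a loop of iload/iadd/istore-style bytecodes,
// which is why it serves as a micro-test for the hardware design.
public class Fib {
    public static long fib(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return a;   // fib(0)=0, fib(1)=1, fib(2)=1, ...
    }
}
```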
2.7 Summary
This chapter presents the idea of co-designed virtual machines and its motivation.
The chapter began with a brief description of virtual machines and some of the motiva-
tion for virtual machines usage. Several current implementation methods for virtual
machines were presented along with their advantages and disadvantages. The discussion
continued with a description of some of the benefits offered by co-designed virtual
machines over these current methodologies. The chapter concludes discussing the Java
platform and the decision behind choosing the Java virtual machine for the working
example. The next chapter addresses some of the underlying information and research
that has been done in the areas of hardware/software co-design, reconfigurable comput-
ing, and virtual machines in general.
CHAPTER 3
Hardware/Software Co-Design
3.1 Introduction
To provide a context for the research work to be presented later, this chapter pre-
sents a short background survey of hardware/software co-design. The concept of hard-
ware/software co-design is defined and the issues that make co-design important are
discussed. This leads into an overview of reconfigurable computing including the various
types of reconfigurable computing and a common device used to implement reconfig-
urable computing, Field Programmable Gate Arrays (FPGAs).
3.2 Hardware/Software Co-Design
Hardware/Software co-design is the integrated design of systems implemented
using both hardware and software components [25]. Systems that consist of both compo-
nents are not new, but methodologies for designing these systems are new. Software pro-
vided for such systems is often written using instance specific techniques and is now
being seen as a specific topic for software engineering [70]. These methodologies are
used to concurrently apply and trade off design techniques from both computer and soft-
ware engineering disciplines [86,89]. The approaches are intended to give relief to
designers struggling with instance specific divisions of hardware and software compo-
nents, and the resulting integration problems. Their purpose is to streamline the design
process, thus reducing design costs and shortening time-to-market; to optimize the hard-
ware/software partitioning, thus reducing direct product costs; and to ease integration, by
invokevirtualobject_quick, and invokevirtual_quick_w (for wide index). Unlike other
quick instructions which can be implemented in hardware after the class is known to be
loaded, quick method invocations cannot be implemented in hardware. This is due to the
complexities of storing and changing the machine state between methods. Despite the
method being resolved, this problem still exists in the quick version.
4.3.5 Exceptions
Exceptions are a complex mechanism to implement on any platform because of the
effect an exception can have on the call stack and the flow of execution. When an
exception is thrown, it signals that something out of the ordinary has happened. The
actions taken vary between instances: they can range from printing a simple message to
terminating the entire application. Within the virtual machine, handling an exception can
involve unwinding several frames of the call stack to find a location where the exception
is handled. An exception in Java also involves the creation
of an Exception object that is passed back to the location where the exception is caught.
This can result in class loading and verifying as part of the exception throwing process.
As a result of this complexity, the instruction athrow (opcode 191) is implemented in
software where manipulating the execution stack is more easily performed.
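The behaviour that makes athrow expensive can be observed from plain Java source. The following sketch is illustrative only (the class and method names are not from the dissertation's test virtual machine): throwing allocates an Exception object and forces the VM to unwind intermediate frames until a matching handler is found.

```java
// Illustrates why athrow is complex: throwing creates an Exception
// object and unwinds call-stack frames until a handler is reached.
public class ThrowDemo {
    static void level3() {
        // The exception object is created here; the VM must then unwind
        // the frames of level3, level2, and level1 to reach the handler.
        throw new IllegalStateException("unwound from level3");
    }

    static void level2() { level3(); }

    static void level1() { level2(); }

    public static String run() {
        try {
            level1();
            return "not reached";
        } catch (IllegalStateException e) {
            // Control transfers here; three frames were discarded.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "unwound from level3"
    }
}
```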
4.3.6 Object Creation
Creating objects in the virtual machine can be a very complex process. During the
creation of the object it may be necessary to load and verify the class of which the object
is an instance. If the object is a thread object, it must be added to the virtual machine's
thread scheduler so that the thread will have the opportunity to
obtain time slices and to have an execution environment created. To make the process
still more complex, it is possible that during the creation of the object an exception may
have to be thrown to signal a shortage of memory or that the class description cannot be
found. This all adds an enormous amount of complex processing to create an object.
Since the creation involves direct interaction with the software support in terms of excep-
tions and thread handling, the creation of objects should be done in software. As such the
instructions new and new_quick (opcodes 187 and 221) are implemented in the software
partition.
4.3.7 Array Creation
Creating an array is as complex a process as creating a single object. During the creation
it is possible that an exception can be thrown due to a lack of memory or other
resources. As such it was decided to implement the instructions newarray, anewarray,
multianewarray, anewarray_quick, and multianewarray_quick (opcodes 188, 189, 197,
222, and 223 respectively) in software. The only difference between the quick and
non-quick versions of the instructions is that for the quick version, the class has already
been loaded and verified in the virtual machine. This removes some complexity, but a
lack of resources can still cause an exception to be thrown.
4.3.8 Storing to a Reference Array
Java is a strongly typed language, and this is exemplified when storing a reference
into a reference array. Opcode 083, aastore, checks the reference before storing it into the
array to verify that the object reference is of a correct type, or subtype, of the array's
element type. This is an expensive process that involves tracing through the subtype
hierarchy to determine whether the class is a descendant of the array's element type. This
check is better left in software since it requires intensive use of the virtual machine's
entire object store.
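The run-time check performed by aastore is visible at the Java source level; a minimal sketch:

```java
// Demonstrates the aastore type check: storing an incompatible
// reference into a covariant array fails at run time, not compile time.
public class AastoreDemo {
    public static boolean storeFails() {
        Object[] arr = new String[2]; // legal: a String[] is an Object[]
        arr[0] = "ok";                // passes the aastore check
        try {
            arr[1] = Integer.valueOf(1); // fails: Integer is not a String
            return false;
        } catch (ArrayStoreException e) {
            return true; // the VM traced the type hierarchy and rejected it
        }
    }

    public static void main(String[] args) {
        System.out.println(storeFails()); // prints "true"
    }
}
```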
4.3.9 Type Checking
The Java virtual machine, as part of its security features, performs type checking on
objects to verify that any actions that cast the type of an object are legal. When checking
a type, the virtual machine must trace back through the inheritance tree of the object to
verify that the object at some point inherits that type. This tracing can involve a great deal
of work and can result in an exception being thrown. Because of this it has been decided
to implement the instructions checkcast, instanceof, checkcast_quick, and
instanceof_quick in software. This is not a major drawback as the frequency of these
instructions in relation to most applications is rather low.
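The source-level operations that compile to these opcodes can be illustrated as follows (a sketch; the values chosen are arbitrary):

```java
// instanceof and checked casts both require tracing the inheritance
// tree; an illegal cast results in a ClassCastException being thrown.
public class CheckDemo {
    public static boolean[] check() {
        Object s = "hello";
        // instanceof: true, because String implements CharSequence.
        boolean isCharSeq = s instanceof CharSequence;
        boolean castThrew = false;
        try {
            // checkcast fails: String does not inherit from Integer.
            Integer i = (Integer) s;
        } catch (ClassCastException e) {
            castThrew = true;
        }
        return new boolean[] { isCharSeq, castThrew };
    }

    public static void main(String[] args) {
        boolean[] r = check();
        System.out.println(r[0] + " " + r[1]); // prints "true true"
    }
}
```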
4.3.10 Monitors
Monitors are the synchronization tools used within Java to ensure correct concurrent
execution. Dealing with threads at a low level involves the ability to manipulate the
states of the threads and to adjust their priorities. The thread-manipulation support is
implemented in software because it operates at a higher level of abstraction. Since the
instructions monitorenter and monitorexit (opcodes 194 and 195) must deal with
threads in a low-level fashion, by accessing thread-scheduling information and methods in
the higher-level support, it is desirable to implement these instructions in software.
4.3.11 Accessing the Jump Table
Two bytecode instructions implement case statement operations, where the various
cases of the instruction are arguments to the opcode. These instructions are tableswitch
and lookupswitch, opcodes 170 and 171 respectively. With instruction lengths that are
variable up to a potential maximum of 2^34 bytes, these instructions would cause havoc
in an efficient pipeline implementation. Additionally, these instructions have a low
frequency of 0.032% in comparison to the other opcodes [35]. Because of this low
frequency and their negative effect on the pipelining architecture, these instructions are
better implemented in software.
4.3.12 Wide Indexing
The opcode 196, wide, is used for extending a local variable index to a wide for-
mat. This instruction is better suited to the software partition for various reasons. The
instruction has multiple formats and would complicate the pipelining of instructions,
especially since one form of the instruction would result in having to lengthen the data
path from 32 bits to 48. Implementing this instruction in hardware would only improve
performance if the hardware partition could access the extended area of the local
variables. As well, the instruction is not frequently used. This instruction is therefore
included in the software partition, although under the right environment conditions it
may be moved into the hardware partition.
4.3.13 Long Mathematical Operations
Some mathematical operations on long data types are costly in design space and are
infrequently used. This is the case for multiplication, division, and remainder (opcodes
105, 109, and 113 respectively). For this reason, they are commonly left to be imple-
mented in software as in the picoJava core [101]. This partitioning scheme allows these
opcodes to remain in software. In the event that additional design space is available, these
opcodes could potentially be moved to hardware.
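The design-space cost of lmul stems from composing a 64-bit product out of 32-bit partial products. The decomposition can be sketched in Java (variable names are illustrative; this is not the dissertation's implementation):

```java
// Sketches why a 64-bit multiply is expensive in hardware: the low
// 64 bits of a*b are assembled from three 32x32-bit partial products.
public class LongMulDemo {
    public static long mulViaParts(long a, long b) {
        long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;
        long lo = aLo * bLo;                // low x low, fits in 64 bits
        long cross = aLo * bHi + aHi * bLo; // cross terms, shifted up 32
        // aHi*bHi contributes only to bits >= 64, which wrap away.
        return lo + (cross << 32);
    }

    public static void main(String[] args) {
        long a = 0x123456789ABCDEFL, b = -987654321L;
        System.out.println(mulViaParts(a, b) == a * b); // prints "true"
    }
}
```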
4.3.14 Returning from a Method
Six instructions exist for returning from a method with various return types and
void, opcodes 172 - 177. These instructions affect the virtual machine by changing its
entire execution frame, including bytecode method, execution stack, local variables, and
constant pool. In the event that the hardware and software partitions do not share a com-
mon memory space, these instructions cannot be efficiently implemented in hardware. To
maintain a high level of flexibility for environments that do not share a common memory
space, these opcodes must be implemented in software. If it were known that a common
memory were always present, then these instructions could be moved to the hardware
partition.
4.3.15 Operating System Support
In the general case, a processor must have some software support for managing the
processes executing on it and the resources it provides; this support constitutes the
operating system for the processor. A virtual machine likewise requires software support
similar to an operating system. This software support must include the management of
threads, garbage collection, class verification, network and I/O support, the
Java application programming interface (API), and native method support. These are just
some of the necessary features which must be implemented in software to support the
Java virtual machine.
4.3.16 Software and Hardware Coordination
Just as the hardware partition needs added functionality for communicating with the
software partition, so too must the software partition have extra functionality for commu-
nicating with the hardware. This support must provide the software partition with the
ability to transfer data to and receive data from the hardware, and to signal the hardware
to stop, abort, start, and continue processes. Functionality for the enhanced control
needed during debugging and testing is also required.
These are some of the more specific instructions which must be supported in soft-
ware. As discussed in the previous section 4.2, the software partition also overlaps with
the hardware partition; thus it includes all of the instructions in the hardware component
as well. This allows for greater flexibility in migrating execution between the hard-
ware and software partitions. This is cost effective since the only penalty for having extra
support in software is the software development time.
4.4 Hardware Partition
The more instructions that can be implemented in hardware, the better, since the
overall purpose of this co-design is to obtain faster execution by pipelining the
fetch-decode-execute loop. Additionally, the desire is to target the instructions that have
traditionally been supported in a typical processor to be moved into the hardware parti-
tion. It is also important to consider the usage frequency of the instructions when decid-
ing whether to implement them in hardware.
As discussed earlier, the partitioning provides several configurations with varying
levels of hardware support and resource requirements. Three primary hardware partitions
have been identified, with each partition being an extension of the former, as shown in
Figure 4.2. These partitionings are:
• Compact - Small partition for restrictive hardware devices. Intended for
small FPGA devices, environments with a slow communication bus, or
both.
Figure 4.2 Abstract view of overlapping partitioning extensions.
• Host - Extends Compact with support for accessing host memory system.
Intended for medium sized FPGA devices that have available support for
accessing the global data space of the application.
• Full - Extends Host with support for quick instructions. Intended for the
same environment as the Host partitioning scheme, but additionally provides
support for dynamic instructions that initially require software support.
These partitioning schemes may not be suitable for all architectural environments;
however, they are intended to be used as starting points for a solution. Incremental
changes that add or delete instructions can easily be made to fine-tune the partitioning
and better utilize the resources available.
The next three subsections discuss the specific instructions, or grouping of instruc-
tions, that have been implemented in each of the hardware partitions, with a brief
explanation of why each decision was made. Appendix B lists all of the Java virtual
machine instructions; for each instruction it gives the opcode, mnemonic, a description of
the instruction, and whether or not the instruction is targeted for hardware implementation.
4.4.1 Compact Partition
The compact partitioning, or minimal partition, encompasses the instructions that
have been targeted as fundamental for execution and that require minimal system
knowledge. This scheme minimizes the necessary data that must be
exchanged between the hardware and software partitions for execution. Likewise, it pro-
vides a configuration with minimal hardware support. The minimal data exchange is
viewed as a potential benefit in the event that the communication medium between the
FPGA and the host system is slow. Thus, this partition is intended for environments with
a small FPGA, a slow communication bus, or both. The typical instruction groups that
comprise this partition are:
• Constant instructions that perform a fixed operation on no data and do not
change the state of execution other than a simple data register assignment.
• Data assignments or retrievals from the temporary register stores.
• Basic arithmetic and logic operations on local data values.
• Branching instructions to manipulate control flow.
• High frequency instructions.
The following sub-subsections list the specific groups of instructions that are sup-
ported by the hardware partition for the test case Java virtual machine.
4.4.1.1 Constant Instructions
Opcodes 001 through 015 represent the instructions within the Java virtual machine
that push a constant numerical value onto the data stack. An instruction that requires no
computation, only the assignment of a numerical value to a location in memory, can
clearly be implemented in hardware.
4.4.1.2 Stack Manipulation
The Java virtual machine is a stack-based computing model. This means that as
Java executes, it uses a stack to retain all data being manipulated. To allow for this model,
Java has many instructions for the purpose of manipulating the stack. These consist of
opcodes 016, 017, and 087 - 095. All of these stack manipulations can be implemented in
hardware as they translate into basic memory assignments, which are the equivalent of
instructions in traditional microprocessors for reading and writing data to registers.
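The effect of these opcodes can be modelled with an explicit operand stack. The following is a minimal sketch of that model, not the dissertation's implementation; the opcode numbers in the comments follow the standard Java virtual machine numbering:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Models stack-manipulation opcodes such as pop (087), dup (089),
// and swap (095) as simple memory moves on an operand stack.
public class StackDemo {
    public static int[] run() {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);               // push constant 1
        stack.push(2);               // push constant 2
        stack.push(stack.peek());    // dup: copy the top of stack
        stack.pop();                 // pop: discard it again
        int b = stack.pop(), a = stack.pop();
        stack.push(b);
        stack.push(a);               // swap: exchange the top two values
        return new int[] { stack.pop(), stack.pop() };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1]); // prints "1 2"
    }
}
```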
4.4.1.3 Mathematical Opcodes
The opcodes 096 - 119 are for mathematical operations such as addition, subtrac-
tion, multiplication, division, remainder, and negation. Each of these operations can be
performed on the basic data types float, double, integer, and long. These instructions are
traditionally implemented in microprocessors and are easily implemented in hardware,
with the exception of the instructions discussed in section 4.3, namely lmul, ldiv, and
lrem. These instructions are implemented in the software partition since, given their low
frequency of occurrence, the potential benefit does not outweigh the required design space.
4.4.1.4 Shift and Logical Opcodes
The instructions for shifting numerals are opcodes 120 - 125, and logical operators
and, or, and xor as performed on integers and longs are opcodes 126 - 131. These are
implemented in hardware to increase performance.
4.4.1.5 Loading and Storing
The opcodes 021 - 045 and 054 - 078 are combinational opcodes for storing and
loading different types and offsets of variables between the local variables and the
operand stack. Each of these instructions is simply a data-copying instruction, equivalent
to the register read and write instructions of traditional microprocessors, and can easily
be implemented in hardware.
4.4.1.6 Casting Operators
The opcodes 133 - 147 allow for casting values from one primitive type to another.
For example, integer 1 to double 1.0. These operators are mathematical in nature, like
the arithmetic operators, and a clear algorithm based on the same basic principles exists
for each casting operator.
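The source-level equivalents of two of these opcodes can be illustrated as follows; note that narrowing conversions such as d2i truncate toward zero (a sketch, with arbitrary example values):

```java
// Primitive casting as performed by opcodes such as i2d and d2i.
public class CastDemo {
    public static double widen(int i) {
        return (double) i;  // i2d: exact widening conversion
    }

    public static int narrow(double d) {
        return (int) d;     // d2i: narrowing, truncates toward zero
    }

    public static void main(String[] args) {
        System.out.println(widen(1));     // prints "1.0"
        System.out.println(narrow(-3.9)); // prints "-3"
    }
}
```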
4.4.1.7 Comparison and Branching Operators
The opcodes 148 - 152 compare values of different basic types; opcodes 153 - 167
branch from one location to another; opcodes 198 and 199 branch based on a reference
being either null or non-null; and opcode 200 is an unconditional branch to a wide index.
These opcodes simply change the program counter and are implemented in hardware just
as they are in other microprocessors. These instructions are special, however, in that they
pose a challenge for pipelining instructions in the hardware design.
4.4.1.8 Jump and Return
Opcode 168 (jsr) jumps to a subroutine, and opcode 201 (jsr_w) jumps to a subroutine
at a wide-index location. Opcode 169 (ret) returns from a subroutine. These opcodes are
specialized versions of branch instructions and can easily be included in the hardware
partition.
4.4.1.9 Miscellaneous Instructions
There are some instructions within the virtual machine that do not belong to a spe-
cific set of instructions but are useful and should be implemented in hardware. The nop
instruction, opcode 000, is a common instruction in hardware architectures to perform a
“do nothing” operation. The opcode 132, iinc, increments a local variable by a constant.
This is no different than a combination of load, add and store instructions.
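The equivalence described above is visible in how the compiler treats a compound assignment; a small sketch:

```java
// iinc combines load, add-immediate, and store on a local variable:
// "i += 5" compiles to a single iinc instruction rather than the
// sequence iload / bipush 5 / iadd / istore.
public class IincDemo {
    public static int bump(int i) {
        i += 5; // compiles to: iinc <slot of i>, 5
        return i;
    }

    public static void main(String[] args) {
        System.out.println(bump(10)); // prints "15"
    }
}
```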
4.4.1.10 Communication Support
Since the hardware partition has to work in unison with software components, it is
necessary that the hardware partition contains functionality such that it can communicate
with software through the system bus. For the Compact partition, the communication
support is limited, as it is capable only of interacting with and retrieving data from its
own local memory system. For the Host and Full partitions it is more complicated, as
support is required for retrieving data from the host's memory system.
The next sub-section discusses an extension of this partition called the common
memory partition. This extension provides extra instructions that require communication
support for accessing the memory of the host system.
4.4.2 Host (Common Memory) Partition
The second partition is an extension of the former, but with added support for
accessing the host system’s memory. A significant group of instructions in the virtual
machine requires accessing the common memory store of the virtual machine. This data
is too large to hold in the limited external cache memory available to the FPGA. Even if
space were not an issue, the penalty in communication is too great to transfer all poten-
tial data to the local memory cache. Thus, the solution would be to have the FPGA
directly access the data from the host system. This partition carries the penalty of the
extra logic in hardware needed to communicate with the host's memory system, hence
the separate partition configuration.
4.4.2.1 Array Accessing
Array access is supported by opcodes 046 - 053, for loading different data types,
and opcodes 079 - 086 excluding 083, for storing different data types. These instructions
simplify into memory accesses to the virtual machine’s object store. With the support for
accessing the object store, these instructions can easily be implemented in hardware.
4.4.2.2 Length of Arrays
The opcode 190 for arraylength is used to retrieve the size of an array. This instruc-
tion requires accessing the header information of the array that is being evaluated. This
check is trivial; however, the data in question lies in the general heap of the virtual
machine. Thus, it will be implemented in this partitioning.
Other instructions exist that could be implemented in hardware, given the ability
to access the host system's memory space. These instructions, however, are classified
differently since the instructions themselves transform during runtime. The next section
discusses an extension of this partition, the Full partition, which includes these instructions.
4.4.3 Full Partition
The third partition scheme provides added support for quick or augmenting instruc-
tions. These are instructions which replace their normal equivalents after the initial exe-
cution of the instruction. These pose a unique challenge in that the initial instance cannot
be supported in hardware. However, the replacement quick version, which is substituted
after the initial execution, is simplified and removes the constraints that prevent the
original instruction from being contained in the hardware partition. An exam-
ple of this instruction in the case study Java virtual machine is getfield. On the first exe-
cution, the instruction may involve loading the class; however after the initial instance,
the class is loaded and execution is reduced to simple memory accessing. These instruc-
tions do not differ significantly from instructions in the previous two partitioning
schemes, but they require special consideration when determining when to transfer execu-
tion between the hardware and software partitions. This will be discussed later in chapter
6 of the dissertation. The following sub-subsections describe the instructions, or groups
of instructions, that are included in this partitioning scheme.
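The quick-rewriting mechanism can be sketched as an interpreter that patches its own instruction stream after the first, slow execution. In the sketch below the class, field names, and opcode values are illustrative (getfield is 180 in the standard instruction set; 206 is taken from the quick-opcode range discussed below, but the exact assignment is an assumption):

```java
// Sketch of quick-instruction rewriting: the first execution of an
// opcode performs resolution in software, then overwrites itself with
// a "quick" variant that later executions run directly (and that the
// hardware partition can support).
public class QuickRewriteDemo {
    static final int GETFIELD = 180;        // standard opcode
    static final int GETFIELD_QUICK = 206;  // illustrative quick opcode

    static int resolutions = 0; // counts slow, software-side executions

    static void execute(int[] code, int pc) {
        switch (code[pc]) {
            case GETFIELD:
                resolutions++;             // resolve class/field (slow path)
                code[pc] = GETFIELD_QUICK; // patch the instruction stream
                // fall through to the quick behaviour
            case GETFIELD_QUICK:
                break;                     // simple memory access (fast path)
        }
    }

    public static int run() {
        int[] code = { GETFIELD };
        for (int i = 0; i < 100; i++) {
            execute(code, 0);
        }
        return resolutions; // only the first execution took the slow path
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "1"
    }
}
```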
4.4.3.1 Quick Loading Data from the Constant Pool
Within the instruction set there exist opcodes for quick loading data from the con-
stant pools of objects into the virtual machine’s execution stack. The opcodes 203 - 205
are used for this purpose and cover standard sized data as well as wide and double wide
data elements. These instructions, once the classes and constant pools are resolved, can
easily be implemented in hardware.
4.4.3.2 Quick Field Accesses in Classes and Objects
The opcodes 206 - 213, 227, and 228 are for the quick versions of the instructions
for getting and setting fields in objects and classes. Since the class or object in question is
already resolved and loaded into the virtual machine's memory, the instructions are
simple data accesses that transfer data between the virtual machine's execution stack and
the object store.
4.5 Partition Coverage
For the partitioning schemes described, the success in achieving a performance
increase is dependent on the application. Having an application that contains a significant
majority of its instructions in the hardware partition will certainly be beneficial in achiev-
ing a performance increase. For the Java virtual machine case study, it can be seen in Fig-
ure 4.3 that a high percentage of the instructions for each of the benchmarks, based on
instruction frequency, are supported in the hardware partition. For the minimal compact
partitioning scheme, the coverage ranges between 51.5% and 94.6%, with an average of
68.2%. As expected, the full partitioning scheme provides even higher instruction cover-
age ranging from 69% to 99.9% with an average of 87.2%.
A second metric that can be used to judge the coverage of the partitioning schemes
is that of execution time. With the availability of a full software implementation of the
Java virtual machine, the time spent executing each of the instructions in the different
partitionings can be measured. Figure 4.4 shows the coverage of the different partition-
ing schemes for each of the benchmarks based on execution time.1 The percentage rates
reflect the amount of overall time spent executing instructions that belong to the hardware
partition. For the minimal compact partitioning scheme, the coverage ranges between
46.5% and 95.7%, with an average of 67.7%. The full partitioning scheme provides even
higher instruction coverage, ranging from 59.9% to 99.6% with an average of 84.9%.
1. The benchmarks mtrt and jess were omitted from this analysis due to the complexities of obtaining accurate timings for multithreaded applications, as described in section 2.5.1.
These percentages are in general lower than the percentages obtained by measuring
instruction frequencies. This is because instructions that remain in the software partition
typically involve complex high-level tasks, such as class loading and verification, that
require comparatively large amounts of execution time or incur latencies in I/O
functions. Despite the lower percentages, a significant portion of each application's
execution is supported by the hardware partition.
An important characteristic of the applications that is not captured by this analysis
is that of instruction density. While it can be seen how much of the execution for each
benchmark is performed in each partition, the number of times execution is transferred
between the partitions is not shown. It is perceived that the optimal scenario is having
minimal execution transfer between the hardware and software partitions. Thus, having a
high hardware instruction density would be favorable. This aspect of the execution is
analyzed later in chapter 6 of the dissertation.

Figure 4.3 Instruction coverage for various partitioning schemes (based on instruction execution frequency). [Bar chart, 0% - 100%, for the benchmarks queen, mandel, db, compress, jess, mtrt, and raytrace; series: Compact, Host, Full.]
4.6 Summary
This chapter outlines the approach used in determining the partitioning between
hardware and software. This approach was applied to the instruction set of the Java
virtual machine, and the resulting software partition, along with three hardware
partitions, was presented. Each partition is accompanied by an explanation and
justification of the decisions made. The partitioning schemes were examined in the
example Java virtual machine for their coverage, both in instruction frequency and in
execution time, demonstrating that the partitioning schemes provide a high level of
hardware support. The following chapter describes the hardware design that was
implemented from the various partitionings.
Figure 4.4 Instruction coverage for various partitioning schemes (based on percentage of overall execution time). [Bar chart, 0% - 100%, for the benchmarks queen, mandel, db, compress, and raytrace; series: Compact, Host, Full.]
CHAPTER 5
Hardware Design
5.1 Introduction
With the decisions made as to what components of the virtual machine to imple-
ment in hardware and software, each of the partitions must be examined in depth for how
they are to be implemented. This chapter first examines the various aspects of a develop-
ment environment, the effects the environment’s characteristics have on the hardware
design and the factors that must be considered when making any design decisions. This is
followed by a description of the development environment available for use in this
research.
With a development environment established, the context for implementation of the
partition can be discussed. The general approaches used in an implementation, and how
each of these can contribute to an increase in performance, are described. This is
supported by the example Java virtual machine, including the detailed description of the
hardware design and its sub-components [61]. The design is then discussed for its inter-
esting characteristics and properties which make it a co-designed solution. Finally, the
chapter concludes with some benchmark results of the performance for the subset Java
virtual machine.
5.2 Development Environment
As with all systems, the design of the hardware portion of the co-designed virtual
machine is constrained by the target environment. In this case, the focus is on targeting a
desktop workstation that has an FPGA available through a local bus. Due to the target
environment, there are several implications to the requirements of the hardware design,
the primary one being the availability of resources.
The architecture must be flexible to promote easy implementation on different hard-
ware environments. Having a design that can be easily modified to fit on a smaller FPGA
is an important requirement since the size of the FPGA affects how much of the hard-
ware partition can be implemented in hardware. With restricted design space, there is a
need for a design that occupies minimal space while supporting a maximal number of
bytecode instructions. Some of the decisions made during the partitioning process may
have to be revisited in the event that the FPGA is too small or even too large. The
required and desirable size of the FPGA depends on the specific virtual machine.
The communication rate between the FPGA and the host processor is also of impor-
tance. With the tight coupling of both execution elements at a low instruction level, there
is frequent and fine-grained communication between them. Each execution migration
between processing devices requires an exchange of the current state of the virtual
machine. How high a communication rate is needed to provide a performance increase
depends upon how often the execution migrates between processing ele-
ments, how much data must be transferred during the migration process, and the differ-
ence in computation performance between the two processing elements.
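One hedged way to reason about this trade-off is a simple break-even model: migration pays off only when hardware execution time plus state-transfer time is below the software execution time. All numbers in the sketch below are illustrative assumptions, not measurements from this dissertation:

```java
// Break-even model for execution migration: migrate only if
//   t_hw + stateBytes / busRate < t_sw
// All parameter values used here are illustrative assumptions.
public class MigrationModel {
    public static boolean worthMigrating(double tSw, double tHw,
                                         double stateBytes,
                                         double busBytesPerSec) {
        double tComm = stateBytes / busBytesPerSec; // state-transfer time
        return tHw + tComm < tSw;
    }

    public static void main(String[] args) {
        // 1 ms in software vs 0.2 ms in hardware, 4 KB of VM state over
        // a 100 MB/s bus (transfer ~0.04 ms): migration pays off.
        System.out.println(worthMigrating(1e-3, 2e-4, 4096, 100e6)); // prints "true"
    }
}
```

The same model shows why a slow bus favors the Compact partition: as busBytesPerSec shrinks, tComm grows and migration stops being worthwhile.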
The memory that is available for use by the FPGA is also a large concern when
designing a solution. A key architectural influence is whether the memory used by the
FPGA and host processor is shared or disjoint. If each partition has its own memory sys-
tem, then additional requirements must be met to ensure that the necessary data can be
exchanged between the two memory systems when needed. The hardware co-design must
take an active role in compensating for the slow communication speed between the
software and hardware partitions: the hardware itself assists in the communication,
instead of waiting for the software to push data onto the hardware partition.
The size and speed of the memory that is accessible also influences the design. It is
possible that the hardware partition may require more memory than the resources pro-
vide, or require too frequent memory accesses for the memory subsystem to support.
These are factors whose requirements vary between virtual machines. Thus, depending on
the characteristics of the virtual machine to be supported, the development environment
may have to be modified.
All of these factors contribute to the development environment and its suitability for
a virtual machine. It is clear that a development environment suitable for one virtual
machine may not be suitable for another. The next section describes the development
environment architecture that is used in this research.
5.2.1 Hot-II Development Environment
The Hot-II card is a commercial environment available from the Virtual Computer
Corporation (VCC). The board is based on a Peripheral Component Interconnect (PCI)
card that houses a Xilinx XC4062XLT FPGA, 4 Mb of user SRAM, and a 2 Mb
configuration flash [110-114]. The board also has a programmable clock that can be configured
from 360 KHz to 100 MHz. The purpose of the configuration flash is to allow the board
to hold multiple designs that can be loaded dynamically onto the FPGA faster than if the
configuration was dynamically loaded from the host processor. Using this flash, it is pos-
sible to implement a design that reconfigures the FPGA during runtime. The configuration
cache was sized to hold roughly three designs for the FPGA. The development environ-
ment also provides the Xilinx PCI LogiCORE Interface macro and a VCC custom back-
end that lets users communicate with two fully independent 32-bit banks of RAM and the
Configuration Cache Manager (CCM) that controls the run-time configuration/reload
behavior of the system. The PCI board has two independent buses, each with 32-bit data
and 24-bit addresses. There is an I/O connector for each of these two buses. A diagram of
the board layout can be seen in Figure 5.1.
As with all development boards, the amount of software support that accompanies
the board is just as important as the actual features of the board itself. The HOT II devel-
opment board comes with support for the HOT II PCI interface, both target and initiator,
drivers for Windows 95, 98 & NT, VCC's HOT Run-Time-Reconfiguration program-
ming tools, C++ libraries and API files [110,111]. The package does not, however,
include design entry and implementation software for entering and mapping the FPGA,
nor does it include a C++ compiler. If there is a need to alter the HOT II PCI interface,
the source files and license for the LogiCORE PCI32 interface can be obtained from Xil-
inx Incorporated [53]. This level of support is adequate to allow developers to use the
tools of their choice, but it also leaves them without a complete solution for their develop-
ment needs. The software development platform used to conduct the research is the one
suggested by the board manufacturers. These tools include:
• Synopsys FPGA express v3.2 for HDL design entry,
• Xilinx Foundation Standard Express software for generating the FPGA
mapping from the design, and
• Microsoft Visual C++ for writing software to implement the software parti-
tion.
This architecture was chosen for this research because of its simple layout, the dis-
tinct memory system available for use by the FPGA, and the standard interface connec-
tor. There are no special hardware connectors or additional devices provided that could
cause interference with the results. This development card can be easily installed into
most current desktop workstations. The distinct memory system allows an investigation
into determining the communication requirements between the hardware and software
components. Since FPGAs are not typically found in desktop workstations, having a
shared memory region between the host processor and the FPGA is perceived as not
being typical either. With the growth of FPGAs this is anticipated to change, but currently the distinct memory systems must be addressed.
Figure 5.1 Hot-II development board architecture.
5.3 Hardware Design
To simply implement a portion of the virtual machine in hardware does not guaran-
tee that the performance will increase. The implementation must also leverage the charac-
teristics of the hardware environment to further increase the performance. One such
characteristic is the parallel nature of hardware. With the appropriate division of the hard-
ware partition into smaller parallel tasks, the hardware can contribute to a significant
increase in performance. With the partitioning scheme that was discussed in the previous
chapter, the hardware partition contains the instruction level fetch, decode, and execute
stages. These stages are traditionally implemented in parallel in hardware architectures,
with each stage forwarding its result to the next. In software, these stages are executed
sequentially. By implementing these stages as a pipeline, a theoretical three-fold performance increase is attainable. In this manner, the power of the hardware is exploited.
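As a rough illustration of the three-fold bound, the cycle counts for sequential versus pipelined execution of the stages can be sketched in C. The stage and instruction counts here are illustrative assumptions, not measurements from the design:

```c
#include <assert.h>

/* Cycles to run n instructions through s stages, each taking one cycle,
 * when the stages execute one after another (the software case).       */
static long sequential_cycles(long n, long s) { return n * s; }

/* With the stages pipelined in hardware, a new instruction completes
 * every cycle once the pipeline is full, after a fill of s - 1 cycles. */
static long pipelined_cycles(long n, long s) { return n + (s - 1); }
```

For 1000 instructions and the three fetch-decode-execute stages this gives 3000 versus 1002 cycles, approaching the three-fold speedup as the instruction count grows.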
In addition to the parallel execution of the “fetch-decode-execute” pipeline, there is
a further performance increase that is inherited due to the dedicated execution environ-
ment. When executing a virtual machine in software, it executes in the capacity of an
application on the host system. This application shares the processor with other applica-
tions in the system. In contrast, the hardware partition provides a dedicated single thread
operating environment. While a portion of the application is executing within the hard-
ware device, it is running in a dedicated environment with less contention than when
sharing the CPU, thus potentially increasing performance.
The remaining sections of this chapter continue with the ideas discussed and apply
them to the example Java virtual machine.
5.4 Java Hardware Design
The Java hardware design is composed of four main units: the host interface, instruc-
tion buffer, execution engine, and data cache controller. The architecture is based upon a
3-stage pipeline that funnels instructions through the first three units respectively. Figure
5.2 shows the interconnections between the components and the pipeline between them.
The pipeline works in the traditional “fetch-decode-execute” method.
The dark connections show the direction in which the instructions travel through the
pipeline, and the dashed connections show control lines that are used to deliver informa-
tion back to its feeding unit. Pipelining is utilized such that preceding units in the design
flow “forward” data to the next component so as to make full use of clock cycles. The feedback
information is used to indicate that the unit requires an adjustment in the next address to
forward. The following subsections discuss each of the units in detail and their purpose in
the architecture.
Figure 5.2 Java hardware architecture
5.4.1 Host Controller
The Host Controller is the central point within the architecture for communication
with both the on-board memory and the host computer. It is involved in retrieving instruc-
tions and data as well as handshaking with the software partition in performing context
switches. Bytecodes are retrieved from memory and are pipelined to the Instruction
Buffer. Any request from the Instruction Buffer to change the address of fetching results
in a delay of execution. The Data Cache Controller only requests data when the current
instruction is waiting for the data, thus any requests from the Data Cache Controller for
data take precedence over the fetching of instructions. As such, a variable delay in pro-
cessing is possible due to the transaction requests of other components. This is necessary
since execution cannot continue until the higher precedence requests, which include data
accesses and stack overflows/underflows, are fulfilled.
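The precedence rule just described can be sketched as a simple arbiter in C. This is an illustrative sketch only; the relative ordering of data-cache traffic above stack spill/fill traffic is an assumption consistent with the text, since both stall execution:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { REQ_NONE, REQ_IFETCH, REQ_STACK, REQ_DATA } mem_request_t;

/* One arbitration step in the Host Controller: data requests occur only
 * when the current instruction is stalled on them, so they outrank the
 * stack spill/fill traffic, which in turn outranks the background
 * fetching of instructions for the Instruction Buffer.                  */
static mem_request_t arbitrate(bool data_req, bool stack_req, bool fetch_req)
{
    if (data_req)  return REQ_DATA;
    if (stack_req) return REQ_STACK;
    if (fetch_req) return REQ_IFETCH;
    return REQ_NONE;
}
```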
5.4.2 Instruction Buffer
The Instruction Buffer acts as both an instruction cache and a decoder for instruc-
tions. The cache can be of variable size, primarily determined by the amount of
design space that is available. A sufficiently large cache is preferable, as it provides a
higher probability that upon executing a branch instruction, the next instruction will
already be in the cache and not require a delay as the instruction is retrieved from on-
board memory. There is no real disadvantage to having a large cache, other than the area
required to support it. Since the target environment is an FPGA with a fixed area, using
the remaining area available is not costly. When the cache has room for more instructions,
or the instruction required by the Execution Engine is not located in the cache, then the
Instruction Buffer requests the next instruction from the Host Controller.
Instruction decoding is performed to align the instructions before passing to the
Execution Engine. The Java virtual machine uses variable-length instructions, which
arrive from software packed together to reduce memory
usage. In the current target environment the memory available on-board is rather low
(4Mb). Therefore, it is better for the co-design to perform the decoding and padding in
hardware rather than software in order to utilize the memory to the fullest and reduce
communication.
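A minimal sketch of this decode-and-pad step follows, assuming a hypothetical fixed four-byte decoded slot. The slot width is an invention for this example; the operand counts for the few opcodes shown are taken from the real JVM instruction set (e.g. bipush carries one operand byte, sipush two):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Operand-byte counts for a few real JVM opcodes (illustrative subset). */
static int operand_bytes(uint8_t opcode)
{
    switch (opcode) {
    case 0x10: return 1;   /* bipush: one immediate byte  */
    case 0x11: return 2;   /* sipush: two immediate bytes */
    default:   return 0;   /* e.g. iadd (0x60) and others */
    }
}

/* Unpack one packed, variable-length instruction into a hypothetical
 * fixed four-byte slot, zero-padding unused operand positions, and
 * return the number of code bytes consumed.                           */
static int decode_align(const uint8_t *code, uint8_t slot[4])
{
    int n = operand_bytes(code[0]);
    memset(slot, 0, 4);
    memcpy(slot, code, (size_t)(1 + n));
    return 1 + n;
}
```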
The Instruction Buffer decodes the instruction for the Execution Engine to execute
next and pipelines the instruction through. The Instruction Buffer does not have branch
prediction logic and as such simply feeds the next sequential instruction. In the event that
a branch occurs, the Execution Engine ignores the incorrect pipelined instruction and sig-
nals the Instruction Buffer for the correct execution location. When ready, the Instruction
Buffer pipelines the correct instruction. This may take a variable amount of time depend-
ing on cache hits and misses. In the event of a cache miss, the Instruction Cache is
cleared and it starts re-filling at the new branch address.
5.4.3 Execution Engine
The Execution Engine receives instructions from the Instruction Buffer and exe-
cutes the instruction. To assist in the execution, the Execution Engine has a hardware
stack cache of 64 entries that contains the top elements on the stack. As the stack underflows/overflows, stack elements are exchanged with the Host Controller, which manages
the complete stack in the RAM on the FPGA card. This on-demand loading and storing of
the stack prevents unnecessary loading of the stack when context switching to
hardware, when execution may return back to software quickly. It is not seen as a perfor-
mance penalty when the stack elements are required and execution is stalled, since execu-
tion will have to be delayed in either situation. This approach protects against situations
where execution may return to software before the stack elements are required.
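The on-demand spill/fill behavior of the 64-entry stack cache can be sketched as follows. The backing-RAM size and the one-element-at-a-time transfer policy are illustrative assumptions, not details taken from the design:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 64
#define RAM_SLOTS   1024   /* illustrative backing-store size */

/* 64-entry stack cache holding the top of the stack; elements move
 * to/from the Host Controller's RAM copy only on overflow/underflow. */
typedef struct {
    int32_t cache[CACHE_SLOTS];
    int     top;                /* entries currently cached     */
    int32_t ram[RAM_SLOTS];     /* spilled portion of the stack */
    int     ram_top;
} stack_cache_t;

static void sc_push(stack_cache_t *s, int32_t v)
{
    if (s->top == CACHE_SLOTS) {          /* overflow: spill bottom entry */
        s->ram[s->ram_top++] = s->cache[0];
        memmove(s->cache, s->cache + 1,
                (CACHE_SLOTS - 1) * sizeof(int32_t));
        s->top--;
    }
    s->cache[s->top++] = v;
}

static int32_t sc_pop(stack_cache_t *s)
{
    if (s->top == 0)                      /* underflow: fill on demand */
        s->cache[s->top++] = s->ram[--s->ram_top];
    return s->cache[--s->top];
}
```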
The Data Cache Controller is used to fetch/store data from/to the execution frame's
local variables. Data that is required in the Execution Engine from the constant pool is
received directly through the Host Controller. This is done since the constant pool will
never be updated in the hardware design. Thus, instructions that require accessing the
constant pool or local variables take a performance hit. This is unavoidable due to the
potential size of both the constant pool and local variables. The Java virtual machine
avoids taking this hit by loading data onto the stack, performing its manipulations there,
and only pushing the data back out to memory when completed.
5.4.4 Data Cache Controller
The Data Cache Controller is responsible for interacting with the memory for both
loading and storing data from the local variables when required. Ideally it contains a
cache that buffers data to reduce the number of transactions with the on-board memory.
This cache is a “write through” architecture that writes cache data to the RAM immedi-
ately upon writing to the cache. This prevents the cache from having to be flushed when execu-
tion returns to software. In the event that no design space is available to house the cache,
the size of the cache can be zero, which results in a memory transaction every time the
Execution Engine makes a request. The effects of this on performance are dependent
upon the difference in penalties for accessing the on-board RAM instead of the local
cache.
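A sketch of the write-through local-variable cache follows, including the zero-size fallback in which every request becomes an on-board RAM transaction. The direct-mapped organization and line count are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RAM_WORDS  256   /* illustrative local-variable store size */
#define MAX_LINES  8

/* Write-through cache: every write is sent to RAM immediately, so
 * nothing needs flushing when execution returns to software. When
 * size is 0 (no design space), every access is a RAM transaction.  */
typedef struct {
    int32_t ram[RAM_WORDS];
    int32_t data[MAX_LINES];
    int     tag[MAX_LINES];
    int     valid[MAX_LINES];
    int     size;               /* number of lines, may be zero */
    int     ram_ops;            /* RAM transactions performed   */
} dcache_t;

static void dc_write(dcache_t *c, int a, int32_t v)
{
    c->ram[a] = v;              /* write through, unconditionally */
    c->ram_ops++;
    if (c->size > 0) {
        int i = a % c->size;    /* direct-mapped, for illustration */
        c->data[i] = v; c->tag[i] = a; c->valid[i] = 1;
    }
}

static int32_t dc_read(dcache_t *c, int a)
{
    if (c->size > 0) {
        int i = a % c->size;
        if (c->valid[i] && c->tag[i] == a)
            return c->data[i];  /* hit: no RAM transaction */
    }
    c->ram_ops++;               /* miss or no cache: go to RAM */
    return c->ram[a];
}
```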
5.5 Design Characteristics
The architecture possesses various features that allow it to be a successful design in
the resource restricted target environment. The design is based on a native stack and a
pipelined architecture. It also attempts to be active, assisting the software in the transfer
of data between software and hardware. This is achieved by the loading of the data stack
in hardware on demand. This avoids the situations where the data stack is loaded, only to
have execution switch back to software. This is an expensive penalty since the stack
needs to be stored back to memory shared with software. The Data Cache Controller is
designed to push any cached data element back to memory shared with software on writ-
ing the element to the data cache. This avoids dumping the entire cache to RAM when
context switching occurs.
The design is also compact and flexible as required. The Instruction Cache and
Data Cache are both configurable and can be resized to accommodate a larger or smaller
memory cache. As expected, the larger the cache the better, up to an optimal size, but this
is a trade-off against area resources. The core pipeline can be altered to remove the data
cache from the design completely. This is possible; however, removing the data cache
forces all local variable accesses to communicate with the external RAM memory. This
drastically affects performance as expected and should only be performed when the
FPGA resources are at a minimum.
The true power of this hardware design for the virtual machine is that it is generic.
It is capable of representing a wide range of architecture paradigms, provided the paradigm
is based on the fetch-decode-execute strategy. This encapsulates a wide range of architec-
tures, most notably both stack and register based. This will potentially allow the same
overall design to be used for many different virtual machine hardware designs. The inter-
nals of each component may be required to change to adapt, but the overall structure can
succeed.
5.5.1 Comparison to picoJava
The picoJava core is the original design of a Java processor based on the virtual
machine, with the goal of creating a native Java computer [95-102,107,75,81]. Thus, a com-
parison of distinguishing features between the design of picoJava and that of the co-
design is appropriate. Through the comparison it is clear that both designs share various
basic characteristics. However, they also differ significantly.
A major difference between the picoJava core and this design is the environment for
which they are targeted. The picoJava core is designed for the purpose of being the sole
processing unit, while the approach described here is intended to complement an already
existing general microprocessor. Thus, the proposed architecture does not require operat-
ing system instructions for support in accessing hardware components such as RAM and
various busses, as well as additional instructions for supporting different languages and
paradigms. An example of this is the ncstore_word instruction, for performing a non-
cacheable store of an integer to memory [101].
With a greater restriction in the design space available, it was deemed suitable to
remove the process of folding instructions from hardware. This process can be relocated
to software. The restriction in design space leaves less area available to implement special instructions that can perform multiple Java virtual machine instructions as a
single instruction. Any special techniques that can be utilized to increase performance
through folding, re-ordering, or re-structuring of instructions are left to be implemented
in software. This is beneficial not only for saving precious design space, but also for per-
forming the operation during class loading, where it needs to take place only once, as
opposed to hardware, where it is done at every encounter of the instructions.
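As an illustration of relocating folding to class-loading time, the following sketch collapses one common stack pattern into a single internal opcode. OP_FOLDED_ADD is invented for this example and is not a real JVM bytecode; the folded pattern itself uses the real opcodes iload_0, iload_1, iadd, and istore_2:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical internal opcode emitted by the class loader. */
#define OP_FOLDED_ADD 0xE0

/* One-pass folding done once at class-loading time: the sequence
 * iload_0 (0x1a), iload_1 (0x1b), iadd (0x60), istore_2 (0x3d) is
 * collapsed into a single internal operation for the hardware.    */
static size_t fold(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        if (i + 3 < n && in[i] == 0x1a && in[i + 1] == 0x1b &&
            in[i + 2] == 0x60 && in[i + 3] == 0x3d) {
            out[o++] = OP_FOLDED_ADD;   /* four bytecodes become one op */
            i += 4;
        } else {
            out[o++] = in[i++];         /* everything else passes through */
        }
    }
    return o;
}
```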
The picoJava core provides flexibility in its design with the caches it uses and
allows for various cache sizes. This design also contains caches for the same purposes,
but it emphasizes flexibility with much smaller sizes. The picoJava specification outlines
the caches as ranging from 16Kb down to 0Kb in size. This hardware design pushes the smaller sizes further, using caches ranging from 1Kb down to no cache at all. The smaller caches are attributable to this hardware design executing only one given method at a time, whereas the picoJava processor must support multitasking of multiple methods and thus requires a larger cache.
Overall, the main difference between the picoJava core and this design lies in the
simplicity and reduced design space of the latter. This is a necessary step for the tar-
geted environment. This can be clearly seen in the simplified pipeline architecture in
comparison with the complex parallel architecture of picoJava with instruction folding.
The differences between the co-design approach and picoJava can also be seen in
the users and applications targeted by each. Both designs can be used for embedded sys-
tems, but the co-design could be better for this context due to its smaller size and tight
coupling with a microprocessor along with other factors. The picoJava design has extra
functionality that may be unnecessary in an embedded system.
5.6 Hardware Simulator Justification
When attempting to prototype a hardware design, the question arises as to whether
the design should be implemented in hardware, or simply simulated in software. There
are many advantages and disadvantages to each approach and each project needs to be
evaluated independently as to which is more appropriate. For this project it was decided
to simulate the hardware architecture. There are several reasons for this decision:
• One reason for simulating this design is the desire to have a flexible inter-
face between hardware and software. The development environment that
was available to this research did not provide the flexibility to have the
hardware architecture access the memory of the host system. To implement
this flexibility would consume efforts that are better used in analyzing the
real problem at hand.
• The same is true for analyzing the requirements of the communication rate
between the partitions. A simulated hardware design allows for greater
investigation into different communication rates between the hardware and
software partitions. This allows one to capture the performance of the
machine as a whole under varying rates.
• Several questions are raised concerning the programmable device and its
suitability for this purpose. Is the FPGA sufficiently large to accommodate
the hardware partition? Is the FPGA fast enough to provide a performance
increase? When targeting a physical environment, the capabilities of the
FPGA are fixed. This can result in the investigation being prevented from
analyzing the different partitions that are outlined in Chapter 4. However,
in a simulation environment, these constraints are lifted. In any case, state-
of-the-art FPGAs such as the Xilinx Virtex-II provide 8 million system
gates and feature an internal 420 MHz clock, which should satisfy the per-
formance requirements for this purpose [53].
• With the hardware architecture designed to be tightly coupled with the soft-
ware partition, it is necessary to analyze the integration between hardware
and software. This requires the ability to easily integrate the two partitions.
This integration is more easily realized using a software simulation due to
its inherent flexibility. Attempting to integrate hardware and software
together often involves dealing with low-level implementation issues, such
as implementing correct binary floating-point support, and not architecture
design issues [54].
• Targeting a specific platform environment such as the one from Virtual
Computer Corporation introduces further complexities into the process.
Technical issues can emerge that affect the overall process and hinder anal-
ysis. For example, the above mentioned development environment con-
tains technical issues with the provided interface and prevents the hardware
partition from signalling the software partition. Technical issues such as
these are not the focus of the work and in a physical environment can result
in loss of time or project failure.
Additionally, there are the normal benefits of simulating over implementing that include
the following:
• Lower costs, as simulating requires no special hardware.
• Better software support, as support in software is more dynamic and exten-
sive than in hardware.
• Fewer environment quirks. Software allows a generic environment, whereas
a hardware implementation requires the design to accommodate its quirks.
• Faster development time, as typically software development is faster than
hardware implementation.
Overall, this flexibility allows for design space exploration which is crucial to this
research process.
Various simulation environments already exist for simulating hardware designs.
Unfortunately, there are two major factors that suggested using a custom simulator. The
first is the complexity of the software component. The software component in this co-
designed system is very intricate and relies upon certain functionalities available through
the host operating system, namely scheduling and memory management. Running the co-
designed software in an encapsulated simulator would not provide realistic results. Sec-
ondly, the tight integration between the software and hardware components requires the
intricate integration of the software components with the hardware simulator. With the
low level dependency between the hardware and software partitions, it is unclear if the
available simulators would support and allow investigation of this communication. For
these reasons, it was decided to build a custom simulator in software.
The next section will discuss some of the techniques that were used in simulating
the hardware design to ensure accuracy and hardware portability in simulation.
5.7 Software Simulator
This section discusses various techniques used to implement the software simulator
of this hardware design. Each of these techniques is a step towards not only achieving
correct simulation timing at the clock level, but also towards easing a later hardware
implementation. At the end of this exercise it is anticipated that a specification of the
design for synthesis can be developed from the specification used in the simulator.
5.7.1 Simulator Goals
Overall, the simulator’s purpose is to give an indication of the potential perfor-
mance of the hardware design described in section 5.4. There are many different reasons
for simulation, but this simulator is intended to give an indication of the possible perfor-
mance so an assessment can be made of the feasibility of using a co-designed virtual
machine. To accomplish this there are several smaller goals that the simulator must strive
to achieve to provide an accurate indication. For this simulation these goals are:
• To model the pipeline stages of fetching, decoding, and executing Java
bytecodes in parallel.
• To model the various data caches that exist in the design and provide flexi-
bility for investigation into the effects of varying sizes.
• To model the communication interface between the hardware design
(FPGA) and the software partition (host processor) through the PCI inter-
face.
• To model the memory available to the hardware design (FPGA) through
the VCC (Virtual Computer Corporation) custom interface [110].
• To model the interface between the hardware design and the memory sub-
system that is available on the host workstation.
• To model the different execution stages of each instruction that is sup-
ported by the hardware design.
• To provide a reasonably fast simulation of the hardware design.
• To provide an accurate simulation of the hardware design.
To best achieve these goals, it is suitable for the simulator to leverage known char-
acteristics of existing hardware components. Likewise, it is desirable for the simulator to
be based upon a specification language that is synthesizable into a hardware implementa-
tion. The next section describes different assumptions that the simulator used to provide
an accurate measurement of execution time.
5.7.2 Simulator Design Overview
The most critical decision to be made in the simulation is the specification lan-
guage to use in describing the hardware design. For this it was decided to use the C pro-
gramming language. Primarily this decision is based on the implementation language of
the software partition, and C also allows easy integration with software. Having the simu-
lator easily integrate with the software partition allows the research to examine full appli-
cations and the interactions between the partitions during execution. Additionally, the C
language provides similar programming constructs to VHDL, a common specification
language for hardware, and support for low-level bit operations. It is also beneficial that it
is a language that can generate fast binaries in comparison to other languages. This leads
to a fast simulation environment that can be used to examine full applications. C also pro-
vides support that can be used for exploration without requiring vast amounts of develop-
ment time. An example of this is support for floating-point operations.
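As a sketch of how such constructs can be mirrored in C, VHDL's deferred signal-assignment semantics can be approximated by keeping a current and a driven value per signal and committing all signals together at the simulated clock edge. The names and the slice helper here are illustrative, not taken from the simulator's source:

```c
#include <assert.h>
#include <stdint.h>

/* VHDL-style signal: an assignment takes effect only at the next
 * clock edge, so each signal keeps a current and a driven value. */
typedef struct { uint32_t cur, nxt; } signal_t;

static void sig_drive(signal_t *s, uint32_t v) { s->nxt = v; }

/* Commit all driven values at once, mimicking VHDL's update rule
 * in which no assignment is visible until the edge.              */
static void clock_edge(signal_t *sigs, int n)
{
    for (int i = 0; i < n; i++)
        sigs[i].cur = sigs[i].nxt;
}

/* Bit-level stand-in for a VHDL slice such as word(15 downto 8). */
static uint32_t bits(uint32_t w, int hi, int lo)
{
    return (w >> lo) & ((1u << (hi - lo + 1)) - 1u);
}
```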
It was decided to base the simulation on the VHDL behavioral model [7,37]. Using
the VHDL model is a justifiable decision since the support for the development environ-
ment described in subsection 5.2.1 is provided in VHDL. Limiting the usage of C in the
implementation to only the subset of constructs that are supported by VHDL can contrib-
ute towards a later effort of converting the specification to VHDL if deemed desirable.
Some additional effort is necessary to provide support for VHDL constructs that are not
directly available in C; some of these issues are addressed later in the dissertation.
The simulator performs a time-driven simulation of the hardware design for the
Java virtual machine. In this simulation, each of the different components in the design
executes for one clock cycle and then interchanges signals that relay information between
the components. Each of the different components in the hardware design is either imple-
mented as a custom defined component, or modelled using some other existing compo-
nents that are available within the development environment described in subsection
5.2.1. The following Figure 5.3 depicts the components that are modeled by the simula-
tor. Components that are shaded in the diagram are modeled after existing components
within the development environment presented at the beginning of the chapter.
The remaining components are custom defined for the simulator’s purpose.
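The cycle loop just described can be sketched as follows, with each component implemented as a function that reads current signal values and drives next values, and all signals exchanged after every unit has run for the cycle. The component and wire types are illustrative, not taken from the simulator's actual source:

```c
#include <assert.h>

/* A wire carries the value visible this cycle and the value driven
 * for the next cycle; all wires commit together at the clock edge. */
typedef struct { int cur, nxt; } wire_t;

typedef void (*component_fn)(wire_t *w);

static void commit(wire_t *w, int nw)
{
    for (int i = 0; i < nw; i++)
        w[i].cur = w[i].nxt;
}

/* Run every component for one clock cycle, then interchange signals. */
static void simulate(component_fn *comp, int nc, wire_t *w, int nw, long cycles)
{
    for (long c = 0; c < cycles; c++) {
        for (int i = 0; i < nc; i++)
            comp[i](w);      /* all units see the same pre-edge values */
        commit(w, nw);
    }
}

/* Example component: a counter driving wire 0. */
static void counter(wire_t *w) { w[0].nxt = w[0].cur + 1; }
```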
    current = new Point(base);
    for (xpixel = 0; xpixel < canvas.GetWidth(); xpixel++) {
        if (xpixel >= xstart && xpixel < xend) {
            Color color = new Color(0.0f, 0.0f, 0.0f);
            eyeRay.GetDirection().Sub(current, eyeRay.GetOrigin());
            eyeRay.GetDirection().Normalize();
            eyeRay.SetID(RayID++);
            Shade(octree, eyeRay, color, 1.0f, 0, 0);
            canvas.Write(Brightness, xpixel, ypixel, color);
        }
        current.Add(horIncr);
    }
    base.Add(vertIncr);
  }
}
Figure 7.9 Critical section of Raytrace application.
CHAPTER 8
Conclusions Chapter 8
8.1 Summary
The prominence of the internet and networked computing has driven research
efforts into providing support for homogeneous computing. This has been exemplified by
the current research into virtual machines, a case in point being the Java virtual machine.
Unfortunately, it has long been accepted that with virtual computing platforms and the
ability to “write once, run anywhere” comes the penalty of performance. This disserta-
tion presents a new hardware/software co-design approach for providing virtual comput-
ing platforms through the use of reconfigurable computing devices. This novel approach
promotes the philosophy that user applications remain portable, while achieving a per-
formance increase.
Chapters three through five discuss specifically how the co-design process can be
applied to the class of virtual machines in a structured approach. This replaces the
instance specific techniques that are often used within each of the co-design stages. The
dissertation demonstrates that a structured partitioning, hardware, software, and interface
design approach can result in a successful co-designed virtual machine. Novel ideas in these
approaches are presented including the overlapping of hardware and software partitions, a
generic hardware design, and algorithms for controlling execution location at run-time.
Simulation showed that under ideal conditions for certain benchmarks it attained as high
as a nine-fold performance increase. The effects of the physical environment on this per-
formance are also included. Specifics such as the required size and speed of the program-
mable device, the memory system, and the communication are addressed. This results in
the description of the requirements needed for this approach to be successful. These
include:
• An FPGA large enough to support the Full partitioning scheme described.
• An FPGA that performs at least within a 1:5 speed ratio of the general-
purpose processor, which can provide significant performance gains.
• A memory system that does not need to be extremely large, as studies
showed 1 Mb will suffice, but must be accessible by both processing
devices and capable of operating at a high rate, fewer than 50 clock cycles
per access.
• A fast communication bus between the FPGA and the general purpose pro-
cessor. It was shown that the communication penalty is too great unless the
hardware component executes for approximately 8300 cycles.
The following sections outline the major contributions of this research and dis-
cusses some of the future work that can branch from the research presented in this disser-
tation.
8.2 Contributions
This dissertation has introduced and addressed the original concept of using hard-
ware/software co-design as a means for providing virtual machine platforms. Specifi-
cally it described a new approach that extends the generally accepted co-design process
for all systems. Stages in the co-design process such as partitioning, design of both the
hardware and software components, and the inherent interface between them were
described.
The new contributions succeed in linking the general co-design process that is well
established, and described in chapter three, with the specific co-design of virtual
machines. The dissertation discusses specific techniques for each of the various stages of the general co-
design process, including:
• Partitioning. The dissertation presents the partitioning strategy of dividing
the functionalities between the hardware architecture and the operating
system. While this is only one of many possible partitioning strategies, it is
extendable to other virtual machines, and was demonstrated to contribute
to a successful co-design for the example Java virtual machine. Additionally, the novel idea of having the hardware and software partitions overlap
is introduced. This is different from traditional co-design systems where
the partitions are disjoint. This approach is shown to be beneficial for
allowing the virtual machine to determine the partition at run-time where
execution will occur.
• Hardware design. A generic hardware design is presented that can increase
performance based upon parallelizing the fetch-decode-execute execution
cycle found in virtual machines. This design can be used as a starting point
for all virtual machines in designing the hardware component. The design
is flexible and allows for many different types of hardware architectures.
• Software design. The software design incorporates a handle into the hard-
ware component to allow for the off-loading of tasks into the faster hard-
ware. It also includes the functionality to determine the run-time
scheduling. Three algorithms were presented and both the benefits and
importance of dynamic run-time scheduling are discussed. This is new for
co-designed systems and presents a new perspective for co-designing vir-
tual machines.
These contributions were applied to the ubiquitous Java virtual machine and simu-
lated for insights into the potential benefits and drawbacks of co-design for this area. This
entailed partitioning of the Java virtual machine instruction set between hardware and
software following the previously proposed process characteristics. The two main chal-
lenges were designing the hardware component to provide the functionality of the parti-
tioning while being aware of possible design space shortages, and then designing the
software component to identify suitable conditions under which switching execution from
hardware to software is worthwhile dynamically at run-time.
Through simulation, many valuable characteristics of the co-designed virtual
machine were revealed. It was seen that overall performance in the co-designed system
may provide an increase over a software only implementation under well-defined con-
straints. It was shown, however, that the performance increase was only seen under cer-
tain underlying architectural conditions. Factors such as: the communication rate between
the hardware and software components; the size and speed of the physical hardware com-
puting device; and the design and size of the memory subsystem specifically affected the
performance. Each of these factors was investigated separately to identify the threshold
levels of each and the minimum support required to obtain a performance increase.
Finally, an ideal, yet currently technically achievable, architecture was proposed and
will be expanded upon in the future.
This work has demonstrated the potential for extended use of reconfigurable devices.
It has also derived some of the required performance capabilities and support needed for
reconfigurable computing to be capable of supporting co-designed virtual machines. As
such, reconfigurable computing can be used for more general computing, not just for very
specific problem instances.
8.3 Future Work
This dissertation has introduced the use of hardware/software co-design and recon-
figurable computing for use as a general computing platform. There is considerable work
that can be extended from the results presented, including obvious extensions that can be
followed such as applying the hardware/software co-design contributions to other virtual
machine platforms. There are, however, three other distinctive streams that further work
can follow and for which the current research provides a strong and valuable start.
First, further investigation can be carried out towards targeting applications for this
computing platform. It was demonstrated that the underlying characteristics of applica-
tions affect the performance. This is true regardless of the underlying virtual machine
implementation. Much research has been done to massage applications for improved per-
formance with just-in-time compilation technologies or for dedicated hardware execu-
tion, such as picoJava, or through advanced topics such as bytecode re-ordering and
branch instruction prediction. While some of this research can be carried over to improve performance within a co-designed virtual machine, there are unique features of the co-design platform that must also be investigated further for exploitation. One example is to investigate context switching further, to find better techniques that can be applied at compile-time. Another is to examine selective bytecode usage, replacing bytecode instructions with other instructions, or combinations thereof, that are more desirable for execution in hardware.
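The bytecode-replacement idea above amounts to a peephole pass over the instruction stream. A minimal sketch follows; the opcode mnemonics and fused combinations here are hypothetical illustrations, not part of the actual co-designed virtual machine described in this work:

```python
# Hypothetical peephole pass: fuse bytecode pairs that would be awkward to
# execute individually in hardware into single combined instructions.
# The mnemonics and the COMBINED table are illustrative assumptions only.

COMBINED = {
    ("iload", "iadd"): "iload_iadd",      # fused load-then-add
    ("iconst", "istore"): "iconst_istore",  # fused constant-then-store
}

def combine_bytecodes(stream):
    """Scan a list of opcode mnemonics and fuse known adjacent pairs."""
    out = []
    i = 0
    while i < len(stream):
        pair = tuple(stream[i:i + 2])
        if pair in COMBINED:
            out.append(COMBINED[pair])
            i += 2  # consume both instructions of the fused pair
        else:
            out.append(stream[i])
            i += 1
    return out

print(combine_bytecodes(["iload", "iadd", "ireturn"]))
# -> ['iload_iadd', 'ireturn']
```

A real pass would also have to respect operands, branch targets, and exception ranges; the sketch only conveys the pattern-matching structure of the idea.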
A second area for further research concerns the effects of parallelism. The research
to date has focused on the performance increases attained because of dedicated hardware
support over software interpretation. The work did not address the increases available
because of parallel execution. This is primarily due to the complexities involved with
simulating parallel execution. With the added hardware support, it is possible to have
both the dedicated hardware and host system processor working in parallel on executing
the application. This research is of considerable value in the event the specific virtual
machine being targeted is multi-threaded. Under this condition, additional performance
increases can be envisaged.
Finally, this work has identified characteristics of the underlying hardware architec-
ture that would promote hardware/software co-design for use in providing virtual
machines. Traditionally, reconfigurable computing has been used for the embedded sys-
tems market and/or very constrained instance specific problems. Thus the architectural
environments available today are not entirely suitable for co-designed virtual machines.
Knowing the desirable features, the development of a suitable architectural environment would be invaluable. Such a development environment would not only provide a suitable platform for validation and verification of co-designed virtual machines, but would also ignite further research in the area.
APPENDIX A
Java Virtual Machine Bytecode Statistics
This appendix presents several quantitative execution measurements of the various
Java bytecodes within each of five different benchmarks. The bytecode size is the number
of bytes that comprise the opcode and operands. The execution time is the average number of clock cycles required to execute each of the instruction instances in the benchmarks1. For instructions that have a quick version, the time given is that required to perform the necessary class loading; it is at this point that the quick version of the instruction is invoked.
These numbers are affected by external operating system events and other anomalies during execution; therefore, these execution times should only be considered approximations. The data traffic is the number of memory accesses necessary for execution, broken into local and remote accesses. For this purpose, local refers to accesses to the execution stack and the local variables, while all other accesses are considered remote. Finally, the frequency is the number of times each instruction is encountered during execution for each benchmark.
1. Clock cycles are measured in relation to a Pentium III 750 MHz processor running Windows 2000. These results are acquired by capturing the processor's time-stamp counter.
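The frequency measurement described above amounts to tallying each opcode as it is interpreted. A minimal sketch, assuming the executed bytecodes are available as a list of mnemonics (the trace below is illustrative, not taken from the benchmarks):

```python
from collections import Counter

def bytecode_frequencies(trace):
    """Count how often each bytecode mnemonic appears in an execution trace."""
    return Counter(trace)

# Hypothetical trace fragment for illustration only.
trace = ["iload_0", "iload_1", "iadd", "istore_2", "iload_0", "iadd"]
print(bytecode_frequencies(trace).most_common(2))
# -> [('iload_0', 2), ('iadd', 2)]
```

In an interpreter, the same tally would be updated inside the dispatch loop rather than over a recorded trace, keeping the measurement overhead to a single counter increment per instruction.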
[2] Aoki, Takashi, and Eto, Takeshi. On the Software Virtual Machine for the Real Hardware Stack Machine. USENIX Java Virtual Machine Research and Technology Symposium, April 2001.
[3] Arm Ltd. Jazelle - ARM Architecture Extensions for Java Applications. http://www.arm.com/armtech/jazelle, Arm Ltd., September 2001.
[4] Arm Ltd. ARM - Jazelle Technology. http://www.arm.com/armtech/jazelle, Arm Ltd., September 2001.
[5] Arnold, Ken, and Gosling, James. The Java Programming Language (2nd edition). Addison-Wesley, 1997.
[6] Atherton, Robert J. Moving Java to the Factory. IEEE Spectrum, pp. 18 - 23, December 1998.
[7] Ashenden, Peter J. The Designer's Guide to VHDL. Morgan Kaufmann Publishers, 1996.
[8] Aurora VLSI Inc. DeCaf - Summary. http://www.auroravlsi.com/website/DeCaf_summary.html, Aurora VLSI Inc., September 2001.
[9] Aurora VLSI Inc. Espresso - Datasheet. http://www.auroravlsi.com/website/Espresso_datasheet.html, Aurora VLSI Inc., September 2001.
[10] Aurora VLSI Inc. Espresso - Summary. http://www.auroravlsi.com/website/Espresso_summary.html, Aurora VLSI Inc., September 2001.
[11] Awalt, R. K. Making the ASIC/FPGA Decision. Integrated System Design Magazine, July 1999.
[12] Bass, M. J. and Christensen, C. M. The Future of the Microprocessor Business. IEEE Spectrum, pp. 34 - 39, April 2002.
[13] Benveniste, A. and Berry, G. The Synchronous Approach to Reactive and Real-Time Systems. IEEE Proceedings, Vol. 79, No. 9, pp. 1270 - 1282, September 1991.
[14] Berge, J. M., Levia, O. and Rouillard, J. (eds). Hardware/Software Co-Design and Co-Verification. Kluwer Academic Publishers, 1997.
[15] Bingham, J. and Serra, M. Solving Hamiltonian Cycle on FPGA Technology via Instance to Circuit Mappings. Workshop on Engineering of Reconfigurable Hardware/Software Objects, PDPTA, June 2000.
[16] Brown, Stephen D., Francis, Robert J., Rose, Jonathan, and Vranesic, Zvonko G. Field-Programmable Gate Arrays. Kluwer Academic Publishers, 1992.
[17] Burton, Kevin. .NET Common Language Runtime Unleashed. Sams Publishers, March 2002.
[18] Cardoso, J. M. P. and Neto, H. C. Macro-Based Hardware Compilation of Java Bytecodes into a Dynamic Reconfigurable Computing System. IEEE Symposium on Field-Programmable Custom Computing Machines, April 1999.
[19] Case, Brian. Implementing the Java Virtual Machine. Microprocessor Report, pp. 12 - 17, March 25, 1996.
[20] Case, Brian. Java Virtual Machine Should Stay Virtual. Microprocessor Report, pp. 14 - 15, April 15, 1996.
[21] Case, Brian. Java Performance Advancing Rapidly. Microprocessor Report, pp. 17 - 19, May 27, 1996.
[23] Compton, K., and Hauck, S. Reconfigurable Computing: A Survey of Systems and Software. ACM Computing Surveys, Vol. 34, No. 2, pp. 171 - 210, June 2002.
[24] Cornell, G. and Horstmann, C. S. Core Java. SunSoft Press, 1996.
[25] De Micheli, G. and Sami, M. (eds.) Hardware/Software Co-Design. Kluwer Academic Publishers, pp. 1 - 28, 1996.
[26] De Micheli, G., Ernst, R. and Wolf, W. Readings in Hardware/Software Co-Design. Morgan Kaufmann Publishers, 2002.
[27] Dey, S. et al. Using a Soft Core in a SoC Design: Experiences with picoJava. IEEE Design & Test, pp. 60 - 71, July-Sept 2000.
[28] Dorf, R. C. Field Programmable Gate Arrays: Reconfigurable Logic for Rapid Prototyping and Implementation of Digital Systems. Wiley & Sons Inc., 1995.
[29] El-Kharashi, M. W. and ElGuibaly, F. Java Microprocessors: Computer Architecture Implications. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM 1997), pp. 277 - 280, August 1997.
[30] El-Kharashi, M. W., ElGuibaly, F., and Li, K. F. A New Methodology for Stack Operations Folding for Java Microprocessors. High Performance Computing Systems and Applications, chapter 11, pp. 149 - 160, Kluwer Academic Publishers, 2000.
[31] El-Kharashi, M. W., ElGuibaly, F., and Li, K. F. A Novel Approach for Stack Operations Folding for Java Processors. IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pp. 104 - 107, September 2000.
[32] El-Kharashi, M. W., ElGuibaly, F., and Li, K. F. An Operand Extraction-Based Stack Folding Algorithm for Java Processors. International Conference on Computer Design, pp. 22 - 26, September 2000.
[33] El-Kharashi, M. W., ElGuibaly, F., and Li, K. F. Quantitative Analysis for Java Microprocessor Architectural Requirements: Instruction Set Design. International Conference on Computer Design, pp. 50 - 54, October 1999.
[34] El-Kharashi, M. W., ElGuibaly, F., and Li, K. F. A Quantitative Study for Java Microprocessor Architectural Requirements. Part I: Instruction Set Design. Microprocessors and Microsystems, pp. 225 - 236, September 2000.
[35] El-Kharashi, M. W., ElGuibaly, F., and Li, K. F. A Quantitative Study for Java Microprocessor Architectural Requirements. Part II: High-Level Language Support. Microprocessors and Microsystems, pp. 237 - 250, September 2000.
[36] Engel, Joshua. Programming for the Java Virtual Machine. Addison-Wesley, 1999.
[37] Gajski, D., Vahid, F., Narayan, S., and Gong, J. Specification and Design of Embedded Systems. Prentice-Hall Inc., 1994.
[38] Gajski, D., Zhu, J., Domer, R., Gerstlauer, A., and Zhao, S. SpecC: Specification Language and Methodology. Kluwer Academic Publishers, 2000.
[39] Goldberg, A. and Robson, D. Smalltalk-80: The Language and its Implementation. Addison-Wesley, 1983.
[40] Gosling, J. The Feel of Java. IEEE Computer, pp. 53 - 57, June 1997.
[41] Gu, W., Burns, N. A., Collins, M. T., and Wong, W. Y. P. The Evolution of a High-Performing Java Virtual Machine. IBM Systems Journal, Vol. 39, No. 1, pp. 135 - 150, 2000.
[42] Gupta, Rajesh Kumar. Co-Synthesis of Hardware and Software for Digital Embedded Systems. Kluwer Academic Publishers, 1995.
[43] Halfhill, T. R. How to Soup up Java (Part I). BYTE, pp. 60 - 74, May 1998.
[44] Harel, D., Pnueli, A., Schmidt, J., and Sherman, R. Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming, No. 8, pp. 231 - 274, 1987.
[45] Hauck, S. The Future of Reconfigurable Systems. Keynote Address, 5th Canadian Conference on Field Programmable Devices, Montreal, June 1998.
[46] Henkel, Joerg, and Hu, Xiaobo (general co-chairs). Tenth International Symposium on Hardware/Software Codesign. ACM Press, 2002.
[47] http://ptolemy.eecs.berkeley.edu/. August 2002.
[48] http://www.altera.com. August 2002.
[49] http://www.spec.org/osg/jvm98. November 1997.
[50] http://www.systemc.org. August 2002.
[51] http://www.threedee.com/jcm/psystem/index.html. July 2002.
[52] http://www.webopedia.com/TERM/v/virtual_machine.html. July 2002.
[53] http://www.xilinx.com. March 2003.
[54] IEEE. IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-1985, IEEE, 1985.
[56] Intel Corp. Intel Architecture Software Developer's Manual, Volume II: Instruction Set Reference Manual. http://www.intel.com/design/pentiumii/manuals/243191.htm, February 2003.
[57] Internet Software Consortium. Internet Domain Survey, January 2002. http://www.isc.org/ds/hosts.html, July 2002.
[58] inSilicon Inc. http://www.insilicon.com/products/images/jvxtreme.pdf, inSilicon Inc., September 2001.
[59] Ito, S. A., Carro, L., and Jacobi, R. P. Making Java Work for Microcontroller Applications. IEEE Design and Test of Computers, pp. 100 - 110, Sept-Oct 2001.
[60] Kent, Kenneth B. and Serra, Micaela. Context Switching in a Hardware/Software Co-Design of the Java Virtual Machine. Designer's Forum of Design Automation & Test in Europe (DATE) 2002, pp. 81 - 86, March 2002.
[61] Kent, Kenneth B., and Serra, Micaela. Hardware Architecture for Java in a Hardware/Software Co-Design of the Virtual Machine. Euromicro Symposium on Digital System Design (DSD) 2002, pp. 20 - 27, September 4-6, 2002.
[62] Kent, Kenneth B. and Serra, Micaela. Hardware/Software Co-Design of a Java Virtual Machine. International Workshop on Rapid System Prototyping, pp. 66 - 71, June 2000.
[63] Kent, Kenneth B., and Serra, Micaela. Reconfigurable Architecture Requirements for Co-Designing Virtual Machines. Reconfigurable Architectures Workshop (RAW), part of International Parallel and Distributed Processing Symposium (IPDPS) 2003, April 2003.
[64] Kimura, S., Yukishita, M., Itou, Y., Nagoya, A., Hirao, M., and Watanabe, K. A Hardware/Software Codesign Method for a General Purpose Reconfigurable Co-Processor. IEEE 5th International Workshop on Hardware/Software Co-Design, pp. 147 - 151, March 1997.
[65] Kreuzinger, J., Zulauf, A., Schulz, A., Ungerer, T., Pfeffer, M., Brinkschulte, U. and Krakowski, C. Performance Evaluations and Chip-Space Requirements of a Multithreaded Java Microcontroller. Second Annual Workshop on Hardware Support for Objects and Microarchitectures for Java (ICCD '00), September 2000.
[66] Ku, D. and De Micheli, G. HardwareC - A Language for Hardware Design (version 2.0). CSL Technical Report CSL-TR-90-419, Stanford University, April 1990.
[67] Kumar, S., Aylor, J., Johnson, B., and Wulf, W. The Codesign of Embedded Systems: A Unified Hardware/Software Representation. Kluwer Academic Publishers, 1996.
[69] Lee, Burton H. Embedded Internet Systems: Poised for Takeoff. IEEE Internet Computing, pp. 24 - 29, May-June 1998.
[70] Lee, Edward. What's Ahead for Embedded Software. IEEE Computer, pp. 18 - 26, September 2000.
[71] Lentczner, Mark. Java's Virtual World. Microprocessor Report, pp. 8 - 17, March 25, 1996.
[72] Lindholm, Tim, and Yellin, Frank. The Java Virtual Machine Specification (2nd edition). Sun Microsystems Inc., 1997.
[73] Madsen, Jan (general chair). Ninth International Symposium on Hardware/Software Codesign. ACM Press, 2001.
[74] McDowell, Charlie. Challenges to Embedded Java. http://www.cse.ucsc.edu/research/embedded/pubs/challenges.ppt, University of California, Santa Cruz, December 1998.
[75] McGhan, H., and O'Connor, M. PicoJava: A Direct Execution Engine For Java Bytecode. IEEE Computer, pp. 22 - 30, October 1998.
[78] Mulchandani, Deepak. Java for Embedded Systems. IEEE Internet Computing, pp. 30 - 39, May-June 1998.
[79] Nazomi Inc. http://www.nazomi.com/pdf/jstar_arm.pdf, Nazomi Inc., September 2001.
[80] Nazomi Inc. http://www.nazomi.com/pdf/jstar_productbrief.pdf, Nazomi Inc., September 2001.
[81] O'Connor, Michael J. and Tremblay, Marc. picoJava-I: The Java Virtual Machine in Hardware. IEEE Micro, pp. 45 - 53, March-April 1997.
[82] Platzner, M. Reconfigurable Accelerators for Combinatorial Problems. IEEE Computer, pp. 62 - 69, April 2000.
[83] Ploog, H., Rachui, T. and Timmermann, D. Design Issues in the Development of a JAVA-Processor for Small Embedded Applications. FPGA 99, pp. 246, Monterey, California, 1999.
[84] Rincon, F. and Teres, L. Reconfigurable Hardware Systems. International Semiconductor Conference, Vol. 1, pp. 45 - 54, Oct. 1998.
[85] Roman, G., Stucki, M. J., Ball, W. E., and Gillett, W. D. A Total System Design Framework. International Semiconductor Conference, Vol. 1, pp. 45 - 54, Oct. 1998.
[87] Rozenblit, J. and Buchenrieder, K. (eds.) Codesign: Computer-Aided Software/Hardware Engineering. IEEE Press, New York, 1995.
[88] Sánchez, L., Koch, G., Martínez, N., López-Vallejo, M. L., Delgado-Kloos, C., and Rosenstiel, W. Hardware-Software Prototyping from Lotos. Journal of Design Automation for Embedded Systems, Vol. 3, No. 2/3, pp. 117 - 148, March 1998.
[89] Sangiovanni-Vincentelli, A. and Martin, G. Platform-Based Design and Software Design Methodology for Embedded Systems. IEEE Design & Test, pp. 23 - 33, Nov-Dec 2001.
[90] Sansonnet, J., Castan, M., Percebois, C., Botella, D. and Perez, J. Direct Execution of LISP on a List-Directed Architecture. Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS I), pp. 132 - 139, March 1982.
[91] Schlett, Manfred. Trends in Embedded-Microprocessor Design. IEEE Computer, pp. 44 - 49, August 1998.
[92] Schoellkopf, J. PASC-HLL: A High-Level-Language Computer Architecture for Pascal. Proceedings of the International Workshop on High-Level Language Computer Architecture, pp. 222 - 225, May 1980.
[93] Sciuto, Donatella. Guest Editor's Introduction: Design Tools for Embedded Systems. IEEE Design and Test, pp. 11 - 13, April-June 2000.
[94] Suganuma, T., et al. Overview of the IBM Java Just-in-Time Compiler. IBM Systems Journal, Vol. 39, No. 1, pp. 175 - 193, 2000.
[95] Sun Microsystems. The Java Chip Processor: Redefining the Processor Market. Sun Microsystems, November 1997.
[96] Sun Microsystems. picoJava-I: Java Processor Core. Sun Microsystems data sheet, December 1997.
[97] Sun Microsystems. picoJava-I: picoJava-I Core Microprocessor Architecture. Sun Microsystems white paper, October 1996.
[98] Sun Microelectronics. picoJava-I: Sun Microelectronics' picoJava-I Posts Outstanding Performance. Sun Microelectronics white paper, October 1996.
[99] Sun Microsystems. picoJava-II: Java Processor Core. Sun Microsystems data sheet, April 1998.
[100] Sun Microsystems. picoJava-II: Microarchitecture Guide. Sun Microsystems, March 1999.
[101] Sun Microsystems. picoJava-II: Programmer's Reference Manual. Sun Microsystems, March 1999.
[102] Sun Microsystems. picoJava-II: Verification Guide. Sun Microsystems, March 1999.
[103] Sun Microsystems Inc. http://java.sun.com/, Sun Microsystems Inc., August 2002.
[104] Sun Microsystems Inc. http://java.sun.com/embeddedjava, Sun Microsystems Inc., December 1999.
[105] Tanabe, K. and Yamamoto, M. Single Chip Pascal Processor: Its Architecture and Performance Evaluation. Proceedings of the Twenty-First IEEE Computer Society International Conference (Fall COMPCON 80), pp. 395 - 399, September 1980.
[106] Thomas, D. and Moorby, P. The Verilog Hardware Description Language. Kluwer Academic Publishers, 1991.
[107] Turley, Jim. Sun Reveals First Java Processor Core. Microprocessor Report, pp. 28 - 31, October 28, 1996.
[108] Vemuri, R. R., and Harr, R. E. Configurable Computing: Technology and Applications. IEEE Computer, pp. 39 - 40, April 2000.
[109] Venners, Bill. Inside the Java Virtual Machine. McGraw-Hill Inc., 1998.
[116] Wayner, P. How to Soup up Java (Part II): Nine Recipes for Fast Easy Java. BYTE, pp. 76 - 80, May 1998.
[117] Wayner, P. Sun Gambles on Java Chips. BYTE, pp. 79 - 88, November 1996.
[118] Wilson, J. (ed). SOCs Drive New Product Development. IEEE Computer, pp. 61 - 66, June 1999.
[119] Wilson, S. and Kesselman, J. Java Platform Performance: Strategies and Tactics. Addison-Wesley, 2000.
[120] Wolf, W. and Staunstrup, J. (eds.) Hardware/Software Co-Design: Principles and Practice. Kluwer Academic Publishers, 1997.
[121] Zivojnovic, Vojin and Meyr, Heinrich. Compiled HW/SW Co-Simulation. Readings in Hardware/Software Co-Design, pp. 584 - 589, 2001.
VITA
Surname: Kent
Given Names: Kenneth Blair
Place of Birth: Bell Island, Newfoundland
Date of Birth: June 21, 1973

Educational Institutions Attended:
University of Victoria, 1996 - 2003.
Memorial University of Newfoundland, 1991 - 1996.

Degrees Awarded:
M.Sc., University of Victoria, 1999.
B.Sc. (hons), Memorial University of Newfoundland, 1996.

Honours and Awards:
University of Victoria Fellowship, 1999 - 2002.
University of Victoria Graduate Research and Teaching Fellowship, 1996 - 2002.
Government of Newfoundland Scholarship, 1996.
Dean's List, Memorial University of Newfoundland, 1996.
Monie Bown Scholarship, 1991.

Publications:
Kent, Kenneth B., and Serra, Micaela, "A Co-Design Methodology for Virtual Machines", in progress, May 2003.
Kent, Kenneth B., "Branch Sensitive Context Switching between Partitions in a Hardware/Software Co-Design of the Java Virtual Machine", accepted for IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM) 2003, Victoria, Canada, August 2003.
Kent, Kenneth B., and Rice, Jacqueline E., "Using Instance-Specific Circuits to Compute Autocorrelation Coefficients", accepted for The First Northeast Workshop on Circuits and Systems (NEWCAS) 2003, Montreal, Canada, June 2003.
Kent, Kenneth B., and Serra, Micaela, "Using FPGAs to Solve the Hamiltonian Cycle Problem", accepted for IEEE International Symposium on Circuits and Systems (ISCAS) 2003, Bangkok, Thailand, May 2003.
Kent, Kenneth B., and Serra, Micaela, "Reconfigurable Architecture Requirements for Co-Designing Virtual Machines", Reconfigurable Architectures Workshop (RAW), part of International Parallel and Distributed Processing Symposium (IPDPS) 2003, Nice, France, April 2003.
Kent, Kenneth B., and Serra, Micaela, "Hardware Architecture for Java in a Hardware/Software Co-Design of the Virtual Machine", Euromicro Symposium on Digital System Design (DSD) 2002, Dortmund, Germany, pp. 20 - 27, September 2002.
Kent, Kenneth B., and Serra, Micaela, "Context Switching in a Hardware/Software Co-Design of the Java Virtual Machine", Designer's Forum of Design Automation & Test in Europe (DATE) 2002, Paris, France, pp. 81 - 86, March 4-8, 2002.
Kent, Kenneth B., Muzio, Jon C., and Shoja, Gholamali C., "Remote Transparent Execution of Java Threads", Proceedings of the High Performance Computing Symposium (HPC) 2001, Seattle, WA, pp. 184 - 191, April 2001.
Kent, Kenneth B., and Serra, Micaela, "Hardware/Software Co-Design of a Java Virtual Machine", IEEE International Workshop on Rapid Systems Prototyping (RSP), Paris, France, pp. 66 - 71, June 2000.
Kent, Kenneth B., "Transparent Remote Execution of Java Threads", M.Sc. thesis, University of Victoria, 1998.
Kent, Kenneth B., "Kenet: A Software Library for Designing and Testing Multicast Protocols", B.Sc. (hons) dissertation, Memorial University of Newfoundland, 1996.
Partial Copyright License
I hereby grant the right to lend my thesis to users of the University of Victoria
Library, and to make single copies only for such users, or in response to a request from
the library of any other university or similar institution, on its behalf or for one of its
users. I further agree that permission for extensive copying of this thesis for scholarly
purposes may be granted by me or a member of the university designated by me. It is
understood that copying or publication of this thesis for financial gain shall not be
allowed without my written permission.
Title of Thesis:
The Co-Design of Virtual Machines Using Reconfigurable Hardware