UNIVERSITY OF CALIFORNIA
Los Angeles

Architectural and Algorithmic Acceleration of Real-Time Physics Simulation in Interactive Entertainment
A dissertation submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy in Computer Science
by
Thomas Yen-Hsi Yeh
2007

© Copyright by
Thomas Yen-Hsi Yeh
2007
The dissertation of Thomas Yen-Hsi Yeh is approved.
William Kaiser
Sanjay Patel
Yuval Tamir
Demetri Terzopoulos
Petros Faloutsos, Committee Co-chair
Glenn Reinman, Committee Co-chair
University of California, Los Angeles
2007
To Erica
TABLE OF CONTENTS

1 Introduction and Motivation
1.1 The Emerging Workload of Interactive Entertainment
1.2 Software Components of Interactive Entertainment Applications
1.3 Contributions
1.4 Overview
2 Background
2.1 Kinematics vs Physics
2.2 Types of Physical Simulation
2.3 High Level Characteristics of the Simulation Load
2.4 Open Dynamics Engine Algorithmic Load
3 Prior Work
3.1 Benchmarking
3.2 Hardware Physics Accelerators
3.3 Perceptual Error Tolerance
3.3.1 Perceptual Believability
3.3.2 Simulation Believability
4 Workload Characterization
4.1 PhysicsBench 1.0
4.2 Characterization
4.3 Parallelization
4.3.1 Real x86 Processor Evaluation
4.3.2 Fine-Grain Parallelism
5 Architectural Acceleration
5.1 PhysicsBench 2.0
5.2 ParallAX: An Architecture for Real-Time Physics
5.2.1 Physics Simulation and Workload
5.2.2 Experimental Setup
5.2.3 Performance Demands of Real-Time Physics Workload
5.2.4 ParallAX Architecture
5.2.5 Architectural Design Exploration
5.2.6 Summary
5.3 Performance-Driven Adaptive Sharing Cache
5.3.1 Introduction and Motivation
5.3.2 Related Work
5.3.3 Distributed L2 Cache
5.3.4 Limiting Data Migration Among Clusters
5.3.5 Example
5.3.6 Methodology
5.3.7 Results
5.3.8 Summary
6 Algorithmic Acceleration
6.1 Perceptual Error Tolerance
6.1.1 Introduction
6.1.2 Background
6.1.3 Methodology
6.1.4 Numerical Error Tolerance
6.1.5 Precision Reduction
6.1.6 Simulation Time-Step
6.1.7 Summary
6.2 Fast Estimation with Error Control
7 Architectural Exploitation of Algorithmic Properties
7.1 Leveraging Precision Reduction in FPU Design for CMPs
7.1.1 Core Area Reduction
7.1.2 Improving Floating-Point Trivialization and Memoization
7.2 Fuzzy Computation
7.2.1 Value Prediction
7.3 Object-Pair Information
7.3.1 Application-Level Correlation and Locality
7.3.2 Branch Prediction for CD
7.3.3 Increasing Parallelism for CD with the Object Table
8 Conclusion and Future Directions
References
LIST OF FIGURES

1.1 Software Components of Interactive Entertainment Applications.
4.1 Parameters Affecting Computation Load.
4.2 Benchmarks 2-Cars, 100CrSk, Fight, and Battle2 from top to bottom. Images in raster order.
4.3 Instruction Mix for PhysicsBench.
4.4 Performance of four modern architectures on PhysicsBench. [Note: Frame Rate uses log scale]
4.5 Performance of an ideal architecture on PhysicsBench.
4.6 Performance of four modern architectures on SPEC FP.
4.7 Performance of an ideal architecture on SPEC FP.
4.8 Normal, Parallel Simulation, and Parallel Collision Detection Flows.
4.9 Alpha ISA Performance of Parallel Physics Simulation. [Frame Rate uses log scale]
4.10 Alpha ISA Performance of Parallel Collision Detection + Physics Simulation. [Frame Rate uses log scale]
4.11 Instructions Per Frame for PhysicsBench.
4.12 x86 ISA PhysicsBench Performance.
5.1 Physics Engine Flow. All phases are serialized with respect to each other, but unshaded stages can exploit parallelism within the stage.
5.2 (a) Execution Time Breakdown of 1 Core + 1MB L2 — (b) Single Core Execution of Serial Parts with Different L2 Sizes.
5.3 (a) Performance of Broadphase with dedicated L2 — (b) Performance of Narrowphase with dedicated L2.
5.4 (a) Performance of Island Creation with dedicated L2 — (b) Performance of Island Processing with dedicated L2.
5.5 (a) Performance of Narrowphase with dedicated L2 — (b) Performance with Processor Scaling.
5.6 (a) Execution Time Breakdown of 4 Core + 12MB L2 — (b) L2 Miss Breakdown with Thread Scaling.
5.7 (a) Limit of Coarse-grain Parallelism. — (b) Instruction Mix for all 5 Phases.
5.8 ParallAX - Parallel Physics Accelerator.
5.9 (a) Coarse-grain vs Fine-grain Execution Time. — (b) Instruction Mix of Fine-grain Kernels.
5.10 (a) IPC of Different Fine-grain Core Types. — (b) Number of Fine-grain Cores Required per Type to Achieve 30 FPS.
5.11 Average Number of Available Fine-grain Parallel Tasks.
5.12 The Proposed PDAS CMP Memory Hierarchy and High Level Floorplan.
5.13 Cache content of NuRAPID.
5.14 Cache content of PDAS.
5.15 Per-core IPC weighted by ST IPC with 1MB cache.
5.16 Processor and PDAS Parameters.
5.17 Single Thread IPC and Migration per Access. We show the harmonic mean across all benchmarks.
5.18 2-Thread Weighted Speedup and Migration per Access.
5.19 3-Thread Weighted Speedup.
5.20 4-Thread Weighted Speedup.
6.1 Snapshots of two simulation runs with the same initial conditions. The simulation results shown on top are the baseline, and the bottom row is the simulation computed with 7-bit mantissa floating-point computation in Narrowphase and LCP. The results are different but both are visually correct.
6.2 Physics Engine Flow. All phases are serialized with respect to each other, but unshaded stages can exploit parallelism within the stage.
6.3 Snapshots of two simulation runs with the same initial conditions and different constraint ordering. The results are different but both are visually correct.
6.4 Simulation Worlds. CD = Collision Detection. IP = Island Processing. E = Error-injected. The Baseline world simulates without any errors. The Error-Injected world simulation has error injected. The Synched world copies the state of objects from Error-Injected after collision detection, then continues the physics loop with no error injection.
6.5 Perceptual Metrics Data for Error-Injection. X-axis shows the maximum possible injected error. Note: Extremely large numbers and infinity are converted to the max value of each Y-axis scale for better visualization.
6.6 Average and Standard Deviation of Error-Injection.
6.7 Floating-point Representation Formats (s = sign, e = exponent, and m = mantissa).
6.8 Percentage Error Injected from Precision Reduction. X-axis shows the number of mantissa bits used.
6.9 Perceptual Metrics Data for Precision Reduction. X-axis shows the number of mantissa bits used. Note: Extremely large numbers and infinity are converted to the max value of each Y-axis scale for better visualization.
6.10 Effect on Energy with Time-Step Scaling.
6.11 Fast Estimation with Error Control (FEEC).
7.1 FP Adder/Multiplier Area with Varying Mantissa Width.
7.2 Performance of FEEC and fuzzy value prediction.
7.3 Error for FEEC relative to a 20-iteration QuickStep.
7.4 Narrowphase branch prediction rate correlation with the program counter (PC), branch history, and high-level objects.
7.5 Battle Execution Time Breakdown for Collision Detection.
7.6 Battle2 Execution Time Breakdown for Collision Detection.
7.7 CrashWa Execution Time Breakdown for Collision Detection.
7.8 Collision Detection’s Role in the Physics Simulation Flow.
7.9 Collision Detection Flow with Decoupled Broad-Phase and Narrow-Phase.
7.10 Unnecessary Narrow-Phase Comparisons and New Object-Pairs.
LIST OF TABLES

4.1 Parameters Affecting Computation Load.
4.2 Parameters for our architectural configurations. All architectures use a common ISA.
4.3 Resource Requirement for Full Parallelization using Server Cores.
4.4 Resource Requirement for Full Parallelization for a Real x86 Processor.
4.5 Distribution of factors affecting fine-grain parallelism.
5.1 Our benchmarks cover a wide range of parameterized situations within different game genres.
5.2 Features Found in Our Benchmarks.
5.3 Our Physics Benchmarking Suite.
5.4 Benchmark Specs.
5.5 Coarse-grain Core Design.
5.6 Our Fine-Grain Core Designs.
5.7 Number of Fine-Grain Tasks Required to Hide Communication.
5.8 Selected 2-Thread L2 Miss Rates.
5.9 Selected 3-Thread L2 Miss Rates.
6.1 Max Error Tolerated for Each Computation Phase.
6.2 Perceptual Metric Data for Random Reordering, Baseline, and Simple Simulations.
6.3 Numerically-derived Min Mantissa Precision Tolerated for Each Computation Phase.
6.4 Simulation-based Min Mantissa Precision Tolerated for Each Computation Phase.
6.5 Min Mantissa Precision Tolerated for PhysicsBench 2.0.
7.1 Conventional Trivial Cases.
7.2 Reduced Precision Trivial Cases.
7.3 Percent Trivialized FP Operations for Full and Reduced Precision.
7.4 Percent Memoized FP Multiply for Full and Reduced Precision.
7.5 Percent Memoized FP Add/Sub for Full and Reduced Precision.
7.6 Percent Memoized FP Mixed for Full and Reduced Precision.
7.7 Parameters for our architectural configuration.
ACKNOWLEDGMENTS

I would like to extend my gratitude and appreciation to the many people in my academic and personal life who made this dissertation possible.

First and foremost, I would like to thank my research advisors, Professor Glenn Reinman and Professor Petros Faloutsos, for their support, encouragement, and guidance. It has been a privilege to work with both. Their deep knowledge in the disjoint fields of computer architecture and computer graphics was essential for this work. Most importantly, I appreciate Professor Reinman’s trust and patience during the search for my research topic.

I thank Professor David Patterson for introducing me to the world of computer architecture, Professor Bill Mangione-Smith for co-advising me and suggesting the exploration of real-time physics, Professor Sanjay Patel for his guidance and the opportunity to collaborate with AGEIA Technologies, Professor Yuval Tamir for providing invaluable feedback, Professor Milos Ercegovac for collaboration and guidance, and Dr. Enric Musoll for support and advice. I would also like to acknowledge Professors William Kaiser and Demetri Terzopoulos, who served on my committee and provided insightful comments.

It has been a pleasure working with the talented and dedicated people at UCLA. I would like to acknowledge the help and support from my fellow graduate students (Dr. Anahita Shayesteh, Adam Kaplan, Dr. Eren Kursun, Dr. Yongxiang Liu, Kanit Therdsteerasukdi, Gruia Pitigoi-Aron, Shawn Singh, and Brian Allen), the assistance I received from the undergraduates (Eric Wood, Nathan Beckmann, Mishali Naik, and Paul Salzman), and the consistent help and advice from our department’s graduate student advisor (Verra Morgan).

My deepest gratitude goes to my extended family (the Yehs, the Lis, the Changs, the Chuangs, and the Jengs) and close friends for their unconditional support, understanding, and sacrifice. I am especially thankful to David, Gene, Mike, Snoopy, Lucia, Christine, Leo, Ramona, Shang, Lauren, Brandon, Kai-Sheng, Hank, David Sr., George, Amy, John, Albert, Pearl, David Jr., my grandparents, and my brother David. I offer special thanks to my parents, Shou-Shoung and Fumei, especially for their sacrifice in making my education in the United States possible and for their encouragement to always strive for my dreams. Finally, and most importantly, I want to thank my wife, Erica May-Chien Chang, M.D., for her love, support, and sacrifice, without which I could not have accomplished this work. I dedicate this dissertation to her.

VITA

1974 Born, Taipei, Taiwan.
1993–1997 B.S., Electrical Engineering and Computer Science, University of California, Berkeley
1996 Hardware Engineer, SBE
1998 Verification Engineer, Intel Corporation
1998–1999 M.S., Computer Science, University of California, Los Angeles
1999–2001 Logic Design / Architecture Research / Marketing, Intel Corporation
2001–2002 Design Engineer, Xstream Logic / Clearwater Networks
2002 Microarchitect, Sun Microsystems
2003–2007 Graduate Research Assistant and Teaching Assistant, Computer Science Department, University of California, Los Angeles
PUBLICATIONS

Fool Me Twice: Exploring and Exploiting Error Tolerance in Physics-Based Animation. T. Y. Yeh, G. Reinman, S. Patel, P. Faloutsos. ACM Transactions on Graphics (TOG), accepted with major revisions, 2007.

ParallAX: An Architecture for Real-Time Physics. T. Y. Yeh, P. Faloutsos, S. Patel, G. Reinman. The 34th International Symposium on Computer Architecture (ISCA-34), June 2007.

Enabling Real-Time Physics Simulation in Future Interactive Entertainment. T. Y. Yeh, P. Faloutsos, G. Reinman. 2006 ACM SIGGRAPH Symposium on Videogames (Sandbox), August 2006.

Fast and Fair: Data-stream Quality of Service. T. Y. Yeh, G. Reinman. 2005 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), September 2005.

Redundant Arithmetic Optimizations. T. Y. Yeh, H. Wang. The 6th International Euro-Par Conference on Parallel Computing (Euro-Par), August 2000.
ABSTRACT OF THE DISSERTATION

Architectural and Algorithmic Acceleration of Real-Time Physics Simulation in Interactive Entertainment

by

Thomas Yen-Hsi Yeh
Doctor of Philosophy in Computer Science
University of California, Los Angeles, 2007
Professor Glenn Reinman, Co-chair
Professor Petros Faloutsos, Co-chair
Interactive entertainment (IE) applications are rapidly gaining significance from both technical and economic points of view. Future IE applications will feature on-the-fly content creation with large numbers of interacting objects, intelligent agents, and high-definition rendering. Application designers must provide at least 30 graphical frames per second to sustain the illusion of visual continuity. While IE’s real-time constraint demands a tremendous amount of performance, almost no academic attention in the architecture community has been directed at quantifying the needs of this emerging workload.

In this dissertation, we focus on the acceleration of one core component of this emerging workload, namely real-time physics simulation, or physics-based animation (PBA). Our holistic approach to acceleration spans benchmark creation, workload characterization, architectural acceleration, algorithmic acceleration, and architectural exploitation of algorithmic properties.

To represent this emerging workload, we developed PhysicsBench, a set of benchmarks that capture the complexity and scale of PBA in IE applications. Using the PhysicsBench suite, we characterized the workload to identify its key differentiating factors. Based on this characterization, we propose ParallAX, an architecture that sustains interactive frame rates for real-time physics. The ParallAX architecture is a heterogeneous chip-multiprocessor that features aggressive coarse-grain cores and area-efficient fine-grain cores. Scaling the number of active cores per chip increases the load on the lowest-level cache. To alleviate cache thrashing, we propose the Performance-Driven Adaptive Sharing (PDAS) cache design. PDAS is a scalable, multi-ported NUCA that dynamically allocates its distributed cache resources through an intelligent, realizable on-line partitioning strategy.

In addition to parallelism, the error tolerance of human perception can also be leveraged for performance in PBA. Using prior studies of simpler scenes as a starting point, we extrapolate a methodology for evaluating the tolerable error of complex scenes. Leveraging the findings from these studies, we propose architectural techniques that exploit algorithmic properties of PBA, namely perceptual error tolerance and the notion of object-pairs.

To summarize, this dissertation is an in-depth study of the acceleration of real-time physics simulation. Given physics’ similarity to other software components, our proposed methodologies and techniques can be applied to many areas within the IE space.
CHAPTER 1
Introduction and Motivation
1.1 The Emerging Workload of Interactive Entertainment
Interactive entertainment (IE) has grown into a substantial industry. According to the Entertainment Software Association [CS], IE software generates $10.3 billion in direct sales per year and $7.8 billion in complementary products. Video games are the predominant form of interactive entertainment, and they have driven mass demand for high-performance computing. Gaming sustains the economy of scale for CPU and GPU development, which finances research and development and impacts the entire computing industry. Sixty-nine percent of the heads of American households play games, and the current average game player age is 33 [ESA].
Beyond pure entertainment, interactive gaming is being leveraged for use in education, training, health, and public policy [Ini]. One interesting example is the America’s Army game, created by the United States Army for civilians to experience life as a soldier [Arm]. Other creative uses include medical screening, fitness promotion, and hazmat training. Recently, gaming software research has been integrated into the curriculum of various prestigious universities. Interactive entertainment is evolving into a powerful medium for future generations to experience [ESA].
Despite the social, economic, and technical importance of gaming software, there has been very little academic effort to quantify games’ behavior and needs. The latest generation of game consoles (Sony PlayStation 3 [CNE], Microsoft Xbox 360 [360], and Nintendo Revolution [Rev]) shows a broad spectrum of designs aimed at the same workload. Differing design choices include the programming model, number of threads, type of chip-multiprocessor, order of execution, and complexity of branch prediction. Surprisingly, all three drastically different processor designs were created by the same company.
1.2 Software Components of Interactive Entertainment Applications
From a technical perspective, future games will be computationally intensive applications that involve various computation tasks [Kel]: artificial intelligence, physics simulation, motion synthesis, scene database query, networking, graphics, audio, video, I/O, OS, tactile feedback, and general-purpose game engine code. The interdependencies of these components are illustrated in Figure 1.1, based on information from [Kel]. In order to provide smooth game-play, gaming hardware is required to complete all tasks under their respective real-time constraints. The diverse computational tasks that could compose future games will challenge future system architects to meet the performance constraints of these games. As demands change, the algorithms employed in games are continuously being refined, further increasing computation demands into the foreseeable future.
Real-time physics simulation and artificial intelligence are considered the top two areas for dramatically enhancing the user experience of future IE applications. Both components enable on-line content creation by dynamically generating game-play content. This characteristic offers both substantial advantages and disadvantages. Dynamic content allows for open-ended, unique user experiences while reducing production costs in the form of programmers statically coding possible scenarios. However, it results in a significant increase in hardware performance requirements as well as verification complexity.

Figure 1.1: Software Components of Interactive Entertainment Applications.
While the laws that govern physical behavior are well understood and modeled, gaming artificial intelligence (AI) covers a wide range of tasks and is currently an active area of research for the AI community [LL00, YFR06]. The collision detection component of physics simulation is also a main component of AI tasks such as local steering [Ope]. Therefore, we focus on the software component of real-time physics simulation.
The research goal of this dissertation is to determine how best to meet the compute demands of real-time physics simulation for future gaming workloads. With proper benchmarking and characterization, we propose and evaluate novel architectural and algorithmic acceleration techniques.
1.3 Contributions
• Real-Time Physics Benchmarking: PhysicsBench 1.0 and 2.0
• Physics Workload Characterization
• ParallAX: Architecture for Physics Acceleration
• Performance-Driven Adaptive Sharing Cache
• Evaluation of Perceptual Error Tolerance for Complex Scenarios
• Precision Reduction in FPU Design for CMPs
• Fast Estimation with Error Control
• Fuzzy Value Prediction
• Object-Pair Filter
1.4 Overview
This dissertation is organized as follows. Chapter 2 presents background information on real-time physics simulation. Prior work on both software physics engines and hardware accelerators is discussed in Chapter 3, and the workload characterization appears in Chapter 4. Chapter 5 presents architectural contributions for accelerating physics simulation: ParallAX, a heterogeneous CMP architecture to accelerate physics simulation, and PDAS, a performance-driven adaptive sharing cache. Chapter 6 details pure algorithmic contributions in the field of real-time physics simulation. In Chapter 7, contributions in hybrid techniques that leverage algorithmic properties for architectural acceleration are presented. Finally, the conclusion and future directions are presented in Chapter 8.
CHAPTER 2
Background
Before discussing the details of our work, we describe the key background information on real-time physics simulation.
In the early days of the interactive entertainment industry, virtual characters were heavily simplified, crude polygonal models. The scenarios in which they participated were also simple, requiring them to perform small sets of simple actions. Recent advances in graphics hardware and software techniques have resulted in near-cinematic-quality images for entertainment applications such as Assassin’s Creed, Heavenly Sword, MotorStorm, Gears of War, World of Warcraft, and Crysis.
These unprecedented levels of visual quality and complexity in turn require high-fidelity animation. To achieve it, modern interactive entertainment applications have started to incorporate new techniques into their motion synthesis engines. Among them, physics-based simulation is one of the most promising options.
2.1 Kinematics vs Physics
The current state of the art in motion synthesis for interactive entertainment applications is predominantly based on kinematic techniques. The motion of all objects and characters in a virtual world is derived procedurally or from a convex set of parameterized recorded motions. Such techniques offer absolute control over the motion of the animated objects and are fairly efficient to compute. However, the more complex the virtual characters are, the larger the sets of recorded motions must be. For the most complex virtual characters, it is impractical to record the entire set of possible motions that their real counterparts can perform.
Physics-based simulation is an alternative approach to the motion synthesis problem. It computes the motion of virtual objects by numerically simulating the laws of physics. Thus, it supports unpredictable, non-prescribed interaction between objects in the most general possible way. Physics-based simulation provides physical realism and automated motion calculation, but it comes with greater computational cost, difficulty in object control, and potentially unstable results.
Realism: The laws of physics offer the most general constraint over motion. Not only do they guarantee realistic motion, but they also avoid repetition. Any variation in the initial conditions (e.g., contact points) will produce a different motion. In a sense, the set of possible actions is as large as the domain of the initial conditions, and not restricted to a small set of recorded motions.
Automation: Once the equations of motion are provided for each object in a virtual world, motion can be computed automatically based on the applied forces and torques.
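As a concrete sketch of this automation, the snippet below integrates a point mass under an applied force using one semi-implicit (symplectic) Euler sub-step. The `Body` type and `step` function are our own illustration, far simpler than a full rigid-body integrator (which would also track orientation, angular velocity, and an inertia tensor), but the principle is the same: supply forces, and motion follows.

```cpp
// Minimal point-mass state; a full rigid body would also carry
// orientation, angular velocity, and an inertia tensor.
struct Body {
    double pos[3];
    double vel[3];
    double mass;
};

// One semi-implicit Euler step: update velocity from the applied force
// first, then position from the NEW velocity. This ordering is what
// makes the step symplectic, and it tends to be more stable than
// explicit Euler for oscillatory systems.
void step(Body& b, const double force[3], double dt) {
    for (int i = 0; i < 3; ++i) {
        b.vel[i] += (force[i] / b.mass) * dt;
        b.pos[i] += b.vel[i] * dt;
    }
}
```

Calling `step` twice with a 1/60 s time-step and a constant gravitational force drops a body released from rest by a few millimetres; no trajectory is authored anywhere, only forces are supplied.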
Control: The laws of physics specify how objects move under the influence of applied forces and torques. However, they do not specify what the forces and torques should be in order to achieve a desired action. That is a separate problem, which can be very complex for dynamically balanced characters such as virtual humans, and it is outside the scope of our work.
Stability: The numerical methods that simulation uses to solve the equations of motion can become unstable under certain circumstances. However, many methods have been developed to deal with this problem. Of particular interest to entertainment applications are methods that trade off accuracy for stability. This is a key issue that we exploit later on.
2.2 Types of Physical Simulation
Simulation in IE applications can be categorized based on the objects and the types of phenomena that we are most interested in simulating. They typically fall into the following five categories:
1. Rigid Body: idealization of a solid body of finite size in which deformation is neglected [Bar97]
2. Cloth: cloth mesh simulated as point masses connected via distance constraints [CK02, BW98]
3. Explosion: suspended particle explosions [FOA03]
4. Fluid: smoothed particle hydrodynamics, mass distributed around a point [FL04, MTP04]
5. Hair: geometric model of hairs using vector fields [CJY02]
This dissertation focuses on Rigid Body, Cloth, and Explosion.
2.3 High Level Characteristics of the Simulation Load
All types of simulation have certain characteristics that are unique to the domain of IE applications. Efficiency is crucial in interactive entertainment: each frame of animation must be computed at a minimum rate of approximately 30 frames per second. For a frame to be computed, all the necessary components of the application must complete within a fraction of this frame time.
Stability is also critical to creating a realistic environment.
The simulation should
not numerically explode under any circumstances. However, while
it is important that
8
-
actions have a visually believable outcome and do not violate
any constraints placed
on the objects of the simulation (i.e. bones bending, walking
through walls), interac-
tive entertainment applications generally have looser
requirements on accuracy than
most scientific applications. Recent research in animation
[HRP04b, RP03a] has actu-
ally studied and quantified errors that are visually
imperceptible. For instance, length
changes below 2.7% cannot be perceived by an average observer [HRP04b], while changes of over 20% are always visible. The acceptable bounds on errors increase with scene clutter and high-speed motions [HRP04b].
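The thresholds above can be encoded as a simple classifier. This is our own illustrative helper; only the 2.7% and 20% figures come from [HRP04b]:

```python
# Thresholds from [HRP04b]: length errors below 2.7% are imperceptible
# to an average observer; errors above 20% are always visible.
LOWER, UPPER = 0.027, 0.20

def classify_length_error(rest_length, observed_length):
    """Classify a constraint (length) error by perceptual visibility."""
    err = abs(observed_length - rest_length) / rest_length
    if err < LOWER:
        return "imperceptible"
    if err > UPPER:
        return "always visible"
    return "possibly visible"
```

For example, a limb stretched from 1.0 to 1.01 units (a 1% error) falls below the 2.7% threshold and is classified as imperceptible.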
The physics load of interactive entertainment applications has
certain unique fea-
tures. First, it seems to be distributed. For most scenes that
depict realistic events,
there are many things happening simultaneously but independently
of each other.
This distributed nature of the physics load can be exploited to
reduce the complex-
ity of the underlying solvers and allows for parallel execution.
Second, the physics load
seems to be sparse. Numerical solvers and dynamic formulations
can exploit sparsity
to improve computational efficiency. Third, since the
applications are interactive there
is usually a human viewer/user involved. Therefore, the
application inherently toler-
ates errors that the human user cannot perceive.
In summary, the physics load specifically as it applies to
interactive entertainment
applications seems to be distributed, sparse, and
error-tolerant. At the same time, such applications require efficiency and stability, for which they can trade off accuracy. Based on these considerations, we use the Open Dynamics
Engine [Eng] as a
representative physics-based simulator for interactive
entertainment applications.
2.4 Open Dynamics Engine Algorithmic Load
The Open Dynamics Engine follows a constraint-based approach for
modeling articu-
lated figures, similar to [Bar97]. ODE is designed with
efficiency rather than accuracy
in mind and it is particularly tuned to the characteristics of
constrained rigid body dy-
namics simulation. A typical application that uses ODE has the
following high level
algorithmic structure:
1. Create a dynamics world.
2. Create bodies in the dynamics world.
3. Set the state (position and velocities) of all bodies.
4. Create the joints (constraints) that connect bodies.
5. Create a collision world and collision geometry objects.
6. While (time < timemax)
(a) Apply forces to the bodies as necessary.
(b) Call collision detection.
(c) Create a contact joint for every collision point, and put it
in the contact joint group.
(d) Take a forward simulation step.
(e) Remove all joints in the contact joint group.
(f) Advance the time: time = time+∆t
7. End.
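Structurally, loop (6) can be sketched as follows. This is an illustrative skeleton in Python with stub callbacks standing in for engine routines, not ODE's actual C API:

```python
def simulate(world, t_max, dt, apply_forces, collide, step):
    """Structural skeleton of the loop in step 6 above.
    apply_forces, collide, and step stand in for engine routines."""
    t = 0.0
    frames = 0
    while t < t_max:
        apply_forces(world)        # (a) apply external forces to the bodies
        contacts = collide(world)  # (b, c) contact joints for this step
        step(world, contacts, dt)  # (d) take a forward simulation step
        # (e) contact joints are discarded; fresh ones are made next pass
        t += dt                    # (f) advance the time
        frames += 1
    return frames
```

The key structural point is that contact joints are transient: they are recreated from scratch by collision detection on every iteration, while the body and joint state persists across iterations.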
The computational load of a simulation is defined by two main
components: Col-
lision Detection, (b), and the forward dynamics step, (d). Finer
granularity distinction
between computation phases will be explored later.
Collision Detection Collision detection (CD) uses geometrical approaches to identify bodies that are in contact and to generate appropriate contact points. A space in CD
contains geometric objects that represent the outline of rigid
bodies [Eng]. Spaces are
used to accelerate collision detection by allowing the removal
of certain object pairs
that would result in useless tests.
Collision detection depends significantly on the geometric
properties of the objects
involved. ODE supports contact between standard shapes such as
boxes, spheres, and
cylinders, and also arbitrary triangle meshes. The contact
resolution module of ODE
supports both instantaneous collisions and resting contact with
friction. High speed
collisions can be resolved even at coarse time steps. In such
cases, the collision may
produce penetrating configurations. However, a nice feature of ODE is that the penetration will be eliminated within a small number of steps. Such
features make ODE
especially suitable for interactive applications.
Forward Dynamics Step The simulator takes a forward step in time by computing the constraint forces that maintain the structure of the objects and that satisfy the collision constraints produced by collision detection. This is one of the most expensive parts of the simulator and requires the solution of a Linear Complementarity Problem (LCP).
ODE offers two ways of solving the LCP system for the constraint
forces: an accurate
and expensive one based on a big-matrix approach (the so called
normal step), and
a less accurate approach called quick step that iteratively
solves a number of much
smaller LCP problems. Their respective complexities are O(m^3) and O(m * i), where m is the total number of constraints and i is the number of iterations, typically 20.
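The two solver costs can be compared with a toy cost model. This is our own illustration; the units are arbitrary, and only the asymptotic shapes and the default iteration count come from the text:

```python
ITERS = 20  # typical iteration count for ODE's quick step, per the text

def normal_step_cost(m):
    """Big-matrix LCP solve: O(m^3) in the constraint count m."""
    return m ** 3

def quick_step_cost(m, iters=ITERS):
    """Iterative quick-step solve: O(m * i)."""
    return m * iters
```

With m = 100 constraints, the model gives 1,000,000 units of work for the normal step versus 2,000 for the quick step, consistent with the observation below that the iterative approach far outperforms the big-matrix approach on scenes of average complexity.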
For any scene of average complexity the iterative (quick-step)
approach far outper-
forms the big-matrix approach. IE applications’ tolerance for
lower accuracy is one
main characteristic that we can leverage for performance. By
using the quick step, we
enable massive parallelization for the constraint solver as
described in the fine-grain
parallelism section.
ODE's integrator trades off accuracy for efficiency and allows relatively large time steps, even in situations with multiple high-speed collisions. The key parameter here is the integration time step which, for a fixed-step integrator, relates directly to the time step of the simulation ∆t. Typical values range from 0.01 to 0.03 seconds.
In the physics integration computation, the concept of an island
is analogous to the
space in the above discussion on CD. The island concept is
defined as a group of bodies
that cannot be pulled apart [Eng], which means that there are
joints interconnecting
these bodies. Each island of bodies is computed independently
from other islands by
the physics engine.
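The island definition above can be illustrated with a small sketch of our own (bodies as integer indices, joints as index pairs; these are not ODE's actual data structures), using a union-find pass over the joints:

```python
def find_islands(num_bodies, joints):
    """Group bodies connected by joints into islands.
    `joints` is a list of (body_a, body_b) index pairs."""
    parent = list(range(num_bodies))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in joints:
        parent[find(a)] = find(b)  # union the two components

    islands = {}
    for i in range(num_bodies):
        islands.setdefault(find(i), []).append(i)
    return list(islands.values())
```

For example, five bodies with joints (0,1) and (1,2) form three islands: one of three interconnected bodies and two singletons; each island can then be stepped independently, and in parallel, by the engine.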
The computation demand is affected significantly by the number
and the complex-
ity of islands during one simulation step. The complexity of an
island can be quantified
by the number of objects along with the number and complexity of
the interconnecting
joints. The complexity of a joint is characterized by the
degrees of freedom (DoF) it
removes as listed in the following table:
Joint        Ball   Hinge   Slider   Contact   Universal   Fixed
DoF Removed  3      5 (4)   5        1         4           6
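As a hypothetical use of the table, the total number of constraint rows an island contributes to the solver can be estimated by summing the DoF removed by its joints. This helper is our own illustration (the Hinge entry is taken at its primary value of 5):

```python
# DoF removed per joint type, from the table above.
DOF_REMOVED = {"ball": 3, "hinge": 5, "slider": 5,
               "contact": 1, "universal": 4, "fixed": 6}

def island_constraint_count(joint_types):
    """Estimate the constraint count m contributed by an island's joints;
    m drives the O(m^3) and O(m * i) solver costs discussed above."""
    return sum(DOF_REMOVED[j] for j in joint_types)
```

An island with one ball joint and two contact joints, for instance, contributes five constraint rows.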
Islands exhibit different temporal behaviors: some persist for a long time while others change constantly from one integration step to the next. This behavior contributes to the variance in the computation demanded of the engine.
CHAPTER 3
Prior Work
There is little work directly related to interactive
entertainment (IE) in the architecture
community. In this chapter, prior work on IE benchmarking,
hardware acceleration,
and perceptual error tolerance is presented. Throughout the
dissertation, discussion of
contributions will include additional related work specific to
each contribution.
3.1 Benchmarking
At the start of the project detailed in this dissertation, no
benchmarks for real-time
physics simulation existed. Since then, AGEIA Technologies
joined Futuremark’s
3DMark Benchmark Development Program to include two complex
game-like scenar-
ios in 3DMark06 [3DM]. While these benchmarks support
multi-thread and multi-core
architectures, the lack of source code hampers in-depth architectural studies.
[MWG04] compared performance-counter statistics gathered over one second of execution of two first-person shooter games against music and video playback applications. This work shows the difference between gaming and multimedia applications due to games' content-creation tasks, and points to chip multiprocessors (CMP) [ONH96a] as a promising approach to providing the needed performance.
Physics Engines Physics engines are software libraries that enable rigid body dynamics simulation. [SR06] compares three physics engines, namely ODE, Newton,
and Novodex. Different engines may support different types of
physics simulation and
solver algorithms. Below is the list of prominent engines
currently on the market:
1. Commercial: AGEIA [agea] and Havok [Hava]
2. Open-source: ODE [Eng]
3. Free, closed-source: Newton [New] and Tokamak [Tok]
Both AGEIA [agea] and Havok [Havb] provide their own proprietary
SDKs. Open
Dynamics Engine (ODE) [Eng] is the most popular open-source
alternative. It has
been used in commercial settings, and provides APIs and
numerical techniques similar
in nature to proprietary engines. ODE is the basis of our
physics engine.
3.2 Hardware Physics Accelerators
The MDGRAPE-3 chip by RIKEN [Tai04] and the PhysX chip by AGEIA
[agea]
are currently the only dedicated physics simulation accelerator
designs. While MD-
GRAPE targets computational physics, PhysX targets real-time
physics for games.
Both designs are placed on accelerator boards which connect to
the host CPU through
a system bus. PhysX’s architectural design is not public, and
MDGRAPE's design is
specific to computing forces for molecular dynamics and
astrophysical N-body simu-
lations with limited programmability.
Two other closely related bodies of prior work are vector
processing [HP96] and
stream computation [LMT04, KRD03].
Vector Processing. The massive parallelism available in real-time physics hints at the use of vector processors like VIRAM [KPP97], Tarantula [EAE02], and CODE [KP03]. VIRAM [KP02] has achieved an order of magnitude performance
improvement on certain multimedia benchmarks. However,
conventional vector ar-
chitectures are constrained [KP03] by limitations like the
complexity of a centralized
register file, the difficulty of implementing precise exceptions,
and the requirement of an
expensive on-chip memory system. Most importantly, the physics
workload requires
tremendous speedup that necessitates massive parallel execution.
While CODE is scal-
able by increasing clusters and lanes, the data shows a plateau
at eight clusters with
eight lanes, and the cache-less CODE cannot satisfy our
measured physics workload.
Stream Computation. Stream architectures (SAs) aim to enable ASIC-like performance efficiency while being programmable with a high-level language. Stream
programs express computation as a signal flow graph with streams
of records flowing
between computation kernels. While a broad range of designs
populate this space, the
high-level characteristics of SAs are described in [LMT04].
The Stream Virtual Machine (SVM) architecture model logically
consists of three
execution engines and three storage structures. The execution
engines include a con-
trol processor, kernel processor, and DMA. The storage
structures are local registers,
local memory, and global memory. The SVM mitigates the
engineering complexity of
developing new stream languages or architectures by enabling a
2-level compilation
approach. Related designs in this space include IBM’s Cell,
GPUs, and the Xbox360
system.
IBM’s Cell [Hof05] consists of one general purpose PowerPC core
(PPE) and eight
application specific streaming engines (SPE) – all connected by
a ring of on-chip inter-
connect (EIB). Although the Cell’s programming model is
described as cellular com-
puting, the design can be included in the broad space of
streaming computation. The
PPE is a 64-bit, 2-way SMT, in-order execution PowerPC design
with 32KB L1 caches
and a 512KB L2 cache. The SPEs are RISC cores each with 128
128-bit SIMD reg-
isters, customized SIMD instructions, and 256KB local private
memory. SPEs are not
ISA compatible with conventional PowerPC cores. Heterogeneous
CMP designs such
as the Cell are able to target the best task specific
performance using different cores.
The PPE targets control intensive tasks such as the OS and the
SPEs target compute
intensive tasks. However, according to our exploration, neither the PPE nor the SPE design is optimal for physics computation. The serial components' execution on the PPE consumes a significant fraction of each frame's time, and the SPE's complexity prevents placing the number of cores required to achieve 30 FPS. This may be a result of the fact that the Cell is designed to execute all components of a game, not just physics simulation.
The Graphics Processing Unit (GPU) [PF05] is another design
point within the
streaming architecture space. GPUs are specialized hardware
cores designed to accel-
erate rendering and display. While Havok’s FX allows effect
physics simulation on
GPUs, GPUs are designed to maximize throughput from the graphics
card to the dis-
play – data that enters the pipeline and the results of
intermediate computations cannot
be easily accessed by the CPU. Furthermore, the host CPU is
connected to the GPU
via a system bus. This communication latency is problematic for
physics simulation
working in a continuous feedback loop. This may be one reason
why Havok’s FX
only enables effect physics and not game-play physics. This
limitation is alleviated in
the Xbox360 system [AB06], which combines a 3-core CMP and GPU
shaders. This
system allows the GPU to read from the FSB rather than main
memory and L2 data
compression reduces the required bandwidth.
3.3 Perceptual Error Tolerance
This section reviews the relevant literature in the area of
perceptual error tolerance.
3.3.1 Perceptual Believability
[OHM04] is a 2004 state-of-the-art survey of the field of
perceptual adaptive
techniques proposed in the graphics community. There are six
main categories of such
techniques: interactive graphics, image fidelity, animation,
virtual environments, visu-
alization and non-photorealistic rendering. This dissertation
focuses on the animation
category. Given the comprehensive coverage of this prior survey
paper, we will only
present prior work most related to our work and point the reader
to [OHM04] for ad-
ditional information.
[BHW96] is credited with the introduction of the plausible
simulation concept,
and [CF00] built upon this idea to develop a scheme for sampling
plausible solutions.
[ODG03] is a recent paper upon which we base most of our
perceptual metrics. For the
metrics examined in this paper, the authors experimentally
arrive at thresholds for high
probability of user believability. Then, a probability function
is developed to capture
the effects of different metrics. This paper only uses simple scenarios, with two objects colliding, for its user trials.
[HRP04a] is a study on the visual tolerance of lengthening or
shortening of human
limbs due to constraint errors produced by physics simulation.
We derive the threshold
for constraint error from this paper.
[RP03b] is a study on the visual tolerance of ballistic motion
for character ani-
mation. Errors in horizontal velocity were found to be more detectable than errors in vertical velocity, and added accelerations were easier to detect than added decelerations.
In general, prior work has focused on simple scenarios in
isolation (involving 2
colliding objects, a human jumping, human arm/foot movement,
etc). Isolated special
cases allow us to see the effect of instantaneous phenomena,
such as collisions, over
time. In addition, they allow a priori knowledge of the correct
motion which serves
as the baseline for exact error comparisons. Complex cases do
not offer that luxury.
On the other hand, in complex cases, such as multiple
simultaneous collisions, errors
become difficult to detect and may, in fact, cancel out.
3.3.2 Simulation Believability
Chapter 4 of [SR06] compares three physics engines, namely ODE,
Newton, and
Novodex, by conducting performance tests on friction, gyroscopic
forces, bounce,
constraints, accuracy, scalability, stability, and energy
conservation. All tests show
significant differences between the three engines, and the
engine choice will pro-
duce different simulation results with the same initial
conditions. Even without any
error-injection, there is no single correct simulation for
real-time physics simulation
in games as the algorithms are optimized for speed rather than
accuracy.
CHAPTER 4
Workload Characterization
This chapter covers our initial characterization of the
real-time physics simulation
workload.
4.1 PhysicsBench 1.0
In order to suggest architectural improvements to enable future
applications’ use of
real-time physics simulation, a workload characterization of the real-time physics engine kernel is required. Due to the lack of prior work, this
involves the creation of a
representative suite of benchmarks that covers a wide range of
common situations in
interactive entertainment applications. We have created two
versions of PhysicsBench.
This section covers the details and reasoning in creating
PhysicsBench 1.0. Our first
pass on PhysicsBench takes a bottom-up approach and focuses on
parameters that af-
fect the computation load. The latest version, PhysicsBench 2.0,
is described in the
next chapter.
High Level Considerations PhysicsBench covers a wide range of typical IE situations that involve object interaction. Our scenarios include fighting humans, object-to-human collisions, object-to-object collisions, exploding structures, and fairly complex
battle scenes. Our benchmarks are designed to test the
scalability of the simulation
both in terms of the objects that interact with each other
simultaneously, for example
stacking, and in independent groups, for example a large battle
scene.
The benchmarks represent scenes of realistic complexity
(interactions) but not nec-
essarily realistic motions. The visual representation of the objects in a scene shows the geometries used for collisions, not the ones used for visual display. We are only inter-
ested in the simulation load, not the graphics load.
While the benchmarks below cover a wide range of representative
scenarios, more
complex situations can be constructed by mixing multiple
benchmarks as shown be-
low. Because of the distributed nature of the physics load as it
applies to interactive
entertainment applications, the combined computational load can
be roughly extrapo-
lated from the results of the individual scenarios.
[Figure: tree of parameters. World splits into Physics (# of Islands, Island Complexity, Temporal Behavior, Accuracy) and Collision Detection (# of Spaces, Space Complexity, Inter-Space Comm); island complexity involves Objects, Joints, and Joint Type, and space complexity involves Space Type and Geom Type.]
Figure 4.1: Parameters Affecting Computation Load.
Computation Load As described previously, the physics simulation engine is composed of two major dependent components: collision detection and the forward dynamics step. Collision detection determines all contact points and creates joints to model the impulse forces generated. Then, the bodies along with these contact joints are computed to determine the new positions. Within each component, there are a number of factors that affect the computation load. Figure 4.1 captures the most significant parameters.
Benchmarks The benchmarks involve virtual humans, cars, tanks, walls and projectiles. The virtual humans are of anthropomorphic dimensions and mass properties.
Each character consists of 16 segments (bones) connected with
idealized joints that
allow movement similar to their real world counterpart. The car
consists of a single
rigid body and four wheels that can rotate around their main
axis. Four slider joints
model the suspension at the wheels. The walls are modeled with
blocks of light con-
crete. The projectiles are single bodies with spherical,
cylindrical or box geometry.
In all benchmarks, the simulator is configured to resolve
collisions and resting contact
with friction. Table 4.1 summarizes the quantitative differences
between benchmarks.
Benchmark          Number of Islands (Max, Min, Avg, Dev)   Number of Spaces
2 Cars             2, 2, 2, 0                               1
10 Cars            10, 10, 10, 0                            1
Car Crash Sk       3, 1, 2, 0.65                            1
Car Crash Wall     105, 99, 101, 1.4                        1,3
Environment        337, 196, 245, 46                        1,10
Car Crash Sk x100  300, 100, 220, 64                        100
Battle I           120, 2, 93, 18                           3
Fight              10, 7, 8, 1.1                            1,10
Battle II          156, 113, 134, 18                        1,15
Table 4.1: Parameters Affecting Computation Load.
The benchmarks are as follows:
• 2-Cars: Two cars driving - two cars, each with 3 wheels that
are steered to run
in parallel then collide. One of the cars goes over a wooden
ramp.
• 10-Cars: Ten cars driving - to get a sense of how the load
changes with scale,
we extend the two-car scenario to ten cars.
• CrashSk: Car crashing on two people - a car with four wheels
crashing into two
16-bone virtual humans.
• CrashWa: Extreme-speed car crashing into a wall, tank shooting projectiles - a high-speed car (velocity 200 mph) crashing into a wall while a tank shoots projectiles of varying shapes towards the wall. The wall consists of a large number of blocks.
Figure 4.2: Benchmarks 2-Cars, 100CrSk, Fight, and Battle2, in raster order.
• Environ: Complex environment scene with wall, tank, car, monster, and projectiles - similar to the previous benchmark, with the addition of a tank firing projectiles and a centipede monster.
• 100CrSk: Car crashing on two people, replicated 100 times - the CrashSk scenario replicated 100 times.
• Battle: Battle scene I - one group of 10 humanoids is attacked by a tank, and two groups of 4 and 6 humanoids crash into each other.
• Fight: Fighting Scene, 2 groups of 5 humanoids - two groups of
five humanoids
that come in contact in pairs and eventually form a number of
piles.
• Battle2: Battle scene II - a relatively complex battle scene. A tank behind the far wall shoots projectiles in different directions. A car crashes into the right wall while two groups of five people fight inside the compound. The walls eventually get destroyed and fall on the people.
These scenarios can capture complex interactions. The computational load of 2-Cars, for example, relates to a wide range of two-object interactions that arise in interactive entertainment applications. These include racing games, airplanes that crash in midair, rocket-and-plane collisions, tank-to-tank collisions, and even simple ships colliding. Fight captures the computational complexity of a wide
range of human group
activities that involve progressive interaction such as action,
sports games and urban
simulation scenes.
4.2 Characterization
We have described the unique characteristics of real-time
physics simulation. To fur-
ther support our claims, we compare and contrast PhysicsBench to
graphics, embed-
ded, and scientific applications at the algorithm level. Then,
we present a detailed
characterization of PhysicsBench and consider possible
techniques to accelerate per-
formance. The Alpha ISA was used for the data presented in this
section.
Comparison Against Other Workloads A graphics workload includes the computations needed to draw a single frame after all motion parameters have been computed
and applied to the associated graphics primitives (object
geometries). For interactive
entertainment applications all geometric primitives are
approximated with polygonal
meshes and most often meshes of quadrilaterals or triangles. To
produce the final
image, all polygons go through a set of well defined stages that
include: geometric
transformations, lighting calculations, clipping, projections,
and finally rasterization.
Most of these stages perform calculations based on a polygon’s
vertices. Each vertex
is defined by four floating point numbers. All of these stages
treat each polygon inde-
pendently of the others. For realistic scenes, there are
thousands of polygons involved.
Therefore the typical graphics load is highly parallel and
pipelined. Modern graphics
cards have multiple hardware pipelines capable of treating
massive numbers of poly-
gons. Certain research groups have managed to use graphics
hardware to accelerate
specific physics-based formulations such as computational fluid
dynamics. The grid-
based nature of such approaches can be supported, albeit in
awkward ways, by the
graphics hardware. However, this type of adaptation is not
appropriate for constrained
rigid body formulations.
The SPEC CPU 2000 FP suite seems similar in that it makes use of
similar numer-
ical methods, but the constraints imposed by interactive
entertainment applications
along with the specific load characteristics make it a different
problem. For exam-
ple, the relaxed accuracy requirement allows multiple levels of
approximations and
optimizations, such as higher error thresholds, restricted size
matrices, constraint vio-
lation, higher time-steps, inter-penetrations, approximate
iterative techniques etc. The
differences are reflected by our measurements.
Embedded application suites like MiBench [GRE01] are also quite
different from
PhysicsBench. One major difference is the relatively small
amount of floating point
instructions seen in typical embedded applications.
PhysicsBench Characterization In order to accurately characterize PhysicsBench, we present some system-independent data, along with performance data from some specific systems. In particular, we evaluate state-of-the-art mobile, console, desktop, and server processors.
Table 4.2 presents the architectural parameters for these
classes.
[Figure: stacked bars showing, for each PhysicsBench test (normal and quick-step Q variants) and the harmonic mean, the fraction of Loads, Stores, Branches, Integer Calculation, and Floating Point Calc instructions.]
Figure 4.3: Instruction Mix for PhysicsBench.
Platform-Independent Characteristics Figure 4.3 provides the instruction mix for PhysicsBench. Despite the diverse input sets given to the physics engine, the instruction mix remains fairly uniform across benchmarks. This uniformity strongly contrasts
with the diversity seen in typical general purpose, scientific,
and embedded benchmark
suites. On average, PhysicsBench is composed of 34% floating
point calculations,
25% integer calculations, 6% branches, 5% stores, and 30% loads.
The relatively large
amount of both integer and floating point calculations shows a
fundamental difference
between PhysicsBench and the integer heavy SPEC INT and MiBench,
as well as the
floating point heavy SPEC FP.
Instruction mix also depends on the algorithm used. For simple tests, ODE's quick step algorithm [Eng] executes more floating-point calculations and loads than the normal step. However, the normal step executes more floating-point calculations in complex runs. The two algorithms are more efficient at different levels of complexity: the more accurate normal step algorithm is faster for simple islands, while the quick step algorithm is faster for complex islands.
The harmonic mean for instruction per branch (IPB) across all
PhysicsBench tests
is 16. Similar to the instruction mix, the average behavior
accurately represents all but
a few outliers. Fight Normal shows 27 IPB while 2 Cars Normal
and 10 Cars Normal
show 11 IPB. On the other hand, SPEC CPU2000 FP shows an average
IPB of 21 with
many outliers ranging from 7.7 to 341.
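The suite-wide averages quoted here and elsewhere in this chapter are harmonic means, which can be computed as follows. This is a generic helper of our own, not code from the benchmark harness:

```python
def harmonic_mean(values):
    """Harmonic mean, as used for the suite-wide IPB and IPC figures.
    It weights low outliers more heavily than the arithmetic mean,
    which is appropriate for rate-like metrics."""
    return len(values) / sum(1.0 / v for v in values)
```

Because of this weighting, a single low-IPB benchmark pulls the suite-wide mean down far more than a high-IPB outlier pulls it up.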
On average, PhysicsBench's tests have 400KB of text and 370KB of data. In contrast, SPEC FP has on average 740MB of data and 980KB of instructions. The more-than-order-of-magnitude difference between the data segment sizes suggests a significant memory
of magnitude difference between the data segment sizes suggests
a significant memory
behavior difference between these two workloads. This will be
corroborated by our
performance data comparison.
Furthermore, we observe a drastic difference between the maximum
stack sizes
of these two workloads during run-time. The average maximum
stack size for SPEC
FP tests is 23KB with a maximum of 75KB, but the average maximum
stack size for
PhysicsBench is 1.1MB with a maximum of 2.9MB. This large stack
is due to the
dynamically allocated temporary structures to hold large
matrices.
The combination of a small data size along with a large maximum
stack size sug-
gests that PhysicsBench’s performance will heavily depend on the
L1 cache’s perfor-
mance, while memory latency effects may be minimal. In contrast,
SPEC FP’s large
data sizes with a small maximum stack size suggests the exact
opposite behavior. This
observation will also be corroborated by our performance
data.
In order to focus our attention on the bottleneck, we capture
the percentage of
total instructions contributed by collision detection vs physics
simulation. On average,
collision makes up 7% of all executed instructions, ranging
between 2% and 20%.
Despite this, we will later demonstrate that accelerating
collision detection can also
yield substantial gains.
Platform Dependent Characterization Figure 4.4 presents results for the four architectures in Table 4.2 on PhysicsBench. We consider three metrics: IPC, frame rate, and the percentage of frames that were computed within 10% of the 30 frames/sec constraint. The first two metrics are averages over all frames executed – they give some indication of how close we are getting on average to meeting the frame constraint.
The latter metric gives an indication of whether or not all
frames were computed
in time – ideally, we would like this metric to be as close to
100% as possible to
provide stability and realism. For this initial study, we allocate 10% of each 1/30th of a second to physics simulation. While this time allocation may be too conservative, the data presented can be extrapolated for larger allocations of each 1/30th of a second.
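Under these assumptions the physics time slice works out as follows. This is a trivial helper of our own; the defaults are the 30 frames/sec and 10% figures from the text:

```python
def physics_budget_ms(fps=30, share=0.10):
    """Time slice allotted to physics per frame, in milliseconds:
    a `share` fraction of each 1/fps-second frame."""
    return 1000.0 / fps * share
```

With the defaults, physics must complete in roughly 3.3 ms per frame, which is the bar the frame-satisfaction metric measures against.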
From this first set of data, we clearly see why current
interactive entertainment
applications rarely use realistic physics simulation to
dynamically generate content. In
this suite of physics-only tests, even the powerful Desktop and
Server processors can
only satisfy the demand of the three simplest scenarios.
Surprisingly, the much lower
[Figure: three panels showing, for Mobile, Console, Desktop, and Server on each PhysicsBench test: Instructions Per Cycle; Frames per Second (log scale); and % Frames Satisfied.]
Figure 4.4: Performance of four modern architectures on PhysicsBench. [Note: Frame Rate uses log scale]
performance, in-order Console processor is able to satisfy a similar number of frames. Due to its low clock frequency, Mobile is shown to be adequate only for the basic test, 2-Cars.
From these high-level observations, we can conclude the
following: real-time
physics simulation is extremely difficult to satisfy due to both
the amount of compu-
tation required and the real-time constraint. Furthermore, there
is a large performance
requirement gap between the simple scenarios and the typical
in-game scenarios.
[Figure: for each PhysicsBench test, bars show IPC and diamonds show % Frames Satisfied for Ultimate and for Ultimate with FU, L1, or BP individually scaled down.]
Figure 4.5: Performance of an ideal architecture on PhysicsBench.
In order to find the most useful directions for attacking the large performance gap shown above, we scale processor parameters to find the most critical bottlenecks. Due
to the order of magnitude performance improvement required, we
start this study with
an idealized processor, the Ultimate core from Table 4.2.
Figure 4.5 shows the performance for Ultimate and when
individual parameters are
scaled down to more realistic conditions. The primary y-axis
(bars) shows IPC and the
secondary y-axis (diamonds) shows the % of frames that were
computed within 10%
of our 30 frame/sec constraint.
We explored scaling down each parameter of Ultimate independently. The suffix after the “-” indicates the one parameter being scaled (FU = functional unit latency, BP = branch misprediction penalty, and L1 = data cache size). We have scaled other parameters as well, but only the parameters with the greatest interest and impact are presented. On average, the descending order of % performance degradation is branch misprediction penalty (BP) at 44%, L1 at 32%, and FU (functional unit latency) at 27%.
First, the large effect
of a more realistic branch misprediction penalty shows the large
amount of instruction
level parallelism being exploited by Ultimate’s large
instruction window and support-
ing structures. Second, the L1 is scaled to equal that of the
Desktop. The performance
degradation supports our prior observation that the physics engine generates numerous
temporary values during computation, which necessitates a fast and large memory
hierarchy. Finally, the effect of FU indicates that dependent chains of long-latency
operations lie on the critical path.
Although this extremely idealized configuration comes closer to the goal, not all
applications are satisfied. With the major bottlenecks identified, we explore
techniques to move closer to real-time physics behavior.
Figure 4.6: Performance of four modern architectures on SPEC FP. [Chart: IPC for each
SPEC FP benchmark (ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa,
mgrid, sixtrack, swim, wupwise) and the harmonic mean, on the Mobile, Console, Desktop,
and Server configurations.]
As a point of comparison, we also show performance data for the SPEC FP suite.
Figure 4.6 presents IPC results on SPEC FP for the four architectures in Table 4.2. The
most apparent behavior in this graph is the drastic performance difference between
Server and all the other designs. This corresponds with the fact that scientific
workloads are a major target of server processor designs. The much larger percentage of
FP operations in this workload contributes to the difference among the four designs.
To find the most critical design parameters, we scale resources as in the PhysicsBench
study above. Figure 4.7 presents IPC results for Ultimate running SPEC FP, to compare
against PhysicsBench. Again, each parameter of Ultimate is scaled one at a time. In
addition to FU, BP, and L1, we also present the result of scaling memory latency to 267
cycles. On average, the descending order of % performance degradation
Figure 4.7: Performance of an ideal architecture on SPEC FP. [Chart: IPC for each SPEC
FP benchmark and the harmonic mean under Ultimate, Ultimate-FU, Ultimate-BP,
Ultimate-L1, and Ultimate-MEM.]
is memory latency (MEM) at 25%, branch misprediction penalty (BP) at 22%, and
functional unit latency (FU) at 15%. Compared to PhysicsBench, the memory behavior is
drastically different in that memory latency dominates any L1 cache behavior. This
confirms our earlier workload characterization: scaling memory latency for PhysicsBench
results in < 1% IPC degradation.
Parameter             | Mobile    | Game Console | Desktop      | Server         | Ultimate
Frequency             | 222 MHz   | 3 GHz        | 3 GHz        | 2 GHz          | 5 GHz
Fetch, Decode, Issue  | 1, 1, 1   | 2, 2, 2      | 4, 8, 4      | 8, 8, 8        | 32, 32, 32
FetchQ Size, Speed    | 4, 1      | 16, 1        | 32, 2        | 32, 1          | 128, 2
Issue                 | In-order  | In-order     | Out-of-order | Out-of-order   | Out-of-order
Issue window          | 8         | 16           | 32           | 64             | 512
Branch Predictor      | Taken     | 4K Bimod,    | 8K Gshare,   | 8K Gshare,     | 16K Gshare,
                      |           | 4K 4-way BTB | 4K 4-way BTB | 4K 4-way BTB   | 32K 4-way BTB
Branch Miss Penalty   | 3         | 19           | 19           | 17             | 2
Inst L1 Cache         | 8K 4-way  | 32K 4-way    | 8K 4-way     | 64K Direct Map | 512K 4-way
  Latency             | 2 cycles  | 4 cycles     | 2 cycles     | 2 cycles       | 1 cycle
Data L1 Cache         | 8K 4-way  | 32K 4-way    | 16K 4-way    | 32K 2-way      | 512K 4-way
  Latency             | 2 cycles  | 4 cycles     | 4 cycles     | 4 cycles       | 1 cycle
Inst Window, Ld/St    | 32, 8     | 64, 32       | 128, 64      | 256, 64        | 2048, 1024
L2 Cache              | None      | 512K 8-way,  | 1M 8-way,    | 2M 8-way,      | 64M 16-way,
                      |           | 16 cycles    | 27 cycles    | 24 cycles      | 12 cycles
Functional Units:
  (Int ALU, Mult)     | 1, 1      | 3, 1         | 6, 1         | 6, 1           | 32, 32
  (FP, FP Mult)       | 1, 1      | 2, 1         | 2, 1         | 4, 4           | 32, 32
  (Mem ports)         | 1         | 1            | 2            | 3              | 32
Mem Latency (cycles)  | 20        | 258          | 269          | 184            | 50
Table 4.2: Parameters for our architectural configurations. All architectures use a
common ISA.
4.3 Parallelization
In this section, we explore architectural candidates for satisfying the frame
constraint of PhysicsBench 1.0.
Figure 4.8: Normal, Parallel Simulation, and Parallel Collision Detection Flows.
[Diagram: in the normal flow, collision detection over the whole space is followed by
island generation and a serial loop of per-island physics simulation. In the parallel
physics simulation flow, the per-island simulations run concurrently. In the parallel
collision detection + simulation flow, collision detection itself is split across
subspaces (Sub_Space_0, Sub_Space_1, Sub_Space_2) that communicate bidirectionally with
the root space, followed by island generation and parallel per-island simulation.]
Parallel Threads As illustrated in Figure 4.8, the core physics simulation loop can be
simplified into three dependent logical steps: collision detection, island creation,
and per-island physics simulation. Collision detection finds all object pairs that
interact with one another. Then, islands are formed from interconnected objects.
Finally, the engine computes the new positions of all objects at island granularity. In
the normal flow, the physics simulation for all of the islands is done serially.
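The three steps above can be sketched as follows. This is a minimal illustration, not
ODE's actual code; the contact-pair list, the union-find island grouping, and the
solver call are all simplified stand-ins.

```python
# Sketch of one physics frame: collision detection -> island creation ->
# serial per-island simulation. Illustrative only; not ODE's implementation.

def find(parent, x):
    # Union-find root lookup with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def build_islands(num_bodies, contact_pairs):
    # Bodies connected by contacts/joints end up in the same island.
    parent = list(range(num_bodies))
    for a, b in contact_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    islands = {}
    for body in range(num_bodies):
        islands.setdefault(find(parent, body), []).append(body)
    return list(islands.values())

def step_world(num_bodies, detect_collisions, solve_island):
    pairs = detect_collisions()                 # step 1: collision detection
    islands = build_islands(num_bodies, pairs)  # step 2: island creation
    for island in islands:                      # step 3: serial per-island solve
        solve_island(island)
    return islands
```

For example, with contacts (0, 1) and (2, 3) among four bodies, `build_islands`
produces two islands: one containing bodies 0 and 1, the other bodies 2 and 3.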
Parallel Physics Simulation From the earlier instruction mix study, we see that physics
simulation (PS) code contributes the bulk of the instructions executed. On average, PS
accounts for 92% of the execution time across the suite on the four processors
presented earlier. Therefore, we first explore the limits of parallel physics
simulation by creating one thread for every island. This parallelization involves
farming the threads out to either logical or physical processors. The initial data
communication requirement is dictated by the number of bodies and joints in each island
spawned away from the original thread. Because every island is independent of the
others, only the final position data of objects needs to be communicated back to a
central thread at the end of each simulation step.
To evaluate the potential of this optimization, we first capture an upper bound on
performance with simulations that assume an unlimited supply of homogeneous cores and
ignore overheads from sources such as thread creation, thread migration, data
migration, and setup code. The data presented in Figure 4.9 contains both frame rate
and % frames satisfied. Because the 2-Cars and 10-Cars tests are easily satisfied, we
remove them from subsequent graphs and discussions to conserve space.
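One-thread-per-island can be sketched as below; `solve_island` stands in for the
per-island solver, and the thread pool stands in for the logical or physical
processors. Because islands share no state, no locking is needed beyond the final join
that merges results back into the central thread.

```python
from concurrent.futures import ThreadPoolExecutor

def step_islands_parallel(islands, solve_island, max_workers=None):
    # Farm each independent island out to a worker; collect the new
    # body positions once every island has been solved.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(solve_island, islands))
    return results  # per-island results, merged back by the central thread
```

`ThreadPoolExecutor.map` preserves input order, so the central thread can match each
result to its island by position.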
Figure 4.9: Alpha ISA Performance of Parallel Physics Simulation. [Chart: frame rate
(log scale) and % frames satisfied for each benchmark on Mobile, Console, Desktop, and
Server.]
For frames satisfied, Mobile is now able to satisfy 100% of 10-Cars, and Server now
satisfies 100% of FightQ. Frame rate improvement is consistent across the suite, with a
maximum of 516% and an average of 118%. However, the magnitude of the speedup from
parallel physics simulation varies between tests. The tests showing the least
improvement are CrashWaQ and EnvironQ, which each contain one large, complex island
simulating a brick wall. The wall's bricks apply contact forces on one another, and
this cannot be parallelized unless the bricks are pushed apart. In scenarios containing
one extremely complex island, the frame rate is dictated by the processing of that
island even with parallelization.
At the algorithm level, the QuickStep function used for these tests already estimates
the result by processing each object independently of the others. QuickStep uses an
iterative approach in which, during each iteration, (a) each body in an island is
essentially treated as a free body in space and solved independently of the others, and
(b) a constraint relaxation step progressively enforces the constraints by some small
amount. Constraint satisfaction improves with the number of iterations. Fine-grain
parallelization of the LCP solver is discussed in the next chapter.
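The two-phase iteration can be sketched as the following toy relaxation loop. This is
not QuickStep's LCP formulation; it only illustrates the pattern of a free-body step
followed by repeated small constraint corrections (here, a 1-D distance constraint
between body pairs, with a hypothetical relaxation factor `beta`).

```python
def relax_step(positions, velocities, constraints, dt, iterations=10, beta=0.2):
    # Phase (a): integrate each body as if it were free in space.
    pos = [p + v * dt for p, v in zip(positions, velocities)]
    # Phase (b): repeatedly nudge constrained pairs toward their rest
    # separation; more iterations -> better constraint satisfaction.
    for _ in range(iterations):
        for (i, j, rest) in constraints:
            error = (pos[j] - pos[i]) - rest
            pos[i] += beta * 0.5 * error
            pos[j] -= beta * 0.5 * error
    return pos
```

Each inner pass shrinks the constraint error by a factor of (1 - beta), so the residual
decays geometrically with the iteration count, mirroring the text's observation that
constraint satisfaction improves with more iterations.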
Even though our idealized coarse-grain parallel physics simulation is very effective,
we are still some distance from satisfying the demands of these benchmarks, especially
in the extreme cases described. With physics simulation parallelized, the percentage of
cycles spent on collision detection becomes much more significant: 20% on average, with
a maximum of 55%. Therefore, we consider performing collision detection in parallel as
well.
Parallel Collision Detection + Physics Simulation Figure 4.8 shows our implementation
of parallel collision detection through a hierarchy of collision spaces. Groups of
frequently interacting objects are inserted into the same subspace, and only the
subspaces are directly inserted into the root space. Note the bidirectional arrows
interconnecting the root space with the subspaces, representing two-way communication
between collision threads. During collision detection, the root thread handles any
collisions across subspaces, while each subspace handles collisions within its own
domain.
Figure 4.10: Alpha ISA Performance of Parallel Collision Detection + Physics
Simulation. [Chart: frame rate (log scale) and % frames satisfied for each benchmark on
Mobile, Console, Desktop, and Server.]
Take the 100CrSk test, for example: we can create 100 subspaces, each containing one
trio of a car and two humanoids. Because of each trio's physical location, no
interaction between subspaces occurs during execution. This allows us to completely
parallelize the test into fractions of the original task with minimal overhead on the
root space thread.
Frame rate and % frames satisfied are presented in Figure 4.10. In contrast to parallel
physics simulation, the results show significant improvements as well as degradations.
The most apparent change is that 100CrSk and 100CrSkQ now achieve more than 80%
satisfaction on the Console, Desktop, and Server processors. In addition, both FightQ
and BattleQ can now be 100% satisfied by Desktop and Server.
Even without taking certain overheads into account, parallel collision detection
degrades the performance of EnvironQ, Fight, and Battle. All three tests have
fast-changing island makeup along with collisions across the logical spaces created.
This indicates a need to use high-level context information to selectively enable
parallel collision detection.
With both optimizations enabled, a comparison of Figure 4.10 and Figure 4.5 shows that
Desktop and Server can achieve levels of user experience similar to that of the
unrealistic Ultimate design.
Resource Requirement for Parallelization To bound the resources necessary to achieve
the improvements presented above, we refer back to Table 4.1, which shows the
distribution of island counts and space counts across the benchmarks.
For parallel physics simulation, the maximum island count of each benchmark indicates
the number of cores needed to achieve the improvements shown earlier. As shown in
Table 4.1, the maximum island count in the suite is 337. However, far fewer cores may
suffice to achieve the performance shown, because simple islands can be processed
serially on one core. We present the resource requirements under optimal load balancing
for the Server processor in Table 4.3. The data shows that most benchmarks with high
island counts can actually be satisfied with fewer than 5 Server cores. The outlier,
100CarSk, can also be satisfied with a few cores, but we parallelized it into 100
worlds to show the opportunity for effective massive parallelization given
non-interacting virtual spaces.
Name            | 2-Cars | 10-Cars | CrashSk | EnvironQ | CrashWaQ | 100CarSk  | Battle  | Fight | Battle2Q
                | N / Q  | N / Q   | N / Q   | Q        | Q        | N / Q     | N / Q   | N / Q | Q
Number of Cores | 3 , 2  | 9 , 9   | 3 , 3   | 4        | 3        | 100 , 100 | 14 , 19 | 3 , 4 | 3
Table 4.3: Resource Requirement for Full Parallelization using Server Cores.
4.3.1 Real x86 Processor Evaluation
In this section, we explore the performance of PhysicsBench on a real x86 processor.
For this study, we used a 2.4 GHz Intel Pentium 4 Xeon CPU with a 512 KB L2 cache and
support for SSE/SSE2 instructions. This study parallels the one in Section 4.3.
Real Processor Methodology PhysicsBench 1.0 includes tests using both the big-matrix
and the iterative solvers. In this section, we focus on enabling real-time physics
simulation by accelerating the iterative solver (QuickStep), the faster of the two
approaches.
The PhysicsBench suite consists of two sets of source code. The first set contains
graphics code to allow for visual correctness inspection; the second contains only user
input and physics simulation code, for performance evaluation. We compiled binaries for
the x86 ISA using gcc version 3.4.5 at optimization level -O2 (recommended by ODE),
with single-precision floating point and the following flags: -ffast-math, -mmmx,
-msse2, -msse, -mfpmath=sse, and -march=pentium4. These options enable full SSE support
to exploit SIMD parallelism.
All benchmarks are warmed up for 3 frames, to get past setup code and warm up processor
resources, and then we execute 5 frames. We designed the benchmarks so that significant
activity is captured within these 5 frames (e.g., the actual collision of a car and a
skeleton, the crumbling of a wall).
We model a uniprocessor as the baseline for our study: it is the predominant execution
hardware for current gaming platforms and the architectural target of most current
physics engines, including ODE.
As described in the introduction and in [Wu05], the physics engine is interdependent
with the other software components of the application, including AI, game-play logic,
audio, IO, and graphics rendering. We therefore allocate 10% of each 1/30th-of-a-second
frame to physics simulation. While this is a conservative estimate, behavior under a
larger time allocation can be extrapolated from the presented results.
Figure 4.11: Instructions Per Frame for PhysicsBench. [Chart: instructions per frame,
log scale, for each benchmark and the harmonic mean.]
We consider two performance metrics: frame rate (frames per second) and the % of frames
computed within 10% of our 30 frames/sec constraint. The first metric is the harmonic
mean over all frames executed, giving some indication of how close we come, on average,
to meeting the frame constraint. The second metric indicates whether all frames were
computed in time. Ideally, we would like this metric to be as close to 100% as
possible, to provide stability and realism.
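These two metrics can be computed from per-frame times as follows (a minimal sketch;
reading the constraint as a 10% slack on the per-frame deadline is our interpretation):

```python
def metrics(frame_times, target_fps=30.0, slack=0.10):
    # Harmonic-mean frame rate: total frames / total time.
    rate = len(frame_times) / sum(frame_times)
    # A frame is "satisfied" if it finishes within 10% of the
    # 1/30 s deadline, i.e. within 1.1 * (1/30) seconds.
    deadline = (1.0 + slack) / target_fps
    satisfied = 100.0 * sum(t <= deadline for t in frame_times) / len(frame_times)
    return rate, satisfied
```

The harmonic mean is the natural average here because one slow frame drags the overall
rate down far more than one fast frame lifts it.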
We base our performance metrics on frames rather than instructions, as frames are a
more natural fit for interactive entertainment, particularly since the performance goal
is measured in frames per second. Figure 4.11 shows the number of instructions per
frame for each individual benchmark.
PhysicsBench Results Figure 4.12 presents results from actual runs of PhysicsBench on
the processor described in Section 4.3.1. It is clear why current interactive
entertainment applications rarely use realistic physics simulation to dynamically
generate content: in this suite of physics-only tests, our test processor can satisfy
only the demands of the simple scenarios described by 2-Cars, 10-Cars, and CrashSk.
We present the resource requirements under optimal load balancing in Table 4.4. The
data shows that most benchmarks with high island counts can actually be satisfied
Figure 4.12: x86 ISA PhysicsBench Performance. [Chart: frame rate (bars) and % frames
satisfied (secondary axis) for 2-Cars, 10-Cars, CrashSk, CrashWa, Environ, 100CrSk,
Battle, Fight, and Battle2, comparing the Pentium 4 Xeon against an ideal parallel P4
Xeon.]
with fewer than 5 cores. The outlier, 100CarSk, can also be satisfied with a few