A Hardware and Software Architecture for Pervasive Parallelism
Mark Christopher Jeffrey
Bachelor of Applied Science, University of Toronto (2009) Master of Applied Science, University of Toronto (2011)
Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2020

© Mark Christopher Jeffrey, MMXX. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, October 25, 2019

Certified by: Daniel Sanchez, Associate Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students
A Hardware and Software Architecture for Pervasive Parallelism
by Mark Christopher Jeffrey
Submitted to the Department of Electrical Engineering and Computer Science on October 25, 2019, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science
Abstract
Parallelism is critical to achieve high performance in modern computer systems. Unfortunately, most programs scale poorly beyond a few cores, and those that scale well often require heroic implementation efforts. This is because current parallel architectures squander most of the parallelism available in applications and are too hard to program.
This thesis presents Swarm, a new execution model, architecture, and system software that exploits far more parallelism than conventional multicores, yet is almost as easy to program as a sequential machine. Programmer-ordered tasks sit at the software-hardware interface. Swarm programs consist of tiny tasks, as small as tens of instructions each. Parallelism is dynamic: tasks can create new tasks at run time. Synchronization is implicit: the programmer specifies a total or partial order on tasks. This eliminates the correctness pitfalls of explicit synchronization (e.g., deadlock and data races). Swarm hardware uncovers parallelism by speculatively running tasks out of order, even thousands of tasks ahead of the earliest active task. Its speculation mechanisms build on decades of prior work, but Swarm is the first parallel architecture to scale to hundreds of cores due to its new programming model, distributed structures, and distributed protocols. Leaning on its support for task order, Swarm incorporates new techniques to reduce data movement, to speculate selectively for improved efficiency, and to compose parallelism across abstraction layers.
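To make the execution model concrete, the sketch below expresses single-source shortest paths as timestamp-ordered tasks and runs them under the model's sequential reference semantics: a priority queue executes tasks in timestamp order while tasks dynamically create children. The Task/enqueue interface here is illustrative only, not the actual Swarm API; Swarm hardware executes such tasks speculatively and in parallel while preserving this order.

// Sequential reference semantics of timestamp-ordered tasks (illustrative
// interface, not the Swarm API): each task carries a programmer-assigned
// timestamp, may enqueue new tasks at run time, and tasks run in timestamp order.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Task {
    uint64_t timestamp;            // programmer-specified order
    std::function<void()> run;     // tiny unit of work
    bool operator>(const Task& o) const { return timestamp > o.timestamp; }
};

// Earliest-timestamp-first queue of pending tasks.
std::priority_queue<Task, std::vector<Task>, std::greater<Task>> taskQueue;

void enqueue(uint64_t ts, std::function<void()> fn) {
    taskQueue.push({ts, std::move(fn)});
}

int main() {
    // Weighted directed graph as adjacency lists of (neighbor, weight) pairs.
    std::vector<std::vector<std::pair<int, int>>> adj = {
        {{1, 4}, {2, 1}}, {{3, 1}}, {{1, 2}, {3, 5}}, {}};
    std::vector<uint64_t> dist(adj.size(), std::numeric_limits<uint64_t>::max());

    // Visiting node v at timestamp d means "v is reachable with distance d".
    std::function<void(int, uint64_t)> visit = [&](int v, uint64_t d) {
        if (d >= dist[v]) return;            // an earlier-timestamp task already settled v
        dist[v] = d;
        for (auto [u, w] : adj[v])           // dynamic parallelism: create child tasks
            enqueue(d + w, [&, u, d, w] { visit(u, d + w); });
    };
    enqueue(0, [&] { visit(0, 0); });

    while (!taskQueue.empty()) {             // run tasks in timestamp order
        Task t = taskQueue.top();
        taskQueue.pop();
        t.run();
    }
    for (size_t v = 0; v < dist.size(); v++)
        std::printf("dist[%zu] = %llu\n", v, (unsigned long long) dist[v]);
    return 0;
}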
Swarm achieves efficient near-linear scaling to hundreds of cores on otherwise hard-to-scale irregular applications. These span a broad set of domains, including graph analytics, discrete-event simulation, databases, machine learning, and genomics. Swarm even accelerates applications that are conventionally deemed sequential. It outperforms recent software-only parallel algorithms by one to two orders of magnitude, and sequential implementations by up to 600× at 256 cores.
Thesis Supervisor: Daniel Sanchez
Title: Associate Professor of Electrical Engineering and Computer Science
Acknowledgments
This PhD journey at MIT has been extremely rewarding, challenging, and stimulating. I sincerely thank the many individuals who supported me along the way.
First and foremost, I am grateful to my advisor, Professor Daniel Sanchez. His breadth of knowledge and wide range of skills astounded me when I arrived at MIT and continue to impress me as I depart. My background was in parallel and distributed systems, and Daniel quickly brought me up to speed on computer architecture research. Early on, Daniel opined that a lot of fun happens at the interface, where you can change both hardware and software to create radical new designs. Now, I could not agree more. Daniel taught me how to think about big-picture challenges in computer science, frame tractable research problems, and use limit-study prototypes to quickly filter ideas as promising or not. He helped me improve my communication of insights and results to the broader community. He also taught me new tricks to debug low-level protocol deadlocks and system errors. Daniel exemplified strong leadership: he recognized and nurtured particular strengths in his students, and rallied strong teams to tackle big, important problems, while encouraging us to work on our weaknesses. Daniel gave me so much freedom throughout my time at MIT, yet was always available when I needed him.
I would like to thank my thesis committee members, Professor Joel Emer and Professor Arvind. Over the years, Joel taught me the importance of carefully distilling the contributions of a research project down to its core insights; implementation details are important, but secondary to understanding the insight. Throughout my career, I will strive to continue his principled approach to computer architecture research, and his welcoming, inclusive, and empathetic approach to mentorship. Arvind, who solved fundamental problems in parallel dataflow architectures and languages, provided a valuable perspective and important feedback on this work.
This thesis is the result of collaboration with an outstanding team of students: Suvinay Subramanian, Cong Yan, Maleen Abeydeera, Victor Ying, and Hyun Ryong (Ryan) Lee. Suvinay was a crucial partner in all of the projects of this thesis. I appreciate the hours we spent brainstorming, designing, and debugging. I learned a lot from his persistence, deep focus, and unbounded optimism. Cong invested a few months to wrangle a database application for our benchmark suite. We learned a lot from such a large workload. Maleen improved the modeling of our simulations, implemented several applications, and, bringing his expertise in hardware design, provided valuable feedback and fast prototypes to simplify our designs. Victor added important capabilities to our simulation in the Espresso and Capsules project, identified several opportunities for system optimization, and has been a crucial sounding board for the last three years. I am excited to see where his audacious work will lead next. Ryan implemented new applications that taught us important lessons on how to improve performance.
I am thankful to the members of the Sanchez group: Maleen Abeydeera, Nathan Beckmann, Nosayba El-Sayed, Yee Ling Gan, Harshad Kasture, Ryan Lee, Anurag Mukkara, Quan Nguyen, Suvinay Subramanian, Po-An Tsai, Cong Yan, Victor Ying, and Guowei Zhang. In addition to providing excellent feedback on papers and presentations, they brought fun and insight to my time at MIT, through technical discussions, teaching me history, eating, traveling to conferences, and hiking the trails of New York and New Hampshire. Thanks to my officemate, Po-An, who shared much wisdom about mentorship, research trajectories, and career decisions. I thoroughly enjoyed our one official collaboration on the MICRO-50 submissions server, and our travels around the globe.
I appreciate my dear friends across Canada and the United States, who made every moment outside the lab extraordinary.
Last but not least, an enormous amount of thanks goes to my family. My siblings encouraged me to start this PhD journey and provided valuable advice along the way. My parents boosted morale with frequent calls full of love and encouragement and relaxing visits back home. My nieces and nephews brought a world of fun and adventure every time we got together. I got to watch you grow up over the course of this degree. Finally, a very special thanks goes to my wife, Ellen Chan, who brings joy, fun, and order to my life. I could not have finished this thesis without her love and support.
I am grateful for financial support from C-FAR, one of six SRC STARnet centers sponsored by MARCO and DARPA; NSF grants CAREER-1452994, CCF-1318384, SHF-1814969, and NSF/SRC grant E2CDA-1640012; a grant from Sony; an MIT EECS Jacobs Presidential Fellowship; an NSERC Postgraduate Scholarship; and a Facebook Fellowship.
Contents
1.1 Challenges 2
1.2 Contributions 5
1.3 Thesis Organization 7

2 Background 9

2.1 Important Properties of Task-Level Parallelism 10
2.1.1 Task Regularity 10
2.1.2 Task Ordering 11
2.1.3 Task Granularity 12
2.1.4 Open Opportunity: Fine-Grain Ordered Irregular Parallelism 12
2.2 Exploiting Regular Parallelism 13
2.3 Exploiting Non-Speculative Irregular Parallelism 14
2.4 Exploiting Speculative Irregular Parallelism 18
2.4.1 Dynamic Instruction-Level Parallelism 19
2.4.2 Thread-Level Speculation 20
2.4.3 Transactional Memory 24

3 Swarm: A Scalable Architecture for Ordered Parallelism 27

3.1 Motivation 29
3.1.1 Understanding Ordered Irregular Parallelism 29
3.1.2 Analysis of Ordered Irregular Algorithms 31
3.1.3 Limitations of Thread-Level Speculation 33
3.3 Swarm Implementation 36
3.3.1 ISA Extensions 37
3.3.5 Selective Aborts 43
3.3.7 Handling Limited Queue Sizes 45
3.3.8 Analysis of Hardware Costs 46
3.4 Experimental Methodology 47
3.5.3 Swarm Analysis 52
3.5.4 Sensitivity Studies 55
3.6 Additional Related Work 56
3.7 Summary 58

4 Spatial Hints: Data-Centric Execution of Speculative Parallel Programs 59

4.1 Motivation 61
4.2.2 Hardware Mechanisms 64
4.3.1 Experimental Methodology 68
4.4 Improving Locality and Parallelism with Fine-Grain Tasks 74
4.4.1 Evaluation 77
4.6 Additional Related Work 81
4.6.1 Scheduling in Speculative Systems 81
4.6.2 Scheduling in Non-Speculative Systems 82
4.7 Summary 82
5 Harmonizing Speculative and Non-Speculative Execution in Architectures for Ordered Parallelism 85
5.1 Motivation 87
5.1.1 Speculation Benefits Are Input-Dependent 87
5.1.2 Combining Speculative and Non-Speculative Tasks 88
5.1.3 Software-Managed Speculation Improves Parallelism 89
5.2 Espresso Execution Model 90
5.2.1 Espresso Semantics 91
5.2.2 MAYSPEC: Tasks That May Speculate 93
5.2.3 Exception Model 93
5.3 Capsules 94
5.3.1 Untracked Memory 95
5.3.2 Safely Entering a Capsule 95
5.3.3 Capsule Execution 96
5.3.4 Capsule Programming Example 96
5.4 Implementation 97
5.4.1 Espresso Microarchitecture 97
5.4.2 Capsules Implementation 99
5.5 Evaluation 100
5.5.1 Methodology 101
5.5.2 Espresso Evaluation 102
5.5.3 Capsules Case Study: Dynamic Memory Allocation 104
5.5.4 Capsules Case Study: Disk-Backed Key-Value Store 106
5.6 Additional Related Work 106
5.6.1 Task Scheduling and Synchronization 107
5.6.2 Restricted vs. Unrestricted Speculative Tasks 107
5.6.3 Open-Nested Transactions 108
5.7 Summary