The University of Texas at El Paso
PIs: Pat Teller and Michael McGarry, University of Texas-El Paso Research Staff: Sarala Arunagiri Graduate Students: Ricardo Portillo (PhD), Felipe Jovel (MS), Joshua McCartney (MS), Salvador Melendez (PhD), Joshua McKee (MS), Enrique Portillo (MS), Ben Post (PhD) Undergraduate Students: Arturo Argueta, Adriana Contreras, Jaime Jaloma, Garrett Shaw (2013-2014 COURI Research Award)
Enabling Battlefield Decision-Making using Tactical Cloudlets
Optimal Scenario: Embedded devices facilitate time-critical decision making in isolation
ADVANCED ANALYTICS: analysis of IED attack data; sensor data (e.g., from airborne sensors, tactical sensors, and soldiers on the battlefield); fused sensor data; …
Tactical Processing Elements (in order of increasing capability):
• Wearable
• Ground-vehicle mounted
• Low-altitude aerial-vehicle mounted
• High-altitude aerial-vehicle mounted
• Airborne mid-scale compute cluster
• Grounded mid-scale compute cluster
• Grounded large-scale compute cluster

Tactical Computing Paradigms:
• Local Computing (MFLOPS to TFLOPS)
• Wireless Distributed Computing (WDC) (GFLOPS to TFLOPS)
• Cloud Computing (GFLOPS to PFLOPS)

Compute Performance:
• Embedded to multi-core processors (MFLOPS to GFLOPS)
• Up to many-core processors (GFLOPS to 1-2 TFLOPS)
• Up to 10,000s of clustered cores (10s-100s of TFLOPS)
• Up to 100,000s of cores (100s of TFLOPS to under 10 PFLOPS)

Network and Bandwidth:
• MANET (e.g., WNaN network); WNaN BW: 90 Kbps – 2 Mbps
• Regional Network and SATCOM-leveraged nodes, connected to one or more MANETs; SATCOM upload BW: 3.2 Mbps – 48 Mbps (present), 6.4 Mbps – 274 Mbps (future)
• Global Information Grid (GIG), SATCOM-accessible

Power for Computation:
• 1 to 10s of Watts
• 10s to 1000s of Watts
• 10s to 100s of Kilowatts
• 1 to 10s of Megawatts
Hurdles to Optimal Scenario
Compute performance, memory capacity, disk storage capacity, power/energy consumption, and access to remote data → INCREASE RESOURCE EFFICIENCY, a focus of our research
Alternative Scenario: Tactical clouds with embedded, mobile, and fixed compute platforms connected by wireless MANETs
INCREASE RESOURCE CAPACITY
Hurdles to Increased Resource Capacity – another research focus
• Job deadlines
• Informed application-to-architecture mappings
• Energy-efficient wireless device MANET communication
• Robust data transmission
• Effective scheduling
PROJECT OVERVIEW – 1/2
• Scientific problem: Develop techniques for use in tactical cloudlets that provide
  Power/energy-aware computing
  Informed application-to-architecture mappings
  Efficient inter-device communication and discovery of information that enables cloud scheduling
• Army relevance: The Army plans to use cloud computing at the tactical edge to provide commanders with better situational awareness and improve their ability to make informed decisions quickly [FedTechs eNewsletter, July 2013].
• Technical challenges:
  Diversity of battlefield applications w.r.t. computational complexity, data-set size, required time-to-solution, and power/energy consumption
  Static and mobile devices with different processing, memory, and power performance connected by wireless MANETs
• Objectives: Enable the use of tactical cloudlets in the battlefield through migration of applications to mobile computing devices and employment of efficient and cooperative resource provisioning and communication.
• Approach:
  1. Assemble the necessary performance-measurement infrastructure, including physical test beds, simulators, tools, and applications
  2. Collect performance data across the configurations under study
  3. Analyze the data
  4. When appropriate, develop mathematical models to gain insights
  5. Based on the findings, develop new techniques to meet the objectives
  6. Evaluate the performance of these new techniques via steps 2-5
• Related work:
  Power- and energy-aware computing: many groups, including ORNL, SDSC, and UT-Austin
  Application-to-architecture mappings: many groups, including DoE labs
  Efficient inter-device communication and discovery of information that facilitates cloud scheduling: IBM T.J. Watson Research Center, UC-Berkeley, Purdue, U Melbourne; neighbor discovery: U Mass, Northwestern, UC Riverside, LTS NCSC; OLSR-specific: Carleton U, U Antwerp, U Malaga
• Collaborations with Army: ARL-APG (Brian Henz, Song-Jun Park, and Dale Shires); CERDEC (Joseph Deroba and Russ Ruppe); ARL-Adelphi (Lam Nguyen)
• SAR Image Processing: double- to single-precision SIRE/RSM – reduced time to solution, up to 50% decrease in energy consumption, and comparable output quality
• Computer Vision: floating-point to integer stereo-matching cost function – 70% savings in execution time & energy consumption for 42-megapixel images, but negligible power savings for 1-megapixel images
Decrease Arithmetic Precision
[Diagram: resource-efficiency levers affected by decreased arithmetic precision – Data Footprint & Transfers, Compute Capability / Work Sharing, and Power Consumption]
In collaboration with ARL
• Focus: Floating-point precision
• Supplemented prior published work on SIRE/RSM (Synchronous Impulse REconstruction/Recursive Sidelobe Minimization) radar image formation (precision vs. quality/power) with an evaluation of SIRE image post-processing:
  Evaluated common Coherent- and Amplitude-Change-Detection methods
  Showed that lower-power single precision affects neither SIRE radar image quality nor change-detection outputs
A disparity map plus the reference 2D image contain the information needed to generate the 3D image.
Images courtesy of Michael Bleyer, Interactive Media Systems Group, Software Technology & Interactive Systems, Vienna University of Technology
• Motivation: Stereo Matching
  Transformed into finding a minimum-energy solution in a 2D dense neighborhood, an NP-complete problem that is approximately solved or reduced to a solvable case
  Core process of many military applications
  Compute intensive & suitable for GPGPU acceleration
• Goal: Employ arithmetic-precision findings to improve performance
• Foci: Develop a new integer-based cost function and evaluate its performance on CPU/GPGPU systems (a hedged sketch of such a cost function follows); investigate the effect of TDP (Thermal Design Power)
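As an illustration of the idea (not the project's published cost function, which is based on normalized cross-correlation), here is a minimal CUDA sketch of an all-integer matching cost: a sum of absolute differences over a window for one candidate disparity. The window radius and image layout are assumptions.

```cuda
#include <cstdint>
#include <cstdlib>

// Hypothetical integer matching cost: sum of absolute differences (SAD)
// over a (2*RADIUS+1)^2 window for one candidate disparity d. All
// arithmetic is integer, so the floating-point units are never exercised.
#define RADIUS 3

__global__ void sad_cost(const uint8_t* left, const uint8_t* right,
                         uint32_t* cost, int width, int height, int d)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // Skip pixels whose window or disparity shift would fall off the image.
    if (x < RADIUS + d || x >= width - RADIUS ||
        y < RADIUS     || y >= height - RADIUS)
        return;

    uint32_t sum = 0;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            int l = left [(y + dy) * width + (x + dx)];
            int r = right[(y + dy) * width + (x + dx - d)];  // disparity shift
            sum += abs(l - r);
        }
    cost[y * width + x] = sum;   // lower cost = better match at disparity d
}
```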
Effect of Integer-based Cost Function on CPU/GPGPU Performance
Percentage Difference: Floating-point (Baseline) vs. Integer-based Cost Functions
  smaller ~1-megapixel images → negligible power savings
  larger 42-megapixel images → 70% savings in execution time & energy consumption
Another example of how reduced arithmetic precision can result in improved performance with no or insignificant degradation of output quality
Effect of CPU TDP on CPU/GPGPU Performance: Stereo Matching
2 CPU/GPGPU systems: same GPGPU with host CPUs of different TDPs (45W & 100W)
Percent Performance Difference: Intel Xeon E3 1260L w/TDP 45W (baseline) vs. AMD A8 3850 w/TDP 100W host processors
For the CPU/GPGPU system with the smaller CPU TDP:
  smaller images → ~50% decrease in execution time & 5% decrease in avg. power, resulting in >50% decrease in energy consumption
  larger images → 70% savings in execution time & energy consumption
For some tactical HPC GPGPU-accelerated apps, a host with a lower TDP may provide better power performance without increasing execution time, resulting in decreased energy consumption.
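A quick sanity check of the smaller-image finding, using energy = average power × execution time: with t' = 0.50 t and P' = 0.95 P, E'/E = 0.95 × 0.50 = 0.475, i.e., a ~52.5% reduction in energy, consistent with the reported >50% decrease in energy consumption.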
GPU Single-precision Matrix Multiplication
• Motivation:
  Component of many application codes
  Limitation of the NVIDIA CUBLAS API, an implementation of the Basic Linear Algebra Subprograms (BLAS): all 3 matrices must fit in GPGPU memory
  CUSUMMA, an open-source, scalable, parallel CUDA implementation of the Scalable Universal Matrix Multiplication Algorithm (SUMMA) with self-tuning capabilities, employs CUBLAS and tiling on the host; the 3 matrices need not fit in GPGPU memory
• Goal: Develop a scalable, efficient, tiled matrix-multiplication implementation in CUDA with better performance than CUBLAS and CUSUMMA; explore the effect of pinned memory
• Foci:
  Increase the problem size that can be handled
  Improve power/energy and execution-time performance
• Test platform: dual AMD Opteron 6272 (16 cores and 64GB RAM) with an NVIDIA C2075 GPGPU (5.2GB global memory)
• Results:
  AHPCRC_Paged: 7-10% better performance than CUBLAS & CUSUMMA
  AHPCRC_Pinned: performance comparable to CUSUMMA when the result matrix C fits in <= 96% of GPGPU memory (only A & B tiled on the host), e.g., for the 1764MB and 4900MB problem sizes, & ~15% better performance than CUSUMMA for larger matrices, but at the cost of power and with no energy savings
(A hedged sketch of host-side tiling follows the charts below.)
[Chart: Percentage Difference vs. Problem Size (MB). The AHPCRC implementation employs CUBLAS and tiling on the host and is limited only by host memory.]
[Chart annotations: 21504 x 21504 is the limit of CUBLAS, for which all 3 matrices must fit in GPU memory; 59008 x 59008 is the limit of CUSUMMA, which tiles on the host.]
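To make the host-side tiling concrete, here is a hedged CUDA sketch in the spirit of the AHPCRC approach: tile A and B on the host and compute each block of C with a cuBLAS SGEMM call. The tiling policy, column-major layout, and divisibility assumption are ours, not the project's code; staging through pinned buffers allocated with cudaMallocHost instead of pageable memory is what an AHPCRC_Pinned-style variant would change.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Illustrative host-side tiling for single-precision C = A * B. Matrices
// are n x n, column-major (the cuBLAS convention); for brevity we assume
// n is a multiple of tile and omit error checks.
void tiled_sgemm(const float* A, const float* B, float* C, int n, int tile)
{
    cublasHandle_t h;
    cublasCreate(&h);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * tile * n);     // tile x n row panel of A
    cudaMalloc(&dB, sizeof(float) * n * tile);     // n x tile column panel of B
    cudaMalloc(&dC, sizeof(float) * tile * tile);  // tile x tile block of C
    const float one = 1.0f, zero = 0.0f;

    for (int j = 0; j < n; j += tile) {
        // Columns j..j+tile-1 of B are contiguous in column-major storage.
        cudaMemcpy(dB, B + (size_t)j * n, sizeof(float) * n * tile,
                   cudaMemcpyHostToDevice);
        for (int i = 0; i < n; i += tile) {
            // Rows i..i+tile-1 of A are strided; gather them with a 2D copy.
            // Staging via pinned (cudaMallocHost) host buffers would speed
            // these transfers up (the AHPCRC_Pinned-style idea).
            cudaMemcpy2D(dA, sizeof(float) * tile, A + i, sizeof(float) * n,
                         sizeof(float) * tile, n, cudaMemcpyHostToDevice);
            // (tile x n) * (n x tile) -> (tile x tile) on the GPU via cuBLAS.
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, tile, tile, n,
                        &one, dA, tile, dB, n, &zero, dC, tile);
            // Scatter the finished block back into C at block (i, j).
            cudaMemcpy2D(C + (size_t)j * n + i, sizeof(float) * n,
                         dC, sizeof(float) * tile, sizeof(float) * tile, tile,
                         cudaMemcpyDeviceToHost);
        }
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(h);
}
```

Because only one panel of A, one panel of B, and one block of C reside on the GPU at a time, the problem size is bounded by host memory rather than GPU memory, which is the property the chart annotation highlights.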
Improve Power Consumption: GPU Single-precision Matrix Multiplication
[Charts: Percentage Difference in average power vs. Problem Size (MB) – AHPCRC_Paged vs. CUSUMMA, AHPCRC_Paged vs. CUBLAS, AHPCRC_Pinned vs. CUSUMMA, and AHPCRC_Pinned vs. CUBLAS. AHPCRC_Paged has comparable average power consumption; AHPCRC_Pinned consumes more power for large matrices.]
Improve Energy Performance: Matrix Multiplication
[Charts: Percentage Difference in energy vs. Problem Size (MB) – AHPCRC_Paged vs. CUSUMMA, AHPCRC_Paged vs. CUBLAS, AHPCRC_Pinned vs. CUSUMMA, and AHPCRC_Pinned vs. CUBLAS. AHPCRC_Paged saves ~4-8% average energy consumption; AHPCRC_Pinned saves ~5-6% for smaller matrices and is comparable to CUSUMMA for larger matrices, while requiring ~15% less execution time.]
Power/Energy-Aware Computing: Fine-Grain DVFS
• State-of-the-Art GPU DVFS
  Throttles GPU speed to save power and energy at coarse grain (millisecond scale)
  GPU application phases change at microsecond to millisecond scale
  This leads to wasted power when an application phase does not need to run at full throttle
• Our Approach
  Develop a method that quantifies the benefit of fine-grain DVFS with current hardware (a hedged sketch of in-code frequency scaling follows)
  Aim to motivate hardware vendors to supply fine-grain DVFS capability
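For context, a hedged sketch of the kind of "in-code frequency scaling" the roadmap below lists: wrapping a kernel phase with system() calls to NVIDIA's nvidia-smi utility to change application clocks. The clock pairs are hypothetical and must be taken from `nvidia-smi -q -d SUPPORTED_CLOCKS`; setting clocks typically requires administrative privileges, and the millisecond-scale latency of this path is precisely the coarseness the study quantifies.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hedged sketch of coarse "in-code" GPU DVFS: wrap a kernel phase with
// system() calls to nvidia-smi's application-clocks control. The (memory,
// graphics) MHz pairs below are hypothetical; query the board's supported
// pairs with `nvidia-smi -q -d SUPPORTED_CLOCKS`. Works only on boards
// that expose application clocks.
static void set_app_clocks(int mem_mhz, int gr_mhz)
{
    char cmd[128];
    snprintf(cmd, sizeof(cmd), "nvidia-smi -ac %d,%d", mem_mhz, gr_mhz);
    if (system(cmd) != 0)
        fprintf(stderr, "warning: could not set application clocks\n");
}

__global__ void memory_bound_phase(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;  // streaming, low arithmetic intensity
}

int main()
{
    int n = 1 << 24;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    // A memory-bound phase rarely needs peak core clocks: drop the graphics
    // clock for this phase, then restore it afterward.
    set_app_clocks(2600, 324);                       // hypothetical low pair
    memory_bound_phase<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();
    set_app_clocks(2600, 758);                       // hypothetical high pair

    cudaFree(x);
    return 0;
}
```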
[Diagram: resource-efficiency levers – Improve Algorithm Efficiency, Data Footprint & Transfers, Compute Capability / Work Sharing, and Power Consumption – with Fine-Grain DVFS highlighted]
In collaboration with NVIDIA
Power/Energy-Aware Computing: Fine-Grain Power Measurement
Custom Power Measurement Board
• At least as accurate as NVIDIA's onboard GPU power sensors
• 50 times faster power sampling rate (microsecond-scale granularity)
(A sketch of converting such samples to energy follows.)
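Turning microsecond-scale power samples into energy figures is a small numerical-integration step; a minimal sketch (trapezoidal rule; the types and names are illustrative, not the project's analysis software):

```cpp
#include <vector>

// Illustrative conversion of timestamped power samples to energy:
// trapezoidal integration of P(t). Timestamps in seconds, power in watts.
struct PowerSample { double t; double watts; };

double energy_joules(const std::vector<PowerSample>& s)
{
    double e = 0.0;
    for (size_t i = 1; i < s.size(); ++i)
        e += 0.5 * (s[i].watts + s[i - 1].watts) * (s[i].t - s[i - 1].t);
    return e;   // joules = integral of watts over seconds
}
```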
Power/Energy-Aware Computing: Sample Finding
Comparing the fastest static frequency against the lowest-energy static frequency:
  Lowest-energy (static): 36.5% energy reduction at a 28.8% performance cost
  Optimal (fine-grain): 36.5% energy reduction at an 11.5% performance cost
Same energy benefit, less of a performance hit.
Power/Energy-Aware Computing: Fine-grain DVFS Study
Fine-grain DVFS Study (in detail)
• Will quantify the energy benefits a DVFS implementation could obtain with full knowledge of application nano/micro phases and, thus, the potential of fine-grain, application-aware GPU DVFS
• Will motivate GPU hardware developers to include finer-grain and faster DVFS technologies on their boards
• Workloads: a comprehensive set of general GPU workloads – synthetic stressmarks, GPU benchmarks (Rodinia, SHOC, Parboil, CUDA SDK), and custom Army SIRE/RSM versions with varying degrees of compute and memory intensity
Application-Aware GPU Power Management: Roadmap
• DEC 2013 – Evaluation and development of experimental infrastructure:
  – Evaluated GPU DVFS enforcement capability (latency and granularity)
  – Enabled in-code frequency scaling (system calls to NVIDIA's DVFS utility)
  – Enabled workload idleness characterization (CUPTI/Vampir profiling tools)
  – Improved accuracy and granularity of power monitoring (custom power monitoring boards)
  – Improved post-processing of large power-measurement datasets (SPSS and custom analysis software)
• JAN 2014 – Comprehensive static GPU frequency study; target: HPCA/SHAW-5 2014 (Dec 2013 deadline)
• FEB 2014 – Proof-of-concept application-aware GPU DVFS on Army workloads
• SPR 2014 – PhD dissertation, "The Unrealized Potential of DVFS Power-Management for GPGPU Applications"
We are now working on the tasks below.
[Diagram: research thrusts]
• Meet Job Deadlines within Tactical Cloudlets: Informed Application-to-Architecture Mappings; Effective Cloud Scheduling; Efficient MANET Communication & Data Transmission
• Cooperative Resource Provisioning & Communication: Energy- & Mobility-aware Communication; Reserved Data Transmission Paths; Job Migration
Meet Job Deadlines within Cloudlets: Informed Application-to-Architecture Mappings
• Performance Study of LULESH 1.0
  Proxy code that represents a typical hydrocode
  – Ported to multiple programming languages
  – Optimized to run on multiple architectures
  Executed on Sandy Bridge, Xeon Phi, Kepler GPU, and Fermi GPU
  – Measured and now analyzing execution time, speedup, parallel efficiency, overheads, effectiveness of vectorization, cost of serial sections, memory performance, and power/energy consumption
  – Map code-segment characteristics to differences in execution time and power/energy consumption
In collaboration with the Texas Advanced Computing Center, LLNL, Intel, and NVIDIA
• Motivation: Emerging Technologies
  Different programming models and different architectures
  Disparate costs associated with processing, memory access, synchronization, and parallelism
• Goal: Using LULESH 1.0, a Lawrence Livermore National Laboratory (LLNL) proxy code ported to several programming models and tuned by experts to perform well on multiple architectures, identify program characteristics that lead to performance variations on diverse computer architectures
• Foci:
  Study the execution behavior of code segments optimized for three state-of-the-art accelerators – the Intel Xeon Phi co-processor and the NVIDIA Fermi and Kepler GPGPUs – along with the Intel Sandy Bridge multi-core processor (baseline)
  Map code-segment characteristics to differences in execution time and power/energy consumption
Execution Time Comparison: Sample Finding
LULESH 1.0 Execution Time Comparison (sec)

Processor / Problem Size | 50^3   | 70^3   | 90^3
Sandy Bridge (Parallel)  | 13.083 | 53.113 | 140.844
Kepler                   | 2.420  | 7.302  | 17.022
Fermi                    | 17.106 | 52.281 | 121.428
MIC (Xeon Phi)           | 26.875 | 82.754 | 300.032
Beta (Sandy Bridge)      | 10.923 | 48.322 | 126.864
Beta (Xeon Phi)          | 15.287 | 45.818 | 121.402
Solve time of LULESH 1.0 run on one node, not including the initialization and termination portions. Except for the Xeon Phi, the codes were optimized for their architectures.
Data-layout optimizations in the beta code result in a 1.75x to 2.47x improvement.
Meet Job Deadlines: Efficient MANET Communication
Goal: Reduce the energy consumed by wireless MANET routing functions
Current Focus: Discover ways to adapt OLSR operating parameters to changing mobility patterns
Experimental Platform: Developed & validated a 6-node MANET experimental platform with OLSR routing and topology changes via iptables configuration
Performance Study: Using the test bed to understand the relationship between the time intervals between topological changes and the values of OLSR operating parameters
• Motivation: Energy-aware Tactical Cloud Communication
  Resource-constrained communication devices
  Lack of a fixed role for each device (host or packet switch)
  Use of free space as a transmission medium, which suffers from transmission impairments
• Goal: Reduce the energy consumed by battlefield cloud communication; in particular, reduce the energy consumed by wireless MANET routing functions
• Foci:
  Develop physical and simulation test beds
  Understand the relationship between the time intervals between topological changes and the values of OLSR operating parameters, such as the time intervals between transmissions of Hello messages and Topology Control (TC) messages
  Discover ways to dynamically adapt OLSR operating parameters to changing mobility patterns
Designed & commenced an experimental plan to identify mechanisms to reduce the energy consumed by the OLSR neighbor-discovery process:
  o Understand the relationship between topological-change time intervals & the values of OLSR operating parameters, such as the transmission frequency of Hello messages and Topology Control messages
  o Discover ways to dynamically adapt OLSR operating parameters to changing mobility patterns
Designed, developed, & validated a 6-node MANET experimental platform that uses OLSR for routing and "simulates" mobility through topology changes via iptables configuration.
Started experiments to understand the relationship between topological-change time intervals & OLSR operating parameters.
Started design and development of a simulator to conduct complementary experiments. (A sketch of the kind of parameter adaptation under study follows.)
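As a sketch of the kind of adaptation under study, not the project's method (which these experiments are meant to discover): scale OLSR's beaconing intervals with the observed mean time between topology changes. Only the RFC 3626 defaults of 2 s for HELLO_INTERVAL and 5 s for TC_INTERVAL are standard; the adaptation rule and constants below are hypothetical.

```cpp
#include <algorithm>

// Hypothetical adaptation rule: scale OLSR's HELLO_INTERVAL and TC_INTERVAL
// with the observed mean time between topology changes. RFC 3626 defaults
// are 2 s (HELLO) and 5 s (TC); the divisors and clamp bounds here are
// illustrative only.
struct OlsrIntervals { double hello_s; double tc_s; };

OlsrIntervals adapt_intervals(double mean_change_interval_s)
{
    // Faster-changing topology -> more frequent control messages (more
    // energy, fresher routes); stable topology -> slower beaconing.
    double hello = std::clamp(mean_change_interval_s / 10.0, 0.5, 8.0);
    double tc    = std::clamp(mean_change_interval_s / 4.0,  1.0, 20.0);
    return { hello, tc };
}
```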
Meet Job Deadlines: Effective Cloud Scheduling
• Estimate the times at which data transfers commence; reserve transmission paths to deliver data on time
• Leverage knowledge of temporarily fixed and expected future locations of platforms to
  – Decrease communication overhead
  – Initiate necessary job migration/replication
• By December: an easily scalable, fully functional Nimbus cloud test bed connected via a wireless ad hoc network that can collect power and memory metrics for each node for use by the scheduler
• Cooperative Cloud Scheduling and Communication
In collaboration with FutureGrid
Cooperative Cloud Scheduling & Communication – 1/2
• Motivation:
  Program and computer-platform characteristics can be used to estimate the times at which data transfers will commence. To meet job deadlines, transmission paths could be reserved to deliver data on time (a sketch follows).
  Knowledge of the non-mobility and/or travel paths of devices could be used to decrease the communication overhead associated with neighbor discovery.
  Knowledge of devices moving out of range could initiate the migration of jobs.
• Goal: Pursue the dual objectives of reducing energy consumption and increasing the number of jobs that meet their solution deadlines.
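To make the first motivation concrete: given an estimated transfer size and a reserved path's bandwidth, the latest moment a transfer can start and still meet a job's deadline is simple arithmetic. A minimal sketch, with all names and the slack margin as assumptions rather than project-defined quantities:

```cpp
#include <optional>

// Illustrative latest-start-time computation for a reserved transfer path.
// A transfer of `bytes` over a path of `bandwidth_bps` (bits/s) must finish
// early enough to leave `compute_time_s` before `deadline_s`; the slack
// term is a made-up safety margin.
std::optional<double> latest_transfer_start(double bytes, double bandwidth_bps,
                                            double deadline_s,
                                            double compute_time_s,
                                            double slack_s = 1.0)
{
    double transfer_s = bytes * 8.0 / bandwidth_bps;        // transmission delay
    double start = deadline_s - compute_time_s - transfer_s - slack_s;
    if (start < 0.0)
        return std::nullopt;   // deadline cannot be met; consider job migration
    return start;
}
```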
Cooperative Cloud Provisioning & Communication – 2/2
• Progress:
  Experimental Platform: Designed & developed a Nimbus cloud test bed comprised of 4 mobile and static computing devices connected via a wireless ad hoc network (a master/controller node, a head node, and two hypervisor nodes). Each hypervisor device hosts multiple virtual machines.
  By December 2013:
  o An easily scalable, fully functional cloud test bed connected via a wireless ad hoc network that can collect power and memory metrics for each node for use by the scheduler
  o Documentation with all relevant codes and information required to reproduce this work and create such a test bed
This work is a collaboration with FutureGrid, which described it as "cool, groundbreaking cloud research". (It is primarily being done by a graduate student who is not supported by this grant.)
APP 2013-2014 – 1/2
Research Objective 1: Analysis of applications and emerging technologies to inform application-to-architecture mappings and the employment of optimizations that will result in "best" execution-time and power performance.
• Deliverable #1: Identification of code attributes (e.g., memory-access behavior, vectorization potential, instructions per cycle, and branch behavior) that highly influence an application's execution-time and power performance on sets of light-weight cores (e.g., Xeon Phi), GPUs, and sets of heavy-weight cores (e.g., Sandy Bridge)
• Deliverable #2: Development of guidelines that inform the mapping of code segments and/or applications to these computing architectures
• Deliverable #3: Dissemination of findings in technical reports, conference proceedings, and journals
APP 2013-2014 – 2/2
Research Objective 2: Investigation of cooperative resource provisioning and communication mechanisms for tactical cloud management to reduce the number of missed application-execution deadlines and energy consumption.
• Deliverable #1: Definition of mechanisms in which the resource-provisioning system and communication protocols cooperate to reduce the number of missed application-execution deadlines and energy consumption
• Deliverable #2: Development of analytical frameworks, simulation models, and a physical test bed for performance evaluation of the defined cooperative mechanisms
• Deliverable #3: Preliminary performance evaluation of the defined mechanisms using analytical methods, simulation experiments, and physical experiments
• Deliverable #4: Dissemination of findings in technical reports, conference proceedings, and journals
PROJECT PUBLICATIONS FY 2013 – 1/2
Refereed publications and reports (* students)
• Sarala Arunagiri, Jaime Jaloma*, Ricardo Portillo*, and Arturo Argueta*, "Power and Execution Performance Tradeoffs of GPGPU Computing: a Case Study Employing Stereo Matching," in Proceedings of the Image Processing: Machine Vision Applications VI Conference (part of IS&T/SPIE Electronic Imaging), San Francisco, CA, February 3-7, 2013.
• Sarala Arunagiri and Jaime Jaloma*, "Parallel GPGPU Stereo Matching with an Energy-Efficient Cost Function Based on Normalized Cross Correlation," in Proceedings of the Image Processing: Algorithms and Systems XI Conference (part of IS&T/SPIE Electronic Imaging), San Francisco, CA, February 3-7, 2013.
Papers in progress
• Esthela Gallardo*, Patricia J. Teller, Jaime Jaloma, Ian Karlin, Arturo Argueta, Edgar Leon, and James Browne, "Accelerators: the Good, the Bad, and the Ugly," target: TBD.
• Ricardo Portillo* and Patricia J. Teller, "The Potential of Application-aware DVFS for GPGPU Scientific Workloads," target: HPCA/SHAW-5 2014 (Dec 2013 deadline).
PROJECT PUBLICATIONS FY 2013 – 2/2
Dissertations, Master's Theses, and Technical Reports
• Ricardo Portillo, "The Unrealized Potential of DVFS Power-Management for GPGPU Applications," PhD dissertation, UTEP, Department of Computer Science, expected Spring 2014.
• Joshua McCartney, "Towards Understanding the Impact of OLSR HELLO and TC Message Frequency on Network Performance," M.S. thesis, UTEP, Department of Computer Science, expected Fall 2013.
• Joshua McKee, "Setting Up a Highly Configurable, Scalable Cloud using Nimbus and Phantom," M.S. project, UTEP, Department of Computer Science, expected Fall 2013.
• Enrique Portillo, "Efficient, Scalable, Parallel Matrix-Matrix Multiply," M.S. thesis, UTEP, Department of Computer Science, expected Fall 2013.
SUMMARY
• 2013 Research focused on enabling Tactical Cloud computing
Main Focus: Power/Energy-Aware Computing – Decrease arithmetic precision (SAR Image Processing and