The University of Texas at El Paso
PIs: Pat Teller and Michael McGarry, University of Texas-El Paso Research Staff: Sarala Arunagiri Graduate Students: Ricardo Portillo (PhD), Felipe Jovel (MS), Joshua McCartney (MS), Salvador Melendez (PhD), Joshua McKee (MS), Enrique Portillo (MS), Ben Post (PhD) Undergraduate Students: Arturo Argueta, Adriana Contreras, Jaime Jaloma, Garrett Shaw (2013-2014 COURI Research Award)
Enabling Battlefield Decision-Making using Tactical Cloudlets
Optimal Scenario: Embedded devices facilitate time-critical decision making in isolation
ADVANCED ANALYTICS: analysis of IED attack data; sensor data (e.g., from airborne sensors, tactical sensors, and soldiers on the battlefield); fused sensor data; …
Tactical Processing Elements (in order of increasing capability):
• Wearable
• Ground-vehicle mounted
• Low-altitude aerial-vehicle mounted
• High-altitude aerial-vehicle mounted
• Airborne mid-scale compute cluster
• Grounded mid-scale compute cluster
• Grounded large-scale compute cluster

Tactical Computing Paradigms:
• Local Computing (MFLOPS to TFLOPS)
• Wireless Distributed Computing (WDC) (GFLOPS to TFLOPS)
• Cloud Computing (GFLOPS to PFLOPS)

Compute Performance:
• Embedded to multi-core processors (MFLOPS to GFLOPS)
• Up to many-core processors (GFLOPS to 1-2 TFLOPS)
• Up to 10,000s of clustered cores (10s-100s of TFLOPS)
• Up to 100,000s of cores (100s of TFLOPS to under 10 PFLOPS)

Network and Bandwidth:
• MANET (e.g., WNaN network); WNaN BW: 90 Kbps – 2 Mbps
• Regional Network and SATCOM-leveraged nodes, connected to one or more MANETs; SATCOM upload BW: 3.2 Mbps – 48 Mbps (present), 6.4 Mbps – 274 Mbps (future)
• Global Information Grid (GIG), SATCOM-accessible

Power for Computation:
• 1 to 10s of Watts
• 10s to 1000s of Watts
• 10s to 100s of Kilowatts
• 1 to 10s of Megawatts
Hurdles to Optimal Scenario
Compute performance, memory capacity, disk storage capacity, power/energy consumption, and access to remote data → INCREASE RESOURCE EFFICIENCY, a focus of our research
Alternative Scenario: Tactical clouds with embedded, mobile, and fixed compute platforms connected by wireless MANETs
INCREASE RESOURCE CAPACITY
Hurdles to Increased Resource Capacity – another research focus
• Job deadlines
• Informed application-to-architecture mappings
• Energy-efficient wireless device MANET communication
• Robust data transmission
• Effective scheduling
PROJECT OVERVIEW – 1/2
• Scientific problem: Develop techniques for use in tactical cloudlets that provide
  Power/energy-aware computing
  Informed application-to-architecture mappings
  Efficient inter-device communication and discovery of information that enables cloud scheduling
• Army relevance: The Army plans to use cloud computing at the tactical edge to provide commanders with better situational awareness and improve their ability to make informed decisions quickly [FedTechs eNewsletter, July 2013].
• Technical challenges:
  Diversity of battlefield applications w.r.t. computational complexity, data-set size, required time-to-solution, and power/energy consumption
  Static and mobile devices with different processing, memory, and power performance connected by wireless MANETs
• Objectives: Enable the use of tactical cloudlets in the battlefield through migration of applications to mobile computing devices and employment of efficient and cooperative resource provisioning and communication.
• Approach:
  1. Assemble the necessary performance-measurement infrastructure, including physical test beds, simulators, tools, and applications
  2. Collect performance data across the configurations under study
  3. Analyze the data
  4. When appropriate, develop mathematical models to gain insights
  5. Based on the findings, develop new techniques to meet the objectives
  6. Evaluate the performance of these new techniques via steps 2-5
• Related work:
  Power- and energy-aware computing: many groups, including ORNL, SDSC, and UT-Austin
  Application-to-architecture mappings: many groups, including DoE labs
  Efficient inter-device communication and discovery of information that facilitates cloud scheduling: IBM T.J. Watson Research Center, UC-Berkeley, Purdue, U Melbourne; neighbor discovery: U Mass, Northwestern, UC Riverside, LTS NCSC; OLSR-specific: Carleton U, U Antwerp, U Malaga
• Collaborations with Army: ARL-APG (Brian Henz, Song-Jun Park, and Dale Shires); CERDEC (Joseph Deroba and Russ Ruppe); ARL-Adelphi (Lam Nguyen)
• SAR Image Processing: double- to single-precision SIRE/RSM – reduced time to solution, up to 50% decrease in energy consumption, and comparable output quality
• Computer Vision: floating-point to integer stereo-matching cost function – 70% savings in execution time & energy consumption for 42-megapixel images, but negligible power savings for 1-megapixel images
Decrease Arithmetic Precision
[Diagram: resource-efficiency levers affected by decreased arithmetic precision – Data Footprint & Transfers, Compute Capability / Work Sharing, and Power Consumption]
In collaboration with ARL
• Focus: Floating-point precision
• Supplemented prior published work on SIRE/RSM (Synchronous Impulse REconstruction/Recursive Sidelobe Minimization) radar image formation (precision vs. quality/power) with an evaluation of SIRE image post-processing:
  Evaluated common Coherent- and Amplitude-Change-Detection methods
  Showed that lower-power single precision affects neither SIRE radar image quality nor change-detection outputs
A disparity map plus the reference 2D image contain the information needed to generate the 3D image.
Images courtesy of Michael Bleyer, Interactive Media Systems Group, Software Technology & Interactive Systems, Vienna University of Technology
• Motivation: Stereo Matching
  Transformed into finding a minimum-energy solution in a 2D dense neighborhood, an NP-complete problem that is approximately solved or reduced to a solvable case
  Core process of many military applications
  Compute intensive & suitable for GPGPU acceleration
• Goal: Employ arithmetic-precision findings to improve performance
• Foci: Develop a new integer-based cost function and evaluate its performance on CPU/GPGPU systems (a hedged sketch of such a cost function follows); investigate the effect of TDP (Thermal Design Power)
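As an illustration of the idea (not the project's published cost function, which is based on normalized cross-correlation), here is a minimal CUDA sketch of an all-integer matching cost: a sum of absolute differences over a window for one candidate disparity. The window radius and image layout are assumptions.

```cuda
#include <cstdint>
#include <cstdlib>

// Hypothetical integer matching cost: sum of absolute differences (SAD)
// over a (2*RADIUS+1)^2 window for one candidate disparity d. All
// arithmetic is integer, so the floating-point units are never exercised.
#define RADIUS 3

__global__ void sad_cost(const uint8_t* left, const uint8_t* right,
                         uint32_t* cost, int width, int height, int d)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // Skip pixels whose window or disparity shift would fall off the image.
    if (x < RADIUS + d || x >= width - RADIUS ||
        y < RADIUS     || y >= height - RADIUS)
        return;

    uint32_t sum = 0;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            int l = left [(y + dy) * width + (x + dx)];
            int r = right[(y + dy) * width + (x + dx - d)];  // disparity shift
            sum += abs(l - r);
        }
    cost[y * width + x] = sum;   // lower cost = better match at disparity d
}
```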
Effect of Integer-based Cost Function on CPU/GPGPU Performance
Percentage Difference: Floating-point (Baseline) vs. Integer-based Cost Functions
  smaller ~1-megapixel images → negligible power savings
  larger 42-megapixel images → 70% savings in execution time & energy consumption
Another example of how reduced arithmetic precision can result in improved performance with no or insignificant degradation of output quality
Effect of CPU TDP on CPU/GPGPU Performance: Stereo Matching
2 CPU/GPGPU systems: same GPGPU with host CPUs of different TDPs (45W & 100W)
Percent Performance Difference: Intel Xeon E3 1260L w/TDP 45W (baseline) vs. AMD A8 3850 w/TDP 100W host processors
For the CPU/GPGPU system with the smaller CPU TDP:
  smaller images → ~50% decrease in execution time & 5% decrease in avg. power, resulting in >50% decrease in energy consumption
  larger images → 70% savings in execution time & energy consumption
For some tactical HPC GPGPU-accelerated apps, a host with a lower TDP may provide better power performance without increasing execution time, resulting in decreased energy consumption.
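A quick sanity check of the smaller-image finding, using energy = average power × execution time: with t' = 0.50 t and P' = 0.95 P, E'/E = 0.95 × 0.50 = 0.475, i.e., a ~52.5% reduction in energy, consistent with the reported >50% decrease in energy consumption.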
GPU Single-precision Matrix Multiplication
• Motivation:
  Component of many application codes
  Limitation of the NVIDIA CUBLAS API, an implementation of the Basic Linear Algebra Subprograms (BLAS): all 3 matrices must fit in GPGPU memory
  CUSUMMA, an open-source, scalable, parallel CUDA implementation of the Scalable Universal Matrix Multiplication Algorithm (SUMMA) with self-tuning capabilities, employs CUBLAS and tiling on the host; the 3 matrices need not fit in GPGPU memory
• Goal: Develop a scalable, efficient, tiled matrix-multiplication implementation in CUDA with better performance than CUBLAS and CUSUMMA; explore the effect of pinned memory
• Foci:
  Increase the problem size that can be handled
  Improve power/energy and execution-time performance
• Test platform: dual AMD Opteron 6272 (16 cores and 64GB RAM) with an NVIDIA C2075 GPGPU (5.2GB global memory)
• Results:
  AHPCRC_Paged: 7-10% better performance than CUBLAS & CUSUMMA
  AHPCRC_Pinned: performance comparable to CUSUMMA when the result matrix C fits in <= 96% of GPGPU memory (only A & B tiled on the host), e.g., for the 1764MB and 4900MB problem sizes, & ~15% better performance than CUSUMMA for larger matrices, but at the cost of power and with no energy savings
(A hedged sketch of host-side tiling follows the charts below.)
[Chart: Percentage Difference vs. Problem Size (MB). The AHPCRC implementation employs CUBLAS and tiling on the host and is limited only by host memory.]
[Chart annotations: 21504 x 21504 is the limit of CUBLAS, for which all 3 matrices must fit in GPU memory; 59008 x 59008 is the limit of CUSUMMA, which tiles on the host.]
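To make the host-side tiling concrete, here is a hedged CUDA sketch in the spirit of the AHPCRC approach: tile A and B on the host and compute each block of C with a cuBLAS SGEMM call. The tiling policy, column-major layout, and divisibility assumption are ours, not the project's code; staging through pinned buffers allocated with cudaMallocHost instead of pageable memory is what an AHPCRC_Pinned-style variant would change.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Illustrative host-side tiling for single-precision C = A * B. Matrices
// are n x n, column-major (the cuBLAS convention); for brevity we assume
// n is a multiple of tile and omit error checks.
void tiled_sgemm(const float* A, const float* B, float* C, int n, int tile)
{
    cublasHandle_t h;
    cublasCreate(&h);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * tile * n);     // tile x n row panel of A
    cudaMalloc(&dB, sizeof(float) * n * tile);     // n x tile column panel of B
    cudaMalloc(&dC, sizeof(float) * tile * tile);  // tile x tile block of C
    const float one = 1.0f, zero = 0.0f;

    for (int j = 0; j < n; j += tile) {
        // Columns j..j+tile-1 of B are contiguous in column-major storage.
        cudaMemcpy(dB, B + (size_t)j * n, sizeof(float) * n * tile,
                   cudaMemcpyHostToDevice);
        for (int i = 0; i < n; i += tile) {
            // Rows i..i+tile-1 of A are strided; gather them with a 2D copy.
            // Staging via pinned (cudaMallocHost) host buffers would speed
            // these transfers up (the AHPCRC_Pinned-style idea).
            cudaMemcpy2D(dA, sizeof(float) * tile, A + i, sizeof(float) * n,
                         sizeof(float) * tile, n, cudaMemcpyHostToDevice);
            // (tile x n) * (n x tile) -> (tile x tile) on the GPU via cuBLAS.
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, tile, tile, n,
                        &one, dA, tile, dB, n, &zero, dC, tile);
            // Scatter the finished block back into C at block (i, j).
            cudaMemcpy2D(C + (size_t)j * n + i, sizeof(float) * n,
                         dC, sizeof(float) * tile, sizeof(float) * tile, tile,
                         cudaMemcpyDeviceToHost);
        }
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(h);
}
```

Because only one panel of A, one panel of B, and one block of C reside on the GPU at a time, the problem size is bounded by host memory rather than GPU memory, which is the property the chart annotation highlights.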
Improve Power Consumption: GPU Single-precision Matrix Multiplication
[Charts: Percentage Difference in average power vs. Problem Size (MB) – AHPCRC_Paged vs. CUSUMMA, AHPCRC_Paged vs. CUBLAS, AHPCRC_Pinned vs. CUSUMMA, and AHPCRC_Pinned vs. CUBLAS. AHPCRC_Paged has comparable average power consumption; AHPCRC_Pinned consumes more power for large matrices.]
Improve Energy Performance: Matrix Multiplication
[Charts: Percentage Difference in energy vs. Problem Size (MB) – AHPCRC_Paged vs. CUSUMMA, AHPCRC_Paged vs. CUBLAS, AHPCRC_Pinned vs. CUSUMMA, and AHPCRC_Pinned vs. CUBLAS. AHPCRC_Paged saves ~4-8% average energy consumption; AHPCRC_Pinned saves ~5-6% for smaller matrices and is comparable to CUSUMMA for larger matrices, while requiring ~15% less execution time.]
Power/Energy-Aware Computing: Fine-Grain DVFS
• State-of-the-Art GPU DVFS
  Throttles GPU speed to save power and energy at coarse grain (millisecond scale)
  GPU application phases change at microsecond to millisecond scale
  This leads to wasted power when an application phase does not need to run at full throttle
• Our Approach
  Develop a method that quantifies the benefit of fine-grain DVFS with current hardware (a hedged sketch of in-code frequency scaling follows)
  Aim to motivate hardware vendors to supply fine-grain DVFS capability
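For context, a hedged sketch of the kind of "in-code frequency scaling" the roadmap below lists: wrapping a kernel phase with system() calls to NVIDIA's nvidia-smi utility to change application clocks. The clock pairs are hypothetical and must be taken from `nvidia-smi -q -d SUPPORTED_CLOCKS`; setting clocks typically requires administrative privileges, and the millisecond-scale latency of this path is precisely the coarseness the study quantifies.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hedged sketch of coarse "in-code" GPU DVFS: wrap a kernel phase with
// system() calls to nvidia-smi's application-clocks control. The (memory,
// graphics) MHz pairs below are hypothetical; query the board's supported
// pairs with `nvidia-smi -q -d SUPPORTED_CLOCKS`. Works only on boards
// that expose application clocks.
static void set_app_clocks(int mem_mhz, int gr_mhz)
{
    char cmd[128];
    snprintf(cmd, sizeof(cmd), "nvidia-smi -ac %d,%d", mem_mhz, gr_mhz);
    if (system(cmd) != 0)
        fprintf(stderr, "warning: could not set application clocks\n");
}

__global__ void memory_bound_phase(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;  // streaming, low arithmetic intensity
}

int main()
{
    int n = 1 << 24;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    // A memory-bound phase rarely needs peak core clocks: drop the graphics
    // clock for this phase, then restore it afterward.
    set_app_clocks(2600, 324);                       // hypothetical low pair
    memory_bound_phase<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();
    set_app_clocks(2600, 758);                       // hypothetical high pair

    cudaFree(x);
    return 0;
}
```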
[Diagram: resource-efficiency levers – Improve Algorithm Efficiency, Data Footprint & Transfers, Compute Capability / Work Sharing, and Power Consumption – with Fine-Grain DVFS highlighted]
In collaboration with NVIDIA
Power/Energy-Aware Computing: Fine-Grain Power Measurement
Custom Power Measurement Board
• At least as accurate as NVIDIA's onboard GPU power sensors
• 50 times faster power sampling rate (microsecond-scale granularity)
(A sketch of converting such samples to energy follows.)
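Turning microsecond-scale power samples into energy figures is a small numerical-integration step; a minimal sketch (trapezoidal rule; the types and names are illustrative, not the project's analysis software):

```cpp
#include <vector>

// Illustrative conversion of timestamped power samples to energy:
// trapezoidal integration of P(t). Timestamps in seconds, power in watts.
struct PowerSample { double t; double watts; };

double energy_joules(const std::vector<PowerSample>& s)
{
    double e = 0.0;
    for (size_t i = 1; i < s.size(); ++i)
        e += 0.5 * (s[i].watts + s[i - 1].watts) * (s[i].t - s[i - 1].t);
    return e;   // joules = integral of watts over seconds
}
```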
Power/Energy-Aware Computing: Sample Finding
Comparing the fastest static frequency against the lowest-energy static frequency:
  Lowest-energy (static): 36.5% energy reduction at a 28.8% performance cost
  Optimal (fine-grain): 36.5% energy reduction at an 11.5% performance cost
Same energy benefit, less of a performance hit.
Power/Energy-Aware Computing: Fine-grain DVFS Study
Fine-grain DVFS Study (in detail)
• Will quantify the energy benefits a DVFS implementation could obtain with full knowledge of application nano/micro phases and, thus, the potential of fine-grain, application-aware GPU DVFS
• Will motivate GPU hardware developers to include finer-grain and faster DVFS technologies on their boards
• Workloads: a comprehensive set of general GPU workloads – synthetic stressmarks, GPU benchmarks (Rodinia, SHOC, Parboil, CUDA SDK), and custom Army SIRE/RSM versions with varying degrees of compute and memory intensity
Application-Aware GPU Power Management: Roadmap
• DEC 2013 – Evaluation and development of experimental infrastructure:
  – Evaluated GPU DVFS enforcement capability (latency and granularity)
  – Enabled in-code frequency scaling (system calls to NVIDIA's DVFS utility)
  – Enabled workload idleness characterization (CUPTI/Vampir profiling tools)
  – Improved accuracy and granularity of power monitoring (custom power monitoring boards)
  – Improved post-processing of large power-measurement datasets (SPSS and custom analysis software)
• JAN 2014 – Comprehensive static GPU frequency study; target: HPCA/SHAW-5 2014 (Dec 2013 deadline)
• FEB 2014 – Proof-of-concept application-aware GPU DVFS on Army workloads
• SPR 2014 – PhD dissertation, "The Unrealized Potential of DVFS Power-Management for GPGPU Applications"
We are now working on the tasks below.
[Diagram: research thrusts]
• Meet Job Deadlines within Tactical Cloudlets: Informed Application-to-Architecture Mappings; Effective Cloud Scheduling; Efficient MANET Communication & Data Transmission
• Cooperative Resource Provisioning & Communication: Energy- & Mobility-aware Communication; Reserved Data Transmission Paths; Job Migration
Meet Job Deadlines within Cloudlets: Informed Application-to-Architecture Mappings
• Performance Study of LULESH 1.0
  Proxy code that represents a typical hydrocode
  – Ported to multiple programming languages
  – Optimized to run on multiple architectures
  Executed on Sandy Bridge, Xeon Phi, Kepler GPU, and Fermi GPU
  – Measured and now analyzing execution time, speedup, parallel efficiency, overheads, effectiveness of vectorization, cost of serial sections, memory performance, and power/energy consumption
  – Map code-segment characteristics to differences in execution time and power/energy consumption
In collaboration with the Texas Advanced Computing Center, LLNL, Intel, and NVIDIA
• Motivation: Emerging Technologies
  Different programming models and different architectures
  Disparate costs associated with processing, memory access, synchronization, and parallelism
• Goal: Using LULESH 1.0, a Lawrence Livermore National Laboratory (LLNL) proxy code ported to several programming models and tuned by experts to perform well on multiple architectures, identify program characteristics that lead to performance variations on diverse computer architectures
• Foci:
  Study the execution behavior of code segments optimized for three state-of-the-art accelerators – the Intel Xeon Phi co-processor and the NVIDIA Fermi and Kepler GPGPUs – along with the Intel Sandy Bridge multi-core processor (baseline)
  Map code-segment characteristics to differences in execution time and power/energy consumption
Execution Time Comparison: Sample Finding
LULESH 1.0 Execution Time Comparison (sec)

Processor / Problem Size | 50^3   | 70^3   | 90^3
Sandy Bridge (Parallel)  | 13.083 | 53.113 | 140.844
Kepler                   | 2.420  | 7.302  | 17.022
Fermi                    | 17.106 | 52.281 | 121.428
MIC (Xeon Phi)           | 26.875 | 82.754 | 300.032
Beta (Sandy Bridge)      | 10.923 | 48.322 | 126.864
Beta (Xeon Phi)          | 15.287 | 45.818 | 121.402
Solve time of LULESH 1.0 run on one node, not including the initialization and termination portions. Except for the Xeon Phi, the codes were optimized for their architectures.
Data-layout optimizations in the beta code result in a 1.75x to 2.47x improvement.
Meet Job Deadlines: Efficient MANET Communication
Goal: Reduce the energy consumed by wireless MANET routing functions
Current Focus: Discover ways to adapt OLSR operating parameters to changing mobility patterns
Experimental Platform: Developed & validated a 6-node MANET experimental platform with OLSR routing and topology changes via iptables configuration
Performance Study: Using the test bed to understand the relationship between the time intervals between topological changes and the values of OLSR operating parameters
• Motivation: Energy-aware Tactical Cloud Communication
  Resource-constrained communication devices
  Lack of a fixed role for each device (host or packet switch)
  Use of free space as a transmission medium, which suffers from transmission impairments
• Goal: Reduce the energy consumed by battlefield cloud communication; in particular, reduce the energy consumed by wireless MANET routing functions
• Foci:
  Develop physical and simulation test beds
  Understand the relationship between the time intervals between topological changes and the values of OLSR operating parameters, such as the time intervals between transmissions of Hello messages and Topology Control (TC) messages
  Discover ways to dynamically adapt OLSR operating parameters to changing mobility patterns
Designed & commenced an experimental plan to identify mechanisms to reduce the energy consumed by the OLSR neighbor-discovery process:
  o Understand the relationship between topological-change time intervals & the values of OLSR operating parameters, such as the transmission frequency of Hello messages and Topology Control messages
  o Discover ways to dynamically adapt OLSR operating parameters to changing mobility patterns
Designed, developed, & validated a 6-node MANET experimental platform that uses OLSR for routing and "simulates" mobility through topology changes via iptables configuration.
Started experiments to understand the relationship between topological-change time intervals & OLSR operating parameters.
Started design and development of a simulator to conduct complementary experiments. (A sketch of the kind of parameter adaptation under study follows.)
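As a sketch of the kind of adaptation under study, not the project's method (which these experiments are meant to discover): scale OLSR's beaconing intervals with the observed mean time between topology changes. Only the RFC 3626 defaults of 2 s for HELLO_INTERVAL and 5 s for TC_INTERVAL are standard; the adaptation rule and constants below are hypothetical.

```cpp
#include <algorithm>

// Hypothetical adaptation rule: scale OLSR's HELLO_INTERVAL and TC_INTERVAL
// with the observed mean time between topology changes. RFC 3626 defaults
// are 2 s (HELLO) and 5 s (TC); the divisors and clamp bounds here are
// illustrative only.
struct OlsrIntervals { double hello_s; double tc_s; };

OlsrIntervals adapt_intervals(double mean_change_interval_s)
{
    // Faster-changing topology -> more frequent control messages (more
    // energy, fresher routes); stable topology -> slower beaconing.
    double hello = std::clamp(mean_change_interval_s / 10.0, 0.5, 8.0);
    double tc    = std::clamp(mean_change_interval_s / 4.0,  1.0, 20.0);
    return { hello, tc };
}
```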
Meet Job Deadlines: Effective Cloud Scheduling
• Estimate the times at which data transfers commence; reserve transmission paths to deliver data on time
• Leverage knowledge of temporarily fixed and expected future locations of platforms to
  – Decrease communication overhead
  – Initiate necessary job migration/replication
• By December: an easily scalable, fully functional Nimbus cloud test bed connected via a wireless ad hoc network that can collect power and memory metrics for each node for use by the scheduler
• Cooperative Cloud Scheduling and Communication
In collaboration with FutureGrid
Cooperative Cloud Scheduling & Communication – 1/2
• Motivation:
  Program and computer-platform characteristics can be used to estimate the times at which data transfers will commence. To meet job deadlines, transmission paths could be reserved to deliver data on time (a sketch follows).
  Knowledge of the non-mobility and/or travel paths of devices could be used to decrease the communication overhead associated with neighbor discovery.
  Knowledge of devices moving out of range could initiate the migration of jobs.
• Goal: Pursue the dual objectives of reducing energy consumption and increasing the number of jobs that meet their solution deadlines.
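To make the first motivation concrete: given an estimated transfer size and a reserved path's bandwidth, the latest moment a transfer can start and still meet a job's deadline is simple arithmetic. A minimal sketch, with all names and the slack margin as assumptions rather than project-defined quantities:

```cpp
#include <optional>

// Illustrative latest-start-time computation for a reserved transfer path.
// A transfer of `bytes` over a path of `bandwidth_bps` (bits/s) must finish
// early enough to leave `compute_time_s` before `deadline_s`; the slack
// term is a made-up safety margin.
std::optional<double> latest_transfer_start(double bytes, double bandwidth_bps,
                                            double deadline_s,
                                            double compute_time_s,
                                            double slack_s = 1.0)
{
    double transfer_s = bytes * 8.0 / bandwidth_bps;        // transmission delay
    double start = deadline_s - compute_time_s - transfer_s - slack_s;
    if (start < 0.0)
        return std::nullopt;   // deadline cannot be met; consider job migration
    return start;
}
```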
Cooperative Cloud Provisioning & Communication – 2/2
• Progress:
  Experimental Platform: Designed & developed a Nimbus cloud test bed comprised of 4 mobile and static computing devices connected via a wireless ad hoc network (a master/controller node, a head node, and two hypervisor nodes). Each hypervisor device hosts multiple virtual machines.
  By December 2013:
  o An easily scalable, fully functional cloud test bed connected via a wireless ad hoc network that can collect power and memory metrics for each node for use by the scheduler
  o Documentation with all relevant codes and information required to reproduce this work and create such a test bed
This work is a collaboration with FutureGrid, which described it as "cool, groundbreaking cloud research". (It is primarily being done by a graduate student who is not supported by this grant.)
APP 2013-2014 – 1/2
Research Objective 1: Analysis of applications and emerging technologies to inform application-to-architecture mappings and the employment of optimizations that will result in "best" execution-time and power performance.
• Deliverable #1: Identification of code attributes (e.g., memory-access behavior, vectorization potential, instructions per cycle, and branch behavior) that highly influence an application's execution-time and power performance on sets of light-weight cores (e.g., Xeon Phi), GPUs, and sets of heavy-weight cores (e.g., Sandy Bridge)
• Deliverable #2: Development of guidelines that inform the mapping of code segments and/or applications to these computing architectures
• Deliverable #3: Dissemination of findings in technical reports, conference proceedings, and journals
APP 2013-2014 – 2/2
Research Objective 2: Investigation of cooperative resource provisioning and communication mechanisms for tactical cloud management to reduce the number of missed application-execution deadlines and energy consumption.
• Deliverable #1: Definition of mechanisms in which the resource-provisioning system and communication protocols cooperate to reduce the number of missed application-execution deadlines and energy consumption
• Deliverable #2: Development of analytical frameworks, simulation models, and a physical test bed for performance evaluation of the defined cooperative mechanisms
• Deliverable #3: Preliminary performance evaluation of the defined mechanisms using analytical methods, simulation experiments, and physical experiments
• Deliverable #4: Dissemination of findings in technical reports, conference proceedings, and journals
PROJECT PUBLICATIONS FY 2013 – 1/2
Refereed publications and reports (* students)
• Sarala Arunagiri, Jaime Jaloma*, Ricardo Portillo*, and Arturo Argueta*, "Power and Execution Performance Tradeoffs of GPGPU Computing: a Case Study Employing Stereo Matching," in Proceedings of the Image Processing: Machine Vision Applications VI Conference (part of IS&T/SPIE Electronic Imaging), San Francisco, CA, February 3-7, 2013.
• Sarala Arunagiri and Jaime Jaloma*, "Parallel GPGPU Stereo Matching with an Energy-Efficient Cost Function Based on Normalized Cross Correlation," in Proceedings of the Image Processing: Algorithms and Systems XI Conference (part of IS&T/SPIE Electronic Imaging), San Francisco, CA, February 3-7, 2013.
Papers in progress
• Esthela Gallardo*, Patricia J. Teller, Jaime Jaloma, Ian Karlin, Arturo Argueta, Edgar Leon, and James Browne, "Accelerators: the Good, the Bad, and the Ugly," target: TBD.
• Ricardo Portillo* and Patricia J. Teller, "The Potential of Application-aware DVFS for GPGPU Scientific Workloads," target: HPCA/SHAW-5 2014 (Dec 2013 deadline).
PROJECT PUBLICATIONS FY 2013 – 2/2
Dissertations, Master's Theses, and Technical Reports
• Ricardo Portillo, "The Unrealized Potential of DVFS Power-Management for GPGPU Applications," PhD dissertation, UTEP, Department of Computer Science, expected Spring 2014.
• Joshua McCartney, "Towards Understanding the Impact of OLSR HELLO and TC Message Frequency on Network Performance," M.S. thesis, UTEP, Department of Computer Science, expected Fall 2013.
• Joshua McKee, "Setting Up a Highly Configurable, Scalable Cloud using Nimbus and Phantom," M.S. project, UTEP, Department of Computer Science, expected Fall 2013.
• Enrique Portillo, "Efficient, Scalable, Parallel Matrix-Matrix Multiply," M.S. thesis, UTEP, Department of Computer Science, expected Fall 2013.
SUMMARY
• 2013 Research focused on enabling Tactical Cloud computing
Main Focus: Power/Energy-Aware Computing – Decrease arithmetic precision (SAR Image Processing and