EEL 4930/5934, Fall 09
December 3-4, 2009
Novo-G: Adaptively Custom Reconfigurable Supercomputer
Dr. Alan D. George, Professor of ECE, University of Florida
Dr. Herman Lam, Assoc. Professor of ECE, University of Florida
Abhijeet Lawande, Carlo Pascoe, Research Assistants, CHREC, University of Florida
HPC Marketplace
HPC practitioners often more reactive than proactive
Understandably conservative, risk-averse
Looking for quick fixes (not always best approach for long-term)
Accelerators (e.g. GPU, Cell) popular @ SC09
But these consume more (energy) to get more (performance)
Performance promising for subset of apps (on fixed-logic spectrum)
Productivity a significant challenge (common in Age of Parallelism)
Sustainability a major concern (single devices approaching 300W!)
But better solutions are born from better methods
Goal: high performance, productivity, & sustainability
Change in paradigm, mindset, approach
“Every generation needs a new revolution” – Jefferson
Smarter device and system architectures
Adaptive hardware parallelism, more (performance) with less (energy)
Better models & solutions apply more broadly than only HPC
So what’s the problem?
New computing model: revolutionary, potent, complex
Adaptive hardware offers many challenges & opportunities
Still relatively new and immature field
Many open R&D issues, from prog. model to device arch.
Novo-G Concept
Goals
Investigate, develop, evaluate, & showcase:
Most powerful RC machine ever fielded for research
Innovative suite of productivity tools for app development
Impactful set of scalable kernels/apps in key science areas
Project & machine name: Novo-G
“Novo” is Latin: "to make anew, refresh, revive, change, alter," essence of RC
“G” for Genesis (first of a series of Novo machines) or Green
Focus on experimental research challenges of RC spanning HPC to HPEC
Motivations
Design productivity is foremost need/challenge for widespread use of RC
Challenges accentuated as scale increases (devices, systems, apps)
Powerful experimental testbed to support R&D addressing these challenges
After capacity doubled, total power of Novo-G @ max. load: 8 kW
Novo-G Tools
Commercial and open-source tools
Digital design tools: Altera, GiDEL, Aldec, Synopsys
Cores and libraries: Altera, GiDEL, et al.
High-level device design: Altera FP Compiler, Impulse-C, Mitrion-C, LabVIEW (2010)
High-level system design: MPI, UPC, SHMEM
Additional options in review (ROCCC, et al.)
Variety of CHREC tools being ported & used for Novo-G
Strategic design & prediction: RCML, RCSE, RAT, CMD
High-level system design: SHMEM+, SCF
Hardware virtualization for fast PAR: IFET
App verification & performance analysis: ReCAP
Proposed OpenCL over CHREC-IF
Assorted kernel & app cores
Industry Partners
Impulse-C Platform Support Package
Impulse-C
Allows software written in Impulse-C programming language to run in Novo-G FPGAs
H/W – S/W partitioning approach
S/W processes compiled to executable using GCC
H/W processes converted to synthesizable VHDL/Verilog
Platform Support Package (PSP)
Provides interface between Impulse-C generated H/W and S/W customized for Impulse-C application
Currently supports streams and registers
Future Work: provide support for shared memory; extend PSP to support multi-FPGA system
Impulse-C apps on Novo-G: Smith-Waterman, Back-projection, European Option Pricing
Planned apps / app areas: bioinformatics, information retrieval and search engines, database acceleration
Sequence Alignment in Bioinformatics
Smith-Waterman (S-W) is an algorithm used to compute the optimal local sequence alignment of two character strings.
Needleman-Wunsch (N-W) is for the computation of the optimal global sequence alignment.
In biology, alignments are performed in search of sequence similarities under the assumption that they imply functional, structural, or evolutionary relationships between sequences and their sources.
Contemporary implementations of optimal sequence alignment (whether global, local, or anything in-between) are based on a computation-intensive dynamic programming algorithm that breaks down the process of alignment into a set of recursive computations.
Algorithms involve calculation of optimal alignment for all possible subsequences, then choosing the final sequence alignment from set of sub-alignments.
Equivalent to populating a score matrix and selecting the appropriate cell based on the type of alignment desired
Example of local alignment (S-W)
Query Sequence = “ACGTATGC”
Database Sequence = “ACGAACCCTTGC”
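The score-matrix view described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `smith_waterman` and the scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for the example, not the scoring used on Novo-G.

```python
def smith_waterman(query, database, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix; return (best_score, matrix).

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(query) + 1, len(database) + 1
    H = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay at zero
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == database[j - 1] else mismatch
            H[i][j] = max(
                0,                      # local alignment may restart anywhere
                H[i - 1][j - 1] + sub,  # align the two characters
                H[i - 1][j] + gap,      # gap in the database sequence
                H[i][j - 1] + gap,      # gap in the query sequence
            )
            best = max(best, H[i][j])   # local alignment: best cell anywhere
    return best, H


best, H = smith_waterman("ACGTATGC", "ACGAACCCTTGC")
print("best local score:", best)
```

For a global (N-W) alignment the zero floor is dropped, the borders are initialized with gap penalties, and the score is read from the bottom-right cell instead of the maximum cell.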
For two sequences of length A and B, optimum alignment requires the calculation of A∙B scores, with serial implementations operating in O(A∙B) time and O(min{A,B}) space complexity.
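The O(min{A,B}) space bound follows from the fact that each row of the score matrix depends only on the row before it. A minimal sketch, assuming a hypothetical function name and illustrative scoring values (match +2, mismatch -1, gap -1), computes the best local score while storing only two rows:

```python
def sw_score_linear_space(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score using O(min(A, B)) memory.

    Scoring values are illustrative assumptions. Only the score is
    returned; recovering the alignment itself needs extra bookkeeping.
    """
    # Put the shorter sequence along the row so the row buffers are small.
    if len(b) > len(a):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:                      # A*B cell updates in total
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            sub = match if ch_a == ch_b else mismatch
            score = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr                     # the older row is no longer needed
    return best


print(sw_score_linear_space("ACGTATGC", "ACGAACCCTTGC"))
```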
As amount of sequence data grows exponentially, the need for faster sequence alignment has fuelled the development of hardware accelerators.
Hardware Approach:
Novo-G Apps: Smith-Waterman (S-W)
First completed app: S-W Kernel for use in Bio Apps
Locally/Optimally align DNA, RNA, or protein sequences
Identify regions of similarity; dominant & vital app. in comp. biology
Optimal alignment ideal but often replaced with much faster heuristics
Design: systolic array spanning 4 FPGAs per board
512 PE/FPGA, 2048 PE/board, 1 board/server, 125MHz, see app-note
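A systolic array can speed up S-W because all cells on the same anti-diagonal of the score matrix are independent of one another. The sketch below is a software model of that dependency pattern only, not the Novo-G hardware design; the function name and scoring values are assumptions. It sweeps the matrix one anti-diagonal at a time and counts the sweeps, which is what an idealized array with one PE per query character would need.

```python
def sw_wavefront(query, database, match=2, mismatch=-1, gap=-1):
    """Score S-W by anti-diagonals; return (best_score, wavefront_steps).

    Every cell visited inside one iteration of the outer loop depends only
    on earlier anti-diagonals, so idealized hardware could update them all
    in the same clock cycle. Scoring values are illustrative assumptions.
    """
    A, B = len(query), len(database)
    H = [[0] * (B + 1) for _ in range(A + 1)]
    best, steps = 0, 0
    for d in range(2, A + B + 1):                 # d = i + j
        for i in range(max(1, d - B), min(A, d - 1) + 1):
            j = d - i
            sub = match if query[i - 1] == database[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
        steps += 1                                # one parallel step per diagonal
    return best, steps


best, steps = sw_wavefront("ACGTATGC", "ACGAACCCTTGC")
print(best, steps, "wavefront steps vs", 8 * 12, "serial cell updates")
```

For sequences of length A and B the sweep takes A + B - 1 steps instead of A∙B serial updates (19 instead of 96 for the 8- and 12-character example sequences).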
Execution Time of Serial Baseline on Single 2.4 GHz Opteron Core = 743,460 Seconds (≈8.6 Days)

Number of Novo-G Nodes in Execution    1      4       8       12      16      24(E)
Execution Time (Sec) of Novo-G         279    70.4    35.6    24.2    18.2    12.38886
Novo-G Speedup vs. Single Core         2665   10561   20884   30721   40849   60053
Novo-G achieves in ~12 seconds what takes a fast CPU core nearly 9 days!
Speed of S-W on Novo-G comparable to two largest machines on NSF TeraGrid
After our 2x upgrade, as fast as both combined!
Yet, Novo-G is 100s of times lower in energy, cooling, cost, size, weight, etc. than TeraGrid
Future Plans:
• Use S-W Kernel in SHRiMP application as replacement for BLAST heuristic
Execution times for 34MB chromosome sequence aligned with 16K 128-character sequences
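The speedup row of the S-W table is the serial baseline divided by the Novo-G time, and the same numbers give a rough parallel efficiency relative to one node. The short script below recomputes both from the measured columns (the estimated 24-node column is left out); the variable and function names are hypothetical.

```python
BASELINE_S = 743_460  # serial baseline on one 2.4 GHz Opteron core (seconds)
# (nodes, execution time in seconds) for the measured columns of the table
MEASURED = [(1, 279.0), (4, 70.4), (8, 35.6), (12, 24.2), (16, 18.2)]


def scaling_rows(baseline_s=BASELINE_S, measured=MEASURED):
    """Return (nodes, rounded speedup vs. one core, efficiency vs. one node)."""
    t1 = measured[0][1]                     # single-node Novo-G time
    rows = []
    for nodes, t in measured:
        speedup = baseline_s / t            # vs. the single CPU core
        efficiency = t1 / (nodes * t)       # 1.0 would be perfect scaling
        rows.append((nodes, round(speedup), efficiency))
    return rows


for nodes, speedup, eff in scaling_rows():
    print(f"{nodes:>2} nodes: speedup {speedup:>6}, efficiency {eff:.1%}")
```

The recomputed speedups match the table, and efficiency relative to a single node stays above 95% through 16 nodes.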
Execution Time of needledist Baseline on Single 2.26GHz Intel Nehalem Core = 55,200 Seconds (≈ 15.4 hours)

Number of Novo-G FPGAs in Execution    1      2      4      8      96      192
Execution Time (Sec) of Novo-G         50.6   25.9   14.7   8.39
Novo-G Speedup vs. Single Core         1091   2131   3755   6579   78951   157902

Execution times for distance calculation of 16,777,216 pairs of length 250. Note: Red cells extrapolated values, obtainable with larger data sets.
N-W Kernel for use in ICBR’s ESPRIT application
Globally/Optimally align DNA sequences, then compute edit distance
Edit distances used to group sequences into operational taxonomic units (OTUs)
OTUs grouped into tree; tree represents species richness and taxonomy
Design: systolic array of PEs with I/O FIFOs for streaming
Current Design: 250 PEs/FPGA, 1000 PEs/board, 2 boards/server, 125MHz
Design only consumes 68% of chip; number of PEs will be increased
Compared to S-W, overall design simpler in terms of control signals, but N-W PEs vastly more complex
Uses special encoding scheme that allows N-W to approach S-W performance
Future Plans:
• Fully integrate N-W Kernel into ESPRIT and create web app for use by scientific community
USED IN ACTUAL METAGENOMICS RESEARCH!
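As a rough software illustration of a global alignment that yields an edit distance, the sketch below runs the N-W recurrence with unit costs (mismatch 1, gap 1, match 0), which gives the classic Levenshtein distance. This is an assumption made for illustration: the distance definition actually used by needledist and ESPRIT may differ, and the function name is hypothetical.

```python
def nw_edit_distance(a, b):
    """Global-alignment edit distance with unit costs (Levenshtein).

    Unit costs are an illustrative assumption, not taken from the slides.
    """
    # Row 0: aligning an empty prefix of `a` costs one gap per character of `b`.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]                      # column 0: i gaps
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j - 1] + cost,     # match or substitution
                prev[j] + 1,            # gap in b
                curr[j - 1] + 1,        # gap in a
            ))
        prev = curr
    return prev[-1]                     # global alignment: bottom-right cell


print(nw_edit_distance("ACGTATGC", "ACGAACCCTTGC"))
```

Unlike the local (S-W) case there is no zero floor, and the result is read from the final cell rather than the best cell anywhere in the matrix.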
Novo-G Apps: Real-Time Adaptive Filtering
Use of ITL optimization in real-time adaptive filtering
Filter weights change with every sample through feedback by optimizing value of cost function
Current filters minimize mean squared error (MSE) cost functions; ITL cost functions minimize error entropy (EE), yielding better results.
Minimizing EE equivalent to calculating gradient of information potential (IP), a double sum over window indices i and j of exponential terms in the error samples e_j and e_i
Increasing IP window size (i, j) results in smoother and faster convergence but computation increases as O(n²)
Novo-G implementation advantages
SW implementation (MATLAB) can't operate in RT for large window size
All summed exponential terms independent; HW can compute in parallel
First design iteration can compute window size up to (50, 50) on 1 FPGA in single clock cycle
Clock rate/speedup limited by sample frequency for RT filtering
Maximum Sample Frequency (kHz)

Window size    5          40         50
Matlab         12.99      4.35       3.03
1 FPGA         11,764.7   10,204.1   9,803.9
Speedup        906        2346       3236
Sample frequencies based on simulated time to calculate a single weight (computation of IP). Does not include FPGA transfer time.
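To make the O(n²) cost concrete, the sketch below evaluates an information potential as a double sum of exponential terms over all pairs of error samples in a window. The Gaussian kernel, the kernel width `sigma`, and the 1/n² normalization are assumptions borrowed from a common ITL formulation; the exact expression used on the slide is not recoverable here, and the function name is hypothetical.

```python
import math


def information_potential(errors, sigma=1.0):
    """Double sum of Gaussian terms over all error pairs in the window.

    Kernel form, sigma, and normalization are illustrative assumptions.
    A window of n samples needs n * n exponential evaluations.
    """
    n = len(errors)
    total = 0.0
    for e_i in errors:
        for e_j in errors:
            # Each term depends only on one pair (e_j, e_i), so all n*n
            # terms are independent and could be evaluated in parallel.
            total += math.exp(-((e_j - e_i) ** 2) / (2.0 * sigma ** 2))
    return total / (n * n)


window = [0.10, -0.05, 0.02, 0.30, -0.12]
print(information_potential(window))
```

With a window of 50 samples the double sum has 2,500 independent terms, which is the kind of workload that hardware can evaluate concurrently.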
Novo-G Apps: Filtered Back-Projection (FBP)
BP for use in CT image reconstruction
2D object is reconstructed from several 1-D projections
Projections obtained by bombarding object with X-ray beam from multiple angles
Each pixel on projected image represents total absorption of X-ray along path from source to detector
Mathematically, transformation from projection-space into Cartesian coordinates
Design: 512 pipelined processing engines per FPGA
Design        CPU (4 cores)   4 FPGA (VHDL)   4 FPGA (Impulse C)
Total Time    2300 ms         6.81 ms         8.8 ms
Speedup       -               338X            261X
Embarrassingly parallel w.r.t. computation of each pixel as well as projections for each
Processing engines iterate over all pixels and compute partial sum; final image formed in software
Software time complexity is O(n³); Hardware design reduces complexity to O(n²)
H/W implementation uses 16-bit fixed point arithmetic; results are visually indistinguishable from DPFP
S/W baseline: C code executed with fixed point on Intel E5520 Nehalem Quad Core Xeon @ 2.26GHz
• Design implemented in both Impulse-C and VHDL to compare performance and productivity
• Performance loss of 1.29x but estimated productivity gain considerably greater
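The O(n³) software cost of back-projection comes from a triple loop: every projection angle contributes to every pixel. The sketch below shows that loop nest for a simplified case. Parallel-beam geometry, nearest-neighbour detector lookup, floating-point arithmetic, and the omission of the filtering step are all assumptions made for brevity; the function name and data layout are hypothetical and do not describe the Novo-G design.

```python
import math


def back_project(sinogram, size):
    """Unfiltered back-projection of sinogram[k][t] onto a size x size image.

    k indexes the projection angle, t the detector bin. Geometry and
    interpolation are simplifying assumptions for illustration.
    """
    n_angles = len(sinogram)
    n_bins = len(sinogram[0])
    center = (size - 1) / 2.0           # image centre in pixel units
    bin_center = (n_bins - 1) / 2.0     # detector centre in bin units
    image = [[0.0] * size for _ in range(size)]
    for k, projection in enumerate(sinogram):        # angles
        theta = math.pi * k / n_angles
        c, s = math.cos(theta), math.sin(theta)
        for row in range(size):                      # pixel rows
            y = row - center
            for col in range(size):                  # pixel columns
                x = col - center
                # Detector bin hit by the ray through this pixel.
                t = int(round(x * c + y * s + bin_center))
                if 0 <= t < n_bins:
                    image[row][col] += projection[t]  # partial sum per pixel
    return image


# A point-like object: every projection sees a spike in the centre bin.
sino = [[0, 0, 0, 1, 0, 0, 0] for _ in range(4)]
img = back_project(sino, 5)
print(img[2][2], img[0][0])
```

Each pixel's sum is independent of every other pixel's, which is why the computation is embarrassingly parallel and maps well onto many processing engines.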