Slide-1 SC2002 Tutorial MIT Lincoln Laboratory DoD Sensor Processing: Applications and Supporting Software Technology Dr. Jeremy Kepner MIT Lincoln Laboratory This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Slide-2
[Diagram: parallel embedded processor with nodes P0-P3, each with a Node Controller, plus a System Controller linked to consoles and other computers]
Control communication: CORBA, HP-CORBA, SCA
Data communication: MPI, MPI/RT, DRI
Computation: VSIPL
Definitions:
– VSIPL = Vector, Signal, and Image Processing Library
– MPI = Message Passing Interface
– MPI/RT = MPI Real-Time
– DRI = Data Reorganization Interface
– CORBA = Common Object Request Broker Architecture
– HP-CORBA = High Performance CORBA
Preamble: Existing Standards
• A variety of software standards support existing DoD signal processing systems
Slide-3
Preamble: Next Generation Standards
[Figure: HPEC Software Initiative pyramid, with goals of Performance (1.5x), Portability (3x), and Productivity (3x), built through Demonstrate, Develop, and Applied Research activities on object-oriented open standards; interoperable & scalable]
• Portability: lines-of-code changed to port/scale to new system
• Productivity: lines-of-code added to add new functionality
• Performance: computation and communication benchmarks
• Software Initiative Goal: transition research into commercial standards
Slide-4
HPEC-SI: VSIPL++ and Parallel VSIPL
• Demonstrate insertions into fielded systems (e.g., CIP); demonstrate 3x portability
• High-level code abstraction; reduce code size 3x
• Unified embedded computation/communication standard; demonstrate scalability
Demonstration: Existing Standards
Phase 1
Phase 2
Phase 3
Time
Development: Object-Oriented Standards
Applied Research: Unified Comp/Comm Lib
Demonstration: Object-Oriented Standards
Applied Research: Fault tolerance
Demonstration: Unified Comp/Comm Lib
Development: Fault tolerance
Applied Research: Self-optimization
Development: Unified Comp/Comm Lib
[Figure: functionality vs. time across the three phases, from VSIPL and MPI today, through a VSIPL++ prototype leading to VSIPL++, to a Parallel VSIPL++ prototype leading to Parallel VSIPL++]
Slide-5
Preamble: The Links
High Performance Embedded Computing Workshop: http://www.ll.mit.edu/HPEC
High Performance Embedded Computing Software Initiative: http://www.hpec-si.org/
Vector, Signal, and Image Processing Library: http://www.vsipl.org/
• Software costs for embedded systems could be reduced by one-third
with improved programming models, methodologies, and standards
Slide-8
Embedded Stream Processing
• Requires high performance computing and networking
[Figure: peak bisection bandwidth (GB/s, 0.1 to 10000, log scale) vs. peak processor power (Gflop/s, 1 to 100000, log scale); COTS systems today fall short of the desired region of performance (goal); trends labeled Moore's Law and Faster Networks; applications shown include radar, sonar, wireless, video, medical, scientific, and encoding]
Slide-9
Military Embedded Processing
• Signal processing drives computing requirements
• Rapid technology insertion is critical for sensor dominance
[Figure: processing requirements (TFLOPS, 0.001 to 100, log scale) vs. year (1990-2010) for airborne radar, shipboard surveillance, UAV, missile seeker, SBR, small unit operations, and SIGINT]
Requirements are increasing by an order of magnitude every 5 years; embedded processing requirements will exceed 10 TFLOPS in the 2005-2010 time frame.
Slide-10
Military Query Processing
[Diagram: sensors (wide area imaging, hyperspectral imaging, SAR/GMTI) feed high speed networks (BoSSNET) and parallel computing, supporting missions such as targeting, force location, and infrastructure assessment; enabled by parallel distributed software and multi-sensor algorithms]
• Highly distributed computing
• Fewer very large data movements
Slide-11
Signal processing algorithm stages mapped onto a parallel computer as a parallel pipeline:
• Filter: XOUT = FIR(XIN)
• Beamform: XOUT = w*XIN
• Detect: XOUT = |XIN| > c
• Data Parallel within stages
• Task/Pipeline Parallel across stages
Slide-12
Filtering
XOUT = FIR(XIN,h)
• Fundamental signal processing operation
• Converts data from wideband to narrowband via filter: O(Nsamples Nchannel Nh / Ndecimation)
• Degrees of parallelism: Nchannel
[Figure: Xin is Nchannel x Nsamples; Xout is Nchannel x (Nsamples / Ndecimation)]
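The filtering stage above can be sketched in a few lines of plain Python. This is an illustrative stand-in with made-up names (`fir_decimate`, `n_dec`), not the VSIPL API; each output sample costs O(Nh), and the decimated output per channel gives the O(Nsamples Nh / Ndecimation) count, repeated over Nchannel independent channels:

```python
# Sketch of the filtering stage: per-channel FIR with decimation.
# Names are illustrative, not from VSIPL.

def fir_decimate(x, h, n_dec):
    """FIR-filter one channel of samples x with taps h,
    keeping every n_dec-th output (wideband -> narrowband)."""
    n_h = len(h)
    out = []
    for i in range(0, len(x) - n_h + 1, n_dec):
        # Inner product of the taps with a sliding window: O(n_h) per output.
        out.append(sum(h[j] * x[i + j] for j in range(n_h)))
    return out

# Each channel is independent: the "degrees of parallelism: Nchannel".
channels = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
            [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]]
h = [0.5, 0.5]          # 2-tap averaging filter
filtered = [fir_decimate(ch, h, n_dec=2) for ch in channels]
print(filtered)
```

Because the channels never interact, distributing the outer list comprehension over processors is exactly the channel-parallel mapping described later in the tutorial.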
Slide-13
Beamforming
XOUT = w *XIN
• Fundamental operation for all multi-channel receiver systems
• Converts data from channels to beams via matrix multiply: O(Nsamples Nchannel Nbeams)
• Key: weight matrix can be computed in advance
• Degrees of parallelism: Nsamples
[Figure: Xin is Nchannel x Nsamples; Xout is Nbeams x Nsamples]
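Beamforming is just the matrix multiply y = wᴴx, and a minimal Python sketch makes the O(Nsamples Nchannel Nbeams) triple loop explicit. The function name and pure-Python formulation are illustrative (a real system would call VSIPL or BLAS), and since the samples are independent, the sample loop is the natural axis of parallelism:

```python
# Sketch of beamforming as a matrix multiply: y = w^H x.
# Pure-Python stand-in; names are illustrative.

def beamform(w, x):
    """w: Nchannel x Nbeams steering weights, x: Nchannel x Nsamples data.
    Returns Nbeams x Nsamples output using the conjugate transpose of w."""
    n_ch, n_beams = len(w), len(w[0])
    n_samp = len(x[0])
    return [[sum(w[c][b].conjugate() * x[c][s] for c in range(n_ch))
             for s in range(n_samp)]          # parallelize over samples
            for b in range(n_beams)]

# Two channels, one beam that simply sums the channels, three samples.
w = [[1 + 0j], [1 + 0j]]
x = [[1 + 0j, 2 + 0j, 3 + 0j],
     [4 + 0j, 5 + 0j, 6 + 0j]]
y = beamform(w, x)
print(y)   # each output sample is the sum of the two channels
```

The weight matrix w is fixed ahead of time, which is the "key" point on the slide: only x streams through the system.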
Slide-14
Detection
XOUT = |XIN|>c
• Fundamental operation for all processing chains
• Converts data from a stream to a list of detections via thresholding: O(Nsamples Nbeams)
• Number of detections is data dependent
• Degrees of parallelism: Nbeams Nchannels or Ndetects
[Figure: Xin is Nbeams x Nsamples; Xout is a list of Ndetects detections]
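The detection stage can be sketched directly from the formula XOUT = |XIN| > c. The tuple format of a detection below is an illustrative choice; the important property, as the slide notes, is that the output length Ndetects depends on the data rather than on the input dimensions:

```python
# Sketch of the detection stage: threshold the magnitude of each sample.
# Output size is data dependent (Ndetects), unlike the earlier stages.

def detect(x, c):
    """x: Nbeams x Nsamples, c: threshold.
    Returns a list of (beam, sample, value) detections where |x| > c."""
    return [(b, s, v)
            for b, row in enumerate(x)
            for s, v in enumerate(row)
            if abs(v) > c]

x = [[0.1, 2.5, 0.3],
     [3.0, 0.2, 1.7]]
dets = detect(x, c=1.0)
print(dets)
```

Each beam's row can be thresholded on a different processor, which is the Nbeams degree of parallelism; load imbalance arises because some beams produce many detections and others none.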
Slide-15
Types of Parallelism
[Diagram: Input, then FIR Filters, then a Scheduler feeding Beamformer 1 / Beamformer 2, feeding Detector 1 / Detector 2; illustrates task parallelism, pipelining, round robin scheduling, and data parallelism]
Slide-16
• Filtering • Beamforming • Detection
Outline
• Introduction
• Processing Algorithms
• Parallel System Analysis
• Software Frameworks
• Summary
Slide-17
FIR Overview
• Uses: pulse compression, equalization, …
• Formulation: y = h ∘ x
– y = filtered data [#samples]
– x = unfiltered data [#samples]
– h = filter [#coefficients]
– ∘ = convolution operator
• Implementation Parameters: Direct Sum or FFT based
Slide-18
Basic Filtering via FFT
• Fourier Transform (FFT) allows specific frequencies to be selected O(N log N)
[Figure: the FFT maps data between the time and frequency domains, with the DC component at the left of the spectrum]
Slide-19
Basic Filtering via FIR
[Figure: band-pass filter example; input x with power at any frequency is filtered by FIR(x,h) into output y with power only between f1 and f2, implemented as a tapped delay line with coefficients h1, h2, h3, …, hL]
• Finite Impulse Response (FIR) allows a range of frequencies to be selected O(N Nh)
Slide-20
Multi-Channel Parallel FIR filter
• Parallel Mapping Constraints:
– #channels MOD #processors = 0
– 1st: parallelize across channels
– 2nd: parallelize within a channel based on #samples and #coefficients
[Diagram: four FIR filters, one per channel (Channel 1 through Channel 4)]
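The first mapping rule above is simple enough to sketch: distribute whole channels over processors, insisting that the channel count divide evenly. The function name and block-assignment policy are illustrative (PVL expresses this with Map objects rather than an explicit dictionary):

```python
# Sketch of the first mapping constraint: distribute whole channels
# across processors, requiring #channels MOD #processors == 0.

def map_channels(n_channels, n_procs):
    if n_channels % n_procs != 0:
        raise ValueError("#channels MOD #processors must be 0")
    per_proc = n_channels // n_procs
    # Block assignment: processor p owns a contiguous slab of channels.
    return {p: list(range(p * per_proc, (p + 1) * per_proc))
            for p in range(n_procs)}

mapping = map_channels(4, 2)
print(mapping)
```

Only when there are more processors than channels does the second rule kick in, splitting a single channel's samples and coefficients across processors.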
Slide-21
• Filtering • Beamforming • Detection
Outline
• Introduction
• Processing Algorithms
• Parallel System Analysis
• Software Frameworks
• Summary
Slide-22
Beamforming Overview
• Uses: angle estimation
• Formulation: y = wᴴx
– y = beamformed data [#samples x #beams]
– x = channel data [#samples x #channels]
– w = (tapered) steering vectors [#channels x #beams]
• Single processor and multi-processor code are the same
• Maps can be changed without changing software
• High level code is compact
Slide-51
PVL Evolution
[Figure: 1989-2000 timeline of library evolution, from single-processor libraries (LAPACK, VSIPL) through parallel communications (MPI, MPI/RT) to parallel processing libraries (ScaLAPACK, STAPL, PVL), plus PETE; languages evolve from Fortran and object-based C to object-oriented C++; applicability spans scientific (non-real-time) computing and real-time signal processing]
• Transition technology from scientific computing to real-time
• Moving from procedural (Fortran) to object oriented (C++)
Slide-52
Anatomy of a PVL Map
[Diagram: Vector/Matrix, Computation, Conduit, and Task objects each contain a Map; a Map contains a Grid, a list of nodes (e.g., {0,2,4,6,8,10}), a Distribution, and an Overlap]
• All PVL objects contain maps
• PVL Maps contain:
– Grid
– List of nodes
– Distribution
– Overlap
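The four pieces a PVL Map carries can be captured in a small sketch. This is a hypothetical Python class for exposition only; the real PVL API is C++ and its types and names differ:

```python
# Hypothetical sketch of the four pieces a PVL Map carries: grid,
# node list, distribution, and overlap. Names are illustrative.

class Map:
    def __init__(self, grid, nodes, distribution, overlap=0):
        self.grid = grid                  # 2D processor layout, e.g. (2, 3)
        self.nodes = nodes                # e.g. [0, 2, 4, 6, 8, 10]
        self.distribution = distribution  # e.g. "block" or "cyclic"
        self.overlap = overlap            # boundary elements shared by neighbors
        assert grid[0] * grid[1] == len(nodes), "grid must cover the node list"

m = Map(grid=(2, 3), nodes=[0, 2, 4, 6, 8, 10], distribution="block")
print(m.grid, m.nodes, m.distribution, m.overlap)
```

Because every PVL object carries such a map, changing how a computation is spread across processors means constructing a different Map, not rewriting the computation.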
Slide-53
Signal Processing & Control Mapping
Library Components
Class | Description | Parallelism
Map | Specifies how Tasks, Vectors/Matrices, and Computations are distributed on processors | Data, Task & Pipeline
Grid | Organizes processors into a 2D layout | —
Task | Supports algorithm decomposition (i.e., the boxes in a signal flow diagram) | Task & Pipeline
Conduit | Supports data movement between tasks (i.e., the arrows on a signal flow diagram) | Task & Pipeline
Vector/Matrix | Used to perform matrix/vector algebra on data spanning multiple processors | Data
Computation | Performs signal/image processing functions on matrices/vectors (e.g., FFT, FIR, QR) | Data & Task
• Simple mappable components support data, task and pipeline parallelism
Slide-54
PVL Layered Architecture
[Diagram: layered architecture; an Application (Input, Analysis, Output) sits on the Parallel Vector Library (Map, Vector/Matrix, Computation, Task, Conduit, Grid, Distribution, over a Math Kernel (VSIPL) and a Messaging Kernel (MPI)), which sits on Hardware (workstations, embedded multi-computers, PowerPC clusters, embedded boards, Intel clusters); the layers provide productivity, portability, and performance]
• Layers enable simple interfaces between the application, the library, and the hardware
• Communication dominates performance
Slide-61
• PVL • PETE • S3P • MatlabMPI
Outline
• Introduction
• Processing Algorithms
• Parallel System Analysis
• Software Frameworks
• Summary
Slide-62
S3P Framework Requirements
• Each compute stage can be mapped to different sets of hardware and timed
[Diagram: pipeline of Tasks connected by Conduits; Filter: XOUT = FIR(XIN), Beamform: XOUT = w*XIN, Detect: XOUT = |XIN| > c]
• Mappable: to different sets of hardware
• Measurable: resource usage of each mapping
• Decomposable: into Tasks (computation) and Conduits (communication)
Slide-63
S3P Engine
[Diagram: the S3P Engine takes Hardware Information, Algorithm Information, System Constraints, and the Application Program as inputs, and produces the "best" system mapping]
• Map Generator constructs the system graph for all candidate mappings
• Map Timer times each node and edge of the system graph
• Map Selector searches the system graph for the optimal set of maps
[Diagram: Map Generator, then Map Timer, then Map Selector]
Slide-64
Test Case: Min(#CPU | Throughput)
Pipeline: Input, Low Pass Filter, Beamform, Matched Filter
[Table: measured per-stage times for 1-4 CPUs per stage, at 33 frames/sec (1.6 MHz BW) and 66 frames/sec (3.2 MHz BW)]
• Vary number of processors used on each stage
• Time each computation stage and communication conduit
• Find path with minimum bottleneck
Slide-65
Dynamic Programming
N = total hardware units; M = number of tasks; Pi = number of mappings for task i
t = M
pathTable[M][N] = all infinite weight paths
for( j : 1..M ){
    for( k : 1..Pj ){
        for( i : j+1..N-t+1 ){
            if( i - size[k] >= j ){
                if( j > 1 ){
                    w = weight[pathTable[j-1][i-size[k]]] + weight[k]
                        + weight[edge[last[pathTable[j-1][i-size[k]]], k]]
                    p = addVertex[pathTable[j-1][i-size[k]], k]
                }else{
                    w = weight[k]
                    p = makePath[k]
                }
                if( weight[pathTable[j][i]] > w ){
                    pathTable[j][i] = p
                }
            }
        }
    }
    t = t - 1
}
• Graph construct is very general
• Widely used for optimization problems
• Many efficient techniques for choosing "best" path (under constraints), such as Dynamic Programming
• Excellent agreement between S3P predicted and achieved latencies and throughputs
[Figure: predicted vs. achieved latency (seconds) and throughput (frames/sec) vs. #CPU (4-8) for small (48x4K) and large (48x128K) problem sizes; selected mappings range from 1-1-1-1 to 2-2-2-2 processors per stage]
Slide-67
• PVL • PETE • S3P • MatlabMPI
Outline
• Introduction
• Processing Algorithms
• Parallel System Analysis
• Software Frameworks
• Summary
Slide-68
Modern Parallel Software Layers
[Diagram: layered software; an Application (Input, Analysis, Output) on a Parallel Library (Vector/Matrix, Computation, Task, Conduit, over a Math Kernel and a Messaging Kernel) on Hardware (workstations, PowerPC clusters, Intel clusters)]
• Can build any parallel application/library on top of a few basic messaging capabilities
• MatlabMPI provides this Messaging Kernel
Slide-69
MatlabMPI “Core Lite”
• Parallel computing requires a small core set of capabilities:
– MPI_Run launches a Matlab script on multiple processors
– MPI_Comm_size returns the number of processors
– MPI_Comm_rank returns the id of each processor
– MPI_Send sends Matlab variable(s) to another processor
– MPI_Recv receives Matlab variable(s) from another processor
– MPI_Init is called at the beginning of the program
– MPI_Finalize is called at the end of the program
Slide-70
MatlabMPI:Point-to-point Communication
MPI_Send (dest, tag, comm, variable);
variable = MPI_Recv (source, tag, comm);
[Diagram: the sender saves the variable to a Data file on a shared file system and then creates a Lock file; the receiver detects the Lock file and then loads the Data file]
• Sender saves variable in Data file, then creates Lock file
• Receiver detects Lock file, then loads Data file
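The lock-file protocol is simple enough to sketch outside of Matlab. This Python stand-in (illustrative function and file names, pickle instead of MAT-files) follows the same two rules: write the data first and create the lock last, so the lock's existence guarantees the data is complete:

```python
# Sketch of MatlabMPI's lock-file protocol: the sender writes the payload
# to a data file, then creates a lock file; the receiver polls for the
# lock file before reading. Names and formats are illustrative.

import os
import pickle
import tempfile
import time

def send(msg_dir, tag, variable):
    data = os.path.join(msg_dir, f"msg_{tag}.pkl")
    lock = os.path.join(msg_dir, f"msg_{tag}.lock")
    with open(data, "wb") as f:        # 1. save the variable (Data file)
        pickle.dump(variable, f)
    open(lock, "w").close()            # 2. create the Lock file last

def recv(msg_dir, tag, poll=0.01):
    data = os.path.join(msg_dir, f"msg_{tag}.pkl")
    lock = os.path.join(msg_dir, f"msg_{tag}.lock")
    while not os.path.exists(lock):    # 1. wait for the Lock file
        time.sleep(poll)
    with open(data, "rb") as f:        # 2. load the Data file
        return pickle.load(f)

msg_dir = tempfile.mkdtemp()
send(msg_dir, tag=1, variable=list(range(1, 11)))
received = recv(msg_dir, tag=1)
print(received)
```

Ordering the writes this way is what lets the scheme work over any shared file system with no daemon: the only synchronization primitive is file creation.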
Slide-71
Example: Basic Send and Receive
MPI_Init;                          % Initialize MPI.
comm = MPI_COMM_WORLD;             % Create communicator.
comm_size = MPI_Comm_size(comm);   % Get size.
my_rank = MPI_Comm_rank(comm);     % Get rank.
source = 0;                        % Set source.
dest = 1;                          % Set destination.
tag = 1;                           % Set message tag.

if(comm_size == 2)                 % Check size.
  if (my_rank == source)           % If source.
    data = 1:10;                   % Create data.
    MPI_Send(dest,tag,comm,data);  % Send data.
  end
  if (my_rank == dest)             % If destination.
    data = MPI_Recv(source,tag,comm); % Receive data.
  end
end
• Bandwidth matches native C MPI at large message size
• Primary difference is latency (35 milliseconds vs. 30 microseconds)
[Figure: bandwidth (Bytes/sec, 1e5 to 1e8) vs. message size (1KB to 32MB) on an SGI Origin2000; MatlabMPI matches C MPI at large message sizes]
Slide-73
Image Filtering Parallel Performance
[Figures: fixed problem size (SGI O2000), speedup vs. number of processors (1-64), MatlabMPI vs. linear; scaled problem size (IBM SP2), Gigaflops vs. number of processors (1-1000), MatlabMPI vs. linear]
• Achieved "classic" super-linear speedup on fixed problem
• Achieved speedup of ~300 on 304 processors on scaled problem
• The community is developing software libraries to address many of these challenges:
– Exploits C++ to easily express data/task parallelism
– Separates parallel hardware dependencies from software
– Allows a variety of strategies for implementing dynamic applications (e.g., for fault tolerance)
– Delivers high performance execution comparable to or better than standard approaches
• Our future efforts will focus on adding to and exploiting the features of this technology to: