KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association (www.kit.edu)
Performance-oriented instrumentation for high-speed synchrotron imaging
Institute for Data Processing and Electronics at KIT
Instrumentation for high-speed synchrotron imaging
Optimizing tomographic reconstruction for parallel architectures
S. Chilingaryan, M. Caselle, T. Dritschler, A. Kopmann, A. Shkarin, M. Vogelgesang
(Processing pipeline: acquisition → flat field correction → noise reduction → sinogram generation → FFT → filter → iFFT → back projection → storage; raw data is also stored, and results feed segmentation / meshing.)
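The first stage of this pipeline, flat-field correction, can be sketched as follows. This is a minimal NumPy illustration of the standard normalization formula, not the UFO implementation; the function name and sample arrays are made up:

```python
import numpy as np

def flat_field_correct(raw, flat, dark):
    """Normalize a projection by the beam profile: (raw - dark) / (flat - dark).
    `flat` is an image taken without the sample, `dark` one without beam."""
    return (raw - dark) / np.maximum(flat - dark, 1e-6)  # avoid division by zero

raw = np.array([[55.0, 30.0]])
flat = np.array([[105.0, 105.0]])
dark = np.array([[5.0, 5.0]])
corrected = flat_field_correct(raw, flat, dark)  # transmission values in [0, 1]
```

The corrected projections are what the noise-reduction and sinogram-generation stages operate on.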
Institute for Data Processing and Electronics, Karlsruhe Institute of Technology
Karlsruhe Institute of Technology
KIT is the merger of Karlsruhe Research Center and Karlsruhe University; it is home to the ANKA synchrotron.
IPE Competences
March 2013 Prof. Dr. Marc Weber
Competences cover the full data path: from the detector and analog electronics through to data management.

Experiments:
- Astroparticle & High Energy Physics
- Atmosphere and Climate
- Nuclear Fusion
- Electrical Storage Systems
- Photon Science
- Ultrasound Tomography
- Nano and Microsystems
- Supercomputing & Big Data

Tools and Technologies:
- High-speed DAQ electronics
- High-performance and GPU computing
- Software optimization
- Databases and data warehousing
- Web-based data visualization

Data path: Detector → Digital Electronics → Data Analysis → Data Management
Example 4D cine-tomography experiment
In vivo X-ray 4D cine-tomography experiment. (a) Photograph of Sitophilus granarius, dorsal view. (b) Experimental set-up for ultra-fast X-ray microtomography showing bending magnet (1), rotation stage (2), fixed specimen (3) and detector system (4). (c) Radiographic projection. (d) 3D rendering of the reconstructed volume with thorax cut open and revealing hip joints (arrows). (e) In vivo cine-tomographic sequence of moving weevil.
Fully automated 4D imaging of living species with high spatial and temporal resolution and image-based control.
Scope of the problem & Projects
UFO — Online monitoring and image-based control: real-time reconstruction and visualization
STROBOS — Low-dose laminography: high-quality reconstruction from under-sampled data for diffraction laminography
ASTOR — Post-processing tools for biologists: work-flow for remote semi-automated segmentation
Ultra Fast X-ray Imaging of Scientific Processes with On-Line Assessment and Data-Driven Process Control
- Enable data-driven control
- Auto-tuning optical system
- Tracking dynamic processes
- Finding the area of interest
4D Tomography of living organisms
Two constraints: radiation and motion.
- The Nyquist–Shannon criterion defines the minimum number of projections required for a quality reconstruction
- Radiation is destructive and limits the duration of the experiment
- Motion during the acquisition of projections blurs the reconstruction
- We therefore need to reduce the number of projections used
- A priori knowledge can be introduced to overcome the restriction
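The angular form of the Nyquist–Shannon criterion is the standard rule of thumb that a detector row of N pixels needs roughly π/2 · N equally spaced projections over 180°. A quick illustrative calculation (function name is made up):

```python
import math

def min_projections(n_detector_pixels):
    """Rule-of-thumb angular sampling requirement for parallel-beam CT:
    about pi/2 * N projections over 180 degrees for an N-pixel detector row."""
    return math.ceil(math.pi / 2 * n_detector_pixels)

# For a ~2000-pixel detector row this demands thousands of projections,
# which is why reconstructing from only 50 projections (as shown later
# in this deck) requires a priori knowledge instead of dense sampling.
required = min_projections(2016)
```

This makes the tension concrete: dense angular sampling conflicts directly with the dose and motion limits listed above.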
Reconstruction Algorithms
Analytic methods (high performance — faster):
- FBP
- DFM: DFI, Gridding, Pasciak

Iterative methods (customizable reconstruction — more robust):
- ART: SIRT, SART, OS-SART
- Minimization techniques: Shrinkage, SD, CGLS, ...
- Compound methods: ASD-POCS, Split-Bregman, ...
- Add geometry modeling, projection modeling, and a priori knowledge
Handling the computational problem:
- Distributed control system based on Infiniband interconnects
- GPU-based computing
- Multiple levels of scalability
- Cheap off-the-shelf components
- Modular reconstruction framework

Easily scalable: up to 4 GPUs with 35 Tflops for ~5000 EUR
UFO Control Network
(Diagram: a camera station at the beam-line is connected via CameraLink and PCIe x8 gen3 to a master server with lots of memory; an IB router links it over optical and electrical Infiniband to OpenCL compute nodes and storage nodes in the compute center, to the control room, and to an archive in the Large Scale Data Facility; Ethernet and 10 Gig Ethernet carry slow control. A scalable control network.)
Master Server
(Diagram: cameras — PCO.edge, PCO.dimax, … — attach internally; on the storage side the server connects to the LSDF (Large Scale Data Facility) via FDR Infiniband at 56 Gbit/s, external PCIe x16 (16 GB/s), 10 Gb/s Ethernet, and SFF-8088 (2.4 GB/s).)

SuperMicro 7047GR-TRF (Intel C602 chipset)
- CPU: 2 x Xeon E5-2680v2 (20 cores total at 2.8 GHz)
- GPUs: 7 x NVIDIA GTX Titan
- Memory: 256 GB (512 GB max)
- Network: Intel 82598EB (10 Gb/s)
- Infiniband: 2 x Mellanox ConnectX-3 VPI
- Storage: Areca ARC-1880-ix-12 SAS RAID; 8 x Samsung 840 Pro 510 (RAID 0); 16 x Hitachi A7K200 (RAID 6)

Key points: a high amount of memory; a fast SSD-based RAID for overflow data; easy scalability with external PCI Express and SAS.
UFO Image Processing Framework
A fully pipelined architecture supporting a diversity of hardware platforms, based on open standards for easy exchange of algorithms. Easy prototyping with Python and other scripting languages.
UFO Algorithms: Synchrotron data
(Image comparison: FBP vs. Split-Bregman with framelets — zoomed joint of Sitophilus granarius (grain weevil), plus a segmented view. Reconstructed from 50 projections, ~1/40 of a standard dataset.)
Reconstruction Performance
Method      Complexity     Throughput
DFI         N² · log N     1600 MB/s (CUDA), 850 MB/s (OpenCL)
FBP         M · N²         800 MB/s
SART/SIRT   N² equations   500 KB/s

(Plot: scalability with 1–7 GPUs for FBP and DFI, in MB/s. With CUDA, throughput scales up to 6000 MB/s (~2250 MB/s from a 12-bit camera); with OpenCL, up to 2500 MB/s (~950 MB/s from a 12-bit camera).)
Software Stack
(Diagram:
- Camera station: ALPS (Advanced Linux PCI Services), LibUCA (Unified Camera Access), LibPCO (PCO drivers), FastWriter (streaming library), Concert (control system), UFO (GPU image processing framework), OpenCL, KIRO
- Tango control: device servers for motors and other slow devices
- UFO master server and computing cluster: UFO, OpenCL, KIRO, Python (GObject-Introspection)
- Storage cluster: XFS, iSER, SoftRaid
- Remote users: WAVe, DM)
ASTOR
(Diagram of the ASTOR infrastructure: the UFO station manages recorded data through ASTOR Data Management, which archives and restores data in the long-term archive (LSDF). Users initiate analysis via the ASTOR web catalog and the ASTOR portal; compute servers get data from the catalog through a processing cache; virtualization servers and VM-storage servers host the ASTOR analysis services, which request data and store results back.)
WAVe: Web-based volume visualization
- Ray-casting approach
- Preview slice-maps are pre-generated using the UFO framework
- Optimized storage layout for fast zooming
- Works on the majority of mobile platforms with decent GPUs
- Multiple zooming levels for inspecting fine details
- High-quality cuts
- Automatic thresholding-based segmentation
- Multi-modality rendering support
Optimizing tomography for parallel architectures
(Figure: a Compute Unit on Fermi.)

GPUs consist of SIMD-type Compute Units (CUs):
- One instruction is executed on many data items
- Each CU is able to execute several operation types, but only FP additions/multiplications are fast

They possess a complex memory hierarchy:
- Low bandwidth-per-flop ratio and small caches
- Up to four different types of memory
- Optimal access patterns have to be followed

Architectures vary drastically:
- Sizes, speed, and structure of memories/caches
- Types and number of the provided processing units
- Balance of operation throughput
Codes and algorithms have to be carefully optimized for the specific parallel architecture
Memory model:
- Host memory: 6 GB/s (PCIe x16 gen2) to 12 GB/s (PCIe x16 gen3)
- Global memory: 100–300 GB/s, with latencies up to 1000 clocks
- Local memory: 1–2 TB/s (total)

A complex memory hierarchy of 4 levels, with each level one order of magnitude faster than the previous!
Programming Model

(Hierarchy: grid → block → warp → thread.)

E.g. the resulting image is mapped to a 1-, 2-, or 3D grid of GPU threads, and each pixel is computed by the thread whose index equals the pixel coordinates.

- All threads execute the same code (kernel)
- The task is defined by the linear or volumetric index of the thread
- The GPU schedules threads in groups of fixed size (warps)
- A user-defined block of threads is assigned to a specific CU
- Threads of a block may exchange data using the CU's shared memory

The thread abstraction is used to split the problem space into independent GPU tasks.
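The pixel-to-thread mapping can be made concrete with a plain-Python stand-in for the GPU scheduler. `run_kernel_2d` and the block size here are illustrative, not UFO code; each simulated thread derives its pixel coordinate from (block index, thread index), exactly as a kernel does via blockIdx * blockDim + threadIdx:

```python
def run_kernel_2d(width, height, block_dim, kernel):
    """Toy emulation of launching a 2D grid of block_dim x block_dim blocks."""
    grid_x = (width + block_dim - 1) // block_dim   # blocks per row
    grid_y = (height + block_dim - 1) // block_dim  # blocks per column
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(block_dim):
                for tx in range(block_dim):
                    x = bx * block_dim + tx  # pixel x = blockIdx.x*blockDim.x + threadIdx.x
                    y = by * block_dim + ty
                    if x < width and y < height:  # guard threads past the image edge
                        kernel(x, y)

out = [[0] * 5 for _ in range(4)]
run_kernel_2d(5, 4, 16, lambda x, y: out[y].__setitem__(x, x + y))
```

On a real GPU the four nested loops run concurrently; the guard is needed because the grid is rounded up to whole blocks.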
Scheduling
- Warps from several blocks are executed by a CU in parallel
- The number of currently resident warps is called occupancy
- Occupancy is limited by the available registers and shared memory
- Suboptimal occupancy limits the instruction bandwidth

(Diagram: two warp schedulers issue instructions — warp 1 instr 1, warp 2 instr 1, warp 3 instr 1, warp 3 instr 2, warp 1 instr 2, warp 4 instr 1 — over time to cores, SFUs, and load units: multiple warps on a CU are executed in parallel, and independent instructions are executed in parallel.)

Warp 4 will be blocked for a long time, but the other warps on the CU will execute and hide the latency. For optimal performance we have to increase occupancy and the number of independent instructions.
FBP Reconstruction
(Figure: fan of projections 1…5 at angle α over the reconstructed slice.)

1. Filtering: multiplication with the configured filter in Fourier space.
2. Back projection: for each position (x, y) compute v = x●cos(α) - y●sin(α); interpolate between the neighboring bins; sum over all projections; the sum is the value of (x, y).

For each texel of the output volume and for each projection we perform a single linear interpolation.
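The two stages can be sketched in NumPy as a toy CPU stand-in for the GPU kernels. This assumes a simple ramp filter; the `fbp` function and the point phantom below are illustrative, not the UFO implementation:

```python
import numpy as np

def fbp(sinogram, angles):
    """Filtered back projection of a (n_projections, n_bins) sinogram."""
    n_proj, n_bins = sinogram.shape
    # 1. Filtering: multiply each projection by a ramp filter in Fourier space
    ramp = np.abs(np.fft.fftfreq(n_bins))
    filtered = np.fft.ifft(np.fft.fft(sinogram, axis=1) * ramp, axis=1).real
    # 2. Back projection: v = x*cos(a) - y*sin(a), linear interpolation, sum
    half = n_bins // 2
    y, x = np.mgrid[-half:half, -half:half]
    bins = np.arange(n_bins) - half
    image = np.zeros((n_bins, n_bins))
    for a, proj in zip(angles, filtered):
        v = x * np.cos(a) - y * np.sin(a)
        image += np.interp(v, bins, proj)  # linear interpolation between bins
    return image * np.pi / len(angles)

# Reconstruct a point at the rotation centre: its sinogram is a constant
# delta at the central bin for every angle.
angles = np.linspace(0, np.pi, 90, endpoint=False)
sino = np.zeros((90, 64))
sino[:, 32] = 1.0
rec = fbp(sino, angles)
```

The per-angle `np.interp` over the whole pixel grid is exactly the "one linear interpolation per texel per projection" the slide describes; on the GPU it is done by the texture engine.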
Texture Engine
Features:
- Spatial-aware cache
- Bi-/tri-linear interpolation
- Normalized coordinates
- Different clamping modes

Applications:
- Linear interpolation, e.g. image scaling
- Optimizing random access to multidimensional arrays
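What the texture engine computes in hardware can be written out in a few lines. This is a pure-NumPy illustration of a 2D texture fetch with bilinear interpolation and clamp-to-edge addressing; `tex2d` is a made-up name:

```python
import numpy as np

def tex2d(img, x, y):
    """Emulate a bilinear texture fetch with clamp-to-edge addressing."""
    h, w = img.shape
    # clamp-to-edge addressing mode
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # bilinear weighting of the four surrounding texels
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bot = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bot

img = np.array([[0.0, 1.0], [2.0, 3.0]])
sample = tex2d(img, 0.5, 0.5)  # midpoint of the four texels averages them
```

On the GPU all of this — clamping, the two fractional weights, and the four fetches — costs a single texture instruction, which is why back projection leans on it so heavily.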
Filtered Back Projection
(Diagram: an image loader feeds a pool of sinograms in host memory; a pool of CPU and GPU processing threads fetches slices for processing and stores results into a pool of vertical slices in host memory, which are written to data storage. Each GPU thread runs a two-stage, double-buffered pipeline: 1st stage — PCIe data transfer in and filtering; 2nd stage — texture-based back projection of a W x H slice and PCIe data transfer out.)
Performance of Texture Engine

                    GTX 280    GTX 580
Core throughput     930 GF     1580 GF
Texture fill rate   48 GT/s    49 GT/s
Ratio               19.3       31.6
S. Chilingaryan, M. Vogelgesang, A. Mirone, A. Kopmann — Institute for Data Processing and Electronics, Karlsruhe Institute of Technology
Optimizing FBP for Fermi
Standard version — the texture engine is heavily loaded: reconstructing an N x N image takes N² texture fetches per projection.

Fermi-optimized version — both the texture and compute engines are used: bins are staged through shared memory with (3/2)●N texture fetches, and the N² interpolations are computed by the cores.

A thread block of 16 x 16 px actually accesses only 3●N/2 bins per projection, because for v = x●cos(α) - y●sin(α):

    max over x,y<N of v  -  min over x,y<N of v  ≤  N√2  <  1.5●N
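The 1.5·N bound is easy to verify numerically. A small illustrative Python check (not part of any kernel), sweeping all integer angles over a 16 x 16 pixel tile:

```python
import math

def v_range(block, angle):
    """Span of v = x*cos(a) - y*sin(a) over a block x block pixel tile."""
    vs = [x * math.cos(angle) - y * math.sin(angle)
          for x in range(block) for y in range(block)]
    return max(vs) - min(vs)

# Worst case over all angles; the maximum occurs near 45 degrees,
# where the span approaches 16*sqrt(2) ~ 22.6 — always under the
# 1.5 * 16 = 24 bins that the optimized kernel caches per projection.
worst = max(v_range(16, math.radians(a)) for a in range(360))
assert worst <= 16 * math.sqrt(2) < 1.5 * 16
```

This is why a fixed-size shared-memory buffer of 1.5·N bins is guaranteed to cover every thread in the block, for every projection angle.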
Pixel to thread mapping

(Diagram: a single 16x16 thread block covers a tile of the volume in bin/projection space; 48 bins of a projection are required for the current 32 px block, held in texture memory as 48 bins x 16 projections.)

Step 1: filling shared memory — only 48 texture fetches per projection.
Step 2: integrating the volume — 32² interpolations per projection over a 32 x 32 px tile; each thread (1,1), (1,2), …, (3,3) reads from shared memory and accumulates into the volume.

16 projections are processed in a single pass; the full set is processed in multiple passes of 16 projections each.

Processing 4 pixels per thread reduces the number of texture fetches and hides operation latencies with multiple independent operations (instruction reordering):

Px./thread   Fetches/px.   Regs   ShMem   Occup.   ILP
1            0.09375       26     1536    66%      1
4            0.046875      32     3072    66%      4
Summary

Scalable hardware platform for image-based control:
- Only off-the-shelf components are used
- Easily scalable from a single PC to a GPU cluster
- Reliable storage for data streaming at rates up to 4 GB/s
- Distributed over a large area using optical Infiniband links

Fully-pipelined parallel image-processing framework:
- Tuning for various parallel architectures
- Real-time reconstruction (up to 2 GB/s from camera)
- Fast low-dose reconstruction (about 4 hours per dataset)

Remote data analysis infrastructure:
- Virtualization environment for remote image segmentation
- High-quality web-based visualization of large volumes