Parallel implementations of a 3-D image reconstruction algorithm
L. Pastor*, A. Sánchez†, F. Fernández*, A. Rodríguez*
* Departamento de Tecnología Fotónica
† Departamento de Lenguajes y Sistemas Informáticos
del Monte, Madrid, Spain
ABSTRACT
This paper compares two different parallel implementations of Feldkamp's cone-beam reconstruction method for 3D tomography. The first approach is based on a vector-parallel shared-memory architecture, and the second on a transputer-based distributed-memory architecture. The experimental results have shown the effectiveness of both models for executing this kind of compute-intensive parallel algorithm.
1. INTRODUCTION
From a computational point of view, 3D image reconstruction is a very demanding task. For example, for reconstructing an object with N³ voxels (volume elements), the cone-beam backprojection operation, the most time-consuming stage in filtered backprojection methods (Feldkamp et al. [4]), requires around 60N⁴ FLOPs. Other 3D reconstruction techniques need even more operations (Smith [15], Grangeat [5]). In order to achieve acceptable reconstruction times for practical resolutions (in the range between 128³ and 512³), special purpose hardware can be developed.
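As a back-of-the-envelope check of the figure above, the cost model can be written out explicitly. The constant of 60 operations per voxel per projection and the assumption of m = N projections are illustrative, reconstructed from the text, not measured values:

```c
/* Hypothetical cost model: an N^3-voxel volume, m projections, roughly
   60 floating-point operations per voxel per projection (illustrative
   constant).  With m = N projections this yields the ~60 N^4 total. */
double backprojection_flops(int n, int m) {
    return 60.0 * (double)n * (double)n * (double)n * (double)m;
}
```

For N = m = 512 this already exceeds 4 x 10^12 operations, which is why special purpose hardware or parallel machines become attractive.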
An alternative approach is to exploit the parallelism inherent in reconstruction algorithms: an in-depth analysis of the problem and of its algorithmic solution can help define the computational task so as to achieve the maximum degree of parallelism, taking
Transactions on Information and Communications Technologies vol
3, © 1993 WIT Press, www.witpress.com, ISSN 1743-3517
184 Applications of Supercomputers in Engineering
advantage of existing parallel machines. In general this last approach gives lower performance, although it has a very good price/performance ratio. Similar solutions have been developed in other image processing domains (Li et al. [13], Webber [17]).
This paper describes the implementation of Feldkamp's cone-beam reconstruction method on two different parallel architectures: a shared-memory vector-parallel multiprocessor (Alliant FX/40) and a hierarchical distributed-memory message-passing multiprocessor (T.Node). The first machine was selected for implementing reconstruction algorithms in the EC BRITE project 'EVA' (Morisseau et al. [14]), developed jointly by INTERCONTROLE (France), LETI-CEA (France), CUALICONTROL/DTF-UPM (Spain), MILANO RICERCHE (Italy), REGIENOV (France) and FAIREY TECRAMICS (United Kingdom).
This paper is organized as follows: Section 2 presents the main stages of Feldkamp's reconstruction method. Section 3 offers a brief description of the architectural features of the multiprocessors considered. Section 4 describes the implemented solutions. Section 5 presents and compares the major experimental results for both parallel architectures. Finally, Section 6 offers some concluding remarks.
2. FELDKAMP'S CONE-BEAM RECONSTRUCTION METHOD
Feldkamp's method [4] is a cone-beam geometry extrapolation of
fan-beam
bidimensional reconstruction techniques. It is composed of three
main stages:
- Projection weighting: the projection data are multiplied by coefficients that depend only on their position within the projection (the coefficients remain constant for all of the acquisitions).
- Filtering: the weighted data are convolved with a one-dimensional filter, such as Shepp-Logan's (Jain [9]).
- Backprojection: during the method's third stage, the weighted and filtered projection data are backprojected using a cone-beam geometry (Fig. 1).
Figure 1 Cone-beam acquisition (source, object, detector).
Feldkamp's method was the first practical cone-beam reconstruction method available for 3D tomography. Being an extrapolation of fan-beam techniques, it is correct only for the object's middle plane (the plane containing the source's circular trajectory). For small vertical cone aperture angles, the reconstruction errors remain acceptable. The method's most salient feature is its simplicity, being remarkably efficient from the computational point of view. A sequential algorithm for Feldkamp's method can be found in Jacquet [8].
In this paper only the backprojection stage has been considered for parallelization, for two reasons:
- The complexity of the cone-beam backprojection is substantially higher than the complexity of the other two stages together (O(N⁴) for backprojection versus lower-order costs for weighting and filtering).
- The first two stages can be computed at data acquisition time, together with additional preprocessing operations.
3. CONSIDERED MULTIPROCESSORS
3.1. Shared-memory paradigm: ALLIANT FX/40
The available shared-memory multiprocessor is an Alliant FX/40 with four Advanced Computational Elements (ACEs), one Interactive Processor (IP), 256 Kb of shared cache and 32 Mb of main memory (Alliant [2]). The ACEs are processors with vector capabilities that are crossbar-connected to the shared cache, which is in turn connected via a high-speed bus to the shared memory.
In the Alliant FX/40 system, concurrent programs can be produced using the FX/Fortran compiler (Alliant [1]), which has been especially designed to support concurrency and is one of the best vectorizing compilers. Optimized library functions for computing vector and matrix operations are also available.
3.2. Distributed-memory paradigm: TELMAT T.NODE
The T.Node system is a commercial product of Telmat Informatique (Telmat [16]) which emerged from the development of the Supernode (ESPRIT project P1085). It is a loosely coupled MIMD multiprocessor machine based on T800-25 MHz transputers (Inmos [7]), in which the interconnections (called links) among transputers are made via software-controlled switches. This modular and hierarchical architecture is based on reconfigurable nodes (or modules) of transputers, allowing the interconnection of up to 1024 processors.
Each basic node is a reconfigurable network with a maximum of 16 worker transputers and an associated control transputer. The four bidirectional links of each transputer are connected to a 72x72 crossbar switch, configured by a program running on the control transputer. An additional control bus, with a master-slave protocol, enables any transputer to communicate with the control one independently of the links.
The available hardware configuration has 3 worker boards (or clusters) with 8 transputers per cluster. Each processor has 2 Mb of dynamic RAM. The host machine is a Sun 4 workstation.
The programming language is 3L's Fortran 77 with library functions for specifying interprocessor communication. The 3L programming environment consists of a software toolset for compiling, linking, debugging and running parallel applications on transputers (3L [18]). Since transputers are reconfigurable processors, it is necessary to describe how they are to be interconnected for running a particular program. This is done by means of a configuration file, which includes the number of needed transputers, how their
links are connected and the way of mapping software processes
onto the
physical network.
4. PARALLELIZATION ASPECTS IN THE BACKPROJECTION STAGE
This section describes the main aspects considered in the parallelization of Feldkamp's algorithm. In particular, we have focused on how the backprojection step can be efficiently executed on the two parallel architectures considered.
4.1. Code optimizations (for shared-memory architectures)
The strategy followed in this case considers two main lines:
1. Techniques such as code vectorization and parallelization (in particular, execution of nested loops in Concurrent Outer (loop), Vector Inner (loop) or COVI mode), elimination of unnecessary or redundant code, factorization of operations, subroutine expansion, loop collapsing, etc. have been used when advisable to achieve good results. The COVI execution mode for nested loops achieves the computer's maximum parallel and vector performance.
2. An important problem found when optimizing large-resolution reconstructions was the degradation of the system's virtual-memory performance when large data volumes were used. The evolution of the reconstruction times as the problem size increased showed clearly that maintaining a monolithic data structure was unfeasible. This problem was avoided by limiting the amount of memory used by the application, in order to match the computer's main memory size.
Feldkamp's method uses two data volumes: the acquisition or projections volume, and the object volume. The first volume is accessed projection after projection, so it is not necessary to keep the used projections in memory. The object volume, on the other hand, is traversed for each projection. This last volume was decomposed into horizontal slices to limit the memory requirements of large-resolution reconstructions.
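The slice-decomposed loop structure described above can be sketched as follows. The array size and the constant per-projection weight are placeholders; the real kernel interpolates the filtered projection at each voxel's detector coordinates.

```c
enum { N = 64 };   /* illustrative resolution per axis */

/* Backproject one (weighted, filtered) projection into one horizontal
   slice.  On the Alliant the outer loop would run concurrently across
   the ACEs while the inner loop vectorizes (COVI mode); the geometry is
   reduced here to a constant weight for brevity. */
void backproject_slice(float slice[N][N], float proj[N][N], double weight) {
    for (int y = 0; y < N; y++)          /* concurrent outer loop */
        for (int x = 0; x < N; x++)      /* vector inner loop */
            slice[y][x] += (float)(weight * proj[y][x]);
}
```

Because each slice is updated independently, only the current slice and the current projection need to be resident in memory, which is precisely what bounds the working set.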
4.2. Mapping strategies (for distributed-memory architectures)
The cone-beam backprojection can be computed by accumulating the contributions of the projections to each of the points of the considered n slices S = (s_1, s_2, ..., s_n) of the object to be reconstructed, taking into account the
superposition effect of the m different projections P = (p_1, p_2, ..., p_m). The density of the corresponding slices D = (d_1, d_2, ..., d_n) can be computed independently one from another. This stage can be summarized by the following expression:

d_i = \sum_{j=1}^{m} f(p_j, s_i),   1 <= i <= n

which can be transformed into the local recurrent form:

d(j, i) = d(j-1, i) + f(p(j, i-1), s_i),   p(j, i) = p(j, i-1)

with initial conditions d(0, i) = 0 and p(j, 0) = p_j, and final result:

d_i = d(m, i),   1 <= i <= n

Figure 2 (a) Backprojection dependence graph (m = 6, n = 3); (b) a signal flow graph (SFG).

The set {(j, i) | 1 <= j <= m, 1 <= i <= n} defines the index space of the dependence graph (DG), from which a mapping onto the
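The recurrence can be transcribed directly into code. Here f is stubbed as a simple product, standing in for the backprojection of projection p_j into slice s_i:

```c
enum { M = 6, NS = 3 };   /* m projections, n slices, as in figure 2 */

/* Placeholder for the real kernel f(p_j, s_i). */
static double f(double p, double s) { return p * s; }

/* d(j,i) = d(j-1,i) + f(p_j, s_i), with d(0,i) = 0; result d_i = d(m,i). */
void accumulate(const double p[M], const double s[NS], double d[NS]) {
    for (int i = 0; i < NS; i++) {
        d[i] = 0.0;                       /* initial condition d(0,i) = 0 */
        for (int j = 0; j < M; j++)
            d[i] += f(p[j], s[i]);
    }
}
```

The outer loop runs over independent slices, which is exactly the independence exploited by the distributed mapping below.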
transputer network can be systematically derived using different time-allocation functions (Kung [11]) which fulfil the causality constraints. In our problem we have taken into consideration the communication and memory requirements and the index space dimensions. The following allocation function has been selected:
allocation(j, i) = i
which is equivalent to a projection of the DG in the j-axis direction (figure 2.b). For the timing function, it is possible to select two permissible options:

timing1(j, i) = j (with broadcasting)
timing2(j, i) = j + i (without broadcasting)
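Written out as code, the mapping is just two small functions; checking that timing2 increases along every dependence arc of the pipeline is the causality condition mentioned above:

```c
/* allocation(j, i): the slice index i decides the processor (projection
   of the dependence graph along the j-axis). */
int allocation(int j, int i) { (void)j; return i; }

/* Two permissible schedules: timing1 requires each projection to be
   broadcast to all processors in the same step; timing2 lets it ripple
   down the pipeline one processor per step. */
int timing1(int j, int i) { (void)i; return j; }
int timing2(int j, int i) { return j + i; }
```

With timing2, the last result emerges at step m + n, so the pipeline fills and drains in n extra steps compared with broadcasting.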
This systolic-like approach allows us to consider different parallelization strategies for the backprojection stage of Feldkamp's algorithm. The chosen distributed solution is detailed in the following section.
4.3. Implementation on distributed-memory architectures
Figure 3 Selected topology and software processes.
We have considered different strategies for exploiting parallelism: pipeline, geometric and farm (de Carlini et al. [3], Harp [6], Jane et al. [10]). Due to the
characteristics of the problem being solved, and taking into account the limitations of our hardware configuration (especially the number of processors and the size of the available memory in each transputer), we have adopted a pipeline topology, as shown in figure 3.
In this topology, three types of processes can be distinguished: a process that sequentially performs the weighting and filtering stages if needed (main); many identical processes, one replicated in each worker transputer, which perform backprojection over independent slices of the target volume (backprojectors); and, finally, a process which sends projections from main to the first backprojector in the pipeline and collects slices of the reconstructed volume in the opposite direction (recollector). To summarize, the operations carried out by each process type are expressed with the following pseudocode:
Process main
(1) Initialization step
(2) FOR all backprojectors b_i DO
    (2.1) Send identification and slice limits s_i to backprojector b_i through the recollector process
(3) FOR all projections p_j of the target volume DO
    (3.1) Read projection p_j
    (3.2) Weight projection p_j {if necessary}
    (3.3) Filter projection p_j {if necessary}
    (3.4) Send filtered projection p_j to the first backprojector of the pipeline through the recollector process
(4) FOR all slices s_i of the reconstructed target volume DO
    (4.1) Receive the normalized slice s_i corresponding to backprojector b_i through the recollector process
    (4.2) Write the slice s_i onto disk

Process backprojector (b_i)
(1) Receive identification and limits of its volume slice s_i from backprojector b_{i-1} (or directly from the recollector for b_1)
(2) Send identification and limits of volume slices s_l (l > i) to backprojector b_{i+1}
(3) FOR all projections p_j of the target volume DO
    (3.1) Receive filtered projection p_j from backprojector b_{i-1} (or directly from the recollector for b_1)
    (3.2) FOR all lines l_k of its volume slice DO
        (3.2.1) Perform backprojection operations over l_k using projection p_j
    (3.3) Send filtered projection p_j to backprojector b_{i+1}
(4) Normalize reconstructed volume slice s_i
(5) Send reconstructed volume slice s_i to backprojector b_{i-1} (to be finally received by process main)
(6) FOR all reconstructed volume slices s_l (l > i) DO
    (6.1) Receive slice s_l from backprojector b_{i+1}
    (6.2) Send slice s_l to backprojector b_{i-1}

Process recollector
(1) Receive identifications and limits of volume slices from main
(2) Send identifications and limits of volume slices to backprojector b_1
(3) FOR all filtered projections p_j DO
    (3.1) Receive p_j from main
    (3.2) Send p_j to backprojector b_1
(4) FOR all volume slices s_i of the reconstructed target volume DO
    (4.1) Receive s_i from backprojector b_1
    (4.2) Send s_i to main
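The net effect of the three processes above can be modelled in a few lines; message passing is replaced by an array walk, the backprojection is stubbed as a plain accumulation, and the real implementation would use 3L channel operations instead:

```c
enum { NPROJ = 4, NWORK = 3 };   /* illustrative pipeline dimensions */

/* Every projection visits every backprojector in pipeline order; each
   stage accumulates its contribution into its own slice and forwards
   the projection unchanged, as in steps (3.1)-(3.3) above. */
void run_pipeline(const double proj[NPROJ], double slice[NWORK]) {
    for (int i = 0; i < NWORK; i++) slice[i] = 0.0;
    for (int j = 0; j < NPROJ; j++)        /* main emits projection j */
        for (int i = 0; i < NWORK; i++)    /* stages b_1 .. b_NWORK   */
            slice[i] += proj[j];           /* backprojection stub     */
    /* the finished slices then flow back to the recollector */
}
```

The toy model makes the key property visible: every slice receives the contribution of every projection exactly once, regardless of the number of pipeline stages.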
5. EXPERIMENTAL RESULTS
This section presents the experimental results achieved. The tests performed include volume reconstructions with resolutions of 32³, 64³, 128³, 160³ and 180³ voxels, using the two parallel implementations of Feldkamp's reconstruction method described in the previous section.
Figure 4 refers to the Alliant FX/40, and shows processing time as a function of the number of processors. Each curve represents a different target volume resolution. It is interesting to point out that doubling the number of processors cuts the processing time by a factor of almost two, with the exception of the 32³ case.
The following three figures correspond to experiments performed on the Telmat T.Node. Figure 5 represents the same tests using different numbers of worker processors (1, 2, 4, 8, 16 and 23, respectively). Each processor performs the cone-beam backprojection on its own assigned volume slice, in parallel with the remaining processors.
Figure 5, like figure 4, shows large processing-time reductions when the number of available processors is increased. In this case, the reductions differ somewhat more from the maxima achievable, due to inter-transputer
Figure 4 Processing time in the Alliant FX/40 as a function of the number of processors, for various image resolutions.
communication times and to the presence of inherently sequential algorithm portions.
Speedup curves for the 64³ case are displayed in figure 6: the curves represented are the ideal (linear) speedup, the experimental speedup obtained from practical time measurements, and an adjusted speedup line least-squares fitted to the experimental results. This last line gives a 72% efficiency (measured as the quotient of the speedup divided by the number of processors).
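The efficiency figure is computed as follows; the sample numbers in the usage are illustrative, not the measured times:

```c
/* Speedup and efficiency as defined in the text: S(p) = T(1)/T(p) and
   E(p) = S(p)/p.  A least-squares fit of S against p gives the adjusted
   speedup line of figure 6. */
double speedup(double t1, double tp)           { return t1 / tp; }
double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }
```

For example, a run that takes 100 s on one processor and 25 s on eight has speedup 4 and efficiency 0.5.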
Figure 7 compares communication and backprojection times for different resolution reconstructions using the maximum (best) number of worker transputers (23 backprojectors). Communication time does not depend significantly on the number of processors, but on the reconstructed volume
Figure 5 Processing time in the Telmat T.Node as a function of the number of processors, for various image resolutions.
Figure 6 Speedup for an image with resolution of 64³ voxels.
resolution (in general, the number of projections is much greater than the number of volume slices). This observation, together with the backprojection
Figure 7 Communication vs. backprojection time in the Telmat T.Node.
Figure 8 Comparative processing time of the Alliant FX/40 and Telmat T.Node versions, as a function of different problem sizes.
time increase when the target volume's resolution grows, explains why it is advantageous to use a large number of transputers.
Finally, figure 8 compares the experimental backprojection times measured on the Alliant FX/40 and the Telmat T.Node. Each curve has been
obtained using the maximum number of processors available for
performing the
backprojection stage (in our case, 4 and 23 for the
shared-memory and
distributed-memory multiprocessors, respectively).
6. CONCLUDING REMARKS
In this paper, two parallel implementations of Feldkamp's method have been presented, each of them developed for a very different multiprocessor architecture. The experimental results have shown that the backprojection, the most computationally intensive stage, can be effectively parallelized: large speedups have been achieved in both shared- and distributed-memory multiprocessors when the number of processors used is increased.
Comparing both implementations, it can be noted that for all of the tests performed, the response time of the shared-memory computer is better. This can be attributed to the additional speed improvements achieved by using the Alliant's vector capabilities. However, the differences between both machines show a small percentage decrease when the resolution increases. (It is interesting to point out that the overall memory available to the user was similar in both machines.) An aspect in favour of the transputer-based solution is that further machine upgrades can be performed at a smaller cost.
Regarding the T.Node version, the use of a pipeline topology has been shown to yield good performance, in particular for large resolutions. Moreover, the processing performance can be increased by adding processors to the network without process modifications. An important point in this sense, confirmed by the experimental results, is that the total communication time does not depend significantly on the number of transputers used. Similar parallelization strategies are presently being applied to other reconstruction techniques, such as Grangeat's [5].
ACKNOWLEDGEMENTS
This work has been partly supported by the Spanish Ministry of Education and Science (CICYT) under grant ROB91-0489 and by the European Community BRITE Project EVA no. P-2051-4, contract no. R/1B-0285-C.
REFERENCES
[1] Alliant Computer Systems Corporation, FX/Fortran Language Manual, Massachusetts, USA, 1988.
[2] Alliant Computer Systems Corporation, FX/Series Architecture Manual, Massachusetts, USA, 1988.
[3] de Carlini, U. and Villano, U. Transputers and Parallel Architectures: Message-passing Distributed Systems, Ellis Horwood, Chichester, England, 1991.
[4] Feldkamp, L.A., Davis, L.C. and Kress, J.W. 'Practical cone-beam algorithm', Journal of the Optical Society of America A, Vol. 1, No. 6, pp. 612-619, 1984.
[5] Grangeat, P. 'Mathematical Framework of Cone Beam 3D Reconstruction via the First Derivative of the Radon Transform', in Mathematical Methods in Tomography (ed. Herman, G.T., Louis, A.K. and Natterer, F.), Springer-Verlag, Heidelberg, Germany, 1991.
[6] Harp, G. (Ed). Transputer Applications, Pitman, London, England, 1989.
[7] INMOS Ltd., The Transputer Databook, Redwood Burn Ltd., Trowbridge, England, 1989.
[8] Jacquet, I. Reconstruction d'images 3D par l'algorithme éventail généralisé, Thèse C.N.A.M., Grenoble, France, 1988.
[9] Jain, A.K. Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, USA, 1989.
[10] Jane, M.R., Fawcett, R.J. and Mawby, T.P. (Eds). Transputer Applications - Progress & Prospects, Proc. of the Closing Symp. of the SERC/DTI Initiative in the Engineering Applications of Transputers, Reading, IOS Press, Amsterdam, The Netherlands, 1992.
[11] Kung, S.Y. VLSI Array Processors, Prentice-Hall, Englewood Cliffs, USA, 1988.
[12] Lewis, T.G. and El-Rewini, H. Introduction to Parallel Computing, Prentice-Hall, Englewood Cliffs, USA, 1992.
[13] Li, J. and Miguet, S. 'Parallel Volume Rendering of Medical Images', in Parallel Computing: From Theory to Practice (ed. Joosen, W. and Milgrom, E.), pp. 332-343, Proceedings of the European Workshop on Parallel Computing, Barcelona, Spain, IOS Press, Amsterdam, The Netherlands, 1992.
[14] Morisseau, P. et al. 'X-ray voludensitometry: Application to the testing of technical ceramics', Proceedings of the 13th World Conference on NDT, Sao Paulo, Brazil, 1991.
[15] Smith, B.D. 'Cone-beam tomography: recent advances and a tutorial', Optical Engineering, Vol. 29, No. 5, pp. 524-534, 1990.
[16] Telmat Informatique, T.Node User Manual, 1990.
[17] Webber, H.C. (Ed). Image Processing and Transputers, IOS Press, Amsterdam, The Netherlands, 1992.
[18] 3L Ltd., Parallel Fortran User Guide, Livingston, Scotland, 1988.