Parallel implementations of a 3-D image reconstruction algorithm
L. Pastor*, A. Sánchez†, F. Fernández*, A. Rodríguez*
* Departamento de Tecnología Fotónica
† Departamento de Lenguajes y Sistemas Informáticos
del Monte, Madrid, Spain
ABSTRACT
This paper compares two different parallel implementations of Feldkamp's cone-beam reconstruction method for 3D tomography. The first approach is based on a vector-parallel shared-memory architecture, and the second on a transputer-based distributed-memory architecture. The experimental results have shown the effectiveness of both models for executing this kind of compute-intensive parallel algorithm.
1. INTRODUCTION
From a computational point of view, 3D image reconstruction is a very demanding task. For example, for reconstructing an object with N³ voxels (volume elements), the cone-beam backprojection operation, the most time-consuming stage in filtered backprojection methods (Feldkamp et al. [4]), requires around 60N⁴ FLOPs. Other 3D reconstruction techniques need even more operations (Smith [15], Grangeat [5]). In order to achieve acceptable reconstruction times for practical resolutions (in the range between 128³ and 512³), special purpose hardware can be developed.
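As a back-of-the-envelope check of the figure above, the cost model can be written out explicitly. The constant of 60 operations per voxel per projection and the assumption of m = N projections are illustrative, reconstructed from the text, not measured values:

```c
/* Hypothetical cost model: an N^3-voxel volume, m projections, roughly
   60 floating-point operations per voxel per projection (illustrative
   constant).  With m = N projections this yields the ~60 N^4 total. */
double backprojection_flops(int n, int m) {
    return 60.0 * (double)n * (double)n * (double)n * (double)m;
}
```

For N = m = 512 this already exceeds 4 x 10^12 operations, which is why special purpose hardware or parallel machines become attractive.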
An alternative approach is to exploit the parallelism inherent in reconstruction algorithms: an in-depth analysis of the problem and of its algorithmic solution can help define the computational task so as to achieve the maximum degree of parallelism, taking
Transactions on Information and Communications Technologies vol
3, © 1993 WIT Press, www.witpress.com, ISSN 1743-3517
184 Applications of Supercomputers in Engineering
advantage of existing parallel machines. In general this last approach gives lower performance, although it has a very good price/performance ratio. Similar solutions have been developed in other image processing domains (Li et al. [13], Webber [17]).
This paper describes the implementation of Feldkamp's cone-beam reconstruction method on two different parallel architectures: a shared-memory vector-parallel multiprocessor (Alliant FX/40) and a hierarchical distributed-memory message-passing multiprocessor (T.Node). The first machine was selected for implementing reconstruction algorithms in the EC BRITE project 'EVA' (Morisseau et al. [14]), developed jointly by INTERCONTROLE (France), LETI-CEA (France), CUALICONTROL/DTF-UPM (Spain), MILANO RICERCHE (Italy), REGIENOV (France) and FAIREY TECRAMICS (United Kingdom).
This paper is organized as follows: Section 2 presents the main stages of Feldkamp's reconstruction method. Section 3 offers a brief description of the architectural features of the multiprocessors considered. Section 4 describes the implemented solutions. Section 5 presents and compares the major experimental results for both parallel architectures. Finally, Section 6 offers some concluding remarks.
2. FELDKAMP'S CONE-BEAM RECONSTRUCTION METHOD
Feldkamp's method [4] is a cone-beam geometry extrapolation of
fan-beam
bidimensional reconstruction techniques. It is composed of three
main stages:
- Projection weighting: the projection data are multiplied by coefficients that depend only on their position within the projection (the coefficients remain constant for all of the acquisitions).
- Filtering: the weighted data are convolved with a one-dimensional filter, such as Shepp-Logan's (Jain [9]).
- Backprojection: during the method's third stage, the weighted and filtered projection data are backprojected using a cone-beam geometry (Fig. 1).
Figure 1 Cone-beam acquisition (source, object, detector).
Feldkamp's method was the first practical cone-beam reconstruction method available for 3D tomography. Being an extrapolation of fan-beam techniques, it is correct only for the object's middle plane (the plane containing the source's circular trajectory). For small vertical cone aperture angles, the reconstruction errors remain acceptable. The method's most salient feature is its simplicity, being remarkably efficient from the computational point of view. A sequential algorithm for Feldkamp's method can be found in Jacquet [8].
In this paper only the backprojection stage has been considered for parallelization, for two reasons:
- The complexity of the cone-beam backprojection is substantially higher than the complexity of the other two stages together (O(N⁴) for backprojection versus lower-order costs for weighting and filtering).
- The first two stages can be computed at data acquisition time, together with additional preprocessing operations.
3. CONSIDERED MULTIPROCESSORS
3.1. Shared-memory paradigm: ALLIANT FX/40
The available shared-memory multiprocessor is an Alliant FX/40 with four Advanced Computational Elements (ACEs), one Interactive Processor (IP), 256 Kb of shared cache and 32 Mb of main memory (Alliant [2]). The ACEs are processors with vector capabilities that are crossbar-connected to the shared cache, which is in turn connected via a high-speed bus to the shared memory.
In the Alliant FX/40 system, concurrent programs can be produced using the FX/Fortran compiler (Alliant [1]), which has been especially designed to support concurrency and is one of the best vectorizing compilers. Optimized library functions for computing vector and matrix operations are also available.
3.2. Distributed-memory paradigm: TELMAT T.NODE
The T.Node system is a commercial product of Telmat Informatique (Telmat [16]) which emerged from the development of the Supernode (ESPRIT project P1085). It is a loosely coupled MIMD multiprocessor machine based on T800-25 MHz transputers (Inmos [7]), in which the interconnections (called links) among transputers are made via software-controlled switches. This modular and hierarchical architecture is based on reconfigurable nodes (or modules) of transputers, allowing the interconnection of up to 1024 processors.
Each basic node is a reconfigurable network with a maximum of 16 worker transputers and an associated control transputer. The four bidirectional links of each transputer are connected to a 72x72 crossbar switch, configured by a program running on the control transputer. An additional control bus, with a master-slave protocol, enables any transputer to communicate with the control one independently of the links.
The available hardware configuration has 3 worker boards (or clusters) with 8 transputers per cluster. Each processor has 2 Mb of dynamic RAM. The host machine is a Sun 4 workstation.
The programming language is 3L's Fortran 77 with library functions for specifying interprocessor communication. The 3L programming environment consists of a software toolset for compiling, linking, debugging and running parallel applications on transputers (3L [18]). Since transputers are reconfigurable processors, it is necessary to describe how they are to be interconnected for running a particular program. This is done by means of a configuration file, which includes the number of needed transputers, how their
links are connected and the way of mapping software processes
onto the
physical network.
4. PARALLELIZATION ASPECTS IN THE BACKPROJECTION STAGE
This section describes the main aspects considered in the parallelization of Feldkamp's algorithm. In particular, we have focused on how the backprojection step can be efficiently executed on the two parallel architectures considered.
4.1. Code optimizations (for shared-memory architectures)
The strategy followed in this case considers two main lines:
1. Techniques such as code vectorization and parallelization (in particular, execution of nested loops in Concurrent Outer (loop), Vector Inner (loop) or COVI mode), elimination of unnecessary or redundant code, factorization of operations, subroutine expansion, loop collapsing, etc. have been used when advisable to achieve good results. The COVI execution mode for nested loops achieves the computer's maximum parallel and vector performance.
2. An important problem found when optimizing large-resolution reconstructions was the degradation of the system's virtual-memory performance when large data volumes were used. The evolution of the reconstruction times as the problem size increased showed clearly that maintaining a monolithic data structure was unfeasible. This problem was avoided by limiting the amount of memory used by the application, in order to match the computer's main memory size.
Feldkamp's method uses two data volumes: the acquisition or projections volume, and the object volume. The first volume is accessed projection after projection, so it is not necessary to keep the used projections in memory. The object volume, on the other hand, is traversed for each projection. This last volume was decomposed into horizontal slices to limit the memory requirements of large-resolution reconstructions.
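The slice-decomposed loop structure described above can be sketched as follows. The array size and the constant per-projection weight are placeholders; the real kernel interpolates the filtered projection at each voxel's detector coordinates.

```c
enum { N = 64 };   /* illustrative resolution per axis */

/* Backproject one (weighted, filtered) projection into one horizontal
   slice.  On the Alliant the outer loop would run concurrently across
   the ACEs while the inner loop vectorizes (COVI mode); the geometry is
   reduced here to a constant weight for brevity. */
void backproject_slice(float slice[N][N], float proj[N][N], double weight) {
    for (int y = 0; y < N; y++)          /* concurrent outer loop */
        for (int x = 0; x < N; x++)      /* vector inner loop */
            slice[y][x] += (float)(weight * proj[y][x]);
}
```

Because each slice is updated independently, only the current slice and the current projection need to be resident in memory, which is precisely what bounds the working set.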
4.2. Mapping strategies (for distributed-memory architectures)
The cone-beam backprojection can be computed by accumulating the contributions of the projections to each of the points of the considered n slices S = (s_1, s_2, ..., s_n) of the object to be reconstructed, taking into account the
superposition effect of the m different projections P = (p_1, p_2, ..., p_m). The density of the corresponding slices D = (d_1, d_2, ..., d_n) can be computed independently one from another. This stage can be summarized by the following expression:

d_i = \sum_{j=1}^{m} f(p_j, s_i),   1 <= i <= n

which can be transformed into the local recurrent form:

d(j, i) = d(j-1, i) + f(p(j, i-1), s_i),   p(j, i) = p(j, i-1)

with initial conditions d(0, i) = 0 and p(j, 0) = p_j, and final result:

d_i = d(m, i),   1 <= i <= n

Figure 2 (a) Backprojection dependence graph (m = 6, n = 3); (b) a signal flow graph (SFG).

The set {(j, i) | 1 <= j <= m, 1 <= i <= n} defines the index space of the dependence graph (DG), from which a mapping onto the
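The recurrence can be transcribed directly into code. Here f is stubbed as a simple product, standing in for the backprojection of projection p_j into slice s_i:

```c
enum { M = 6, NS = 3 };   /* m projections, n slices, as in figure 2 */

/* Placeholder for the real kernel f(p_j, s_i). */
static double f(double p, double s) { return p * s; }

/* d(j,i) = d(j-1,i) + f(p_j, s_i), with d(0,i) = 0; result d_i = d(m,i). */
void accumulate(const double p[M], const double s[NS], double d[NS]) {
    for (int i = 0; i < NS; i++) {
        d[i] = 0.0;                       /* initial condition d(0,i) = 0 */
        for (int j = 0; j < M; j++)
            d[i] += f(p[j], s[i]);
    }
}
```

The outer loop runs over independent slices, which is exactly the independence exploited by the distributed mapping below.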
transputer network can be systematically derived using different time-allocation functions (Kung [11]) which fulfil the causality constraints. In our problem we have taken into consideration the communication and memory requirements and the index space dimensions. The following allocation function has been selected:
allocation(j, i) = i
which is equivalent to a projection of the DG in the j-axis direction (figure 2.b). For the timing function, it is possible to select two permissible options:

timing1(j, i) = j (with broadcasting)
timing2(j, i) = j + i (without broadcasting)
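Written out as code, the mapping is just two small functions; checking that timing2 increases along every dependence arc of the pipeline is the causality condition mentioned above:

```c
/* allocation(j, i): the slice index i decides the processor (projection
   of the dependence graph along the j-axis). */
int allocation(int j, int i) { (void)j; return i; }

/* Two permissible schedules: timing1 requires each projection to be
   broadcast to all processors in the same step; timing2 lets it ripple
   down the pipeline one processor per step. */
int timing1(int j, int i) { (void)i; return j; }
int timing2(int j, int i) { return j + i; }
```

With timing2, the last result emerges at step m + n, so the pipeline fills and drains in n extra steps compared with broadcasting.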
This systolic-like approach allows us to consider different parallelization strategies for the backprojection stage of Feldkamp's algorithm. The chosen distributed solution is detailed in the following section.
4.3. Implementation on distributed-memory architectures
Figure 3 Selected topology and software processes.
We have considered different strategies for exploiting parallelism: pipeline, geometric and farm (de Carlini et al. [3], Harp [6], Jane et al. [10]). Due to the
characteristics of the problem being solved, and taking into account the limitations of our hardware configuration (especially the number of processors and the size of the available memory in each transputer), we have adopted a pipeline topology, as shown in figure 3.
In this topology, three types of processes can be distinguished: a process that sequentially performs the weighting and filtering stages if needed (main); many identical processes, one replicated in each worker transputer, which perform backprojection over independent slices of the target volume (backprojectors); and, finally, a process which sends projections from main to the first backprojector in the pipeline and collects slices of the reconstructed volume in the opposite direction (recollector). To summarize, the operations carried out by each process type are expressed with the following pseudocode:
Process main
(1) Initialization step
(2) FOR all backprojectors b_i DO
    (2.1) Send identification and slice limits s_i to backprojector b_i through the recollector process
(3) FOR all projections p_j of the target volume DO
    (3.1) Read projection p_j
    (3.2) Weight projection p_j {if necessary}
    (3.3) Filter projection p_j {if necessary}
    (3.4) Send filtered projection p_j to the first backprojector of the pipeline through the recollector process
(4) FOR all slices s_i of the reconstructed target volume DO
    (4.1) Receive the normalized slice s_i corresponding to backprojector b_i through the recollector process
    (4.2) Write the slice s_i onto disk

Process backprojector (b_i)
(1) Receive identification and limits of its volume slice s_i from backprojector b_{i-1} (or directly from the recollector for b_1)
(2) Send identification and limits of volume slices s_l (l > i) to backprojector b_{i+1}
(3) FOR all projections p_j of the target volume DO
    (3.1) Receive filtered projection p_j from backprojector b_{i-1} (or directly from the recollector for b_1)
    (3.2) FOR all lines l_k of its volume slice DO
        (3.2.1) Perform backprojection operations over l_k using projection p_j
    (3.3) Send filtered projection p_j to backprojector b_{i+1}
(4) Normalize reconstructed volume slice s_i
(5) Send reconstructed volume slice s_i to backprojector b_{i-1} (to be finally received by process main)
(6) FOR all reconstructed volume slices s_l (l > i) DO
    (6.1) Receive slice s_l from backprojector b_{i+1}
    (6.2) Send slice s_l to backprojector b_{i-1}

Process recollector
(1) Receive identifications and limits of volume slices from main
(2) Send identifications and limits of volume slices to backprojector b_1
(3) FOR all filtered projections p_j DO
    (3.1) Receive p_j from main
    (3.2) Send p_j to backprojector b_1
(4) FOR all volume slices s_i of the reconstructed target volume DO
    (4.1) Receive s_i from backprojector b_1
    (4.2) Send s_i to main
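The net effect of the three processes above can be modelled in a few lines; message passing is replaced by an array walk, the backprojection is stubbed as a plain accumulation, and the real implementation would use 3L channel operations instead:

```c
enum { NPROJ = 4, NWORK = 3 };   /* illustrative pipeline dimensions */

/* Every projection visits every backprojector in pipeline order; each
   stage accumulates its contribution into its own slice and forwards
   the projection unchanged, as in steps (3.1)-(3.3) above. */
void run_pipeline(const double proj[NPROJ], double slice[NWORK]) {
    for (int i = 0; i < NWORK; i++) slice[i] = 0.0;
    for (int j = 0; j < NPROJ; j++)        /* main emits projection j */
        for (int i = 0; i < NWORK; i++)    /* stages b_1 .. b_NWORK   */
            slice[i] += proj[j];           /* backprojection stub     */
    /* the finished slices then flow back to the recollector */
}
```

The toy model makes the key property visible: every slice receives the contribution of every projection exactly once, regardless of the number of pipeline stages.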
5. EXPERIMENTAL RESULTS
This section presents the experimental results achieved. The tests performed include volume reconstructions with resolutions of 32³, 64³, 128³, 160³ and 180³ voxels, using the two parallel implementations of Feldkamp's reconstruction method described in the previous section.
Figure 4 refers to the Alliant FX/40, and shows processing time as a function of the number of processors. Each curve represents a different target volume resolution. It is interesting to point out that doubling the number of processors cuts the processing time by a factor of almost two, with the exception of the 32³ case.
The following three figures correspond to experiments performed on the Telmat T.Node. Figure 5 represents the same tests using different numbers of worker processors (1, 2, 4, 8, 16 and 23, respectively). Each processor performs the cone-beam backprojection on its own assigned volume slice, in parallel with the remaining processors.
Figure 5, like figure 4, shows large processing-time reductions when the number of available processors is increased. In this case, the reductions differ somewhat more from the maxima achievable, due to inter-transputer
Figure 4 Processing time in the Alliant FX/40 as a function of the number of processors, for various image resolutions.
communication times and to the presence of inherently sequential algorithm portions.
Speedup curves for the 64³ case are displayed in figure 6: the curves represented are the ideal (linear) speedup, the experimental speedup obtained from practical time measurements, and an adjusted speedup line least-squares fitted to the experimental results. This last line gives a 72% efficiency (measured as the quotient of the speedup divided by the number of processors).
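The efficiency figure is computed as follows; the sample numbers in the usage are illustrative, not the measured times:

```c
/* Speedup and efficiency as defined in the text: S(p) = T(1)/T(p) and
   E(p) = S(p)/p.  A least-squares fit of S against p gives the adjusted
   speedup line of figure 6. */
double speedup(double t1, double tp)           { return t1 / tp; }
double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }
```

For example, a run that takes 100 s on one processor and 25 s on eight has speedup 4 and efficiency 0.5.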
Figure 7 compares communication and backprojection times for different resolution reconstructions using the maximum (best) number of worker transputers (23 backprojectors). Communication time does not depend significantly on the number of processors, but on the reconstructed volume
Figure 5 Processing time in the Telmat T.Node as a function of the number of processors, for various image resolutions.
Figure 6 Speedup for an image with resolution of 64³ voxels.
resolution (in general, the number of projections is much greater than the number of volume slices). This observation, together with the backprojection
Figure 7 Communication vs. backprojection time in the Telmat T.Node.
Figure 8 Comparative processing time of the Alliant FX/40 and Telmat T.Node versions, as a function of different problem sizes.
time increase when the target volume's resolution grows, explains why it is advantageous to use a large number of transputers.
Finally, figure 8 compares the experimental backprojection times measured on the Alliant FX/40 and the Telmat T.Node. Each curve has been
obtained using the maximum number of processors available for
performing the
backprojection stage (in our case, 4 and 23 for the
shared-memory and
distributed-memory multiprocessors, respectively).
6. CONCLUDING REMARKS
In this paper, two parallel implementations of Feldkamp's method have been presented, each of them developed for a very different multiprocessor architecture. The experimental results have shown that the backprojection, the most computationally intensive stage, can be effectively parallelized: large speedups have been achieved in both shared- and distributed-memory multiprocessors when the number of processors used is increased.
Comparing both implementations, it can be noted that for all of the tests performed, the response time of the shared-memory computer is better. This can be attributed to the additional speed improvements achieved by using the Alliant's vector capabilities. However, the differences between both machines show a small percentage decrease when the resolution increases. (It is interesting to point out that the overall memory available to the user was similar in both machines.) An aspect in favour of the transputer-based solution is that further machine upgrades can be performed at a smaller cost.
Regarding the T.Node version, the use of a pipeline topology has been shown to yield good performance, in particular for large resolutions. Moreover, the processing performance can be increased by adding processors to the network without process modifications. An important point in this sense, confirmed by the experimental results, is that the total communication time does not depend significantly on the number of transputers used. Similar parallelization strategies are presently being applied to other reconstruction techniques, such as Grangeat's [5].
ACKNOWLEDGEMENTS
This work has been partly supported by the Spanish Ministry of Education and Science (CICYT) under grant ROB91-0489 and by the European Community BRITE Project EVA no. P-2051-4, contract no. R/1B-0285-C.
REFERENCES
[1] Alliant Computer Systems Corporation, FX/Fortran Language Manual, Massachusetts, USA, 1988.
[2] Alliant Computer Systems Corporation, FX/Series Architecture Manual, Massachusetts, USA, 1988.
[3] de Carlini, U. and Villano, U. Transputers and Parallel Architectures: Message-passing Distributed Systems, Ellis Horwood, Chichester, England, 1991.
[4] Feldkamp, L.A., Davis, L.C. and Kress, J.W. 'Practical cone-beam algorithm', Journal of the Optical Society of America A, Vol. 1, No. 6, pp. 612-619, 1984.
[5] Grangeat, P. 'Mathematical Framework of Cone Beam 3D Reconstruction via the First Derivative of the Radon Transform', in Mathematical Methods in Tomography (ed. Herman, G.T., Louis, A.K. and Natterer, F.), Springer-Verlag, Heidelberg, Germany, 1991.
[6] Harp, G. (Ed). Transputer Applications, Pitman, London, England, 1989.
[7] INMOS Ltd., The Transputer Databook, Redwood Burn Ltd., Trowbridge, England, 1989.
[8] Jacquet, I. Reconstruction d'images 3D par l'algorithme éventail généralisé, Thèse C.N.A.M., Grenoble, France, 1988.
[9] Jain, A.K. Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, USA, 1989.
[10] Jane, M.R., Fawcett, R.J. and Mawby, T.P. (Eds). Transputer Applications - Progress & Prospects, Proc. of the Closing Symp. of the SERC/DTI Initiative in the Engineering Applications of Transputers, Reading, IOS Press, Amsterdam, The Netherlands, 1992.
[11] Kung, S.Y. VLSI Array Processors, Prentice-Hall, Englewood Cliffs, USA, 1988.
[12] Lewis, T.G. and El-Rewini, H. Introduction to Parallel Computing, Prentice-Hall, Englewood Cliffs, USA, 1992.
[13] Li, J. and Miguet, S. 'Parallel Volume Rendering of Medical Images', in Parallel Computing: From Theory to Practice (ed. Joosen, W. and Milgrom, E.), pp. 332-343, Proceedings of the European Workshop on Parallel Computing, Barcelona, Spain, IOS Press, Amsterdam, The Netherlands, 1992.
[14] Morisseau, P. et al. 'X-ray voludensitometry: Application to the testing of technical ceramics', Proceedings of the 13th World Conference on NDT, Sao Paulo, Brazil, 1991.
[15] Smith, B.D. 'Cone-beam tomography: recent advances and a tutorial', Optical Engineering, Vol. 29, No. 5, pp. 524-534, 1990.
[16] Telmat Informatique, T.Node User Manual, 1990.
[17] Webber, H.C. (Ed). Image Processing and Transputers, IOS Press, Amsterdam, The Netherlands, 1992.
[18] 3L Ltd., Parallel Fortran User Guide, Livingston, Scotland, 1988.