INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING
Int. J. Numer. Meth. Engng 2005; 62:1127–1147
Published online 16 November 2004 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/nme.1203

An eddy current integral formulation on parallel computer systems

Raffaele Fresa1,‡, Guglielmo Rubinacci2,∗,† and Salvatore Ventre2,§

1 Dipartimento di Ingegneria e Fisica dell’Ambiente, Università degli Studi della Basilicata, Contrada Macchia Romana, Potenza, Italy
2 Associazione EURATOM/ENEA/CREATE–DAEIMI, Università degli Studi di Cassino, Via Di Biasio 43, I-03043 Cassino, Italy

SUMMARY

In this paper, we show how an eddy current volume integral formulation can be used to analyse complex 3D conducting structures, achieving a substantial benefit from the use of a parallel computer system. To this purpose, the different steps of the numerical algorithms are outlined in view of their parallelization, to highlight the merits and the limitations of the proposed approach. Numerical examples are developed in a parallel environment to show the effectiveness of the method. Copyright © 2004 John Wiley & Sons, Ltd.

KEY WORDS: computational electromagnetism; parallel systems; parallel algorithms; eddy currents; integral formulation; edge elements

1. INTRODUCTION

The numerical solution of Maxwell equations in the magneto-quasi-stationary limit is at the basis of many design and analysis engineering applications, including, among others, electromagnetic non-destructive evaluation, low-frequency electromagnetic compatibility and computer-aided design of electromagnetic devices.

In this context, the integral formulations present several advantages with respect to differential approaches. The discretized domain is limited to the region where sources and materials are located; the regularity conditions at infinity for unbounded domains are implicitly included in the formulation, so that the overall discretization cost is very limited.

∗Correspondence to: G. Rubinacci, Associazione EURATOM/ENEA/CREATE–DAEIMI, Università degli Studi di Cassino, Via Di Biasio 43, I-03043 Cassino, Italy.

† E-mail: [email protected]
‡ E-mail: [email protected]
§ E-mail: [email protected]

Contract/grant sponsor: MIUR, Italian Ministry of Education, University and Research
Contract/grant sponsor: EURATOM/ENEA/CREATE–DAEIMI Association

Received 15 January 2003
Revised 5 April 2004

Copyright © 2004 John Wiley & Sons, Ltd. Accepted 12 July 2004


Moving parts can be easily taken into account, while non-linear magnetic materials require a moderate additional effort.

The linear set of algebraic equations arising after discretization is described by a dense matrix with n² elements, where n is the number of scalar unknowns related to a given spatial resolution. The need to form, store and invert this matrix poses severe limitations on the size of solvable problems. The use of parallel or massively parallel computer systems can represent a convenient way to extend this range.

A very important issue in the use of parallel computation is the portability of the related source code. The availability of message-passing systems and linear algebra libraries provides high source-code portability and allows efficient implementations across a wide range of architectures. Moreover, parallelization is no longer limited to specialized supercomputer environments: it is now possible, although more challenging, to achieve the required efficiency using the distributed parallelism available from clusters of workstations connected by Ethernet or other fast links such as ATM [1]. This approach is much more suitable in industry and research environments, where large computational resources are often not easily available for numerical field computation.

Many papers have already been published on these aspects [2]. In Reference [3], the authors implemented a 2D boundary element method (BEM) magnetostatic code on a system of transputers. Since then, several other experiences have been reported on the parallelization of electromagnetic 3D BEM and volume integral formulation codes, for electrostatic [4], magnetostatic [5] and scattering problems [6, 7].

The aim of this paper is to show how the volume integral formulation [8, 9] can be used to analyse complex 3D structures, achieving a substantial benefit from its parallelization. To this purpose, after summarizing the mathematical and numerical model, the different steps of the numerical algorithm are outlined in view of its parallelization, to highlight the merits and the limitations of the proposed approach. Finally, numerical examples are developed in a parallel environment to show the effectiveness of the method.

2. THE INTEGRAL FORMULATION

We solve Maxwell equations in the time domain, in the magneto-quasi-stationary limit. The conducting domain Vc, whose position and shape are constant in time, is characterized by the constitutive relation

    E = ηJ    (1)

where E is the electric field and η(r) the resistivity; the current density J is solenoidal, with a continuous normal component that in the following, for the sake of simplicity, is assumed to be zero on the boundary ∂Vc. An impressed current density Js is localized in the domain Vs. We assume that no magnetic materials are present in the problem domain, which is therefore characterized by the magnetic permeability μ0 of free space.

Posing:

    B = ∇ × A    (2)

    A(r, t) = (μ0/4π) ∫_{Vc} J(r′, t)/|r − r′| dV′ + (μ0/4π) ∫_{Vs} Js(r′, t)/|r − r′| dV′ = (μ0/4π) ∫_{Vc} J(r′, t)/|r − r′| dV′ + As(r, t)    (3)


    E = −∂A/∂t − ∇φ    (4)

Maxwell equations are satisfied. We only have to impose the solenoidality of J and Ohm's law (1), which we do in weak form:

Find J ∈ Q such that

    ∫_{Vc} (ηJ − E) · w dV = 0,   ∀w ∈ Q    (5)

where Q = {q ∈ L²_div(Vc) | ∇ · q = 0, n̂ · q = 0 on ∂Vc} is the functional space of admissible solutions q, and L²_div(Vc) = {a ∈ L²(Vc) | ∇ · a ∈ L²(Vc)}. Using (3) we finally have

    (μ0/4π) ∫_{Vc} ∫_{Vc} [∂J(r′, t)/∂t · w(r)] / |r − r′| dV′ dV + ∫_{Vc} ηJ · w dV + ∫_{Vc} ∇φ · w dV = −∫_{Vc} (∂As/∂t) · w dV    (6)

with J ∈ Q and w ∈ Q. Since in Vc both J ∈ Q and w ∈ Q, we have

    ∫_{Vc} ∇φ · w dV = −∫_{∂Vc} φ w · n̂ dS = 0    (7)

To guarantee the solenoidality of J and the continuity of its normal component, we introduce the electric vector potential T, such that ∇ × T = J. A mesh of the conductor Vc is given, and T is expanded in terms of edge element basis functions Nk:

    J(r, t) = Σ_{k=1}^{n} I_k(t) ∇ × N_k(r)

The boundary condition J · n̂ = 0 on ∂Vc can be directly imposed on the w_k's. The gauge assuring the uniqueness of T is imposed directly in the discrete approximation, by means of a tree–cotree decomposition of the edges of the mesh and retaining only the n degrees of freedom associated with the edges of the cotree [8, 9]. Posing w_k = ∇ × N_k, (6) becomes:

    L dI/dt + R I = V    (8)

where

    R_{i,j} = ∫_{Vc} η w_i · w_j dV    (9)

    L_{i,j} = (μ0/4π) ∫_{Vc} ∫_{Vc} [w_i(r′) · w_j(r)] / |r − r′| dV′ dV    (10)

    V_i = −∫_{Vc} (∂As/∂t) · w_i dV    (11)
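The gauge just described relies on a tree–cotree splitting of the mesh edges. As a minimal illustration (not the authors' code), the Python sketch below builds such a splitting with a union-find spanning tree, assuming only that the mesh connectivity is given as a list of node-index pairs; the cotree edges are the ones that carry the retained degrees of freedom.

    # Minimal sketch of a tree-cotree splitting of the mesh edges (illustrative only).
    # Assumption: the mesh connectivity is given as a list of (node_a, node_b) pairs.
    def tree_cotree_split(num_nodes, edges):
        parent = list(range(num_nodes))            # union-find forest over the mesh nodes

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]      # path compression
                i = parent[i]
            return i

        tree, cotree = [], []
        for k, (a, b) in enumerate(edges):
            ra, rb = find(a), find(b)
            if ra != rb:                           # edge joins two components: tree edge
                parent[ra] = rb
                tree.append(k)
            else:                                  # edge closes a loop: cotree edge (DOF)
                cotree.append(k)
        return tree, cotree

    # Example: a single tetrahedron (4 nodes, 6 edges) yields 3 tree and 3 cotree edges.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    tree, cotree = tree_cotree_split(4, edges)
    print(len(tree), len(cotree))                  # 3 3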


Notice that R is a sparse, symmetric, positive semi-definite matrix, while L is a dense n × n symmetric positive definite matrix. The integrals in (10) are numerically computed using the method described in Reference [8].

Usually, the structures of interest have a high level of symmetry, which can be efficiently taken into account to reduce computational time and memory allocation. The basic idea is to assume basis functions w_k which automatically verify the symmetry conditions. For this purpose, the integration domain Vc to be discretized is an elementary part of the whole structure. The shape functions in the rest of the structure can be obtained by means of suitable operators. For instance, with reference to a system of rectangular co-ordinates with unit vectors t̂_x, t̂_y and n̂_z, with n̂_z perpendicular to the plane of the reflection symmetry, we define the following reflection operator:

    S_n = | 1  0   0 |
          | 0  1   0 |    (12)
          | 0  0  −1 |

such that the components of w_k at the reflected point r_r = S_n r are given by:

    w_k(r_r) = α S_n w_k(r)    (13)

where r = x t̂_x + y t̂_y + z n̂_z is in the integration domain Vc, r_r is in the reflected domain, the 3D vectors are represented as column vectors of their Cartesian components, and α is +1 or −1, depending on the type of symmetry.

Similarly, the rotation of an angle θ around the z-axis can be represented by the usual rotation operator R_θ:

    R_θ = | cos θ  −sin θ  0 |
          | sin θ   cos θ  0 |    (14)
          |   0       0    1 |

such that the components of w_k at the rotated point r_r = R_θ r are given by:

    w_k(r_r) = R_θ w_k(r)    (15)

It should be noticed that the continuity of J · n̂_z has to be assured on the symmetry planes. Therefore, for instance, when applying a reflection with α = +1, the condition J · n̂_z = 0 must be guaranteed on the symmetry plane. This can be directly imposed on the shape functions, as described in the previous section.
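As a small numerical illustration of the operators (12)–(15) (a sketch with invented sample values, not taken from the paper), they can be built and composed as follows:

    # Illustrative construction of the reflection and rotation operators of (12) and (14).
    import numpy as np

    S_n = np.diag([1.0, 1.0, -1.0])                 # reflection through the x-y plane, (12)

    def R(theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, 0.0],
                         [s,  c, 0.0],
                         [0.0, 0.0, 1.0]])          # rotation of angle theta about z, (14)

    r = np.array([0.30, 0.10, 0.05])                # a point of the elementary domain (made up)
    w = np.array([0.00, 1.00, 0.20])                # shape-function value at r (made up)
    alpha = +1.0                                    # symmetry type, +1 or -1

    r_image = R(np.pi / 3) @ S_n @ r                # image point: reflect, then rotate
    w_image = alpha * R(np.pi / 3) @ S_n @ w        # shape-function value there, cf. (13), (15)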

At this point, the calculation of the coefficients of L, R and V is straightforward. With reference to a simple example with a reflection symmetry and a rotational periodicity of θ, they take the following form:

    L_{i,j} = 2N_θ Σ_{m=0}^{N_θ−1} Σ_{p=0}^{1} (μ0/4π) ∫_{Vc} ∫_{Vc} [w_i(r) · R_θ^m α^p S_n^p w_j(r′)] / |r − R_θ^m S_n^p r′| dV′ dV    (16)

    R_{i,j} = 2N_θ ∫_{Vc} η w_i · w_j dV    (17)


    V_i = −2N_θ ∫_{Vc} w_i(r) · ∂As(r, t)/∂t dV    (18)

where N_θ = 2π/θ.

In the time domain, using the theta method for the time integration of Equation (8), the following system of n linear algebraic equations is obtained:

    A I(t_{k+1}) = B I(t_k) + θ V(t_{k+1}) + (1 − θ) V(t_k)    (19)

where t_k = kΔt,

    A = (1/Δt) L + θ R    (20)

and

    B = (1/Δt) L − (1 − θ) R    (21)
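A compact sketch of the time-stepping loop implied by (19)–(21) is given below. It is only an illustration (not the authors' code): it assumes that dense arrays L_mat and R_mat and a source function V(t) are already available, and it reuses a single LU factorization of A at every time step.

    # Theta-method time stepping for L dI/dt + R I = V, cf. (19)-(21).
    # Assumes L_mat, R_mat (n x n arrays) and V(t) (length-n source vector) are given.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def transient(L_mat, R_mat, V, dt, n_steps, theta=0.5, I0=None):
        n = L_mat.shape[0]
        A = L_mat / dt + theta * R_mat              # system matrix, (20)
        B = L_mat / dt - (1.0 - theta) * R_mat      # right-hand-side matrix, (21)
        lu, piv = lu_factor(A)                      # factorize once, reuse at every step
        I = np.zeros(n) if I0 is None else I0.copy()
        history = [I.copy()]
        for k in range(n_steps):
            rhs = B @ I + theta * V((k + 1) * dt) + (1.0 - theta) * V(k * dt)
            I = lu_solve((lu, piv), rhs)            # solve A I(t_{k+1}) = rhs, (19)
            history.append(I.copy())
        return np.array(history)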

In the frequency domain, Equation (8) is rewritten as

    Z I = V    (22)

where

    Z = jωL + R    (23)

Further details on this formulation can be found in Reference [10], where the problems related to the treatment of multiply connected regions and of the coupling with lumped circuit elements are analysed. The presence of magnetic materials (linear or non-linear) can be taken into account in the numerical model by means of an additional integral equation in terms of the equivalent magnetic current densities, as described for example in References [9, 11].

3. PARALLEL IMPLEMENTATION

The numerical implementation requires the following steps:

(1) Initialization.(2) Computation of matrices L and R and vector V, by means of volume integration.(3) LU decomposition of the system matrix A for the solution of the linear system (19) in

the time domain or of the system matrix Z in the frequency domain.(4) Computation of the transient or the frequency domain solution.(5) Output.

Steps (1) and (5) have a complexity of O(n) (order n). The computation of the matrix L in step (2) has a complexity of O(n²), while the computation of R is O(n). Step (3) is O(n³). Most of the computational time is required by steps (2) and (3), which are therefore parallelized. Step (4) is O(n²); a high number of time steps could suggest its parallelization as well.


3.1. Solution of the linear system

The two major drawbacks of this integral approach, related to the storage limit and to the computational time, can be overcome by partitioning the coefficient matrix among the cluster of processors and by using efficient algorithms for assembling the matrix and solving the linear system.

The assembly of the matrix could be obtained using a straightforward parallel approach, in which a different block of rows is assigned to each processor. However, the best performance of the parallel solver is often obtained by partitioning the matrix among the processors in a block-cyclic way, which is not fully compatible with the parallelization of the matrix assembly just mentioned. In this case, a possible strategy could be the separate optimization of the different phases. But this approach requires an intensive use of all-to-all communication, which is difficult to program and lacks generality. In the following we analyse these different phases separately.

In our work, the solution of the linear system is obtained using the standard ScaLAPACK linear algebra library [12]. We distinguish two different cases. In the frequency domain, the coefficients of the linear system are complex and symmetric (hence not Hermitian). In this case, we choose the general subroutine PCGETRF [12], computing a general PLU decomposition of the original matrix (P is a permutation matrix, L is a lower triangular matrix with unit diagonal terms and U is an upper triangular matrix), so that the storage and computational advantage related to the symmetry cannot be exploited using the ScaLAPACK library. In the time domain, the solver performs the Cholesky decomposition. However, also in this case, the particular way in which the routine has been implemented requires the storage of rectangular blocks, leading to the full storage of the coefficient matrix. In both cases, a critical issue is the way in which the data are distributed among the processors, because this affects the load balance and the reuse of the serial BLAS library [12] in the local computation of each processor.

A two-dimensional block-cyclic distribution assures a good splitting of the computational work among the processors throughout the algorithm and gives the possibility of using the level 3 BLAS routines [12] during computations on a single processor. To understand how this works, we recall that the LU decomposition algorithm proceeds by working on successively smaller square south-east corners of the matrix. In addition, during this process suitable sub-matrices are factorized by means of extra level 2 BLAS [12] work. Assigning a block of contiguous columns to successive processes results in a highly inefficient load balance, since as soon as the first set of columns is complete, the first process stays idle for the rest of the computation. The same argument applies to the transpose of this configuration. A one-dimensional cyclic distribution overcomes this limitation; however, single columns are stored rather than blocks, so that level 2 BLAS routines cannot be used for the local computation. The two-dimensional block-cyclic distribution gives the right solution, allowing a good load balance while assuring an efficient local computation by serial library routines. In this case, the processors are organized in a 2D grid of dimension prow × pcol = p, where p is the total number of processors. Each processor is uniquely identified by the couple of indices i_proc and j_proc, with 0 ≤ i_proc ≤ prow − 1 and 0 ≤ j_proc ≤ pcol − 1, according to its relative position in the grid. The coefficient matrix is partitioned into np × nq local square blocks of size nb × nb, with nb ≪ n. Row and column blocks are assigned to each processor in a cyclic way along the processor grid. This data partitioning is illustrated in Figure 1, with reference to a 9 × 9 matrix decomposed into 2 × 2 blocks mapped onto a 2 × 2 process grid. The performance of each processor improves by increasing the local block size nb × nb. On the other hand, the load imbalance also increases for increasing nb, leading to the definition of an optimum value.


Figure 1. A 9 × 9 matrix decomposed into 2 × 2 blocks mapped onto a 2 × 2 processor grid: (a) the matrix is decomposed into 2 × 2 blocks; and (b) the blocks are mapped onto the 2 × 2 processor grid. The four processors are identified by the indices (0, 0), (0, 1), (1, 0) and (1, 1), respectively.

This optimum value for nb was found to be in the range 40–50 in all the cases analysed in this work.
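As an illustration of this mapping (not part of the authors' code), the short Python sketch below computes the process owning a global matrix entry under a two-dimensional block-cyclic distribution; the grid and block sizes reproduce the example of Figure 1.

    # Owner of global entry (i, j) under a 2D block-cyclic distribution (ScaLAPACK style).
    def owner(i, j, nb, prow, pcol):
        # i, j       : zero-based global row and column indices
        # nb         : block size (square nb x nb blocks)
        # prow, pcol : dimensions of the process grid
        i_proc = (i // nb) % prow
        j_proc = (j // nb) % pcol
        return i_proc, j_proc

    # Example of Figure 1: a 9 x 9 matrix, 2 x 2 blocks, 2 x 2 process grid.
    n, nb, prow, pcol = 9, 2, 2, 2
    grid = [[owner(i, j, nb, prow, pcol) for j in range(n)] for i in range(n)]
    print(grid[0][0], grid[0][2], grid[4][4])       # (0, 0) (0, 1) (0, 0)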

3.2. Matrix assembly

The matrix elements L_{i,j}, in principle, can be numerically computed as

    L_{i,j} = (μ0/4π) Σ_{ei∈Ei} Σ_{ej∈Ej} Σ_{Gi=1}^{NG} Σ_{Gj=1}^{NG} w_i(r_{Gi}) · w_j(r_{Gj}) J_{ei}(r_{Gi}) J_{ej}(r_{Gj}) / |r_{Gi} − r_{Gj}|    (24)

where E_i is the set of elements sharing the edge i, J_{ei} is the Jacobian of the element e_i computed at the Gauss point r_{Gi} and multiplied by the corresponding Gauss weight, and NG is the number of Gauss points. In Reference [8] we showed that (24) can be conveniently modified in order to reduce the computational effort and to preserve the properties of the inductance matrix. Here, we only briefly recall that expression (24) can be interpreted as the energy of interaction among the set of 'vector' point charges w_i(r_{Gi}) J_{ei}(r_{Gi}) located at the points r_{Gi}.

Since this discrete approximation fails when non-zero point charges are located at the same point, the formula is corrected by replacing each point charge with a uniformly charged sphere of an adequate radius:

    L_{i,j} = (μ0/4π) Σ_{ei∈Ei} Σ_{ej∈Ej} Σ_{Gi=1}^{NG} Σ_{Gj=1}^{NG} w_i(r_{Gi}) J_{ei} · w_j(r_{Gj}) J_{ej} Φ(r_{Gi}, a_{Gi} | r_{Gj}, a_{Gj})    (25)

where Φ(r_l, a_l | r_m, a_m) is the energy of interaction between two unit charges uniformly distributed within the spheres |r − r_l| < a_l and |r − r_m| < a_m. An analytic expression of Φ is available; the radius is chosen so as to give the same self-energy for each volume element


[8] as the one computed analytically in the inner integration and numerically in the outer. We explicitly notice that expression (25) guarantees the symmetry and the positive definiteness of the inductance matrix, by inheriting these properties from the analytic expression of the energy of interaction Φ(r_l, a_l | r_m, a_m) between two unit charges uniformly distributed within the spheres. On the other hand, with any other conventional integration method, the need to perform the outer integral in (10) numerically, even if the inner integral is computed analytically, unavoidably leads to unsymmetric inductance matrices, without any guarantee of positive definiteness.

Although the matrix assembly scales as n², the crossover point between the O(n³) factorization and the O(n²) fill-in can be very high, so that in many applications the filling time can be a non-negligible part of the CPU time, especially if a high precision is required in the computation of the inductance matrix. Moreover, the limits due to the available storage strongly suggest an efficient parallelization of the matrix fill-in.

The direct use of expression (25), based on two nested loops indexed over the unknowns, although ready to be parallelized, is not very efficient. In fact, the inner loops over the elements have to be repeated several times, because each element contributes to up to 12 unknowns, leading to redundant computations.

Since there is very little difference in the amount of computational work needed to compute a single entry or all the entries related to the same couple of volume elements, a subdivision of the computation among the processors by elements can be much more efficient. In this case, however, the matrix assembly requires a communication overhead, because the contribution to a certain number of matrix entries is shared by volume elements allocated to different processors.
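As an illustration of how all the edge–edge entries associated with one couple of elements can be computed at once, the NumPy sketch below evaluates the contribution of an element pair to (25). The array layout, the sphere radii and the treatment of overlapping spheres are assumptions of the example; in particular, the exact analytic expression of Φ used in Reference [8] is not reproduced, and a crude placeholder is used when the two spheres overlap.

    # Contribution of one couple of elements to (25); illustrative sketch only.
    import numpy as np

    def phi(r_a, rad_a, r_b, rad_b):
        d = np.linalg.norm(r_a - r_b)
        if d >= rad_a + rad_b:
            return 1.0 / d                  # exact for non-overlapping uniform spheres
        return 1.0 / (rad_a + rad_b)        # placeholder for the analytic overlap expression

    def element_pair_block(w_a, J_a, x_a, rad_a, w_b, J_b, x_b, rad_b, mu0=4.0e-7 * np.pi):
        # w_a[g, e, :] : shape function of local edge e at Gauss point g of element a
        # J_a[g]       : Jacobian times Gauss weight at Gauss point g
        # x_a[g, :]    : Gauss-point coordinates; rad_a[g] : equivalent sphere radius
        n_ga, n_ea = w_a.shape[0], w_a.shape[1]
        n_gb, n_eb = w_b.shape[0], w_b.shape[1]
        block = np.zeros((n_ea, n_eb))      # contributions to L for all edge pairs at once
        for ga in range(n_ga):
            for gb in range(n_gb):
                k = J_a[ga] * J_b[gb] * phi(x_a[ga], rad_a[ga], x_b[gb], rad_b[gb])
                block += k * (w_a[ga] @ w_b[gb].T)   # all edge-edge dot products of this pair
        return mu0 / (4.0 * np.pi) * block

The resulting block is then scattered into L through the global edge indices of the two elements, which is the indirect addressing referred to in the algorithms below.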

These considerations lead to the following algorithm (local computation and communication, LCAC).

(1) Initialize. The volume elements of the mesh are distributed by rows among the different processes, in order to have balanced loads and taking into account the symmetry of interaction among the elements. Pointers from the rows of the local matrix to the same rows of the global matrix, and vice versa, are computed. Note that the same matrix element L_{ij} can be present in different processes, since the edges i and/or j can belong to elements stored in different processes. For this reason, the memory required for storing the computed data could be, for some processes, much larger than the memory necessary for storing the local part of L.

(2) Compute local matrices.

    for each element iel1 of the current process do
        for each element iel2 with iel2 ≤ iel1 do
            Compute the edge–edge interactions associated with the couple of elements
            (iel1, iel2), taking into account the symmetry of the matrix and using
            indirect addressing.
        end for
    end for

(3) Redistribute local matrices, summing the local contributions to the processes according to the allocation of the unknowns required by the solver. This phase requires an all-to-all communication that can be a difficult step if the speed of the network is not very high.
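The LCAC structure can be conveyed with a minimal message-passing sketch. The fragment below uses mpi4py only as a stand-in for the message-passing layer, the problem sizes are invented, and the edge–edge computation is replaced by a trivial placeholder, so it illustrates the distribution and redistribution pattern rather than the authors' implementation.

    # Minimal sketch of the LCAC pattern with mpi4py (illustrative only).
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_elements, n_unknowns = 1200, 500                 # hypothetical problem sizes
    my_elements = range(rank, n_elements, size)        # step (1): distribute the elements

    L_local = np.zeros((n_unknowns, n_unknowns))       # local contributions to L
    for iel1 in my_elements:                           # step (2): local computation
        for iel2 in range(iel1 + 1):                   # exploit symmetry: iel2 <= iel1
            # placeholder for the edge-edge interactions of the couple (iel1, iel2),
            # accumulated with indirect addressing into global row/column positions
            L_local[iel1 % n_unknowns, iel2 % n_unknowns] += 1.0

    # step (3): redistribute; a global reduction stands in for the all-to-all exchange
    # toward the block-cyclic layout required by the solver
    L_global = np.zeros_like(L_local) if rank == 0 else None
    comm.Reduce(L_local, L_global, op=MPI.SUM, root=0)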

The overhead due to the communications may suggest a different approach, although redundant in the computation. In this case, on every local process the blocks of the matrix are allocated with the same ordering required by the linear system solver. However, the symmetry of the matrix implies an additional redundancy in the computation for the processors at symmetric positions in the grid. To avoid this computational overhead, each processor of a couple of symmetric processors computes half of the interactions required in the block, in an element-by-element order, and at the end each processor of the couple exchanges with the other the data needed to assemble the required block. Unlike the previous case, the memory required by each process is now exclusively related to the size of the local part of L.

This approach can be summarized in the following algorithm (local algorithm, LA).

(1) Initialization. Each processor is identified in the grid by the indices i_proc and j_proc. The unknowns are distributed among the processors in np × nq local blocks with the same ordering required by the linear system solver. The lists of volume elements containing at least one edge of the row-unknowns (1 : Nr_el_row) and of the column-unknowns (1 : Nr_el_col) are computed and allocated to each (i_proc, j_proc) processor.

(2) Compute local matrices.

    if (i_proc ≠ j_proc) then
        for each element iel1 containing row-unknowns do
            if (i_proc < j_proc) | (iel1 ≤ Nr_el_row/2) then
                for each element iel2 containing column-unknowns do
                    if (i_proc > j_proc) | (iel2 > Nr_el_col/2) then
                        Compute the edge–edge interactions associated with the
                        couple of elements (iel1, iel2)
                    end if
                end for
            end if
        end for
    else if (i_proc = j_proc) then
        // For these processes the row-unknowns coincide with the column-unknowns
        for each element iel1 containing row-unknowns do
            for each element iel2 containing column-unknowns with iel2 ≤ iel1 do
                Compute the edge–edge interactions associated with the couple of
                elements (iel1, iel2)
            end for
        end for
    end if

(3) Communications between (i_proc, j_proc) and (j_proc, i_proc).

    Send and receive the local matrices L(i_proc, j_proc) and L(j_proc, i_proc)
    Compute the local matrix as
        L(i_proc, j_proc) = L(i_proc, j_proc) + transpose(L(j_proc, i_proc))


3.3. Renumbering of the unknowns

A suitable renumbering of the unknowns can increase the computational efficiency. In fact, the initial ordering follows by construction the ordering of the elements, usually based on spatial contiguity. A proper renumbering assures that the cyclic distribution does not alter this ordering. The spatial contiguity is, in fact, advantageous for the LA algorithm, since it is easy to verify that in this case the number of elements to integrate is relatively smaller. There are advantages also for the LCAC algorithm, since in this case the number of locally computed terms increases, with a consequent reduction of the data communication and storage requirements.

4. RESULTS

The algorithms have been implemented on a ‘Beowulf system’ (BWS) and a ‘shared memory system’.

The BWS consists of a cluster of 16 computers connected via a 16-port fast Ethernet switch. Each node is equipped with a Pentium III 450 MHz CPU, 512 MB RAM, an 8 GB EIDE hard disk and a 100 Mbit/s Ethernet adapter. The operating system installed on each node is RedHat Linux version 7.0 and the development environment is the PGI cluster tool by Portland Group.

The ‘shared memory system’ (SUN) is a Sun Fire 880 machine. This machine has six UltraSPARC III 750 MHz CPUs, 12 GB RAM and a 100 GB SCSI hard disk. The operating system is Solaris version 5.8 and the development environment is the HPC cluster tool by Sun.

4.1. A model problem

As a first case, we considered a very simple geometry of interest for electromagnetic non-destructive evaluation and for electromagnetic compatibility. Eddy currents are induced by a circular loop of radius R = 0.02 m, fed by a sinusoidal current (i = I0 sin(2πf t) with I0 = 1 A and f = 3.3 kHz), placed 0.205 mm over a 0.10 m × 0.10 m × 0.003 m conducting plate. The resistivity of the plate η = 5.3 × 10⁻⁸ Ωm gives rise to a skin depth δ = √(η/(πμ0f)) = 2 mm. In Figure 2, the eddy current density distribution along the thickness of the plate is shown for several different meshes and compared with the analytic solution [13]. Note that the results for the fine mesh (60 × 60 × 6 elements, 38 881 unknowns) have been obtained with a fast solver [14] taking advantage of the translation invariance of the finite element mesh. In this case, the regularity of the mesh assures a satisfactory precision in the computation of the inductance matrix even using one Gauss point per element in the volume integrals. However, we have chosen a fixed number of eight Gauss points per element instead of an adaptive integration, to allow a simple comparison of the assembly algorithms. Moreover, a fixed number of Gauss points guarantees the positive definiteness of the inductance matrix, as pointed out in Section 3.2. Notice that the computational time scales quadratically with the number of Gauss points. However, with few Gauss points, most of the computational time is spent in the computation of the shape functions and Jacobians at the Gauss point locations; consequently, in these cases, the computational time approaches a linear scaling with the number of Gauss points.
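As a quick check of the quoted skin depth (a small verification script, not part of the original work):

    # Verification of the 2 mm skin depth quoted for the model problem.
    import math
    eta = 5.3e-8                   # plate resistivity (ohm m)
    f = 3.3e3                      # excitation frequency (Hz)
    mu0 = 4.0e-7 * math.pi         # vacuum permeability (H/m)
    delta = math.sqrt(eta / (math.pi * mu0 * f))
    print(delta)                   # about 2.0e-3 m, i.e. 2 mm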

In Figure 3, we show the performance of the linear system solver for different meshes and numbers of processors on the two parallel systems.


Figure 2. The amplitude of the eddy current density modulus as a function of the distance from the top of the plate, just under the inducing coil. The numerical results for various meshes, (a) 24 × 24 × 6 elements and 6049 scalar unknowns, (b) 30 × 30 × 6 elements and 9541 scalar unknowns, (c) 36 × 36 × 6 elements and 13 825 scalar unknowns, (d) 60 × 60 × 6 elements and 38 881 scalar unknowns, are compared with the analytic solution (e).

Figure 3. The execution time of the linear system solver for different meshes and number of processors on the two parallel systems: (a) BWS; and (b) SUN.


Figure 4. The execution time of the two assembly algorithms versus the number of processors and the number of unknowns on the two parallel systems: (a) LA algorithm, BWS; (b) LCAC algorithm, BWS; (c) LA algorithm, SUN; and (d) LCAC algorithm, SUN.

The CPU time required by the computer systems with one processor, in the case of the finer meshes with n > 6049, has been obtained by extrapolating the values obtained with the coarser meshes.

In Figure 4, we show the performance of the two assembly algorithms on the two parallel systems for different meshes and numbers of processors.

For a better understanding of the differences between the two approaches, in Figure 5 we show how the computer time is distributed among the different phases of the algorithms. Notice that, in the LCAC algorithm on the BWS system, the local matrices are temporarily stored on disk, to optimize the memory allocation. This implies additional computer time.

In the SUN multi-processor computer system, all processes share the hard disk communication bus, so that, to avoid bottlenecks in the read–write operations, we slightly modified the LCAC algorithm by storing the local matrices directly in the central memory.

We notice that, because of the low communication speed of the BWS system used in these tests, LA achieves better performance on BWS, even if, in this example, there is a 6% redundancy in the computations.


Figure 5. The execution time for each processor in the two assembly algorithms on the two parallel systems: (a) LA algorithm, BWS; (b) LCAC algorithm, BWS; (c) LA algorithm, SUN; and (d) LCAC algorithm, SUN. The term ‘barrier’ refers to the idle (inactivity) time of the process.

Instead, on the shared memory system SUN, since the four processors of the Sun workstation share the same central memory, the time devoted to communications is much less and the two algorithms show almost the same speed-up.

To further assess the performance of the proposed approach, two problems of interest in different fields of application are briefly illustrated.

4.2. A shielding problem

We analyse the shielding effectiveness of a test enclosure with one rectangular aperture in the low-frequency limit. The test enclosure is a thin copper cube, with a side of 0.5 m. The front panel has a rectangular aperture, sized 0.2 × 0.05 m². A uniform sinusoidal external magnetic field parallel to the longer side of the aperture is applied. The frequency f = 10 kHz has been chosen so as to assure a uniform distribution of the field along the thickness of the walls of the cube. The shield effectiveness is defined as SE = −20 log10(|B|/|Bext|). A similar problem has been analysed in Reference [15].


Figure 6. A test enclosure with one rectangular aperture: (a) coarse mesh with 5097 scalar unknowns; (b) normalized eddy current power density deposition; (c) fine mesh with 13 985 scalar unknowns; and (d) normalized eddy current power density deposition.

The analysis has been performed using two meshes, a coarse mesh with 5097 scalar unknowns and a fine mesh with 13 985 scalar unknowns, with eight Gauss points per element. The two meshes are shown in Figure 6 together with the normalized eddy current power density deposition. The real and imaginary parts of the x component (parallel to the longer side of the aperture) of the normalized magnetic induction due to the eddy currents, as a function of the distance from the aperture, are shown in Figure 7 for the two meshes. The shield effectiveness as a function of the distance from the aperture is shown in Figure 8, in good agreement with Reference [15]. In Table I we show the execution time for one processor and the memory requirements for the storage of the matrix Z, while in Table II we report the speed-up resulting from the parallel implementation.

4.3. A problem in the analysis of a nuclear fusion device

The analysis deals with the toroidal conducting structure of the ITER Tokamak reactor [16].


Figure 7. A test enclosure with one rectangular aperture. The real (a) and imaginary (b) parts of the x component of the magnetic induction due to the eddy currents, normalized to the amplitude of the applied field, as a function of the distance from the aperture along the line passing through the centre of the aperture, for the coarse (dash–dot) and fine (solid) mesh.

Figure 8. A test enclosure with one rectangular aperture. The shield effectiveness as a function of the distance from the aperture along the line passing through the centre of the aperture, for the coarse (dash–dot) and fine (solid) mesh.

The model includes the first wall, the vacuum vessel and the toroidal field coil cases with several radial plates and the inter-coil structure, for a total of 34 700 nodes, 17 958 elements and 21 189 real unknowns (see Figure 9). The field source is made of six side correction coils (SCC) distributed along the external periphery of the torus (see Figure 9). The currents circulating in two neighbouring coils have opposite directions, so that the geometrical period of the sources around the axis of the torus is 120°.


Table I. A test enclosure with one rectangular aperture: one processor, execution time and memory requirements.

                        System    Assembly time (s)    Inversion time (s)    Memory (MB)
    5097 unknowns       SUN       529                  785                   198
                        BWS       952                  2428
    13 985 unknowns     SUN       (3982)               (16 215)              1493
                        BWS       (7167)               (50 512)

    (·) Extrapolated.

Table II. A test enclosure with one rectangular aperture; speed-up for the two parallel systems.

                      No. of scalar    SUN             BWS             BWS             BWS
                      unknowns         4 processors    4 processors    9 processors    16 processors
    LA algorithm      5097             3.67            3.52            7.37            12.7
                      13 985           3.70                                            13.8
    LCAC algorithm    5097             3.91            3.35            7.00            11.5
                      13 985           3.85                                            11.6
    Inversion         5097             3.32            2.45            6.5             11.4
                      13 985           3.28                                            10.2

This set of coils is designed to control the non-axisymmetric plasma deformation by means of a non-axisymmetric magnetic field that is an alternating function of the toroidal co-ordinate. The reflection symmetry of this conducting structure with respect to the x–z plane reduces the discretized domain to 60°. However, the assembly of the inductance matrix requires the integration over the whole torus and not only over the discretized domain, as explained in Section 2 (see Equation (16)). This circumstance represents an additional overhead with respect to the computation required in the previous examples, so that, also in this case, the time required by the assembly is not at all negligible, even if the size of the problem is high. The matrix filling time is high even using one Gauss point per element in the numerical integration. Moreover, it should be explicitly noticed that the conducting region in this example is multiply connected, so that several degrees of freedom are not locally based [10], and this has some influence on the distribution of the load among the processors.

In this case, we examine a transient related to a given control action, with the current in each SCC rising to 90 kA in 25 ms and then decaying to zero with the same time constant. An implicit time integration scheme has been used with a time step of 1 ms. The eddy current density distribution and the power density distribution on the conducting structure at the end of the rise time (t = 0.025 s) are shown in Figure 10.

In Table III, we report the execution time for one processor and the memory requirements for the storage of the matrix A, for the SUN parallel system. The assembly time is reported for one and eight Gauss points per element, respectively.


Figure 9. Finite element discretization of one sixth of the toroidal conducting structure of the ITER Tokamak reactor, including the double-shell vacuum vessel, the toroidal field coil cases and the SCCs with the active current density distribution.

Table III. The toroidal conducting structure of ITER: one processor, execution time and memory requirements.

              Assembly time (s)
    System    1 Gauss point    8 Gauss points    Cholesky factorization (s)    Memory (MB)
    SUN       7464             42 458            7628                          1713

In Tables IV and V, we show how the computer time is distributed among the four SUN processors, for the assembly phase with one and eight Gauss points, respectively.

In this example, the memory limitation does not allow the BWS to solve the problem even with four processors. In Tables VI and VII, we report the execution time of the parallel implementation in the two systems.


Figure 10. One sixth of the toroidal conducting structure of the ITER Tokamak reactor. Eddy current density distribution: (a) in the whole structure; (b) in the vacuum vessel; and Ohmic power density distribution: (c) in the whole structure; and (d) in the vacuum vessel, at t = 0.025 s.

The speed-up is reported only for the SUN system, because no reference point is available for the BWS. Moreover, we note from Table VI that the LA and LCAC algorithms show almost the same speed-up when the CPU time increases, since the overhead due to the distribution of the unknowns among the processors in LA is balanced by the overhead due to communications in LCAC.

5. CONCLUSIONS

In this paper, we have discussed several aspects related to the parallelization of a volume integral formulation for solving Maxwell equations in the magneto-quasi-stationary limit. We have shown how the intrinsic limits related to storage and computational requirements for the generation and solution of the large matrix resulting from the discretization can be extended.


Table IV. The toroidal conducting structure of ITER; execution time and speed-up for the SUN 4-processor parallel system.

                              Execution time (s)                                                         Speed-up
                              1 Gauss point                         8 Gauss points
                              Barrier   Communications   Assembly   Barrier   Communications   Assembly   1 Gauss point   8 Gauss points
    LA algorithm              347       —                2080       3464      —                12 750     3.59            3.33
    LCAC algorithm            109       1067             2757       105       1087             11 691     2.71            3.63
    Cholesky factorization                               1986                                             3.84

Table V. The toroidal conducting structure of ITER; subdivision of the assembly time (1 Gauss point) among the processors of the SUN parallel system.

                               LA algorithm (s)                    LCAC algorithm (s)
                               Processor                           Processor
                               1       2       3       4           1       2       3       4
    Preprocessing              4       3       3       3           4       4       4       5
    Parallel computations      1729    1803    1949    2073        1762    1879    1711    1510
    Read and write on disk     0       0       0       0           75      64      31      66
    Barrier                    347     83      0       4           100     0       62      109
    Communications             0       191     128     0           816     808     949     1067
    Total time                 2080    2080    2080    2080        2757    2757    2757    2757

The computation and storage of the system matrix are O(n²), while the linear system solution is O(n³). However, especially in the presence of a rotational symmetry, or in some particular cases such as those arising in a design problem or an inverse problem (when many solutions have to be produced by changing some parameters), the assembly phase can be at least as expensive as the solver phase, because of the cost of the numerical evaluations involved. In these cases one has to assemble the matrix and repeatedly solve the problem many times, leading to CPU time requirements that can easily be of the order of days, even for ‘small’ problems. The assembly phase is also certainly a significant task when a higher precision is required in the computation of the inductance matrix, so that a high number of Gauss points (or even analytical formulas) should be used. In addition, the storage limitations support in any case the parallelization of the assembly phase.

For this phase, two different algorithms have been analysed. It has been shown that their performance is in general good and, in the best cases, the speed-up is very near to the theoretical limit, scaling almost linearly with the number of processors.


Table VI. The toroidal conducting structure of ITER; subdivision of the assembly time (8 Gauss points) among the processors of the SUN parallel system.

                               LA algorithm (s)                        LCAC algorithm (s)
                               Processor                               Processor
                               1        2        3        4           1        2        3        4
    Preprocessing              10       10       10       10          11       11       11       11
    Parallel computations      9276     11 048   10 731   12 740      10 581   10 574   10 628   10 423
    Read and write on disk     0        0        0        0           71       66       32       66
    Barrier                    3464     1646     1542     0           100      0        58       105
    Communications             0        45       467      0           928      1040     963      1087
    Total time                 12 750   12 750   12 750   12 750      11 691   11 691   11 691   11 691

Table VII. The toroidal conducting structure of ITER; execution time for the BWS parallel system.

                              Execution time (s)
                              BWS 9 processors                        BWS 16 processors
                              Barrier   Communications   Assembly    Barrier   Communications   Assembly
    LA algorithm              734       —                2059        513       —                1313
    LCAC algorithm            122       754              2123        75        465              1223
    Cholesky factorization                               5393                                   3263

The choice of the algorithm depends not only on the parallel system configuration but also on the specific geometrical characteristics of the problem. The performance of the LA algorithm is in general slightly better when the computational overhead related to the not exactly equal distribution of the unknowns among the processors is less than the cost of the all-to-all communications needed by the LCAC algorithm.

The solver phase has been based on the use of the standard ScaLAPACK linear algebra library. Speed-ups of 10 or 8 (for complex or real systems) on the 16-processor Beowulf system, and of 3.3 or 3.6 (for complex or real systems) on the 4-processor SUN workstation, relative to the single-processor performance, have been obtained for this phase.

Several cases of practical interest have been briefly analysed, showing the advantages of the proposed approach. The tests performed on two very different parallel systems finally confirmed the high source-code portability, verifying the efficiency of the implementation across a quite wide range of architectures and the attractiveness of low-cost parallel systems.

ACKNOWLEDGEMENTS

The authors wish to thank Dr C. Nardone, Sun Microsystems of Italy, for his suggestions about the tuning of the shared memory system. Work supported in part by the Italian Ministry of Education, University and Research, MIUR, and by the EURATOM/ENEA/CREATE–DAEIMI Association.


REFERENCES

1. Yuan Y, Banerjee P. Comparative study of parallel algorithms for 3-D capacitance extraction on distributed memory multiprocessors. 2000 International Conference on Computer Design, Proceedings, 2000; 133–138.
2. Mackerle J. Parallel finite element and boundary element analysis: theory and applications—a bibliography (1997–1999). Finite Elements in Analysis and Design 2000; 35:283–296.
3. Bryant CF, Roberts MH, Trowbridge CW. Implementing a boundary element method on a transputer system. IEEE Transactions on Magnetics 1990; 26:819–822.
4. Natarajan R, Krishnaswamy D. A case study in parallel scientific computing: the boundary element method on a distributed-memory multicomputer. Engineering Analysis Boundary Elements 1996; 18:183–193.
5. Kettunen L, Forsman K, Levine D, Gropp W. Volume integral equations in non-linear 3-D magnetostatics. International Journal for Numerical Methods in Engineering 1995; 38:2655–2675.
6. Putnam JM, Car DD, Kotulski JD. Parallel CARLOS-3D—an electromagnetic boundary integral method for parallel platforms. Engineering Analysis Boundary Elements 1997; 19:49–55.
7. Jacques T, Nicolas L, Vollaire C. Electromagnetic scattering with the boundary integral method on MIMD systems. IEEE Transactions on Magnetics 2000; 36:1479–1482.
8. Albanese R, Rubinacci G. Integral formulation for 3D eddy current computation using edge-elements. IEE Proceedings Part A 1988; 135:457–462.
9. Albanese R, Rubinacci G. Finite element methods for the solution of 3D eddy current problems. Advances in Imaging and Electron Physics 1998; 102:1–86.
10. Rubinacci G, Tamburrino A, Villone F. Circuits/fields coupling and multiply connected domains in integral formulations. IEEE Transactions on Magnetics 2002; 38:581–584.
11. Albanese R, Hantila FI, Rubinacci G. A nonlinear eddy current integral formulation in terms of a two-component current density vector potential. IEEE Transactions on Magnetics 1996; 32:784–787.
12. Drakos N. ScaLAPACK Users' Guide. http://www.netlib.org/scalapack/slug/index.html
13. Dodd CV, Deeds WE. Analytical solution to eddy-current probe-coil problems. Journal of Applied Physics 1968; 39:2829–2838.
14. Tamburrino A, Ventre S, Rubinacci G. An FFT integral formulation using edge-elements for eddy current testing. International Journal of Applied Electromagnetics and Mechanics 2000; 11:141–162.
15. De Moerloose J, Criel S, De Smedt R, Laermans E, Olyslager F, De Zutter D. Comparison of FDTD and MoM for shielding effectiveness modelling of test enclosures. IEEE 1997 International Symposium on Electromagnetic Compatibility, Proceedings, 1997; 596–601.
16. Aymar R. ITER R&D: Executive Summary: Design Overview. Fusion Engineering and Design 2001; 55(2–3):107–118.
