PP08 Abstracts - SIAM

CP1

Efficient Parallel Simulation for Stochastic Simulation of Biochemical Systems on the Graphics Processing Unit

The small populations of some reactant species in biological systems formed by living cells can result in inherent randomness that cannot be captured by traditional deterministic (ordinary differential equation) simulation. A more accurate simulation can be obtained by using the Stochastic Simulation Algorithm (SSA). Many stochastic realizations are required to obtain accurate probability density functions. This carries a very high computational cost. The current generation of general-purpose graphics processing units (GPU) is well-suited to this task. Computational experiments illustrate the power of this technology for this important and challenging class of problems.

Hong Li
Department of Computer Science
University of California, Santa [email protected]

Linda [email protected]
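
The reason many realizations are needed, and why they parallelize so readily, can be illustrated with a minimal serial sketch of the SSA (Gillespie's direct method) in Python. This is not the authors' GPU implementation. The birth-death reaction system, the rate constants, and the function names `ssa_birth_death` and `estimate_mean` are assumptions chosen only for illustration. Each realization is independent of the others, which is what would allow many of them to run concurrently.

```python
import random


def ssa_birth_death(k_birth, k_death, x0, t_end, rng):
    """One SSA realization of the hypothetical system
    0 -> X (propensity k_birth) and X -> 0 (propensity k_death * x).
    Returns the population of X at time t_end."""
    t, x = 0.0, x0
    while True:
        a_birth = k_birth
        a_death = k_death * x
        a_total = a_birth + a_death
        if a_total == 0.0:
            return x  # no reaction can fire any more
        # Waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            return x
        # Choose which reaction fires, with probability proportional
        # to its propensity.
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1


def estimate_mean(n_runs, seed=0):
    """Average many independent realizations (illustrative parameters)."""
    rng = random.Random(seed)
    samples = [ssa_birth_death(10.0, 1.0, 0, 10.0, rng) for _ in range(n_runs)]
    return sum(samples) / n_runs
```

For these assumed rates the deterministic model predicts a population close to k_birth / k_death = 10 at the final time, and the sample mean over many realizations should approach that value, while the individual realizations scatter around it.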

CP1

From HPC to Grid Computing: CSE Scenarios with GridSFEA

With the simulation framework GridSFEA, we intend to provide the HPC application developer and user with a complementary instrument for working in a grid. GridSFEA services, application library, and wrappers simplify tasks such as parameter studies and long-running simulations. This is important, since the HPC community lacks well-suited and usable grid tools and thus remains reluctant to exploit computing grids. We demonstrate several successful scenarios, such as molecular dynamics and traffic simulations in a grid.

Ioan L. Muntean, Hans-Joachim Bungartz, Martin Buchholz, Michael Moltenbrey, Ekaterina Elts, Dirk Pflüger
Technische Universität München, Department of Informatics
Chair of Scientific Computing in Computer [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Ralf-Peter Mundani
TUM, Faculty of Civil Engineering and Geodesie
Chair for Computational Civil and [email protected]

CP1

Plasma Turbulence Computation and Visualization on the GPU

We present techniques to accelerate plasma turbulence simulations on modern Graphics Processing Units (GPUs). We contrast and compare the performance of two classes of methods: spectral methods used in MHD models and particle methods used in gyrokinetic models. We demonstrate a prototype of a scalable computational steering framework based on a tight coupling of plasma turbulence simulation and visualization on the GPU.

George Stantchev
University of [email protected]

CP1

Parallel Numerical Methods for Solving Nonlinear Evolution Equations that Model Optical Fiber Communication Systems

Nonlinear evolution equations of the nonlinear Schrödinger (NLS) type are of tremendous interest in both theory and applications. Various regimes of pulse propagation in optical fibers are modeled by some form of the NLS and CNLS equations. In this talk we introduce parallel algorithms for numerical simulations of these equations. The parallel methods are implemented on the IBM p655 multiprocessor computer. Our numerical experiments have shown that the methods used give accurate results and considerable speedup.

Thiab R. Taha
Professor at the University of [email protected]

CP1

Efficient Parallel Algorithm for 2D2V Vlasov Equation with High Order Spectral Element Method

For decades kinetic space plasma simulation has been dominated by PIC (Particle-In-Cell) codes. Because of the inherent noise of PIC, solving the Vlasov equation directly becomes more promising. In this work, we are developing a tera-scalable scheme for a 2D2V Vlasov solver using a high order spectral element method. For this four-dimensional problem, an efficient parallel algorithm is necessary due to memory and speed requirements. The results have been verified against PIC codes.

Jin Xu
Physics Division
Argonne National Lab.
jin [email protected]

CP1

Using Multi-Core, Multi-CPU, PC Clusters for Statistical 3-D Virus Reconstructions from Cryo Electron Microscopy Images

Cryo electron microscopy images of viruses provide roughly projection data at unknown projection angles and signal to noise ratios less than 1/2. Statistical approaches to 3-D reconstruction (e.g., Doerschuk and Johnson, IEEE Transactions Information Theory, 2000) require extensive computation. A system to perform such computations on a PC cluster with multi-cpu and multi-core nodes using C, MPI, and OpenMP is described, including experiments with different algorithms and different mixtures of parallelism methodologies.

John Johnson
The Scripps Research Institute
Department of Molecular [email protected]

Yili Zheng
Electrical and Computer Engineering
Purdue [email protected]

Peter C. Doerschuk
Cornell University
Biomedical Engineering and Electrical and [email protected]

CP2

Electronic Structure Calculations for Large Nanosystems on Parallel Computers

Electronic structure calculations based on the density functional theory approach have become one of the biggest consumers of cycles on high performance computers around the world. In this talk I will discuss this approach, as used in nanoscience applications on high performance computers, as well as new methods that go beyond density functional theory and allow us to simulate much larger systems with first principles accuracy. Performance of new parallel solvers for these methods on high performance parallel computers such as the IBM BGL, Cray XT4, NEC Earth Simulator and PC clusters will also be discussed. I will discuss some applications of these methods to nanosystems and detector materials. Work done in collaboration with L-W Wang, O. Marques, S. Tomov and C. Voemel.

Andrew M. Canning
Lawrence Berkeley National [email protected]

CP2

Parallelized Topography Simulation for Electronic Device Manufacturing

Three-dimensional topography simulation for electronic device manufacturing is a demanding task regarding computational resources. We developed a parallelized method using a Monte Carlo algorithm for flux calculation and level-set algorithms for front tracking. The computational complexity scales optimally with surface size. The flux calculation is parallelized using MPI and OpenMP. Communication is minimized by storing the compressed structure geometry on each node, which makes our method applicable to cheap infrastructure consisting of conventional PCs with a LAN connection.

Otmar Ertl, Johann Cervenka, Siegfried Selberherr
Vienna University of Technology
Institute for [email protected], [email protected], [email protected]

CP2

Parallel Block-Oriented Preconditioners for FEM Modeling of Semiconductor Devices

Various approximate block factorization and physics-based preconditioners are applied to the drift-diffusion equations for modeling semiconductor devices. The resulting scalar subsystems are solved by various iterative methods including AMG-type techniques. We employ a stabilized finite element discretization of the drift-diffusion equations on unstructured meshes. The nonlinear coupled system is solved with a parallel preconditioned Newton-Krylov method. Preliminary results will be presented demonstrating the performance of the block-oriented preconditioners compared with one-level and multilevel preconditioners. This work was partially funded by the DOE NNSA's ASC Program and the DOE Office of Science AMR Program, and was carried out at Sandia National Laboratories, operated for the U.S. Department of Energy under contract no. DE-AC04-94AL85000.

Paul Lin
Sandia National [email protected]

Gary Hennigan
Sandia National [email protected]

Robert J. Hoekstra
Sandia National [email protected]

John Shadid
Sandia National Laboratories
Albuquerque, [email protected]

CP2

A Massively Parallel Schroedinger Solver for Nano-Electronics

NEMO3D is a scalable simulator for nano-electronic devices such as quantum dots that uses a quantum mechanical description of the device. A key computational challenge is to compute degenerate eigenstates in the interior of the spectrum for a very large Hamiltonian matrix, with up to 10^9 degrees of freedom. We compare several parallel eigensolvers and present detailed performance results that demonstrate the petascale potential of NEMO3D on state-of-the-art parallel platforms.

Maxim Naumov
Purdue University
Department of Computer [email protected]

Faisal Saied, Hansang Bae, Steve Clark, Ben Haley, Gerhard Klimeck, Sunhee Lee
Purdue [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

CP2

Parallel Methods for Electronic Transport Through Nanoscale Devices

Nanoelectronics is a fast-developing field, so understanding electronic transport at the nanoscale is currently of great interest. In this talk, we present new parallel algorithms to calculate the electronic transport through large nanoscale devices consisting of many thousands of atoms. Applying the semi-empirical Extended Hückel Theory to model nanowire junctions, we compare the parallel performance of our recently developed direct and iterative approaches, the latter of which uses preconditioned Krylov subspace techniques.

Hans Henrik B Sørensen
Informatics and Mathematical Modelling
Technical University of [email protected]

Martin van Gijzen
Numerical Analysis Group
Delft University of Technology
[email protected]

Dan Erik Petersen, Stig Skelboe
Department of Computer Science
University of [email protected], [email protected]

Kurt Stokbro
NBI and Department of Computer Science
University of [email protected]

Per Christian Hansen
Technical University of Denmark
Informatics and Mathematical [email protected]

CP3

Cortically-Inspired Parallel Processing

Symbolic logic and serial computations have proven inadequate for modeling human cognition and solving hard cognitive problems. At the other extreme, the recent use of a Blue Gene supercomputer for the molecular simulation of a single cortical column, albeit a remarkable leap for molecular neuroscience, is still very far from explaining cognition. I will present an implementation of a massively parallel, highly scalable, heteroassociative network of attractor networks that abstracts the functionality of millions of cortical columns and explains phenomena of visual cognition. This is a paradigm of how brain-like computing might look in the future.

Socrates Dimitriadis
Brown University
Department of Cognitive and Linguistic [email protected]

CP3

UC-geoWave: A Stereo-Distributed-Parallel Application for Seismic Modeling in Oil Exploration

This paper describes a stereo-distributed-parallel application, called UC-geoWave, for the simulation of two-dimensional acoustic wave propagation in heterogeneous media. This application has three modules: pre-processing, processing and post-processing. Using the pre-processing module, the user can create synthetic terrain in 3D and then view it stereoscopically. In the processing module, the application computes a parallel reverse time migration (RTM) of a seismic data set to obtain a depth image. This migration technique is based on the solution of the acoustic wave equation in 2D using a finite difference scheme. This phase is executed on a cluster machine using MPI. In the post-processing module, the user can load the resulting data and view the graphics. The application can be accessed over the internet using any commercial browser (Internet Explorer, Netscape, etc.) and runs on different operating systems (Windows, MacOS X, Linux and Solaris).

Juan Medina
Universidad de [email protected]

German A. Larrazabal
University of [email protected]

CP3

Ratio-Based Parallel Time Integration (RaPTI) for Satellite Trajectories

Chartier and Philippe (1993) and Erhel and Rault (2000) introduced parallel time integration for satellite trajectories. We apply a version of the RaPTI algorithm to solve this problem. RaPTI is a predictor-corrector scheme based on automatic generation of time slices (Nassif et al. (2005)), at the end of which solution values exhibit a "ratio phenomenon". Such an approach leads to the parallel time integration schemes used in the authors' previous works (2006-2007). The present paper extends RaPTI to a J2 perturbed satellite trajectory.

Nabil R. Nassif
Mathematics Department
American University of [email protected]

Jocelyne Erhel, Noha Makhoul-Karam
IRISA, UNIVERSITE DE [email protected], [email protected]

Yeran Soukiassian
Mathematics Department
American University of [email protected]

CP3

A Scalable Parallel Classification Algorithm for Remote Sensing

Previous work on the parallel IGSCR (iterative guided spectral class rejection) classification algorithm for remote sensing resulted in good speedup through 64 processors; however, speedup began deteriorating beyond 16 processors. This work will tackle scalability issues with parallel clustering that will be essential in a scalable distributed memory version of IGSCR. These issues include clustering schemes that are more amenable to a parallel environment and load balancing methodologies that will increase overall parallel efficiency.

Layne T. Watson
Virginia Polytechnic Institute and State University
Departments of Computer Science and [email protected]

Rhonda D. Phillips
Virginia Polytechnic Institute and State [email protected]

Randolph H. Wynne
Virginia Polytechnic Institute and State University
Department of [email protected]

CP3

Parallel Implementation of Data Mining Algorithms

This paper discusses the parallel or distributed implementation of key data mining algorithms in the areas of collaborative filtering and latent semantic analysis. Both optimization problems are elegantly captured by matrix representations that are usually sparse and of very large dimension. We discuss the implementation of these algorithms on architectures of different levels of granularity, such as dedicated highly coupled parallel processors, loosely coupled parallel processors and a distributed platform.

Yosef G. Tirat-Gefen
Castel Research Inc. and George Mason [email protected]

CP4

Higher-Level Abstractions and Patterns for Designing Data-Parallel Applications

The level of abstraction provided by the Message Passing Interface (MPI) is too low, and enormous amounts of time and effort are spent refactoring existing code to include MPI primitives. This presentation describes higher-level abstractions and design patterns that encapsulate data distribution, communication, and load balancing. Several applications developed using a single pattern will be presented, and their performance will be compared with hand-written versions of the applications.

Purushotham Bangalore
Univ. of Alabama at Birmingham
Dept. of Computer and Information [email protected]

CP4

Parallel IO and Data Management for Data Structures in Applications

We have developed a middle layer between parallel file systems/MPI-IO and applications, through which applications are able to efficiently use the most efficient part of MPI-IO and parallel file systems for millions of distributed un-aligned small datasets. For a variety of data structures in applications, such as unstructured meshes and variables, the middle layer provides sustainable, interoperable, efficient, scalable, and convenient tools for parallel IO and data management. The IO performance of the middle layer for high-level data structures in applications is almost the same as the performance of MPI-IO for large datasets. The IO performance of either collective or non-collective calls for millions of distributed non-aligned small datasets is comparable to the performance of MPI-IO for large datasets.

William W. Dai
Los Alamos National [email protected]

CP4

Using GPUs From High-level Programming Languages

This talk describes a high-productivity development model for general purpose computing on Graphics Processing Units (GPUs). This is accomplished by exposing the capabilities of NVIDIA's CUDA architecture to high-level programming languages, such as Python, IDL, MATLAB and Java. In this talk we describe an array-based programming interface that hides the details of CUDA and the GPU from the user, allowing them to perform GPU accelerated computations with little effort. The result is a development model that makes the high performance of GPUs accessible to end users and working scientists. This talk will focus on the Python implementation of the interface, but will also briefly cover IDL, MATLAB and Java versions. We acknowledge Nathaniel Sizemore and Dave Wade-Stein for help with the build system for this project.

Dan Karipides, Paul Mullowney, Michael Galloy, Peter Messmer, Brian E. Granger
Tech-X [email protected], [email protected], [email protected], [email protected], [email protected]

CP4

Grids and Clusters with Multi-Core Nodes: A Genetics Application Perspective

The introduction of multicore processors implies that algorithms which are parallelized at an outer, coarse grain level should possibly be revisited to examine if multithreading should also be used at an inner, fine grain level. In this paper we discuss parallel versions of the tightly coupled global optimization algorithm DIRECT. We examine how both coarse grained and fine grained parallelism can be exploited using a hybrid programming model. We show that excellent performance can be achieved when using the hybrid algorithm on loosely-coupled systems like clusters and grids with multicore nodes.

Henrik Löf
Stanford University
Department of Energy Resources [email protected]

Mahen Jayawardena
Uppsala University
Department of Scientific [email protected]

Sverker Holmgren
Uppsala University
Department of Scientific [email protected]

CP4

Group Locality Based Performance Analysis of Triplet Architecture: A Static Direct Interconnection Network for Multi-Processor (MP-SoC)

We propose a new criterion for performance evaluation based on the concept of group locality in interconnection networks: the lower layer complete connect, i.e., how completely a node in a subset of processing nodes is connected to its neighbors. Triplet Based Architecture, TriBA - a new idea in MP-SoC architectures - is compared with three static interconnection networks with respect to three orthogonal criteria: physical (chip area, dissipation), computational speed (message delay) and cost (chip yield, layout cost).

Haroon-Ur-Rashi Khan, Shi Feng, Ji Wei Xing
School of Computer Science and Technology
Beijing Institute of [email protected], [email protected], [email protected]

Kamran Kamran
Department of Electrical Engineering
University of Engg. & Tech., Lahore, Pakistan
[email protected]

CP4

A Data-Distributed Massively Parallel Design of DIRECT

A data-distributed massively parallel implementation is developed for the optimization algorithm DIRECT, favored for its deterministic nature and global convergence property. Sharing data across multiple machines reduces the local memory burden. Multilevel parallelism boosts the concurrency and mitigates the data dependency, thus improving the load balancing and scalability. Also, user-level checkpointing is integrated as a fault-tolerance feature. On large-scale systems, the design was evaluated using benchmark functions and real-world applications.

Rhonda D. Phillips
Virginia Polytechnic Institute and State [email protected]

Layne T. Watson
Virginia Polytechnic Institute and State University
Departments of Computer Science and [email protected]

Jian He
Department of Computer Science
Virginia [email protected]

Masha Sosonkina
Ames Laboratory/DOE
Iowa State [email protected]

CP5

Load Distribution in MADNESS

Load balancing is vital to the efficiency of MADNESS (Multiresolution Adaptive Numerical Environment for Scientific Simulation), an environment for prototyping and developing scientific applications being developed to run on leadership computing resources. We propose the melding algorithm to load balance the computational work in MADNESS. In this presentation, we describe the method, discuss its theoretical advantages over alternative load balancing techniques for this problem, and present preliminary results from runs on leadership computing resources.

Rebecca J. Hartman-Baker, George Fann
Oak Ridge National [email protected], [email protected]

Robert Harrison
University of Tennessee
Oak Ridge National [email protected]

CP5

A Benchmark Study of Compiler Performance for Sparse Kernels on Multicore Processors

Obtaining optimal performance for scientific applications on modern computer architectures continues to be a challenge. This study presents an empirical comparison of the impact of hardware architecture, compilation options, data structure and coding technique on algorithm performance for a small set of representative mathematical kernels, including sparse matrix-vector products, on a set of multicore processor-based HPC platforms. Numerical results are presented, and implications for the optimization of numerical software codes are considered.

Wayne Joubert
U.S. Army Engineer Research and Development Center (ERDC)
Major Shared Resource Center (MSRC)
[email protected]

CP5

Database Components for Support of Computational Quality of Service for Scientific CCA Applications

While component-based design has proven helpful in managing the complexity of parallel scientific simulations, many challenges remain in selecting and configuring components during runtime to improve performance. This presentation introduces a new aspect of our infrastructure in computational quality of service (CQoS), namely database components that manage historical performance data and metadata. We illustrate their use in selecting appropriate parallel solver components.

Li Li
Argonne National [email protected]

Boyana Norris
Argonne National Laboratory
Mathematics and Computer Science [email protected]

Lois McInnes
Argonne National [email protected]

CP5

Computational Forces in the Linpack Benchmark

The efficiency of parallel algorithms can be explained as a balancing act between computational forces. These forces, also called computational intensities, are determined by the particular algorithm and the particular machine running the algorithm. For a timing formula describing the Linpack benchmark from Greer and Henry, we show that different machines follow different paths along a single efficiency surface.

Robert Numrich
University of [email protected]

CP5

Performance Comparison Between Square-to-Hemisphere and Cubed Sphere Projections of a Global Shallow-Water Model on a Toroidal Interconnect Architecture

Motivated by the limited scalability encountered with the cubed sphere projection implemented in global shallow-water models over a toroidal interconnect, we propose a square-to-hemisphere projection. We argue that the square-to-hemisphere projection is superior in optimizing processor communication and decreasing the complexity of computational load balancing. We present a performance comparison for a numerical shallow-water model under both projections using a discrete Galerkin Runge-Kutta (DGRK) method on the IBM BlueGene/L system over 1024 nodes.

Marcus Waldman
Undergraduate, University of Colorado at Boulder
Student Research Assistant, [email protected]

Siddhartha Ghosh
CISL/[email protected]

CP6

Why Column Pivoting Should Be Used for Performance

This talk presents new research on parallel dense linear algebra with implicit column pivoting to improve load balancing. After showing the performance on a cluster of workstations, we discuss heterogeneous clusters, where the dynamic load balancing helps the most. We show how the same new idea can be applied to hybrid OpenMP/MPI problems and sparse problems. We also show how this research is being integrated into the latest Intel Math Kernel Library's cluster products.

Greg Henry
Intel [email protected]

CP6

Block Householder Reduction of Sparse Matrices to Small Band Upper Triangular Form

Bidiagonalization can be accomplished by accessing a sparse matrix A only to perform the sparse matrix dense vector multiplications Ax and y^T A. Only a moderate number of leading rows and columns are eliminated. The computations Ax and y^T A are predominant, especially when x and y are too large to fit in cache memory. If the reduction is to bandwidth k, the multiplications can instead be AX and Y^T A, with A sparse and X, Y dense with k columns. Blocking A gives further speedup. On a cache based architecture, the resulting algorithm is fast and stable. It adapts easily to multi-core architectures.

Gary Howell
North Carolina State [email protected]
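
The benefit of replacing k products Ax by one product AX can be illustrated with a small Python sketch. It is not the author's implementation. The CSR storage layout and the function names `csr_matvec` and `csr_matmat` are assumptions made for illustration. In the block version each stored entry of A is read once and reused for all k columns of X, whereas k separate matrix-vector products would traverse A k times.

```python
def csr_matvec(indptr, indices, data, x):
    """y = A x for A in CSR form (indptr, indices, data)."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        for p in range(indptr[i], indptr[i + 1]):
            y[i] += data[p] * x[indices[p]]
    return y


def csr_matmat(indptr, indices, data, X):
    """Y = A X, where X is a dense list of rows with k entries each.
    Every nonzero of A is loaded once and applied to all k columns."""
    k = len(X[0])
    Y = [[0.0] * k for _ in range(len(indptr) - 1)]
    for i in range(len(Y)):
        row = Y[i]
        for p in range(indptr[i], indptr[i + 1]):
            a = data[p]
            x_row = X[indices[p]]
            for j in range(k):
                row[j] += a * x_row[j]
    return Y


# Hypothetical 3x3 example: A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
INDPTR = [0, 2, 3, 5]
INDICES = [0, 2, 1, 0, 2]
DATA = [2.0, 1.0, 3.0, 4.0, 5.0]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = csr_matmat(INDPTR, INDICES, DATA, X)
```

Each column of Y equals the result of `csr_matvec` applied to the corresponding column of X, so the block product changes only the memory access pattern, not the arithmetic.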

CP6

Divide and Conquer Eigenvalue Solver Parallelization

The divide and conquer algorithm is, in principle, well suited to parallelization: a big task is divided into smaller ones that can be solved in parallel. In practice it is not so easy, because the small solutions must be merged into a big one and, in addition, they affect each other during the solving stage. This work describes the problems that arose in parallelizing an eigenvalue solver, and their solutions.

Alexander V. Kobotov
Intel Corp.; Institute of Computational Mathematics and Mathematical Geophysics SB [email protected]

CP6

Weighted Matrix Reordering and Parallel Banded Preconditioners for Non-Symmetric Linear Systems

With the emergence of petascale architectures, the role of preconditioning techniques that can scale well on large numbers of processors has become crucial. We present a reordering scheme that allows the extraction of a central dominant band that can be used as a preconditioner. Our results demonstrate excellent scalability and robustness for a large class of problems for which other black-box preconditioners, such as ILU and its variants, are poorly scalable.

Murat Manguoglu
Purdue University Department of Computer [email protected]

Ahmed Sameh
Department of Computer Science
Purdue [email protected]

Mehmet Koyuturk
Case Western Reserve University
Department of Electrical Engineering and [email protected]

Ananth Grama
Purdue University
Department of Computer [email protected]
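
The idea of using a central band as a preconditioner can be sketched in Python under strong simplifying assumptions. The reordering scheme of the abstract is not reproduced. The band is taken to be tridiagonal. A plain preconditioned Richardson iteration stands in for the Krylov method one would use in practice. The test matrix and all function names are hypothetical.

```python
def extract_tridiagonal(A):
    """Return the sub-, main and super-diagonal of a dense matrix A."""
    n = len(A)
    lower = [A[i][i - 1] for i in range(1, n)]
    diag = [A[i][i] for i in range(n)]
    upper = [A[i][i + 1] for i in range(n - 1)]
    return lower, diag, upper


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm without pivoting (assumes a dominant diagonal)."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    if n > 1:
        c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def residual(A, x, b):
    n = len(A)
    return [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]


def banded_richardson(A, b, iterations=30):
    """Iterate x <- x + M^{-1} (b - A x), with M the tridiagonal band of A."""
    lower, diag, upper = extract_tridiagonal(A)
    x = [0.0] * len(A)
    for _ in range(iterations):
        dx = solve_tridiagonal(lower, diag, upper, residual(A, x, b))
        x = [xi + di for xi, di in zip(x, dx)]
    return x


def example_matrix(n=6):
    """Hypothetical non-symmetric matrix with a dominant central band."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 4.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
        if i + 3 < n:
            A[i][i + 3] = -0.5   # weak coupling outside the band
        if i - 3 >= 0:
            A[i][i - 3] = -0.25  # different value makes A non-symmetric
    return A
```

The sketch converges quickly here only because the entries outside the band are small relative to the band itself. Making that true for a general matrix is the purpose of a reordering step such as the one the abstract describes.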

CP6

One World, One Matrix

We propose a new parallel algorithm, called Directed Transmission Method (DTM), to solve the sparse linear system whose coefficient matrix is symmetric-positive-definite (SPD). DTM is a fully scalable, asynchronous, distributed and continuous-time iterative algorithm, which is quite different from the traditional discrete-time iterative algorithms. It is proved to be convergent. DTM can run efficiently on any kind of homogeneous or heterogeneous parallel computer, e.g. multicore and manycore microprocessors, SMP, clusters, supercomputers, grids, clouds and WWW. By means of DTM, we are capable of solving arbitrarily-large sparse SPD linear systems, as long as we have enough processors and memory. Furthermore, we may unite the supercomputers all over the world to solve an unprecedented, extremely large sparse linear system, and the dream of "One World, One Matrix" would come true at that time. Besides, DTM would be a persuasive benchmark to test the performance of parallel computers, especially supercomputers and manycore microprocessors.

Huazhong Yang, Fei Wei
Department of Electronic Engineering
Tsinghua University, Beijing, [email protected], [email protected]

CP6

New Algorithms for Sparse Matrix Partitioning

We discuss how to partition a sparse matrix to reduce communication in parallel sparse matrix computations. We focus on sparse matrix-vector multiplication, which is an important kernel in scientific computing. We consider two-dimensional distributions, and present a new algorithm based on vertex separators and nested dissection. Empirical results on real application matrices show our method is better than the traditional 1-d (row) distribution, and competitive with other 2-d distributions.

Erik G. Boman
Sandia National Labs, NM
Scalable Algorithms [email protected]

Michael Wolf
Univ. of Illinois, [email protected]
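
What a partition is judged by can be made concrete with a small Python sketch that counts the communication volume of y = Ax under the traditional 1-d row distribution. It does not implement the authors' 2-d algorithm. It assumes a square matrix in which the processor owning row i also owns x[i], and the function name `row_partition_comm_volume` is hypothetical. An entry x[j] has to be sent once to every other processor that holds a nonzero in column j.

```python
def row_partition_comm_volume(nonzeros, part):
    """Count the vector entries communicated in y = A x for a 1-d row
    distribution. `nonzeros` is an iterable of (i, j) index pairs and
    `part[i]` is the processor owning row i and x[i]."""
    needed_by = {}
    for i, j in nonzeros:
        needed_by.setdefault(j, set()).add(part[i])
    # x[j] is sent to each processor that needs it, except its owner.
    return sum(len(procs - {part[j]}) for j, procs in needed_by.items())


# Hypothetical example: nonzero pattern of a 4x4 tridiagonal matrix.
TRIDIAG = [(i, j) for i in range(4) for j in range(4) if abs(i - j) <= 1]

contiguous = row_partition_comm_volume(TRIDIAG, [0, 0, 1, 1])
interleaved = row_partition_comm_volume(TRIDIAG, [0, 1, 0, 1])
```

For this pattern the contiguous split exchanges fewer vector entries than the interleaved one. A partitioner searches for assignments that keep this count low while balancing the work per processor.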

CP7

High Performance Solution of Sparse Linear Systems Using Direct Methods with Application to Electromagnetic Problems

The numerical treatment of high frequency electromagnetic scattering in inhomogeneous media is very computationally intensive. For scattering, the electromagnetic field must be computed around and inside 3D complex bodies. Because of this, accurate numerical methods must be used to solve Maxwell's equations in the frequency domain, which leads to very large linear systems. In order to solve these systems, we have combined modern numerical methods with efficient parallel algorithms on our TERAscale computer.

Katherine Mer-Nkonga, Michel Mandallena, Jean-Jacques Pesque, David Goudin
CEA/[email protected], [email protected], [email protected], [email protected]

CP7

Parallel Subspace Newton Methods for Algebraic Systems with Local High Nonlinearities

We present locally refined Newton type methods for large nonlinear systems of algebraic equations, arising from the discretization of nonlinear partial differential equations. We focus on the type of systems that have local high nonlinearities. In other words, the nonlinear system may have many equations, but only a small percentage of them are highly nonlinear compared to the rest of the equations. Global Newton methods may be used to solve the system, but often the computing time is wasted since all equations are treated equally as if they were all highly nonlinear. We introduce subspace Newton methods to remove the local high nonlinearities and therefore improve the efficiency and the effectiveness of the outer global Newton method, which performs well on equations with roughly the same level of nonlinearities. We prove the convergence of this new method under certain assumptions. We also discuss the parallel implementation of the new method using PETSc and provide some numerical results from solving several different nonlinear differential equations.

Xiao-Chuan Cai
University of Colorado, Boulder
Dept. of Computer [email protected]

Xuefeng Li
Loyola University New Orleans
[email protected]

CP7

Fully Coupled Two-Level Domain Decomposition Algorithms for Inverse Problems

In this talk, we discuss multilevel domain decomposition methods for solving some coupled nonlinear systems of equations obtained from the discretization of inverse problems. We focus on a fully coupled Newton-Krylov algorithm with two-level Schwarz type domain decomposition methods as the preconditioner. We study the parallel performance of the algorithms on supercomputers with hundreds of processors for solving some difficult inverse problems arising from the modeling of ground water flows.

Xiao-Chuan Cai
University of Colorado, Boulder
Dept. of Computer [email protected]

Si Liu
Department of Applied Mathematics
University of Colorado, Boulder [email protected]

CP7

A Parallel Multigrid Preconditioner for High-Order and hp-Adaptive Finite Elements

The hp version of the finite element method is an adaptive finite element approach in which adaptivity occurs in both the size, h, of the elements and in the order, p, of the approximating piecewise polynomials. An optimal order parallel linear system solver is needed to get the best efficiency from these methods. We present a parallel multigrid preconditioner whose rate of convergence is independent of both h and p.

William F. Mitchell
National Institute of Standards and Technology
Mathematical and Computational Sciences [email protected]

CP7

Impact of Dual-Core Processors on the Perfor-mance of Parallel Krylov Subspace Linear Solversand Preconditioners for Porous Media Flow Appli-cations

Data from finite element modeling of porous media flowwere used to solve linear systems of equations using 12Krylov subspace parallel linear solvers with five precondi-tioners (60 scenarios) using PETSc to test for efficiencyand accuracy of the different options. The Cray XT3 usedin this study has been recently upgraded to 4160 dual corenodes. This presentation will highlight the performance ofthe linear solvers before and after the dual-core processorswere installed.

Thomas OppeEngineer Research and Development CenterWaterways Experiment [email protected]

Sharad GavaliNASA Ames Research [email protected]

Fred T. TracyEngineer Research and Development CenterWaterways Experiment [email protected]

CP7

Multi-Length Scale Preconditioned Iterative Solver for Parallel Hybrid Quantum Monte Carlo Simulation

The hybrid quantum Monte Carlo (HQMC) method of the Hubbard model is a powerful method used to study the electron interactions that characterize the properties of materials, such as magnetism and superconductivity. The bottleneck of the method is the repeated solution of the underlying multi-length-scale linear systems of equations. In this talk, we present a preconditioning technique and its parallelization for solving the linear systems. The preconditioned solver demonstrates the optimal linear scaling complexity of the HQMC method for moderately correlated materials.

Zaojun BaiDepartment of Computer ScienceUniversity of California, Davis, [email protected]

Richard ScalettarDepartment of Physics,University of California, Davis, [email protected]

Wenbin ChenSchool of Mathematical Science,Fudan University, [email protected]

Ichitaro YamazakiDepartment of Computer ScienceUniversity of California, [email protected]

CP8

A Parallel Algorithm for Optimization-Based Smoothing of Unstructured 3-D Meshes

Serial optimization-based smoothing algorithms are computationally expensive. Using Metis (or ParMetis) to partition the mesh, the parallel algorithm moves (or does not move) a processor’s internal nodes based on a cost function derived from the Jacobians and condition numbers of surrounding elements. Ghost cells are used to communicate new positions, the lower processor on a boundary uses the new information to move boundary nodes, and the process repeats. The result is a ready-to-use decomposed mesh.

Vincent C. BetroUniversity of Tennessee at [email protected]

CP8

Distributed Transpose for 3D FFT: The Effects of Machine Geometry and Process Mapping on BlueGene/L

We describe how to extend the scalability of the 3D-FFT using 2D-decomposition on thousands of BlueGene/L processors. The communication cost of carrying out the data transposes required by the 3D-FFT is very high and dominates the computation cost at the limits of scalability. This motivated us to focus on performance measurements of the distributed transpose alone. We report performance data on two communication protocols, MPI and BG/L-ADE. The proposed approach is effective in improving performance for Particle-Mesh-based N-body simulations.

T.J.C. WardIBM Software Group,Hursley Park, Hursley, [email protected]

Philiph HeidelbergerIBM Thomas J. Watson Research CenterYorktown Heights, NY 10598-0218, USAphiliph@@us.ibm.com

Robert S. Germain, Blake Fitch, Aleksandr Rayshubskiy,Maria EleftheriouIBM Thomas J. Watson Research [email protected], [email protected],[email protected], [email protected]

CP8

New Parallel Techniques for BVPs in ODEs

The main objective of this paper is the development of new parallel integration algorithms for solving boundary value problems (BVPs) in ordinary differential equations (ODEs). The idea of the new techniques is to combine parallel integration processes with parallel interpolation processes suitable for running on MIMD (multiple instruction streams, multiple data streams) computing systems. The stability of the developed algorithms is analysed. We also study the treatment of stiff BVPs by the developed techniques.

Bashir M. KhalafProfessor of Scientific [email protected]

CP8

Programming with Large Scale Edge-Node Simulator on BlueGene/L: A Case Study of 3D FFT

We designed a network simulator for rapid specification of complex networks, such as those required to model neural tissue. Here we demonstrate a more general use of the network simulator for implementing generic parallel algorithms, with a case study of the 3D-FFT. We demonstrate scaling of the 128x128x128 FFT network to 4,096 BG/L processors, and compare performance against the original algorithm (Eleftheriou et al., 2006). Strategies for automatically mapping network calculations to BG/L are discussed.

Robert S. Germain, Blake Fitch, Maria EleftheriouIBM Thomas J. Watson Research [email protected], [email protected],[email protected]

James KozloskiIBM TJ Watson Research [email protected]

Charles PeckBiometaphorical Computing ResearchIBM T.J. Watson Research [email protected]

CP8

Improving the Scalability of Adaptive Mesh Refinement

In many large scale adaptive simulations scalability is hindered due to costs associated with the changing mesh. Algorithmic improvements to the mesh changing processes have led to a significant reduction in these costs. In addition, the frequency of remeshing can be reduced through the use of dilation. These changes have led to large improvements in overall scalability of the Uintah simulation framework. Results up to 4096 processors will be shown.

Justin P. Luitjens, Tom HendersonUniversity of [email protected], [email protected]

Martin BerzinsSCI InstituteUniversity of [email protected]

CP8

Finite Element Assembly on Arbitrary Meshes

One goal of automating Finite Element Methods (FEM) is to allow arbitrary element types and orders on arbitrary meshes. A challenge to this goal is separating local element definitions from the mesh definition. We show our conceptual paradigm for this separation using the PETSc Sieve library, a library based on representing meshes as Grothendieck topologies, and demonstrate results with a grade-2 fluid application.

Andy R. TerrelUniversity of ChicagoDepartment of Computer [email protected]

Matthew G. KnepleyArgonne National [email protected]

MS1

Towards General Auto-tuning Description Language on Advanced Computing Systems

The description of auto-tuning is crucial but time-consuming work when developing numerical libraries with an auto-tuning facility. In this presentation, a description language for auto-tuning, named ABCLibScript, is explained with several examples of numerical computation. Although the target of ABCLibScript was vector supercomputers, we show its effectiveness for the software development process on embedded systems. Its effect on an advanced computing environment, namely a supercomputer with multi-core processors, will also be shown.

Takahiro KatagiriInformation Technology CenterThe University of [email protected]

MS1

Proposal of Run-time Parameter Auto-Tuning Approach for Restarted Lanczos Method

For many input parameters of matrix solvers, the best values are difficult to predict before runtime. This paper proposes an automatic tuning approach for the restarted Lanczos method, which explores the best projection matrix size from the history of residual values at runtime. The numerical experiments show the proposed approach is 100 times faster than the original method in the best case. The result implies that runtime automatic tuning is effective for iterative matrix solvers.

Takao Sakurai, Ken Naono, Masashi EgiCentral Research LaboratoryHitachi [email protected], [email protected],[email protected]

Mitsuyoshi Igai, Hiroyuki KidachiHitachi ULSI Systems [email protected],[email protected]

MS1

A Bayesian Approach to Automatic Performance Tuning

Code tuning has been done based on models, experiments, or their combinations, but the combinations are mostly heuristic. In this talk it is shown that Bayesian statistics can provide a convenient mathematical framework for combining models and experiments for code tuning. The example problem here is online selection among several unrolled codes for matrix-matrix multiply, and some sequential experimental designs based on a simple performance model are proposed and evaluated.

Reiji SudaDepartment of Computer Science, The University [email protected]

MS1

Automatic Tuning for Parallel FFTs

In this talk, an automatic performance tuning method for parallel fast Fourier transforms (FFTs) is presented. A blocking algorithm for parallel FFTs utilizes cache memory effectively. Since the optimal block size may depend on the problem size, we propose a method to determine the optimal block size that minimizes the number of cache misses. Performance results of parallel FFTs on a PC cluster are reported.

Daisuke TakahashiGraduate School of Systems and Information EngineeringUniversity of [email protected]
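
The block-size selection described in the parallel FFT abstract above can be illustrated with a small empirical tuning loop. This is only a sketch under stated assumptions: it times a blocked matrix transpose as a stand-in kernel rather than an FFT, and it picks the block size by measured wall time rather than by counting cache misses as the abstract proposes. The names `blocked_transpose` and `tune_block_size` are hypothetical.

```python
import time


def blocked_transpose(a, n, block):
    """Transpose an n-by-n matrix stored as a flat row-major list, tile by tile."""
    out = [0.0] * (n * n)
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            # Work on one block-by-block tile at a time.
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j * n + i] = a[i * n + j]
    return out


def tune_block_size(n, candidates, repeats=3):
    """Time each candidate block size and return (best_block, timings).

    Assumption: the fastest of `repeats` runs is used as the cost of a
    candidate; this replaces the cache-miss criterion of the abstract.
    """
    a = [float(k) for k in range(n * n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_transpose(a, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    best_block = min(timings, key=timings.get)
    return best_block, timings


if __name__ == "__main__":
    block, timings = tune_block_size(256, [8, 16, 32, 64])
    print("selected block size:", block)
```

In pure Python the timing differences mostly reflect interpreter overhead rather than cache behavior, so the sketch shows the shape of the selection loop, not realistic tuning results.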

MS2

Neutral Territory Methods for Efficient Parallelization of Molecular Dynamics Simulations

The majority of the computational workload in molecular dynamics simulations involves interactions between nearby particles. We will describe a class of algorithms for parallelization of range-limited particle interactions, the neutral territory methods, some of which confer significant practical advantages over traditional parallelization algorithms. We will illustrate specific neutral territory methods introduced by other researchers and by ourselves, and we will discuss the tradeoffs that led us to select different neutral territory methods for different molecular dynamics implementations.

Ron O. DrorD. E. Shaw [email protected]

David E. ShawD. E. Shaw ResearchColumbia [email protected]

Kevin J. BowersD. E. Shaw [email protected]

MS2

Scaling NAMD to Large Parallel Machines

NAMD’s parallel design, circa 1996, has stood the test of time. The basic parallel structure includes (a) decomposition into cells, and force-computation objects for each pair of interacting cells, (b) implementation using message-driven objects in Charm++, and (c) assignment of objects to processors using measurement-based load balancers that also reduce communication. This talk will review recent optimizations to scale NAMD to over 32,000 processors for small and large biomolecular systems.

Laxmikant V. KaleUniversity of Illinois at [email protected]

James C. PhillipsBeckman Institute, U. Illinois at [email protected]

Chao MeiUniversity of Illinois at [email protected]

Abhinav Bhatele, Gengbin Zheng, Sameer KumarBeckman Institute, U. Illinois at [email protected], [email protected],[email protected]

Klaus SchultenUniversity of Illinios at Urbana [email protected]
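
As a rough illustration of item (a) in the NAMD abstract above, the sketch below enumerates one force-computation object per pair of interacting cells on a small grid. The function name, the non-periodic grid, and the neighbor rule (a cell interacts with itself and with cells it touches by a face, edge, or corner) are assumptions made for illustration; this is not NAMD or Charm++ code.

```python
from itertools import product


def pair_compute_objects(nx, ny, nz):
    """Return one (cell_a, cell_b) tuple per pair of interacting cells.

    Assumption: two cells interact when they are the same cell or differ by
    at most one step in each coordinate; the grid has no periodic wrap.
    """
    cells = list(product(range(nx), range(ny), range(nz)))
    in_grid = set(cells)
    pairs = []
    for cell in cells:
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(c + d for c, d in zip(cell, offset))
            # Keep each unordered pair once by requiring cell <= other.
            if other in in_grid and cell <= other:
                pairs.append((cell, other))
    return pairs


if __name__ == "__main__":
    objects = pair_compute_objects(3, 3, 3)
    print(len(objects), "compute objects for 27 cells")
```

Because there are several times more pair objects than cells, a load balancer has many small work units to place, which is the property that item (c) of the abstract relies on.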

MS2

Nanoparticle and Colloidal Simulations with Molecular Dynamics

Modeling nanoparticle or colloidal systems in a molecular dynamics (MD) code requires coarse-graining on several levels to achieve meaningful simulation times for study of rheological and other manufacturing properties. These include treating colloids as single particles, moving from explicit to implicit solvent, and capturing hydrodynamic effects. These changes also impact parallel algorithms for tasks such as finding neighbor particles and interprocessor communication. I’ll describe enhancements we’ve made to our MD code LAMMPS to make nanoparticle simulations more efficient, highlighting its flexible design that has enabled the new capabilities.

Steve PlimptonSandia National [email protected]

MS2

A Summary of the Performance and Scaling of AMBER 10 and the Challenges Ahead

This talk will present a summary of the current performance and scaling of the soon to be released version 10 of the AMBER software on a range of NSF and DOE high performance computing systems. In addition it will include an overview of the supported methods and the approaches used to obtain the level of performance seen. Finally some of the challenges that may face the molecular dynamics community in the near future will be discussed.

Ross C. WalkerSan Diego Supercomputer [email protected]

Robert E. DukeNIEHS and UNC-CHapel [email protected]

David A. CaseThe Scripps Research [email protected]

MS3

Creating Interoperability for Parallel Meshing Tools

Mesh technology, such as mesh generation, database queries, and adaptivity, plays a critical role in scientific simulations. While many frameworks providing mesh technology exist, their incorporation into applications requires significant effort and learning by application developers. Interfaces allowing interoperable use of mesh tools greatly simplify this process while providing a wider range of technology than a single framework. In this talk, we discuss interoperable mesh interfaces and, in particular, their extension to parallel mesh services.

Karen D. DevineSandia National [email protected]

Xiaojuan Luo, Mark S. ShephardRensselaer Polytechnic InstituteScientific Computation Research [email protected], [email protected]

Lori A. DiachinLawrence Livermore National [email protected]

Tim TautgesArgonne National [email protected]

Carl Ollivier-GoochUniversity of British [email protected]

Vitus Leung
Sandia National [email protected]

MS3

Algorithms for Parallel Mesh Smoothing Using Mesquite

We discuss the development of an infrastructure that supports the use of Mesquite mesh quality improvement algorithms in distributed memory applications. We start with the application’s decomposition of the mesh data and use an iterative process to select independent sets of vertices to reposition in each pass. We experiment with a mix of local and global techniques from Mesquite and report on the scalability and performance of our methods.

Lori A. DiachinLawrence Livermore National [email protected]

Martin IsenburgLawrence Livermore National [email protected]

MS3

Zoltan Load Balancing Approaches

Dynamic load-balancing is a data-management service that is critical to a wide range of unstructured and/or adaptive parallel applications. The Zoltan Library provides a suite of dynamic load-balancing tools. Access to Zoltan is now available through a common interface that supports interoperability within the ITAPS data model. In this presentation, we give a brief overview of the dynamic load-balancing approaches available through Zoltan’s ITAPS interface.

Karen D. Devine, Vitus LeungSandia National [email protected], [email protected]
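
To make the idea of geometric load balancing concrete, here is a minimal sketch of recursive coordinate bisection (RCB) on 2D points. Zoltan includes an RCB partitioner, but the abstract above does not list specific methods, and this sketch does not use Zoltan's API or the ITAPS interface; the function `rcb` and its argument layout are hypothetical.

```python
def rcb(points, num_parts):
    """Split a list of (x, y) points into num_parts groups of near-equal size.

    Each step cuts the current group across its longer coordinate direction,
    at a position proportional to the number of parts on each side, then
    recurses on the two sides.
    """
    if num_parts < 1 or len(points) < num_parts:
        raise ValueError("need at least one point per part")
    if num_parts == 1:
        return [list(points)]
    # Choose the coordinate with the largest extent as the cut direction.
    extents = [max(p[d] for p in points) - min(p[d] for p in points) for d in (0, 1)]
    axis = 0 if extents[0] >= extents[1] else 1
    ordered = sorted(points, key=lambda p: p[axis])
    left_parts = num_parts // 2
    # Give each side a share of points proportional to its number of parts.
    cut = len(ordered) * left_parts // num_parts
    return rcb(ordered[:cut], left_parts) + rcb(ordered[cut:], num_parts - left_parts)


if __name__ == "__main__":
    grid = [(x, y) for x in range(8) for y in range(4)]
    for part in rcb(grid, 4):
        print(len(part), "points, first:", part[0])
```

A dynamic load balancer would rerun such a partitioner as the point set or the per-point work changes; weighting points by work is omitted here to keep the sketch short.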

MS3

A Partition Model for Massively Parallel Mesh-Based Computations

The Interoperable Technologies for Advanced Petascale Simulations DOE SciDAC center is designing and implementing an interoperable partition model to support parallel mesh-based operations, including adaptive computations, accounting for the complexities that arise due to the changing computational load and communications of adapted meshes. The presentation will first discuss the overall partition model design. Consideration will then be given to its implementation and relation to adaptive mesh control and Zoltan load balancing procedures.

Onkar SahniRensselaer PolytechnicScientific Computation Research [email protected]

Xiaojuan Luo, Mark S. ShephardRensselaer Polytechnic InstituteScientific Computation Research [email protected], [email protected]

Kenneth Jansen, TIng XieRensselaer Polytechnic [email protected], [email protected]

MS4

Issues in Exploiting the Power of Multiple Methods

In this presentation, we will discuss some of the issues in multimethod implementation. While the focus of using multimethods is mapping a "single" method to a simulation stage, for certain problems, several suitable methods might be combined to produce more effective results. The trade-off related to the frequency of changing methods is another issue, as it might not be practical to switch methods at every opportunity for adaptivity. Yet another challenge is the efficient identification of adaptivity in the simulation.

Sanjukta BhowmickDepartment of Computer Science and EngineeringPennsylvania State [email protected]

MS4

Machine Learning Support for Numerical Decision Making

We present the SALSA (Self-Adapting Large-scale Solver Architecture) software system for intelligent multi-methods. The system is based on a modular architecture for composite algorithms (for instance, choice of scaling/preconditioner/iterator in iterative linear system solvers) and uses machine learning techniques for adaptively choosing the component algorithms. We will discuss various learning techniques we have explored, and the high level of accuracy obtained.

Victor EijkhoutThe University of Texas at AustinTexas Advanced Computing [email protected]

MS4

Evaluation of a Meta-partitioner for Simulations Using Block-structured Adaptively Refined Meshes

High parallel efficiency for structured adaptive mesh refinement (SAMR) applications requires repeated data partitioning and distribution. We present a performance evaluation of a framework for adaptive partitioning. Considering computational load, communication volume, synchronization delays, and data movement, the framework selects, configures and invokes the most efficient partitioning algorithm. We show that adaptive partitioning can significantly improve parallel efficiency for SAMR applications.

Henrik JohanssonDepartment of Information TechnologyUppsala [email protected]

MS4

Adaptive Partitioning for Unstructured AMR Applications

Improving performance of large scientific adaptive applications is non-trivial due to their inherent dynamics and wide spectrum of properties. Performance is limited by the partitioner’s ability to exploit computer resources given the application state. No single partitioning configuration can generally achieve high performance; partitioning must be dynamically adaptive. In this talk, we describe the meta-partitioner: a framework for selecting and configuring the most suitable partitioner based on run-time state.

Johan SteenslandSandia National [email protected]

MS5

Parallel Programming in MATLAB: Best Practices

Matlab is one of the most commonly used languages for scientific computing with approximately one million users worldwide. The Lincoln pMatlab library (http://www.ll.mit.edu/pMatlab), The Mathworks DCT, and StarP from ISC have brought parallel computing to this community using the distributed array programming paradigm. This talk provides an introduction to distributed array programming and will describe the best programming practices for using distributed arrays to produce well performing parallel Matlab programs.

Jeremy KepnerMIT Lincoln [email protected]

MS5

Parallel MATLAB in Production Supercomputing with Applications in Signal and Image Processing

Parallel MATLAB enables the large community of MATLAB users to harness the increased computing capacity and memory of distributed memory clusters. At the Ohio Supercomputer Center we provide our users with three varieties of Parallel MATLAB. In this talk, we will describe how we run these Parallel MATLAB environments within a traditional batch oriented queuing system. We will also describe our experiences in developing three signal and image processing applications within this environment.

Ashok KrishnamurthyOhio Supercomputing [email protected]

David Hudak, John Nehrbass, Siddharth Samsi, VijayGadepallyOhio Supercomputer [email protected], [email protected], [email protected], [email protected]

MS5

Parallel Computing Toolbox (PCT) and Parallel Programming in MATLAB

Parallel Computing Toolbox addresses computationally and data-intensive problems using MATLAB and Simulink in a multiprocessor computing environment. The toolbox supports both several independent tasks and a single parallel computation by harnessing computing clusters and a variety of batch queuing software implementations. The toolbox provides high-level constructs, such as parallel loops and algorithms, and MPI-based functions. Low-level constructs for resource management are also included. The Parallel Command Window provides an interactive environment for developing parallel applications.

Piotr LuszczekThe MathWorks, [email protected]

MS5

Interactive Data Exploration with Star-P

High performance applications increasingly combine numerical and combinatorial algorithms. Past research on high performance computation has focused mainly on numerical algorithms, and there is a rich variety of tools for high performance numerical computing. On the other hand, few tools exist for large scale combinatorial computing. We describe our efforts to build a common infrastructure for numerical and combinatorial computing by using parallel sparse matrices to implement parallel graph algorithms.

Viral B. ShahInteractive [email protected]

MS6

Integrated Air/Ocean/Wave Modeling Using ESMF

Development of an integrated air/ocean/wave modeling system is described. The single executable system is built from mature stand-alone models using the Earth System Modeling Framework (ESMF). The framework provides the required functionality for treating each model as a separate component and for the redistribution and remapping of data between them. An exchange grid approach is implemented to simplify the interface between models that use telescoping nests. In addition to describing the implementation details, preliminary results will be presented for two regional test cases.

Sue ChenNaval Research LaboratoryMonterey, [email protected]

Hao JinSAIC, Naval Research LaboratoryMonterey, [email protected]

Rich HodurNaval Research LaboratoryMonterey, [email protected]

Sasa GabersekUCAR, Naval Research LaboratoryMonterey, [email protected]

Tim CampbellNaval Research LaboratoryStennis Space [email protected]

MS6

Algorithms for a Scalable Earth System Model

Abstract not available at time of publication.

John DrakeOak Ridge National [email protected]

MS6

A Coupled Watershed-Nearshore Model Using DBuilder

Coupling of independent models involves implementation of synchronization and data-exchanging algorithms. Also, coupling may be along a shared edge of two meshes or an overlapped region between two meshes. The latter can be difficult in terms of spatially mapping nodes/elements between two meshes. DBuilder, a parallel data management toolkit, provides users with APIs such as element searching and data synchronization routines to accomplish these tasks.

Robert M. HunterU.S. Army Engineer Research & Development [email protected]

MS6

Parallel Rendezvous Regridding in ESMF

In coupled multiphysics simulations often each physics is modelled by a distinct, specialized code; to combine these codes into a coupled solver, it is necessary to transfer fields from one code to another (often called regridding). In the Earth Sciences (and other disciplines) each individual physics code will likely be a massively parallel code, with a unique parallel decomposition of the physical domain. We discuss the ESMF implementation of the Parallel Rendezvous algorithm of Stewart et al., which creates a geometric rendezvous mesh to perform the search and interpolation. We discuss the application and extension of this algorithm to interpolation of high order finite elements with non-nodal interpolation rules (e.g. Hierarchical elements). We also demonstrate a smoothing interpolation method that is based on finite element patch recovery techniques.

David NeckelsNational Center for Atmospheric [email protected]

MS7

Solving Rank Deficient Linear-Least Squares Problems Using Sparse QR Factorizations

We address the problem of solving linear least-squares problems min ||Ax − b|| when A is a sparse m-by-n rank-deficient or highly ill-conditioned matrix. Since A is rank-deficient or highly ill-conditioned, the factorization A = QR is not useful because the computed R is ill-conditioned. We have developed a new method that uses a regular QR factorization instead of a rank-revealing QR factorization. The goal of this work is to implement and test the algorithm in a high performance QR factorization.

Esmond G. NgComputational Research DivisionLawrence Berkeley National [email protected]

Haim AvronSchool of Computer ScienceTel-Aviv [email protected]

Sivan ToledoTel Aviv [email protected]

MS7

Computing the Conditioning of Dense Linear Least Squares with (Sca)LAPACK

We define condition numbers that can assess the accuracy of the components of the least squares solution. We interpret them in terms of statistical quantities. We show that the ratio of the variance of one component of the solution to the variance of the right-hand side is exactly the condition number. We also propose codes based on (Sca)LAPACK for computing the variance-covariance matrix. Finally we present experiments from the space industry with real physical data.

Julien LangouUniversity of Colorado at Denver and Health [email protected]

Jack DongarraUniversity of [email protected]

Marc [email protected]

Serge [email protected]
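
For readers of the conditioning abstract above, the standard statistical identity behind a variance ratio of this kind can be written out as follows. The noise model is an assumption stated in the first line of the block and is not taken from the abstract, which does not give its exact definition of the componentwise condition number.

```latex
% Assumptions: b = A x + \varepsilon, \operatorname{Cov}(\varepsilon) = \sigma^2 I,
% and A has full column rank, so that A^{+} = (A^{T}A)^{-1}A^{T}.
\hat{x} = (A^{T}A)^{-1}A^{T}b,
\qquad
\operatorname{Cov}(\hat{x}) = \sigma^{2}\,(A^{T}A)^{-1},
\qquad
\frac{\operatorname{Var}(\hat{x}_i)}{\sigma^{2}}
  = e_i^{T}(A^{T}A)^{-1}e_i
  = \bigl\| e_i^{T}A^{+} \bigr\|_2^{2}.
```

Under these assumptions the ratio for component i depends only on row i of the pseudoinverse, which is why a variance-covariance matrix computed from a QR factorization gives access to such per-component quantities.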

MS7

Sparse QR Rank-revealing Factorization

We discuss an algorithm for computing a rank revealing sparse QR factorization. First, a QR factorization with no pivoting is performed, which allows a sparse triangular factor R to be obtained efficiently. Second, an incremental condition number estimator (ICE) is used iteratively on R to identify redundant columns. We also introduce a block formulation of the ICE algorithm. Numerical tests show that block ICE leads to approximations close to those obtained by successive runs of ICE.

Bernard PhilippeIRISA-INRIARennes [email protected]

Laura [email protected]

Frederic [email protected]

MS7

Blocked Bidiagonal Reduction of Sparse Matrices Using Givens Rotations

Computing the singular value decomposition involves bidiagonalization of a sparse upper triangular matrix R. Conventional methods do not exploit the sparsity of R. We introduce a method to bidiagonalize R using a sequence of Givens rotations while preserving the "mountainview" profile of R. A dynamic blocking scheme extends the method to two blocked variations which generate no more fill than the unblocked version. We present performance results comparing all the different methods.

Gene H. GolubStanford UniversityDepartment of Computer [email protected]

Timothy A. DavisUniversity of FloridaComputer and Information Science and [email protected]

Sivasankaran RajamanickamDept. of Computer and Information Science andEngineeringUniv. of Florida, [email protected]

MS8

NVIDIA CUDA Software and GPU Parallel Computing Architecture

In the past, graphics processors were special purpose hardwired application accelerators, suitable only for conventional rasterization-style graphics applications. Modern GPUs are now fully programmable, massively parallel floating point processors. This talk will describe NVIDIA’s massively multithreaded computing architecture and CUDA software for GPU computing. The architecture is a scalable, highly parallel architecture that delivers high throughput for data-intensive processing. Although not truly general-purpose processors, GPUs can now be used for a wide variety of compute-intensive applications beyond graphics.

Michael [email protected]

MS8

Implementation of the Navier-Stokes Stanford University Solver (NSSUS) on a GPU

Current graphics processing units are capable of over 300 Gflops peak performance and this typically doubles every year. We have ported some of the capabilities of the Navier-Stokes Stanford University Solver (NSSUS), a multi-block structured code with a provably stable and accurate numerical discretization which uses a vertex-based finite-difference method and multigrid for convergence acceleration. Speed-ups of over 40x were demonstrated for simple test geometries and up to 20x for realistic geometries of engineering interest.

Patrick LeGresleyStanford University, now with [email protected]

Erich ElsenStanford [email protected]

Eric F. DarveStanford UniversityCenter for Turbulence [email protected]

MS8

Performance and Productivity of Graphics Processing Units for a Quantum Monte Carlo Application

The increased programmability and performance of Graphics Processing Units (GPUs) can have profound positive impact on developer productivity. In this talk, we discuss the acceleration of a Quantum Monte Carlo application using GPUs. Topics include the impact of GPU features on performance, tradeoffs in using a library approach versus a more informed hand optimized acceleration path, and the feasibility of combining these approaches.

Jeremy MeredithOak Ridge National [email protected]

MS8

Accelerating Molecular Modeling Applications with Graphics Processors

State-of-the-art graphics processing units (GPUs) can perform over 500 billion arithmetic operations per second, a powerful computational resource that can now be harnessed for use by scientific applications. We present an overview of recent advances in programmable GPUs, with an emphasis on their application to biomolecular modeling applications and the programming techniques required to obtain optimal performance in these cases. Performance and implementation details are presented for several applications. The calculations include runs on multiple GPUs.

John Stone, James C. Phillips, Peter Freddolino,Leonardo TrabucoBeckman InstituteUniv of Illinois at [email protected], [email protected],[email protected], [email protected]


David J. HardyTheoretical Biophysics Group, Beckman InstituteUniversity of Illinois at [email protected]

MS9

Communication Avoiding Algorithms for Linear Algebra: Motivation, Approach

We survey results to be presented in this minisymposium on designing numerical algorithms to minimize the largest cost component: communication. This could be bandwidth and latency costs between processors over a network, or between levels of a memory hierarchy; both costs are increasing exponentially compared to floating point. We describe novel algorithms in sparse and dense linear algebra, for both direct methods (like QR and LU) and iterative methods that can minimize communication.

James W. DemmelUniversity of CaliforniaDivision of Computer [email protected]

MS9

Communication Avoiding Gaussian Elimination

We present CALU, a Communication Avoiding LU factorization algorithm for dense matrices distributed in a two-dimensional (2D) cyclic layout. The new algorithm leads to an important decrease in the number of messages exchanged during the factorization, and thus it overcomes the latency bottleneck of the LU factorization as implemented in ScaLAPACK. We also discuss the stability of the pivoting strategy used in CALU and present performance results on several computational platforms.


James DemmelUC Berkeley, [email protected]


Hua XiangINRIA, [email protected]

MS9

AllReduce Algorithms: Application to Householder QR Factorization

QR factorizations of tall and skinny matrices with their data partitioned vertically across several processors arise in a wide range of applications. Various methods exist to perform the QR factorization of such matrices: Gram-Schmidt, Householder, or CholeskyQR. In this talk, we present the Allreduce Householder QR factorization. This method is stable and performs, in our experiments, from four to eight times faster than ScaLAPACK routines on tall and skinny matrices. The idea of Allreduce algorithms can be extended to 2D block-cyclic LU or QR factorization.


Jim DemmelDivision of Computer ScienceUniversity of California, [email protected]


Mark HoemmenUC Berkeley, [email protected]
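
One common way to organize a reduction-based QR of a tall and skinny matrix is sketched below: each row block is factored on its own, and the small triangular factors are then combined pairwise until a single R remains. The abstract above gives no implementation details, so this is an illustration under assumptions: the "processors" are just list entries in one process, only R is formed (Q is not), every block must have at least as many rows as columns, and the helper names `householder_r` and `tsqr_r` are hypothetical.

```python
import math


def householder_r(a):
    """Return the n-by-n upper-triangular factor R of a (rows >= cols)."""
    a = [row[:] for row in a]
    m, n = len(a), len(a[0])
    for k in range(n):
        norm = math.sqrt(sum(a[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue
        # Reflector v maps column k (from row k down) onto a multiple of e_k.
        alpha = -norm if a[k][k] >= 0 else norm
        v = [a[i][k] for i in range(k, m)]
        v[0] -= alpha
        vtv = sum(x * x for x in v)
        for j in range(k, n):
            dot = sum(v[i - k] * a[i][j] for i in range(k, m))
            scale = 2.0 * dot / vtv
            for i in range(k, m):
                a[i][j] -= scale * v[i - k]
    # Entries below the diagonal are zero up to rounding; return exact zeros.
    return [[a[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]


def tsqr_r(blocks):
    """Combine per-block R factors pairwise until one R remains."""
    rs = [householder_r(b) for b in blocks]
    while len(rs) > 1:
        merged = []
        for i in range(0, len(rs) - 1, 2):
            # Stack two n-by-n triangles and factor the 2n-by-n result.
            merged.append(householder_r(rs[i] + rs[i + 1]))
        if len(rs) % 2:
            merged.append(rs[-1])
        rs = merged
    return rs[0]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    for row in tsqr_r([a[0:2], a[2:4]]):
        print(row)
```

Each combination step touches only small n-by-n triangles, so in a distributed setting the number of messages grows with the depth of the reduction tree rather than with the number of columns; the R obtained this way can differ from a direct factorization by the signs of its rows.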

MS9

A Low Latency Approach for Parallel Sparse LU Factorization

We present a new scheme for computing the sparse LU factorization. Our goal is to decrease the number of messages, hence decreasing the time spent in communication. The reduction in the number of messages is obtained by using a heuristic pivoting strategy, which is shown by numerical experiments to be stable in practice. The parallel algorithm is based on a hypergraph reordering strategy, and an associated separator tree is used for distributing the data.


Hua XiangINRIA, [email protected]

MS10

Anton: A Special-purpose Machine for Molecular Dynamics Simulation

Anton is a massively parallel machine which should make practical millisecond-scale classical molecular dynamics (MD) simulations of proteins in explicit solvent. The machine, which is scheduled for completion by the end of 2008, is based on 512 identical MD-specific ASICs that interact in a tightly coupled manner using a specialized high-speed communication network. Anton has been designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation. The remainder of the simulation algorithm is executed by a programmable portion of each chip that achieves a substantial degree of parallelism while preserving the flexibility necessary to accommodate anticipated advances in physical models and simulation methods.

Ron O. Dror
D. E. Shaw [email protected]

David E. Shaw
D. E. Shaw Research
Columbia [email protected]

Martin M. Deneroff, Jeffrey S. Kuskin, Richard H. Larson, John K. Salmon, Cliff Young
D. E. Shaw [email protected], [email protected], [email protected], [email protected], [email protected]

MS10

Scaling Classical Molecular Dynamics to O(1) Atom per Node

We will describe some of the issues involved in scaling biomolecular simulations onto massively parallel machines as well as some of the science that we have been able to achieve using the Blue Matter molecular simulation application on Blue Gene/L. Our experiences in scaling to order one atom/node on BG/L should provide some insights into the challenges involved in scaling biomolecular simulations onto larger peta-scale platforms.

Blake G. Fitch, Christopher Ward, Michael C. Pitman, Robert S. Germain
IBM T.J. Watson Research [email protected], [email protected], [email protected], [email protected]

Aleksandr Rayshubskiy, Maria Eleftheriou
IBM Thomas J. Watson Research [email protected], [email protected]

MS10

Accelerating NAMD with Graphics Processors

Commodity graphics processors allow a single workstation to achieve teraflop performance on certain workloads. Working with the NVIDIA CUDA programming system, we have adapted our molecular dynamics code NAMD (www.ks.uiuc.edu/Research/namd/) to offload the most expensive calculations to graphics processors while maintaining its parallel capability (J. Comp. Chem., 28:2618-2640, 2007). This talk will present recent work and parallel performance results for CUDA-accelerated NAMD.

John E. Stone, James C. Phillips
Beckman Institute, U. Illinois at [email protected], [email protected]


MS10

Petascale Special-Purpose Computer for Molecular Dynamics Simulations: MDGRAPE-3 and Beyond

We have developed the MDGRAPE-3 system, a petaflops special-purpose computer for molecular dynamics simulations. The MDGRAPE-3 is a PC cluster equipped with accelerators comprising 4,778 ASICs that calculate nonbonded interactions between atoms. Currently, serial Amber-8 and in-house parallel MD software have been ported to the system. We will present the architecture and performance of the system as well as the next-generation project to develop a tile processor with special-purpose engines and performance above the TFLOPS level.

Tetsu Narumi
Genomic Sciences Center, RIKEN
and Keio [email protected]

Duraid Madeina
University of [email protected]

Makoto Taiji, Yosuke Ohno
Genomic Sciences Center, [email protected], [email protected]

Takashi Ikegami
The Graduate School of Arts and Sciences
University of [email protected]

MS11

Irregular Algorithms on the Cell Broadband Engine

The Sony-Toshiba-IBM Cell Broadband Engine is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this talk, we present a case study of list ranking, a fundamental kernel for graph problems, that illustrates the design and implementation of parallel combinatorial algorithms on Cell. List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation.
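For readers unfamiliar with the kernel, list ranking assigns to every node of a linked list its distance from the end of the list. The sketch below shows the textbook pointer-jumping formulation, emulated serially; it is not the SM-Threads technique of the talk, and the name `list_rank` and the convention that the tail points to itself are assumptions made for illustration. The data-dependent reads `nxt[nxt[i]]` are the irregular memory accesses that make the problem hard to parallelize.

```python
def list_rank(succ):
    """Rank each node by its distance to the tail.

    succ[i] is the index of the node after i; the tail points to itself.
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # Every round doubles the distance each pointer spans, so the loop
    # runs O(log n) times; each round is fully data-parallel over i.
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        # Build new arrays from the old ones so all nodes update "at once".
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank
```

For example, with `succ = [2, 0, 3, 3, 1]` the list order is 4, 1, 0, 2, 3, and node 4 receives rank 4 while the tail (node 3) receives rank 0.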

David A. Bader
Georgia Institute of [email protected]

MS11

Linear Algebra Algorithms on the IBM Cell

This talk describes the design concepts behind implementations of some linear algebra routines targeted for the Cell processor and multicore in general. It describes in detail the implementation of code to solve a linear system of equations using Gaussian elimination in single precision with iterative refinement of the solution to full double-precision accuracy. We will also look at the PlayStation 3 for use in scientific computations.
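The single-precision solve with double-precision refinement can be sketched as follows. This is a minimal pure-Python illustration, not the Cell implementation: single precision is emulated by rounding through `struct`, the names `f32`, `solve_single`, and `refine` are hypothetical, and the matrix is refactored on every pass for brevity, whereas a practical code would factor once and reuse the factors.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def solve_single(a, b):
    """Gaussian elimination with partial pivoting, rounding each result to single."""
    n = len(b)
    # Augmented matrix [A | b], stored in emulated single precision.
    m = [[f32(v) for v in row] + [f32(b[i])] for i, row in enumerate(a)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))   # pivot row
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = f32(m[i][k] / m[k][k])
            for j in range(k, n + 1):
                m[i][j] = f32(m[i][j] - f32(f * m[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):                           # back substitution
        s = m[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(m[i][j] * x[j]))
        x[i] = f32(s / m[i][i])
    return x

def refine(a, b, iters=5):
    """Iterative refinement: cheap single-precision solves, double residuals."""
    n = len(b)
    x = solve_single(a, b)
    for _ in range(iters):
        # Residual r = b - A x, accumulated in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = solve_single(a, r)                             # correction in single
        x = [x[i] + d[i] for i in range(n)]                # update in double
    return x
```

For a well-conditioned matrix, each pass shrinks the error by roughly the single-precision unit roundoff times the condition number, so a handful of passes is usually enough to reach double-precision accuracy.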

Jack J. Dongarra
Department of Computer Science
The University of [email protected]

MS11

The Implementation of FFTW on Cell

FFTW is a library for computing Fourier transforms of complex, real, and real-symmetric multi-dimensional sequences. In this talk, I describe the port of FFTW to the Cell Broadband Engine, which was completed at the beginning of 2007 by the IBM Austin Research Lab. The bulk of FFTW runs on the Cell PPE, treating the SPEs as accelerators. The SPEs execute a specialized program capable of executing one-dimensional DFTs of two-dimensional vectors. While the capabilities of the SPE program are restricted, FFTW can reduce an arbitrary multi-dimensional DFT to this restricted form, thus taking advantage of the SPEs in most cases.
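The reduction of a multi-dimensional transform to one-dimensional ones can be illustrated with the textbook row-column decomposition. The sketch below is not FFTW's planner or the SPE kernel; `dft1d` is a naive O(n^2) stand-in for an optimized one-dimensional routine, and both function names are hypothetical.

```python
import cmath

def dft1d(x):
    """Naive one-dimensional DFT (stand-in for an optimized kernel)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft2d(a):
    """Two-dimensional DFT built only from one-dimensional DFTs."""
    rows = [dft1d(r) for r in a]                     # transform along each row
    cols = [dft1d(list(c)) for c in zip(*rows)]      # transpose, transform columns
    return [list(r) for r in zip(*cols)]             # transpose back
```

The same pattern extends to higher dimensions by transforming along one axis at a time, which is why a kernel restricted to one-dimensional transforms can still serve most multi-dimensional problems.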

Matteo Frigo
Cilk [email protected]


MS11

Dealing with the Memory Bandwidth Bottleneck on the Cell Processor

The computational workload on the Cell processor is handled by co-processors called SPEs. They have small local stores, which makes it necessary to store the application data in main memory. In many applications, especially those requiring O(n) computational effort, the bandwidth to main memory limits application performance. We will discuss the effectiveness of data compression in dealing with this limitation in a few important applications, such as matrix-vector multiplication.

Ashok Srinivasan
Department of Computer Science,
Florida State [email protected]

Gunaranjan Gunaranjan, T Nagaraju, Ramprasad Ramprasad, T.V. Sivakumar
Sri Sathya Sai [email protected], [email protected], [email protected], [email protected]

MS12

Fluid Dynamics Simulations on Massively Parallel Computers

Achieving the goal of reliable flow simulations for realistic problems requires methods that are extensible to levels of parallelism that scale on hundreds of thousands of processors and that can attain petaflop performance. We present a framework to perform massively parallel simulations where work is partitioned into balanced parts with well-controlled communications. We demonstrate scalability on 30,000 processors on an IBM BlueGene/L for the case of blood flow in a patient-specific arterial system.

Min Zhou
Rensselaer Polytechnic [email protected]

Ken Jansen
Rensselaer Polytechnic [email protected]

Onkar Sahni
Rensselaer Polytechnic
Scientific Computation Research [email protected]

Mark S. Shephard
Rensselaer Polytechnic Institute
Scientific Computation Research [email protected]

MS12

Load Balancing in FronTier/AMR Computation

Computation of fluid physics with dynamically evolving fronts embedded in the PDE solution domain poses a great challenge for load balancing in parallel computing, particularly when adaptive mesh refinement is used. In this presentation, we present our adjusted rectangular domain decomposition algorithm and AMR patch redistribution for computation on a parallel platform with a large number of processors. These algorithms enhance the parallel efficiency and scaling substantially. Our implementation conforms to the ITAPS interoperability requirements.

Ryan Kaufman, Brian Fix
SUNY at Stony [email protected], [email protected]

Xiaolin Li
Department of Applied Math and Stat
SUNY at Stony [email protected]

MS12

Sets and Tags in the ITAPS Data Model

The data model used in the ITAPS iMesh and iGeom interfaces includes sets (an arbitrary collection of entities and other sets) and tags (application-defined data assigned to entities, sets, and the interface itself). The combination of sets and tags is a powerful mechanism for embedding data with a variety of sources and semantics in the ITAPS interfaces. However, in practice this same variety makes it difficult to find data through such an abstract interface. This issue is discussed in the context of parallel scientific computing for both geometry and mesh data.

Tim Tautges
Argonne National [email protected]

Mark Miller
Lawrence Livermore National [email protected]

MS12

Parallel, Scalable Unstructured Mesh Generation and Computation Physics Tools on Petascale Computing Architectures

As a scientific mesh-based modeling community, we are making steady progress toward petascale computing hardware. The computing hardware that we will be dealing with over the next several years is going to get more complex for mesh-based algorithms in terms of multi/many-core processors, hierarchical memories, and distributed I/O, because the relationships between CPU speed, memory bandwidth, memory latency, and communication latency are going to change dramatically. We need to make sure that our modeling and simulation tools keep pace with these hardware developments, such that the algorithms are scalable to hundreds of thousands of processors and data is partitioned properly with respect to memory hierarchies. This presentation will describe our efforts to maintain parallel, scalable, efficient software tools and technology for mesh generation and continuum/discrete simulation on advanced computing architectures by making use of interoperable tools such as the ITAPS mesh/field interfaces; Zoltan data/graph partitioning tools; and complex mesh generation and solvers, like NWGrid/NWPhys and FronTier, on applications such as multiscale subsurface transport in complex geometries.

Yilin Fang, Harold E. Trease
Pacific Northwest National [email protected], [email protected]

Bruce Palmer
Pacific Northwest National Lab
[email protected]

MS13

Challenges and Achievements in Computational Electromagnetics

Under consideration are problems in the vicinity of existing and future high-current and high-brightness particle accelerators, such as high-power proton drivers and 4th-generation light sources, i.e., x-ray free electron lasers. One can distinguish between two broad classes of problems in this field: real or complex eigenvalue problems in the context of cavity designs, and relativistic particle tracking in 3D time-dependent electromagnetic fields from Maxwell's equations. Here the source terms, i.e., the time-dependent charge distribution, must be explicitly modeled with high accuracy. Also of great importance is the efficient handling of datasets in the terabyte range; our HDF5-based approach will be outlined. Our workhorses are a massively parallel particle-in-cell code and a finite-element-based eigenmode solver. I will talk about our implementations and show some results. Ongoing projects are time-dependent hp-finite-element-based particle codes; here I will sketch our ideas.

Andreas Adelmann
Paul Scherrer [email protected]

MS13

Parallel Particle-In-Cell (PIC) Simulation on Hybrid Meshes

Particle-In-Cell (PIC) codes have become an essential tool for the numerical simulation of many physical phenomena involving charged particles, in particular beam physics and space and laboratory plasmas, including fusion plasmas. Genuinely kinetic phenomena can be modeled by the Vlasov-Maxwell equations, which are discretized by a PIC method coupled to a Maxwell field solver. Today's and future massively parallel supercomputers make it possible to envision the simulation of realistic problems involving complex geometries and multiple scales. To achieve this efficiently, we propose to couple a Finite Element Maxwell solver with particles on hybrid grids with several homogeneous zones, each having its own structured or unstructured mesh type and size. This allows, in particular, fast particle tracking in zones having a structured mesh, but requires a careful analysis of load balancing issues for efficient parallelization. Our latest progress towards this goal will be presented.

Latu Guillaume
Strasbourg [email protected]

MS13

Parallel Smoothed Aggregation Multigrid for Large Scale Electromagnetic Simulations

We present a new AMG preconditioner for linear systems arising from edge element discretization of the eddy current equations. The linear system is implicitly transformed into a 2 × 2 block system whose diagonal blocks are an edge Hodge Laplacian and a nodal scalar Laplacian, respectively. Solving the edge Hodge Laplacian involves matrix-free smoothing and a specialized restriction to a coarse nodal problem. We present three-dimensional computational results on twenty thousand Cray XT3 processors.

Pavel Bochev
Sandia National Laboratories
Computational Math and [email protected]

Ray S. Tuminaro
Sandia National Laboratories
Computational Mathematics and [email protected]

Jonathan J. Hu
Sandia National Laboratories
Livermore, CA [email protected]

Chris Siefert
Sandia National [email protected]

MS13

Parallel Auxiliary Space AMG for Maxwell Problems

In this talk we will discuss the implementation and performance of an auxiliary space based algebraic solver for definite Maxwell problems, discretized with edge elements. The algorithm is based on a recent theoretical result by Hiptmair and Xu, and utilizes two internal Algebraic Multigrid (AMG) V-cycles: one for a scalar and one for a vector Poisson-like matrix. The parallel scalability of this approach is directly tied to the AMG performance on Poisson problems.

Tzanio V. Kolev, Panayot S. Vassilevski
Center for Applied Scientific Computing
Lawrence Livermore National [email protected], [email protected]

MS14

Scalability Infrastructure for the Lustre File System

This paper describes low-level infrastructure in the Lustre file system that addresses scalability in very large clusters. The features deal with I/O and networking, lock management, recovery and failure, and other scalability-related issues.

Peter Braam
Cluster File [email protected]

MS14

High End Computing File Systems and I/O (HEC FSIO): Coordinating the US Government Research Investments

The High End Computing Interagency Working Group (HEC IWG) is chartered with coordinating US Government investments in Research and Development (R&D) for HEC. The HEC FSIO Technical Advisory Group (TAG) is chartered with providing guidance to the HEC IWG in the area of File Systems and I/O (FSIO). The HEC FSIO research needs and priorities will be discussed. Also, the current portfolio of 28 research projects will be reviewed. Additionally, the future direction for the HEC FSIO area will be outlined, including programs for taking research outcomes into products and a new round of government-sponsored research to continue to feed the R&D pipeline in this area.

Gary Grider
Los Alamos National [email protected]

MS14

I/O Architectures for Petascale Computing

Production high-performance storage systems today are typically constructed from enterprise storage hardware, with a parallel file system such as GPFS, Lustre, or PVFS tying this hardware into a coherent whole. As we move into the petascale regime, constructing storage systems in this way is becoming problematic. In this talk we will discuss some of the challenges in storage at petascale, particularly in reliability and performance, and examine hardware and software options that can help us construct effective storage systems at this extreme scale.

Rob Ross
Argonne National [email protected]

MS14

Structured Streams: Data Services for Petascale Science Environments

The challenge of meeting the I/O needs of petascale applications is exacerbated by an emerging class of data-intensive HPC applications that require annotation, reorganization, or even conversion of their data. We introduce an end-to-end approach to meeting these requirements. The Structured Streaming Data System (SSDS) enables high-performance data movement or manipulation between the compute and service nodes of the petascale machine and between/on service nodes and ancillary machines. This talk describes the SSDS architecture, motivating its design decisions and intended application uses. Performance claims are supported with experiments benchmarking the underlying software layers of SSDS, as well as application-specific usage scenarios.

Karsten Schwan
College of Computing
Georgia Institute of [email protected]

MS15

An Empirical Investigation of Generating Parallel Quasirandom Sequences by Using Different Scrambling Methods

Quasi-Monte Carlo (QMC) methods are now widely used in scientific computation. The use of randomized QMC methods, where randomness can be brought to bear on quasirandom sequences through scrambling and other related randomization techniques, broadens the applications of QMC. Scrambling QMC offers a natural way to generate parallel sequences. QMC applications have high degrees of parallelism, can tolerate large latencies, and usually require considerable computational effort, making them extremely well suited for grid computing. Parallel computations using QMC require a source of quasirandom sequences, which are distributed among the individual parallel processes. However, the integration variance can depend strongly on the scrambling methods. Much of the work dealing with scrambling methods has focused on linear scrambling methods. In this paper, we take a close look at the quadratic scrambling method for Halton sequences in generating parallel sequences.
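To make the notion of a scrambled Halton sequence concrete, the sketch below generates Halton points with a simple linear digit scrambling, d -> (w * d) mod b. It is an illustration only: the quadratic scrambling studied in the paper is not specified in this abstract and is not reproduced here, and the names `radical_inverse` and `halton_point`, as well as the idea of giving each parallel process its own multipliers, are assumptions. Each multiplier must be nonzero modulo its (prime) base so that the digit map is a permutation.

```python
def radical_inverse(index, base, scramble=lambda d, j: d):
    """Radical inverse of `index` in `base`, with an optional digit map.

    scramble(d, j) maps digit d at position j (0 = least significant)
    to a new digit; it should map 0 to 0 so trailing zeros stay zero.
    """
    value, scale, j = 0.0, 1.0 / base, 0
    while index > 0:
        index, digit = divmod(index, base)
        value += scramble(digit, j) * scale
        scale /= base
        j += 1
    return value

def halton_point(index, bases, multipliers):
    """One point of a linearly scrambled Halton sequence."""
    return [
        radical_inverse(index, b, lambda d, j, w=w, b=b: (w * d) % b)
        for b, w in zip(bases, multipliers)
    ]
```

With all multipliers equal to 1 this reduces to the unscrambled Halton sequence; different multiplier sets yield different sequences over the same bases, which is one way separate streams could be handed to separate processes.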

Hongmei Chi
Computer Science
Florida A&M [email protected]

MS15

Estimation of Migration Rates and Effective Population Numbers by Using Importance Sampling

Now that coalescence theory is widely used to explore diversity among populations and species in population genetics (phylogenetics), computing the likelihood or posterior distribution of population genetics parameters is a common task in computational biology. The numerical results of these approaches can be obtained by Monte Carlo simulations. This paper focuses on exploring the use of uniform random sequences, more specifically, completely uniformly distributed sequences, to calculate the likelihood with the help of importance sampling. We demonstrate by examples that quasi-Monte Carlo can be a viable alternative to the Monte Carlo methods in population genetics. Analysis of a simple one-population problem in this paper showed that quasi-Monte Carlo methods achieve parameter estimates as good as or better than those of standard Monte Carlo, but have the potential to converge faster and so reduce the computational burden.
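The combination of importance sampling with a deterministic uniform sequence can be shown on a toy integral. The sketch below is not the population-genetics likelihood of the paper: it estimates a rare tail probability P(X > 10) for X ~ Exp(1), whose exact value is exp(-10), by drawing from a heavier-tailed proposal Exp(0.1) through the inverse CDF and driving the draws with a base-2 van der Corput sequence. The function names and the choice of target and proposal are assumptions made for illustration.

```python
import math

def van_der_corput(i, base=2):
    """i-th point of the van der Corput sequence in [0, 1)."""
    value, scale = 0.0, 1.0 / base
    while i > 0:
        i, digit = divmod(i, base)
        value += digit * scale
        scale /= base
    return value

def tail_probability(points, threshold=10.0, proposal_rate=0.1):
    """Importance-sampling estimate of P(X > threshold) for X ~ Exp(1)."""
    total, n = 0.0, 0
    for u in points:
        # Inverse-CDF draw from the proposal density Exp(proposal_rate).
        x = -math.log(1.0 - u) / proposal_rate
        if x > threshold:
            # Weight = target density / proposal density at x.
            total += math.exp(-x) / (proposal_rate * math.exp(-proposal_rate * x))
        n += 1
    return total / n
```

Calling `tail_probability(van_der_corput(i) for i in range(1, 4097))` gives an estimate close to exp(-10); replacing the generator with pseudo-random uniforms turns the same code into ordinary Monte Carlo importance sampling, which makes the two easy to compare.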

Peter Beerli
Florida State University
School of Computational [email protected]

Hongmei Chi
Computer Science
Florida A&M [email protected]

MS15

Hybrid Parallel Tempering and Simulated Annealing Method in Rosetta Practice

We applied our recently developed hybrid Parallel Tempering (PT)/Simulated Annealing (SA) method to the Rosetta program. The hybrid PT/SA method is an effective global optimization algorithm that overcomes the slow convergence in low-temperature protein simulation by initiating multiple systems to run at multiple slowly decreasing temperature levels (SA scheme) and randomly switch with neighbor temperature levels (PT scheme). The PT scheme can significantly enhance the relaxation rate in the SA search. With hybrid PT/SA's fast barrier-crossing capability, we expect to achieve resolution improvement compared to the original Rosetta program. Our preliminary results show that the Rosetta fragment assembly implementation using the hybrid PT/SA method has a broader exploration of the protein folding scoring function landscape and exhibits a 0.2-4.0 Å shift toward the native structure in most of the Rosetta benchmark proteins. Our analysis and computational re
