34 PP08 Abstracts
CP1
Efficient Parallel Simulation for Stochastic Simulation of
Biochemical Systems on the Graphics Processing Unit
The small populations of some reactant species in biological
systems formed by living cells can result in inherent randomness
that cannot be captured by traditional deterministic (ordinary
differential equation) simulation. A more accurate simulation can
be obtained by using the Stochastic Simulation Algorithm (SSA).
Many stochastic realizations are required to obtain accurate
probability density functions. This carries a very high
computational cost. The current generation of general-purpose
graphics processing units (GPUs) is well-suited to this task.
Computational experiments illustrate the power of this technology
for this important and challenging class of problems.
Hong Li
Department of Computer Science
University of California, Santa [email protected]
Linda [email protected]
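The Stochastic Simulation Algorithm named in the abstract above can be illustrated with a minimal serial sketch in Python. This is not the authors' GPU implementation: the single decay reaction X -> 0, the rate constant `k`, and the function names `gillespie_decay` and `mean_final_count` are assumptions chosen only for illustration. Each call to `gillespie_decay` produces one stochastic realization, and the realizations are independent of one another, which is why many of them can be computed side by side.

```python
import math
import random


def gillespie_decay(x0, k, t_end, rng):
    """One SSA (direct method) realization of the reaction X -> 0.

    x0: initial molecule count, k: rate constant (hypothetical values),
    t_end: final time, rng: a random.Random instance.
    Returns the reaction times and the molecule count after each event.
    """
    t, x = 0.0, x0
    times, counts = [t], [x]
    while x > 0:
        propensity = k * x  # total propensity of the only reaction
        # The waiting time to the next reaction is exponentially distributed.
        t += -math.log(1.0 - rng.random()) / propensity
        if t > t_end:
            break
        x -= 1  # fire the reaction: one molecule of X is consumed
        times.append(t)
        counts.append(x)
    return times, counts


def mean_final_count(x0, k, t_end, n_runs, seed=0):
    """Average the final molecule count over n_runs independent realizations."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        _, counts = gillespie_decay(x0, k, t_end, rng)
        total += counts[-1]
    return total / n_runs
```

For this simple reaction the ensemble mean should approach the deterministic value x0 * exp(-k * t_end) as the number of realizations grows, while individual realizations fluctuate around it.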
CP1
From HPC to Grid Computing: CSE Scenarios with GridSFEA
With the simulation framework GridSFEA, we intend to provide the
HPC application developer and user with a complementary instrument
for working in the grid. GridSFEA services, application library,
and wrappers simplify tasks such as parameter studies and
long-running simulations. This is important, since the HPC
community lacks well-suited and usable grid tools and thus remains
reluctant to exploit computing grids. We demonstrate several
successful scenarios such as molecular dynamics and traffic
simulations in the grid.
Ioan L. Muntean, Hans-Joachim Bungartz, Martin Buchholz, Michael
Moltenbrey, Ekaterina Elts, Dirk Pflüger
Technische Universität München, Department of Informatics
Chair of Scientific Computing in Computer [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected]
Ralf-Peter Mundani
TUM, Faculty of Civil Engineering and Geodesie
Chair for Computational Civil and [email protected]
CP1
Plasma Turbulence Computation and Visualization on the GPU
We present techniques to accelerate plasma turbulence simulations
on modern Graphics Processing Units (GPUs). We contrast and
compare the performance of two classes of methods: spectral
methods used in MHD models and particle methods used in
gyrokinetic models. We demonstrate a prototype of a scalable
computational steering framework based on a tight coupling of
plasma turbulence simulation and visualization on the GPU.
George Stantchev
University of [email protected]
CP1
Parallel Numerical Methods for Solving Nonlinear Evolution
Equations that Model Optical Fiber Communication Systems
Nonlinear evolution equations of the nonlinear Schrödinger (NLS)
type are of tremendous interest in both theory and applications.
Various regimes of pulse propagation in optical fibers are modeled
by some form of the NLS and coupled NLS (CNLS) equations. In this
talk we introduce parallel algorithms for numerical simulations of
these equations. The parallel methods are implemented on the IBM
p655 multiprocessor computer. Our numerical experiments have shown
that the methods give accurate results and considerable speedup.
Thiab R. Taha
Professor at the University of [email protected]
CP1
Efficient Parallel Algorithm for 2D2V Vlasov Equation with High
Order Spectral Element Method
For decades kinetic space plasma simulation has been dominated by
PIC (Particle-In-Cell) codes. Because of the inherent noise of PIC,
solving the Vlasov equation directly becomes more promising. In
this work, we are developing a tera-scalable 2D2V Vlasov solver
using a high order spectral element method. For this
four-dimensional problem, an efficient parallel algorithm is
necessary due to memory and speed requirements. The results have
been verified against PIC codes.
Jin Xu
Physics Division
Argonne National Lab.
jin [email protected]
CP1
Using Multi-Core, Multi-CPU, PC Clusters for Statistical 3-D
Virus Reconstructions from Cryo Electron Microscopy Images
Cryo electron microscopy images of viruses provide what is roughly
projection data at unknown projection angles, with signal-to-noise
ratios less than 1/2. Statistical approaches to 3-D reconstruction
(e.g., Doerschuk and Johnson, IEEE Transactions on Information
Theory, 2000) require extensive computation. A system to perform
such computations on a PC cluster with multi-CPU and multi-core
nodes using C, MPI, and OpenMP is described, including experiments
with different algorithms and different mixtures of parallelism
methodologies.
John Johnson
The Scripps Research Institute
Department of Molecular [email protected]
Yili Zheng
Electrical and Computer Engineering
Purdue [email protected]
Peter C. Doerschuk
Cornell University
Biomedical Engineering and Electrical and [email protected]
CP2
Electronic Structure Calculations for Large Nanosystems on
Parallel Computers
Electronic structure calculations based on the density functional
theory approach have become one of the biggest consumers of cycles
on high performance computers around the world. In this talk I
will discuss this approach, as used in nanoscience applications on
high performance computers, as well as new methods that go beyond
density functional theory and allow us to simulate much larger
systems with first principles accuracy. Performance of new
parallel solvers for these methods on high performance parallel
computers such as the IBM BGL, Cray XT4, NEC Earth Simulator and
PC clusters will also be discussed. I will discuss some
applications of these methods to nanosystems and detector
materials. Work done in collaboration with L-W Wang, O. Marques,
S. Tomov and C. Voemel.
Andrew M. Canning
Lawrence Berkeley National [email protected]
CP2
Parallelized Topography Simulation for Electronic Device
Manufacturing
Three-dimensional topography simulation for electronic device
manufacturing is demanding in terms of computational resources. We
developed a parallelized method using a Monte Carlo algorithm for
flux calculation and level-set algorithms for front tracking. The
computational complexity scales optimally with surface size. The
flux calculation is parallelized using MPI and OpenMP.
Communication is minimized by storing the compressed structure
geometry on each node, which makes our method applicable to
inexpensive infrastructure consisting of conventional PCs with a
LAN connection.
Otmar Ertl, Johann Cervenka, Siegfried Selberherr
Vienna University of Technology
Institute for [email protected], [email protected],
[email protected]
CP2
Parallel Block-Oriented Preconditioners for FEM Modeling of
Semiconductor Devices
Various approximate block factorization and physics-based
preconditioners are applied to the drift-diffusion equations for
modeling semiconductor devices. The resulting scalar subsystems
are solved by various iterative methods including AMG-type
techniques. We employ a stabilized finite element discretization
of the drift-diffusion equations on unstructured meshes. The
nonlinear coupled system is solved with a parallel preconditioned
Newton-Krylov method. Preliminary results will be presented
demonstrating the performance of the block-oriented
preconditioners compared with one-level and multilevel
preconditioners. This work was partially funded by the DOE NNSA’s
ASC Program and the DOE Office of Science AMR Program, and was
carried out at Sandia National Laboratories operated for the U.S.
Department of Energy under contract no. DE-AC04-94AL85000.
Paul Lin
Sandia National [email protected]
Gary Hennigan
Sandia National [email protected]
Robert J. Hoekstra
Sandia National [email protected]
John Shadid
Sandia National Laboratories
Albuquerque, [email protected]
CP2
A Massively Parallel Schroedinger Solver for Nano-Electronics
NEMO3D is a scalable simulator for nano-electronic devices such as
quantum dots that uses a quantum mechanical description of the
device. A key computational challenge is to compute degenerate
eigenstates in the interior of the spectrum for a very large
Hamiltonian matrix, with up to 10^9 degrees of freedom. We compare
several parallel eigensolvers and present detailed performance
results that demonstrate the petascale potential of NEMO3D on
state of the art parallel platforms.
Maxim Naumov
Purdue University
Department of Computer [email protected]
Faisal Saied, Hansang Bae, Steve Clark, Ben Haley, Gerhard
Klimeck, Sunhee Lee
Purdue [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected]
CP2
Parallel Methods for Electronic Transport Through Nanoscale
Devices
Nanoelectronics is a fast developing field. Therefore
understanding of electronic transport at the nanoscale is
currently of great interest. In this talk, we present new parallel
algorithms to calculate the electronic transport through large
nanoscale devices consisting of many thousands of atoms. Applying
the semi-empirical Extended Hückel Theory to model nanowire
junctions, we compare the parallel performance of our recently
developed direct and iterative approaches, the latter of which
uses preconditioned Krylov subspace techniques.
Hans Henrik B Sørensen
Informatics and Mathematical Modelling
Technical University of [email protected]
Martin van Gijzen
Numerical Analysis Group
Delft University of [email protected]
Dan Erik Petersen, Stig Skelboe
Department of Computer Science
University of [email protected], [email protected]
Kurt Stokbro
NBI and Department of Computer Science
University of [email protected]
Per Christian Hansen
Technical University of Denmark
Informatics and Mathematical [email protected]
CP3
Cortically-Inspired Parallel Processing
Symbolic logic and serial computations have proven inadequate for
modeling human cognition and solving hard cognitive problems. At
the other extreme, the recent use of a Blue Gene supercomputer for
the molecular simulation of a single cortical column, albeit a
remarkable leap for molecular neuroscience, is still very far from
explaining cognition. I will present an implementation of a
massively parallel, highly scalable, heteroassociative network of
attractor networks that abstracts the functionality of millions of
cortical columns and explains phenomena of visual cognition. This
is a paradigm of how brain-like computing might look in the
future.
Socrates Dimitriadis
Brown University
Department of Cognitive and Linguistic [email protected]
CP3
UC-geoWave: A Stereo-Distributed-Parallel Application for
Seismic Modeling in Oil Exploration
This paper describes a stereo-distributed-parallel application,
called UC-geoWave, for the simulation of two-dimensional acoustic
wave propagation in heterogeneous media. The application has three
modules: pre-processing, processing and post-processing. Using the
pre-processing module, the user can create synthetic terrain in 3D
and then view it stereoscopically. In the processing module, the
application computes a parallel reverse time migration (RTM) of a
seismic data set to obtain a depth image. This migration technique
is based on the solution of the acoustic wave equation in 2D using
a finite difference scheme. This phase is executed on a cluster
using MPI. In the post-processing module, the user can load the
resulting data and view the graphics. The application can be
accessed over the internet using any commercial browser (Internet
Explorer, Netscape, etc.) and runs on different operating systems
(Windows, Mac OS X, Linux and Solaris).
Juan Medina
Universidad de [email protected]
German A. Larrazabal
University of [email protected]
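The finite difference solution of the 2D acoustic wave equation mentioned in the abstract above can be sketched as a single explicit time step. This is only an illustrative serial kernel under stated assumptions, not the UC-geoWave code and not a full reverse time migration: the function name `wave_step`, the second-order five-point stencil, the zero boundary values, and the per-cell velocity grid `vel` are all choices made here for the example.

```python
def wave_step(u_prev, u_curr, vel, dt, h):
    """Advance u_tt = c^2 (u_xx + u_yy) by one time step.

    u_prev, u_curr: pressure fields at the two previous time levels
                    (lists of rows), vel: wave speed per grid cell,
                    dt: time step, h: grid spacing.
    Boundary values of the returned field are held at zero.
    For constant speed c, this explicit scheme is stable only if
    c * dt / h <= 1 / sqrt(2).
    """
    n_rows, n_cols = len(u_curr), len(u_curr[0])
    u_next = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            # Five-point approximation of the Laplacian (times h^2).
            lap = (u_curr[i + 1][j] + u_curr[i - 1][j]
                   + u_curr[i][j + 1] + u_curr[i][j - 1]
                   - 4.0 * u_curr[i][j])
            r2 = (vel[i][j] * dt / h) ** 2
            # Centered second difference in time.
            u_next[i][j] = 2.0 * u_curr[i][j] - u_prev[i][j] + r2 * lap
    return u_next
```

Each interior point depends only on its four neighbours at the previous time level, so in a distributed setting a row or block decomposition needs to exchange just one layer of boundary values per step.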
CP3
Ratio-Based Parallel Time Integration (RaPTI) for Satellite
Trajectories
Chartier and Philippe (1993) and Erhel and Rault (2000) introduced
parallel time integration for satellite trajectories. We apply a
version of the RaPTI algorithm to solve this problem. RaPTI is a
predictor-corrector scheme based on automatic generation of time
slices (Nassif et al. (2005)) at the end of which solution values
exhibit a “ratio phenomenon”. Such an approach leads to the
parallel time integration schemes used in the authors’ previous
works (2006-2007). The present paper extends RaPTI to a
J2-perturbed satellite trajectory.
Nabil R. Nassif
Mathematics Department
American University of [email protected]
Jocelyne Erhel, Noha Makhoul-Karam
IRISA, UNIVERSITE DE [email protected], [email protected]
Yeran Soukiassian
Mathematics Department
American University of [email protected]
CP3
A Scalable Parallel Classification Algorithm for Remote
Sensing
Previous work on the parallel IGSCR (iterative guided spectral
class rejection) classification algorithm for remote sensing
resulted in good speedup through 64 processors; however, speedup
began deteriorating beyond 16 processors. This work will tackle
scalability issues with parallel clustering that will be essential
in a scalable distributed memory version of IGSCR. Addressing
these issues involves clustering schemes that are more amenable to
a parallel environment and load balancing methodologies that
increase overall parallel efficiency.
Layne T. Watson
Virginia Polytechnic Institute and State University
Departments of Computer Science and [email protected]
Rhonda D. Phillips
Virginia Polytechnic Institute and State [email protected]
Randolph H. Wynne
Virginia Polytechnic Institute and State University
Department of [email protected]
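The terms speedup and parallel efficiency used in the abstract above follow the usual definitions, which the short sketch below computes. The function name `speedup_and_efficiency` and the timings in the usage lines are hypothetical and are not measurements of IGSCR.

```python
def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    """Return (speedup, efficiency) for a run on n_procs processors.

    speedup    = serial time / parallel time
    efficiency = speedup / number of processors
    """
    speedup = t_serial / t_parallel
    return speedup, speedup / n_procs


# Hypothetical timings in seconds, for illustration only.
hypothetical_runs = {16: 10.0, 64: 5.0}
for procs, seconds in hypothetical_runs.items():
    s, e = speedup_and_efficiency(100.0, seconds, procs)
    print(procs, s, e)
```

Speedup can keep growing while efficiency falls, which is one way to read a report of continued speedup alongside deteriorating scalability.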
CP3
Parallel Implementation of Data Mining Algorithms
This paper discusses the parallel or distributed implementation of
key data mining algorithms in the areas of collaborative filtering
and latent semantic analysis. Both optimization problems are
elegantly captured by matrix representations which are usually
sparse and of very large dimension. We discuss the implementation
of these algorithms on architectures of different levels of
granularity, such as dedicated highly coupled parallel processors,
loosely coupled parallel processors and a distributed platform.
Yosef G. Tirat-Gefen
Castel Research Inc. and George Mason [email protected]
CP4
Higher-Level Abstractions and Patterns for Designing
Data-Parallel Applications
The level of abstraction provided by the Message Passing Interface
(MPI) is too low, and enormous amounts of time and effort are
spent refactoring existing code to include MPI primitives. This
presentation describes higher-level abstractions and design
patterns to encapsulate data distribution, communication, and load
balancing. Several applications developed using a single pattern
will be presented, and performance comparisons with hand-written
versions of the applications will also be provided.
Purushotham Bangalore
Univ. of Alabama at Birmingham
Dept. of Computer and Information [email protected]
CP4
Parallel IO and Data Management for Data Structures in
Applications
We have developed a middle layer between parallel file
systems/MPI-IO and applications, through which applications are
able to efficiently use the most efficient part of MPI-IO and
parallel file systems for millions of distributed un-aligned small
datasets. For a variety of data structures in applications, such
as unstructured meshes and variables, the middle layer provides
sustainable, interoperable, efficient, scalable, and convenient
tools for parallel IO and data management. The IO performance of
the middle layer for high-level data structures in applications is
almost the same as the performance of MPI-IO for large datasets.
The IO performance of either collective or non-collective calls
for millions of distributed non-aligned small datasets is
comparable to the performance of MPI-IO for large datasets.
William W. Dai
Los Alamos National [email protected]
CP4
Using GPUs From High-level Programming Languages
This talk describes a high-productivity development model for
general purpose computing on Graphics Processing Units (GPUs).
This is accomplished by exposing the capabilities of NVIDIA’s CUDA
architecture to high-level programming languages, such as Python,
IDL, MATLAB and Java. In this talk we describe an array-based
programming interface that hides the details of CUDA and the GPU
from the user, allowing them to perform GPU accelerated
computations with little effort. The result is a development model
that makes the high performance of GPUs accessible to end users
and working scientists. This talk will focus on the Python
implementation of the interface, but will also briefly cover IDL,
MATLAB and Java versions. We acknowledge Nathaniel Sizemore and
Dave Wade-Stein for help with the build system for this project.
Dan Karipides, Paul Mullowney, Michael Galloy, Peter Messmer,
Brian E. Granger
Tech-X [email protected],
[email protected], [email protected],
[email protected], [email protected]
CP4
Grids and Clusters with Multi-Core Nodes: A Genetics
Application Perspective
The introduction of multicore processors implies that algorithms
which are parallelized at an outer, coarse grain level should
possibly be revisited to examine if multithreading should also be
used at an inner, fine grain level. In this paper we discuss
parallel versions of the tightly coupled global optimization
algorithm DIRECT. We examine how both coarse grained and fine
grained parallelism can be exploited using a hybrid programming
model. We show that excellent performance can be achieved when
using the hybrid algorithm on loosely-coupled systems like
clusters and grids with multicore nodes.
Henrik Löf
Stanford University
Department of Energy Resources [email protected]
Mahen Jayawardena
Uppsala University
Department of Scientific [email protected]
Sverker Holmgren
Uppsala University
Department of Scientific [email protected]
CP4
Group Locality Based Performance Analysis of Triplet Architecture:
A Static Direct Interconnection Network for Multi-Processor
(MP-SoC)
We propose a new criterion in performance evaluation based on the
concept of group locality in interconnection networks, the lower
layer complete connect, i.e., how completely a node in a subset of
processing nodes is connected to its neighbors. Triplet Based
Architecture (TriBA), a new idea in MP-SoC architectures, is
compared with three static interconnection networks in terms of
three orthogonal entities: physical (chip area, dissipation),
computational speed (message delay) and cost (chip yield, layout
cost).
Haroon-Ur-Rashi Khan, Shi Feng, Ji Wei Xing
School of Computer Science and Technology
Beijing Institute of [email protected],
[email protected], [email protected]
Kamran Kamran
Department of Electrical Engineering
University of Engg. & Tech., Lahore, Pakistan
[email protected]
CP4
A Data-Distributed Massively Parallel Design of DIRECT
A data-distributed massively parallel implementation is developed
for the optimization algorithm DIRECT, favored for its
deterministic nature and global convergence property. Sharing data
across multiple machines reduces the local memory burden.
Multilevel parallelism boosts the concurrency and mitigates the
data dependency, thus improving the load balancing and
scalability. Also, user-level checkpointing is integrated as a
fault-tolerance feature. The design was evaluated on large-scale
systems using benchmark functions and real-world applications.
Rhonda D. Phillips
Virginia Polytechnic Institute and State [email protected]
Layne T. Watson
Virginia Polytechnic Institute and State University
Departments of Computer Science and [email protected]
Jian He
Department of Computer Science
Virginia [email protected]
Masha Sosonkina
Ames Laboratory/DOE
Iowa State [email protected]
CP5
Load Distribution in MADNESS
Load balancing is vital to the efficiency of MADNESS
(Multiresolution Adaptive Numerical Environment for Scientific
Simulation), an environment for prototyping and developing
scientific applications being developed to run on leadership
computing resources. We propose the melding algorithm to load
balance the computational work in MADNESS. In this presentation,
we describe the method, discuss its theoretical advantages over
alternative load balancing techniques for this problem, and
present preliminary results from runs on leadership computing
resources.
Rebecca J. Hartman-Baker, George Fann
Oak Ridge National [email protected], [email protected]
Robert Harrison
University of Tennessee
Oak Ridge National [email protected]
CP5
A Benchmark Study of Compiler Performance for Sparse Kernels on
Multicore Processors
Obtaining optimal performance for scientific applications on
modern computer architectures continues to be a challenge. This
study presents an empirical comparison of the impact of hardware
architecture, compilation options, data structure and coding
technique on algorithm performance for a small set of
representative mathematical kernels, including sparse
matrix-vector products, on a set of multicore processor-based HPC
platforms. Numerical results are presented, and implications for
the optimization of numerical software codes are considered.
Wayne Joubert
U.S. Army Engineer Research and Development Center (ERDC)
Major Shared Resource Center (MSRC)[email protected]
CP5
Database Components for Support of Computational Quality of
Service for Scientific CCA Applications
While component-based design has proven helpful in managing the
complexity of parallel scientific simulations, many challenges
remain in selecting and configuring components during runtime to
improve performance. This presentation introduces a new aspect of
our infrastructure in computational quality of service (CQoS),
namely database components that manage historical performance data
and metadata. We illustrate their use in selecting appropriate
parallel solver components.
Li Li
Argonne National [email protected]
Boyana Norris
Argonne National Laboratory
Mathematics and Computer Science [email protected]
Lois McInnes
Argonne National [email protected]
CP5
Computational Forces in the Linpack Benchmark
The efficiency of parallel algorithms can be explained as a
balancing act between computational forces. These forces, also
called computational intensities, are determined by the particular
algorithm and the particular machine running the algorithm. For a
timing formula describing the Linpack benchmark from Greer and
Henry, we show that different machines follow different paths
along a single efficiency surface.
Robert Numrich
University of [email protected]
CP5
Performance Comparison Between Square-to-Hemisphere and Cubed
Sphere Projections of a Global Shallow-Water Model on a Toroidal
Interconnect Architecture
Motivated by the limited scalability encountered with the cubed
sphere projection implemented in global shallow-water models over
a toroidal interconnect, we propose a square-to-hemisphere
projection. We argue that the square-to-hemisphere projection is
superior in optimizing processor communication and decreasing the
complexity of computational load balancing. We present a
performance comparison for a numerical shallow-water model under
both projections using a discrete Galerkin Runge-Kutta (DGRK)
method on the IBM BlueGene/L system over 1024 nodes.
Marcus Waldman
Undergraduate, University of Colorado at Boulder
Student Research Assistant, [email protected]
Siddhartha Ghosh
CISL/[email protected]
CP6
Why Column Pivoting Should Be Used for Performance
This talk presents new research on parallel dense linear algebra
with implicit column pivoting to improve load balancing. After
showing the performance on a cluster of workstations, we discuss
heterogeneous clusters, where the dynamic load balancing helps the
most. We show how the same new idea can be applied to hybrid
OpenMP/MPI problems and sparse problems. We also show how this
research is being integrated into the latest Intel Math Kernel
Library’s cluster products.
Greg Henry
Intel [email protected]
CP6
Block Householder Reduction of Sparse Matrices to Small Band
Upper Triangular Form
Bidiagonalization can be accomplished by accessing a sparse matrix
A only to perform sparse matrix dense vector multiplications Ax
and y^T A. Only a moderate number of leading rows and columns are
eliminated. The computations Ax and y^T A are predominant,
especially when x and y are too large to fit in cache memory. If
the reduction is to bandwidth k, the multiplications can instead
be AX and Y^T A, with A sparse and X, Y dense with k columns.
Blocking A gives further speedup. On a cache based architecture,
the resulting algorithm is fast and stable. It adapts easily to
multi-core architectures.
Gary Howell
North Carolina State [email protected]
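The replacement of k separate products Ax by one product AX, as described in the abstract above, can be sketched as follows. This is an illustrative sketch only, not the author's implementation: the row-wise storage of A as lists of (column, value) pairs and the function names `spmv` and `spmm` are assumptions made for the example.

```python
def spmv(rows, x):
    """Sparse matrix times dense vector.

    rows[i] is a list of (column, value) pairs for the nonzeros of row i.
    """
    return [sum(v * x[j] for j, v in row) for row in rows]


def spmm(rows, X):
    """Sparse matrix times dense matrix X with k columns.

    X is stored as a list of rows, each of length k. Every nonzero of A
    is read once and applied to all k columns, instead of being read
    k times as in k separate calls to spmv.
    """
    k = len(X[0])
    out = []
    for row in rows:
        acc = [0.0] * k
        for j, v in row:
            xj = X[j]
            for c in range(k):
                acc[c] += v * xj[c]
        out.append(acc)
    return out
```

Both routines compute the same numbers; the difference is that `spmm` streams the sparse matrix through memory once for all k columns, which is where the cache benefit described in the abstract would come from.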
CP6
Divide and Conquer Eigenvalue Solver Parallelization
In principle, the Divide and Conquer algorithm is well suited to
parallelization: a big task is divided into smaller ones that can
be solved in parallel. In practice it is not so easy, because the
small solutions must be merged into a big one and, in addition,
they affect each other during the solving stage. This work
describes the problems that arose in parallelizing an eigenvalue
solver and their solutions.
Alexander V. Kobotov
Intel Corp.; Institute of Computational Mathematics and
Mathematical Geophysics SB [email protected]
CP6
Weighted Matrix Reordering and Parallel Banded Preconditioners
for Non-Symmetric Linear Systems
With the emergence of petascale architectures, the role of
preconditioning techniques that can scale well on large numbers of
processors has become crucial. We present a reordering scheme that
allows the extraction of a central dominant band that can be used
as a preconditioner. Our results demonstrate excellent scalability
and robustness for a large class of problems for which other
black-box preconditioners, such as ILU and its variants, scale
poorly.
Murat Manguoglu
Purdue University Department of Computer [email protected]
Ahmed Sameh
Department of Computer Science
Purdue [email protected]
Mehmet Koyuturk
Case Western Reserve University
Department of Electrical Engineering and [email protected]
Ananth Grama
Purdue University
Department of Computer [email protected]
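The idea of extracting a central band to use as a preconditioner, mentioned in the abstract above, can be illustrated on a small dense matrix. This sketch does not implement the authors' weighted reordering; it only shows the extraction step and one simple, assumed measure of how much of the matrix the band captures. The names `extract_band` and `band_weight_fraction` are hypothetical.

```python
def extract_band(a, half_bandwidth):
    """Return a copy of square matrix a with entries outside the band zeroed.

    An entry a[i][j] is kept when |i - j| <= half_bandwidth.
    """
    n = len(a)
    return [[a[i][j] if abs(i - j) <= half_bandwidth else 0.0
             for j in range(n)]
            for i in range(n)]


def band_weight_fraction(a, half_bandwidth):
    """Fraction of the sum of absolute values that lies inside the band."""
    n = len(a)
    total = sum(abs(v) for row in a for v in row)
    inside = sum(abs(a[i][j])
                 for i in range(n) for j in range(n)
                 if abs(i - j) <= half_bandwidth)
    return inside / total if total else 1.0
```

A reordering that moves large entries toward the diagonal raises this fraction for a fixed bandwidth, which is the sense in which a band can become "dominant" and plausibly useful as a preconditioner.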
CP6
One World, One Matrix
We propose a new parallel algorithm, called the Directed
Transmission Method (DTM), to solve sparse linear systems whose
coefficient matrix is symmetric positive definite (SPD). DTM is a
fully scalable, asynchronous, distributed and continuous-time
iterative algorithm, which is quite different from traditional
discrete-time iterative algorithms. It is proved to be convergent.
DTM can run efficiently on any kind of homogeneous or
heterogeneous parallel computer, e.g. multicore and manycore
microprocessors, SMPs, clusters, supercomputers, grids, clouds and
the WWW. By means of DTM, we are capable of solving arbitrarily
large sparse SPD linear systems, as long as we have enough
processors and memory. Furthermore, we may unite the
supercomputers all over the world to solve an unprecedented,
extremely large sparse linear system, and the dream of “One World,
One Matrix” would come true at that time. Besides, DTM would be a
persuasive benchmark to test the performance of parallel
computers, especially supercomputers and manycore microprocessors.
Huazhong Yang, Fei Wei
Department of Electronic Engineering
Tsinghua University, Beijing, [email protected], [email protected]
CP6
New Algorithms for Sparse Matrix Partitioning
We discuss how to partition a sparse matrix to reduce
communication in parallel sparse matrix computations. We focus on
sparse matrix-vector multiplication, which is an important kernel
in scientific computing. We consider two-dimensional
distributions, and present a new algorithm based on vertex
separators and nested dissection. Empirical results on real
application matrices show our method is better than the
traditional 1-d (row) distribution, and competitive with other 2-d
distributions.
Erik G. Boman
Sandia National Labs, NM
Scalable Algorithms [email protected]
Michael Wolf
Univ. of Illinois, [email protected]
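The communication that a partition is meant to reduce can be made concrete for the traditional 1-d (row) distribution that the abstract above uses as its baseline. The sketch below is not the authors' 2-d algorithm; it only counts, under the assumption that processor `owner[i]` holds both row i and vector entry x[i], how many vector entries must be received from other processors for one product y = Ax. The function name `row_distribution_volume` is hypothetical.

```python
def row_distribution_volume(pattern, owner):
    """Communication volume of y = A x under a 1-d row distribution.

    pattern[i]: column indices of the nonzeros in row i.
    owner[i]:   processor that owns row i and vector entry x[i].
    Returns the total number of distinct x entries that processors
    must receive from other processors.
    """
    needed = {}
    for i, cols in enumerate(pattern):
        p = owner[i]
        for j in cols:
            if owner[j] != p:
                # Processor p needs x[j] once, however many rows use it.
                needed.setdefault(p, set()).add(j)
    return sum(len(entries) for entries in needed.values())
```

For a 4-by-4 tridiagonal pattern, assigning contiguous row blocks to two processors requires 2 entries to be exchanged, while interleaving the rows requires 4, which shows how strongly the choice of partition affects communication.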
CP7
High Performance Solution of Sparse Linear Systems Using Direct
Methods with Application to Electromagnetic Problems
The numerical treatment of high frequency electromagnetic
scattering in inhomogeneous media is very computationally
intensive. For scattering, the electromagnetic field must be
computed around and inside 3D complex bodies. Because of this,
accurate numerical methods must be used to solve Maxwell’s
equations in the frequency domain, which leads to very large
linear systems. In order to solve these systems, we have combined
modern numerical methods with efficient parallel algorithms on our
TERAscale computer.
Katherine Mer-Nkonga, Michel Mandallena, Jean-Jacques Pesque,
David Goudin
CEA/[email protected],
[email protected], [email protected],
[email protected]
CP7
Parallel Subspace Newton Methods for Algebraic Systems with Local
High Nonlinearities
We present locally refined Newton type methods for large nonlinear
systems of algebraic equations arising from the discretization of
nonlinear partial differential equations. We focus on the type of
systems that have local high nonlinearities. In other words, the
nonlinear system may have many equations, but only a small
percentage of them are highly nonlinear compared to the rest of
the equations. Global Newton methods may be used to solve the
system, but computing time is often wasted since all equations are
treated equally, as if they were all highly nonlinear. We
introduce subspace Newton methods to remove the local high
nonlinearities and therefore improve the efficiency and the
effectiveness of the outer global Newton method, which performs
well on equations with roughly the same level of nonlinearity. We
prove the convergence of this new method under certain
assumptions. We also discuss the parallel implementation of the
new method using PETSc and provide some numerical results from
solving several different nonlinear differential equations.
Xiao-Chuan Cai
University of Colorado, Boulder
Dept. of Computer [email protected]
Xuefeng Li
Loyola University New Orleans
[email protected]
CP7
Fully Coupled Two-Level Domain Decomposition Algorithms for
Inverse Problems
In this talk, we discuss multilevel domain decomposition methods
for solving some coupled nonlinear systems of equations obtained
from the discretization of inverse problems. We focus on a fully
coupled Newton-Krylov algorithm with two-level Schwarz type domain
decomposition methods as the preconditioner. We study the parallel
performance of the algorithms on supercomputers with hundreds of
processors for solving some difficult inverse problems arising
from the modeling of ground water flows.
Xiao-Chuan Cai
University of Colorado, Boulder
Dept. of Computer [email protected]
Si Liu
Department of Applied Mathematics
University of Colorado, Boulder [email protected]
CP7
A Parallel Multigrid Preconditioner for High-Order and
hp-Adaptive Finite Elements
The hp version of the finite element method is an adaptive finite
element approach in which adaptivity occurs in both the size, h,
of the elements and in the order, p, of the approximating
piecewise polynomials. An optimal order parallel linear system
solver is needed to get the best efficiency from these methods. We
present a parallel multigrid preconditioner whose rate of
convergence is independent of both h and p.
William F. Mitchell
National Institute of Standards and Technology
Mathematical and Computational Sciences [email protected]
CP7
Impact of Dual-Core Processors on the Performance of Parallel
Krylov Subspace Linear Solvers and Preconditioners for Porous
Media Flow Applications
Data from finite element modeling of porous media flow were used
to solve linear systems of equations using 12 Krylov subspace
parallel linear solvers with five preconditioners (60 scenarios)
using PETSc to test the efficiency and accuracy of the different
options. The Cray XT3 used in this study has recently been
upgraded to 4160 dual core nodes. This presentation will highlight
the performance of the linear solvers before and after the
dual-core processors were installed.
Thomas Oppe
Engineer Research and Development Center
Waterways Experiment [email protected]
Sharad Gavali
NASA Ames Research [email protected]
Fred T. Tracy
Engineer Research and Development Center
Waterways Experiment [email protected]
CP7
Multi-Length Scale Preconditioned Iterative Solver for Parallel
Hybrid Quantum Monte Carlo Simulation
The hybrid quantum Monte Carlo (HQMC) method for the Hubbard model
is a powerful method used to study the electron interactions that
characterize the properties of materials, such as magnetism and
superconductivity. The bottleneck of the method is the repeated
solution of the underlying multi-length-scale linear systems of
equations. In this talk, we present a preconditioning technique
and its parallelization for solving the linear systems. The
preconditioned solver demonstrates the optimal linear scaling
complexity of the HQMC method for moderately-correlated materials.
Zaojun Bai
Department of Computer Science
University of California, Davis, [email protected]
Richard Scalettar
Department of Physics,
University of California, Davis, [email protected]
Wenbin Chen
School of Mathematical Science,
Fudan University, [email protected]
Ichitaro Yamazaki
Department of Computer Science
University of California, [email protected]
CP8
A Parallel Algorithm for Optimization-Based Smoothing of
Unstructured 3-D Meshes
Serial optimization-based smoothing algorithms are
computationally expensive. Using Metis (or ParMetis) to partition
the mesh, the parallel algorithm moves (or does not move) a
processor’s internal nodes based on a cost function derived from the
Jacobians and condition numbers of surrounding elements. Ghost
cells are used to communicate new positions, the lower processor on
a boundary uses the new information to move boundary nodes, and the
process repeats. The result is a ready-to-use decomposed mesh.
Vincent C. BetroUniversity of Tennessee at
[email protected]
CP8
Distributed Transpose for 3D FFT: The Effects of Machine Geometry
and Process Mapping on BlueGene/L
We describe how to extend the scalability of the 3D-FFT using
2D-decomposition on thousands of BlueGene/L processors. The
communication cost of carrying out the data transposes required
by the 3D-FFT is very high and dominates the computation cost at
the limits of scalability. This motivated us to focus on
performance measurements of the distributed transpose alone. We
report performance data on two communication protocols, MPI and
BG/L-ADE. The proposed approach is effective in improving
performance for Particle-Mesh-based N-body simulations.
T.J.C. WardIBM Software Group,Hursley Park, Hursley,
[email protected]
Philiph HeidelbergerIBM Thomas J. Watson Research CenterYorktown
Heights, NY 10598-0218, USAphiliph@@us.ibm.com
Robert S. Germain, Blake Fitch, Aleksandr Rayshubskiy,Maria
EleftheriouIBM Thomas J. Watson Research [email protected],
[email protected],[email protected], [email protected]
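The role of the data transposes in the abstract above can be illustrated with a small single-process sketch. It assumes nothing about the authors' BlueGene/L code: the function names are hypothetical, a naive O(n^2) DFT stands in for a real 1D FFT, and the in-memory axis rotation stands in for the distributed transpose whose cost the talk measures.

```python
import cmath

def dft_1d(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for a 1D FFT)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transform_last_axis(a):
    """Apply the 1D transform to every contiguous line a[i][j][:]."""
    return [[dft_1d(line) for line in plane] for plane in a]

def rotate_axes(a):
    """Transpose so that b[k][i][j] == a[i][j][k] (cyclic axis rotation).

    In a distributed code this is the step that needs all-to-all
    communication; here it is a purely local copy.
    """
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for j in range(n1)] for i in range(n0)]
            for k in range(n2)]

def fft3d_by_transposes(a):
    """3D DFT as three passes of 1D transforms, each followed by a transpose."""
    for _ in range(3):
        a = rotate_axes(transform_last_axis(a))
    return a  # axes are back in their original order after three rotations
```

Each pass only ever transforms contiguous lines, which is why the transposes, and not the 1D transforms themselves, carry the communication cost in a parallel setting.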
CP8
New Parallel Techniques for BVPs in ODEs
The main objective of this paper is the development of new
parallel integration algorithms for solving boundary value problems
(BVPs) in ordinary differential equations (ODEs). The idea of the
new techniques is to combine parallel integration processes with
parallel interpolation processes suitable for running on MIMD
(multiple instruction streams with multiple data streams)
computing systems. The stability of the developed algorithms is
analysed. We also study the treatment of stiff BVPs by the
developed techniques.
Bashir M. KhalafProfessor of Scientific
[email protected]
CP8
Programming with Large Scale Edge-Node Simulator on BlueGene/L:
A Case Study of 3D FFT
We designed a network simulator for rapid specification of
complex networks, such as those required to model neural tissue.
Here we demonstrate a more general use of the network simulator for
implementing generic parallel algorithms, with a case study of the
3D-FFT. We demonstrate scaling of the 128x128x128 FFT network to
4,096 BG/L processors, and compare performance against the
original algorithm (Eleftheriou et al., 2006). Strategies for
automatically mapping network calculations to BG/L are
discussed.
Robert S. Germain, Blake Fitch, Maria EleftheriouIBM Thomas J.
Watson Research [email protected],
[email protected],[email protected]
James KozloskiIBM TJ Watson Research
[email protected]
Charles PeckBiometaphorical Computing ResearchIBM T.J. Watson
Research [email protected]
CP8
Improving the Scalability of Adaptive Mesh Refinement
In many large scale adaptive simulations scalability is
hindered due to costs associated with the changing mesh.
Algorithmic improvements to the mesh changing processes have led to
a significant reduction in these costs. In addition, the frequency
of remeshing can be reduced through the use of dilation. These
changes have led to large improvements in overall scalability of
the Uintah simulation framework. Results up to 4096 processors will
be shown.
Justin P. Luitjens, Tom HendersonUniversity of
[email protected], [email protected]
Martin BerzinsSCI InstituteUniversity of [email protected]
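The effect of dilation on remeshing frequency mentioned in the abstract above can be illustrated in one dimension. This is a toy sketch under stated assumptions, not Uintah's implementation: it assumes dilation means widening the set of cells flagged for refinement by a fixed radius, and the function names are hypothetical.

```python
def dilate_flags(flags, radius):
    """Mark every cell within `radius` cells of a flagged cell (1D)."""
    n = len(flags)
    out = [False] * n
    for i, flagged in enumerate(flags):
        if flagged:
            for j in range(max(0, i - radius), min(n, i + radius + 1)):
                out[j] = True
    return out

def needs_regrid(new_flags, refined):
    """Regrid only when a newly flagged cell lies outside the refined region."""
    return any(f and not r for f, r in zip(new_flags, refined))

# A feature at cell 3 is refined with a padding of two cells on each side.
refined = dilate_flags([False, False, False, True, False, False, False, False], 2)
# The feature moves one cell to the right: still covered, so no remeshing.
moved_one = [False, False, False, False, True, False, False, False]
# The feature moves three cells to the right: it leaves the padded region.
moved_three = [False, False, False, False, False, False, True, False]
```

A larger radius refines more cells than strictly needed but lets a moving feature travel further before the mesh has to change, which is the trade-off behind reducing the remeshing frequency.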
CP8
Finite Element Assembly on Arbitrary Meshes
One goal of automating Finite Element Methods (FEM) is to allow
arbitrary element types and orders on arbitrary meshes. A challenge
to this goal is separating local element definitions from the mesh
definition. We show our conceptual paradigm for this separation
using the PETSc Sieve library, a library based on representing
meshes as Grothendieck topologies, and demonstrate results with
a grade-2 fluid application.
Andy R. TerrelUniversity of ChicagoDepartment of Computer
[email protected]
Matthew G. KnepleyArgonne National
[email protected]
MS1
Towards General Auto-tuning Description Language on Advanced
Computing Systems
The description of auto-tuning is crucial but time-consuming
work in developing numerical libraries with an auto-tuning facility.
In this presentation, a description language for auto-tuning,
named ABCLibScript, is explained with several examples of numerical
computation. Although the original target of ABCLibScript was
vector supercomputers, we show its effectiveness for the software
development process on embedded systems. Its effect in an advanced
computing environment, namely a supercomputer with multi-core
processors, will also be shown.
Takahiro KatagiriInformation Technology CenterThe University of
[email protected]
MS1
Proposal of Run-time Parameter Auto-Tuning Approach for
Restarted Lanczos Method
For many input parameters of matrix solvers, the best values are
difficult to predict before runtime. This paper proposes an
automatic tuning approach for the restarted Lanczos method, which
searches for the best projection matrix size using the history of
residual values at runtime. The numerical experiments show the
proposed approach is 100 times faster than the original method in
the best case. The result implies that runtime automatic tuning is
effective for iterative matrix solvers.
Takao Sakurai, Ken Naono, Masashi EgiCentral Research
LaboratoryHitachi [email protected],
[email protected],[email protected]
Mitsuyoshi Igai, Hiroyuki KidachiHitachi ULSI Systems
[email protected],[email protected]
MS1
A Bayesian Approach to Automatic Performance Tuning
Code tuning has been done based on models, experiments or their
combinations, but the combinations are mostly heuristic. In this
talk it is shown that Bayesian statistics can provide a convenient
mathematical framework for combining models and experiments for code
tuning. The example problem here is online selection among several
unrolled codes for matrix-matrix multiply, and some sequential
experimental designs based on a simple performance model are
proposed and evaluated.
Reiji SudaDepartment of Computer Science, The University
[email protected]
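One way to picture the combination of model and experiment described above is a conjugate normal update, where the performance model supplies a prior on each code variant's run time and each measurement sharpens it. The sketch below is an illustration under assumptions, not the method of the talk: the class and function names, the known-noise-variance model, and the lower-credible-bound selection rule are all hypothetical choices.

```python
import math

class VariantBelief:
    """Normal belief over one variant's mean run time (known noise variance)."""

    def __init__(self, prior_mean, prior_var, noise_var):
        self.mean = prior_mean      # prediction from the performance model
        self.var = prior_var        # how much the model is trusted
        self.noise_var = noise_var  # assumed measurement noise

    def update(self, observed_time):
        # Standard conjugate update for a normal mean with known variance.
        precision = 1.0 / self.var + 1.0 / self.noise_var
        self.mean = (self.mean / self.var + observed_time / self.noise_var) / precision
        self.var = 1.0 / precision

def tune(measure, priors, trials, noise_var=0.01, kappa=2.0):
    """Run `trials` online experiments; return (best index, beliefs)."""
    beliefs = [VariantBelief(m, v, noise_var) for m, v in priors]
    for _ in range(trials):
        # Try the variant whose run time could plausibly be the smallest.
        i = min(range(len(beliefs)),
                key=lambda k: beliefs[k].mean - kappa * math.sqrt(beliefs[k].var))
        beliefs[i].update(measure(i))
    best = min(range(len(beliefs)), key=lambda k: beliefs[k].mean)
    return best, beliefs

# Hypothetical example: the model cannot tell three unrolled variants apart,
# but the measured times (in arbitrary units) differ.
true_times = [3.0, 1.0, 2.0]
best, beliefs = tune(lambda i: true_times[i], [(2.0, 1.0)] * 3, trials=30)
```

Because poorly performing variants quickly acquire a confident, unfavourable posterior, most of the experimental budget is spent on the promising variant.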
MS1
Automatic Tuning for Parallel FFTs
In this talk, an automatic performance tuning method for parallel
fast Fourier transforms (FFTs) is presented. A blocking algorithm
for parallel FFTs utilizes cache memory effectively. Since the
optimal block size may depend on the problem size, we propose a
method to determine the optimal block size that minimizes the number
of cache misses. Performance results of parallel FFTs on a PC
cluster are reported.
Daisuke TakahashiGraduate School of Systems and Information
EngineeringUniversity of [email protected]
MS2
Neutral Territory Methods for Efficient Parallelization of
Molecular Dynamics Simulations
The majority of the computational workload in molecular dynamics
simulations involves interactions between nearby particles. We will
describe a class of algorithms for parallelization of
range-limited particle interactions, the neutral territory methods,
some of which confer significant practical advantages over
traditional parallelization algorithms. We will illustrate specific
neutral territory methods introduced by other researchers and by
ourselves, and we will
discuss the tradeoffs that led us to select different neutral
territory methods for different molecular dynamics
implementations.
Ron O. DrorD. E. Shaw [email protected]
David E. ShawD. E. Shaw ResearchColumbia
[email protected]
Kevin J. BowersD. E. Shaw [email protected]
MS2
Scaling NAMD to Large Parallel Machines
NAMD’s parallel design, circa 1996, has stood the test of time.
The basic parallel structure includes (a) decomposition into
cells, and force-computation objects for each pair of interacting
cells, (b) implementation using message-driven objects in Charm++,
and (c) assignment of objects to processors using measurement-based
load balancers that also reduce communication. This talk will review
recent optimizations to scale NAMD to over 32,000 processors for
small and large biomolecular systems.
Laxmikant V. KaleUniversity of Illinois at
[email protected]
James C. PhillipsBeckman Institute, U. Illinois at
[email protected]
Chao MeiUniversity of Illinois at
[email protected]
Abhinav Bhatele, Gengbin Zheng, Sameer KumarBeckman Institute,
U. Illinois at [email protected],
[email protected],[email protected]
Klaus SchultenUniversity of Illinois at Urbana
[email protected]
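Step (c) of the NAMD design above, assigning objects to processors from measured loads, can be sketched with a simple greedy heuristic. This is an illustration only and an assumption on our part: it is not the Charm++ load balancer, it ignores the communication term the abstract mentions, and the object names and loads are hypothetical.

```python
import heapq

def greedy_assign(object_loads, num_procs):
    """Place objects, heaviest first, on the currently least-loaded processor.

    object_loads maps an object id to its measured load (e.g. seconds per step).
    Returns (assignment dict, list of per-processor total loads).
    """
    heap = [(0.0, p) for p in range(num_procs)]  # (total load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        total, proc = heapq.heappop(heap)
        assignment[obj] = proc
        heapq.heappush(heap, (total + load, proc))
    proc_loads = [0.0] * num_procs
    for obj, proc in assignment.items():
        proc_loads[proc] += object_loads[obj]
    return assignment, proc_loads

# Hypothetical measured loads for five force-computation objects.
measured = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 3.0, "e": 1.0}
assignment, proc_loads = greedy_assign(measured, num_procs=2)
```

The key idea carried over from the abstract is that the loads are measured at run time rather than predicted, so the assignment can be recomputed as the simulation evolves.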
MS2
Nanoparticle and Colloidal Simulations with Molecular
Dynamics
Modeling nanoparticle or colloidal systems in a molecular
dynamics (MD) code requires coarse-graining on several levels to
achieve meaningful simulation times for study of rheological and
other manufacturing properties. These include treating colloids as
single particles, moving from explicit to implicit solvent, and
capturing hydrodynamic effects. These changes also impact parallel
algorithms for tasks such as finding neighbor particles and
interprocessor communication. I’ll describe enhancements we’ve made
to our MD code LAMMPS to make nanoparticle simulations more
efficient, highlighting its flexible design that has enabled the
new capabilities.
Steve PlimptonSandia National [email protected]
MS2
A Summary of the Performance and Scaling of AMBER 10 and the
Challenges Ahead
This talk will present a summary of the current performance and
scaling of the soon to be released version 10 of the AMBER software
on a range of NSF and DOE high performance computing systems. In
addition it will include an overview of the supported methods and
the approaches used to obtain the level of performance seen. Finally
some of the challenges that may face the molecular dynamics
community in the near future will be discussed.
Ross C. WalkerSan Diego Supercomputer [email protected]
Robert E. DukeNIEHS and UNC-CHapel [email protected]
David A. CaseThe Scripps Research [email protected]
MS3
Creating Interoperability for Parallel Meshing Tools
Mesh technology, such as mesh generation, database queries, and
adaptivity, plays a critical role in scientific simulations. While
many frameworks providing mesh technology exist, their
incorporation into applications requires significant effort and
learning by application developers. Interfaces allowing
interoperable use of mesh tools greatly simplify this process while
providing a wider range of technology than a single framework. In
this talk, we discuss interoperable mesh interfaces and, in
particular, their extension to parallel mesh services.
Karen D. DevineSandia National
[email protected]
Xiaojuan Luo, Mark S. ShephardRensselaer Polytechnic
InstituteScientific Computation Research [email protected],
[email protected]
Lori A. DiachinLawrence Livermore National
[email protected]
Tim TautgesArgonne National [email protected]
Carl Ollivier-GoochUniversity of British
[email protected]
Vitus Leung
Sandia National [email protected]
MS3
Algorithms for Parallel Mesh Smoothing Using Mesquite
We discuss the development of an infrastructure that supports
the use of Mesquite mesh quality improvement algorithms in
distributed memory applications. We start with the application’s
decomposition of the mesh data and use an iterative process to
select independent sets of vertices to reposition in each pass. We
experiment with a mix of local and global techniques from Mesquite
and report on the scalability and performance of our methods.
Lori A. DiachinLawrence Livermore National
[email protected]
Martin IsenburgLawrence Livermore National
[email protected]
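The idea of repositioning only an independent set of vertices per pass, as described above, can be sketched with a greedy graph coloring: vertices that share a color are not adjacent, so they can be moved at the same time without reading each other's changing positions. This is a minimal illustration under assumptions, not Mesquite's algorithm, and the function name and the adjacency-dictionary input are hypothetical.

```python
def independent_sets(adjacency):
    """Split mesh vertices into sets such that no set contains two neighbors.

    adjacency maps each vertex to an iterable of its neighboring vertices.
    Uses greedy coloring; each color class is one independent set.
    """
    color = {}
    for v in sorted(adjacency):
        used = {color[u] for u in adjacency[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    sets = {}
    for v, c in color.items():
        sets.setdefault(c, []).append(v)
    return [sets[c] for c in sorted(sets)]

# Hypothetical example: four vertices connected in a chain 0-1-2-3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
passes = independent_sets(chain)
```

Each returned set corresponds to one smoothing pass; between passes the updated positions would be exchanged so the next set sees current neighbor coordinates.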
MS3
Zoltan Load Balancing Approaches
Dynamic load-balancing is a data-management service that is
critical to a wide range of unstructured and/or adaptive parallel
applications. The Zoltan Library provides a suite of dynamic
load-balancing tools. Access to Zoltan is now available through a
common interface that supports interoperability within the ITAPS
data model. In this presentation, we give a brief overview of the
dynamic load-balancing approaches available through Zoltan’s ITAPS
interface.
Karen D. Devine, Vitus LeungSandia National
[email protected], [email protected]
MS3
A Partition Model for Massively Parallel Mesh-Based
Computations
The Interoperable Technologies for Advanced Petascale Simulations
DOE SciDAC center is designing and implementing an interoperable
partition model to support parallel mesh-based operations,
including adaptive computations, accounting for the complexities
that arise due to the changing computational load and
communications of adapted meshes. The presentation will first
discuss the overall partition model design. Consideration will then
be given to its implementation and relation to adaptive mesh control
and Zoltan load balancing procedures.
Onkar SahniRensselaer PolytechnicScientific Computation Research
[email protected]
Xiaojuan Luo, Mark S. ShephardRensselaer Polytechnic
InstituteScientific Computation Research [email protected],
[email protected]
Kenneth Jansen, TIng XieRensselaer Polytechnic
[email protected], [email protected]
MS4
Issues in Exploiting the Power of Multiple Methods
In this presentation, we will discuss some of the issues in
multimethod implementation. While the focus of using multimethods
is mapping a "single" method to a simulation stage, for certain
problems, several suitable methods might be combined to produce more
effective results. The trade-off related to the frequency of
changing methods is another issue, as it might not be practical to
switch methods at every opportunity for adaptivity. Yet another
challenge is the efficient identification of adaptivity in the
simulation.
Sanjukta BhowmickDepartment of Computer Science and
EngineeringPennsylvania State [email protected]
MS4
Machine Learning Support for Numerical Decision Making
We present the SALSA (Self-Adapting Large-scale Solver
Architecture) software system for intelligent multi-methods.
The system is based on a modular architecture for composite
algorithms (for instance, choice of scaling/preconditioner/iterator
in iterative linear system solvers) and uses machine learning
techniques for adaptively choosing the component algorithms. We
will discuss various learning techniques we have explored, and the
high level of accuracy obtained.
Victor EijkhoutThe University of Texas at AustinTexas Advanced
Computing [email protected]
MS4
Evaluation of a Meta-partitioner for Simulations Using
Block-structured Adaptively Refined Meshes
High parallel efficiency for structured adaptive mesh refinement
(SAMR) applications requires repeated data partitioning and
distribution. We present a performance evaluation of a framework
for adaptive partitioning. Considering computational load,
communication volume, synchronization delays, and data movement,
the framework selects, configures and invokes the most efficient
partitioning algorithm. We show that adaptive partitioning can
significantly improve parallel efficiency for SAMR applications.
Henrik JohanssonDepartment of Information TechnologyUppsala
[email protected]
MS4
Adaptive Partitioning for Unstructured AMR Applications
Improving performance of large scientific adaptive applications
is non-trivial due to their inherent dynamics and wide spectrum of
properties. Performance is limited by the partitioner’s ability
to exploit computer resources given the application state. No single
partitioning configuration can generally achieve high performance;
partitioning must be dynamically adaptive. In this talk, we describe
the meta-partitioner: a framework for selecting and configuring the
most suitable partitioner based on run-time state.
Johan SteenslandSandia National
[email protected]
MS5
Parallel Programming in MATLAB: Best Practices
Matlab is one of the most commonly used languages for scientific
computing with approximately one million users worldwide. The
Lincoln pMatlab library (http://www.ll.mit.edu/pMatlab), The
Mathworks DCT, and StarP from ISC have brought parallel computing
to this community using the distributed array programming
paradigm. This talk provides an introduction to distributed array
programming and will describe the best programming practices for
using distributed arrays to produce well performing parallel Matlab
programs.
Jeremy KepnerMIT Lincoln [email protected]
MS5
Parallel MATLAB in Production Supercomputing with Applications in
Signal and Image Processing
Parallel MATLAB enables the large community of MATLAB users to
harness the increased computing capacity and memory of distributed
memory clusters. At the Ohio Supercomputer Center we provide our
users with three varieties of Parallel MATLAB. In this talk, we
will describe how we run these Parallel MATLAB environments within a
traditional batch oriented queuing system. We will also describe our
experiences in developing three signal and image processing
applications within this environment.
Ashok KrishnamurthyOhio Supercomputing [email protected]
David Hudak, John Nehrbass, Siddharth Samsi, VijayGadepallyOhio
Supercomputer [email protected], [email protected],
[email protected], [email protected]
MS5
Parallel Computing Toolbox (PCT) and Parallel Programming in
MATLAB
Parallel Computing Toolbox addresses computationally and
data-intensive problems using MATLAB and Simulink in a
multiprocessor computing environment. The toolbox supports both
several independent tasks and a single parallel computation by
harnessing computing clusters and a variety of batch queuing
software implementations. The toolbox provides high-level
constructs, such as parallel loops and algorithms, and MPI-based
functions. Low-level constructs for resource management are also
included. The Parallel Command Window provides an interactive
environment for developing parallel applications.
Piotr LuszczekThe MathWorks, [email protected]
MS5
Interactive Data Exploration with Star-P
High performance applications increasingly combine numerical
and combinatorial algorithms. Past research on high performance
computation has focused mainly on numerical algorithms, and there
is a rich variety of tools for high performance numerical computing.
On the other hand, few tools exist for large scale combinatorial
computing. We describe our efforts to build a common
infrastructure for numerical and combinatorial computing by using
parallel sparse matrices to implement parallel graph
algorithms.
Viral B. ShahInteractive
[email protected]
MS6
Integrated Air/Ocean/Wave Modeling Using ESMF
Development of an integrated air/ocean/wave modeling system is
described. The single executable system is built from mature
stand-alone models using the Earth System Modeling Framework (ESMF).
The framework provides the required functionality for treating each
model as a separate component and for the redistribution and
remapping of data between them. An exchange grid approach is
implemented to simplify the interface between models that use
telescoping nests. In addition to describing the implementation
details, preliminary results will be presented for two regional test
cases.
Sue ChenNaval Research LaboratoryMonterey,
[email protected]
Hao JinSAIC, Naval Research LaboratoryMonterey,
[email protected]
Rich HodurNaval Research LaboratoryMonterey,
[email protected]
Sasa GabersekUCAR, Naval Research LaboratoryMonterey,
[email protected]
Tim CampbellNaval Research LaboratoryStennis Space
[email protected]
MS6
Algorithms for a Scalable Earth System Model
Abstract not available at time of publication.
John DrakeOak Ridge National [email protected]
MS6
A Coupled Watershed-Nearshore Model Using DBuilder
Coupling of independent models involves implementation of
synchronization and data-exchanging algorithms. Also, coupling may
be along a shared edge of two meshes or an overlapped region between
two meshes. The latter can be difficult in terms of spatially
mapping nodes/elements between two meshes. DBuilder, a parallel
data management toolkit, provides users with APIs such as element
searching and data synchronization routines to accomplish these
tasks.
Robert M. HunterU.S. Army Engineer Research & Development
[email protected]
MS6
Parallel Rendezvous Regridding in ESMF
In coupled multiphysics simulations often each physics is
modelled by a distinct, specialized code; to combine these codes
into a coupled solver, it is necessary to transfer fields from one
code to another (often called regridding). In the Earth Sciences
(and other disciplines) each individual physics code will likely
be a massively parallel code, with a unique parallel decomposition
of the physical domain. We discuss the ESMF implementation of the
Parallel Rendezvous algorithm of Stewart et al., which creates a
geometric rendezvous mesh to perform the search and interpolation.
We discuss the application and extension of this algorithm to
interpolation of high order finite elements with non-nodal
interpolation rules (e.g. hierarchical elements). We also
demonstrate a smoothing interpolation method that is based on finite
element patch recovery techniques.
David NeckelsNational Center for Atmospheric
[email protected]
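The two steps named in the abstract above, search and interpolation, can be illustrated with a serial, one-dimensional toy example. It is a sketch under assumptions and does not reproduce the ESMF or Parallel Rendezvous implementation: the function name is hypothetical, the grids are sorted lists of node coordinates, and linear interpolation stands in for the finite element interpolation rules.

```python
from bisect import bisect_right

def regrid_linear(src_x, src_f, dst_x):
    """Transfer a nodal field from a source grid to a destination grid.

    src_x: sorted source node coordinates; src_f: field values at those nodes;
    dst_x: destination node coordinates, all inside [src_x[0], src_x[-1]].
    """
    out = []
    for x in dst_x:
        if not src_x[0] <= x <= src_x[-1]:
            raise ValueError(f"destination point {x} lies outside the source grid")
        # Search step: find the source cell [src_x[i], src_x[i + 1]] containing x.
        i = min(bisect_right(src_x, x) - 1, len(src_x) - 2)
        # Interpolation step: linear weights within that cell.
        t = (x - src_x[i]) / (src_x[i + 1] - src_x[i])
        out.append((1.0 - t) * src_f[i] + t * src_f[i + 1])
    return out

# Hypothetical example: the field f(x) = 2x sampled on a coarse grid.
values = regrid_linear([0.0, 1.0, 2.0], [0.0, 2.0, 4.0], [0.0, 0.5, 1.5, 2.0])
```

In the parallel setting the difficulty is that the source cell containing a destination point may live on a different processor, which is the problem the rendezvous mesh is built to address.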
MS7
Solving Rank Deficient Linear-Least Squares Problems Using
Sparse QR Factorizations
We address the problem of solving linear least-squares problems
min ||Ax − b|| when A is a sparse m-by-n rank deficient or highly
ill-conditioned matrix. Since A is rank-deficient or highly
ill-conditioned, the factorization A = QR is not useful because the
computed R is ill-conditioned. We have developed a new method that
uses a regular QR factorization instead of a rank-revealing QR
factorization. The goal of this work is to implement and test the
algorithm in a high performance QR factorization.
Esmond G. NgComputational Research DivisionLawrence Berkeley
National [email protected]
Haim AvronSchool of Computer ScienceTel-Aviv
[email protected]
Sivan ToledoTel Aviv [email protected]
MS7
Computing the Conditioning of Dense Linear Least Squares with
(Sca)LAPACK
We define condition numbers that can assess the accuracy of the
components of the least squares solution. We interpret them in
terms of statistical quantities. We show that the ratio of the
variance of one component of the solution to the variance of the
right-hand side is exactly the condition number. We also propose
codes based on (Sca)LAPACK for computing the variance-covariance
matrix. Finally we present experiments from the space industry
with real physical data.
Julien LangouUniversity of Colorado at Denver and Health
[email protected]
Jack DongarraUniversity of [email protected]
Marc [email protected]
Serge [email protected]
MS7
Sparse QR Rank-revealing Factorization
We discuss an algorithm for computing a rank revealing sparse QR
factorization. First, a QR factorization with no pivoting is
performed, which allows a sparse triangular factor R to be obtained
efficiently. Second, an incremental condition number estimator (ICE)
is used iteratively on R to identify redundant columns. We also
introduce a block formulation of the ICE algorithm. Numerical tests
show that block ICE leads to approximations close to those obtained
by successive runs of ICE.
Bernard PhilippeIRISA-INRIARennes [email protected]
Laura [email protected]
Frederic [email protected]
MS7
Blocked Bidiagonal Reduction of Sparse Matrices Using Givens
Rotations
Computing the singular value decomposition involves
bidiagonalization of a sparse upper triangular matrix R.
Conventional methods do not exploit the sparsity of R. We introduce
a method to bidiagonalize R using a sequence of Givens rotations
while preserving the "mountainview" profile of R. A dynamic
blocking scheme extends the method to two blocked variations which
generate no more fill than the unblocked version. We present
performance results comparing all the different methods.
Gene H. GolubStanford UniversityDepartment of Computer
[email protected]
Timothy A. DavisUniversity of FloridaComputer and Information
Science and [email protected]
Sivasankaran RajamanickamDept. of Computer and Information
Science andEngineeringUniv. of Florida,
[email protected]
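The basic building block named in the abstract above, a single Givens rotation, is small enough to show directly. The sketch below covers only that textbook primitive; it says nothing about the ordering, blocking, or profile-preserving strategy of the talk, and the function names are hypothetical.

```python
import math

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def apply_givens(c, s, x, y):
    """Rotate the pair (x, y), e.g. two entries in the same column of two rows."""
    return c * x + s * y, -s * x + c * y

# Zero the second entry of the pair (3, 4); the first becomes the norm, 5.
c, s = givens(3.0, 4.0)
r, zero = apply_givens(c, s, 3.0, 4.0)
```

Because each rotation touches only two rows (or two columns), it changes the sparsity pattern locally, which is what makes fill control by careful ordering possible.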
MS8
NVIDIA CUDA Software and GPU Parallel Computing
Architecture
In the past, graphics processors were special purpose hardwired
application accelerators, suitable only for conventional
rasterization-style graphics applications. Modern GPUs are now
fully programmable, massively parallel floating point processors.
This talk will describe NVIDIA's massively multithreaded computing
architecture and CUDA software for GPU computing. The architecture
is a scalable, highly parallel architecture that delivers high
throughput for data-intensive processing. Although not truly
general-purpose processors, GPUs can now be used for a wide variety
of compute-intensive applications beyond graphics.
Michael [email protected]
MS8
Implementation of the Navier-Stokes Stanford University Solver
(NSSUS) on a GPU
Current graphics processing units are capable of over 300 Gflops
peak performance and this typically doubles every year. We have
ported to a GPU some of the capabilities of the Navier-Stokes
Stanford University Solver (NSSUS), a multi-block structured code
with a provably stable and accurate numerical discretization which
uses a vertex-based finite-difference method and multigrid for
convergence acceleration. Speed-ups of over 40x were demonstrated
for simple test geometries and up to 20x for realistic geometries of
engineering interest.
Patrick LeGresleyStanford University, now with
[email protected]
Erich ElsenStanford [email protected]
Eric F. DarveStanford UniversityCenter for Turbulence
[email protected]
MS8
Performance and Productivity of Graphics Processing Units for a
Quantum Monte Carlo Application
The increased programmability and performance of Graphics
Processing Units (GPUs) can have profound positive impact on
developer productivity. In this talk, we discuss the acceleration of
a Quantum Monte Carlo application using GPUs. Topics include the
impact of GPU features on performance, tradeoffs in using a library
approach versus a more informed hand optimized acceleration path,
and the feasibility of combining these approaches.
Jeremy MeredithOak Ridge National
[email protected]
MS8
Accelerating Molecular Modeling Applications with Graphics
Processors
State-of-the-art graphics processing units (GPUs) can perform
over 500 billion arithmetic operations per second, a powerful
computational resource that can now be harnessed for use by
scientific applications. We present an overview of recent advances
in programmable GPUs, with an emphasis on their application to
biomolecular modeling applications and the programming techniques
required to obtain optimal performance in these cases. Performance
and implementation details are presented for several
applications. The calculations include runs on multiple GPUs.
John Stone, James C. Phillips, Peter Freddolino,Leonardo
TrabucoBeckman InstituteUniv of Illinois at
[email protected],
[email protected],[email protected], [email protected]
Klaus SchultenUniversity of Illinois at Urbana
[email protected]
David J. HardyTheoretical Biophysics Group, Beckman
InstituteUniversity of Illinois at
[email protected]
MS9
Communication Avoiding Algorithms for Linear Algebra: Motivation,
Approach
We survey results to be presented in this minisymposium on
designing numerical algorithms to minimize the largest cost
component: communication. This could be bandwidth and latency costs
between processors over a network, or between levels of a memory
hierarchy; both costs are increasing exponentially compared to
floating point. We describe novel algorithms in sparse and dense
linear algebra, for both direct methods (like QR and LU) and
iterative methods that can minimize communication.
James W. DemmelUniversity of CaliforniaDivision of Computer
[email protected]
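The distinction between latency and bandwidth costs drawn above is often made concrete with a simple linear cost model, in which every message pays a fixed latency and every word pays a per-word transfer time. The abstract does not name this model, so treating it as the relevant one is an assumption, and the parameter values below are hypothetical.

```python
def comm_time(messages, words, alpha, beta):
    """Latency-bandwidth estimate: alpha seconds per message, beta per word."""
    return alpha * messages + beta * words

# Hypothetical machine: 10 microseconds per message, 1 nanosecond per word.
ALPHA, BETA = 1e-5, 1e-9

# The same one million words sent as 1000 small messages or 10 large ones.
many_small = comm_time(messages=1000, words=10**6, alpha=ALPHA, beta=BETA)
few_large = comm_time(messages=10, words=10**6, alpha=ALPHA, beta=BETA)
```

Under these assumed parameters the latency term dominates the many-message case, which is the motivation for algorithms that reduce the number of messages even when the data volume stays the same.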
MS9
Communication Avoiding Gaussian Elimination
We present CALU, a Communication Avoiding LU factorization
algorithm for dense matrices distributed in a two-dimensional (2D)
cyclic layout. The new algorithm leads to an important decrease in
the number of messages exchanged during the factorization, and
thus it overcomes the latency bottleneck of the LU factorization as
implemented in ScaLAPACK. We also discuss the stability of the
pivoting strategy used in CALU and present performance results on
several computational platforms.
Julien LangouUniversity of Colorado at Denver and Health
[email protected]
James DemmelUC Berkeley, [email protected]
Laura [email protected]
Hua XiangINRIA, [email protected]
MS9
AllReduce Algorithms: Application to Householder QR
Factorization
QR factorizations of tall and skinny matrices with their data
partitioned vertically across several processors arise in a wide
range of applications. Various methods exist to perform the QR
factorization of such matrices: Gram-Schmidt, Householder, or
CholeskyQR. In this talk, we present the Allreduce Householder QR
factorization. This method is stable and performs, in our
experiments, from four to eight times faster than ScaLAPACK routines
on tall and skinny matrices. The idea of Allreduce algorithms can
be extended to 2D block-cyclic LU or QR factorization.
Julien LangouUniversity of Colorado at Denver and Health
[email protected]
Jim DemmelDivision of Computer ScienceUniversity of California,
[email protected]
Laura [email protected]
Mark HoemmenUC Berkeley, [email protected]
MS9
A Low Latency Approach for Parallel Sparse LU
Factorization
We present a new scheme for computing the sparse LU
factorization. Our goal is to decrease the number of messages,
hence decreasing the time spent in communication. The reduction in
the number of messages is obtained by using a heuristic pivoting
strategy, which is shown by numerical experiments to be stable in
practice. The parallel algorithm is based on a hypergraph reordering
strategy, and an associated separator tree is used for distributing
the data.
Laura [email protected]
Hua XiangINRIA, [email protected]
MS10
Anton: A Special-purpose Machine for Molecular Dynamics
Simulation
Anton is a massively parallel machine which should make practical
millisecond-scale classical molecular dynamics (MD) simulations of
proteins in explicit solvent. The machine, which is scheduled for
completion by the end of 2008, is based on 512 identical MD-specific
ASICs that interact in a tightly coupled manner using a specialized
high-speed communication network. Anton has been designed to use
both novel parallel algorithms and special-purpose logic to
dramatically accelerate those calculations that dominate the time
required for a typical MD simulation. The remainder of the
simulation algorithm is executed by a programmable portion of each
chip that achieves a substantial degree of parallelism while
preserving the flexibility necessary to accommodate anticipated
advances in physical models and simulation methods.
Ron O. Dror
D. E. Shaw [email protected]
David E. Shaw
D. E. Shaw Research
Columbia
[email protected]
Martin M. Deneroff, Jeffrey S. Kuskin, Richard H. Larson, John K.
Salmon, Cliff Young
D. E. Shaw [email protected],
[email protected], [email protected],
[email protected], [email protected]
MS10
Scaling Classical Molecular Dynamics to O(1) Atom per Node
We will describe some of the issues involved in
scaling biomolecular simulations onto massively parallel machines as
well as some of the science that we have been able to achieve using
the Blue Matter molecular simulation application on Blue Gene/L.
Our experiences in scaling to order one atom/node on BG/L should
provide some insights into the challenges involved in scaling
biomolecular simulations onto larger peta-scale platforms.
Blake G. Fitch, Christopher Ward, Michael C. Pitman, Robert S.
Germain
IBM T.J. Watson Research [email protected],
[email protected], [email protected], [email protected]
Aleksandr Rayshubskiy, Maria Eleftheriou
IBM Thomas J. Watson
Research [email protected], [email protected]
MS10
Accelerating NAMD with Graphics Processors
Commodity graphics processors allow a single workstation to
achieve teraflop performance on certain workloads. Working with the
NVIDIA CUDA programming system, we have adapted our molecular
dynamics code NAMD (www.ks.uiuc.edu/Research/namd/) to offload the
most expensive calculations to graphics processors while
maintaining its parallel capability (J. Comp. Chem., 28:2618-2640,
2007). This talk will present recent work and parallel performance
results for CUDA-accelerated NAMD.
John E. Stone, James C. Phillips
Beckman Institute, U. Illinois
at [email protected], [email protected]
Klaus Schulten
University of Illinois at Urbana
[email protected]
MS10
Petascale Special-Purpose Computer for Molecular Dynamics
Simulations: MDGRAPE-3 and Beyond
We have developed the MDGRAPE-3 system, a petaflops
special-purpose computer for molecular dynamics
simulations. The MDGRAPE-3 is a PC cluster equipped
with accelerators of 4,778 ASICs that calculate nonbonded
interactions between atoms. Currently, serial Amber-8 and in-house
parallel MD software have been ported to the system. We will
present the architecture and performance of the system as well as
the next-generation project to develop a tile processor with
special-purpose engines over TFLOPS performance.
Tetsu Narumi
Genomic Sciences Center, RIKEN
and Keio
[email protected]
Duraid Madeina
University of [email protected]
Makoto Taiji, Yosuke Ohno
Genomic Sciences Center,
[email protected], [email protected]
Takashi Ikegami
The Graduate School of Arts and Sciences
University of [email protected]
MS11
Irregular Algorithms on the Cell Broadband Engine
The Sony-Toshiba-IBM Cell Broadband Engine is a heterogeneous
multicore architecture that consists of a traditional microprocessor
(PPE) with eight SIMD co-processing units (SPEs) integrated on-chip.
While the Cell processor is architected for multimedia
applications with regular processing requirements, we are
interested in its performance on problems with non-uniform memory
access patterns. In this talk, we present a case study of list
ranking, a fundamental kernel for graph problems, that
illustrates the design and implementation of parallel
combinatorial algorithms on Cell. List ranking is a particularly
challenging problem to parallelize on current cache-based
and distributed memory architectures due to its low computational
intensity and irregular memory access patterns. To
tolerate memory latency on the Cell processor, we decompose
work into several independent tasks and coordinate
computation using the novel idea of Software-Managed
threads (SM-Threads). We apply this generic SPE
work-partitioning technique to efficiently implement list
ranking, and demonstrate substantial speedup in comparison to
traditional cache-based microprocessors. For instance, on a 3.2 GHz
IBM QS20 Cell blade, for a random linked list of 1 million nodes, we
achieve an overall speedup of 8.34 over a PPE-only
implementation.
David A. Bader
Georgia Institute of
[email protected]
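To make the list ranking kernel from the abstract above concrete, here is a minimal sketch of the textbook pointer-jumping algorithm (often attributed to Wyllie). It is not the SM-Threads implementation described in the talk; the function name `list_rank` and the array layout (`succ[i]` holds the index of the successor of node `i`, with -1 at the tail) are assumptions made for illustration.

```python
def list_rank(succ):
    """Return, for every node, its distance to the tail of the linked list."""
    n = len(succ)
    rank = [0 if s == -1 else 1 for s in succ]
    nxt = list(succ)
    # Each round doubles the distance covered by every pointer, so
    # O(log n) rounds suffice. Within a round, all iterations of the
    # inner loop are independent and could run in parallel.
    while any(s != -1 for s in nxt):
        new_rank, new_nxt = rank[:], nxt[:]
        for i in range(n):
            j = nxt[i]
            if j != -1:
                new_rank[i] = rank[i] + rank[j]
                new_nxt[i] = nxt[j]
        rank, nxt = new_rank, new_nxt
    return rank

# The list 0 -> 3 -> 4 -> 2 -> 1 stored in scrambled memory order.
ranks = list_rank([3, -1, 1, 4, 2])
```

The reads `rank[j]` and `nxt[j]` jump to unpredictable locations, which is the irregular memory access pattern the abstract refers to.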
MS11
Linear Algebra Algorithms on the IBM Cell
This talk describes the design concepts behind implementations
of some linear algebra routines targeted for the Cell
processor and multicore in general. It describes in detail
the implementation of code to solve a linear system of equations
using Gaussian elimination in single precision with
iterative refinement of the solution to full double-precision
accuracy. We will also look at the PlayStation 3 for
use in scientific computations.
Jack J. Dongarra
Department of Computer Science
The University of
[email protected]
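The mixed-precision idea in the abstract above (factor and solve in single precision, then refine toward double-precision accuracy) can be sketched as follows. This is a small pure-Python illustration, not the Cell code from the talk: single precision is emulated by rounding through `struct`, the helper names `f32`, `lu_factor_f32`, `lu_solve_f32`, and `solve_refined` are hypothetical, and the matrix is assumed to be nonsingular and reasonably well conditioned.

```python
import struct

def f32(x):
    """Round a double to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_f32(a):
    """LU factorization with partial pivoting, rounded to single precision."""
    n = len(a)
    lu = [[f32(x) for x in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm

def lu_solve_f32(lu, perm, b):
    """Forward and back substitution, again rounded to single precision."""
    n = len(lu)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y

def solve_refined(a, b, iters=5):
    """Single-precision solve followed by refinement with double residuals."""
    n = len(a)
    lu, perm = lu_factor_f32(a)
    x = lu_solve_f32(lu, perm, b)
    for _ in range(iters):
        # The residual and the update are computed in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve_f32(lu, perm, r)
        x = [x[i] + d[i] for i in range(n)]
    return x

A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
x_true = [0.1, 0.2, 0.3]
b = [sum(A[i][j] * x_true[j] for j in range(3)) for i in range(3)]
x = solve_refined(A, b)
```

The expensive O(n^3) factorization runs entirely in the fast precision; only the O(n^2) residual and update steps use double precision.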
MS11
The Implementation of FFTW on Cell
FFTW is a library for computing Fourier transforms of complex,
real, and real-symmetric multi-dimensional sequences. In this
talk, I describe the port of FFTW to the
Cell Broadband Engine, which was completed at the beginning
of 2007 by the IBM Austin Research Lab. The bulk
of FFTW runs on the Cell PPE, treating the SPEs as accelerators.
The SPEs execute a specialized program capable
of executing one-dimensional DFTs of two-dimensional
vectors. While the capabilities of the SPE program are restricted,
FFTW can reduce an arbitrary multi-dimensional DFT
to this restricted form, thus taking advantage of the
SPEs in most cases.
Matteo Frigo
Cilk [email protected]
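The reduction of a multi-dimensional DFT to one-dimensional DFTs mentioned in the FFTW abstract above can be illustrated with the standard row-column decomposition. This sketch uses a naive O(n^2) one-dimensional DFT and does not call FFTW or reproduce its planner; the function names `dft1d` and `dft2d` are hypothetical.

```python
import cmath

def dft1d(x):
    """Naive one-dimensional DFT (forward transform, no scaling)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]

def dft2d(a):
    """Two-dimensional DFT built only from one-dimensional DFTs."""
    # Transform every row, then transform every column of the result.
    rows = [dft1d(r) for r in a]
    cols = [dft1d(list(c)) for c in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]

a = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]
X = dft2d(a)
```

Because each row or column transform is independent, the one-dimensional kernels are natural units of work to hand to an accelerator.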
MS11
Dealing with the Memory Bandwidth Bottleneck on the Cell
Processor
The computational workload on the Cell processor is handled by
co-processors called SPEs. They have small local stores, which makes
it necessary to store the application data in main memory. In many
applications, especially those requiring O(n) computational effort,
the bandwidth to main memory limits application performance. We
will discuss the effectiveness of data compression in dealing with
this limitation in a few important applications, such as
matrix-vector multiplication.
Ashok Srinivasan
Department of Computer Science, Florida State
[email protected]
Gunaranjan Gunaranjan, T Nagaraju, Ramprasad Ramprasad, T.V.
Sivakumar
Sri Sathya Sai [email protected],
[email protected], [email protected],
[email protected]
MS12
Fluid Dynamics Simulations on Massively Parallel Computers
Achieving the goal of reliable flow simulations for realistic
problems requires methods that are extensible to levels of
parallelism that scale to hundreds of thousands of processors and that
can attain petaflop performance. We present a framework
to perform massively parallel simulations where work is
partitioned into balanced parts with well-controlled
communications. We demonstrate scalability on 30,000 processors
on an IBM BlueGene/L for the case of blood flow in
a patient-specific arterial system.
Min Zhou
Rensselaer Polytechnic [email protected]
Ken Jansen
Rensselaer Polytechnic
[email protected]
Onkar Sahni
Rensselaer Polytechnic
Scientific Computation Research
[email protected]
Mark S. Shephard
Rensselaer Polytechnic Institute
Scientific Computation Research [email protected]
MS12
Load Balancing in FronTier/AMR Computation
Computation of fluid physics with dynamically evolving fronts
embedded in the PDE solution domain poses a great challenge for load
balancing in parallel computing, particularly when adaptive mesh
refinement is used. In this presentation, we present our adjusted
rectangular domain decomposition algorithm and AMR patch
redistribution for computation on a parallel platform with a large
number of processors. These algorithms enhance the parallel
efficiency and scaling substantially. Our implementation conforms
with the ITAPS interoperability requirements.
Ryan Kaufman, Brian Fix
SUNY at Stony
[email protected], [email protected]
Xiaolin Li
Department of Applied Math and Stat
SUNY at Stony
[email protected]
MS12
Sets and Tags in the ITAPS Data Model
The data model used in the ITAPS iMesh and iGeom interfaces
includes sets (an arbitrary collection of entities and other sets)
and tags (application-defined data assigned to entities, sets, and
the interface itself). The combination of sets and tags is a
powerful mechanism for embedding data with a variety of sources and
semantics in the ITAPS interfaces. However, in practice this same
variety makes it difficult to find data through such an abstract
interface. This issue is discussed in the context of parallel
scientific computing for both geometry and mesh data.
Tim Tautges
Argonne National [email protected]
Mark Miller
Lawrence Livermore National
[email protected]
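The sets-and-tags data model described in the abstract above can be pictured with a toy sketch. This is not the ITAPS iMesh or iGeom API: the classes `Entity` and `EntitySet`, the function `find_by_tag`, and the tag names are all hypothetical, and nested sets are assumed to contain no cycles.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Entity:
    """A mesh or geometry entity carrying application-defined tags."""
    kind: str
    tags: dict = field(default_factory=dict)

@dataclass(eq=False)
class EntitySet:
    """An arbitrary collection of entities and other sets, also taggable."""
    members: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)

    def entities(self):
        """Yield every entity reachable through nested sets."""
        for m in self.members:
            if isinstance(m, EntitySet):
                yield from m.entities()
            else:
                yield m

def find_by_tag(root, name, value):
    """Return all entities under root whose tag `name` equals `value`."""
    return [e for e in root.entities() if e.tags.get(name) == value]

v1 = Entity("vertex", {"material": "steel"})
v2 = Entity("vertex", {"material": "copper"})
inner = EntitySet([v2], tags={"role": "boundary"})
root = EntitySet([v1, inner], tags={"name": "mesh"})
copper = find_by_tag(root, "material", "copper")
```

The search only works if the caller already knows the tag name and value conventions chosen by whoever wrote the data, which is the discoverability problem the abstract raises.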
MS12
Parallel, Scalable Unstructured Mesh Generation and Computation
Physics Tools on Petascale Computing Architectures
As a scientific mesh-based modeling community, we are making
steady progress toward petascale computing hardware.
The computing hardware that we will be dealing
with over the next several years is going to get more complex
for mesh-based algorithms in terms of multi/many-core
processors, hierarchical memories and distributed I/O
because the relationships between CPU speed, memory bandwidth,
memory latency and communication latency
are going to change dramatically. We need to make sure
that our modeling and simulation tools keep pace with
these hardware developments, such that the algorithms are
scalable to 100s of thousands of processors and data is
partitioned properly with respect to memory hierarchies. This
presentation will describe our efforts to maintain parallel,
scalable, efficient software tools and technology for mesh
generation and continuum/discrete simulation on advanced computing
architectures by making use of interoperable tools such as the ITAPS
mesh/field interfaces; Zoltan data/graph partitioning tools; and
complex mesh generation and solvers, like NWGrid/NWPhys and
FronTier, on applications such as multiscale subsurface transport
in complex geometries.
Yilin Fang, Harold E. Trease
Pacific Northwest National
[email protected], [email protected]
Bruce Palmer
Pacific Northwest National Lab
[email protected]
MS13
Challenges and Achievements in Computational Electromagnetics
Under consideration are problems in the vicinity of existing and
future high-current and high-brightness particle accelerators such
as high-power proton drivers and 4th-generation light sources, i.e.,
X-ray free electron lasers. One can distinguish between two broad
classes of problems in this field: real or complex eigenvalue
problems in the context of cavity designs, and relativistic
particle tracking in 3D time-dependent electromagnetic fields from
Maxwell's equations. Here the source terms, i.e., the time-dependent
charge distribution, must be explicitly modeled with high
accuracy. Also of great importance is the efficient handling of
datasets in the terabyte range; our HDF5-based approach
will be outlined. Our workhorses are a massively parallel
particle-in-cell code and a finite-element-based eigenmode
solver. I will talk about our implementations and show
some results. Ongoing projects are time-dependent
hp-finite-element-based particle codes; here I will sketch our
ideas.
Andreas Adelmann
Paul Scherrer
[email protected]
MS13
Parallel Particle-In-Cell (PIC) Simulation on Hybrid Meshes
Particle-In-Cell (PIC) codes have become an essential tool for
the numerical simulation of many physical phenomena involving
charged particles, in particular beam physics,
space and laboratory plasmas including fusion plasmas.
Genuinely kinetic phenomena can be modeled by the
Vlasov-Maxwell equations, which are discretized by a PIC
method coupled to a Maxwell field solver. Today's and future
massively parallel supercomputers make it possible to envision
the simulation of realistic problems involving complex geometries
and multiple scales. To achieve this efficiently we
propose to couple a Finite Element Maxwell solver with
particles on hybrid grids with several homogeneous zones
having their own structured or unstructured mesh type
and size. This allows in particular fast particle tracking in
zones having a structured mesh, but requires careful analysis of load
balancing issues for efficient parallelization. Our
latest progress towards this goal will be presented.
Latu Guillaume
Strasbourg [email protected]
MS13
Parallel Smoothed Aggregation Multigrid for Large Scale
Electromagnetic Simulations
We present a new AMG preconditioner for linear systems arising
from edge element discretization of the eddy current equations.
The linear system is implicitly transformed into a 2 × 2 block
system whose diagonal blocks are an edge Hodge Laplacian and a nodal
scalar Laplacian, respectively. Solving the edge Hodge Laplacian
involves matrix-free smoothing and a specialized restriction to a
coarse nodal problem. We present three-dimensional computational
results on twenty thousand Cray XT3 processors.
Pavel Bochev
Sandia National Laboratories
Computational Math and
[email protected]
Ray S. Tuminaro
Sandia National Laboratories
Computational Mathematics and [email protected]
Jonathan J. Hu
Sandia National Laboratories
Livermore, CA
[email protected]
Chris Siefert
Sandia National [email protected]
MS13
Parallel Auxiliary Space AMG for Maxwell Problems
In this talk we will discuss the implementation and performance
of an auxiliary space based algebraic solver for
definite Maxwell problems, discretized with edge elements.
The algorithm is based on a recent theoretical result by
Hiptmair and Xu, and utilizes two internal Algebraic
Multigrid (AMG) V-cycles: one for a scalar and one for
a vector Poisson-like matrix. The parallel scalability of
this approach is directly tied to the AMG performance on
Poisson problems.
Tzanio V. Kolev, Panayot S. Vassilevski
Center for Applied Scientific Computing
Lawrence Livermore National
[email protected], [email protected]
MS14
Scalability Infrastructure for the Lustre File System
This paper describes low-level infrastructure in the Lustre file
system that addresses scalability in very large clusters. The
features deal with I/O and networking, lock management, recovery
and failure, and other scalability-related issues.
Peter Braam
Cluster File [email protected]
MS14
High End Computing File Systems and I/O (HEC FSIO): Coordinating
the US Government Research Investments
The High End Computing Interagency Working Group
(HEC IWG) is chartered with coordinating US Government
investments in Research and Development (R&D) for
HEC. The HEC FSIO Technical Advisory Group (TAG) is
chartered with providing guidance to the HEC IWG in the
area of File Systems and I/O (FSIO). The HEC FSIO research
needs and priorities will be discussed. Also, the
current portfolio of 28 research projects will be reviewed.
Additionally, the future direction for the HEC FSIO area
will be outlined, including programs for taking research outcomes
into products and a discussion of a new round of
government-sponsored research to continue to feed the R&D
pipeline in this area.
Gary Grider
Los Alamos National [email protected]
MS14
I/O Architectures for Petascale Computing
Production high-performance storage systems today are
typically constructed from enterprise storage hardware,
with a parallel file system such as GPFS, Lustre, or PVFS
tying this hardware into a coherent whole. As we move into
the petascale regime, constructing storage systems in this
way is becoming problematic. In this talk we will discuss
some of the challenges in storage at petascale, particularly
in reliability and performance, and examine hardware and
software options that can help us construct effective storage
systems at this extreme scale.
Rob Ross
Argonne National [email protected]
MS14
Structured Streams: Data Services for Petascale Science
Environments
The challenge of meeting the I/O needs of petascale
applications is exacerbated by an emerging class of data-intensive
HPC applications that requires annotation, reorganization,
or even conversion of their data. We introduce
an end-to-end approach to meeting these requirements.
The Structured Streaming Data System (SSDS)
enables high-performance data movement or manipulation between
the compute and service nodes of the petascale
machine and between/on service nodes and ancillary machines.
This talk describes the SSDS architecture, motivating
its design decisions and intended application uses.
Performance claims are supported with experiments benchmarking
the underlying software layers of SSDS, as well as
application-specific usage scenarios.
Karsten Schwan
College of Computing
Georgia Institute of
[email protected]
MS15
An Empirical Investigation of Generating Parallel Quasirandom
Sequences by Using Different Scrambling Methods
Quasi-Monte Carlo (QMC) methods are now widely used
in scientific computation. The use of randomized QMC
methods, where randomness can be brought to bear on
quasirandom sequences through scrambling and other related
randomization techniques, broadens the range of applications
for QMC. Scrambled QMC offers a natural way
to generate parallel sequences. QMC applications have
high degrees of parallelism, can tolerate large latencies, and
usually require considerable computational effort, making
them extremely well suited for grid computing. Parallel
computations using QMC require a source of quasirandom
sequences, which are distributed among the individual parallel
processes. However, the integration variance can depend
strongly on the scrambling methods. Much of the
work dealing with scrambling methods has focused on
linear scrambling methods. In this paper, we take
a close look at the quadratic scrambling method for Halton
sequences in generating parallel sequences.
Hongmei Chi
Computer Science
Florida A&M
[email protected]
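For readers unfamiliar with scrambled Halton sequences, the sketch below generates Halton points and applies a simple linear digit scrambling, in which every digit d is replaced by (w * d) mod b for a multiplier w coprime to the base b. This illustrates the linear family the abstract above contrasts with; the quadratic scrambling studied in the paper is not reproduced here. The function names and the choice of multipliers are assumptions for illustration only.

```python
def radical_inverse(i, base, mult=1):
    """Radical inverse of i in `base`, mapping each digit d to (mult * d) % base."""
    inv, scale = 0.0, 1.0 / base
    while i > 0:
        i, d = divmod(i, base)
        inv += ((mult * d) % base) * scale
        scale /= base
    return inv

def halton(i, bases=(2, 3), mults=None):
    """i-th Halton point; `mults` holds one scrambling multiplier per base."""
    mults = mults or (1,) * len(bases)
    return tuple(radical_inverse(i, b, m) for b, m in zip(bases, mults))

# Plain Halton points in two dimensions.
plain = [halton(i) for i in range(1, 5)]

# One way to obtain distinct streams for parallel processes: give each
# process its own multipliers (hypothetical values, coprime to the bases).
stream_a = [halton(i, bases=(3, 5), mults=(1, 2)) for i in range(1, 5)]
stream_b = [halton(i, bases=(3, 5), mults=(2, 3)) for i in range(1, 5)]
```

Because the digit map is a permutation that fixes 0, every scrambled stream still visits each base-b cell once per block, while different multipliers give different point orderings.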
MS15
Estimation of Migration Rates and Effective Population Numbers
by Using Importance Sampling
Now that coalescence theory is widely used to explore diversity
among populations and species in population genetics
(phylogenetics), the computation of the likelihood or posterior
distribution of the population genetics parameters is a common
task in computational biology. The numerical results of
these approaches can be obtained by Monte Carlo simulations. This
paper focuses on exploring the use of uniform
random sequences, more specifically, completely uniformly
distributed sequences, to calculate the likelihood with the
help of importance sampling. We demonstrate by examples
that quasi-Monte Carlo can be a viable alternative to
the Monte Carlo methods in population genetics. Analysis
of a simple one-population problem in this paper showed
that quasi-Monte Carlo methods achieve parameter estimates
as good as or better than standard Monte Carlo, but have
the potential to converge faster and so reduce the computational
burden.
Peter Beerli
Florida State University
School of Computational
[email protected]
Hongmei Chi
Computer Science
Florida A&M
[email protected]
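The combination of importance sampling with a deterministic uniform sequence, as discussed in the abstract above, can be shown on a toy integral. The example is not a population genetics likelihood: it estimates the integral of g(x) = x exp(-x) over [0, infinity), whose exact value is 1, using an exponential proposal density with rate 0.5 driven by a base-2 van der Corput sequence. All names and parameters here are assumptions chosen for illustration.

```python
import math

def van_der_corput(i, base=2):
    """i-th element of the van der Corput low-discrepancy sequence."""
    inv, scale = 0.0, 1.0 / base
    while i > 0:
        i, d = divmod(i, base)
        inv += d * scale
        scale /= base
    return inv

def importance_estimate(g, q_pdf, q_inv_cdf, uniforms):
    """Average of g(x) / q(x) with x drawn from q by inverse-CDF sampling."""
    total, count = 0.0, 0
    for u in uniforms:
        x = q_inv_cdf(u)
        total += g(x) / q_pdf(x)
        count += 1
    return total / count

rate = 0.5
g = lambda x: x * math.exp(-x)                 # integrand, exact integral is 1
q_pdf = lambda x: rate * math.exp(-rate * x)   # exponential proposal density
q_inv_cdf = lambda u: -math.log(1.0 - u) / rate

uniforms = [van_der_corput(i) for i in range(1, 4097)]
est = importance_estimate(g, q_pdf, q_inv_cdf, uniforms)
```

Replacing `uniforms` with pseudorandom numbers gives the standard Monte Carlo estimator, so the two approaches can be compared on the same problem.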
MS15
Hybrid Parallel Tempering and Simulated Annealing Method in
Rosetta Practice
We applied our recently developed hybrid Parallel Tempering
(PT)/Simulated Annealing (SA) method to the
Rosetta program. The hybrid PT/SA method is an effective
global optimization algorithm that overcomes the slow
convergence in low-temperature protein simulation by initiating
multiple systems to run at multiple slowly decreasing
temperature levels (SA scheme) and randomly switch with
neighbor temperature levels (PT scheme). The PT scheme
can significantly enhance the relaxation rate in the SA
search. With the hybrid PT/SA method's fast barrier-crossing
capability, we expect to achieve resolution improvement compared
to the original Rosetta program. Our preliminary results
show that the Rosetta fragment assembly implementation
using the hybrid PT/SA method has a broader exploration of
the protein folding scoring function landscape and exhibits
a 0.2 to 4.0 Å shift toward the native structure in most of the
Rosetta benchmark proteins. Our analysis and computational re