Page 1

Experiences with a large-memory HP cluster – performance on benchmarks and genome codes

Craig A. Stewart (stewart@iu.edu)
Executive Director, Pervasive Technology Institute
Associate Dean, Research Technologies
Associate Director, CREST

Robert Henschel
Manager, High Performance Applications, Research Technologies/PTI

William K. Barnett
Director, National Center for Genome Analysis Support
Associate Director, Center for Applied Cybersecurity Research, PTI

Thomas G. Doak
Department of Biology

Indiana University

Page 2

License terms

• Please cite this presentation as: Stewart, C.A., R. Henschel, W. K. Barnett, T.G. Doak. 2011. Experiences with a large-memory HP cluster – performance on benchmarks and genome codes. Presented at: HP-CAST 17 - HP Consortium for Advanced Scientific and Technical Computing World-Wide User Group Meeting. Renaissance Hotel, 515 Madison Street, Seattle WA, USA, November 12th 2011. http://hdl.handle.net/2022/13879

• Portions of this document that originated from sources outside IU are shown here and used by permission or under licenses indicated within this document. Items indicated with a © are under copyright and may not be reused without permission from the holder of copyright, except where license terms noted on a slide permit reuse.

• Except where otherwise noted, the contents of this presentation are copyright 2011 by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.


Page 3

The human genome project was just the start

• More sequencing will help us:
  – Understand the basic building blocks and mechanisms of life
  – Improve crop yields or disease resistance by genetic modification, or add new nutrients to crops
  – Understand disease variability by mapping the genetic variation of diseases such as cancer, or by studying the human microbiome and how it interacts with us and our conditions
  – Create personalized treatments for various illnesses by understanding human genetic variability, or create gene therapies as new treatments
  – Really begin to understand genome variability as a population issue, as well as having at hand the genome of one particular individual

Page 4

Genome sequencer model | Year introduced | Raw image data per run | Data products | Sequence per run | Read length
Doctoral student (hard working) as sequencer | Circa 1980s | n.a. | Several exposed films/day on a good day | 2 Kbp | 100-200 nt
ABI 3730 | 2002 | 0.03 GB | 2 GB/day | 60 Kbp | 800 nt
454 Titanium | 2005 | 39 GB | 9 GB/day | 500 Mbp | 400 nt
Illumina-Solexa G1 | 2006 | 600 GB | 100 GB/day | 50 Gbp | 300 nt
ABI SOLiD 4 | 2007 | 680 GB | 25 GB/day | 70 Gbp | 90 nt
Illumina HiSeq 2000 | 2010 | 600 GB | 150 GB/day | 200 Gbp | 200 nt

Evolution of sequencers over time

Page 5

Date | Cost per Mb of DNA sequence | Cost per human genome
March 2002 | $3,898.64 | $70,175,437
April 2004 | $1,135.70 | $20,442,576
April 2006 | $651.81 | $11,732,535
April 2008 | $15.03 | $1,352,982
April 2010 | $0.35 | $31,512

Cost of sequencing over time

Page 6

Mason – a HP ProLiant DL580 G7

• 16-node cluster
• 10GE interconnect
  – Cisco Nexus 7018
  – Compute nodes are oversubscribed 4:1
  – This is the same switch that we use for DC and other 10G-connected equipment
• Quad-socket nodes
  – 8-core Xeon L7555, 1.87 GHz base frequency
  – 32 cores per node
  – 512 GB of memory per node (aggregate totals are tallied in the sketch below)
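For quick reference, the per-node figures above imply the following aggregate totals for the cluster; this is a minimal tally using only the numbers on this slide.

```python
# Aggregate resources implied by the per-node figures on this slide:
# 16 nodes, 4 sockets x 8 cores per node, 512 GB of memory per node.
nodes = 16
sockets_per_node = 4
cores_per_socket = 8
mem_per_node_gb = 512

total_cores = nodes * sockets_per_node * cores_per_socket                   # 512 cores
total_mem_tb = nodes * mem_per_node_gb / 1024                               # 8 TB of RAM
mem_per_core_gb = mem_per_node_gb / (sockets_per_node * cores_per_socket)   # 16 GB per core

print(f"{total_cores} cores, {total_mem_tb:.0f} TB RAM, {mem_per_core_gb:.0f} GB per core")
```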

Page 7

Why 512 GB – sweet spot in RAM requirements

Largest genome that can be assembled on a computer with 512 GB of memory, assuming maximum memory usage and k-mers of 20 bp

“Memory is a problem that admits only money as a solution” – David Moffett

Application | Genome / genome size | RAM required (per node)
ABySS | Various plant genomes | 7.8 GB per node (distributed-memory parallel code) [based on McCombie lab at CSHL]
SOAPdenovo | Panda / 2.3 Gbp | 512 GB
SOAPdenovo | Human gut metagenome / est. 4 Gbp | 512 GB
SOAPdenovo | Human genome | 150 GB
Velvet | Honeybee / ~300 Mbp | 128 GB
Velvet | Daphnia / 200 Mbp | > 512 GB [based on runs by Lynch lab at IU]
Velvet | Duckweed / 150 Mbp | > 512 GB [based on McCombie lab at CSHL]

Coverage | Maximum assemblable genome (Gbp) | Percentile of genome size distribution (plant) | Percentile of genome size distribution (animal)
20x | 1.3 | 32 | 44
40x | 0.6 | 16 | 15
60x | 0.4 | 7 | 9
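As a rough illustration of where the coverage table comes from: if assembler memory scales with the number of k-mers held in the graph (approximately coverage times genome size) and each k-mer entry costs a fixed number of bytes, then 512 GB caps the assemblable genome size as shown. The 20-bytes-per-k-mer constant below is an assumption chosen because it roughly reproduces the table, not a figure stated in this presentation.

```python
# Back-of-the-envelope model for the coverage table above, assuming
# memory ~ (number of k-mers) * bytes_per_kmer and
# number of k-mers ~ coverage * genome_size.
RAM_BYTES = 512e9
BYTES_PER_KMER = 20  # assumed constant; not taken from the slide

def max_genome_gbp(coverage: int) -> float:
    """Largest genome (in Gbp) whose k-mers fit in RAM at this coverage."""
    max_kmers = RAM_BYTES / BYTES_PER_KMER
    return max_kmers / coverage / 1e9

for cov in (20, 40, 60):
    print(f"{cov}x coverage -> ~{max_genome_gbp(cov):.1f} Gbp")
# 20x -> ~1.3 Gbp, 40x -> ~0.6 Gbp, 60x -> ~0.4 Gbp, in line with the table
```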

Page 8

Community trust matters

Right now, the codes most trusted by the community require large amounts of memory in a single namespace. More side-by-side testing may lead to more trust in the distributed-memory codes, but for now….

Application | Year initially published | Number of citations as of August 2010

de Bruijn graph methods
ABySS | 2009 | 783
EULER | 2001 | 1870
SOAPdenovo | 2010 | 254
Velvet | 2008 | 1420

Overlap/layout/consensus
Arachne 2 | 2003 | 578
Celera Assembler | 2000 | 3579
Newbler | 2005 | 999

Page 9

The National Center for Genome Analysis Support

• Dedicated to supporting life science researchers who need computational support for genomics analysis

• Initially funded by the National Science Foundation Advances in Biological Informatics (ABI) program, grant no. 1062432

• A Cyberinfrastructure Service Center affiliated with the Pervasive Technology Institute at Indiana University (http://pti.iu.edu)

• Provides support for genomics analysis software on supercomputers customized for genomics studies, including Mason and systems that are part of XSEDE
• Provides distributions of hardened versions of popular codes
• Particularly dedicated to genome assembly codes such as:
  – de Bruijn graph methods: SOAPdenovo, Velvet, ABySS
  – consensus methods: Celera, Newbler, Arachne 2

• For more information, see http://ncgas.org

Page 10

Benchmark overview

• High Performance Computing Challenge Benchmark (HPCC)
  – http://icl.cs.utk.edu/hpcc/
• SPEC OpenMP
  – http://www.spec.org/omp2001/
• SPEC MPI
  – http://www.spec.org/mpi2007/

Page 11

High Performance Computing Challenge benchmark

Innovative Computing Laboratory at the University of Tennessee (Jack Dongarra and Piotr Luszczek)

• Announced at Supercomputing 2004
• Version 1.0 available June 2005
• Current: Version 1.4.1
• Our results are not yet published, because we are unsatisfied with the 8-node HPCC runs. This is likely due to our 4:1 oversubscription of the switch.

Page 12

High Performance Computing Challenge benchmark

• Raw results, 1 to 16 nodes

# Nodes | # CPUs | # Cores | G-HPL (TFLOP/s) | G-PTRANS (GB/s) | G-FFTE (GFLOP/s) | G-RandomAccess (Gup/s) | G-STREAM (GB/s) | EP-STREAM (GB/s) | EP-DGEMM (GFLOP/s) | Random Ring Bandwidth (GB/s) | Random Ring Latency (usec) | % HPL Peak
16 | 64 | 512 | 3.383 | 5.112 | 17.962 | 0.245 | 1084.682 | 2.119 | 7.158 | 0.010 | 229.564 | 82.59
8 | 32 | 256 | 1.608 | 2.648 | 8.938 | 0.1534 | 549.184 | 2.145 | 7.136 | 0.011 | 169.079 | 78.51
4 | 16 | 128 | 0.847 | 1.575 | 5.356 | 0.123 | 267.297 | 2.088 | 7.141 | 0.014 | 119.736 | 82.66
2 | 8 | 64 | 0.424 | 3.545 | 10.095 | 0.152 | 137.790 | 2.153 | 7.128 | 0.083 | 71.988 | 82.85
1 | 4 | 32 | 0.222 | 6.463 | 11.542 | 0.225 | 66.936 | 2.092 | 7.157 | 0.324 | 3.483 | 86.78
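The "% HPL Peak" column lets us back out the theoretical peak the percentages were computed against: roughly 4.1 TFLOP/s for the full system, or about 8 GFLOP/s per core at every scale. A minimal sketch of that arithmetic, using only values from the table:

```python
# Theoretical peak implied by the HPL numbers above:
# peak = measured G-HPL / (HPL efficiency / 100).
runs = [  # (nodes, cores, G-HPL in TFLOP/s, % HPL peak), from the table
    (16, 512, 3.383, 82.59),
    (8, 256, 1.608, 78.51),
    (4, 128, 0.847, 82.66),
    (2, 64, 0.424, 82.85),
    (1, 32, 0.222, 86.78),
]

for nodes, cores, hpl_tflops, eff_pct in runs:
    peak_tflops = hpl_tflops / (eff_pct / 100.0)
    per_core_gflops = 1000.0 * peak_tflops / cores
    print(f"{nodes:2d} nodes: implied peak ~{peak_tflops:.2f} TFLOP/s "
          f"(~{per_core_gflops:.1f} GFLOP/s per core)")
```

Because the implied per-core peak is constant across runs, the efficiency dip at 8 nodes reflects the interconnect rather than the nodes themselves, consistent with the switch-oversubscription concern noted on the previous slide.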

Page 13

High Performance Computing Challenge benchmark

• HPL efficiency from 1 to 16 nodes
  – Highlighting our issue at 8 nodes
  – However, for a 10GE system, not so bad!

[Figure: HPL efficiency (%) vs. number of nodes (1, 2, 4, 8, 16); y-axis range 74-88%]

Page 14

SPEC benchmarks

• High Performance Group (HPG) of the Standard Performance Evaluation Corporation (SPEC)

• Robust framework for measurement
• Industry / education mix
• Result review before publication
• Fair use policy and its enforcement
• Concept of reference machine, base / peak runs, different datasets

Page 15

SPEC OpenMP

• Evaluates the performance of OpenMP applications (single node)
• Benchmark consists of 11 applications; medium and large datasets are available
• Our results:
  – Large and Medium: http://www.spec.org/omp/results/res2011q3/

Page 16

The SPEC OpenMP application suite

310.wupwise_m and 311.wupwise_l – quantum chromodynamics
312.swim_m and 313.swim_l – shallow water modeling
314.mgrid_m and 315.mgrid_l – multi-grid solver in 3D potential field
316.applu_m and 317.applu_l – parabolic/elliptic partial differential equations
318.galgel_m – fluid dynamics analysis of oscillatory instability
330.art_m and 331.art_l – neural network simulation of adaptive resonance theory
320.equake_m and 321.equake_l – finite element simulation of earthquake modeling
332.ammp_m – computational chemistry
328.fma3d_m and 329.fma3d_l – finite-element crash simulation
324.apsi_m and 325.apsi_l – temperature, wind, distribution of pollutants
326.gafort_m and 327.gafort_l – genetic algorithm code

Page 17

SPEC OpenMP Medium

Benchmark | Base Ref Time | Base Run Time (HT off) | Base Ratio (HT off) | Base Run Time (HT on) | Base Ratio (HT on)
310.wupwise_m | 6000 | 46.7 | 128583 | 37.3 | 160774
312.swim_m | 6000 | 84.1 | 71337 | 74.2 | 80847
314.mgrid_m | 7300 | 96.9 | 75332 | 87.7 | 83222
316.applu_m | 4000 | 29.9 | 133967 | 26.1 | 153288
318.galgel_m | 5100 | 115.0 | 44387 | 114.0 | 44802
320.equake_m | 2600 | 52.3 | 49691 | 47.9 | 54295
324.apsi_m | 3400 | 44.7 | 76128 | 46.5 | 73134
326.gafort_m | 8700 | 98.2 | 88571 | 109.0 | 79651
328.fma3d_m | 4600 | 88.3 | 52108 | 92.8 | 49543
330.art_m | 6400 | 31.1 | 205935 | 32.9 | 194318
332.ammp_m | 7000 | 152.0 | 45953 | 161.0 | 43469
SPECompMbase2001 | | | 78307 | | 80989

Hyper-Threading beneficial.
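For reference, each base ratio in the table is the reference time divided by the measured run time, multiplied by a fixed scaling constant (1000 reproduces the Medium ratios shown here), and the overall SPECompMbase2001 score is the geometric mean of the per-benchmark ratios. A small sketch that recomputes the hyper-threading-off score from the times above:

```python
from math import prod

# Recompute the hyper-threading-off SPECompMbase2001 score from the
# reference and run times in the table above. Each base ratio is
# (reference time / run time) * 1000, the scaling that reproduces the
# published Medium ratios; the overall score is the geometric mean.
ht_off = {  # benchmark: (base ref time, base run time with HT off)
    "310.wupwise_m": (6000, 46.7),
    "312.swim_m": (6000, 84.1),
    "314.mgrid_m": (7300, 96.9),
    "316.applu_m": (4000, 29.9),
    "318.galgel_m": (5100, 115.0),
    "320.equake_m": (2600, 52.3),
    "324.apsi_m": (3400, 44.7),
    "326.gafort_m": (8700, 98.2),
    "328.fma3d_m": (4600, 88.3),
    "330.art_m": (6400, 31.1),
    "332.ammp_m": (7000, 152.0),
}

ratios = [1000 * ref / run for ref, run in ht_off.values()]
score = prod(ratios) ** (1 / len(ratios))
print(f"SPECompMbase2001 (HT off) ~ {score:.0f}")  # ~78260 vs. the published 78307
```

The small gap from the published 78307 comes from the rounded run times in the table; the Large results on a later slide imply a different scaling constant for the Large reference machine.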

Page 18

[Figure: System comparison of SPEC OpenMP Medium scores (SPEC score, scale 0-200,000):
  HP Integrity, 32 threads, Itanium 2, 1.5 GHz, Sep. 03
  HP Integrity, 64 threads, Itanium 2, 1.5 GHz, Sep. 03
  Sun SPARC, 64 threads, SPARC64, 2.5 GHz, Jul. 08
  HP DL580, 32 threads, Xeon, 1.8 GHz, Apr. 11
  IBM p570, 32 threads, Power6, 4.7 GHz, May 07
  SGI UV, 64 threads, Xeon, 2.2 GHz, Mar. 10
  IBM p595, 128 threads, Power6, 5 GHz, Jun. 08]

Page 19

SPEC OpenMP Large

Benchmark | Base Ref Time | Base Run Time (HT off) | Base Ratio (HT off) | Base Run Time (HT on) | Base Ratio (HT on)
311.wupwise_l | 9200 | 203 | 723729 | 211 | 697727
313.swim_l | 12500 | 602 | 331955 | 628 | 318338
315.mgrid_l | 13500 | 518 | 416695 | 528 | 409378
317.applu_l | 13500 | 562 | 384230 | 590 | 366221
321.equake_l | 13000 | 575 | 361456 | 542 | 383513
325.apsi_l | 10500 | 271 | 620871 | 286 | 587380
327.gafort_l | 11000 | 391 | 450003 | 359 | 490814
329.fma3d_l | 23500 | 1166 | 322462 | 941 | 399786
331.art_l | 25000 | 290 | 1377258 | 277 | 1445765
SPECompLbase2001 | | | 493152 | | 504788

Hyper-Threading beneficial.

Page 20

SPEC MPI

• Evaluates the performance of MPI applications across the whole cluster
• Benchmark consists of 12 applications; medium and large datasets are available
• Our results are not yet published, as we are still lacking a 16-node run
  – We spent a lot of time on the 8-node run
  – We think we know the source of the problem; we just have not yet been able to fix it
  – The problem is the result of rational configuration choices, and it does not impact our primary intended uses of the system

Page 21

SPEC MPI

• Scalability study, 1 to 8 nodes
  – Preliminary results, not yet published by SPEC

[Figure: SPEC MPI scalability, 1 to 8 nodes (SPEC score vs. node count) for the IU HP DL580, Intel Endeavor, and Intel Atlantis]

Endeavor: Intel Xeon X5560, 2.80 GHz, IB, Feb 2009
Atlantis: Intel Xeon X5482, 3.20 GHz, IB, Mar 2009
IU/HP: Intel Xeon L7555, 1.87 GHz, 10 GigE

Page 22

Early users overview

• Metagenomic Sequence Analysis
• Genome Assembly and Annotation
• Genome Informatics for Animals and Plants
• Imputation of Genotypes and Sequence Alignment
• Daphnia Population Genomics

Page 23

Metagenomic Sequence Analysis

Yuzhen Ye's Lab (IUB School of Informatics)

• Environmental sequencing
  – Sampling DNA sequences directly from the environment
  – Since the sequences consist of DNA fragments from hundreds or even thousands of species, the analysis is far more difficult than traditional sequence analysis that involves only one species
• Assembling metagenomic sequences and extracting genes from the assembled dataset
• Dynamic programming is used to find the optimal mapping of consecutive contigs out of the assembly
• Since the number of contigs is enormous for most metagenomic datasets, a large-memory computing system is required to hold the dynamic programming tables so the analysis can be completed in polynomial time

Page 24

Genome Assembly and Annotation

Michael Lynch's Lab (IUB Department of Biology)

• Assembles and annotates genomes in the Paramecium aurelia species complex in order to eventually study the evolutionary fates of duplicate genes after whole-genome duplication. The project has also been performing RNAseq on each genome, which is currently being used to aid genome annotation and will later be used to detect expression differences between paralogs.

• The assembler used is based on an overlap-layout-consensus method rather than a de Bruijn graph method (as in some of the newer assemblers). It is more memory-intensive because it requires performing pairwise alignments between all pairs of reads (see the sketch below).

• Annotation of the genome assemblies involves programs such as GMAP, GSNAP, PASA, and Augustus. To use these programs, we need to load in millions of RNAseq and EST reads and map them back to the genome.
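To see why the all-against-all overlap step is so memory hungry, consider how the number of candidate read pairs grows with read count. This is a generic upper bound, not the lab's actual pipeline: real assemblers use k-mer indexes and filters rather than literally aligning every pair.

```python
# Quadratic growth of candidate read pairs in overlap-layout-consensus
# assembly. Upper bound only: practical assemblers index and filter
# candidate overlaps instead of examining every pair.
def candidate_pairs(n_reads: int) -> int:
    """Number of unordered read pairs: n * (n - 1) / 2."""
    return n_reads * (n_reads - 1) // 2

for n in (1_000_000, 10_000_000, 100_000_000):
    print(f"{n:>11,} reads -> {candidate_pairs(n):.3e} candidate pairs")
# 1e6 reads -> ~5.0e11 pairs; 1e8 reads -> ~5.0e15 pairs
```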

Page 25

Genome Informatics for Animals and Plants

Genome Informatics Lab (Don Gilbert) (IUB Department of Biology)

• This project finds genes in animals and plants, using the vast amounts of new gene information coming from next-generation sequencing technology. These improvements are applied to newly deciphered genomes of an environmental sentinel animal, the waterflea (Daphnia); the agricultural pest insect, the pea aphid; the evolutionarily interesting jewel wasp (Nasonia); and the chocolate bean tree (Th. cacao), bringing genomics insights to sustainable agriculture of cacao.

• Large-memory compute systems are needed for biological genome and gene-transcript assembly because assembling genomic DNA or gene RNA sequence reads (billions of fragments) into full genomic or gene sequences requires a minimum of 128 GB of shared memory, and more depending on the data set. These programs build graph matrices of sequence alignments in memory.

Page 26

Imputation of Genotypes and Sequence Alignment

Tatiana Foroud's Lab (IUPUI Department of Medical and Molecular Genetics)

• Studies complex disorders using imputation of genotypes, typically for genome-wide association studies, as well as sequence alignment and post-processing of whole-genome and whole-exome sequencing.

• Requires analysis of markers in a genetic region (such as a chromosome) in several hundred representative individuals genotyped for the full reference panel of SNPs, with extrapolation of the inferred haplotype structures.

• More memory allows the imputation algorithms to evaluate haplotypes across much broader genomic regions, reducing or eliminating the need to partition chromosomes into segments. This results in imputed genotypes with both increased accuracy and speed, allowing improved evaluation of detailed within-study results as well as better communication and collaboration (including meta-analysis) with other researchers using the disease study results.

Page 27

Daphnia Population Genomics

Michael Lynch's Lab (IUB Department of Biology)

• This project involves whole-genome shotgun sequences of more than 20 additional diploid genomes, each with a genome size >200 megabases. With each genome sequenced to over 30x coverage, the full project involves both the mapping of reads to a reference genome and the de novo assembly of each individual genome (see the estimate below).

• The genome assembly of millions of small reads often requires more memory than typical systems provide, for which we once turned to Dash at SDSC. With Mason now online at IU, we have been able to run our assemblies and analysis programs here at IU.
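For a sense of scale, the figures in the first bullet imply a lower bound on the raw sequence involved; this is a simple multiplication of the slide's numbers, each of which is a minimum.

```python
# Lower-bound estimate of raw sequence in the Daphnia population project,
# using the figures from this slide: >20 diploid genomes, >200 Mb each,
# each sequenced to >30x coverage.
genomes = 20
genome_size_bp = 200e6
coverage = 30

total_gbp = genomes * genome_size_bp * coverage / 1e9
print(f"At least ~{total_gbp:.0f} Gbp of raw sequence to map and assemble")  # ~120 Gbp
```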

Page 28

Mason as an example of effective campus bridging

• The goal of campus bridging is to make local, regional, and national cyberinfrastructure facilities appear as if they were peripherals to your laptop

• Mason is designed for a specific set of tasks that drive a different configuration than XSEDE (the eXtreme Science and Engineering Discovery Environment – http://xsede.org/)

• For more information on campus bridging: http://pti.iu.edu/campusbridging/

Page 29

Key points

• The increased amount of data and decreased k-mer length are driving growing demands for data analysis in genome assembly
• The codes the biological community trusts are the codes they trust. Over time, testing may enable more use of distributed-memory codes. But for now, if we want to serve the biological community most effectively, we need to implement systems that match their research needs now.
• In the performance analysis of Mason we found two outcomes of note:
  – There is a problem in our switch configuration, still not sorted out, that is causing odd HPL results; we will continue to work on it
  – The summary result on hyper-threading is "sometimes it helps, sometimes not"
• If we as a community are frustrated by the attention senior administrators give to placement on the Top500 list, and by how that affects system configuration, we need to take more time to publish SPEC and/or HPCC benchmark results.
  – Much of the time this may mean "we got what we expected." But more data will make it easier to identify and understand results we don't expect.
• By implementing Mason – a lot of memory with some processors attached to it – we have enabled research that would otherwise not be possible.

Page 30

Absolutely Shameless Plugs

• XSEDE12: Bridging from the eXtreme to the campus and beyond. July 16-20, 2012 | Chicago

• The XSEDE12 Conference will be held at the beautiful Intercontinental Chicago (Magnificent Mile) at 505 N. Michigan Ave. The hotel is in the heart of Chicago's most interesting tourist destinations and best shopping.

• Watch for Calls for Participation – coming early January

• And please visit the XSEDE and IU displays in the SC11 Exhibition Hallway!

Page 31

Thanks

• Thanks for the invitation: Dr. Frank Baetke, Eva-Marie Markert, and HP
• Thanks to HP, particularly James Kovach, for partnership efforts over many years, including the implementation of Mason.
• Staff of the Research Technologies Division of University Information Technology Services, affiliated with the Pervasive Technology Institute, who led the implementation of Mason and the benchmarking activities at IU: George Turner, Robert Henschel, David Y. Hancock, Matthew R. Link

• Our many collaborators in the Pervasive Technology Institute, particularly the co-PIs of NCGAS: Michael Lynch, Matthew Hahn, and Geoffrey C. Fox

• Those involved in campus bridging activities: Guy Almes, Von Welch, Patrick Dreher, Jim Pepin, Dave Jent, Stan Ahalt, Bill Barnett, Therese Miller, Malinda Lingwall, Maria Morris, Gabrielle Allen, Jennifer Schopf, Ed Seidel

• All of the IU Research Technologies and Pervasive Technology Institute staff who have contributed to the development of IU’s advanced cyberinfrastructure and its support

• NSF for funding support (Awards 040777, 1059812, 0948142, 1002526, 0829462, 1062432, OCI-1053575 – which supports the Extreme Science and Engineering Discovery Environment)

• Lilly Endowment, Inc. and the Indiana University Pervasive Technology Institute
• Any opinions presented here are those of the presenter and do not necessarily represent the opinions of the National Science Foundation or any other funding agencies

Page 32

Thank you!

• Questions and discussion?
