Sequencing Genomics: The New Big Data Driver IntermezzoTalk SURFnet7, Part of GigaPort3 Utrecht, Netherlands December 7, 2011 Dr. Larry Smarr Director, California Institute for Telecommunications and Information Technology Harry E. Gruber Professor, Dept. of Computer Science and Engineering Jacobs School of Engineering, UCSD http://lsmarr.calit2.net

Mar 27, 2015

Transcript
Page 1:

Sequencing Genomics: The New Big Data Driver

Intermezzo Talk

SURFnet7, Part of GigaPort3

Utrecht, Netherlands

December 7, 2011

Dr. Larry Smarr

Director, California Institute for Telecommunications and Information Technology

Harry E. Gruber Professor,

Dept. of Computer Science and Engineering

Jacobs School of Engineering, UCSD

http://lsmarr.calit2.net

Page 2:

Cost Per Megabase in Sequencing DNA is Falling Much Faster Than Moore’s Law

www.genome.gov/sequencingcosts/

Page 3:

Genomic Sequencing is Driving Big Data

November 30, 2011

Page 4:

BGI (the Beijing Genomics Institute) is the World’s Largest Genomic Institute

• Main Facilities in Shenzhen and Hong Kong, China
  – Branch Facilities in Copenhagen, Boston, UC Davis

• 137 Illumina HiSeq 2000 Next Generation Sequencing Systems
  – Each Illumina Next Gen Sequencer Generates 25 Gigabases/Day

• Supported by Supercomputing: ~160TF, 33TB Memory
  – Large-Scale (12PB) Storage
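As a rough check on the scale these figures imply, the sketch below multiplies the two numbers quoted on the slide. It assumes, purely for illustration, that all 137 instruments run at the quoted 25 gigabases/day on every day of the year; the variable names are invented for this example.

```python
# Figures quoted on the slide.
SEQUENCERS = 137
GIGABASES_PER_SEQUENCER_PER_DAY = 25

# Assumption for illustration: every instrument runs at the quoted rate,
# 365 days a year, with no downtime.
gigabases_per_day = SEQUENCERS * GIGABASES_PER_SEQUENCER_PER_DAY
gigabases_per_year = gigabases_per_day * 365

print(f"{gigabases_per_day:,} gigabases/day")    # 3,425 gigabases/day
print(f"{gigabases_per_year:,} gigabases/year")  # 1,250,125 gigabases/year
```

Under those assumptions the fleet would produce about 3.4 terabases per day, or roughly 1.25 petabases per year.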

Page 5:

Next Generation Genome Sequencers Produce Large Data Sets

Source: Chris Misleh, SOM/Calit2 UCSD

Page 6:

Needed: Interdisciplinary Teams Made From Computer Science, Data Analytics, and Genomics

“We believe the field of bioinformatics for genetic analysis will be one of the biggest areas of disruptive innovation in life science tools over the next few years,” --Isaac Ro, an analyst at Goldman Sachs

Page 7:

Calit2 Brings Together Computer Science and Bioinformatics

National Biomedical Computation Resource, an NIH-supported resource center

Page 8:

Single Nucleotide Polymorphisms (SNPs): Human DNA Base Pairs May Differ At Some Points

Person A

Person B

http://en.wikipedia.org/wiki/File:Dna-SNP.svg
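The idea on this slide can be sketched in a few lines: given two aligned sequences of equal length, a single-base difference is any position where the letters disagree. The two short sequences below are invented for illustration and are not taken from the slide's figure; `snp_positions` is a hypothetical helper name, and real genotyping works on aligned reads rather than raw strings.

```python
def snp_positions(seq_a, seq_b):
    """Return (index, base_a, base_b) for each position where two
    equal-length, already-aligned sequences differ by a single base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical example sequences (not real genomic data).
person_a = "AAGCCTA"
person_b = "AAGCTTA"

print(snp_positions(person_a, person_b))  # [(4, 'C', 'T')]
```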

Page 9:

Why We Study SNPs

99.9% of One’s Individual DNA Sequence will be Identical to that of Another Person.

Of the 0.1% Difference, Over 80% will be Single Nucleotide Polymorphisms (SNPs).

http://shop.perkinelmer.com/content/snps/genotyping.asp
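To put these percentages into absolute numbers, the sketch below applies them to a genome length. The genome size of roughly 3.2 billion base pairs is an assumption that does not appear on the slide; the 0.1% and 80% figures are the slide's own. Integer arithmetic is used so the result is exact.

```python
# Assumption (not on the slide): a human genome of roughly 3.2 billion base pairs.
GENOME_BASE_PAIRS = 3_200_000_000

# From the slide: about 0.1% of the sequence differs between two people,
# and over 80% of those differences are SNPs.
differing_positions = GENOME_BASE_PAIRS // 1000      # 0.1% of the genome
snp_lower_bound = differing_positions * 80 // 100    # 80% of the differences

print(f"{differing_positions:,} differing positions")  # 3,200,000 differing positions
print(f"at least {snp_lower_bound:,} SNPs")            # at least 2,560,000 SNPs
```

Under that assumption, two people would differ at a few million positions, most of them SNPs.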

Page 10:

Consumer Companies Provide Your SNPs

www.23andme.com

Page 11:

Cost of Sequencing Human Genome is Rapidly Becoming Affordable

Page 12:

The Rise of Individual and Societal Genomic Testing: Promise and Concerns

www.technologyreview.com/biomedicine/25218/

Page 13:

Publicly Sharing Your Genome and Medical Records: Is it Crazy or the Future?

Page 14:

From 10,000 Human Genomes Sequenced in 2011 to 1 Million by 2015 Out of Less Than 5,000 sq. ft.!

4 Million Newborns / Year in U.S.
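The growth rate implied by this slide can be worked out directly. The sketch below assumes smooth exponential growth between the two quoted data points (10,000 genomes in 2011, 1 million in 2015), which is a simplification rather than anything the slide claims; the variable names are illustrative.

```python
import math

# Figures quoted on the slide.
genomes_2011 = 10_000
genomes_2015 = 1_000_000
newborns_per_year_us = 4_000_000
years = 2015 - 2011

# Assumption: smooth exponential growth between the two data points.
annual_factor = (genomes_2015 / genomes_2011) ** (1 / years)
doubling_months = 12 * math.log(2) / math.log(annual_factor)

print(round(annual_factor, 2))    # 3.16
print(round(doubling_months, 1))  # 7.2

# A capacity of 1 million genomes, set against 4 million U.S. newborns per year.
print(genomes_2015 / newborns_per_year_us)  # 0.25
```

A hundredfold increase over four years corresponds to roughly a tripling every year, or a doubling about every seven months.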

Page 15:

But the Human Genome Contains Less Than 1% of the Body’s Genes

http://commonfund.nih.gov/hmp/

The Total Number of These Bacterial Cells is 10 Times the Number of Human Cells in Your Body

Page 16:

The Human Microbiome is the Next Large NIH Drive to Understand Human Health and Disease

• “A majority of the bacterial sequences corresponded to uncultivated species and novel microorganisms.”

• “We discovered significant inter-subject variability.”

• “Characterization of this immensely diverse ecosystem is the first step in elucidating its role in health and disease.”

“Diversity of the Human Intestinal Microbial Flora,” Paul B. Eckburg, et al., Science (10 June 2005)

395 Phylotypes

Page 17:

The New Science of Metagenomics

“The emerging field of metagenomics, where the DNA of entire communities of microbes is studied simultaneously, presents the greatest opportunity -- perhaps since the invention of the microscope -- to revolutionize understanding of the microbial world.” -- National Research Council, March 27, 2007

NRC Report: Metagenomic data should be made publicly available in international archives as rapidly as possible.

Page 18:

Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis

http://camera.calit2.net/

Page 19:

Calit2 CAMERA: Over 4000 Registered Users From Over 80 Countries

Page 20:

Calit2 Microbial Metagenomics Cluster: Next Generation Optically Linked Science Data Server

Diagram labels:
512 Processors, ~5 Teraflops
~200 Terabytes Storage
1GbE and 10GbE Switched/Routed Core
~200TB Sun X4500 Storage
10GbE
4000 Users From 90 Countries

Source: Phil Papadopoulos, SDSC, Calit2
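To see why the 10GbE links matter for a store of this size, the sketch below estimates how long it would take to move the quoted ~200 terabytes over a single link. It assumes decimal terabytes, a link running at its full nominal rate, and no protocol overhead, so real transfers would be slower; `transfer_hours` is a hypothetical helper written for this example.

```python
def transfer_hours(terabytes, gigabits_per_second):
    """Idealized time in hours to move `terabytes` (decimal TB) over a link
    running at its full nominal rate with no protocol overhead."""
    bits = terabytes * 10**12 * 8
    seconds = bits / (gigabits_per_second * 10**9)
    return seconds / 3600


print(round(transfer_hours(200, 10), 1))  # 44.4  (hours over 10GbE)
print(round(transfer_hours(200, 1), 1))   # 444.4 (hours over 1GbE)
```

Even in this best case, moving the full store takes close to two days at 10 Gbps and more than eighteen days at 1 Gbps.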

Page 21:

UCSD Planned Optical Networked Biomedical Researchers and Instruments

Diagram labels:
Cellular & Molecular Medicine West
National Center for Microscopy & Imaging
Biomedical Research
Center for Molecular Genetics
Pharmaceutical Sciences Building
Cellular & Molecular Medicine East
CryoElectron Microscopy Facility
Radiology Imaging Lab
Bioengineering
Calit2@UCSD
San Diego Supercomputer Center

• Connects at 10 Gbps:
  – Microarrays
  – Genome Sequencers
  – Mass Spectrometry
  – Light and Electron Microscopes
  – Whole Body Imagers
  – Computing
  – Storage

Page 22:

UCSD Campus Investment in Fiber Enables Big Data Science

Source: Philip Papadopoulos, SDSC, UCSD

Diagram labels:
OptIPortal Tiled Display Wall
Campus Lab Cluster
Digital Data Collections
N x 10Gb/s
Triton – Petascale Data Analysis
Gordon – HPD System
Cluster Condo
WAN 10Gb: CENIC, NLR, I2
GLIF
Scientific Instruments
DataOasis (Central) Storage
GreenLight Data Center

Page 23:

Visualization courtesy of Donna Cox, Bob Patterson, NCSA.

www.glif.is

SURFnet – a Global SuperNetwork Connecting to the Global Lambda Integrated Facility