HIG Project Overview August 31, 2012 Matthieu-P. Schapranow Hasso Plattner Institute Chair of Prof. Hasso Plattner
Page 1: High-Performance In-Memory Genome (HIG) Project

HIG Project Overview

August 31, 2012

Matthieu-P. Schapranow Hasso Plattner Institute

Chair of Prof. Hasso Plattner

Page 2: High-Performance In-Memory Genome (HIG) Project

Vision: Real-time Analysis of Genomic Data to Improve Medical Treatment

HIG Project Overview, M. Schapranow, Aug 31, 2012

Page 3: High-Performance In-Memory Genome (HIG) Project

Build up the Whole Picture out of Layers

■  Data:

□  Combine research findings from international scientific databases in a single system at HPI

■  Platform:

□  Expose information as a service to be consumed by special purpose applications

■  Applications:

□  Support genome alignment pipeline processing through massively parallel execution of alignment algorithms (e.g. BWA, Bowtie2) and variant calling

□  Analyze individual patient results (real-time annotations with combined data)

□  Analyze patient cohorts using individual filters

Page 4: High-Performance In-Memory Genome (HIG) Project

How the Vision Becomes Real

■  Platform:

□  Worker Framework: Enables parallel execution of tasks (alignment, variant calling) across node boundaries

□  Updating Framework: Periodically retrieves updates from international databases and automatically integrates them into the local store

■  Applications:

□  Alignment Coordinator: Submit alignment tasks and retrieve mutation lists, e.g. as CSV

□  Genome Browser: Interactive browsing in reference and specific patient genomes
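
The idea behind the Worker Framework (split the work, run the pieces in parallel, merge the results) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `split_into_chunks`, `run_parallel` and `gc_count_task` are hypothetical, a local thread pool stands in for the cluster nodes, and a trivial GC-count task stands in for a real alignment or variant-calling job. It is not the HIG implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_chunks(items: Sequence[str], n_chunks: int) -> List[List[str]]:
    """Split items into at most n_chunks contiguous, roughly equal chunks."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, rest = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rest else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def gc_count_task(reads: List[str]) -> List[int]:
    """Placeholder task (assumption): count G/C bases per read.

    A real worker would run an aligner or a variant caller here.
    """
    return [sum(base in "GC" for base in read) for read in reads]


def run_parallel(task: Callable[[List[str]], List[int]],
                 reads: Sequence[str],
                 n_workers: int = 4) -> List[int]:
    """Run task on each chunk in parallel and merge results in input order."""
    chunks = split_into_chunks(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in the order of the submitted chunks.
        per_chunk = list(pool.map(task, chunks))
    return [value for chunk_result in per_chunk for value in chunk_result]


if __name__ == "__main__":
    example_reads = ["ACGT", "GGCC", "ATAT", "CGCG", "AAAA"]
    print(run_parallel(gc_count_task, example_reads, n_workers=2))
```

Because the chunks are contiguous and the results are merged in submission order, the output lines up with the input reads regardless of which worker finishes first.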

Page 5: High-Performance In-Memory Genome (HIG) Project

Alignment Coordinator

■  Available Alignment Algorithms (and growing)

□  Bowtie2

□  Bowtie

□  BWA

□  TMAP

□  SNAP

□  MAQ

□  SOAP

Page 6: High-Performance In-Memory Genome (HIG) Project

Numbers you should know: Alignment Execution Time

■  One cell line ~600k reads / 110 MB

■  Pipeline: Alignment and variant calling

Property    | Traditional | HPI
Full Genome | No          | Yes
Cores       | 2 * 6 cores | 25 * 40 cores
Main Memory | 48 GB       | 25 TB
Runtime     | ~720        | ~40s
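
To put the comparison into proportion, the resource and runtime ratios can be worked out directly. The sketch below makes two assumptions that are not stated on the slide: the traditional runtime of ~720 is taken to be in seconds (the slide gives no unit), and 25 TB is treated as 25,000 GB. Note also that the two setups are not doing identical work (full genome: no vs. yes), so the speedup is indicative only.

```python
# Figures from the slide; "runtime_s" for the traditional setup ASSUMES seconds.
TRADITIONAL = {"cores": 2 * 6, "memory_gb": 48, "runtime_s": 720}
HPI = {"cores": 25 * 40, "memory_gb": 25 * 1000, "runtime_s": 40}

# How much more hardware the HPI cluster provides.
core_ratio = HPI["cores"] / TRADITIONAL["cores"]
memory_ratio = HPI["memory_gb"] / TRADITIONAL["memory_gb"]

# How much faster the pipeline finishes (valid only under the unit assumption).
speedup = TRADITIONAL["runtime_s"] / HPI["runtime_s"]

if __name__ == "__main__":
    print(f"cores:   {core_ratio:.1f}x")
    print(f"memory:  {memory_ratio:.1f}x")
    print(f"runtime: {speedup:.1f}x faster")
```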

Page 7: High-Performance In-Memory Genome (HIG) Project

Numbers you should know: History of the Human Genome Project

■  1984: Idea of a global Human Genome (HG) project discussed at Alta Summit: “DNA available on the Internet”

■  1990: HG project started in the US, planned for 15 years (3 billion USD funding)

■  2000: Rough draft of the HG announced

■  2003: Complete genome sequenced

■  2006: Chromosome 1 (chr1), the last and longest chromosome, sequenced

■  … what’s next?

Page 8: High-Performance In-Memory Genome (HIG) Project

Numbers you should know: Human Genome

Entity                          | Cardinality
Different Bases                 | 4 (A,C,G,T)
Base Pairs                      | 3.137 Bbp
Chromosomes                     | 23
Distinct Genes                  | 20k-25k
Amino Acids (coded as triplets) | 21
Proteins                        | 50k-300k

Taken from http://de.wikipedia.org/wiki/Code-Sonne
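
The phrase "coded as triplets" can be made concrete: with 4 bases there are 4^3 = 64 possible triplets (codons), which map onto a much smaller set of amino acids. The sketch below builds the standard genetic code table from its usual compact string representation and translates a short DNA sequence. The function name `translate` and the example sequence are illustrative choices. The standard table yields 20 amino acids plus a stop signal, i.e. 21 distinct symbols; the slide does not say how its figure of 21 is counted.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
# '*' marks a stop codon.
BASE_ORDER = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 64 triplets to its one-letter amino acid code.
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASE_ORDER, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence triplet by triplet (incomplete tail ignored)."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)
    )


if __name__ == "__main__":
    print(len(CODON_TABLE))                 # number of possible triplets
    print(len(set(CODON_TABLE.values())))   # distinct symbols incl. stop
    print(translate("ATGGCCTAA"))
```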

Page 9: High-Performance In-Memory Genome (HIG) Project

Numbers you should know: Comparison of Costs

[Figure: Comparison of Costs for Main Memory and Genome Analysis. Series: costs per megabyte RAM and costs per megabase sequencing. Y-axis: costs in USD, logarithmic scale from 0.01 to 10,000. X-axis: dates from 01.01.01 to 01.01.12 in four-month steps.]

Page 10: High-Performance In-Memory Genome (HIG) Project

Hardware Characteristics

■  1,000 core cluster, 25 TB main memory

■  Consists of 25 identical nodes:

□  40 physical cores (80 logical cores with Hyper-Threading)

□  1 TB main memory

□  Intel® Xeon® E7-4870

□  2.40 GHz

□  30 MB Cache

Page 11: High-Performance In-Memory Genome (HIG) Project

Customer Process as of Today

■  Tissue sequencing in context of cancer treatment

■  Complex, time-consuming, media breaks, manual steps

Page 12: High-Performance In-Memory Genome (HIG) Project

Project Objectives

■  Alignment of DNA reads (FASTQ) against reference genome (FASTA) → mapped reads

■  Real-time analysis of mapped reads

□  Detection of mutations (SNPs, INDELs)

□  Comparison of multiple tissues

□  Detection of similar clusters to identify correlations

■  Analysis of mutations

□  Identify mutations with scientific references (existing knowledge)

□  Detection of similar clusters to identify correlations

□  Identify genes and regulators for certain phenotypic characteristics, e.g. “fast running horses”
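
The first two objectives (reads in FASTQ format, alignment against a reference, detection of SNPs) can be illustrated with a deliberately naive Python sketch. Everything here is an assumption made for illustration: the function names `read_fastq`, `best_offset` and `call_snps`, the tiny reference and the three reads are invented, the brute-force alignment is a stand-in for real aligners such as BWA or Bowtie2, and the majority-vote pileup is a stand-in for a real variant caller. INDELs and base qualities are ignored.

```python
import io
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def best_offset(reference: str, read: str) -> int:
    """Brute force: reference offset with the fewest mismatches to the read."""
    def mismatches(offset: int) -> int:
        window = reference[offset:offset + len(read)]
        return sum(a != b for a, b in zip(window, read))
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def call_snps(reference: str, reads: Iterable[str],
              min_support: int = 2) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, observed_base) for supported mismatches."""
    pileup: Dict[int, Counter] = defaultdict(Counter)
    for read in reads:
        offset = best_offset(reference, read)
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1
    snps = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if base != reference[pos] and count >= min_support:
            snps.append((pos, reference[pos], base))
    return snps


# Invented example data: three reads that all carry C->T at position 8.
REFERENCE = "GATTACAGCTTGACCGTA"
FASTQ_TEXT = (
    "@r1\nACAGTTTG\n+\nIIIIIIII\n"
    "@r2\nAGTTTGAC\n+\nIIIIIIII\n"
    "@r3\nTTACAGTT\n+\nIIIIIIII\n"
)

if __name__ == "__main__":
    sequences = [seq for _, seq, _ in read_fastq(io.StringIO(FASTQ_TEXT))]
    print(call_snps(REFERENCE, sequences))
```

The `min_support` threshold is a crude guard against single-read sequencing errors; production variant callers use base qualities and statistical models instead.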

Page 13: High-Performance In-Memory Genome (HIG) Project

Thank you for your interest! Keep in contact with us.

Matthieu-P. Schapranow, M.Sc.

Hasso Plattner Institute

Enterprise Platform & Integration Concepts

August-Bebel-Str. 88

14482 Potsdam, Germany

http://j.mp/schapranow