Frank Dehne www.dehne.net Parallel Computational Biochemistry
Frank Dehne Parallel Computational Biochemistry.

Dec 30, 2015

Page 1: Frank Dehne Parallel Computational Biochemistry.

Frank Dehne www.dehne.net

Parallel Computational Biochemistry

Page 2: Frank Dehne Parallel Computational Biochemistry.

Proteins, DNA, etc.

DNA encodes the information necessary to produce proteins

Proteins are the main molecular building blocks of life (for example, structural proteins, enzymes)

Page 3: Frank Dehne Parallel Computational Biochemistry.

Proteins, DNA, etc.

• Proteins are formed from a chain of molecules called amino acids

Page 4: Frank Dehne Parallel Computational Biochemistry.

Proteins, DNA, etc.

• The DNA sequence encodes the amino acid sequence that constitutes the protein

Page 5: Frank Dehne Parallel Computational Biochemistry.

Proteins, DNA, etc.

• There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I, ...

Page 6: Frank Dehne Parallel Computational Biochemistry.

Multiple Sequence Alignment

Page 7: Frank Dehne Parallel Computational Biochemistry.

Databases of Biological Sequences

>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

NCBI: 14,976,310 sequences; 15,849,921,438 nucleotides

Swiss-Prot: 104,559 sequences; 38,460,707 residues

PDB: 17,175 structures

Page 8: Frank Dehne Parallel Computational Biochemistry.

Sequence comparison

• Compare one sequence (target) to many sequences (database search)

• Compare more than two sequences simultaneously

Page 9: Frank Dehne Parallel Computational Biochemistry.

Applications

• Phylogenetic analysis

• Identification of conserved motifs and domains

• Structure prediction

Page 10: Frank Dehne Parallel Computational Biochemistry.

Page 11: Frank Dehne Parallel Computational Biochemistry.

Phylogenetic Analysis

Page 12: Frank Dehne Parallel Computational Biochemistry.

Structure Prediction

Genomic sequences

> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

Protein sequences

Protein structures

Page 13: Frank Dehne Parallel Computational Biochemistry.

Our Contributions

• Parallel min vertex cover for improved sequence alignments (to appear in Journal of Computer and System Sciences)

• Parallel Clustal W (ICCSA 2003)

• In progress: “Clustal XP” portal at http://cgm.dehne.net

Page 14: Frank Dehne Parallel Computational Biochemistry.

Clustal W

Page 15: Frank Dehne Parallel Computational Biochemistry.

Progressive Alignment

Pairwise distance matrix:

Scerevisiae [1]
Celegans    [2]  0.640
Drosophila  [3]  0.634  0.327
Human       [4]  0.630  0.408  0.420
Mouse       [5]  0.619  0.405  0.469  0.289

Guide tree leaves: S.cerevisiae, C.elegans, Drosophila, Mouse, Human

1. Do pairwise alignment of all sequences and calculate distance matrix

2. Create a guide tree based on this pairwise distance matrix

3. Align progressively following guide tree.

• start by aligning most closely related pairs of sequences

• at each step align two sequences or one to an existing subalignment
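Steps 1 and 2 can be illustrated with the distance matrix shown on this slide. The sketch below is a minimal example that builds a guide tree with plain average-linkage (UPGMA) clustering. This is a simpler stand-in chosen for illustration and is not claimed to be the tree-building method Clustal W applies. The function names and the nested-tuple tree representation are assumptions.

```python
from itertools import combinations

# Lower-triangular distance matrix from the slide.
NAMES = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
ROWS = [[], [0.640], [0.634, 0.327], [0.630, 0.408, 0.420],
        [0.619, 0.405, 0.469, 0.289]]
DIST = {frozenset((NAMES[i], NAMES[j])): d
        for i, row in enumerate(ROWS) for j, d in enumerate(row)}


def average_distance(members_a, members_b, dist):
    """Mean leaf-to-leaf distance between two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def guide_tree(names, dist):
    """Repeatedly merge the two closest clusters; return (tree, merge order)."""
    clusters = [(name, [name]) for name in names]  # (subtree, member leaves)
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], dist),
        )
        (tree_i, mem_i), (tree_j, mem_j) = clusters[i], clusters[j]
        merges.append((tree_i, tree_j))
        rest = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters = rest + [((tree_i, tree_j), mem_i + mem_j)]
    return clusters[0][0], merges


tree, merges = guide_tree(NAMES, DIST)
print(merges[0])  # the most closely related pair is aligned first
```

The merge order gives the order of step 3: each merge corresponds to aligning two sequences, a sequence to a subalignment, or two subalignments.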

Page 16: Frank Dehne Parallel Computational Biochemistry.

Parallel Clustal

• Parallel pairwise (PW) alignment matrix

• Parallel guide tree calculation

• Parallel progressive alignment

[Figure: the pairwise distance matrix and guide tree from the Progressive Alignment slide, repeated]
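The pairwise stage parallelizes naturally because every pair of sequences can be scored independently. The sketch below shows only that task decomposition. The distance function is a toy mismatch fraction (no gaps, no scoring matrix), not Clustal W's pairwise alignment, and the thread pool stands in for whatever cluster mechanism a real implementation would use. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def mismatch_distance(a, b):
    """Toy distance: fraction of differing positions over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / n


def pairwise_distances(seqs, workers=4, distance=mismatch_distance):
    """Compute the distance for every unordered pair of named sequences."""
    names = sorted(seqs)
    pairs = list(combinations(names, 2))  # one independent task per pair
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = pool.map(
            lambda pair: distance(seqs[pair[0]], seqs[pair[1]]), pairs)
        return dict(zip(pairs, values))


example = {"a": "ACDE", "b": "ACDF", "c": "WWWW"}
print(pairwise_distances(example, workers=2))
```

With n sequences there are n(n-1)/2 independent tasks, so this stage has far more available parallelism than the guide tree or the progressive alignment itself.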

Page 17: Frank Dehne Parallel Computational Biochemistry.

Relative Speedup

Page 18: Frank Dehne Parallel Computational Biochemistry.

Clustal XP vs. SGI

SGI data taken from: Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts, "Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL"

Page 19: Frank Dehne Parallel Computational Biochemistry.

Parallel Clustal - Improvements

• Optimization of input parameters

– scoring matrices, gap penalties

– requires many repetitive Clustal W calculations with various input parameters

• Minimum Vertex Cover

– use minimum vertex cover to remove erroneous sequences, and identify clusters of highly similar sequences

Page 20: Frank Dehne Parallel Computational Biochemistry.

Minimum Vertex Cover

Conflict Graph

– vertex: sequence

– edge: conflict (e.g. alignment with very poor score)

TASK: remove the smallest number of gene sequences that eliminates all conflicts

NP-complete
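The conflict graph and the task above can be sketched as follows. This is a minimal illustration, assuming that a conflict means a pairwise alignment score below a threshold. The score table, the threshold semantics, and the function names are hypothetical.

```python
def build_conflict_graph(scores, threshold):
    """Return the conflict edges: pairs whose alignment score is below threshold.

    scores maps (sequence_a, sequence_b) -> pairwise alignment score.
    """
    return {frozenset(pair) for pair, score in scores.items()
            if score < threshold}


def is_vertex_cover(edges, removed):
    """True if removing the given sequences eliminates every conflict."""
    return all(edge & removed for edge in edges)


scores = {("s1", "s2"): 5, ("s1", "s3"): 40, ("s2", "s3"): 8}
edges = build_conflict_graph(scores, threshold=10)
print(is_vertex_cover(edges, {"s2"}))  # s2 takes part in both conflicts
```

A set of removed sequences eliminates all conflicts exactly when it is a vertex cover of the conflict graph, which is why the task reduces to minimum vertex cover.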

Page 21: Frank Dehne Parallel Computational Biochemistry.

FPT Algorithms

• Phase 1: Kernelization

Reduce problem to size f(k)

• Phase 2: Bounded Tree Search

Exhaustive tree search; exponential in f(k)

Page 22: Frank Dehne Parallel Computational Biochemistry.

Kernelization

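Buss's Algorithm for k-vertex cover

• Let G=(V,E) and let S be the subset of vertices with degree greater than k. Every vertex in S must belong to any vertex cover of size at most k; otherwise all of its more than k neighbours would have to be in the cover.

• Remove S and all incident edges: G -> G', k -> k' = k - |S|.

• IF G' has more than k x k' edges THEN no k-vertex cover exists (each remaining vertex has degree at most k, so k' vertices cover at most k x k' edges)

ELSE start bounded tree search on G'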
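The kernelization step can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes a simple undirected graph given as an edge list and uses the strict threshold (degree greater than k), under which every removed vertex is forced into any cover of size at most k. The function name and return convention are hypothetical.

```python
def buss_kernel(edges, k):
    """Kernelize a k-vertex-cover instance.

    Returns (forced_vertices, remaining_edges, k_prime), or None when no
    vertex cover of size at most k can exist.
    """
    edges = {frozenset(e) for e in edges}
    degree = {}
    for e in edges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1

    # Vertices of degree > k must be in every cover of size <= k.
    forced = {v for v, d in degree.items() if d > k}
    if len(forced) > k:
        return None
    k_prime = k - len(forced)

    # Remove the forced vertices together with their incident edges.
    remaining = {e for e in edges if not (e & forced)}

    # Each remaining vertex covers at most k edges.
    if len(remaining) > k * k_prime:
        return None
    return forced, remaining, k_prime


star_plus_edge = [("c", "l1"), ("c", "l2"), ("c", "l3"), ("c", "l4"), ("x", "y")]
print(buss_kernel(star_plus_edge, 2))
```

The remaining edge set is the kernel handed to the bounded tree search, with budget k' instead of k.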

Page 23: Frank Dehne Parallel Computational Biochemistry.

Bounded Tree Search

[Figure: search tree rooted at VC={}; each node below the root extends the partial vertex cover (VC+=...)]

Page 24: Frank Dehne Parallel Computational Biochemistry.

Case 1: simple path of length 3

In graph G': path v, v1, v2, v3

Search tree: VC={...} branches into VC+={v,v2}, VC+={v1,v2}, VC+={v1,v3}

Remove selected vertices from G'; k' -= 2

Page 25: Frank Dehne Parallel Computational Biochemistry.

Case 2: 3-cycle

In graph G': cycle v, v1, v2

Search tree: VC={...} branches into VC+={v,v1}, VC+={v1,v2}, VC+={v,v2}

Remove selected vertices from G'; k' -= 2

Page 26: Frank Dehne Parallel Computational Biochemistry.

Case 3: simple path of length 2

In graph G': path v, v1, v2

Search tree: VC={...} has the single child VC+={v1}

Remove v1, v2 from G'; k' -= 1

Page 27: Frank Dehne Parallel Computational Biochemistry.

Case 4: simple path of length 1

In graph G': path v, v1

Search tree: VC={...} has the single child VC+={v}

Remove v, v1 from G'; k' -= 1

Page 28: Frank Dehne Parallel Computational Biochemistry.

Sequential Tree Search

Depth first search

– backtrack when k'=0 and G' is not empty ("dead end")

– stop when solution found (G'={}, k'>=0)
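The backtracking and stopping rules can be shown with a deliberately simplified search. The sketch below branches on the two endpoints of one uncovered edge (a binary search tree) rather than on the path and cycle cases of the preceding slides, so it illustrates the control flow only. The function name is hypothetical.

```python
def vertex_cover_search(edges, k):
    """Depth-first bounded search; returns a cover of size <= k, or None."""
    edges = [tuple(e) for e in edges]

    def dfs(remaining, budget):
        if not remaining:
            return set()          # solution found: G' is empty
        if budget == 0:
            return None           # dead end: edges left but k' = 0
        u, v = remaining[0]       # one endpoint of this edge must be chosen
        for pick in (u, v):
            rest = [e for e in remaining if pick not in e]
            sub = dfs(rest, budget - 1)
            if sub is not None:
                return sub | {pick}
        return None               # both children failed: backtrack

    return dfs(edges, k)


path = [("a", "b"), ("b", "c"), ("c", "d")]
print(vertex_cover_search(path, 1))  # no cover of size 1 exists
print(vertex_cover_search(path, 2))
```

The recursion depth is at most k, so the search tree has at most 2^k leaves in this simplified variant.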

Page 29: Frank Dehne Parallel Computational Biochemistry.

Parallel Tree Search

Basic Idea:

– Build top log p levels of the search tree (T ')

– every processor starts depth-first search at one leaf of T '

– randomize depth-first search by selecting random child

[Figure: search tree whose top log p levels form T ']
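The basic idea can be sketched as follows, under several simplifying assumptions: the search tree is the binary "pick one endpoint of an edge" tree rather than the case-based tree of the earlier slides, the top levels are expanded until at least p open leaves exist, and a thread pool stands in for the processors of a cluster. All function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def dfs(edges, budget, chosen, rng):
    """Randomized depth-first search below one leaf of the top tree."""
    if not edges:
        return chosen
    if budget == 0:
        return None
    endpoints = list(edges[0])
    rng.shuffle(endpoints)        # visit the children in random order
    for pick in endpoints:
        rest = [e for e in edges if pick not in e]
        found = dfs(rest, budget - 1, chosen | {pick}, rng)
        if found is not None:
            return found
    return None


def expand_frontier(edges, k, p):
    """Expand the tree level by level until it has at least p leaves."""
    frontier = [(list(edges), k, frozenset())]
    while len(frontier) < p:
        next_level, expanded = [], False
        for sub_edges, budget, chosen in frontier:
            if not sub_edges or budget == 0:
                next_level.append((sub_edges, budget, chosen))  # terminal leaf
                continue
            expanded = True
            for pick in sub_edges[0]:
                rest = [e for e in sub_edges if pick not in e]
                next_level.append((rest, budget - 1, chosen | {pick}))
        frontier = next_level
        if not expanded:
            break
    return frontier


def parallel_vertex_cover(edges, k, p=4, seed=0):
    """Start one search per frontier leaf; return the first cover found."""
    frontier = expand_frontier([tuple(e) for e in edges], k, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        futures = [pool.submit(dfs, sub, budget, chosen, random.Random(seed + i))
                   for i, (sub, budget, chosen) in enumerate(frontier)]
        for future in futures:
            result = future.result()
            if result is not None:
                return set(result)
    return None


cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(parallel_vertex_cover(cycle, 3))
```

This sketch waits for the searches in submission order; an implementation aiming for the speedups discussed here would stop all processors as soon as any one of them finds a solution.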

Page 30: Frank Dehne Parallel Computational Biochemistry.

Analysis: Balls-in-bins

sequential depth-first search path: total length L, number of solutions m

expected sequential time (random distribution): L/(m+1)

parallel search path

expected parallel time (random distribution): p + L/(p(m+1))

expected speedup: p / (1 + (m+1)/L)

if m << L then expected speedup = p
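The balls-in-bins model can also be explored empirically. The sketch below is a small Monte Carlo experiment under stated assumptions: the m solutions are placed uniformly at random among L positions, the sequential search scans from position 0, and each of the p processors scans its own contiguous segment of length L/p. It does not model the cost of building the top levels of the search tree. The function name and parameters are hypothetical.

```python
import random


def simulate(L, m, p, trials=5000, seed=1):
    """Return (mean sequential steps, mean parallel steps) to the first solution.

    Assumes p divides L so that every processor gets a segment of equal length.
    """
    rng = random.Random(seed)
    segment = L // p
    seq_total = 0
    par_total = 0
    for _ in range(trials):
        solutions = rng.sample(range(L), m)
        # Sequential: steps until the first solution along the whole path.
        seq_total += min(solutions) + 1
        # Parallel: every processor advances one step at a time in its own
        # segment; the search ends when any processor reaches a solution.
        par_total += min(pos % segment for pos in solutions) + 1
    return seq_total / trials, par_total / trials


seq, par = simulate(L=100_000, m=10, p=8)
print(seq, par, seq / par)
```

For m much smaller than L the measured ratio of sequential to parallel steps comes out close to p, in line with the last statement on this slide.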

Page 31: Frank Dehne Parallel Computational Biochemistry.

Simulation Experiment

[Figure: predicted speedup versus number of processors (both axes from 0 to 200) for L = 1,000,000, with one curve each for m = 10, m = 100, m = 1,000, m = 10,000, m = 100,000]

Page 32: Frank Dehne Parallel Computational Biochemistry.

Implementation

• test platform:

– 32 node HPCVL Beowulf cluster

– each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk

– gcc and LAM/MPI on Red Hat Linux 7.2

• code-s: Sequential k-vertex cover

• code-p: Parallel k-vertex cover

Page 33: Frank Dehne Parallel Computational Biochemistry.

Test Data

• Protein sequences

• Same protein from several hundred species

• Each protein sequence a few hundred amino acid residues in length

• Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)

Page 34: Frank Dehne Parallel Computational Biochemistry.

Test Data

• Somatostatin

– neuropeptide involved in the regulation of many functions in different organ systems

– Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255

Page 35: Frank Dehne Parallel Computational Biochemistry.

Test Data

• WW

– small protein domain that binds proline rich sequences in other proteins and is involved in cellular signaling

– Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318

Page 36: Frank Dehne Parallel Computational Biochemistry.

Test Data

• Kinase

– large family of enzymes involved in cellular regulation

– Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397

Page 37: Frank Dehne Parallel Computational Biochemistry.

Test Data

• SH2 (src-homology domain 2)

– involved in targeting proteins to specific sites in cells by binding to phosphotyrosine

– Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397

Page 38: Frank Dehne Parallel Computational Biochemistry.

Test Data

• Thrombin

– protease in the blood coagulation cascade that promotes blood clotting by converting fibrinogen to fibrin

– Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413

Page 39: Frank Dehne Parallel Computational Biochemistry.

Test Data

• PHD (pleckstrin homology domain)

– involved in cellular signaling

– Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603

Page 40: Frank Dehne Parallel Computational Biochemistry.

Test Data

• Random Graph

|V| = 220, |E| = 2155, k = 122, k' = 122

• Grid Graph

|V| = 289, |E| = 544, k = 145, k' = 145

Page 41: Frank Dehne Parallel Computational Biochemistry.

Test Data

|VC| ~ |V| / 2

k' = k

Page 42: Frank Dehne Parallel Computational Biochemistry.

Sequential Times

Kinase, SH2, Thrombin: n/a

Page 43: Frank Dehne Parallel Computational Biochemistry.

Code-p on Virtual Proc.

Page 44: Frank Dehne Parallel Computational Biochemistry.

Parallel Times

Page 45: Frank Dehne Parallel Computational Biochemistry.

Speedup: Somatostatin

Page 46: Frank Dehne Parallel Computational Biochemistry.

Speedup: WW

Page 47: Frank Dehne Parallel Computational Biochemistry.

Speedup: Rand. Graph

Page 48: Frank Dehne Parallel Computational Biochemistry.

Speedup: Grid Graph

Page 49: Frank Dehne Parallel Computational Biochemistry.

Clustal W

+Parallel Clustal

Parallel FPT MVC

Clustal XP

Web Portal

Clustal XP (in progress)

X: Extended

P: Parallel

Page 50: Frank Dehne Parallel Computational Biochemistry.

Clustal XP

http://cgm.dehne.net

Page 51: Frank Dehne Parallel Computational Biochemistry.