Top Banner
Project Phase II Report Due on 10/20, send me through email Write on top of Phase I report. 5-20 Pages Free style in writing (use 11pt font or larger) Methods Overview (high-level description) Source of data Algorithm (pseudo code) Prove or argue why the algorithm will work Analyze the computational complexity of the algorithm Plan of implementation
50

Project Phase II Report

Feb 12, 2016

Download

Documents

Bryce

Project Phase II Report. Due on 10/20, send me through email Write on top of Phase I report. 5-20 Pages Free style in writing (use 11pt font or larger) Methods Overview (high-level description) Source of data Algorithm (pseudo code) Prove or argue why the algorithm will work - PowerPoint PPT Presentation
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Project Phase II Report

Project Phase II Report Due on 10/20, send me through email Write on top of Phase I report. 5-20 Pages Free style in writing (use 11pt font or larger) Methods

Overview (high-level description) Source of data Algorithm (pseudo code) Prove or argue why the algorithm will work Analyze the computational complexity of the algorithm Plan of implementation

Page 2: Project Phase II Report

Multiple Sequence Alignment

Dong Xu

Computer Science Department109 Engineering Building WestE-mail: [email protected]

573-882-7064http://digbio.missouri.edu

Page 3: Project Phase II Report

As we proceed …

Warning:Muddy Road

Ahead!!!

Page 4: Project Phase II Report

Outline

Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast

Page 5: Project Phase II Report

Introduction

The multiple sequence alignment of a set of sequences may be viewed as an evolutionary history of the sequences.

No sequence ordering is required.

Page 6: Project Phase II Report

An Example of Multiple Alignment

VTISCTGSESNIGAG-NHVKWYQQLPGVTISCTGTESNIGS--ITVNWYQQLPGLRLSCSSSDFIFSS--YAMYWVRQAPGLSLTCTVSETSFDD--YYSTWVRQPPGPEVTCVVVDVSHEDPQVKFNWYVDG--ATLVCLISDFYPGA--VTVAWKADS--AALGCLVKDYFPEP--VTVSWNSG---VSLTCLVKEFYPSD--IAVEWWSNG--

Page 7: Project Phase II Report

Why Multiple Alignment (1)

Natural extension of Pairwise Sequence Alignment

“Pairwise alignment whispers… multiple alignment shouts out loud” Hubbard et al 1996

Much more sensitive in detecting sequence relationship and patterns

Page 8: Project Phase II Report

Why Multiple Alignment (2)

Give hints about the function and evolutionary history of a set of sequences

Foundation for phylogenic tree construction and protein family classification

Useful for protein structure prediction…

Page 9: Project Phase II Report

Outline

Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast

Page 10: Project Phase II Report

Idea of Scoring

Requirement of a good quality of alignment measureAdditive function

Function must be independent of order of arguments

Should reward presence of many equal or strongly related symbols (in the same column) and penalize unrelated symbols and spaces.

Page 11: Project Phase II Report

Distance from Consensus

Consensus sequnece: single sequence which represents the most common amino acid/base in that positionD D G A V - E A LD G G - - - E A LE G G I L V E A LD - G I L V Q A VE G G A V V Q A LD G G A/I V/L V E A L

distance from consensus: total number of characters in the alignment that differs from the consensus character of their columns

Page 12: Project Phase II Report

Sum-of-pairs’ (SP) measure

The Sum of Pairs (SP) method is as follows:Given (1) a set of N aligned sequences each of length L in the form of a L x N MSA alignment matrix M and (2) a substitution matrix (PAM or BLOSUM) that gives a cost c(x,y) for aligning two characters x, y .

The SP score SP(mi) for the I-th column of M denoted by mi is calculated:

SP(mi) = SUM(j, k) [ c( mij , mi

k) ] The SP score for M is SUM(i) [SP(mi) ]

Page 13: Project Phase II Report

Feature of SP Measure

Theorem: Let alpha be a multiple alignment of the set of sequences s1, …, sk; and alpha(I,j) denote the pairwise alignment of si and sj as induced by alpha. Then

SP-score(alpha) = Sum over i,j [score(alpha(i,j)]

The above is only true if we have s(-,-) = 0.

This is because in pairwise alignment the presence of two aligned spaces (–) in the two sequences are ignored.

Page 14: Project Phase II Report

In the minimum entropy approach the score of multiple alignment is defined as the sum of entropies of the columns (the entropy of a column is related with the frequency of letter x in the column)

Information content Profile score

Profile Distance

Page 15: Project Phase II Report

Outline

Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast

Page 16: Project Phase II Report

Dynamic Programming

Multiple sequence alignment can be done with dynamic programming, using (n+1)k memory cells, where n is the length of each of the k sequences to be aligned.

Sequence 1

Sequence 2

Sequence 3

Page 17: Project Phase II Report

Complexity Analysis of Dynamic Programming

Running time: (n+1)k number of entries in the table

For each entry we need to find the maximum of 2k -1 elements

Finding the SP-score corresponding to each element means adding O(k2) numbers

Total = O(k22knk) i.e., NP-hard

Page 18: Project Phase II Report

Outline

Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast

Page 19: Project Phase II Report

Center Star Method

The center star method is an approximation algorithm, if scores can be decoupled into pairwise distances.

minimizing sum of pairwise distances. Works well when a distance satisfies

the triangle inequality: dst <= dsc + dct for all sequences s, t, and c.

Page 20: Project Phase II Report

Star Alignments

Select a sequence sc as the center of the star

For each sequence s1, …, sk such that index i c, perform a Needleman-Wunsch global alignment

Aggregate alignments with the principle “once a gap, always a gap.”

Page 21: Project Phase II Report

Star Alignments Example

s2

s1 s3

s4

s1: MPEs2: MKEs3: MSKEs4: SKE

MPE

| |

MKE

MSKE

- ||

MKE

SKE

||

MKE

MPEMKE

-MPE-MKEMSKE

-MPE-MKEMSKE-SKE

Page 22: Project Phase II Report

Choosing a Center

Try them all and pick the one with the best score

Calculate all O(k2) alignments, and pick the sequence sc that maximizes

ci

ci ssscore ),(

Page 23: Project Phase II Report

Performance

Therefore, if C is the center star method SP score, then C/2 is a lower bound to the SP score of any multiple alignment.

C is at most as bad as twice of the optimum, and this bound on how bad the result can be makes the center star method an approximation algorithm to the NP-hard problem.

Page 24: Project Phase II Report

Complexity Analysis

Assuming k sequences have length n

O(n2) to calculate 1 global alignment

O(k2) global alignments to calculate

O(k2n2) overall cost

Page 25: Project Phase II Report

Outline

Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast

Page 26: Project Phase II Report

Progressive Alignment

Devised by Feng and Doolittle in 1987.

Essentially a heuristic method, not guaranteed to find the ‘optimal’ alignment.

Multiple alignment is achieved by successive application of pairwise methods.

Page 27: Project Phase II Report

Basic Algorithm

Compare all sequences pairwise. Perform cluster analysis on the pairwise

data to generate a hierarchy for alignment (guide tree).

Build alignment step by step according to the guide tree. Build the multiple alignment by first aligning the most similar pair of sequences, then add another sequence or another pairwise alignments.

Page 28: Project Phase II Report

Steps in Progressive Multiple Alignment

Compare pairwise sequences Perform cluster analysis on pairwise data

to generate hierarchy for alignment

Page 29: Project Phase II Report

Building Guided Tree

Page 30: Project Phase II Report

Using WeightHbb_Human -

Hbb_Horse .17 -Hba_Human .59 .60 -

Hba_Horse .59 .59 .13 -Myg_Whale .77 .77 .75 .75 -

Hbb_Human

Hbb_Horse

Hba_Horse

Hba_Human

Myg_Whale

2

1

3

4

Quick pairwise alignment: calculate distance matrix

Neighbor-joining tree (guide tree)

Page 31: Project Phase II Report

Alignment (1)

Build multiple alignments by first aligning most similar pair, then next similar pair etc.

Page 32: Project Phase II Report

Alignment (2)

Page 33: Project Phase II Report

Scoring

PQRRZWYQRKZX

YZTUOPTZZ_FO

Total Score = [w(R, U ) + w(R, ) + w(K, U) + w(K, ) ] / 4

Page 34: Project Phase II Report

Most successful implementation of progressive alignment (Des Higgins)

CLUSTAL - gives equal weight to all sequences

CLUSTALW - has the ability to give different weights to the sequences

CLUSTALX - provides a GUI to CLUSTAL

http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html

CLUSTAL

Page 35: Project Phase II Report

Problems with Progressive Alignments

Greedy Nature: No guarantee for global optimum

Depends on pairwise alignments If sequences are very distantly related,

much higher likelihood of errors Care must be made in choosing scoring

matrices and penalties Other approaches using Bayesian

methods such as hidden Markov models

Page 36: Project Phase II Report

Outline

Background (what, why) Scoring function Dynamic programming Star alignment Progressive alignment Profile and Psi-Blast

Page 37: Project Phase II Report

Concept of Profile

Seq1>-

Seq3>-Seq4>-

Seq2>-

Information about the degree of conservation of sequence positions is included

Page 38: Project Phase II Report

Position-specific Score Matrix (PSSM)

For protein of length L, scoring matrix is L x 20, PSSM(i,j) --“Profile”: specific scores for each of the 20 amino acids at each position in a sequence.

For highly conserved residues at a particular position, a high positive score is assigned, and others are assigned high negatives.

For a weakly conserved position, a value close to zero is assigned to all the amino acid types.

Page 39: Project Phase II Report

Building a Profile

First, get multiple sequence alignments using substitution matrix, Sjk.

Second, count the number of occurrences of amino acid k at position i, Cik.

(1) Average-score method: Wij = k CikSjk / N. (2) log-odds-ratio formula: Wij = log(qij/pj).

qij = Cij / N.

pj : background probability of residue j.

Page 40: Project Phase II Report

Calculating Profiles (1)

Gribskov et al, Proc. Natl. Acad. Sci. USA 84, 4355-4358, 1987

ACGCTAFKI

GCGCTAFKI

ACGCTAFKL

GCGCTGFKI

GCGCTLFKI

ASGCTAFKL

ACACTAFKL

C1A = 4, C1G = 3W1A = (4 SAA + 3 SAG) / 7 = (4 4 + 3 0) / 7 = 2.3

Wij = k CikSjk / N

Page 41: Project Phase II Report

Calculating profile (2)

Wij = log(qij/pj).qij = Cij / N.

pj : background probability of residue j.

For small N, formula qij = Cij / N is not good A large set of too closely related sequences carries little more information than a single member. Absence of Leu does not mean no Leu at this position when Ile is abundant! Pseudocount frequency, gij

Page 42: Project Phase II Report

Frequency Matrix

Effective frequency, fij

ij iij

q gf

Frequency matrix element, fij, is the

probability of amino acid j at position i.

j

gij = pj k qik exp ( Skj )pj: : background frequency

Page 43: Project Phase II Report

Frequency matrix, example

0.03 0.02 0.01 0.01 0.01 0.02 0.02 0.02 0.01 0.22 0.27 0.02 0.12 0.03 0.01 0.02 0.03 0.00 0.01 0.080.04 0.02 0.17 0.03 0.01 0.02 0.03 0.03 0.01 0.03 0.08 0.03 0.01 0.01 0.02 0.07 0.34 0.00 0.01 0.040.07 0.03 0.02 0.03 0.01 0.12 0.13 0.03 0.01 0.01 0.02 0.04 0.01 0.01 0.38 0.07 0.07 0.00 0.01 0.020.07 0.03 0.09 0.06 0.01 0.03 0.14 0.03 0.06 0.01 0.02 0.09 0.01 0.01 0.15 0.16 0.05 0.00 0.01 0.020.03 0.11 0.08 0.03 0.01 0.12 0.13 0.02 0.01 0.04 0.04 0.13 0.09 0.01 0.02 0.04 0.03 0.00 0.01 0.040.03 0.12 0.05 0.03 0.01 0.10 0.05 0.02 0.01 0.04 0.14 0.10 0.16 0.01 0.01 0.04 0.03 0.00 0.01 0.040.04 0.04 0.12 0.06 0.00 0.06 0.12 0.03 0.03 0.01 0.02 0.18 0.02 0.02 0.02 0.15 0.04 0.03 0.01 0.020.06 0.02 0.08 0.03 0.00 0.08 0.19 0.02 0.04 0.05 0.07 0.09 0.04 0.01 0.02 0.12 0.04 0.00 0.01 0.030.03 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.02 0.15 0.31 0.01 0.02 0.12 0.01 0.02 0.02 0.01 0.02 0.130.06 0.03 0.05 0.06 0.03 0.07 0.19 0.02 0.02 0.03 0.06 0.10 0.01 0.05 0.02 0.07 0.07 0.00 0.05 0.030.22 0.10 0.03 0.08 0.01 0.03 0.12 0.02 0.01 0.01 0.06 0.11 0.01 0.01 0.03 0.05 0.07 0.00 0.02 0.030.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.08 0.34 0.01 0.02 0.26 0.01 0.02 0.02 0.01 0.02 0.100.03 0.05 0.02 0.03 0.00 0.10 0.29 0.12 0.01 0.01 0.08 0.13 0.02 0.03 0.02 0.03 0.03 0.00 0.01 0.020.06 0.10 0.10 0.07 0.00 0.06 0.13 0.03 0.06 0.01 0.02 0.18 0.01 0.01 0.02 0.08 0.04 0.00 0.03 0.010.09 0.22 0.07 0.03 0.00 0.06 0.04 0.02 0.02 0.04 0.07 0.18 0.01 0.01 0.01 0.07 0.05 0.00 0.01 0.030.03 0.01 0.02 0.01 0.01 0.01 0.02 0.01 0.01 0.17 0.41 0.01 0.03 0.05 0.01 0.03 0.04 0.00 0.04 0.050.04 0.02 0.10 0.05 0.00 0.06 0.02 0.48 0.02 0.01 0.02 0.05 0.01 0.01 0.01 0.05 0.03 0.00 0.01 0.010.05 0.01 0.01 0.01 0.01 0.02 0.08 0.01 0.06 0.13 0.09 0.02 0.04 0.05 0.01 0.02 0.02 0.01 0.25 0.080.06 0.06 0.07 0.05 0.01 0.06 0.09 0.02 0.01 0.02 0.02 0.07 0.01 0.02 0.09 0.17 0.13 0.00 0.02 0.020.02 0.01 0.01 0.04 0.00 0.02 0.05 0.01 0.01 0.06 0.11 0.01 0.01 0.43 0.03 0.03 0.03 0.01 0.04 0.03

i (p

ositi

on) 1

,…

,L

j (amino acid type) 1,…,20

Page 44: Project Phase II Report

Profile Alignment (1)

ACD……VWY

sequence

profile

Page 45: Project Phase II Report

alignment of alignments• Sequence – Profile Alignment.• Profile – Profile Alignment.

Penalize gaps in conserved regions more heavily than gaps in more variable regions

Dynamic Programming. (same idea as in Pairwise Sequence Alignment)Optimal alignment in time O(a2l2)a = alphabet size, l = sequence length

Profile Alignment (2)

Page 46: Project Phase II Report

PsiBlast Psi (Position Specific Iterated) is an

automatic profile-like search The program first performs a gapped blast

search of the database. The information of the significant alignments is then used to construct a “position specific” score matrix. This matrix replaces the query sequence in the next round of database searching

The program may be iterated until no new significant are found

Page 47: Project Phase II Report

Search (Building Profile) with PSI-BLAST

Query sequence is “master” for multiple alignment. Profile length = query sequence length. Find family members with low E-value (e.g. <0.01). Exclude sequences with >98% identity. Build frequency matrix. Build PSSM. Build alignment using PSSM. Add new sequences. Iterate…

Page 48: Project Phase II Report

PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.

Much more sensitive than BLAST E-values can be misleading!

Significance of Psi Blast

Page 49: Project Phase II Report

Reading Assignments

Suggested reading: Chapter 4 in “Current Topics in

Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press. 2002.”

Optional reading: Chapter 7 in “Pavel Pevzner: Computational

Molecular Biology - An Algorithmic Approach. MIT Press, 2000.”

Page 50: Project Phase II Report

Develop a program that implement the Center Star algorithm

1. Modify your code in the first assignment for global alignment.

2. Use edit distance (match 1; otherwise 0) with gap penalty –1 – k (k is gap size).

Project Assignment