Eric C. Rouchka, University of Louisville SATCHMO: sequence alignment and tree construction using hidden Markov models Edgar, R.C. and Sjolander, K. Bioinformatics. 19(11):1404-1411. CECS 694-04 Bioinformatics Journal Club Eric Rouchka, D.Sc. September 10, 2003

Jan 01, 2016


Andrea Craig
Eric C. Rouchka, University of Louisville

SATCHMO: sequence alignment and tree construction using hidden Markov models

Edgar, R.C. and Sjolander, K. Bioinformatics. 19(11):1404-1411.

CECS 694-04 Bioinformatics Journal Club

Eric Rouchka, D.Sc.

September 10, 2003

                              

What is Multiple Sequence Alignment (MSA)?

• Taking more than two sequences and aligning based on similarity

Globin Example

>gamma_A
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKL
HVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH
>alfa
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>beta
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
HVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>delta
VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDK
LHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH
>epsilon
VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLH
VDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>gamma_G
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKL
HVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH
>myoglobin
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATK
HKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
>teta1
ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQKVADALSLAVERLDDLPHALSALSHLHACQLRVDPAS
FQLLGHCLLVTLARHYPGDFSPALQASLDKFLSHVISALVSEYR
>zeta
SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKL
LSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR

Globin Multiple Alignment

Why do MSA?

• Homology Searching
– Important regions conserved across (or within) species
• Genic Regions
• Regulatory Elements
• Phylogenetic Classification
• Subfamily classification
• Identification of critical residues

MSA Approaches

• All columns assumed alignable across all sequences
– MSA
– ClustalW
• Columns alignable across all sequences are singled out (Profile HMM)
– HMMER
– SAM

MSA

• N-dimensional dynamic programming

• Time consuming

• High memory usage

• Guaranteed to yield the optimal (maximum-scoring) alignment
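
A back-of-the-envelope count shows why exact N-dimensional dynamic programming is slow and memory-hungry. The sketch below assumes N sequences of equal length L and a full, unpruned lattice, giving (L+1)^N cells with 2^N - 1 predecessor cells each. The function name `dp_cost` is illustrative and not taken from any tool.

```python
def dp_cost(num_seqs: int, length: int) -> tuple[int, int]:
    """Return (cells, cell_updates) for a full N-dimensional DP lattice."""
    cells = (length + 1) ** num_seqs
    # Each cell is reached from every neighbor that steps back
    # in at least one of the N dimensions.
    predecessors = 2 ** num_seqs - 1
    return cells, cells * predecessors

# Nine sequences of about 150 residues, roughly the size of the globin example.
for n in (2, 3, 5, 9):
    cells, updates = dp_cost(n, 150)
    print(f"N={n}: {cells:.3e} cells, {updates:.3e} updates")
```

Both numbers grow exponentially in N, which is why heuristic methods are used in practice.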

ClustalW

• Progressive Alignment
– Sequences aligned in pair-wise fashion
– Alignment scores produce phylogenetic tree
– Enhanced dynamic programming approach
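
The pairwise step can be illustrated with a minimal global alignment score (Needleman-Wunsch). This is a sketch only: the match, mismatch and linear gap scores are assumptions chosen for brevity, whereas ClustalW itself uses substitution matrices and more elaborate gap penalties. The fragments are the first 20 residues of three globin sequences shown earlier.

```python
from itertools import combinations

def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score by Needleman-Wunsch with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

seqs = {
    "beta":  "VHLTPEEKSAVTALWGKVNV",
    "delta": "VHLTPEEKTAVNALWGKVNV",
    "alfa":  "VLSPADKTNVKAAWGKVGAH",
}
# Score every pair; the highest-scoring pair would be joined first.
scores = {(x, y): nw_score(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)}
best_pair = max(scores, key=scores.get)
print(scores, best_pair)
```

A progressive aligner turns such pairwise scores into distances, builds a tree from them, and then aligns along the tree.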

Hidden Markov Models

• Match State, Insert State, Delete State

HMMs

• Models conserved regions

• Successful at detecting and aligning critical motifs and conserved core structure

• Difficulty in aligning sequence outside of these regions

SATCHMO

• Simultaneous Alignment and Tree Construction using Hidden Markov mOdels

www.lib.jmu.edu/music/composers/armstrong.htm

SATCHMO

• Progressive Alignment
– Built iteratively in pairs
– Profile HMMs used

• Alignment of the same sequences may differ from node to node

• Number of columns predicted alignable shrinks as structures diverge

• Output is not represented by a single matrix

Why HMMs?

• Homologs ranked through scoring

• Accurate profiles from small numbers of sequences

• Accurately combines two alignments having low sequence similarity

Bits saved relative to background

• k = 1..M: HMM node number

• a: amino acid type

• Pk(a): emission probability of a in kth match state

• P0(a): approximation of background probability of a
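
The equation on this slide was an image and is missing from the transcript. With the symbols defined above, one natural reading is the average relative entropy of the match emissions against the background, b = (1/M) Σk Σa Pk(a) log2(Pk(a) / P0(a)). Treat that form as an assumption rather than a quotation from the paper. A minimal Python sketch of that quantity, using a toy two-letter alphabet instead of the 20 amino acids:

```python
import math

def bits_saved(match_emissions, background):
    """Average relative entropy (in bits) of match-state emissions vs. background.

    match_emissions: list of dicts, one per match state k, mapping residue a -> Pk(a)
    background: dict mapping residue a -> P0(a)
    """
    total = 0.0
    for p_k in match_emissions:
        for a, p in p_k.items():
            if p > 0.0:  # the term 0 * log(0) is taken as 0
                total += p * math.log2(p / background[a])
    return total / len(match_emissions)

background = {"A": 0.5, "G": 0.5}
profile = [
    {"A": 1.0, "G": 0.0},  # fully conserved column: saves 1 bit
    {"A": 0.5, "G": 0.5},  # matches the background: saves 0 bits
]
print(bits_saved(profile, background))  # 0.5
```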

Sequence weights

• Sequences weighted such that b converges on a desired value

• Weights compensate for correlation in sequences
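
The slide does not say how the weights are adjusted. One possible way to make b reach a target is to scale the total sequence weight, mix the scaled counts with background pseudocounts, and search for the scale by bisection. Everything in this sketch (additive pseudocounts, the `prior_strength` parameter, the bisection itself, and taking b as average relative entropy) is an assumption for illustration; the procedure in the paper may differ, and the relative weights between sequences are taken as already fixed.

```python
import math

def emissions(counts, background, scale, prior_strength=1.0):
    """Scaled weighted counts mixed with background pseudocounts."""
    total = scale * sum(counts.values()) + prior_strength
    return {a: (scale * counts.get(a, 0.0) + prior_strength * p0) / total
            for a, p0 in background.items()}

def bits_saved(columns, background, scale):
    """Average relative entropy (bits) of the column emissions vs. background."""
    b = 0.0
    for counts in columns:
        p_k = emissions(counts, background, scale)
        b += sum(p * math.log2(p / background[a]) for a, p in p_k.items() if p > 0)
    return b / len(columns)

def scale_for_target(columns, background, target, hi=1e6, iters=60):
    """Bisection on the weight scale; b rises from 0 as the scale grows."""
    if bits_saved(columns, background, hi) < target:
        raise ValueError("target exceeds the information available in the columns")
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bits_saved(columns, background, mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

background = {"A": 0.5, "G": 0.5}
columns = [{"A": 3, "G": 1}, {"A": 2, "G": 2}]  # toy weighted residue counts per column
scale = scale_for_target(columns, background, target=0.05)
print(scale, bits_saved(columns, background, scale))
```

A small scale pulls every column toward the background (few bits saved); a large scale trusts the observed counts (more bits saved).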

HMM Construction

• Profile HMM constructed from multiple alignment

• Some columns alignable; others not

HMM Construction

• Given an alignment a, a profile HMM is generated

• Each column in a is assigned to an emitter state
– Emission and transition probabilities are calculated from the observed amino acids and gaps
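
As a rough illustration of going from alignment columns to emitter states, the sketch below marks columns with few gaps as match states and estimates their emissions from residue counts. The 50% gap cutoff and the add-one pseudocount are assumptions made for this example, not values taken from the paper, and transition probabilities are left out.

```python
from collections import Counter

def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as match states (gap fraction at or below the cutoff)."""
    n = len(alignment)
    return [j for j in range(len(alignment[0]))
            if sum(row[j] == "-" for row in alignment) / n <= max_gap_fraction]

def match_emissions(alignment, alphabet, pseudocount=1.0):
    """One emission distribution per match column, from pseudocount-smoothed counts."""
    profile = []
    for j in match_columns(alignment):
        counts = Counter(row[j] for row in alignment if row[j] != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile

# Toy alignment; column 4 is mostly gaps and is not given a match state.
toy = ["VHLT-PE",
       "VHLT-PE",
       "V-LSAPA"]
alphabet = "AEHLPSTV"
profile = match_emissions(toy, alphabet)
print(match_columns(toy))  # [0, 1, 2, 3, 5, 6]
print(round(profile[0]["V"], 3))
```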

Transition Probabilities

• If we have a total of five match states, the probabilities can be stored in the following table:

HMM Terminology

• π: Path through an HMM to produce a sequence s

• P(A|π) = ∏_s P(s|π_s)

• π+: maximum probability path through the HMM
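
The maximum probability path is usually found with the Viterbi algorithm. The sketch below runs Viterbi on a small generic HMM in log space. It is not a profile HMM (there are no silent delete states), and the two states and all probabilities are made-up values for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log probability, path) of the most probable state path for obs."""
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    # best[s] = (log probability of the best path ending in state s, that path)
    best = {s: (log(start_p[s]) + log(emit_p[s][obs[0]]), [s]) for s in states}
    for symbol in obs[1:]:
        step = {}
        for s in states:
            score, path = max(
                ((best[r][0] + log(trans_p[r][s]), best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            step[s] = (score + log(emit_p[s][symbol]), path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])

states = ("M", "I")  # a match-like state that prefers A, an insert-like uniform state
start_p = {"M": 0.5, "I": 0.5}
trans_p = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.1, "I": 0.9}}
emit_p = {"M": {"A": 0.9, "G": 0.1}, "I": {"A": 0.5, "G": 0.5}}

logp, path = viterbi("AAAA", states, start_p, trans_p, emit_p)
print(path, math.exp(logp))
```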

Aligning Two Alignments

• One alignment is converted to an HMM

• Second alignment is aligned to the HMM
– Some columns remain alignable
– Affinities (relative match scores) calculated

• New MSA results

• HMM constructed from new MSA

Aligning Two Alignments

SATCHMO Algorithm

• Step 1:
– Create a cluster for each input sequence and construct an HMM from the sequence

• Step 2:
– Calculate the similarity of all pairs of clusters and identify a pair with highest similarity
– Align the target and template to produce a new node

SATCHMO Algorithm

• Repeat step 2 until:
– All sequences are assigned to a single cluster
– Highest similarity between clusters is below a threshold
– No alignable positions are predicted

• Output: A set of binary trees
– Leaf nodes are sequences
– Each node contains an HMM aligning the sequences in the subtree
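
The control flow of the two steps and the stopping rules can be sketched as a greedy pairwise merging loop. This is not the SATCHMO implementation: the profile HMMs are replaced by toy stand-ins (equal-length strings), and `toy_similarity`, `toy_merge` and the threshold value are hypothetical helpers invented for this example.

```python
from itertools import combinations

def progressive_clustering(items, similarity, merge, min_similarity):
    """Greedy pairwise merging; returns the remaining tree roots.

    A leaf is (model, name); an internal node is (model, left, right).
    """
    roots = [(model, name) for name, model in items.items()]  # step 1: one cluster each
    while len(roots) > 1:
        # Step 2: find the most similar pair of clusters.
        i, j = max(combinations(range(len(roots)), 2),
                   key=lambda p: similarity(roots[p[0]][0], roots[p[1]][0]))
        if similarity(roots[i][0], roots[j][0]) < min_similarity:
            break  # remaining clusters are too dissimilar to join
        merged = (merge(roots[i][0], roots[j][0]), roots[i], roots[j])
        roots = [r for k, r in enumerate(roots) if k not in (i, j)] + [merged]
    return roots

def leaf_names(node):
    """Names of the sequences under a tree node."""
    if len(node) == 2:
        return [node[1]]
    return leaf_names(node[1]) + leaf_names(node[2])

def toy_similarity(a, b):
    """Fraction of positions that agree; 'x' marks a column no longer alignable."""
    return sum(x == y and x != "x" for x, y in zip(a, b)) / len(a)

def toy_merge(a, b):
    """Keep agreeing positions, mark disagreeing ones as 'x'."""
    return "".join(x if x == y else "x" for x, y in zip(a, b))

items = {"s1": "AAAAAA", "s2": "AAAAAT", "s3": "GGGGGG", "s4": "GGGGCC"}
forest = progressive_clustering(items, toy_similarity, toy_merge, min_similarity=0.5)
print([sorted(leaf_names(tree)) for tree in forest])
```

With these toy inputs the loop stops with two trees, because the two merged clusters share no agreeing positions. This mirrors the "set of binary trees" output described above.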

Graphical Interface for SATCHMO

Demonstration of SATCHMO

Validation Set

• BAliBASE benchmark alignment set used
– Ref1: equidistant sequences
– Ref2: distantly related sequences
– Ref3: subgroups of sequences; < 25% similarity between groups
– Ref4: alignments with long extensions on the ends
– Ref5: alignments with long insertions

Comparison of Results

• SATCHMO compared to:
– ClustalW (Progressive Pairwise Alignment)
– SAM (HMM)

Discussion

• SATCHMO effective in identifying protein domains

• Comparison to T-Coffee and PRRP would be useful
– Time and sensitivity

• Tree representation is unique, modeling structural similarity