Bioinformatics

Bioinformatics

Richard Tseng and Ishawar Hosamani

Outline• Homology modeling (Ishwar)• Structural analysis

– Structure prediction– Structure comparisons

• Cluster analysis– Partitioning method– Density-based method

• Phylogenetic analaysis

Structural Analysis

• Overview– Structure prediction– Structural alignment– Similarity

• Tools for protein structure prediction– Protein

• Secondary structure prediction: SSEAhttp://protein.cribi.unipd.it/ssea/

• Tertiary structure prediction: – Wurst: http://www.zbh.uni-hamburg.de/wurst/– LOOPP: http://cbsuapps.tc.cornell.edu/loopp.aspx

http://protein.cribi.unipd.it/ssea/

http://www.zbh.uni-hamburg.de/wurst/

http://cbsuapps.tc.cornell.edu/loopp.aspx

• WURST( Torda et al. (2004) Wurst: A protein threading server with a structural scoring function, sequence profiles and optimized substitution matrices

Nucleic Acids Res., 32, W532-W535)• Rationale

– Alignment: Sequence to structure alignments are done with a Smith-Waterman style alignment and the Gotoh algorithm

– Score function: fragment-based sequence to structure compatibility score and a pure sequence-sequence component substitution score

– Library: Dali PDB90 (24599 srtuctures)

• Tools for structure comparison– Pair structures comparison:

• TopMatch• Matras: (http://biunit.naist.jp/matras/)

– Multiple structures comparison: • 3D-surfer• Matras: (http://biunit.naist.jp/matras/)

http://biunit.naist.jp/matras/

http://biunit.naist.jp/matras/

• TopMatch (Sippl & Wiederstein (2008) A note on difficult structure alignment

problems. Bioinformatics 24, 426-427)– Rationale:

• Structure alignment: http://www.cgl.ucsf.edu/home/meng/grpmt/structalign.html

• Similarity measurement

– Input format• PDB, SCOP and CATH code• PDB structure directly

– Exercise: http://topmatch.services.came.sbg.ac.at/

2,, bababa DLLS

http://www.cgl.ucsf.edu/home/meng/grpmt/structalign.html

http://topmatch.services.came.sbg.ac.at/

• 3D-surfer (David La et al. 3D-SURFER: software for high throughput protein

surface comparison and analysis. Bioinformatics , in press. (2009))– Rationale

1. Define a surface function2. Transform the surface function into a 3D Zernike description

function

– Input format• PDB and CATH code• PDB structure directly

– Exercise: http://dragon.bio.purdue.edu/3d-surfer/

,,, mlnl

mnl YrRrZ

http://dragon.bio.purdue.edu/3d-surfer/

Cluster analysis

• Goal: – Grouping the data into classes or clusters, so that

objects within a cluster have high similarity in comparison to one another but are very dissimilar to objects in other clusters.

• Methods– Partitioning method: k-means– Density-based method: Ordering Points to

Identify the Clustering Structure (OPTICS)

• k-means– Rationale: Partition n observations into k clusters

in which each observation belongs to the cluster with the nearest mean

– Exercise

http://cgm.cs.ntust.edu.tw/etrex/kMeansClustering/kMeansClustering2.html

k

i Cpi

i

mpE1

2



• OPTICS– Rationle: Partition

observations based on the density of similar objects

– Exercise

http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo/



• Example: Folding of Trp-cage peptide

Phylogenetic analysis

• Overviews– Comparisons of more than two sequences– Analysis of gene families, including functional

predictions– Estimation of evolutionary relationships among

organisms

• Theoretical tree– Parsimony method– Distance matrix method– Maximum likelihood and Bayesian method– Invariants method

• Software– Collections of toolshttp://evolution.genetics.washington.edu/phylip/software.html – A web server version for tree construction and display

• PHYLIP, http://bioweb2.pasteur.fr/phylogeny/intro-en.html • Interactive tree of life, http://itol.embl.de/

– Mostly common used stand alone software• PHYLIP, tool for evaluating similarity of nucleotide and amino

acid sequences.http://evolution.gs.washington.edu/phylip.html • TreeView, tool for visualization and manipulation of family

tree.http://taxonomy.zoology.gla.ac.uk/rod/treeview.html • Matlab - bioinformatics tool box

http://evolution.genetics.washington.edu/phylip/software.html

http://bioweb2.pasteur.fr/phylogeny/intro-en.html

http://itol.embl.de/

http://evolution.gs.washington.edu/phylip.html

http://taxonomy.zoology.gla.ac.uk/rod/treeview.html

• Example: Alignment phylogenetic tree of Tubulin family– Searching homologous sequences of Tubulin (PDB

code: 1JFF) from RCSB protein databank• Blast for pair sequence alignment• Clustalw for comparative sequence alignment

– Evaluating protein distance matrix • using “Protdist” of PHYILIP (Particularly, Point Accepted

Mutation (PAM) matrix is used)

– Clustering proteins using “Neighbor” of PHYILIP (Neightboring-Joint method is considered)

• Example: n-distance phylogenetic tree– Evaluating n-distance matrix

• n-distance method

– Clustering proteins using “Neighbor” of PHYILIP (Neightboring-Joint method is considered)

• 16S and 18S Ribosomal RNA sequenecs of 35 organisms

Summary

• Homology modeling• Tools for structure prediction and

comparisons• Tools for phylogenetic tree construction

Thanks for your attention!!

1Z5V_A 3CB2_A 1JFF_B 1FFX_B 1TUB_B 1Z2B_B

1Z5V_A 0 0.000010 1.349411 1.349411 1.303115 1.345634

3CB2_A 0.000010 0 1.350506 1.350506 1.303115 1.346730

1JFF_B 1.349411 1.350506 0 0.000010 0.000010 0.010729

1FFX_B 1.349411 1.350506 0.000010 0 0.000010 0.010729

1TUB_B 1.303115 1.303115 0.000010 0.000010 0 0.006725

1Z2B_B 1.345634 1.346730 0.010729 0.010729 0.006725 0

•Protein distance matrix

•Tubulin family tree

• n-distance method– Frequency count of “n-letter words”

– n-dsiatnce matrix

– Advantage: 1. Identify fully conservative words located at nearly the

same sites2. Effecient

MREIVHIQAGQCGNQIGAKFWEVISDEHGIDPTGSYHGDSDLQLERINVYYNE

Nfp /

'', ppDn

Bioinformatics

Documents

structure alignments

cath codepdb structure

structure compatibility

surface function

fragmentbased sequence

sequence profiles

protein threading server

meansdensitybased method