Page 1: David Hoksza, hoksza@ksi.mff.cuni.cz Supervisor: Tomáš Skopal, skopal@ksi.mff.cuni.cz KSI MFF UK Similarity Search in Protein Databases.


Page 2:

Thesis Contributions

Protein sequence similarity
  metric indexing approach
  sequential approach
  speedup within existing similarity model

Protein structure similarity
  nonalignment-based similarity approach
  alignment-based similarity approach
  similarity modelling itself

Page 3:

Protein structure & Motivation

Transportation, building, signalling, catabolism, ...

Molecule consisting of 20 types of amino acids (AA)

Central dogma of molecular biology: DNA → RNA → protein

Proteins’ 3D interactions secure biological function
  protein structure similarity → biological function similarity
  protein sequence similarity → protein structure similarity

Source: http://en.wikipedia.org/wiki/File:Main_protein_structure_levels_en.svg

Page 4:

Protein Sequence Similarity

Edit distance
  minimal number of editing operations (insert/update/delete) needed to convert one sequence into the other
  alignment

Weighted edit distance
  scoring matrix (PAM, BLOSUM, …)
    amino acids’ mutation probability
  affine gap penalty system
  global alignment
    dynamic programming - Needleman-Wunsch (NW) algorithm
  local alignment
    dynamic programming - Smith-Waterman (SW) algorithm
    the locality leads to nonmetric measures

Figure (alignment example): S1 = NPHGIIMGLAE, S2 = HGALGLLE, differing positions marked x; ED(S1, S2) = 8
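The edit distance defined on this slide can be sketched with the standard dynamic-programming recurrence. This is a minimal unit-cost illustration, not the thesis implementation: the function name and the test strings are illustrative assumptions, and a weighted variant would replace the unit costs with scoring-matrix entries and gap penalties.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance (insert/update/delete) via dynamic programming."""
    # prev[j] is the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # distance between s1[:i] and the empty prefix of s2
        for j, b in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # update (costs nothing on a match)
            ))
        prev = curr
    return prev[-1]


# Illustrative strings, not taken from the slides.
print(edit_distance("kitten", "sitting"))  # prints 3
```

Only two rows of the DP matrix are kept at a time, which is enough when the distance alone is needed; recovering the alignment itself would require the full matrix or a traceback structure.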

Page 5:

Metric Indexing Approach

Using metric access methods (MAMs)
  organize objects into regions for effective filtering
  require metric distance function

Distance function
  E-value with SW
    standard in biological databases
    statistical relevance
  dealing with reflexivity, symmetry, triangular inequality (Trigen)
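Why MAMs require a metric can be illustrated with pivot-based filtering: the triangular inequality gives a lower bound on the query-to-object distance, so many objects are discarded without computing the expensive distance. The sketch below is a generic illustration under stated assumptions, not the index used in the thesis; the function names, the toy 2-D points, and the Euclidean distance are all hypothetical stand-ins.

```python
import math


def build_pivot_table(db, pivots, dist):
    """Precompute the distance from every database object to every pivot."""
    return [[dist(obj, p) for p in pivots] for obj in db]


def range_query(query, radius, db, pivots, table, dist):
    """Return (hits, number of full distance computations against db objects)."""
    q_to_p = [dist(query, p) for p in pivots]
    hits, computed = [], 0
    for obj, row in zip(db, table):
        # Triangular inequality: |d(q,p) - d(o,p)| <= d(q,o) for every pivot p.
        lower_bound = max(abs(qp - op) for qp, op in zip(q_to_p, row))
        if lower_bound > radius:
            continue  # safely filtered out without computing d(q, o)
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed


# Toy data: 2-D points under the Euclidean metric (hypothetical values).
db = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (10.0, 10.0)]
pivots = [(0.0, 0.0), (10.0, 0.0)]
table = build_pivot_table(db, pivots, math.dist)
hits, computed = range_query((1.0, 0.0), 1.5, db, pivots, table, math.dist)
print(hits, computed)
```

If the distance violates the triangular inequality, the lower bound is no longer valid and relevant objects may be filtered out, which is why a nonmetric measure has to be adjusted before it is indexed this way.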

D. Hoksza, T. Skopal. Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007

Page 6:

Metric Indexing Approach

D. Hoksza, T. Skopal. Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007

Swiss-Prot database
  3,000 DB sequences
  100 query sequences

Page 7:

Sequence-Based Approach

Speeding up distance computation itself

Skipping common parts in the DP matrix
  context dependency problem
  prefixes
    database sort
    skipping common prefixes
    prefix ratio (PR) = speedup
  suffixes
    database division
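The prefix idea above can be sketched as follows: when the database is sorted, consecutive sequences tend to share a prefix, and the DP rows computed for that prefix against the query can be reused. This is a simplified illustration with unit-cost edit distance, not the published method; the function names are hypothetical, and defining the prefix ratio as the fraction of DP rows skipped is an assumption made for this sketch. It does not address the context dependency problem or the suffix case.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def search_sorted(query: str, database):
    """Edit distance from query to every database sequence, reusing DP rows.

    Returns (distances, prefix_ratio); prefix_ratio is the fraction of DP rows
    that did not have to be recomputed (an assumed definition).
    """
    distances = {}
    rows = [list(range(len(query) + 1))]  # row 0: empty database prefix
    prev_seq = ""
    computed = total = 0
    for seq in sorted(database):
        keep = common_prefix_len(prev_seq, seq)
        del rows[keep + 1:]  # rows 0..keep depend only on the shared prefix
        for i in range(keep + 1, len(seq) + 1):
            a, above = seq[i - 1], rows[i - 1]
            row = [i]
            for j, b in enumerate(query, start=1):
                row.append(min(above[j] + 1, row[j - 1] + 1,
                               above[j - 1] + (a != b)))
            rows.append(row)
        computed += len(seq) - keep
        total += len(seq)
        distances[seq] = rows[len(seq)][-1]
        prev_seq = seq
    prefix_ratio = 1 - computed / total if total else 0.0
    return distances, prefix_ratio


# Illustrative data, not taken from the slides.
distances, pr = search_sorted("kitten", ["sitting", "kitten", "kit", "kits"])
print(distances, round(pr, 2))
```

Each row of the DP matrix corresponds to one database residue, so a shared prefix of length k saves k rows of work for the second sequence; the saving grows with the amount of redundancy in the sorted database.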

D. Hoksza. Improved Alignment of Protein Sequences Based on Common Parts, ISBRA 2008, LNCS

Page 8:

Sequence-Based Approach

D. Hoksza. Improved Alignment of Protein Sequences Based on Common Parts, ISBRA 2008, LNCS

UniProt database

Page 9:

Protein Structure Similarity

Nonalignment-based approach
  indexing
  feature extraction
  simple or ad-hoc similarity measure

Alignment-based approach
  sequential scan
  feature extraction
  alignment
  superposition (3D transformation)
  similarity measure
    RMSD, TM-score

Quality criterion
  classification accuracy (SCOP hierarchical classification)
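The two similarity measures named above can be written down for a pair of structures that are already aligned and superposed. The sketch below uses the commonly published formulas (RMSD as root mean square deviation of aligned atom pairs, and TM-score with d0 = 1.24 (L - 15)^(1/3) - 1.8); the function names, the lower bound applied to d0, and the toy coordinates are assumptions made for illustration. The search for the optimal superposition, which the full TM-score maximizes over, is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation of aligned, already superposed atom pairs."""
    sq = [math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)]
    return math.sqrt(sum(sq) / len(sq))


def tm_score(coords_a, coords_b, target_len, d0_floor=0.5):
    """TM-score term sum for one fixed superposition, normalized by target_len.

    d0_floor is an assumed lower bound that keeps d0 positive for short chains.
    """
    d0 = d0_floor
    if target_len > 15:
        d0 = max(d0_floor, 1.24 * (target_len - 15) ** (1.0 / 3.0) - 1.8)
    total = sum(1.0 / (1.0 + (math.dist(p, q) / d0) ** 2)
                for p, q in zip(coords_a, coords_b))
    return total / target_len


# Toy coordinates (hypothetical): a straight chain and a copy shifted by 3 units.
chain = [(float(i), 0.0, 0.0) for i in range(20)]
shifted = [(x, y + 3.0, z) for x, y, z in chain]
print(rmsd(chain, shifted), tm_score(chain, shifted, target_len=len(chain)))
```

RMSD grows without bound and is dominated by the worst-fitting pairs, whereas every TM-score term lies in (0, 1], so the normalized score stays in (0, 1] and is less sensitive to local deviations.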

Page 10:

Density-Based Feature Extraction

D. Hoksza. DDPIn: Distance and Density-Based Protein Indexing, CIBCB 2009, IEEE

Features
  n-dimensional vectors of real numbers
  AA ≈ viewpoint → VPT (viewpoint tag)

sDens
  density of AAs in rings of predefined width

sRad
  widths of rings containing predefined percentage of AAs
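The two feature types described above can be sketched for a single viewpoint amino acid. This is an interpretation of the slide text under stated assumptions, not the DDPIn implementation: each AA is represented by one 3-D coordinate, sDens is returned as raw counts per ring (any normalization is omitted), sRad splits the other AAs into equally sized groups, and all names and toy coordinates are hypothetical.

```python
import math


def sdens(coords, viewpoint, ring_width, n_rings):
    """Number of AAs in each concentric ring of fixed width around the viewpoint."""
    center = coords[viewpoint]
    counts = [0] * n_rings
    for k, p in enumerate(coords):
        if k == viewpoint:
            continue
        ring = int(math.dist(center, p) // ring_width)
        if ring < n_rings:  # AAs beyond the outermost ring are ignored
            counts[ring] += 1
    return counts


def srad(coords, viewpoint, n_rings):
    """Widths of consecutive rings that each hold 1/n_rings of the other AAs."""
    center = coords[viewpoint]
    dists = sorted(math.dist(center, p)
                   for k, p in enumerate(coords) if k != viewpoint)
    widths, inner = [], 0.0
    for r in range(1, n_rings + 1):
        # Index of the last AA inside the r-th cumulative fraction (ceiling).
        idx = -(-r * len(dists) // n_rings) - 1
        widths.append(dists[idx] - inner)
        inner = dists[idx]
    return widths


# Toy coordinates (hypothetical), viewpoint is the first AA.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
          (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
print(sdens(coords, viewpoint=0, ring_width=2.5, n_rings=4))  # [2, 1, 0, 1]
print(srad(coords, viewpoint=0, n_rings=2))                   # [2.0, 6.0]
```

Both functions yield one fixed-length vector of real numbers per amino acid, which matches the n-dimensional feature vectors mentioned on the slide.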

Page 11:

Nonalignment-Based Approach

One-step search
  o database creation
    1. AAs → feature vectors
    2. indexing using weighted L2 metric and MAM
  o querying
    1. AAs → feature vectors
    2. feature vector → query object
    3. results’ merging
    4. SCOP classification

Two-step search
  2 one-step searches
  results’ comparison
  rescoring using Smith-Waterman
  SCOP classification
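The querying steps of the one-step search can be sketched as below. This is a simplified illustration, not the published pipeline: a sequential scan stands in for the MAM, results are merged by simple vote counting, and the function names, weights, protein identifiers, and SCOP-style labels are all hypothetical.

```python
import math
from collections import Counter


def weighted_l2(x, y, weights):
    """Weighted L2 (Euclidean) distance between two feature vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def knn(query_vec, index, weights, k):
    """k nearest (distance, protein_id) pairs; sequential scan instead of a MAM."""
    scored = sorted((weighted_l2(query_vec, vec, weights), pid)
                    for pid, vec in index)
    return scored[:k]


def classify(query_vectors, index, weights, scop_labels, k=2):
    """One kNN query per AA feature vector, merged by counting votes per protein."""
    votes = Counter()
    for vec in query_vectors:
        for _, pid in knn(vec, index, weights, k):
            votes[pid] += 1
    best_pid, _ = votes.most_common(1)[0]
    return scop_labels[best_pid]


# Toy index: (protein id, feature vector of one AA); all values are hypothetical.
index = [("p1", (0.0, 0.0)), ("p1", (0.1, 0.0)),
         ("p2", (5.0, 5.0)), ("p2", (5.1, 5.0))]
scop_labels = {"p1": "a.1.1", "p2": "b.1.1"}
query_vectors = [(0.05, 0.0), (0.2, 0.1)]
print(classify(query_vectors, index, (1.0, 1.0), scop_labels))  # a.1.1
```

Because the weighted L2 distance with non-negative weights is a metric, the sequential scan in `knn` could be replaced by any metric index without changing the results.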

D. Hoksza. DDPIn: Distance and Density-Based Protein Indexing, CIBCB 2009, IEEE

Page 12:

Alignment-Based Approach

D. Hoksza, J. Galgonek. Density-Based Classification of Protein Structures Using Iterative TM-score, BIBMW 2009, IEEE

D. Hoksza, J. Galgonek. Alignment-Based Extension to DDPIn Feature Extraction, IJCB, ACTA Press, 2010

Page 13:

Alignment-Based Approach

Feature extraction
  AAs → feature vectors
  density-based feature extraction

Alignment (amino acid matching)
  Smith-Waterman alignment
  distance between feature vectors → scoring matrix
  modified variable gap penalty system

Superposition + scoring
  RMSD
  TM-score
    reducing number of initial states
    iterative dynamic programming with belt-based restriction
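The alignment step can be sketched as a Smith-Waterman recurrence whose substitution scores are derived from distances between feature vectors rather than from a PAM or BLOSUM matrix. This is a minimal sketch under stated assumptions: the score `match_bonus - distance`, the constant linear gap penalty (a stand-in for the modified variable gap penalty system), the function name, and the toy vectors are all illustrative and not taken from the papers.

```python
import math


def smith_waterman(fv_a, fv_b, match_bonus=1.0, gap=0.5):
    """Best local alignment score between two sequences of feature vectors."""
    best = 0.0
    prev = [0.0] * (len(fv_b) + 1)
    for a in fv_a:
        curr = [0.0]
        for j, b in enumerate(fv_b, start=1):
            # Similar feature vectors (small distance) give a positive score.
            score = match_bonus - math.dist(a, b)
            curr.append(max(
                0.0,                  # start a new local alignment
                prev[j - 1] + score,  # align a with b
                prev[j] - gap,        # gap in fv_b
                curr[j - 1] - gap,    # gap in fv_a
            ))
        best = max(best, max(curr))
        prev = curr
    return best


# Toy one-dimensional feature vectors (hypothetical values).
protein_a = [(9.0,), (0.0,), (1.0,), (2.0,), (7.0,)]
protein_b = [(0.0,), (1.0,), (2.0,)]
print(smith_waterman(protein_a, protein_b))  # 3.0
```

The zero in the maximum is what makes the alignment local: a poorly matching region resets the score instead of penalizing the rest of the alignment.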

D. Hoksza, J. Galgonek. Density-Based Classification of Protein Structures Using Iterative TM-score, BIBMW 2009, IEEE

D. Hoksza, J. Galgonek. Alignment-Based Extension to DDPIn Feature Extraction, IJCB, ACTA Press, 2010

Page 14:

Alignment-Based Approach - Indexing

Indexability measures
  intrinsic dimensionality
    objects’ distribution
  ball overlap factor (BOF)
    ball regions’ separation
  T-error
    nontriangular triplets

Semimetrization
  reranking

Metrization
  Trigen
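Two of the indexability measures above can be computed directly from a matrix of pairwise distances. The sketch assumes the usual definitions: intrinsic dimensionality as mean squared over twice the variance of the pairwise distances, and T-error as the fraction of object triplets that violate the triangular inequality. The function names and the toy matrix are hypothetical, the matrix is assumed symmetric, and the square-root modifier is only one example of a concave function, not the modifier selected by Trigen.

```python
import math
from itertools import combinations
from statistics import mean, pvariance


def intrinsic_dimensionality(dm):
    """mu^2 / (2 * sigma^2) over all pairwise distances (undefined if sigma = 0)."""
    d = [dm[i][j] for i, j in combinations(range(len(dm)), 2)]
    return mean(d) ** 2 / (2 * pvariance(d))


def t_error(dm):
    """Fraction of triplets whose two shorter sides sum to less than the longest."""
    bad = total = 0
    for i, j, k in combinations(range(len(dm)), 3):
        a, b, c = sorted((dm[i][j], dm[j][k], dm[i][k]))
        total += 1
        bad += a + b < c
    return bad / total


def modify(dm, f):
    """Apply a modifier function f to every distance in the matrix."""
    return [[f(x) for x in row] for row in dm]


# Squared differences of the points 0, 1, 2: a nonmetric toy example.
squared = [[0.0, 1.0, 4.0],
           [1.0, 0.0, 1.0],
           [4.0, 1.0, 0.0]]
print(t_error(squared))                     # 1.0
print(t_error(modify(squared, math.sqrt)))  # 0.0
```

A concave modifier removes nontriangular triplets but also compresses the distance distribution, which raises the intrinsic dimensionality and makes filtering less effective, so the two measures have to be balanced against each other.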

J. Galgonek, D. Hoksza. On the Effectiveness of Distances Measuring Protein Structure Similarity, SISAP 2009, IEEE

Page 15:

Conclusion

Protein sequence search
  indexing
    data domain exploration
  speeding distance computation
    20% speedup
    expected speedup growth

Protein structure search
  nonalignment-based
    best accuracy on all SCOP levels
  alignment-based
    best accuracy on superfamily and fold SCOP levels
    indexing possibilities
    comparison with nonalignment-based

Page 16:

Publications

D. Hoksza, T. Skopal. Index-Based Approach to Similarity Search in Protein and Nucleotide Databases. DATESO 2007

D. Hoksza. Improved Alignment of Protein Sequences Based on Common Parts, ISBRA 2008, LNCS

D. Hoksza. DDPIn: Distance and Density-Based Protein Indexing, CIBCB 2009, IEEE

D. Hoksza, J. Galgonek. Density-Based Classification of Protein Structures Using Iterative TM-score, BIBMW 2009, IEEE

J. Galgonek, D. Hoksza. On the Effectiveness of Distances Measuring Protein Structure Similarity, SISAP 2009, IEEE

D. Hoksza, J. Galgonek. Alignment-Based Extension to DDPIn Feature Extraction, IJCB, ACTA Press, 2010