DDPIn Distance and Density Based Protein Indexing David Hoksza Charles University in Prague Department of Software Engineering Czech Republic

Jan 17, 2016

Dayna Booker
DDPIn Distance and Density Based Protein Indexing

David Hoksza

Charles University in Prague

Department of Software Engineering

Czech Republic

CIBCB 2009 2

Presentation Outline

Biological background

Similarity search in protein structure databases

DDPIn
  feature vector extraction
  metrics
  querying
    one-step approach
    multi-step approach

Experimental results

Conclusion

Biological Background

Proteins
  molecules translated from mRNA in ribosomes
  DNA → RNA → protein
  sequence of amino acids (20 AAs), each coded by a codon (triplet of nucleotides)

Function of a protein is derived from its three-dimensional structure
  → similar proteins have similar functions
  similar proteins have a common ancestor

Identifying protein structure → finding similar proteins → getting a clue to the function

Similarity Search in Protein Databases

Similarity between a pair of proteins
  alignment + similarity score
    RMSD, TM-score, …
  visual inspection
  DALI, CE, SAP, VAST…

Classification
  SCOP (Structural Classification of Proteins)
  no need for an alignment
  indexing various features
    PSI, PSIST, ProGreSS, CTSS, … DDPIn

DDPIn - Overview

Distance and Density based Protein Indexing

Classification method

Indexing of protein features
  distances among Cα atoms used
  each AA represents a feature → protein p consists of |p| features
  various semantics used
    based on clustering Cα atoms into rings
  metric indexing employed (M-tree)

kNN querying
  outcomes of several searches are merged to obtain final results

DDPIn - Feature Extraction

Features
  n-dimensional vectors of real numbers
  AA ≈ viewpoint → VPT (viewpoint tag)

sDens
  density of AAs in rings with a predefined width
sDensSSE
  enhanced with SSE information
sRad
  widths of rings containing a predefined percentage of AAs
sRadSSE
  enhanced with SSE information
sDir
  number of AAs in a ring pointing from the viewpoint
  sDens enhanced with direction information
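
The slide gives only one-line descriptions of the VPT semantics, so the following Python sketch is one possible reading of sDens and sRad, not the author's implementation. It assumes that each residue is represented by the 3D coordinates of its Cα atom, that rings are concentric spherical shells centered at the viewpoint, and that "density" means the raw count of other Cα atoms per shell. The function names and the parameters `ring_width`, `fraction`, and `n_rings` are hypothetical.

```python
import math

def sdens(coords, vp_index, ring_width, n_rings):
    """Count the other C-alpha atoms in equal-width shells around one viewpoint.

    Assumption: "density" is taken as a raw count per shell.
    """
    vp = coords[vp_index]
    tag = [0] * n_rings
    for j, c in enumerate(coords):
        if j == vp_index:
            continue
        ring = int(math.dist(vp, c) // ring_width)
        if ring < n_rings:          # atoms beyond the last ring are ignored
            tag[ring] += 1
    return tag

def srad(coords, vp_index, fraction, n_rings):
    """Widths of successive shells, each holding `fraction` of the other atoms.

    Assumption: shell boundaries are placed at the distance of the last atom
    that falls into the shell.
    """
    vp = coords[vp_index]
    dists = sorted(math.dist(vp, c) for j, c in enumerate(coords) if j != vp_index)
    per_ring = max(1, round(fraction * len(dists)))
    widths, inner = [], 0.0
    for r in range(n_rings):
        last = min((r + 1) * per_ring, len(dists)) - 1
        outer = dists[last]
        widths.append(outer - inner)
        inner = outer
    return widths

# Toy chain of five C-alpha atoms on the x-axis; viewpoint is the first atom.
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
         (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
dens_tag = sdens(chain, 0, ring_width=2.0, n_rings=3)
rad_tag = srad(chain, 0, fraction=0.5, n_rings=2)
```

Computing one such tag per residue yields the |p| feature vectors of a protein p. The sketch requires Python 3.8 or later for `math.dist` and a chain with at least two atoms.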

DDPIn - Similarity of VPTs

Metrics
  L2
    $d(x, y) = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$
  weighted L2
    $d(x, y) = \sqrt{\sum_{i=1}^{n} w_i |x_i - y_i|^2}$
    close neighborhood of VPs is more important
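
As a small illustration of the two metrics, the Python sketch below compares two VPTs with plain and weighted L2. The slide does not state which weights were used; `decreasing_weights` is a hypothetical choice (1, 1/2, 1/3, ...) that merely reflects the remark that rings close to the viewpoint matter more.

```python
import math

def l2(x, y):
    """Euclidean (L2) distance between two VPTs of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_l2(x, y, w):
    """L2 distance with a per-dimension weight w_i on each squared difference."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

def decreasing_weights(n):
    """Hypothetical weighting: the i-th ring (1-based) gets weight 1/i."""
    return [1.0 / (i + 1) for i in range(n)]

x, y = [0.0, 0.0], [2.0, 2.0]
plain = l2(x, y)                                   # sqrt(4 + 4)
weighted = weighted_l2(x, y, decreasing_weights(2))  # sqrt(4 + 2)
```

With strictly positive weights the weighted form is still a metric, which is what a metric index such as the M-tree requires.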

DDPIn – Indexing Structure

M-tree (Metric tree)
  Dynamic, hierarchical indexing structure
  Data space divided into ball-shaped data regions (hyper-spheres)
    root node represents data region covering all data
    children nodes represent regions covering parts of the space, …
  data regions form balanced hierarchical structure
    inner nodes → routing entries
      $rout_l(O_i) = [O_i, r_{O_i}, d(O_i, par(O_i)), ptr(T(O_i))]$
    leaf nodes → ground entries
      $grnd(O_i) = [O_i, d(O_i, par(O_i))]$
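
To make the two entry types concrete, here is a minimal Python sketch of routing and ground entries together with a range query that prunes whole ball regions. It is not the M-tree implementation used in DDPIn: the class and field names are hypothetical, the tree is built by hand instead of by dynamic insertion, and the stored distance to the parent is kept only to mirror the entry layout (a full M-tree also uses it to skip distance computations).

```python
import math
from dataclasses import dataclass, field

@dataclass
class GroundEntry:
    obj: tuple               # indexed feature vector O_i
    dist_to_parent: float    # d(O_i, par(O_i))

@dataclass
class RoutingEntry:
    obj: tuple               # routing object O_i, center of the ball region
    radius: float            # covering radius r_{O_i}
    dist_to_parent: float    # d(O_i, par(O_i)); set to 0.0 at the root here
    children: list = field(default_factory=list)  # stands in for ptr(T(O_i))

def range_query(entries, q, r_q, dist=math.dist):
    """Return all indexed objects within distance r_q of the query q."""
    hits = []
    for e in entries:
        d = dist(q, e.obj)
        if isinstance(e, RoutingEntry):
            # The query ball and the region ball can overlap only if
            # d(q, O_i) <= r_{O_i} + r_q; otherwise the subtree is skipped.
            if d <= e.radius + r_q:
                hits.extend(range_query(e.children, q, r_q, dist))
        elif d <= r_q:
            hits.append(e.obj)
    return hits

left = RoutingEntry((0.0, 0.0), 1.5, 0.0,
                    [GroundEntry((0.0, 0.0), 0.0), GroundEntry((1.0, 0.0), 1.0)])
right = RoutingEntry((10.0, 0.0), 1.0, 0.0,
                     [GroundEntry((10.0, 0.0), 0.0), GroundEntry((11.0, 0.0), 1.0)])
found = range_query([left, right], (0.5, 0.0), 0.6)
```

In this toy tree the right-hand region is never descended into, because its center lies farther from the query than the sum of the two radii.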

Querying / Classification

One-step
  extracting VPTs from query → n queries
  ranking scheme

Two-step
  healing
  reclassification with Smith-Waterman algorithm on sequences
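
The one-step approach can be sketched in Python as follows. The slide does not specify the ranking scheme, so the 1/rank vote used here is purely an assumption, and a brute-force scan stands in for the M-tree kNN query. The names `knn` and `classify` and the layout of `db` (a list of `(vpt, label)` pairs) are hypothetical.

```python
import math
from collections import defaultdict

def knn(db, query_vpt, k):
    """Brute-force stand-in for an M-tree kNN query over (vpt, label) pairs."""
    return sorted(db, key=lambda item: math.dist(item[0], query_vpt))[:k]

def classify(db, query_vpts, k):
    """Run one kNN query per query VPT and merge the outcomes into one label."""
    scores = defaultdict(float)
    for vpt in query_vpts:
        for rank, (_, label) in enumerate(knn(db, vpt, k), start=1):
            scores[label] += 1.0 / rank   # hypothetical ranking scheme
    return max(scores, key=scores.get)

# Toy database: two VPTs from superfamily "a", two from superfamily "b".
db = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"),
      ((5.0, 5.0), "b"), ((5.1, 5.0), "b")]
query = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1)]   # three VPTs of one query protein
label = classify(db, query, k=2)
```

A query protein with n residues issues n such kNN queries; the merged scores then decide the predicted class.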

Experimental Results

SCOP 1.65 dataset
  class → fold → superfamily → family
  1810 proteins
  181 superfamilies
    at least 10 proteins each
  all α, all β, α + β and α/β classes

Query set
  reduced - 181 queries
  full
  used also by PSI, ProGreSS, PSIST methods

Testing of
  superfamily classification accuracy
  fold classification accuracy

Finding Optimal k for kNN Queries

Accuracy of VPT Semantics

Accuracy for Increasing Dimension

Accuracy of Various Metrics

Suitability of Pairs of VPT Semantics for Healing

identical correct classification

identical wrong classification

Comparison of Classification Methods

Conclusion

We have proposed a new representation of protein structures
  distance and density of Cα atoms
  ranking scheme
  two-step classification

We implemented
  M-tree indexing for the proposed representation
  classification against SCOP

Experimental results
  best results among methods using identical classification
    98.9% superfamily classification accuracy
    100% fold classification accuracy
  comparable run time