Directions in Protein Contact Map Mining Mohammed J. Zaki Computer Science Dept. joint work with Jingjing Hu & Xiaolan Shen, CS Dept. Yu Shao & Prof. Chris Bystroff, Biology Dept.
Dec 29, 2015

Directions in Protein Contact Map Mining

Mohammed J. Zaki
Computer Science Dept.

joint work with
Jingjing Hu & Xiaolan Shen, CS Dept.
Yu Shao & Prof. Chris Bystroff, Biology Dept.

Rensselaer Polytechnic Institute, Troy NY

Protein Structures

Primary structure
  Un-branched polymer
  20 side chains (residues or amino acids)
  PDB file 2IGD: MTPAVTTYSLVINGLTLSGU…..

Higher order structures
  Secondary: local (consecutive) in sequence
  Tertiary: 3D fold of one polypeptide chain
  Quaternary: chains packing together

PDB protein 2IGD

[Figure: structure of 2IGD with labeled anti-parallel beta sheets, parallel beta sheets, and alpha helix]

The Protein Folding Problem

Contact Map

Amino acids Ai and Aj are in contact if their 3D distance is less than contact threshold (e.g., 7 Angstroms)

Sequence separation is given as |i-j|

Contact map C is a symmetric N x N matrix with
  C(i,j) = 1 if Ai and Aj are in contact
  C(i,j) = 0 otherwise

Consider all pairs with |i-j| >= 4
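
The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `contact_map`, its parameters, and the choice of one representative 3D point per residue are assumptions, and the toy coordinates are made up.

```python
import numpy as np

def contact_map(coords, threshold=7.0, min_separation=4):
    """Return a symmetric N x N 0/1 contact map.

    coords: (N, 3) array with one representative 3D point per residue
    (which atom to use is an assumption left to the caller).
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Pairwise Euclidean distances between all residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sequence separation |i - j| for every pair.
    idx = np.arange(n)
    separation = np.abs(idx[:, None] - idx[None, :])
    # In contact: closer than the threshold and far enough apart in sequence.
    return ((dist < threshold) & (separation >= min_separation)).astype(int)

# Toy chain (hypothetical data): six residues 3.8 Angstroms apart on a line,
# plus a seventh residue folded back near the start of the chain.
coords = [(3.8 * k, 0.0, 0.0) for k in range(6)] + [(0.0, 5.0, 0.0)]
C = contact_map(coords)
```

In the toy chain only the last residue comes within 7 Angstroms of residues that are at least four positions away, so the map has contacts only in its last row and column.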

Contact Map (2IGD)

[Figure: contact map of 2IGD; axes: Amino Acid Ai and Amino Acid Aj; labeled regions: anti-parallel beta sheets, alpha helix, parallel beta sheets]

Characterizing Physical, Protein-like Contact Maps

A very small subset of all contact maps code for physically possible proteins (self-avoiding, globular chains)

A contact map must:
  Satisfy geometric constraints
  Represent low-energy structure

Characterizing Physical Contact Maps in Proteins

What are the typical non-local interactions?
  Frequent dense 0/1 sub-matrices in contact maps

3-step approach
  Dense pattern mining
  Pruning mined patterns
  Clustering dense patterns (non-local pattern signatures)

Dense Pattern Mining

Frequent 2D Pattern Mining
  Use WxW sliding window; W window size
  Measure density under each window
  (N-W)^2 / 2 possible windows for N length protein
  Look for “minimum density” (number of 1’s) scale away from diagonal
  Try different window sizes
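
A minimal sketch of the sliding-window step, under stated assumptions: the function name `dense_windows` is hypothetical, only windows whose top-left corner lies above the diagonal are visited (the map is symmetric), and the slide's scaling of the density requirement away from the diagonal is not modeled.

```python
import numpy as np

def dense_windows(C, W=5, min_density=5):
    """Yield (i, j, window) for W x W windows holding at least min_density ones.

    Only top-left corners with j > i are visited, since the map is symmetric.
    """
    C = np.asarray(C)
    n = C.shape[0]
    for i in range(n - W + 1):
        for j in range(i + 1, n - W + 1):
            window = C[i:i + W, j:j + W]
            if window.sum() >= min_density:
                yield i, j, window

# Toy 12-residue map (hypothetical data) with one ladder of five contacts.
C = np.zeros((12, 12), dtype=int)
for k in range(5):
    C[k, 11 - k] = C[11 - k, k] = 1

hits = [(i, j) for i, j, _ in dense_windows(C, W=5, min_density=5)]
```

Here `hits` contains the single window whose corner is at (0, 7), the only 5x5 window that covers all five contacts of the ladder.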

Counting Dense Patterns

Naïve Approach: for W=5, N=60 there are 1485 windows per protein. 28 million possible windows for 18,544 proteins (in PDB)

Test if two sub-matrices are equal
  Linear search: O(P x W^2) with P current dense patterns
  Hash based: O(W^2)

Our Approach: 2-level Hashing
  O(W) time

Pattern (WxW Sub-matrix) Encoding

Encode sub-matrix as string (W ints)

Sub-matrix   Integer Value
00000        0
01100        12
01000        8
01000        8
00000        0

Concatenated String: 0.12.8.8.0
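
The encoding can be reproduced with a short function. This is a sketch; the name `string_id` is chosen here for illustration, and each row is read as a binary number with the leftmost cell as the most significant bit, which matches the values listed on the slide.

```python
def string_id(window):
    """Encode a W x W 0/1 sub-matrix as W dot-separated integers, one per row."""
    values = []
    for row in window:
        # Read the row as a binary number, leftmost cell most significant.
        values.append(int("".join(str(int(b)) for b in row), 2))
    return ".".join(str(v) for v in values)

window = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(string_id(window))  # 0.12.8.8.0
```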

Two-level Hashing

String-ID(M) = $v_1.v_2.\ldots.v_W$

Level1 (approximate): $h_1(M) = \sum_{i=1}^{W} v_i$

Level2 (exact): $h_2(M) = \text{String-ID}(M)$
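
One way to picture the two levels is a table of buckets keyed by the row-value sum, each bucket keyed in turn by the exact String-ID. The sketch below is illustrative only: the class `PatternTable` and its helpers are hypothetical names, and Python dictionaries stand in for whatever hash tables the authors used, so it does not demonstrate the stated time bounds.

```python
from collections import defaultdict

def row_values(window):
    """Integer value of each row, reading the row as a binary number."""
    return [int("".join(str(int(b)) for b in row), 2) for row in window]

class PatternTable:
    """Level 1 (sum of row values) picks a bucket; level 2 (String-ID) resolves it."""

    def __init__(self):
        self.buckets = defaultdict(dict)

    def add(self, window):
        values = row_values(window)
        h1 = sum(values)                        # level 1: approximate
        h2 = ".".join(str(v) for v in values)   # level 2: exact String-ID
        bucket = self.buckets[h1]
        bucket[h2] = bucket.get(h2, 0) + 1      # support count of this pattern
        return h1, h2

    def support(self, window):
        values = row_values(window)
        h2 = ".".join(str(v) for v in values)
        return self.buckets.get(sum(values), {}).get(h2, 0)

def from_rows(*rows):
    return [[int(ch) for ch in row] for row in rows]

a = from_rows("00000", "01100", "01000", "01000", "00000")  # 0.12.8.8.0
b = from_rows("00000", "01000", "01100", "01000", "00000")  # 0.8.12.8.0
table = PatternTable()
table.add(a)
table.add(a)
table.add(b)
```

Patterns `a` and `b` share the level-1 value 28 and land in the same bucket; the String-ID then tells them apart.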

Binding Patterns to Protein Sequence and Structure

StringID: 0.12.8.8.0, Support = 170 (window size W=5)
Pattern: 00000 01100 01000 01000 00000

Occurrences:

pdb-name  (X,Y)  X_sequence  Y_sequence  Interaction
1070.0    52,30  ILLKN       TFVRI       alpha::beta
1145.0    51,13  VFALH       GFHIA       alpha::strand
1251.2    42,6   EVCLR       GSKFG       alpha::strand
1312.0    54,11  HGYDE       ATFAK       alpha::beta
1732.0    49,6   HRFAK       KELAG       alpha::beta
2895.0    49,7   SRCLD       DTIYY       alpha::beta
...

Frequent Dense Local Patterns

Submatrix 1 (Frequency 2.0%; Physical Phenomenon: Parallel beta sheet)

0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0
0 0 0 0 0 1 0 0
0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0

Submatrix 2 (Frequency 2.0%; Physical Phenomenon: Anti-parallel beta sheet)

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
0 1 1 1 0 0 0 0

Submatrix 3 (Frequency 2.2%; Physical Phenomenon: Anti-parallel beta sheet)

1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
0 1 1 1 0 0 0 0
0 0 1 1 1 0 0 0
0 0 0 1 1 1 0 0
0 0 0 0 1 1 1 0
0 0 0 0 0 1 1 1

Pruning Patterns

00000 01000 01000 01000 00000

00000 00100 00100 00100 00000

00000 00010 00010 00010 00000

Same pattern (shifted to right) but different String-IDs

Merge horizontally or vertically shifted patterns

Prune away the local patterns (alpha/beta)
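
The slide does not say how shifted patterns are merged. One possible approach, shown here purely as an assumption, is to slide the occupied block of each pattern to the top-left corner of its window and compare the results; the function name `canonical_shift` is hypothetical.

```python
import numpy as np

def canonical_shift(bits, W=5):
    """Slide the 1s of a flattened W x W pattern to the top-left corner."""
    M = np.array([int(b) for b in bits]).reshape(W, W)
    rows = np.flatnonzero(M.any(axis=1))
    cols = np.flatnonzero(M.any(axis=0))
    out = np.zeros_like(M)
    if rows.size:  # the all-zero pattern is left unchanged
        block = M[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
        out[:block.shape[0], :block.shape[1]] = block
    return "".join(str(v) for v in out.ravel())

# The three shifted patterns from the slide.
patterns = [
    "0000001000010000100000000",
    "0000000100001000010000000",
    "0000000010000100001000000",
]
merged = {canonical_shift(p) for p in patterns}
```

All three patterns reduce to the same canonical string, so `merged` holds a single entry.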

Dense Pattern Mining Results

2702 non-redundant proteins from PDB

Min-Support = 1 (exhaustive patterns)

Window size = 5, Min-Density = 5

Contact Threshold   Number of Patterns
5 Angstroms         2508
6 Angstroms         9929
7 Angstroms         21231

Frequent Dense Non-Local Patterns

[Figures: Alpha – Alpha; Alpha – Beta Sheet]

Frequent Dense Non-Local Patterns

[Figures: Alpha – Beta Turn; Beta Sheet – Beta Turn]

Clustering Dense Patterns

Distance: Mi, Mj are dense sub-matrices

$$d(M_i, M_j) = \sum_{k=1}^{W^2} \left| M_i[k] - M_j[k] \right|$$

Use agglomerative hierarchical clustering

Find each cluster’s (c) representative (n patterns)
  Conceptually the super-imposition of n sub-matrices
  Compute contact probability at each position

$$p_c[k] = \frac{\sum_{i=1}^{n} M_i[k]}{n}$$

  Note a 1 whenever contact probability is more than a probability threshold
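
A small sketch of the clustering step, assuming average linkage and a fixed target number of clusters; the slides only say "agglomerative hierarchical clustering", so both choices, and the function names, are assumptions. The eight input strings are the member patterns listed in the "Example 1" and "Example 2" mined-cluster slides.

```python
import numpy as np

def distance(a, b):
    """d(Mi, Mj): number of positions where two flattened patterns differ."""
    return int(np.abs(a - b).sum())

def agglomerate(patterns, n_clusters):
    """Naive average-linkage agglomerative clustering; returns lists of indices."""
    M = [np.array([int(ch) for ch in p]) for p in patterns]
    clusters = [[i] for i in range(len(M))]

    def linkage(c1, c2):
        return np.mean([distance(M[i], M[j]) for i in c1 for j in c2])

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

patterns = [
    "0001100011011111100010000",  # Example 1 members
    "0000100101111111100010000",
    "0001000000110001000010000",
    "0001100101111001000000000",
    "1101001111010000100011000",  # Example 2 members
    "0100001110010000100011000",
    "1100001100011100100001000",
    "1101001110011000110001000",
]
groups = agglomerate(patterns, n_clusters=2)
```

With two target clusters, the first four patterns and the last four patterns end up in separate groups, because every within-example distance is smaller than every cross-example distance.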

Cluster Representative

Contact Probabilities:
 0: 0.05   1: 0.05   2: 0.68   3: 0.85   4: 0.71
 5: 0.03   6: 0.02   7: 0.14   8: 0.07   9: 0.09
10: 0.05  11: 0.05  12: 0.12  13: 0.09  14: 0.03
15: 0.03  16: 0.05  17: 0.15  18: 0.27  19: 0.85
20: 0.25  21: 0.10  22: 0.59  23: 0.92  24: 0.83

Representative contact pattern: 00111 00000 00000 00001 00011
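
The representative can be recomputed from the listed probabilities. The slide does not state the probability threshold; the value 0.6 below is an assumption, chosen because position 22 (probability 0.59) is 0 in the listed pattern while position 2 (probability 0.68) is 1.

```python
probs = [0.05, 0.05, 0.68, 0.85, 0.71,
         0.03, 0.02, 0.14, 0.07, 0.09,
         0.05, 0.05, 0.12, 0.09, 0.03,
         0.03, 0.05, 0.15, 0.27, 0.85,
         0.25, 0.10, 0.59, 0.92, 0.83]

def representative(probs, threshold, W=5):
    """Mark a 1 wherever the contact probability exceeds the threshold."""
    bits = ["1" if p > threshold else "0" for p in probs]
    return " ".join("".join(bits[r * W:(r + 1) * W]) for r in range(W))

print(representative(probs, threshold=0.6))  # 00111 00000 00000 00001 00011
```

A threshold of 0.5 would instead switch position 22 on and give `00111` as the last row, which does not match the slide.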

Clustering Quality

High and low value of pc[k] are good (most cluster members agree on k)

For a cluster c, define quality Qc:

$$S_c^1 = \sum_{k=1}^{W^2} p_c[k], \quad (p_c[k] \ge 0.5)$$

$$S_c^0 = \sum_{k=1}^{W^2} \left(1 - p_c[k]\right), \quad (p_c[k] < 0.5)$$

$$Q_c = S_c^1 + S_c^0$$

Overall clustering quality (0.5 <= Q <= 1)

$$Q = \frac{\sum_{i=1}^{NC} |c_i| \, Q_{c_i}}{NP}$$

NC = Number of Clusters
NP = Number of Patterns
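
A sketch of the quality measure. One assumption is made explicit here: the per-cluster score is divided by the number of positions (W^2), since each position contributes between 0.5 and 1 and the slide states that the overall quality lies between 0.5 and 1. The function names and sample inputs are hypothetical.

```python
def cluster_quality(probs):
    """Q_c for one cluster, divided by the number of positions (assumption)."""
    # Positions where most members have a contact contribute p.
    s1 = sum(p for p in probs if p >= 0.5)
    # Positions where most members have no contact contribute 1 - p.
    s0 = sum(1 - p for p in probs if p < 0.5)
    return (s1 + s0) / len(probs)

def overall_quality(clusters):
    """clusters: list of (size, probs) pairs; returns the size-weighted mean of Q_c."""
    total = sum(size for size, _ in clusters)
    return sum(size * cluster_quality(p) for size, p in clusters) / total

# Hypothetical inputs: a perfectly consistent cluster of three patterns
# and a maximally ambiguous cluster of one pattern.
q = overall_quality([(3, [1.0, 0.0]), (1, [0.5, 0.5])])
```

A cluster whose members agree everywhere scores 1, and one whose probabilities are all 0.5 scores 0.5, which are the two ends of the stated range.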

Example 1: Mined Cluster

#1355           00011 00011 01111 11000 10000
#3496           00001 00101 11111 11000 10000
#6282           00010 00000 11000 10000 10000
#7980           00011 00101 11100 10000 00000
representative  00011 00001 11100 10000 10000

Cluster patterns (beta-beta strand)

Example 2: Mined Cluster

#196            11010 01111 01000 01000 11000
#503            01000 01110 01000 01000 11000
#2834           11000 01100 01110 01000 01000
#8697           11010 01110 01100 01100 01000
representative  11000 01110 01000 01000 01000

Cluster Patterns (beta-beta turn)

Clustering Results

Contact Threshold   Number of Patterns   Number of Clusters   Cluster Quality
5 A                 2508                 83                   0.89
6 A                 9929                 99                   0.86
7 A                 21231                367                  0.84

Future Work

Comprehensive list of non-local motifs
  I-sites library (by Prof. Bystroff) catalogs local motifs

Future Directions
  Improving prediction of contact maps
  Mining heuristic rules for “physicality”
  Protein folding pathways

Improving Contact Map Prediction

[Figure: contact map with two regions labeled "Physically Impossible"]

Mining Physicality Rules

Mining heuristic rules for “physicality”
  Based on simple geometric constraints
  Rules governing contacts and non-contacts

Parallel Beta Sheets:
  If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0

Anti-parallel Beta Sheets:
  If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0

Alpha Helices:
  If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1, then C(i+2,j) = 0
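
The three rules translate directly into a checker that reports where a 0/1 map breaks them. This is a minimal sketch: the function name `rule_violations` is hypothetical, and scanning every (i, j) position of the map, rather than only positions already identified as sheets or helices, is an assumption.

```python
import numpy as np

def rule_violations(C):
    """List (rule, i, j) triples where a contact map breaks a heuristic rule."""
    C = np.asarray(C)
    n = C.shape[0]
    found = []
    for i in range(n - 2):
        for j in range(n - 2):
            # Parallel sheet: C(i,j) and C(i+2,j+2) forbid C(i,j+2) and C(i+2,j).
            if C[i, j] and C[i + 2, j + 2] and (C[i, j + 2] or C[i + 2, j]):
                found.append(("parallel", i, j))
            # Anti-parallel sheet: C(i,j+2) and C(i+2,j) forbid C(i,j) and C(i+2,j+2).
            if C[i, j + 2] and C[i + 2, j] and (C[i, j] or C[i + 2, j + 2]):
                found.append(("anti-parallel", i, j))
    for i in range(n - 4):
        for j in range(n):
            # Alpha helix: C(i,i+4), C(i,j), C(i+4,j) forbid C(i+2,j).
            if C[i, i + 4] and C[i, j] and C[i + 4, j] and C[i + 2, j]:
                found.append(("alpha", i, j))
    return found

def symmetric_map(n, contacts):
    C = np.zeros((n, n), dtype=int)
    for i, j in contacts:
        C[i, j] = C[j, i] = 1
    return C

# Hypothetical maps: a clean parallel ladder, then the same ladder
# with an extra contact at (0, 8) that the parallel rule forbids.
ok = symmetric_map(10, [(0, 6), (2, 8)])
bad = symmetric_map(10, [(0, 6), (2, 8), (0, 8)])
```

Because the map is symmetric, the forbidden contact in `bad` is reported twice, once on each side of the diagonal.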

Heuristic Rules of Physicality

Anti-parallel Beta Sheets

If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0

[Figure: residues i, i+2, j, and j+2 on anti-parallel strands]

Heuristic Rules of Physicality

Parallel Beta Sheets

If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0

[Figure: residues i, i+2, j, and j+2 on parallel strands]

Heuristic Rules of Physicality

Alpha Helix

If C(i,j) = 1 and C(i+4,j) = 1 and C(i,i+4) = 1, then C(i+2,j) = 0

[Figure: helix residues i, i+2, and i+4, and a residue j]

Protein Folding Pathways

Rules for Pathways in Contact Map Space
  Pathway is time-ordered sequence of contacts
  Consider only native contacts (those that are present in the true map)
  Condensation rule: New contacts within Smax
    U(i,j) <= Smax; U(i,j) unfolded residues from i to j

Pathway prediction is complementary to structure prediction

Contact Map Folding Pathways