AN ENSEMBLE SVM MODEL FOR THE ACCURATE PREDICTION OF NON- CANONICAL MICRORNA TARGETS Asish Ghoshal 1 , Ananth Grama 1 , Saurabh Bagchi 2 , Somali Chaterji 1 1 : Computer Science 2 : Electrical and Computer Engineering Purdue University, West Lafayette, IN

Jan 04, 2016


Jesse Waters

AN ENSEMBLE SVM MODEL FOR THE ACCURATE PREDICTION OF NON-CANONICAL MICRORNA TARGETS

Asish Ghoshal1, Ananth Grama1, Saurabh Bagchi2, Somali Chaterji1

1: Computer Science
2: Electrical and Computer Engineering
Purdue University, West Lafayette, IN


MicroRNA (miRNA): “The genome’s guiding hand?”

• miRNAs are RNA strands of about 22 nucleotides (nt) that base-pair with messenger RNA (mRNA) to cause mRNA degradation or translational repression

• Can be thought of as biology’s dark matter: small regulatory RNA that are abundant and encoded in the genome

• Dysregulation of miRNA may contribute to diverse diseases

• Canonical (i.e., exact) matches involve the miRNA’s seed region (nt 2-7) and the 3’ untranslated region (UTR) of mRNA and were thought of as the only form of interaction

• Recent high-throughput experimental studies have indicated a high preponderance of “non-canonical” miRNA targets


Background

• microRNA
• mRNA
• Argonaute
• RISC (complex)

Competition between target sites of regulators shapes post-transcriptional gene regulation: Marvin Jens & Nikolaus Rajewsky, Nature Reviews Genetics, 2015


Computational miRNA Target Prediction

• We are attempting computational prediction of miRNA-mRNA interactions

• Challenging because of:
– Large number of features in miRNAs and mRNAs
– Noisy and incomplete data sets from experimental approaches
– Wide variety of positive miRNA-mRNA interactions

• Prior computational approaches are deterministic and rely on perfect seed matches of the miRNA and mRNA nucleotides [Nature Methods 2013, NAR 2010]


Our Contributions: Avishkar

• Avishkar provides an ensemble model classifier, specialized to each miRNA family

• It achieves the highest precision and recall among all competitive computational approaches
– Shows how to use an ensemble non-linear SVM classifier
– Shows how to transform spatial features into smooth curves
– Shows how to handle perfect matches as well as imperfect matches in a unified manner

• Since training non-linear SVMs is computationally expensive, we provide an open source, efficient implementation of Cascade SVM [1], on top of Apache Spark

• Key Result: TPR of 76% and an FPR of 20%, with the AUC (ROC curve) for the ensemble non-linear model being 20% higher than for the simple linear model. This is an improvement of over 150% in the TPR over the best competitive protocol.

[1] Parallel Support Vector Machines: The Cascade SVM: Hans Peter Graf, Eric Cosatto, Leon Bottou, Igor Durdanovic, Vladimir Vapnik, NIPS 2005


Problem Statement

• Predict if a miRNA targets an mRNA (segment)

• Ideally we would want some experimentally verified edge labels to train on

[Figure: bipartite graph linking miRNAs to mRNA segments. Some edges carry ground-truth labels (1 or 0); the labels of the remaining edges have to be predicted.]


Problem Statement

• We have labels on vertices from CLIP-Seq experiments, i.e., which mRNA segments were targeted.

• We do not know which miRNA targeted a particular mRNA segment.

[Figure: bipartite graph of miRNAs and mRNA segments. A segment is labeled 1 if it is contained within an immunoprecipitated (IP) region and 0 if it is not; the edge labels (1 or 0?) are unknown.]


Methods: Generating Ground Truth Data

• Consider only the most highly expressed miRNAs
– 10 in humans, 20 in mouse

• Label an edge ‘1’ if
– The mRNA segment is labeled ‘1’, AND
– The binding between a miRNA and mRNA segment is strong enough, i.e., ΔG is below a certain threshold, OR,
– There is at least a 6-mer seed match between the mRNA segment and miRNA.

• All other edges are labeled ‘0’
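
The labeling rule above can be sketched in a few lines of Python. This is only one reading of the slide: it assumes the rule groups as "segment labeled 1 AND (strong binding OR 6-mer seed match)". The `Edge` type, its field names, the kcal/mol unit and the `DELTA_G_THRESHOLD` value are hypothetical; the slide says only "a certain threshold".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One candidate miRNA-mRNA segment pair (hypothetical container)."""
    segment_in_ip_region: bool  # vertex label from CLIP-Seq: True means '1'
    delta_g: float              # duplex free energy; more negative = stronger (assumed kcal/mol)
    seed_match_len: int         # longest contiguous seed match, in nucleotides


# Hypothetical cutoff: the slides do not state the actual threshold.
DELTA_G_THRESHOLD = -20.0


def label_edge(edge: Edge, dg_threshold: float = DELTA_G_THRESHOLD) -> int:
    """Return 1 for a positive edge, 0 otherwise.

    Assumed grouping: segment labeled 1 AND (delta-G below threshold OR
    at least a 6-mer seed match).
    """
    if not edge.segment_in_ip_region:
        return 0
    strong_binding = edge.delta_g < dg_threshold
    has_seed_match = edge.seed_match_len >= 6
    return 1 if (strong_binding or has_seed_match) else 0
```

Under this reading, a segment outside every IP region is always a negative, however strong the predicted binding.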


Methods: Feature Construction

• The alternating blue and green regions denote the 13 consecutive windows around the miRNA target site (red). These are the windows where the average thermodynamic and sequence features are computed.

• Compute interaction profiles at two different resolutions
– Window size of 46 and using the entire miRNA: “site” curves
– Window size of 9 and only using the seed region of the miRNA: “seed” curves

• Use coefficients of B-spline basis functions as features for classifier

• We hypothesize that the curves are different for the positive and negative samples.
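
As a rough illustration of the windowing step, the sketch below averages a per-nucleotide profile over 13 consecutive windows. The function name and the layout (six windows upstream, one starting at the target site, six downstream) are assumptions; the slides give only the number of windows and the two window sizes.

```python
from statistics import fmean


def window_means(values, site_start, window_size, n_windows=13):
    """Average a per-nucleotide profile over consecutive windows.

    Assumed layout: the middle window starts at `site_start`, with
    n_windows // 2 windows on each side. Windows that fall entirely
    outside the sequence yield NaN; partial windows are averaged over
    the positions that exist.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    first = site_start - (n_windows // 2) * window_size
    means = []
    for w in range(n_windows):
        lo = max(first + w * window_size, 0)
        hi = min(first + (w + 1) * window_size, len(values))
        means.append(fmean(values[lo:hi]) if lo < hi else float("nan"))
    return means
```

With `window_size=46` this would produce the points of a "site" curve and with `window_size=9` those of a "seed" curve, one 13-point profile per feature.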


Methods: Metric for Non-Canonical Matches

• Encode the alignment of the miRNA seed region with mRNA nucleotides using a string of 1s (matches), 2s (mismatches), 3s (gaps), and 4s (GU wobbles)

• Compute occurrence frequencies of seed match patterns among positive miRNA-mRNA interactions

• If the pattern a occurs k times among n positive samples, we calculate α = Probability(pattern a occurs k of n by chance)

• Define seed enrichment score for the seed match pattern a as: SES = 1 – α

• The “seed enrichment score” captures, in a single unified numeric feature, the relative efficacy of various kinds of seed matches – ML techniques handle numeric features better
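
A minimal sketch of the two steps above, under stated assumptions. The input format (two equal-length, position-aligned strings with "-" for gaps) is hypothetical. The slides also do not specify the null model or whether "k of n" means exactly k or at least k; the sketch assumes an upper-tail binomial probability with a hypothetical background rate `p_background`.

```python
from math import comb

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def encode_alignment(mirna_seed: str, mrna_site: str) -> str:
    """Encode an aligned seed/site pair as 1 (match), 2 (mismatch),
    3 (gap) or 4 (GU wobble), one digit per aligned position."""
    if len(mirna_seed) != len(mrna_site):
        raise ValueError("aligned strings must have equal length")
    codes = []
    for a, b in zip(mirna_seed.upper(), mrna_site.upper()):
        if a == "-" or b == "-":
            codes.append("3")
        elif (a, b) in WATSON_CRICK:
            codes.append("1")
        elif (a, b) in WOBBLE:
            codes.append("4")
        else:
            codes.append("2")
    return "".join(codes)


def seed_enrichment_score(k: int, n: int, p_background: float) -> float:
    """SES = 1 - alpha, where alpha is assumed to be P(X >= k) for
    X ~ Binomial(n, p_background)."""
    alpha = sum(
        comb(n, i) * p_background**i * (1.0 - p_background) ** (n - i)
        for i in range(k, n + 1)
    )
    return 1.0 - alpha
```

With this choice, a pattern that is seen far more often among positives than the background rate predicts gets a score close to 1, and a pattern that is never seen gets a score of 0.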


Validation Protocol used to Evaluate Avishkar


Improving Classification Performance with Kernel SVM

• Linear classifier suffers from high bias (large error even on training set)

• Solution: Use more complex learning model– Non-linear or Kernel SVM

• SVMs suffer from a widely recognized scalability problem in both memory use and compute time.

• Kernel SVM computational cost: O(n^3)

• Does not scale beyond a few thousand examples for feature vector of dimension ~ 150
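
A back-of-the-envelope estimate shows why splitting the training set helps. It takes the cubic cost above at face value, uses a hypothetical partition count p, and ignores the cost of merging partial results. Training on p equal partitions of n/p examples each costs

```latex
\[
  p \cdot \left(\frac{n}{p}\right)^{3} = \frac{n^{3}}{p^{2}},
  \qquad\text{so the total work shrinks by a factor of}\qquad
  \frac{n^{3}}{n^{3}/p^{2}} = p^{2}.
\]
```

The p^2 saving applies to total work; running the p partitions in parallel reduces wall-clock time further.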


Distributed Training for Kernel SVM

• Cascading SVM [Graf et al. NIPS 2005]

• Key ideas:
– Train on partitions of the whole data set and do this in parallel
– Merge SVs from each partition in a hierarchical manner
– Final step is serial and it is hoped that the number of SVs is reasonably small at that stage

• Implemented on top of Apache Spark: Open source release
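
The control flow of a single pass through the cascade can be sketched as follows. This is not the released Spark implementation: the partitions are processed serially here, and `toy_support_vectors` is a stand-in trainer that is only valid for linearly separable 1-D data with negatives below positives, where the hard-margin support vectors are the closest pair of opposite-class points. A real run would plug in a kernel SVM solver instead.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (feature, label in {-1, +1}); 1-D toy data
Trainer = Callable[[Sequence[Example]], List[Example]]


def toy_support_vectors(data: Sequence[Example]) -> List[Example]:
    """Stand-in trainer (assumption): for separable 1-D data with
    negatives below positives, return the closest opposite-class pair."""
    pos = [x for x, y in data if y == +1]
    neg = [x for x, y in data if y == -1]
    svs: List[Example] = []
    if neg:
        svs.append((max(neg), -1))
    if pos:
        svs.append((min(pos), +1))
    return svs


def cascade_train(data: Sequence[Example], n_partitions: int,
                  train: Trainer) -> List[Example]:
    """One cascade pass: train per partition, then merge the surviving
    support vectors pairwise until a single set remains."""
    if not data or n_partitions < 1:
        raise ValueError("need data and at least one partition")
    size = -(-len(data) // n_partitions)  # ceiling division
    # First layer: independent per-partition training (parallel in Spark).
    layer = [train(data[i:i + size]) for i in range(0, len(data), size)]
    # Later layers: merge SV sets two at a time and retrain on the union.
    while len(layer) > 1:
        merged = []
        for i in range(0, len(layer), 2):
            union = list(layer[i])
            if i + 1 < len(layer):
                union += layer[i + 1]
            merged.append(train(union))
        layer = merged
    return layer[0]
```

On the toy data the cascade returns the same support vectors as training on the whole set at once, which is the property the hierarchical merge relies on.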


Making Kernel SVM Scale Up

• Biological insight: miRNAs within an miRNA family share structural similarities

• Therefore, we create a separate non-linear classifier for each miRNA family

• Within each family, we train in parallel using Cascade SVM approach
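
The per-family strategy amounts to grouping examples by miRNA family, training one model per group, and routing each prediction to its family's model. A minimal sketch, with hypothetical names throughout: `FamilyEnsemble`, the `(family, features, label)` tuple layout and the `majority_trainer` placeholder are illustrative and do not reflect the Avishkar code.

```python
from collections import Counter, defaultdict


def majority_trainer(rows):
    """Placeholder trainer (assumption): predicts the family's most
    common label. A real system would train a non-linear SVM here."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: majority


class FamilyEnsemble:
    """One independently trained model per miRNA family."""

    def __init__(self, train_fn):
        self._train_fn = train_fn
        self._models = {}

    def fit(self, examples):
        """examples: iterable of (family, features, label) tuples."""
        by_family = defaultdict(list)
        for family, features, label in examples:
            by_family[family].append((features, label))
        # Families are independent, so these calls could run in parallel.
        self._models = {fam: self._train_fn(rows)
                        for fam, rows in by_family.items()}
        return self

    def predict(self, family, features):
        if family not in self._models:
            raise KeyError(f"no model trained for family {family!r}")
        return self._models[family](features)
```

Because each family's training set is much smaller than the full data set, the cubic training cost applies to the family size rather than to the total number of examples.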


CLIP Data Set

Data set (Species) | # Positive examples (Seed, Seedless) | # Negative examples | # mRNA | # miRNA | Positive target sites: 3’ UTR | CDS | 5’ UTR
HITS-CLIP (Mouse) | 861,208 (6%, 94%) | 35,608,333 | 4,059 | 119 | 56% | 43% | 1%
PAR-CLIP (Human) | 141,109 (8%, 92%) | 2,659,748 | 1,211 | 35 | 57% | 39% | 4%
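
The counts above imply a strong class imbalance, which is easy to check with a few lines of arithmetic. The dictionary and function names are illustrative; the numbers are copied from the table.

```python
# (positive examples, negative examples), taken from the table above
DATASETS = {
    "HITS-CLIP (Mouse)": (861_208, 35_608_333),
    "PAR-CLIP (Human)": (141_109, 2_659_748),
}


def positive_fraction(n_pos: int, n_neg: int) -> float:
    """Share of positive examples among all labeled examples."""
    return n_pos / (n_pos + n_neg)


for name, (n_pos, n_neg) in DATASETS.items():
    frac = positive_fraction(n_pos, n_neg)
    print(f"{name}: {frac:.2%} positive, {n_neg / n_pos:.1f} negatives per positive")
```

Positives make up only a few percent of either data set, so raw accuracy is a weak metric here and true and false positive rates are the more informative measures.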


Results: Misclassification Rate for Linear and Non-Linear Classifier

• For every miRNA family, the mean test error of the non-linear SVM is lower than that of the corresponding linear model.

• Benefit of non-linear SVM is more pronounced for larger miRNA families: e.g., for the let-7 and miR-320 families, the improvements over the linear model are 50% and 69.9%, respectively.

• Insight: The non-linear model can remove the prediction bias inherent in the linear model.


Results: ROC Curve for Linear and Non-linear SVM

• ROC curves for the ensemble linear model and ensemble non-linear model, obtained by varying the probability threshold for the output of the SVM.

• The misclassification error, true positive rate, and false positive rate were computed using 5-fold stratified cross-validation for seedless sites in the human data set.

• One possible operating point is an FPR of 0.2, where the TPR for the linear model is 0.469, while the TPR for the non-linear model is 0.756.
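
For readers unfamiliar with how such a curve is produced, here is a small self-contained sketch of a threshold sweep and trapezoidal AUC. It is a generic illustration, not the evaluation code behind the slides; labels are assumed to be 0 or 1 and scores to be classifier outputs where higher means more likely positive.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision
    threshold from above the highest score to below the lowest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Reading the operating point above in relative terms: 0.756 / 0.469 is about 1.61, so at an FPR of 0.2 the non-linear model recovers roughly 61% more true positives than the linear model.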


Take-Away Points

• We have developed “Avishkar”, a machine-learning, support vector machine-based model, to predict both canonical and non-canonical miRNA-mRNA interactions

• Avishkar extracts thermodynamic and sequence features and smooths them through curve fitting, in order to extract enriched features from CLIP data

• We use non-linear SVM to minimize bias and scale it up to the large biological data sizes through a biologically-driven parallelization strategy

• We achieve the best-in-class recall (true positive rate), with an improvement of over 150% over the best competitive protocol

• Open source software releases:
– https://bitbucket.org/cellsandmachines/avishkar
– https://bitbucket.org/cellsandmachines/kernelsvmspark


Thanks!


EXTRA Slides + Notes


Background: CLIP Technology and Non-Canonical miRNA-mRNA Interactions

An experimental method to map the binding sites of RNA-binding proteins across the transcriptome. Proteins are crosslinked to RNA using ultraviolet light, and an antibody is used to specifically isolate the RNA-binding protein of interest together with its RNA interaction partners, which are subjected to sequencing.


Background: RNA interference


Methods: Capturing Spatial Interaction using Smooth Curves

• Compute thermodynamic interaction profile upstream and downstream of the target region
– Data is collected for fixed-size windows on both sides of the target region
– Averaging is done within each window

• Compute smooth curves to remove noise in the biological data set
– Used B-spline interpolation for smoothing out the points
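
One way to turn a windowed profile into spline coefficients is a least-squares fit onto a small B-spline basis, sketched below in plain Python. The slides do not give the spline degree, the number of basis functions or the knot placement, so the cubic degree, the six basis functions, the clamped uniform knots and the placement of window centres on (0, 1) are all assumptions, as are the function names.

```python
def clamped_knots(n_basis, degree):
    """Uniform knots on [0, 1] with repeated end knots (clamped spline)."""
    spans = n_basis - degree
    interior = [i / spans for i in range(spans + 1)]
    return [0.0] * degree + interior + [1.0] * degree


def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: value at t of basis function i, degree k."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    value = 0.0
    left_span = knots[i + k] - knots[i]
    if left_span > 0:
        value += (t - knots[i]) / left_span * bspline_basis(i, k - 1, t, knots)
    right_span = knots[i + k + 1] - knots[i + 1]
    if right_span > 0:
        value += ((knots[i + k + 1] - t) / right_span
                  * bspline_basis(i + 1, k - 1, t, knots))
    return value


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_bspline_coefficients(y, n_basis=6, degree=3):
    """Least-squares B-spline coefficients for window averages y.

    Window j is placed at (j + 0.5) / len(y), strictly inside (0, 1).
    """
    xs = [(j + 0.5) / len(y) for j in range(len(y))]
    knots = clamped_knots(n_basis, degree)
    design = [[bspline_basis(i, degree, x, knots) for i in range(n_basis)]
              for x in xs]
    # Normal equations (A^T A) c = A^T y; adequate for a basis this small.
    ata = [[sum(row[p] * row[q] for row in design) for q in range(n_basis)]
           for p in range(n_basis)]
    aty = [sum(row[p] * yj for row, yj in zip(design, y))
           for p in range(n_basis)]
    return solve(ata, aty)
```

Fitting 13 window averages with six coefficients smooths the profile, and the coefficient vector then serves as a fixed-length feature for the classifier regardless of how noisy the individual windows are.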


Methods: Our Classifier

• Distributed linear SVM using gradient descent (Apache Spark)

• 9 Intel X86 nodes, 8 GB memory, 4 cores.
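
To make "linear SVM using gradient descent" concrete, here is a single-machine sketch of batch subgradient descent on the L2-regularized hinge loss. It is not the distributed Spark code; the regularization strength, learning rate and epoch count are arbitrary illustrative defaults.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))
    by batch subgradient descent. Labels ys must be -1 or +1."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # only margin violators contribute
                for j in range(d):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_label(w, b, x):
    """Sign of the decision function, as -1 or +1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

In a distributed setting the inner loop over examples is the part that parallelizes: each node sums the subgradient contributions of its own partition and the partial sums are combined before the weight update.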


Results: ROC Curves


Results: Feature Importance


Results

• Does the classification performance improve due to clustering by miRNA family or due to the use of a more complex model?

• Performance of linear classifier does not improve by clustering data.