Cancer Outlier Profile Analysis using Apache Spark
Mahmoud Parsian, Ph.D. in Computer Science
Senior Architect @ Illumina
Apache Big Data, Vancouver, BC
May 10, 2016
Table of Contents
1 Biography
2 What is COPA using Spark?
3 COPA Algorithm
4 Input Data
5 Rank Product Algorithm
6 Moral of Story
7 References
Biography
Who am I?
Name: Mahmoud Parsian
Education: Ph.D in Computer Science
Work: Senior Architect @ Illumina, Inc.
Lead Big Data Team @ Illumina:
Develop scalable regression algorithms
Develop DNA-Seq and RNA-Seq workflows
Use Java/MapReduce/Hadoop/Spark/HBase
Input Data
What is Rank Product?
Let {A1, ..., Ak} be a set of datasets of (key, value) pairs, where keys are unique per dataset.
Example of (key, value) pairs:
(K,V) = (item, number of items sold)
(K,V) = (user, number of followers for the user)
(K,V) = (gene, test expression)
Then the rank product of {A1, ..., Ak} is computed based on the ranks r_i for key i across all k datasets. Typically, ranks are assigned based on the sorted values of the datasets.
Input Data
What is Rank Product?
Let A1 = {(K1, 30), (K2, 60), (K3, 10), (K4, 80)},
then Rank(A1) = {(K1, 3), (K2, 2), (K3, 4), (K4, 1)}
since 80 > 60 > 30 > 10.
Note that 1 is the highest rank (assigned to the largest value).
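The ranking rule above (rank 1 for the largest value) can be sketched in plain Java. This is a minimal sketch, not the deck's code; `Ranker` and `rank` are our names.

```java
import java.util.*;

// Hypothetical helper (names are ours): assigns ranks so that the
// largest value gets rank 1, matching the Rank(A1) example above.
class Ranker {
    static Map<String, Integer> rank(Map<String, Double> data) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(data.entrySet());
        // sort descending by value: largest value first
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < entries.size(); i++) {
            ranks.put(entries.get(i).getKey(), i + 1); // position 0 -> rank 1
        }
        return ranks;
    }
}
```

Applied to the A1 example, K4 (value 80) gets rank 1 and K3 (value 10) gets rank 4.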
Input Data
Calculation of the Rank Product
Given n genes and k replicates,
let e_{g,i} be the fold change and r_{g,i} the rank of gene g in the i-th replicate.
Compute the rank product (RP) via the geometric mean:

RP(g) = \left( \prod_{i=1}^{k} r_{g,i} \right)^{1/k} = \sqrt[k]{\prod_{i=1}^{k} r_{g,i}}
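As a sanity check of the formula, a minimal Java sketch of the geometric mean of ranks (class and method names are ours, not from the deck):

```java
// Minimal sketch (names are ours): the rank product of a gene over k
// replicates is the k-th root of the product of its k ranks.
class RankProductMath {
    static double rankProduct(int[] ranks) {
        double product = 1.0;
        for (int r : ranks) {
            product *= r; // multiply the k ranks together
        }
        return Math.pow(product, 1.0 / ranks.length); // k-th root
    }
}
```

For example, ranks {1, 4, 2} give (1 * 4 * 2)^(1/3) = 2.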
Input Data
Formalizing Rank Product
Let S = {S1, S2, ..., Sk} be a set of k studies, where k > 0 and each study represents a microarray experiment.
Let Si (i = 1, 2, ..., k) be a study, which has an arbitrary number of assays identified by {Ai1, Ai2, ...}.
Let each assay (which can be represented as a text file) be a set of an arbitrary number of records in the following format:
<gene_id><,><gene_value_as_double_data-type>
Let gene_id be in {g1, g2, ..., gn} (we have n genes).
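Parsing one assay record in the format above might look like this; `AssayRecord` and `parse` are hypothetical names, not part of the deck's code.

```java
// Hypothetical parser (names are ours) for one assay record in the
// <gene_id><,><gene_value> format described above.
class AssayRecord {
    final String geneId;
    final double value;

    AssayRecord(String geneId, double value) {
        this.geneId = geneId;
        this.value = value;
    }

    static AssayRecord parse(String line) {
        int comma = line.indexOf(','); // split on the first comma
        return new AssayRecord(line.substring(0, comma),
                               Double.parseDouble(line.substring(comma + 1)));
    }
}
```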
Input Data
Rank Product: in 2 Steps
Let S = {S1, S2, ..., Sk} be a set of k studies:
STEP-1: find the mean of values per study per gene
you may replace the "mean" function with your desired function
finding the mean involves groupByKey() or combineByKey()
STEP-2: perform the "Rank Product" per gene across all studies
finding the "Rank Product" involves groupByKey() or combineByKey()
Input Data
Formalizing Rank Product
The last step is to find the rank product for each gene across all studies:
Let S_k = {(g_1, r_{k,1}), (g_2, r_{k,2}), ...}; then the rank product of gene g_j is

RP(g_j) = \left( \prod_{i=1}^{k} r_{i,j} \right)^{1/k} = \sqrt[k]{\prod_{i=1}^{k} r_{i,j}}
Input Data
Spark Solution for Rank Product
1 Read k input paths (path = study)
2 Find the mean per gene per study
3 Sort the genes by value per study and then assign rank values; to sort the dataset by value, we swap the key with the value and then perform the sort
4 Assign ranks 1, 2, ..., N (1 is assigned to the highest value)
use JavaPairRDD.zipWithIndex(), which zips the RDD with its element indices (these indices will be the ranks)
Spark indices start from 0, so we add 1
5 Finally, compute the Rank Product per gene across all studies
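Steps 2-4 can be emulated for a single study in plain, Spark-free Java to check the logic locally. All names here are ours; sorting plus an index stands in for Spark's swap-and-sort plus zipWithIndex().

```java
import java.util.*;

// Spark-free emulation (names are ours) of steps 2-4 for one study:
// mean per gene, sort by mean descending, assign ranks starting at 1.
class StudyRanks {
    // each record is {geneId, valueAsString}, as read from an assay file
    static Map<String, Integer> ranksFromStudy(List<String[]> records) {
        // step 2: accumulate {total, count} per gene; total/count is the mean
        Map<String, double[]> sums = new HashMap<>();
        for (String[] rec : records) {
            double[] s = sums.computeIfAbsent(rec[0], k -> new double[2]);
            s[0] += Double.parseDouble(rec[1]);
            s[1] += 1;
        }
        // step 3: sort genes by mean value, largest first
        List<String> genes = new ArrayList<>(sums.keySet());
        genes.sort((a, b) -> Double.compare(sums.get(b)[0] / sums.get(b)[1],
                                            sums.get(a)[0] / sums.get(a)[1]));
        // step 4: assign ranks 1..N (index + 1, mirroring zipWithIndex() + 1)
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < genes.size(); i++) {
            ranks.put(genes.get(i), i + 1);
        }
        return ranks;
    }
}
```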
Input Data
Two Spark Solutions: groupByKey() and combineByKey()
Two solutions are provided using Spark-1.6.0:
SparkRankProductUsingGroupByKey
uses groupByKey()
SparkRankProductUsingCombineByKey
uses combineByKey()
Input Data
How does groupByKey() work?
Input Data
How does reduceByKey() work?
Rank Product Algorithm
Rank Product Algorithm in Spark
Algorithm: High-Level Steps
Step    Description
STEP-1  import required interfaces and classes
STEP-2  handle input parameters
STEP-3  create a Spark context object
STEP-4  create list of studies (1, 2, ..., K)
STEP-5  compute mean per gene per study
STEP-6  sort by values
STEP-7  assign rank
STEP-8  compute rank product
STEP-9  save the result in HDFS
Rank Product Algorithm
Main Driver
Listing 1: performRankProduct()

public static void main(String[] args) throws Exception {
  // args[0] = output path
  // args[1] = number of studies (K)
  // args[2] = input path for study 1
  // args[3] = input path for study 2
  // ...
  // args[K+1] = input path for study K
  final String outputPath = args[0];
  final String numOfStudiesAsString = args[1];
  final int K = Integer.parseInt(numOfStudiesAsString);
  List<String> inputPathMultipleStudies = new ArrayList<String>();
Rank Product Algorithm
groupByKey() vs. combineByKey()
Which one should we use: combineByKey() or groupByKey()? According to their semantics, both will give you the same answer, but combineByKey() is more efficient.
In some situations, groupByKey() can even cause out-of-disk problems. In general, reduceByKey() and combineByKey() are preferred over groupByKey().
Spark shuffling is more efficient for reduceByKey() than for groupByKey(), and the reason is this: in the shuffle step for reduceByKey(), data is combined so that each partition outputs at most one value for each key to send over the network, while in the shuffle step for groupByKey(), all the data is wastefully sent over the network and collected on the reduce workers.
To understand the difference, the following figures show how the shuffle is done for reduceByKey() and groupByKey().
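The shuffle-size argument can be illustrated with a toy count (not Spark itself; all names are ours): local combining sends at most one record per distinct key per partition, while groupByKey() sends every record.

```java
import java.util.*;

// Toy illustration (not Spark): count how many records cross the
// network in the shuffle for groupByKey() vs. reduceByKey(), given
// the raw key occurrences held by each map-side partition.
class ShuffleCost {
    // groupByKey(): every record is shuffled as-is
    static int groupByKeyShuffled(List<List<String>> partitions) {
        int total = 0;
        for (List<String> p : partitions) {
            total += p.size();
        }
        return total;
    }

    // reduceByKey(): values are combined locally first, so each
    // partition shuffles at most one record per distinct key
    static int reduceByKeyShuffled(List<List<String>> partitions) {
        int total = 0;
        for (List<String> p : partitions) {
            total += new HashSet<>(p).size();
        }
        return total;
    }
}
```

With two partitions holding keys [a, a, b] and [a, b, b], groupByKey() shuffles 6 records but reduceByKey() only 4; the gap widens as keys repeat more.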
Rank Product Algorithm
Understanding groupByKey()
Rank Product Algorithm
Understanding reduceByKey() or combineByKey()
        return new Tuple2<Double, Integer>(rankedProduct, N);
      }
    });
  return rankedProducts;
}
Rank Product Algorithm
Next FOCUS on combineByKey()
We need to develop two functions:
computeMeanByCombineByKey()
computeRankedProductsUsingCombineByKey()
Rank Product Algorithm
combineByKey(): how does it work?
combineByKey() is the most general of the per-key aggregation functions. Most of the other per-key combiners are implemented using it.
Like aggregate(), combineByKey() allows the user to return values that are not the same type as our input data.
To understand combineByKey(), it is useful to think of how it handles each element it processes.
As combineByKey() goes through the elements in a partition, each element either has a key it has not seen before or has the same key as a previous element.
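That per-element decision — create a combiner for a new key, merge into the existing combiner otherwise — can be sketched for a single partition in plain, Spark-free Java (names are ours):

```java
import java.util.*;
import java.util.function.*;

// Plain-Java sketch (names are ours) of combineByKey()'s per-partition
// logic: the first sighting of a key creates a combiner from the value;
// later sightings merge the value into the existing combiner.
class CombineByKeyLocal {
    static <K, V, C> Map<K, C> combineLocally(List<Map.Entry<K, V>> partition,
                                              Function<V, C> createCombiner,
                                              BiFunction<C, V, C> mergeValue) {
        Map<K, C> combiners = new HashMap<>();
        for (Map.Entry<K, V> e : partition) {
            C current = combiners.get(e.getKey());
            if (current == null) {
                combiners.put(e.getKey(), createCombiner.apply(e.getValue())); // new key
            } else {
                combiners.put(e.getKey(), mergeValue.apply(current, e.getValue())); // seen before
            }
        }
        return combiners;
    }
}
```

Across partitions, Spark would then apply mergeCombiners to the per-partition results; that third function is omitted from this single-partition sketch.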
Rank Product Algorithm
combineByKey() by Combiner (C)
public <C> JavaPairRDD<K,C> combineByKey(
    Function<V,C> createCombiner,    // V -> C
    Function2<C,V,C> mergeValue,     // C+V -> C
    Function2<C,C,C> mergeCombiners) // C+C -> C
Provide 3 functions:
1. createCombiner, which turns a V into a C
// creates a one-element list
2. mergeValue, to merge a V into a C
// adds it to the end of a list
3. mergeCombiners, to combine two Cs into a single one.
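For the list-building case the comments describe, the three functions might look like this in plain Java (static methods stand in for Spark's Function/Function2 interfaces; names are ours):

```java
import java.util.*;

// Sketch (names are ours) of the three combineByKey() functions for
// collecting all values of a key into a list.
class ListCombiner {
    // 1. createCombiner: V -> C, creates a one-element list
    static List<Double> createCombiner(double v) {
        List<Double> list = new ArrayList<>();
        list.add(v);
        return list;
    }

    // 2. mergeValue: C+V -> C, adds the value to the end of the list
    static List<Double> mergeValue(List<Double> c, double v) {
        c.add(v);
        return c;
    }

    // 3. mergeCombiners: C+C -> C, concatenates two lists
    static List<Double> mergeCombiners(List<Double> c1, List<Double> c2) {
        c1.addAll(c2);
        return c1;
    }
}
```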
Rank Product Algorithm
computeMeanByCombineByKey(): Define the C data structure
//
// AverageCount is used by combineByKey()
// to hold the total values and their count.
//
static class AverageCount implements Serializable {
  double total;
  int count;

  public AverageCount(double total, int count) {
    this.total = total;
    this.count = count;
  }

  public double average() {
    return total / (double) count;
  }
}
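How AverageCount would plug into combineByKey() can be sketched as three plain functions. The wiring and names other than AverageCount are our assumption, not the deck's code:

```java
// Sketch (our wiring, not the deck's code) of AverageCount as the
// combiner C: create from the first value, fold in later values, and
// merge partial results coming from different partitions.
class AverageCountDemo {
    static class AverageCount {
        double total;
        int count;
        AverageCount(double total, int count) { this.total = total; this.count = count; }
        double average() { return total / (double) count; }
    }

    // createCombiner: V -> C
    static AverageCount create(double v) { return new AverageCount(v, 1); }

    // mergeValue: C+V -> C
    static AverageCount mergeValue(AverageCount c, double v) {
        c.total += v;
        c.count += 1;
        return c;
    }

    // mergeCombiners: C+C -> C
    static AverageCount mergeCombiners(AverageCount a, AverageCount b) {
        a.total += b.total;
        a.count += b.count;
        return a;
    }
}
```

Folding in 2.0 and 4.0 on one partition and 6.0 on another, then merging, yields total 12.0 over count 3, i.e. an average of 4.0.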
new Function2<RankProduct, RankProduct, RankProduct>() {
  public RankProduct call(RankProduct a, RankProduct b) {
    a.product(b);
    return a;
  }
};
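The RankProduct class itself is not shown in this excerpt; a hypothetical sketch consistent with the merge function above, where a.product(b) multiplies two partial products together, might be:

```java
// Hypothetical RankProduct (the real class is not shown in this
// excerpt): keeps a running product of ranks plus how many ranks went
// into it, so the geometric mean can be taken at the end.
class RankProduct {
    long value; // running product of ranks
    int count;  // number of ranks multiplied in so far

    RankProduct(long rank) {
        this.value = rank;
        this.count = 1;
    }

    // merge another partial product into this one (C+C -> C)
    RankProduct product(RankProduct other) {
        this.value *= other.value;
        this.count += other.count;
        return this;
    }

    // geometric mean: count-th root of the product
    double rankedProduct() {
        return Math.pow((double) value, 1.0 / count);
    }
}
```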
Moral of Story
Moral of Story...
Understand data requirements
Select proper platform for implementation: Spark
Partition your RDDs properly
number of cluster nodes
number of cores per node
amount of RAM
Avoid unnecessary computations (g1, g2), (g2, g1)
Use filter() early to remove unneeded RDD elements
Avoid unnecessary RDD.saveAsTextFile(path)
Run on both YARN and a standalone Spark cluster and compare performance
use groupByKey() cautiously
prefer combineByKey() over groupByKey()
Verify your test results against R
References
References
COPA – Cancer Outlier Profile Analysis. James W. MacDonald and Debashis Ghosh. http://bioinformatics.oxfordjournals.org/content/22/23/2950.full.pdf
mCOPA: analysis of heterogeneous features in cancer expression data. Chenwei Wang, Alperen Taciroglu, Stefan R Maetschke, Colleen C Nelson, Mark A Ragan, and Melissa J Davis. http://jclinbioinformatics.biomedcentral.com/articles/