Finding Allelic Frequencies Using MapReduce/Hadoop
Mahmoud Parsian, Ph.D. in Computer Science
Senior Architect @ Illumina [1]
2014 Hadoop Summit, Amsterdam, Netherlands
April 3, 2014
[1] www.illumina.com
Table of Contents
1 Biography
2 Overview
3 Basic Definitions
4 Source of Data for Allelic Frequency
5 Allelic Frequency Analysis
6 Allelic Frequency Using MapReduce/Hadoop
7 Running Allelic Frequency Analysis
Biography
Who am I?
Name: Mahmoud Parsian
Education: Ph.D in Computer Science
Works: as Senior Architect @Illumina, Inc
Lead Big Data Team @ Illumina
Develop scalable regression algorithms
Develop DNA-Seq and RNA-Seq workflows
Use Java/MapReduce/Hadoop/HBase
Overview
Genetic variants in a patient's germline DNA are identified through next-generation sequencing technology.
Patient Sample −→ ... −→ VCF
The magnitude of this data makes it challenging to store and analyze:
several million variants per patient
several billion for groups of patients
The group comparison estimates allelic or genotypic frequency differences between groups for all variants present in any individual in the analysis cohort.
Use Fisher's Exact Test to determine whether the difference in frequency is statistically significant.
Find allelic frequencies (use MapReduce/Hadoop)
Find top-100 p-values for two groups of variants (use MapReduce/Hadoop)
Basic Definitions
Some Basic Definitions
Chromosome
Bioset
Bioset Record
Allele
Allelic Frequency
Chromosome
The term chromosome comes from the Greek words for color (chroma) and body (soma).
A chromosome is an organized structure of DNA, protein, and RNA found in cells.
Human cells have 23 pairs of chromosomes, for a total of 46: 22 pairs of autosomes labeled 1 through 22, plus one pair of sex chromosomes (X and Y).
How are chromosomes inherited? In humans, one copy of each chromosome is inherited from the female parent and the other from the male parent.
[Figure: Chromosome in Picture]
[Figure: Cells to DNA]
What is a Bioset?
Individually analyzed data signatures are referred to as "biosets". Biosets encompass data in the form of experimental sample comparisons as well as genotype signatures.
A bioset is most commonly referred to as a "gene signature". A sample record of a bioset contains a chromosome, its start and stop positions, two alleles, and other related information.
A germline bioset can have 4.3 million records.
A patient may have any number of biosets
Each bioset has a set of genes
VCF to Bioset
Sample −→ FASTQ [1] Data
FASTQ Data −→ DNA-Seq
DNA-Seq −→ VCF [2]
VCF −→ Bioset
Bioset −→ Ready for Analysis
[1] FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
[2] VCF = Variant Call Format = the format of a text file used in bioinformatics for storing gene sequence variations.
Sample Record of a Bioset?
A bioset can have 4.3 million records
A sample record of a bioset will contain:
a chromosome (chromosomeID: 1, 2, 3, ...)
start position
stop position
two alleles: allele1, allele2
genome reference
other related information such as mutation class, ...
What is an Allele?
An allele is a viable DNA coding that occupies a given locus (position) on a chromosome. There are two alleles per chromosome position, called allele1 and allele2.
Allelic frequency is defined as "the percentage of a population of a species that carries a particular allele on a given chromosome locus."
Alternatively, "allele frequency" can be defined as the frequency of an allele relative to that of other alleles of the same gene in a population.
Fisher's Exact Test is used to calculate the "p-value" for allelic frequency.
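The two definitions of allele frequency given here are not the same quantity, and a small sketch may make the difference concrete. The function names and the toy genotype list below are illustrative assumptions, not part of the original slides: the first function counts individuals who carry the allele, the second counts allele copies (two per individual).

```python
from collections import Counter


def carrier_fraction(genotypes, allele):
    """Fraction of individuals carrying at least one copy of `allele`."""
    carriers = sum(1 for a1, a2 in genotypes if allele in (a1, a2))
    return carriers / len(genotypes)


def allele_copy_fraction(genotypes, allele):
    """Fraction of all allele copies (two per individual) equal to `allele`."""
    counts = Counter(a for pair in genotypes for a in pair)
    return counts[allele] / (2 * len(genotypes))


# Hypothetical genotypes at a single locus: (allele1, allele2) per individual.
genotypes = [("A", "C"), ("A", "A"), ("G", "G"), ("A", "C")]

print(carrier_fraction(genotypes, "A"))      # 3 of 4 individuals carry A
print(allele_copy_fraction(genotypes, "A"))  # 4 of 8 allele copies are A
```

Which definition applies depends on the analysis; the group comparison in the rest of the talk works with counts of allele copies.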
Two Alleles: allele1, allele2
An allele is one of two or more versions of a gene. An individualinherits two alleles for each gene, one from each parent.
Source of Data for Allelic Frequency
VCF to Bioset
Sample → FASTQ Data → DNA-Seq → VCF → Bioset
Bioset Record Elements:
1. chromosomeID
2. startPosition
3. stopPosition
4. allele1
5. allele2
6. referenceGenome
7. mutationClass
...
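As a rough illustration of the record elements listed above, a bioset record could be modeled as follows. The slides do not specify the on-disk layout, so the tab-separated field order, the `parse_record` helper, and the sample values are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiosetRecord:
    chromosome_id: str       # kept as text so "X" and "Y" fit alongside "1".."22"
    start_position: int
    stop_position: int
    allele1: str
    allele2: str
    reference_genome: str
    mutation_class: str

    def locus_key(self):
        """Key identifying the position that the two alleles belong to."""
        return (self.chromosome_id, self.start_position, self.stop_position)


def parse_record(line, sep="\t"):
    """Parse one line, assuming the seven fields appear in the listed order."""
    fields = line.rstrip("\n").split(sep)
    chrom, start, stop, a1, a2, ref, mut = fields[:7]
    return BiosetRecord(chrom, int(start), int(stop), a1, a2, ref, mut)


# Hypothetical input line; the values are made up for illustration.
record = parse_record("1\t100\t100\tA\tC\thg19\tSNV")
print(record.locus_key(), record.allele1, record.allele2)
```

Any additional fields after the first seven are ignored here, which mirrors the open-ended "..." in the list.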
Size of Data for Analysis
One Bioset = 4.3 million records
For Allelic frequency: form two groups: Group-A, Group-B
Keep two sets of the same data:
one set for Group-A
one set for Group-B
Group-A = 6,000 Biosets
Group-B = 9,000 Biosets
6,000 + 9,000 = 15,000
15,000 Total Biosets to analyze
15,000 × 4.3 million = 64.5 billion records
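The sizing arithmetic above can be checked in a few lines. The constant names are introduced here only for illustration; the numbers are the ones on the slide.

```python
RECORDS_PER_BIOSET = 4_300_000  # one bioset = 4.3 million records
GROUP_A_BIOSETS = 6_000
GROUP_B_BIOSETS = 9_000

total_biosets = GROUP_A_BIOSETS + GROUP_B_BIOSETS
total_records = total_biosets * RECORDS_PER_BIOSET

print(f"{total_biosets:,} biosets, {total_records:,} records")
```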
Allelic Frequency Analysis
Given
Group-A = set of biosets = {A1, A2, ..., An}
Group-B = set of biosets = {B1, B2, ..., Bm}
Find
Allelic Frequency for every chromosomeID, start, stop, allele
Find p-value for every chromosomeID, start, stop, allele
Find top-100 p-values
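The last two steps can be sketched in plain Python, independent of Hadoop. This is a minimal sketch under stated assumptions, not the implementation used in the talk: the p-value is a two-sided Fisher's exact test computed by summing the probabilities of all tables no more likely than the observed one (a common convention), and "top-100" is read as the 100 smallest p-values. The function names and the sample keys are hypothetical.

```python
import heapq
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table whose probability does not exceed the observed one;
    # the small relative tolerance guards against floating-point ties.
    total = sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


def top_n_smallest(keyed_p_values, n=100):
    """Return the n (key, p_value) pairs with the smallest p-values."""
    return heapq.nsmallest(n, keyed_p_values, key=lambda pair: pair[1])


# Hypothetical keys of the form (chromosomeID, start, stop, allele).
results = [
    (("1", 100, 100, "A"), fisher_exact_two_sided(8, 2, 1, 5)),
    (("1", 200, 200, "C"), fisher_exact_two_sided(5, 5, 5, 5)),
]
print(top_n_smallest(results, n=1))
```

In a MapReduce setting, each task could keep its own local top-N list and a final step would merge those lists, so that only small lists, not all p-values, need to reach one place.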
Allelic Frequency by Example
Group-A: 6 biosets
Bioset-ID   Allele-1   Allele-2
1           A          C
2           A          A
3           A          C
4           G          G
5           A          A
6           AC         T
Group-B: 5 biosets
Bioset-ID   Allele-1   Allele-2
7           A          A
8           C          C
9           A          C
10          A          A
11          A          A
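The two example groups above can be tallied with a short script. The sketch assumes that, for a given allele, "known" means the number of copies of that allele in the group and "others" means all remaining allele copies in the group; that reading and the helper names are assumptions of this example.

```python
from collections import Counter

# Bioset-ID -> (allele1, allele2), copied from the two example groups.
group_a = {1: ("A", "C"), 2: ("A", "A"), 3: ("A", "C"),
           4: ("G", "G"), 5: ("A", "A"), 6: ("AC", "T")}
group_b = {7: ("A", "A"), 8: ("C", "C"), 9: ("A", "C"),
           10: ("A", "A"), 11: ("A", "A")}


def allele_counts(group):
    """Count every allele copy in a group (two copies per bioset)."""
    return Counter(allele for pair in group.values() for allele in pair)


def frequency_row(counts, allele):
    """Return (known, others) for one allele within one group."""
    total = sum(counts.values())
    known = counts[allele]  # a Counter yields 0 for an allele it never saw
    return known, total - known


counts_a = allele_counts(group_a)
counts_b = allele_counts(group_b)

for allele in sorted(set(counts_a) | set(counts_b)):
    print(allele, frequency_row(counts_a, allele), frequency_row(counts_b, allele))
```

For allele A this gives 6 known and 6 others in Group-A, and 7 known and 3 others in Group-B; those four numbers form the 2x2 table that Fisher's Exact Test takes as input.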
Create Frequency Table for Group-A and Group-B:
Allele   Group-A Known   Group-A Others   Group-B Known   Group-B Others