
EBI is an Outstation of the European Molecular Biology Laboratory. CRAM: reference-based compression format developed by Vadim Zalunin.

Dec 16, 2015

Page 1

EBI is an Outstation of the European Molecular Biology Laboratory.

CRAM: reference-based compression format, developed by Vadim Zalunin

Page 2

Data horror

EMBL-EBI: 10 petabytes

SRA: ~1 petabyte

Over 2 million DVDs, or a stack 2.5 km high

Complete Genomics: 0.5 TB for a single file

Page 3

The need for compression

Red alert

Page 4

Compression, what is it?

LOSSLESS: BMP, 190 kB; PNG, 100 kB

LOSSY: JPG, 21 kB; JPG, 4 kB

Page 5

Compression, when we know what to expect.

LOSSLESS: BMP, 145 kB; PNG, 2 kB

LOSSY: JPG, 6 kB; JPG, 3 kB

But the actual message is only 40 characters (bytes) long!
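The effect of predictability on lossless compression can be sketched with Python's standard zlib module (DEFLATE, the same lossless method PNG builds on). The two byte strings below are illustrative stand-ins chosen for this sketch, not the images from the slide.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of data after lossless DEFLATE compression."""
    return len(zlib.compress(data, 9))


# Two inputs of equal length (assumptions for illustration only).
N = 100_000
rng = random.Random(0)  # fixed seed so the run is repeatable
unpredictable = bytes(rng.getrandbits(8) for _ in range(N))
predictable = b"Five little ducks went swimming one day\n" * (N // 40)

size_unpredictable = compressed_size(unpredictable)
size_predictable = compressed_size(predictable)

print(f"unpredictable: {len(unpredictable)} -> {size_unpredictable} bytes")
print(f"predictable:   {len(predictable)} -> {size_predictable} bytes")
```

The compressor cannot shrink the random bytes at all, while the repeated sentence collapses to a small fraction of its size: the more the compressor can predict, the less it has to store.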

Page 6

Compression at its best

"Five little ducks went swimming one day"

IMAGE, 145 kB → compress → TEXT, 40 B → uncompress → IMAGE, 145 kB

~3500 times more efficient

Page 7

What are we talking about

bug → sample → sequencing machines → bunch of huge files

The bug’s DNA is hidden somewhere

Page 8

Looking closer at the data

bunch of huge files

It boils down to a long list of reads:

read 1
read 2
read 3
…
read bazillion

Each read represents a short nucleotide sequence from the genome.

Additional information may be attached to it, for example error estimates.

Page 9

What is a Read?

@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

An excerpt from a FASTQ file.

Page 10

What is a Read?

@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

Line 1: read name

An excerpt from a FASTQ file.

Page 11

What is a Read?

@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

Line 1: read name
Line 2: read bases

An excerpt from a FASTQ file.

Bases: ACGTN

Page 12

What is a Read?

@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

Line 1: read name
Line 2: read bases
Line 4: read quality scores

An excerpt from a FASTQ file.

Bases: ACGTN

Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126)
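The three parts of a read can be pulled out of a FASTQ record with a few lines of Python. This is a minimal sketch that assumes the common four-line record layout (name, bases, "+", quality scores); the names `Read` and `parse_fastq` are made up for this example, and the record is the one from the slide, cut where the slide elides it.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    name: str       # line 1, without the leading "@"
    bases: str      # line 2
    qualities: str  # line 4, one ASCII symbol per base


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from four-line FASTQ records (name, bases, '+', qualities)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(bases) != len(qualities):
            raise ValueError(f"bases/qualities length mismatch in {header!r}")
        yield Read(header[1:], bases, qualities)


# The record shown on the slide, truncated at the slide's "…".
example = io.StringIO(
    "@SRR081241.20758946\n"
    "CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG\n"
    "+\n"
    "IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI\n"
)
reads = list(parse_fastq(example))
for read in reads:
    print(read.name, len(read.bases), "bases")
```

Every base has exactly one quality symbol, which is why the parser rejects records where the two lines differ in length.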

Page 13

What is a quality score?

The quality score is a Phred quality score, encoded as an ASCII symbol with a code from 33 to 126.

Basically: higher scores are better, so ‘!’ is bad, ‘I’ is good.
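A short sketch of the decoding, assuming the offset of 33 given on the slide. The conversion to an error probability uses the standard Phred definition, Q = -10 · log10(P), which the slide does not spell out; the function names are made up for this example.

```python
from typing import List


def phred_scores(quality_string: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(symbol) - offset for symbol in quality_string]


def error_probability(score: int) -> float:
    """Standard Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


for symbol in "!I":
    score = phred_scores(symbol)[0]
    print(symbol, score, error_probability(score))
```

Here ‘!’ decodes to a score of 0, meaning the base call is as good as a guess, while ‘I’ decodes to 40, an estimated error rate of 1 in 10,000.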

Page 14

Reference based encoding

Reference sequence  T G A G C T C T A A G T A C C C G C G G T C T G T C C G
read 1              T G A G C T C T T A G T A G C
read 2                    G C T C T A A G T A G C C G C
read 3                      C T C T A A G T A G C C G C G
read 4                                  G T A G C C G C G G A C T G T
read 5                                                C G G T C T G T C C G

Labels: read start position, read end position

Page 15

Reference based encoding

Reference sequence  T G A G C T C T A A G T A C C C G C G G T C T G T C C G
read 1              . . . . . . . . T . . . . . .
read 2                    . . . . . . . . . . . . . . .
read 3                      . . . . . . . . . . . . . . .
read 4                                  . . . . . . . . . . A . . . .
read 5                                                . . . . . . . . . . .

Page 16

Reference based encoding

Reference sequence  T G A G C T C T A A G T A C C C G C G G T C T G T C C G
read 1              . . . . . . . . T . . . . . .
read 2                    . . . . . . . . . . . . . . .
read 3                      . . . . . . . . . . . . . . .
read 4                                  . . . . . . . . . . A . . . .
read 5                                                . . . . . . . . . . .

Label: mismatching bases (the letters shown in place of dots)
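The idea on these slides can be sketched as follows: instead of storing a read's bases, store where it starts on the reference, how long it is, and only the bases that differ. This is an illustration of the principle, not the CRAM file format. The names `EncodedRead`, `encode` and `decode` are made up for this example, and the start positions were inferred by matching each read against the reference. One further assumption is flagged in the code: the reference below has G at position 14, where the transcript shows C, so that the reads agree with the dotted view.

```python
from typing import List, NamedTuple, Tuple


class EncodedRead(NamedTuple):
    start: int                         # 0-based start position on the reference
    length: int                        # number of bases in the read
    mismatches: List[Tuple[int, str]]  # (offset within the read, read base)


def encode(reference: str, read: str, start: int) -> EncodedRead:
    """Keep only the position, length and bases that differ from the reference."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return EncodedRead(start, len(read), mismatches)


def decode(reference: str, encoded: EncodedRead) -> str:
    """Rebuild the read by copying the reference and patching the mismatches."""
    bases = list(reference[encoded.start:encoded.start + encoded.length])
    for offset, base in encoded.mismatches:
        bases[offset] = base
    return "".join(bases)


# ASSUMPTION: G at position 14 (the transcript shows C), see the note above.
REFERENCE = "TGAGCTCTAAGTAGCCGCGGTCTGTCCG"
# (read bases, inferred 0-based start position)
READS = [
    ("TGAGCTCTTAGTAGC", 0),
    ("GCTCTAAGTAGCCGC", 3),
    ("CTCTAAGTAGCCGCG", 4),
    ("GTAGCCGCGGACTGT", 10),
    ("CGGTCTGTCCG", 17),
]

encoded_reads = [encode(REFERENCE, read, start) for read, start in READS]
for number, encoded in enumerate(encoded_reads, start=1):
    print(f"read {number}: {encoded}")
```

Reads 2, 3 and 5 reduce to a start position and a length; reads 1 and 4 need one extra (offset, base) pair each. Decoding with the same reference gives back the original bases, which is why the reference must be available to whoever uncompresses the data.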

Page 17

Lossy quality scores

Approach 1

Quality scores are usually values from 0 to 39.

Let’s shrink them so that they range from 0 to 7.

Approach 2

Let’s treat quality scores using alignment information.

For example: preserve only quality scores for mismatching bases.
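Both approaches can be sketched in a few lines. The slide does not say how the scores are shrunk, so the equal-width binning below is an assumption, as are the function names; the second function keeps a score only where the read base differs from the reference and drops the rest.

```python
from typing import List, Optional


def bin_scores(scores: List[int], max_in: int = 39, levels: int = 8) -> List[int]:
    """Approach 1: map scores in 0..max_in onto 0..levels-1 using equal-width bins."""
    width = (max_in + 1) / levels  # 40 / 8 = 5 input scores per bin
    return [min(int(score / width), levels - 1) for score in scores]


def keep_mismatch_scores(
    read: str, reference_window: str, scores: List[int]
) -> List[Optional[int]]:
    """Approach 2: keep a score only where the read base differs from the reference."""
    return [
        score if base != ref_base else None
        for base, ref_base, score in zip(read, reference_window, scores)
    ]


print(bin_scores([0, 4, 5, 20, 39]))
print(keep_mismatch_scores("ACGT", "ACCT", [30, 31, 12, 33]))
```

Eight levels fit in 3 bits per score, and in the second approach most positions carry no score at all. Both are lossy: the original scores cannot be recovered from the output.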

Figure labels: horizontal, vertical

Page 18

Comparison study: 1K Genomes exomes

BAM → compress → CRAM → uncompress → BAM

Page 19

Comparison study: 1K Genomes exomes

BAM → compress → CRAM → uncompress → BAM

Original BAM → some analysis pipeline

Restored BAM → some analysis pipeline

Page 20

Comparison study: 1K Genomes exomes

BAM → compress → CRAM → uncompress → BAM

Original BAM → some analysis pipeline → original SNPs

Restored BAM → some analysis pipeline → restored SNPs
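One way to compare the two call sets is to treat each SNP as a (chromosome, position, reference allele, alternative allele) tuple and look at the overlap. This is a sketch of that comparison only; the slide does not say which measure the study used, and the calls below are hypothetical values for illustration, not results from the study.

```python
from typing import Dict, NamedTuple, Set


class Snp(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def compare_calls(original: Set[Snp], restored: Set[Snp]) -> Dict[str, float]:
    """Summarise how far two SNP call sets agree."""
    shared = original & restored
    union = original | restored
    return {
        "shared": len(shared),
        "only_original": len(original - restored),
        "only_restored": len(restored - original),
        # Jaccard index: 1.0 means the two call sets are identical.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical calls, for illustration only.
original_snps = {
    Snp("1", 1000, "A", "G"),
    Snp("1", 2000, "C", "T"),
    Snp("2", 500, "G", "A"),
}
restored_snps = {
    Snp("1", 1000, "A", "G"),
    Snp("1", 2000, "C", "T"),
    Snp("2", 900, "T", "C"),
}

summary = compare_calls(original_snps, restored_snps)
print(summary)
```

With lossless compression the restored BAM is identical to the original, so the two sets must match exactly; with lossy quality scores the entries that appear in only one set show what the loss cost.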

Page 21

Comparison study: 1K Genomes exomes

Page 22

CRAM NGS data compression

[Chart: bits/base, on a scale from (bad) to (good), for untreated data (do nothing), CRAM lossless, CRAM lossy and CRAM very lossy, grouped into lossless and lossy.]
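Bits/base is the stored size in bits divided by the number of sequenced bases, so lower is better. A minimal sketch, with a made-up function name and hypothetical numbers rather than values read off the chart:

```python
def bits_per_base(file_size_bytes: int, total_bases: int) -> float:
    """Average number of bits of storage spent on each sequenced base."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return file_size_bytes * 8 / total_bases


# Hypothetical example: one million bases stored in 250,000 bytes.
example = bits_per_base(250_000, 1_000_000)
print(example)
```

For scale, 2 bits/base is what a plain encoding of the four letters A, C, G and T needs for the bases alone, before read names and quality scores are added.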

Page 23

Progressive application of compression

[Chart axes: sample value (high to low), sample accessibility (hard to easy). Labels: lossless, 2-fold, 20-fold, 200-fold.]

Page 24

References

More information:

http://www.ebi.ac.uk/ena/about/cram_toolkit

Mailing list:

http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev

Publications:

Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40.

Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. Gigascience 1