An Open-Source Format for Personal Genome Representation Enabling Fast Queries and Analyses of Human Genomes Compact Genome Format Sally Guthrie, Research Scientist, Curoverse ([email protected])
Transcript
Page 1: Compact Genome Format

An Open-Source Format for Personal Genome Representation Enabling Fast Queries and Analyses of Human Genomes

Compact Genome Format

Sally Guthrie, Research Scientist, Curoverse ([email protected])

Page 2: Compact Genome Format

ACKNOWLEDGEMENTS

Alexander Wait Zaranek, Chief Scientist, Curoverse
Abram Connelly, Research Scientist, Curoverse

Page 3: Compact Genome Format

CURRENT USES OF GENOMIC DATA

Patient Care
• Analyze one genome for rare and pathogenic variants

Population Analysis
• Examine a population for rare variants
• Separate a population into subgroups
• Case/Control Studies and GWA Studies
• Can require merging multiple data sets
• Can require using supervised and unsupervised machine learning

Page 4: Compact Genome Format

VARIANT CALL FORMAT (VCF) SNAPSHOT

Advantages
• Very flexible
• Easily annotated with canonical or in-house annotation pipelines
• Can be small (with compression)

Disadvantages
• Difficult to merge VCFs between studies
• Can be slow to query and run machine learning algorithms on (requires pre-processing)

Page 5: Compact Genome Format

WHAT IS COMPACT GENOME FORMAT (CGF)?

Compact Genome Format is a compressed genomic sequence

Allows analysis to be run on the compressed data

Represents a sequence using a series of vectors
• Each position in the vector is termed a “tile”
• The value at each position points to a sequence in a “Tile Library,” a pan-genome
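The vector-plus-library idea above can be illustrated with a minimal sketch. The toy library, the sequences in it, and the `decode` helper are hypothetical and are not taken from the Lightning source code; the sketch also ignores how tag sequences are handled at tile boundaries.

```python
# Hypothetical toy tile library: one list of known sequences per tile position.
# Index 0 at each position stands for the reference sequence.
tile_library = [
    ["ACGTAC", "ACGAAC"],
    ["GGCTTA"],
    ["TTAGCA", "TTGGCA", "TTAGCC"],
]

def decode(cgf_vector, library):
    """Look up the sequence each vector value points to and concatenate them."""
    return "".join(library[pos][variant] for pos, variant in enumerate(cgf_vector))

# One genome is just a short vector of indices into the library.
genome = [1, 0, 2]
print(decode(genome, tile_library))  # prints ACGAACGGCTTATTAGCC
```

The genome itself stores only small integers; the sequences live once in the shared library.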

Page 6: Compact Genome Format

GENERATING THE REFERENCE TILE LIBRARY

Human Reference Genome (with tag sets highlighted)

Tag Set: …

1. Choose a tag set of unique 24-base-long sequences
2. Map tag set to a reference genome
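A rough sketch of the mapping step, assuming that "unique" means each tag occurs exactly once in the reference. The `map_tags` helper and the toy sequences are hypothetical, and 4-base tags stand in for the 24-base tags so the example stays readable.

```python
def map_tags(reference, tags):
    """Return the sorted start offset of each tag, requiring exactly one occurrence."""
    positions = []
    for tag in tags:
        first = reference.find(tag)
        # Reject tags that are missing or that occur a second time.
        if first == -1 or reference.find(tag, first + 1) != -1:
            raise ValueError(f"tag {tag!r} is not unique in the reference")
        positions.append(first)
    return sorted(positions)

reference = "AACCGGTTTACGATCGGGCATTTGCA"
tags = ["CCGG", "CGAT", "GCAT"]
print(map_tags(reference, tags))  # prints [2, 10, 17]
```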

Page 7: Compact Genome Format

GENERATING THE REFERENCE TILE LIBRARY

1. Save sequence between each tag pair to the tile library
2. Give these sequences a value (0)

Tile Position Id: 00.0000, 00.0001, 00.0002, …
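The two steps above might look like the following sketch. The helper name, the toy reference, and the tag offsets are hypothetical. The id format imitates the `00.0000` style shown on the slide, but the meaning of the leading `00` field is not stated there, so the sketch simply keeps it constant and counts the second field in hexadecimal. Whether the stored tile includes the flanking tags is also an assumption; here it does not.

```python
def build_reference_library(reference, tag_positions, tag_len):
    """Map each tile position id to a list whose entry 0 is the reference tile."""
    library = {}
    pairs = zip(tag_positions, tag_positions[1:])
    for step, (start, next_start) in enumerate(pairs):
        tile_id = f"00.{step:04x}"  # assumed id layout, see lead-in
        # Sequence strictly between the end of one tag and the start of the next.
        library[tile_id] = [reference[start + tag_len:next_start]]
    return library

reference = "AACCGGTTTACGATCGGGCATTTGCA"
library = build_reference_library(reference, [2, 10, 17], tag_len=4)
print(library)  # prints {'00.0000': ['TTTA'], '00.0001': ['CGG']}
```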

Page 8: Compact Genome Format

EXTENDING THE TILE LIBRARY

…010020……011031…

Tile Library

Tile Position Id: 00.002b, 00.002c, 00.002d, 00.002e, 00.002f, 00.0030, …

Page 9: Compact Genome Format

EXTENDING THE TILE LIBRARY

…00201*……1*11**…

Tile Library

Tile Position Id: 00.002b, 00.002c, 00.002d, 00.002e, 00.002f, 00.0030, …
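One way to picture the extension step: when a new genome carries a sequence the library has not seen at some tile position, that sequence is appended and receives the next unused value. The sketch below is an assumption about the general idea, not the Lightning implementation; the `add_genome` helper and all sequences are hypothetical, and it does not model the `*` entries shown on the slide.

```python
def add_genome(library, tiles_by_position):
    """Encode one genome as a vector of values, extending the library as needed."""
    vector = []
    for tile_id in sorted(library):
        sequence = tiles_by_position[tile_id]
        variants = library[tile_id]
        if sequence not in variants:
            variants.append(sequence)  # unseen sequence gets the next unused value
        vector.append(variants.index(sequence))
    return vector

# Library holding only reference tiles (value 0) to start with.
library = {"00.002b": ["ACGT"], "00.002c": ["GGA"], "00.002d": ["TTC"]}

person_a = {"00.002b": "ACGT", "00.002c": "GTA", "00.002d": "TTC"}
person_b = {"00.002b": "ACCT", "00.002c": "GTA", "00.002d": "TTC"}

print(add_genome(library, person_a))  # prints [0, 1, 0]
print(add_genome(library, person_b))  # prints [1, 1, 0]
```

The second genome reuses the value already assigned at `00.002c`, so the library grows only when a sequence is new.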

Page 10: Compact Genome Format

RATE OF GROWTH OF THE TILE LIBRARY

Page 11: Compact Genome Format

CGF AND TILE LIBRARY FACILITATE

Queries on Sequences

Requires: beginning locus and end locus
Returns: the sequences between the two loci for all people in the population
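A minimal sketch of such a query, assuming the two loci have already been translated into tile positions (that translation is not shown here). The `query` helper, the toy library, and the person names are hypothetical.

```python
def query(population, library, first_tile, last_tile):
    """Return each person's sequence from first_tile to last_tile, inclusive."""
    return {
        name: "".join(
            library[pos][vector[pos]] for pos in range(first_tile, last_tile + 1)
        )
        for name, vector in population.items()
    }

# Toy library: one list of known sequences per tile position.
library = [["ACGT", "ACCT"], ["GGA", "GTA"], ["TTC"]]
population = {"person_a": [0, 1, 0], "person_b": [1, 1, 0]}

print(query(population, library, 0, 1))
# prints {'person_a': 'ACGTGTA', 'person_b': 'ACCTGTA'}
```

The query touches only the vector entries inside the requested range, so the work depends on the number of tiles in the range and the number of people.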

Page 12: Compact Genome Format

TIME USED FOR QUERIES ON SEQUENCES

Page 13: Compact Genome Format

TIME PER BASE FOR QUERIES ON SEQUENCES

Page 14: Compact Genome Format

CGF FACILITATES SEVERAL IMPORTANT ANALYSIS TYPES

Unsupervised Machine Learning

Supervised Machine Learning (Case/Control) GWAS

Encompass all variation, not just SNP variation
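Because each genome is already a fixed-length vector, it can be turned into a feature matrix with little pre-processing. The sketch below uses one-hot encoding, a common way to feed categorical values to a learning algorithm. The encoding choice and the `one_hot` helper are assumptions for illustration, not something the slides prescribe.

```python
def one_hot(vectors):
    """One-hot encode tile vectors; returns (features, rows)."""
    n_positions = len(vectors[0])
    # One feature per (tile position, value) pair observed in the population.
    features = sorted({(pos, v[pos]) for v in vectors for pos in range(n_positions)})
    rows = [
        [1 if v[pos] == value else 0 for pos, value in features]
        for v in vectors
    ]
    return features, rows

vectors = [[0, 1, 0], [1, 1, 0], [0, 0, 2]]
features, rows = one_hot(vectors)
print(features)  # prints [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 2)]
print(rows)      # prints [[1, 0, 0, 1, 1, 0], [0, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1]]
```

Each feature stands for a whole tile sequence, so an insertion or deletion is encoded the same way as a single-base change.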

Page 15: Compact Genome Format

COMPACT GENOME FORMAT FINAL THOUGHTS

• Allows annotations: the Tile Library can be annotated by canonical and in-house annotation pipelines, thus automatically applying annotations to all CGF files
• Small
• Standardized
• Fast to query
• Designed for machine learning
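The annotation point can be sketched as a lookup keyed by tile position and value: an annotation is stored once in the library and every genome that carries that tile picks it up. The `annotate` helper and the labels are hypothetical placeholders.

```python
# Hypothetical annotations attached to library entries, keyed by (position, value).
annotations = {
    (0, 1): "hypothetical label A",
    (2, 2): "hypothetical label B",
}

def annotate(cgf_vector, annotations):
    """Collect the annotations for every annotated tile a genome carries."""
    return [
        annotations[(pos, value)]
        for pos, value in enumerate(cgf_vector)
        if (pos, value) in annotations
    ]

print(annotate([1, 0, 2], annotations))  # prints ['hypothetical label A', 'hypothetical label B']
print(annotate([0, 0, 0], annotations))  # prints []
```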

Page 16: Compact Genome Format

Thank you!

Any Questions?

Preliminary implementation: lightning-dev3.curoverse.com/brca
Source code: https://github.com/curoverse/lightning
Software license: GNU AGPLv3

Page 17: Compact Genome Format

GENERATING THE REFERENCE TILE LIBRARY WITH MULTIPLE TAG SETS

Tag Set … … Tag Set

Page 18: Compact Genome Format

RATE OF GROWTH OF THE TILE LIBRARY (NO CALLS CREATE VARIANTS)