Genetic Data Warehousing eBook

Copyright © 2016 Andreas Scherer

All rights reserved. No part of this book may be reproduced in any form or by any electronic or mechanical means - except in the case of brief quotations embodied in articles or reviews - without written permission from its publisher.

ISBN: 978-0-9908886-7-3

Table of Contents

Preface

Chapters

1. The Data Explosion in Genetics

2. Data Warehousing

3. Use Cases for Data Warehousing in a Genetic Testing Lab

4. Implementation Approach

5. Summary

Preface

As Next-Generation Sequencing takes off in the clinic, it creates a significant data management problem for clinicians, scientists and IT professionals alike. How can we retain the massive amounts of data coming out of clinical pipelines in a way that lets labs systematically build a knowledge base, capturing the insights clinicians gain on a day-to-day basis while analyzing the genetic information of their patients? What infrastructure is required to alert medical personnel to new research that could alter medical decisions? And how can we embed the work being done in the labs into general hospital workflows?

Data warehousing is a pivotal technology that can help in all of these areas. In this book, I will explain the concept of Data Warehousing and discuss the use cases for this technology in light of the upcoming requirements in clinical labs that are implementing Precision Medicine.

Many people at Golden Helix have contributed to this eBook. It would have been impossible to write it without the ingenious work of our product developers. Specifically, I'd like to thank Gabe Rudy, Greta Linse Peterson, Nathan Fortier, Hauwa Yusuf and Cheryl Rogers for their invaluable contributions.

Andreas Scherer January 2016

Bozeman, Montana

Chapter 1: The Data Explosion in Genetics

According to Grand View Market Research, the global next-generation sequencing (NGS) market was worth 2.0 billion USD in 2014 and is expected to grow at an annual rate of about 40% from 2015 to 2022. This growth is driven by the increasing number of treatment options for Precision Medicine (see figure 1).

Figure 1: Utility of NGS based diagnostics per disease category

In addition, the price of genome sequencing is falling quickly thanks to the research and development of rapid, high-capacity whole genome sequencers by leading vendors such as Illumina. A growing population, increasing demand for prenatal testing, and a growing number of cancer cases in an aging population are expected to further fuel the demand for NGS-based diagnostics.

Now, let's look at this from a lab's perspective. A human genome consists of about 3 billion base pairs. Storing all of it would take about 700 megabytes, assuming that there are no technological flaws to worry about and therefore no need to include information on data quality along with the sequence. In that case, all we would need is the string of letters A, C, G and T that make up one strand of the human genome. However, that is not a valid assumption. The actual number is much higher, about 200 gigabytes, assuming we store all the short reads produced by a sequencer at 30x coverage.

A more compact representation of this information is the variant file, or VCF file. Only about 0.1% of the genome differs among individuals, which equates to about 3 million variants in the average human genome. This means we can derive a "diff file" of just the places where any given individual differs from the normal "reference" genome. So, per sample, we have to store about 3 million variants. Depending on how much additional data we keep to compute clinically relevant information such as coverage statistics, as well as for visualization purposes, our storage requirements are between 200 megabytes and 200 gigabytes per sample. For example, it would make sense to retain all BAM files associated with a particular sample. In addition, we have to save clinical reports and meta-information about the filtering process.

In 2016, labs easily conduct dozens of whole genome analyses per month. Larger labs process hundreds in the same time frame. According to the market study, each lab will see on average a 40% increase in its data volume year over year. This means that very quickly there are billions of variants and terabytes of data to manage per lab, and that the volume of newly created data will double in a little over two years.

In order to capture, organize and leverage this data, advanced warehousing capabilities are required. The numbers above call for a Big Data approach, but the need for centralized and genetically aware data management already arises at the much smaller variant numbers generated by more targeted tests such as gene panels or whole exome sequencing. Essentially, labs require an integrated approach to store, manage and gain insights from samples produced in a clinical setting.
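The storage and growth figures in this chapter can be checked with back-of-the-envelope arithmetic. The Python sketch below is only an illustration under stated assumptions, not a description of any real pipeline: the function names are hypothetical, the 2 bits per base encoding and the 2 bytes per sequenced base (one for the base call, one for its quality score, ignoring read headers and compression) are simplifying assumptions that do not come from the text, and the results should be read as orders of magnitude only.

```python
import math

# Figures taken from the text.
BASE_PAIRS = 3_000_000_000   # approximate length of the human genome
COVERAGE = 30                # 30x sequencing coverage
VARIANT_FRACTION = 0.001     # about 0.1% of positions differ among individuals
ANNUAL_GROWTH = 0.40         # about 40% more data per year

# Assumptions made for this sketch (not from the text).
BITS_PER_BASE = 2            # four letters A, C, G, T fit in 2 bits
BYTES_PER_SEQUENCED_BASE = 2 # 1 byte base call + 1 byte quality score


def sequence_only_bytes(base_pairs=BASE_PAIRS, bits_per_base=BITS_PER_BASE):
    """Bytes needed for one strand with no quality information."""
    return base_pairs * bits_per_base // 8


def raw_reads_bytes(base_pairs=BASE_PAIRS, coverage=COVERAGE,
                    bytes_per_base=BYTES_PER_SEQUENCED_BASE):
    """Bytes needed for all short reads at the given coverage."""
    return base_pairs * coverage * bytes_per_base


def expected_variants(base_pairs=BASE_PAIRS, fraction=VARIANT_FRACTION):
    """Approximate number of variants relative to the reference genome."""
    return round(base_pairs * fraction)


def doubling_time_years(growth=ANNUAL_GROWTH):
    """Years until a quantity growing at `growth` per year has doubled."""
    return math.log(2) / math.log(1 + growth)


if __name__ == "__main__":
    print(f"Sequence only: {sequence_only_bytes() / 1e6:.0f} MB")
    print(f"Raw reads:     {raw_reads_bytes() / 1e9:.0f} GB")
    print(f"Variants:      {expected_variants():,}")
    print(f"Doubling time: {doubling_time_years():.2f} years")
```

Under these assumptions the sequence-only estimate lands in the high hundreds of megabytes and the raw-read estimate in the high hundreds of gigabytes' lower range, which is the same order of magnitude as the roughly 700 megabytes and 200 gigabytes quoted above. Real file sizes depend heavily on file format, read headers and compression.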

http://goldenhelix.com/resources/ebooks/Genetic-Data-Warehousing.html
