Emily Williams, Yuan Tian, Yun Zhu, Carol Munroe, John Bucci, Yutao Fu, Fiona Hyland, and Corina Shtir
Clinical Next-Gen Sequencing Division, Thermo Fisher Scientific Inc., 5781 Van Allen Way, Carlsbad, CA 92008, U.S.A.

RESULTS
• The Disease Association Database organizes diseases into a hierarchical structure for efficient lookup, using the disease parent-child relationships established by the NIH’s Unified Medical Language System (UMLS) [2].
• For any disease in the hierarchical tree, the gene scoring algorithm computes scores that summarize the strength of each gene’s association with all of that disease’s child diseases.

Table 1. Disease annotation for the 28 identified gene clusters.

INTRODUCTION
Selecting genes to include in genomic studies of disease remains a difficult task. Current methods rely on expert opinion or manual use of search engines, so neither the process nor the result is repeatable or scalable. To remedy this, we created the Informative Genetic Content (IGC) system, which enables the algorithmic selection of genes for inclusion in such studies, given one or more diseases to target. The IGC system has three components: a database associating diseases with genes and with other diseases, an algorithm that ranks the genes under consideration for inclusion in a panel, and a module that clusters genes by families of diseases. The first component, the database, maps diseases to associated genes and scores each mapping according to the strength of the relationship. The database also maps diseases to other diseases, so that groups of diseases and hierarchical relationships between diseases can be identified. The second component enables the ranking of candidate genes when multiple diseases are of interest.
The algorithm accounts for the common situation in which two or more diseases are associated with the same gene at varying strengths of association: it weights and combines the scores across the diseases associated with each gene. The final component, the gene clustering module, groups genes by pathogenic pathways, should the user want to target a broader family of diseases affected by a closely related set of genes.
We validated the IGC system by comparing our automated gene selections with expert-curated gene panel designs. We found a high degree of overlap between the IGC’s gene selections and the gene lists chosen by experts, supporting the viability of our system. Together with the scalability and repeatability enabled by its automation, the IGC system greatly improves the gene panel selection process and thereby advances targeted genomic studies.

CONCLUSIONS
We created a comprehensive, efficient, and informative engine, the IGC, to optimize gene selection given diseases at any level of the disease ontology hierarchy:
• The Disease Association Database organizes diseases into an effective hierarchical structure and associates diseases with genes.
• The gene scoring algorithm ranks genes by disease relevance and summarizes the scores for diseases at any level of the hierarchy.
• The Virtual Panel Library efficiently groups genes into clusters by major disease category, and further ranks the genes within each cluster by their relative importance to that category’s diseases.

* For Research Use Only. Not for use in diagnostic procedures.

REFERENCES
1. Piñero J, Queralt-Rosinach N, Bravo A et al (2015) DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015:bav028.
2. Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Database issue):D267-D270.
3. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9:559.

TRADEMARKS/LICENSING
© 2016 Thermo Fisher Scientific Inc. All rights reserved. All trademarks are the property of Thermo Fisher Scientific or its subsidiaries unless otherwise specified.

Algorithmically optimized gene selection for targeted clinical sequencing panels
Thermo Fisher Scientific • 5781 Van Allen Way • Carlsbad, CA 92008 • thermofisher.com

Figure 1. Overview of IGC - Database and algorithms for identifying and ranking gene-disease associations

Figure 2. Disease Association Database maps genes to diseases
The database establishes gene-disease relationships based on DisGeNET [1], which scores gene-disease associations according to expert-curated sources (e.g. CTD, CLINVAR, and ORPHANET), predicted data using mouse models, and text-mining of publications. Blue circles: two neurological diseases (schizophrenia and bipolar disorder). Green circles: genes associated with these two diseases.

Figure 4. Gene prioritization in disease hierarchy

Figure 5. Gene clustering identified 28 Virtual Panel Libraries associated with major disease categories.
(Figure 5 panels A and B; axis label: Cluster Groups.)

Disease Key
MeSH Category  Description
C04            Neoplasms
C05            Musculoskeletal Diseases
C06            Digestive System Diseases
C07            Stomatognathic Diseases
C08            Respiratory Tract Diseases
C09            Otorhinolaryngologic Diseases
C10            Nervous System Diseases
C11            Eye Diseases
C12            Male Urogenital Diseases
C13            Female Urogenital Diseases and Pregnancy Complications
C14            Cardiovascular Diseases
C15            Hemic and Lymphatic Diseases
C16            Congenital, Hereditary, and Neonatal Diseases and Abnormalities
C17            Skin and Connective Tissue Diseases
C18            Nutritional and Metabolic Diseases
C19            Endocrine System Diseases
C20            Immune System Diseases

Figure 3. Gene Scoring Algorithm
The ranking score uses an unbiased gene scoring method that accounts for both the strength and number of gene-disease pairs.
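The poster does not give the scoring formula, so the following Python sketch is only one way a score could account for both the strength and the number of gene-disease pairs while rolling child diseases up the hierarchy. The noisy-OR combination rule, the function names, the gene names, the parent node `PARENT_DISEASE`, and all score values are assumptions made for illustration; only the idea of per-pair scores between 0 and 1 (as in DisGeNET) and parent-child disease relationships comes from the text.

```python
from collections import defaultdict
from math import prod

# Hypothetical toy hierarchy: parent disease -> child diseases.
CHILDREN = {"PARENT_DISEASE": ["schizophrenia", "bipolar disorder"]}

# Hypothetical (gene, disease) -> association score in [0, 1].
ASSOCIATIONS = {
    ("GENE_A", "schizophrenia"): 0.5,
    ("GENE_A", "bipolar disorder"): 0.5,
    ("GENE_B", "schizophrenia"): 0.6,
    ("GENE_C", "unrelated disease"): 0.9,
}


def descendants(disease, children):
    """Return the disease plus every disease below it in the hierarchy."""
    seen, stack = set(), [disease]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def rank_genes(targets, children, associations):
    """Rank genes for one or more target diseases, highest score first.

    Assumed combination rule (not taken from the poster): noisy-OR,
    1 - prod(1 - s), which grows with both the strength and the number
    of supporting gene-disease pairs and stays within [0, 1].
    """
    diseases = set()
    for target in targets:
        diseases |= descendants(target, children)

    per_gene = defaultdict(list)
    for (gene, disease), score in associations.items():
        if disease in diseases:
            per_gene[gene].append(score)

    combined = {
        gene: 1.0 - prod(1.0 - s for s in scores)
        for gene, scores in per_gene.items()
    }
    # Sort by descending score, then by gene name for a stable order.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for gene, score in rank_genes(["PARENT_DISEASE"], CHILDREN, ASSOCIATIONS):
        print(f"{gene}\t{score:.2f}")
```

In this toy input, GENE_A has two moderate associations and GENE_B a single stronger one; under the assumed rule GENE_A ranks first, which is the kind of behavior a score sensitive to the number of pairs would show. A different combination rule (for example a weighted sum) would give a different ordering.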
From the top 5,000 disease-relevant genes according to the gene scoring algorithm, 28 gene clusters were identified using the WGCNA algorithm [3]. A) Hierarchical clustering of genes according to their association patterns with 16 high-level MeSH categories relevant to inherited diseases. B) Gene cluster association scores with the 16 MeSH disease categories are shown with p-values.

Module #  Module Color   Gene Count  Disease Annotation
1         turquoise      530         Nervous System Diseases
2         blue           321         Nutritional and Metabolic Diseases
3         brown          307         Cardiovascular Diseases
4         yellow         280         Digestive System Diseases
5         green          253         Eye Diseases
6         red            250         Skin and Connective Tissue Diseases
7         black          229         Male and Female Urogenital Diseases
8         pink           205         Musculoskeletal Diseases
9         magenta        164         Nervous System Diseases; Nutritional and Metabolic Diseases
10        purple         150         Hemic and Lymphatic Diseases
11        greenyellow    140         Musculoskeletal Diseases; Nervous System Diseases
12        tan            137         Neoplasms
13        salmon         129         Respiratory Tract Diseases
14        cyan           111         Otorhinolaryngologic Diseases; Nervous System Diseases
15        midnightblue   90          Male Urogenital Diseases;
16        lightcyan      87          Immune System Diseases; Male Urogenital Diseases; Female Urogenital Diseases and Pregnancy Complications
17        grey60         76          Stomatognathic Diseases
18        lightgreen     69          Hemic and Lymphatic Diseases; Immune System Diseases
19        lightyellow    67          Female Urogenital Diseases and Pregnancy Complications; Endocrine System Diseases
20        royalblue      63          Female Urogenital Diseases and Pregnancy Complications
21        darkred        61          Musculoskeletal Diseases; Skin and Connective Tissue Diseases
22        darkgreen      60          Musculoskeletal Diseases; Stomatognathic Diseases
23        darkgrey       55          Female and Male Urogenital Diseases; Nutritional and Metabolic Diseases
24        darkturquoise  55          Nutritional and Metabolic Diseases; Endocrine System Diseases
25        darkorange     36          Musculoskeletal Diseases; Cardiovascular Diseases
26        orange         36          Immune System Diseases
27        white          35          Endocrine System Diseases
28        skyblue        34          Immune System Diseases; Skin and Connective Tissue Diseases

(Figure labels: Disease of interest; DisGeNET Database [1].)
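The gene clusters above were produced with WGCNA, an R package, and the poster does not report the settings used. As a rough illustration of the underlying idea only, the Python sketch below builds an unsigned, soft-thresholded correlation adjacency between genes from their association profiles across MeSH categories, the kind of similarity on which a weighted correlation network is built. It is not WGCNA itself and omits the later steps (topological overlap, hierarchical clustering, module detection). The gene names, the three-category profiles, their values, and the power `beta=6` are all assumptions chosen for illustration.

```python
from math import sqrt

# Hypothetical per-gene association scores across three MeSH categories
# (for example C10, C14, C18); values are invented for illustration.
PROFILES = {
    "GENE_A": [0.9, 0.1, 0.2],
    "GENE_B": [0.8, 0.2, 0.1],
    "GENE_C": [0.1, 0.9, 0.3],
}


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return max(-1.0, min(1.0, cov / sqrt(var_x * var_y)))


def soft_threshold_adjacency(profiles, beta=6):
    """Unsigned adjacency |cor|^beta for every unordered gene pair.

    Raising the absolute correlation to a power keeps strong
    correlations and pushes weak ones toward zero; beta is an
    assumed value, not one reported in the poster.
    """
    genes = sorted(profiles)
    return {
        (g, h): abs(pearson(profiles[g], profiles[h])) ** beta
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


if __name__ == "__main__":
    for pair, weight in soft_threshold_adjacency(PROFILES).items():
        print(pair, round(weight, 3))
```

Genes whose profiles peak in the same category (GENE_A and GENE_B here) receive a higher adjacency than genes whose profiles peak in different categories, which is what lets a downstream clustering step group them into the same module.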