Platforms CIBERER and INB-ELIXIR-es
Joaquín Dopazo
Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF); Functional Genomics Node (INB-ELIXIR-es); Bioinformatics in Rare Diseases (BiER-CIBERER), Valencia, Spain.
http://bioinfo.cipf.es http://www.babelomics.org @xdopazo
Symposium: International platforms for biomedical research: A focus on rare diseases, Fundación Ramón Areces, Madrid, 3-4 November 2016
The CIBERER “1000 genomes” Initiative to sequence rare disease patients
Diseases with:
• Unknown genes
• No mutations in known genes
Search for:
• New genes
• Known genes with unknown modifier genes
• Susceptibility genes
http://www.gbpa.es/
Sample providers Sequencing platforms Data analysis
A total of 1044 individuals (including 300 controls), covering more than 30 diseases, were sequenced between 2012 and 2013.
The actors: MGP and CIBERER
MGP is a public-private partnership (PPP) between the Andalusian regional government and Roche. The MGP roadmap is based on the availability of:
• More than 14,000 clinically well-characterized samples
• An automatically updated PATIENT HEALTH RECORD (PHR)
• SAMPLE INFORMATION (SI)
These will be used as the first steps towards the implementation of genomic and personalized medicine in the Andalusian HEALTHCARE SYSTEM, a system covering a population of 8.5 million. MGP spans from 2012 to 2014.
The Spanish Network for Research in Rare Diseases
(CIBERER) is an initiative of the Spanish Health Ministry.
The CIBERER is composed of 60 research and clinic
groups distributed across the country and has been
The CSVS is a crowdsourcing project
Scenario: Sequencing projects of healthy populations are expensive, and funding bodies are reluctant to fund them.
CSVS aim: To offer increasingly accurate information on variant frequencies characteristic of the Spanish population.
CSVS main use: Frequency-based filtering of candidate variants.
Main data source: Sequencing projects of individual researchers (CIBERER and others).
Problem: Most of the contributions correspond to patient exomes.
Idea: Patients with disease A can be considered healthy pseudo-controls for disease B (provided that A and B share no common genetic background).
Beacon: CSVS will soon appear in the Beacon server.
http://ciberer.es/bier/exome-server/
The CSVS Interface
CSVS is organized in disease categories
CSVS can be queried about chromosomal
regions or genes
Why bin data into ICD-10 categories?
The first level of the ICD-10 disease classification offers two advantages:
• No (or very low) common genetic background among ICD categories
• Classes big enough to preserve data confidentiality. Attempts to identify individuals within them will yield only very vague phenotype clues
Binning into ICD-10 high-level categories is endorsed by CIBERER experts in bioethics.
D1 D2 D3 D4 D5 D6 D7 … D22
(Pseudo-)controls for D3: every category except D3
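The pseudo-control idea can be sketched in a few lines of Python. This is only an illustration: the category names, the genotype counts and the function `pseudo_control_counts` are hypothetical and are not taken from CSVS itself. Genotype counts for one variant are summed over every disease category except the one under study.

```python
# Hypothetical genotype counts for ONE variant, per disease category:
# category -> (0/0, 0/1, 1/1). The numbers are made up for illustration.
counts_by_category = {
    "D1": (40, 3, 0),
    "D2": (55, 5, 1),
    "D3": (30, 10, 4),
}


def pseudo_control_counts(counts, query_category):
    """Sum genotype counts over every category except the one under study."""
    total = [0, 0, 0]
    for category, genotype_counts in counts.items():
        if category == query_category:
            continue  # patients of the query disease are not controls for it
        for i, n in enumerate(genotype_counts):
            total[i] += n
    return tuple(total)


controls_for_d3 = pseudo_control_counts(counts_by_category, "D3")
print(controls_for_d3)  # (95, 8, 1)
```

Excluding a whole category at once relies on the assumption stated above, that categories share little or no genetic background.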
Statistics
As of 11/09/2016, CSVS contains 790 exomes of unrelated Spanish individuals. About 1000 are expected by the end of the year.
Information provided
• Genomic coordinates, variation, gene
• SNP id, if any
• Genotype frequencies in the different reference populations
• Pathogenicity indexes
• Phenotype, if available
Variants can also be seen within their genomic context
GenomeMaps viewer (Medina et al., 2013, NAR) embedded in the application.
GenomeMaps is the official genome viewer of the ICGC (http://dcc.icgc.org/)
CSVS provides insights on the portion of the variability already contained in it
Table of Spanish Frequencies (TSF)
DB of Spanish variants (DBSV)
Chr  Position  Ref  Alt  0/0  0/1  1/1
1    1365313   A    T    75   0    0
1    1484884   G    A    70   4    1
2    326252    T    C    25   35   15
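Genotype counts like those in the table can be turned into an alternate allele frequency, (n_het + 2·n_hom_alt) / (2·N) for diploid genotypes, and that frequency can then drive the frequency-based filtering mentioned earlier. The sketch below uses the three rows of the table. The 5% threshold and the function name are assumptions made for illustration, not values used by CSVS.

```python
def alt_allele_frequency(hom_ref, het, hom_alt):
    """Alternate allele frequency from diploid genotype counts (0/0, 0/1, 1/1)."""
    n_individuals = hom_ref + het + hom_alt
    if n_individuals == 0:
        raise ValueError("no genotyped individuals")
    return (het + 2 * hom_alt) / (2 * n_individuals)


# Rows of the table: (chr, position, ref, alt, 0/0, 0/1, 1/1)
rows = [
    ("1", 1365313, "A", "T", 75, 0, 0),
    ("1", 1484884, "G", "A", 70, 4, 1),
    ("2", 326252, "T", "C", 25, 35, 15),
]

MAX_FREQ = 0.05  # hypothetical cut-off: keep variants rare in the population

# Keep candidate variants whose alternate allele is rare among the controls.
rare = [row for row in rows if alt_allele_frequency(*row[4:]) <= MAX_FREQ]
for chrom, pos, ref, alt, *_ in rare:
    print(chrom, pos, ref, alt)
```

With these counts the third variant (alternate allele frequency of about 0.43) is filtered out as too common to be a likely cause of a rare disease, while the first two are kept.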
[Diagram: CSVS input workflow. Labels: CES use; Other countries; CSVS input; External; Internal; Regional; VCFs; Unrelated? (DBSV); Spanish? (TSF); YES / NO; Counts; SIP]
AIM (ancestry-informative markers) are used to discard kinship and different ethnicity
Diagnosis + biomarker discovery: an ongoing integrated CIBERER initiative
Ongoing CIBERER pilot project with the collaboration of seven hospitals: La Paz, FJD, Ramón y Cajal, CBM (Madrid), Virgen del Rocío (Sevilla), Hospital del Mar (Barcelona), HU La Fe (Valencia)
http://team.babelomics.org
http://BiERapp.babelomics.org
Diagnosis using NGS and virtual panels
• Diagnostic SNV
• Management of variants of unknown significance (VUS) and unexpected findings
• Medical reports
• Generation and management of virtual panels (http://team.babelomics.org)
• 100% traceability of data management and decisions
The CIBERER CNV server
Stores CNVs found in patients of different hospitals, along with additional information on ethnicity, location, phenotype (HPO), etc., which can be studied in the genomic context (using GenomeMaps).
If everything goes as planned, it will contain data on more than 15,000 patients from 5 CIBERER hospitals by the end of the year.
What is inside? OpenCGA
Overview and goals
Open-source Computational Genomics Analysis (OpenCGA) aims to provide a high-performance and scalable solution for genomic big data processing and analysis.
OpenCGA is built on OpenCB: CellBase, Genome Maps, Cell Maps, HPG Aligner, HPG BigData, Variant annotation.
Project at GitHub: https://github.com/opencb/opencga
OpenCGA - Storage
Extensive capabilities to query across genotype and phenotype relationships
Times (Transform, Load and Merge measured on a 6-node Hadoop cluster):
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Annotate: 2000 v/sec
On the same cluster:
• Millisecond response times for regional queries
• Whole-genome filtering queries for all individuals within seconds
https://github.com/opencb/opencga
Tools developed to improve the pipeline: CellBase, the knowledge DB
Project: http://bioinfo.cipf.es/compbio/cellbase (now at: https://github.com/opencb/cellbase)
CellBase (Bleda, 2012, NAR), a comprehensive integrative database and RESTful Web Services API, more than 250GB of data:
and evolve towards a systems biology based perspective
Genetic diseases have a modular nature and, consequently, must be addressed from a systems biology perspective
• With the development of systems biology, studies have shown that phenotypically similar diseases are often caused by functionally related genes, which is referred to as the modular nature of human genetic diseases (Oti and Brunner, 2007; Oti et al., 2008).
• This modularity suggests that causative genes for the same or phenotypically similar diseases may generally reside in the same functional module, be it a protein complex, a sub-network of protein interactions, or a pathway.
• Perturbed modules account for disease better than individual perturbed genes.
Disease genes are close in the interactome (Goh, 2007, PNAS)
The same disease in different populations is caused by different genes affecting the same functions (Fernandez, 2013, Orphanet J Rare Dis.)
In fact, predictions made with proper models of functional modules outperform predictions based on their individual components
The activity of the pathway is better correlated with survival than individual gene activities (Fey et al., Sci. Signal., 2015).
ODEs are used to solve the dynamics of a model from the expression values of its components
Problem: ODEs can efficiently solve only small systems
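To make the ODE approach concrete, here is a minimal sketch of a two-component activation cascade integrated with the forward Euler method. The model, the rate constants and the function name are all hypothetical; this is not the model of Fey et al. Each component needs its own equation and its own kinetic parameters, which hints at why the approach becomes hard to apply to large systems.

```python
def simulate_cascade(signal, k_act=1.0, k_deact=0.5, dt=0.01, steps=2000):
    """Forward-Euler integration of a toy two-step activation cascade.

    dA/dt = k_act * signal * (1 - A) - k_deact * A
    dB/dt = k_act * A      * (1 - B) - k_deact * B

    A and B are the active fractions (between 0 and 1) of two components;
    `signal` is the upstream input, e.g. a normalised expression value.
    """
    a = b = 0.0
    for _ in range(steps):
        da = k_act * signal * (1.0 - a) - k_deact * a
        db = k_act * a * (1.0 - b) - k_deact * b
        a += dt * da
        b += dt * db
    return a, b


a_final, b_final = simulate_cascade(signal=1.0)
print(f"A = {a_final:.3f}, B = {b_final:.3f}")
```

With the default parameters the system settles near its steady state, A = 2/3 and B = 4/7, which can be checked by setting both derivatives to zero.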
Two problems: defining functional modules and modeling their behavior
• Gene ontology: descriptive; unstructured functional labels
• Networks of interaction, regulation, etc.: relationships among components but unknown function
• Pathways: relationships among components and their functional roles
Models
Enrichment methods. GO, etc. (simple statistical tests)
Connectivity models. Protein-protein, protein-DNA and protein-small molecule interactions (tests on network properties)
Low resolution models. Models of signalling pathways, metabolic pathways, regulatory pathways, etc. (executable models)
Detailed models. Kinetic models including stoichiometry, balancing reactions, etc. (mathematical models)
The behavior of a functional module can be estimated from the behavior of its components
Transforming gene expression levels into a different metric that accounts for a function. The easiest example of modeling function is signaling pathways, whose function is the transmission of a signal from a receptor to an effector.
Receptors Effectors
Important assumption: collective changes in gene expression within the context of a signaling circuit are proxies of changes in protein activation
Important fact: when the signal reaches the end of a circuit, it triggers a function
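A minimal sketch of how expression values could be turned into a signal intensity at the effector is given below. The propagation rule is an assumption chosen for illustration (node value times the combined activating inputs, damped by inhibitory inputs) and is not necessarily the rule used in the tools presented here. The circuit, the gene names and the values are hypothetical.

```python
def propagate(order, value, activators, inhibitors, receptor_signal=1.0):
    """Propagate a signal through a circuit whose nodes are in topological order.

    value[n]      : normalised expression of node n, between 0 and 1
    activators[n] : upstream nodes that activate n
    inhibitors[n] : upstream nodes that inhibit n
    Returns the signal intensity leaving every node.
    """
    signal = {}
    for node in order:
        acts = activators.get(node, [])
        if acts:
            # Probability-style OR over the activating inputs.
            blocked = 1.0
            for upstream in acts:
                blocked *= 1.0 - signal[upstream]
            incoming = 1.0 - blocked
        else:
            incoming = receptor_signal  # nodes without activators start the signal
        for upstream in inhibitors.get(node, []):
            incoming *= 1.0 - signal[upstream]  # inhibitors damp the signal
        signal[node] = value[node] * incoming
    return signal


# Hypothetical circuit: receptor R activates K, K activates effector E,
# and I inhibits E.
order = ["R", "I", "K", "E"]
value = {"R": 0.8, "I": 0.5, "K": 0.5, "E": 1.0}
activators = {"K": ["R"], "E": ["K"]}
inhibitors = {"E": ["I"]}

effector_signal = propagate(order, value, activators, inhibitors)["E"]
print(round(effector_signal, 3))  # 0.2
```

The value obtained at the effector is the kind of quantity that can then be compared between conditions or related to clinical parameters.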
Signaling activity triggers cell functions directly related to cancer progression
Estimations of the signal intensity received by the effectors that trigger a cancer-related function can be related to clinical parameters, such as survival
Actually, signal activity triggers all the cancer hallmarks
Hanahan, Weinberg, 2011. Hallmarks of cancer: the next generation. Cell 144, 646
Negative regulation of release of cytochrome c from mitochondria (inhibition of apoptosis)
Some interesting features of mechanistic biomarkers derived from models of pathway activity
• Mechanistic biomarkers show high specificity and sensitivity
• Models used for obtaining mechanistic biomarkers can integrate different omics data (e.g. mutations)
• Mechanistic biomarkers can be used in the context of prediction
[Plot axes: Specificity, Sensitivity]
Future prospects: Actionable models
The real advantage of models is that, in the same way that they can be used to convert omics data into measurements of cell functionality that provide information on disease mechanisms and drug MoA, they can be used to test hypotheses such as “what if I suppress (or over-express) this gene?” This leads to the concept of actionable models.
By simulating changes in gene expression/activity, it is easy to carry out:
• Direct studies of the consequences of induced gene over-expressions or KOs
• Reverse studies of the genes that need to be perturbed to change cell functionalities, such as:
  • Restoring the “normal” functional status of a cell
  • Selectively killing diseased cells without affecting normal cells
  • Enhancing or reducing cell functionalities (e.g., apoptosis or proliferation, respectively, to fight cancer)
  • Etc.
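Both kinds of study can be illustrated with a deliberately simple toy model. Everything here is an assumption made for illustration: a linear three-gene circuit, made-up expression values, and an effector signal computed as the product of the node values. A direct study sets one gene to 0 (KO) or 1 (over-expression) and recomputes the signal; a reverse study searches for the perturbation that changes the signal the most.

```python
def circuit_signal(values, chain):
    """Signal reaching the end of a linear circuit: product of the node values."""
    signal = 1.0
    for gene in chain:
        signal *= values[gene]
    return signal


def perturb(values, gene, new_value):
    """Return a copy of the expression values with one gene set to new_value."""
    changed = dict(values)
    changed[gene] = new_value
    return changed


chain = ["RECEPTOR", "RAF1", "EFFECTOR"]  # hypothetical 3-gene circuit
values = {"RECEPTOR": 0.9, "RAF1": 0.6, "EFFECTOR": 0.8}  # made-up values

# Direct study: what happens if RAF1 is knocked out or over-expressed?
baseline = circuit_signal(values, chain)
knockout = circuit_signal(perturb(values, "RAF1", 0.0), chain)
overexpressed = circuit_signal(perturb(values, "RAF1", 1.0), chain)

# Reverse study: which single over-expression raises the signal the most?
gain = {
    gene: circuit_signal(perturb(values, gene, 1.0), chain) - baseline
    for gene in chain
}
best_target = max(gain, key=gain.get)
print(best_target)  # RAF1
```

In this toy circuit RAF1 has the lowest value, so over-expressing it gives the largest gain; in a realistic pathway model the same search would run over many circuits and perturbations.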
Actionable pathway models
[Screenshot annotations: KO in RAF1 gene; drugs that target RAF1; extra targets of the selected drugs; other pathways affected by the KO; specific circuits affected; action button]
http://pathact.babelomics.org/
The use of new algorithms that enable the transformation of genomic
measurements into cell functionality measurements that account for
disease mechanisms and for drug mechanisms of action will ultimately
allow the real transition from today’s empirical medicine to precision
medicine and provide an increasingly personalized medicine
Biomedical platforms need to evolve to provide real support for the transition to precision medicine
[Figure: degree of personalization, from today to tomorrow. Stages: Intuitive Medicine, Empirical Medicine, Precision Medicine. Labels: intuitive; based on trial and error; identification of probabilistic patterns; decisions and actions based on knowledge]
The Computational Genomics Department at the Centro de
Investigación Príncipe Felipe (CIPF), Valencia, Spain, and…
...the INB-ELIXIR, National Institute of Bioinformatics and the BiER (CIBERER Network of Centers for Research in Rare Diseases)