Bachelor's Degree in Computer Engineering
Universidad Politécnica de Madrid
Escuela Técnica Superior de Ingenieros Informáticos
BACHELOR'S THESIS (TRABAJO FIN DE GRADO)
A Graph Mining technique for identifying individuals at risk of genetic diseases in pedigrees
Author: Luciano García Giordano
Supervisor: Sergio Paraíso Medina
Madrid, June 2019
Since the 1970s, many statistics-based models for performing genetic prediction on
individuals have been described, formalized and even implemented. However, their
adoption in clinical practice is not significant. Nowadays, geneticists continue to use
the traditional technique based on Punnett squares, and calculations are still performed
mainly by hand. With the current integration of genetic information into clinical
practice, there is a need for tools that support the exploitation of family-related data
as part of genetic risk assessment. A tool with this purpose would reduce the possibility
of errors in the geneticists' mathematical operations, would allow fast predictions and
simulations, and could facilitate the visualization of the process, all of which tend to
be cumbersome and tedious without the support of a computing environment. In this work,
I propose a technique that aims to improve this situation by automatically providing
predictions for the genotypes and phenotypes of individuals based on their inheritance,
using Graph Mining techniques. To evaluate its results, I implement the method as a
module for genoDraw, a Pedigree Drawing System currently under development at the
Biomedical Informatics Group of the Universidad Politécnica de Madrid in collaboration
with the Genetics and Inheritance Research Group of the 12 de Octubre Hospital, Madrid.
The results show that my technique is adequate in terms of predictions and is able to
provide insight into the genetic dynamics of families, which makes it promising for
future clinical practice.
Keywords: genetic statistics, genetic risk assessment, graph mining, human genetics
1 INTRODUCTION
In this section, I first briefly describe what genetic diseases are and how they are
transmitted. Then, I comment on the use of pedigrees as a tool for Genetic Risk
Assessment and on the computing tools that ease their representation. In this context,
the objective of this work is then laid out: to devise a method capable of estimating the
risk that individuals have a genetic disease, given mostly latent genetic information
about their family and the specific mode of inheritance of the disease in question.
Lastly, the work plan is summarized.
1.1 Genetic diseases and Genetic Risk Assessment
A genetic disease is a disease caused by abnormalities in the genome of a person. Every
individual typically inherits half of their genome from their mother and the other half
from their father, in the form of chromosomes organized in pairs. Humans typically have
22 pairs of chromosomes plus two sex chromosomes, each of which can be of type X or Y.
Gametes, the cells that combine with gametes from the opposite sex to generate offspring,
are ideally formed by one chromosome from each pair and one of the two sex chromosomes.
The resulting offspring thus have 23 chromosomes from the father and 23 from the mother,
adding up to 22 pairs and two sex chromosomes, as in each of their parents. New
mutations, chromosomal crossovers, parental imprinting, and uniparental disomy are
examples of factors that may cause deviations from this pattern. Although these factors
are too important to be dismissed, information about the family context remains of
tremendous importance when analyzing the risk that a given patient is affected by a
given disease.
A field of medicine that takes special advantage of this relationship among individuals
is Precision Medicine, which aims to identify the best approaches to treat or prevent a
disease based on the patient’s characteristics. Since the patient’s family heavily
influences whether they have a genetic disease or can transmit it to their offspring,
the ability to aggregate genetic information has considerable potential for precision
medicine.
For this aggregation of information, pedigrees are a widely-adopted graphical lan-
guage used by medical specialists to collect information about the family of the patient.
In a pedigree, some types of information can be inserted, such as who the individuals of
the family are and what characteristics they present, how and to whom they are related,
and by which diseases each of them is affected. A Pedigree Drawing System (PDS) is
an informatics tool developed especially to help medical practitioners collect, in a com-
puterized environment, the necessary information for diagnosis or analysis to be made.
Ideally, a PDS is capable of facilitating the tasks required to collect, process and visualize
family-related information, thus enabling a broad and encompassing analysis of a family,
contributing to advancement in current precision medicine. Current PDSs can range from
not much more than a canvas in which symbols of a pedigree can be positioned manu-
ally to complex integrated environments that operate on structured family data, generate
pedigrees automatically and link the information included with external resources, such
as biomedical vocabularies for the annotation of diseases.
1.2 Objective
In the context of Pedigree Drawing Systems, genoDraw is an in-house development at the
Biomedical Informatics Group created in collaboration with the Genetics and Inheritance
Research Group of the 12 de Octubre Hospital, Madrid, and designed and implemented
by me. It is an integrated environment that helps automate the creation, management, and
visualization of pedigree diagrams. genoDraw presents some characteristics in this sense
that are promisingly useful for medical specialists in the area of genetics [1], which make
it a tool that can be expected to be adopted in the near future by medical practitioners.
With the major necessities of a complex PDS addressed (i.e. automation, generation
of pedigrees from structured data, etc.), the next step for augmenting its potential
impact in the area of precision medicine is to include tools to facilitate the risk
assessment of patients. In current clinical practice, calculations are performed by hand
to reach conclusions such as that a certain person is at risk of being affected by a
certain genetic disease [2]. These calculations are most of the time very complex (thus
prone to errors) and based on certainties. In many complex family scenarios, no exact
calculations can be made, and only approximate probabilities of an individual being
affected or not can be obtained. With simple techniques, risk assessment in these
situations becomes tedious, very difficult, error-prone and even outright incorrect,
especially since such techniques rely on statistical intuitions and calculations, which
are especially susceptible to human error [3, 4].
In this work, I propose a new method for facilitating the prediction of risks related to
genetic diseases of individuals based on (a) the underlying genotypes associated with the
risks of having a genetic disease, (b) the propagation of information in the family graph,
and (c) the integration with biomedical vocabularies for obtaining information about
specific genetic diseases automatically. The method is based on the prediction of genotype
distributions for individuals in two different ways. Combining the two yields not only a
prediction for each individual but also the uncertainty of that prediction. This
technique will help prioritize certain key individuals in the family. Therefore, genetics
specialists can not only dedicate fewer resources to error-prone mathematical operations,
but also better decide on which individuals to perform further genetic analysis,
potentially reducing diagnosis time and the resources needed.
In this work, I also implement the proposed method as a module for genoDraw and
analyze its results.
1.3 Work Plan
List of Objectives
• Study the currently-existing data sources based on biomedical terminologies that
can provide information on the inheritance mode of diseases caused by genetic factors.
Define, from these data sources, a method for obtaining a link between a genetic trait
and its mode(s) of inheritance.
• Establish adequate methods for modelling the propagation of information in the
family graphs, with the objective of estimating the risk that a certain person is
affected, carrier, or not affected by a certain genetic trait given the phenotypical
information of other individuals to which they are related, directly or indirectly, as
well as to point to the user the persons from whom obtaining information tends to
be most advantageous.
• Implement the methods as a module in genoDraw, thus enabling healthcare pro-
fessionals to observe such risks as an insight towards the current risk condition of
a family, observing important characteristics such as ease of use, adequacy to the
clinical practice and precision of the estimations given. Design, Implementation,
Testing and Deployment are included as parts of this objective.
List of Tasks
1. Search for biomedical terminologies-based solutions through which a link between
genetic traits and their modes of inheritance can be established.
2. Study the literature on algorithmic prediction of genetic traits, searching for existing
methods to predict the risks of an individual to have a certain genetic disease given
its mode of inheritance and given its family graph.
3. Model the propagation of information in the family graph to define a graph-based
algorithm for algorithmic prediction of genetic traits given phenotypical or geno-
typical information of related individuals.
4. Design an implementation of the method as a genoDraw module.
5. Evaluate the resulting module as part of genoDraw, focusing on ease of use, adequacy
to the activity of healthcare professionals and accuracy of the predicted risks.
6. Deploy the module as part of genoDraw.
7. Write the final report.
Gantt diagram
The obtained Gantt diagram is as follows:
2 STATE OF THE ART
This work presents a method for genotype and phenotype prediction based on the genetic
information of related individuals. It is implemented as a module for genoDraw, a web-
based platform for drawing and maintaining pedigree diagrams in a standard-complying
and integrated way. In this section, I describe the foundations on which my work is based.
Initially, the motivation behind this work is introduced, presenting the concept and
purpose of pedigree diagrams in clinical practice. Then, I include an overview of the
task of assessing the risk that an individual is affected by, or a carrier of, a genetic
disease.
The next subsection discusses the existing computing approaches to the problem of
representing pedigree diagrams. Then, an overview of the concept and usefulness of
biomedical vocabularies in the context of this work is included. Lastly, I discuss the
current computing approaches to knowledge extraction and risk assessment on individuals
in a computerized environment.
2.1 Pedigree diagrams
Pedigree diagrams are a graphical language to represent families by representing each of
the individuals of interest and the relations existing among them. The essence of pedigrees
is being able to distinguish the key characteristics observable in a small population in
order to interpret a family. That is, pedigrees are a method of information visualization.
Historically, according to [5], the drawing of pedigrees as a way to represent a family
dates back at least to the 19th century. Pedigrees have been an indispensable tool for
the medical practice of genetics for more than a century [5]. An early pedigree is shown
in Figure 1. It was drawn by the infamous Swiss geneticist and eugenicist Ernst Rüdin,
one of the theorists and evangelists of social Darwinism. He was largely funded by Nazi
Germany and advocated for mass sterilization and outright killing of individuals as a
mechanism to achieve a “better population” [6].
Although they have been used as a means for (incorrectly) justifying the practice of
racial hygiene, pedigree diagrams have a much nobler use in emphasizing important
relationships among individuals who suffer from certain conditions or are at risk of
doing so. Since pedigree drawing spread as a tool in clinical practice, the need for a
uniform way to represent pedigrees, one that covers the necessary situations and
represents precisely the complex relationships among individuals, has become increasingly
evident. Unsurprisingly, standardization efforts arose from this need [7] and culminated
in the Pedigree Standardization Task Force (PSTF) of the National Society of Genetic
Counselors at the end of the last century. The Recommendations for the Standardized
Pedigree Nomenclature [8], developed by the PSTF and initially published in 1995,
establish uniform and unambiguous directives for the drawing of pedigrees.
More recently, the availability of alternative reproductive scenarios (e.g. ovum
donations and surrogate gestations) created a need to represent more complex
relationships among parents and children. In 2008, an update to the Standardized Human
Pedigree Nomenclature was published [10]. This update introduces some extensions to the
previous version [8], enabling most major reproductive scenarios to be clearly
represented in pedigree diagrams. In Figure 2, a situation that could not be represented
in the previous version is drawn following the updated version.
Currently, pedigree diagrams are widely used as a tool to collect and interpret genetic
characteristics in families. Medical practitioners, especially those in the genetics
area, use pedigree diagrams in their day-to-day clinical activities. A rather
comprehensive list of functions enabled or facilitated by the use of pedigrees during
medical encounters and other activities is found in [2]. It ranges from “Making a medical
diagnosis” and “Calculating disease risks” to seemingly unimportant but actually
essential activities, such as “Educating the patient” and “Exploring the patient’s
understanding”.
Figure 1: A Sippschaftstafel drawn by Ernst Rüdin in 1910. (from Mazumdar, 1992 [9]
and Resta, 1993 [5])
In conclusion, pedigree diagrams are a relevant tool for medical practitioners to per-
form part of their activities, ranging from collecting information from the patient and their
family to reaching a conclusion and helping form the patient’s understanding about the
risks and facts observed.
2.2 Risk assessment of genetic diseases
It is widely known that the phenotype of a person, that is, whether or not they present
certain characteristics, is affected mainly by two factors: environment and genotype.
The study of the environment in which a person lives is sometimes enough to explain
their condition. For example, a mine worker is much more likely to develop lung diseases
such as emphysema or pneumoconiosis than a control urban population [11]. On the other
hand, many other disorders are determined almost exclusively by the contents of the
genome of the individual. For example, having a certain mutation in the APC gene is
almost certain to cause the individual in question to develop familial adenomatous
polyposis by the age of 40 [2].
Figure 2: This pedigree complies with the 2008 update of the Standardized Human Pedi-
gree Nomenclature, but not with the 1995 version.
The unfamiliar reader might assume that diseases with a clear genetic component are very
simple to understand. However, not only do different diseases follow potentially
different inheritance patterns, some of them even have almost unknown or unpredictable
patterns. One example is Alzheimer’s disease. According to [12], not only are there
many genes with multiple possible mutations involved, but there is also more than one
possible inheritance pattern for the same disease.
In terms of inheritance patterns often observed in genetic diseases, the most common
ones are the following [2]:
• Autosomal Dominant (AD): if one gene is mutated, it is enough to cause the disor-
der
• Autosomal Recessive (AR): for the same locus in the two chromosomes of the same
pair, if both genes are equally mutated, the disorder is observed
• X-linked Recessive (XLR): women possess two X chromosomes, while men only
carry one. In X-linked recessive disorders, a man who has a mutation in a certain
gene of his X chromosome will be affected by the disorder, while a woman with
one X chromosome containing the mutation will most likely present mild symptoms
or none at all; both of her X chromosomes must carry the mutation for her to be
affected by the disorder.
• X-linked Dominant (XLD): contrary to X-linked recessive disorders, in XLD
disorders both women and men are affected if they carry a mutated gene in an X
chromosome (only one of the two for women). However, due to lyonization, also
called X-inactivation, women tend to present milder symptoms, while the effects in
men are usually lethal.
• Y-linked (YL): while men carry one Y chromosome, women carry none. For this
reason, only men can be affected by, or be carriers of, YL diseases. Additionally,
having the mutated gene in this chromosome is usually enough to show the phenotype
corresponding to the related disease.
• Chromosomal inheritance (C): a chromosomal disease is one caused by the presence
of a disruptive added segment or by the absence of an important segment of a
chromosome, which can range from a few bases to a whole chromosome.
• Multifactorial and Polygenic (MP): there are some disorders which are caused by
alterations on many loci at once. Since they are very difficult to observe in families,
they are usually grouped in one category.
• Mitochondrial (M): the mitochondrion is an organelle present in the cells of most
eukaryotic organisms that processes some compounds to generate, among other
compounds, ATP. ATP is a molecule that stores energy and is used in many bio-
logical phenomena inside the cell as energy source. However, given the nature of
the fertilization in humans (and many other species), mitochondria are only inher-
ited from biological mother to child. Furthermore, mitochondria possess their own
DNA, the mitochondrial DNA. For that reason, mutations in this DNA that cause
defects in the mitochondria of a woman are passed on to her offspring. If this off-
spring is a male, the defect is not transmitted to his children, but if it is a female,
then the inheritance is assured.
Although the previous list is by no means complete (nor does it intend to be), we
can easily observe that the inheritance patterns of genetic diseases range from the
simple “inherit one mutated gene and be affected” to patterns whose underlying
mechanism is still poorly understood in the literature.
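To make the two autosomal patterns from the list concrete, the following is a minimal sketch of the classical Punnett-square calculation, not the method proposed in this work. The genotype notation ("A" for the normal allele, "a" for the mutated one) and all function names are assumptions introduced here for illustration, and full penetrance is assumed.

```python
from fractions import Fraction
from itertools import product


def offspring_distribution(mother: str, father: str) -> dict[str, Fraction]:
    """Punnett square: each parent passes one of its two alleles with probability 1/2."""
    dist: dict[str, Fraction] = {}
    for m_allele, f_allele in product(mother, father):
        # Sort so that "aA" and "Aa" are counted as the same genotype.
        genotype = "".join(sorted(m_allele + f_allele))
        dist[genotype] = dist.get(genotype, Fraction(0)) + Fraction(1, 4)
    return dist


def is_affected(genotype: str, mode: str) -> bool:
    """Phenotype under full penetrance for the two autosomal modes."""
    mutated = genotype.count("a")
    if mode == "AD":  # one mutated copy is enough
        return mutated >= 1
    if mode == "AR":  # both copies must be mutated
        return mutated == 2
    raise ValueError(f"unsupported mode of inheritance: {mode}")


def risk(mother: str, father: str, mode: str) -> Fraction:
    """Probability that a child of the two given genotypes is affected."""
    dist = offspring_distribution(mother, father)
    return sum((p for g, p in dist.items() if is_affected(g, mode)), Fraction(0))


if __name__ == "__main__":
    print(risk("Aa", "Aa", "AR"))  # two carrier parents, recessive disorder
    print(risk("Aa", "AA", "AD"))  # one affected parent, dominant disorder
```

This only covers the case in which both parental genotypes are known with certainty; the sex-linked, mitochondrial and multifactorial patterns would need different transmission rules.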
Additionally, sometimes an individual whose genotype corresponds to having the disease
does not show the corresponding phenotype. This is known as incomplete penetrance. The
penetrance is the probability that an individual with the genomic conditions necessary
for a disease actually presents the corresponding phenotype.
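As a small illustration of this definition, the sketch below scales the probability of carrying an at-risk genotype by the penetrance. The function name and parameters are hypothetical, and the sketch assumes that individuals without an at-risk genotype never show the phenotype.

```python
def probability_affected(
    genotype_probs: dict[str, float],
    at_risk_genotypes: set[str],
    penetrance: float,
) -> float:
    """P(phenotype) = penetrance * P(genotype is one that can cause the disease)."""
    if not 0.0 <= penetrance <= 1.0:
        raise ValueError("penetrance must be a probability between 0 and 1")
    p_at_risk = sum(p for g, p in genotype_probs.items() if g in at_risk_genotypes)
    return penetrance * p_at_risk


if __name__ == "__main__":
    # Child of two carriers of a recessive disorder, with an assumed penetrance of 0.8.
    child = {"AA": 0.25, "Aa": 0.5, "aa": 0.25}
    print(probability_affected(child, {"aa"}, penetrance=0.8))
```

With a penetrance of 1 the function reduces to the plain genotype probability, which is the fully penetrant case.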
In clinical practice, genetics specialists must not only collect the information
presented by the patient and their family, but also predict, given the specific
inheritance modes in which each of the diseases may be expressed, who the carrier
individuals are, who may be affected or come to be affected, and on which individuals
more specific genetic tests must be performed. According to [2], it is important not
only to register who the affected individuals are, but also to take note of those who
are unaffected.
Given the multiple scenarios and the complexity of the inheritance schemes for each
of them, a visualization method that helps the user to observe which of them is taking
place for a certain disease in a family is necessary. One of the methods widely used for
this task is pedigree diagrams [2].
2.3 Pedigree diagram drawing systems
A Pedigree Drawing System (PDS) is an informatics tool that enables a user to create,
manage and visualize pedigrees. Many PDSs currently exist. However, none of them
is widely adopted in clinical practice. Some characteristics identified in [1] can be key
to making a PDS useful for genetics specialists. The first is the ability to generate
pedigree diagrams interactively during medical encounters and to make dynamic changes
to the pedigree. For that, the medical practitioner must be deeply familiar with the
tool, so the easier the tool is for an expert user to master, the better. The second
characteristic is the capability of the system to generate pedigrees from structured
data. This is important because a pedigree can then be generated directly from the
information about individuals and their relations, which can either be entered by the
user or retrieved from a medical information system. The third characteristic is the
automation of the generation process. That is, the components of a pedigree are
automatically generated and placed on the screen, with minimal interaction from the
user. This helps the user focus on the interaction with the patient rather than on the
interaction with the system. The fourth characteristic, identified as very important for
the adoption of any PDS in daily clinical practice, is the generation of
standard-complying pedigrees. In this work, we consider only the Nomenclature presented
by the National Society of Genetic Counselors, since it is the only reported
Recommendation and it is widely adopted. The fifth characteristic is that the
information inserted into the pedigree is expressed as concepts of current standard
biomedical vocabularies. This could enable the integration of pedigrees with current
medical information systems, allowing the PDS to receive information in standard
formats, with standard concepts referring to each entity, and to use it internally as a
means of obtaining information or making changes to a pedigree. Ideally, since a
pedigree is a visualization technique, any changes made to it (except those exclusively
related to the placement of entities) should be directly reflected in the medical
history of the individuals involved.
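To illustrate what "structured data" could look like for the second, third and fifth characteristics, here is a minimal sketch of a family record. It is not genoDraw's data model: the `Individual` class, its fields and the helper functions are hypothetical, and conditions are stored as vocabulary codes only to show the idea.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Individual:
    id: str
    sex: str  # "F", "M" or "U" (unknown)
    father: Optional[str] = None  # id of the father, if recorded
    mother: Optional[str] = None  # id of the mother, if recorded
    conditions: set[str] = field(default_factory=set)  # codes from a vocabulary


def children_of(family: dict[str, Individual], parent_id: str) -> list[str]:
    """Ids of all individuals that list parent_id as father or mother."""
    return sorted(i.id for i in family.values() if parent_id in (i.father, i.mother))


def generation(family: dict[str, Individual], person_id: str) -> int:
    """Founders are generation 0; a child sits one row below its deepest parent."""
    person = family[person_id]
    parents = [p for p in (person.father, person.mother) if p is not None]
    if not parents:
        return 0
    return 1 + max(generation(family, p) for p in parents)


if __name__ == "__main__":
    family = {
        "A": Individual("A", "M"),
        "B": Individual("B", "F"),
        "C": Individual("C", "F", father="A", mother="B"),
    }
    print(children_of(family, "A"), generation(family, "C"))
```

A generation index of this kind is one simple input an automatic layout could use to assign rows; real drawing systems need considerably more (partnerships, twins, donors and so on).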
Most of the currently available PDSs have similar characteristics. The most relevant
are Madeline 2.0 [13], My Family Health Portrait (MFHP) [14], Progeny [15], GenoPro
[16] and CRA Health [17]. Although Madeline 2.0 is a self-hosted server and MFHP is
an online tool, both are only useful for exporting pedigrees as images from a
description. Neither of them enables the user to make changes to the exported
pedigrees, since they are images. Also, both receive the data to draw the pedigree as
structured data and export a pedigree without any kind of interaction with the user.
The whole pedigree is produced at once, with no possibility to visualize the insertion
process or to make changes and corrections to a possibly badly positioned pedigree. In
Madeline, the input is a text file with descriptions of the individuals. In MFHP, the
user must enter all the data by hand using forms. The output of both is a picture of a
pedigree generated with relative positioning rules.
Progeny is a commercial tool that includes a PDS. As a platform, it aims to include
every possible aspect related to the diagnosis of genetic diseases in one tool.
However, as a PDS, it offers only a limited set of features. GenoPro is very similar to
Progeny as a pedigree drawing system. First, neither generates pedigree diagrams
automatically from structured data; the user must insert the data by hand using menus
and textual data. Second, although the user can interact with the layout of the
pedigree and make changes and corrections, the representation is easily violated. That
is, a modification of the layout made by the user can render the representation wrong.
Like Madeline and MFHP, Progeny and GenoPro comply only partly with the previous
version of the Pedigree Nomenclature.
From the presented selection of PDSs, CRA Health contains one of the most advanced
engines. Reported in [18], it is the first to allow automatic creation of diagrams directly
from structured data while considering usability and interactivity aspects. It makes use of
optimization techniques to position the nodes in a canvas, and the user can make changes
to the graph as desired. However, the data model is in itself limited. In Section 3.3.1, I
discuss these limitations to emphasize the necessity of a broader data model for the correct
representation of pedigrees.
2.4 Biomedical vocabularies
In the field of biomedical informatics, one important concept is that of a vocabulary.
In this context, as well as in other fields, a vocabulary is essentially a data
structure containing concepts that can be referred to in a homogeneous manner. In
computer vocabularies, each entry is usually assigned a code, together with a textual
explanation or its usual names.
In terms of utility, vocabularies tend to be used as a way to establish a common
reference to each of their entries. For example, multiple people might refer to the
same idea using different words or expressions. When a vocabulary is used, the
different ways to express the same idea can be unified in the same concept. That way,
every reference made to the same object is done through the same code.
Vocabularies can range from simple lists of concepts and their meanings, also called
glossaries, to complex structures usually called ontologies. The different subtypes of
vocabularies are distinguished by their expressive power about the field of knowledge
they are intended to represent. A simple list with names for reference and a sufficient
explanation for each concept is a glossary. Concepts can be subtypes of other concepts;
a vocabulary that contains such hierarchical relationships is usually called a taxonomy.
Additionally, non-hierarchical relations can be added, as can properties of concepts.
When both are present, the vocabulary is adequately called a thesaurus [19].
On the far end of the aforementioned spectrum are the ontologies. As well as con-
cepts, their meanings, relationships and properties, ontologies also contain vast informa-
tion about the behavior of the concepts included. They are expressed through the use of
axioms, restrictions of many types and descriptions of the relations among concepts.
In biomedical informatics, a very important vocabulary is the Unified Medical Lan-
guage System (UMLS) [20]. UMLS is a metathesaurus, in the sense that it is a thesaurus
that contains many different vocabularies and unifies the common concepts present.
In UMLS, the information is contained in sections (easily translatable to database
tables). One of the most relevant for this work is the one that contains the concepts
themselves, MRCONSO. Since many expressions in the medical literature might refer to
the same concept, in MRCONSO, as in the vocabularies on which it is based, each unit of
information is a text entry of a concept from a vocabulary. One or more text entries
(strings) are grouped into a term, which is a concept in the vocabulary of origin. One
or more terms are then grouped into a UMLS concept. That way, UMLS records which terms
of the vocabularies are the same concept in a global sense. That is, if concept 1 from
vocabulary A is attached to the same UMLS concept as concept 2 from vocabulary B, we
can say that they are equivalent: they refer to the same idea in medicine.
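The string, term and concept grouping can be sketched with a simplified in-memory table. This is not the actual MRCONSO file layout: each row below is a hypothetical (UMLS concept, source vocabulary, source code, string) tuple, the values come from the Abetalipoproteinemia example discussed in this section, and the assignment of each string to a particular source vocabulary is an assumption made for illustration.

```python
from collections import defaultdict

# Simplified rows: (UMLS concept id, source vocabulary, source code, string).
ROWS = [
    ("C0000744", "OMIM", "200100", "ABETALIPOPROTEINEMIA"),
    ("C0000744", "OMIM", "200100", "BASSEN-KORNZWEIG SYNDROME"),
    ("C0000744", "SNOMEDCT", "8312300", "Abetalipoproteinaemia"),
]


def build_index(rows):
    """Map every string to its concept, and every concept to its source terms."""
    concept_by_string: dict[str, str] = {}
    terms_by_concept: dict[str, set[tuple[str, str]]] = defaultdict(set)
    for concept, source, code, string in rows:
        concept_by_string[string.casefold()] = concept
        terms_by_concept[concept].add((source, code))
    return concept_by_string, terms_by_concept


def equivalent(concept_by_string: dict[str, str], a: str, b: str) -> bool:
    """Two expressions are equivalent if they resolve to the same concept."""
    concept_a = concept_by_string.get(a.casefold())
    concept_b = concept_by_string.get(b.casefold())
    return concept_a is not None and concept_a == concept_b


if __name__ == "__main__":
    by_string, by_concept = build_index(ROWS)
    print(equivalent(by_string, "Bassen-Kornzweig syndrome", "abetalipoproteinaemia"))
    print(sorted(by_concept["C0000744"]))
```

Case-insensitive matching is a simplification chosen here; it is not a statement about how UMLS normalizes strings.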
Another section of UMLS which is relevant for this work is MRREL. It contains the
relations among concepts. Each entry of this file is a link of a certain type between
two concepts. Although these relations are extracted from the vocabularies used, they
are always between two UMLS concepts. As depicted in Figure 3, Abetalipoproteinemia,
an autosomal recessive disorder, can be referred to as “ABETALIPOPROTEINEMIA”,
“BASSEN-KORNZWEIG SYNDROME” or “Abetalipoproteinaemia”. These are some of the names
used as STR (string) in OMIM for the term with code 200100 and in SNOMED CT with code
8312300. The three terms are linked in the UMLS concept with code C0000744. Through
OMIM, this concept has an inheritance-type relationship with the UMLS concept C0441748,
which corresponds to being an autosomal recessive disorder. In OMIM, the code
MTHU000016 corresponds to this concept. In HPO and in SNOMED CT, the codes HP:0000006
and 258211005 are used.
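Following a relation from a disease concept to its mode of inheritance can be sketched as a lookup over (source, relation, target) triples. The triple layout, the relation label `has_inheritance_type` and the mapping from concept codes to the abbreviations used earlier are assumptions for illustration, not the real MRREL column layout; the two concept codes are the ones given in the text.

```python
# Simplified relation triples: (source concept, relation label, target concept).
RELATIONS = [
    ("C0000744", "has_inheritance_type", "C0441748"),
]

# Hypothetical mapping from UMLS concepts to the abbreviations used in this work.
MODE_BY_CONCEPT = {
    "C0441748": "AR",  # autosomal recessive inheritance
}


def inheritance_modes(disease_concept: str, relations=RELATIONS) -> list[str]:
    """Modes of inheritance linked to a disease concept, as abbreviations."""
    modes = {
        MODE_BY_CONCEPT[target]
        for source, label, target in relations
        if source == disease_concept
        and label == "has_inheritance_type"
        and target in MODE_BY_CONCEPT
    }
    return sorted(modes)


if __name__ == "__main__":
    print(inheritance_modes("C0000744"))  # Abetalipoproteinemia
```

Returning a list rather than a single value leaves room for diseases that have more than one recorded inheritance pattern.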
Many vocabularies are integrated into UMLS. Some of them are SNOMED CT [21], the
Online Mendelian Inheritance in Man (OMIM) [22], and the Human Phenotype Ontology
(HPO) [23].
Figure 3: In UMLS, many different terms from different vocabularies can correspond to
the same “idea”. If that is the case, one UMLS concept encompasses all of them.
Meanwhile, UMLS concepts are interrelated. In the case of this figure, the concept
C0441748 (autosomal recessive inheritance) is the type of inheritance of the concept
C0000744 (a disorder called “Abetalipoproteinaemia”)
Although currently-existing biomedical vocabularies present many characteristics be-
yond the principles here discussed, what is mentioned in this section is sufficient for the
scope of this work.
2.5 Knowledge extraction from family data
The extraction of knowledge from family data is a field that might come to be of great help
to medical professionals, in contexts that can range from individual healthcare to public
health. As in many other types of data, the nature of family-related genetic information is
largely characterized by its uncertainty and by the fact that most individuals do not have
any of their genetic characteristics registered anywhere, and thus have to be considered as
latent data.
Nonetheless, the analysis of such data is enabled because of the connectivity among
individuals. Since the vast majority of humans are “created” from one father and one
mother via fertilization, each individual’s genetic information is deeply connected with
two individuals. We might not know anything about the three of them, but they are certain
to share some genetic characteristics that can reveal important features about the rest of
the individuals in their family (since they are carriers of chromosomes, which contain
genetic information). In Figure 4, for example, although no information is known for the
individuals A, B, and C, they are an important part of the explanation of why individual
F is affected by the disease represented (affected individuals are drawn with a colored
symbol), assuming an autosomal recessive disease. This is because individuals E and D are
also affected, so F most likely received the disease-causing allele from both parents.
That is, C is at least a carrier of this allele, which was most likely inherited from A or B.
As we can see, the mere information that C has a brother who is affected by the disease is
an explanation of why her daughter is affected, even though we do not have any additional
information about C’s parents.
Figure 4: The lack of information on A, B and C about being at least carriers of an
autosomal recessive trait is important to understand why F is affected. However, the
structure of the family makes this information evident.
According to [2], genetic data based on the information of the family is inevitably
uncertain. Many factors that come into play are sometimes not registered or not even
discovered by the genetic counselor. When no information about a certain disease is given
for an individual, for example, they may be unaffected, affected but not properly
observed, or not yet affected. They might also never become affected while carrying
at least one mutated copy of a disease-causing gene. Observations can thus be imprecise,
out-of-date and even outright false.
Additionally, meiosis, the process that generates the genome of every individual, is
perfectly random in a simplistic approach, but proves much more complex when the
additional factors that come into play are considered. Similarly, mutations
cause changes in the genome and are also seemingly random processes. How a system
that extracts information from family data considers the randomness of the generation
of the genome is, therefore, a critical factor for its precision. However, there is a severe
limitation on this precision, since many biological processes that influence the creation
and expression of the genome are not yet fully understood, such as parental
imprinting and even aspects of some genetic diseases [2].
However, as discussed previously, the inheritance among individuals is not to be dis-
missed as an important source of information to help detect the root causes of most genetic
diseases and to help increase the precision of the diagnosis of patients.
2.6 Likelihood estimation in pedigree data
One objective of the extraction of information from family data is the prediction of the
affection statuses of individuals in a family. This is useful, for instance, when
a rare genetic disease is observed in both families of a couple. If the objective is that
their children neither carry nor are affected by the disease, an analysis may be useful to
determine whether they are at risk of giving birth to an affected child.
Currently, genetics specialists more often than not do the corresponding risk calculations
by hand. As mentioned in the Introduction of this work, this process is prone to
errors and can produce incorrect results. Their calculations correspond to simple methods
that resemble the calculation of the Maximum Likelihood Estimation (MLE) of the genotypes
of individuals. Therefore, an adequate calculation of the MLE for the individuals in
a family should be of utility to specialists.
In past works, approximately from 1970 to 1995, this approach was extensively explored,
reaching complex algorithms. In 1970, [24] proposes an algorithm to calculate likelihoods
in pedigrees. To calculate a likelihood means to obtain the statistical functions that
describe the information contained in the pedigree. The likelihood formula proposed by the
work mentioned is exact and is based on existing data. Thus, no extrapolation is made. Eight
years later, [25] extends the calculation to complex pedigrees (those which do not assume
statistical independence among parents of children). Furthermore, the authors also
propose that individuals are represented as triples (σ(A), e(A), φ(A)). For an individual
A, σ corresponds to their unique ousiotype, which is a set of fixed characteristics that
includes their genotype; e is a set of personal characteristics, such as age, gender, and
whether A smokes; and φ corresponds to A’s phenotype. The work also models the penetrance
of a characteristic as the probability that an individual has a certain phenotype given
their genotype. One aspect to notice is that such penetrance is calculated among all the
possibilities of phenotypes and genotypes. The calculations presented in [25] are also
exact, and are based on recursive calculations of transmission and penetrance.
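As a minimal sketch of how founder priors, transmission and penetrance combine into a pedigree likelihood, the following Python snippet enumerates all genotype assignments for a father, a mother and one child at a single biallelic locus. The Hardy-Weinberg founder prior, the fully penetrant recessive model and every function name are illustrative assumptions; this is not the formulation of [24] or [25], which avoid brute-force enumeration through recursion.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")

def founder_prior(g, q):
    """Hardy-Weinberg prior for a founder; q is the assumed frequency of allele 'a'."""
    p = 1.0 - q
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}[g]

def transmit_a(g):
    """Probability that a parent with genotype g transmits allele 'a'."""
    return {"AA": 0.0, "Aa": 0.5, "aa": 1.0}[g]

def transmission(child, father, mother):
    """P(child genotype | parental genotypes) under Mendelian segregation."""
    f, m = transmit_a(father), transmit_a(mother)
    return {"AA": (1 - f) * (1 - m),
            "Aa": f * (1 - m) + (1 - f) * m,
            "aa": f * m}[child]

def penetrance(phenotype, g):
    """P(phenotype | genotype) for a fully penetrant recessive trait (assumed model).

    phenotype is True (affected), False (unaffected) or None (not observed).
    """
    if phenotype is None:
        return 1.0  # an unobserved individual contributes no information
    return 1.0 if (g == "aa") == phenotype else 0.0

def trio_likelihood(father_ph, mother_ph, child_ph, q):
    """Sum over all genotype assignments of prior x transmission x penetrance."""
    total = 0.0
    for gf, gm, gc in product(GENOTYPES, repeat=3):
        total += (founder_prior(gf, q) * penetrance(father_ph, gf)
                  * founder_prior(gm, q) * penetrance(mother_ph, gm)
                  * transmission(gc, gf, gm) * penetrance(child_ph, gc))
    return total
```

With both parents unobserved and an affected child, the sum reduces to the population frequency of the aa genotype (q squared), which is a quick sanity check for the three factors.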
In 1979, the concept of Maximum Likelihood Estimation is brought to the analysis
of pedigree data [26]. This is done via mixed models and gene counting. In gene counting,
the probability that a child has a certain genotype is calculated directly from their
parents, in an exact manner. A later work [27] extends the approaches of the then
state of the art by proposing an algorithm for the exclusion of genotypes. Although the
method is notably overconfident by nature, it introduces the possibility of omitting the
whole analysis for some genotypes or phenotypes of people given their own status and
that of their surrounding family.
During the 1980s, few advances are made. Nonetheless, some unorthodox approaches
are developed during the decade. From one perspective, the analysis of identity
by descent is first approached [28]. Simply put, it is the highly detailed analysis of the
segregation probabilities that occur during the meiosis process (gamete generation), with or
without recombinations and uniparental disomies. These probabilities identify which
individuals are likely to have inherited two copies of the exact same chromosome, one from
the father and one from the mother (that is, a direct measure of consanguinity). Another
unconventional approach is the one presented by [29], in which image processing ideas are
applied to the calculation of likelihoods in pedigrees.
From the 1990s onwards, less innovative statistical approaches are proposed. A project
proposal from 1990 [30] establishes some basis for later works [31, 32, 33, 34,
35]. After a sequence of similar works, [33] formalizes a simple Monte Carlo estimation
method. It uses the Expectation Maximization algorithm and a Gibbs sampler in order
to reach a model that handles random and fixed components of heritable and nonheritable
characteristics to calculate predictions for the Maximum Likelihood of genotypes.
Random and fixed components, in this sense, refer to elements that come into play when
deciding the composition of the genome of the individual in question. The fixed components
are, for example, the factors that underlie the segregation of chromosomes during meiosis,
while random components include mutations and chromosomal recombinations, as
well as uniparental disomies.
More recently, despite parallel advancements in genome sequencing and other -omics fields, works such as [36] and [37] do not present any actual novelty in the field. In the
era of Big Data and almost unlimited computing power, possibilities such as machine
learning approaches have not been explored yet for genotype distribution prediction in
pedigree data.
2.7 Numerical Computing and Machine Learning approaches for information extraction from graph-based data
In this section, a brief collection of numerical computing and machine learning approaches
is presented. Such approaches are some of the most promising for the characteristics of
the data included in pedigrees, which are organized in a graph-like structure and contain
mostly categorical data, although continuous and discrete numerical data also exist.
Also regarding the structure of data, in the family graph, the relationship between a
parent and a child exists in only one direction. No child is a parent of their parent, and
this is guaranteed by temporal reasons. That means that a family can be understood as a
directed acyclic graph (more specifically as a semantic network, as will be discussed in
Section 3.3.1). Therefore, nodes of this graph can be considered as individuals that have
their own genotypes. Directed edges are parenthood relationships (also interpretable as
is child of). From every node, a maximum of two parenthood edges can originate.
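The two structural properties stated above (no cycles, at most two parenthood edges per individual) can be checked mechanically. The sketch below assumes a hypothetical input format, a dictionary mapping each child to a list of parents, and uses Kahn's topological sort; neither the format nor the function name comes from this work.

```python
from collections import deque

def validate_family(parents_of):
    """Check a family graph given as {child: [parents]} (assumed format).

    Raises ValueError if an individual has more than two parents or if the
    'is child of' edges contain a cycle. Returns the individuals in an order
    in which every parent precedes their children.
    """
    nodes = set(parents_of)
    for parents in parents_of.values():
        nodes.update(parents)
    for child, parents in parents_of.items():
        if len(parents) > 2:
            raise ValueError(f"{child} has more than two parents")

    # Kahn's algorithm on parent -> child edges.
    children_of = {n: [] for n in nodes}
    pending = {n: 0 for n in nodes}  # number of parents not yet processed
    for child, parents in parents_of.items():
        for parent in parents:
            children_of[parent].append(child)
            pending[child] += 1

    queue = deque(sorted(n for n in nodes if pending[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in sorted(children_of[node]):
            pending[child] -= 1
            if pending[child] == 0:
                queue.append(child)

    if len(order) != len(nodes):
        raise ValueError("cycle detected: someone is their own ancestor")
    return order
```

The returned order is convenient for any top-down propagation, since founders come first and each child is visited only after both parents.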
Given this graph-like nature and the randomness that characterizes the meiosis and
fertilization from which each individual originates, some algorithms can be considered as
candidates for the extraction of information from a pedigree.
Expectation Maximization algorithm The first algorithm that can be considered is the
previously mentioned and extensively explored EM algorithm. Since it is a statistical
algorithm, many formulations derived from its essence can be made. It is based on two
steps (hence its name): the first is the Expectation step, which is the calculation of
expected values for latent data from the data we have and the current parameters, while
the second is the Maximization step, in which the parameters are re-estimated so as to
improve the Expectation in the next iteration of the algorithm. Sometimes, however, the
M step can be reduced to the reestablishment of some parameters, in the sense that no
actual optimization is made. The convergence in those cases is a consequence of navigating
the parameters without any driving factors. A characteristic feature of the EM algorithm is
its greedy behavior: no alternative options for maximization are considered, only one
solution is calculated from the input, and the likelihood, in the best-case scenario,
increases monotonically towards a local maximum.
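To make the two steps concrete, the following sketch applies EM to a textbook problem rather than to the model of this work: estimating the frequency q of a recessive allele when only homozygous (aa) individuals are observably affected and carriers among the unaffected are latent. Hardy-Weinberg proportions, the sample counts and the function name are assumptions made for illustration.

```python
def em_allele_frequency(n_affected, n_total, q0=0.5, iterations=500):
    """Estimate the frequency q of a recessive allele 'a' by EM.

    Only aa individuals are observed as affected; the split of unaffected
    individuals into AA and Aa is latent. Assumes Hardy-Weinberg proportions.
    """
    n_unaffected = n_total - n_affected
    if n_unaffected == 0:
        return 1.0  # everyone is aa, so there is nothing latent to estimate
    q = q0
    for _ in range(iterations):
        p = 1.0 - q
        # E step: expected number of carriers (Aa) among unaffected individuals,
        # given the current estimate of q.
        carrier_share = (2 * p * q) / (p * p + 2 * p * q)
        expected_carriers = n_unaffected * carrier_share
        # M step: re-estimate q by counting 'a' alleles
        # (two per affected individual, one per expected carrier).
        q = (2 * n_affected + expected_carriers) / (2 * n_total)
    return q
```

In this particular problem the fixed point satisfies q squared = n_affected / n_total, so the iteration can be checked against the closed-form estimate.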
Hidden Markov Models The second formulation can be that the problem is a Hidden
Markov Model (HMM). It is a simplification of other Markov process-based models. In
an HMM, previous information is considered as a causal factor of later information.
However, only one causal factor can be used for this calculation. Therefore, the
prediction is limited to a one-way propagation across many generations. Latent data can
have a tremendous impact on prediction, and many lines of descent must be analyzed
before a prediction can be made.
Rule-based models A third formulation is to use rule-based models to make predictions
of the genotypes of individuals given their parents and their children. A rule-based model
is one that either searches for an explanation for a phenomenon or extrapolates
a given scenario in search of a prediction for an unknown scenario. For the searches,
rule-based models possess sets of rules that are activated in certain situations.
A rule-based model could, for example, activate a rule that concludes that the individual
A from Figure 4 does not have a genotype that only contains the dominant allele for the
disease in question because D, his son, is affected by it.
However, once again, the existence of mostly latent data in pedigrees renders the
widespread use of this kind of model impractical. Pure rule-based approaches can only
define constraints on the data, and constraints alone are not enough to enable prediction
on pedigrees.
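The limitation described above can be seen in a small sketch. The three rules below are hypothetical and assume a fully penetrant autosomal recessive trait; they only narrow the set of admissible genotypes and say nothing about how likely each remaining genotype is.

```python
GENOTYPES = frozenset({"AA", "Aa", "aa"})

def possible_genotypes(affected, children_affected):
    """Genotypes compatible with simple exclusion rules (assumed model:
    fully penetrant autosomal recessive trait, 'a' is the disease allele).

    affected: True, False or None (unknown) for the individual.
    children_affected: list with True, False or None for each child.
    """
    candidates = set(GENOTYPES)
    if affected is True:
        candidates &= {"aa"}   # rule 1: an affected individual is homozygous
    elif affected is False:
        candidates -= {"aa"}   # rule 2: an unaffected individual is not aa
    if any(child is True for child in children_affected):
        candidates -= {"AA"}   # rule 3: an affected child received 'a' from each parent
    return candidates
```

For an individual with unknown status and one affected child, such as A in Figure 4, the rules leave both Aa and aa as candidates without ranking them, which is why constraints need to be complemented by probabilities.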
Graphical models Graphical models are models devised under the formalism of a structured
probabilistic model, as are HMMs and Markov chain-based methods. These models can
facilitate the prediction with latent data when the situation to be predicted can be
modeled as a graph. As listed in [38], two groups of models can be identified: the directed
models, also known as Bayesian networks or belief networks, and the undirected models,
also known as Markov random fields or Markov networks.
In the case of Bayesian networks, each node of the graph is a variable in the model,
while a directed edge means that the variable at the destination of the edge is conditioned
on the variable at its origin. Currently, BNs with binary variables are being used in fields
in which a graph-like network of relationships among variables can be obtained.
From another perspective, undirected models do not consider the edges of the graph
as one-way conditionals. Instead, each edge has the meaning of interdependence between
two variables. One practical algorithm derived from this formalization is the Boltzmann
Machine. A Boltzmann Machine is characterized by an undirected graph whose nodes are
binary variables. It is trained as an unsupervised algorithm and can be used for supervised,
unsupervised and generative tasks. Both groups of models can be used interchangeably
[38].
However, although graphical models have the potential to be extremely useful in the
analysis of pedigrees, current implementations lack support for categorical data types,
which are essential for this work.
Graph embedding-based models One of the obstacles posed by information organized
in graph structures is the interpretation of the connections themselves. Traditional and
modern machine learning tools are capable of learning from information that can be
organized in tensors. As is widely known, the connectivity matrices of graphs can also
be expressed as tensors. However, graphs tend to have an arbitrary number of nodes, and
these connectivity matrices are usually not only of variable size, but they also tend to be
very sparse, since the nodes in these graphs are usually very loosely connected (a node
connects to very few other nodes). In the specific case of pedigrees, each person is only
connected to their parents and children.
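As a rough illustration of this sparsity, the sketch below builds the symmetric parent-child connectivity matrix of a small hypothetical pedigree and measures the fraction of nonzero entries. The input format (a dictionary from child to parents) and the function name are assumptions for this example.

```python
def adjacency_density(parents_of):
    """Return (number of nodes, fraction of nonzero entries) of the symmetric
    parent-child connectivity matrix of a pedigree given as {child: [parents]}."""
    nodes = sorted(set(parents_of) | {p for ps in parents_of.values() for p in ps})
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for child, parents in parents_of.items():
        for parent in parents:
            matrix[index[child]][index[parent]] = 1
            matrix[index[parent]][index[child]] = 1
    nonzero = sum(map(sum, matrix))
    return n, nonzero / (n * n)
```

Because each individual contributes at most two parenthood edges, a pedigree with n individuals has at most 4n nonzero entries out of n squared, so the density is bounded by 4/n and shrinks as the family grows.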
Largely used in social sciences and bioinformatics research, in part due to the
characteristics of the underlying graphs, graph embedding techniques are ways of not only
standardizing the size of the feature vector of each node but also providing more
information at once about its vicinity. That way, traditional machine learning techniques can
be used to make predictions on nodes, subgraphs and even on the whole graph [39].
In the case of genetic prediction, the predictions can be about an individual given
their family’s pedigree. To my knowledge, the technique of graph embedding has never
been used for predictions about individuals given their family. The reason might
be that, given the current state of the graph embedding techniques, enormous amounts
of data, which are currently poorly collected, would be needed to train an algorithm
capable of “summarizing” the context of a node from the genetic information of its
vicinity (parents and children).
In the graph embedding literature, there are various techniques that enable the em-
bedding of graphs. The two which are currently most relevant are graph2vec [40] and
node2vec [41].
3 STATE OF DEVELOPMENT
This work is related to other research and commercial efforts that culminated in tools that
are currently adopted by medical specialists.
One such tool is Phenomizer [42]. Although the relationship between genotype and
phenotype is widely known in the field of genetics, another relationship of interest
in the area is the one between diseases and
their corresponding symptoms (phenotypical expression).
Another genre of tools that precedes this work and is relevant to it
is that of Pedigree Drawing Systems (PDS). As commented in the State of the Art of
this work (Section 2), there are simpler and more complex Pedigree Drawing Systems.
An example of a simple PDS is Madeline 2.0. A more complex example is the Risk
Assessment module of CRA Health. Furthermore, an especially relevant tool for this work
is genoDraw, into which the module implemented in this work is integrated.
In the next subsections, the most relevant platforms that either influence or precede
this work are introduced and their main characteristics are outlined.
3.1 Phenomizer: Exploring the Symptoms-Disease relationship
As commented previously, some relationships are of special interest in Risk Assessment.
One example is the relationship between symptoms and diseases.
Diseases express themselves through symptoms, which are typically used in their di-
agnosis. However, many diseases share similar symptoms.
In this context, Phenomizer is a tool of relevance. It serves its purpose by estimating
the likelihood that a person who presents some specific symptoms (phenotype) has each
of the possible diseases, assigning scores to each of the associations. The highest-scoring
diseases are presented to the user as the most likely possibilities. The exploration is done
following a Semantic Similarity search, through which the semantic network that expresses
the ontology (biomedical vocabulary) is explored in a top-down fashion, searching for
similarities among the disease in question and the symptoms selected by the user.
Currently, Phenomizer is a freely accessible tool available at
http://compbio.charite.de/phenomizer/. In its implementation, the user is able to select
symptoms, which are terms from the HPO [23] vocabulary, from a list and select the mode of
inheritance with which the disease manifests itself in the family. From both of these
features, a list of possible genetic diseases from OMIM [22] is obtained. A depiction of the
graphical interface of this software can be seen in Figure 5.
Figure 5: Graphical User Interface of Phenomizer. Source: http://compbio.charite.de/phenomizer/
3.2 Previous Pedigree Drawing Systems
3.2.1 Madeline 2.0
Madeline 2.0 is an open-source Pedigree Drawing Engine first reported in 2007 [13]. It
is a web service that automatically generates pedigree diagrams from family descriptions,
without user assistance. In terms of characteristics, Madeline 2.0 offers a service that
produces pedigrees in compliance with the Standardized Human Pedigree Nomenclature
from 1995.
From the perspective of the characteristics a PDS can offer, however, Madeline 2.0
fits best in the simple PDS category. Although the creation of pedigrees is automatic, the
user is not able to adapt it to their liking. For this reason, a pedigree, once generated, is a
static image in which no personalization can be done. Furthermore, the user is not able
to select specific diseases for the affected individuals in the family, let alone register such
diseases as terms from biomedical vocabularies.
Madeline 2.0 also relies on a data model which is far from ideal. This data model is
based on a list of individuals to whom characteristics are given. The code below corresponds
to a simple example that can be found at
https://madeline.med.umich.edu/madeline/testdata/input/si_001.data. As we can see, the first
element of the code is a sequence of variables to be used in the block below (the columns).
Then, each row corresponds to an individual.
FamilyId  IndividualId  Gender  Father  Mother  Deceased  Proband  DOB         MZTwin  DZTwin  Sampled  Affected
si_001    S00102        M       S00100  S00101  .         .        1937.07     .       .       Y        U
si_001    S00103        F       S00100  S00101  .         Y        1939.02.25  .       .       Y        A
si_001    S00104        F       S00100  S00101  .         .        1942.04.15  .       .       Y        U
The previous example, when processed with Madeline 2.0, outputs the following image,
in SVG format:
Figure 6: Image output of Madeline 2.0. Source: https://madeline.med.umich.edu/madeline/testdata/
In terms of limitations, the data model used in Madeline 2.0 is not capable of
expressing any states of the relationships among individuals. Moreover, the annotation
of diseases is limited to expressing whether an individual is affected, not affected, carrier
or unknown; the disease itself is not recorded. Finally, some important information
is contained in the gestations and is completely ignored in Madeline 2.0.
In conclusion, Madeline 2.0 is prohibitively complex to be used by genetics specialists
during medical encounters and does not allow users to insert and remove data dynami-
cally.
3.2.2 CRA Health - Risk Assessment Software
CRA Health (https://www.crahealth.com/) is a company that develops a risk
assessment software suite. Implemented in CRA Health is a plugin for pedigree drawing
activities first reported in 2011 [18]. This plugin is more appropriately called a
complex PDS, due to the characteristics it presents. First of all, it is an interactive and
automatic system that allows the user to insert individuals in the diagram by using high-level
directives and without having to manually place each entity. Secondly, the generation is
done from structured data.
The plugin also complies partly with the 1995 Standardized Pedigree Nomenclature,
due to its data structure (as will be discussed later). Diseases are also not registered as
associated with terms from medical vocabularies.
In terms of usability, however, the plugin presents interesting agility, being thus
seemingly adequate for clinical practice. A screenshot of its use extracted from the webpage
of the company is shown in Figure 7.
Figure 7: Risk Assessment Software of CRA Health. Source: https://www.crahealth.com/screensamples
3.3 genoDraw
genoDraw is an in-house development of the Biomedical Informatics Group of the Tech-
nical University of Madrid. It is strongly based on the characteristics identified as promis-
ingly useful for medical specialists in the area of genetics [1]. Figures 8 and 9 are screen-
shots of genoDraw.
In terms of compliance with standard visual nomenclatures for the drawing of indi-
viduals and their relations, the 2008 Standardized Human Pedigree Nomenclature of the
National Society of Genetic Counselors is fully implemented.
Furthermore, genoDraw works with a data model that is comprehensive enough to
represent a wide spectrum of possibilities which can be observed in society. It also gen-
erates pedigree diagrams directly from such data model automatically and supports the
interactive manipulation of entities, either to reposition them or to make changes to the
underlying data.
Another important characteristic of genoDraw is the capability to handle diseases as
terms from biomedical vocabularies. This not only enables genoDraw to offer as wide
a selection as possible, but also enables a common reference to disorders. Additionally,
future exploitation of pedigree data is facilitated.
Figure 8: Interface of genoDraw. The user is editing a simple pedigree using the normal
mode, intended for visualizations and minor structural changes. The individual F is a
male adopted by A and B. In the sidebar, as we can see, changes to individual F can be
made.
In the next subsection, an aspect of utmost importance to the understanding of how
pedigrees and their data models differ is explained. This is an issue of special relevance
both for the drawing process, as described in [43], and for the exploitation of such data, be
it for the propagation of genetic information as in this work, or for gathering other types
of information.
3.3.1 Representation of family-related information
Figure 9: Interface of genoDraw. The user is editing a simple pedigree using the editing
mode, intended for major structural changes and overall pedigree creation. Its interface is
optimized for fast interaction.
From a mathematical perspective, a (modern) pedigree is a visual representation of an
undirected hypergraph. A hypergraph is a generalization of a graph. A graph can be
defined as a tuple G = (V, E), in which V is a set of vertices v and E is a set of edges
defined between two vertices. When we refer to directed graphs, we can consider E as
E ⊂ V × V. Similarly, undirected graphs are graphs in which E is composed of
unordered pairs of elements of V. Meanwhile, a hypergraph is a graph whose edges
are not defined between two nodes, but among an arbitrary number of vertices. The
unordered pairs of vertices that define edges in graphs are now sets of vertices (which can
also be thought of as unordered lists of vertices). The definition of undirected hypergraph
considered in this work is as follows [44]:
Definition 3.1 (Undirected hypergraph). An undirected hypergraph is a pair H = (V, E),
where V = {v_1, v_2, ..., v_n} is the set of vertices and E = {E_1, E_2, ..., E_m} is the set of
hyperedges. A hyperedge E_k is a set of vertices with unrestricted cardinality.
As an illustration of the previous definition, Figure 10a depicts an undirected hypergraph.
As we can see, not constraining an edge to be between two nodes allows us to
connect multiple nodes at once. When compared to the pedigree diagram shown in Figure 10b,
we can see that a pedigree is clearly an undirected hypergraph in which nodes
are drawn according to their characteristics and edges are only connections among many
nodes, and are usually drawn as compositions of straight lines. Additionally, positions are
defined for each node, so that, for example, parents are always drawn above their children.
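Definition 3.1 maps directly onto ordinary set types. The following sketch represents a small hypothetical pedigree fragment as a set of vertices and a list of hyperedges, and looks up the hyperedges incident to a vertex; the vertex names and the function name are assumptions for this example.

```python
def incident_hyperedges(vertices, hyperedges, vertex):
    """Return the hyperedges (sets of vertices) that contain the given vertex."""
    if vertex not in vertices:
        raise KeyError(vertex)
    return [edge for edge in hyperedges if vertex in edge]

# Hypothetical fragment: a single hyperedge joins parents A and B with their
# children C and D, much as one sibship line does in a drawn pedigree; a second
# hyperedge joins D with a partner E.
V = {"A", "B", "C", "D", "E"}
E = [frozenset({"A", "B", "C", "D"}), frozenset({"D", "E"})]
```

Note that the first hyperedge does not record who is a parent and who is a child; that information is carried only by the drawing positions, which is one motivation for the typed representation discussed next.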
(a) An undirected hypergraph (b) A simple pedigree diagram
Figure 10: Comparison between an undirected hypergraph and a pedigree diagram that
follows the Standardized Human Pedigree Nomenclature in its updated version [10].
However, the relations in families are best represented as a semantic network. A semantic
network is a directed graph whose vertices and edges are assigned a type. Usually,
they are used to describe ontologies, but as mathematical instruments, their usage need
not be limited to that purpose. In this work, I define a family graph as a semantic network
with types of nodes Individual, Gestation and Relationship. The possible types of directed
edges are biological father, biological mother, birth mother, gestation, partner and
non-biological parent. I extracted these types in a formalization performed directly from the
Standardized Human Pedigree Nomenclature [10].
Given that families are composed solely of persons, and the relations among individuals
can be of multiple types, a first approach to the problem could indicate that a
directed graph with labeled edges could be enough to model the representation of a family.
However, the existence of characteristics such as dates of engagement and divorce in
relationships invalidates the consideration that a relationship is an ordinary relation that
could be modeled as two is partner of relations for each relationship. Instead, an entity that
carries information on its own is required. Furthermore, gestations also carry information on
their own. Initially, one may conclude that a gestation is part of the individual. However,
the existence of multiple gestations requires that a common entity exists between two or
more twins and their parents. Therefore, given that both entities and edges must be of certain
types, a plain directed graph is too restrictive for modeling the information required. For this
task, a semantic network is ideal.
Considering the formal aspects of pedigree diagrams is of utmost importance to understand
the limitations of most of the past software dedicated to the creation and management
of such graphical representations. (Kelleher, 2011) [18], for example, defines
multiple layers of directed graphs to represent the families. However, many limitations
arise from this decision. One of them is that the chosen set of layers and the content
of each of them, as a whole, limits the information that can be inserted. Although the
model proposed by [18] complies partially with the first version of the Standardized Human
Pedigree Nomenclature, the restriction posed by the set of layers is that each child is
to have two parents and that each of the two parents needs to have a relationship. This is
due to one of the layers being the “couples’ graph”. Individuals have their parents defined
directly through the couples’ graph. That means that two parents must form a couple to
be able to have their children represented. This is not only a violation of the old
Nomenclature from 1995, which allowed parents not to be married, but also a factor that blocks
the compliance of the proposed system with the Updated Nomenclature, because modeling
that an individual is a child of more than two parents at once is impossible, and
this is exactly what happens with ovum donations, for example.
In an ovum donation, the father provides the sperm, one mother provides an ovum,
while the other mother gestates the child (in this work, I refer to them as biological
father, biological mother and birth mother, respectively). That way, three people must be
somehow annotated as parents of the child. In this case, failing to indicate that the mother
that gestates the child is a parent is a loss of important information, since her gestation of
the child may have had its effect on the fetus, for example by transmitting teratogens.
For a more flexible formal representation of a family to be possible, we must be able
to: (a) represent vertices as entities of multiple types, given that individuals have different
connectivity patterns and a different semantic meaning than gestations or relationships,
and (b) represent edges as directed and labeled links between two entities. For this sce-
nario, one of the most adequate structures is a semantic network.
A semantic network is a graph-like structure based on the existence of entities (vertices)
and links among them (directed edges). Both entities and links are of a certain type.
This type carries information about which type of entity or connectivity we are dealing
with. Semantic networks are widely used in the field of semantic web, given both their
modeling power and their simplicity (i.e. any information to be captured in the network is
expressed as a triple that carries a source entity, a relation, and a destination entity).
Thus, while the representation of a family as a pedigree is an undirected hypergraph,
the interpersonal data underlying it is best described as a semantic network. In the case of
genoDraw, the semantic network is composed of nodes of classes Individual, Gestation
and Relationship, while the relations among such entities are of types biological father,
biological mother, birth mother, gestation, partner and non-biological parent.
As an illustrative example, Figure 11 depicts, at the top, a correct pedigree with multiple
less-common relations among the individuals, as well as common relations. At the
bottom, the underlying semantic network is shown.
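A typed network of this kind can be sketched as a store of (source, relation, destination) triples with type checks. The node and edge type names below are the ones listed in the text; the class, its methods, the identifiers and in particular the direction chosen for each edge are assumptions made for illustration and do not describe genoDraw's actual data model.

```python
NODE_TYPES = {"Individual", "Gestation", "Relationship"}
EDGE_TYPES = {"biological father", "biological mother", "birth mother",
              "gestation", "partner", "non-biological parent"}

class FamilyGraph:
    """Minimal typed triple store for a family graph (illustrative only)."""

    def __init__(self):
        self.node_type = {}  # node id -> node type
        self.triples = []    # (source, relation, destination)

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.node_type[node_id] = node_type

    def add_edge(self, source, relation, destination):
        if relation not in EDGE_TYPES:
            raise ValueError(f"unknown relation: {relation}")
        if source not in self.node_type or destination not in self.node_type:
            raise KeyError("both endpoints must be added before the edge")
        self.triples.append((source, relation, destination))

    def sources(self, destination, relation):
        """Nodes linked to `destination` through `relation`."""
        return [s for s, r, d in self.triples if d == destination and r == relation]

def ovum_donation_example():
    """Three parents attached to one gestation; edge directions are assumed."""
    g = FamilyGraph()
    for person in ("F", "M1", "M2", "K"):
        g.add_node(person, "Individual")
    g.add_node("G1", "Gestation")
    g.add_edge("F", "biological father", "G1")
    g.add_edge("M1", "biological mother", "G1")
    g.add_edge("M2", "birth mother", "G1")
    g.add_edge("G1", "gestation", "K")
    return g
```

Because the gestation is an entity of its own, the ovum donation case needs no special handling: the child K is reached from three distinct parents through G1.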
(a) A correct pedigree diagram
(b) Its underlying semantic network
Figure 11: Comparison between a complex pedigree diagram that follows the Standard-
ized Human Pedigree Nomenclature in its updated version [10] and its underlying seman-
tic network.
4 METHODS
In this section, the elements that comprise the information propagation in the family
graph, as well as the means by which this information is treated, are laid out.
First, the treatment of genotypes as distributions over the possibilities for each
individual is explained. Then, three techniques for genetic information propagation in
the family graph are presented. After that, a brief comment is made on the technique
used to hold known information in the genotype distribution of individuals, the masking
of genotype distributions. Next, the proposed structuring of the techniques in an Expecta-
tion Maximization algorithm is described, as well as how the approach presented can be
changed to adapt to various modes of inheritance. Lastly, the strategy used to estimate the
contribution that information of each individual can provide to the genetic information of
the whole family is proposed.
In this work, I propose a method that aims to identify which individuals are capa-
ble of providing most information to the genetic information of the whole family. For
this purpose, phenotype and genotype distribution estimation must be made from genetic
information available. In my formalization of the problem, I propose a model for ge-
netic information propagation in the family graph that enables an interpretation of the
Expectation-Maximization (EM) algorithm. Three information propagation schemes are
proposed. The final model is composed of two parallel EM algorithms, and the difference
between the results of both provides a predicted measure of how decisive the individual
in question is. These predictions are calculated across the whole family graph, and the
results are obtained at the level of the individual. Each person can have a phenotype, or a
genotype, associated with them for a certain genetic trait. As mentioned in Section 2.4,
genetic traits are associated with specific modes of inheritance, if known. Specific modes of
inheritance mean that a pattern can be observed in the way a genetic disease is inherited
from parents to child. For this method, the structure of the family in question is necessary.
From that, we know who is child of whom with high reliability. Having this information,
a model is devised to simulate the inheritance pattern so that a prediction of the risk of
having a certain disease can be made exclusively from data available for a family.
As in many other domains, genetic data in families are generally very sparse and mostly
latent. Additionally, the nature of genetics is such that many phenomena are largely
unpredictable. That means that, even if a person is known to have or not to have a certain
disease, little can be said about their relatives in a decisive manner. For example, for
a couple in which both partners suffer from an autosomal dominant genetic disease,
it is wrong to assume that their children will also be or become affected. These
parents might both be heterozygous, in which case each of their children has a probability
of approximately 25% of being neither affected nor a carrier of the mutation.
Since no assumptions should be made (e.g. automatically labelling a person as affected),
an approach based on the genotype distributions of each individual is more appropriate
than a decisive model, such as rule-based systems or other naïve algorithms (those which
stick to the most likely prediction). This technique has been used in many studies (e.g. [26,
31, 32, 33, 34]). From the phenotypical information of a person, a genotypical distribution
is derived, as depicted in Figure 12 for a generic case. When genotypical information
is given, this step is skipped. Each person has two chromosomes of
each type, one from the mother and one from the father (sex chromosomes are not necessarily
the same from father and from mother, but autosomes are ideally very similar). Each gene is
a pair of alleles. Since the copies from father and mother cannot be easily
differentiated [35], the pair is considered in genetics as an unordered pair. The alleles that
correspond to a locus (gene) can be of various types, and each type can be the result of a
mutation or recombination from the other types. The most common alleles are generally
known and are typically assigned a letter. When the allele is dominant over the others,
it is generally written as an uppercase letter. If it is recessive, it is generally written as
a lowercase letter. In this work, ‘A’ and ‘a’ are used for each of the cases, respectively.
The different combinations of alleles can cause different phenotypes with different
probabilities (depending on many factors, such as the age of the patient or the penetrance of the
disease). Therefore, each phenotype can indicate with a certain confidence which is the
underlying genotype. Although no certain assertions can be reached (except in isolated
cases), information can be gathered as a variation in the original probabilities of each
genotype (pair of alleles) for a person.
Figure 12: Example of phenotype-genotype distribution conversion with penetrance of
60% for being affected.
With the previous strategy, a genetic counselor can estimate whether a person is affected or not
by a disease (phenotype) given their genotype, and vice versa. Knowing some parameters
of the disease in question enables estimating the probability of each combination of alleles
in the genome of each patient (genotype). For people whose status for the
disease is unknown (latent), the probabilities of each combination of alleles can be
estimated from an analysis of the surrounding family. Considered together, these
probabilities form a distribution for each individual. Thus, after the transformation to
genotype distributions, every person is assigned a probability for each genotype. After
the operations to be defined are performed on the genotype distributions of the
individuals, new phenotypes can be estimated from the resulting genotype distributions.
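The conversion from a phenotype to a genotype distribution is not spelled out at this point, so the following is only a minimal sketch of one way it could be done, assuming Bayes' rule with a per-genotype penetrance. The function name, the prior and the penetrance values are hypothetical and are not taken from Figure 12.

```python
def genotype_from_phenotype(prior, penetrance, affected):
    """Posterior over genotypes (AA, Aa, aa) given an observed phenotype.

    Assumption (not from the text): the posterior is obtained with Bayes'
    rule, where penetrance[i] = P(affected | genotype i).
    """
    likelihood = [p if affected else 1.0 - p for p in penetrance]
    unnormalized = [pr * li for pr, li in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("phenotype is impossible under this prior and penetrance")
    return [u / total for u in unnormalized]


# Hypothetical recessive disease: only aa can be affected, with 60% penetrance.
prior = [0.25, 0.5, 0.25]      # illustrative prior for AA, Aa, aa
penetrance = [0.0, 0.0, 0.6]   # P(affected | AA), P(affected | Aa), P(affected | aa)

g_affected = genotype_from_phenotype(prior, penetrance, affected=True)
g_unaffected = genotype_from_phenotype(prior, penetrance, affected=False)
```

Under these assumed values, an affected person is certainly aa, while an unaffected person keeps some probability of being aa because the penetrance is incomplete.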
When performed by hand, as genetic counselors currently do, these operations
are rather complex and extremely prone to errors, as are many statistics-based calculations
and intuitions when performed by humans [3, 4]. In a computerized environment, on the
other hand, each of the steps can be automated, decreasing the workload of
the genetic counselor and increasing the correctness of the calculations. The genotype
can either be given by the user or predicted from a known phenotype. If neither option
is viable, the patient is considered latent. Then, a predictive algorithm can analyze
the family graph, propagating information that helps estimate the likelihood of each
genotype in each person. Then, from the likelihood estimations (genotype distributions)
for each individual, phenotype distributions can be calculated using parameters of the
disease being analyzed, such as the average age of onset, the gender of the patient, penetrance,
etc. In all, this process is decidedly faster and less prone to errors than what is currently
done.
In this work, I consider the genotype and the phenotype distributions of the individuals
as feature vectors in a similar fashion to one-hot encoding, where the probability of each
of the possibilities is calculated. The vectors always add up to 1, being thus normalized.
An example of such a vector is the genotype of a fictitious individual A for a monogenic
biallelic disease. The individual A has a probability of 0.6 of being homozygous on the
dominant alleles (AA), 0.28 of being heterozygous (Aa) and 0.12 of being homozygous
on the recessive alleles (aa). The vector is considered in this work as follows:
$$
G_A = \begin{bmatrix} 0.6 \\ 0.28 \\ 0.12 \end{bmatrix} \tag{1}
$$
In order to extract genetic information from the family graph, a propagation-based
predictive algorithm is proposed. It is based on three components for the propagation of
information that can be used and combined as desired: a) downwards propagation,
b) upwards propagation and c) upwards constraint. In the next subsections, each of
these components is introduced. They operate on the genotype distributions
of the represented individuals and can, in conjunction, be used as an EM algorithm,
which is explained in subsequent subsections.
4.1 Genotype probability distribution propagation
In current clinical practice, genetics specialists usually perform manual calculations dur-
ing their analysis of a family in order to calculate the risks associated with genetic dis-
eases. They make extensive use of a method that will be called biological model in this
work.
The biological model is based on each parent having a 50% chance of contributing each
of their alleles to a child. That means that an individual with Aa genotype will transmit
either A or a with a 50% chance each. This is due to the segregation that occurs during
meiosis, which splits the pairs of chromosomes in two, so that each resulting
gamete has only half the genetic information of the original cell. Combining both parents,
we can estimate the joint probability of the alleles inherited from both. A practical way to
visualize this splitting is the Punnett square. The numbers written in a Punnett square are
the probabilities of a child having a certain combination of alleles. Imagining the pairs of
alleles of the parents as ordered lists, a child can inherit the first allele of the father and
the second of the mother, for example, which ideally corresponds to a 25% inheritance
probability in the Punnett square. As shown in Table 1, parents who are heterozygous for a
certain trait have a 25% chance of having a child that inherits the conditions to be affected.
At the same time, the probability that the child is a carrier of the disease (heterozygous, in
this case) is 50%, since, in biological terms, Aa and aA (first from mother, second from
father, for example) are equivalent and their probabilities are summed.

|   | A  | a  |
|---|----|----|
| A | AA | Aa |
| a | aA | aa |

Table 1: Punnett square of two heterozygous parents. Alleles are A and a. The first row
and the first column are the possibilities of inheritance (alleles) from each parent. The
rest of the cells are the possible genotypes of the child, at a 25% probability each.
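The Punnett square reasoning can be reproduced in a few lines. The sketch below enumerates the four equally likely allele pairs and merges Aa with aA; the function name `punnett` and the string encoding of genotypes are illustrative choices, not part of the original work.

```python
from collections import Counter
from fractions import Fraction


def punnett(father, mother):
    """Genotype probabilities of a child for fully known parental genotypes.

    Each parent passes either of their two alleles with probability 1/2, so
    each of the four ordered combinations has probability 1/4.
    """
    counts = Counter()
    for from_father in father:
        for from_mother in mother:
            # Sorting makes the pair unordered: "aA" and "Aa" both become "Aa".
            genotype = "".join(sorted(from_father + from_mother))
            counts[genotype] += Fraction(1, 4)
    return dict(counts)


cross = punnett("Aa", "Aa")  # two heterozygous parents, as in Table 1
```

For two heterozygous parents this yields 1/4 for AA, 1/2 for Aa and 1/4 for aa, matching the values discussed for Table 1.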
By chaining the biological model, some invaluable information can be obtained. For
example, relationships between cousins can be justified to be of high risk to the child,
since identical mutated alleles can be inherited at the same time from father and mother.
This is a model especially relevant for the risk assessment of autosomal recessive traits.
The biological model is widely used in part because of its simplicity. It can be of
help in the prediction from known parents to children. However, when non-latent family
members are more distant, the biological model is of virtually no utility.
4.1.1 Downwards propagation
Punnett squares are very useful for immediate visualization when the full
information of the parents is given. However, when dealing with pedigrees, since many people
are represented and the presence of latent data is highly likely, the probabilities must be
passed on without assuming that someone has a certain genotype simply because
it is the most likely (an assumption that is necessary when chaining the biological model).
That is, all scenarios must be considered concurrently.
For that reason, one proposition is to consider a structure (Profile) whose cells are the
probabilities that a child has a certain genotype (as in the Punnett square). However,
instead of being two-dimensional, it is a tensor of four dimensions. These dimensions are
indexed by 1) the genotype (combination of alleles) of the father, 2) the genotype of the
mother, 3) the allele inherited from the father and 4) the allele inherited from the mother.
Therefore, each cell now carries more conditions.
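| Father genotype | Allele from father | Mother AA: A | Mother AA: a | Mother Aa: A | Mother Aa: a | Mother aa: A | Mother aa: a |
|---|---|---|---|---|---|---|---|
| AA | A | 1 | 0 | 0.5 | 0.5 | 0 | 1 |
| AA | a | 0 | 0 | 0 | 0 | 0 | 0 |
| Aa | A | 0.5 | 0 | 0.25 | 0.25 | 0 | 0.5 |
| Aa | a | 0.5 | 0 | 0.25 | 0.25 | 0 | 0.5 |
| aa | A | 0 | 0 | 0 | 0 | 0 | 0 |
| aa | a | 1 | 0 | 0.5 | 0.5 | 0 | 1 |

Table 2: Profile tensor for monogenic biallelic autosomal diseases.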
Two examples are given:
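$$
\mathrm{Profile}_{AA,AA,A,A} = P(\text{child inherits } A \text{ from father} \mid \text{father is } AA) \times P(\text{child inherits } A \text{ from mother} \mid \text{mother is } AA) = 1 \tag{2}
$$

$$
\mathrm{Profile}_{Aa,Aa,A,A} = P(\text{child inherits } A \text{ from father} \mid \text{father is } Aa) \times P(\text{child inherits } A \text{ from mother} \mid \text{mother is } Aa) = 0.25 \tag{3}
$$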
Table 2 contains the Profile for the case of monogenic biallelic autosomal traits, in
which only one gene is involved and two alleles are possible.
The Profile tensor holds the probabilities that can be observed in a Punnett square.
In order for the actual calculation to be performed (prediction of child given parents),
prior information is needed. In this case, the genotype distributions of the parents are the
prior information. If the genotype distributions of the father and the mother are vectors in
which each position is the probability of one genotype, one step for calculating the
probabilities in the downwards propagation is to generate a matrix from the outer product of
both vectors, which is then multiplied element by element along the first two axes of the
Profile tensor. The result is a map of the probabilities for each possible genotype of the
parents (Inheritance By Case - IBC), weighted by the probability of each of the cases.
For example, if a father has probability 0.25 of being heterozygous for a trait (Aa) and
a mother has probability 0.1 of having that genotype, the Profile position that corresponds
to the child inheriting A from both parents (genotype AA) given that both the father and the
mother are Aa is 0.25. This is a conditional probability; weighted by the probabilities of
the parents' genotypes, it yields the IBC entry 0.25 × 0.1 × 0.25 = 0.00625. For us to predict
the probability of the child being AA, all the other cases must be considered, including the
probabilities for every other combination of genotypes for the parents. Table 3 contains an
IBC tensor for one possible example of the aforementioned case.
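| Father genotype (probability) | Allele from father | Mother AA (0.9): A | Mother AA (0.9): a | Mother Aa (0.1): A | Mother Aa (0.1): a | Mother aa (0.0): A | Mother aa (0.0): a |
|---|---|---|---|---|---|---|---|
| AA (0.5) | A | 0.45 | 0 | 0.025 | 0.025 | 0 | 0 |
| AA (0.5) | a | 0 | 0 | 0 | 0 | 0 | 0 |
| Aa (0.25) | A | 0.1125 | 0 | 0.00625 | 0.00625 | 0 | 0 |
| Aa (0.25) | a | 0.1125 | 0 | 0.00625 | 0.00625 | 0 | 0 |
| aa (0.25) | A | 0 | 0 | 0 | 0 | 0 | 0 |
| aa (0.25) | a | 0.225 | 0 | 0.0125 | 0.0125 | 0 | 0 |

Table 3: Inheritance By Case tensor for the case of 0.25 probability that the father is
heterozygous and 0.1 probability that the mother is heterozygous too. The rows axis
corresponds to the father. The columns axis corresponds to the mother.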
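The construction of the Profile and IBC tensors can be sketched with plain nested lists. This is an illustrative reading of the description, assuming fair Mendelian segregation; the helper names `transmit`, `build_profile` and `build_ibc` are hypothetical, and the priors are those of Table 3 with the father on the first axis.

```python
GENOTYPES = ["AA", "Aa", "aa"]
ALLELES = ["A", "a"]


def transmit(genotype, allele):
    """P(a parent with `genotype` passes on `allele`), assuming fair segregation."""
    return genotype.count(allele) / 2


def build_profile():
    """profile[f][m][k][l]: father genotype f, mother genotype m,
    allele k inherited from the father, allele l inherited from the mother."""
    return [[[[transmit(f, k) * transmit(m, l) for l in ALLELES]
              for k in ALLELES]
             for m in GENOTYPES]
            for f in GENOTYPES]


def build_ibc(g_father, g_mother, profile):
    """Weight every Profile cell by the prior probability of the parents' genotypes."""
    return [[[[g_father[i] * g_mother[j] * cell for cell in row]
              for row in profile[i][j]]
             for j in range(len(g_mother))]
            for i in range(len(g_father))]


profile = build_profile()
ibc = build_ibc([0.5, 0.25, 0.25], [0.9, 0.1, 0.0], profile)

# All cells together cover every case exactly once, so they add up to 1.
total = sum(cell for per_father in ibc for per_mother in per_father
            for row in per_mother for cell in row)
```

Indexing follows the order AA, Aa, aa for genotypes and A, a for alleles, so `ibc[1][1][0][0]` is the cell for both parents being Aa and the child inheriting A from each.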
When the genotypes of the parents are known, this table is mathematically identical to
the Punnett square found at the position of their genotypes. However, when all we
can know are the probabilities of each genotype, a Punnett square does not easily lead to
a conclusion, whereas the Profile tensor handles the genotype probabilities of the father
and the mother separately.
Having obtained the IBC tensor for the child from the parents, two steps are required
to take it from a 4-dimensional tensor to a genotype vector (prediction of the genotype of
the child). First of all, a sum over the first two axes of the IBC tensor results in a matrix
($C$) whose entries are the probabilities of each genotype for the child (for example,
$C_{A,a}$ = P(child is Aa)). Nonetheless, each axis of the matrix is composed of the
possibilities for each allele, and the matrix is not triangular. That is, for two alleles in one
gene, $C_{A,a}$ and $C_{a,A}$ have the same meaning but are two independent numbers. The
second step for reaching the genotype vector is, therefore, to remap the matrix into a vector
whose dimensions are the same as those of the genotype vector (in the case of 2 alleles and
one gene it will be 3 (AA, Aa, aa)). This calculation for the case of monogenic biallelic
diseases is done as follows:
$$
G_{child} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \times \sum_{i} \sum_{j} IBC_{i,j,k,l} \tag{4}
$$
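The two steps of Equation 4 can be sketched as follows. The code assumes that the 2 × 2 matrix obtained from the double sum is flattened row by row, in the order (A,A), (A,a), (a,A), (a,a), before the remapping matrix is applied; that ordering and the helper names are assumptions made for illustration.

```python
GENOTYPES = ["AA", "Aa", "aa"]
ALLELES = ["A", "a"]
REMAP = [[1, 0, 0, 0],   # AA <- (A,A)
         [0, 1, 1, 0],   # Aa <- (A,a) + (a,A)
         [0, 0, 0, 1]]   # aa <- (a,a)


def transmit(genotype, allele):
    """P(a parent with `genotype` passes on `allele`), assuming fair segregation."""
    return genotype.count(allele) / 2


def build_ibc(g_father, g_mother):
    """IBC[i][j][k][l] = prior of the parents' genotypes times the Profile cell."""
    return [[[[g_father[i] * g_mother[j] * transmit(f, k) * transmit(m, l)
               for l in ALLELES] for k in ALLELES]
             for j, m in enumerate(GENOTYPES)]
            for i, f in enumerate(GENOTYPES)]


def child_genotype(ibc):
    n = len(ALLELES)
    # Step 1: sum over the two parental-genotype axes, leaving an n x n matrix C.
    c = [[sum(ibc[i][j][k][l]
              for i in range(len(ibc)) for j in range(len(ibc[i])))
          for l in range(n)] for k in range(n)]
    # Step 2: flatten C row by row and merge the equivalent entries Aa and aA.
    flat = [c[k][l] for k in range(n) for l in range(n)]
    return [sum(w * x for w, x in zip(row, flat)) for row in REMAP]


# Parental priors of Table 3 (father first, mother second).
g_child = child_genotype(build_ibc([0.5, 0.25, 0.25], [0.9, 0.1, 0.0]))
```

With the priors of Table 3, the child's distribution is 0.59375 for AA, 0.3875 for Aa and 0.01875 for aa, which can be checked by summing the corresponding columns of the table.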
Since the IBC-based model does not cease to consider the probabilities of the parents
when calculating the probabilities of the child, it does not lose information by consider-
ing only the most likely option for each parent. Furthermore, it is statistically correct for
both known and unknown parents’ genotypes. Therefore, it will remain consistent across
multiple executions. Additionally, given the characteristics described, the model is also
capable of handling chaining across many generations of people. Since latent data is ex-
pected to be notoriously large, precision in results is not expected to be perfect. Instead, it
is expected to indicate some insights into who are most likely to be carriers, not affected
or affected patients. However, be it precise or not, chaining the downwards model is pos-
sible and does not lose information in each iteration, carrying all that can be inferred to the
next generation without unnecessary losses. These characteristics render the downwards
propagation model adequate for the EM algorithm.
4.1.2 Upwards propagation
As with the downwards propagation model, it is possible to model the inverse effect.
In the upwards propagation, we can tune the probabilities of the parents of a child when
information about this child is given. For example, if a child is known to be affected by
an autosomal recessive trait, it is very likely that their parents are either carriers of or
affected by the same trait. This is not mandatory, since de novo mutations may have caused
the child to be affected. However, it is reasonable to set a higher probability
that the parents are carriers or affected when this is the case.
Furthermore, combining the upwards propagation with the downwards propagation
can help us model the effect that a cousin whose phenotype we know has on the patient.
Since the grandparents of this cousin are shared with
the patient of interest, some of the genes of these two individuals might carry the same
mutations. Therefore, the genetic relatedness among individuals who have ancestry in
common can be expressed by combining the presented downwards propagation model
and the upwards model to be introduced in this section.
Since the downwards model is a numerical method on the probabilities of each
possible genotype of the child conditioned on the probabilities of the possible genotypes of the
parents ($P(G_c \mid G_f, G_m)$), it is reasonable to model the upwards model as the
probabilities of the parents conditioned on the probability of the child ($P(G_f, G_m \mid G_c)$).
According to Bayes’ rule,

$$
P(G_c \mid G_f, G_m) = \frac{P(G_f, G_m \mid G_c) \times P(G_c)}{P(G_f, G_m)} \tag{5}
$$

Therefore,

$$
P(G_f, G_m \mid G_c) = \frac{P(G_c \mid G_f, G_m) \times P(G_f, G_m)}{P(G_c)} \tag{6}
$$
Thus, it is possible to devise a method for having an insight about the genotypes of
the parents given the genotype of the child.
In this work, I formalize the previous statement in an algorithm. First, the parents’
prior matrix is calculated ($P(G_f, G_m)$). Then, an $n_a \times n_a \times n_a$ tensor ($B$) is
created, where $n_a$ is the possible number of alleles for the disease in question. Each of the
three dimensions corresponds to one individual. In this work, I consider the first two as father
and mother, respectively. The third corresponds to the child.
For each of the positions of the $B$ tensor, $B_{i,j,k}$, an operation is performed. If, for
the corresponding child’s genotype distribution, the probability at the position $k$ is zero,
then $B_{i,j,k}$ is zero. If this is not the case, the conditional probability from the Profile is
calculated for the $i, j$ position. Then, a calculation following the Bayes’ rule above is
done. The result is stored in the $B$ tensor.
At the end of this calculation, the normalized sum on the first axis gives the most
likely genotype probabilities of the father for that child and the sum on the second axis
has the same meaning for the mother.
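One possible reading of this algorithm is sketched below. It assumes that the axes of B are indexed by genotype (AA, Aa, aa), that P(G_c) in Equation 6 is the child distribution implied by the parents' priors, and that each slice of B is weighted by the child's current probability before marginalizing onto each parent. These choices, like the function names, are assumptions rather than details confirmed by the text.

```python
GENOTYPES = ["AA", "Aa", "aa"]


def child_given_parents(father, mother):
    """[P(AA), P(Aa), P(aa)] of the child for fixed parental genotypes."""
    f = father.count("A") / 2   # P(father passes A)
    m = mother.count("A") / 2   # P(mother passes A)
    return [f * m, f * (1 - m) + (1 - f) * m, (1 - f) * (1 - m)]


def normalize(vector):
    total = sum(vector)
    return [v / total for v in vector] if total > 0 else list(vector)


def upwards(g_father, g_mother, g_child):
    n = len(GENOTYPES)
    cond = [[child_given_parents(GENOTYPES[i], GENOTYPES[j]) for j in range(n)]
            for i in range(n)]
    prior = [[g_father[i] * g_mother[j] for j in range(n)] for i in range(n)]
    # P(G_c = k) implied by the parents' priors (denominator of Equation 6).
    evidence = [sum(cond[i][j][k] * prior[i][j] for i in range(n) for j in range(n))
                for k in range(n)]
    b = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if g_child[k] > 0 and evidence[k] > 0:
                    b[i][j][k] = cond[i][j][k] * prior[i][j] / evidence[k]
    # Weight each child genotype by its probability, then marginalize per parent.
    father = [sum(g_child[k] * b[i][j][k] for j in range(n) for k in range(n))
              for i in range(n)]
    mother = [sum(g_child[k] * b[i][j][k] for i in range(n) for k in range(n))
              for j in range(n)]
    return normalize(father), normalize(mother)


new_father, new_mother = upwards([0.96, 0.03, 0.01], [0.96, 0.03, 0.01], [0.0, 0.0, 1.0])
```

For a child who is certainly aa and two parents with priors (0.96, 0.03, 0.01), this sketch moves both parents to (0, 0.6, 0.4), illustrating that the child's information is propagated equally to both parents.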
One aspect to be noticed when performing the upwards propagation is that any char-
acteristics existing on the child will be propagated equally to both parents (in autosomal
diseases). Therefore, some mutation that comes from the father, for example, will be
assigned a high likelihood to also have come from the mother. This is, in a sense, incor-
rect. However, it is only incorrect in the case we already know that this should not have
happened. Any individual, as expert as they may be, will suspect that both parents are
equally likely to have transmitted a mutation to the child unless more information proves
this assumption wrong.
4.1.3 Upwards constraint propagation
In some cases, a direct constraint on the possible genotypes of the parents given the pos-
sible genotypes of the child can be made. Since this behavior is much more precise in
effect than the upwards propagation, being able to propagate a constraint instead of an ex-
pectation can be a very useful characteristic of a model for predicting genotypes of latent
or semi-latent individuals (the case of being semi-latent will be discussed in Section 4.2).
By propagating a constraint, what is intended is that, if some genotype is
necessarily present or necessarily absent in the parents of the individual in question, the
genotype distributions of these parents are transformed so that this genotype becomes the
only possible one or becomes impossible, respectively. For
example, in a monogenic biallelic autosomal recessive disease, to affirm that a child is
affected by the disease means that the only possible genotype for this child is
homozygous for the recessive allele (that is, aa in the naming scheme of this work).
Since each of these alleles comes from one parent, both parents must possess
it in their genomes. Therefore, neither parent can be homozygous for the dominant
allele. That is, they cannot be AA.
As an example, consider the individuals in Figure 13. The genotype distribution
shown in Equation 7 for the individual C can improve the estimation of the genotype
distributions of her parents. C is affected by a monogenic biallelic autosomal recessive
disease (genotype is certainly aa). Therefore, neither of her parents should have a
distribution that allows for the AA genotype. Equation 9 is the new prediction of the
family after applying one step of upwards constraint from C to A and B. The operation is
performed following Equation 8 for individual A. The same calculation is performed for
individual B.
Figure 13: Example of small family for which the application of upwards constraint can
improve the prediction.
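$$
G_A = \begin{bmatrix} 0.96 \\ 0.03 \\ 0.01 \end{bmatrix}, \quad
G_B = \begin{bmatrix} 0.96 \\ 0.03 \\ 0.01 \end{bmatrix}, \quad
G_C = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \tag{7}
$$

$$
G_A = \begin{bmatrix} 0.96 \\ 0.03 \\ 0.01 \end{bmatrix}
\xrightarrow{\text{constraint}}
G_A = \begin{bmatrix} 0 \\ 0.03 \\ 0.01 \end{bmatrix}
\xrightarrow{\text{normalization}}
G_A = \begin{bmatrix} 0 \\ 0.75 \\ 0.25 \end{bmatrix} \tag{8}
$$

$$
G_A = \begin{bmatrix} 0 \\ 0.75 \\ 0.25 \end{bmatrix}, \quad
G_B = \begin{bmatrix} 0 \\ 0.75 \\ 0.25 \end{bmatrix}, \quad
G_C = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \tag{9}
$$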
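The constraint and normalization steps of Equation 8 can be written as a short sketch. The rule used to decide which parental genotypes are impossible (a parent must share at least one allele with a child whose genotype is certain) ignores de novo mutations and uniparental disomy, and the function names are illustrative.

```python
GENOTYPES = ["AA", "Aa", "aa"]


def forbidden_for_parent(child_genotype):
    """Indices of parental genotypes that share no allele with the child."""
    return {i for i, genotype in enumerate(GENOTYPES)
            if not set(genotype) & set(child_genotype)}


def apply_constraint(distribution, forbidden):
    """Zero out the forbidden genotypes and renormalize what is left."""
    constrained = [0.0 if i in forbidden else p for i, p in enumerate(distribution)]
    total = sum(constrained)
    if total == 0:
        raise ValueError("constraint removes every genotype with non-zero probability")
    return [p / total for p in constrained]


forbidden = forbidden_for_parent("aa")   # the child C is certainly aa
g_a = apply_constraint([0.96, 0.03, 0.01], forbidden)
g_b = apply_constraint([0.96, 0.03, 0.01], forbidden)
```

Both parents end at approximately (0, 0.75, 0.25), as in Equation 9.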
The attentive reader may notice that propagating a constraint such as in the example
given may result in a wrong parental expectation, since there is the possibility that, for
example, one parent is a carrier of the mutated allele and the other allele is either the result
of a uniparental disomy (both alleles come from a single parent) or of a mutation.
However, for some diseases, this is very unlikely. In such cases, it is highly credible that the
allele was inherited from both parents. Therefore, the propagation of constraints
can provide useful insight into the risk of transmission from the parents. Additionally, in this
case, a false positive constraint is more useful than not considering that the parents are
much more likely to have at least one recessive allele than none. If a parent is proven by
any other means to have none, this is registered by the user and the constraint
is ignored.
4.2 Masked genotypes
Given the nature of the algorithm used for genotype distribution prediction, there is a
need to keep partial predictions unchanged where the available information is sufficient,
while updating the remaining ones as better estimates are found. From this need emerges
the concept of masked genotypes. A masked genotype is a genotype whose probability is
fixed and unchangeable. When the new genotype distribution is estimated from the stack
of possible genotype distributions for a specific patient, only those positions which are
free to be changed are actually changed (all except the masked genotypes). The rest is left
as before. Furthermore, the new free probabilities only add up to what is left by the fixed
ones, thus leaving the genotype probabilities array normalized.
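A masked update could look like the following sketch, in which `mask[i]` set to `True` marks a fixed genotype. The rescaling of the free positions so that they share the probability mass left by the fixed ones is one way to satisfy the normalization requirement; the function name and the fallback for an all-zero proposal are assumptions.

```python
def update_with_mask(previous, proposed, mask):
    """Return a new distribution in which masked positions keep their old value.

    The free positions take the proportions of `proposed`, rescaled so that
    they add up to whatever probability mass the fixed positions leave over.
    """
    fixed_mass = sum(p for p, fixed in zip(previous, mask) if fixed)
    free_mass = 1.0 - fixed_mass
    proposed_free = sum(q for q, fixed in zip(proposed, mask) if not fixed)
    if proposed_free == 0:
        # Nothing usable in the proposal: keep the previous distribution.
        return list(previous)
    return [p if fixed else q / proposed_free * free_mass
            for p, q, fixed in zip(previous, proposed, mask)]


# AA is fixed at 0 (for example after a constraint); Aa and aa are free.
updated = update_with_mask([0.0, 0.5, 0.5], [0.2, 0.6, 0.2], [True, False, False])

# AA is fixed at 0.5; the free genotypes share the remaining 0.5.
partial = update_with_mask([0.5, 0.25, 0.25], [0.1, 0.3, 0.6], [True, False, False])
```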
4.3 Expectation Maximization for genotype expectation propagation
While families are networks of multiple individuals, the previous subsections of this
section (Methods) only mention families formed by one father, one mother and
one child. The reason is that the basis for the model proposed here is the triplet,
formed by exactly one father, one mother and one child. If one of the parents does not
exist in the pedigree, they are created and considered as a latent individual. That way,
there is always the possibility to propagate genetic information among individuals. The
existence of siblings of a given person, for example, provides information to the family (if
there is some to be provided), but not directly. Each gestation is distinct, and each meiosis is
different from any other, even from the same parents and even though the genetic material is
the same. Thus, combining siblings in the same unit of information does not seem to be
adequate.
Therefore, in the model here proposed, triplets are the basic units. Each individual in
a triplet has a clear purpose and different triplets may share individuals (the alterations
in their genotype distributions are shared with other triplets). Families are decomposed
into triplets by biological roles: biological father (provider of sperm), biological mother
(provider of ovum) and child.
The Expectation Maximization process is composed of two steps, the first being the
calculation of the expectation (E step), and the second being the calculation of the new
parameters, seeking to maximize a function (M step). In this work, I do not formulate
an explicit function to be maximized. This can be seen as a simplification of the algorithm,
but since there is no need to formulate such a complex function, it is left without further
formalization.
E step The E step is responsible for calculating an intermediate decision that will enable
the M step to modify the actual parameters, searching for maximization of expectation.
In this work, the E step finishes leaving the individuals with a stack of expected genotype
distributions. These individuals are the ones for whom a prediction can be made (to
be discussed later). The stack contains one genotype distribution for each operation in
the triplets that involves the individual in question. Therefore, different individuals may
have different numbers of genotype distributions, depending on their connectivity and the
activation criteria for each triplet of which they are members.
The criteria for triplet activation (calculation) depend on the status of the key
individual(s) present and their genotype masks. The list below contains the operations and
the criteria that enable the activation of a triplet. A triplet is activated if at least one of the
operations is possible, and it performs only those operations which are possible. The list
of criteria by operation is as follows:
• Downwards propagation: If the mask of the child allows for adjustment (it is not
fully determined), the triplet can perform downwards propagation.
• Upwards propagation: If the mask of at least one parent allows for adjustment,
the triplet can perform upwards propagation.
• Upwards constraint propagation: If either the genotype of the child or their phe-
notype is known and the penetrance provides enough confidence in propagating the
constraint to the parents, the constraint can be performed (if it is the case that there
is a possible constraint to be propagated).
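The activation criteria above can be sketched as a small predicate. The `Member` class, the `min_penetrance` threshold of 0.9 and the reading that a known genotype alone is enough for a constraint are assumptions made for illustration; the text does not fix a numeric threshold, and the check that a constraint actually exists is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    mask: List[bool]               # True marks a genotype whose probability is fixed
    genotype_known: bool = False
    phenotype_known: bool = False


def adjustable(member):
    """A member can be adjusted unless every genotype position is masked."""
    return not all(member.mask)


def possible_operations(father, mother, child, penetrance, min_penetrance=0.9):
    """List the operations a triplet may perform; an empty list means it is inactive."""
    operations = []
    child_informative = child.genotype_known or (
        child.phenotype_known and penetrance >= min_penetrance)
    if child_informative:
        operations.append("upwards_constraint")
    if adjustable(father) or adjustable(mother):
        operations.append("upwards")
    if adjustable(child):
        operations.append("downwards")
    return operations


def latent_member():
    return Member(mask=[False, False, False])


known_child = Member(mask=[True, True, True], genotype_known=True)
ops_known_child = possible_operations(latent_member(), latent_member(), known_child, 1.0)
ops_all_latent = possible_operations(latent_member(), latent_member(), latent_member(), 1.0)
```

The constraint is listed first because the E step starts with the propagation of constraints.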
The E step starts by calculating the propagations of constraints. Then, the possible
upwards and downwards propagations are executed, each adding a new genotype distribu-
tion to the stack of some individuals (parents in upwards propagation, child in downwards
propagation). The E step finishes when all the possible operations are performed, and the
individuals’ stacks of genotype distributions contain all the necessary entries.
M step The M step is the one in charge of calculating the new parameters for the next it-
eration of the EM algorithm (or to finish the iterative process). In this interpretation of the
algorithm, the M step calculates the expected genotype distributions for each individual
given the previous distribution and the stack calculated by the E step.
For each individual, we calculate the average genotype distribution from the stack.
The process of calculating the average distribution can take weights into account,
depending on the confidence of each estimation. In this work, I do not consider weights. A
gradient is then calculated between the average distribution and the previous distribution.
Lastly, the previous genotype distribution is updated by a percentage of the calculated
gradient (the percentage is a learning parameter, commonly referred to as the learning
rate α in the Machine Learning literature).
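A minimal sketch of this update follows, assuming that the gradient is the element-wise difference between the unweighted average of the stack and the previous distribution. The default `alpha=0.5` is an arbitrary illustrative value, and masking is left out for brevity.

```python
def m_step(previous, stack, alpha=0.5):
    """Move `previous` a fraction `alpha` of the way towards the stack average."""
    if not stack:
        # No operation produced an estimate for this individual.
        return list(previous)
    count = len(stack)
    average = [sum(dist[i] for dist in stack) / count for i in range(len(previous))]
    gradient = [a - p for a, p in zip(average, previous)]
    return [p + alpha * g for p, g in zip(previous, gradient)]


previous = [0.5, 0.3, 0.2]
stack = [[0.7, 0.2, 0.1],   # e.g. from a downwards propagation
         [0.5, 0.4, 0.1]]   # e.g. from an upwards propagation
updated = m_step(previous, stack)
```

Because the result is a convex combination of two normalized vectors, it remains normalized; here it is approximately (0.55, 0.3, 0.15).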
Iterative process Since the EM algorithm is an iterative process, it is constructed around
a loop. In the case of the EM algorithm specifically, each iteration consists of performing
one E step followed by one M step. Then, if convergence is found, the process is terminated.
In this work, since it is an objective that the iterative process is implemented in a
PDS, a mechanism for early termination of the algorithm must be devised. Thus, a
maximum number of iterations is set. If convergence is found earlier, the process
is terminated before running out of iterations. This is done for usability reasons.
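The loop with a cap on the number of iterations can be sketched as below. The convergence test (largest absolute change below `tolerance`), the default values and the toy E and M steps are placeholders chosen for illustration; they stand in for the triplet-based steps described above.

```python
def run_em(distributions, e_step, m_step, max_iterations=50, tolerance=1e-6):
    """Alternate E and M steps until the change is small or the budget is spent."""
    for iteration in range(1, max_iterations + 1):
        stacks = e_step(distributions)
        updated = m_step(distributions, stacks)
        change = max(abs(new - old)
                     for name in distributions
                     for new, old in zip(updated[name], distributions[name]))
        distributions = updated
        if change < tolerance:
            return distributions, iteration
    return distributions, max_iterations


# Toy steps (placeholders): every individual is pulled towards a fixed target.
TARGET = [0.25, 0.5, 0.25]


def toy_e_step(distributions):
    return {name: [TARGET] for name in distributions}


def toy_m_step(distributions, stacks, alpha=0.5):
    return {name: [p + alpha * (stacks[name][0][i] - p) for i, p in enumerate(dist)]
            for name, dist in distributions.items()}


start = {"child": [1 / 3, 1 / 3, 1 / 3]}
result, used = run_em(start, toy_e_step, toy_m_step)
capped, used_capped = run_em(start, toy_e_step, toy_m_step, max_iterations=3)
```

In the first call the loop stops as soon as the change falls below the tolerance; in the second it stops after three iterations regardless of convergence.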
One aspect to be noticed is that, in each M step, the genotype distribution of an indi-
vidual is changed following only the gradient, without any stochastic term or any other
type of maximization mechanism. For that reason, the EM algorithm, in this work, has
a greedy maximization. Therefore, it ends up searching for local maxima of the search
function (which is not formalized in this work but exists implicitly nonetheless).
4.4 Mode of inheritance-specific factors
In this work, only monogenic biallelic autosomal genetic diseases are used in examples
and explanations. However, the model here proposed is also compatible with other sce-
narios.
For example, in the same context of autosomal diseases, both more possible alleles
and more genes are supported. In the first case, a Profile tensor, genotype probabilities
arrays, etc. adapted to the changes (more possibilities) is enough for being able to make
predictions in diseases in which more than two classes of alleles are significant for the
analysis of the risk of the disease. In the second case, if more than one gene is involved,
more predictions in parallel are enough for multiple genes to be considered. In this case,
the phenotype-genotype calculation must be adapted, so that more than one genotype is
analyzed to determine the phenotype probabilities.
Changing from autosomal diseases to sex-linked ones, those caused by mutations in
the sex chromosomes, more substantial changes have to be made. In the case of Y-linked
diseases, the method here explained is better than any rule-based system only for those
diseases with incomplete penetrance. The profile tensor can now have only two dimen-
sions (only the father’s dimensions). Some adaptations in the rest of the operations are
needed. In the case of X-linked diseases, more variables have to be analyzed. Since X-
linked diseases have different behaviors in men and women [2], the treatment is to be
different in each of those cases. In the case of men, the X chromosome is only inherited
from the mother. The Profile tensor is thus two-dimensional, only containing the axes
corresponding to the mother. In the case of women, one of the X chromosomes is inherited
from the father, who has a single X chromosome to transmit, while the other is inherited
from the mother like an autosome. For the prediction, therefore, what is needed is only an
adapted Profile tensor and a modified phenotype-genotype transformation.
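As an illustration of the two-dimensional Profile for sons in X-linked traits, the sketch below keeps only the mother's genotype and the allele she transmits. The names `profile_son` and `son_distribution`, and the representation of a son's genotype as a single allele, are assumptions made for this example.

```python
MOTHER_GENOTYPES = ["AA", "Aa", "aa"]
ALLELES = ["A", "a"]

# profile_son[m][l] = P(son inherits X-linked allele l | mother has genotype m)
profile_son = [[genotype.count(allele) / 2 for allele in ALLELES]
               for genotype in MOTHER_GENOTYPES]


def son_distribution(g_mother):
    """Distribution over the single X-linked allele carried by a son."""
    return [sum(g_mother[m] * profile_son[m][l] for m in range(len(g_mother)))
            for l in range(len(ALLELES))]


# Mother is AA or Aa with equal probability (illustrative values).
son = son_distribution([0.5, 0.5, 0.0])
```

With these values, the son carries A with probability 0.75 and a with probability 0.25; the father's genotype plays no role.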
When two or more genes are in the same chromosome, the use of
dynamic Profile tensors for each of the alleles can be key to finding the most likely
genotypes for each individual. Genetic linkage can be treated in the same
way. When genetic linkage occurs, the randomness of meiosis is partly lost because of
the tendency to inherit certain chromosomes always from the same grandparents.
In the case of mitochondrial diseases, although few are the diseases that result in
viable human beings when the percentage of mitochondria that carry the mutation is high [2],
mitochondria are always inherited only from the mother. The model proposed in this work
should theoretically work well for mitochondrial diseases when the mother’s
mitochondria are very likely to be mutated, or when the percentage is at least known. In terms
of the model, the changes in the Profile tensor are similar to those for Y-linked diseases, but
from the mother.
One example of a type of genetic disease that is not supported by the model proposed
here is chromosomal diseases. This is because they are based on the existence of
multiple copies of the same alleles or the absence of any of them.
4.5 Information contribution prediction
Some previous approaches to the prediction of genetic traits from family-related genetic
information apply models that take into consideration more parameters than the ones
described here. Guo et al. (1994) [33], for example, formalize a mixed-model method based
on fixed and random parameters, as well as on both genetic and nongenetic factors. In
terms of prediction, it is very useful. However, in a clinical scenario, fine-tuning such
parameters is complex and, more likely than not, uncertain. This renders such a method
inappropriate for clinical use.
To actually assist the medical professional, an algorithm that readily predicts which
individuals are most likely to provide information that narrows the likelihoods of all the
other individuals is more adequate than a precise but overly complex Maximum Likelihood
Estimation. Therefore, in this work, a difference-based perspective is applied. Instead of
performing one precise but excessively complex operation, the algorithm performs two
different operations, one which is underconfident (method A) and one which is
overconfident (method B). Both represent local optima of the search space. The difference
between the predictions of the two methods expresses how uncertain we can be about a
person. The individuals with the most uncertain predictions are assumed to be the most
likely to provide useful information to the genetic screening process. The medical
professional can thus visualize this information and, based on the suggestions given and
the prediction intervals obtained, make a decision.
Method B is the EM algorithm described above with the three modes of information
propagation. Method A is very similar, but uses only the downwards propagation and upwards
constraint propagation modes. As a consequence, method A is composed only of those modes
which do not calculate expectations in an inverse manner. Information therefore travels
upwards in the family graph only by constraint propagation, which has a fainter but more
decisive effect than upwards propagation. Both results are intended to be shown to the
user for each individual, so that the interval between the two predictions can be
observed. In addition, a distance measure can be computed so that a list of individuals
sorted by lack of information can be obtained.
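The ranking by lack of information can be sketched as follows. The text does not fix a particular distance measure, so the total variation distance used here is an assumption, as are the function names and the example values, which are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions (assumed measure)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def rank_by_uncertainty(method_a, method_b):
    """Sort individuals by the distance between their method A and method B predictions.

    Both arguments map an individual to a (P(AA), P(Aa), P(aa)) tuple.
    The most uncertain individual (largest distance) comes first.
    """
    scores = {
        person: total_variation(method_a[person], method_b[person])
        for person in method_a
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical predictions for one triad.
method_a = {
    "father": (1.0, 0.0, 0.0),
    "mother": (0.97, 0.03, 0.0),
    "child": (0.98, 0.02, 0.0),
}
method_b = {
    "father": (1.0, 0.0, 0.0),
    "mother": (0.20, 0.59, 0.21),
    "child": (0.50, 0.50, 0.0),
}
ranking = rank_by_uncertainty(method_a, method_b)
```

With these values the latent mother is listed first, since the two methods disagree most about her, and the fully constrained father is listed last.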
5 IMPLEMENTATION AND EVALUATION PROCEDURE
In order to complete one of the objectives of this work, as well as to evaluate the useful-
ness of the method here introduced in actual clinical practice, I implemented the genetic
information propagation as a module of genoDraw.
In this implementation, so as to evaluate the method as clearly as possible, two modes of
inheritance are considered: monogenic biallelic autosomal recessive and monogenic
biallelic autosomal dominant. These types were chosen due to their conceptual simplicity
and direct compatibility with the method. Although only a small selection of modes of
inheritance is implemented, adding more modes is straightforward, as explained in the
Methods section.
The next subsection is dedicated to the details and intricacies of the implementation
itself. After that, the testing strategy is explained.
5.1 Implementation
The method for propagating genetic information in the family graph, whose objective is to
predict genotype and phenotype distributions for individuals, was implemented for two
types of inheritance as a module for genoDraw. Since genoDraw is a web-based PDS,
technologies of the web stack had to be chosen for this implementation.
The implementation is described in three parts. The first part corresponds to the
essential calculations, that is, to the parts that perform the propagation of information
itself. The second part covers the supportive elements, which deal with the existing
components of genoDraw and retrieve data from databases. This retrieval is done in order
to allow genoDraw to be integrated with biomedical vocabularies and to add the
functionality of suggesting inheritance modes depending on the specific disease being
analyzed. The third part is a brief description of the resulting user interface of the
genoDraw module. Each of the next subsections is dedicated to one of these parts.
5.1.1 Implementation of the genetic information propagation algorithm
The genetic information propagation algorithm was implemented inside the module dedicated
to Risk Assessment in genoDraw, using JavaScript as the core language and TensorFlow.js as
a library. JavaScript is a programming language in widespread use. As of now, it is by far
the most popular programming language for web development and one of only a handful
supported by modern web browsers.
TensorFlow.js is also a unique asset in its class. It is a JavaScript library which brings
access to GPU processing in browsers via WebGL. TensorFlow.js is intended as a means of
enabling Machine Learning applications to run in web-based environments, locally on the
user's device. In the specific case of this work, the capabilities of TensorFlow.js for
accelerating machine learning algorithms are of no importance. However, its ready-made
functions for managing and manipulating numerical tensors are features that are not
readily found in other JavaScript libraries. Hence the use of TensorFlow.js in the
implementation of this non-machine-learning method.
The method was implemented as a loop with a limit on the number of iterations and with
convergence detection (so as to terminate the execution early in the case of
convergence). Inside the loop, a selection of propagation methods can be run. They are
executed, whenever possible, for each triplet of two parents and one child. Then, the
stack of candidate genotype probability distributions is collapsed for each individual,
yielding a new estimated distribution for every person (likelihood estimation).
At the end of the prediction, the predicted genotype distributions are assigned to the
profile of each individual. Then, a prediction of the distribution of phenotypes is
calculated.
The whole process described here is done twice, once for each mode of calculation
(underconfident and overconfident).
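The loop structure described above can be sketched in Python as follows. This is not the TensorFlow.js code of the module: the function names, the `fixed` parameter, the tolerance value and, in particular, the way candidates are collapsed (a normalized element-wise mean) are assumptions made only to obtain a runnable illustration. Only a downwards Mendelian propagator is included as an example.

```python
def gamete(parent):
    """P(transmit A), P(transmit a) for a parent given as (P(AA), P(Aa), P(aa))."""
    p_AA, p_Aa, p_aa = parent
    return p_AA + 0.5 * p_Aa, p_aa + 0.5 * p_Aa


def downwards(dists, father, mother, child):
    """Example propagator: a candidate distribution for the child from both parents."""
    f_A, f_a = gamete(dists[father])
    m_A, m_a = gamete(dists[mother])
    return {child: (f_A * m_A, f_A * m_a + f_a * m_A, f_a * m_a)}


def collapse(stack):
    """ASSUMPTION: collapse the candidate stack by a normalized element-wise mean."""
    sums = [sum(values) for values in zip(*stack)]
    total = sum(sums)
    return tuple(s / total for s in sums)


def propagate(dists, triplets, propagators, fixed=(), max_iterations=50, tolerance=1e-6):
    """Iterate the propagators over all triplets until convergence or the iteration limit."""
    dists = dict(dists)
    for iteration in range(1, max_iterations + 1):
        candidates = {person: [] for person in dists}
        for father, mother, child in triplets:
            for propagator in propagators:
                result = propagator(dists, father, mother, child)
                for person, candidate in result.items():
                    candidates[person].append(candidate)
        updated = {}
        for person, current in dists.items():
            if person in fixed or not candidates[person]:
                updated[person] = current  # constrained or untouched individual
            else:
                updated[person] = collapse(candidates[person])
        change = max(
            abs(new - old)
            for person in dists
            for new, old in zip(updated[person], dists[person])
        )
        dists = updated
        if change < tolerance:
            break  # convergence detected: stop before the iteration limit
    return dists, iteration


# Hypothetical triad: father AA, mother Aa, latent child starting from a prior.
start = {
    "father": (1.0, 0.0, 0.0),
    "mother": (0.0, 1.0, 0.0),
    "child": (0.969, 0.03, 0.001),
}
result, iterations = propagate(
    start, [("father", "mother", "child")], [downwards], fixed={"father", "mother"}
)
```

In this example the child settles at an equal split between AA and Aa, and the loop stops at the second iteration because nothing changes any more.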
5.1.2 Implementation of the supportive elements
As supportive elements, genoDraw makes use of some internal and external resources. On the
one hand, the drawing of the pedigree is fundamental for displaying the results of the
calculations to the user. On the other hand, data must be fetched from external databases
in order to provide information about diseases (i.e., the codes to which they are attached
and their modes of inheritance).
As reported in [43], the drawing of the pedigree is made semi-automatically in the
user’s device, following a unique three-step process. The pedigree drawing module is
implemented in JavaScript and makes use of WebCola [45] and D3.js [46] as libraries.
The present module then draws the results on the nodes already represented on the screen.
External resources are retrieved from a relational database containing the information of
UMLS [20] for three biomedical vocabularies pointed out by the genetics team of 12 de
Octubre Hospital: SNOMED-CT [21], OMIM [22] and HPO [23]. As also mentioned in [1] and
[43], this database holds information on many diseases, whose codes can be linked to
patients as a means of registering their status for each of the diseases of interest in a
pedigree. As discussed in Section 2.4, the code is enough for this attachment. However,
the registered code is also used to retrieve, from the same database, the modes of
inheritance linked to the disease. Nongenetic diseases, and even some diseases known to
have genetic components, do not have any mode of inheritance attached. For this reason,
the modes of inheritance are retrieved only as a hint to the user, who may choose any
other mode of inheritance to analyze families.
The communication between the client application and the database goes through a server
implemented in Node.js. This server exposes a REST API, is enabled to operate over HTTPS,
and handles user authentication via username and password. User data is stored in a
MongoDB database, and passwords are hashed before being stored. The UMLS database, which
is a MySQL database, is accessed from this server.
The internal graph files which correspond to the pedigrees are JSON (JavaScript Object
Notation) files managed locally by the user. An internal graph can be loaded into the
platform but is never uploaded to the server, for confidentiality reasons. After making
the desired changes or visualizing the pedigree, the user can choose to download the
internal graph as a file, which can be loaded into the platform again in the future. If
desired, the server can also be configured to store family graphs, a feature which is
deactivated in the current deployment on www.genodraw.com.
Figure 14: Architecture of genoDraw.
5.1.3 User Interface
As thoroughly described in [1] and [43], genoDraw is a PDS which automatically draws
pedigrees in a canvas, requiring only minimal interaction from the user for adjustments.
In this implementation, a new mode was developed which allows for risk assessment. In
this mode, the pedigree is drawn automatically in a canvas, as in all the other modes
(normal and editing). Then, by using a sidebar, the user is able to select specific
inheritance modes, with some suggestions retrieved from UMLS. The user is also able to
select the specific penetrance and allele frequency in a submenu. A button for launching
the prediction is also present in the sidebar. When the prediction is calculated, its
results are shown below the nodes of the pedigree, under their corresponding individuals
(Figure 15). A list of the individuals for whom not enough genetic information is known is
also composed, sorted by lack of information, that is, by the difference between methods A
and B. From this list, the user can assign phenotypes to the individuals, but this task
can also be done directly in the canvas through the use of context menus (Figure 16).
Figure 15: Screenshot of the risk assessment mode of genoDraw. As we can see, individ-
uals are annotated, after the prediction, with two possible distributions (methods A and
B). On the right, the sidebar is shown, from which the settings for the prediction are set.
Figure 16: Screenshot of the risk assessment mode of genoDraw. Context menus can be used
to assign phenotypes to individuals and to unassign them. In this specific case, the user
is about to assign individual D the status of carrier for the disease being analyzed.
5.2 Evaluation procedure
To evaluate the method as a whole and the module implemented in genoDraw, an analysis
of results was performed for some examples. The examples range from simple situations,
in which there are only two parents and one child of theirs, to complex situations. The
simple examples always contain latent individuals and are composed of cases in which
information is expected to flow upwards in the family graph and cases in which the down-
wards flow is expected.
The analysis of simple examples consists of comparing the estimated results with the
aforementioned biological model, also referred to as part of the gene counting method by
other authors [26, 37]. That is, the model based on Punnett squares and widely adopted in
current clinical practice. Since autosomal dominant and autosomal recessive models are
implemented, the tests were performed for both cases. For each of the cases, a solution
for full penetrance and a solution for penetrance at 60% were given.
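The Punnett-square baseline used for the simple examples can be illustrated with a short Python sketch. It handles only parents with definite genotypes, written as two-letter strings; the function name and this string encoding are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def punnett(father, mother):
    """Genotype distribution of a child for definite parental genotypes such as 'Aa'.

    Each cell of the Punnett square pairs one paternal allele with one maternal
    allele; all cells are equally likely.
    """
    counts = Counter()
    for paternal, maternal in product(father, mother):
        # 'aA' and 'Aa' are the same unordered genotype, so the alleles are
        # sorted ('A' sorts before 'a') and their cells are added together.
        genotype = "".join(sorted(paternal + maternal))
        counts[genotype] += 1
    cells = sum(counts.values())
    return {genotype: count / cells for genotype, count in counts.items()}


two_carriers = punnett("Aa", "Aa")
one_carrier = punnett("AA", "Aa")
```

For two heterozygous parents the four ordered cells collapse into 25% AA, 50% Aa and 25% aa, and for a homozygous AA parent with a heterozygous partner the child is AA or Aa with equal probability.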
For more complex cases, which consist of large and complex pedigrees, a different approach
was taken. Even though the cases were still separated by inheritance mode, an analysis
using the biological model is not feasible, since that model does not, by itself, produce
acceptable results. Since one of the main aims of the method described here is to assist
in the detection of individuals of high influence on the genetic distribution of the
family, the evaluation was performed by starting from a partially latent family, removing
the information of some individuals and evaluating the changes in the predictions for the
family, as well as discussing possible issues and other pertinent details.
genoDraw is an online tool, intended to be used by medical professionals as part of their
routine. Therefore, good usability is required for its success in the intended scenario.
This usability was assessed in previous publications for the creation and visualization
parts [1, 43]. Since the addition of the mode reported in this work causes only slight
changes in the tool itself, and the method presented here is the most essential part of
this thesis, no usability evaluations are presented in the next section.
6 RESULTS
In this section, the results obtained by following the testing procedures described in Sec-
tion 5.2 are presented. In all, it contains the calculations for a set of cases for atomic
pedigrees (triples of a father, a mother and a child), as well as more complex cases, both
with full penetrance and partial penetrance.
The cases associated with atomic families are presented in the next subsection. Then,
more complex cases are separated into another subsection.
6.1 Atomic family cases
The results for full penetrance in biallelic monogenic autosomal recessive traits are
included in Table 4. Every three rows represent an atomic family. The columns contain a
number identifying each family, the individual to which the row refers, their phenotype,
their initial predicted genotype distribution, their predicted genotype distribution at
the end of the calculations and their predicted phenotype. In each of the predicted
columns, results are divided in two: those calculated using only downwards probability
propagation and upwards constraint propagation (method A), and those calculated using the
three proposed propagation methods (method B). In this table, since penetrance is
complete, the results of the last two sets of three columns are the same.
In the first family included, all individuals are latent. Thus, no information should be
obtained from the processing of their data. As we can see, method A causes almost no
change to the distributions. Method B, on the other hand, transforms the distributions of
all the individuals by calculating the most expected case for the child. Since there is no
constraint, the most expected case is that no information is known for the child: if there
are 4 possible combinations of ordered pairs of alleles, each is assigned a 25%
probability, and since, in terms of unordered pairs, Aa is the same as aA, their
probabilities are added, resulting in a 50% probability. A reasonable approach could
consider both situations correct, since no information is known for any individual.
Therefore, methods A and B are both correct for cases in which no information is given.
The second family includes a constraint. The father is defined as not affected by the
disease in question. That is, the father is known to be neither affected nor a carrier of
the trait. Therefore, in the case of full penetrance, the only possible genotype he can
have is AA. As we can see in the results, method A again uses the prior information of the
mother and the constraint on the father to calculate the distribution of the child. It
reaches a correct result in which the child is known not to be affected: the father cannot
transmit the recessive allele to the child, because his genome does not contain it. Method
B, on the other hand, predicts what is most likely from the perspective of the child. That
is, in a situation with no information about one of the parents but in which the father is
known to be AA, the child is most expected to be either AA or Aa, and each is predicted at
50%.
Cases 3 and 4 represent similar situations and are not commented on here. Suffice it to
say that both methods A and B, each from its own perspective, are correct.
Case 5 shows that the prediction for the latent child is equal for both methods when
constraints are present for both parents, that is, when their genotype distributions are
fully defined. In the biological model for monogenic biallelic autosomal recessive
diseases, when one parent is a carrier and the other parent is known to be homozygous and
not affected (not a carrier), their child will always be either a carrier or also
homozygous, with equal probability. Since methods A and B reach the same result, both are
correct in case 5.
Cases 6, 7 and 8 contain situations in which information is expected to flow upwards in
the family graph. That is, information is given for a child and we want it to affect the
genetic distributions of the parents. These situations are what justify the existence of
an upwards propagation model, as well as of separate methods A and B. In case 6, the
effect of constraint propagation is noticeable. As we can see, since penetrance is
complete and the child is affected, neither parent can be completely unaffected (AA);
otherwise, the mutated gene would not reach the child. Therefore, according to method A,
they are most likely to be carriers. Method B, on the other hand, estimates that, given no
information about the parents, the most likely scenario is that both are very likely to
also be affected. We know that this is an exaggerated estimation. Nonetheless, it is
indeed the most likely scenario if a child is affected and we have no information about
the parents. Case 8 yields a similar result. Case 7, in contrast, estimates nearly a
25%/50%/25% distribution, since the child is asserted to be a carrier of the disease. That
is, their genotype is Aa.
In Table 5, calculations for 60% penetrance in monogenic biallelic autosomal recessive
diseases are included. Its columns have the same meaning as in the previously described
full penetrance table. However, predicted genotypes and predicted phenotypes are no longer
equal.
In the first case, latent individuals have the same genotype probabilities as in any other
case, since no phenotypes constrain the distributions of the persons in the family. The
predicted phenotypes are calculated from the predicted genotype distributions and the
penetrance factor.
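The step from a genotype distribution to a phenotype distribution can be sketched as below. The mapping is an assumption inferred from the rounded values reported in Tables 5 and 7 (for example, genotype 0.21/0.59/0.20 with 60% penetrance gives phenotype 0.21/0.67/0.12 in the recessive case); it is not taken from the source code of the module. The function name and the convention that A is the disease-causing allele in the dominant case are likewise assumptions.

```python
def phenotype_distribution(genotype, penetrance, dominant=False):
    """Map (P(AA), P(Aa), P(aa)) to (P(not affected), P(carrier), P(affected)).

    Recessive case: allele a is disease-causing, so only aa can show the trait.
    Dominant case (assumed convention): allele A is disease-causing, so AA and
    Aa can show the trait. Genotypes that carry the disease but do not express
    it are counted as carriers.
    """
    p_AA, p_Aa, p_aa = genotype
    if dominant:
        at_risk = p_AA + p_Aa
        return (p_aa, (1 - penetrance) * at_risk, penetrance * at_risk)
    return (p_AA, p_Aa + (1 - penetrance) * p_aa, penetrance * p_aa)


recessive_60 = phenotype_distribution((0.21, 0.59, 0.20), 0.6)
dominant_60 = phenotype_distribution((0.20, 0.59, 0.21), 0.6, dominant=True)
recessive_full = phenotype_distribution((0.25, 0.50, 0.25), 1.0)
```

With full penetrance the phenotype distribution of a recessive disease equals the genotype distribution, which is why the last two column groups coincide in the full penetrance table.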
Case 2 contains a phenotype which was not shown in the full penetrance scenario, although
it is no less important for full penetrance disease cases. Essentially, if a person is
annotated as possibly carrier, it means that the person is either not affected (or not
known to be so, as discussed previously) or a carrier. In this work, I treat this pair of
possibilities as a 50%/50% ratio of probabilities. In the case of 60% penetrance, since
there is a probability that the person in question is aa and still does not present the
characteristic, the distribution is shifted accordingly to represent the scenario
correctly. As in the full penetrance results, method A predicts the child given the
parents, which shifts the child towards a higher probability of being a carrier, although
not of being affected, since the mother is assumed latent. Also similarly to the full
penetrance results, method B predicts the parents based on the child, so that all the
cases are as equally distributed as possible. The phenotype distributions are calculated
from the obtained results, considering, as described earlier, the penetrance factor and
the genotype distribution. As we can see, the probabilities of actually being affected are
slightly decreased, as expected in incomplete penetrance cases. A similar scenario is
observed in case 3.
Case 4, on the other hand, is composed of two latent parents and a child who is asserted
to be possibly carrier. As in the previous cases, method A does not make any changes to
the initial genotypes of the parents, and therefore provides no new information about the
latent parents, even though it is known that their child is possibly a carrier. Method B,
however, searches for the most likely scenario given the most equally distributed genotype
for the child. Of course, a child being possibly a carrier is still not sufficient
information to pinpoint the genotype distributions of the parents. Nonetheless, methods A
and B provide results that allow a better view of the possible scenarios in a family.
In Tables 6 and 7, results for full penetrance and incomplete penetrance (60%) are in-
cluded for monogenic biallelic autosomal dominant diseases. As discussed before, domi-
nant diseases are the ones observable in the phenotype of an individual when at least one
allele is disease-causing (“mutated”). If they are both disease-causing, the symptoms tend
to be more pronounced [2].
As far as full penetrance is concerned, cases 1, 2 and 3 of Table 6 are examples. As
before, methods A and B perform differently but in a complementary manner. A peculiarity
of autosomal dominant diseases with full penetrance is that no individual can be a carrier
and not affected (hence the carrier column, which is composed entirely of zeros). In case
1, all individuals are latent. Method A predicts no changes, and method B predicts as
equally distributed a distribution as possible. Case 2 is similar. However, the father is
known not to be affected. Hence, since the mother is very unlikely to be affected, method
A predicts the child to be not affected at 98% and affected at 2%. Method B, on the other
hand, finds its balance in the child having a 50% probability of being Aa and a 50%
probability of being aa, since it can do no better than eliminate the possibility of the
child being AA.
For the more general cases of incomplete penetrance, case 1 of Table 7 is a fully latent
example. As we can see, the calculated genotype distribution is the same as for complete
penetrance. The predicted phenotype, however, is quite different. Method A, as usual,
estimates genotypes no different from the initial ones. Method B again searches for the
most uniform distribution (considering the Aa/aA imbalance previously mentioned). Case 2
is very similar to other previously discussed cases. Case 3, however, is an interesting
example of how the balance within the probabilities can be important for the downwards
prediction. In case 3, the father is affected and the mother is latent. However, the
father is much more likely to be Aa than AA, as observable in the population (allele
frequency) [2]. The consequence is that the child is slightly more likely to be
heterozygous than homozygous for the non-disease-causing allele, and only 1% likely to be
homozygous for the disease-causing allele.
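The claim that an affected father is much more likely to be Aa than AA can be checked with a small Bayesian sketch. Conditioning the population prior on the observed phenotype, as done here, is an assumption about how the initial genotype of an affected individual may be obtained; the function name is hypothetical. The prior is the latent (AA, Aa, aa) distribution quoted in the tables, and A is taken as the disease-causing allele.

```python
def condition_on_phenotype(prior, likelihood):
    """Posterior genotype distribution: prior times P(phenotype | genotype), renormalized."""
    weighted = [p * l for p, l in zip(prior, likelihood)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("the phenotype is impossible under this prior")
    return tuple(w / total for w in weighted)


prior = (0.001, 0.03, 0.969)          # (AA, Aa, aa) for a latent individual
affected_dominant = (0.6, 0.6, 0.0)   # P(affected | genotype) at 60% penetrance
posterior = condition_on_phenotype(prior, affected_dominant)
```

Rounded to two decimals, the posterior is about 0.03 for AA and 0.97 for Aa, which agrees with the initial genotype reported for the affected father in Tables 6 and 7. Because the penetrance factor is the same for AA and Aa, it cancels in the normalization.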
                                   Initial genotype    Predicted genotype (A/B)            Predicted phenotype (A/B)
Fam. Individual  Phenotype         AA     Aa    aa     AA          Aa          aa          not affected  carrier     affected
1    father      latent            0.969  0.03  0.001  0.97/0.21   0.03/0.59   0.001/0.20  0.97/0.21     0.03/0.59   0.001/0.20
     mother      latent            0.969  0.03  0.001  0.97/0.21   0.03/0.59   0.001/0.20  0.97/0.21     0.03/0.59   0.001/0.20
     child       latent            0.969  0.03  0.001  0.97/0.25   0.03/0.50   0.000/0.25  0.97/0.25     0.03/0.50   0.000/0.25
2    father      not affected      1      0     0      1/1         0/0         0/0         1/1           0/0         0/0
     mother      latent            0.969  0.03  0.001  0.97/0.20   0.03/0.59   0.001/0.2   0.97/0.20     0.03/0.59   0.001/0.2
     child       latent            0.969  0.03  0.001  0.98/0.50   0.02/0.50   0/0         0.98/0.50     0.02/0.50   0/0
3    father      affected          0      0     1      0           0           1           0             0           1
     mother      latent            0.969  0.03  0.001  0.97/0.20   0.03/0.60   0.001/0.20  0.97/0.20     0.03/0.60   0.001/0.20
     child       latent            0.969  0.03  0.001  0/0         0.98/0.50   0.02/0.50   0/0           0.98/0.50   0.02/0.50
4    father      not affected      1      0     0      1/1         0/0         0/0         1/1           0/0         0/0
     mother      affected          0      0     1      0/0         0/0         1/1         0/0           0/0         1/1
     child       latent            0.969  0.03  0.01   0/0         1/1         0/0         0/0           1/1         0/0
5    father      not affected      1      0     0      1/1         0/0         0/0         1/1           0/0         0/0
     mother      carrier           0      1     0      0/0         1/1         0/0         0/0           1/1         0/0
     child       latent            0.969  0.03  0.001  0.50/0.50   0.50/0.50   0/0         0.50/0.50     0.50/0.50   0/0
6    father      latent            0.969  0.03  0.001  0.00/0.00   0.97/0.01   0.03/0.99   0.00/0.00     0.97/0.01   0.03/0.99
     mother      latent            0.969  0.03  0.001  0.00/0.00   0.97/0.01   0.03/0.99   0.00/0.00     0.97/0.01   0.03/0.99
     child       affected          0      0     1      0/0         0/0         1/1         0/0           0/0         1/1
7    father      latent            0.969  0.03  0.001  0.97/0.23   0.03/0.55   0.001/0.22  0.97/0.23     0.03/0.55   0.001/0.22
     mother      latent            0.969  0.03  0.001  0.97/0.23   0.03/0.55   0.001/0.22  0.97/0.23     0.03/0.55   0.001/0.22
     child       carrier           0      1     0      0/0         1/1         0/0         0/0           1/1         0/0
8    father      latent            0.969  0.03  0.001  0.97/1.00   0.03/0.00   0.001/0.00  0.97/1.00     0.03/0.00   0.001/0.00
     mother      latent            0.969  0.03  0.001  0.97/1.00   0.03/0.00   0.001/0.00  0.97/1.00     0.03/0.00   0.001/0.00
     child       not affected      1      0     0      1/1         0/0         0/0         1/1           0/0         0/0
Table 4: Examples of executions for a monogenic biallelic autosomal recessive disease with full penetrance.
                                   Initial genotype    Predicted genotype (A/B)            Predicted phenotype (A/B)
Fam. Individual  Phenotype         AA     Aa    aa     AA          Aa          aa          not affected  carrier     affected
1    father      latent            0.969  0.03  0.01   0.97/0.21   0.03/0.59   0.001/0.20  0.97/0.21     0.03/0.67   0.001/0.12
     mother      latent            0.969  0.03  0.01   0.97/0.21   0.03/0.59   0.001/0.20  0.97/0.21     0.03/0.67   0.001/0.12
     child       latent            0.969  0.03  0.01   0.97/0.26   0.03/0.50   0.00/0.24   0.97/0.26     0.03/0.60   0.00/0.15
2    father      possibly carrier  0.5    0.36  0.14   0.50/0.25   0.36/0.45   0.14/0.30   0.50/0.25     0.41/0.57   0.09/0.18
     mother      latent            0.969  0.03  0.01   0.97/0.23   0.03/0.58   0.00/0.18   0.97/0.23     0.03/0.65   0.00/0.11
     child       latent            0.969  0.03  0.01   0.67/0.26   0.32/0.50   0.01/0.24   0.67/0.26     0.33/0.60   0.00/0.14
3    father      affected          0      0     1      0/0         0/0         1/1         0/0           0/0         1/1
     mother      latent            0.969  0.03  0.01   0.97/0.21   0.03/0.59   0.001/0.20  0.97/0.21     0.03/0.67   0.00/0.12
     child       latent            0.969  0.03  0.01   0.00/0.00   0.98/0.50   0.02/0.50   0.00/0.00     0.99/0.70   0.01/0.30
4    father      latent            0.969  0.03  0.01   0.97/0.28   0.03/0.47   0.00/0.25   0.97/0.28     0.03/0.57   0.00/0.15
     mother      latent            0.969  0.03  0.01   0.97/0.28   0.03/0.47   0.00/0.25   0.97/0.28     0.03/0.57   0.00/0.15
     child       possibly carrier  0.5    0.36  0.14   0.97/0.27   0.03/0.50   0.00/0.23   0.97/0.27     0.03/0.59   0.00/0.14
Table 5: Examples of executions for a monogenic biallelic autosomal recessive disease with penetrance 60%.
                                   Initial genotype    Predicted genotype (A/B)            Predicted phenotype (A/B)
Fam. Individual  Phenotype         AA     Aa    aa     AA          Aa          aa          not affected  carrier     affected
1    father      latent            0.001  0.03  0.969  0.001/0.20  0.03/0.59   0.97/0.21   0.97/0.21     0.00/0.00   0.03/0.79
     mother      latent            0.001  0.03  0.969  0.001/0.20  0.03/0.59   0.97/0.21   0.97/0.21     0.00/0.00   0.03/0.79
     child       latent            0.001  0.03  0.969  0.00/0.24   0.03/0.50   0.97/0.26   0.97/0.26     0.00/0.00   0.03/0.74
2    father      not affected      0      0     1      0/0         0/0         1/1         1/1           0/0         0/0
     mother      latent            0.001  0.03  0.969  0.001/0.20  0.03/0.59   0.97/0.21   0.97/0.21     0.00/0.00   0.03/0.79
     child       latent            0.001  0.03  0.969  0.00/0.00   0.02/0.50   0.98/0.50   0.98/0.50     0.00/0.00   0.02/0.50
3    father      affected          0.03   0.97  0      0.03/0.08   0.97/0.92   0/0         0.00/0.00     0.00/0.00   1.00/1.00
     mother      latent            0.001  0.03  0.969  0.001/0.18  0.03/0.57   0.97/0.24   0.97/0.23     0.00/0.00   0.03/0.77
     child       latent            0.001  0.03  0.969  0.01/0.26   0.51/0.50   0.47/0.24   0.48/0.24     0.00/0.00   0.52/0.76
Table 6: Examples of executions for a monogenic biallelic autosomal dominant disease with full penetrance.
                                   Initial genotype    Predicted genotype (A/B)            Predicted phenotype (A/B)
Fam. Individual  Phenotype         AA     Aa    aa     AA          Aa          aa          not affected  carrier     affected
1    father      latent            0.001  0.03  0.969  0.00/0.20   0.03/0.59   0.97/0.21   0.97/0.21     0.01/0.32   0.02/0.47
     mother      latent            0.001  0.03  0.969  0.00/0.20   0.03/0.59   0.97/0.21   0.97/0.21     0.01/0.32   0.02/0.47
     child       latent            0.001  0.03  0.969  0.00/0.24   0.03/0.50   0.97/0.26   0.97/0.26     0.01/0.30   0.02/0.44
2    father      possibly carrier  0.22   0.22  0.56   0.22/0.39   0.22/0.28   0.56/0.33   0.56/0.33     0.18/0.27   0.27/0.40
     mother      latent            0.001  0.03  0.969  0.00/0.19   0.03/0.58   0.97/0.22   0.97/0.22     0.01/0.31   0.02/0.47
     child       latent            0.001  0.03  0.969  0.01/0.25   0.34/0.50   0.66/0.25   0.66/0.25     0.14/0.30   0.21/0.45
3    father      affected          0.03   0.97  0      0.03/0.08   0.97/0.92   0/0         0/0           0/0         1/1
     mother      latent            0.001  0.03  0.969  0.00/0.19   0.03/0.58   0.97/0.23   0.97/0.23     0.01/0.31   0.02/0.46
     child       latent            0.001  0.03  0.969  0.01/0.26   0.51/0.50   0.47/0.24   0.48/0.24     0.21/0.30   0.31/0.46
Table 7: Examples of executions for a monogenic biallelic autosomal dominant disease with penetrance 60%.
6.2 More complex cases
The objectives of this work include devising a method which can help identify the
individuals who may add the most information to the whole family. In this work, this is
done by analyzing the difference between two predictive methods for each individual. As a
side effect, eliminating the information of some individuals indicates the efficacy of the
method. In this section, I present the results obtained for three hypothetical pedigrees
by making use of this technique in more complex scenarios. Some authors call a pedigree
complex when it presents loops in the family. In this work, I use the term for non-triad
families.
Case 1 The first case included as more complex is that of a family composed of two
parents and two children. It is very simple in terms of structure. However, for the sake
of this work, demonstrating that the triads interact is of interest. As depicted in Figure
17a, individual C is affected by an autosomal recessive disease, represented in grey, of
which every other related individual is a carrier. In Figure 17b, I omit that D is a
carrier. The calculations predict that her genotype probability distribution is, in fact,
as uniform as possible, as shown earlier. Next, I remove the information present for
individual A (Figure 17c). Since individual C is affected, A cannot be both not affected
and not a carrier, which is observed in his prediction. The probability that A is affected
thus rises to 56%, which also raises the probability that D is affected to almost 40%,
according to method B. Figure 17d is the result of eliminating the information that the
mother (B) is a carrier. As we can see, A is now less likely to be affected, while B is
more likely to be so. This imbalance affects the whole family, indicating that the amount
of latent information may have exceeded an acceptable level.
Case 2 The second case is composed of a large but simple pedigree. The simplicity here
refers to the lack of loops in the family.
This case is represented in Figure 18. As we can see, individuals C and K are carriers
of an autosomal recessive trait with full penetrance, G and R are affected by such trait and
D is not affected, although we do not know if she is carrier or not.
According to the predictions, however, individual D is certain to be a carrier, since G is
affected and propagates a constraint that limits her to being a carrier or affected.
After omitting that C is a carrier and that D is not affected, we reach the situation
depicted in Figure 19. As we can see, B is much more likely to be a carrier according to
method B, which causes F to be more likely to be affected, and E to be less likely to be
so.
Case 3 Lastly, the third case is a pedigree that can be considered as a complex pedigree
in the sense most authors in the literature use. It is comprised of a family in which a
60
(a) (b)
(c) (d)
Figure 17: Visualization of the progressive removal of information in case 1.
61
Figure 18: Initial situation of complex case 2.
62
Figure 19: Complex case 2 after the omission of some information.
Figure 20: Initial situation of complex case 3.
relationship between two individuals of the same family exists. Such family is represented
in Figure 20.
Automatically finding carriers is a capability that can be of interest in clinical practice.
Finding them in many cases requires a genome analysis, which is expensive and sometimes
of no practical use. In this case, I am going to omit, one by one, all the carriers of
this hypothetical family, so as to evaluate the behavior of the method.
In the first omission (Figure 21), since A is latent and B is affected, method A predicts
that E is almost sure to be carrier. Method B is more aggressive, estimating a 50% chance
that E is in fact affected.
After the second omission (Figure 22), N is less defined, since her probability of being
affected increases. A is also more likely to be affected.
The third omission (Figure 23) is that H is not registered as a carrier. Therefore, almost
Figure 21: Complex case 3 after the omission that E is carrier.
Figure 22: Complex case 3 after the omission that K is carrier.
Figure 23: Complex case 3 after the omission that H is carrier.
the whole family has their genotype distributions free, except B and O. However, not
much is changed in terms of phenotype distributions. Q, who is in gestation, was initially
at a risk between 25% and 48% of being affected and is now at a risk between 26% and
52%, which is not a big change. This is due to O being an older sibling who is affected.
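As a small worked illustration of how the two intervals quoted above compare, the following sketch stores each interval in percentage points and reports its width. The class name `RiskInterval` is hypothetical and not part of genoDraw; the only data used are the four percentages stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskInterval:
    """Risk of being affected, in percentage points, bounded by two estimates."""
    low: int
    high: int

    @property
    def width(self) -> int:
        # A wider interval means the two methods disagree more,
        # which hints at more latent (missing) information.
        return self.high - self.low


before = RiskInterval(low=25, high=48)  # Q before the omission of H
after = RiskInterval(low=26, high=52)   # Q after the omission of H

print(before.width, after.width)
print(after.low - before.low, after.high - before.high)
```

The interval widens by three percentage points while its lower bound moves by only one, which is consistent with the observation that the change is small.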
In a fourth omission (Figure 24), we can simulate that O is not a sibling of Q just by
eliminating the information that he is affected. A shift in the whole family is observed, and
the likelihood that Q is affected decreases. However, with the amount of information available,
although there is a risk of 2% that Q is affected, nothing can be asserted, since their
distribution, according to method B, is too uniform, although leaning towards the affected side.
Figure 24: Complex case 3 after the omission that O is affected.
7 DISCUSSION
In this section, the context surrounding this work is considered. From one perspective,
other methods aim at better precision in intricate and diverse inheritance scenarios. From
another perspective, the lack of methods that aim at assisting the risk assessment of ge-
netic diseases indicates a lack not of extremely precise, but of adequate tools in clinical
practice.
In this work, I describe three algorithms that can be useful for propagating genetic
information in family graphs, combining such algorithms in two groups (methods A and
B). This is done so as to provide insight into the balance between available information
and latent information. These groups were implemented as a module in genoDraw, and
their results were analyzed in the previous section.
The basis of this work, as described in the Methods section, is the propagation of
information in all directions, as well as the consideration that genotypes can have their
characteristics fixed by constraints. Each of the three algorithms presented is independent
from the rest, which enables flexibility in their combinations. Each combination is run in a
separate Expectation Maximization algorithm, in which each individual can be influenced
both by their parents and children by means of gradients influencing their genotype
probability distributions. In a pedigree with more than one generation, a step-by-step
propagation can thus be observed.
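To make one direction of this propagation concrete, the following is a minimal sketch of how a child's genotype distribution follows from those of the parents under plain Mendelian segregation. It is not the algorithm of methods A or B: it covers only the parent-to-child direction, assumes a single biallelic locus and independent parental genotypes, and uses hypothetical function names.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")  # 'a' is the disease allele


def gamete_distribution(genotype_probs: dict) -> dict:
    """Probability of transmitting each allele, given a genotype distribution."""
    p_big_a = genotype_probs["AA"] + 0.5 * genotype_probs["Aa"]
    p_small_a = genotype_probs["aa"] + 0.5 * genotype_probs["Aa"]
    return {"A": p_big_a, "a": p_small_a}


def child_distribution(father: dict, mother: dict) -> dict:
    """Genotype distribution of a child (assumes independent parental genotypes)."""
    paternal = gamete_distribution(father)
    maternal = gamete_distribution(mother)
    child = {g: 0.0 for g in GENOTYPES}
    for (af, pf), (am, pm) in product(paternal.items(), maternal.items()):
        genotype = "".join(sorted(af + am))  # 'A' sorts before 'a', giving "Aa"
        child[genotype] += pf * pm
    return child


carrier = {"AA": 0.0, "Aa": 1.0, "aa": 0.0}
print(child_distribution(carrier, carrier))
```

For two parents known to be carriers, the sketch reproduces the classical Punnett-square result: one quarter homozygous unaffected, one half carrier, one quarter homozygous for the disease allele. Repeating such a step generation after generation gives a forward propagation; the method described in this work additionally lets information flow from children back to parents.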
In terms of related work, there are, to my knowledge, no other contributions that share
the specific purpose of this work, which is to provide medical professionals with a tool
that facilitates their calculation tasks without being overly complex or cumbersome. That
is, a tool that assists them in the most tedious and error-prone parts of their daily routine.
The method presented here is of little use unless the user knows what every estimation
means, and what the intervals between estimations from methods A and B indicate in
terms of the precision and assertiveness the professional is seeking.
Most of the works I refer to in Section 2.6 intend to be as precise as possible. In fact,
they grew more and more complex throughout the 1980s and 1990s, seeking higher and
higher precision via complex statistical tools. However, many of them lack the good sense
to include only parameters that can actually be managed in a clinical scenario. Such
parameters render an otherwise useful tool a hindrance in the daily routine of geneticists. This may
be the reason for the remarkable lack of tools and platforms with purposes similar to what
I propose and implement here. Works such as [31, 32, 33, 34, 35] explore the nature of
genetics with extreme precision. [37] does not even mention alleles as unique blocks of
genetic information but refers to shared sections of the genome with multiple purposes.
Consanguinity is not analyzed in a global sense but as an analysis of identity of genes
by descent. Of course, the precision achieved can come to be excellent given enough
data. However, better still is the sequencing of every individual’s section of the genome
responsible for a trait. It is definitely a more expensive procedure, but undoubtedly more
precise. If precision is the intention, there are precise methods. However, for a tool to be
successful in clinical practice, one of the many characteristics it must have is a balance
between usefulness and precision.
In this work, usefulness is the primary consideration. The fact that intervals between predicted
genetic probability distributions (methods A and B) are presented instead of a pinpoint
estimation is the clearest statement that precision is not the most heavily weighted
characteristic of the method. After all, the segregation process during meiosis is random
enough to allow us the luxury of presenting the user with results that are certain not to be
exact. This is why this work is developed using simple, easy-to-calculate propagation models.
As the reader may observe, no mixed models or otherwise complex, parameter-heavy methods
are used here. Instead, I apply as simple a process as possible, while extending it to allow
the propagation of probability distributions instead of most-likely situations. The only
parameters of interest to the user are the mode of inheritance of the disease being analyzed,
its penetrance, and the mutated allele's frequency in the population. All of these parameters
are easily obtained from public databases and/or simple statistical observations. If a disease
does not present a clear mode of inheritance or is quantitative, the method presented here is
clearly not adequate. However, having to deal with quantitative parameters in simple X-linked
recessive diseases, for example, is not an advantage over current clinical practice. In
this sense and in this specific scenario, a simpler tool is expected to be more effective and
better accepted within its own scope than one that is adaptable to even the most complex
situations but excessively complex.
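To illustrate how little is needed to turn these three parameters into numbers, the following sketch derives a prior genotype distribution from an allele frequency and converts it into a probability of being affected. It is an illustration under stated assumptions, not the procedure implemented in genoDraw: the use of Hardy-Weinberg proportions for individuals without known parents is an assumption introduced here, only autosomal modes are covered, and the function names are hypothetical.

```python
def founder_prior(q: float) -> dict:
    """Genotype prior for an individual with unknown parents.

    Assumption: Hardy-Weinberg proportions, where q is the population
    frequency of the mutated allele 'a'.
    """
    p = 1.0 - q
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def probability_affected(genotype_probs: dict, penetrance: float, mode: str) -> float:
    """Probability of showing the trait for an autosomal mode of inheritance."""
    if mode == "recessive":
        at_risk = genotype_probs["aa"]
    elif mode == "dominant":
        at_risk = genotype_probs["Aa"] + genotype_probs["aa"]
    else:
        raise ValueError(f"unsupported mode of inheritance: {mode}")
    # Penetrance scales the share of at-risk genotypes that express the trait.
    return penetrance * at_risk


prior = founder_prior(q=0.01)
print(prior)
print(probability_affected(prior, penetrance=1.0, mode="recessive"))
print(probability_affected(prior, penetrance=0.8, mode="dominant"))
```

With an allele frequency of 1%, roughly 2% of such individuals are expected to be carriers under these assumptions, while the prior risk of being affected by a fully penetrant recessive trait is one in ten thousand.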
8 CONCLUSIONS AND FUTURE LINES OF WORK
In this section, I bring this work to a close, while contemplating new possibilities and
conceivable future efforts.
In this work, I present a method for genetic information propagation using graph min-
ing techniques. It is based on three modes of propagation which operate in complemen-
tary manners. Each combination of such modes serves a different purpose, as observed
in the Results section, being thus useful in different family scenarios. Together, they are
observed to indicate plausible genotype distributions and to draw attention to gaps in the
information about the family. The method I developed is centered on its applicability
in clinical practice, rather than focusing only on unmatchable precision scores. It takes
advantage of widely known and used parameters, such as modes of inheritance, as well as
the slightly more complex penetrance and allele frequency variables. In all, what is presented
here is a method which is novel in its central aim, which is to help genetic
counselors in their daily routines. It does so by automatically executing mathematical
operations and statistical interpretations, an activity that is currently performed by hand
and is thus extremely prone to errors, and that is usually performed only partially,
without considering all possibilities from all individuals.
The method proposed is now implemented in genoDraw. genoDraw is a complex PDS
developed by me at the Biomedical Informatics Group, Technical University of Madrid,
in collaboration with the Genetics and Inheritance Research Group of the 12 de Octubre
Hospital, Madrid. The platform has already been reported in a conference paper presented
in March of this year (Inforsalud, 2019) [1], of which I am the main author. The article
is attached to this work as Appendix A. As commented in the introduction of this work,
the method proposed in it addresses some of the needs of complex PDSs, such as the
composition and layout of pedigrees and the model of interaction with them. In the
present work, I build on that past work by implementing the proposed method as a risk
assessment module in genoDraw.
Therefore, by being implemented in an actual PDS, the method can come to have a
major impact in clinical practice by helping professionals more easily visualize the
dynamics of genetic diseases in families. At the 12 de Octubre Hospital, this could mean
savings in diagnosis time and professional efforts, as well as in genome sequencing and
more advanced but sometimes unnecessary analyses, be it in the context of precision
medicine or preventive medicine. In other medical centers, however, this could have
much deeper and far-reaching consequences. We currently live in a society decidedly
marked by a harsh contrast between rich and poor regions. In many areas of the world,
despite the existence of medical professionals (in many cases with top-notch academic
training), resources are scarce. In such cases, genome analysis and collection of detailed
genotypical information tend to not be available to the general public. In these scenarios,
the method here described can be most useful. Not only can it perform estimations based
only on family structures and on the phenotypes of some individuals, it is also based on
widely-known information, such as the mode of inheritance of diseases. In this sense, it
is compatible with the objective, purpose and reality of clinical procedures and resources.
The genetic information propagation techniques here described are not an end to the
field of statistical genetics in terms of modes of inheritance and related diseases, nor do
they intend to be. Quite the contrary, they are not more than some of many possibilities. In
the search for adequate ways for supporting the daily practice of genetic risk assessment,
some tradeoffs are required. Not all information is always known, and some may be less
decisive than others. In the future, I intend to (a) explore more deeply the possibilities
that Graph Mining offers for extracting information from largely latent data sources in
genetics, in which the structure relating the entities is significantly decisive. In
terms of genetics, some possibilities lie in the expansion of the methods described here to
incorporate factors such as how likely a de novo mutation is to happen, or even to allow
uniparental disomies to be predicted. These possibilities are vastly more complex
to consider without turning the whole process into a hindrance for the genetics practitioner.
However, a more feasible future effort related to this work might be the (b) deployment
of the implementation described here so that the Genetics and Inheritance team at the
12 de Octubre Hospital, in Madrid, is able not only to use the tool in their daily clinical
practice but also to contribute valuable feedback. Another idea aligned with the
deployment is (c) the design of a predictive algorithm capable of finding the most likely
mode of inheritance for genetic diseases given a family and some phenotypes. Lastly,
one more ambitious idea for future work is (d) to evaluate the association between modes
of inheritance. In essence, this line of work would contrast the modes of inheritance
currently assigned to genetic diseases by analyzing actual pedigrees.
9 REFERENCES
[1] L. Garcia-Giordano, S. Paraiso-Medina, R. Alonso-Calvo, F. J. Fernández-Martínez,
and V. Maojo, “Genodraw: A tool to create pedigree diagrams based on biomedical
terminologies and standards”, 2019.
[2] R. L. Bennett, The Practical Guide to the Genetic Family History. John Wiley and
Sons, 2011, ISBN: 978-1-118-20981-3.
[3] S. Lee, “Why do we read many articles with bad statistics? : What does the new
american statistical association’s statement on pvalues mean?”, Korean Journal of
Anesthesiology, vol. 69, no. 2, 109–110, Apr. 2016, ISSN: 2005-6419.
DOI: 10.4097/kjae.2016.69.2.109.
[4] S. J. Maglio and E. Polman, “Revising probability estimates: Why increasing
likelihood means increasing impact.”, Journal of Personality and Social Psychology,
[5] R. G. Resta, “The crane’s foot: The rise of the pedigree in human genetics”, Journal
of Genetic Counseling, vol. 2, no. 4, 235–260, 1993, ISSN: 1059-7700, 1573-3599.
DOI: 10.1007/BF00961574.
[6] M. M. Weber, “Ernst rüdin, 1874-1952: A german psychiatrist and geneticist”,
American Journal of Medical Genetics, vol. 67, no. 4, 323–331, 1996, ISSN: 0148-
[23] S. Köhler, L. Carmody, N. Vasilevsky, J. O. B. Jacobsen, D. Danis, J.-P. Gourdine,
M. Gargano, N. L. Harris, N. Matentzoglu, J. A. McMurry, and et al., “Expansion
of the human phenotype ontology (hpo) knowledge base and resources”, Nucleic
Acids Research, vol. 47, no. D1, D1018–D1027, 2019, ISSN: 0305-1048.
DOI: 10.1093/nar/gky1105.
[24] R. C. Elston and J. Stewart, “A new test of association for continuous variables”,
[34] C. Stricker, R. L. Fernando, and R. C. Elston, “An algorithm to approximate the
likelihood for pedigree data with loops by cutting”, Theoretical and Applied
Genetics, vol. 91, no. 6–7, 1054–1063, 1995, ISSN: 0040-5752, 1432-2242.
DOI: 10.1007/BF00223919.
[35] E. A. Thompson, “Statistical inference from genetic data on pedigrees”, NSF-CBMS
Regional Conference Series in Probability and Statistics, 2000. [Online].
Available: http://www.jstor.org/stable/4153187.
[36] X. Li, “Haplotype inference from pedigree data and population data”, PhD thesis,
Case Western Reserve University, 2010. [Online].
Available: https://etd.ohiolink.edu/pg_10?::NO:10:P10_ETD_SUBID:52101.
[37] D. E. A. Thompson, “Identity by descent in pedigrees and populations; methods
for genome-wide linkage and association. une short course: Feb 14-18, 201”, p. 99,
2011.
[38] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016,
[46] M. Bostock, V. Ogievetsky, and J. Heer, “D3 data-driven documents”, IEEE
Transactions on Visualization and Computer Graphics, vol. 17, no. 12, 2301–2309, 2011,
ISSN: 1077-2626. DOI: 10.1109/TVCG.2011.185.
A GENODRAW: A TOOL TO CREATE PEDIGREE DIAGRAMS BASED ON BIOMEDICAL TERMINOLOGIES AND STANDARDS
Madrid, March 21, 22 and 23. Infors@lud2019
GENODRAW: A TOOL TO CREATE PEDIGREE DIAGRAMS BASED ON BIOMEDICAL TERMINOLOGIES AND STANDARDS
L. GARCÍA-GIORDANO, S. PARAISO-MEDINA, R. ALONSO-CALVO, F. J. FERNÁNDEZ MARTÍNEZ, V. MAOJO
Abstract
The need for integrating genomic data into daily clinical practice raises the demand for tools capable of representing individuals’ data and also their biological relationships with other individuals. In this work, we introduce genoDraw, a new platform for creating and managing pedigree diagrams following biomedical standards. The proposed work focuses on five critical aspects for the adoption of this platform in clinical practice, namely: data-drivenness, automation, interactivity, comprehensiveness and compatibility with widely-adopted biomedical vocabularies for the annotation of traits and characteristics. We present a novel process for generating pedigree diagrams from individual data. This process generates pedigree diagrams that comply with the pedigree nomenclature used as a de facto standard in the area. We implemented the system as a web platform to ensure complete compatibility. We also performed an evaluation process, which included usability tests, and the results show a promising adequacy for use in clinical practice.
Keywords: Data visualization, Biomedical vocabularies, Genetics
Introduction
The current usage of genomic data in clinical practice indicates a demand for tools to represent individuals’ data and their relations. This demand also indicates a need for such tools to have characteristics such as data-drivenness, automation, interactivity, comprehensiveness, and compatibility with standard biomedical vocabularies. To our knowledge, there are currently no informatics tools addressing all of these aspects. In this work, we present genoDraw, a new interactive and user-friendly system that aims to address all the characteristics previously mentioned. It is capable of following the guidelines established by the newest revised version of the pedigree nomenclature [1]. The system is (a) comprehensive enough to represent all the major scenarios, (b) capable of automating the creation of the genogram, (c) interactive, (d) data-driven and (e) compatible with the annotation of characteristics of each individual as terms from standard biomedical vocabularies, such as the Human Phenotype Ontology (HPO) [2].
Methods
GenoDraw aims to address the required characteristics enumerated in the previous section from (a) to (e). Firstly, to provide comprehensiveness (a), we adopted a widely-used visual nomenclature [1][3]. This nomenclature is capable of clearly representing all kinds of clinical heritage scenarios, not only the traditional family relations, but also major non-traditional reproductive scenarios, such as ovum donations, adoptions, sperm donations, surrogate gestations, and planned adoptions, among others. We represent the pedigree diagram as a graph. The nodes of the graph are the entities of a genogram, which are positioned according to an optimization engine. This form of representation allows us to achieve the characteristics (b-e), as discussed in the pedigree graph creation process explained below.
The generation of the graph and of such constraints follows an automatic process. Initially, the system contains information about the individuals, their relations and characteristics, including their traits as terms from biomedical vocabularies. The process generates a correct genogram from the data of individuals and their relations (data-drivenness) in a wide variety of scenarios (comprehensiveness). As a visual example of this process, we can consider a family comprised of a man (A) and a woman (B) who have a relationship and a female child still being gestated (C), in which both the mother and the child have a certain trait (grey). Depicted in Figure 1 is the result of the genogram generation process for this set of data. Notice that all the traits and characteristics are represented: each individual as the corresponding shape, the ‘P’ symbol that indicates that the individual is still being gestated, and the characteristics annotated (in this case, the grey marks).
While the nodes and links of the genogram are determined following pre-made rules, its structure is determined by an optimization process, which defines the positions of the nodes on the canvas. This optimization process takes advantage of the alignments required by the nomenclature to define linear restrictions for the optimization. An example of the result of this process is shown in Figure 2.
The interactivity is achieved by facilitating the manipulation of the positions of the nodes in the represented graph and the input of new information by allowing direct interaction with the drawn entities. The manipulation of positions is achieved by moving any node of the representation to a new desired position. However, this could cause the diagram to be incorrect according to the features of the nomenclature. To solve this possible issue, the graph as a whole is then repositioned by the convergence process of the optimization engine, which finds a new layout that complies with both the desire of the user and the nomenclature.
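The idea of deriving linear restrictions from the alignments required by the nomenclature can be sketched as follows. This is only an illustration, not the genoDraw implementation: the dictionary layout is hypothetical (loosely inspired by separation constraints in constraint-based layout engines, and not to be read as the exact format of any library), and the rule set is reduced to two assumed rules, namely that partners share a row and that children sit at least one row below each parent.

```python
def layout_constraints(parent_links: dict, partner_links: list, row_gap: int = 100) -> list:
    """Build linear constraints on vertical positions.

    Each constraint means y[right] - y[left] >= gap, or == gap if 'equality'.
    The y axis is assumed to grow downwards, as on a typical canvas.
    """
    constraints = []
    for a, b in partner_links:
        # Partners are drawn on the same horizontal line.
        constraints.append({"axis": "y", "left": a, "right": b, "gap": 0, "equality": True})
    for child, parents in parent_links.items():
        for parent in parents:
            # A child is drawn at least one row below each parent.
            constraints.append(
                {"axis": "y", "left": parent, "right": child, "gap": row_gap, "equality": False}
            )
    return constraints


def satisfied(constraints: list, y: dict) -> bool:
    """Check a candidate assignment of vertical positions against the constraints."""
    for c in constraints:
        delta = y[c["right"]] - y[c["left"]]
        if c["equality"] and delta != c["gap"]:
            return False
        if not c["equality"] and delta < c["gap"]:
            return False
    return True


# The example family from the text: A and B are partners, C is their child.
constraints = layout_constraints({"C": ("A", "B")}, [("A", "B")])
print(satisfied(constraints, {"A": 0, "B": 0, "C": 100}))
print(satisfied(constraints, {"A": 0, "B": 40, "C": 100}))
```

An optimization engine would search for positions that satisfy all such constraints while keeping the diagram compact; after the user drags a node, the same constraints pull the layout back to a valid configuration.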
XX Congreso Nacional de Informática de la Salud Infors@lud2019
The input of new information is achieved by the use of context menus in each of the represented nodes, as well as by other input methods. For example, for a child to be added to a certain person, the user might click on the relationship between the two biological parents and then choose the item on the context menu that corresponds to adding a child. Further information about this child, such as name, gender or traits as terms from biomedical vocabularies, can then be inserted using a sidebar menu that is shown by clicking the corresponding node.
Results
The presented pedigree diagram drawing system was implemented as a web platform that allows for intuitive creation and display of genograms. The implementation of the platform was done using common web technologies, such as JavaScript, and the representation engine was based on WebCola (https://ialab.it.monash.edu/webcola/), a graph visualization tool derived from the Force module of D3.js (https://d3js.org/). This was achieved by adapting the rules deduced from the nomenclature to the engine chosen for displaying the graph using linear constraints. From an interaction perspective, the user is able to build a genogram step by step only by inserting people and/or creating relations between them. As shown in Figure 3, starting from a blank representation, a user may: (i) Add the parents: for this purpose, add a person called A, a man who has albinism (shown in grey), which is a term from HPO. Then, the user is able to set a partner for person A, which creates another person B and the relationship between them. (ii) Add biological children: thus, another person C may be added, having persons A and B as biological father and mother, respectively. Following the same steps, a child D can be added. Child C is also observed to have albinism, which is added through a sidebar menu for the affected individual, thus using the same term as person A to describe this characteristic.
Ultimately, the user is able to construct the family step by step, and all the information that is inserted is that of each individual (name and conditions) and of the relations (biological father, biological mother, and partner are the ones used in the example).
Discussion
In this work, we describe the main elements of genoDraw, a new system that enables the user to create and edit genograms not only in a highly interactive manner, but also in a way that encourages them to follow the chosen nomenclature, which is widely accepted as a visual guideline for drawing pedigree diagrams. The first of these characteristics, interactiveness, has been continuously gaining relevance since the advent of touchscreen-equipped devices and powerful graphics processors, especially in mobile devices. The most recent pedigree drawing tools date back to the beginning of the massive rise of such devices, which explains their noticeable lack of interactiveness, a feature of tremendous utility nowadays. To test the use of our system, we devised and carried out a usability test. The test consisted of generating, without prior use of the platform, two pedigree diagrams given a real-world situation written as text for each of them. The first diagram was a simple but large representation, to assess the familiarization of the users at a low complexity level.
The second diagram consisted of a family in which many children were born from surrogate gestations, ovum donors, or were adopted. In both cases, traits and characteristics were asked to be symbolized as specific terms from standard vocabularies. The test was carried out with a group of individuals with expertise in the biomedical domain who were previously trained in the standard nomenclature. The results show that the platform offers an adequate set of functionalities and characteristics for its purpose, and is, therefore, suitable for use in clinical practice. The second characteristic, which relates to the correctness of the representations, is a common feature among current and old systems. None of them follows the revised version of the nomenclature that we chose, since they instead follow their own specifications of pedigree diagrams. In terms of comprehensiveness, our system complies with all the specific scenarios proposed by the nomenclature. Since the nomenclature is very comprehensive by itself, genoDraw is thus very comprehensive in terms of the diversity of situations it can represent. We decided not to include some features that can become useful when analyzing psychological elements of a family, such as affinity among individuals. Therefore, regarding this specific issue, other tools are more complete than ours.
In terms of data-drivenness, some reported systems implement some kind of generation of diagrams from data, but the data tends to be translated into the diagram upon insertion by the user, and the information itself is then lost. Using our approach, we managed to collect all the inserted data and store it as the information of the genogram itself.
The generation of the pedigree diagram is then performed from the stored data, and the output is a correct genogram.
Finally, regarding the storage of traits and characteristics as terms of standard biomedical vocabularies, to our knowledge, no other pedigree drawing system reported in the literature discusses the implementation of this feature. This is undoubtedly a step towards the integration of this system with the electronic health records of each individual that might be represented in a genogram.
Some limitations for creating a visual representation of frequent situations have been revealed. In fact, these limitations were previously commented on in the pedigree diagram drawing literature [4]. One of them is the impossibility of drawing more than two relationships simultaneously for the same person. This can be addressed by allowing the user to hide relationships at will without necessarily removing them from the data of the genogram. Another limitation is the impossibility of representing relationships between people from different generations in the same family. This is solved by creating new structure constraints which enable more flexibility in the represented pedigree diagram.
Conclusions
This work explores the capabilities of the representation of genograms by an automated interactive tool. By following the process mentioned in the methods section, we developed a system that complies with the proposed characteristics: comprehensiveness, automation, interactiveness, data-drivenness and compatibility with biomedical vocabularies for traits.
Regarding the limitations of the system, in terms of the comprehensiveness of the representation, they are already defined
by the limitations of the nomenclature itself, which are due to planarity and alignment issues, unavoidable in any two-dimensional representation. As far as automation and data-drivenness are concerned, any graph generated from the information of individuals and their relations is always composed of the entities required by the nomenclature, and the resulting graph is isomorphic to that derived from a correct pedigree diagram according to the directives of the nomenclature. Nonetheless, the structure, being calculated only by an optimization engine, might not be able to represent the most adequate genogram without corrections. However, from an interactivity standpoint, the support for corrections of the nodes’ positions by the user makes the system capable of representing, in a correct manner, all the situations included in the chosen nomenclature.
Although our system is currently capable of storing the annotation of diseases and traits as terms from standard biomedical vocabularies, a current limitation is the limited access to information from electronic health records. We intend to address this limitation in the near future, so as to have this system not as an isolated part of the diagnosis of genetic diseases, but as a tool that is capable of retrieving, updating and using relevant information that is stored in electronic health records to contribute to an enhancement and facilitation of the diagnosis and risk evaluation of genetic diseases.
Figures and Graphs
Acknowledgements
This work is supported by “Proyecto colaborativo de integración de datos genómicos (CICLOGEN)” PI17/01561 funded by the Carlos III Health Institute from the Spanish National plan for Scientific and Technical Research and Innovation 2017-2020 and the European Regional Development Funds (FEDER).
References
[1] R.L. Bennett, K.S. French, R.G. Resta, and D.L. Doyle, Standardized Human Pedigree Nomenclature: Update and Assessment of the Recommendations of the National Society of Genetic Counselors, J Genet Counsel. 17 (2008) 424–433. doi:10.1007/s10897-008-9169-9.
[2] S. Köhler, L. Carmody, N. Vasilevsky, et al., Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources, Nucleic Acids Res. 47 (2019) D1018–D1027. doi:10.1093/nar/gky1105.
[3] R.L. Bennett, K.A. Steinhaus, S.B. Uhrich, C.K. O’Sullivan, R.G. Resta, D. Lochner-Doyle, D.S. Markel, V. Vincent, and J. Hamanishi, Recommendations for standardized human pedigree nomenclature. Pedigree Standardization Task Force of the National Society of Genetic Counselors, Am J Hum Genet. 56 (1995) 745–752.
[4] C. Kelleher, B. Drohan, K. Hughes, and G. Grinstein, Self Organizing Interactive Pedigree Diagrams, (2011).
[5] K. Donnelly, SNOMED-CT: The advanced terminology and coding system for eHealth, Stud Health Technol Inform. 121 (2006) 279–290.