

Coevolution of Mathematics, Statistics, and Genetics

Yun Joo Yoo

Contents

Introduction
Early Contributions
    Mendel and His Inheritance Models
    Hardy-Weinberg Equilibrium
    Wright-Fisher Model
Study of Family History and Pedigrees
    Twin Studies
    Genetic Linkage Mapping
Exploring Big Genetic Data
    Genome-Wide Association Studies
    Whole Genome Sequencing
    Network-Based Analysis for Genetic Data
Discussion
References

Abstract

Genetics is the science of studying heredity. Heredity is the process of transmitting genetic materials from parents to offspring. In genetic studies, hypotheses derived from biological theories and mathematical models are tested with the data from experiments or observations of genetic phenomena using statistical methodologies. Throughout the history of genetics, mathematics and statistics have been extensively used for genetic studies, and genetics, in turn, has influenced many fields of mathematics and statistics. In this chapter, we describe some of the most important mathematical models and statistical methods in the history of genetics. We especially focus on three periods: (1) the early days, when the basic concepts in genetics were established, such as genes, evolution, and inheritance, and mathematical models of such genetic mechanisms were laid out; (2) the period of studying family data from twins or large pedigrees in the mid- to late twentieth century; and (3) the present period of exploring big genetic data by complex modeling and machine learning. We show that various probabilistic models, differential equations, and graph and network theories have been applied to the analysis of genetic data. We also illustrate how statistical issues involved with model fitting, estimation, and hypothesis testing have been raised and resolved in the context of genetic studies, contributing to the field of statistics as well as that of genetics. In the discussion, we suggest some promising mathematical and statistical methods to be applied in future genetic studies.

Keywords

Mathematical genetics · Statistical genetics · Linkage study · Genetic association · Whole genome sequencing

Y. J. Yoo
Department of Mathematics Education, Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
e-mail: [email protected]

© Springer Nature Switzerland AG 2018
B. Sriraman (ed.), Handbook of the Mathematics of the Arts and Sciences, https://doi.org/10.1007/978-3-319-70658-0_28-1

Introduction

Genetics is the study of the mechanisms of inheritance in living organisms at the molecular level or at the population level. Whether an organism is a bacterium or a human, many of its biological characteristics are affected by genetic factors. To solve many important problems in the fields of biology, agriculture, ecology, and medicine, heredity in humans and other species has long been studied using mathematical models and statistical methodologies. In human genetics, genetic traits, especially those related to diseases for which researchers are trying to find causes and cures, have been extensively researched.

Mathematics has been a key instrument since the very beginning of genetics, when Gregor Mendel (1822–1884) tried to explain why some traits of plants appear in certain ratios under controlled conditions (Siddartha 2016). Currently, genetic studies depend upon large-scale, high-throughput data such as single nucleotide variations or protein networks. The large-scale genomic data generated by the current advanced technology require high-dimensional statistical analysis, advanced machine learning techniques, and complex mathematical modeling, through which researchers discover a specific genetic makeup responsible for certain heritable traits by exploring the billions of possible candidate genetic compositions (Brown 2002).

Early statisticians such as Francis Galton (1822–1911), Karl Pearson (1857–1936), and Ronald A. Fisher (1890–1962), who laid the foundations of modern statistics, were also geneticists. Galton suggested the concepts of variance, standard deviation, correlation, and regression for the first time and discovered the phenomenon called “regression to the mean” from observations of the heights of parents and their children, where the extreme values found in the parents move toward the average in their children (Stigler 2010). Galton thought it was a general inheritance phenomenon, and the term “regression” originated from these observations (Galton 1886). Currently, the term “regression to the mean” is used to describe the statistical phenomenon in which initial sampling bias disappears over repeated observations (Stigler 1997). Later, the regression method became one of the most important and powerful tools in statistics. Pearson developed the correlation coefficient r, the chi-squared test, and hypothesis testing (Biau et al. 2010). Pearson’s chi-squared test was applied to inheritance data, including Mendel’s original experiment results (Magnello 1998). Fisher discovered the F distribution, which is the distribution of the ratio of two independent chi-squared random variables, each divided by its degrees of freedom (Fisher 1924), and introduced the concept of p-values to judge the statistical significance of a hypothesis test, which is required to determine whether the evidence is strong enough to support the designated hypothesis in genetic studies (Biau et al. 2010). Fisher also discovered numerous genetic principles and key concepts in population genetics, including the principle of equal ratio between male and female and the fundamental theorem of natural selection, which states the relationship between fitness and genetic variance (Yates and Mather 1963; Crow 2002). Galton, Pearson, Fisher, and their fellow scholars applied the various mathematical achievements of their time, from probability theory to differential equations, to explaining genetic and evolutionary phenomena, and laid the basic principles and frameworks of statistics, then a relatively new field of science, to provide rational criteria to make scientific judgments based on observed data in genetic experiments.

Currently, scientists say that genetics has entered the era of “omics,” where large-scale comprehensive information is accumulated and analyzed. In this era of omics, the coevolution of mathematics, statistics, and genetics remains strong (Kiechle et al. 2004; Raja et al. 2017; Tian et al. 2011). In earlier times, the cost and time required to generate genetic data restricted the scale of the experiments such that a genetic study usually focused on a limited number of genetic elements. Now, new technologies with reduced cost and increased speed and capacity, such as next-generation sequencing, parallel computing, and improved memory devices, enable geneticists, mathematicians, and statisticians together to conduct large-scale studies using the genetic data from hundreds of thousands of subjects or allow the accumulated data assembled in public databases to be shared with researchers all over the world. For example, genomics is the study of entire genomes to find the causes of certain biological phenomena, using the genomic data of millions and even billions of single nucleotides genotyped by commercial array chips or sequencing machines (Brown 2002). Proteomics is the large-scale study of proteins generated from microarray experiments (Blackstock and Weir 1999). Phenomics is the analysis of the phenotypes (trait configurations) of an organism and their changes in response to genetic variations and interactions with the environment (Freimer and Sabatti 2003; Gerlai 2002). Omics studies require mathematical and statistical methodologies for high-dimensional data. For this purpose, network theory has been actively applied to omics data, and new machine learning methods have been developed to select causal variables amid a vast array of candidate genetic materials (Wu et al. 2014).

In this chapter, we introduce and discuss several key ideas for mathematical models and statistical approaches that have enabled the discovery of important findings in genetics and have led to changes in the paradigm of genetic research. First, we show how the beginning of genetics was deeply related to mathematical modeling and early statistical development. Next, we describe how statistical inferences from the inheritance data of complex families with molecular information made it possible to find genes responsible for Mendelian and complex diseases. Then, we also illustrate the more recent genetic research scenarios of using various advanced mathematical and machine learning methods to disentangle the complex mechanisms of genes and outcomes from the big genetic data generated by new technologies.

Early Contributions

Mendel and His Inheritance Models

Gregor Mendel is recognized as the founder of genetics due to his first attempt to systematically model the cross-breeding results of plants and his proposal of the concept of genes (the term he used was actually “factors”) (Siddartha 2016; Chiras 2012). Mendel conducted extensive breeding experiments on pea plants (Pisum sativum) and established several rules of heredity, now called the laws of Mendelian inheritance.

His experiments using pea plants were intended to study the inheritance patterns of seven traits including height, flower color, and seed color and shape. He artificially pollinated one type of pea plants with the pollen from another type (or sometimes the same type) and examined the appearances of those seven traits in the resulting offspring. The first thing he did for these experiments was to obtain purebred plants for each type of the seven traits by fertilizing plants that shared the same trait type (self-fertilization) for years. Mendel ensured that the self-fertilization of a plant purebred for a trait led to offspring with the same trait type as that of the parent. This phenomenon of the offspring having the same trait type as its parents is called heredity. The concept of heredity as the result of breeding was widely understood by scholars of agriculture or biology in Mendel’s time, but they lacked a clear explanation for the mechanism of heredity. As a result, accurate prediction or control of the outcomes of breeding was almost impossible (Orel 2009).

With these purebred plants in hand, Mendel proceeded to cross-breed different types of purebred plants. The purebred plants used in the cross-breeding experiment are called the P generation, and the offspring, as the outcomes of cross-breeding between different types, are called the F1 generation. He then obtained the offspring of the self-fertilization of the F1 generation, which are called the F2 generation. When Mendel observed the distributions of trait types in the F1 and F2 generations, he noticed some patterns in terms of the ratios among the different trait types. For example, when he cross-bred violet-flowered pea plants and white-flowered pea plants, he obtained all violet flowers in the F1 generation. Next, when he let the F1 generation self-fertilize, suddenly, he observed white flowers appearing in the F2 generation, which were absent in the F1 generation (Fig. 1). In addition, he found that the ratio between violet and white in the F2 generation was close to 3:1 (705 vs. 224). He observed similar phenomena for height (tall and short) and other traits in the pea plants.

Fig. 1 Results of cross-breeding of pure-bred pea plants of violet and white flowers and self-fertilization of offspring generation (F1). (P generation: true-bred violet flower × true-bred white flower; F1 generation: all violet flowers; F2 generation: 705 violet flowers, 224 white flowers)

To explain why this phenomenon happens, Mendel came up with the theory of a pair of hidden factors: one inherited from the father and the other inherited from the mother. This notion of “inheritance factors residing as a pair in an individual” suggested by Mendel led to the concept of the gene, which is now known as the DNA sequences inherited from both parents residing on chromosomes as pairs. Mendel theorized the concept of genes without physically observing the process of meiosis, only by inferring a mathematical model that fits the data.

In addition to the concept of genes, he proposed three principles describing inheritance mechanisms. One of these principles is called the law of dominance, which can be seen as a biological model of inheritance. The law of dominance means that the trait type related to one allele (a variant of a gene) is suppressed by the other trait type, which is related to a different allele of that gene, when they coexist in a heterozygous (co-appearing) genotype. The suppressed allele and its related trait type are said to be recessive, and the suppressing allele and its related trait type are said to be dominant. For example, if Y represents the dominant allele corresponding to violet flowers and y represents the recessive allele connected to white flowers, a plant with a Yy genotype yields violet flowers.

The other two principles of inheritance are the law of segregation and the law of independent assortment. These principles are directly connected to probabilistic models of Mendelian inheritance. The law of segregation is the principle that each pair of parental genes (alleles) is randomly separated into sex cells so that the offspring inherits one genetic allele from each parent with a 50:50 chance. When the law of segregation is combined with the law of dominance, the 3:1 ratio of violet flowers versus white flowers in the F2 generation in Mendel’s experiment can be explained. The cross-breeding of two different purebred lines in the P generation means that fertilization occurs between one plant with a YY genotype and another with a yy genotype; this cross-breeding process can be symbolized as YY × yy. According to the law of segregation, only one of the parent gene copies will be inherited randomly by the offspring, so every offspring in the F1 generation will have the genotype Yy (provided that we always write the Y before the y). Next, the self-fertilization of F1, denoted by Yy × Yy, will yield three possible genotypes, YY, Yy, and yy, in the F2 generation. The self-fertilization results of the heterozygous F1 generation are usually represented by a Punnett square (Fig. 2). The proportions of YY, Yy, and yy genotypes in the F2 generation should be 25%, 50%, and 25%, according to the law of segregation. If violet flowers represent the dominant phenotype, then 75% of the F2 generation will have violet flowers, fitting the observed ratio of approximately 3:1 in Mendel’s cross-breeding experiments.

Fig. 2 Illustration of the law of segregation and the Punnett square. (P generation: YY × yy; F1 generation: Yy × Yy; F2 generation: Punnett square with gametes Y and y from each parent, giving YY, Yy, Yy, and yy)
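The Punnett-square argument above can be checked by enumerating the equally likely allele pairings. The following is a minimal Python sketch; the function names `cross` and `phenotype_counts` are our own illustrative choices, and the sketch assumes a single gene with one dominant allele written in uppercase.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Enumerate the equally likely offspring genotypes for one gene.

    Each parent is a two-letter genotype such as "Yy"; by the law of
    segregation each of its two alleles is passed on with probability 1/2.
    """
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sorting puts the uppercase (dominant) allele first, so "yY" becomes "Yy".
        genotype = "".join(sorted(allele1 + allele2))
        offspring[genotype] += 1
    return offspring

def phenotype_counts(offspring, dominant="Y"):
    """Return (dominant phenotype count, recessive phenotype count)."""
    dominant_count = sum(n for g, n in offspring.items() if dominant in g)
    return dominant_count, sum(offspring.values()) - dominant_count

f1 = cross("YY", "yy")   # P generation cross
f2 = cross("Yy", "Yy")   # self-fertilization of F1
print(dict(f1))
print(dict(f2), phenotype_counts(f2))
```

Under these assumptions, the F1 cross yields only Yy, and the F2 cross yields YY, Yy, and yy in the proportions 1:2:1, which collapses to 3:1 once dominance is applied.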

The law of independent assortment is a principle describing independence in the inheritance of different genes. Mendel investigated two traits of pea plants at the same time: seed color and seed shape. For seed color, there are two types, yellow (Y) and green (y), and for seed shape, round (R) and wrinkled (r). Here, Y and R are dominant alleles, i.e., seeds of genotype YY or Yy will appear yellow, and seeds of genotype RR or Rr will appear round. If we cross-breed pea plants that are purebred in both traits, one with seeds of yellow color and round shape (YYRR) and one with seeds of green color and wrinkled shape (yyrr), we will end up with an F1 generation having round yellow seeds and uniformly heterozygous genotypes (Yy and Rr); such plants are called dihybrids.

When Mendel fertilized these dihybrid F1 plants among themselves, he observed four types in the offspring (F2) generation: yellow and round, yellow and wrinkled, green and round, and green and wrinkled. These types were observed to appear in a ratio of approximately 9:3:3:1. From this result, Mendel conjectured that the pairing of alleles for these two traits, seed color and seed shape, in the inheritance process is randomly (independently) determined, meaning that the pairs (Y, R), (Y, r), (y, R), and (y, r) all have equal probabilities. This result fit the observed 9:3:3:1 ratio in four phenotype cases (Fig. 3). Based on this observation of the independent determination of two traits, Mendel proposed the law of independent assortment.
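The 9:3:3:1 ratio follows from counting the 16 equally likely gamete pairings of a YyRr × YyRr cross. The sketch below illustrates this in Python; the helper names `gametes` and `phenotype` are hypothetical, and the code assumes genotypes are written as four letters with the seed-color gene first.

```python
from collections import Counter
from itertools import product

def gametes(genotype):
    """List the equally likely gametes of a two-gene genotype such as "YyRr".

    Independent assortment means every color allele pairs with every
    shape allele with the same probability.
    """
    color_alleles, shape_alleles = genotype[:2], genotype[2:]
    return list(product(color_alleles, shape_alleles))

def phenotype(gamete1, gamete2):
    """Apply the law of dominance to each trait separately."""
    color = "yellow" if "Y" in (gamete1[0], gamete2[0]) else "green"
    shape = "round" if "R" in (gamete1[1], gamete2[1]) else "wrinkled"
    return color, shape

f1_gametes = gametes("YyRr")  # (Y, R), (Y, r), (y, R), (y, r)
f2_counts = Counter(
    phenotype(g1, g2) for g1, g2 in product(f1_gametes, repeat=2)
)
for kind, count in f2_counts.items():
    print(kind, count)
```

The counts out of 16 pairings are 9 yellow round, 3 yellow wrinkled, 3 green round, and 1 green wrinkled.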

With the data obtained from biological experiments in one hand and the mathematical models that seemed to explain the data in the other hand, the need for a rational decision process to justify models in comparison to data emerged in the field of genetics (Cox 2002). The chi-squared goodness of fit test was suggested by Karl Pearson in this atmosphere (Pearson 1900) and applied to Mendel’s data by Raphael Weldon (Magnello 2004). Weldon discovered that Mendel’s data were too close to the expected values and suggested the possibility of fabrication. Fisher was another person who claimed possible falsification based on a statistical point of view (Fairbanks and Schaalje 2007). The controversy over the results of Mendel’s experiments was deeply related to hypothesis testing and the statistical decision process, which prompted an active discussion of those subjects among statisticians.
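As an illustration of the goodness of fit idea, the sketch below applies Pearson's chi-squared statistic to the flower-color counts quoted earlier (705 violet, 224 white) against the 3:1 expectation. It is a minimal example using only the Python standard library, not a reproduction of Weldon's or Fisher's analyses, which considered Mendel's experiments jointly. The function names are our own, and the p-value helper is valid only for 1 degree of freedom.

```python
from math import erfc, sqrt

def chi_squared_gof(observed, expected_ratio):
    """Pearson goodness of fit statistic for counts against a ratio."""
    n = sum(observed)
    ratio_total = sum(expected_ratio)
    expected = [n * r / ratio_total for r in expected_ratio]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected

def p_value_df1(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    Uses P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    """
    return erfc(sqrt(statistic / 2))

statistic, expected = chi_squared_gof([705, 224], [3, 1])
p_value = p_value_df1(statistic)
print(f"expected counts: {expected}")
print(f"chi-squared = {statistic:.3f}, p-value = {p_value:.3f}")
```

With two categories and no estimated parameters there is 1 degree of freedom. The statistic here is about 0.39, far below the 5% critical value of 3.84, so this single experiment does not contradict the 3:1 model.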

Hardy-Weinberg Equilibrium

Mendel’s laws of inheritance are typical examples which show that finding proper biological and mathematical models for inheritance mechanisms that explain the observed data can be the objective of genetic research. In particular, population genetics, the study of the distributions of genetic components and phenotypic variables in the population in relation to various environmental and genetic factors, has actively used mathematical modeling and statistical methods to theorize and prove hypotheses about population genetic phenomena, such as mutation, evolution, migration, and natural selection, from the early days of genetic research. In its early days, population genetics was also called mathematical genetics due to its extensive use of mathematical theories (Edwards 1977).

Fig. 3 Illustration of the law of independent assortment. (P generation: YYRR × yyrr; F1 generation: YyRr × YyRr; F2 generation: 4 × 4 Punnett square with gametes YR, yR, Yr, and yr from each parent)

One of the most famous principles found by the first generation of population geneticists is the Hardy-Weinberg principle established by Godfrey H. Hardy (Hardy 1908) and Wilhelm Weinberg (Weinberg 1908). The Hardy-Weinberg principle states that the allele frequencies and genotype frequencies of one generation will remain unchanged in subsequent generations as long as no genetic interference, such as mutation, sexual or natural selection, or genetic drift, is present, with the assumption of random mating. Hardy was actually a mathematician who had not been interested in genetics prior to this problem. When his cricket buddy Reginald Punnett (1875–1967) (who created the Punnett square) introduced the proposition of the then-famous statistician George U. Yule (1871–1951) that a dominant allele should prevail in a population over successive generations, Hardy solved the problem by applying his mathematical knowledge and commented that it was a very “simple” problem (Edwards 2008).

Table 1 Punnett square for the Hardy-Weinberg principle assuming random mating between males and females (each cell gives a genotype and its frequency)

                      Females
                      A (p)        a (q)
    Males   A (p)     AA (p^2)     Aa (pq)
            a (q)     Aa (qp)      aa (q^2)

The mathematical modeling and theoretical explanation for the Hardy-Weinberg principle are as follows. Let the baseline generation be denoted using the index value $t = 0$. Suppose that only two alleles, $A$ and $a$, exist in the population for a gene. The frequencies of occurrence of alleles $A$ and $a$ are denoted by $p_0(A) = p$ and $p_0(a) = q = 1 - p$, respectively. For any generation $t$, the allele frequencies $p_t(A)$ and $p_t(a)$ can be obtained from the genotype frequencies $p_t(AA)$, $p_t(Aa)$, and $p_t(aa)$ of that generation by:

$$p_t(A) = p_t(AA) + \frac{1}{2} p_t(Aa)$$

$$p_t(a) = p_t(aa) + \frac{1}{2} p_t(Aa)$$

To obtain the genotype frequencies of the next generation, we assume random mating in the current generation, which results in the expected frequency table in a Punnett square, as shown in Table 1.

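For example, the $t = 1$ generation’s allele frequencies are the same as those of the $t = 0$ generation, since $p_1(AA) = p^2$, $p_1(Aa) = 2pq$, $p_1(aa) = q^2$, and $p + q = 1$:

$$p_1(A) = p_1(AA) + \frac{1}{2} p_1(Aa) = p^2 + pq = p(p + q) = p = p_0(A)$$

$$p_1(a) = p_1(aa) + \frac{1}{2} p_1(Aa) = q^2 + pq = q(q + p) = q = p_0(a)$$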

In this way, the allele frequencies and genotype frequencies remain fixed across generations; this phenomenon is also called Hardy-Weinberg equilibrium (HWE).
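The invariance can be illustrated numerically by starting from genotype frequencies that are not in equilibrium and applying random mating repeatedly. The following Python sketch uses hypothetical starting frequencies, and the function names `allele_freq` and `random_mating` are our own. It assumes an effectively infinite population with no mutation, selection, or drift.

```python
def allele_freq(genotype_freqs):
    """Return (p, q) from genotype frequencies given as (AA, Aa, aa)."""
    f_AA, f_Aa, f_aa = genotype_freqs
    return f_AA + 0.5 * f_Aa, f_aa + 0.5 * f_Aa

def random_mating(genotype_freqs):
    """Genotype frequencies of the next generation under random mating."""
    p, q = allele_freq(genotype_freqs)
    return (p * p, 2 * p * q, q * q)

# Hypothetical starting population that is NOT in equilibrium.
generation = (0.5, 0.2, 0.3)
history = [generation]
for _ in range(3):
    generation = random_mating(generation)
    history.append(generation)

for t, freqs in enumerate(history):
    p, q = allele_freq(freqs)
    print(t, [round(f, 4) for f in freqs], round(p, 4), round(q, 4))
```

In this sketch the allele frequencies never change, while the genotype frequencies reach the equilibrium proportions after a single round of random mating and stay there.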

Given the genotype frequencies of a population, deviations from HWE can be statistically tested using Pearson’s goodness of fit chi-squared test (Wang and Shete 2017). If you have genotype counts $O_{AA}$, $O_{Aa}$, and $O_{aa}$ in the data for genotypes $AA$, $Aa$, and $aa$, respectively, and $n = O_{AA} + O_{Aa} + O_{aa}$ is the total number of genotyped subjects, then the frequency $p$ of allele $A$ is obtained by:

$$p = \frac{O_{AA} + \frac{1}{2} O_{Aa}}{n}$$

and the frequency of allele $a$ is obtained by $q = 1 - p$.

From this, you can compute the expected genotype counts $E_{AA}$, $E_{Aa}$, and $E_{aa}$ according to HWE such that $E_{AA} = np^2$, $E_{Aa} = 2npq$, and $E_{aa} = n(1 - p)^2$, where $q = 1 - p$ and $n$ is the total number of genotypes (subjects).

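The Pearson chi-squared test to detect deviation from HWE is then defined as:

$$\chi^2 = \frac{(O_{AA} - E_{AA})^2}{E_{AA}} + \frac{(O_{Aa} - E_{Aa})^2}{E_{Aa}} + \frac{(O_{aa} - E_{aa})^2}{E_{aa}}$$

Under the null hypothesis of no deviation from HWE, $\chi^2$ approximately follows a chi-squared distribution with 1 degree of freedom: there are three genotype categories, one degree of freedom is lost because the counts sum to $n$, and another is lost because $p$ is estimated from the data.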
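The test can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `hwe_chi_squared` and the genotype counts are hypothetical, and the sketch assumes both alleles are present in the sample so that no expected count is zero.

```python
from math import erfc, sqrt

def hwe_chi_squared(o_AA, o_Aa, o_aa):
    """Pearson chi-squared test for deviation from HWE (1 degree of freedom).

    Returns the statistic and its approximate p-value.
    """
    n = o_AA + o_Aa + o_aa
    p = (o_AA + 0.5 * o_Aa) / n          # estimated frequency of allele A
    q = 1 - p
    observed = (o_AA, o_Aa, o_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of chi-squared with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: the first sample is close to HWE, the second has
# far too few heterozygotes.
print(hwe_chi_squared(298, 489, 213))
print(hwe_chi_squared(400, 200, 400))
```

A large statistic (small p-value) indicates that the observed genotype counts are unlikely under HWE, which in practice may point to selection, non-random mating, population structure, or genotyping error.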

Wright-Fisher Model

Population genetics research evaluates changes and variations in the genetic composition of populations over time. Factors that affect population genetic phenomena, such as natural selection or mutation, are of particular interest. To find evidence for such special evolutionary events, population geneticists must establish a neutral model and investigate departures from that neutral model, as in the case of the test for HWE (Crow 1987).

One of the neutral phenomena that mathematical geneticists have studied from the early history of population genetics is genetic drift, the mechanism that causes changes in allele frequencies in the population over generations due to chance (Masel 2011). Technically, any changes in allele frequencies can be called evolution since they can affect the characteristics of the population. In natural selection, allele frequencies may change in order to adapt to environmental pressures. In genetic drift, the change in allele frequencies occurs as a random phenomenon.

An illustration of genetic drift is given as follows. Suppose that the population contains only B and b alleles for a gene, with frequencies of 0.5 and 0.5, respectively, and the ratio of BB, Bb, and bb genotypes in the population is 1:2:1. Let us assume that only a portion of the population has mated, and by chance, the individuals who have mated have only BB and Bb genotypes, in a ratio of 2:3. Then, in the next generation, the allele frequencies become 0.7 and 0.3 for the B and b alleles, respectively. If, by chance, only BB individuals reproduce in the second generation, then the b allele completely disappears from the population, due solely to the random phenomenon of genetic drift (Fig. 4).

Fig. 4 Illustration of genetic drift resulting in fixation in the third generation. (First generation: p = 0.5, q = 0.5, where p and q are the frequencies of alleles B and b; second generation: p = 0.7, q = 0.3; third generation, fixation: p = 1.0, q = 0.0)

Sewall Wright (1889–1988) and Ronald A. Fisher (1890–1962) independently suggested similar mathematical models for genetic drift in approximately 1930 (Fisher 1930; Wright 1931). In a population of $N$ individuals, for a gene with two alleles $B$ and $b$ with allele frequencies $p_t = \frac{k_t}{2N}$ and $q_t = 1 - p_t$, respectively, in generation $t$ (where $k_t$ is the number of copies of allele $B$ among the $2N$ gene copies), the probability of obtaining $k_{t+1}$ copies of allele $B$ in generation $t + 1$ is:

$$\binom{2N}{k_{t+1}} p_t^{\,k_{t+1}} q_t^{\,2N - k_{t+1}}$$

since the $2N$ gene copies of generation $t + 1$ are drawn at random, with replacement, from the $2N$ gene copies of generation $t$, which consist of $k_t$ copies of allele $B$ and $2N - k_t$ copies of allele $b$. Then, the expected value and the variance of the allele frequency of $B$ in generation $t + 1$, given the allele frequency in generation $t$, are:

$$E[p_{t+1} \mid p_t] = p_t,$$

$$Var(p_{t+1} \mid p_t) = \frac{1}{2N} p_t (1 - p_t)$$

By iterating this process, the mean and variance of the allele frequency of $B$ after $s$ generations from the initial generation, $t = 0$, can be obtained as follows (Crow and Kimura 1970):

$$E[p_s \mid p_0] = p_0,$$

$$Var(p_s \mid p_0) = p_0 (1 - p_0) \left(1 - \left(1 - \frac{1}{2N}\right)^{s}\right).$$

For $s = 1$, the variance reduces to the one-generation value $\frac{1}{2N} p_0 (1 - p_0)$.
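These moments can be checked exactly for a small population by propagating the full probability distribution of the allele count through the binomial transition probabilities. The Python sketch below does this with the standard library only; the population size, starting count, and number of generations are arbitrary illustrative values, and the function names are our own.

```python
from math import comb

def wright_fisher_step(dist, two_n):
    """Advance the distribution of the B-allele count by one generation.

    dist[k] is the probability that k of the 2N gene copies are allele B.
    """
    new_dist = [0.0] * (two_n + 1)
    for k, prob in enumerate(dist):
        if prob == 0.0:
            continue
        p = k / two_n
        for j in range(two_n + 1):
            # Binomial probability of drawing j copies of B out of 2N.
            new_dist[j] += prob * comb(two_n, j) * p**j * (1 - p) ** (two_n - j)
    return new_dist

def frequency_moments(dist, two_n):
    """Mean and variance of the allele frequency k / 2N."""
    mean = sum(prob * k / two_n for k, prob in enumerate(dist))
    var = sum(prob * (k / two_n - mean) ** 2 for k, prob in enumerate(dist))
    return mean, var

two_n, k0, s = 20, 8, 5          # illustrative values: 2N = 20, p0 = 0.4
dist = [0.0] * (two_n + 1)
dist[k0] = 1.0                   # generation 0 is known exactly
for _ in range(s):
    dist = wright_fisher_step(dist, two_n)

p0 = k0 / two_n
mean, var = frequency_moments(dist, two_n)
formula_var = p0 * (1 - p0) * (1 - (1 - 1 / two_n) ** s)
print(mean, var, formula_var)
```

In this sketch the exact mean stays at $p_0$ and the exact variance agrees with the closed-form expression up to floating-point error.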

If we treat time as a continuous quantity and assume N is large, by letting �t =1

2Nas a new timescale, we obtain:

2Npt+�t | pt ∼ B (2N,pt )

where B(n, p) denotes a binomial distribution with sample size n and probability p.With a normal approximation to the binomial distribution, we have:

pt+�t | pt ∼ N (pt , pt (1 − pt )�t) (1)

0

0.0

0.4

B A

llele

freq

uenc

y

0.8

0.0

0.4

B A

llele

freq

uenc

y

0.8

0.0

0.4

B A

llele

freq

uenc

y

0.8

10 20 30Generations

n=200

n=20

Generations

Generations

n=2000

40 50

0 10 20 30 40 50

0 10 20 30 40 50

Fig. 5 The allele frequency changes by genetic drift simulated for three population sizes n = 20,200 and 2000


Here, N(μ, σ²) indicates the normal distribution with mean μ and variance σ². Furthermore, Eq. (1) corresponds to a stochastic differential equation:

$$dp_t = \sqrt{p_t (1 - p_t)}\; dw_t$$

where $w_t$ is a standard Brownian motion.

HWE states that allele frequencies and genotype frequencies will remain the same from generation to generation with random coupling of genes and assuming an infinite population (in practice, a sufficiently large population). The Wright-Fisher model states that, in a finite population, chance alone can lead to the loss of an existing allele, leaving the other allele at a frequency of 1, a phenomenon called fixation. In Fig. 5, simulated allele frequency changes under genetic drift are given for several population sizes. For each population size (n = 20, 200 and 2000), 10 scenarios of genetic drift are plotted assuming only half of the population participate in mating. When the population size is small, some populations undergo fixation by chance. In contrast, as the population size increases, these cases of fixation or near fixation occur less often in the simulation results.
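The drift process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the procedure used to draw Fig. 5: it applies the plain Wright-Fisher binomial sampling to all 2N gene copies (it does not reproduce the assumption that only half of the population mates), and the function names `wright_fisher` and `fraction_fixed`, the replicate count, and the seed are all hypothetical choices.

```python
import random

def wright_fisher(p0, n_individuals, generations, rng):
    """Simulate the frequency of allele B under pure genetic drift.

    Each generation, 2N gene copies are drawn with replacement from the
    previous generation, so the number of B copies is Binomial(2N, p_t).
    """
    copies = 2 * n_individuals
    p = p0
    path = [p]
    for _ in range(generations):
        # Binomial draw built from 2N Bernoulli trials (standard library only).
        k = sum(1 for _ in range(copies) if rng.random() < p)
        p = k / copies
        path.append(p)
    return path

def fraction_fixed(n_individuals, replicates=50, generations=50, seed=1):
    """Share of replicate populations in which allele B was fixed or lost."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        final = wright_fisher(0.5, n_individuals, generations, rng)[-1]
        if final in (0.0, 1.0):
            fixed += 1
    return fixed / replicates

if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, fraction_fixed(n))
```

Under these assumptions, the share of replicates that reach a frequency of 0 or 1 within 50 generations should shrink as n grows, which is the qualitative pattern the text describes.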

Study of Family History and Pedigrees

Twin Studies

One of the longest-running scientific issues, the nature versus nurture debate, is closely related to the history of genetics. Even before Mendel’s theories on the concept of genes and inheritance became known to the public, Francis Galton tried to explain the nature of heredity by studying how inheritance affects human behavior and characteristics (Galton 1874). To disentangle the effects of nature and nurture, with a belief in the primary role of the former, Galton investigated twins for the first time in history (Waller 2012). Since twins share almost the same environmental factors and genetic makeup, especially identical twins raised in the same family, research on twins was expected to provide the answer to this old debate.

Several twin studies that were more scientifically sound than Galton’s attempt, building on the early twentieth-century findings on genes and heredity, emerged at approximately the same time. This “classical” twin study design involves comparing the traits of monozygotic (MZ) twin pairs (identical twins) and dizygotic (DZ) twin pairs (nonidentical twins). MZ twins share exactly the same genetic variants, while DZ twins have 50% genetic similarity, the same as the genetic similarity between ordinary siblings, since they are derived from different eggs and sperm. The assumption of the classical twin study is that MZ and DZ twins share almost identical family environments, and thus, if a trait is genetically determined, at least to some degree, that trait in MZ twin pairs will be more similar than that in DZ twin pairs. The first twin study to compare MZ and DZ twins was a study on refraction in human eyes published by a German ophthalmologist, Walter Jablonski, in 1922 (Liew et al. 2005). He compared 40 MZ twins and 12 DZ twins of the same sex for within-pair differences in refractive error and astigmatism. Two other independently designed twin studies also appeared in 1924: one study by Hermann Siemens (Siemens 1924) and another by Merriman (Merriman 1924). In Merriman’s study, the intelligence quotient (IQ) scores of MZ twins showed high correlation (98%) compared to that of the overall twin population (88%).

The most commonly used analytical method in classical twin studies is a statistical method called variance component analysis, which is based on a mathematical model of a simple linear relationship between the variable representing the trait phenotype and the variables for the effects of genes and environment. If we denote shared genetic effects between twins as A, shared environmental effects as C and the residual as E, the linear model to account for the standardized trait phenotype Y (the mean is zero and the variance is one) is:

Y = A + C + E

If A, C, and E are independent within each twin pair, then Var(Y) = Var(A) + Var(C) + Var(E), i.e., the phenotype variance can be decomposed into genetic, environmental, and residual components. Here, Var(E) can be seen as the component representing the environmental influences that are not shared by family members and measurement error, and E is assumed to be independent for the twins in each pair.

If Y is affected solely by genes and not by environmental factors, then the correlation between MZ twin pairs should be 1 and the correlation between DZ twin pairs should be 0.5. If Y is affected by only shared environmental factors, then the correlations of the Y values between MZ twin pairs and DZ twin pairs should both be 1.

The correlation between MZ twin pairs should be:

$$r_{MZ} = \mathrm{Var}(A) + \mathrm{Var}(C) \qquad (2)$$

since A and C are identical and E is independent between the twins in each pair. Additionally, the correlation between DZ twin pairs should be:

$$r_{DZ} = \frac{1}{2}\mathrm{Var}(A) + \mathrm{Var}(C) \qquad (3)$$

since only 50% of genes are shared by DZ twin pairs. If we solve (2) and (3) for Var(A), we obtain:

$$\mathrm{Var}(A) = 2\,(r_{MZ} - r_{DZ})$$

which is called Falconer’s formula (Falconer and MacKay 1996) and can be used to obtain the heritability. The heritability is the degree of genetic effect on a specific trait and is mathematically defined as:


$$h^2 = \frac{\mathrm{Var}(A)}{\mathrm{Var}(Y)}$$

If Y is standardized, then $h^2$ = Var(A). If the heritability is high, the trait can be considered to be determined largely by genes, i.e., by nature. If the heritability is close to zero, we can conclude that the trait is subject to little genetic influence, i.e., it is mostly the result of other factors. For example, recent studies of IQ, a phenotypic measurement of mental ability, have used twin studies (Visscher et al. 2008). In twin samples obtained from many studies, the average MZ and DZ correlations for IQ were 0.86 and 0.60, respectively, based on 4,672 MZ and 5,546 DZ twin pairs (Deary et al. 2006). The heritability obtained from these values using Falconer’s formula is 2(0.86 − 0.60) = 0.52, i.e., 52%. The estimated heritability for IQ has generally been reported as 50–80% by various types of studies (Bartels et al. 2002; McClearn et al. 1997).
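The arithmetic behind Falconer’s formula can be checked with a short sketch. Solving Eqs. (2) and (3) together with Var(A) + Var(C) + Var(E) = 1 also gives the two remaining components, Var(C) = 2 r_DZ − r_MZ and Var(E) = 1 − r_MZ; the function name `falconer_components` is a hypothetical helper, not part of any library.

```python
def falconer_components(r_mz, r_dz):
    """Solve Eqs. (2) and (3) for the variance components of a standardized trait."""
    a = 2 * (r_mz - r_dz)   # Var(A): genetic component, equal to the heritability
    c = 2 * r_dz - r_mz     # Var(C): shared environmental component
    e = 1 - r_mz            # Var(E): unshared environment and measurement error
    return a, c, e

if __name__ == "__main__":
    a, c, e = falconer_components(0.86, 0.60)
    print(f"h2 = {a:.2f}, c2 = {c:.2f}, e2 = {e:.2f}")
    # prints: h2 = 0.52, c2 = 0.34, e2 = 0.14
```

With the IQ correlations quoted above, the sketch reproduces the 52% heritability and attributes the remaining variance to shared environment (34%) and the residual (14%) under the same model assumptions.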

A recent study examined all twin studies conducted between 1956 and 2012 by meta-analysis, covering 17,804 traits from 14,558,903 twins, and reported that the heritability across all traits was 49%, concluding that the nature versus nurture debate should be settled by admitting that genetic and environmental factors are about equally important to human life (Polderman et al. 2015).

Genetic Linkage Mapping

Genetic linkage is a phenomenon that violates the law of independent assortment. When Mendel discovered the law of independent assortment, he did not know that genes are serially structured in chromosomes. Now, it is known that many genes reside together on each chromosome. For example, humans have approximately 19,000 genes on 23 pairs of chromosomes, and fruit flies have approximately 15,000 genes on four pairs of chromosomes (Ezkurdia et al. 2014; Halligan and Keightley 2006). During meiosis, one of each pair of chromosomes is randomly selected to become the single set of chromosomes in a sex cell. For example, a woman’s egg can contain the chromosome 1 copy inherited from her mother and the chromosome 2 copy inherited from her father as a result of random selection. Therefore, if two genes are on different chromosomes, Mendel’s law of independent assortment applies.

Genes on the same chromosome are another question. Early geneticists thought that the genes on the same chromosome were inherited together by offspring until Morgan discovered a contradictory phenomenon (Lobo and Shaw 2008). For example, if the genes for seed color and seed shape are on the same chromosome, and the two copies of the chromosome are separated intact during meiosis, then self-fertilization in the F1 generation should involve only two types of allele pairs for these two genes, (Y, R) and (y, r). (Here, Y and R represent the dominant alleles for seed color and shape, while y and r represent the recessive alleles for seed color and shape, respectively, as in the example in section “Mendel and His Inheritance Models.”) However, the actual meiotic mechanism is more complex than simple random selection of chromosomes. During meiosis, the DNA of chromosome pairs can become mixed due to physical proximity, and when they are separated, a chromosome copy in a sex cell can contain some parts inherited from the father and some parts inherited from the mother (Fig. 6). This phenomenon is called crossing over or recombination.

[Figure 6: a chromosome pair, one copy carrying alleles A and B and the other carrying alleles a and b, is duplicated and undergoes cross-over during meiosis. The four products of meiosis consist of two nonrecombinant (NR) copies and two recombinant (R) copies.]

Fig. 6 Illustration of cross-over (recombination) between a chromosome pair during meiosis

Since these recombination events occur at random locations on a chromosome, the alleles of genes positioned closely on a chromosome are more likely to be inherited together, whereas those of genes positioned farther apart are more likely to assort independently, similar to genes on different chromosomes. This phenomenon is called genetic linkage, and genes that are in physical proximity and passed on to gametes together are said to be linked. Genes that assort independently are said to be unlinked. The combination of alleles at different loci (positions) on the same member of a chromosome pair is called a haplotype.

The positions where recombinations occur are determined randomly, and the recombination probability between two positions depends on the distance between them. Based on this biological model, the concept of a linkage study to find the location of susceptibility genes that are responsible for a trait was proposed through the study of pedigree data (Dawn-Teare and Barrett 2005; Pulst 1999). If pedigree data including several generations of family members with their genotypes and phenotypes are available, then the inherited haplotypes at two loci can be logically or statistically inferred, and the occurrence of recombination events between the two loci can be determined or estimated. The number of recombination events among all meioses observed in the offspring generations can be used to estimate the recombination probability, called the recombination fraction, as the relative frequency of recombination.

If two genes are unlinked, the recombination fraction should be 0.5. If two genes are linked, the recombination fraction should be lower than 0.5. These two statements constitute the null and alternative hypotheses tested in linkage studies:

$$H_0 : \theta = \frac{1}{2} \quad \text{vs.} \quad H_1 : \theta < \frac{1}{2}$$

where θ is the recombination fraction. To test the above hypotheses, a likelihood ratio statistic called the LOD score, which was developed by Morton (Morton 1955), is used. When we know the number of recombinant offspring (R) and the number of nonrecombinant offspring (NR), the LOD score is calculated as follows:

$$\mathrm{LOD} = \max_{\theta < 0.5} \log_{10} \frac{P(\text{data} \mid \theta)}{P(\text{data} \mid \theta = 0.5)} = \max_{\theta < 0.5} \log_{10} \frac{\theta^{R} (1 - \theta)^{NR}}{0.5^{R + NR}}$$

Using the LOD score, the location of the genes responsible for a disease trait can be mapped on a chromosome. The search for disease susceptibility genes by scanning all chromosomes based on genetic linkage analysis of pedigree data is called linkage mapping. In linkage mapping, the LOD score is estimated for each locus across chromosomes using markers (loci with known genotypes) distributed on the chromosomes. Conventionally, LOD scores above 3.0 are considered evidence of linkage (Fig. 7). High LOD scores at a locus indicate the proximity of a susceptibility gene to that locus. In Fig. 7, an example plot of LOD score signals obtained for a genome-wide linkage study of a musical ability phenotype is shown (Park et al. 2012). In this study, a locus on chromosome 4 was detected as a linkage region, showing a LOD score of 3.1. In this region, a gene called UGT8 has been suggested as the candidate gene responsible for variation in musical ability.

For certain pedigree structures, the recombination fraction can be logically inferred without ambiguity. However, for other types of pedigrees, the recombination fraction is statistically estimated. In Fig. 8, a pedigree consisting of the family members in three generations is shown. Males are represented by squares, and females are represented by circles. Individuals affected by the disease are shown in black. If we assume that the disease is autosomal dominant based on a gene with two alleles D (dominant allele) and d (recessive allele) and the genetic marker consists of two alleles, 1 and 2, the recombination status of offspring can be determined without ambiguity. For example, the first granddaughter, with the marker genotype 12, must have the disease genotype dd since she is not affected, and the haplotype consisting of allele 2 and allele d must have been inherited from the father, who clearly received it from the grandfather, indicating no recombination.

Fig. 7 An example of genome-wide LOD scores obtained for linkage study of musical ability (Park et al. 2012)

[Figure 8: three-generation pedigree, listed as marker genotype/disease genotype.
Grandparents: 22/dd, 11/DD
Parents: 11/dd, 12/Dd
Offspring: 12/dd (NR), 11/Dd (NR), 11/Dd (NR), 12/Dd (R), 12/dd (NR), 12/dd (NR)
11, 12, 22: marker genotype; DD, Dd, dd: disease gene genotype (unknown); NR: nonrecombinant; R: recombinant]

Fig. 8 Three-generation pedigree data with affection status and marker genotype information. The recombination status inferred from the known information is also stated
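When the recombinant and nonrecombinant counts are known, as in Fig. 8, the LOD score has a simple closed form, because the likelihood θ^R (1 − θ)^NR is maximized at θ = R / (R + NR). The sketch below assumes phase-known meioses and reports a score of 0 when that estimate is 0.5 or larger; the function name `lod_score` is hypothetical.

```python
import math

def lod_score(recombinants, nonrecombinants):
    """Two-point LOD score and estimated recombination fraction.

    The likelihood theta^R (1 - theta)^NR is maximized at
    theta = R / (R + NR). If that estimate is 0.5 or larger, the data
    give no support for linkage and the score is reported as 0.
    """
    total = recombinants + nonrecombinants
    if total == 0:
        return 0.0, 0.5
    theta_hat = recombinants / total
    if theta_hat >= 0.5:
        return 0.0, 0.5

    # log10 likelihood at theta_hat; a zero count contributes nothing,
    # which also avoids evaluating log10(0) when there are no recombinants.
    log_lik = 0.0
    if recombinants:
        log_lik += recombinants * math.log10(theta_hat)
    if nonrecombinants:
        log_lik += nonrecombinants * math.log10(1 - theta_hat)
    return log_lik - total * math.log10(0.5), theta_hat

if __name__ == "__main__":
    lod, theta = lod_score(1, 5)  # counts read from Fig. 8: R = 1, NR = 5
    print(f"theta = {theta:.3f}, LOD = {lod:.2f}")
```

For the six offspring in Fig. 8, the estimated recombination fraction is 1/6 and the LOD score is about 0.63, far below the conventional threshold of 3.0; a single small family rarely provides enough meioses, which is why scores are in practice combined across many families.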

Through linkage mapping, susceptibility genes for many autosomal dominant or recessive diseases have been found. Nail-patella syndrome (NPS) is a rare genetic disorder causing abnormalities in the bones, joints, fingernails, and kidneys. NPS is characterized by absent or underdeveloped kneecaps and thumbnails, is estimated to occur in 1/50,000 newborns, and is a Mendelian disease caused by an autosomal dominant allele. The first linkage evidence for NPS was suggested by a linkage analysis using the ABO blood group as a marker (Renwick 1956). Subsequent linkage analyses with finer-scale genetic markers led to the discovery of the LMX1B gene responsible for NPS (McIntosh et al. 2005). Huntington disease is another example of a Mendelian disease, and it was the first disease that was mapped to a chromosome by linkage mapping without any prior indication of candidate loci (Gusella et al. 1983). Using data from two large families, researchers found linkage to Huntington disease in the markers on chromosome 4 (HTT gene) (Bates 2005; Walker 2007). Over 7,500 Mendelian disorders in humans have been found by linkage studies, which are well catalogued in the Online Mendelian Inheritance in Man (OMIM) database (https://www.omim.org/) (McKusick-Nathans Institute of Genetic Medicine 2017; Chong et al. 2015). The OMIM database also includes curated data of the genes related to these Mendelian disorders.

Exploring Big Genetic Data

Genome-Wide Association Studies

Linkage studies analyze pedigree data with genotype information from genome-wide marker sets covering all chromosomes to find the locations of susceptibility genes for a trait. However, recruiting and obtaining blood samples from all the family members of the individuals who have been affected by the disease is difficult in many circumstances. A relatively less complicated approach to finding disease susceptibility regions is the genetic association study, which uses data from independent subjects in a population. The most popular genetic association study design currently is the genome-wide association study (GWAS) using case and control samples (Amos 2007; Wellcome Trust Case Control Consortium 2007). GWAS utilizes dense single nucleotide polymorphism (SNP) markers in the genome obtained by commercial array chips. SNPs are single base-pair changes in the DNA sequence observed in the population. Most of the SNPs in the genome have no effect on biological functions. However, SNPs causing amino-acid changes or transcriptional changes may yield phenotypic variation that leads to diseases and defects. Since SNPs occur due to a point mutation involving one base position, SNPs are typically binary, i.e., they have two alleles.

GWAS has become popular for several reasons: First, genetic association studies have been shown to be more powerful than linkage studies for the same sample size (Risch and Merikangas 1996). Second, the cost of new technology for genotyping has decreased, so that the study of 300,000 to 1,000,000 markers has become feasible, enabling finer mapping of susceptibility loci in GWAS, whereas typical linkage studies were performed using approximately 300 to 1,000 microsatellite markers (Baron 2001; Park et al. 2012). Third, the acquisition of population data from independent individuals to constitute case and control samples can be easier than tracking down family members and obtaining their genotypes (Laird and Lange 2006).


Table 2 The genotype frequency notations of a SNP with alleles A and G for cases and controls

                   AA    AG    GG
Controls (Y = 0)   O00   O01   O02
Cases (Y = 1)      O10   O11   O12

The SNPs to be genotyped for GWAS are usually predetermined by the selection available on commercial array chips such as the products from Illumina (San Diego, CA) or Affymetrix (Santa Clara, CA). Those commercial chips may not include the causal SNPs for the phenotypes in the study. The phenomenon of linkage disequilibrium (LD) provides the theoretical basis for pursuing GWAS using these array chips of predetermined SNPs. LD refers to nonrandom association between genetic variants at different loci. Due to LD, the effects of untyped (not genotyped) SNPs can be captured by those typed (genotyped) SNPs that are in high LD with the untyped SNPs. Those SNPs that capture the effects of untyped SNPs are called tag SNPs (Stram 2005). Most of the commercial array chips are designed to include tag SNPs that capture indirect association with most of the known untyped SNPs (Illumina 2010).

The most commonly used analysis method for GWAS consists of single-SNP-based tests to capture the marginal signals of SNPs via indirect association. If a SNP has three genotypes, AA, AG, and GG, and the phenotype of interest is disease status (case vs. control), then the genotype data consist of six counts of genotypes for cases and controls, as shown in Table 2.

Several statistical methods exist to compare the distributions of genotype counts between cases and controls, assuming no specific genetic model or assuming dominant, recessive, or additive genetic models. One of them is the chi-squared test of association between the binary phenotype variable and the nominal genotype variable:

$$T = \sum_{i=0}^{1} \sum_{j=0}^{2} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}$$

where $O_{ij}$ is the observed count in Table 2 and $E_{ij}$ is the expected count assuming no association between phenotype and genotype variables. Under the null hypothesis, this statistic approximately follows a chi-squared distribution with 2 degrees of freedom.
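The test statistic can be computed directly from the six counts of Table 2. The sketch below is an illustration under stated assumptions: the expected counts are taken as (row total × column total) / grand total, every genotype is assumed to be observed at least once, and the p-value uses the fact that the chi-squared survival function with 2 degrees of freedom equals exp(−x/2). The function name `genotype_chi_squared` and the counts in the usage example are hypothetical.

```python
import math

def genotype_chi_squared(controls, cases):
    """Pearson chi-squared test on a 2 x 3 table of genotype counts.

    `controls` and `cases` are (AA, AG, GG) counts, i.e. the rows
    O_0j and O_1j of Table 2. Assumes no genotype column is empty.
    """
    observed = [list(controls), list(cases)]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (expected - observed[i][j]) ** 2 / expected
    # With 2 degrees of freedom, P(chi2 > x) = exp(-x / 2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value

if __name__ == "__main__":
    # Hypothetical counts for 100 controls and 100 cases.
    t, p = genotype_chi_squared(controls=(30, 50, 20), cases=(50, 40, 10))
    print(f"T = {t:.3f}, p = {p:.4f}")
```

For these made-up counts the statistic is about 9.44, which would be notable for a single test but would fall well short of a genome-wide significance threshold.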

For case-control studies, a more flexible analysis model, which can incorporate covariates such as age, sex, or other environmental variables, is logistic regression. Logistic regression can also accommodate different genetic models, including dominant, recessive, and additive models. The logistic regression model of the log odds of the binary phenotype Y on the genotype dosage X and the covariates $Z_1, \ldots, Z_k$ is:

$$\log \frac{P(Y = 1)}{P(Y = 0)} = \beta_0 + \beta_1 X + \gamma_1 Z_1 + \cdots + \gamma_k Z_k$$


Table 3 The genotype dosages of the dominant, recessive and additive genetic model

Genetic model   AA      AG      GG
Dominant        X = 1   X = 1   X = 0
Recessive       X = 1   X = 0   X = 0
Additive        X = 2   X = 1   X = 0

[Figure 9: Manhattan plot. X-axis: chromosome (1 to 22); Y-axis: −log10(P).]

Fig. 9 An example of the Manhattan plot of the GWAS results of single-SNP based tests (Ikram et al. 2010)

Examples of genotype dosages for the dominant, recessive, and additive models are presented in Table 3, assuming allele A is the risk allele.
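The three coding schemes differ only in how the number of risk alleles is mapped to X: the additive model counts the copies, the dominant model asks whether at least one copy is present, and the recessive model asks whether both copies are present. A minimal sketch, with the hypothetical helper name `genotype_dosage` and genotypes written as two-letter strings:

```python
def genotype_dosage(genotype, model, risk_allele="A"):
    """Code a genotype such as "AG" as the dosage X used in the regression."""
    copies = genotype.count(risk_allele)  # 0, 1 or 2 copies of the risk allele
    if model == "additive":
        return copies
    if model == "dominant":
        return 1 if copies >= 1 else 0
    if model == "recessive":
        return 1 if copies == 2 else 0
    raise ValueError(f"unknown genetic model: {model}")

if __name__ == "__main__":
    for model in ("dominant", "recessive", "additive"):
        print(model, [genotype_dosage(g, model) for g in ("AA", "AG", "GG")])
```

The resulting X values can then be used as the genotype predictor in either the logistic or the linear regression model.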

For quantitative phenotypes, the phenotype variable Y can be directly regressed on the linear model of the genotype dosage X and the covariates $Z_1, \ldots, Z_k$:

$$Y = \beta_0 + \beta_1 X + \gamma_1 Z_1 + \cdots + \gamma_k Z_k$$

Once the statistics and corresponding p-values are obtained for every SNP in the genome, the significantly associated SNPs, which show p-values below the genome-wide threshold value, are determined. The Manhattan plot (Fig. 9) is a popular visualization method to present the results of single-SNP-based tests in GWAS (Clarke et al. 2011). The Y-axis of the Manhattan plot is the negative base-10 logarithm of the p-value, and the X-axis represents the SNPs aligned by physical position in each chromosome.

The threshold value for GWAS should be determined considering the multiple hypothesis testing problem, which is the problem of Type I error inflation when the conclusion of the analysis is based on multiple hypothesis tests (Aickin and Gensler 1996). The probability of making a Type I error (a false positive) in at least one of the hypotheses tested, called the family-wise error rate (FWER), increases with the number of hypotheses (Aickin and Gensler 1996). For example, a 5% Type I error probability for each of two independent hypotheses tested increases to an FWER of 0.0975, as shown in the left panel of Fig. 10. To maintain the overall FWER of multiple hypothesis testing below a certain value, the p-value threshold of each hypothesis tested should be adjusted to a lower value than when testing just one hypothesis. If we lower the threshold to 2.5%, then the FWER of two-hypothesis testing becomes approximately 0.0494, within the limit of 5% overall Type I error (Fig. 10). The most famous and simple adjustment method is the Bonferroni correction, which uses an adjusted threshold of the original threshold value divided by the number of hypotheses tested (Aickin and Gensler 1996). However, the Bonferroni correction becomes overly conservative when the test statistics are correlated, and genome-wide single-SNP-based marginal statistics are usually correlated due to LD. Several researchers have studied properly adjusted genome-wide significance threshold values for GWAS and suggested the use of a threshold between $1 \times 10^{-6}$ and $5 \times 10^{-8}$ (Dudbridge and Gusnanto 2008; Clayton 2003; Risch and Merikangas 1996).

[Figure 10: two panels showing the rejection regions of Hypothesis 1 and Hypothesis 2. Unadjusted FWER (5% each): FWER = 0.05 × 1 + 0.05 × 1 − 0.05² = 0.0975. Adjusted FWER (2.5% each): FWER = 0.025 × 1 + 0.025 × 1 − 0.025² ≈ 0.0494.]

Fig. 10 Family-wise error rate (FWER) of two independent hypotheses testing based on the significance level of 0.05 (unadjusted) and 0.025 (adjusted)
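The numbers in Fig. 10 follow from the identity FWER = 1 − (1 − α)^m for m independent tests, each run at level α. The sketch below reproduces them and applies the Bonferroni division; the function names are hypothetical, and the figure of one million tests in the usage example is an illustrative assumption rather than a value taken from the studies cited.

```python
def family_wise_error_rate(alpha, n_tests):
    """FWER for n independent tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the FWER at or below alpha."""
    return alpha / n_tests

if __name__ == "__main__":
    print(f"{family_wise_error_rate(0.05, 2):.4f}")    # unadjusted, two tests
    print(f"{family_wise_error_rate(0.025, 2):.4f}")   # Bonferroni-adjusted
    # Assumed illustration: one million tests at an overall level of 0.05.
    print(bonferroni_threshold(0.05, 1_000_000))
```

Dividing 0.05 by one million tests gives 5 × 10⁻⁸, which matches the stricter end of the range of genome-wide thresholds quoted above.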

Through GWAS, many causal genes and SNPs have been discovered for many genetic traits, usually complex traits affected by multiple genes. As of August 2018, the National Human Genome Research Institute (NHGRI)-EBI GWAS catalog (http://www.ebi.ac.uk/gwas/) contained more than 69,900 SNP-disease associations that have reached p-values of less than $1 \times 10^{-5}$ from more than 3,500 publications (MacArthur et al. 2017; Hindorff et al. 2018). For example, based on GWAS of 36,989 cases and 113,075 controls, 108 associated loci for schizophrenia were found (Schizophrenia Working Group of the Psychiatric Genomics Consortium 2014). Later, a meta-analysis of several GWASs confirmed 132 risk genes for schizophrenia (Lin et al. 2016). These findings are utilized to construct the polygenic risk score, a score that predicts the risk of an individual developing the disease (Power et al. 2015); this method represents a gateway to the era of personalized medicine, which will allow specific prediction of genetic disease risks for individuals (Lu et al. 2014).


Whole Genome Sequencing

Since GWAS became popular, SNPs have been the prevailing genetic variants used for studies in genetic epidemiology. SNP genotypes are usually obtained using commercial array chips targeting predetermined sites. For this reason, GWAS data are limited to the known polymorphisms found in the panel data used to design these commercial arrays. Complete DNA sequencing, even of a small region, used to be very time- and resource-consuming, such that obtaining the entire sequence of the genome was almost unthinkable. Therefore, GWAS using whole genome sequencing (WGS) data from many subjects to find disease susceptibility variants was not attempted until recently.

After the first WGS for bacteria was attempted in 1995, the first WGS data for human were published in 2001 using capillary sequencing technology (Lander et al. 2001; Fleischmann et al. 1995). Recently, faster and cheaper new sequencing technologies have been developed, called next-generation sequencing (NGS) (Shendure and Hanlee 2008). Now, the sequencing results of thousands of people are piling up, waiting for the new era of human genetic studies to include complete nucleotide-level data (1000 Genomes Project Consortium 2012).

NGS technology adopts a shotgun sequencing approach, in which a long DNA sequence is broken into small fragments and the resulting sequences of these short fragments, called reads, are assembled to construct the original DNA sequence by mathematical and statistical methods (Shendure and Hanlee 2008; Zhang et al. 2011). NGS methods try to obtain many reads of shorter length than those of the classical methods; thus, the technology itself heavily depends on complex mathematical and statistical computations to assemble the sequence fragments.

For the assembly of genome sequences using short read data, a method in graph theory, the de Bruijn graph, has been widely used. In 1946, de Bruijn (1918–2012) studied the problem of reconstructing an original sequence of letters using the information of all partial strings of a certain length (k), called k-mers (De Bruijn 1946). This problem was called the superstring problem. An example of k-mers of a DNA sequence is as follows: Given the alphabet A, C, G, and T, corresponding to the DNA nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T), there can be a sequence S, GGCGATTCATCG, and all the 3-mers of S are listed in Fig. 11. Another example of a simple superstring problem can be given using two numbers 0 and 1. Suppose all possible 3-mers are given by the list of 3-digit numbers: 000, 001, 010, 011, 100, 101, 110, 111. The shortest circular superstring that contains each of these 3-mers is 00011101; written out as a linear string, it becomes 0001110100.
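Listing the k-mers of a sequence amounts to sliding a window of length k along it one position at a time. A minimal sketch, using a hypothetical helper named `kmers`:

```python
def kmers(sequence, k):
    """Return every length-k substring of `sequence`, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

if __name__ == "__main__":
    print(kmers("GGCGATTCATCG", 3))
    # ['GGC', 'GCG', 'CGA', 'GAT', 'ATT', 'TTC', 'TCA', 'CAT', 'ATC', 'TCG']
    print(sorted(kmers("0001110100", 3)))
    # ['000', '001', '010', '011', '100', '101', '110', '111']
```

The first call reproduces the ten 3-mers of S, and the second confirms that the linear string 0001110100 contains each of the eight binary 3-mers exactly once.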

For a given value of k, de Bruijn modeled the possible circular relationship of (k−1)-mers using a directed graph allowing loops such that every possible (k−1)-mer is assigned to a node, and a directed edge connects one (k−1)-mer to another (k−1)-mer if the length-(k−2) suffix of the former is the length-(k−2) prefix of the latter (De Bruijn 1946). Each edge therefore corresponds to a k-mer, and an Eulerian cycle of the de Bruijn graph yields the shortest cyclic superstring that contains each k-mer exactly once. Even though an actual set of short reads may not contain all possible sequences, the graph generated from any partial k-mer information is also called a de Bruijn graph. In Fig. 12, an example of a de Bruijn graph for k = 4 is given for a string composed of 0 and 1. In the graph, the nodes represent 3-mers, and the edges represent 4-mers, each of which is constructed by connecting the two 3-mers adjacent to the edge. An Eulerian cycle can be found in this graph, following the blue colored numbers assigned to edges from 1 to 16. Following this order, the first letters of the k-mers constitute the superstring 0000110010111101 (Compeau et al. 2011).

Fig. 11 All 3-mers of the DNA sequence GGCGATTCATCG
Original sequence: GGCGATTCATCG
All 3-mers: GGC, GCG, CGA, GAT, ATT, TTC, TCA, CAT, ATC, TCG

[Figure 12: directed graph whose eight nodes are the 3-mers 000, 001, 010, 011, 100, 101, 110, and 111 and whose sixteen edges are the 4-mers 0000 through 1111; the edges are numbered 1 to 16 in the order of an Eulerian cycle.]

Fig. 12 De Bruijn graph for k = 4 and a string composed of 0 and 1 (Compeau et al. 2011)
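The construction can be sketched in Python as follows. This is an illustration under stated assumptions, not the assembler used in any particular sequencing pipeline: the graph is built from the complete set of k-mers over a small alphabet, the Eulerian cycle is found with Hierholzer's algorithm (a standard choice that the text does not name), and all function names are hypothetical.

```python
from itertools import product

def de_bruijn_graph(kmer_list):
    """Map each (k-1)-mer node to its successors, with one edge per k-mer."""
    graph = {}
    for kmer in kmer_list:
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
        graph.setdefault(kmer[1:], [])
    return graph

def eulerian_cycle(graph):
    """Hierholzer's algorithm; assumes the graph is connected and balanced."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack = [next(iter(remaining))]
    cycle = []
    while stack:
        node = stack[-1]
        if remaining[node]:
            # Follow an unused edge out of the current node.
            stack.append(remaining[node].pop())
        else:
            # No unused edges left here: the node joins the finished cycle.
            cycle.append(stack.pop())
    return cycle[::-1]

def cyclic_superstring(alphabet, k):
    """A shortest cyclic string containing every k-mer over `alphabet` once."""
    all_kmers = ["".join(letters) for letters in product(alphabet, repeat=k)]
    cycle = eulerian_cycle(de_bruijn_graph(all_kmers))
    # The cycle ends at its start node, so drop the last node and read off
    # the first letter of each node visited.
    return "".join(node[0] for node in cycle[:-1])

if __name__ == "__main__":
    print(cyclic_superstring("01", 4))
```

A graph of this kind has many Eulerian cycles, so the 16-letter string printed here need not be the one shown in Fig. 12; any of them contains each of the sixteen 4-mers exactly once when read cyclically.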

With new NGS equipment capable of generating massive quantities of random short reads combined with successful alignment algorithms, faster and cheaper sequencing of the human genome became possible (Zhang et al. 2011; Goodwin et al. 2016). A project called the 1000 Genomes Project was launched to sequence more than one thousand people of various populations over the world, and initial sequencing of 1092 humans was finished in 2012 (1000 Genomes Project Consortium 2012). In this project, 38 million SNPs were validated, including newly discovered SNPs, which represented 58% of the total SNPs found in the study.


Now that the cost of WGS has dropped below $1000, WGS studies to identify genetically associated loci for diseases have become feasible, and several have been finished and reported (Long et al. 2017). Additionally, using dense panel data, such as those in the 1000 Genomes Project, which is available for public use, the untyped genotypes of the subjects in a given study can be statistically inferred. This type of technique to infer missing data is called imputation (Pasaniuc et al. 2012; 1000 Genomes Project Consortium 2015). Imputation methods utilize the LD information between the typed and untyped SNPs obtained from panel data. Imputation methods can augment data and enable extraction of more information from existing GWASs at lower cost, and these methods have been used for many valuable achievements in genetic association studies (Wood et al. 2013).

Network-Based Analysis for Genetic Data

To dissect complex genetic mechanisms and address the genetic information that has been piling up so rapidly, more articulated mathematical models involving a very large number of genetic components and their relationships are required. Network theory (or graph theory) has been used for this purpose, since it provides a simple but powerful mathematical model in which the genetic components are represented by nodes and the relationships between them are represented by edges. Network-based analysis has been applied to solve diverse problems in genetic research by modeling interactions and dynamics in the etiology of genetic disorders.

In the past decades, molecular interaction data from a diverse set of organisms, from yeast to human, have been accumulated. Molecular interaction information can be represented by a network called the interactome, and several kinds of such networks exist, including protein interaction networks, metabolic networks, regulatory networks, and RNA networks (Barabási et al. 2011). Theoretically established network properties are applied to extract the underlying dynamics of the genetic entities composing interactomes. In particular, network topology and information flow through networks have been investigated to find important relationships between molecules.

A protein interaction network is a network model of physical interactions between proteins. Protein interaction networks are the most studied type of molecular interactome, and many public databases of protein interaction data are available. Well-known ones include BioGRID (Stark et al. 2006), the Human Protein Reference Database (HPRD) (Keshava Prasad et al. 2009), and the Database of Interacting Proteins (DIP) (Salwinski et al. 2004). The human protein interaction network consists of approximately 20,000 proteins (represented by nodes) and more than 400,000 interactions, represented by undirected edges, that have been documented by scientific experiments (BioGRID, release 3.4.154). Protein interaction networks are characterized by the property that a small percentage of proteins interact with many other proteins, while most proteins have only a few interactions with others. The highly connected proteins have particular structural importance for the connectivity of the network, and each of these special proteins is called a hub.


Removal of hub proteins from the network affects the function of the network. For example, knockout experiments of yeast genes encoding hub proteins resulted in lethality far more often than knockout of nonhub proteins did (Winzeler et al. 1999). Additionally, hub proteins are more essential and enriched in cancer cells (Sun and Zhao 2010). Furthermore, they appear in more ancient organisms, since their interactions have arisen through the evolutionary process (Chen et al. 2014).

Protein interaction networks are often analyzed to find modularity, or structures of highly connected clusters of proteins within the network. Such clusters of interacting proteins have been shown to correspond to protein functions and to have evolved together to perform a common biological function (Luo et al. 2007). Various clustering algorithms have been applied to find protein complexes based on protein interaction information (Brohée and Helden 2006; Trivodaliev et al. 2014). One of the approaches to clustering proteins is called divisive clustering; this technique partitions the network in hierarchical order. Newman and Girvan proposed a divisive clustering algorithm using edge-betweenness, and this algorithm has been applied to find protein modules in protein interaction networks (Newman and Girvan 2004; Dunn et al. 2005). The edge-betweenness of an edge is calculated as the number of shortest paths between pairs of vertices that pass through that edge. The Newman and Girvan (NG) algorithm repeatedly finds the edge with the highest edge-betweenness and removes it from the network, decomposing the nodes in a hierarchical order that can be represented by a dendrogram (Fig. 13). The NG algorithm can proceed until it divides the network into individual nodes. To obtain a clustering result, an optimum value for the number of clusters must therefore be determined using some criterion. For example, a quality measure called modularity, denoted by Q, can be computed for each candidate number of clusters, denoted by K, and is defined as follows (Narayanan et al. 2011):

Fig. 13 A dendrogram consisting of a hierarchical tree structure. The horizontal line partitions the elements into three clusters (cluster 1, cluster 2, and cluster 3)


$$Q = \sum_{i=1}^{K} \left( e_{ii} - a_i^{2} \right)$$

where $e_{ii}$ is the fraction of the edges with both ends in cluster i over all edges in the network and $a_i$ is the fraction of edges with at least one end in cluster i over all edges in the network. Higher values of Q mean that more edges are within the same cluster than would be expected by chance.
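The divisive procedure and the use of Q to choose the number of clusters can be sketched as follows. This is a minimal illustration written with the Python standard library only, not the implementation used in the cited studies. Two assumptions should be noted: edge-betweenness is computed with a breadth-first (Brandes-style) accumulation, and $a_i$ is computed in the usual Newman and Girvan (2004) sense, as the fraction of edge ends attached to nodes of cluster i. The six-node graph (two triangles joined by one bridge) and all function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (unweighted, undirected)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0.0 for v in adj}      # shortest-path counts from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):          # accumulate from the farthest nodes
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2.0 for e, b in score.items()}  # each pair was seen twice


def components(adj):
    """Connected components of the current graph, as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def modularity(edges, clusters):
    """Q = sum_i (e_ii - a_i^2), evaluated on the original edge list."""
    m = len(edges)
    q = 0.0
    for c in clusters:
        e_ii = sum(1 for u, v in edges if u in c and v in c) / m
        a_i = sum((u in c) + (v in c) for u, v in edges) / (2 * m)
        q += e_ii - a_i ** 2
    return q


def girvan_newman(edges):
    """Remove highest-betweenness edges one by one; keep the partition with the best Q."""
    adj = {v: set() for v in sorted({x for e in edges for x in e})}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best = components(adj)
    best_q = modularity(edges, best)
    while any(adj.values()):
        eb = edge_betweenness(adj)
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
        clusters = components(adj)
        q = modularity(edges, clusters)
        if q > best_q:
            best_q, best = q, clusters
    return best, best_q


# Hypothetical toy network: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
best_clusters, best_q = girvan_newman(edges)
print(sorted(sorted(c) for c in best_clusters), round(best_q, 4))
```

In this toy network the bridge carries all shortest paths between the two triangles, so it is removed first, and the resulting two-cluster partition has the highest Q among all partitions visited by the algorithm.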

Network modeling using directed edges is also used to model directional relationships between genetic components, environmental components, and disease phenotypes. In particular, probabilistic models combined with structural assumptions represented by a directed network can provide statistical methods to discover complex mechanisms relating disease phenotypes to genetic factors. Among these probabilistic graph models, Bayesian network (BN) analysis is a popular machine learning method using a graph model based on a directed acyclic graph (DAG) representing a network structure (Pearl 2000). In BN analysis, nodes represent random variables, which can denote disease phenotypes, genetic variants, or environmental factors, and edges represent conditional dependencies between random variables. In certain cases, directed edges can be interpreted as causal relationships between random variables.

If a directed edge connects node X to node Y in a DAG, we call X the parent of Y or Y the child of X. A BN can be translated into a joint likelihood L of all involved random variables $X_1, \cdots, X_n$, given a set of parameters, to specify the joint probability distribution. The likelihood L can be decomposed into a product of the conditional probabilities of each random variable given its parents:

$$L(X_1, \cdots, X_n \mid \theta) = \prod_{i=1}^{n} P\left(X_i \mid \mathrm{pa}(X_i), \theta_i\right)$$

where $\mathrm{pa}(X_i)$ is the set of parent variables of $X_i$ and $\theta = (\theta_1, \cdots, \theta_n)$ is the group of parameters related to $X_1, \cdots, X_n$, respectively.
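As a small worked illustration of this factorization, the sketch below evaluates the joint probability of a three-node DAG as a product of conditional probabilities. The network (A and G are both parents of C, all variables binary) and every probability value are hypothetical and chosen only for illustration; they are not taken from any study cited in this chapter.

```python
from itertools import product

# Hypothetical DAG A -> C <- G, given as {node: tuple of parents}.
parents = {"A": (), "G": (), "C": ("A", "G")}

# theta_i for each node: P(X_i = 1 | parent values). All numbers are made up.
cpt = {
    "A": {(): 0.3},
    "G": {(): 0.2},
    "C": {(0, 0): 0.01, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.30},
}


def joint_probability(assignment):
    """Product over nodes of P(X_i | pa(X_i), theta_i) for one full assignment."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob


# The factorized terms should sum to 1 over all 2^3 assignments.
total = sum(joint_probability(dict(zip(parents, values)))
            for values in product((0, 1), repeat=len(parents)))
p_all = joint_probability({"A": 1, "G": 1, "C": 1})  # 0.3 * 0.2 * 0.30
print(total, p_all)
```

Because each node needs a table indexed only by its own parents, the number of parameters grows with the size of the parent sets rather than with the total number of variables, which is what makes the factorized form practical.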

In Fig. 14, the relationships among random variables denoting cancer status, genetic variants (XRCC3_04 and XRCC3_241), and other covariates such as gender and risk factors including arsenic exposure and smoking are represented by a DAG. This model can be interpreted to state “for any combination of smoking status and gender, the risk of (bladder) cancer is elevated when toenail arsenic levels are high in those with variants at positions 241 and 04 of the gene region XRCC3” (Su et al. 2013).

A BN model can be extracted from data by various machine learning algorithms. If this learning is performed for only the parameter values, assuming the structure part is given, it is called parameter learning. If the structure, i.e., DAG, is also determined by the data, the process is called structure learning. To learn BN parameters, methods using expectation-maximization (EM) algorithms or Bayesian methods such as Markov chain Monte Carlo (MCMC) algorithms are usually used


Fig. 14 The DAG representing the relationships among random variables denoting cancer status, genetic variants (XRCC3_04 and XRCC3_241), and other covariates such as gender and risk factors such as arsenic exposure and smoking (Su et al. 2013). Nodes shown: GENDER, SMOKER, CANCER, XRCC3_241, ARSENIC, and XRCC3_04

(Neapolitan 2003). To learn DAG structure, constraint-based learning, which tries to construct the structure model by examining conditional independence between random variables, and score-based learning, which tries to optimize the fit score based on the likelihood obtained by assuming each candidate structure, are used.
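Score-based learning can be illustrated by scoring two candidate structures on the same data and keeping the better one. The sketch below uses the Bayesian information criterion (BIC) as the fit score, which is one common likelihood-based choice and an assumption here, since the text does not name a specific score. The data set, the variable names X and Y, and the function names are hypothetical; a real structure search would explore many more candidate DAGs.

```python
import math
from collections import Counter


def log_likelihood(data, parents):
    """Maximized log-likelihood of discrete data for a DAG {node: parents}."""
    ll = 0.0
    for node, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        for (pa_values, _), n in joint.items():
            ll += n * math.log(n / marginal[pa_values])  # MLE of P(x | pa)
    return ll


def bic(data, parents, levels=2):
    """Log-likelihood penalized by the number of free parameters."""
    n_params = sum((levels - 1) * levels ** len(pa) for pa in parents.values())
    return log_likelihood(data, parents) - 0.5 * n_params * math.log(len(data))


# Hypothetical binary data in which X and Y agree in 80 of 100 records.
data = ([{"X": 0, "Y": 0}] * 40 + [{"X": 1, "Y": 1}] * 40
        + [{"X": 0, "Y": 1}] * 10 + [{"X": 1, "Y": 0}] * 10)

candidates = {
    "no edge": {"X": (), "Y": ()},
    "X -> Y": {"X": (), "Y": ("X",)},
}
scores = {name: bic(data, dag) for name, dag in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Note that a likelihood-based score of this kind cannot distinguish X -> Y from Y -> X, because both structures encode the same dependence; orienting such edges requires additional assumptions or prior knowledge.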

Discussion

In this chapter, the author has explained how genetics emerged as mathematical models of inheritance were found, stimulated the development of modern inferential statistics, and has evolved with advances in the mathematical and statistical sciences. These three domains have interacted and influenced each other throughout the history of modern science (Lange et al. 2014). Mathematical models have been used to illustrate the mechanisms of inheritance and relationships between entities involved in genetic phenomena. Statistical methods have been developed to provide criteria for judgment of fit of mathematical models to observed genetic data (confirmatory analysis) and to explore genetic data to find meaningful patterns (exploratory analysis) (Tukey 1980). Statistical methods developed for genetics have been extended and generalized to other situations outside genetics, contributing to the growth of statistical science.

State-of-the-art mathematical and statistical methodologies will continue to be applied in order to unveil the mechanisms of genetic phenomena using newly available data types. The current trends in mathematical genetics and statistical genetics revolve around the following methodological needs: (1) achieving drastically improved computational efficiency to deal with big genomic data produced by recent technologies (The Computational Pan-Genomics Consortium 2018), (2) finding suitable strategies for integration of multiple information sources and different outputs from various analysis methods (Karczewski 2018), and (3) finding better analytic tools for more individualized healthcare including prediction, prevention, diagnosis, and prognosis based on individual genomic configuration (Lu et al. 2014). Recently, genetics research has been moving toward personalized medicine with two new research paradigms: single-cell analysis


(Shalek and Benson 2017) and human microbiomics (Zitvogel et al. 2015). We can expect that, in the near future, mathematics and statistics will affect everyday life more deeply through genetic analysis and predictions tailored for each individual.

Acknowledgments This work was supported by the National Research Foundation of Korea (NRF) grants NRF-2015R1A1A3A04001269 and NRF-2018R1A2B6008016.

References

1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65

1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature 526:68–74

Aickin M, Gensler H (1996) Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am J Public Health 86:726–728

Amos CI (2007) Successful design and conduct of genome-wide association studies. Hum Mol Genet 16:R220–R225

Barabási AL, Gulbahce N, Loscalzo J (2011) Network medicine: a network-based approach to human disease. Nat Rev Genet 12:56–68

Baron M (2001) The search for complex disease genes: fault by linkage or fault by association? Mol Psychiatry 6:143–149

Bartels M, Rietveld MJ, Van Baal C, Boomsma DI (2002) Genetic and environmental influences on the development of intelligence. Behav Genet 32:237–249

Bates GP (2005) History of genetic disease: The molecular genetics of Huntington disease – a history. Nat Rev Genet 6:766–773

Biau DJ, Jolles BM, Porcher R (2010) P value and the theory of hypothesis testing: an explanation for new researchers. Clin Orthop Relat Res 468:885–892

Blackstock WP, Weir MP (1999) Proteomics: quantitative and physical mapping of cellular proteins. Trends Biotechnol 17:121–127

Brohée S, Helden JV (2006) Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinf 7:488

Brown TA (2002) Genomes, 2nd edn. Wiley-Liss, Oxford

Chen C-Y, Ho A, Huang H-Y, Juan H-F, Huang H-C (2014) Dissecting the human protein-protein interaction network via phylogenetic decomposition. Sci Rep 4:7153

Chiras D (2012) Human biology, 7th edn. Jones & Bartlett Learning, Sudbury

Chong JX, Buckingham KJ, Jhangiani SN, Boehm C (2015) The genetic basis of mendelian phenotypes: discoveries, challenges, and opportunities. Am J Hum Genet 97:199–215

Clarke GM, Anderson CA, Pettersson FH, Cardon LR, Morris AP, Zondervan KT (2011) Basic statistical analysis in genetic case-control studies. Nat Protoc 6:121–133

Clayton D (2003) P-values, false discovery rates, and Bayes factors: how should we assess the “significance” of genetic associations? Ann Hum Genet 67:630

Compeau PE, Pevzner PA, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29:987–991

Cox DR (2002) Karl Pearson and the Chi-Squared Test. In: Huber-Carol C, Balakrishnan N, Nikulin M, Mesbah M (eds) Goodness-of-fit test and model validity (Statistics for industry and technology). Springer Science+Business Media, Boston, pp 3–8


Crow JF (1987) Population Genetics History: A Personal view. Annu Rev Genet 21:1–22

Crow JF (2002) Perspective: here’s to Fisher, additive genetic variance, and the fundamental theorem of natural selection. Evolution 56:1313–1316

Crow JF, Kimura M (1970) An introduction to population genetics theory. Harper and Row, New York

Dawn-Teare M, Barrett JH (2005) Genetic linkage studies. Lancet 366:1036–1044

De Bruijn NG (1946) A combinatorial problem. Koninklijke Nederlandse Akademie v Wetenschappen 49:758–764

Deary IJ, Spinath FM, Bates TC (2006) Genetics of intelligence. Eur J Hum Genet 14:690–700

Dudbridge F, Gusnanto A (2008) Estimation of significance thresholds for genomewide association scans. Genet Epidemiol 32:227–234

Dunn R, Dudbridge F, Sanderson C (2005) The use of edge-betweenness clustering to investigate biological function in protein interaction networks. BMC Bioinf 6:39

Edwards AWF (1977) Foundations of mathematical genetics. Cambridge University Press, Cambridge

Edwards AWF (2008) G. H. Hardy (1908) and Hardy–Weinberg equilibrium. Genetics 179:1143–1150

Ezkurdia I, Juan D, Rodriguez JM, Frankish A, Diekhans M, Harrow J, Vazquez J, Valencia A, Tress ML (2014) Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Hum Mol Genet 23:5866–5878

Fairbanks DJ, Schaalje GB (2007) The tetrad-pollen model fails to explain the bias in Mendel’s pea (Pisum sativum) experiments. Genetics 177:2531–2534

Falconer DS, MacKay TFC (1996) Introduction to quantitative genetics, 4th edn. Longmans Green, Harlow

Fisher RA (1924) On a distribution yielding the error functions of several well known statistics. In: Proceedings of the International Congress of Mathematics, vol 2, Toronto, pp 806–813

Fisher RA (1930) The genetical theory of natural selection. Clarendon, Oxford

Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512

Freimer N, Sabatti C (2003) The human phenome project. Nat Genet 34:15–21

Galton F (1874) On men of science, their nature and their nurture. In: Proceedings of the Royal Institution of Great Britain, vol 7, pp 227–236

Galton F (1886) Regression Towards Mediocrity in Hereditary Stature. J Anthropol Inst G B Irel 15:246–263

Gerlai R (2002) Phenomics: fiction or the future? Trends Neurosci 25:506–509

Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 17:333–351

Gusella JF, Wexler NS, Conneally PM, Naylor SL, Anderson MA, Tanzi RE, Watkins PC, Ottina K, Wallace MR, Sakaguchi AY et al (1983) A polymorphic DNA marker genetically linked to Huntington’s disease. Nature 306:234–238

Halligan D, Keightley P (2006) Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison. Genome Res 16:875–884

Hardy GH (1908) Mendelian proportions in a mixed population. Science 28:49–50

Hindorff LA MJEBI, Morales J (European Bioinformatics Institute), Junkins HA, Hall PN, Klemm AK, Manolio TA (2018) A catalog of published genome-wide association studies. Available at: http://www.ebi.ac.uk/gwas. Accessed at Mar 2018

Ikram MK, Sim X, Jensen RA, Cotch MF, Hewitt AW, Ikram MA, Wang JJ, Klein R, Klein BE, Breteler MM et al (2010) Four novel Loci (19q13, 6q24, 12q24, and 5q14) influence the microcirculation in vivo. PLoS Genet 28:e1001184

Illumina (2010) Technical note: software for tag single nucleotide polymorphism selection. Illumina, San Diego

Karczewski KJ (2018) Integrative omics for health and disease. Nat Rev Genet 19:299–310


Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L et al (2009) Human protein reference database – 2009 update. Nucleic Acids Res 37:D767–D772

Kiechle FL, Zhang X, Holland-Staley CA (2004) The -omics era and its impact. Arch Pathol Lab Med 128:1337–1345

Laird NM, Lange C (2006) Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet 7:285–394

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M et al (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921

Lange K, Papp JC, Sinsheimer JS, Sobel EM (2014) Next generation statistical genetics: modeling, penalization, and optimization in high-dimensional data. Annu Rev Stat Appl 1:279–300

Liew S, Elsner H, Spector T, Hammond C (2005) The first “classical” twin study? Analysis of refractive error using monozygotic and dizygotic twins published in 1922. Twin Res Hum Genet 8:198–200

Lin J-R, Cai Y, Zhang Q, Zhang W, Nogales-Cadenas R, Zhang ZD (2016) Integrated post-GWAS analysis sheds new light on the disease mechanisms of schizophrenia. Genetics 204:1587–1600

Lobo I, Shaw K (2008) Thomas Hunt Morgan, genetic recombination, and gene mapping. Nat Educ 1:205

Long T, Hicks M, Yu HC, Biggs WH, Kirkness EF, Menni C, Zierer J, Small KS, Mangino M, Messier H (2017) Whole-genome sequencing identifies common-to-rare variants associated with human blood metabolites. Nat Genet 49:568–578

Lu Y-F, Goldstein DB, Angrist M, Cavalleri G (2014) Personalized medicine and human genetic diversity. Cold Spring Harb Perspect Med 4:a008581

Luo F, Yang Y, Chen CF, Chang R, Zhou J, Scheuermann RH (2007) Modular organization of protein interaction networks. Bioinformatics 23:207–214

MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, Junkins H, McMahon A, Milano A, Morales J, Pendlington ZM, Welter D, Burdett T, Hindorff L, Flicek P, Cunningham F, Parkinson H (2017) The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res 45:D896–D901

Magnello ME (1998) Karl Pearson’s mathematization of inheritance: from ancestral heredity to Mendelian genetics (1895–1909). Ann Sci 55:35–94

Magnello ME (2004) The reception of Mendelism by the biometricians and the early Mendelians. In: Keynes M, Edwards AWF, Peel R (eds) A century of Mendelism in human genetics. CRC Press, Boca Raton, pp 17–30

Masel J (2011) Genetic drift. Curr Biol 21:R837–R838

McClearn GE, Johansson B, Berg S, Pedersen NL, Ahern F, Petrill SA, Plomin R (1997) Substantial genetic influence on cognitive abilities in twins 80 or more years old. Science 276:1560–1563

McIntosh I, Dunston JA, Liu L, Hoover-Fong JE, Sweeney E (2005) Nail patella syndrome revisited: 50 years after linkage. Ann Hum Genet 69:349–363

McKusick-Nathans Institute of Genetic Medicine (2017) “OMIM Entry Statistics” Online Mendelian inheritance in man. Johns Hopkins University, Baltimore

Merriman C (1924) The intellectual resemblance of twins. Psychol Monogr 33:1–58

Morton NE (1955) Sequential tests for the detection of linkage. Am J Hum Genet 7:277–318

Narayanan T, Gersten M, Subramaniam S, Grama A (2011) Modularity detection in protein-protein interaction networks. BMC Res Notes 4:569

Neapolitan RE (2003) Learning Bayesian networks. Prentice Hall, Englewood Cliffs

Newman M, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113

Orel V (2009) The “useful questions of heredity” before Mendel. J Hered 100:421–423

Park H, Lee S, Kim HJ, Ju YS, Shin JY, Hong D, von Grotthuss M, Lee DS, Park C, Kim JH, Kim B, Yoo YJ, Cho SI, Sung J, Lee C, Kim JI, Seo JS (2012) Comprehensive genomic analyses associate UGT8 variants with musical ability in a Mongolian population. J Med Genet 49:747–752


Pasaniuc B, Rohland N, McLaren PJ, Garimella K, Zaitlen N, Li H, Gupta N, Neale BM et al (2012) Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet 44:631–635

Pearl J (2000) Causality: models, reasoning, and inference. Cambridge University Press, Cambridge

Pearson K (1900) On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos Mag Ser 5 50:157–175

Polderman TJC, Benyamin B, Leeuw CAD, Sullivan PF (2015) Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nat Genet 47:702–712

Power RA, Steinberg S, Bjornsdottir G, Rietveld CA, Abdellaoui A, Nivard MM, Johannesson M, Galesloot TE, Hottenga JJ et al (2015) Polygenic risk scores for schizophrenia and bipolar disorder predict creativity. Nat Neurosci 18:953–955

Pulst SM (1999) Genetic linkage studies. Arch Neurol 56:667–672

Raja K, Patrick M, Gao Y, Madu D, Yang Y, Tsoi LC (2017) A review of recent advancement in integrating omics data with literature mining towards biomedical discoveries. Int J Genomics 2017:6213474

Renwick JH (1956) Nail-patella syndrome: evidence for modification by alleles at the main locus. Ann Hum Genet 21:159–169

Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273:1516–1517

Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res 32:D449–D451

Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014) Biological insights from 108 schizophrenia-associated genetic loci. Nature 511:421–427

Shalek AK, Benson M (2017) Single-cell analyses to tailor treatments. Sci Transl Med 9:eaan4730

Shendure J, Hanlee JI (2008) Next-generation DNA sequencing. Nat Biotechnol 26:1135–1145

Siddartha M (2016) The gene: an intimate history, 1st edn. Scribner, New York

Siemens H (1924) Zwillingspathologie: Ihre Bedeutung; ihre Methodik, ihre bisherigen Ergebnisse. Springer, Berlin

Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34:D535–D539

Stigler SM (1997) Regression toward the mean, historically considered. Stat Methods Med Res 6:103–114

Stigler SM (2010) Darwin, Galton and the statistical enlightenment. J R Stat Soc A Stat 173:469–482

Stram DO (2005) Software for tag single nucleotide polymorphism selection. Hum Genomics 2:144–151

Su C, Andrew A, Karagas MR, Borsuk ME (2013) Using Bayesian networks to discover relations. BioData Min 6:6

Sun J, Zhao Z (2010) A comparative study of cancer proteins in the human protein-protein interaction network. BMC Genomics 11:S5

The Computational Pan-Genomics Consortium (2018) Computational pan-genomics: status, promises and challenges. Brief Bioinform 19:118–135

Tian W, Dong X, Zhou Y, Ren R (2011) Predicting gene function using omics data: from data preparation to data integration. In: Kihara D (ed) Protein function prediction for omics era. Springer, London, pp 215–242

Trivodaliev K, Bogojeska A, Kocarev L (2014) Exploring function prediction in protein interaction networks via clustering methods. PLoS One 9:e99755

Tukey JW (1980) We need both exploratory and confirmatory. Am Stat 34:23–25

Visscher PM, Hill WG, Wray NR (2008) Heritability in the genomics era – concepts and misconceptions. Nat Rev Genet 9:255–266

Walker F (2007) Huntington’s disease. Lancet 369:218–228


Waller JC (2012) Commentary: the birth of the twin study – a commentary on Francis Galton’s ‘The history of twins’. Int J Epidemiol 41:913–917

Wang J, Shete S (2017) Testing departure from Hardy-Weinberg proportions. Methods Mol Biol 1666:83–115

Weinberg W (1908) Über den Nachweis der Vererbung beim Menschen. Jahreshefte des Vereins für vaterländische Naturkunde in Württemberg 64:368–382

Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678

Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, Bangham R, Benito R, Boeke JD et al (1999) Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285:901–906

Wood AR, Perry JR, Tanaka T, Hernandez DG, Zheng HF, Melzer D, Gibbs JR, Nalls MA, Weedon MN, Spector TD, Richards JB, Bandinelli S, Ferrucci L, Singleton AB, Frayling TM (2013) Imputation of variants from the 1000 Genomes Project modestly improves known associations and can identify low-frequency variant-phenotype associations undetected by HapMap based imputation. PLoS One 8:e64343

Wright S (1931) Evolution in Mendelian populations. Genetics 16:97–159

Wu X, AlHasan M, Chen J (2014) Pathway and network analysis in proteomics. J Theor Biol 2014:44–52

Yates F, Mather K (1963) Ronald Aylmer Fisher, 1890–1962. Biogr Mem Fellows R Soc 9:91–129

Zhang J, Chiodini R, Badr A, Zhang G (2011) The impact of next-generation sequencing on genomics. J Genet Genomics 38:95–109

Zitvogel L, Galluzzi L, Viaud S, Vétizou M, Daillère R, Merad M, Kroemer G (2015) Cancer and the gut microbiota: an unexpected link. Sci Transl Med 7:271ps1