Rockefeller pop gen 2017 - Baylor College of Medicine


Intro to population genetics

Shamil Sunyaev

Broad Institute of M.I.T. and Harvard

Forces responsible for genetic change

Mutation µ

Selection s

Drift Ne

Population structure FST

Mutations

Mutation rate in humans and flies

Per nt: 2.5x10^-8 (Nachman & Crowell), 1.8x10^-8 (Kondrashov)

Changes per genome: ~10^2

Other events: indels (10^-9)

repeat extensions/contractions (10^-5)

large events (?)

NGS estimates: ~1.2x10^-8 per nt
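The link between a per-nucleotide rate and the number of new changes per genome is a single multiplication. The sketch below is illustrative only: the haploid genome size of 3.2x10^9 nt is an assumption that does not appear on the slide, and the function name is hypothetical.

```python
# Assumed approximate human haploid genome size (not stated on the slide).
HAPLOID_GENOME_NT = 3.2e9

# Per-nucleotide, per-generation rates quoted on the slide.
RATES_PER_NT = {
    "Nachman & Crowell": 2.5e-8,
    "Kondrashov": 1.8e-8,
    "NGS estimates": 1.2e-8,
}


def new_mutations_per_diploid_genome(rate_per_nt, haploid_nt=HAPLOID_GENOME_NT):
    """Expected new point mutations per diploid genome per generation."""
    return 2 * haploid_nt * rate_per_nt


for source, rate in RATES_PER_NT.items():
    expected = new_mutations_per_diploid_genome(rate)
    print(f"{source}: about {expected:.0f} new mutations per genome")
```

Under this assumed genome size all three rates give tens to low hundreds of new mutations per generation, which is the order of magnitude quoted above.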

Mutation rate is variable along the genome

Regional variation of mutation rate

Context dependence of mutation rate

Replication fidelity

DNA damage

DNA repair

CpG deamination

Genetic drift

Drift is a random change of allele frequencies

Drift depends on population size
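The dependence of drift on population size can be seen in a minimal Wright-Fisher sketch: each generation, 2N gene copies are sampled binomially from the current allele frequency, so the variance of the one-generation frequency change is p(1-p)/(2N). The function names, population sizes, and seed below are illustrative assumptions.

```python
import random


def next_generation_count(count, two_n, rng):
    """Sample 2N gene copies binomially from the current allele frequency."""
    p = count / two_n
    return sum(rng.random() < p for _ in range(two_n))


def variance_of_frequency_change(n_diploid, start_freq=0.5, replicates=500, seed=1):
    """Empirical variance of the one-generation change in allele frequency."""
    rng = random.Random(seed)
    two_n = 2 * n_diploid
    start_count = round(start_freq * two_n)
    changes = []
    for _ in range(replicates):
        new_count = next_generation_count(start_count, two_n, rng)
        changes.append((new_count - start_count) / two_n)
    mean_change = sum(changes) / replicates
    return sum((c - mean_change) ** 2 for c in changes) / (replicates - 1)


for n in (20, 500):
    expected = 0.25 / (2 * n)  # p(1-p)/(2N) at p = 0.5
    observed = variance_of_frequency_change(n)
    print(f"N={n}: simulated variance {observed:.5f}, expected {expected:.5f}")
```

The smaller population shows a much larger variance in frequency change, which is what "drift depends on population size" means in practice.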

Demographic history

Selection

Neutral Deleterious Advantageous

New mutation

Functional

Nonfunctional

Selection indicates functional mutations, whether or not the tested trait is under selection

Selective effect of mutation

Most functional mutations are deleterious


Methods of mathematical population genetics

Dynamics of allelic substitution

[Figure: allele frequency between 0 and 1 plotted against time]

Mathematically, allele frequency change in a population follows a one-dimensional random walk

Diffusion approximation

Random walk that does not jump long distances can be approximated by a diffusion process

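$$\frac{\partial \phi(x, p, t)}{\partial t} = -\frac{\partial\,[M\,\phi(x, p, t)]}{\partial x} + \frac{1}{2}\,\frac{\partial^2\,[V\,\phi(x, p, t)]}{\partial x^2}$$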
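The slide does not define M and V. As a hedged illustration, a common textbook choice (an assumption here, not taken from the slides) for genic selection with coefficient s in a diploid population of effective size N_e is the following, where x is the current allele frequency, M is the mean change in frequency per generation, and V is its variance:

$$M(x) = s\,x(1 - x), \qquad V(x) = \frac{x(1 - x)}{2N_e}$$

With s = 0 the first term of the diffusion equation vanishes and only drift remains, so the strength of selection relative to drift is governed by the product of N_e and s.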

Coalescent theory

Instead of modeling a population, we can model our sample

Time goes backwards!
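To make "modeling the sample" concrete, the sketch below uses the standard Kingman coalescent result that, with time measured in units of 2N generations, the expected time during which the sample has k lineages is 2/(k(k-1)). The function names are hypothetical and the neutral, constant-size model is an assumption.

```python
def expected_coalescent_times(sample_size):
    """Expected time with k lineages, k = n..2, in units of 2N generations."""
    return {k: 2.0 / (k * (k - 1)) for k in range(sample_size, 1, -1)}


def expected_tmrca(sample_size):
    """Expected time to the most recent common ancestor of the sample."""
    return sum(expected_coalescent_times(sample_size).values())


for n in (2, 10, 100):
    print(f"n={n}: expected TMRCA = {expected_tmrca(n):.3f} (units of 2N generations)")
```

The sum telescopes to 2(1 - 1/n), so even a very large sample is expected to coalesce within about 4N generations, and most of that time is spent on the final two lineages.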

Natural selection in protein coding regions

Effect of new missense mutations

Computer simulations

$$\frac{\partial \phi(x, p, t)}{\partial t} = -\frac{\partial\,[M\,\phi(x, p, t)]}{\partial x} + \frac{1}{2}\,\frac{\partial^2\,[V\,\phi(x, p, t)]}{\partial x^2}$$

Demographic history

Natural selection

• Can we find additional evidence in sequence data?

• Is there any information beyond frequency? Can we tell alleles under selection from neutral alleles if they are of the same frequency?


Maruyama effect (1974): at any given frequency, advantageous or deleterious alleles are younger than neutral alleles

[Figure: simulated allele trajectories; x-axis: time (generations), y-axis: allele count]

At a given frequency, deleterious and advantageous alleles are younger than neutral alleles

[Figure: two trajectories rising from frequency 0% to frequency x over time]

Longer trajectory: 6 jumps

Shorter trajectory: 4 jumps

Intuition: shorter trajectories require fewer lucky jumps

[Figure: allele frequency versus time]

Neutral alleles: equal time at each frequency

Selected alleles: faster through higher frequencies

Idea: low accumulation of mutations at linked sites indicates selection

Diffusion theory: deleterious alleles pass fast through higher frequencies
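The age claim can be checked numerically without simulation noise. The sketch below is a minimal illustration, not the method used for the figures: it builds an exact Wright-Fisher Markov chain for a small population (2N = 40, an arbitrary choice), assumes genic selection with relative fitness 1 + s per allele copy, and assumes new mutations enter as single copies at a constant rate. Under those assumptions the mean age of alleles currently at count i is the age-weighted average of the probability that a new mutation sits at count i after a given number of generations.

```python
from math import comb


def transition_matrix(two_n, s):
    """Wright-Fisher transition probabilities between allele counts 0..2N."""
    matrix = []
    for i in range(two_n + 1):
        x = i / two_n
        p = x * (1 + s) / (1 + s * x)  # expected frequency after selection
        matrix.append([comb(two_n, j) * p**j * (1 - p) ** (two_n - j)
                       for j in range(two_n + 1)])
    return matrix


def mean_age_by_count(two_n, s, tol=1e-10):
    """Mean age in generations of segregating alleles, keyed by current count."""
    P = transition_matrix(two_n, s)
    segregating = range(1, two_n)
    state = [0.0] * (two_n + 1)
    state[1] = 1.0  # a new mutation starts as a single copy at age 0
    visits = [0.0] * (two_n + 1)
    age_weighted = [0.0] * (two_n + 1)
    age = 0
    while sum(state) > tol:  # stop once nearly all alleles are lost or fixed
        for i in segregating:
            visits[i] += state[i]
            age_weighted[i] += age * state[i]
        nxt = [0.0] * (two_n + 1)
        for i in segregating:
            if state[i] > 0.0:
                row = P[i]
                for j in segregating:  # lost and fixed alleles are dropped
                    nxt[j] += state[i] * row[j]
        state = nxt
        age += 1
    return {i: age_weighted[i] / visits[i] for i in segregating}


two_n, count = 40, 8  # alleles currently at 20% frequency
for label, s in (("neutral", 0.0), ("deleterious", -0.1), ("advantageous", 0.1)):
    print(f"{label}: mean age {mean_age_by_count(two_n, s)[count]:.1f} generations")
```

Both selected classes come out younger than the neutral class at the same current count, which is the pattern the Maruyama effect describes.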

Figure 1. Simulation and theoretical results for allelic age and sojourn times. a. Example trajectories for a neutral and a deleterious allele with current population frequencies of 3% (indicated by an arrow). The shaded areas indicate sojourn times at frequencies above 5%. b. Mean ages for neutral and deleterious alleles at a given population frequency (lines show theoretical predictions, dots show simulation results with standard error bars). The graph shows that deleterious alleles at a given frequency are younger than neutral alleles, and that the effect is greater for more strongly selected alleles. c. Mean sojourn times for neutral and deleterious alleles. Vertical line denotes the current population frequency of the variant (3%). Mean sojourn times have been computed in bins of 1%. Line connects theoretical predictions for each frequency bin. Dots show simulation results. The graph illustrates that deleterious alleles spend much less time than neutral alleles at higher population frequencies in the past even if they have the same current frequency.

[Panel axes: a. population frequency (%) versus time (generations before present, in 2N units), neutral and deleterious alleles; b. mean age (2N generations) versus selection coefficient 2Ns, for population frequencies 3%, 5%, and 7%; c. mean sojourn time (2N generations) versus intermediate allele frequency (%), for 2Ns = 0 (neutral), -2 (weakly deleterious), and -10 (deleterious)]

Neighborhood clock (fuzzy clock)

[Diagram labels: variant; closest rarer linked variant; closest variant beyond recombination event]

[Figure: cumulative distribution of the NC statistic at MAC=3; x-axis: NC statistic, y-axis: proportion with NC <= x; series: missense ancestral, missense derived, probably damaging derived, synonymous derived]

Neighborhood clock is consistent with Maruyama-effect expectations

Data: pilot Genome of Netherlands dataset

MAC  Variants           N     Mean NC  Effect size  95% CI           P
2    coding-synon       2813  4.97     baseline
2    missense           3957  5.02     0.089        (0.0387, 0.138)  0.0012
2    benign             1772  5.02     0.088        (0.0361, 0.136)  0.0083
2    possibly damaging  708   4.99     0.040        (-0.013, 0.091)  0.141
2    probably damaging  1136  5.05     0.142        (0.0914, 0.188)  0.0003
3    coding-synon       1708  4.68     baseline
3    missense           2277  4.75     0.134        (0.0726, 0.197)  2.17x10^-5
3    benign             1035  4.74     0.118        (0.0521, 0.183)  0.00213
3    possibly damaging  368   4.75     0.137        (0.0714, 0.202)  0.0149
3    probably damaging  650   4.79     0.211        (0.146, 0.275)   1.58x10^-6
4    coding-synon       1216  4.46     baseline
4    missense           1496  4.56     0.16         (0.088, 0.238)   2.68x10^-5
4    benign             695   4.54     0.127        (0.050, 0.207)   0.00817
4    possibly damaging  254   4.59     0.217        (0.144, 0.287)   0.000512
4    probably damaging  376   4.59     0.212        (0.140, 0.284)   0.000124
5    coding-synon       935   4.37     baseline
5    missense           1102  4.42     0.0966       (0.010, 0.188)   0.00934
5    benign             530   4.42     0.0922       (0.005, 0.176)   0.0454
5    possibly damaging  181   4.4      0.0596       (-0.028, 0.158)  0.312
5    probably damaging  277   4.52     0.266        (0.185, 0.353)   2.73x10^-5
6    coding-synon       814   4.24     baseline
6    missense           896   4.28     0.082        (-0.015, 0.171)  0.0562
6    benign             432   4.26     0.047        (-0.044, 0.136)  0.291
6    possibly damaging  145   4.29     0.101        (0.012, 0.187)   0.183
6    probably damaging  215   4.37     0.243        (0.149, 0.338)   0.000826
2-6  coding-synon       7486           baseline
2-6  missense           9728                                         1.79x10^-10
2-6  benign             4464                                         5.30x10^-06
2-6  possibly damaging  1656                                         0.001
2-6  probably damaging  2654                                         3.25x10^-13

Table 1. Discrimination of derived missense alleles by the NC statistic. Missense alleles are sub-classified into categories based on PolyPhen-2 predictions. Effect sizes were calculated as standard deviations from the mean of the NC statistic for synonymous variants at the same minor allele count (MAC). Within each MAC class, P-values were calculated by 1-sided Mann-Whitney test. Combined P-values for MAC 2-6 were computed by meta-analysis (Methods).
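The two per-class quantities in the table, an effect size in baseline standard deviations and a one-sided Mann-Whitney comparison, can be sketched as follows. The NC values are toy numbers invented for illustration, the function names are hypothetical, and the P-value uses a plain normal approximation without tie or continuity correction, which may differ from the procedure in the Methods.

```python
from math import erfc, sqrt
from statistics import mean, stdev


def effect_size(test_values, baseline_values):
    """Mean shift of the test class, in baseline standard deviations."""
    return (mean(test_values) - mean(baseline_values)) / stdev(baseline_values)


def mann_whitney_greater(x, y):
    """U statistic and one-sided P-value for 'x tends to exceed y'."""
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    n1, n2 = len(x), len(y)
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, 0.5 * erfc(z / sqrt(2))  # upper-tail normal probability


# Toy NC values at a single MAC (hypothetical, not from the dataset).
synonymous_nc = [4.1, 4.6, 4.8, 5.0, 5.3]
missense_nc = [4.5, 4.9, 5.1, 5.4, 5.6]

u_stat, p_value = mann_whitney_greater(missense_nc, synonymous_nc)
print(f"effect size = {effect_size(missense_nc, synonymous_nc):.3f} baseline s.d.")
print(f"U = {u_stat}, one-sided P = {p_value:.3f}")
```

Comparing each missense class only against synonymous variants at the same MAC keeps allele frequency fixed, so any shift in NC reflects something other than frequency.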


Can signals of selection guide prioritization?

Genes of interest should be highly selectively constrained

Can we estimate fitness loss directly?

Several methods exist to estimate gene-based selective constraint (pLI, RVIS)

ExAC dataset combines exomes of >60,000 individuals

Selection inference using frequencies of individual SNPs

Change in allele frequency = Mutation + Selection + Drift

Of the order of 10^-8

Demographic history

Population structure

Focusing on rare deleterious PTVs

PTV – protein truncating variant (a.k.a. nonsense)

Combine all PTVs per gene – we assume that they have identical effects

Consider each gene as a bi-allelic locus: PTV / no PTV

Selection inference using combined frequency of PTVs

Change in allele frequency = Mutation + Selection + Drift

Combined frequency of rare deleterious PTVs is expected to be Poisson distributed with λ = U/(hs)
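A short numerical sketch of the mutation-selection balance relation λ = U/(hs). The mutation rate, selection coefficient, and sample size below are hypothetical values chosen for illustration, and scaling the expected frequency by the number of sampled chromosomes to get an expected count is an assumption about how the relation is applied to sequence data.

```python
from math import exp, lgamma, log


def expected_ptv_frequency(mutation_rate, hs):
    """Mutation-selection balance: combined PTV allele frequency of about U/(hs)."""
    return mutation_rate / hs


def poisson_pmf(k, lam):
    """Poisson probability of observing k events when lam are expected."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


U = 1e-6            # hypothetical per-gene PTV mutation rate per generation
HS = 0.05           # hypothetical heterozygous selection coefficient
N_CHROM = 120_000   # hypothetical number of sampled chromosomes

frequency = expected_ptv_frequency(U, HS)
expected_count = N_CHROM * frequency
print(f"expected combined PTV frequency: {frequency:.1e}")
print(f"expected PTV count in the sample: {expected_count:.2f}")
for k in range(5):
    print(f"P(count = {k}) = {poisson_pmf(k, expected_count):.3f}")
```

With only a handful of PTV alleles expected per gene, observed counts vary widely around the expectation, so a single gene's count constrains its selection coefficient only loosely.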

Simulations

The model

PTV counts in each gene are Poisson distributed, but we lack sufficient data to estimate selection coefficients

We can treat selection coefficients as random variables with a distribution to be estimated

Distribution of selection coefficients

[Figure: inferred distribution P(s_het | α, β); x-axis: heterozygous selection coefficient s_het, log scale from 10^-4 to 1]

Estimates for each gene

combine the results in a mixture distribution with equal weights. The mean mutation rates in the three terciles are U_1 = 4.6 x 10^-[?], U_2 = 1.1 x 10^-[?], and U_3 = 2.6 x 10^-[?]. We estimate (α_1, β_1) = (0.057±0.010, 0.0052±0.0003), (α_2, β_2) = (0.046±0.005, 0.0087±0.0004), and (α_3, β_3) = (0.074±0.005, 0.0160±0.0005), with error margins denoting two s.d. from 100 bootstrapping replicates of the set of ~5,333 genes in each tercile. This error estimate is intended to quantify the effect of the sampling noise in the data set on the parameter inference while local mutation rate estimates are assumed fixed. The resulting fitted distributions of counts are shown in Supplementary Figure 9 together with the corresponding P(n), while Figure 1 shows the inferred P(s_het; α, β) = [IG(s_het; α_1, β_1) + IG(s_het; α_2, β_2) + IG(s_het; α_3, β_3)]/3. The choice for the functional form of P(s_het) is motivated by the shape of the empirical distribution of the naïve estimator νU/n (given by a simple inversion of Eq. 3). We also compared the log-likelihood of the fit to P(n) obtained with this model to that obtained from two other two-parameter distributions, s_het ~ Gamma and s_het ~ InvGamma, and chose the model with the highest likelihood, which is s_het ~ IG.

Inference of s_het on individual genes

From the inferred distributions P(s_het; α_t, β_t) in each tercile t of the mutation rate U, we construct a per-gene estimator of s_het for genes in the tercile using the posterior probability given n, which mitigates the stochasticity of the observed PTV count:

$$P(s_{het,i} \mid n_i; \nu U_i) = \frac{P(n_i \mid s_{het,i}; \nu U_i)\, P(s_{het,i}; \alpha_t, \beta_t)}{\int P(n_i \mid s; \nu U_i)\, P(s; \alpha_t, \beta_t)\, ds}, \quad (7)$$

where the denominator is given by Eq. 5. Supplementary Table 1 provides the mean values derived from these posterior probabilities for each gene.

Predicted mode of inheritance in clinical exome cases

We trained a Naïve Bayes classifier to predict the mode of inheritance in a set of solved clinical exome sequencing cases from Baylor College of Medicine (N=283 cases) (ref. 22) and UCLA (ref. 23) (N=176 cases). Using data from UCLA as the training dataset, we are able to cross-predict the mode of inheritance in separately ascertained Baylor cases with classification accuracy of 88.0%, sensitivity of 86.1%, specificity of 90.2%, and an AUC of 0.931. Genes that were related to diagnosis in both clinics (overlapping genes) were removed from the larger Baylor set (Supplementary Figure 2).

Using a logistic regression based on the full set of cases from Baylor and UCLA, we generated predictions for all 15,998 genes where there is a s_het value (Supplementary Table 4).

Mouse knockout comparative analysis

We reviewed mouse knockout enrichments from two datasets: the full set of mouse knockouts from a neutrally-ascertained mouse knockout screen (N=2,179 genes) generated by the International Mouse Phenotyping Consortium (ref. 25). Genes were classified as 'Viable', 'Sub-Viable', or 'Lethal' based on the results for the assay.

PubMed gene score and enrichment analysis

bioRxiv preprint first posted online Sep. 16, 2016; doi: http://dx.doi.org/10.1101/075523. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.

The estimated distribution over selection coefficients can now be used as a prior, and per-gene estimates are obtained from the posteriors
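A minimal sketch of how a fitted prior and a Poisson likelihood combine into a per-gene posterior mean. Several choices here are assumptions rather than facts from the slides: the inverse Gaussian prior is parameterized by its mean and shape, the expected PTV count is taken to be `nu_u / s_het` (with `nu_u` a hypothetical name for the number of sampled chromosomes times the per-gene mutation rate), and s_het is restricted to the interval from 10^-5 to 1 and integrated on a log-spaced grid. The prior parameters are the first-tercile estimates quoted in the excerpt above; the value of `NU_U` is invented for illustration.

```python
from math import exp, lgamma, log, pi, sqrt


def inverse_gaussian_pdf(s, prior_mean, prior_shape):
    """Inverse Gaussian density, assumed parameterized by mean and shape."""
    return (sqrt(prior_shape / (2 * pi * s**3))
            * exp(-prior_shape * (s - prior_mean) ** 2 / (2 * prior_mean**2 * s)))


def poisson_log_pmf(n, lam):
    """Log of the Poisson probability of n events when lam are expected."""
    return n * log(lam) - lam - lgamma(n + 1)


def posterior_mean_s_het(n_obs, nu_u, prior_mean, prior_shape, grid_size=2000):
    """Posterior mean of s_het given an observed PTV count, by grid integration."""
    low, high = 1e-5, 1.0
    log_step = (log(high) - log(low)) / (grid_size - 1)
    grid = [low * exp(i * log_step) for i in range(grid_size)]
    weights = []
    for s in grid:
        prior = inverse_gaussian_pdf(s, prior_mean, prior_shape)
        likelihood = exp(poisson_log_pmf(n_obs, nu_u / s))
        weights.append(prior * likelihood * s)  # ds = s * d(log s) on a log grid
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, grid)) / total


NU_U = 0.12  # hypothetical: 120,000 chromosomes times a mutation rate of 1e-6
for observed in (0, 1, 5, 20):
    estimate = posterior_mean_s_het(observed, NU_U, prior_mean=0.057, prior_shape=0.0052)
    print(f"observed PTV count {observed}: posterior mean s_het = {estimate:.4f}")
```

Genes with no observed PTVs are pulled toward strong selection, and the prior keeps the estimate finite where the naive ratio estimator would diverge.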

AD and AR Mendelian genes

Figure 2: Separation of disease genes and clinical cases by mode of inheritance. [a] The distribution of genes associated with exclusively autosomal dominant (AD, N=867) disorders versus autosomal recessive (AR, N=1,482) disorders as annotated by the Clinical Genomics Database (CGD). Logarithmic bins are ordered from greatest to smallest s_het values. [b] Overall, AD genes have significantly higher s_het values than AR genes [Mann-Whitney p-value 3.14x10^-64]. [c] Similarly, in solved Mendelian clinical exome sequencing cases (Baylor) (ref. 22), s_het values can help discriminate between AR and AD disease genes, as annotated by clinical geneticists. [d] A s_het value of 0.04 can be used as a simple classification threshold for AD genes with a PPV of 96%. [e] This finding is replicated in a separately ascertained sample from UCLA. Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.

In a set of 504 clinical exome cases that resulted in a Mendelian diagnosis (ref. 22), we find a similar enrichment of cases by MOI and selection value (Figure 2[c]). We find that 90.4% of novel, dominant variants are associated with heterozygous fitness loss greater than 0.04 (Figure 2[d]). Among disease variants, a cutoff of s_het > 0.04 provides a 96% positive predictive value for discriminating between AD and AR modes of inheritance.

[Figure 2 panels: [a] Mode of Inheritance [Clinical Genomic Database], fraction of genes in each s_het bin for AD and AR disease genes; [b] s_het distributions for AD and AR disease genes; [c] Mode of Inheritance in Molecular Diagnoses [Baylor], number of observed genes and fraction of genes by mode of inheritance per s_het bin; [d] Baylor and [e] UCLA, s_het < 0.04 versus s_het > 0.04]
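The threshold rule reported for Figure 2 (call a gene AD when s_het exceeds 0.04) and its positive predictive value can be sketched as below. The s_het values and labels are toy data invented for illustration and the function name is hypothetical; only the 0.04 cutoff comes from the text.

```python
def positive_predictive_value(s_het_values, is_dominant, threshold=0.04):
    """Fraction of genes above the s_het threshold that are truly AD."""
    called = [dominant for s, dominant in zip(s_het_values, is_dominant) if s > threshold]
    if not called:
        raise ValueError("no gene exceeds the threshold")
    return sum(called) / len(called)


# Toy genes (hypothetical values): s_het and whether the disease gene is AD.
toy_s_het = [0.30, 0.12, 0.06, 0.05, 0.02, 0.008, 0.003, 0.001]
toy_is_ad = [True, True, True, False, True, False, False, False]

ppv = positive_predictive_value(toy_s_het, toy_is_ad)
print(f"PPV at s_het > 0.04 on the toy data: {ppv:.2f}")
```

PPV depends on the mix of AD and AR genes in the set being classified, so a value measured on solved clinical cases need not carry over to an arbitrary candidate list.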


Age of onset, penetrance and severity

To test the generalizable utility of s_het values in prioritizing candidate genes in Mendelian sequencing studies, we compared the overall prevalence of genes with s_het > 0.04 to the corresponding fraction in an independently ascertained dataset of new dominant Mendelian diagnoses (Figure 2[e]) (ref. 23). This analysis suggests that restricting to genes with s_het > 0.04 would provide a three-fold reduction of candidate variants, given the overall distribution of s_het values. Thus, initial effort in clinical cases can be focused on just a few genes for functional validation, familial segregation studies, and patient matching. We summarize the classification accuracy for all possible thresholds (AUC 0.9312) and probabilities for the mode of inheritance in each gene, generated using the full set of clinical sequencing cases (Supplementary Figure 2 and Supplementary Table 2).

Beyond mode of inheritance, we find that s_het can help predict phenotypic severity, age of onset, penetrance, and the fraction of de novo variants in a set of high-confidence haploinsufficient disease genes (Figure 3). In broader sets of known disease genes, s_het estimates significantly correlate with the number of references in OMIM MorbidMap and the number of HGMD disease "DM" variants (Supplementary Figure 3).

Figure 3: Enrichments of s_het in known haploinsufficient disease genes of high confidence (ClinGen Project). In (N=127) autosomal genes, we annotate the s_het scores of genes associated with each disease category and classification. Higher s_het values are associated with increased phenotypic severity (Mann-Whitney p-value 4.87x10^-3), earlier age of onset (p=1.46x10^-2), high or unspecified penetrance (p=1.79x10^-2), and a larger fraction of de novo variants (p=8x10^-5). Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.

Gene-specific fitness loss values allow us to plot the distribution of selective effects for different disorders. This provides information about the breadth and severity of selection associated with various disorder groups using both well-established genes (Figure 4[a]) and new findings from Mendelian exome cases (Figure 4[b]). Overall, genes involved in neurologic phenotypes and congenital heart disease appear to be under more intense selection when compared with other disorder groups, tolerated knockouts in a consanguineous cohort, or in all genes (Figure 4[c,d]) (ref. 24). Interestingly, genes recessive for these disorders appear to have only partially


Concordance with mouse knockout data

viability, while those with the lowest s_het estimates are depleted for embryonic lethality [Mann-Whitney p=2.95x10^-28] (Figure 5[a,b]).

Figure 5: High-throughput screens of gene essentiality in mice and cell assays. [a] Proportion of orthologous mouse knockout genes by phenotype, from a neutrally-ascertained set of genes generated by the International Mouse Phenotyping Consortium (IMPC). Logarithmic bins are ordered from greatest to smallest s_het values. [b] IMPC mice are separated into viable (N=1,057), sub-viable (N=211) and lethal knockouts (N=477), and lethal knockouts have significantly higher s_het values than viable [Mann-Whitney p-value 2.95x10^-28]. [c] Cell-essential genes as reported by Wang et al. from genome-wide KBM-7 tumor cell CRISPR assay (N=1,740) have significantly higher s_het values [p-value 5.13x10^-16] and [d] as do genes that were characterized as essential in a gene trap assay (N=1,081) [p-value = 4.90x10^-18]. In the CRISPR assay, all genes with adjusted p-values < 0.05 and negative assay scores are included, and genes with gene trap scores < 0.4 are included. Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.

[Figure 5 panels: [a] Orthologous mouse knockouts by phenotype (lethal, subviable, viable), percentage of genes in each s_het bin; [b] Distribution of s_het values by phenotype; [c] Cell-Essential by KBM7 CRISPR Assay, percentage of genes classified as essential per s_het bin; [d] Cell-Essential by Yeast Gene Trap Assay, percentage of genes classified as essential per s_het bin]


Concordance with cell essentiality screens

[Figure 5 shown again; panels [c] and [d] give the percentage of genes classified as essential in each s_het bin for the KBM7 CRISPR assay and the gene trap assay]

Black hole in knowledge

Supplementary Figure 7: Most published and least published genes from top s_het decile. The proportion of annotations related to genes with the fewest and most publications in Entrez Gene. From the set of genes under the strongest selection (top 10% of s_het values), we create two sets of 250 genes. The first set of genes has the fewest publications associated with each gene, as defined by our PubMed gene score (Methods), and the second set has the greatest number of associated publications. Between the two groups, we compare the s_het values, number of protein-protein interactions, viability of orthologous mouse knockouts (IMPC), and cell essentiality assays (KBM-7 CRISPR score and Gene Trap Score). These results suggest that the genes in the least published set are similar to those in the most published set, and are also potentially important developmental genes.

[Figure panels: Non-Viable Sanger Mice; KBM7 Human Cell Line; Protein-Protein Interactions; s_het Value; Yeast Gene Trap Score. y-axis: percentage of genes in each group; bars: fewest publications versus most publications]

