Comparative analysis of natural selection effects across human populations
Student: Julia Kornienko
Supervisor: Yury Barbitoff, Bioinformatics Institute
Analysis of protein-coding genetic variation in 60,706 humans - Lek et al., Nature, 2016
(assembled by the Exome Aggregation Consortium - ExAC)
Example of ExAC data:
1 69516 rs776332430 G A 488.39 PASS
AC=1;AF=6.34180e-06;AN=157684;BaseQRankSum=-2.19700e+00;ClippingRankSum=-3.50000e-01;DP=6226716;FS=0.00000e+00;InbreedingCoeff=-1.59000e-02;MQ=3.46100e+01;MQRankSum=-3.09000e-01;QD=8.42000e+00;ReadPosRankSum=1.10000e-01;SOR=7.80000e-01;VQSLOD=3.50000e-01;VQSR_culprit=MQ;GQ_HIST_ALT=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1;DP_HIST_ALT=0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0;AB_HIST_ALT=0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0;GQ_HIST_ALL=10123|5490|820|605|258|106|135|126|74|198|216|141|629|159|681|368|1295|175|1644|72931;DP_HIST_ALL=16331|996|338|458|1015|1800|4266|13665|14910|11109|8010|6763|5319|3981|2727|1771|1075|520|349|220;AB_HIST_ALL=0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=1;AC_OTH=0;AC_SAS=0;AC_Male=1;AC_Female=0;AN_AFR=11582;AN_AMR=19550;AN_ASJ=4962;AN_EAS=16704;AN_FIN=11160;AN_NFE=67470;AN_OTH=3302;AN_SAS=22954;AN_Male=86662;AN_Female=71022;AF_AFR=0.00000e+00;AF_AMR=0.00000e+00;AF_ASJ=0.00000e+00;AF_EAS=0.00000e+00;AF_FIN=0.00000e+00;AF_NFE=1.48214e-05;AF_OTH=0.00000e+00;AF_SAS=0.00000e+00;AF_Male=1.15391e-05;AF_Female=0.00000e+00;GC_AFR=5791,0,0;GC_AMR=9775,0,0;GC_ASJ=2481,0,0;GC_EAS=8352,0,0;GC_FIN=5580,0,0;GC_NFE=33734,1,0;GC_OTH=1651,0,0;GC_SAS=11477,0,0;GC_Male=43330,1,0;GC_Female=35511,0,0;AC_raw=1;AN_raw=192348;AF_raw=5.19891e-06;GC_raw=96173,1,0;GC=78841,1,0;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=0;Hom=0;POPMAX=NFE;AC_POPMAX=1;AN_POPMAX=67470;AF_POPMAX=1.48214e-05;DP_MEDIAN=58;DREF_MEDIAN=1.00000e-52;GQ_MEDIAN=99;AB_MEDIAN=3.79310e-01;AS_RF=5.44231e-01;AS_FilterStatus=PASS;CSQ=A|stop_gained|HIGH|OR4F5|ENSG00000186092|Transcript|ENST00000335137|protein_coding|1/1||ENST00000335137.3:c.426G>A|ENSP00000334393.3:p.Trp142Ter|426|426|142|W/*|tgG/tgA|rs776332430|1||1||SNV|1|HGNC|14825|YES|||CCDS30547.1|ENSP00000334393|Q8NH21||UPI0000041BC1||||Transmembrane_helices:TMhelix&Superfamily_domains:SSF81321&Pfam_domain:PF13853&Gene3D:1.20.1070.10&hmmpanther:PTHR26451&hmmpanther:PTHR26451:SF72&PROSITE_profiles:PS50262|||A:0||||||||A:0|A:1.117e-05|A:0|A:1.275e-05|A:0|A:0|A:2.532e-05|A:0|||||||||HC||SINGLE_EXON|POSITION:0.464052287581699&PHYLOCSF_TOO_SHORT,A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000278218|open_chromatin_region||||||||||rs776332430|1||||SNV|1|||||||||||||||||A:0||||||||A:0|A:1.117e-05|A:0|A:1.275e-05|A:0|A:0|A:2.532e-05|A:0||||||||||||
• 60706 individuals
• Only PTV alleles were selected (stop gained, splice donor, splice acceptor)
• Selective effects were estimated for PTV alleles with very low allele frequency (AF << 1), so the contribution of homozygous PTVs, which scales with AF squared, was neglected.
E(n) = NU/Shet, where n is the number of loss-of-function alleles among N chromosomes and U is the estimated mutation rate under the neutral model. n is assumed to follow a Poisson distribution with expected value E(n), which makes it possible to estimate Shet.
Estimating the selective effects of heterozygous protein-truncating variants from human
exome data - Cassa et al., Nature Genetics, 2017
The aim of the project: to estimate the selective effects of heterozygous PTV alleles from gnomAD data (123,136 exomes) and to compare these effects across human populations, both for individual genes and for gene sets.
Project objectives:
• Create a filtered data set following Cassa et al., but using the updated gnomAD data for 123,136 individuals (instead of 60,706).
• Estimate the selection coefficients for individual genes in the global population from the gnomAD data and compare them with the coefficients published by Cassa et al. for the ExAC data.
• Estimate the selection coefficients for the different populations and compare them.
• Search for genes and gene sets with population-dependent selective effects.
- We calculated the sum of PTV allele counts (AC) for each gene (GENCODE v19) with an average coverage > 30x, producing a data set with AC for 17412 genes for both the global and the local populations.
- For further analysis we kept genes whose summed AC was < 0.001*AN and whose naive Shet (calculated as Shet = AN*U/AC) differed from the Shet published for the ExAC data by no more than a factor of 10.
- The final data set for estimating selective effects contained 12367 genes.
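The naive estimate and the two filters can be sketched as below. This is a minimal illustration, not the project's actual pipeline: the function names are hypothetical, the value of U and the published Shet in the example are made up, and excluding genes with AC = 0 (where the naive formula is undefined) is an assumption.

```python
def naive_shet(ac, an, u):
    """Naive estimate Shet = AN * U / AC (undefined when AC is 0)."""
    if ac <= 0:
        raise ValueError("naive Shet needs AC > 0")
    return an * u / ac


def keep_gene(ac, an, u, published_shet, max_fold=10.0, max_ac_fraction=0.001):
    """Apply both filters to one gene: AC < 0.001 * AN and a naive Shet
    within a factor of max_fold of the published ExAC value."""
    if ac <= 0 or ac >= max_ac_fraction * an:
        return False  # assumption: genes with AC = 0 are dropped here
    fold = naive_shet(ac, an, u) / published_shet
    return 1.0 / max_fold <= fold <= max_fold


# Illustrative numbers only: U and the published Shet are invented.
example = dict(ac=32, an=243601, u=2e-6, published_shet=0.02)
print(naive_shet(example["ac"], example["an"], example["u"]))
print(keep_gene(**example))
```

Both thresholds are exposed as parameters so that the effect of a stricter or looser filter on the number of retained genes can be checked.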
Data set for further analysis
[Figure: naive Shets (gnomAD, y-axis) plotted against published Shets (ExAC, x-axis); axis ticks at 0.001, 0.01, 0.1 and 1]
For each gene we computed the expected value over the Pois(AC; lambda = AN*U/Shet) distribution, with Shet varied from 0 to 2 in steps of 0.0001. In this way the selection coefficients were estimated for both the global and the local populations.
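One way to read this grid procedure is sketched below. It is an assumption, not a confirmed description of the original code: the Poisson likelihood of the observed AC is evaluated at each grid value of Shet, the likelihoods are normalised (which amounts to a flat prior on the grid), and both the weighted mean and the most likely grid value are returned. The grid starts at one step above 0 because lambda is undefined at Shet = 0, and the value of U in the example is made up.

```python
import math


def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing k events with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def estimate_shet(ac, an, u, step=1e-4, s_max=2.0):
    """Grid estimate of Shet from Pois(AC; lambda = AN * U / Shet).

    Returns (mean, mode) over the grid step, 2*step, ..., s_max.
    """
    n_steps = int(round(s_max / step))
    grid = [step * i for i in range(1, n_steps + 1)]
    log_l = [poisson_log_pmf(ac, an * u / s) for s in grid]
    peak = max(log_l)
    # Subtract the peak before exponentiating to avoid underflow.
    weights = [math.exp(v - peak) for v in log_l]
    total = sum(weights)
    mean = sum(s * w for s, w in zip(grid, weights)) / total
    mode = grid[log_l.index(peak)]
    return mean, mode


# AC and AN are the GDNF global counts from the slides; U is illustrative.
mean, mode = estimate_shet(ac=26, an=178435, u=1e-6)
print(mean, mode)
```

The most likely grid value sits next to the naive estimate AN*U/AC, while the weighted mean is somewhat larger because the likelihood has a longer tail towards high Shet.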
Estimation of the selection coefficients
Dependence of distribution of the coefficients for individual genes
on the population
The distributions of estimated coefficients for the AFR, AMR, EAS and SAS populations were considered comparable. Because the FIN population is much smaller than the others and its coefficient distribution differs from theirs, it was excluded from further analysis. The NFE coefficient distribution is closer to that of the global population because of its large size.
GDNF
Population  GLOBAL  AFR   AMR    EAS    SAS    NFE
AC          26      0     0      0      26     0
AN          178435  9479  27069  12467  25309  72395
Chisq.test p.value = 5.23e-26
BBS10
Population  GLOBAL  AFR   AMR    EAS    SAS    NFE
AC          32      3     5      3      5      16
AN          243601  9479  27069  12467  25309  72395
Chisq.test p.value = 0.99
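The slides do not say how the chi-square test was set up. One reading that approximately reproduces the reported GDNF p-value is a goodness-of-fit test of the per-population AC against proportions given by the per-population AN (in R, something like chisq.test(x = AC, p = AN / sum(AN))). The sketch below implements that reading with the standard library only; the helper names are hypothetical, and the closed-form tail probability is valid only for an even number of degrees of freedom, which holds for five populations (df = 4).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail of the chi-square distribution, closed form for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def ac_vs_an_test(ac, an):
    """Goodness of fit of allele counts to proportions implied by AN."""
    total_ac, total_an = sum(ac), sum(an)
    expected = [total_ac * n / total_an for n in an]
    stat = sum((o - e) ** 2 / e for o, e in zip(ac, expected))
    return stat, chi2_sf_even_df(stat, len(ac) - 1)


# Population order: AFR, AMR, EAS, SAS, NFE (values from the tables).
an = [9479, 27069, 12467, 25309, 72395]
stat_gdnf, p_gdnf = ac_vs_an_test([0, 0, 0, 26, 0], an)
stat_bbs10, p_bbs10 = ac_vs_an_test([3, 5, 3, 5, 16], an)
print(p_gdnf, p_bbs10)
```

Under this assumption GDNF comes out at roughly the reported order of magnitude, and BBS10 gives a p-value far above 0.05, consistent with alleles spread in proportion to sample size.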
• 2040 of 12367 genes had a Bonferroni-corrected p-value < 0.05
• 30 of these 2040 genes (with AC > 10) had more than 90% of their PTV alleles in one of four populations (AFR = 6, AMR = 6, EAS = 6, SAS = 12)
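The two selection steps above can be sketched as follows. The function names are hypothetical, and taking the global AC as the denominator of the population share is an assumption.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


def dominant_population(ac_by_pop, ac_global, threshold=0.9, min_ac=10):
    """Return (population, share) if one population carries more than
    `threshold` of the gene's PTV alleles, otherwise None.

    Genes with a global AC of min_ac or less are skipped.
    """
    if ac_global <= min_ac:
        return None
    pop, ac = max(ac_by_pop.items(), key=lambda item: item[1])
    share = ac / ac_global
    return (pop, share) if share > threshold else None


# GDNF counts and p-value as reported on the slides.
gdnf = {"AFR": 0, "AMR": 0, "EAS": 0, "SAS": 26}
print(bonferroni(5.23e-26, 12367) < 0.05)
print(dominant_population(gdnf, ac_global=26))
```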
Search for the genes with population-dependent selective effects
Some interesting genes with different distribution of AC among the
populations
GDNF - GLIAL CELL LINE-DERIVED NEUROTROPHIC FACTOR
(26 of 26 variant alleles are among South Asians)
• highly conserved neurotrophic factor
• supports the survival and differentiation of dopaminergic neurons and motoneurons
PAX3 - PAIRED BOX GENE 3 (48 of 53 variant alleles are among
South Asians)
• together with Sox10 activates transcription of MITF and RET genes
• controls a cascade of transcriptional events that are necessary and
sufficient for skeletal myogenesis.
+ different olfactory receptor genes in each population
Gene-Phenotype Relationships (OMIM), GDNF
Phenotype                                     Inheritance
Central hypoventilation syndrome              AD
{Hirschsprung disease, susceptibility to, 3}  AD
{Pheochromocytoma, modifier of}               AD
Gene-Phenotype Relationships (OMIM), PAX3
Phenotype                                     Inheritance
Craniofacial-deafness-hand syndrome           AD
Rhabdomyosarcoma 2, alveolar                  AR
Waardenburg syndrome, type 1                  AD
Waardenburg syndrome, type 3                  AR, AD
+ EIF4G3 (translation initiation factor, 92% AFR), NUMBL (Numb-related gene, 94% EAS), TERF1 (telomeric-repeat binding factor, 93% SAS)
+ other genes
We used the curated and hallmark gene set collections from MSigDB (GSEA) to estimate Shet per gene set (1377 gene sets in total). For each population, we summed AC and AN over all genes in a given gene set. Shet was then estimated in the naive way as Shet = AN*U/AC.
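A pooled naive estimate for one gene set might look like the sketch below. The slides do not state how U is combined across genes, so this is only one possible reading: the expected neutral influx is taken as the sum of AN*U over the genes and divided by the summed AC. The function name and the example numbers are hypothetical.

```python
def pooled_naive_shet(genes):
    """Naive Shet for a gene set in one population.

    genes: list of (ac, an, u) tuples, one per gene in the set.
    Returns None when no PTV alleles were observed in the set.
    """
    total_ac = sum(ac for ac, _, _ in genes)
    if total_ac == 0:
        return None
    # Assumption: per-gene AN * U terms are added up before dividing.
    neutral_influx = sum(an * u for _, an, u in genes)
    return neutral_influx / total_ac


# Two made-up genes forming a toy gene set.
toy_set = [(10, 100000, 1e-6), (30, 200000, 2e-6)]
print(pooled_naive_shet(toy_set))
```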
Dependence of distribution of coefficients for the gene sets on the
population
Pathway                                   AC_GLOBAL  AFR   AMR   EAS   SAS   NFE   p_value
PROTEASOME PATHWAY                        185        0.53  0.08  0.02  0.09  0.21  2.20e-133
IONOTROPIC ACTIVITY OF KAINATE RECEPTORS  125        0.03  0.55  0.04  0.04  0.27  3.50e-35
IL 10 PATHWAY                             326        0.05  0.06  0.06  0.51  0.20  5.91e-94
APOPTOSIS INDUCED DNA FRAGMENTATION       132        0.05  0.14  0.02  0.52  0.24  1.07e-33
REGULATION OF IFNG SIGNALING              186        0.03  0.06  0.03  0.60  0.24  9.67e-66
IL 13 PATHWAY                             215        0.03  0.08  0.06  0.56  0.25  8.24e-64
(AFR to NFE columns: fraction of the gene set's PTV alleles found in each population)
Search for the gene sets with population-dependent selective effects
• 746 of 1379 gene sets had a Bonferroni-corrected p-value < 0.05
• 6 of these 746 gene sets had more than 50% of their PTV alleles in one of four populations (AFR = 1, AMR = 1, EAS = 0, SAS = 4)
• 14 of these 746 gene sets had more than 40% of their PTV alleles in one of four populations (AFR = 2, AMR = 1, EAS = 3, SAS = 8)
Conclusions:
• A data set with summed PTV ACs for the global and local populations was created from the gnomAD data for 12367 genes
• The selective effects of heterozygous PTVs were estimated for individual genes and for gene sets, in both the global and the local populations
• Estimated Shet values tend to decrease as population size increases
• The selective effects of 2040 of 12367 genes are population-dependent, and 30 of these genes have more than 90% of their PTV alleles in one of four populations (AFR, AMR, EAS, SAS)
• The selective effects of 746 of 1379 gene sets are population-dependent, and 6 of these gene sets have more than 50% of their PTV alleles in one of four populations (AFR, AMR, EAS, SAS)
To sum up, the selective effects of some genes do vary among populations.
Thank you!
Data Set:
• 60706 individuals
• Mean coverage depth > 30
• Stop gained, splice donor, splice acceptor
• Only variants that lead to complete loss of function of the gene were selected
X = ∑xj is the frequency of loss-of-function alleles in a gene, computed as the sum of allele frequencies over all PTV sites within the gene (based on the selected data set).
For X << 1, the change in X is driven by the influx of de novo mutations and their removal by selection (genetic drift is ignored):
∂X/∂t = −Shet X(1−X) − Shom X²(1−X)² + U
Shet and Shom are the selection coefficients acting on PTV variants that we want to estimate (because X << 1, Shom can be neglected); U is the estimated mutation rate under the neutral model.
For N chromosomes, the number of loss-of-function alleles n depends on the frequency X:
n = NX = N ∑xj
and follows a Poisson distribution with expected value:
E(n) = NU/Shet
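The step from the dynamics to E(n) = NU/Shet can be checked numerically: iterating the deterministic equation until X stops changing should give X close to U/Shet when X << 1, and hence n = NX close to NU/Shet. The sketch below uses a simple one-generation update; the parameter values are illustrative, not estimates from the data.

```python
def equilibrium_frequency(s_het, u, s_hom=0.0, generations=5000):
    """Iterate dX/dt = -Shet*X*(1-X) - Shom*X^2*(1-X)^2 + U from X = 0."""
    x = 0.0
    for _ in range(generations):
        dx = -s_het * x * (1 - x) - s_hom * x**2 * (1 - x) ** 2 + u
        x += dx  # one generation per step
    return x


s_het, u = 0.05, 1e-6  # illustrative values only
x_eq = equilibrium_frequency(s_het, u)
x_eq_hom = equilibrium_frequency(s_het, u, s_hom=1.0)
print(x_eq, u / s_het)   # equilibrium vs. the approximation U / Shet
print(x_eq_hom)          # even with Shom = 1 the result barely moves
```

Because the Shom term scales with X squared, it changes the equilibrium very little at these frequencies, which is the reason it is neglected.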