
Apr 28, 2023

Page 1: Julia Kornienko Supervisor: Yury Barbitoff, Bioinformatics ...

Student: Julia Kornienko

Supervisor: Yury Barbitoff, Bioinformatics Institute

Comparative analysis of natural selection effects across human populations

Page 2

Analysis of protein-coding genetic variation in 60,706 humans - Lek et al., Nature, 2016

(assembled by the Exome Aggregation Consortium - ExAC)

Example of ExAC data:

1 69516 rs776332430 G A 488.39 PASS AC=1;AF=6.34180e-06;AN=157684;BaseQRankSum=-2.19700e+00;ClippingRankSum=-3.50000e-01;DP=6226716;FS=0.00000e+00;InbreedingCoeff=-1.59000e-02;MQ=3.46100e+01;MQRankSum=-3.09000e-01;QD=8.42000e+00;ReadPosRankSum=1.10000e-01;SOR=7.80000e-01;VQSLOD=3.50000e-01;VQSR_culprit=MQ;GQ_HIST_ALT=0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1;DP_HIST_ALT=0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0;AB_HIST_ALT=0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0;GQ_HIST_ALL=10123|5490|820|605|258|106|135|126|74|198|216|141|629|159|681|368|1295|175|1644|72931;DP_HIST_ALL=16331|996|338|458|1015|1800|4266|13665|14910|11109|8010|6763|5319|3981|2727|1771|1075|520|349|220;AB_HIST_ALL=0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0;AC_AFR=0;AC_AMR=0;AC_ASJ=0;AC_EAS=0;AC_FIN=0;AC_NFE=1;AC_OTH=0;AC_SAS=0;AC_Male=1;AC_Female=0;AN_AFR=11582;AN_AMR=19550;AN_ASJ=4962;AN_EAS=16704;AN_FIN=11160;AN_NFE=67470;AN_OTH=3302;AN_SAS=22954;AN_Male=86662;AN_Female=71022;AF_AFR=0.00000e+00;AF_AMR=0.00000e+00;AF_ASJ=0.00000e+00;AF_EAS=0.00000e+00;AF_FIN=0.00000e+00;AF_NFE=1.48214e-05;AF_OTH=0.00000e+00;AF_SAS=0.00000e+00;AF_Male=1.15391e-05;AF_Female=0.00000e+00;GC_AFR=5791,0,0;GC_AMR=9775,0,0;GC_ASJ=2481,0,0;GC_EAS=8352,0,0;GC_FIN=5580,0,0;GC_NFE=33734,1,0;GC_OTH=1651,0,0;GC_SAS=11477,0,0;GC_Male=43330,1,0;GC_Female=35511,0,0;AC_raw=1;AN_raw=192348;AF_raw=5.19891e-06;GC_raw=96173,1,0;GC=78841,1,0;Hom_AFR=0;Hom_AMR=0;Hom_ASJ=0;Hom_EAS=0;Hom_FIN=0;Hom_NFE=0;Hom_OTH=0;Hom_SAS=0;Hom_Male=0;Hom_Female=0;Hom_raw=0;Hom=0;POPMAX=NFE;AC_POPMAX=1;AN_POPMAX=67470;AF_POPMAX=1.48214e-05;DP_MEDIAN=58;DREF_MEDIAN=1.00000e-52;GQ_MEDIAN=99;AB_MEDIAN=3.79310e-01;AS_RF=5.44231e-01;AS_FilterStatus=PASS;CSQ=A|stop_gained|HIGH|OR4F5|ENSG00000186092|Transcript|ENST00000335137|protein_coding|1/1||ENST00000335137.3:c.426G>A|ENSP00000334393.3:p.Trp142Ter|426|426|142|W/*|tgG/tgA|rs776332430|1||1||SNV|1|HGNC|14825|YES|||CCDS30547.1|ENSP00000334393|Q8NH21||UPI0000041BC1||||Transmembrane_helices:TMhelix&Superfamily_domains:SSF81321&Pfam_domain:PF13853&Gene3D:1.20.1070.10&hmmpanther:PTHR26451&hmmpanther:PTHR26451:SF72&PROSITE_profiles:PS50262|||A:0||||||||A:0|A:1.117e-05|A:0|A:1.275e-05|A:0|A:0|A:2.532e-05|A:0|||||||||HC||SINGLE_EXON|POSITION:0.464052287581699&PHYLOCSF_TOO_SHORT,A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000278218|open_chromatin_region||||||||||rs776332430|1||||SNV|1|||||||||||||||||A:0||||||||A:0|A:1.117e-05|A:0|A:1.275e-05|A:0|A:0|A:2.532e-05|A:0||||||||||||

Page 3

• 60706 individuals

• Only PTV alleles were selected (stop gained, splice donor/acceptor)

• Selective effects were estimated for PTV alleles with very low allele frequency (AF << 1), so the contribution of homozygous PTVs was neglected (it scales with the square of a very low AF).

E(n) = NU/Shet, where n is the number of loss-of-function alleles among N chromosomes and U is the estimated mutation rate under the neutral model. n is assumed to follow a Poisson distribution with expected value E(n), which makes it possible to estimate Shet.

Estimating the selective effects of heterozygous protein-truncating variants from human exome data - Cassa et al., Nature Genetics, 2017
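A minimal Python sketch of the model stated on this slide may help make it concrete. It computes E(n) = NU/Shet and the Poisson probability of seeing n PTV alleles. The mutation rate `U` and the coefficient `S_HET` below are made-up illustrative values, not numbers from the talk or from Cassa et al.; the function names are hypothetical.

```python
import math


def expected_ptv_count(n_chromosomes, mutation_rate, s_het):
    """Expected number of PTV alleles: E(n) = N * U / Shet."""
    return n_chromosomes * mutation_rate / s_het


def poisson_pmf(n, lam):
    """Poisson probability of observing exactly n alleles when the mean is lam."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


# Illustrative inputs (assumptions, not values from the slides).
N_CHROMOSOMES = 2 * 60706   # two chromosomes per sequenced individual
U = 1e-6                    # assumed per-gene PTV mutation rate
S_HET = 0.05                # assumed heterozygous selection coefficient

lam = expected_ptv_count(N_CHROMOSOMES, U, S_HET)
for n in range(6):
    print(n, round(poisson_pmf(n, lam), 4))
```

Under this model a larger Shet gives a smaller expected count, which is why rare or absent PTV alleles in a gene point to stronger selection against heterozygotes.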

Page 4

The aim of the project: estimate the selective effects of heterozygous PTV alleles from gnomAD data (123136 exomes) and compare these effects among the different human populations, both for individual genes and for gene sets.

Project objectives:

• Create a filtered data set as in Cassa et al., but from the updated gnomAD data for 123,136 individuals (instead of 60,706).

• Estimate the per-gene selective coefficients in the global population from the gnomAD data and compare them with the coefficients published by Cassa et al. for the ExAC data.

• Estimate the selective coefficients for the different populations and compare them.

• Search for genes and gene sets with population-dependent selective effects.

Page 5

- We summed the PTV allele counts (AC) per gene (GENCODE 19) for genes with an average coverage > 30x, which gave a data set with AC for 17412 genes, both for the global and for the local populations.

- For further analysis we kept genes with a summed AC < 0.001AN and with a naive Shet (Shet = AN*U/AC) that differs from the Shet published for the ExAC data by no more than a factor of 10.

- The final data set for estimating the selective effects contains 12367 genes.
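The two filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the example inputs are invented, and the slides do not say how genes with AC = 0 were handled, so the sketch simply drops them because the naive estimate is undefined there.

```python
def naive_s_het(ac, an, u):
    """Naive estimate from the slide: Shet = AN * U / AC."""
    return an * u / ac


def keep_gene(ac, an, u, published_s_het, max_af=0.001, max_fold=10.0):
    """Apply the two filters: summed AC < 0.001 * AN, and the naive Shet
    within a factor of 10 of the published ExAC value."""
    if ac <= 0:
        # Assumption: the naive estimate is undefined for AC = 0, so skip.
        return False
    if ac >= max_af * an:
        return False
    ratio = naive_s_het(ac, an, u) / published_s_het
    return 1.0 / max_fold <= ratio <= max_fold


# Invented example values; U and the published Shet are not from the slides.
print(keep_gene(ac=32, an=243601, u=2e-6, published_s_het=0.02))   # passes both filters
print(keep_gene(ac=500, an=243601, u=2e-6, published_s_het=0.02))  # AC too high
```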

Data set for further analysis

[Figure: naive Shets (gnomAD) plotted against published Shets (ExAC); both axes range from 0.001 to 1]

Page 6

Estimation of the selective coefficients

For each gene we calculated the expected value of the Pois(AC, lambda = AN*U/Shet) distribution, with Shet varied from 0 to 2 in steps of 0.0001. In this way the selective coefficients were estimated for both the global and the local populations.
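One way to read the grid procedure is as a search over candidate Shet values scored by the Poisson model. The sketch below is an assumption about how such a scan could look, not the author's code: it picks the grid value that maximizes the Poisson likelihood of the observed AC, and it starts the grid at the first step rather than at 0, because Shet = 0 would make lambda infinite. The slide's wording ("expected value") may instead refer to a different summary over the same grid.

```python
import math


def poisson_log_pmf(k, lam):
    """Log-probability of k events under a Poisson distribution with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def grid_estimate_s_het(ac, an, u, step=0.0001, s_max=2.0):
    """Scan Shet over (0, s_max] and return the value with the highest
    Poisson likelihood for the observed allele count."""
    best_s, best_ll = None, -math.inf
    n_steps = round(s_max / step)
    for i in range(1, n_steps + 1):      # i = 0 (Shet = 0) is skipped on purpose
        s = i * step
        ll = poisson_log_pmf(ac, an * u / s)
        if ll > best_ll:
            best_s, best_ll = s, ll
    return best_s


# Invented inputs: U is an assumed mutation rate, not a value from the slides.
print(grid_estimate_s_het(ac=32, an=243601, u=2e-6))
```

For AC > 0 the maximum lies next to the naive value AN*U/AC, so the scan mainly differs from the naive formula in how it treats genes with no observed PTV alleles, where it returns the upper end of the grid.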

Page 7

Dependence of the distribution of coefficients for individual genes on the population

The distributions of the estimated coefficients for the AFR, AMR, EAS and SAS populations were considered comparable. The FIN population is much smaller than the others and its coefficient distribution differs from theirs, so it was excluded from further analysis. The NFE distribution is closer to that of the global population because of the large size of the NFE sample.

Page 8

Search for the genes with population-dependent selective effects

GDNF (Chisq.test p.value = 5.23e-26)

Population  GLOBAL  AFR   AMR    EAS    SAS    NFE
AC          26      0     0      0      26     0
AN          178435  9479  27069  12467  25309  72395

BBS10 (Chisq.test p.value = 0.99)

Population  GLOBAL  AFR   AMR    EAS    SAS    NFE
AC          32      3     5      3      5      16
AN          243601  9479  27069  12467  25309  72395

• 2040 of 12367 genes had a Bonferroni-corrected p.value < 0.05

• 30 of these 2040 genes had more than 90% of their PTV alleles (with AC > 10) in one of four populations (AFR = 6, AMR = 6, EAS = 6, SAS = 12)
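The slide only names `Chisq.test`, so the exact setup is not stated. A plausible reading, sketched here in Python with the standard library only, is a goodness-of-fit test in which the expected allele count of each population is proportional to its AN. With the GDNF row (all 26 alleles in SAS) this sketch gives a p-value of roughly 5e-26, in line with the value reported on the slide. The function names are hypothetical, and the closed-form tail probability used here is valid only for an even number of degrees of freedom (five populations give df = 4).

```python
import math


def chi_square_gof(observed, weights):
    """Pearson chi-square statistic with expected counts proportional to weights."""
    total_obs = sum(observed)
    total_w = sum(weights)
    stat = 0.0
    for o, w in zip(observed, weights):
        expected = total_obs * w / total_w
        stat += (o - expected) ** 2 / expected
    return stat


def chi2_sf_even_df(x, df):
    """Upper-tail probability of the chi-square distribution for even df:
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Population order: AFR, AMR, EAS, SAS, NFE (values from the GDNF table).
an = [9479, 27069, 12467, 25309, 72395]
gdnf_ac = [0, 0, 0, 26, 0]

stat = chi_square_gof(gdnf_ac, an)
p_value = chi2_sf_even_df(stat, df=len(an) - 1)
print(stat, p_value)
```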

Page 9

Some interesting genes with different distributions of AC among the populations

GDNF - GLIAL CELL LINE-DERIVED NEUROTROPHIC FACTOR (26 of 26 variant alleles are among South Asians)

• highly conserved neurotrophic factor

• supports survival and differentiation of dopaminergic neurons and motoneurons

Gene-Phenotype Relationships (OMIM)

Phenotype                                      Inheritance
Central hypoventilation syndrome               AD
{Hirschsprung disease, susceptibility to, 3}   AD
{Pheochromocytoma, modifier of}                AD

PAX3 - PAIRED BOX GENE 3 (48 of 53 variant alleles are among South Asians)

• together with Sox10 activates transcription of the MITF and RET genes

• controls a cascade of transcriptional events that are necessary and sufficient for skeletal myogenesis

Gene-Phenotype Relationships (OMIM)

Phenotype                                      Inheritance
Craniofacial-deafness-hand syndrome            AD
Rhabdomyosarcoma 2, alveolar                   AR
Waardenburg syndrome, type 1                   AD
Waardenburg syndrome, type 3                   AR, AD

+ different olfactory receptor genes in each population

+ EIF4G3 (translation initiation factor, 92% AFR), NUMBL (Numb-related gene, 94% EAS), TERF1 (telomeric-repeat binding factor, 93% SAS)

+ other genes

Page 10

Dependence of the distribution of coefficients for the gene sets on the population

We used the curated and hallmark gene set collections from MSigDB (GSEA) to estimate Shets per gene set (1377 gene sets in total). For each population we summed the ACs and ANs of all genes included in a given pathway. Shets were estimated in the naive way as Shet = AN*U/AC.
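The slides do not say how the mutation rate U was combined across the genes of a pathway. The Python sketch below therefore makes an explicit assumption: all genes in a set share one Shet, so the expected counts add up and the pooled naive estimate becomes sum(AN_g * U_g) / sum(AC_g). This may differ from the computation actually used in the project, and the function name and input values are hypothetical.

```python
def pooled_naive_s_het(genes):
    """Pooled naive Shet for a gene set.

    genes: list of (ac, an, u) tuples, one per gene in the set.
    Assumes a single Shet shared by every gene in the set.
    """
    total_ac = sum(ac for ac, _, _ in genes)
    neutral_expectation = sum(an * u for _, an, u in genes)
    if total_ac == 0:
        raise ValueError("naive estimate is undefined when no PTV alleles are observed")
    return neutral_expectation / total_ac


# Invented two-gene pathway; the U values are assumptions.
pathway = [
    (10, 200000, 1e-6),
    (30, 200000, 3e-6),
]
print(pooled_naive_s_het(pathway))
```

For a set containing a single gene this reduces to the per-gene formula Shet = AN*U/AC.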

Page 11

Search for the gene sets with population-dependent selective effects

Pathway                                    AC_GLOBAL  AFR   AMR   EAS   SAS   NFE   p_value
PROTEASOME PATHWAY                         185        0.53  0.08  0.02  0.09  0.21  2.20e-133
IONOTROPIC ACTIVITY OF KAINATE RECEPTORS   125        0.03  0.55  0.04  0.04  0.27  3.50e-35
IL 10 PATHWAY                              326        0.05  0.06  0.06  0.51  0.20  5.91e-94
APOPTOSIS INDUCED DNA FRAGMENTATION        132        0.05  0.14  0.02  0.52  0.24  1.07e-33
REGULATION OF IFNG SIGNALING               186        0.03  0.06  0.03  0.60  0.24  9.67e-66
IL 13 PATHWAY                              215        0.03  0.08  0.06  0.56  0.25  8.24e-64

(Population columns give the fraction of the pathway's PTV alleles found in that population.)

• 746 of 1379 gene sets had a Bonferroni-corrected p.value < 0.05

• 6 of these 746 gene sets had more than 50% of their PTV alleles in one of four populations (AFR = 1, AMR = 1, EAS = 0, SAS = 4)

• 14 of these 746 gene sets had more than 40% of their PTV alleles in one of four populations (AFR = 2, AMR = 1, EAS = 3, SAS = 8)
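The two screening steps on this slide, a Bonferroni correction followed by a check for one population holding most of the PTV alleles, can be sketched as follows. The helper names are hypothetical; the example row uses the PROTEASOME PATHWAY values from the table, and 1379 is the number of tests quoted in the bullets.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


def dominant_population(shares, threshold):
    """Return the population whose share of PTV alleles exceeds the
    threshold, or None if no population does."""
    population, share = max(shares.items(), key=lambda item: item[1])
    return population if share > threshold else None


# PROTEASOME PATHWAY row, restricted to the four populations that were compared.
proteasome = {"AFR": 0.53, "AMR": 0.08, "EAS": 0.02, "SAS": 0.09}

significant = bonferroni(2.20e-133, 1379) < 0.05
print(significant, dominant_population(proteasome, threshold=0.5))
```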

Page 12

Conclusions:

• A data set with the summed PTV ACs for the global and local populations was created from the gnomAD data for 12367 genes

• Selective effects of heterozygous PTVs were estimated for individual genes and for gene sets, in the global and local populations

• The estimated Shet values tend to decrease as the population size increases

• Selective effects of 2040 of 12367 genes are population-dependent, and 30 of these genes have more than 90% of their PTV alleles in one of four populations (AFR, AMR, EAS, SAS)

• Selective effects of 746 of 1379 gene sets are population-dependent, and 6 of these gene sets have more than 50% of their PTV alleles in one of four populations (AFR, AMR, EAS, SAS)

To sum up, selective effects for some genes do vary among the populations.

Thank you!

Page 13

Data Set:

• 60706 individuals

• Mean coverage depth > 30

• Stop gained, splice donor, splice acceptor

• Only variants that lead to complete loss of function of the gene were selected

X = ∑xj is the frequency of loss-of-function alleles in a gene, taken as the sum of the allele frequencies over all PTV sites within that gene (based on the selected data set).

For X << 1, the change in the frequency X is determined by the influx of de novo mutations and their removal by selection (genetic drift is not taken into account):

∂tX = − Shet·X(1−X) − Shom·X²(1−X)² + U

Shet and Shom are the selection coefficients to be estimated, acting on PTV variants (because X << 1, Shom can be neglected). U is the estimated mutation rate under the neutral model.

For N chromosomes, the number of loss-of-function alleles n depends on the frequency X:

n = NX = N ∑xj

and follows a Poisson distribution with expected value:

E(n) = NU/Shet
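The step from the dynamics to E(n) = NU/Shet is not spelled out on the slide. A short reconstruction, assuming mutation-selection balance (the frequency has reached a stationary value X*) and keeping only terms of first order in X, could read:

```latex
\begin{align}
\partial_t X &= -s_{het}\,X(1-X) + U + O(X^2) \\
             &\approx -s_{het}\,X + U \qquad (X \ll 1) \\
0 &= -s_{het}\,X^{*} + U \;\Rightarrow\; X^{*} = \frac{U}{s_{het}} \\
E(n) &= N X^{*} = \frac{N U}{s_{het}}
\end{align}
```

Here the O(X^2) term collects the homozygous contribution and the second-order part of the heterozygous term, both of which are negligible when X << 1.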

Estimating the selective effects of heterozygous protein-truncating variants from human exome data - Cassa et al., Nature Genetics, 2017