WHOLE-GENOME GENETIC DIVERSITY AND FUNCTIONAL CLASSIFICATION OF VARIATIONS OF A PAKISTANI INDIVIDUAL
MUHAMMAD ILYAS
National Centre of Excellence in Molecular Biology
UNIVERSITY OF THE PUNJAB, LAHORE, PAKISTAN
2015
Whole-Genome Genetic Diversity and Functional Classification of Variations of a Pakistani Individual
In Partial Fulfillment of the Requirement for the Degree of
DOCTORATE OF PHILOSOPHY
in MOLECULAR BIOLOGY
(Human Genomics, Bioinformatics)
Submitted by MUHAMMAD ILYAS
Supervisors DR. ZIAUR RAHMAN
PROF. DR. JONG BHAK
National Centre of Excellence in Molecular Biology
University of the Punjab, Lahore, Pakistan
“IN THE NAME OF ALLAH, THE MOST BENEFICENT, THE MOST MERCIFUL”
DEDICATED TO
MY MOTHER AND FATHER
WHOSE AFFECTION, LOVE, ENCOURAGEMENT AND PRAYERS OF
DAY AND NIGHT ENABLED ME TO ACHIEVE SUCH SUCCESS AND
HONOR
ALONG WITH ALL HARD WORKING AND RESPECTED TEACHERS
CERTIFICATE
This is to certify that the experimental work described in the thesis submitted by
MUHAMMAD ILYAS has been carried out under my direct supervision. Data/results reported
in this manuscript are duly recorded in the Centre’s official notebook(s). I have personally gone
through the raw data and certify the authenticity of all the results reported herein. I further certify
that these data have not been used, in part or in full, in a manuscript already submitted or in the
process of submission in partial/complete fulfillment of the award of any other degree from any
other institution at home or abroad. I also certify that the enclosed manuscript has been prepared
under my supervision, and I endorse its evaluation for the award of the PhD degree through the
official procedures of the Centre/University.
In accordance with the rules of the Centre, data book No. 1078 is declared an
unexpendable document that will be kept in the registry of the Centre for a minimum of three
years from the date of the thesis defense examination.
Signature of Supervisor ___________________________
Name of Supervisor: Dr. Ziaur Rahman
Signature of Co-Supervisor: ________________________
Name of Co-Supervisor: Prof. Dr. Jong Bhak
SUMMARY
Pakistan covers a key geographic area in human history, being both part of the Indus
River region, one of the cradles of civilization, and a link between Western
Eurasia and Eastern Asia. The region is inhabited by a number of distinct ethnic groups, the
largest being the Punjabi, Pathan (Pakhtun), Sindhi, and Baloch. We analyzed the first male
Pakistani genome (PTN) from the north-west province of Pakistan by sequencing it to 29.7-fold
coverage on the Illumina HiSeq2000 platform. Comparison with the human reference genome
identified a total of 3.8 million single nucleotide variations (SNVs) and 0.5 million small
indels. Among the SNVs, 129,441 were novel, and 10,315 nonsynonymous SNVs
were found in 5,344 genes. SNVs were annotated for health consequences and high-risk diseases,
as well as possible influences on drug efficacy. Comparison with a panel of Central
Asians from the HGDP-CEPH panels typed for ~650k SNPs confirmed that the PTN genome
is representative of the Pathan/Pakhtun ethnic group. The mtDNA haplogroup (C4a1a1) and Y
haplogroup (L1) of this individual were also typical of his geographic region of origin. The
demographic history reconstructed with PSMC highlights a recent increase in effective
population size, compatible with the admixture between European and Asian lineages expected in
this geographic region. The genome is a useful resource for understanding genetic variation and human
migration across the whole Asian continent. Finally, we conclude that modern
Pathans/Pakhtuns are an admixture of European and Asian lineages, which makes them distinct from
other world populations. Their genetic makeup will help in discovering rare variants and
facilitate the development of personalized medicine.
ACKNOWLEDGEMENTS
At the outset, I bow my head to the Omnipotent, the Most Merciful, the Compassionate
and the Omniscient Almighty Allah, who showered upon me all His blessings throughout my
life and especially gave me the strength to complete this research work.
I wish to acknowledge the remarkable contribution of Prof. Dr. Sheikh Riazuddin (S.I.,
T.I., HI) founder and Ex-Director, and Prof. Dr. Tayyab Husnain (I.F., T.I.) Director, Centre of
Excellence in Molecular Biology in the establishment and strengthening of the prestigious
institute CEMB, where I began learning research and science.
I am also grateful to my supervisor, Dr. Ziaur Rahman, for his guidance, energy, time
and other forms of contribution. I am deeply grateful to him for his confidence in me, for being a
constant source of inspiration and for always supporting me in pursuing my own ideas, and I am
most indebted to him for the extremely friendly atmosphere on both a professional and a personal level.
Without his support my research work would have been impossible.
Foremost, I would like to express my deep and sincere gratitude to my co-supervisor,
Prof. Dr. Jong Bhak, Director and CEO of the Personal Genomics Institute, Genome Research
Foundation, South Korea. His vision, patience and motivation at every step of the study made it
possible for me to work in this exciting and emerging field of research. His encouragement and
support helped me understand and carry out my research project in South Korea. The positive
atmosphere and excellent working facilities in his laboratory deepened my devotion to learning and
knowledge. Colleagues at Jong’s lab (JongSo Kim, Yunsung Cho, Hakmin, Jesse Cooper and
Jaewoo Moon) helped me complete this research successfully.
I am indebted to the members of the Tonellato lab at Harvard Medical School for
providing a stimulating environment for intellectual development and research. From the day I
joined the group, Prof. Dr. Peter J Tonellato played a crucial role in getting me up to speed
with biomedical informatics and personalized medicine. I benefited constantly from his
continuous support and guidance throughout my work. Informal discussions with Michiyo
Yamada, Sheida Nabavi, Latrice Landry and Yassine Souilmi were crucial for the success of my
research project. My whole stay at Harvard was a rewarding and most agreeable experience,
and Boston is one of the most enjoyable cities I have lived in.
I wish to express my deepest gratitude to my senior colleagues at CEMB: Khalid
Masood, who always motivated me to do my best in bioinformatics, Sobia Ahsan Halim,
Muhammad Israr, Aneela Yasmin and Shahid ur Rahman, for their helpful guidance. I also
acknowledge my lab members Atif Anwar Mirza and Zulfiqar Ali Mir for their kind cooperation
during my PhD.
I would also like to express my indebtedness to Prof. Dr. Andrea Manica (Cambridge
University, UK), Prof. Dr. Qasim Ayub (Wellcome Trust Sanger Institute, UK), Prof. Dr.
Sultan-e-Rome (Government Jehanzeb College Swat), Khwaja Aftab Ahmad (Swat) and Dr. Muhammad
Fahim (IBGE Peshawar), who encouraged me by showing interest in my work. They generously
provided reading material and shared their knowledge with me. Special thanks are due
to Prof. Dr. Habib Ahmad and Prof. Dr. Mukhtar Alam for their kind words and continuous
guidance.
I additionally appreciate the support of my friends Ziaur Rahman, Sulaiman Shams,
Imtiaz Ali, Sahib Zar and Inamullah. Their endless help and support allowed me to overcome all
of the difficult times.
Finally, I thank those who are dearest to me, who have loved me unconditionally and
stood by me during times of confusion and frustration: my mother and father, my brother
Muhammad Abbas, my sisters and my loving wife, who helped me get through some of the
most difficult challenges that I have faced to date. I thank her for her patience and understanding
over the past few years. Last but not least, I am grateful to the rest of my family for their
endless love, support and encouragement throughout my entire academic career. My family has
been far away from me these years, but they were closer than ever in my mind and heart.
I would like to acknowledge the financial support of the Genome Research Foundation while I
was working in South Korea. Thanks to the Higher Education Commission of Pakistan for providing
me the fellowship, which helped me greatly in obtaining advanced training in personalized genomics and
biomedical informatics at Harvard University, Boston, USA.
Many people, especially my classmates and team members, have made valuable
comments and suggestions on this project, which inspired me to improve my research. I
thank all those who helped, directly or indirectly, to complete this dissertation.
Muhammad Ilyas
Lahore, 2015
LIST OF ABBREVIATIONS
BAC: Bacterial Artificial Chromosome
BGI: Beijing Genomics Institute
ddNTP: dideoxynucleotide triphosphate
dNTP: deoxynucleotide triphosphate
EST: Expressed Sequence Tag
FISH: Fluorescent in situ Hybridization
GWAS: Genome Wide Association Study
NGS: Next Generation Sequencing
qPCR: quantitative PCR
SNP: Single Nucleotide Polymorphism
TGS: Third Generation Sequencing
WGA: Whole Genome Amplification
KPGP: Korean Personal Genomes Project
1KGP: 1000 Genomes Project
SNV: Single Nucleotide Variant
CAMDA: Critical Assessment of Massive Data Analysis
CDS: Coding DNA Sequence
UTR: Untranslated Region
NMD: Nonsense Mediated Decay
PTN: Pathan Genome
PK1: Pakistani Genome (Sindhi)
SJK: First Korean Genome
PGP: Personal Genome Project
TABLE OF CONTENTS
SUMMARY
ACKNOWLEDGEMENT
LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
1. INTRODUCTION
CHAPTER 2
2. LITERATURE REVIEW
2.1. SEQUENCING TECHNIQUES
2.1.1. HIGH-THROUGHPUT SEQUENCING
2.1.2. DE NOVO SEQUENCING
2.1.3. RE-SEQUENCING
2.1.4. EXOME SEQUENCING
2.2. HIGH THROUGHPUT SEQUENCING PLATFORMS
2.2.1. ROCHE 454 SYSTEM: PYROSEQUENCING
2.2.2. AB SOLID SYSTEM: SEQUENCING BY LIGATION
2.2.3. ILLUMINA/SOLEXA SYSTEM: SEQUENCING WITH REVERSIBLE TERMINATORS
2.2.4. ION TORRENT: SEMICONDUCTOR SEQUENCING
2.2.5. THE THIRD GENERATION SEQUENCER
2.3. GENETIC VARIANTS IN HUMAN GENOME
2.3.1. SINGLE NUCLEOTIDE VARIANTS/POLYMORPHISMS
2.3.2. STRUCTURAL VARIATIONS
2.3.3. COPY NUMBER VARIATIONS
2.3.4. LINEAGE MARKERS FOR POPULATION STUDY
2.3.5. VARIABLE NUMBER TANDEM REPEATS
2.3.6. SHORT TANDEM REPEATS (STRS)
2.4. APPLICATIONS OF GENOME VARIANTS
2.4.1. GENETIC ANCESTRY AND ADMIXTURE MAPPING
2.4.2. MEDICAL AND CLINICAL IMPLICATIONS
2.4.3. PHARMACOGENOMICS
2.5. PERSONAL AND POPULATION GENOME PROJECTS
2.5.1. PERSONAL GENOME PROJECT (PGP)
2.5.2. 1000 GENOMES PROJECT (1KGP)
2.5.3. PAN-ASIAN POPULATION GENOMICS INITIATIVE (PAPGI)
2.5.4. ONE MILLION GENOMES
2.5.5. HUMAN GENOME DIVERSITY PROJECT (HGDP)
2.5.6. BILLION GENOMES PROJECT
2.5.7. OTHER GENOME CONSORTIUMS
CHAPTER 3
3. MATERIALS AND METHODS
3.1. SUBJECT SELECTION AND ETHICAL STATEMENT
3.2. DATA SOURCES
3.3. DNA EXTRACTION
3.4. CYTOGENETIC ANALYSIS
3.5. LIBRARY PREPARATION AND WHOLE GENOME SEQUENCING
We conducted a PSMC (Pairwise Sequentially Markovian Coalescent) analysis to reconstruct
the demographic history of the Pathan population (Li and Durbin 2012). We compared the Pathan
genome to a set of 11 HGDP genomes from around the world (as published by Meyer et al.). We
first used samtools to extract the diploid genome sequences from their BAM files aligned to hg19,
excluding the sex chromosomes and mitochondrial genomes because they are haploid. In PSMC, we
used the command-line options -N25 -t15 -r5 -p "4+25*2+4+6", which have been used successfully
in previous similar analyses of humans and great apes (Prado-Martinez et al., 2013).
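The -p option above is compact, so a short illustration may help. In the PSMC pattern syntax, each "+"-separated term is one free parameter spanning the given number of atomic time intervals, and a term of the form "n*m" stands for n parameters spanning m intervals each. The following minimal Python sketch expands the pattern used here; the function name parse_psmc_pattern is a hypothetical helper written for illustration and is not part of PSMC itself.

```python
def parse_psmc_pattern(pattern):
    """Expand a PSMC -p pattern into a list with one entry per free
    parameter, each entry being the number of atomic time intervals
    that parameter spans. (Hypothetical helper, not part of PSMC.)"""
    groups = []
    for term in pattern.split("+"):
        if "*" in term:
            # "25*2" means 25 parameters, each spanning 2 intervals
            repeat, width = (int(x) for x in term.split("*"))
            groups.extend([width] * repeat)
        else:
            # "4" means one parameter spanning 4 intervals
            groups.append(int(term))
    return groups


groups = parse_psmc_pattern("4+25*2+4+6")
n_intervals = sum(groups)    # total atomic time intervals
n_parameters = len(groups)   # free interval parameters
print(n_intervals, n_parameters)  # prints: 64 28
```

Under this reading, the pattern used in this study divides time into 64 atomic intervals described by 28 free parameters, with the most recent and most ancient intervals pooled to stabilize the estimates.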
3.14 Phylogenomics Analysis
A central aim of evolutionary biology is to understand the relationships
among species. Single nucleotide variants (SNVs), often also referred to as SNPs, generated
through sequencing, genotyping and related technologies enable phylogeny
reconstruction by providing very large numbers of characters for investigation (Miller et al.,
2013). In the current study, a SNP-based phylogeny was constructed after identifying and compiling
SNPs in all individuals. The neighbor-joining tree was generated from pairwise FST values
calculated for all ethnic samples using the population allele frequencies across all autosomal
variants. The “Neighbor” program from PHYLIP was used to construct all bootstrap trees (Saitou
and Nei, 1987), and MEGA5 was then used to visualize them (Tamura et al., 2011). The Yoruba
population was used as an outgroup to root the phylogenetic tree.
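The pairwise FST step can be sketched in a few lines of Python. The estimator actually used is not stated in this section, so the sketch below assumes a simple Nei-type ratio-of-averages form, FST = 1 - sum(HS) / sum(HT), computed from biallelic allele frequencies; the population labels and frequencies are toy values chosen for illustration only.

```python
def pairwise_fst(freqs_a, freqs_b):
    """Nei-type FST between two populations from per-site allele
    frequencies (assumed estimator; ratio of sums over sites)."""
    num = 0.0
    den = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        # mean within-population expected heterozygosity
        hs = (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2
        # expected heterozygosity of the pooled population
        p_bar = (pa + pb) / 2
        ht = 2 * p_bar * (1 - p_bar)
        num += ht - hs
        den += ht
    return num / den if den > 0 else 0.0


def fst_matrix(pop_freqs):
    """Symmetric matrix of pairwise FST values, keyed by population."""
    names = sorted(pop_freqs)
    return {a: {b: pairwise_fst(pop_freqs[a], pop_freqs[b]) for b in names}
            for a in names}


# Toy allele frequencies at four sites (illustrative, not thesis data)
toy = {
    "POP_A": [0.10, 0.50, 0.80, 0.30],
    "POP_B": [0.15, 0.45, 0.75, 0.35],
    "POP_C": [0.90, 0.10, 0.20, 0.70],
}
matrix = fst_matrix(toy)
print(matrix["POP_A"]["POP_B"] < matrix["POP_A"]["POP_C"])
```

A matrix of this kind, written in PHYLIP distance format, is the sort of input a neighbor-joining program takes; populations with similar frequencies (POP_A and POP_B in the toy data) receive a smaller distance and are joined first.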
CHAPTER 4
4. Results
4.1 Genome Sequencing and Variants Identification:
DNA extracted from blood was sequenced with paired-end reads of 90 bp using the
Illumina HiSeq2000 sequencer, producing 1,069,127,687 reads. A total of 83.3 Gb of
sequence was generated and aligned to the human reference genome (without Ns,
2,861,343,702 bp), covering 98.2% of the reference genome at an average depth of 28.5× (Table
4.1).
Table 4.1: Summary of data production and mapping results
Read length (bp)        90
No. of reads            1,069,127,687
No. of mapped reads     992,124,335
Mapped reads (%)        92.80
No. of nucleotides      83.25 Gb (89,385,267,060)
Mapping depth (×)       28.5
We identified a total of 3,813,440 SNVs, of which 3,683,999 (96.6%) were reported in
the dbSNP database (Sherry et al., 2001) and 129,441 were novel (Table 4.2). The novel SNV count was
further compared with the novel variant counts of other individual genomes from the literature
(Figure 4.1). There were 1,272,912 homozygous and 2,540,528 heterozygous SNVs. A total
of 18,547 SNVs were found in coding DNA sequence (CDS) regions, 25,481 in 3’
untranslated regions (UTRs), and 4,969 in 5’ UTRs. A total of 10,315 SNVs in 5,344 genes
were non-synonymous (nsSNVs).
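As a quick consistency check, the counts reported above and in Table 4.1 can be recomputed directly. The short Python sketch below uses only numbers quoted in this section; the variable names are introduced here for illustration.

```python
# Counts quoted in Section 4.1 and Table 4.1
total_snvs = 3_813_440
known_snvs = 3_683_999      # present in dbSNP
novel_snvs = 129_441
homozygous = 1_272_912
heterozygous = 2_540_528
total_reads = 1_069_127_687
mapped_reads = 992_124_335

# Known + novel and homozygous + heterozygous should both equal the total
assert known_snvs + novel_snvs == total_snvs
assert homozygous + heterozygous == total_snvs

known_pct = 100 * known_snvs / total_snvs
mapped_pct = 100 * mapped_reads / total_reads
het_hom_ratio = heterozygous / homozygous

print(f"known in dbSNP: {known_pct:.1f}%")     # 96.6%
print(f"mapped reads:   {mapped_pct:.2f}%")    # 92.80%
print(f"het/hom ratio:  {het_hom_ratio:.2f}")
```

Both partitions sum exactly to the total SNV count, and the recomputed percentages agree with the 96.6% and 92.80% reported in the text and table.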
Table 4.2: Summary of SNVs found in Pathan’s genome and overlaps with dbSNP137 Total SNVs
A total of 65 CNVRs had not previously been described in the database of genomic
variants (DGV; http://projects.tcag.ca/variation/). Figure 4.2 shows the number of gained and
lost CNVRs in each chromosome. ANNOVAR was used for detailed annotation analysis of
CNVRs to identify genes associated with these regions.
Figure 4.1: Novel SNVs in personal genomes from thirteen different ethnic groups. Scatter plot showing novel
variants reported in personal genomes. Data collected from the literature.
Figure 4.2: Copy number variations counts distributed in each chromosome.
4.2 Functional Classification and Clinical Relevance of Variants:
All 10,315 nsSNVs found in the Pakistani (PTN) genome were further scrutinized for
their possible functional effects using computational prediction methods (SIFT and
PolyPhen2), resulting in 43 nsSNVs in 43 genes being classified as functionally damaging
(Table 4.4). Additionally, nsSNVs were annotated using ClinVar for their clinical relevance,
and we found that 31 coding SNVs are associated with several diseases (Table 4.5). Of
particular note are an SNV in the TF gene (rs1049296, Pro570Ser), which affects
Alzheimer’s susceptibility (Wang et al., 2013), and Ser217Leu in the ELAC2 gene (rs4792311), which is
implicated in genetic susceptibility to hereditary prostate cancer (Alvarez-Cubero et al.,
2013). The rate of prostate cancer is low in Pakistan (3.8%) (Aziz et al., 2003) compared
to Americans and Caucasians (Bhurgri et al., 2009). Three coding SNVs, in GHRLOS
(rs696217, Leu72Met), SERPINE1 (rs6092, Ala15Thr), and PPARG (rs1801282, Pro12Ala),
all have links with obesity (Gueorguiev et al., 2009, Bouchard et al., 2010, Galbete et
al., 2013). About 22.2% of Pakistanis are reported to be obese, which is close to the figures for European
(~24%) and United States populations (~19%) (Flegal et al., 2010, Kopelman et al., 2009).
We also found three pathogenic SNVs in genes associated with hair, skin and
pigmentation: EDAR (rs3827760, Val370Ala), SLC45A2 (rs16891982, Phe374Leu), and TYR
(rs1042602, Ser192Tyr) (Tan et al., 2013, Spichenok et al., 2011, Sulem et al., 2007). In
addition, we detected an SNV (rs17822931, Gly180Arg) in ABCC11, which is responsible for
wet earwax and was also found in the Pakistani PK1 genome (Yoshiura et al., 2006).
Figure 4.3: Comparative variant count of other reported individual genomes with the Pakistani (PTN) genome.
Graphical representation of a comparative study of PTN SNVs with other personal genomes reported previously.
One of the variants (rs1065852, Pro34Ser) in the CYP2D6 gene is responsible for
poor metabolism of debrisoquine, an adrenergic-blocking medication used for the treatment
of hypertension (Zheng et al., 2013). Also, two SNVs in the TPMT gene (rs1142345, Tyr240Cys
and rs1800460, Ala154Thr) are known to have a pathogenic effect and lead to thiopurine
methyltransferase (TPMT) deficiency (Li et al., 2013, Corrigan et al., 2013). Moreover, two
nsSNVs (rs2056899 and rs140980900) in the CYP4A22 and GGT5 genes of the arachidonic acid
metabolism pathway were found. Arachidonic acid in the human body usually comes from
dietary animal sources, such as meat, eggs, and dairy. Meat is an important part of the diet of
people living in northwestern Pakistan, usually consumed at least once a day, often in the
form of kabab (minced meat fried in oil) or curry (Lindholm 2004).
Table 4.4: Functionally damaged novel nsSNVs.
CHR    POS        REF  ALT  AA        GENE       SIFT (≤ 0.05)   PolyPhen2
chr1   114442945  T    C    E232G     AP4B1      0.00            Damaging
chr1   235976331  G    C    L75V      LYST       0.00            Damaging
chr1   113253928  C    T    G336R     PPM1J      0.01            Damaging
chr1   156242159  G    T    A222E     SMG5       0.01            Damaging
chr10  73475893   G    A    R68C      C10orf105  0.02            Damaging
chr11  128839275  C    G    G1931R    ARHGAP32   0.00            Damaging
chr11  46388863   C    T    L251F     DGKZ       0.04            Damaging
chr11  607617     G    A    G720R     PHRF1      0.01            Damaging
chr12  46757591   C    A    M324I     SLC38A2    0.03            Damaging
chr12  21457414   C    A    G179V     SLCO1A2    0.00            Damaging
chr12  8327035    C    G    H42Q      ZNF705A    0.00            Damaging
chr14  71445083   C    T    R677W     PCNX       0.01            Damaging
chr15  45426095   G    A    R31Q      DUOX1      0.04            Damaging
chr15  42041072   T    C    L1817P    MGA        0.00            Damaging
chr16  70524280   C    T    V555M     COG4       0.04            Damaging
chr16  27782929   A    G    E1385G    KIAA0556   0.05            Damaging
chr16  75147696   A    G    L324P     LDHD       0.00            Damaging
chr17  36003399   G    C    D17E      DDX52      0.01            Damaging
chr17  78082104   C    A    P324Q     GAA        0.00            Damaging
chr17  2995813    T    G    T160P     OR1D2      0.00            Damaging
chr17  7324288    C    A    D98E      SPEM1      0.00            Damaging
chr18  10487685   G    A    G399S     APCDD1     0.01            Damaging
chr18  55143927   C    G    S496C     ONECUT2    0.00            Damaging
chr19  4513548    C    T    G128R     PLIN4      0.01            Damaging
chr2   42990263   C    T    V353M     OXER1      0.01            Damaging
chr2   179439827  G    C    Q23678E   TTN        0.00            Damaging
chr2   98779387   C    G    I354M     VWA3B      0.05            Damaging
chr21  34924337   C    G    P934A     SON        0.02            Damaging
chr22  50307056   G    A    S91F      ALG12      0.01            Damaging
chr4   69796409   G    A    P387S     UGT2A3     0.00            Damaging
chr5   65290677   G    A    D98N      ERBB2IP    0.04            Damaging
chr5   154320687  T    A    L6Q       MRPL22     0.00            Damaging
chr5   140475629  T    A    Y419N     PCDHB2     0.00            Damaging
chr6   56879992   G    T    K120N     BEND6      0.00            Damaging
chr6   32188296   C    T    G349S     NOTCH4     0.00            Damaging
chr6   84234199   G    A    G347S     PRSS35     0.00            Damaging
chr7   73634930   G    C    R94S      LAT2       0.05            Damaging
chr8   28989961   C    G    E936Q     KIF13B     0.02            Damaging
chr8   81897091   C    T    D266N     PAG1       0.05            Damaging
chr8   110476498  C    A    H2479Q    PKHD1L1    0.01            Damaging
chr8   142228631  C    T    D319N     SLC45A4    0.03            Damaging
chr9   135863848  G    T    C168F     GFI1B      0.02            Damaging
chrX   152801794  C    T    T30M      ATP2B3     0.00            Damaging
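The selection criterion implied by the header of Table 4.4 (a SIFT score of at most 0.05 together with a PolyPhen2 call of "Damaging") can be sketched as a simple filter. Whether the two predictors were combined by requiring both to agree is an assumption of this sketch, as is the tuple layout; the first three rows are taken from Table 4.4, while the last two are hypothetical rows added only to show what the filter rejects.

```python
SIFT_CUTOFF = 0.05  # threshold shown in the Table 4.4 header

# (chromosome, position, gene, SIFT score, PolyPhen2 call)
variants = [
    ("chr1", 114442945, "AP4B1", 0.00, "Damaging"),
    ("chr16", 27782929, "KIAA0556", 0.05, "Damaging"),
    ("chrX", 152801794, "ATP2B3", 0.00, "Damaging"),
    # Hypothetical rows, not from the thesis:
    ("chrN", 1, "GENE_A", 0.20, "Damaging"),   # fails the SIFT cutoff
    ("chrN", 2, "GENE_B", 0.01, "Benign"),     # fails the PolyPhen2 call
]


def is_damaging(sift_score, polyphen_call):
    """Assumed rule: both predictors must flag the variant."""
    return sift_score <= SIFT_CUTOFF and polyphen_call == "Damaging"


damaging_genes = [gene for _, _, gene, sift, call in variants
                  if is_damaging(sift, call)]
print(damaging_genes)  # ['AP4B1', 'KIAA0556', 'ATP2B3']
```

Requiring agreement between two independent predictors is a conservative choice: it reduces false positives at the cost of discarding variants flagged by only one tool.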
Comparative genomic analysis was performed between the Pakistani genome symbolized as
“PTN” and the previously published Pakistani (PK1) genome. Non-synonymous
variants from the PK1 genome were annotated to investigate associated diseases.
Out of ~8,000 nsSNVs, only 37 variants (three novel) were found to be linked with particular
disorders. Eight clinically relevant SNVs overlapped with the PTN genome. In PK1 we
found no damaging variants responsible for Alzheimer’s, obesity and heart-related diseases
such as those found in the PTN genome. An SNV (rs1057910; CYP2C9) known to affect
warfarin response was observed in the PK1 genome. Moreover, a pathogenic mutation (rs1169305)
was seen in the HNF1A gene, which may become a cause of diabetes in the PK1 individual.
Most of the clinically relevant variants considered in this study were originally described
in Caucasian populations. While this result might be a consequence of the genomic affinities
of the PTN genome with other Caucasian populations, it might also reflect a bias due to most
of the GWAS work being carried out on Caucasian populations (Ayub and Tyler-Smith
2009). Therefore, a cohort study in the Pakistani population will be required for
validation.
4.3 Pharmacogenomics Analysis:
Damaging nsSNVs were annotated using the PharmGKB and DrugBank databases
(Hewett et al., 2002, Thorn et al., 2013, Wishart et al., 2008). A significant number of
variants were found to be linked with susceptibility to drug toxicity, while the remaining nsSNVs
were associated with the efficacy of drugs used in the treatment of diseases such as depression,
Chr    Position   rsID         Ref  Alt  Clinical Significance  Description
chr1   115236057  rs17602729   G    A    Pathogenic             Muscle AMP deaminase deficiency (MMDD)
chr2   49189921   rs6166       C    T    Association            Ovarian hyperstimulation syndrome (OHSS)
chr2   49191041   rs6165       C    T    drug response          Ovarian response to FSH stimulation
chr2   109513601  rs3827760    A    G    Pathogenic             Hair morphology
chr2   215813331  rs726070     C    T    Pathogenic             Autosomal recessive congenital ichthyosis 4B (ARCI4B)
chr3   10331457   rs696217     G    T    Pathogenic             Obesity
chr3   12393125   rs1801282    C    G    Pathogenic             Obesity
chr3   15686693   rs13078881   G    C    Pathogenic             Biotinidase deficiency
chr3   133494354  rs1049296    C    T    risk factor            Susceptibility to Alzheimer disease
chr4   102751076  rs10516487   G    A    Pathogenic             Association with systemic lupus erythematosus
chr5   33951693   rs16891982   C    G    Pathogenic             Skin/hair/eye pigmentation, variation in, 5 (SHEP5)
chr5   35861068   rs1494558    T    C    Pathogenic             Severe combined immunodeficiency
chr5   35871190   rs1494555    G    A    Pathogenic             Severe combined immunodeficiency
chr6   18130918   rs1142345    T    C    Pathogenic             Thiopurine methyltransferase deficiency (TPMT)
chr6   18139228   rs1800460    C    T    Pathogenic             Thiopurine methyltransferase deficiency (TPMT)
chr7   100771717  rs6092       G    A    Pathogenic             Plasminogen activator inhibitor type 1 deficiency
chr7   138417791  rs3807153    A    G    Pathogenic             Renal tubular acidosis, distal, autosomal recessive (RTADR)
chr8   18258103   rs1799930    G    A    drug response          Slow acetylator due to N-acetyltransferase enzyme variant
chr10  54531235   rs1800450    C    T    Pathogenic             Mannose-binding protein deficiency
chr10  70645376   rs10509305   A    C    Pathogenic             Preeclampsia/eclampsia 4 (PEE4)
chr11  5255582    rs35152987   C    A    Pathogenic             Delta thalassemia
chr11  88911696   rs1042602    C    A    Pathogenic             Skin/hair/eye pigmentation, variation in, 3 (SHEP3)
chr11  113270828  rs1800497    G    A    Pathogenic             Dopamine receptor D2, reduced brain density of
chr12  14993439   rs11276      C    T    Pathogenic             Dombrock blood group
chr14  21790040   rs10151259   G    T    Pathogenic             Cone-rod dystrophy 13 (CORD13)
chr15  28228553   rs74653330   C    T    Pathogenic             Tyrosinase-positive oculocutaneous albinism (OCA2)
chr16  48258198   rs17822931   C    T    Pathogenic             Colostrum secretion, ear wax
chr17  12915009   rs4792311    G    A    Pathogenic             Prostate cancer, hereditary, 2 (HPC2)
chr20  43043159   rs142204928  G    A    likely pathogenic      Maturity-onset diabetes of the young, type 1 (MODY1)
chr20  43280227   rs73598374   C    T    Pathogenic             Adenosine deaminase 2 allozyme
chr22  42526694   rs1065852    G    A    Pathogenic             Poor metabolism of debrisoquine
The same 46,946 SNVs were used to perform model-based cluster analysis using the
software ADMIXTURE. We performed the analysis for K = 2 to K = 13 distinct ancestral
populations. For K = 3, the PTN genome corresponds to the Caucasian ancestry, which accounts
for 85% of ancestry overall in the PTN Pakistani individual and 74% in Gujarati Indians (Figure
4.5). For K = 4, the Caucasian, African and East Asian ancestral populations were the
same as those seen for K = 3. Comparing results from K = 3 and K = 4, we see remarkable
agreement in the relative proportions of Caucasian and Asian ancestry across all Indian and
Pakistani individuals. However, K = 4 shows a very clear separation of South Asian ancestry
into distinct groups. Results from K = 5 to K = 13 suggest further separation in the ancestral
populations. Moreover, ancestry chromosome painting was performed using
INTERPRETOME, which supports admixture of the Pakistani individual's genome with
Caucasian and Asian ancestry (Figure 4.6). The admixture results are in agreement with the MDS
plots and suggest shared common ancestry of Pakistanis and Caucasians.
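A run over a range of K values such as the one described above is typically scripted. The minimal Python sketch below only builds the command lines, one per K from 2 to 13, using the basic "admixture <bed file> <K>" invocation; the input file name ptn_plus_panel.bed is a hypothetical placeholder, and the commands are printed rather than executed so that the sketch runs without the software installed.

```python
import shlex


def admixture_commands(bed_path, k_min=2, k_max=13):
    """Return one ADMIXTURE argument list per value of K (inclusive)."""
    return [["admixture", bed_path, str(k)]
            for k in range(k_min, k_max + 1)]


# Hypothetical PLINK .bed file holding the 46,946 shared SNVs
commands = admixture_commands("ptn_plus_panel.bed")
for cmd in commands:
    # Each list could be passed to subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Keeping each command as an argument list rather than a single string avoids shell-quoting problems if the commands are later passed to subprocess.run.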
Figure 4.5: ADMIXTURE results for K = 2 and K = 3 for the PTN individual combined with 46 selected
whole genomes from the Complete Genomics Inc. dataset (ASW: African ancestry in Southwest USA, CEU: Utah
residents of Northern and Western European ancestry, KOR: Korean, CHB: Han Chinese in Beijing, GIH:
Gujarati Indians in Houston, Texas, JPT: Japanese in Tokyo, Japan, LWK: Luhya in Webuye, Kenya, MKK:
Maasai in Kinyawa, Kenya, TSI: Toscani in Italia, YRI: Yoruba in Ibadan, Nigeria, and PTN: Pakistani Pathan).
The analysis was based on 46,946 SNVs. Each individual is represented by a vertical line, divided into colored
segments that represent membership coefficients in the subgroups.
Figure 4.6: Chromosome painting of possible genomic admixture, with Caucasians, Africans and Asians.
INTERPRETOME was used to create the chromosome ancestry painting.
4.5 Comparison with other Pakistani Individuals:
We investigated how representative our Pakistani PTN genome was of its ethnic
group by comparing it to 190 other Pakistani individuals in the HGDP-CEPH panel
(Rosenberg 2006, Li et al., 2008), which had been typed for ~650k SNVs. Admixture
analysis was performed based on 643,281 SNVs (thinned to avoid LD). Considering the
cluster membership from ADMIXTURE and STRUCTURE (from K = 2 to K = 5), the
composition of the Pakistani (PTN) genome was within the variability observed within the Pathan
sample from the HGDP (Figure 4.7). Similarly, in a multi-dimensional scaling (MDS) plot,
the PTN genome fell within the cluster of other Pathan individuals (Figure 4.8). Taken together, these
two results confirm that the Pakistani genome presented in this thesis, symbolized as “PTN”,
is representative of the Pathan ethnic group. These results are also in line with the
self-reported ancestry of the subject, all of whose grandparents came from Afghanistan to
Khyber Pakhtunkhwa (Pakistan).
Figure 4.7: Admixture results of Pakistani Pathan (PTN) individual to other ethnic groups in South Asia.
Admixture results for K = 2 and K = 5 for the Pathan individual combined with eight ethnic genomes from
HGDP dataset. The analysis was based on 643,281 SNVs. Each individual is represented by a vertical line,
divided into colored segments that represent membership coefficients in the subgroups.
Figure 4.8: Relationship of the Pakistani Pathan individual to other ethnic groups in South Asia. Twelve different groups from South Asia were compared with PTN. The
analysis was based on 643,281 SNVs.
4.6 Demographic History Analysis:
We inferred the demographic history of the Pakistani Pathan using the pairwise
sequentially Markovian coalescent (PSMC) model (Li and Durbin 2012) (Figure 4.9), and
compared it to a panel of worldwide populations based on a number of HGDP genomes (Meyer
et al., 2012). As previously reported, all populations share a similar demographic history
between 1 million and 200 kyr ago. From 200 kyr ago to 20 kyr ago, the PTN genome follows a similar
trajectory to other Asian and European populations, with an inferred effective population size
smaller than that of African populations, reflecting the out-of-Africa bottleneck. Over the last 20 kyr,
the PTN genome shows an explosion in effective population size, contemporaneous with that of other Eurasian
populations but much greater in magnitude. The very large effective population size likely
reflects admixture between European and Asian lineages giving rise to modern Pathans in
Pakistan (as also suggested by the analysis of mtDNA and the Y-chromosome), rather than an actual
increase in census size.
Figure 4.9: Pairwise Sequentially Markovian Coalescent (PSMC) model for reconstructing Pakistan’s demographic
history.
4.7 mtDNA and Y-chromosome analyses
The full mitochondrial genome of the Pakistani individual was generated by mapping its
reads to the revised Cambridge reference sequence (rCRS) (Andrews et al., 1999). The adenine and
thymine (AT) content of the genome was 55.5%, while the guanine and cytosine (GC) content was
44.5%. A total of 57 SNVs were found in the PTN mitochondrial genome, 13 of which had not
been previously reported. The variants were then analyzed with HaploGrep (Kloss-Brandstätter et
al., 2011) to identify the mitochondrial haplogroup of our PTN individual. A total of 14 SNVs
were diagnostic of the C4a1a1 haplogroup, which is more prevalent in southern Siberian
populations and is also reported in Pakistani Pathans (Rakha et al., 2011, Derenko et al., 2010).
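The AT and GC percentages quoted in this section are simple base-composition statistics. A minimal Python sketch of the calculation is given below; ignoring ambiguous bases such as N in the denominator is an assumption of this sketch, and the example sequence is a toy string rather than thesis data.

```python
def base_composition(sequence):
    """Return (AT %, GC %) of a DNA sequence, counting only
    unambiguous A, C, G and T bases (assumed convention)."""
    sequence = sequence.upper()
    at = sequence.count("A") + sequence.count("T")
    gc = sequence.count("G") + sequence.count("C")
    called = at + gc
    if called == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100 * at / called, 100 * gc / called


# Toy sequence: 6 A/T bases and 4 G/C bases; the two Ns are ignored
at_pct, gc_pct = base_composition("ATATTAGCGCNN")
print(f"AT = {at_pct:.1f}%, GC = {gc_pct:.1f}%")  # AT = 60.0%, GC = 40.0%
```

Because only called bases enter the denominator, the two percentages always sum to 100, as they do for the mitochondrial (55.5% and 44.5%) and Y-chromosome values reported here.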
The AT and GC contents of the Y-chromosome were 39.87% and 60.13%, respectively.
A total of 13,724 SNVs were identified, of which 4,423 were novel. The observed
Y-chromosomal SNVs were annotated as markers for the L1 haplotype of clade L. Haplogroup L
has a high frequency in Pakistan (14%) compared to India (6.3%), Turkey (~4%) and Caucasians
(~6%) (Mohyuddin et al., 2001, Firasat et al., 2007).
4.8 Phylogenomic Analysis:
A phylogenetic tree was constructed using 46 unrelated individuals, in which genomes
belonging to the same population and geographic region were found together in the same clade.
The PTN genome was closest to the Indian genomes, which were the most similar and
geographically nearest to it among the representative genomes from
Asian individuals. Pakistan lies next to China on the north-east side geographically, but China
forms a separate subtree with genetically similar groups such as Japan and Korea (Figure
4.11). Genomes from East Asia were placed close to each other. The African genomes, which include
those from the Yoruba (YRI), Maasai (MKK), and Luhya (LWK) populations as well as Africans
from the USA (ASW), formed one clade clearly separated from the Asian and Caucasian
genomes. Utah genomes (CEU) were grouped together, separated from those of Italy (TSI). Only
the Indian (GIH) and Pakistani (PTN) genomes from South Asia were used in this study.
Together they formed a clade, although they also showed a rather clear separation from each other.
Figure 4.10: Phylogenomic tree of Pakistani PTN genome with other world ethnic genomes.
CHAPTER 5
5. Discussion
Globally, human populations show structured genetic diversity as a result of geographical
dispersion, selection and drift (Gurdasani et al., 2015). Understanding this variation can provide
insights into evolutionary processes that shape both human adaptation and variation in disease
susceptibility (Ding and Kullo 2009). Although the HapMap (Gibbs et al., 2003), HGDP (Cann
et al., 2002), PanAsia (Ngamphiw et al., 2011) and 1000 Genomes (Siva, 2008) projects have
greatly enhanced our understanding of genetic variation globally, the detailed characterization of
Pakistani populations remains largely unexplored. Efforts such as the Human Genome Diversity
Panel have examined Pakistani genetic diversity but are limited by variant density (Cann et al., 2002).
The Pakistani population consists of four major ethnic groups (Punjabis, Pakhtuns, Sindhis,
Balochis), each with a unique cultural, dietary, environmental and ancestral heritage (Mehdi et al.,
1999). Genetic inferences about these ethnic groups have mostly focused on uniparental
lineage markers, which indicate ancient admixture of Pakistanis with Caucasians (Mohyuddin et al.,
2001). Clarifying the admixture of the Pakistani population provides fundamental
knowledge pertinent to the interpretation of any genetic study of prevalent diseases in Pakistani
groups and to correspondingly improved healthcare. Prevalent diseases in Pakistan include
cancer, diabetes, hypertension, and cardiovascular and neurological disorders (Dennis et al., 2006;
Rizvi et al., 2004; Whiting et al., 2011; Shera et al., 2007; Jafar et al., 2005; Jafar et al., 2003;
Nanan 2009; Shah et al., 2001; Mirza and Jenkins 2004). For example, it is estimated that 10%
of the population is afflicted with neurological diseases (Husain et al., 2000).
As a consequence of genetic diversity arising from dispersion, selection and drift,
and complicated by admixture, disease prevalence, severity, and resistance vary considerably
among ethnic groups. These factors are further complicated by inheritance issues and by
non-inherited and environmental causes, such as poverty, unequal access to care, lifestyle, and
health-related cultural practices (Chin et al., 2007). The genetic makeup of populations from
Pakistan is important for understanding specific diseases and is of interest to
scientists around the globe because of the increased likelihood of congenital diseases whose
prevalence is unique to Pakistani populations. Consequently, this research was conducted to sequence the
first whole genome from northwest Pakistan, both to discover disease variants and to provide a
foundation for complex disease studies. The current research not only provides new
approaches for exploring population admixture dynamics, but also allows us to conduct the first
genetic study of diseases and pharmacogenes in the northwestern population of Pakistan. The
ultimate goal of this study was to extend these results to interpretation and
translation that improve healthcare for the Pakistani people.
5.1 Clinical Relevance and Variant Characterization:
Studying complex diseases and mapping genes is often difficult because samples are drawn from
genetically heterogeneous populations. This complexity can be circumvented in isolated
populations, where genetic and environmental homogeneity is likely to produce fewer
variants of the disease and the extent of linkage disequilibrium is generally larger than in outbred
populations (Race and Group 2005). Genomic variations including single nucleotide variations
(SNVs), small insertions and deletions (indels), and copy number variations (CNVs) were
identified. Variants were then annotated and scanned for associated biological and physiological
function along with SNVs that could modulate drug response.
Overall, 3.8 million single nucleotide variations (SNVs), 1,503 copy number variation
regions (CNVRs) and 0.5 million small indels were identified by comparing the PTN genome with the human
reference genome (hg19). Among the SNVs, 129,441 were novel, and 10,315 non-synonymous
SNVs were found in 5,344 genes. SNVs were annotated for genealogical study, high risk
diseases, as well as possible influences on drug efficacy. Functional classification of all the non-
synonymous variants obtained was performed using computational prediction methods. Clinical
variants were investigated, and 31 coding SNVs were found to be associated with several
diseases. Our analysis suggests that the donor may be susceptible to Alzheimer’s disease (AD), based on
the SNV rs1049296 in the TF gene, in which proline is replaced by serine at position 570
(Wang et al., 2013). This AD-associated SNV is reported to decrease the affinity of TF for iron, leading to
iron accumulation in brain cells, which results in memory loss. Another variant, rs4792311 in
the ELAC2 gene, was observed in the Pakistani genome (PTN) and is reported to be associated with
prostate cancer. As a result of this SNV, serine at position 217 is replaced by leucine
(Alvarez-Cubero et al., 2013). The rate of prostate cancer is low in Pakistan (3.8%) (Aziz et al.,
2003) compared with Americans and Caucasians (Bhurgri et al., 2009). The donor’s family
medical history included documented cases of obesity, hypertension and heart
disease. Therefore, we specifically investigated the genes associated with these
disorders. Three variants associated with obesity were found in the genes GHRLOS (rs696217,
Leu72Met), SERPINE1 (rs6092, Ala15Thr), and PPARG (rs1801282, Pro12Ala) (Gueorguiev et
al., 2009; Bouchard et al., 2010; Galbete et al., 2013). About 22.2% of Pakistanis are reported to
be obese, which is close to the rates in European (~24%) and United States populations (~19%) (Flegal et al.,
2010; Kopelman et al., 2009; Streib 2007). We also found three pathogenic SNVs in genes
associated with hair, skin and pigmentation: EDAR (rs3827760, Val370Ala), SLC45A2
(rs16891982, Phe374Leu), and TYR (rs1042602, Ser192Tyr) (Tan et al., 2013; Spichenok et al.,
2011; Sulem et al., 2007). In addition, we detected an SNV (rs17822931, Gly180Arg) in
ABCC11, which is associated with wet earwax and was also found in the Pakistani PK1
genome (Yoshiura et al., 2006).
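The amino acid changes reported above (for example Pro12Ala) follow from translating the reference and variant codons. A minimal Python sketch of that classification step is given here for illustration; it is not the annotation pipeline used in this study, the function name classify_codon_change is hypothetical, and the codons in the usage note are illustrative rather than taken from the PTN data.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a single-codon substitution by its effect on the protein.

    Returns (effect, reference amino acid, variant amino acid), using
    one-letter amino acid codes and '*' for a stop codon.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "stop-gain"
    elif ref_aa == "*":
        effect = "stop-loss"
    else:
        effect = "non-synonymous"
    return effect, ref_aa, alt_aa
```

For instance, classify_codon_change("CCA", "GCA") reports a non-synonymous proline-to-alanine change, whereas classify_codon_change("CCA", "CCG") is synonymous. Only non-synonymous changes would then be passed on to functional prediction tools.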
One of the variants (rs1065852, Pro34Ser) in the CYP2D6 gene is responsible for poor
metabolism of debrisoquine, an adrenergic-blocking medication used for the treatment of
hypertension (Zheng et al., 2013). Also, two SNVs are known to have a pathogenic effect and
lead to thiopurine methyltransferase (TPMT) deficiency (Li et al., 2013; Corrigan et al., 2013).
Moreover, two nsSNVs in the arachidonic acid metabolism pathway were found. Arachidonic
acid in the human body usually comes from dietary animal sources, such as meat, eggs, and dairy
products. Meat is an important part of the diet for the people living in Khyber Pakhtunkhwa, usually
consumed at least once a day, often in the form of kabab (minced meat fried in oil), or curry
(Lindholm, 2004).
Comparative genomic analysis was performed using the genome from the northwest (PTN) and the
previously published Pakistani (PK1) genome (Azim et al., 2013). The PK1 genome was
reported to be of Sindhi ethnicity. Non-synonymous variants from the Pakistani (PK1) genome were
annotated and screened against prediction tools and disease and drug databases, for example SIFT, PolyPhen,
OMIM, ClinVar, PharmGKB and DrugBank (Ng and Henikoff, 2003, Jordan et al., 2011,
Landrum et al., 2013, Amberger et al., 2011, Thorn et al., 2013, Wishart et al., 2008), to
investigate associated diseases. Out of ~8,000 nsSNVs, only 37 variants (three novel) were
found to be linked with certain disorders. Eight clinically relevant SNVs overlapped
with the PTN genome. Unlike in the PTN genome, we found no damaging variants associated with
Alzheimer’s disease, obesity or heart-related diseases in PK1. An SNV known for its association
with warfarin response was observed in the PK1 genome (Schwarz et al., 2008). Moreover, a pathogenic
mutation (rs1169305) was seen in the HNF1A gene, which may become a cause of diabetes in the
PK1 individual (Bonnycastle et al., 2006). In addition, we detected an SNV (rs17822931,
Gly180Arg) in ABCC11, which is associated with wet earwax and was found in both Pakistani
genomes (Yoshiura et al., 2006).
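The overlap of clinically relevant SNVs between two genomes can be expressed as a set intersection on rsIDs, restricted to variants present in a clinical annotation table. The Python sketch below illustrates the idea under stated assumptions: the function name shared_clinical_variants and the small example sets are hypothetical and do not reproduce the actual PTN or PK1 call sets or any database content.

```python
def shared_clinical_variants(genome_a, genome_b, clinical_db):
    """Return the sorted rsIDs found in both genomes that also have a
    clinical annotation.

    genome_a, genome_b: iterables of rsID strings called in each genome.
    clinical_db: mapping from rsID to a free-text clinical annotation.
    """
    shared = set(genome_a) & set(genome_b)
    return sorted(rsid for rsid in shared if rsid in clinical_db)


# Illustrative toy inputs only (assumed, not the real call sets).
ptn_calls = {"rs17822931", "rs1049296", "rs4792311"}
pk1_calls = {"rs17822931", "rs1169305"}
annotations = {
    "rs17822931": "ABCC11, earwax type",
    "rs1049296": "TF, Alzheimer's disease",
    "rs1169305": "HNF1A, diabetes",
}

overlap = shared_clinical_variants(ptn_calls, pk1_calls, annotations)
```

With these toy inputs, overlap contains only rs17822931, the ABCC11 variant reported in both genomes. In practice the comparison would also need to confirm that both genomes carry the same allele at each shared position.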
5.2 Pharmacogenomic Profile:
The genetic map of the PTN individual was further used to find possible influences on drug
efficacy. A large number of variants were associated with susceptibility to drug toxicity,
while other nsSNVs were linked to the efficacy of medicines used in the treatment of diseases
such as depression, diabetes mellitus, Alzheimer’s disease and arthritis. A variant was found
associated with an increased risk of metabolic syndrome when treated with antipsychotics
(Ellingrod et al., 2008). Our donor has a high chance of having decreased diastolic blood pressure
if treated with benazepril (Jiang et al., 2004). One of the variants was associated with an increased
risk of toxic liver disease when treated with ethambutol, isoniazid, pyrazinamide, and rifampin
(Çetintaş et al., 2008). We also observed an SNV relevant to this individual’s use of escitalopram
for depression and anxiety (Han et al., 2013).
Most of the clinically relevant variants adopted in this study were originally described in
Caucasian populations. While this result might be a consequence of the genomic affinities of the
Pakistani genome with other Caucasian populations, it might also reflect a bias due to most of
the GWAS work being carried out on Caucasian populations (Ayub et al., 2009). Therefore, a
cohort study in the Pakistani population will be required for validation.
The methodology, technology and infrastructure that we developed and used are equally
applicable to the study of other global ethnic populations and the diseases most prevalent in those
populations. Most importantly, we successfully created a DNA variation dataset of the Pakistani
population and made it available to researchers for understanding human biology with respect to
disease predisposition, adverse drug reactions, and other genetically informed healthcare
interpretation.
5.3 Genealogical and Admixture Analysis:
For many years, researchers have been trying to clarify the origins and stratification,
as well as the intra- and inter-population relationships, of ethnic groups in Pakistan. Originally the
focus was on uniparental lineage markers passed through the Y chromosome and mtDNA in
the male and female lines, respectively (Mohyuddin et al., 2001, Firasat et al., 2007, Rakha et al., 2011,
Metspalu et al., 2004). We therefore analyzed the first-ever whole genome of a Pathan /
Pakhtun from a northwestern province (Khyber Pakhtunkhwa) of Pakistan to explore what
additional information can be learnt. Other analytical approaches were also used to assess the
influence of ancestral contributions within Pakistani Pakhtuns along with the historical
background of the region. Our analysis of 46 unrelated human genomes from 10 different
populations provides a comprehensive view of the PTN genome. We found that the Pakistani
Pathans appear along the Indian cline in our MDS, alongside Caucasians and East Asians. At
K = 4, the Pakistani Pathans and Indians formed their own component, making them better
representatives of South Asia; this was additionally confirmed by comparing our
representative genome with other individuals from South Asia in the HGDP-CEPH panel (Li et
al., 2008), which were studied using Illumina Omnichips of ~650k SNVs. Considering the
cluster membership from K=2 to K=5, the PTN genome composition was within the
variability observed within the Pathan sample from the HGDP (Figure 4.7). Similarly, in a
multi-dimensional scaling (MDS) plot, the PTN genome fell within the other Pathan/Pakistani
individuals (Figure 4.8). African populations were the most distant and differentiated
from the Pathan population. As the only neighboring genomes in the dataset, the Indian genomes showed the
closest genetic relationship with the Pakistani PTN genome. Both types of ethnic genomes formed
a separate clade distant from other Asian genomes, as supported by the MDS plot and phylogenetic
tree analysis.
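Both an MDS plot and a distance-based tree start from a matrix of pairwise genetic distances between individuals. The following Python sketch shows one common choice, the allele-sharing (identity-by-state) distance on genotypes coded as 0, 1 or 2 copies of the alternate allele. It is an illustration under that assumption; the distance measure actually used for Figures 4.8 and 4.10 is not specified here, and the function names are hypothetical.

```python
def ibs_distance(genotypes_1, genotypes_2):
    """Allele-sharing distance between two individuals.

    Genotypes are coded as 0, 1 or 2 alternate alleles per site, with None
    for a missing call. The distance is the mean number of differing
    alleles per site divided by 2, so it lies between 0 and 1.
    """
    pairs = [
        (a, b)
        for a, b in zip(genotypes_1, genotypes_2)
        if a is not None and b is not None
    ]
    if not pairs:
        raise ValueError("no sites genotyped in both individuals")
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))


def distance_matrix(genotypes):
    """Build a symmetric distance matrix from {sample name: genotype list}.

    Returns the sorted sample names and the matrix as a list of rows.
    """
    names = sorted(genotypes)
    matrix = [
        [ibs_distance(genotypes[a], genotypes[b]) for b in names]
        for a in names
    ]
    return names, matrix
```

The resulting matrix can be passed to any standard MDS or tree-building routine. Sites with a missing call in either individual are skipped, which keeps a low-coverage sample from appearing artificially distant.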
Based on our results, we confirmed that our genome PTN is representative of the Pathan
ethnic group. These results are also in line with the self-reported ancestry of the subject, with all
his grandparents coming from Afghanistan to Khyber Pakhtunkhwa (Pakistan). We found that
the Pathan genome has more than 80% Caucasian ancestry, with mitochondrial haplogroup C4a1a1 and
Y-chromosome haplogroup L, suggesting that Pathans are probably an admixture of Caucasians and South
Asians at the genomic level. Haplogroup L has a high frequency in Pakistan (14%) as compared
to India (6.3%), Turkey (~4%) and Caucasians (~6%) (Mohyuddin et al., 2001, Firasat et al.,
2007).
5.4 Demographic History Analysis and Ancestral Population Size:
We inferred the demographic history of the Pakistani genome (PTN) using the pairwise
sequentially Markovian coalescent (PSMC) model (Li and Durbin, 2012) (Figure 4.9), and
compared it to a panel of worldwide populations based on a number of HGDP genomes (Meyer
et al., 2012). As previously reported, all populations share a similar demographic history
between 1 million and 200 kyr ago. From 200 kyr to 20 kyr ago, the PTN genome follows a similar
trajectory to other Asian and European populations, with an inferred effective population size
smaller than that of African populations, reflecting the out-of-Africa bottleneck. Over the last 20 kyr,
the PTN genome shows an explosion in effective population size, contemporaneous with that of other Eurasian
populations but much greater in magnitude. The very large effective population size likely
reflects admixture between European and Asian lineages giving rise to modern Pathans (as also
suggested by the analysis of mtDNA and the Y chromosome), rather than an actual increase in
census size.
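PSMC reports time and population size in scaled units, which must be converted to years and effective population size using a mutation rate and a generation time. The Python sketch below shows that conversion as it is commonly done. The mutation rate (2.5e-8 per site per generation), the generation time (25 years) and the 100 bp bin size are assumed values for illustration and are not stated in this thesis; the function name scale_psmc is hypothetical.

```python
def scale_psmc(theta0, intervals, mu=2.5e-8, gen_time=25.0, bin_size=100):
    """Convert PSMC output from scaled units to years and effective size.

    theta0: scaled mutation rate per bin estimated by PSMC.
    intervals: list of (t_k, lambda_k) pairs, the scaled start time and
        relative population size of each time interval.
    mu: assumed mutation rate per site per generation.
    gen_time: assumed generation time in years.
    bin_size: number of base pairs per PSMC input bin.

    Returns a list of (years before present, effective population size).
    """
    # Reference effective size implied by theta0 = 4 * N0 * mu * bin_size.
    n0 = theta0 / (4.0 * mu * bin_size)
    return [
        (2.0 * n0 * t_k * gen_time, n0 * lambda_k)
        for t_k, lambda_k in intervals
    ]
```

Because both axes scale with the assumed mutation rate, a different choice of mu shifts the inferred times and sizes together, so the shape of the curve is more robust than its absolute values.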
5.5 Conclusion:
Here we present, for the first time, the whole genome of a Pakistani individual from a
northwestern province (Khyber Pakhtunkhwa). This research not only provides new
approaches for exploring population admixture dynamics, but also allows us to conduct the first
genetic study of diseases and pharmacogenes in the northwestern population of Pakistan. The
ultimate goal of this research was to extend these results to interpretation and
translation that improve healthcare for the Pakistani people. Our analysis provides a detailed view
of the diversity of the PTN genome, the functional classification of its variants and their impact on
pharmacogenomics. A large-scale analysis of diverse genomes is needed to help researchers
around the world understand genetic diversity and the functional classification of variants,
along with pharmacogenomic traits and associated drugs that could be used in personalized
medicine.
5.6 Recommendations and Future Plans:
• A genetic resource for all Pakistani populations should be established for computing their
allele sharing as a measure of linkage disequilibrium, admixture, and migration.
• A cohort study in the Pakistani population is required for validation, which will help
in conducting genetic disease studies.
• Susceptibility to rare and common diseases, and their association with the genetic
makeup of the Pakistani population, should be investigated.
• Patients, physicians and science journalists should be educated on interpreting genomic
results.
• Genomics applications and implications should be openly discussed through conferences
and workshops. This will encourage interaction between experts, academicians,
researchers, students and policy makers.
Chapter 6 REFERENCES
CHAPTER 6
6. References
Ahn, S.-M., Kim, T.-H., Lee, S., Kim, D., Ghang, H., Kim, D.-S., et al. (2009). The first Korean
genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome
research, 19(9), 1622-1629.
Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in