eprints.utm.my/id/eprint/54624/25/UmarShittuMFBME2015.pdf
STATISTICAL ANALYSIS OF HUMAN TUBERCULOSIS MICROARRAY
GENE EXPRESSION DATA IN THE BIOCONDUCTOR R PACKAGE
UMAR SHITTU
A dissertation submitted in partial fulfilment of the
Requirements for the award of
Master of Science (Biotechnology)
Faculty of Biosciences and Medical Engineering
Universiti Teknologi Malaysia
JANUARY 2015
DEDICATION
I dedicate this thesis to my beloved parents,
the late Shittu Umar and the late Hauwa’u Lawal.
ACKNOWLEDGEMENT
Alhamdulillah, in the name of ALLAH, the most Gracious, the most Merciful,
for His guidance and blessing throughout my master’s programme. I am very grateful
and would like to express my sincere gratitude to my supervisor Dr. Mohammed Abu
Naser for his invaluable guidance, continuous encouragement and constant support in
making this research possible.
I am highly grateful to my lovely wife Maryam Abdullahi and my son
Muhammad Kabir Umar who gave me courage and support to achieve success in my
chosen career. My sincere appreciation also goes to all my colleagues and others who
provided assistance on various occasions. Their views and tips were useful indeed.
Unfortunately, it is not possible to list all of them in this limited space. To you all I
say JAZAKUMULLAHU KHAIRAN.
ABSTRACT
Tuberculosis (TB) is an intracellular bacterial infection that attacks organs of the
human body. It is a worldwide disease with a high estimated number of deaths every
year. Microarray technology produces large amounts of disease gene expression data
and provides opportunities to mine the data to understand disease mechanisms at the
molecular level. The aim of this study is to explore the use of some of the tools
available for analysing human TB microarray gene expression data. Microarray gene
expression data for control samples stimulated with phosphate-buffered saline (PBS)
and unstimulated experimental samples of three clinical forms of human TB, namely
latent TB (LTB), pulmonary TB (PTB) and meningeal TB (TBM), were collected from
the GEO-NCBI database, and all analyses were performed using Bioconductor R
packages. This study explores the use of affycoretools for TB microarray image
visualization, affyQCReport for TB microarray data quality assessment, the GCRMA
method for TB microarray data normalization, and LIMMA as a statistical tool for
identifying significantly expressed genes in human TB. According to LIMMA, there
was a significant difference between stimulated and unstimulated tuberculosis
samples, and the majority of the significantly expressed genes identified were genes
responsible for the cellular immune response. Venn diagrams of the regulated genes
identified by the LIMMA analysis indicated more down-regulated than up-regulated
genes in the stimulated tuberculosis samples, but more up-regulated than
down-regulated genes in the unstimulated samples. Hierarchical clustering (hclust)
was used to determine common expression patterns among the three clinical forms of
human TB infection; the results suggest that hierarchical clustering distinguishes the
different clinical forms of human TB infection. This study recommends that these
findings be used in further analyses for the detection and control of human TB
infection.
ABSTRAK
Tuberculosis results from an intracellular bacterial infection that attacks the
organs of the human body, and it is a worldwide disease with a high death rate every
year. Microarray technology generates large amounts of disease gene expression data
and offers the opportunity to mine those data to understand disease mechanisms at the
molecular level. The purpose of this study is to explore the use of several tools
available for analysing human TB microarray gene expression data. Microarray gene
expression data for control samples stimulated with phosphate-buffered saline (PBS)
and unstimulated experimental samples of three different clinical forms of human TB,
namely latent TB (LTB), pulmonary TB (PTB) and meningeal TB (TBM), were obtained
from the GEO-NCBI database, and all analyses were carried out using Bioconductor R
packages. This study explores the use of affycoretools for TB microarray image
visualization, the affyQCReport tool for TB microarray data quality assessment, the
GCRMA method for TB microarray data normalization, and LIMMA as a statistical tool
for identifying significantly expressed genes in human TB. According to LIMMA, there
was a significant difference between stimulated and unstimulated tuberculosis, and
the majority of the significantly expressed genes identified were genes responsible
for the cellular immune response. The regulated genes identified from the LIMMA
analysis, shown using Venn diagrams, indicated more decreases than increases in gene
expression in stimulated tuberculosis, and more increases than decreases in gene
expression in unstimulated tuberculosis. The hierarchical clustering (hclust) method
was used to determine common expression patterns among the three different clinical
forms of human TB infection, and it suggested that hierarchical clustering analysis
distinguishes the different clinical forms of human TB infection. This study
suggests that the results generated from these findings can be used in further
analysis for the detection and control of human TB infection.
TABLE OF CONTENTS
CHAPTER TITLE
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS/SYMBOLS
LIST OF APPENDICES
1 INTRODUCTION
1.1 Background of the Study
1.2 Statement of the Research Problem
1.3 Significance of the Study
1.4 Aims and Objectives of the Research
1.5 Scope and Limitations
2 LITERATURE REVIEW
2.1 History of Tuberculosis
2.2 General Description of Tuberculosis Infection
2.2.1 Latent Tuberculosis Infection
2.2.2 Meningeal Tuberculosis Infection
2.2.3 Active Tuberculosis Infection
2.3 Control of Tuberculosis Infection
2.3.1 Tuberculosis Early Prevention and Control
2.3.2 Introduction of Vaccines and Drugs for the Treatment and Control of TB
2.4 Microarray Technology
2.5 Experimental Design of Microarray
2.6 Microarray Data Analysis
2.6.1 Packages for Microarray Data Analysis
2.6.2 Steps and Tools for Microarray Data Analysis in the R Package and Bioconductor
2.6.2.1 Pre-processing of Microarray Data
2.6.2.2 Common Tools Used for Pre-processing of Microarray in the R Package and Bioconductor
2.6.2.3 Identification of Differentially Expressed Genes
2.6.2.4 LIMMA as a Common Statistical Tool for Microarray Data in the R Package and Bioconductor
2.6.2.5 Advantages of LIMMA over Other Tools for Microarray Data Analysis in the R Package and Bioconductor
2.6.2.6 Pattern of Recognition Determination
3 RESEARCH METHODOLOGY
3.1 Description of the Research Area
3.2 Data Collection
3.3 Stimulated and Unstimulated Samples of Human Tuberculosis Infection
3.4 Microarray Data Analysis
3.4.1 Pre-processing of Microarray Data
3.4.1.1 Image Visualization Analysis
3.4.1.2 Quality Control and RNA Degradation Analysis of Microarray Data
3.4.1.3 Data Normalization, Background Correction, Summarization and Visualization
3.5 Identification of Differential Gene Expression
3.6 Determination of Common Pattern of Recognition
4 RESULTS AND DISCUSSION
4.1 Image Visualization Analysis
4.2 Quality Control and RNA Degradation Analysis of Microarray Data
4.3 Microarray Data Normalization, Background Correction, Summarization and Visualization
4.4 Identification of Differentially Expressed Genes
4.4.1 B-Statistics
4.5 Determination of Common Recognition Pattern
5 CONCLUSIONS AND RECOMMENDATIONS
5.1 Conclusion
5.2 Recommendations
LIST OF REFERENCES
APPENDICES
LIST OF TABLES
TABLE NO. TITLE
2.1 Different Packages for Analysing Microarray Data
2.2 Common Tools Used for Pre-processing of Microarray Data
4.1 Samples of stimulated and unstimulated microarray raw data in CEL file format of three different forms of human tuberculosis infections
4.2 Summary of RNA degradation plot with calculated slope and p-value
4.3 Identification of differential gene expression between stimulated samples of three different forms of human tuberculosis infections
4.4 Identification of differential gene expression between unstimulated samples of three different forms of human tuberculosis infections
LIST OF FIGURES
FIGURE NO. TITLE
2.1 Experimental design of microarray
2.2 Microarray experiment workflow
3.1 Image visualization analysis
3.2 Quality control and RNA degradation analysis of microarray data
3.3 Data normalization, background correction, summarization and visualization
3.4 Identification of differential gene expression
3.5 Determination of common recognition pattern
4.1 Images of some selected arrays of human latent tuberculosis infection
4.2 Images of some selected arrays of human pulmonary tuberculosis infection
4.3 Images of some selected arrays of human meningeal tuberculosis infection
4.4 Box plot of unnormalized samples of stimulated and unstimulated microarray raw data of three different forms of human tuberculosis infections
4.5 Histogram of unnormalized raw data samples of three different forms of human tuberculosis infections
4.6 QC plot for all raw data samples of three different forms of human tuberculosis infections from the simpleaffy package
4.7 Signal intensity plots of positive and negative border elements of microarray raw data of three different forms of human tuberculosis infections
4.8 Heat map of the array-array Spearman rank correlation coefficients of microarray raw data of three different forms of human tuberculosis infections
4.9 Summary of RNA degradation plot with calculated slope and p-value for each array
4.10 Box plot of normalized samples of stimulated and unstimulated microarray raw data of three different forms of human tuberculosis infections
4.11 MA plots of some selected normalized arrays of human latent tuberculosis infection
4.12 MA plots of some selected normalized arrays of human meningeal tuberculosis infection
4.13 MA plots of some selected normalized arrays of human pulmonary tuberculosis infection
4.14 Venn diagram showing ‘up’ and ‘down’ regulated genes among the three group comparisons of stimulated samples
4.15 Venn diagram showing ‘up’ and ‘down’ regulated genes among the three group comparisons of unstimulated samples
4.16 Hierarchical clustering of stimulated and unstimulated