APPENDIX 1
A new algorithm for identification of Stress responsive
TranscrIption Factor binding sites (STIF) and a database of
abiotic stress responsive transcription factors in Arabidopsis
thaliana (STIFDB)
Publications from this chapter:
• K. Shameer, S. Ambika, S. M. Varghese, N. Karaba, M. Udayakumar and R. Sowdhamini: STIFDB – Arabidopsis Stress responsive TranscrIption Factor DataBase. (2009) Int. Journal of Plant Genomics: 583429
• Sundar AS, Varghese SM, K. Shameer, Karaba N, Udayakumar M, R. Sowdhamini: STIF: Identification of stress-upregulated transcription factor binding sites in Arabidopsis thaliana. (2008) Bioinformation. 30; 2(10).
A: 1.1 Introduction
Transcription factors play a pivotal role in the cell by regulating differential expression of
genes required for a particular molecular function or biological process in the complex
cellular environment. The expression of proteins in the cell is carefully regulated by
transcription factors that interact with their downstream targets in specific signal transduction
cascades. Our understanding of the regulation of functional genes responsive to plant
abiotic stress signals is still nascent [638]. Understanding the molecular mechanisms that
underlie stress tolerance would be the first step in the generation of abiotic stress tolerant
crops. To understand plant abiotic stress responses, unraveling the mechanisms of regulation
of abiotic stress responsive genes assumes paramount importance. Gene regulation by
Transcription Factors (TFs) is an important facet of stress responsive signal transduction
cascades. Transcription factors are regulatory proteins that implement their functions by
binding directly to the promoters of target genes in a sequence-specific manner to either
activate or repress the transcription of downstream target genes. Arabidopsis thaliana is a
convenient plant model system to study fundamental questions related to regulation of the
stress transcriptome. Microarray experiments on the A. thaliana transcriptome indicate that
several genes can be up-regulated during multiple stresses, such as cold, salinity and drought.
Biochemical experiments have shown that several transcription factors are involved in the
up-regulation of these stress-responsive genes [3, 639]. Bioinformatics approaches are widely
employed in multiple domains of plant biology to study processes at the fundamental, cellular
or biochemical level [640]. To follow the intricate networks of transcription factors and genes
that respond to abiotic stress in plants, a new algorithm was developed to identify key
transcription factor binding sites located upstream of genes of interest. Hidden Markov models
of the transcription factor binding sites enable the identification of predicted sites upstream
of abiotic stress genes in A. thaliana. The search algorithm, named ‘STIF’, was assessed using
a set of genes reported to be up-regulated during abiotic stress response in A. thaliana [636].
The algorithm performed well, with more than 90% sensitivity, when tested on experimentally
validated positions of transcription factor binding sites in a dataset of 29 plant abiotic
stress up-regulated genes. The algorithm was then applied to a larger dataset of 2,629 genes
from the A. thaliana genome, extracted from various public microarray datasets related to
abiotic stress response experiments. These 2,629 genes were scanned with the algorithm for
potential abiotic stress-responsive transcription factor binding sites. A new database called
“STIFDB [Stress responsive TranscrIption Factor DataBase]” was compiled, developed and made
available in the public domain [637, 641]. STIFDB is a database of plant abiotic stress
responsive genes and their predicted abiotic stress transcription factor binding sites in
A. thaliana. STIFDB will be a useful resource for researchers to understand the abiotic stress
regulome and transcriptome of this important model plant system. This chapter details the new
HMM-based algorithm for the identification of plant abiotic stress-responsive transcription
factor binding sites, describes the database, and discusses general trends among the genes and
transcription factors available in STIFDB.
A: 1.2 STIF Algorithm
The interactions between regulatory proteins and DNA control many important processes and
responses to abiotic stresses, and defects in these interactions can contribute to inefficient
stress responses. Numerous studies have shown that transcription factors are important in
regulating plant responses to stress. One important step in the control of stress responses is
the transcriptional activation or repression (regulation) of genes. Databases, such as
ATHAMAP, offer information about the chromosomal positions of genes of interest and
possible location of their transcription factors and binding sites [642]. Multiple signaling
pathways regulate the stress responses of plants and there is significant overlap between the
patterns of gene expression that are induced in plants in response to different stresses [643].
Many genes induced by stress challenges, including those encoding transcription factors,
have been identified and some of them have been shown to be essential for stress tolerance
[644]. Many studies have also revealed some of the complexity and overlap in the responses
to different stresses, and are likely to lead to new ways to enhance crop tolerance to disease
and environmental stress. The binding specificities of only a small number of transcription
factors (TFs) are well characterized. Transcription factor binding sites (TFBSs) are usually
short (around 5-15 base pairs (bp)) and frequently contain degenerate sequence motifs. The
sequence degeneracy of TFBSs has been selected through evolution and is beneficial because it
confers different levels of activity upon different promoters. Much of the information on TF
binding specificity has been determined using traditional methodologies, such as footprinting
methods (which identify the region of DNA protected by a bound protein), nitrocellulose
binding assays, South-western blotting (of both DNA and protein) or reporter constructs.
These methods are generally time-consuming and are not readily scalable to a whole genome
[645]. One promising approach is to identify the transcription factors by computational
techniques at the whole-genome level, so as to choose promising targets for detailed
experimental investigation. Well-known eukaryotic transcription factors and their binding
sites are recorded in the TRANSFAC database [646]. Computational tools are available to
facilitate the retrieval of transcription factor binding site information from the TRANSFAC
database, but these target the human genome [647]. Several existing algorithms use scoring
schemes based on position-specific profiles [648, 649] or probabilistic models to recognize
putative binding sites. Even though various bioinformatics tools are
available in the public domain for transcription factor binding site prediction, most of the
servers and algorithms are largely for eukaryotic general-purpose transcription factors and
not specific for plant genomes or plant abiotic stress responsive genes. Other computational
algorithms search for possible genes that are downstream of classical transcription factor
binding sites, where the binding site data are encoded as HMMs and searched across the genome
of interest. These methods are called ‘targeted gene finding’ since they begin from known
transcription factor binding sites [648]. However, this approach is complicated for plant
stress genes, since stress TF-binding site signatures could also be upstream of constitutive
genes and the binding sites of different transcription factors could overlap. Data on a set of
10 well-known plant abiotic stress-specific transcription factor families were curated from
the literature, and Hidden Markov Models (HMMs) of their known binding sites were generated.
This knowledge-based approach, which builds HMMs from well-known abiotic stress cis-elements,
has been tested extensively to standardize score thresholds. ‘STIF’ is an HMM-based algorithm
developed to predict transcription factor binding sites in the upstream and 5'UTR regions of
genes extracted from TAIR. The program based on the STIF algorithm accepts a DNA sequence
(upstream region + 5'UTR) in FASTA format as input. Extensive experimental results show that
abiotic stress responsive transcription factors fall into ten transcription factor families
[650, 651]: ABI3/VP1, AP2/EREBP, ARF, bHLH, bZIP, HB, HSF, Myb, NAC and WRKY, which have a
total of 22 subfamilies (Table A3). The input sequence is scanned against a library of
pre-constructed stress-responsive transcription factor HMMs for these 22 subfamilies, built
from the literature. After the HMM search, the scores of all possible matches in forward and
reverse orientations in the upstream regions of stress genes are calculated, along with their
average and standard deviation. Hits are then scored using a significance scoring method. In
the final step, the standard deviation, the average and the significance scores of the hits
are used to calculate the Z-score and the normalized score. A Hidden Markov Model (HMM) is
used for transcription factor binding site detection in the STIF algorithm. The consensus (S)
of length (L) was taken from the literature, and the probabilistic score (P(S)) and log-odd
score were calculated.
P(S) = F × T
where
P(S) = probability of the consensus
F = frequency (i.e., number of occurrences of a particular nucleotide / total number in the column)
T = transition probability
The log-odd score for the consensus (S) is
Log-odd score (S) = log P(S) – L(AT) log 0.375 + L(GC) log 0.125
As plant sequences are rich in GC content, higher weight is assigned to AT than GC in the
log-odd score. A schematic representation of the STIF algorithm is provided in Figure A1.
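The scoring step can be illustrated with a small Python sketch. It is not the authors' implementation (which was written in Perl) and rests on several assumptions that the text does not state: P(S) is taken as the product of one F and one T term per position, natural logarithms are used, L(AT) and L(GC) are read as the numbers of A/T and G/C positions in the site, and the score is computed as a conventional log-odds in which both background terms are subtracted from log P(S). The printed formula groups the two background terms differently, so this is one possible reading only. All function names are illustrative.

```python
import math

# Background probabilities taken from the formula in the text:
# 0.375 for A or T positions, 0.125 for G or C positions.
P_AT = 0.375
P_GC = 0.125


def consensus_probability(frequencies, transitions):
    """P(S): product of per-position F (frequency) and T (transition) terms.

    Assumption: one F and one T value per position of the consensus.
    """
    p = 1.0
    for f, t in zip(frequencies, transitions):
        p *= f * t
    return p


def log_odd_score(site, p_s):
    """Log-odds of a site against the AT/GC background (natural log assumed)."""
    l_at = sum(site.count(base) for base in "AT")  # number of A/T positions
    l_gc = sum(site.count(base) for base in "GC")  # number of G/C positions
    background = l_at * math.log(P_AT) + l_gc * math.log(P_GC)
    return math.log(p_s) - background


# G box consensus CACGTG from Table A3, with every F and T set to 1.0
# (an idealised perfect match, chosen only for illustration).
p = consensus_probability([1.0] * 6, [1.0] * 6)
g_box_score = log_odd_score("CACGTG", p)
```

Under this reading, a site with more G/C positions scores higher than an equally long site with more A/T positions at the same P(S), because G and C are rarer in the assumed background.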
A: 1.3 Implementation of STIF Algorithm
The STIF algorithm and the associated scripts for HMM-related computation, searching,
calculation of statistics, input-output parsing and other calculations such as Z-score and
normalization were coded in Perl. A flowchart of the algorithm is provided in Figure A2.
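The search and statistics steps can be sketched as follows. This is an illustrative Python outline, not the original Perl code: a regular expression stands in for the HMM search, reverse-strand positions are reported relative to the reverse-complemented sequence, and the Z-score uses the population standard deviation. The text does not specify any of these choices, and all names are hypothetical.

```python
import re
import statistics

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_both_strands(seq, pattern):
    """Return (start, strand) for each motif match in either orientation.

    A regex is used here as a simple stand-in for an HMM search.
    Reverse-strand starts refer to the reverse-complemented sequence.
    """
    regex = re.compile(f"(?=({pattern}))")  # lookahead also finds overlapping matches
    hits = [(m.start(), "+") for m in regex.finditer(seq)]
    hits += [(m.start(), "-") for m in regex.finditer(reverse_complement(seq))]
    return hits


def z_scores(scores):
    """Z-score of each hit score relative to all hit scores."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population SD: an assumption
    if sd == 0:
        return [0.0 for _ in scores]
    return [(s - mean) / sd for s in scores]


def significant_hits(hits, scores, threshold=2.0):
    """Keep hits whose Z-score reaches the chosen threshold."""
    return [h for h, z in zip(hits, z_scores(scores)) if z >= threshold]


upstream = "AACACGTGTTGTCGGC"  # made-up sequence for illustration
dre_hits = scan_both_strands(upstream, "[AG]CCGAC")  # CRT/DRE pattern (A/G)CCGAC
```

In this made-up sequence the CRT/DRE pattern is absent from the forward strand but present on the reverse strand, which is why both orientations are scanned.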
A: 1.4 Statistical Validation of STIF Algorithm
A new transcription factor binding site prediction algorithm, ‘STIF’, was developed to
identify potential TFBSs of stress-specific transcription factors using Hidden Markov
Models. The HMMs of cis-elements, based on abiotic stress transcription factor families, were
validated using the jackknife method. The HMM-based search algorithm STIF is used to search
100 base pairs upstream of the gene together with its 5’UTR. A set of 29 abiotic stress genes
was selected from public microarray databases as candidate genes for validation, based on
their high stress-induced expression profiles [652]. To evaluate the method further, sequence
searches were also performed against 1000 base pairs upstream together with the 5’UTR. In the
validation data set, at a Z-score of 2.0 for searches of 100 base pairs with the 5’UTR, the
sensitivity of the method is high: it identified 18 out of 20 hits (90% coverage), with only
two false negatives. Based on these observations, a Z-score of 2.0 or more can be considered
effective for predicting transcription factor binding sites within 100 base pairs with the
5’UTR. In several instances, more than one transcription factor has been recorded for a stress
gene of interest (for instance, COR15a has both DREB_AP2_EREBP and G_ABRE_bZIP; Figure A3,
Table A1). The 29 stress genes considered for validation are known to be up-regulated during
different types of stress, such as cold, dehydration and salinity. It is possible that, during
a particular type of abiotic stress, any one of these transcription factors would selectively
respond by binding upstream of the gene of interest. Because few ‘validated’ transcription
factor binding sites map within the 100 base pairs upstream of stress genes, validation
searches were extended to 1000 base pairs upstream of the gene, where a Z-score threshold of
1.5 is appropriate for 1000 base pairs with the 5’UTR (Figure A4, Table A2). At this threshold
STIF achieves about 91% sensitivity: 71 out of 78 hits are correctly identified with Z-scores
above the threshold. As with most other algorithms, the method is not highly specific and can
generate false positives. The specificities for searches in the validation set, for 100 base
pairs and 1000 base pairs respectively, are 57 and 18.6 (for a Z-score threshold of 1.5) and
54 and 20.4 (for a Z-score threshold of 2.0). The difficulty in obtaining high specificity is
due to the simple and short nucleotide patterns that describe the binding sites of some
transcription factors, such as bHLH. Such TFs match frequently and with very good agreement
with the HMM, which is reflected in high scores. An alternate normalized score was proposed
for these frequently responding TFs in the Arabidopsis genome. STIF employs Hidden Markov
Models of binding site information for well-known plant transcription factors in abiotic
stress. Microarray results for key stress up-regulated genes in plants have shown that a large
number of these genes are up-regulated in response to a variety of stresses, generating
redundancy in the dataset of stress up-regulated plant genes. Further, the experimentally
‘validated’ results also indicate that more than one transcription factor can induce the
expression of the stress genes archived in STIFDB. The scoring schemes and
thresholds established should be useful for dealing with redundancy and occurrence of
Table A1: Statistical validation of search using STIF algorithm for Transcription Factor
Binding Sites (TFBS) 100 bp upstream of 11 stress genes. (* Total number of false positives
(x) and total number of false negatives (y) for a set threshold of Z-score. The numbers are
expressed as x,y for different thresholds imposed.)
Table A2: Statistical validation of search using STIF algorithm for Transcription Factor Binding Sites (TFBS) 1000 bp upstream of 29 stress genes. (* Total number of false positives (x) and total number of false negatives (y) for a set threshold of Z-score. The numbers are expressed as x,y for different thresholds imposed.)
TAIR ID      Number of hits* using different thresholds
             1.0      1.5      2.0      2.5      3.0
AT1G02920    8,0      8,0      5,0      1,1      1,1
AT1G02930    9,0      9,0      7,0      6,1      3,1
AT1G05680    7,0      7,0      7,0      2,0      1,0
AT1G07890    17,1     14,1     8,1      8,1      0,1
AT1G20440    4,0      4,0      4,0      2,0      1,0
AT1G20450    6,0      6,0      5,0      4,0      2,0
AT1G52400    6,0      6,0      3,0      3,0      0,0
AT1G67090    12,0     12,0     12,0     12,0     3,0
AT1G77120    9,0      9,0      9,0      3,0      2,0
AT2G14610    4,0      4,0      4,3      4,3      4,3
AT2G14960    8,0      6,0      6,0      1,0      0,2
AT2G15970    13,0     13,0     12,0     4,0      3,0
AT2G21330    4,0      4,0      4,0      4,0      4,0
AT2G33380    11,0     11,0     9,0      7,0      2,0
AT2G40880    9,0      8,0      7,0      6,0      1,3
AT2G42540    16,0     6,0      5,0      4,0      4,0
AT2G46270    17,0     16,0     16,0     14,2     4,2
AT3G02480    7,0      5,0      4,0      3,0      3,1
AT3G04720    11,0     11,0     11,0     5,1      1,1
AT3G15500    10,0     10,0     10,0     4,0      3,1
AT4G00340    5,0      5,1      5,1      5,1      3,1
AT4G01120    12,0     12,0     12,0     10,1     2,1
AT4G02380    7,0      7,0      7,1      5,1      5,1
AT4G23130    9,0      9,0      9,2      2,2      0,2
AT4G37070    7,0      7,0      7,0      4,0      3,1
AT5G15970    11,0     11,0     5,0      3,0      1,2
AT5G44420    9,0      9,0      9,0      8,0      2,0
AT5G51070    7,0      7,0      7,0      6,0      4,0
AT5G52310    6,0      6,0      4,0      3,0      3,0
Total        261,1    242,2    213,8    143,14   65,24
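The x,y bookkeeping used in the validation tables, and the sensitivity quoted in the text, can be reproduced with a few lines of Python. This is a minimal sketch with illustrative function names. It assumes each table row is available as a whitespace-separated string, and that sensitivity is the fraction of expected hits that were correctly identified.

```python
def parse_row(row):
    """Split 'TAIR_ID x,y x,y ...' into the ID and a list of (FP, FN) pairs."""
    gene, *cells = row.split()
    return gene, [tuple(int(v) for v in cell.split(",")) for cell in cells]


def column_totals(rows):
    """Sum false positives and false negatives for each threshold column."""
    parsed = [parse_row(r)[1] for r in rows]
    return [
        (sum(pair[0] for pair in col), sum(pair[1] for pair in col))
        for col in zip(*parsed)
    ]


def sensitivity(identified, expected):
    """Fraction of expected binding sites that were correctly identified."""
    return identified / expected


# First two rows of the table above (thresholds 1.0, 1.5, 2.0, 2.5, 3.0).
rows = [
    "AT1G02920 8,0 8,0 5,0 1,1 1,1",
    "AT1G02930 9,0 9,0 7,0 6,1 3,1",
]
totals = column_totals(rows)

# 71 of 78 validated hits at the 1000 bp search, as stated in the text.
sens_1000bp = sensitivity(71, 78)
```

Applied to all 29 rows, `column_totals` would give the "Total" line of the table; `sens_1000bp` evaluates to roughly 0.91.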
Transcription factor family        Stress signal        Name of the cis-element   Cis-element     Reference (stress signal / cis-element)
ABI3/VP1                           ABA                  distB ABRE                GCCACTTGTC      [666]
AP2/EREBP (EREBP-ERF sub-family)   Cold, Drought        GCC-box                   GCCGCC          [667]
AP2/EREBP (DREB sub-family)        Cold, Drought        CRT/DRE                   (A/G)CCGAC      [668]
ARF                                Auxin                AuxREs                    TGTCTC          [669]
bHLH/myc                           NaCl, ABA, Drought   N box                     CACG(G/A)C      [670]
                                                        G box                     CACGTG          [671]
bZIP                               ABA, Drought         G box1                    CCACGTGG        [672] [673]
                                                        G box2                    TGACG(T/C)      [673]
                                                        G/ABRE                    (C/T)ACGTGGC    [674]
                                                        C/ABRE                    CGCGTG          [674]
HB                                 ABA, Drought                                   CAATNATTG       [675], [676]
HSF                                Drought, Cold, Heavy-metal stress and oxidative stress