http://www.abbs.info/ e-mail:[email protected] ISSN 0582-9879 ACTA BIOCHIMICA et BIOPHYSICA SINICA 2003, 35(4):317-324 CN 31-1300/Q |
Prediction of Prokaryotic Promoters Based on Prediction of Transcriptional Units
LIN Jian-Cheng, XU Jin-Lin#,
LUO Jian-Hua#, LI Yi-Xue1
( Biotechnology Department, School of Life Science and Technology, Shanghai Jiaotong University, Shanghai 200240, China; 1 Bioinformatics Center of Shanghai Institutes for Biological Sciences, the Chinese Academy of Sciences, Shanghai 200031, China )
Abstract Identification of promoters is very important in understanding gene regulating relationships in an organism, and computational identification of promoters has been a long standing problem in computational biology. A new method was presented to predict promoter regions in prokaryotic organism. The method predicted transcription unit (TU) first and the TU was divided into singlet that contains only one single gene in a TU, and operon that contains more than one gene. Based on these predicted TUs, promoter was predicted for each TU using hidden Markov model including explicit state duration density. Both predicted TUs and promoters were satisfying.
Key
words promoter; TU; operon; prediction;
HMM with state duration
Large-scale sequencing has brought us piles of rough data remaining to be annotated. The laboratory confir*mation is time-consuming and costly. Alternately, computational solutions to this problem have been developed very well and have achieved a lot. However, most of the work was mainly focused on the coding regions. In fact, promoter region is very important to a gene, because most of the regulatory elements involved in gene expression are located in or around this region. In E.coli, the σ factor is a functional component that recognizes promoter sequences. Thus, we always define different promoters by different σ factors recognizing them, such as the σ70 promoter is the promoter recognized by σ70 factor. One organism contains different σ factors, each recognizes different subset of promoters and thus regulates different gene expression under specific conditions. For example, σ32 promoters control the genes that are critical for the bacterium to survive prolonged exposure to higher-than-normal temperature[1]. Therefore, identification of the promoter region of a certain TU is very important for understanding the regulating relationships of the genes.
Prior to promoter prediction, we must realize that all the genes are transcribed as a TU, and if a TU is an operon which contains more than one gene, all the genes in this operon must share a same promoter under certain condition and are transcribed together[2]. Since a promoter corresponds to a TU but not a gene, we should predict promoter for each TU but not for each single gene as done before by others. And since the promoter region lies in the region before the coding region of a TU, that is, it lies before the coding region of the first gene in the TU, if the TU is an operon, what we need to care about is only the upstream region before the first gene in the operon. Thus, it is valuable to predict TU before promoter prediction. TU prediction could base on distance and functional class[3], or base on multiple-genome comparison[4]. Both methods have their advantages, the former can be efficiently used in the well-studied genomes, while the later is a good method to those less studied genomes.
TU prediction offered a good source for further promoter prediction. Considering only the limited region that a promoter resides in, we can make promoter prediction more accurately and efficiently. Moreover, because most of the prokaryotic promoter sequences lie inside the upstream 200 bp before the translation initiation size (TI), the upstream 200 bp before the TI of all the first genes in each TU was cut first, and then, their promoter sequences were predicted using hidden Markov model (HMM) with duration. HMM was first interpreted by Rabiner[5], and further applied to TU prediction[6], gene finding[7], and the related algorithm of ECOPARSE was also developed[8]. In promoter prediction, although HMM had been employed by Pedersen et al.[9], the HMM with explicit state duration density has not been employed yet. We calculated all the parameters carefully from the documented known promoters and used these parameters to predict unknown promoters. In the end, this method was used to analyze the whole genome of E. coli and the newly sequenced Leptospira interrogans genome.
In present study, TU prediction based on distance, function class and multiple-genome comparison was applied first. The HMM with state duration density was further applied to promoter prediction. The results were also used to analyze the genome of Leptospira interrogans genome.
1.1 Materials
1.1.1 Materials used in TUs predictionRegulonDB (http://kinich.cifn.unam.mx:8850/db/regulondb-intro.frameset)[10], a database with all the documented proved TUs, promoters and terminators, as well as some predictions based on direction, distance and functional class log-likelihoods. Our newly predicted results were validated in this database.
The collected 388 documented known TUs were kindly provided by professor Kenta Nakai from Tokyo University. The functional classes were available at “E. coli Genome and Proteome Database” GenProtEC (http://genprotec.mbl.edu). The multiple genome comparing result was provided by TIGR (http://www.tigr.org/tigr-scripts/operons/operons.cgi).
E. coli K12 genome sequences were downloaded from NCBI.
1.1.2 Materials used in promoter predictionPromEC (http://bioinfo.md.huji.ac.il/marg/promec/) is an updated compilation of E. coli mRNA promoter sequences. It includes documentation on the location of experimentally identified mRNA transcriptional start sites on the E.coli chromosome, as well as the actual sequences in the promoter region. 469 documented E.coli promoters were downloaded from PromEC, then we divided them into several categories by the transcriptional factors as shown in Table 1.
σ Factors |
Promoter
number |
σ70 |
375 |
σ38 |
3 |
σ32 |
10 |
σ54 |
5 |
σ19 |
1 |
Unknown |
60 |
σ70 and σ38 |
14 |
All of the promoters were from PromEC and validated in RegulonDB.
1.2 Methods
1.2.1 Methods for TUs prediction Firstly, as done by Salgado et al.[3], we divided all the genes in the E. coli K12 into forward and backward strand directons(the groups including every gene transcribes in the same direction with no intervening gene transcribed in the opposite one). The procedure yielded 638 forward directons and 636 backward directons. Secondly, we compared the collection of known TUs with the collection of directons to find those directons containing TUs with added genes at either side. These added genes were used to construct a data set of pairs of adjacent genes at borders between TUs, which constituted a contrasting data set against the collection of adjacent genes in operons. These operations resulted in a set of 577 pairs of adjacent genes in operons, a set of 265 pairs at the borders of TUs, and a set of 3004 total pairs of adjacent genes transcribed in the same direction.
Method by Salgado et al. was based on distance and functional class. Different distance between genes was assigned a different log-likelihood. The distance between pairs within operons and pairs at TU borders, as described by Salgado was shown in Fig.1. There were two obvious peaks at the distance of -4 and -1 (there was also a low peak at -8) for the genes within operons. There are two reasons: (1) there is no need for space between genes inside operons; (2)minimal spacing between genes can protect mRNA from degradation by association with ribosomes. But these answers are not enough. Here we supposed that some genes are co-expressed in both time and quantity, so that the distance between them should be short enough to ensure co-transcription. On the other hand, to combine with 16 S RNA in the translation process, the distance should be consistent with the 16 S RNA; this may explain why the distance should be -8, -4, and -1, but not -3, 0, etc. The log-likelihoods for different distances were calculated according to the formula given by Salgado et al.[3] [Function(1)]. In which, Nop and Nnop are pairs of genes in operons and at transcriptional boundaries, respectively, at a distance [dist] (in 10 bp intervals), whereas TNop and TNnop are the total number of pairs of genes in operons and at the transcription unit boundaries, respectively. While the log-likelihood LL[F] for genes in the same functional class was 0.7192, to different functional class was -0.616[3].
Fig.1 The
distance frequency of gene pairs within operons and at TU borders
To the gene pairs within operons, there are two obvious peaks at -4 and -1. But to the gene pairs at TU borders the peaks are not obvious although we can also find frequency increase at these two points and at -8 (From Salgado et al.[3]).
(1)
Since the log-likelihood of distance and functional class was based on the known knowledge, and therefore was fit for well-studied organism. Ermolaeva et al.[4] developed a method based on the knowledge that many sets of gene were in the same order through multiple genomes and these conserved gene groups represented candidate operons. They developed an algorithm to find gene pairs in the same operon with highly likely (probability>=98%), but low sensitivity (30%-50%). Low sensitivity limited badly its usage in prediction of TUs and operons.
Combining these two methods, we could add an additional log-likelihood of multiple genome comparing result to distance and functional class log-likelihood. According to Ermolaeva et al.[4], in the 263 SN(on the same strand but belong to different operons), there were only 24 false positives, and the accuracy was 98%. We could calculate the log-likelihood of the two genes in a same operon as done by Salgado et al.[3], LL[S]=log[0.98/(24/263)]=1.031. There were totally 513 pairs of genes predicted by this method. Thus, we could make our prediction based on distance, functional class, and multiple-genome comparison by Function (2). In this function, the three statistical elements belonged to different statistical spaces. If LL>threshold, we could predict that the gene pair was in the same TU, otherwise, it would belong to different TU.
(2)
1.2.2 Methods for promoter prediction
(1) Analysis of the promoter region From Table 1, it could be found that about 80% of the promoters were recognized by σ70 factor, therefore σ70 promoters were the most important transcriptional factors in E. coli; while the other promoters were only used in some special situations, such as the σ32 promoters were only used in heat shock response.
Cutting the sequence from upstream -75 bp to downstream +25 bp region of TS (transcriptional initiation site) of the promoters, we could make a statistical analysis of the documented promoters. First, we calculated the σ70 promoters and then the Un-σ70 promoters (the promoters other than σ70 promoters), and the Un-σ70 and the unknown promoters (keep unknown) were also calculated together in the end. These results were shown in Fig. 2.
Fig.2 Statistical
result of the promoters
From Fig.2, we could draw the following conclusions. (Ⅰ) Fig.2(A) and Fig.2(C) are similar at distances of 41, 65-70 and 76, with a “T” peak at 41, “TA” peaks at 65-70 and a “CAT” peak at 76, corresponding to the Sextama box, Pribnow box, and TS in the σ 70 promoters respectively. Fig.2(B) was not obvious at these sites due to the lack of data. From the similarity between Fig.2(A) and Fig.2(C), it was shown that all prokaryotic σ promoters may contain similar functional boxes as Sextama and Pribnow box. This phenomenon may be explained in this way that most σ factors have similar structure and sequence, and thus recognize similar DNA sequences at similar sites. (II) There was a high “A” peak which had not been identified and reported before at about a distance of 25 (corresponding to -51 upstream TS) which was much more obvious in un-σ70 promoters. Probably this was another functional box other than Sextama box, Pribnow box, and TS.
(2) HMM algorithm, architecture and design
(I) Typical hidden Markov models (HMM) An HMM is a discrete-time Markov
model with some additional features. The main addition is that when a state
is visited by the Markov chain, the state "emits” a letter from a fixed time-dependent alphabet. Letters are emitted
via a time independent, but usually state-dependent, probability distribution
over the alphabet. When the HMM runs, there is, first, a sequence of states
visited which we denote by q1, q2, q3, …, and second, a sequence of emitted symbols denoted by o1, o2, o3,
…. This is a two-step process: initial(q1)→emission(o1)→transition(to q2)→emission(o2)→transition(to q3)→emission(to o3)→……. Compared with Markov chain, an HMM is more general and flexible,
allowing us to model phenomena that we can’t model sufficiently well with a regular Markov chain model. One of
the standard HMM architectures for molecular biology applications, first introduced
by Krogh et al.[8], is the left-right architecture. Three algorithms
are used in HMM: Forward-Backward procedure, Viterbi algorithm and Baum-Welch
method. The typical HMMs have already been used in multiple sequence alignments.
In Pfam, it was used to determine protein domains for a query protein sequence.
(II) Model including explicit state duration density in HMMs and our architecture In a conventional HMM, the major weakness is the modeling of state duration. Suppose that ‘p’ is the probability of transition from any state to itself, the probability that the process stays in this state for ‘n’ steps will be pn-1(1-p), so that the time the process stays in that state will follow a geometric distribution. But sometimes we must use other distributions for this time, and only HMM with duration fits this situation.
In an HMM with explicit state duration, all transition probabilities from a state to itself are zero. When the process visits a state, it produces not a single symbol but an entire sequence from the alphabet. The length of the sequence can follow any distribution, and the model generating the sequence of that length can be any distribution.
A parse ¢ is a sequence q1, q2, ..., qr and a sequence of lengths d1, d2, ..., dr. Given an observed sequence S=(s1, s2, ..., sL), from an HMM with duration, the Viterbi algorithm finds an optimal parse ¢ opt such that Prob(¢opt|S)≥Prob(¢|S) for all parses ¢. That is to find the hidden state trajectory T that makes the conditional probability P(T|S) maximal. Here, T={(q1d1)(q2d2)...(qrdr)}, ∑dr=L. The algorithm is similar to that used in GeneMark.hmm[7].
Given a DNA sequence designated as S={s1, s2, ..., sL} with length L, where si stands for the nucleotide A, T, G or C, the functional role of each nucleotide could be indicated by a “functional” sequence T={t1, t2, ..., tL}. Here when ti takes value ‘1’, it indicates that the nucleotide si is a part of random region (containing no promoter sequence); ‘2’, si is a part of Sextama box; ‘3’, a part of the region between Sextama box and Pribnow boxes (-10 region), and so on (Fig.3). Basically, the Sextama box and Pribnow box were derived from the analysis of σ70 promoters, they were actually functional box recognized by σ70. Since most prokaryotic promoters contain the similar functional boxes as Sextama and Pribnow box, this architecture of σ70 promoters can be extended to all prokaryotic promoters.
Fig.3 The
architecture of HMM with duration to identify prokaryotic promoters
R, Sequence region; TS, transcriptional initiation site. State ‘1’ is the non-promoter region before a promoter, and ‘2’ is the Sextama box of a promoter, ‘3’ is the region between Sextama and Pribnow boxes, ‘4’ is the Pribnow box, ‘5’ is the region between Pribnow box and TS, ‘6’ is the TS and the random sequence after TS.
To find the optimal trajectory, we used an extended of Viterbi algorithm. Function (3) was used to calculate the maximal probability of all parses.
(3)
(3)Calculation of the HMM parameters From the documented E. coli promoters in PromEC, the observation probability of Pribnow box, Sextama box, and the transcriptional initiation site were calculated, and the result was shown in Table 2.
(A) Pribnow box
|
P1 |
P2 |
P3 |
P4 |
P5 |
P6 |
A |
0.0475 |
0.8881 |
0.2170 |
0.5898 |
0.5220 |
0.0305 |
T |
0.7932 |
0.0407 |
0.5051 |
0.1627 |
0.1288 |
0.9356 |
G |
0.0746 |
0.0407 |
0.1322 |
0.1254 |
0.1254 |
0.0170 |
C |
0.0847 |
0.0305 |
0.1458 |
0.1220 |
0.2237 |
0.0169 |
(B) Sextama box
|
P1 |
P2 |
P3 |
P4 |
P5 |
P6 |
A |
0.0441 |
0.0373 |
0.0881 |
0.5729 |
0.1898 |
0.4712 |
T |
0.8373 |
0.8508 |
0.1458 |
0.1695 |
0.1898 |
0.2000 |
G |
0.0576 |
0.0644 |
0.6339 |
0.0746 |
0.1017 |
0.1797 |
C |
0.0610 |
0.0475 |
0.1322 |
0.1831 |
0.5186 |
0.1492 |
(C) TS site
|
P1 |
P2 |
P3 |
A |
0.1732 |
0.4673 |
0.2320 |
T |
0.2810 |
0.1307 |
0.3595 |
G |
0.1405 |
0.2908 |
0.1536 |
C |
0.4052 |
0.1111 |
0.2549 |
It is supposed that the Sextama and the Pribnow boxes both have the length of 6 bp, the TS 3 bp. If the length is out of the limitation, all of its observation is set at 0.25 for each ‘A’, ‘T’, ‘G’, or ‘C’. P1 is the first nucleotide in a Sextama and Pribnow boxes or the TS, and P2, is the second, and so on.
The region between Pribnow and Sextama boxes can vary between 13 bp and 21 bp. When the region has a length of 17 bp, the promoter will have the highest transcriptional efficiency, which has been proved by experiments. Accordingly, the longest region we identified is 17 bp (Fig.4).
Fig.4 Frequency
of different length of the region between Pribnow and Sextama boxes
After calculation of the region between Pribnow and Sextama boxes of the documented σ70 promoters, the length of 17 bp has the largest probability.
Table 3 Base
content of various regions(%)
Regions |
A |
T |
G |
C |
Before Sextama |
0.26036 |
0.29997 |
0.22703 |
0.21264 |
Sextama―Pribnow box |
0.24647 |
0.30640 |
0.23817 |
0.20896 |
Pribnow―TS |
0.26478 |
0.29260 |
0.22015 |
0.22248 |
TS―TS +10 |
0.28072 |
0.31078 |
0.18954 |
0.21895 |
Fig.5 Frequency
of different length of the region between Sextama boxes and transcriptional
initiation size
The transcriptional initiation size is
defined as a region including three bp: -1, TS, and +1. The length of 5 bp has the largest
probability.
The length between Sextama box and TS can also be analyzed by the same method.
2.1 Result of TUs
prediction
We made a program
using the formula(2), and TUs could be predicted at different threshold. The
results with multiple-genome comparison and not at different threshold were
shown in Table 4.
Table 4 Transcription units generated at different thresholds using multiple-genome comparison and not using multiple genome comparison
(Threshold) LL=LL[dist]+LL[F]+LL[S] |
FD |
BD |
Total |
0.0600 |
1343 |
1382 |
2725 |
0.0950 |
1350 |
1390 |
2740 |
0.1000 |
1350 |
1390 |
2740 |
0.1500 |
1350 |
1390 |
2740 |
0.2000 |
1352 |
1391 |
2743 |
LL=LL[dist]+LL[F] |
FD |
BD |
Total |
0.0600 |
1392 |
1443 |
2835 |
0.0950 |
1389 |
1438 |
2827 |
0.1500 |
1370 |
1421 |
2791 |
FD, the number of TUs in the forward strand; BD, the number of TUs in the backward strand.
From Table 4, it could be found that, when the threshold was set at the optimal value (at 0.0950[3]), adding the LL[S] improved the sensitivity (The total number of TUs reduced by 2827-2740=87). At the same time, when the threshold changed from 0.0950 to 0.1500, the number didn’t change and remained at 2740. If LL[S] was removed, the predicted TUs number would increase. This showed that adding the LL[S] could largely enhance the predicted accuracy.
To make clear whether the newly predicted operons after adding LL[S] were correct, we validated the 87 newly predicted pairs of genes with the datain RegulonDB to see whether they were truly predicted or not. The results were shown in Table 5.
Category of the newly predicted
results |
Forward strand |
Backward
strand |
Total |
Accuracy
percentage |
Known true
gene pairs |
20 |
29 |
49 |
49/(49+6) =91% |
Known false
gene pairs |
3 |
3 |
6 | |
Unknown gene
pairs |
16 |
16 |
32 |
--------- |
Predicted new
operons |
16 |