Categories
Articles

02368

https://www.abbs.info/ e-mail:[email protected]

ISSN 0582-9879                          
      
        ACTA BIOCHIMICA et
BIOPHYSICA SINICA 2003, 35(4):317-324             
                
     CN 31-1300/Q

 

Prediction of Prokaryotic Promoters Based
on Prediction of Transcriptional Units

LIN Jian-Cheng, XU Jin-Lin#,
LUO Jian-Hua#, LI Yi-Xue1

*

(
Biotechnology Department, School of Life Science and Technology, Shanghai
Jiaotong University, Shanghai 200240, China;
1 Bioinformatics
Center of Shanghai Institutes for Biological Sciences, the Chinese Academy of
Sciences, Shanghai 200031, China
)

Abstract     
Identification of promoters is very important
in understanding gene regulating relationships in an organism, and computational
identification of promoters has been a long standing problem in computational
biology. A new method was presented to predict promoter regions in prokaryotic
organism. The method predicted transcription unit (TU) first and the TU was
divided into singlet that contains only one single gene in a TU, and operon that
contains more than one gene. Based on these predicted TUs, promoter was
predicted for each TU using hidden Markov model including explicit state
duration density. Both predicted TUs and promoters were satisfying.

Key
words
   promoter; TU; operon; prediction;
HMM with state duration

Large-scale sequencing
has brought us piles of rough data remaining to be annotated. The laboratory
confir*mation is time-consuming and costly. Alternately, computational solutions
to this problem have been developed very well and have achieved a lot. However,
most of the work was mainly focused on the coding regions. In fact, promoter
region is very important to a gene, because most of the regulatory elements
involved in gene expression are located in or around this region. In E.coli,
the σ factor is a functional component that recognizes promoter sequences.
Thus, we always define different promoters by different σ factors recognizing
them, such as the σ70 promoter is the promoter recognized by σ70 factor. One
organism contains different σ factors, each recognizes different subset of
promoters and thus regulates different gene expression under specific conditions.
For example,
σ32 promoters control the genes that are critical for the bacterium
to survive prolonged exposure to higher-than-normal temperature[1].
Therefore, identification of the promoter region of a certain TU is very important
for understanding the regulating relationships of the genes.

Prior to promoter
prediction, we must realize that all the genes are transcribed as a TU, and if a
TU is an operon which contains more than one gene, all the genes in this operon
must share a same promoter under certain condition and are transcribed
together[2]. Since a promoter corresponds to a TU but not a gene, we
should predict promoter for each TU but not for each single gene as done before
by others. And since the promoter region lies in the region before the coding
region of a TU, that is, it lies before the coding region of the first gene in
the TU, if the TU is an operon, what we need to care about is only the upstream
region before the first gene in the operon. Thus, it is valuable to predict TU
before promoter prediction. TU prediction could base on distance and functional
class[3], or base on multiple-genome comparison[4]. Both
methods have their advantages, the former can be efficiently used in the
well-studied genomes, while the later is a good method to those less studied
genomes.

TU prediction
offered a good source for further promoter prediction. Considering only the
limited region that a promoter resides in, we can make promoter prediction more
accurately and efficiently. Moreover, because most of the prokaryotic promoter
sequences lie inside the upstream 200 bp before the translation initiation size
(TI), the upstream 200 bp before the TI of all the first genes in each TU was
cut first, and then, their promoter sequences were predicted using hidden Markov
model (HMM) with duration. HMM was first interpreted by Rabiner[5],
and further applied to TU prediction[6], gene finding[7],
and the related algorithm of ECOPARSE was also developed[8]. In
promoter prediction, although HMM had been employed by Pedersen et
al
.[9], the HMM with explicit state duration density has not been
employed yet. We calculated all the parameters carefully from the documented
known promoters and used these parameters to predict unknown promoters. In the
end, this method was used to analyze the whole genome of E. coli and the newly
sequenced Leptospira interrogans genome.

In present study,
TU prediction based on distance, function class and multiple-genome comparison
was applied first. The HMM with state duration density was further applied to
promoter prediction. The results were also used to analyze the genome of
Leptospira interrogans genome.

1   
Materials and Methods

1.1  Materials

1.1.1      
Materials used in TUs predictionRegulonDB
(http://kinich.cifn.unam.mx:8850/db/regulondb
intro.frameset)[10], a database with all the documented
proved TUs, promoters and terminators, as well as some predictions based on
direction, distance and functional class log-likelihoods. Our newly predicted
results were validated in this database.

The collected 388
documented known TUs were kindly provided by professor Kenta Nakai from Tokyo
University. The functional classes were available at “E. coli Genome and
Proteome Database” GenProtEC (http://genprotec.mbl.edu). The multiple genome
comparing result was provided by TIGR
(http://www.tigr.org/tigr-scripts/operons/operons.cgi).

E. coli
K12 genome sequences were downloaded from
NCBI.

1.1.2      
Materials used in promoter    predictionPromEC
(http://bioinfo.md.huji.ac.il/marg/promec/) is an updated compilation of E.
coli
mRNA promoter sequences. It includes documentation on the location of
experimentally identified mRNA transcriptional start sites on the E.coli
chromosome, as well as the actual sequences in the promoter region. 469
documented E.coli promoters were downloaded from PromEC, then we divided them
into several categories by the transcriptional factors as shown in Table
1.

Table 1   The number of different promoters
defined by different σ factors recognizing them

σ Factors

Promoter
number

σ70

375

σ38

3

σ32

10

σ54

5

σ19

1

Unknown

60

σ70 and σ38

14

All
of the promoters were from PromEC and validated in RegulonDB.

1.2  Methods

1.2.1       Methods
for TUs prediction  
Firstly, as done by Salgado et al.[3], we divided
all the genes in the E. coli K12 into forward and backward strand
directons(the groups including every gene transcribes in the same direction with
no intervening gene transcribed in the opposite one). The procedure yielded 638
forward directons and 636 backward directons. Secondly, we compared the
collection of known TUs with the collection of directons to find those directons
containing TUs with added genes at either side. These added genes were used to
construct a data set of pairs of adjacent genes at borders between TUs, which
constituted a contrasting data set against the collection of adjacent genes in
operons. These operations resulted in a set of 577 pairs of adjacent genes in
operons, a set of 265 pairs at the borders of TUs, and a set of 3004 total pairs
of adjacent genes transcribed in the same direction.

Method by Salgado
et al. was based on distance and functional class. Different distance
between genes was assigned a different log-likelihood. The distance between
pairs within operons and pairs at TU borders, as described by Salgado was
shown in Fig.1. There were two obvious peaks at the distance of
4 and 1 (there was also a low peak at 8) for the genes within operons. There are two reasons: (1) there is
no need for space between genes inside operons; (2)minimal spacing between
genes can protect mRNA from degradation by association with ribosomes. But
these answers are not enough. Here we supposed that some genes are co-expressed
in both time and quantity, so that the distance between them should be short
enough to ensure co-transcription. On the other hand, to combine with 16 S
RNA in the translation process, the distance should be consistent with the
16 S RNA; this may explain why the distance should be
8, 4, and 1, but not 3, 0, etc. The log-likelihoods for different distances were calculated
according to the formula given by Salgado et al.[3] [Function(1)]. 
In which,  Nop and Nnop are pairs of genes in operons
and at transcriptional boundaries, respectively, at a distance [dist] (in
10 bp intervals),  whereas TNop
and TNnop are the total number of pairs of genes in operons and at the transcription
unit boundaries, respectively. While the log-likelihood LL[F]  for genes in the same functional class
was  0.7192, to different functional
class was
0.616[3].

Fig.1     The
distance frequency of gene pairs within operons and at TU borders

To the gene pairs within operons, there are
two obvious peaks at
4 and 1. But to the gene pairs at TU borders the peaks are not obvious
although we can also find frequency increase at these two points and at
8 (From Salgado et al.[3]). 

Since the
log-likelihood of distance and functional class was based on the known
knowledge, and therefore was fit for well-studied organism. Ermolaeva et
al
.[4] developed a method based on the knowledge that many sets
of gene were in the same order through multiple genomes and these conserved gene
groups represented candidate operons. They developed an algorithm to find gene
pairs in the same operon with highly likely (probability>=98%), but low
sensitivity (30%
50%). Low sensitivity limited badly its usage in prediction of TUs
and operons.

Combining these
two methods, we could add an additional log-likelihood of multiple genome
comparing result to distance and functional class log-likelihood. According to
Ermolaeva et al.[4], in the 263 SN(on the same strand but
belong to different operons), there were only 24 false positives, and the
accuracy was 98%. We could calculate the log-likelihood of the two genes in a
same operon as done by Salgado et al.[3],  LL[S]=log[0.98/(24/263)]=1.031. There
were totally 513 pairs of genes predicted by this method. Thus, we could make
our prediction based on distance, functional class, and multiple-genome
comparison by Function (2).  In this
function, the three statistical elements belonged to different statistical
spaces.  If LL>threshold, we
could predict that the gene pair was in the same TU, otherwise, it would belong
to different TU.

1.2.2       Methods
for promoter prediction

(1) Analysis of the promoter
region
      
From Table 1, it could be found that about 80% of the promoters were
recognized by σ70 factor, therefore σ70 promoters were the most important
transcriptional factors in E. coli; while the other promoters were only
used in some special situations, such as the σ32 promoters were only used in
heat shock response.

Cutting the
sequence from upstream
75 bp to downstream +25 bp region of TS (transcriptional initiation
site) of the promoters, we could make a statistical analysis of the documented
promoters. First, we calculated the
σ70 promoters and then the Un-σ70 promoters (the promoters other than σ70 promoters) and the Un-σ70 and the unknown promoters (keep unknown) were also calculated
together in the end. These results were shown in Fig. 2.


Fig.2       Statistical
result of the promoters

Distance
of 1 in the figure is the
75 bp upstream of the transcriptional initiation site, hence,
distance of 76 is the TS site. (A) is the 375 known
σ70 promoters, (B) is the 19 known un-σ70 promoters, (C) is the 79 unknown and Un-σ70 promoters.

From Fig.2, we
could draw the following conclusions. (
) Fig.2(A) and Fig.2(C) are similar at  distances of 41,  6570 and 76, with  a
T peak at 41, TA peaks at 6570 and  a CAT peak at 76, corresponding to the Sextama box, Pribnow box, and TS in
the σ 70 promoters respectively. Fig.2(B) was not obvious at these sites due to
the lack of data. From the similarity between Fig.2(A) and Fig.2(C), it was
shown that all prokaryotic σ promoters may contain similar functional boxes as
Sextama and Pribnow box. This phenomenon may be explained in this way that most
σ factors have similar structure and sequence, and thus recognize
similar DNA sequences at similar sites. (II) There was a high
A peak which had not been identified and reported before at about a
distance of 25 (corresponding to
51 upstream TS) which was much more obvious in un-σ70 promoters. Probably this was another functional box other than
Sextama box, Pribnow box, and TS.

(2) HMM algorithm, architecture and design             
(I) Typical hidden Markov models (HMM)   An HMM is a discrete-time Markov
model with some additional features. The main addition is that when a state
is visited by the Markov chain, the state
emits a letter from a fixed time-dependent alphabet. Letters are emitted
via a time independent, but usually state-dependent, probability distribution
over the alphabet. When the HMM runs, there is, first, a sequence of states
visited which we denote by q1, q2, q3,
, and second, a sequence of emitted symbols denoted by o1, o2, o3,
. This is a two-step process: initial(q1)emission(o1)transition(to q2)emission(o2)transition(to q3)emission(to o3)→……. Compared with Markov chain, an HMM is more general and flexible,
allowing us to model phenomena that we can
t model sufficiently well with a regular Markov chain model. One of
the standard HMM architectures for molecular biology applications, first introduced
by Krogh et al.[8], is the  left-right architecture. Three algorithms
are used in HMM: Forward-Backward procedure, Viterbi algorithm and Baum-Welch
method. The typical HMMs have already been used in multiple sequence alignments.
In Pfam, it was used to determine protein domains for a query protein sequence.

(II) Model including explicit state duration
density in HMMs and our architecture   
In a conventional HMM,
the major weakness is the modeling of state duration. Suppose that
p is the probability of transition from any state to itself, the probability
that the process stays in this state for
n steps will be pn1(1p), so that the time the process stays in that state will follow
a geometric distribution. But sometimes we must use other distributions for
this time, and only HMM with duration fits this situation.

In an HMM with
explicit state duration, all transition probabilities from a state to itself are
zero. When the process visits a state, it produces not a single symbol but an
entire sequence from the alphabet. The length of the sequence can follow any
distribution, and the model generating the sequence of that length can be any
distribution.

A parse
is a sequence q1, q2, …,
qr and a sequence of lengths d1, d2,
…, dr. Given an observed sequence S=(s1,
s2, …, sL), from an HMM with duration, the
Viterbi algorithm finds an optimal parse
opt such that Prob(opt|S)Prob(|S) for all parses . That is to find the hidden state trajectory T that makes the
conditional probability P(T|S) maximal. Here,
T={(q1d1)(q2d2)…(qrdr)},
dr=L. The algorithm is similar to that
used in GeneMark.hmm[7].

Given a DNA
sequence designated as S={s1, s2, …,
sL} with length L, where si  stands for the nucleotide A, T, G or C,
the functional role of each nucleotide could be indicated by a “functional”
sequence T={t1, t2, …, tL}. Here when
ti takes value ‘1’, it indicates that the nucleotide
si is a part of random region (containing no promoter
sequence); ‘2’, si is a part of Sextama box; ‘3’, a part of
the region between Sextama box and Pribnow boxes (
10 region), and so on (Fig.3). Basically, the Sextama box and Pribnow
box were derived from the analysis of σ70 promoters, they were actually
functional box recognized by σ70. Since most prokaryotic promoters contain the
similar functional boxes as Sextama and Pribnow box, this architecture of σ70
promoters can be extended to all prokaryotic promoters.

 

Fig.3       The
architecture of HMM with duration to identify prokaryotic promoters

R, Sequence region; TS, transcriptional
initiation site. State ‘1’ is the non-promoter region before a promoter, and ‘2’
is the Sextama box of a promoter, ‘3’ is the region between Sextama and Pribnow
boxes, ‘4’ is the Pribnow box, ‘5’ is the region between Pribnow box and TS, ‘6’
is the TS and the random sequence after TS.

To find the
optimal trajectory, we used an extended of Viterbi algorithm. Function (3) was
used to calculate the maximal probability of all parses.

(3)Calculation of the HMM
parameters
             
From the documented E. coli promoters in PromEC, the observation
probability of Pribnow box, Sextama box, and the transcriptional initiation site
were calculated, and the result was shown in Table 2.

Table 2  
Observation probability of Pribnow box, Sextama box and TS

(A)
Pribnow box

P1

P2

P3

P4

P5

P6

A

0.0475

0.8881

0.2170

0.5898

0.5220

0.0305

T

0.7932

0.0407

0.5051

0.1627

0.1288

0.9356

G

0.0746

0.0407

0.1322

0.1254

0.1254

0.0170

C

0.0847

0.0305

0.1458

0.1220

0.2237

0.0169

(B)
Sextama box

P1

P2

P3

P4

P5

P6

A

0.0441

0.0373

0.0881

0.5729

0.1898

0.4712

T

0.8373

0.8508

0.1458

0.1695

0.1898

0.2000

G

0.0576

0.0644

0.6339

0.0746

0.1017

0.1797

C

0.0610

0.0475

0.1322

0.1831

0.5186

0.1492

(C)
TS site

P1

P2

P3

A

0.1732

0.4673

0.2320

T

0.2810

0.1307

0.3595

G

0.1405

0.2908

0.1536

C

0.4052

0.1111

0.2549

It is supposed that the Sextama and the
Pribnow boxes both have the length of 6 bp, the TS 3 bp. If the length is out of
the limitation, all of its observation is set at 0.25 for each ‘A’, ‘T’, ‘G’, or
‘C’. P1 is the first nucleotide in a Sextama and Pribnow boxes or the TS,
and P2, is the second, and so on.

The region between
Pribnow and Sextama boxes can vary between 13 bp and 21 bp. When the region has
a length of 17 bp, the promoter will have the highest transcriptional
efficiency, which has been proved by experiments. Accordingly, the longest
region we identified is 17 bp (Fig.4).

Fig.4       Frequency
of different length of the region between Pribnow and Sextama boxes

After
calculation of the region between Pribnow and Sextama boxes of the documented
σ70 promoters, the length of 17 bp has the largest probability.

Table 3   Base
content of various regions(%)

Regions

A

T

G

C

Before Sextama

0.26036

0.29997

0.22703

0.21264

Sextama�DPribnow box

0.24647

0.30640

0.23817

0.20896

Pribnow�DTS

0.26478

0.29260

0.22015

0.22248

TS�DTS +10

0.28072

0.31078

0.18954

0.21895

Fig.5       Frequency
of different length of the region between Sextama boxes and transcriptional
initiation size

The transcriptional initiation size is
defined as a region including three bp:
1, TS, and +1. The length of 5 bp has the largest
probability.

The length between
Sextama box and TS can also be analyzed by the same method.

2   
Results

2.1  Result of TUs
prediction

We made a program
using the formula(2), and TUs could be predicted at different threshold. The
results with multiple-genome comparison and not at different threshold were
shown in Table 4.

Table 4   Transcription units generated at
different thresholds using multiple-genome comparison and not using multiple
genome comparison

(Threshold)

LL=LL[dist]+LL[F]+LL[S]

FD

BD

Total

0.0600

1343

1382

2725

0.0950

1350

1390

2740

0.1000

1350

1390

2740

0.1500

1350

1390

2740

0.2000

1352

1391

2743

LL=LL[dist]+LL[F]

 FD

BD

Total

0.0600

1392

1443

2835

0.0950

1389

1438

2827

0.1500

1370

1421

2791

FD,
the number of TUs in the forward strand; BD, the number of TUs in the backward
strand.

From Table 4, it
could be found that, when the threshold was set at the optimal value (at
0.0950[3]), adding the LL[S] improved the sensitivity (The total
number of TUs reduced by 2827
2740=87). At the same time, when the threshold changed from 0.0950 to
0.1500, the number didn
t change and  remained at
2740. If LL[S] was removed, the predicted TUs number would increase. This showed
that adding the LL[S] could largely enhance the predicted accuracy.

To make clear
whether the newly predicted operons after adding LL[S] were correct, we
validated the 87 newly predicted pairs of genes with the datain RegulonDB to see
whether they were truly predicted or not. The results were shown in Table
5.

Table 5  
Analysis of the newly predicted gene pairs

Category of the newly predicted
results

Forward strand

Backward
strand

Total

Accuracy
percentage

Known true
gene pairs

20

29

49

49/(49+6)

=91%

Known false
gene pairs

3

3

6

Unknown gene
pairs

16

16

32

———

Predicted new
operons

16

26

42

42/(42+6)

=87.5%

Predicted
partial operons

4

3

7

If a newly predicted gene pair was
documented truly in RegulonDB, it was assigned “Known true gene pairs”; while
documented falsely, assigned “Known false gene pairs”; otherwise, it might have
not been documented in RegulonDB, and was assigned “Unknown gene pairs”.
“Predicted new operons (Known)” meant that a whole operon was predicted, while
the “Predicted partial operons” meant that the predicted operon still lacked
some of its member genes.

The statistical
data in Table 5 showed that, most (91%) of the newly predicted gene pairs were
true compared with the data in RegulonDB. According to the calculation of
Salgado et al.[3], at the threshold of 0.0950, 73.13% of the
known TUs and 61.18% of the known operons could be correctly predicted without
the multiple-genome comparison. Adding the newly predicted gene pairs and
operons, the accuracy rate increased to 86.70%, and 78.90% respectively. All
these results showed that, combining distance, functional class and
multiple-genome comparison, our method could largely improve the specificity and
sensitivity. The results could be used in the promoter prediction
later.

2.2  Result of promoter
prediction

According to the
parameters set and architecture designed in method above, a program was
developed using HMM with duration to calculate the probability value and predict
where the promoter region is.

Firstly, we
analyzed the 375 σ70 promoters and 200 random sequences to find what probability
value was more suitable for the identification of promoters. The result was
shown in Figure 6. It was found that the new program can identify the promoter
sequences and random sequences very well. When the log probability value was
93, 75.80% of the σ70 promoters could be identified with the largest false positives of
22.39%.

To verify whether
the new program could  efficiently
predict the un-
σ70 promoters or not, all the known un-σ70 promoters and random sequences were also run by this program. As
shown in Fig.7, at the threshold of
93, about 72% of the un-σ70 sequences could be identified. To acquire the optimal probability
value for this program, we compared the sensitivity and accuracy of different
value from
92 to 94. It was found that, when the log probability is set at  92.6, the sensitivity was 70.48% with the largest false positives of
17.91%, and the accuracy was at least 82.09%. It was possible that this new
program could serve well for computer prediction.

 

Fig.6       Run by our
program, the log probability value of known σ70 promoter sequences and random
sequences

Obviously,
the promoter sequences had higher value than the random sequences. The promoter
sequences had a peak around
90, while the peak for the random sequences was around 97. The values around 93 could separate the two kinds of sequences very well.

 

Fig.7       Run by our
program, the log probability value of known un-
σ70 promoter sequences and random sequences

The result was similar to the σ70 promoters. It was shown that our
program running result was consistent with the analysis result of various
promoters shown in Fig.3.

3   
Discussion

The prediction of
the transcription unit (TU) of genomes is an important clue in the inference of
functional relationships of genes, the interpretation and evaluation of
transcriptome experiments, and the overall inference of the regulatory networks
governing the expression of genes in response to environment[11].
Several methods have been developed, among them the method based on inter-genic
distance distributions is the most efficient one. Recently, it is found that
similar inter-genic distance distributions are shared by all prokaryotic
genomes. Therefore, this method can be employed in the TU prediction of
Leptospira interrogans. Combining this method with the multiple-genome
comparison method attributes greatly to the accuracy and specificity of TU
prediction. However, some additional messages should be added to TU prediction,
such as the terminator and the documented promoters.

Upon analysis of
the promoter data, it was found that different kinds of promoters might contain
the similar functional boxes as Sextama and Pribnow boxes in σ70 promoters. This
seemed to be inconsistent with the fact that different E. coli σ factors
recognize promoters with different consensus sequences. There are two possible
reasons. First, different σ factors recognize different consensus sequences, but
have the similar DNA content at the known box, such the high “T” at Sextama box.
Second, the distance of the different box to TS site is also similar. Further
prediction of un-
σ70 promoters with satisfying result showed that our assumptions might
be rational.

In the promoter
prediction, it was found that at the threshold of
73, sometimes more than one promoter could be identified. These
predicted promoters might contain some false positives, but at some time, it
might indicate that this TU was promoted by more than one promoters at certain
circumstances.

Could this program
identify the promoters at proper sites? We used 375 sequences of 100 bp long,
which were between
75 bp upstream and +25 bp downstream the TS of the 375 known
promoters for our program to identify the promoter site. When the log value of
probability was highest, this predicted promoter sites were thought to be the
most probable ones. Hence, 238 of the most probable promoter sites were
coincided with the documented results. Considering that some promoters are
closely juxtaposed and we only found the “best” results, there must be some
hided promoters. Such as ftsZP4 and ftsZP4; gapAP4, gapAP3, and gapAP2; amnP,
amnP3, amnP4, amnP5, amnP6, are all closely juxtaposed, and only one promoter
site was identified, although actually there should be 2, 3 and 4 sites to be
identified. In the above prediction, totally there were 54 hided promoters, so
we could calculate the accuracy to be 74.14%.

Can TU prediction
actually improve the accuracy of the promoter prediction? The answer is “yes”,
because some regions inside a TU sometimes represent sequence similarity as a
promoter, and it will be predicted as a promoter without the prediction of TU.
One of the examples was the transcription unit
entCEBA-ybdB, it had only one promoter of entC (Promoter ID:
ECK120009358)[11]. Using the predicted result of TU
entCEBA-ybdB, we could successfully predict promoter entC. While lack of the
predicted TU, we predicted another two false positives for gene “entB” and
“entA”.

In the end, our
program could be used when the log value of probability was set at
92.6 to identify the promoter regions of each TU. Although the
program could be used to
slide through the forward and backward strands of the whole genome, TUs
prediction result would make it timesaving and improve the accuracy. Firstly,
the upstream 200 bp sequence of the first gene in each TUs was cut, then the
program was run to analyze the sequence and make promoter prediction. Predicted
2740 TUs are all analyzed, and among them about 78% have more than one result.
These results could be used for further research on the regulation of gene
expression and promoter analysis. Especially, to a researcher interested in
studying regulatory mechanisms of prokaryotic organisms, these results would be
very useful, and could serve as guidance for experimental confirmation, our new
method for promoter prediction can also help greatly in the further research of
prokaryotic genome.

References

1     Glansdorff N. On the
origin of operons and their possible role in evolution toward thermophily. J Mol
Evol, 1999, 49: 432
438

2     Craven M, Page D,
Shavlik J, Bockhorst J, Glasner JA. Probabilistic learning approach to
whole-genome operon prediction. Proc Int Conf Intell Syst Mol Biol, 2000, 8:
116
127

3     Salgado H,
Moreno-Hagelsieb G, Smith TF, Collado-Vides J. Operons in Escherichia coli:
Genomic analyses and predictions. Proc. Natl Aca Sci USA, 2000, 97:
6652
6657

4     Ermolaeva MD, White O,
Salzberg SL. Prediction of operons in microbial genomes. Nucleic Acids Res.
2001, 29: 1217
1221

5     Rabiner LR. A tutorial
on hidden Markov models and selected applications in speech recognition.
Proceeding of the IEEE, 1989, 77: 257
285

6     Yada T, Nakao M, Totoki
Y, Nakai K. Modeling and predicting transcriptional units of Escherichia coli
genes using hidden Markov models. Bioinformatics, 1999, 15: 987
993

7     Lukashin AV, Borodovsky
M. Gene mark.hmm: New solutions for gene finding. Nucleic Acids Res, 1998, 26:
1107
1115

8     Krogh A, Mian IS,
Haussler D. A hidden Markov model that finds genes in E.coli DNA. Nucleic Acids
Res. 1994, 22(22): 4768
4778

9     Pedersen AG, Baldi P,
Brunak S, Chauvin Y. Characterization of prokaryotic and eukaryotic promoters
using hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 1996, 4:
182
191

10    Salgado H, Santos-Zavaleta A,
Gama-Castro S, Millan-Zarate D, Diaz-Peredo E, Sanchez-Solano F, Perez-Rueda E
et al. RegulonDB (version 3.2): Transcriptional regulation and operon
organization in Escherichia coli K-12. Nucleic Acids Res, 2001, 29(1):
72
74

11    Brickman TJ, Ozenberger BA,
McIntosh MA. Regulation of divergent transcription from the iron-responsive
fepB-entC promoter-operator regions in Escherichia coli. J Mol Biol, 1990,
212(4): 669
682