http://www.abbs.info e-mail:[email protected]

ISSN 0582-9879                             ACTA BIOCHIMICA et BIOPHYSICA SINICA 2003, 35(6): 580-586                             CN 31-1300/Q

 

Short Communication

Factors Affecting Codon Usage in Yersinia pestis

HOU Zhuo-Cheng, YANG Ning*

( College of Animal Science and Technology, China Agricultural University, Beijing 100094, China )

 

Abstract        The complete genome of Yersinia pestis, the causative agent of the systemic invasive infectious disease classically referred to as plague, has recently been sequenced. To gain further insight into the evolution of synonymous codon usage, the factors shaping the synonymous codon usage pattern of Yersinia pestis were analyzed in this paper. Coding sequences of 300 bp or longer were used in the codon usage analysis. Although the “G”+“C” content of the Y. pestis genome was slightly below 50% (47.64%), highly expressed genes tended to use “C” or “G” at synonymous sites compared with lowly expressed genes. Conversely, lowly expressed genes tended to prefer “A” or “T” at synonymous positions. Gene expression level was strongly correlated with the first axis of the correspondence analysis (COA) (R=0.63, P<0.0001). Comparison of the codon usage patterns of highly and lowly expressed genes confirmed that gene expression level was partially responsible for the codon usage bias. GC-skew analysis showed that codon usage was subject to replication-transcriptional selection. The codon adaptation index (CAI), the frequency of “C”+“G” at the synonymous third position of codons (GC3s) and the effective number of codons (Nc) differed among gene length groups. The “G”+“C” content of genes was strongly correlated with the first axis of the COA (R=0.72, P<0.0001). It was concluded that gene expressivity, replication-transcriptional selection, gene length and gene composition constraints were the main factors affecting codon usage variation in Y. pestis.

 

Key words     codon usage; correspondence analysis; gene expression level; coding sequence length; Yersinia pestis

 

The rapidly growing body of genome data gives us new opportunities to study genome evolution at the molecular level. It is well known that codon usage patterns are nonrandom and species-specific, and that inter-genomic variation in codon usage is a widespread phenomenon. There are also reports that different genes have different codon usage patterns within the same organism[1]. Biased codon usage might be influenced by various factors, such as translational selection[2], mutation[3], compositional constraints[4], physical location of the gene on the chromosome[5], replication-transcriptional selection[6], hydrophobicity of each gene[7], etc. In Y. pestis, analysis of the codon usage pattern is of great interest, because it is essential for studying the evolution of major codons, predicting ORFs, and designing primers for PCR.

Yersinia pestis, a Gram-negative bacterium, is the causative agent of the systemic invasive infectious disease classically referred to as plague, and has been responsible for three human pandemics: the Justinian plague, the Black Death, and modern plague. The complete genome of Y. pestis has recently been published[8]. Many genes in the Y. pestis genome seem to have been acquired from other bacteria and viruses. There is also evidence that Y. pestis has undergone large-scale genetic flux. Y. pestis provides a unique insight into the ways in which new and highly virulent pathogens evolve.

In this paper, the codon usage pattern of Y. pestis and the main factors that influence it were analysed using the whole-genome dataset. The aim of this study was to facilitate further study of codon evolution, ORF prediction, and primer design.

 

1    Materials and Methods

The complete DNA sequences of the Y. pestis genome were downloaded from the Sanger Center (ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/Bact-eria/ypestis—CO92). All coding sequences used were 300 bp or longer. The 149 pseudogenes and 3 plasmids found in the Y. pestis genome were excluded from our dataset. In total, 3444 genes (coding sequences) were analyzed in this study. The coding sequences were retrieved from the complete genome with a program developed in our lab (ftp://202.205.81.236/download/soft/applying software/CDsRead).
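The selection criteria described above can be sketched as a simple filter. This is only an illustration: the record layout (keys 'id', 'seq', 'pseudo', 'replicon') and the function name are hypothetical and are not taken from the CDsRead program.

```python
def select_coding_sequences(records, min_length=300):
    """Keep chromosomal, non-pseudogene coding sequences of at least min_length bp.

    Each record is a dict with hypothetical keys:
      'id' (str), 'seq' (nucleotide string), 'pseudo' (bool), 'replicon' (str).
    """
    selected = []
    for rec in records:
        if rec.get("pseudo", False):
            continue  # pseudogenes are excluded
        if rec.get("replicon", "chromosome") != "chromosome":
            continue  # plasmid-borne genes are excluded
        if len(rec["seq"]) < min_length:
            continue  # shorter than the 300 bp cutoff
        selected.append(rec)
    return selected


# Toy records (invented for illustration only)
toy_records = [
    {"id": "g1", "seq": "ATG" * 100, "pseudo": False, "replicon": "chromosome"},
    {"id": "g2", "seq": "ATG" * 99, "pseudo": False, "replicon": "chromosome"},
    {"id": "g3", "seq": "ATG" * 200, "pseudo": True, "replicon": "chromosome"},
    {"id": "g4", "seq": "ATG" * 200, "pseudo": False, "replicon": "plasmid"},
]
print([rec["id"] for rec in select_coding_sequences(toy_records)])  # prints ['g1']
```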

Relative synonymous codon usage (RSCU)[9], the effective number of codons (Nc)[10], the frequency of “C”+“G” at the synonymous third position of codons (GC3s) and correspondence analysis (COA) were calculated with the program CodonW 1.3 [written and provided by Dr. John Peden (Oxford University), see http://molbiol.ox.ac.uk/cu]. A3s, T3s, G3s and C3s were the frequencies of “A”, “T”, “G” and “C” at the synonymous third position of codons, respectively. The codon adaptation index (CAI)[11] was calculated using the genes encoding ribosomal proteins and elongation factors as the reference set (71 genes in total). CAI has been widely regarded as a good theoretical predictor of gene expression and has been extensively used as a measure of gene expression level[4,6,7,12]. In this study, the CAI value was used as the presumed expression level. A higher CAI value meant higher codon usage bias and a higher gene expression level[11].
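The indices named above can be illustrated with a minimal Python sketch. It is not CodonW's implementation: the function names are hypothetical, the standard genetic code is assumed, and replacing a zero reference count by a pseudo-count of 0.5 when computing CAI weights is an assumption made here for illustration. RSCU is the observed count of a codon divided by the mean count of its synonymous family, GC3s is the fraction of G or C at the third position of codons that have synonymous alternatives, and CAI is the geometric mean of the relative adaptiveness of the codons in a gene.

```python
from collections import Counter
from itertools import product
from math import exp, log

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families: amino acids with more than one codon (no Met, Trp, stops)
FAMILIES = {}
for codon, aa in CODE.items():
    FAMILIES.setdefault(aa, []).append(codon)
SYNONYMOUS = {aa: cs for aa, cs in FAMILIES.items() if aa != "*" and len(cs) > 1}


def count_codons(seq):
    """Count complete in-frame codons of a coding sequence."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def rscu(counts):
    """RSCU = observed count / mean count over the synonymous family."""
    values = {}
    for codons in SYNONYMOUS.values():
        total = sum(counts.get(c, 0) for c in codons)
        for c in codons:
            values[c] = counts.get(c, 0) * len(codons) / total if total else 0.0
    return values


def gc3s(counts):
    """Fraction of G or C at the third position of synonymous codons."""
    syn = [c for codons in SYNONYMOUS.values() for c in codons]
    total = sum(counts.get(c, 0) for c in syn)
    gc = sum(counts.get(c, 0) for c in syn if c[2] in "GC")
    return gc / total if total else 0.0


def cai_weights(reference_counts, pseudo=0.5):
    """Relative adaptiveness w = count / max count within each family.

    Assumption: a zero count in the reference set is replaced by `pseudo`.
    """
    weights = {}
    for codons in SYNONYMOUS.values():
        adjusted = {c: reference_counts.get(c, 0) or pseudo for c in codons}
        top = max(adjusted.values())
        for c in codons:
            weights[c] = adjusted[c] / top
    return weights


def cai(counts, weights):
    """Geometric mean of w over all synonymous codons of a gene."""
    n = sum(counts.get(c, 0) for c in weights)
    if n == 0:
        return 0.0
    return exp(sum(counts.get(c, 0) * log(w) for c, w in weights.items()) / n)


# Phe counts of the highly expressed genes in Table 2: TTT 598, TTC 1076
phe = rscu({"TTT": 598, "TTC": 1076})
print(round(phe["TTT"], 2), round(phe["TTC"], 2))  # prints 0.71 1.29
```

The two printed values agree with the RSCU entries listed for Phe in Table 2, which gives a quick check that the definition used here matches the one behind the table.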

 

2    Results

2.1   Genome and gene composition constraint analysis

The “G”+“C” content could be one of the most important factors in the evolution of genomic structures[13]. The genome of Y. pestis was only slightly compositionally biased, with a “G”+“C” content of 47.64%. The GC3s values of genes ranged from 16.5% to 69.8%, with a mean of 47.08% and a standard deviation of 7.6%. The Nc values of genes in Y. pestis ranged from 28.07 to 61, with a mean of 50.82 and a standard deviation of 4.47. Wright[10] suggested that a plot of Nc against GC3s could be used effectively to explain codon usage variation among genes. This method has been used to investigate the evolution of many genomes[6,14]. If the codon usage of a gene were subject to neither “G”+“C” compositional constraints nor natural selection, the Nc value of the gene would fall on the continuous Nc-plot curve. In Y. pestis, although a small number of genes lay on the Nc-plot curve, the Nc values of most genes fell below the expected curve (Fig.1), indicating that compositional constraints had some effect on the codon usage of most genes.
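The expected curve in an Nc-plot is commonly computed from GC3s with Wright's approximation, Nc = 2 + s + 29/(s^2 + (1-s)^2), where s is the GC3s value. The sketch below evaluates this curve and reports how far a gene lies from it; the function names are hypothetical, and the example gene is invented rather than taken from the Y. pestis data.

```python
def expected_nc(gc3s_value):
    """Expected effective number of codons for a given GC3s (0..1), after Wright."""
    if not 0.0 <= gc3s_value <= 1.0:
        raise ValueError("GC3s must be a fraction between 0 and 1")
    s = gc3s_value
    return 2.0 + s + 29.0 / (s * s + (1.0 - s) ** 2)


def nc_deviation(observed_nc, gc3s_value):
    """Observed minus expected Nc; negative values lie below the curve."""
    return observed_nc - expected_nc(gc3s_value)


print(expected_nc(0.5))  # prints 60.5, the maximum of the curve
# Invented example gene: Nc = 45.0 at GC3s = 0.40
print(nc_deviation(45.0, 0.40) < 0)  # prints True, i.e. below the curve
```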

 

Fig.1       Nc-plot of Y. pestis genes

The continuous curve represented the expected curve between GC3s and Nc under random codon usage.

 

2.2   The relationship among gene length, gene expression and codon bias

It has been proposed that longer genes, which are energetically more costly, have higher codon usage bias to maximize translational efficiency. Selection might also act to reduce the size of highly expressed genes, an effect that is particularly pronounced in eukaryotes[15]. In this paper, we classified genes by length into 5 groups (<500 bp, 500-999 bp, 1000-1999 bp, 2000-2999 bp, >3000 bp), and then examined the effects of gene length on gene expression level, Nc and GC3s for each group in Y. pestis.
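The grouping by gene length can be sketched as follows, with group numbers following Table 1. Two points are assumptions of this sketch rather than statements from the paper: a gene of exactly 3000 bp is placed in group 1, and the per-gene records are plain dicts with a hypothetical 'length' key.

```python
from statistics import mean, stdev


def length_group(length_bp):
    """Return the Table 1 group number (1 = longest, 5 = shortest)."""
    if length_bp >= 3000:
        return 1  # assumption: exactly 3000 bp is counted with the >3000 group
    if length_bp >= 2000:
        return 2
    if length_bp >= 1000:
        return 3
    if length_bp >= 500:
        return 4
    return 5


def summarize_by_group(genes, key):
    """Return {group: (n, mean, standard deviation)} for one per-gene statistic."""
    grouped = {}
    for gene in genes:
        grouped.setdefault(length_group(gene["length"]), []).append(gene[key])
    return {
        group: (len(vals), mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for group, vals in sorted(grouped.items())
    }


# Invented values for illustration only
toy_genes = [
    {"length": 450, "Nc": 52.0},
    {"length": 480, "Nc": 50.0},
    {"length": 1500, "Nc": 49.0},
]
print(summarize_by_group(toy_genes, "Nc"))
```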

The gene length analysis showed that shorter genes tended to have higher Nc values (Table 1). However, GC3s and CAI behaved differently (Table 1). The shortest genes (<500 bp) had lower GC3s than the longer genes (≥1000 bp). CAI varied somewhat irregularly among the groups but tended to increase with gene length overall. A higher CAI in longer genes implied stronger codon usage bias and a higher expression level. Overall, genes longer than 3000 bp had lower Nc, higher GC3s, stronger codon usage bias and a higher expression level (Table 1).

The positions of the genes along the first axis of the COA were also significantly correlated with gene length (R=0.197, P<0.0001), and highly expressed genes were longer (Table 1). Similar results were found in P. aeruginosa[12]. Longer genes were thought to impose constraints on codon usage bias[15]. Powell & Moriyama[16] argued that the selective advantage for the speed of translation of an optimal codon would depend on the length of the translated message. In a longer gene, each individual mutation from a non-optimal to an optimal codon would contribute less to reducing the total time needed to translate the whole protein[17]. In Drosophila, longer coding regions had both lower codon bias and higher synonymous substitution rates, and were affected less efficiently by selection[18]. Although the underlying reason is not yet clearly understood, gene length played an important role in shaping codon usage bias in Y. pestis.

 

Table 1   Comparison of Nc, GC3s and CAI among different length groups

Group   Gene length (bp)   Number of observations   Nc (mean±SD)     GC3s (mean±SD)   CAI (mean±SD)
1       >3000              60                       49.614±3.486b    0.533±0.079a     0.544±0.054b
2       2000-2999          193                      49.854±3.726b    0.513±0.070b     0.558±0.060a
3       1000-1999          1201                     50.574±3.783ab   0.506±0.067b     0.552±0.061ab
4       500-999            1424                     51.335±4.246a    0.484±0.072c     0.540±0.060b
5       <500               566                      50.463±6.388ab   0.456±0.084d     0.539±0.082b

*Within a column, two groups differ significantly (P<0.05) if they share no superscript letter (a, b, c, or d).

 

2.3   Correspondence analysis (COA)

Correspondence analysis has been widely used to investigate codon usage patterns in different species[6,7,12,14]. In the present study, we applied COA to the RSCU values of each gene to minimize the effects of amino acid composition. The first axis generated by the analysis accounted for 12.99% of the total variability, the second axis for 8.67% and the third axis for 5.56%. As the first axis explained only part of the variation in codon usage among the genes of this bacterial genome, it was postulated that several major factors shape codon usage in Y. pestis genes. The first axis of the COA explained far less variation than in other genomes studied previously[6,12], which suggests that the major trend in codon selection is not as strong in Y. pestis as in other species.

The correlation coefficients between the positions of genes along the first major axis and GC3s, C3s, A3s and CAI were calculated (Fig.2). The first axis of the COA is positively correlated with GC3s (R=0.66, P<0.0001) and C3s (R=0.58, P<0.0001), but negatively correlated with A3s (R=-0.70, P<0.0001). Furthermore, there is a high correlation between the position on the first axis of the COA and the gene expression level (CAI) (R=0.63, P<0.0001). To quantify the difference in codon usage between highly and lowly expressed genes, a χ2 test was applied to compare the genes with the highest CAI values (151 genes) and the genes with the lowest CAI values (151 genes) (Table 2).
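The correlation coefficients reported here are of the Pearson type as commonly used in such analyses; which statistical package produced them is not stated. A minimal sketch of the calculation is given below, with a hypothetical function name and invented toy values standing in for axis-1 coordinates and per-gene indices.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Invented toy values: axis-1 coordinates and A3s of five genes
axis1 = [-0.30, -0.10, 0.00, 0.15, 0.40]
a3s = [0.42, 0.35, 0.33, 0.28, 0.20]
print(pearson_r(axis1, a3s) < 0)  # prints True: A3s falls as the coordinate rises
```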

The presumed highly expressed sequences tend to use C3s-rich and A3s-poor codons compared with lowly expressed genes. An increase in C3s is accompanied by a decrease in A3s, and the frequencies of these two bases at third codon positions are negatively correlated (R=-0.45, P<0.001). We postulate that the first axis of the COA is determined by the gene expression level and that the second axis might reflect the effects of the lowly expressed genes. Although the G+C content of the genome is 47.64%, the highly expressed genes tend to end in C or G at synonymous positions compared with lowly expressed genes. The CAI value and GC3s are also significantly correlated (R=0.18, P<0.001). These results support the view that highly expressed genes tend to use C or G at synonymous positions compared with lowly expressed genes, and confirm that gene expression level affects codon usage. Codon usage in the highly expressed genes has been subject to selective pressure during translation. In contrast, most of the lowly expressed genes end in A or T at synonymous positions, which implies that codon usage in lowly expressed genes has not been subject to selective pressure as severe as that in highly expressed genes.

The highly expressed genes displayed a pattern of codon usage that differed from that of the lowly expressed genes in Y. pestis. In total, 23 codons were used more frequently in highly expressed genes, while another 30 codons were used more frequently in lowly expressed genes (Table 2). There was a significant increase in “C” or “G” at synonymous sites in highly expressed genes compared with lowly expressed genes, and only two exceptional codons ended in “T” [Arg (CGT), Gly (GGT)]. The lowly expressed genes used “A” or “T” at the terminal position of codons more frequently than highly expressed genes, except for two codons ending in “C” [Leu (CTC), Pro (CCC)] and two codons ending in “G” [Arg (AGG), Gly (GGG)]. These results confirmed a bias towards “C” or “G” at synonymous positions in the highly expressed genes and towards “A” or “T” in the lowly expressed genes. In two-codon (duet) families, the C-ending codon was preferred in pyrimidine-ending families and the G-ending codon was preferred in purine-ending families [with the exception of Glu (GAG)]. In highly expressed genes, the termination codons (Ter) TAA (134/151), TAG (9/151) and TGA (7/151) were used at different frequencies. In lowly expressed genes, TAA (74/151), TAG (27/151) and TGA (50/151) were also used at different frequencies. In both groups TAA was the most frequent stop codon, but stop codon usage was more strongly biased in highly expressed genes than in lowly expressed genes. TAA was also the most frequent stop codon in highly expressed genes of B. burgdorferi[19], C. trachomatis[7] and E. histolytica[17]. More genomes need to be investigated before it can be inferred that TAA is generally the most frequent stop codon in highly expressed genes.
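How the χ2 test was set up for each codon is not described in the text. One plausible reading, sketched below as an assumption, is a 2×2 contingency test per codon that compares the counts of that codon and of its synonymous alternatives in the two gene sets. The example uses the Phe counts from Table 2 (TTT 598 and TTC 1076 in highly expressed genes; TTT 1026 and TTC 454 in lowly expressed genes); the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction).

    Table layout:
                    codon   other synonymous codons
      highly expr.    a       b
      lowly expr.     c       d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Phe in Table 2: TTT versus TTC in the highly and lowly expressed gene sets
stat = chi_square_2x2(598, 1076, 1026, 454)
# 6.635 is the critical value of chi-square with 1 degree of freedom at P = 0.01
print(stat > 6.635)  # prints True
```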

 

Fig.2       The correlations between the first axis of the COA and GC3s, A3s, C3s, CAI

 

Table 2   Codon usage in highly and lowly expressed genes in Y. pestis

a.a.  Codon   Na    RSCUa  Nb    RSCUb   a.a.  Codon   Na    RSCUa  Nb    RSCUb
Phe   TTT##   598   0.71   1026  1.39    Ser   TCT**   860   2.06   415   0.96
      TTC**   1076  1.29   454   0.61          TCC**   384   0.92   230   0.53
Leu   TTA##   333   0.51   1223  1.69          TCA##   372   0.89   626   1.44
      TTG*    862   1.33   884   1.22          TCG##   140   0.34   298   0.69
      CTT##   199   0.31   550   0.76    Pro   CCT##   417   0.90   405   1.21
      CTC##   113   0.17   452   0.62          CCC##   84    0.18   283   0.84
      CTA##   138   0.21   439   0.60          CCA**   810   1.75   371   1.11
      CTG**   2245  3.46   806   1.11          CCG**   541   1.17   282   0.84
Ile   ATT#    1261  1.26   1268  1.35    Thr   ACT**   859   1.28   380   0.86
      ATC**   1700  1.70   661   0.70          ACC**   1183  1.77   450   1.02
      ATA##   45    0.04   889   0.95          ACA##   300   0.45   506   1.15
Met   ATG     1315  1.00   952   1.00          ACG##   339   0.51   424   0.96
Val   GTT**   1695  1.78   694   1.23    Ala   GCT**   1600  1.32   620   0.94
      GTC##   524   0.55   496   0.88          GCC##   928   0.76   687   1.05
      GTA#    667   0.70   446   0.79          GCA     1273  1.05   732   1.12
      GTG##   931   0.98   621   1.10          GCG     1058  0.87   587   0.89
Tyr   TAT##   679   0.99   851   1.45    Cys   TGT     219   1.22   225   1.11
      TAC**   695   1.01   320   0.55          TGC     139   0.78   181   0.89
TER   TAA     134   1.00   74    1.00    TER   TGA     7     1.00   50    1.00

TER