Research Paper
|
|
|||
Acta Biochim Biophys Sin
2005,37: 759-766 |
||||
doi:10.1111/j.1745-7270.2005.00110.x |
Fast Fourier Transform-based
Support Vector Machine for Prediction of G-protein Coupled Receptor Subfamilies
Yan-Zhi GUO, Meng-Long LI*, Ke-Long
WANG, Zhi-Ning WEN, Min-Chun LU, Li-Xia LIU, and Lin JIANG
College
of Chemistry, Sichuan University, Chengdu 610064, China
Received: April 11,
2005
Accepted: August
24, 2005
*Corresponding
author: Tel, 86-28-89005151; Fax, 86-28-85412356; E-mail, [email protected]
Abstract Although the sequence
information on G-protein coupled receptors (GPCRs) continues to grow, many GPCRs
remain orphaned (i.e. ligand specificity unknown) or poorly characterized with
little structural information available, so an automated and reliable method is
badly needed to facilitate the identification of novel receptors. In this
study, a method of fast Fourier transform-based support vector machine has been
developed for predicting GPCR subfamilies according to protein's
hydrophobicity. In classifying Class B, C, D and F subfamilies, the method
achieved an overall Matthew's correlation coefficient and accuracy of 0.95 and
93.3%, respectively, when evaluated using the jackknife test. The method
achieved an accuracy of 100% on the Class B independent dataset. The results
show that this method can classify GPCR subfamilies as well as their functional
classification with high accuracy. A web server implementing the prediction is
available at http://chem.scu.edu.cn/blast/Pred-GPCR.
Key words G-protein coupled
receptor; subfamily; fast Fourier transform; support vector machine; prediction
G-protein coupled receptors
(GPCRs) constitute a superfamily of cell surface receptor proteins
characterized by seven transmembrane segments. The N-terminus is always
located extracellularly and the C-terminus extends into the cytoplasm, which
makes these proteins capable of transducing signals into the cell by the
heterotrimeric G-protein [1]. GPCRs play a key role in cellular signaling
networks that regulate various basic physiological processes, such as
neurotransmission, cell metabolism, secretion, cell differentiation and growth,
inflammatory and immune responses, smell, taste and vision [2]. More than 50%
of drugs now available on the market act through GPCRs [3]. Although there have
been methods developed to build the structural models of GPCRs [4,5], the
structure of only one GPCR, bovine rhodopsin, has been solved experimentally.
The identification of novel
GPCRs will greatly facilitate the target validation process and automatically
provide a possible compound-screening assay [6]. In the past, many strategies
have been used to identify novel GPCRs. The simplest and most frequently used
method is to search a sequence database using sequence alignment tools, such as
BLAST and FASTA [7-9]. Several pattern databases
(e.g. PRINTS) have been built [10,11]. However, they are not always successful
when the query proteins have no significant sequence similarity to the
sequences in the database. The Pfam classifier based on the profile-hidden
Markov model has been developed [12-15], but
on the class level. To overcome these limitations, the support vector machine
(SVM)-based methods have been used to classify the families and subfamilies,
even sub-subfamilies, of GPCRs [3,16,17]. Another method using binary topology
pattern has also been used to identify eukaryotic GPCRs [18].
The main goal of this work is
to develop a method to determine GPCRs function at the subfamily level. A new
method was developed for classifying subfamilies belonging to Class B, C, D
and F GPCRs. This method couples fast Fourier transform (FFT) with SVM on the
basis of the hydrophobicity of amino acid sequences. The performance of this
method was validated by the jackknife test and evaluated by the independent
dataset test.
Methods
Dataset
To collect the sequences used for this study, all of the sequences belonging to Class B, C, D and F in GPCRDB (March 2005 release 9.0) (http://www.gpcr.org/7tm/) [19] were picked out, then all orphan/putative sequences and fragments were removed. none of the sequences was identical to another in the dataset. The subfamilies that contained less than 10 sequences were dropped out. For Class B GPCRs, the sequences marked as "new" were excluded as the independent dataset; and because the other three classes were relatively small, all the eligible sequences were used. The final dataset contained 403 sequences belonging to 17 different subfamilies. The number of sequences for each different subfamily is listed in Table 1.
Quantitative description of
proteins
The quantitative description
of amino acid sequences is crucial. Here, three principal properties of
proteins, the hydrophobicity, bulk and electronic property, were taken into
account. The hydrophobicity model, c-p-v model [20] and electron-ion
interaction potential (EIIP) model [21] were selected. The hydrophobicity
determines the structure and function of proteins, especially for the
transmembrane proteins. Three different hydrophobicity scales, KDHF [22], MHF
[23] and FHF [24], were selected and
optimized. The c-p-v model includes the composition (c), polarity
(p) and molecular volume (v). The EIIP model describes the
average energy states of all valence electrons of amino acid sequences. These
numerical series are normalized to zero mean and unit standard deviation, as
defined in Equation 1:
Eq. 1
where xij is some property value of the
ith amino acid residue in the jth sequence, is the mean property value of the jth
sequence, and sj is the standard deviation of the jth
sequence.
FFT
The Fourier transform changes
the signal from time-based to frequency-based, as shown in Equation 2:
Eq. 2
FFT has been applied to
protein sequence comparison [25] and rapid multiple sequence alignment [26].
FFT is defined in Equation 3:
Eq. 3
where ? is an Nth root
of Unity. N is the number of frequency points.
In this work, 512 frequency points were set, and the power spectrum, a measurement of the power at each frequency, was used. A plot of power versus frequency is called the power spectrum or power spectral density. The power at each frequency point was taken as the input feature of SVMs. The numerical sequences of variable lengths are transformed to fixed length vectors in this way.
SVM
SVM [27,28] is a kind of learning
machine based on statistical learning theory. The most attractive characteristics
of SVM are the absence of local minima, the sparseness of the solution, and
the use of the kernel-induced feature spaces. The SVM training process always
seeks a global optimized solution and avoids over-fitting, so it has the
ability to handle a large number of features and a relatively small dataset.
The basic ideas behind SVM can be introduced as follows. For a two-class problem, there are a series of samples described by the feature vectors xi(i=1,2,¼,l) (Equation 4) with corresponding labels yi={+1,-1}(i=1,2,¼,l) (Equation 5). To classify the two classes of samples, SVM maps the input vectors into a higher dimensional feature space, then constructs the maximal margin hyperplane (MMH), which maximizes the distance of the closest vectors belonging to the two classes to the hyperplane. The MMH can be obtained by solving the following convex quadratic programming problem:
Maximize Eq. 4
subject to, Eq. 5
where C is a
regularization parameter that controls the trade-off between the margin and
classification error.
K(xj,xi) is the kernel function. In this
paper, the radial basis function was selected as the kernel function (Equation
6):
Eq. 6
where s is the kernel width parameter.
The decision function
implemented by SVM can be written as Equation 7:
Eq. 7
The prediction of GPCR
subfamilies is a multi-class classification problem. In this paper, n
SVMs were constructed for n-class classification. The ith SVM was
trained with all samples in the ith subfamily with the label
"1" and all other samples with the label "-1". The SVMs trained in this way
were referred to as one-versus-rest SVMs [29]. All the kernel parameters were
kept constant except for C and s. All the programs of this method were written in Matlab 7.0
programming language.
Performance evaluation
The models for all the
subfamilies were validated by the jackknife test. For cross-validation, the
jackknife test is deemed more effective and objective than the independent
dataset test and sub-sampling test [30,31]. Chou and Zhang [32] have given a
comprehensive discussion, and Mardia et al. [33] has explained the
mathematical principle behind it. During the process of jackknifing, each
receptor was singled out in turn as a test receptor with the remaining
receptors used to train SVM.
Four indices, the accuracy
(ACC) (Equation 8), Matthew's correlation coefficient (MCC) (Equation
9) [34], total ACC (Equation 10) and total MCC (Equation 11),
were calculated for the assessment of the prediction system.
Eq. 8
Eq. 9
Eq. 10
Eq. 11
Here, i is the any subfamily,
N is the total number of sequences, k is the subfamily number,
exp(i) is the number of sequences observed in subfamily i, p(i)
is the number of correctly predicted sequences of subfamily i, n(i)
is the number of correctly predicted sequences not of subfamily i, u(i)
is the number of under-predicted sequences, and o(i) is the
number of over-predicted sequences.
Results and Discussion
Selecting principal property
for SVMs with the best performance
FHF,
one of the hydrophobicity scales, was used in the hydrophobicity model. The
hydrophobicity, c-p-v and EIIP models transformed the amino acid
sequences into numerical sequences separately, which were then transformed to
input feature vectors using FFT of 512 frequency points for SVMs. The
performance of SVMs based on the three models was validated using the jackknife
test, as shown in Table 2. Table 2 shows that the performance
based on the hydrophobicity model (FHF) is
better than that based on the c-p-v model or EIIP model, achieving the
highest total ACC and MCC of 91.6% and 0.94, respectively. The results indicate
that hydrophobicity is the most important property of proteins and can
preferably substitute the amino acid sequences quantitatively.
Selecting input feature
vectors for SVMs from the FFT transformed signals
The numerical sequences based
on the hydrophobicity model with FHF as the
scale were transformed with FFT to the input feature vectors for SVMs in three
ways: (1) using all the 512 frequency points; (2) using 256 odd frequency
points extracted from the FFT signals; and (3) using 256 even frequency points
extracted from the FFT signals. The performances of the three groups of feature
vectors were compared using the jackknife test, as shown in Table 3.
Table 3 shows that adopting 256 even frequency points as input vectors can
get the highest overall ACC and MCC. For each GPCR subfamily, the total ACC of
adopting 256 even frequency points is equal or higher than that of adopting the
other two groups of frequency points. So, in the later experiments, the input
feature vectors of SVMs were the 256 even frequency points for all the
subfamilies.
Selecting one hydrophobicity
scale with the best performance
Adopting the 256 even
frequency points as the feature vectors for each subfamily, the performances of
SVMs based on the different hydrophobicity scales were compared using the
jackknife test. The results are listed in Table 4. From Table 4,
we can see that the performance of SVMs based on KDHF
and MHF is not as good as that based
on FHF. The SVM based on FHF achieves the highest total ACC and MCC
of 93.3% and 0.95, respectively. This method can classify Class B with a total
ACC of 90.7%; Class C, 87.9%; Class D, 95.8%; and Class F, 100%. The results
prove that SVM based on FHF can classify GPCR subfamilies
with the highest accuracy.
Assigning a reliability index
to the prediction
It is important to know the
prediction reliability when using machine-learning techniques to assign
subfamilies of GPCRs. The reliability index (RI) was assigned according
to the difference (noted as diff) between the highest and the
second-highest output score of SVMs in a multi-class classification [3,27]. The
reliability score in this work has been computed using Equation 12:
Eq. 12
The expected prediction
accuracy and the number of sequences for each given RI were calculated,
as shown in Fig. 1. Fig. 1 shows that the model predicted 81.9% (330/403)
sequences with RI5. Three hundred and thirty sequences (RI5)
were nearly 100% correctly predicted, and only one sequence is under-predicted
(the accuracy is 329/330=99.7%). These results suggest that our model can
predict GPCR subfamilies with high reliability.
Independent dataset test
Because Class C, D and F have
fewer members than Class B, all proteins of Class C, D and F were used for the
training dataset. As described in "Methods", the sequences marked as
"new" in Class B in GPCRDB (March 2005 release 9.0) that do not exist
in the training set were used to check the practical application of this method
as the independent dataset. There are three receptors for the calcitonin
subfamily, three for the corticotropin-releasing factor subfamily, two for the
glucagon subfamily, two for the parathyroid hormone subfamily, four for the
PACAP subfamily, three for the vasoactive intestinal polypeptide subfamily and
ten for the latrophilin subfamily. The overall ACC is 100%, which shows the
method's strong facility for practical application.
Comparison with SVM based on
amino acid composition and dipeptide composition
Artificial intelligence-based
techniques such as SVM and the neural network require a fixed number of inputs
for training, so it is necessary to find a strategy for transforming the
variable lengths of proteins to fixed length patterns. Amino acid composition
[3,29,33] and dipeptide composition [3,35,36] have been used widely in
bioinformatics using intelligence techniques. The amino acid and dipeptide
compositions of proteins can be used to encapsulate the protein information in
a vector of 20 and 400 dimensions, respectively. It is worth comparing our
method with the SVM based on amino acid composition and dipeptide composition
respectively. But the performances of the SVMs based on the two approaches with
the jackknife test indicate that neither of the two approaches can discriminate
any subfamily successfully. It may be because Class B, C, D and F do not have
enough samples, so that there is no statistical meaning for each subfamily
using the two statistical approaches. However, our method can classify GPCR
subfamilies with good performance, indicating its powerful ability to deal with
relatively small datasets.
Prediction Web Server
Based on this study, a web
server has been set up to allow users to recognize GPCR subfamilies. It is
freely available at http://chem.scu.edu.cn/blast/Pred-GPCR. Users can
submit the query sequence in the standard format of FASTA. After analysis, the
results will show the GPCR subfamily to which the query sequence belongs.
Conclusion
This paper describes a method
of SVM in combination with FFT to classify GPCR subfamilies. During the development
of SVMs, three principal properties, hydrophobicity, bulk and electronic
property, were compared. The results show that the hydrophobicity of proteins
is the most important property in deciding GPCRs' function. Three
hydrophobicity scales, KDHF, MHF
and FHF, were optimized, and the
performance based on FHF was best. It was indicated
that taking 256 even frequency points of FFT transformed signals as input
vectors could achieve the highest accuracy. From these results, it is obvious
that the substitution model can affect the performance of the method. We think
that the performance of this method will be improved further if a more suitable
substitution model is found. We expect to find a hybrid model (related to
hydrophobicity) that can integrate other important properties in a reasonable
way, rather than using hydrophobicity alone. The establishment of such
methods will facilitate drug discovery for many diseases.
Acknowledgement
This work was supported by the State Key Laboratory of Chemo/Biosensing and Chemometrics, Hunan University (Changsha, China).
References
1 Attwood TK, Croning MDR, Gaulton A.
Deriving structural and functional insights from a ligand-based hierarchical
classification of G protein-coupled receptors. Protein Eng 2002, 15: 7-12
2 Hebert TE, Bouvier M. Structural
and functional aspects of G protein-coupled receptor oligomerization. Biochem
Cell Biol 1998, 76: 1-11
3 Bhasin M, Raghava GPS. GPCRpred: An
SVM-based method for prediction of families and subfamilies of G-protein
coupled receptors. Nucleic Acids Res 2004, 32: W383-W389
4 Yin YB, Luo JC, Jiang Y. Advances
in G-protein coupled receptor research and related bioinformatics study. Chin
Sci Bull 2003, 48: 511-516
5 Huang XQ, Jiang HL, Luo XM, Chen
KX, Zhu YC, Ji RY, Cao Y. Comparative molecular modeling on 3D-structure of
opioid receptor-like 1 receptor. Acta Pharmacol Sin 2000, 21: 529-535
6 Takeshi H, Wataru N,Takeshi K,
Norihisa F. Construction of hypothetical three-dimensional structure of P2Y1
receptor based on Fourier transform analysis. J Protein Chem 2002, 21: 537-545
7 Altschul SF, Gish W, Miller W,
Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990, 215:
403-410
8 Altschul SF, Madden TL, Schaffer
AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new
generation of protein database search programs. Nucleic Acids Res 1997, 25:
3389-3402
9 Pearson WR. Flexible sequence
similarity searching with the FASTA3 program package. Methods Mol Biol 2000,
132: 185-219
10 Lapinsh M, Gutcaits A, Prusis P, Post C,
Lundstedt T, Wikberg JES. Classification of G-protein coupled receptors by
alignment-independent extraction of principal chemical properties of primary
amino acid sequences. Protein Sci 2002, 11: 795-805
11 Sadowski MI, Parish JH. Automated
generation and refinement of protein signatures: case study with G-protein
coupled receptors. Bioinformatics 2003, 19: 727-734
12 Bateman A, Birney E, Cerruti L, Durbin R,
Etwiller L, Eddy SR, Griffiths-Jones S et al. The Pfam protein families
database. Nucleic Acids Res 2002, 30: 276-280
13 Bateman A, Coin L, Durbin R, Finn RD,
Hollich V, Griffiths-Jones S, Khanna A et al. The Pfam protein families
database. Nucleic Acids Res 2004, 32: D138-D141
14 Papasaikas PK, Bagos PG, Litou ZI,
Promponas VJ, Hamodrakas SJ. PRED-GPCR: GPCR recognition and family
classification server. Nucleic Acids Res 2004, 32: W380-W382
15 Papasaikas PK, Bagos PG, Litou ZI,
Hamodrakas SJ. A novel method for GPCR recognition and family classification
from sequence alone using signatures derived from profile hidden Markov
models. SAR QSAR Environ Res 2003, 14: 413-420
16 Karchin R, Karplus K, Haussler D.
Classifying G-protein coupled receptors with support vector machines.
Bioinformatics 2002, 18: 147-159
17 Huang Y, Cai J, Ji L, Li YD. Classifying
G-protein coupled receptors with bagging classification tree. Comput Biol Chem
2004, 28: 275-280
18 Inoue Y, Ikeda M, Shimizu T.
Proteome-wide functional classification and identification of mammalian-type
GPCRs by binary topology pattern. Comput Biol Chem 2004, 28: 39-49
19 Horn F, Vriend G, Cohen FE. Collecting
and harvesting biological data: The GPCRDB and NucleaRDB information systems. Nucleic
Acids Res 2001, 29: 346-349
20 Grantham R. Amino acid difference formula
to help explain protein evolution. Science 1974, 185: 862-864
21 Cosic I. Macromolecular bioactivity: Is
it resonant interaction between macromolecules? Theory and applications. IEEE
Trans Biomed Eng 1994, 41: 1101-1114
22 Kyte J, Doolittle RF. A simple method for
displaying the hydropathic character of a protein. J Mol Biol 1982, 157: 105-132
23 Mandell AJ, Selz KA, Shlesinger MF.
Wavelet transformation of protein hydrophobicity sequences suggests their
memberships in structural families. Physica A 1997, 244: 254-262
24 Fauch��re J, Pliška V. Hydrophobic
parameters F of amino-acid side chains from the partitioning of
n-acetyl-amino-acid amides. Eur J Med Chem Chim Ther 1983, 18: 369-375
25 Trad CH, Fang Q, Cosic I. Protein
sequence comparison based on the wavelet transform approach. Protein Eng 2002,
15: 193-203
26 Katoh K, Misawa K, Kuma K, Miyata T.
MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier
transform. Nucleic Acids Res 2002, 30: 3059-3066
27 Haykin S. Support vector machines. In:
Haykin S ed. Neural Networks: A Comprehensive Foundation. 2nd ed. New York:
Prentice Hall Inc. 1999
28 Vapnik V. Support Vector Machines of Pattern
Recognition. In: Vapnik V ed. Statistical Learning Theory. Peking: Publishing
House of Electronics Industry 2004
29 Hua S, Sun Z. Support vector machine
approach for protein subcellular localization prediction. Bioinformatics 2001,
17: 721-728
30 Chou KC, Elrod DW. Bioinformatical
analysis of G-protein-coupled receptors. J Proteome Res 2002, 1: 429-433
31 Elrod DW, Chou KC. A study on the
correlation of G-protein-coupled receptor types with amino acid composition.
Protein Eng 2002, 15: 713-715
32 Chou KC, Zhang CT. Prediction of protein
structural classes. Crit Rev Biochem Mol 1995, 30: 275-349
33 Mardia KV, Kent JT, Bibby JM eds.
Multivariate Analysis. London: Academic Press 1979
34 Matthews BW. Comparison of the predicted
and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta
1975, 405: 442-451
35 Bhasin M, Raghava GPS.
Classification of nuclear receptors based on amino acid composition and
dipeptide composition. J Biol Chem 2004, 279: 23262-23266
36 Reczko M, Bohr H. The DEF data base of
sequence based protein fold class predictions. Nucleic Acids Res 1994, 22: 3616-3619