Protein-DNA and protein-RNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanisms of protein-nucleic acid recognition, it is important to identify the DNA or RNA-binding amino acid residues in proteins. The identification is straightforward if the structure of a protein-DNA or protein-RNA complex is known. Unfortunately, solving the structure of a protein-DNA/RNA complex is expensive and time-consuming. Currently, only a few hundred protein-nucleic acid complexes have structural data available in the Protein Data Bank (PDB, http://www.rcsb.org/pdb/). With the rapid accumulation of sequence data, predictive methods are needed for identifying potential DNA or RNA-binding residues in protein sequences.
Several machine learning methods have been reported for predicting DNA or RNA-binding residues directly from amino acid sequences [1-3], using biochemical features of amino acid residues [4, 5], and by incorporating evolutionary information in terms of position-specific scoring matrices [6-8]. Ahmad et al. [1] investigated representative structures of protein-DNA complexes, and used the amino acid sequences in these structures to train artificial neural networks (ANNs) for prediction of DNA-binding residues. Yan et al. [2] constructed Naïve Bayes classifiers for DNA-binding site prediction from amino acid identities. Naïve Bayes classifiers were also developed for predicting RNA-binding residues directly from amino acid sequences [3]. However, without using biological knowledge for classifier construction, the prediction accuracy was relatively low in these studies.
The use of evolutionary information for input encoding has been shown to improve classifier performance. Ahmad and Sarai [6] constructed ANN classifiers for DNA-binding site prediction using evolutionary information in the form of a position-specific scoring matrix (PSSM). More recently, PSSM profiles have also been used to train support vector machines (SVMs) and logistic regression models for sequence-based prediction of DNA-binding residues [7, 8]. For a given protein sequence, its PSSM profile can be derived from the result of a PSI-BLAST search against a large sequence database. PSSM scores indicate how well an amino acid position in the query sequence is conserved among its homologues. Since functional sites, including DNA and RNA-binding residues, tend to be conserved among homologous proteins, PSSM can provide relevant information for classifier construction. However, PSSM was designed primarily for PSI-BLAST searches, and it may not contain all the evolutionary information needed for modelling DNA or RNA-binding sites.
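As a rough illustration of how a PSSM profile can be turned into classifier input, the sketch below encodes one residue by flattening the PSSM rows inside a sliding window centred on it. The window size, the zero padding at the sequence termini, and the logistic scaling of raw scores are assumptions made for this example, not details given in the text, and the toy matrix has 3 columns rather than the 20 of a real PSSM.

```python
import math

def scale(score):
    """Squash a raw PSSM score into (0, 1) with a logistic function (assumed scaling)."""
    return 1.0 / (1.0 + math.exp(-score))

def pssm_window_features(pssm, center, window_size):
    """Flatten the PSSM rows in a window centred on `center` into one feature vector.

    pssm        -- list of rows, one per residue, each a list of scores
    center      -- index of the residue being encoded
    window_size -- odd number of residues in the window (hypothetical parameter)
    Window positions that fall outside the sequence are padded with zeros.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    n_cols = len(pssm[0])
    half = window_size // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(scale(s) for s in pssm[pos])
        else:
            features.extend([0.0] * n_cols)
    return features

# Toy 4-residue profile with 3 score columns (illustrative values only).
toy_pssm = [[2, -1, 0], [0, 0, 0], [-3, 4, 1], [1, 1, -2]]
vec = pssm_window_features(toy_pssm, center=0, window_size=3)
```

Each residue then contributes one fixed-length vector (window size times the number of PSSM columns), which is the shape of input that an ANN, SVM, or logistic regression model expects.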
In our previous studies [4, 5], ANN and SVM classifiers were constructed for DNA or RNA-binding site prediction using relevant biochemical features, including the hydrophobicity index, side chain pKa value, and molecular mass of an amino acid. These features were used to represent biological knowledge, which might not be learned from the training data. It was found that classifier performance was enhanced by using the biochemical features for input encoding, and the SVM classifiers outperformed the ANN predictors. Nevertheless, it is still unknown whether classifier performance can be further improved by combining the biochemical features with evolutionary information.
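The idea of encoding residues by biochemical features can be sketched as follows. Each residue in a window is replaced by a tuple of numeric features looked up in a table. Here the table is filled only with the Kyte-Doolittle hydropathy scale as a stand-in hydrophobicity index; which scale the studies used is not stated in this section, and side chain pKa and molecular mass would simply be further entries in each tuple. The function name, window size, and zero padding are assumptions for illustration.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in hydrophobicity index.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# One tuple of features per amino acid; extend each tuple to add pKa, mass, etc.
FEATURE_TABLE = {aa: (h,) for aa, h in KD_HYDROPATHY.items()}

def encode_window(sequence, center, window_size, table=FEATURE_TABLE):
    """Encode the residue at `center` by the features of its sequence neighbours.

    Positions outside the sequence, and residues missing from the table,
    are encoded as zeros (an assumption of this sketch).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    n_features = len(next(iter(table.values())))
    half = window_size // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(sequence) and sequence[pos] in table:
            vector.extend(table[sequence[pos]])
        else:
            vector.extend([0.0] * n_features)
    return vector

middle = encode_window("MKR", center=1, window_size=3)
first = encode_window("MKR", center=0, window_size=3)
```

Because the feature values come from a fixed table rather than from the training data, this encoding injects prior knowledge that a classifier trained on amino acid identities alone would have to infer.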
This study aimed to examine different descriptors of evolutionary information for DNA and RNA-binding site prediction, and to improve classifier performance by combining relevant sequence features. Three new descriptors of evolutionary information as well as PSSM were used to construct SVM classifiers, and the new descriptors were shown to improve classifier performance. Interestingly, the most accurate classifiers were obtained by combining the new descriptors with PSSM and relevant biochemical features for input encoding. The results suggest that PSSM, although useful for classifier construction, does not capture all the evolutionary information for predicting DNA and RNA-binding residues in protein sequences. A new web server called BindN+ (http://bioinfo.ggc.org/bindn+/) has been developed to make the SVM classifiers accessible to the biological research community.