Here we use computational prediction of solvent accessibility to extend our analysis to all the positions undergoing variation contained in HUMSAVAR. Unclassified SRVs were filtered out from the set. Overall, 69, SRVs were collected. Table 2 summarizes the basic statistics of the dataset.
To compute solvent accessibility from protein sequences, we implemented an in-house prediction method. The method is based on deep-learning processing of several input features, which encode the protein sequence and the sequence profile (see Materials and Methods for more details on the method).
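As a rough illustration of the kind of input encoding mentioned here, the following minimal sketch builds one feature vector per residue by concatenating a one-hot encoding with a profile row. The alphabet order, the function names, and the toy profile are assumptions made for illustration only; the actual features and architecture are those described in Materials and Methods.

```python
# Illustrative sketch only: names and the alphabet order are assumptions,
# not the authors' implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def one_hot(residue):
    """Return a 20-dimensional one-hot vector; unknown residues map to all zeros."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def encode_sequence(sequence, profile):
    """Concatenate one-hot and profile rows, giving one feature vector per residue.

    `profile` is expected to hold one row of 20 frequencies per position,
    for example derived from a multiple sequence alignment.
    """
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return [one_hot(res) + list(row) for res, row in zip(sequence, profile)]


# Toy example: a 3-residue sequence with a uniform (uninformative) profile.
toy_sequence = "MKV"
toy_profile = [[0.05] * 20 for _ in toy_sequence]
features = encode_sequence(toy_sequence, toy_profile)
```

Each residue is then described by 40 numbers (20 one-hot plus 20 profile values), which is the sort of per-position input a sequence-based predictor could consume.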
Performances are listed in Table 3 and are evaluated on three different testing sets: by adopting a cross-validation procedure (leftmost column); on the blind test (central column); and on our HVAR3D set. Comparing the first two columns, it is evident that our method is robust, achieving generalization performance that is as good as, and even better than, the cross-validation results.
Performance of our deep learning-based method for predicting solvent exposure from protein sequence.

We also performed a side-by-side comparison between our method and two state-of-the-art approaches, namely PaleAle5. Results are reported in Table 4. All methods perform quite well, with comparable scoring indexes. It is worth mentioning that the testing set used in this benchmark is non-redundant only with respect to our training set: this condition is not guaranteed for the other two methods evaluated, which adopt different training sets.
In general, we can conclude that our method compares well with recent state-of-the-art tools.

Performance of different methods for solvent accessibility prediction on the blind test set described in this study comprising protein sequences.

After computing solvent accessibility over HVARSEQ, we assessed the proportions of buried and exposed predictions separately on the subsets of residues undergoing disease-related and neutral variations.
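The proportion analysis described here can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the function name, and the toy data are hypothetical and are not taken from HVARSEQ or from the authors' code.

```python
from collections import Counter


def state_proportions(records):
    """Fraction of buried and exposed predictions within each variation class.

    `records` is an iterable of (variation_class, predicted_state) pairs,
    where predicted_state is "buried" or "exposed" (assumed labels).
    """
    records = list(records)
    pair_counts = Counter(records)
    class_totals = Counter(cls for cls, _state in records)
    return {
        cls: {
            state: pair_counts[(cls, state)] / total
            for state in ("buried", "exposed")
        }
        for cls, total in class_totals.items()
    }


# Made-up toy data, for illustration only (not results from the paper).
toy_records = (
    [("disease", "buried")] * 3
    + [("disease", "exposed")]
    + [("neutral", "buried")]
    + [("neutral", "exposed")] * 3
)
proportions = state_proportions(toy_records)
```

With real data, the two inner dictionaries would correspond to the buried and exposed fractions reported for the disease-related and neutral subsets.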
Results are in Figure 4. The result further corroborates the notion that residues undergoing disease-related variations are mainly in buried positions. We also show the baseline probability P(D) 0. Here, the buried and exposed states of each residue position have been predicted using the method described in Section "Analyzing distributions of variated wild-type residues in the sequence database".
P(D|E,R): the conditional probability that a wild-type residue R is disease related upon variation when exposed [see Equation 4]. In the sequence set, this behavior also characterizes arginine (R) and aspartic acid (D). However, this discrepancy may be due to prediction errors on these two less abundant residues in the database.
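Equation 4 is not reproduced in this section, so the following is only a sketch of one plausible way to estimate such probabilities, assuming a simple relative-frequency estimator. The function names, the record layout, and the toy records are hypothetical.

```python
def conditional_disease_probability(records, residue, state):
    """Estimate P(D | state, residue) as a relative frequency (assumed estimator).

    `records` holds (wild_type_residue, state, variation_class) triples.
    Returns None when no record matches the residue and state.
    """
    matching = [cls for res, st, cls in records if res == residue and st == state]
    if not matching:
        return None
    return sum(1 for cls in matching if cls == "disease") / len(matching)


def baseline_disease_probability(records):
    """Estimate the baseline P(D): the fraction of all records that are disease related."""
    return sum(1 for _res, _st, cls in records if cls == "disease") / len(records)


# Made-up toy records, for illustration only (not data from the paper).
toy_records = [
    ("R", "exposed", "disease"),
    ("R", "exposed", "neutral"),
    ("R", "exposed", "neutral"),
    ("R", "buried", "disease"),
    ("L", "buried", "disease"),
    ("L", "exposed", "neutral"),
]
p_d_given_exposed_r = conditional_disease_probability(toy_records, "R", "exposed")
p_d = baseline_disease_probability(toy_records)
```

Comparing the conditional estimate with the baseline indicates whether exposure of a given residue type is associated with more or fewer disease-related variations than expected overall.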
Out of curiosity, we used an example to show the 3D location of our sequence-based predictions. Disease-related SRVs are all associated with Trimethylaminuria OMIM: 5 , a disease condition resulting from the abnormal presence of large amounts of volatile and malodorous trimethylamine within the body.

Mapping SASA predictions on a protein model.
SRV positions are highlighted using the spacefill view. In red, buried positions associated with disease-related SRVs and correctly predicted as buried by our method. In magenta, buried disease-related positions wrongly predicted as exposed. In orange, exposed disease-related positions wrongly predicted as buried. In blue, exposed neutral SRV positions correctly predicted as exposed.
In yellow, exposed neutral positions wrongly predicted as buried. In green, buried neutral positions correctly predicted as buried. It is evident that the vast majority of disease-related SRVs (6 out of 8) are in buried positions. Of these, five are correctly predicted as buried by our method (in red), while only one is wrongly predicted as exposed (in magenta).
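The color legend of the figure can be written down as a small lookup table. This is a sketch for illustration: the key layout and the function name are hypothetical, and combinations that the legend does not mention are deliberately left unmapped.

```python
# Keys are (variation_class, observed_state, predicted_state); values are the
# colors named in the figure legend. The tuple layout is an assumption.
LEGEND = {
    ("disease", "buried", "buried"): "red",
    ("disease", "buried", "exposed"): "magenta",
    ("disease", "exposed", "buried"): "orange",
    ("neutral", "exposed", "exposed"): "blue",
    ("neutral", "exposed", "buried"): "yellow",
    ("neutral", "buried", "buried"): "green",
}


def srv_color(variation_class, observed_state, predicted_state):
    """Return the legend color for an SRV position, or None if not in the legend."""
    return LEGEND.get((variation_class, observed_state, predicted_state))


# Example lookup: a buried disease-related position predicted as buried.
example_color = srv_color("disease", "buried", "buried")
```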
Neutral SRVs are mostly exposed (10 out of 11): eight of these are correctly predicted as exposed (in blue). These results illustrate the general trend observed in the structural data set and are consistent with the accuracy of the prediction method.

In this paper, we focus on the solvent accessible surface area, a property of protein residues first described and computed in several biophysical studies, to which Cyrus Chothia contributed (Chothia). The property, which can nowadays be computed with machine learning-based methods, is exploited here in relation to another important problem: the annotation of variations in human proteins as disease related or not.
We took advantage of an ample set of human protein structures to observe that disease-related variations indeed occur more frequently in buried regions of proteins than on solvent-accessible surfaces. In turn, neutral polymorphisms are more frequently solvent exposed.
We then showed that, with a deep learning method performing at the state of the art, the tendency is also observable in the majority of the wild-type residues undergoing variations that are presently listed in HUMSAVAR. We suggest that the solvent accessible surface area of wild-type residues is a distinctive property to be included among those needed to distinguish pathogenic from non-pathogenic variations.

MM and CS: software. RC and PM: supervision.
All authors contributed to the article and approved the submitted version. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Front Mol Biosci. Published online Jan 7. This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences. Received Nov 5; Accepted Dec 7. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice.
No use, distribution or reproduction is permitted which does not comply with these terms.

Abstract

Solvent accessibility (SASA) is a key feature of proteins for determining their folding and stability.

Keywords: solvent accessible surface area, relative solvent accessibility, protein variations, prediction of solvent accessible surface, pathogenic protein variations.
Predicting Solvent Accessibility From the Protein Sequence

The method implements a deep-learning architecture processing an input based on the following descriptors: the residue one-hot encoding, representing primary sequence information; evolutionary information encoded with a protein sequence profile, as extracted from a multiple sequence alignment generated using the HHblits version 3 program (Steinegger et al.).

Figure 1.

Analyzing Distributions of Variated Wild-Type Residues in the Structure Database

We tackle the problem of associating solvent exposure with a specific wild-type residue as a characteristic feature of its variation type (neutral or disease related).
Figure 2. The clear decrease in conservation for sequence alignments indicates the limitations in gaining prediction accuracy. average filter [Eq. The averaging filter had a n Fig. Particularly striking was the poor performance of the ten-state network for the intermediate state I.

Half the residues are predicted as accurately as by homology modeling

However, this was not a The reliability index [RI, defined in Fig. This distribution is approximately Gaussian with a mean of 0. Shown is the distribution of exposure prediction secondary structure prediction the final filtered jury prediction called PHDacc. Three negative outliers are not shown: 2MEV-4 2GN5 0. Here we compare the accuracy of prediction of relative tion methods compared are: random prediction, cross-validated. Note that the prediction accuracy was significantly higher for monomeric proteins Table The values of prediction accuracy evaluated based on the cross-validation set were largely confirmed Table This implies that the correlation between predicted and observed solvent accessibility is likely to remain at about 0.
Here, we pursued a far less ambitious goal: to evaluate the accuracy of exposure prediction as such. with RI ≥ 4. For these residues, the correlation between predicted and observed accessibility was 0. The reliability index, RI, is defined as: The empirical factor of 30 that RI lies between 0 and 9. Comparing relative accessibility between structurally aligned corresponding residues in 3D homologous structures yields a correlation coefficient of 0.

Prediction performance confirmed by test on pre-release set

To make doubly sure that the prediction results
The accessibility of completely buried residues was best conserved Fig. Evaluated in three states, relative accessibility ture in 2D by a matrix of all inter-residue contacts in a protein. This low degree of conservation raises the question of whether or not there are descriptions of the relative position of a residue in a structure that are better conserved between 3D homologues. In particular, predictions of secondary structure and solvent accessibility could be aligned to known 3D structures to detect putative remote homologues, or at least to provide Sixth, solvent accessibility predictions could be helpful to predict epitopes (antigenic sites). cient of about 0. Expressed in units of the difference between automatic homology modeling and DE by electronic mail. Prediction was better for residues in regular secondary structure segments, and in general better for the extreme cases, i. proof-reading, and motivating discussions; Gerrit Vriend and Reinhard Schneider, for many valuable
and Ulrike Gobel, for proof-reading. Last but not least, many thanks to all those who publish experimental data on 3D protein structures and deposit the coordinates in public da- The network method rate. However, its accuracy was relatively closer to prediction by automatic homology modeling than it was to prediction of secondary structure Fig.

Can Prediction Accuracy Be Improved?

conserved in evolution. Another idea is to combine

Is Prediction of Accessibility Useful?

Possible applications for solvent accessibility pre- First, an approach that has Second, the prediction could be used as Such an esti-
Ter- faces in proteins. Janin, J. Surface area of globular proteins. Richards, F. A two-stage method is developed for the single sequence prediction of protein solvent accessibility from solely its amino acid sequence. The first stage classifies each residue in a protein sequence as exposed or buried using support vector machine SVM. The features used in the SVM are physico-chemical properties of the amino acid to be predicted as well as the information coming from its neighboring residues.
The MUMs are identified by an efficient data structure called a suffix tree. The results demonstrate that the new method achieves slightly better accuracy than recent methods using single sequence prediction.