Performance evaluation of neural network, support vector machine and random forest for prediction of donor splice sites in rice
Abstract
Prediction of splice sites plays an important role in predicting gene structure. Rice is one of the major cereal crops, and its continued improvement can benefit from the prediction of unknown genes associated with complex traits. Machine learning techniques, namely Artificial Neural Network (ANN) and Support Vector Machine (SVM), have been used successfully for the prediction of splice sites, but to the best of our knowledge their performance has not yet been compared. Further, Random Forest (RF), another machine learning method, has been reported to outperform ANN and SVM in areas other than splice site prediction. In this study we developed an approach to encode the splice site sequence data of rice into numeric form, which is subsequently used as input to ANN, SVM and RF for the prediction of donor splice sites. The performances were then evaluated and compared using the receiver operating characteristic (ROC) curve and the estimated area under the ROC curve (AUC), averaged over 5-fold cross validation. The results reveal that the AUC of RF is higher than that of ANN and SVM, which implies that RF can be preferred over SVM and ANN for the prediction of donor splice sites.
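The evaluation pipeline summarised above can be illustrated with a minimal Python sketch. The abstract does not state the encoding actually used, so the one-hot scheme below is an assumption chosen only for illustration, and the names `encode`, `auc`, `k_fold_indices` and `mean_auc` are hypothetical helpers rather than the authors' implementation. The AUC is computed by its rank interpretation: the probability that a randomly chosen true site receives a higher score than a randomly chosen false site.

```python
from statistics import mean

# Assumed one-hot scheme for illustration; the abstract does not specify the encoding.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def encode(seq):
    """Flatten a DNA window into a numeric feature vector (4 values per base)."""
    return [bit for base in seq.upper() for bit in ONE_HOT[base]]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k interleaved test folds."""
    return [list(range(i, n, k)) for i in range(k)]


def mean_auc(folds):
    """Average the AUC over folds, each given as a (labels, scores) pair of held-out data."""
    return mean(auc(labels, scores) for labels, scores in folds)
```

In such a setup, each classifier would be trained on four folds of encoded sequences, scored on the held-out fold, and compared by the value returned from `mean_auc`.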