Abstract
• Random Forest and Rotation Forest classifiers are used for subcellular localization.
• Various feature extraction strategies are utilized.
• SMOTE is employed as a data balancing technique.
• SMOTE has improved prediction performance in classifying protein images.
• A web server is available online at http://111.68.99.218/RF-SubLoc.
Protein subcellular localization plays a vital role in understanding the behavior of proteins under different circumstances. The effectiveness of various drugs can be assessed by the successful prediction of protein locations. Therefore, it is important to develop a prediction system that is sufficiently reliable and accurate in making decisions regarding protein localization. However, the main problem in developing a reliable, high-throughput prediction system is the presence of imbalanced data, which greatly affects the performance of such a system. To remedy this problem, we apply oversampling through the Synthetic Minority Oversampling TEchnique (SMOTE). Further, different feature extraction strategies and ensemble classification techniques are assessed for their contribution toward solving the challenging problem of subcellular localization. After applying the SMOTE data balancing technique, a remarkable improvement is observed in the performance of the random forest and rotation forest ensemble classifiers on the CHOM, CHOA, and VeroA datasets. It is anticipated that our proposed model might be helpful for the research community in the field of functional and structural proteomics as well as in drug discovery.
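As a rough illustration of the oversampling idea mentioned above, the following minimal Python sketch generates synthetic minority-class samples by interpolating between a sample and one of its nearest minority-class neighbours. It is a simplified sketch of the basic SMOTE idea, not the implementation used in this work. The function name `smote`, its parameters, and the toy feature vectors are assumptions made for illustration.

```python
import random
from math import dist


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (basic SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbours among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors (hypothetical, not taken from the CHOM, CHOA, or VeroA datasets).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
new_points = smote(minority, n_synthetic=4, k=2)
```

In such a setup, the synthetic points would be appended to the minority class of the training set only, before fitting a classifier such as a random forest; the test set would be left unbalanced so that the reported performance reflects the original class distribution.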