Abstract
Long non-coding RNA (lncRNA) is involved in important biological processes in the human body, but it is also implicated in major diseases such as cancer. Comparatively little research has been done on lncRNA because of its structural resemblance to mRNA, which is protein-coding RNA. A reliable model for distinguishing mRNA from lncRNA is therefore required for further research on lncRNA and its association with diseases.
Results: First, k-mers with different values of k are extracted and tested directly with our model to select the best k-mer features. The k-mers with high accuracy scores are then combined into one pool and screened with an extra trees classifier to identify the most relevant features. The shortlisted features are then given to our random forest (RF) classifier to classify human lncRNA and mRNA sequences. Finally, our classifier is compared with other machine learning classifiers: Support Vector Machine (SVM), Neural Network (NN) and K-Nearest Neighbor (KNN). The results show that our proposed model achieves the highest accuracy, precision, recall and F1 score.
Conclusion: We have built an RF classifier based on a combination of k-mer features with k = 2 and k = 3, which achieves an accuracy of 0.9984. This score is better than those of the Convolutional Neural Network (CNN) based approach and all previously proposed classification approaches.
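
As an illustration of the feature extraction step, the following is a minimal sketch of how combined 2-mer and 3-mer features might be computed for one sequence. It assumes normalized k-mer frequencies over the alphabet A, C, G, T; the abstract does not state whether raw counts or frequencies are used, and the function names `kmer_frequencies` and `combined_features` are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences are handled as DNA-style letters


def kmer_frequencies(sequence, k):
    """Return the frequency of every possible k-mer, in a fixed order."""
    # Treat RNA input (U) the same as DNA input (T).
    seq = sequence.upper().replace("U", "T")
    # Slide a window of width k across the sequence.
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Ignore windows that contain ambiguous bases such as N.
    counts = Counter(w for w in windows if set(w) <= set(ALPHABET))
    total = sum(counts.values())
    vocabulary = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


def combined_features(sequence, ks=(2, 3)):
    """Concatenate the k-mer frequency vectors for each k in ks."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(sequence, k))
    return features


vector = combined_features("ATGCGATACGCTTGA")
print(len(vector))  # 4**2 + 4**3 = 80 features
```

Vectors of this form, one per sequence, could then be screened with a feature selector such as scikit-learn's `ExtraTreesClassifier` and passed to a `RandomForestClassifier`; that pairing is likewise an assumption about tooling, not something the abstract specifies.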