Abstract
We propose 4mCi6mA-BGC, a novel method for predicting DNA modification sites, represented here by 4mC and 6mA sites. Five encodings (binary, K-mer, PseKNC, DAC and MonoDiKGap) convert DNA sequence information into numerical features, which are then fused. The elastic net feature selection method removes redundant and irrelevant features to select the optimal feature subset. This subset is fed into the deep learning model BiGRU_CNN to predict DNA 4mC and 6mA sites. The experimental results show that 4mCi6mA-BGC can significantly improve the prediction accuracy for DNA modification sites represented by 4mC and 6mA sites.
• A novel method called 4mCi6mA-BGC to predict DNA modification sites represented by 4mC and 6mA sites.
• Fusing the Binary, K-mer, PseKNC, DAC and MonoDiKGap to extract multiple feature information of DNA sequence.
• The elastic net feature selection method is used to eliminate redundant and irrelevant features.
• In terms of 4mC and 6mA sites prediction, we employ a method BiGRU_CNN consisting of BiGRU and CNN for the first time and achieve good results.
• 4mCi6mA-BGC shows superior effect for predicting DNA modification sites represented by 4mC and 6mA sites compared with existing prediction methods.
DNA N4-methylcytosine (4mC) and DNA N6-methyladenine (6mA) are significant epigenetic modifications. 4mC is closely related to the restriction-modification system, and 6mA is involved in various cellular activities. To further explore their functional mechanisms and biological significance, and to overcome the narrow coverage of traditional experimental methods, an efficient and widely applicable prediction method is needed. In this work, we develop a prediction method named 4mCi6mA-BGC to predict 4mC sites and 6mA sites. First, we employ binary encoding, K-mer nucleotide frequency (K-mer), pseudo K-tuple nucleotide composition (PseKNC), dinucleotide-based auto covariance (DAC) and monoDiKGap theoretical description (MonoDiKGap) to encode DNA sequences. Then, the elastic net is employed for feature selection, and the optimized feature space is fed into a deep learning framework composed of a bidirectional gated recurrent unit (BiGRU) and a convolutional neural network (CNN). The benchmark comprises six datasets containing 14,328 4mC sites from different species. The results of 10-fold cross-validation indicate that the prediction accuracy of 4mCi6mA-BGC significantly exceeds that of existing prediction methods. In addition, we use the independent datasets Rice and Arabidopsis thaliana to further confirm the predictive ability of 4mCi6mA-BGC. Compared with the existing prediction methods, 4mCi6mA-BGC shows the best prediction performance. These comprehensive results indicate that our method can identify DNA modification sites represented by 4mC and 6mA sites.
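To illustrate the first step of the pipeline, the following minimal sketch shows how two of the five encodings named above, binary and K-mer frequency, might be computed and concatenated for one sequence. It is not the authors' implementation: the function names (`binary_encode`, `kmer_frequencies`, `fuse_features`), the choice of k, the all-zero handling of ambiguous bases and the plain concatenation used for fusion are all assumptions made for illustration. PseKNC, DAC, MonoDiKGap, the elastic net selection and the BiGRU_CNN model are omitted.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def binary_encode(seq):
    """Binary (one-hot) encoding: each nucleotide becomes four 0/1 values."""
    one_hot = {n: [int(n == m) for m in NUCLEOTIDES] for n in NUCLEOTIDES}
    features = []
    for base in seq.upper():
        # Assumption: symbols outside A/C/G/T (e.g. N) map to all zeros.
        features.extend(one_hot.get(base, [0, 0, 0, 0]))
    return features


def kmer_frequencies(seq, k=2):
    """Normalized frequency of every possible k-mer, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k over the sequence and count valid k-mers.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def fuse_features(seq, k=2):
    """Hypothetical fusion step: concatenate the two feature vectors."""
    return binary_encode(seq) + kmer_frequencies(seq, k)
```

For a sequence of length n, `fuse_features` returns a vector of length 4n + 4^k; in a full pipeline, vectors like this would be stacked into a matrix before feature selection.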