Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm

Nucleic Acids Res. 2014 Jun;42(11):e93. doi: 10.1093/nar/gku325. Epub 2014 Apr 25.

Abstract

To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a newly proposed feature, SCORE. This feature is generated by a logistic regression function that combines five significant features (structure, sequence, modularity, structural robustness and coding potential) to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacterial genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm.
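As a rough illustration of the two ideas named in the abstract, the sketch below shows how a logistic function could fold five per-sequence feature values into a single composite value, and how accuracy, sensitivity and specificity follow from a confusion matrix. It is not the authors' implementation: the function names, the feature keys, the weights and the bias are all hypothetical placeholders, and the abstract does not state how the underlying features are computed or which coefficients were fitted.

```python
import math

# Hypothetical keys for the five features listed in the abstract.
FEATURES = ("structure", "sequence", "modularity",
            "structural_robustness", "coding_potential")


def composite_score(features, weights, bias=0.0):
    """Logistic combination of five feature values into one value in (0, 1).

    `weights` and `bias` are assumptions; in practice they would be
    fitted by logistic regression on labelled training sequences.
    """
    z = bias + sum(weights[name] * features[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = ncRNA)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Fraction of true ncRNAs that were recovered.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Fraction of negatives that were correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```

In a hybrid setup of the kind the abstract describes, the value returned by `composite_score` would be appended to each sequence's feature vector before the random forest is trained, so the forest can use it alongside the other selected features.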

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Classification / methods
  • Genome, Bacterial
  • Genomics
  • Humans
  • Logistic Models
  • RNA, Long Noncoding / genetics*
  • RNA, Small Untranslated / genetics*
  • RNA, Untranslated / classification
  • RNA, Untranslated / genetics

Substances

  • RNA, Long Noncoding
  • RNA, Small Untranslated
  • RNA, Untranslated