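Datasets: Benchmark datasets are from Deng et al. 2022 (Nucleic Acids Research, doi:10.1093/nar/gkac112). The Test column gives the error type each dataset measures. FN datasets contain only rRNA, so any read not classified as rRNA is a false negative (measures sensitivity). FP datasets contain only non-rRNA, so any read classified as rRNA is a false positive (measures specificity). FN+FP datasets are mixed.
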
| Dataset | Test | Pairs | Description |
|---|---|---|---|
| SILVA_rRNA | FN | 20,000,000 | SILVA SSU+LSU rRNA sequences |
| OMA_CDS | FP | 20,000,000 | Prokaryotic and eukaryotic mRNA |
| ENA_virus | FP | 27,206,792 | Viral gene sequences from ENA |
| Amplicon_16S | FN | 7,917,920 | Real 16S V1-V2 amplicon reads (oral microbiome) |
| Human_ncRNA | FP | 6,330,381 | Human non-coding RNA |
| MetaT | FN+FP | 9,165,829 | Oral metatranscriptome: 4.7M prokaryotic mRNA, 2.5M human mRNA, 73K viral mRNA, 1.9M rRNA (21% rRNA fraction) |
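
Metrics: For FN datasets, sensitivity = (total - misclassifications) / total. For FP datasets, the false positive rate (FPR) = misclassifications / total. MetaT has no per-read count of misclassifications here, so the table reports the number of reads classified as rRNA and compares it with the expected ~21% rRNA fraction.
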
| Dataset | Type | Total pairs | Misclassified (MetaT: classified as rRNA) | Metric | Value | Wall time (s) | Memory (MB) |
|---|---|---|---|---|---|---|---|
| OMA_CDS | nonrrna | 20,000,000 | 7,031 | FPR | 0.04% | 547 | 3821 |
| SILVA_rRNA | rrna | 20,000,000 | 5,124 | Sensitivity | 99.97% | 4160 | 4393 |
| Amplicon_16S | rrna | 7,917,920 | 7 | Sensitivity | 100.00% | 1357 | 4203 |
| Human_ncRNA | nonrrna | 6,330,381 | 4,493 | FPR | 0.07% | 101 | 3805 |
| MetaT | mixed | 9,165,829 | 1,888,326 (20.6% of reads classified as rRNA; ~21% expected) | NA | NA | 999 | 3973 |
| ENA_virus | nonrrna | 27,206,792 | 1,769 | FPR | 0.01% | 342 | 3798 |
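
The Value column can be recomputed from the counts with the metric definitions given above. The following is a minimal sketch, not part of the benchmarked tool: the names `RESULTS`, `sensitivity`, `false_positive_rate`, and `summarize` are hypothetical and chosen only for illustration, and the counts are copied by hand from the results table.

```python
# Counts copied from the results table: (type, total pairs, count column).
# For "rrna" and "nonrrna" rows the count is the number of misclassifications;
# for the "mixed" row it is the number classified as rRNA.
RESULTS = {
    "OMA_CDS": ("nonrrna", 20_000_000, 7_031),
    "SILVA_rRNA": ("rrna", 20_000_000, 5_124),
    "Amplicon_16S": ("rrna", 7_917_920, 7),
    "Human_ncRNA": ("nonrrna", 6_330_381, 4_493),
    "MetaT": ("mixed", 9_165_829, 1_888_326),
    "ENA_virus": ("nonrrna", 27_206_792, 1_769),
}


def sensitivity(total, misclassified):
    """Fraction of rRNA input that was correctly classified as rRNA."""
    return (total - misclassified) / total


def false_positive_rate(total, misclassified):
    """Fraction of non-rRNA input that was wrongly classified as rRNA."""
    return misclassified / total


def summarize(results):
    """Return {dataset: (metric name, formatted percentage)}."""
    summary = {}
    for name, (kind, total, count) in results.items():
        if kind == "rrna":
            summary[name] = ("Sensitivity", f"{sensitivity(total, count):.2%}")
        elif kind == "nonrrna":
            summary[name] = ("FPR", f"{false_positive_rate(total, count):.2%}")
        else:
            # Mixed input: report the fraction classified as rRNA instead.
            summary[name] = ("rRNA fraction", f"{count / total:.1%}")
    return summary


if __name__ == "__main__":
    for name, (metric, value) in summarize(RESULTS).items():
        print(f"{name}: {metric} = {value}")
```

Rounded to two decimals, Amplicon_16S shows 100.00% even though 7 of its 7,917,920 pairs were misclassified, so the raw count column is the more informative figure for that row.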