SortMeRNA Database Clustering Summary

This table summarises how many representative sequences remain in each reference database after clustering at decreasing identity thresholds. Each column group covers a source database and, within it, a taxonomic kingdom (SILVA) or rRNA family (RFAM); each row corresponds to a SortMeRNA database configuration. Every kingdom or family has two columns:

%clust: identity threshold used for vsearch clustering
#seqs: number of representative (centroid) sequences retained
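
As a rough illustration of how a %clust value maps onto a vsearch run, the sketch below builds a clustering command line for a given threshold. It assumes the `--cluster_fast` command, which is only one of several vsearch clustering commands; the table does not state which one produced these counts. The function name and the file names are hypothetical.

```python
import shlex


def vsearch_cluster_cmd(input_fasta, centroids_fasta, identity_percent):
    """Build a vsearch argv that clusters input_fasta at the given identity.

    Assumption: --cluster_fast is used. The table does not say which
    vsearch clustering command produced the counts.
    """
    if not 0 < identity_percent <= 100:
        raise ValueError("identity_percent must be in (0, 100]")
    return [
        "vsearch",
        "--cluster_fast", input_fasta,
        # vsearch expects the threshold as a fraction, e.g. 0.97 for 97%
        "--id", f"{identity_percent / 100:.2f}",
        # one representative (centroid) sequence is written per cluster
        "--centroids", centroids_fasta,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = vsearch_cluster_cmd("silva_ssu_bacteria.fasta", "bacteria_97.fasta", 97)
    print(shlex.join(cmd))
```

Under this assumption, #seqs would correspond to the number of FASTA records in the centroids file written by the command.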

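Baseline rows (SILVA NR99 / RFAM unclustered) show the sequence counts before any additional clustering, against which the clustered databases can be compared. SILVA NR99 is itself a 99% non-redundant set, which is why its baseline row lists 99% under %clust. A dash (—) indicates that no sequences were available for that kingdom/database combination.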

Source database versions:

SILVA release 138.2 (SSURef NR99 and LSURef NR99)
RFAM release 15.1 (5S / RF00001, 5.8S / RF00002)

Configuration	SILVA 138.2 SSURef NR99						SILVA 138.2 LSURef NR99						RFAM 15.1
	Archaea		Bacteria		Eukaryota		Archaea		Bacteria		Eukaryota		5S		5.8S
	%clust	#seqs	%clust	#seqs	%clust	#seqs	%clust	#seqs	%clust	#seqs	%clust	#seqs	%clust	#seqs	%clust	#seqs
SILVA NR99 (unclustered)	99%	20,389	99%	431,166	99%	58,940	—	—	—	—	—	—	—	—	—	—
SMR v4.7 sensitive db	97%	8,934	97%	165,468	97%	28,707	—	—	—	—	—	—	—	—	—	—
SMR v4.7 default db	95%	5,293	95%	99,721	95%	18,810	—	—	—	—	—	—	—	—	—	—
SMR v4.7 fast db	90%	1,839	90%	31,979	90%	8,464	—	—	—	—	—	—	—	—	—	—
SMR v4.7 fast db (85%)	85%	813	85%	10,111	85%	4,199	—	—	—	—	—	—	—	—	—	—
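
To make the comparison against the baseline concrete, the following sketch computes the percentage of SILVA NR99 sequences retained at each threshold. The counts are copied from the SSURef NR99 columns of the table; the names `SSU_COUNTS` and `retained_percent` are illustrative only.

```python
# Representative-sequence counts copied from the SSURef NR99 columns of the table.
# Keys are the %clust thresholds; the 99% entry is the SILVA NR99 baseline row.
SSU_COUNTS = {
    "Archaea": {99: 20389, 97: 8934, 95: 5293, 90: 1839, 85: 813},
    "Bacteria": {99: 431166, 97: 165468, 95: 99721, 90: 31979, 85: 10111},
    "Eukaryota": {99: 58940, 97: 28707, 95: 18810, 90: 8464, 85: 4199},
}


def retained_percent(counts, baseline=99):
    """Return {threshold: percent of baseline sequences retained}."""
    base = counts[baseline]
    return {threshold: 100.0 * n / base for threshold, n in counts.items()}


if __name__ == "__main__":
    for kingdom, counts in SSU_COUNTS.items():
        row = ", ".join(
            f"{threshold}%: {pct:.1f}%"
            for threshold, pct in retained_percent(counts).items()
        )
        print(f"{kingdom}: {row}")
```

For example, the 97% Archaea database keeps 8,934 of 20,389 baseline sequences, or roughly 44%.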