Architecture and implementation of ulrb algorithm in R
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-12-01 |
| Series: | Ecological Informatics |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1574954125002389 |
| Summary: | Low-abundance microorganisms, often referred to as the “rare biosphere”, play a crucial role in ecosystem resistance and resilience, but remain challenging to study. One of the main difficulties is the lack of an appropriate definition of rare taxa. Most studies use relative abundance thresholds (e.g., 0.1% relative abundance per sample) to distinguish rare from abundant taxa within a microbial community. This is inappropriate because such thresholds are arbitrary and lack biological meaning. To solve this problem, we proposed using unsupervised machine learning through the ulrb (“Unsupervised Learning Definition of the Microbial Rare Biosphere”) algorithm, implemented as an R package (v0.1.8). The algorithm applies partitioning around medoids (PAM) to cluster the taxa of a community by abundance, for any number of samples. Based on the clusters, ulrb by default classifies taxa automatically as “rare”, “undetermined” or “abundant”. ulrb includes functions for all analytical steps necessary to define the rare biosphere, in four groups: 1) functions that convert user data into the format required by the ulrb algorithm; 2) functions that cluster taxa into abundance classifications; 3) helper functions that report detailed statistics of the clustering steps; and 4) visualization functions, focused on rank abundance curves and silhouette scores, for assessing clustering quality. In addition, ulrb allows the user to change the number of classifications obtained and includes options for detailed reporting. In this article, we describe the architecture, code organization, and strategy of the ulrb R package. Furthermore, we use a 16S rRNA gene amplicon sequencing dataset from the Arctic Ocean to provide illustrative examples, with code, of how to use and explore ulrb capabilities. By explaining the architecture and implementation of ulrb, this study allows independent groups to integrate an abundance classification step into their data analysis protocols, instead of relying on taxa labeled by inconsistent or manual strategies. |
|---|---|
| ISSN: | 1574-9541 |
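
The clustering idea described in the summary can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not use or reproduce the ulrb package. The function names, the evenly spaced initial medoids, the greedy swap search (a simplification of full PAM) and the rule that clusters are labeled by increasing medoid abundance are all assumptions made for illustration. The abundance counts are invented toy values for a single sample.

```python
def clustering_cost(values, medoids):
    """Sum of distances from every value to its nearest medoid."""
    return sum(min(abs(v - values[m]) for m in medoids) for v in values)


def k_medoids_1d(values, k=3):
    """Greedy swap search for k medoids; returns indices into `values`.

    Illustrative simplification of PAM, not the ulrb implementation.
    Assumes len(values) >= k.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    # Assumed initialisation: medoids at k evenly spaced abundance ranks.
    medoids = [order[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    best = clustering_cost(values, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # Try replacing one medoid and keep the swap if cost drops.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = clustering_cost(values, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids


def classify_taxa(values, names=("rare", "undetermined", "abundant")):
    """Label each taxon by its nearest medoid, medoids ranked by abundance."""
    medoids = sorted(k_medoids_1d(values, k=len(names)),
                     key=lambda m: values[m])
    labels = []
    for v in values:
        nearest = min(range(len(medoids)),
                      key=lambda c: abs(v - values[medoids[c]]))
        labels.append(names[nearest])
    return labels


# Toy abundance counts for ten taxa in one sample (invented values).
abundances = [1, 1, 2, 3, 2, 40, 55, 60, 900, 1000]
for count, label in zip(abundances, classify_taxa(abundances)):
    print(count, label)
```

Passing a longer or shorter `names` tuple changes the number of classes, which mirrors the option mentioned in the summary for changing the number of classifications. Unlike a fixed relative abundance threshold, the class boundaries here come from the abundance distribution of the sample itself.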