Microbiome Informatics: OTU vs. ASV

Analyzing targeted microbiome sequencing data requires a different approach to determining the origin of a sequence than typical alignment-based methods. This is because the gene of origin for an amplified target gene is already known, and the goal is to determine its taxonomic origin based upon a potentially small number of variations relative to similar taxa. In the context of whole genome sequencing, a small number of single nucleotide variants (SNVs) caused by sequencer error are unlikely to seriously confound an aligner and have little effect on the final attribution of the sequence.

Targeted sequencing, where the main operation is the comparison of multiple similar sequences rather than alignment across multiple genomes, has the potential to be confounded by erroneous SNVs. This can result in misattribution of the sequence, leading either to detection of a similar but incorrect organism or to the false discovery of a new organism. Fortunately, two strategies have been developed to minimize the effects of targeted sequencing error, each with its own pros, cons, and idiosyncrasies. In this article and video, we will review the logic behind these methods, their applications, and relative advantages and disadvantages.

What is OTU Clustering?

Clustering approaches were initially developed to minimize the risks of sequencer error in targeted sequencing. They are based upon the idea that related or similar organisms will have similar target gene sequences, and that rare sequencing errors will make a trivial contribution, if any, to the consensus sequence for these clusters, or operational taxonomic units (OTUs) [1].

There are three basic methods to generate OTUs from sequencing data, with clusters often being generated using a similarity threshold of 97% sequence identity. This approach carries with it the risk that multiple similar species can be grouped into a single OTU, with their individual identifications being lost to the abstraction of a cluster. Alternatively, some have tried requiring extremely high levels of sequence identity, with thresholds closer to 100%, to minimize the risk of losing diversity to clustering. This, however, creates a significant risk of identifying sequencing errors as new species and reporting false diversity.

The OTU approach clusters similar reads into one representative group, potentially containing more than one organism from the sample.

Reference-free OTU Clustering

The simplest method to understand, although the most computationally complex to carry out, is de novo clustering. De novo clustering requires no reference database and creates the OTU clusters entirely from observed sequences. This method is computationally expensive and difficult to carry out in parallel, resulting in potentially prohibitively long compute times for large sets of sequencing data. Additionally, de novo clustering must be repeated when data are added to or removed from the study. This is because the same sequence may cluster differently depending upon which other sequences were detected in the study.
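The behavior described above can be illustrated with a minimal sketch of greedy, centroid-based de novo clustering. This is a simplified illustration rather than the algorithm of any particular tool: the function names `identity` and `greedy_de_novo_otus` are hypothetical, identity is computed position by position on equal-length sequences (real tools use pairwise alignment), and the 97% threshold follows the convention mentioned earlier.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences.

    Assumption for this sketch: reads are already trimmed to the same length,
    so no alignment is needed.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_de_novo_otus(reads, threshold=0.97):
    """Cluster reads into OTUs without a reference database.

    Unique sequences are visited from most to least abundant. Each one joins
    the first existing centroid it matches at or above the threshold;
    otherwise it becomes the centroid of a new OTU.
    """
    counts = Counter(reads)
    # Most abundant first; ties broken alphabetically so results are repeatable.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    otus = []
    for seq in ordered:
        for otu in otus:
            if identity(seq, otu["centroid"]) >= threshold:
                otu["members"][seq] = counts[seq]
                break
        else:
            otus.append({"centroid": seq, "members": {seq: counts[seq]}})
    return otus
```

Because every sequence is compared against centroids chosen from the current data set, the cluster a sequence lands in depends on which other sequences are present. In this sketch, a sequence that is 96% identical to an abundant centroid founds its own OTU, but the same sequence can be absorbed into a different OTU if that abundant centroid is absent and a closer neighbor becomes the centroid instead. The nested comparison loop also hints at why the method scales poorly with large data sets.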

Reference-based OTU Clustering

A much more computationally efficient method for clustering is closed-reference clustering. As implied by the name, this method uses a reference database of target gene sequences from known taxa and compares discovered sequences to them. This method will also minimize the effects of sequencing errors because a small number of erroneous SNVs is unlikely to change the final consensus sequence from the entire OTU. Additionally, should the sequencing read have sufficient errors to prevent clustering with a reference sequence, closed-reference clustering will drop that read from further analysis. This method, along with being computationally fast, allows for easy comparison between studies using matching reference databases and for the rapid incorporation of new data into a study without having to reanalyze previous results. However, this method carries the disadvantage of being completely dependent on reference sequences, and is thus subject to any errors or biases in the reference database. These biases may be a lesser issue in a well-studied sample type, such as human stool, that has robust representation in the database.

On the other hand, any novel taxa from this source will be lost. If a more unusual or completely novel sample source is being used, closed-reference clustering is likely to be inappropriate, as the reference database is unlikely to have appropriate sequences already deposited for many of the taxa present. To avoid the loss of novel sequences, open-reference clustering was developed: sequences that can be quickly clustered to a reference database are clustered in a manner similar to closed-reference clustering, and the remaining sequences are clustered in a manner similar to de novo clustering.
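The difference between closed-reference and open-reference clustering can be sketched in a few lines. The code below is an illustration under simplifying assumptions, not the procedure of any specific pipeline: the function names and the reference names used in the usage note are hypothetical, identity is computed position by position on equal-length sequences, and the leftover reads are grouped with a very simple greedy pass.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference_assign(reads, references, threshold=0.97):
    """Assign each read to its best-matching reference sequence.

    `references` maps a reference name to its sequence. Reads whose best
    match falls below the threshold are returned separately; a strict
    closed-reference analysis would discard them.
    """
    assigned = {}
    unassigned = []
    for read in reads:
        best_name, best_score = None, 0.0
        for name, ref_seq in references.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= threshold:
            assigned.setdefault(best_name, []).append(read)
        else:
            unassigned.append(read)
    return assigned, unassigned


def open_reference_assign(reads, references, threshold=0.97):
    """Closed-reference assignment first, then de novo grouping of leftovers.

    Each leftover read joins the first novel OTU whose centroid (the first
    read placed in it) it matches at or above the threshold, or starts a
    new novel OTU.
    """
    assigned, leftovers = closed_reference_assign(reads, references, threshold)
    novel_otus = []
    for read in leftovers:
        for otu in novel_otus:
            if identity(read, otu[0]) >= threshold:
                otu.append(read)
                break
        else:
            novel_otus.append([read])
    return assigned, novel_otus
```

With a reference dictionary such as `{"ref_taxon_A": "..."}` (a made-up name), reads near a reference end up under that reference's name in both modes, which is what makes results comparable across studies sharing a database. Reads from a taxon missing from the database appear only in the second return value: lost in closed-reference mode, retained as novel OTUs in open-reference mode.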

What is ASV Analysis?

While OTU clustering approaches attempt to blur similar sequences into an abstracted consensus sequence, thus minimizing the influence of any sequencing errors within the pool of reads, the Amplicon Sequence Variant (ASV) approach attempts to go the opposite direction. The ASV approach will start by determining which exact sequences were read and how many times each exact sequence was read. These data will be combined with an error model for the sequencing run, enabling the comparison of similar reads to determine the probability that a given read at a given frequency is not due to sequencer error. This creates, in essence, a p-value for each exact sequence, where the null-hypothesis is equivalent to that exact sequence being a consequence of sequencing error.

Following this calculation, sequences are filtered according to some threshold value for confidence, leaving behind a collection of exact sequences having a defined statistical confidence. Because these are exact sequences, generated without clustering or reference databases, ASV results can be readily compared between studies using the same target region. Additionally, a given target gene sequence should always generate the same ASV and a given ASV, being an exact sequence, can be compared to a reference database at a much higher resolution allowing for more precise identification down to the species level and even potentially beyond.
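The logic of counting exact sequences, scoring them against an error model, and filtering by a confidence threshold can be sketched as follows. This is a deliberately simplified illustration and not a reimplementation of any published denoiser: it assumes a single uniform per-base error rate (real tools learn quality-aware error rates from the run), equal-length reads, and a Poisson model for the number of error-derived copies. The names `infer_exact_variants`, `per_base_error`, and `alpha`, along with their default values, are hypothetical.

```python
import math
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def error_production_rate(parent, child, per_base_error):
    """Probability that one read of `parent` is misread as exactly `child`.

    Assumes independent errors, each equally likely to yield any of the
    three other bases.
    """
    d = hamming(parent, child)
    return (per_base_error / 3) ** d * (1 - per_base_error) ** (len(parent) - d)


def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), summed directly to keep tiny tails."""
    if k <= 0:
        return 1.0
    if mean <= 0:
        return 0.0
    total, i = 0.0, k
    while True:
        term = math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))
        total += term
        # Past the mode the terms only shrink; stop once they are negligible.
        if i > mean and term <= total * 1e-17:
            break
        i += 1
    return min(1.0, total)


def infer_exact_variants(reads, per_base_error=0.001, alpha=1e-6):
    """Return {sequence: count} for sequences unlikely to be pure error.

    Null hypothesis for each sequence: all of its copies are misreads of a
    more abundant, already accepted sequence. The p-value used is the
    largest one over all accepted candidates, i.e. the most plausible
    error explanation.
    """
    counts = Counter(reads)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    accepted = {}
    for seq in ordered:
        if not accepted:
            accepted[seq] = counts[seq]  # the most abundant sequence seeds the set
            continue
        p_value = max(
            poisson_tail(
                counts[seq],
                counts[parent] * error_production_rate(parent, seq, per_base_error),
            )
            for parent in accepted
        )
        if p_value < alpha:
            accepted[seq] = counts[seq]
    return accepted
```

In this sketch, a one-base variant seen once next to a parent seen 1,000 times is easily explained by error and is filtered out, while a one-base variant seen 50 times is far too abundant to be explained that way and is kept as its own exact sequence. Production denoisers typically also reassign the rejected reads to their most likely parent rather than simply dropping them.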

The ASV approach identifies single, exact sequences that are statistically supported as being present in the sample.

OTU vs. ASV Comparison

There are many arguments that the field should be moving towards an ASV approach [4, 5]. As stated above, ASV approaches can provide a significant advantage in the more precise identification of microbes. In addition, they can provide a more detailed picture of the diversity within a sample. An OTU, being a cluster of multiple similar sequences that may be either "real" sequences from the sample or errors, can contain multiple similar species of microbe lumped into a single unit. An ASV does not have this issue, as even a single base difference in the sequence will result in a unique ASV and a more detailed picture of the diversity of a given sample.

OTU vs. ASV Trade-offs

There is a potentially significant trade-off among the methods of OTU generation: one favors computational ease of generating and comparing OTUs, another favors freedom from reference bias, and a third combines the two for an intermediate result. Closed-reference OTUs are computationally fast and easy for both generation and comparison between samples and studies, but carry a significant risk of reference bias and loss of novel sequences. De novo OTUs are computationally slow, but will retain all sequences from the sample and have no risk of reference bias, as they are generated reference-free.

Open-reference OTUs lie somewhere between these methods, depending upon the nature of the sample. Reference-based OTU approaches are still a valid choice in large, population-based studies such as the Human Microbiome Project [6], which has contributed tremendous insight to the field through the enrollment of large numbers of subjects and the thorough characterization of samples where the expected taxa are already well-defined and well-documented in the reference databases. The concerns about reference bias for these sample types would not be expected to be especially high, and the computational efficiency and ease of adding new data and comparing samples help to keep computational resource requirements under control. By comparison, an analysis of microbes living in a previously unexplored, remote underwater cave in the Amazon, where the water conditions are highly unusual in terms of mineral content, pH, and temperature, would almost certainly require significant de novo OTU generation. This type of scenario would clearly do well to adopt an ASV approach, both to facilitate data comparison and the addition of new data and to ensure that only high-confidence, exact sequences are accepted into reference databases.

OTU vs. ASV Performance against Confounding Factors

The ASV approach carries many advantages when working with difficult samples or when correcting for common confounders that affect either targeted sequence analysis or microbiome analysis in general. When attempting to study low-abundance sequences, OTUs are generally considered to be much more likely to retain rare sequences, although this comes at the cost of higher detection of spurious OTUs [7]. Among the ASV determination programs, DADA2 [8] has been demonstrated to be the most sensitive to low-abundance sequences [9]. In the context of sample contamination, a study using a dilution series of the ZymoBIOMICS Microbial Community Standard showed that, when the nature of both populations was known, ASV-based methods were better able to distinguish sample biomass from contaminant biomass, with the precise nature of ASVs allowing for the best identification of both [5].

Finally, chimera creation is a constant nuisance in targeted sequence studies that, while possible to minimize with optimized library production, is difficult to eliminate entirely. ASVs, being exact sequences, allow for simple detection of chimeric sequences without the potential biasing effect of a reference database. OTU-based chimera detection requires the input of "fuzzy" consensus OTU sequences and can avoid making calls on sequences that are too similar to a parent sequence, as they are likely to end up joined in the same OTU anyway. Because an ASV is an exact sequence, a chimeric ASV can be expected to be the exact or near-exact child of two more prevalent exact parent sequences in the same sample, with one parent contributing the left side and one parent contributing the right side of the chimera. These properties can enable the identification of chimeras by using the alignment of less prevalent ASVs to more prevalent ASVs from the same sample.
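The left-parent/right-parent property described above lends itself to a short sketch. The code below is a minimal illustration that only detects exact two-parent chimeras among equal-length sequences; it is not the chimera-removal routine of any specific tool, and the function names and the `min_fold` parameter are hypothetical.

```python
def is_exact_bimera(candidate, parents):
    """True if `candidate` is the left part of one parent joined to the
    right part of a different parent at some breakpoint.

    Assumption for this sketch: sequences are compared position by position,
    so only parents of the same length as the candidate are considered.
    """
    n = len(candidate)
    for left in parents:
        for right in parents:
            if left == right or len(left) != n or len(right) != n:
                continue
            for breakpoint in range(1, n):
                if (candidate[:breakpoint] == left[:breakpoint]
                        and candidate[breakpoint:] == right[breakpoint:]):
                    return True
    return False


def flag_chimeras(abundances, min_fold=1.0):
    """Return the sequences in `abundances` ({sequence: count}) that look
    like exact chimeras of more prevalent sequences from the same sample.

    A sequence can only act as a parent if it is more than `min_fold` times
    as abundant as the candidate and has not itself been flagged.
    """
    ordered = sorted(abundances, key=lambda s: (-abundances[s], s))
    chimeras = []
    for index, seq in enumerate(ordered):
        parents = [
            p for p in ordered[:index]
            if abundances[p] > abundances[seq] * min_fold and p not in chimeras
        ]
        if is_exact_bimera(seq, parents):
            chimeras.append(seq)
    return chimeras
```

Note that no reference database appears anywhere in the sketch: parents are drawn only from more abundant exact sequences in the same sample. Allowing near-exact matches, as mentioned above, would require tolerating a small number of mismatches on either side of the breakpoint.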

While the OTU approach has served the microbiome community for many years, and will likely still find use for years to come in specific circumstances, the evidence is mounting that the future of targeted sequencing lies with the ASV approach. The ASV approach has several mature bioinformatic applications for analysis [8, 10, 11], each with its own advantages and disadvantages. As the community moves towards methods with increased reproducibility and ease of comparison between studies, the importance of the ASV method will only increase. This will be especially pronounced as more novel sources are analyzed, where reference-based OTU clustering has significant disadvantages in terms of bias and de novo clustering has disadvantages in terms of computation and comparability, while ASV-based methods present significant advantages over both.

References:

  1. Blaxter M, Mann J, Chapman T, et al. Defining operational taxonomic units using DNA barcode data. Philos Trans R Soc Lond B Biol Sci. 2005;360(1462):1935-43. doi:10.1098/rstb.2005.1725
  2. Kunin V, Engelbrektson A, Ochman H, Hugenholtz P. Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol. 2010;12(1):118-23. doi:10.1111/j.1462-2920.2009.02051.x
  3. Callahan BJ, Wong J, Heiner C, et al. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res. 2019;47(18):e103. doi:10.1093/nar/gkz569
  4. Callahan BJ, McMurdie PJ, Holmes SP. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J. 2017;11(12):2639-2643. doi:10.1038/ismej.2017.119
  5. Caruso V, Song X, Asquith M, Karstens L. Performance of Microbiome Sequence Inference Methods in Environments with Varying Biomass. mSystems. 2019;4(1):e00163-18. doi:10.1128/mSystems.00163-18
  6. Gevers D, Knight R, Petrosino JF, et al. The Human Microbiome Project: a community resource for the healthy human microbiome. PLoS Biol. 2012;10(8):e1001377. doi:10.1371/journal.pbio.1001377
  7. Edgar RC. Accuracy of microbial community diversity estimated by closed- and open-reference OTUs. PeerJ. 2017;5:e3889. doi:10.7717/peerj.3889
  8. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods. 2016;13(7):581-3. doi:10.1038/nmeth.3869
  9. Nearing JT, Douglas GM, Comeau AM, Langille MGI. Denoising the Denoisers: an independent evaluation of microbiome sequence error-correction approaches. PeerJ. 2018;6:e5364. doi:10.7717/peerj.5364
  10. Amir A, McDonald D, Navas-Molina JA, et al. Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns. mSystems. 2017;2(2). doi:10.1128/mSystems.00191-16
  11. Edgar RC. UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. bioRxiv. 2016:081257. doi:10.1101/081257
  12. https://www.zymoresearch.com/

December 11, 2020