
Draft:DIAMOND (biotechnology)

From Wikipedia, the free encyclopedia



DIAMOND
Original author: Benjamin J. Buchfink
Developer: Benjamin J. Buchfink
Release: November 17, 2014
Stable release: 2.2.1 / 25 May 2026
Written in: C++
Operating system: UNIX, Linux, Mac, MS-Windows
Type: Bioinformatics tool
License: GNU General Public License version 3
Repository: github.com/bbuchfink/diamond

In bioinformatics, DIAMOND[1][2][3] is an algorithm and program for sequence alignment of protein and translated DNA sequences, designed as a fast alternative to NCBI BLAST. As of June 2026, it has 19,000 citations in the scientific literature[4] and has been built into pipelines such as antiSMASH[5], OrthoFinder[6], HUMAnN2[7], eggNOG-mapper[8], BRAKER2[9], and Anvi'o[10], among others[11].

Background

DTRA awarded a $1 million prize to fast alignment software defending against biothreats to the U.S. armed forces[12]

German computer scientist Benjamin J. Buchfink began developing DIAMOND in late 2012. An early predecessor version, then called SASS ("Sequence Alignment using Spaced Seeds"), was part of the winning entry of the U.S. Defense Threat Reduction Agency's $1 Million Algorithm Challenge[12][13]. At the time, scientists were spending 800,000 CPU hours on a supercomputer to align their metagenomic reads against the KEGG database with BLASTX[14], creating the need for faster software solutions.

The first major version of DIAMOND was published in November 2014[1] and focused on alignment sensitivity at above 50% sequence identity and short read alignment. It reported performance gains of 600 to 22,000-fold vs BLAST and 23 to 500-fold vs RapSearch2[15][1].

The second major version of DIAMOND was published in April 2021[2] and extended the tool towards full pairwise alignment sensitivity (on par with BLAST) down to 20% sequence identity, reporting speedups of 12 to 15-fold vs state-of-the-art methods[2].

In 2021, DIAMOND was used to search through 10.2 petabases of sequencing data in the NCBI Sequence Read Archive, discovering 131,957 novel RNA viruses and thereby expanding the number of known species by roughly an order of magnitude[16]. The computation was completed within 11 days, at a cost of US$23,980[16].

Sequence clustering, a downstream application of alignment, was released as a DIAMOND feature in January 2023. In addition to clustering based on all-vs-all alignment, the tool also offers clustering with linear-time scaling and sensitivity at lower sequence identities that can be run in parallel across multiple compute nodes[3].

DIAMOND is one of 52 benchmarks in SPEC CPU®2026[17].

Algorithm


Like most fast aligners, DIAMOND is based on the "seed-and-extend" concept[18]: it first finds short exact matches (seeds) between query and target (also called subject) sequences, then extends them into longer gapped alignments. Seeds are usually chosen as k-mers; DIAMOND uses spaced k-mers, possibly also in a reduced alphabet.
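The idea of a spaced seed in a reduced alphabet can be illustrated with a minimal Python sketch. The amino acid grouping, the seed shape `110101`, and the function name `spaced_seeds` are assumptions chosen for this example and are not taken from the DIAMOND source code.

```python
# Illustrative reduced alphabet: residues in the same group are treated as equal.
# ASSUMPTION: this grouping is only an example, not DIAMOND's documented setting.
GROUPS = ["KREDQN", "C", "G", "H", "ILV", "M", "F", "Y", "W", "P", "STA"]
# Map every amino acid to the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def spaced_seeds(seq, shape="110101"):
    """Yield (position, seed) for each window of the sequence.

    In the shape, '1' marks a position that must match and '0' marks a
    position that is ignored. The shape used here is hypothetical.
    """
    keep = [i for i, c in enumerate(shape) if c == "1"]
    for pos in range(len(seq) - len(shape) + 1):
        window = seq[pos:pos + len(shape)]
        # Skip windows containing letters outside the 20 standard residues.
        if all(window[i] in REDUCE for i in keep):
            yield pos, "".join(REDUCE[window[i]] for i in keep)


# Two windows that differ at ignored positions and by similar residues
# (K/R, L/I, T/S) produce the same seed and would therefore count as a hit.
print(list(spaced_seeds("MKVLAT")))  # [(0, 'MKIS')]
print(list(spaced_seeds("MRAIGS")))  # [(0, 'MKIS')]
```

Ignoring some positions and merging similar residues lets a seed match tolerate substitutions, which is why such seeds tend to be more sensitive than exact contiguous k-mers of the same weight.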

DIAMOND is an acronym for "Double Indexed Alignment of NGS Data", referring to its concept of building an index for both query and target sequences[1]. Most aligners work by building an index data structure for the targets (database index), then linearly looking up query seeds in it[19]. BLAST indexes the queries and linearly processes the database[20]. DIAMOND indexes both queries and targets, then evaluates seed hits between them in a seed-by-seed order. The main advantage of this approach is better use of CPU caches by reusing cached data for all associated comparisons[1].
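A simplified sketch of double indexing, assuming plain contiguous k-mers as seeds and sorted Python lists as the index: both sequence sets are indexed, and the two sorted indexes are then traversed together so that all comparisons for one seed are handled in one place. The function names `build_index`, `group_by_seed`, and `seed_hits` are hypothetical, and the sketch does not reproduce DIAMOND's actual data structures.

```python
from itertools import groupby
from operator import itemgetter


def build_index(seqs, k=3):
    """Return a list of (seed, seq_id, position) tuples sorted by seed."""
    entries = [
        (s[i:i + k], seq_id, i)
        for seq_id, s in seqs.items()
        for i in range(len(s) - k + 1)
    ]
    entries.sort()
    return entries


def group_by_seed(index):
    """Collapse a sorted index into [(seed, [(seq_id, position), ...]), ...]."""
    return [
        (seed, [(seq_id, pos) for _, seq_id, pos in group])
        for seed, group in groupby(index, key=itemgetter(0))
    ]


def seed_hits(queries, targets, k=3):
    """Merge the query and target indexes seed by seed and list all hits."""
    q = group_by_seed(build_index(queries, k))
    t = group_by_seed(build_index(targets, k))
    i = j = 0
    hits = []
    while i < len(q) and j < len(t):
        if q[i][0] < t[j][0]:
            i += 1
        elif q[i][0] > t[j][0]:
            j += 1
        else:
            # Same seed on both sides: pair every query and target location.
            for q_loc in q[i][1]:
                for t_loc in t[j][1]:
                    hits.append((q[i][0], q_loc, t_loc))
            i += 1
            j += 1
    return hits


queries = {"q1": "MKVLA"}
targets = {"t1": "AKVLG", "t2": "MKVAA"}
for hit in seed_hits(queries, targets):
    print(hit)
# ('KVL', ('q1', 1), ('t1', 1))
# ('MKV', ('q1', 0), ('t2', 0))
```

Because both lists are sorted by seed, the locations belonging to one seed sit next to each other in memory on both sides, which is the property the cache argument above relies on.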

Seed hits are passed through several heuristic filter stages[1] before being subjected to gapped Smith–Waterman[21] extension, which computes the final alignments.
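The Smith–Waterman recurrence behind the final extension step can be sketched as follows. This is a textbook version with an assumed simple scoring scheme (match +2, mismatch -1, linear gap penalty -2); protein aligners typically use a substitution matrix and affine gap penalties instead, and the sketch makes no claim about DIAMOND's optimized implementation.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    ASSUMPTION: the scoring parameters are illustrative defaults only.
    Only two rows of the dynamic programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # A local alignment may restart anywhere, hence the floor at 0.
            curr.append(max(0, diag, up, left))
        best = max(best, max(curr))
        prev = curr
    return best


print(smith_waterman_score("ACDEF", "ACDEF"))  # 10: five matches
print(smith_waterman_score("ACDEF", "ACEF"))   # 6: four matches, one gap
print(smith_waterman_score("AAAA", "CCCC"))    # 0: nothing aligns
```

The full algorithm also records which choice was taken in each cell so that the alignment itself can be recovered by traceback; that step is omitted here to keep the example short.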

See also


References

  1. ^ a b c d e f Buchfink, Benjamin J.; Xie, Chao; Huson, Daniel H. (2014). "Fast and sensitive protein alignment using DIAMOND". Nature Methods. 12 (1): 59–60. doi:10.1038/nmeth.3176. ISSN 1548-7105. PMID 25402007.
  2. ^ a b c Buchfink, Benjamin J.; Reuter, Klaus; Drost, Hajk-Georg (2021). "Sensitive protein alignments at tree-of-life scale using DIAMOND". Nature Methods. 18 (4): 366–368. doi:10.1038/s41592-021-01101-x. ISSN 1548-7105. PMC 8026399. PMID 33828273.
  3. ^ a b Buchfink, Benjamin J.; Barbé, Émile; Ashkenazy, Haim; Reuter, Klaus; Kennedy, John A.; Drost, Hajk-Georg (April 2026). "Clustering the protein universe of life using DIAMOND DeepClust". Nature Methods. 23 (4): 724–727. doi:10.1038/s41592-026-03030-z. ISSN 1548-7105. PMC 13076203. PMID 41876643.
  4. ^ "Benjamin J. Buchfink - Google Scholar".
  5. ^ Blin, Kai; Shaw, Simon; Vader, Lisa; Szenei, Judit; Reitz, Zachary L; Augustijn, Hannah E; Cediel-Becerra, José D D; de Crécy-Lagard, Valérie; Koetsier, Robert A; Williams, Sam E; Cruz-Morales, Pablo; Wongwas, Sopida; Segurado Luchsinger, Alejandro E; Biermann, Friederike; Korenskaia, Aleksandra (2025-04-25). "antiSMASH 8.0: extended gene cluster detection capabilities and analyses of chemistry, enzymology, and regulation". Nucleic Acids Research. 53 (W1): W32–W38. doi:10.1093/nar/gkaf334. ISSN 0305-1048. PMC 12230676. PMID 40276974. Archived from the original on 2026-03-29.
  6. ^ Emms, David M.; Kelly, Steven (2019-11-14). "OrthoFinder: phylogenetic orthology inference for comparative genomics". Genome Biology. 20 (1): 238. doi:10.1186/s13059-019-1832-y. ISSN 1474-760X. PMC 6857279. PMID 31727128.
  7. ^ Franzosa, Eric A.; McIver, Lauren J.; Rahnavard, Gholamali; Thompson, Luke R.; Schirmer, Melanie; Weingart, George; Lipson, Karen Schwarzberg; Knight, Rob; Caporaso, J. Gregory; Segata, Nicola; Huttenhower, Curtis (November 2018). "Species-level functional profiling of metagenomes and metatranscriptomes". Nature Methods. 15 (11): 962–968. doi:10.1038/s41592-018-0176-y. ISSN 1548-7105. PMC 6235447. PMID 30377376.
  8. ^ Cantalapiedra, Carlos P.; Hernández-Plaza, Ana; Letunic, Ivica; Bork, Peer; Huerta-Cepas, Jaime (2021-12-09). "eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale". Molecular Biology and Evolution. 38 (12): 5825–5829. doi:10.1093/molbev/msab293. PMC 8662613. PMID 34597405. Archived from the original on 2026-05-10.
  9. ^ Brůna, Tomáš; Hoff, Katharina J; Lomsadze, Alexandre; Stanke, Mario; Borodovsky, Mark (2021-01-06). "BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database". NAR Genomics and Bioinformatics. 3 (1) lqaa108. doi:10.1093/nargab/lqaa108. ISSN 2631-9268. PMC 7787252. PMID 33575650. Archived from the original on 2026-05-19.
  10. ^ Eren, A. Murat; Esen, Özcan C.; Quince, Christopher; Vineis, Joseph H.; Morrison, Hilary G.; Sogin, Mitchell L.; Delmont, Tom O. (2015-10-08). "Anvi'o: an advanced analysis and visualization platform for 'omics data". PeerJ. 3 e1319. doi:10.7717/peerj.1319. ISSN 2167-8359. PMC 4614810. PMID 26500826.
  11. ^ "Applications - bbuchfink/diamond Wiki". GitHub.
  12. ^ a b Defense Threat Reduction Agency (DTRA) (2013). "DTRA/SCC-WMD Announces $1 Million Algorithm Challenge Winner". www.prweb.com. Retrieved 2025-12-24.
  13. ^ Buchfink, Benjamin; Huson, Daniel H.; Xie, Chao (2015-11-25). "MetaScope - Fast and accurate identification of microbes in metagenomic sequencing data". arXiv:1511.08753v1 [q-bio.GN].
  14. ^ Jansson, Janet (2011). "Towards tera terra: Terabase sequencing of terrestrial metagenomics". Lawrence Berkeley National Laboratory.
  15. ^ Zhao, Yongan; Tang, Haixu; Ye, Yuzhen (2012-01-01). "RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data". Bioinformatics (Oxford, England). 28 (1): 125–126. doi:10.1093/bioinformatics/btr595. ISSN 1367-4811. PMC 3244761. PMID 22039206.
  16. ^ a b Edgar, Robert C.; Taylor, Brie; Lin, Victor; Altman, Tomer; Barbera, Pierre; Meleshko, Dmitry; Lohr, Dan; Novakovsky, Gherman; Buchfink, Benjamin; Al-Shayeb, Basem; Banfield, Jillian F.; de la Peña, Marcos; Korobeynikov, Anton; Chikhi, Rayan; Babaian, Artem (February 2022). "Petabase-scale sequence alignment catalyses viral discovery". Nature. 602 (7895): 142–147. Bibcode:2022Natur.602..142E. doi:10.1038/s41586-021-04332-2. ISSN 1476-4687. PMID 35082445.
  17. ^ Madhav, Mahesh; Lee, Allen; Mejia, Andres; Moore, Branden; Soppadandi, Charan; Cambly, Chris; Müllner, Christoph; Bowers, Daniel; Reiner, David (2026-05-02), SPEC CPU: The Next Generation, arXiv:2605.01575
  18. ^ Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein similarity searches". Science. 227 (4693): 1435–41. Bibcode:1985Sci...227.1435L. doi:10.1126/science.2983426. PMID 2983426.
  19. ^ Langmead, Ben; Cole Trapnell; Mihai Pop; Steven L Salzberg (4 March 2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biology. 10 (3): 10:R25. doi:10.1186/gb-2009-10-3-r25. PMC 2690996. PMID 19261174.
  20. ^ Stephen Altschul; Warren Gish; Webb Miller; Eugene Myers; David J. Lipman (1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712. S2CID 14441902.
  21. ^ Smith, Temple F. & Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences" (PDF). Journal of Molecular Biology. 147 (1): 195–197. CiteSeerX 10.1.1.63.2897. doi:10.1016/0022-2836(81)90087-5. PMID 7265238.