Journal article, Bioinformatics, Year: 2023

Scalable sequence database search using partitioned aggregated Bloom comb trees

Abstract

Motivation: The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making immense public resources searchable is beyond the reach of alignment-based strategies. In recent years, an abundant literature has tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants with scalability to collections of up to 10 000 eukaryotic samples.

Results: Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3- to 6-fold improvement in construction time compared to other compressed methods for comparable index size. In favorable instances, a PAC query can require a single random access and be performed in constant time. Using limited computational resources, we built PAC indexes for very large collections. These include 32 000 human RNA-seq samples indexed in 5 days and the entire GenBank bacterial genome collection indexed in a single day, for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed PAC's ability to query 500 000 transcript sequences in less than an hour.

Availability and implementation: PAC's open-source software is available at https://github.com/Malfoy/PAC.
Main file
btad225.pdf (533.7 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-04279824, version 1 (14-11-2023)

Identifiers

HAL Id: hal-04279824
DOI: 10.1093/bioinformatics/btad225

Cite

Camille Marchet, Antoine Limasset. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics, 2023, 39 (Supplement_1), pp.i252-i259. ⟨10.1093/bioinformatics/btad225⟩. ⟨hal-04279824⟩