Scalable sequence database search using partitioned aggregated Bloom comb trees

Camille Marchet; Antoine Limasset

doi:10.1093/bioinformatics/btad225

Journal article, Bioinformatics, Year: 2023

Camille Marchet

Role: Author

Institut des sciences informatiques et de leurs interactions - CNRS Sciences informatiques

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Antoine Limasset

Role: Author

Institut des sciences informatiques et de leurs interactions - CNRS Sciences informatiques

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Centre National de la Recherche Scientifique

Abstract

Motivation: The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making immense public resources searchable is beyond the reach of alignment-based strategies. In recent years, abundant literature has tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants with scalability to collections of up to 10 000 eukaryotic samples.

Results: Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3- to 6-fold improvement in construction time compared to other compressed methods for comparable index size. In favorable instances, a PAC query can need a single random access and be performed in constant time. Using limited computation resources, we built PAC for very large collections. They include 32 000 human RNA-seq samples in 5 days, and the entire GenBank bacterial genome collection in a single day for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed PAC's ability to query 500 000 transcript sequences in less than an hour.

Availability and implementation: PAC's open-source software is available at https://github.com/Malfoy/PAC.
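To make the notion of a k-mer-based approximate membership query concrete, here is a minimal Python sketch that keeps one plain Bloom filter per dataset and reports the datasets sharing a large enough fraction of the query's k-mers. It is not PAC's data structure (the partitioned, aggregated Bloom comb trees are described in the paper itself). The names `BloomFilter`, `build_index`, and `query`, the filter size, the number of hash functions, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every substring of length k (k-mer) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Plain Bloom filter: false positives possible, false negatives not."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # One salted BLAKE2b hash per hash function (illustrative choice).
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_index(datasets, k, n_bits=1 << 16, n_hashes=3):
    """Map each dataset name to a Bloom filter holding all of its k-mers."""
    index = {}
    for name, sequences in datasets.items():
        bf = BloomFilter(n_bits, n_hashes)
        for seq in sequences:
            for km in kmers(seq, k):
                bf.add(km)
        index[name] = bf
    return index


def query(index, seq, k, threshold=0.8):
    """Return {dataset: fraction of query k-mers found} above the threshold."""
    query_kmers = list(kmers(seq, k))
    if not query_kmers:
        return {}
    hits = {}
    for name, bf in index.items():
        frac = sum(km in bf for km in query_kmers) / len(query_kmers)
        if frac >= threshold:
            hits[name] = frac
    return hits


if __name__ == "__main__":
    toy = {"sample_a": ["ACGTACGTTGCA"], "sample_b": ["TTTTTTTTTTTT"]}
    idx = build_index(toy, k=4)
    print(query(idx, "ACGTACGT", k=4))
```

In this naive layout every query k-mer is looked up in every dataset's filter, so query cost grows linearly with the number of datasets. That cost is what more elaborate structures for large collections aim to reduce.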

Domains

Bioinformatics [q-bio.QM] Computer Science [cs] Life Sciences [q-bio]

Main file

btad225.pdf (533.7 KB)

Origin: Publisher files allowed on an open archive

https://hal.science/hal-04279824

Submitted on: Tuesday, 14 November 2023, 14:53:14

Last modified on: Friday, 12 April 2024, 10:35:17

Dates and versions

hal-04279824 , version 1 (14-11-2023)

Identifiers

HAL Id : hal-04279824 , version 1
DOI : 10.1093/bioinformatics/btad225

Cite

Camille Marchet, Antoine Limasset. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics, 2023, 39 (Supplement_1), pp.i252-i259. ⟨10.1093/bioinformatics/btad225⟩. ⟨hal-04279824⟩

Collections

CNRS CRISTAL CRISTAL-BONSAI UNIV-LILLE ANR

34 Views

8 Downloads
