Improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index Bloom filters
ABSTRACT Alignment-free classification of sequences against collections of sequences has enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines. These methods were originally hash-table based, and much work has since gone into reducing the memory required to index k-mer sequences by using probabilistic indexing strategies. These efforts have produced highly efficient, low-memory indexes, but because they are k-mer based, they often lack sensitivity in the presence of sequencing errors or polymorphisms. To address this, we designed a new memory-efficient data structure, the multi-index Bloom filter, which can tolerate mismatches by using multiple spaced seeds. It is implemented as part of BioBloom Tools, and we demonstrate our algorithm in two applications: read binning for targeted assembly and taxonomic read assignment. For read binning, our tool shows higher sensitivity and specificity than BWA-MEM in an order of magnitude less time. For taxonomic classification, it shows higher sensitivity than CLARK-S in an order of magnitude less time while using half the memory.
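As a rough illustration of why multiple spaced seeds tolerate mismatches where exact k-mer matching does not, the following minimal Python sketch may help. It is not the BioBloom Tools implementation: the seed patterns, the sequences, and the function names (`mask`, `build_index`, `matching_seeds`) are all hypothetical, and a plain Python `set` stands in for the probabilistic Bloom filter.

```python
def mask(word, seed):
    """Keep bases where the seed has '1'; replace '0' (don't-care) positions with '*'."""
    return "".join(base if flag == "1" else "*" for base, flag in zip(word, seed))


def build_index(sequence, seeds):
    """Index every window of the sequence once per seed.

    Assumption: all seeds have the same length k. A set per seed stands in
    for the Bloom filter used by a real probabilistic index.
    """
    k = len(seeds[0])
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return {seed: {mask(w, seed) for w in windows} for seed in seeds}


def matching_seeds(word, index):
    """Return the seeds under which the masked word is found in the index."""
    return [seed for seed, words in index.items() if mask(word, seed) in words]


# Hypothetical seed patterns chosen only for illustration.
SEEDS = ["11011011", "10110110"]
index = build_index("ACGTACGTTGCA", SEEDS)

exact = "ACGTACGT"    # occurs in the indexed sequence
mutated = "ACTTACGT"  # same word with one mismatch at position 2 (0-based)

print(matching_seeds(exact, index))    # both seeds match
print(matching_seeds(mutated, index))  # only the seed that ignores position 2 matches
```

In this sketch the mutated word would be missed by exact 8-mer lookup, but it is still found through the seed whose don't-care position covers the mismatch. Using several seeds with different don't-care positions raises the chance that at least one seed avoids any given error.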
Yeo Sarah, Chu Justin, Chiu Readman, Tse Jeffery, Mohamadi Hamid, Erhan Emre, Birol Inanc
Bioinformatics Technology Lab, Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency (all authors); Department of Bioinformatics, University of British Columbia (Chu Justin, Erhan Emre, Birol Inanc)
Subject areas: biological science research methods and techniques; computing and computer technology
Yeo Sarah, Chu Justin, Chiu Readman, Tse Jeffery, Mohamadi Hamid, Erhan Emre, Birol Inanc. Improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index Bloom filters [EB/OL]. (2025-03-28) [2025-05-22]. https://www.biorxiv.org/content/10.1101/434795.