osptBWT:动态增长字符串集合的次优BWT在线计算方法
Burrows-Wheeler变换(BWT)是现代压缩技术与生物信息学应用中的核心技术。针对动态增长的字符串集合构造BWT仍然是一个挑战,现有面向字符串集合的最优BWT构建算法(optBWT)能显著减低目标字符串集合的BWT游程数r,但在新增字符串时需要完全重构整个BWT。为解决这一问题,本文提出一种基于动态插入区间的在线BWT构建算法--动态增长字符串集合的次优BWT在线计算方法(osptBWT)。该算法在每次新增字符串时仅需O(r log n)比特的空间与O(m log r)的时间(其中n为数据集长度,m为新增字符串长度,r为游程数),即可生成与optBWT游程数r仅存在微小差距的次优BWT。实验结果表明,在七个真实基因组数据集上,osptBWT产生的平均游程数较optBWT仅高出1.41%,且优于其他六种BWT构建算法。此外,其峰值内存使用量仅为当前最佳在线BWT构建算法的63%。在新增字符串规模远小于原始字符串集合的场景下,相较于离线optBWT算法,osptBWT展现出更快的构建速度与更低的峰值内存消耗。项目源码见:https://github.com/xiaoYu0103/osptBWT.git。
he Burrows-Wheeler Transform (BWT) is a core technology in many modern compression and bioinformatics applications. Constructing the BWT for dynamically growing string collections remains a challenge. The existing optimal BWT construction algorithm for string collections (optBWT) can significantly reduce the number of BWT runs r for a given string collection. However, it requires reconstructing the entire BWT when new strings are added. To address this issue, this paper proposes an online BWT construction algorithm based on dynamic insertion interval-Online Computation of Suboptimal BWT(osptBWT). This algorithm requires only O(rlogn) bits of space and O(mlogr) time (where n is the dataset length, m is the length of the newly added string, and r is the number of runs) for each new string added to produce a BWT that is suboptimal by only a small margin compared to optBWT in terms of runs r. Experimental results show that, across seven real-world genomic datasets, the average number of runs produced by osptBWT is only 1.41% higher than that of optBWT, outperforming six other BWT construction algorithms. Moreover, the peak memory usage is only 63% of the best-performing online BWT construction algorithm. In scenarios where the newly added strings is significantly smaller than the original string collection, osptBWT achieves faster construction speed and lower peak memory usage compared to the offline optBWT algorithm. Source code is available at https://github.com/xiaoYu0103/osptBWT.git.
喻心武、瞿有利
北京交通大学交通大数据与人工智能教育部重点实验室, 北京 100044北京交通大学交通大数据与人工智能教育部重点实验室, 北京 100044
计算技术、计算机技术
软件工程、无损压缩、Burrows-Wheeler变换(BWT)、基因组数据
Software EngineeringLossless CompressionBurrows-Wheeler Transform (BWT)Genomic Data
喻心武,瞿有利.osptBWT:动态增长字符串集合的次优BWT在线计算方法[EB/OL].(2025-04-22)[2025-04-26].http://www.paper.edu.cn/releasepaper/content/202504-187.点此复制
评论