Hercules: a profile HMM-based hybrid error correction algorithm for long reads

Source: bioRxiv
Abstract

Motivation: Choosing between second- and third-generation sequencing platforms involves a trade-off between accuracy and read length. Several applications, including de novo assembly and fusion and structural variation detection, require reads that are both long and accurate. In such cases, researchers often combine both technologies and correct the more error-prone long reads using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies.

Results: We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. It learns a posterior transition/emission probability distribution for each long read and uses this distribution to correct errors in that read. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms, and the highest accuracy when most of the base pairs of a long read are covered with short reads.

Availability: Hercules source code is available at https://github.com/BilkentCompGen/Hercules
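To make the core idea concrete, here is a minimal Python sketch of how per-position emission probabilities for a long read could be initialized from an error rate and then updated with short-read evidence. It is not the Hercules implementation: it keeps only one match state per position, drops the insertion, deletion, and transition parameters of a real profile HMM, and replaces Forward-Backward training with a simple pseudo-count update. The function names, the 15% error rate, the `(start, sequence)` alignment format, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"


def init_emissions(long_read, error_rate=0.15):
    """Prior emission distribution for each long-read position.

    The observed base gets 1 - error_rate; the other three bases share
    error_rate equally. The 0.15 default is an illustrative assumption.
    """
    return [
        {b: (1.0 - error_rate) if b == base else error_rate / 3.0 for b in BASES}
        for base in long_read
    ]


def update_emissions(emissions, aligned_short_reads, prior_weight=1.0):
    """Combine the prior with short-read evidence (pseudo-count update).

    aligned_short_reads is a list of (start, sequence) pairs describing
    gap-free alignments to the long read (a hypothetical input format).
    The prior counts as prior_weight observations at every position.
    """
    counts = [Counter() for _ in emissions]
    for start, seq in aligned_short_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(emissions) and base in BASES:
                counts[pos][base] += 1

    posterior = []
    for prior, count in zip(emissions, counts):
        total = prior_weight + sum(count.values())
        posterior.append(
            {b: (prior_weight * prior[b] + count[b]) / total for b in BASES}
        )
    return posterior


def decode(posterior):
    """Pick the most probable base at every position."""
    return "".join(max(BASES, key=lambda b: dist[b]) for dist in posterior)


# Toy data: position 3 of the long read disagrees with all three short reads.
long_read = "ACGTTCGA"
short_reads = [(0, "ACGA"), (2, "GATC"), (3, "ATCGA")]

corrected = decode(update_emissions(init_emissions(long_read), short_reads))
print(corrected)
```

In this toy setup, positions with no short-read coverage keep the original long-read base, which loosely mirrors the abstract's observation that accuracy is highest when most base pairs of a long read are covered by short reads.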

Firtina Can, Bar-Joseph Ziv, Cicek A. Ercument, Alkan Can

Department of Computer Engineering, Bilkent University
Computational Biology Department, School of Computer Science, Carnegie Mellon University
Department of Computer Engineering, Bilkent University; Computational Biology Department, School of Computer Science, Carnegie Mellon University
Department of Computer Engineering, Bilkent University

DOI: 10.1101/233080

Biological science research methods and techniques; computing and computer technology; bioengineering

Firtina Can, Bar-Joseph Ziv, Cicek A. Ercument, Alkan Can. Hercules: a profile HMM-based hybrid error correction algorithm for long reads [EB/OL]. (2025-03-28) [2025-05-03]. https://www.biorxiv.org/content/10.1101/233080.
