extgfa: a low-memory on-disk representation of genome graphs

Source: bioRxiv
Abstract

The representation of genomes and genomic sequences through graph structures has developed rapidly in recent years, particularly to accommodate the growing size of the genome sequences being produced. Genome graphs have been employed extensively for a variety of purposes, including assembly, variant detection, visualization, alignment, and pangenomics. Many tools have been developed to work with and manipulate such graphs. However, most of these tools load the complete graph into memory, which imposes a significant burden even for relatively straightforward operations such as extracting subgraphs or executing basic algorithms like breadth-first search (BFS) or depth-first search. In procedurally generated open-world games like Minecraft, it is not feasible to load the complete world into memory. Instead, a mechanism is needed that keeps most of the world on disk and loads parts only when required. Accordingly, the world is partitioned into chunks, which are loaded or unloaded based on their distance from the player. Furthermore, to conserve memory, the system unloads chunks that are no longer in use, based on the player's movement direction, and sends them back to disk. In this paper, we investigate the potential of employing a similar mechanism on genome graphs. To this end, we have developed a proof-of-concept implementation, which we call "extgfa" (for external GFA). Our implementation applies a similar chunking mechanism to genome graphs, whereby only the necessary parts of the graph are loaded and the rest stays on disk. We demonstrate that this proof-of-concept implementation improves the memory profile when running an algorithm such as BFS on a large graph, and that it reduces memory usage by more than one order of magnitude for certain BFS parameters.
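The chunking idea described in the abstract can be illustrated with a minimal Python sketch. This is not the extgfa implementation or its API: the names `write_chunks`, `ChunkedGraph`, `bfs`, and `demo`, the JSON chunk files, the hand-written node-to-chunk assignment, and the least-recently-used eviction rule are all assumptions made for illustration. In particular, least-recently-used eviction stands in for the distance-based unloading the abstract describes, and the node-to-chunk index is simply kept in memory.

```python
import json
import os
import tempfile
from collections import OrderedDict, deque


def write_chunks(adjacency, node_to_chunk, directory):
    """Write each chunk's adjacency lists to its own JSON file on disk."""
    chunks = {}
    for node, neighbors in adjacency.items():
        chunks.setdefault(node_to_chunk[node], {})[node] = neighbors
    for chunk_id, nodes in chunks.items():
        path = os.path.join(directory, f"chunk_{chunk_id}.json")
        with open(path, "w") as fh:
            json.dump(nodes, fh)


class ChunkedGraph:
    """Hypothetical graph that keeps at most `max_loaded` chunks in memory."""

    def __init__(self, directory, node_to_chunk, max_loaded=2):
        self.directory = directory
        self.node_to_chunk = node_to_chunk  # assumed small enough to keep in memory
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()  # chunk_id -> {node: [neighbors]}
        self.loads = 0  # number of chunk reads from disk

    def neighbors(self, node):
        chunk_id = self.node_to_chunk[node]
        if chunk_id in self.loaded:
            self.loaded.move_to_end(chunk_id)  # mark as most recently used
        else:
            path = os.path.join(self.directory, f"chunk_{chunk_id}.json")
            with open(path) as fh:
                self.loaded[chunk_id] = json.load(fh)
            self.loads += 1
            if len(self.loaded) > self.max_loaded:
                self.loaded.popitem(last=False)  # drop least recently used chunk
        return self.loaded[chunk_id][node]


def bfs(graph, start, max_depth):
    """Breadth-first search that expands nodes up to `max_depth` edges from `start`."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand further, so no new chunks are touched
        for nxt in graph.neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order


def demo(max_depth):
    """Run BFS from node "a" on a six-node path graph split into three chunks."""
    adjacency = {
        "a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
        "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
    }
    node_to_chunk = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}
    with tempfile.TemporaryDirectory() as directory:
        write_chunks(adjacency, node_to_chunk, directory)
        graph = ChunkedGraph(directory, node_to_chunk, max_loaded=2)
        order = bfs(graph, "a", max_depth)
        return order, sorted(graph.loaded), graph.loads
```

In this toy setup, `demo(2)` reads only the chunk containing the start node, while `demo(5)` visits all six nodes yet never holds more than two chunks in memory at once. How the graph is partitioned into chunks is left out of the sketch, since the abstract does not specify it.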

Dabbaghie Fawaz

10.1101/2024.11.29.626045

Research methods and techniques in biological sciences; computing and computer technology

Dabbaghie Fawaz. extgfa: a low-memory on-disk representation of genome graphs [EB/OL]. (2025-03-28) [2025-04-27]. https://www.biorxiv.org/content/10.1101/2024.11.29.626045.
