Foldcomp: a library and format for compressing and indexing large protein structure sets
Summary: Highly accurate protein structure predictors have generated hundreds of millions of protein structures, which are challenging to store and process. Here we present Foldcomp, a novel lossy structure compression algorithm and indexing system that addresses this challenge. By combining internal and Cartesian coordinates with a bi-directional NeRF-based strategy, Foldcomp improves the compression ratio by a factor of 3 compared to the next best method. Its reconstruction error of 0.08 Å is comparable to that of the best lossy compressor. It is 5 times faster than the next fastest compressor, and its decompression speed is competitive with the fastest decompressors. With its multi-threaded implementation and a Python interface that allows easy database downloads and efficient querying of protein structures by accession, Foldcomp is a powerful tool for managing and analyzing large collections of protein structures.
Availability: Foldcomp is a free open-source library and command-line software available for Linux, macOS and Windows at https://foldcomp.foldseek.com. The AlphaFold Swiss-Prot (2.9 GB), TrEMBL (1.1 TB) and ESMatlas HQ (114 GB) databases are provided ready for download.
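The summary refers to a NeRF-based (Natural extension Reference Frame) conversion from internal coordinates back to Cartesian coordinates. As a rough illustration only, the sketch below places one atom from three already-placed atoms plus a bond length, bond angle and torsion angle, then checks the round trip. It is not Foldcomp's implementation: the function names `place_atom` and `torsion_angle`, the example coordinates and the example internal coordinates are all assumptions made for this sketch, and the bi-directional combination and the lossy encoding of angles are not shown.

```python
import math

def sub(u, v):
    return tuple(ui - vi for ui, vi in zip(u, v))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    length = math.sqrt(dot(u, u))
    return tuple(ui / length for ui in u)

def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d from atoms a, b, c (hypothetical helper, angles in radians).

    bond_length is |c-d|, bond_angle is the angle b-c-d,
    torsion is the dihedral a-b-c-d.
    """
    bc = unit(sub(c, b))                 # local x axis, along the b->c bond
    n = unit(cross(sub(b, a), bc))       # normal of the a-b-c plane
    m = cross(n, bc)                     # completes the right-handed frame
    # Coordinates of d in the local frame centred on c.
    x = -bond_length * math.cos(bond_angle)
    y = bond_length * math.sin(bond_angle) * math.cos(torsion)
    z = bond_length * math.sin(bond_angle) * math.sin(torsion)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))

def torsion_angle(a, b, c, d):
    """Signed dihedral a-b-c-d in radians, used here to check the round trip."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(dot(cross(n1, n2), unit(b2)), dot(n1, n2))

# Made-up example values, not taken from any real structure.
a = (-0.5, 1.2, 0.3)
b = (0.0, 0.0, 0.0)
c = (1.5, 0.2, 0.0)
d = place_atom(a, b, c, bond_length=1.33, bond_angle=2.03, torsion=-1.0)

recovered_length = math.sqrt(dot(sub(d, c), sub(d, c)))
recovered_angle = math.acos(dot(unit(sub(b, c)), unit(sub(d, c))))
recovered_torsion = torsion_angle(a, b, c, d)
```

Because each atom is placed relative to the three before it, any rounding of the stored angles accumulates along the chain. Rebuilding from both ends and combining the two results, as the bi-directional strategy in the summary suggests, is one way to limit that accumulation.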
Mirdita Milot, Steinegger Martin, Kim Hyunbin
Research methods and techniques in biological sciences; computing and computer technology; bioengineering
Mirdita Milot, Steinegger Martin, Kim Hyunbin. Foldcomp: a library and format for compressing and indexing large protein structure sets [EB/OL]. (2025-03-28) [2025-07-17]. https://www.biorxiv.org/content/10.1101/2022.12.09.519715.