ProTokens: Probabilistic Vocabulary for Compact and Informative Encodings of All-Atom Protein Structures

Source: bioRxiv
Abstract

Designing protein structures towards specific functions is of great value for science, industry and therapeutics. Although backbones can be designed with arbitrary variety in coordinate space, the generated structures may not be stabilized by any combination of natural amino acids, which gives many design approaches a high risk of failure. Aiming to sketch a compact space for designable protein structures, we develop a probabilistic tokenization theory for metastable protein structures. We present an unsupervised learning strategy, which couples inverse folding with structure prediction, to encode protein structures into amino-acid-like tokens and decode them back to atom coordinates. We show that tokenizing protein structures variationally can lead to compact and informative representations (ProTokens). Compared to amino acids (Anfinsen's tokens), ProTokens are easier to detokenize and more descriptive of finer conformational ensembles. Therefore, protein structures can be efficiently compressed, stored, aligned and compared in the form of ProTokens. By unifying 1-dimensional and 3-dimensional representations of protein structures, ProTokens also enable all-atom protein structure design via various generative models without concerns about symmetry or modality mismatch. We demonstrate that generative pretraining over the ProToken vocabulary allows scalable foundation models to perceive, process and explore the microscopic structures of biomolecules effectively.

Lin Xiaohan, Feng Shihao, Gao Yi Qin, Li Yanheng, Ma Zicheng, Chen Zhenyu, Cao Ziqiang, Zhang Jun, Fan Chuanliu

DOI: 10.1101/2023.11.27.568722

Bioengineering, Biophysics, Molecular Biology

Lin Xiaohan, Feng Shihao, Gao Yi Qin, Li Yanheng, Ma Zicheng, Chen Zhenyu, Cao Ziqiang, Zhang Jun, Fan Chuanliu. ProTokens: Probabilistic Vocabulary for Compact and Informative Encodings of All-Atom Protein Structures [EB/OL]. (2025-03-28) [2025-05-07]. https://www.biorxiv.org/content/10.1101/2023.11.27.568722.
