preciseTAD: A machine learning framework for precise 3D domain boundary prediction at base-level resolution

Source: bioRxiv
Abstract

High-throughput chromosome conformation capture technology (Hi-C) revealed extensive DNA looping and folding into discrete 3D domains. These include Topologically Associating Domains (TADs) and chromatin loops, which are critical for cellular processes like gene regulation and cell differentiation. The relatively low resolution of Hi-C data (regions of several kilobases in size) prevents precise mapping of domain boundaries. However, the high resolution of genomic annotations associated with boundaries, such as CTCF and members of the cohesin complex, suggests they can inform the precise location of domain boundaries. Several methods have attempted to leverage genome annotation data for predicting domain boundaries; however, they overlooked key characteristics of the data, such as spatial associations between an annotation and a boundary, and the much smaller number of boundaries relative to the rest of the genome (class imbalance). We developed preciseTAD, an optimized random forest model to improve the location of domain boundaries. Trained on high-resolution genome annotation data and boundaries from low-resolution Hi-C data, the model predicts the location of boundaries at base-level resolution. We investigated several feature engineering and resampling techniques (random over- and undersampling, Synthetic Minority Over-sampling TEchnique (SMOTE)) to select optimal data characteristics and address class imbalance. Density-based clustering and scalable partitioning techniques were used to identify the precise location of boundary regions and summit points. We benchmarked our method against the Arrowhead domain caller and a novel chromatin loop prediction algorithm, Peakachu, on the two most annotated cell lines. We found that the spatial relationship (distance in the linear genome) between boundaries and annotations has the most predictive power. Transcription factor binding sites outperformed other genome annotation types. Random under-sampling significantly improved model performance. Boundaries predicted by preciseTAD were more enriched for CTCF, RAD21, SMC3, and ZNF143 signal and more conserved across cell lines, highlighting their higher biological significance. The model pre-trained in one cell line performs well in predicting boundaries in another cell line using only genomic annotations, enabling the detection of domain boundaries in cells without Hi-C data. Our study implements the method and the pre-trained models for precise domain boundary prediction using genome annotation data. The precise identification of domain boundaries will improve our understanding of how genomic regulators shape the 3D structure of the genome. The preciseTAD R package is available at https://dozmorovlab.github.io/preciseTAD/ and Bioconductor (submitted).
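
Two ideas from the abstract, distance in the linear genome as a predictor and random under-sampling to counter class imbalance, can be illustrated with a minimal sketch. This is not the preciseTAD implementation (which is an R package): the function names, bin midpoints, annotation positions, and labels below are hypothetical and chosen only for illustration, and the sketch uses the Python standard library alone.

```python
import bisect
import random


def distance_to_nearest(position, sorted_sites):
    """Distance in bases from `position` to the closest annotation site.

    `sorted_sites` must be a non-empty, ascending list of coordinates.
    """
    if not sorted_sites:
        raise ValueError("sorted_sites must not be empty")
    i = bisect.bisect_left(sorted_sites, position)
    candidates = []
    if i < len(sorted_sites):
        candidates.append(sorted_sites[i] - position)      # nearest site at or to the right
    if i > 0:
        candidates.append(position - sorted_sites[i - 1])  # nearest site to the left
    return min(candidates)


def random_undersample(features, labels, seed=0):
    """Keep every minority-class row and an equal-sized random sample of the majority class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [features[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: midpoints of 10 kb genomic bins, positions of
# CTCF-like binding sites, and a 0/1 label marking boundary bins.
bin_midpoints = [5_000, 15_000, 25_000, 35_000, 45_000, 55_000]
ctcf_sites = [14_800, 44_900]
labels = [0, 1, 0, 0, 1, 0]

# One feature per bin: distance to the nearest annotation site.
features = [[distance_to_nearest(m, ctcf_sites)] for m in bin_midpoints]

# Balance the classes before fitting any classifier.
balanced_X, balanced_y = random_undersample(features, labels)
```

In this toy setup the two boundary bins sit within a few hundred bases of an annotation site while the other bins are roughly 10 kb away, and under-sampling reduces four non-boundary bins to two so that both classes are equally represented in the training set.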

Dozmorov Mikhail G., Stilianoudakis Spiro C.

Dept. of Biostatistics, Virginia Commonwealth University

DOI: 10.1101/2020.09.03.282186

Research methods and techniques in biological sciences; Current status and development of biological sciences; Molecular biology

Dozmorov Mikhail G., Stilianoudakis Spiro C. preciseTAD: A machine learning framework for precise 3D domain boundary prediction at base-level resolution [EB/OL]. (2025-03-28) [2025-05-02]. https://www.biorxiv.org/content/10.1101/2020.09.03.282186.