Bridging Chaos Game Representations and $k$-mer Frequencies of DNA Sequences

Source: arXiv
Abstract

This paper establishes formal mathematical foundations linking Chaos Game Representations (CGR) of DNA sequences to their underlying $k$-mer frequencies. We prove that the Frequency CGR (FCGR) of order $k$ is mathematically equivalent to a discretization of CGR at resolution $2^k \times 2^k$, and its vectorization corresponds to the $k$-mer frequencies of the sequence. Additionally, we characterize how symmetry transformations of CGR images correspond to specific nucleotide permutations in the originating sequences. Leveraging these insights, we introduce an algorithm that generates synthetic DNA sequences from prescribed $k$-mer distributions by constructing Eulerian paths on De Bruijn multigraphs. This enables reconstruction of sequences matching target $k$-mer profiles with arbitrarily high precision, facilitating the creation of synthetic CGR images for applications such as data augmentation for machine learning-based taxonomic classification of DNA sequences. Numerical experiments validate the effectiveness of our method across both real genomic data and artificially sampled distributions. To our knowledge, this is the first comprehensive framework that unifies CGR geometry, $k$-mer statistics, and sequence reconstruction, offering new tools for genomic analysis and visualization.
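The two constructions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the corner layout in `CORNERS`, the function names, and the use of Hierholzer's algorithm for the Eulerian path are all assumptions made for illustration. The sketch bins CGR points into a $2^k \times 2^k$ grid, builds the same grid directly from $k$-mer counts, and rebuilds a sequence with a given $k$-mer profile (for $k \ge 2$) by walking an Eulerian path on the De Bruijn multigraph. The reconstruction assumes the counts come from an actual sequence, so that an Eulerian path exists.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Assumed corner layout as (x, y); conventions differ between papers.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fcgr_from_cgr(seq, k):
    """Bin CGR points into a 2^k x 2^k grid, skipping the first k-1 points."""
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    x = y = Fraction(1, 2)  # exact arithmetic keeps points off cell borders
    for i, base in enumerate(seq):
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway toward the corner
        if i >= k - 1:  # the point now encodes a full k-mer
            grid[int(y * n)][int(x * n)] += 1
    return grid


def fcgr_from_kmers(counts, k):
    """Place each k-mer count directly into its grid cell."""
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for kmer, c in counts.items():
        col = row = 0
        # The last nucleotide contributes the most significant bit.
        for base in reversed(kmer):
            cx, cy = CORNERS[base]
            col, row = 2 * col + cx, 2 * row + cy
        grid[row][col] += c
    return grid


def sequence_from_kmers(counts):
    """Walk an Eulerian path (Hierholzer) on the De Bruijn multigraph, k >= 2."""
    edges = defaultdict(list)  # (k-1)-mer prefix -> list of suffix nodes
    balance = Counter()        # out-degree minus in-degree per node
    for kmer, c in counts.items():
        u, v = kmer[:-1], kmer[1:]
        edges[u].extend([v] * c)
        balance[u] += c
        balance[v] -= c
    # Start at the node with one surplus outgoing edge, else at any node.
    start = next((u for u, b in balance.items() if b == 1), next(iter(edges)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges[u]:
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])
```

For a sequence `s` and order `k`, `fcgr_from_cgr(s, k)` and `fcgr_from_kmers(kmer_counts(s, k), k)` should return the same grid, and `sequence_from_kmers(kmer_counts(s, k))` should return a sequence whose $k$-mer counts match those of `s`, although the sequence itself may differ. Exact fractions are used instead of floats so that long runs of one nucleotide cannot round a point onto a cell boundary.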

Haoze He, Lila Kari, Pablo Millan Arias

Subjects: Theory and methods of biological science; computing technology and computer technology

Haoze He, Lila Kari, Pablo Millan Arias. Bridging Chaos Game Representations and $k$-mer Frequencies of DNA Sequences [EB/OL]. (2025-06-30) [2025-07-25]. https://arxiv.org/abs/2506.22172.
