Skip-mers: increasing entropy and sensitivity to detect conserved genic regions with simple cyclic q-grams
Abstract: Bioinformatic analyses and tools make extensive use of k-mers (fixed contiguous strings of k nucleotides) as an informational unit. K-mer analyses are both useful and fast, but are strongly affected by single nucleotide polymorphisms or sequencing errors, which hinders direct analyses of whole regions and decreases their usability between evolutionarily distant samples. Q-grams or spaced seeds, subsequences generated with a pattern of used-and-skipped nucleotides, overcome many of these limitations but introduce greater complexity, which hinders their wider adoption. We introduce the concept of skip-mers, a cyclic pattern of used-and-skipped positions of k nucleotides spanning a region of size S ≥ k, and show how analyses are improved by using this simple subset of q-grams as a replacement for k-mers. The entropy of skip-mers increases with larger span, capturing information from more distant positions and increasing the specificity, and uniqueness, of larger-span skip-mers within a genome. In addition, skip-mers constructed in cycles of 1 or 2 nucleotides in every 3 (or a multiple of 3) lead to increased sensitivity in the coding regions of genes, by grouping together the more conserved nucleotides of the protein-coding regions. We implemented a set of tools to count and intersect skip-mers between different datasets, a simple task given that the properties of skip-mers make them a direct substitute for k-mers. We used these tools to show that skip-mers have advantages over k-mers in terms of entropy and increased sensitivity to detect conserved coding sequence, allowing better identification of genic matches between evolutionarily distant species. We then show benefits for multi-genome analyses provided by increased and better-correlated coverage of conserved skip-mers across multiple samples.
Software availability: The skm-tools implementing the methods described in this manuscript are available under the MIT license at http://github.com/bioinfologics/skm-tools/
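The cyclic used-and-skipped pattern described in the abstract can be illustrated with a short sketch. This is not the skm-tools implementation: the function names and the parameter names `m` (positions used per cycle) and `n` (cycle length) are assumptions introduced here for illustration, and the sketch ignores strand orientation and canonical forms. It assumes each cycle uses its first `m` positions and skips the rest, until `k` nucleotides have been collected.

```python
def skipmer_span(m, n, k):
    """Span S (>= k) covered by a skip-mer that uses m of every n positions
    until k nucleotides have been used. Parameter names are illustrative."""
    full_cycles, remainder = divmod(k, m)
    if remainder == 0:
        # The last cycle contributes only its m used positions.
        return n * (full_cycles - 1) + m
    return n * full_cycles + remainder


def skipmers(seq, m, n, k):
    """Return the skip-mer starting at every position of seq (forward strand only)."""
    span = skipmer_span(m, n, k)
    # Offsets inside the span that are used: the first m positions of each cycle.
    offsets = [i for i in range(span) if i % n < m]
    return ["".join(seq[start + o] for o in offsets)
            for start in range(len(seq) - span + 1)]


# Two hypothetical coding fragments that differ only at third codon positions.
a = "GCTGAAAAG"
b = "GCCGAGAAA"

# Contiguous 6-mers (m = n = 1) share nothing between the two fragments.
shared_kmers = set(skipmers(a, 1, 1, 6)) & set(skipmers(b, 1, 1, 6))

# Skip-mers using 2 of every 3 positions skip the variable third position
# when they start in frame, so one skip-mer is shared.
shared_skipmers = set(skipmers(a, 2, 3, 6)) & set(skipmers(b, 2, 3, 6))

print(shared_kmers)     # set()
print(shared_skipmers)  # {'GCGAAA'}
```

With `m = n = 1` the function reduces to ordinary k-mers, which matches the abstract's point that skip-mers can act as a direct substitute for k-mers in counting and intersection workflows.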
Barr Katie, Wright Jonathan, Accinelli Gonzalo Garcia, Clavijo Bernardo J., Yanes Luis
Earlham Institute, Norwich Research Park (all authors)
Research methods in biological sciences; research techniques in biological sciences; molecular biology; genetics
Barr Katie, Wright Jonathan, Accinelli Gonzalo Garcia, Clavijo Bernardo J., Yanes Luis. Skip-mers: increasing entropy and sensitivity to detect conserved genic regions with simple cyclic q-grams [EB/OL]. (2025-03-28) [2025-05-28]. https://www.biorxiv.org/content/10.1101/179960.