Comparative Analysis and Data Provenance for 1,113 Bacterial Genome Assemblies
The quality and traceability of microbial genomics data in public databases are deteriorating as the databases rapidly expand and struggle to cope with data curation challenges. While the availability of public genomic data has become essential for modern life sciences research, the curation of the data is a growing area of concern that has significant real-world impacts on public health epidemiology, drug discovery, and environmental biosurveillance research [1–6]. Although public microbial genome databases such as NCBI's RefSeq database leverage the scalability of crowdsourcing for growth, they do not require data provenance to the original biological source materials or accurate descriptions of how the data were produced [7]. Here, we describe the de novo assembly of 1,113 bacterial genome references produced from authenticated materials sourced from the American Type Culture Collection (ATCC), each with full data provenance. Over 98% of these ATCC Standard Reference Genomes (ASRGs) are superior to assemblies for comparable strains found in NCBI's RefSeq database. Comparative genomics analysis revealed significant issues in RefSeq bacterial genome assemblies related to genome completeness, mutations, structural differences, metadata errors, and gaps in traceability to the original biological source materials. For example, nearly half of RefSeq assemblies lack details on sample source information, sequencing technology, or bioinformatics methods. We suggest that the quality of genomic metadata, the traceability of the data, and the methods used to produce them are intrinsically connected to the quality of the resulting genome assemblies. Our results highlight common problems with "reference genomes" and underscore the importance of data provenance for precision science and reproducibility.
These gaps in metadata accuracy and data provenance represent an "elephant in the room" for microbial genomics research, but addressing these issues would require raising the level of accountability for data depositors and our own expectations of data quality.
Combs Patrick Ford, Reese Amy L., Duncan James, Greenfield Samuel R., King Stephen, Jacobs Jonathan L., Bagnoli John, Puthuveetil Nikhita P., Riojas Marco A., Benton Briana, Pierola Amanda E., Yarmosh David A., Tabron Corina, Marlow Robert, Lopera Juan G.
American Type Culture Collection (ATCC); BEI Resources
Biological science research methods and techniques; Microbiology; Basic medicine
Combs Patrick Ford, Reese Amy L., Duncan James, Greenfield Samuel R., King Stephen, Jacobs Jonathan L., Bagnoli John, Puthuveetil Nikhita P., Riojas Marco A., Benton Briana, Pierola Amanda E., Yarmosh David A., Tabron Corina, Marlow Robert, Lopera Juan G. Comparative Analysis and Data Provenance for 1,113 Bacterial Genome Assemblies [EB/OL]. (2025-03-28) [2025-04-26]. https://www.biorxiv.org/content/10.1101/2021.12.14.472616.