Statistical analysis methods for extremely imbalanced data in GWAS
Extremely imbalanced data are defined here as data in which the values of an independent or dependent variable are severely disproportionate. Examples include extremely unbalanced case-control ratios, very low disease incidence, heavily censored survival data, and low-frequency or rare genetic variants. In these settings, the classical test statistics used for parametric hypothesis testing in models such as logistic regression and the Cox proportional hazards model deviate from the normal distribution, which makes the type I error rate difficult to control. In recent years, as genome-wide association study (GWAS) resources from very large population cohorts have been increasingly shared and mined in depth, the need for efficient and accurate statistical methods for extremely imbalanced data, in both independent and non-independent samples, has become more pressing. To address this need, this paper provides a systematic methodological overview. First, it reviews the principles behind the theoretical derivation of common classical test statistics. Second, it describes how extremely imbalanced data affect the distribution of these statistics. Third, it introduces two correction methods widely used in statistical genetics: the Firth correction and the saddlepoint approximation. Finally, it briefly introduces commonly used software for extremely imbalanced genomic data. The paper offers theoretical reference and practical recommendations for the statistical analysis of extremely imbalanced data.
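As an illustration of why the normal approximation fails and what a saddlepoint correction does, the following minimal sketch compares the two tail probabilities for a score statistic. It assumes the simplest possible setting, which is not taken from the paper: a null logistic model with no covariates, so that each phenotype Y_i is Bernoulli(mu) with mu equal to the case fraction, and the score statistic is S = sum_i g_i (Y_i - mu) for non-negative genotypes g_i. The function names (`cgf`, `normal_tail`, `saddlepoint_tail`) are hypothetical, the tail formula is the Barndorff-Nielsen form of the saddlepoint approximation, and the code is not the implementation of any particular GWAS software.

```python
import math

def cgf(t, g, mu):
    """Return K(t), K'(t), K''(t) for S = sum_i g_i * (Y_i - mu), Y_i ~ Bernoulli(mu)."""
    k0 = k1 = k2 = 0.0
    for gi in g:
        e = math.exp(t * gi)
        denom = 1.0 - mu + mu * e
        p = mu * e / denom                 # exponentially tilted success probability
        k0 += math.log(denom) - t * gi * mu
        k1 += gi * (p - mu)
        k2 += gi * gi * p * (1.0 - p)
    return k0, k1, k2

def norm_sf(x):
    """Upper tail of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def normal_tail(s, g, mu):
    """Classical approximation: treat S as normal with variance K''(0)."""
    _, _, var = cgf(0.0, g, mu)
    return norm_sf(s / math.sqrt(var))

def saddlepoint_tail(s, g, mu):
    """Saddlepoint approximation to P(S >= s); this sketch handles only 0 < s < max(S)."""
    s_max = sum(g) * (1.0 - mu)
    if not 0.0 < s < s_max:
        raise ValueError("sketch only handles 0 < s < max(S)")
    # Solve K'(t) = s by bisection; K' is increasing and the root is positive.
    lo, hi = 0.0, 1.0
    while cgf(hi, g, mu)[1] < s:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cgf(mid, g, mu)[1] < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    k0, _, k2 = cgf(t, g, mu)
    w = math.sqrt(2.0 * (t * s - k0))
    v = t * math.sqrt(k2)
    return norm_sf(w + math.log(v / w) / w)

# Hypothetical example: 30 carriers of a rare variant (genotype 1 each),
# 1% case fraction, and 3 of the carriers are cases.
g = [1] * 30
mu = 0.01
s = 3 - sum(g) * mu
print("normal tail:     ", normal_tail(s, g, mu))
print("saddlepoint tail:", saddlepoint_tail(s, g, mu))
```

In this toy setting the exact tail is a binomial probability, so both approximations can be checked directly. The normal approximation gives a p-value several orders of magnitude too small, which is the type I error inflation described above, while the saddlepoint value stays on the same order of magnitude as the exact one.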
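Subject areas: medical research methods; current status and development of biological sciences; research methods and techniques in biological sciences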
Keywords: genome-wide association studies; extremely imbalanced data; Firth correction; saddlepoint approximation; rare variants
Statistical analysis methods for extremely imbalanced data in GWAS [EB/OL]. (2024-04-25) [2025-08-21]. https://chinaxiv.org/abs/202404.00373.