CluStrat: a structure informed clustering strategy for population stratification

Source: bioRxiv
Abstract

Genome-wide association studies (GWAS) have been extensively used to estimate the signed effects of trait-associated alleles. Recent independent studies failed to replicate the strong evidence of selection for height across Europe, implying shortcomings in standard population stratification correction approaches. Here, we present CluStrat, a stratification correction algorithm for complex population structure that leverages the linkage disequilibrium (LD)-induced distances between individuals. CluStrat performs agglomerative hierarchical clustering using the Mahalanobis distance and then applies sketching-based randomized ridge regression on the genotype data to obtain the association statistics. As data sets grow, computing and storing the genome-wide covariance matrix becomes a non-trivial task. We get around this overhead by computing the genetic relationship matrix (GRM) directly, using a connection between statistical leverage scores and the Mahalanobis distance. We test CluStrat on a large simulation study of discrete and admixed, arbitrarily structured sub-populations, identifying two- to three-fold more true causal variants than Principal Component (PC)-based stratification correction methods, at the cost of slightly more spurious associations. Applying CluStrat to WTCCC2 Parkinson's disease (PD) data, we identified loci mapped to a host of genes associated with PD, such as BACH2, MAP2, NR4A2, SLC11A1, and UNC5C, to name a few.

Availability and Implementation: CluStrat source code and user manual are available at: https://github.com/aritra90/CluStrat
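
The connection between statistical leverage scores and the Mahalanobis distance mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mahalanobis_from_leverage`, the column centering, the `(n - 1)` scaling, and the optional rank truncation `k` are all assumptions made for illustration. It relies only on the standard identity that, for a centered matrix with left singular vectors `U`, the squared Mahalanobis distance between rows `i` and `j` equals `(n - 1) * (l_i + l_j - 2 * (U U^T)_ij)`, where `l_i` is the leverage score of row `i`. No feature-by-feature covariance matrix is formed.

```python
import numpy as np


def mahalanobis_from_leverage(X, k=None, tol=1e-10):
    """Squared Mahalanobis distances between the rows of X (individuals x variants).

    Illustrative sketch only; not taken from the CluStrat source code.
    If k is given, distances are restricted to the top-k singular directions.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                      # center each variant (column)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    U = U[:, s > tol * s.max()]                  # drop numerically null directions
    if k is not None:
        U = U[:, :k]                             # assumed optional rank truncation
    G = U @ U.T                                  # n x n projection matrix
    lev = np.diag(G).copy()                      # statistical leverage scores
    D2 = (n - 1) * (lev[:, None] + lev[None, :] - 2.0 * G)
    np.fill_diagonal(D2, 0.0)
    return np.maximum(D2, 0.0), lev              # clip tiny negative round-off
```

The square root of the returned matrix could then be passed to an agglomerative hierarchical clustering routine, for example `scipy.cluster.hierarchy.linkage` after condensing it with `scipy.spatial.distance.squareform`. Note that when there are more variants than individuals and no truncation is applied, all pairwise distances become equal, so a rank `k` smaller than the number of individuals would be needed in that setting.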

Paschou Peristera, Burch Myson C., Chowdhury Agniva, Drineas Petros, Bose Aritra

Department of Biological Sciences, Purdue University; Computer Science Department, Purdue University; Department of Statistics, Purdue University; Computer Science Department, Purdue University; Computational Genomics, IBM T.J. Watson Research Center and Computer Science Department, Purdue University

10.1101/2020.01.15.908228

Biological science research methods and techniques; Basic medicine; Neurology and psychiatry

Population Structure; Association Studies; Clustering; Ridge Regression

Paschou Peristera, Burch Myson C., Chowdhury Agniva, Drineas Petros, Bose Aritra. CluStrat: a structure informed clustering strategy for population stratification [EB/OL]. (2025-03-28) [2025-05-11]. https://www.biorxiv.org/content/10.1101/2020.01.15.908228.
