Restoring the Forecasting Power of Google Trends with Statistical Preprocessing
Google Trends reports how frequently specific queries are searched on Google over time. It is widely used in research and industry to gain early insights into public interest. However, its data generation mechanism introduces missing values, sampling variability, noise, and trends. These issues arise from privacy thresholds that map low search volumes to zeros, daily sampling variations that cause discrepancies across historical downloads, and algorithm updates that alter volume magnitudes over time. Data quality has recently deteriorated, with more zeros and noise, even for previously stable queries. We propose a comprehensive statistical methodology to preprocess Google Trends search information using hierarchical clustering, smoothing splines, and detrending. We validate our approach by forecasting U.S. influenza hospitalizations with a univariate ARIMAX model. Relative to a baseline that omits exogenous variables, raw Google Trends data degrades modeling performance, while preprocessed signals enhance forecast accuracy by 58% nationally and 24% at the state level.
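To make two of the preprocessing steps named above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that zeros are privacy-threshold artifacts and can be treated as missing, that a cubic smoothing spline from SciPy stands in for the paper's smoothing step, and that a simple linear detrend is adequate. The function name `preprocess_series`, the `smoothing` parameter, and the synthetic data are hypothetical. The hierarchical clustering step and the ARIMAX forecast are left out.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.signal import detrend


def preprocess_series(values, smoothing=None):
    """Treat zeros as missing, fit a smoothing spline, then remove a linear trend.

    Illustrative only: the order of steps and all parameters are assumptions.
    """
    y = np.asarray(values, dtype=float)
    t = np.arange(y.size, dtype=float)
    # Assumption: a zero is a suppressed low-volume value, not a true count.
    observed = y > 0
    # Fit a cubic smoothing spline to the observed points only.
    spline = UnivariateSpline(t[observed], y[observed], k=3, s=smoothing)
    # Evaluate at every time point, which also fills the suppressed gaps.
    smooth = spline(t)
    # Subtract the least-squares straight line from the smoothed series.
    return detrend(smooth, type="linear")


# Synthetic weekly series: upward drift + yearly cycle + noise + suppressed zeros.
rng = np.random.default_rng(0)
weeks = np.arange(120)
seasonal = 10.0 * np.sin(2.0 * np.pi * weeks / 52.0)
raw = 50.0 + 0.2 * weeks + seasonal + rng.normal(0.0, 3.0, size=weeks.size)
zero_idx = rng.choice(weeks.size, size=15, replace=False)
raw[zero_idx] = 0.0

# Smoothing budget of roughly (number of points) x (noise variance).
signal = preprocess_series(raw, smoothing=9.0 * raw.size)
```

The resulting `signal` has the same length as the input, has no gaps at the suppressed positions, and has zero mean and zero linear slope by construction. In a pipeline like the one the abstract describes, a series of this kind would then be supplied as the exogenous regressor of the forecasting model.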
Candice Djorno, Mauricio Santillana, Shihao Yang
Medical research methods
Candice Djorno, Mauricio Santillana, Shihao Yang. Restoring the Forecasting Power of Google Trends with Statistical Preprocessing [EB/OL]. (2025-04-09) [2025-07-02]. https://arxiv.org/abs/2504.07032.