Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings
Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings
Training models on uncurated Text Embeddings (TEs) derived from raw tabular data can lead to a severe failure mode known as model collapse, where predictions converge to a single class regardless of input. By comparing models trained with identical hyper-parameter configurations on both raw tabular data and their TE-derived counterparts, we find that collapse is a consistent failure mode in the latter setting. We introduce a set of metrics that capture the extent of model collapse, offering a new perspective on TE quality as a proxy for data curation. Our results reveal that TE alone does not effectively function as a curation layer - and that their quality significantly influences downstream learning. More insidiously, we observe that the presence of model collapse can yield artificially inflated and spurious Accuracy-on-the-Line correlation. These findings highlight the need for more nuanced curation and evaluation of embedding-based representations, particularly in out-of-distribution settings.
Lucas Mattioli、Youness Ait Hadichou、Sabrina Chaouche、Martin Gonzalez
计算技术、计算机技术
Lucas Mattioli,Youness Ait Hadichou,Sabrina Chaouche,Martin Gonzalez.Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings[EB/OL].(2025-06-22)[2025-07-16].https://arxiv.org/abs/2506.17989.点此复制
评论