Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Abstract

Transformers have demonstrated remarkable success across various applications. However, the success of transformers has not been understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups, and the label depends only on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data.
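To make the "group sparsity" setting concrete, the following is a minimal sketch, not the paper's actual model or training procedure. It assumes Gaussian features, a label given by the sign of the sum of the relevant group's variables, and hand-set attention scores; the names `make_sample`, `attention_pool`, and all parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def make_sample(num_groups, group_dim, relevant_group, rng):
    """Draw one sample: a list of groups, each a list of Gaussian features."""
    x = [[rng.gauss(0.0, 1.0) for _ in range(group_dim)]
         for _ in range(num_groups)]
    # Assumed labeling rule (illustrative only): the label is the sign of the
    # sum of the features in the relevant group; other groups are ignored.
    y = 1 if sum(x[relevant_group]) >= 0 else -1
    return x, y


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(x, scores):
    """Return the softmax-weighted sum of the groups and the weights used."""
    weights = softmax(scores)
    dim = len(x[0])
    pooled = [sum(w * g[k] for w, g in zip(weights, x)) for k in range(dim)]
    return pooled, weights
```

When the score of the relevant group is much larger than the others, for example `attention_pool(x, [0.0, 0.0, 10.0, 0.0])` with `relevant_group=2`, nearly all of the softmax weight falls on that group, so the pooled vector is close to the relevant variables and the irrelevant groups are effectively discarded. In this sketch the scores are fixed by hand; in the setting the abstract describes, they would be produced by a trained attention layer.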

Authors: Chenyang Zhang, Xuran Meng, Yuan Cao

Subject classification: Computing technology, computer technology

Recommended citation: Chenyang Zhang, Xuran Meng, Yuan Cao. Transformer Learns Optimal Variable Selection in Group-Sparse Classification [EB/OL]. (2025-04-11) [2025-05-25]. https://arxiv.org/abs/2504.08638.
