
Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Source: arXiv
Abstract

Transformers have demonstrated remarkable success across various applications. However, the success of transformers has not yet been well understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups and the label depends only on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data.
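As a rough illustration of the "group sparsity" setting described in the abstract, the sketch below draws inputs made of several variable groups and assigns a label that depends on only one of them. The abstract does not specify the feature distribution or the label rule, so the standard Gaussian features, the linear-sign label, and all names (`sample_group_sparse`, `relevant_group`, `direction`) are assumptions made for illustration and are not the authors' construction.

```python
import random


def sample_group_sparse(n, num_groups, group_dim, relevant_group, direction, seed=0):
    """Draw n examples, each with num_groups groups of group_dim variables.

    Assumptions (not taken from the paper): variables are i.i.d. standard
    Gaussian, and the label is the sign of <direction, x[relevant_group]>.
    All other groups are irrelevant to the label.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        groups = [
            [rng.gauss(0.0, 1.0) for _ in range(group_dim)]
            for _ in range(num_groups)
        ]
        # Only the relevant group enters the label.
        score = sum(w * x for w, x in zip(direction, groups[relevant_group]))
        label = 1 if score >= 0 else -1
        data.append((groups, label))
    return data


samples = sample_group_sparse(
    n=5, num_groups=4, group_dim=3, relevant_group=2, direction=[1.0, -1.0, 0.5]
)
```

A classifier that has identified `relevant_group` can ignore the other groups entirely, which is the kind of variable selection the abstract attributes to the attention mechanism.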

Chenyang Zhang, Xuran Meng, Yuan Cao

Subject: Computing Technology, Computer Technology

Chenyang Zhang, Xuran Meng, Yuan Cao. Transformer Learns Optimal Variable Selection in Group-Sparse Classification [EB/OL]. (2025-04-11) [2025-05-25]. https://arxiv.org/abs/2504.08638.
