Automated Feature Labeling with Token-Space Gradient Descent

Abstract

We present a novel approach to feature labeling using gradient descent in token-space. While existing methods typically use language models to generate hypotheses about feature meanings, our method directly optimizes label representations by using a language model as a discriminator to predict feature activations. We formulate this as a multi-objective optimization problem in token-space, balancing prediction accuracy, entropy minimization, and linguistic naturalness. Our proof-of-concept experiments demonstrate successful convergence to interpretable single-token labels across diverse domains, including features for detecting animals, mammals, Chinese text, and numbers. Although our current implementation is constrained to single-token labels and relatively simple features, the results suggest that token-space gradient descent could become a valuable addition to the interpretability researcher's toolkit.
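The multi-objective formulation described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the four-token vocabulary, the one-hot-style embedding matrix `E`, the logistic stand-in for the language-model discriminator, the fixed token prior used as a naturalness proxy, and the weights `lam_entropy` and `lam_natural`. The sketch only shows the general shape of the idea: a label is held as a softmax distribution over tokens, and gradient descent on the logits trades off prediction accuracy, entropy, and naturalness.

```python
import numpy as np

# Hypothetical toy vocabulary; a real setup would use a language model's vocabulary.
VOCAB = ["animal", "number", "the", "blue"]


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def optimize_label(X, y, E, log_prior, lam_entropy=0.1, lam_natural=0.05,
                   lr=0.5, steps=500, bias=-3.0):
    """Gradient descent on token logits for a soft single-token label.

    Loss = BCE(discriminator(label), y)
           + lam_entropy * H(p)                # push toward one token
           - lam_natural * sum(p * log_prior)  # prefer plausible tokens
    The discriminator is a toy logistic model (an assumption), not a language model.
    """
    n = len(y)
    z = np.zeros(E.shape[0])  # start from a uniform distribution over tokens
    for _ in range(steps):
        p = softmax(z)
        label_emb = p @ E                          # soft label embedding
        pred = sigmoid(X @ label_emb + bias)       # predicted feature activations
        grad_emb = X.T @ (pred - y) / n            # d(BCE)/d(label_emb)
        g_acc = E @ grad_emb                       # d(BCE)/dp
        g_ent = -(np.log(p + 1e-12) + 1.0)         # d(entropy)/dp
        g_nat = -log_prior                         # d(naturalness penalty)/dp
        g_p = g_acc + lam_entropy * g_ent + lam_natural * g_nat
        grad_z = p * (g_p - p @ g_p)               # chain rule through softmax
        z -= lr * grad_z
    return softmax(z)


# Toy data (assumed): rows are inputs, and the feature fires on the two "animal" inputs.
E = 6.0 * np.eye(4)
X = np.array([[1, 0, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0, 0], dtype=float)
log_prior = np.log(np.array([0.2, 0.2, 0.4, 0.2]))

probs = optimize_label(X, y, E, log_prior)
label = VOCAB[int(np.argmax(probs))]
print(label)
```

In this toy setting the distribution concentrates on the token whose embedding best predicts the activations. The entropy term is what drives the soft label toward a single discrete token, which corresponds to the single-token constraint the abstract mentions.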

Authors: Julian Schulz, Seamus Fallows

Subject classification: Computing technology, computer technology

Recommended citation: Julian Schulz, Seamus Fallows. Automated Feature Labeling with Token-Space Gradient Descent [EB/OL]. (2025-04-01) [2025-05-06]. https://arxiv.org/abs/2504.00754.
