SCALAR: Scale-wise Controllable Visual Autoregressive Learning
Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. This design provides persistent and structurally aligned guidance throughout the generation process. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model. Extensive experiments show that SCALAR achieves superior generation quality and control precision across various tasks.
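The scale-wise conditioning idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes, for illustration only, that the control encoding is a square feature map, that it is average-pooled to each scale's resolution, that each scale has its own linear projection, and that injection is a simple addition to that scale's token embeddings. The names `avg_pool_to`, `linear`, and `scale_wise_condition` are hypothetical.

```python
def avg_pool_to(feat, size):
    """Average-pool a square H x H x C map (nested lists) to size x size x C.

    Assumes H is divisible by size (an illustrative simplification).
    """
    h = len(feat)
    c = len(feat[0][0])
    step = h // size
    pooled = []
    for i in range(size):
        row = []
        for j in range(size):
            cell = [0.0] * c
            for di in range(step):
                for dj in range(step):
                    pixel = feat[i * step + di][j * step + dj]
                    for k in range(c):
                        cell[k] += pixel[k]
            row.append([v / (step * step) for v in cell])
        pooled.append(row)
    return pooled


def linear(vec, weight):
    """Apply a D x C weight matrix (list of rows) to a length-C vector."""
    return [sum(w * x for w, x in zip(w_row, vec)) for w_row in weight]


def scale_wise_condition(control_feat, tokens_per_scale, projections):
    """Add a scale-specific projection of the control features to each scale.

    control_feat:     H x H x C control encoding (e.g. from a frozen encoder)
    tokens_per_scale: dict mapping scale s -> s x s x D token embeddings
    projections:      dict mapping scale s -> D x C weight matrix (hypothetical)
    """
    conditioned = {}
    for s, tokens in tokens_per_scale.items():
        pooled = avg_pool_to(control_feat, s)  # align control map to scale s
        conditioned[s] = [
            [
                [t + p for t, p in zip(tokens[i][j],
                                       linear(pooled[i][j], projections[s]))]
                for j in range(s)
            ]
            for i in range(s)
        ]
    return conditioned
```

Because every scale receives its own spatially aligned copy of the control signal, the guidance is present at each step of the coarse-to-fine generation rather than only at the input, which is the property the abstract calls persistent and structurally aligned.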
Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu
Computing technology, computer technology
Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu. SCALAR: Scale-wise Controllable Visual Autoregressive Learning [EB/OL]. (2025-07-29) [2025-08-18]. https://arxiv.org/abs/2507.19946.