Does CLIP perceive art the same way we do?

Abstract

CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it "see" the same way humans do, especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.

Authors: Andrea Asperti, Leonardo Dessì, Maria Chiara Tonetti, Nico Wu

Subject classification: Computing technology, computer technology

Recommended citation: Andrea Asperti, Leonardo Dessì, Maria Chiara Tonetti, Nico Wu. Does CLIP perceive art the same way we do? [EB/OL]. (2025-05-08) [2025-06-29]. https://arxiv.org/abs/2505.05229.
