F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search

Abstract

The proliferation of digital food content has intensified the need for robust and accurate systems capable of fine-grained visual understanding and retrieval. In this work, we address the challenging task of food image-to-text matching, a critical component in applications such as dietary monitoring, smart kitchens, and restaurant automation. We propose F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search, a training-free, vision-language model (VLM)-guided framework that significantly improves retrieval performance through enhanced multi-modal feature representations. Our approach introduces two key contributions: (1) a uni-directional (and bi-directional) multi-modal fusion strategy that combines image embeddings with VLM-generated textual descriptions to improve query expressiveness, and (2) a novel feature-based re-ranking mechanism for top-k retrieval that leverages predicted food ingredients to refine results and boost precision. Using open-source image-text encoders, we demonstrate substantial gains over standard baselines: improvements of ~10% and ~7.7% in top-1 retrieval under dense and sparse caption scenarios, respectively, and a ~28.6% gain in top-k ingredient-level retrieval. Additionally, we show that smaller models (e.g., ViT-B/32) can match or outperform larger counterparts (e.g., ViT-H, ViT-G, ViT-bigG) when augmented with textual fusion, highlighting the effectiveness of our method in resource-constrained settings. Code and test datasets will be made publicly available at: https://github.com/mailcorahul/f4-its
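
The two contributions named in the abstract can be illustrated with a minimal sketch. The abstract does not give the fusion formula or the re-ranking score, so everything below is an assumption made for illustration: fusion is shown as a weighted sum of L2-normalized embeddings (the weight `alpha` is hypothetical), and re-ranking is shown as a stable sort by how many predicted ingredients appear in each candidate caption. All function names are hypothetical, and the embeddings are plain lists standing in for the output of an image-text encoder.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def fuse(image_emb, description_emb, alpha=0.5):
    """Assumed fusion: weighted sum of the normalized image embedding and the
    normalized embedding of a VLM-generated description, then re-normalized."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(description_emb)
    return l2_normalize([alpha * i + (1 - alpha) * t for i, t in zip(img, txt)])

def retrieve_top_k(query_emb, caption_embs, k):
    """Indices of the k captions most similar to the query (ties by index)."""
    scores = [(cosine(query_emb, c), idx) for idx, c in enumerate(caption_embs)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:k]]

def rerank_by_ingredients(candidates, caption_texts, predicted_ingredients):
    """Assumed re-ranking: order candidates by the number of predicted
    ingredients found in the caption text; the sort is stable, so ties keep
    their original retrieval order."""
    def overlap(idx):
        text = caption_texts[idx].lower()
        return sum(1 for ing in predicted_ingredients if ing.lower() in text)
    return sorted(candidates, key=lambda idx: -overlap(idx))

# Toy data: 2-D vectors stand in for real encoder outputs.
captions = ["grilled chicken salad", "tomato basil pasta", "chicken curry with rice"]
caption_embs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
query = fuse([1.0, 0.0], [0.0, 1.0], alpha=0.5)
top = retrieve_top_k(query, caption_embs, k=3)
reranked = rerank_by_ingredients(top, captions, ["chicken", "rice"])
```

In this toy run the fused query sits between the image and description embeddings, and the re-ranking step promotes the caption that mentions both predicted ingredients. A real pipeline would replace the toy vectors with encoder outputs and the ingredient list with VLM predictions.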

Author: Raghul Asokan

Subject classification: Food industry

Recommended citation: Raghul Asokan. F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search [EB/OL]. (2025-08-23) [2025-09-06]. https://arxiv.org/abs/2508.17037
