Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy, detailed descriptions because their training focuses on short, concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP’s ability to handle lengthy text by leveraging both global and local semantic alignments between an image and its detailed description.
Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs.
Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through extensive experiments, we show that our method’s focus on local semantic alignment alongside global context leads to more nuanced and representative embeddings, particularly beneficial for tasks requiring fine-grained understanding of lengthy text descriptions.
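To illustrate how global and local alignment could be combined during fine-tuning, the following is a minimal sketch assuming a standard CLIP-style symmetric contrastive loss on the global image-text pairs plus weighted local terms from the matched pairs. The weighting `lambda_local` and the exact local loss forms are assumptions, not the paper’s formulation.

```python
# Hedged sketch: global CLIP-style contrastive loss + local alignment terms.
# `lambda_local` and the local losses are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of global image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def goal_style_objective(global_img_emb, global_txt_emb, local_losses, lambda_local=1.0):
    """Combine global alignment with local terms derived from matched local pairs."""
    return clip_contrastive_loss(global_img_emb, global_txt_emb) + lambda_local * sum(local_losses)
```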
Given a global image and its detailed caption, LISM uses SAM to segment the image into local regions and splits the caption into individual sentences. These local pairs are then processed through CLIP encoders to obtain CLS embeddings, which are used for maximum similarity matching to identify the most relevant image-sentence pairs.
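As a rough illustration of this matching step (not the authors’ released code), a LISM-style procedure might look like the sketch below. The SAM checkpoint name, the period-based sentence splitting, and the bounding-box cropping are assumptions made to keep the example self-contained.

```python
# Hedged sketch of LISM-style local image-sentence matching.
# Assumes the segment-anything and openai/CLIP packages; crop and
# sentence-splitting heuristics are illustrative, not the paper's exact setup.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth").to(device)  # checkpoint path is an assumption
mask_generator = SamAutomaticMaskGenerator(sam)

def lism_match(image: Image.Image, caption: str):
    """Pair each segmented region with its most similar caption sentence."""
    # 1) Local image regions: crop around each SAM mask's bounding box.
    masks = mask_generator.generate(np.array(image))
    crops = []
    for m in masks:
        x, y, w, h = (int(v) for v in m["bbox"])
        crops.append(preprocess(image.crop((x, y, x + w, y + h))))
    # 2) Local sentences: naive split of the lengthy caption.
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    # 3) CLS embeddings from the CLIP encoders.
    with torch.no_grad():
        img_emb = model.encode_image(torch.stack(crops).to(device))
        txt_emb = model.encode_text(clip.tokenize(sentences, truncate=True).to(device))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    # 4) Maximum-similarity matching: most relevant sentence for each region.
    sim = img_emb @ txt_emb.T                      # (num_regions, num_sentences)
    best = sim.argmax(dim=-1)
    return [(i, best[i].item(), sim[i, best[i]].item()) for i in range(len(crops))]
```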
The framework processes global image-text pairs and their local pairs through shared CLIP encoders, extracting patch and sequence tokens. TSL identifies and projects corresponding token regions to match the local CLS embeddings, enabling attention on local elements.
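To make this concrete, here is a hedged sketch of a TSL-style local loss: the tokens covered by a matched region (or sentence span) are pooled, projected, and pulled toward the matched local CLS embedding. The pooling, the linear projector, and the cosine-distance objective are assumptions rather than the paper’s exact formulation, and names such as `TokenProjector` and `tsl_loss` are illustrative.

```python
# Hedged sketch of a TSL-style local alignment loss (assumptions, not the paper's exact loss).
# `tokens` are the global encoder's patch or sequence tokens (B, N, D), `mask`
# marks the tokens belonging to a matched region/sentence span (B, N), and
# `local_cls` is the matched local pair's CLS embedding (B, E).
import torch
import torch.nn.functional as F
from torch import nn

class TokenProjector(nn.Module):
    """Projects pooled token features into the CLS embedding space."""
    def __init__(self, token_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(token_dim, embed_dim)

    def forward(self, tokens, mask):
        mask = mask.float()
        # Average only the tokens covered by the matched region/sentence span.
        pooled = (tokens * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)
        return F.normalize(self.proj(pooled), dim=-1)

def tsl_loss(projector, tokens, mask, local_cls):
    """Pull the projected token region toward the matched local CLS embedding."""
    pred = projector(tokens, mask)                    # (B, embed_dim)
    target = F.normalize(local_cls, dim=-1)
    return (1 - (pred * target).sum(dim=-1)).mean()   # cosine distance

# Usage sketch (names are placeholders): one projector per modality.
# loss = tsl_loss(img_proj, patch_tokens, region_mask, local_image_cls) \
#      + tsl_loss(txt_proj, seq_tokens, sentence_mask, local_text_cls)
```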
Original test set results on the DOCCI dataset. Comparison of retrieval performance across different fine-tuning approaches using ViT-B/16 and ViT-L/14 backbones. The evaluation metrics include both text-to-image and image-to-text Recall@K. The best and second-best scores are marked in bold and underlined, respectively.
Comparison of attention maps generated by GOAL and by the ablation without TSL (w/o TSL). For each row pair, we present three components: (1) the original input image (left), (2) the attention heatmap (middle), and (3) the attention overlaid on the original image (right). The examples demonstrate how GOAL achieves more focused attention than the w/o TSL baseline. Red circles in the overlay highlight regions where GOAL localizes attention particularly effectively.
If you find this work useful in your research, please consider citing:
@inproceedings{Hyungyu_2025_CVPR,
author={Hyungyu Choi and Young Kyun Jang and Chanho Eom},
title={GOAL: Global-local Object Alignment Learning},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}