COVA: Text-Guided Composed Retrieval for Audio-Visual Content

Gyuwon Han¹*, Young Kyun Jang²*, Chanho Eom¹

* Equal contribution

¹Chung-Ang University
²Google DeepMind

(a) Existing CoVR considers only visual modifications, whereas (b) COVA uses both visual and auditory information to support more fine-grained retrieval.

Abstract

Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite visual similarity. To address this limitation, we introduce Composed retrieval for Video with its Audio (COVA), a new retrieval task that accounts for both visual and auditory variations. To support this, we construct AV-Comp, a benchmark of video pairs with cross-modal changes and textual queries describing the differences, enabling retrieval based on audio as well. We also propose AVT Compositional Fusion (AVT), which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for COVA.
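To make the idea of weighting modalities by their relevance to the query text concrete, here is a minimal toy sketch. It is not the AVT architecture: the function names (`fuse_query`, `rank_gallery`), the softmax gate over text-to-modality similarity, and the assumption that all features live in one shared embedding space are illustrative choices that do not come from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_query(video, audio, text):
    """Hypothetical text-gated fusion of reference video, audio, and text features.

    Assumption: all three features share one embedding space, so the
    text-to-modality cosine similarity can serve as a relevance score.
    """
    video, audio, text = normalize(video), normalize(audio), normalize(text)
    # The modality that agrees more with the text receives the larger weight.
    w_video, w_audio = softmax([dot(text, video), dot(text, audio)])
    fused = [w_video * v + w_audio * a + t for v, a, t in zip(video, audio, text)]
    return normalize(fused), (w_video, w_audio)


def rank_gallery(query, gallery):
    """Return gallery ids sorted by descending cosine similarity to the query."""
    scores = {vid: dot(query, normalize(emb)) for vid, emb in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-d features: the text points along the same axis as the audio feature.
query, (w_video, w_audio) = fuse_query([1, 0, 0], [0, 1, 0], [0, 1, 0])
ranking = rank_gallery(query, {"clip_a": [1, 0, 0], "clip_b": [0, 1, 0]})
```

In this toy setup the audio weight exceeds the video weight because the text feature is aligned with the audio feature, so the gallery clip lying along the audio axis ranks first.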

Datasets

EXAMPLES OF GENERATED AV-COMP TRIPLETS

Example 1

Query: U6ElTfA5lSw_10

Modification Text

Object: Replace the transparent plastic container with a black metal cage containing various colorful toys and perches.

Action: Change the parrot's action from perching on a container to climbing out of the cage onto the top bar and observing its surroundings.

Attribute: Change the dark background to an indoor setting near a window with blinds and natural light.

Audio: Replace the sound of a man speaking and a bird whistling with a bird chirping, a person burping, and the bird imitating the burp.

Target: NBR-XVmJNjQ_60

Hard Negative: eeNw3_nsEdM_310

Hard Negative: QM6X_mK_KAE_50

Example 2

Query: 7JCV2B6mbDo_30

Modification Text

Object: Replace the child in a pink shirt, teddy bear, and chair with a baby in a white outfit with red patterns.

Action: Change the action from the child walking with support from a chair to a baby walking to an adult, touching their leg, and being picked up.

Attribute: Update the setting to an indoor home environment with a wooden cabinet and refrigerator, and change the clothing to a light patterned outfit for the baby and a dark shirt with beige pants for the adult.

Audio: Change the audio from a baby's loud cries to a baby crying while a kid sings and laughs.

Target: VcZykKLnTnI_30

Example 3

Query: cew6UO_TiWI_60

Modification Text

Object: The one duck remains the same, but its color changes from white to brown and green.

Action: The action remains unchanged.

Attribute: A wooden fence is added to the background.

Audio: Replace the sound of splashing with a woman speaking, while a duck quacks.

Target: II7uLXgHSD8_30

Hard Negative: q0uIdT4wzRk_20

Example 4

Query: qkQ7ooIUNd0_60

Modification Text

Object: Remove the trees from the background, focusing only on the sheep and the grassy field.

Action: Change the sheep's movement from walking in a line to running across the field in a scattered and playful manner.

Attribute: Shift the mood from calm and serene to lively and energetic, and change the sheep's appearance from all white to a mix of white, brown, and speckled patterns.

Audio: Replace the sound of a bell ringing with birds chirping and a man and woman speaking.

Target: YRg_topnqRI_40

Hard Negative: E4ECgoC8ahg_20

Example 5

Query: 2QsWqMg_j08_30

Modification Text

Object: Change the man's attire from a striped shirt and headscarf to traditional attire, change the horse from white to brown, and add other people to the scene.

Action: Change the action from the man leading the horse calmly down a sidewalk to the horse bending down while the man adjusts its position as others observe.

Attribute: Transform the urban setting into a rural or semi-rural outdoor environment with a paved area.

Audio: Replace the sound of a child speaking with people shouting.

Target: 6O6rqrMirkU_30

Hard Negative: DukP2K1j2Kg_30

Example 6

Query: lxVT6iqlJ2k_27

Modification Text

Object: Replace the brown cushion and various items with a zebra-patterned blanket, a wooden floor.

Action: Change the cat's action from standing on its hind legs to walking away.

Attribute: Change the environment from a naturally lit, casual home setting to a cozy bedroom with warm, dim lighting.

Audio: Remove the sound of lips kissing, leaving only a woman speaking and a cat meowing.

Target: Kx0eryXWMgE_7

Hard Negative: 2JgHbC7yyTU_0
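The examples above share a common layout: a query clip, a target clip, four aspect-wise modification texts, and zero or more hard negatives. A minimal sketch of how one such record could be held in code follows. The `Triplet` class, its field names, and `split_clip_id` are hypothetical and do not describe the released file format. Reading the trailing number of a clip id as a numeric offset is also an assumption based only on how the ids look.

```python
from dataclasses import dataclass, field


@dataclass
class Triplet:
    """Hypothetical container for one AV-Comp example."""
    query_id: str
    target_id: str
    modification: dict  # keys: "object", "action", "attribute", "audio"
    hard_negatives: list = field(default_factory=list)


def split_clip_id(clip_id):
    """Split a clip id at its LAST underscore into (prefix, numeric suffix).

    The prefix itself may contain underscores (e.g. "QM6X_mK_KAE_50"),
    so splitting at the first underscore would be wrong.
    """
    prefix, _, suffix = clip_id.rpartition("_")
    return prefix, int(suffix)


# Example 3 from the list above, transcribed into the hypothetical structure.
example = Triplet(
    query_id="cew6UO_TiWI_60",
    target_id="II7uLXgHSD8_30",
    modification={
        "object": "The one duck remains the same, but its color changes "
                  "from white to brown and green.",
        "action": "The action remains unchanged.",
        "attribute": "A wooden fence is added to the background.",
        "audio": "Replace the sound of splashing with a woman speaking, "
                 "while a duck quacks.",
    },
    hard_negatives=["q0uIdT4wzRk_20"],
)

query_prefix, query_offset = split_clip_id(example.query_id)
```

Keeping hard negatives as a list with an empty default accommodates examples such as Example 2, which lists none, and Example 1, which lists two.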

STATISTICS

Dataset Composition

Explanation: Dataset Composition

"Unique videos" refers to the number of distinct video clips, while "Pairs" refers to the number of triplets.

Aspect-wise Statistics in Modification Text

Explanation: Aspect Type Distribution

Modification texts are categorized into four types: Object, Action, Attribute, and Audio. This chart summarizes the distribution of these types across all triplets in the train and test sets of the dataset.
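A distribution of this kind can be tallied with a simple counter. The sketch below assumes each triplet's modification text is available as a dict keyed by aspect name. That layout, the `aspect_distribution` helper, and the toy records are all hypothetical, and the placeholder strings are not dataset content.

```python
from collections import Counter

ASPECTS = ("object", "action", "attribute", "audio")


def aspect_distribution(modifications):
    """Count how many triplets carry a non-empty text for each aspect."""
    counts = Counter()
    for modification in modifications:
        for aspect in ASPECTS:
            if modification.get(aspect, "").strip():
                counts[aspect] += 1
    return counts


# Toy records with placeholder strings, for illustration only.
toy_modifications = [
    {"object": "placeholder", "action": "placeholder", "audio": "placeholder"},
    {"object": "placeholder", "attribute": "placeholder", "audio": "placeholder"},
]
toy_counts = aspect_distribution(toy_modifications)
```

Running the same function separately on the train and test records would give the per-split counts that such a chart compares.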

Word Cloud of Object in Modification Text

Word Cloud of Action in Modification Text

Word Cloud of Attribute in Modification Text

Word Cloud of Audio in Modification Text

We generate word clouds for each of the four modification text aspects: Object, Action, Attribute, and Audio, using the entire dataset (train and test). Each figure visualizes the most frequently occurring words within its respective aspect and provides an intuitive overview of the dominant concepts used in compositional queries.
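The word frequencies behind a word cloud can be approximated with a few lines of standard-library code. The sketch below is not the tooling used for the figures. The tokenization rule and the small stopword list are assumptions, and the two input sentences are the Audio texts of Examples 3 and 5 above.

```python
import re
from collections import Counter

# Hypothetical minimal stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "from", "in", "of", "on", "the", "to", "with"}


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count lowercase word occurrences across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


audio_texts = [
    "Replace the sound of splashing with a woman speaking, while a duck quacks.",
    "Replace the sound of a child speaking with people shouting.",
]
audio_freqs = word_frequencies(audio_texts)
```

The resulting counter can be passed to any word-cloud renderer that accepts a word-to-frequency mapping. Instruction verbs such as "replace" will dominate unless they are also added to the stopword list.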

Video Distribution per Cluster (Video-based Clustering)

Video Distribution per Cluster (Audio-based Clustering)

We perform clustering based on two types of embeddings: video embeddings and audio embeddings. In each clustering result, we color-code videos by their split: train (green), test (blue), and gallery (orange). These visualizations offer an intuitive view of the overall distribution patterns in the dataset and how samples are organized across different splits.
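Once each video has a cluster id and a split label, the per-cluster distribution shown in such a plot reduces to a grouped count. The sketch below assumes the cluster ids were already produced by some clustering step, which the text does not specify. The function name and the toy inputs are hypothetical.

```python
from collections import Counter, defaultdict


def split_counts_per_cluster(cluster_ids, splits):
    """Tabulate how many train/test/gallery videos fall into each cluster."""
    if len(cluster_ids) != len(splits):
        raise ValueError("cluster_ids and splits must have the same length")
    table = defaultdict(Counter)
    for cluster, split in zip(cluster_ids, splits):
        table[cluster][split] += 1
    # Plain dicts, ordered by cluster id, for easy printing or plotting.
    return {cluster: dict(counts) for cluster, counts in sorted(table.items())}


# Toy inputs for illustration only: one entry per video.
toy_clusters = [0, 0, 1, 1, 1, 2]
toy_splits = ["train", "test", "train", "gallery", "train", "gallery"]
toy_table = split_counts_per_cluster(toy_clusters, toy_splits)
```

Computing this table once from video-based cluster ids and once from audio-based cluster ids would give the two views described above.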