🏆 ICCV 2025 Highlight

DiffPS: Leveraging Prior Knowledge of Diffusion Model for Person Search

¹GSAIM, Chung-Ang University  |  ²IPAI, Seoul National University  |  ³Dept. of Mathematical Sciences and RIMS, SNU
*Equal contribution  ·  †Corresponding author

Research Motivation

Most existing person search models rely on ImageNet pre-trained backbones. While these backbones provide fine-grained features, they often lack the rich visual priors required for person search in diverse and complex scenes. Conventional methods also rely on a shared backbone for both person detection and re-identification, leading to conflicting optimization objectives and degraded performance.

We address these limitations by leveraging a pre-trained diffusion model, which offers richer visual semantics and enables task-specific decoupling to avoid feature interference.

DiffPS Teaser: diffusion features vs ImageNet backbones

Method

Prior Knowledge in Diffusion Model

DiffPS uses a frozen diffusion backbone (Stable Diffusion) and separates task-specific features for detection and Re-ID to avoid gradient interference. We leverage four inherent priors:

Pre-trained diffusion models exhibit strong text–image alignment, enabling precise localization of person-related regions (e.g., body parts, clothing). This motivates DGRPN and SFAN.

Text conditioning illustration

The informativeness of diffusion features varies with the timestep. For person search, earlier (low-noise) timesteps retain fine-grained features while keeping the impact of the injected noise small.
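To make the timestep effect concrete, here is a minimal sketch of how much of the input survives at a given timestep t under the standard forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The schedule values (scaled-linear betas from 0.00085 to 0.012 over 1000 steps) are the ones commonly found in Stable Diffusion v1 configurations; that DiffPS uses exactly this schedule is an assumption, and the function names are illustrative only.

```python
import math

def alpha_bar_schedule(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative signal coefficient alpha_bar_t for a scaled-linear beta schedule.

    Assumption: these defaults mirror common Stable Diffusion v1 configs.
    """
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        # Scaled-linear: interpolate linearly in sqrt(beta), then square.
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def signal_and_noise_scale(alpha_bar, t):
    """Return (signal, noise) scales in x_t = signal * x_0 + noise * eps."""
    return math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])

if __name__ == "__main__":
    schedule = alpha_bar_schedule()
    for t in (0, 100, 500, 999):
        signal, noise = signal_and_noise_scale(schedule, t)
        print(f"t={t:4d}  signal={signal:.3f}  noise={noise:.3f}")
```

The signal scale shrinks monotonically with t, which is consistent with preferring small timesteps when fine-grained identity cues matter.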

Timestep semantics

Up-stage layers in the UNet merge global context and local details. Carefully selecting specific up-stage outputs improves spatial precision and identity discrimination.

Hierarchical structure

Generative models often exhibit shape bias. We design MSFRN to reinforce high-frequency, fine-grained representations.

DiffPS Framework

DiffPS features a frozen diffusion backbone with decoupled detection (DGRPN) and re-ID (MSFRN, SFAN) branches, each tailored to distinct diffusion priors.

DiffPS architecture

Module Design

Dedicated modules for detection and re-ID that fully exploit the diffusion backbone.

📍 DGRPN (Detection)

A region proposal network that harnesses text-conditioned priors to enhance localization and suppress background clutter.

🔍 MSFRN (Re-ID)

Multi-scale frequency refinement with hierarchical DWT to mitigate shape bias and enhance fine-grained discriminative features.
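As a rough illustration of the hierarchical DWT idea, the sketch below splits an image into a low-frequency approximation band and three high-frequency detail bands, then repeats the split on the approximation band. The Haar wavelet, the plain list-of-lists image format, and all function names are assumptions chosen for brevity; how MSFRN actually consumes the sub-bands is not shown here.

```python
def haar_dwt2(img):
    """One level of an orthonormal 2D Haar transform on a list-of-lists image.

    Height and width must be even. Returns four half-resolution bands:
    approx (low-pass), lr_diff, tb_diff, and diag_diff (high-pass details).
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("image height and width must be even")
    approx, lr_diff, tb_diff, diag_diff = [], [], [], []
    for i in range(0, h, 2):
        row_a, row_lr, row_tb, row_dg = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            row_a.append((a + b + c + d) / 2)   # local average
            row_lr.append((a - b + c - d) / 2)  # left-right difference
            row_tb.append((a + b - c - d) / 2)  # top-bottom difference
            row_dg.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(row_a)
        lr_diff.append(row_lr)
        tb_diff.append(row_tb)
        diag_diff.append(row_dg)
    return approx, lr_diff, tb_diff, diag_diff

def haar_pyramid(img, levels):
    """Hierarchical DWT: re-decompose the approximation band at each level."""
    details, current = [], img
    for _ in range(levels):
        current, lr_diff, tb_diff, diag_diff = haar_dwt2(current)
        details.append((lr_diff, tb_diff, diag_diff))
    return current, details

def energy(band):
    """Sum of squared coefficients in a band."""
    return sum(x * x for row in band for x in row)
```

On a flat image all three detail bands are zero, while a sharp vertical edge shows up only in the left-right difference band. That separation is what makes the detail bands a natural place to reinforce fine-grained, high-frequency cues.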

🧠 SFAN (Re-ID)

Semantic-adaptive feature aggregation using cross-modal alignment for text-aware re-ID representations.
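One generic way to picture text-aware aggregation is similarity-weighted pooling: score each spatial feature against a text embedding, turn the scores into softmax weights, and take the weighted sum. The sketch below shows only that generic pattern. It is not SFAN's actual architecture, and the function names, the cosine scoring, and the temperature value are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def text_guided_pool(features, text_embedding, temperature=0.1):
    """Pool per-location features using softmax weights from text similarity.

    features: list of feature vectors, one per spatial location.
    text_embedding: a single vector in the same space (hypothetical input).
    Returns (pooled_vector, weights).
    """
    scores = [cosine(f, text_embedding) / temperature for f in features]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
    return pooled, weights
```

Locations whose features align with the text embedding dominate the pooled vector; a lower temperature makes the weighting sharper.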

Quantitative Results

State-of-the-art results on CUHK-SYSU and PRW

Curious to dive deeper? Check out our full paper for all the details.

Citation

@inproceedings{kim2025leveraging,
  title     = {Leveraging Prior Knowledge of Diffusion Model for Person Search},
  author    = {Kim, Giyeol and Yang, Sooyoung and Oh, Jihyong and Kang, Myungjoo and Eom, Chanho},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages     = {20301--20312},
  year      = {2025}
}