DiffPS: Leveraging Prior Knowledge of Diffusion Model for Person Search

Giyeol Kim1*, Sooyoung Yang2*, Jihyong Oh1, Myungjoo Kang2,3, Chanho Eom1†
1GSAIM, Chung-Ang University, 2IPAI, Seoul National University,
3Department of Mathematical Sciences and RIMS, Seoul National University

*Equal contribution

🏆 ICCV 2025 Highlight Paper 🎉

Teaser Figure

🚀 Research Motivation

Most existing person search models rely on ImageNet pre-trained backbones. While these backbones provide decent fine-grained features, they often lack the rich visual priors required for person search in diverse and complex scenes.

Furthermore, conventional approaches share a single backbone feature between person detection and person re-identification (Re-ID), which leads to conflicting optimization objectives and degraded performance.

Our key motivation is to address these limitations by leveraging a pre-trained diffusion model, which offers richer visual semantics and enables task-specific decoupling to avoid feature interference.

Method

Prior Knowledge in Diffusion Model

DiffPS uses a frozen diffusion model backbone (Stable Diffusion) to provide rich spatial features, and separates task-specific features for detection and Re-ID to avoid gradient interference. This decoupled design ensures stability and better representation learning. To fully exploit the capabilities of the pre-trained diffusion model, DiffPS leverages four inherent priors embedded in the model architecture:

DiffPS Framework

DiffPS features a frozen diffusion backbone with decoupled detection (DGRPN) and Re-ID (MSFRN, SFAN) branches, each tailored to leverage distinct diffusion priors for robust person search.

Architecture Diagram

🔎 Module design in DiffPS

DiffPS decouples the person search task into two branches: detection and re-identification (Re-ID), and designs dedicated modules to fully exploit the strengths of the diffusion backbone.
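To make the motivation for decoupling concrete, here is a toy scalar sketch of gradient interference. It is not the DiffPS architecture: the one-weight "backbone" `w`, the head weights `a` and `b`, the squared-error losses, and every number are assumptions chosen only for illustration. It shows that two task losses can push a shared trainable weight in opposite directions, while a frozen backbone leaves each head to follow only its own loss.

```python
# Toy model (all names and values are hypothetical):
#   feature = w * x          shared "backbone"
#   det_out = a * feature    detection head
#   reid_out = b * feature   Re-ID head


def task_gradients(w, a, b, x, det_target, reid_target):
    """Return d(loss)/dw for each squared-error task loss."""
    feature = w * x
    det_residual = a * feature - det_target
    reid_residual = b * feature - reid_target
    grad_w_det = 2.0 * det_residual * a * x
    grad_w_reid = 2.0 * reid_residual * b * x
    return grad_w_det, grad_w_reid


def train_step(w, a, b, x, det_target, reid_target, lr=0.1, freeze_backbone=True):
    """One gradient-descent step; w is left untouched when the backbone is frozen."""
    feature = w * x
    det_residual = a * feature - det_target
    reid_residual = b * feature - reid_target
    # Each head weight receives the gradient of its own loss only.
    new_a = a - lr * 2.0 * det_residual * feature
    new_b = b - lr * 2.0 * reid_residual * feature
    if freeze_backbone:
        new_w = w
    else:
        # A shared trainable backbone receives the sum of both task gradients.
        g_det, g_reid = task_gradients(w, a, b, x, det_target, reid_target)
        new_w = w - lr * (g_det + g_reid)
    return new_w, new_a, new_b


g_det, g_reid = task_gradients(w=1.0, a=1.0, b=1.0, x=1.0,
                               det_target=2.0, reid_target=0.5)
print(g_det, g_reid)  # -2.0 1.0 (opposite signs on the shared weight)
```

With `freeze_backbone=True`, `train_step` returns `w` unchanged, so neither task can disturb the features the other task relies on.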

🔍 MSFRN (Re-ID)

MSFRN enhances fine-grained discriminative features by combining multi-scale fusion with hierarchical frequency decomposition based on the discrete wavelet transform (DWT). This mitigates the shape bias in diffusion features.
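As a rough illustration of what a hierarchical DWT decomposition produces, the sketch below implements a Haar wavelet pyramid in plain Python. The text above only says the module is DWT-based, so the Haar wavelet, the function names `haar_dwt2` and `haar_pyramid`, and the number of levels are assumptions; how MSFRN actually fuses the sub-bands is not shown here.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT of a 2D list with even height and width.

    Returns (LL, LH, HL, HH), each half the input size. Sub-band naming
    follows one common convention; other sources swap LH and HL.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, height, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for j in range(0, width, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            ll_row.append((a + b + c + d) / 2.0)  # local average (low frequency)
            lh_row.append((a + b - c - d) / 2.0)  # top vs bottom (horizontal edges)
            hl_row.append((a - b + c - d) / 2.0)  # left vs right (vertical edges)
            hh_row.append((a - b - c + d) / 2.0)  # diagonal detail
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def haar_pyramid(image, levels):
    """Apply haar_dwt2 repeatedly to the low-frequency band.

    Returns the coarsest LL band and a list of (LH, HL, HH) tuples,
    ordered from the finest level to the coarsest.
    """
    approx, details = image, []
    for _ in range(levels):
        approx, lh, hl, hh = haar_dwt2(approx)
        details.append((lh, hl, hh))
    return approx, details


ll, lh, hl, hh = haar_dwt2([[1, 2], [3, 4]])
print(ll, lh, hl, hh)  # [[5.0]] [[-2.0]] [[-1.0]] [[0.0]]
```

The high-frequency bands (LH, HL, HH) carry edge and texture detail, the kind of fine-grained cue that a shape-biased feature map tends to underrepresent, while the LL band keeps the coarse structure that the next level decomposes further.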

🧠 SFAN (Re-ID)

SFAN leverages cross-modal alignment from the diffusion model to boost identity discrimination, refining Re-ID representations with text-aware semantics.

📍 DGRPN (Detection)

A novel region proposal network tailored to diffusion features, DGRPN harnesses text-conditioned priors to enhance localization accuracy and suppress background clutter in detection.

Quantitative Results

SOTA Results

🎉 Curious to dive deeper? Check out our full paper above for all the exciting details! 📄

BibTeX

@inproceedings{kim2025diffps,
  title={Leveraging Prior Knowledge of Diffusion Model for Person Search},
  author={Kim, Giyeol and Yang, Sooyoung and Oh, Jihyong and Kang, Myungjoo and Eom, Chanho},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}