Most existing person search models rely on ImageNet pre-trained backbones. While these backbones provide decent fine-grained features, they often lack the rich visual priors required for person search in diverse and complex scenes.
Furthermore, conventional approaches share a single backbone feature between the person detection and person re-identification (Re-ID) tasks, leading to conflicting optimization objectives and degraded performance.
Our key motivation is to address these limitations by leveraging a pre-trained diffusion model, which offers richer visual semantics and enables task-specific decoupling to avoid feature interference.
DiffPS uses a frozen diffusion backbone (Stable Diffusion) to provide rich spatial features and decouples person search into two branches: detection and Re-ID. Each branch is served by dedicated modules (MSFRN and SFAN for Re-ID, DGRPN for detection), so the gradients of the two tasks never interfere, which stabilizes training and improves representation learning. To fully exploit the pre-trained diffusion model, DiffPS leverages four inherent priors embedded in its architecture:
MSFRN enhances fine-grained discriminative features by combining multi-scale fusion with hierarchical frequency decomposition (DWT-based). This mitigates the shape bias in diffusion features.
SFAN leverages the cross-modal alignment prior of the diffusion model to boost identity discrimination, refining Re-ID representations with text-aware semantics.
A novel region proposal network tailored to diffusion features, DGRPN harnesses text-conditioned priors to enhance localization accuracy and suppress background clutter in detection.
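To give a feel for the frequency decomposition inside MSFRN, here is a minimal single-level 2-D Haar DWT in NumPy. It is an illustrative sketch, not the paper's code: it splits a feature map into a low-frequency approximation (LL) and high-frequency detail sub-bands (LH, HL, HH), which is where fine texture, rather than coarse shape, concentrates.

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """Single-level 2-D Haar DWT of an (H, W) feature map.

    Returns the low-frequency approximation LL and the detail
    sub-bands LH, HL, HH, each of shape (H/2, W/2).
    """
    # Pair adjacent rows, then adjacent columns (H and W assumed even).
    a, b = x[0::2, :], x[1::2, :]
    lo_r = (a + b) / 2.0          # row low-pass
    hi_r = (a - b) / 2.0          # row high-pass

    def split_cols(y):
        c, d = y[:, 0::2], y[:, 1::2]
        return (c + d) / 2.0, (c - d) / 2.0

    LL, LH = split_cols(lo_r)
    HL, HH = split_cols(hi_r)
    return LL, LH, HL, HH

# Toy feature map: a smooth ramp with a sharp vertical edge.
H = W = 8
feat = np.tile(np.linspace(0, 1, W), (H, 1))
feat[:, 5:] += 5.0  # sharp edge -> high-frequency energy

LL, LH, HL, HH = haar_dwt2(feat)
print(LL.shape)  # (4, 4): coarse structure survives in LL
```

The sharp edge produces a large response in the detail band LH while the smooth ramp stays in LL, which is the intuition behind separating shape-biased diffusion features from fine-grained identity cues.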
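The text-aware refinement in SFAN can be sketched as attention from region embeddings to text-prompt embeddings. Everything below is a hypothetical stand-in (`align_with_text`, the random embeddings, the temperature `tau` are all illustrative); in the actual model the text side would come from the diffusion model's text encoder.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between (N, D) and (K, D) matrices."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def align_with_text(region_feats, text_embs, tau=0.07):
    """Refine Re-ID region features with text-aware context.

    region_feats: (N, D) embeddings for N person regions.
    text_embs:    (K, D) embeddings of K text prompts, a stand-in
                  for the diffusion model's text-encoder outputs.
    """
    sim = cosine_sim(region_feats, text_embs)   # (N, K)
    w = np.exp(sim / tau)
    w = w / w.sum(axis=1, keepdims=True)        # softmax over prompts
    return region_feats + w @ text_embs         # residual text context

rng = np.random.default_rng(0)
regions = rng.normal(size=(3, 16))
prompts = rng.normal(size=(4, 16))
out = align_with_text(regions, prompts)
print(out.shape)  # (3, 16)
```

The residual form keeps the original identity embedding intact while mixing in semantics from the prompts each region most resembles.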
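Finally, the idea behind DGRPN's text-conditioned localization can be illustrated by fusing raw objectness scores with a cross-attention map for the word "person". The function and the log-prior fusion rule below are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def text_guided_objectness(rpn_scores, attn_map, alpha=0.5):
    """Fuse RPN objectness with a text cross-attention prior.

    rpn_scores: (H, W) raw objectness logits from the proposal head.
    attn_map:   (H, W) cross-attention of the word "person" over
                spatial features, values in [0, 1]; a stand-in for
                the diffusion model's text-conditioned prior.
    Cells with low "person" attention (background clutter) are suppressed.
    """
    prior = np.clip(attn_map, 1e-6, 1.0)
    return rpn_scores + alpha * np.log(prior)  # log-prior fusion

H = W = 4
scores = np.zeros((H, W))                 # uniform objectness
attn = np.full((H, W), 0.1)
attn[1:3, 1:3] = 0.9                      # "person" attends to the center
fused = text_guided_objectness(scores, attn)
print(fused[1, 1] > fused[0, 0])  # True: center proposals now win
```

Even with uniform detector scores, the text prior alone re-ranks proposals toward regions the diffusion model associates with the person concept, which is the clutter-suppression effect described above.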
Curious to dive deeper? Check out our full paper above for all the exciting details!