Most existing person search models rely on ImageNet pre-trained backbones. While these backbones provide decent fine-grained features, they often lack the rich visual priors required for person search in diverse and complex scenes.
Furthermore, conventional approaches rely on a shared backbone feature for both person detection and person re-identification tasks, leading to conflicting optimization objectives and degraded performance.
Our key motivation is to address these limitations by leveraging a pre-trained diffusion model, which offers richer visual semantics and enables task-specific decoupling to avoid feature interference.
DiffPS uses a frozen diffusion model backbone to provide rich spatial features, and separates task-specific features for detection and Re-ID to avoid gradient interference. This decoupled design ensures stability and better representation learning. To fully exploit the capabilities of the pre-trained diffusion model, DiffPS draws upon four core priors:
Figure: DiffPS framework. DiffPS features a frozen diffusion backbone with decoupled detection (DGRPN) and re-ID (MSFRN, SFAN) branches, each tailored to leverage distinct diffusion priors for robust person search.
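To make the decoupling idea concrete, the following is a minimal pure-Python sketch, not the DiffPS implementation. It assumes a toy fixed linear projection in place of the frozen diffusion backbone and a single linear layer per task in place of the DGRPN and MSFRN/SFAN branches. The names `FrozenBackbone`, `LinearHead`, and `sgd_step`, the weights, the targets, and the learning rate are all hypothetical. The point it illustrates is that, with a frozen backbone, each task loss only updates its own branch.

```python
class FrozenBackbone:
    """Toy stand-in for a frozen backbone: fixed weights and no update method."""

    def __init__(self):
        # Hypothetical fixed projection (4 inputs -> 3 features); never modified.
        self.weights = (
            (0.5, 0.0, -0.5, 0.0),
            (0.0, 0.5, 0.0, 0.5),
            (0.25, 0.25, 0.25, 0.25),
        )

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


class LinearHead:
    """Hypothetical task branch: one trainable linear layer with a scalar output."""

    def __init__(self, feat_dim):
        self.weights = [0.0] * feat_dim

    def __call__(self, feat):
        return sum(w * f for w, f in zip(self.weights, feat))

    def sgd_step(self, feat, target, lr=0.1):
        # Squared-error loss L = (w.f - target)^2, so dL/dw = 2 * (w.f - target) * f.
        err = self(feat) - target
        self.weights = [w - lr * 2.0 * err * f for w, f in zip(self.weights, feat)]
        return err ** 2  # loss measured before this update


backbone = FrozenBackbone()
det_head = LinearHead(feat_dim=3)   # stands in for the detection branch
reid_head = LinearHead(feat_dim=3)  # stands in for the re-ID branch

x = [1.0, 1.0, 0.0, 0.0]
backbone_snapshot = backbone.weights
feat = backbone(x)  # shared features, computed once by the frozen backbone

det_losses, reid_losses = [], []
for _ in range(20):
    # The two tasks pull toward opposite targets on the same features,
    # but each loss only ever updates its own head.
    det_losses.append(det_head.sgd_step(feat, target=-1.0))
    reid_losses.append(reid_head.sgd_step(feat, target=1.0))
```

In this toy setting the two objectives are deliberately opposed. Had they shared trainable weights, the updates would pull the same parameters in opposite directions; here both losses decrease because neither gradient reaches the backbone or the other branch.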