PhyGenHOI is a novel framework for generating physically accurate and photorealistic 4D Human-Object Interactions. By coupling generative human motion (Motion Diffusion Model) with physical object simulation (Material Point Method) over a shared 3D Gaussian Splatting representation, we synthesize dynamic scenes where humans actively engage with objects through text-driven actions.
We address the task of generating physically accurate and photorealistic 4D Human-Object Interaction (HOI). Given a static 3D human and a target object, both represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through a diverse set of actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) a Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) a Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) a Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments demonstrate that PhyGenHOI generates realistic and physically consistent 4D Human-Object Interactions across a diverse set of actions, humans, and objects, outperforming state-of-the-art baselines.
Given a static 3D human and a target object, both represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through a diverse set of text-driven actions. To achieve this, PhyGenHOI couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. Their interaction is supervised through three coupled mechanisms: (1) a Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) a Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) a Masked Video-SDS objective that injects video-based priors to enhance contact fidelity.
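As a rough illustration of the first mechanism, the sketch below computes an attraction term that only acts inside a temporal window. It is a minimal NumPy sketch under stated assumptions, not the loss used in PhyGenHOI: the function name `windowed_attraction_loss`, the hard (binary) window, the `center` and `half_width` parameters, and the squared-distance penalty are all hypothetical choices made for illustration.

```python
import numpy as np


def windowed_attraction_loss(limb_traj, target_traj, center, half_width):
    """Mean squared limb-to-target distance over frames inside a window.

    limb_traj, target_traj: arrays of shape (T, 3), one 3D point per frame.
    center, half_width: hypothetical window parameters, in frames.
    """
    limb_traj = np.asarray(limb_traj, dtype=float)
    target_traj = np.asarray(target_traj, dtype=float)

    # Hard temporal window (assumption): frames outside it are ignored,
    # so the motion is only pulled toward the target near the chosen time.
    frames = np.arange(limb_traj.shape[0])
    mask = np.abs(frames - center) <= half_width
    if not mask.any():
        return 0.0  # window lies entirely outside the sequence

    # Squared Euclidean distance per frame, averaged over the window.
    sq_dist = np.sum((limb_traj - target_traj) ** 2, axis=1)
    return float(sq_dist[mask].mean())
```

In this toy form, frames far from `center` contribute nothing, which leaves the rest of the generated motion free to keep its natural wind-up and follow-through. A differentiable variant would replace NumPy with an autodiff framework and could use a soft window instead of a binary mask.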
We evaluate the contribution of each component by removing them one at a time. The table shows results for the following ablations:
- w/o Attraction - lets the MDM-generated human motion evolve independently of the object's position, so it frequently misses the target entirely or makes contact at unnatural moments
- w/o Contact - breaks the causal relationship between human action and object response; the human successfully reaches the object, but the object continues its initial trajectory unaffected
- w/o MDM - produces unnatural human motion lacking the characteristic wind-up and follow-through of realistic actions
- w/o Video-SDS - preserves the overall physical plausibility but leaves minor penetration artifacts at the contact region
- w/o MPM - eliminates both realistic pre-contact motion and post-contact material response
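To make the "w/o Contact" case concrete, the toy sketch below detects the first contact frame and then updates the object's velocity. It deliberately swaps the MPM re-simulation for a much simpler head-on point-mass collision with a coefficient of restitution, so it illustrates the causal link between impact and object response rather than the actual method. The function names, the `radius` threshold, the masses, and `restitution` are all assumptions introduced for this example.

```python
import numpy as np


def first_contact_frame(limb_traj, obj_pos, radius):
    """Index of the first frame where the limb is within `radius` of the
    object center, or None if contact never happens."""
    limb_traj = np.asarray(limb_traj, dtype=float)
    dist = np.linalg.norm(limb_traj - np.asarray(obj_pos, dtype=float), axis=1)
    hits = np.nonzero(dist <= radius)[0]
    return int(hits[0]) if hits.size else None


def post_contact_velocity(obj_vel, limb_vel, obj_mass, limb_mass, restitution=0.5):
    """Object velocity after a head-on collision between two point masses.

    Standard restitution formula; a stand-in for the MPM re-simulation,
    not an implementation of it.
    """
    obj_vel = np.asarray(obj_vel, dtype=float)
    limb_vel = np.asarray(limb_vel, dtype=float)
    total_momentum = obj_mass * obj_vel + limb_mass * limb_vel
    bounce = limb_mass * restitution * (limb_vel - obj_vel)
    return (total_momentum + bounce) / (obj_mass + limb_mass)


# Usage: without a contact trigger the object keeps its initial velocity;
# with it, the velocity is updated at the detected frame.
limb = [[3.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
frame = first_contact_frame(limb, obj_pos=[0.0, 0.0, 0.0], radius=1.0)
obj_vel = np.zeros(3)
if frame is not None:
    obj_vel = post_contact_velocity(obj_vel, [-1.0, 0.0, 0.0], 1.0, 1.0)
```

Skipping the `if frame is not None` update corresponds to the "w/o Contact" behavior described above: the limb reaches the object, but the object's motion is unchanged.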
@misc{benishu2026phygenhoiphysicallyaware4dgeneration,
title={PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions},
author={Omer Benishu and Gal Fiebelman and Sagie Benaim},
year={2026},
eprint={2605.30268},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.30268},
}