PINO: Person-Interaction Noise Optimization

Spatiotemporal Motion Penalty

"Two people walk from one location to another."

Switching Prompt

"These two are posing for a photo to capture the moment."

"One person is posing while the other is taking a picture."

Motion Extension & Switching Prompt

"Two people shake hands."

"Two people are standing."

Abstract

Generating realistic group interactions involving multiple characters remains challenging due to increasing complexity as group size expands. While existing conditional diffusion models incrementally generate motions by conditioning on previously generated characters, they rely on single shared prompts, limiting nuanced control and leading to overly simplified interactions.

In this paper, we introduce Person-Interaction Noise Optimization (PINO), a novel, training-free framework designed for generating realistic and customizable interactions among groups of arbitrary size. PINO decomposes complex group interactions into sequential, semantically relevant pairwise interactions, leveraging pretrained two-person interaction diffusion models. To ensure physical plausibility and avoid common artifacts such as overlapping or penetration between characters, PINO employs physics-based penalties during noise optimization. This approach allows precise user control over character orientation, speed, and spatial relationships without additional training.

Comprehensive evaluations demonstrate that PINO generates visually realistic, physically coherent, and adaptable multi-person interactions suitable for diverse animation, gaming, and robotics applications.

Approach

PINO addresses multi-character interaction by decomposing it into a series of pairwise motion-generation steps.

It starts from a pretrained two-person diffusion model.
To add a new character, the method pairs the newcomer with one of the existing characters, supplies a text prompt describing only that pair, and generates their motion with the same model.
The model's initial noise is then optimized so that the new motion fits the entire group, avoiding overlaps and preserving proper distances and orientations.
This optimization-and-add cycle is repeated to build interactions of any size, and the same idea can be used to extend motion sequences over time.
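The optimization-and-add cycle can be illustrated with a deliberately small sketch. It is not the PINO implementation: each character is reduced to a 2D root trajectory, `toy_pair_model` is a hypothetical stand-in for the pretrained two-person diffusion model (it ignores text prompts and simply offsets the partner's trajectory by the noise), and `overlap_penalty`, `MIN_DIST`, the step count, and the learning rate are all assumed for illustration. Because the toy model is linear in the noise, the penalty gradient with respect to the noise can be written by hand; a real setup would presumably obtain it by differentiating through the sampler.

```python
import numpy as np

MIN_DIST = 0.5  # assumed minimum root-to-root distance (arbitrary units)


def toy_pair_model(noise, partner_traj):
    """Hypothetical stand-in for a pretrained two-person model.

    Maps noise of shape (T, 2) to the newcomer's root trajectory,
    conditioned on the partner's trajectory.
    """
    return partner_traj + noise


def overlap_penalty(new_traj, group_trajs, min_dist=MIN_DIST):
    """Squared hinge on every per-frame distance that falls below min_dist."""
    total = 0.0
    for other in group_trajs:
        dist = np.linalg.norm(new_traj - other, axis=-1)
        total += float(np.sum(np.maximum(0.0, min_dist - dist) ** 2))
    return total


def overlap_penalty_grad(new_traj, group_trajs, min_dist=MIN_DIST, eps=1e-8):
    """Analytic gradient of overlap_penalty with respect to new_traj."""
    grad = np.zeros_like(new_traj)
    for other in group_trajs:
        diff = new_traj - other
        dist = np.linalg.norm(diff, axis=-1, keepdims=True)
        violation = np.maximum(0.0, min_dist - dist)
        grad -= 2.0 * violation * diff / (dist + eps)
    return grad


def add_character(group_trajs, partner_idx, rng, steps=300, lr=0.1):
    """Pair a newcomer with one existing character, then optimize its noise
    against the whole group."""
    partner = group_trajs[partner_idx]
    noise = rng.normal(scale=0.3, size=partner.shape)
    for _ in range(steps):
        traj = toy_pair_model(noise, partner)
        # d(traj)/d(noise) is the identity for the toy model, so the
        # trajectory gradient is also the noise gradient.
        noise = noise - lr * overlap_penalty_grad(traj, group_trajs)
    return toy_pair_model(noise, partner)


rng = np.random.default_rng(0)
num_frames = 30
group = [np.zeros((num_frames, 2))]  # first character stands at the origin
for _ in range(2):                   # add two more characters, one at a time
    group.append(add_character(group, partner_idx=len(group) - 1, rng=rng))
```

Each newcomer is generated relative to a single partner, but the penalty is evaluated against every character already in the scene, which is the part of the cycle that keeps the growing group consistent.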

Because physical and spatial penalties are built into the noise-optimization step, PINO offers fine-grained control over motion composition without any additional training of the diffusion model.
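As a rough illustration of how such control could be expressed, the sketch below combines three penalty terms on 2D root trajectories: a target distance between two characters, a facing term, and a target speed. The function names, the weighting scheme, the 30 fps frame rate, and the specific penalty forms are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def distance_penalty(traj_a, traj_b, target_dist):
    """Mean squared deviation of the per-frame root distance from target_dist."""
    dist = np.linalg.norm(traj_a - traj_b, axis=-1)
    return float(np.mean((dist - target_dist) ** 2))


def facing_penalty(traj_a, heading_a, traj_b, eps=1e-8):
    """Mean of 1 - cos(angle) between A's heading and the direction to B.

    heading_a holds unit 2D facing vectors of shape (T, 2); the term is
    close to 0 when A looks straight at B and close to 2 when A faces away.
    """
    to_b = traj_b - traj_a
    to_b = to_b / (np.linalg.norm(to_b, axis=-1, keepdims=True) + eps)
    cos = np.sum(heading_a * to_b, axis=-1)
    return float(np.mean(1.0 - cos))


def speed_penalty(traj, target_speed, fps=30.0):
    """Mean squared deviation of root speed (units per second) from target_speed."""
    velocity = np.diff(traj, axis=0) * fps
    speed = np.linalg.norm(velocity, axis=-1)
    return float(np.mean((speed - target_speed) ** 2))


def total_penalty(traj_a, heading_a, traj_b, weights, targets):
    """Weighted sum of the three assumed control terms."""
    return (
        weights["dist"] * distance_penalty(traj_a, traj_b, targets["dist"])
        + weights["facing"] * facing_penalty(traj_a, heading_a, traj_b)
        + weights["speed"] * speed_penalty(traj_a, targets["speed"])
    )


# Usage: A walks along +x at 1 unit/s, B stays 1 unit ahead of A.
fps = 30.0
t = np.arange(10) / fps
traj_a = np.stack([t, np.zeros_like(t)], axis=-1)
traj_b = traj_a + np.array([1.0, 0.0])
facing_b = np.tile([1.0, 0.0], (10, 1))      # A looks toward B
facing_away = np.tile([-1.0, 0.0], (10, 1))  # A looks away from B

weights = {"dist": 1.0, "facing": 1.0, "speed": 1.0}
targets = {"dist": 1.0, "speed": 1.0}
good = total_penalty(traj_a, facing_b, traj_b, weights, targets)
bad = total_penalty(traj_a, facing_away, traj_b, weights, targets)
```

In this toy setting `good` is near zero and `bad` is dominated by the facing term. Adjusting the weights or targets changes which constraint the optimization favors, and none of this requires touching the generative model's parameters.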

Multi-Person Motion Generation

"They acknowledge each other's presence with a polite greeting."

"They are engaged in a game of rock-paper-scissors."

"Both two humans simultaneously jump up while raising their right arms."

"They are talking and using hand gestures."

"Both people rotate on their feet in a circular dance."

"Two people are boxing."

"One person is boxing, while the other watches and cheers from outside."

Motion Extension

Ablation Study

BibTeX

@inproceedings{ota2025pino,
  author    = {Ota, Sakuya and Yu, Qing and Fujiwara, Kent and Ikehata, Satoshi and Sato, Ikuro},
  title     = {PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}