Thesis title: Modeling Virtual Humans and Scenes
Human body features like size, shape, and pose are key elements in Human-Centric Computer Vision. They determine where people are, what they do, and how much space they occupy in the environment.
This thesis investigates three aspects of computer vision, all connected by human presence: crowd analysis, motion generation, and human interaction with scenes.
Perceiving and counting people in a crowd is useful for public safety and incident prevention.
Motion generation helps creators improve the quality and smoothness of their content with minimal effort.
Human-Scene Interaction enables generated motions to interact realistically with their environments, and evaluating such solutions is itself a critical problem. Current methods in these domains often fail to meet key user requirements.
Practical applications demand an understanding of the physical space people occupy, a requirement that mere counting fails to provide. To address this, we first introduce the new task of Crowd Volume Estimation: predicting the total human body volume present in a scene from a single RGB image.
We release ANTHROPOS-V, a photorealistic benchmark with per-person and per-part supervision derived from anatomically plausible meshes and grounded in anthropometric priors.
Training on per-part volume density maps spreads supervision beyond heads to torsos and limbs, making estimates robust to occlusion and scale.
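The per-part density-map idea can be sketched in a few lines: each body part's volume is spread over a normalized spatial blob, so integrating the map recovers the total volume. This is an illustrative toy, not the thesis' actual pipeline; the part layout, blob model, and volume values are all assumptions.

```python
import numpy as np

H, W = 64, 64

def part_density_map(center, sigma, volume_litres, shape=(H, W)):
    """Spread one body part's volume over a Gaussian blob; the blob is
    normalized to integrate to 1, so the map integrates to the part volume."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))
    g /= g.sum()                      # normalize over the (truncated) grid
    return volume_litres * g          # now integrates to the part volume

# One toy person: head, torso+arms, legs (illustrative anthropometric volumes).
parts = [((10, 32), 2.0, 4.5),
         ((30, 32), 6.0, 35.0),
         ((50, 32), 5.0, 25.0)]
density = sum(part_density_map(c, s, v) for c, s, v in parts)

total_volume = density.sum()          # integrating the map recovers body volume
```

Because supervision is distributed over all parts rather than concentrated at heads, a partially occluded person still contributes most of their volume to the target map.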
Our proposed model, STEERER-V, outperforms counting and human mesh recovery baselines on Crowd Volume Estimation and transfers readily to real images. This advances crowd analysis from simple counts to a more practical, volume-based analysis for safety, planning, and comfort.
A central challenge in motion generation is ensuring that outputs are plausible and coherent, not just diverse. To tackle the problem of implausible and uncontrolled generation, we propose MoDiPO, a method to align text-to-motion diffusion models.
MoDiPO adapts Direct Preference Optimization to the motion domain and replaces costly human labels with AI feedback.
For each prompt, an AI ranker constructs preference sets over candidate motions; we then align the generator toward the winner motions and away from the losers, while preserving diversity.
With Pick-a-Move, a new dataset of motion-preference pairs, MoDiPO improves FID, human preference, and prompt faithfulness without mode collapse, turning diffusion-based text-to-motion models into preference-aligned generators.
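The preference-alignment step can be illustrated with a minimal DPO-style loss on winner/loser log-likelihoods, regularized by a frozen reference model. This is a generic sketch of Direct Preference Optimization, not MoDiPO's exact objective, and the numbers below are made up.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss: reward the policy for preferring the winner motion
    over the loser by a larger margin than the frozen reference model does."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log(sigmoid(margin))

# Illustrative log-probabilities: the policy already prefers the winner
# slightly more than the reference does, so the loss drops below
# -log(0.5) ~= 0.693, the value at zero margin.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

The reference-model terms act as a KL-style anchor: the generator is pulled toward winners only relative to its starting point, which is what helps preserve diversity.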
True understanding requires contextual coherence across motion, intention, and the environment. To achieve this, we propose MonSTeR, a tri-modal retrieval model that embeds motion, scene, and text in a unified latent space. This space is trained via coupled unimodal and cross-modal encoders, supporting flexible, all-direction retrieval.
This versatile representation enables downstream tasks like zero-shot in-scene object placement and enhanced scene-aware motion captioning.
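A minimal sketch of such a shared tri-modal space, with random projections standing in for the learned encoders (all names, dimensions, and data here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "encoders": random projections into a common latent dimension.
D_MOTION, D_SCENE, D_TEXT, D_LATENT = 32, 48, 16, 8
W_motion = rng.normal(size=(D_MOTION, D_LATENT))
W_scene = rng.normal(size=(D_SCENE, D_LATENT))
W_text = rng.normal(size=(D_TEXT, D_LATENT))

def embed(x, W):
    """Project a batch into the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

motions = embed(rng.normal(size=(5, D_MOTION)), W_motion)
scenes = embed(rng.normal(size=(5, D_SCENE)), W_scene)
texts = embed(rng.normal(size=(5, D_TEXT)), W_text)

# Any modality can query any other via cosine similarity in the shared space.
sim_text_to_motion = texts @ motions.T          # (5, 5) similarity matrix
best_motion_per_text = sim_text_to_motion.argmax(axis=1)
```

Because every modality lands in the same unit-norm space, the same similarity matrix supports all retrieval directions (text-to-motion, motion-to-scene, and so on) without retraining.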
MonSTeR serves as a powerful evaluator for Human-Scene Interaction models. We validate this capability by showing that its scores correctly penalize physically implausible interactions such as those derived from path rotations and scene collisions.
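The evaluation idea can be sketched with a toy score: an aligned motion-scene-text triplet shares a latent direction, while a perturbed motion (e.g. a rotated path) drifts away from it. Everything here is synthetic; the hand-built embeddings stand in for trained encoders.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64

def unit(v):
    return v / np.linalg.norm(v)

def noisy(base, scale=0.2):
    """A unit vector near `base`, with fixed-magnitude random perturbation."""
    n = rng.normal(size=base.shape)
    return unit(base + scale * n / np.linalg.norm(n))

shared = unit(rng.normal(size=D))     # latent "interaction" direction
motion = noisy(shared)                # intact, coherent triplet
scene = noisy(shared)
text = noisy(shared)
perturbed = unit(rng.normal(size=D))  # implausible interaction, off-direction

def triplet_score(m, s, t):
    """Mean pairwise cosine similarity across the three modalities."""
    return (m @ s + m @ t + s @ t) / 3.0

intact = triplet_score(motion, scene, text)
broken = triplet_score(perturbed, scene, text)
# A useful evaluator assigns the intact triplet the higher score.
```

The same score-comparison pattern underlies the validation described above: corrupt an interaction, re-score it, and check that the evaluator's score drops.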
Together, these contributions move from population-level capacity (volume), to individual-level plausibility (aligned motion), to contextual coherence (motion-scene-text alignment).