Thesis title: Generative Models for Human Motion: From Rule-Based Crowds to Scene-Aware Alignment
In recent years, generative models have rapidly evolved into one of the most groundbreaking paradigms in machine learning, reshaping how data is synthesized across domains such as text, images, audio, and human animation.
Within this broad domain, human motion generation has emerged as a particularly compelling application, offering the ability to produce diverse, semantically meaningful, and context-aware motions from compact, descriptive conditioning signals such as text prompts, scene layouts, or past trajectories. This capability enables a wide range of applications, from animation, virtual reality, and gaming to robotics, rehabilitation, and multi-agent simulation for data synthesis.
This dissertation investigates motion generation across successive methodological stages, from symbolic-algorithmic pipelines to multimodal neural strategies, addressing emerging challenges in the field.
First, motion is high-dimensional and temporally complex, making it computationally demanding to balance realism with diversity, especially when scaling to large crowds. Second, generative models often treat their latent spaces as black boxes, freely sampling from regions that may yield implausible outputs; here we explore both steering models away from such regions and repurposing their outputs as useful signals. Third, when conditioning on additional modalities such as scenes, existing methods lack robust evaluation metrics for assessing motion coherence relative to the environment, hindering alignment and reliability.
To address these issues, this thesis progresses from scalable symbolic pipelines to diffusion-based models, introducing alignment strategies that steer generation toward preferred outcomes and developing unified latent representations that enable estimating fitness and coherence across text, motion, and scene samples.
In particular, we begin with ANTHROPOS-V, where a scalable rule-based system leverages a game engine to efficiently generate large synthetic crowds, providing an annotated dataset valuable for downstream tasks.
We then transition to diffusion-based models with MoCoDAD, using stochastic motion generation for synthesis while also exploiting poor-quality generations, which likely originate from underfitted regions of the latent space, as anomaly indicators.
Building on this, MoDiPO introduces alignment strategies through Direct Preference Optimization, steering generative models toward preferred outputs using AI feedback and reducing reliance on costly human annotations.
Finally, MonSTeR proposes a unified latent space embedding motion, text, and scene together, enabling not only cross-modal retrieval and evaluation of generated samples, but also notable downstream tasks.
By considering generative models together with their evaluation and contrastive alignment, this dissertation shows not only how motions can be synthesized, but also how they can be made interpretable, measurable, and useful for practical applications.