LUCA COLLORONE

PhD Graduate

PhD program: XXXVIII



Thesis title: Generative Models for Human Motion: From Rule-Based Crowds to Scene-Aware Alignment

In recent years, generative modeling has rapidly evolved into one of the most groundbreaking paradigms in machine learning, reshaping how data is synthesized across domains such as text, images, audio, and human animation. Within this broad domain, human motion generation has emerged as a particularly compelling application, offering the ability to produce diverse, semantically meaningful, and context-aware motions from compact, descriptive conditioning signals such as text prompts, scene layouts, or past trajectories. This capacity enables a wide range of applications, from animation, virtual reality, and gaming to robotics, rehabilitation, and multi-agent simulation for data synthesis. This dissertation investigates motion generation across different methodological stages, from symbolic-algorithmic pipelines to multimodal neural strategies, and addresses three emerging challenges in the field. First, motion is high-dimensional and both temporally and computationally complex, making it difficult to balance realism with diversity, especially when scaling to large crowds. Second, generative models often treat their latent spaces as black boxes, freely sampling from regions that may yield implausible outputs; here we explore both steering models away from such regions and repurposing the resulting implausible outputs as useful signals. Third, when conditioning on additional modalities such as scenes, existing methods lack robust evaluation metrics to assess motion coherence relative to the environment, hindering alignment and reliability. To address these issues, this thesis progresses from scalable symbolic pipelines to diffusion-based models, introducing alignment strategies that steer generation toward preferred outcomes and developing unified latent representations that enable the estimation of fitness and coherence across text, motion, and scene samples.
In particular, we begin with ANTHROPOS-V, in which a scalable rule-based system leverages a game engine to efficiently generate large synthetic crowds, providing an annotated dataset valuable for downstream tasks. We then transition to diffusion-based models with MoCoDAD, which uses stochastic motion generation for synthesis and also exploits poor-quality generations, likely originating from underfitted regions of the latent space, as anomaly indicators. Building on this, MoDiPO introduces alignment strategies through Direct Preference Optimization, steering generative models toward preferred outputs using AI feedback and reducing reliance on costly human annotations. Finally, MonSTeR proposes a unified latent space that jointly embeds motion, text, and scene, enabling not only cross-modal retrieval and evaluation of generated samples but also notable downstream tasks. By considering generative models together with their evaluation and contrastive alignment, this dissertation shows not only how motions can be synthesized, but also how they can be made interpretable, measurable, and useful for practical applications.

Research products

11573/1741969 - 2025 - ANTHROPOS-V: Benchmarking the Novel Task of Crowd Volume Estimation
Collorone, Luca; D'arrigo, Stefano; Pappa, Massimiliano; D'amely Di Melendugno, Guido M.; Ficarra, Giovanni; Galasso, Fabio - 04b Atto di convegno in volume
conference: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025 (Tucson, AZ, USA)
book: Proceedings of the 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025 - (979-8-3315-1083-1)

11573/1753073 - 2025 - MonSTeR: a Unified Model for Motion, Scene, Text Retrieval
Collorone, Luca; Gioia, Matteo; Pappa, Massimiliano; Leoni, Paolo; Ficarra, Giovanni; Litany, Or; Spinelli, Indro; Galasso, Fabio - 04b Atto di convegno in volume
conference: IEEE International Conference on Computer Vision (Honolulu, Hawaii)
book: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

11573/1699647 - 2023 - Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection
Flaborea, Alessandro; Collorone, Luca; D'amely Di Melendugno, Guido Maria; D'arrigo, Stefano; Prenkaj, Bardh; Galasso, Fabio - 04b Atto di convegno in volume
conference: IEEE/CVF International Conference on Computer Vision 2023 (Paris, France)
book: Proceedings of the IEEE/CVF International Conference on Computer Vision - (979-8-3503-0718-4)

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma