Thesis title: From Forecasting to Reasoning: Context-Aware and Safe Modeling of Human Motion and Behavior
Understanding, modeling, and controlling human motion and behavior are fundamental to the development of intelligent systems capable of interacting naturally and safely with people. From virtual and augmented reality to robotics, healthcare, and creative applications, the ability to predict, synthesize, and interpret how humans move, act, and respond to their surroundings enables adaptive and responsible human-machine collaboration.
Human behavior, however, reflects not only physical motion but also intent, attention, and social context, demanding models that unify geometry, dynamics, and semantics.
This thesis advances the computational modeling of human motion and behavior through a sequence of works that progressively broaden the scope of understanding and control. It begins with STAG, which introduces a staged contact-aware approach for scene-conditioned motion forecasting, followed by SEE-ME, the first socially conditioned egocentric motion estimation model that integrates cues from both the environment and other people. These works collectively establish the importance of context in predicting and reconstructing human behavior.
Building on this foundation, 2Body formalizes collaborative forecasting between two interacting humans, identifying the architectural and representational principles that transfer from single-person to multi-person motion prediction.
The next part of the thesis shifts focus from modeling to control, addressing the safety and ethical dimensions of generative AI. ``Human Motion Unlearning'' proposes a latent-code replacement method to selectively forget harmful motion patterns in text-to-motion diffusion models, while ``Video Unlearning via Low-Rank Refusal Vector'' extends this paradigm to video diffusion models, neutralizing unsafe generative concepts without retraining or data access. Together, these works define the first framework for targeted unlearning in spatio-temporal generative models.
The final part explores egocentric understanding and anomaly detection. PREGO introduces the first online open-set framework for detecting procedural mistakes in egocentric videos, combining visual recognition with symbolic reasoning to anticipate and identify behavioral deviations in real time. Its successor, TI-PREGO, incorporates Chain-of-Thought reasoning to strengthen anticipation and inference over human procedures, bridging perception and cognition in egocentric understanding.
Across these seven works, the thesis establishes a coherent progression in human motion understanding: from forecasting to generation, from interaction to unlearning, and from observation to reasoning. Collectively, these contributions articulate a unified vision of context-aware, socially grounded, and ethically aligned human motion modeling, paving the way for intelligent systems that anticipate, adapt, and act responsibly within human environments.