Samuele Poppi - Not All Attackers Are Malicious: When Safety Degrades Without Harmful Intent


AI safety research has traditionally relied on a static attacker-defender framework, assuming that threats come from malicious actors targeting frozen, deployed models. This talk challenges that assumption by highlighting a subtler and often overlooked risk: safety alignment can silently degrade through entirely benign interactions, such as domain adaptation, personalization, or utility-driven fine-tuning. We introduce SPQR, a benchmark for evaluating the robustness of safety alignment in text-to-image diffusion models against benign fine-tuning, and show that high safety at deployment time does not guarantee safety over a model’s lifecycle. Our findings call for a shift in how the community defines, measures, and maintains safety in continuously evolving AI systems.

18/03/2026

Dr. Samuele Poppi is a Postdoctoral Associate at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), where he works with Dr. Nils Lukas in the Secure, Private, Open, and Trustworthy (SPOT) AI Lab.

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma