RICCARDO FOSCO GRAMACCIONI

PhD (Dottore di ricerca)

Cycle: XXXVIII


Supervisor: Prof. Danilo Comminiello

Thesis title: Controllable Generative Audio for Audiovisual Immersive Environments

The rapid progress of deep generative learning has profoundly transformed multimedia production, opening new possibilities for the automatic synthesis and control of audiovisual content. Among the most promising and challenging directions is the generation of sound that is coherent with video, both in semantics and in timing, and that can adapt to the physical and acoustic characteristics of the environment in which it is reproduced. This research area has its roots in machine learning, acoustics, and creative media technologies, and it is of high interest to both academia and industry. On the academic side, it raises fundamental questions about multimodal representation learning, multimodal alignment, and the physical modeling of sound; on the industrial side, companies operating in cinema, video games, and extended-reality (XR) production are investing heavily in generative solutions that can assist sound designers, post-production engineers, and interactive content creators. Automating or augmenting sound design through learning-based methods can reduce production time, enhance creative flexibility, and enable fully adaptive audio for immersive experiences.

This thesis explores controllable generative audio for realistic simulation in audiovisual and immersive environments, with the goal of learning how to generate sounds that match the visual world semantically, temporally, spatially, and acoustically. The research develops through a coherent sequence of five works, each addressing a specific aspect of this problem. Starting from the synthesis of temporally synchronized Foley effects from silent video, the work evolves toward spatialized sound analysis for virtual environments and concludes with data-driven modeling of physical acoustics through deep learning (DL)-based approximations of wave equations.
The methodological backbone of this research is the use of diffusion-based generative models, which have proven highly effective in modeling the temporal and semantic dependencies between modalities. These architectures are extended with interpretable conditioning mechanisms, such as visual onset cues, motion-derived envelopes, and multimodal embeddings, enabling both automation and artistic supervision in audiovisual generation. We begin by introducing an onset-synchronized video-to-audio generation model that aligns sound with video events using visual onset detection and diffusion-based synthesis. We then improve temporal and semantic control by separating the when and the what of sound generation: we develop a model in which a video-driven motion envelope guides the timing, while semantic embeddings define the auditory content. We extend this study by introducing GRAM-aligned multimodal encoders that jointly learn coherent audio, video, and textual representations, enhancing multimodal control and semantic consistency. Our research then moves toward immersive applications, focusing on where sound sources are located in the visual space. We propose a large-scale dataset and benchmark for learning 3D spatial audio and sound source localization from multichannel Ambisonics recordings and visual data, supporting audio generation and analysis in Augmented Reality (AR) and Virtual Reality (VR) scenarios. Finally, we analyze how deep learning models can estimate physical acoustics, showing that neural networks can approximate solutions to the Helmholtz equation and emulate how waves propagate across materials and space. This last step grounds generative audio in physically consistent simulation, which is essential for realism in virtual environments. Together, these works aim to unify semantic, temporal, spatial, and physical realism in generative audiovisual learning.
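To make the last point concrete, the quantity that a physics-informed network is trained to drive to zero is the Helmholtz residual. The following is a minimal numerical sketch, not the thesis's actual model: it evaluates the 1D residual p'' + k^2 p by central finite differences for an exact plane-wave solution, the same residual a network would minimize over its own predictions. The wavenumber and grid are illustrative assumptions.

```python
import math

# 1D Helmholtz equation: p''(x) + k^2 p(x) = 0.
# A physics-informed model is trained so that this residual vanishes
# at its outputs; here we simply evaluate it for the exact solution
# p(x) = sin(k x), using central finite differences on [0, 1].

k = 2.0 * math.pi                      # illustrative wavenumber
n = 2001
h = 1.0 / (n - 1)                      # grid spacing
p = [math.sin(k * i * h) for i in range(n)]

# Residual at interior points: r_i = (p_{i+1} - 2 p_i + p_{i-1}) / h^2 + k^2 p_i
residual = [
    (p[i + 1] - 2.0 * p[i] + p[i - 1]) / h**2 + k * k * p[i]
    for i in range(1, n - 1)
]

# For the exact solution the residual is near zero, up to the O(h^2)
# discretization error of the finite-difference stencil.
print(max(abs(r) for r in residual))
```

In a learned setting, `p` would be the network's predicted pressure field and this residual (summed over collocation points) would serve as the physics term of the training loss.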
Beyond its algorithmic advances, the thesis also introduces several high-quality datasets, such as Walking the Maps, L3DAS23, and HA30K, that address the critical lack of multimodal, well-synchronized data for research in this field. Overall, this thesis demonstrates how diffusion-based architectures and generative models can serve as controllable and interpretable tools for realistic and creative media synthesis. The proposed frameworks and datasets pave the way for future research and industrial applications, aiming at a new generation of systems capable of producing audiovisual content that is not only perceptually coherent but also acoustically and physically realistic: an essential step for deep learning-based applications in immersive media.

Publications (Produzione scientifica)

11573/1747232 - 2025 - Generative Models for Helmholtz Equation Solutions: A Dataset of Acoustic Materials
Gramaccioni, R. F.; Marinoni, C.; Frezza, F.; Uncini, A.; Comminiello, D. - Conference paper in proceedings
Conference: EUSIPCO 2025 (Palermo)
In: Proc. EUSIPCO 2025

11573/1725167 - 2024 - Syncfusion: Multimodal Onset-Synchronized Video-to-Audio Foley Synthesis
Comunità, M.; Gramaccioni, R. F.; Postolache, E.; Rodolà, E.; Comminiello, D.; Reiss, J. D. - Conference paper in proceedings
Conference: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Seoul, Republic of Korea)
In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

11573/1714344 - 2024 - L3DAS23: Learning 3D Audio Sources for Audio-Visual Extended Reality
Gramaccioni, R. F.; Marinoni, C.; Chen, C.; Uncini, A.; Comminiello, D. - Journal article
Journal: IEEE OPEN JOURNAL OF SIGNAL PROCESSING (New York, NY: IEEE), pp. 1-9 - ISSN: 2644-1322 - WoS: WOS:001256424400015 (2) - Scopus: 2-s2.0-85187979009 (4)

11573/1712935 - 2024 - Inverse Design of Thin-Film Metamaterials with a LSTM-based Approach
Gramaccioni, R. F.; Marinoni, C.; Frezza, F.; Uncini, A.; Comminiello, D. - Abstract in conference proceedings
Conference: WIRN 2024 (Vietri sul Mare)
In: Proc. WIRN 2024

11573/1723593 - 2024 - Ship in sight: diffusion models for ship-image super resolution
Sigillo, L.; Gramaccioni, R. F.; Nicolosi, A.; Comminiello, D. - Conference paper in proceedings
Conference: 2024 International Joint Conference on Neural Networks, IJCNN 2024 (Yokohama, Japan)
In: Proceedings of the International Joint Conference on Neural Networks (ISBN 9798350359312)

11573/1714441 - 2023 - Overview of the L3DAS23 challenge on audio-visual extended reality
Marinoni, Christian; Gramaccioni, Riccardo F.; Chen, Changan; Uncini, Aurelio; Comminiello, Danilo - Conference paper in proceedings
Conference: 48th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2023 (Rhodes Island, Greece)
In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (ISBN 9781728163284)

11573/1606332 - 2021 - L3DAS21 challenge: machine learning for 3D audio signal processing
Guizzo, Eric; Gramaccioni, Riccardo Fosco; Jamili, Saeid; Marinoni, Christian; Massaro, Edoardo; Medaglia, Claudia; Nachira, Giuseppe; Nucciarelli, Leonardo; Paglialunga, Ludovica; Pennese, Marco; Pepe, Sveva; Rocchi, Enrico; Uncini, Aurelio; Comminiello, Danilo - Conference paper in proceedings
Conference: 31st IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2021 (Gold Coast, Australia)
In: IEEE International Workshop on Machine Learning for Signal Processing, MLSP (ISBN 978-1-7281-6338-3)

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma