CHRISTIAN MARINONI

PhD graduate

cycle: XXXVIII

supervisor: Prof. Danilo Comminiello

Thesis title: An Echo of Sight: Generative Models for Audio-Visual Spatial Coherence

While deep generative models have achieved remarkable success in synthesizing high-fidelity images, video, and audio, the next frontier is coherent multimodal synthesis. In the audio-visual domain, current research has largely focused on semantic ("what") and temporal ("when") alignment, while neglecting the equally critical dimension of spatial coherence ("where"). This omission creates a perceptual disconnect that breaks immersion, as the auditory world feels flat and detached from the visual space. This thesis, guided by the principle of "An Echo of Sight", directly addresses this gap by investigating, developing, and advancing generative models that establish robust spatial coherence between audio and visual modalities. The work progresses through four core, interconnected contributions. First, we establish an analytical foundation for spatial audio understanding. Through our work on the L3DAS23 challenge and dataset, we demonstrate that deep learning models can effectively extract and exploit the rich spatial cues embedded in 3D Ambisonics audio for complex analysis tasks, including 3D speech enhancement and sound event localization and detection. Second, we transition from analysis to perceptual synthesis with StereoSync, a novel framework for spatially-aware video-to-audio (V2A) generation. This model is the first to leverage visual spatial cues, such as depth maps and object trajectories, to condition a latent diffusion model, successfully generating stereo audio that spatially pans and aligns with on-screen object dynamics. Third, we address the "off-screen" problem by expanding the generative context to a full spherical environment. We introduce Con360-AV, a framework for joint audio-visual generation conditioned on a complete 360° space. By using panoramic saliency and novel geometric maps, the model generates specific audio-visual viewpoints that are coherently embedded within the larger, surrounding world. 
Finally, we introduce the HA30K dataset, a large-scale collection of acoustic simulations, and develop a generative surrogate model that learns to approximate the solutions of the Helmholtz equation. This work demonstrates that a generative model can learn the complex physical laws connecting a visual "Sight" (the geometry of a space) to its physical "Echo" (the acoustic pressure field). The proposed frameworks demonstrate significant quantitative improvements in spatial alignment, generative fidelity, and computational efficiency. In particular, StereoSync achieves state-of-the-art spatial tracking, Con360-AV demonstrates robust spatial control in a 360° context, and our physics-based surrogate achieves a nearly 5x speedup over traditional solvers in batch processing. Collectively, this research provides a comprehensive methodology for audio-visual spatial coherence and delivers foundational technologies for the next generation of immersive media, virtual reality, and engineering-focused "Acoustic Digital Twins".

Scientific publications

11573/1747232 - 2025 - Generative models for Helmholtz equation solutions: A dataset of acoustic materials

Gramaccioni, R. F.; Marinoni, C.; Frezza, F.; Uncini, A.; Comminiello, D. - 04b Conference paper in proceedings volume

conference: 33rd European Signal Processing Conference, EUSIPCO 2025 (Palermo; Italy)

book: 2025 33rd European Signal Processing Conference (EUSIPCO)

11573/1764468 - 2025 - FoleyGRAM. Video-to-audio generation with GRAM-aligned multimodal encoders

Gramaccioni, Riccardo Fosco; Marinoni, Christian; Grassucci, Eleonora; Cicchetti, Giordano; Uncini, Aurelio; Comminiello, Danilo - 04b Conference paper in proceedings volume

conference: International Joint Conference on Neural Networks (IJCNN 2025) (Rome; Italy)

book: 2025 International Joint Conference on Neural Networks (IJCNN) - (979-8-3315-1042-8; 979-8-3315-1043-5)

11573/1714344 - 2024 - L3DAS23: Learning 3D Audio Sources for Audio-Visual Extended Reality

Gramaccioni, R. F.; Marinoni, C.; Chen, C.; Uncini, A.; Comminiello, D. - 01a Journal article

journal: IEEE OPEN JOURNAL OF SIGNAL PROCESSING (New York NY: IEEE) pp. 1-9 - issn: 2644-1322 - wos: WOS:001256424400015 (4) - scopus: 2-s2.0-85187979009 (6)

11573/1712935 - 2024 - Inverse Design of Thin-Film Metamaterials with a LSTM-based Approach

Gramaccioni, R. F.; Marinoni, C.; Frezza, F.; Uncini, A.; Comminiello, D. - 04d Abstract in conference proceedings

conference: WIRN 2024 (Vietri sul Mare)

book: Proc. WIRN 2024

11573/1741589 - 2024 - Diffusion models for audio semantic communication

Grassucci, Eleonora; Marinoni, Christian; Rodriguez, Andrea; Comminiello, Danilo - 04b Conference paper in proceedings volume

conference: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Seoul; Korea)

book: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) - (979-8350-344-85-1)

11573/1714441 - 2023 - Overview of the L3DAS23 challenge on audio-visual extended reality

Marinoni, Christian; Gramaccioni, Riccardo F.; Chen, Changan; Uncini, Aurelio; Comminiello, Danilo - 04b Conference paper in proceedings volume

conference: 48th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2023 (Rhodes Island; Greece)

book: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings - (9781728163284)

11573/1715972 - 2023 - DPPL hallway tracker: hospital contact tracing during the COVID-19 pandemic

Marinoni, Christian; Ponzi, Valerio; Comminiello, Danilo - 04b Conference paper in proceedings volume

conference: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, SYSTEM 2023 (Rome; Italy)

book: CEUR Workshop Proceedings

11573/1669170 - 2022 - L3DAS22 Challenge: Learning 3D Audio Sources in a Real Office Environment

Guizzo, E.; Marinoni, C.; Pennese, M.; Ren, X.; Zheng, X.; Zhang, C.; Masiero, B.; Uncini, A.; Comminiello, D. - 04b Conference paper in proceedings volume

conference: 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 (Marina Bay Sands Expo and Convention Center, Singapore)

book: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings - (978-1-6654-0540-9)

11573/1606332 - 2021 - L3DAS21 challenge: machine learning for 3D audio signal processing

Guizzo, Eric; Gramaccioni, Riccardo Fosco; Jamili, Saeid; Marinoni, Christian; Massaro, Edoardo; Medaglia, Claudia; Nachira, Giuseppe; Nucciarelli, Leonardo; Paglialunga, Ludovica; Pennese, Marco; Pepe, Sveva; Rocchi, Enrico; Uncini, Aurelio; Comminiello, Danilo - 04b Conference paper in proceedings volume

conference: 31st IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2021 (Gold Coast; Australia)

book: IEEE International Workshop on Machine Learning for Signal Processing, MLSP - (978-1-7281-6338-3)