Thesis title: Reducing Supervision in Semantic Segmentation through Advancements in Bayesian Prior Modelling
Over the past few years, semantic segmentation has witnessed significant advancements, particularly with the emergence of Vision Transformers (ViTs). However, in 2021–2022, when this research began, the adoption of ViTs in semantic segmentation was not yet widespread, and the field continued to face challenges due to the labour-intensive and costly nature of annotating data for semantic segmentation. This thesis addresses these challenges through three interconnected research projects that explore reduced-supervision and unsupervised methodologies, making semantic segmentation more efficient and accessible.
In the first study, we investigate the robustness of classification-pretrained deep neural networks for semantic segmentation without spatial guidance on object positions, a common challenge in weakly supervised semantic segmentation (WSSS). We address this by extracting the high-level information encoded in model representations through low-level information degradation and multi-view information bottleneck techniques. By leveraging geometric priors on image composition, specifically the principle of geometric equivariance under affine transformations, we enhance the model's ability to segment images accurately. Our empirical results demonstrate that ViTs, combined with an appropriate computation of Class Activation Maps (CAMs), are significantly more effective at achieving high-quality WSSS than the previously favoured convolutional neural networks (CNNs).
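To make the role of CAMs concrete, the sketch below shows one common way to derive class activation maps from a classification-pretrained ViT: each patch token is projected onto the linear classifier's class weight vectors and the resulting scores are reshaped to the patch grid. This is a minimal illustration under assumed shapes (a 14×14 patch grid, 768-dimensional tokens, 20 classes) with random stand-in tensors, not the exact computation used in the thesis.

```python
import numpy as np

# Hypothetical ViT outputs: N patch tokens of dimension D for one image.
# In practice these would come from a classification-pretrained ViT backbone.
rng = np.random.default_rng(0)
num_patches, dim, num_classes = 14 * 14, 768, 20

patch_tokens = rng.standard_normal((num_patches, dim))   # [N, D]
classifier_w = rng.standard_normal((num_classes, dim))   # [C, D] linear head

def compute_cams(tokens, weights, grid=(14, 14)):
    """Class Activation Maps: project each patch token onto the class
    weight vectors, then reshape the scores to the spatial patch grid."""
    scores = tokens @ weights.T                   # [N, C] per-patch class evidence
    scores = np.maximum(scores, 0)                # keep positive evidence only
    cams = scores.T.reshape(len(weights), *grid)  # [C, H, W]
    # Normalise each map to [0, 1) so it can be thresholded into pseudo-masks.
    cams -= cams.min(axis=(1, 2), keepdims=True)
    cams /= cams.max(axis=(1, 2), keepdims=True) + 1e-8
    return cams

cams = compute_cams(patch_tokens, classifier_w)
print(cams.shape)  # (20, 14, 14)
```

Thresholding each normalised map then yields the coarse localisation cues that WSSS pipelines refine into segmentation masks.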
Building on these findings and the limitations of existing approaches, our second study mitigates the lack of spatial information in WSSS with a prior assumption about the spatial distribution of categories across natural images. We posit that objects appear at different scales within images, either in the foreground or the background, leading to a similar spatial distribution for each category over a large set of images. Incorporating this prior counteracts the side effects of an unbalanced data distribution among visual concepts and improves model generalization in WSSS. We model the prior through class frequencies and matrix balancing, an approach derived from optimal transport theory. Unlike contrastive learning methods, our approach operates efficiently with small batches and requires no memory bank. It achieves new state-of-the-art performance on several benchmarks, demonstrating the significant potential of cluster-based principles to enhance WSSS and reaching results comparable to fully supervised methods.
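Matrix balancing of the kind derived from optimal transport is typically realised with Sinkhorn-style iterations: an assignment matrix is alternately rescaled so that its marginals match a target class-frequency prior. The sketch below is a minimal illustration under assumed inputs (random scores for 100 samples over 4 classes and an illustrative frequency prior), not the thesis's exact formulation.

```python
import numpy as np

def balanced_assignments(scores, class_freq, n_iter=50):
    """Sinkhorn-style matrix balancing: rescale exp(scores) so column
    marginals match a class-frequency prior and row marginals are uniform
    over samples (each pixel/sample carries one unit of mass)."""
    P = np.exp(scores)                      # [n_samples, n_classes], positive
    r = np.ones(len(P)) / len(P)            # uniform marginal over samples
    c = np.asarray(class_freq, dtype=float)
    c = c / c.sum()                         # target marginal over classes
    for _ in range(n_iter):
        P *= (r / P.sum(axis=1))[:, None]   # match row marginals to r
        P *= (c / P.sum(axis=0))[None, :]   # match column marginals to c
    return P

rng = np.random.default_rng(1)
scores = rng.standard_normal((100, 4))      # stand-in model logits
P = balanced_assignments(scores, class_freq=[0.4, 0.3, 0.2, 0.1])
print(P.sum(axis=0))                        # per-class mass matches the prior
```

Because the prior is enforced as a marginal constraint rather than learned from batch statistics, the balancing works with small batches, which is what removes the need for a memory bank.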
The third study explores unsupervised semantic segmentation (USS) by introducing a deep recursive spectral clustering technique built on the hypothesis that semantics is hierarchical. While conventional methods often rely on dataset-specific predefined assumptions, such as object-part decomposition and salient semantic regions, our data-driven approach segments images at multiple levels of granularity without prior knowledge of the scene's structure. This algebraic method recursively refines the segmentation based on the inherent semantic properties of the data, providing a flexible and robust way of grouping pixels. We experimentally demonstrate that the method excels at discovering both fine and coarse semantic structures in a fully unsupervised manner, offering substantial improvements over traditional models that often struggle with granularity and require dataset-specific priors that hinder scalability.
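The core mechanism of recursive spectral clustering can be sketched as repeated two-way cuts: each region is split by the sign of the Fiedler vector (the second eigenvector of the normalized graph Laplacian over an affinity matrix), and the recursion depth controls granularity, coarse near the root and fine at the leaves. The following toy example on two well-separated groups of feature vectors is an assumed, simplified stand-in for the deep features and refinements used in the thesis.

```python
import itertools
import numpy as np

def spectral_bipartition(affinity):
    """Two-way cut from the Fiedler vector of the symmetric normalized
    graph Laplacian built from a pairwise affinity matrix."""
    d = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-8)
    lap = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)        # eigenvalues in ascending order
    return vecs[:, 1] >= 0               # sign of the Fiedler vector

def recursive_segment(affinity, depth):
    """Recursively bipartition down to `depth`, labelling each leaf:
    shallow depths give coarse segments, deeper ones finer segments."""
    labels = np.full(len(affinity), -1)
    counter = itertools.count()

    def split(idx, d):
        if d == 0 or len(idx) < 2:
            labels[idx] = next(counter)
            return
        mask = spectral_bipartition(affinity[np.ix_(idx, idx)])
        if mask.all() or not mask.any():  # degenerate cut: stop refining
            labels[idx] = next(counter)
            return
        split(idx[mask], d - 1)
        split(idx[~mask], d - 1)

    split(np.arange(len(affinity)), depth)
    return labels

# Toy example: two well-separated groups of 3-D feature vectors.
rng = np.random.default_rng(2)
feats = np.vstack([rng.normal(0, 0.1, (10, 3)), rng.normal(5, 0.1, (10, 3))])
dists = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
labels = recursive_segment(np.exp(-dists), depth=1)
```

Increasing `depth` re-applies the cut inside each segment, which is how a single data-driven mechanism exposes both coarse and fine semantic structure.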
Altogether, these studies advance semantic segmentation by progressively reducing the level of supervision required, demonstrating that reliance on pixel-level annotations can be minimized through appropriate modelling of prior assumptions. These contributions align with the trends that shaped semantic segmentation over the course of this research and pave the way for future developments in computer vision.