Thesis title: Capitalizing on Self-supervision and Pre-trained Models in Computer Vision
This thesis addresses the overarching challenge of advancing computer vision under two constraints: limited labeled data and the need to capitalize on knowledge already encoded in pre-trained models. By exploring three distinct computer vision tasks (classification, regression, and segmentation), this work presents diverse frameworks aimed at transcending the conventional boundaries imposed by data scarcity and task-specific methodologies.
The first focus lies on Unsupervised Domain Adaptation (UDA) in visual recognition, a critical step toward bridging disparate visual domains for robust real-world performance. Existing UDA approaches typically require manual adaptation to specific backbone architectures, so they become outdated as architectures evolve. To circumvent this limitation, this thesis proposes Adversarial Branch Architecture Search for UDA (ABAS). ABAS addresses the lack of target labels through a data-driven ensemble approach to model selection, and it searches over auxiliary adversarial branches that drive domain alignment. Extensive validation on standard visual recognition datasets demonstrates that ABAS robustly improves modern UDA techniques, yielding superior performance across diverse domains.
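To make the notion of an auxiliary adversarial branch concrete, the following is a minimal sketch of the kind of module ABAS searches over: a DANN-style domain discriminator attached to an intermediate backbone feature via gradient reversal. All module names, sizes, and the two-layer discriminator are illustrative assumptions, not the thesis implementation; ABAS's contribution lies in automatically searching where and how such branches attach.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AdversarialBranch(nn.Module):
    """Hypothetical domain discriminator attached to a backbone feature vector."""
    def __init__(self, in_dim, hidden_dim=256, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.discriminator = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),  # source vs. target domain
        )

    def forward(self, features):
        # Reversed gradients push the backbone toward domain-invariant features
        # while the branch itself learns to tell the two domains apart.
        reversed_feats = GradientReversal.apply(features, self.lambd)
        return self.discriminator(reversed_feats)
```

In an architecture-search setting, candidate placements of such a branch (which layer, which width, which reversal strength) define the search space, and model selection must proceed without target labels, hence the data-driven ensemble criterion mentioned above.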
In the realm of regression tasks, the thesis turns to collaborative human pose forecasting, an understudied problem where the correlated motion patterns of interacting individuals can be exploited for improved predictions. By revisiting prevalent single-person practices and tailoring them to the collaborative setting, significant advances are achieved. Notably, frequency input representations, space-time separable interaction encodings, and fully-learnable interaction adjacencies are integrated into a Graph Convolutional Network (GCN) framework with promising results. Furthermore, a novel initialization procedure for the spatial interaction parameters improves both performance and stability, culminating in a substantial boost over state-of-the-art methods on benchmark datasets.
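Two of the ingredients named above can be sketched compactly: a DCT frequency representation of joint trajectories, and a graph convolution whose interaction adjacency is a free parameter shared across the joints of all interacting people. Shapes and names are assumptions for illustration, and the near-identity adjacency initialization below only loosely echoes the stability-oriented initialization the thesis actually proposes.

```python
import math
import torch
import torch.nn as nn

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis; rows index frequencies, columns time steps."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    t = torch.arange(n, dtype=torch.float32).unsqueeze(0)
    basis = math.sqrt(2.0 / n) * torch.cos(math.pi / n * (t + 0.5) * k)
    basis[0] /= math.sqrt(2.0)
    return basis  # (n_freq, n_time); coeffs = dct_matrix(T) @ trajectory

class LearnableGraphConv(nn.Module):
    """GCN layer over all joints of all persons, with a fully-learnable adjacency."""
    def __init__(self, in_feats, out_feats, num_nodes):
        super().__init__()
        # Free interaction adjacency, initialized near the identity for stability.
        self.adj = nn.Parameter(
            torch.eye(num_nodes) + 0.01 * torch.randn(num_nodes, num_nodes)
        )
        self.weight = nn.Parameter(torch.empty(in_feats, out_feats))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x):
        # x: (batch, num_nodes, in_feats), e.g. per-joint DCT coefficients.
        return torch.einsum("vu,buf,fo->bvo", self.adj, x, self.weight)
```

Because the adjacency spans the joints of both people, the layer can learn cross-person edges directly, which is what lets the forecaster exploit correlated motion between interacting individuals.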
Lastly, the thesis tackles semantic segmentation in autonomous driving scenarios, leveraging the unique capabilities of event cameras: low-latency operation under challenging lighting conditions. We introduce OVOSE, the first open-vocabulary semantic segmentation approach explicitly tailored to event-based data. OVOSE leverages knowledge distillation from pre-trained image-based models and synthetic event data to enhance segmentation performance. Additionally, we propose a novel dissimilarity network that recalibrates the mask loss, mitigating the effects of sub-optimal event-to-image reconstructions and enabling precise fine-tuning of the segmentation model. Through this approach, OVOSE demonstrates superior performance in dynamic environments, outperforming both conventional image-based models and state-of-the-art unsupervised domain adaptation methods for event-based semantic segmentation.
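A minimal sketch of the distillation-plus-recalibration idea follows: a frozen image-based teacher supervises the event-based student, and a per-pixel dissimilarity score, produced by the dissimilarity network, down-weights the loss wherever the event-to-image reconstruction is unreliable. The function name, tensor layout, and the specific KL formulation are assumptions for illustration, not OVOSE's exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_mask_loss(student_logits, teacher_logits, dissimilarity):
    """
    student_logits, teacher_logits: (B, C, H, W) segmentation logits;
    the teacher is a frozen, pre-trained image-based model.
    dissimilarity: (B, 1, H, W) in [0, 1]; high where reconstruction is poor.
    """
    # Per-pixel KL divergence between student and teacher class distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=1),
        F.softmax(teacher_logits, dim=1),
        reduction="none",
    ).sum(dim=1, keepdim=True)  # (B, 1, H, W)

    # Recalibrate: trust the teacher less on poorly reconstructed regions.
    weight = 1.0 - dissimilarity
    return (weight * kl).mean()
```

The design intent is that artifacts introduced by reconstructing images from events should not be distilled into the student; masking the loss by reconstruction quality keeps the fine-tuning signal focused on regions where the teacher is trustworthy.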
In summary, this thesis presents a holistic approach to computer vision tasks, unifying disparate methodologies under the common goal of leveraging pre-trained models and limited labels to achieve superior performance across diverse domains. By addressing specific challenges within classification, regression, and segmentation, the proposed frameworks contribute to advancing the frontier of computer vision in real-world applications.