Abstract
Supervised learning on large-scale labeled datasets has been critical to the success of computer vision, with widespread applications in robotics, healthcare, security, sports, and retail. To reduce over-dependence on labeled data, self-supervised learning aims to learn from data without annotations. However, new problems arise, such as the difficulty of defining appropriate pretext tasks, the increased computational demands of multi-stage training, and the need for large amounts of unlabeled data.

In this thesis, we first introduce a learning paradigm that models global and local context for semantic segmentation. The proposed method effectively captures pixel relationships, improving performance in ambiguous regions, and better segments minority classes through masking. We show that our approach outperforms state-of-the-art single- and multi-task learning baselines on both binary and multi-class semantic segmentation tasks, particularly on small, ambiguous regions in medical images and minority-class instances in cluttered scenes.

Motivated by the intuition that occluded objects are partial inputs, we then propose a single-stage, model-agnostic approach for multi-label image recognition. The proposed method learns contextualized representations using a masked branch and models label co-occurrence through label consistency. Experimental results demonstrate the simplicity, applicability, and, more importantly, the competitive performance of our approach against previous state-of-the-art methods, especially in identifying small and occluded objects.

Finally, we propose an efficient unsupervised object localization method that can segment unfamiliar objects in images without additional training, particularly when they are small, reflective, or poorly illuminated. The proposed method learns context-based representations at both the pixel and shape level using only a frozen encoder and a decoder consisting of a single learnable convolutional layer. We demonstrate on six benchmark datasets the simplicity, efficiency, and competitive performance of our approach in both single-object discovery and unsupervised salient object detection, outperforming existing methods that require intensive computational resources, extensive training, and large data volumes.
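The frozen-encoder, single-layer-decoder design described above can be sketched as follows. This is a minimal illustration under assumed names and feature sizes, with a small placeholder convolution standing in for the pretrained frozen backbone; it is not the thesis implementation.

```python
import torch
import torch.nn as nn

class SingleConvDecoder(nn.Module):
    """The only trainable component: one 1x1 convolution mapping
    encoder features to a per-location foreground logit."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)  # (B, 1, H, W) mask logits

# Placeholder frozen encoder (stands in for a pretrained backbone).
encoder = nn.Conv2d(3, 64, kernel_size=8, stride=8)
for p in encoder.parameters():
    p.requires_grad = False  # frozen: no gradients flow to the encoder

decoder = SingleConvDecoder(feat_dim=64)

x = torch.randn(2, 3, 224, 224)  # a batch of two RGB images
with torch.no_grad():
    feats = encoder(x)           # (2, 64, 28, 28) feature map
mask_logits = decoder(feats)     # (2, 1, 28, 28) coarse object mask

# Only the decoder's single conv layer contributes learnable parameters.
trainable = sum(p.numel() for p in decoder.parameters() if p.requires_grad)
```

The point of the sketch is the parameter budget: everything except one convolutional layer is frozen, which is what makes the method cheap to train compared with approaches that fine-tune the full backbone.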