Abstract
This thesis addresses the challenge of few-shot semantic segmentation (FSS), which aims to achieve accurate image understanding in low-data regimes. Existing FSS methods often struggle to generalize, primarily because the few labeled support examples cannot capture the full variability of object appearances; this leads to poor performance under occlusions, appearance shifts, and viewpoint differences between support and query samples. To overcome these limitations, we first propose a transductive meta-learning framework that leverages an ensemble of features from pretrained classification and semantic segmentation networks. This method enhances discriminative power by capturing both high-level semantic cues and pixel-level spatial information, and introduces a two-pass correlation mechanism that improves intra-class and intra-object similarity modeling while reducing false positives, all with minimal trainable parameters.
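To make the two ingredients named above concrete, the following is a minimal, illustrative sketch rather than the thesis implementation: it ensembles multi-level features from pretrained backbones and computes a single dense query-support correlation pass restricted to foreground support locations. All tensor shapes, function names, and the cosine-similarity formulation are assumptions for illustration only.

```python
# Illustrative sketch only: feature ensembling and one dense correlation pass.
# Shapes, names, and the cosine-similarity choice are assumptions, not the
# thesis implementation.
import torch
import torch.nn.functional as F

def feature_ensemble(feature_maps, out_size=(32, 32)):
    """Resize per-layer feature maps from pretrained backbones to a common
    resolution and concatenate them along the channel dimension."""
    resized = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
               for f in feature_maps]
    return torch.cat(resized, dim=1)  # (B, sum(C_i), H, W)

def dense_correlation(query_feat, support_feat, support_mask):
    """Cosine similarity between every query location and every foreground
    support location, reduced to a per-query-pixel prior map."""
    b, c, h, w = query_feat.shape
    q = F.normalize(query_feat.flatten(2), dim=1)              # (B, C, HWq)
    s = F.normalize(support_feat.flatten(2), dim=1)             # (B, C, HWs)
    corr = torch.einsum("bcq,bcs->bqs", q, s)                   # (B, HWq, HWs)
    # Keep only foreground support locations, then pool over them.
    mask = F.interpolate(support_mask, size=(h, w), mode="nearest").flatten(2)
    corr = corr * mask                                           # zero background
    return corr.max(dim=2).values.view(b, 1, h, w)               # (B, 1, H, W)

# Usage with random tensors standing in for real backbone features.
q_feats = [torch.randn(2, 256, 64, 64), torch.randn(2, 512, 32, 32)]
s_feats = [torch.randn(2, 256, 64, 64), torch.randn(2, 512, 32, 32)]
s_mask = (torch.rand(2, 1, 64, 64) > 0.5).float()

prior = dense_correlation(feature_ensemble(q_feats), feature_ensemble(s_feats), s_mask)
print(prior.shape)  # torch.Size([2, 1, 32, 32])
```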
However, despite strong performance, this approach remains limited in its ability to reason about object semantics or to adapt flexibly to complex query-support discrepancies. Motivated by these challenges, we introduce a second framework that unifies visual features with semantic knowledge derived from multimodal large language models (LLMs). By generating adaptive, class-specific semantic prompts with these models and integrating them with dense visual correspondences between support and query samples, our model performs reasoning-driven segmentation and achieves robust generalization, even in cross-domain settings. The resulting vision-language system addresses key failure cases of prior work, particularly in scenes with severe appearance variation or ambiguous context.
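As a second illustrative sketch, again an assumption-laden simplification rather than the thesis model, the snippet below fuses a class-level semantic embedding (for example, an LLM-generated class description encoded by a text encoder) with a query feature map and a visual correlation prior to produce segmentation logits. The fusion design, module name, and all dimensions are hypothetical.

```python
# Hypothetical fusion of a semantic (text-derived) embedding with visual
# features and a support-derived correlation prior; not the thesis model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticVisualFusion(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=512, hid_dim=256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, hid_dim, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        # Small head mixing the semantic-similarity map with the visual prior.
        self.head = nn.Sequential(
            nn.Conv2d(hid_dim + 2, hid_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hid_dim, 1, kernel_size=1),
        )

    def forward(self, query_feat, text_emb, visual_prior):
        # query_feat: (B, vis_dim, H, W); text_emb: (B, txt_dim);
        # visual_prior: (B, 1, H, W) correlation prior from the support set.
        v = self.vis_proj(query_feat)                            # (B, hid, H, W)
        t = self.txt_proj(text_emb)                              # (B, hid)
        sim = F.cosine_similarity(v, t[:, :, None, None], dim=1).unsqueeze(1)
        fused = torch.cat([v, sim, visual_prior], dim=1)
        return self.head(fused)                                  # (B, 1, H, W) logits

# Usage with random stand-ins for the real features.
model = SemanticVisualFusion()
logits = model(torch.randn(2, 768, 32, 32), torch.randn(2, 512),
               torch.randn(2, 1, 32, 32))
print(logits.shape)  # torch.Size([2, 1, 32, 32])
```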
Extensive experiments on the PASCAL-5i and COCO-20i benchmarks demonstrate that our proposed frameworks outperform prior methods, both in standard few-shot settings and under cross-domain evaluation. Together, these contributions represent a significant advancement in learning to segment with limited supervision, offering a path forward for more intelligent and adaptable vision systems.