Date & time
10 a.m. – 1 p.m.
This event is free
School of Graduate Studies
Online
When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge in the thesis subject, together with the candidate's own contributions to that subject. The distinguishing criterion of doctoral research is a significant and original contribution to knowledge.
Once the thesis is accepted, the candidate presents it orally. This oral examination is open to the public.
Artificial intelligence has seen massive improvements with large-scale neural networks. In this thesis, we propose learnable algorithms that advance neural network training and make it practical along two axes: (1) learned optimization and (2) knowledge distillation.

On the learned optimization axis, we meta-train a parametric optimizer on a distribution of tasks to learn an update rule that outperforms hand-designed alternatives. The core challenge in this area is meta-generalization: learned optimizers have proven difficult to train in a way that transfers reliably to out-of-distribution problems. Our first contribution, Celo, directly targets this problem with a compute-efficient recipe that significantly improves the Pareto frontier of meta-training compute versus performance. With a fixed budget of 24 GPU hours, Celo outperforms prior hand-designed and learned-optimizer baselines on 17 diverse out-of-distribution tasks. Building on Celo, we develop Celo2, a simplified recipe with a normalized update rule that, even when meta-trained on small-scale tasks with merely 4.5 GPU hours, scales stably to large-scale unseen tasks such as GPT-3 XL (1.3B) pretraining, which is six orders of magnitude larger than the tasks in its meta-training distribution. We further evaluate Celo2 on GPT-2 (124M) pretraining, ViT ImageNet classification, and Atari reinforcement learning, where it outperforms strong baselines. Together, these two works elevate learned optimizers from a promising research idea to a practical one, opening the door to improving optimizers through learning, compute, and data, much like deep models.

On the knowledge distillation axis, we develop a learned two-stage algorithm that bridges a Transformer teacher and a Mamba-based state-space student via an intermediate Linear Attention model, recovering near-teacher performance at scale.
This demonstrates that learnable algorithms can advance neural network training beyond optimization, enabling knowledge transfer across fundamentally different architectures.
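To give a flavor of the meta-training setup described above, the sketch below shows the generic two-loop structure behind learned optimization: an inner loop applies a parametric update rule to a task, and an outer loop selects the rule's parameters by the final loss they achieve. This is an illustration of the general idea only, not the Celo or Celo2 recipe; the momentum-style update rule, the toy quadratic task, the candidate grid, and all names (`learned_update`, `run_task`, `meta_train`) are simplifying assumptions, and real systems meta-train over task distributions with gradient-based outer optimization.

```python
def learned_update(grad, m, theta):
    # Parametric update rule (an assumed, momentum-style form).
    # theta = (beta, lr) are the optimizer's learnable parameters.
    beta, lr = theta
    m = beta * m + (1 - beta) * grad
    return -lr * m, m

def run_task(theta, steps=50):
    # Inner loop: minimize a fixed 1-D quadratic f(w) = 0.5 * a * w**2,
    # a stand-in for one task drawn from the meta-training distribution.
    a, w, m = 1.5, 1.0, 0.0
    for _ in range(steps):
        grad = a * w                      # gradient of the quadratic
        step, m = learned_update(grad, m, theta)
        w += step
    return 0.5 * a * w * w                # final loss = meta-objective

def meta_train(candidates):
    # Outer loop: choose the optimizer parameters that reach the lowest
    # final loss. A grid search stands in here for gradient-based
    # meta-training over many tasks.
    return min(candidates, key=run_task)

candidates = [(0.9, 0.01), (0.9, 0.1), (0.5, 0.3)]
best = meta_train(candidates)
```

The point of the structure is that the optimizer itself becomes the object being trained: better outer-loop search, more compute, and more tasks all translate into a better update rule.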
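For the knowledge distillation axis, the thesis's two-stage Transformer-to-Mamba algorithm is not spelled out here; the sketch below shows only the standard temperature-scaled distillation objective (the KL divergence between softened teacher and student output distributions) that cross-architecture distillation methods build on. The temperature value and function names are illustrative assumptions.

```python
import math

def softmax(logits, temperature):
    # Temperature-scaled softmax over a list of logits; higher
    # temperature produces a softer distribution.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                     # subtract max for stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on the softened distributions: the student
    # is trained to match the teacher's full output distribution, not
    # just its argmax label.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is what lets a student with a fundamentally different architecture inherit the teacher's behavior.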
© Concordia University