Date & time
10 a.m. – 1 p.m.
This event is free
School of Graduate Studies
Online
When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge in the thesis subject, together with the candidate's own contributions to that subject. The distinguishing criterion of doctoral research is a significant and original contribution to knowledge.
Once the thesis is accepted, the candidate presents it orally. This oral examination is open to the public.
Artificial intelligence has seen massive improvements with large-scale neural networks. In this thesis, we propose learnable algorithms that advance neural network training and make it practical along two axes: (1) learned optimization and (2) knowledge distillation.

On the learned optimization axis, we meta-train a parametric optimizer on a distribution of tasks to learn an update rule that outperforms hand-designed alternatives. The core challenge in this area is meta-generalization: learned optimizers have proven difficult to train in a way that transfers reliably to out-of-distribution problems. Our first contribution, Celo, directly targets this problem with a compute-efficient recipe that significantly improves the Pareto frontier of meta-training compute versus performance. With a fixed budget of 24 GPU hours, Celo outperforms prior hand-designed and learned-optimizer baselines on 17 diverse out-of-distribution tasks. Building on Celo, we develop Celo2, a simplified recipe with a normalized update rule that, even when meta-trained on small-scale tasks with merely 4.5 GPU hours, scales stably to large-scale unseen tasks such as GPT-3 XL (1.3B) pretraining, which is six orders of magnitude larger than the tasks in its meta-training distribution. We further evaluate Celo2 on GPT-2 (124M) pretraining, ViT ImageNet classification, and Atari reinforcement learning, where it outperforms strong baselines. Together, these two works elevate learned optimizers from a promising research idea to a practical one, opening the door to improving optimizers through learning, compute, and data, much like deep models.

On the knowledge distillation axis, we develop a learned two-stage algorithm that bridges a Transformer teacher and a Mamba-based state-space student via an intermediate Linear Attention model, recovering near-teacher performance at scale.
This demonstrates that learnable algorithms can advance neural network training beyond optimization, enabling knowledge transfer across fundamentally different architectures.
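To give a flavor of the meta-training setup described above, the sketch below shows the generic two-loop structure behind learned optimization: an inner loop applies a parametric update rule to a task, and an outer loop selects the rule's parameters by the final loss they achieve. This is an illustration of the general idea only, not the Celo or Celo2 recipe; the momentum-style update rule, the toy quadratic task, the candidate grid, and all names (`learned_update`, `run_task`, `meta_train`) are simplifying assumptions, and real systems meta-train over task distributions with gradient-based outer optimization.

```python
def learned_update(grad, m, theta):
    # Parametric update rule (an assumed, momentum-style form).
    # theta = (beta, lr) are the optimizer's learnable parameters.
    beta, lr = theta
    m = beta * m + (1 - beta) * grad
    return -lr * m, m

def run_task(theta, steps=50):
    # Inner loop: minimize a fixed 1-D quadratic f(w) = 0.5 * a * w**2,
    # a stand-in for one task drawn from the meta-training distribution.
    a, w, m = 1.5, 1.0, 0.0
    for _ in range(steps):
        grad = a * w                      # gradient of the quadratic
        step, m = learned_update(grad, m, theta)
        w += step
    return 0.5 * a * w * w                # final loss = meta-objective

def meta_train(candidates):
    # Outer loop: choose the optimizer parameters that reach the lowest
    # final loss. A grid search stands in here for gradient-based
    # meta-training over many tasks.
    return min(candidates, key=run_task)

candidates = [(0.9, 0.01), (0.9, 0.1), (0.5, 0.3)]
best = meta_train(candidates)
```

The point of the structure is that the optimizer itself becomes the object being trained: better outer-loop search, more compute, and more tasks all translate into a better update rule.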
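For the knowledge distillation axis, the thesis's two-stage Transformer-to-Mamba algorithm is not spelled out here; the sketch below shows only the standard temperature-scaled distillation objective (the KL divergence between softened teacher and student output distributions) that cross-architecture distillation methods build on. The temperature value and function names are illustrative assumptions.

```python
import math

def softmax(logits, temperature):
    # Temperature-scaled softmax over a list of logits; higher
    # temperature produces a softer distribution.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                     # subtract max for stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on the softened distributions: the student
    # is trained to match the teacher's full output distribution, not
    # just its argmax label.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is what lets a student with a fundamentally different architecture inherit the teacher's behavior.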
© Concordia University