Thesis defences

PhD Oral Exam - Xingyu Shen, Electrical and Computer Engineering

Deep Learning Approaches for Speech Enhancement Toward Robust ASR


Date & time
Tuesday, April 21, 2026
2 p.m. – 5 p.m.
Cost

This event is free

Organization

School of Graduate Studies

Contact

Dolly Grewal

Where

Engineering, Computer Science and Visual Arts Integrated Complex
1515 Ste-Catherine St. W.
Room EV 2.184

Accessible location

Yes

When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.

Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.

Abstract

Speech enhancement (SE) aims to suppress noise and reverberation in order to improve speech quality and intelligibility in adverse environments. It also serves as an important front end for robust automatic speech recognition (ASR). In multichannel speech enhancement (MCSE), the spatial diversity of a microphone array can be exploited for spatial filtering. Classical array-processing methods such as minimum variance distortionless response (MVDR) beamforming provide principled spatial filters, but their performance is sensitive to inaccurate spatial statistics, array perturbations, and nonstationary interference. Deep learning has substantially advanced MCSE by enabling data-driven spectro-temporal modeling and learnable spatial processing. However, it remains challenging to build robust, deployable multichannel front ends under practical mismatch: array-geometry variation, brittle covariance estimation, long-context modeling requirements, and the need for reliable gains on downstream ASR with frozen recognizers.
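For readers unfamiliar with the classical baseline the abstract refers to, the MVDR beamformer computes channel weights w = R⁻¹d / (dᴴR⁻¹d) from a noise covariance matrix R and a steering vector d, and satisfies the distortionless constraint wᴴd = 1. The following minimal NumPy sketch (not from the thesis; diagonal loading and all variable names are illustrative) shows both the weights and the covariance sensitivity the abstract mentions, which loading is commonly used to mitigate:

```python
import numpy as np

def mvdr_weights(R, d, diag_load=1e-6):
    """MVDR beamformer weights w = R^{-1} d / (d^H R^{-1} d).

    R: (M, M) Hermitian noise covariance, d: (M,) steering vector.
    Diagonal loading regularizes the ill-conditioned covariance
    estimates that make classical MVDR brittle in practice.
    """
    M = R.shape[0]
    R_loaded = R + diag_load * (np.trace(R).real / M) * np.eye(M)
    Rinv_d = np.linalg.solve(R_loaded, d)      # R^{-1} d without explicit inverse
    return Rinv_d / (d.conj() @ Rinv_d)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
R = A @ A.conj().T + np.eye(4)                 # Hermitian positive definite
d = np.exp(-1j * np.pi * np.arange(4))         # toy steering vector
w = mvdr_weights(R, d)
assert np.isclose(w.conj() @ d, 1.0)           # distortionless constraint holds
```

The constraint wᴴd = 1 means the target direction passes undistorted while output noise power is minimized; when R is badly estimated, the solve amplifies the error, which is the brittleness the proposed covariance-free front ends are designed to avoid.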

This thesis develops several deep-learning approaches that address these challenges through complementary strategies including topology-aware spatial reasoning, physics-informed spatial filtering, scalable long-context sequence modeling, and recognition-consistent front-end synthesis, all formulated within a common complex short-time Fourier transform (STFT)-domain processing interface. Four major contributions are presented.

First, topology-robust spatial front ends are developed by representing microphones as graph nodes and learning inter-channel interactions in the complex STFT domain. The thesis progresses from complex-valued graph convolution with multi-path spatio-temporal modeling to graph-attention-based convex spatial combining that outputs real, nonnegative, sum-to-one channel weights per time-frequency bin, enabling gain-controlled and interpretable spatial filtering. To further reduce deployment sensitivity to covariance estimation, a covariance-free inference spatial front end is proposed, in which convex spatial weights are guided during training by MVDR-inspired teacher distributions and stabilized by learnable temperature scaling.

Second, physics-informed and lightweight spatial-filtering architectures are developed to improve robustness and efficiency, including a compact dynamic spatial-filtering design with residual spectral mapping and an end-to-end MVDR-inspired framework with physics-inspired regularization and residual refinement. A non-learned multi-band relative contrastive loss is further proposed as an analytic, architecture-agnostic objective that better aligns optimization with perceptual structure without increasing inference cost.

Third, to improve long-range modeling in noisy and reverberant conditions, this thesis proposes a dual-path state-space modeling framework with cross-domain interaction that captures both short- and long-context dependencies at a practical computational cost while coordinating magnitude restoration with complex-spectrum refinement.

Finally, beyond enhancement metrics, this thesis studies efficient and recognition-compatible ASR front ends under frozen recognizers. A band-split Parallel Time-Band Mixer (PTBM) captures intra-band temporal context and cross-band structure without within-block recurrence, and an ASR-oriented learned observation fusion (LOF) module performs structured complex-spectrum fusion between noisy and enhanced spectra on the complex STFT grid to suppress ASR-sensitive artifacts without development-set coefficient tuning.
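The observation-fusion idea above can be pictured with a simple stand-in: blending the noisy and enhanced complex spectra per STFT bin, so that some of the original observation is retained where aggressive enhancement would introduce ASR-sensitive artifacts. The sketch below is illustrative only — the thesis's LOF module learns a structured complex fusion rather than the scalar per-bin blend shown here, and all names are assumptions:

```python
import numpy as np

def fuse_observations(Y_noisy, Y_enh, alpha):
    """Per-bin fusion of noisy and enhanced complex spectra (sketch).

    Y_noisy, Y_enh: (T, F) complex STFTs; alpha: (T, F) real weights
    in [0, 1]. alpha = 1 trusts the enhancer fully; smaller values
    fall back toward the raw observation.
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    return alpha * Y_enh + (1.0 - alpha) * Y_noisy

rng = np.random.default_rng(2)
Yn = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))
Ye = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))
Yf = fuse_observations(Yn, Ye, np.full((4, 6), 0.8))
assert Yf.shape == (4, 6)
# Endpoints recover the pure enhanced / pure noisy spectra.
assert np.allclose(fuse_observations(Yn, Ye, np.ones((4, 6))), Ye)
assert np.allclose(fuse_observations(Yn, Ye, np.zeros((4, 6))), Yn)
```

In the learned setting the fusion coefficients come from the model rather than a development-set grid search, which is the coefficient-tuning-free property the abstract highlights.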

Extensive experiments on diverse simulated and real noisy and reverberant benchmarks demonstrate consistent improvements in speech quality and intelligibility. When coupled with frozen recognizers, the proposed methods also improve downstream recognition accuracy. Simulation-based analyses further demonstrate robustness to geometry variation, array perturbations, and imperfect estimates of spatial statistics, together with efficiency under practical computational budgets.

