Thesis defences

PhD Oral Exam - Goutam Yelluru Gopal, Electrical and Computer Engineering

Exploring Convex Optimization and Transformer based Methods for Efficient Visual Object Tracking


Date & time
Tuesday, April 9, 2024
1 p.m. – 4 p.m.
Cost

This event is free

Organization

School of Graduate Studies

Contact

Nadeem Butt

Where

Engineering, Computer Science and Visual Arts Integrated Complex
1515 St. Catherine W.
Room 001.162

Wheelchair accessible

Yes

When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.

Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.

Abstract

Tracking arbitrary objects in video sequences is a widely explored task with diverse applications across domains such as remote surveillance, augmented reality, and robotics. A critical determinant of a tracker's effectiveness is the representation of the target object through a collection of feature templates (or channels). However, under challenging video conditions such as target deformation, occlusion, background clutter, and motion blur, some channels lose their discriminative power, leading to tracking failures. There are two recent paradigms in visual object tracking: Discriminative Correlation Filters (DCF) and Siamese Networks (SN). DCF trackers address video challenges by aggregating hand-crafted and deep Convolutional Neural Network-based (CNN) channels to model low-level (such as shape and color) and high-level visual cues of the target. However, this approach increases the computational complexity of the tracker due to the larger number of model parameters and significantly reduces inference speed, especially on constrained hardware such as a Central Processing Unit (CPU) or edge devices. We observe a parallel trend in end-to-end trainable deep SN trackers, which deploy parameter-heavy backbones for feature extraction and rely on specialized hardware such as a Graphics Processing Unit (GPU) for fast inference. In this thesis, we propose computationally efficient solutions for both DCF and SN tracking algorithms while improving their accuracy.

For multi-channel DCF tracking, we present three solutions to alleviate the impact of non-discriminative features (or channels). These methods are based on the concept of reliability, which quantifies the discriminative power of a feature (or channel) from its filter response. Our first method dynamically lowers the weight of unreliable features while enforcing temporal smoothness of the learned weights. The second method uses a dynamic channel pruning scheme to suppress non-discriminative channels, and our third method extends this pruning scheme by modeling inter-channel relations to avoid false suppression of discriminative channels during adaptive weight estimation. We formulate the learning of adaptive feature (or channel) weights as a convex optimization problem and derive efficient solutions to maintain tracking speed. Experimental results on multiple datasets demonstrate the efficacy of the proposed solutions, which enhance baseline DCF trackers and outperform related channel-adaptive DCF methods.
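
To make the channel-weighting idea concrete, the following is a minimal sketch, not the exact formulation from the thesis, of learning adaptive channel weights from per-channel reliability scores with a temporal-smoothness term, posed as a small convex program and solved with CVXPY. The function name, objective, and parameters are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the thesis's formulation): estimate
# adaptive channel weights from per-channel reliability scores, with a
# temporal-smoothness term, as a small convex program.
import numpy as np
import cvxpy as cp

def update_channel_weights(reliability, prev_weights, smooth_lambda=0.5):
    """Return non-negative channel weights that sum to one.

    reliability  : per-channel reliability scores (e.g., peak filter response), shape (C,)
    prev_weights : weights estimated for the previous frame, shape (C,)
    """
    c = len(reliability)
    w = cp.Variable(c)
    # Favor reliable channels while keeping weights close to the previous frame.
    objective = cp.Minimize(
        cp.sum_squares(w - reliability / np.sum(reliability))
        + smooth_lambda * cp.sum_squares(w - prev_weights)
    )
    constraints = [w >= 0, cp.sum(w) == 1]
    cp.Problem(objective, constraints).solve()
    return w.value

# Example: three reliable channels and one degraded by occlusion.
r = np.array([0.9, 0.8, 0.85, 0.1])
w_prev = np.full(4, 0.25)
print(update_channel_weights(r, w_prev))
```

In this toy objective, the unreliable channel is down-weighted while the smoothness term keeps the weights from changing abruptly between frames.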

Expanding on the lightweight SN tracking paradigm, we propose two efficient transformer-based algorithms for fast visual object tracking. Our first algorithm, MVT, employs Mobile Vision Transformers as the backbone for accurate and fast tracking. Using a cascaded arrangement of CNN and transformer blocks in its backbone, MVT fuses the template and search regions during feature extraction to generate a superior feature encoding for target localization. Our second tracking algorithm, the Separable Self- and Mixed-Attention Transformer-based tracker (SMAT), further improves the efficiency of MVT by replacing standard attention with a computationally efficient separable attention block. Notably, SMAT is the first to deploy a hybrid CNN-transformer module concurrently in both the backbone and the head. The proposed trackers outperform related lightweight trackers on eight challenging benchmarks, with SMAT emerging as the top performer. Compared to state-of-the-art SN trackers, the proposed models demonstrate good accuracy-speed tradeoffs. Their computationally efficient architecture enables MVT and SMAT to run at real-time speed on a CPU while reaching 150 frames per second (fps) on a GPU. Additionally, we demonstrate that inference frameworks such as ONNX Runtime and TensorRT can significantly increase the fps of CPU- and GPU-based inference.
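
As a rough illustration of the last point, here is a small sketch, using a toy stand-in network rather than the actual MVT or SMAT models, of exporting a PyTorch module to ONNX and timing it with ONNX Runtime on the CPU. The model, input names, and timing loop are assumptions for demonstration only.

```python
# Sketch: export a toy PyTorch network to ONNX and time CPU inference with
# ONNX Runtime. The real MVT/SMAT trackers take a template and a search-region
# crop as inputs; this stand-in model is purely illustrative.
import time
import torch
import onnxruntime as ort

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
)
model.eval()

search = torch.randn(1, 3, 256, 256)  # hypothetical search-region crop
torch.onnx.export(model, search, "tracker_toy.onnx",
                  input_names=["search"], output_names=["features"])

session = ort.InferenceSession("tracker_toy.onnx",
                               providers=["CPUExecutionProvider"])
x = search.numpy()

start = time.time()
for _ in range(100):
    session.run(None, {"search": x})
print(f"~{100 / (time.time() - start):.1f} fps on CPU for this toy model")
```

The speed-up such runtimes typically provide over eager-mode execution comes from graph-level optimizations and fused kernels applied at export time.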
