When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
Abstract
Modern computing applications demand ever-increasing performance and energy efficiency. However, conventional processor architectures frequently stall while waiting for data to arrive from memory, a bottleneck known as the memory wall. Over the past decades, approaches such as speculative prefetching, load value prediction, and hardware caching have been proposed to mitigate this limitation. While these techniques yield moderate gains, they often rely on rigid hardware logic or simple pattern matching, both of which struggle with the irregular, data-driven workloads typical of contemporary multimedia and machine learning applications.
This thesis proposes the use of Machine Learning (ML) to speculate on load values and reduce memory accesses. The proposed method is grounded in the principles of Approximate Computing (AC), where minor inaccuracies are accepted in exchange for improvements in performance or efficiency. To this end, we introduce an ML-based Load Value Approximation (ML-LVA) approach, which predicts the values of memory loads to reduce access latency. The ML-LVA is trained offline to produce a compact predictor that captures patterns in image and audio data, enabling accurate value prediction at runtime without continual retraining. By learning spatial correlations among adjacent data values, the proposed ML-LVA effectively anticipates memory contents, thereby reducing stalls and improving overall system performance in online deployment.
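To make the idea concrete, the following minimal sketch illustrates how an offline-trained predictor could approximate a load value from spatially adjacent data already on hand. The three-tap linear model, its weights, and all names are illustrative assumptions for this sketch; the abstract does not specify the model used in the thesis.

```c
/* Minimal sketch of an offline-trained load value approximator.
 * Hypothetical: a 3-tap linear predictor over spatially adjacent
 * pixel values stands in for the compact ML predictor described
 * in the abstract; the weights are illustrative, not from the thesis. */
#include <stdint.h>
#include <stdio.h>

/* Weights assumed to come from offline training. */
static const float W[3] = {0.50f, 0.30f, 0.20f};

/* Predict the value at a missed load address from three recently
 * loaded neighboring values (exploiting spatial correlation). */
static uint8_t lva_predict(const uint8_t neigh[3])
{
    float y = 0.0f;
    for (int i = 0; i < 3; i++)
        y += W[i] * (float)neigh[i];
    if (y < 0.0f)   y = 0.0f;     /* clamp to the 8-bit pixel range */
    if (y > 255.0f) y = 255.0f;
    return (uint8_t)(y + 0.5f);   /* round to nearest */
}

int main(void)
{
    /* Adjacent pixel values already resident in registers or cache. */
    const uint8_t neighbors[3] = {120, 124, 126};
    printf("approximated load value: %u\n", lva_predict(neighbors));
    return 0;
}
```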
We have implemented the proposed ML-LVA framework in both software and hardware. The software variant targets existing processors that lack reconfigurability, as well as systems with tight area or power constraints that preclude adding custom hardware. It operates as a callable subroutine designed for seamless integration without modifying the processor architecture, and was evaluated on an x86 processor in the gem5 simulator. In contrast, the hardware variant integrates the proposed ML-LVA as a dedicated accelerator accessed via a custom instruction, offering tighter pipeline integration, lower latency, and higher efficiency for newly designed systems. The hardware-based ML-LVA was implemented in CVA6, an open-source RISC-V processor, and synthesis results obtained with Cadence Innovus show that the overhead of the added accelerator is marginal.
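As a rough illustration of the two integration styles, the sketch below contrasts a plain callable subroutine (software variant) with a custom-instruction wrapper (hardware variant). The function names, the last-value placeholder model, and the custom-0 instruction encoding are all assumptions for illustration; the abstract does not specify the actual interface used in the thesis.

```c
/* Hypothetical sketch contrasting the two ML-LVA integration styles.
 * Names, the placeholder predictor, and the instruction encoding are
 * illustrative assumptions, not the thesis's actual interface. */
#include <stdint.h>
#include <stdio.h>

/* Software variant: a callable subroutine. A trivial last-value
 * predictor stands in for the offline-trained compact ML model. */
static uint8_t lva_last_value = 0;

static uint8_t ml_lva_predict_sw(void)
{
    return lva_last_value;  /* placeholder for the ML model's output */
}

static inline uint8_t approx_load_sw(const uint8_t *p, int approximate)
{
    if (approximate)
        return ml_lva_predict_sw();  /* skip the memory access */
    lva_last_value = *p;             /* real load; update history */
    return lva_last_value;
}

#if defined(__riscv)
/* Hardware variant: the accelerator is reached through a custom
 * instruction. The custom-0 opcode (0x0B) and funct fields below
 * are assumed for illustration only. */
static inline uint32_t approx_load_hw(const void *addr)
{
    uint32_t value;
    __asm__ volatile (".insn r 0x0B, 0x0, 0x00, %0, %1, x0"
                      : "=r"(value)
                      : "r"(addr));
    return value;
}
#endif

int main(void)
{
    uint8_t buf[2] = {10, 12};
    (void)approx_load_sw(&buf[0], 0);  /* real load: 10 */
    /* Approximated load predicts the last real value (10), not 12. */
    printf("approximated: %u\n", approx_load_sw(&buf[1], 1));
    return 0;
}
```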
Experimental results on audio and image processing workloads demonstrate that the proposed ML-LVA accelerates memory accesses by over 6×, yielding application speedups of up to 2.45×. Moreover, even when up to 95% of loads are predicted, output fidelity remains within perceptual thresholds. Consequently, the proposed ML-LVA outperforms state-of-the-art LVAs in both performance and quality, at a cost of only 5% area overhead and less than a 1% power increase in silicon.