Doctoral Thesis Defense: Bahareh Goodarzi
Speaker: Bahareh Goodarzi
Supervisor: Dr. D. Goswami
Examining Committee: Drs. A. Agarwal, M. Dagenais, H. Harutyunyan, B. Jaumard, N. Bouguila (Chair)
Title: Efficient Scheduling and High-Performance Graph Partitioning on Heterogeneous CPU-GPU Systems
Date: Tuesday, August 21, 2018
Time: 10:30 a.m.
Place: EV 1.162
Heterogeneous CPU-GPU systems have emerged as a power-efficient platform for high performance parallelization of the applications. However, effectively exploiting these architectures faces a number of challenges including differences in the programming models of the CPU (MIMD) and the GPUs (SIMD), GPU memory constraints. In this thesis, first we explore embarrassingly parallel applications where tasks have no inter-dependencies. Herein, we design and implement a new dynamic and scalable scheduling heuristic to minimize the execution time of embarrassingly parallel applications on a heterogeneous CPU-GPU system. The scheduler is integrated into a scheduling framework that provides pre-implemented automated scheduling modules, liberating the user from the complexities of scheduling. We then investigate task dependent applications, where the tasks have data dependencies, and the computational tasks and their communication patterns are expressed by a task interaction graph. Scheduling of the task interaction graph on a cluster can be done by first partitioning the graph into a set of computationally balanced partitions in such a way that the communication cost among the partitions is minimized. We design and implement a multilevel graph partitioner on a heterogeneous CPU-GPU system that takes advantage of the high parallel processing power of GPUs and overcome some of the challenges arising due to the irregular nature of the algorithm. We present a lock-free partitioner that outperforms serial and parallel MPI-based partitioners. It performs similar to shared-memory CPU-based parallel graph partitioner. To optimize the graph partitioner performance, we describe an effective approach to enable a GPU-based multi-level graph partitioning that is tailored specifically for the SIMD architecture. Our solution avoids thread divergence and balances the load over GPU threads by dynamically assigning an appropriate number of threads to process the graph vertices and irregular sized neighbors. We show that this design outperforms CPU-based parallel graph partitioner. Finally, we apply some of our partitioning techniques to another graph processing algorithm, minimum spanning tree (MST), and show that extending these techniques helps in achieving a high performance implementation of MST on the GPU.