PhD Oral Exam - Bahareh Goodarzi, Computer Science
When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
Heterogeneous CPU-GPU systems have emerged as a power-efficient platform for high-performance parallel execution of applications. However, effectively exploiting these architectures faces a number of challenges, including differences between the programming models of CPUs (MIMD) and GPUs (SIMD), GPU memory constraints, and comparatively low communication bandwidth between the CPU and the GPU. As a consequence, high-performance execution of applications on these platforms requires new adaptive parallelization methods. In this thesis, we first explore embarrassingly parallel applications, where tasks have no inter-dependencies. Although the massive processing power of GPUs provides an attractive opportunity for high-performance execution of embarrassingly parallel tasks on CPU-GPU systems, the minimum execution time can be achieved only by distributing the tasks optimally between the processors. In contemporary CPU-GPU systems, the scheduler cannot determine an appropriate distribution ratio on its own, so dividing the tasks between the processors manually requires considerable programming effort. Herein, we design and implement a new dynamic scheduling heuristic to minimize the execution time of embarrassingly parallel applications on a heterogeneous CPU-GPU system. The scheduler is integrated into a scheduling framework that provides pre-implemented automated scheduling modules, freeing the user from the complexities of scheduling details. The experimental results show that our scheduling approach achieves performance similar to or better than that of some of the scheduling algorithms proposed for CPU-GPU systems. We then investigate task-dependent applications, where the tasks have data dependencies, and the computational tasks and their communication patterns are expressed by a task interaction graph.
Scheduling of the task interaction graph on a cluster can be done by first partitioning the graph into a set of computationally balanced partitions in such a way that the communication cost among the partitions is minimized, and subsequently mapping the partitions onto physical processors. Aside from scheduling, graph partitioning is a common computation phase in many application domains, including social network analysis, data mining, and VLSI design. However, the irregular, data-dependent sub-tasks of graph partitioning pose multiple challenges for efficient use of GPUs, which favor regular computation. We design and implement a multilevel graph partitioner on a heterogeneous CPU-GPU system that takes advantage of the high parallel processing power of GPUs by executing the computation-intensive parts of the partitioning sub-tasks on the GPU and assigning the parts with less parallelism to the CPU. Our partitioner aims to overcome some of the challenges arising from the irregular nature of the algorithm and from memory constraints on GPUs. We present a lock-free scheme, since fine-grained synchronization among thousands of GPU threads imposes too high a performance overhead. Experimental results demonstrate that our partitioner outperforms serial and parallel MPI-based partitioners and performs similarly to a shared-memory CPU-based parallel graph partitioner. To optimize the performance of the graph partitioner, we describe an effective, methodical approach to GPU-based multilevel graph partitioning that is tailored specifically to the SIMD architecture. Our solution avoids thread divergence and balances the load over GPU threads by dynamically assigning an appropriate number of threads to process the graph vertices and their irregularly sized neighbor lists. Our optimized design is autonomous, as all the steps are carried out by the GPU with minimal CPU intervention. We show that this design outperforms a CPU-based parallel graph partitioner.
Finally, we apply some of our partitioning techniques to another graph processing algorithm, minimum spanning tree (MST), which also exhibits load imbalance. We show that extending these techniques helps in achieving a high-performance implementation of MST on the GPU.
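
As a rough illustration of the partitioning objective described in the abstract (computationally balanced partitions with minimal communication cost among them), the following Python sketch evaluates a given partition of a small task interaction graph. It is not taken from the thesis: the function names `edge_cut` and `imbalance`, the weights, and the example graph are assumptions chosen only for illustration.

```python
from collections import defaultdict

def edge_cut(edges, part):
    """Total weight of edges whose endpoints fall in different partitions."""
    return sum(w for u, v, w in edges if part[u] != part[v])

def imbalance(vertex_weight, part, k):
    """Load of the heaviest partition divided by the ideal (average) load.

    A value of 1.0 means the k partitions are perfectly balanced.
    """
    load = defaultdict(float)
    for v, w in vertex_weight.items():
        load[part[v]] += w
    ideal = sum(vertex_weight.values()) / k
    return max(load.values()) / ideal

# Hypothetical task interaction graph: (task, task, communication cost).
edges = [(0, 1, 3), (1, 2, 1), (2, 3, 3), (3, 0, 1)]
# Hypothetical computation cost of each task.
vertex_weight = {0: 2, 1: 2, 2: 2, 3: 2}

# Two candidate 2-way partitions (task -> partition id).
keeps_heavy_edges = {0: 0, 1: 0, 2: 1, 3: 1}
cuts_heavy_edges = {0: 0, 1: 1, 2: 1, 3: 0}

for name, part in [("keeps_heavy_edges", keeps_heavy_edges),
                   ("cuts_heavy_edges", cuts_heavy_edges)]:
    print(name, edge_cut(edges, part), imbalance(vertex_weight, part, 2))
```

Both candidate partitions are equally balanced, but the first cuts only the two light edges, so it would incur less communication once the partitions are mapped onto processors. A partitioner searches for partitions of this kind; the sketch only scores a partition that is already given.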