Skip to main content
notice

Doctoral Seminar: Upama Kabir

April 16, 2015
|


Speaker: Upama Kabir

Supervisor: Dr. D. Goswami

Examining Committee:
Drs. T. Eavis, N. Kharma, S. P. Mudur

Title: Identifying Patterns Towards Algorithm Based Fault Tolerance

Date: Thursday, April 16, 2015

Time: 11:40 a.m.

Place: EV 3.309

ABSTRACT

Checkpoint and recovery cost imposed by coordinated checkpoint/restart (CCP/R) is a crucial performance issue for high performance computing (HPC) applications. In comparison, Algorithm Based Fault Tolerance (ABFT) is a promising fault tolerance method with low recovery overhead, but it suffers from the inadequacy of universal applicability and user non-transparency. In our research, we address the overhead problem of CCP/R and the limitations of ABFT and propose a solution for ABFT based on algorithmic patterns. The proposed solution is a generic fault tolerance strategy for a group of applications that exhibit similar algorithmic (structural and behavioral) features. These features together with the minimal fault recovery data (critical data) determine the fault tolerance strategy for the group of applications. We call this strategy a fault-tolerance pattern (FTP). In this research, we propose a hierarchical classification of FTP, based on the dependency patterns among the sub-problems of a parallel algorithm. We demonstrate the idea for two such classifications: (i) FTP for a class of parallel algorithms where sub-problems are independent. As a case study in this class, we consider the depth first search for discrete optimization problems using PIDA* and propose an algorithm for fault tolerant PIDA* (FTPIDA*). We prove the correctness of FTPIDA* and compare its performance with CCP/R both theoretically and experimentally. (ii) Next, we propose an FTP for the class of problems with an regular dependency pattern. As a case study in this class, we choose a dynamic programming based solution to the 0/1 knapsack problem, which is a serial monadic dynamic programming algorithm where the sub-problems are regularly interdependent. A fault tolerant version of the algorithm is designed and theoretically analyzed. The ongoing research focuses on evaluating other classes of FTPs and associated case studies, and their theoretical and experimental evaluations; with a final goal of investigating into the possibility of creating a generic FTP which is applicable to most, if not all, dependency patterns. We conclude this report with an outline of the future research plans and milestones to achieve the goal.




Back to top

© Concordia University