Thesis defences

PhD Oral Exam - Valerie Hayot-Sasson, Software Engineering

Data management without reinstrumentation: how to speed up existing big data neuroimaging workflows

Date & time

Friday, February 4, 2022 (all day)

Cost

This event is free

Organization

School of Graduate Studies

Contact

Daniela Ferrer

Where

Online

When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.

Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.

Abstract

Neuroimaging has entered the Big Data era through the adoption of data sharing practices and improved data collection infrastructure enabling higher resolution imaging. Whereas the pipelines have shifted to become increasingly data-intensive, neuroimaging software has minimally adapted to address this shift. Rather, scientific software has primarily focused on ease-of-use, reproducibility, portability and parallelism. While the goals of Big Data and scientific frameworks differ, their strategies can be combined to make scientific frameworks more suitable for the processing of the increasingly prominent scientific Big Data.

The objectives of this thesis are two-fold: 1) determine whether neuroimaging frameworks benefit from the incorporation of Big Data management strategies and investigate how to adapt existing solutions, and 2) develop new tools to enable data management within neuroimaging workflows. Our performance analysis determined that neuroimaging frameworks can benefit significantly from the incorporation of data management strategies, by up to a factor of 5.3X in the most data-intensive case. While we found Big Data frameworks (i.e., Apache Spark) to significantly speed up data-intensive neuroimaging workflows, our analysis of overlay pilot-scheduling with Spark determined that large-scale Spark workflows would be difficult to run on HPC. Furthermore, while alternative hardware solutions such as Intel Optane DCPMM produce speedups similar to in-memory processing with Spark and could be used as an alternative, they remain inaccessible to many researchers.

To bring data-management solutions to neuroimaging applications, we developed two libraries, namely, Rolling Prefetch and Sea. Rolling Prefetch is our data-management solution for cloud-based applications that enables the sequential prefetching of data located on Amazon S3 storage. Experimental results demonstrate that Rolling Prefetch can speed up experiments by a factor of 1.8X and has a theoretical bound of 2X.
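The idea of sequential prefetching can be illustrated with a minimal Python sketch. This is not the Rolling Prefetch API: the names `rolling_prefetch`, `fetch`, `depth` and `overlap_speedup` are hypothetical, and `fetch` stands in for whatever call reads one block from remote storage such as Amazon S3. A background thread reads upcoming blocks into a bounded buffer while the caller processes the current one. Under the simplifying assumption that transfer and compute overlap perfectly, the speedup over a sequential run is (transfer + compute) / max(transfer, compute), which cannot exceed 2 and reaches 2 only when the two times are equal. That idealized model is one way to read the 2X bound quoted above.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator


def rolling_prefetch(
    keys: Iterable[str],
    fetch: Callable[[str], bytes],  # hypothetical: reads one block from remote storage
    depth: int = 2,                 # hypothetical: how many blocks may be buffered ahead
) -> Iterator[bytes]:
    """Yield fetch(key) for each key, in order, while fetching ahead in the background."""
    buf: queue.Queue = queue.Queue(maxsize=depth)
    done = object()  # marks the end of the stream

    def producer() -> None:
        try:
            for key in keys:
                buf.put(fetch(key))  # blocks once `depth` items are waiting
            buf.put(done)
        except Exception as exc:
            buf.put(exc)  # hand the error to the consumer instead of losing it

    # Daemon thread: if the caller stops iterating early, it will not block exit.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item


def overlap_speedup(t_transfer: float, t_compute: float) -> float:
    """Idealized speedup when transfer and compute overlap perfectly."""
    sequential = t_transfer + t_compute
    overlapped = max(t_transfer, t_compute)
    return sequential / overlapped
```

In this sketch a caller would simply iterate, for example `for block in rolling_prefetch(keys, fetch): process(block)`, and the buffer depth limits how much local memory or disk the prefetched data can occupy.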

Sea targets the standard neuroimaging workflows executed on HPC. It brings prefetching, data-locality and in-memory computing to POSIX-based command-line tools through the interception of glibc calls. Using this approach, researchers can benefit from data-management-related speedups by incorporating Sea into their standard analysis. Our results on standard neuroimaging pipelines show that Sea can speed up execution by an average of 11X with large datasets writing to a deteriorated shared file system.

Department of Computer Science and Software Engineering (CSSE)