Abstract
Narrative Electronic Medical Record (EMR) data is a valuable but challenging resource for analysis, owing to the need for preprocessing that comprises three essential tasks: section detection, text normalization, and feature engineering. This thesis establishes a pipeline that leverages Large Language Models (LLMs) to preprocess narrative EMR data, with the objective of identifying Hospital Adverse Events (HAE). The proposed pipeline aims to enhance the efficiency of HAE detection while reducing reliance on labor-intensive, time-consuming, and costly procedures. HAE detection is typically accomplished through a variety of methods, including manual chart review, discharge diagnostic coding, prevalence surveys, and incident reporting systems. Recently, researchers have shown growing interest in leveraging narrative EMR data together with Natural Language Processing (NLP), Machine Learning (ML), and LLM techniques. A significant challenge associated with these techniques is the critical need to preprocess narrative EMR data. Moreover, the existing tools for preprocessing narrative EMR data are predominantly designed for general applications rather than being optimized specifically for HAE detection.
This thesis examines the preprocessing of narrative EMR data for HAE identification by developing an LLM-based pipeline. First, given the increasing use of NLP, and consequently LLMs, for HAE detection, a systematic scoping review is conducted to summarize the existing literature and to identify overlooked research gaps and the challenges of using narrative EMR data to detect HAE. The review underscores the essential role of preprocessing in HAE detection, with the results indicating that text normalization and feature engineering are preprocessing tasks that significantly affect detection performance.
Second, the LLM-based pipeline tackles the section detection task by designing and implementing a novel multi-head attention mechanism for training LLMs to accurately identify section headers within clinical notes. In contrast to standard attention mechanisms, which attend to all tokens in the input sentence, the proposed customized multi-head attention mechanism selectively directs attention toward tokens that denote section-header titles during LLM training. The results indicate that this approach enhanced performance across three distinct LLMs: Text-to-Text Transfer Transformer (T5), Generative Pre-trained Transformer (GPT)-2, and Bidirectional Encoder Representations from Transformers (BERT). Notably, consistent improvements were observed in T5, a smaller model.
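The idea of restricting attention to header-title tokens can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the thesis's implementation: it assumes a precomputed score matrix and a binary mask marking candidate header tokens, and shows how masked softmax concentrates attention weight on those tokens.

```python
import numpy as np

def header_focused_attention(scores, header_mask):
    """Masked softmax over attention scores.

    scores: (queries, tokens) raw attention scores (illustrative values).
    header_mask: (tokens,) binary vector, 1 = token belongs to a
    section-header title (hypothetical input; in practice such a mask
    would be derived during training).
    """
    # Suppress non-header tokens with a large negative score.
    masked = np.where(header_mask[None, :] == 1, scores, -1e9)
    # Numerically stable softmax over the token axis.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: 4 tokens; tokens 0 and 1 form a header title.
scores = np.array([[1.0, 0.5, 2.0, 0.3],
                   [0.2, 1.5, 0.1, 0.9]])
header_mask = np.array([1, 1, 0, 0])
w = header_focused_attention(scores, header_mask)
# All attention mass falls on the two header tokens.
```

A full implementation would apply such masking per attention head inside the transformer; here a single score matrix stands in for one head's query-key products.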
Third, to address text normalization challenges, a framework is proposed for detecting and deciphering abbreviations in clinical text using LLMs. The framework is structured into four phases: task definition, property identification, example selection, and the application of LLMs through either fine-tuning or an optimized example-based prompting method. The results demonstrate that fine-tuning LLMs yields superior performance at a lower cost than the optimized example-based prompting method, indicating that fine-tuning effectively and efficiently facilitates the detection and deciphering of abbreviations in clinical notes.
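The example-based prompting phase can be sketched as a simple template builder. Everything here is hypothetical, including the function name, the example sentences, and the prompt wording; the thesis's optimized example-selection strategy is not reproduced, only the general shape of a few-shot prompt for abbreviation expansion.

```python
def build_abbrev_prompt(examples, sentence, abbreviation):
    """Assemble a few-shot prompt asking an LLM to expand a clinical
    abbreviation in context (hypothetical template, for illustration)."""
    lines = ["Expand the clinical abbreviation in each sentence."]
    for ex_sentence, ex_abbrev, ex_expansion in examples:
        lines.append(f"Sentence: {ex_sentence}")
        lines.append(f"Abbreviation: {ex_abbrev} -> {ex_expansion}")
    # The query ends with an open arrow for the model to complete.
    lines.append(f"Sentence: {sentence}")
    lines.append(f"Abbreviation: {abbreviation} ->")
    return "\n".join(lines)

# Toy example with one in-context demonstration.
examples = [("Pt c/o SOB on exertion.", "SOB", "shortness of breath")]
prompt = build_abbrev_prompt(examples, "Hx of CHF and HTN.", "HTN")
```

In the fine-tuning alternative, the same (sentence, abbreviation, expansion) triples would instead be formatted as supervised training pairs rather than packed into the prompt.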
In conclusion, this thesis posits that directing customized attention toward the specific target task significantly enhances both the effectiveness and efficiency of LLM task performance. This customization may be achieved through various approaches, including a customized multi-head attention mechanism during training, the formulation of engineered prompts, or the systematic fine-tuning of LLMs.