PhD Oral Exam - Elnaz Davoodi, Computer Science
When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
This thesis is a computational study of the relation between discourse properties and textual complexity. Specifically, we explore three research questions.
The first research question asks to what degree discourse-level properties can be used to predict the complexity level of a text. To answer it, we considered three types of discourse-level properties: (1) the realization of discourse relations, and the representation of discourse relations in terms of (2) the choice of discourse relation and (3) the choice of discourse marker. Using datasets from standard corpora in the fields of discourse analysis and text simplification, we developed a supervised machine learning model for pairwise text complexity assessment and compared these properties with more traditional linguistic features. Our results show that using discourse features alone performed statistically as well as using all linguistic features. Thus, we conclude that discourse properties and complexity level are strongly correlated.
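As an illustration only, pairwise complexity assessment can be sketched as follows: each text is mapped to a feature vector, and a classifier is trained on the difference between the two vectors of a pair to decide which text is more complex. The marker list, the two features, the perceptron, and the example sentences are all assumptions made for this sketch; they are not the features, model, or data used in the thesis.

```python
# Illustrative sketch only: the marker list, features, and perceptron below
# are assumptions, not the feature set or model used in the thesis.

# Hypothetical, tiny list of explicit discourse markers.
MARKERS = {"although", "because", "but", "however", "therefore"}


def features(text):
    """Return [discourse markers per word, mean sentence length in words]."""
    tokens = text.lower().replace(",", " ").replace(".", " . ").split()
    words = [t for t in tokens if t != "."]
    n_sentences = max(1, tokens.count("."))
    marker_rate = sum(w in MARKERS for w in words) / max(1, len(words))
    return [marker_rate, len(words) / n_sentences]


def pair_vector(a, b):
    """Represent a pair of texts by the difference of their feature vectors."""
    return [x - y for x, y in zip(features(a), features(b))]


def predict(w, a, b):
    """Return 1 if text a is predicted to be more complex than text b, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, pair_vector(a, b)))
    return 1 if score > 0 else 0


def train(pairs, epochs=10):
    """Train a perceptron on (text_a, text_b, label) triples."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for a, b, label in pairs:
            error = label - predict(w, a, b)
            if error:
                x = pair_vector(a, b)
                w = [wi + error * xi for wi, xi in zip(w, x)]
    return w


# Invented example pairs (complex version, simplified version).
c1 = "Although it rained, the match continued because the referee insisted."
s1 = "It rained. The match went on."
c2 = "The plan failed, however the team persisted, therefore funding was renewed."
s2 = "The plan failed. The team kept going. They got more money."

# Each pair is used in both orders so that both labels are seen in training.
training_pairs = [(c1, s1, 1), (s1, c1, 0), (c2, s2, 1), (s2, c2, 0)]
weights = train(training_pairs)

# Held-out pair, also invented.
c3 = "The bridge closed because the river flooded, but traffic was rerouted quickly."
s3 = "The river flooded. The bridge closed."
print(predict(weights, c3, s3))  # 1: the first text is judged more complex
print(predict(weights, s3, c3))  # 0: the first text is judged simpler
```

Working on feature differences keeps the decision symmetric: swapping the two texts in a pair negates the input vector, so the model cannot label both orderings as "first text is more complex".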
The second research question asks how the complexity level of a text influences its discourse-level linguistic choices. To address this question, we conducted a corpus analysis of the Simple English Wikipedia, the largest annotated corpus based on complexity level. Our analysis used the 16 discourse relations defined in the DLTAG framework and focused on explicit relations. Our results show that the distribution of discourse relations is not influenced by a text’s complexity level, but the way these relations are signaled is.
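A comparison of this kind can be sketched by counting, for each complexity level, which relations occur and which markers signal them, and then measuring how far apart the two resulting distributions are. The sketch below uses Jensen-Shannon divergence as the distance; this choice, the annotation format, and the toy annotations are assumptions for illustration and do not reproduce the statistical analysis or the data of the thesis.

```python
import math
from collections import Counter


def distribution(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Invented toy annotations as (relation, marker) tuples, one list per level.
complex_level = [("Contrast", "however"), ("Contrast", "however"), ("Cause", "because")]
simple_level = [("Contrast", "but"), ("Contrast", "but"), ("Cause", "because")]

relations_complex = distribution(Counter(rel for rel, _ in complex_level))
relations_simple = distribution(Counter(rel for rel, _ in simple_level))
markers_complex = distribution(Counter(marker for _, marker in complex_level))
markers_simple = distribution(Counter(marker for _, marker in simple_level))

relation_gap = js_divergence(relations_complex, relations_simple)
marker_gap = js_divergence(markers_complex, markers_simple)
print(relation_gap, marker_gap)
```

In this toy input the two levels use the same relations in the same proportions, so the relation divergence is zero, while the marker divergence is positive because one level signals Contrast with "however" and the other with "but".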
Finally, given the results of our corpus analysis, our third research question asks whether we can leverage these differences to mine parallel corpora across complexity levels and automatically discover alternative lexicalizations (AltLexes) of discourse markers. This work led to the automatic identification of 91 new AltLexes in two corpora: the Simple English Wikipedia corpus and the Newsela corpus.
Overall, this thesis demonstrates that a text’s complexity level and its discourse-level properties are indeed correlated. Discourse properties play an important role in assessing a text’s complexity level and should be taken into account in complexity-level assessment. In addition, we observed that the way explicit discourse relations are signaled is influenced by textual complexity. Lastly, our thesis shows that the automatic identification of alternative lexicalizations of discourse markers can benefit from large-scale parallel corpora across complexity levels.