When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the candidate's own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once the thesis is accepted, the candidate presents it orally. This oral exam is open to the public.
Language is the main mechanism of human communication. For decades, researchers have tried to build computers capable of communicating with humans in natural languages. Such research endeavors include building systems that understand human language (Natural Language Understanding, or NLU) and generate responses in human language (Natural Language Generation, or NLG). The goal of NLG is to deliver information by producing natural texts in human languages. Text generation typically consists of several steps in a pipeline, from determining the content to communicate to producing the actual words, and hence requires different techniques at each step. In this thesis, we focus on ordering problems in NLG and address two tasks that require ordering, namely sentence ordering and surface realization.
In sentence ordering, models need to capture the relations between sentences and then, based on these relations, find the most coherent order of sentences. Our proposed approach is based on pointer networks, where at each step the model chooses the sentence that should appear next in the text. We show that using a conditional sentence representation, which captures the meaning of a sentence based on its position and the previously selected sentences, improves on the state of the art on standard datasets.
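To make the pointer-network decoding step concrete, the following is a minimal greedy sketch: at each step, the current decoder state scores every not-yet-selected sentence, the highest-scoring one is appended to the order, and the state is updated to condition on that choice. The weight matrices, the mean-pooled initial state, and the dot-product scoring are illustrative assumptions, not the trained model described in the thesis.

```python
import numpy as np

def pointer_decode(sent_embs, w_q, w_k):
    """Greedy pointer-network decoding over sentence embeddings.

    sent_embs: (n, d) array, one embedding per sentence.
    w_q, w_k:  (d, d) query/key projections (assumed learned; here
               they are just passed in for illustration).
    Returns the predicted sentence order as a list of indices.
    """
    n, _ = sent_embs.shape
    order = []
    remaining = set(range(n))
    # Crude initial "context" state: mean of all sentence embeddings.
    state = sent_embs.mean(axis=0)
    for _ in range(n):
        q = state @ w_q  # query derived from the current decoder state
        # Score each unselected sentence against the query.
        scores = {i: float(q @ (w_k @ sent_embs[i])) for i in remaining}
        nxt = max(scores, key=scores.get)  # argmax over unselected sentences
        order.append(nxt)
        remaining.remove(nxt)
        # Condition the next step on the sentence just selected.
        state = sent_embs[nxt]
    return order
```

In the actual model, the scoring and state update would be learned jointly with the sentence encoder; this sketch only shows the pointer mechanism of selecting one remaining sentence per step.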
For surface realization, we show that a pointer network is insufficient to improve state-of-the-art performance. We therefore propose to use language models for surface realization by mapping the task from a graph-to-text problem to a text-to-text problem. Our experiments show that pre-trained language models can easily learn surface realization and achieve competitive performance on standard datasets. To further improve performance, we pre-train a language model on synthetic data and then fine-tune it on manually labeled data, thereby increasing the amount of training data available for surface realization. Our experiments indicate that this approach improves the state of the art by more than 10% in BLEU score on standard datasets.
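Mapping graph-to-text to text-to-text requires linearizing the input graph into a token sequence that a pre-trained language model can consume. The sketch below shows one such linearization; the `<H>`/`<R>`/`<T>` marker scheme and the node/edge representation are hypothetical choices for illustration, not necessarily the format used in the thesis.

```python
def linearize_graph(nodes, edges):
    """Flatten a labeled graph into a single string for a
    text-to-text model.

    nodes: dict mapping node id -> surface label.
    edges: list of (head_id, relation, dependent_id) triples.
    Each triple becomes a "<H> head <R> relation <T> tail" segment.
    """
    parts = []
    for head, rel, dep in edges:
        parts.append(f"<H> {nodes[head]} <R> {rel} <T> {nodes[dep]}")
    return " ".join(parts)

# Example: a tiny semantic graph for "John eats an apple".
nodes = {0: "John", 1: "apple", 2: "eat"}
edges = [(2, "agent", 0), (2, "patient", 1)]
linearized = linearize_graph(nodes, edges)
# -> "<H> eat <R> agent <T> John <H> eat <R> patient <T> apple"
```

Once linearized, the graph side and the text side are both plain strings, so standard sequence-to-sequence pre-training and fine-tuning apply directly; synthetic graph-text pairs can be generated in the same format to enlarge the training set before fine-tuning on manually labeled data.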