When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
Abstract
A primary objective in natural language processing is the classification of texts into discrete categories. Topic models and mixture models are indispensable tools for this task, as they both acquire patterns from data in an unsupervised manner. Several extensions to established topic modeling frameworks are introduced by incorporating more flexible priors and advanced inference methods to enhance performance in text document analysis. The Multinomial Principal Component Analysis (MPCA) framework, a Dirichlet-based model, is extended by integrating generalized Dirichlet (GD) and Beta-Liouville (BL) distributions, resulting in the GDMPCA and BLMPCA models. These priors address limitations of the Dirichlet prior, such as its independence assumption within components and its restricted covariance structure. Efficiency is further improved by implementing variational Bayesian inference and collapsed Gibbs sampling for fast and accurate parameter estimation.
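To make the covariance point concrete, the following is a minimal sketch, not taken from the thesis, of how a generalized Dirichlet vector can be drawn with the standard stick-breaking construction from independent Beta variables. The function name `sample_generalized_dirichlet` and all parameter values are illustrative assumptions. Under a Dirichlet prior every pair of components is negatively correlated, whereas the GD parameters chosen here yield a positive covariance between the second and third components.

```python
import numpy as np


def sample_generalized_dirichlet(a, b, size, rng):
    """Draw `size` vectors from a generalized Dirichlet distribution.

    Stick-breaking construction: v_i ~ Beta(a_i, b_i) independently,
    x_1 = v_1, x_i = v_i * prod_{j<i} (1 - v_j), and the last component
    takes whatever remains of the stick. Returns shape (size, len(a) + 1).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # One Beta draw per stick break and per sample.
    v = rng.beta(a, b, size=(size, a.size))
    # remaining[:, i] is the stick length left after break i.
    remaining = np.cumprod(1.0 - v, axis=1)
    x = np.empty((size, a.size + 1))
    x[:, 0] = v[:, 0]
    x[:, 1:-1] = v[:, 1:] * remaining[:, :-1]
    x[:, -1] = remaining[:, -1]
    return x


rng = np.random.default_rng(0)

# Illustrative (assumed) parameters: a high-variance first break and a
# low-variance second break make components 2 and 3 move together.
gd = sample_generalized_dirichlet([0.5, 50.0, 5.0], [0.5, 50.0, 5.0], 20000, rng)
dirichlet = rng.dirichlet([1.0, 1.0, 1.0, 1.0], size=20000)

gd_cov = np.cov(gd, rowvar=False)
dir_cov = np.cov(dirichlet, rowvar=False)
print("GD        cov(x2, x3):", gd_cov[1, 2])
print("Dirichlet cov(x2, x3):", dir_cov[1, 2])
```

The comparison is only meant to illustrate why a GD prior is more flexible than a Dirichlet prior; it does not reproduce the GDMPCA or BLMPCA inference procedures.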
Enhancements to the Bi-Directional Recurrent Attentional Topic Model (bi-RATM) are made by incorporating GD and BL distributions, leading to GD-bi-RATM and BL-bi-RATM models. These models leverage attention mechanisms to model relationships between sentences, offering higher flexibility and improved performance in document embedding tasks.
Extensions to the Dirichlet Multinomial Regression (DMR) and deep Dirichlet Multinomial Regression (dDMR) approaches are achieved by incorporating GD and BL distributions. This integration addresses limitations related to handling complex data structures and overfitting, with collapsed Gibbs sampling providing an efficient method for parameter inference. Experimental results on benchmark datasets demonstrate enhanced topic modeling performance, particularly in handling complex data structures and reducing overfitting.
Novel approaches are developed by integrating embeddings derived from BERTopic with the multi-grain clustering topic model (MGCTM). Recognizing the hierarchical and multi-scale nature of topics, these methods utilize MGCTM to capture topic structures at multiple levels of granularity. By incorporating GD and BL distributions, the expressiveness and flexibility of MGCTM are enhanced. Experiments on various datasets show superior topic coherence and granularity compared to state-of-the-art methods.
Overall, the proposed models exhibit improved interpretability and effectiveness in various natural language processing and machine learning applications, showcasing the potential of combining neural embeddings with advanced probabilistic modeling techniques.