Thesis defences

PhD Oral Exam - Hafsa Ennajari, Information and Systems Engineering

Embedded Spherical Probabilistic Modeling for Topic Discovery and Text Representation Learning in Unstructured Text Data


Date & time
Friday, September 8, 2023
11 a.m. – 1 p.m.
Cost

This event is free

Organization

School of Graduate Studies

Contact

Daniela Ferrer

Where

Online

When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.

Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.

Abstract

Every day, large amounts of text data are generated on the web. Taking advantage of such data necessitates effective methods of retrieval, exploration, and analysis to extract hidden knowledge from these voluminous unstructured texts. In this context, probabilistic topic modeling is regarded as an effective text mining technique that uncovers the main topics in an unlabeled set of documents. Topic models have been successfully used to reveal hidden topics in various domains, e.g., marketing, medicine, and political science. However, the topics inferred by conventional topic models are often unclear and difficult to interpret, because these models do not account for semantic structures in language. Recently, several topic modeling approaches have been proposed that leverage external knowledge to enhance the quality of the learned topics, but they still assume a Multinomial or Gaussian document likelihood in Euclidean space, which often results in information loss and poor performance. In this thesis, we introduce a set of probabilistic embedded spherical topic models designed to address several challenges, including lack of topic interpretability, high dimensionality, and sparsity. Our approaches integrate knowledge graphs and word embeddings within a non-Euclidean curved space, namely the hypersphere, to enhance topic interpretability and generate discriminative text representations. The proposed models effectively handle a wide range of scenarios, encompassing both unsupervised and supervised learning tasks. Experimental results demonstrate the effectiveness of the proposed algorithms in discovering coherent topics and learning high-quality text representations, which prove valuable for common Natural Language Processing (NLP) tasks across diverse benchmark datasets. These findings further highlight the advantages of modeling textual data on the surface of the unit hypersphere using directional distributions while incorporating word and knowledge graph embeddings.
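The core idea of spherical modeling mentioned in the abstract can be illustrated with a minimal sketch: word embeddings are projected onto the unit hypersphere, and a group of words (a "topic") is summarized by the mean direction and concentration of a von Mises-Fisher distribution, a standard directional distribution on the sphere. This is not the thesis's actual model; the embeddings are random stand-ins and the concentration estimate uses the common Banerjee et al. approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-dimensional embeddings for four words assumed to share a
# topic (the thesis uses pretrained word and knowledge-graph embeddings).
embeddings = rng.normal(size=(4, 5))

# Project each embedding onto the unit hypersphere, as directional
# distributions are defined only for unit vectors.
unit_vectors = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# A von Mises-Fisher distribution is parameterized by a mean direction mu
# (the topic's center on the sphere) and a concentration kappa (how tightly
# the topic's words cluster around mu).
resultant = unit_vectors.sum(axis=0)
r_norm = np.linalg.norm(resultant)
mu = resultant / r_norm                 # mean direction, a unit vector
r_bar = r_norm / len(unit_vectors)      # mean resultant length in (0, 1)
d = unit_vectors.shape[1]
kappa = r_bar * (d - r_bar**2) / (1 - r_bar**2)  # Banerjee et al. estimate

# A word's affinity to the topic is the cosine similarity to mu.
cos_sim = unit_vectors @ mu
```

Intuitively, a large kappa means the topic's word embeddings point in nearly the same direction, which is what makes spherical topics interpretable: top words are those with the highest cosine similarity to the topic's mean direction.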


© Concordia University