When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
Abstract
Categorical data are prevalent across various disciplines, making clustering a valuable tool for analysis. However, clustering categorical data is particularly challenging because of their non-numerical nature and multi-modal distributions. While numerous techniques have been developed to address these challenges, several issues persist. Since most of the proposed techniques adapt algorithms like k-means, which were originally designed for numerical data, they often fail to fully exploit and capture the unique characteristics of categorical datasets. A main problem is that those techniques require the number of clusters to be set in advance, which restricts their application and may introduce bias when no expert knowledge is available. Additionally, since they tend to choose cluster seeds randomly, they perform well on binary-class data but struggle when applied to multi-class or imbalanced binary datasets. In this thesis, we propose three techniques specifically designed to address these challenges. First, we introduce a cohesion-based clustering process that determines the potential number of clusters dynamically and allows detection of small clusters without relying on k-means-like or k-modes-like methods. Unlike conventional clustering algorithms that assign weights to all the attributes, we adopt a mechanism that assigns weights to clusters at the attribute-value level, improving cluster cohesion and interpretability. Second, we develop multi-criteria subspace-based clustering techniques that, unlike the traditional subspace solution, search the entire space. Our techniques leverage two existing clustering strategies, namely density-based and theoretical clustering, to identify small, non-redundant clusters. To ensure effective merging, we extend the hierarchical clustering technique so that discovered clusters can be combined while small clusters are prevented from being absorbed into larger ones.
Third, we study methods for measuring quality and diversity in clustering ensemble selection. Existing methods often rely on external validation metrics, which are not well suited to the characteristics of categorical data. Those metrics tend to introduce redundancy and favor large clusters, potentially overlooking smaller yet meaningful ones. Furthermore, current approaches often assess quality and diversity only at the cluster level, neglecting more granular and informative perspectives. To overcome these limitations, we adopt and extend a measure based on granular computing, which aligns more naturally with the discrete and multi-level structure of categorical data. The resulting measure evaluates quality and diversity at the class level as well as the object level, enabling a more nuanced and effective ensemble selection process. To evaluate the proposed techniques, we conduct extensive experiments using real-world and synthetic benchmark datasets. The experimental results and their analyses demonstrate overall enhanced clustering performance and show that the techniques help identify small clusters in certain datasets.
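
For readers unfamiliar with the k-modes-like baselines mentioned in the abstract, the following is a minimal sketch of one assignment-and-update step in that style. It is not the method proposed in the thesis. The toy records, the hand-picked seeds and the function names (`matching_dissimilarity`, `cluster_mode`, `assign`) are all assumptions made for illustration. The sketch shows the two constraints the abstract refers to: the number of clusters is fixed by the number of seeds, and the outcome depends on which records are chosen as seeds.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Return a most frequent value for each attribute (ties resolved arbitrarily)."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def assign(records, modes):
    """Assign each record to the index of its nearest mode."""
    return [
        min(range(len(modes)), key=lambda j: matching_dissimilarity(r, modes[j]))
        for r in records
    ]


# Hypothetical toy data: (colour, size, shape)
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
]

# Seeds are picked by hand here; k-modes-like methods typically pick them at random,
# and the number of seeds fixes the number of clusters in advance.
modes = [records[0], records[3]]

labels = assign(records, modes)

# Update step: recompute each cluster's mode from its current members.
new_modes = [
    cluster_mode([r for r, label in zip(records, labels) if label == j])
    for j in range(len(modes))
]

print(labels)
print(new_modes)
```

A full run would repeat the assignment and update steps until the labels stop changing. A small cluster that receives no seed can only be recovered if its records happen to be closer to each other's mode than to a larger cluster's mode, which is one way to read the motivation stated in the abstract.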