
Masters Thesis Defense: Sunanda Bansal


Date & time
Friday, August 20, 2021
1 p.m. – 3 p.m.
Cost

This event is free

Where

Online

Candidate: Sunanda Bansal

Thesis Title: Vector Representation of Documents Using Word Clusters

Date & Time: August 20th, 2021 @ 1:00 PM

Location: Zoom

Examining Committee:

Dr. Gregory Butler (Chair)
Dr. Sabine Bergler (Supervisor)
Dr. Leila Kosseim (Examiner)
Dr. Gregory Butler (Examiner)

Abstract

Processing textual data with statistical methods such as Machine Learning (ML) often requires representing the data as vectors. With the dawn of the internet, the amount of textual data has exploded and, partly owing to its size, most of this data is unlabeled. Sorting and analyzing text documents therefore often requires representing them in an unsupervised way, i.e. with no prior knowledge of expected output or labels. Most existing unsupervised methodologies do not factor in the similarity between words, and those that do leave room for improvement. This thesis discusses Word Cluster based Document Embedding (WcDe), in which documents are represented in terms of clusters of similar words, and compares its performance in representing documents at two levels of topical similarity: general and specific. The thesis shows that WcDe outperforms existing unsupervised representation methodologies at both levels of topical similarity. Furthermore, it analyzes variations of WcDe with respect to its components and discusses the combination of components that consistently performs well across both topical levels. Finally, it analyzes the document vectors generated by WcDe on two fronts: whether they capture the similarity of documents within a class, and whether they capture the dissimilarity of documents belonging to different classes. The analysis shows that WcDe encodes both aspects of document representation very well at both topical levels.
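
To make the general idea of representing a document through clusters of similar words more concrete, here is a minimal sketch. It is not the thesis's WcDe implementation: the 2-D word vectors are invented for illustration, plain k-means stands in for whatever clustering the thesis uses, and the document vector is simply the share of a document's known words that fall into each cluster. The names `WORD_VECTORS`, `cluster_words`, and `document_vector` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors, invented for illustration only.
# A real system would load pretrained word embeddings instead.
WORD_VECTORS = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2), "pet": (0.85, 0.15),
    "stock": (0.1, 0.9), "market": (0.2, 0.8), "trade": (0.15, 0.85),
}


def cluster_words(vectors, k, iterations=10):
    """Plain k-means over word vectors; returns {word: cluster_id}."""
    words = list(vectors)
    # Deterministic initialisation: k evenly spaced words act as seeds.
    centroids = [vectors[words[i * len(words) // k]] for i in range(k)]
    assignment = {}
    for _ in range(iterations):
        # Assign every word to its nearest centroid (Euclidean distance).
        for w in words:
            assignment[w] = min(
                range(k), key=lambda c: math.dist(vectors[w], centroids[c])
            )
        # Move each centroid to the mean of its member vectors.
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignment


def document_vector(tokens, assignment, k):
    """Fraction of the document's known words that fall in each cluster."""
    counts = Counter(assignment[t] for t in tokens if t in assignment)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in range(k)]


clusters = cluster_words(WORD_VECTORS, k=2)
doc_a = document_vector("the cat and the dog".split(), clusters, k=2)
doc_b = document_vector("stock market trade".split(), clusters, k=2)
print(doc_a, doc_b)
```

In this toy setup, two documents that share no words can still receive similar vectors when their words land in the same clusters, which is the kind of word similarity that purely count-based representations do not capture.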


© Concordia University