Thesis defences

MCS Thesis Examination: Pavel Khloponin

Vector Space Proximity Based Document Retrieval For Document Embeddings Built By Transformers


Date & time
Wednesday, July 27, 2022
10 a.m. – 12 p.m.
Cost

This event is free

Organization

Department of Computer Science and Software Engineering

Contact

Leila Kosseim

Where

Online

Abstract

    Internet publications stay on top of local and international events, generating hundreds, sometimes thousands, of news articles per day, which makes it difficult for readers to navigate this stream of information without assistance. Competition for the reader’s attention has never been greater. One strategy to keep readers’ attention on a specific article and help them better understand its content is news recommendation, which automatically provides readers with references to relevant complementary articles. To be effective, however, news recommendation must select, from a large collection of candidate articles, only a handful that are relevant yet provide diverse information.

    In this thesis, we propose and experiment with three methods for news recommendation and evaluate them in the context of the NIST TREC News Track. Our first approach uses the classic BM25 information retrieval model and assumes that relevant articles will share common keywords with the current article. Our second approach is based on novel document embedding representations and uses various proximity measures to retrieve the closest documents. For this approach, we experimented with a substantial number of models, proximity measures, and hyperparameters, yielding a total of 47,332 distinct models. Finally, our third approach combines the BM25 and the embedding models to increase the diversity of the results.
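
As an illustration of the second approach, the following is a minimal sketch of proximity-based retrieval over document embeddings. It assumes cosine similarity as the proximity measure, which is only one of the measures such a system might use; the function names, the toy vectors, and the `top_k` parameter are hypothetical and are not taken from the thesis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_documents(query_vec, doc_vecs, top_k=5):
    """Rank documents by similarity to the query embedding, most similar first.

    doc_vecs maps a document id to its embedding (hypothetical layout).
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy two-dimensional "embeddings"; real transformer embeddings have hundreds of dimensions.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = closest_documents([1.0, 0.0], docs, top_k=2)
```

In this toy example the query vector points in the same direction as document "a", so "a" ranks first, followed by "c".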

    The results on the 2020 TREC News Track show that the performance of the BM25 model (nDCG@5 of 0.5924) greatly exceeds the TREC median performance (nDCG@5 of 0.5250) and achieves the highest score in the shared task. The performance of the embedding model alone (nDCG@5 of 0.4541) is lower than both the TREC median and BM25. The performance of the combined model (nDCG@5 of 0.5873) is close to that of the BM25 model; however, an analysis of the results shows that the articles it recommends differ from those proposed by BM25, so the combined model may be a promising way to achieve diversity without much loss in relevance.
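
For readers unfamiliar with the nDCG@5 metric reported above, the following is a minimal sketch of how it can be computed for a single query. It assumes the linear-gain form of DCG (gain divided by log2 of rank + 1) and that the ideal ranking is built from all judged gains for the query; the exact gain values and tooling used in the TREC evaluation are not specified here, and the function names are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has no discount)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains, all_judged_gains, k=5):
    """DCG of the system ranking divided by the DCG of the best possible ranking."""
    ideal = sorted(all_judged_gains, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents were judged for this query
    return dcg_at_k(ranked_gains, k) / ideal_dcg


# A ranking that places the less relevant document first scores below 1.0.
score = ndcg_at_k([1, 3], [3, 1], k=2)
```

A score of 1.0 means the top k results are in the best possible order; the track-level figure is typically the mean of this value over all queries.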

 

Examining Committee

  • Dr. Brigitte Jaumard (Chair) 
  • Dr. Leila Kosseim (Supervisor)
  • Dr. Essam Mansour (Examiner)
  • Dr. Brigitte Jaumard (Examiner)
     

© Concordia University