Oral defences & examinations, Thesis defences

Master's Thesis Defense: Ahmed Helal


Date & time: Friday, August 20, 2021, 10 a.m. – 12 p.m.

Cost: This event is free

Where: Online

Candidate: Ahmed Helal

Thesis Title: Towards Empowering Data Lakes with Knowledge Graphs

Date & Time: August 20th, 2021 @ 10:00 AM – 12:00 PM

Location: Zoom

Examining Committee:

Dr. Rene Witte (Chair)
Dr. Essam Mansour (Supervisor)
Dr. Rene Witte (Examiner)
Dr. Tristan Glatard (Examiner)

Abstract:

The emergence of data lakes has made it possible to store large volumes of data that arrive in different formats and at high speed. Data lakes are simultaneously a boon and a bane: while they are great data stores, exploring their content is tedious. Data lakes are schema-agnostic; in other words, they come with limited or no metadata, which makes data discovery time-consuming and cumbersome. In addition, some existing data lakes, such as open data portals, offer users few functionalities for finding datasets, and these functionalities merely consist of basic search coupled with some filters. These limitations are costly because users spend considerable time looking for data rather than working on their main tasks. To mitigate this shortcoming, this thesis presents an approach that creates metadata on top of the content of data lakes to facilitate data discovery and data enrichment. The approach consists of two steps: first, constructing an RDF knowledge graph (KG) as a navigational structure that models the schema; second, providing the user with a set of APIs to discover and enrich data. To demonstrate this approach, this work presents a proof-of-concept (POC) system that captures the schema of tabular-like data and represents it as a KG (GLac) by means of LAC, an ontology for data lakes. It then equips practitioners with user-friendly interface services to interact with GLac and compile a dataset for a given task. With these main contributions, the system offers promising results in terms of the quality of the generated schema.
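
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `LAC` and `BASE` namespaces, the class and property names (`Table`, `Column`, `isPartOf`, `dataType`), and the helper functions are hypothetical stand-ins, since the announcement does not give the actual LAC vocabulary or the platform's API. Triples are kept as plain tuples rather than in an RDF library so that the example stays self-contained.

```python
import csv
import io

# Hypothetical namespaces: the announcement does not give LAC's real vocabulary.
LAC = "http://example.org/lac#"
BASE = "http://example.org/lake/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def infer_type(values):
    """Guess a coarse column type (integer, float, or string) from its values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def table_to_triples(table_name, csv_text):
    """Step 1: describe a table's schema as (subject, predicate, object) triples.

    Simplification: literals and IRIs are both plain strings, and column
    names are assumed to be safe to embed in an IRI without escaping.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    table_iri = BASE + table_name
    triples = [
        (table_iri, RDF_TYPE, LAC + "Table"),
        (table_iri, RDFS_LABEL, table_name),
    ]
    for i, col in enumerate(header):
        col_iri = f"{table_iri}/{col}"
        values = [row[i] for row in data if i < len(row)]
        triples += [
            (col_iri, RDF_TYPE, LAC + "Column"),
            (col_iri, RDFS_LABEL, col),
            (col_iri, LAC + "isPartOf", table_iri),
            (col_iri, LAC + "dataType", infer_type(values)),
        ]
    return triples


def find_tables_with_column(triples, column_label):
    """Step 2: a tiny discovery call that lists tables owning a column label."""
    labelled = {s for s, p, o in triples if p == RDFS_LABEL and o == column_label}
    return sorted(o for s, p, o in triples if p == LAC + "isPartOf" and s in labelled)


graph = []
graph += table_to_triples("patients", "id,age,city\n1,34,Montreal\n2,51,Laval\n")
graph += table_to_triples("visits", "visit_id,id,cost\n10,1,99.5\n11,2,120\n")
print(find_tables_with_column(graph, "id"))
```

Here the discovery call returns both toy tables because each has a column labelled `id`, which hints at how a schema-level graph could suggest join candidates for data enrichment. A real system would presumably store the graph in an RDF store and query it with SPARQL instead of scanning a Python list.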

The main findings of this thesis have been published in two venues: an extended abstract titled ‘Data Lakes Empowered by Knowledge Graphs’ [24] and a demonstration paper titled ‘A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science’ [25]. The former, accepted to the poster session of SIGMOD/PODS’21, describes how KGs can be used to leverage the content of data lakes. The latter, accepted to the demo session of VLDB’21, provides an overview of KGLac and illustrates the functionalities the platform supports on top of data lakes after processing their content.

© Concordia University