Thesis defences

Automated Data Preparation Using Semantics of Data Science Artifacts


Date & time
Tuesday, August 15, 2023
1 p.m. – 3 p.m.
Speaker(s)

Shubham Vashisth

Cost

This event is free

Organization

Department of Computer Science and Software Engineering

Contact

Essam Mansour

Where

ER Building
2155 Guy St.
Room ZOOM

Wheelchair accessible

Yes

Abstract

Data preparation is critical for improving model accuracy. However, data scientists often work independently, spending most of their time writing code to identify and select relevant features and to enrich, clean, and transform their datasets before training predictive models for a machine learning problem. Working in isolation, they lack support for learning from what other data scientists have done on similar datasets. This thesis addresses these challenges by presenting a novel approach that automates data preparation using the semantics of data science artifacts. Specifically, this work proposes KGFarm, a holistic platform for automating data preparation based on machine learning models trained using the semantics of data science artifacts, captured as a knowledge graph (KG). These artifacts comprise datasets and pipeline scripts. KGFarm integrates with existing data science platforms, enabling scientific communities to automatically discover and learn from each other’s work. KGFarm’s models were trained on a KG constructed from the 1,000 top-rated Kaggle datasets and the 13,800 pipeline scripts with the highest number of votes. Our evaluation uses 130 unseen datasets collected from different AutoML benchmarks to compare KGFarm against state-of-the-art systems in data cleaning, data transformation, feature selection, and feature engineering tasks. Our experiments show that KGFarm consumes significantly less time and memory than the state-of-the-art systems while achieving comparable or better accuracy. Hence, KGFarm effectively handles large-scale datasets and empowers data scientists to automate data preparation pipelines interactively.

https://github.com/CoDS-GCS/kgfarm

© Concordia University