When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
In the Big Data era, data is one of the most important core elements for any governmental, institutional, or private organization. Acquiring, processing, and analyzing large amounts of heterogeneous data to derive valuable information has become crucial to both large and small organizations. However, efforts to extract valuable insights are useless if the quality of the data is compromised. Assessing data quality is key to eliminating poor-quality data and therefore to supporting Big Data processes, including pre-processing, processing, and analytics. Moreover, the abundance of raw data issued from various sources, together with the growing sizes, speeds, and formats in which data is generated and processed, affects the overall quality of data. Consequently, Big Data Quality has become an important factor in ensuring quality throughout the Big Data lifecycle.
In this research, we focus on the quality of Big Data before, during, and after the pre-processing phase, which includes sub-processes such as data cleansing, integration, enrichment, filtering, and normalization. Hence, we propose a Big Data Quality Management Framework featuring components, processes, and algorithms to support and ensure data quality before any processing. The framework’s main features include a continuously improved data quality profile realized through Big Data sampling, profiling, quality mapping, quality assessment, quality rules discovery, and quality control and monitoring. The outcomes of quality evaluation represent the essence of a Data Quality Profile (DQP). A DQP is a fundamental component of the proposed framework, incorporating critical data quality information including the data profile, the quality dimension scores, and the quality rules. The quality rules are generated by the rule discovery component, which evaluates the Data Quality Dimension scores against the quality requirements and acts accordingly. Moreover, the DQP is extended and updated with relevant quality-related information throughout all the framework’s processes.
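As an illustration of the assessment loop described above, the following is a minimal sketch rather than the framework's actual implementation. It assumes a single quality dimension (completeness), a simple random sample, and a hypothetical `DataQualityProfile` class; the attribute names, thresholds, and rule wording are all invented for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class DataQualityProfile:
    """Hypothetical DQP: dimension scores plus the rules derived from them."""
    scores: dict = field(default_factory=dict)
    rules: list = field(default_factory=list)


def completeness(rows, attribute):
    """Fraction of rows in which the attribute is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(attribute) is not None)
    return filled / len(rows)


def assess(rows, requirements, sample_size=100, seed=0):
    """Score a random sample and propose a rule for each unmet requirement."""
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    profile = DataQualityProfile()
    for attribute, required in requirements.items():
        score = completeness(sample, attribute)
        profile.scores[attribute] = score
        if score < required:
            # Assumed rule format; a real rule would be machine-actionable.
            profile.rules.append(f"fill or drop missing values in '{attribute}'")
    return profile


records = [
    {"id": 1, "age": 34, "city": "Montreal"},
    {"id": 2, "age": None, "city": "Laval"},
    {"id": 3, "age": 51, "city": None},
    {"id": 4, "age": None, "city": "Quebec"},
]
# Requirements: minimum completeness per attribute (illustrative values).
profile = assess(records, {"age": 0.9, "city": 0.7})
print(profile.scores, profile.rules)
```

Here `age` falls below its required completeness and triggers a rule, while `city` meets its requirement and does not. Other dimensions, such as accuracy or consistency, would need their own scoring functions.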
Another important process of the framework is the pre-processing of data samples, producing pre-processed data that is re-injected into the framework to certify and confirm that the DQP is delivering the pre-defined quality requirements. We also propose a Big Data quality profile repository that stores and manages data quality profiles, data quality rules, and pre-processing activities such as data cleansing and data transformation. The repository is indexed by data quality dimensions, data type, data domain, and data attributes.
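The indexing scheme of such a repository could look roughly like the sketch below. This is an assumption-laden illustration, not the repository proposed in the thesis: the `ProfileRepository` class, its method names, and the in-memory inverted index are all hypothetical, and stored profiles are plain dictionaries.

```python
from collections import defaultdict


class ProfileRepository:
    """Hypothetical in-memory store for data quality profiles."""

    def __init__(self):
        self._profiles = {}
        # Inverted index: (kind, value) -> set of profile ids.
        self._index = defaultdict(set)

    def add(self, profile_id, profile, *, dimensions, data_type, domain, attributes):
        """Store a profile and index it by the four kinds of key."""
        self._profiles[profile_id] = profile
        keys = [("dimension", d) for d in dimensions]
        keys += [("type", data_type), ("domain", domain)]
        keys += [("attribute", a) for a in attributes]
        for key in keys:
            self._index[key].add(profile_id)

    def find(self, **criteria):
        """Return sorted ids matching every criterion, e.g. find(domain="health")."""
        matches = None
        for kind, value in criteria.items():
            ids = self._index.get((kind, value), set())
            matches = set(ids) if matches is None else matches & ids
        return sorted(matches or [])


repo = ProfileRepository()
repo.add("dqp-1", {"rules": ["drop empty rows"]},
         dimensions=["completeness"], data_type="tabular",
         domain="health", attributes=["age", "city"])
repo.add("dqp-2", {"rules": ["normalize encoding"]},
         dimensions=["completeness", "accuracy"], data_type="text",
         domain="health", attributes=["review"])

print(repo.find(domain="health"))
print(repo.find(domain="health", dimension="accuracy"))
```

Looking up by several keys at once intersects the matching sets, which is one simple way to find an existing profile that could be reused for a new data source with similar characteristics.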
Analyzing the quality evaluation results generally leads to improved data quality scores, and deeper quality assessments can reveal additional insights about the data and its quality. Determining which aspects have more (or less) impact on data quality is crucial. These aspects vary in terms of data attributes, observations, and types, and can be represented in rows, in columns, or in an unstructured format. To extract these insights, we propose an exploratory quality profiling component supported by algorithms that profile the data quality and discover highly relevant data attributes using techniques such as Principal Component Analysis (PCA).
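To make the PCA step concrete, the sketch below ranks numeric attributes by the absolute value of their loading on the first principal component. It is only one possible reading of "discovering highly relevant attributes" and is not taken from the thesis: the use of first-component loadings as a relevance proxy, the power-iteration solver, and the sample table are all assumptions. It also assumes that no column is constant and that the dominant component is not orthogonal to the all-ones starting vector.

```python
import math


def standardize(columns):
    """Centre each column and scale it to unit sample variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        out.append([(x - mean) / std for x in col])
    return out


def covariance(columns):
    """Sample covariance matrix of already-centred columns."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
            for ci in columns]


def first_component(matrix, iterations=200):
    """Power iteration: approximates the dominant eigenvector."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def rank_attributes(table):
    """Sort attribute names by absolute loading on the first component."""
    names = list(table)
    cols = standardize([table[n] for n in names])
    loadings = first_component(covariance(cols))
    return sorted(zip(names, (abs(l) for l in loadings)), key=lambda p: -p[1])


# Illustrative data: two strongly correlated attributes and one weakly related.
table = {
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 59, 71, 80, 88],
    "noise": [3, 1, 4, 1, 5],
}
ranked = rank_attributes(table)
for name, loading in ranked:
    print(f"{name}: {loading:.3f}")
```

On this data the two correlated attributes carry the largest loadings and `noise` ranks last. A production pipeline would more likely rely on an established linear algebra library and inspect several components, not just the first.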
To cope with Unstructured Big Data (UBD) quality evaluation, we extend our framework with additional features that address UBD quality assessment throughout the Big Data lifecycle. We propose a UBD quality model that emphasizes unstructured data discovery, profiling, extraction, exploitation, classification, and feature extraction and selection activities. These activities were applied to textual, multimedia, and social media data, whose schema-less nature raises additional challenges for evaluating the quality of UBD. These include, for instance, challenges related to UBD parsing, metadata composition, schema construction, feature extraction, pre-processing, and analytics.
The novelty of our solution resides in its ability to estimate the quality of Big Data and to create ongoing data quality profiles, together with repositories that record Big Data quality profiling results, thereby making it possible to reuse them for other Big Data sources. Moreover, addressing Big Data quality in the early phases of the Big Data lifecycle will significantly reduce costs and help ensure accurate data analysis.