When studying for a doctoral degree (PhD), candidates submit a thesis that provides a critical review of the current state of knowledge of the thesis subject as well as the student’s own contributions to the subject. The distinguishing criterion of doctoral graduate research is a significant and original contribution to knowledge.
Once accepted, the candidate presents the thesis orally. This oral exam is open to the public.
Many large software systems rely on bug tracking systems to record submitted bug reports and to track and manage bugs. Handling bug reports is known to be a challenging task, especially in software organizations with a large client base, which tend to receive a large number of bug reports every day. Fortunately, not all reported bugs are new; many reports are similar or identical to previously submitted ones and are known as duplicate bug reports.
Automatic detection of duplicate bug reports is an important research topic because it helps reduce the time and effort that triaging and development teams spend on sorting and fixing bugs. This explains the recent increase in attention to this topic, as evidenced by the number of tools and algorithms that have been proposed in academia and industry. The objective is to automatically detect duplicate bug reports as soon as they arrive in the system. To do so, existing techniques rely heavily on the nature of the bug report data they operate on. This includes both structured information, such as OS, product version, time and date of the crash, and stack traces, and unstructured information, such as bug report summaries and descriptions written in natural language by end users and developers.
In this thesis, we propose new approaches for automatically detecting duplicate bug reports, with the objective of helping triaging teams and software developers provide fixes. These techniques are based on machine learning and stochastic processes, namely automata, Hidden Markov Models, and deep learning algorithms. While the majority of existing approaches focus on the textual parts of bug reports, we use stack traces. The use of stack traces is desirable in situations where bug report descriptions are deemed unreliable due to the imprecision and ambiguity of natural language. Moreover, stack traces have the apparent advantage of reducing the number of preprocessing tasks required, such as those associated with processing bug report comments and descriptions using natural language processing techniques. We evaluate the approaches presented in this thesis by applying them to bug reports from two large open source systems, namely Firefox and GNOME, and by comparing them to state-of-the-art approaches.
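To make the idea of stack-trace-based duplicate detection concrete, the following is a minimal sketch and not the method developed in the thesis. It assumes that each stack trace is a list of function names and that known bug reports are already grouped by the bug they describe. Instead of a Hidden Markov Model, it uses a plain first-order Markov chain with add-one smoothing as a simpler stand-in: one chain is trained per group, and an incoming trace is assigned to the group whose chain gives it the highest average log-likelihood. The names `FrameChain` and `best_group`, and the frame names in the usage note, are hypothetical.

```python
import math
from collections import defaultdict

START = "<start>"  # artificial state that precedes the first frame


class FrameChain:
    """First-order Markov chain over stack-frame names (add-one smoothing)."""

    def __init__(self, traces):
        # counts[prev][frame] = how often `frame` follows `prev` in training
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()
        for trace in traces:
            prev = START
            for frame in trace:
                self.counts[prev][frame] += 1
                self.vocab.add(frame)
                prev = frame

    def avg_log_likelihood(self, trace):
        """Average per-frame log-probability of `trace` under this chain."""
        v = len(self.vocab) + 1  # one extra slot for frames never seen
        total = 0.0
        prev = START
        for frame in trace:
            row = self.counts.get(prev, {})
            seen = row.get(frame, 0)
            total += math.log((seen + 1) / (sum(row.values()) + v))
            prev = frame
        return total / max(len(trace), 1)


def best_group(groups, trace):
    """Return the id of the group whose chain scores `trace` highest.

    `groups` maps a (hypothetical) bug-group id to its list of stack traces.
    """
    models = {gid: FrameChain(traces) for gid, traces in groups.items()}
    return max(models, key=lambda gid: models[gid].avg_log_likelihood(trace))
```

As a usage illustration with invented data, given `groups = {"A": [["main", "parse", "read"]], "B": [["main", "render", "paint"]]}`, calling `best_group(groups, ["main", "parse", "read"])` selects group `"A"`. A practical system would also need a score threshold so that a trace matching no existing group is treated as a new bug rather than forced into the closest one.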