Test Case Generation and Fault Localization for Data Science Programs

Staff - Faculty of Informatics

Date: 13 June 2024 / 09:00 - 12:00

USI East Campus, Room D1.15

You are cordially invited to attend the PhD Dissertation Defence of Mohammad Rezaalipour on Thursday 13 June 2024 at 09:00 in room D1.15.

Abstract:
Data science refers to interdisciplinary approaches designed to extract knowledge from vast amounts of data. It combines techniques from fields such as statistics and machine learning to develop novel applications for different science and engineering domains. Data science approaches are implemented as programs, usually written in languages such as R or Python, collectively referred to as data science programs. Because of their interdisciplinary use, these programs are often written by domain experts who may be unfamiliar with the best practices of software development, and thus they may exhibit low quality. In fact, there is evidence that these programs contain several bugs, often different in nature from those found in traditional programs. As a result, data science programs challenge conventional debugging techniques such as test generation and fault localization. Additionally, being written in dynamically typed languages such as Python adds to the difficulty of testing and analyzing them.

These challenges call for research into new debugging techniques tailored specifically to these programs, which is the focus of this dissertation. Specifically, this thesis aims to understand the capabilities and limitations of standard test generation and fault localization techniques on data science programs implemented in dynamic languages such as Python. To achieve this goal, the dissertation presents contributions in three areas: i) a test generation technique for neural network (NN) programs, a widespread class of data science programs; ii) an empirical study of fault localization in Python programs; and iii) two debugging tools and a curated dataset of NN bugs.

In the first area, we investigated and identified the limitations of general-purpose test generation techniques on NN programs, which led to the development of aNNoTest, a novel test generation technique tailored for NN programs. We evaluated aNNoTest on 19 open-source programs, demonstrating its effectiveness at finding bugs in real-world NN programs. In the second area, we conducted the first large-scale multi-family empirical study of fault localization in Python programs. Targeting 135 bugs from 13 projects, we studied seven fault localization techniques from four families, along with combinations of them. We considered different fault localization granularity levels and measured both effectiveness and efficiency in our analyses. In the third area, we developed: i) the aNNoTest tool, an implementation of the aNNoTest approach mentioned above; ii) FauxPy, to our knowledge the first open-source multi-family fault localization tool for Python; and iii) a curated dataset of NN bugs, for which aNNoTest was used to generate tests. Beyond supporting the field with these tools and techniques, we hope our contributions will help inform the development of more effective debugging techniques for Python data science programs.

Dissertation Committee:
- Prof. Carlo Alberto Furia, Università della Svizzera italiana, Switzerland (Research Advisor)
- Prof. Michele Lanza, Università della Svizzera italiana, Switzerland (Internal Member)
- Prof. Paolo Tonella, Università della Svizzera italiana, Switzerland (Internal Member)
- Prof. Domenico Bianculli, University of Luxembourg, Luxembourg (External Member)
- Prof. Gordon Fraser, University of Passau, Germany (External Member)