ICSE 2021
ICSE 2021 is the premier venue in software engineering. My research group is proud to present ten papers across the different co-located events.
I thank all my 25 collaborators (listed in no particular order): Jeanderson Cândido, Jan Haesen, Arie van Deursen, Hendrig Sellik, Onno van Paridon, Georgios Gousios, Bart van Oort, Luís Cruz, Casper Schröder, Adriaan van der Feltz, Annibale Panichella, Henk Grent, Aleksei Akimov, Frank Mulder, Felienne Hermans, Eric Maziero, Rafael Durelli, Vinicius Durelli, Jürgen Cito, Aaron Beigelbeck, Julian Harty, Haonan Zhang, Lili Wei, Luca Pascarella, and Weiyi Shang.
Although writing code seems trivial at times, problems arise when humans misinterpret what the code actually does. One of the potential causes is “atoms of confusion”, the smallest possible patterns of misinterpretable source code. Previous research has investigated the impact of atoms of confusion in C code and shows that developers make significantly more mistakes in code where atoms are present.
In this paper, we replicate the work of Gopstein et al. for the Java language. After deriving a set of atoms of confusion for Java, we perform a two-phase experiment with 132 computer science students (i.e., novice developers).
Our results show that participants are 2.7 to 56 times more likely to make mistakes in code snippets affected by 7 out of the 14 studied atoms of confusion. Moreover, when faced with both versions of a code snippet, participants perceived the version affected by the atom of confusion to be more confusing and/or less readable for 10 out of the 14 studied atoms.
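For illustration, consider a post-increment operator used inside a larger expression, a classic atom of confusion from the original C study, written here in Java (whether it is among the 14 atoms evaluated in this paper is not claimed here):

```java
public class AtomOfConfusionExample {
    public static void main(String[] args) {
        // Confusing version: the reader must know that v++ evaluates to the
        // old value of v and only then increments it.
        int v = 1;
        int confusing = v++ + v;   // 1 + 2 = 3, and v ends up as 2

        // Clarified version with the same behaviour, one step at a time.
        int w = 1;
        int old = w;
        w = w + 1;
        int clarified = old + w;   // clearly 1 + 2 = 3

        System.out.println(confusing == clarified);  // true
    }
}
```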
Watch a summary of the paper (in English):
Logging is a development practice that plays an important role in the operations and monitoring of complex systems. Developers place log statements in the source code and use log data to understand how the system behaves in production. Unfortunately, anticipating where to log during development is challenging. Previous studies show the feasibility of leveraging machine learning to recommend log placement despite the data imbalance that arises because log statements are only a small fraction of the overall code base. However, it remains unknown how those techniques apply to an industry setting, and little is known about the effect of imbalanced data and sampling techniques.
In this paper, we study the log placement problem in the code base of Adyen, a large-scale payment company. We analyze 34,526 Java files and 309,527 methods that add up to more than 2M SLOC. We systematically measure the effectiveness of five models based on code metrics, explore the effect of sampling techniques, understand which features the models consider relevant for the prediction, and evaluate whether we can exploit 388,086 methods from 29 Apache projects to learn where to log in an industry setting.
Our best performing model achieves 79% balanced accuracy, 81% precision, and 60% recall. While sampling techniques improve recall, they penalize precision at a prohibitive cost. Experiments with open-source data yield under-performing models on Adyen’s test set; nevertheless, they are useful due to their low rate of false positives. Our supporting scripts and tools are available to the community.
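To make the setup concrete, here is a minimal sketch of how log placement can be framed as method-level binary classification. The metric names and the threshold rule below are hypothetical placeholders, not the paper’s actual features or models:

```java
import java.util.function.Predicate;

public class LogPlacementSketch {

    // Hypothetical method-level feature vector; the study trains its models
    // on code metrics computed over industrial and Apache code bases.
    record MethodFeatures(int sloc, int cyclomaticComplexity,
                          int exceptionHandlers, int methodCalls) {}

    public static void main(String[] args) {
        // Label used during training: does the method already contain a log
        // statement? Prediction then suggests where new logs may be needed.
        MethodFeatures candidate = new MethodFeatures(42, 7, 2, 15);

        // Trivial threshold rule as a stand-in for a real trained classifier.
        Predicate<MethodFeatures> model =
                f -> f.exceptionHandlers() > 0 && f.cyclomaticComplexity() > 5;

        System.out.println("Recommend adding a log statement: "
                + model.test(candidate));
    }
}
```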
Watch a summary of the paper (in English):
Mistakes in binary conditions are a source of error in many software systems. They happen when developers use, e.g., < or > instead of <= or >=. These boundary mistakes are hard to find and impose manual, labor-intensive work on software developers. While previous research has proposed solutions to identify errors in boundary conditions, the problem remains open.
In this paper, we explore the effectiveness of deep learning models in learning and predicting mistakes in boundary conditions. We train different models on approximately 1.6M examples with faults in different boundary conditions.
We achieve a precision of 85% and a recall of 84% on a balanced dataset, but lower numbers on an imbalanced dataset. We also test the model on 41 real-world boundary condition bugs found on GitHub, where it shows only modest performance. Finally, we run the model on a large-scale Java code base from Adyen, our industrial partner: it reported 36 buggy methods, but none of them were confirmed by developers.
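As a concrete illustration (with hypothetical method names, not taken from the paper’s dataset), this is the kind of boundary mistake such models are meant to flag:

```java
public class BoundaryConditionExample {

    // Buggy: index 0 is valid, but '>' excludes it (off-by-one at the boundary).
    static boolean isValidIndexBuggy(int index, int length) {
        return index > 0 && index < length;
    }

    // Fixed boundary condition.
    static boolean isValidIndex(int index, int length) {
        return index >= 0 && index < length;
    }

    public static void main(String[] args) {
        System.out.println(isValidIndexBuggy(0, 10)); // false -- the mistake
        System.out.println(isValidIndex(0, 10));      // true
    }
}
```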
Watch a summary of the paper (in English):
Deciding what constitutes a single module, which classes belong to which module, or what the right set of modules is for a specific software system has always been a challenging task. The problem is even harder in large-scale software systems composed of thousands of classes and hundreds of modules. Over the years, researchers have proposed different techniques to support developers in re-modularizing their software systems. In particular, search-based software re-modularization has been an active research topic in the software engineering community for more than 20 years.
This paper describes our efforts in applying search-based software re-modularization approaches at Adyen, a large-scale payment company. Adyen’s code base has 5.5M+ lines of code, split into around 70 different modules. We leveraged the existing body of knowledge in the field to devise our own search algorithm and applied it to our code base.
Our results show that search-based approaches scale to large code bases such as ours. Our algorithm finds solutions that improve the code base according to the metrics we optimize for, and developers see value in its recommendations. Based on our experiences, we then list a set of challenges and opportunities for future researchers, aiming to make search-based software re-modularization more efficient for large-scale software companies.
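The post does not spell out the exact objective function, but the following minimal sketch shows the kind of fitness a search-based re-modularization algorithm can optimize: it rewards dependencies that stay inside a module (cohesion) and penalizes those that cross modules (coupling). All class and module names below are made up:

```java
import java.util.List;
import java.util.Map;

public class RemodularizationFitnessSketch {

    // A class-level dependency extracted from the code base.
    record Dependency(String fromClass, String toClass) {}

    // A candidate solution maps every class to a module. The fitness rewards
    // intra-module dependencies and penalizes cross-module ones; higher is better.
    static double fitness(Map<String, String> classToModule, List<Dependency> deps) {
        long cohesive = deps.stream()
                .filter(d -> classToModule.get(d.fromClass())
                        .equals(classToModule.get(d.toClass())))
                .count();
        long coupling = deps.size() - cohesive;
        return (double) (cohesive - coupling) / deps.size();
    }

    public static void main(String[] args) {
        Map<String, String> candidate = Map.of(
                "PaymentService", "payments",
                "PaymentValidator", "payments",
                "ReportGenerator", "reporting");
        List<Dependency> deps = List.of(
                new Dependency("PaymentService", "PaymentValidator"), // intra-module
                new Dependency("PaymentService", "ReportGenerator")); // cross-module
        System.out.println(fitness(candidate, deps)); // 0.0: one cohesive vs one coupling edge
    }
}
```

A real search algorithm would repeatedly mutate the class-to-module assignment and keep the candidates with the best fitness, subject to additional constraints such as module size.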
Watch a summary of the paper (in English):
Web APIs may have constraints on their parameters: not every parameter is simply always required or always optional. The presence or value of one parameter can make another parameter required, and parameters can restrict which values are valid. Having a clear overview of these constraints helps API consumers integrate without the need for additional support and with fewer integration faults. We made use of existing documentation and code analysis approaches to identify parameter constraints in complex web APIs.
In this paper, we report on our case study of several APIs at Adyen, a large-scale payment company that offers complex Web APIs to its customers. Our results show that the documentation-based and code-based approaches identify 23% and 53% of the constraints, respectively, and, when combined, 68% of them. We also reflect on the challenges these approaches currently face: for the documentation analysis, the absence of information that explicitly describes the constraints; for the static analysis, the engineering of a sound, data-flow-sensitive static code analyser that tracks parameter references throughout the API’s code and can symbolically execute the several libraries and frameworks used by the API.
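As an illustration of the kind of inter-parameter constraint involved (the parameter names below are hypothetical and not taken from Adyen’s actual API), such rules often only show up implicitly in the API’s validation code rather than in its documentation:

```java
public class ParameterConstraintExample {

    // Validation logic like this encodes constraints that are often missing
    // from the written documentation.
    static void validatePaymentRequest(String paymentMethod, String iban, String cardNumber) {
        // Presence constraint: if paymentMethod is "sepadirectdebit",
        // then iban becomes required.
        if ("sepadirectdebit".equals(paymentMethod) && iban == null) {
            throw new IllegalArgumentException("iban is required for SEPA payments");
        }
        // Value constraint: for card payments, cardNumber must be 12-19 digits.
        if ("card".equals(paymentMethod)
                && (cardNumber == null || !cardNumber.matches("\\d{12,19}"))) {
            throw new IllegalArgumentException("cardNumber must be 12-19 digits");
        }
    }

    public static void main(String[] args) {
        validatePaymentRequest("card", null, "4111111111111111"); // passes
        validatePaymentRequest("sepadirectdebit", null, null);    // throws: iban is missing
    }
}
```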
Watch a summary of the paper (in English):
Grading large classes has become a challenging and expensive task for many universities. The Delft University of Technology (TU Delft), located in the Netherlands, has observed a large increase in student numbers over the past few years. Given this growth of the student population, grading all submissions has become costly.
We made use of self and peer grading in the 2018-2019 edition of our software testing course. Students worked in teams of two and both self and peer graded three assignments. We ended up with 906 self and peer graded submissions, which we compared to 248 submissions graded by our TAs.
In this paper, we report on the differences we observed between self, peer, and TA grading. Our findings show that: (i) self grades tend to be 8-10% higher than peer grades on average, (ii) peer grades seem to be a good approximation of TA grades, and in cases where self and peer grades differ significantly, the TA grade tends to lie in between, and (iii) the gender and nationality of the student do not seem to affect self and peer grading.
Watch a summary of the paper (in English):
Refactoring is the process of changing the internal structure of software to improve its quality without modifying its external behavior. Before carrying out refactoring activities, developers need to identify refactoring opportunities. Currently, refactoring opportunity identification heavily relies on developers’ expertise and intuition.
In this paper, we investigate the effectiveness of machine learning algorithms in predicting software refactorings. More specifically, we train six different machine learning algorithms on a dataset comprising over two million refactorings from 11,149 real-world projects from the Apache, F-Droid, and GitHub ecosystems.
The resulting models predict 20 different refactorings at the class, method, and variable levels with an accuracy often higher than 90%. Our results show that (i) Random Forests are the best models for predicting software refactoring, (ii) process and ownership metrics seem to play a crucial role in the creation of better models, and (iii) the models generalize well to different contexts.
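To give one concrete example of a prediction target, here is an Extract Method refactoring, one of the canonical method-level refactorings; the class and method names below are made up for illustration:

```java
public class InvoicePrinter {

    // Before: computing the total and printing it are tangled in one method.
    void printInvoiceBefore(double amount, double taxRate) {
        double total = amount + amount * taxRate;
        System.out.println("Total: " + String.format("%.2f", total));
    }

    // After: the computation is extracted into its own method -- the kind of
    // opportunity a trained model could point developers to.
    void printInvoiceAfter(double amount, double taxRate) {
        System.out.println("Total: " + String.format("%.2f", totalWithTax(amount, taxRate)));
    }

    double totalWithTax(double amount, double taxRate) {
        return amount + amount * taxRate;
    }

    public static void main(String[] args) {
        new InvoicePrinter().printInvoiceAfter(100.0, 0.21); // Total: 121.00
    }
}
```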
Watch a summary of the paper (in English):
Artificial Intelligence (AI) and Machine Learning (ML) are pervasive in the current computer science landscape. Yet, there still exists a lack of software engineering experience and best practices in this field. One such best practice, static code analysis, can be used to find code smells, i.e., (potential) defects in the source code, refactoring opportunities, and violations of common coding standards.
Our research set out to discover the most prevalent code smells in ML projects. We gathered a dataset of 74 open-source ML projects, installed their dependencies, and ran Pylint on them. This resulted in a top-20 list of detected code smells per category. Manual analysis of these smells mainly showed that code duplication is widespread and that the PEP 8 convention for identifier naming style may not always be applicable to ML code due to its resemblance to mathematical notation. More interestingly, however, we found several major obstructions to the maintainability and reproducibility of ML projects, primarily related to the dependency management of Python projects. We also found that Pylint cannot reliably check for correct usage of imported dependencies, including prominent ML libraries such as PyTorch.
Watch a summary of the paper (in English):
Detecting performance issues caused by suboptimal code during development can be a daunting task, especially when it comes to localizing them once performance degradation is noticed after deployment. Static analysis has the potential to give developers early feedback on performance problems without having to run profilers with expensive (and often unavailable) performance tests.
We develop a VSCode tool that integrates the static performance analysis results from Infer via code annotations and decorations (surfacing complexity analysis results in context) and side panel views showing details and overviews (enabling explainability of the results). Additionally, we design our system for interactivity to allow for more responsiveness to code changes as they happen. We evaluate the efficacy of our tool by measuring the overhead that the static performance analysis integration introduces in the development workflow.
Further, we report on a case study that illustrates how our system can be used to reason about software performance in the context of a real performance bug in the Elasticsearch open-source project.
- Demo video: https://www.youtube.com/watch?v=-GqPb_YZMOs
- Repository: https://github.com/ipa-lab/vscode-infer-performance
Watch a summary of the paper (in English):
Software logs are of great value in both industrial and open-source projects. Mobile analytics logging enables developers to collect logs from end users at the cost of recording and transmitting logs across the Internet to a centralised infrastructure. The goal of this paper is to take a first step in the characterisation of the logging practices of a widely adopted mobile analytics logging library, namely Firebase Analytics. With this study, we aim to understand which common developer needs push practitioners to adopt logging practices on mobile devices.
We present an empirical evaluation of the use of Firebase Analytics in open-source Android applications, which shows that mobile analytics logs are less pervasive and less maintained than traditional logging code. Finally, while the main goal of traditional logging is to gather information for debugging purposes, logging becomes more user-centered when mobile analytics is used.
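For context, this is roughly what a mobile analytics log statement looks like with Firebase Analytics on Android, in contrast to a traditional logger call; the event and parameter names below are hypothetical:

```java
import android.content.Context;
import android.os.Bundle;
import com.google.firebase.analytics.FirebaseAnalytics;

public class CheckoutTracker {

    // A mobile analytics log: the event is recorded on the device and shipped
    // to a centralised backend, rather than written to a local log file.
    void trackCheckoutCompleted(Context context, double basketValue) {
        Bundle params = new Bundle();
        params.putDouble("basket_value", basketValue);
        FirebaseAnalytics.getInstance(context).logEvent("checkout_completed", params);
    }
}
```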