Thu 18 Jul 2024 12:12 - 12:30 at Mandacaru - Human Aspects 2 Chair(s): Bianca Trinkenreich

During code reviews, an essential step in software quality assurance, reviewers face the difficult task of understanding and evaluating code changes to validate their quality and prevent the introduction of faults into the codebase. This is a tedious process whose effort depends heavily on the submitted code as well as the author's and the reviewer's experience, leading to median wait times for review feedback of 15-64 hours. This paper aims to improve the velocity and effectiveness of code reviews by predicting three review activity tasks at code submission time: which parts of a patch (1) need to be commented on, (2) need to be revised, or (3) are hotspots (will be commented on or revised). We evaluate two types of text embeddings (i.e., Bag-of-Words and Large Language Model encodings) and review process features (i.e., code size-based and history-based features) to predict these tasks. Our empirical study on three open-source and two industrial datasets shows that combining code embeddings and review process features yields better results than the state-of-the-art approach. F1-scores (median of 40-62%) are significantly better than the state-of-the-art for all tasks (from +1 to +9%). Furthermore, we find that size-based review process features improve performance the most across all datasets, whereas history-based features are less important, though they still improve performance.
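For intuition only, the sketch below illustrates the general idea described in the abstract: combining a Bag-of-Words encoding of a code change with size-based and history-based review process features to predict whether a patch hunk is a hotspot. This is not the authors' implementation; the feature names, toy data, and classifier choice are assumptions made purely for illustration.

```python
# Illustrative sketch (not the paper's implementation): Bag-of-Words text
# features for the diff plus size-/history-based process features feeding a
# simple classifier that predicts "hotspot" hunks. All column names and the
# toy data are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical training data: one row per submitted patch hunk.
data = pd.DataFrame({
    "diff_text": [
        "+ if user is None: return None",
        "+ cache.invalidate(key)",
        "- log.debug(msg)",
        "+ retry(request, attempts=3)",
    ],
    "lines_added": [1, 1, 0, 5],      # size-based process feature
    "lines_deleted": [0, 0, 1, 2],    # size-based process feature
    "prior_revisions": [3, 0, 1, 4],  # history-based process feature
    "is_hotspot": [1, 0, 0, 1],       # label: later commented on or revised
})

features = ColumnTransformer([
    # Bag-of-Words (TF-IDF) encoding of the code change text.
    ("bow", TfidfVectorizer(token_pattern=r"\w+"), "diff_text"),
    # Review process features passed through unchanged.
    ("process", "passthrough",
     ["lines_added", "lines_deleted", "prior_revisions"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns="is_hotspot"), data["is_hotspot"],
    test_size=0.5, random_state=0, stratify=data["is_hotspot"],
)
model.fit(X_train, y_train)
print("F1:", f1_score(y_test, model.predict(X_test)))
```

In practice, the Bag-of-Words block could be swapped for embeddings produced by a large language model, and evaluation would use the held-out F1-score per task, as in the study.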

Thu 18 Jul

Displayed time zone: Brasilia, Distrito Federal, Brazil

11:00 - 12:30
Human Aspects 2 (Research Papers) at Mandacaru
Chair(s): Bianca Trinkenreich Colorado State University
11:00
18m
Talk
Can GPT-4 Replicate Empirical Software Engineering Research?
Research Papers
Jenny T. Liang Carnegie Mellon University, Carmen Badea Microsoft Research, Christian Bird Microsoft Research, Robert DeLine Microsoft Research, Denae Ford Microsoft Research, Nicole Forsgren Microsoft Research, Thomas Zimmermann Microsoft Research
Pre-print
11:18
18m
Talk
Do Code Generation Models Think Like Us? - A Study of Attention Alignment between Large Language Models and Human Programmers
Research Papers
Bonan Kou Purdue University, Shengmai Chen Purdue University, Zhijie Wang University of Alberta, Lei Ma The University of Tokyo & University of Alberta, Tianyi Zhang Purdue University
Pre-print
11:36
18m
Talk
Do Words Have Power? Understanding and Fostering Civility in Code Review Discussion
Research Papers
Md Shamimur Rahman University of Saskatchewan, Canada, Zadia Codabux University of Saskatchewan, Chanchal K. Roy University of Saskatchewan, Canada
11:54
18m
Talk
Effective Teaching through Code Reviews: Patterns and Anti-Patterns
Research Papers
Anita Sarma Oregon State University, Nina Chen Google
DOI
12:12
18m
Talk
An Empirical Study on Code Review Activity Prediction in Practice
Research Papers
Doriane Olewicki Queen's University, Sarra Habchi Ubisoft Montréal, Bram Adams Queen's University
Pre-print