Thu 18 Jul 2024 11:00 - 11:18 at Mandacaru - Human Aspects 2 Chair(s): Bianca Trinkenreich

Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners could benefit from replicating research on their own data, doing so poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and of subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering and science-related tasks, these models could help democratize empirical software engineering research.

In this paper, we examine GPT-4's ability to perform replications of empirical software engineering research on new data. We specifically study its ability to surface assumptions made in empirical software engineering research methodologies, as well as its ability to plan and generate code for analysis pipelines, on seven empirical software engineering papers. We perform a user study with 14 participants who have software engineering research expertise and who evaluate the GPT-4-generated assumptions and analysis plans (i.e., lists of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggles to generate ones that reflect common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains the correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research, as well as for practitioner data scientists in software teams.
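
The paper's own prompts and pipeline are not reproduced here. As a minimal sketch of this kind of workflow, assuming the OpenAI Python client, with illustrative prompt wording and hypothetical names (surface_assumptions, plan_analysis, methodology_text), one might write:

    # Minimal sketch (not the authors' actual pipeline): prompt GPT-4 to
    # surface methodological assumptions and to draft an analysis plan as a
    # list of module specifications. Prompts and names are illustrative.
    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    def surface_assumptions(methodology_text: str) -> str:
        """Ask GPT-4 to list assumptions implicit in a methodology."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "You are an empirical software engineering researcher."},
                {"role": "user",
                 "content": "List the assumptions this methodology makes about "
                            "the underlying software engineering data:\n\n"
                            + methodology_text},
            ],
        )
        return response.choices[0].message.content

    def plan_analysis(methodology_text: str) -> str:
        """Ask GPT-4 to decompose a methodology into module specifications."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "user",
                 "content": "Decompose the following methodology into an "
                            "analysis plan: a numbered list of code modules, "
                            "each with a name, inputs, outputs, and a "
                            "one-sentence specification:\n\n" + methodology_text},
            ],
        )
        return response.choices[0].message.content

Each module specification in the returned plan could then be handed back to the model to generate the corresponding analysis code, which is the step where the paper reports small implementation-level errors.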

Thu 18 Jul

Displayed time zone: Brasília, Distrito Federal, Brazil

11:00 - 12:30: Human Aspects 2 (Research Papers) at Mandacaru
Chair(s): Bianca Trinkenreich (Colorado State University)
11:00 (18m) Talk: Can GPT-4 Replicate Empirical Software Engineering Research? (Research Papers)
Jenny T. Liang (Carnegie Mellon University), Carmen Badea (Microsoft Research), Christian Bird (Microsoft Research), Robert DeLine (Microsoft Research), Denae Ford (Microsoft Research), Nicole Forsgren (Microsoft Research), Thomas Zimmermann (Microsoft Research)
Pre-print
11:18 (18m) Talk: Do Code Generation Models Think Like Us? - A Study of Attention Alignment between Large Language Models and Human Programmers (Research Papers)
Bonan Kou (Purdue University), Shengmai Chen (Purdue University), Zhijie Wang (University of Alberta), Lei Ma (The University of Tokyo & University of Alberta), Tianyi Zhang (Purdue University)
Pre-print
11:36 (18m) Talk: Do Words Have Power? Understanding and Fostering Civility in Code Review Discussion (Research Papers)
Md Shamimur Rahman (University of Saskatchewan, Canada), Zadia Codabux (University of Saskatchewan), Chanchal K. Roy (University of Saskatchewan, Canada)
11:54 (18m) Talk: Effective Teaching through Code Reviews: Patterns and Anti-Patterns (Research Papers)
Anita Sarma (Oregon State University), Nina Chen (Google)
DOI
12:12 (18m) Talk: An empirical study on code review activity prediction in practice (Research Papers)
Doriane Olewicki (Queen's University), Sarra Habchi (Ubisoft Montréal), Bram Adams (Queen's University)
Pre-print