Can GPT-4 Replicate Empirical Software Engineering Research? (FSE 2024 - Research Papers)

Mon 15 - Fri 19 July 2024, Porto de Galinhas, Brazil

Who

Jenny T. Liang, Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, Thomas Zimmermann

Track

FSE 2024 Research Papers

Time Zone

The program is currently displayed in (GMT-03:00) Brasilia, Distrito Federal, Brazil.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 18 Jul 2024 11:00 - 11:18 at Mandacaru - Human Aspects 2 Chair(s): Bianca Trinkenreich

Abstract

Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering and science-related tasks, these models could help democratize empirical software engineering research.

In this paper, we examine GPT-4’s abilities to perform replications of empirical software engineering research on new data. We specifically study its ability to surface assumptions made in empirical software engineering research methodologies, as well as its ability to plan and generate code for analysis pipelines on seven empirical software engineering papers. We perform a user study with 14 participants with software engineering research expertise, who evaluate GPT-4-generated assumptions and analysis plans (i.e., a list of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggles to generate ones that reflect common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains the correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research as well as practitioner data scientists in software teams.

Link to Preprint

https://arxiv.org/pdf/2310.01727

Jenny T. Liang, Carnegie Mellon University, United States
Carmen Badea, Microsoft Research
Christian Bird, Microsoft Research, United States
Robert DeLine, Microsoft Research
Denae Ford, Microsoft Research, United States
Nicole Forsgren, Microsoft Research, United States
Thomas Zimmermann, Microsoft Research

Session Program

Thu 18 Jul
Displayed time zone: Brasilia, Distrito Federal, Brazil

11:00 - 12:30	Human Aspects 2 (Research Papers) at Mandacaru. Chair(s): Bianca Trinkenreich (Colorado State University)

11:00 18m Talk	Can GPT-4 Replicate Empirical Software Engineering Research? Jenny T. Liang (Carnegie Mellon University), Carmen Badea (Microsoft Research), Christian Bird (Microsoft Research), Robert DeLine (Microsoft Research), Denae Ford (Microsoft Research), Nicole Forsgren (Microsoft Research), Thomas Zimmermann (Microsoft Research)
11:18 18m Talk	Do Code Generation Models Think Like Us? - A Study of Attention Alignment between Large Language Models and Human Programmers. Bonan Kou (Purdue University), Shengmai Chen (Purdue University), Zhijie Wang (University of Alberta), Lei Ma (The University of Tokyo & University of Alberta), Tianyi Zhang (Purdue University)
11:36 18m Talk	Do Words Have Power? Understanding and Fostering Civility in Code Review Discussion. Md Shamimur Rahman (University of Saskatchewan, Canada), Zadia Codabux (University of Saskatchewan), Chanchal K. Roy (University of Saskatchewan, Canada)
11:54 18m Talk	Effective Teaching through Code Reviews: Patterns and Anti-Patterns. Anita Sarma (Oregon State University), Nina Chen (Google)
12:12 18m Talk	An empirical study on code review activity prediction in practice. Doriane Olewicki (Queen's University), Sarra Habchi (Ubisoft Montréal), Bram Adams (Queen's University)