Predicting Test Results without Execution (FSE 2024 - Ideas, Visions and Reflections)

Track

FSE 2024 Ideas, Visions and Reflections

Time Zone

The program is currently displayed in (GMT-03:00) Brasilia, Distrito Federal, Brazil.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 17 Jul 2024, 17:21 - 17:30 at Pitanga - Testing 2. Chair(s): Wing Lam

Abstract

As software systems grow, test suites may become complex, making it challenging to run the tests frequently and locally. Recently, Large Language Models (LLMs) have been adopted in multiple software engineering tasks. They have demonstrated great results in code generation; however, it is not yet clear whether these models understand code execution. In particular, it is unclear whether LLMs can be used to predict test results and, potentially, overcome the issues of running real-world tests. To shed some light on this problem, in this paper, we explore the capability of LLMs to predict test results without execution. We evaluate the performance of the state-of-the-art GPT-4 in predicting the execution results of 200 test cases from the Python Standard Library, of which 100 pass and 100 fail. Overall, we find that GPT-4 has a precision of 88.8%, recall of 71%, and accuracy of 81% in test result prediction. However, the results vary with test complexity: GPT-4 achieved better precision and recall when predicting simpler tests (93.2% and 82%) than complex ones (83.3% and 60%). We also find differences among the analyzed test suites, with precision ranging from 77.8% to 94.7% and recall from 60% to 90%. Our findings suggest that GPT-4 still needs significant progress in predicting test results.
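To make the reported metrics concrete, the following minimal sketch shows how precision, recall, and accuracy follow from a confusion matrix over the 200 test cases. The counts used here (71, 9, 29, 91) are an assumption: they are not stated in the abstract and were chosen only because they are consistent with the reported 88.8%, 71%, and 81% on a 100/100 split. The abstract does not say which outcome is treated as the positive class, and the `Confusion` class is a hypothetical helper, not code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for a binary test-result predictor."""
    tp: int  # positive-class tests predicted as positive
    fp: int  # negative-class tests predicted as positive
    fn: int  # positive-class tests predicted as negative
    tn: int  # negative-class tests predicted as negative

    def precision(self) -> float:
        # Share of positive predictions that were correct.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Share of actual positives that were found.
        return self.tp / (self.tp + self.fn)

    def accuracy(self) -> float:
        # Share of all predictions that were correct.
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total


# ASSUMPTION: counts reconstructed to match the abstract's percentages;
# they are illustrative and not taken from the paper.
overall = Confusion(tp=71, fp=9, fn=29, tn=91)

for name in ("precision", "recall", "accuracy"):
    print(f"{name}: {getattr(overall, name)():.4f}")
```

Under these assumed counts, precision is 71/80 = 0.8875 (reported as 88.8%), recall is 71/100 = 0.71, and accuracy is 162/200 = 0.81.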

Link to Preprint

https://andrehora.github.io/pub/2024-fse-predicting-test-result-gpt4.pdf

Bio

Andre Hora is currently a professor in the Department of Computer Science at UFMG, Brazil. He received his PhD in Computer Science from the University of Lille, France. He was a postdoctoral researcher at the ASERG group. He worked as a software engineer at Inria (Lille, France) and was a research intern at Siemens (Erlangen, Germany).

Session Program

Wed 17 Jul
Displayed time zone: Brasilia, Distrito Federal, Brazil

16:00 - 18:00	Testing 2: Demonstrations / Ideas, Visions and Reflections / Research Papers / Industry Papers at Pitanga. Chair(s): Wing Lam (George Mason University)

16:00 (18m) Talk: Metamorphic Testing of Secure Multi-Party Computation (MPC) Compilers [Research Papers]. Dongwei Xiao (Hong Kong University of Science and Technology); Zhibo Liu (The Hong Kong University of Science and Technology); Qi Pang (Carnegie Mellon University); Shuai Wang (The Hong Kong University of Science and Technology); Yichen LI (Hong Kong University of Science and Technology)
16:18 (18m) Talk: Mobile Bug Report Reproduction via Global Search on the App UI Model [Research Papers]. Zhaoxu Zhang (University of Southern California); Fazle Mohammed Tawsif (University of Southern California); Komei Ryu (University of Southern California); Tingting Yu (University of Connecticut); William G.J. Halfond (University of Southern California)
16:36 (18m) Talk: FinHunter: Improved Search-based Test Generation for Structural Testing of FinTech Systems [Industry Papers]. Xuanwen Ding (East China Normal University); Qingshun Wang (East China Normal University); Dan Liu (East China Normal University); Lihua Xu (New York University Shanghai); Jun Xiao (Ant Group Co. Ltd.); Bojun Zhang (Ant Group Co. Ltd.); Xue Li (Ant Group Co. Ltd.); Liang Dou (East China Normal University); Liang He (East China Normal University); Tao Xie (Peking University)
16:54 (9m) Talk: Tests4Py: A Benchmark for System Testing [Demonstrations]. Marius Smytzek (CISPA Helmholtz Center for Information Security); Martin Eberlein (Humboldt University of Berlin); Batuhan Serce (CISPA Helmholtz Center for Information Security); Lars Grunske (Humboldt-Universität zu Berlin); Andreas Zeller (CISPA Helmholtz Center for Information Security)
17:03 (9m) Talk: On Polyglot Program Testing [Ideas, Visions and Reflections]. Philémon Houdaille (DIVERSE Team, IRISA-INRIA, CNRS, Université Rennes 1); Djamel Eddine Khelladi (CNRS, IRISA, University of Rennes); Benoit Combemale (University of Rennes, Inria, CNRS, IRISA); Gunter Mussbacher (McGill University)
17:12 (9m) Talk: Ctest4J: A Practical Configuration Testing Framework for Java [Demonstrations]. Shuai Wang (University of Illinois at Urbana-Champaign); Xinyu Lian (University of Illinois at Urbana-Champaign); Qingyu Li (University of Illinois at Urbana-Champaign); Darko Marinov (University of Illinois at Urbana-Champaign); Tianyin Xu (University of Illinois at Urbana-Champaign)
17:21 (9m) Talk: Predicting Test Results without Execution [Ideas, Visions and Reflections]. Andre Hora (UFMG)
17:30 (9m) Talk: Py-holmes: Causal Testing for Deep Neural Networks in Python [Demonstrations]. Wren McQueary (George Mason University); sadia afrin mim (George Mason University); Nishat Raihan (George Mason University); Justin Smith (Lafayette College); Brittany Johnson (George Mason University)
17:39 (9m) Talk: AndroLog: Android Instrumentation and Code Coverage Analysis [Demonstrations]. Jordan Samhi (CISPA Helmholtz Center for Information Security); Andreas Zeller (CISPA Helmholtz Center for Information Security)
17:48 (9m) Talk: PathSpotter: Exploring Tested Paths to Discover Missing Tests [Demonstrations]. Andre Hora (UFMG)
