Wed 17 Jul 2024 17:21 - 17:30 at Pitanga - Testing 2 Chair(s): Wing Lam

As software systems grow, test suites may become complex, making it challenging to run the tests frequently and locally. Recently, Large Language Models (LLMs) have been adopted in multiple software engineering tasks. They have demonstrated great results in code generation; however, it is not yet clear whether these models understand code execution. In particular, it is unclear whether LLMs can be used to predict test results and, potentially, overcome the issues of running real-world tests. To shed some light on this problem, in this paper we explore the capability of LLMs to predict test results without execution. We evaluate the performance of the state-of-the-art GPT-4 in predicting the execution of 200 test cases of the Python Standard Library. Among these 200 test cases, 100 are passing and 100 are failing. Overall, we find that GPT-4 has a precision of 88.8%, recall of 71%, and accuracy of 81% in test result prediction. However, the results vary depending on the test complexity: GPT-4 presented better precision and recall when predicting simpler tests (93.2% and 82%) than complex ones (83.3% and 60%). We also find differences among the analyzed test suites, with precision ranging from 77.8% to 94.7% and recall from 60% to 90%. Our findings suggest that GPT-4 still needs significant progress in predicting test results.
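To make the overall figures concrete, the sketch below recomputes precision, recall, and accuracy from a confusion matrix. The counts are not given in the abstract: they are inferred from the rounded percentages and the 100/100 split, under the assumption that one of the two outcomes is treated as the positive class. The class name `Confusion` is hypothetical and only serves the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for a binary prediction task."""
    tp: int  # positive tests predicted as positive
    fp: int  # negative tests predicted as positive
    fn: int  # positive tests predicted as negative
    tn: int  # negative tests predicted as negative

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total


# ASSUMPTION: counts reconstructed from the reported percentages,
# with 100 positive and 100 negative test cases (200 in total).
overall = Confusion(tp=71, fp=9, fn=29, tn=91)

print(f"precision = {overall.precision:.4f}")
print(f"recall    = {overall.recall:.4f}")
print(f"accuracy  = {overall.accuracy:.4f}")
```

Under these assumed counts, the three values agree with the reported 88.8%, 71%, and 81% after rounding.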

Andre Hora is currently a professor in the Department of Computer Science at UFMG, Brazil. He received his PhD in Computer Science from the University of Lille, France. He was a postdoctoral researcher in the ASERG group. He worked as a software engineer at Inria (Lille, France) and was a research intern at Siemens (Erlangen, Germany).

Wed 17 Jul

Displayed time zone: Brasilia, Distrito Federal, Brazil

16:00 - 18:00
16:00
18m
Talk
Metamorphic Testing of Secure Multi-Party Computation (MPC) Compilers
Research Papers
Dongwei Xiao Hong Kong University of Science and Technology, Zhibo Liu The Hong Kong University of Science and Technology, Qi Pang Carnegie Mellon University, Shuai Wang The Hong Kong University of Science and Technology, Yichen LI Hong Kong University of Science and Technology
16:18
18m
Talk
Mobile Bug Report Reproduction via Global Search on the App UI Model
Research Papers
Zhaoxu Zhang University of Southern California, Fazle Mohammed Tawsif University of Southern California, Komei Ryu University of Southern California, Tingting Yu University of Connecticut, William G.J. Halfond University of Southern California
16:36
18m
Talk
FinHunter: Improved Search-based Test Generation for Structural Testing of FinTech Systems
Industry Papers
Xuanwen Ding East China Normal University, Qingshun Wang East China Normal University, Dan Liu East China Normal University, Lihua Xu New York University Shanghai, Jun Xiao Ant Group Co. Ltd., Bojun Zhang Ant Group Co. Ltd., Xue Li Ant Group Co. Ltd., Liang Dou East China Normal University, Liang He East China Normal University, Tao Xie Peking University
16:54
9m
Talk
Tests4Py: A Benchmark for System Testing
Demonstrations
Marius Smytzek CISPA Helmholtz Center for Information Security, Martin Eberlein Humboldt University of Berlin, Batuhan Serce CISPA Helmholtz Center for Information Security, Lars Grunske Humboldt-Universität zu Berlin, Andreas Zeller CISPA Helmholtz Center for Information Security
17:03
9m
Talk
On Polyglot Program Testing
Ideas, Visions and Reflections
Philémon Houdaille DIVERSE Team, IRISA-INRIA, CNRS, Université Rennes 1, Djamel Eddine Khelladi CNRS, IRISA, University of Rennes, Benoit Combemale University of Rennes, Inria, CNRS, IRISA, Gunter Mussbacher McGill University
17:12
9m
Talk
Ctest4J: A Practical Configuration Testing Framework for Java
Demonstrations
Shuai Wang University of Illinois at Urbana-Champaign, Xinyu Lian University of Illinois at Urbana-Champaign, Qingyu Li University of Illinois at Urbana-Champaign, Darko Marinov University of Illinois at Urbana-Champaign, Tianyin Xu University of Illinois at Urbana-Champaign
17:21
9m
Talk
Predicting Test Results without Execution
Ideas, Visions and Reflections
17:30
9m
Talk
Py-holmes: Causal Testing for Deep Neural Networks in Python
Demonstrations
Wren McQueary George Mason University, Sadia Afrin Mim George Mason University, Nishat Raihan George Mason University, Justin Smith Lafayette College, Brittany Johnson George Mason University
17:39
9m
Talk
AndroLog: Android Instrumentation and Code Coverage Analysis
Demonstrations
Jordan Samhi CISPA Helmholtz Center for Information Security, Andreas Zeller CISPA Helmholtz Center for Information Security
17:48
9m
Talk
PathSpotter: Exploring Tested Paths to Discover Missing Tests
Demonstrations