As software systems grow, test suites may become complex, making it challenging to run the tests frequently and locally. Recently, Large Language Models (LLMs) have been adopted in multiple software engineering tasks. They have demonstrated great results in code generation; however, it is not yet clear whether these models understand code execution. In particular, it is unclear whether LLMs can be used to predict test results and, potentially, overcome the issues of running real-world tests. To shed light on this problem, we explore the capability of LLMs to predict test results without execution. We evaluate the performance of the state-of-the-art GPT-4 in predicting the execution of 200 test cases of the Python Standard Library, of which 100 pass and 100 fail. Overall, we find that GPT-4 has a precision of 88.8%, a recall of 71%, and an accuracy of 81% in test result prediction. However, the results vary with test complexity: GPT-4 achieves higher precision and recall on simpler tests (93.2% and 82%) than on complex ones (83.3% and 60%). We also find differences among the analyzed test suites, with precision ranging from 77.8% to 94.7% and recall from 60% to 90%. Our findings suggest that GPT-4 still needs significant progress in predicting test results.
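As a sanity check on how the three overall figures relate, the sketch below applies the standard definitions of precision, recall, and accuracy to a confusion matrix. The counts used here are an assumption: they are not taken from the paper, but back-derived as one set of integers that is consistent with the reported 88.8%, 71%, and 81% on a balanced set of 200 tests. Which outcome (pass or fail) is treated as the positive class is also not stated in the abstract, so the code leaves it unnamed. The `Confusion` class and the `overall` variable are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts (hypothetical helper, not from the paper)."""
    tp: int  # positive tests predicted as positive
    fp: int  # negative tests predicted as positive
    fn: int  # positive tests predicted as negative
    tn: int  # negative tests predicted as negative

    def precision(self) -> float:
        # Share of positive predictions that are correct.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Share of actual positives that are found.
        return self.tp / (self.tp + self.fn)

    def accuracy(self) -> float:
        # Share of all predictions that are correct.
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total


# ASSUMPTION: illustrative counts back-derived from the reported percentages,
# with 100 positive and 100 negative tests. They are not the paper's raw data.
overall = Confusion(tp=71, fp=9, fn=29, tn=91)

if __name__ == "__main__":
    print(f"precision = {overall.precision():.4f}")  # 71 / 80
    print(f"recall    = {overall.recall():.4f}")     # 71 / 100
    print(f"accuracy  = {overall.accuracy():.4f}")   # 162 / 200
```

Under these assumed counts, the precision is 71/80 (about 0.8875), which rounds to the reported 88.8%, while recall and accuracy come out at 0.71 and 0.81. Other integer counts cannot be ruled out from the abstract alone.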
Andre Hora is currently a professor in the Department of Computer Science at UFMG, Brazil. He received his PhD in Computer Science from the University of Lille, France. He was a postdoctoral researcher at the ASERG group. He worked as a software engineer at Inria (Lille, France) and was a research intern at Siemens (Erlangen, Germany).
Wed 17 Jul (displayed time zone: Brasilia, Distrito Federal, Brazil)
16:00 - 18:00 | Testing 2 | Demonstrations / Ideas, Visions and Reflections / Research Papers / Industry Papers | at Pitanga | Chair(s): Wing Lam (George Mason University)
16:00 | 18m | Talk | Metamorphic Testing of Secure Multi-Party Computation (MPC) Compilers | Research Papers | Dongwei Xiao (Hong Kong University of Science and Technology), Zhibo Liu (The Hong Kong University of Science and Technology), Qi Pang (Carnegie Mellon University), Shuai Wang (The Hong Kong University of Science and Technology), Yichen LI (Hong Kong University of Science and Technology)
16:18 | 18m | Talk | Mobile Bug Report Reproduction via Global Search on the App UI Model | Research Papers | Zhaoxu Zhang (University of Southern California), Fazle Mohammed Tawsif (University of Southern California), Komei Ryu (University of Southern California), Tingting Yu (University of Connecticut), William G.J. Halfond (University of Southern California)
16:36 | 18m | Talk | FinHunter: Improved Search-based Test Generation for Structural Testing of FinTech Systems | Industry Papers | Xuanwen Ding (East China Normal University), Qingshun Wang (East China Normal University), Dan Liu (East China Normal University), Lihua Xu (New York University Shanghai), Jun Xiao (Ant Group Co. Ltd.), Bojun Zhang (Ant Group Co. Ltd.), Xue Li (Ant Group Co. Ltd.), Liang Dou (East China Normal University), Liang He (East China Normal University), Tao Xie (Peking University)
16:54 | 9m | Talk | Tests4Py: A Benchmark for System Testing | Demonstrations | Marius Smytzek (CISPA Helmholtz Center for Information Security), Martin Eberlein (Humboldt University of Berlin), Batuhan Serce (CISPA Helmholtz Center for Information Security), Lars Grunske (Humboldt-Universität zu Berlin), Andreas Zeller (CISPA Helmholtz Center for Information Security)
17:03 | 9m | Talk | On Polyglot Program Testing | Ideas, Visions and Reflections | Philémon Houdaille (DIVERSE Team, IRISA-INRIA, CNRS, Université Rennes 1), Djamel Eddine Khelladi (CNRS, IRISA, University of Rennes), Benoit Combemale (University of Rennes, Inria, CNRS, IRISA), Gunter Mussbacher (McGill University)
17:12 | 9m | Talk | Ctest4J: A Practical Configuration Testing Framework for Java | Demonstrations | Shuai Wang (University of Illinois at Urbana-Champaign), Xinyu Lian (University of Illinois at Urbana-Champaign), Qingyu Li (University of Illinois at Urbana-Champaign), Darko Marinov (University of Illinois at Urbana-Champaign), Tianyin Xu (University of Illinois at Urbana-Champaign)
17:21 | 9m | Talk | Predicting Test Results without Execution | Ideas, Visions and Reflections | Andre Hora (UFMG)
17:30 | 9m | Talk | Py-holmes: Causal Testing for Deep Neural Networks in Python | Demonstrations | Wren McQueary (George Mason University), Sadia Afrin Mim (George Mason University), Nishat Raihan (George Mason University), Justin Smith (Lafayette College), Brittany Johnson (George Mason University)
17:39 | 9m | Talk | AndroLog: Android Instrumentation and Code Coverage Analysis | Demonstrations | Jordan Samhi (CISPA Helmholtz Center for Information Security), Andreas Zeller (CISPA Helmholtz Center for Information Security)
17:48 | 9m | Talk | PathSpotter: Exploring Tested Paths to Discover Missing Tests | Demonstrations | Andre Hora (UFMG)