Thu 18 Jul 2024 16:00 - 16:18 at Mandacaru - SE4AI 2 Chair(s): Wei Yang

Pre-trained Large Language Models (LLMs) have achieved remarkable success in several domains. However, code-oriented LLMs are often computationally heavy, with complexity that grows quadratically with the length of the input code sequence. To simplify the input program of an LLM, the state-of-the-art approach filters the input code tokens based on the attention scores given by the LLM. However, the decision to simplify the input program should not rely on the attention patterns of an LLM, as these patterns are influenced by both the model architecture and the pre-training dataset. Since the model and dataset are part of the solution domain, not the problem domain where the input program belongs, the outcome may differ when the model is trained on a different dataset. We propose SlimCode, a model-agnostic code simplification solution for LLMs that depends on the nature of the input code tokens. In an empirical study on LLMs including CodeBERT, CodeT5, and GPT-4 for two main tasks, code search and summarization, we report that 1) the reduction ratio of code has a roughly linear relation with the saving ratio on training time, 2) the impact of categorized tokens on code simplification can vary significantly, 3) the impact of categorized tokens on code simplification is task-specific but model-agnostic, and 4) the above findings hold for the paradigm of prompt engineering and interactive in-context learning, where code simplification can reduce the cost of invoking GPT-4 by 24% per API query. Importantly, SlimCode simplifies the input code with a greedy strategy and is up to 133 times faster than the state-of-the-art technique, while achieving a significant improvement. This paper calls for a new direction of code-based, model-agnostic code simplification solutions to further empower LLMs.
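To make the token-category idea concrete, the sketch below is a minimal, hypothetical illustration (not the authors' SlimCode implementation): it lexes a code snippet with a simple regex, assigns each token a coarse category, and greedily drops whole categories in a fixed priority order until a target reduction ratio is reached. The category set, the removal order, the regex lexer, and the function names are assumptions made for illustration only; the paper determines the actual per-category, per-task impact empirically.

```python
import re

# Hypothetical removal priority: categories listed first are dropped first.
# This ordering is an assumption for illustration, not the paper's ranking.
REMOVAL_ORDER = ["symbol", "keyword", "literal", "identifier"]

JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return",
                 "if", "else", "for", "while", "new", "class"}

# Very rough lexer: string literals, numbers, identifiers, then single symbols.
TOKEN_RE = re.compile(r'"[^"]*"|\d+|[A-Za-z_][A-Za-z_0-9]*|\S')


def categorize(token: str) -> str:
    """Assign a coarse lexical category to a code token."""
    if token.startswith('"') or token.isdigit():
        return "literal"
    if token in JAVA_KEYWORDS:
        return "keyword"
    if re.fullmatch(r"[A-Za-z_][A-Za-z_0-9]*", token):
        return "identifier"
    return "symbol"


def simplify(code: str, reduction_ratio: float) -> str:
    """Greedily drop whole token categories until the target ratio is met."""
    tokens = TOKEN_RE.findall(code)
    target = int(len(tokens) * (1.0 - reduction_ratio))
    kept = list(tokens)
    for category in REMOVAL_ORDER:
        if len(kept) <= target:
            break
        kept = [t for t in kept if categorize(t) != category]
    return " ".join(kept)


if __name__ == "__main__":
    snippet = 'public static int add(int a, int b) { return a + b; }'
    # Drop roughly 40% of the tokens before passing the code to an LLM.
    print(simplify(snippet, reduction_ratio=0.4))
```

Because the procedure only inspects the lexical nature of tokens, it requires no attention scores from any model, which is what makes this style of simplification model-agnostic.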


Thu 18 Jul

Displayed time zone: Brasilia, Distrito Federal, Brazil

16:00 - 18:00
SE4AI 2 Research Papers / Industry Papers / Demonstrations / Journal First at Mandacaru
Chair(s): Wei Yang University of Texas at Dallas
16:00
18m
Talk
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models
Research Papers
Yan Wang Central University of Finance and Economics, Xiaoning Li Central University of Finance and Economics, Tien N. Nguyen University of Texas at Dallas, Shaohua Wang Central University of Finance and Economics, Chao Ni School of Software Technology, Zhejiang University, Ling Ding Central University of Finance and Economics
Pre-print Media Attached File Attached
16:18
18m
Talk
On Reducing Undesirable Behavior in Deep-Reinforcement-Learning-Based Software
Research Papers
Ophir Carmel The Hebrew University of Jerusalem, Guy Katz The Hebrew University of Jerusalem
16:36
9m
Talk
Decide: Knowledge-based Version Incompatibility Detection in Deep Learning Stacks
Demonstrations
Zihan Zhou The University of Hong Kong, Zhongkai Zhao National University of Singapore, Bonan Kou Purdue University, Tianyi Zhang Purdue University
DOI Pre-print Media Attached
16:45
18m
Talk
Test Input Prioritization for Machine Learning Classifiers
Journal First
Xueqi Dang University of Luxembourg, Yinghua Li University of Luxembourg, Mike Papadakis University of Luxembourg, Jacques Klein University of Luxembourg, Tegawendé F. Bissyandé University of Luxembourg, Yves Le Traon University of Luxembourg
17:03
18m
Talk
How Far Are We with Automated Machine Learning? Characterization and Challenges of AutoML Toolkits
Journal First
Md Abdullah Al Alamin University of Calgary, Gias Uddin York University, Canada
17:21
18m
Talk
Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4
Industry Papers
Xuchao Zhang Microsoft, Supriyo Ghosh Microsoft, Chetan Bansal Microsoft Research, Rujia Wang Microsoft, Minghua Ma Microsoft Research, Yu Kang Microsoft Research, Saravan Rajmohan Microsoft
17:39
18m
Talk
Exploring LLM-based Agents for Root Cause Analysis
Industry Papers
Devjeet Roy Washington State University, Xuchao Zhang Microsoft, Rashi Bhave Microsoft Research, Chetan Bansal Microsoft Research, Pedro Las-Casas Microsoft, Rodrigo Fonseca Microsoft Research, Saravan Rajmohan Microsoft