Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection (FSE 2024 - Research Papers)

Mon 15 - Fri 19 July 2024 Porto de Galinhas, Brazil, Brazil

Who

Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, Haoyu Wang

Track

FSE 2024 Research Papers

Time Zone

The program is currently displayed in (GMT-03:00) Brasilia, Distrito Federal, Brazil.

Use conference time zone: (GMT-03:00) Brasilia, Distrito Federal, BrazilSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Wed 17 Jul 2024 15:12 - 15:30 at Pitomba - AI4SE 1 Chair(s): Mauro Pezze

Abstract

With the expanding application of Large Language Models (LLMs) in various domains, it becomes imperative to comprehensively investigate their unforeseen behaviors and consequent outcomes. In this study, we introduce and systematically explore the phenomenon of “glitch tokens”, which are anomalous tokens produced by established tokenizers and could potentially compromise the models’ quality of response. Specifically, we experiment on seven top popular LLMs utilizing three distinct tokenizers and involving a totally of 182,517 tokens. We present categorizations of the identified glitch tokens and symptoms exhibited by LLMs when interacting with glitch tokens. Based on our observation that glitch tokens tend to cluster in the embedding space, we propose GlitchHunter, a novel iterative clustering-based technique, for efficient glitch token detection. The evaluation shows that our approach notably outperforms three baseline methods on eight open-source LLMs. To the best of our knowledge, we present the first comprehensive study on glitch tokens. Our new detection further provides valuable insights into mitigating tokenization-related errors in LLMs.

Yuxi Li

Huazhong University of Science and Technology

China

Yi Liu

Nanyang Technological University

Singapore

Gelei Deng

Nanyang Technological University

Singapore

Ying Zhang

Virginia Tech

United States

Wenjia Song

Virginia Tech

Ling Shi

Nanyang Technological University

Kailong Wang

Huazhong University of Science and Technology

China

Yuekang Li

The University of New South Wales

Australia

Yang Liu

Nanyang Technological University

Singapore

Haoyu Wang

Huazhong University of Science and Technology

China

Time Zone

The program is currently displayed in (GMT-03:00) Brasilia, Distrito Federal, Brazil.

Use conference time zone: (GMT-03:00) Brasilia, Distrito Federal, BrazilSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

Display full programSpecify a time band

Save

Session Program

Wed 17 Jul
Displayed time zone: Brasilia, Distrito Federal, Brazil change

14:00 - 15:30	AI4SE 1Research Papers at Pitomba Chair(s): Mauro Pezze USI Università della Svizzera Italiana & SIT Schaffhausen Institute of Technology

14:00 18m Talk		Are Human Rules Necessary? Generating Reusable APIs with CoT Reasoning and In-context Learning Research Papers Yubo Mai Zhejiang University, Zhipeng Gao Shanghai Institute for Advanced Study - Zhejiang University, Xing Hu Zhejiang University, Lingfeng Bao Zhejiang University, Yu Liu Zhejiang University, JianLing Sun Zhejiang University
14:18 18m Talk		CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking Research Papers Zian Su Purdue University, Xiangzhe Xu Purdue University, Ziyang Huang Purdue University, Zhuo Zhang Purdue University, Yapeng Ye Purdue University, Jianjun Huang Renmin University of China, Xiangyu Zhang Purdue University
14:36 18m Talk		Enhancing Code Understanding for Impact Analysis by Combining Transformers and Program Dependence Graphs Research Papers Yanfu Yan William & Mary, Nathan Cooper William & Mary, Kevin Moran University of Central Florida, Gabriele Bavota Software Institute @ Università della Svizzera Italiana, Denys Poshyvanyk William & Mary, Steve Rich Cisco Systems
14:54 18m Talk		Exploring and Unleashing the Power of Large Language Models in Automated Code Translation Research Papers Zhen Yang Shandong University, Fang Liu Beihang University, Zhongxing Yu Shandong University, Jacky Keung City University of Hong Kong, Jia Li Peking University, Shuo Liu City University of Hong Kong, Hong Yifan City University of Hong Kong, Xiaoxue Ma City University of Hong Kong, Zhi Jin Peking University, Ge Li Peking University Pre-print
15:12 18m Talk		Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection Research Papers Yuxi Li Huazhong University of Science and Technology, Yi Liu Nanyang Technological University, Gelei Deng Nanyang Technological University, Ying Zhang Virginia Tech, Wenjia Song Virginia Tech, Ling Shi Nanyang Technological University, Kailong Wang Huazhong University of Science and Technology, Yuekang Li The University of New South Wales, Yang Liu Nanyang Technological University, Haoyu Wang Huazhong University of Science and Technology