COSTELLO: Contrastive Testing for Embedding-based Large Language Model as a Service Embeddings (FSE 2024 - Research Papers)

Mon 15 - Fri 19 July 2024 Porto de Galinhas, Brazil

Who

Weipeng Jiang, Juan Zhai, Shiqing Ma, Xiaoyu Zhang, Chao Shen

Track

FSE 2024 Research Papers

Time Zone

The program is currently displayed in (GMT-03:00) Brasilia, Distrito Federal, Brazil.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 17 Jul 2024 14:54 - 15:12 at Pitanga - Testing 1 Chair(s): Xi Zheng

Abstract

Large language models have gained significant popularity and are often provided as a service (i.e., LLMaaS). Companies like OpenAI and Google provide online APIs of LLMs to allow downstream users to create innovative applications. Despite their popularity, LLM safety and quality assurance is a well-recognized concern in the real world, requiring extra effort for testing these LLMs. Unfortunately, while end-to-end services like ChatGPT have garnered rising attention in terms of testing, LLMaaS embeddings have received comparatively less scrutiny. We state the importance of testing and uncovering problematic individual embeddings without considering downstream applications. The abstraction and non-interpretability of embedded vectors, combined with the black-box inaccessibility of LLMaaS, make testing a challenging puzzle. This paper proposes COSTELLO, a black-box approach to reveal potential defects in abstract embedding vectors from LLMaaS by contrastive testing. Our intuition is that high-quality LLMs can adequately capture the semantic relationships of the input texts and properly represent their relationships in the high-dimensional space. For the given interface of LLMaaS and seed inputs, COSTELLO can automatically generate test suites and output words with potentially problematic embeddings. The idea is to synthesize contrastive samples with guidance, including positive and negative samples, by mutating seed inputs. Our synthesis guide leverages task-specific properties to control the mutation procedure and generate samples with known partial relationships in the high-dimensional space. Thus, we can compare the expected relationship (oracle) and embedding distance (output of LLMs) to locate potential buggy cases. We evaluate COSTELLO on 42 open-source language models and two real-world commercial LLMaaS. Experimental results show that COSTELLO can effectively detect semantic violations, where more than 62% of violations on average result in erroneous behaviors (e.g., unfairness) of downstream applications.
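
The comparison between the expected relationship and the embedding distance can be illustrated with a minimal Python sketch. This is not COSTELLO's implementation: the lookup table `TOY_VECTORS`, the function names `toy_embed` and `find_violations`, the choice of cosine distance, and the hand-written triples are all assumptions made for illustration. A real setup would call the LLMaaS embedding API instead of `toy_embed` and would generate the positive and negative samples by guided mutation of seed inputs.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Triple = Tuple[str, str, str]  # (seed, positive sample, negative sample)

# Hypothetical 2-D embeddings standing in for a black-box embedding service.
# "fine" is deliberately misplaced to simulate a defective embedding.
TOY_VECTORS = {
    "good": (1.0, 0.0),
    "great": (0.9, 0.1),
    "bad": (-1.0, 0.0),
    "fine": (-0.9, 0.1),
    "awful": (-0.2, 0.9),
}


def toy_embed(word: str) -> Vector:
    """Stand-in for an embedding API call (assumption, not a real service)."""
    return TOY_VECTORS[word]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def find_violations(
    embed: Callable[[str], Vector], triples: List[Triple]
) -> List[Triple]:
    """Flag triples where the positive sample is not closer than the negative.

    Oracle (assumed here): dist(seed, positive) < dist(seed, negative).
    """
    violations = []
    for seed, positive, negative in triples:
        seed_vec = embed(seed)
        d_pos = cosine_distance(seed_vec, embed(positive))
        d_neg = cosine_distance(seed_vec, embed(negative))
        if d_pos >= d_neg:
            violations.append((seed, positive, negative))
    return violations


triples = [
    ("good", "great", "bad"),   # oracle holds for the toy vectors
    ("good", "fine", "awful"),  # oracle is violated because "fine" is misplaced
]
violations = find_violations(toy_embed, triples)
print(violations)  # [('good', 'fine', 'awful')]
```

In this sketch the oracle is only a partial ordering of distances, so no ground-truth embedding is needed; the check relies solely on the relationships that the mutation is assumed to preserve or break.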

Weipeng Jiang

Xi'an Jiaotong University

China

Juan Zhai

University of Massachusetts, Amherst

United States

Shiqing Ma

University of Massachusetts, Amherst

United States

Xiaoyu Zhang

Xi'an Jiaotong University

Chao Shen

Xi'an Jiaotong University

China

Session Program

Wed 17 Jul
Displayed time zone: Brasilia, Distrito Federal, Brazil

14:00 - 15:30	Testing 1 (Research Papers / Journal First) at Pitanga. Chair(s): Xi Zheng (Macquarie University)

14:00 18m Talk		Test Input Prioritization for 3D Point Clouds (Journal First). Yinghua Li (University of Luxembourg), Xueqi Dang (University of Luxembourg), Lei Ma (The University of Tokyo & University of Alberta), Jacques Klein (University of Luxembourg), Yves Le Traon (University of Luxembourg, Luxembourg), Tegawendé F. Bissyandé (University of Luxembourg)
14:18 18m Talk		Evaluating and Improving ChatGPT for Unit Test Generation (Research Papers). Zhiqiang Yuan (Fudan University), Mingwei Liu (Fudan University), Shiji Ding (Fudan University), Kaixin Wang (Fudan University), Yixuan Chen (Yale University), Xin Peng (Fudan University), Yiling Lou (Fudan University)
14:36 18m Talk		Bounding Random Test Set Size with Computational Learning Theory (Research Papers). Neil Walkinshaw (University of Sheffield), Michael Foster (The University of Sheffield), José Miguel Rojas (The University of Sheffield), Robert Hierons (The University of Sheffield)
14:54 18m Talk		COSTELLO: Contrastive Testing for Embedding-based Large Language Model as a Service Embeddings (Research Papers). Weipeng Jiang (Xi'an Jiaotong University), Juan Zhai (University of Massachusetts, Amherst), Shiqing Ma (University of Massachusetts, Amherst), Xiaoyu Zhang (Xi'an Jiaotong University), Chao Shen (Xi'an Jiaotong University)
15:12 18m Talk		FeatMaker: Automated Feature Engineering for Search Strategy of Symbolic Execution (Research Papers). Jaehan Yoon (Sungkyunkwan University), Sooyoung Cha (Sungkyunkwan University)