Abstract

Sentiment analysis has been used to study aspects of software engineering, such as issue resolution, toxicity, and self-admitted technical debt. The automatic classification of software engineering texts into three different polarity classes (negative, neutral, and positive) makes it possible to understand how developers communicate. To address the peculiarities of software engineering texts, sentiment analysis tools often consider the specific technical lingo practitioners use. With the emergence of more advanced deep-learning models, it has become increasingly important to understand the performance and limitations of sentiment analysis tools when applied to software engineering data. This is especially true because existing replications of software engineering studies that apply sentiment analysis tools show that tool choice can influence the conclusions obtained. Moreover, we believe that it is important to assess the performance of newer deep-learning tools and models and compare their performance to that of existing tools.

Therefore, we validated two existing recommendations from the software engineering literature: the recommendation to use pre-trained transformer models to classify sentiment, and the recommendation to replace non-natural language elements with meta-tokens.
We validated both recommendations through a set of rigorous benchmarks.
We selected five sentiment analysis tools, taking care to pick a diverse set: two pre-trained transformer models and three machine learning tools.
Because recent benchmarks show that ChatGPT is not competitive with fine-tuned tools on sentiment analysis, we did not include it in these benchmarks.
To train and evaluate the selected tools, we used two state-of-the-art, manually labeled datasets sampled from GitHub and StackOverflow.
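
As an illustration of the first recommendation, the following Python sketch shows how a pre-trained transformer could be applied to classify the sentiment of a developer comment. The model checkpoint name is hypothetical and does not refer to any of the tools benchmarked in the article.

\begin{verbatim}
from transformers import pipeline

# Hypothetical checkpoint name; any transformer fine-tuned for
# three-class (negative/neutral/positive) sentiment would do.
classifier = pipeline("sentiment-analysis",
                      model="some-org/se-sentiment-model")

print(classifier("This patch finally fixes the flaky test, great work!"))
# e.g. [{'label': 'positive', 'score': 0.97}]
\end{verbatim}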

Based on the results of the benchmarks, we conclude that these ``common-knowledge'' guidelines do not hold as previously believed.
We find that pre-trained transformers outperform the best machine learning tool on only one of the two datasets,
and that even there the performance difference is just a few percentage points.
Therefore, we recommend that software engineering researchers not consider predictive performance alone when selecting a sentiment analysis tool, because the best-performing sentiment analysis tools perform very similarly to each other (within 4 percentage points).
Additionally, we find that meta-tokenization, the practice of pre-processing datasets to replace non-natural language elements with meta-tokens (sketched below), does not further improve the predictive performance of sentiment analysis tools.
These findings are relevant to researchers who apply sentiment analysis tools to software engineering data, as they can help them select an appropriate tool.
These findings also help builders of sentiment analysis tools who seek to further adapt their tools to the software engineering domain.
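
As an illustration of the second recommendation, the sketch below shows one way meta-tokenization could be implemented in Python. The regular expressions and token names are illustrative assumptions, not the exact ones used by the benchmarked tools.

\begin{verbatim}
import re

# Illustrative patterns and token names; not necessarily those
# used in the benchmarked tools.
URL = re.compile(r"https?://\S+")
INLINE_CODE = re.compile(r"`[^`]+`")

def meta_tokenize(text: str) -> str:
    """Replace non-natural language elements with meta-tokens."""
    text = URL.sub("URLTOKEN", text)
    text = INLINE_CODE.sub("CODETOKEN", text)
    return text

print(meta_tokenize("See https://example.com and call `foo()` first."))
# -> 'See URLTOKEN and call CODETOKEN first.'
\end{verbatim}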

Information

The article was accepted for publication by the Springer Journal of Empirical Software Engineering (EMSE) on the 23rd of February 2024. Currently, the article has not yet been published by EMSE itself; however, the camera-ready version submitted to the Author Services can be accessed online.\footnote{\url{https://cassee.dev/files/meta-tokenization-transformers.pdf}} The article is an original, journal-first article. It has not been presented at, nor is it under consideration for, any other journal-first program or conference.