Automated code translation tools, namely transpilers, perform source-to-source translation (e.g., Java to Python). Current state-of-the-art learning-based transpilers (e.g., TransCoder) have demonstrated impressive improvements in both translation accuracy and readability over rule-based counterparts (e.g., j2py), largely owing to their task-specific pre-training on extensive monolingual corpora. Despite these advancements, however, their performance remains unsatisfactory for practical deployment, and the associated training resources are prohibitively expensive. Large Language Models (LLMs), pre-trained on huge amounts of human-written code/text, have shown remarkable performance in many software engineering fields (e.g., code generation and program repair) thanks to their strong generality, even without task-specific re-training/fine-tuning. Thus, LLMs can potentially circumvent the above limitations, but their capability in code translation has not been thoroughly explored yet.

In this paper, we perform the first extensive study of five LLMs and three state-of-the-art learning-based transpilers on automated code translation tasks between Python, Java, and C++. Our investigation finds that, although certain LLMs outperform current transpilers, they still exhibit accuracy issues. Taking GPT-3.5, one of the state-of-the-art LLMs, as an example, we carry out an in-depth analysis and categorization of its failures. The results show that most failures are induced by (1) a lack of comprehension of the source programs (38.51%), (2) missing clear instructions on Input/Output (I/O) types during translation (14.94%), and (3) ignoring the discrepancies between source and target programs (41.38%).
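As a hypothetical illustration of the third failure category (this example is ours, not drawn from the study's data), consider floor division: a literal Python-to-Java translation that maps Python's \texttt{//} to Java's \texttt{/} silently changes the semantics for negative operands.

\begin{verbatim}
# Python floor division rounds toward negative infinity:
print(-7 // 2)   # -4
# A literal Java translation, -7 / 2, truncates toward zero
# and yields -3, so the translated program diverges on
# negative inputs even though both lines look equivalent.
\end{verbatim}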

Motivated by the above findings, we further propose \textbf{UniTrans}, a \textbf{Uni}fied code \textbf{Trans}lation framework applicable to various LLMs, to unleash their power in this field. Specifically, \textbf{UniTrans} first crafts a series of test cases for the target programs with the assistance of the source programs. Next, since test cases encode the programs' behavioral requirements and carry explicit I/O type information, \textbf{UniTrans} harnesses them to augment code translation and then evaluates the correctness of translated programs via execution. Afterward, to alleviate failures caused by ignored source-target discrepancies, \textbf{UniTrans} further repairs incorrectly translated programs, prompted by test-case execution results, with an option of iterative repair for practitioners. Extensive experiments are conducted on six translation settings between Python, Java, and C++. Three state-of-the-art LLMs of diverse sizes, i.e., GPT-3.5, LLaMA-13B, and LLaMA-7B, are tested with \textbf{UniTrans}, and all achieve substantial improvements in terms of Computational Accuracy (CA) and Exact Match Accuracy (EM Acc) across almost all translation settings, demonstrating the universal effectiveness of \textbf{UniTrans} in practice.
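To make the pipeline concrete, the following is a minimal sketch of the three \textbf{UniTrans} stages in Python. The \texttt{llm} and \texttt{run\_tests} callables, the prompt wording, and all identifiers are illustrative placeholders under our own assumptions, not the paper's actual implementation.

\begin{verbatim}
from typing import Callable, List, Tuple

def unitrans(source: str, src_lang: str, tgt_lang: str,
             llm: Callable[[str], str],
             run_tests: Callable[[str, List[str]], Tuple[bool, str]],
             max_repairs: int = 3) -> str:
    # Stage 1: craft test cases for the target program with the
    # assistance of the source program.
    tests = llm(
        f"Given this {src_lang} program, write test cases for its "
        f"{tgt_lang} counterpart, covering input/output types:\n{source}"
    ).splitlines()

    # Stage 2: translate with the test cases in the prompt, so the
    # model sees concrete I/O types and behavioral requirements.
    translation = llm(
        f"Translate this {src_lang} program into {tgt_lang}. It must "
        f"pass these tests:\n" + "\n".join(tests) + f"\n\n{source}"
    )

    # Stage 3: evaluate correctness via execution and repair failed
    # translations, prompted by the test execution feedback;
    # iterate up to max_repairs times.
    for _ in range(max_repairs):
        passed, feedback = run_tests(translation, tests)
        if passed:
            break
        translation = llm(
            f"This {tgt_lang} translation fails its tests.\n"
            f"Execution feedback:\n{feedback}\nFix it:\n{translation}"
        )
    return translation
\end{verbatim}

In this sketch, a translation is accepted once it passes all generated tests; setting \texttt{max\_repairs} to one corresponds to disabling the iterative-repair option.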