Thu 18 Jul 2024 11:36 - 11:54 at Pitomba - Software Maintenance and Comprehension 2 Chair(s): Denys Poshyvanyk

A package’s source code repository records the package’s development history and provides indispensable information for using the package and monitoring its risks. However, a package release often lacks a link to its source code repository because the package’s development platform is separate from its distribution platform. To establish the link, existing tools retrieve the release’s repository information from the release’s metadata. This approach has two limitations: the metadata may omit the repository information, or it may contain wrong information. Our analysis shows that existing tools can retrieve repository information for at most 70.5% of PyPI releases (63.1% of packages). To address these limitations, this paper proposes PyRadar, a novel framework that uses both the release’s metadata and its source code to automatically retrieve and validate the repository information for PyPI package releases. We start with an empirical study that compares four existing metadata-based tools on 4,227,425 PyPI releases and analyzes phantom files (files appearing in the release distribution but not in the release’s repository) in 14,375 correct package-repository links and 2,064 incorrect links. Based on the findings, we design PyRadar with three components, i.e., Metadata-based Retriever, Source Code Repository Validator, and Source Code-based Retriever, that progressively retrieve correct source code repository information for PyPI releases. In particular, the Metadata-based Retriever combines the best practices of existing tools and successfully retrieves repository information from the metadata for 72.1% of PyPI releases. The Source Code Repository Validator applies machine learning models to six crafted features and achieves an AUC of up to 0.995. The Source Code-based Retriever queries World of Code with the SHA-1 hashes of the Python files in the release’s source distribution and retrieves repository information for 90.2% of packages in our dataset with an accuracy of 0.970. Both practitioners and researchers can employ PyRadar to better use PyPI packages.
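
The hashing step behind the Source Code-based Retriever and the notion of phantom files can be illustrated with a minimal sketch. It is not PyRadar's implementation. It assumes that the SHA-1 in question is the Git blob hash (SHA-1 over a "blob <size>\0" header plus the file content) and that a phantom file can be approximated as a release file whose hash is absent from the repository's set of blob hashes. The function names are hypothetical, and the World of Code lookup itself is left out.

```python
import hashlib
import io
import tarfile
from typing import Dict, List, Set


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    # Assumption: files are matched by Git blob hash, not by a plain SHA-1.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def python_file_hashes(sdist: tarfile.TarFile) -> Dict[str, str]:
    """Map each .py member of an opened source distribution to its blob hash."""
    hashes: Dict[str, str] = {}
    for member in sdist.getmembers():
        if member.isfile() and member.name.endswith(".py"):
            extracted = sdist.extractfile(member)
            if extracted is not None:
                hashes[member.name] = git_blob_sha1(extracted.read())
    return hashes


def phantom_files(release_hashes: Dict[str, str],
                  repo_blob_hashes: Set[str]) -> List[str]:
    """List release files whose content hash is not among the repository blobs.

    Assumption: this hash-based test stands in for "appears in the release
    distribution but not in the release's repository".
    """
    return sorted(path for path, digest in release_hashes.items()
                  if digest not in repo_blob_hashes)


if __name__ == "__main__":
    # Build a tiny in-memory source distribution for demonstration only.
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in [("pkg-1.0/pkg/__init__.py", b""),
                           ("pkg-1.0/pkg/_version.py", b"v = '1.0'\n")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
        release = python_file_hashes(tar)
    # Pretend the repository only contains the empty __init__.py blob.
    print(phantom_files(release, {git_blob_sha1(b"")}))
```

In this sketch, the hashes returned by python_file_hashes are what a retriever would send to a blob-to-repository index, while phantom_files shows the kind of signal a validator could derive from a candidate repository.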

Thu 18 Jul

Displayed time zone: Brasilia, Distrito Federal, Brazil

11:00 - 12:30
Software Maintenance and Comprehension 2 - Research Papers at Pitomba
Chair(s): Denys Poshyvanyk William & Mary
11:00
18m
Talk
Bloat beneath Python's Scales: A Fine-Grained Inter-Project Dependency Analysis
Research Papers
Georgios-Petros Drosos ETH Zurich, Thodoris Sotiropoulos ETH Zurich, Diomidis Spinellis Athens University of Economics and Business & Delft University of Technology, Dimitris Mitropoulos University of Athens
11:18
18m
Research paper
Characterizing Python Library Migrations
Research Papers
Mohayeminul Islam University of Alberta, Ajay Jha North Dakota State University, Ildar Akhmetov Northeastern University, Sarah Nadi New York University Abu Dhabi, University of Alberta
11:36
18m
Talk
PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages
Research Papers
Kai Gao University of Science and Technology Beijing, Weiwei Xu Peking University, Wenhao Yang Peking University, Minghui Zhou Peking University
11:54
18m
Talk
Refactoring to Pythonic Idioms: A Hybrid Knowledge-Driven Approach Leveraging Large Language Models
Research Papers
Zejun Zhang Australian National University, Zhenchang Xing CSIRO's Data61, Xiaoxue Ren Zhejiang University, Qinghua Lu Data61, CSIRO, Xiwei (Sherry) Xu Data61, CSIRO
12:12
18m
Talk
Dependency-Induced Waste in Continuous Integration: An Empirical Study of Unused Dependencies in the NPM Ecosystem
Research Papers
Nimmi Weeraddana University of Waterloo, Mahmoud Alfadel University of Waterloo, Shane McIntosh University of Waterloo