PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages (FSE 2024 - Research Papers)

Mon 15 - Fri 19 July 2024, Porto de Galinhas, Brazil

Who

Kai Gao, Weiwei Xu, Wenhao Yang, Minghui Zhou

Track

FSE 2024 Research Papers

When

Thu 18 Jul 2024 11:36 - 11:54 at Pitomba - Software Maintenance and Comprehension 2 Chair(s): Denys Poshyvanyk

Abstract

A package’s source code repository records the development history of the package, providing indispensable information for the use and risk monitoring of the package. However, a package release often lacks a link to its source code repository because the package’s development platform is separate from its distribution platform. To establish the link, existing tools retrieve the release’s repository information from the release’s metadata, an approach that suffers from two limitations: the metadata may not contain repository information, or it may contain wrong information. Our analysis shows that existing tools can only retrieve repository information for up to 70.5% of PyPI releases (63.1% of packages). To address these limitations, this paper proposes PyRadar, a novel framework that utilizes both the release’s metadata and source code to automatically retrieve and validate the repository information for PyPI package releases. We start with an empirical study that compares four existing metadata-based tools on 4,227,425 PyPI releases and analyzes phantom files (files appearing in the release distribution but not in the release’s repository) in 14,375 correct package-repository links and 2,064 incorrect links. Based on the findings, we design PyRadar with three components, i.e., Metadata-based Retriever, Source Code Repository Validator, and Source Code-based Retriever, that progressively retrieve correct source code repository information for PyPI releases. In particular, the Metadata-based Retriever combines the best practices of existing tools and successfully retrieves repository information from the metadata for 72.1% of PyPI releases. The Source Code Repository Validator applies machine learning models on six crafted features and achieves an AUC of up to 0.995. The Source Code-based Retriever queries World of Code with the SHA-1 hash of Python files in the release’s source distribution and retrieves repository information for 90.2% of packages in our dataset with an accuracy of 0.970. Both practitioners and researchers can employ PyRadar to better use PyPI packages.
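As a rough illustration of the hashing step mentioned in the abstract, the sketch below computes a SHA-1 hash for every Python file in a source distribution and flags files whose hash is absent from a repository. It is not the PyRadar implementation. It assumes Git-style blob hashing (SHA-1 over a `blob <size>\0` header plus the file content), which is the identifier a Git-based index would use, and it assumes the set of blob hashes for a candidate repository has already been obtained by some other means. The function names `git_blob_sha1`, `python_file_hashes`, and `phantom_files` are hypothetical, and matching by hash is only one possible way to approximate the paper's notion of phantom files.

```python
import hashlib
import io
import tarfile


def git_blob_sha1(data: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def python_file_hashes(sdist_bytes: bytes) -> dict[str, str]:
    """Map each .py file in a .tar.gz source distribution to its blob hash."""
    hashes: dict[str, str] = {}
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".py"):
                extracted = tar.extractfile(member)
                if extracted is not None:
                    hashes[member.name] = git_blob_sha1(extracted.read())
    return hashes


def phantom_files(release_hashes: dict[str, str],
                  repo_blob_hashes: set[str]) -> list[str]:
    """List release files whose content hash is absent from the repository.

    Assumption: repo_blob_hashes holds the blob hashes of a candidate
    repository, collected separately (hypothetical input).
    """
    return sorted(path for path, digest in release_hashes.items()
                  if digest not in repo_blob_hashes)
```

In use, the hashes returned by `python_file_hashes` would serve as lookup keys against an index of repository blobs, and the share of files reported by `phantom_files` could act as one signal of whether a package-repository link is plausible.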

Link to Preprint

https://gaokai320.github.io/papers/FSE24.pdf

DOI

https://doi.org/10.1145/3660822

Kai Gao, University of Science and Technology Beijing, China

Weiwei Xu, Peking University, China

Wenhao Yang, Peking University, China

Minghui Zhou, Peking University, China

Session Program

Thu 18 Jul
Displayed time zone: Brasilia, Distrito Federal, Brazil (GMT-03:00)

11:00 - 12:30	Software Maintenance and Comprehension 2 (Research Papers) at Pitomba. Chair(s): Denys Poshyvanyk (William & Mary)

11:00 18m Talk		Bloat beneath Python's Scales: A Fine-Grained Inter-Project Dependency Analysis. Georgios-Petros Drosos (ETH Zurich), Thodoris Sotiropoulos (ETH Zurich), Diomidis Spinellis (Athens University of Economics and Business & Delft University of Technology), Dimitris Mitropoulos (University of Athens)
11:18 18m Research paper		Characterizing Python Library Migrations. Mohayeminul Islam (University of Alberta), Ajay Jha (North Dakota State University), Ildar Akhmetov (Northeastern University), Sarah Nadi (New York University Abu Dhabi, University of Alberta)
11:36 18m Talk		PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages. Kai Gao (University of Science and Technology Beijing), Weiwei Xu (Peking University), Wenhao Yang (Peking University), Minghui Zhou (Peking University)
11:54 18m Talk		Refactoring to Pythonic Idioms: A Hybrid Knowledge-Driven Approach Leveraging Large Language Models. Zejun Zhang (Australian National University), Zhenchang Xing (CSIRO's Data61), Xiaoxue Ren (Zhejiang University), Qinghua Lu (Data61, CSIRO), Xiwei (Sherry) Xu (Data61, CSIRO)
12:12 18m Talk		Dependency-Induced Waste in Continuous Integration: An Empirical Study of Unused Dependencies in the NPM Ecosystem. Nimmi Weeraddana (University of Waterloo), Mahmoud Alfadel (University of Waterloo), Shane McIntosh (University of Waterloo)