Fri 19 Jul 2024 14:00 - 14:18 at Pitanga - Fault Diagnosis and Root Cause Analysis 2 Chair(s): Xi Zheng

Timely localization of the root causes of gray failures is essential for maintaining the stability of the server OS. Previous intrusive gray failure localization methods usually require modifying the source code of applications, which limits their practical deployment. In this paper, we propose GrayScope, a method for non-intrusively localizing the root causes of gray failures based on metric data in the server OS. Its core idea is to combine expert knowledge with causal learning techniques to capture more reliable inter-metric causal relationships. It then incorporates metric correlations and anomaly degrees to help identify potential root causes of gray failures. Additionally, it infers the gray failure propagation paths between metrics, providing interpretability and enhancing operators’ efficiency in mitigating gray failures. We evaluate GrayScope’s performance on 1241 injected gray failure cases and 135 cases from industrial experiments at Huawei. GrayScope achieves an AC@5 of 90% and an interpretability accuracy of 81%, significantly outperforming popular root cause localization methods. Additionally, we have made the code publicly available to facilitate further research.


Fri 19 Jul

Displayed time zone: Brasilia, Distrito Federal, Brazil

14:00 - 15:30
Fault Diagnosis and Root Cause Analysis 2 - Research Papers / Industry Papers at Pitanga
Chair(s): Xi Zheng Macquarie University
14:00
18m
Talk
Illuminating the Gray Zone: Non-Intrusive Gray Failure Localization in Server Operating Systems
Industry Papers
Shenglin Zhang Nankai University, Yongxin Zhao Nankai University, Xiao Xiong Nankai University, Yongqian Sun Nankai University, Xiaohui Nie CNIC, CAS, Jiacheng Zhang Nankai University, Fenglai Wang Huawei Technologies Ltd., Xian Zheng Huawei Technologies Ltd., Yuzhi Zhang Nankai University, Dan Pei Tsinghua University
14:18
18m
Talk
Towards Better Graph Neural Network-based Fault Localization Through Enhanced Code Representation
Research Papers
Md Nakhla Rafi Concordia University, Dong Jae Kim Concordia University, An Ran Chen University of Alberta, Tse-Hsun (Peter) Chen Concordia University, Shaowei Wang Department of Computer Science, University of Manitoba, Canada
14:36
18m
Talk
Easy over Hard: A Simple Baseline for Test Failures Causes Prediction
Industry Papers
Zhipeng Gao Shanghai Institute for Advanced Study - Zhejiang University, Zhipeng Xue, Xing Hu Zhejiang University, Weiyi Shang University of Waterloo, Xin Xia Huawei Technologies