Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph (FSE 2024 - Industry Papers)

Mon 15 - Fri 19 July 2024, Porto de Galinhas, Brazil

Who

Zhenhe Yao, Changhua Pei, Wenxiao Chen, Hanzhang Wang, Liangfei Su, Huai Jiang, Zhe Xie, Xiaohui Nie, Dan Pei

Track

FSE 2024 Industry Papers

Time Zone

The program is currently displayed in (GMT-03:00) Brasilia, Distrito Federal, Brazil.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 17 Jul 2024 17:12 - 17:30 at Sapoti - Fault Diagnosis and Root Cause Analysis 1 Chair(s): Muhammad Ali Gulzar

Abstract

This paper presents Chain-of-Event (CoE), an interpretable model for root cause analysis in microservice systems that analyzes causal relationships among events transformed from multi-modal observation data. CoE distinguishes itself by its interpretable parameter design, which aligns with the operational experience of Site Reliability Engineers (SREs) and thereby lets their expertise be integrated directly into the analysis process. Furthermore, CoE automatically learns event-causal graphs from historical incidents and accurately locates root cause events, eliminating the need for manual configuration. In an evaluation on two datasets sourced from an e-commerce system involving over 5,000 services, CoE achieves top-tier performance, with 79.30% top-1 and 98.8% top-3 accuracy on the Service dataset and 85.3% top-1 and 96.6% top-3 accuracy on the Business dataset. An ablation study further explores the significance of each component within the CoE model, offering insights into their individual contributions to the model’s overall effectiveness. Additionally, through real-world case analysis, this paper demonstrates how CoE enhances interpretability and improves incident comprehension for SREs.

Zhenhe Yao

Tsinghua University

China

Changhua Pei

Computer Network Information Center at Chinese Academy of Sciences

China

Wenxiao Chen

Tsinghua University

Hanzhang Wang

Walmart Global Tech

Liangfei Su

eBay, USA

Huai Jiang

eBay, USA

Zhe Xie

Tsinghua University

China

Xiaohui Nie

CNIC, CAS

China

Dan Pei

Tsinghua University

Session Program

Wed 17 Jul
Displayed time zone: Brasilia, Distrito Federal, Brazil

16:00 - 18:00	Fault Diagnosis and Root Cause Analysis 1 (Demonstrations / Research Papers / Industry Papers) at Sapoti. Chair(s): Muhammad Ali Gulzar, Virginia Tech

16:00 18m Talk		A Quantitative and Qualitative Evaluation of LLM-based Explainable Fault Localization - Research Papers - Sungmin Kang (Korea Advanced Institute of Science and Technology), Gabin An (Korea Advanced Institute of Science and Technology), Shin Yoo (Korea Advanced Institute of Science and Technology)
16:18 18m Talk		BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection - Research Papers - Luan Pham (RMIT University), Huong Ha (RMIT University), Hongyu Zhang (Chongqing University)
16:36 18m Talk		Fault Diagnosis for Test Alarms in Microservices Through Multi-source Data - Industry Papers - Shenglin Zhang (Nankai University), Jun Zhu (Nankai University), Bowen Hao (Nankai University), Yongqian Sun (Nankai University), Xiaohui Nie (CNIC, CAS), Jingwen Zhu (Nankai University), Xilin Liu (Huawei Cloud), Xiaoqian Li (Huawei Cloud), Yuchi Ma (Huawei Cloud Computing Technologies CO., LTD.), Dan Pei (Tsinghua University)
16:54 18m Talk		Costs and Benefits of Machine Learning Software Defect Prediction: Industrial Case Study - Industry Papers - Szymon Stradowski (Wroclaw University of Science and Technology & NOKIA), Lech Madeyski (Wroclaw University of Science and Technology)
17:12 18m Talk		Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph - Industry Papers - Zhenhe Yao (Tsinghua University), Changhua Pei (Computer Network Information Center at Chinese Academy of Sciences), Wenxiao Chen (Tsinghua University), Hanzhang Wang (Walmart Global Tech), Liangfei Su (eBay, USA), Huai Jiang (eBay, USA), Zhe Xie (Tsinghua University), Xiaohui Nie (CNIC, CAS), Dan Pei (Tsinghua University)
17:30 18m Talk		ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems - Research Papers - Guangba Yu (Sun Yat-sen University), Pengfei Chen (Sun Yat-sen University), Zilong He (Sun Yat-sen University), Qiuyu Yan (Tencent), Yu Luo (Tencent), Fangyuan Li (Tencent), Zibin Zheng (Sun Yat-sen University)
17:48 9m Talk		MineCPP: Mining Bug Fix Pairs and Their Structures - Demonstrations - Sai Krishna Avula (IIT Gandhinagar), Shouvick Mondal (IIT Gandhinagar)