Major cloud providers have employed advanced AI-based solutions, such as large language models, to aid humans in identifying the root causes of cloud incidents. Although AI-driven assistants are becoming more common in root cause analysis, their usefulness in supporting on-call engineers is limited by their unstable accuracy. This limitation arises from the fundamental difficulty of the task, the tendency of language-model-based methods to hallucinate, and the difficulty of distinguishing these well-disguised hallucinations. To address this challenge, we propose a novel confidence estimation method that assigns reliable confidence scores to root cause recommendations, helping on-call engineers decide whether to trust the model’s predictions. We enable retraining-free confidence estimation on out-of-domain tasks via retrieval augmentation. To elicit better-calibrated confidence estimates, we adopt a two-stage prompting procedure and a learnable transformation, which reduces the expected calibration error (ECE) to 31% of the direct prompting baseline on a dataset comprising over 100,000 incidents from Microsoft. Additionally, we demonstrate that our method is applicable across various root cause prediction models. Our study takes an important step towards reliably and effectively embedding LLMs into cloud incident management systems.
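For context, ECE here refers to the standard binned calibration metric; the formulation below is the commonly used definition rather than one taken from this paper's text:

$$
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|,
$$

where predictions are partitioned into $M$ equal-width confidence bins $B_m$, $n$ is the total number of predictions, $\operatorname{acc}(B_m)$ is the empirical accuracy within bin $B_m$, and $\operatorname{conf}(B_m)$ is the mean predicted confidence in that bin. A lower ECE indicates that predicted confidences more closely match observed accuracy.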