OpenAI Reports on ChatGPT Outage Triggered by Minor Change
date
Dec 16, 2024
language
en
status
Published
type
News
image
https://www.ai-damn.com/1734362772815-6386995762721204183447021.png
slug
openai-reports-on-chatgpt-outage-triggered-by-minor-change-1734362857088
tags
OpenAI
ChatGPT
Kubernetes
Outage
Technology
summary
OpenAI has released a report detailing the recent ChatGPT outage that lasted over four hours due to a minor change in the telemetry service. The report outlines the technical challenges faced by engineers during the incident and the subsequent recovery efforts.
OpenAI Reports on ChatGPT Outage Triggered by Minor Change
Last week, on December 11, OpenAI's ChatGPT, along with services such as Sora, suffered a significant outage lasting four hours and ten minutes and affecting a large number of users. OpenAI has since published a detailed report explaining the cause of the outage and the recovery process that followed.
Cause of the Outage
The outage was traced to a small change: the deployment of a new telemetry service intended to collect metrics from the Kubernetes (K8s) control plane. An inadvertently broad configuration caused resource-intensive K8s API operations to run simultaneously from every node in each cluster. The resulting overload crashed the API servers, leaving the K8s data plane of most clusters unable to serve requests.
While the K8s data plane can in theory operate independently of the control plane, critical services such as DNS rely on the control plane to function. The overload of API operations broke the service discovery mechanism, resulting in a complete service failure. Although the issue was identified within three minutes, engineers found themselves in a “deadlock”: they were locked out of the control plane, which prevented them from rolling back the faulty service or otherwise addressing the problem.
Recovery Efforts
Upon recognizing the issue, OpenAI's engineers took immediate action to recover the clusters. Their strategy was to scale down the size of the clusters to reduce the K8s API load and to block access to the administrative K8s APIs so that the API servers could return to normal operation. They also increased the resources allocated to the K8s API servers so they could better handle the backlog of requests.
After several rounds of effort, the engineers regained control of the K8s control plane, which allowed them to remove the faulty service and gradually restore the clusters. During this recovery period, they also redirected traffic to already-recovered or newly added healthy clusters to further reduce the load on the affected systems.
However, because many services attempted to recover at the same time, resource limits were quickly saturated. This required additional manual intervention and meant that some clusters took longer to restore than others.
Lessons Learned
OpenAI says it intends to learn from this incident to avoid being locked out of its own control planes under similar circumstances in the future. The episode is a reminder of the complexity of operating large-scale cloud services and of how consequential even minor configuration changes can be.
For further details on the incident, see the complete report published by OpenAI.
Key Points
- Cause of the outage: A small configuration change to a new telemetry service overloaded the K8s API, triggering a service failure.
- Engineer dilemma: The crash of the control plane locked engineers out, hindering resolution of the issue.
- Recovery process: Services were ultimately restored by scaling down clusters and increasing API server resources.