OpenAI Reports on ChatGPT Outage Caused by Configuration Error
date
Dec 17, 2024
language
en
status
Published
type
News
image
https://www.ai-damn.com/1734416091241-6386995762721204183447021.png
slug
openai-reports-on-chatgpt-outage-caused-by-configuration-error-1734416272471
tags
ChatGPT
OpenAI
Kubernetes
Outage
Service Recovery
summary
OpenAI has released a detailed report explaining the December 11 outage of ChatGPT, which lasted over four hours. The incident was triggered by a configuration error during a telemetry service deployment that overwhelmed system resources. The report outlines the challenges faced by engineers in restoring service and measures taken to prevent future occurrences.
OpenAI Reports on ChatGPT Outage Caused by Configuration Error
On December 11, 2024, OpenAI's ChatGPT and related services experienced a major outage lasting approximately 4 hours and 10 minutes, affecting a large number of users. OpenAI has since published a comprehensive post-incident report detailing what happened and why.
Outage Overview
The outage originated with a small change: the deployment of a new telemetry service intended to collect metrics from the Kubernetes (K8S) control plane. The service shipped with an unintentionally broad configuration that caused every node across all clusters to execute resource-intensive K8S API operations at the same time. The resulting load crashed the K8S API servers, leaving the control planes of most clusters unable to process requests.
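To make that failure mode concrete, here is a minimal Python sketch (using the kubernetes client library) of a per-node telemetry agent that queries cluster-wide state from the API server instead of only local data. The names, scrape interval, and structure are illustrative assumptions, not OpenAI's actual configuration.

```python
# Hypothetical sketch of the failure mode: a per-node agent that issues
# cluster-wide LIST calls against the API server on every scrape. With N
# nodes per cluster, the control plane receives N copies of this load.
import time
from kubernetes import client, config

SCRAPE_INTERVAL_SECONDS = 15  # illustrative value, not from the report


def scrape_cluster_state(core: client.CoreV1Api) -> dict:
    """Collect "metrics" by listing cluster-wide objects from the API server.

    Run as a per-node agent (e.g. a DaemonSet), every node in every cluster
    issues these heavyweight LIST calls on the same schedule.
    """
    # Cluster-wide LISTs are expensive for the API server and etcd; fanned out
    # across thousands of nodes they can saturate the control plane.
    pods = core.list_pod_for_all_namespaces(watch=False)
    nodes = core.list_node()
    return {"pod_count": len(pods.items), "node_count": len(nodes.items)}


if __name__ == "__main__":
    config.load_incluster_config()  # the agent runs inside the cluster
    api = client.CoreV1Api()
    while True:
        print(scrape_cluster_state(api))
        time.sleep(SCRAPE_INTERVAL_SECONDS)
```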
While the K8S data plane can keep running without the control plane, cluster DNS (Domain Name System) depends heavily on it. When API operations failed, service discovery broke down, ultimately leading to a complete service outage. Although the problem was identified within three minutes, engineers could not reach the control plane to roll back the change, creating a "deadlock": the crashed control plane was needed to remove the very service that had crashed it, which significantly complicated recovery.
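The sketch below illustrates, under stated assumptions, why DNS-based service discovery degrades rather than failing instantly when the control plane goes down: cached answers keep working until they expire. The service name, TTL, and caching logic are hypothetical, not taken from the report.

```python
# Minimal sketch: in-cluster DNS serves records built from API server data,
# so cached answers mask the outage briefly, then lookups fail cluster-wide.
import socket
import time

CACHE_TTL_SECONDS = 30  # illustrative TTL, not the cluster's real setting
_cache: dict[str, tuple[float, str]] = {}


def resolve_service(hostname: str) -> str:
    """Resolve a service hostname, falling back to a short-lived local cache."""
    now = time.time()
    cached = _cache.get(hostname)
    try:
        addr = socket.getaddrinfo(hostname, 443)[0][4][0]
        _cache[hostname] = (now, addr)
        return addr
    except socket.gaierror:
        # DNS is failing (control plane / cluster DNS unhealthy). A cached
        # answer works only until its TTL runs out.
        if cached and now - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]
        raise


if __name__ == "__main__":
    # Hypothetical in-cluster service name, used purely for illustration.
    print(resolve_service("chat-backend.prod.svc.cluster.local"))
```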
Recovery Efforts
In the wake of the incident, OpenAI engineers pursued several strategies in parallel to recover the affected clusters. Their initial steps included scaling down cluster sizes to reduce the load on the K8S API and blocking network access to the K8S admin APIs so the servers could recover. They also increased the resource allocation of the K8S API servers to better handle incoming requests.
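As a rough illustration of the scale-down step, the following Python sketch uses the kubernetes client to shrink every Deployment in a namespace to a minimal replica count. The namespace and replica numbers are placeholders; OpenAI's actual tooling is not public.

```python
# Sketch of one mitigation: shed load so the API servers can recover.
from kubernetes import client, config


def scale_down_namespace(namespace: str, replicas: int = 1) -> None:
    """Scale every Deployment in a namespace down to a minimal replica count."""
    config.load_kube_config()  # operator credentials
    apps = client.AppsV1Api()

    for deployment in apps.list_namespaced_deployment(namespace).items:
        name = deployment.metadata.name
        apps.patch_namespaced_deployment_scale(
            name=name,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )
        print(f"scaled {namespace}/{name} to {replicas} replica(s)")


if __name__ == "__main__":
    scale_down_namespace("telemetry")  # hypothetical namespace
```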
After several attempts, the engineers regained access to the K8S control plane, allowing them to remove the problematic telemetry service and gradually bring the clusters back to health. During recovery, they also shifted traffic to healthy clusters to reduce the load on those still under strain.
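Once the API server answered requests again, the offending component could be deleted. The hypothetical sketch below removes a telemetry DaemonSet by name; both the name and namespace are invented for illustration.

```python
# Sketch: delete the misconfigured DaemonSet so nodes stop hammering the
# API server. Names are placeholders, not from OpenAI's report.
from kubernetes import client, config


def remove_bad_telemetry_agent(name: str, namespace: str) -> None:
    """Delete the misconfigured DaemonSet once the control plane is reachable."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    apps.delete_namespaced_daemon_set(name=name, namespace=namespace)
    print(f"deleted DaemonSet {namespace}/{name}")


if __name__ == "__main__":
    remove_bad_telemetry_agent("telemetry-agent", "monitoring")  # hypothetical names
```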
However, attempting to recover many services at once saturated shared resources, requiring further manual intervention, and some clusters took longer to come back as a result. OpenAI says it intends to apply the lessons from this incident to prevent similar lockout situations in the future.
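A staggered restore, roughly like the sketch below, avoids the saturation described above by bringing workloads back in small batches with pauses in between. Batch size, pause length, namespace, and replica counts are assumptions for illustration only.

```python
# Sketch of a staggered restore: scale workloads back up in small batches so
# the shared control plane is not overwhelmed a second time.
import time
from kubernetes import client, config

BATCH_SIZE = 5       # illustrative batch size
PAUSE_SECONDS = 60   # illustrative pause between batches


def restore_in_batches(namespace: str, target_replicas: int = 3) -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()
    deployments = apps.list_namespaced_deployment(namespace).items

    for i in range(0, len(deployments), BATCH_SIZE):
        for deployment in deployments[i:i + BATCH_SIZE]:
            apps.patch_namespaced_deployment_scale(
                name=deployment.metadata.name,
                namespace=namespace,
                body={"spec": {"replicas": target_replicas}},
            )
        time.sleep(PAUSE_SECONDS)  # let the control plane absorb the new load


if __name__ == "__main__":
    restore_in_batches("prod")  # hypothetical namespace
```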
Conclusion
The detailed report serves not only as a record of the outage but also as a blueprint for improving response strategies in similar future incidents. OpenAI emphasizes the importance of careful monitoring and configuration management to avoid service disruptions.
For further details, the full report can be accessed here.
Key Points
- Cause of the outage: A configuration error during telemetry service deployment led to an overload of K8S API operations, resulting in service failure.
- Engineer dilemma: The crash of the control plane restricted engineers' access, complicating the resolution process.
- Recovery process: Engineers restored service by scaling down clusters, blocking admin API access, increasing API server resources, and shifting traffic to healthy clusters.