OpenAI Reports on ChatGPT Outage Caused by Configuration Error
date
Dec 17, 2024
language
en
status
Published
type
News
image
https://www.ai-damn.com/1734416091241-6386995762721204183447021.png
slug
openai-reports-on-chatgpt-outage-caused-by-configuration-error-1734416272471
tags
ChatGPT
OpenAI
Kubernetes
Outage
Service Recovery
summary
OpenAI has released a detailed report explaining the December 11 outage of ChatGPT, which lasted over four hours. The incident was triggered by a configuration error during a telemetry service deployment that overwhelmed system resources. The report outlines the challenges faced by engineers in restoring service and measures taken to prevent future occurrences.
OpenAI Reports on ChatGPT Outage Caused by Configuration Error
On December 11, 2024, OpenAI's ChatGPT and related services experienced a major outage lasting approximately 4 hours and 10 minutes, affecting a large number of users. OpenAI has since published a comprehensive post-incident report detailing what happened and why.
Outage Overview
The outage originated with a small change: the deployment of a new telemetry service intended to collect metrics from the Kubernetes (K8S) control plane. The service shipped with an unintentionally broad configuration that caused every node across all clusters to execute resource-intensive K8S API operations at the same time. The resulting load crashed the K8S API servers, leaving the control planes of most clusters unable to process requests.
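To make that failure mode concrete, here is a minimal Python sketch (using the kubernetes client library) of a per-node telemetry agent that queries cluster-wide state from the API server instead of only local data. The names, scrape interval, and structure are illustrative assumptions, not OpenAI's actual configuration.

```python
# Hypothetical sketch of the failure mode: a per-node agent that issues
# cluster-wide LIST calls against the API server on every scrape. With N
# nodes per cluster, the control plane receives N copies of this load.
import time
from kubernetes import client, config

SCRAPE_INTERVAL_SECONDS = 15  # illustrative value, not from the report


def scrape_cluster_state(core: client.CoreV1Api) -> dict:
    """Collect "metrics" by listing cluster-wide objects from the API server.

    Run as a per-node agent (e.g. a DaemonSet), every node in every cluster
    issues these heavyweight LIST calls on the same schedule.
    """
    # Cluster-wide LISTs are expensive for the API server and etcd; fanned out
    # across thousands of nodes they can saturate the control plane.
    pods = core.list_pod_for_all_namespaces(watch=False)
    nodes = core.list_node()
    return {"pod_count": len(pods.items), "node_count": len(nodes.items)}


if __name__ == "__main__":
    config.load_incluster_config()  # the agent runs inside the cluster
    api = client.CoreV1Api()
    while True:
        print(scrape_cluster_state(api))
        time.sleep(SCRAPE_INTERVAL_SECONDS)
```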
While the K8S data plane can keep running without the control plane, cluster DNS (Domain Name System) depends heavily on it. When API operations failed, service discovery broke down, ultimately leading to a complete service outage. Although the problem was identified within three minutes, engineers could not reach the control plane to roll back the change, creating a "deadlock": the crashed control plane was needed to remove the very service that had crashed it, which significantly complicated recovery.
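The sketch below illustrates, under stated assumptions, why DNS-based service discovery degrades rather than failing instantly when the control plane goes down: cached answers keep working until they expire. The service name, TTL, and caching logic are hypothetical, not taken from the report.

```python
# Minimal sketch: in-cluster DNS serves records built from API server data,
# so cached answers mask the outage briefly, then lookups fail cluster-wide.
import socket
import time

CACHE_TTL_SECONDS = 30  # illustrative TTL, not the cluster's real setting
_cache: dict[str, tuple[float, str]] = {}


def resolve_service(hostname: str) -> str:
    """Resolve a service hostname, falling back to a short-lived local cache."""
    now = time.time()
    cached = _cache.get(hostname)
    try:
        addr = socket.getaddrinfo(hostname, 443)[0][4][0]
        _cache[hostname] = (now, addr)
        return addr
    except socket.gaierror:
        # DNS is failing (control plane / cluster DNS unhealthy). A cached
        # answer works only until its TTL runs out.
        if cached and now - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]
        raise


if __name__ == "__main__":
    # Hypothetical in-cluster service name, used purely for illustration.
    print(resolve_service("chat-backend.prod.svc.cluster.local"))
```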
Recovery Efforts
In the wake of the incident, OpenAI engineers pursued several strategies in parallel to recover the affected clusters. Their initial steps included scaling down cluster sizes to reduce the load on the K8S API and blocking network access to the K8S admin APIs so the servers could recover. They also increased the resource allocation of the K8S API servers to better handle incoming requests.
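As a rough illustration of the scale-down step, the following Python sketch uses the kubernetes client to shrink every Deployment in a namespace to a minimal replica count. The namespace and replica numbers are placeholders; OpenAI's actual tooling is not public.

```python
# Sketch of one mitigation: shed load so the API servers can recover.
from kubernetes import client, config


def scale_down_namespace(namespace: str, replicas: int = 1) -> None:
    """Scale every Deployment in a namespace down to a minimal replica count."""
    config.load_kube_config()  # operator credentials
    apps = client.AppsV1Api()

    for deployment in apps.list_namespaced_deployment(namespace).items:
        name = deployment.metadata.name
        apps.patch_namespaced_deployment_scale(
            name=name,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )
        print(f"scaled {namespace}/{name} to {replicas} replica(s)")


if __name__ == "__main__":
    scale_down_namespace("telemetry")  # hypothetical namespace
```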
After several attempts, the engineers regained access to the K8S control plane, allowing them to remove the problematic telemetry service and gradually bring the clusters back to health. During recovery, they also shifted traffic to healthy clusters to reduce the load on those still under strain.
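Once the API server answered requests again, the offending component could be deleted. The hypothetical sketch below removes a telemetry DaemonSet by name; both the name and namespace are invented for illustration.

```python
# Sketch: delete the misconfigured DaemonSet so nodes stop hammering the
# API server. Names are placeholders, not from OpenAI's report.
from kubernetes import client, config


def remove_bad_telemetry_agent(name: str, namespace: str) -> None:
    """Delete the misconfigured DaemonSet once the control plane is reachable."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    apps.delete_namespaced_daemon_set(name=name, namespace=namespace)
    print(f"deleted DaemonSet {namespace}/{name}")


if __name__ == "__main__":
    remove_bad_telemetry_agent("telemetry-agent", "monitoring")  # hypothetical names
```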
However, attempting to recover many services at once saturated shared resources, requiring further manual intervention, and some clusters took longer to come back as a result. OpenAI says it intends to apply the lessons from this incident to prevent similar lockout situations in the future.
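A staggered restore, roughly like the sketch below, avoids the saturation described above by bringing workloads back in small batches with pauses in between. Batch size, pause length, namespace, and replica counts are assumptions for illustration only.

```python
# Sketch of a staggered restore: scale workloads back up in small batches so
# the shared control plane is not overwhelmed a second time.
import time
from kubernetes import client, config

BATCH_SIZE = 5       # illustrative batch size
PAUSE_SECONDS = 60   # illustrative pause between batches


def restore_in_batches(namespace: str, target_replicas: int = 3) -> None:
    config.load_kube_config()
    apps = client.AppsV1Api()
    deployments = apps.list_namespaced_deployment(namespace).items

    for i in range(0, len(deployments), BATCH_SIZE):
        for deployment in deployments[i:i + BATCH_SIZE]:
            apps.patch_namespaced_deployment_scale(
                name=deployment.metadata.name,
                namespace=namespace,
                body={"spec": {"replicas": target_replicas}},
            )
        time.sleep(PAUSE_SECONDS)  # let the control plane absorb the new load


if __name__ == "__main__":
    restore_in_batches("prod")  # hypothetical namespace
```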
Conclusion
The detailed report serves not only as a record of the outage but also as a blueprint for improving response strategies in similar future incidents. OpenAI emphasizes the importance of careful monitoring and configuration management to avoid service disruptions.
For further details, the full report can be accessed here.
Key Points
- Cause of the outage: A configuration error during telemetry service deployment led to an overload of K8S API operations, resulting in service failure.
- Engineer dilemma: The crash of the control plane restricted engineers' access, complicating the resolution process.
- Recovery process: Engineers restored service by scaling down clusters, blocking admin API access, increasing API server resources, and shifting traffic to healthy clusters.