Was the Global Outage Caused by an Internal Error Rather Than a Cyberattack?
Key Takeaways
- The global internet outage was due to an internal configuration error, not a cyberattack.
- Major platforms like X, ChatGPT, and Canva were significantly affected.
- Cloudflare will implement new safeguards to enhance system reliability.
- The incident highlighted the importance of proper system configuration and oversight.
- Cloudflare described this outage as its most serious since 2019.
New Delhi, Nov 19 (NationPress) Cloudflare's CEO, Matthew Prince, has clarified that the significant global internet disruption was not a cyberattack but rather stemmed from an internal configuration error.
The incident affected numerous major platforms, including X, ChatGPT, Canva, Discord, and many other websites and applications worldwide.
In a detailed analysis, Prince noted that the issue originated when a modification was made to permissions on a ClickHouse database cluster.
This update was intended to enhance data access; however, a flawed query resulted in the system retrieving an excessive amount of information.
This mistake caused a crucial “feature file” utilized by Cloudflare's Bot Management system to expand beyond its designated size.
This feature file is regenerated and distributed across Cloudflare's network every five minutes. When the file abruptly doubled in size, it exceeded the size limit built into the software that reads it, causing the routing software at the network edge to crash.
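The failure mode described here, a consumer crashing on an oversized input, is commonly guarded against by validating the file before it replaces the active copy. The following is a minimal sketch of that idea, not Cloudflare's code: the `FeatureStore` class, its `load` method, and the cap of 200 entries are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

# Illustrative cap only; the article does not state the real limit.
MAX_FEATURES = 200


@dataclass
class FeatureStore:
    """Hypothetical holder for the currently active feature list."""

    max_features: int = MAX_FEATURES
    active: list = field(default_factory=list)

    def load(self, candidate: list) -> bool:
        """Swap in a new feature list only if it fits within the cap.

        Returns True when the candidate was accepted. An oversized
        candidate is rejected and the last known-good list stays active,
        so the caller keeps running instead of crashing.
        """
        if len(candidate) > self.max_features:
            return False
        self.active = list(candidate)
        return True


store = FeatureStore(max_features=3)
store.load(["f1", "f2"])              # accepted
store.load(["f1", "f2", "f1", "f2"])  # doubled file: rejected
print(store.active)                   # still the last good list
```

The design choice in this sketch is to treat an invalid update as a rejected update rather than a fatal error, so a bad file degrades freshness instead of availability.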
The failures were intermittent because the flawed file was generated only on the parts of the cluster where the change had already been applied. Consequently, every five minutes, Cloudflare's network received either a good file, allowing a brief recovery, or a bad file, causing it to fail again.
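The recover-and-fail pattern can be illustrated with a toy simulation. This sketch assumes, purely for illustration, that each five-minute cycle is served by the next database node in round-robin order and that an updated node produces a file twice the normal size. The function name `simulate` and every number in it are hypothetical.

```python
from itertools import cycle


def simulate(nodes_updated, cycles, limit, base_size):
    """Return 'ok' or 'fail' for each generation cycle.

    nodes_updated: one bool per hypothetical node, True if the
                   permissions change has reached that node.
    limit:         maximum file size the consumer tolerates.
    base_size:     size of a correctly generated file.
    """
    states = []
    node_iter = cycle(nodes_updated)  # round-robin is an assumption
    for _ in range(cycles):
        updated = next(node_iter)
        size = base_size * 2 if updated else base_size
        states.append("fail" if size > limit else "ok")
    return states


# Half the nodes updated: the network alternates between states.
print(simulate([True, False], cycles=4, limit=200, base_size=120))
```

With only some nodes updated, the output alternates between failure and recovery, which matches the kind of symptom that is easy to misread as an external attack.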
This cycle of recovery and failure persisted for approximately three hours, starting at around 11:20 UTC, and caused extensive service interruptions worldwide. Prince emphasized that no cyberattack was involved and acknowledged that the company initially mistook the symptoms for a massive DDoS attack before pinpointing the actual cause.
Engineers ultimately halted the distribution of the faulty file, substituted it with an older correct version, and rebooted the affected systems. Cloudflare announced that the matter was fully resolved by 17:06 UTC and described this incident as its most severe outage since 2019.
Prince extended his apologies for the disruption and stated that Cloudflare will implement more robust safeguards, including stricter limits on file sizes, global kill switches for critical updates, and a comprehensive review of potential failures in its core systems.
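Two of the safeguards named above, size limits and a global kill switch, can be sketched on the publishing side of a distribution pipeline. This is an illustrative sketch under stated assumptions, not a description of Cloudflare's planned implementation: the `Distributor` class, its attributes, and the byte limit are hypothetical.

```python
class Distributor:
    """Hypothetical publisher that pushes a config file to the network."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes   # illustrative size limit
        self.kill_switch = False     # operators flip this to halt rollout
        self.published = []          # stand-in for the real network push

    def publish(self, payload: bytes) -> bool:
        """Publish the payload unless a safeguard blocks it."""
        if self.kill_switch:
            return False  # global stop: nothing leaves the publisher
        if len(payload) > self.max_bytes:
            return False  # oversized file never reaches consumers
        self.published.append(payload)
        return True


dist = Distributor(max_bytes=8)
dist.publish(b"good")          # accepted
dist.publish(b"far-too-long")  # blocked by the size limit
dist.kill_switch = True
dist.publish(b"ok")            # blocked by the kill switch
print(len(dist.published))
```

Checking at the publisher complements checks at the consumer: a bad file is stopped once at the source instead of being rejected separately by every machine that receives it.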