Pragmatic Engineer · 19 December 2025, 02:44 · approx. 9 min read

The Pulse: Cloudflare's latest outage proves again the danger of global configuration changes

#CloudInfrastructure #ConfigurationManagement #SystemAvailability #Cloudflare

TL;DR

Cloudflare's repeated large-scale outages show the risk of global configuration changes, and the serious crisis of trust caused by the delay in implementing staged rollouts.

AI deep analysis · 2 May 2026, 14:07

Key points

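A repeat outage caused by a global configuration change

The Cloudflare outage on 5 December was caused by a configuration change: while rolling out a fix for a React vulnerability, the team disabled an internal testing tool, and that change triggered a bug that spread HTTP 500 errors across the network.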

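Delayed implementation of preventive measures, and the risk it carries

After the November outage, Cloudflare committed to hardening the ingestion of its configuration files. That large implementation, expected to take months, was not finished, which left room for a repeat with a similar cause.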

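Customer trust and the need for a backup CDN

Consecutive outages with similar causes damage Cloudflare's reputation for reliability. Postmortems alone are not enough, and customers may have to reconsider backup CDNs and their redundancy strategy.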

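The CTO acknowledges the weight of the priority

Cloudflare's CTO acknowledged that staged rollouts for global configuration changes are the top priority across the organization, and committed to delivering them.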

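Strengthening the mechanisms for safe configuration changes

Cloudflare has made "Enhanced Rollouts & Versioning" a top priority: configuration data is to get the same strict health validation and quick rollback as software deployments, which limits the blast radius of a bad change.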

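Fail-open handling during failures

In critical data-plane components, Cloudflare is moving to "Fail-Open" error handling: when a configuration file is corrupt or out of range, the system defaults to a known-good state or passes traffic without scoring, rather than dropping requests.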

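The pattern of large outages caused by global configuration changes

Past large outages share a common pattern in which a single change spreads across a whole network: DNS changes (Meta, AWS), simultaneous OS updates (Datadog, Heroku), and a configuration policy replicated globally within seconds (Google Spanner).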

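Impact analysis

This article highlights how quickly a flaw in the configuration management process of a large cloud infrastructure provider can bring down an entire network. In particular, the fact that a delay in implementing preventive measures led directly to a loss of trust is a strong warning to engineering leaders to rethink the balance between immediate response and robust architecture.

Editorial comment

Managing the risk of configuration changes is a perennial challenge for infrastructure engineers, but a case like this one, where the preventive measures were not in place in time, is especially painful. Regaining trust takes more than technical improvements; practical steps such as recommending redundancy to customers are also essential.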

Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover Big Tech and startups through the lens of senior engineers and engineering leaders. Today, we cover one out of four topics from last week’s The Pulse issue. Full subscribers received the below article seven days ago. If you’ve been forwarded this email, you can subscribe here.

A mere two weeks after Cloudflare suffered a major outage and took down half the internet, the same thing has happened again. Last Friday, 5th December, thousands of sites went fully or partially down once more, in a global Cloudflare outage lasting 25 minutes.

As with last time, Cloudflare was speedy to share a full postmortem on the same day. It estimated that 28% of Cloudflare’s HTTP traffic was impacted. The cause of this latest outage was a seemingly innocent, but global, configuration change that went on to take out a good portion of Cloudflare, globally, until it was reverted. Here’s what happened:

1. Cloudflare was rolling out a fix for a nasty React security vulnerability

2. The fix caused an error in an internal testing tool

3. The Cloudflare team disabled the testing tool with a global killswitch

4. As this global configuration change was made, the killswitch unexpectedly caused a bug that resulted in HTTP 500 errors across Cloudflare’s network

In this latest outage, Cloudflare was burnt by yet another global configuration change. The previous outage in November was caused by a global database permissions change. In the postmortem of that incident, the Cloudflare team closed with this action item:

“Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input”

This change would stop Cloudflare’s configuration files from propagating immediately to the full network, as they still do now. But giving all global configuration files staged rollouts is a large implementation that could take months. Evidently, there hasn’t been time to build it yet, and that has come back to bite Cloudflare.
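To make the idea of a staged rollout concrete, here is a minimal sketch of pushing a configuration version out in waves, with a health check after each wave and a rollback of every touched host if the check fails. It is an illustration only, not Cloudflare’s implementation: the stage fractions, the `fleet` dictionary, and the `is_healthy` callback are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical stage plan: fraction of the fleet that should hold the new
# config version at the end of each stage.
STAGES: Sequence[float] = (0.01, 0.10, 0.50, 1.00)


@dataclass
class RolloutResult:
    completed: bool
    stages_applied: int
    rolled_back: List[str]


def staged_rollout(
    hosts: List[str],
    fleet: Dict[str, int],
    new_version: int,
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = STAGES,
) -> RolloutResult:
    """Apply new_version wave by wave; revert every touched host on failure."""
    previous: Dict[str, int] = {}  # host -> version it ran before this rollout
    done = 0
    for stage_index, fraction in enumerate(stages, start=1):
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[done:target]
        for host in wave:
            previous[host] = fleet[host]
            fleet[host] = new_version
        done = max(done, target)
        # Health gate: one unhealthy host in the wave stops the rollout.
        if not all(is_healthy(host) for host in wave):
            for host, old_version in previous.items():
                fleet[host] = old_version
            return RolloutResult(False, stage_index, sorted(previous))
    return RolloutResult(True, len(stages), [])
```

With 100 hosts and the stage plan above, a change that breaks a host in the second wave touches 10 hosts before it is reverted, instead of all 100. The price is the one the article describes: every change now takes several health-check cycles to reach the whole fleet.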

Unfortunately for Cloudflare, customers are likely to find a second outage unacceptable when its cause is similar to that of one only weeks earlier. If Cloudflare proves unreliable, customers should plan to onboard to backup CDNs at the very least, and a backup CDN vendor will do its best to convince new customers to use it as the primary CDN.

Cloudflare’s value-add rests on rock-solid reliability without customers needing to budget for a backup CDN. Yes, publishing postmortems on the same day as an outage occurs helps restore trust, but that will crumble anyway with repeated large outages.

To be fair, the company is doubling down on implementing staged configuration rollouts. In its postmortem, Cloudflare is its own biggest critic. CTO Dane Knecht reflected:

“[Global configuration changes rolling out globally] remains our first priority across the organization. In particular, the projects outlined below should help contain the impact of these kinds of changes:

Enhanced Rollouts & Versioning: Similar to how we slowly deploy software with strict health validation, data used for rapid threat response and general configuration needs to have the same safety and blast mitigation features. This includes health validation and quick rollback capabilities among other things.

Streamlined break glass capabilities: Ensure that critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers.

“Fail-Open” Error Handling: As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios. This will include drift-prevention capabilities to ensure this is enforced continuously.

These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours”.
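The “Fail-Open” item in the quote can be sketched in a few lines. The example below validates a new configuration strictly, but on failure it logs the error and falls back to the last known-good configuration, or passes traffic unscored if there is none, rather than dropping requests. Everything here is hypothetical and chosen for illustration: the JSON format, the `MAX_FEATURES` cap, and the `ScoringConfig`, `load_fail_open`, and `handle_request` names are not taken from Cloudflare’s code.

```python
import json
import logging
from dataclasses import dataclass
from typing import Optional, Tuple

logger = logging.getLogger("config_loader")

MAX_FEATURES = 200  # hypothetical feature cap


@dataclass(frozen=True)
class ScoringConfig:
    features: Tuple[str, ...]
    version: int


def parse_config(raw: str) -> ScoringConfig:
    """Validate strictly; raise ValueError on corrupt or out-of-range input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"corrupt config: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("config must be a JSON object")
    features = data.get("features")
    version = data.get("version")
    if not isinstance(features, list) or not isinstance(version, int):
        raise ValueError("config is missing 'features' or 'version'")
    if len(features) > MAX_FEATURES:
        raise ValueError(f"{len(features)} features exceeds cap of {MAX_FEATURES}")
    return ScoringConfig(tuple(features), version)


def load_fail_open(
    raw: str, last_known_good: Optional[ScoringConfig]
) -> Optional[ScoringConfig]:
    """Return the new config if valid, otherwise the last known-good one.

    A None result means no usable config exists; callers pass traffic unscored.
    """
    try:
        return parse_config(raw)
    except ValueError as exc:
        logger.error("rejecting new config, failing open: %s", exc)
        return last_known_good


def handle_request(request: dict, config: Optional[ScoringConfig]) -> dict:
    """Score the request when a config is available; never drop it."""
    if config is None:
        return {**request, "score": None}  # fail open: pass through unscored
    signals = request.get("signals", ())
    score = sum(1 for feature in config.features if feature in signals)
    return {**request, "score": score}
```

The design choice is that validation stays strict while the failure mode changes: a bad file is still rejected and logged, but the rejection no longer takes the request path down with it. As the quote notes, some services may need the opposite default, so whether to fail open or closed remains a per-service decision.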

Global configuration errors often trigger large outages

There’s a pattern of implicit or explicit global configuration errors causing large outages, and some of the biggest ones in recent years were caused by a single change being rolled out to a whole network of machines:

DNS and DNS-related systems like BGP: DNS changes are global by default, so it’s no wonder that DNS changes can cause global outages. Meta’s 7-hour outage in 2021 was related to DNS changes (more specifically, Border Gateway Protocol changes). Meanwhile, the AWS outage in October started with the internal DNS system.

OS updates happening at the same time, globally: Datadog’s 2023 outage cost the company $5M and was caused by Datadog’s Ubuntu machines executing an OS update within the same time window, globally. It caused issues with networking, and it didn’t help that Datadog ran its infra on 3 different cloud providers across 3 networks. The same kind of Ubuntu update also caused a global outage for Heroku in 2024.

Globally replicating configs: in 2024, a configuration policy change was rolled out globally and crashed every Spanner database node straight away. As Google concluded in its postmortem: “Given the global nature of quota management, this metadata was replicated globally within seconds”.
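For the second pattern in the list above, OS updates firing in the same window everywhere, one common mitigation is to give each machine its own deterministic slot inside a longer window. The sketch below does this by hashing the hostname. It is a generic illustration, not something the article says Datadog or Heroku adopted, and the `update_slot` function and its parameters are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def update_slot(
    hostname: str,
    window_start: datetime,
    spread: timedelta,
    slots: int = 24,
) -> datetime:
    """Map a host to one of `slots` evenly spaced start times in the window."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % slots
    return window_start + (spread / slots) * slot


if __name__ == "__main__":
    start = datetime(2025, 1, 6, tzinfo=timezone.utc)
    for name in ("web-001", "web-002", "db-001"):
        print(name, update_slot(name, start, timedelta(hours=24)).isoformat())
```

Because the slot depends only on the hostname, a machine keeps the same slot across runs, and a bad update surfaces on the first slots of hosts rather than on the whole fleet at once.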

Implementing gradual rollouts for all configuration files is a lot of work. It’s also invisible labor: when it’s done well, the only sign of the better infrastructure is the absence of incidents!

The largest systems in the world will likely have to implement safer ways to roll out configs – but not everybody needs to. Staged configuration rollout doesn’t make much sense for smaller companies and products because this infra work slows down product development.

It doesn’t just slow down building, but every deployment, too, and this friction is designed to make everything slower. As such, staged rollouts don’t make much sense unless the stability of a mature system is more important than fast iteration.

Software engineering is a field where tradeoffs are a fact of life, and universal solutions don’t exist. The development approach that worked a year ago, for a system with 1/100th of the load and users, may not make sense today.

This was one out of the four topics covered in this week’s The Pulse. The full edition additionally covers:

Industry Pulse. Poor capacity planning at AWS, Meta moves to a “closed AI” approach, a looming RAM shortage, early-stage startups hiring slower than before, how long it takes to earn $600K at Amazon and Meta, Apple loses execs to Meta, and more

How the engineering team at Oxide uses LLMs. They find LLMs great for reading documents and lightweight research, mixed for coding and code review, and a poor choice for writing documents – or any kind of writing, really!

Linux officially supports Rust in the kernel. Rust is now a first-class language inside the Linux kernel, eight months after a Linux Foundation Fellow predicted more support for Rust. A summary of the pros and cons of Rust support for Linux

Figure: Step 2 (replicating a configuration file globally across GCP) caused a global outage in 2024
