InfoQ · April 6, 2026, 23:32 · about a 6-minute read

Pinterest Cuts Spark OOM Failures by 96% with Automatic Memory Retries

#Big Data Processing #Apache Spark #Operational Optimization #Data Pipelines #Pinterest #Scalability

TL;DR

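Pinterest Engineering cut Apache Spark out-of-memory errors by 96% through improved observability, configuration tuning, and automatic memory retries, stabilizing its data pipelines and reducing operational load.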

AI Deep Analysis · April 7, 2026, 00:41

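Key Points

Large reduction in out-of-memory errors

Pinterest reduced Apache Spark out-of-memory (OOM) errors by 96%.

Main techniques

The result came from combining three techniques: improved observability, configuration tuning, and automatic memory retries.

Operational effects

A staged rollout and a tracking dashboard stabilized data pipelines and reduced manual intervention; proactive memory increases are planned as future work.

Scale and impact

The approach was applied to tens of thousands of jobs per day and lowered operational load.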

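Impact Analysis

This article offers a practical case study in stabilizing Spark operations in a large-scale data processing environment. The automatic memory retry mechanism in particular shows an implementation pattern that companies facing similar problems can draw on, and it has value as a data engineering best practice.

Editorial Comment

A practical answer to the out-of-memory problems that recur in large-scale data processing, most useful for its reduction of operational load. The case study is stronger on practicality than on technical novelty.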

Pinterest Engineering has significantly improved the reliability of its Apache Spark workloads, cutting out-of-memory (OOM) failures by 96% through a combination of improved observability, configuration tuning, and automatic memory retries. This work addresses persistent job failures that disrupted pipelines, increased on-call load, and threatened timely analytics for memory-heavy workloads powering recommendation systems and large-scale data processing.

For years, OOM errors were a persistent headache. Jobs would fail late in execution, often after hours of computation, forcing engineers to manually tweak memory settings to keep pipelines running. These failures disrupted downstream processes, increased on-call load, and made it harder for teams to focus on delivering new features. Fixing the problem required both technical and workflow-level solutions to reduce failures while minimizing manual effort.

A critical first step was improving visibility into how jobs consumed memory. Engineers built detailed metrics for executor memory usage, shuffle operations, and task execution times. This data helped identify hotspots, skewed partitions, and stages that were unusually resource-hungry. As Pinterest engineers explained in their blog, understanding where memory is consumed within a job is critical to addressing failures effectively. By knowing exactly where problems arose, the team could make precise adjustments rather than simply adding memory across the board.
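As a rough illustration of how per-task metrics can expose skewed partitions, the sketch below flags stages whose largest task processed far more data than the median task. The input shape, the function name `find_skewed_stages`, and the threshold of 5x are assumptions made for this example; the article does not describe how Pinterest's metrics are structured or queried.

```python
from statistics import median


def find_skewed_stages(task_bytes_by_stage, ratio_threshold=5.0):
    """Return {stage_id: skew_ratio} for stages whose largest task dwarfs the median.

    task_bytes_by_stage maps a stage id to the bytes each task in that stage
    processed (a hypothetical input shape chosen for this sketch).
    """
    skewed = {}
    for stage_id, task_bytes in task_bytes_by_stage.items():
        if not task_bytes:
            continue
        typical = median(task_bytes)
        if typical <= 0:
            continue  # avoid dividing by zero when most tasks read nothing
        ratio = max(task_bytes) / typical
        if ratio >= ratio_threshold:
            skewed[stage_id] = ratio
    return skewed


# Made-up numbers: stage-2 has one task that reads far more than its peers.
metrics = {
    "stage-1": [100, 110, 95, 105],
    "stage-2": [100, 120, 90, 2000],
}
print(find_skewed_stages(metrics))
```

With these made-up numbers only `stage-2` is flagged, which is the kind of signal that points at a specific stage to fix instead of raising memory for the whole job.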

Visualizing executor-level memory usage and Auto Memory Retry in Spark workflows (Source: Pinterest Blog Post)

Configuration tuning complemented these insights. Spark settings for memory allocation, shuffle partitions, and broadcast joins were optimized for workload patterns. Adaptive query execution allowed the system to adjust partitioning dynamically, reducing memory pressure during heavy stages. Additional preprocessing helped smooth out data skew, and validation checks flagged unusually large or anomalous datasets before they could trigger failures. For high-risk jobs, human review remained part of the workflow, ensuring pipelines stayed stable and predictable.
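The settings mentioned above map onto standard Spark configuration keys. The sketch below collects a few of them and renders `spark-submit` arguments. The keys are real Spark options, but every value is an illustrative assumption, as are the helper `to_submit_args` and the script name `job.py`; the article does not publish the values Pinterest uses.

```python
# Illustrative values only; the article does not publish Pinterest's settings.
SPARK_CONF = {
    # Memory allocation per executor (heap plus off-heap overhead).
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",
    # Number of partitions used when shuffling data for joins and aggregations.
    "spark.sql.shuffle.partitions": "2000",
    # Largest table size, in bytes, that Spark will broadcast in a join.
    "spark.sql.autoBroadcastJoinThreshold": str(64 * 1024 * 1024),
    # Adaptive query execution: re-plan partitioning and skewed joins at runtime.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def to_submit_args(conf):
    """Render a config mapping as a flat list of --conf key=value arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# "job.py" is a placeholder script name.
print(" ".join(["spark-submit", *to_submit_args(SPARK_CONF), "job.py"]))
```

Keeping the settings in one mapping makes it easier to vary them per workload pattern than scattering flags across job definitions.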

Auto Memory Retries represented a major workflow shift. Jobs that previously failed due to memory exhaustion could now automatically restart with updated memory settings. This automation eliminated much of the manual tuning that had been consuming engineering time, allowing pipelines to finish without changing core job logic.
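The retry idea can be sketched as a loop that reruns a job with more memory each time it fails from memory exhaustion. This is a minimal sketch under stated assumptions, not Pinterest's implementation: the `OutOfMemoryFailure` exception, the `run_job` callable, the 1.5x growth factor, and the memory cap are all hypothetical.

```python
class OutOfMemoryFailure(Exception):
    """Hypothetical signal that a run failed because it ran out of memory."""


def run_with_memory_retry(run_job, initial_mb, growth=1.5, max_mb=65536, max_attempts=3):
    """Run run_job(memory_mb), retrying with more memory after an OOM failure.

    Returns (result, memory_mb_used, attempts). Re-raises the OOM failure once
    attempts are exhausted or memory can no longer grow past the cap.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    memory_mb = initial_mb
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job(memory_mb), memory_mb, attempt
        except OutOfMemoryFailure:
            next_mb = min(int(memory_mb * growth), max_mb)
            if attempt == max_attempts or next_mb == memory_mb:
                raise
            memory_mb = next_mb


def fake_job(memory_mb):
    """Stand-in for a real job that needs roughly 10 GB to finish."""
    if memory_mb < 10000:
        raise OutOfMemoryFailure(f"ran out of memory at {memory_mb} MB")
    return "ok"


print(run_with_memory_retry(fake_job, 8192))  # ('ok', 12288, 2)
```

The job logic itself is untouched; only the memory setting changes between attempts, which matches the article's point that pipelines could finish without changes to core job logic.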

The rollout was staged carefully. Engineers started with ad hoc jobs, ramping from 0% to 100%, and then moved to scheduled jobs, beginning with lower-priority tiers and eventually applying the feature to critical workloads. A dashboard tracked key metrics, including recovered jobs, cost savings, memory (MB) and vCore-seconds saved, and post-retry failures. This staged approach allowed the team to catch issues early, ensure reliability, and fine-tune retries before full deployment.
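One common way to ramp a feature from 0% to 100% is to hash each job ID into a stable bucket and enable the feature for buckets below the current percentage. The sketch below assumes that approach; the article does not say how Pinterest selected jobs, and the tier names and percentages are hypothetical.

```python
import hashlib


def rollout_bucket(job_id):
    """Map a job id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def retry_enabled(job_id, tier, percent_by_tier):
    """Return True if the job falls inside its tier's current rollout percentage."""
    return rollout_bucket(job_id) < percent_by_tier.get(tier, 0)


# Hypothetical tiers: ad hoc jobs fully enabled, scheduled tiers still ramping.
percent_by_tier = {"adhoc": 100, "scheduled-low": 25, "scheduled-critical": 0}

for job_id, tier in [("job-a", "adhoc"), ("job-b", "scheduled-low"), ("job-c", "scheduled-critical")]:
    print(job_id, tier, retry_enabled(job_id, tier, percent_by_tier))
```

Because the bucket depends only on the job ID, a job that is enabled at 25% stays enabled as the percentage grows, so raising the percentage only ever adds jobs.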

Along the way, teams learned important operational lessons, including improving scheduler performance for large TaskSets, handling custom resource profiles for Apache Gluten compatibility, and adjusting host failure exclusions so OOM failures no longer blocked retries. Future work includes proactive memory increases, where tasks in high-risk stages receive extra memory before failing, further reducing retries and cluster overhead.
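The host-exclusion lesson can be illustrated with a small tracker that counts task failures per host but ignores failures caused by memory exhaustion, so that OOM failures never push a host over the exclusion threshold. This is a conceptual sketch only: the class `HostExclusionTracker`, its threshold, and the `is_oom` flag are assumptions, and it does not reproduce Spark's scheduler code or Pinterest's change.

```python
from collections import Counter


class HostExclusionTracker:
    """Count failures per host, skipping OOM failures (hypothetical sketch)."""

    def __init__(self, max_failures_per_host=2):
        self.max_failures_per_host = max_failures_per_host
        self._failures = Counter()

    def record_failure(self, host, is_oom):
        """Record a task failure; OOM failures are not held against the host."""
        if is_oom:
            return  # memory exhaustion says little about the host's health
        self._failures[host] += 1

    def is_excluded(self, host):
        """Return True once a host has reached the failure threshold."""
        return self._failures[host] >= self.max_failures_per_host


tracker = HostExclusionTracker()
for _ in range(3):
    tracker.record_failure("host-a", is_oom=True)   # OOMs: host stays usable
for _ in range(2):
    tracker.record_failure("host-b", is_oom=False)  # other failures: host excluded
print(tracker.is_excluded("host-a"), tracker.is_excluded("host-b"))  # False True
```

Without the OOM check, a few memory-related failures could exclude every healthy host a retry might land on, which is the blocking behavior the article describes.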

About the Author

Leela Kumili

Leela is a Lead Software Engineer at Starbucks with deep expertise in building scalable, cloud-native systems and distributed platforms. She drives architecture, delivery, and operational excellence across the Rewards Platform, leading efforts to modernize systems, improve scalability, and enhance reliability.

In addition to her technical leadership, Leela serves as an AI Champion for the organization, identifying opportunities to improve developer productivity and workflows using LLM-based tools and establishing best practices for AI adoption. She is passionate about building production-ready systems, enhancing developer experience, and mentoring engineers to grow in both technical and strategic impact. Her interests include platform engineering, distributed systems, developer productivity, and bridging technical solutions with business and product goals.
