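Pinecone · May 26, 2026, 09:00 · about a 25-minute read

Inside AskData: How We Cut Token Consumption by More Than 90%

#RAG #VectorDatabase #LLMAgents #TokenOptimization

TL;DR

Pinecone published a case study showing that rebuilding its internal data agent, AskData, on Pinecone Nexus cut token usage by more than 92% and query turns by 78%.

AI in-depth analysis · June 12, 2026, 23:06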


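Key points

Fragmented knowledge was the bottleneck

Business context (definitions and rules) lived in unstructured sources such as Slack and the CRM rather than in the warehouse's structured data, and it could not be brought together, so analysts became the bottleneck for decisions.

Rebuild on Pinecone Nexus and a dramatic improvement

After AskData was rebuilt on Pinecone Nexus in May 2026, token usage fell by more than 92% and query turns fell by 78%.

From RAG to a trustworthy agent

AskData evolved from a plain SQL query runner into a reliable AI agent that reasons over an accumulated knowledge base and accurately maps business vocabulary to data definitions.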

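Impact analysis

This case lays out a concrete architecture for achieving both cost efficiency and reliability: rather than using an LLM only as a query generator, it consolidates an organization's tacit knowledge and context in a vector database. The sharp drop in token consumption in particular lowers the economic barrier to operating data analysis agents at scale, and it supports the view that RAG (retrieval-augmented generation) design is shifting toward "knowledge infrastructure" across the industry.

Editorial comment

The "90% reduction in token consumption" figure is an important cost-optimization metric for LLM agent implementations. It suggests that architecture built on knowledge infrastructure (Nexus), not just a model swap, is a deciding factor in whether such a system succeeds in production.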


Original article

Data underwrites how products evolve and how companies make decisions. But pulling answers out of a data warehouse, quickly, correctly, with the right business context baked in, is still harder than it should be.

As Pinecone grew into a multi-product, multi-channel business, static dashboards stopped being enough. The questions that actually drive decisions, about pipeline health, retention risk, product adoption, revenue mix, rarely fit neatly into a pre-built view. Analysts became the bottleneck. Ad-hoc questions went unasked. Decisions got made on stale numbers or gut feel.

To close that gap, we built AskData: an in-house AI data agent that explores and reasons over our warehouse, informed by the accumulated knowledge of how our business actually operates. That knowledge is scattered across Slack threads, call transcripts, CRM records, billing systems, and internal documents. Surfacing all of it alongside structured warehouse data is what makes the difference between a query runner and an agent people actually trust.

In May 2026, we rebuilt AskData on top of Pinecone Nexus. Token usage dropped by over 92%. Query turns dropped by 78%.

This is the story of how we got there.

The Last Mile is a Knowledge Problem

Pinecone's data stack is a pretty standard setup. Events land in BigQuery. dbt transforms them. Mart tables feed dashboards on top.

The pipelines work. The dashboards work. The gap is the last mile: translating questions phrased in business vocabulary into the right table, the right column, the right filter, and the right caveat, and doing it at scale.

That is a knowledge problem, not a data problem.

The data lives in the warehouse. The meaning of the data lives somewhere else. Which view is canonical for ARR? Which metric has how much lag? Which accounts should be filtered out? When did the definition change last quarter? None of that is discoverable from schema inspection.

For analysts working in the warehouse every day, this is another normal day. For everyone else, the cost of self-service is high enough that most ad-hoc questions never get asked at all. Decisions get made on stale dashboards or gut feel. Analysts become the bottleneck for every cross-functional question.

That's the gap AskData had to close.

V0 — Throw it into Claude/Cursor, See what happens

The obvious starting point was to wire a set of tools to BigQuery, dbt and a few internal docs and hand them directly to local coding agents like Claude or Cursor. Many internal users tried this in late 2025.

The agent loop itself was not the problem. Given enough business context fed by hand in each session, the coding agent could read SQL, reason about transformation, and replicate metric definitions well enough. The problem was everything else.

Same question, different answers. Two people asking the same question would walk away with different SQL, different filters, and different numbers. When an agent that reports critical business metrics, the ones meant to drive decisions and align mental models, is inconsistent, decisions stop. We needed one canonical answer to "what does ARR mean" and a way for a correction caught by one person to reach the next person immediately. We needed centralized knowledge management and an agent harness that could deliver consistency and reproducibility.

No shared learning. The questions users ask range from "what's last month's revenue" to "explain why this account is at risk and what to do about it." Those questions require very different levels of reasoning and intelligence. Figuring out which model tier fits which kind of question is challenging work that you want to do once, for everyone.

No feedback loop. Without centralized tooling, there was no eval to run regressions against, no production observability and no signal about which questions were tripping up the agent.

Orientation tax on every session. Each fresh session starts cold. Schemas alone do not carry business meaning, so the agent has to blind-traverse the unstructured context (dbt code, Slack threads, analyst notes, query history) from scratch on every question, burning tokens orienting before it can answer.

AskData V1: Building the Knowledge Layer

The agent loop wasn't the hard part. The hard part was the layer above the SQL, the one that holds what the SQL formulas and numbers actually mean for the business.

A traditional 'semantic layer' holds schema and metric definitions, typically manually maintained descriptions of structured data which slowly drift out of sync with your business. What we needed was a knowledge layer that holds the unstructured context (Slack, Gong, dbt comments, docs) that explains why a metric is defined the way it is, and *when* that definition last changed.

That knowledge layer had to bridge the vocabulary gap between how people ask questions and how SQL expresses logic. Raw dbt SQL didn't work. SQL encodes transformations, not meaning. The question "how many monthly active orgs do we have" embeds as a vector that has almost nothing in common with a count(distinct …) expression over an is_active flag.

Across business questions like "*how is our ARR trending*" or "*did our service have any outages this month*," LLM-summarized markdown describing the relevant warehouse table scored at least 2X higher cosine similarity to the question than the raw SQL that defines or queries the same table. Same data, different vocabulary. Embedding models alone couldn't close the gap.
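To make the vocabulary gap concrete, here is a minimal sketch that compares a question against a plain-language summary and against raw SQL. It uses toy bag-of-words vectors rather than a real embedding model, and the summary text, SQL text, table name, and helper names are all illustrative assumptions, so the absolute scores mean nothing; only the ordering illustrates the point.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphabetic tokens (a toy stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


question = "how many monthly active orgs do we have"

# Hypothetical LLM-written summary of a warehouse table.
summary = (
    "This table tracks monthly active orgs. An org is active in a month "
    "if it sent at least one request. Use it to answer how many orgs we have."
)

# Hypothetical raw SQL over the same table.
raw_sql = (
    "select date_trunc(event_date, month) as m, count(distinct org_id) as n "
    "from analytics.org_activity where is_active group by 1"
)

q = bag_of_words(question)
sim_summary = cosine_similarity(q, bag_of_words(summary))
sim_sql = cosine_similarity(q, bag_of_words(raw_sql))

print(f"question vs summary: {sim_summary:.2f}")
print(f"question vs raw SQL: {sim_sql:.2f}")
```

The summary shares the question's words ("monthly", "active", "orgs"), while the SQL shares almost none of them, which is the same effect the text describes for real embeddings.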

So we started writing the knowledge articles. A few high-quality hand-written markdown files at first, then scripts that used LLMs to generate more from dbt models and query logs, then a Curator agent whose only job was to investigate gaps and propose edits. By the time V1 stabilized, the KB was 234 markdown files (18,000 lines) served by Pinecone Assistant. Five additional retrieval surfaces (Slack threads, Gong calls, historical SQL, dbt source) ran on Pinecone vector indexes with integrated inference. Hosted embedding and reranking meant no embedding pipeline to manage, and the retrieval substrate was Pinecone end-to-end. The five ETL pipelines feeding it hadn't been tuned beyond "it runs daily."

V1 launched in #ask-data. Three months in:

| Stat | Value |
| --- | --- |
| Questions answered | 3,690 |
| Slack channels with active threads | 40 |
| Follow-up runs (chained conversations) | ~49% |
| Avg SQL queries per run | 2.2 |
| Questions per day (May 11) | 191 |

The surprise wasn't volume. It was the 49% follow-up rate. People were having conversations with the data: adjusting scope, drilling into a result, comparing cohorts. The bar for asking dropped, and the long tail of small questions started showing up. This is the gap BI tools have spent years trying to close with drill-downs and other ad-hoc explore primitives.

A 24/7 data analytics agent that gets the basics right reshapes how a company decides.

Where V1 hit its limits

The retrieval substrate was Pinecone end-to-end, but the agent's view of it was anything but unified.

By the time V1 stabilized, the system had grown to:

22 tools across two agents (DataAgent + Curator).
6 dedicated retrieval surfaces (Pinecone Assistant + three Pinecone indexes + dbt file reads + historical SQL search).
1,300 lines of Airflow code syncing Slack, Gong, and BigQuery logs into Pinecone every day.
2,200 lines of Curator code maintaining 18,000 lines of hand-curated markdown across 234 files.
A system prompt that grew with the agent, explaining when to use search_kb versus search_slack versus search_query_logs versus grep_dbt, how to dedupe across them, how to handle dbt's ref() macros that don't match a literal grep.

Each backend brought its own client, schema, embedding strategy, retry logic, and ETL pipeline. The Curator existed because the KB couldn't maintain itself. There was no layer underneath that compiled the parts into something coherent; cross-source synthesis happened at query time, by the agent, on every question.
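The ref() issue mentioned in the list above is easy to reproduce. The sketch below, with a made-up model file and a hypothetical helper name, shows why a literal search for a warehouse table name misses a dbt model that depends on it through `{{ ref('...') }}`, and how a small regular expression recovers the dependency. It only handles the simple quoted-argument forms of ref().

```python
import re

# Hypothetical dbt model source; the model and column names are made up.
DBT_MODEL_SQL = """
select
    o.org_id,
    sum(r.amount_usd) as revenue_usd
from {{ ref('dim_orgs') }} as o
join {{ ref("fct_revenue_daily") }} as r
    on r.org_id = o.org_id
group by 1
"""

# Matches {{ ref('model') }} and {{ ref('package', 'model') }}, capturing the last argument.
REF_PATTERN = re.compile(
    r"\{\{\s*ref\(\s*(?:['\"][^'\"]+['\"]\s*,\s*)?['\"]([^'\"]+)['\"]\s*\)\s*\}\}"
)


def referenced_models(sql: str) -> list[str]:
    """Return the model names a dbt SQL file depends on via ref()."""
    return REF_PATTERN.findall(sql)


# A literal search for the fully qualified table name finds nothing,
# because dbt resolves ref() to a relation only at compile time.
literal_hit = "analytics.dim_orgs" in DBT_MODEL_SQL

print(literal_hit)
print(referenced_models(DBT_MODEL_SQL))
```

In V1 this kind of special-case knowledge had to live in the system prompt; a compiled knowledge layer can resolve it once, ahead of query time.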

That cost showed up in the token traces. A multi-part question ("what was the total pipeline amount, opportunity count, and weighted pipeline for opportunities qualified in January") took 9 steps and around 240,000 tokens. The agent fanned out across KB searches, dumped a 292-column schema JSON into context, re-searched twice to find the right date column, ran a DISTINCT query just to learn the vocabulary, and finally got SQL right on its 4th attempt. 7 of those 9 steps were spent orienting (which table, which column, which filter) before the actual analysis could begin.

A compiler doesn't re-parse its source on every run. Without a knowledge layer underneath, agent infrastructure was doing exactly that.

That's what Nexus had to fix.

What Nexus had to be

Pinecone Nexus was being designed in parallel with V1, and AskData was the workload it had to support first. A few asks that came directly from V1's pain stuck:

One curation pipeline, many sources. A single managed system that takes structured, semi-structured, and unstructured inputs from every source, and produces task-specific views and artifacts the agent can leverage in one call. For AskData that meant natural-language-to-SQL semantics: which table, which column, which pattern. Not five ETL pipelines, not five retrieval surfaces. One.

Adaptive knowledge representation. The artifact's schema and representation shouldn't require human design upfront. It should evolve organically based on the task at hand, driven by the eval signal and source data, not by a fixed ontology or hand-authored template.

Human-in-the-loop knowledge updates. The Curator agent's actual job (investigate a gap, propose an edit, get it reviewed) had to survive into Nexus, not as a separate agent but as a first-class feedback mechanism.

Nexus shipped against those requirements. The architecture and primitives (Context Compiler, KnowQL, and the rest) are their own story; for that, see the Nexus deep dive.

The migration

The migration started with defining the eval, since we needed clear retrieval outcomes for Nexus's build loop. V1 had no clean contract; the agent just drove the context expansion from six tools and stitched them together itself.

We built the eval from V1 production traces. Each question had a full call log: which tools the agent fired, what each tool returned, which chunks the agent ultimately used to write SQL. For each question, we extracted the minimum context payload that would have let the agent get the SQL right on the first attempt. Those payloads became the expected outputs in the eval. The eval set was the target Nexus's build loop optimized toward.

Sample Eval Question:

{
    "id": "difficulty_hard_minimum_fee_revenue",
    "input": "I need comprehensive context to write a correct BigQuery SQL query for: analyzing how much minimum fee revenue came from enterprise vs standard plan customers in a given month.\nReturn ALL relevant information as structured markdown:\n1. **Tables**: Which tables to use (full names like analytics.table_name)\n2. **Key Columns**: Every column needed with its role\n3. **Methodology**: Calculation logic with SQL snippets\n4. **Anti-patterns**: Common mistakes with wrong vs. right examples\n5. **Exclusions/Filters**: Required WHERE clauses\nBe exhaustive. Include SQL code blocks. This context directly determines SQL correctness.",
    "expected_output": "Must mention revenue columns that distinguish minimum fee or platform fee, plan or tier-based segmentation, and dim_orgs as source table to use. Must mention column platform_minimum_fee_1d. Must mention revenue data latency as a caveat. Must mention minimum fee is not final until month end.",
    "match_type": "llm_judge"
}
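As a rough illustration of how such an eval case might be consumed, the sketch below loads a trimmed copy of the case and applies a simple keyword check. This is not the LLM judge the case specifies (match_type is llm_judge); it is a deliberately crude stand-in, and the required-identifier list, the candidate brief, and the function name are assumptions derived from the expected_output text.

```python
import json

# Trimmed copy of the sample eval case; "input" is shortened here for readability.
EVAL_CASE_JSON = """
{
  "id": "difficulty_hard_minimum_fee_revenue",
  "input": "I need comprehensive context to write a correct BigQuery SQL query ...",
  "expected_output": "Must mention revenue columns that distinguish minimum fee or platform fee, plan or tier-based segmentation, and dim_orgs as source table to use. Must mention column platform_minimum_fee_1d. Must mention revenue data latency as a caveat. Must mention minimum fee is not final until month end.",
  "match_type": "llm_judge"
}
"""

# Identifiers the expected_output names explicitly. A real judge would also
# assess the prose requirements (latency caveat, month-end finality).
REQUIRED_IDENTIFIERS = ("dim_orgs", "platform_minimum_fee_1d")


def missing_identifiers(brief: str) -> list[str]:
    """Return required identifiers that the retrieved brief fails to mention."""
    lowered = brief.lower()
    return [name for name in REQUIRED_IDENTIFIERS if name not in lowered]


case = json.loads(EVAL_CASE_JSON)

# Hypothetical retrieval output, written only to exercise the check.
candidate_brief = (
    "Use analytics.dim_orgs joined to daily revenue. "
    "Minimum fee revenue is in platform_minimum_fee_1d."
)

print(case["id"], case["match_type"])
print(missing_identifiers(candidate_brief))
print(missing_identifiers("Use the revenue table."))
```

A keyword check like this can only catch missing identifiers; judging whether the caveats are stated correctly is why the case asks for an LLM judge.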

The plan was straightforward: stand up two Nexus Contexts serving different retrieval goals: one for the semantic layer (schemas + dbt source + SQL patterns), one for customer context (Slack threads + Gong call transcripts).

Replace the six retrieval tools with two: search_semantic_layer and search_customer_context. Delete the Curator agent and replace its KB-amendment function with a single semantic_layer_feedback tool that submits corrections back to Nexus for knowledge amendments. Re-point the ETL to feed sources into Nexus Contexts with minimal prep and transformation, instead of one dedicated custom pipeline per legacy Pinecone index.
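A minimal sketch of what the reduced tool surface might look like from the agent's side. The three tool names come from the text above; the function signatures, return shapes, and registry are assumptions, and the bodies are stubs rather than calls to any real Nexus API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    handler: Callable[[str], str]


def search_semantic_layer(question: str) -> str:
    """Stub: would return a compiled brief (tables, columns, filters, SQL patterns)."""
    return f"[semantic-layer brief for: {question}]"


def search_customer_context(question: str) -> str:
    """Stub: would return relevant Slack threads and Gong call excerpts."""
    return f"[customer context for: {question}]"


def semantic_layer_feedback(correction: str) -> str:
    """Stub: would submit a correction for review as a knowledge amendment."""
    return f"[feedback queued: {correction}]"


TOOLS = {
    tool.name: tool
    for tool in (
        Tool("search_semantic_layer", "Schemas, dbt source, SQL patterns", search_semantic_layer),
        Tool("search_customer_context", "Slack threads, Gong call transcripts", search_customer_context),
        Tool("semantic_layer_feedback", "Submit corrections to the knowledge layer", semantic_layer_feedback),
    )
}


def call_tool(name: str, argument: str) -> str:
    """Dispatch a tool call by name, as an agent loop would."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name].handler(argument)


print(call_tool("search_semantic_layer", "weighted pipeline for January"))
```

With only two retrieval entry points, the system prompt no longer has to explain how to choose among and dedupe across six of them.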

Three weeks later, V2 hit full parity with V1 on our regression set: 68.3% on both. The set is 101 questions across 14 business domains, weighted toward the hard, context-heavy cases we use to benchmark AskData internally. The point of the migration wasn't to lift accuracy. The goal was matching accuracy on a fraction of the budget, and validating Nexus as the substrate underneath: fewer tokens, less code, much shorter path to standing up a system like this in the first place.

We also measured a third baseline: a fresh Claude Code session pointed at the dbt repo and given the bq CLI. No KB, no Nexus, no AskData tooling.

On 20 questions stratified across our 14 domains, it matched V1/V2's accuracy (68.3%). It just paid for it: ~625K average input tokens per question (~16× V2), 21 turns (~4.6× V2), $0.35 per question (~5× V2 with caching normalized). The orientation tax is real. Even a capable coding agent can find the right answer; rebuilding context from raw SQL on every question is what costs.

| Metric | V0 (Claude Code raw) | V1 | V2 | V0 → V2 |
| --- | --- | --- | --- | --- |
| Avg input tokens per question | ~625,287 | 64,008 | 39,595 | -93.7% |
| Avg turns per question | ~21 | 4.5 | 4.6 | -78% |
| Avg cost per question | $0.35 | $0.20 | $0.07 | -80.2% |
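The percentages in the table and the surrounding text can be checked with a few lines of arithmetic. The figures below are copied from the table; the helper name is ours. Note that the rounded per-question costs give exactly 80% rather than the reported 80.2%, which presumably reflects unrounded costs.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


# (V0, V1, V2) figures from the table above.
input_tokens = (625_287, 64_008, 39_595)
turns = (21, 4.5, 4.6)
cost_usd = (0.35, 0.20, 0.07)

print(f"input tokens V0 -> V2: {pct_reduction(input_tokens[0], input_tokens[2]):.1f}%")
print(f"input tokens V1 -> V2: {pct_reduction(input_tokens[1], input_tokens[2]):.1f}%")
print(f"turns V0 -> V2: {pct_reduction(turns[0], turns[2]):.1f}%")
print(f"cost V0 -> V2: {pct_reduction(cost_usd[0], cost_usd[2]):.1f}%")
print(f"V0 / V2 input-token ratio: {input_tokens[0] / input_tokens[2]:.1f}x")
```

The V1 to V2 line is the 38% input-token drop discussed in the text, and the final ratio is the "~16× V2" figure.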

What the Numbers Actually Mean

The per-step view is where the change is structural, not incremental.

In V1, the first LLM call after retrieval averaged about 10,000 input tokens with a wide p25–p90 spread. Every question got a differently-sized pile of concatenated retrieval chunks. In V2, the same call averaged 6,000 tokens with a much tighter spread. Every question had the same focused context. By step 3, V1 was sitting at 21,000 tokens against V2's 10,500. That compounding is the 38% input-token drop.

Output tokens went up by 22%. The agent wrote more SQL per run (1.8 queries on average versus 1.3 in V1) because it had the context to write SQL immediately, instead of burning budget orienting.

Simple lookup: a daily-revenue ranking question. V1 burned 74,000 tokens over 4 steps: three parallel retrievals injected 10,000 tokens of raw KB, a get_schema call returned a 478-column JSON, and context climbed past 30,000 tokens before any SQL ran. V2 burned 11,000 tokens over 3 steps: search_semantic_layer returned a 1,200-token brief naming the exact column, the date column, the filter to exclude internal orgs, an anti-pattern warning about the trailing-window variant, and an example query. No get_schema call was needed. Roughly one-seventh of the budget for the same answer.

Multi-step pipeline question: the pipeline question from earlier. V1: 240,000 tokens, 9 steps, 4 SQL attempts spent discovering the data model. V2: 35,000 tokens, 5 steps, 2 SQL attempts. Nexus named the right mart table directly in step 0, gave the weighted-pipeline formula, and provided the correct stage filter. One SQL attempt failed on a typo, and the agent fixed it.

Unstructured context awareness: One class of questions has no SQL-only answer: "why is this account's usage dropping?" While a schema-based semantic layer would get the query right, the real explanation lives in unstructured sources. search_customer_context surfaces a support thread and a Gong call showing the account is mid-migration from pods to serverless and the actual usage drop is healthy and expected. *The structured number alone says anomaly but the unstructured sources contextualize the meaning to the business.*

Deletions are What Matter

The Curator agent (2,194 lines, 9 tools, its own system prompt) is gone. The 234 hand-curated KB files became Nexus curation artifacts. The KB-generation scripts were replaced by Nexus's build process. The Airflow ETL was repurposed to feed Nexus Contexts instead of legacy Pinecone indexes. Across three repos, 25,000 lines went away. The agent itself got smaller; tool count dropped from 22 to 10.

A few things we took away:

Compile once, read many. Moving synthesis from query time to compile time produced the 38% input-token drop. Nexus spends ~8,000 tokens once to give the agent a brief that saves it ~24,000 downstream. Roughly three-to-one.

Fewer tools, better agents. Every exposed tool is a branch in the agent's decision tree. Six tools collapsing to two cut the input-token spread by ~4×. Fewer choices, better choices, more SQL per run.

Curation has to be iterative, with the agent in the loop. A single-pass build can't match months of human refinement. Incremental curation + agent-driven feedback give Nexus the same quality ratchet the KB had.

Structured data tells you what but unstructured data tells you *why*. The warehouse gives you the number but the business rationale behind it (e.g. why a metric dropped, how ARR is actually defined) lives in unstructured sources. AskData's accuracy came from treating those unstructured sources as first-class inputs to create the knowledge layer for data.
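The compile-once idea can be sketched in a few lines. In this toy model, query-time synthesis re-reads every source for every question, while a compiled context reads the sources once at build time and serves lookups afterwards. The class and source names are hypothetical and the "synthesis" is a trivial string join; this illustrates only the cost shape, not how Nexus's Context Compiler works internally.

```python
class CountingSources:
    """Toy source collection that counts how often each source is read."""

    def __init__(self, sources: dict[str, str]) -> None:
        self._sources = sources
        self.reads = 0

    def read_all(self) -> list[str]:
        self.reads += len(self._sources)
        return list(self._sources.values())


def synthesize_at_query_time(sources: CountingSources, question: str) -> str:
    """V1-style: re-derive context from raw sources on every question."""
    return f"{question} :: " + " | ".join(sources.read_all())


class CompiledContext:
    """V2-style: synthesize once at build time, then serve the stored brief."""

    def __init__(self, sources: CountingSources) -> None:
        self._brief = " | ".join(sources.read_all())

    def query(self, question: str) -> str:
        return f"{question} :: {self._brief}"


raw = {"dbt": "arr model notes", "slack": "arr definition thread", "sql": "historical arr queries"}
questions = ["how is ARR trending", "what is ARR by plan", "ARR last month"]

query_time_sources = CountingSources(raw)
for q in questions:
    synthesize_at_query_time(query_time_sources, q)

compile_time_sources = CountingSources(raw)
context = CompiledContext(compile_time_sources)
for q in questions:
    context.query(q)

print("query-time source reads:", query_time_sources.reads)
print("compile-time source reads:", compile_time_sources.reads)
```

The query-time count grows with the number of questions, while the compile-time count stays fixed at one pass over the sources.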

V1 proved that a competent internal data agent is feasible and changes how a company decides. V2 proved that with a managed knowledge layer underneath, building and maintaining that agent stops being a months-long bespoke project. Same accuracy, roughly 38% fewer input tokens than V1, simpler code.

Build your own

We published a demo so you can try this yourself.

nexus-analyst-demo is a stripped-down version of the agent, with two agents in one repo:

agent-nexus/ is the Nexus version. One retrieval tool (nexus_query). The agent asks a natural-language question against a Nexus Context and gets back a compiled brief.
agent-classic/ is the baseline. Three retrieval tools (search_schema, search_docs, search_notebooks) over a plain Pinecone vector index, with manual chunking and per-source filtering. Roughly the shape AskData had before the migration.

Otherwise the two agents are identical: same loop, same SQL executor, same chart and file tools, same Gemini inference. The only difference is the retrieval substrate.
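A sketch of what "identical except for the retrieval substrate" could look like in code: one agent function that takes the retrieval step as a parameter. The tool names nexus_query, search_schema, search_docs, and search_notebooks come from the demo description; everything else (function signatures, stub return values, the run_agent name) is assumed for illustration and does not reflect the demo repo's actual code.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]


def nexus_retriever(question: str) -> list[str]:
    """Nexus-style: one call returns a single compiled brief (stubbed)."""
    return [f"nexus_query brief for: {question}"]


def classic_retriever(question: str) -> list[str]:
    """Classic-style: three searches whose results the agent must stitch together (stubbed)."""
    return [
        f"search_schema hits for: {question}",
        f"search_docs hits for: {question}",
        f"search_notebooks hits for: {question}",
    ]


def run_agent(question: str, retrieve: Retriever) -> dict:
    """Shared loop: retrieve context, then hand it to the (omitted) SQL-writing step."""
    context = retrieve(question)
    return {
        "question": question,
        "context_chunks": len(context),
        "prompt": "\n".join(context + [f"Question: {question}"]),
    }


question = "Which accounts churned last quarter?"
for name, retriever in (("nexus", nexus_retriever), ("classic", classic_retriever)):
    result = run_agent(question, retriever)
    print(name, result["context_chunks"])
```

Keeping everything but the retriever fixed is what lets the demo attribute any difference in tokens or turns to the retrieval layer alone.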

The companion repo, nexus-analyst-demo-ingest, is the data side. It shows how we prepared Nexus for the knowledge layer. It builds the Nexus Context for Acme, a fictional SaaS company: structured warehouse data alongside unstructured organizational knowledge (Slack threads, Gong-style transcripts, runbooks, postmortems). Two scripts sit side by side: upload_to_nexus.py builds the Nexus Context end-to-end (sources + an eval set + a build.md), and index_pinecone.py chunks the same sources into a plain Pinecone index for the classic agent. Both pipelines, one repo. The Nexus side is under 1,000 lines of glue. There's no per-source ETL to tune, no chunking strategy to optimize, no embedding model to pick. You hand Nexus the sources, an eval set, and a build spec; the Context Compiler does the rest.

An eval set of 30 hard multi-part questions (sentinel-value gotchas, schema drift, cross-source joins across Acme's warehouse) runs against both agents and surfaces the gap directly.

A live demo is available at https://nexus-analyst-web.vercel.app/ if you want to try it before standing it up yourself. Pick a question and watch the baseline agent stitch together multiple searches and re-derive context on every call, while the Nexus agent does it in one.

If you want to build something like this for your own internal systems, sign up for Pinecone Nexus early access.
