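Building a Stable Fable 5 Traces Workflow in Colab: Tool-Call Parsing, Data Auditing, and Baseline Training
MarkTechPost has published a practical guide to building a stable Colab workflow around the Fable 5 Traces dataset on Hugging Face, covering how to parse, visualize, and train baselines on coding-agent trace data.
Key Points
A lightweight strategy for a stable Colab environment
The guide avoids heavyweight dependencies such as datasets and scikit-learn and instead downloads and parses the JSONL file manually, which lowers the risk of the notebook failing at startup.
Data quality and security auditing
In addition to normalizing tool calls and text outputs, it implements pattern matching that automatically flags secret-like strings such as API keys and passwords, which helps keep the data safe to handle.
Building predictive models from trace data
It trains pure-Python Naive Bayes baselines to test whether context information can predict the assistant's output type and tool usage.
Detailed data visualization and analysis
It plots the distributions of output types, tools used, source roots, and text lengths, giving a quantitative view of how the coding agent behaves and where the data is skewed.
Lightweight Colab setup and consistency
The workflow loads only the minimum packages it needs and defines parameters such as data paths and the random seed up front to keep runs reproducible.
Helper functions for safe data handling
It sets up helpers for JSON formatting, redaction of secret-like strings, missing-value handling, and cleanup of text previews.
Flexible tool-call parsing logic
It implements automatic parsing of JSON strings and a recursive extraction function that handles varied key names (for example, tool_call and function_call) and nested structures, so different model output formats are processed uniformly.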
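Impact Analysis
This article offers concrete solutions to practical problems in data preprocessing and quality control for coding-agent research, namely dependency conflicts and security risks. In particular, its lightweight approach, which does not depend on large libraries, lowers the barrier for researchers and engineers who want to prototype quickly and analyze agent behavior.
Editor's Comment
Understanding how a coding agent works internally takes more than just running the model; the key is how safely and efficiently you handle the trace data that records its generation process. This article is a valuable practical guide that goes into the implementation details.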
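In this tutorial, we work with the Fable 5 Traces dataset from Hugging Face and build a complete workflow around real coding-agent trace data. We start by setting up a lightweight environment that avoids fragile dependencies such as datasets, scikit-learn, and scipy. Then we manually download and parse the merged JSONL file to keep the notebook stable in Colab.
From there, we inspect repository files, preview raw trace examples, normalize tool calls and text outputs, audit the dataset structure, and detect potential secret-like patterns. We also visualize key distributions, including output types, tools, source roots, and text lengths.
Finally, we create safe no-CoT (chain-of-thought) chat/SFT exports, build a simple keyword-search helper, and train pure-Python Naive Bayes baselines to assess whether trace context can predict the assistant's output type and tool usage.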
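The Naive Bayes code itself is not part of this excerpt. As a rough illustration of what a pure-Python baseline of this kind could look like, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. The class name `TinyNaiveBayes`, the whitespace tokenizer, and the toy training rows are all assumptions made for illustration; none of them come from the tutorial.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()              # training rows per label
        self.token_counts = defaultdict(Counter)   # label -> token -> count
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()          # simplistic tokenizer (assumption)
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_rows = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, row_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(row_count / total_rows)
            label_total = sum(self.token_counts[label].values())
            for token in tokens:
                count = self.token_counts[label][token]
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy rows invented for illustration: context text -> output type.
train_texts = [
    "run the tests in the repo",
    "list files in the directory",
    "explain what this function does",
    "summarize the change for the user",
]
train_labels = ["tool_use", "tool_use", "text", "text"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("run the tests")
print(prediction)
```

Working in log space avoids numeric underflow on long contexts, and the add-one term keeps unseen tokens from zeroing out a class. A real baseline would also need a held-out split to measure accuracy.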
Setting Up the Fable 5 Traces Colab Environment and Helpers
import os
import sys
import json
import re
import math
import random
import subprocess
from pathlib import Path
from collections import Counter, defaultdict


def install_packages():
    # Install only the lightweight packages this workflow needs.
    packages = [
        "huggingface_hub>=0.23.0",
        "rich>=13.0.0",
        "tqdm>=4.66.0",
    ]
    # check=False: a failed install does not abort the notebook here.
    subprocess.run(
        [
            sys.executable,
            "-m",
            "pip",
            "install",
            "-q",
            "-U",
            "--upgrade-strategy",
            "only-if-needed",
            *packages,
        ],
        check=False,
    )


install_packages()
import pandas as pd
import matplotlib.pyplot as plt

# NumPy is optional; the helpers below check for None before using it.
try:
    import numpy as np
except Exception:
    np = None

from tqdm.auto import tqdm
from rich import print as rprint
from rich.panel import Panel
from rich.table import Table
from huggingface_hub import HfApi, hf_hub_download
from IPython.display import display

DATASET_ID = "Glint-Research/Fable-5-traces"
FLAT_JSONL_FILENAME = "fable5_cot_merged.jsonl"
OUT_DIR = Path("/content/fable5_traces_tutorial_outputs")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Fix the random seeds so sampling is repeatable.
SEED = 42
random.seed(SEED)
if np is not None:
    np.random.seed(SEED)

MAX_PREVIEW_CHARS = 900
N_AGENT_TRACE_PREVIEWS = 2
N_SAFE_DATASET_PREVIEWS = 3
SAVE_COT_RESEARCH_EXPORT = False
MAX_ROWS_TO_LOAD = None  # None loads every row

rprint(
    Panel.fit(
        f"[bold]Fable 5 Traces Advanced Tutorial[/bold]\n"
        f"Dataset: {DATASET_ID}\n"
        f"Output directory: {OUT_DIR}\n"
        f"Manual JSONL loading: True\n"
        f"CoT research export enabled: {SAVE_COT_RESEARCH_EXPORT}",
        title="Setup",
    )
)
# Regexes for common secret-like strings (API keys, tokens, key=value credentials).
SECRET_PATTERNS = [
    r"sk-[A-Za-z0-9_\-]{20,}",
    r"hf_[A-Za-z0-9_\-]{20,}",
    r"github_pat_[A-Za-z0-9_]{20,}",
    r"ghp_[A-Za-z0-9]{20,}",
    r"xox[baprs]-[A-Za-z0-9\-]{20,}",
    r"AKIA[0-9A-Z]{16}",
    r"(?i:(api[_-]?key|secret|token|password)\s*[:=]\s*['\"]?[^'\"\s]{8,})",
]
SECRET_RE = re.compile("|".join(f"(?:{pattern})" for pattern in SECRET_PATTERNS))
# Tokens: identifiers of two or more characters, runs of path or dash
# characters, and runs of brackets or operators.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9]{1,}|[./\\-]{2,}|[{}()\[\]:=<>]+")


def safe_json_dumps(obj, max_chars=None):
    # Pretty-print as JSON, fall back to str(), and optionally truncate.
    try:
        text = json.dumps(obj, ensure_ascii=False, indent=2, default=str)
    except Exception:
        text = str(obj)
    if max_chars is not None and len(text) > max_chars:
        return text[:max_chars] + "\n... [truncated]"
    return text


def is_missing_scalar(value):
    # True for None and NaN-like scalars; containers are never "missing".
    if value is None:
        return True
    if isinstance(value, (list, dict, tuple, set)):
        return False
    try:
        return bool(pd.isna(value))
    except Exception:
        return False


def clean_for_json(value):
    # Recursively convert values into JSON-serializable Python types.
    if is_missing_scalar(value):
        return None
    if isinstance(value, dict):
        return {str(k): clean_for_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean_for_json(v) for v in value]
    if isinstance(value, tuple):
        return [clean_for_json(v) for v in value]
    if np is not None:
        if isinstance(value, np.integer):
            return int(value)
        if isinstance(value, np.floating):
            if math.isnan(float(value)):
                return None
            return float(value)
        if isinstance(value, np.ndarray):
            return value.tolist()
    return value


def redact_possible_secrets(text):
    # Replace every secret-like match with a fixed placeholder.
    if text is None:
        return ""
    text = str(text)
    return SECRET_RE.sub("[REDACTED_POSSIBLE_SECRET]", text)


def contains_possible_secret(text):
    if text is None:
        return False
    return bool(SECRET_RE.search(str(text)))


def preview_text(text, max_chars=MAX_PREVIEW_CHARS):
    # Redact, collapse whitespace, and truncate for display.
    text = redact_possible_secrets(text)
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) > max_chars:
        return text[:max_chars] + " ... [truncated]"
    return text
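We begin by setting up the Colab environment with only the lightweight packages needed for this workflow. We define the dataset path, output directory, random seed, preview limits, and export options so the tutorial behaves consistently. We also create the first set of helper functions for safe JSON formatting, secret redaction, missing-value handling, and clean text previews.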
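To see what the redaction helper does in practice, the following sketch repeats a subset of the tutorial's `SECRET_PATTERNS` so it can run on its own and applies it to a few sample strings. The sample strings are invented for illustration and contain no real credentials.

```python
import re

# Subset of the tutorial's SECRET_PATTERNS, repeated so the snippet runs alone.
SECRET_PATTERNS = [
    r"sk-[A-Za-z0-9_\-]{20,}",
    r"AKIA[0-9A-Z]{16}",
    r"(?i:(api[_-]?key|secret|token|password)\s*[:=]\s*['\"]?[^'\"\s]{8,})",
]
SECRET_RE = re.compile("|".join(f"(?:{pattern})" for pattern in SECRET_PATTERNS))


def redact_possible_secrets(text):
    if text is None:
        return ""
    return SECRET_RE.sub("[REDACTED_POSSIBLE_SECRET]", str(text))


# Invented sample strings; none of these are real credentials.
samples = [
    "export OPENAI_KEY=sk-" + "a" * 24,
    "password = hunter2hunter2",
    "plain log line with nothing sensitive",
]
redacted = [redact_possible_secrets(sample) for sample in samples]
for line in redacted:
    print(line)
```

Note that the generic key=value pattern replaces the key name along with the value, so the second sample collapses to the placeholder alone. Pattern matching of this kind is a heuristic: it can miss secrets in unfamiliar formats and can flag harmless strings.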
Building Parsing Utilities for Tool Calls and Text Outputs
def maybe_parse_json_string(value):
    # Parse strings that look like a JSON object or array; otherwise return as-is.
    if isinstance(value, str):
        stripped = value.strip()
        if (stripped.startswith("{") and stripped.endswith("}")) or (
            stripped.startswith("[") and stripped.endswith("]")
        ):
            try:
                return json.loads(stripped)
            except Exception:
                return value
    return value


def normalize_output_obj(value):
    return maybe_parse_json_string(value)


def extract_tool_name(output):
    output = normalize_output_obj(output)
    if isinstance(output, dict):
        # 1) Look for a tool name under the common top-level keys.
        direct_keys = [
            "name",
            "tool_name",
            "tool",
            "function",
            "command_name",
            "recipient_name",
            "toolName",
            "callee",
        ]
        for key in direct_keys:
            value = output.get(key)
            if isinstance(value, str) and value.strip():
                return value.strip()
        # 2) Recurse into nested call objects.
        nested_keys = [
            "tool_call",
            "toolCall",
            "function_call",
            "call",
            "action",
        ]
        for nested_key in nested_keys:
            nested = output.get(nested_key)
            if isinstance(nested, dict):
                found = extract_tool_name(nested)
                if found:
                    return found
        # 3) Fall back to "type" unless it is a generic label.
        output_type = output.get("type")
        if isinstance(output_type, str):
            output_type = output_type.strip()
            if output_type and output_type.lower() not in {"tool_use", "text", "message"}:
                return output_type
    return ""
def extract_tool_args(output):
    output = normalize_output_obj(output)
    if isinstance(output, dict):
        # 1) Return the first argument-like key that is present.
        direct_arg_keys = [
            "input",
            "args",
            "arguments",
            "parameters",
            "kwargs",
            "json",
            "payload",
        ]
        for key in direct_arg_keys:
            if key in output:
                return output[key]
        # 2) Recurse into nested call objects.
        nested_keys = [
            "tool_call",
            "toolCall",
            "function_call",
            "call",
            "action",
        ]
        for nested_key in nested_keys:
            nested = output.get(nested_key)
            if isinstance(nested, dict):
                args = extract_tool_args(nested)
                if args not in [None, "", {}]:
                    return args
        # 3) Otherwise treat every key that is not a name or type as an argument.
        ignored = {
            "name",
            "tool_name",
            "tool",
            "function",
            "command_name",
            "recipient_name",
            "toolName",
            "callee",
            "type",
        }
        return {key: value for key, value in output.items() if key not in ignored}
    return {}


def extract_text_payload(output):
    output = normalize_output_obj(output)
    if isinstance(output, str):
        return output
    if isinstance(output, dict):
        text_keys = [
            "text",
            "content",
            "message",
            "output",
            "value",
            "result",
        ]
        for key in text_keys:
            value = output.get(key)
            if isinstance(value, str):
                return value
            if isinstance(value, list):
                return safe_json_dumps(value)
            if isinstance(value, dict):
                nested = extract_text_payload(value)
                if nested:
                    return nested
        # No text-like key matched: serialize the whole object.
        return safe_json_dumps(output)
    return str(output)
def robust_len(value):
    # Character length of the string form; None counts as zero.
    if value is None:
        return 0
    return len(str(value))


def source_root(source_file):
    # Coarse project label: the path segment after a known marker directory,
    # else the parent directory, else the first segment.
    source_file = str(source_file or "").replace("\\", "/")
    if not source_file:
        return "unknown"
    parts = [part for part in source_file.split("/") if part]
    for marker in ["projects", "AIArchives", "archives", "claude"]:
        if marker in parts:
            idx = parts.index(marker)
            if idx + 1 < len(parts):
                return parts[idx + 1]
    if len(parts) >= 2:
        return parts[-2]
    if parts:
        return parts[0]
    return "unknown"


def write_jsonl(path, records):
    path = Path(path)
    with path.open("w", encoding="utf-8") as file:
        for record in records:
            file.write(json.dumps(clean_for_json(record), ensure_ascii=False, default=str) + "\n")


def save_plot(path):
    # Save the current figure, show it inline, then close it.
    path = Path(path)
    plt.tight_layout()
    plt.savefig(path, dpi=160, bbox_inches="tight")
    plt.show()
    plt.close()
    return path


def print_basic_table(title, rows, columns=("Metric", "Value")):
    table = Table(title=title)
    for column in columns:
        table.add_column(str(column))
    for row in rows:
        table.add_row(*[str(item) for item in row])
    rprint(table)


def tokenize(text, max_chars=12000):
    # Lowercase, cap the length, and split with TOKEN_RE.
    text = str(text or "")[:max_chars].lower()
    return TOKEN_RE.findall(text)


def load_jsonl_manual(path, max_rows=None):
    # Parse one JSON object per line; record bad lines instead of raising.
    records = []
    bad_lines = []
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in tqdm(enumerate(file, start=1), desc="Reading JSONL"):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except Exception as error:
                bad_lines.append(
                    {
                        "line_number": line_number,
                        "error": repr(error),
                        "preview": line[:500],
                    }
                )
            if max_rows is not None and len(records) >= max_rows:
                break
    return records, bad_lines
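We build the core parsing utilities that turn raw output fields into usable tool names, tool arguments, and text payloads. We also define helpers for measuring text length, identifying source roots, writing JSONL files, saving plots, and printing clean tables. We finish this snippet by adding tokenization and manual JSONL loading to avoid fragile dataset-loading dependencies.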
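The lookup order used for tool names (direct keys first, then recursion into nested call objects) can be seen on a few payloads. The sketch below is a trimmed-down stand-in for `extract_tool_name`, not the tutorial's function: `find_tool_name` is a hypothetical name, it checks only a subset of the keys, and it tries to parse any string as JSON. The payload shapes are invented to exercise each branch.

```python
import json

DIRECT_KEYS = ("name", "tool_name", "tool")              # subset of the tutorial's keys
NESTED_KEYS = ("tool_call", "function_call", "action")   # subset of the tutorial's keys


def find_tool_name(output):
    """Simplified lookup: direct keys first, then recurse into nested objects."""
    if isinstance(output, str):
        try:
            output = json.loads(output)
        except ValueError:
            return ""
    if not isinstance(output, dict):
        return ""
    for key in DIRECT_KEYS:
        value = output.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    for key in NESTED_KEYS:
        nested = output.get(key)
        if isinstance(nested, dict):
            found = find_tool_name(nested)
            if found:
                return found
    return ""


# Hypothetical payload shapes, invented to exercise each branch.
payloads = [
    {"type": "tool_use", "name": "Bash", "input": {"command": "ls"}},
    '{"function_call": {"name": "read_file", "arguments": {"path": "a.py"}}}',
    {"action": {"tool_call": {"tool": " grep "}}},
    "just a plain text answer",
]
names = [find_tool_name(payload) for payload in payloads]
print(names)
```

Returning an empty string rather than None for "no tool found" keeps the resulting column a plain string type, which makes later counting and filtering simpler.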
Inspecting the Hugging Face Repository and Loading JSONL Traces
rprint(Panel.fit("[bold]Inspecting Hugging Face dataset repository[/bold]"))
api = HfApi()
files = api.list_repo_files(repo_id=DATASET_ID, repo_type="dataset")
pi_trace_files = [
    file for file in files
    if file.startswith("pi-traces/") and file.endswith(".jsonl")
]
file_summary = {
    "total_repo_files": len(files),
    "jsonl_files": sum(file.endswith(".jsonl") for file in files),
    "pi_trace_files": len(pi_trace_files),
    "claude_files": sum(file.startswith("claude/") for file in files),
    "has_flat_jsonl": FLAT_JSONL_FILENAME in files,
}
print_basic_table(
    "Repository File Summary",
    [(key, value) for key, value in file_summary.items()],
)
rprint("[bold]Sample repository files:[/bold]")
for file in files[:20]:
    print(" -", file)

rprint(Panel.fit("[bold]Manual raw pi-trace preview[/bold]"))
pi_examples = []
if pi_trace_files:
    for trace_file in pi_trace_files[:N_AGENT_TRACE_PREVIEWS]:
        try:
            local_trace_path = hf_hub_download(
                repo_id=DATASET_ID,
                repo_type="dataset",
                filename=trace_file,
            )
            # Only the first record of each trace file is needed for a preview.
            trace_records, trace_bad_lines = load_jsonl_manual(local_trace_path, max_rows=1)
            if trace_records:
                example = trace_records[0]
                pi_examples.append(example)
                preview_payload = {
                    "trace_file": trace_file,
                    "keys": list(example.keys()),
                    "preview": example,
                }
                rprint(
                    Panel(
                        safe_json_dumps(preview_payload, max_chars=3000),
                        title=f"Raw pi-trace preview: {trace_file}",
                    )
                )
            if trace_bad_lines:
                rprint(
                    f"[yellow]Bad JSONL lines in {trace_file}: {len(trace_bad_lines)}[/yellow]"
                )
        except Exception as error:
            rprint(f"[yellow]Could not preview {trace_file}[/yellow]")
            rprint(repr(error))
else:
    rprint("[yellow]No pi-traces JSONL files found.[/yellow]")
rprint(Panel.fit("[bold]Downloading flat merged JSONL from Hugging Face Hub[/bold]"))
flat_path = hf_hub_download(
    repo_id=DATASET_ID,
    repo_type="dataset",
    filename=FLAT_JSONL_FILENAME,
)
rprint(f"[green]Downloaded flat file:[/green] {flat_path}")

rprint(Panel.fit("[bold]Loading flat JSONL manually[/bold]"))
records, bad_lines = load_jsonl_manual(flat_path, max_rows=MAX_ROWS_TO_LOAD)
if bad_lines:
    # Keep a record of unparseable lines for later inspection.
    bad_lines_path = OUT_DIR / "bad_jsonl_lines.json"
    with open(bad_lines_path, "w", encoding="utf-8") as file:
        json.dump(bad_lines, file, ensure_ascii=False, indent=2)
    rprint(f"[yellow]Bad JSONL lines found: {len(bad_lines)} -> {bad_lines_path}[/yellow]")

df = pd.DataFrame.from_records(records)
rprint(f"[green]Loaded rows:[/green] {len(df):,}")
rprint(f"[green]DataFrame shape:[/green] {df.shape}")
rprint("[bold]Columns:[/bold]")
print(list(df.columns))
display(df.head(3))

# Make sure every expected column exists so later steps do not fail on a missing key.
expected_cols = [
    "uid",
    "source_file",
    "session",
    "model",
    "context",
    "cot",
    "output_type",
    "output",
    "completion",
    "origin",
]
for column in expected_cols:
    if column not in df.columns:
        df[column] = None

# Derived columns: normalized output, tool fields, lengths, source root, secret flags.
df["output_norm"] = df["output"].map(normalize_output_obj)
df["tool_name"] = df["output_norm"].map(extract_tool_name)
df["tool_args"] = df["output_norm"].map(extract_tool_args)
df["text_payload"] = df["output_norm"].map(extract_text_payload)
df["context_chars"] = df["context"].map(robust_len)
df["cot_chars"] = df["cot"].map(robust_len)
df["completion_chars"] = df["completion"].map(robust_len)
df["text_payload_chars"] = df["text_payload"].map(robust_len)
df["source_root"] = df["source_file"].map(source_root)
df["possible_secret_in_context"] = df["context"].map(contains_possible_secret)
df["possible_secret_in_completion"] = df["completion"].map(contains_possible_secret)
df["possible_secret_anywhere"] = (
    df["possible_secret_in_context"] | df["possible_secret_in_completion"]
)
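We inspect the Hugging Face dataset repository and summarize the number of files, JSONL traces, and flat-merged files available. We manually preview a few raw Pi trace files to understand the structure without relying on the datasets library. We then download the merged JSONL file, load it into a DataFrame, and normalize key fields for later analysis.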
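The manual loading step collects malformed lines instead of aborting on them. The standard-library sketch below shows that behavior on a small temporary file. The function name `load_jsonl`, the file name, and the rows are invented for illustration, and the progress bar used in the tutorial is left out.

```python
import json
import os
import tempfile


def load_jsonl(path, max_rows=None):
    """Parse one JSON object per line; collect failures instead of raising."""
    records, bad_lines = [], []
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are skipped, not counted as errors
            try:
                records.append(json.loads(line))
            except ValueError as error:
                bad_lines.append({"line_number": line_number, "error": repr(error)})
            if max_rows is not None and len(records) >= max_rows:
                break
    return records, bad_lines


# Invented rows: two valid records, one blank line, one truncated record.
raw = (
    '{"uid": "a", "output_type": "text"}\n'
    "\n"
    '{"uid": "b", "output_type": "tool_use"}\n'
    '{"uid": "c", "outp\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.jsonl")
    with open(path, "w", encoding="utf-8") as file:
        file.write(raw)
    records, bad_lines = load_jsonl(path)

print(len(records), [bad["line_number"] for bad in bad_lines])
```

Recording the line number of each failure makes it possible to go back to the source file later, which is why the tutorial writes its bad lines to a separate JSON file.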
Auditing Dataset Structure and Visualizing Trace Distributions
# Headline audit metrics for the flat JSONL.
audit_rows = [
    ("rows", len(df)),
    ("columns", len(df.columns)),
    ("unique_uid", df["uid"].nunique(dropna=True)),
    ("duplicate_uid_rows", int(df["uid"].duplicated().sum())),
    ("unique_sessions", df["session"].nunique(dropna=True)),
    ("unique_models", df["model"].nunique(dropna=True)),
    ("missing_context", int(df["context"].isna().sum())),
    ("missing_cot", int(df["cot"].isna().sum())),
    ("missing_output", int(df["output"].isna().sum())),
    ("rows_with_possible_secret_pattern", int(df["possible_secret_anywhere"].sum())),
    ("median_context_chars", round(float(df["context_chars"].median()), 2)),
    ("median_cot_chars", round(float(df["cot_chars"].median()), 2)),
    ("median_completion_chars", round(float(df["completion_chars"].median()), 2)),
    ("max_completion_chars", int(df["completion_chars"].max())),
]
print_basic_table("Flat JSONL Audit", audit_rows)

rprint("\n[bold]Output type distribution:[/bold]")
display(df["output_type"].value_counts(dropna=False).to_frame("rows"))
rprint("\n[bold]Model distribution:[/bold]")
display(df["model"].value_counts(dropna=False).to_frame("rows").head(20))
rprint("\n[bold]Origin distribution:[/bold]")
display(df["origin"].value_counts(dropna=False).to_frame("rows"))
rprint("\n[bold]Top source roots:[/bold]")
display(df["source_root"].value_counts().head(20).to_frame("rows"))
rprint("\n[bold]Top tool names:[/bold]")
display(
    df.loc[df["output_type"].eq("tool_use"), "tool_name"]
    .replace("", pd.NA)
    .value_counts(dropna=False)
    .head(25)
    .to_frame("rows")
)

rprint(
    Panel.fit(
        "[bold]Safe previews[/bold]\n"
        "These previews redact common secret-like patterns and never execute trace commands."
    )
)
# Preview a small random sample with redaction applied to every text field.
sample_df = df.sample(
    n=min(N_SAFE_DATASET_PREVIEWS, len(df)),
    random_state=SEED,
).reset_index(drop=True)
for index, row in sample_df.iterrows():
    payload = {
        "uid": row.get("uid"),
        "session": row.get("session"),
        "model": row.get("model"),
        "origin": row.get("origin"),
        "output_type": row.get("output_type"),
        "tool_name": row.get("tool_name"),
        "context_preview": preview_text(row.get("context")),
        "cot_preview": preview_text(row.get("cot")),
        "text_or_tool_payload_preview": preview_text(row.get("text_payload")),
    }
    rprint(
        Panel(
            safe_json_dumps(payload, max_chars=4000),
            title=f"Safe Row Preview {index}",
        )
    )
rprint(Panel.fit("[bold]Creating plots[/bold]"))
plot_paths = {}

# Bar chart of output types.
output_counts = df["output_type"].fillna("missing").value_counts()
plt.figure(figsize=(8, 5))
output_counts.plot(kind="bar")
plt.title("Output Type Distribution")
plt.xlabel("Output Type")
plt.ylabel("Rows")
plt.xticks(rotation=25, ha="right")
plot_paths["output_type_distribution"] = str(
    save_plot(OUT_DIR / "output_type_distribution.png")
)

# Horizontal bar chart of the 20 most frequent tools among tool_use rows.
tool_counts = (
    df.loc[df["output_type"].eq("tool_use"), "tool_name"]
    .replace("", "unknown")
    .value_counts()
    .head(20)
)
if len(tool_counts) > 0:
    plt.figure(figsize=(9, 6))
    tool_counts.sort_values().plot(kind="barh")
    plt.title("Top Tool Names")
    plt.xlabel("Rows")
    plt.ylabel("Tool")
    plot_paths["top_tools"] = str(save_plot(OUT_DIR / "top_tools.png"))
else:
    rprint("[yellow]No tool-use rows found for tool plot.[/yellow]")

# Horizontal bar chart of the 20 most frequent source roots.
source_counts = df["source_root"].fillna("unknown").value_counts().head(20)
plt.figure(figsize=(9, 6))
source_counts.sort_values().plot(kind="barh")
plt.title("Top Source Roots")
plt.xlabel("Rows")
plt.ylabel("Source Root")
plot_paths["top_source_roots"] = str(save_plot(OUT_DIR / "top_source_roots.png"))

# Length histograms, clipped at the 99th percentile so outliers do not flatten the plot.
length_cols = [
    "context_chars",
    "cot_chars",
    "completion_chars",
    "text_payload_chars",
]
for column in length_cols:
    plt.figure(figsize=(8, 5))
    clipped = df[column].clip(upper=df[column].quantile(0.99))
    plt.hist(clipped, bins=50)
    plt.title(f"{column} Distribution, Clipped at P99")
    plt.xlabel("Characters")
    plt.ylabel("Rows")
    plot_paths[f"{column}_histogram"] = str(
        save_plot(OUT_DIR / f"{column}_histogram.png")
    )
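We audit the dataset by checking row counts, unique sessions, duplicate IDs, missing fields, text lengths, and possible secret-like patterns. We display important distributions across output types, models, origins, source roots, and tool names to understand the data's shape. We also create safe previews and visual plots so we can inspect the data closely.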
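The same audit counts can be reproduced without pandas, which may help when checking what each metric means. The sketch below computes a few of them with the standard library only. The rows are invented stand-ins for the flat JSONL records, and the dictionary keys simply mirror the metric names used in the audit table.

```python
from collections import Counter
from statistics import median

# Invented rows standing in for the flat JSONL records.
rows = [
    {"uid": "a", "output_type": "tool_use", "context": "ls -la"},
    {"uid": "b", "output_type": "text", "context": "explain the diff"},
    {"uid": "b", "output_type": "text", "context": None},
    {"uid": "c", "output_type": "tool_use", "context": "pytest -q tests/"},
]

uid_counts = Counter(row["uid"] for row in rows)
audit = {
    "rows": len(rows),
    "unique_uid": len(uid_counts),
    # Same convention as pandas' duplicated(): every repeat after the first counts.
    "duplicate_uid_rows": sum(count - 1 for count in uid_counts.values()),
    "missing_context": sum(row["context"] is None for row in rows),
    # A missing context counts as length zero here.
    "median_context_chars": median(len(row["context"] or "") for row in rows),
}
output_type_counts = Counter(row["output_type"] for row in rows)

print(audit)
print(output_type_counts.most_common())
```

A nonzero duplicate count is worth checking before any train/test split, since repeated rows can leak between the two sides.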