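MarkTechPost · June 28, 2026, 16:02 · approx. 10 min read

Building a Stable Fable 5 Traces Workflow in Colab: Tool-Call Parsing, Data Auditing, and Baseline Training

#CodeAgents #Fable-5-traces #DataPreprocessing #SecurityAudit #Hugging Face

TL;DR

MarkTechPost has published a practical guide to building a stable Colab workflow around the Fable 5 Traces dataset on Hugging Face, covering parsing, visualization, and baseline training on coding-agent trace data.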

AI In-Depth Analysis · June 28, 2026, 08:02


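Key Points

A lightweight strategy for a stable Colab environment

The guide avoids heavyweight dependencies such as datasets and scikit-learn and instead downloads and parses the JSONL file manually, which lowers the risk of the notebook failing at startup.

Data quality and security auditing

In addition to normalizing tool calls and text outputs, it implements pattern matching that automatically flags sensitive information such as API keys and passwords, to help keep the data safe to handle.

Building predictive models from trace data

It trains a pure-Python Naive Bayes baseline to test whether context information can predict the assistant's output type and tool usage.

Detailed data visualization and analysis

It visualizes the distributions of output types, tools used, source roots, and text lengths, giving a quantitative view of coding-agent behavior and dataset skew.

Lightweight Colab setup and consistency

It loads only the minimum packages the workflow needs and defines parameters such as data paths and the random seed to keep runs reproducible.

Helper functions for safe data handling

It sets up helpers for JSON formatting, redaction of sensitive information, missing-value handling, and text cleanup for previews.

Flexible tool-call parsing logic

It implements automatic parsing of JSON strings and recursive extraction functions that handle varied key names (for example, tool_call and function_call) and nested structures, so different model output formats are processed uniformly.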


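Impact Analysis

This article offers concrete solutions to practical problems that data preprocessing and quality control pose in coding-agent research, namely dependency conflicts and security risks. In particular, the lightweight approach that avoids large libraries lowers the barrier for researchers and engineers who want to prototype quickly and analyze agent behavior.

Editorial Comment

Understanding how a coding agent works internally takes more than running the model: the key is handling the "trace" data from its generation process safely and efficiently. This article is a valuable practical guide that goes down to the implementation details.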

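In this tutorial, we work with the Fable 5 Traces dataset from Hugging Face and build a complete workflow around real coding-agent trace data. We start by setting up a lightweight environment that avoids fragile dependencies such as datasets, scikit-learn, and scipy. Then we manually download and parse the merged JSONL file to keep the notebook stable in Colab.

From there, we inspect repository files, preview raw trace examples, normalize tool calls and text outputs, audit the dataset structure, detect potential secret-like patterns, and visualize key distributions, including output types, tools, source roots, and text lengths.

We also create safe no-CoT (Chain of Thought) chat/SFT exports, build a simple keyword-search helper, and train pure-Python Naive Bayes baselines to assess whether trace context can predict the assistant's output type and tool usage.

Setting Up the Fable 5 Traces Colab Environment and Helpers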

import os
import sys
import json
import re
import math
import random
import subprocess
from pathlib import Path
from collections import Counter, defaultdict


def install_packages():
    # Install only the three lightweight packages this workflow needs.
    packages = [
        "huggingface_hub>=0.23.0",
        "rich>=13.0.0",
        "tqdm>=4.66.0",
    ]
    subprocess.run(
        [
            sys.executable,
            "-m",
            "pip",
            "install",
            "-q",
            "-U",
            "--upgrade-strategy",
            "only-if-needed",
            *packages,
        ],
        check=False,  # a failed install does not raise here
    )


install_packages()

import pandas as pd
import matplotlib.pyplot as plt

# numpy is optional: every later use is guarded by "np is not None".
try:
    import numpy as np
except Exception:
    np = None

from tqdm.auto import tqdm
from rich import print as rprint
from rich.panel import Panel
from rich.table import Table
from huggingface_hub import HfApi, hf_hub_download
from IPython.display import display

DATASET_ID = "Glint-Research/Fable-5-traces"
FLAT_JSONL_FILENAME = "fable5_cot_merged.jsonl"

OUT_DIR = Path("/content/fable5_traces_tutorial_outputs")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Fix the seeds so sampling is reproducible.
SEED = 42
random.seed(SEED)
if np is not None:
    np.random.seed(SEED)

MAX_PREVIEW_CHARS = 900
N_AGENT_TRACE_PREVIEWS = 2
N_SAFE_DATASET_PREVIEWS = 3
SAVE_COT_RESEARCH_EXPORT = False
MAX_ROWS_TO_LOAD = None  # None loads every row

rprint(
    Panel.fit(
        f"[bold]Fable 5 Traces Advanced Tutorial[/bold]\n"
        f"Dataset: {DATASET_ID}\n"
        f"Output directory: {OUT_DIR}\n"
        f"Manual JSONL loading: True\n"
        f"CoT research export enabled: {SAVE_COT_RESEARCH_EXPORT}",
        title="Setup",
    )
)

# Regexes for common credential formats plus generic "key = value" assignments.
SECRET_PATTERNS = [
    r"sk-[A-Za-z0-9_\-]{20,}",
    r"hf_[A-Za-z0-9_\-]{20,}",
    r"github_pat_[A-Za-z0-9_]{20,}",
    r"ghp_[A-Za-z0-9]{20,}",
    r"xox[baprs]-[A-Za-z0-9\-]{20,}",
    r"AKIA[0-9A-Z]{16}",
    r"(?i:(api[_-]?key|secret|token|password)\s*[:=]\s*['\"]?[^'\"\s]{8,})",
]

SECRET_RE = re.compile("|".join(f"(?:{pattern})" for pattern in SECRET_PATTERNS))
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9]{1,}|[./\\-]{2,}|[{}()\[\]:=<>]+")


def safe_json_dumps(obj, max_chars=None):
    # Pretty-print as JSON, falling back to str() for unserializable objects.
    try:
        text = json.dumps(obj, ensure_ascii=False, indent=2, default=str)
    except Exception:
        text = str(obj)
    if max_chars is not None and len(text) > max_chars:
        return text[:max_chars] + "\n... [truncated]"
    return text


def is_missing_scalar(value):
    # Containers are never "missing"; scalars are checked with pd.isna.
    if value is None:
        return True
    if isinstance(value, (list, dict, tuple, set)):
        return False
    try:
        return bool(pd.isna(value))
    except Exception:
        return False


def clean_for_json(value):
    # Recursively convert NaN and numpy types into plain JSON-friendly values.
    if is_missing_scalar(value):
        return None
    if isinstance(value, dict):
        return {str(k): clean_for_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean_for_json(v) for v in value]
    if isinstance(value, tuple):
        return [clean_for_json(v) for v in value]
    if np is not None:
        if isinstance(value, np.integer):
            return int(value)
        if isinstance(value, np.floating):
            if math.isnan(float(value)):
                return None
            return float(value)
        if isinstance(value, np.ndarray):
            return value.tolist()
    return value


def redact_possible_secrets(text):
    if text is None:
        return ""
    text = str(text)
    return SECRET_RE.sub("[REDACTED_POSSIBLE_SECRET]", text)


def contains_possible_secret(text):
    if text is None:
        return False
    return bool(SECRET_RE.search(str(text)))


def preview_text(text, max_chars=MAX_PREVIEW_CHARS):
    # Redact first, then collapse whitespace and truncate for display.
    text = redact_possible_secrets(text)
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) > max_chars:
        return text[:max_chars] + " ... [truncated]"
    return text

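We begin by setting up the Colab environment with only the lightweight packages needed for this workflow. We define the dataset path, output directory, random seed, preview limits, and export options so the tutorial behaves consistently. We also create the first set of helper functions for safe JSON formatting, secret redaction, missing-value handling, and clean text previews.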
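To make the redaction helper concrete, here is a minimal sketch that applies two of the patterns listed above to a made-up string. The sample text and the synthetic token are assumptions for illustration only; they do not come from the dataset, and the shortened function name `redact` is hypothetical.

```python
import re

# Two of the article's patterns, copied unchanged.
PATTERNS = [
    r"hf_[A-Za-z0-9_\-]{20,}",
    r"(?i:(api[_-]?key|secret|token|password)\s*[:=]\s*['\"]?[^'\"\s]{8,})",
]
SECRET_RE = re.compile("|".join(f"(?:{p})" for p in PATTERNS))
PLACEHOLDER = "[REDACTED_POSSIBLE_SECRET]"


def redact(text):
    """Replace every secret-like match with a fixed placeholder."""
    return SECRET_RE.sub(PLACEHOLDER, str(text or ""))


fake_token = "hf_" + "a" * 24  # synthetic value, not a real credential
sample = f"export HF_TOKEN={fake_token} && password: hunter2hunter2"
redacted = redact(sample)
print(redacted)
```

Because the generic `key = value` pattern matches any value of eight or more non-space characters, it can also redact harmless text (for example, `token: some_long_identifier`). For a safety filter that over-matching is probably an acceptable trade-off, but the audit count is better read as "rows worth a manual look" than as confirmed leaks.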

Building Parsing Utilities for Tool Calls and Text Outputs


def maybe_parse_json_string(value):
    # Decode strings that look like a JSON object or array; leave the rest alone.
    if isinstance(value, str):
        stripped = value.strip()
        if (stripped.startswith("{") and stripped.endswith("}")) or (
            stripped.startswith("[") and stripped.endswith("]")
        ):
            try:
                return json.loads(stripped)
            except Exception:
                return value
    return value


def normalize_output_obj(value):
    return maybe_parse_json_string(value)


def extract_tool_name(output):
    output = normalize_output_obj(output)
    if isinstance(output, dict):
        # 1) Look for a string-valued name key at this level.
        direct_keys = [
            "name",
            "tool_name",
            "tool",
            "function",
            "command_name",
            "recipient_name",
            "toolName",
            "callee",
        ]
        for key in direct_keys:
            value = output.get(key)
            if isinstance(value, str) and value.strip():
                return value.strip()
        # 2) Recurse into known wrapper keys.
        nested_keys = [
            "tool_call",
            "toolCall",
            "function_call",
            "call",
            "action",
        ]
        for nested_key in nested_keys:
            nested = output.get(nested_key)
            if isinstance(nested, dict):
                found = extract_tool_name(nested)
                if found:
                    return found
        # 3) Fall back to "type" unless it is a generic label.
        output_type = output.get("type")
        if isinstance(output_type, str):
            output_type = output_type.strip()
            if output_type and output_type.lower() not in {"tool_use", "text", "message"}:
                return output_type
    return ""


def extract_tool_args(output):
    output = normalize_output_obj(output)
    if isinstance(output, dict):
        direct_arg_keys = [
            "input",
            "args",
            "arguments",
            "parameters",
            "kwargs",
            "json",
            "payload",
        ]
        for key in direct_arg_keys:
            if key in output:
                return output[key]
        nested_keys = [
            "tool_call",
            "toolCall",
            "function_call",
            "call",
            "action",
        ]
        for nested_key in nested_keys:
            nested = output.get(nested_key)
            if isinstance(nested, dict):
                args = extract_tool_args(nested)
                if args not in [None, "", {}]:
                    return args
        # No explicit argument key: treat every non-name field as an argument.
        ignored = {
            "name",
            "tool_name",
            "tool",
            "function",
            "command_name",
            "recipient_name",
            "toolName",
            "callee",
            "type",
        }
        return {key: value for key, value in output.items() if key not in ignored}
    return {}


def extract_text_payload(output):
    output = normalize_output_obj(output)
    if isinstance(output, str):
        return output
    if isinstance(output, dict):
        text_keys = [
            "text",
            "content",
            "message",
            "output",
            "value",
            "result",
        ]
        for key in text_keys:
            value = output.get(key)
            if isinstance(value, str):
                return value
            if isinstance(value, list):
                return safe_json_dumps(value)
            if isinstance(value, dict):
                nested = extract_text_payload(value)
                if nested:
                    return nested
        return safe_json_dumps(output)
    return str(output)


def robust_len(value):
    if value is None:
        return 0
    return len(str(value))


def source_root(source_file):
    # Use the path segment after a known marker directory as the source root.
    source_file = str(source_file or "").replace("\\", "/")
    if not source_file:
        return "unknown"
    parts = [part for part in source_file.split("/") if part]
    for marker in ["projects", "AIArchives", "archives", "claude"]:
        if marker in parts:
            idx = parts.index(marker)
            if idx + 1 < len(parts):
                return parts[idx + 1]
    if len(parts) >= 2:
        return parts[-2]
    if parts:
        return parts[0]
    return "unknown"


def write_jsonl(path, records):
    path = Path(path)
    with path.open("w", encoding="utf-8") as file:
        for record in records:
            file.write(json.dumps(clean_for_json(record), ensure_ascii=False, default=str) + "\n")


def save_plot(path):
    # Save the current figure, show it, then close it.
    path = Path(path)
    plt.tight_layout()
    plt.savefig(path, dpi=160, bbox_inches="tight")
    plt.show()
    plt.close()
    return path


def print_basic_table(title, rows, columns=("Metric", "Value")):
    table = Table(title=title)
    for column in columns:
        table.add_column(str(column))
    for row in rows:
        table.add_row(*[str(item) for item in row])
    rprint(table)


def tokenize(text, max_chars=12000):
    text = str(text or "")[:max_chars].lower()
    return TOKEN_RE.findall(text)


def load_jsonl_manual(path, max_rows=None):
    # Parse line by line so one malformed line does not abort the whole load.
    records = []
    bad_lines = []
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in tqdm(enumerate(file, start=1), desc="Reading JSONL"):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except Exception as error:
                bad_lines.append(
                    {
                        "line_number": line_number,
                        "error": repr(error),
                        "preview": line[:500],
                    }
                )
            if max_rows is not None and len(records) >= max_rows:
                break
    return records, bad_lines

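We build the core parsing utilities that turn raw output fields into usable tool names, tool arguments, and text payloads. We also define helpers for measuring text length, identifying source roots, writing JSONL files, saving plots, and printing clean tables. We finish this snippet by adding tokenization and manual JSONL loading to avoid fragile dataset-loading dependencies.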
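The lookup order used for tool names (direct string keys first, then recursion into wrapper keys) can be seen on a few hand-written payloads. The sketch below is a trimmed-down restatement with shorter key lists, not the tutorial's full function. The sample payloads are invented for illustration, and the names `parse_maybe_json` and `tool_name` are hypothetical.

```python
import json

NAME_KEYS = ("name", "tool_name", "tool", "function")
NESTED_KEYS = ("tool_call", "function_call", "call", "action")


def parse_maybe_json(value):
    """Decode strings that look like JSON objects or arrays; return others unchanged."""
    if isinstance(value, str):
        s = value.strip()
        if (s.startswith("{") and s.endswith("}")) or (s.startswith("[") and s.endswith("]")):
            try:
                return json.loads(s)
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                return value
    return value


def tool_name(output):
    """Return the first string-valued name key, searching wrapper keys recursively."""
    output = parse_maybe_json(output)
    if not isinstance(output, dict):
        return ""
    for key in NAME_KEYS:
        value = output.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    for key in NESTED_KEYS:
        nested = output.get(key)
        if isinstance(nested, dict):
            found = tool_name(nested)
            if found:
                return found
    return ""


samples = [
    '{"type": "tool_use", "name": "bash", "input": {"command": "ls"}}',  # JSON string
    {"function_call": {"name": "read_file", "arguments": {"path": "a.py"}}},  # wrapped
    {"tool_call": {"function": {"name": "grep"}}},  # "function" holds a dict
    "plain assistant text",
    "{not valid json}",
]
names = [tool_name(sample) for sample in samples]
print(names)
```

The third sample comes back empty because `function` is only treated as a string-valued name key and is not in the list of wrapper keys to recurse into. The key lists in the tutorial's `extract_tool_name` have the same property, so if a trace format nests the name under a `function` object, adding `function` to the nested keys would be one possible extension.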

Inspecting the Hugging Face Repository and Loading JSONL Traces


rprint(Panel.fit("[bold]Inspecting Hugging Face dataset repository[/bold]"))

api = HfApi()
files = api.list_repo_files(repo_id=DATASET_ID, repo_type="dataset")

pi_trace_files = [
    file for file in files
    if file.startswith("pi-traces/") and file.endswith(".jsonl")
]

file_summary = {
    "total_repo_files": len(files),
    "jsonl_files": sum(file.endswith(".jsonl") for file in files),
    "pi_trace_files": len(pi_trace_files),
    "claude_files": sum(file.startswith("claude/") for file in files),
    "has_flat_jsonl": FLAT_JSONL_FILENAME in files,
}

print_basic_table(
    "Repository File Summary",
    [(key, value) for key, value in file_summary.items()],
)

rprint("[bold]Sample repository files:[/bold]")
for file in files[:20]:
    print(" -", file)

rprint(Panel.fit("[bold]Manual raw pi-trace preview[/bold]"))

pi_examples = []

if pi_trace_files:
    for trace_file in pi_trace_files[:N_AGENT_TRACE_PREVIEWS]:
        try:
            local_trace_path = hf_hub_download(
                repo_id=DATASET_ID,
                repo_type="dataset",
                filename=trace_file,
            )
            # Only the first record of each trace file is needed for a preview.
            trace_records, trace_bad_lines = load_jsonl_manual(local_trace_path, max_rows=1)
            if trace_records:
                example = trace_records[0]
                pi_examples.append(example)
                preview_payload = {
                    "trace_file": trace_file,
                    "keys": list(example.keys()),
                    "preview": example,
                }
                rprint(
                    Panel(
                        safe_json_dumps(preview_payload, max_chars=3000),
                        title=f"Raw pi-trace preview: {trace_file}",
                    )
                )
            if trace_bad_lines:
                rprint(
                    f"[yellow]Bad JSONL lines in {trace_file}: {len(trace_bad_lines)}[/yellow]"
                )
        except Exception as error:
            rprint(f"[yellow]Could not preview {trace_file}[/yellow]")
            rprint(repr(error))
else:
    rprint("[yellow]No pi-traces JSONL files found.[/yellow]")

rprint(Panel.fit("[bold]Downloading flat merged JSONL from Hugging Face Hub[/bold]"))

flat_path = hf_hub_download(
    repo_id=DATASET_ID,
    repo_type="dataset",
    filename=FLAT_JSONL_FILENAME,
)
rprint(f"[green]Downloaded flat file:[/green] {flat_path}")

rprint(Panel.fit("[bold]Loading flat JSONL manually[/bold]"))

records, bad_lines = load_jsonl_manual(flat_path, max_rows=MAX_ROWS_TO_LOAD)

# Keep a record of unparseable lines instead of silently dropping them.
if bad_lines:
    bad_lines_path = OUT_DIR / "bad_jsonl_lines.json"
    with open(bad_lines_path, "w", encoding="utf-8") as file:
        json.dump(bad_lines, file, ensure_ascii=False, indent=2)
    rprint(f"[yellow]Bad JSONL lines found: {len(bad_lines)} -> {bad_lines_path}[/yellow]")

df = pd.DataFrame.from_records(records)

rprint(f"[green]Loaded rows:[/green] {len(df):,}")
rprint(f"[green]DataFrame shape:[/green] {df.shape}")
rprint("[bold]Columns:[/bold]")
print(list(df.columns))
display(df.head(3))

# Add any expected column that is absent so later steps can rely on it.
expected_cols = [
    "uid",
    "source_file",
    "session",
    "model",
    "context",
    "cot",
    "output_type",
    "output",
    "completion",
    "origin",
]
for column in expected_cols:
    if column not in df.columns:
        df[column] = None

# Derived columns: normalized output, tool fields, lengths, and secret flags.
df["output_norm"] = df["output"].map(normalize_output_obj)
df["tool_name"] = df["output_norm"].map(extract_tool_name)
df["tool_args"] = df["output_norm"].map(extract_tool_args)
df["text_payload"] = df["output_norm"].map(extract_text_payload)

df["context_chars"] = df["context"].map(robust_len)
df["cot_chars"] = df["cot"].map(robust_len)
df["completion_chars"] = df["completion"].map(robust_len)
df["text_payload_chars"] = df["text_payload"].map(robust_len)

df["source_root"] = df["source_file"].map(source_root)

df["possible_secret_in_context"] = df["context"].map(contains_possible_secret)
df["possible_secret_in_completion"] = df["completion"].map(contains_possible_secret)
df["possible_secret_anywhere"] = (
    df["possible_secret_in_context"] | df["possible_secret_in_completion"]
)

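We inspect the Hugging Face dataset repository and summarize the total file count, the number of JSONL and pi-trace files, and whether the flat merged file is present. We manually preview a few raw Pi trace files to understand the structure without relying on the datasets library. We then download the merged JSONL file, load it into a DataFrame, and normalize key fields for later analysis.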
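The manual loading step tolerates malformed lines rather than failing on them. The following minimal sketch reproduces that behavior on an in-memory stream, so it runs without any download. The three sample records are invented, and the function name `load_jsonl` and the `io.StringIO` input are assumptions made for the demo.

```python
import io
import json


def load_jsonl(stream, max_rows=None):
    """Parse a JSONL stream, collecting unparseable lines instead of raising."""
    records, bad_lines = [], []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped but still counted in line_number
        try:
            records.append(json.loads(line))
        except ValueError as error:
            bad_lines.append(
                {"line_number": line_number, "error": repr(error), "preview": line[:80]}
            )
        if max_rows is not None and len(records) >= max_rows:
            break
    return records, bad_lines


RAW = (
    '{"uid": "a1", "output_type": "text"}\n'
    "\n"
    '{"uid": "a2", "output_type": "tool_use"\n'  # missing closing brace
    '{"uid": "a3", "output_type": "tool_use"}\n'
)

records, bad_lines = load_jsonl(io.StringIO(RAW))
first_only, _ = load_jsonl(io.StringIO(RAW), max_rows=1)
print(len(records), "good rows;", len(bad_lines), "bad line(s)")
```

Recording the physical line number of each failure makes it possible to go back to the source file later, which is presumably why the tutorial writes the bad lines to a separate JSON file rather than discarding them.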

Auditing Dataset Structure and Visualizing Trace Distributions


audit_rows = [
    ("rows", len(df)),
    ("columns", len(df.columns)),
    ("unique_uid", df["uid"].nunique(dropna=True)),
    ("duplicate_uid_rows", int(df["uid"].duplicated().sum())),
    ("unique_sessions", df["session"].nunique(dropna=True)),
    ("unique_models", df["model"].nunique(dropna=True)),
    ("missing_context", int(df["context"].isna().sum())),
    ("missing_cot", int(df["cot"].isna().sum())),
    ("missing_output", int(df["output"].isna().sum())),
    ("rows_with_possible_secret_pattern", int(df["possible_secret_anywhere"].sum())),
    ("median_context_chars", round(float(df["context_chars"].median()), 2)),
    ("median_cot_chars", round(float(df["cot_chars"].median()), 2)),
    ("median_completion_chars", round(float(df["completion_chars"].median()), 2)),
    ("max_completion_chars", int(df["completion_chars"].max())),
]

print_basic_table("Flat JSONL Audit", audit_rows)

rprint("\n[bold]Output type distribution:[/bold]")
display(df["output_type"].value_counts(dropna=False).to_frame("rows"))

rprint("\n[bold]Model distribution:[/bold]")
display(df["model"].value_counts(dropna=False).to_frame("rows").head(20))

rprint("\n[bold]Origin distribution:[/bold]")
display(df["origin"].value_counts(dropna=False).to_frame("rows"))

rprint("\n[bold]Top source roots:[/bold]")
display(df["source_root"].value_counts().head(20).to_frame("rows"))

rprint("\n[bold]Top tool names:[/bold]")
display(
    df.loc[df["output_type"].eq("tool_use"), "tool_name"]
    .replace("", pd.NA)
    .value_counts(dropna=False)
    .head(25)
    .to_frame("rows")
)

rprint(
    Panel.fit(
        "[bold]Safe previews[/bold]\n"
        "These previews redact common secret-like patterns and never execute trace commands."
    )
)

sample_df = df.sample(
    n=min(N_SAFE_DATASET_PREVIEWS, len(df)),
    random_state=SEED,
).reset_index(drop=True)

for index, row in sample_df.iterrows():
    # Every free-text field goes through preview_text, which redacts and truncates.
    payload = {
        "uid": row.get("uid"),
        "session": row.get("session"),
        "model": row.get("model"),
        "origin": row.get("origin"),
        "output_type": row.get("output_type"),
        "tool_name": row.get("tool_name"),
        "context_preview": preview_text(row.get("context")),
        "cot_preview": preview_text(row.get("cot")),
        "text_or_tool_payload_preview": preview_text(row.get("text_payload")),
    }
    rprint(
        Panel(
            safe_json_dumps(payload, max_chars=4000),
            title=f"Safe Row Preview {index}",
        )
    )

rprint(Panel.fit("[bold]Creating plots[/bold]"))

plot_paths = {}

# Bar chart of output types.
output_counts = df["output_type"].fillna("missing").value_counts()
plt.figure(figsize=(8, 5))
output_counts.plot(kind="bar")
plt.title("Output Type Distribution")
plt.xlabel("Output Type")
plt.ylabel("Rows")
plt.xticks(rotation=25, ha="right")
plot_paths["output_type_distribution"] = str(
    save_plot(OUT_DIR / "output_type_distribution.png")
)

# Horizontal bar chart of the 20 most frequent tools among tool_use rows.
tool_counts = (
    df.loc[df["output_type"].eq("tool_use"), "tool_name"]
    .replace("", "unknown")
    .value_counts()
    .head(20)
)

if len(tool_counts) > 0:
    plt.figure(figsize=(9, 6))
    tool_counts.sort_values().plot(kind="barh")
    plt.title("Top Tool Names")
    plt.xlabel("Rows")
    plt.ylabel("Tool")
    plot_paths["top_tools"] = str(save_plot(OUT_DIR / "top_tools.png"))
else:
    rprint("[yellow]No tool-use rows found for tool plot.[/yellow]")

source_counts = df["source_root"].fillna("unknown").value_counts().head(20)
plt.figure(figsize=(9, 6))
source_counts.sort_values().plot(kind="barh")
plt.title("Top Source Roots")
plt.xlabel("Rows")
plt.ylabel("Source Root")
plot_paths["top_source_roots"] = str(save_plot(OUT_DIR / "top_source_roots.png"))

length_cols = [
    "context_chars",
    "cot_chars",
    "completion_chars",
    "text_payload_chars",
]

for column in length_cols:
    plt.figure(figsize=(8, 5))
    # Clip at the 99th percentile so a few very long rows do not flatten the histogram.
    clipped = df[column].clip(upper=df[column].quantile(0.99))
    plt.hist(clipped, bins=50)
    plt.title(f"{column} Distribution, Clipped at P99")
    plt.xlabel("Characters")
    plt.ylabel("Rows")
    plot_paths[f"{column}_histogram"] = str(
        save_plot(OUT_DIR / f"{column}_histogram.png")
    )

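We audit the dataset by checking row counts, unique sessions, duplicate IDs, missing fields, text lengths, and secret-like patterns. We display the key distributions across output types, models, origins, source roots, and tool names to understand the shape of the data. We also create safe previews and plots so the data can be inspected closely.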
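The length histograms are clipped at the 99th percentile before plotting. As a rough illustration of why, the sketch below computes a P99 cap in pure Python on made-up lengths and clips to it. The data and the helper name `quantile` are assumptions. The helper uses linear interpolation, which is also the default for pandas `quantile`.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("quantile of empty data")
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up character counts: 99 ordinary rows plus one extreme outlier.
lengths = list(range(1, 100)) + [50_000]

p99 = quantile(lengths, 0.99)
clipped = [min(value, p99) for value in lengths]

print("raw max:", max(lengths))
print("P99 cap:", round(p99, 2))
print("clipped max:", round(max(clipped), 2))
```

Without clipping, a 50-bin histogram of these values would put 99 of the 100 rows into the first bin. With the cap, the bulk of the distribution spreads across the axis while the outlier is still visible as a spike in the last bin.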

原文を表示

In this tutorial, we work with the Fable 5 Traces dataset from Hugging Face and build a complete workflow around real coding-agent trace data. We start by setting up a lightweight environment that avoids fragile dependencies such as datasets, scikit-learn, and scipy. Then we manually download and parse the merged JSONL file to keep the notebook stable in Colab. From there, we inspect repository files, preview raw trace examples, normalize tool calls and text outputs, audit the dataset structure, detect potential secret-like patterns, and visualize key distributions, including output types, tools, source roots, and text lengths. We also create safe no-CoT chat/SFT exports, build a simple keyword-search helper, and train pure-Python Naive Bayes baselines to assess whether trace context can predict the assistant’s output type and tool usage.

Setting Up the Fable 5 Traces Colab Environment and Helpers

Copy CodeCopiedUse a different Browser

import os

import sys

import json

import re

import math

import random

import subprocess

from pathlib import Path

from collections import Counter, defaultdict

def install_packages():

packages = [

"huggingface_hub>=0.23.0",

"rich>=13.0.0",

"tqdm>=4.66.0",

]

subprocess.run(

[

sys.executable,

"-m",

"pip",

"install",

"-q",

"-U",

"--upgrade-strategy",

"only-if-needed",

*packages,

check=False,

)

install_packages()

import pandas as pd

import matplotlib.pyplot as plt

try:

import numpy as np

except Exception:

np = None

from tqdm.auto import tqdm

from rich import print as rprint

from rich.panel import Panel

from rich.table import Table

from huggingface_hub import HfApi, hf_hub_download

from IPython.display import display

DATASET_ID = "Glint-Research/Fable-5-traces"

FLAT_JSONL_FILENAME = "fable5_cot_merged.jsonl"

OUT_DIR = Path("/content/fable5_traces_tutorial_outputs")

OUT_DIR.mkdir(parents=True, exist_ok=True)

SEED = 42

random.seed(SEED)

if np is not None:

np.random.seed(SEED)

MAX_PREVIEW_CHARS = 900

N_AGENT_TRACE_PREVIEWS = 2

N_SAFE_DATASET_PREVIEWS = 3

SAVE_COT_RESEARCH_EXPORT = False

MAX_ROWS_TO_LOAD = None

rprint(

Panel.fit(

f"[bold]Fable 5 Traces Advanced Tutorial[/bold]\n"

f"Dataset: {DATASET_ID}\n"

f"Output directory: {OUT_DIR}\n"

f"Manual JSONL loading: True\n"

f"CoT research export enabled: {SAVE_COT_RESEARCH_EXPORT}",

title="Setup",

)

SECRET_PATTERNS = [

r"sk-[A-Za-z0-9_\-]{20,}",

r"hf_[A-Za-z0-9_\-]{20,}",

r"github_pat_[A-Za-z0-9_]{20,}",

r"ghp_[A-Za-z0-9]{20,}",

r"xox[baprs]-[A-Za-z0-9\-]{20,}",

r"AKIA[0-9A-Z]{16}",

r"(?i:(api[_-]?key|secret|token|password)\s*[:=]\s*['\"]?[^'\"\s]{8,})",

]

SECRET_RE = re.compile("|".join(f"(?:{pattern})" for pattern in SECRET_PATTERNS))

TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9]{1,}|[./\\-]{2,}|[{}()\[\]:=<>]+")

def safe_json_dumps(obj, max_chars=None):

try:

text = json.dumps(obj, ensure_ascii=False, indent=2, default=str)

except Exception:

text = str(obj)

if max_chars is not None and len(text) > max_chars:

return text[:max_chars] + "\n... [truncated]"

return text

def is_missing_scalar(value):

if value is None:

return True

if isinstance(value, (list, dict, tuple, set)):

return False

try:

return bool(pd.isna(value))

except Exception:

return False

def clean_for_json(value):

if is_missing_scalar(value):

return None

if isinstance(value, dict):

return {str(k): clean_for_json(v) for k, v in value.items()}

if isinstance(value, list):

return [clean_for_json(v) for v in value]

if isinstance(value, tuple):

return [clean_for_json(v) for v in value]

if np is not None:

if isinstance(value, np.integer):

return int(value)

if isinstance(value, np.floating):

if math.isnan(float(value)):

return None

return float(value)

if isinstance(value, np.ndarray):

return value.tolist()

return value

def redact_possible_secrets(text):

if text is None:

return ""

text = str(text)

return SECRET_RE.sub("[REDACTED_POSSIBLE_SECRET]", text)

def contains_possible_secret(text):

if text is None:

return False

return bool(SECRET_RE.search(str(text)))

def preview_text(text, max_chars=MAX_PREVIEW_CHARS):

text = redact_possible_secrets(text)

text = re.sub(r"\s+", " ", text).strip()

if len(text) > max_chars:

return text[:max_chars] + " ... [truncated]"

return text

We begin by setting up the Colab environment with only the lightweight packages needed for this workflow. We define the dataset path, output directory, random seed, preview limits, and export options so the tutorial behaves consistently. We also create the first set of helper functions for safe JSON formatting, secret redaction, missing-value handling, and clean text previews.

Building Parsing Utilities for Tool Calls and Text Outputs

Copy CodeCopiedUse a different Browser

def maybe_parse_json_string(value):

if isinstance(value, str):

stripped = value.strip()

if (stripped.startswith("{") and stripped.endswith("}")) or (

stripped.startswith("[") and stripped.endswith("]")

try:

return json.loads(stripped)

except Exception:

return value

def normalize_output_obj(value):

return maybe_parse_json_string(value)

def extract_tool_name(output):

output = normalize_output_obj(output)

if isinstance(output, dict):

direct_keys = [

"name",

"tool_name",

"tool",

"function",

"command_name",

"recipient_name",

"toolName",

"callee",

]

for key in direct_keys:

value = output.get(key)

if isinstance(value, str) and value.strip():

return value.strip()

nested_keys = [

"tool_call",

"toolCall",

"function_call",

"call",

"action",

]

for nested_key in nested_keys:

nested = output.get(nested_key)

if isinstance(nested, dict):

found = extract_tool_name(nested)

if found:

return found

output_type = output.get("type")

if isinstance(output_type, str):

output_type = output_type.strip()

if output_type and output_type.lower() not in {"tool_use", "text", "message"}:

return output_type

return ""

def extract_tool_args(output):

output = normalize_output_obj(output)

if isinstance(output, dict):

direct_arg_keys = [

"input",

"args",

"arguments",

"parameters",

"kwargs",

"json",

"payload",

]

for key in direct_arg_keys:

if key in output:

return output[key]

nested_keys = [

"tool_call",

"toolCall",

"function_call",

"call",

"action",

]

for nested_key in nested_keys:

nested = output.get(nested_key)

if isinstance(nested, dict):

args = extract_tool_args(nested)

if args not in [None, "", {}]:

return args

ignored = {

"name",

"tool_name",

"tool",

"function",

"command_name",

"recipient_name",

"toolName",

"callee",

"type",

}

return {key: value for key, value in output.items() if key not in ignored}

return {}

def extract_text_payload(output):

output = normalize_output_obj(output)

if isinstance(output, str):

return output

if isinstance(output, dict):

text_keys = [

"text",

"content",

"message",

"output",

"value",

"result",

]

for key in text_keys:

value = output.get(key)

if isinstance(value, str):

return value

if isinstance(value, list):

return safe_json_dumps(value)

if isinstance(value, dict):

nested = extract_text_payload(value)

if nested:

return nested

return safe_json_dumps(output)

return str(output)

def robust_len(value):

if value is None:

return 0

return len(str(value))

def source_root(source_file):

source_file = str(source_file or "").replace("\\", "/")

if not source_file:

return "unknown"

parts = [part for part in source_file.split("/") if part]

for marker in ["projects", "AIArchives", "archives", "claude"]:

if marker in parts:

idx = parts.index(marker)

if idx + 1 < len(parts):

return parts[idx + 1]

if len(parts) >= 2:

return parts[-2]

if parts:

return parts[0]

return "unknown"

def write_jsonl(path, records):

path = Path(path)

with path.open("w", encoding="utf-8") as file:

for record in records:

file.write(json.dumps(clean_for_json(record), ensure_ascii=False, default=str) + "\n")

def save_plot(path):

path = Path(path)

plt.tight_layout()

plt.savefig(path, dpi=160, bbox_inches="tight")

plt.show()

plt.close()

return path

def print_basic_table(title, rows, columns=("Metric", "Value")):

table = Table(title=title)

for column in columns:

table.add_column(str(column))

for row in rows:

table.add_row(*[str(item) for item in row])

rprint(table)

def tokenize(text, max_chars=12000):

text = str(text or "")[:max_chars].lower()

return TOKEN_RE.findall(text)

def load_jsonl_manual(path, max_rows=None):

records = []

bad_lines = []

with open(path, "r", encoding="utf-8") as file:

for line_number, line in tqdm(enumerate(file, start=1), desc="Reading JSONL"):

line = line.strip()

if not line:

continue

try:

records.append(json.loads(line))

except Exception as error:

bad_lines.append(

{

"line_number": line_number,

"error": repr(error),

"preview": line[:500],

}

)

if max_rows is not None and len(records) >= max_rows:

break

return records, bad_lines

We build the core parsing utilities that turn raw output fields into usable tool names, tool arguments, and text payloads. We also define helpers for measuring text length, identifying source roots, writing JSONL files, saving plots, and printing clean tables. We finish this snippet by adding tokenization and manual JSONL loading to avoid fragile dataset-loading dependencies.

Inspecting the Hugging Face Repository and Loading JSONL Traces

Copy CodeCopiedUse a different Browser

rprint(Panel.fit("[bold]Inspecting Hugging Face dataset repository[/bold]"))

api = HfApi()

files = api.list_repo_files(repo_id=DATASET_ID, repo_type="dataset")

pi_trace_files = [

file for file in files

if file.startswith("pi-traces/") and file.endswith(".jsonl")

]

file_summary = {

"total_repo_files": len(files),

"jsonl_files": sum(file.endswith(".jsonl") for file in files),

"pi_trace_files": len(pi_trace_files),

"claude_files": sum(file.startswith("claude/") for file in files),

"has_flat_jsonl": FLAT_JSONL_FILENAME in files,

}

print_basic_table(

"Repository File Summary",

[(key, value) for key, value in file_summary.items()],

)

rprint("[bold]Sample repository files:[/bold]")

for file in files[:20]:

print(" -", file)

rprint(Panel.fit("[bold]Manual raw pi-trace preview[/bold]"))

pi_examples = []

if pi_trace_files:

for trace_file in pi_trace_files[:N_AGENT_TRACE_PREVIEWS]:

try:

local_trace_path = hf_hub_download(

repo_id=DATASET_ID,

repo_type="dataset",

filename=trace_file,

)

trace_records, trace_bad_lines = load_jsonl_manual(local_trace_path, max_rows=1)

if trace_records:

example = trace_records[0]

pi_examples.append(example)

preview_payload = {

"trace_file": trace_file,

"keys": list(example.keys()),

"preview": example,

}

rprint(

Panel(

safe_json_dumps(preview_payload, max_chars=3000),

title=f"Raw pi-trace preview: {trace_file}",

)

if trace_bad_lines:

rprint(

f"[yellow]Bad JSONL lines in {trace_file}: {len(trace_bad_lines)}[/yellow]"

)

except Exception as error:

rprint(f"[yellow]Could not preview {trace_file}[/yellow]")

rprint(repr(error))

else:

rprint("[yellow]No pi-traces JSONL files found.[/yellow]")

rprint(Panel.fit("[bold]Downloading flat merged JSONL from Hugging Face Hub[/bold]"))

flat_path = hf_hub_download(

repo_id=DATASET_ID,

repo_type="dataset",

filename=FLAT_JSONL_FILENAME,

)

rprint(f"[green]Downloaded flat file:[/green] {flat_path}")

rprint(Panel.fit("[bold]Loading flat JSONL manually[/bold]"))

records, bad_lines = load_jsonl_manual(flat_path, max_rows=MAX_ROWS_TO_LOAD)

if bad_lines:

bad_lines_path = OUT_DIR / "bad_jsonl_lines.json"

with open(bad_lines_path, "w", encoding="utf-8") as file:

json.dump(bad_lines, file, ensure_ascii=False, indent=2)

rprint(f"[yellow]Bad JSONL lines found: {len(bad_lines)} -> {bad_lines_path}[/yellow]")

df = pd.DataFrame.from_records(records)

rprint(f"[green]Loaded rows:[/green] {len(df):,}")

rprint(f"[green]DataFrame shape:[/green] {df.shape}")

rprint("[bold]Columns:[/bold]")

print(list(df.columns))

display(df.head(3))

expected_cols = [

"uid",

"source_file",

"session",

"model",

"context",

"cot",

"output_type",

"output",

"completion",

"origin",

]

for column in expected_cols:

if column not in df.columns:

df[column] = None

df["output_norm"] = df["output"].map(normalize_output_obj)

df["tool_name"] = df["output_norm"].map(extract_tool_name)

df["tool_args"] = df["output_norm"].map(extract_tool_args)

df["text_payload"] = df["output_norm"].map(extract_text_payload)

df["context_chars"] = df["context"].map(robust_len)

df["cot_chars"] = df["cot"].map(robust_len)

df["completion_chars"] = df["completion"].map(robust_len)

df["text_payload_chars"] = df["text_payload"].map(robust_len)

df["source_root"] = df["source_file"].map(source_root)

df["possible_secret_in_context"] = df["context"].map(contains_possible_secret)

df["possible_secret_in_completion"] = df["completion"].map(contains_possible_secret)

df["possible_secret_anywhere"] = (

df["possible_secret_in_context"] | df["possible_secret_in_completion"]

)

We inspect the Hugging Face dataset repository and summarize the number of files, JSONL traces, and flat-merged files available. We manually preview a few raw Pi trace files to understand the structure without relying on the datasets library. We then download the merged JSONL file, load it into a DataFrame, and normalize key fields for later analysis.

Auditing Dataset Structure and Visualizing Trace Distributions

Copy CodeCopiedUse a different Browser

# Summarize dataset size, ID uniqueness, missing fields, secret-like hits, and text lengths.
audit_rows = [
    ("rows", len(df)),
    ("columns", len(df.columns)),
    ("unique_uid", df["uid"].nunique(dropna=True)),
    ("duplicate_uid_rows", int(df["uid"].duplicated().sum())),
    ("unique_sessions", df["session"].nunique(dropna=True)),
    ("unique_models", df["model"].nunique(dropna=True)),
    ("missing_context", int(df["context"].isna().sum())),
    ("missing_cot", int(df["cot"].isna().sum())),
    ("missing_output", int(df["output"].isna().sum())),
    ("rows_with_possible_secret_pattern", int(df["possible_secret_anywhere"].sum())),
    ("median_context_chars", round(float(df["context_chars"].median()), 2)),
    ("median_cot_chars", round(float(df["cot_chars"].median()), 2)),
    ("median_completion_chars", round(float(df["completion_chars"].median()), 2)),
    ("max_completion_chars", int(df["completion_chars"].max())),
]

print_basic_table("Flat JSONL Audit", audit_rows)

# Show how rows are distributed across output types, models, origins, and source roots.
rprint("\n[bold]Output type distribution:[/bold]")
display(df["output_type"].value_counts(dropna=False).to_frame("rows"))

rprint("\n[bold]Model distribution:[/bold]")
display(df["model"].value_counts(dropna=False).to_frame("rows").head(20))

rprint("\n[bold]Origin distribution:[/bold]")
display(df["origin"].value_counts(dropna=False).to_frame("rows"))

rprint("\n[bold]Top source roots:[/bold]")
display(df["source_root"].value_counts().head(20).to_frame("rows"))

# Count tool names for tool-use rows only; empty names are treated as missing.
rprint("\n[bold]Top tool names:[/bold]")
display(
    df.loc[df["output_type"].eq("tool_use"), "tool_name"]
    .replace("", pd.NA)
    .value_counts(dropna=False)
    .head(25)
    .to_frame("rows")
)

rprint(
    Panel.fit(
        "[bold]Safe previews[/bold]\n"
        "These previews redact common secret-like patterns and never execute trace commands."
    )
)

# Draw a small reproducible sample of rows to preview.
sample_df = df.sample(
    n=min(N_SAFE_DATASET_PREVIEWS, len(df)),
    random_state=SEED,
).reset_index(drop=True)

for index, row in sample_df.iterrows():
    # preview_text redacts secret-like patterns and truncates long fields.
    payload = {
        "uid": row.get("uid"),
        "session": row.get("session"),
        "model": row.get("model"),
        "origin": row.get("origin"),
        "output_type": row.get("output_type"),
        "tool_name": row.get("tool_name"),
        "context_preview": preview_text(row.get("context")),
        "cot_preview": preview_text(row.get("cot")),
        "text_or_tool_payload_preview": preview_text(row.get("text_payload")),
    }
    rprint(
        Panel(
            safe_json_dumps(payload, max_chars=4000),
            title=f"Safe Row Preview {index}",
        )
    )

rprint(Panel.fit("[bold]Creating plots[/bold]"))

plot_paths = {}

# Bar chart of output types; missing values get their own bucket.
output_counts = df["output_type"].fillna("missing").value_counts()
plt.figure(figsize=(8, 5))
output_counts.plot(kind="bar")
plt.title("Output Type Distribution")
plt.xlabel("Output Type")
plt.ylabel("Rows")
plt.xticks(rotation=25, ha="right")
plot_paths["output_type_distribution"] = str(
    save_plot(OUT_DIR / "output_type_distribution.png")
)

# Horizontal bar chart of the 20 most frequent tools among tool-use rows.
tool_counts = (
    df.loc[df["output_type"].eq("tool_use"), "tool_name"]
    .replace("", "unknown")
    .value_counts()
    .head(20)
)

if len(tool_counts) > 0:
    plt.figure(figsize=(9, 6))
    tool_counts.sort_values().plot(kind="barh")
    plt.title("Top Tool Names")
    plt.xlabel("Rows")
    plt.ylabel("Tool")
    plot_paths["top_tools"] = str(save_plot(OUT_DIR / "top_tools.png"))
else:
    rprint("[yellow]No tool-use rows found for tool plot.[/yellow]")

# Horizontal bar chart of the 20 most frequent source roots.
source_counts = df["source_root"].fillna("unknown").value_counts().head(20)
plt.figure(figsize=(9, 6))
source_counts.sort_values().plot(kind="barh")
plt.title("Top Source Roots")
plt.xlabel("Rows")
plt.ylabel("Source Root")
plot_paths["top_source_roots"] = str(save_plot(OUT_DIR / "top_source_roots.png"))

length_cols = [
    "context_chars",
    "cot_chars",
    "completion_chars",
    "text_payload_chars",
]

for column in length_cols:
    plt.figure(figsize=(8, 5))
    # Clip at the 99th percentile so a few very long rows do not flatten the histogram.
    clipped = df[column].clip(upper=df[column].quantile(0.99))
    plt.hist(clipped, bins=50)
    plt.title(f"{column} Distribution, Clipped at P99")
    plt.xlabel("Characters")
    plt.ylabel("Rows")
    plot_paths[f"{column}_histogram"] = str(
        save_plot(OUT_DIR / f"{column}_histogram.png")
    )
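To make the audit metrics and the P99 clipping in the code above concrete, here is a minimal standard-library sketch on toy values. It counts duplicate IDs the way `duplicated().sum()` does (every repeat after the first occurrence), and computes a quantile with linear interpolation, which we believe matches the pandas default for `quantile`. The helper names and toy values are hypothetical, not dataset contents.

```python
from collections import Counter
from statistics import median


def count_duplicate_rows(ids):
    """Count rows whose ID already appeared earlier (first occurrences are not counted)."""
    counts = Counter(ids)
    return sum(count - 1 for count in counts.values())


def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = (len(ordered) - 1) * q
    lower = int(position)  # floor, since position is non-negative
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def clip_upper(values, cap):
    """Replace every value above `cap` with `cap`."""
    return [min(value, cap) for value in values]


# Hypothetical toy rows: one repeated uid and one extreme length.
toy_uids = ["a", "b", "b", "c"]
toy_lengths = [10, 20, 30, 1000]

duplicate_rows = count_duplicate_rows(toy_uids)
median_length = median(toy_lengths)
p99_cap = linear_quantile(toy_lengths, 0.99)
clipped_lengths = clip_upper(toy_lengths, p99_cap)

print({"duplicate_uid_rows": duplicate_rows, "median_chars": median_length})
print("P99 cap:", round(p99_cap, 2))
print("clipped lengths:", [round(value, 2) for value in clipped_lengths])
```

Clipping changes only the plotted values, not the data: the single 1000-character row is pulled down to the cap so the remaining rows stay visible in the histogram, while the audit table still reports the true maximum.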

We audit the dataset by checking row counts, unique sessions, duplicate IDs, missing fields, text lengths, and possible secret-like patterns. We display important distributions across output types, models, origins, source roots, and tool names to understand the data’s shape. We also create safe previews and visual plots so we can inspect the




原文を表示

Setting Up the Fable 5 Traces Colab Environment and Helpers

Copy CodeCopiedUse a different Browser

import os

import sys

import json

import re

import math

import random

import subprocess

from pathlib import Path

from collections import Counter, defaultdict

def install_packages():

packages = [

"huggingface_hub>=0.23.0",

"rich>=13.0.0",

"tqdm>=4.66.0",

]

subprocess.run(

[

sys.executable,

"-m",

"pip",

"install",

"-q",

"-U",

"--upgrade-strategy",

"only-if-needed",

*packages,

check=False,

)

install_packages()

import pandas as pd

import matplotlib.pyplot as plt

try:

import numpy as np

except Exception:

np = None

from tqdm.auto import tqdm

from rich import print as rprint

from rich.panel import Panel

from rich.table import Table

from huggingface_hub import HfApi, hf_hub_download

from IPython.display import display

DATASET_ID = "Glint-Research/Fable-5-traces"

FLAT_JSONL_FILENAME = "fable5_cot_merged.jsonl"

OUT_DIR = Path("/content/fable5_traces_tutorial_outputs")

OUT_DIR.mkdir(parents=True, exist_ok=True)

SEED = 42

random.seed(SEED)

if np is not None:

np.random.seed(SEED)

MAX_PREVIEW_CHARS = 900

N_AGENT_TRACE_PREVIEWS = 2

N_SAFE_DATASET_PREVIEWS = 3

SAVE_COT_RESEARCH_EXPORT = False

MAX_ROWS_TO_LOAD = None

rprint(

Panel.fit(

f"[bold]Fable 5 Traces Advanced Tutorial[/bold]\n"

f"Dataset: {DATASET_ID}\n"

f"Output directory: {OUT_DIR}\n"

f"Manual JSONL loading: True\n"

f"CoT research export enabled: {SAVE_COT_RESEARCH_EXPORT}",

title="Setup",

)

SECRET_PATTERNS = [

r"sk-[A-Za-z0-9_\-]{20,}",

r"hf_[A-Za-z0-9_\-]{20,}",

r"github_pat_[A-Za-z0-9_]{20,}",

r"ghp_[A-Za-z0-9]{20,}",

r"xox[baprs]-[A-Za-z0-9\-]{20,}",

r"AKIA[0-9A-Z]{16}",

r"(?i:(api[_-]?key|secret|token|password)\s*[:=]\s*['\"]?[^'\"\s]{8,})",

]

SECRET_RE = re.compile("|".join(f"(?:{pattern})" for pattern in SECRET_PATTERNS))

TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9]{1,}|[./\\-]{2,}|[{}()\[\]:=<>]+")

def safe_json_dumps(obj, max_chars=None):

try:

text = json.dumps(obj, ensure_ascii=False, indent=2, default=str)

except Exception:

text = str(obj)

if max_chars is not None and len(text) > max_chars:

return text[:max_chars] + "\n... [truncated]"

return text

def is_missing_scalar(value):

if value is None:

return True

if isinstance(value, (list, dict, tuple, set)):

return False

try:

return bool(pd.isna(value))

except Exception:

return False

def clean_for_json(value):

if is_missing_scalar(value):

return None

if isinstance(value, dict):

return {str(k): clean_for_json(v) for k, v in value.items()}

if isinstance(value, list):

return [clean_for_json(v) for v in value]

if isinstance(value, tuple):

return [clean_for_json(v) for v in value]

if np is not None:

if isinstance(value, np.integer):

return int(value)

if isinstance(value, np.floating):

if math.isnan(float(value)):

return None

return float(value)

if isinstance(value, np.ndarray):

return value.tolist()

return value

def redact_possible_secrets(text):

if text is None:

return ""

text = str(text)

return SECRET_RE.sub("[REDACTED_POSSIBLE_SECRET]", text)

def contains_possible_secret(text):

if text is None:

return False

return bool(SECRET_RE.search(str(text)))

def preview_text(text, max_chars=MAX_PREVIEW_CHARS):

text = redact_possible_secrets(text)

text = re.sub(r"\s+", " ", text).strip()

if len(text) > max_chars:

return text[:max_chars] + " ... [truncated]"

return text

Building Parsing Utilities for Tool Calls and Text Outputs

Copy CodeCopiedUse a different Browser

def maybe_parse_json_string(value):

if isinstance(value, str):

stripped = value.strip()

if (stripped.startswith("{") and stripped.endswith("}")) or (

stripped.startswith("[") and stripped.endswith("]")

try:

return json.loads(stripped)

except Exception:

return value

def normalize_output_obj(value):

return maybe_parse_json_string(value)

def extract_tool_name(output):

output = normalize_output_obj(output)

if isinstance(output, dict):

direct_keys = [

"name",

"tool_name",

"tool",

"function",

"command_name",

"recipient_name",

"toolName",

"callee",

]

for key in direct_keys:

value = output.get(key)

if isinstance(value, str) and value.strip():

return value.strip()

nested_keys = [

"tool_call",

"toolCall",

"function_call",

"call",

"action",

]

for nested_key in nested_keys:

nested = output.get(nested_key)

if isinstance(nested, dict):

found = extract_tool_name(nested)

if found:

return found

output_type = output.get("type")

if isinstance(output_type, str):

output_type = output_type.strip()

if output_type and output_type.lower() not in {"tool_use", "text", "message"}:

return output_type

return ""

def extract_tool_args(output):

output = normalize_output_obj(output)

if isinstance(output, dict):

direct_arg_keys = [

"input",

"args",

"arguments",

"parameters",

"kwargs",

"json",

"payload",

]

for key in direct_arg_keys:

if key in output:

return output[key]

nested_keys = [

"tool_call",

"toolCall",

"function_call",

"call",

"action",

]

for nested_key in nested_keys:

nested = output.get(nested_key)

if isinstance(nested, dict):

args = extract_tool_args(nested)

if args not in [None, "", {}]:

return args

ignored = {

"name",

"tool_name",

"tool",

"function",

"command_name",

"recipient_name",

"toolName",

"callee",

"type",

}

return {key: value for key, value in output.items() if key not in ignored}

return {}

def extract_text_payload(output):

output = normalize_output_obj(output)

if isinstance(output, str):

return output

if isinstance(output, dict):

text_keys = [

"text",

"content",

"message",

"output",

"value",

"result",

]

for key in text_keys:

value = output.get(key)

if isinstance(value, str):

return value

if isinstance(value, list):

return safe_json_dumps(value)

if isinstance(value, dict):

nested = extract_text_payload(value)

if nested:

return nested

return safe_json_dumps(output)

return str(output)

def robust_len(value):

if value is None:

return 0

return len(str(value))

def source_root(source_file):

source_file = str(source_file or "").replace("\\", "/")

if not source_file:

return "unknown"

parts = [part for part in source_file.split("/") if part]

for marker in ["projects", "AIArchives", "archives", "claude"]:

if marker in parts:

idx = parts.index(marker)

if idx + 1 < len(parts):

return parts[idx + 1]

if len(parts) >= 2:

return parts[-2]

if parts:

return parts[0]

return "unknown"

def write_jsonl(path, records):

path = Path(path)

with path.open("w", encoding="utf-8") as file:

for record in records:

file.write(json.dumps(clean_for_json(record), ensure_ascii=False, default=str) + "\n")

def save_plot(path):

path = Path(path)

plt.tight_layout()

plt.savefig(path, dpi=160, bbox_inches="tight")

plt.show()

plt.close()

return path

def print_basic_table(title, rows, columns=("Metric", "Value")):

table = Table(title=title)

for column in columns:

table.add_column(str(column))

for row in rows:

table.add_row(*[str(item) for item in row])

rprint(table)

def tokenize(text, max_chars=12000):

text = str(text or "")[:max_chars].lower()

return TOKEN_RE.findall(text)

def load_jsonl_manual(path, max_rows=None):

records = []

bad_lines = []

with open(path, "r", encoding="utf-8") as file:

for line_number, line in tqdm(enumerate(file, start=1), desc="Reading JSONL"):

line = line.strip()

if not line:

continue

try:

records.append(json.loads(line))

except Exception as error:

bad_lines.append(

{

"line_number": line_number,

"error": repr(error),

"preview": line[:500],

}

)

if max_rows is not None and len(records) >= max_rows:

break

return records, bad_lines

Inspecting the Hugging Face Repository and Loading JSONL Traces

Copy CodeCopiedUse a different Browser

rprint(Panel.fit("[bold]Inspecting Hugging Face dataset repository[/bold]"))

api = HfApi()

files = api.list_repo_files(repo_id=DATASET_ID, repo_type="dataset")

pi_trace_files = [

file for file in files

if file.startswith("pi-traces/") and file.endswith(".jsonl")

]

file_summary = {

"total_repo_files": len(files),

"jsonl_files": sum(file.endswith(".jsonl") for file in files),

"pi_trace_files": len(pi_trace_files),

"claude_files": sum(file.startswith("claude/") for file in files),

"has_flat_jsonl": FLAT_JSONL_FILENAME in files,

}

print_basic_table(

"Repository File Summary",

[(key, value) for key, value in file_summary.items()],

)

rprint("[bold]Sample repository files:[/bold]")

for file in files[:20]:

print(" -", file)

rprint(Panel.fit("[bold]Manual raw pi-trace preview[/bold]"))

pi_examples = []

if pi_trace_files:

for trace_file in pi_trace_files[:N_AGENT_TRACE_PREVIEWS]:

try:

local_trace_path = hf_hub_download(

repo_id=DATASET_ID,

repo_type="dataset",

filename=trace_file,

)

trace_records, trace_bad_lines = load_jsonl_manual(local_trace_path, max_rows=1)

if trace_records:

example = trace_records[0]

pi_examples.append(example)

preview_payload = {

"trace_file": trace_file,

"keys": list(example.keys()),

"preview": example,

}

rprint(

Panel(

safe_json_dumps(preview_payload, max_chars=3000),

title=f"Raw pi-trace preview: {trace_file}",

)

if trace_bad_lines:

rprint(

f"[yellow]Bad JSONL lines in {trace_file}: {len(trace_bad_lines)}[/yellow]"

)

except Exception as error:

rprint(f"[yellow]Could not preview {trace_file}[/yellow]")

rprint(repr(error))

else:

rprint("[yellow]No pi-traces JSONL files found.[/yellow]")

rprint(Panel.fit("[bold]Downloading flat merged JSONL from Hugging Face Hub[/bold]"))

flat_path = hf_hub_download(

repo_id=DATASET_ID,

repo_type="dataset",

filename=FLAT_JSONL_FILENAME,

)

rprint(f"[green]Downloaded flat file:[/green] {flat_path}")

rprint(Panel.fit("[bold]Loading flat JSONL manually[/bold]"))

records, bad_lines = load_jsonl_manual(flat_path, max_rows=MAX_ROWS_TO_LOAD)

if bad_lines:

bad_lines_path = OUT_DIR / "bad_jsonl_lines.json"

with open(bad_lines_path, "w", encoding="utf-8") as file:

json.dump(bad_lines, file, ensure_ascii=False, indent=2)

rprint(f"[yellow]Bad JSONL lines found: {len(bad_lines)} -> {bad_lines_path}[/yellow]")

df = pd.DataFrame.from_records(records)

rprint(f"[green]Loaded rows:[/green] {len(df):,}")

rprint(f"[green]DataFrame shape:[/green] {df.shape}")

rprint("[bold]Columns:[/bold]")

print(list(df.columns))

display(df.head(3))

expected_cols = [

"uid",

"source_file",

"session",

"model",

"context",

"cot",

"output_type",

"output",

"completion",

"origin",

]

for column in expected_cols:

if column not in df.columns:

df[column] = None

df["output_norm"] = df["output"].map(normalize_output_obj)

df["tool_name"] = df["output_norm"].map(extract_tool_name)

df["tool_args"] = df["output_norm"].map(extract_tool_args)

df["text_payload"] = df["output_norm"].map(extract_text_payload)

df["context_chars"] = df["context"].map(robust_len)

df["cot_chars"] = df["cot"].map(robust_len)

df["completion_chars"] = df["completion"].map(robust_len)

df["text_payload_chars"] = df["text_payload"].map(robust_len)

df["source_root"] = df["source_file"].map(source_root)

df["possible_secret_in_context"] = df["context"].map(contains_possible_secret)

df["possible_secret_in_completion"] = df["completion"].map(contains_possible_secret)

df["possible_secret_anywhere"] = (

df["possible_secret_in_context"] | df["possible_secret_in_completion"]

)

Auditing Dataset Structure and Visualizing Trace Distributions

Copy CodeCopiedUse a different Browser

audit_rows = [

("rows", len(df)),

("columns", len(df.columns)),

("unique_uid", df["uid"].nunique(dropna=True)),

("duplicate_uid_rows", int(df["uid"].duplicated().sum())),

("unique_sessions", df["session"].nunique(dropna=True)),

("unique_models", df["model"].nunique(dropna=True)),

("missing_context", int(df["context"].isna().sum())),

("missing_cot", int(df["cot"].isna().sum())),

("missing_output", int(df["output"].isna().sum())),

("rows_with_possible_secret_pattern", int(df["possible_secret_anywhere"].sum())),

("median_context_chars", round(float(df["context_chars"].median()), 2)),

("median_cot_chars", round(float(df["cot_chars"].median()), 2)),

("median_completion_chars", round(float(df["completion_chars"].median()), 2)),

("max_completion_chars", int(df["completion_chars"].max())),

]

print_basic_table("Flat JSONL Audit", audit_rows)

rprint("\n[bold]Output type distribution:[/bold]")
display(df["output_type"].value_counts(dropna=False).to_frame("rows"))

rprint("\n[bold]Model distribution:[/bold]")
display(df["model"].value_counts(dropna=False).to_frame("rows").head(20))

rprint("\n[bold]Origin distribution:[/bold]")
display(df["origin"].value_counts(dropna=False).to_frame("rows"))

rprint("\n[bold]Top source roots:[/bold]")
display(df["source_root"].value_counts().head(20).to_frame("rows"))

rprint("\n[bold]Top tool names:[/bold]")
display(
    df.loc[df["output_type"].eq("tool_use"), "tool_name"]
    .replace("", pd.NA)
    .value_counts(dropna=False)
    .head(25)
    .to_frame("rows")
)

rprint(
    Panel.fit(
        "[bold]Safe previews[/bold]\n"
        "These previews redact common secret-like patterns and never execute trace commands."
    )
)

# Draw a reproducible sample of rows for redacted previews.
sample_df = df.sample(
    n=min(N_SAFE_DATASET_PREVIEWS, len(df)),
    random_state=SEED,
).reset_index(drop=True)

for index, row in sample_df.iterrows():
    payload = {
        "uid": row.get("uid"),
        "session": row.get("session"),
        "model": row.get("model"),
        "origin": row.get("origin"),
        "output_type": row.get("output_type"),
        "tool_name": row.get("tool_name"),
        "context_preview": preview_text(row.get("context")),
        "cot_preview": preview_text(row.get("cot")),
        "text_or_tool_payload_preview": preview_text(row.get("text_payload")),
    }
    rprint(
        Panel(
            safe_json_dumps(payload, max_chars=4000),
            title=f"Safe Row Preview {index}",
        )
    )

rprint(Panel.fit("[bold]Creating plots[/bold]"))

plot_paths = {}

# Bar chart of rows per output type; missing values get their own bar.
output_counts = df["output_type"].fillna("missing").value_counts()
plt.figure(figsize=(8, 5))
output_counts.plot(kind="bar")
plt.title("Output Type Distribution")
plt.xlabel("Output Type")
plt.ylabel("Rows")
plt.xticks(rotation=25, ha="right")
plot_paths["output_type_distribution"] = str(
    save_plot(OUT_DIR / "output_type_distribution.png")
)

# Horizontal bar chart of the 20 most frequent tools among tool-use rows.
tool_counts = (
    df.loc[df["output_type"].eq("tool_use"), "tool_name"]
    .replace("", "unknown")
    .value_counts()
    .head(20)
)

if len(tool_counts) > 0:
    plt.figure(figsize=(9, 6))
    tool_counts.sort_values().plot(kind="barh")
    plt.title("Top Tool Names")
    plt.xlabel("Rows")
    plt.ylabel("Tool")
    plot_paths["top_tools"] = str(save_plot(OUT_DIR / "top_tools.png"))
else:
    rprint("[yellow]No tool-use rows found for tool plot.[/yellow]")

source_counts = df["source_root"].fillna("unknown").value_counts().head(20)
plt.figure(figsize=(9, 6))
source_counts.sort_values().plot(kind="barh")
plt.title("Top Source Roots")
plt.xlabel("Rows")
plt.ylabel("Source Root")
plot_paths["top_source_roots"] = str(save_plot(OUT_DIR / "top_source_roots.png"))

length_cols = [
    "context_chars",
    "cot_chars",
    "completion_chars",
    "text_payload_chars",
]

# One histogram per length column; values above the 99th percentile are
# clipped to it so a few very long rows do not stretch the x-axis.
for column in length_cols:
    plt.figure(figsize=(8, 5))
    clipped = df[column].clip(upper=df[column].quantile(0.99))
    plt.hist(clipped, bins=50)
    plt.title(f"{column} Distribution, Clipped at P99")
    plt.xlabel("Characters")
    plt.ylabel("Rows")
    plot_paths[f"{column}_histogram"] = str(
        save_plot(OUT_DIR / f"{column}_histogram.png")
    )
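
The histogram loop above clips each length column at its 99th percentile rather than dropping the long rows. A minimal standard-library sketch of that idea follows. The helper names `percentile_linear` and `clip_upper` and the sample lengths are assumptions made for illustration, and the percentile uses linear interpolation between neighboring ranks, which is the default interpolation of pandas `quantile`.

```python
import math


def percentile_linear(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def clip_upper(values, cap):
    """Replace every value above cap with cap; the row count is unchanged."""
    return [min(v, cap) for v in values]


# Hypothetical character counts: 100 ordinary rows plus one extreme outlier.
lengths = list(range(1, 101)) + [10_000]
p99 = percentile_linear(lengths, 0.99)
clipped = clip_upper(lengths, p99)

print(len(lengths), len(clipped))  # clipping keeps every row
print(max(lengths), max(clipped))  # the outlier is pulled down to the cap
```

Because clipping keeps every row, the last histogram bin absorbs all values at or above the cap, so a tall final bar signals a long tail rather than a true spike at that length.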
