AIニュース最前線

MarkTechPost · June 16, 2026, 16:20 · about an 11-minute read

How to Build a Parsing Pipeline with Docling Parse for Layout-Aware Document Intelligence

#Document Intelligence#PDF Parsing#Layout Analysis#RAG#Open Source
TL;DR

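This tutorial presents an implementation workflow that uses Docling Parse to analyze the layout structure of PDFs in detail and to extract structured data from complex documents containing tables and images.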

AI deep analysis · June 16, 2026, 17:02

Attention: 3 on a 5-point scale

Depth (40%): 4
Relevance (30%): 4
Practicality (20%): 5
Novelty (10%): 3

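Key Points

1. Building the Colab environment and resolving dependencies

The article details how to detect a Pillow version conflict, reinstall Pillow automatically, and set up the required libraries, including Docling Parse and Docling Core, in a stable Colab environment.

2. Automatically generating a multi-element test PDF

A test PDF that mimics complex real-world documents, with text, columns, tabular data, vector shapes, and an embedded image, is generated programmatically to provide a suitable setup for evaluating the parser.

3. Detailed layout-aware structured extraction

Docling Parse is used to obtain word-level and character-level coordinates and to generate visualization overlays, laying the groundwork for advanced document AI tasks such as reading-order reconstruction and table recognition.

4. Structured output formats

Parsing results are saved in machine-readable formats such as JSON and CSV, with an implementation example that lets them feed directly into RAG (retrieval-augmented generation) and document analysis pipelines.

5. Test PDF covering diverse document elements

The code programmatically creates a complex PDF containing text, a table-like layout, vector paths, and an embedded bitmap image in order to exercise each capability of the parser.

6. Coordinate-based reading order and structural analysis

Using a two-column layout and positioned text blocks, the article shows how to obtain word-level coordinates, reconstruct reading order, and analyze the spatial structure of a page.

7. Support for downstream tasks

The extracted layout features can be used in downstream processing such as retrieval, table extraction, chunking, and RAG applications where page position matters.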
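To make the chunking use case in the last key point concrete, here is a minimal sketch of position-aware chunking. It is not taken from the article: the function name `chunk_lines` and the record shape (`page`, `text`, `bbox`) are illustrative assumptions, and the lines are assumed to arrive already in reading order.

```python
def chunk_lines(lines, max_chars=200):
    """Pack consecutive line records into chunks without crossing page boundaries.

    Each input record is assumed (hypothetically) to look like
    {"page": int, "text": str, "bbox": (x0, y0, x1, y1)}.
    """
    chunks = []
    current = None
    for line in lines:
        x0, y0, x1, y1 = line["bbox"]
        start_new = (
            current is None
            or current["page"] != line["page"]
            or len(current["text"]) + 1 + len(line["text"]) > max_chars
        )
        if start_new:
            current = {"page": line["page"], "text": line["text"], "bbox": (x0, y0, x1, y1)}
            chunks.append(current)
        else:
            cx0, cy0, cx1, cy1 = current["bbox"]
            current["text"] += " " + line["text"]
            # Grow the chunk box to the union of the boxes of its lines.
            current["bbox"] = (min(cx0, x0), min(cy0, y0), max(cx1, x1), max(cy1, y1))
    return chunks
```

Each chunk keeps its page number and a union bounding box, so a retrieval hit can be traced back to a region of the page.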

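Impact Analysis

Rather than staying at the level of theory, this article starts from an environment problem developers actually run into (dependency conflicts) and goes on to provide concrete code for parsing complex document structure and converting it into structured data, which gives it high practical value. As tools like Docling Parse spread, the cost of extracting high-quality data from unstructured PDFs is expected to fall and the development of document AI applications to speed up.

Editorial Comment

A very practical article that packs in Docling Parse implementation details and know-how for setting up the environment on Colab. The dependency troubleshooting steps are especially valuable as a quick fix for problems developers are likely to encounter.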

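In this tutorial, we build a workflow for using Docling Parse to analyze PDF documents at a detailed structural level. We start by preparing a stable Python environment, handling common Colab dependency issues, and generating a custom multi-page PDF with text, columns, table-like content, vector shapes, and an embedded image. We then use Docling Parse to extract words, characters, and lines with page-level coordinates, render visual overlays, and save the results into structured JSON and CSV files. Through this workflow, we see how low-level PDF parsing can support document AI tasks such as layout analysis, reading-order reconstruction, table-aware processing, and retrieval-ready document preparation.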

Setting Up the Docling Parse Colab Environment and Dependencies


import os, sys, subprocess, textwrap, json, time, shutil
from pathlib import Path

def run(cmd):
    # Echo the shell command, then run it with output streamed to the notebook.
    print(f"\n$ {cmd}")
    return subprocess.run(cmd, shell=True, text=True, capture_output=False)

run(f'{sys.executable} -m pip install -q --no-cache-dir -U "pillow>=10.4.0,<12" reportlab pandas matplotlib docling-core docling-parse')

try:
    from PIL import Image, ImageDraw
except ImportError:
    print("\nPillow import failed because Colab has a mixed PIL installation.")
    print("Reinstalling Pillow and restarting runtime. After restart, run this same cell again.")
    run(f'{sys.executable} -m pip uninstall -y pillow PIL')
    run(f'{sys.executable} -m pip install -q --no-cache-dir --force-reinstall "pillow>=10.4.0,<12"')
    # Kill the current process so Colab restarts the runtime with the fresh Pillow.
    os.kill(os.getpid(), 9)

import pandas as pd
import matplotlib.pyplot as plt
from reportlab.lib.pagesizes import A4
from reportlab.lib import colors
from reportlab.platypus import Table, TableStyle
from reportlab.pdfgen import canvas
from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser

print("Environment ready.")
print("Python:", sys.version.split()[0])

# Working directory, generated PDF, output folder, and demo image used throughout.
WORKDIR = Path("/content/docling_parse_advanced_tutorial")
WORKDIR.mkdir(parents=True, exist_ok=True)
PDF_PATH = WORKDIR / "advanced_docling_parse_demo.pdf"
OUT_DIR = WORKDIR / "outputs"
OUT_DIR.mkdir(exist_ok=True)
DEMO_IMAGE_PATH = WORKDIR / "demo_bitmap.png"

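We set up the Colab environment by installing Docling Parse, Docling Core, Pillow, ReportLab, Pandas, and Matplotlib. We also handle the Pillow import issue safely so the notebook can recover if Colab has a broken or mixed PIL installation. We then define the working directory, output folder, PDF path, and image path that we use throughout the tutorial.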
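The recovery path in the setup cell relies on catching an `ImportError`. As a lighter complement, a sketch like the following could report which required packages are importable before any parsing starts. The helper name `check_modules` is a hypothetical addition, not part of the original tutorial; it uses only the standard library.

```python
import importlib
from importlib import metadata


def check_modules(names):
    """Return {module_name: (importable, version_or_None)} for each name."""
    report = {}
    for name in names:
        try:
            importlib.import_module(name)
        except Exception:
            # ImportError is the usual case, but a broken install can raise other errors.
            report[name] = (False, None)
            continue
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The import name can differ from the distribution name (PIL vs. pillow),
            # and standard-library modules have no distribution metadata at all.
            version = None
        report[name] = (True, version)
    return report
```

Calling `check_modules(["PIL", "docling_parse", "reportlab"])` at the top of a notebook gives a quick view of the environment without forcing a runtime restart.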

Generating a Multi-Element Test PDF for Parser Evaluation


def create_demo_image(path):
    # Draw a small bitmap (frame, circle, diagonal line, caption) to embed in the PDF.
    img = Image.new("RGB", (320, 180), "white")
    draw = ImageDraw.Draw(img)
    draw.rectangle([20, 20, 300, 160], outline="black", width=3)
    draw.ellipse([55, 45, 145, 135], outline="black", width=4)
    draw.line([180, 140, 285, 45], fill="black", width=4)
    draw.text((45, 145), "Embedded bitmap image", fill="black")
    img.save(path)

create_demo_image(DEMO_IMAGE_PATH)

def build_pdf(pdf_path):
    c = canvas.Canvas(str(pdf_path), pagesize=A4)
    width, height = A4

    # Page 1: title and introduction.
    c.setFont("Helvetica-Bold", 20)
    c.drawString(60, height - 70, "Docling Parse Advanced PDF Parsing Tutorial")
    c.setFont("Helvetica", 11)
    intro = (
        "This generated document is designed for testing text extraction, coordinate parsing, "
        "line grouping, vector path detection, bitmap resources, and layout-aware reconstruction."
    )
    text_obj = c.beginText(60, height - 105)
    text_obj.setLeading(15)
    for line in textwrap.wrap(intro, width=90):
        text_obj.textLine(line)
    c.drawText(text_obj)

    # Two-column text region.
    c.setFont("Helvetica-Bold", 14)
    c.drawString(60, height - 170, "1. Two-column text region")
    left_para = (
        "The left column contains compact explanatory text. A parser should expose words, "
        "characters, and line-level cells along with coordinates. These coordinates allow us "
        "to reconstruct reading order and inspect the spatial structure of a page."
    )
    right_para = (
        "The right column contains a separate paragraph. In document AI pipelines, layout "
        "features are useful for retrieval, table extraction, chunking, and downstream RAG "
        "applications where page position can matter."
    )
    y_start = height - 200
    left_text = c.beginText(60, y_start)
    left_text.setFont("Helvetica", 10)
    left_text.setLeading(13)
    for line in textwrap.wrap(left_para, width=42):
        left_text.textLine(line)
    c.drawText(left_text)
    right_text = c.beginText(325, y_start)
    right_text.setFont("Helvetica", 10)
    right_text.setLeading(13)
    for line in textwrap.wrap(right_para, width=42):
        right_text.textLine(line)
    c.drawText(right_text)

    # Vector shapes: column frames, a circle, and a line.
    c.setStrokeColor(colors.darkblue)
    c.setLineWidth(2)
    c.rect(55, height - 315, 225, 130, stroke=1, fill=0)
    c.rect(320, height - 315, 225, 130, stroke=1, fill=0)
    c.setStrokeColor(colors.darkgreen)
    c.setLineWidth(3)
    c.circle(140, height - 390, 40, stroke=1, fill=0)
    c.line(220, height - 430, 310, height - 355)

    # Table-like structure.
    c.setFont("Helvetica-Bold", 14)
    c.setFillColor(colors.black)
    c.drawString(60, height - 470, "2. Simple table-like structure")
    data = [
        ["Section", "Signal", "Expected parser behavior"],
        ["Text", "Words and lines", "Return text cells with coordinates"],
        ["Vector", "Boxes and lines", "Expose page path/vector resources"],
        ["Bitmap", "Embedded image", "Expose or render image resources"],
    ]
    table = Table(data, colWidths=[100, 130, 260])
    table.setStyle(TableStyle([
        ("BACKGROUND", (0, 0), (-1, 0), colors.lightgrey),
        ("GRID", (0, 0), (-1, -1), 0.7, colors.black),
        ("FONTNAME", (0, 0), (-1, 0), "Helvetica-Bold"),
        ("FONTSIZE", (0, 0), (-1, -1), 9),
        ("VALIGN", (0, 0), (-1, -1), "MIDDLE"),
    ]))
    table.wrapOn(c, width, height)
    table.drawOn(c, 60, height - 590)
    c.setFont("Helvetica", 9)
    c.drawString(60, 55, "Page 1: generated programmatic PDF with text, table-like layout, and vector paths.")
    c.showPage()

    # Page 2: dense text blocks and the embedded bitmap.
    c.setFont("Helvetica-Bold", 18)
    c.drawString(60, height - 70, "Page 2: Bitmap, Dense Text, and Reading Order")
    c.setFont("Helvetica", 10)
    dense = (
        "This page includes an embedded bitmap image and several short blocks of text. "
        "We use it to test whether rendering works, whether the parser preserves page-level "
        "coordinates, and whether our own reconstruction logic can group words into lines."
    )
    y = height - 105
    for para_idx in range(4):
        tx = c.beginText(60, y)
        tx.setFont("Helvetica", 10)
        tx.setLeading(13)
        for line in textwrap.wrap(f"Block {para_idx + 1}: {dense}", width=92):
            tx.textLine(line)
        c.drawText(tx)
        y -= 70
    c.drawImage(str(DEMO_IMAGE_PATH), 110, height - 510, width=320, height=180, preserveAspectRatio=True)
    c.setStrokeColor(colors.red)
    c.setLineWidth(2)
    c.roundRect(95, height - 525, 350, 210, 10, stroke=1, fill=0)
    c.setFillColor(colors.black)
    c.setFont("Helvetica-Bold", 12)
    c.drawString(60, height - 570, "Coordinate-aware extraction lets us keep page, text, and position together.")
    c.setFont("Helvetica", 9)
    c.drawString(60, 55, "Page 2: embedded bitmap image and multiple text blocks.")
    c.save()

build_pdf(PDF_PATH)
print("Created PDF:", PDF_PATH)

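We generate a small bitmap image and create a custom two-page PDF for testing Docling Parse. We add text blocks, two-column content, vector shapes, table-like content, and an embedded image so the parser has multiple document elements to process. We use the generated PDF as a controlled input to check text extraction, layout structure, rendering, and coordinate-aware parsing.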
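ReportLab places content with the origin at the bottom-left corner of the page, which is why the generator writes positions as `height - offset`. Parsed coordinates may follow either a bottom-left or a top-left convention, so a small conversion helper is handy when comparing what was drawn with what was extracted. The sketch below is an illustration only; `flip_y` and `flip_box` are hypothetical names, and the A4 size is computed from its 210 mm × 297 mm definition at 72 points per inch.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0

# A4 is 210 mm x 297 mm; PDF user space is measured in points (1/72 inch).
A4_WIDTH_PT = 210 / MM_PER_INCH * POINTS_PER_INCH
A4_HEIGHT_PT = 297 / MM_PER_INCH * POINTS_PER_INCH


def flip_y(y, page_height=A4_HEIGHT_PT):
    """Convert a y coordinate between bottom-left and top-left origins."""
    return page_height - y


def flip_box(x0, y0, x1, y1, page_height=A4_HEIGHT_PT):
    """Flip a box vertically and return it normalized as (left, top, right, bottom)."""
    ya, yb = flip_y(y0, page_height), flip_y(y1, page_height)
    return (min(x0, x1), min(ya, yb), max(x0, x1), max(ya, yb))
```

For example, the left column frame drawn with `c.rect(55, height - 315, 225, 130)` spans from 185 to 315 points below the top edge once flipped.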

Extracting Word, Character, and Line Cells with Docling Parse


def safe_to_dict(obj, max_depth=2):
    # Best-effort conversion of arbitrary objects into JSON-friendly values.
    if obj is None:
        return None
    if isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, (list, tuple)):
        return [safe_to_dict(x, max_depth=max_depth - 1) for x in obj[:50]]
    if isinstance(obj, dict):
        return {
            str(k): safe_to_dict(v, max_depth=max_depth - 1)
            for k, v in list(obj.items())[:50]
        }
    if hasattr(obj, "model_dump"):
        try:
            return obj.model_dump()
        except Exception:
            pass
    if hasattr(obj, "__dict__") and max_depth > 0:
        try:
            return {
                k: safe_to_dict(v, max_depth=max_depth - 1)
                for k, v in obj.__dict__.items()
                if not k.startswith("_")
            }
        except Exception:
            pass
    return str(obj)

def rect_to_dict(rect):
    d = safe_to_dict(rect)
    if isinstance(d, dict):
        return d
    # Fall back to probing common rectangle attribute names.
    attrs = {}
    for name in [
        "l", "t", "r", "b",
        "left", "top", "right", "bottom",
        "x0", "y0", "x1", "y1",
        "width", "height"
    ]:
        if hasattr(rect, name):
            try:
                attrs[name] = getattr(rect, name)
            except Exception:
                pass
    return attrs if attrs else {"raw": str(rect)}

def get_text_cell_records(page_no, pred_page, unit_type):
    records = []
    try:
        cells = list(pred_page.iterate_cells(unit_type=unit_type))
    except Exception as e:
        print(f"Could not iterate {unit_type} cells on page {page_no}: {e}")
        return records
    for idx, cell in enumerate(cells):
        text = getattr(cell, "text", "")
        rect = getattr(cell, "rect", None)
        records.append({
            "page": page_no,
            "unit": str(unit_type).split(".")[-1],
            "index": idx,
            "text": text,
            "rect": rect_to_dict(rect),
            "raw_cell": safe_to_dict(cell, max_depth=1),
        })
    return records

def count_possible_resources(pred_page):
    # Heuristic: report sizes of page attributes whose names suggest resources.
    resource_summary = {}
    names = dir(pred_page)
    keywords = ["path", "bitmap", "image", "resource", "line", "rect"]
    for name in names:
        lname = name.lower()
        if any(k in lname for k in keywords) and not name.startswith("_"):
            try:
                value = getattr(pred_page, name)
                if callable(value):
                    continue
                try:
                    resource_summary[name] = len(value)
                except Exception:
                    resource_summary[name] = type(value).__name__
            except Exception:
                pass
    return resource_summary

parser = DoclingPdfParser()
start = time.perf_counter()
pdf_doc = parser.load(path_or_stream=str(PDF_PATH))
load_time = time.perf_counter() - start
print(f"\nLoaded PDF in {load_time:.3f} seconds.")

all_records = []
page_summaries = []
rendered_paths = []
parse_start = time.perf_counter()

for page_no, pred_page in pdf_doc.iterate_pages():
    print(f"\n--- Page {page_no} ---")
    word_records = get_text_cell_records(page_no, pred_page, TextCellUnit.WORD)
    char_records = get_text_cell_records(page_no, pred_page, TextCellUnit.CHAR)
    line_records = get_text_cell_records(page_no, pred_page, TextCellUnit.LINE)
    all_records.extend(word_records)
    all_records.extend(char_records)
    all_records.extend(line_records)
    resource_summary = count_possible_resources(pred_page)
    page_summaries.append({
        "page": page_no,
        "words": len(word_records),
        "chars": len(char_records),
        "lines": len(line_records),
        "possible_resource_attributes": resource_summary,
    })
    print("Words:", len(word_records))
    print("Characters:", len(char_records))
    print("Lines:", len(line_records))
    print("Possible resource attributes:", resource_summary)
    print("\nFirst 20 extracted words:")
    print(" ".join([r["text"] for r in word_records[:20]]))

    # Render one overlay image per text unit for visual inspection.
    for unit_name, unit_type in [
        ("word", TextCellUnit.WORD),
        ("char", TextCellUnit.CHAR),
        ("line", TextCellUnit.LINE),
    ]:
        try:
            img = pred_page.render_as_image(cell_unit=unit_type)
            out_img = OUT_DIR / f"page_{page_no}_{unit_name}_overlay.png"
            img.save(out_img)
            rendered_paths.append(out_img)
            print("Saved rendered overlay:", out_img)
        except Exception as e:
            print(f"Could not render {unit_name} overlay for page {page_no}: {e}")

parse_time = time.perf_counter() - parse_start

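We define helper functions to safely convert Docling objects, rectangles, and page resources into readable Python dictionaries. We load the generated PDF with DoclingPdfParser and extract word-level, character-level, and line-level text cells from each page. We also render page overlays for different text units to visually inspect how Docling Parse detects and maps content on PDF pages.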
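The shape of the `rect` dictionaries returned by `rect_to_dict` depends on how the library serializes rectangles, which the article does not spell out. Assuming (and this is only an assumption) that a rectangle comes back as four corner points with keys such as `r_x0`, `r_y0` through `r_x3`, `r_y3`, a sketch for collapsing it into an axis-aligned bounding box might look like this. The name `corners_to_bbox` and the key prefixes are hypothetical.

```python
def corners_to_bbox(rect):
    """Collapse corner-point keys (assumed r_x0..r_x3, r_y0..r_y3) into min/max extents.

    Returns None when fewer than two numeric x or y values are present.
    """
    xs, ys = [], []
    for key, value in rect.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # skip non-numeric entries such as a coordinate-origin label
        if key.startswith("r_x"):
            xs.append(float(value))
        elif key.startswith("r_y"):
            ys.append(float(value))
    if len(xs) < 2 or len(ys) < 2:
        return None
    return {"x_min": min(xs), "y_min": min(ys), "x_max": max(xs), "y_max": max(ys)}
```

Taking the minimum and maximum over all corners avoids relying on the order in which the corners are listed.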

Exporting Structured Outputs and Reconstructing Layout-Aware Text


# display is built into notebooks; the explicit import keeps the cell usable in plain IPython too.
from IPython.display import display

records_path = OUT_DIR / "docling_parse_cells.json"
with open(records_path, "w", encoding="utf-8") as f:
    json.dump(all_records, f, indent=2, ensure_ascii=False)

summary_path = OUT_DIR / "page_summaries.json"
with open(summary_path, "w", encoding="utf-8") as f:
    json.dump(page_summaries, f, indent=2, ensure_ascii=False)

# Flatten each record so every rectangle field becomes its own CSV column.
flat_rows = []
for r in all_records:
    rect = r.get("rect", {})
    row = {
        "page": r["page"],
        "unit": r["unit"],
        "index": r["index"],
        "text": r["text"],
    }
    if isinstance(rect, dict):
        for k, v in rect.items():
            if isinstance(v, (str, int, float, bool)) or v is None:
                row[f"rect_{k}"] = v
            else:
                row[f"rect_{k}"] = str(v)
    flat_rows.append(row)

df = pd.DataFrame(flat_rows)
csv_path = OUT_DIR / "docling_parse_cells.csv"
df.to_csv(csv_path, index=False)

summary_df = pd.DataFrame(page_summaries)
summary_csv_path = OUT_DIR / "page_summaries.csv"
summary_df.to_csv(summary_csv_path, index=False)

print("\nSaved structured outputs:")
print(records_path)
print(csv_path)
print(summary_path)
print(summary_csv_path)
print("\nPage summary:")
display(summary_df)
print("\nCell dataframe sample:")
display(df.head(20))

def extract_rect_numbers(rect):
    # Try known key sets first, then fall back to the first four numeric values.
    if not isinstance(rect, dict):
        return None
    possible_sets = [
        ("l", "t", "r", "b"),
        ("left", "top", "right", "bottom"),
        ("x0", "y0", "x1", "y1"),
    ]
    for keys in possible_sets:
        if all(k in rect for k in keys):
            try:
                vals = [float(rect[k]) for k in keys]
                return vals
            except Exception:
                pass
    numeric = []
    for v in rect.values():
        try:
            numeric.append(float(v))
        except Exception:
            pass
    if len(numeric) >= 4:
        return numeric[:4]
    return None

word_df = df[df["unit"].str.contains("WORD", case=False, na=False)].copy()
if len(word_df) == 0:
    word_df = df[df["unit"].str.contains("word", case=False, na=False)].copy()

coords = []
for _, row in word_df.iterrows():
    rect_data = {}
    for col in word_df.columns:
        if col.startswith("rect_"):
            rect_data[col.replace("rect_", "")] = row[col]
    nums = extract_rect_numbers(rect_data)
    coords.append(nums)

word_df["coord_numbers"] = coords
word_df = word_df[word_df["coord_numbers"].notna()].copy()

if len(word_df) > 0:
    word_df["x0"] = word_df["coord_numbers"].apply(lambda x: min(x[0], x[2]))
    word_df["x1"] = word_df["coord_numbers"].apply(lambda x: max(x[0], x[2]))
    word_df["y0"] = word_df["coord_numbers"].apply(lambda x: min(x[1], x[3]))
    word_df["y1"] = word_df["coord_numbers"].apply(lambda x: max(x[1], x[3]))
    word_df["y_mid"] = (word_df["y0"] + word_df["y1"]) / 2

    reconstructed_pages = {}
    for page, g in word_df.groupby("page"):
        g = g.sort_values(["y_mid", "x0"]).copy()
        y_values = sorted(g["y_mid"].tolist())

        # Cluster vertical midpoints into line bins within a fixed threshold.
        line_bins = []
        threshold = 8.0
        for y in y_values:
            placed = False
            for line in line_bins:
                if abs(line["center"] - y) <= threshold:
                    line["values"].append(y)
                    line["center"] = sum(line["values"]) / len(line["values"])
                    placed = True
                    break
            if not placed:
                line_bins.append({"center": y, "values": [y]})

        def assign_line(y):
            return min(range(len(line_bins)), key=lambda i: abs(line_bins[i]["center"] - y))

        g["line_id"] = g["y_mid"].apply(assign_line)

        # Order words left to right within each line, then order lines by vertical position.
        lines = []
        for line_id, lg in g.groupby("line_id"):
            lg = lg.sort_values("x0")
            line_text = " ".join(lg["text"].astype(str).tolist())
            lines.append((lg["y_mid"].mean(), line_text))
        lines = sorted(lines, key=lambda x: x[0])
        reconstructed_text = "\n".join([line for _, line in lines])
        reconstructed_pages[int(page)] = reconstructed_text

    recon_path = OUT_DIR / "layout_aware_reconstructed_text.json"
    with open(recon_path, "w", encoding="utf-8") as f:
        json.dump(reconstructed_pages, f, indent=2, ensure_ascii=False)

    print("\nLayout-aware reconstructed text:")
    for page, text in reconstructed_pages.items():
        print(f"\n===== PAGE {page} =====")
        print(text[:2500])
    print("\nSaved reconstruction:", recon_path)
else:
    print("\nCould not build coordinate-based reconstruction because rectangle coordinates were not exposed in a numeric form.")

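We save the extracted parsing results to JSON and CSV files for later analysis. We flatten the parsed records into a Pandas DataFrame and display the per-page summary and a sample of the cell-level extraction. We also reconstruct text from coordinates, which shows how a layout-aware reading order can be derived from word positions.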
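The reconstruction in this section bins words purely by vertical position, so in a two-column region words from the left and right columns that share a baseline would land on the same line. One possible refinement, sketched here under stated assumptions, is to split words by column first and apply the line binning within each column. This is not the article's method: `Word`, `group_lines`, `reading_order`, and the `column_split_x` parameter are hypothetical, and the sketch assumes top-left-origin coordinates (smaller `y_mid` means higher on the page) and a column boundary that is already known.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float
    y_mid: float  # assumed top-left origin: smaller y means higher on the page


def group_lines(words, threshold=8.0):
    """Greedy binning: a word joins the current line if its y_mid is within
    threshold of that line's first word; otherwise it starts a new line."""
    lines = []
    for w in sorted(words, key=lambda w: (w.y_mid, w.x0)):
        if lines and abs(lines[-1][0].y_mid - w.y_mid) <= threshold:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x0)) for line in lines]


def reading_order(words, column_split_x):
    """Read the whole left column top to bottom, then the whole right column."""
    left = [w for w in words if w.x0 < column_split_x]
    right = [w for w in words if w.x0 >= column_split_x]
    return group_lines(left) + group_lines(right)
```

With the generated PDF, the left column starts at x = 60 and the right at x = 325, so a split somewhere between the two would separate them. If the parser reports bottom-left-origin coordinates, the y values would need to be flipped before calling this sketch.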

Running a Threaded Parsing Benchmark and Checking CLI Availability


print("\nAttempting threaded parsing benchmark...")

threaded_results = []
threaded_available = True

try:
    from docling_parse.pdf_parser import DoclingThreadedPdfParser, ThreadedPdfParserConfig
    from docling_parse.pdf_parsers import DecodePageConfig

    parser_config = ThreadedPdfParserConfig(
        loglevel="fatal",
        threads=4,
        max_concurrent_results=32,
    )
    decode_config = DecodePageConfig()
    threaded_parser = DoclingThreadedPdfParser(
        parser_config=parser_config,
        decode_config=decode_config,
    )

    t0 = time.perf_counter()
    doc_key = threaded_parser.load(str(PDF_PATH))
    page_count = threaded_parser.page_count(doc_key)
    print("Threaded doc key:", doc_key)
    print("Threaded page count:", page_count)

    for result in threaded_parser.iterate_results():
        item = {
            "doc_key": str(getattr(result, "doc_key", "")),
            "page_number": getattr(result, "page_number", None),
            "success": getattr(result, "success", None),
            "error_message": getattr(result, "error_message", None),
        }
        if getattr(result, "success", False):
            seg_page = result.get_page()
            timings = result.get_timings()
            item["word_count"] = len(getattr(seg_page, "word_cells", []))
            try:
                item["total_time"] = timings.total()
            except Exception:
                item["total_time"] = str(timings)
        threaded_results.append(item)

    threaded_time = time.perf_counter() - t0
except Exception as e:
    threaded_available = False
    threaded_time = None
    print("Threaded parser is not available or failed in this environment.")
    print("Error:", repr(e))

if threaded_available:
    threaded_path = OUT_DIR / "threaded_parse_results.json"
    with open(threaded_path, "w", encoding="utf-8") as f:
        json.dump(threaded_results, f, indent=2, ensure_ascii=False)
    threaded_df = pd.DataFrame(threaded_results)
    print("\nThreaded parsing results:")
    display(threaded_df)
    print("Saved threaded results:", threaded_path)

benchmark = {
    "standard_load_time_seconds": load_time,
    "standard_iterate_parse_time"

原文を表示

In this tutorial, we build a workflow for using Docling Parse to analyze PDF documents at a detailed structural level. We start by preparing a stable Python environment, handling common Colab dependency issues, and generating a custom multi-page PDF with text, columns, table-like content, vector shapes, and an embedded image. We then use Docling Parse to extract words, characters, and lines with page-level coordinates, render visual overlays, and save the results into structured JSON and CSV files. Through this workflow, we see how low-level PDF parsing can support document AI tasks such as layout analysis, reading-order reconstruction, table-aware processing, and retrieval-ready document preparation.

Setting Up the Docling Parse Colab Environment and Dependencies

Copy CodeCopiedUse a different Browser

import os, sys, subprocess, textwrap, json, time, shutil

from pathlib import Path

def run(cmd):

print(f"\n$ {cmd}")

return subprocess.run(cmd, shell=True, text=True, capture_output=False)

run(f'{sys.executable} -m pip install -q --no-cache-dir -U "pillow>=10.4.0,<12" reportlab pandas matplotlib docling-core docling-parse')

try:

from PIL import Image, ImageDraw

except ImportError:

print("\nPillow import failed because Colab has a mixed PIL installation.")

print("Reinstalling Pillow and restarting runtime. After restart, run this same cell again.")

run(f'{sys.executable} -m pip uninstall -y pillow PIL')

run(f'{sys.executable} -m pip install -q --no-cache-dir --force-reinstall "pillow>=10.4.0,<12"')

os.kill(os.getpid(), 9)

import pandas as pd

import matplotlib.pyplot as plt

from reportlab.lib.pagesizes import A4

from reportlab.lib import colors

from reportlab.platypus import Table, TableStyle

from reportlab.pdfgen import canvas

from docling_core.types.doc.page import TextCellUnit

from docling_parse.pdf_parser import DoclingPdfParser

print("Environment ready.")

print("Python:", sys.version.split()[0])

WORKDIR = Path("/content/docling_parse_advanced_tutorial")

WORKDIR.mkdir(parents=True, exist_ok=True)

PDF_PATH = WORKDIR / "advanced_docling_parse_demo.pdf"

OUT_DIR = WORKDIR / "outputs"

OUT_DIR.mkdir(exist_ok=True)

DEMO_IMAGE_PATH = WORKDIR / "demo_bitmap.png"

We set up the Colab environment by installing Docling Parse, Docling Core, Pillow, ReportLab, Pandas, and Matplotlib. We also handle the Pillow import issue safely so the notebook can recover if Colab has a broken or mixed PIL installation. We then define the working directory, output folder, PDF path, and image path that we use throughout the tutorial.

Generating a Multi-Element Test PDF for Parser Evaluation

Copy CodeCopiedUse a different Browser

def create_demo_image(path):

img = Image.new("RGB", (320, 180), "white")

draw = ImageDraw.Draw(img)

draw.rectangle([20, 20, 300, 160], outline="black", width=3)

draw.ellipse([55, 45, 145, 135], outline="black", width=4)

draw.line([180, 140, 285, 45], fill="black", width=4)

draw.text((45, 145), "Embedded bitmap image", fill="black")

img.save(path)

create_demo_image(DEMO_IMAGE_PATH)

def build_pdf(pdf_path):

c = canvas.Canvas(str(pdf_path), pagesize=A4)

width, height = A4

c.setFont("Helvetica-Bold", 20)

c.drawString(60, height - 70, "Docling Parse Advanced PDF Parsing Tutorial")

c.setFont("Helvetica", 11)

intro = (

"This generated document is designed for testing text extraction, coordinate parsing, "

"line grouping, vector path detection, bitmap resources, and layout-aware reconstruction."

)

text_obj = c.beginText(60, height - 105)

text_obj.setLeading(15)

for line in textwrap.wrap(intro, width=90):

text_obj.textLine(line)

c.drawText(text_obj)

c.setFont("Helvetica-Bold", 14)

c.drawString(60, height - 170, "1. Two-column text region")

left_para = (

"The left column contains compact explanatory text. A parser should expose words, "

"characters, and line-level cells along with coordinates. These coordinates allow us "

"to reconstruct reading order and inspect the spatial structure of a page."

)

right_para = (

"The right column contains a separate paragraph. In document AI pipelines, layout "

"features are useful for retrieval, table extraction, chunking, and downstream RAG "

"applications where page position can matter."

)

y_start = height - 200

left_text = c.beginText(60, y_start)

left_text.setFont("Helvetica", 10)

left_text.setLeading(13)

for line in textwrap.wrap(left_para, width=42):

left_text.textLine(line)

c.drawText(left_text)

right_text = c.beginText(325, y_start)

right_text.setFont("Helvetica", 10)

right_text.setLeading(13)

for line in textwrap.wrap(right_para, width=42):

right_text.textLine(line)

c.drawText(right_text)

c.setStrokeColor(colors.darkblue)

c.setLineWidth(2)

c.rect(55, height - 315, 225, 130, stroke=1, fill=0)

c.rect(320, height - 315, 225, 130, stroke=1, fill=0)

c.setStrokeColor(colors.darkgreen)

c.setLineWidth(3)

c.circle(140, height - 390, 40, stroke=1, fill=0)

c.line(220, height - 430, 310, height - 355)

c.setFont("Helvetica-Bold", 14)

c.setFillColor(colors.black)

c.drawString(60, height - 470, "2. Simple table-like structure")

data = [

["Section", "Signal", "Expected parser behavior"],

["Text", "Words and lines", "Return text cells with coordinates"],

["Vector", "Boxes and lines", "Expose page path/vector resources"],

["Bitmap", "Embedded image", "Expose or render image resources"],

]

table = Table(data, colWidths=[100, 130, 260])

table.setStyle(TableStyle([

("BACKGROUND", (0, 0), (-1, 0), colors.lightgrey),

("GRID", (0, 0), (-1, -1), 0.7, colors.black),

("FONTNAME", (0, 0), (-1, 0), "Helvetica-Bold"),

("FONTSIZE", (0, 0), (-1, -1), 9),

("VALIGN", (0, 0), (-1, -1), "MIDDLE"),

]))

table.wrapOn(c, width, height)

table.drawOn(c, 60, height - 590)

c.setFont("Helvetica", 9)

c.drawString(60, 55, "Page 1: generated programmatic PDF with text, table-like layout, and vector paths.")

c.showPage()

c.setFont("Helvetica-Bold", 18)

c.drawString(60, height - 70, "Page 2: Bitmap, Dense Text, and Reading Order")

c.setFont("Helvetica", 10)

dense = (

"This page includes an embedded bitmap image and several short blocks of text. "

"We use it to test whether rendering works, whether the parser preserves page-level "

"coordinates, and whether our own reconstruction logic can group words into lines."

)

y = height - 105

for para_idx in range(4):

tx = c.beginText(60, y)

tx.setFont("Helvetica", 10)

tx.setLeading(13)

for line in textwrap.wrap(f"Block {para_idx + 1}: {dense}", width=92):

tx.textLine(line)

c.drawText(tx)

y -= 70

c.drawImage(str(DEMO_IMAGE_PATH), 110, height - 510, width=320, height=180, preserveAspectRatio=True)

c.setStrokeColor(colors.red)

c.setLineWidth(2)

c.roundRect(95, height - 525, 350, 210, 10, stroke=1, fill=0)

c.setFillColor(colors.black)

c.setFont("Helvetica-Bold", 12)

c.drawString(60, height - 570, "Coordinate-aware extraction lets us keep page, text, and position together.")

c.setFont("Helvetica", 9)

c.drawString(60, 55, "Page 2: embedded bitmap image and multiple text blocks.")

c.save()

build_pdf(PDF_PATH)

print("Created PDF:", PDF_PATH)

We generate a small bitmap image and create a custom two-page PDF for testing Docling Parse. We add text blocks, two-column content, vector shapes, table-like content, and an embedded image so the parser has multiple document elements to process. We use the generated PDF as a controlled input to check text extraction, layout structure, rendering, and coordinate-aware parsing.

Extracting Word, Character, and Line Cells with Docling Parse

Copy CodeCopiedUse a different Browser

def safe_to_dict(obj, max_depth=2):

if obj is None:

return None

if isinstance(obj, (str, int, float, bool)):

return obj

if isinstance(obj, (list, tuple)):

return [safe_to_dict(x, max_depth=max_depth - 1) for x in obj[:50]]

if isinstance(obj, dict):

return {

str(k): safe_to_dict(v, max_depth=max_depth - 1)

for k, v in list(obj.items())[:50]

}

if hasattr(obj, "model_dump"):

try:

return obj.model_dump()

except Exception:

pass

if hasattr(obj, "__dict__") and max_depth > 0:

try:

return {

k: safe_to_dict(v, max_depth=max_depth - 1)

for k, v in obj.__dict__.items()

if not k.startswith("_")

}

except Exception:

pass

return str(obj)

def rect_to_dict(rect):

d = safe_to_dict(rect)

if isinstance(d, dict):

return d

attrs = {}

for name in [

"l", "t", "r", "b",

"left", "top", "right", "bottom",

"x0", "y0", "x1", "y1",

"width", "height"

]:

if hasattr(rect, name):

try:

attrs[name] = getattr(rect, name)

except Exception:

pass

return attrs if attrs else {"raw": str(rect)}

def get_text_cell_records(page_no, pred_page, unit_type):

records = []

try:

cells = list(pred_page.iterate_cells(unit_type=unit_type))

except Exception as e:

print(f"Could not iterate {unit_type} cells on page {page_no}: {e}")

return records

for idx, cell in enumerate(cells):

text = getattr(cell, "text", "")

rect = getattr(cell, "rect", None)

records.append({

"page": page_no,

"unit": str(unit_type).split(".")[-1],

"index": idx,

"text": text,

"rect": rect_to_dict(rect),

"raw_cell": safe_to_dict(cell, max_depth=1),

})

return records

def count_possible_resources(pred_page):

resource_summary = {}

names = dir(pred_page)

keywords = ["path", "bitmap", "image", "resource", "line", "rect"]

for name in names:

lname = name.lower()

if any(k in lname for k in keywords) and not name.startswith("_"):

try:

value = getattr(pred_page, name)

if callable(value):

continue

try:

resource_summary[name] = len(value)

except Exception:

resource_summary[name] = type(value).__name__

except Exception:

pass

return resource_summary

parser = DoclingPdfParser()

start = time.perf_counter()

pdf_doc = parser.load(path_or_stream=str(PDF_PATH))

load_time = time.perf_counter() - start

print(f"\nLoaded PDF in {load_time:.3f} seconds.")

all_records = []

page_summaries = []

rendered_paths = []

parse_start = time.perf_counter()

for page_no, pred_page in pdf_doc.iterate_pages():

print(f"\n--- Page {page_no} ---")

word_records = get_text_cell_records(page_no, pred_page, TextCellUnit.WORD)

char_records = get_text_cell_records(page_no, pred_page, TextCellUnit.CHAR)

line_records = get_text_cell_records(page_no, pred_page, TextCellUnit.LINE)

all_records.extend(word_records)

all_records.extend(char_records)

all_records.extend(line_records)

resource_summary = count_possible_resources(pred_page)

page_summaries.append({

"page": page_no,

"words": len(word_records),

"chars": len(char_records),

"lines": len(line_records),

"possible_resource_attributes": resource_summary,

})

print("Words:", len(word_records))

print("Characters:", len(char_records))

print("Lines:", len(line_records))

print("Possible resource attributes:", resource_summary)

print("\nFirst 20 extracted words:")

print(" ".join([r["text"] for r in word_records[:20]]))

for unit_name, unit_type in [

("word", TextCellUnit.WORD),

("char", TextCellUnit.CHAR),

("line", TextCellUnit.LINE),

]:

try:

img = pred_page.render_as_image(cell_unit=unit_type)

out_img = OUT_DIR / f"page_{page_no}_{unit_name}_overlay.png"

img.save(out_img)

rendered_paths.append(out_img)

print("Saved rendered overlay:", out_img)

except Exception as e:

print(f"Could not render {unit_name} overlay for page {page_no}: {e}")

parse_time = time.perf_counter() - parse_start

We define helper functions to safely convert Docling objects, rectangles, and page resources into readable Python dictionaries. We load the generated PDF with DoclingPdfParser and extract word-level, character-level, and line-level text cells from each page. We also render page overlays for different text units to visually inspect how Docling Parse detects and maps content on PDF pages.

Exporting Structured Outputs and Reconstructing Layout-Aware Text

Copy CodeCopiedUse a different Browser

records_path = OUT_DIR / "docling_parse_cells.json"

with open(records_path, "w", encoding="utf-8") as f:

json.dump(all_records, f, indent=2, ensure_ascii=False)

summary_path = OUT_DIR / "page_summaries.json"

with open(summary_path, "w", encoding="utf-8") as f:

json.dump(page_summaries, f, indent=2, ensure_ascii=False)

flat_rows = []

for r in all_records:

rect = r.get("rect", {})

row = {

"page": r["page"],

"unit": r["unit"],

"index": r["index"],

"text": r["text"],

}

if isinstance(rect, dict):

for k, v in rect.items():

if isinstance(v, (str, int, float, bool)) or v is None:

row[f"rect_{k}"] = v

else:

row[f"rect_{k}"] = str(v)

flat_rows.append(row)

df = pd.DataFrame(flat_rows)

csv_path = OUT_DIR / "docling_parse_cells.csv"

df.to_csv(csv_path, index=False)

summary_df = pd.DataFrame(page_summaries)

summary_csv_path = OUT_DIR / "page_summaries.csv"

summary_df.to_csv(summary_csv_path, index=False)

print("\nSaved structured outputs:")

print(records_path)

print(csv_path)

print(summary_path)

print(summary_csv_path)

print("\nPage summary:")

display(summary_df)

print("\nCell dataframe sample:")

display(df.head(20))

def extract_rect_numbers(rect):
    # Return four numeric coordinates from a rectangle dict, or None.
    if not isinstance(rect, dict):
        return None
    # Try common key naming schemes first.
    possible_sets = [
        ("l", "t", "r", "b"),
        ("left", "top", "right", "bottom"),
        ("x0", "y0", "x1", "y1"),
    ]
    for keys in possible_sets:
        if all(k in rect for k in keys):
            try:
                vals = [float(rect[k]) for k in keys]
                return vals
            except Exception:
                pass
    # Fallback: take the first four values that parse as numbers.
    numeric = []
    for v in rect.values():
        try:
            numeric.append(float(v))
        except Exception:
            pass
    if len(numeric) >= 4:
        return numeric[:4]
    return None

word_df = df[df["unit"].str.contains("WORD", case=False, na=False)].copy()
if len(word_df) == 0:
    word_df = df[df["unit"].str.contains("word", case=False, na=False)].copy()

# Rebuild a rectangle dict per row from the flattened rect_* columns.
coords = []
for _, row in word_df.iterrows():
    rect_data = {}
    for col in word_df.columns:
        if col.startswith("rect_"):
            rect_data[col.replace("rect_", "")] = row[col]
    nums = extract_rect_numbers(rect_data)
    coords.append(nums)

word_df["coord_numbers"] = coords
word_df = word_df[word_df["coord_numbers"].notna()].copy()

if len(word_df) > 0:
    # Normalize so that x0 <= x1 and y0 <= y1 regardless of corner order.
    word_df["x0"] = word_df["coord_numbers"].apply(lambda x: min(x[0], x[2]))
    word_df["x1"] = word_df["coord_numbers"].apply(lambda x: max(x[0], x[2]))
    word_df["y0"] = word_df["coord_numbers"].apply(lambda x: min(x[1], x[3]))
    word_df["y1"] = word_df["coord_numbers"].apply(lambda x: max(x[1], x[3]))
    word_df["y_mid"] = (word_df["y0"] + word_df["y1"]) / 2

    reconstructed_pages = {}
    for page, g in word_df.groupby("page"):
        g = g.sort_values(["y_mid", "x0"]).copy()
        y_values = sorted(g["y_mid"].tolist())

        # Cluster vertical midpoints into lines; words within
        # `threshold` units of a line center share that line.
        line_bins = []
        threshold = 8.0
        for y in y_values:
            placed = False
            for line in line_bins:
                if abs(line["center"] - y) <= threshold:
                    line["values"].append(y)
                    line["center"] = sum(line["values"]) / len(line["values"])
                    placed = True
                    break
            if not placed:
                line_bins.append({"center": y, "values": [y]})

        def assign_line(y):
            return min(range(len(line_bins)), key=lambda i: abs(line_bins[i]["center"] - y))

        g["line_id"] = g["y_mid"].apply(assign_line)

        lines = []
        for line_id, lg in g.groupby("line_id"):
            lg = lg.sort_values("x0")
            line_text = " ".join(lg["text"].astype(str).tolist())
            lines.append((lg["y_mid"].mean(), line_text))

        # Lines are emitted by ascending y. This reads top-to-bottom only when
        # y grows downward; with a bottom-left origin the order is reversed.
        lines = sorted(lines, key=lambda x: x[0])
        reconstructed_text = "\n".join([line for _, line in lines])
        reconstructed_pages[int(page)] = reconstructed_text

    recon_path = OUT_DIR / "layout_aware_reconstructed_text.json"
    with open(recon_path, "w", encoding="utf-8") as f:
        json.dump(reconstructed_pages, f, indent=2, ensure_ascii=False)

    print("\nLayout-aware reconstructed text:")
    for page, text in reconstructed_pages.items():
        print(f"\n===== PAGE {page} =====")
        print(text[:2500])
    print("\nSaved reconstruction:", recon_path)
else:
    print("\nCould not build coordinate-based reconstruction because rectangle coordinates were not exposed in a numeric form.")

We save the extracted parsing results into JSON and CSV files for later analysis. We flatten the parsed records into a Pandas DataFrame and display both the page summary and cell-level extraction sample. We also reconstruct text from coordinate information, which helps us understand how a layout-aware reading order can be derived from word positions.
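To make the coordinate-based reconstruction easier to follow, here is a small self-contained sketch on toy data. It is a simplified variant of the line-binning idea, not the tutorial's exact routine: `group_words_into_lines` and its `top_left_origin` flag are hypothetical names, and the sample word boxes are invented for illustration. The flag matters because PDF-style coordinates often place the origin at the bottom-left, in which case larger `y` values are nearer the top of the page.

```python
def group_words_into_lines(words, threshold=8.0, top_left_origin=True):
    """Group (text, x0, y0, x1, y1) word boxes into text lines.

    Words whose vertical midpoints lie within `threshold` of the running
    line center are placed on the same line. Returns a list of strings in
    top-to-bottom order for the given coordinate origin.
    """
    # Reduce each box to (text, left edge, vertical midpoint), sorted by midpoint.
    items = sorted(
        ((t, min(x0, x1), (y0 + y1) / 2.0) for t, x0, y0, x1, y1 in words),
        key=lambda w: w[2],
    )
    lines = []  # each entry: {"center": float, "words": [(x, text), ...]}
    for text, x, y_mid in items:
        if lines and abs(lines[-1]["center"] - y_mid) <= threshold:
            line = lines[-1]
            line["words"].append((x, text))
            # Incremental update of the mean midpoint of the line.
            line["center"] += (y_mid - line["center"]) / len(line["words"])
        else:
            lines.append({"center": y_mid, "words": [(x, text)]})
    # With a top-left origin small y is at the top; otherwise reverse.
    lines.sort(key=lambda ln: ln["center"], reverse=not top_left_origin)
    return [" ".join(t for _, t in sorted(ln["words"])) for ln in lines]


# Invented sample boxes in bottom-left coordinates (larger y is higher up).
sample_words = [
    ("world", 60, 700, 100, 712),
    ("Hello", 10, 700, 50, 712),
    ("Second", 10, 680, 60, 692),
    ("line", 65, 681, 90, 693),
]

lines_bottom_left = group_words_into_lines(sample_words, top_left_origin=False)
print("\n".join(lines_bottom_left))
```

Note that this kind of grouping looks only at vertical position, so words from side-by-side columns at the same height end up on one line; a two-column page would need an extra step that splits words by horizontal position before grouping.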

Benchmarking Threaded Parsing and Checking CLI Availability


print("\nAttempting threaded parsing benchmark...")

threaded_results = []
threaded_available = True

try:
    from docling_parse.pdf_parser import DoclingThreadedPdfParser, ThreadedPdfParserConfig
    from docling_parse.pdf_parsers import DecodePageConfig

    parser_config = ThreadedPdfParserConfig(
        loglevel="fatal",
        threads=4,
        max_concurrent_results=32,
    )
    decode_config = DecodePageConfig()
    threaded_parser = DoclingThreadedPdfParser(
        parser_config=parser_config,
        decode_config=decode_config,
    )

    t0 = time.perf_counter()
    doc_key = threaded_parser.load(str(PDF_PATH))
    page_count = threaded_parser.page_count(doc_key)
    print("Threaded doc key:", doc_key)
    print("Threaded page count:", page_count)

    for result in threaded_parser.iterate_results():
        item = {
            "doc_key": str(getattr(result, "doc_key", "")),
            "page_number": getattr(result, "page_number", None),
            "success": getattr(result, "success", None),
            "error_message": getattr(result, "error_message", None),
        }
        if getattr(result, "success", False):
            seg_page = result.get_page()
            timings = result.get_timings()
            item["word_count"] = len(getattr(seg_page, "word_cells", []))
            try:
                item["total_time"] = timings.total()
            except Exception:
                item["total_time"] = str(timings)
        threaded_results.append(item)

    threaded_time = time.perf_counter() - t0

except Exception as e:
    threaded_available = False
    threaded_time = None
    print("Threaded parser is not available or failed in this environment.")
    print("Error:", repr(e))

if threaded_available:
    threaded_path = OUT_DIR / "threaded_parse_results.json"
    with open(threaded_path, "w", encoding="utf-8") as f:
        json.dump(threaded_results, f, indent=2, ensure_ascii=False)
    threaded_df = pd.DataFrame(threaded_results)
    print("\nThreaded parsing results:")
    display(threaded_df)
    print("Saved threaded results:", threaded_path)

benchmark = {
    "standard_load_time_seconds": load_time,
    "standard_iterate_parse_time
