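Designing a Schema-Guided Invoice Intelligence Pipeline with lift-pdf: Extraction Techniques for Accounts-Payable Processing, Validation, and Ledger Construction
This article is a tutorial on building a structured, schema-guided invoice extraction pipeline with lift-pdf. It shows how to extract and validate the complex financial data that comes up in real accounts-payable work.
Key points
From OCR to document understanding
Rather than plain character recognition (OCR), the tutorial takes a document-understanding approach driven by a schema that defines specific fields such as the vendor name and PO number.
Overcoming practical extraction traps
It covers problems that real finance workflows run into: telling the bill-to party from the ship-to party, separating pre-tax and after-tax totals, returning null for missing values, and correctly determining whether an invoice is still unpaid.
A test environment built on synthetic data
It generates realistic invoice PDFs and validates the pipeline against these controlled test documents, with a structured JSON schema as the target output.
Runtime controls and dependencies
It defines runtime controls for the number of invoices to process, whether to use 4-bit loading, whether to preview the PDF, and whether to run a test on a real invoice, and then installs the core dependencies.
Pinning the Pillow version for compatibility
To avoid a known conflict among Pillow, torchvision, and Transformers on Colab, Pillow is pinned to a stable version so that the environment is reproducible.
Dynamic GPU detection and VRAM optimization
The code checks that CUDA is available and switches to 4-bit NF4 quantization when VRAM is below 34 GB (unless full precision is forced) or when 4-bit loading is forced by configuration.
Automatic patching of the Transformers library
Before the model is loaded, `from_pretrained` on the `AutoModel` family of classes is patched dynamically so that a quantization config (quantization_config) and a device map are injected automatically, reducing memory use.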
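Impact analysis
This tutorial highlights why document-intelligence tasks such as invoice processing need contextual understanding that goes beyond plain character recognition. By providing concrete code for the conditional logic and data-consistency problems that arise in practice, it gives developers a foundation for building a working system quickly.
Editorial comment
A strong implementation example showing that automating invoice processing depends not only on OCR accuracy but also on contextual understanding grounded in business logic.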
In this tutorial, we build an end-to-end accounts-payable extraction pipeline with lift-pdf, using synthetic invoice PDFs as controlled test documents and a structured JSON schema as the target output format. Instead of treating invoice parsing as a simple OCR task, we frame it as schema-guided document understanding: we generate realistic invoices, define fields such as vendor identity, billing party, PO number, line items, tax, total amount, balance due, and payment status, and then ask the model to extract those values directly from the rendered PDF layout. We also include practical extraction traps that appear in real finance workflows, such as distinguishing bill-to from ship-to, separating subtotal from after-tax total, returning null for absent values, and correctly marking partially paid invoices as unpaid when a balance remains. Through GPU-aware model loading, optional 4-bit quantization, PDF generation and extraction, scoring, and ledger construction, we turn this tutorial into a compact yet realistic demonstration of document intelligence for invoice mining.
N_DOCS = 3
FORCE_FULL_PRECISION = False
FORCE_4BIT = False
SHOW_FIRST_PAGE = True
RUN_ON_REAL_PDF = False
REAL_PDF_URL = ""
REAL_PDF_PAGES = "0-1"
PIN_PILLOW = True
PILLOW_VERSION = "11.3.0"

import os, sys, subprocess, json, re, time, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def pip(*pkgs, upgrade=False):
    """Install without invoking a shell (so '[hf]' is never glob-expanded)."""
    args = [sys.executable, "-m", "pip", "install", "-q"] + (["-U"] if upgrade else []) + list(pkgs)
    print(" pip install", *pkgs)
    subprocess.run(args, check=False)

print("STEP 1/7 · Installing lift + light dependencies (first run is the slow one)…")
pip("reportlab", "pypdfium2", "pandas", "matplotlib")
pip("lift-pdf[hf]")
pip("bitsandbytes", "accelerate", upgrade=True)
if PIN_PILLOW:
    pip(f"pillow=={PILLOW_VERSION}")
    if "PIL" in sys.modules:
        import PIL
        if getattr(PIL, "__version__", "") != PILLOW_VERSION:
            print(f" Pinned Pillow {PILLOW_VERSION} on disk, but a stale "
                  f"{getattr(PIL, '__version__', '?')} is loaded in memory — restarting runtime.")
            print(" Just re-run the cell(s) after Colab reconnects.")
            os.kill(os.getpid(), 9)
print(" …install finished.\n")

import torch
We begin by defining the runtime controls that decide how many invoices we process, whether we use 4-bit loading, whether we preview the generated PDF, and whether we later test a real invoice. We install the core dependencies for PDF generation, rendering, tabular analysis, plotting, and lift-pdf inference. We also pin Pillow to a fixed version (11.3.0) to work around a known Colab compatibility issue among Pillow, torchvision, and Transformers. If a different Pillow version is already loaded in memory, the cell kills its own process so that Colab restarts the runtime, after which the cell has to be re-run. This setup gives us a reproducible environment before we load any model or generate any document.
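The difference between the Pillow version on disk and the one already loaded in memory can be inspected without importing PIL. The following is a minimal sketch rather than part of the original tutorial: the helper name `pillow_pin_status` and its result keys are hypothetical, and it uses only the standard library's `importlib.metadata`.

```python
import sys
from importlib import metadata


def pillow_pin_status(expected: str) -> dict:
    """Compare the pinned Pillow version with what is installed and what is loaded."""
    try:
        on_disk = metadata.version("pillow")  # version recorded by pip on disk
    except metadata.PackageNotFoundError:
        on_disk = None
    # PIL only appears in sys.modules if something has already imported it.
    in_memory = getattr(sys.modules.get("PIL"), "__version__", None)
    return {
        "expected": expected,
        "on_disk": on_disk,
        "in_memory": in_memory,
        "needs_install": on_disk != expected,
        "needs_restart": in_memory is not None and in_memory != expected,
    }


print(pillow_pin_status("11.3.0"))
```

A `needs_restart` value of true corresponds to the case in which the tutorial cell restarts the runtime: the pin is on disk, but an older module is still held by the running interpreter.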
def detect_gpu():
    if not torch.cuda.is_available():
        raise SystemExit(
            "\n✗ No CUDA GPU found. In Colab: Runtime ▸ Change runtime type ▸ GPU "
            "(A100 is best; L4/T4 also work).\n"
        )
    p = torch.cuda.get_device_properties(0)
    cc = torch.cuda.get_device_capability(0)
    return p.name, p.total_memory / 1e9, cc

def enable_4bit(compute_dtype):
    """Load lift's weights in 4-bit NF4 whatever transformers Auto* class it uses internally."""
    import inspect, functools, transformers
    from transformers import BitsAndBytesConfig
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=compute_dtype,
    )

    def patch(cls):
        try:
            cm = inspect.getattr_static(cls, "from_pretrained")
            orig = cm.__func__ if isinstance(cm, (classmethod, staticmethod)) else cm
        except Exception:
            return

        @functools.wraps(orig)
        def inner(cls_, *args, **kwargs):
            kwargs.setdefault("quantization_config", bnb)
            kwargs.setdefault("device_map", {"": 0})
            model = orig(cls_, *args, **kwargs)
            try:
                model.to = lambda *a, **k: model
                model.cuda = lambda *a, **k: model
            except Exception:
                pass
            return model

        cls.from_pretrained = classmethod(inner)

    for name in ["AutoModelForImageTextToText", "AutoModelForMultimodalLM",
                 "AutoModelForVision2Seq", "AutoModelForCausalLM", "AutoModel"]:
        c = getattr(transformers, name, None)
        if c is not None:
            patch(c)
    try:
        from transformers.modeling_utils import PreTrainedModel
        patch(PreTrainedModel)
    except Exception:
        pass

print("STEP 2/7 · Preparing the model backend…")
gpu_name, vram, cc = detect_gpu()
use_4bit = FORCE_4BIT or (vram < 34 and not FORCE_FULL_PRECISION)
compute_dtype = torch.bfloat16 if cc[0] >= 8 else torch.float16
print(f" GPU: {gpu_name} | ~{vram:.0f} GB | compute capability {cc[0]}.{cc[1]}")
print(f" Load mode: {'4-bit NF4' if use_4bit else 'full bf16'} (compute dtype {compute_dtype})")
os.environ.setdefault("TORCH_DEVICE", "cuda:0")
os.environ.setdefault("MODEL_CHECKPOINT", "datalab-to/lift")
if use_4bit:
    enable_4bit(compute_dtype)

from lift import extract
from lift.model import InferenceManager
print(" Loading lift weights (≈20 GB download on first run)…")
_t = time.time()
MODEL = InferenceManager(method="hf")
print(f" ✓ model ready in {time.time() - _t:.0f}s\n")

def run_lift(pdf_path, schema, page_range=None):
    kw = {"model": MODEL}
    if page_range:
        kw["page_range"] = page_range
    result = extract(pdf_path, schema, **kw)
    return getattr(result, "extraction", None)
We prepare the GPU-aware inference backend and decide whether the model should run in full precision or with 4-bit NF4 quantization. 4-bit loading is selected when FORCE_4BIT is set, or when the GPU reports less than 34 GB of VRAM and FORCE_FULL_PRECISION is not set. The compute dtype is bfloat16 on GPUs with compute capability 8.0 or higher and float16 otherwise. When 4-bit loading is selected, we patch from_pretrained on the Hugging Face Auto* classes so that lift transparently loads the checkpoint with a BitsAndBytes quantization configuration. We initialize the InferenceManager once and reuse it across all invoices, avoiding repeated model-loading overhead. Finally, we wrap lift.extract() inside a small helper so each PDF can be mined with the same schema and optional page range.
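The two ideas in this step, the load-mode rule and the `from_pretrained` patch, can be tried without a GPU or the transformers library. The sketch below is illustrative only: `choose_load_mode`, `ToyModel`, and `inject_defaults` are hypothetical names, `ToyModel` stands in for a transformers Auto* class, and the placeholder string replaces a real `BitsAndBytesConfig`.

```python
import functools
import inspect


def choose_load_mode(vram_gb, compute_capability, force_4bit=False,
                     force_full_precision=False, threshold_gb=34.0):
    """Mirror the tutorial's rule for picking 4-bit loading and the compute dtype."""
    use_4bit = force_4bit or (vram_gb < threshold_gb and not force_full_precision)
    dtype = "bfloat16" if compute_capability[0] >= 8 else "float16"
    return use_4bit, dtype


class ToyModel:
    """Stand-in for a transformers Auto* class (hypothetical, illustration only)."""

    @classmethod
    def from_pretrained(cls, name, **kwargs):
        return {"cls": cls.__name__, "name": name, "kwargs": kwargs}


def inject_defaults(cls, **defaults):
    """Wrap cls.from_pretrained so that missing keyword arguments receive defaults."""
    raw = inspect.getattr_static(cls, "from_pretrained")  # the raw descriptor
    orig = raw.__func__ if isinstance(raw, (classmethod, staticmethod)) else raw

    @functools.wraps(orig)
    def inner(cls_, *args, **kwargs):
        for key, value in defaults.items():
            kwargs.setdefault(key, value)  # values passed by the caller win
        return orig(cls_, *args, **kwargs)

    cls.from_pretrained = classmethod(inner)


inject_defaults(ToyModel, quantization_config="nf4-placeholder", device_map={"": 0})
print(choose_load_mode(16, (7, 5)))      # (True, 'float16')
print(ToyModel.from_pretrained("demo"))  # defaults appear in the kwargs
```

Because the wrapper uses `setdefault`, a caller that passes its own `device_map` or `quantization_config` is not overridden, which is what keeps the patch transparent to the code that calls `from_pretrained`.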
DOCS = [
    dict(
        invoice_number="INV-2026-0412",
        invoice_date="2026-05-04", due_date="2026-06-03",
        vendor_name="Cloudworks Inc.",
        vendor_address="500 Market St, Suite 900, San Francisco, CA 94105, USA",
        bill_to_name="Acme Robotics LLC",
        bill_to_address="12 Foundry Rd, Pittsburgh, PA 15222, USA",
        ship_to_name="Acme Robotics — Warehouse 4",
        ship_to_address="88 Dockside Blvd, Newark, NJ 07114, USA",
        po_number=None,
        discount_amount=None,
        currency_code="USD", currency_symbol="$",
        tax_rate=0.085,
        amount_paid=0.00,
        line_items=[
            ("Cloud Compute — Standard tier (monthly)", 3, 240.00),
            ("Object Storage — 2 TB", 1, 46.00),
            ("Priority Support add-on", 1, 99.00),
        ],
        notes="Payment due within 30 days. Late payments accrue 1.5% monthly interest.",
    ),
    dict(
        invoice_number="INV-ND-2026-118",
        invoice_date="2026-04-18", due_date="2026-05-18",
        vendor_name="Nordic Design Studio Oy",
        vendor_address="Eteläranta 12, 00130 Helsinki, Finland",
        bill_to_name="Helsinki Media Oy",
        bill_to_address="Mannerheimintie 4, 00100 Helsinki, Finland",
        ship_to_name=None, ship_to_address=None,
        po_number="PO-HM-5589",
        discount_amount=785.00,
        currency_code="EUR", currency_symbol="€",
        tax_rate=0.24,
        amount_paid=8760.60,
        line_items=[
            ("Brand identity design package", 1, 4200.00),
            ("Web UI design — 12 screens", 12, 180.00),
            ("Custom illustration set", 1, 850.00),
            ("Design-system documentation", 1, 640.00),
        ],
        notes="Paid in full — thank you. All amounts in EUR.",
    ),
    dict(
        invoice_number="INV-BR-4471",
        invoice_date="2026-06-01", due_date="2026-07-15",
        vendor_name="BuildRight Contractors Inc.",
        vendor_address="740 Industrial Way, Austin, TX 78744, USA",
        bill_to_name="Sunrise Property Group",
        bill_to_address="9 Lakeview Terrace, Austin, TX 78703, USA",
        ship_to_name="Sunrise Property Group — Lot 14 site office",
        ship_to_address="Parcel 14, Mesa Ridge Development, Austin, TX 78737, USA",
        po_number="PO-SPG-2211",
        discount_amount=None,
        currency_code="USD", currency_symbol="$",
        tax_rate=0.07,
        amount_paid=15000.00,
        line_items=[
            ("Site preparation and grading", 1, 18500.00),
            ("Foundation concrete pour (Phase 1)", 1, 27400.00),
        ],
        notes="A 15,000 USD deposit has been received. Remaining balance due by the date above.",
    ),
][:N_DOCS]

def compute(d):
    """Derive every money figure once, so PDF text and ground truth are guaranteed identical."""
    items = [(desc, q, up, round(q * up, 2)) for (desc, q, up) in d["line_items"]]
    subtotal = round(sum(t for *_, t in items), 2)
    disc = d.get("discount_amount")
    taxable = round(subtotal - (disc or 0.0), 2)
    tax = round(taxable * d["tax_rate"], 2)
    total = round(taxable + tax, 2)
    paid = round(d.get("amount_paid", 0.0), 2)
    balance = round(total - paid, 2)
    return dict(items=items, subtotal=subtotal, discount=disc, tax=tax,
                total=total, amount_paid=paid, balance=balance, is_paid=(balance <= 0.005))

def ground_truth(d):
    """Reshape raw inputs + computed totals into the exact JSON shape our schema asks for."""
    c = compute(d)
    return {
        "invoice_number": d["invoice_number"],
        "invoice_date": d["invoice_date"],
        "due_date": d["due_date"],
        "vendor": {"name": d["vendor_name"], "address": d["vendor_address"]},
        "customer_name": d["bill_to_name"],
        "purchase_order_number": d.get("po_number"),
        "currency": d["currency_code"],
        "line_items": [{"description": desc, "quantity": q,
                        "unit_price": up, "line_total": t} for (desc, q, up, t) in c["items"]],
        "subtotal": c["subtotal"],
        "discount_amount": c["discount"],
        "tax_amount": c["tax"],
        "total_amount": c["total"],
        "amount_paid": c["amount_paid"],
        "balance_due": c["balance"],
        "is_paid": c["is_paid"],
    }
We define a controlled synthetic invoice corpus that mimics realistic accounts-payable documents across different vendors, currencies, payment states, and invoice layouts. Each invoice includes raw business fields such as vendor details, bill-to and ship-to parties, PO numbers, discounts, taxes, deposits, and line items. We then compute derived financial values such as subtotal, tax, total, balance due, and paid status from the raw invoice data. This ensures the rendered PDF and the ground-truth JSON remain mathematically consistent.
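To make the derived values concrete, the arithmetic can be replayed on two of the invoices defined above. This is a self-contained sketch that follows the same order of operations as the tutorial's compute() (discount before tax, rounding at each step); the function name `derive_totals` is hypothetical.

```python
def derive_totals(line_items, tax_rate, discount=None, amount_paid=0.0):
    """Recompute the money fields in the same order as the tutorial's compute()."""
    line_totals = [round(qty * unit_price, 2) for (_desc, qty, unit_price) in line_items]
    subtotal = round(sum(line_totals), 2)
    taxable = round(subtotal - (discount or 0.0), 2)  # the discount applies before tax
    tax = round(taxable * tax_rate, 2)
    total = round(taxable + tax, 2)
    balance = round(total - round(amount_paid, 2), 2)
    return {"subtotal": subtotal, "tax": tax, "total": total,
            "balance": balance, "is_paid": balance <= 0.005}


nordic = derive_totals(
    [("Brand identity design package", 1, 4200.00),
     ("Web UI design — 12 screens", 12, 180.00),
     ("Custom illustration set", 1, 850.00),
     ("Design-system documentation", 1, 640.00)],
    tax_rate=0.24, discount=785.00, amount_paid=8760.60)

buildright = derive_totals(
    [("Site preparation and grading", 1, 18500.00),
     ("Foundation concrete pour (Phase 1)", 1, 27400.00)],
    tax_rate=0.07, amount_paid=15000.00)

print(nordic)
print(buildright)
```

Worked by hand, the Nordic invoice has a subtotal of 7,850.00, which becomes 7,065.00 after the 785.00 discount; tax at 24% is 1,695.60 and the total is 8,760.60, equal to the amount paid, so is_paid is true. The BuildRight invoice totals 49,113.00 (45,900.00 plus 3,213.00 tax) against a 15,000.00 deposit, leaving 34,113.00 due, so it has to be reported as unpaid even though a payment appears on the page.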
def render_pdf(d, path):
    """Draw a realistic one-page invoice: header, meta, bill/ship, line items, totals, payment."""
    from reportlab.lib.pagesizes import LETTER
    from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
    from reportlab.lib.units import inch
    from reportlab.lib import colors
    from reportlab.platypus import (SimpleDocTemplate, Paragraph, Spacer,
                                    Table, TableStyle)
    c = compute(d)
    sym = d["currency_symbol"]
    def money(x): return f"{sym}{x:,.2f}"
    ss = getSampleStyleSheet()
    H1 = ParagraphStyle("H1", parent=ss["Title"], fontSize=18, leading=22, spaceAfter=2)
    SMALL = ParagraphStyle("SM", parent=ss["Normal"], fontSize=8.5, textColor=colors.grey, leading=11)
    LBL = ParagraphStyle("LBL", parent=ss["Normal"], fontSize=8.5, textColor=colors.HexColor("#2b3a67"),
                         spaceAfter=1, fontName="Helvetica-Bold")
    BODY = ParagraphStyle("BODY", parent=ss["Normal"], fontSize=9.5, leading=13)
    RIGHT = ParagraphStyle("R", parent=ss["Normal"], fontSize=16, leading=18, alignment=2,
                           textColor=colors.HexColor("#2b3a67"), fontName="Helvetica-Bold")
    story = []

    head = Table([[
        [Paragraph(d["vendor_name"], H1), Paragraph(d["vendor_address"], SMALL)],
        [Paragraph("INVOICE", RIGHT),
         Paragraph(f"{d['invoice_number']}", ParagraphStyle('n', parent=SMALL, alignment=2, fontSize=9.5))],
    ]], colWidths=[4.2 * inch, 2.8 * inch])
    head.setStyle(TableStyle([("VALIGN", (0, 0), (-1, -1), "TOP")]))
    story += [head, Spacer(1, 10)]

    meta_rows = [["Invoice date", d["invoice_date"], "Due date", d["due_date"]]]
    if d.get("po_number"):
        meta_rows.append(["PO number", d["po_number"], "Currency", d["currency_code"]])
    else:
        meta_rows.append(["Currency", d["currency_code"], "", ""])
    meta = Table(meta_rows, colWidths=[1.3 * inch, 2.2 * inch, 1.3 * inch, 2.2 * inch])
    meta.setStyle(TableStyle([
        ("FONTSIZE", (0, 0), (-1, -1), 9),
        ("TEXTCOLOR", (0, 0), (0, -1), colors.HexColor("#2b3a67")),
        ("TEXTCOLOR", (2, 0), (2, -1), colors.HexColor("#2b3a67")),
        ("FONTNAME", (0, 0), (0, -1), "Helvetica-Bold"),
        ("FONTNAME", (2, 0), (2, -1), "Helvetica-Bold"),
        ("BOTTOMPADDING", (0, 0), (-1, -1), 3), ("TOPPADDING", (0, 0), (-1, -1), 3)]))
    story += [meta, Spacer(1, 12)]

    bill = [Paragraph("BILL TO", LBL), Paragraph(d["bill_to_name"], BODY),
            Paragraph(d["bill_to_address"], SMALL)]
    if d.get("ship_to_name"):
        ship = [Paragraph("SHIP TO", LBL), Paragraph(d["ship_to_name"], BODY),
                Paragraph(d["ship_to_address"], SMALL)]
    else:
        ship = [Paragraph("SHIP TO", LBL), Paragraph("Same as billing address", SMALL)]
    parties = Table([[bill, ship]], colWidths=[3.5 * inch, 3.5 * inch])
    parties.setStyle(TableStyle([("VALIGN", (0, 0), (-1, -1), "TOP"),
                                 ("LEFTPADDING", (0, 0), (-1, -1), 0)]))
    story += [parties, Spacer(1, 14)]

    rows = [["Description", "Qty", "Unit price", "Amount"]]
    for (desc, q, up, t) in c["items"]:
        rows.append([desc, str(q), money(up), money(t)])
    items_tbl = Table(rows, colWidths=[3.5 * inch, 0.7 * inch, 1.4 * inch, 1.4 * inch])
    items_tbl.setStyle(TableStyle([
        ("BACKGROUND", (0, 0), (-1, 0), colors.HexColor("#2b3a67")),
        ("TEXTCOLOR", (0, 0), (-1, 0), colors.white),
        ("FONTSIZE", (0, 0), (-1, -1), 9.5),
        ("ALIGN", (1, 0), (-1, -1), "RIGHT"),
        ("GRID", (0, 0), (-1, -1), 0.4, colors.HexColor("#cdd3e6")),
        ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, colors.HexColor("#eef1f8")]),
        ("LEFTPADDING", (0, 0), (-1, -1), 8), ("TOPPADDING", (0, 0), (-1, -1), 5),
        ("BOTTOMPADDING", (0, 0), (-1, -1), 5)]))
    story += [items_tbl, Spacer(1, 10)]

    tot_rows = [["Subtotal", money(c["subtotal"])]]
    if c["discount"]:
        tot_rows.append(["Discount", "-" + money(c["discount"])])
    tot_rows.append([f"Tax ({d['tax_rate']*100:.1f}%)", money(c["tax"])])
    tot_rows.append(["TOTAL", money(c["total"])])
    totals = Table(tot_rows, colWidths=[1.6 * inch, 1.4 * inch], hAlign="RIGHT")
    totals.setStyle(TableStyle([
        ("FONTSIZE", (0, 0), (-1, -1), 10),
        ("ALIGN", (0, 0), (-1, -1), "RIGHT"),
        ("LINEABOVE", (0, -1), (-1, -1), 1.0, colors.HexColor("#2b3a67")),
        ("FONTNAME", (0, -1), (-1, -1), "Helvetica-Bold"),
        ("TEXTCOLOR", (0, -1), (-1, -1), colors.HexColor("#2b3a67")),
        ("TOPPADDING", (0, 0), (-1, -1), 3), ("BOTTOMPADDING", (0, 0), (-1, -1), 3)]))
    story += [totals, Spacer(1, 8)]

    pay_rows = [["Amount paid", money(c["amount_paid"])],
                ["Balance due", money(c["balance"])]]
    pay = Table(pay_rows, colWidths=[1.6 * inch, 1.4 * inch], hAlign="RIGHT")
    due_color = colors.HexColor("#1b7a3d") if c["is_paid"] else colors.HexColor("#7a2e2e")
    pay.setStyle(TableStyle([
        ("FONTSIZE", (0, 0), (-1, -1), 10),
        ("ALIGN", (0, 0), (-1, -1), "RIGHT"),
        ("FONTNAME", (0, 1), (-1, 1), "Helvetica-Bold"),
        ("TEXTCOLOR", (0, 1), (-1, 1), due_color),
        ("TOPPADDING", (0, 0), (-1, -1), 2), ("BOTTOMPADDING", (0, 0), (-1, -1), 2)]))
    status = "PAID IN FULL" if c["is_paid"] else "BALANCE DUE"
    story += [pay, Spacer(1, 6),
              Paragraph(f"<b>Status:</b> {status}", BODY), Spacer(1, 16),
              Paragraph("Notes", LBL), Paragraph(d["notes"], BODY)]

    SimpleDocTemplate(path, pagesize=LETTER,
                      topMargin=0.7 * inch, bottomMargin=0.7 * inch,
                      leftMargin=0.8 * inch, rightMargin=0.8 * inch).build(story)

print("STEP 3/7 · Generating synthetic invoice PDFs…")
CORPUS = []
for i, d in enumerate(DOCS):
    path = f"/content/invoice_{i}.pdf" if os.path.isdir("/content") else f"invoice_{i}.pdf"
    render_pdf(d, path)
    CORPUS.append((d, ground_truth(d), path))
    print(f" ✓ {os.path.basename(path)} — {d['vendor_name']} → {d['bill_to_name']}")
print()

if SHOW_FIRST_PAGE:
    try:
        import pypdfium2 as pdfium, matplotlib.pyplot as plt
        pg = pdfium.PdfDocument(CORPUS[0][2])[0]
        img = pg.render(scale=2.0).to_pil()
        plt.figure(figsize=(6.4, 8.3)); plt.imshow(img); plt.axis("off")
        plt.title("What lift reads — page 1 of invoice_0.pdf", fontsize=10); plt.show()
    except Exception as e:
        print(" page preview skipped:", e, "\n")
We render each synthetic invoice into a realistic one-page PDF using ReportLab, including headers, invoice metadata, billing and shipping blocks, line-item tables, totals, payment status, and notes. We intentionally preserve layout elements that make invoice extraction difficult, such as separate bill-to and ship-to sections and subtotal versus total fields. We then generate the PDF corpus and optionally preview the first page using pypdfium2 and Matplotlib. This step creates the actual visual documents that lift reads during extraction.
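One consequence of this layout is that some labels are simply absent from the page: the PO row and the Discount row are drawn only when a value exists, which is why the extractor is expected to return null for them instead of guessing. The sketch below isolates that row-selection logic without ReportLab. The helper `visible_rows` is hypothetical and simplified (the tutorial's renderer also prints a Currency cell in the metadata table), and the second call is an invented variant of the Nordic invoice with the PO number and discount removed.

```python
def money(x, sym):
    """Format an amount the way the tutorial's renderer does, e.g. €8,760.60."""
    return f"{sym}{x:,.2f}"


def visible_rows(po_number, discount, subtotal, tax, total, tax_rate, sym="$"):
    """Return the (label, text) pairs that would actually be printed on the page."""
    meta = [("PO number", po_number)] if po_number else []
    totals = [("Subtotal", money(subtotal, sym))]
    if discount:  # None or 0 means no Discount row at all
        totals.append(("Discount", "-" + money(discount, sym)))
    totals.append((f"Tax ({tax_rate * 100:.1f}%)", money(tax, sym)))
    totals.append(("TOTAL", money(total, sym)))
    return meta, totals


# Nordic Design Studio invoice as defined in the corpus.
meta, totals = visible_rows("PO-HM-5589", 785.00, 7850.00, 1695.60, 8760.60, 0.24, sym="€")
print(meta, [label for label, _ in totals])

# Hypothetical variant with no PO number and no discount: both labels vanish.
meta, totals = visible_rows(None, None, 7850.00, 1884.00, 9734.00, 0.24, sym="€")
print(meta, [label for label, _ in totals])
```

In the second case nothing on the page mentions a purchase order or a discount, so the only correct extraction for those fields is null.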
SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "The invoice's unique identifier / number"},
        "invoice_date": {"type": "string", "description": "Date the invoice was issued (as printed)"},
        "due_date": {"type": "string", "description": "Date payment is due"},
        "vendor": {
            "type": "object",
            "description": "The party that ISSUED the invoice (the seller / supplier)",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
            }},
        "customer_name": {"type": "string",
                          "description": "The party the invoice is billed TO (the 'Bill To' party) — "
                                         "not the vendor, and not the 'Ship To' party if it differs"},
        "purchase_order_number": {"type": "string",
                                  "description": "The PO number referenced on the invoice. "
                                                 "Return null if no purchase-order number appears"},
        "currency": {"type": "string",