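Hugging Face Blog · April 16, 2026 09:00 · about a 4-minute read

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

#Multimodal embeddings #Sentence Transformers #Finetuning #RAG/Search #Hugging Face

TL;DR

This article explains how to train and finetune multimodal embedding models with the Sentence Transformers library, and walks through a worked example in which domain-specific data improves retrieval accuracy.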

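Key Points

A standardized multimodal training pipeline

The standard training pipeline has six components (model, dataset, loss function, training arguments, evaluator, and trainer), and it carries over unchanged from text-only models to multimodal models that handle images, audio, and video.

Domain-specific finetuning, demonstrated

Finetuning Qwen3-VL-Embedding-2B on a Visual Document Retrieval dataset raised NDCG@10 from 0.888 to 0.947, a higher retrieval accuracy than models up to 4x larger.

Optimization techniques and evaluation metrics

The article combines CachedMultipleNegativesRankingLoss for the loss computation, Matryoshka dimensions, and evaluation with NDCG@10 to obtain accurate embeddings while keeping compute requirements down.

What is different about the multimodal training script

The script is nearly identical to a text-only one. The differences are the model_kwargs/processor_kwargs passed at load time, the use of CachedMultipleNegativesRankingLoss with mini_batch_size=1, and a cross-modal evaluation with an image corpus and text queries.

A small model made strong through domain-specific training

After one epoch of finetuning, the 2B model reached an NDCG@10 of 0.947, ahead of larger general-purpose models and existing VDR models, which shows the value of domain-specific finetuning.

Trading off size and quality with Matryoshka dimensions

From the full 2048 dimensions down to 64 (32x smaller), the model retains over 92% of its peak performance. The most important information is concentrated in the earlier dimensions, so embedding size and retrieval quality can be balanced at deployment time.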

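Impact Analysis

The article lays out a practical training method for the embedding models that underpin RAG and semantic search, giving developers a reference for building accurate retrieval systems specialized to their own domain. By showing that finetuning can beat larger models, it supports the industry trend of balancing compute cost against model customization.

Editor's Comment

This is a practical guide worth recommending to engineers building multimodal retrieval: it is a concrete case in which domain-specific finetuning, rather than adopting a larger model, directly improved accuracy.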


Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Sentence Transformers is a Python library for using and training embedding and reranker models for applications like retrieval augmented generation, semantic search, and more. In my previous blogpost, I introduced the new multimodal capabilities, showing how to use embedding and reranker models that handle text, images, audio, and video. In this blogpost, I'll show you how to train or finetune these multimodal models on your own data.

As a practical example, I'll walk through finetuning Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval. The resulting model is tomaarsen/Qwen3-VL-Embedding-2B-vdr.

If you're new to multimodal models in Sentence Transformers, I recommend reading Multimodal Embedding & Reranker Models with Sentence Transformers first. For training text-only embedding, reranker, or sparse embedding models, see the Prior Blogposts section at the end.

Table of Contents

Training Components
    Dataset
        Visual Document Retrieval Dataset
    Loss Function
        CachedMultipleNegativesRankingLoss
    Training Arguments
Results
    Model Size vs NDCG@10
    Matryoshka Dimensions vs NDCG@10
Training Multimodal Reranker Models
Additional Resources
    Prior Blogposts
    Training Examples

General-purpose multimodal embedding models like Qwen/Qwen3-VL-Embedding-2B are built to work across many tasks, so they are not tuned for any single domain.

Consider Visual Document Retrieval: given a text query like "What was the company's Q3 revenue?", the model must find the most relevant document screenshot from a corpus of thousands. This requires understanding document layouts, charts, tables, and text, which is a very different skill from e.g. matching pictures of shoes with product descriptions.

By finetuning on domain-specific data, the model can learn these specialized patterns. In my experiment, finetuning improved NDCG@10 from 0.888 to 0.947, ahead of every recent multimodal model I tested, including ones up to 4x larger.
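Since NDCG@10 is the headline metric throughout this post, here is a minimal sketch of how it can be computed for one query with binary relevance. The function name ndcg_at_k and the toy document IDs are my own illustration, not part of Sentence Transformers; the library's InformationRetrievalEvaluator computes the metric for you.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """NDCG@k with binary relevance: a document scores 1 if relevant, else 0."""
    relevant = set(relevant_doc_ids)
    # Discounted cumulative gain of the actual ranking (ranks are 1-based).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        if doc_id in relevant
    )
    # Ideal DCG: every relevant document sits at the top of the ranking.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical rankings for a query whose only relevant document has ID 7.
top_hit = ndcg_at_k([7, 3, 9], relevant_doc_ids=[7])     # relevant doc ranked first
second_hit = ndcg_at_k([3, 7, 9], relevant_doc_ids=[7])  # relevant doc ranked second
miss = ndcg_at_k([3, 9, 4], relevant_doc_ids=[7])        # relevant doc not retrieved
print(top_hit, second_hit, miss)
```

With a single relevant document per query, as in the evaluation used later, the score depends only on the rank at which that document appears, so the reported number is the average of these per-query values.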

Training Components

Training multimodal Sentence Transformer models involves the same components as training text-only models:

Model: The multimodal model to train or finetune.

Dataset: The data used for training and evaluation.

Loss Function: A function that quantifies the model's performance and guides the optimization process.

Training Arguments (optional): Parameters that influence training performance and tracking/debugging.

Evaluator (optional): A tool for evaluating the model before, during, or after training.

Trainer: Brings together the model, dataset, loss function, and other components for training.

The multimodal training pipeline uses the same SentenceTransformerTrainer as text-only training.

Let's walk through each component, using Visual Document Retrieval (matching text queries to document screenshots) as a running example.

The most common approach is to finetune an existing multimodal embedding model, or to start from a Vision-Language Model (VLM) checkpoint.

To finetune an existing multimodal embedding model (e.g. one that already has a modules.json file), load it with SentenceTransformer. The processor_kwargs and model_kwargs arguments are passed on to AutoProcessor.from_pretrained(...) and AutoModel.from_pretrained(...), respectively:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)

You can also start from a fresh VLM checkpoint that hasn't been trained for embeddings yet. Sentence Transformers will attempt to recognize the architecture, infer the supported modalities from the processor, and set up the appropriate forward method and pooling. If the automatic detection doesn't work perfectly for a particular model, the configuration in the saved sentence_bert_config.json can be adjusted.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-2B")

In both cases, you can inspect the modalities that were detected for the model:

print(model.modalities)
# ['text', 'image', 'video', 'message']
print(model.supports("image"))
# True

Instead of using a single VLM backbone, you can compose separate encoders for different modalities using the Router module:

from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.modules import Dense, Pooling, Router, Transformer

# Create separate encoders for different modalities
text_encoder = Transformer("sentence-transformers/all-MiniLM-L6-v2")
text_pooling = Pooling(text_encoder.get_embedding_dimension(), pooling_mode="mean")
text_projection = Dense(text_encoder.get_embedding_dimension(), 768)

# SigLIP outputs pooled embeddings directly, so no separate Pooling module is needed
image_encoder = Transformer("google/siglip2-base-patch16-224")

# Route inputs based on modality
router = Router(
    sub_modules={
        "text": [text_encoder, text_pooling, text_projection],
        "image": [image_encoder],
    },
)
model = SentenceTransformer(modules=[router])

Since Router-based multimodal models use separate encoders per modality, their embedding spaces are initially unaligned. Training is required to align the spaces for meaningful cross-modal similarity. The Dense layer in the example above projects the text embeddings to 768 dimensions.

This approach is useful when you want to use lightweight, specialized encoders rather than a large VLM. You can also combine Router-based multimodality with task-based routing (e.g. different encoders for queries vs. documents) using route_mappings.

Visual Document Retrieval Dataset

For this example, I use the tomaarsen/llamaindex-vdr-en-train-preprocessed dataset, which is derived from llamaindex/vdr-multilingual-train:

from datasets import load_dataset

train_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "train", split="train")
train_dataset = train_dataset.select_columns(["query", "image", "negative_0"])
eval_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "eval", split="train")

Just like text-only training, the dataset format must match your chosen loss function. The rules are the same:

If your loss function requires a Label, your dataset must have a column named "label" or "score".

All columns other than "label" or "score" are considered Inputs. The number of these columns must match the number of valid inputs for your chosen loss function. Beyond the label column, the column names don't matter, only the order does.
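To make the two rules concrete, here is a small sketch that classifies column names the way the rules describe. The helpers split_columns and check_dataset are hypothetical names of my own and are not the library's validation code; they only restate the rules in executable form.

```python
LABEL_COLUMNS = {"label", "score"}  # column names treated as the label


def split_columns(column_names):
    """Separate input columns (order preserved) from the label column, if any."""
    inputs = [name for name in column_names if name not in LABEL_COLUMNS]
    labels = [name for name in column_names if name in LABEL_COLUMNS]
    return inputs, (labels[0] if labels else None)


def check_dataset(column_names, expected_inputs, needs_label):
    """Raise if the columns do not fit a loss with the given requirements."""
    inputs, label = split_columns(column_names)
    if needs_label and label is None:
        raise ValueError("loss requires a 'label' or 'score' column")
    if len(inputs) != expected_inputs:
        raise ValueError(f"expected {expected_inputs} input columns, got {len(inputs)}")
    return inputs


# The training columns used in this post: (query, positive image, hard negative),
# i.e. three inputs and no label.
triplet_inputs = check_dataset(["query", "image", "negative_0"], expected_inputs=3, needs_label=False)
print(triplet_inputs)
```

Because only the order of the input columns matters, renaming "image" to something else would not change the result, but swapping "image" and "negative_0" would silently train on the wrong positives.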

For multimodal datasets, the inputs can contain:

Image: PIL images, file paths, URLs, or numpy/torch arrays.

Audio: file paths, numpy/torch arrays, dicts with "array" and "sampling_rate" keys, or torchcodec.AudioDecoder instances.

Video: file paths, numpy/torch arrays, dicts with "array" and "video_metadata" keys, or torchcodec.VideoDecoder instances.

Multimodal dicts: a dict mapping modality names to values, e.g. {"text": ..., "image": ...}

The data collator automatically calls model.preprocess() on these inputs.

Many Hugging Face datasets that work out of the box with Sentence Transformers have been tagged with sentence-transformers.

CachedMultipleNegativesRankingLoss

For this training, I use CachedMultipleNegativesRankingLoss. For each query, it contrasts the positive against two kinds of negatives:

Hard negatives: the negative column(s) explicitly supplied in the dataset (just negative_0 here).

In-batch negatives: the positives and hard negatives from every other sample in the same batch, reused as additional negatives for this query at no extra cost.

More negatives per query means a stronger training signal, so a larger batch size directly improves training quality. Beyond that, the "cached" variant of the loss uses gradient caching to make large effective batch sizes feasible even when GPU memory is limited.
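The following is a plain-Python sketch of the ranking objective described above, written to show where the in-batch negatives come from. It is not the library's implementation: there is no gradient caching here, the function names are mine, and scale=20.0 is an assumed temperature for the cosine similarities.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def ranking_loss(queries, positives, hard_negatives, scale=20.0):
    """Cross-entropy over all candidates in the batch; query i should pick positive i."""
    # Every positive and every hard negative in the batch is a candidate for every query.
    candidates = positives + hard_negatives
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, candidate) for candidate in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(value - top) for value in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


def negatives_per_query(batch_size, hard_negatives_per_sample=1):
    """All candidates in the batch except the query's own positive."""
    return batch_size * (1 + hard_negatives_per_sample) - 1


# Toy 2-dimensional batch of two samples.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[0.6, 0.8], [0.8, 0.6]]

aligned_loss = ranking_loss(queries, positives, hard_negatives)
swapped_loss = ranking_loss(queries, positives[::-1], hard_negatives)
print(aligned_loss, swapped_loss, negatives_per_query(64))
```

With a batch size of 64 and one hard negative per sample, each query is contrasted against 127 negatives, which is why the batch size matters so much for this loss.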

The mini_batch_size parameter sets how many samples are processed at once during the gradient-caching passes. Smaller values use less memory at the cost of speed. I use 1 here:

from sentence_transformers.sentence_transformer.losses import CachedMultipleNegativesRankingLoss

loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=1)

To produce embeddings that work well at multiple dimensionalities, I wrap the base loss with MatryoshkaLoss:

from sentence_transformers.sentence_transformer.losses import CachedMultipleNegativesRankingLoss, MatryoshkaLoss

loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=1)
loss = MatryoshkaLoss(model, loss, matryoshka_dims=[2048, 1536, 1024, 512, 256, 128, 64])

This is especially useful for multimodal models, where embeddings can be large (2048 dimensions for Qwen3-VL). With Matryoshka training, you can use truncated embeddings (e.g., 256 or 128 dimensions) at deployment time for faster search with minimal quality loss. As I'll show in the Results section, the finetuned model achieves near-peak performance even at 512 dimensions.
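As a sketch of what using a truncated embedding means in practice: keep the first dimensions, renormalize, and compare with cosine similarity as usual. The helper names and the 8-dimensional toy vectors are made up for illustration; with the library itself you would typically rely on truncate_dim rather than slicing by hand.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine_similarity(unit_a, unit_b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(unit_a, unit_b))


# Toy vectors standing in for a query and a document embedding.
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.02, 0.01, 0.01]
document = [0.8, 0.2, 0.4, 0.1, 0.01, 0.03, 0.02, 0.01]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(document, dim)
    print(dim, round(cosine_similarity(q, d), 4))
```

Renormalizing after truncation matters: without it, the shortened vectors are no longer unit length and dot-product scores stop being comparable across dimensions.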

Training Arguments

The SentenceTransformerTrainingArguments class holds the training hyperparameters and the tracking/debugging settings:

from sentence_transformers.sentence_transformer.training_args import SentenceTransformerTrainingArguments, BatchSamplers

run_name = "Qwen3-VL-Embedding-2B-vdr"
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir=f"models/{run_name}",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=0.1,
    save_strategy="steps",
    save_steps=0.1,
    save_total_limit=2,
    logging_steps=0.05,
    run_name=run_name,
)

A few things to note for (multimodal) training:

batch_sampler=BatchSamplers.NO_DUPLICATES: MultipleNegativesRankingLoss benefits from batches without duplicate samples, because a duplicate would be treated as a false in-batch negative.

per_device_train_batch_size=64: a batch this large is feasible because CachedMultipleNegativesRankingLoss with mini_batch_size=1 uses gradient caching to keep memory use in check.
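The fractional values for eval_steps, save_steps, and logging_steps are ratios of the total number of training steps. The arithmetic below sketches how I understand the Hugging Face Trainer to resolve them (rounding up); the dataset size of 10,000 samples is hypothetical, and a single GPU with no gradient accumulation is assumed. The NO_DUPLICATES sampler can also change the exact number of batches per epoch.

```python
import math


def resolve_interval(value, max_steps):
    """Turn a ratio (0 < value < 1) into a step count; pass integers through."""
    return math.ceil(max_steps * value) if value < 1 else int(value)


num_train_samples = 10_000  # hypothetical dataset size
per_device_train_batch_size = 64
num_train_epochs = 1

steps_per_epoch = math.ceil(num_train_samples / per_device_train_batch_size)
max_steps = steps_per_epoch * num_train_epochs

eval_every = resolve_interval(0.1, max_steps)   # eval_steps=0.1
save_every = resolve_interval(0.1, max_steps)   # save_steps=0.1
log_every = resolve_interval(0.05, max_steps)   # logging_steps=0.05
warmup_steps = math.ceil(max_steps * 0.1)       # warmup_ratio=0.1

print(max_steps, eval_every, save_every, log_every, warmup_steps)
```

Under these assumptions the run has 157 steps, is evaluated and checkpointed every 16 steps, and logs every 8 steps, so the settings above give roughly ten evaluations and twenty log entries regardless of dataset size.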

To track retrieval performance before, during, and after training, I use the InformationRetrievalEvaluator:

from sentence_transformers.sentence_transformer.evaluation import InformationRetrievalEvaluator

# Build the evaluation data from the eval dataset.
# Queries and corpus use integer IDs: query 0's relevant document is corpus 0.
eval_queries = {qid: sample["query"] for qid, sample in enumerate(eval_dataset)}
eval_corpus = {did: sample["image"] for did, sample in enumerate(eval_dataset)}
num_eval = len(eval_dataset)

# Add hard negatives to the corpus with offset IDs (num_eval, 2*num_eval, ...)
# so they don't collide with the positive document IDs (0..num_eval-1).
negative_columns = ["negative_0", "negative_1", "negative_2", "negative_3"]
for neg_idx, neg_col in enumerate(negative_columns):
    for did, sample in enumerate(eval_dataset):
        eval_corpus[num_eval * (neg_idx + 1) + did] = sample[neg_col]

# Each query's relevant document is the positive at the same index
eval_relevant_docs = {idx: [idx] for idx in range(len(eval_dataset))}

eval_evaluator = InformationRetrievalEvaluator(
    queries=eval_queries,
    corpus=eval_corpus,
    relevant_docs=eval_relevant_docs,
    batch_size=1,
    show_progress_bar=True,
    name="vdr-eval-hard",
)

The evaluator takes text queries, a corpus of images (including hard negatives), and a mapping of which documents are relevant to which queries. Note that the corpus contains a mix of positive and hard negative document screenshots, making this a challenging evaluation. Using batch_size=1 keeps memory usage low during evaluation.
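The offset scheme for corpus IDs is easy to check in isolation. The sketch below reproduces only the ID arithmetic, with num_eval set to the 300 evaluation queries reported in the Results section; the helper corpus_id is my own name for the expression used in the loop.

```python
def corpus_id(num_eval, doc_index, negative_index=None):
    """ID of a positive (negative_index=None) or of the given hard-negative column."""
    if negative_index is None:
        return doc_index
    return num_eval * (negative_index + 1) + doc_index


num_eval = 300             # evaluation queries, one positive each
num_negative_columns = 4   # negative_0 .. negative_3

# Collect every ID the evaluation corpus would contain.
all_ids = {corpus_id(num_eval, did) for did in range(num_eval)}
for neg_idx in range(num_negative_columns):
    all_ids.update(corpus_id(num_eval, did, neg_idx) for did in range(num_eval))

print(len(all_ids), min(all_ids), max(all_ids))
```

The 300 positives plus 4 x 300 hard negatives give 1500 distinct IDs, which matches the corpus size of the evaluation, and the positives keep IDs 0 to 299 so that eval_relevant_docs can map each query to the document with the same index.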

The SentenceTransformerTrainer brings all of these components together. Here is the complete training script:

from datasets import load_dataset

from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.evaluation import InformationRetrievalEvaluator
from sentence_transformers.sentence_transformer.losses import CachedMultipleNegativesRankingLoss, MatryoshkaLoss
from sentence_transformers.sentence_transformer.model_card import SentenceTransformerModelCardData
from sentence_transformers.sentence_transformer.trainer import SentenceTransformerTrainer
from sentence_transformers.sentence_transformer.training_args import (
    BatchSamplers,
    SentenceTransformerTrainingArguments,
)

# 1. Load a model to finetune with (optional) model card data
model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="Qwen3-VL-Embedding-2B model trained on Visual Document Retrieval query-document screenshot pairs",
    ),
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    # Control image resolution: lower values save memory, higher values preserve detail
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)

# 2. Load a dataset to finetune on: (query, positive, negative_0) triplets for training,
# all 4 hard negatives retained for evaluation
train_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "train", split="train")
train_dataset = train_dataset.select_columns(["query", "image", "negative_0"])
eval_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "eval", split="train")

# 3. Define a loss function
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=1)
loss = MatryoshkaLoss(model, loss, matryoshka_dims=[2048, 1536, 1024, 512, 256, 128, 64])

# 4. (Optional) Specify training arguments
run_name = "Qwen3-VL-Embedding-2B-vdr"
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir=f"models/{run_name}",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,  # BF16 is preferred over FP16 for VLMs due to better numerical stability
    bf16=True,  # Set to True if your GPU supports BF16 (most modern GPUs do)
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicates
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=0.1,
    save_strategy="steps",
    save_steps=0.1,
    save_total_limit=2,
    logging_steps=0.05,
    run_name=run_name,  # Used in e.g. Trackio if installed
    # report_to=["codecarbon", "trackio"],  # Uncomment to enable logging (pip install codecarbon trackio)
)

# 5. (Optional) Create an evaluator & evaluate the base model
eval_queries = {qid: sample["query"] for qid, sample in enumerate(eval_dataset)}
eval_corpus = {did: sample["image"] for did, sample in enumerate(eval_dataset)}
num_eval = len(eval_dataset)
negative_columns = ["negative_0", "negative_1", "negative_2", "negative_3"]
for neg_idx, neg_col in enumerate(negative_columns):
    for did, sample in enumerate(eval_dataset):
        eval_corpus[num_eval * (neg_idx + 1) + did] = sample[neg_col]
eval_relevant_docs = {idx: [idx] for idx in range(len(eval_dataset))}
eval_evaluator = InformationRetrievalEvaluator(
    queries=eval_queries,
    corpus=eval_corpus,
    relevant_docs=eval_relevant_docs,
    batch_size=1,
    show_progress_bar=True,
    name="vdr-eval-hard",
)
eval_evaluator(model)

# 6. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=eval_evaluator,
)
trainer.train()

# 7. (Optional) Evaluate at each Matryoshka dimension
eval_evaluator(model)
for dim in [2048, 1536, 1024, 512, 256, 128, 64]:
    dim_evaluator = InformationRetrievalEvaluator(
        queries=eval_queries,
        corpus=eval_corpus,
        relevant_docs=eval_relevant_docs,
        truncate_dim=dim,
        batch_size=1,
        show_progress_bar=True,
        name=f"vdr-eval-hard-{dim}d",
    )
    dim_evaluator(model)

# 8. Save the trained model
model.save_pretrained(f"models/{run_name}/final")

# 9. (Optional) Push it to the Hugging Face Hub
# This pushes to your personal namespace, e.g. {your_username}/Qwen3-VL-Embedding-2B-vdr
model.push_to_hub("Qwen3-VL-Embedding-2B-vdr")

The training script is nearly identical to a text-only training script. The only differences are:

Model loading: We pass model_kwargs and processor_kwargs when loading the model.

Loss function: We use CachedMultipleNegativesRankingLoss with mini_batch_size=1.

Evaluator: The evaluator uses images in the corpus and text as queries, enabling cross-modal retrieval evaluation.

Everything else (the trainer, training arguments, dataset loading) works exactly the same as text-only training.

Model Size vs NDCG@10

After training for just 1 epoch, the finetuned tomaarsen/Qwen3-VL-Embedding-2B-vdr model achieves an NDCG@10 of 0.947 on the evaluation set (300 queries, 1500 corpus documents, cosine similarity). This is a significant improvement over the base Qwen/Qwen3-VL-Embedding-2B model's 0.888, and outperforms all existing VDR models:

[Figure: Model Size vs NDCG@10. Models compared:]

tomaarsen/Qwen3-VL-Embedding-2B-vdr
Qwen/Qwen3-VL-Embedding-8B
nvidia/omni-embed-nemotron-3b
nvidia/llama-nemotron-embed-vl-1b-v2
nomic-ai/nomic-embed-multimodal-7b
llamaindex/vdr-2b-multi-v1
llamaindex/vdr-2b-v1
nomic-ai/nomic-embed-multimodal-3b
Qwen/Qwen3-VL-Embedding-2B
LCO-Embedding/LCO-Embedding-Omni-7B
LCO-Embedding/LCO-Embedding-Omni-3B
BAAI/BGE-VL-v1.5-zs
BAAI/BGE-VL-v1.5-mmeb
BAAI/BGE-VL-MLLM-S2
BidirLM/BidirLM-Omni-2.5B-Embedding
BAAI/BGE-VL-MLLM-S1
sentence-transformers/clip-ViT-L-14
BAAI/BGE-VL-large
BAAI/BGE-VL-base

The finetuned 2B model outperforms even the 8B Qwen3-VL-Embedding model, demonstrating the power of task-specific finetuning. Finetuning on your own domain is often worth considering, even when a larger general-purpose model is available!

Matryoshka Dimensions vs NDCG@10

The comparison above uses full-size 2048-dim embeddings. Thanks to the Matryoshka training, the finetuned model also holds up well when truncated to fewer dimensions, letting you trade off embedding size and retrieval quality at deployment time:

The finetuned model's peak is at the full 2048 dimensions (0.948), but it stays within 0.3% of peak all the way down to 512 (4x smaller), and retains over 92% of peak even at 64 (32x smaller). Matryoshka training concentrates the most important information in the earlier dimensions, so moderate truncation costs very little performance.
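To put the size side of this trade-off in numbers, the sketch below computes the raw storage needed for an index at each Matryoshka dimension. It assumes float32 values (4 bytes each) and a hypothetical corpus of one million documents; neither figure comes from the experiment above, and index overhead is ignored.

```python
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_documents, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for `num_documents` dense embeddings of size `dim`."""
    return num_documents * dim * bytes_per_value


matryoshka_dims = [2048, 1536, 1024, 512, 256, 128, 64]
num_documents = 1_000_000  # hypothetical corpus size

sizes = {dim: index_size_bytes(num_documents, dim) for dim in matryoshka_dims}
for dim, size in sizes.items():
    print(f"{dim:>4} dims: {size / 1e9:.3f} GB")
```

Going from 2048 to 512 dimensions cuts the raw index to a quarter of its size, and 64 dimensions to one thirty-second, which is the storage saving being weighed against the NDCG@10 figures above.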

[Figure: Matryoshka dimensions vs. finetuned NDCG@10]

The gap between 1024 and 2048 dimensions is small (0.946 vs. 0.948), so I've saved the model with truncate_dim=1024. Loading it with SentenceTransformer("tomaarsen/Qwen3-VL-Embedding-2B-vdr") therefore produces 1024-dimensional embeddings by default.

Training Multimodal Reranker Models

You can also finetune multimodal Cross Encoder (reranker) models using the same training infrastructure. The key difference is using CrossEncoderTrainer, together with CrossEncoderTrainingArguments and a cross-encoder loss, instead of SentenceTransformerTrainer.

Here's a simplified example based on the doodles training script, which trains a reranker to match images with text captions:

from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
from sentence_transformers.cross_encoder.modules import LogitScore, Transformer
from sentence_transformers.cross_encoder.trainer import CrossEncoderTrainer
from sentence_transformers.cross_encoder.training_args import CrossEncoderTrainingArguments

# 1. Build the model from modules
transformer = Transformer(
    "Qwen/Qwen3.5-0.8B",
    transformer_task="any-to-any",
    model_kwargs={"torch_dtype": "bfloat16", "device_map": "auto", "attn_implementation": "flash_attention_2"},
    processing_kwargs={"chat_template": {"add_generation_prompt": True}},
)
# Extend chat template to support "query" and "document" roles
transformer.processor.chat_template = transformer.processor.chat_template.replace(
    'message.role == "user"',
    'message.role in ["user", "query", "document"]'
)
# LogitScore: score = log(P("1")) - log(P("0"))
score_head = LogitScore(
    true_token_id=transformer.tokenizer.convert_tokens_to_ids("1"),
    false_token_id=transformer.tokenizer.convert_tokens_to_ids("0"),
)
model = CrossEncoder(
    modules=[transformer, score_head],
    num_labels=1,
    prompts={
        "image_to_text": "Given the image, judge whether the text matches it. Respond with 1 if they match, 0 if they don't.",
        "text_to_image": "Given the text, judge whether the image matches it. Respond with 1 if they match, 0 if they don't.",
    },
)

# 2. Define the loss
loss = BinaryCrossEntropyLoss(model)

# 3. Multi-dataset training with separate directions
# (args, the datasets, and the evaluators are omitted from this simplified example)
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset={"image_to_text": train_image_to_text, "text_to_image": train_text_to_image},
    eval_dataset={"image_to_text": eval_image_to_text, "text_to_image": eval_text_to_image},
    loss=loss,
    evaluator=[image_to_text_evaluator, text_to_image_evaluator],
)
trainer.train()

There are multiple valid architectural choices for multimodal rerankers, including:

Any-to-Any + LogitScore: Uses the multimodal language model to generate a token, then computes the log-odds of "1" vs "0".

Feature Extraction + Pooling + Dense: Uses only the multimodal base model, and extracts the last token's hidden state and projects it to a score via a Dense layer, avoiding the language modeling head computation.
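The log-odds used by the first option can be illustrated without a model. The sketch below applies the formula score = log(P("1")) - log(P("0")) to a toy vector of next-token logits; the four-entry vocabulary and the token IDs are invented for the example, and this is a restatement of the formula rather than the LogitScore module's code.

```python
import math


def log_softmax(logits):
    """Log-probabilities over the vocabulary, computed stably."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(value - top) for value in logits))
    return [value - log_norm for value in logits]


def logit_score(logits, true_token_id, false_token_id):
    """score = log P(true token) - log P(false token)."""
    log_probs = log_softmax(logits)
    return log_probs[true_token_id] - log_probs[false_token_id]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy next-token logits over a 4-token vocabulary; IDs 0 and 1 stand in for "0" and "1".
vocab_logits = [0.2, 2.5, -1.0, 0.7]
FALSE_ID, TRUE_ID = 0, 1

score = logit_score(vocab_logits, TRUE_ID, FALSE_ID)
match_probability = sigmoid(score)
print(score, match_probability)
```

Because both log-probabilities share the same normalizer, the score reduces to the difference of the two raw logits, and its sigmoid is the probability of "1" renormalized over just the two answer tokens. That makes it a natural single logit for a binary loss such as BinaryCrossEntropyLoss.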

Both approaches are demonstrated in the multimodal cross encoder training examples.

The two scripts linked above split the training data into two datasets, one per direction (image-to-text and text-to-image), with a task-specific prompt for each that tells the model how to score in that direction. Each positive pair is then expanded with randomly sampled negatives so the loss sees a balanced mix of matches and non-matches.
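The negative-sampling step described above can be sketched as follows. This is my own minimal version, not the code from the example scripts: the function name, the file names, and the choice of one negative per positive are assumptions made for illustration.

```python
import random


def add_random_negatives(pairs, num_negatives=1, seed=0):
    """Turn (image, caption) positives into labeled (image, caption, label) examples."""
    rng = random.Random(seed)
    captions = [caption for _, caption in pairs]
    examples = []
    for image, caption in pairs:
        examples.append((image, caption, 1))  # the true match
        # Sample captions belonging to other images as non-matches.
        others = [other for other in captions if other != caption]
        for negative in rng.sample(others, k=min(num_negatives, len(others))):
            examples.append((image, negative, 0))
    return examples


# Hypothetical image-to-text pairs; the strings stand in for real images.
positive_pairs = [
    ("cat.png", "a doodle of a cat"),
    ("house.png", "a doodle of a house"),
    ("tree.png", "a doodle of a tree"),
]
image_to_text_examples = add_random_negatives(positive_pairs, num_negatives=1)
for example in image_to_text_examples:
    print(example)
```

With one negative per positive, the labels are split evenly between 1 and 0. The text-to-image dataset can be built the same way by swapping the roles of the image and the caption.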

Additional Resources

Prior Blogposts

Multimodal Embedding & Reranker Models with Sentence Transformers: Multimodal inference

Training and Finetuning Embedding Models with Sentence Transformers v3: Training embedding models

Training and Finetuning Reranker Models with Sentence Transformers v4: Training reranker models

Training and Finetuning Sparse Embedding Models with Sentence Transformers v5: Training sparse embedding models

Training Examples

The Sentence Transformers repository includes several multimodal training examples:

Visual Document Retrieval: The training script used in this blogpost to finetune a VLM-based embedding model for document screenshot retrieval

Multimodal Reranker (Any-to-Any): Train a multimodal reranker using LogitScore

Multimodal Reranker (Feature Extraction): Train a multimodal reranker using Pooling + Dense

Additionally, the following pages may be useful to learn more about training with Sentence Transformers:

Sentence Transformer > Training Overview

Sentence Transformer > Loss Overview

Cross Encoder > Training Overview

Cross Encoder > Loss Overview

Dataset Overview
