MarkTechPost · June 24, 2026, 03:31 · about a 7-minute read

How to Use NVIDIA Canary-1B-v2 for ASR, Translation, and Automatic SRT Subtitle Export in Python

#Speech Recognition #NeMo #NVIDIA Canary #Automatic Subtitle Generation #Multilingual Translation

TL;DR

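This article walks through a complete workflow, with implementation code, that uses the NVIDIA Canary-1B-v2 model to automate speech recognition, multilingual translation, and SRT subtitle export in Python.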

AI in-depth analysis · June 24, 2026, 04:03

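Key points

Environment setup and dependency management

The tutorial installs the required libraries, including the NeMo ASR toolkit, librosa, soundfile, and pydub, and covers the full development environment setup, including preprocessing audio to 16 kHz mono.

Loading Canary-1B-v2 and running inference

It explains how to check PyTorch and CUDA so that inference runs efficiently on a GPU, and how to load the NVIDIA Canary-1B-v2 model.

Implementing the multilingual workflow

It details a pipeline that transcribes English speech (ASR) and translates it into other supported languages. The language table in the code lists 25 languages, including Bulgarian and Polish.

SRT subtitle output and batch processing

It retrieves word-level and segment-level timestamps and exports them in SRT format. It also covers long-form audio, batch processing, and a simple inference speed benchmark.

Environment check and language support

The code checks whether CUDA is available and, when no GPU is found, warns that it will run on the CPU (slowly). It also defines a dictionary of the 25 supported languages used for the ASR and translation tasks.

Model loading and device placement

It loads the NVIDIA Canary-1B-v2 model with the NeMo library, moves it to the available GPU (or CPU), and puts it in evaluation mode (eval) for inference.

Standardizing audio data

It implements a reusable function that downloads audio from a URL when needed and converts it to a 16 kHz mono WAV file.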

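Impact analysis

This tutorial gives developers a concrete path for bringing NVIDIA's recent speech model into a working environment, so they can quickly build multilingual subtitle generation or large-scale transcription systems. The combination of SRT export and batch processing is especially relevant to productivity in content production and media archiving.

Editorial comment

Beyond introducing the model, the article publishes implementation code that runs from environment setup through subtitle output, which is its main strength. The pairing of multilingual support with SRT export is likely to be immediately useful for localization work and video production.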

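In this tutorial, we build a speech recognition and translation workflow with NVIDIA Canary-1B-v2. We start by installing the required audio libraries, NeMo, NumPy, and SciPy, then load the Canary model on a GPU-enabled runtime for efficient inference. Next, we convert audio to clean 16 kHz mono, run English ASR, translate the speech into several languages, generate word and segment timestamps, and export the translated subtitles as an SRT file. We also test long-form transcription, run batch processing, and benchmark inference speed. The result is a complete multilingual ASR and speech translation pipeline that we can adapt to real audio files, subtitle generation, and large-scale transcription experiments.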

Installing NeMo, Audio Libraries, NumPy, and SciPy Dependencies

import os, subprocess, sys

# Marker file that records that the one-time setup has already run.
SENTINEL = "/content/.canary_setup_done"

if not os.path.exists(SENTINEL):
    def sh(c):
        # Echo the command, then run it; check=False means a failure does not raise.
        print("$", c); subprocess.run(c, shell=True, check=False)

    print(">>> PHASE 1: installing dependencies (one-time)...\n")
    sh("apt-get -qq update")
    sh("apt-get -qq install -y libsndfile1 ffmpeg > /dev/null")
    sh('pip install -q "nemo_toolkit[asr]"')
    sh("pip install -q librosa soundfile pydub")
    sh('pip install -q --force-reinstall --no-cache-dir "numpy>=2.2,<2.4" "scipy>=1.15"')
    with open(SENTINEL, "w") as f:
        f.write("done")
    print("\nSetup complete. Restarting the runtime now.")
    print(" When it reconnects, RUN THIS CELL AGAIN to start the tutorial.")
    # Kill the kernel process so the runtime restarts with the new packages.
    os.kill(os.getpid(), 9)

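We set up the environment for the NVIDIA Canary-1B-v2 tutorial. We install the required system packages, the NeMo ASR toolkit, the audio libraries, and compatible NumPy and SciPy versions. We then write a setup marker file and restart the runtime so that the updated dependencies load cleanly before the main tutorial runs.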
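After the restart, it can help to confirm that the pinned versions were actually picked up. The sketch below is only an illustration: `parse_version` and `in_range` are hypothetical helpers that are not part of the tutorial, and they handle only plain numeric versions such as 2.3.1.

```python
import re

def parse_version(text):
    """Turn '2.3.1' (or '2.3.1+cu121') into a tuple of ints, padded to 3 parts."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a numeric version: {text!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    # Pad so that '2.2' and '2.2.0' compare as equal.
    return tuple(parts + [0] * (3 - len(parts)))

def in_range(version, minimum, maximum=None):
    """True when minimum <= version < maximum (maximum is optional)."""
    v = parse_version(version)
    if v < parse_version(minimum):
        return False
    return maximum is None or v < parse_version(maximum)

# Mirrors the pins used in the install cell: numpy>=2.2,<2.4 and scipy>=1.15.
print(in_range("2.3.1", "2.2", "2.4"))
print(in_range("2.4.0", "2.2", "2.4"))
print(in_range("1.15.2", "1.15"))
```

In the notebook, one would pass `np.__version__` and `scipy.__version__` instead of the literal strings.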

Loading NVIDIA Canary-1B-v2 and Checking GPU Availability

import time, json, gc, math, urllib.request
import torch, numpy as np, soundfile as sf, librosa

print(">>> PHASE 2: running tutorial\n")
print("NumPy:", np.__version__, "| PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0),
          f"| VRAM: {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
else:
    print("No GPU: will run on CPU (very slow). "
          "Set Runtime > Change runtime type > GPU.")

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Language codes accepted as source_lang / target_lang (25 entries).
LANGS = {
    "bg":"Bulgarian","hr":"Croatian","cs":"Czech","da":"Danish","nl":"Dutch",
    "en":"English","et":"Estonian","fi":"Finnish","fr":"French","de":"German",
    "el":"Greek","hu":"Hungarian","it":"Italian","lv":"Latvian","lt":"Lithuanian",
    "mt":"Maltese","pl":"Polish","pt":"Portuguese","ro":"Romanian","sk":"Slovak",
    "sl":"Slovenian","es":"Spanish","sv":"Swedish","ru":"Russian","uk":"Ukrainian",
}
print(f"\nSupported languages ({len(LANGS)}):", ", ".join(LANGS.keys()))

from nemo.collections.asr.models import ASRModel

print("\nLoading nvidia/canary-1b-v2 ...")
t0 = time.time()
# Download the checkpoint, move it to the chosen device, and switch to eval mode.
asr_model = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2").to(DEVICE).eval()
print(f"Model loaded in {time.time()-t0:.1f}s")

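We import the main libraries and check whether CUDA is available for GPU acceleration. We define a dictionary of the 25 languages that Canary supports for multilingual ASR and translation. We then load the NVIDIA Canary-1B-v2 model through NeMo, move it to the available device, and switch it to evaluation mode for inference.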
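Because the language codes are passed to the model as plain strings, a small guard can catch typos before any inference runs. The following is a minimal sketch under stated assumptions: `check_lang_pair` is a hypothetical helper, the table is a shortened copy of the tutorial's `LANGS` dictionary, and the check covers only membership in that table, not whether the model handles every pair equally well.

```python
# Small subset of the tutorial's LANGS table, repeated so the sketch runs alone.
LANGS = {"en": "English", "fr": "French", "de": "German", "es": "Spanish", "it": "Italian"}

def check_lang_pair(source_lang, target_lang, langs=LANGS):
    """Return 'asr' or 'translation' for a listed pair, else raise ValueError."""
    for role, code in (("source_lang", source_lang), ("target_lang", target_lang)):
        if code not in langs:
            supported = ", ".join(sorted(langs))
            raise ValueError(f"{role}={code!r} is not in the table; supported: {supported}")
    # Same source and target means plain transcription; otherwise translation.
    return "asr" if source_lang == target_lang else "translation"

print(check_lang_pair("en", "en"))
print(check_lang_pair("en", "fr"))
```

Calling such a check at the top of a transcription helper would turn a mistyped code into an immediate, readable error.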

Preparing 16 kHz Audio and Running English ASR with Translation

TARGET_SR = 16000

def prepare_audio(path_or_url, out_path=None):
    # Download remote files first so librosa can read them from disk.
    if str(path_or_url).startswith(("http://", "https://")):
        local = "/content/_dl_" + os.path.basename(path_or_url.split("?")[0])
        urllib.request.urlretrieve(path_or_url, local)
        path_or_url = local
    # Resample to 16 kHz and downmix to mono in one step.
    audio, _ = librosa.load(path_or_url, sr=TARGET_SR, mono=True)
    if out_path is None:
        base = os.path.splitext(os.path.basename(path_or_url))[0]
        out_path = f"/content/{base}_16k_mono.wav"
    sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")
    dur = len(audio) / TARGET_SR
    print(f"Prepared: {out_path} ({dur:.1f}s, 16kHz, mono)")
    return out_path, dur

SAMPLE_URL = "https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav"
sample_wav, sample_dur = prepare_audio(SAMPLE_URL)

def transcribe(files, source_lang="en", target_lang="en", timestamps=False, batch_size=1):
    # Accept a single path as well as a list of paths.
    if isinstance(files, str):
        files = [files]
    return asr_model.transcribe(files, source_lang=source_lang, target_lang=target_lang,
                                timestamps=timestamps, batch_size=batch_size)

print("\n=== 1) BASIC ASR (English) ===")
res = transcribe(sample_wav, source_lang="en", target_lang="en")
print("Transcript:", res[0].text)

print("\n=== 2) TRANSLATION (EN audio -> X) ===")
for tgt in ["fr", "de", "es", "it"]:
    out = transcribe(sample_wav, source_lang="en", target_lang=tgt)
    print(f" EN -> {LANGS[tgt]:<10} ({tgt}): {out[0].text}")

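We create a reusable audio preparation function that downloads the audio when given a URL and converts it to 16 kHz mono WAV. We prepare the sample file and define a helper that wraps the model's transcribe call for both transcription and translation. We then run basic English ASR and translate the same English speech into French, German, Spanish, and Italian.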
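To make the "16 kHz mono" step concrete, here is a NumPy-only sketch of what the conversion does conceptually. It is a simplification rather than what the tutorial runs: `to_mono` and `resample_linear` are hypothetical helpers, the input is assumed to have shape (samples, channels), and the linear interpolation has no anti-aliasing filter, whereas librosa's own resampler is more careful about aliasing.

```python
import numpy as np

def to_mono(audio):
    """Average channels; expects shape (samples,) or (samples, channels)."""
    audio = np.asarray(audio, dtype=np.float32)
    return audio if audio.ndim == 1 else audio.mean(axis=1)

def resample_linear(audio, orig_sr, target_sr=16000):
    """Resample by linear interpolation (simplified; no anti-aliasing filter)."""
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Positions of the output samples, measured in input-sample units.
    positions = np.arange(n_out) * (orig_sr / target_sr)
    return np.interp(positions, np.arange(len(audio)), audio).astype(np.float32)

# One second of synthetic 44.1 kHz stereo: a 440 Hz tone on both channels.
t = np.arange(44100) / 44100.0
stereo = np.stack([np.sin(2 * np.pi * 440 * t)] * 2, axis=1)
mono_16k = resample_linear(to_mono(stereo), orig_sr=44100)
print(mono_16k.shape, mono_16k.dtype)
```

For real recordings, the `librosa.load(..., sr=16000, mono=True)` call used in the tutorial remains the better choice; the sketch only shows the shape of the transformation.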

Generating Word and Segment Timestamps and Exporting SRT Subtitles

print("\n=== 3) TIMESTAMPS (ASR) ===")
ts_out = transcribe(sample_wav, source_lang="en", target_lang="en", timestamps=True)
word_ts = ts_out[0].timestamp.get("word", [])
seg_ts = ts_out[0].timestamp.get("segment", [])
print("Segments:")
for s in seg_ts:
    print(f" [{s['start']:6.2f}s - {s['end']:6.2f}s] {s['segment']}")
print("First 10 words:")
for w in word_ts[:10]:
    print(f" [{w['start']:6.2f}s - {w['end']:6.2f}s] {w['word']}")

def _srt_time(t):
    # Work in whole milliseconds so rounding can never produce ",1000".
    total_ms = int(round(t * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments, out_path="/content/output.srt"):
    # Each cue is: index, time range, text, blank separator line.
    lines = []
    for i, seg in enumerate(segments, 1):
        lines += [str(i), f"{_srt_time(seg['start'])} --> {_srt_time(seg['end'])}",
                  seg["segment"].strip(), ""]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    print(f"Saved SRT: {out_path}")
    return out_path

print("\n=== 4) SRT EXPORT (translated French subtitles) ===")
fr_ts = transcribe(sample_wav, source_lang="en", target_lang="fr", timestamps=True)
segments_to_srt(fr_ts[0].timestamp["segment"], "/content/subtitles_fr.srt")
with open("/content/subtitles_fr.srt", encoding="utf-8") as f:
    print(f.read())

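We enable timestamped transcription to extract both segment-level and word-level timing. We print the segments and the first ten word timestamps to inspect how the model aligns text with audio. We then convert the translated French segments into an SRT subtitle file and display its contents.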
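Model segments can be long for on-screen subtitles, so one possible extension is to regroup the word-level timestamps into shorter cues. The sketch below is an assumption-laden illustration, not part of the tutorial: `words_to_cues` is a hypothetical helper, the sample words are invented, and the `max_chars` and `max_gap` defaults are arbitrary starting points.

```python
def words_to_cues(words, max_chars=42, max_gap=0.8):
    """Group word timestamps into subtitle cues.

    A new cue starts when adding the next word would exceed max_chars,
    or when the silence before it is longer than max_gap seconds.
    """
    cues, current = [], None
    for w in words:
        text = w["word"].strip()
        if not text:
            continue
        too_long = current is not None and len(current["segment"]) + 1 + len(text) > max_chars
        long_pause = current is not None and w["start"] - current["end"] > max_gap
        if current is None or too_long or long_pause:
            current = {"start": w["start"], "end": w["end"], "segment": text}
            cues.append(current)
        else:
            current["segment"] += " " + text
            current["end"] = w["end"]
    return cues

# Hypothetical word timestamps in the same shape the tutorial prints.
words = [
    {"word": "well", "start": 0.00, "end": 0.30},
    {"word": "I", "start": 0.35, "end": 0.40},
    {"word": "don't", "start": 0.45, "end": 0.70},
    {"word": "know", "start": 0.75, "end": 1.00},
    {"word": "maybe", "start": 2.50, "end": 2.90},
]
for cue in words_to_cues(words, max_chars=20):
    print(f"{cue['start']:.2f}-{cue['end']:.2f} {cue['segment']}")
```

The cues use the same `start`, `end`, and `segment` keys as the model's segment timestamps, so they could be handed to a writer such as `segments_to_srt` without changes.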

Running Long-Form Transcription, Batch Processing, and Speed Benchmark

print("\n=== 5) LONG-FORM (sample tiled x6) ===")
long_audio, _ = librosa.load(sample_wav, sr=TARGET_SR, mono=True)
long_audio = np.tile(long_audio, 6)  # repeat the clip six times
sf.write("/content/long.wav", long_audio, TARGET_SR, subtype="PCM_16")
print(f"Long clip duration: {len(long_audio)/TARGET_SR:.1f}s")
long_out = transcribe("/content/long.wav", source_lang="en", target_lang="en", batch_size=1)
print("Long transcript (first 300 chars):", long_out[0].text[:300], "...")

print("\n=== 6) BATCH ===")
# Write two copies of the sample so there is something to batch.
for name in ["clip_a", "clip_b"]:
    sf.write(f"/content/{name}.wav",
             librosa.load(sample_wav, sr=TARGET_SR, mono=True)[0], TARGET_SR, subtype="PCM_16")
batch = transcribe(["/content/clip_a.wav", "/content/clip_b.wav"],
                   source_lang="en", target_lang="en", batch_size=2)
for i, b in enumerate(batch):
    print(f" file {i}: {b.text}")

print("\n=== 7) BENCHMARK ===")
t0 = time.time(); _ = transcribe(sample_wav, source_lang="en", target_lang="en")
elapsed = time.time()-t0
# RTFx = audio seconds processed per second of compute.
print(f"Audio: {sample_dur:.2f}s | Compute: {elapsed:.2f}s | RTFx ≈ {sample_dur/elapsed:.1f}x")
print("\nDone. Change source_lang/target_lang from the LANGS dict to try other languages.")

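We test long-form transcription by tiling the sample audio six times and passing the longer clip through the model. We also write two copies of the sample clip to demonstrate batch transcription with a batch size of two. Finally, we benchmark the model by dividing the audio duration by the compute time and report the result as the inverse real-time factor (RTFx), where values above 1 mean faster than real time.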
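A single timed call can be noisy, so a slightly more careful measurement might add a warm-up run and take the median of several repeats. The following is a minimal sketch, assuming a callable that performs one transcription: `measure_rtfx` and `fake_transcribe` are hypothetical names, and the sleep stands in for the model so the example runs anywhere.

```python
import statistics
import time

def measure_rtfx(fn, audio_seconds, warmup=1, repeats=5):
    """Return (median_seconds, rtfx) for calling fn() on a clip of known length."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    # RTFx: seconds of audio processed per second of compute.
    return median, audio_seconds / median

# Stand-in workload so the sketch runs without a model.
def fake_transcribe():
    time.sleep(0.01)

median, rtfx = measure_rtfx(fake_transcribe, audio_seconds=10.0)
print(f"median {median:.3f}s | RTFx ~ {rtfx:.0f}x")
```

In the notebook, `fn` could be a lambda that calls the tutorial's `transcribe` helper on `sample_wav`, with `audio_seconds=sample_dur`.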

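In conclusion, we completed a practical end-to-end workflow for using NVIDIA Canary-1B-v2 as a multilingual ASR and speech translation system. We processed raw audio, generated transcripts, translated speech into several target languages, extracted timestamps, created subtitle files, handled a longer audio clip, and measured runtime performance with a simple benchmark. We now have a reusable, Colab-ready pipeline that we can extend with custom uploads, more languages, larger batches, and production-style audio processing.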

This article, "How to Use NVIDIA Canary-1B-v2 for ASR, Translation, and Automatic SRT Subtitle Export in Python," was first published on MarkTechPost.
