Simon Willison Blog · April 13, 2026 08:57 · about 1 min read

Gemma 4 audio with MLX

#SpeechRecognition #MultimodalAI #LLM #OpenSource #DeveloperTools #AppleSilicon

TL;DR

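Simon Willison's blog post presents a concrete command recipe for transcribing an audio file on macOS with the Gemma 4 E2B model using MLX and mlx-vlm. It also shows the result of running the recipe on a real 14-second WAV file, including the words the model misrecognized.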

AI deep analysis · April 13, 2026 09:40


Depth: 40%

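Key points

A concrete recipe for audio transcription

The post provides a runnable command that uses uv, MLX, mlx-vlm, torchvision, and gradio on macOS to transcribe an audio file with the 10.28 GB Gemma 4 E2B model.

A real demo with shared results

The author ran the recipe on a 14-second demo audio file and quotes the resulting transcript.

A worked example of the model's performance and limits

The transcript rendered "This right here..." as "This front here..." and "how well that works" as "how that works". The author notes that he can hear why the model made these mistakes, which illustrates the model's current limits.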

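Impact analysis

The post gives the developer community a concrete method for applying a recent large language model to a practical task, speech recognition. Its educational value lies in lowering the barrier to building multimodal AI applications on an open-source toolchain. It is, however, a report of a personal experiment and does not claim a technical breakthrough or broad industry impact.

Editorial comment

The post has high practical value as a technical blog entry but limited novelty. Its main contribution is a reproducible recipe that reduces a complex technology stack to one concrete command.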

Thanks to a tip from Rahim Nathwani, here's a uv run recipe for transcribing an audio file on macOS using the 10.28 GB Gemma 4 E2B model with MLX and mlx-vlm:


uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio \
  mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio file.wav \
  --prompt "Transcribe this audio" \
  --max-tokens 500 \
  --temperature 1.0
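
To reuse the recipe on other recordings, the same invocation can be assembled from Python. The following is a minimal sketch, not part of the original post: the function and parameter names are hypothetical, and only the flags and values shown in the command above are used. It assumes uv is on PATH and that the machine is a Mac able to load the model.

```python
import subprocess

# Default taken from the recipe above. All names in this sketch are
# hypothetical and exist only for illustration.
DEFAULT_MODEL = "google/gemma-4-e2b-it"


def build_transcribe_command(
    audio_path: str,
    model: str = DEFAULT_MODEL,
    prompt: str = "Transcribe this audio",
    max_tokens: int = 500,
    temperature: float = 1.0,
) -> list[str]:
    """Return the recipe's uv invocation as an argument list."""
    return [
        "uv", "run", "--python", "3.13",
        "--with", "mlx_vlm", "--with", "torchvision", "--with", "gradio",
        "mlx_vlm.generate",
        "--model", model,
        "--audio", audio_path,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--temperature", str(temperature),
    ]


def transcribe(audio_path: str) -> str:
    """Run the command and return its raw standard output.

    The output is returned unparsed, because this sketch makes no
    assumption about how the tool formats what it prints.
    """
    result = subprocess.run(
        build_transcribe_command(audio_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout
```

Calling `transcribe("file.wav")` would run the same command as the recipe. Passing an argument list rather than a shell string avoids quoting problems with file names that contain spaces.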


I tried it on this 14 second .wav file and it output the following:

This front here is a quick voice memo. I want to try it out with MLX VLM. Just going to see if it can be transcribed by Gemma and how that works.

(That was supposed to be "This right here..." and "... how well that works" but I can hear why it misinterpreted that as "front" and "how that works".)
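
The two errors can also be located mechanically with a word-level diff. The sketch below uses Python's standard `difflib`. The transcribed sentence is quoted from the post, while the intended sentence is an assumption reconstructed by applying the two corrections the author mentions. The function name is hypothetical.

```python
from difflib import SequenceMatcher

# Assumption: the intended sentence is rebuilt from the author's two
# corrections ("right" for "front", and the dropped "well").
INTENDED = (
    "This right here is a quick voice memo. I want to try it out with "
    "MLX VLM. Just going to see if it can be transcribed by Gemma and "
    "how well that works."
)

# Quoted from the model output shown in the post.
TRANSCRIBED = (
    "This front here is a quick voice memo. I want to try it out with "
    "MLX VLM. Just going to see if it can be transcribed by Gemma and "
    "how that works."
)


def word_differences(intended: str, transcribed: str) -> list[tuple[str, str, str]]:
    """Return (operation, intended words, transcribed words) for each mismatch."""
    a = intended.split()
    b = transcribed.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


for tag, said, heard in word_differences(INTENDED, TRANSCRIBED):
    print(f"{tag}: said {said!r}, transcribed {heard!r}")
```

Under that assumption the diff reports one substituted word ("right" became "front") and one dropped word ("well"), which matches the author's description.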

Tags: uv, mlx, ai, gemma, llms, speech-to-text, python, generative-ai



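Hugging Face Blog · ★4 · April 23, 2026 09:00

How to use Transformers.js in a Chrome extension

Developers can embed Transformers.js in a Chrome extension to run machine learning models in the browser. The post walks through an implementation that removes the server dependency, which protects privacy and reduces latency.

InfoQ · ★3 · April 24, 2026 00:00

Google announces Room 3.0: a Kotlin-first, asynchronous, multiplatform persistence library

Google has announced Room 3.0. This version introduces breaking changes, strengthens Kotlin Multiplatform support, and adds support for JS and Wasm.

Simon Willison Blog · April 16, 2026 01:41

A tool for natural speech synthesis with Google's Gemini 3.1 Flash TTS model

Google has released the Gemini 3.1 Flash TTS model, which supports single-speaker and multi-speaker conversation modes and accepts vocal direction tags. The tool generates natural-sounding speech from text and makes it available for download.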
