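Simon Willison Blog · April 16, 2026, 02:13 · about a 3-minute read

Gemini 3.1 Flash TTS released

#TTS #SpeechGeneration #MultimodalAI #Gemini #PromptEngineering #CreativeAI

TL;DR

Google has released Gemini 3.1 Flash TTS, a new text-to-speech model whose output can be directed in detail through prompts: voice quality, accent, emotion, scene setting, and more.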

AI in-depth analysis · April 16, 2026, 03:42

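Key points

Detailed, prompt-based voice control

Gemini 3.1 Flash TTS lets you specify the voice profile, accent, emotional expression, speaking pace, and scene setting in a natural-language prompt. The example prompt sets up a London radio DJ.

Audio file output via the API

The model is available through the standard Gemini API under the model ID "gemini-3.1-flash-tts-preview", but it can only output audio files.

Reproducing regional accents

Changing the accent references in the prompt from "Brixton, London" to "Newcastle" or "Exeter, Devon" is enough to generate audio in a different regional accent.

Building creative voice profiles

The prompting guide shows how to describe a complete persona: the speaker's background (the DJ of "The Morning Hype"), the studio environment, the vocal style (the "Vocal Smile"), the pace, and so on.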

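Impact analysis

This release shows text-to-speech moving from simply voicing text toward generating a performance that carries emotion and context. Content creators can generate more natural, expressive audio with less effort, which could change audio production workflows.

Editorial comment

Voice control this detailed through prompting alone is a major step forward for TTS. That said, output is currently limited to audio files, which may be a practical constraint.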

Gemini 3.1 Flash TTS

Google released Gemini 3.1 Flash TTS today, a new text-to-speech model that can be directed using prompts.

It's presented via the standard Gemini API using gemini-3.1-flash-tts-preview as the model ID, but can only output audio files.
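As a rough illustration of what "audio only" means in practice, here is a minimal Python sketch of the two offline halves of a call: building the request body and turning the returned audio into a WAV file. Only the model ID comes from the post. The request shape (`responseModalities`, `speechConfig`), the voice name `Kore`, and the 24 kHz, 16-bit mono PCM response format are assumptions carried over from earlier Gemini TTS preview models and may not match this one. The helper names are hypothetical.

```python
import base64
import io
import wave

MODEL_ID = "gemini-3.1-flash-tts-preview"  # model ID quoted in the post


def build_request(prompt: str, voice_name: str = "Kore") -> dict:
    """Build a generateContent-style body that requests audio output only.

    The field names and the default voice are assumptions based on
    earlier Gemini TTS previews, not confirmed for this model.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}
            },
        },
    }


def pcm_to_wav(pcm: bytes, rate: int = 24000, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container (assumed 24 kHz, 16-bit, mono)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buf.getvalue()


def extract_wav(response: dict) -> bytes:
    """Decode the first base64 audio part of a parsed JSON response into WAV bytes."""
    part = response["candidates"][0]["content"]["parts"][0]
    return pcm_to_wav(base64.b64decode(part["inlineData"]["data"]))
```

In a real call, the body from `build_request` would be POSTed with an API key to the `generateContent` endpoint for `MODEL_ID`, and the parsed JSON response passed to `extract_wav`. That network step is left out here so the sketch stays runnable offline.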

The prompting guide is surprising, to say the least. Here's their example prompt to generate just a few short sentences of audio:

# AUDIO PROFILE: Jaz R.
## "The Morning Hype"

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").

Pace: Speaks at an energetic pace, keeping up with the fast music.  Speaks with A "bouncing" cadence. High-speed delivery with fluid transitions — no dead air, no gaps.

Accent: Jaz is from Brixton, London

### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you.
[shouting] Turn this up! We've got the project roadmap landing in three, two... let's go!

Here's what I got using that example prompt:

[Audio clip: output generated from the example prompt]

Then I modified it to say "Jaz is from Newcastle" and "... requires a charismatic Newcastle accent" and got this result:

[Audio clip: Newcastle variant]

Here's Exeter, Devon for good measure:

[Audio clip: Exeter, Devon variant]
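The accent swap described above amounts to two string substitutions in the prompt: the "Jaz is from ..." line and the "charismatic ... accent" phrase. A minimal Python sketch follows, with a hypothetical helper name `with_accent`. The post quotes the exact wording only for the Newcastle variant, so the Exeter wording shown in the usage note is an assumption.

```python
def with_accent(prompt: str, origin: str, accent: str,
                base_origin: str = "Brixton, London",
                base_accent: str = "Estuary") -> str:
    """Return a copy of the prompt with the speaker's origin and accent swapped.

    Raises ValueError if either expected phrase is missing, so that a
    silently unchanged prompt is never sent to the model.
    """
    origin_phrase = f"Jaz is from {base_origin}"
    accent_phrase = f"charismatic {base_accent} accent"
    if origin_phrase not in prompt or accent_phrase not in prompt:
        raise ValueError("prompt does not contain the expected accent phrases")
    updated = prompt.replace(origin_phrase, f"Jaz is from {origin}")
    return updated.replace(accent_phrase, f"charismatic {accent} accent")
```

For example, `with_accent(prompt, "Newcastle", "Newcastle")` reproduces the edit quoted in the post, and `with_accent(prompt, "Exeter, Devon", "Devon")` is one plausible wording for the third clip.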

I had Gemini 3.1 Pro vibe code this UI for trying it out:

Tags: google, text-to-speech, tools, ai, prompt-engineering, generative-ai, llms, gemini, llm-release, vibe-coding

