Google AI Blog · April 16, 2026, 00:00 · approx. 7 min

Gemini 3.1 Flash TTS: the next generation of expressive AI speech

#Text-to-Speech #Generative AI #Audio Synthesis #SynthID #Google Gemini

TL;DR

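Google announced Gemini 3.1 Flash TTS, a next-generation AI speech model that supports more than 70 languages, adds audio tags for fine-grained control over vocal style and pace, and watermarks its output with SynthID.

AI deep analysis · April 26, 2026, 18:02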

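Key points

Precise control with granular audio tags

Audio tags let you specify vocal style, pace, and expressiveness through natural language commands, allowing more finely directed speech generation than before.

Multilingual support and improved quality

The model supports more than 70 languages and produces more natural, higher-quality speech than previous versions.

Integrated SynthID watermarking

Every generated audio clip carries an embedded SynthID watermark, which helps identify AI-generated content and prevent misinformation.

Tools for developers

The model is available through Google AI Studio, Vertex AI, and Google Vids. Developers can fine-tune a voice and export the settings for consistent reuse.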

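Better speech control and quality

Gemini 3.1 Flash TTS delivers natural-sounding audio and lets you use audio tags to finely control vocal style and pace in more than 70 languages.

SynthID watermarking introduced

All generated audio has a SynthID watermark embedded, which makes it identifiable as AI-generated speech and helps prevent misinformation.

Available on major platforms

The model can be tested and used in Google AI Studio, Vertex AI, and Google Vids, where developers can fine-tune voices and export their settings.

Key quotes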

Our newest audio model introduces granular audio tags that give you precise control to direct AI speech for expressive audio generation.

Gemini 3.1 Flash TTS is here, giving you improved AI speech quality and control.

Audio tags let you control vocal style, pace, and delivery using natural language commands.

You can now use audio tags to adjust vocal style and pacing in over 70 languages.

all audio is watermarked with SynthID to prevent misinformation.

"Gemini 3.1 Flash TTS achieved an impressive Elo score of 1,211."

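Impact analysis

This release shows AI speech generation moving from plain read-aloud to directed, expressive delivery, which makes it considerably more practical for content production and accessibility. Integrating SynthID also answers the public demand for trustworthy information in the generative AI era, and it is likely to carry weight as an industry standard.

Editorial comment

By pairing controllable speech generation with ethical transparency (SynthID), Google lowers the barrier to practical use and sets itself apart from competitors.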

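April 15, 2026

Reading time: 10 min

Our newest audio model introduces granular audio tags that give you precise control to direct AI speech for expressive audio generation.

Vilobh Meshram

Senior Product Manager

Max Gubin

Senior Research Engineer, on behalf of the Gemini team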

General summary

Gemini 3.1 Flash TTS is here, giving you improved AI speech quality and control. You can now use audio tags to adjust vocal style and pacing in over 70 languages. Test it out in Google AI Studio, Vertex AI, and Google Vids, and know that all audio is watermarked with SynthID to prevent misinformation.

Summaries were generated by Google AI. Generative AI is experimental.

Bullet points

"Gemini 3.1 Flash TTS" is a new AI speech model with better control, expressiveness, and quality.

This model has improved speech quality, making it sound more natural than previous versions.

Audio tags let you control vocal style, pace, and delivery using natural language commands.

Developers can use Google AI Studio to fine-tune voices and export settings for consistent use.

Gemini 3.1 Flash TTS supports 70+ languages and uses SynthID watermarking to identify AI-generated audio.

Basic explainer

Gemini 3.1 Flash TTS is a new AI that makes computer speech sound more real. It lets people change how the AI talks by using special commands in the text. This AI can speak in over 70 languages and adds a hidden watermark to the audio. This helps people know it's AI-generated and not a real person.

Today, we’re introducing Gemini 3.1 Flash TTS, the latest text-to-speech model that delivers improved controllability, expressivity and quality — empowering developers, enterprises and everyday users to build the next generation of AI-speech applications.

Starting today, 3.1 Flash TTS is rolling out:

For developers in preview via the Gemini API and Google AI Studio
For enterprises in preview on Vertex AI
For Workspace users via Google Vids
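
For developers using the Gemini API route, a request body might be assembled roughly as follows. This is a minimal sketch under stated assumptions, not a confirmed interface: the field names mirror the speech-generation request shape documented for earlier Gemini TTS preview models, and the model ID `gemini-3.1-flash-tts-preview` and the voice name `Kore` are placeholders chosen for illustration. The announcement does not state either value. The code only builds and prints the payload; it makes no network call.

```python
import json

# Placeholder model ID (assumption): the announcement does not give one.
MODEL_ID = "gemini-3.1-flash-tts-preview"

# Endpoint pattern used by the Gemini API's generateContent method.
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL_ID}:generateContent"
)


def build_tts_request(text: str, voice_name: str = "Kore") -> dict:
    """Build a JSON-serializable request body that asks for audio output.

    The voice name is an assumed placeholder, not a value from the post.
    """
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


request_body = build_tts_request("Welcome to the show.")
print(ENDPOINT)
print(json.dumps(request_body, indent=2))
```

Keeping the payload builder separate from any HTTP client makes it easy to check the request shape against the official API reference before sending anything.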

Improved speech quality and controllability

We’ve improved the overall speech quality of Gemini 3.1 Flash TTS, making it our most natural and expressive model to date. On the Artificial Analysis TTS leaderboard, a benchmark that captures thousands of blind human preferences, 3.1 Flash TTS achieved an impressive Elo score of 1,211.
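
To put an Elo score in context: under the standard Elo model, a rating gap maps to an expected preference rate in a head-to-head comparison. The sketch below applies that general formula. The 1,100 rating for the comparison model is an invented value for illustration, not a leaderboard result, and Artificial Analysis may compute its ratings with a different variant of the method.

```python
def expected_preference(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that listeners prefer A over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


flash_tts = 1211.0           # score quoted in the post
hypothetical_rival = 1100.0  # invented for illustration only

rate = expected_preference(flash_tts, hypothetical_rival)
print(f"Expected preference rate: {rate:.2f}")
```

Two models with equal ratings give 0.5, and a 400-point lead corresponds to roughly ten-to-one odds in favor of the higher-rated model.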

Artificial Analysis has also positioned Gemini 3.1 Flash TTS within its “most attractive quadrant” for its ideal blend of high-quality speech generation and low cost. The model stands out further with native multi-speaker dialogue, support for 70+ languages, and granular creative control via natural language.
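
A multi-speaker request might look like the sketch below. It assumes the new model accepts the same `multiSpeakerVoiceConfig` structure documented for earlier Gemini TTS preview models, which the post does not confirm. The speaker names, voice names, and dialogue are all made up for illustration.

```python
def build_dialogue_request(turns: list[tuple[str, str]],
                           voices: dict[str, str]) -> dict:
    """Build a request body for a dialogue with one voice per speaker.

    turns  -- (speaker, line) pairs in speaking order
    voices -- speaker name -> voice name (assumed placeholder values)
    """
    missing = {speaker for speaker, _ in turns} - set(voices)
    if missing:
        raise ValueError(f"no voice assigned for: {sorted(missing)}")

    # The transcript labels each line with its speaker.
    transcript = "\n".join(f"{speaker}: {line}" for speaker, line in turns)
    speaker_configs = [
        {
            "speaker": speaker,
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}},
        }
        for speaker, voice in voices.items()
    ]
    return {
        "contents": [{"parts": [{"text": transcript}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": speaker_configs
                }
            },
        },
    }


dialogue = build_dialogue_request(
    turns=[("Ana", "Did the build pass?"), ("Ben", "Green across the board.")],
    voices={"Ana": "Kore", "Ben": "Puck"},
)
print(dialogue["contents"][0]["parts"][0]["text"])
```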

New audio tags for more expressive speech generation

3.1 Flash TTS also introduces audio tags — an intuitive way to control vocal style, pace and delivery. By embedding natural language commands directly into the text input, you can steer AI-speech output with improved levels of granularity.
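
The post does not show the tag syntax itself, so the following is only a sketch of how embedding directions in the text input could work. It assumes a square-bracket inline form such as `[whispering]`; the bracket convention, the helper names `tag` and `direct`, and the sample directions are all hypothetical.

```python
def tag(direction: str) -> str:
    """Wrap a natural-language direction in the assumed bracket syntax."""
    cleaned = direction.strip()
    if not cleaned or "[" in cleaned or "]" in cleaned:
        raise ValueError("direction must be non-empty and contain no brackets")
    return f"[{cleaned}]"


def direct(*segments: tuple[str, str]) -> str:
    """Join (direction, line) pairs into a single tagged text input."""
    return " ".join(f"{tag(d)} {line.strip()}" for d, line in segments)


script = direct(
    ("whispering", "I have a secret."),
    ("excited, faster pace", "We launch tomorrow!"),
)
print(script)
```

Building the input from (direction, line) pairs keeps the spoken text and the delivery instructions separate until the last step, so a direction can be changed without touching the script.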

3.1 Flash TTS enables enterprises to utilize audio tags within Vertex AI, empowering the next generation of enterprise applications.

You can start experimenting with these audio tags along with other updates to the developer experience in Google AI Studio with configurable controls that place the developer in the “director’s chair”:

Scene direction: Set the stage by defining the environment and providing specific dialogue instructions. This world-building context helps characters remain “in-character” and react to one another naturally across multiple turns.
Speaker-level specificity: Cast characters using unique Audio Profiles, then specify Director’s Notes to toggle pace, tone and accent. Using inline tags, speakers can pivot from these high-level settings to change expression mid-sentence.
Seamless export: Once the performance is perfected, these exact parameters can be exported as Gemini API code to ensure consistent, recognizable voices across various projects and platforms.
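
The three controls above can be pictured as one reusable configuration object. The sketch below is illustrative only: it does not reproduce the code that Google AI Studio exports, and the class names, field names, and prompt layout are assumptions. It shows the general idea of storing scene direction and per-speaker notes once, then serializing them so the same voice setup can be reused across projects.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Speaker:
    name: str
    audio_profile: str    # hypothetical profile label
    directors_notes: str  # pace, tone, accent in plain language


@dataclass
class Scene:
    direction: str
    speakers: list[Speaker] = field(default_factory=list)

    def to_prompt(self, lines: list[tuple[str, str]]) -> str:
        """Render scene context, per-speaker notes, then the dialogue."""
        known = {s.name for s in self.speakers}
        for name, _ in lines:
            if name not in known:
                raise ValueError(f"unknown speaker: {name}")
        header = [f"Scene: {self.direction}"]
        header += [
            f"{s.name} ({s.audio_profile}): {s.directors_notes}"
            for s in self.speakers
        ]
        body = [f"{name}: {text}" for name, text in lines]
        return "\n".join(header + [""] + body)

    def export(self) -> str:
        """Serialize the settings so another project can reload them."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def load(cls, payload: str) -> "Scene":
        data = json.loads(payload)
        return cls(data["direction"], [Speaker(**s) for s in data["speakers"]])


scene = Scene(
    direction="A quiet radio studio, late at night.",
    speakers=[Speaker("Host", "warm-baritone", "slow pace, reassuring tone")],
)
restored = Scene.load(scene.export())
print(restored.to_prompt([("Host", "Good evening, and welcome back.")]))
```

Because the exported JSON round-trips to an equal object, two projects that load the same payload start from identical scene and speaker settings.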

With these new configurations, developers can enhance precision for specific scenarios, creating memorable characters and immersive audio experiences.

Built for global scale

Gemini 3.1 Flash TTS delivers high-fidelity speech and more precise control across more than 70 languages. These core optimizations bring advanced style, pacing and accent control to major markets — helping developers create localized, expressive speech experiences for users at global scale.

Early developer and enterprise testers are already seeing the impact of 3.1 Flash TTS, highlighting its impressive controllability and expressivity. They’ve told us how audio tags provide a new level of creative precision, transforming simple text into a high-fidelity vocal performance.

Watermarked with SynthID

All audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID. This imperceptible watermark is interwoven directly into the audio output, allowing the reliable detection of AI-generated content to help prevent misinformation. For more information on our approach to safety and responsibility, you can review the model card.
