Google DeepMind · December 13, 2025, 02:50 · approx. 8 min read

Improved Gemini audio models for powerful voice interactions

#SpeechGenerationAI #RealTimeInterpretation #Gemini #GoogleDeepMind #Agents

TL;DR

Google updated Gemini 2.5 Flash Native Audio, improving function calling accuracy and conversational fluency for voice agents, and began a beta rollout of real-time speech translation in the Google Translate app.

AI in-depth analysis · May 2, 2026, 20:07

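Key points

Enhancements to Gemini 2.5 Flash Native Audio

Function calling accuracy and instruction following have improved for live voice agents, and retrieving context from the conversation history makes dialogue more natural.

Beta rollout of real-time speech translation

A new feature in the Google Translate app that provides streaming interpretation across more than 70 languages, while preserving the speaker's intonation and pitch, has begun rolling out in the US, Mexico, and India.

Rollout across multiple products

The update is available in Google AI Studio and Vertex AI and is also rolling out in Gemini Live and Search Live, broadening its use in search and customer service.

Improved function calling accuracy

External functions are triggered more reliably, so the model can fetch real-time information accurately during a conversation and integrate it seamlessly into the audio response.

Better handling of complex instructions

Adherence to developer instructions rose from 84% to 90%, improving content completeness and user satisfaction.

Multi-turn conversation and expanded translation

The model draws on earlier context more effectively for smoother conversations, and it natively supports real-time speech translation for continuous listening and two-way conversation.

Key capabilities of real-time speech translation

It covers more than 70 languages and 2,000 language pairs, and offers style transfer that preserves the speaker's intonation, pacing, and pitch, along with automatic language detection and noise robustness.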

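Impact analysis

This announcement shows AI voice technology evolving from simple text conversion into "living dialogue" that understands context and conveys emotion, and it is expected to accelerate practical adoption in customer service and global communication. Intonation-preserving interpretation in particular enables natural exchange across language barriers and is an important step in strengthening Google's competitiveness in the AI translation market.

Editor's comment

The announcement makes clear that the center of gravity in voice AI is shifting from "interpretation" to "dialogue," and the preservation of emotional expression in particular marks a major turning point toward practical use. Companies can find immediate value in higher-quality customer-facing agents, and individuals in a new tool for removing language barriers.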

Improved Gemini audio models for powerful voice interactions

General summary

Google enhanced Gemini 2.5 Flash Native Audio for better live voice agents. Expect sharper function calling, robust instruction following and smoother conversations. Try live speech translation in the Google Translate app beta, rolling out now on Android in the US, Mexico and India.

"Improved Gemini audio models for powerful voice interactions" enhance live agents and translation.

Gemini 2.5 Flash Native Audio now has sharper function calling and better instruction following.

The update allows for smoother conversations by retrieving context from previous turns.

Live speech translation in Google Translate preserves intonation and handles 70+ languages.

You can start building voice agents today with Gemini 2.5 Flash Native Audio on Vertex AI.

Basic explainer

Google made its Gemini AI better at understanding and speaking in conversations. It can now understand instructions better, have smoother conversations, and translate languages in real time. This means AI can help businesses with customer service and people can understand each other better, even if they speak different languages. You can even try out the live translation feature in the Google Translate app.

Earlier this week, we introduced greater control over audio generation with an upgrade to our Gemini 2.5 Pro and Flash Text-to-Speech models.

But generating expressive speech is only one side of the conversation. Today, we’re releasing an updated Gemini 2.5 Flash Native Audio for live voice agents. This update improves the model’s ability to handle complex workflows, navigate user instructions, and hold natural conversations.

Gemini 2.5 Flash Native Audio is now available across Google products, including Google AI Studio and Vertex AI, and has also started rolling out in Gemini Live and Search Live, bringing the naturalness of native audio to Search Live for the first time. This means you can more effectively brainstorm live with Gemini, get real-time help in Search Live, or build the next generation of enterprise-ready customer service agents.

Beyond powering helpful agents, native audio unlocks new possibilities for global communication. We’re introducing live speech translation, a capability that enables streaming speech-to-speech translation for headphones. It preserves the speaker’s intonation, pacing and pitch. This beta experience is rolling out in the Google Translate app starting today.

Live Voice Agents

To enable the breadth of use cases across surfaces and products, we have improved Gemini 2.5 Flash Native Audio in three key areas:

Sharper function calling: We’ve improved the model's reliability when triggering external functions. It can now more accurately identify when to fetch real-time information during a conversation and seamlessly weave that data back into the audio response, without breaking the flow. On ComplexFuncBench Audio, an eval that captures multi-step function calling with various constraints, Gemini 2.5 Flash Native Audio leads with a score of 71.5%.

Robust instruction following: The model is now better at handling complex instructions, resulting in higher user satisfaction on content completeness. With a 90% adherence rate to developer instructions (up from 84%), it delivers more reliable outputs.

Smoother conversations: We’ve achieved significant gains in multi-turn conversation quality. Gemini 2.5 Flash Native Audio is able to retrieve context from previous turns more effectively, creating more cohesive conversations.
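To make the function-calling item above concrete, here is a minimal sketch of the application side of that loop: a tool declaration, a local handler, and a dispatcher that checks a model-issued call before running it. Everything in it is an assumption for illustration: the `get_order_status` tool, its fields, the sample data, and the `{"name": ..., "args": ...}` call shape are hypothetical and not taken from the post, and no real SDK is used.

```python
# Hypothetical tool declaration in a JSON-schema style; the name and fields
# are illustrative assumptions, not something the announcement specifies.
GET_ORDER_STATUS = {
    "name": "get_order_status",
    "description": "Look up the shipping status of an order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}


def get_order_status(order_id: str) -> dict:
    """Stand-in for a real backend lookup (made-up sample data)."""
    orders = {"A100": "shipped", "A101": "processing"}
    return {"order_id": order_id, "status": orders.get(order_id, "unknown")}


DECLARATIONS = {GET_ORDER_STATUS["name"]: GET_ORDER_STATUS}
HANDLERS = {"get_order_status": get_order_status}


def dispatch(call: dict) -> dict:
    """Check a model-issued call against its declaration, then run it."""
    name = call.get("name")
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown function: {name}"}
    args = call.get("args", {})
    required = DECLARATIONS[name]["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        return {"error": f"missing arguments: {missing}"}
    # The returned payload is what the model would weave into its spoken reply.
    return {"result": handler(**args)}


print(dispatch({"name": "get_order_status", "args": {"order_id": "A100"}}))
```

Returning errors as data rather than raising lets the agent recover inside the conversation, for example by asking the caller for the missing order ID.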

The updated Gemini 2.5 Flash Native Audio’s performance against previous versions and industry competitors on ComplexFuncBench

What customers are saying

Google Cloud customers are already using Gemini’s native audio capabilities to drive real business results, from mortgage processing to customer calls.

“Users often forget they’re talking to AI within a minute of using Sidekick, and in some cases have thanked the bot after a long chat…New Live API AI capabilities offered through Gemini [2.5 Flash Native Audio] empower our merchants to win.” – David Wurtz, VP of Product, Shopify

"By integrating the Gemini 2.5 Flash Native Audio model…we've significantly enhanced Mia's capabilities since launching in May 2025. This powerful combination has enabled us to generate over 14,000 loans for our broker partners." – Jason Bressler, Chief Technology Officer, United Wholesale Mortgage (UWM)

“Working with the Gemini 2.5 Flash Native Audio model through Vertex AI allows Newo.ai AI Receptionists to achieve unmatched conversational intelligence ... They can identify the main speaker even in noisy settings, switch languages mid-conversation, and sound remarkably natural and emotionally expressive.” – David Yang, Co-founder, Newo.ai

Live Speech Translation

Gemini now natively supports new live speech-to-speech translation capabilities designed to handle both continuous listening and two-way conversation.

With continuous listening, Gemini automatically translates speech in multiple languages into a single target language. This allows you to put headphones in and hear the world around you in your language.

For two-way conversation, Gemini’s live speech translation handles translation between two languages in real time, automatically switching the output language based on who is speaking. For example, if you speak English and want to chat with a Hindi speaker, you’ll hear English translations in real time in your headphones, while your phone broadcasts Hindi when you’re done speaking.
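The two modes described above can be read as a small routing rule. The sketch below is one possible way to express it, not Google's implementation: the `Route` type, the function names, the device labels, and the choice to skip speech that is already in the listener's language are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    target_language: str  # language the translation is rendered in
    output: str           # "headphones" or "phone_speaker" (assumed labels)


def route_continuous(spoken_language: str, my_language: str) -> Optional[Route]:
    """Continuous listening: any detected language goes to one target language."""
    if spoken_language == my_language:
        return None  # assumption: speech already in my language is passed over
    return Route(my_language, "headphones")


def route_two_way(spoken_language: str, my_language: str, their_language: str) -> Route:
    """Two-way conversation: output language and device depend on who spoke."""
    if spoken_language == my_language:
        # I spoke, so the other person hears the translation from the phone.
        return Route(their_language, "phone_speaker")
    if spoken_language == their_language:
        # They spoke, so I hear the translation in my headphones.
        return Route(my_language, "headphones")
    raise ValueError(f"unexpected language in two-way mode: {spoken_language}")


# The English/Hindi example from the text, using language codes "en" and "hi".
print(route_two_way("hi", my_language="en", their_language="hi"))
print(route_two_way("en", my_language="en", their_language="hi"))
```

In this reading, the detected language of each utterance is the only input needed to decide both what to translate into and where to play it, which is why no manual switching is required.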

Gemini’s live speech translation has a number of key capabilities that help in the real world:

Language coverage: Translates speech in over 70 languages and 2,000 language pairs by combining the Gemini model’s world knowledge and multilingual capabilities with its native audio capabilities.

Style transfer: Captures the nuance of human speech, preserving the speaker’s intonation, pacing and pitch so the translation sounds natural.

Multilingual input: Understands multiple languages simultaneously in a single session, helping you follow multilingual conversations without needing to fiddle around with language settings.

Auto detection: Identifies the spoken language and begins translation, so you don’t even need to know what language is being spoken to start translating.

Noise robustness: Filters out ambient noise so you can converse comfortably even in loud, outdoor environments.
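For a sense of scale on the language-coverage figures in the list above, the short calculation below counts how many pairs a set of languages can form. It assumes exactly 70 languages as a round lower bound; the post says "over 70" and does not state whether its 2,000 pairs are counted per direction, so this is only a rough reference point.

```python
from math import comb


def pair_counts(n_languages: int) -> tuple[int, int]:
    """Return (unordered pairs, directed pairs) for n languages."""
    unordered = comb(n_languages, 2)               # A<->B counted once
    directed = n_languages * (n_languages - 1)     # A->B and B->A counted separately
    return unordered, directed


unordered, directed = pair_counts(70)
print(unordered, directed)  # 2415 4830
```

Under either counting convention, 70 languages allow more than 2,000 combinations, so the stated pair count is plausible for a set of that size.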

Starting today, you can try it in a new beta experience in the Google Translate app for real-time translation in your headphones by connecting them to your device and tapping “Live translate.” This experience is rolling out to all Android devices in the US, Mexico and India with support for iOS and more regions coming soon.

Based on feedback, we will continue to iterate on this experience and bring it to more Google products including the Gemini API in 2026.

Get started today

Start building voice agents today with Gemini 2.5 Flash Native Audio, now generally available on Vertex AI and in preview in the Gemini API. Try it out in Google AI Studio.

Gemini 2.5 Flash and 2.5 Pro text-to-speech models are also available via the Gemini API in Google AI Studio. Get started with the speech generation docs, explore the prompting guide, or check out the Gemini API Cookbook.
