xAI News · April 17, 2026 09:00 · About 5 min read

Grok Speech to Text and Text to Speech APIs

#SpeechRecognition #SpeechSynthesis #VoiceAI #API #Multimodal #xAI

TL;DR

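xAI has announced the Grok Speech to Text and Text to Speech APIs, which offer fast, accurate speech recognition and natural, expressive speech synthesis, with simple pricing and multilingual support.

AI Deep Analysis · April 18, 2026 11:41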

Attention: / 5-point scale

Depth: 40%

Key Points

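Fast, accurate speech recognition

The Grok Speech to Text API provides fast, accurate speech recognition, making it a practical option for voice input.

Natural, expressive speech synthesis

The Text to Speech API generates natural, expressive speech with human-like delivery.

Simple pricing

A clear, easy-to-understand pricing model lowers the barrier to adoption for developers and companies.

Multilingual support

Support for multiple languages enables global rollouts and serving diverse user bases.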


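Impact Analysis

This announcement marks xAI's full-scale entry into the voice AI market and could intensify competition with existing speech recognition and synthesis services such as OpenAI Whisper, Google Speech-to-Text, and Amazon Polly. The simple pricing and multilingual support are expected to encourage adoption by small and medium-sized businesses and global projects.

Editorial Comment

This is a brief, PR-heavy announcement that lacks technical detail and performance comparison data, so more information is needed to assess its real competitiveness.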

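Today, we are excited to announce two powerful standalone audio APIs: Grok Speech to Text (STT) and Grok Text to Speech (TTS). They are built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support. These standalone endpoints make it straightforward for developers to integrate high-quality speech features into any application, whether that means creating voice agents, real-time transcription tools, accessibility solutions, podcasts, or interactive audio experiences.

Speech to Text

High accuracy, low latency. Generate transcripts from large audio files in milliseconds via the REST API. Transcribe speech in real time with the lowest-latency WebSocket API.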

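We've added powerful features such as word-level timestamps, speaker diarization, and multichannel support. The API also includes intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more.

Voice in vs. text out

xAI (0 mistakes):
Thank you for holding, Anghared Llewelyn Bowen. I see here your mortgage rate lock is set at 3.75% and is valid until March 10th, 2024. Oisin MacGiolla Phadraigh, once we receive your signed documents by February 15th, we can aim for a closing date on March 20th. If you have any concerns, please feel free to email me at a.bowen@bestbank.com.

Other models (6 mistakes):
Thank you for holding, Anherd Lualin Bowen. I see here your mortgage rate lock is set at 3.75% and is valid until 03/10/2024. Oysen Magilla Fadrig, once we receive your signed documents by February, 15, we can aim for a closing date on March 20. If you have any concerns, please feel free to email me at a dot bowen at bestbank dot com.

Pricing

We keep pricing simple and predictable: Speech to Text is $0.10 per hour for batch transcription and $0.20 per hour for streaming. Full details and current rate limits are available in the xAI API console.

[Chart: cost per hour, batch vs. streaming]

Enterprise-Grade Transcription

Grok STT is evaluated against the top commercial models on phone calls, meetings, video/podcasts, and telephony. It excels at entity recognition in business use cases such as medical, legal, and financial.

Domain (Word Error Rate)   Grok STT   ElevenLabs   Deepgram   AssemblyAI
Phone Call Entities        5.0%       12.0%        13.5%      21.3%
Video/Podcasts             2.4%       2.4%         3.0%       3.2%
Meetings                   10.9%      12.2%        16.3%      15.7%
Telephone                  9.3%       9.4%         11.0%      11.2%
Overall                    6.9%       9.0%         11.0%      12.9%

Most transcription models return the raw spoken words. Grok Speech to Text goes a step further. When formatting is enabled, the API performs advanced Inverse Text Normalization that intelligently converts spoken language into properly structured output.

Raw input:
My name is John Smith and my phone number is 4145551234.
I saw a transaction for 6.99 on my account.

Multilingual Support

The Grok Speech to Text API offers strong multilingual support across 25+ languages and can switch between languages seamlessly, without interruption.

Multichannel & Diarization (Speaker Identification)

Transcribe multichannel audio files with the same API for perfect speaker separation. Use diarization to detect word-level speaker IDs in both pre-recorded audio and real-time streaming.

Speaker 1: Thank you for calling. How can I help you today?
Speaker 2: I signed up for an account, but I cannot log in.
Speaker 1: I am sorry to hear that. If you give me your email address, I will check on it.
Speaker 2: It's john.smith@gmail.com.
Speaker 1: Thank you. Could you confirm your date of birth so I can validate the account?
Speaker 2: Sure, it's March 16, 1985.

Text to Speech

Fast, natural, and expressive voices with Speech Tags. Turn long-form text into speech via the REST API. Generate speech in real time with the WebSocket API.

Fine-Grained Control

Add natural prosody and emotion using simple inline and wrapping speech tags: [laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>, and many more. These controls let you create engaging, lifelike delivery without complex markup.

Have you heard the new Grok Voice? (whisper) Let me tell you a secret... I am the smartest and best AI. (laugh) Give it a go! Ask me anything. I'll be your trusted personal assistant and closest companion.

Pricing

Text to Speech is priced at $4.20 per 1 million characters, with simple usage-based billing and no hidden fees.

[Chart: cost per million characters]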
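To make the quoted rates concrete, the following is a minimal cost-estimation sketch in Python. Only the three rates come from the announcement. The function names are hypothetical, and the sketch assumes that usage is prorated linearly by audio duration and that every character sent to Text to Speech (speech tags included) is billed. The actual billing granularity should be checked in the xAI API console.

```python
from decimal import Decimal

# Rates quoted in the announcement (USD). Current values may differ.
STT_BATCH_PER_HOUR = Decimal("0.10")
STT_STREAMING_PER_HOUR = Decimal("0.20")
TTS_PER_MILLION_CHARS = Decimal("4.20")

CENT_FRACTION = Decimal("0.0001")  # round estimates to 1/100 of a cent


def stt_cost(audio_seconds: float, streaming: bool = False) -> Decimal:
    """Estimate transcription cost for a given audio duration.

    Assumption: cost scales linearly with duration (no minimum charge).
    """
    rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return (rate * hours).quantize(CENT_FRACTION)


def tts_cost(text: str) -> Decimal:
    """Estimate synthesis cost for a piece of text.

    Assumption: every character, including any speech tags, is billed.
    """
    chars = Decimal(len(text))
    return (TTS_PER_MILLION_CHARS * chars / Decimal(1_000_000)).quantize(CENT_FRACTION)


if __name__ == "__main__":
    print("1 h batch audio:     $", stt_cost(3600))
    print("1 h streaming audio: $", stt_cost(3600, streaming=True))
    print("500k characters:     $", tts_cost("x" * 500_000))
```

Decimal arithmetic is used instead of floats so that small per-request estimates add up without rounding drift when aggregated over many calls.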

Original text

Today, we are excited to announce two powerful standalone audio APIs: Grok Speech to Text (STT) and Grok Text to Speech (TTS). Built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support.

These standalone endpoints make it straightforward for developers to integrate high-quality speech features into any application, whether you're creating voice agents, real-time transcription tools, accessibility solutions, podcasts, or interactive audio experiences.

Speech to Text

High accuracy, low latency.

Generate transcripts from large audio files in milliseconds via our REST API

Transcribe speech in real time with our lowest latency WebSocket API

We've added powerful features like word-level timestamps, speaker diarization, and multichannel support. It further includes intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more.

VOICE IN VS TEXT OUT

xAI (0 mistakes):
Thank you for holding, Anghared Llewelyn Bowen. I see here your mortgage rate lock is set at 3.75% and is valid until March 10th, 2024. Oisin MacGiolla Phadraigh, once we receive your signed documents by February 15th, we can aim for a closing date on March 20th. If you have any concerns, please feel free to email me at a.bowen@bestbank.com.

Other Models (6 mistakes):
Thank you for holding, Anherd Lualin Bowen. I see here your mortgage rate lock is set at 3.75% and is valid until 03/10/2024. Oysen Magilla Fadrig, once we receive your signed documents by February, 15, we can aim for a closing date on March 20. If you have any concerns, please feel free to email me at a dot bowen at bestbank dot com.

Pricing

We keep pricing straightforward and predictable: Speech to Text is $0.10 per hour for batch and $0.20 per hour for streaming. Full details and current rate limits are available in the xAI API console.

[Chart: cost per hour, batch vs. streaming]

Enterprise-Grade Transcription

Grok STT is evaluated against the top commercial models on phone calls, meetings, video/podcasts, and telephony. It excels at entity recognition and business use cases like medical, legal, and financial.

Domain (Word Error Rate)   Grok STT   ElevenLabs   Deepgram   AssemblyAI
Phone Call Entities        5.0%       12.0%        13.5%      21.3%
Video/Podcasts             2.4%       2.4%         3.0%       3.2%
Meetings                   10.9%      12.2%        16.3%      15.7%
Telephone                  9.3%       9.4%         11.0%      11.2%
Overall                    6.9%       9.0%         11.0%      12.9%

Most transcription models give you raw spoken words. Grok Speech to Text goes further. When you enable formatting, the API performs advanced Inverse Text Normalization that intelligently converts spoken language into proper structured output:

Raw input:
My name is John Smith and my phone number is 4145551234.
I saw a transaction for 6.99 on my account.

Multilingual fluency

The Grok Speech to Text API offers strong multilingual support across 25+ languages, switch languages seamlessly without missing a beat.

Multichannel & Diarization (Speaker Identification)

Transcribe multichannel audio files for perfect speaker separation with the same API. Detect speakers in both pre-recorded and real-time streaming with word-level speaker IDs using Diarization.

Speaker 1: Hello thanks for calling how can I help you today?
Speaker 2: I just signed up for an account and cannot login.
Speaker 1: I am sorry to hear that, what is your email address so I can check on that for you?
Speaker 2: It's john.smith@gmail.com
Speaker 1: Thanks and can you confirm your date of birth so I can validate the account please?
Speaker 2: Sure, it's March 16th 1985

Text to Speech

Fast, natural, and expressive voices with Speech Tags.

Turn long-form text into speech with our REST API

Generate speech in real time with our WebSocket API

Fine-Grained Control

Add natural prosody and emotion using simple inline and wrapping speech tags: [laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>, and many more. These controls let you create engaging, lifelike delivery without complex markup.

Have you heard the new Grok Voice? (whispers) Let me tell you a secret... I am the smartest and best AI. (laugh) Give it a go! Ask me anything. I'll be your trusted personal assistant and closest companion.

Sample voice: ARA

Pricing

Text to Speech is priced at $4.20 per 1 million characters, with straightforward usage-based billing and no hidden fees.

[Chart: cost per million characters]
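The speech tags named above can be composed into a script with a few string helpers. The following Python sketch is illustrative only: the announcement lists the tag names but does not define their exact syntax, so the sketch assumes that bracketed tags are inline markers written as [name] and that angle-bracket tags wrap a span as <name>...</name>. The helper names `inline`, `wrap`, and `strip_tags` are hypothetical and are not part of any xAI SDK.

```python
import re

# Tag names taken from the announcement. The inline/wrapping split and the
# closing-tag form </name> are ASSUMPTIONS, not documented syntax.
INLINE_TAGS = {"laugh", "sigh", "whisper"}
WRAPPING_TAGS = {"emphasis", "slow", "pause"}

# Matches any known inline tag, or an opening/closing known wrapping tag.
_TAG_RE = re.compile(
    r"\[(?:%s)\]|</?(?:%s)>"
    % ("|".join(sorted(INLINE_TAGS)), "|".join(sorted(WRAPPING_TAGS)))
)


def inline(tag: str) -> str:
    """Return an inline marker such as [laugh]."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap a span of text, e.g. <emphasis>text</emphasis> (assumed form)."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag}")
    return f"<{tag}>{text}</{tag}>"


def strip_tags(text: str) -> str:
    """Remove known tags and collapse whitespace, e.g. for captions."""
    return " ".join(_TAG_RE.sub(" ", text).split())


if __name__ == "__main__":
    script = (
        "Have you heard the new Grok Voice? "
        f"{inline('whisper')} Let me tell you a secret... "
        f"{inline('laugh')} {wrap('emphasis', 'Give it a go!')}"
    )
    print(script)
    print(strip_tags(script))
```

Validating tag names up front catches typos before a request is sent, and `strip_tags` gives a plain-text version of the same script for display alongside the audio.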


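Simon Willison Blog · April 16, 2026 01:41

A natural speech synthesis tool built on Google's Gemini 3.1 Flash TTS model

Google has released the Gemini 3.1 Flash TTS model, which supports single-speaker and multi-speaker conversation modes and accepts tags that direct vocal delivery. With this tool, natural speech can be generated from text and downloaded.

Simon Willison Blog · ★3 · April 11, 2026 00:56

ChatGPT voice mode runs on a weaker model

OpenAI's ChatGPT voice mode runs on an older, less capable model (a GPT-4o-era model) with a knowledge cutoff of April 2024.

Google Developers AI · ★4 · June 3, 2026 09:00

Gemma 4 12B: Developer Guide

Google announced Gemma 4 12B, a dense multimodal model aimed at high-performance local AI on consumer devices, and published a developer guide covering its new architecture, which removes the need for conventional vision and audio encoders.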
