The Decoder · March 23, 2026, 00:52 · approx. 1 min read

Xiaomi launches three MiMo AI models to power agents, robots, and voice

#AIAgents #AutonomousControl #Robotics #Xiaomi #MultimodalAI

TL;DR

Chinese technology company Xiaomi has announced three MiMo AI models for building AI agents that can control software autonomously, shop in the browser, and eventually control robots.

AI in-depth analysis · March 23, 2026, 01:40

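Key points

Xiaomi's AI agent vision

Xiaomi aims to build AI agents that can control software autonomously, shop in the browser, and eventually control robots.

Three MiMo models announced at once

The in-house MiMo team released three AI models at the same time to power agents, robots, and voice.

A focus on autonomous control and real-world applications

The models are meant to go beyond conversation and to operate and control real-world systems such as software and robots on their own.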

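Impact analysis

This announcement matters because it makes clear that Xiaomi aims to move beyond smartphones and home appliances and become a platform provider for next-generation autonomous AI agents. It reflects intensifying competition over AI agents in the Chinese market and the trend of AI applications expanding from simple conversation to operating and controlling real-world systems.

Editorial comment

The announcement marks a concrete step in Xiaomi's AI strategy, but at this stage it amounts to a vision and a model release; details of practical deployment and any advantage over competitors remain unclear. Future development and real-world results bear watching.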

Chinese technology company Xiaomi aims to build AI agents that can operate software autonomously, shop in the browser, and eventually control robots. Its in-house MiMo team has just released three models at once.

This article, "Xiaomi launches three MiMo AI models to power agents, robots, and voice", first appeared on The Decoder.

Xiaomi wants to build AI agents that can control software on their own, navigate browsers, and eventually run robots. To get there, the company's in-house MiMo team just shipped three models at once.

The flagship MiMo-V2-Pro runs on a Mixture-of-Experts architecture with over one trillion total parameters, 42 billion of which are active per request. That's roughly three times the size of its predecessor, MiMo-V2-Flash, which launched in December 2025. Despite the jump in scale, a hybrid attention mechanism keeps things efficient, letting the model handle context windows up to one million tokens. It also generates multiple tokens at once instead of predicting one token at a time, giving it a noticeable speed boost.
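To make the "active parameters" idea concrete, here is a minimal sketch of generic top-k expert routing, the mechanism Mixture-of-Experts models commonly use so that each token only touches a few experts. The article does not describe Xiaomi's actual routing scheme, so the expert count, the value of k, and the function name `top_k_route` are illustrative assumptions; only the 42 billion and one trillion figures come from the text above.

```python
import math
import random


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
num_experts, k = 64, 4  # illustrative values, NOT MiMo-V2-Pro's real configuration
logits = [random.gauss(0, 1) for _ in range(num_experts)]
chosen = top_k_route(logits, k)
print("experts used for this token:", [i for i, _ in chosen])

# Figures from the article: 42B active out of "over one trillion" total,
# so the active share is at most about 4.2 percent.
active_share = 42e9 / 1e12
print(f"active share of parameters: at most {active_share:.1%}")
```

The remaining experts are skipped entirely for that token, which is why compute per request tracks the active parameter count rather than the total.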

MiMo-V2-Pro ranks third globally on both PinchBench and ClawEval, trailing just behind Claude Opus 4.6. | Image: Xiaomi

On the Artificial Analysis Intelligence Index, MiMo-V2-Pro lands at seventh place worldwide, making it the third-ranked Chinese model behind GLM-5 and MiniMax-M2.7. It hits 78 percent on the coding benchmark SWE-bench Verified, close behind Claude Sonnet 4.6 (79.6) and Claude Opus 4.6 (80.8). On ClawEval, the agent benchmark, it scores 81 points, nearly matching Claude Opus 4.6's 81.5, while GPT-5.2 sits at 77.

MiMo-V2-Pro generates a 3D tower defense game with different tower types, enemy waves, and explosion effects from a single prompt. | Image: Xiaomi

Xiaomi undercuts Anthropic on pricing by a wide margin

Xiaomi is going after the competition hard on price. According to the platform page, MiMo-V2-Pro costs one dollar per million input tokens and three dollars per million output tokens for context lengths up to 256,000 tokens. For comparison, Claude Sonnet 4.6 runs three dollars for input and 15 for output, and Claude Opus 4.6 goes for five and 25. Xiaomi is also waiving all cache write costs for now.
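A quick back-of-the-envelope calculation shows what that gap means per request. The sketch below uses only the per-million-token prices quoted above; the workload of 200,000 input tokens and 20,000 output tokens is a made-up example, and the helper name `request_cost` is ours. Cache pricing and any rates beyond the 256,000-token tier are not modeled.

```python
# Per-million-token list prices in USD (input, output), as quoted in the article.
PRICES = {
    "MiMo-V2-Pro": (1.0, 3.0),        # tier for contexts up to 256,000 tokens
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Claude Opus 4.6": (5.0, 25.0),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: one long-context agent step.
for model in PRICES:
    cost = request_cost(model, input_tokens=200_000, output_tokens=20_000)
    print(f"{model}: ${cost:.2f}")
```

For this example workload the request comes to $0.26 on MiMo-V2-Pro, $0.90 on Claude Sonnet 4.6, and $1.50 on Claude Opus 4.6.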

The model is live through a public API. For the launch, Xiaomi has partnered with five agent frameworks: OpenClaw, OpenCode, KiloCode, Blackbox, and Cline. Developers worldwide get free API access for one week.

MiMo-V2-Omni sees, hears, and acts in a single model

MiMo-V2-Omni folds image, video, and audio encoders into a shared backbone. The model can perceive and act on what it takes in: it natively supports structured tool calls, executes functions, and navigates user interfaces on its own.

MiMo-V2-Omni beats Claude Opus 4.6 on audio and image benchmarks but falls short of Gemini 3 Pro on video. | Image: Xiaomi

Xiaomi says MiMo-V2-Omni beats Gemini 3 Pro on audio and can record continuously for over ten hours. On images (MMMU-Pro: 76.8), it edges out Claude Opus 4.6 (73.9). The agent benchmarks tell a different story, though: on ClawEval, the Omni model scores just 54.8 - well behind Claude Opus 4.6 (66.3) and GPT-5.2 (59.6). It did outperform both Gemini 3 Pro and GPT-5.2 on the MM-BrowserComp web navigation benchmark.

For a demo, Xiaomi fed the model dashcam footage and had it flag pedestrians, oncoming vehicles, and bottlenecks as potential hazards in real time. In another scenario, MiMo-V2-Omni opened a browser on its own, looked up product reviews on the Chinese platform Xiaohongshu, compared prices on JD.com, haggled for discounts with customer service via chat, and completed the purchase.

A separate demo showed the model creating multimedia content, debugging the code behind it, and publishing the result to TikTok through the browser, all without human input. In every case, MiMo-V2-Omni handles the decision-making while the open-source framework OpenClaw takes care of the actual clicks and file operations.
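The split between a model that decides and a framework that executes can be sketched in a few lines. This is a generic illustration of structured tool calling, not OpenClaw's or Xiaomi's actual interface: the tool names, the JSON shape, and the scripted "model output" below are all hypothetical stand-ins.

```python
import json


# Hypothetical tools; a real framework would drive a browser or the file system.
def open_url(url):
    return f"opened {url}"


def click(selector):
    return f"clicked {selector}"


TOOLS = {"open_url": open_url, "click": click}


def execute_tool_call(raw_call):
    """Parse one structured tool call (a JSON string) and run the matching tool."""
    call = json.loads(raw_call)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        # The executor, not the model, has the final say on what can run.
        return {"ok": False, "error": f"unknown tool: {name}"}
    return {"ok": True, "result": TOOLS[name](**args)}


# Scripted stand-in for what a model might emit, one call per step.
scripted_model_output = [
    '{"name": "open_url", "arguments": {"url": "https://example.com/product"}}',
    '{"name": "click", "arguments": {"selector": "#add-to-cart"}}',
    '{"name": "delete_everything", "arguments": {}}',
]

for raw in scripted_model_output:
    print(execute_tool_call(raw))
```

Keeping execution behind an explicit registry means the model can only trigger actions the framework has chosen to expose, which is one reason agent stacks tend to separate the two roles.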

MiMo-V2-TTS generates emotional speech from natural language descriptions

Xiaomi says its MiMo-V2-TTS speech synthesis model was trained on over 100 million hours of speech data. It breaks speech down into several parallel layers of discrete units, giving it finer control over sound, rhythm, and emotion than standard TTS systems.

The key difference: instead of picking an emotion from a dropdown, users describe the voice they want in plain language. "Sleepy, just woken up, slightly hoarse" sounds different from "angry, but trying to stay calm." The model also generates paralinguistic sounds like coughs, hesitations, sighs, and laughter as part of the output rather than splicing in audio clips after the fact.

According to Xiaomi, MiMo-V2-TTS is the only commercially available TTS API that natively handles both speech and singing in the same model. It reads typographic cues like capital letters or repeated characters as signals for emphasis and rhythm, so that "THIS IS IMPORTANT" comes out with real punch, not simply higher volume. Even without any style instructions, the model picks up the right tone directly from the text.

Competitive benchmarks, but Xiaomi still has ground to cover

Shipping three specialized models at once sends a clear signal: Xiaomi wants to build a full-stack platform for AI agents. The benchmarks show the models going toe-to-toe with Anthropic and OpenAI in some areas while still falling short in others. On general agent tasks in particular, MiMo-V2-Pro still has work to do before it catches Claude Opus 4.6.

Next up, the MiMo team says it's working on long-term planning across hours and days, real-time streaming, coordinated multi-agent systems, and robotics. "We believe the path to general intelligence runs through the real world," the team writes. "A model that only reads text lives in a library. A model that sees, hears, reasons, and acts lives in the world."

The "Hunter Alpha" mystery: it wasn't DeepSeek

Before Xiaomi officially pulled the curtain back, MiMo-V2-Pro was listed anonymously on the API platform OpenRouter under the codename "Hunter Alpha." Xiaomi says usage climbed steadily: the model topped the daily rankings for several days straight and racked up over one trillion tokens total. The most popular use case by far was coding.

Many users had guessed Hunter Alpha was actually DeepSeek V4. But DeepSeek's next release is still some way off: reports say the next major DeepSeek model has been delayed due to its growing size.

Other Chinese AI labs aren't sitting still, though. Zhipu AI recently shipped GLM-5, an open-source model with 744 billion parameters built to compete with Claude Opus 4.5 and GPT-5.2 on coding and agent tasks. Moonshot AI's Kimi K2.5 takes a different approach with swarms of agents working in parallel, and Alibaba has been expanding its Qwen 3.5 lineup.
