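Qualcomm compresses AI reasoning chains by 2.4x to run thinking models on smartphones
Qualcomm AI Research has developed a modular system that shortens the reasoning traces of reasoning-capable language models by a factor of 2.4, making it possible to run them on smartphones.
Key points
Substantial compression of reasoning chains
Qualcomm's technique shortens language models' verbose reasoning by a factor of 2.4, making on-phone deployment feasible.
A modular design
The technique is implemented as a modular system: add-on adapters on a single base model that can be switched on or off as needed.
Local inference on smartphones
The technique opens a path to running reasoning-capable language models on smartphones without depending on the cloud.
Qualcomm's AI research
The work was led by Qualcomm AI Research, the company's research division.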
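Impact analysis
The technique substantially widens the scope for running advanced AI reasoning locally on smartphones and other edge devices, which helps with privacy, latency, and offline use. For the industry as a whole, it is an important technical step that could accelerate the shift from cloud dependence to edge AI.
Editorial comment
Notable as a concrete technical breakthrough toward serious AI reasoning on smartphones. The reported 2.4x reduction could become a reference point for the industry.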

This article, "Qualcomm shrinks AI reasoning chains by 2.4x to fit thinking models on smartphones," was first published on The Decoder.
Qualcomm AI Research has developed a modular system that brings reasoning-capable language models to smartphones by compressing the models' verbose thought processes by a factor of 2.4.
Current reasoning models pose a fundamental problem on mobile devices because their lengthy thought chains generate massive amounts of tokens, balloon memory requirements, and drain battery life. Qualcomm's new framework is designed to make these models run on smartphones despite these constraints.
According to the paper, the company envisions use cases ranging from intelligent personal assistants that plan multi-step tasks and act independently across apps to direct interaction with device interfaces and external services. Running locally also brings structural advantages since sensitive data stays on the device, latency drops, and the whole system works without an internet connection.
A single base model that switches between two modes
Rather than training a completely new model, Qualcomm went with a modular approach. The starting point is a standard language model without reasoning capabilities (Qwen2.5-7B-Instruct), extended through LoRA adapters: small, specialized add-on modules that can be toggled on or off as needed. The same model can work either as a fast chatbot or as a deeper reasoning system, depending on the task.
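To make the toggling idea concrete, here is a minimal sketch of a LoRA-style layer: a frozen weight matrix plus a low-rank update that is applied only when a flag is set. This is not Qualcomm's implementation. The class name `LoRALinear`, the `enabled` flag, the rank, and the scaling factor are all assumptions chosen for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen base weight W plus an optional low-rank update B @ A (hypothetical sketch)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, never modified
        # A gets a small random init; B starts at zero so the adapter is a no-op at first.
        self.a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank
        self.enabled = False  # False: plain chat mode, True: adapter active

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.enabled:
            delta = matvec(self.b, matvec(self.a, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1)
layer.a = [[1.0, 1.0]]      # stand-in for trained adapter weights
layer.b = [[0.5], [-0.5]]
x = [2.0, 3.0]

chat_out = layer.forward(x)        # adapter off: base model only
layer.enabled = True
reasoning_out = layer.forward(x)   # adapter on: base output plus low-rank update
```

Because the base weights are never changed, switching modes costs only a flag flip and the small extra matrix products, which is what lets one set of base weights serve both modes.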
Left: the modular architecture with switcher and LoRA adapters. Top right: the parallel reasoning strategy. Bottom right: the pipeline from training to on-device deployment. | Image: Qualcomm AI
Only about 4 percent of the parameters need to be trained, according to the researchers. Despite that, the result comes close to the performance of DeepSeek-R1-Distill-Qwen-7B, a model that required significantly more training effort. A built-in classifier also decides automatically for each query whether the more complex reasoning mode is even necessary, saving compute and energy on simple questions.
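A back-of-the-envelope calculation shows why adapter training touches only a small share of the parameters. For a weight matrix of size d_out x d_in, a rank-r LoRA adapter adds r * (d_in + d_out) trainable values. The matrix size and rank below are hypothetical and not taken from the paper; the function name is also an assumption.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Adapter parameters as a fraction of the frozen matrix they attach to."""
    base_params = d_in * d_out              # frozen
    adapter_params = rank * (d_in + d_out)  # trainable: A is rank x d_in, B is d_out x rank
    return adapter_params / base_params


# Hypothetical layer: a 4096 x 4096 projection with a rank-64 adapter.
fraction = lora_trainable_fraction(4096, 4096, 64)
print(f"{fraction:.3%}")  # 3.125%
```

With these illustrative numbers the adapter comes to a few percent of the layer, the same order of magnitude as the roughly 4 percent the researchers report for the whole system.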
Reinforcement learning cuts token bloat by up to 8x
The biggest problem after initial training is that the models become extremely verbose. They often arrive at the correct solution early on, then burn through thousands of tokens double-checking their own work in various ways. The researchers call this phenomenon "epistemic hesitation"; in the broader research community, it is known simply as "overthinking."
To tackle this, the team uses reinforcement learning that specifically penalizes overlong answers. On average, responses shrink by a factor of 2.4, and for some tasks the reduction reaches up to 8x. One example from the paper: an algebraic simplification that took the initial model 3,118 tokens gets solved in just 810 tokens after optimization. According to the researchers, accuracy stays largely intact.
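The article does not spell out the reward used in the paper. As a hedged illustration of the general idea, the sketch below gives a correct answer a base reward and subtracts a penalty for every token beyond a budget. The function name, the budget of 1,000 tokens, and the penalty weight are all hypothetical.

```python
def length_penalized_reward(is_correct, num_tokens, token_budget, penalty_per_token=0.001):
    """Reward correctness, then subtract a cost for tokens beyond the budget (hypothetical)."""
    reward = 1.0 if is_correct else 0.0
    overflow = max(0, num_tokens - token_budget)
    return reward - penalty_per_token * overflow


# Token counts from the article's algebra example; the budget is an assumption.
verbose = length_penalized_reward(True, 3118, token_budget=1000)
concise = length_penalized_reward(True, 810, token_budget=1000)

# Length reduction for this one example.
ratio = 3118 / 810
print(round(ratio, 2))  # 3.85
```

Under such a reward, two equally correct answers no longer score the same: the 810-token answer keeps its full reward while the 3,118-token answer is pushed below zero. For this example the reduction works out to about 3.85x, above the 2.4x average and below the 8x maximum.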
Top: the task. Middle: the baseline answer at 3,118 tokens with numerous redundant verification loops (marked in red). Bottom: the budget-optimized answer at 810 tokens. Both arrive at the same result. | Image: Qualcomm AI Research
One early approach to limiting length backfired: the model learned to formally close its reasoning block, then simply continued its lengthy deliberation in the regular answer section. The team had to redesign the reward function to account for total answer length before the model stopped gaming the system.
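The loophole can be reproduced in a few lines. The sketch below compares two ways of measuring length: counting only the tokens before the reasoning block is closed, and counting the whole response. The `</think>` marker and the helper names are assumptions for illustration, not details from the paper.

```python
CLOSE_TAG = "</think>"  # hypothetical marker that ends the reasoning block


def split_response(tokens):
    """Return (reasoning_tokens, answer_tokens), split at the first close tag."""
    if CLOSE_TAG in tokens:
        i = tokens.index(CLOSE_TAG)
        return tokens[:i], tokens[i + 1:]
    return tokens, []


def reasoning_only_length(tokens):
    """Length measure that can be gamed: ignores everything after the close tag."""
    return len(split_response(tokens)[0])


def total_length(tokens):
    """Length measure over the entire response, reasoning and answer alike."""
    return len(tokens)


# An honest response deliberates inside the block; a gaming one closes it early
# and keeps deliberating in the answer section.
honest = ["step"] * 50 + [CLOSE_TAG] + ["answer"] * 5
gaming = ["step"] * 5 + [CLOSE_TAG] + ["step"] * 45 + ["answer"] * 5

print(reasoning_only_length(honest), reasoning_only_length(gaming))  # 50 5
print(total_length(honest), total_length(gaming))                    # 56 56
```

Measured by the reasoning block alone, the gaming response looks ten times shorter even though both responses spend the same number of tokens. Measuring total length removes the incentive.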
Parallel solution paths and 4-bit compression for real-world use
The framework also lets the model pursue multiple solution paths at the same time. A small evaluation head on the base model estimates which answer is most likely correct. With eight parallel runs, accuracy on the MATH500 math benchmark jumps by about 10 percent without significantly increasing response time, according to the paper. That's because token generation on mobile devices is bottlenecked by memory access, not compute power, so the parallel paths simply tap into capacity that would otherwise sit idle.
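The memory-bound argument can be checked with a simple roofline-style estimate. In the sketch below, a decoding step takes as long as the slower of two things: reading the weights from memory once, or doing the arithmetic for every sequence in the batch. The bandwidth and compute figures are hypothetical round numbers, not measurements of any device, and the model ignores per-sequence cache traffic. `pick_best` stands in for the evaluation head, with an arbitrary scoring function supplied by the caller.

```python
def decode_step_seconds(batch, weight_bytes, flops_per_token, bandwidth, compute):
    """Estimate one decoding step as max(memory time, compute time)."""
    memory_time = weight_bytes / bandwidth            # weights read once, shared by the batch
    compute_time = batch * flops_per_token / compute  # arithmetic grows with the batch
    return max(memory_time, compute_time)


def pick_best(answers, score):
    """Return the candidate with the highest score (stand-in for the evaluation head)."""
    return max(answers, key=score)


# 7 billion weights at 4 bits each is 3.5e9 bytes; the hardware numbers are assumptions.
WEIGHT_BYTES = 7e9 * 0.5
FLOPS_PER_TOKEN = 2 * 7e9
BANDWIDTH = 50e9   # bytes per second (hypothetical)
COMPUTE = 10e12    # FLOP per second (hypothetical)

one_path = decode_step_seconds(1, WEIGHT_BYTES, FLOPS_PER_TOKEN, BANDWIDTH, COMPUTE)
eight_paths = decode_step_seconds(8, WEIGHT_BYTES, FLOPS_PER_TOKEN, BANDWIDTH, COMPUTE)
print(one_path == eight_paths)  # True: both steps are limited by memory reads
```

Under these assumptions, eight parallel paths cost the same per step as one, because the arithmetic for eight sequences still finishes before the weights have been read.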
Left: the RL-optimized model (green) generates significantly shorter answers than the baseline (red), with even slightly higher accuracy. Right: the distribution of length reduction per task, averaging 2.4x. | Image: Qualcomm AI Research
To actually run on a phone, Qualcomm compresses the model weights to 4 bits. The reasoning adapters have to be trained directly on the compressed model; otherwise, the system just produces random text, according to the paper. Despite this aggressive compression, the final model loses only about 2 percent accuracy compared to the uncompressed version. Videos on the project page show the system running on mobile devices.
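For readers unfamiliar with weight quantization, the sketch below shows one simple scheme: symmetric round-to-nearest quantization to 4-bit integers with a single shared scale. The article does not say which quantization scheme Qualcomm uses, so treat this as a generic illustration; the function names and sample weights are assumptions.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-8, 7] using one shared scale (generic sketch)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [1.75, -0.75, 0.3, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

print(codes)     # [7, -3, 1, 0]
print(restored)  # [1.75, -0.75, 0.25, 0.0]
```

The value 0.3 comes back as 0.25: each weight can move by up to half a scale step. An adapter trained against the original weights would see slightly different weights at run time, which is consistent with the paper's point that the adapters have to be trained on the compressed model.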
On-device AI still hasn't moved beyond demos
Qualcomm has been pushing to bring AI models to mobile devices for years, publishing 80 pre-optimized AI models for Snapdragon devices and presenting an AI orchestrator designed to sit between personal data, apps, and on-device AI models. Google has made similar moves, showing how small language models can run locally on Android with FunctionGemma and the AI Edge Gallery.
But so far, these efforts have largely remained technical proofs of concept. When it comes to deep system integration, where an AI assistant needs access to emails, photos, and calendars, companies still default to cloud models. Google's recently announced "Personal Intelligence" feature, for example, connects Gemini with Gmail, Google Photos, and Search but runs entirely on the server side.