Not much happened today
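NVIDIA released Nemotron 3 Ultra, an open large model built for long-running agent workloads, and posted leading benchmark results among US open-weights models on the back of low-precision pretraining and fast inference.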
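Key points
Nemotron 3 Ultra technical specifications
A 550B-parameter MoE model (55B active) with a 1M-token context window, built on a hybrid Mamba/attention architecture and pretrained in NVFP4.
Optimized for agent workloads
The model targets long-running tasks. NVIDIA says it is up to 5x faster and 30% cheaper on agentic tasks, and it is fully open under the OpenMDW 1.1 license.
Benchmarks and inference performance
In @ArtificialAnlys's evaluation it posted the highest score of any US open-weights model they have tested, and it reached 400+ output tokens/s via BlackBox.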
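Impact analysis
This release challenges the usual cost-versus-performance trade-off for large models and markedly strengthens the competitiveness of open models, especially for agent workloads. Successfully scaling up low-precision training could also make it feasible for more organizations to run high-performance AI locally or on-premises.
Editorial comment
Despite the "not much happened today" title, this was a significant release that could reshape the performance and cost structure of open-source AI. In particular, the success of large-scale pretraining in NVFP4 has the potential to become an industry standard.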
a quiet day.
AI News for 6/3/2026-6/4/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
NVIDIA’s Nemotron 3 Ultra and 3.5 ASR Release
- Nemotron 3 Ultra was the clearest technical release of the day: a fully open 550B MoE model with 55B active parameters, 1M context, and an explicit focus on long-running agent workloads. NVIDIA says it is up to 5x faster and 30% lower cost for agentic tasks, with weights, synthetic data, reward checkpoints, quantized variants, and training recipes released under OpenMDW 1.1 (NVIDIA launch, NVIDIAAI open artifacts, Pavlo Molchanov thread). The architecture combines hybrid Mamba/attention, LatentMoE, and native MTP, with pretraining done in NVFP4 over 20T tokens—notable because it pushes low-precision pretraining into a new scale regime (tech notes, scaling discussion).
- Benchmarks and serving story were unusually strong for an open release. @ArtificialAnlys measured 47.7 on its Intelligence Index using NVIDIA’s recommended NVFP4 inference weights (48.2 in BF16), making it the strongest US open-weights model they’ve tested, though still behind Kimi K2.6. More interestingly, they reported 400+ output tok/s via BlackBox, and separately showed Nemotron 3 Ultra sitting on the Pareto frontier for task latency vs. performance on Terminal-Bench-style evaluations under turn limits (latency analysis, BlackBox throughput). The model shipped day 0 across the stack: vLLM, Modal, Together, Fireworks, Ollama cloud, Baseten, CoreWeave/W&B, Cline, Prime Intellect, and Nous Portal.
- Nemotron 3.5 ASR was the quieter but practical companion release: an open streaming ASR model with a single 0.6B checkpoint, 40 language-locale combinations, and sub-100ms latency, built on a cache-aware FastConformer / RNN-T style design optimized for voice agents and streaming speech workloads (Piotr Zelasko, Together, fal availability).
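To put the Nemotron 3 Ultra headline figures in perspective, the back-of-envelope sketch below estimates the active-parameter fraction, the weight storage footprint, and the time to stream a long response. These are rough estimates, not NVIDIA's numbers: the 4.5 bits/parameter figure assumes NVFP4 stores 4-bit values plus one 8-bit scale per 16-element block, the 10,000-token response length is hypothetical, and 400 tok/s is treated as a flat rate.

```python
# Figures quoted in the recap above.
TOTAL_PARAMS = 550e9      # total MoE parameters
ACTIVE_PARAMS = 55e9      # parameters active per token
OUTPUT_TOK_PER_S = 400    # reported lower bound on output throughput

# Assumption: 4-bit values plus one 8-bit scale shared by 16 elements.
NVFP4_BITS = 4 + 8 / 16
BF16_BITS = 16


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for each token."""
    return active / total


def weight_gigabytes(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring other overhead."""
    return params * bits_per_param / 8 / 1e9


def seconds_to_generate(tokens: int, tok_per_s: float) -> float:
    """Time to stream `tokens` output tokens at a constant rate."""
    return tokens / tok_per_s


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.0%}")
    print(f"NVFP4 weights:   {weight_gigabytes(TOTAL_PARAMS, NVFP4_BITS):.1f} GB")
    print(f"BF16 weights:    {weight_gigabytes(TOTAL_PARAMS, BF16_BITS):.1f} GB")
    # Hypothetical 10,000-token agent response.
    print(f"10k tokens:      {seconds_to_generate(10_000, OUTPUT_TOK_PER_S):.1f} s")
```

Under these assumptions only a tenth of the weights are touched per token, which is one plausible reason a 550B model can serve at speeds more typical of much smaller dense models.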
Anthropic’s Recursive Self-Improvement Framing and Internal AI-Coding Metrics
- Anthropic published the most-discussed policy/research note of the day, arguing that current systems show early signs of recursive self-improvement (RSI)—not yet full autonomy in research direction, but clear evidence that AI is accelerating AI development (Anthropic post). The headline operational claims were concrete: 80%+ of merged code at Anthropic is now authored by Claude, the typical engineer ships 8x more code per quarter than in prior years, and on internal open-ended engineering tasks Claude’s success rate rose from roughly 26% to 76% in six months (code metric, Alex Albert summary).
- The most striking empirical datapoint was Anthropic’s recurring “speed up a small model training script” test: Claude Opus 4 averaged about 3x speedup, while Mythos Preview reportedly achieved ~52x (Anthropic benchmark claim, correction on dates). Anthropic also says Mythos gave better “what to do next” research suggestions than humans 64% of the time in sessions where the researcher had taken a wrong turn (research-next-step result). Their broader thesis: automating problem selection is still unresolved, but automating large portions of implementation and iteration is already happening.
- The governance angle mattered as much as the productivity claims. Anthropic explicitly wrote that “it would be good for the world to have the option to slow or temporarily pause frontier AI development,” framing verification and coordination mechanisms as increasingly urgent if RSI-like dynamics continue (Anthropic governance statement, discussion, commentary). This landed amid criticism that Anthropic recently weakened parts of its Responsible Scaling Policy thresholds around bio/chemical risk, according to @CRSegerie. Separately, a coalition including Altman, Amodei, Hassabis, and Baker backed mandatory DNA synthesis screening and recordkeeping in the US, arguing AI is eroding biological knowledge barriers (letter summary).
Cloudflare Acquires VoidZero and Tightens the Full-Stack Agent Toolchain
- The biggest developer-platform move was Cloudflare bringing in VoidZero, the team behind Vite, Vitest, Rolldown, Oxc, and Vite+. Cloudflare and VoidZero emphasized that Vite remains open source, MIT, and vendor-neutral, with Cloudflare also committing $1M to a fund for independent Vite ecosystem development (Cloudflare, Vite statement, Evan You).
- The strategic read from developers was that this gives Cloudflare tighter control over an increasingly agent-friendly application stack: frontend/build tooling, runtime, storage, inference, deployment primitives, and security in one place. @wesbos framed it as Cloudflare assembling “a tidy package they can hand to an LLM to make a site,” which is directionally consistent with Cloudflare’s own push on agents, MCP, sandboxes, AI search, payments, and observability in a unified platform (Cloudflare agents docs overview).
Agents, Harnesses, Memory, and Evaluation Infrastructure
- Several tweets pointed to a maturing “agent systems” layer beyond raw model releases. A recurring theme was that the bottleneck is increasingly the harness/orchestrator, not just prompting. A popular clip summarized the Claude Code workflow as “I don’t prompt Claude anymore, I write loops,” while @omarsar0 described reverse-engineering dynamic workflows into his own orchestrator for branching research, verification, triage, data synthesis, and eval generation. The common idea: higher-order control loops, not one-shot prompts, are becoming the real unit of work.
- Tooling around those loops also improved. LangSmith Sandboxes reached GA with Dockerfile snapshots, interactive consoles, TCP tunneling, and standard Linux tooling. Hugging Face pushed two adjacent ideas: a Kernels distribution path for custom kernels on the Hub (announcement) and stronger support for storing agent traces as first-class artifacts, echoed by @ClementDelangue. @julien_c released SynthTraces, a minimal harness that generated 2,000+ synthetic coding-agent session traces by having an open model play the coding agent and a local model simulate the user.
- Evaluation also shifted toward real-world agent work. Arena launched Agent Arena / Agent Mode, measuring agentic performance from millions of live sessions with tools like web search, filesystem, bash, and image generation. Their current ranking puts GPT-5.5 first, followed by Claude Opus 4.7, GLM-5.1, Gemini 3.1 Pro, and Kimi-K2.6, with methodology based on task success, steerability, recovery, user praise/complaint, and tool hallucination across 300K+ tasks, 2M+ tool calls, and 40M lines of code (launch, methodology). On the enterprise side, Cognition introduced an AI Productivity Guarantee for Devin—up to $10M in covered usage if the product doesn’t produce positive engineering value—backed by an internal measurement system over 258 enterprise sessions spanning tasks up to 64+ hours (guarantee, technical writeup).
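The "write loops, not prompts" idea above can be sketched as a small control loop that proposes, verifies, and retries with feedback. This is only an illustration: `run_loop`, `propose`, and `verify` are hypothetical names, the stub functions stand in for a real model call and a real test suite, and nothing here reflects the internals of Claude Code or any specific orchestrator.

```python
from typing import Callable, Optional, Tuple

Propose = Callable[[str, Optional[str]], str]
Verify = Callable[[str], Tuple[bool, Optional[str]]]


def run_loop(task: str, propose: Propose, verify: Verify,
             max_turns: int = 5) -> Tuple[Optional[str], int]:
    """Propose a candidate, verify it, and feed failures back until it passes.

    Returns (candidate, turns_used), or (None, max_turns) if nothing passed.
    """
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        candidate = propose(task, feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, turn
    return None, max_turns


def stub_propose(task: str, feedback: Optional[str]) -> str:
    """Stand-in for a model call: wrong on the first try, fixed after feedback."""
    if feedback is None:
        return "def add(a, b): return a - b"
    return "def add(a, b): return a + b"


def stub_verify(candidate: str) -> Tuple[bool, Optional[str]]:
    """Stand-in for a test suite: run the candidate and check one case."""
    scope: dict = {}
    exec(candidate, scope)
    if scope["add"](2, 3) == 5:
        return True, None
    return False, "add(2, 3) should return 5"


if __name__ == "__main__":
    result, turns = run_loop("write add(a, b)", stub_propose, stub_verify)
    print(turns, result)
```

The point of the design is that the verifier, not the prompt, decides when work is done, and the turn limit bounds cost when the model cannot recover.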
Memory, Multimodality, and Model/Benchmark Updates
- OpenAI rolled out a more capable ChatGPT memory system to Plus and Pro users in the US, with memory summaries, more steering controls, and 2x more memory. The company framed this as a longer-running research arc from saved memory to “dreaming” to the current system (OpenAI, controls, Christina Kim explanation). Related developer-side updates included moderation scores in the Responses and Completions APIs (OpenAIDevs) and a heavily shared demo of the new Codex iOS app plugin for viewing and testing apps in-browser with hot reload (OpenAIDevs demo).
- A few other model/data releases are worth noting. Gemma 4 12B continued to draw attention both as a local coding model replacement and in highly compressed form: Unsloth released a 2-bit GGUF at 4.66 GB. @_philschmid highlighted an architectural explainer on how Gemma 4 handles text/images/audio without separate encoders. In multimodal research, @skalskip92 flagged Molmo2 as a strong open VLM candidate at CVPR, supporting video pointing, tracking, counting, and multi-image reasoning. For document understanding, ParseBench from LlamaIndex introduced an open benchmark with 2,000+ human-verified pages and 167K+ test rules across tables, charts, faithfulness, formatting, and grounding (benchmark announcement).
Top Tweets (by engagement, filtered for technical relevance)
- Anthropic on RSI and internal automation: Claude now writes 80%+ of merged code at Anthropic, engineers ship 8x more code, and the company says AI accelerating AI development is becoming plausible (Anthropic).
- OpenAI memory upgrade: a more capable ChatGPT memory system with summaries, steering controls, and 2x more memory for Plus/Pro users in the US (OpenAI).
- Cloudflare + VoidZero: Cloudflare brings in the VoidZero team while keeping Vite MIT and vendor-neutral, plus a $1M OSS fund for the ecosystem (Cloudflare, Vite).
- Nemotron 3 Ultra launch: open 550B/55B-active hybrid MoE for long-running agents, with full recipes and unusually strong speed claims (NVIDIA).
- Cursor canvases + context explorer: sharable canvases for apps/reports/internal tools and an interactive breakdown of where agent context is spent (Cursor).
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Gemma 4 12B Release and Benchmarks
- google/gemma-4-12B · Hugging Face (Activity: 1610): Google DeepMind released google/gemma-4-12B as part of the Gemma 4 open-weights family, spanning E2B, E4B, 12B, 26B A4B, and 31B variants with dense and MoE architectures, instruction-tuned/pretrained checkpoints, multimodal input, multilingual support across 140+ languages, and context windows up to 256K tokens. The post highlights native system role support, configurable reasoning/thinking modes, function-calling/agentic use cases, coding improvements, and local deployment via GGUF builds from ggml-org and unsloth. A top comment links Maarten Grootendorst’s visual guide, specifically calling out the model’s “encoder-free architecture.” Commenters are mainly interested in empirical coding performance, with one explicitly wanting to test whether Gemma 4 12B can beat Qwen 3.5 9B on coding tasks. No concrete benchmark results were provided in the comments.
- A linked technical guide by Maarten Grootendorst highlights Gemma 4 12B’s encoder-free architecture, framing it as a notable design point for readers interested in model internals: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4-12b.
- Several commenters positioned Gemma 4 12B as a practical size tier between smaller Gemma variants like E4B and larger models such as 26B, with one user also noting interest in whether it can outperform Qwen 3.5 9B on coding tasks.
- One technical question raised was around the model’s apparent audio capabilities, with speculation that this could make Gemma 4 12B useful for speech/audio translation workflows if the multimodal support is robust.
- New Google Gemma 4 12B Claims Near-26B Performance - We Tested Both! (Activity: 984): A local single-RTX 4090 comparison claims Google Gemma 4 26B-A4B used 15 GB VRAM, generated 6.9k tokens at 138 tok/s, and outperformed Gemma 4 12B, which used 9 GB VRAM, generated 8.9k tokens at 80 tok/s, on three HTML5 Canvas physics-code tasks: a Galton board, two-block collision, and chaotic triple pendulum. The poster argues the MoE-style 26B-A4B model is ~1.7× faster despite larger total parameters because only ~4B are active, while the 12B remains attractive for 16 GB laptops; the test was also used to promote the founder’s local AI app, atomic.chat. Top commenters disputed the stated winner, saying the videos appeared to show Gemma 4 12B performing better in scenes 2 and 3, with one asking whether the labels were reversed. Another commenter requested a comparable benchmark against Qwen3.6 35B-A3B.
- Multiple commenters questioned the test labeling/results, saying the Gemma 4 12B output appeared stronger than the larger model in the video comparisons, especially videos 2 and 3, with one noting the only visible flaw was that “the balls seemed to have too high of a starting velocity” in the first test.
- A technical advantage highlighted for Gemma 4 12B was multimodal capability: it can ingest audio and video while fitting on devices with less VRAM.
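The throughput figures in the RTX 4090 comparison above can be sanity-checked with simple arithmetic. The sketch below uses only the numbers reported in the post. It assumes the token counts are totals for each model and the tok/s values are averages over the whole run, which the post does not state explicitly. The dictionary keys and function names are illustrative.

```python
# Figures as reported in the post (single RTX 4090 run).
RUNS = {
    "gemma-4-26B-A4B": {"tokens": 6900, "tok_per_s": 138, "vram_gb": 15},
    "gemma-4-12B": {"tokens": 8900, "tok_per_s": 80, "vram_gb": 9},
}


def wall_seconds(run: dict) -> float:
    """Generation time implied by total tokens and average decode speed."""
    return run["tokens"] / run["tok_per_s"]


def speed_ratio(fast: dict, slow: dict) -> float:
    """How many times faster `fast` decodes than `slow`."""
    return fast["tok_per_s"] / slow["tok_per_s"]


moe = RUNS["gemma-4-26B-A4B"]
dense = RUNS["gemma-4-12B"]

if __name__ == "__main__":
    print(f"decode speed ratio: {speed_ratio(moe, dense):.3f}x")
    print(f"26B-A4B wall time:  {wall_seconds(moe):.2f} s")
    print(f"12B wall time:      {wall_seconds(dense):.2f} s")
    print(f"wall-time ratio:    {wall_seconds(dense) / wall_seconds(moe):.3f}x")
```

The decode speed ratio works out to 138 / 80, consistent with the poster's "~1.7×" claim. Because the 12B model also emitted more tokens, the implied wall-clock gap is larger, a bit over 2x under these assumptions.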