[AINews] not much happened today
NVIDIA released Nemotron 3 Ultra, a 550B-parameter open-weights model, and reported breakthrough performance and cost results on long-context and agentic tasks.
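Key points
A large open-weights model release
Nemotron 3 Ultra, a MoE model with 550B total parameters (55B active), was released in full under the OpenMDW 1.1 license.
Performance and efficiency on agentic tasks
NVIDIA claims up to 5x faster and 30% cheaper agentic workloads, with pretraining done in NVFP4 and recommended NVFP4 inference weights; the model supports a 1M-token context and is tuned for complex, long-running agent work.
Top-tier benchmark results
In @ArtificialAnlys benchmarks it scored highest among the US open-weights models tested, and it sits on the Pareto frontier for task latency versus performance.
An ecosystem that works on day one
From release day the model is supported by major inference platforms and cloud providers, including vLLM, Modal, and Together, so it can be deployed right away.
Anthropic's recursive self-improvement (RSI) framing and productivity gains
Anthropic reported that Claude now authors more than 80% of its merged code and that engineers ship 8x more code per quarter. Automating the choice of research direction remains unresolved, but automating implementation is already under way.
Dramatic speedups of model training scripts
Claude Opus 4 and Mythos Preview sped up a small model training script by about 3x and roughly 52x respectively, a sign that AI is accelerating AI development.
Cloudflare acquires VoidZero and invests in the Vite ecosystem
Cloudflare acquired the VoidZero team behind Vite and Vitest. The tools stay open source, and a $1M fund was set up to support independent development of the Vite ecosystem.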
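Impact analysis
This release could reset the baseline for performance and cost efficiency in open-weights models, and it is likely to speed up development in areas that need complex agentic tasks or long-context processing. NVIDIA's push against the scaling limits of low-precision training with its own approach (NVFP4, hybrid Mamba/attention) is an important turning point that may influence architecture design across the industry.
Editor's comment
This NVIDIA release is more than a model drop: it is highly strategic in that it brings low-precision training into practical use and integrates with the open-source ecosystem on day one. The agent-focused performance gains in particular could become a de facto standard for next-generation AI application development.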
Anthropic is seeing Sparks of RSI, OpenAI’s ChatGPT has finally crossed 1B MAU ~5 months behind schedule and improved memory, and SpaceXAI is explaining its IPO to people who might not know they will be forced into buying it.
None of which are as important as getting your AIEWF tickets and hotels and tuning in to the latest pod with Andon Labs!
$2k in credits and free AIE WF tickets! Latest pod: "Reality: The Final Eval" with Lukas Petersson and Axel Backlund of Andon Labs (https://www.latent.space/p/andon).
AI News for 6/3/2026-6/4/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
NVIDIA’s Nemotron 3 Ultra and 3.5 ASR Release
Nemotron 3 Ultra was the clearest technical release of the day: a fully open 550B MoE model with 55B active parameters, 1M context, and an explicit focus on long-running agent workloads. NVIDIA says it is up to 5x faster and 30% lower cost for agentic tasks, with weights, synthetic data, reward checkpoints, quantized variants, and training recipes released under OpenMDW 1.1 (NVIDIA launch, NVIDIAAI open artifacts, Pavlo Molchanov thread). The architecture combines hybrid Mamba/attention, LatentMoE, and native MTP, with pretraining done in NVFP4 over 20T tokens—notable because it pushes low-precision pretraining into a new scale regime (tech notes, scaling discussion).
Benchmarks and serving story were unusually strong for an open release. @ArtificialAnlys measured 47.7 on its Intelligence Index using NVIDIA’s recommended NVFP4 inference weights (48.2 in BF16), making it the strongest US open-weights model they’ve tested, though still behind Kimi K2.6. More interestingly, they reported 400+ output tok/s via BlackBox, and separately showed Nemotron 3 Ultra sitting on the Pareto frontier for task latency vs. performance on Terminal-Bench-style evaluations under turn limits (latency analysis, BlackBox throughput). The model shipped day 0 across the stack: vLLM, Modal, Together, Fireworks, Ollama cloud, Baseten, CoreWeave/W&B, Cline, Prime Intellect, and Nous Portal.
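To make the "Pareto frontier for task latency vs. performance" claim concrete, here is a minimal sketch of how such a frontier can be computed. The model names and numbers below are hypothetical placeholders, not ArtificialAnlys data; the only assumption carried over from the text is that lower latency and higher score are both better.

```python
from typing import List, Tuple

# (name, task latency in seconds, benchmark score).
# All entries are made-up illustration values, not measured results.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points that no other point beats on both latency (lower) and score (higher)."""
    frontier: List[Point] = []
    best_score = float("-inf")
    # Walk from fastest to slowest; a slower model only survives if it scores higher
    # than every faster model seen so far.
    for point in sorted(points, key=lambda p: (p[1], -p[2])):
        if point[2] > best_score:
            frontier.append(point)
            best_score = point[2]
    return frontier


models: List[Point] = [
    ("model-a", 120.0, 41.0),
    ("model-b", 300.0, 55.0),
    ("model-c", 310.0, 50.0),  # slower and weaker than model-b
    ("model-d", 90.0, 30.0),
    ("model-e", 150.0, 38.0),  # slower and weaker than model-a
]

frontier_names = [name for name, _, _ in pareto_frontier(models)]
print(frontier_names)
```

A model "sits on the frontier" when it appears in this list: choosing any other model would cost either latency or score.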
Nemotron 3.5 ASR was the quieter but practical companion release: an open streaming ASR model with a single 0.6B checkpoint, 40 language-locale combinations, and sub-100ms latency, built on a cache-aware FastConformer / RNN-T style design optimized for voice agents and streaming speech workloads (Piotr Zelasko, Together, fal availability).
Anthropic’s Recursive Self-Improvement Framing and Internal AI-Coding Metrics
Anthropic published the most-discussed policy/research note of the day, arguing that current systems show early signs of recursive self-improvement (RSI)—not yet full autonomy in research direction, but clear evidence that AI is accelerating AI development (Anthropic post). The headline operational claims were concrete: 80%+ of merged code at Anthropic is now authored by Claude, the typical engineer ships 8x more code per quarter than in prior years, and on internal open-ended engineering tasks Claude’s success rate rose from roughly 26% to 76% in six months (code metric, Alex Albert summary).
The most striking empirical datapoint was Anthropic’s recurring “speed up a small model training script” test: Claude Opus 4 averaged about 3x speedup, while Mythos Preview reportedly achieved ~52x (Anthropic benchmark claim, correction on dates). Anthropic also says Mythos gave better “what to do next” research suggestions than humans 64% of the time in sessions where the researcher had taken a wrong turn (research-next-step result). Their broader thesis: automating problem selection is still unresolved, but automating large portions of implementation and iteration is already happening.
The governance angle mattered as much as the productivity claims. Anthropic explicitly wrote that “it would be good for the world to have the option to slow or temporarily pause frontier AI development,” framing verification and coordination mechanisms as increasingly urgent if RSI-like dynamics continue (Anthropic governance statement, discussion, commentary). This landed amid criticism that Anthropic recently weakened parts of its Responsible Scaling Policy thresholds around bio/chemical risk, according to @CRSegerie. Separately, a coalition including Altman, Amodei, Hassabis, and Baker backed mandatory DNA synthesis screening and recordkeeping in the US, arguing AI is eroding biological knowledge barriers (letter summary).
Cloudflare Acquires VoidZero and Tightens the Full-Stack Agent Toolchain
The biggest developer-platform move was Cloudflare bringing in VoidZero, the team behind Vite, Vitest, Rolldown, Oxc, and Vite+. Cloudflare and VoidZero emphasized that Vite remains open source, MIT, and vendor-neutral, with Cloudflare also committing $1M to a fund for independent Vite ecosystem development (Cloudflare, Vite statement, Evan You).
The strategic read from developers was that this gives Cloudflare tighter control over an increasingly agent-friendly application stack: frontend/build tooling, runtime, storage, inference, deployment primitives, and security in one place. @wesbos framed it as Cloudflare assembling “a tidy package they can hand to an LLM to make a site,” which is directionally consistent with Cloudflare’s own push on agents, MCP, sandboxes, AI search, payments, and observability in a unified platform (Cloudflare agents docs overview).
Agents, Harnesses, Memory, and Evaluation Infrastructure
Several tweets pointed to a maturing “agent systems” layer beyond raw model releases. A recurring theme was that the bottleneck is increasingly the harness/orchestrator, not just prompting. A popular clip summarized the Claude Code workflow as “I don’t prompt Claude anymore, I write loops,” while @omarsar0 described reverse-engineering dynamic workflows into his own orchestrator for branching research, verification, triage, data synthesis, and eval generation. The common idea: higher-order control loops, not one-shot prompts, are becoming the real unit of work.
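The "write loops, not prompts" idea can be sketched as a small generate/verify/retry loop. This is an illustrative pattern only, not Claude Code's or @omarsar0's actual orchestrator; `run_loop`, `fake_model`, and `check_sum` are hypothetical names, and the model is a stub so the example runs offline.

```python
from typing import Callable, Optional


def run_loop(
    generate: Callable[[str], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_turns: int = 5,
) -> Optional[str]:
    """Generate, verify, and feed the verifier's error back until success or the turn limit."""
    prompt = task
    for _ in range(max_turns):
        candidate = generate(prompt)
        error = verify(candidate)
        if error is None:
            return candidate  # verifier accepted the output
        # Fold the failed attempt and the feedback into the next prompt.
        prompt = (
            f"{task}\n\nPrevious attempt:\n{candidate}\n\nVerifier feedback:\n{error}"
        )
    return None  # turn budget exhausted without a verified result


def fake_model(prompt: str) -> str:
    # Stub standing in for a real model call: wrong first, right after feedback.
    return "4" if "Verifier feedback" in prompt else "5"


def check_sum(answer: str) -> Optional[str]:
    # Returns None on success, otherwise an error message for the next turn.
    return None if answer.strip() == "4" else f"expected 2 + 2 = 4, got {answer}"


result = run_loop(fake_model, check_sum, "What is 2 + 2?")
print(result)
```

The unit of work here is the loop plus its verifier: swapping in a test suite, a linter, or an eval for `check_sum` changes what "done" means without touching the prompt.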
Tooling around those loops also improved. LangSmith Sandboxes reached GA with Dockerfile snapshots, interactive consoles, TCP tunneling, and standard Linux tooling. Hugging Face pushed two adjacent ideas: a Kernels distribution path for custom kernels on the Hub (announcement) and stronger support for storing agent traces as first-class artifacts, echoed by @ClementDelangue. @julien_c released SynthTraces, a minimal harness that generated 2,000+ synthetic coding-agent session traces by having an open model play the coding agent and a local model simulate the user.
Evaluation also shifted toward real-world agent work. Arena launched Agent Arena / Agent Mode, measuring agentic performance from millions of live sessions with tools like web search, filesystem, bash, and image generation. Their current ranking puts GPT-5.5 first, followed by Claude Opus 4.7, GLM-5.1, Gemini 3.1 Pro, and Kimi-K2.6, with methodology based on task success, steerability, recovery, user praise/complaint, and tool hallucination across 300K+ tasks, 2M+ tool calls, and 40M lines of code (launch, methodology). On the enterprise side, Cognition introduced an AI Productivity Guarantee for Devin—up to $10M in covered usage if the product doesn’t produce positive engineering value—backed by an internal measurement system over 258 enterprise sessions spanning tasks up to 64+ hours (guarantee, technical writeup).
Memory, Multimodality, and Model/Benchmark Updates
OpenAI rolled out a more capable ChatGPT memory system to Plus and Pro users in the US, with memory summaries, more steering controls, and 2x more memory. The company framed this as a longer-running research arc from saved memory to “dreaming” to the current system (OpenAI, controls, Christina Kim explanation). Related developer-side updates included moderation scores in the Responses and Completions APIs (OpenAIDevs) and a heavily shared demo of the new Codex iOS app plugin for viewing and testing apps in-browser with hot reload (OpenAIDevs demo).
A few other model/data releases are worth noting. Gemma 4 12B continued to draw attention both as a local coding model replacement and in highly compressed form: Unsloth released a 2-bit GGUF at 4.66 GB. @_philschmid highlighted an architectural explainer on how Gemma 4 handles text/images/audio without separate encoders. In multimodal research, @skalskip92 flagged Molmo2 as a strong open VLM candidate at CVPR, supporting video pointing, tracking, counting, and multi-image reasoning. For document understanding, ParseBench from LlamaIndex introduced an open benchmark with 2,000+ human-verified pages and 167K+ test rules across tables, charts, faithfulness, formatting, and grounding (benchmark announcement).
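As a rough sanity check on the "2-bit GGUF at 4.66 GB" figure, the sketch below computes the effective bits per weight. It assumes exactly 12 billion parameters and decimal gigabytes (10^9 bytes), neither of which the source confirms, so treat the result as an estimate.

```python
def effective_bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    file_bits = file_size_gb * 1e9 * 8
    return file_bits / (params_billions * 1e9)


# 4.66 GB comes from the text; 12.0 is an assumed round parameter count.
bpw = effective_bits_per_weight(4.66, 12.0)
print(f"{bpw:.2f} bits per weight")
```

Under these assumptions the file averages a little over 3 bits per weight, so the "2-bit" label likely describes the dominant quantization type rather than the whole file. The source does not say how the remaining bits are distributed.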
Top Tweets (by engagement, filtered for technical relevance)
Anthropic on RSI and internal automation: Claude now writes 80%+ of merged code at Anthropic, engineers ship 8x more code, and the company says AI accelerating AI development is becoming plausible (Anthropic).
OpenAI memory upgrade: a more capable ChatGPT memory system with summaries, steering controls, and 2x more memory for Plus/Pro users in the US (OpenAI).
Cloudflare + VoidZero: Cloudflare brings in the VoidZero team while keeping Vite MIT and vendor-neutral, plus a $1M OSS fund for the ecosystem (Cloudflare, Vite).
Nemotron 3 Ultra launch: open 550B/55B-active hybrid MoE for long-running agents, with full recipes and unusually strong speed claims (NVIDIA).
Cursor canvases + context explorer: sharable canvases for apps/reports/internal tools and an interactive breakdown of where agent context is spent (Cursor).
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
- Gemma 4 12B Release and Benchmarks
google/gemma-4-12B · Hugging Face (Activity: 1610): Google DeepMind released google/gemma-4-12B as part of the Gemma 4 open-weights family, spanning E2B, E4B, 12B, 26B A4B, and 31B variants with dense and MoE architectures, instruction-tuned/pretrained checkpoints, multimodal input, multilingual support across 140+ languages, and context windows up to 256K tokens. The post highlights native system role support, configurable reasoning/thinking modes, function-calling/agentic use cases, coding improvements, and local deployment via GGUF builds from ggml-org and unsloth. A top comment links Maarten Grootendorst’s visual guide, specifically calling out the model’s “encoder-free architecture.” Commenters are mainly interested in empirical coding performance, with one explicitly wanting to test whether Gemma 4 12B can beat Qwen 3.5 9B on coding tasks. No concrete benchmark results were provided in the comments.
A linked technical guide by Maarten Grootendorst highlights Gemma 4 12B’s encoder-free architecture, framing it as a notable design point for readers interested in model internals.
Several commenters positioned Gemma 4 12B as a practical size tier between smaller Gemma variants like E4B and larger models such as 26B, with one user also noting interest in whether it can outperform Qwen 3.5 9B on coding tasks.
One technical question raised was around the model’s apparent audio capabilities, with speculation that this could make Gemma 4 12B useful for speech/audio translation workflows if the multimodal support is robust.
New Google Gemma 4 12B Claims Near-26B Performance - We Tested Both! (Activity: 984): A local single-RTX 4090 comparison claims Google Gemma 4 26B-A4B used 15 GB VRAM, generated 6.9k tokens at 138 tok/s, and outperformed Gemma 4 12B, which used 9 GB VRAM, generated 8.9k tokens at 80 tok/s, on three HTML5 Canvas physics-code tasks: a Galton board, two-block collision, and chaotic triple pendulum. The poster argues the MoE-style 26B-A4B model is ~1.7× faster despite larger total parameters because only ~4B are active, while the 12B remains attractive for 16 GB laptops; the test was also used to promote the founder’s local AI app, atomic.chat. Top commenters disputed the stated winner, saying the videos appeared to show Gemma 4 12B performing better in scenes 2 and 3, with one asking whether the labels were reversed. Another commenter requested a comparable benchmark against Qwen3.6 35B-A3B.
Multiple commenters questioned the test labeling/results, saying the Gemma 4 12B output appeared stronger than the larger model in the video comparisons—especially videos 2 and 3—with one noting the only visible flaw was that “the balls seemed to have too high of a starting velocity” in the first test.
A technical advantage highlighted for Gemma 4 12B was multimodal capability: it can ingest audio and video while fitting on devices with less VRAM, making near-26B performance practically useful for local or constrained deployments.
Commenters requested broader baselines such as Qwen3.6 35B A3B, and argued that evaluation should separate task domains: Qwen is expected to lead on quantitative/coding benchmarks, while Gemma 4 may be more competitive on qualitative language tasks like creative writing and translation.
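The poster's "~1.7× faster" figure and its effect on wall-clock time can be reproduced from the numbers quoted in the post. This sketch assumes the token counts are totals per model and that throughput was steady during decoding; the post does not confirm either.

```python
# Token counts and throughput as quoted in the post.
runs = {
    "Gemma 4 26B-A4B": {"tokens": 6900, "tok_per_s": 138.0},
    "Gemma 4 12B": {"tokens": 8900, "tok_per_s": 80.0},
}


def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to generate the output at a constant decode rate."""
    return tokens / tok_per_s


moe = runs["Gemma 4 26B-A4B"]
dense = runs["Gemma 4 12B"]

speed_ratio = moe["tok_per_s"] / dense["tok_per_s"]
moe_seconds = decode_seconds(moe["tokens"], moe["tok_per_s"])
dense_seconds = decode_seconds(dense["tokens"], dense["tok_per_s"])

print(f"throughput ratio: {speed_ratio:.1f}x")
print(f"26B-A4B decode time: {moe_seconds:.2f} s")
print(f"12B decode time: {dense_seconds:.2f} s")
```

Because the 12B run also produced more tokens, its total generation time is more than twice as long, a larger gap than the throughput ratio alone suggests.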
gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint (Activity: 520): The image is a technical benchmark table comparing Gemma 4 12B Unified vs Qwen3.5-9B, compiled from official Hugging Face model-card scores, with Qwen3.5-9B winning 5/8 shared benchmarks despite a smaller parameter footprint and allegedly lighter KV cache (image). Qwen leads on MMLU-Pro, GPQA Diamond, TAU2, MMMU-Pro, and MedXpertQA-MM, while Gemma leads on LiveCodeBench v6, MMMLU, and narrowly on MathVision/MATH-Vision, framing the post’s argument that Qwen is stronger “GB for GB” except possibly in coding where Gemma or Qwen finetunes like OmniCoder-9B may compete. Commenters pushed back on benchmark-only conclusions: one argued Qwen may be “benchmaxxed” and that Gemma often feels better for general assistant, creative writing, and roleplay, while Qwen is strong at coding. Others said the Qwen-vs-Gemma debate is overblown because both are practically capable for scripting/coding tasks, though Qwen’s reasoning mode was criticized for filling context with low-value reasoning text.
Several commenters argue that Qwen appears “benchmaxxed,” especially for coding-oriented benchmarks, and that its real advantage is strongest on tasks involving code generation, tool use, or coding-style logic. In practical use, users report both Gemma 4 31B / Gemma 3.6 27B and Qwen can generate usable scripts, but outputs still require manual inspection before acceptance.
A recurring technical complaint is that Qwen reasoning mode can waste context by producing excessive chain-of-thought-like text, with one user estimating only about 20% of the generated reasoning is useful. This suggests that for some local/SLM workflows, disabling reasoning may improve effective context utilization and reduce noise.
Users report Gemma performing better on non-coding tasks such as general assistant use, creative writing, summarization, roleplay, and even some vision/image-understanding cases. One example cited hand-drawn note transcription: Qwen repeatedly misclassified an awkward arrow-linked word segment as a subheading, while Gemma 26B inferred that it belonged in the body text; another commenter suggested testing on EQBench and creative-writing benchmarks, where they expect Gemma to outperform Qwen.
- Long-Context Scaling and KV Cache Efficiency
nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face (Activity: 542): NVIDIA released nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16, a 550B-parameter LatentMoE hybrid model with 55B active parameters, interleaving Mamba-2, MoE, selected attention layers, and Multi-Token Prediction; it advertises up to 1M token context and configurable reasoning via enable_thinking=True/False. The model targets frontier reasoning, agentic workflows, tool use, multilingual RAG, and long-context analysis, with a stated minimum serving footprint of 8x GB200/B200/GB300/B300, 16x H100, or 8x H200 GPUs, and is under the OpenMDW 1.1 license. Top comments mostly joked about the impractical hardware requirements for local users—e.g. “Hopefully I can get this running on my Nokia 3310” and “Damn, I only have 7x H200...”—rather than debating model quality or architecture.
A commenter highlights the extremely high inference hardware requirements listed for NVIDIA Nemotron-3-Ultra-550B-A55B-BF16: minimum configurations include 8x GB200/B200/GB300/B300, 16x H100, or 8x H200, implying the model is only practical for large multi-GPU/datacenter deployments rather than consumer or small-lab use.
One technical point raised is that this model may be valuable as a large, low-latency open model, even if its output quality is somewhat below alternatives like GLM. The tradeoff discussed is that faster response/processing can matter more than absolute benchmark quality for latency-sensitive applications.
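The minimum serving configurations listed in the post line up with a back-of-the-envelope weight-memory estimate. The sketch below assumes 80 GB per H100 and 141 GB per H200, and it models NVFP4 as 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-value block). That NVFP4 figure is an assumption about the format's overhead, not a number from the source, and it ignores any tensors kept at higher precision.

```python
PARAMS = 550e9  # total parameters, from the model name


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes for a given average precision."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weight_gb(PARAMS, 16)    # BF16: 2 bytes per weight
nvfp4_gb = weight_gb(PARAMS, 4.5)  # assumed NVFP4 average, see lead-in

# Aggregate GPU memory for two of the listed minimum configurations.
clusters_gb = {
    "16x H100": 16 * 80,
    "8x H200": 8 * 141,
}

print(f"BF16 weights:  {bf16_gb:.0f} GB")
print(f"NVFP4 weights: {nvfp4_gb:.0f} GB (estimate)")
for name, capacity in clusters_gb.items():
    headroom = capacity - bf16_gb
    print(f"{name}: {capacity} GB total, {headroom:.0f} GB left after BF16 weights")
```

Under these assumptions the BF16 weights alone nearly fill 8x H200, leaving little room for KV cache and activations, which is consistent with the post describing these as minimum footprints.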
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) (Activity: 438): Huawei CSL open-sourced KVarN, an Apache-2.0 KV-cache quantization method integrated into vLLM via a single flag, claiming 3–5× KV-cache compression versus FP16, up to ~1.4× FP16 throughput, and up to ~2.4× TurboQuant throughput while preserving FP16-level quality (repo, paper). The post contrasts KVarN with vLLM FP8 KV cache (~2× capacity, near-BF16 throughput) and Google TurboQuant, citing a vLLM/Red Hat AI study where TurboQuant achieves compression but drops to 66–80% of BF16 throughput and loses ~20 reasoning points in low-bit modes on benchmarks like AIME25 and LiveCodeBench. The key technical claim is that KVarN avoids explicit BF16 dequantization overhead in attention and maintains reasoning/code/math accuracy at higher compression, with no model changes, retraining, or calibration. Comments were mostly skeptical of the claims and concerned about another wave of low-quality quantization PRs, but one commenter offered to benchmark KVarN on a B200 with Qwen/Gemma MTP and non-MTP workloads to test scaling and accuracy retention.
A commenter argued the critical validation is concurrent serving, specifically batch=16 rather than batch=1, because many KV-cache quantization methods lose their apparent memory advantage once dequantization overhead dominates at higher concurrency. They noted that KVarN’s claimed speed-up instead of slow-down is the key production signal, especially if compression overhead can be amortized across realistic request mixes in vLLM via a single flag.
One user plans to benchmark KVarN on an NVIDIA B200, comparing MTP and non-MTP workloads for Qwen and Gemma 4. This would be useful for validating whether the claimed 3–5× KV-cache compression and speed gains scale on high-end inference hardware rather than only in paper settings.
Another commenter was skeptical that KV quantization results will generalize to newer architectures, suggesting many methods work because current models store information inefficiently i
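To put the compression ratios in this thread in context, here is a minimal sketch of the standard KV-cache size formula for a transformer with grouped-query attention. The model shape (32 layers, 8 KV heads, head dimension 128, 128K context) is a hypothetical example, not any model named above, and the ratios are applied as plain divisors without modeling KVarN's or TurboQuant's actual storage layout.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch: int = 1,
    bytes_per_elem: int = 2,  # FP16/BF16
) -> int:
    """Bytes for keys plus values across all layers; the leading 2 covers K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 2**30

# Hypothetical model shape, chosen only to give round numbers.
fp16_bytes = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=131072)

print(f"FP16 KV cache: {fp16_bytes / GIB:.1f} GiB per sequence")
for label, ratio in [("FP8 (~2x)", 2), ("3x", 3), ("5x", 5)]:
    print(f"{label}: {fp16_bytes / ratio / GIB:.1f} GiB per sequence")
```

Since the cache grows linearly with both sequence length and batch size, the saved memory translates directly into more concurrent sequences or longer contexts on the same GPU, which is why commenters asked for results at batch=16 rather than batch=1.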