The Decoder · April 3, 2026, 21:04 · about 1 min read

Zhipu AI's GLM-5V-Turbo turns design mockups directly into executable front-end code

#MultimodalAI #CodeGeneration #FrontendDevelopment #AgentWorkflows #ChineseAI #DesignToCode

TL;DR

Chinese AI startup Zhipu AI has released GLM-5V-Turbo, a multimodal model that converts design mockups directly into executable front-end code and is intended for use in agent workflows.

AI in-depth analysis · April 3, 2026, 22:41

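Key points

Release of the new model GLM-5V-Turbo

Zhipu AI has announced GLM-5V-Turbo, a multimodal model that processes images, video, and text.

Direct design-to-code conversion

The model can convert design mockups directly into executable front-end code.

Designed for agent workflows

GLM-5V-Turbo was developed specifically with use in agent workflows in mind.

Technical progress at a Chinese AI startup

It is notable that Zhipu AI, a Chinese AI startup, has released a model with capabilities this advanced.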

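Impact analysis

This technology could fundamentally transform front-end development workflows and become a breakthrough way to bridge the gap between designers and developers. Its integration into agent workflows in particular points to a new stage of AI-assisted development.

Editorial comment

Automatic design-to-code conversion has been sought after for years, and if it reaches practical use it could contribute significantly to more efficient development processes. The technical progress of Chinese AI companies is also drawing attention.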

Chinese AI startup Zhipu AI has released GLM-5V-Turbo, a multimodal model that processes images, video, and text. The model is designed for use in agent workflows.

The article "Zhipu AI's GLM-5V-Turbo turns design mockups directly into executable front-end code" was first published on The Decoder.

Chinese AI company Zhipu AI has released GLM-5V-Turbo, its first multimodal coding base model. It processes images, video, and text and is built specifically for agent workflows.

With GLM-5V-Turbo, the startup wants to close the gap between visual understanding and code generation. Instead of working with text alone, the model analyzes design mockups and generates executable code straight from them. According to the company, it plugs right into agents like Claude Code and OpenClaw, covering the full loop of "understand the environment → plan actions → execute tasks."
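The "understand the environment → plan actions → execute tasks" loop can be pictured as a small control loop. The sketch below is only an illustration of that pattern: `AgentState`, `run_agent_loop`, and the three stub callbacks are hypothetical names, not part of any Z.AI, Claude Code, or OpenClaw API, and a real agent would replace the stubs with model and tool calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    """Hypothetical record of what the agent has observed and done so far."""
    goal: str
    observations: List[str] = field(default_factory=list)
    actions_taken: List[str] = field(default_factory=list)
    done: bool = False


def run_agent_loop(
    state: AgentState,
    perceive: Callable[[AgentState], str],
    plan: Callable[[AgentState], str],
    execute: Callable[[AgentState, str], bool],
    max_steps: int = 10,
) -> AgentState:
    """Repeat understand -> plan -> execute until done or the step budget runs out."""
    for _ in range(max_steps):
        # 1. Understand the environment (for example, describe a screenshot).
        state.observations.append(perceive(state))
        # 2. Plan the next action from the goal and the latest observation.
        action = plan(state)
        # 3. Execute the action; the executor reports whether the goal is reached.
        state.done = execute(state, action)
        state.actions_taken.append(action)
        if state.done:
            break
    return state


# Stub callbacks standing in for model and tool calls (assumptions for the demo).
def demo_perceive(state: AgentState) -> str:
    return f"screenshot #{len(state.observations) + 1}"


def demo_plan(state: AgentState) -> str:
    return f"act on {state.observations[-1]}"


def demo_execute(state: AgentState, action: str) -> bool:
    # Pretend the task is finished once three actions have been carried out.
    return len(state.actions_taken) + 1 >= 3


final = run_agent_loop(
    AgentState(goal="rebuild landing page"),
    demo_perceive,
    demo_plan,
    demo_execute,
)
print(final.actions_taken)
```

The step budget (`max_steps`) is a design choice in this sketch so that a loop whose executor never reports success still terminates.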

The context window handles 200,000 tokens, with a maximum output of 128,000 tokens. Features include a thinking mode, streaming output, function calling, and context caching.
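To make the two limits concrete, here is a minimal sketch of a budget check. It assumes that prompt and output tokens share the 200,000-token window, which the article does not state; the function name `max_allowed_output` is hypothetical and not taken from any Z.AI SDK.

```python
CONTEXT_WINDOW_TOKENS = 200_000  # context window reported in the article
MAX_OUTPUT_TOKENS = 128_000      # maximum output reported in the article


def max_allowed_output(prompt_tokens: int) -> int:
    """Largest output budget that still fits.

    Assumption: prompt and output tokens count against the same window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = CONTEXT_WINDOW_TOKENS - prompt_tokens
    # The output is capped both by the per-response limit and by the space left.
    return max(0, min(MAX_OUTPUT_TOKENS, remaining))


for prompt in (50_000, 150_000, 250_000):
    print(prompt, max_allowed_output(prompt))
```

Under that assumption, a 50,000-token prompt leaves room for the full 128,000-token output, while a 150,000-token prompt leaves only 50,000 tokens.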

How vision and code come together in a single model

Z.AI says GLM-5V-Turbo's performance stems from improvements in four areas: model architecture, training methods, data construction, and tooling.

The model learns to process images and text together from the start of training, rather than tacking a separate image recognition module onto a finished language model after the fact. Z.AI built a new vision encoder called CogViT for this. The model also predicts multiple tokens at once during inference, which should speed up output.
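Why predicting several tokens per step should speed up output can be shown with simple arithmetic. The sketch below is an idealized upper bound: it assumes every predicted token is kept, whereas real multi-token decoding typically verifies predictions and may discard some. The article does not say how many tokens GLM-5V-Turbo predicts at once, so `tokens_per_step` is a hypothetical parameter.

```python
import math


def decoding_steps(output_tokens: int, tokens_per_step: int) -> int:
    """Number of forward passes needed if each pass yields `tokens_per_step` tokens.

    Idealized: assumes every predicted token is accepted.
    """
    if output_tokens < 0 or tokens_per_step < 1:
        raise ValueError("output_tokens must be >= 0 and tokens_per_step >= 1")
    return math.ceil(output_tokens / tokens_per_step)


baseline = decoding_steps(1_000, 1)    # one token per pass
multi = decoding_steps(1_000, 3)       # three tokens per pass (hypothetical)
print(baseline, multi)
```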

Reinforcement learning optimizes the model across more than 30 task types, including STEM, grounding, video, GUI agents, and coding agents, aiming for more robust perception, reasoning, and agentic execution.

To tackle the shortage of agent training data, Z.AI built a multi-level, controllable, and verifiable data system. Agentic meta-skills are baked into the pre-training stage to strengthen action prediction and execution early on.

A new multimodal toolchain extends the agent's reach from pure text to visual interaction. Tools for box drawing, screenshots, and website reading, including image understanding, complete the perception-planning-execution loop.

Strong numbers in coding and GUI agent benchmarks

According to Z.AI, GLM-5V-Turbo delivers leading results in multimodal coding and agent tasks. The model scores well in design-to-code generation, visual code generation, multimodal search, and visual exploration, and posts strong numbers on AndroidWorld and WebVoyager, two benchmarks that test an agent's ability to navigate real GUI environments.

Z.AI says GLM-5V-Turbo leads in most multimodal coding and tool usage categories. Claude Opus 4.6 pulls ahead in a few benchmarks like Flame-VLM-Code and OSWorld. | Image: Z.AI

In text-only coding tasks, GLM-5V-Turbo reportedly shows no performance drop despite the added visual capabilities, holding its own across the three core CC-Bench-V2 benchmarks (backend, frontend, repo exploration). It also puts up strong numbers on PinchBench, ClawEval, and ZClawBench, which measure task execution quality. Independent evaluations are still pending.

In text-only coding and agent benchmarks, Claude Opus 4.6 leads overall, but GLM-5V-Turbo outperforms its own text model GLM-5-Turbo and Kimi K2.5 in several categories. | Image: Z.AI

Design mockups become working front-end projects

GLM-5V-Turbo targets several specific use cases. The model takes design mockups or reference images and generates a complete, runnable front-end project. It reconstructs wireframe structure and functionality, aiming for pixel-perfect visual consistency with high-resolution designs.

Paired with frameworks like Claude Code, the model handles autonomous GUI exploration: it searches target websites on its own, maps page transitions, collects visual assets and interaction details, and writes code based on what it finds. Z.AI calls this an upgrade from "recreating from a screenshot" to "recreating through autonomous exploration."

For debugging, the model screenshots broken pages, automatically spots rendering issues like layout shifts, component overlaps, and color mismatches, then generates fix code. With GLM-5V-Turbo integrated, OpenClaw can also understand website layouts, GUI elements, and diagrams, helping it tackle more complex tasks that combine perception, planning, and execution.

Z.AI ships official skills, including image captioning, visual grounding, document-based writing, resume screening, and prompt generation, all available on ClawHub. GLM-5V-Turbo is available only as an API through the Z.AI platform for now, priced at $1.20 per million input tokens and $4 per million output tokens, the same as the text-only GLM-5-Turbo and slightly above the base GLM-5 model. Z.AI hasn't announced open model weights yet.
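The quoted prices translate into per-request costs with a short calculation. This is a rough sketch using only the two list prices from the article; it ignores any discount for cached context, which the article does not price, and `estimate_cost_usd` is a hypothetical helper name.

```python
from decimal import Decimal

INPUT_USD_PER_MILLION = Decimal("1.20")   # input price from the article
OUTPUT_USD_PER_MILLION = Decimal("4.00")  # output price from the article


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one request in US dollars at list prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    million = Decimal(1_000_000)
    return (
        Decimal(input_tokens) * INPUT_USD_PER_MILLION
        + Decimal(output_tokens) * OUTPUT_USD_PER_MILLION
    ) / million


# Example: a request that fills both reported limits.
full_request = estimate_cost_usd(200_000, 128_000)
print(full_request)
```

A request with 200,000 input tokens and 128,000 output tokens works out to 0.24 + 0.512 = 0.752 US dollars at these rates.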

GLM-5-Turbo and GLM-5 laid the groundwork

Z.AI recently shipped GLM-5-Turbo, a text-only model built for the OpenClaw agent framework that improves tool calls, instruction following, time-controlled and persistent tasks, and long task chain execution.

Alongside it, Z.AI introduced ZClawBench, an end-to-end benchmark for agent tasks in the OpenClaw ecosystem. Results show GLM-5-Turbo significantly outperforming its predecessor, GLM-5, and beating Claude Opus 4.6, Gemini 3.1 Pro, MiniMax M2.5, and Kimi K2.5 in several categories. Skill usage in the OpenClaw ecosystem jumped from 26 to 45 percent in a short time, a sign of growing momentum for modular agent systems, Z.AI says.

Before that, Zhipu AI released GLM-5 in mid-February: an open-source model with 744 billion parameters under an MIT license that the company says competes with Claude Opus 4.5 and GPT-5.2 on coding and agent tasks. GLM-5 hit 77.8 percent on SWE-bench Verified, just behind Claude Opus 4.5 at 80.9 percent. The model also runs on Chinese chips from Huawei and others alongside Nvidia GPUs, a major advantage given US export restrictions.

Alibaba is taking a similar approach with Qwen3.5-Omni, an omnimodal model that processes text, images, audio, and video. Like GLM-5V-Turbo, it generates code from visual input but also accepts spoken instructions.
