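Naver's "Seoul World Model" uses real Street View data to keep AI from hallucinating entire cities
South Korean internet giant Naver has built "Seoul World Model," a video world model grounded in real city geometry drawn from more than one million of its own Street View images. It generalizes to other cities without fine-tuning.
Key points
A world model grounded in real-world data
Naver extracted real city geometry from more than one million of its own Street View images and built a video world model on top of it.
Preventing hallucinated cities
This approach reduces the "hallucination" problem, in which AI generates fictional cities, and allows more realistic representations of urban spaces.
Generalization without fine-tuning
The model was trained on Seoul and generalizes to other cities without additional tuning.
Practical applications
Expected applications include autonomous driving simulation, urban planning, virtual tours, and other AI-generated content grounded in the real world.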
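Impact analysis
This technology could improve the realism and reliability of AI-generated content and have a major impact on practical applications such as building simulation environments for autonomous vehicles and creating digital twin cities. It may also establish a new benchmark in geospatial AI.
Editorial comment
A good example of why building AI models on real-world data matters. It has the potential to overcome the limits of conventional generative AI, especially in accurately representing complex spaces such as urban environments.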

This article, "Naver's 'Seoul World Model' uses real Street View data to keep AI from hallucinating entire cities," first appeared on The Decoder.
South Korean internet giant Naver built a video world model grounded in actual city geometry from over a million of its own Street View images. The model generalizes to other cities without any fine-tuning.
Previous video world models produce visually convincing but entirely fictional environments. Everything beyond the starting image—invisible streets, distant buildings—is hallucinated. Researchers from Naver and Naver Cloud are taking a fundamentally different approach: their Seoul World Model (SWM) anchors video generation in the real geometry and appearance of an actual city.
SWM follows real routes through Seoul and generates videos that users can modify with text prompts - adding burning cars or dropping Godzilla between skyscrapers. | Image: Naver
According to the research paper, this is the first world model tied to a real physical location. Naver is often called the "Google of South Korea" and operates the country's dominant search engine along with Naver Map, its own mapping service with street panoramas similar to Google Maps. The model draws directly from this pool.
Users enter geographic coordinates, a desired camera movement, and a text prompt. The model then searches a database of 1.2 million panoramic images from Naver Map, retrieves the nearest Street View images, and uses them as guides for step-by-step video generation.
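The retrieval step described above can be illustrated with a minimal sketch. It assumes each panorama is stored with a latitude and longitude and that "nearest" means smallest great-circle distance. The record layout, the function names, and the brute-force search are illustrative assumptions, not details from the paper; a database of 1.2 million panoramas would more plausibly use a spatial index.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_panoramas(query, panoramas, k=3):
    """Return the k panorama records closest to query = (lat, lon)."""
    lat, lon = query
    return sorted(panoramas, key=lambda p: haversine_m(lat, lon, p["lat"], p["lon"]))[:k]


# Hypothetical records; the real database schema is not described in the article.
panoramas = [
    {"id": "pano_a", "lat": 37.5665, "lon": 126.9780},
    {"id": "pano_b", "lat": 37.5667, "lon": 126.9782},
    {"id": "pano_c", "lat": 37.5700, "lon": 126.9900},
    {"id": "pano_d", "lat": 37.5512, "lon": 126.9882},
]
nearest = nearest_panoramas((37.56655, 126.9780), panoramas, k=2)
nearest_ids = [p["id"] for p in nearest]
```

The retrieved records would then act as the reference images that guide each generation step.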
Real street data creates three distinct challenges
Working with real images introduces problems that don't exist with purely synthetic world models. The biggest one: Street View images are snapshots. Cars and pedestrians captured at the time of shooting have nothing to do with the dynamic scene the model needs to generate. Without a fix, the model would simply copy these random objects from the reference images into the generated video.
With the cross-temporal pairing mechanism (center), the model focuses on buildings and streets. Without it (right), it latches onto cars and pedestrians, incorrectly copying them from the reference images. | Image: Naver
The researchers solve this with "cross-temporal pairing": during training, they deliberately combine reference images and target sequences from different recording times. This teaches the model to distinguish between permanent structures like building facades and transient objects like parked cars. In ablation studies, this mechanism turned out to be the single most effective component.
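As a rough illustration of the pairing idea, the sketch below builds training pairs in which the reference and the target show the same place but come from different capture dates. The field names (`location_id`, `captured_on`, `image_id`) and the sampling strategy are assumptions made for this example; the article does not describe the actual data pipeline.

```python
import random


def cross_temporal_pairs(captures, rng):
    """Pair each target capture with a reference of the same place from a different date.

    captures: list of dicts with 'location_id', 'captured_on', and 'image_id'.
    Returns a list of (reference_image_id, target_image_id) tuples.
    Captures with no differently dated counterpart are skipped.
    """
    by_location = {}
    for capture in captures:
        by_location.setdefault(capture["location_id"], []).append(capture)

    pairs = []
    for target in captures:
        candidates = [
            c for c in by_location[target["location_id"]]
            if c["captured_on"] != target["captured_on"]
        ]
        if candidates:
            reference = rng.choice(candidates)
            pairs.append((reference["image_id"], target["image_id"]))
    return pairs


# Hypothetical metadata for illustration only.
captures = [
    {"location_id": "loc_1", "captured_on": "2019-05-01", "image_id": "img_1a"},
    {"location_id": "loc_1", "captured_on": "2022-09-14", "image_id": "img_1b"},
    {"location_id": "loc_1", "captured_on": "2022-09-14", "image_id": "img_1c"},
    {"location_id": "loc_2", "captured_on": "2021-03-03", "image_id": "img_2a"},
]
pairs = cross_temporal_pairs(captures, random.Random(0))
```

Because the parked cars and pedestrians in the reference never match those in the target, copying them cannot help the model reconstruct the target, while buildings and streets stay informative.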
Moreover, Street View cameras are mounted on vehicles and only capture an image every 5 to 20 meters. That means there are no continuous videos and no images from a pedestrian perspective or from the air. To fill this gap, the researchers generated 12,700 synthetic videos in CARLA, a driving simulator built on Unreal Engine, with camera paths covering pedestrian, vehicle, and free-flight perspectives. They also developed a pipeline that interpolates temporally coherent training videos from the spatially scattered individual images.
Top: real Street View images from Seoul, where reference images and target video intentionally come from different points in time. Bottom: synthetic data from the CARLA simulator with pedestrian and vehicle perspectives. | Image: Naver
Finally, small errors accumulate over long distances because the model generates video section by section. Previous methods use the very first image as a fixed anchor, but that becomes useless once the camera has traveled hundreds of meters.
SWM replaces this static anchor with a "virtual lookahead sink": for each new section, the model retrieves a Street View image slightly further ahead on the route and inserts it as a virtual destination. This gives the model an error-free landmark that moves along with the camera.
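The selection of such a moving anchor can be sketched as follows. The example assumes panoramas are indexed by their distance along the route in meters. The section length, the lookahead distance, and the function name are hypothetical values chosen for illustration, since the article gives no concrete numbers.

```python
import bisect


def lookahead_anchor(pano_positions_m, section_end_m, lookahead_m=10.0):
    """Return the index of the first panorama at or beyond section_end_m + lookahead_m.

    pano_positions_m must be sorted in ascending order (distance along the route).
    Falls back to the last panorama when the route ends before the lookahead point.
    """
    target_m = section_end_m + lookahead_m
    index = bisect.bisect_left(pano_positions_m, target_m)
    return min(index, len(pano_positions_m) - 1)


# Hypothetical panorama positions along one route, in meters.
pano_positions_m = [0.0, 12.0, 25.0, 41.0, 55.0, 70.0]
section_length_m = 15.0  # assumed distance covered per generated section

# One anchor per generated section; a static anchor would be index 0 every time.
anchors = [
    lookahead_anchor(pano_positions_m, (n + 1) * section_length_m)
    for n in range(4)
]
```

With these numbers the anchor index advances with every section, unlike a fixed first-frame anchor that falls further behind the camera as the route gets longer.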
Depth maps and original images work together
The retrieved Street View images feed into the generation process through two complementary paths. First, the model projects a spatially close reference image into the target perspective using its depth information, providing the spatial layout of the scene.
Second, the reference images aren't fed directly into the Transformer as raw pixels. Instead, they're first encoded into latent representations and integrated as semantic references. This lets the model pick up additional appearance details from the environment. According to the researchers, quality drops significantly if either of these two paths is removed.
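The first path, projecting a reference image into the target perspective using depth, can be illustrated with a generic forward-warping sketch. It assumes a simple pinhole camera with intrinsics `K` shared by both views and a known rigid transform `R`, `t` from the reference camera to the target camera. This is a textbook formulation, not Naver's implementation, and panoramic source images would need a different projection model.

```python
import numpy as np


def reproject_with_depth(image, depth, K, R, t):
    """Forward-warp a reference image into a target camera using per-pixel depth.

    image: (H, W, 3) array; depth: (H, W) depth along the reference camera's z-axis.
    K: (3, 3) pinhole intrinsics assumed to be shared by both cameras.
    R, t: rotation (3, 3) and translation (3,) from reference to target camera coordinates.
    Returns the warped image and a boolean mask of target pixels that received a value.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project every pixel to a 3D point in the reference camera frame.
    points_ref = (np.linalg.inv(K) @ pixels.T) * depth.reshape(1, -1)
    # Move the points into the target camera frame and project them.
    points_tgt = R @ points_ref + t.reshape(3, 1)
    z = points_tgt[2]
    proj = K @ points_tgt

    in_front = z > 1e-6
    u_t = np.full(z.shape, -1, dtype=np.int64)
    v_t = np.full(z.shape, -1, dtype=np.int64)
    u_t[in_front] = np.round(proj[0, in_front] / z[in_front]).astype(np.int64)
    v_t[in_front] = np.round(proj[1, in_front] / z[in_front]).astype(np.int64)
    valid = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Where several source pixels land on one target pixel, keep the nearest one.
    src = np.flatnonzero(valid)
    flat_tgt = v_t[src] * W + u_t[src]
    order = np.lexsort((z[src], flat_tgt))  # sort by target pixel, then by depth
    uniq, first = np.unique(flat_tgt[order], return_index=True)
    winners = src[order][first]

    warped_flat = np.zeros((H * W, 3), dtype=image.dtype)
    mask_flat = np.zeros(H * W, dtype=bool)
    warped_flat[uniq] = image.reshape(-1, 3)[winners]
    mask_flat[uniq] = True
    return warped_flat.reshape(H, W, 3), mask_flat.reshape(H, W)
```

Target pixels left empty by the warp (mask value `False`) correspond to regions that the reference image never saw. This is one plausible reason a second, appearance-based path is useful alongside the geometric one.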
SWM is built on Nvidia's Cosmos-Predict2.5-2B, a diffusion transformer with two billion parameters. The researchers trained the model on 24 Nvidia H100 GPUs using 440,000 Seoul Street View images, the synthetic CARLA data, and publicly available Waymo driving data.
Users enter coordinates, camera movement, and text. The model retrieves matching Street View images and feeds them into the video transformer two ways: once as a spatial layout via depth map, and once as original images for fine details. | Image: Naver
SWM generalizes to cities it was never trained on
The researchers tested SWM in Seoul and also in Busan and the U.S. city of Ann Arbor, both completely absent from training. According to the paper, SWM outperforms six current video world models, including Aether, DeepVerse, and HY-World1.5, across visual quality, camera fidelity, temporal consistency, and correspondence with real locations on custom benchmarks with 30 test sequences of roughly 100 meters each.
Existing models increasingly drift over longer distances, producing blurry videos or a complete generation collapse. SWM keeps its output stable over hundreds of meters. Despite the strict spatial anchoring, the model still responds to text prompts: users can change weather, time of day, or add hypothetical scenarios while the underlying city layout stays intact.
Missing video data still limits prediction quality
Because continuous video recordings of entire cities aren't freely available, training relies on interpolated sequences of individual images, which fall short of real video footage in quality. Incorrect timestamps in the metadata also occasionally cause vehicles to appear or vanish abruptly in generated videos.
All Street View data was processed in compliance with privacy regulations, the researchers say, with faces and license plates anonymized before training. They point to urban planning, autonomous driving, and location-based exploration as potential use cases.
World models are currently one of the most actively researched areas in AI. Runway recently unveiled its first "General World Model," GWM-1, which builds an internal representation of an environment and simulates future events in real time. Google DeepMind CEO Demis Hassabis sees such models as a critical step toward general artificial intelligence. And a recent study by Microsoft Research and several U.S. universities also showed that large language models can function as world models, predicting environmental conditions with more than 99 percent accuracy.