The world model will make rapid breakthroughs this year! Autonomous driving may reach its commercialization turning point
“Driven together by the unified architecture, the data ecosystem, and computing power support, the world model will see rapid breakthroughs this year!”
At the 2026 Zhongguancun Forum annual conference special forum “AI Future Forum: Leapfrogging · Investing · Coexistence,” held on March 29, Zhu Jun, founder of Shengshu Technology and deputy director of the Institute for Artificial Intelligence at Tsinghua University, put forward the above viewpoint.
How to build
Meanwhile, the definition of the world model is being stretched and blurred. “It is necessary to further clarify the definition of ‘world model,’” Zhu Jun said. Many current research efforts are incomplete: some interactive video-generation methods, for example, remain essentially limited to rebuilding digital space; they support only one-way interaction between humans and the system and cannot learn and execute actions in real-world environments.
Wu Wei, founder of Manifold Space, divides world models into two categories: one is the world model of the digital world, focused mainly on building more real-time interactive interfaces; the other serves the physical world, acting as a predictive machine brain. “The capabilities behind the two kinds of world model are not the same. In the digital world you need to cater more to creators’ preferences; in the physical world you need to faithfully reproduce real physics and robotic operation.”
Taking autonomous driving and embodied intelligence as examples: autonomous driving collects real-vehicle data to close the data loop, while robots face a data cold start. Wu Wei observed that many companies deploy robots the way autonomous-driving fleets are deployed, teleoperating them in real environments to collect data. Although such data is of very high quality, model performance then grows only in lockstep with the investment in parameters and computing power. “For training world models, pretraining on first-person-perspective data can solve this issue.”
Drawing on his company’s experience, Xu Huazhe, founder of Po-king Robot and an assistant professor at Tsinghua University’s Institute for Interdisciplinary Information Sciences, pointed out that data collected from 100 households does not generalize to 10,000 households. Robot pretraining therefore needs first-person video to achieve genuine generalization. Concretely: first define what the system should and should not do, then iterate backward to improve the whole stack, including hardware and motion control. For example, the hand of the Po-king robot does not reach 21 degrees of freedom, yet it can generalize across 10 kinds of tasks, with hardware upgrades to come later.
Zhu Jun proposed a “unified world model framework,” unifying cross-modal generation and action tasks theoretically. This unification is not an engineering patchwork, but a structural-level unification. From a more macro perspective, whether in the digital world or the physical world, the end result will be composed of intelligent agents in different forms. Intelligent agents in the physical world have “bodies,” while the world model is their core “intelligence hub.”
Building a general-purpose world model can return to first principles of large models: a scalable architecture, large-scale data, and sufficient computing power. Zhu Jun believes that world models should adopt a unified architecture, while current mainstream approaches are often modular and fragmented—some focus on fitting action trajectories, others emphasize prediction, and others directly learn control strategies.
Technical breakthroughs
When discussing the prospects of world-model technology, Zhang Mingxing, an associate professor at Tsinghua University, noted that many world-model routes start from the capabilities of language models and then transfer them to more modalities. But is language sufficient to model the physical world, or is another kind of latent-space language needed? There is still theoretical disagreement on this point. In addition, should capabilities such as “physical remote sensing” or the “first-person perspective” be achieved through data and training, or through the physical space itself? The modalities of physical space, and how to realize them, still await breakthroughs.
More specifically, Wu Wei said two technical breakthroughs should be the focus for world models in 2026: first, real-time manipulation and interaction capabilities; second, post-training of world models. “Especially reinforcement learning and online learning,” Xu Huazhe elaborated: scaling reinforcement learning to 100, 1,000, or 10,000 robots while reaching human-like speed without losing success rate, and enabling embodied systems to learn unfamiliar tasks quickly online even after deployment.
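The fleet-scale reinforcement learning Xu Huazhe describes rests on a fan-in data pattern: many robots generate experience in parallel, and a single learner consumes the pooled data. The toy sketch below illustrates only that data flow; the `collect_episode` rollout, buffer, and batch size are all invented for illustration and do not describe any company's actual system.

```python
import random

random.seed(0)

def collect_episode(robot_id):
    # Hypothetical stand-in for one robot's rollout:
    # a short list of (robot_id, step, reward) tuples.
    return [(robot_id, step, random.random()) for step in range(3)]

# Fan-in: experience from the whole fleet lands in one shared buffer,
# the basic pattern behind scaling RL post-training to many robots.
replay_buffer = []
for robot_id in range(100):  # "100, 1,000, or 10,000 robots"
    replay_buffer.extend(collect_episode(robot_id))

# A single learner samples from the pooled data each update step.
batch = random.sample(replay_buffer, 8)
print(len(replay_buffer), len(batch))  # 300 8
```

The point of the pattern is that data collection parallelizes across the fleet while a single policy update sees all of it, which is what makes "10,000 robots" more than 10,000 independent learners.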
Drawing on its long-term accumulation in large video models, Zhu Jun proposed a clearer technical roadmap: at the bottom layer, a Diffusion Transformer (U-ViT) serves as the unified base architecture; on top of it, pixel-space decoding (corresponding to the Vidu video generation model) serves digital content creation, while action-space decoding serves embodied interaction in the physical world. The same base model can thus support both generation in the digital world and action in the physical world.
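The roadmap above, one shared base feeding a pixel-space head and an action-space head, can be sketched in plain NumPy. Every detail here (the dimensions, the one-step "trunk", the mean pooling) is an illustrative assumption, not the actual U-ViT/Vidu architecture; the sketch only shows how a single backbone can drive two different decoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_backbone(tokens, w):
    # Stand-in for the shared trunk: one nonlinear mixing step.
    return np.tanh(tokens @ w)

# Illustrative dimensions (assumptions, not any real model config).
dim, patch_dim, action_dim = 64, 192, 7
w_trunk = rng.standard_normal((dim, dim)) * 0.1
w_pixel = rng.standard_normal((dim, patch_dim)) * 0.1    # pixel-space head
w_action = rng.standard_normal((dim, action_dim)) * 0.1  # action-space head

tokens = rng.standard_normal((16, dim))  # 16 patch tokens for one frame
h = shared_backbone(tokens, w_trunk)     # shared representation

pixels = h @ w_pixel                 # per-token patch prediction (video generation)
action = h.mean(axis=0) @ w_action   # pooled summary -> one robot action vector
print(pixels.shape, action.shape)    # (16, 192) (7,)
```

The design choice the sketch captures is that both heads read the same representation `h`, so improvements to the shared base can benefit generation and control at once.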
According to the company, Shengshu Technology has verified these capabilities in multi-task scenarios, for example: captcha-solving tasks, where a robotic arm simulates a human operating a mouse to recognize the screen and click precisely; board-game decision tasks, which demand long-horizon planning and multi-step reasoning coordinating perception, prediction, and decision-making; and flexible-object manipulation, where the system stably grasps complex, irregular objects.
Unified architecture brings a new development path. Through experimental observation, Zhu Jun said there are two key phenomena: first, compared with the traditional Vision-Language-Action (VLA) route, data utilization efficiency improves by orders of magnitude; second, multi-task generalization ability is strengthened—under a unified model, efficient generalization can be achieved across more than 50 tasks, with performance not only not dropping but increasing. By contrast, traditional VLA models (such as PI0.5) will see a noticeable decline in performance as the number of tasks increases.
At the implementation level, the two major tracks—autonomous driving and industrial vertical scenarios—will see a commercial and capital-driven turning point in 2026. Bai Zongyi, founding partner of Yaotu Capital, said directly that he is optimistic about new opportunities in the embodied intelligence era—especially the last-mile logistics track. Ivo Muth, vice president for R&D at Audi China, believes that regarding spatial intelligence and world models, the most core future change—beyond improving driving safety—will also be reflected in contextual awareness and passenger riding comfort.
(Editor: Wen Jing)