The world model will make rapid breakthroughs this year! Autonomous driving may reach its commercialization turning point
“Driven together by the unified architecture, the data ecosystem, and computing power support, the world model will see rapid breakthroughs this year!”
At the 2026 Zhongguancun Forum annual conference special forum “AI Future Forum: Leapfrogging · Investing · Coexistence,” held on March 29, Zhu Jun, founder of Shengshu Technology and deputy director of the Institute for Artificial Intelligence at Tsinghua University, put forward the above viewpoint.
How to build
Meanwhile, the definition of the world model is being stretched and blurred. “It is necessary to further clarify the definition of ‘world model,’” Zhu Jun said. Many current research efforts are incomplete: some interactive video-generation methods, for example, remain essentially limited to rebuilding digital space; they support only one-way interaction between humans and the system and cannot learn and execute actions in real-world environments.
Wu Wei, founder of Manifold Space, divides world models into two categories: one is the world model of the digital world, focused mainly on building more real-time interactive interfaces; the other serves the physical world, acting as a predictive machine brain. “The capabilities behind the two kinds of world model are not the same. In the digital world you need to cater more to creators’ preferences; in the physical world you need to faithfully reproduce real physics and robotic operation.”
Taking autonomous driving and embodied intelligence as examples: autonomous driving collects real-vehicle data to close the data loop, while robots face a data cold start. Wu Wei observed that many companies deploy robots the way autonomous-driving fleets are deployed, teleoperating them in real environments to collect data. Although such data is of very high quality, model performance then grows only in lockstep with the investment in parameters and computing power. “For training world models, pretraining on first-person-perspective data can solve this issue.”
Drawing on his company’s experience, Xu Huazhe, founder of Po-king Robot and an assistant professor at Tsinghua University’s Institute for Interdisciplinary Information Sciences, pointed out that data collected from 100 households does not generalize to 10,000 households. Robot pretraining therefore needs first-person video to achieve genuine generalization. Concretely: first define what the system should and should not do, then iterate backward to improve the whole stack, including hardware and motion control. For example, the hand of the Po-king robot does not reach 21 degrees of freedom, yet it can generalize across 10 kinds of tasks, with hardware upgrades to come later.
Zhu Jun proposed a “unified world model framework,” unifying cross-modal generation and action tasks theoretically. This unification is not an engineering patchwork, but a structural-level unification. From a more macro perspective, whether in the digital world or the physical world, the end result will be composed of intelligent agents in different forms. Intelligent agents in the physical world have “bodies,” while the world model is their core “intelligence hub.”
Building a general-purpose world model can return to first principles of large models: a scalable architecture, large-scale data, and sufficient computing power. Zhu Jun believes that world models should adopt a unified architecture, while current mainstream approaches are often modular and fragmented—some focus on fitting action trajectories, others emphasize prediction, and others directly learn control strategies.
Technical breakthroughs
When discussing the prospects of world-model technology, Zhang Mingxing, an associate professor at Tsinghua University, noted that many world-model routes start from the capabilities of language models and then transfer them to more modalities. But is language sufficient to model the physical world, or is another kind of latent-space language needed? There is still theoretical disagreement on this point. In addition, should capabilities such as “physical remote sensing” or the “first-person perspective” be achieved through data and training, or through the physical space itself? The modalities of physical space, and how to realize them, still await breakthroughs.
More specifically, Wu Wei said two technical breakthroughs should be the focus for world models in 2026: first, real-time manipulation and interaction capabilities; second, post-training of world models. “Especially reinforcement learning and online learning,” Xu Huazhe elaborated: scaling reinforcement learning to 100, 1,000, or 10,000 robots while reaching human-like speed without losing success rate, and enabling embodied systems to learn unfamiliar tasks quickly online even after deployment.
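The fleet-scale reinforcement learning Xu Huazhe describes rests on a fan-in data pattern: many robots generate experience in parallel, and a single learner consumes the pooled data. The toy sketch below illustrates only that data flow; the `collect_episode` rollout, buffer, and batch size are all invented for illustration and do not describe any company's actual system.

```python
import random

random.seed(0)

def collect_episode(robot_id):
    # Hypothetical stand-in for one robot's rollout:
    # a short list of (robot_id, step, reward) tuples.
    return [(robot_id, step, random.random()) for step in range(3)]

# Fan-in: experience from the whole fleet lands in one shared buffer,
# the basic pattern behind scaling RL post-training to many robots.
replay_buffer = []
for robot_id in range(100):  # "100, 1,000, or 10,000 robots"
    replay_buffer.extend(collect_episode(robot_id))

# A single learner samples from the pooled data each update step.
batch = random.sample(replay_buffer, 8)
print(len(replay_buffer), len(batch))  # 300 8
```

The point of the pattern is that data collection parallelizes across the fleet while a single policy update sees all of it, which is what makes "10,000 robots" more than 10,000 independent learners.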
Drawing on its long-term accumulation in large video models, Zhu Jun proposed a clearer technical roadmap: at the bottom layer, a Diffusion Transformer (U-ViT) serves as the unified base architecture; on top of it, pixel-space decoding (corresponding to the Vidu video generation model) serves digital content creation, while action-space decoding serves embodied interaction in the physical world. The same base model can thus support both generation in the digital world and action in the physical world.
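The roadmap above, one shared base feeding a pixel-space head and an action-space head, can be sketched in plain NumPy. Every detail here (the dimensions, the one-step "trunk", the mean pooling) is an illustrative assumption, not the actual U-ViT/Vidu architecture; the sketch only shows how a single backbone can drive two different decoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_backbone(tokens, w):
    # Stand-in for the shared trunk: one nonlinear mixing step.
    return np.tanh(tokens @ w)

# Illustrative dimensions (assumptions, not any real model config).
dim, patch_dim, action_dim = 64, 192, 7
w_trunk = rng.standard_normal((dim, dim)) * 0.1
w_pixel = rng.standard_normal((dim, patch_dim)) * 0.1    # pixel-space head
w_action = rng.standard_normal((dim, action_dim)) * 0.1  # action-space head

tokens = rng.standard_normal((16, dim))  # 16 patch tokens for one frame
h = shared_backbone(tokens, w_trunk)     # shared representation

pixels = h @ w_pixel                 # per-token patch prediction (video generation)
action = h.mean(axis=0) @ w_action   # pooled summary -> one robot action vector
print(pixels.shape, action.shape)    # (16, 192) (7,)
```

The design choice the sketch captures is that both heads read the same representation `h`, so improvements to the shared base can benefit generation and control at once.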
According to the company, Shengshu Technology has verified these capabilities in multi-task scenarios, for example: captcha-solving tasks, where a robotic arm simulates a human operating a mouse to recognize the screen and click precisely; board-game decision tasks, which demand long-horizon planning and multi-step reasoning coordinating perception, prediction, and decision-making; and flexible-object manipulation, where the system stably grasps complex, irregular objects.
Unified architecture brings a new development path. Through experimental observation, Zhu Jun said there are two key phenomena: first, compared with the traditional Vision-Language-Action (VLA) route, data utilization efficiency improves by orders of magnitude; second, multi-task generalization ability is strengthened—under a unified model, efficient generalization can be achieved across more than 50 tasks, with performance not only not dropping but increasing. By contrast, traditional VLA models (such as PI0.5) will see a noticeable decline in performance as the number of tasks increases.
At the implementation level, the two major tracks—autonomous driving and industrial vertical scenarios—will see a commercial and capital-driven turning point in 2026. Bai Zongyi, founding partner of Yaotu Capital, said directly that he is optimistic about new opportunities in the embodied intelligence era—especially the last-mile logistics track. Ivo Muth, vice president for R&D at Audi China, believes that regarding spatial intelligence and world models, the most core future change—beyond improving driving safety—will also be reflected in contextual awareness and passenger riding comfort.
(Editor: Wen Jing)