WonderTurbo: Generating Interactive 3D World in 0.72 Seconds
- Chaojun Ni*1,2
- Xiaofeng Wang*1,3
- Zheng Zhu*1✉
- Weijie Wang*1,4
- Li Haoyun1,3
- Guosheng Zhao1,3
- Jie Li1
- Wenkang Qin1
- Guan Huang1
- Wenjun Mei2✉
- 1 GigaAI
- 2 Peking University
- 3 Institute of Automation, Chinese Academy of Sciences
- 4 Zhejiang University

Beginning with a single image, users can freely adjust the viewpoint and interactively control the generation of a 3D scene, each interaction requiring only 0.72 seconds.

The pipeline of WonderTurbo. As the user moves the real-time rendering camera and enters a text prompt, the rendered image and depth map are processed by FastPaint and QuickDepth to generate coherent appearance and geometry. Finally, StepSplat performs incremental fusion based on the outputs of FastPaint and QuickDepth.
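The data flow in this caption can be sketched as a single function per user interaction. This is only an illustrative outline, not the authors' implementation: the `Scene` class, the `interaction_step` function, all call signatures, and the stub modules are hypothetical, and the choice to run QuickDepth on the inpainted image (rather than in parallel with FastPaint) is an assumption the caption does not confirm.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Scene:
    """Hypothetical stand-in for the incrementally built 3D representation."""
    fused_views: List[dict] = field(default_factory=list)


def interaction_step(
    scene: Scene,
    camera_pose: Any,
    prompt: str,
    render: Callable[[Scene, Any], Tuple[Any, Any]],
    fast_paint: Callable[[Any, str], Any],
    quick_depth: Callable[[Any, Any], Any],
    step_splat: Callable[[Scene, Any, Any, Any], Scene],
) -> Scene:
    """Run one user interaction through the four stages named in the caption."""
    # 1. Render the current scene from the user's camera: partial image and depth.
    rendered_rgb, rendered_depth = render(scene, camera_pose)
    # 2. Appearance: inpaint the unseen regions, guided by the text prompt.
    completed_rgb = fast_paint(rendered_rgb, prompt)
    # 3. Geometry: complete the depth map (assumed here to see the inpainted image).
    completed_depth = quick_depth(completed_rgb, rendered_depth)
    # 4. Fusion: incrementally merge the new view into the scene.
    return step_splat(scene, completed_rgb, completed_depth, camera_pose)


# Trivial stubs so the sketch runs end to end; the real modules would be networks.
def stub_render(scene, pose):
    return f"rgb@{pose}", f"depth@{pose}"


def stub_fast_paint(rgb, prompt):
    return f"inpaint({rgb}, '{prompt}')"


def stub_quick_depth(rgb, depth):
    return f"complete({depth})"


def stub_step_splat(scene, rgb, depth, pose):
    scene.fused_views.append({"pose": pose, "rgb": rgb, "depth": depth})
    return scene


scene = Scene()
for pose, prompt in [("front", "a beach"), ("left", "palm trees")]:
    scene = interaction_step(scene, pose, prompt, stub_render,
                             stub_fast_paint, stub_quick_depth, stub_step_splat)
print(len(scene.fused_views))
```

Passing the four stages in as callables keeps the sketch independent of any particular model; each stub could be swapped for a real renderer, inpainting model, depth completion model, or fusion routine without changing the loop.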
Abstract
Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a two-step diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15× speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.
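The timing figures in the abstract can be related with some simple arithmetic. The sketch below rests on two assumptions that the abstract does not state: that the 0.26-second StepSplat update is counted inside the 0.72-second total, and that the 15× speedup refers to per-interaction latency. Under those assumptions, roughly 0.46 seconds remain for the other stages, and the implied baseline latency is about 10.8 seconds per interaction. The variable names are illustrative only.

```python
# Figures quoted in the abstract.
TOTAL_PER_INTERACTION_S = 0.72   # end-to-end time per interaction
STEPSPLAT_UPDATE_S = 0.26        # one StepSplat dynamic update
SPEEDUP = 15                     # reported speedup over baseline methods

# Assumption: the StepSplat update is counted inside the 0.72 s total.
remaining_s = round(TOTAL_PER_INTERACTION_S - STEPSPLAT_UPDATE_S, 2)

# Assumption: the speedup refers to per-interaction latency.
implied_baseline_s = round(TOTAL_PER_INTERACTION_S * SPEEDUP, 2)

print(f"left for the other stages: {remaining_s} s")
print(f"implied baseline latency: {implied_baseline_s} s")
```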