Highlights
Demo Video
Spatial tracing is pivotal for embodied robots to translate spatially constrained instructions (e.g., "Water flowers from left to right with the watering can hovering 1-5 cm above each flower") into 3D positional sequences (i.e., spatial traces) in complex 3D scenes. This task demands (a) 3D spatial referring to resolve spatial relations and locate the relevant objects involved in the trace, and (b) 3D spatial measuring to understand absolute, real-world metric quantities related to the trace. For example, (a) the 3D positions of the watering can and each flower pot are localized from left to right, and (b) their corresponding heights in meters are measured. By performing multi-step, metric-grounded reasoning over this key information, the model generates a spatial trace that supports not only (c) multi-step manipulation but also (d) collision-free motion, thereby (e) enabling efficient control of diverse robots (e.g., the G1 humanoid) across tasks in cluttered scenes.
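To make this concrete, below is a minimal, hypothetical sketch of a spatial trace for the watering instruction above, represented as an ordered list of 3D waypoints in meters. This is not RoboTracer's actual output format, and all coordinate values are illustrative placeholders.

# Minimal hypothetical sketch: a spatial trace as an ordered list of 3D waypoints (meters).
from dataclasses import dataclass
from typing import List

@dataclass
class Waypoint:
    x: float  # meters, left-right in the camera/world frame
    y: float  # meters, forward-backward
    z: float  # meters, height

# Illustrative flower-pot top positions, ordered from left to right (placeholder values).
flower_tops = [Waypoint(0.20, 0.50, 0.31), Waypoint(0.45, 0.52, 0.28), Waypoint(0.70, 0.49, 0.33)]
hover_offset = 0.03  # 3 cm, within the instructed 1-5 cm hovering range

# The resulting trace: one waypoint hovering above each flower, visited left to right.
spatial_trace: List[Waypoint] = [Waypoint(p.x, p.y, p.z + hover_offset) for p in flower_tops]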
This demo visualizes the performance of our model and the baseline models on TraceSpatial-Bench. Yellow masks mark the target objects, and pink 3D boxes mark the correct end regions. Despite similar 2D projections (left view of each case), our model yields more accurate spatial traces than strong general VLMs, which often produce floating or colliding traces due to inaccurate depth estimation. Leveraging richer geometric cues further improves performance.
Visualization of executing spatial-trace-guided embodied tasks in the RoboTwin simulation environment. The demo shows how robotic arms follow 3D spatial traces to successfully complete a diverse set of manipulation tasks, demonstrating RoboTracer's strong spatial reasoning ability and effective support for embodied task execution.
Spatial Tracing in Cluttered Scenes via RoboTracer
The demos below show that RoboTracer can handle challenging long-horizon spatial tracing tasks requiring complex, multi-step, metric-grounded reasoning in cluttered and dynamic environments by integrating with various control policies across diverse robots.
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging: it requires multi-step, metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement, and existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that is the first to achieve both 3D spatial referring and measuring, via a universal spatial encoder and a regression-supervised decoder that enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, which supervise key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs spanning outdoor, indoor, and tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
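As a rough illustration of what a metric-sensitive process reward could look like, the sketch below scores the intermediate perceptual cues (referred 3D positions and measured metric quantities) rather than only the final trace. The function name, exponential shaping, scales, and equal weighting are assumptions made for illustration, not the paper's exact reward design.

import numpy as np

def metric_process_reward(pred_points, gt_points, pred_metrics, gt_metrics,
                          point_scale=0.10, metric_scale=0.05):
    # Reward for 3D spatial referring: decays with the Euclidean error (meters)
    # between predicted and ground-truth 3D positions of the key objects.
    point_err = np.linalg.norm(np.asarray(pred_points) - np.asarray(gt_points), axis=-1)
    r_refer = float(np.mean(np.exp(-point_err / point_scale)))
    # Reward for 3D spatial measuring: decays with the absolute error (meters)
    # between predicted and ground-truth metric quantities (e.g., heights, clearances).
    metric_err = np.abs(np.asarray(pred_metrics) - np.asarray(gt_metrics))
    r_measure = float(np.mean(np.exp(-metric_err / metric_scale)))
    # Equal weighting of the two process terms (an assumption in this sketch).
    return 0.5 * r_refer + 0.5 * r_measure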
Overview of RoboTracer. RoboTracer processes RGB images and task instructions, and can flexibly incorporate additional geometric inputs (e.g., absolute depth, camera intrinsics) when available to improve spatial precision, enabled by its universal spatial encoder. A scale decoder additionally outputs a metric scale factor supervised by a regression loss beyond next-token prediction, bolstering real-world scale awareness. After SFT, metric-sensitive reward functions in RFT further supervise the key perceptual objects involved in the trace and provide crucial intermediate evidence (e.g., 3D spatial referring and measuring) for accurate spatial trace generation.
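The following sketch illustrates, under assumed names and shapes, how a scale decoder head could regress a metric scale factor from a pooled VLM hidden state and how that regression term could be added to the standard next-token objective during SFT; it is not the released RoboTracer implementation.

import torch
import torch.nn as nn

class ScaleDecoder(nn.Module):
    # Assumed hidden size; the real model's dimensions are not specified here.
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) pooled hidden state from the VLM backbone.
        return self.head(h).squeeze(-1)  # predicted metric scale factor per sample

def sft_objective(lm_loss: torch.Tensor, pred_scale: torch.Tensor,
                  gt_scale: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # Total SFT loss: next-token prediction plus a scale regression term (L1 assumed).
    return lm_loss + lam * nn.functional.l1_loss(pred_scale, gt_scale)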
TraceSpatial is a comprehensive dataset comprising 4.5M data samples (~30M QA pairs) from 2D, 3D, and video sources, spanning outdoor, indoor, and tabletop scenes and containing complex reasoning processes (up to 9 steps). TraceSpatial's key features are: (1) Fine-Grained Annotations, (2) Multi-Dimensionality, (3) High Quality, (4) Large Scale, (5) Rich Diversity, (6) Easy Scalability.
These demos show that RoboRefer can be integrated into the system as a useful tool for spatial referring, predicting locations and placements with spatial relations, which is crucial for robots in both simulated and real-world scenes.
@article{zhou2025roborefer,
title={RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics},
author={Zhou, Enshen and An, Jingkun and Chi, Cheng and Han, Yi and Rong, Shanyu and Zhang, Chi and Wang, Pengwei and Wang, Zhongyuan and Huang, Tiejun and Sheng, Lu and others},
journal={arXiv preprint arXiv:2506.04308},
year={2025}
}