
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

1Beihang University; 2Peking University; 3Beijing Academy of Artificial Intelligence; 4CASIA;
Equal Contribution   † Project Leader   ✉ Equal Advising  

From what you say to where it moves — with RoboTracer


Highlights

  • RoboTracer is the first 3D-aware VLM for multi-step metric-grounded spatial tracing with explicit reasoning.

  • RoboTracer first acquires both 3D spatial referring and measuring capabilities via SFT, and then advances to multi-step metric-grounded spatial tracing via RFT.

  • To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes, and containing complex reasoning processes (up to 9 steps).

  • SFT-trained RoboTracer achieves SOTA performance in spatial understanding, measuring, and referring, and RFT-trained RoboTracer exhibits strong spatial tracing in novel cluttered and dynamic scenes that require complex reasoning.


Demo Video


Motivation

Spatial tracing is pivotal for embodied robots: it translates spatially constrained instructions (e.g., "Water flowers from left to right with the watering can hovering 1-5 cm above each flower") into 3D positional sequences (i.e., spatial traces) in complex 3D scenes. This task demands (a) 3D spatial referring, to resolve spatial relations and locate the objects involved in the trace, and (b) 3D spatial measuring, to understand absolute, real-world metric quantities related to the trace. For example, (a) the 3D positions of the watering can and each flower pot are localized from left to right, and (b) their corresponding heights in meters are measured. By performing multi-step, metric-grounded reasoning over this key information, the generated spatial trace can support not only (c) multi-step manipulation but also (d) collision-free motion, thereby (e) enabling efficient control of diverse robots (e.g., the G1 humanoid) across tasks in cluttered scenes.

TraceSpatial-Bench Results

This demo visualizes the performance of our model and the baseline models on TraceSpatial-Bench. Yellow masks mark the target objects, and pink 3D boxes mark the correct end regions. Despite similar 2D projections, our model yields more accurate spatial traces than strong general VLMs, which often produce floating or colliding traces due to inaccurate depth estimation. Leveraging richer geometric cues further improves performance.

RoboTwin 2.0 Execution Demo

The demo shows how robotic arms follow 3D spatial traces to successfully complete a diverse set of manipulation tasks, demonstrating RoboTracer's strong spatial reasoning ability and effective support for embodied task execution.

Spatial Tracing in Cluttered Scenes via RoboTracer

More Real-world Demos

The demos below show that RoboTracer can handle challenging long-horizon spatial tracing tasks that require complex multi-step metric-grounded reasoning in cluttered and dynamic environments, by integrating various control policies across diverse robots.




Abstract

Spatial tracing, a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first acquires both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder that enhances scale awareness during supervised fine-tuning (SFT). RoboTracer then advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, which supervise key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
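To make the "regression-supervised decoder" idea concrete, the following is a minimal sketch (not the released implementation) of an SFT objective that pairs the usual next-token cross-entropy with a regression term on a predicted metric scale factor. All names (e.g., `sft_loss`, `pred_scale`, `scale_weight`) and the log-space regression choice are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def sft_loss(lm_logits, target_token_ids, pred_scale, gt_scale, scale_weight=1.0):
    """Next-token prediction loss plus a metric-scale regression term (illustrative).

    lm_logits:        (B, T, V) language-model logits
    target_token_ids: (B, T) ground-truth token ids (padding marked with -100)
    pred_scale:       (B,) positive metric scale factor from a scale decoder head
    gt_scale:         (B,) ground-truth scale (e.g., metres per normalized unit)
    """
    # Standard autoregressive cross-entropy over the answer tokens.
    ce = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        target_token_ids.reshape(-1),
        ignore_index=-100,
    )
    # Regression supervision on the scale factor, beyond next-token prediction.
    # Regressing in log space keeps the term well-behaved across scene scales
    # (an assumed design choice; scales are assumed strictly positive).
    scale_reg = F.smooth_l1_loss(torch.log(pred_scale), torch.log(gt_scale))
    return ce + scale_weight * scale_reg
```

The weighting between the two terms and the exact regression target are design choices not specified here; the point is only that the scale decoder receives a direct numeric loss rather than being supervised solely through text tokens.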

Method Overview

Overview of RoboTracer. RoboTracer processes RGB images and task instructions, while flexibly integrating various geometric configurations (e.g., absolute depth, camera intrinsics) when available to improve spatial precision, enabled by the integrated universal spatial encoder. It also includes a scale decoder that outputs a metric scale factor, supervised by a regression loss beyond next-token prediction to bolster real-world scale awareness. After SFT, metric-sensitive reward functions in RFT further supervise the key perceptual objects involved in the trace and provide crucial intermediate evidence (e.g., 3D spatial referring and measuring) for accurate spatial trace generation.
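As a rough illustration of what a metric-sensitive process reward could look like, the sketch below scores the intermediate perceptual cues (referred 3D positions and measured metric quantities) rather than only the final trace. The function name, tolerances, and equal weighting are assumptions, not RoboTracer's exact reward definition.

```python
import numpy as np

def process_reward(pred_points, gt_points, pred_measures, gt_measures,
                   pos_tol=0.05, meas_rel_tol=0.10):
    """Reward in [0, 1] from intermediate 3D referring / measuring steps (illustrative).

    pred_points, gt_points:     (N, 3) key object positions in metres
    pred_measures, gt_measures: (M,) metric quantities (heights, clearances) in metres
    """
    pred_points, gt_points = np.asarray(pred_points), np.asarray(gt_points)
    pred_measures, gt_measures = np.asarray(pred_measures), np.asarray(gt_measures)

    # Referring cue: fraction of key objects localized within a Euclidean tolerance.
    ref_hits = np.linalg.norm(pred_points - gt_points, axis=-1) <= pos_tol
    r_refer = ref_hits.mean()

    # Measuring cue: fraction of metric quantities within a relative-error tolerance.
    rel_err = np.abs(pred_measures - gt_measures) / np.maximum(np.abs(gt_measures), 1e-6)
    r_measure = (rel_err <= meas_rel_tol).mean()

    # Equal weighting of the two intermediate cues (a simplification).
    return 0.5 * r_refer + 0.5 * r_measure
```

In an RFT loop such a term would typically be combined with an outcome reward on the final spatial trace; the split between process and outcome rewards is left unspecified here.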

Dataset Overview

TraceSpatial is a comprehensive dataset of 4.5M data samples (~30M QA pairs) drawn from 2D/3D/video sources, spanning outdoor/indoor/tabletop scenes and containing complex reasoning processes (up to 9 steps). Its key features are: (1) Rich Diversity, (2) Multi-Dimensionality, (3) Large Scale, (4) Fine-Grained Annotations, (5) High Quality, and (6) Easy Scalability.

BibTeX

@article{zhou2025robotracer,
    title={RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics},
    author={Zhou, Enshen and Chi, Cheng and Li, Yibo and An, Jingkun and Zhang, Jiayuan and Rong, Shanyu and Han, Yi and Ji, Yuheng and Liu, Mengzhen and Wang, Pengwei and others},
    journal={arXiv preprint arXiv:2512.13660},
    year={2025}
}