Highlights
Demo Video
Spatial tracing is pivotal for embodied robots to translate spatially constrained instructions (e.g., "Water flowers from left to right with the watering can hovering 1-5 cm above each flower") into 3D positional sequences (i.e., spatial traces) in complex 3D scenes. This task demands (a) 3D spatial referring to resolve spatial relations and locate the relevant objects involved in the trace, and (b) 3D spatial measuring to understand absolute, real-world metric quantities related to the trace. For example, (a) the 3D positions of the watering can and each flower pot are localized from left to right, and (b) their corresponding heights in meters are measured. By performing multi-step, metric-grounded reasoning over this key information, the model generates a spatial trace that supports not only (c) multi-step manipulation but also (d) collision-free motion, thereby (e) enabling efficient control of diverse robots (e.g., the G1 humanoid) across tasks in cluttered scenes.
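To make this concrete, below is a minimal, hypothetical sketch of a spatial trace for the watering instruction above, represented as an ordered list of 3D waypoints in meters. This is not RoboTracer's actual output format, and all coordinate values are illustrative placeholders.

# Minimal hypothetical sketch: a spatial trace as an ordered list of 3D waypoints (meters).
from dataclasses import dataclass
from typing import List

@dataclass
class Waypoint:
    x: float  # meters, left-right in the camera/world frame
    y: float  # meters, forward-backward
    z: float  # meters, height

# Illustrative flower-pot top positions, ordered from left to right (placeholder values).
flower_tops = [Waypoint(0.20, 0.50, 0.31), Waypoint(0.45, 0.52, 0.28), Waypoint(0.70, 0.49, 0.33)]
hover_offset = 0.03  # 3 cm, within the instructed 1-5 cm hovering range

# The resulting trace: one waypoint hovering above each flower, visited left to right.
spatial_trace: List[Waypoint] = [Waypoint(p.x, p.y, p.z + hover_offset) for p in flower_tops]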
This demo visualizes the performance of our model and the baseline models on TraceSpatial-Bench. Yellow masks mark the target objects, and pink 3D boxes mark the correct end regions. Despite similar 2D projections (left view of each case), our model yields more accurate spatial traces than strong general VLMs, which often produce floating or colliding traces due to inaccurate depth estimation. Leveraging richer geometric cues further improves performance.
Visualization of executing spatial-trace-guided embodied tasks in the RoboTwin simulation environment. The demo shows how robotic arms follow 3D spatial traces to successfully complete a diverse set of manipulation tasks, demonstrating RoboTracer's strong spatial reasoning ability and effective support for embodied task execution.
Spatial Tracing in Cluttered Scenes via RoboTracer
The demos below show that RoboTracer can handle challenging long-horizon spatial tracing tasks requiring complex, multi-step, metric-grounded reasoning in cluttered and dynamic environments by integrating with various control policies across diverse robots.
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging: it requires multi-step, metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement, and existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that is the first to achieve both 3D spatial referring and measuring, via a universal spatial encoder and a regression-supervised decoder that enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, which supervise key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs spanning outdoor, indoor, and tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
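As a rough illustration of what a metric-sensitive process reward could look like, the sketch below scores the intermediate perceptual cues (referred 3D positions and measured metric quantities) rather than only the final trace. The function name, exponential shaping, scales, and equal weighting are assumptions made for illustration, not the paper's exact reward design.

import numpy as np

def metric_process_reward(pred_points, gt_points, pred_metrics, gt_metrics,
                          point_scale=0.10, metric_scale=0.05):
    # Reward for 3D spatial referring: decays with the Euclidean error (meters)
    # between predicted and ground-truth 3D positions of the key objects.
    point_err = np.linalg.norm(np.asarray(pred_points) - np.asarray(gt_points), axis=-1)
    r_refer = float(np.mean(np.exp(-point_err / point_scale)))
    # Reward for 3D spatial measuring: decays with the absolute error (meters)
    # between predicted and ground-truth metric quantities (e.g., heights, clearances).
    metric_err = np.abs(np.asarray(pred_metrics) - np.asarray(gt_metrics))
    r_measure = float(np.mean(np.exp(-metric_err / metric_scale)))
    # Equal weighting of the two process terms (an assumption in this sketch).
    return 0.5 * r_refer + 0.5 * r_measure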
Overview of RoboTracer. RoboTracer processes RGB images and task instructions, and can flexibly incorporate additional geometric inputs (e.g., absolute depth, camera intrinsics) when available to improve spatial precision, enabled by its universal spatial encoder. A scale decoder additionally outputs a metric scale factor supervised by a regression loss beyond next-token prediction, bolstering real-world scale awareness. After SFT, metric-sensitive reward functions in RFT further supervise the key perceptual objects involved in the trace and provide crucial intermediate evidence (e.g., 3D spatial referring and measuring) for accurate spatial trace generation.
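The following sketch illustrates, under assumed names and shapes, how a scale decoder head could regress a metric scale factor from a pooled VLM hidden state and how that regression term could be added to the standard next-token objective during SFT; it is not the released RoboTracer implementation.

import torch
import torch.nn as nn

class ScaleDecoder(nn.Module):
    # Assumed hidden size; the real model's dimensions are not specified here.
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) pooled hidden state from the VLM backbone.
        return self.head(h).squeeze(-1)  # predicted metric scale factor per sample

def sft_objective(lm_loss: torch.Tensor, pred_scale: torch.Tensor,
                  gt_scale: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # Total SFT loss: next-token prediction plus a scale regression term (L1 assumed).
    return lm_loss + lam * nn.functional.l1_loss(pred_scale, gt_scale)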
TraceSpatial is a comprehensive dataset comprising 4.5M data samples (~30M QA pairs) from 2D, 3D, and video sources, spanning outdoor, indoor, and tabletop scenes and containing complex reasoning processes (up to 9 steps). TraceSpatial's key features are: (1) Fine-Grained Annotations, (2) Multi-Dimensionality, (3) High Quality, (4) Large Scale, (5) Rich Diversity, (6) Easy Scalability.
These demos show that RoboRefer can be integrated into the system as a useful tool for spatial referring, predicting locations and placements with spatial relations, which is crucial for robots in both simulated and real-world scenes.
@article{zhou2025roborefer,
title={RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics},
author={Zhou, Enshen and An, Jingkun and Chi, Cheng and Han, Yi and Rong, Shanyu and Zhang, Chi and Wang, Pengwei and Wang, Zhongyuan and Huang, Tiejun and Sheng, Lu and others},
journal={arXiv preprint arXiv:2506.04308},
year={2025}
}