RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

1Beihang University; 2Peking University; 3Beijing Academy of Artificial Intelligence;
Equal Contribution   † Project Leader   ✉ Equal Advising  

From words to exactly where you mean — with RoboRefer


Highlights

  • RoboRefer is the first 3D-aware VLM for multi-step spatial referring with explicit reasoning.

  • RoboRefer first acquires precise spatial understanding via SFT, and then develops strong, generalizable reasoning via RFT.

  • To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and containing complex reasoning processes (up to 5 steps).

  • SFT-trained RoboRefer achieves SOTA spatial understanding, and RFT-trained RoboRefer exhibits generalizable spatial referring under novel spatial relation combinations.

Motivation

Spatial referring in complex 3D environments demands not only precise single-step spatial understanding but also multi-step spatial reasoning to resolve intricate references step by step, thereby enabling efficient control of diverse robots (e.g., UR5 arm, G1 humanoid) across tasks (e.g., manipulation, navigation).

Spatial Referring for Location via RoboRefer

Spatial Referring for Placement via RoboRefer

Real-world Demos

These demos show that RoboRefer can handle challenging long-horizon tasks that require complex multi-step spatial referring in cluttered, dynamic environments, by integrating with various control policies on diverse robots (e.g., UR5, G1 humanoid).








Abstract

Spatial referring is a fundamental capability for embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still fall short of accurately understanding complex 3D scenes and dynamically reasoning about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 12.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.
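The metric-sensitive process reward mentioned above is not reproduced here; as a purely illustrative sketch of the idea (the function name, tolerance value, and exponential shape are our assumptions, not the paper's formulation), a distance-based point reward in Python could look like:

import math

def point_reward(pred_xy, gt_xy, tol=0.05):
    # Hypothetical metric-sensitive reward for a predicted 2D point.
    # pred_xy, gt_xy: normalized (x, y) image coordinates in [0, 1].
    # tol: assumed distance scale, not a value from the paper.
    # The reward decays smoothly with Euclidean error, so nearer predictions
    # earn proportionally more than a binary hit/miss signal would give.
    err = math.dist(pred_xy, gt_xy)
    return math.exp(-err / tol)

# Example: a prediction ~0.028 (normalized units) from the target earns a reward of ~0.57.
print(point_reward((0.42, 0.61), (0.44, 0.63)))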

Method Overview

Overview of RoboRefer. RoboRefer can perform single-step, precise spatial understanding from RGB(D) inputs with spatially constrained instructions (enabled by the SFT stage, which introduces the depth modality), and multi-step spatial referring with explicit reasoning (powered by the RFT stage, leveraging within each step the spatial understanding learned during SFT).
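As a rough, non-authoritative illustration of the two-pathway design described above (the encoders, token dimensions, and fusion projector below are simplified stand-ins assumed for exposition; only the idea of a dedicated depth encoder alongside the RGB pathway comes from the overview), a minimal PyTorch sketch:

import torch
import torch.nn as nn

class SpatialVLMSketch(nn.Module):
    # Toy front end of an RGB(D) VLM with a disentangled depth encoder.
    # Both encoders and the projector are placeholders; RoboRefer's real
    # pretrained backbones and language model are not reproduced here.
    def __init__(self, dim=256):
        super().__init__()
        self.rgb_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)    # placeholder RGB patch encoder
        self.depth_encoder = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # separate depth pathway
        self.projector = nn.Linear(2 * dim, dim)                           # fuse per-patch RGB and depth tokens

    def forward(self, rgb, depth):
        # rgb: (B, 3, H, W); depth: (B, 1, H, W), aligned with rgb
        rgb_tok = self.rgb_encoder(rgb).flatten(2).transpose(1, 2)      # (B, N, dim)
        dep_tok = self.depth_encoder(depth).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.projector(torch.cat([rgb_tok, dep_tok], dim=-1))    # (B, N, dim) tokens for the LLM

tokens = SpatialVLMSketch()(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 256])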

Dataset Overview

RefSpatial is a comprehensive dataset comprising 2.5M samples drawn from 2D, 3D, and simulated sources, covering 31 spatial relations. Its key features are: (1) Fine-Grained Annotations, (2) Multi-Dimensionality, (3) High Quality, (4) Large Scale, (5) Rich Diversity, (6) Easy Scalability.
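To make the data format concrete, a purely hypothetical QA pair with step-by-step reasoning is sketched below; the field names, wording, and coordinates are invented for illustration and are not samples from RefSpatial:

# Hypothetical spatial-referring QA pair with explicit reasoning steps.
# All field names, text, and coordinates are illustrative, not from RefSpatial.
qa_example = {
    "image": "tabletop_scene.png",
    "question": "Point to the free area to the left of the mug that is closest to the bowl.",
    "reasoning": [
        "Step 1: locate the bowl among the candidate objects.",
        "Step 2: find the mug closest to that bowl.",
        "Step 3: identify the vacant region to the left of that mug.",
    ],
    "answer": {"point": [0.37, 0.62]},  # normalized (x, y) target location
}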

Open6DOR V2 Benchmark Demos in LIBERO

These demos show that RoboRefer can be integrated into a robotic system as a spatial referring tool that predicts locations and placements constrained by spatial relations, which is crucial for robots in both simulated and real-world scenes.


Visualization of RefSpatial Dataset

We show more visualizations of the RefSpatial dataset, which covers 31 different spatial relations to enhance spatial understanding in the SFT stage and advance multi-step spatial referring with reasoning in the RFT stage.

RoboRefer's Prediction Visualization for Cluttered Tabletops

We show more visualizations of RoboRefer's predictions on cluttered tabletops to demonstrate its strong generalizability to unseen scenarios, tasks with novel spatial relation combinations, and novel objects.

BibTeX

@article{zhou2025roborefer,
    title={RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics},
    author={Zhou, Enshen and An, Jingkun and Chi, Cheng and Han, Yi and Rong, Shanyu and Zhang, Chi and Wang, Pengwei and Wang, Zhongyuan and Huang, Tiejun and Sheng, Lu and others},
    journal={arXiv preprint arXiv:2506.04308},
    year={2025}
}