Code-as-Monitor | Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

1 Beihang University; 2 Peking University; 3 Beijing Academy of Artificial Intelligence; 4 GalBot
Equal Contribution   ✉ Equal Advising  

Highlights

  • Code-as-Monitor is the first framework to integrate both reactive and proactive failure detection.

  • Code-as-Monitor leverages the proposed constraint elements to simplify real-time failure detection with high precision.

  • Code-as-Monitor achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, and exhibits strong generalizability on unseen scenarios, tasks, and objects.

Summary Video

Motivation

For the task "Move the pan with lobster to the stove without losing the lobster", (a) reactive failure detection identifies failures after they occur, and (b) proactive failure detection prevents foreseeable failures. In (a), the robot detects the failure after the lobster unexpectedly jumps out due to the heat. In (b), tilting of the pan is detected and corrected, which requires real-time precision. (c) shows that our method, combined with an open-loop policy, forms a closed-loop system, enabling proactive (e.g., detecting a moving glass during grasping) and reactive (e.g., removing a toy after grasping) failure detection in cluttered scenes.

Real-world Demos

These demos show that Code-as-Monitor can transform an open-loop policy into a closed-loop system by integrating reactive and proactive failure detection for long-horizon tasks in cluttered environments with disturbances.



Clear all objects on table except for animals. (2X speed)


Grasp the animals according to their distances to fruits, from nearest to farthest. (1X speed)

Abstract

Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.

Method Overview

Overview of Code-as-Monitor. The framework unifies reactive and proactive failure detection through constraints, abstracts the relevant entities and parts into more general constraint elements, and ensures precise, real-time monitoring via code evaluation.
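To make the idea of monitoring by code evaluation concrete, here is a minimal sketch of what a monitor for the "keep the pan level" constraint from the motivating example might look like. It is not the code the VLM actually generates: the function names, the 10-degree threshold, and the assumption that a surface element is available as an upward-oriented normal vector in a z-up world frame are all hypothetical.

```python
import math

def tilt_angle_deg(normal):
    """Angle in degrees between a surface normal and the world up axis (0, 0, 1).

    Assumes (hypothetically) that the normal is oriented upward and given
    in a z-up world frame.
    """
    nx, ny, nz = normal
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm == 0:
        raise ValueError("normal must be non-zero")
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, nz / norm))
    return math.degrees(math.acos(cos_theta))

def pan_is_level(normal, max_tilt_deg=10.0):
    """Return True while the constraint holds, False once it is violated."""
    return tilt_angle_deg(normal) <= max_tilt_deg

def first_violation(normals, max_tilt_deg=10.0):
    """Scan a sequence of per-frame normals and return the index of the
    first frame that violates the constraint, or None if none does."""
    for t, normal in enumerate(normals):
        if not pan_is_level(normal, max_tilt_deg):
            return t
    return None
```

Because a check like this is plain arithmetic on tracked geometry, it could run on every frame without querying the VLM again, which is the property the overview attributes to code evaluation.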

Constraint Elements

Constraint Element Pipeline. Given a constraint, our model ConSeg generates instance-level and part-level masks across multiple views, which are projected into 3D space. Through a series of heuristics, the desired elements are produced. Once all elements are obtained, they are annotated onto the original multi-view images.
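The projection step of this pipeline can be illustrated with a small sketch. It back-projects the pixels of one mask into 3D with a pinhole camera model and reduces them to a single point element. The centroid is only a stand-in for the unspecified heuristics, and the function names, the single-view setup, and the intrinsics parameters (`fx`, `fy`, `cx`, `cy`) are assumptions for illustration. The output of ConSeg is represented here as a plain nested list of 0/1 values.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with the given depth
    into the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def mask_to_point_element(mask, depth_map, fx, fy, cx, cy):
    """Reduce a binary mask to one 3D point: the centroid of all masked
    pixels that have a valid (positive) depth reading.

    The centroid is an assumed heuristic, not the one used by the authors.
    Returns None if no masked pixel has valid depth.
    """
    points = []
    for v, row in enumerate(mask):
        for u, inside in enumerate(row):
            d = depth_map[v][u]
            if inside and d > 0:
                points.append(backproject(u, v, d, fx, fy, cx, cy))
    if not points:
        return None
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

A multi-view version would additionally transform each camera-frame point into a shared world frame using the camera extrinsics before aggregating.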

CLIPort Simulator Demos

These demos show that Code-as-Monitor can successfully enhance monitoring of 3D spatial relationships of entities in the environment, facilitating both reactive and proactive failure detection and leading to more accurate counting.


Task: Stack in Order. Disturbances: Placement noise.


Our Method
DoReMi Baseline: Frequent VLM queries lead to more incorrect failure judgments because of limited 3D spatial understanding from a single view.
Our Method under the most severe disturbance.

Task: Stack in Order. Disturbances: Random drop.


Our Method
Inner Monologue Baseline: Only detects failures upon subgoal completion.
Our Method under the most severe disturbance.

Task: Sweep half the blocks


Our Method
DoReMi Baseline: Directly using VLM to count blocks fails to complete the task.
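As a rough illustration of why counting through code can be more reliable than asking a VLM to count, the sketch below counts tracked block positions inside a rectangular target zone and reports how many blocks still need to be swept. The 2D point representation, the axis-aligned zone, and the reading of "half" as the floor of half the block count are assumptions made for this example only.

```python
def count_in_zone(points, zone):
    """Count 2D points that lie inside an axis-aligned rectangular zone.

    zone is ((xmin, ymin), (xmax, ymax)); borders count as inside.
    """
    (xmin, ymin), (xmax, ymax) = zone
    return sum(1 for x, y in points if xmin <= x <= xmax and ymin <= y <= ymax)

def blocks_remaining_to_sweep(points, zone):
    """Number of blocks that still must enter the zone for half of all
    blocks (rounded down, an assumed convention) to be inside it."""
    target = len(points) // 2
    return max(0, target - count_in_zone(points, zone))
```

With one tracked point per block, the count is exact by construction, so the task-completion check reduces to comparing two integers.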

Omnigibson Simulator Demos

These demos show that Code-as-Monitor can detect richer failures (e.g., point-, line-, and surface-level disturbances) at lower computational cost than frequently querying VLMs.

Visualization of Constraint-aware Segmentation

We show additional constraint-aware segmentation results, at both the instance level and the part level, on out-of-distribution data to demonstrate strong generalizability to unseen scenarios, tasks, and objects.

BibTeX

@article{zhou2024code,
    title={Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection},
    author={Zhou, Enshen and Su, Qi and Chi, Cheng and Zhang, Zhizheng and Wang, Zhongyuan and Huang, Tiejun and Sheng, Lu and Wang, He},
    journal={arXiv preprint arXiv:2412.04455},
    year={2024}
}