Publications (*, †, and ‡ indicate equal contribution, corresponding author, and project leader, respectively.)
Currently, my interest lies in Embodied Agents,
which sit at the intersection of Multimodal Large Language Models (MLLMs) and Embodied AI,
with particular interests in high-level planning and low-level control with spatio-temporal intelligence,
working towards a generalist agent in complex real-world environments.
Representative works are highlighted.
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
Enshen Zhou *,
Qi Su *,
Cheng Chi *†,
Zhizheng Zhang,
Zhongyuan Wang,
Tiejun Huang,
Lu Sheng†,
He Wang†
Paper /
Project /
Bilibili Video /
Copy BibTeX
TL;DR: Enjoy open-world failure detection with real-time, high precision!
The IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
WorldSimBench: Towards Video Generation Models as World Simulators
Yiran Qin *,
Zhelun Shi *,
Jiwen Yu,
Xijun Wang,
Enshen Zhou,
Lijun Li,
Zhenfei Yin‡,
Xihui Liu,
Lu Sheng,
Jing Shao†,
Lei Bai,
Wanli Ouyang,
Ruimao Zhang†
Paper /
Project /
Copy BibTeX
TL;DR: Evaluate Video Generation Models as World Simulators!
Preprint, 2024
AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Jingkun An *,
Yinghao Zhu *,
Zongjian Li *,
Enshen Zhou,
Haoran Feng,
Xijie Huang,
Bohua Chen,
Yemin Shi,
Chengwei Pan†
Paper /
Project /
Code /
Copy BibTeX
TL;DR: Train T2I diffusion models with AI-generated feedback for DPO!
Proceedings of the AAAI Conference on Artificial Intelligence, AAAI 2025
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou *,
Yiran Qin *,
Zhenfei Yin,
Yuzhou Huang,
Ruimao Zhang†,
Lu Sheng†,
Yu Qiao,
Jing Shao‡
Paper /
Project /
Code /
Copy BibTeX
TL;DR: Use imagination to guide the agent itself on how to act step-by-step!
Advances in Neural Information Processing Systems, NeurIPS 2024 @ OWA
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
Yiran Qin *,
Enshen Zhou *,
Qichang Liu *,
Zhenfei Yin,
Lu Sheng†,
Ruimao Zhang†,
Yu Qiao,
Jing Shao‡
Paper /
Project /
Bilibili Video /
Code /
Copy BibTeX
TL;DR: A multi-agent system can solve endless open-ended long-horizon tasks!
The IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Services
Workshop Challenge Organizer:
- Trustworthy Multi-modal Foundation Models and AI Agents (TiFA) in ICML 2024.
- Multi-modal Foundation Model meets Embodied AI (MFM-EAI) in ICML 2024.
Reviewer: CVPR
Selected Awards and Honors
2024: Outstanding Graduate of Beihang University.
2023: Special Prize (Top 1) in "Challenge Cup" Competition of Science Achievement in China.
2017: Ranked 1st/68k in the National High School Entrance Examination of Shenzhen.