Researcher at BAAI
I am currently a researcher at the Beijing Academy of Artificial Intelligence, focusing on embodied multimodal large models. Previously, I received my Ph.D. from the Institute of Information Engineering at the Chinese Academy of Sciences, advised by Prof. Bo Li.
We have several academic visitor and intern positions open at the Beijing Academy of Artificial Intelligence. We actively work on multimodal retrieval, multimodal learning, autonomous driving perception, and embodied intelligence. If you like what we do, don't hesitate to contact me.
Amazon Web Services. Mentor: Yi Zhu, Mu Li
Samsung Research China - Beijing (SRC-B)
Beijing Academy of Artificial Intelligence
* equal contributions ‡ project lead § corresponding author
A Hierarchical Reinforcement Learning Framework for Multi-UAV Combat Using Leader-Follower Strategy
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
TASAR: Transfer-Based Attack on Skeletal Action Recognition
AS-GCL: Asymmetric Spectral Augmentation on Graph Contrastive Learning
MapFusion: A novel BEV feature fusion network for multi-modal map construction. Information Fusion, 2025
STViT+: improving self-supervised multi-camera depth estimation with spatial-temporal context and adversarial geometry regularization. Applied Intelligence, 2025
Is Your HD Map Constructor Reliable under Sensor Corruptions?
MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation
KALAHash: Knowledge-Anchored Low-Resource Adaptation for Deep Hashing
FTF-ER: Feature-Topology Fusion-Based Experience Replay Method for Continual Graph Learning. ACM Multimedia (MM), 2024
MBFusion: A New Multi-modal BEV Feature Fusion Method for HD Map Construction
Customized Treatment per Pixel for Blind Image Super-Resolution
Enhancing 3D Hand Pose Estimation via Dense Ordinal Regression Network
ESC-MISR: Enhancing Spatial Correlations for Multi-Image Super-Resolution in Remote Sensing
Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
Uncertainty-Aware Alignment Network for Cross-Domain Video-Text Retrieval
MixGen: A New Multi-Modal Data Augmentation
Listen and Look: Multi-Modal Aggregation and Co-Attention Network for Video-Audio Retrieval
Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval
What Matters: Attentive and Relational Feature Aggregation Network for Video-Text Retrieval
TLA: Tactile-Language-Action Model for Contact-Rich Manipulation. arXiv
AffordGrasp: In-Context Affordance Reasoning for Open-Vocabulary Task-Oriented Grasping in Clutter. arXiv
Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation. arXiv
What Foundation Models can Bring for Robot Learning in Manipulation: A Survey. arXiv
BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation. arXiv
Communication-Efficient Personalized Federal Graph Learning via Low-Rank Decomposition. arXiv
DWCL: Dual-Weighted Contrastive Learning for Multi-View Clustering. arXiv
MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation. arXiv
MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception. arXiv
EPIC-Kitchens Dataset Challenges. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
EPIC-Kitchens Dataset Challenges. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
EPIC-Kitchens Dataset Challenges. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The RoboDrive Challenge. IEEE International Conference on Robotics and Automation (ICRA)
The RoboDrive Challenge. IEEE International Conference on Robotics and Automation (ICRA)
A Challenge for Out-of-Distribution Generalization in Computer Vision (OOD-CV). IEEE/CVF International Conference on Computer Vision (ICCV)