XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations