Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks