VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation