Robotic Manipulation is Vision-to-Geometry Mapping (f(v) \rightarrow G): Vision-Geometry Backbones over Language and Video Models