VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model