Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models
AlphaXiv Chinese-language paper page (scrollable)