Research
Highlighted Papers
Can Vision-Language Models Solve the Shell Game?
VET-Bench
The Visual Entity Tracking Benchmark (VET-Bench) is a synthetic diagnostic benchmark that simulates the shell game using visually indistinguishable objects. State-of-the-art VLMs perform at chance level, whereas our proposed Molmo2-SGCoT achieves over 90% accuracy.