Research

Highlighted Papers

Can Vision-Language Models Solve the Shell Game? VET-Bench

The Visual Entity Tracking Benchmark (VET-Bench) is a synthetic diagnostic benchmark that simulates the shell game with visually indistinguishable objects. State-of-the-art VLMs perform at chance level on it, while our proposed Molmo2-SGCoT achieves over 90% accuracy.
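
To make "chance level" concrete, the following is a minimal sketch of a shell-game episode. It is an illustration only, not the VET-Bench generator: the number of cups, the number of swaps, and all function names here are assumptions chosen for the example. Because the objects look identical, the only way to find the target at the end is to follow every swap; a guesser that ignores the swaps is right about 1 time in `num_cups`.

```python
import random


def play_shell_game(num_cups, num_swaps, rng):
    """Generate one hypothetical episode: (start, swaps, final position)."""
    start = rng.randrange(num_cups)  # cup that initially hides the target
    position = start
    swaps = []
    for _ in range(num_swaps):
        a, b = rng.sample(range(num_cups), 2)  # two distinct cups trade places
        swaps.append((a, b))
        if position == a:
            position = b
        elif position == b:
            position = a
    return start, swaps, position


def track(start, swaps):
    """Follow the target through every swap (what a perfect tracker does)."""
    position = start
    for a, b in swaps:
        if position == a:
            position = b
        elif position == b:
            position = a
    return position


def random_guess_accuracy(num_cups, num_swaps, episodes, seed=0):
    """Accuracy of a guesser that ignores the swaps entirely."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(episodes):
        _, _, final = play_shell_game(num_cups, num_swaps, rng)
        if rng.randrange(num_cups) == final:
            correct += 1
    return correct / episodes
```

With three cups (an assumed setting), `random_guess_accuracy(3, 5, 6000)` comes out close to 1/3, whereas `track` recovers the final position exactly whenever the full swap sequence is observed.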