SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

Abstract

Vision-language models (VLMs) often underperform on evidence-intensive tasks because the decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether the highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from either evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E
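To make the ambiguity of naive entropy minimization concrete, the following is a minimal sketch of one possible reading of an anchor-preserving entropy score. It is not the SPOT-E objective itself: the function names, the threshold of 0.5 nats, the flip penalty, and the toy distributions are all assumptions introduced for illustration, and the GRPO-based spotlight optimization is not modeled here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(dists):
    """Average entropy over the answer-span tokens."""
    return sum(token_entropy(d) for d in dists) / len(dists)

def shaped_entropy_score(base_dists, new_dists,
                         anchor_threshold=0.5, anchor_weight=1.0):
    """Hypothetical entropy-shaping score (lower is better).

    base_dists: per-token distributions from the frozen model on the
                unmodified image (the baseline).
    new_dists:  per-token distributions with a candidate spotlight.
    Tokens whose baseline entropy is below anchor_threshold are treated
    as low-entropy anchors (assumed definition); each anchor whose
    top-1 token changes adds anchor_weight to the score.
    """
    anchor_flips = 0
    for base, new in zip(base_dists, new_dists):
        is_anchor = token_entropy(base) < anchor_threshold
        if is_anchor and base.index(max(base)) != new.index(max(new)):
            anchor_flips += 1
    return mean_entropy(new_dists) + anchor_weight * anchor_flips

# Toy answer span of two tokens over a three-word vocabulary.
baseline = [[0.90, 0.05, 0.05], [0.4, 0.3, 0.3]]
# Candidate A: the confident token is kept, the uncertain one sharpens.
grounded = [[0.90, 0.05, 0.05], [0.8, 0.1, 0.1]]
# Candidate B: same entropies, but the confident token has flipped.
collapsed = [[0.05, 0.90, 0.05], [0.8, 0.1, 0.1]]

score_grounded = shaped_entropy_score(baseline, grounded)
score_collapsed = shaped_entropy_score(baseline, collapsed)
```

In this toy case both candidates have the same mean entropy, so plain entropy minimization cannot tell them apart, while the anchor term ranks the candidate that overwrote a baseline high-confidence token as worse.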

Authors (9)

Bo Yin
Xiaobin Hu
Chengming Xu
Ruolin Shen
Mo Yang
Jiangning Zhang
Peng-Tao Jiang
Cheng Tan
Shuicheng YAN
