SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs
Abstract
Vision-language models (VLMs) often underperform on evidence-intensive tasks because the decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether the highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from either evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving the baseline's high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all evaluated benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E
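The abstract does not state the exact form of the entropy-shaping objective, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that an "anchor" is an answer-span position whose baseline entropy falls below a threshold, and that "preserving" an anchor means keeping the new distribution close to the baseline one in KL divergence. The names `tau`, `lam`, and `shaped_entropy_objective` are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def shaped_entropy_objective(base_dists, new_dists, tau=0.5, lam=1.0):
    """Lower is better. Hypothetical objective, assumed for illustration.

    base_dists: per-position answer-span distributions without a spotlight.
    new_dists:  the same positions with a candidate spotlight applied.
    tau:        assumed entropy threshold that defines low-entropy anchors.
    lam:        assumed weight of the anchor-preservation term.
    """
    # Term 1: mean answer-span entropy under the candidate spotlight.
    mean_entropy = sum(token_entropy(d) for d in new_dists) / len(new_dists)

    # Anchors: positions where the baseline model was already confident.
    anchors = [i for i, d in enumerate(base_dists) if token_entropy(d) < tau]

    # Term 2: penalize drift away from the baseline at anchor positions,
    # so that low entropy cannot be reached by overwriting confident tokens.
    if anchors:
        drift = sum(kl_divergence(base_dists[i], new_dists[i]) for i in anchors)
        drift /= len(anchors)
    else:
        drift = 0.0

    return mean_entropy + lam * drift

# Toy answer span: position 0 is a confident anchor, position 1 is uncertain.
base = [[0.97, 0.02, 0.01], [0.4, 0.3, 0.3]]
# Candidate that keeps the anchor and sharpens the uncertain position.
grounded = [[0.97, 0.02, 0.01], [0.9, 0.05, 0.05]]
# Candidate that is equally sharp but flips the anchor token.
collapsed = [[0.01, 0.02, 0.97], [0.9, 0.05, 0.05]]

score_grounded = shaped_entropy_objective(base, grounded)
score_collapsed = shaped_entropy_objective(base, collapsed)
```

In this toy case both candidates have the same mean entropy, so plain entropy minimization cannot tell them apart; only the anchor term separates the candidate that keeps the confident token from the one that overwrites it. In a GRPO-style loop, the negated score could serve as a per-candidate reward, although the abstract does not specify how the reward is defined.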
Authors (9)
- Bo Yin
- Xiaobin Hu
- Chengming Xu
- Ruolin Shen
- Mo Yang
- Jiangning Zhang
- Peng-Tao Jiang
- Cheng Tan
- Shuicheng YAN