Zero-shot VLN agents built on Large Language Models (LLMs) show strong generalization, but their performance is limited by an overreliance on linguistic reasoning and insufficient spatial perception. To address this limitation, we focus on complex, perceptually rich continuous environments and systematically categorize the key perceptual bottlenecks into three spatial challenges: door interaction, multi-room navigation, and ambiguous instruction execution, on which existing methods consistently suffer high failure rates. We present Spatial-VLN, a perception-guided exploration framework designed to overcome these challenges. The framework consists of two main modules. The Spatial Perception Enhancement (SPE) module integrates panoramic filtering with specialized door and region experts to produce spatially coherent, cross-view consistent perceptual representations. Building on this foundation, our Explored Multi-expert Reasoning (EMR) module uses parallel LLM experts to address waypoint-level semantics and region-level spatial transitions. When discrepancies arise between expert predictions, a query-and-explore mechanism is activated, prompting the agent to actively probe critical areas and resolve perceptual ambiguities. Experiments on VLN-CE demonstrate that Spatial-VLN achieves state-of-the-art performance using only low-cost LLMs. Furthermore, to validate real-world applicability, we introduce a value-based waypoint sampling strategy that effectively bridges the Sim2Real gap. Extensive real-world evaluations confirm that our framework delivers superior generalization and robustness in complex, cluttered environments.
The architecture of the proposed Spatial-VLN framework comprises two core components: the Spatial Perception Enhancement (SPE) module, which fuses panoramic filtering with specialized door and region experts to build spatially coherent, cross-view consistent perceptual representations, and the Explored Multi-expert Reasoning (EMR) module, which runs parallel LLM experts over waypoint-level semantics and region-level spatial transitions and invokes a query-and-explore step whenever their predictions disagree.
Together, these two components enable robust and adaptive navigation by explicitly integrating spatial perception with exploration-driven multi-expert reasoning.
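To make the interplay between the EMR experts and the query-and-explore mechanism concrete, the sketch below shows one way the decision logic could be organized. It is a minimal illustration under stated assumptions, not the paper's implementation: the Observation/Action containers and the waypoint_expert, region_expert, and probe_area callables are hypothetical placeholders for the corresponding components described above.

# Illustrative sketch only (not the authors' code): parallel LLM experts with a
# disagreement-triggered query-and-explore step. All names below are hypothetical.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    panorama: List[str]             # filtered panoramic views from the SPE module
    candidate_waypoints: List[int]  # navigable waypoints detected in the scene


@dataclass
class Action:
    waypoint_id: int
    rationale: str


def emr_step(
    obs: Observation,
    instruction: str,
    waypoint_expert: Callable[[Observation, str], Action],
    region_expert: Callable[[Observation, str], Action],
    probe_area: Callable[[Observation, int], Observation],
) -> Action:
    """One reasoning step: query both experts; if their predictions
    diverge, actively probe the contested waypoints and re-query."""
    a_wp = waypoint_expert(obs, instruction)   # waypoint-level semantics
    a_rg = region_expert(obs, instruction)     # region-level spatial transitions

    if a_wp.waypoint_id == a_rg.waypoint_id:
        return a_wp                            # experts agree: act directly

    # Disagreement: gather additional observations around both contested
    # waypoints (query-and-explore) before committing to an action.
    for wp in (a_wp.waypoint_id, a_rg.waypoint_id):
        obs = probe_area(obs, wp)

    # Re-query with the richer context; as a tie-break in this sketch only,
    # defer to the waypoint-level expert if the experts still disagree.
    return waypoint_expert(obs, instruction)

The sketch keeps the experts as interchangeable callables so that the agreement check and the exploration trigger remain independent of any particular LLM backend.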