Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration

Lu Yue, Yue Fan, Shiwei Lian, Yu Zhao, Jiaxin Yu, Liang Xie, Feitian Zhang

Robotics and Control Laboratory, School of Advanced Manufacturing and Robotics, State Key Laboratory of Turbulence and Complex Systems, College of Engineering, Peking University
Defense Innovation Institute, Academy of Military Sciences
Tianjin Artificial Intelligence Innovation Center
💻 Code 📄 Paper

Abstract

Abstract Overview

Zero-shot VLN agents built on Large Language Models (LLMs) show strong generalization, but their performance is limited by an overreliance on linguistic reasoning and insufficient spatial perception. To address this limitation, we focus on complex, perceptually rich continuous environments and systematically categorize the key perceptual bottlenecks into three spatial challenges: door interaction, multi-room navigation, and ambiguous instruction execution, where existing methods consistently suffer high failure rates. We present Spatial-VLN, a perception-guided exploration framework designed to overcome these challenges. The framework consists of two main modules. The Spatial Perception Enhancement (SPE) module integrates panoramic filtering with specialized door and region experts to produce spatially coherent, cross-view consistent perceptual representations. Building on this foundation, our Explored Multi-expert Reasoning (EMR) module uses parallel LLM experts to address waypoint-level semantics and region-level spatial transitions. When discrepancies arise between expert predictions, a query-and-explore mechanism is activated, prompting the agent to actively probe critical areas and resolve perceptual ambiguities. Experiments on VLN-CE demonstrate that Spatial-VLN achieves state-of-the-art performance using only low-cost LLMs. Furthermore, to validate real-world applicability, we introduce a value-based waypoint sampling strategy that effectively bridges the Sim2Real gap. Extensive real-world evaluations confirm that our framework delivers superior generalization and robustness in complex, cluttered environments.

Method

The architecture of the proposed Spatial-VLN framework comprises two core components:

- Spatial Perception Enhancement (SPE): integrates panoramic filtering with specialized door and region experts to produce spatially coherent, cross-view consistent perceptual representations.
- Explored Multi-expert Reasoning (EMR): runs parallel LLM experts over waypoint-level semantics and region-level spatial transitions; when their predictions disagree, a query-and-explore mechanism prompts the agent to actively probe the disputed area and re-decide (see the sketch after this list).

Together, these two components enable robust and adaptive navigation by explicitly integrating spatial perception with exploration-driven multi-expert reasoning.
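To make the EMR decision loop concrete, here is a minimal sketch of the discrepancy-triggered query-and-explore behavior described in the abstract. All names (`waypoint_expert`, `region_expert`, `probe`, the score fields) and the probing budget are illustrative assumptions, not the paper's actual prompts or API.

```python
# Hypothetical sketch of one Explored Multi-expert Reasoning (EMR) step:
# two LLM experts vote on the next waypoint; disagreement triggers
# query-and-explore before re-deciding. Expert calls are stubbed with scores.

import concurrent.futures as cf

def waypoint_expert(instruction, candidates):
    """LLM expert for waypoint-level semantics (stubbed): picks the
    candidate whose local description best matches the instruction."""
    return max(range(len(candidates)), key=lambda i: candidates[i]["semantic_score"])

def region_expert(instruction, candidates):
    """LLM expert for region-level spatial transitions (stubbed): prefers
    candidates that cross toward the instructed room or door."""
    return max(range(len(candidates)), key=lambda i: candidates[i]["region_score"])

def probe(candidate):
    """Query-and-explore primitive (stubbed): move toward the disputed
    area, capture a fresh observation, and refresh the candidate's scores."""
    candidate["semantic_score"] += 0.1  # placeholder for re-perception
    return candidate

def emr_step(instruction, candidates, max_probes=2):
    for _ in range(max_probes + 1):
        # Run both LLM experts in parallel over the same candidate set.
        with cf.ThreadPoolExecutor() as pool:
            w = pool.submit(waypoint_expert, instruction, candidates)
            r = pool.submit(region_expert, instruction, candidates)
            w_idx, r_idx = w.result(), r.result()

        if w_idx == r_idx:            # experts agree: commit to the waypoint
            return candidates[w_idx]

        # Experts disagree: actively probe the disputed waypoints, then
        # re-query the experts with the refreshed perception.
        candidates[w_idx] = probe(candidates[w_idx])
        candidates[r_idx] = probe(candidates[r_idx])

    return candidates[w_idx]          # fall back after the exploration budget

# Example: two candidates on which the experts initially disagree.
cands = [{"semantic_score": 0.8, "region_score": 0.2},
         {"semantic_score": 0.3, "region_score": 0.9}]
print(emr_step("go through the door into the kitchen", cands))
```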

Method Illustration
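For real-world deployment, the abstract introduces a value-based waypoint sampling strategy to bridge the Sim2Real gap. The sketch below shows one plausible reading: sample waypoints in proportion to a traversability value derived from a local costmap, rather than from a fixed simulator-style grid. The value terms, the preferred step length, and the temperature are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of value-based waypoint sampling. Scoring terms and
# weights are illustrative assumptions; only the idea of sampling by
# value instead of a fixed sim grid is taken from the abstract.

import math
import random

def waypoint_value(angle, dist, occupancy_free_prob):
    """Score a polar candidate (angle in radians, distance in meters) by
    how traversable it looks; occupancy_free_prob comes from a costmap."""
    reach = math.exp(-abs(dist - 1.5))   # prefer ~1.5 m steps (assumed)
    return occupancy_free_prob * reach

def sample_waypoints(costmap_lookup, k=8, n_candidates=64, temperature=0.5):
    """Draw k waypoints from dense polar candidates, with probability
    proportional to exp(value / temperature)."""
    cands = [(random.uniform(-math.pi, math.pi), random.uniform(0.5, 3.0))
             for _ in range(n_candidates)]
    values = [waypoint_value(a, d, costmap_lookup(a, d)) for a, d in cands]
    weights = [math.exp(v / temperature) for v in values]
    return random.choices(cands, weights=weights, k=k)

# Toy costmap: free space ahead of the robot, mostly blocked behind it.
lookup = lambda a, d: 0.9 if abs(a) < math.pi / 2 else 0.1
for a, d in sample_waypoints(lookup):
    print(f"angle={math.degrees(a):7.1f} deg  dist={d:.2f} m")
```

Sampling by value rather than at fixed headings lets the same policy tolerate cluttered real scenes where simulator-style waypoint patterns would frequently land on obstacles.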

Real-World Visual Results

Qualitative outcomes captured from diverse indoor environments