Comparison of model predictions with ground-truth answers for various perceptual and spatial audio questions. (Please use headphones for best audio quality.)
OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models
Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the Spatial-Acoustic Geometry Encoder (SAGE), a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and simulated room impulse responses at training time, while requiring only audio at inference. Building on this representation, we present OWL, an ALLM that integrates SAGE with a spatially grounded chain-of-thought (CoT) to reason over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, OWL supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release BiDepth, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new BiDepth and the public SpatialSoundQA, OWL reduces mean DoA error by 11° through SAGE and improves spatial reasoning QA accuracy by up to 25% over BAT.
Comparison of OWL with closed- and open-source baselines on BiDepth across four task types: Type I (event detection), Type II (direction estimation), Type III (spatial reasoning), and Type IV (CoT reasoning). OWL consistently surpasses prior open-source models, with further gains from CoT supervision. Best results are in bold.
Zero-shot performance of OWL on SpatialSoundQA across perception and reasoning tasks. OWL consistently outperforms the baselines, with larger gains on spatial reasoning tasks, demonstrating the benefit of SAGE and CoT instruction tuning. Best results are denoted in bold.