OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

BASH Lab, Worcester Polytechnic Institute
International Conference on Learning Representations (ICLR), 2026

Abstract

Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the Spatial-Acoustic Geometry Encoder (SAGE), a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and simulated room impulse responses at training time, while requiring only audio at inference. Building on this representation, we present OWL, an ALLM that integrates SAGE with a spatially grounded chain-of-thought to reason over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, OWL supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release BiDepth, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new BiDepth and the public SpatialSoundQA, OWL reduces mean DoA error by 11° through SAGE and improves spatial reasoning QA accuracy by up to 25% over BAT.
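The o'clock-level azimuth labels and the mean DoA error figure quoted above can be made concrete with a small sketch. This is not the paper's released code; the function names, the convention that 0° = twelve o'clock (straight ahead, increasing clockwise), and the circular-wraparound error definition are illustrative assumptions.

```python
# Illustrative sketch (not OWL's implementation): quantizing azimuth to
# o'clock-level labels and computing mean DoA error with 360-degree wraparound.
# Assumed convention: 0 deg = twelve o'clock (straight ahead), clockwise-positive.

def azimuth_to_oclock(azimuth_deg: float) -> int:
    """Quantize an azimuth in degrees to the nearest hour hand (1-12)."""
    hour = round((azimuth_deg % 360) / 30) % 12
    return 12 if hour == 0 else hour

def mean_doa_error(pred_deg, true_deg):
    """Mean absolute angular error, taking the shorter way around the circle."""
    errors = []
    for p, t in zip(pred_deg, true_deg):
        diff = abs(p - t) % 360
        errors.append(min(diff, 360 - diff))
    return sum(errors) / len(errors)

if __name__ == "__main__":
    print(azimuth_to_oclock(0))    # 12 (twelve o'clock)
    print(azimuth_to_oclock(95))   # 3 (three o'clock)
    # Wraparound matters: 350 deg vs 10 deg is a 20 deg error, not 340 deg.
    print(mean_doa_error([350, 10], [10, 350]))  # 20.0
```

Without the wraparound term, predictions near the 0°/360° seam would be penalized as near-maximal errors, which is why circular error is the standard choice for DoA evaluation.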

Architecture

Architecture of OWL and SAGE. The left panel shows SAGE, trained with geometry-aware supervision using RIRs and depth cues. The right panel illustrates the OWL pipeline, where the Binaural Audio Encoder φa(·) is combined with the LLM (Π) through a projector ψ(·) to generate spatially grounded answers.
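The right-panel dataflow can be summarized schematically. The components below are stand-ins, not SAGE or OWL internals: a pooling "encoder," a tiny linear "projector," and a token-concatenating "LLM" are placeholder assumptions used only to show how binaural audio features are mapped into the language model's input sequence.

```python
# Schematic sketch of the OWL dataflow with placeholder components:
# binaural audio -> encoder phi_a(.) -> projector psi(.) -> LLM Pi.

def phi_a(binaural_audio):
    """Stand-in encoder: pool each of the two channels to mean and energy."""
    left, right = binaural_audio
    def pool(ch):
        mean = sum(ch) / len(ch)
        energy = sum(x * x for x in ch) / len(ch)
        return [mean, energy]
    return pool(left) + pool(right)  # toy 4-dim "acoustic embedding"

def psi(features, weights):
    """Stand-in projector: linear map from audio features to LLM token space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def Pi(prompt_tokens, audio_tokens):
    """Stand-in LLM: projected audio tokens are prepended to the text
    prompt before decoding the spatially grounded answer."""
    return audio_tokens + prompt_tokens

left = [0.1, 0.3, 0.2]
right = [0.4, 0.0, 0.2]
features = phi_a((left, right))                           # 4 features
weights = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]  # 2 audio tokens
audio_tokens = psi(features, weights)
sequence = Pi(["Where", "is", "the", "sound?"], audio_tokens)
print(len(features), len(audio_tokens), len(sequence))    # 4 2 6
```

The point of the projector step is dimensional: the encoder's acoustic embedding lives in its own space, and psi(·) is what makes it consumable as soft tokens by the frozen or fine-tuned LLM.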

Quantitative Results


Comparison of OWL with closed- and open-source baselines on BiDepth across four task types: Type I (event detection), Type II (direction estimation), Type III (spatial reasoning), and Type IV (CoT reasoning). OWL consistently surpasses prior open-source models, with further gains from CoT supervision. Best results are in bold.


Zero-shot performance of OWL on SpatialSoundQA across perception and reasoning tasks. OWL consistently outperforms the baselines, with larger gains on spatial reasoning tasks, demonstrating the benefit of SAGE and CoT instruction tuning. Best results are denoted in bold.

Qualitative Examples

Comparison of model predictions with ground-truth answers for various perceptual and spatial audio questions. (Please use headphones for best audio quality.)

Question: What distinct auditory events do you notice?
Ground Truth: Engine; Idling; Vehicle
Prediction: Engine; Idling; Vehicle; Medium engine (mid frequency)
Question: From which direction and at what distance can the sound of the Speech be detected?
Ground Truth: twelve o'clock; down; 1.0 m
Prediction: twelve o'clock; down; 1.0 m
Question: Can you identify the sound positioned at the right side of the receiver?
Ground Truth: Fireworks is located at eight o' clock, while Speech is positioned at four o' clock. Therefore, Speech is on the receiver's right side.
Prediction: From the receiver's perspective, the positions four o' clock indicates that Speech is on the right-hand side.