RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

BASH Lab, Worcester Polytechnic Institute
Empirical Methods in Natural Language Processing (main), 2025

Abstract

Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning – each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio–Video–Sensor streams paired with automatically generated question–answer pairs. Experimental results on seven multi-modal QA benchmarks – including egocentric and exocentric tasks – show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%.

Architecture

Overview of RAVEN. Each modality (video, audio, sensor) is encoded using pretrained encoders and projected into a shared space. The QuART module performs query-conditioned token relevance scoring to align informative tokens across modalities. The figure also highlights the three-stage training pipeline for alignment-aware multi-modal reasoning.
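To make the idea of query-conditioned token relevance scoring concrete, here is a minimal numpy sketch of QuART-style gating. This is not the official implementation; the bilinear scoring form, the `W` matrix, and the sigmoid gate are illustrative assumptions. The point it demonstrates is that each token in each modality stream receives a scalar relevance score conditioned on the query, which then scales that token before fusion.

```python
import numpy as np

def quart_gate(query, tokens_by_modality, W):
    """Hypothetical sketch of query-conditioned cross-modal gating.

    query:              (d,) query embedding
    tokens_by_modality: dict of modality name -> (num_tokens, d) arrays
    W:                  (d, d) bilinear interaction matrix (assumed form)
    Returns gated token arrays with the same shapes as the inputs.
    """
    gated = {}
    for name, tokens in tokens_by_modality.items():
        # Scalar relevance per token: s_i = t_i^T W q
        scores = tokens @ W @ query                 # (num_tokens,)
        gates = 1.0 / (1.0 + np.exp(-scores))       # sigmoid gate in (0, 1)
        # Amplify informative tokens, suppress distractors before fusion
        gated[name] = gates[:, None] * tokens
    return gated

rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)
streams = {
    "video":  rng.normal(size=(4, d)),
    "audio":  rng.normal(size=(3, d)),
    "sensor": rng.normal(size=(2, d)),
}
W = 0.1 * rng.normal(size=(d, d))
out = quart_gate(query, streams, W)
print({name: arr.shape for name, arr in out.items()})
```

Because every gate lies in (0, 1), each output token is a down-weighted copy of its input token, so irrelevant streams (e.g. off-camera speech) can be attenuated without being discarded outright.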

Quantitative Results

Qualitative Results

BibTeX


@article{biswas2025raven,
  title={RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language},
  author={Biswas, Subrata and Khan, Mohammad Nur Hossain and Islam, Bashima},
  journal={arXiv preprint arXiv:2505.17114},
  year={2025}
}