Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning, with each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks show that RAVEN achieves accuracy gains of up to 14.5% on egocentric and 8.0% on exocentric tasks over state-of-the-art multi-modal large language models. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%.
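To make the gating idea concrete, below is a minimal PyTorch sketch of query-conditioned, per-token scalar gating applied before fusion. The class name `QuARTGate`, the additive scoring function, and all dimensions are illustrative assumptions for exposition, not the paper's actual QuART implementation.

```python
# Illustrative sketch of query-conditioned cross-modal token gating (not the official QuART code).
import torch
import torch.nn as nn


class QuARTGate(nn.Module):
    """Assigns a scalar relevance score in [0, 1] to every modality token, conditioned on the query."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.token_proj = nn.Linear(d_model, d_model)
        self.score = nn.Linear(d_model, 1)  # one scalar relevance score per token

    def forward(self, query: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # query:  (batch, d_model)        pooled question embedding
        # tokens: (batch, seq, d_model)   audio / video / sensor token sequence
        q = self.query_proj(query).unsqueeze(1)               # (batch, 1, d_model)
        t = self.token_proj(tokens)                           # (batch, seq, d_model)
        gate = torch.sigmoid(self.score(torch.tanh(q + t)))   # (batch, seq, 1)
        # Amplify query-relevant tokens, suppress distractors before fusion.
        return tokens * gate


# Usage: gate each modality's tokens with the same question embedding, then concatenate for fusion.
gate = QuARTGate(d_model=768)
query = torch.randn(2, 768)
video_tokens = torch.randn(2, 196, 768)
audio_tokens = torch.randn(2, 64, 768)
fused_input = torch.cat([gate(query, video_tokens), gate(query, audio_tokens)], dim=1)
```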
@article{biswas2025raven,
  title={RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language},
  author={Biswas, Subrata and Khan, Mohammad Nur Hossain and Islam, Bashima},
  journal={arXiv preprint arXiv:2505.17114},
  year={2025}
}