Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning, with each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks show that RAVEN achieves accuracy gains of up to 14.5% on egocentric and 8.0% on exocentric tasks over state-of-the-art multi-modal large language models. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%.
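To make the gating idea concrete, below is a minimal PyTorch sketch of query-conditioned, per-token scalar gating applied before fusion. The class name `QuARTGate`, the additive scoring function, and all dimensions are illustrative assumptions for exposition, not the paper's actual QuART implementation.

```python
# Illustrative sketch of query-conditioned cross-modal token gating (not the official QuART code).
import torch
import torch.nn as nn


class QuARTGate(nn.Module):
    """Assigns a scalar relevance score in [0, 1] to every modality token, conditioned on the query."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.token_proj = nn.Linear(d_model, d_model)
        self.score = nn.Linear(d_model, 1)  # one scalar relevance score per token

    def forward(self, query: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # query:  (batch, d_model)        pooled question embedding
        # tokens: (batch, seq, d_model)   audio / video / sensor token sequence
        q = self.query_proj(query).unsqueeze(1)               # (batch, 1, d_model)
        t = self.token_proj(tokens)                           # (batch, seq, d_model)
        gate = torch.sigmoid(self.score(torch.tanh(q + t)))   # (batch, seq, 1)
        # Amplify query-relevant tokens, suppress distractors before fusion.
        return tokens * gate


# Usage: gate each modality's tokens with the same question embedding, then concatenate for fusion.
gate = QuARTGate(d_model=768)
query = torch.randn(2, 768)
video_tokens = torch.randn(2, 196, 768)
audio_tokens = torch.randn(2, 64, 768)
fused_input = torch.cat([gate(query, video_tokens), gate(query, audio_tokens)], dim=1)
```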
@article{biswas2025raven,
  title={RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language},
  author={Biswas, Subrata and Khan, Mohammad Nur Hossain and Islam, Bashima},
  journal={arXiv preprint arXiv:2505.17114},
  year={2025}
}