The Convergence of Vision and Language: A Deep Learning Approach to Enhanced Visual Question Answering via Question-Guided Attention

Authors

  • Ali Rasheed, Computer Dept, Sweden

DOI:

https://doi.org/10.66395/globeis.3

Keywords:

Artificial Intelligence, Computer Vision, Natural Language Processing, Vision-Language Models (VLMs), Visual Question Answering (VQA), Deep Learning, Transformer, Multimodal AI

Abstract

The combination of Computer Vision (CV) and Natural Language Processing (NLP), embodied in Vision-Language Models (VLMs), is a critical area in the development of cognitive Artificial Intelligence (AI). This paper addresses a major weakness of current VLMs in Visual Question Answering (VQA): they are prone to linguistic bias and poor visual grounding. We present an in-depth discussion and experimental evaluation of a new VLM framework that adds a Question-Guided Attention (QGA) module within the cross-modal fusion phase. The QGA module dynamically filters the raw visual features, emphasizing those most relevant to the question vector (a semantic feature), and thereby passes a cleaner visual signal to the later reasoning stages. The QGA-VLM model is evaluated on the VQA v2.0 benchmark. The results indicate that overall VQA accuracy improves significantly over the baseline VLM, with the largest gains on more complex reasoning types, including counting and relational questions, which improved by an average of +4.75 percent. These results support the effectiveness of the QGA module in improving visual grounding and reducing reliance on dataset priors, a step toward more trustworthy and human-interpretable multimodal AI systems.
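
The abstract does not give the exact form of the QGA module, so the following is only a minimal sketch of the general idea of question-guided filtering, assuming scaled dot-product scoring followed by a softmax over image regions. The function name, the projection matrices `w_v` and `w_q`, and all dimensions are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def question_guided_attention(visual_feats, question_vec, w_v, w_q):
    """Reweight region features by their relevance to the question.

    visual_feats: (num_regions, d_v) raw visual features
    question_vec: (d_q,) question embedding
    w_v: (d_v, d_h) and w_q: (d_q, d_h) are hypothetical projections
    into a shared space; the paper's actual parameterization is unknown.
    """
    v_proj = visual_feats @ w_v                  # (num_regions, d_h)
    q_proj = question_vec @ w_q                  # (d_h,)
    # Relevance of each region to the question (scaled dot product).
    scores = v_proj @ q_proj / np.sqrt(v_proj.shape[1])
    scores = scores - scores.max()               # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    # Scale each region's features by its attention weight.
    filtered = weights[:, None] * visual_feats   # (num_regions, d_v)
    return filtered, weights

# Illustrative random inputs (assumed sizes: 36 regions, 64-d visual, 32-d question).
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(36, 64))
question_vec = rng.normal(size=(32,))
w_v = rng.normal(size=(64, 16)) * 0.1
w_q = rng.normal(size=(32, 16)) * 0.1

filtered, weights = question_guided_attention(visual_feats, question_vec, w_v, w_q)
```

In this sketch, regions with low relevance scores are scaled toward zero while the feature shape is preserved, so a downstream fusion or reasoning stage could consume `filtered` in place of the raw features.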

Published

2025-12-26

How to Cite

The Convergence of Vision and Language: A Deep Learning Approach to Enhanced Visual Question Answering via Question-Guided Attention. (2025). GlobeIS International Journal of Global Information Systems, 1(1), 20-27. https://doi.org/10.66395/globeis.3