Modular Co-attention Networks in Nepali Visual Question Answering Systems

Gyanwali, Aashish and Sapkota, Binod and Koirala, Abhishek and Dawadi, Babu R (2024) Modular Co-attention Networks in Nepali Visual Question Answering Systems. Asian Journal of Research in Computer Science, 17 (10). pp. 62-84. ISSN 2581-8260

Sapkota17102024AJRCOS123586.pdf - Published Version

Abstract

Visual question answering (VQA) is a challenging task requiring a tight integration of computer vision and natural language processing. As no dataset was available to train such a model for the Nepali language, a new dataset was developed during the research by translating the VQAv2 dataset. The resulting dataset, consisting of 202,577 images and 886,560 questions, was used to train an attention-based VQA model. It contains yes/no, counting, and other question types with primarily one-word answers. A Modular Co-attention Network (MCAN) was applied to visual features extracted with the Faster R-CNN framework and question embeddings extracted with a Nepali GloVe model. After co-attending the visual and language features through several cascaded MCAN layers, the features were fused to train the whole network. During evaluation, an overall accuracy of 69.87% was obtained, with 81.09% accuracy on yes/no questions. These results surpassed the performance of models developed for the Hindi and Bengali languages. Overall, this is novel research in the Nepali-language VQA domain, paving the way for further advancements.
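The pipeline the abstract describes (question-guided attention over region features from a detector, cascaded for a few layers, then fused) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature dimensions, token/region counts, layer count, mean-pooling, and sum fusion are all assumptions for demonstration; the actual MCAN uses learned multi-head attention, feed-forward sublayers, layer normalization, and a learned attentional reduction before fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(q, k, v):
    """Single-head scaled dot-product attention: q (Tq, d), k/v (Tk, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (Tq, Tk) similarity scores
    return softmax(scores, axis=-1) @ v    # each query attends over v

def coattention_cascade(question, regions, num_layers=2):
    """Hypothetical MCAN-style cascade: self-attention on the question,
    then question-guided attention on the image regions, repeated."""
    for _ in range(num_layers):
        question = scaled_dot_attention(question, question, question)
        regions = scaled_dot_attention(regions, question, question)
    # Reduce each modality to one vector and fuse (sum is an assumption;
    # the paper's fusion is a learned operation).
    return question.mean(axis=0) + regions.mean(axis=0)

rng = np.random.default_rng(0)
q_tokens = rng.standard_normal((14, 300))  # e.g. 14 GloVe question embeddings
v_regions = rng.standard_normal((36, 300)) # e.g. 36 detector region features,
                                           # assumed projected to the same dim
fused = coattention_cascade(q_tokens, v_regions)
```

In the full model, the fused vector would feed a classifier over the answer vocabulary (yes/no, counts, and other one-word answers).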

Item Type: Article
Subjects: European Scholar > Computer Science
Depositing User: Managing Editor
Date Deposited: 21 Oct 2024 07:27
Last Modified: 21 Oct 2024 07:27
URI: http://article.publish4promo.com/id/eprint/3561
