
Lightweight Visual Question Answering using Scene Graphs

Vidyaranya Sai Nuthalapati, Ramraj Chandradevan, Eleonora Giunchiglia, Bowen Li, Maxime Kayser, Thomas Lukasiewicz and Carl Yang


Visual question answering (VQA) is a challenging problem in machine perception that requires a deep joint understanding of both visual and textual data. Recent research has advanced the automatic generation of high-quality scene graphs from images, while elegant models such as graph neural networks (GNNs) have proven powerful at reasoning over graph-structured data. In this work, we propose to bridge the gap between scene graph generation and VQA by leveraging GNNs. In particular, we design a new model called Conditional Enhanced Graph ATtention network (CE-GAT) to encode pairs of visual and semantic scene graphs with both node and edge features. CE-GAT is integrated with a textual question encoder and generates answers through question-graph conditioning. Moreover, to alleviate the difficulty of training CE-GAT for VQA, we enforce more useful inductive biases in the scene graphs through novel question-guided graph enriching and pruning. Finally, we evaluate the framework on one of the largest available VQA datasets, GQA, using ground-truth scene graphs. It achieves an accuracy of 77.87%, compared with 63.17% for the state of the art, the neural state machine (NSM). Notably, by leveraging existing scene graphs, our framework is much lighter than end-to-end VQA methods (e.g., about 95.3% fewer parameters than a typical NSM).
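The abstract does not spell out how question-guided pruning works, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes a simple keyword match between question words and node labels as a stand-in for whatever matching the paper uses, and the names `SceneGraph` and `prune_by_question`, along with the `hops` parameter and the fallback behaviour, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Node id -> object label, e.g. {0: "man", 1: "horse"}.
    nodes: dict[int, str]
    # (subject id, relation label, object id) triples.
    edges: list[tuple[int, str, int]] = field(default_factory=list)


def prune_by_question(graph: SceneGraph, question: str, hops: int = 1) -> SceneGraph:
    """Keep nodes whose label occurs in the question, plus neighbours within `hops`.

    Assumption: plain keyword overlap stands in for the paper's matching step.
    """
    words = {w.strip("?.,!").lower() for w in question.split()}
    keep = {nid for nid, label in graph.nodes.items() if label.lower() in words}
    if not keep:
        # Hypothetical fallback: with no match, leave the graph untouched.
        return graph

    for _ in range(hops):
        frontier = set()
        for subj, _rel, obj in graph.edges:
            if subj in keep:
                frontier.add(obj)
            if obj in keep:
                frontier.add(subj)
        keep |= frontier

    nodes = {nid: label for nid, label in graph.nodes.items() if nid in keep}
    edges = [(s, r, o) for s, r, o in graph.edges if s in keep and o in keep]
    return SceneGraph(nodes, edges)


# Toy example (invented data, not taken from GQA).
graph = SceneGraph(
    nodes={0: "man", 1: "horse", 2: "tree", 3: "bird"},
    edges=[(0, "riding", 1), (2, "behind", 1), (3, "on", 2)],
)
pruned = prune_by_question(graph, "What is the man riding?")
print(pruned.nodes)
print(pruned.edges)
```

In this toy case only the node mentioned in the question and its direct neighbour survive, which gives a smaller graph for a downstream graph encoder to process.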

Book Title
Proceedings of the 30th International Conference on Information and Knowledge Management, CIKM 2021, Gold Coast, Queensland, Australia, November 1–5, 2021
ACM Press