Graph-based Intelligence for Industrial Internet-of-Things
Visual Question Answering for Intelligent Interaction
With the application of deep learning methods to image processing, image-related intelligent interaction technology has also developed rapidly. Visual question answering (VQA) collects image information by asking questions related to an image and ultimately serves to enrich image understanding. Vision and language are the two core components of human intelligence for understanding the real world and are also basic building blocks of artificial intelligence, and a large body of research has been carried out in each field separately. With the continued adoption of deep learning in computer vision and natural language processing, visual question answering, which spans both disciplines, has become a research hotspot in recent years. At the same time, as an emerging research direction, visual question answering systems face substantial challenges that remain to be studied. Through a comprehensive comparison and analysis of existing VQA models and methods, this paper summarizes the shortcomings and development directions of current research, analyzes several VQA models with respect to how they process the image input and the question input, and reviews the working principles of these models and the common public datasets they use. It concludes that extending structured knowledge bases and applying mature technologies such as text question answering and natural language processing to VQA problems are the future development directions of VQA models.
In recent years, with the continuous combination and development of computer vision and deep learning, the accuracy and efficiency of computer vision tasks have improved greatly. In tasks such as image classification [1, 2], behavior recognition [3, 4], and target detection [5, 6], machines have reached or even exceeded human-level performance, and these tasks belong to the field of artificial intelligence. As humans, in addition to identifying specific objects and their attributes in pictures or videos and marking the spatial positions of objects, we can also mark the relationships between objects, find the corresponding objects according to given text, and describe a picture in detail. We can even pose questions based on the content of pictures or videos for others to answer, and reason about the relationship between a question and the corresponding picture in order to obtain the required information. However, the main problem of current computer vision tasks is that the degree of image understanding is low, and it is difficult to conduct a comprehensive analysis of images. In response to this problem, researchers have explored further intelligent interaction methods for images, and visual question answering (VQA) is based on this approach. Visual question answering can abstract and condense, from a given image, high-level information such as the categories, spatial relationships, activities, and scene of the objects in the picture, and give reasonable answers to different questions. From the above description, visual question answering can improve the experience of human-computer interaction and is one of the key research directions for realizing artificial intelligence.
VQA is an emerging topic in the field of artificial intelligence and a very challenging research direction. It covers the two fields of computer vision and natural language processing and requires a model that can analyze an image and understand a question: given an image and an image-related question as input, the model automatically outputs a predicted answer. Since the VQA challenge in 2014, a large number of visual question answering models have been proposed. According to whether exogenous knowledge bases are introduced, the existing VQA models can be divided into two categories: joint embedding models and knowledge base-based models. However, most pioneering VQA models focus on the visual processing required by their dataset tasks. On recently proposed datasets such as TextVQA and ST-VQA, which contain real-world pictures with scene text, existing VQA models generally do not perform well. In the VQA task, both the question and the object it operates on are unknown in advance; the question is raised while the system is running, and the output answer varies with the training set and the operation object. Therefore, VQA is more “intelligent.” Compared with text question answering in the NLP domain, the VQA task faces challenges such as higher image dimensionality, more noise, and a lack of structured semantics and grammatical rules for images. In general, as a cross-domain artificial intelligence task, research on visual question answering represents an exploration toward future “general artificial intelligence”: it not only provides a method for cross-modal data processing and fusion but also represents a new stage of artificial intelligence in which machines understand and solve complex problems and even perform reasoning.
The basic models of visual question answering usually use linear classifiers or multilayer perceptron (MLP) classifiers to connect image vectors and text features [9–11]. In 1950, Alan Turing proposed the Turing Test, which is used to test whether a computer can show intelligence equivalent to or indistinguishable from that of humans; a computer that passes the Turing Test can be considered to have the ability to think. Since computers began to be used to answer questions posed by humans, many question answering systems have been introduced, promoting the development of natural language processing technology. In 2015, A. Agrawal et al. proposed the visual question answering task: a visual question answering system is given an image and a natural language question about this image, and the task is to provide an accurate natural language answer. Malinowski et al. first proposed a joint embedding model, Neural-Image-QA, applied to real scenes. Neural-Image-QA uses a convolutional neural network (CNN) to extract image features, and the obtained feature vector and the question text are passed to a long short-term memory (LSTM) network to generate the word sequence of the answer. The accuracy of the model on the DAQUAR dataset is 19.43%. This basic CNN + RNN paradigm has also been widely used by later researchers. In terms of text representation, Zhou et al. chose the bag-of-words (BOW) model, which is simpler than LSTM, for handling the question text, proposed the iBOWIMG model, and transferred a pre-trained GoogLeNet to extract image features, achieving good performance on the VQA dataset. Gao et al. believed that the question and the answer differ in syntactic structure, so they used two independent LSTM networks to encode the question and decode the answer, and combined them with a convolutional neural network to form the mQA model. Lin et al. applied convolutional neural networks to both encoding the image content and extracting features from the question text, used a multimodal convolutional layer to output joint feature vectors, and proposed a dual CNN model.
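The CNN + RNN joint-embedding paradigm described above can be sketched in a few lines of pure Python. The sketch below is illustrative only: the CNN and LSTM are replaced by toy feature extractors, and the fusion operator, answer vocabulary, and weights are made-up stand-ins, not any published model.

```python
# Minimal sketch of the joint-embedding paradigm: extract image and
# question features, fuse them, and classify over a fixed answer set.
# All extractors and weights here are illustrative stand-ins.

def image_features(image):
    # Stand-in for a CNN: map an image (a list of pixel values)
    # to a fixed-length feature vector.
    n = len(image)
    return [sum(image) / n, max(image), min(image)]

def question_features(question):
    # Stand-in for an LSTM / bag-of-words encoder: map a question
    # string to a fixed-length feature vector.
    words = question.lower().split()
    return [len(words), sum(len(w) for w in words), question.count("?")]

def fuse(v_img, v_q):
    # Element-wise product is one common simple fusion operator.
    return [a * b for a, b in zip(v_img, v_q)]

def classify(fused, answer_vocab, weights):
    # Linear classifier over a fixed answer vocabulary.
    scores = [sum(w_i * f for w_i, f in zip(w, fused)) for w in weights]
    return answer_vocab[scores.index(max(scores))]

answer_vocab = ["yes", "no", "giraffe"]
weights = [[0.1, 0.2, 0.3], [0.5, 0.1, 0.0], [0.0, 0.9, 0.1]]
image = [0.2, 0.8, 0.5, 0.1]
fused = fuse(image_features(image), question_features("What animal is this?"))
print(classify(fused, answer_vocab, weights))  # giraffe
```

Real models replace the stubs with learned networks, but the overall flow (encode each modality, fuse, classify over a fixed answer set) is the same.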
According to how the knowledge base is used, knowledge base-based visual question answering models are divided into knowledge base query models and knowledge base embedding models. The goal of the knowledge base query class is to create a knowledge base query statement based on the image and text and obtain the answer by querying the knowledge base. The model extracts the entities in the picture, maps the entities to the knowledge base, converts the natural language question into a query statement, and queries the knowledge base. Representative models are the Ahab and FVQA models. Wang et al. introduced the DBpedia knowledge base and proposed the Ahab model. Ahab uses a pre-trained Fast R-CNN and two different VGGNets to extract three kinds of visual concepts from images: objects, the image scene, and image attributes. All extracted image information is represented in the form of the Resource Description Framework (RDF); for example, “the image contains a giraffe object” is represented as (image, contains, object1), (object1, name, giraffe). Each visual concept is directly linked to a knowledge base concept with the same semantics. In terms of question text processing, Wang et al. set up 23 question templates based on the self-built KB-VQA dataset, which requires common sense or exogenous knowledge, transformed natural language questions into corresponding knowledge base query statements, and obtained answers directly from knowledge base queries. On the KB-VQA dataset, Ahab’s accuracy on each question type is much higher than that of joint embedding models. For identification questions: knowledge base embedding class > joint embedding model > knowledge base query class. For reasoning questions: knowledge base query class > knowledge base embedding class > joint embedding model. For transfer capability: knowledge base embedding class = knowledge base query class > joint embedding model.
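The RDF-style representation and querying used by knowledge base query models can be illustrated with a toy triple store. The triples and query pattern below are illustrative only; the actual Ahab pipeline links concepts to DBpedia and uses templated SPARQL-like queries, which are not reproduced here.

```python
# Toy RDF-style triple store illustrating how knowledge base query
# models represent extracted visual concepts and answer a query.
# The triples and the query below are illustrative examples.

triples = [
    ("image", "contains", "object1"),
    ("object1", "name", "giraffe"),
    ("image", "scene", "savanna"),
]

def query(subject=None, predicate=None, obj=None):
    # Return all triples matching the pattern (None acts as a wildcard),
    # analogous to a simple basic graph pattern in SPARQL.
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# "What is the name of the object the image contains?"
obj_id = query("image", "contains")[0][2]
name = query(obj_id, "name")[0][2]
print(name)  # giraffe
```

A question template would translate a natural language question into such a pattern and read the answer off the matched triple.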
One of the most effective ways to improve joint embedding models is to use attention mechanisms. Humans are capable of answering questions by focusing only on the areas of an image that are most relevant to the question, whereas most of the models introduced previously use the global features of the entire image to represent the visual input, which is likely to introduce noisy information irrelevant to a given question and affect answer prediction. The main idea of the attention mechanism is to let the model focus on specific visual areas in the image or certain words in the question, which provide more effective information for answering the question than the rest of the image or question. The structure of the attention mechanism model is shown in Figure 1.
In this model, attention focuses on the image regions relevant to the specific question to obtain attention weights, the image region features are then weighted and summed to obtain the image features, and these are finally combined with the question features and input to the classifier to obtain a predicted answer. Different variants of the attention mechanism can adaptively select the most important features and improve the accuracy of visual question answering. The soft and hard attention mechanisms proposed by Xu et al. have become mainstream approaches for VQA. Subsequently, Yang et al. proposed a stacked attention network that generates multiple attention maps on the image in a sequential manner, gradually focusing on the most important visual areas. Kim et al. extended this idea and incorporated it into residual connection architectures, yielding better attention mechanisms; experimental results show that models with attention mechanisms significantly outperform methods without them.
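The soft-attention step described above (score each region against the question, normalize with a softmax, then take the weighted sum of region features) can be sketched as follows. The dot-product scoring function and the toy features are illustrative; real models use a learned scoring network.

```python
import math

def soft_attention(region_features, question_feature):
    # Score each image region against the question (dot product here;
    # real models learn this scoring function), normalize the scores
    # with softmax, and return the attention-weighted sum of regions.
    scores = [sum(r_i * q_i for r_i, q_i in zip(r, question_feature))
              for r in region_features]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(region_features[0])
    attended = [sum(w * r[d] for w, r in zip(weights, region_features))
                for d in range(dim)]
    return attended, weights

regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]  # this toy question "attends" to the first dimension
attended, weights = soft_attention(regions, question)
print(weights)  # the first region receives the largest weight
```

Stacked attention repeats this step, feeding the attended feature back in as the new query so the model can refine its focus.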
2. Methods and Datasets
To predict an accurate answer for a given question and image pair, this paper proposes a method based on graph convolutional networks. The algorithm flow chart can be seen in Figure 2. First, the object features of the image and the word vectors of the question are extracted; then, the image and question features are organized into a graph structure, and the adjacency matrix of the graph, which is constrained by the relationship between the image and the question, is constructed. The adjacency matrix is used in the graph convolution layer, so the convolutional features not only focus on the image objects but also represent the correlation between the image objects and the question. Two datasets are used:

(1) TextVQA. The TextVQA dataset contains 28,408 images from the OpenImages v3 dataset and 45,336 questions about those images. The questions involve, among other things, querying times, names, and brands, and the image text may be partially occluded, making this a challenging task. Each question has a list of tokens extracted from the image text by Rosetta OCR, which is available in both multilingual (Rosetta-ml) and English-only (Rosetta-en) versions; performance is evaluated with the VQA accuracy criterion.

(2) ST-VQA. The ST-VQA dataset is composed of 23,038 images with 31,791 questions and contains natural images from multiple sources, including ICDAR 2013, ICDAR 2015, ImageNet, VizWiz, IIIT STR, Visual Genome, and COCO-Text. The dataset involves three tasks: strongly contextualised, weakly contextualised, and open vocabulary. These three tasks use different dictionaries: for the strongly contextualised task, each picture has its own dictionary of 100 words; for the weakly contextualised task, all pictures share a large dictionary of 30,000 words, 22,000 of which are correct answers and the rest distractors; for the open vocabulary task, no dictionary is provided.
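A single graph-convolution step of the kind described above, where an adjacency matrix restricts which object and question nodes exchange information, can be sketched in pure Python. The features, adjacency matrix, and degree normalization below are illustrative simplifications, not the paper's exact layer.

```python
def graph_conv(features, adjacency, weight):
    # One simplified graph-convolution layer: each node's new feature is
    # a degree-normalized sum over its neighbors' (and its own) features,
    # followed by a linear transform. Self-loops are added so each node
    # keeps its own information, as in standard GCN formulations.
    n = len(features)
    a_hat = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    out = []
    for i in range(n):
        deg = sum(a_hat[i])
        # Aggregate neighbor features, normalized by node degree.
        agg = [sum(a_hat[i][j] * features[j][d] for j in range(n)) / deg
               for d in range(len(features[0]))]
        # Linear transform (weight has shape d_in x d_out).
        out.append([sum(agg[k] * weight[k][m] for k in range(len(agg)))
                    for m in range(len(weight[0]))])
    return out

# Three nodes (e.g., two image objects and one question node); the
# adjacency encodes which objects are related to the question.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity transform for clarity
print(graph_conv(features, adjacency, weight))
```

Because the adjacency matrix zeroes out unrelated node pairs, each output feature mixes an object's own features only with those of the objects and question words it is connected to.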
Figure 3 provides a comparison of the ST-VQA and TextVQA datasets, from which it can be seen that the length distributions of questions and answers in the two datasets are very similar. Although both TextVQA and ST-VQA require reading scene text, unlike general VQA datasets, ST-VQA has 84.1% consistency between subjective answers and standard answers, while TextVQA has only 80.3%, indicating that ST-VQA has lower question ambiguity. In addition, all ST-VQA questions are about the text information in the image, while 39% of the questions in TextVQA do not use any scene text, indicating that ST-VQA is more representative than TextVQA.
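The VQA accuracy criterion used on these benchmarks compares a predicted answer against the multiple (typically ten) human answers collected per question. A commonly used simplified form of the metric counts an answer as fully correct when at least three annotators gave it; the human answers below are made up for illustration.

```python
def vqa_accuracy(predicted, human_answers):
    # Simplified form of the VQA accuracy metric: an answer scores 1.0
    # if at least 3 of the (typically 10) human annotators gave it,
    # and matches/3 otherwise. The official metric additionally averages
    # over subsets of annotators and normalizes answer strings.
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

humans = ["coca cola"] * 6 + ["coke"] * 3 + ["soda"]
print(vqa_accuracy("coke", humans))  # 1.0 (3 matches)
print(vqa_accuracy("soda", humans))  # about 0.33 (1 match)
```

This soft scoring is what makes the inter-annotator consistency figures above matter: ambiguous questions spread the human answers out, capping the achievable accuracy.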
2.1. Basic Model Framework and Working Principle of Visual Question Answering Technology
Automatically answering questions about visual content is considered one of the highest goals of artificial intelligence. The uniqueness of VQA is that it handles problems across the fields of computer vision and natural language processing, and its difficulty and goal lie in building a well-designed model that predicts the correct answer. The basic model framework of this technology is shown in Figure 4.
In the figure above, when an image of the Eiffel Tower is given and a related question is asked about the image, the model extracts the semantic information related to the question from the information provided by the image and answers the question in natural language; the answer obtained is “the Eiffel Tower.” Therefore, VQA technology faces the challenges of image analysis and question understanding, and sometimes even requires inferring answers from information that does not exist in the image. This additionally required information may be common sense or external knowledge about specific elements in the image. With the continuous development of related technologies, many novel visual question answering models have emerged.
2.2. Practical Application of VQA
The visual question answering model needs a certain reasoning ability in the process of answering questions, rather than simply “guessing” the answer. At present, general neural network models have no way to obtain such reasoning ability through end-to-end training. Visual question answering can be regarded as a knowledge complex and has very broad prospects in practice. For example, a visual assistant for blind users can answer questions based on images the user captures of the scene in front of them: the system can describe the current scene, obtain more image information through question and answer, and make correct guidance judgments for the user. It also has broad applications in online education for young children and in automatic querying of surveillance videos. Visual question answering can be used to improve the mode of human-computer interaction and provide a more natural way to query visual content; for example, it can be used for image retrieval without providing image metadata or tags. Therefore, visual question answering is a field worthy of further study. If the visual question answering task is substantially solved, that will be a significant milestone in the history of artificial intelligence.
This paper discusses the current situation and existing problems of visual question answering in terms of models and datasets, classifies visual question answering algorithms, summarizes their advantages and disadvantages, and points out some directions for future visual question answering research. Visual question answering is both a relatively novel concept and a complex task. Through continuous application and development in recent years, visual question answering research has achieved some positive results, but the current visual question answering models still have considerable limitations, and there is still a gap between what has been achieved and the actual goals. The visual question answering models and methods proposed so far are limited and simplified in form: the overall structure of the models is relatively simple, the form of the answers they generate is relatively simple, and basic reasoning and analysis capabilities are lacking. Visual question answering still leaves a great deal of room for research.
Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.
Acknowledgments
The authors acknowledge the National Natural Science Foundation of China (Grant no. 52075046).
References
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, New York, NY, USA, December 2016.
J. Donahue, L. Anne Hendricks, S. Guadarrama, S. Venugopalan, and M. Rohrbach, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634, September 2015.
J. Redmon, S. Divvala, and R. Girshick, “You Only Look Once: unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Las Vegas, NV, USA, December 2016.
A. Singh, V. Natarajan, M. Shah et al., “Towards VQA models that can read,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8317–8326, Long Beach, CA, USA, January 2019.
A. F. Biten, R. Tito, A. Mafla et al., “Scene text visual question answering,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, February 2020.
K. Kafle and C. Kanan, “Answer-type prediction for visual question answering,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, December 2016.
C. Szegedy, V. Vanhoucke, and S. Ioffe, “question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4976–4984, San Juan, PR, USA, 2016.
S. Antol, A. Agrawal, and J. Lu, “VQA: visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433, 2015.
L. C. F. Ribeiro, “Convolutional neural networks ensembles through single-iteration optimization,” Soft Computing, vol. 26, no. 8, pp. 3871–3882, 2022.
N. Murrugarra-Llerena and A. Kovashka, “Image retrieval with mixed initiative and multimodal feedback,” Computer Vision and Image Understanding, vol. 210, Article ID 103204, 2021.
Y. Zheng, B. M. Williams, and K. Chen, “Medical image understanding and analysis,” in Proceedings of the 23rd Conference, MIUA 2019, Liverpool, UK, July 2019.
W. Zheng, L. Yin, X. Chen, Z. Ma, S. Liu, and B. Yang, “Knowledge base graph embedding module design for visual question answering model,” Pattern Recognition, vol. 120, 2021.
G. Wang, Q. Zhai, and H. Liu, “Cross self-attention network for 3D point cloud,” Knowledge-Based Systems, vol. 247, Article ID 108769, 2022.