Article

Path-Wise Attention Memory Network for Visual Question Answering

1 School of Computer Science and Engineering, Central South University, Changsha 410083, China
2 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410083, China
3 College of Science and Technology, Xiangsihu College Guangxi University for Nationalities, Nanning 530008, China
4 College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
* Authors to whom correspondence should be addressed.
Mathematics 2022, 10(18), 3244; https://doi.org/10.3390/math10183244
Submission received: 21 July 2022 / Revised: 18 August 2022 / Accepted: 1 September 2022 / Published: 7 September 2022
(This article belongs to the Special Issue Computational Methods and Application in Machine Learning)

Abstract: Visual question answering (VQA) is regarded as a multi-modal fine-grained feature fusion task, which requires the construction of multi-level and omnidirectional relations between nodes. One main solution is the composite attention model composed of co-attention (CA) and self-attention (SA). However, existing composite models only stack single attention blocks and lack path-wise historical memory and overall adjustment. We propose a path-wise attention memory network (PAM) to construct a more robust composite attention model. After each single-hop attention block (SA or CA), the accumulated importance of nodes is used to calibrate the signal strength of node features. Four memorized single-hop attention matrices are used to obtain the path-wise co-attention matrix of path-wise attention (PA); therefore, the PA block is capable of synthesizing and strengthening the learning effect along the whole path. Moreover, we use guard gates of the target modality to check the source modality values in CA, and conditioning gates of the other modality to guide the query and key of the current modality in SA. The proposed PAM helps to construct a robust multi-hop neighborhood relationship between vision and language and achieves excellent performance on both the VQA2.0 and VQA-CP v2 datasets.

1. Introduction

Traditionally, computer vision and natural language processing are two important but mutually independent research fields of Artificial Intelligence. Both fields have made significant progress toward their goals and have gradually been driven toward convergence by the explosion of visual and textual data and the requirements of complex real-world tasks. At present, multi-modal learning has bridged the gap between vision and language and has attracted wide attention [1,2,3,4,5]. Remarkable progress has been made in many multi-modal learning tasks, e.g., image captioning [6,7,8,9], video captioning [10,11,12], cross-modal retrieval [13,14,15,16,17,18,19,20,21,22], and visual question answering (VQA) [7,23,24,25,26,27,28,29,30,31].
The VQA task is more challenging than other multi-modal learning tasks because it requires a full understanding of the textual information in the question and the visual information in the image, as well as identifying the key information needed for comprehensive reasoning.
Early methods coarsely learn a joint embedding representation from global features [32,33,34], which contains more noise and makes it difficult to answer fine-grained questions. To address this problem, two lines of work have made major contributions and shown effective improvements in accuracy. First, on data representations, global features are replaced by image region features [27,29,35,36] and word features [29,37], which enable the model to fuse modal information at a finer level. Second, different variants of attention mechanisms are applied to enhance the interaction between fine-grained visual features and language features. The underlying motivation is to selectively focus on important parts and ignore irrelevant information. We combine fine-grained representation with the attention mechanism and further improve the multi-step joint reasoning ability of the model; we then propose a Path-wise Attention Memory network (PAM). By recording and tracking the attention distribution pattern, the path-wise attention memory block is constructed to strengthen effective signals and weaken invalid noise.
The attention mechanism is applied very extensively in VQA, and various types of attention have been proposed. ReGAT [27] uses question self-attention to aggregate word embeddings into a global sentence feature. MCAN [36] uses self-attention on attended word features and attended region features in the fusion stage, so that an answer vector is obtained by adding the learned global question vector and global visual vector. The attention mechanisms mentioned above fuse information from a sequence of features into a single feature vector, while graph-based attention, similar to the graph attention network (GAT) [38], does not change the length of the input feature sequence. Graph-based self-attention can be used to establish region-to-region relations on the visual channel and word-to-word relations on the textual channel. Co-attention is able to learn fine-grained correlations between two feature sequences [25,37], so region-to-word and word-to-region relations can be acquired when it is applied to the VQA task. Compared with single-stream early fusion and dual-stream late fusion, the introduction of co-attention can flexibly adjust the hidden values according to the context of the other modality and learn inter-modal relations.
Many existing works use self-attention [23,27,35]; only a few use co-attention and self-attention simultaneously [29,36]. Existing models aggregate important information in multi-hop neighborhoods by stacking several layers of attention modules. An encoder–decoder architecture similar to the Transformer is also proposed in MCAN [36] to stack self-attention and co-attention. However, visual features and linguistic features lie in different feature spaces, and some deviations may exist even if transformation matrices are used for feature projection. Inspired by the channel-wise conditioning gates used in DFAF [29], we use channel-wise guard gates to check and reshape the projected values from the other modality in co-attention.
In addition, we believe that the multi-hop neighborhood relationship constructed by stacking attention modules needs a path-wise module to summarize the previous learning effects and correct some learning biases. This path-wise module needs to be directly connected to the previous attention modules and play a centralized control role to instruct and constrain the learning of single-hop modules. We draw on the idea of a memory network to memorize four attention matrices that represent the pairwise relationships among a set of regions and words. These attention matrices are regarded as the direct link between the single-hop modules and the path-wise module. In short, we propose the path-wise attention (PA) block, which is designed as a special co-attention block. The PA block takes four attention matrices and two modality features as inputs, calculates a path-wise attention matrix, and outputs the adjusted features, so it can directly guide the attention matrix learning of the single-hop attention modules along the entire path during backpropagation. The attention matrix of a general attention module is calculated from the query and key and is directly related to the current input features, while the attention matrix of PA is not directly related to the input features but to the update history of the features.
The contributions of our work are four-fold:
(1) We propose a novel framework Path-wise Attention Memory (PAM) Network to construct a robust multi-hop neighborhood relationship between visual and language.
(2) We design a central governor for attention-based models, namely Path-wise Attention, to instruct and constrain the learning of single-hop attention blocks in the path.
(3) We use the cumulative node importance to calibrate the signal strength of regions and words after each single-hop attention block. This strategy is simple and inexpensive but effective.
(4) We adaptively adopt a new gate mechanism in self-attention and co-attention, to make the information interaction between modalities tight and useful.
Roadmap. The rest of this paper is organized as follows. The related work is summarized in Section 2. The details of the proposed method PAM are presented in Section 3. The experimental results and evaluations are reported in Section 4. Finally, we conclude this paper in Section 5.

2. Related Work

2.1. Visual Question Answering

Visual question answering requires a comprehensive and fine-grained understanding of both visual information and text information. One line of work focuses on designing vector fusion functions to capture high-level correlations between the global visual vector space and the global text vector space [32,33,39]. Some approaches along this line achieve excellent results. However, the direct use of coarse-grained global features in this way may lead to the loss of local details, and the learned inter-modal relations are highly abstract and poorly interpretable. Therefore, research began to shift from coarse-grained to fine-grained, and many methods gradually use image region features to replace the global image vector and word features to replace the global sentence vector [25,40]. Moreover, the attention mechanism was also introduced to aggregate fine-grained node information and has become the most mainstream inference framework. Ref. [27] represents the question as a global sentence vector concatenated after each visual region feature; three types of visual relationships are considered, and their relation matrices are pre-trained and applied to mask the soft attention of a bidirectional graph attention network. Fine-grained representations of both visual and language features are adopted by [29], which applies self-attention within each modality and uses cross-modal feature gate vectors to guide inter-modal self-attention; that is, the query and key in inter-modal self-attention are multiplied element-wise with the average-pooled cross-modal features. Ref. [36] stacks question self-attention modules and question-guided visual self-attention modules in the proposed encoder–decoder form.

2.2. Attention Mechanisms

By selectively focusing on important parts and ignoring irrelevant information, an attention mechanism can effectively alleviate the problem of long-term dependencies in model training and improve the interpretability of neural networks; it has been widely applied to many unimodal tasks (e.g., machine translation [41,42], visual detection [43], image classification) and multimodal tasks (e.g., multimedia retrieval, visual question answering [23]).
A typical attention module uses a query and key to calculate the weight of each item in the input sequence and then sums the weighted values to obtain the output sequence. The attention mechanism is not restricted by the length of the input sequence, which makes it suitable for inductive learning problems. Different varieties of attention modules have been produced for different application scenarios.
In the early stage, the attention mechanism was usually combined with an RNN for machine translation tasks. The attention model was first introduced for machine translation by Bahdanau et al. [41], with sequence input but non-sequence output. Similarly, Yang et al. use an RNN and an attention mechanism to capture document information gradually at the word level and the sentence level and finally construct a document-level feature vector. In such scenarios, attention is used to weight the tokens in the input sequence and aggregate their representations into a holistic vector. An RNN's recurrent architecture is non-parallel, which results in computational inefficiency. To address this, Vaswani et al. proposed the Transformer architecture, which relies only on a self-attention mechanism to calculate the relationship between each word and all the other words [44].
In the VQA task, self-attention is extensively used to model word-to-word relationships for questions and region-to-region relationships for images. Question-guided attention on image regions or video frames is widely explored for visual question answering [27,36,45,46], video question answering [47,48,49], image captioning [6], etc. In order to capture denser correlations between cross-modal nodes, co-attention-based approaches [25,29,36,37] use bi-directional attention to learn the relationships between word–region pairs.

2.3. Graph Attention Network

When processing graph-structured data, graph convolutional networks (GCNs) use the Laplacian matrix to aggregate node information, which leads to fixed-size neighborhoods and non-parametric node weights allocated according to node degree. In contrast, graph attention networks (GATs) use the attention mechanism to aggregate node information, so that more important nodes receive higher weights. The parallelism, flexibility, and interpretability of the attention mechanism allow GATs to overcome these problems and perform well. Presented by Veličković et al. [38], graph attention networks aggregate multi-hop neighborhood information as the number of layers deepens. Meanwhile, each layer applies multi-head attention to stabilize the learning process: K hidden states are computed by K independent mechanisms and can be interpreted as observing features from K different perspectives.
When GATs are applied to the VQA task, visual region nodes and language word nodes jointly compose a fully connected graph, whose edge weights can be learned from the associated node features by attention alone or constrained by relative spatial positions between nodes or additional relational label data [27,50]. Our model represents both visual features and language features as graph structures and simultaneously learns the inter-relations and intra-relations of the two types of nodes via the attention mechanism. The fine-grained aggregation at the node level makes the reasoning process of the model more meticulous and precise. For the sequence of nodes processed by multiple graph attention layers, we retrace their importance along the meta-path to construct meta-path-based cross-modal attention, thereby reinforcing and verifying the previous reasoning.

3. The Proposed Approach

Given the input visual information and question information, a VQA solver extracts features from each single modality, fuses features between the visual and language modalities, and finally predicts the answer based on the learned joint representation. The VQA model can thus be divided into three steps: feature extraction, feature fusion, and answer prediction.
Our proposed Path-wise Attention Memory Network (PAM) also follows this structure, as shown in Figure 1. In feature fusion, we use a composite attention network composed of co-attention, self-attention, and path-wise attention. The node impact factor is used to self-reinforce the signal. In the following subsections, we first introduce the problem definition and then present the components of PAM in this order.

3.1. Problem Definition

As most existing approaches do, we regard the visual question answering task as a multi-class problem rather than a generation problem. Let (X,Y,Z) denote the training set, where X is the space of grounded images, Y is the space of questions, and Z is the space of labels. Following previous VQA methods, we consider the multi-class classification problem with binary cross-entropy (BCE) loss.
$$P = \mathrm{Sigmoid}\left( f(X, Y; \theta) \right) \tag{1}$$

$$L(P, Z) = -\sum_{i=1}^{C} \left[ z_i \log(p_i) + (1 - z_i) \log(1 - p_i) \right] \tag{2}$$
Here, $C$ denotes the number of categories, and $f(\cdot)$ represents the VQA model that fuses the visual and language information into a $C$-dimensional predictive vector.
Our model takes fine-grained visual features and question features as input: the image is divided into multiple regions according to the different objects detected in it, and the question text is divided into consecutive words. The image regions and question words are collectively called nodes.
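As a concrete illustration of this objective, the following PyTorch sketch computes the sigmoid-plus-BCE loss of Equations (1) and (2); the batch size, the number of candidate answers, and the random tensors standing in for the fusion model output are placeholders for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: B samples, C candidate answers.
B, C = 32, 3129

logits = torch.randn(B, C)    # stands in for f(X, Y; theta), the fused multi-modal logits
targets = torch.rand(B, C)    # soft labels in [0, 1] derived from the annotators' answers

# BCEWithLogitsLoss fuses the Sigmoid of Eq. (1) with the binary cross-entropy of Eq. (2).
criterion = nn.BCEWithLogitsLoss(reduction="sum")
loss = criterion(logits, targets) / B
print(loss.item())
```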

3.2. Build Fine-Grained Feature Vectors

The visual information used in our model is fine-grained image region information. A Faster R-CNN [51] model pre-trained on the Visual Genome dataset is used to detect objects in the original images, and the image region features are then extracted. Each image is represented as a feature matrix $R \in \mathbb{R}^{m \times 2048}$ with at most $m = 100$ regions.
The input question is first tokenized into a word sequence and padded or truncated to the maximum length of $n = 14$; each word embedding has 600 dimensions. The word sequence is then transformed into a feature matrix $W \in \mathbb{R}^{n \times 1024}$ by a bidirectional RNN with gated recurrent units (GRUs) [52]. In order to better capture the overall information in the language modality, we use a simple self-attention mechanism [41] to fuse the information of the entire word sequence into a global question node with the same dimension of 1024, which is concatenated after the word features, so the final language feature matrix is $15 \times 1024$.
To facilitate the subsequent processing of cross-modal and intra-modal graph attention information, two linear layers with ReLU activation and 0.1 dropout are used to map the visual features and language features to the same dimension $D$. That is, the visual features are represented as $R \in \mathbb{R}^{N_1 \times D}$ and the language features as $W \in \mathbb{R}^{N_2 \times D}$. For inputs with fewer than $N_1$ or $N_2$ nodes, zero-padding is used to add fake empty nodes, and the attention logits of empty nodes are filled with negative infinity by using node masks in the following attention modules.
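The padding-and-masking convention described above can be sketched as follows; the helper name, the reduced hidden dimension, and the single-linear-layer projections are our illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

def pad_nodes(feat, max_nodes):
    """Zero-pad a (n, d) feature matrix to (max_nodes, d) and return a boolean node mask."""
    n, d = feat.shape
    padded = torch.zeros(max_nodes, d)
    padded[:n] = feat
    mask = torch.zeros(max_nodes, dtype=torch.bool)
    mask[:n] = True                      # True = real node, False = empty padding
    return padded, mask

# Project both modalities to a common hidden dimension (D = 512 here for brevity).
D = 512
proj_r = nn.Sequential(nn.Linear(2048, D), nn.ReLU(), nn.Dropout(0.1))
proj_w = nn.Sequential(nn.Linear(1024, D), nn.ReLU(), nn.Dropout(0.1))

regions, r_mask = pad_nodes(torch.randn(36, 2048), max_nodes=100)   # N1 = 100
words, w_mask = pad_nodes(torch.randn(12, 1024), max_nodes=15)      # N2 = 15
R, W = proj_r(regions), proj_w(words)

# In later attention blocks, masking keeps empty nodes at ~zero weight after Softmax.
logits = R @ R.t() / D ** 0.5
logits = logits.masked_fill(~r_mask.unsqueeze(0), float("-inf"))
```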

3.3. Attention Blocks with Memory

Taking the input sequences of regions R and words W, we use a composite attention network composed of three types of attention blocks to fuse the dual-modal information.
In this subsection, we first introduce the standard attention block and then introduce our memory-improved attention blocks, that is, co-attention with guard gates, self-attention with conditioning gates, and the path-wise attention block. We then describe the node impact factor used in co-attention and self-attention.

3.3.1. Standard Attention Block

For the standard Transformer attention [44], given the input features $I \in \mathbb{R}^{N \times D}$, three linear layers project it to the query, key, and value matrices and divide it into multiple heads, $Q, K, V \in \mathbb{R}^{H \times N \times D_h}$,

$$Q, K, V = \mathrm{Linear}(I), \tag{3}$$

$$\mathrm{Att} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{D_h}}\right), \tag{4}$$
where "Linear" denotes a fully connected layer, $D_h$ is the dimension of each head, and $K^{\top}$ denotes the transpose of $K$. The query $Q$ and key $K$ are used to calculate the attention matrix $\mathrm{Att}$, which is regarded as the weight of the value feature $V$; the weighted sum of the values is then fed to a feed-forward network (FFN) to obtain the updated feature $I^{*}$,

$$I^{*} = \mathrm{FFN}(\mathrm{Att}\,V), \tag{5}$$

$$\mathrm{FFN}(x) = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(x))). \tag{6}$$
"ReLU" retains only positive values; that is, $\mathrm{ReLU}(x) = \max(0, x)$.
Many works follow the attention mechanism of the Transformer, which brings a large computing and memory overhead. Our work introduces gates into it and prunes it. To obtain a simpler network while retaining good performance, we trim the single attention module of the Transformer appropriately, deleting the feed-forward network and some projection linear layers. The details are covered separately in the specific attention blocks.
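A minimal PyTorch sketch of the standard attention block in Equations (3)-(6) is given below; the joint Q/K/V projection and the illustrative sizes are our own choices, not necessarily the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandardAttention(nn.Module):
    """Multi-head attention followed by an FFN, following Equations (3)-(6)."""

    def __init__(self, dim, heads):
        super().__init__()
        self.h, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)      # joint projection to Q, K, V
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def split(self, x):                          # (N, D) -> (H, N, Dh)
        return x.view(x.size(0), self.h, self.dh).transpose(0, 1)

    def forward(self, x):
        q, k, v = (self.split(t) for t in self.qkv(x).chunk(3, dim=-1))        # Eq. (3)
        att = F.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)      # Eq. (4)
        out = (att @ v).transpose(0, 1).reshape(x.size(0), -1)                 # Att V
        return self.ffn(out), att                                              # Eqs. (5)-(6)

x = torch.randn(100, 512)
y, att = StandardAttention(dim=512, heads=8)(x)
print(y.shape, att.shape)   # torch.Size([100, 512]) torch.Size([8, 100, 100])
```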

3.3.2. Co-Attention with Guard Gates

The co-attention block, as shown in Figure 2, learns to capture the attention score between each region–word pair. This information flow structure is able to learn cross-modal relations and facilitate cross-modal information interaction.
Given the visual region features and word features, we first transform the input features into query, key, and value features.
$$Q = \mathrm{Linear}(R), \quad K = W \tag{7}$$

$$V_r = Q, \quad V_w = W \tag{8}$$
For the co-attention block, the query and the value of the visual modality are the head-divided visual region features with linear projection; the key and the value of the language modality are the head-divided word features without linear projection.
By calculating the matrix product of the query and key and then applying the Softmax function on different dimensions, we obtain two inter-modality attention matrices.
$$\mathrm{Att}_{r2w} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{D_h}},\, 1\right) \tag{9}$$

$$\mathrm{Att}_{w2r} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{D_h}},\, 2\right)^{\top} \tag{10}$$

where $\mathrm{Att}_{r2w} \in \mathbb{R}^{H \times N_1 \times N_2}$ represents the regions' attention to words, and $\mathrm{Att}_{w2r} \in \mathbb{R}^{H \times N_2 \times N_1}$ represents the words' attention to regions.
Then, the two attention matrices are used to gather cross-modal information from the values of regions or words. To reduce the noise caused by the difference between the two modal feature spaces, we use guard gates to check and filter the noisy information in cross-modal values. The guard gates, denoted $g_r$ and $g_w$, are computed by average pooling the target modality's values and are multiplied element-wise with the source modality's values to reduce the noise of cross-modal message passing.
$$g_r = \mathrm{Avg}(V_r), \quad g_r \in \mathbb{R}^{H \times D_h} \tag{11}$$

$$g_w = \mathrm{Avg}(V_w), \quad g_w \in \mathbb{R}^{H \times D_h} \tag{12}$$

$$\mathrm{Avg}(Y) = \frac{1}{N} \sum_{i}^{N} y_i, \quad Y \in \mathbb{R}^{H \times N \times D_h} \tag{13}$$
We denote the updated information flows as the updated region features $R^{*}$ and updated word features $W^{*}$, respectively,

$$R^{*} = \mathrm{Linear}\!\left(\mathrm{Att}_{r2w}\,(V_w \odot g_r)\right) \odot r_{\mathrm{imp}} \tag{14}$$

$$W^{*} = \mathrm{Att}_{w2r}\,(V_r \odot g_w) \odot w_{\mathrm{imp}} \tag{15}$$

Here $\odot$ denotes the element-wise (broadcast) product, $r_{\mathrm{imp}}$ is the accumulated region importance, and $w_{\mathrm{imp}}$ is the accumulated word importance; see details in Section 3.3.5.
The updated region features and word features are then fed into the following self-attention block to learn the mutual relations between nodes of the same modality.
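The guard-gated co-attention update of Equations (7)-(15) can be sketched as below. The head-splitting helpers, the softmax dimensions, and the externally supplied linear layers and importance vectors are our assumptions for illustration; the released code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def split_heads(x, h):                     # (N, D) -> (H, N, Dh)
    n, d = x.shape
    return x.view(n, h, d // h).transpose(0, 1)

def merge_heads(x):                        # (H, N, Dh) -> (N, D)
    return x.transpose(0, 1).reshape(x.size(1), -1)

def co_attention(R, W, lin_q, lin_out, r_imp, w_imp, heads=8):
    """Guard-gated co-attention between regions R (N1, D) and words W (N2, D)."""
    dh = R.size(-1) // heads
    Q = split_heads(lin_q(R), heads)                       # Eq. (7): projected region query
    K = split_heads(W, heads)                              # key = words, no projection
    V_r, V_w = Q, K                                        # Eq. (8)

    logits = Q @ K.transpose(-2, -1) / dh ** 0.5           # (H, N1, N2)
    att_r2w = F.softmax(logits, dim=-1)                    # Eq. (9): regions attend to words
    att_w2r = F.softmax(logits, dim=-2).transpose(-2, -1)  # Eq. (10): words attend to regions

    g_r = V_r.mean(dim=1, keepdim=True)                    # Eq. (11): guard gate of regions
    g_w = V_w.mean(dim=1, keepdim=True)                    # Eq. (12): guard gate of words

    R_new = lin_out(merge_heads(att_r2w @ (V_w * g_r))) * r_imp.unsqueeze(-1)   # Eq. (14)
    W_new = merge_heads(att_w2r @ (V_r * g_w)) * w_imp.unsqueeze(-1)            # Eq. (15)
    return R_new, W_new, att_r2w, att_w2r

# Illustrative usage with placeholder sizes (D = 512, N1 = 100, N2 = 15).
R, W = torch.randn(100, 512), torch.randn(15, 512)
R2, W2, a_rw, a_wr = co_attention(R, W, nn.Linear(512, 512), nn.Linear(512, 512),
                                  torch.ones(100), torch.ones(15))
```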

3.3.3. Self-Attention with Conditioning Gates

Co-attention promotes the fusion of relevant information between visuals and language, while self-attention facilitates the accurate location of important nodes through multi-hop relations. For example, for the question, “What is the man doing under the tree?”, the self-attention block helps the model focus on the right man by recognizing the relation between the man and the tree.
Two symmetrical self-attention (SA) blocks are used for regions and words, respectively; the SA block for regions is shown in Figure 3. Our proposed self-attention is based on the standard Transformer attention, which we prune and extend with conditioning gates.
The conditioning gate from the other modality is multiplied element-wise with the query and key to build context-appropriate relations between nodes. We then obtain a pair of different attention matrices from the gated query and key: $\mathrm{Att}_{r2r} \in \mathbb{R}^{H \times N_1 \times N_1}$ represents the regions' attention to regions, and $\mathrm{Att}_{w2w} \in \mathbb{R}^{H \times N_2 \times N_2}$ represents the words' attention to words.
$$\mathrm{Att}_{r2r} = \mathrm{Softmax}\!\left(\frac{\left(\mathrm{Linear}(R) \odot \mathrm{Avg}(W)\right)\left(R \odot \mathrm{Avg}(W)\right)^{\top}}{\sqrt{D_h}}\right) \tag{16}$$

$$\mathrm{Att}_{w2w} = \mathrm{Softmax}\!\left(\frac{\left(\mathrm{Linear}(W) \odot \mathrm{Avg}(R)\right)\left(W \odot \mathrm{Avg}(R)\right)^{\top}}{\sqrt{D_h}}\right) \tag{17}$$
In the self-attention blocks, the image region features and word features are finally updated as follows,

$$R^{*} = \left(\mathrm{Att}_{r2r}\,R\right) \odot r_{\mathrm{imp}} \tag{18}$$

$$W^{*} = \left(\mathrm{Att}_{w2w}\,W\right) \odot w_{\mathrm{imp}} \tag{19}$$
The details of accumulated importance are described in Section 3.3.5.
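A corresponding sketch of the conditioning-gated self-attention for regions, following Equations (16) and (18), is given below; the symmetric block for words swaps the roles of R and W. Gating before the head split and the helper names are our simplifications, not the released implementation.

```python
import torch
import torch.nn.functional as F

def split_heads(x, h):                     # (N, D) -> (H, N, Dh)
    n, d = x.shape
    return x.view(n, h, d // h).transpose(0, 1)

def gated_self_attention(R, W, lin_q, r_imp, heads=8):
    """Region self-attention conditioned on the averaged word feature (Eqs. (16), (18))."""
    dh = R.size(-1) // heads
    gate = W.mean(dim=0, keepdim=True)                 # conditioning gate from the other modality
    Q = split_heads(lin_q(R) * gate, heads)            # gated query (with linear projection)
    K = split_heads(R * gate, heads)                   # gated key (projection pruned)
    att_r2r = F.softmax(Q @ K.transpose(-2, -1) / dh ** 0.5, dim=-1)
    out = att_r2r @ split_heads(R, heads)              # aggregate region values
    out = out.transpose(0, 1).reshape(R.size(0), -1)   # merge heads back to (N1, D)
    return out * r_imp.unsqueeze(-1), att_r2r          # calibrate by accumulated importance

# Illustrative usage (D = 512, N1 = 100, N2 = 15 are placeholder sizes).
R, W = torch.randn(100, 512), torch.randn(15, 512)
R_new, att = gated_self_attention(R, W, torch.nn.Linear(512, 512), torch.ones(100))
```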

3.3.4. Path-Wise Attention Block

A co-attention block and two self-attention blocks have been used previously; hence, four attention matrices have been memorized. These memorized attention matrices serve two purposes. One is to calculate the global node importance, which is described in Section 3.3.5; the other, more important, usage is to calculate a path-wise attention matrix.
The path-wise attention block is a special co-attention block without linear projections from features to query and key. Instead, the four memorized attention matrices are used to calculate its attention matrix, which can influence all the single-hop attention blocks along the entire path during backpropagation. Adding a PA block after the stack of single-hop attention modules enables the model not only to gradually transmit information in single-hop neighborhoods, but also to adjust the signal strength along the constructed multi-hop path.
To be clear, the four memorized matrices from the previous attention blocks are the region-to-region attention $\mathrm{Att}_{r2r}$, word-to-word attention $\mathrm{Att}_{w2w}$, region-to-word attention $\mathrm{Att}_{r2w}$, and word-to-region attention $\mathrm{Att}_{w2r}$. To calculate the path-wise attention matrix, we first sum each attention matrix to obtain four importance scores while keeping the multi-head structure. For example,

$$S_{r2w} = \sum_{i}^{N_1} \mathrm{Att}_{r2w,i}, \quad S_{r2w} \in \mathbb{R}^{H \times N_2 \times 1} \tag{20}$$

$$S_{w2w} = \sum_{i}^{N_2} \mathrm{Att}_{w2w,i}, \quad S_{w2w} \in \mathbb{R}^{H \times N_2 \times 1} \tag{21}$$

Here $S_{r2w}$ represents the words' importance scores as rated by the visual regions, and $S_{w2w}$ represents the words' importance scores as rated by the language words. The importance scores rated by the two different modalities can be combined into a unified rating score; we perform a matrix multiplication with a tanh activation function to obtain both the unified region rating score and the unified word rating score. Then we can calculate the PA attention matrix scores,
$$S = \tanh\!\left(S_{w2r} S_{r2r}^{\top}\right) \mathrm{Att}_{r2w} \odot \left(\tanh\!\left(S_{r2w} S_{w2w}^{\top}\right) \mathrm{Att}_{w2r}\right)^{\top}, \tag{22}$$

$$\mathrm{scores} = \mathrm{Conv1d}(S), \tag{23}$$

where $S$ comprehensively considers the importance assigned to each node by the previous attention modules, and the final PA attention matrix $\mathrm{scores} \in \mathbb{R}^{H \times N_1 \times N_2}$ is further fused by a one-dimensional convolution. This calculation helps each attention block of the model construct same-level interdependence during backpropagation. The interactivity between attention blocks is stronger, so the PA block has the ability to directly regulate the blocks along the entire attention path.
As shown in Figure 4, the rest of the PA block is similar to the CA block. We calculate the cross-modal path-wise attention matrices by applying Softmax on different dimensions of the same attention scores,

$$\mathrm{Att}^{p}_{r2w} = \mathrm{Softmax}(\mathrm{scores}, 1), \tag{24}$$

$$\mathrm{Att}^{p}_{w2r} = \mathrm{Softmax}(\mathrm{scores}, 2)^{\top}. \tag{25}$$
The two path-wise attention matrices are then used to update the values and output the updated $R^{*}$ and $W^{*}$. The guard gates on the values mentioned in Equations (11) and (12) are also adopted in PA.
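The path-wise attention scores of Equations (20)-(23) can be sketched as follows. The shapes follow the text, but the way the 1-D convolution mixes heads (treating heads as channels with kernel size 1) is our reading of the formula and not necessarily the released implementation.

```python
import torch
import torch.nn as nn

def path_attention_scores(att_r2w, att_w2r, att_r2r, att_w2w, conv):
    """Path-wise attention scores from four memorized attention matrices (Eqs. (20)-(23))."""
    # Node importance as rated by each modality: sum the attention each node receives.
    s_r2w = att_r2w.sum(dim=1, keepdim=True).transpose(-2, -1)   # (H, N2, 1): words rated by regions
    s_w2w = att_w2w.sum(dim=1, keepdim=True).transpose(-2, -1)   # (H, N2, 1): words rated by words
    s_w2r = att_w2r.sum(dim=1, keepdim=True).transpose(-2, -1)   # (H, N1, 1): regions rated by words
    s_r2r = att_r2r.sum(dim=1, keepdim=True).transpose(-2, -1)   # (H, N1, 1): regions rated by regions

    # Unified rating matrices reweight the memorized cross-modal attention (Eq. (22)).
    region_rating = torch.tanh(s_w2r @ s_r2r.transpose(-2, -1))  # (H, N1, N1)
    word_rating = torch.tanh(s_r2w @ s_w2w.transpose(-2, -1))    # (H, N2, N2)
    S = (region_rating @ att_r2w) * (word_rating @ att_w2r).transpose(-2, -1)   # (H, N1, N2)

    # Fuse head-wise scores with a 1-D convolution, treating heads as channels (Eq. (23)).
    return conv(S.permute(1, 0, 2)).permute(1, 0, 2)             # (H, N1, N2)

# Illustrative usage with placeholder sizes.
H, N1, N2 = 8, 100, 15
conv = nn.Conv1d(H, H, kernel_size=1)
att_r2w = torch.rand(H, N1, N2).softmax(-1)
att_w2r = torch.rand(H, N2, N1).softmax(-1)
att_r2r = torch.rand(H, N1, N1).softmax(-1)
att_w2w = torch.rand(H, N2, N2).softmax(-1)
scores = path_attention_scores(att_r2w, att_w2r, att_r2r, att_w2w, conv)
```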

3.3.5. Node Impact Factor

The node impact factor measures the relative importance of a node according to how much attention it has received in the attention network; we also refer to it as the global node importance in this paper. This concept is introduced to amplify the signals of important nodes and reduce the signals of irrelevant nodes after each attention operation.
$$\mathrm{imp}^{*} = \mathrm{Sigmoid}\!\left((1 - \alpha)\,\mathrm{imp} + \alpha \sum_{i}^{H} \sum_{j}^{N} \mathrm{Att}\right). \tag{26}$$
The global node importance is updated after each attention block; it is obtained by summing the attention matrix over the other dimension and applying a sigmoid function, $\mathrm{imp} \in \mathbb{R}^{N}$. In order to remember the importance of nodes along the whole attention path, a cumulative strategy is adopted to update the importance. $H$ is the number of heads, and $\alpha$ is the cumulative coefficient, ranging from 0 to 1; experiments show that values of $\alpha$ between 0.9 and 0.96 give better performance. Each region or word feature is multiplied by its global node importance to adjust the signal strength of the node.
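A minimal sketch of the cumulative update in Equation (26) follows; the choice of which attention matrix feeds which importance vector and the broadcasting details are our assumptions for illustration.

```python
import torch

def update_importance(imp, att, alpha=0.95):
    """Cumulative node impact factor (Eq. (26)).

    imp: (N,) previous importance of the N nodes receiving attention.
    att: (H, M, N) attention matrix whose last dimension indexes those nodes.
    """
    received = att.sum(dim=(0, 1))                       # total attention received per node
    return torch.sigmoid((1 - alpha) * imp + alpha * received)

# Example: update word importance from the region-to-word attention matrix,
# then rescale the word features by their importance.
H, N1, N2, D = 8, 100, 15, 512
att_r2w = torch.rand(H, N1, N2).softmax(dim=-1)
w_imp = update_importance(torch.ones(N2), att_r2w)
W = torch.randn(N2, D) * w_imp.unsqueeze(-1)             # calibrate signal strength
```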

3.4. Answer Prediction

After several blocks of feature updating by CA, SA, and PA, we obtain the final visual region and language word features by gathering node information via robust multi-hop bi-modality neighborhood relationships. As described in Figure 1, two attention blocks are used to turn the region and word features into a global visual feature $r \in \mathbb{R}^{D}$ and a global language feature $w \in \mathbb{R}^{D}$. Each attention block uses a multi-layer perceptron to compress the input's dimension from $D$ to 1, then uses the Softmax function to obtain each item's contribution to the global representation, $a \in \mathbb{R}^{N_1 \times 1}$, and finally computes a matrix multiplication between the input and $a$.
$$a_r = \mathrm{Softmax}\!\left(\mathrm{MLP}(R), 2\right), \quad a_r \in \mathbb{R}^{N_1 \times 1}, \tag{27}$$

$$r = R^{\top} a_r, \tag{28}$$

$$a_w = \mathrm{Softmax}\!\left(\mathrm{MLP}(W), 2\right), \quad a_w \in \mathbb{R}^{N_2 \times 1}, \tag{29}$$

$$w = W^{\top} a_w. \tag{30}$$
The global features $r$ and $w$ are finally fused to generate the answer prediction,

$$\mathrm{answer} = \mathrm{MLP}\!\left(\mathrm{FC}([r, w]) + r + w\right), \tag{31}$$

$$\mathrm{FC}(x) = \mathrm{ReLU}(x W_x + b), \tag{32}$$

$$\mathrm{MLP}(x) = \mathrm{ReLU}(x W_1 + b_1) W_2 + b_2. \tag{33}$$
FC is a linear layer with ReLU and 0.1 dropout. The MLP maps the fused feature vector from the hidden dimension to $C$, where $C$ is the number of the most frequent answers in the training set. The binary cross-entropy (BCE) loss is then calculated from the prediction and the data labels by Equation (2).
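The prediction head of Equations (27)-(33) can be sketched as follows; layer widths and the answer vocabulary size are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerHead(nn.Module):
    """Attention pooling of regions/words followed by fused answer prediction."""

    def __init__(self, dim, num_answers):
        super().__init__()
        self.pool_r = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.pool_w = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.fc = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Dropout(0.1))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_answers))

    def forward(self, R, W):
        a_r = F.softmax(self.pool_r(R), dim=0)        # (N1, 1), Eq. (27)
        a_w = F.softmax(self.pool_w(W), dim=0)        # (N2, 1), Eq. (29)
        r = (R * a_r).sum(dim=0)                      # (D,), Eq. (28)
        w = (W * a_w).sum(dim=0)                      # (D,), Eq. (30)
        fused = self.fc(torch.cat([r, w], dim=-1)) + r + w
        return self.mlp(fused)                        # answer logits, fed to sigmoid + BCE

head = AnswerHead(dim=512, num_answers=3129)
logits = head(torch.randn(100, 512), torch.randn(15, 512))
```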

4. Experiment

4.1. Datasets

We conduct our experiments mainly on VQA2.0 [53], but also test out-of-distribution generalization on VQA-CP [54].
VQA2.0 [53] is the most commonly used VQA benchmark dataset. It is composed of human-annotated question–answer pairs for real images from the Microsoft COCO dataset [55]. Compared with the previous VQA 1.0 [56], VQA 2.0 has many more annotations and less dataset bias. The entire dataset is split into training (82,783 images and 443,757 QA pairs), validation (40,504 images and 214,354 QA pairs), and test-standard (81,434 images and 447,793 QA pairs) sets. Additionally, the test-dev set is a 25% subset of the test-standard set. An average of 3 questions are generated for each image; 10 answers are collected for each image–question pair by human annotators, and the most frequent answer is treated as the ground truth. The results are divided into three per-type accuracies (yes/no, number, and other) and an overall accuracy according to question type.
The VQA-CP v2 [54] dataset (visual question answering under changing priors) was constructed by reorganizing VQA2.0 such that the distribution of answers for each question type differs between the training set and the test set, to prevent the model from over-relying on prior knowledge of potential relationships between questions and answers.

4.2. Implementation Details

Our model is implemented in PyTorch. All layers use PyTorch's default initialization; that is, the initial weights of the linear layers and Conv1d layers are sampled from a uniform distribution. We use cross-validation to evaluate the accuracy under different hyperparameters and select the group with the best performance. The final hyperparameter settings are given in Table 1. The batch size is set to 64. We adopt 8 attention heads (H = 8) in CA/SA/PA, the hidden dimension is 2048 (D = 2048), and the dimension of each head is 256 ($D_h$ = 2048/8).
The dropout rate of the fully connected layers is set to 0.1. We use binary cross-entropy as the loss function with the Adamax optimizer. The basic learning rate is set to 0.001, and a warm-up strategy is adopted: the learning rates of the first three epochs are 0.0005, 0.001, and 0.0015, and then the learning rate remains at 0.002 until the 10th epoch, from which it decays with a step length of 2 epochs and a decay rate of 0.5. The cumulative coefficient of the global node importance is set to 0.95.
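One reading of this warm-up and step-decay schedule is sketched below; the model parameters and optimizer setup are placeholders, and the mapping of epochs to rates follows the description above under our interpretation that decay starts at epoch 10.

```python
import torch

def epoch_lr(epoch):
    """Learning rate per epoch: three warm-up epochs, a plateau, then step decay."""
    warmup = [0.0005, 0.001, 0.0015]
    if epoch < len(warmup):
        return warmup[epoch]
    if epoch < 10:
        return 0.002
    return 0.002 * (0.5 ** ((epoch - 10) // 2 + 1))   # halve every 2 epochs from epoch 10

# Placeholder parameters stand in for the full PAM model.
params = [torch.nn.Parameter(torch.randn(4, 4))]
optimizer = torch.optim.Adamax(params, lr=1.0)        # lr overwritten each epoch below

for epoch in range(15):                               # 15 epochs as in Table 1
    for group in optimizer.param_groups:
        group["lr"] = epoch_lr(epoch)
    # ... one training epoch with the BCE loss would run here ...
```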

4.3. Baselines

In this paper, the proposed PAM method is compared with several baselines which can be divided into two categories based on the dataset they are focused on.
One is mainly concerned with the performance on VQA2.0, including Teney et al., 2018 [57], DCN [25], DRAU [58], BLOCK [26], Zhang et al., 2020 [59], MuRel [28], RAMEN [60], and ReGAT-implicit [27].
The other is mainly concerned with VQA-CP to verify its generalization performance, including AReg [24], Grand et al., 2019 [61], CSS-UpDn [30], Teney et al., 2021 [31], and Whitehead et al., 2021 [62].
To reflect the improvements in the model structure, we select their models with similar feature extraction methods and without extra annotations, without ensembling or tuning. Here is a brief introduction to these baselines.
  • Teney et al., 2018 [57] adopts the question/image early joint embedding and the single-layer question-guided image self-attention mechanism.
  • DCN [25] designs a bi-directional interactions hierarchy network stacked by several dense co-attention maps to fuse visual and language features.
  • DRAU [58] combines convolutional attention and recurrent attention to promote bi-modal fusion reasoning.
  • BLOCK [26] is a multi-modal fusion method based on the tensor composition which can perform fine-grained inter-modal representation while maintaining strong single-modal representation ability.
  • Zhang et al., 2020 [59] designs a question-guided top-down visual attention block and a question-guided convolutional relational reasoning block.
  • MuRel [28] adopts a method similar to the graph attention mechanism to learn the inter-regional relations of the image and updates the regions hidden representation under the guidance of questions and image geometric information.
  • RAMEN [60] performs an early fusion of fine-grained image region features and the global question feature and then processes the fused feature with a bi-directional gated recurrent unit (bi-GRU), but without an attention or bilinear pooling mechanism [63].
  • ReGAT-implicit [27] models implicit relations between visual regions via graph attention network.
  • AReg [24] prevents the VQA model from capturing language bias by introducing the question-only adversary to encourage visual grounding.
  • Grand et al., 2019 [61] investigate how to alleviate language bias by introducing adversarial regularization.
  • CSS-UpDn [30] forces the model to reason correctly by adding custom counterfactual samples to the training data; then the model could achieve better performance on VQA-CP.
  • Teney et al., 2021 [31] adopts a new multiple environment training scheme to improve the out-of-distributed generalization. We choose its best non-ensemble model to compare with our PAM in Table 2.
  • Whitehead et al., 2021 [62] utilize both labeled and unlabeled image–question pairs and try to separate skills and concepts for the model to improve its generalization.

4.4. Results and Analysis

The PAM is trained on the training set of VQA2.0 [53]. Table 2 shows its overall performance on the VQA2.0 test-dev, test-std, and val splits using the accuracy metric [56] in comparison with the baselines, and it is also evaluated on the VQA-CP v2 [54] dataset to demonstrate its generalizability. Our PAM achieves state-of-the-art overall accuracy on VQA2.0 (69.24% vs. 68.41% on the test-std split, 66.20% vs. 65.93% on the val split) and 41.60% overall accuracy on the test set of VQA-CP. This indicates that PAM is not only the best performing among the baselines focused on VQA2.0, but also very competitive against the VQA-CP-focused baselines, which means our PAM relies less on prior knowledge and more on correct inference and understanding and generalizes well to the VQA-CP dataset.
To be more specific about the generalizability comparison, the bottom five models in Table 2 focus on VQA out-of-distribution generalization, which leads them to achieve higher accuracy on the VQA-CP dataset, but obviously at the cost of a huge performance degradation on the VQA2.0 dataset. Take Grand et al., 2019 [61] for example: compared with ReGAT-implicit, it gains a 2.2% performance improvement on the VQA-CP test set, but suffers a 14.01% performance degradation on the VQA2.0 val set.
Our PAM does not specifically optimize performance on VQA-CP v2 like the models in the bottom five rows of Table 2; however, without adversarial regularization or optimized training data, our PAM achieves competitive performance on the VQA-CP dataset and avoids the obvious degradation on VQA2.0. PAM outperforms the other popular models by ±0.8% on the VQA-CP test set and obtains an improvement of at least 3.45% on the VQA2.0 val set over the bottom five models. This shows that our model achieves a more suitable balance between generalization ability and fitting ability.

4.5. Ablation Study

Ablation studies are performed on the VQA2.0 [53] validation dataset to evaluate the effectiveness of our proposed actions. The results are shown in Table 3.
The "PAM" model in the first row is our complete model with all components. The next few rows progressively remove blocks from PAM. "-PA" indicates that path-wise attention is removed, i.e., the block marked in purple in Figure 1, while "-PA+CA" means that PA is replaced with CA. "-imp" indicates that the global node importance applied after each SA and CA block is removed; that is, the operations marked in red at the end of SA (Figure 3) and CA (Figure 2) are deleted. Removing the guard gates marked in orange in CA (Figure 2) is shown as "-gates in CA" in Table 3, and as "-gates in CA/SA" if the gates in SA are also removed.
Comparing "PAM" with "-PA", we achieve a 0.51% performance improvement by adding the path-wise attention. To eliminate the influence of the change in the number of parameters, we also experiment with replacing PA with CA in "-PA+CA", as PA is designed as a special CA and their parameters are mainly derived from two linear layers. The results in Table 3 show that "-PA+CA" does not achieve a performance similar to PAM and even shows a decrease compared to "-PA". We believe the reason is that CA is only a single-hop attention module and, unlike PA, cannot summarize and strengthen the multi-hop neighborhood formed by multiple attention modules along the whole path.
Comparing "-PA" with "-PA-imp", we achieve a 0.68% performance improvement by applying the global node importance. The "imp" trick is universal and easy to apply to all popular attention blocks with almost no extra overhead. Moreover, the cumulative importance is well suited to concatenating multiple models in boosting-based ensemble learning. Due to limited computing resources, we only tried concatenating two PAMs with small dimensions, and this showed a performance improvement over a single PAM. The gate vectors are inspired by the conditioning gating vector in DFAF; we make some adaptive changes to the calculation and apply gates to both SA and CA for different purposes, whereas the gating vector in DFAF is only used to guide SA learning. The improved guard gates in CA are effective, as removing them from "-PA-imp" decreases accuracy from 65.01 to 64.66.
From Table 3, removing the gates in SA from "-PA-imp" brings a performance improvement, while removing the gates from PAM brings a performance decrease. We believe the reason is that the gates in SA complement and reinforce the other proposed strategies in PAM, which is beneficial to the overall performance.
Figure 5 shows the validation accuracy versus epochs for different ablations of our model; our PAM model maintains its accuracy advantage throughout the training process.

4.6. Distribution of Attention and Prediction

In Figure 6, we analyze the distribution of the eight attention heads. Each colored line represents a head in the attention block; its calculation formula is similar to Equation (20). Compared with the model without the path-wise attention block, PAM successfully focuses on the important words and regions.
In Figure 7 and Figure 8, our PAM gives higher confidence to the correct answers compared with the model without the path-wise attention block.

5. Conclusions

In this paper, we proposed a novel path-wise attention memory (PAM) network framework for visual question answering. Four attention matrices of self-attention and co-attention are memorized by the network; they are used to calculate the global node importance and a new path-wise attention matrix. The cumulative node importance calibrates the signal strength of regions and words after each single-hop attention block. The path-wise attention directly guides the attention matrix learning of the single-hop attention modules as a central governor. The gate vectors in SA/CA/PA are used to enhance the interaction between the visual and language modalities. These components improve model effectiveness, and PAM shows good accuracy and generalization performance in the experiments.

Author Contributions

Conceptualization, C.Z.; Methodology, Z.H. and J.L.; Resources, H.Y.; Software, Y.X.; Visualization, Y.X. and L.Z.; Writing—original draft, Y.X.; Writing—review & editing, C.Z. and L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62072166, 61836016) and the Natural Science Foundation of Hunan Province (2022JJ40190).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Our Code is available at https://github.com/bluehat999/PAM-for-VQA (accessed on 21 July 2022). VQA2.0 and VQA1.0 download at: https://visualqa.org/download.html (accessed on 21 July 2022), and VQA-CP download at: https://computing.ece.vt.edu/~aish/vqacp (accessed on 21 July 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Notations

The following notations are used in this manuscript:
r: Final global visual vector
w: Final global question vector
α: Cumulative coefficient of the node impact factor
R: Image region features
W: Question word features
C: Number of categories
D: Hidden dimension
N_1: Number of image regions
N_2: Number of question words
H: Number of attention heads
D_h: Dimension of attention heads
W_{1,2,x,...}: Weights of linear layers
b_{1,2,...}: Biases of linear layers
Avg: Average pooling layer
Linear: Linear layer

References

  1. Kim, J.; Koh, J.; Kim, Y.; Choi, J.; Hwang, Y.; Choi, J.W. Robust Deep Multi-Modal Learning Based on Gated Information Fusion Network. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 90–106. [Google Scholar]
  2. Dou, Q.; Liu, Q.; Heng, P.A.; Glocker, B. Unpaired multi-modal segmentation via knowledge distillation. IEEE Trans. Med. Imaging 2020, 39, 2415–2425. [Google Scholar] [CrossRef]
  3. Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360. [Google Scholar] [CrossRef]
  4. Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1780–1790. [Google Scholar]
  5. Yu, H.; Zhang, C.; Li, J.; Zhang, S. Robust sparse weighted classification For crowdsourcing. IEEE Trans. Knowl. Data Eng. 2022, 1–13. [Google Scholar] [CrossRef]
  6. Mun, J.; Cho, M.; Han, B. Text-guided attention model for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  7. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086. [Google Scholar]
  8. Jiang, M.; Huang, Q.; Zhang, L.; Wang, X.; Zhang, P.; Gan, Z.; Diesner, J.; Gao, J. Tiger: Text-to-image grounding for image caption evaluation. arXiv 2019, arXiv:1909.02050. [Google Scholar]
  9. Ding, S.; Qu, S.; Xi, Y.; Wan, S. Stimulus-driven and concept-driven analysis for image caption generation. Neurocomputing 2020, 398, 520–530. [Google Scholar] [CrossRef]
  10. Rohrbach, M.; Qiu, W.; Titov, I.; Thater, S.; Pinkal, M.; Schiele, B. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 433–440. [Google Scholar]
  11. Dong, J.; Li, X.; Snoek, C.G. Predicting visual features from text for image and video caption retrieval. IEEE Trans. Multimed. 2018, 20, 3377–3388. [Google Scholar] [CrossRef]
  12. Ding, S.; Qu, S.; Xi, Y.; Wan, S. A long video caption generation algorithm for big video data retrieval. Future Gener. Comput. Syst. 2019, 93, 583–595. [Google Scholar] [CrossRef]
  13. Wang, L.; Zhu, L.; Dong, X.; Liu, L.; Sun, J.; Zhang, H. Joint feature selection and graph regularization for modality-dependent cross-modal retrieval. J. Vis. Commun. Image Represent. 2018, 54, 213–222. [Google Scholar] [CrossRef]
  14. Zhang, C.; Liu, M.; Liu, Z.; Yang, C.; Zhang, L.; Han, J. Spatiotemporal activity modeling under data scarcity: A graph-regularized cross-modal embedding approach. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  15. Gao, D.; Jin, L.; Chen, B.; Qiu, M.; Li, P.; Wei, Y.; Hu, Y.; Wang, H. Fashionbert: Text and image matching with adaptive loss for cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 25–30 July 2020; pp. 2251–2260. [Google Scholar]
  16. Xie, D.; Deng, C.; Li, C.; Liu, X.; Tao, D. Multi-task consistency-preserving adversarial hashing for cross-modal retrieval. IEEE Trans. Image Process. 2020, 29, 3626–3637. [Google Scholar] [CrossRef] [PubMed]
  17. Mithun, N.C.; Sikka, K.; Chiu, H.P.; Samarasekera, S.; Kumar, R. Rgb2lidar: Towards solving large-scale cross-modal visual localization. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 934–954. [Google Scholar]
  18. Zhang, C.; Song, J.; Zhu, X.; Zhu, L.; Zhang, S. Hcmsl: Hybrid cross-modal similarity learning for cross-modal retrieval. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–22. [Google Scholar] [CrossRef]
  19. Zhang, C.; Xie, F.; Yu, H.; Zhang, J.; Zhu, L.; Li, Y. PPIS-JOIN: A novel privacy-preserving image similarity join method. Neural Process. Lett. 2021, 54, 2783–2801. [Google Scholar] [CrossRef]
  20. Zhang, C.; Zhong, Z.; Zhu, L.; Zhang, S.; Cao, D.; Zhang, J. M2guda: Multi-metrics graph-based unsupervised domain adaptation for cross-modal Hashing. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 21–24 August 2021; pp. 674–681. [Google Scholar]
  21. Zhu, L.; Zhang, C.; Song, J.; Liu, L.; Zhang, S.; Li, Y. Multi-graph based hierarchical semantic fusion for cross-modal representation. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  22. Zhu, L.; Zhang, C.; Song, J.; Zhang, S.; Tian, C.; Zhu, X. Deep Multi-Graph Hierarchical Enhanced Semantic Representation for Cross-Modal Retrieval. IEEE Multimed. 2022. [Google Scholar] [CrossRef]
  23. Zhu, C.; Zhao, Y.; Huang, S.; Tu, K.; Ma, Y. Structured attentions for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1291–1300. [Google Scholar]
  24. Ramakrishnan, S.; Agrawal, A.; Lee, S. Overcoming language priors in visual question answering with adversarial regularization. arXiv 2018, arXiv:1810.03649. [Google Scholar]
  25. Nguyen, D.K.; Okatani, T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6087–6096. [Google Scholar]
  26. Ben-Younes, H.; Cadene, R.; Thome, N.; Cord, M. Block: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8102–8109. [Google Scholar]
  27. Li, L.; Gan, Z.; Cheng, Y.; Liu, J. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 10313–10322. [Google Scholar]
  28. Cadene, R.; Ben-Younes, H.; Cord, M.; Thome, N. Murel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1989–1998. [Google Scholar]
  29. Gao, P.; Jiang, Z.; You, H.; Lu, P.; Hoi, S.C.; Wang, X.; Li, H. Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6639–6648. [Google Scholar]
  30. Chen, L.; Yan, X.; Xiao, J.; Zhang, H.; Pu, S.; Zhuang, Y. Counterfactual samples synthesizing for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10800–10809. [Google Scholar]
  31. Teney, D.; Abbasnejad, E.; van den Hengel, A. Unshuffling data for improved generalization in visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1417–1427. [Google Scholar]
  32. Zhou, B.; Tian, Y.; Sukhbaatar, S.; Szlam, A.; Fergus, R. Simple baseline for visual question answering. arXiv 2015, arXiv:1512.02167. [Google Scholar]
  33. Chen, K.; Wang, J.; Chen, L.C.; Gao, H.; Xu, W.; Nevatia, R. Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv 2015, arXiv:1511.05960. [Google Scholar]
  34. Ren, M.; Kiros, R.; Zemel, R. Exploring models and data for image question answering. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
  35. Shih, K.J.; Singh, S.; Hoiem, D. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4613–4621. [Google Scholar]
  36. Yu, Z.; Yu, J.; Cui, Y.; Tao, D.; Tian, Q. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6281–6290. [Google Scholar]
  37. Lu, J.; Yang, J.; Batra, D.; Parikh, D. Hierarchical question-image co-attention for visual question answering. Adv. Neural Inf. Process. Syst. 2016, 29, 289–297. [Google Scholar]
  38. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  39. Santoro, A.; Raposo, D.; Barrett, D.G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; Lillicrap, T. A simple neural network module for relational reasoning. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  40. Ghosh, S.; Burachas, G.; Ray, A.; Ziskind, A. Generating natural language explanations for visual question answering using scene graphs and visual attention. arXiv 2019, arXiv:1902.05715. [Google Scholar]
  41. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  42. Bapna, A.; Chen, M.X.; Firat, O.; Cao, Y.; Wu, Y. Training deeper neural machine translation models with transparent attention. arXiv 2018, arXiv:1808.07561. [Google Scholar]
  43. Zhang, H.; Kyaw, Z.; Chang, S.F.; Chua, T.S. Visual translation embedding network for visual relation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5532–5540. [Google Scholar]
  44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  45. Xu, H.; Saenko, K. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 451–466. [Google Scholar]
  46. Guo, W.; Zhang, Y.; Yang, J.; Yuan, X. Re-attention for visual question answering. IEEE Trans. Image Process. 2021, 30, 6730–6743. [Google Scholar] [CrossRef]
  47. Yu, T.; Yu, J.; Yu, Z.; Tao, D. Compositional attention networks with two-stream fusion for video question answering. IEEE Trans. Image Process. 2019, 29, 1204–1218. [Google Scholar] [CrossRef]
  48. Jiang, J.; Chen, Z.; Lin, H.; Zhao, X.; Gao, Y. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11101–11108. [Google Scholar]
  49. Kim, N.; Ha, S.J.; Kang, J.W. Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1708–1717. [Google Scholar]
  50. Teney, D.; Liu, L.; van Den Hengel, A. Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1–9. [Google Scholar]
  51. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  52. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  53. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913. [Google Scholar]
  54. Agrawal, A.; Batra, D.; Parikh, D.; Kembhavi, A. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4971–4980. [Google Scholar]
  55. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  56. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  57. Teney, D.; Anderson, P.; He, X.; Van Den Hengel, A. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4223–4232. [Google Scholar]
  58. Osman, A.; Samek, W. DRAU: Dual recurrent attention units for visual question answering. Comput. Vis. Image Underst. 2019, 185, 24–30. [Google Scholar] [CrossRef]
  59. Zhang, W.; Yu, J.; Hu, H.; Hu, H.; Qin, Z. Multimodal feature fusion by relational reasoning and attention for visual question answering. Inf. Fusion 2020, 55, 116–126. [Google Scholar] [CrossRef]
  60. Shrestha, R.; Kafle, K.; Kanan, C. Answer them all! toward universal visual question answering models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10472–10481. [Google Scholar]
  61. Grand, G.; Belinkov, Y. Adversarial regularization for visual question answering: Strengths, shortcomings, and side effects. arXiv 2019, arXiv:1906.08430. [Google Scholar]
  62. Whitehead, S.; Wu, H.; Ji, H.; Feris, R.; Saenko, K. Separating Skills and Concepts for Novel Visual Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5632–5641. [Google Scholar]
  63. Kim, J.H.; Jun, J.; Zhang, B.T. Bilinear attention networks. arXiv 2018, arXiv:1805.07932. [Google Scholar]
Figure 1. Illustration of the proposed Path-wise Attention Memory Network (PAM) for visual question answering. It contains three main parts: (1) feature extraction, which contains a Faster R-CNN for image region feature extraction and a gated recurrent unit for question word feature learning; (2) a composite attention network composed of co-attention, self-attention, and path-wise attention, which dynamically swaps and fuses information between the visual modality and the text modality; (3) an answer predictor, which computes the final fused multi-modal feature and maps features to the answer vector space.
Figure 2. The co-attention (CA) block used in PAM, with the guard gates (orange marked) and importance rectification (red marked).
Figure 3. The self-attention (SA) block used in PAM for regions, with the conditioning gate (orange marked) and importance rectification (red marked). A symmetrical SA block for words produces $\mathrm{Att}_{w2w}$.
Figure 4. The path-wise attention (PA) block with the guard gate (orange marked). The specific calculation method of score is given by the formula.
Figure 5. Validation accuracy versus epochs for the proposed VQA model.
Figure 6. An example of attention distribution by 8 heads, the left two from PAM and the right two from “-PA”. The path-wise attention block helps the model focus on useful information to answer the question.
Figure 7. Examples of model prediction. The second column is the TOP5 prediction of the “-PA” model, and the third column is the TOP5 prediction of PAM.
Figure 8. Examples of model prediction for complicated questions. The second column is the TOP5 prediction of the “-PA” model, and the third column is the TOP5 prediction of PAM.
Table 1. Setting of hyperparameters.

Hyperparameter                  Value
Hidden dimension                2048
Number of attention heads       8
Learning rate (lr)              0.001
Decay start epoch of lr         10
Decay interval of lr            2
Decay factor of lr              0.5
Dropout factor                  0.1
Batch size                      64
Epochs                          15
Table 2. Model accuracy on the VQA2.0 and VQA-CP benchmarks.

Model                          VQA2.0 Test-Dev   VQA2.0 Test-Std   VQA2.0 val   VQA-CP Test
Teney et al., 2018 [57]        65.32             65.67             63.15        -
DCN [25]                       66.60             67.00             -            -
DRAU [58]                      66.45             66.85             -            -
BLOCK [26]                     66.41             67.92             -            -
Zhang et al., 2020 [59]        67.20             67.34             -            -
MuRel [28]                     68.03             68.41             65.14        39.54
RAMEN [60]                     65.96             -                 -            39.21
ReGAT-implicit [27]            67.6              67.81             65.93        40.13
Our PAM                        69.01             69.24             66.20        41.60
AReg [24]                      -                 -                 62.75        41.17
Grand et al., 2019 [61]        -                 -                 51.92        42.33
CSS-UpDn [30]                  -                 -                 59.21        41.16
Teney et al., 2021 [31]        -                 -                 61.08        42.39
Whitehead et al., 2021 [62]    -                 -                 61.08        41.71
Table 3. Ablation studies of our proposed components on the VQA2.0 validation dataset. "-" denotes removing the corresponding component from the complete PAM model.

Model                          VQA v2 val
PAM                            66.20
-PA                            65.69
-PA+CA                         65.32
-PA-imp                        65.01
-gate in SA                    66.11
-gates in CA/SA/PA             65.85
-PA-imp-gates in CA/PA         64.66
-PA-imp-gates in SA            65.58
-PA-imp-gates in CA/SA/PA      65.34
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

Xiang, Y.; Zhang, C.; Han, Z.; Yu, H.; Li, J.; Zhu, L. Path-Wise Attention Memory Network for Visual Question Answering. Mathematics 2022, 10, 3244. https://doi.org/10.3390/math10183244

