Article

Hateful Memes Detection Based on Multi-Task Learning

1 Engineering Research Center of Cyberspace, Yunnan University, Kunming 650091, China
2 School of Software, Yunnan University, Kunming 650091, China
3 Yunnan Key Laboratory of Statistical Modeling and Data Analysis, School of Mathematics and Statistics, Yunnan University, Kunming 650091, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(23), 4525; https://doi.org/10.3390/math10234525
Submission received: 29 October 2022 / Revised: 22 November 2022 / Accepted: 26 November 2022 / Published: 30 November 2022

Abstract

With the popularity of posting memes on social platforms, the severe negative impact of hateful memes is growing. Because existing detection models have lower detection accuracy than humans, hateful memes detection remains a challenge for statistical learning and artificial intelligence. This paper proposes a multi-task learning method consisting of a primary multimodal task and two unimodal auxiliary tasks to address this issue. We introduce a self-supervised generation strategy in the auxiliary tasks to generate unimodal auxiliary labels automatically. Meanwhile, we use BERT and RESNET as the backbones for text and image classification, respectively, and then combine them with a late fusion method. In the training phase, the backward guidance technique and the adaptive weight adjustment strategy are used to capture the consistency and variability between different modalities, numerically improving the hateful memes detection accuracy as well as the generalization and robustness of the model. Experiments conducted on the Facebook AI multimodal hateful memes dataset show that our model outperforms the compared models in prediction accuracy.

1. Introduction

Memes are elements of a cultural or behavioral system transmitted from one person to another through imitation or other non-genetic means. Memes come in various types and formats, including but not limited to images, videos, and posts, and they are increasingly influential on social platforms. The vast number of memes on the Internet has become a conspicuous problem. Memes not only express people’s natural emotions but may also cause emotional harm. The most popular form of meme is an image containing text, which is the type we are interested in. Usually, an ordinary sentence or a picture does not carry any special emotional meaning, but when combined, they become meaningful. Hateful memes thus emerge and are becoming an increasingly serious problem in modern society. People with malicious motives use such memes, with misleading content, hateful speech, and harmful images, to attack vulnerable or targeted people.
Nowadays, social media giants such as Facebook, Twitter, and Weibo are working to identify hateful memes and have removed thousands of them to protect users. However, it is impossible for humans to manually review every meme at the massive scale of the Internet. Researchers have explored statistical tools [1,2] and machine learning techniques [3,4] with optimization algorithms [5] to address this issue. The probability upper bounds of the generalization errors of simple models are well studied [6,7], but statisticians are still struggling to explain the generalization ability of large artificial neural networks [8]. Meanwhile, machines cannot understand contextual information as humans do, and detecting hateful memes remains a challenging problem for statistical learning and artificial intelligence. Owing to the development of sentiment analysis (hate being one emotion) and artificial intelligence, we can build our research on the work of previous researchers [9,10,11,12]. However, the available sentiment analysis methods have limited usefulness in practice because hate is not always as easy to identify as other emotions, and these methods do not explain generalization ability statistically. Most early studies on hateful memes focused on unimodal hateful text detection, classifying hateful, abusive, or offensive texts against individuals or groups according to gender, nationality, or sexual orientation [13,14]. These studies on hate detection are enlightening, but they cannot handle hateful memes detection, which combines visual and textual elements. In addition, some hateful attacks against specific groups are very subtle. To further improve the accuracy of detecting hateful memes, we have to extend these methods to multimodal learning.
Baltrušaitis et al. [15] identified the main difficulties and challenges in multimodal learning as representation, translation, alignment, fusion, and co-learning, with representation learning arguably having the most critical impact on multimodal learning. According to how guidance is applied in representation learning, existing methods can be divided into forward guidance and backward guidance. Forward guidance projects unimodal representations together into a shared subspace [16] with an interaction module for obtaining information from different modalities [10,17,18,19]. However, the uniformity of multimodal labels makes it difficult to obtain information specific to a single modality. Backward guidance adds extra regularization terms to the optimization objectives [20] to guide feature learning by gradient descent and thus learn the variability across modalities [21,22]; we prefer this method.
Multi-task learning is a learning paradigm in machine learning that learns multiple related tasks jointly and leverages the useful information contained in them [23]. It can learn multiple related tasks simultaneously and maximize the use of information from each modality in multimodal data, and it can therefore be used to further enhance the accuracy of hateful memes detection. Usually, multi-task learning is designed with a primary classification task and some auxiliary tasks to enhance the feature learning capability. Nevertheless, this introduces the requirement of independent labels for the auxiliary tasks, which are time-consuming and labor-intensive to obtain by manual labeling [22]. Yu et al. [24] designed a self-supervised unimodal label generation module to overcome this problem. This method can obtain appropriate labels automatically without requiring access to any further data.
Thus, we propose a new approach to detecting hateful memes using a multi-task learning method that balances the unilateral information extracted from each modality separately and the fuzzy information from the fused multimodal representation, without introducing further data or manual labels, while reducing the generalization errors. We conduct a primary task to learn multimodal features and classify hateful memes. Meanwhile, two auxiliary tasks are used in the training phase to learn unimodal features and classify the hatefulness of text and images. Moreover, we use two self-supervised label generation modules to generate unimodal labels for the auxiliary tasks automatically. Finally, we applied our method to the Facebook AI hateful memes dataset [25] and achieved competitive results. In contrast with previous works, the main contributions of this work are as follows:
  • A new artificial intelligence model is proposed for hateful memes detection. It effectively improves hateful memes detection accuracy, outperforming the compared models.
  • The multi-task strategy and the adaptive weight adjustment strategy used in our model capture the consistency and variability between different modalities and numerically improve the generalization and robustness of the model.
  • Our auxiliary tasks, which use a self-supervised unimodal auxiliary label generation module, enhance the feature learning capability without human-defined labels or additional data.
The remaining part of this paper is organized as follows. Section 2 introduces related works. Section 3 shows our hateful memes detection model’s framework and algorithm. Next, experiments with real data and their results are presented in Section 4. Section 5 summarizes this work.

2. Related Works

Hateful memes detection is a binary classification task on multimodal data containing text and images. As detection accuracy is still a big challenge for this task, we introduce a multi-task learning strategy into our model to address it. We therefore review text, image, and multimodal classification models and combine ideas from them with multi-task learning.

2.1. Datasets

As research on multimodal sentiment analysis constantly evolves, many multimodal sentiment analysis datasets have emerged. The CMU-MOSEI dataset [26] is one of the largest trimodal sentiment analysis datasets and has both sentiment and emotion labels. It contains seven categories of sentiment, from negative to positive, and six categories of emotion: anger, happiness, sadness, surprise, fear, and disgust. The CMU-MOSEI dataset has been extensively studied in the literature.
However, hate is a special emotion, and its expression is subtle and not easy to detect, requiring more sophisticated reasoning. A growing number of researchers are focusing on hate analysis, especially multimodal hateful memes detection. MMHS150K [27] is a multimodal hate speech dataset collected and annotated from Twitter, consisting of images and text. Facebook launched a competition called the “Hateful Memes Challenge” and released a dataset of over 10,000 memes [25]. It is also a multimodal dataset consisting of images and text, but it uses methods such as “benign confounders” to make its hateful samples difficult to distinguish by unimodal methods.

2.2. Textual Model

Much of the early research on hate detection focused on hateful text detection. Warner et al. [28] developed a support vector machine (SVM) classifier to detect offensive language. The classifier distinguishes features extracted from the text and classifies whether a given text is malicious. At the same time, Djuric et al. [29] proposed using N-gram features to classify whether a text is offensive. As hateful text detection is a binary classification problem, many deep learning models are also available. TextCNN [30] was shown early on to perform well, and a variety of more advanced models for this task [31,32,33,34,35,36] have emerged in recent years. Among them, the best-performing model is BERT (Bidirectional Encoder Representations from Transformers) [37]. It is a pre-trained model proposed by Google AI Research that learns bidirectional representations with the help of the Transformer [38]. By using the Transformer's attention mechanism, BERT processes entire sequences in parallel to collect information about the context of each word and encodes it in a rich vector representation. There are two commonly used versions, BERT$_{BASE}$ (L = 12, H = 768, A = 12, 110 M total parameters) and BERT$_{LARGE}$ (L = 24, H = 1024, A = 16, 340 M total parameters), where L is the number of layers (Transformer blocks), H is the hidden size, and A is the number of self-attention heads. The pre-trained BERT is a highly generalizable model that no longer needs to be trained from scratch on large datasets for a specific task, saving time and improving efficiency. Figure 1 shows the architecture of BERT.
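As a brief illustration of how a pre-trained BERT can serve as a text feature extractor, the following sketch takes the final-layer [CLS] representation as the sentence feature. It uses the Hugging Face transformers library and the bert-base-uncased checkpoint as assumptions; the paper does not specify the exact toolkit or checkpoint.

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed checkpoint: the 12-layer BERT-base model (L = 12, H = 768, A = 12).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

text = "an example meme caption"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = bert(**inputs)

# Take the [CLS] token of the last hidden layer as a 768-dimensional text feature.
text_feature = outputs.last_hidden_state[:, 0, :]
print(text_feature.shape)  # torch.Size([1, 768])
```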

2.3. Visual Model

Another important part of memes is the image. Many image classification models have been developed, each with its own characteristics, and the most significant progress has come from neural networks for extracting image information. Early researchers focused on convolutional neural networks, such as VGG (Visual Geometry Group) [39] and RESNET (Residual Neural Network) [40]. VGG is a large model built from small 3 × 3 convolution kernels and is famous for its regularly designed, simple, and stackable convolution blocks. Compared with VGG and other neural networks, the most significant advantage of RESNET is that it introduces an identity mapping to construct a Residual Unit that computes residuals and solves the degradation problem that arises with a high number of layers. RESNET has different versions depending on the number of convolution layer blocks; the five standard versions are RESNET18, RESNET34, RESNET50, RESNET101, and RESNET152. Figure 2 shows the architecture of RESNET18 as an example. Recently, many studies have shown that the dependence on CNNs is not necessary, and Transformer models based on attention can also perform well [41]. Currently, both are extensively applied to image classification tasks.
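For reference, a minimal way to reuse a pre-trained RESNET as an image feature extractor is to drop the final classification layer and keep the pooled 2048-dimensional feature. The sketch below uses torchvision's resnet101 with standard ImageNet preprocessing; the exact pipeline used in the paper is not specified, so these choices are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Assumed backbone: ResNet-101 pre-trained on ImageNet, with the final fc layer removed.
resnet = models.resnet101(pretrained=True)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1])  # keep layers up to global average pooling
feature_extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("meme.png").convert("RGB")        # hypothetical input file
x = preprocess(image).unsqueeze(0)                   # shape: (1, 3, 224, 224)

with torch.no_grad():
    image_feature = feature_extractor(x).flatten(1)  # shape: (1, 2048)
print(image_feature.shape)
```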

2.4. Multimodal Model

As hateful memes detection is a multimodal classification task, fusion techniques and attention strategies from multimodal models can be used. Many late fusion models have shown outstanding performance, such as the concatenation model [42] and the multiplicative combining model [43]. The concatenation fusion model [42] fuses VGG16-based image features with BERT-based text features to train a multi-layer perceptron network for hate detection. The multiplicative combining model [43] can automatically focus on information from the more reliable modalities while reducing the emphasis on less reliable modalities during training. Later, more advanced multimodal models [44,45,46,47,48,49,50] were studied and designed; they extract visual-textual relationships by introducing an attention strategy. These models can be divided into two main categories: single-stream models and dual-stream models. In a single-stream model, the language and visual information are fused at the beginning and fed directly into the encoder together. A typical single-stream model is VisualBERT [49], which inputs both text and images into the model and then aligns and fuses the text and image information through the Transformer's self-attention. In a dual-stream model, the language and vision information first pass through two separate encoder modules, and the different modal information is then fused through a cross transformer, as in the ViLBERT [50] model. ViLBERT is a representative dual-stream model that does not directly fuse linguistic and image information at the beginning. Instead, the image and text first go through two different streams into the co-attention transformer layer, and the two streams then pass through multiple layers of interleaved co-transformer and transformer layers. This allows the corresponding visual information to be embedded when generating text features by attention, and vice versa.
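To make the distinction between fusion styles concrete, the sketch below contrasts a simple late fusion of unimodal decisions with a concatenation-based fusion of unimodal features followed by a multi-layer perceptron. It is an illustrative simplification under assumed feature dimensions, not the exact architectures of [42,43].

```python
import torch
import torch.nn as nn

class LogitLateFusion(nn.Module):
    """Late fusion: average the decisions of two independent unimodal classifiers."""
    def __init__(self, text_dim=768, image_dim=2048):
        super().__init__()
        self.text_head = nn.Linear(text_dim, 1)
        self.image_head = nn.Linear(image_dim, 1)

    def forward(self, f_text, f_image):
        return 0.5 * (self.text_head(f_text) + self.image_head(f_image))

class ConcatFusion(nn.Module):
    """Concatenation fusion: join unimodal features and classify with an MLP."""
    def __init__(self, text_dim=768, image_dim=2048, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, f_text, f_image):
        return self.mlp(torch.cat([f_text, f_image], dim=-1))
```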

2.5. Multi-Task Learning

Multi-task learning is prevalent in multimodal sentiment analysis [22,51,52,53], but few studies have applied multi-task learning to hateful memes detection. As multi-task learning can improve performance on a primary task by using information from auxiliary tasks [23], we bring this idea to hateful memes detection. We can learn similarity information from the multimodal task and differentiation information from the unimodal tasks to improve classification accuracy. Compared to single-task learning, there are two main challenges in the training phase of multi-task learning. The first is how to share network parameters; the leading solutions are soft parameter sharing and hard parameter sharing. Hard parameter sharing shares the hidden layers among all tasks while keeping a few task-specific output layers. In soft parameter sharing, each task has a separate model with its own parameters, and the distance between the model parameters is used as a regularization term to keep the parameters as similar as possible. The other challenge is handling the inconsistent convergence speed and training importance of different tasks, for which optimization methods such as GradNorm [54] can be used. Therefore, we introduce two unimodal auxiliary tasks to help the primary task improve its accuracy in hateful memes detection, and we use the hard parameter sharing method together with an adaptive weight adjustment strategy to address these two challenges.
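As a generic illustration of the two sharing schemes (not part of our implementation), the snippet below shows how a soft-sharing penalty can be computed between two task-specific encoders. Hard sharing, which we adopt, instead reuses a single shared bottom for all tasks, as sketched later in Section 3.2.

```python
import torch
import torch.nn as nn

def soft_sharing_penalty(model_a: nn.Module, model_b: nn.Module) -> torch.Tensor:
    """L2 distance between corresponding parameters of two task-specific models,
    used as a regularization term that keeps the two models similar."""
    penalty = torch.zeros(())
    for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
        penalty = penalty + torch.sum((p_a - p_b) ** 2)
    return penalty

# Hypothetical usage, assuming the two encoders share the same architecture:
# total_loss = loss_a + loss_b + reg_weight * soft_sharing_penalty(encoder_a, encoder_b)
```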

3. Method

This paper aims to design a model that balances the unilateral information extracted from each modality separately and the fuzzy information from the fused multimodal representation, without introducing further data or manual labels, while reducing the generalization errors. Modern methods for predicting and understanding data are rooted in both statistical and computational thinking, and algorithms are put on an equal footing with intuition, properties, and the abstract arguments behind them [55]. We therefore propose a new hateful memes detection method combining statistical theory with modern neural networks and optimization algorithms, and we describe it in detail in this section.
First, we introduce the setup of our model to illustrate its inputs and outputs. Next, we construct the multi-task learning model with a primary multimodal task and two unimodal auxiliary tasks to capture the consistency and variability between different modalities. As the dataset only provides manually annotated labels ($y_m$) for the primary task, we adopt a self-supervised method [24] to generate the unimodal labels ($y_u$). We then design an adaptive weight in the objective function to optimize the model and reduce the generalization errors. In the following, we call the multimodal labels m-labels and the unimodal labels u-labels, where $u \in \{t, v\}$.

3.1. Setup

Hateful memes detection is a binary classification task that uses text and image signals to judge whether a meme is hateful. Our model takes $I_t$ and $I_v$ as inputs after data processing and outputs the hateful intensity $\hat{y}_m \in \mathbb{R}$. In addition to the primary multimodal classification output $\hat{y}_m$, two unimodal auxiliary task outputs $\hat{y}_t$ and $\hat{y}_v$ are also produced to improve accuracy in the training phase. Obviously, $\hat{y}_m$ is the final result we are interested in.

3.2. Architecture

We designed a multi-task learning model that can generate auxiliary labels in a self-supervised way to detect multimodal hateful memes, as shown in Figure 3. The network consists of a primary multimodal task using BERT and RESNET to extract features and two unimodal auxiliary tasks that share the bottom feature learning network in a hard parameter sharing method.
The primary task is a multimodal classification network consisting of three steps: feature extraction, feature fusion, and classification output. Pre-trained models have performed very well in recent years, so we use two pre-trained models as the backbones of the two unimodal branches in the hateful memes detection task.
For text processing, we use the pre-trained twelve-layer BERT [37] to extract the text feature $F_t$:
$F_t = \mathrm{BERT}(I_t; \theta_t^{bert}),$
where $I_t$ is the text input and $\theta_t^{bert}$ denotes all parameters of the BERT model we used.
For image processing, we use the pre-trained RESNET101 [40] to extract the image feature $F_v$:
$F_v = \mathrm{RESNET}(I_v; \theta_v^{resnet}),$
where $I_v$ is the image input and $\theta_v^{resnet}$ denotes all parameters of the RESNET model we used.
Then, the text and image representations are concatenated as $F_m = [F_t; F_v]$ and projected onto a low-dimensional space:
$F_m^* = \sigma(W_1^m F_m + b_1^m),$
where $W_1^m$ and $b_1^m$ are the parameters of the first linear layer in the primary multimodal task and $\sigma$ is the activation function.
After that, we use the fused representation obtained from the linear layer and activation function to detect whether the meme is hateful:
$\hat{y}_m = W_2^m F_m^* + b_2^m,$
where $W_2^m \in \mathbb{R}^{d_m \times 1}$ and $b_2^m$ are the parameters of the second linear layer in the primary multimodal task.
The auxiliary tasks are two unimodal classification tasks that detect the presence of hateful sentiment in text and images, respectively. We project the unimodal features into a new feature space, which reduces the impact of the dimensional difference between different modalities. Moreover, the text and image auxiliary classification tasks share modal features with the primary multimodal classification task.
$F_u^* = \sigma(W_1^u F_u + b_1^u),$
where $u \in \{t, v\}$, and $W_1^u$ and $b_1^u$ are the parameters of the first linear layer in the unimodal auxiliary task.
Then, the results of the unimodal auxiliary tasks are obtained by
$\hat{y}_u = W_2^u F_u^* + b_2^u,$
where $u \in \{t, v\}$, and $W_2^u$ and $b_2^u$ are the parameters of the second linear layer in the unimodal auxiliary task.
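Putting the projection and classification steps above together, a minimal PyTorch sketch of the shared-bottom multi-task head structure could look as follows. The hidden size and the way backbone features are pooled are our assumptions for illustration; the BERT and RESNET101 backbones are assumed to produce the 768- and 2048-dimensional features $F_t$ and $F_v$, and ReLU is used as the activation function as stated in Section 4.3.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Shared-bottom multi-task heads on top of BERT and RESNET features.
    Assumed feature sizes: F_t is 768-d (BERT), F_v is 2048-d (RESNET101)."""
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=128):
        super().__init__()
        self.act = nn.ReLU()                                          # sigma
        # Primary multimodal task: project concatenated features, then classify.
        self.fuse_proj = nn.Linear(text_dim + image_dim, hidden_dim)  # W_1^m, b_1^m
        self.fuse_out = nn.Linear(hidden_dim, 1)                      # W_2^m, b_2^m
        # Unimodal auxiliary tasks: project each modality, then classify.
        self.text_proj = nn.Linear(text_dim, hidden_dim)              # W_1^t, b_1^t
        self.text_out = nn.Linear(hidden_dim, 1)                      # W_2^t, b_2^t
        self.image_proj = nn.Linear(image_dim, hidden_dim)            # W_1^v, b_1^v
        self.image_out = nn.Linear(hidden_dim, 1)                     # W_2^v, b_2^v

    def forward(self, f_t, f_v):
        f_m = torch.cat([f_t, f_v], dim=-1)        # F_m = [F_t; F_v]
        f_m_star = self.act(self.fuse_proj(f_m))   # F_m^*
        f_t_star = self.act(self.text_proj(f_t))   # F_t^*
        f_v_star = self.act(self.image_proj(f_v))  # F_v^*
        y_m = self.fuse_out(f_m_star)              # primary multimodal prediction
        y_t = self.text_out(f_t_star)              # text auxiliary prediction
        y_v = self.image_out(f_v_star)             # image auxiliary prediction
        return y_m, y_t, y_v, f_m_star, f_t_star, f_v_star
```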

3.3. Unimodal Label Generation Module

Since we need corresponding labels to guide the training of the two unimodal auxiliary tasks and manual labeling is too costly, we adopt a self-supervised label generation strategy to obtain the u-labels. We call this module the “Unimodal Label Generation Module” (ULGM), that is,
$y_u = \mathrm{ULGM}(y_m, F_m^*, F_u^*),$
where $u \in \{t, v\}$.
The ULGM generates labels for the unimodal auxiliary tasks based on the multimodal labels and the features of each modality. The unimodal label generation module has no parameters, which makes it a stand-alone module that does not affect the multi-task network. Based on the fact that unimodal labels are closely related to multimodal labels, this module calculates an offset value from the distance between each modal representation and the centers of the hateful and not-hateful classes.
Here, we calculate relative distances rather than absolute distance values, which avoids the error introduced by different modal features lying in different feature spaces. First, we keep the center of the hateful class ($C_k^h$) and the center of the not-hateful class ($C_k^n$) unchanged for the different modal features during the training phase. The hateful and not-hateful class centers are defined as:
$C_k^h = \frac{\sum_{j=1}^{N} \mathbb{I}(y_k^j > c) \cdot F_k^{jg}}{\sum_{j=1}^{N} \mathbb{I}(y_k^j > c)}, \qquad C_k^n = \frac{\sum_{j=1}^{N} \mathbb{I}(y_k^j < c) \cdot F_k^{jg}}{\sum_{j=1}^{N} \mathbb{I}(y_k^j < c)},$
where $k \in \{m, t, v\}$, $N$ is the sample size of the training set, $\mathbb{I}(\cdot)$ is an indicator function, $F_k^{jg}$ is the global representation of the $j$-th sample in modality $k$, and $c$ is a threshold value, which we set to 0.5 in our experiments.
Then we use the L2 norm to calculate the distance between the features and the hateful/not-hateful class centers, that is,
$D_k^h = \frac{\lVert F_k^* - C_k^h \rVert_2^2}{d_k}, \qquad D_k^n = \frac{\lVert F_k^* - C_k^n \rVert_2^2}{d_k},$
where $k \in \{m, t, v\}$ and $d_k$ is a scaling factor representing the feature dimension.
After the above calculations, we can compute the relative distance $\alpha_k$ between the modality representation and the hateful/not-hateful centers with
$\alpha_k = \frac{D_k^n - D_k^h}{D_k^h + \epsilon},$
where $k \in \{m, t, v\}$ and $\epsilon$ is a very small number used to avoid division by zero.
Obviously, $\alpha_k$ is positively related to $y_k$, so the ratio relationship between $y_u$ and $y_m$ can be summarized as:
$\frac{y_u}{y_m} \approx \frac{\hat{y}_u}{\hat{y}_m} \approx \frac{\alpha_u}{\alpha_m} \;\Rightarrow\; y_u = \frac{\alpha_u \cdot y_m}{\alpha_m}.$
To avoid the “zero value problem”, the difference relationship between $y_u$ and $y_m$ should also be considered, which means:
$(y_u - y_m) \approx (\hat{y}_u - \hat{y}_m) \approx (\alpha_u - \alpha_m) \;\Rightarrow\; y_u = y_m + \alpha_u - \alpha_m.$
By taking the equal-weight sum of the ratio and difference relationships above, we obtain the unimodal supervisions as follows:
$y_u = \frac{y_m \cdot \alpha_u}{2\alpha_m} + \frac{y_m + \alpha_u - \alpha_m}{2} = y_m + \frac{\alpha_u - \alpha_m}{2} \cdot \frac{y_m + \alpha_m}{\alpha_m} = y_m + \delta_{um},$
where $u \in \{t, v\}$ and $\delta_{um}$ is the offset of the unimodal supervision value from the given multimodal label.
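A compact sketch of the label generation described above (relative distances to the class centers and the offset) might look as follows. It assumes the class centers have already been accumulated over the training set, treats labels as scalar tensors in [0, 1], adds a small epsilon to the final denominator for numerical safety, and omits the epoch-wise momentum update discussed in Section 3.4.

```python
import torch

def relative_distance(f, center_hate, center_not, dim, eps=1e-8):
    """alpha_k: relative distance of a representation to the two class centers."""
    d_h = torch.sum((f - center_hate) ** 2) / dim  # D_k^h
    d_n = torch.sum((f - center_not) ** 2) / dim   # D_k^n
    return (d_n - d_h) / (d_h + eps)

def generate_unimodal_label(y_m, alpha_u, alpha_m, eps=1e-8):
    """y_u = y_m + delta_um, the equal-weight combination of the ratio and
    difference relationships between unimodal and multimodal labels."""
    delta = (alpha_u - alpha_m) / 2.0 * (y_m + alpha_m) / (alpha_m + eps)
    return y_m + delta

# Hypothetical usage, assuming class centers C_*_hate / C_*_not were accumulated
# over the training set as in the text:
# alpha_m = relative_distance(f_m_star, C_m_hate, C_m_not, dim=f_m_star.numel())
# alpha_t = relative_distance(f_t_star, C_t_hate, C_t_not, dim=f_t_star.numel())
# y_t = generate_unimodal_label(y_m, alpha_t, alpha_m)
```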

3.4. Optimization Objectives

In a binary classification task, since there are only positive and negative cases and their probabilities sum to 1, the model only needs to predict a single probability rather than a vector. We choose the binary cross-entropy loss as the base optimization objective, and the loss function is defined in a simplified way as follows:
$loss_k = -\left[ y_k \cdot \log(\hat{y}_k) + (1 - y_k) \cdot \log(1 - \hat{y}_k) \right],$
where $k \in \{m, t, v\}$.
As the hateful memes data are complicated, with two modalities, we designed multi-task learning to make the statistical inference. When we optimize the model, the extracted information may be fuzzy if we pay too much attention to the multimodal part. However, if we pay too much attention to the unimodal parts, the extracted information may be overly unilateral and weaken our primary task. In addition, the gradient magnitudes from backpropagating the different tasks' losses may differ. When backpropagating to the shared bottom part, a task with a small gradient magnitude has less weight in updating the model parameters, so the shared bottom does not learn enough for that task. Of course, we could simply introduce static weights to balance the gradients of the different tasks. However, this does not work well: if we assigned a small fixed weight to a task with a large gradient magnitude at the beginning of training, this small weight would keep limiting that task until the end of training, leaving the task under-learned and increasing the generalization errors [56,57]. Meanwhile, the information intensity may also differ among samples. Suppose the difference between the multimodal label $y_m^{(i)}$ and the generated unimodal label $y_u^{(i)}$ is large. In that case, the results from the different modalities diverge, and we should impose a larger weight on this sample to learn more information from it. Therefore, a data-driven weight should be imposed on different samples so that the objective function can be adaptively adjusted to balance the learning process.
Thus, we use the absolute difference between the generated unimodal label and the existing multimodal label, $|y_u^{(i)} - y_m|$, as the measure for weight adjustment. As we want to make more significant adjustments for samples with large distances and slight adjustments for samples with small distances, an ‘S’-type function mapping into $(0, 1)$ is preferred, such as $\tanh(\cdot)$, $\mathrm{elliot}(\cdot)$, $\arctan(\cdot)$, and $\mathrm{logit}(\cdot)$. We chose $\tanh(\cdot)$ here because its rapid change gives larger adjustments to the samples with large distances, and the weight of the $i$-th sample for auxiliary task $u$ can be expressed as $\omega_u^{(i)} = \tanh(|y_u^{(i)} - y_m|)$. Then the optimization objective is
$L = \frac{1}{N} \sum_{j=1}^{N} \Big( loss_m^j + \sum_{u \in \{t, v\}} \omega_u^j \, loss_u^j \Big),$
where $N$ is the sample size, $loss_m^j$ is the binary cross-entropy loss between the multimodal label and the multimodal prediction of the $j$-th sample, and $loss_u^j$ is the binary cross-entropy loss between the self-supervised generated unimodal label and the unimodal prediction of the $j$-th sample.
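A minimal sketch of the adaptively weighted objective is given below, using PyTorch's binary cross-entropy with logits. Treating the raw model outputs as logits and clamping the generated u-labels to [0, 1] so they are valid BCE targets are our assumptions about the implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(y_hat_m, y_hat_t, y_hat_v, y_m, y_t, y_v):
    """Primary BCE loss plus adaptively weighted auxiliary BCE losses.
    y_m: multimodal labels from the dataset; y_t, y_v: generated u-labels."""
    y_t = y_t.clamp(0.0, 1.0)
    y_v = y_v.clamp(0.0, 1.0)
    loss_m = F.binary_cross_entropy_with_logits(y_hat_m, y_m)
    # Per-sample weights: larger when a generated unimodal label diverges from y_m.
    w_t = torch.tanh(torch.abs(y_t - y_m))
    w_v = torch.tanh(torch.abs(y_v - y_m))
    loss_t = F.binary_cross_entropy_with_logits(y_hat_t, y_t, reduction="none")
    loss_v = F.binary_cross_entropy_with_logits(y_hat_v, y_v, reduction="none")
    return loss_m + (w_t * loss_t).mean() + (w_v * loss_v).mean()
```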
Since the modal representations change dynamically during training, the generated auxiliary labels are unstable. To mitigate this disadvantage, a momentum update strategy is introduced:
$y_u^{(i)} = \begin{cases} y_m & i = 1 \\ \dfrac{i-1}{i+1}\, y_u^{(i-1)} + \dfrac{2}{i+1}\, y_u^{(i)} & i > 1 \end{cases}$
where $u \in \{t, v\}$, $i$ denotes the $i$-th training epoch [58], and the $y_u^{(i)}$ on the right-hand side is the label newly generated in epoch $i$ before stabilization.
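The momentum update can be implemented as a small helper, sketched here; y_prev is the stabilized label from the previous epoch and y_new is the label freshly generated in the current epoch (our naming).

```python
def momentum_update(y_prev, y_new, y_m, epoch):
    """Stabilize the generated unimodal label across epochs (epoch is 1-indexed)."""
    if epoch == 1:
        # In the first epoch the unimodal label is initialized with the m-label.
        return y_m
    return (epoch - 1) / (epoch + 1) * y_prev + 2.0 / (epoch + 1) * y_new
```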
Finally, supervised by the m-labels in the dataset and the u-labels generated by the self-supervised module, the final result $\hat{y}_m$ for detecting whether each meme is hateful can be obtained. Overall, the entire algorithm (Algorithm 1) of our model is defined as follows:
Algorithm 1: The algorithm of our model in the training stage [24]

4. Experiments

4.1. Dataset

To validate the performance of our model, we chose the hateful memes dataset of the “Hateful Memes Challenge” [25] published by Facebook AI as our experimental dataset. It contains over 10,000 memes manually labeled as hateful or not according to a strict definition. The researchers carefully designed each meme and confounded the hateful memes with benign memes by methods such as “benign confounders”, as shown in Figure 4. These subtle designs make the memes difficult to detect accurately by unimodal methods; both text and image must be reasoned over jointly to obtain accurate detection results.

4.2. Compared Models

We compared our model with the advanced unimodal and multimodal models described in [59]. All models fall into two categories: unimodal models and multimodal models.
The unimodal models include image and text classification models, where the image classification models are Image-Grid and Image-Region, which differ in their features. The features of Image-Grid are ResNet-152 [40] convolutional features based on res-5c with average pooling. The features of Image-Region come from the fc6 layer of Faster-RCNN [60] and are based on ResNeXt-152. The text classification model is the twelve-layer BERT.
The multimodal models include Late Fusion, Concat BERT, MMBT-Grid, MMBT-Region, ViLBERT, and VisualBERT. Late Fusion averages the outputs of the unimodal text model BERT and the unimodal image model ResNet-152. Concat BERT concatenates the features of the unimodal image model ResNet-152 with those of the unimodal text model BERT. MMBT-Grid and MMBT-Region are both supervised multimodal transformer models, the former using Image-Grid features and the latter using Image-Region features. VisualBERT [49] is a single-stream model in which the text and image features are fused at the beginning of the model. ViLBERT [50] is a representative dual-stream model: instead of fusing linguistic and image information directly at the beginning, the text and image features first pass through two separate encoding modules, and the different modal information is then fused through a co-attention mechanism. Both ViLBERT and VisualBERT can be pretrained on unimodal and multimodal datasets. We use VisualBERT and ViLBERT with unimodal pretraining; VisualBERT COCO is VisualBERT trained on the multimodal dataset COCO [61], and ViLBERT CC is ViLBERT trained on the multimodal dataset Conceptual Captions [62].

4.3. Results

We compared the results of our model with the various unimodal and multimodal models on the hateful memes dataset. The activation function in our model is ReLU, and the threshold value for calculating the hateful/not-hateful class centers is set to 0.5. The results of the compared models on this dataset are from [59]. For the unimodal models, the performance is generally less satisfactory. In addition, the unimodal text model outperformed the unimodal image model, reflecting the fact that the text features may contain more information. The multimodal models outperformed the unimodal models. We also found that the fusion method affects their performance: models using early fusion methods outperformed those using late fusion methods. Regarding multimodal pretraining, there was little difference between the multimodal pretrained models and the unimodal pretrained models.
In contrast to the models mentioned above, our model uses a late fusion method and two unimodal pre-trained models. Although the late fusion method generally performed worse than the early fusion method, our model outperformed those early fusion models, thanks to the additional auxiliary learning. This validates the idea that adding multi-task learning to hateful memes detection can improve the accuracy of the task. Moreover, it may be helpful to fuse different unimodal pre-trained models using our method in future studies of similar tasks. The prediction accuracy results of these models are presented in Table 1.

4.4. Ablation Study

We added self-supervised multi-task learning with generated auxiliary labels to the task of hateful memes detection, which greatly improved its accuracy. However, we wanted to further investigate the effect of each unimodal auxiliary task on the overall model. Therefore, we set up this experiment to test the model by adding each unimodal auxiliary task separately and comparing the results in Table 2.
These results indicate that the accuracy of the multi-task model with only the unimodal textual auxiliary task or only the unimodal visual auxiliary task is very similar in hateful memes detection. Furthermore, both results are also very close to that of the multimodal-only model, which shows that the accuracy of detecting hateful memes can hardly be improved by adding a single unimodal auxiliary task alone. In contrast, the multi-task learning model was greatly enhanced by adding both the unimodal textual auxiliary task and the unimodal visual auxiliary task. Moreover, all the cases optimized using equal weights with $\omega_u^j = 1$ performed worse than the same models using the adaptive weight adjustment strategy. In conclusion, multi-task learning and the adaptive weight adjustment strategy helped improve the testing accuracy and reduce the generalization errors.

5. Conclusions

Our research aims to improve the accuracy and reduce the generalization errors of detecting hateful memes, which are widely available on the Internet and have severe negative impacts. For this purpose, we selected the multimodal dataset of hateful memes published by Facebook AI as our experimental dataset. Moreover, we designed a multi-task learning model that generates auxiliary labels in a self-supervised manner. The text classification model BERT and the image classification model RESNET were selected as the backbones, and a late fusion method was used. In the multi-task learning network, we added two unimodal auxiliary learning tasks, the textual and the visual auxiliary task, to the primary classification task. To solve the problem of lacking labels for the unimodal auxiliary tasks and the high cost of manual labeling, we chose a self-supervised label generation strategy for the auxiliary tasks. In the optimization phase, we added a data-driven adaptive weight adjustment strategy to balance the learning process and reduce the generalization errors. By comparing our multi-task learning model with various advanced models for hateful memes detection, we found that our model achieved more accurate results.
In the ablation experiments, we also found that it is difficult to improve the accuracy of the final classification results by simply adding a single unimodal auxiliary task to the multi-task learning network. Both the text and image auxiliary tasks should be introduced to achieve better results. In addition to its good performance, our method can easily be extended to fuse other unimodal models to solve similar problems. Although our experiments achieved good results, there is still much room for improvement. Our model and existing multimodal models are still far from reaching human accuracy (84.7%) on this task. We are trying to improve the accuracy of hateful memes detection from other perspectives: one is improving the adaptability of the backbone model and the multi-task learning network, and another is improving the feature fusion methods.

Author Contributions

Conceptualization, Y.Z. and S.Y.; software, Z.M. and S.G.; validation, Y.Z.; formal analysis, Z.M., L.W. and S.G.; investigation, Y.Z. and L.W.; resources, Y.Z., L.W. and S.Y.; writing—original draft preparation, Z.M. and Y.Z.; writing—review and editing, Y.Z. and S.G.; visualization, Z.M.; supervision, Y.Z. and S.Y.; project administration, Y.Z., L.W. and S.Y.; funding acquisition, Y.Z. and S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61863036), the China Postdoctoral Science Foundation (No. 2021M702778), the Fundamental Research Funds for the Central Universities (No. 2042022KF0021), and the Fundamental Research Plan of “Release Management Service” in Yunnan Province: Research on Multi-source Data Platform and Situation Awareness Application for Cross-border Cyberspace Security (No. 202001BB050076).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated and analysed are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer Science & Business Media: New York, NY, USA, 2013; Volume 31. [Google Scholar]
  2. Fan, J.; Li, R.; Zhang, C.H.; Zou, H. Statistical Foundations of Data Science; Chapman and Hall/CRC: New York, NY, USA, 2020. [Google Scholar]
  3. Hastie, T.; Tibshirani, R.; Friedman, J.H.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2. [Google Scholar]
  4. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  5. Bertsekas, D.P. Nonlinear programming. J. Oper. Res. Soc. 1997, 48, 334. [Google Scholar] [CrossRef]
  6. Tewari, A.; Bartlett, P.L. On the Consistency of Multiclass Classification Methods. J. Mach. Learn. Res. 2007, 8, 1007–1025. [Google Scholar]
  7. Zhang, T. Statistical analysis of some multi-category large margin classification methods. J. Mach. Learn. Res. 2004, 5, 1225–1251. [Google Scholar]
  8. Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  9. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. arXiv 2017, arXiv:1707.07250. [Google Scholar]
  10. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; NIH Public Access: Bethesda, MD, USA, 2019; Volume 2019, p. 6558. [Google Scholar]
  11. Poria, S.; Hazarika, D.; Majumder, N.; Mihalcea, R. Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research. IEEE Trans. Affect. Comput. 2020, 1. [Google Scholar] [CrossRef]
  12. Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 2006, 101, 138–156. [Google Scholar] [CrossRef] [Green Version]
  13. i Orts, Ò.G. Multilingual detection of hate speech against immigrants and women in Twitter at SemEval-2019 task 5: Frequency analysis interpolation for hate in speech detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 460–463. [Google Scholar]
  14. Burnap, P.; Williams, M.L. Hate speech, machine classification and statistical modelling of information flows on Twitter: Interpretation and communication for policy decision making. In Proceedings of the Internet, Policy & Politics Conference, Oxford, UK, 26 September 2014. [Google Scholar]
  15. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef] [Green Version]
  16. Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access 2019, 7, 63373–63394. [Google Scholar] [CrossRef]
  17. Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.P. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  18. Sun, Z.; Sarma, P.; Sethares, W.; Liang, Y. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8992–8999. [Google Scholar]
  19. Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P.; Hoque, E. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), online, 5–10 July 2020; NIH Public Access: Bethesda, MD, USA, 2020; Volume 2020, p. 2359. [Google Scholar]
  20. Wang, S.; Zhang, H.; Wang, H. Object co-segmentation via weakly supervised data fusion. Comput. Vis. Image Underst. 2017, 155, 43–54. [Google Scholar] [CrossRef]
  21. Hazarika, D.; Zimmermann, R.; Poria, S. Misa: Modality-invariant and-specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
  22. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3718–3727. [Google Scholar]
  23. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586–5609. [Google Scholar] [CrossRef]
  24. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 10790–10797. [Google Scholar]
  25. Kiela, D.; Firooz, H.; Mohan, A.; Goswami, V.; Singh, A.; Ringshia, P.; Testuggine, D. The hateful memes challenge: Detecting hate speech in multimodal memes. Adv. Neural Inf. Process. Syst. 2020, 33, 2611–2624. [Google Scholar]
  26. Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1: Long Papers, pp. 2236–2246. [Google Scholar]
  27. Gomez, R.; Gibert, J.; Gomez, L.; Karatzas, D. Exploring hate speech detection in multimodal publications. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1470–1478. [Google Scholar]
  28. Warner, W.; Hirschberg, J. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media, Montréal, QC, Canada, 7 June 2012; pp. 19–26. [Google Scholar]
  29. Djuric, N.; Zhou, J.; Morris, R.; Grbovic, M.; Radosavljevic, V.; Bhamidipati, N. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 29–30. [Google Scholar]
  30. Chen, Y. Convolutional Neural Network for Sentence Classification. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2015. [Google Scholar]
  31. Waseem, Z.; Davidson, T.; Warmsley, D.; Weber, I. Understanding abuse: A typology of abusive language detection subtasks. arXiv 2017, arXiv:1705.09899. [Google Scholar]
  32. Benikova, D.; Wojatzki, M.; Zesch, T. What does this imply? Examining the impact of implicitness on the perception of hate speech. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology, Berlin, Germany, 13–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 171–179. [Google Scholar]
  33. Wiegand, M.; Siegel, M.; Ruppenhofer, J. Overview of the germeval 2018 shared task on the identification of offensive language. In Proceedings of the 14th Conference on Natural Language Processing (KONVENS 2018), Vienna, Austria, 21 September 2018. [Google Scholar]
  34. Kumar, R.; Ojha, A.K.; Malmasi, S.; Zampieri, M. Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Santa Fe, NM, USA, 25 August 2018; pp. 1–11. [Google Scholar]
  35. Nobata, C.; Tetreault, J.; Thomas, A.; Mehdad, Y.; Chang, Y. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, Montreal, QC, Canada, 11–15 May 2016; pp. 145–153. [Google Scholar]
  36. Aggarwal, P.; Horsmann, T.; Wojatzki, M.; Zesch, T. LTL-UDE at SemEval-2019 Task 6: BERT and two-vote classification for categorizing offensiveness. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 678–682. [Google Scholar]
  37. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; NIPS: La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
  39. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  42. Sabat, B.O.; Ferrer, C.C.; Giro-i Nieto, X. Hate speech in pixels: Detection of offensive memes towards automatic moderation. arXiv 2019, arXiv:1910.02334. [Google Scholar]
  43. Liu, K.; Li, Y.; Xu, N.; Natarajan, P. Learn to combine modalities in multimodal deep learning. arXiv 2018, arXiv:1805.11730. [Google Scholar]
  44. Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 104–120. [Google Scholar]
  45. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 121–137. [Google Scholar]
  46. Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv 2019, arXiv:1908.08530. [Google Scholar]
  47. Aken, B.v.; Winter, B.; Löser, A.; Gers, F.A. Visbert: Hidden-state visualizations for transformers. In Proceedings of the Companion Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 207–211. [Google Scholar]
  48. Yu, F.; Tang, J.; Yin, W.; Sun, Y.; Tian, H.; Wu, H.; Wang, H. Ernie-vil: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 3208–3216. [Google Scholar]
  49. Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. Visualbert: A simple and performant baseline for vision and language. arXiv 2019, arXiv:1908.03557. [Google Scholar]
  50. Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems; NIPS: La Jolla, CA, USA, 2019; Volume 32. [Google Scholar]
  51. Liu, W.; Mei, T.; Zhang, Y.; Che, C.; Luo, J. Multi-task deep visual-semantic embedding for video thumbnail selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3707–3715. [Google Scholar]
  52. Zhang, W.; Li, R.; Zeng, T.; Sun, Q.; Kumar, S.; Ye, J.; Ji, S. Deep model based transfer and multi-task learning for biological image analysis. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; pp. 1475–1484. [Google Scholar]
  53. Akhtar, M.S.; Chauhan, D.S.; Ghosal, D.; Poria, S.; Ekbal, A.; Bhattacharyya, P. Multi-task learning for multi-modal emotion recognition and sentiment analysis. arXiv 2019, arXiv:1905.05812. [Google Scholar]
  54. Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the International Conference on Machine Learning (PMLR 2018), Stockholm, Sweden, 10–15 July 2018; pp. 794–803. [Google Scholar]
  55. Efron, B.; Hastie, T. Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and Data Science; Cambridge University Press: Cambridge, UK, 2021; Volume 6. [Google Scholar]
  56. Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. 2004, 32, 56–85. [Google Scholar] [CrossRef]
  57. Chen, D.R.; Sun, T. Consistency of multiclass empirical risk minimization methods based on convex loss. J. Mach. Learn. Res. 2006, 7, 2435–2447. [Google Scholar]
  58. Su, W.; Boyd, S.; Candes, E. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems; NIPS: La Jolla, CA, USA, 2014; Volume 27. [Google Scholar]
  59. Sandulescu, V. Detecting hateful memes using a multimodal deep ensemble. arXiv 2020, arXiv:2012.13235. [Google Scholar]
  60. Mao, H.; Yao, S.; Tang, T.; Li, B.; Yao, J.; Wang, Y. Towards real-time object detection on embedded systems. IEEE Trans. Emerg. Top. Comput. 2016, 6, 417–431. [Google Scholar] [CrossRef]
  61. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  62. Mokady, R.; Hertz, A.; Bermano, A.H. Clipcap: Clip prefix for image captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar]
Figure 1. The structure of BERT, built on the bidirectional Transformer [37].
Figure 2. The structure of RESNET18 for ImageNet [40].
Figure 3. The architecture of our method. $y_m$ is the multimodal label provided in the dataset, and $y_t$, $y_v$ are the auxiliary labels generated by the self-supervised label generation module for the unimodal text and image auxiliary tasks, respectively. $\hat{y}_m$ is the predicted output of the primary multimodal task, and $\hat{y}_t$, $\hat{y}_v$ are the predicted outputs of the unimodal text and image auxiliary tasks, respectively.
Figure 4. Example pictures in the experimental dataset. The memes in the first column are all hateful memes, the second column replaces only their images to make them not hateful, and the third column replaces only their text to make them not hateful.
Table 1. The prediction accuracy of different models on the “Hateful Memes Challenge” data set.

Type         Model              Validation   Test
Unimodal     Image-Grid         52.73%       52.00%
             Image-Region       52.66%       52.13%
             Text BERT          58.26%       59.20%
Multimodal   Late Fusion        61.53%       59.66%
             Concat BERT        58.60%       59.13%
             MMBT-Grid          58.20%       60.06%
             MMBT-Region        58.73%       60.23%
             ViLBERT            62.20%       62.30%
             Visual BERT        62.10%       63.20%
             ViLBERT CC         61.40%       61.10%
             Visual BERT COCO   65.06%       64.73%
             Our model          65.92%       66.30%

The results of compared models on the dataset are from [59]. We show the best performance results of our model on accuracy.
Table 2. The prediction accuracy of the multi-task learning models with the addition of different unimodal auxiliary tasks.

Model           Validation   Test
M               61.92%       63.40%
M, T_E          62.67%       63.10%
M, V_E          62.05%       62.24%
M, T            62.83%       63.45%
M, V            62.33%       62.60%
M, T_E, V_E     63.00%       64.65%
M, T, V         65.92%       66.30%

M is the model with the primary multimodal classification task only; M, T_E is the model with a primary multimodal classification task and a text classification auxiliary task using equal weights with $\omega_u^j = 1$; M, V_E is the model with a primary multimodal classification task and an image classification auxiliary task using equal weights with $\omega_u^j = 1$; M, T is the model with a primary multimodal classification task and a text classification auxiliary task; M, V is the model with a primary multimodal classification task and an image classification auxiliary task; M, T_E, V_E is the model with a primary multimodal classification task and two auxiliary tasks (text and image classification) using equal weights with $\omega_u^j = 1$; M, T, V is the model with a primary multimodal classification task and two auxiliary tasks (text and image classification).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
