Article

A Multi-View Interactive Approach for Multimodal Sarcasm Detection in Social Internet of Things with Knowledge Enhancement

School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(5), 2146; https://doi.org/10.3390/app14052146
Submission received: 1 February 2024 / Revised: 1 March 2024 / Accepted: 2 March 2024 / Published: 4 March 2024
(This article belongs to the Special Issue Future Trends in Intelligent Edge Computing and Networking)

Abstract

Multimodal sarcasm detection is a developing research field in the social Internet of Things and underpins work in artificial intelligence and human psychology. Sarcastic comments posted on social media often imply people's real attitudes toward the events they are commenting on, reflecting their current emotional and psychological state. At the same time, the limited memory of Internet of Things mobile devices makes it challenging to deploy sarcasm detection models, and an abundance of parameters also increases a model's inference time. Social networking platforms such as Twitter and WeChat have generated large amounts of multimodal data, which provide more comprehensive information than unimodal data. Therefore, sarcasm detection in the social Internet of Things must consider both inter-modal interaction and the number of model parameters. In this paper, we propose a lightweight multimodal interaction model with knowledge enhancement based on deep learning. By integrating visual commonsense knowledge into the sarcasm detection model, we enrich the semantic information of the image and text modal representations. We also develop a multi-view interaction method that facilitates interaction between modalities from different modal perspectives. The experimental results indicate that the proposed model outperforms the unimodal baselines and achieves performance comparable to the multimodal baselines with far fewer parameters.

1. Introduction

There is a close relationship between human activities and sarcastic communication, which plays an important role in daily life. Sarcasm is a distinctive form of expression; the Oxford Dictionary defines it as "a way of using words that are the opposite of what you mean in order to be unpleasant to someone or to make fun of them". Given a sample consisting of a sentence W and an image I, multimodal sarcasm detection aims to predict the sample's label from the set {Sarcasm, Non-Sarcasm}. From the Internet of Things perspective, identifying sarcastic information can help machines in many ways, such as building more anthropomorphic human–computer interaction applications on the social IoT, which is why sarcasm detection has attracted growing attention. The social Internet of Things is a "social network of intelligent objects" paradigm based on social relationships between objects; it can capture the Internet of Things and discover services and resources through social relationships [1,2]. Sarcasm detection can be applied in social Internet of Things applications such as sentiment analysis, opinion mining, and dialogue generation [3,4,5,6,7]. Sarcasm may mislead the predictions of sentiment analysis and opinion-mining methods, and sentiment analysis and sarcastic expression can influence a dialogue system's performance in social robots. Another common application is multimodal data analysis on mobile social platforms such as WeChat, Facebook, and Twitter. Early sarcasm detection research mainly used machine learning methods to mine information in text, such as viewpoints or emotional orientation, for text classification [8,9]. With advances in mobile social media technology, people can share their daily lives or comment on current events through multimodal posts composed of videos, images, and text on platforms such as Twitter and Weibo, leading to a substantial increase in the volume of available multimodal data. Sarcasm detection has therefore gradually shifted from unimodal to multimodal research [10]. This paper focuses on multimodal sarcasm detection for image–text pairs on Twitter. In this setting, we need to recognize that images and text express different psychological states of users; multimodal sarcasm detection thus requires extracting semantic information from both images and text while considering their incongruity.
For multimodal sarcasm detection composed of images and text, some researchers use concatenation operations [11], attention mechanisms, or graph neural networks to fuse multimodal data [12,13]. Although these studies have achieved excellent performance, there are still the following drawbacks:
  • They overlook commonsense knowledge’s role in supplementing the information of graphic and textual features. For example, subtitles and background descriptions of images can be used as visual commonsense knowledge to enhance unimodal feature representations. Intuitively, visual commonsense knowledge provides additional semantic information.
  • They lack effective modal interaction approaches. Current multimodal interaction approaches are overly simple and fail to view the multimodal sarcasm detection task from multiple perspectives.
  • They do not take into account the impact of model size on mobile devices. When processing multimodal data in particular, each modality's features must be extracted, which typically requires a larger modal fusion model.
In addition, most existing methods for sarcasm detection extract semantic representations for each modality separately and then fuse the multimodal features through complex networks. The sarcastic information embedded in text and images is mutually complementary, together expressing the user's current intentions and feelings. In the previous literature, each modality is modeled separately, and relatively little attention is paid to the interaction between modalities. Consequently, for sarcasm detection, it is essential to construct a multimodal interaction model that integrates visual commonsense knowledge.
For the above reasons, this paper focuses on multimodal sarcasm detection that can run on mobile devices with limited computing resources. We propose a multi-view interaction model with knowledge enhancement (MVK) for multimodal sarcasm detection. MVK mainly consists of two parts: interactive learning and feature fusion. Firstly, we employ the pre-trained model ResNet [14] to obtain each image's attribute labels, which effectively represent the object and background information of the image. These attribute labels serve as visual commonsense knowledge; from the perspective of multi-view learning, they can also be regarded as the image attribute view. We then extract the image, text, and visual commonsense knowledge representations. In the interaction learning stage, text and image features are learned through the self-influence and mutual influence of text and images under the guidance of visual commonsense knowledge. Specifically, we concatenate the knowledge features with the text and image features, respectively, and use an attention mechanism to obtain the importance of the knowledge for each modality and enhance the unimodal representations. Subsequently, an inter-modal interaction module lets the image and text information interact to learn inter-modal mutual and incongruous information. Within this module, the attention weight matrix from the text perspective is used to extract image features, whereas the attention weight matrix from the image perspective is used to extract text features; the attention weight matrix of each modality is optimized through interactive learning. Finally, in the feature fusion stage, a late attention mechanism fuses the representation vectors of each modality. The constructed multimodal sarcasm detection model also has few parameters. The experiments demonstrate that the presented approach performs excellently in multimodal sarcasm detection.
The main contributions of this paper can be stated as follows:
  • In this paper, we employ ResNet to acquire attribute information of images as visual commonsense knowledge and leverage visual commonsense knowledge to enhance the representation of each modality.
  • The proposed method enhances the multimodal information interaction. The model can learn the inter-modal differences and mutual information through different modal views and improve the model’s robustness.
  • A series of experiments and analyses indicate that the presented model can effectively utilize the information of text and image to improve the performance of multimodal sarcasm detection. Concurrently, the model we constructed has fewer parameters, reducing memory pressure and computational resources.
The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 provides a detailed description of the proposed method, and the experimental setting and results are presented in Section 4 and Section 5, respectively. Section 6 concludes the paper.

2. Related Work

2.1. Unimodal Sarcasm Detection

Early sarcasm detection primarily concentrates on the text modality [15,16,17] and extracts paragraph-level, sentence-level, or word-level fine-grained features from textual data. Research on textual sarcasm detection mainly falls into three categories: rule-based, machine learning-based, and deep learning-based. Riloff et al. [18] propose a rule-based algorithm that iteratively expands positive and negative phrases and then uses these learned words to predict irony labels, but it lacks sufficient adaptability. With the widespread use of machine learning algorithms, researchers began to extract manual textual features and design machine learning algorithms for sarcasm detection. Ghosh et al. [19] use an SVM as their classifier to solve this problem. These methods rely on the quality of the features and do not handle data with similar features well. Since deep learning architectures can extract features automatically, a large amount of deep learning work has been applied to sarcasm detection. Poria et al. [20] leverage a mixed CNN-SVM model to encode emotional and personal features for sarcasm detection. Xiong et al. [21] use an attention mechanism to capture word differences to detect sarcastic discourse. Ilic et al. [22] employ ELMo to extract character-level word representations to express contextual satire. These deep learning-based methods have achieved excellent performance. However, with the increase in multimodal data, researchers are paying more attention to multimodal sarcasm detection.

2.2. Multimodal Sarcasm Detection

Applications such as social networks and news websites have generated abundant multimodal data. As a result, extensive multimodal studies have been conducted, for example on sentiment analysis [23,24,25,26,27], image–text retrieval [28], reason extraction [29,30], and sarcasm detection [31]. Unlike text-based sarcasm detection, multimodal sarcasm detection aims to identify sarcastic expressions implied in multimodal data. Schifanella et al. [11] first address multimodal sarcasm detection on social platforms through manually designed text and image features, but manual features incur higher costs. Chauhan et al. [32] introduce a multitasking framework to identify irony and emotions. Furthermore, Cai et al. [12] build a new text–image sarcasm dataset collected from Twitter and propose an attention-based fusion model for multimodal sarcasm recognition that performs a simple late fusion. Xu et al. [33] exploit relational and decompositional networks to model semantic correlation for sarcasm recognition. These early studies use simple fusion networks, which makes it difficult to extract mutual information between modalities. The HKE model [34] jointly uses cross-modal graphs and syntactic analysis; such studies employ adjective–noun pairs to enrich multimodal information. Pan et al. [35] propose inter-modal and common attention mechanisms to complete the sarcasm detection task. The recently proposed DIP framework [36] leverages a contrastive loss to capture sarcastic information from factual and emotional perspectives. Liang et al. [13] focus on learning incongruous relationships through a cross-modal graph convolutional network by building intra-modal and cross-modal graphs for each multimodal sample, and they further present a cross-modal graph neural network whose edge weights come from SenticNet to capture inter-modal inconsistency [37]. Jiang et al. embed sentiment words into multimodal vectors [38]. These graph neural networks achieve excellent performance but incur considerable computational cost.
These studies achieve high classification accuracy by learning the inconsistent relationships between modalities through attention mechanisms or graph neural networks. Although these deep learning-based methods avoid the high cost of manually designed features, there is still room for improvement in the use of commonsense knowledge. Meanwhile, they focus on improving classification performance while ignoring the model's size. This paper proposes a multimodal interaction model based on knowledge enhancement to utilize visual knowledge information. The model has fewer parameters, which is beneficial for deployment on mobile devices.

3. Methodology

The approach presented in this paper consists of three components: a feature extract module, a multi-view interaction module, and a late fusion module. In the feature extract module, pre-trained models extract the multimodal features. The multi-view interaction module lets the modalities interact under the guidance of visual commonsense knowledge. Lastly, the late fusion module leverages an attention mechanism to fuse the multimodal representations. The overall architecture is illustrated in Figure 1.

3.1. Feature Extract Module

3.1.1. Image Feature

We employ the pre-trained DenseNet-121 [39] to extract image features. Each image is divided into 14 × 14 regions, and each region is fed into DenseNet to obtain a local feature:

$C_{image}^{i} = \mathrm{DenseNet}(I_i)$

Then, we average all region features to obtain the representation vector of the image:

$v_{image} = \frac{1}{n} \sum_{i=1}^{n} C_{image}^{i}$

where n = 196 is the number of regions, $C_{image}^{i}$ is used for late attention fusion, and $v_{image}$ is used for multimodal interaction.
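A minimal PyTorch sketch of this extraction step is shown below. The 448 × 448 input size, which makes DenseNet-121 produce a 14 × 14 feature map (196 region vectors of 1024 dimensions), and all variable names are assumptions for illustration, not the authors' code.

```python
# Sketch: region-level image features C_image^i and their average v_image.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

densenet = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
densenet.eval()

preprocess = T.Compose([
    T.Resize((448, 448)),                       # assumed input size: 448/32 = 14 regions per side
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_image_features(path: str):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 448, 448)
    fmap = densenet.features(img)                                   # (1, 1024, 14, 14)
    c_image = fmap.flatten(2).transpose(1, 2).squeeze(0)            # (196, 1024): C_image^i
    v_image = c_image.mean(dim=0)                                   # (1024,): averaged v_image
    return c_image, v_image
```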

3.1.2. Text Feature

In this section, we utilize the pre-trained model BERT-base [40] to extract the raw text feature. For each sample, we extract word-level features:
$C_{text}^{i} = \mathrm{BERT}(w_i)$

where $w_i$ denotes the i-th word in the text. The maximum text length is set to 75, i.e., $i \in [1, 75]$. The output of BERT's last layer serves as the text's sentence-level feature $v_{text}$. $C_{text}^{i}$ is used for late attention fusion and $v_{text}$ is used for multimodal interaction.
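The text-side extraction can be sketched with the Hugging Face Transformers library as follows. The bert-base-uncased checkpoint and the use of the last layer's [CLS] vector as the sentence-level feature are assumptions consistent with the description above, not the authors' exact setup.

```python
# Sketch: token-level features C_text^i and a sentence-level vector v_text from BERT.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

@torch.no_grad()
def extract_text_features(text: str, max_len: int = 75):
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    out = bert(**enc)
    c_text = out.last_hidden_state.squeeze(0)      # (75, 768): token-level C_text^i
    v_text = out.last_hidden_state[0, 0]           # (768,): [CLS] of the last layer as v_text (assumed pooling)
    return c_text, v_text
```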

3.1.3. Knowledge Feature

Previous studies have revealed that introducing extra knowledge can enrich the semantic information of images or texts and enhance the robustness of the model. We utilize the ResNet model trained on the COCO dataset [41] to predict the attribute labels of each image in the dataset to be evaluated. That is to say, we acquire some words that can illustrate the content and background of the image. These words serve as the visual commonsense knowledge to establish connections between text and images. Subsequently, we utilize BERT to extract the representation of each word:
$C_{knowledge}^{i} = \mathrm{BERT}(k_i)$

where $k_i$ denotes a knowledge word and $i \in [1, 5]$. After that, attention weighting is applied to these representations to obtain the knowledge feature used in the subsequent sections:

$\alpha_i = \mathrm{softmax}(W_o \cdot C_{knowledge}^{i} + b_o)$

$v_{knowledge} = \sum_{i=1}^{5} \alpha_i \cdot C_{knowledge}^{i}$

where $C_{knowledge}^{i}$ is used for late attention fusion and $v_{knowledge}$ is used for multimodal interaction.
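A small sketch of this attention pooling over the five knowledge-word embeddings is given below; the single linear scoring layer standing in for $(W_o, b_o)$ is an assumption based on the equations above.

```python
# Sketch: softmax-weighted pooling of knowledge-word embeddings into v_knowledge.
import torch
import torch.nn as nn

class KnowledgeAttention(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # plays the role of W_o, b_o

    def forward(self, c_knowledge: torch.Tensor):
        # c_knowledge: (5, dim) word representations C_knowledge^i
        alpha = torch.softmax(self.score(c_knowledge), dim=0)   # (5, 1) attention weights
        v_knowledge = (alpha * c_knowledge).sum(dim=0)          # (dim,) weighted sum
        return v_knowledge, alpha
```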

3.2. Multi-View Interaction Module

In this section, we elaborate on the interaction process for the multimodal data, which is enhanced with visual commonsense knowledge. Before being fed into the multi-view interaction module, the text, image, and knowledge features are first aligned to a common dimension through fully connected layers. As displayed in Figure 1, the multi-view interaction module has N layers. In the i-th layer, the text and image features are first concatenated with the visual commonsense knowledge separately:
$B_{TK}^{i} = \mathrm{concatenate}(v_{text}^{i}, v_{knowledge})$

$B_{IK}^{i} = \mathrm{concatenate}(v_{image}^{i}, v_{knowledge})$

where $B_{TK}^{i} \in \mathbb{R}^{d}$ and $B_{IK}^{i} \in \mathbb{R}^{d}$.
Then, we utilize scaled dot-product attention to enhance the text and image representations with visual commonsense knowledge:

$\hat{B}_{T}^{i} = \frac{(W_T^{i} \cdot B_{TK}^{i}) \otimes (W_T^{i} \cdot B_{TK}^{i})}{\sqrt{d}}$

$\hat{B}_{I}^{i} = \frac{(W_I^{i} \cdot B_{IK}^{i}) \otimes (W_I^{i} \cdot B_{IK}^{i})}{\sqrt{d}}$

$attw_T^{i} = \mathrm{softmax}(\hat{B}_T^{i})$

$attw_I^{i} = \mathrm{softmax}(\hat{B}_I^{i})$

$F_T^{i} = (attw_T^{i})^{\top} \otimes B_{TK}^{i}$

$F_I^{i} = (attw_I^{i})^{\top} \otimes B_{IK}^{i}$
where d = 512, and $attw_T^{i}$ and $attw_I^{i}$ are the attention weights of the text and image views, respectively. $W_T^{i}$ and $W_I^{i}$ are trainable parameters. The symbols · and ⊗ denote the dot product and the matrix product, respectively. $F_T^{i}$ and $F_I^{i}$ are the knowledge-enhanced text and image representations obtained through the attention mechanism. We then reuse the attention weights across views to let the text and image information interact:
$f_T^{i} = (attw_I^{i})^{\top} \otimes B_{TK}^{i}$

$f_I^{i} = (attw_T^{i})^{\top} \otimes B_{IK}^{i}$
After that, $F_T^{i}$ is concatenated with $f_T^{i}$, and $F_I^{i}$ with $f_I^{i}$; the results then pass through a projection layer:

$O_T^{i} = \mathrm{projection}_T^{i}(\mathrm{concatenate}(F_T^{i}, f_T^{i}))$

$O_I^{i} = \mathrm{projection}_I^{i}(\mathrm{concatenate}(F_I^{i}, f_I^{i}))$

The projection layer consists of two linear layers, a LeakyReLU activation function, and a LayerNorm layer.
The current layer’s output serves as the next layer’s input:
$v_{text}^{i+1} = O_T^{i}$

$v_{image}^{i+1} = O_I^{i}$
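The sketch below illustrates one interaction layer in a simplified form: each modality is treated as a single d-dimensional vector, the attention weights of one view re-weight the other view, and self- and cross-view features are concatenated and projected. The exact tensor shapes of the scaled dot-product step and the layer widths are assumptions, so this is an illustration of the idea rather than the authors' implementation.

```python
# Simplified sketch of one multi-view interaction layer with knowledge enhancement.
import torch
import torch.nn as nn

class MultiViewLayer(nn.Module):
    def __init__(self, d: int = 512):
        super().__init__()
        self.d = d
        self.w_t = nn.Linear(2 * d, d, bias=False)   # acts on B_TK = [v_text; v_knowledge]
        self.w_i = nn.Linear(2 * d, d, bias=False)   # acts on B_IK = [v_image; v_knowledge]
        self.proj_t = self._projection(2 * d, d)
        self.proj_i = self._projection(2 * d, d)

    @staticmethod
    def _projection(in_dim, out_dim):
        # two linear layers + LeakyReLU + LayerNorm, as described above
        return nn.Sequential(nn.Linear(in_dim, out_dim), nn.LeakyReLU(),
                             nn.Linear(out_dim, out_dim), nn.LayerNorm(out_dim))

    def forward(self, v_text, v_image, v_knowledge):
        b_tk = torch.cat([v_text, v_knowledge], dim=-1)              # B_TK^i
        b_ik = torch.cat([v_image, v_knowledge], dim=-1)             # B_IK^i
        q_t, q_i = self.w_t(b_tk), self.w_i(b_ik)                    # projected text/image views
        attw_t = torch.softmax(q_t / self.d ** 0.5, dim=-1)          # text-view attention weights
        attw_i = torch.softmax(q_i / self.d ** 0.5, dim=-1)          # image-view attention weights
        f_t_self, f_i_self = attw_t * q_t, attw_i * q_i              # F_T^i, F_I^i (knowledge-enhanced)
        f_t_cross, f_i_cross = attw_i * q_t, attw_t * q_i            # f_T^i, f_I^i (cross-view re-weighting)
        o_t = self.proj_t(torch.cat([f_t_self, f_t_cross], dim=-1))  # O_T^i
        o_i = self.proj_i(torch.cat([f_i_self, f_i_cross], dim=-1))  # O_I^i
        return o_t, o_i                                              # become v_text^{i+1}, v_image^{i+1}
```

Stacking N = 2 such layers (the best setting in Section 5.3) and feeding each layer's outputs, together with the fixed $v_{knowledge}$, into the next layer reproduces the multi-turn interaction described above.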

3.3. Late Fusion Module

Finally, we adopt an attention mechanism to fuse the representation vectors of the text, image, and visual commonsense knowledge. The representations obtained from the multi-view interaction module are concatenated with the raw representations and fed into a fully connected layer to calculate the fused vector:
$\alpha_{mn}^{i} = \tanh(W_{mn} \cdot \mathrm{concatenate}(C_m^{i}, v_n) + b_{mn})$

$\alpha_m^{i} = \frac{1}{3} \sum_{n \in \phi} \alpha_{mn}^{i}$

$v_m = \frac{1}{S_m} \sum_{i=1}^{S_m} \alpha_m^{i} C_m^{i}$

where $W_{mn}$ and $b_{mn}$ are trainable parameters, $m, n \in \phi$, and $\phi = \{text, image, knowledge\}$. $C_m^{i}$ denotes the i-th raw representation vector of modality m, and $S_m$ is the sequence length.

$\hat{\alpha} = \mathrm{softmax}(W_m \cdot v_m + b_m)$

$v_p = \sum_{m \in \phi} \hat{\alpha}_m v_m$

where $W_m$ and $b_m$ are trainable parameters, and $v_p$ is the fused vector used in the classification layer.
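A hedged sketch of this late attention fusion follows. It assumes the raw token features of all three modalities have already been projected to a common dimension, and the averaging details are a simplification of the equations above rather than the authors' exact implementation.

```python
# Sketch: late attention fusion of text, image, and knowledge representations into v_p.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, dim: int = 768, modalities=("text", "image", "knowledge")):
        super().__init__()
        self.modalities = modalities
        # one scoring layer per (m, n) pair, playing the role of W_mn, b_mn
        self.pair_score = nn.ModuleDict({
            f"{m}_{n}": nn.Linear(2 * dim, 1) for m in modalities for n in modalities
        })
        self.modal_score = nn.Linear(dim, 1)    # W_m, b_m for the final modality weights

    def forward(self, seqs: dict, globals_: dict):
        # seqs[m]: (S_m, dim) raw token features C_m^i; globals_[n]: (dim,) interaction output v_n
        pooled = {}
        for m in self.modalities:
            s_m = seqs[m]                                           # (S_m, dim)
            scores = []
            for n in self.modalities:
                v_n = globals_[n].expand(s_m.size(0), -1)           # broadcast v_n over the sequence
                scores.append(torch.tanh(self.pair_score[f"{m}_{n}"](torch.cat([s_m, v_n], -1))))
            alpha_m = torch.stack(scores, 0).mean(0)                # (S_m, 1): average over the 3 views
            pooled[m] = (alpha_m * s_m).mean(dim=0)                 # (dim,): modality vector v_m
        v_stack = torch.stack([pooled[m] for m in self.modalities])     # (3, dim)
        weights = torch.softmax(self.modal_score(v_stack), dim=0)       # (3, 1): modality weights
        v_p = (weights * v_stack).sum(dim=0)                            # fused vector for the classifier
        return v_p
```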

3.4. Classification and Loss Function

In this paper, we use a fully connected layer followed by a sigmoid activation function as the classification layer. Binary cross-entropy loss is applied to optimize the model.
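A minimal sketch of this classification head and loss is shown below; the 768-dimensional fused vector is an assumption (the actual dimension depends on the fusion module).

```python
# Sketch: fully connected classifier with sigmoid + binary cross-entropy.
import torch
import torch.nn as nn

classifier = nn.Linear(768, 1)                 # fused-vector dimension is an assumption
criterion = nn.BCEWithLogitsLoss()             # combines sigmoid and BCE, numerically stable

def compute_loss(v_p: torch.Tensor, labels: torch.Tensor):
    logits = classifier(v_p).squeeze(-1)       # (batch,)
    return criterion(logits, labels.float())   # labels: 1 = sarcasm, 0 = non-sarcasm
```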
The training, validation, and testing process of the model is shown in Figure 2.
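The procedure in Figure 2 can be sketched as a standard loop that keeps the checkpoint with the minimum validation loss for testing; model, the data loaders, and the compute_loss helper (from the sketch above) are placeholders, not the authors' code.

```python
# Sketch: training with validation-based model selection, as in Figure 2.
import copy
import torch

def train(model, train_loader, val_loader, optimizer, epochs: int = 15):
    best_state, best_val = None, float("inf")
    for _ in range(epochs):
        model.train()
        for batch, labels in train_loader:
            optimizer.zero_grad()
            loss = compute_loss(model(batch), labels)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(compute_loss(model(b), y).item() for b, y in val_loader)
        if val_loss < best_val:                      # keep the checkpoint with minimum valid loss
            best_val, best_state = val_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)                # this checkpoint is evaluated on the test set
    return model
```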

4. Experiment Setup

4.1. Dataset and Evaluated Metrics

In this paper, we use Cai et al.'s [12] publicly available multimodal sarcasm dataset collected from the Twitter platform. The dataset comprises images and their corresponding text and contains 24,635 samples, divided into training, validation, and testing sets. Every sample is annotated with a sarcasm or non-sarcasm label, i.e., 1 or 0. Table 1 presents detailed statistics of the dataset. Following previous works, Accuracy (Acc), Precision (Pre), Recall (Rec), and F1 score (F1) are used to evaluate model performance.
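These metrics can be computed with scikit-learn as sketched below; the helper is illustrative and assumes binary predictions with 1 = sarcasm and 0 = non-sarcasm.

```python
# Sketch: evaluation metrics used in this paper.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Pre": precision_score(y_true, y_pred),
        "Rec": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }
```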

4.2. Hyper-Parameters Setting

In our experiment, the pre-trained models BERT and DenseNet-121 are employed to extract the multimodal features; their parameters are frozen when extracting the raw features. The optimizer is Adam. The important hyper-parameters are shown in Table 2.
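A minimal sketch of this setup is given below, assuming the extractors are frozen by disabling their gradients and only the MVK parameters are passed to Adam; all names are placeholders.

```python
# Sketch: freeze the pre-trained extractors and build the Adam optimizer (lr from Table 2).
import torch

def build_optimizer(model, bert, densenet, lr: float = 0.00025):
    for p in bert.parameters():        # pre-trained extractors are frozen
        p.requires_grad = False
    for p in densenet.parameters():
        p.requires_grad = False
    return torch.optim.Adam(model.parameters(), lr=lr)
```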

4.3. Compared Baselines

The following compared baselines contain models that use text, images, and text–image modalities as inputs.
  • TextCNN [42]: It uses the 1D Convolutional Neural Network to extract text features with few parameters.
  • TextCNN-RNN [42]: We add a recurrent neural network on the basis of TextCNN to further extract text features.
  • BERT [40]: BERT-base-uncased is utilized to extract textual features and is fine-tuned for the downstream sarcasm detection task.
  • Image [43]: It uses only the image vectors after the pooling layer of DenseNet to predict the sarcasm detection result.
  • ViT [43]: Like BERT, the vision pre-trained model ViT extracts the image representation by inserting the [CLS] token for the image sarcasm detection.
  • HFN [12]: This model utilizes GloVe and ResNet as feature extractors, and a simple attention-based fusion layer is then adopted to fuse the text, image, and attribute modalities.
  • Attr-BERT [35]: It presents a self-attention structure focusing on the intra-modal information based on BERT architecture.
  • HKE [34]: It exploits atomic-level congruity and composition-level congruity detection models based on the graph neural network through cross-modality attention.

5. Experimental Results and Analysis

5.1. Main Experimental Results and Analysis

The main results of our proposed MVK model and the compared baselines are presented in Table 3. The key evaluation indicator, F1, of the proposed model is 83.92%, with an accuracy (Acc) of 86.68%; the precision (Pre) and recall (Rec) also improve. Among the compared baselines, using only the text modality yields better results than using only the image modality, indicating that the text carries more information than the images. However, unimodal models cannot sufficiently exploit the complementarity between multimodal data. Current research therefore focuses on building multimodal fusion networks that take multimodal data as input. For example, HFN and InCrossMGs [13] use attention mechanisms and cross-modal graph convolutional networks to fuse text and image features directly. Attr-BERT, on the other hand, directly concatenates text and image data and uses the BERT structure to extract multimodal features. However, these approaches ignore the interactivity between modalities and do not leverage visual commonsense knowledge to enrich multimodal representations. We introduce visual commonsense knowledge into the model to enhance the representation vectors of each modality. Simultaneously, our model uses multiple views to interact inter-modal information dynamically over multiple turns. Finally, we fuse the multimodal representations through attention fusion. The experimental results show that our multi-view interaction model with knowledge enhancement can effectively improve the performance of multimodal sarcasm detection.
In this section, we also quantify the total number of parameters of the proposed model and the comparative baselines, as shown in Table 3. The multimodal baselines take multimodal data as input. Compared with the HKE model, our model achieves comparable performance with significantly fewer parameters. This suggests that our approach, when deployed on mobile devices, can effectively mitigate memory constraints, and the reduced parameter count also lowers the computational overhead.

5.2. Ablation Study

To explore the role of each component in multimodal sarcasm detection, we conduct a series of ablation experiments and provide a detailed analysis. The ablation experiments remove the multi-view interaction module and the late fusion module, respectively. Table 4 reports the results. (Legend: w/o V denotes not employing multi-view interaction; w/o L denotes not employing late attention fusion and instead concatenating the multimodal representations after multi-view interaction; w/o V-L denotes employing neither interaction nor attention fusion.)
Upon observing the results in Table 4, it becomes evident that the model's performance decreases significantly when both components are removed (w/o V-L), indicating that in multimodal sarcasm detection, knowledge enhancement, multi-view interaction, and attention fusion are crucial for learning the relationships between multimodal data. Specifically, when the multi-view interaction module is removed (w/o V), the results show a noticeable decline, which indicates that interacting information through different modal perspectives improves the model's ability to learn dependency relationships between modalities. When the attention fusion is removed (w/o L), the model's performance is slightly worse, indicating that a simple fusion method cannot make the model focus on the important information in the multimodal representations. It is worth noting that the model achieves its best performance in multimodal sarcasm detection when both components are employed together, especially when visual commonsense knowledge is used to enrich the semantic information of the representations.

5.3. Impact of Multi-View Layers

To investigate the impact of the number of multi-view layers on the presented model, we run experiments on different numbers of multi-view layers. The number varies from 1 to 10. Figure 3 exhibits the experimental results on Acc and F1. When the multi-view module consists of two layers, the model performs best. When there is only one layer, insufficient interaction between text and image information makes the model unable to obtain satisfactory features. As the number of layers increases, the model’s performance gradually declines, indicating that the excessive depth weakens its learning ability. The possible reason is that the growth of model parameters leads to overfitting. Even if we increase this module’s dropout rate, the model’s performance is still constrained.

5.4. Case Study

We conduct a case study with attention visualization on a sarcastic sample and a non-sarcastic sample from the test set to analyze how the proposed model learns multimodal information. The results are shown in Figure 4. For multimodal sarcasm detection, models need to learn both the inter-modal mutual information and the incongruous information, and our architecture learns this information through multi-view interaction with knowledge enhancement and attention fusion. As shown in Figure 4, for sample (a) with the sarcasm label, the image's attention heat map shows that the model focuses on the lower middle of the image, i.e., the lying dog, while the text's heat map shows that the model focuses on the words 'wasting time' and 'fun'. For sample (b) with the non-sarcasm label, the image's attention heat map focuses on the wild lizard and the text's heat map focuses on the words 'browser', 'fronoze', and 'video'. We can conclude that, with the help of visual commonsense knowledge and multi-view interaction, our model effectively extracts text and image features, enhances modal interaction, and concentrates on the important objects and words in the multimodal data.

5.5. Error Analysis and Limitation

By observing the incorrectly predicted samples in the test set, we note that a notable proportion of them contain images with little information, for example images featuring only a few words, blurry visuals, or a single scene, as exhibited in Figure 5. Such images make the model fail to focus on essential positions and prevent it from extracting adequate information, so the model relies heavily on the text modality alone. Subsequent work can alleviate these issues by extracting words from images and using techniques such as image defogging to enhance visual quality. The limitations of this research are the inability to extract sufficient information from blurred or information-poor images and inadequate adaptability when a modality is missing.

6. Conclusions

The rapid evolution of social networks has led to the generation of vast amounts of multimodal data, much of it published from mobile devices. Mobile devices on the Internet of Things typically have limited memory and computing power. Therefore, this paper proposes a multi-view interaction model based on knowledge enhancement for multimodal sarcasm detection in the social Internet of Things. The model enhances the representation of each modality by leveraging visual commonsense knowledge. Furthermore, it utilizes a multi-turn, multi-view interaction module to extract the mutual information and incongruous relationship between text and image. The experimental results validate the effectiveness of the proposed model. In addition, the model has fewer parameters and does not require a large amount of memory, and its lower inference time can facilitate the development of the social Internet of Things. In future work, parameter sharing or multi-teacher distillation can be considered to further compress the model for practical use. Moreover, considering the impact of poor image quality on model performance, future work can utilize more powerful feature extractors, such as ViT and CLIP, and Optical Character Recognition (OCR) can be employed to capture the textual information in simple images to enrich the image information.

Author Contributions

Conceptualization, methodology and software; writing—original draft preparation, H.L.; visualization, validation, supervision, B.Y.; data curation, investigation, supervision, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (No. 61960206008, 62025205). The work of B. Yang was supported in part by the Qin Chuang Yuan Fund Program (QCYRCXM-2022-358).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this paper is publicly available at https://github.com/headacheboy/data-of-multimodal-sarcasm-detection, accessed on 21 March 2020.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Atzori, L.; Iera, A.; Morabito, G.; Nitti, M. The Social Internet of Things (SIoT)—When social networks meet the Internet of Things: Concept, architecture and network characterization. Comput. Netw. 2012, 56, 3594–3608. [Google Scholar] [CrossRef]
  2. Atzori, L.; Iera, A.; Morabito, G. SIoT: Giving a Social Structure to the Internet of Things. IEEE Commun. Lett. 2011, 15, 1193–1195. [Google Scholar] [CrossRef]
  3. Jena, A.K.; Sinha, A.; Agarwal, R. C-net: Contextual network for sarcasm detection. In Proceedings of the Second Workshop on Figurative Language Processing, Online, 9 July 2020; pp. 61–66. [Google Scholar]
  4. Ravi, K.; Ravi, V. A survey on opinion mining and sentiment analysis: Tasks, approaches and applications. Knowl.-Based Syst. 2015, 89, 14–46. [Google Scholar] [CrossRef]
  5. Joshi, A.; Bhattacharyya, P.; Carman, M.J. Automatic sarcasm detection: A survey. ACM Comput. Surv. (CSUR) 2017, 50, 1–22. [Google Scholar] [CrossRef]
  6. Jiang, D.; Liu, H.; Tu, G.; Wei, R.; Cambria, E. Self-supervised utterance order prediction for emotion recognition in conversations. Neurocomputing 2024, 577, 127370. [Google Scholar] [CrossRef]
  7. Tu, G.; Xie, T.; Liang, B.; Wang, H.; Xu, R. Adaptive Graph Learning for Multimodal Conversational Emotion Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
  8. Alita, D. Multiclass SVM Algorithm for Sarcasm Text in Twitter. JATISI (J. Tek. Inform. Dan Sist. Inf.) 2021, 8, 118–128. [Google Scholar] [CrossRef]
  9. Eke, C.I.; Norman, A.A.; Shuib, L.; Nweke, H.F. Sarcasm identification in textual data: Systematic review, research challenges and open directions. Artif. Intell. Rev. 2020, 53, 4215–4258. [Google Scholar] [CrossRef]
  10. Castro, S.; Hazarika, D.; Pérez-Rosas, V.; Zimmermann, R.; Mihalcea, R.; Poria, S. Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4619–4629. [Google Scholar]
  11. Schifanella, R.; De Juan, P.; Tetreault, J.; Cao, L. Detecting sarcasm in multimodal social platforms. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 1136–1145. [Google Scholar]
  12. Cai, Y.; Cai, H.; Wan, X. Multi-modal sarcasm detection in twitter with hierarchical fusion model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2506–2515. [Google Scholar]
  13. Liang, B.; Lou, C.; Li, X.; Gui, L.; Yang, M.; Xu, R. Multi-modal sarcasm detection with interactive in-modal and cross-modal graphs. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4707–4715. [Google Scholar]
  14. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  15. Zhang, M.; Zhang, Y.; Fu, G. Tweet sarcasm detection using deep neural network. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2449–2460. [Google Scholar]
  16. Tay, Y.; Luu, A.T.; Hui, S.C.; Su, J. Reasoning with Sarcasm by Reading In-Between. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 1010–1020. [Google Scholar]
  17. Jain, T.; Agrawal, N.; Goyal, G.; Aggrawal, N. Sarcasm detection of tweets: A comparative study. In Proceedings of the 2017 Tenth International Conference on Contemporary Computing (IC3), Noida, India, 10–12 August 2017; pp. 1–6. [Google Scholar]
  18. Riloff, E.; Qadir, A.; Surve, P.; De Silva, L.; Gilbert, N.; Huang, R. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 704–714. [Google Scholar]
  19. Ghosh, D.; Guo, W.; Muresan, S. Sarcastic or not: Word embeddings to predict the literal or sarcastic meaning of words. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1003–1012. [Google Scholar]
  20. Poria, S.; Cambria, E.; Hazarika, D.; Vij, P. A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 1601–1612. [Google Scholar]
  21. Xiong, T.; Zhang, P.; Zhu, H.; Yang, Y. Sarcasm detection with self-matching networks and low-rank bilinear pooling. In Proceedings of The World Wide Web Conference, Toronto, ON, Canada, 11–14 May 2019; pp. 2115–2124. [Google Scholar]
  22. Ilic, S.; Marrese-Taylor, E.; Balazs, J.; Matsuo, Y. Deep contextualized word representations for detecting sarcasm and irony. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Brussels, Belgium, 31 October 2018; pp. 2–7. [Google Scholar]
  23. Jiang, D.; Liu, H.; Wei, R.; Tu, G. CSAT-FTCN: A Fuzzy-Oriented Model with Contextual Self-attention Network for Multimodal Emotion Recognition. Cogn. Comput. 2023, 15, 1082–1091. [Google Scholar] [CrossRef]
  24. Jiang, D.; Wei, R.; Liu, H.; Wen, J.; Tu, G.; Zheng, L.; Cambria, E. A Multitask Learning Framework for Multimodal Sentiment Analysis. In Proceedings of the 2021 International Conference on Data Mining Workshops (ICDMW), Auckland, New Zealand, 7–10 December 2021; pp. 151–157. [Google Scholar] [CrossRef]
  25. Tu, G.; Wen, J.; Liu, H.; Chen, S.; Zheng, L.; Jiang, D. Exploration meets exploitation: Multitask learning for emotion recognition based on discrete and dimensional models. Knowl.-Based Syst. 2022, 235, 107598. [Google Scholar] [CrossRef]
  26. Li, Z.; Tu, G.; Liang, X.; Xu, R. Developing Relationships: A Heterogeneous Graph Network with Learnable Edge Representation for Emotion Identification in Conversations. In CAAI International Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2022; pp. 310–322. [Google Scholar]
  27. Tu, G.; Liang, B.; Jiang, D.; Xu, R. Sentiment-Emotion-and Context-guided Knowledge Selection Framework for Emotion Recognition in Conversations. IEEE Trans. Affect. Comput. 2022, 14, 1803–1816. [Google Scholar] [CrossRef]
  28. Chen, H.; Ding, G.; Liu, X.; Lin, Z.; Liu, J.; Han, J. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12655–12663. [Google Scholar]
  29. Nam, H.; Ha, J.W.; Kim, J. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 299–307. [Google Scholar]
  30. Jiang, D.; Liu, H.; Tu, G.; Wei, R. Window transformer for dialogue document: A joint framework for causal emotion entailment. Int. J. Mach. Learn. Cybern. 2023, 14, 2697–2707. [Google Scholar] [CrossRef]
  31. Sarsam, S.M.; Al-Samarraie, H.; Alzahrani, A.I.; Wright, B. Sarcasm detection using machine learning algorithms in Twitter: A systematic review. Int. J. Mark. Res. 2020, 62, 578–598. [Google Scholar] [CrossRef]
  32. Chauhan, D.S.; Singh, G.V.; Arora, A.; Ekbal, A.; Bhattacharyya, P. An emoji-aware multitask framework for multimodal sarcasm detection. Knowl.-Based Syst. 2022, 257, 109924. [Google Scholar] [CrossRef]
  33. Xu, N.; Zeng, Z.; Mao, W. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3777–3786. [Google Scholar]
  34. Liu, H.; Wang, W.; Li, H. Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 4995–5006. [Google Scholar]
  35. Pan, H.; Lin, Z.; Fu, P.; Qi, Y.; Wang, W. Modeling intra and inter-modality incongruity for multi-modal sarcasm detection. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 1383–1392. [Google Scholar]
  36. Wen, C.; Jia, G.; Yang, J. DIP: Dual Incongruity Perceiving Network for Sarcasm Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2540–2550. [Google Scholar]
  37. Liang, B.; Lou, C.; Li, X.; Yang, M.; Gui, L.; He, Y.; Pei, W.; Xu, R. Multi-modal sarcasm detection via cross-modal graph convolutional network. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 1767–1777. [Google Scholar]
  38. Fu, H.; Liu, H.; Wang, H.; Xu, L.; Lin, J.; Jiang, D. Multi-Modal Sarcasm Detection with Sentiment Word Embedding. Electronics 2024, 13, 855. [Google Scholar] [CrossRef]
  39. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  40. Kenton, J.D.M.W.C.; Toutanova, L.K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  41. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  42. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 25–29 October 2014. [Google Scholar]
  43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
Figure 1. The overall architecture of multi-view interaction model with knowledge enhancement (MVK). The model consists of three modules: (a) feature extract module, (b) multi-view interaction module, and (c) late fusion module. In module (a), the DenseNet and BERT are feature extractors used to extract features at two levels. Module (b) has N layers, and each layer performs an inter-modal interaction based on knowledge augmentation.
Figure 2. The model’s process of training, validation, and test. The test set result is based on the model with the minimum valid loss. The visualization data are from test data.
Figure 3. Impact of multi-view interaction layers.
Figure 4. Case study and visualization.
Figure 5. Samples for error analysis.
Table 1. The statistics of the multimodal dataset.

                Non-Sarcasm   Sarcasm   Total    Percentage
Training set    8642          11,174    19,816   80.44%
Val set         959           1451      2410     9.78%
Test set        959           1450      2409     9.78%
Table 2. Hyper-parameter setting.

Hyper-Parameter                   Value
Batch size                        256
Number of knowledge words         5
Max length of text                75
Dropout rate                      0.25
Learning rate                     0.00025
Number of multi-view sublayers    2
Epoch                             15
BERT embedding dimension          768
Table 3. Main results of the experiment.

Modality       Model          Acc (%)   Pre (%)   Rec (%)   F1 (%)   Parameters
Text           TextCNN        79.54     73.57     78.78     76.09    6.0 M
               TextCNN-RNN    82.30     78.86     78.09     78.48    6.1 M
               BERT           83.85     78.72     82.27     80.22    112.8 M
               MVK (ours)     83.36     80.09     79.46     79.78    19.5 M
Image          Image          64.76     54.41     70.80     61.53    14.9 M
               ViT            67.83     57.93     70.07     63.43    86.6 M
               MVK (ours)     72.44     67.24     64.93     66.06    19.5 M
Text + Image   HFN            83.44     76.57     84.15     80.18    14.9 M
               Attr-BERT      86.05     78.63     83.31     80.90    36.1 M
               HKE            87.02     82.97     84.90     83.92    112.5 M
               MVK (ours)     86.68     83.75     84.06     83.92    19.5 M
Table 4. Ablation experimental results.

Model      Acc (%)   F1 (%)
MVK        86.68     83.92
w/o V      85.22     82.15
w/o L      84.25     82.04
w/o V-L    81.61     76.89
