Article

Video Summarization Generation Based on Graph Structure Reconstruction

School of Cyberspace Security, Gansu University of Political Science and Law, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(23), 4757; https://doi.org/10.3390/electronics12234757
Submission received: 9 October 2023 / Revised: 11 November 2023 / Accepted: 22 November 2023 / Published: 23 November 2023
(This article belongs to the Section Artificial Intelligence)

Abstract

Video summarization aims to identify important segments in a video and merge them into a concise representation, enabling users to comprehend the essential information without watching the entire video. Graph structure-based video summarization approaches ignore the issue of redundancy in the adjacency matrix. To address this issue, this paper proposes a video summary generation model based on graph structure reconstruction (VOGNet). The model first adopts a variational graph auto-encoder (VGAE) to reconstruct the graph structure and remove redundant information from it; it then feeds the reconstructed graph structure into a graph attention network (GAT), which allocates different weights to the shot features in each neighborhood; lastly, to avoid the loss of information during training, a feature fusion approach combines the learned shot features with the original shot features to produce the features used for generating the summary. We perform extensive experiments on two standard datasets, SumMe and TVSum, and the experimental results demonstrate the effectiveness and robustness of the proposed model.

1. Introduction

With the widespread application of digital camera technology and the Internet, massive amounts of video data have been generated, but their management and utilization face challenges in storage, browsing, and searching. Consequently, video summarization techniques have become a research focus, aiming to achieve compressed storage, fast browsing, and efficient searching of video data through automated techniques. Video summarization enables users to grasp the main content of a video rapidly without watching the entire video by extracting key frames or key shots from the original video [1,2,3]. According to the kind of summary content generated, these techniques are divided into two categories: static video summarization and dynamic video summarization [4]. Static video summarization condenses video content into a set of keyframes, which typically contain only image information, whereas dynamic video summarization consists of video clips, combining sound, images, and text to provide richer content for the user. Against this background, we choose to study dynamic video summarization, as it allows a more comprehensive consideration of multiple sources of information, including sound, images, and text, and thus better meets the needs of information delivery and user experience.
Deep learning methods, especially those based on Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), have made significant achievements in the field of video summarization [5,6,7,8]. With the introduction of RNNs, it became possible to capture temporal information in the video more effectively, extract contextual semantic information, and generate more accurate and coherent video summaries [7]. Although designed to capture long-term dependencies, LSTM still faces challenges when dealing with dependencies in very long sequences [9]. Moreover, these deep learning methods perform excellently on Euclidean data but are far less effective on non-Euclidean data. To solve this problem, researchers have introduced Graph Neural Networks (GNNs) [10], which exhibit excellent performance: the video is transformed into a graph structure and the complex dependencies between frames are captured by a GNN, providing a novel approach to video summarization [7,11,12]. Nonetheless, long-term dependency in the video modeling process remains a notable challenge, and the graph structures generated from videos can contain large amounts of redundancy, hindering the effective learning of the structural features of videos. Addressing these issues therefore remains a key challenge in the domain of video summarization.
In this study, a GNN is used to model the video in order to solve the long-term dependency problem of RNNs. First, frame features are extracted by GoogLeNet [13], and a representative-frame algorithm then selects representative frame features as the features of the current shot. Next, a graph structure is constructed by using these shot features as nodes and the similarity between them as edges. To further optimize the representation of the video graph, we employ a VGAE to reconstruct the original video graph and remove the redundancy of the adjacency matrix. The graph structure is then refined by introducing a GAT, which weights the aggregation according to the importance of neighbor information. In addition, we fuse the raw and structural features of the shots to improve the reasonableness of scoring shot importance. This comprehensive approach effectively combines GNN, VGAE, and GAT to more accurately capture key information in the video, optimize the graph structure, and improve the assessment of shot significance.
The contribution of the method to video summarization is reflected in three aspects:
1. Transformation of the video into nonlinear data and building of a graph structure in which the shot features are the nodes and the inter-shot similarities are the edges, thus overcoming the long-term dependency problem, effectively extracting the dependencies between shots, and compensating for the limitations of LSTM.
2. Application of VGAE to reconstruct the graph of the original video for redundant information removal, and aggregation of the neighborhood information of the shot nodes by GAT.
3. Feature fusion of raw and aggregated features of video shots to evaluate the significance of video shots in a more rational way.
The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 provides a detailed description of the proposed method. Section 4 provides the results and analysis of the experiments. Finally, conclusions and insights are drawn in Section 5.

2. Related Work

We first present the existing research in the area of video summarization, followed by an introduction to variational graph auto-encoders and graph attention networks.

2.1. Video Summarization

Existing video summarization is mainly implemented using deep learning methods. In recent years, Recurrent Neural Networks (RNNs) have been widely used to process sequential data. For instance, Zhang et al. [14] propose a video summarization model based on the long short-term memory network (LSTM), which utilizes LSTM to model variable-range temporal dependencies between video frames. However, LSTM suffers from the long-term dependency problem when dealing with long sequences. To tackle this problem, Zhao et al. [6] propose a hierarchical recurrent neural network (H-RNN) with two layers: the first layer employs LSTM to capture the short-term temporal dependencies between frames within a sub-shot, and the second layer uses a bi-directional long short-term memory network (Bi-LSTM) to extract the long-term dependencies between shots. Considering that people focus on key shots, Ji et al. [15] apply the attention mechanism to LSTM. They introduce an attention-based encoder–decoder video summarization network, which encodes contextual information between video frames using a Bi-LSTM and explores two types of attention-based LSTM networks for the decoder. Although RNNs have achieved great success in processing Euclidean data, they are not suited to non-Euclidean data. To better mine the hidden structural relationships in videos, video data are constructed into graph structures, and graph neural networks, such as the Graph Convolutional Network (GCN) [10], are used to learn the topology of the graphs. The idea of GCN is to use a low-dimensional vector to represent the high-dimensional information of each node in the graph, which can capture the global information of the graph and better represent the characteristics of the nodes. Park et al. [16] present a recursive graph modeling network (SumGraph) for video summarization, which takes video frames as the nodes of the graph and connects them with edges defined by the semantic relationships between frames. Zhao et al. [11] propose a reconstructive sequence-graph network (RSGN) that hierarchically encodes frames and shots as sequences and graphs, where frame-level correlations are encoded by an LSTM and shot-level correlations are captured by a GCN. Li et al. [17] propose GCAN, which has two modules: an embedding learning module and a feature fusion module. One branch of the embedding learning module extracts temporal features using DTC and TSA, and the other extracts spatial features using GCN; the fusion module uses a fusion gate mechanism to fuse the original, temporal, and spatial features to generate the summary. To highlight critical information and filter out irrelevant information, Zhong et al. [7] use the graph attention mechanism (GAT) to merge the influence of related neighboring nodes, convert the visual features of nodes into higher-level features, and adjust the weights of related nodes.
In conclusion, research on video summarization initially extracted the sequential information in videos with RNNs, and then introduced the concept of hierarchy to alleviate the long-term dependency problem of RNNs. Since RNN-based models cannot effectively handle non-Euclidean data, graph neural networks were subsequently introduced to learn graph structures.

2.2. Variational Graph Auto Encoders (VGAE)

Graph neural networks have been applied with great success to extract features from non-Euclidean data. They can be classified into five main categories, namely graph convolutional networks (GCNs) [10], graph attention networks (GATs) [18], graph auto-encoders (GAEs) [21], variational graph auto-encoders (VGAEs) [21], and graph sample and aggregate (GraphSAGE) [20]. This paper focuses on VGAEs and GATs.
Kingma et al. [19] introduce the variational auto-encoder (VAE), a generative model that learns latent variables from input data and generates new samples; it is an unsupervised algorithm that learns without labeled data. Kipf et al. [21] migrate the VAE to the graph domain and present the variational graph auto-encoder. Its encoder uses the known graph structure to learn a distribution over node representation vectors, samples from this distribution, and finally decodes the samples to reconstruct the graph structure. The method in this paper employs a variational graph auto-encoder to reconstruct the graph structure of the constructed video: the constructed adjacency matrix is fed into the VGAE, the model learns a set of distributions over latent variables and samples from them to obtain latent representation vectors (the encoding part), and then reconstructs the original graph from the obtained latent representations (the decoding part). Specifically, the encoding part of VGAE uses a GCN, while the decoding part is a simple inner product. Finally, the reconstructed graph is used as input to the next part of the model. The purpose of using VGAE in this paper is to learn the latent structural features of the video and obtain a graph structure that better represents the video information.

2.3. Graph Attention Network (GAT)

Kipf et al. [10] offer a fresh perspective by proposing graph convolutional networks (GCNs), which fuse graph structural features in a convolutional manner. They motivate the choice of convolutional architecture via a localized first-order approximation of spectral graph convolutions, and their model encodes hidden-layer representations of both local graph structure and node features. However, when a GCN fuses the features of a graph structure, the weights of the graph's edges are fixed and not flexible enough. To address this problem, Veličković et al. [18] introduce the graph attention network (GAT). This network assigns different weights to different nodes in a neighborhood when dealing with neighborhoods of different sizes and does not require knowing the entire graph structure in advance. Specifically, by calculating the attention coefficients, i.e., the weights, between the current node and its neighbors and weighting them when aggregating neighbor information, the graph neural network can pay more attention to important nodes, reducing the impact of noise, improving the robustness of the model, and endowing it with a certain degree of interpretability. Zhong et al. [7] apply the graph attention network to video summarization: they propose an unsupervised video summarization model that combines a bi-directional long short-term memory network (Bi-LSTM) with a graph attention network (GAT) to solve the problem of high redundancy between key frames.
To summarize, GCN can effectively extract the spatial features of the video, but its edge weights are fixed when fusing graph-structure features; in contrast, GAT can assign different attention-based weights to different nodes in the neighborhood, which better extracts the spatial features of the video.

3. Model

As shown in Figure 1, a Video Summarization Generation Network Based on Graph Structure Reconstruction (VOGNet) is proposed to remove redundancies in the graph structure and learn the structural features of videos. The network comprises three key parts: feature extraction, graph structure reconstruction, and feature fusion. For feature extraction, we first extract features from video frames and then select representative frames as shot features. The shot features are used as the nodes of the graph, and the original graph is constructed using the similarity between shots as edges. Then, a VGAE is employed to reconstruct the structure of the original graph and learn a graph structure that is potentially richer in information. After that, we introduce a GAT to process the reconstructed graph, aggregating the neighbor information of the nodes with different weights so that the learned node features are more stable and representative.

3.1. Feature Extraction

First, the original video is defined as $F = \{f_n\}_{n=1}^{N}$, $f_n \in \mathbb{R}^{w \times h \times 3}$, where $f_n$ denotes the $n$-th video frame, $N$ denotes the total number of frames in the video, $w$ denotes the width of each frame, $h$ denotes the height of each frame, and 3 denotes the number of image channels. The specific process of feature extraction is as follows.

3.1.1. Shot Split

First, the KTS algorithm [22] is utilized to segment the original video into a number of shots of varying lengths. The set of shots of the original video is denoted as $S = \{s_m\}_{m=1}^{M}$, $s_m \in \mathbb{R}^{w \times h \times 3 \times T_m}$, where $M$ denotes the total number of shots into which the video is segmented, $s_m$ denotes the $m$-th shot containing $T_m$ video frames, and $N = \sum_{m=1}^{M} T_m$.

3.1.2. Calculation of Shot Features

After the collection of shots is obtained, the deep features of each frame are extracted using GoogLeNet. Since a shot is a short sequence of consecutive frames captured in a single take and characterizing the same semantic action, the visual features extracted by GoogLeNet for the frames within a shot are relatively similar. Moreover, since the number of frames in each shot differs, the dimensions of the shot features are not the same. In order to construct the video summary graph structure, this paper selects the features of a representative frame in each shot as the features of that shot. The shot features are denoted as $G = \{\mathbf{s}_m\}_{m=1}^{M}$, $\mathbf{s}_m \in \mathbb{R}^{F \times T_m}$, where $\mathbf{s}_m$ is the deep feature of shot $s_m$, $F$ is the feature dimension of a frame, and $T_m$ is the number of frames in shot $s_m$. For each frame in a shot, the average of the $\ell_2$ distances to the remaining frames is computed, and the representative frame is the one with the lowest average.
$$\text{represent frame} = \exp\!\left(\frac{1}{T_m}\sum_{t=1}^{T_m}\min_{t \neq y}\lVert x_t - x_y\rVert_2\right).$$
We denote by X the nodes (shot features) of the graph.
$$X = s_{\text{represent frame}},$$
where $s_{\text{represent frame}} \in \mathbb{R}^{F \times M}$ denotes the features of the representative frames.
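As an illustration of this selection rule, the following is a minimal NumPy sketch (not the authors' implementation; the array shapes and the function name are assumptions): for each frame it computes the mean L2 distance to the other frames of the shot and returns the feature of the frame with the smallest average.
```python
import numpy as np

def representative_frame(shot_features: np.ndarray) -> np.ndarray:
    # shot_features: (T_m, F) GoogLeNet features for the frames of one shot.
    diffs = shot_features[:, None, :] - shot_features[None, :, :]  # (T_m, T_m, F)
    dists = np.linalg.norm(diffs, axis=-1)                         # pairwise L2 distances
    t_m = shot_features.shape[0]
    mean_dist = dists.sum(axis=1) / max(t_m - 1, 1)                # mean distance to the other frames
    return shot_features[np.argmin(mean_dist)]                     # frame with the lowest average

# Stacking one representative feature per shot yields the node matrix X of shape (M, F).
```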

3.1.3. Adjacency Matrix Construction

We construct adjacency matrix A using the extracted node features X. There are four general ways to construct the adjacency matrix:
Dot product [11]:
$$e_{ij} = f(x_i, x_j) = \phi(x_i)^{T}\varphi(x_j);$$
Gaussian [11]:
$$e_{ij} = f(x_i, x_j) = \exp\{\phi(x_i)^{T}\varphi(x_j)\};$$
Concatenation [11]:
$$e_{ij} = f(x_i, x_j) = W_e^{T}\,[\phi(x_i) \,\|\, \varphi(x_j)];$$
Cosine similarity [23]:
$$e_{ij} = f(x_i, x_j) = \frac{x_i^{T} x_j}{\lVert x_i \rVert_2 \cdot \lVert x_j \rVert_2},$$
where $x_i$ and $x_j$ are node features and $\|$ denotes the concatenation operation. $\phi(\cdot)$ and $\varphi(\cdot)$ are linear transformations based on $W_\phi$ and $W_\varphi$; $W_\phi$, $W_\varphi$, and $W_e$ are all learnable parameters. In video summarization there is no predefined graph structure, so to create one we take the shot features as the graph's nodes and determine the edges by comparing the similarity of pairs of nodes. According to [11], the F1 score obtained when the edge weights are calculated with the dot product is slightly higher than the scores obtained with the Gaussian and concatenation methods, and according to [24], cosine similarity is better suited to measuring the similarity between two nodes; this paper therefore uses cosine similarity to compute the similarity of two nodes:
$$e_{ij} = \frac{x_i^{T} x_j}{\lVert x_i \rVert_2 \cdot \lVert x_j \rVert_2},$$
where $i, j = 1, 2, \ldots, M$, $x_i$ denotes the feature of the $i$-th shot, and $\lVert x_i \rVert_2$ denotes the L2 norm of $x_i$.
At this point, we have a complete graph $G = (X, E)$. $X$ is the set of nodes of the graph, consisting of the video shot features $\{x_1, x_2, \ldots, x_M\}$; $E$ is the set of edges, consisting of $\{e_{11}, \ldots, e_{ij}, \ldots, e_{MM}\}$, where $e_{ij}$ is the similarity of two nodes; i.e., the adjacency matrix $A \in \mathbb{R}^{M \times M}$ is obtained.
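For concreteness, a minimal NumPy sketch of this construction, assuming the node features are stacked row-wise into an $M \times F$ matrix (the function name is ours):
```python
import numpy as np

def cosine_adjacency(X: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # X: (M, F) matrix of shot features; returns A with A[i, j] = cos(x_i, x_j).
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # L2 norm of each row
    X_hat = X / np.maximum(norms, eps)                # row-normalized features
    return X_hat @ X_hat.T                            # (M, M) cosine-similarity adjacency matrix
```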

3.2. Reconstruction of the Adjacency Matrix

VGAE consists of an encoder and a decoder. The encoder learns the mean $\mu$ and log standard deviation $\log\sigma$ of the nodes' low-dimensional vector representations, uses the posterior probability $q(Z \mid A, X)$ to obtain the latent variables $Z$, and then reconstructs the adjacency matrix $A$ from $Z$.
The encoder is implemented by a two-layer graph convolutional network:
$$\mathrm{GCN}(X, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} X W_0)\, W_1,$$
where $W_0$ and $W_1$ are the weight matrices of the first and second layers, $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix, and $D$ is the degree matrix.
Here are the specific steps:
The node features $X$ and the adjacency matrix $A$ are fed into the encoder of VGAE to obtain the mean $\mu$ and (log) standard deviation $\log\sigma$ of the low-dimensional vector representations of the nodes, computed respectively as
$$\mu = \mathrm{GCN}_{\mu}(X, A),$$
$$\log\sigma = \mathrm{GCN}_{\sigma}(X, A).$$
We then sample the latent variables $Z$ from the normal distribution $\mathcal{N}(\mu, \sigma^2)$ defined by the nodes' low-dimensional vectors. The posterior distribution of the latent variables $Z$ is
$$q(Z \mid X, A) = \prod_{i=1}^{N} q(z_i \mid X, A),$$
$$q(z_i \mid X, A) = \mathcal{N}\big(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)\big).$$
VGAE requires each sample to have its own corresponding normal distribution, so that each $z_i$ obtained by sampling corresponds one-to-one with a real sample, which ensures the learning effect of the model.
The sampling process is not differentiable, which means the model cannot perform backpropagation and gradient descent. To resolve this problem, the reparameterization trick is introduced to turn the stochastic sampling into a deterministic transformation:
$$z = \mu + \sigma\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I).$$
To reconstruct the graph structure, the decoder uses an inner product calculation between two hidden variables.
$$p(A \mid Z) = \prod_{i=1}^{N}\prod_{j=1}^{N} p(A_{ij} \mid z_i, z_j),$$
$$p(A_{ij} = 1 \mid z_i, z_j) = \sigma(z_i^{T} z_j),$$
where $\sigma(\cdot)$ is the sigmoid function. We reconstruct the adjacency matrix $\hat{A}$ using the inner product of the latent variables $Z$:
$$\hat{A} = \mathrm{sigmoid}(Z Z^{T}).$$
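Putting the two-layer GCN encoder, the reparameterization, and the inner-product decoder together, a minimal PyTorch sketch of this component is shown below (a sketch under our own assumptions about layer sizes and class names, not the authors' exact implementation):
```python
import torch
import torch.nn as nn

class VGAESketch(nn.Module):
    def __init__(self, in_dim=1024, hid_dim=256, z_dim=64):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hid_dim, bias=False)       # first GCN layer weight W_0
        self.w_mu = nn.Linear(hid_dim, z_dim, bias=False)      # second-layer weight of GCN_mu
        self.w_sigma = nn.Linear(hid_dim, z_dim, bias=False)   # second-layer weight of GCN_sigma

    @staticmethod
    def normalize(A):
        # Symmetric normalization: A_tilde = D^{-1/2} A D^{-1/2}
        d_inv_sqrt = A.sum(dim=1).clamp(min=1e-8).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

    def encode(self, X, A):
        A_t = self.normalize(A)
        h = torch.relu(A_t @ self.w0(X))     # ReLU(A_tilde X W_0)
        mu = A_t @ self.w_mu(h)              # GCN_mu(X, A)
        log_sigma = A_t @ self.w_sigma(h)    # GCN_sigma(X, A)
        return mu, log_sigma

    def forward(self, X, A):
        mu, log_sigma = self.encode(X, A)
        eps = torch.randn_like(mu)                  # epsilon ~ N(0, I)
        Z = mu + torch.exp(log_sigma) * eps         # reparameterization: z = mu + sigma * eps
        A_hat = torch.sigmoid(Z @ Z.T)              # inner-product decoder
        return A_hat, mu, log_sigma
```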

3.3. Optimizing the Adjacency Matrix

In our model, we utilize the GAT to consider the weight information of various neighbors. This helps us optimize the adjacency matrix and minimize its redundancy.
We calculate the attention scores of the central node and the neighboring nodes.
$$e_{ij} = \phi(W X_i, W X_j),$$
where $e_{ij}$ denotes the importance of the $j$-th node to the $i$-th node, i.e., the attention score; $X_i$ and $X_j$ represent the features of the $i$-th and $j$-th nodes; $\phi$ denotes the method of computing the attention score, for which the inner product is commonly used; and $W \in \mathbb{R}^{F' \times F}$ is the learnable linear transformation of the GAT that maps the input features to higher-level features.
Our model implements the scoring function with an MLP: the attention vector $a$ is defined with twice the dimension of the mapped node features, and the two mapped node features are concatenated column-wise before computing their inner product with $a$. The attention scores between a node and its neighboring nodes are
$$e_{ij} = a(W X_i, W X_j).$$
The activated attention score $e_{ij}^{\mathrm{GAT}}$ is
$$e_{ij}^{\mathrm{GAT}} = \mathrm{LeakyReLU}\big(a^{T}\,[W X_i \,\|\, W X_j]\big),$$
where $T$ denotes the transpose and $[\cdot \,\|\, \cdot]$ denotes the concatenation operation.
We then update the attention scores and optimize the graph structure. In order to highlight the key information, we multiply the reconstructed adjacency matrix $\hat{A}$ with the activated attention scores $e_{ij}^{\mathrm{GAT}}$ to obtain the updated attention scores $\tilde{e}_{ij}$:
$$\tilde{e}_{ij} = \hat{A}\, e_{ij}^{\mathrm{GAT}}.$$
Finally, we normalize the attention scores over all neighboring nodes of node $i$ using the softmax function:
$$\tilde{e}_{ij} = \mathrm{softmax}(\tilde{e}_{ij}) = \frac{\exp(\tilde{e}_{ij})}{\sum_{k \in N_i}\exp(\tilde{e}_{ik})},$$
where $k \in N_i$ indexes the neighbors of node $i$.
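The steps above can be summarized in the following PyTorch sketch, which projects the node features with $W$, scores every node pair with the vector $a$, gates the scores with the reconstructed adjacency matrix $\hat{A}$, and softmax-normalizes them (dimensions and names are our assumptions, not the authors' code):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATScoreSketch(nn.Module):
    def __init__(self, in_dim=1024, out_dim=256):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)          # shared linear map W
        self.a = nn.Parameter(torch.randn(2 * out_dim) * 0.01)   # attention vector a

    def forward(self, X, A_hat):
        H = self.W(X)                                            # (M, out_dim) mapped features W X
        M = H.size(0)
        # Build [W X_i || W X_j] for every pair (i, j) and score it with a.
        pairs = torch.cat([H.unsqueeze(1).expand(M, M, -1),
                           H.unsqueeze(0).expand(M, M, -1)], dim=-1)  # (M, M, 2*out_dim)
        e = F.leaky_relu(pairs @ self.a)                         # activated scores e_ij^GAT, (M, M)
        e = A_hat @ e                                            # update with reconstructed adjacency
        return F.softmax(e, dim=1)                               # normalized scores e_tilde_ij
```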

3.4. Feature Fusion

Feature fusion can improve the quality of key-shot selection. To achieve this, the fused feature $F$ is obtained by combining the shot structural features $\tilde{e}_{ij}$ optimized by VGAE and GAT with the shot deep features $G$ extracted by GoogLeNet. The resulting $F$ is then input into an MLP to obtain the shot prediction scores $S_{\mathrm{pred}}$ of the network. By using feature fusion, the importance of video shots can be evaluated in a more rational way.
$$S_{\mathrm{pred}} = \mathrm{MLP}(F) = \mathrm{MLP}\big(\mathrm{cat}(\tilde{e}_{ij}, G)\big).$$
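A minimal sketch of this fusion-and-scoring step in PyTorch (the hidden width and the sigmoid output layer are our assumptions):
```python
import torch
import torch.nn as nn

class FusionScorer(nn.Module):
    def __init__(self, struct_dim, raw_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(struct_dim + raw_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),                                   # per-shot importance score in [0, 1]
        )

    def forward(self, struct_feat, raw_feat):
        fused = torch.cat([struct_feat, raw_feat], dim=-1)  # F = cat(e_tilde, G)
        return self.mlp(fused).squeeze(-1)                  # S_pred, one score per shot
```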

3.5. Loss Function

The loss function for VGAE, called L VGAE , includes two components: reconstruction loss and KL divergence. To determine the distance between generated and original graphs, the reconstruction loss uses cross-entropy loss. Meanwhile, KL divergence calculates the distance between the node representation vector distribution and the normal distribution.
$$L_{\mathrm{VGAE}} = -\,\mathbb{E}_{q(Z \mid X, A)}\big[\log p(A \mid Z)\big] + \mathrm{KL}\big[q(Z \mid X, A)\,\|\,p(Z)\big].$$
VOGNet's loss function $L$ uses the mean squared error (MSELoss) to measure the gap between the predicted scores and the user-annotated scores:
$$L = (S_{\mathrm{real}} - S_{\mathrm{pred}})^{2},$$
where $S_{\mathrm{real}}$ denotes the real shot-level score and $S_{\mathrm{pred}}$ denotes the score predicted by the model.
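The two objectives can be written compactly as follows (a PyTorch sketch under the assumption that the adjacency entries fed to the cross-entropy lie in [0, 1]; any loss weighting is omitted):
```python
import torch
import torch.nn.functional as F

def vgae_loss(A_hat, A, mu, log_sigma):
    # Reconstruction term: cross-entropy between the reconstructed and original adjacency.
    recon = F.binary_cross_entropy(A_hat, A)
    # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I).
    kl = -0.5 * torch.mean(
        torch.sum(1 + 2 * log_sigma - mu.pow(2) - torch.exp(2 * log_sigma), dim=1)
    )
    return recon + kl

def summary_loss(s_pred, s_real):
    # Mean squared error between predicted and user-annotated shot scores.
    return F.mse_loss(s_pred, s_real)
```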

4. Experiments

In this section, we describe the two benchmark datasets and the evaluation metrics used for video summarization.

4.1. Datasets

Two standard datasets, the SumMe [25] dataset and the TVSum [26] dataset, are employed in this experiment. The SumMe dataset consists of 25 user-recorded videos. The duration of each video ranges from 1.5 to 6.5 min, with a minimum duration of 32 s, a maximum duration of up to 324 s, and an average duration of about 146 s. Each video has 15–18 manual annotations, in which frames are labeled as unimportant (0) or important (1). The TVSum dataset includes 50 videos covering a variety of topics, including news stories, recorded articles, operating tips, and video blogs. These videos are between 2 and 10 min long, with a minimum of 83 s, a maximum of 647 s, and an average duration of about 235 s. Each video is viewed by 20 users, who assign scores from 1 to 5, and the average of these scores is used as the final rating. Both SumMe and TVSum provide frame-level scores. Shots are segments made up of sequences of frames, and within the same shot the frames typically vary relatively little. Since our experiments require shot-level scores, we utilize the KTS algorithm to divide the video into segments of varying lengths; after modeling, we obtain shot-level scores and ultimately generate a summary using a 0/1 knapsack algorithm.
During the experimental stage, we split the dataset into 80% for training and 20% for testing. In contrast to splitting the dataset in its original order, we perform the split after randomly shuffling the video indices. To ensure that the experimental results are broadly applicable, we use Python's random library to randomly divide the data into K parts and then perform K-fold cross-validation, taking the average of the K results as the final result.
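For reference, a minimal sketch of this shuffled K-fold split (the function name and seed handling are our own; K = 5 corresponds to the 80%/20% split):
```python
import random

def k_fold_splits(num_videos, k=5, seed=0):
    # Shuffle video indices, cut them into k folds, and yield (train, test) index lists.
    idx = list(range(num_videos))
    random.seed(seed)
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        yield train, test
```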

4.2. Evaluation Metrics

For assessing video summarization, the $F_1$ score is the most frequently used evaluation metric. It is computed from precision and recall: precision denotes the ratio of correctly predicted frames to all frames predicted by the model, while recall denotes the ratio of correctly predicted frames to the manually annotated ones.
$$\mathrm{Precision} = \frac{N_{\mathrm{match}}}{N_{\mathrm{extract}}},$$
where $N_{\mathrm{match}}$ is the number of frames on which the manual summary and the model-generated summary match, and $N_{\mathrm{extract}}$ is the number of frames in the summary generated by the model.
$$\mathrm{Recall} = \frac{N_{\mathrm{match}}}{N_{\mathrm{US}}},$$
where $N_{\mathrm{US}}$ is the number of manually labeled video summarization frames.
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
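A small Python helper makes the computation explicit (the argument names follow the counts defined above):
```python
def f1_score(n_match, n_extract, n_us):
    # n_match: frames shared by the generated and user summaries;
    # n_extract: frames in the generated summary; n_us: frames in the user summary.
    if n_extract == 0 or n_us == 0:
        return 0.0
    precision = n_match / n_extract
    recall = n_match / n_us
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```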

4.3. Results

In this section, we first compare the proposed model with RNN-based methods, GNN-based methods and Reconstruction-based methods as well as state-of-the-art video summarization methods. The model is then subjected to ablation experiments to verify the validity of each module. Lastly, the results of the experiments are visualized.

4.3.1. Contrast Experiment

Table 1 shows the F1 scores of this paper's method compared with other state-of-the-art methods on the two standard datasets. In particular, vsLSTM and dppLSTM introduce the determinantal point process (DPP) to obtain representative and concise video summaries of the sequence. SUM-GAN is an unsupervised video summarization model based on a variational recurrent auto-encoder and a generative adversarial network, consisting of a summarizer and a discriminator: the summarizer tries to generate a summary that fools the discriminator, while the discriminator tries to distinguish the summarized video from the original video. DR-DSN is trained without supervision by reinforcement learning, designing different reward functions to obtain a more representative and diverse video summary. CSNetsup designs a simple but effective regularization term, called variance loss, as well as a chunk and stride network (CSNet) that exploits both local (chunk) and global (stride) temporal views of the video features. HSA-RNN is a hierarchical structure-adaptive network that preserves the hierarchical structure of video data and improves the quality of the generated summary. M-AVS is divided into an encoding part and a decoding part and introduces an attention mechanism in the decoder; concretely, the encoder is a Bi-LSTM and the decoder is a Bi-LSTM with attention. FCSN formulates video summarization as a sequential problem, establishes a new link between semantic segmentation and video summarization, and applies a semantic segmentation network to video summarization. RSGN divides the video into a frame level and a shot level: frame-level dependencies are encoded as sequences by a long short-term memory network (LSTM), and shot-level dependencies are encoded as graphs by a GCN. GATVSunsup uses GAT to aggregate the frame features into deep features, then proposes a new spatial attention model to extract the visual features of the frames, and finally combines the high-level visual features with the sequence features extracted by a Bi-LSTM. GCANsup consists of embedding learning and context fusion; the embedding learning is subdivided into a temporal branch and a graph branch, and it learns the embedding representation of the graph through a multilayer graph convolutional network.
The data in Table 1 show that the proposed method achieves strong results on both datasets, which demonstrates its effectiveness and robustness. Although the result of our method on the SumMe dataset is 3.2% lower than that of GCANsup, its F1 score on TVSum is higher than that of GCANsup. The result of M-AVS on the TVSum dataset is 0.2% higher than that of the proposed method, but on the SumMe dataset it is 5.4% lower. Although our method does not achieve the single best score on either dataset, it is close to the best on both, which, in comparison with other state-of-the-art methods, demonstrates the generalization ability of our method.

4.3.2. Ablation Experiment

To further validate the effects of VGAE and GAT on the video summary generation model, we design three ablation experiments on two datasets, SumMe and TVSum, namely M_model, G_model, and VG_model, where M_model indicates that the model of the experiment contains only multilayer perceptron, G_model indicates that the model contains GAT and does not contain VGAE, and VG_model indicates that the model includes VGAE and GAT. The experimental results are shown in Table 2.
From Table 2, it can be observed that the F1 score of G_model is lower than that of M_model on both datasets, which indicates that the features learned by the multilayer perceptron alone are superior to those learned by GAT; the probable explanation is that there is a large amount of redundant information in the adjacency matrix fed to GAT. Comparing VG_model and M_model shows that on the SumMe dataset the F1 score of VG_model is 0.9% higher than that of M_model, while on the TVSum dataset the results of the two models are comparable. Further comparing VG_model and G_model, the F1 score of VG_model is 4.5% higher than that of G_model on the SumMe dataset and 1.9% higher on the TVSum dataset. From the above analysis, it can be concluded that G_model performs worse because of the redundant information in the graph, while VG_model, which combines VGAE and GAT, substantially improves performance because the reconstruction of the graph structure by VGAE removes the redundant information in the graph.

4.3.3. Visualization Result

In our video summarization visualization experiments, we apply the VOGNet model and compare its output with the ground truth to evaluate the quality of the generated summaries. The video examples shown in Figure 2 are the 15th and 25th videos of SumMe and the 37th and 39th videos of TVSum. A comparison of the plots reveals that VOGNet generates summaries that closely match the ground truth and accurately selects the keyframes.
Figure 3 shows the heat map of the shot adjacency matrix, where darker colors indicate higher similarity. From Figure 3, it can be seen that the highest similarity lies on the diagonal, since the cosine similarity of a shot with itself is one. In addition, the heat map is roughly symmetric because the adjacency matrix of the undirected graph generated by the model is symmetric.

5. Conclusions

A video summarization generation model based on graph structure reconstruction is proposed in this paper. It is designed to remove the redundant information in the adjacency matrix through graph structure reconstruction and to effectively extract the structural features of the video. Specifically, the video is first segmented into shots and frame features are extracted, and the features of a representative frame are then selected as the features of the current shot. Next, with the shot features as nodes and the similarity between shots as the weights of the edges, the graph is constructed and the adjacency matrix is obtained. The original graph is input into VGAE to obtain the reconstructed graph structure, and the neighbor information in the graph is then learned through GAT. Finally, to avoid losing critical information during training, we fuse the GAT output with the original shot features to obtain the importance scores of the shots and generate the summary. In future research, we intend to extend the study of video summarization to the multimodal domain to fuse video, audio, and textual information more comprehensively, which will contribute to richer, more informative summaries that enable users to better understand video content. Given the limited number of labeled datasets and the large human and material investment required for annotation, we will consider exploring unsupervised methods to reduce the reliance on large amounts of labeled data. To further improve the quality of generated summaries, we also intend to design new evaluation metrics that measure model performance more accurately. Furthermore, taking into account the importance of user privacy protection, we will work on ways to protect private information in videos during video summarization generation.

Author Contributions

Conceptualization, G.W.; Methodology, J.Z.; Software, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Natural Science Foundation of Gansu Province under grant No. 21JR7RA570, 20JR10RA334, Gansu University of Political Science and Law Major Scientific Research and Innovation Projects under grant No. GZF2020XZDA03, the Young Doctoral Fund Project of Higher Education Institutions in Gansu Province in 2022 under grant No. 2022QB-123, Gansu Province Higher Education Innovation Fund Project under grant No. 2022A-097, and University-level Innovative Research Team of Gansu University of Political Science and Law.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Saini, P.; Kumar, K.; Kashid, S.; Saini, A.; Negi, A. Video summarization using deep learning techniques: A detailed analysis and investigation. Artif. Intell. Rev. 2023, 56, 12347–12385. [Google Scholar] [CrossRef] [PubMed]
  2. Wu, J.; Zhong, S.-H.; Liu, Y. MvsGCN: A novel graph convolutional network for multi-video summarization. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019. [Google Scholar]
  3. Xu, W.; Wang, R.; Guo, X.; Li, S.; Ma, Q.; Zhao, Y.; Guo, S.; Zhu, Z.; Yan, J. MHSCNET: A Multimodal Hierarchical Shot-Aware Convolutional Network for Video Summarization. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
  4. Meena, P.; Kumar, H.; Yadav, S.K. A review on video summarization techniques. Eng. Appl. Artif. Intell. 2023, 118, 105667. [Google Scholar] [CrossRef]
  5. Zhao, B.; Li, X.; Lu, X. HSA-RNN: Hierarchical Structure-Adaptive RNN for Video Summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  6. Zhao, B.; Li, X.; Lu, X. Hierarchical recurrent neural network for video summarization. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017. [Google Scholar]
  7. Zhong, R.; Wang, R.; Zou, Y.; Hong, Z.; Hu, M. Graph Attention Networks Adjusted Bi-LSTM for Video Summarization. IEEE Signal Process 2021, 28, 663–667. [Google Scholar] [CrossRef]
  8. Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  9. Haq, H.B.U.; Asif, M.; Ahmad, M.B. Video Summarization Using Deep Neural Networks: A Survey. Int. J. Sci. Technol. Res. 2020, 11, 146–153. [Google Scholar]
  10. Kipf, T.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 4–6 April 2017. [Google Scholar]
  11. Zhao, B.; Li, H.; Lu, X.; Li, X. Reconstructive Sequence-Graph Network for Video Summarization. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2793–2801. [Google Scholar] [CrossRef] [PubMed]
  12. Zhu, W.; Han, Y.; Lu, J.; Zhou, J. Relational reasoning over spatial-temporal graphs for video summarization. IEEE Trans. Image Process. 2022, 31, 3017–3031. [Google Scholar] [CrossRef] [PubMed]
  13. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  14. Zhang, K.; Chao, W.L.; Sha, F. Video summarization with long short-term memory. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  15. Ji, Z.; Xiong, K.; Pang, Y. Video summarization with attention-based encoder-decoder networks. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1709–1717. [Google Scholar] [CrossRef]
  16. Park, J.; Lee, I.; Kim, J. Sumgraph: Video summarization via recursive graph modeling. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  17. Li, P.; Tang, C.; Xu, X. Video summarization with a graph convolutional attention network. Front. Inf. Technol. Electron. Eng 2021, 22, 902–913. [Google Scholar] [CrossRef]
  18. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  19. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  20. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  21. Kipf, T.N.; Welling, M. Variational Graph Auto-Encoders. arXiv 2016, arXiv:1611.07308. [Google Scholar]
  22. Potapov, D.; Douze, M.; Harchaoui, Z.; Schmid, C. Category-Specific Video Summarization. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 540–555. [Google Scholar]
  23. Park, J.; Lee, J.; Kim, J. Probabilistic Representations for Video Contrastive Learning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  24. Yasir, M.A.; Ali, Y.H. Dynamic Background Subtraction in Video Surveillance Using Color-Histogram and Fuzzy C-Means Algorithm with Cosine Similarity. Int. J. Online Biomed. Eng. 2022, 18, 74–85. [Google Scholar] [CrossRef]
  25. Gygli, M.; Grabner, H.; Riemenschneider, H. Creating summaries from user videos. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  26. Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. Tvsum: Summarizing web videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  27. Zhou, K.; Qiao, Y.; Xiang, T. Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward. In Proceedings of the AAAI Conference on Artificial Intelligence, McLean, VA, USA, 2–7 February 2018. [Google Scholar]
  28. Jung, Y.; Cho, D.; Kim, D.; Woo, S. Discriminative feature learning for unsupervised video summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 29–31 January 2019. [Google Scholar]
  29. Liu, Y.T.; Li, Y.J.; Yang, F.; Woo, E. Learning hierarchical self-attention for video summarization. In Proceedings of the 2019 IEEE International Conference on Image Processing, Taipei, Taiwan, 22–25 September 2019. [Google Scholar]
  30. Rochan, M.; Ye, L.W. Video summarization using fully convolutional sequence networks. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
Figure 1. The overall framework of our method.
Figure 2. Comparison of model-generated summary and manual evaluation.
Figure 3. Heat map of the shot adjacency matrix.
Table 1. Comparison of F1 scores with existing advanced methods.

Method            SumMe    TVSum
vsLSTM [14]       37.6     54.2
DPP-LSTM [14]     38.6     54.7
SUM-GAN [8]       41.7     56.3
DR-DSN [27]       42.1     58.1
CSNetsup [28]     48.6     58.5
HSA-RNN [5]       44.1     59.8
M-AVS [29]        44.4     61.0
FCSN [30]         48.8     58.4
RSGN [11]         45.0     60.1
GATVSunsup [7]    51.5     59.1
GCANsup [17]      53.0     60.7
VOGNet (ours)     49.8     60.8
Table 2. F1 score results of ablation experiments on the SumMe and TVSum datasets.

Model       SumMe    TVSum
M_model     48.9     60.8
G_model     45.3     58.9
VG_model    49.8     60.8
