Article

Summarization of Videos with the Signature Transform

J. de Curtò, I. de Zarzà, Gemma Roig and Carlos T. Calafate
1 Centre for Intelligent Multidimensional Data Analysis, HK Science Park, Shatin, Hong Kong
2 Departamento de Informática de Sistemas y Computadores, Universitat Politècnica de València, 46022 València, Spain
3 Informatik und Mathematik, GOETHE-University Frankfurt am Main, 60323 Frankfurt am Main, Germany
4 Estudis d’Informàtica, Multimèdia i Telecomunicació, Universitat Oberta de Catalunya, 08018 Barcelona, Spain
* Author to whom correspondence should be addressed.
Electronics 2023, 12(7), 1735; https://doi.org/10.3390/electronics12071735
Submission received: 16 March 2023 / Revised: 30 March 2023 / Accepted: 2 April 2023 / Published: 5 April 2023
(This article belongs to the Special Issue Advanced Technologies for Image/Video Quality Assessment)

Abstract

This manuscript presents a new benchmark for assessing the quality of visual summaries without the need for human annotators. It is based on the Signature Transform, specifically the RMSE and MAE Signature and Log-Signature metrics, and builds upon the assumption that uniform random sampling can offer accurate summarization capabilities. We provide a new dataset comprising videos from YouTube and their corresponding automatic audio transcriptions. First, we introduce a preliminary baseline for automatic video summarization, which has at its core a Vision Transformer, an image–text model pre-trained with Contrastive Language–Image Pre-training (CLIP), and an object detection module. We then propose an accurate technique grounded in the harmonic components captured by the Signature Transform, which delivers compelling accuracy. The analytical measures are extensively evaluated, and we conclude that they correlate strongly with the notion of a good summary.

1. Introduction and Problem Statement

Video data have become ubiquitous, from content creation to the animation industry. The ability to summarize the information present in large quantities of data is a central problem in many applications, particularly when there is a need to reduce the amount of information transmitted and to swiftly assimilate visual contents. Video summarization [1,2,3,4,5,6,7] has been extensively studied in Computer Vision, using both handcrafted methods [8] and learning techniques [9,10]. These approaches traditionally use feature extraction on keyframes to formulate an adequate summary.
Recent advances in Deep Neural Networks (DNN) [11,12,13] have spurred progress across various scientific fields [14,15,16,17,18,19,20,21]. In the realm of video summarization, two prominent approaches have emerged: LSTM- and RNN-based models [22,23,24]. These models have demonstrated considerable success in developing effective systems for video summarization. Additionally, numerous other learning techniques have been employed to address this challenge [25,26,27,28].
In this study, we introduce a novel concatenation of models for video summarization, capitalizing on advancements in Visual Language Models (VLM) [29,30]. Our approach combines zero-shot text-conditioned object detection with automatic text video annotations, resulting in an initial summarization method that captures the most critical information within the visual sequence.
Metrics to assess the performance of such techniques have usually relied on a human in the loop, using services such as Amazon Mechanical Turk (AMT) to provide annotated summaries for comparison. Quantitative measures have been proposed to address this problem, the most common being the F1-score, but they still require human annotators, and studies using them have shown that many state-of-the-art methodologies perform worse than mere uniform random sampling [31].
In this work, we go beyond the current state of the art and introduce a set of metrics based on the Signature Transform [32,33], a rough analogue of the Fourier Transform that takes order and area into account; the metrics contrast the spectrum of the original video with the spectrum of the generated summary to provide a measurable score. We then propose an accurate state-of-the-art baseline based on the Signature Transform to accomplish the task. Thorough evaluations show that the methodologies provide accurate video summaries and that the technique based on the Signature Transform achieves summarization capabilities superior to the state of the art. Indeed, the temporal structure of a video timeline makes the Signature Transform an ideal candidate to assess the quality of generated summaries, since a video stream can be treated as a path.
Section 2 gives a primer on the Signature Transform to bring forth in Section 2.1 a set of metrics to assess the quality of visual summaries by considering the harmonic components of the signal. The metrics are then used to put forward an accurate baseline for video summarization in Section 2.2. In the following section, we introduce the concept of Foundation Models, which serves to propose a preliminary technique for the summarization of videos. Thorough experiments are conducted in Section 4, with emphasis on the newly introduced dataset and the set of measures. Section 4.1 gives an assessment of the metrics in comparison to human annotators, whereas Section 4.2 evaluates the performance of the baselines based on the Signature Transform against another technique. Finally, Section 5 delivers conclusions, addresses the limitations of the methodology, and discusses further work.

2. Signature Transform

The Signature Transform [34,35,36,37,38] is roughly equivalent to the Fourier Transform; instead of extracting information concerning frequency, it extracts information about the order and area. However, the Signature Transform differs from the Fourier Transform in that it utilizes the space of functions of paths, a more general case than the basis of the space of paths found in the Fourier Transform.
Following the work in [34], the truncated signature of order N of the path x is defined as a collection of coordinate iterated integrals
$$S^{N}(x) = \left( \int_{0 < t_1 < \cdots < t_a < 1} \prod_{c=1}^{a} \frac{\mathrm{d}f_{z_c}}{\mathrm{d}t}(t_c)\, \mathrm{d}t_1 \cdots \mathrm{d}t_a \right)_{\substack{1 \le z_1, \ldots, z_a \le d \\ 1 \le a \le N}}.$$
Here, $x = (x_1, \ldots, x_n)$, where $x_z \in \mathbb{R}^d$. Let $f = (f_1, \ldots, f_d) \colon [0,1] \to \mathbb{R}^d$ be continuous, such that $f\big(\tfrac{z-1}{n-1}\big) = x_z$, and linear on the intervals in between.
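To make the definition concrete, the following minimal sketch (Python/NumPy; an illustration, not the implementation used in our experiments, which can rely on optimized libraries such as Signatory [36]) computes the truncated signature of the piecewise-linear path through a sequence of points by taking the truncated tensor exponential of each linear increment and combining segments with Chen's identity.

```python
import numpy as np

def _tensor_exp(dx, depth):
    """Truncated tensor exponential of a single increment dx in R^d.
    Level k equals the k-fold tensor power of dx divided by k!; level 0 is the scalar 1."""
    levels = [np.array(1.0)]
    term = np.array(1.0)
    for k in range(1, depth + 1):
        term = np.multiply.outer(term, dx) / k  # builds dx^{(x)k} / k!
        levels.append(term)
    return levels

def _chen(a, b, depth):
    """Chen's identity: level k of the concatenated path is sum_j a[j] (x) b[k-j]."""
    out = [np.array(1.0)]
    for k in range(1, depth + 1):
        s = np.zeros(a[k].shape)
        for j in range(k + 1):
            s = s + np.multiply.outer(a[j], b[k - j])
        out.append(s)
    return out

def truncated_signature(x, depth):
    """Truncated signature of order `depth` of the piecewise-linear path through
    the rows of x (an n x d array, n >= 2). Returns levels 1..depth as dense tensors."""
    x = np.asarray(x, dtype=float)
    sig = _tensor_exp(x[1] - x[0], depth)
    for i in range(2, len(x)):
        sig = _chen(sig, _tensor_exp(x[i] - x[i - 1], depth), depth)
    return sig[1:]

# Example: order-3 signature of a short 2-D path.
path = np.array([[0.0, 0.0], [1.0, 0.5], [1.5, 1.5], [2.0, 1.0]])
s1, s2, s3 = truncated_signature(path, depth=3)
print(s1.shape, s2.shape, s3.shape)  # (2,) (2, 2) (2, 2, 2)
```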

2.1. RMSE and MAE Signature and Log-Signature

The F1-score between a summary and the ground truth of annotated data has been the widely accepted measure of choice for the task of video summarization. However, recent approaches highlighted the need to come up with metrics that can capture the underlying nature of the information present in the video [31].
In this work, we leverage tools from harmonic analysis, through the Signature Transform, to introduce a set of measures, namely the Signature and Log-Signature Root Mean Squared Error (denoted from now on as RMSE Signature and Log-Signature), that shed light on what constitutes a good summary and serve as powerful tools to analytically quantify the information present in the selected frames.
As introduced in [32] in the context of GAN convergence assessment, the RMSE and MAE Signature and Log-Signature can be defined as follows, particularized for the application under study:
Definition 1. 
Given $n$ components of the element-wise mean of the signatures $\{\tilde{y}^{(c)}\}_{c=1}^{n} \in T(\mathbb{R}^d)$ of the target summary to be scored, and the same number of components of the element-wise mean of the signatures $\{\tilde{x}^{(c)}\}_{c=1}^{n} \in T(\mathbb{R}^d)$ of the original video subsampled at a given frame rate and uniformly chosen, we define the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) as
$$\mathrm{RMSE}\Big(\{\tilde{x}^{(c)}\}_{c=1}^{n}, \{\tilde{y}^{(c)}\}_{c=1}^{n}\Big) = \sqrt{\frac{1}{n}\sum_{c=1}^{n}\big(\tilde{y}^{(c)} - \tilde{x}^{(c)}\big)^{2}},$$
and
$$\mathrm{MAE}\Big(\{\tilde{x}^{(c)}\}_{c=1}^{n}, \{\tilde{y}^{(c)}\}_{c=1}^{n}\Big) = \frac{1}{n}\sum_{c=1}^{n}\big|\tilde{y}^{(c)} - \tilde{x}^{(c)}\big|,$$
respectively, where $T(\mathbb{R}^d) = \bigoplus_{c=0}^{\infty} (\mathbb{R}^d)^{\otimes c}$.
The case for Log-Signature is analogous.
For the task of video summarization, two approaches are possible. When annotated summaries are available, RMSE ( S ¯ , S ¯ t a r g e t ) is computed between an element-wise mean of the annotated summaries and the target summary to be scored. If annotations are not available, the comparison is instead performed against mean random uniform samples, S ¯ , and the mean score and standard deviation are reported. Given the properties of the Signature Transform, the measure takes into consideration the harmonic components that are intrinsic to the video under study and that should be preserved once the video is shortened to produce a summary. Both approaches should lead to the same conclusions, since the harmonic components present in the annotated summaries and those present, on average, in the random uniform samples should agree. A confidence interval for a given measure can be obtained by analyzing the distances in the RMSEs of annotated summaries or random uniform samples, RMSE ( S ¯ a , S ¯ c ).
When comparing against random uniform samples, the underlying assumption is as follows: we assume that good visual summaries capturing all or most of the harmonic components present in the visual cues will achieve a lower standard deviation. In contrast, summaries that lack support for the most important components will yield higher values. For a qualitative example, see Figure 1. With these ideas in mind, we can discern techniques that likely generate consistent summaries from those that fail to convey the most critical information. Moreover, the study of random sample intervals provides a set of tolerances for considering a given summary adequate for the task, meaning it is comparable to or better than uniform sampling of the interval at capturing harmonic components. Consequently, the proposed measures allow for a percentage score representing the number of times a given methodology outperforms random sampling by containing the same or more harmonic components present in the spectrum.
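As a concrete illustration of how the measures are computed, the sketch below continues the NumPy sketch of Section 2 and reuses its truncated_signature helper. The image-to-path mapping shown here (each 64 × 64 grayscale frame read as a stream of 64 rows with 64 channels) is an illustrative choice on our part; the exact construction follows [32], and the order-3, 64 × 64 grayscale setting matches the parameters used later in Section 4. The function names are ours, not part of a released implementation.

```python
import numpy as np

def frame_to_path(gray_64x64):
    """Illustrative image-to-path mapping: a 64x64 grayscale frame becomes a
    path with 64 steps and 64 channels (pixel intensities scaled to [0, 1])."""
    return np.asarray(gray_64x64, dtype=float) / 255.0

def mean_signature(frames, depth=3):
    """Element-wise mean of the flattened truncated signatures of a set of frames."""
    sigs = []
    for f in frames:
        levels = truncated_signature(frame_to_path(f), depth)  # helper from Section 2
        sigs.append(np.concatenate([lv.ravel() for lv in levels]))
    return np.mean(np.stack(sigs), axis=0)

def rmse(a, b):
    """Root mean squared error between two flattened spectra."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def rmse_signature_score(video_frames, summary_frames, n_repeats=10, seed=0):
    """std and mean of RMSE(S_bar, S_bar_*) over n_repeats uniform random samples
    of the original video, each with the same length as the target summary."""
    rng = np.random.default_rng(seed)
    s_target = mean_signature(summary_frames)
    values = []
    for _ in range(n_repeats):
        idx = np.sort(rng.choice(len(video_frames), size=len(summary_frames), replace=False))
        values.append(rmse(mean_signature([video_frames[i] for i in idx]), s_target))
    return float(np.std(values)), float(np.mean(values))
```

A summary whose standard deviation is lower than the corresponding random-versus-random value is taken to preserve the dominant harmonic components.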

2.2. Summarization of Videos with RMSE Signature

A methodology based on the Signature Transform to select proper frames for a visual summary proceeds as follows. Given a uniform random sample of the video to summarize, we compare it against subsequent random summaries using RMSE ( S ¯ , S ¯ * ) . We repeat this procedure n times and choose, as a good candidate, the minimum according to the standard deviation. We can also repeat the procedure for a range of selected summary lengths, which yields a set of good candidates, among which we choose the one with the minimum standard deviation; this provides an estimate of the most suitable length. It is important to note that this baseline is completely unsupervised in the sense that no annotations are used, only the metrics based on the Signature Transform. We rely on the fact that, in general, uniform random samples provide relatively accurate summaries, and among those, we choose the ones that are best according to std( RMSE ( S ¯ , S ¯ * ) ), which we denote as RMSE ( S ¯ , S ¯ u m i n ) | n (a minimal code sketch of this selection procedure is given after the list below). This grants us competitive uniform random summaries according to the given measures to use as a baseline for comparison against other methodologies, and with which we can estimate an appropriate summary length to use in those cases.
Below, we provide a description of the entities involved in the computation of the metrics and the proposed baselines based on the Signature Transform:
  • S ¯ * : Element-wise mean Signature Transform of the target summary to be scored for the corresponding video;
  • S ¯ : Element-wise mean Signature Transform of a uniform random sample of the corresponding video;
  • RMSE ( S ¯ , S ¯ * ) : Root mean squared error between the spectra of S ¯ and S ¯ * with the same summary length. For the computation of the standard deviation and mean, this value is calculated ten times, changing S ¯ each time;
  • RMSE ( S ¯ , S ¯ ) : Root mean squared error between the spectra of two uniform random samples with the same summary length. For the computation of the standard deviation and mean, this value is calculated ten times, changing both samples each time;
  • RMSE ( S ¯ , S ¯ u m i n ) | n : Baseline based on the Signature Transform. It corresponds to RMSE ( S ¯ , S ¯ * ) , where S ¯ * is, in this case, a fixed uniform random sample denoted as S ¯ u . We repeat this procedure n times and choose the candidate with minimum standard deviation, S ¯ u m i n , to propose as a summary;
  • s t d ( ) : Standard deviation.
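The selection procedure for RMSE ( S ¯ , S ¯ u m i n ) | n can be sketched as follows, reusing rmse_signature_score from Section 2.1; the helper names are ours, introduced only for illustration.

```python
import numpy as np

def signature_baseline_summary(video_frames, summary_len, n_candidates=10, seed=0):
    """RMSE(S_bar, S_bar_umin)|n baseline: among n_candidates fixed uniform random
    candidate summaries, keep the one with minimum std(RMSE) against fresh
    uniform random samples of the same length."""
    rng = np.random.default_rng(seed)
    best_idx, best_std = None, np.inf
    for _ in range(n_candidates):
        idx = np.sort(rng.choice(len(video_frames), size=summary_len, replace=False))
        cand = [video_frames[i] for i in idx]
        std, _ = rmse_signature_score(video_frames, cand, seed=int(rng.integers(1_000_000)))
        if std < best_std:
            best_idx, best_std = idx, std
    return best_idx, best_std

# The same loop can be wrapped over a range of summary lengths; the length whose best
# candidate attains the overall minimum standard deviation estimates a suitable length.
```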

3. Summarization of Videos via Text-Conditioned Object Detection

Large Language Models (LLM) [39,40,41,42] and VLMs [43] have emerged as indispensable resources for characterizing complex tasks and bestowing intelligent systems with the capacity to interact with humans in unprecedented ways. These models, also called Foundation Models [44,45,46], excel in a wide variety of tasks, such as robotics manipulation [47,48,49], and can be integrated with other modules to perform robustly in highly complex situations such as navigation and guidance [50,51]. One fundamental module is the Vision Transformer [52].
We introduce a simple yet effective technique aimed at generating video summaries that accurately describe the information contained within video streams, while also proposing new measures for the task of the summarization of videos. These measures will prove useful not only when text transcriptions are available, but also in more general cases in which we seek to describe the quality of a video summary.
Building on the text-conditioned object detection with Vision Transformers recently proposed in [53], we enhance the summarization task by leveraging the automated text transcriptions found on video platforms. We use a noun-extraction module based on NLP techniques [54], whose output is then processed to retain the most frequent nouns. These nouns serve as input queries for text-conditioned object searches in frames, and frames containing the queried objects are selected for the video summary; see Figure 2 for a detailed depiction of the methodology.
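A minimal sketch of the noun-extraction step is given below. It assumes NLTK [54] with its standard tokenizer and part-of-speech tagger models installed; the function name and the confidence-threshold detail are illustrative rather than part of a released implementation. The resulting queries feed the zero-shot text-conditioned object detector of [53].

```python
import re
from collections import Counter

import nltk  # NLTK [54]; requires the 'punkt' tokenizer and POS-tagger models to be downloaded

def top_noun_queries(transcript, k=20):
    """Return the k most frequent nouns of an automatic audio transcription;
    they serve as text queries for the zero-shot text-conditioned object detector."""
    tokens = nltk.word_tokenize(transcript.lower())
    nouns = [w for w, tag in nltk.pos_tag(tokens)
             if tag.startswith("NN") and re.fullmatch(r"[a-z]+", w)]
    return [w for w, _ in Counter(nouns).most_common(k)]

# Frames in which the detector fires on any of these queries above a confidence
# threshold are kept for the summary (Figure 2).
```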
In this manuscript, we initially present a baseline leveraging text-conditioned object detection, specifically Contrastive Language–Image Pre-training (CLIP) [43]. To assess this approach, we employ a recently introduced metric based on the Signature Transform, which accurately gauges summary quality compared to a uniform random sample. Our preliminary baseline effectively demonstrates the competitiveness of uniform random sampling [31]. Consequently, we introduce a technique utilizing prior knowledge of the Signature, specifically the element-wise mean comparison of the spectrum, to generate highly accurate random uniform samples for summarization. The Signature Transform allows for a design featuring an inherent link between the methodology, metric, and baseline. We first present a method for evaluation, followed by a set of metrics for assessment, and ultimately, we propose a state-of-the-art baseline that can function as an independent technique.

4. Experiments: Dataset and Metrics

A dataset consisting of 28 videos about science experiments was sourced from YouTube, along with their automatic audio transcriptions, to evaluate the methodology and the proposed metrics. Table 1 provides a detailed description of the collected data and computed metrics, Figure 3 shows the distribution of selected frames using text-conditioned object detection over a subset of videos and the baselines based on the Signature Transform, Figure 4 depicts a visual comparison between methodologies, and Figure 5 and Figure 6 visually elucidate the RMSE distribution for each video with mean and standard deviation.
The dataset consists of science videos covering a wide range of experiments on several topics of interest; it averages 264 frames per video (sampling rate: 1/4 fps, i.e., one frame every 4 s) and has an average duration of 17 min 30 s.
Figure 3 depicts the frames selected by our methodology for a subset of videos in the dataset. A frame is selected when the zero-shot text-conditioned object detector is triggered by one of the 20 most frequent noun queries, yielding the subset of frames that best explains the main elements of the narrative. A comparison with the baselines based on the Signature Transform with 10 and 20 points is also provided.
In all experiments that involve the computation of the Signature Transform, we use the parameters proposed in [32] that were originally used to assess synthetic distributions generated with GANs; specifically, we employ truncated signatures of order 3 with a resized image size of 64 × 64 in grayscale.
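The frame preprocessing used in these computations can be sketched as follows; the choice of OpenCV is ours for illustration, since only the 64 × 64 grayscale resizing and the order-3 truncation are fixed by the setting above.

```python
import cv2  # OpenCV, assumed here only for illustration
import numpy as np

def preprocess_frame(frame_bgr):
    """Convert a decoded video frame to the 64x64 grayscale input used for the signature."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (64, 64), interpolation=cv2.INTER_AREA)
```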
RMSE ( S ¯ , S ¯ * ) computes the element-wise mean of the signatures of both the target summary to be scored and a random uniform sample with the same number of frames, comparing their spectra by means of the RMSE. Likewise, RMSE ( S ¯ , S ¯ ) computes the same measure between two random uniform samples with the same number of frames. The standard deviations of both results are compared to assess the quality of the summarized video with respect to the harmonic components present. The preliminary technique based on text-conditioned object detection (see Table 1) achieves, in a zero-shot manner, 50% positive cases when compared against std ( RMSE ( S ¯ , S ¯ ) ). The number of frames selected by the methodology is consistent, and it automatically selects, on average, 20% of the total number of frames.
In this paragraph, we discuss the baseline based on the Signature Transform (see Table 1) in terms of the RMSE ( S ¯ , S ¯ u m i n ) | 10 and RMSE ( S ¯ , S ¯ u m i n ) | 20 . These techniques select a uniform random sample with minimum standard deviation in a set of 10 points and 20 points, respectively, and achieve 100 % positive cases when compared to RMSE ( S ¯ , S ¯ ) . Under the assumption that the summary can be approximated well by a random uniform sample, which holds true in many cases, the methodology finds a set of frames that maximizes the harmonic components relative to those present in the original video.
Figure 4 displays examples of summaries using the baseline based on the Signature Transform compared to the summaries using text-conditioned object detection. The figure allows for a visual comparison of the results obtained using RMSE ( S ¯ , S ¯ u m i n ) | 10 , RMSE ( S ¯ , S ¯ u m i n ) | 20 and S ¯ * . The best summary among the three baselines according to the metric is highlighted (Table 1).
The selected frames are consistent and provide a good overall description of the original videos. Moreover, the metric based on the Signature Transform aligns well with our expectations of a high-quality summary, with better scores being assigned to summaries that effectively convey the content present in the original video.
Table 2 presents a qualitative analysis of the baseline based on the Signature Transform using 10 points, RMSE ( S ¯ , S ¯ u m i n ) | 10 and RMSE ( S ¯ , S ¯ ) with a varying number of frames per summary. We observe that RMSE ( S ¯ , S ¯ ) reflects the variability of the harmonic components present; that is, it is preferable to work with lengths for which the variability among summaries is low, according to the standard deviation. RMSE ( S ¯ , S ¯ u m i n ) | 10 indicates the minimum standard deviation achieved in a set of 10 points, meaning that given a computational budget allowing us to select up to a specific number of frames, a good choice is to pick the length that yields the minimum RMSE ( S ¯ , S ¯ u m i n ) | 10 with low variability, as per RMSE ( S ¯ , S ¯ ) .
RMSE ( S ¯ , S ¯ * ) (Figure 5) and RMSE ( S ¯ , S ¯ ) (Figure 6) show the respective distribution of RMSE values (10 points) with the mean and standard deviation. Low standard deviations, in comparison with the random uniform sample counterparts, indicate good summarization capabilities.

4.1. Assessment of the Metrics

The metrics have been rigorously evaluated using the dataset in [1], which consists of short videos sourced from YouTube and includes five annotated summaries per video; we evaluate a subset of 20 videos. Table 3 and Table 4 report the results, using a one-frame-per-second sampling rate. In this case, the average number of times that the human annotator outperforms uniform random sampling according to the proposed metric, std ( RMSE ( S ¯ , S ¯ ) ), is 87%. Several observations emerge from these findings:
  • The proposed metrics demonstrate that human evaluators can perform above average during the task, effectively capturing the dominant harmonic frequencies present in the video.
  • Another crucial aspect to emphasize is that the metrics are able to evaluate human annotators with fair criteria and identify which subjects are creating competitive summaries.
  • Moreover, the observations from this study indicate that the metrics serve as a reliable proxy for evaluating summaries without the need for annotated data, as they correlate strongly with human annotations.
Figure 7 shows the mean and standard deviation for each human-annotated summary (user 1 to user 5) for the subset of 20 videos from [1], using a sampling rate of 1 frame per second. For each video, a visual inspection of the error plot bar for each annotated summary provides an accurate estimate of the quality of the annotation compared to other users. Specifically:
  • Annotations with lower standard deviations offer a better harmonic representation of the overall video;
  • Annotations with higher standard deviations suggest that important harmonic components are missing from the given summary;
  • The metrics make it simple to identify annotated summaries that may need to be relabeled for improved accuracy.
Furthermore, these metrics remain consistent when applied to various sampling rates.
That being said, there are several standard measures that are commonly used for video summarization, such as F1 score, precision, recall, and Mean Opinion Score (MOS). Each of these measures has its own strengths and weaknesses. Compared to these standard measures, the proposed benchmark based on the Signature Transform has several potential advantages. Here are a few reasons for this:
  • Content based: the Signature Transform is a content-based approach that captures the salient features of the video data. This means that the proposed measure is not reliant on manual annotations or subjective human ratings, which can be time consuming and prone to biases.
  • Robustness: the Signature Transform is a robust feature extraction technique that can handle different types of data, including videos with varying frame rates, resolutions, and durations. This means that the proposed measure can be applied to a wide range of video datasets without the need for pre-processing or normalization.
  • Efficiency: the Signature Transform is a computationally efficient approach that can be applied to large-scale datasets. This means that the proposed measure can be used to evaluate the effectiveness of visual summaries quickly and accurately.
  • Flexibility: the Signature Transform can be applied to different types of visual summaries, including keyframe-based and shot-based summaries. This means that the proposed measure can be used to evaluate different types of visual summaries and compare their effectiveness.
Overall, the proposed measure based on the Signature Transform has the potential to provide a more accurate and comprehensive assessment of the quality of visual summaries compared to the preceding measures used in video summarization.
Figure 8 shows a summary that is well annotated by all users, demonstrating that the metrics can accurately indicate when human annotators have effectively summarized the information present in the video.
To illustrate how these metrics can help improve annotations, Figure 9 displays the metrics along with the annotated summaries of users 1 to 5. We observe that selecting the frames highlighted by users 1–4 would increase performance if user 5 were asked to relabel their summary.
Figure 10 showcases an example in which random uniform sampling outperforms the majority of human annotators. This occurs because the visual information is uniformly distributed throughout the video. In this case, user 5 performs the best when compared against std ( RMSE ( S ¯ , S ¯ ) ).
Similarly, Figure 11 presents an example in which incorporating the highlighted frames improves the accuracy of the annotated summary by user 3, which is currently performing worse than uniform random sampling, according to the metrics.

4.2. Evaluation

In this section, we evaluate the baselines and metrics compared to VSUMM [1], a methodology based on handcrafted techniques that performs particularly well on this dataset. Table 5 displays the comparison between the standard deviation of RMSE ( S ¯ , S ¯ * ) and RMSE ( S ¯ , S ¯ ) , as well as against the baselines based on the Signature Transform, RMSE ( S ¯ , S ¯ u m i n ) | 10 and RMSE ( S ¯ , S ¯ u m i n ) | 20 , with 10 and 20 points, respectively.
We can observe how the metrics effectively capture the quality of the visual summaries and how the introduced methodology based on the Signature Transform achieves state-of-the-art results with both 10 and 20 points. The advantage of using a technique that operates on the spectrum of the signal, compared to other state-of-the-art systems, is that it can generate visual summaries without fine-tuning the methodology. In other words, there is no need to train on a subset of the target distribution of videos; rather, compelling summaries can be generated at once for any dataset. Moreover, this approach is highly efficient, as computation is performed on the CPU and consists only of calculating the Signature Transform, the element-wise mean, and the RMSE. These operations can be further optimized for rapid on-device processing or for parallel deployment at the tera-scale level.

5. Conclusions and Future Work

In this manuscript, we propose a benchmark based on the Signature Transform to evaluate visual summaries. For this purpose, we introduce a dataset consisting of videos obtained from YouTube related to science experiments, together with automatic audio transcriptions. A baseline based on zero-shot text-conditioned object detection is used as a preliminary technique to evaluate the metrics. Subsequently, we present an accurate baseline built on the prior knowledge that the Signature provides. Furthermore, we conduct a rigorous comparison against human-annotated summaries to demonstrate the high correlation between the measures and the human notion of a good summary.
One of the main contributions of this work is that techniques based on the Signature Transform can be integrated with any state-of-the-art method in the form of a gate that activates when the method performs worse than the metric, that is, when std ( RMSE ( S ¯ , S ¯ * ) ) > std ( RMSE ( S ¯ , S ¯ ) ).
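A minimal sketch of such a gate, reusing the helpers defined in the earlier sketches (rmse, mean_signature, rmse_signature_score, and the baseline of Section 2.2) and with function names that are ours for illustration, could look as follows.

```python
import numpy as np

def random_vs_random_std(video_frames, summary_len, n_repeats=10, seed=0):
    """std(RMSE(S_bar, S_bar)) between pairs of independent uniform random samples."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_repeats):
        a = np.sort(rng.choice(len(video_frames), size=summary_len, replace=False))
        b = np.sort(rng.choice(len(video_frames), size=summary_len, replace=False))
        values.append(rmse(mean_signature([video_frames[i] for i in a]),
                           mean_signature([video_frames[i] for i in b])))
    return float(np.std(values))

def signature_gate(video_frames, summary_frames, fallback_fn):
    """Keep the candidate summary only if it is at least as stable as uniform random
    sampling according to the metric; otherwise fall back to the signature baseline."""
    std_target, _ = rmse_signature_score(video_frames, summary_frames)
    if std_target > random_vs_random_std(video_frames, len(summary_frames)):
        return fallback_fn(video_frames, len(summary_frames))
    return summary_frames
```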
The experiments conducted in this work lead to the following conclusion: if a method for delivering a summarization technique is proposed that involves complex computation (e.g., DNN techniques or Foundation Models), it must provide better summarization capabilities than the baselines based on the Signature Transform, which serve as lower bounds for uniform random samples. If not, there is no need to use a more sophisticated technique that would involve greater computational and memory overhead and possibly require training data. The only exception to this would be when additional constraints are present in the problem, such as when summarization must be performed by leveraging audio transcriptions (as in the technique based on text-conditioned object detection) or any other type of multimodal data.
That being said, the proposed methodology based on the Signature Transform, although accurate and effective, is built on the overall representation of the harmonic components of the signal. Videlicet, under certain circumstances, it can produce summaries in which frames are selected because of low-level properties of the signal, such as color and image intensity, rather than the storyline. Moreover, it assumes that, in general, uniform random sampling provides good summarization capabilities, which is supported by the literature; however, this assumption does not hold in all circumstances. Therefore, in subsequent works, it would be desirable to develop techniques that perform exceptionally well according to the metrics while simultaneously bestowing a level of intelligence similar to the methodology based on Foundation Models, taking into account factors such as the human concept of detected objects and leading to more context-aware and meaningful summarization.

Author Contributions

Conceptualization, J.d.C. and I.d.Z.; funding acquisition, C.T.C. and G.R.; investigation, J.d.C. and I.d.Z.; methodology, J.d.C. and I.d.Z.; software, J.d.C. and I.d.Z.; supervision, G.R. and C.T.C.; writing—original draft, J.d.C.; writing—review and editing, C.T.C., G.R., J.d.C. and I.d.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the HK Innovation and Technology Commission (InnoHK Project CIMDA). We acknowledge the support of the Universitat Politècnica de València and of the R&D project PID2021-122580NB-I00, funded by MCIN/AEI/10.13039/501100011033 and ERDF. We thank the following funding sources from GOETHE-University Frankfurt am Main: ‘DePP—Dezentrale Plannung von Platoons im Straßengüterverkehr mit Hilfe einer KI auf Basis einzelner LKW’ and ‘Center for Data Science & AI’.

Data Availability Statement

https://doi.org/10.24433/CO.7648856.v2 (accessed on 1 March 2023).

Conflicts of Interest

The authors declare that they have no conflicts of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
DNN   Deep Neural Networks
AMT   Amazon Mechanical Turk
RMSE  Root Mean Squared Error
MAE   Mean Absolute Error
VLM   Visual Language Models
LLM   Large Language Models
GAN   Generative Adversarial Networks
CLIP  Contrastive Language–Image Pre-training
LSTM  Long Short-Term Memory
RNN   Recurrent Neural Network
NLP   Natural Language Processing
CPU   Central Processing Unit
MOS   Mean Opinion Score

References

  1. de Avila, S.E.F.; Lopes, A.; da Luz, A., Jr.; de Albuquerque Araújo, A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognit. Lett. 2011, 32, 56–68. [Google Scholar] [CrossRef]
  2. Gygli, M.; Grabner, H.; Gool, L.V. Video summarization by learning submodular mixtures of objectives. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  3. Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating summaries from user videos. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  4. Kanehira, A.; Gool, L.V.; Ushiku, Y.; Harada, T. Viewpoint-aware video summarization. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  5. Liang, G.; Lv, Y.; Li, S.; Zhang, S.; Zhang, Y. Video summarization with a convolutional attentive adversarial network. Pattern Recognit. 2022, 131, 108840. [Google Scholar] [CrossRef]
  6. Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. TVSum: Summarizing web videos using titles. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5179–5187. [Google Scholar]
  7. Zhu, W.; Lu, J.; Han, Y.; Zhou, J. Learning multiscale hierarchical attention for video summarization. Pattern Recognit. 2022, 122, 108312. [Google Scholar] [CrossRef]
  8. Ngo, C.-W.; Ma, Y.-F.; Zhang, H.-J. Automatic video summarization by graph modeling. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003. [Google Scholar]
  9. Fajtl, J.; Sokeh, H.S.; Argyriou, V.; Monekosso, D.; Remagnino, P. Summarizing videos with attention. In Proceedings of the Computer Vision—ACCV 2018 Workshops: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  10. Zhu, W.; Lu, J.; Li, J.; Zhou, J. DSNet: A flexible detect-to-summarize network for video summarization. IEEE Trans. Image Process. 2020, 30, 948–962. [Google Scholar] [CrossRef] [PubMed]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  12. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  13. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  14. de Curtò, J.; de Zarzà, I.; Yan, H.; Calafate, C.T. On the applicability of the hadamard as an input modulator for problems of classification. Softw. Impacts 2022, 13, 100325. [Google Scholar] [CrossRef]
  15. de Zarzà, I.; de Curtò, J.; Calafate, C.T. Detection of glaucoma using three-stage training with efficientnet. Intell. Syst. Appl. 2022, 16, 200140. [Google Scholar] [CrossRef]
  16. Dwivedi, K.; Bonner, M.F.; Cichy, R.M.; Roig, G. Unveiling functions of the visual cortex using task-specific deep neural networks. PLoS Comput. Biol. 2021, 17, e100926. [Google Scholar] [CrossRef]
  17. Dwivedi, K.; Roig, G.; Kembhavi, A.; Mottaghi, R. What do navigation agents learn about their environment? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10276–10285. [Google Scholar]
  18. Rakshit, S.; Tamboli, D.; Meshram, P.S.; Banerjee, B.; Roig, G.; Chaudhuri, S. Multi-source open-set deep adversarial domain adaptation. In Proceedings of the Computer Vision—ECCV: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 735–750. [Google Scholar]
  19. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI; Springer: Cham, Switzerland, 2015. [Google Scholar]
  20. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  21. Thao, H.; Balamurali, B.; Herremans, D.; Roig, G. Attendaffectnet: Self-attention based networks for predicting affective responses from movies. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 8719–8726. [Google Scholar]
  22. Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  23. Zhang, K.; Chao, W.-L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Proceedings of the Computer Vision–ECCV: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016. [Google Scholar]
  24. Zhao, B.; Li, X.; Lu, X. Hierarchical recurrent neural network for video summarization. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017. [Google Scholar]
  25. Rochan, M.; Ye, L.; Wang, Y. Video summarization using fully convolutional sequence networks. In Proceedings of the Computer Vision–ECCV: 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018. [Google Scholar]
  26. Yuan, L.; Tay, F.E.; Li, P.; Zhou, L.; Feng, J. Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
  27. Zhang, K.; Grauman, K.; Sha, F. Retrospective encoders for video summarization. In Proceedings of the Computer Vision–ECCV: 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018. [Google Scholar]
  28. Zhou, K.; Qiao, Y.; Xiang, T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Proceedings of the Association for the Advancement of Artificial Intelligence Conference (AAAI), New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  29. Narasimhan, M.; Rohrbach, A.; Darrell, T. Clip-it! Language-Guided Video Summarization. Adv. Neural Inf. Process. Syst. 2021, 34, 13988–14000. [Google Scholar]
  30. Plummer, B.A.; Brown, M.; Lazebnik, S. Enhancing video summarization via vision-language embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  31. Otani, M.; Nakashima, Y.; Rahtu, E.; Heikkilä, J. Rethinking the evaluation of video summaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  32. de Curtò, J.; de Zarzà, I.; Yan, H.; Calafate, C.T. Signature and Log-signature for the Study of Empirical Distributions Generated with GANs. arXiv 2022, arXiv:2203.03226. [Google Scholar]
  33. Lyons, T. Rough paths, signatures and the modelling of functions on streams. arXiv 2014, arXiv:1405.4537. [Google Scholar]
  34. Bonnier, P.; Kidger, P.; Arribas, I.P.; Salvi, C.; Lyons, T. Deep signature transforms. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  35. Chevyrev, I.; Kormilitzin, A. A primer on the signature method in machine learning. arXiv 2016, arXiv:1603.03788. [Google Scholar]
  36. Kidger, P.; Lyons, T. Signatory: Differentiable computations of the signature and logsignature transforms, on both CPU and GPU. arXiv 2020, arXiv:2001.00706. [Google Scholar]
  37. Liao, S.; Lyons, T.J.; Yang, W.; Ni, H. Learning stochastic differential equations using RNN with log signature features. arXiv 2019, arXiv:1908.0828. [Google Scholar]
  38. Morrill, J.; Kidger, P.; Salvi, C.; Foster, J.; Lyons, T.J. Neural CDEs for long time series via the log-ode method. arXiv 2021, arXiv:2009.08295. [Google Scholar]
  39. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. arXiv 2022, arXiv:2204.14198. [Google Scholar]
  40. Gu, X.; Lin, T.-Y.; Kuo, W.; Cui, Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv 2022, arXiv:2104.13921. [Google Scholar]
  41. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. [Google Scholar]
  42. de Curtò, J.; de Zarzà, I.; Calafate, C.T. Semantic scene understanding with large language models on unmanned aerial vehicles. Drones 2023, 7, 114. [Google Scholar] [CrossRef]
  43. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. et al. Learning transferable visual models from natural language supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
  44. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
  45. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  46. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Ayan, B.K.; Mahdavi, S.S.; Lopes, R.G.; et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv 2022, arXiv:2205.11487. [Google Scholar]
  47. Cui, Y.; Niekum, S.; Gupta, A.; Kumar, V.; Rajeswaran, A. Can foundation models perform zero-shot task specification for robot manipulation? In Proceedings of the Learning for Dynamics and Control Conference, Palo Alto, CA, USA, 23–24 June 2022. [Google Scholar]
  48. Nair, S.; Rajeswaran, A.; Kumar, V.; Finn, C.; Gupta, A. R3M: A universal visual representation for robot manipulation. arXiv 2022, arXiv:2203.12601. [Google Scholar]
  49. Zeng, A.; Florence, P.; Tompson, J.; Welker, S.; Chien, J.; Attarian, M.; Armstrong, T.; Krasin, I.; Duong, D.; Wahid, A.; et al. Transporter networks: Rearranging the visual world for robotic manipulation. In Proceedings of the Conference on Robot Learning, Online, 15–18 November 2020. [Google Scholar]
  50. Huang, W.; Abbeel, P.; Pathak, D.; Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv 2022, arXiv:2201.07207. [Google Scholar]
  51. Zeng, A.; Attarian, M.; Ichter, B.; Choromanski, K.; Wong, A.; Welker, S.; Tombari, F.; Purohit, A.; Ryoo, M.; Sindhwani, V.; et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv 2022, arXiv:2204.00598. [Google Scholar]
  52. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 ×16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  53. Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection with vision transformers. arXiv 2022, arXiv:2205.06230. [Google Scholar]
  54. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python, 1st ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009. [Google Scholar]
Figure 1. Conceptual plot with RMSE ( S ¯ , S ¯ ) and RMSE ( S ¯ , S ¯ * ) standard deviation and mean for two given summaries (our method and a counterexample) of 12 frames using a randomly picked video from Youtube to illustrate how to select a proper summary according to the proposed metric.
Figure 2. Video Summarization via Zero-shot Text-conditioned Object Detection.
Figure 3. Comparison of distribution of selected frames for a subset of videos (Tides, Sulfur Hexafluoride, Centre of Gravity and Bubbles) using the method based on text-conditioned object detection and the baselines using the Signature Transform.
Figure 4. Summarization of videos using the baseline based on the Signature Transform in comparison to the summarization using text-conditioned object detection. RMSE ( S ¯ , S ¯ u m i n ) | 10 , RMSE ( S ¯ , S ¯ u m i n ) | 20 and S ¯ * summaries for two videos of the introduced dataset. The best summary among the three, according to the metric, is highlighted.
Figure 5. Plot with RMSE ( S ¯ , S ¯ * ) standard deviation and mean.
Figure 6. Plot with RMSE ( S ¯ , S ¯ ) standard deviation and mean.
Figure 7. Error bar plot with mean and standard deviation for each human-annotated summary of the subset of 20 videos from [1]. Sampling rate: 1 frame per second.
Figure 8. Visual depiction of human annotated summaries together with RMSE ( S ¯ , S ¯ * ) and RMSE ( S ¯ , S ¯ ) of video V11, Table 3. Sampling rate: 1 frame per second. Highlighted values on the table correspond to the lowest standard deviation.
Figure 9. Visual depiction of human annotated summaries together with RMSE ( S ¯ , S ¯ * ) and RMSE ( S ¯ , S ¯ ) of video V19, Table 3. Sampling rate: 1 frame per second. Highlighted frames can increase the accuracy of the annotated summary by user 5. Highlighted values on the table correspond to the lowest standard deviation.
Figure 10. Visual depiction of human annotated summaries, together with RMSE ( S ¯ , S ¯ * ) and RMSE ( S ¯ , S ¯ ) of video V75, Table 4. Sampling rate: 1 frame per second. Highlighted values on the table correspond to the lowest standard deviation.
Figure 11. Visual depiction of human annotated summaries together with RMSE ( S ¯ , S ¯ * ) and RMSE ( S ¯ , S ¯ ) of video V76, Table 4. Sampling rate: 1 frame per second. Highlighted frames can increase the accuracy of the annotated summary by user 3. Highlighted values on the table correspond to the lowest standard deviation.
Table 1. Descriptive statistics with RMSE ( S ¯ , S ¯ * ) (target summary against random uniform sample) and RMSE ( S ¯ , S ¯ ) (random uniform sample against random uniform sample). RMSE ( S ¯ , S ¯ u m i n ) | 10 and RMSE ( S ¯ , S ¯ u m i n ) | 20 correspond to the baselines based on the Signature Transform using 10 and 20 random samples, respectively. Highlighted results in blue/brown correspond to values better than std ( RMSE ( S ¯ , S ¯ ) ). Yellow values indicate when std ( RMSE ( S ¯ , S ¯ ) ) is lower than std ( RMSE ( S ¯ , S ¯ * ) ).
Columns: Video; # Frames; Length; # Frames (%) of the summary; then Std and Mean for each of RMSE ( S ¯ , S ¯ * ) , RMSE ( S ¯ , S ¯ ) , RMSE ( S ¯ , S ¯ u m i n ) | 10 , and RMSE ( S ¯ , S ¯ u m i n ) | 20 .
Tides15910 m 29 s35 (22%)13,663202,38814,838155,9868859157,4557312167,480
Sulfur Hexafluoride23015 m 12 s47 (20%)22,727217,93522,607179,4097194161,9957722173,490
Centre of Gravity15510 m 14 s33 (21%)12,333181,46016,404168,8248481160,77912,416175,971
Bubbles17411 m 30 s35 (20%)23,127201,55316,806185,7027461194,9935711175,176
Airplanes15810 m 24 s22 (14%)19,964215,68823,591231,5398417227,39110,235233,020
Protons17411 m 30 s25 (14%)29,853252,22420,186262,43412,835251,90711,542250,512
Hydrophobic16811 m 06 s29 (17%)15,016251,67125,835248,54811,973250,13113,917245,761
States of Matter33222 m 03 s78 (23%)16,249156,4089709130,0646630115,4545340121,028
Spool Racer33222 m 02 s90 (27%)15,903142,52011,883136,1477054137,6218112151,888
Paper Airplane33222 m 03 s29 (9%)20,642235,63911,829221,2205400224,7189385177,448
Loudest Sound33222 m 01 s93 (28%)16,898179,9638304148,8857884138,5614355147,016
Lightning33222 m 01 s70 (21%)15,237169,33821,862162,8499300177,0087494153,797
Light Challenge33222 m 02 s82 (25%)12,566152,48810,546126,1175490139,7004874129,044
Hot Air Balloon33222 m 01 s98 (30%)8620150,3665417144,6343516137,1414165138,453
Hoop Glider33222 m 01 s82 (25%)6419148,0656752132,5444051133,8974966133,894
Drag Race33222 m 03 s73 (22%)9384135,2288931125,2644375122,6154645129,851
All about Balance33222 m 03 s59 (18%)14,023182,06314,238182,1797801176,2196914167,727
Air Pressure33222 m 03 s65 (20%)10,123166,34218,314151,6646386145,8974602148,232
Friction and Momentum16210 m 42 s28 (17%)18,754217,40322,443218,20313,348202,28812,238205,680
Electricity16210 m 41 s30 (19%)24,376298,23822,885279,82016,889268,93210,263270,619
Catapult16911 m 11 s27 (16%)26,413271,64331,265214,72715,158203,29010,222188,008
Carbonation and More16510 m 53 s40 (24%)18,977237,14218,107226,04412,130234,27811,884214,149
Carbon Dioxide16210 m 41 s38 (23%)25,862245,41518,806217,27013,838207,8287760211,504
Bridge16410 m 51 s21 (13%)25,839269,41226,038271,55110,761263,74713,038264,532
Bread Experiment33722 m 22 s59 (18%)15,099189,0868575146,7715542153,2245691156,230
Balloon Power33722 m 22 s53 (16%)14,075157,54229,415147,7107741128,9207351134,545
Attraction and Forces65443 m 30 s81 (12%)5955107,0977486102,965370196,266209399,271
Puzzles20913 m 48 s46 (22%)11,258185,50219,012196,76214,620199,55614,622197,064
Average26417 m 30 s52 (20%)14/28 (50%) 28/28 (100%)28/28 (100%)
Table 2. Descriptive statistics for a set of videos with varying numbers of frames per summary with RMSE ( S ¯ , S ¯ u m i n ) | 10 (brown) and RMSE ( S ¯ , S ¯ ) (yellow).
Columns: Video; # Frames; Summary (%); then Std and Mean for RMSE ( S ¯ , S ¯ u m i n ) | 10 and for RMSE ( S ¯ , S ¯ ) ; Visualization: Plot (Std, Std).
Tides1598 (5%)22,786422,02654,067390,483Electronics 12 01735 i001
16 (10%)12,851254,98437,713263,881
24 (15%)9423202,92517,935224,797
32 (20%)9074183,93315,700186,621
40 (25%)4782158,18313,903159,452
Sulfur Hexafluoride23012 (5%)30,325452,13468,212362,061Electronics 12 01735 i002
23 (10%)12,701281,42539,872246,967
35 (15%)12,034228,53020,846201,740
46 (20%)9241190,98528,621175,440
58 (25%)7914161,6189021152,310
Centre of Gravity1558 (5%)48,787406,50249,234369,648Electronics 12 01735 i003
16 (10%)22,163252,84121,974276,366
24 (15%)8050212,89326,776229,959
31 (20%)10,963180,95335,813184,437
39 (25%)2528164,66616,259163,007
Bubbles1749 (5%)24,538401,40637,816397,470Electronics 12 01735 i004
18 (10%)11,669272,43049,740276,152
27 (15%)12,965213,33619,125215,961
35 (20%)10,331190,63913,792183,984
44 (25%)7625173,0099427162,091
Table 3. Descriptive statistics with RMSE ( S ¯ , S ¯ * ) (target summary against random uniform sample) and RMSE ( S ¯ , S ¯ ) (random uniform sample against random uniform sample). Lower is better. Sampling rate: 1 frame per second. Dataset in [1], videos from V11 to V20. Highlighted results in blue/yellow correspond to the lowest values, either std ( RMSE ( S ¯ , S ¯ * ) ) or std ( RMSE ( S ¯ , S ¯ ) ), respectively.
Columns (YouTube dataset [1]): Video; # Frames; User; # Frames User; then Std and Mean for RMSE ( S ¯ , S ¯ * ) and for RMSE ( S ¯ , S ¯ ) ; Visualization: Plot (Std, Std).
V114811026,644171,10646,655151,483Electronics 12 01735 i005
21213,673202,17215,479155,481
31029,857213,88051,590182,327
4921,192236,95952,982196,303
5831,627254,33652,925193,520
V125911115,497436,72346,551252,142Electronics 12 01735 i006
21718,927359,56224,665177,286
31526,071342,16131,703180,066
41125,330429,27282,323242,627
51434,479348,83439,199188,417
V135911912,238187,00124,649114,155Electronics 12 01735 i007
2925,267287,47934,635166,495
3187790187,34621,203126,432
4149544222,49625,553140,508
51812,298198,34927,138124,386
V14591932,739302,11851,770183,978Electronics 12 01735 i008
21620,249219,06844,235141,927
31724,345222,55935,235113,806
41020,498244,50927,548155,515
51626,561200,13932,840143,384
V155711214,454237,55151,812207,845Electronics 12 01735 i009
21120,018301,65046,590209,491
31313,192261,01442,337171,810
41336,408305,37630,041179,442
51444,931261,85954,428180,145
V16701935,722449,75895,662376,411Electronics 12 01735 i010
2986,863425,10765,626328,563
31241,260388,86943,186340,133
4951,299447,52365,698375,162
51342,200369,51752,316302,677
V175911217,668324,56236,166242,235Electronics 12 01735 i011
21326,203262,89532,930243,366
31810,957250,54330,660177,779
41219,956300,39020,252223,791
51612,611297,70728,433207,258
V185011335,152501,23074,454260,574Electronics 12 01735 i012
21440,896559,24470,863274,572
31446,791540,74739,899246,964
41033,309541,49056,012329,343
51430,663420,92472,998308,756
V19651156114186,89316,695119,136Electronics 12 01735 i013
2206701225,0756899103,517
3205339167,0858834103,752
4138462185,45212,020129,608
5623,992275,15532,512208,629
V206111523,716627,12152,711540,857Electronics 12 01735 i014
21219,933707,82386,586609,589
3952,818787,18893,656747,199
41143,598688,06568,016617,091
51131,058695,90569,077618,156
Table 4. Descriptive statistics with RMSE ( S ¯ , S ¯ * ) (target summary against random uniform sample) and RMSE ( S ¯ , S ¯ ) (random uniform sample against random uniform sample). Lower is better. Sampling rate: 1 frame per second. Dataset in [1], videos from V71 to V80. Highlighted values correspond to the lowest standard deviation.
Columns (YouTube dataset [1]): Video; # Frames; User; # Frames User; then Std and Mean for RMSE ( S ¯ , S ¯ * ) and for RMSE ( S ¯ , S ¯ ) ; Visualization: Plot (Std, Std).
V7127711816,916319,97535,173330,114Electronics 12 01735 i015
21823,314315,99648,511339,793
32038,384293,85350,766345,021
41732,270310,19332,411359,049
51841,753329,35359,688334,337
V7253611815,842187,01932,676194,820Electronics 12 01735 i016
21625,427211,46633,363202,442
31618,684196,14945,453217,699
41821,112205,42119,122177,117
51827,718206,33529,057205,808
V7320111164,802538,239116,284484,970Electronics 12 01735 i017
27153,682106,8305211,124704,655
38113,805661,992135,899653,041
4883,387856,406248,619689,301
57111,767899,150241,947794,828
V7429311725,780282,20029,674309,051Electronics 12 01735 i018
21618,954273,77651,670331,322
31536,714322,83324,961335,618
41341,327363,66555,543369,875
51630,798289,13538,881353,928
V7538311442,736254,38525,959282,877Electronics 12 01735 i019
21341,632263,43139,826337,124
31059,083315,53139,925330,766
41737,954227,41128,843250,314
51249,908278,96663,236312,366
V76891664,097440,82593,524422,565Electronics 12 01735 i020
2453,727536,138123,009464,922
31566,208843,799485,614878,793
4640,356382,64378,354424,418
5639,194395,90660,916401,751
V7716811224,546302,07647,095366,748Electronics 12 01735 i021
2952,176339,28561,880385,056
3961,623355,88354,390413,118
41039,765349,20790,313400,379
5770,562440,65690,468451,833
V7831011365,238706,97896,368770,000Electronics 12 01735 i022
214100,771672,121112,412807,250
33410,792159,3229203,589188,2757
49149,063839,743213,286106,1204
52340,178466,57173,228614,140
V79491756,918831,057124,249835,575Electronics 12 01735 i023
2856,569793,83160,657859,241
3685,973925,025104,621990,479
45158,480109,3141179,902109,9105
5687,104873,950131,597895,318
V8015911866,585529,87567,019572,836Electronics 12 01735 i024
21766,367527,93059,432602,819
31329,459579,07884,101726,883
41243,740643,01687,688685,117
51489,016553,27494,849649,317
Table 5. VSUMM [1] comparison against baseline based on the Signature Transform for the first 20 videos of the dataset crawled from Youtube. Descriptive statistics with RMSE ( S ¯ , S ¯ * ) (target summary against random uniform sample) and RMSE ( S ¯ , S ¯ ) (random uniform sample against random uniform sample). RMSE ( S ¯ , S ¯ u m i n ) | 10 and RMSE ( S ¯ , S ¯ u m i n ) | 20 correspond to the baselines based on the Signature Transform using 10 and 20 random samples, respectively. Highlighted results are better than std ( RMSE ( S ¯ , S ¯ ) ). Sampling rate: 1 frame per second. Highlighted results correspond to lowest standard deviation as described in Table 1.
Columns: Video; # Frames; # Frames of the VSUMM summary; then Std and Mean for each of RMSE ( S ¯ , S ¯ * ) , RMSE ( S ¯ , S ¯ ) , RMSE ( S ¯ , S ¯ u m i n ) | 10 , and RMSE ( S ¯ , S ¯ u m i n ) | 20 .
V11481125,981185,95937,907175,03116,343148,12818,343159,157
V12591356,274313,15641,613205,00417,770181,53311,665206,951
V1359197018184,86515,319120,30710,578110,2586655134,846
V1459821,415281,96939,412171,93519,069157,53110,104180,199
V15571020,159271,19746,041219,18227,536192,66727,765218,787
V1670965,997513,44084,667428,02538,088283,32430,235446,068
V17591510,697255,66641,831197,13617,625197,94419,102227,646
V18501442,731449,32451,635230,69533,525261,28830,179242,746
V1965163891235,7975739121,7665883116,2454582111,766
V2061943,864796,44839,035733,54728,460684,54639,414644,681
V712771720,840383,94543,176341,77914,908352,36520,657327,732
V725361261,886233,64948,603252,68817,604276,63118,966248,489
V732011040,261717,107156,051533,45764,344681,06438,361711,039
V742931726,274270,37436,674334,26517,622354,62117,486330,606
V753831037,516272,80438,026366,51023,163339,07821,295360,216
V7689736,084353,323114,266377,69931,131335,95834,724405,954
V77168926,653361,51667,134422,61233,214407,08527,562480,795
V783101395,305831,043127,705823,93833,903980,39736,361951,784
V7949767,052965,267101,325878,91742,513818,62947,401885,023
V801591548,115613,702118,428644,52943,411589,25637,487808,984
Average1531217/20 (85%) 19/20 (95%)19/20 (95%)

