## Abstract


## 1. Introduction and Problem Statement

## 2. Signature Transform

#### 2.1. RMSE and MAE Signature and Log-Signature

**Definition 1.**

#### 2.2. Summarization of Videos with RMSE Signature

- ${\overline{S}}_{*}$: Element-wise mean Signature Transform of the target summary to be scored for the corresponding video;
- $\overline{S}$: Element-wise mean Signature Transform of a uniform random sample of the corresponding video;
- $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$: Root mean squared error between the spectra of $\overline{S}$ and ${\overline{S}}_{*}$ with the same summary length. For the computation of standard deviation and mean, this value is calculated ten times, drawing a new $\overline{S}$ each time;
- $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$: Root mean squared error between the spectra of two independent uniform random samples, each denoted $\overline{S}$, with the same summary length. For the computation of standard deviation and mean, this value is calculated ten times, drawing both samples anew each time;
- $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{n}$: Baseline based on the Signature Transform. It corresponds to $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$, where ${\overline{S}}_{*}$ is, in this case, a fixed uniform random sample denoted as ${\overline{S}}_{u}$. We repeat this procedure $n$ times and propose as a summary the candidate with the minimum standard deviation, ${\overline{S}}_{{u}_{min}}$;
- $\mathrm{std}(\cdot)$: Standard deviation.
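The quantities defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each frame is assumed to be given as a path of shape `(T, d)`, the Signature Transform is truncated at depth 2 and computed directly with NumPy (a dedicated signature library could be swapped in), a "uniform random sample" is read as frames drawn uniformly at random without replacement, and the population standard deviation is used. All function names are hypothetical.

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path of shape (T, d)."""
    inc = np.diff(path, axis=0)                  # segment increments, (T-1, d)
    level1 = inc.sum(axis=0)                     # total increment
    start = path[:-1] - path[0]                  # displacement before each segment
    level2 = start.T @ inc + 0.5 * inc.T @ inc   # second-order iterated integrals
    return np.concatenate([level1, level2.ravel()])

def mean_signature(frames):
    """Element-wise mean of the per-frame signatures."""
    return np.mean([signature_depth2(f) for f in frames], axis=0)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def sample_mean_signature(video, k, rng):
    """Mean signature of k frames drawn uniformly without replacement."""
    idx = rng.choice(len(video), size=k, replace=False)
    return mean_signature(video[idx])

def rmse_target(video, s_star, k, rng, repeats=10):
    """std and mean of RMSE(S, S_*) over `repeats` fresh random samples S."""
    vals = [rmse(sample_mean_signature(video, k, rng), s_star)
            for _ in range(repeats)]
    return float(np.std(vals)), float(np.mean(vals))

def rmse_random(video, k, rng, repeats=10):
    """std and mean of RMSE(S, S) with both samples redrawn each time."""
    vals = [rmse(sample_mean_signature(video, k, rng),
                 sample_mean_signature(video, k, rng))
            for _ in range(repeats)]
    return float(np.std(vals)), float(np.mean(vals))

def baseline_summary(video, k, rng, n=10):
    """Draw n candidate samples S_u and keep the one with the lowest std."""
    best_idx, best_std = None, np.inf
    for _ in range(n):
        idx = np.sort(rng.choice(len(video), size=k, replace=False))
        std, _ = rmse_target(video, mean_signature(video[idx]), k, rng)
        if std < best_std:
            best_idx, best_std = idx, std
    return best_idx, best_std

# Usage on synthetic data: 40 frames, each a path of 8 points in R^3.
rng = np.random.default_rng(0)
video = rng.normal(size=(40, 8, 3))
summary = np.arange(0, 40, 5)                    # hypothetical 8-frame summary
std_t, mean_t = rmse_target(video, mean_signature(video[summary]), len(summary), rng)
std_r, mean_r = rmse_random(video, len(summary), rng)
idx, std_b = baseline_summary(video, len(summary), rng, n=10)
```

Comparing `std_t` with `std_r` mirrors the comparison between std($\mathrm{RMSE}(\overline{S},{\overline{S}}_{*})$) and std($\mathrm{RMSE}(\overline{S},\overline{S})$) reported in the tables, while `baseline_summary` corresponds to the $n$-candidate baseline.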

## 3. Summarization of Videos via Text-Conditioned Object Detection

## 4. Experiments: Dataset and Metrics

#### 4.1. Assessment of the Metrics

- The proposed metrics demonstrate that human evaluators can perform above average during the task, effectively capturing the dominant harmonic frequencies present in the video.
- Another crucial aspect to emphasize is that the metrics are able to evaluate human annotators with fair criteria and identify which subjects are creating competitive summaries.
- Moreover, the observations from this study indicate that the metrics serve as a reliable proxy for evaluating summaries without the need for annotated data, as they correlate strongly with human annotations.

- Annotations with lower standard deviations offer a better harmonic representation of the overall video;
- Annotations with higher standard deviations suggest that important harmonic components are missing from the given summary;
- The metrics make it simple to identify annotated summaries that may need to be relabeled for improved accuracy.
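As an illustration of how these observations could be applied, the sketch below ranks the five human annotations of video V76 by std($\mathrm{RMSE}(\overline{S},{\overline{S}}_{*})$), using the values reported in Table 4. The relabeling rule (flag an annotation when its standard deviation exceeds that of the random-versus-random comparison) is an assumption made for this example; the text does not fix a specific threshold.

```python
# (user, std RMSE(S, S_*), std RMSE(S, S)) for video V76, copied from Table 4.
v76 = [
    (1, 64097, 93524),
    (2, 53727, 123009),
    (3, 566208, 485614),
    (4, 40356, 78354),
    (5, 39194, 60916),
]

# Lower standard deviation is read as a better harmonic representation.
ranked = sorted(v76, key=lambda row: row[1])
best_user = ranked[0][0]

# Assumed rule: flag annotations that do worse than random-vs-random sampling.
flagged = [user for user, std_target, std_random in v76 if std_target > std_random]

print(best_user, flagged)  # 5 [3]
```

Under this assumed rule, only the single-frame summary of user 3 is flagged, which agrees with the annotation singled out for improvement in Figure 11.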

- Content-based: the Signature Transform is a content-based approach that captures the salient features of the video data. This means that the proposed measure does not rely on manual annotations or subjective human ratings, which can be time-consuming and prone to bias.
- Robustness: the Signature Transform is a robust feature extraction technique that can handle different types of data, including videos with varying frame rates, resolutions, and durations. This means that the proposed measure can be applied to a wide range of video datasets without the need for pre-processing or normalization.
- Efficiency: the Signature Transform is a computationally efficient approach that can be applied to large-scale datasets. This means that the proposed measure can be used to evaluate the effectiveness of visual summaries quickly and accurately.
- Flexibility: the Signature Transform can be applied to different types of visual summaries, including keyframe-based and shot-based summaries. This means that the proposed measure can be used to evaluate different types of visual summaries and compare their effectiveness.

#### 4.2. Evaluation

## 5. Conclusions and Future Work

**Figure 1.** Conceptual plot with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ standard deviation and mean for two given summaries (our method and a counterexample) of 12 frames using a randomly picked video from YouTube to illustrate how to select a proper summary according to the proposed metric.

**Figure 3.** Comparison of distribution of selected frames for a subset of videos (Tides, Sulfur Hexafluoride, Centre of Gravity and Bubbles) using the method based on text-conditioned object detection and the baselines using the Signature Transform.

**Figure 4.** Summarization of videos using the baseline based on the Signature Transform in comparison to the summarization using text-conditioned object detection. $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$, $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{20}$ and ${\overline{S}}_{*}$ summaries for two videos of the introduced dataset. The best summary among the three, according to the metric, is highlighted.

**Figure 5.** Plot with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ standard deviation and mean.

**Figure 6.** Plot with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ standard deviation and mean.

**Figure 7.** Error bar plot with mean and standard deviation for each human-annotated summary of the subset of 20 videos from [1]. Sampling rate: 1 frame per second.

**Figure 8.** Visual depiction of human-annotated summaries together with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ of video V11, Table 3. Sampling rate: 1 frame per second. Highlighted values in the table correspond to the lowest standard deviation.

**Figure 9.** Visual depiction of human-annotated summaries together with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ of video V19, Table 3. Sampling rate: 1 frame per second. Highlighted frames can increase the accuracy of the summary annotated by user 5. Highlighted values in the table correspond to the lowest standard deviation.

**Figure 10.** Visual depiction of human-annotated summaries together with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ of video V75, Table 4. Sampling rate: 1 frame per second. Highlighted values in the table correspond to the lowest standard deviation.

**Figure 11.** Visual depiction of human-annotated summaries together with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ of video V76, Table 4. Sampling rate: 1 frame per second. Highlighted frames can increase the accuracy of the summary annotated by user 3. Highlighted values in the table correspond to the lowest standard deviation.

**Table 1.** Descriptive statistics with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ (target summary against random uniform sample) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (random uniform sample against random uniform sample). $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{20}$ correspond to the baselines based on the Signature Transform using 10 and 20 random samples, respectively. Highlighted results in blue/brown correspond to values better than std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$). Yellow values indicate when std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$) is lower than std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$).

Descriptive Statistics | | | Summary | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},{\overline{\mathit{S}}}_{*})$ | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},\overline{\mathit{S}})$ | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},{\overline{\mathit{S}}}_{{\mathit{u}}_{\mathit{m}\mathit{i}\mathit{n}}}){\vert}_{10}$ | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},{\overline{\mathit{S}}}_{{\mathit{u}}_{\mathit{m}\mathit{i}\mathit{n}}}){\vert}_{20}$ | |
---|---|---|---|---|---|---|---|---|---|---|---|
Video | # Frames | Length | # Frames (%) | Std | Mean | Std | Mean | Std | Mean | Std | Mean |
Tides | 159 | 10 m 29 s | 35 (22%) | 13,663 | 202,388 | 14,838 | 155,986 | 8859 | 157,455 | 7312 | 167,480 |
Sulfur Hexafluoride | 230 | 15 m 12 s | 47 (20%) | 22,727 | 217,935 | 22,607 | 179,409 | 7194 | 161,995 | 7722 | 173,490 |
Centre of Gravity | 155 | 10 m 14 s | 33 (21%) | 12,333 | 181,460 | 16,404 | 168,824 | 8481 | 160,779 | 12,416 | 175,971 |
Bubbles | 174 | 11 m 30 s | 35 (20%) | 23,127 | 201,553 | 16,806 | 185,702 | 7461 | 194,993 | 5711 | 175,176 |
Airplanes | 158 | 10 m 24 s | 22 (14%) | 19,964 | 215,688 | 23,591 | 231,539 | 8417 | 227,391 | 10,235 | 233,020 |
Protons | 174 | 11 m 30 s | 25 (14%) | 29,853 | 252,224 | 20,186 | 262,434 | 12,835 | 251,907 | 11,542 | 250,512 |
Hydrophobic | 168 | 11 m 06 s | 29 (17%) | 15,016 | 251,671 | 25,835 | 248,548 | 11,973 | 250,131 | 13,917 | 245,761 |
States of Matter | 332 | 22 m 03 s | 78 (23%) | 16,249 | 156,408 | 9709 | 130,064 | 6630 | 115,454 | 5340 | 121,028 |
Spool Racer | 332 | 22 m 02 s | 90 (27%) | 15,903 | 142,520 | 11,883 | 136,147 | 7054 | 137,621 | 8112 | 151,888 |
Paper Airplane | 332 | 22 m 03 s | 29 (9%) | 20,642 | 235,639 | 11,829 | 221,220 | 5400 | 224,718 | 9385 | 177,448 |
Loudest Sound | 332 | 22 m 01 s | 93 (28%) | 16,898 | 179,963 | 8304 | 148,885 | 7884 | 138,561 | 4355 | 147,016 |
Lightning | 332 | 22 m 01 s | 70 (21%) | 15,237 | 169,338 | 21,862 | 162,849 | 9300 | 177,008 | 7494 | 153,797 |
Light Challenge | 332 | 22 m 02 s | 82 (25%) | 12,566 | 152,488 | 10,546 | 126,117 | 5490 | 139,700 | 4874 | 129,044 |
Hot Air Balloon | 332 | 22 m 01 s | 98 (30%) | 8620 | 150,366 | 5417 | 144,634 | 3516 | 137,141 | 4165 | 138,453 |
Hoop Glider | 332 | 22 m 01 s | 82 (25%) | 6419 | 148,065 | 6752 | 132,544 | 4051 | 133,897 | 4966 | 133,894 |
Drag Race | 332 | 22 m 03 s | 73 (22%) | 9384 | 135,228 | 8931 | 125,264 | 4375 | 122,615 | 4645 | 129,851 |
All about Balance | 332 | 22 m 03 s | 59 (18%) | 14,023 | 182,063 | 14,238 | 182,179 | 7801 | 176,219 | 6914 | 167,727 |
Air Pressure | 332 | 22 m 03 s | 65 (20%) | 10,123 | 166,342 | 18,314 | 151,664 | 6386 | 145,897 | 4602 | 148,232 |
Friction and Momentum | 162 | 10 m 42 s | 28 (17%) | 18,754 | 217,403 | 22,443 | 218,203 | 13,348 | 202,288 | 12,238 | 205,680 |
Electricity | 162 | 10 m 41 s | 30 (19%) | 24,376 | 298,238 | 22,885 | 279,820 | 16,889 | 268,932 | 10,263 | 270,619 |
Catapult | 169 | 11 m 11 s | 27 (16%) | 26,413 | 271,643 | 31,265 | 214,727 | 15,158 | 203,290 | 10,222 | 188,008 |
Carbonation and More | 165 | 10 m 53 s | 40 (24%) | 18,977 | 237,142 | 18,107 | 226,044 | 12,130 | 234,278 | 11,884 | 214,149 |
Carbon Dioxide | 162 | 10 m 41 s | 38 (23%) | 25,862 | 245,415 | 18,806 | 217,270 | 13,838 | 207,828 | 7760 | 211,504 |
Bridge | 164 | 10 m 51 s | 21 (13%) | 25,839 | 269,412 | 26,038 | 271,551 | 10,761 | 263,747 | 13,038 | 264,532 |
Bread Experiment | 337 | 22 m 22 s | 59 (18%) | 15,099 | 189,086 | 8575 | 146,771 | 5542 | 153,224 | 5691 | 156,230 |
Balloon Power | 337 | 22 m 22 s | 53 (16%) | 14,075 | 157,542 | 29,415 | 147,710 | 7741 | 128,920 | 7351 | 134,545 |
Attraction and Forces | 654 | 43 m 30 s | 81 (12%) | 5955 | 107,097 | 7486 | 102,965 | 3701 | 96,266 | 2093 | 99,271 |
Puzzles | 209 | 13 m 48 s | 46 (22%) | 11,258 | 185,502 | 19,012 | 196,762 | 14,620 | 199,556 | 14,622 | 197,064 |
Average | 264 | 17 m 30 s | 52 (20%) | 14/28 (50%) | | | | 28/28 (100%) | | 28/28 (100%) | |
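The fractions in the last row of Table 1 are consistent with counting, per column, the videos whose standard deviation is lower than std($\mathrm{RMSE}(\overline{S},\overline{S})$). The short sketch below checks this reading for the $\mathrm{RMSE}(\overline{S},{\overline{S}}_{*})$ column; the pairs are copied from the table in row order, and the counting rule itself is an interpretation rather than something the caption states explicitly.

```python
# (std RMSE(S, S_*), std RMSE(S, S)) per video, in the row order of Table 1.
rows = [
    (13663, 14838), (22727, 22607), (12333, 16404), (23127, 16806),
    (19964, 23591), (29853, 20186), (15016, 25835), (16249, 9709),
    (15903, 11883), (20642, 11829), (16898, 8304), (15237, 21862),
    (12566, 10546), (8620, 5417), (6419, 6752), (9384, 8931),
    (14023, 14238), (10123, 18314), (18754, 22443), (24376, 22885),
    (26413, 31265), (18977, 18107), (25862, 18806), (25839, 26038),
    (15099, 8575), (14075, 29415), (5955, 7486), (11258, 19012),
]

# A summary "wins" when its std is lower than that of random-vs-random sampling.
wins = sum(std_target < std_random for std_target, std_random in rows)
summary = f"{wins}/{len(rows)} ({100 * wins / len(rows):.0f}%)"
print(summary)  # 14/28 (50%)
```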

**Table 2.** Descriptive statistics for a set of videos with varying numbers of frames per summary with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$ (brown) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (yellow).

Dataset | | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},{\overline{\mathit{S}}}_{{\mathit{u}}_{\mathit{m}\mathit{i}\mathit{n}}}){\vert}_{10}$ | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},\overline{\mathit{S}})$ | | Visualization |
---|---|---|---|---|---|---|---|
Video | # Frames | Summary (%) | Std | Mean | Std | Mean | Plot (Std, Std) |
Tides | 159 | 8 (5%) | 22,786 | 422,026 | 54,067 | 390,483 | |
Tides | 159 | 16 (10%) | 12,851 | 254,984 | 37,713 | 263,881 | |
Tides | 159 | 24 (15%) | 9423 | 202,925 | 17,935 | 224,797 | |
Tides | 159 | 32 (20%) | 9074 | 183,933 | 15,700 | 186,621 | |
Tides | 159 | 40 (25%) | 4782 | 158,183 | 13,903 | 159,452 | |
Sulfur Hexafluoride | 230 | 12 (5%) | 30,325 | 452,134 | 68,212 | 362,061 | |
Sulfur Hexafluoride | 230 | 23 (10%) | 12,701 | 281,425 | 39,872 | 246,967 | |
Sulfur Hexafluoride | 230 | 35 (15%) | 12,034 | 228,530 | 20,846 | 201,740 | |
Sulfur Hexafluoride | 230 | 46 (20%) | 9241 | 190,985 | 28,621 | 175,440 | |
Sulfur Hexafluoride | 230 | 58 (25%) | 7914 | 161,618 | 9021 | 152,310 | |
Centre of Gravity | 155 | 8 (5%) | 48,787 | 406,502 | 49,234 | 369,648 | |
Centre of Gravity | 155 | 16 (10%) | 22,163 | 252,841 | 21,974 | 276,366 | |
Centre of Gravity | 155 | 24 (15%) | 8050 | 212,893 | 26,776 | 229,959 | |
Centre of Gravity | 155 | 31 (20%) | 10,963 | 180,953 | 35,813 | 184,437 | |
Centre of Gravity | 155 | 39 (25%) | 2528 | 164,666 | 16,259 | 163,007 | |
Bubbles | 174 | 9 (5%) | 24,538 | 401,406 | 37,816 | 397,470 | |
Bubbles | 174 | 18 (10%) | 11,669 | 272,430 | 49,740 | 276,152 | |
Bubbles | 174 | 27 (15%) | 12,965 | 213,336 | 19,125 | 215,961 | |
Bubbles | 174 | 35 (20%) | 10,331 | 190,639 | 13,792 | 183,984 | |
Bubbles | 174 | 44 (25%) | 7625 | 173,009 | 9427 | 162,091 | |

**Table 3.** Descriptive statistics with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ (target summary against random uniform sample) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (random uniform sample against random uniform sample). Lower is better. Sampling rate: 1 frame per second. Dataset in [1], videos from V11 to V20. Highlighted results in blue/yellow correspond to the lowest values, either std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$) or std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$), respectively.

YouTube, Dataset | | | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},{\overline{\mathit{S}}}_{*})$ | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},\overline{\mathit{S}})$ | | Visualization |
---|---|---|---|---|---|---|---|---|
Video | # Frames | User | # Frames User | Std | Mean | Std | Mean | Plot (Std, Std) |
V11 | 48 | 1 | 10 | 26,644 | 171,106 | 46,655 | 151,483 | |
V11 | 48 | 2 | 12 | 13,673 | 202,172 | 15,479 | 155,481 | |
V11 | 48 | 3 | 10 | 29,857 | 213,880 | 51,590 | 182,327 | |
V11 | 48 | 4 | 9 | 21,192 | 236,959 | 52,982 | 196,303 | |
V11 | 48 | 5 | 8 | 31,627 | 254,336 | 52,925 | 193,520 | |
V12 | 59 | 1 | 11 | 15,497 | 436,723 | 46,551 | 252,142 | |
V12 | 59 | 2 | 17 | 18,927 | 359,562 | 24,665 | 177,286 | |
V12 | 59 | 3 | 15 | 26,071 | 342,161 | 31,703 | 180,066 | |
V12 | 59 | 4 | 11 | 25,330 | 429,272 | 82,323 | 242,627 | |
V12 | 59 | 5 | 14 | 34,479 | 348,834 | 39,199 | 188,417 | |
V13 | 59 | 1 | 19 | 12,238 | 187,001 | 24,649 | 114,155 | |
V13 | 59 | 2 | 9 | 25,267 | 287,479 | 34,635 | 166,495 | |
V13 | 59 | 3 | 18 | 7790 | 187,346 | 21,203 | 126,432 | |
V13 | 59 | 4 | 14 | 9544 | 222,496 | 25,553 | 140,508 | |
V13 | 59 | 5 | 18 | 12,298 | 198,349 | 27,138 | 124,386 | |
V14 | 59 | 1 | 9 | 32,739 | 302,118 | 51,770 | 183,978 | |
V14 | 59 | 2 | 16 | 20,249 | 219,068 | 44,235 | 141,927 | |
V14 | 59 | 3 | 17 | 24,345 | 222,559 | 35,235 | 113,806 | |
V14 | 59 | 4 | 10 | 20,498 | 244,509 | 27,548 | 155,515 | |
V14 | 59 | 5 | 16 | 26,561 | 200,139 | 32,840 | 143,384 | |
V15 | 57 | 1 | 12 | 14,454 | 237,551 | 51,812 | 207,845 | |
V15 | 57 | 2 | 11 | 20,018 | 301,650 | 46,590 | 209,491 | |
V15 | 57 | 3 | 13 | 13,192 | 261,014 | 42,337 | 171,810 | |
V15 | 57 | 4 | 13 | 36,408 | 305,376 | 30,041 | 179,442 | |
V15 | 57 | 5 | 14 | 44,931 | 261,859 | 54,428 | 180,145 | |
V16 | 70 | 1 | 9 | 35,722 | 449,758 | 95,662 | 376,411 | |
V16 | 70 | 2 | 9 | 86,863 | 425,107 | 65,626 | 328,563 | |
V16 | 70 | 3 | 12 | 41,260 | 388,869 | 43,186 | 340,133 | |
V16 | 70 | 4 | 9 | 51,299 | 447,523 | 65,698 | 375,162 | |
V16 | 70 | 5 | 13 | 42,200 | 369,517 | 52,316 | 302,677 | |
V17 | 59 | 1 | 12 | 17,668 | 324,562 | 36,166 | 242,235 | |
V17 | 59 | 2 | 13 | 26,203 | 262,895 | 32,930 | 243,366 | |
V17 | 59 | 3 | 18 | 10,957 | 250,543 | 30,660 | 177,779 | |
V17 | 59 | 4 | 12 | 19,956 | 300,390 | 20,252 | 223,791 | |
V17 | 59 | 5 | 16 | 12,611 | 297,707 | 28,433 | 207,258 | |
V18 | 50 | 1 | 13 | 35,152 | 501,230 | 74,454 | 260,574 | |
V18 | 50 | 2 | 14 | 40,896 | 559,244 | 70,863 | 274,572 | |
V18 | 50 | 3 | 14 | 46,791 | 540,747 | 39,899 | 246,964 | |
V18 | 50 | 4 | 10 | 33,309 | 541,490 | 56,012 | 329,343 | |
V18 | 50 | 5 | 14 | 30,663 | 420,924 | 72,998 | 308,756 | |
V19 | 65 | 1 | 15 | 6114 | 186,893 | 16,695 | 119,136 | |
V19 | 65 | 2 | 20 | 6701 | 225,075 | 6899 | 103,517 | |
V19 | 65 | 3 | 20 | 5339 | 167,085 | 8834 | 103,752 | |
V19 | 65 | 4 | 13 | 8462 | 185,452 | 12,020 | 129,608 | |
V19 | 65 | 5 | 6 | 23,992 | 275,155 | 32,512 | 208,629 | |
V20 | 61 | 1 | 15 | 23,716 | 627,121 | 52,711 | 540,857 | |
V20 | 61 | 2 | 12 | 19,933 | 707,823 | 86,586 | 609,589 | |
V20 | 61 | 3 | 9 | 52,818 | 787,188 | 93,656 | 747,199 | |
V20 | 61 | 4 | 11 | 43,598 | 688,065 | 68,016 | 617,091 | |
V20 | 61 | 5 | 11 | 31,058 | 695,905 | 69,077 | 618,156 | |

**Table 4.** Descriptive statistics with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ (target summary against random uniform sample) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (random uniform sample against random uniform sample). Lower is better. Sampling rate: 1 frame per second. Dataset in [1], videos from V71 to V80. Highlighted values correspond to the lowest standard deviation.

YouTube, Dataset | | | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},{\overline{\mathit{S}}}_{*})$ | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},\overline{\mathit{S}})$ | | Visualization |
---|---|---|---|---|---|---|---|---|
Video | # Frames | User | # Frames User | Std | Mean | Std | Mean | Plot (Std, Std) |
V71 | 277 | 1 | 18 | 16,916 | 319,975 | 35,173 | 330,114 | |
V71 | 277 | 2 | 18 | 23,314 | 315,996 | 48,511 | 339,793 | |
V71 | 277 | 3 | 20 | 38,384 | 293,853 | 50,766 | 345,021 | |
V71 | 277 | 4 | 17 | 32,270 | 310,193 | 32,411 | 359,049 | |
V71 | 277 | 5 | 18 | 41,753 | 329,353 | 59,688 | 334,337 | |
V72 | 536 | 1 | 18 | 15,842 | 187,019 | 32,676 | 194,820 | |
V72 | 536 | 2 | 16 | 25,427 | 211,466 | 33,363 | 202,442 | |
V72 | 536 | 3 | 16 | 18,684 | 196,149 | 45,453 | 217,699 | |
V72 | 536 | 4 | 18 | 21,112 | 205,421 | 19,122 | 177,117 | |
V72 | 536 | 5 | 18 | 27,718 | 206,335 | 29,057 | 205,808 | |
V73 | 201 | 1 | 11 | 64,802 | 538,239 | 116,284 | 484,970 | |
V73 | 201 | 2 | 7 | 153,682 | 1,068,305 | 211,124 | 704,655 | |
V73 | 201 | 3 | 8 | 113,805 | 661,992 | 135,899 | 653,041 | |
V73 | 201 | 4 | 8 | 83,387 | 856,406 | 248,619 | 689,301 | |
V73 | 201 | 5 | 7 | 111,767 | 899,150 | 241,947 | 794,828 | |
V74 | 293 | 1 | 17 | 25,780 | 282,200 | 29,674 | 309,051 | |
V74 | 293 | 2 | 16 | 18,954 | 273,776 | 51,670 | 331,322 | |
V74 | 293 | 3 | 15 | 36,714 | 322,833 | 24,961 | 335,618 | |
V74 | 293 | 4 | 13 | 41,327 | 363,665 | 55,543 | 369,875 | |
V74 | 293 | 5 | 16 | 30,798 | 289,135 | 38,881 | 353,928 | |
V75 | 383 | 1 | 14 | 42,736 | 254,385 | 25,959 | 282,877 | |
V75 | 383 | 2 | 13 | 41,632 | 263,431 | 39,826 | 337,124 | |
V75 | 383 | 3 | 10 | 59,083 | 315,531 | 39,925 | 330,766 | |
V75 | 383 | 4 | 17 | 37,954 | 227,411 | 28,843 | 250,314 | |
V75 | 383 | 5 | 12 | 49,908 | 278,966 | 63,236 | 312,366 | |
V76 | 89 | 1 | 6 | 64,097 | 440,825 | 93,524 | 422,565 | |
V76 | 89 | 2 | 4 | 53,727 | 536,138 | 123,009 | 464,922 | |
V76 | 89 | 3 | 1 | 566,208 | 843,799 | 485,614 | 878,793 | |
V76 | 89 | 4 | 6 | 40,356 | 382,643 | 78,354 | 424,418 | |
V76 | 89 | 5 | 6 | 39,194 | 395,906 | 60,916 | 401,751 | |
V77 | 168 | 1 | 12 | 24,546 | 302,076 | 47,095 | 366,748 | |
V77 | 168 | 2 | 9 | 52,176 | 339,285 | 61,880 | 385,056 | |
V77 | 168 | 3 | 9 | 61,623 | 355,883 | 54,390 | 413,118 | |
V77 | 168 | 4 | 10 | 39,765 | 349,207 | 90,313 | 400,379 | |
V77 | 168 | 5 | 7 | 70,562 | 440,656 | 90,468 | 451,833 | |
V78 | 310 | 1 | 13 | 65,238 | 706,978 | 96,368 | 770,000 | |
V78 | 310 | 2 | 14 | 100,771 | 672,121 | 112,412 | 807,250 | |
V78 | 310 | 3 | 3 | 410,792 | 1,593,229 | 203,589 | 1,882,757 | |
V78 | 310 | 4 | 9 | 149,063 | 839,743 | 213,286 | 1,061,204 | |
V78 | 310 | 5 | 23 | 40,178 | 466,571 | 73,228 | 614,140 | |
V79 | 49 | 1 | 7 | 56,918 | 831,057 | 124,249 | 835,575 | |
V79 | 49 | 2 | 8 | 56,569 | 793,831 | 60,657 | 859,241 | |
V79 | 49 | 3 | 6 | 85,973 | 925,025 | 104,621 | 990,479 | |
V79 | 49 | 4 | 5 | 158,480 | 1,093,141 | 179,902 | 1,099,105 | |
V79 | 49 | 5 | 6 | 87,104 | 873,950 | 131,597 | 895,318 | |
V80 | 159 | 1 | 18 | 66,585 | 529,875 | 67,019 | 572,836 | |
V80 | 159 | 2 | 17 | 66,367 | 527,930 | 59,432 | 602,819 | |
V80 | 159 | 3 | 13 | 29,459 | 579,078 | 84,101 | 726,883 | |
V80 | 159 | 4 | 12 | 43,740 | 643,016 | 87,688 | 685,117 | |
V80 | 159 | 5 | 14 | 89,016 | 553,274 | 94,849 | 649,317 | |

**Table 5.** VSUMM [1] comparison against baseline based on the Signature Transform for the first 20 videos of the dataset crawled from YouTube. Descriptive statistics with $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{*})$ (target summary against random uniform sample) and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$ (random uniform sample against random uniform sample). $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{10}$ and $\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},{\overline{S}}_{{u}_{min}}){|}_{20}$ correspond to the baselines based on the Signature Transform using 10 and 20 random samples, respectively. Highlighted results are better than std ($\mathrm{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{S},\overline{S})$). Sampling rate: 1 frame per second. Highlighted results correspond to lowest standard deviation as described in Table 1.

Descriptive Statistics | | VSUMM | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},{\overline{\mathit{S}}}_{*})$ | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},\overline{\mathit{S}})$ | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},{\overline{\mathit{S}}}_{{\mathit{u}}_{\mathit{m}\mathit{i}\mathit{n}}}){\vert}_{10}$ | | $\mathbf{RMSE}\phantom{\rule{3.33333pt}{0ex}}(\overline{\mathit{S}},{\overline{\mathit{S}}}_{{\mathit{u}}_{\mathit{m}\mathit{i}\mathit{n}}}){\vert}_{20}$ | |
---|---|---|---|---|---|---|---|---|---|---|
Video | # Frames | # Frames | Std | Mean | Std | Mean | Std | Mean | Std | Mean |
V11 | 48 | 11 | 25,981 | 185,959 | 37,907 | 175,031 | 16,343 | 148,128 | 18,343 | 159,157 |
V12 | 59 | 13 | 56,274 | 313,156 | 41,613 | 205,004 | 17,770 | 181,533 | 11,665 | 206,951 |
V13 | 59 | 19 | 7018 | 184,865 | 15,319 | 120,307 | 10,578 | 110,258 | 6655 | 134,846 |
V14 | 59 | 8 | 21,415 | 281,969 | 39,412 | 171,935 | 19,069 | 157,531 | 10,104 | 180,199 |
V15 | 57 | 10 | 20,159 | 271,197 | 46,041 | 219,182 | 27,536 | 192,667 | 27,765 | 218,787 |
V16 | 70 | 9 | 65,997 | 513,440 | 84,667 | 428,025 | 38,088 | 283,324 | 30,235 | 446,068 |
V17 | 59 | 15 | 10,697 | 255,666 | 41,831 | 197,136 | 17,625 | 197,944 | 19,102 | 227,646 |
V18 | 50 | 14 | 42,731 | 449,324 | 51,635 | 230,695 | 33,525 | 261,288 | 30,179 | 242,746 |
V19 | 65 | 16 | 3891 | 235,797 | 5739 | 121,766 | 5883 | 116,245 | 4582 | 111,766 |
V20 | 61 | 9 | 43,864 | 796,448 | 39,035 | 733,547 | 28,460 | 684,546 | 39,414 | 644,681 |
V71 | 277 | 17 | 20,840 | 383,945 | 43,176 | 341,779 | 14,908 | 352,365 | 20,657 | 327,732 |
V72 | 536 | 12 | 61,886 | 233,649 | 48,603 | 252,688 | 17,604 | 276,631 | 18,966 | 248,489 |
V73 | 201 | 10 | 40,261 | 717,107 | 156,051 | 533,457 | 64,344 | 681,064 | 38,361 | 711,039 |
V74 | 293 | 17 | 26,274 | 270,374 | 36,674 | 334,265 | 17,622 | 354,621 | 17,486 | 330,606 |
V75 | 383 | 10 | 37,516 | 272,804 | 38,026 | 366,510 | 23,163 | 339,078 | 21,295 | 360,216 |
V76 | 89 | 7 | 36,084 | 353,323 | 114,266 | 377,699 | 31,131 | 335,958 | 34,724 | 405,954 |
V77 | 168 | 9 | 26,653 | 361,516 | 67,134 | 422,612 | 33,214 | 407,085 | 27,562 | 480,795 |
V78 | 310 | 13 | 95,305 | 831,043 | 127,705 | 823,938 | 33,903 | 980,397 | 36,361 | 951,784 |
V79 | 49 | 7 | 67,052 | 965,267 | 101,325 | 878,917 | 42,513 | 818,629 | 47,401 | 885,023 |
V80 | 159 | 15 | 48,115 | 613,702 | 118,428 | 644,529 | 43,411 | 589,256 | 37,487 | 808,984 |
Average | 153 | 12 | 17/20 (85%) | | | | 19/20 (95%) | | 19/20 (95%) | |

