# A Lightweight Graph Neural Network Algorithm for Action Recognition Based on Self-Distillation


## Abstract


## 1. Introduction

## 2. Previous Works

#### 2.1. GNNs

#### 2.2. Model Compression

## 3. Algorithm

#### 3.1. Problem Definition

#### 3.2. Input Features

#### 3.3. ST-GCN Compression Based on Self-Distillation

#### 3.3.1. ST-GCN Blocks

#### 3.3.2. Self-Distillation Compression

## 4. Experiments and Discussion

#### 4.1. Accuracy

**When the backbone capacity is adequate, self-distillation guarantees better performance than pure supervision.** The evaluated accuracies on the NTU xview and NTU xsub datasets are presented in Table 1 and Table 2, respectively. They demonstrate that the overall accuracies of ST-GCN 20 and ST-GCN 12 trained by self-distillation exceed the supervised results, while ST-GCN 8 trained by self-distillation underperforms the supervised ST-GCN 8. One possible explanation for the poor performance of ST-GCN 8 is that self-distillation of a lower-capacity backbone misleads the optimization toward a sub-optimal point [19]. As the capacity (number of layers) of the backbone is reduced, the accuracy decreases slightly for ST-GCN 12 and drastically for ST-GCN 8. Nevertheless, the performance of ST-GCN 12 remains competitive with ST-GCN 20. Figure 6 compares, for each block, the accuracy obtained by self-distillation with the accuracy obtained by supervision (the previous tables only show the performance of the overall output). Note that the supervised models were trained with exactly the same parameters as the original ST-GCN 20 [1], except for the parallel computing strategy. This could explain the difference between the accuracies in our results and those reported for the original supervised ST-GCN 20 (81.5 on NTU xsub, 88.3 on NTU xview).

**When the capacity of the backbone is small, the accuracy from shallower blocks is better.** As shown in Figure 6, for ST-GCN 20 and ST-GCN 12, the contributions from the deeper blocks 2 and 3 are clearly higher than that of block 1, while in ST-GCN 8, the contribution from block 1 is higher than those of the deeper blocks. One reason is that when the capacity of the model is large, it is more difficult to distill the knowledge and back-propagate it through gradient flow across blocks, given possible gradient vanishing; when the capacity is small, these difficulties are alleviated.

**The backbones trained with mov + pos data are usually the best.** Table 1, Table 2 and Figure 6 all illustrate this. The pos data contain the 3D positions of each skeleton joint, recording the spatial configuration at each frame, while the mov data emphasize the temporal movement amplitude of each skeleton joint at each frame. The mov data fill in some otherwise missing information (e.g., the differences between running and walking), which is why the best results are obtained with mov + pos.
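As a concrete illustration, the mov stream can be derived from the pos stream as a frame-to-frame displacement. This is a minimal sketch under that assumed definition (the paper's exact preprocessing may differ), with `pos` shaped `(T, V, C)` for frames, joints and coordinates:

```python
import numpy as np

def build_mov(pos):
    """Temporal movement amplitude per joint: displacement between
    consecutive frames, zero-padded at the first frame (assumed definition)."""
    mov = np.zeros_like(pos)
    mov[1:] = pos[1:] - pos[:-1]
    return mov

def build_mov_pos(pos):
    """mov + pos input: concatenate both streams along the channel axis."""
    return np.concatenate([pos, build_mov(pos)], axis=-1)
```

With this convention, a static joint yields a zero mov channel, while actions with similar poses but different speeds (such as running vs. walking) differ in the magnitude of mov.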

#### 4.2. Compression

**The backbone size can be compressed by at least 50% at the second block, and by at least 70% at the first block.** The compressed and uncompressed backbones follow a similar architecture. The percentage of parameters of each block relative to the corresponding whole backbone is shown in Figure 7. According to Figure 6, the accuracy of the second block of ST-GCN 12 is competitive with that of the whole ST-GCN 20. Since the parameter sizes of the backbones rank as ST-GCN 20 > ST-GCN 12 > ST-GCN 8, taking the second block of ST-GCN 12 yields a compression of 80% compared with the whole ST-GCN 20, combining the compression from reducing the backbone layers and from reducing the number of blocks.

**The running time can be accelerated by at least 1.42× at the second block, and by at least 2.33× at the first block.** The details are listed in Table 3.
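The 80% figure above follows from simple arithmetic on parameter counts. The sketch below uses hypothetical counts (the paper reports only percentages in Figure 7), chosen so the stated reduction holds:

```python
def compression_percent(kept_params: float, full_params: float) -> float:
    """Percentage of parameters removed relative to the full backbone."""
    return 100.0 * (1.0 - kept_params / full_params)

# Hypothetical sizes: ST-GCN 12 truncated after block 2 vs. the whole ST-GCN 20.
full_st20 = 3.10e6
st12_through_block2 = 0.62e6
print(compression_percent(st12_through_block2, full_st20))  # 80.0
```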

#### 4.3. Denser Representations

- Davies–Bouldin index [20]: Good clusters should have low intra-cluster distances and high inter-cluster distances, and therefore a small Davies–Bouldin index. With ${\sigma}_{i}$ the average distance of the samples in cluster $i$ to its centroid ${c}_{i}$, and $d({c}_{i},{c}_{j})$ the distance between centroids, the metric is defined as $$DB=\frac{1}{K}\sum_{i=1}^{K}\max_{j\ne i}\left(\frac{\sigma_i+\sigma_j}{d(c_i,c_j)}\right).$$
- Dunn index [21,22]: Clusters with a higher Dunn index are more desirable. With $d(i,j)$ the distance between clusters $i$ and $j$, and ${d}^{\prime}(k)$ the intra-cluster distance (diameter) of cluster $k$, the formula is $$D=\frac{\min_{1\le i<j\le K}d(i,j)}{\max_{1\le k\le K}d^{\prime}(k)}.$$
- Silhouette coefficient [23]: Clusters with a high silhouette value are considered well clustered. The silhouette coefficient ranges in $[-1,1]$. With $a(i)$ the mean distance from sample $i$ to the other samples of its own cluster, and $b(i)$ the smallest mean distance from sample $i$ to the samples of any other cluster, its formula is $$S(i)=\begin{cases}1-\frac{a(i)}{b(i)}, & \text{if } a(i)<b(i)\\ 0, & \text{if } a(i)=b(i)\\ \frac{b(i)}{a(i)}-1, & \text{if } a(i)>b(i)\end{cases}$$
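The three metrics above can be sketched directly in NumPy. This is a minimal reference implementation under common conventions (Euclidean distances, centroid-based $\sigma_i$, single-linkage inter-cluster distance and cluster diameter for Dunn), not the exact evaluation code used in the paper:

```python
import numpy as np

def _dist(A, B):
    # Pairwise Euclidean distances between rows of A and rows of B
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def davies_bouldin(X, labels):
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(0) for k in ks])
    sig = np.array([_dist(X[labels == k], cents[i:i + 1]).mean()
                    for i, k in enumerate(ks)])          # sigma_i
    D = _dist(cents, cents)
    np.fill_diagonal(D, np.inf)                          # exclude j == i
    R = (sig[:, None] + sig[None, :]) / D
    return R.max(axis=1).mean()

def dunn(X, labels):
    ks = np.unique(labels)
    groups = [X[labels == k] for k in ks]
    inter = min(_dist(groups[i], groups[j]).min()        # min d(i, j)
                for i in range(len(ks)) for j in range(i + 1, len(ks)))
    intra = max(_dist(g, g).max() for g in groups)       # max d'(k)
    return inter / intra

def silhouette(X, labels):
    D = _dist(X, X)
    scores = []
    for i in range(len(X)):
        own = (labels == labels[i])
        own[i] = False
        a = D[i][own].mean() if own.any() else 0.0
        b = min(D[i][labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        # (b - a) / max(a, b) is the compact form of the piecewise definition
        scores.append(0.0 if a == b else (b - a) / max(a, b))
    return float(np.mean(scores))
```

On two well-separated clusters, these give a small Davies–Bouldin value, a large Dunn index and a silhouette near 1, matching the desired directions described above.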

#### 4.3.1. Geometric Shapes of Feature Representations

**The representations learned by ST-GCN after self-distillation have more complex geometric shapes that are not linearly separable.** Different ways of projecting the extracted representations to 2D space were tried, namely principal component analysis (PCA) [24], t-distributed stochastic neighbor embedding (t-SNE) [25,26] and k-nearest-neighbor graphs (k-NNG) [27] (Figure 8). PCA generates clusters with mixed classes. The projections by t-SNE and k-NNG are more impressive: actions with similar semantics, such as drinking water and making a phone call, are close together, and vice versa. Because PCA only renders clusters with linearly separable features well, this indicates that the representations learned by ST-GCN have more complex, non-linearly-separable geometric shapes.
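The PCA baseline used here is easy to reproduce. The sketch below computes a 2D linear projection via SVD (t-SNE and k-NNG typically require dedicated libraries such as scikit-learn, so only the linear baseline is shown):

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the top-2 principal components (a linear map)."""
    Xc = X - X.mean(axis=0)              # center the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                 # scores along the top-2 directions
```

Because this map is linear, clusters that are only non-linearly separable in feature space stay mixed in the 2D plot, which is exactly the behavior observed for PCA in Figure 8.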

#### 4.3.2. Self-Distillation vs. Supervision

**The representations learned by self-distillation are denser than the representations trained under supervision.** As shown in Figure 9, the cluster densities of the representations obtained by self-distillation are better than those obtained under pure supervision. For ST-GCN 20 and ST-GCN 12, the Dunn index and silhouette coefficients of self-distillation are larger than those of supervision, and the Davies–Bouldin index of self-distillation is smaller. Training under supervision only uses action labels to drive the predicted labels toward the ground truth, while self-distillation additionally constrains the similarity between predicted distributions and the distances between features. Pushing the predicted distributions toward the ground truth compacts each cluster, and forcing small feature distances keeps the learned representations geometrically close to the target clusters.
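The combination of objectives described above can be sketched for one shallow block. This is a hedged, NumPy-only illustration in the spirit of BYOT [4]; the hyperparameters `alpha`, `beta` and temperature `T` are illustrative, not the paper's values:

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - (z / T).max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_distill_loss(block_logits, block_feat, final_logits, final_feat, y,
                      alpha=0.5, beta=1e-3, T=3.0):
    """Per-block objective: hard-label cross-entropy + softened KL toward the
    deepest classifier + L2 hint between block and final features."""
    n = len(y)
    ce = -np.log(softmax(block_logits)[np.arange(n), y]).mean()
    p_t = softmax(final_logits, T)                   # teacher: deepest output
    p_s = softmax(block_logits, T)                   # student: shallow block
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(-1).mean()
    l2 = ((block_feat - final_feat) ** 2).mean()
    return ce + alpha * T * T * kl + beta * l2
```

When a shallow block already matches the deepest classifier and its features, the KL and L2 terms vanish and only the label term remains, which is the intuition behind the density argument above.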

#### 4.3.3. Each Block’s Feature Representation

**The deeper blocks generate denser clusters.** When the blocks are deeper, the cluster densities (Figure 9) improve, as illustrated by a smaller Davies–Bouldin index and larger Dunn index and silhouette coefficient. The features of eight actions from different blocks are also projected to visualize the density differences. In Figure 10, the deeper the block, the denser each cluster is and the farther the clusters are from each other. This is reasonable considering the back-propagation of training: gradient flows are back-propagated from the deeper blocks to the shallower blocks, so the blocks close to the output layer receive more informative gradients and therefore generate representations more similar to the true clusters. Furthermore, the shallower blocks first learn coarse features, and the finer features are gradually learned by the deeper blocks.

**Increasing the capacity of the model encourages it to focus more on the detailed differences between actions.** Among the eight selected actions, staggering and falling are anomalies that differ from the other actions when moving speed and movement amplitude are taken into account. ST-GCN 20 and ST-GCN 12 can both classify each action, while ST-GCN 8 can separate staggering and falling from the other six actions but fails to further divide the remaining actions clearly. As demonstrated in Figure 11, although the clusters of actions 0, 10, 16, 27, 31 and 40 are mixed together, actions 41 and 42 (falling and staggering) are already separated from the others.

## 5. Discussion and Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
GNNs | Graph Neural Networks |
HAR | Human Action Recognition |
BYOT | Be Your Own Teacher |

## References

1. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
2. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035.
3. Cheng, Y.; Wang, D.; Zhou, P.; Zhang, T. A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv 2020, arXiv:1710.09282.
4. Zhang, L.; Song, J.; Gao, A.; Chen, J.; Bao, C.; Ma, K. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation. arXiv 2019, arXiv:1905.08094.
5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
6. Veličković, P. Everything is connected: Graph neural networks. Curr. Opin. Struct. Biol. 2023, 79, 102538.
7. Wang, X.; Xu, H.; Wang, X.; Xu, X.; Wang, Z. A Graph Neural Network and Pointer Network-Based Approach for QoS-Aware Service Composition. IEEE Trans. Serv. Comput. 2023, 16, 1589–1603.
8. Zhang, Y.; Hu, Y.; Han, N.; Yang, A.; Liu, X.; Cai, H. A survey of drug-target interaction and affinity prediction methods via graph neural networks. Comput. Biol. Med. 2023, 163, 107136.
9. Zhao, Q.; Feng, X. Utilizing citation network structure to predict paper citation counts: A deep learning approach. J. Inf. 2022, 16, 101235.
10. Bukumira, M.; Antonijevic, M.; Jovanovic, D.; Zivkovic, M.; Mladenovic, D.; Kunjadic, G. Carrot grading system using computer vision feature parameters and a cascaded graph convolutional neural network. J. Electron. Imaging 2022, 31, 061815.
11. Hameed, M.S.A.; Schwung, A. Graph neural networks-based scheduler for production planning problems using reinforcement learning. J. Manuf. Syst. 2023, 69, 91–102.
12. Hamilton, W.L. Graph representation learning. Synth. Lect. Artif. Intell. Mach. Learn. 2020, 14, 1–159.
13. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203.
14. Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; Volume 2, pp. 729–734.
15. Feng, M.; Meunier, J. Skeleton Graph-Neural-Network-Based Human Action Recognition: A Survey. Sensors 2022, 22, 2091.
16. Li, Z.; Li, H.; Meng, L. Model Compression for Deep Neural Networks: A Survey. Computers 2023, 12, 60.
17. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
18. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019.
19. Stanton, S.; Izmailov, P.; Kirichenko, P.; Alemi, A.A.; Wilson, A.G. Does knowledge distillation really work? Adv. Neural Inf. Process. Syst. 2021, 34, 6906–6919.
20. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227.
21. Dunn, J.C. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J. Cybern. 1973, 3, 32–57.
22. Dunn, J.C. Well-Separated Clusters and Optimal Fuzzy Partitions. J. Cybern. 1974, 4, 95–104.
23. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
24. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202.
25. Hinton, G.E.; Roweis, S. Stochastic neighbor embedding. Adv. Neural Inf. Process. Syst. 2002, 15, 833–840.
26. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
27. Preparata, F.P.; Shamos, M.I. Computational Geometry: An Introduction; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
28. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics human action video dataset. arXiv 2017, arXiv:1705.06950.

**Figure 1.** The ST-GCN algorithm [1]. The blue dots represent human joints. The human skeleton at each frame is regarded as a graph.

**Figure 2.** The normalized adjacency matrices after spatial partition. Each row or column represents one pair of nodes. The centripetal matrix records the root nodes and their centripetal nodes, while the centrifugal matrix only records the centrifugal joints.

**Figure 3.** The spatial partition to build new adjacency matrices. The black cross is the center of gravity of the whole skeleton, usually computed as the average of all joints. The green nodes are the selected root joints and the purple nodes are centripetal nodes, while the orange nodes are centrifugal joints. The distance between the centripetal nodes and the center of gravity is shorter than the distance between the centrifugal nodes and the center of gravity.

**Figure 4.** The block division of each backbone. The green layers denoted 'Preprocess' represent preprocessing such as batch normalization. The numbers denote the dimensions of the output features from each block. b1, b2, b3 indicate ST Blocks, and each layer (ST-GCN unit, marked in red, purple or blue) contains two units: a graph spatial unit and a temporal unit. Therefore, for ST-GCN 20, b1, b2 and b3 contain 18 layers in total; counting the preprocessing layer and the output fully connected (FC) layer gives 20 layers.

**Figure 5.** The compression algorithm framework, drawn based on [4]. An ST Block is one block of the ST-GCN algorithm; there are three ST Blocks in total. FC is the fully connected layer. The number of star-shaped symbols denotes the relative magnitude of computing resources required.

**Figure 6.** The accuracy increments of different blocks in the HAR task, where kd_b1, kd_b2 and kd_b3 are the results from different blocks after self-distillation; sup indicates the results obtained by supervision. The integers on the x-axis denote the number of layers of ST-GCN, and pos/mov/mov + pos indicate different input data.

**Figure 7.** The percentage of the parameter size of each block compared with the corresponding whole backbone, where b1, b2, b3 denote blocks 1, 2 and 3, and ST 20, ST 12, ST 8 represent ST-GCN 20, ST-GCN 12 and ST-GCN 8, respectively. The parameter sizes of b1, b2, b3 differ because of the varied output dimensions shown in Figure 4. The size of b1 includes the preprocessing layer, and the size of b3 includes the FC layer.

**Figure 8.** The (**a**) PCA; (**b**) t-SNE; (**c**) 20-nearest-neighbor-graph projections of the representations learned by self-distillation. The features come from ST-GCN 20 with pos as input data, performed on NTU xsub, captured from the first block.

**Figure 9.** The cluster densities of the representations estimated by (**a**) Davies–Bouldin index, (**b**) Dunn index and (**c**) silhouette coefficients. The representations are trained by self-distillation or pure labels, labeled kd and sup in the legend, respectively. The brown color, not shown in the legend, is the overlapped region between kd and sup. For each subgraph, each row shows the densities on different datasets, and each column shows different blocks. The x-axis denotes trained backbones and input data. All instances are evaluated here.

**Figure 10.** The features from (**a**) the 1st block; (**b**) the 2nd block; (**c**) the 3rd block and (**d**) the overall output, projected by the 20-nearest-neighbor graph. The features are trained by ST-GCN 20 with pos as input data, performed on NTU xsub. The overall output includes the features from the output of the whole backbone. From (**a**,**b**), the classes 0, 16 and 27 (marked as black, red and yellow) are more separated. From (**b**,**c**), the classes 0 and 27 are separated more clearly. From (**c**,**d**), the distance between actions 10 and 31 becomes larger.

**Figure 11.** The features from (**a**) the 1st block; (**b**) the 2nd block; (**c**) the 3rd block and (**d**) the overall output, projected by the 20-nearest-neighbor graph. The features are trained by ST-GCN 8 with pos as input data, performed on NTU xsub. The overall output includes the features from the output of the whole backbone. Although many classes are mixed together, classes 41 (green) and 42 (pink) can be linearly separated. They are falling and staggering, which are anomalies compared with the other six daily actions.

**Table 1.** The accuracy on NTU xview. ST 20 means ST-GCN with 20 layers; ST 12 and ST 8 follow similar naming. The supervised models are denoted Sup. and used only labels for training.

Model | ST 20 | ST 12 | ST 8 | Sup. ST 20 | Sup. ST 12 | Sup. ST 8 |
---|---|---|---|---|---|---|
pos | 81.37 | 77.49 | 21.46 | 75.19 | 74.97 | 79.47 |
mov | 85.91 | 82.34 | 18.75 | 82.32 | 81.99 | 80.28 |
mov + pos | 85.79 | 84.63 | 27.84 | 82.19 | 81.34 | 80.34 |

**Table 2.** The accuracy on NTU xsub. ST 20 means ST-GCN with 20 layers; ST 12 and ST 8 follow similar naming. The supervised models are denoted Sup. and used only labels for training.

Model | ST 20 | ST 12 | ST 8 | Sup. ST 20 | Sup. ST 12 | Sup. ST 8 |
---|---|---|---|---|---|---|
pos | 77.36 | 62.14 | 21.46 | 71.34 | 71.48 | 69.09 |
mov | 80.25 | 77.47 | 17.41 | 75.35 | 76.25 | 75.81 |
mov + pos | 80.61 | 78.67 | 26.29 | 74.83 | 74.63 | 75.96 |

**Table 3.** The relative execution speedup per sample compared with block 3 of ST 20, which is the heaviest model. ST 20, ST 12 and ST 8 denote ST-GCN 20, ST-GCN 12 and ST-GCN 8, respectively. Each sample has 300 frames.

Block | ST 20 | ST 12 | ST 8 |
---|---|---|---|
block 1 | 2.33× | 3.43× | 4.60× |
block 2 | 1.42× | 2.36× | 2.90× |
block 3 | 1.00× | 1.54× | 2.03× |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Feng, M.; Meunier, J.
A Lightweight Graph Neural Network Algorithm for Action Recognition Based on Self-Distillation. *Algorithms* **2023**, *16*, 552.
https://doi.org/10.3390/a16120552
