# An Improved Vision Transformer Network with a Residual Convolution Block for Bamboo Resource Image Identification


## Abstract


## 1. Introduction

## 2. Experiment and Methods

#### 2.1. Bamboo Resources

#### 2.2. Methods

#### 2.2.1. Vision Transformer

#### 2.2.2. Residual Vision Transformer Algorithm

#### 2.2.3. Quantitative Evaluation Indicators

## 3. Results and Discussion

#### 3.1. Bamboo Resource Images Dataset Analysis

#### 3.2. Analyzing the Performance of ReVI on Bamboo Datasets

#### 3.3. Comparison with Different Deep Learning Models

## 4. Conclusions

- This paper improves the ViT algorithm. The proposed ReVI algorithm outperforms both the ViT and CNN algorithms on the bamboo dataset, and ReVI continues to outperform ViT as the number of training samples decreases. We conclude that the convolution and residual mechanisms compensate for the inductive bias that ViT cannot learn from a small-scale dataset, so that ViT is no longer limited by the number of training samples and becomes equally applicable to classification on small-scale datasets.
- Bamboo varies widely in species and is distributed around the world, and collecting bamboo samples requires considerable expertise and human resources. The average classification accuracy of ReVI reaches 90.21%, higher than that of CNNs such as ResNet18, VGG16, Xception and DenseNet121. The ReVI algorithm proposed in this manuscript can help bamboo experts conduct more efficient and accurate bamboo classification and identification, which is important for the conservation of bamboo germplasm diversity.
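The accuracy, precision, recall, F1-score and specificity figures reported in the tables below are macro-averaged over the 19 classes. As a minimal, self-contained sketch of how such indicators are computed (the confusion matrix, the 3-class toy example and the function name `macro_metrics` are invented for illustration, not taken from the paper's code):

```python
def macro_metrics(cm):
    """Macro-averaged indicators from a confusion matrix.

    cm[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s, specs = [], [], [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, but wrong
        fn = sum(cm[k]) - tp                        # true k, but missed
        tn = total - tp - fp - fn
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        specs.append(tn / (tn + fp) if tn + fp else 0.0)
    avg = lambda xs: sum(xs) / len(xs)
    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "precision": avg(precisions),
        "recall": avg(recalls),
        "f1": avg(f1s),
        "specificity": avg(specs),
    }

# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
cm = [[8, 1, 1],
      [0, 9, 1],
      [1, 0, 9]]
print(macro_metrics(cm))
```

Note that with macro averaging the reported F1-score need not equal 2PR/(P + R) of the averaged precision and recall, since each is averaged per class first.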

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

- Liese, W. Research on Bamboo. Wood Sci. Technol. **1987**, 21, 189–209.
- Jain, S.; Kumar, R.; Jindal, U.C. Mechanical behaviour of bamboo and bamboo composite. J. Mater. Sci. **1992**, 27, 4598–4604.
- Sharma, B.; van der Vegte, A. Engineered Bamboo for Structural Applications. Constr. Build. Mater. **2015**, 81, 66–73.
- Lakkad, S.C.; Patel, J.M. Mechanical Properties of Bamboo, a Natural Composite. Fibre Sci. Technol. **1981**, 14, 319–322.
- Scurlock, J.M.; Dayton, D.C.; Hames, B. Bamboo: An overlooked biomass resource? Biomass Bioenergy **2000**, 19, 229–244.
- Yeasmin, L.; Ali, M.; Gantait, S.; Chakraborty, S. Bamboo: An overview on its genetic diversity and characterization. 3 Biotech **2015**, 5, 1–11.
- LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. **1989**, 1, 541–551.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM **2017**, 60, 84–90.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv **2014**, arXiv:1409.1556.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Milicevic, M.; Zubrinic, K.; Grbavac, I.; Obradovic, I. Application of deep learning architectures for accurate detection of olive tree flowering phenophase. Remote Sens. **2020**, 12, 2120.
- Quiroz, I.A.; Alférez, G.H. Image recognition of Legacy blueberries in a Chilean smart farm through deep learning. Comput. Electron. Agric. **2020**, 168, 105044.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. Available online: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 17 January 2023).
- Lee, S.; Lee, S.; Song, B.C. Improving Vision Transformers to Learn Small-Size Dataset From Scratch. IEEE Access **2022**, 10, 123212–123224.
- Park, S.; Kim, B.-K.; Dong, S.-Y. Self-Supervised Rgb-Nir Fusion Video Vision Transformer Framework for Rppg Estimation. IEEE Trans. Instrum. Meas. **2022**, 71, 1–10.
- Bi, M.; Wang, M.; Li, Z.; Hong, D. Vision Transformer with Contrastive Learning for Remote Sensing Image Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. **2022**, 16, 738–749.
- Wiedemann, G.; Remus, S.; Chawla, A.; Biemann, C. Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. arXiv **2019**, arXiv:1909.10430.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog **2019**, 1, 9.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. **2020**, 33, 1877–1901.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv **2020**, arXiv:2010.11929.
- Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do vision transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34.
- Jin, X.; Zhu, X. Classifying a Limited Number of the Bamboo Species by the Transformation of Convolution Groups. In Proceedings of the 4th International Conference on Computer Science and Application Engineering, Sanya, China, 20–22 October 2020.
- Wu, Z.; Shen, C.; Van Den Hengel, A. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognit. **2019**, 90, 119–133.
- Coşkun, M.; Uçar, A.; Yildirim, Ö.; Demir, Y. Face recognition based on convolutional neural network. In Proceedings of the 2017 International Conference on Modern Electrical and Energy Systems (MEES), Kremenchuk, Ukraine, 15–17 November 2017.
- Almabdy, S.; Elrefaei, L. Deep convolutional neural network-based approaches for face recognition. Appl. Sci. **2019**, 9, 4397.
- Cheng, H.-D.; Jiang, X.H.; Sun, Y.; Wang, J. Color image segmentation: Advances and prospects. Pattern Recognit. **2001**, 34, 2259–2281.
- Bishop, C.M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, UK, 1995.
- Ripley, B. Pattern Recognition and Neural Networks; Cambridge University Press: Cambridge, UK, 1996.
- Venables, W.N.; Ripley, B.D. Modern Applied Statistics with S-PLUS; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. **1997**, 9, 1735–1780.
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv **2014**, arXiv:1412.3555.
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv **2016**, arXiv:1609.08144.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv **2014**, arXiv:1409.0473.
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv **2014**, arXiv:1406.1078.
- Kim, Y.; Denton, C.; Hoang, L.; Rush, A.M. Structured attention networks. arXiv **2017**, arXiv:1702.00887.
- Parikh, A.P.; Täckström, O.; Das, D.; Uszkoreit, J. A decomposable attention model for natural language inference. arXiv **2016**, arXiv:1606.01933.
- Paulus, R.; Xiong, C.; Socher, R. A deep reinforced model for abstractive summarization. arXiv **2017**, arXiv:1705.04304.
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv **2016**, arXiv:1607.06450.
- Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.

**Figure 1.** Nineteen species of bamboo sample [1,24]. Note: Nos. **0**–**4** are Phyllostachys bambusoides Sieb. et Zucc. f. lacrima-deae Keng f. et Wen, Pleioblastus fortunei (v. Houtte) Nakai, Pleioblastus viridistriatus (Regel) Makino, Phyllostachys edulis ‘Heterocycla’, Phyllostachys edulis cv. Tao kiang; Nos. **5**–**9** are Phyllostachys bambusoides f. mixta Z.P. Wang et N.X. Ma, Phyllostachys heterocycla (Carr.) Mitford cv. Luteosulcata Wen, Phyllostachys sulphurea (Carr.) A. et C. Riv. ‘Robert Young’, Phyllostachys aureosulcata ‘Spectabilis’ C.D. Chu et C.S. Chao, Phyllostachys heterocycla (Carr.) Mitford cv. Viridisulcata; Nos. **10**–**14** are Phyllostachys aurea Carr. ex A. et C. Riv, Phyllostachys nigra (Lodd.) Munro, Shibataea chinensis Nakai, Acidosasa chienouensis (Wen.) C.C. Chao. et Wen, Bambusa subaequalis H.L. Fung et C.Y. Sia; Nos. **15**–**18** are Phyllostachys hirtivagina G.H. Lai f. flavovittata G.H. Lai, Pseudosasa amabilis (McClure) Keng f., P. hindsii (Munro) C.D. Chu et C.S. Chao, Phyllostachys heterocycla (Carr.) Mitford cv. Obliquinoda Z.P. Wang et N.X. Ma.

**Figure 2.** Residual block [11].

**Figure 3.** Transformer encoder-decoder architecture. The figure on the **left** is the encoder of the transformer and the figure on the **right** is the decoder of the transformer [15].

**Figure 4.** The framework of ViT [22].

**Figure 7.** The performance of the ReVI models: (**a**) accuracy variation curve of the $\mathrm{ReVI}\_\mathrm{Xb}\left(\mathrm{X}\in \left[1,2,3,4,5\right]\right)$; (**b**) average accuracy, average recall, average F1-score and average specificity variation curves of the $\mathrm{ReVI}\_\mathrm{Xb}\left(\mathrm{X}\in \left[1,2,3,4,5\right]\right)$.

**Figure 8.** (**a**) Training loss variation curve of the $\mathrm{ViT}\_\mathrm{Xb}\left(\mathrm{X}\in \left[1,2,3,4,5\right]\right)$ and $\mathrm{ReVI}\_\mathrm{Xb}\left(\mathrm{X}\in \left[1,2,3,4,5\right]\right)$; (**b**) accuracy variation curve of the $\mathrm{ViT}\_\mathrm{Xb}\left(\mathrm{X}\in \left[1,2,3,4,5\right]\right)$ and $\mathrm{ReVI}\_\mathrm{Xb}\left(\mathrm{X}\in \left[1,2,3,4,5\right]\right)$.

**Figure 9.** Prediction accuracy of $\mathrm{ViT}\_\mathrm{Xb}\left(\mathrm{X}\in \left[1,2,3,4,5\right]\right)$ and $\mathrm{ReVI}\_\mathrm{Xb}\left(\mathrm{X}\in \left[1,2,3,4,5\right]\right)$ on the test dataset.

**Figure 11.** (**a**) ROC (receiver operating characteristic) curve of the ReVI_3b; (**b**) PR (precision–recall) curve of the ReVI_3b.

**Figure 12.** (**a**) Variation curves of loss for different deep models on the training dataset; (**b**) variation curves of accuracy for different deep models on the training dataset.

| Layer | Output Size | Parameter |
|---|---|---|
| Conv2d | 192 × 192 | 7 × 7, 64, stride 2, padding 3 |
| Maxpool | 95 × 95 | 3 × 3, stride 2 |
| Residual unit | 48 × 48 | $\left(\begin{array}{c}1\times 1,128\\ 3\times 3,128\\ 1\times 1,256\end{array}\right)\times 1$ |
| Re_embedding layer | 768 × 577 | Patch_size = 2 |
| Transformer encoder | 768 | — |
| FC | 19 | — |
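The output sizes in the table above follow from the standard convolution/pooling output-size formula. As a sanity-check sketch (the 384 × 384 input resolution and the stride-2 3 × 3 convolution inside the residual unit are assumptions inferred from the listed 192 × 192 and 48 × 48 outputs, not stated explicitly in the table):

```python
def conv_out(size, kernel, stride, padding=0):
    """Conv/pool output size: floor((H + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

h = 384                                            # assumed input height/width
h = conv_out(h, kernel=7, stride=2, padding=3)     # Conv2d 7x7, 64, stride 2, padding 3
assert h == 192
h = conv_out(h, kernel=3, stride=2)                # Maxpool 3x3, stride 2
assert h == 95
h = conv_out(h, kernel=3, stride=2, padding=1)     # assumed stride-2 3x3 conv in residual unit
assert h == 48
tokens = (h // 2) ** 2 + 1                         # Patch_size = 2 -> 24x24 patches + class token
assert tokens == 577                               # matches the 768 x 577 re_embedding output
print(h, tokens)
```

This also explains the 577 sequence length fed to the transformer encoder: 24 × 24 = 576 patch tokens plus one class token.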

| Highest Accuracy/% | ReVI_1b | ReVI_2b | ReVI_3b | ReVI_4b | ReVI_5b |
|---|---|---|---|---|---|
| Training dataset | 95.15 | 96.18 | 96.18 | 96.27 | 96.58 |
| Validation dataset | 96.65 | 95.34 | 95.81 | 95.50 | 95.65 |
| Test dataset | 87.46 | 89.30 | 90.21 | 88.07 | 89.60 |

| Model | Precision/% | Recall/% | F1-Score/% | Specificity/% | mAP/% |
|---|---|---|---|---|---|
| ReVI_3b | 91.40 | 85.67 | 79.63 | 99.35 | 91.00 |
| ViT_4b | 82.89 | 83.39 | 80.77 | 99.19 | 90.22 |

| Model | ReVI_3b | ViT_4b | ResNet18 | ResNet50 | VGG16 | DenseNet121 | Xception | ResNet18 (pretraining) | ResNet50 (pretraining) |
|---|---|---|---|---|---|---|---|---|---|
| Training time | 60 m 20 s | 59 m 9 s | 40 m 34 s | 69 m 55 s | 88 m 29 s | 102 m 36 s | 74 m 10 s | 44 m 52 s | 62 m 50 s |
| Accuracy/% | 90.21 | 85.63 | 84.71 | 87.16 | 84.40 | 85.93 | 82.97 | 88.07 | 94.50 |
| Precision/% | 80.15 | 81.55 | 85.44 | 82.08 | 83.52 | 81.68 | 75.19 | 80.23 | 93.90 |
| Recall/% | 80.83 | 81.26 | 75.56 | 82.73 | 75.81 | 77.38 | 76.12 | 81.02 | 87.38 |
| F1-score/% | 79.63 | 80.77 | 77.47 | 81.23 | 75.98 | 78.43 | 74.89 | 79.88 | 89.03 |
| Specificity/% | 99.35 | 99.19 | 99.14 | 99.29 | 99.12 | 99.21 | 99.04 | 99.34 | 99.69 |
| mAP/% | 90.22 | 91.00 | 89.94 | 88.73 | 83.99 | 88.90 | 86.86 | 85.86 | 96.36 |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zou, Q.; Jin, X.; Song, Y.; Wang, L.; Li, S.; Rao, Y.; Zhang, X.; Gao, Q. An Improved Vision Transformer Network with a Residual Convolution Block for Bamboo Resource Image Identification. *Electronics* **2023**, *12*, 1055.
https://doi.org/10.3390/electronics12041055
