Convolutional Neural Networks for Visual Detection, Recognition and Segmentation in Images and Videos

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: closed (15 November 2023) | Viewed by 38435

Special Issue Editor


Guest Editor
Department of Computer Science and Engineering, Blekinge Institute of Technology, SE-371 41 Karlskrona, Sweden
Interests: artificial intelligence; image processing; computer vision

Special Issue Information

Dear Colleagues,

We are inviting submissions to the Special Issue on "Convolutional Neural Networks for Visual Detection, Recognition and Segmentation in Images and Videos".

The amount of visual data collected and generated around the world keeps rising, and the automatic recognition of events in these data has become an important and challenging problem for the research community. Visual recognition carried out by human effort is time-consuming, tedious, inefficient, and costly. It is therefore necessary to reduce manual effort, and automated intelligent systems play an important role in solving vision-based problems. In the last decade, Convolutional Neural Network (CNN)-based computational intelligence algorithms have been shown to outperform classical machine learning methods such as k-nearest neighbors, shallow neural networks, and support vector machines. It may also be necessary to develop and employ hybrid systems that combine CNN-based models with other machine learning techniques (e.g., support vector machines, random forests, etc.) to solve visual problems in images and videos. This Special Issue offers an opportunity for researchers to address broad challenges in both the theoretical and application aspects of CNN-based architectures in image and video processing.
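
As a rough illustration of the hybrid systems mentioned above, the following sketch pairs a CNN feature extractor with a support vector machine. It assumes torchvision's pretrained ResNet-18 and scikit-learn's SVC; all names and settings are illustrative, not prescriptive.

```python
# Hybrid pipeline sketch: CNN features feeding a classical SVM classifier.
# Assumes torchvision's pretrained ResNet-18 and scikit-learn's SVC;
# any CNN backbone and classical classifier could be substituted.
import torch
import torchvision.models as models
from sklearn.svm import SVC

# CNN backbone with the final classification layer removed,
# so it outputs 512-dimensional feature vectors.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224) normalized batch -> (N, 512) features."""
    return backbone(images)

def fit_hybrid(x_train: torch.Tensor, y_train):
    """Train the SVM on CNN features; y_train holds class labels."""
    feats = extract_features(x_train).numpy()
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(feats, y_train)
    return clf
```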

The main objective of this Special Issue is to examine the theory and application of CNN-based deep learning architectures for problems in image and video applications. We invite researchers to contribute original research and review articles that will motivate the continuing effort to apply Convolutional Neural Network frameworks to image and video processing problems. The topics of this Special Issue on “Convolutional Neural Networks for Visual Detection, Recognition and Segmentation in Images and Videos” explicitly include (but are not limited to) the following aspects:

  • convolutional neural networks;
  • evolutionary deep learning models;
  • big image and video datasets;
  • deep-learning-based image/video processing;
  • hybrid deep learning models;
  • parallel deep convolutional neural networks;
  • deep-learning-based object detection and recognition in images and videos;
  • deep-learning-based object tracking;
  • image and video segmentation;
  • deep CNN-based real-time video processing;
  • expert systems;
  • applications of convolutional autoencoder in images and videos;
  • applications of convolutional and generative adversarial networks.

Dr. Hüseyin Kusetogullari
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • convolutional neural networks
  • evolutionary deep learning models
  • big image and video datasets
  • deep-learning-based image/video processing
  • hybrid deep learning models
  • parallel deep convolutional neural networks
  • deep-learning-based object detection and recognition in images and videos
  • deep-learning-based object tracking
  • image and video segmentation
  • deep CNN-based real-time video processing
  • expert systems
  • applications of convolutional autoencoder in images and videos
  • applications of convolutional and generative adversarial networks

Published Papers (13 papers)


Research


20 pages, 5225 KiB  
Article
A Reactive Deep Learning-Based Model for Quality Assessment in Airport Video Surveillance Systems
by Wanting Liu, Ya Pan and Yong Fan
Electronics 2024, 13(4), 749; https://doi.org/10.3390/electronics13040749 - 13 Feb 2024
Viewed by 472
Abstract
Monitoring the correct operation of airport video surveillance systems is of great importance in terms of the image quality provided by the cameras. Performing this task using human resources is time-consuming and usually associated with a delay in diagnosis. For this reason, in this article, an automatic system for image quality assessment (IQA) in airport surveillance systems using deep learning techniques is presented. The proposed method monitors the video surveillance system based on the two goals of “quality assessment” and “anomaly detection in images”. This model uses a 3D convolutional neural network (CNN) for detecting anomalies such as jitter, occlusion, and malfunction in frame sequences. Also, the feature maps of this 3D CNN are concatenated with feature maps of a separate 2D CNN for image quality assessment. This combination can be useful in improving the concurrence of correlation coefficients for IQA. The performance of the proposed model was evaluated both in terms of quality assessment and anomaly detection. The results show that the proposed 3D CNN model could correctly detect anomalies in surveillance videos with an average accuracy of 96.48%, which is at least 3.39% higher than the compared methods. Also, the proposed hybrid CNN model could assess image quality with an average correlation of 0.9014, which proves the efficiency of the proposed method.
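
A minimal sketch of the fusion idea this abstract describes, i.e., concatenating 3D CNN clip features with 2D CNN frame features before a quality-score head, might look as follows; the layer sizes and regression head are assumptions, not the paper's architecture.

```python
# Sketch of fusing 3D-CNN (clip-level) and 2D-CNN (frame-level) feature maps
# for image quality assessment; channel sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FusionIQA(nn.Module):
    def __init__(self):
        super().__init__()
        # 3D branch over a short frame sequence (anomaly-oriented features).
        self.branch3d = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 8, 8)),
        )
        # 2D branch over the centre frame (quality-oriented features).
        self.branch2d = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        # Regression head on the concatenated feature maps -> quality score.
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8 * 8, 1))

    def forward(self, clip):                      # clip: (N, 3, T, H, W)
        f3d = self.branch3d(clip).squeeze(2)      # (N, 16, 8, 8)
        centre = clip[:, :, clip.shape[2] // 2]   # middle frame (N, 3, H, W)
        f2d = self.branch2d(centre)               # (N, 16, 8, 8)
        fused = torch.cat([f3d, f2d], dim=1)      # channel-wise concatenation
        return self.head(fused)                   # (N, 1) quality score
```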

16 pages, 20064 KiB  
Article
MFSNet: Enhancing Semantic Segmentation of Urban Scenes with a Multi-Scale Feature Shuffle Network
by Xiaohong Qian, Chente Shu, Wuyin Jin, Yunxiang Yu and Shengying Yang
Electronics 2024, 13(1), 12; https://doi.org/10.3390/electronics13010012 - 19 Dec 2023
Viewed by 574
Abstract
The complexity of urban scenes presents a challenge for semantic segmentation models. Existing models are constrained by factors such as the scale, color, and shape of urban objects, which limit their ability to achieve more accurate segmentation results. To address these limitations, this paper proposes a novel Multi-Scale Feature Shuffle Network (MFSNet), which is an improvement upon the existing Deeplabv3+ model. Specifically, MFSNet integrates a novel Pyramid Shuffle Module (PSM) to extract discriminative features and feature correlations, with the objective of improving the accuracy of classifying insignificant objects. Additionally, we propose an efficient feature aggregation module (EFAM) to effectively expand the receptive field and aggregate contextual information, which is integrated as a branch within the network architecture to mitigate the information loss resulting from downsampling operations. Moreover, in order to augment the precision of segmentation boundary delineation and object localization, we employ a progressive upsampling strategy for reinstating spatial information in the feature maps. The experimental results show that the proposed model achieves competitive performance, achieving 80.4% MIoU on the Pascal VOC 2012 dataset, 79.4% MIoU on the Cityscapes dataset, and 40.1% MIoU on the Coco-Stuff dataset.
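
Two of the abstract's building blocks can be illustrated compactly: a ShuffleNet-style channel shuffle (a plausible core of a "shuffle" module) and progressive ×2 upsampling. Both are sketches under assumptions; the paper's actual PSM and decoder may differ.

```python
# Channel shuffle interleaves feature channels across groups; progressive
# upsampling restores resolution in small steps rather than one large jump.
import torch
import torch.nn.functional as F

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups; C must be divisible by groups."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and channel axes
    return x.view(n, c, h, w)                 # flatten back, now interleaved

def progressive_upsample(x: torch.Tensor, steps: int = 2) -> torch.Tensor:
    """Apply repeated x2 bilinear upsampling to the feature map x."""
    for _ in range(steps):
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)
    return x
```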

15 pages, 4474 KiB  
Article
FCIHMRT: Feature Cross-Layer Interaction Hybrid Method Based on Res2Net and Transformer for Remote Sensing Scene Classification
by Yan Huo, Shuang Gang and Chao Guan
Electronics 2023, 12(20), 4362; https://doi.org/10.3390/electronics12204362 - 20 Oct 2023
Cited by 5 | Viewed by 903
Abstract
Scene classification is one of the areas of remote sensing image processing that is gaining much attention. Aiming to solve the problem of the limited precision of optical scene classification caused by complex spatial patterns, a high similarity between classes, and a high diversity of classes, a feature cross-layer interaction hybrid algorithm for optical remote sensing scene classification is proposed in this paper. Firstly, a number of features are extracted from two branches, a vision transformer branch and a Res2Net branch, to strengthen the feature extraction capability of the strategy. A novel interactive attention technique is proposed, with the goal of focusing on the strong correlation between the two-branch features, to fully use the complementary advantages of the feature information. The retrieved feature data are further refined and merged. The combined characteristics are then employed for classification. The experiments were conducted by using three open-source remote sensing datasets to validate the feasibility of the proposed method, which performed better in scene classification tasks than other methods.
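
The two-branch interactive attention could, for example, be realized as cross-attention in which each branch queries the other. The sketch below assumes token-shaped features of equal width and shared attention weights; it is not the paper's exact design, and the class count is illustrative.

```python
# Sketch of two-branch fusion via cross-attention between a vision
# transformer branch and a Res2Net branch; dimensions are assumptions.
import torch
import torch.nn as nn

class CrossBranchFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 30):
        super().__init__()  # num_classes is illustrative
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, vit_tokens, res2net_tokens):
        # vit_tokens, res2net_tokens: (N, L, dim) token-shaped features.
        # Each branch queries the other, emphasizing correlated features.
        a, _ = self.attn(vit_tokens, res2net_tokens, res2net_tokens)
        b, _ = self.attn(res2net_tokens, vit_tokens, vit_tokens)
        fused = torch.cat([a.mean(dim=1), b.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # (N, num_classes) scene logits
```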

15 pages, 19812 KiB  
Article
Ellipse Detection with Applications of Convolutional Neural Network in Industrial Images
by Kang Liu, Yonggang Lu, Rubing Bai, Kun Xu, Tao Peng, Yichun Tai and Zhijiang Zhang
Electronics 2023, 12(16), 3431; https://doi.org/10.3390/electronics12163431 - 14 Aug 2023
Viewed by 1387
Abstract
Ellipse detection has a very wide range of applications in the field of industrial production, especially in the geometric detection of metallurgical hinge pins. However, factors in industrial images, such as small object size and incomplete ellipses at the image boundary, bring challenges to ellipse detection that cannot be solved by existing methods. This paper proposes a method for ellipse detection in industrial images, which utilizes an extended proposal operation to prevent the loss of ellipse rotation angle features during ellipse regression. Moreover, a Gaussian angle distance conforming to the ellipse axioms is adopted and combined with smooth L1 loss as the ellipse regression loss function to enhance the prediction accuracy of the ellipse rotation angle. The effectiveness of the proposed method is demonstrated on the hinge pins dataset, with experimental results showing an AP* of 80.93% and indicating superior detection performance compared to other methods. It is thus suitable for engineering applications and can provide visual guidance for the precise measurement of ellipse-like mechanical parts.
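
The angle-aware regression loss can be approximated by a smooth-L1 term on the geometric parameters plus a periodic term for the rotation angle; the paper's Gaussian angle distance is more elaborate, so treat this as a sketch of the idea only.

```python
# Sketch of an angle-aware ellipse regression loss. An ellipse is invariant
# under theta -> theta + pi, so the angle term uses cos(2 * delta) to
# respect that periodicity. The weighting is an assumption.
import torch
import torch.nn.functional as F

def ellipse_loss(pred, target, angle_weight: float = 1.0):
    """pred/target: (N, 5) tensors of (cx, cy, a, b, theta), theta in radians."""
    geom = F.smooth_l1_loss(pred[:, :4], target[:, :4])  # centre and axes
    d_angle = 1.0 - torch.cos(2.0 * (pred[:, 4] - target[:, 4]))
    return geom + angle_weight * d_angle.mean()
```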

14 pages, 8302 KiB  
Article
FEFD-YOLOV5: A Helmet Detection Algorithm Combined with Feature Enhancement and Feature Denoising
by Yiduo Zhang, Yi Qiu and Huihui Bai
Electronics 2023, 12(13), 2902; https://doi.org/10.3390/electronics12132902 - 01 Jul 2023
Cited by 1 | Viewed by 1341
Abstract
In intelligent surveillance of construction sites, safety helmet detection is of great significance. However, due to the small size of safety helmets and the presence of high levels of noise in construction scenarios, existing detection methods often encounter issues related to insufficient accuracy and robustness. To address this challenge, this paper introduces a new safety helmet detection algorithm, FEFD-YOLOV5. The FEFD-YOLOV5 algorithm enhances detection performance by adding a shallow detection head specifically for small target detection and incorporating an SENet channel attention module to compress global spatial information, thus improving the model’s mean average precision (mAP) in corresponding scenarios. Additionally, a novel denoise module is introduced in this algorithm, ensuring the model maintains high accuracy and robustness under various noise conditions, thereby enhancing the model’s generalization capability to meet real-world scenario demands. Experimental results show that the proposed improved algorithm achieves a detection accuracy of 94.89% in noiseless environments, and still reaches 91.55% in high-noise environments, demonstrating superior detection efficacy compared to the original algorithm.
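
The SENet channel attention the abstract mentions is the standard squeeze-and-excitation block; a typical form is shown below, with the reduction ratio and its placement inside YOLOv5 left as assumptions.

```python
# Standard squeeze-and-excitation (SE) channel attention block: global
# average pooling "squeezes" spatial information, and a small MLP produces
# per-channel weights that rescale the feature map.
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global spatial info
        self.fc = nn.Sequential(              # excitation: channel weights
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                          # reweight channels
```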

14 pages, 1740 KiB  
Article
MFDANet: Multi-Scale Feature Dual-Stream Aggregation Network for Salient Object Detection
by Bin Ge, Jiajia Pei, Chenxing Xia and Taolin Wu
Electronics 2023, 12(13), 2880; https://doi.org/10.3390/electronics12132880 - 29 Jun 2023
Viewed by 662
Abstract
With the development of deep learning, significant improvements and optimizations have been made in salient object detection. However, many salient object detection methods have limitations, such as insufficient context information extraction, limited interaction modes for different level features, and potential information loss due to a single interaction mode. In order to solve the aforementioned issues, we proposed a dual-stream aggregation network based on multi-scale features, which consists of two main modules, namely a residual context information extraction (RCIE) module and a dense dual-stream aggregation (DDA) module. Firstly, the RCIE module was designed to fully extract context information by connecting features from different receptive fields via residual connections, where convolutional groups composed of asymmetric convolution and dilated convolution are used to extract features from different receptive fields. Secondly, the DDA module aimed to enhance the relationships between different level features by leveraging dense connections to obtain high-quality feature information. Finally, two interaction modes were used for dual-stream aggregation to generate saliency maps. Extensive experiments on 5 benchmark datasets show that the proposed model performs favorably against 15 state-of-the-art methods.
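
A residual context block built from asymmetric and dilated convolutions, in the spirit of the RCIE module, might be sketched as follows; the branch layout and channel counts are assumptions.

```python
# Residual context block sketch: an asymmetric 1x3 / 3x1 pair followed by a
# dilated 3x3 convolution widens the receptive field, and a residual
# connection preserves the original features.
import torch
import torch.nn as nn

class ResidualContextBlock(nn.Module):
    def __init__(self, c: int, dilation: int = 2):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)),  # asymmetric pair
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)),
            nn.Conv2d(c, c, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(c),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.branch(x))  # residual connection
```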

18 pages, 11512 KiB  
Article
BiGA-YOLO: A Lightweight Object Detection Network Based on YOLOv5 for Autonomous Driving
by Jun Liu, Qiqin Cai, Fumin Zou, Yintian Zhu, Lyuchao Liao and Feng Guo
Electronics 2023, 12(12), 2745; https://doi.org/10.3390/electronics12122745 - 20 Jun 2023
Cited by 4 | Viewed by 1845
Abstract
Object detection in autonomous driving scenarios has become a popular task in recent years. Due to the high-speed movement of vehicles and the complex changes in the surrounding environment, objects of different scales need to be detected, which places high demands on the performance of the network model. Additionally, different driving devices have varying performance capabilities, and a lightweight model is needed to ensure the stable operation of devices with limited computing power. To address these challenges, we propose a lightweight network called BiGA-YOLO based on YOLOv5. We design the Ghost-Hardswish Conv module to simplify the convolution operations and incorporate spatial coordinate information into feature maps using Coordinate Attention. We also replace the PANet structure with the BiFPN structure to enhance the expression ability of features through different weights during the process of fusing multi-scale feature maps. Finally, we conducted extensive experiments on the KITTI dataset, and our BiGA-YOLO achieved a mAP@0.5 of 92.2% and a mAP@0.5:0.95 of 68.3%. Compared to the baseline model YOLOv5, our proposed model achieved improvements of 1.9% and 4.7% in mAP@0.5 and mAP@0.5:0.95, respectively, while reducing the model size by 15.7% and the computational cost by 16%. The detection speed was also increased by 6.3 FPS. Through analysis and discussion of the experimental results, we demonstrate that our proposed model is superior, achieving a balance between detection accuracy, model size, and detection speed.
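
GhostConv, the core of the Ghost-Hardswish Conv module, generates half of its output channels with an ordinary convolution and the rest with a cheap depthwise operation on them. The sketch below follows the GhostNet recipe and assumes a Hardswish activation; the paper's exact module layout may differ.

```python
# GhostConv sketch (GhostNet-style): half the output channels come from a
# regular convolution, the rest from a cheap depthwise "ghost" operation.
# Assumes an even c_out so the two halves concatenate cleanly.
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 1, act=nn.Hardswish):
        super().__init__()
        c_hidden = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_hidden), act(),
        )
        self.cheap = nn.Sequential(  # depthwise conv produces ghost features
            nn.Conv2d(c_hidden, c_hidden, 5, padding=2,
                      groups=c_hidden, bias=False),
            nn.BatchNorm2d(c_hidden), act(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)  # (N, c_out, H, W)
```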

25 pages, 8029 KiB  
Article
Signature and Log-Signature for the Study of Empirical Distributions Generated with GANs
by J. de Curtò, I. de Zarzà, Gemma Roig and Carlos T. Calafate
Electronics 2023, 12(10), 2192; https://doi.org/10.3390/electronics12102192 - 11 May 2023
Cited by 1 | Viewed by 1450
Abstract
In this paper, we address the research gap in efficiently assessing Generative Adversarial Network (GAN) convergence and goodness of fit by introducing the application of the Signature Transform to measure similarity between image distributions. Specifically, we propose the novel use of Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) Signature, along with Log-Signature, as alternatives to existing methods such as Fréchet Inception Distance (FID) and Multi-Scale Structural Similarity Index Measure (MS-SSIM). Our approach offers advantages in terms of efficiency and effectiveness, providing a comprehensive understanding and extensive evaluations of GAN convergence and goodness of fit. Furthermore, we present innovative analytical measures based on statistics by means of Kruskal–Wallis to evaluate the goodness of fit of GAN sample distributions. Unlike existing GAN measures, which are based on deep neural networks and require extensive GPU computations, our approach significantly reduces computation time and is performed on the CPU while maintaining the same level of accuracy. Our results demonstrate the effectiveness of the proposed method in capturing the intrinsic structure of the generated samples, providing meaningful insights into GAN performance. Lastly, we evaluate our approach qualitatively using Principal Component Analysis (PCA) and adaptive t-Distributed Stochastic Neighbor Embedding (t-SNE) for data visualization, illustrating the plausibility of our method.
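
The Kruskal–Wallis side of the evaluation can be illustrated with SciPy. The sketch below compares a deliberately trivial per-image statistic between real and generated batches, whereas the paper applies the test to signature-based features.

```python
# Kruskal-Wallis comparison of real vs. generated image batches on the CPU.
# The per-image mean intensity used here is a placeholder statistic; the
# paper's method uses signature/log-signature features instead.
import numpy as np
from scipy.stats import kruskal

def distribution_gap(real: np.ndarray, fake: np.ndarray):
    """real/fake: (N, H, W, C) image batches -> (H statistic, p-value)."""
    real_stats = real.reshape(len(real), -1).mean(axis=1)
    fake_stats = fake.reshape(len(fake), -1).mean(axis=1)
    h, p = kruskal(real_stats, fake_stats)
    return h, p  # a small p-value suggests the distributions differ
```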

37 pages, 17322 KiB  
Article
Accelerating the Response of Self-Driving Control by Using Rapid Object Detection and Steering Angle Prediction
by Bao Rong Chang, Hsiu-Fen Tsai and Chia-Wei Hsieh
Electronics 2023, 12(10), 2161; https://doi.org/10.3390/electronics12102161 - 09 May 2023
Cited by 4 | Viewed by 1971
Abstract
A vision-based autonomous driving system can usually fuse information about object detection and steering angle prediction for safe self-driving through real-time recognition of the environment around the car. If an autonomous driving system cannot respond to driving control quickly and appropriately, it can cause severe car accidents. Therefore, this study introduced GhostConv to the YOLOv4-tiny model for rapid object detection, denoted LW-YOLOv4-tiny, and to the ResNet18 model for rapid steering angle prediction, denoted LW-ResNet18. As per the results, LW-YOLOv4-tiny can achieve the highest execution speed, 56.1 frames per second, and LW-ResNet18 can obtain the lowest mean-square prediction loss, 0.0683. Compared with other integrations, the proposed approach can achieve the best performance indicator, 2.4658, showing the fastest response to driving control in self-driving.

17 pages, 7533 KiB  
Article
DAAM-YOLOV5: A Helmet Detection Algorithm Combined with Dynamic Anchor Box and Attention Mechanism
by Weipeng Tai, Zhenzhen Wang, Wei Li, Jianfei Cheng and Xudong Hong
Electronics 2023, 12(9), 2094; https://doi.org/10.3390/electronics12092094 - 04 May 2023
Cited by 5 | Viewed by 1480
Abstract
Helmet recognition algorithms based on deep learning aim to enable unmanned full-time detection and record violations such as failure to wear a helmet. However, in actual scenarios, weather and human factors can be complicated, which poses challenges for safety helmet detection. Camera shaking and head occlusion are common issues that can lead to inaccurate results and low availability. To address these practical problems, this paper proposes a novel helmet detection algorithm called DAAM-YOLOv5. The DAAM-YOLOv5 algorithm enriches the diversity of datasets under different weather conditions, improving the mAP of the model in the corresponding scenarios, by using Mosaic-9 data augmentation. Additionally, this paper introduces a novel dynamic anchor box mechanism, K-DAFS, into the algorithm and enhances the generation speed of anchor boxes for occluded targets by using bidirectional feature fusion (BFF). Furthermore, by using an attention mechanism, this paper redistributes the weight of objects in a picture and appropriately reduces the model’s sensitivity to the edge information of occluded objects through pooling. This approach improves the model’s generalization ability, which aligns with practical application requirements. To evaluate the proposed algorithm, this paper adopts the region of interest (ROI) detection strategy and carries out experiments on specific, real datasets. Compared with traditional deep learning algorithms on the same datasets, our method effectively distinguishes helmet-wearing conditions even when head information is occluded and improves the detection speed of the model. Moreover, compared with the YOLOv5s algorithm, the proposed algorithm increases the mAP and FPS by 4.32% and 9 frames/s, respectively.
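
Mosaic-9 tiles nine training images into one composite. A bare-bones version (image compositing only, without the bounding-box remapping a real detector pipeline also needs) might look like this; the uniform resizing strategy is an assumption, and OpenCV is assumed available.

```python
# Mosaic-9 sketch: tile nine images into a 3x3 composite to diversify
# scene and weather context in one training sample.
import numpy as np
import cv2  # assumes OpenCV is installed, used only for resizing

def mosaic9(images, tile: int = 213) -> np.ndarray:
    """images: list of 9 HxWx3 uint8 arrays -> (3*tile, 3*tile, 3) mosaic."""
    canvas = np.zeros((3 * tile, 3 * tile, 3), dtype=np.uint8)
    for i, img in enumerate(images[:9]):
        r, c = divmod(i, 3)  # row/column of this tile in the 3x3 grid
        canvas[r * tile:(r + 1) * tile,
               c * tile:(c + 1) * tile] = cv2.resize(img, (tile, tile))
    return canvas
```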

19 pages, 5072 KiB  
Article
A Novel End-to-End Turkish Text-to-Speech (TTS) System via Deep Learning
by Saadin Oyucu
Electronics 2023, 12(8), 1900; https://doi.org/10.3390/electronics12081900 - 18 Apr 2023
Cited by 3 | Viewed by 17463
Abstract
Text-to-Speech (TTS) systems have made strides, but creating natural-sounding human voices remains challenging. Existing methods rely on noncomprehensive models with only one-layer nonlinear transformations, which are less effective for processing complex data such as speech, images, and video. To overcome this, deep learning (DL)-based solutions have been proposed for TTS but require a large amount of training data. Unfortunately, there is no available corpus for Turkish TTS, unlike English, which has ample resources. To address this, our study focused on developing a Turkish speech synthesis system using a DL approach. We obtained a large corpus from a male speaker and proposed a Tacotron 2 + HiFi-GAN structure for the TTS system. Real users rated the quality of the synthesized speech as 4.49 using the Mean Opinion Score (MOS). Additionally, MOS-Listening Quality Objective evaluated the speech quality objectively, obtaining a score of 4.32. The speech waveform inference time was determined by a real-time factor, with 1 s of speech data synthesized in 0.92 s. To the best of our knowledge, these findings represent the first documented deep learning and HiFi-GAN-based TTS system for Turkish.
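
The real-time factor quoted above is simply synthesis time divided by the duration of the produced audio; values below 1.0 are faster than real time.

```python
# Real-time factor (RTF) as reported in the abstract.
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

print(real_time_factor(0.92, 1.0))  # 0.92, matching the reported figure
```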

41 pages, 4004 KiB  
Article
Deep Crowd Anomaly Detection by Fusing Reconstruction and Prediction Networks
by Md. Haidar Sharif, Lei Jiao and Christian W. Omlin
Electronics 2023, 12(7), 1517; https://doi.org/10.3390/electronics12071517 - 23 Mar 2023
Cited by 4 | Viewed by 1554
Abstract
Abnormal event detection is one of the most challenging tasks in computer vision. Many existing deep anomaly detection models are based on reconstruction errors, where the training phase is performed using only videos of normal events and the model is then capable of estimating frame-level scores for an unknown input. It is assumed that the reconstruction error gap between normal and abnormal frames is high for abnormal events during the testing phase. Yet, this assumption may not always hold due to the superior capacity and generalization of deep neural networks. In this paper, we design a generalized framework (rpNet) for proposing a series of deep models by fusing several options of a reconstruction network (rNet) and a prediction network (pNet) to detect anomalies in videos efficiently. In the rNet, either a convolutional autoencoder (ConvAE) or a skip-connected ConvAE (AEc) can be used, whereas in the pNet, either a traditional U-Net, a non-local block U-Net, or an attention block U-Net (aUnet) can be applied. The fusion of both rNet and pNet increases the error gap. Our deep models have distinct degrees of feature extraction capability. One of our models (AEcaUnet), which combines an AEc with our proposed aUnet, confirms a better error gap and extracts the high-quality features needed for video anomaly detection. Experimental results on UCSD-Ped1, UCSD-Ped2, CUHK-Avenue, ShanghaiTech-Campus, and UMN datasets with rigorous statistical analysis show the effectiveness of our models.
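
Fusing the two error signals into a frame-level anomaly score could look like the following sketch, where r_net and p_net stand for any of the ConvAE/U-Net variants above and the equal weighting is an assumption.

```python
# Sketch of a fused frame-level anomaly score from reconstruction and
# prediction errors; higher scores indicate more anomalous frames.
import torch

@torch.no_grad()
def anomaly_score(frame, past_frames, r_net, p_net, alpha: float = 0.5):
    """frame: (1, C, H, W) current frame; past_frames: pNet's input."""
    rec_err = torch.mean((r_net(frame) - frame) ** 2)         # rNet error
    pred_err = torch.mean((p_net(past_frames) - frame) ** 2)  # pNet error
    return alpha * rec_err + (1.0 - alpha) * pred_err
```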

Review


49 pages, 4529 KiB  
Review
Deep-Learning-Based Approaches for Semantic Segmentation of Natural Scene Images: A Review
by Busra Emek Soylu, Mehmet Serdar Guzel, Gazi Erkan Bostanci, Fatih Ekinci, Tunc Asuroglu and Koray Acici
Electronics 2023, 12(12), 2730; https://doi.org/10.3390/electronics12122730 - 19 Jun 2023
Cited by 9 | Viewed by 5971
Abstract
The task of semantic segmentation holds a fundamental position in the field of computer vision. Assigning a semantic label to each pixel in an image is a challenging task. In recent times, significant advancements have been achieved in semantic segmentation through the application of deep-learning-based Convolutional Neural Network (CNN) techniques. This paper presents a comprehensive and structured analysis of approximately 150 CNN-based semantic segmentation methods from the last decade. Moreover, it examines 15 well-known datasets in the semantic segmentation field. These datasets consist of 2D and 3D images and video frames, including general, indoor, outdoor, and street scenes. Furthermore, this paper mentions several recent techniques, such as SAM and UDA, and common post-processing algorithms, such as CRF and MRF. Additionally, it analyzes the performance of the reviewed state-of-the-art methods, pioneering methods, common backbone networks, and popular datasets. These have been compared according to their results in Mean Intersection over Union (MIoU), the most popular evaluation metric of semantic segmentation. Finally, it discusses the main challenges and possible solutions and underlines some future research directions in the semantic segmentation task. We hope that our survey will provide useful background for readers who will work in this field.
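
MIoU, the metric the review uses for all comparisons, is computed from the class confusion matrix; a compact NumPy version is sketched below.

```python
# Mean Intersection over Union (MIoU) from a label-vs-prediction confusion
# matrix, averaged over classes that actually appear.
import numpy as np

def mean_iou(pred: np.ndarray, label: np.ndarray, num_classes: int) -> float:
    """pred/label: integer class maps of identical shape."""
    conf = np.bincount(num_classes * label.ravel() + pred.ravel(),
                       minlength=num_classes ** 2).reshape(num_classes, -1)
    inter = np.diag(conf)                      # per-class intersection
    union = conf.sum(0) + conf.sum(1) - inter  # per-class union
    ious = inter / np.maximum(union, 1)        # guard against empty classes
    return float(ious[union > 0].mean())       # average over present classes
```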
