Deep Learning Based Object Detection

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: closed (31 December 2020) | Viewed by 116021

Special Issue Editor


Dr. Youngbae Hwang
Guest Editor
Department of Intelligent Systems and Robotics, Chungbuk National University, Cheongju 28644, Republic of Korea
Interests: medical image processing; efficient deep learning; low-level processing

Special Issue Information

Dear Colleagues,

Object detection is one of the most important and challenging problems in computer vision and machine learning, and it has been extensively applied in areas such as video surveillance, autonomous vehicles, human–machine interaction, and medical image analysis. Recently, significant improvements have been achieved as a result of the rapid development of deep learning, especially convolutional neural networks (CNNs).

To evaluate deep learning-based object detection methods, various databases have been introduced, and many researchers have endeavored to improve the performance of their proposed methodologies on the target database. Mainstream benchmarks are based on general object detection datasets such as ImageNet, KITTI, and MS COCO. Even though deep methods have improved significantly over earlier shallow network-based methods on these well-known datasets, performance on unseen data from different environments or applications remains relatively low.

This Special Issue will cover the most recent technical advances in all aspects of deep learning-based object detection, including theoretical issues in deep learning, real-world applications, practical object detection systems, and newly designed databases. Submissions on both transfer learning and semi-supervised learning for deep networks are welcome, as are reviews and surveys of the state of the art in deep learning-based object detection. Topics of interest for this Special Issue include, but are not limited to, the following:

  • Image/video-based object detection using deep learning
  • Sensor fusion for object detection using deep learning
  • Transfer learning for object detection
  • Online learning for object detection
  • Active learning for object detection
  • Semi-supervised learning for object detection
  • Deep learning-based object detection for real-world applications
  • Object detection systems
  • New database for object detection
  • Survey for deep learning-based object detection

Dr. Youngbae Hwang
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (20 papers)


Research


28 pages, 18023 KiB  
Article
A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit
by Rafael Padilla, Wesley L. Passos, Thadeu L. B. Dias, Sergio L. Netto and Eduardo A. B. da Silva
Electronics 2021, 10(3), 279; https://doi.org/10.3390/electronics10030279 - 25 Jan 2021
Cited by 299 | Viewed by 21111
Abstract
Recent outstanding results of supervised object detection in competitions and challenges are often associated with specific metrics and datasets. The evaluation of such methods applied in different contexts has increased the demand for annotated datasets. Annotation tools represent the location and size of objects in distinct formats, leading to a lack of consensus on the representation. Such a scenario often complicates the comparison of object detection methods. This work alleviates this problem along the following lines: (i) it provides an overview of the most relevant evaluation methods used in object detection competitions, highlighting their peculiarities, differences, and advantages; (ii) it examines the most used annotation formats, showing how different implementations may influence the assessment results; and (iii) it provides a novel open-source toolkit supporting different annotation formats and 15 performance metrics, making it easy for researchers to evaluate the performance of their detection algorithms in most known datasets. In addition, this work proposes a new metric, also included in the toolkit, for evaluating object detection in videos that is based on the spatio-temporal overlap between the ground-truth and detected bounding boxes.
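
As a concrete illustration of the overlap computations such metrics build on, the following minimal sketch computes the standard IoU of two boxes and a naive spatio-temporal average over a pair of tracks; the (x1, y1, x2, y2) box format and the frame-indexed track representation are assumptions for illustration, not the toolkit's actual API.

```python
# Minimal IoU sketch plus a naive spatio-temporal variant for video tracks.
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def st_iou(track_a, track_b):
    """Spatio-temporal overlap of two tracks given as dicts frame -> box.
    Frames where only one track exists count as zero overlap."""
    frames = set(track_a) | set(track_b)
    overlaps = [iou(track_a[f], track_b[f]) if f in track_a and f in track_b
                else 0.0 for f in frames]
    return float(np.mean(overlaps))

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.1428...
```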

16 pages, 745 KiB  
Article
Layer-Wise Network Compression Using Gaussian Mixture Model
by Eunho Lee and Youngbae Hwang
Electronics 2021, 10(1), 72; https://doi.org/10.3390/electronics10010072 - 03 Jan 2021
Cited by 9 | Viewed by 2733
Abstract
Due to the large number of parameters and heavy computation, the real-time operation of deep learning on low-performance embedded boards is still difficult. Network pruning is an effective method to reduce the number of parameters without additional modification of the network structure. However, the conventional approach prunes redundant parameters at the same rate for all layers. This may cause a bottleneck problem that leads to performance degradation, because the optimal minimum number of parameters differs from layer to layer. We propose a layer-adaptive pruning method based on modeling the weight distribution. By applying a Gaussian mixture model (GMM), we can accurately measure the amount of weights close to zero. Layer selection and pruning are performed iteratively until the target compression rate is reached. The layer selection in each iteration considers the timing to reach the target compression rate and the degree of weight pruning. We apply the proposed network compression method to image classification and semantic segmentation to show its effectiveness. In the experiments, the proposed method shows a higher compression rate than previous methods while maintaining accuracy.
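
As an illustration of the idea, the sketch below fits a Gaussian mixture to a synthetic layer of weights and reads off the fraction assigned to the near-zero component as a layer-specific pruning rate; the component count and synthetic data are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: estimate the fraction of near-zero weights in one layer with a GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def near_zero_fraction(weights, n_components=3, seed=0):
    w = weights.reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(w)
    zero_comp = int(np.argmin(np.abs(gmm.means_.ravel())))  # component closest to 0
    return float(np.mean(gmm.predict(w) == zero_comp))

rng = np.random.default_rng(0)
layer = np.concatenate([rng.normal(0.0, 0.01, 8000),   # redundant, near-zero weights
                        rng.normal(0.5, 0.10, 1000),
                        rng.normal(-0.5, 0.10, 1000)])
rate = near_zero_fraction(layer)
mask = np.abs(layer) > np.quantile(np.abs(layer), rate)  # prune that fraction
print(f"suggested pruning rate: {rate:.2f}; kept {mask.sum()} of {mask.size} weights")
```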

15 pages, 4314 KiB  
Article
Object Detection Based on Center Point Proposals
by Hao Chen and Hong Zheng
Electronics 2020, 9(12), 2075; https://doi.org/10.3390/electronics9122075 - 05 Dec 2020
Cited by 1 | Viewed by 2782
Abstract
Anchor-based detectors are widely adopted in object detection. To improve the accuracy of object detection, multiple anchor boxes are densely placed on the input image, yet most of them are invalid. Although anchor-free methods can reduce the number of useless anchor boxes, the invalid ones still occupy a high proportion. On this basis, this paper proposes an object-detection method based on center point proposals to reduce the number of useless anchor boxes while improving the quality of anchor boxes, balancing the proportion of positive and negative samples. By introducing a differentiation module in the shallow layer, the new method can alleviate missed detections caused by overlapping center points. When trained and tested on the COCO (Common Objects in Context) dataset, this algorithm records an increase of about 2% in APS (average precision of small objects), reaching 27.8%. The detector designed in this study outperforms most state-of-the-art real-time detectors in the speed and accuracy trade-off, achieving an AP of 43.2 in 137 ms.
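
A minimal sketch of how center point proposals can be read out of a per-class heatmap, in the spirit of anchor-free detectors: keep points that are 3 × 3 local maxima above a score threshold. The window size and threshold are assumptions for illustration, not this paper's configuration.

```python
# Sketch: extract center-point proposals from a heatmap via local maxima.
import torch
import torch.nn.functional as F

def center_proposals(heatmap, threshold=0.3):
    """heatmap: (B, C, H, W) of per-class center scores in [0, 1]."""
    local_max = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    keep = (heatmap == local_max) & (heatmap > threshold)
    batch, cls, ys, xs = torch.nonzero(keep, as_tuple=True)
    scores = heatmap[batch, cls, ys, xs]
    return batch, cls, ys, xs, scores

hm = torch.zeros(1, 2, 8, 8)      # empty heatmap...
hm[0, 0, 4, 4] = 0.9              # ...with one synthetic object center
print(center_proposals(hm)[4])    # tensor([0.9000])
```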

16 pages, 4028 KiB  
Article
Content-Based Image Copy Detection Using Convolutional Neural Network
by Xiaolong Liu, Jinchao Liang, Zi-Yi Wang, Yi-Te Tsai, Chia-Chen Lin and Chih-Cheng Chen
Electronics 2020, 9(12), 2029; https://doi.org/10.3390/electronics9122029 - 01 Dec 2020
Cited by 8 | Viewed by 2835
Abstract
With the rapid development of network technology, concerns pertaining to the enhancement of security and protection against violations of digital images have become critical over the past decade. In this paper, an image copy detection scheme based on the Inception convolutional neural network (CNN) model in deep learning is proposed. The image dataset is transformed by a number of image processing manipulations, and the feature values in images are automatically extracted for learning and detecting the suspected unauthorized digital images. The experimental results show that the proposed scheme performs remarkably well in detecting duplicated images with rotation, scaling, and other content manipulations. Moreover, detecting duplicate images via a convolutional neural network model with different combinations of original and manipulated images can improve the accuracy and efficiency of image copy detection compared with existing schemes.
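
The sketch below illustrates the general recipe of CNN-feature-based copy detection: embed both images and compare them by cosine similarity. The tiny untrained encoder is only a structural stand-in for the paper's Inception model, and the similarity threshold is an assumption.

```python
# Sketch: flag an image pair as a copy when CNN descriptors are very similar.
import torch
import torch.nn as nn

encoder = nn.Sequential(                              # stand-in feature extractor
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),            # -> 32-dim descriptor
)

def is_copy(img_a, img_b, threshold=0.9):
    with torch.no_grad():
        fa, fb = encoder(img_a), encoder(img_b)
    return nn.functional.cosine_similarity(fa, fb).item() > threshold

original = torch.rand(1, 3, 64, 64)
flipped = torch.flip(original, dims=[3])  # a crude stand-in for a manipulation
print(is_copy(original, flipped))
```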

14 pages, 1175 KiB  
Article
Patient Monitoring by Abnormal Human Activity Recognition Based on CNN Architecture
by Malik Ali Gul, Muhammad Haroon Yousaf, Shah Nawaz, Zaka Ur Rehman and HyungWon Kim
Electronics 2020, 9(12), 1993; https://doi.org/10.3390/electronics9121993 - 24 Nov 2020
Cited by 38 | Viewed by 5330
Abstract
Human action recognition has emerged as a challenging research domain for video understanding and analysis, and extensive research has been conducted to improve the recognition of human actions. Human activity recognition has various real-time applications, such as patient monitoring, in which patients are monitored among a group of healthy people and identified based on their abnormal activities. Our goal is to perform multi-class abnormal action detection for individuals as well as groups from video sequences in order to differentiate multiple abnormal human actions. In this paper, the You Only Look Once (YOLO) network is utilized as the backbone CNN model. For training the CNN model, we constructed a large dataset of patient videos by labeling each frame with a set of patient actions and the patient's positions. We retrained the backbone CNN model with 23,040 labeled images of patient actions for 32 epochs. The proposed model assigns a confidence score and action label to each frame, and labels a video sequence by finding the recurrent action label. The present study shows that the accuracy of abnormal action recognition is 96.8%. Our proposed approach differentiated abnormal actions with an improved F1-score of 89.2%, which is higher than that of state-of-the-art techniques. The results indicate that the proposed framework can be beneficial to hospitals and elder care homes for patient monitoring.
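
A minimal sketch of the sequence-labeling step: per-frame labels are reduced to one action by choosing the most recurrent label, here weighted by per-frame confidence (the weighting is an assumption, not necessarily the paper's exact rule).

```python
# Sketch: reduce per-frame (label, confidence) predictions to one sequence label.
from collections import defaultdict

def sequence_action(frame_preds):
    """frame_preds: list of (action_label, confidence) for each frame."""
    votes = defaultdict(float)
    for label, conf in frame_preds:
        votes[label] += conf          # confidence-weighted vote per label
    return max(votes, key=votes.get)  # most recurrent (highest-voted) action

preds = [("falling", 0.91), ("falling", 0.85), ("walking", 0.40), ("falling", 0.88)]
print(sequence_action(preds))  # "falling"
```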

21 pages, 4349 KiB  
Article
Deep Learning Method for Selecting Effective Models and Feature Groups in Emotion Recognition Using an Asian Multimodal Database
by Jun-Ho Maeng, Dong-Hyun Kang and Deok-Hwan Kim
Electronics 2020, 9(12), 1988; https://doi.org/10.3390/electronics9121988 - 24 Nov 2020
Cited by 14 | Viewed by 3436
Abstract
Emotional awareness is vital for advanced interactions between humans and computer systems. This paper introduces a new multimodal dataset called MERTI-Apps based on Asian physiological signals and proposes a genetic algorithm (GA)-long short-term memory (LSTM) deep learning model to derive the active feature groups for emotion recognition. This study developed an annotation labeling program with which observers tag the emotions of subjects by their arousal and valence during dataset creation. In the learning phase, a GA was used to select effective LSTM model parameters and determine the active feature group from 37 features and 25 brain lateralization features extracted from the electroencephalogram (EEG) time, frequency, and time-frequency domains. The proposed model achieved a root-mean-square error (RMSE) of 0.0156 for valence regression on the MAHNOB-HCI dataset. On the in-house MERTI-Apps dataset, which uses Asian-population-specific 12-channel EEG data and adds a brain lateralization (BL) feature, it achieved RMSEs of 0.0579 and 0.0287 for valence and arousal regression, and accuracies of 65.7% and 88.3% for valence and arousal. The results also revealed 91.3% and 94.8% accuracy in the valence and arousal domains on the DEAP dataset, owing to the effective model selection of the GA.
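
As a hedged illustration of GA-based feature-group selection, the sketch below evolves binary masks over 62 feature groups with truncation selection, one-point crossover, and bit-flip mutation; the fitness function is a synthetic placeholder where the paper would use the validation error of an LSTM trained on the selected groups.

```python
# Sketch: genetic algorithm over binary feature-group masks.
import numpy as np

rng = np.random.default_rng(0)
N_GROUPS, POP, GENS = 62, 20, 30   # 37 features + 25 brain-lateralization features

def fitness(mask):                 # placeholder: in practice, -validation_rmse(mask)
    target = np.zeros(N_GROUPS); target[[1, 7, 20, 40]] = 1
    return -np.abs(mask - target).sum()

pop = rng.integers(0, 2, size=(POP, N_GROUPS))
for _ in range(GENS):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-POP // 2:]]          # truncation selection
    cut = rng.integers(1, N_GROUPS, size=POP // 2)
    children = np.array([np.r_[parents[i][:c], parents[(i + 1) % len(parents)][c:]]
                         for i, c in enumerate(cut)])      # one-point crossover
    flip = rng.random(children.shape) < 0.02               # bit-flip mutation
    children = np.where(flip, 1 - children, children)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected feature groups:", np.flatnonzero(best))
```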

17 pages, 5884 KiB  
Article
Optimization of Spiking Neural Networks Based on Binary Streamed Rate Coding
by Ali A. Al-Hamid and HyungWon Kim
Electronics 2020, 9(10), 1599; https://doi.org/10.3390/electronics9101599 - 29 Sep 2020
Cited by 3 | Viewed by 5066
Abstract
Spiking neural networks (SNNs) are increasingly attracting attention for their similarity to the biological neural system. Hardware implementation of spiking neural networks, however, remains a great challenge due to their excessive complexity and circuit size. This work introduces a novel optimization method for a hardware-friendly SNN architecture based on a modified rate coding scheme called Binary Streamed Rate Coding (BSRC). BSRC combines the features of both rate and temporal coding. In addition, by employing a built-in randomizer, the BSRC SNN model provides higher accuracy and faster training. We also present SNN optimization methods, including structure optimization and weight quantization. Extensive evaluations with MNIST SNNs demonstrate that the structure optimization of SNN (81-30-20-10) provides a 183.19-fold reduction in hardware compared with SNN (784-800-10), while providing an accuracy of 95.25%, a small loss compared with the 98.89% and 98.93% reported in previous works. Our weight quantization reduces 32-bit weights to 4-bit integers, leading to a further four-fold hardware reduction with only 0.56% accuracy loss. Overall, the SNN model (81-30-20-10) optimized by our method shrinks the SNN's circuit area from 3089.49 mm² for SNN (784-800-10) to 4.04 mm², a 765-fold reduction.
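
A minimal sketch of stochastic rate coding with a randomizer, which conveys the idea behind BSRC without reproducing its exact bit layout: an activation in [0, 1] becomes a fixed-length binary spike stream whose firing rate approximates the value.

```python
# Sketch: encode an activation as a binary spike stream and decode its rate.
import numpy as np

def encode(value, n_steps=16, seed=0):
    """value in [0, 1] -> binary spike train whose firing rate ~ value."""
    rng = np.random.default_rng(seed)            # the "built-in randomizer"
    return (rng.random(n_steps) < value).astype(np.uint8)

def decode(spikes):
    return spikes.mean()                         # firing rate recovers the value

spikes = encode(0.7, n_steps=16)
print(spikes, "->", decode(spikes))              # rate close to 0.7
```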

20 pages, 12781 KiB  
Article
FASSD: A Feature Fusion and Spatial Attention-Based Single Shot Detector for Small Object Detection
by Deng Jiang, Bei Sun, Shaojing Su, Zhen Zuo, Peng Wu and Xiaopeng Tan
Electronics 2020, 9(9), 1536; https://doi.org/10.3390/electronics9091536 - 19 Sep 2020
Cited by 13 | Viewed by 3880
Abstract
Deep learning methods have significantly improved object detection performance, but small object detection remains an extremely difficult and challenging task in computer vision. We propose a feature fusion and spatial attention-based single shot detector (FASSD) for small object detection. We fuse high-level semantic information into shallow layers to generate discriminative feature representations for small objects. To adaptively enhance the expression of small object areas and suppress the feature response of background regions, the spatial attention block learns a self-attention mask to enhance the original feature maps. We also establish a small object dataset (LAKE-BOAT) of a scene with a boat on a lake and test our algorithm on it to evaluate its performance. The results show that our FASSD achieves 79.3% mAP (mean average precision) on the PASCAL VOC2007 test with 300 × 300 input, which outperforms the original single shot multibox detector (SSD) by 1.6 points, as well as most improved algorithms based on SSD. The corresponding detection speed was 45.3 FPS (frames per second) on the VOC2007 test using a single NVIDIA TITAN RTX GPU. The test results of a simplified FASSD on the LAKE-BOAT dataset indicate that our model achieves an improvement of 3.5% mAP over the baseline network while maintaining a real-time detection speed (64.4 FPS).
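
A minimal sketch of a spatial attention block of the kind described above, assuming illustrative layer sizes rather than the FASSD configuration: a small convolutional head predicts a per-pixel mask in [0, 1] that re-weights the feature map, with a residual path so suppressed regions are not lost entirely.

```python
# Sketch: spatial attention mask that re-weights a feature map.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(),
            nn.Conv2d(channels // 4, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (B, C, H, W)
        attention = self.mask(x)           # (B, 1, H, W) in [0, 1]
        return x + x * attention           # enhance objects, keep a residual path

feat = torch.rand(2, 64, 38, 38)           # e.g., an SSD conv4_3-sized map
print(SpatialAttention(64)(feat).shape)    # torch.Size([2, 64, 38, 38])
```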

15 pages, 5037 KiB  
Article
Multimodel Deep Learning for Person Detection in Aerial Images
by Mirela Kundid Vasić and Vladan Papić
Electronics 2020, 9(9), 1459; https://doi.org/10.3390/electronics9091459 - 07 Sep 2020
Cited by 26 | Viewed by 3762
Abstract
In this paper, we propose a novel method for person detection in aerial images of nonurban terrain gathered by an Unmanned Aerial Vehicle (UAV), which plays an important role in Search And Rescue (SAR) missions. The UAV contributes significantly to SAR operations through its ability to survey a large geographical area from an aerial viewpoint. Because of the high recording altitude, the object of interest (a person) covers only a small part of an image (around 0.1%), which makes this task quite challenging. To address this problem, a multimodel deep learning approach is proposed. The solution consists of two different convolutional neural networks in the region proposal stage as well as in the classification stage. Additionally, contextual information is used in the classification stage to improve the detection results. Experimental results on the HERIDAL dataset achieved a precision of 68.89% and a recall of 94.65%, which is better than current state-of-the-art methods used for person detection in similar scenarios. Consequently, this approach is suitable for use as an auxiliary method in real SAR operations.

22 pages, 3553 KiB  
Article
FCC-Net: A Full-Coverage Collaborative Network for Weakly Supervised Remote Sensing Object Detection
by Suting Chen, Dongwei Shao, Xiao Shu, Chuang Zhang and Jun Wang
Electronics 2020, 9(9), 1356; https://doi.org/10.3390/electronics9091356 - 21 Aug 2020
Cited by 9 | Viewed by 2947
Abstract
With the ever-increasing resolution of optical remote-sensing images, how to extract information from these images efficiently and effectively has gradually become a challenging problem. As it is prohibitively expensive to label every object in these high-resolution images manually, only a small number of high-resolution images with detailed object labels are available, highly insufficient for common machine learning-based object detection algorithms. Another challenge is the huge range of object sizes: it is difficult to locate large objects, such as buildings, and small objects, such as vehicles, simultaneously. To tackle these problems, we propose a novel neural network-based remote sensing object detector called the full-coverage collaborative network (FCC-Net). The detector employs various tailored designs, such as hybrid dilated convolutions and multi-level pooling, to enhance multiscale feature extraction and improve its robustness in dealing with objects of different sizes. Moreover, by utilizing asynchronous iterative training alternating between strongly supervised and weakly supervised detectors, the proposed method requires only image-level ground truth labels for training. To evaluate the approach, we compare it against several state-of-the-art techniques on two large-scale remote-sensing image benchmark sets. The experimental results show that FCC-Net significantly outperforms other weakly supervised methods in detection accuracy. Through a comprehensive ablation study, we also demonstrate the efficacy of the proposed dilated convolutions and multi-level pooling in increasing the scale invariance of an object detector.
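
The sketch below shows a hybrid dilated convolution stack of the general kind mentioned above, with the commonly used dilation rates (1, 2, 5) that enlarge the receptive field while avoiding gridding artifacts; channel widths are assumptions, not the FCC-Net configuration.

```python
# Sketch: hybrid dilated convolution block with rates 1, 2, 5.
import torch
import torch.nn as nn

def hdc_block(channels, rates=(1, 2, 5)):
    layers = []
    for r in rates:
        # padding = dilation keeps the spatial size for a 3x3 kernel
        layers += [nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                   nn.BatchNorm2d(channels), nn.ReLU()]
    return nn.Sequential(*layers)

x = torch.rand(1, 32, 56, 56)
print(hdc_block(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```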

16 pages, 8623 KiB  
Article
Object Detection in Sonar Images
by Divas Karimanzira, Helge Renkewitz, David Shea and Jan Albiez
Electronics 2020, 9(7), 1180; https://doi.org/10.3390/electronics9071180 - 21 Jul 2020
Cited by 23 | Viewed by 8225
Abstract
The scope of the project described in this paper is the development of a generalized underwater object detection solution based on Automated Machine Learning (AutoML) principles. Multiple scales, dual priorities, speed, limited data, and class imbalance make object detection a very challenging task. In underwater object detection, further complications come into play due to acoustic image problems such as non-homogeneous resolution, non-uniform intensity, speckle noise, acoustic shadowing, acoustic reverberation, and multipath problems. Therefore, we focus on finding solutions to the problems along the underwater object detection pipeline. A pipeline for realizing a robust generic object detector is described and demonstrated on a case study of detecting an underwater docking station in sonar images. The system shows an overall detection and classification performance, in terms of average precision (AP), of 0.98392 on a test set of 5000 underwater sonar frames.

10 pages, 2227 KiB  
Article
Channel-Based Network for Fast Object Detection of 3D LiDAR
by SoonSub Kwon and TaeHyoung Park
Electronics 2020, 9(7), 1122; https://doi.org/10.3390/electronics9071122 - 10 Jul 2020
Cited by 7 | Viewed by 2402
Abstract
Currently, there are various LiDAR-based object detection networks. In this paper, we propose a channel-based object detection network that uses LiDAR channel information. The proposed method is a single-stage 2D convolutional network with data alignment processing stages. The network consists of a channel-internal convolutional network, a channel-external convolutional network, and a detection network. First, the channel-internal convolutional network splits the LiDAR data by channel to find features within each channel. Second, the channel-external convolutional network combines the per-channel LiDAR data to find features between channels. Finally, the detection network finds objects using the obtained features. We evaluate the proposed network using our 16-channel LiDAR and the popular KITTI dataset. The results confirm that the proposed method detects objects quickly while maintaining performance comparable to existing networks.
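
A minimal sketch of the channel-internal/channel-external split, assuming a 16-ring sensor and illustrative layer widths (not the paper's architecture): grouped 1D convolutions first process each ring independently, and an ungrouped convolution then mixes features across rings.

```python
# Sketch: per-ring (grouped) convolution followed by a cross-ring convolution.
import torch
import torch.nn as nn

rings, pts = 16, 512                      # 16 rings, 512 range returns per ring
internal = nn.Conv1d(rings, rings * 4, kernel_size=5, padding=2, groups=rings)
external = nn.Conv1d(rings * 4, 64, kernel_size=3, padding=1)  # mixes all rings

scan = torch.rand(1, rings, pts)          # per-ring range profile of one sweep
features = external(torch.relu(internal(scan)))
print(features.shape)                     # torch.Size([1, 64, 512])
```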

21 pages, 6747 KiB  
Article
Real-Time and Deep Learning Based Vehicle Detection and Classification Using Pixel-Wise Code Exposure Measurements
by Chiman Kwan, David Gribben, Bryan Chou, Bence Budavari, Jude Larkin, Akshay Rangamani, Trac Tran, Jack Zhang and Ralph Etienne-Cummings
Electronics 2020, 9(6), 1014; https://doi.org/10.3390/electronics9061014 - 18 Jun 2020
Cited by 25 | Viewed by 3849
Abstract
One key advantage of compressive sensing is that only a small amount of the raw video data is transmitted or saved. This is extremely important in bandwidth-constrained applications. Moreover, in some scenarios, the local processing device may not have enough processing power to handle object detection and classification, and hence the heavy-duty processing tasks need to be done at a remote location. Conventional compressive sensing schemes require the compressed data to be reconstructed before any subsequent processing can begin. This is not only time-consuming but may also lose important information in the process. In this paper, we present a real-time framework for processing compressive measurements directly, without any image reconstruction. A special type of compressive measurement known as pixel-wise coded exposure (PCE) is adopted in our framework. PCE condenses multiple frames into a single frame, and individual pixels can have different exposure times to allow high dynamic range. A deep learning tool known as You Only Look Once (YOLO) is used in our real-time system for object detection and classification. Extensive experiments showed that the proposed real-time framework is feasible and can achieve decent detection and classification performance.
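
A minimal sketch of pixel-wise coded exposure, under the simplifying assumption of random binary per-pixel exposure masks: T frames are condensed into one measurement, so only a single coded frame needs to be transmitted.

```python
# Sketch: condense T frames into one measurement with per-pixel exposure masks.
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 8, 64, 64
video = rng.random((T, H, W))                  # raw frames
mask = rng.random((T, H, W)) < 0.25            # each pixel exposed ~2 of 8 frames
exposure = np.maximum(mask.sum(axis=0), 1)     # per-pixel total exposure time
measurement = (video * mask).sum(axis=0) / exposure  # one condensed coded frame
print(measurement.shape)                       # (64, 64)
```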

15 pages, 6985 KiB  
Article
A Two-Branch Network for Weakly Supervised Object Localization
by Chang Sun, Yibo Ai, Sheng Wang and Weidong Zhang
Electronics 2020, 9(6), 955; https://doi.org/10.3390/electronics9060955 - 08 Jun 2020
Viewed by 2900
Abstract
Weakly supervised object localization (WSOL), which avoids costly instance-level annotations, has attracted intense interest in computer vision. As a hot research topic, a number of existing works concentrate on convolutional neural network (CNN)-based methods, which are powerful in extracting and representing features. The main challenge in CNN-based WSOL methods is to obtain features covering the entire target objects, not only the most discriminative object parts. To overcome this challenge and to improve the detection performance of feature-extraction-related WSOL methods, this paper presents a CNN-based two-branch model to locate objects. Our method contains two branches: a detection branch and a self-attention branch. During training, the two branches interact with each other by regarding the segmentation mask from the other branch as their own pseudo ground truth labels. Owing to the self-attention mechanism, our model is able to capture information from all object parts. Additionally, we embed multi-scale detection into our two-branch method to output two-scale features. We evaluated our two-branch network on the CUB-200-2011 and VOC2007 datasets. The pointing localization, intersection over union (IoU) localization, and correct localization precision (CorLoc) results demonstrate competitive performance with other state-of-the-art methods in WSOL.

21 pages, 6188 KiB  
Article
Evaluation of Robust Spatial Pyramid Pooling Based on Convolutional Neural Network for Traffic Sign Recognition System
by Christine Dewi, Rung-Ching Chen and Shao-Kuo Tai
Electronics 2020, 9(6), 889; https://doi.org/10.3390/electronics9060889 - 27 May 2020
Cited by 47 | Viewed by 7028
Abstract
Traffic sign recognition (TSR) is a noteworthy issue for real-world applications such as autonomous driving systems, as it plays a main role in guiding the driver. This paper focuses on Taiwan's prohibitory signs due to the lack of a database or recognition system for Taiwan's traffic signs. This paper investigates the state of the art of various object detection systems (Yolo V3, Resnet 50, Densenet, and Tiny Yolo V3) combined with spatial pyramid pooling (SPP). We adopt the concept of SPP to improve the backbone networks of Yolo V3, Resnet 50, Densenet, and Tiny Yolo V3 for feature extraction, and use spatial pyramid pooling to study multi-scale object features thoroughly. The observation and evaluation of the models include vital metrics, such as the mean average precision (mAP), workspace size, detection time, intersection over union (IoU), and the number of billion floating-point operations (BFLOPS). Our findings show that Yolo V3 SPP achieves the best total BFLOPS (65.69) and mAP (98.88%). Moreover, the highest average accuracy is achieved by Yolo V3 SPP at 99%, followed by Densenet SPP at 87%, Resnet 50 SPP at 70%, and Tiny Yolo V3 SPP at 50%. Hence, SPP improves the performance of all models in the experiment.
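
A minimal sketch of a YOLO-style SPP block: parallel max-pooling at several kernel sizes with stride 1 (padding preserves resolution), concatenated with the input to add multi-scale context. The kernel sizes (5, 9, 13) are the commonly used values, assumed here rather than taken from the paper.

```python
# Sketch: spatial pyramid pooling block that preserves spatial resolution.
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):
        # concatenate the input with its multi-scale pooled views
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

x = torch.rand(1, 512, 13, 13)
print(SPP()(x).shape)  # torch.Size([1, 2048, 13, 13]), i.e., 4x the channels
```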

23 pages, 4196 KiB  
Article
Combustion Instability Monitoring through Deep-Learning-Based Classification of Sequential High-Speed Flame Images
by Ouk Choi, Jongwun Choi, Namkeun Kim and Min Chul Lee
Electronics 2020, 9(5), 848; https://doi.org/10.3390/electronics9050848 - 20 May 2020
Cited by 20 | Viewed by 4300
Abstract
In this study, novel deep learning models based on high-speed flame images are proposed to diagnose the combustion instability of a gas turbine. Two different network layers that can be combined with any existing backbone network are established: (1) an early-fusion layer that can learn to extract the power spectral density of subsequent image frames, which is time-invariant under certain conditions, and (2) a late-fusion layer that combines the outputs of a backbone network at different time steps to predict the current combustion state. The performance of the proposed models is validated on a dataset of high-speed flame images obtained in a gas turbine combustor during the transient process from stable to unstable conditions and vice versa. Excellent performance is achieved for all test cases, with a high accuracy of 95.1-98.6% and a short processing time of 5.2-12.2 ms. Interestingly, simply increasing the number of input images is as competitive as combining the proposed early-fusion layer with a backbone network. In addition, using handcrafted weights for the late-fusion layer proves more effective than using learned weights. From the results, the best combination is the ResNet-18 model combined with our proposed fusion layers over 16 time steps. The proposed deep learning method is proven to be a potential tool for combustion instability identification and is also expected to be a promising tool for combustion instability prediction.
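
As a hedged illustration of the quantity the early-fusion layer is meant to learn, the sketch below computes the per-pixel power spectral density of a short frame sequence directly with an FFT along the time axis; since a temporal shift only changes the phase, the PSD is time-invariant for a stationary oscillation.

```python
# Sketch: per-pixel power spectral density of a frame sequence.
import numpy as np

frames = np.random.rand(16, 32, 32)                # (time, H, W) flame images
spectrum = np.fft.rfft(frames, axis=0)             # FFT over time, per pixel
psd = (np.abs(spectrum) ** 2) / frames.shape[0]    # power spectral density
print(psd.shape)  # (9, 32, 32): one power value per temporal frequency bin
```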

18 pages, 6130 KiB  
Article
Interactive Trimap Generation for Digital Matting Based on Single-Sample Learning
by Zhenpeng Chen, Yuanjie Zheng, Xiaojie Li, Rong Luo, Weikuan Jia, Jian Lian and Chengjiang Li
Electronics 2020, 9(4), 659; https://doi.org/10.3390/electronics9040659 - 17 Apr 2020
Cited by 3 | Viewed by 3408
Abstract
Image matting refers to the task of estimating the foreground of images, which is an important problem in image processing. Recently, trimap generation has attracted considerable attention because designing a trimap for every image is labor-intensive. In this paper, a two-step algorithm is proposed to generate trimaps. Users must only provide some clicks (foreground clicks and background clicks), which are employed as the input to generate a binary mask. Because one-shot learning techniques have achieved remarkable progress in semantic segmentation, we extend this technique to the binary mask prediction task. The mask is further used to predict the trimap using image dilation. Extensive experiments were performed to evaluate the proposed algorithm. Experimental results show that the trimaps generated using the proposed algorithm are visually similar to user-annotated ones. Compared with interactive matting algorithms, the proposed algorithm is less labor-intensive than trimap-based matting algorithms and achieves more accurate results than scribble-based matting algorithms.
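
A minimal sketch of the mask-to-trimap step via morphology, assuming a 10-pixel unknown band: the region between an eroded and a dilated copy of the binary mask is marked unknown.

```python
# Sketch: turn a binary mask into a trimap with a dilated/eroded unknown band.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def mask_to_trimap(mask, band=10):
    """mask: (H, W) bool -> trimap with 0 = bg, 128 = unknown, 255 = fg."""
    fg = binary_erosion(mask, iterations=band)      # confident foreground core
    maybe = binary_dilation(mask, iterations=band)  # foreground plus a margin
    trimap = np.zeros(mask.shape, dtype=np.uint8)
    trimap[maybe] = 128                             # unknown band (so far)
    trimap[fg] = 255                                # overwrite the core as fg
    return trimap

mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
print(np.unique(mask_to_trimap(mask)))  # [  0 128 255]
```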

11 pages, 1474 KiB  
Article
Feasibility Analysis of Deep Learning-Based Reality Assessment of Human Back-View Images
by Young Chan Kwon, Jae Won Jang, Hwasup Lim and Ouk Choi
Electronics 2020, 9(4), 656; https://doi.org/10.3390/electronics9040656 - 16 Apr 2020
Cited by 3 | Viewed by 2300
Abstract
Realistic personalized avatars can play an important role in social interactions in virtual reality, increasing body ownership, presence, and dominance. A simple way to obtain the texture of an avatar is to use a single front-view image of a human and to generate the hidden back-view image. The realism of the generated image is crucial to the overall texture quality, and subjective image quality assessment methods can play an important role in its evaluation. The subjective methods, however, require dozens of human assessors, a controlled environment, and time. This paper proposes a deep learning-based image reality assessment method, which is fully automatic and has a short testing time of about a quarter second per image. We train various discriminators to predict whether an image is real or generated. The trained discriminators are then used to give a mean opinion score for the reality of an image. Through experiments on human back-view images, we show that our learning-based mean opinion scores are close to their subjective counterparts in terms of the root mean square error between them.

11 pages, 6181 KiB  
Article
Object Detection Algorithm Based on Improved YOLOv3
by Liquan Zhao and Shuaiyang Li
Electronics 2020, 9(3), 537; https://doi.org/10.3390/electronics9030537 - 24 Mar 2020
Cited by 173 | Viewed by 20173
Abstract
The ‘You Only Look Once’ v3 (YOLOv3) method is among the most widely used deep learning-based object detection methods. It uses the k-means cluster method to estimate the initial width and height of the predicted bounding boxes. With this method, the estimated width and height are sensitive to the initial cluster centers, and the processing of large-scale datasets is time-consuming. To address these problems, a new cluster method for estimating the initial width and height of the predicted bounding boxes has been developed. Firstly, it randomly selects a pair of width and height values from the widths and heights of the ground truth boxes as one initial cluster center. Secondly, it constructs Markov chains based on the selected initial cluster and uses the final point of every Markov chain as another initial center. In constructing the Markov chains, the intersection-over-union method is used to compute the distance between the selected initial clusters and each candidate point, instead of the square-root method. Finally, the method continually updates the cluster centers with each new set of width and height values, which are only a part of the data selected from the datasets. Our simulation results show that the new method converges faster when initializing the width and height of the predicted bounding boxes and selects more representative initial widths and heights. Our proposed method achieves better performance than the YOLOv3 method in terms of recall, mean average precision, and F1-score.
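
For context, the sketch below implements the common baseline this method improves on: k-means over ground-truth box widths and heights with 1 - IoU as the distance (the paper's Markov-chain initialization and mini-batch updates are omitted for brevity).

```python
# Sketch: anchor clustering with 1 - IoU as the k-means distance.
import numpy as np

def iou_wh(boxes, centers):
    """IoU of (width, height) pairs, assuming co-centered boxes."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes.prod(1)[:, None] + centers.prod(1)[None, :] - inter
    return inter / union

def anchor_kmeans(boxes, k=9, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centers), axis=1)  # min (1 - IoU)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers

boxes = np.abs(np.random.default_rng(1).normal(64, 32, size=(500, 2))) + 1
print(anchor_kmeans(boxes, k=3))  # three representative (width, height) anchors
```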

Review


22 pages, 4782 KiB  
Review
A Survey on Deep Learning Based Methods and Datasets for Monocular 3D Object Detection
by Seong-heum Kim and Youngbae Hwang
Electronics 2021, 10(4), 517; https://doi.org/10.3390/electronics10040517 - 22 Feb 2021
Cited by 22 | Viewed by 5492
Abstract
Owing to recent advancements in deep learning methods and relevant databases, it is becoming increasingly easier to recognize 3D objects using only RGB images from single viewpoints. This study investigates the major breakthroughs and current progress in deep learning-based monocular 3D object detection. For relatively low-cost data acquisition systems without depth sensors or cameras at multiple viewpoints, we first consider existing databases with 2D RGB photos and their relevant attributes. Based on this simple sensor modality for practical applications, deep learning-based monocular 3D object detection methods that overcome significant research challenges are categorized and summarized. We present the key concepts and detailed descriptions of representative single-stage and multiple-stage detection solutions. In addition, we discuss the effectiveness of the detection models on their baseline benchmarks. Finally, we explore several directions for future research on monocular 3D object detection.
