Deep Learning Technologies for Machine Vision and Audition

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: closed (7 March 2021) | Viewed by 31376

Special Issue Editors

Guest Editor
Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
Interests: deep learning; computer vision; audio source separation; music information retrieval

Guest Editor
Jubilee Campus, University of Nottingham, Wollaton Road, Nottingham NG8 1BB, UK
Interests: deep learning; computer vision; face recognition

Special Issue Information

Dear Colleagues,

In recent years, we have witnessed extensive breakthroughs in the field of autonomous robotics. One key element of a successful robotic system is the exploitation of the visual and auditory information around the system in order to make decisions. Therefore, machine vision and audition are major tasks in most robotic systems. Humans, on the other hand, are very adept at handling and processing visual and auditory stimuli to perform a series of tasks such as object detection and identification. The key element in these tasks is the human brain, a complicated organ featuring billions of neurons and trillions of synapses (connections) between them. In recent years, due to the rise of parallel-processing hardware (e.g., graphics processing units (GPUs)), we have seen the emergence of deep neural network architectures that attempt to emulate the vastness and complexity of the human brain in order to match its performance. This is particularly evident in machine vision and audition applications, where deep learning techniques have improved on the performance of traditional shallow neural network architectures.

The aim of this Special Issue is to present and highlight the newest trends in deep learning for machine vision and audition applications. This may include but is not limited to:

  • Deep learning architectures;
  • Deep learning image and audio classification;
  • Deep learning object detection;
  • Deep learning semantic segmentation;
  • Deep learning image enhancement;
  • Deep learning music information retrieval tasks;
  • Deep learning audio-visual source separation;
  • Deep learning audio-visual enhancement;
  • Deep learning for audio-visual scene analysis;
  • Deep learning for audio-visual emotion recognition;
  • Deep learning for audio-visual face analysis.

Assoc. Prof. Nikolaos Mitianoudis
Assoc. Prof. Georgios Tzimiropoulos
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • image enhancement
  • object detection
  • image semantic segmentation
  • source separation
  • music information retrieval
  • audio enhancement
  • scene analysis
  • emotion recognition
  • face analysis

Published Papers (4 papers)

Research

19 pages, 6036 KiB  
Article
Facial Emotion Recognition Using Transfer Learning in the Deep CNN
by M. A. H. Akhand, Shuvendu Roy, Nazmul Siddique, Md Abdus Samad Kamal and Tetsuya Shimamura
Electronics 2021, 10(9), 1036; https://doi.org/10.3390/electronics10091036 - 27 Apr 2021
Cited by 139 | Viewed by 17485
Abstract
Human facial emotion recognition (FER) has attracted the attention of the research community for its promising applications. Mapping different facial expressions to the respective emotional states is the main task in FER. Classical FER consists of two major steps: feature extraction and emotion recognition. Currently, Deep Neural Networks, especially the Convolutional Neural Network (CNN), are widely used in FER by virtue of their inherent feature extraction mechanism from images. Several works have been reported on CNNs with only a few layers to resolve FER problems. However, standard shallow CNNs with straightforward learning schemes have limited feature extraction capability to capture emotion information from high-resolution images. A notable drawback of most existing methods is that they consider only frontal images (i.e., ignore profile views for convenience), although profile views taken from different angles are important for a practical FER system. To develop a highly accurate FER system, this study proposes very Deep CNN (DCNN) modeling through the Transfer Learning (TL) technique, where a pre-trained DCNN model is adopted by replacing its dense upper layer(s) with layer(s) compatible with FER, and the model is fine-tuned with facial emotion data. A novel pipeline strategy is introduced, where the training of the dense layer(s) is followed by tuning each of the pre-trained DCNN blocks successively, which leads to a gradual improvement in FER accuracy. The proposed FER system is verified on eight different pre-trained DCNN models (VGG-16, VGG-19, ResNet-18, ResNet-34, ResNet-50, ResNet-152, Inception-v3 and DenseNet-161) and the well-known KDEF and JAFFE facial image datasets. FER is very challenging even for frontal views alone. FER on the KDEF dataset poses further challenges due to the diversity of images with different profile views together with frontal views.
The proposed method achieved remarkable accuracy on both datasets with pre-trained models. Under 10-fold cross-validation, the best FER accuracies achieved with DenseNet-161 on the test sets of KDEF and JAFFE are 96.51% and 99.52%, respectively. The evaluation results reveal the superiority of the proposed FER system over the existing ones regarding emotion detection accuracy. Moreover, the achieved performance on the KDEF dataset with profile views is promising as it clearly demonstrates the required proficiency for real-life applications.
(This article belongs to the Special Issue Deep Learning Technologies for Machine Vision and Audition)
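
The pipeline strategy in the abstract above (train the new dense layers first, then tune the pre-trained blocks one after another) can be pictured as a staged schedule of trainable flags. The following is a minimal sketch, not the authors' implementation: the block names, the `dense_head` label, and the choice to unfreeze the block nearest the head first are all assumptions, since the abstract only says the blocks are tuned "successively".

```python
from typing import Dict, List


def unfreezing_schedule(blocks: List[str], head: str = "dense_head") -> List[Dict[str, bool]]:
    """Return one trainable-flag map per training stage.

    Stage 0 trains only the new dense head. Each later stage additionally
    unfreezes one pre-trained block, starting with the block nearest the
    head (an assumed order; the abstract does not specify it).
    """
    stages: List[Dict[str, bool]] = []
    unfrozen = set()
    # Stage 0: every pre-trained block is frozen, only the head is trainable.
    stages.append({**{b: False for b in blocks}, head: True})
    # Later stages: unfreeze one more block each time, last block first.
    for block in reversed(blocks):
        unfrozen.add(block)
        stages.append({**{b: (b in unfrozen) for b in blocks}, head: True})
    return stages


# Hypothetical block names, for illustration only.
schedule = unfreezing_schedule(["block1", "block2", "block3"])
for stage_index, flags in enumerate(schedule):
    trainable = [name for name, is_on in flags.items() if is_on]
    print(f"stage {stage_index}: train {trainable}")
```

In a real framework, each flag map would be applied to the corresponding layers' trainable setting before running that stage's fine-tuning epochs.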

13 pages, 4138 KiB  
Article
Audio-Based Aircraft Detection System for Safe RPAS BVLOS Operations
by Jorge Mariscal-Harana, Víctor Alarcón, Fidel González, Juan José Calvente, Francisco Javier Pérez-Grau, Antidio Viguria and Aníbal Ollero
Electronics 2020, 9(12), 2076; https://doi.org/10.3390/electronics9122076 - 05 Dec 2020
Cited by 6 | Viewed by 4032
Abstract
For the Remotely Piloted Aircraft Systems (RPAS) market to continue its current growth rate, cost-effective ‘Detect and Avoid’ systems that enable safe beyond visual line of sight (BVLOS) operations are critical. We propose an audio-based ‘Detect and Avoid’ system, composed of microphones and an embedded computer, which performs real-time inferences using a sound event detection (SED) deep learning model. Two state-of-the-art SED models, YAMNet and VGGish, are fine-tuned using our dataset of aircraft sounds and their performances are compared for a wide range of configurations. YAMNet, whose MobileNet architecture is designed for embedded applications, outperformed VGGish both in terms of aircraft detection and computational performance. YAMNet’s optimal configuration, with >70% true positive rate and precision, results from combining data augmentation and undersampling with the highest available inference frequency (i.e., 10 Hz). While our proposed ‘Detect and Avoid’ system already allows the detection of small aircraft from sound in real time, additional testing using multiple aircraft types is required. Finally, a larger training dataset, sensor fusion, or remote computations on cloud-based services could further improve system performance.
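
The two figures of merit quoted in the abstract above, true positive rate and precision, follow directly from counts of true positives, false negatives, and false positives. The sketch below shows that arithmetic for binary labels; the function name and the label values are made up for illustration and are not data from the paper.

```python
from typing import Sequence, Tuple


def tpr_and_precision(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Compute true positive rate (recall) and precision for binary labels (1 = aircraft)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # aircraft present and detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # aircraft present but missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tpr, precision


# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
tpr, precision = tpr_and_precision(y_true, y_pred)
print(f"TPR={tpr:.2f}, precision={precision:.2f}")
```

Reporting both values matters for a detect-and-avoid setting: a high true positive rate alone could hide many false alarms, and a high precision alone could hide missed aircraft.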

13 pages, 2628 KiB  
Article
A Learning Frequency-Aware Feature Siamese Network for Real-Time Visual Tracking
by Yuxiang Yang, Weiwei Xing, Shunli Zhang, Qi Yu, Xiaoyu Guo and Min Guo
Electronics 2020, 9(5), 854; https://doi.org/10.3390/electronics9050854 - 21 May 2020
Cited by 1 | Viewed by 2364
Abstract
Visual object tracking by Siamese networks has achieved favorable performance in accuracy and speed. However, the features used in Siamese networks have spatially redundant information, which increases computation and limits the discriminative ability of Siamese networks. Addressing this issue, we present a novel frequency-aware feature (FAF) method for robust visual object tracking in complex scenes. Unlike previous works, which select features from different channels or layers, the proposed method factorizes the feature map into multiple frequency components and reduces the low-frequency information that is spatially redundant. Reducing the low-frequency map’s resolution saves computation and also enlarges the receptive field of the layer, yielding more discriminative information. To further improve the performance of the FAF, we design an innovative data-independent augmentation for object tracking to improve the discriminative ability of the tracker, which enhances the linear representation among training samples through convex combinations of the images and tags. Finally, a joint judgment strategy that combines intersection-over-union (IoU) and classification scores is proposed to adjust the bounding box result and improve tracking accuracy. Extensive experiments on five challenging benchmarks demonstrate that our FAF method performs favorably against state-of-the-art (SOTA) tracking methods while running at around 45 frames per second.
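
To make the idea of combining IoU with classification scores concrete, the sketch below computes the IoU of two axis-aligned boxes and ranks candidate boxes by a weighted sum of the two scores. This is an illustration under stated assumptions, not the paper's joint judgment strategy: the weighted-sum form, the `weight` value, measuring IoU against the previous frame's box, and all box coordinates are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def joint_score(cls_score: float, overlap: float, weight: float = 0.5) -> float:
    """Hypothetical joint score: weighted sum of classification score and IoU."""
    return weight * cls_score + (1.0 - weight) * overlap


# Made-up example: previous-frame box and two candidates with classification scores.
previous: Box = (0.0, 0.0, 10.0, 10.0)
candidates: List[Tuple[Box, float]] = [
    ((5.0, 0.0, 15.0, 10.0), 0.9),  # higher class score, lower overlap
    ((1.0, 0.0, 11.0, 10.0), 0.8),  # lower class score, higher overlap
]
best_box, best_cls = max(candidates, key=lambda c: joint_score(c[1], iou(previous, c[0])))
print(best_box)
```

With these made-up numbers, the second candidate wins despite its lower classification score, which is the kind of correction a combined criterion is meant to provide.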

Review

33 pages, 5195 KiB  
Review
Deep Learning Algorithms for Single Image Super-Resolution: A Systematic Review
by Yoong Khang Ooi and Haidi Ibrahim
Electronics 2021, 10(7), 867; https://doi.org/10.3390/electronics10070867 - 06 Apr 2021
Cited by 32 | Viewed by 6583
Abstract
Image super-resolution has recently become an important technology, especially in the medical and industrial fields. As such, much effort has been devoted to developing image super-resolution algorithms. A recent approach is convolutional neural network (CNN) based algorithms. The super-resolution convolutional neural network (SRCNN) was the pioneer of CNN-based algorithms, and it has continued to be improved to this day through different techniques. These techniques include the type of loss function used, the upsampling module deployed, and the network design strategies adopted. In this paper, a total of 18 articles were selected through the PRISMA standard. A total of 19 algorithms were found in the selected articles and were reviewed. Several aspects are reviewed and compared, including the datasets used, loss functions used, evaluation metrics applied, upsampling modules deployed, and design techniques adopted. For each upsampling module and design technique, the respective advantages and disadvantages are also summarized.
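
The abstract above does not name the evaluation metrics it compares. As one illustration, peak signal-to-noise ratio (PSNR) is a metric commonly reported for super-resolution, defined as 10 log10(MAX^2 / MSE). The sketch below assumes PSNR as the example metric and uses tiny made-up grayscale images as nested lists; it is not taken from the review.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[Sequence[float]],
         estimate: Sequence[Sequence[float]],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    squared_errors = [
        (r - e) ** 2
        for row_ref, row_est in zip(reference, estimate)
        for r, e in zip(row_ref, row_est)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up 2x2 images for illustration only.
ground_truth = [[0.0, 255.0], [255.0, 0.0]]
reconstructed = [[0.0, 250.0], [255.0, 5.0]]
value = psnr(ground_truth, reconstructed)
print(f"PSNR = {value:.2f} dB")
```

Higher PSNR means the reconstruction is closer to the reference in a mean-squared-error sense, which does not always agree with perceived visual quality.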