Artificial Intelligence in Computer Vision: Methods and Applications

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: closed (30 June 2023) | Viewed by 61115

Special Issue Editors


Guest Editor
Department of Mechanical Engineering, The Catholic University of America, Washington, DC 20064, USA
Interests: optics; mechanics; robotics; computer vision

Guest Editor
Facebook Reality Labs Research, Sausalito, CA 94965, USA
Interests: computer vision; computational photography; machine learning

Guest Editor
Neuroimaging Research Branch, National Institute on Drug Abuse, National Institutes of Health, Baltimore, MD 21224, USA
Interests: computer vision; machine learning; deep learning; computer hardware; neuroimaging

Special Issue Information

Dear Colleagues,

Recent years have seen an explosion of interest in the research and development of artificial intelligence techniques. At the same time, computer vision methods have been enhanced and extended to encompass an astonishing number of novel sensors and measurement systems. As artificial intelligence spreads across almost all fields of science and engineering, computer vision remains one of its primary application areas. Notably, incorporating artificial intelligence into computer-vision-based sensing and measurement techniques has enabled unprecedented performance in tasks such as high-accuracy object detection, image segmentation, human pose estimation, and real-time 3D sensing, which cannot be achieved using conventional methods.

This Special Issue aims to cover recent advancements in computer vision that involve using artificial intelligence methods, with a particular interest in sensors and sensing. Both original research and review articles are welcome. Typical topics include but are not limited to the following:

  • Physical, chemical, biological, and healthcare sensors and sensing techniques with deep learning approaches;
  • Localization, mapping, and navigation techniques with artificial intelligence;
  • Artificial intelligence-based recognition of objects, scenes, actions, faces, gestures, expressions, and emotions, as well as object relations and interactions;
  • 3D imaging and sensing with deep learning schemes;
  • Accurate learning with simulation datasets or with a small number of training labels for sensors and sensing;
  • Supervised and unsupervised learning for sensors and sensing;
  • Broad computer vision methods and applications that involve using deep learning or artificial intelligence.

Prof. Dr. Zhaoyang Wang
Dr. Minh P. Vo
Dr. Hieu Nguyen
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • artificial intelligence
  • deep learning
  • computer vision
  • smart sensors
  • intelligent sensing
  • 3D imaging and sensing
  • localization and mapping
  • navigation and positioning

Published Papers (30 papers)


Research


17 pages, 4044 KiB  
Article
Underwater Fish Segmentation Algorithm Based on Improved PSPNet Network
by Yanling Han, Bowen Zheng, Xianghong Kong, Junjie Huang, Xiaotong Wang, Tianhong Ding and Jiaqi Chen
Sensors 2023, 23(19), 8072; https://doi.org/10.3390/s23198072 - 25 Sep 2023
Viewed by 919
Abstract
With the sustainable development of intelligent fisheries, accurate underwater fish segmentation is a key step toward intelligently obtaining fish morphology data. However, the blurred, distorted and low-contrast features of fish images in underwater scenes limit improvements in fish segmentation accuracy. To solve these problems, this paper proposes a method for underwater fish segmentation based on an improved PSPNet network (IST-PSPNet). First, in the feature extraction stage, to fully perceive features and context information at different scales, we propose an iterative attention feature fusion mechanism, which enables deep mining of fish features at different scales and full perception of context information. Then, a SoftPool pooling method based on fast exponentially weighted activation is used to reduce the numbers of parameters and computations while retaining more feature information, which improves segmentation accuracy and efficiency. Finally, a triplet attention (TA) module is added to the different-scale features in the pyramid pooling module so that, through cross-dimensional interaction, spatial attention can focus more on the specific positions of fish-body features in the channels and suppress the blurring and distortion caused by background interference in underwater scenes. A parameter-sharing strategy is used in this process so that features at different scales share the same learned weights, further reducing the numbers of parameters and calculations. The experimental results show that the method presented in this paper yielded better results on the DeepFish underwater fish image dataset than other methods, with an mIoU of 91.56%, 46.68 M parameters, and 40.27 GFLOPs. In the underwater fish segmentation task, the method improved the segmentation accuracy for fish that resemble the water background in color, are blurred, or are small, and it located fish edges more clearly. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
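
To make the exponentially weighted pooling mentioned in the abstract concrete, a minimal sketch is given below, assuming the commonly cited SoftPool formulation (each output is the exp-weighted average of its window); this is an illustration, not the authors' implementation.

    # Sketch of SoftPool-style pooling: each window's output is the average of its
    # activations weighted by their exponentials, so salient values dominate while
    # weaker ones are still retained (assumed formulation, not the paper's code).
    import torch
    import torch.nn.functional as F

    def soft_pool2d(x, kernel_size=2, stride=2):
        e_x = torch.exp(x)                                 # exponential weights
        num = F.avg_pool2d(x * e_x, kernel_size, stride)   # mean of weighted activations
        den = F.avg_pool2d(e_x, kernel_size, stride)       # mean of weights
        return num / (den + 1e-12)                         # window sizes cancel in the ratio

    feat = torch.randn(1, 64, 32, 32)
    pooled = soft_pool2d(feat)    # (1, 64, 16, 16): downsampled while keeping more detail than max pooling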

19 pages, 6384 KiB  
Article
Iterated Residual Graph Convolutional Neural Network for Personalized Three-Dimensional Reconstruction of Left Myocardium from Cardiac MR Images
by Xuchu Wang, Yue Yuan, Minghua Liu and Yanmin Niu
Sensors 2023, 23(17), 7430; https://doi.org/10.3390/s23177430 - 25 Aug 2023
Viewed by 859
Abstract
Three-dimensional reconstruction of the left myocardium is of great significance for the diagnosis and treatment of cardiac diseases. This paper proposes a personalized 3D reconstruction algorithm for the left myocardium using cardiac MR images by incorporating a residual graph convolutional neural network. The accuracy of the mesh, reconstructed using the model-based algorithm, is largely affected by the similarity between the target object and the average model. The initial triangular mesh is obtained directly from the segmentation result of the left myocardium. The mesh is then deformed using an iterated residual graph convolutional neural network. A vertex feature learning module is also built to assist the mesh deformation by adopting an encoder–decoder neural network to represent the skeleton of the left myocardium at different receptive fields. In this way, the shape and local relationships of the left myocardium are used to guide the mesh deformation. Qualitative and quantitative comparative experiments were conducted on cardiac MR images, and the results verified the rationale and competitiveness of the proposed method compared to related state-of-the-art approaches. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

18 pages, 9963 KiB  
Article
SSA Net: Small Scale-Aware Enhancement Network for Human Pose Estimation
by Shaohua Li, Haixiang Zhang, Hanjie Ma, Jie Feng and Mingfeng Jiang
Sensors 2023, 23(17), 7299; https://doi.org/10.3390/s23177299 - 22 Aug 2023
Viewed by 831
Abstract
In the field of human pose estimation, heatmap-based methods have emerged as the dominant approach, and numerous studies have achieved remarkable performance based on this technique. However, the inherent drawbacks of heatmaps lead to serious performance degradation for smaller-scale persons. While some researchers have attempted to tackle this issue by improving detection performance for small-scale persons, their efforts have been hampered by the continued reliance on heatmap-based methods. To address this issue, this paper proposes SSA Net, which aims to enhance the detection accuracy of small-scale persons as much as possible while maintaining a balanced perception of persons at other scales. SSA Net utilizes HRNetW48 as a feature extractor and leverages the TDAA module to enhance small-scale perception. Furthermore, it abandons heatmap-based methods and instead adopts coordinate vector regression to represent keypoints. Notably, SSA Net achieved an AP of 77.4% on the COCO Validation dataset, which is superior to other heatmap-based methods. Additionally, it achieved highly competitive results on the Tiny Validation and MPII datasets as well. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

22 pages, 21052 KiB  
Article
IRDC-Net: Lightweight Semantic Segmentation Network Based on Monocular Camera for Mobile Robot Navigation
by Thai-Viet Dang, Dinh-Manh-Cuong Tran and Phan Xuan Tan
Sensors 2023, 23(15), 6907; https://doi.org/10.3390/s23156907 - 03 Aug 2023
Cited by 1 | Viewed by 903
Abstract
Computer vision plays a significant role in mobile robot navigation due to the wealth of information extracted from digital images. Mobile robots localize and move to the intended destination based on the captured images. Due to the complexity of the environment, obstacle avoidance still requires a complex sensor system with high computational efficiency. This study offers a real-time solution to the problem of extracting corridor scenes from a single image using a lightweight semantic segmentation model integrated with a quantization technique to reduce the number of training parameters and the computational cost. The proposed model consists of MobileNetV2 as the encoder and an FCN as the decoder (with multi-scale fusion). This combination allows us to significantly minimize computation time while achieving high precision. Moreover, in this study, we also propose using the Balanced Cross-Entropy loss function to handle diverse datasets, especially those with class imbalances, and integrating a number of techniques, for example, the Adam optimizer and Gaussian filters, to enhance segmentation performance. The results demonstrate that our model outperforms baselines across different datasets. Moreover, when applied to practical experiments with a real mobile robot, the proposed model’s performance remains consistent, supporting optimal path planning and allowing the mobile robot to avoid obstacles efficiently and effectively. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
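
As a pointer for readers, one common form of a balanced cross-entropy loss for binary segmentation is sketched below, in which the rarer class receives the larger weight; the exact weighting used in IRDC-Net may differ.

    # Sketch of a balanced binary cross-entropy loss: class weights are derived
    # from class frequencies in the batch (assumed form, not the paper's exact loss).
    import torch

    def balanced_bce(logits, target, eps=1e-7):
        beta = 1.0 - target.float().mean()             # fraction of background pixels
        p = torch.sigmoid(logits).clamp(eps, 1 - eps)
        loss = -(beta * target * torch.log(p)
                 + (1.0 - beta) * (1.0 - target) * torch.log(1.0 - p))
        return loss.mean()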

15 pages, 2455 KiB  
Article
Fusing Self-Attention and CoordConv to Improve the YOLOv5s Algorithm for Infrared Weak Target Detection
by Xiangsuo Fan, Wentao Ding, Wenlin Qin, Dachuan Xiao, Lei Min and Haohao Yuan
Sensors 2023, 23(15), 6755; https://doi.org/10.3390/s23156755 - 28 Jul 2023
Cited by 1 | Viewed by 985
Abstract
Convolutional neural networks have achieved good results in target detection in many application scenarios, but they still face great challenges in scenarios with small target sizes and complex background environments. To solve the problem of the low accuracy of infrared weak target detection in complex scenes, and considering the real-time requirements of the detection task, we choose the YOLOv5s target detection algorithm for improvement. We add the Bottleneck Transformer structure and CoordConv to the network to optimize the model parameters and improve the performance of the detection network. Meanwhile, a two-dimensional Gaussian distribution is used to describe the importance of pixel points in the target box, and the normalized Gaussian Wasserstein distance (NWD) is used to measure the similarity between the predicted box and the ground-truth box in the loss function for weak targets, which makes the loss vary more smoothly with positional deviations and improves detection accuracy. Finally, experimental verification shows that, compared with other mainstream detection algorithms, the improved algorithm significantly improves target detection accuracy, with the mAP reaching 96.7%, which is 2.2 percentage points higher than YOLOv5s. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
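
For orientation, a minimal sketch of a normalized Gaussian Wasserstein distance between two boxes follows, using the commonly cited formulation in which each box is modeled as a 2D Gaussian; the normalizing constant and the way the term enters the loss are assumptions here, not the authors' exact design.

    # Sketch of a normalized Gaussian Wasserstein distance (NWD) between two boxes
    # (cx, cy, w, h), each treated as a 2D Gaussian N([cx, cy], diag(w^2/4, h^2/4)).
    # The normalizing constant C is dataset-dependent; 12.8 is only a placeholder.
    import math

    def nwd(box1, box2, C=12.8):
        (cx1, cy1, w1, h1), (cx2, cy2, w2, h2) = box1, box2
        w2_sq = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
                 + ((w1 - w2) / 2.0) ** 2 + ((h1 - h2) / 2.0) ** 2)
        return math.exp(-math.sqrt(w2_sq) / C)

    # A small-target regression loss could then be taken as 1 - nwd(pred_box, gt_box).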

18 pages, 890 KiB  
Article
A Lightweight Monocular 3D Face Reconstruction Method Based on Improved 3D Morphing Models
by Xingyi You, Yue Wang and Xiaohu Zhao
Sensors 2023, 23(15), 6713; https://doi.org/10.3390/s23156713 - 27 Jul 2023
Viewed by 1259
Abstract
In the past few years, 3D Morphing Model (3DMM)-based methods have achieved remarkable results in single-image 3D face reconstruction. High-fidelity 3D face texture generation has been achieved with these methods, which mostly rely on the power of deep convolutional neural networks during the parameter fitting process; however, this increases the number of network layers and the computational burden of the network model and reduces computational speed. Existing methods increase computational speed by using lightweight networks for parameter fitting, but at the expense of reconstruction accuracy. To solve these problems, we improve the 3D deformation model and propose an efficient and lightweight network model: Mobile-FaceRNet. First, we combine depthwise separable convolution and multi-scale representation methods to fit the parameters of the 3DMM; then, we introduce a residual attention module during network training to enhance the network’s attention to important features, guaranteeing high-fidelity facial texture reconstruction quality; and, finally, a new perceptual loss function is designed to better address smoothness and image similarity in the smoothing constraints. Experimental results show that the proposed method achieves high-precision reconstruction while remaining lightweight, and it is also more robust to factors such as pose and occlusion. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

21 pages, 4365 KiB  
Article
Liquid Content Detection In Transparent Containers: A Benchmark
by You Wu, Hengzhou Ye, Yaqing Yang, Zhaodong Wang and Shuiwang Li
Sensors 2023, 23(15), 6656; https://doi.org/10.3390/s23156656 - 25 Jul 2023
Cited by 4 | Viewed by 1249
Abstract
Various substances in liquid states, such as drinking water, various types of fuel, pharmaceuticals, and chemicals, are indispensable in our daily lives. There are numerous real-world applications for liquid content detection in transparent containers, for example, service robots, pouring robots, security checks, industrial observation systems, etc. However, the majority of existing methods concentrate either on transparent container detection or on liquid height estimation; the former provides very limited information for more advanced computer vision tasks, whereas the latter is too demanding to generalize to open-world applications. In this paper, we propose a dataset for detecting liquid content in transparent containers (LCDTC), which presents an innovative task involving transparent container detection and liquid content estimation. The primary objective of this task is to obtain more information beyond the location of the container by additionally providing certain liquid content information, which is easy to achieve with computer vision methods in various open-world applications. This task has potential applications in service robots, waste classification, security checks, and so on. The presented LCDTC dataset comprises 5916 images that have been extensively annotated with axis-aligned bounding boxes. We develop two baseline detectors, termed LCD-YOLOF and LCD-YOLOX, for the proposed dataset, based on two identity-preserved human posture detectors, i.e., IPH-YOLOF and IPH-YOLOX. By releasing LCDTC, we intend to stimulate more future work on the detection of liquid content in transparent containers and bring more focus to this challenging task. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

17 pages, 4667 KiB  
Article
MFAFNet: A Lightweight and Efficient Network with Multi-Level Feature Adaptive Fusion for Real-Time Semantic Segmentation
by Kai Lu, Jieren Cheng, Hua Li and Tianyu Ouyang
Sensors 2023, 23(14), 6382; https://doi.org/10.3390/s23146382 - 13 Jul 2023
Cited by 2 | Viewed by 1011
Abstract
Currently, real-time semantic segmentation networks are intensely demanded in resource-constrained practical applications, such as mobile devices, drones and autonomous driving systems. However, most of the current popular approaches have difficulty in obtaining sufficiently large receptive fields, and they sacrifice low-level details to improve inference speed, leading to decreased segmentation accuracy. In this paper, a lightweight and efficient multi-level feature adaptive fusion network (MFAFNet) is proposed to address this problem. Specifically, we design a separable asymmetric reinforcement non-bottleneck module, which designs a parallel structure to extract short- and long-range contextual information and use optimized convolution to increase the inference speed. In addition, we propose a feature adaptive fusion module that effectively balances feature maps with multiple resolutions to reduce the loss of spatial detail information. We evaluate our model with state-of-the-art real-time semantic segmentation methods on the Cityscapes and Camvid datasets. Without any pre-training and post-processing, our MFAFNet has only 1.27 M parameters, while achieving accuracies of 75.9% and 69.9% mean IoU with speeds of 60.1 and 82.6 FPS on the Cityscapes and Camvid test sets, respectively. The experimental results demonstrate that the proposed method achieves an excellent trade-off between inference speed, segmentation accuracy and model size. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

15 pages, 670 KiB  
Article
Multi-Modality Adaptive Feature Fusion Graph Convolutional Network for Skeleton-Based Action Recognition
by Haiping Zhang, Xinhao Zhang, Dongjin Yu, Liming Guan, Dongjing Wang, Fuxing Zhou and Wanjun Zhang
Sensors 2023, 23(12), 5414; https://doi.org/10.3390/s23125414 - 07 Jun 2023
Viewed by 1277
Abstract
Graph convolutional networks are widely used in skeleton-based action recognition because of their good fitting ability to non-Euclidean data. While conventional multi-scale temporal convolution uses several fixed-size convolution kernels or dilation rates at each layer of the network, we argue that different layers and datasets require different receptive fields. We use multi-scale adaptive convolution kernels and dilation rates to optimize traditional multi-scale temporal convolution with a simple and effective self-attention mechanism, allowing different network layers to adaptively select convolution kernels of different sizes and dilation rates instead of keeping them fixed. Moreover, the effective receptive field of the simple residual connection is not large, and there is a great deal of redundancy in deep residual networks, which leads to the loss of context when aggregating spatio-temporal information. This article introduces a feature fusion mechanism that replaces the residual connection between initial features and temporal module outputs, effectively solving the problems of context aggregation and initial feature fusion. We propose a multi-modality adaptive feature fusion framework (MMAFF) to simultaneously increase the receptive field in both the spatial and temporal dimensions. Concretely, we input the features extracted by the spatial module into the adaptive temporal fusion module to simultaneously extract multi-scale skeleton features in both the spatial and temporal dimensions. In addition, based on the current multi-stream approach, we use the limb stream to uniformly process correlated data from multiple modalities. Extensive experiments show that our model obtains results competitive with state-of-the-art methods on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

25 pages, 6219 KiB  
Article
Hardware–Software Partitioning for Real-Time Object Detection Using Dynamic Parameter Optimization
by Corneliu Zaharia, Vlad Popescu and Florin Sandu
Sensors 2023, 23(10), 4894; https://doi.org/10.3390/s23104894 - 19 May 2023
Viewed by 1334
Abstract
Implementations of computer vision algorithms, especially for real-time applications, are present in a variety of devices that we currently use (from smartphones or automotive applications to monitoring/security applications) and pose specific challenges, memory bandwidth or energy consumption (e.g., for mobility) being the most notable ones. This paper aims at providing a solution to improve the overall quality of real-time object detection computer vision algorithms using a hybrid hardware–software implementation. To this end, we explore methods for a proper allocation of algorithm components to hardware (as IP cores) and for the interfacing between hardware and software. Addressing specific design constraints, the relationship between the above components allows embedded artificial intelligence to select the operating hardware blocks (IP cores) in the configuration phase and to dynamically change the parameters of the aggregated hardware resources in the instantiation phase, similar to the concretization of a class into a software object. The conclusions show the benefits of using hybrid hardware–software implementations, as well as major gains from using IP cores managed by artificial intelligence, for an object detection use case implemented on an FPGA demonstrator built around a Xilinx Zynq-7000 SoC Mini-ITX sub-system. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

18 pages, 5042 KiB  
Article
A Multiscale Instance Segmentation Method Based on Cleaning Rubber Ball Images
by Erjie Su, Yongzhi Tian, Erjun Liang, Jiayu Wang and Yibo Zhang
Sensors 2023, 23(9), 4261; https://doi.org/10.3390/s23094261 - 25 Apr 2023
Viewed by 1219
Abstract
The identification of worn rubber balls in the rubber ball cleaning system of heat exchange equipment directly affects the descaling efficiency. To address the problems that rubber ball images contain impurities and bubbles and that real-time segmentation performance is low, a multi-scale feature fusion real-time instance segmentation model based on an attention mechanism is proposed for the segmentation of rubber ball images. First, we introduce the Pyramid Vision Transformer in place of the convolution module in the backbone network and use the transformer’s spatial-reduction attention layer to improve feature extraction across scales and to reduce computational cost. Second, we improve the feature fusion module to fuse image features across scales, combined with an attention mechanism to enhance the output feature representation. Third, the prediction head separates the mask branch; combined with dynamic convolution, this improves the accuracy of the mask coefficients and increases the number of upsampling layers. It also connects the penultimate layer with the second-layer feature map to achieve detection of smaller objects with larger feature maps, improving accuracy. Validated on the produced rubber ball dataset, the Dice score, Jaccard coefficient, and mAP of the regions segmented by this network are improved by 4.5%, 4.7%, and 7.73%, respectively, and our model achieves a segmentation speed of 33.6 fps and a segmentation accuracy of 79.3%. Meanwhile, the average precision of Box and Mask also meets the requirements under different IoU thresholds. We compared the DeepMask, Mask R-CNN, BlendMask, SOLOv1 and SOLOv2 instance segmentation networks with this model in terms of training accuracy and segmentation speed and obtained good results. The proposed modules can work together to better handle object details and achieve better segmentation performance. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

17 pages, 4842 KiB  
Article
FSVM: A Few-Shot Threat Detection Method for X-ray Security Images
by Cheng Fang, Jiayue Liu, Ping Han, Mingrui Chen and Dayu Liao
Sensors 2023, 23(8), 4069; https://doi.org/10.3390/s23084069 - 18 Apr 2023
Cited by 5 | Viewed by 1747
Abstract
In recent years, automatic detection of threats in X-ray baggage has become important in security inspection. However, the training of threat detectors often requires extensive, well-annotated images, which are hard to procure, especially for rare contraband items. In this paper, a few-shot SVM-constrained threat detection model, named FSVM, is proposed, which aims at detecting unseen contraband items with only a small number of labeled samples. Rather than simply fine-tuning the original model, FSVM embeds a differentiable SVM layer to back-propagate the supervised decision information into the former layers. A combined loss function utilizing the SVM loss is also created as an additional constraint. We have evaluated FSVM on the public security baggage dataset SIXray, performing experiments on 10-shot and 30-shot samples under three class divisions. Experimental results show that, compared with four common few-shot detection models, FSVM has the highest performance and is more suitable for complex distributed datasets (e.g., X-ray parcels). Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
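
To illustrate the general idea of back-propagating an SVM-style decision constraint, a minimal differentiable multi-class hinge loss is sketched below; FSVM's actual SVM layer and combined loss are more involved and are not reproduced here.

    # Sketch of a differentiable multi-class hinge (SVM-style) loss appended to a
    # feature extractor so its gradient flows back into earlier layers
    # (illustrative only; not the FSVM architecture itself).
    import torch

    def multiclass_hinge_loss(scores, labels, margin=1.0):
        # scores: (N, K) class scores; labels: (N,) ground-truth class indices.
        correct = scores.gather(1, labels.unsqueeze(1))                 # (N, 1) true-class scores
        margins = torch.clamp(scores - correct + margin, min=0.0)       # hinge term per class
        mask = torch.ones_like(margins)
        mask.scatter_(1, labels.unsqueeze(1), 0.0)                      # exclude the true class
        return (margins * mask).sum(dim=1).mean()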

18 pages, 18756 KiB  
Article
S-BIRD: A Novel Critical Multi-Class Imagery Dataset for Sewer Monitoring and Maintenance Systems
by Ravindra R. Patil, Mohamad Y. Mustafa, Rajnish Kaur Calay and Saniya M. Ansari
Sensors 2023, 23(6), 2966; https://doi.org/10.3390/s23062966 - 09 Mar 2023
Cited by 1 | Viewed by 1539
Abstract
Computer vision, in the context of automated and robotic systems, has emerged as a steady and robust platform for sewer maintenance and cleaning tasks. The AI revolution has enhanced the capability of computer vision, which is being used to detect problems with underground sewer pipes, such as blockages and damage. A large amount of appropriate, validated, and labeled imagery data is always a key requirement for training AI-based detection models to generate the desired outcomes. In this paper, a new imagery dataset, S-BIRD (Sewer-Blockages Imagery Recognition Dataset), is presented to draw attention to the predominant sewer blockage issue caused by grease, plastic and tree roots. The need for the S-BIRD dataset and various parameters such as its strength, performance, consistency and feasibility have been considered and analyzed for real-time detection tasks. The YOLOX object detection model has been trained to prove the consistency and viability of the S-BIRD dataset. The paper also specifies how the presented dataset will be used in an embedded vision-based robotic system to detect and remove sewer blockages in real time. The outcomes of a survey conducted in a typical mid-size city in a developing country, Pune, India, underscore the necessity of the presented work. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

23 pages, 6619 KiB  
Article
Behavior-Based Video Summarization System for Dog Health and Welfare Monitoring
by Othmane Atif, Jonguk Lee, Daihee Park and Yongwha Chung
Sensors 2023, 23(6), 2892; https://doi.org/10.3390/s23062892 - 07 Mar 2023
Cited by 1 | Viewed by 2532
Abstract
The popularity of dogs has been increasing owing to factors such as the physical and mental health benefits associated with raising them. While owners care about their dogs’ health and welfare, it is difficult for them to assess these, and frequent veterinary checkups represent a growing financial burden. In this study, we propose a behavior-based video summarization and visualization system for monitoring a dog’s behavioral patterns to help assess its health and welfare. The system proceeds in four modules: (1) a video data collection and preprocessing module; (2) an object detection-based module for retrieving image sequences where the dog is alone and cropping them to reduce background noise; (3) a dog behavior recognition module using two-stream EfficientNetV2 to extract appearance and motion features from the cropped images and their respective optical flow, followed by a long short-term memory (LSTM) model to recognize the dog’s behaviors; and (4) a summarization and visualization module to provide effective visual summaries of the dog’s location and behavior information to help assess and understand its health and welfare. The experimental results show that the system achieved an average F1 score of 0.955 for behavior recognition, with an execution time allowing real-time processing, while the summarization and visualization results demonstrate how the system can help owners assess and understand their dog’s health and welfare. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

21 pages, 22876 KiB  
Article
Image-Based Detection of Modifications in Assembled PCBs with Deep Convolutional Autoencoders
by Diulhio Candido de Oliveira, Bogdan Tomoyuki Nassu and Marco Aurelio Wehrmeister
Sensors 2023, 23(3), 1353; https://doi.org/10.3390/s23031353 - 25 Jan 2023
Cited by 1 | Viewed by 2052
Abstract
In this paper, we introduce a one-class learning approach for detecting modifications in assembled printed circuit boards (PCBs) based on photographs taken without tight control over perspective and illumination conditions. Anomaly detection and segmentation are essential for several applications, where collecting anomalous samples for supervised training is infeasible. Given the uncontrolled environment and the huge number of possible modifications, we address the problem as a case of anomaly detection, proposing an approach that is directed towards the characteristics of that scenario, while being well suited for other similar applications. We propose a loss function that can be used to train a deep convolutional autoencoder based only on images of the unmodified board—which allows overcoming the challenge of producing a representative set of samples containing anomalies for supervised learning. We also propose a function that explores higher-level features for comparing the input image and the reconstruction produced by the autoencoder, allowing the segmentation of structures and components that differ between them. Experiments performed on a dataset built to represent real-world situations (which we made publicly available) show that our approach outperforms other state-of-the-art approaches for anomaly segmentation in the considered scenario, while producing comparable results on a more general object anomaly detection task. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
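
To make the one-class idea concrete, a minimal reconstruction-based anomaly map is sketched below: an autoencoder trained only on unmodified boards reconstructs the input, and regions that differ are flagged. The feature-level comparison and the custom training loss described in the paper are not reproduced here.

    # Sketch of reconstruction-based anomaly localization with an autoencoder
    # trained only on defect-free images (illustrative; the paper additionally
    # compares higher-level features and uses its own loss function).
    import torch

    def anomaly_map(autoencoder, image, threshold=0.1):
        # image: (1, 3, H, W) normalized photo of the assembled PCB.
        with torch.no_grad():
            recon = autoencoder(image)                          # reconstruction of a "normal" board
        err = (image - recon).abs().mean(dim=1, keepdim=True)   # per-pixel reconstruction error
        return (err > threshold).float()                        # 1 where a modification is suspected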

15 pages, 31252 KiB  
Article
Subgrid Variational Optimized Optical Flow Estimation Algorithm for Image Velocimetry
by Haoxuan Xu, Jianping Wang, Ya Zhang, Guo Zhang and Zhaolong Xiong
Sensors 2023, 23(1), 437; https://doi.org/10.3390/s23010437 - 30 Dec 2022
Cited by 1 | Viewed by 1441
Abstract
The variational optical flow model is used in this work to investigate a subgrid-scale optimization approach for modeling complex fluid flows in image sequences and estimating their two-dimensional velocity fields. To address the lack of subgrid small-scale structure information in variational optical flow estimation, we incorporate the motion laws of incompressible fluids. Introducing the idea of large eddy simulation, the instantaneous motion in the data term is decomposed into large-scale motion and small-scale turbulence, and the Smagorinsky model is used to model and solve the small-scale turbulence. The improved subgrid-scale Horn–Schunck (SGS-HS) optical flow algorithm provides better results in velocity field estimation for turbulent image sequences than the traditional Farneback dense optical flow algorithm. To make the SGS-HS algorithm equally competent for open channel flow measurement, a velocity gradient constraint is chosen for the regularization term of the model, which improves the accuracy of the SGS-HS algorithm in velocimetry experiments where the flow direction of the open channel flow field is relatively uniform. The experimental results show that our algorithm performs better in open channel velocimetry than the conventional algorithm. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
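
For reference, the two standard ingredients that the abstract combines can be written in textbook notation (not necessarily the authors' exact formulation): the Horn–Schunck energy for the flow field (u, v), and the Smagorinsky eddy-viscosity closure for the unresolved subgrid scales.

    E(u, v) = \iint \left( I_x u + I_y v + I_t \right)^2 dx\,dy
              + \alpha \iint \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) dx\,dy,
    \qquad
    \nu_t = \left( C_s \Delta \right)^2 \lvert \bar{S} \rvert, \quad
    \lvert \bar{S} \rvert = \sqrt{2\, \bar{S}_{ij} \bar{S}_{ij}}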

19 pages, 1325 KiB  
Article
Calibration of a Stereoscopic Vision System in the Presence of Errors in Pitch Angle
by Jonatán Felipe, Marta Sigut and Leopoldo Acosta
Sensors 2023, 23(1), 212; https://doi.org/10.3390/s23010212 - 25 Dec 2022
Viewed by 1384
Abstract
This paper proposes a novel method for the calibration of a stereo camera system used to reconstruct 3D scenes. An error in the pitch angle of the cameras causes the reconstructed scene to exhibit some distortion with respect to the real scene. To carry out the calibration procedure, whose purpose is to eliminate or at least minimize this distortion, machine learning techniques have been used, more specifically, regression algorithms. These algorithms are trained with a large number of vectors of input features and their respective outputs, since, in view of the intended application of the proposed procedure, it is important that the training set be sufficiently representative of the variety that can occur in a real scene, which includes the different orientations that the pitch angle can take, the error in this angle, and the effect that all of this has on the reconstruction process. The most efficient regression algorithms for estimating the error in the pitch angle are derived from decision trees and certain neural network configurations. Once estimated, the error can be corrected, thus making the reconstructed scene more closely resemble the real one. Although the authors base their method on U-V disparity and employ this same technique to completely reconstruct the 3D scene, one of the most interesting features of the proposed method is that it can be applied regardless of the technique used to carry out the reconstruction. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

15 pages, 5176 KiB  
Article
Cascading Alignment for Unsupervised Domain-Adaptive DETR with Improved DeNoising Anchor Boxes
by Huantong Geng, Jun Jiang, Junye Shen and Mengmeng Hou
Sensors 2022, 22(24), 9629; https://doi.org/10.3390/s22249629 - 08 Dec 2022
Cited by 1 | Viewed by 1753
Abstract
Transformer-based object detection has recently attracted increasing interest and shown promising results. As one of the DETR-like models, DETR with improved denoising anchor boxes (DINO) produced superior performance on COCO val2017 and achieved a new state of the art. However, it often encounters challenges when applied to new scenarios where no annotated data are available and the imaging conditions differ significantly. To alleviate this problem of domain shift, in this paper, unsupervised domain-adaptive DINO via cascading alignment (CA-DINO) is proposed, which consists of attention-enhanced double discriminators (AEDD) and weak restraints on category-level tokens (WROT). Specifically, AEDD is used to aggregate and align the local–global context from the feature representations of both domains while reducing the domain discrepancy before entering the transformer encoder and decoder. WROT extends the Deep CORAL loss to adapt class tokens after embedding, minimizing the difference in second-order statistics between the source and target domains. Our approach is trained end to end, and experiments on two challenging benchmarks demonstrate the effectiveness of our method, which yields a 41% relative improvement over the baseline on the Foggy Cityscapes benchmark dataset in particular. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
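
A minimal sketch of the Deep CORAL loss that WROT extends is given below (covariance alignment between source- and target-domain features); how WROT applies it to class tokens is specific to the paper and not shown.

    # Sketch of the Deep CORAL loss: align second-order statistics (covariances)
    # of source and target features (standard form; WROT's extension to
    # category-level tokens is not reproduced here).
    import torch

    def coral_loss(source_feats, target_feats):
        # source_feats, target_feats: (N, d) feature matrices from the two domains.
        d = source_feats.size(1)
        cs = torch.cov(source_feats.T)          # (d, d) source covariance
        ct = torch.cov(target_feats.T)          # (d, d) target covariance
        return ((cs - ct) ** 2).sum() / (4.0 * d * d)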

12 pages, 3025 KiB  
Article
Compact Image-Style Transfer: Channel Pruning on the Single Training of a Network
by Minseong Kim and Hyun-Chul Choi
Sensors 2022, 22(21), 8427; https://doi.org/10.3390/s22218427 - 02 Nov 2022
Cited by 3 | Viewed by 1516
Abstract
Recent image-style transfer methods use the structure of a VGG feature network to encode and decode the feature map of the image. Since the network is designed for the general image-classification task, it has a large number of channels and, accordingly, requires a huge amount of memory and high computational power, which is not necessary for a relatively simple task such as image-style transfer. In this paper, we propose a new technique to shrink the previously used style transfer network, eliminating the redundancy of the VGG feature network in memory consumption and computational cost. Our method automatically finds a number of consistently inactive convolution channels during the network training phase by using two new losses, i.e., channel loss and xor loss. The former maximizes the number of inactive channels and the latter fixes the positions of these inactive channels to be the same across images. Our method improves the image generation speed by up to 49% and reduces the number of parameters by 20% while maintaining style transfer performance. Additionally, our losses are also effective in pruning the VGG16 classifier network, i.e., a parameter reduction of 26% and a top-1 accuracy improvement of 0.16% on CIFAR-10. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

13 pages, 1998 KiB  
Article
AI-Generated Face Image Identification with Different Color Space Channel Combinations
by Songwen Mo, Pei Lu and Xiaoyong Liu
Sensors 2022, 22(21), 8228; https://doi.org/10.3390/s22218228 - 27 Oct 2022
Cited by 3 | Viewed by 3137
Abstract
With the rapid development of the Internet and information technology (in particular, generative adversarial networks and deep learning), network data are exploding. Due to the misuse of technology and inadequate supervision, deep-network-generated face images flood the network; such forged images are called deepfakes. These realistic fake images pose a serious challenge to the human eye and to automatic identification systems, resulting in many legal, ethical, and social issues. To meet the needs of network information security, deep-network-generated face image identification based on different color spaces is proposed. Because deepfake images are extremely realistic, it is difficult for ordinary neural network methods to achieve high accuracy, so an image processing approach is used here. First, by analyzing how sensitive a deep learning network model is to different color space components of face images, a combination of color space components that can effectively improve the model’s discrimination rate is given. Second, to further improve the discriminative performance of the model, a channel attention mechanism was added at the shallow level of the model to focus further on the features contributing to the model. The experimental results show that this scheme achieved better accuracy than the two compared methods both within the same face generation model and across different face generation models, reaching up to 99.10% within the same face generation model. Meanwhile, the accuracy of this model only decreased to 98.71% when coping with a JPEG compression factor of 100, which shows that the model is robust. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
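
As a minimal illustration of assembling a classifier input from different color spaces, one might stack channels as sketched below; the particular channels chosen here are arbitrary placeholders, not the combination found effective in the paper.

    # Sketch of building a multi-color-space input for the classifier, assuming
    # OpenCV; the channels stacked below (Y and Cb from YCrCb, S from HSV) are
    # placeholders, not the combination reported in the paper.
    import cv2
    import numpy as np

    def multi_colorspace_input(bgr):
        ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
        hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
        # Stack selected channels into a 3-channel array fed to the network.
        return np.stack([ycrcb[..., 0], ycrcb[..., 2], hsv[..., 1]], axis=-1)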

15 pages, 4860 KiB  
Article
Automatic Meter Reading from UAV Inspection Photos in the Substation by Combining YOLOv5s and DeeplabV3+
by Guanghong Deng, Tongbin Huang, Baihao Lin, Hongkai Liu, Rui Yang and Wenlong Jing
Sensors 2022, 22(18), 7090; https://doi.org/10.3390/s22187090 - 19 Sep 2022
Cited by 10 | Viewed by 2683
Abstract
The combination of unmanned aerial vehicles (UAVs) and artificial intelligence is significant and a key topic in recent substation inspection applications, and meter reading is one of its challenging tasks. This paper proposes a method based on the combination of YOLOv5s object detection and DeepLabv3+ image segmentation to obtain meter readings through the post-processing of segmented images. First, YOLOv5s was introduced to detect the meter dial area and classify the meter. The detected and classified images were then passed to the image segmentation algorithm. The backbone network of the DeepLabv3+ algorithm was improved by using the MobileNetV2 network, and the model size was reduced while ensuring the effective extraction of tick marks and pointers. To account for inaccurate meter readings, the segmented pointer and scale areas were first eroded, and then the concentric circle sampling method was used to flatten the circular dial area into a rectangular area; the readings of several analog meters were calculated from the scale distances in the flattened area. The experimental results show that the mean average precision at 50 (mAP50) of the YOLOv5s model with this method on this dataset reached 99.58%, that the single detection speed reached 22.2 ms, and that the mean intersection over union (mIoU) of the image segmentation model reached 78.92%, 76.15%, 79.12%, 81.17%, and 75.73%, respectively. The single segmentation speed reached 35.1 ms. At the same time, the effects of various commonly used detection and segmentation algorithms on the recognition of meter readings were compared. The results show that the method in this paper significantly improves the accuracy and practicability of substation meter reading detection in complex situations. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
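
A minimal sketch of flattening a circular dial into a rectangular strip by polar unwarping is shown below as a stand-in for the concentric-circle sampling described in the abstract; the dial center and radius are assumed to come from the segmented dial region.

    # Sketch of unrolling a circular dial region into a rectangular strip with a
    # polar transform (a stand-in for concentric-circle sampling; center and
    # radius are assumed to be estimated from the segmentation result).
    import cv2

    def unwrap_dial(dial_img, center, radius, radial_bins=120, angular_bins=720):
        polar = cv2.warpPolar(dial_img, (radial_bins, angular_bins), center,
                              float(radius), cv2.WARP_POLAR_LINEAR)
        # In warpPolar's output, columns index radius and rows index angle;
        # transposing gives a wide strip whose x-axis follows the circular scale.
        return cv2.transpose(polar)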

17 pages, 4481 KiB  
Article
Automatic Segmentation of Standing Trees from Forest Images Based on Deep Learning
by Lijuan Shi, Guoying Wang, Lufeng Mo, Xiaomei Yi, Xiaoping Wu and Peng Wu
Sensors 2022, 22(17), 6663; https://doi.org/10.3390/s22176663 - 03 Sep 2022
Cited by 8 | Viewed by 1616
Abstract
Semantic segmentation of standing trees is important for obtaining factors of standing trees from images automatically and effectively. For the accurate segmentation of multiple standing trees in complex backgrounds, traditional methods have shortcomings such as low segmentation accuracy and the need for manual intervention. To achieve accurate segmentation of standing tree images effectively, SEMD, a lightweight network segmentation model based on deep learning, is proposed in this article. DeepLabV3+ is chosen as the base framework to perform multi-scale fusion of the convolutional features of the standing trees in images, so as to reduce the loss of image edge details and of feature information during standing tree segmentation. MobileNet, a lightweight network, is integrated into the backbone network to reduce the computational complexity. Furthermore, SENet, an attention mechanism, is added to obtain feature information efficiently and suppress the generation of useless feature information. The extensive experimental results show that, using the SEMD model, the mIoU of the semantic segmentation of standing tree images of different varieties and categories reaches 91.78% and 86.90% under simple and complex backgrounds, respectively. The lightweight network segmentation model SEMD based on deep learning proposed in this paper can solve the problem of multiple standing tree segmentation with high accuracy. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

14 pages, 4809 KiB  
Article
BRefine: Achieving High-Quality Instance Segmentation
by Jimin Yu, Xiankun Yang, Shangbo Zhou, Shougang Wang and Shangguo Hu
Sensors 2022, 22(17), 6499; https://doi.org/10.3390/s22176499 - 29 Aug 2022
Viewed by 1253
Abstract
Instance segmentation has been developing rapidly in recent years. Mask R-CNN, a two-stage instance segmentation approach, has demonstrated exceptional performance. However, the masks are still very coarse. The downsampling operations of the backbone network and the ROIAlign layer lose much detailed information, especially for large targets. The lower resolution causes a sawtooth effect at the mask edges, and the small proportion of boundary pixels leads to coarse segmentation. In this paper, we propose a new method called Boundary Refine (BRefine) that achieves high-quality segmentation. This approach uses an FCN as the foundation segmentation architecture and forms a multistage fusion mask head with multistage fusion detail features to improve mask resolution. However, the FCN architecture causes inconsistencies in multiscale segmentation. BRank-and-sort loss (BR and S loss) is proposed to solve the problems of segmentation inconsistency and the difficulty of boundary segmentation; it combines rank-and-sort loss with a boundary region loss. BRefine can handle hard-to-partition boundaries and output high-quality masks. On the COCO, LVIS, and Cityscapes datasets, BRefine outperformed Mask R-CNN by 3.0, 4.2, and 3.5 AP, respectively. Furthermore, on the COCO dataset, performance on large objects improved by 5.0 AP. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

13 pages, 2165 KiB  
Article
Deep Learning-Based 3D Measurements with Near-Infrared Fringe Projection
by Jinglei Wang, Yixuan Li, Yifan Ji, Jiaming Qian, Yuxuan Che, Chao Zuo, Qian Chen and Shijie Feng
Sensors 2022, 22(17), 6469; https://doi.org/10.3390/s22176469 - 27 Aug 2022
Cited by 5 | Viewed by 1892
Abstract
Fringe projection profilometry (FPP) is widely applied to 3D measurements, owing to its advantages of high accuracy, non-contact operation, and full-field scanning. Compared with most FPP systems that project visible patterns, invisible fringe patterns in the near-infrared spectrum have less impact on human eyes or on scenes where bright illumination must be avoided. However, the invisible patterns, which are generated by a near-infrared laser, are usually captured with severe speckle noise, resulting in 3D reconstructions of limited quality. To cope with this issue, we propose a deep learning-based framework that can remove the effect of the speckle noise and improve the precision of the 3D reconstruction. The framework consists of two deep neural networks, where one learns to produce a clean fringe pattern and the other to obtain an accurate phase from the pattern. Compared with traditional denoising methods that depend on complex physical models, the proposed learning-based method is much faster. The experimental results show that the measurement accuracy can be increased effectively by the presented method. Full article
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

18 pages, 4524 KiB  
Article
ULMR: An Unsupervised Learning Framework for Mismatch Removal
by Cailong Deng, Shiyu Chen, Yong Zhang, Qixin Zhang and Feiyan Chen
Sensors 2022, 22(16), 6110; https://doi.org/10.3390/s22166110 - 16 Aug 2022
Cited by 2 | Viewed by 1571
Abstract
Due to radiometric and geometric distortions between images, mismatches are inevitable, so a mismatch removal process is required to improve matching accuracy. Although deep learning methods have been proven to outperform handcrafted methods in specific scenarios, including image identification and point cloud classification, most learning methods are supervised; they are susceptible to incorrect labeling, and labeling data is time-consuming. This paper takes advantage of deep reinforcement learning (DRL) and proposes a framework named unsupervised learning for mismatch removal (ULMR). Resorting to DRL, ULMR first scores each state–action pair guided by the output of a classification network; it then calculates the policy gradient of the expected reward; finally, by maximizing the expected reward over state–action pairs, the optimal network is obtained. Compared to supervised learning methods (e.g., NM-Net and LFGC), unsupervised learning methods (e.g., ULCM), and handcrafted methods (e.g., RANSAC and GMS), ULMR obtains higher precision, more remaining correct matches, and fewer remaining false matches in testing experiments. Moreover, ULMR shows greater stability, better accuracy, and higher quality in application experiments, and ablation experiments demonstrate reduced sampling times and good compatibility with other classification networks, indicating its great potential for further use.
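A minimal REINFORCE-style sketch of the policy-gradient idea in this abstract, under stated assumptions (not the authors' code): a small network scores each putative correspondence, keep/reject actions are sampled, and the network is updated with the policy gradient of an expected reward. The reward function here is a toy stand-in; an unsupervised choice such as the negative model-fitting residual of the kept matches is one possibility.

```python
# Policy-gradient sketch for mismatch removal (illustrative; reward_fn is a placeholder).
import torch
import torch.nn as nn

class MatchScorer(nn.Module):
    """Maps each 4-D correspondence (x1, y1, x2, y2) to a keep-probability."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, corr):                       # corr: (N, 4)
        return torch.sigmoid(self.mlp(corr)).squeeze(-1)

def reward_fn(corr, keep):
    """Toy unsupervised reward: negative mean geometric residual of the kept matches."""
    if keep.sum() == 0:
        return torch.tensor(0.0)
    residual = (corr[:, :2] - corr[:, 2:]).norm(dim=1)
    return -(residual * keep).sum() / keep.sum()

scorer = MatchScorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

corr = torch.rand(256, 4)                          # putative matches
probs = scorer(corr)
dist = torch.distributions.Bernoulli(probs)
actions = dist.sample()                            # 1 = keep, 0 = reject
reward = reward_fn(corr, actions)
loss = -(reward.detach() * dist.log_prob(actions)).mean()   # policy-gradient surrogate
opt.zero_grad()
loss.backward()
opt.step()
```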
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

15 pages, 2634 KiB  
Article
Fair Facial Attribute Classification via Causal Graph-Based Attribute Translation
by Sunghun Kang, Gwangsu Kim and Chang D. Yoo
Sensors 2022, 22(14), 5271; https://doi.org/10.3390/s22145271 - 14 Jul 2022
Cited by 2 | Viewed by 1625
Abstract
Recent studies have raised concerns regarding racial and gender disparities in facial attribute classification performance. Because these attributes are directly and indirectly correlated with the sensitive attribute in a complex manner, simple disparate treatment is ineffective in reducing performance disparity. This paper focuses on achieving counterfactual fairness for facial attribute classification. Each labeled input image is used to generate two synthetic replicas: one under factual assumptions about the sensitive attribute and one under counterfactual assumptions. The proposed causal graph-based attribute translation generates realistic counterfactual images that account for the complicated causal relationships among the attributes using an encoder–decoder framework. A causal graph represents the complex relationships among the attributes and is used to sample factual and counterfactual facial attributes for a given face image. The encoder–decoder architecture translates the given facial image to carry the sampled factual or counterfactual attributes while preserving its identity. The attribute classifier is then trained for fair prediction with a counterfactual regularization between factual and corresponding counterfactual translated images. Extensive experimental results on the CelebA dataset demonstrate the effectiveness and interpretability of the proposed learning method for classifying multiple face attributes.
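A minimal sketch of the counterfactual regularization idea described above, under illustrative assumptions (not the paper's implementation): the classifier sees a factual translation and a counterfactual translation of the same face and is penalized when its attribute predictions differ. The backbone, the image sizes, and the weight `lambda_cf` are placeholders; the translator network is assumed to exist upstream.

```python
# Counterfactual-consistency training objective (sketch, assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 40))  # 40 CelebA attributes

def fairness_loss(x_factual, x_counterfactual, labels, lambda_cf: float = 1.0):
    logits_f = classifier(x_factual)
    logits_cf = classifier(x_counterfactual)
    task = F.binary_cross_entropy_with_logits(logits_f, labels)         # attribute prediction
    consistency = F.mse_loss(torch.sigmoid(logits_f), torch.sigmoid(logits_cf))
    return task + lambda_cf * consistency                               # penalize disagreement

x_f = torch.rand(8, 3, 64, 64)           # factual translations of a batch of faces
x_cf = torch.rand(8, 3, 64, 64)          # counterfactual translations (sensitive attribute flipped)
y = (torch.rand(8, 40) > 0.5).float()
print(fairness_loss(x_f, x_cf, y))
```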
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

16 pages, 5107 KiB  
Article
Adaptive Aggregate Stereo Matching Network with Depth Map Super-Resolution
by Botao Liu, Kai Chen, Sheng-Lung Peng and Ming Zhao
Sensors 2022, 22(12), 4548; https://doi.org/10.3390/s22124548 - 16 Jun 2022
Cited by 2 | Viewed by 1757
Abstract
To avoid direct depth reconstruction from the original image pair and to improve the accuracy of the results, we propose a coarse-to-fine stereo matching network combining multi-level residual optimization and depth map super-resolution (ASR-Net). First, we used a U-Net feature extractor to obtain a multi-scale feature pair. Second, we reconstructed the global disparity at the lowest resolution. Then, we regressed the residual disparity using the higher-resolution feature pair. Finally, the lowest-resolution depth map was refined with the disparity residual. In addition, we introduced deformable convolution and a group-wise cost volume into the network to achieve adaptive cost aggregation, and the network uses ABPN instead of the traditional interpolation method. The network was evaluated on three datasets, Scene Flow, KITTI 2015, and KITTI 2012, and the experimental results showed that our method is both fast and accurate. On the KITTI 2015 dataset, the three-pixel error converged to 2.86%, and the speed was about six times that of GC-Net and two times that of GWC-Net.
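A minimal coarse-to-fine sketch of the residual-refinement idea in this abstract, under illustrative assumptions (not ASR-Net itself): a full disparity map is regressed at the lowest resolution, then at each finer scale it is upsampled, rescaled, and corrected by a residual predicted from the corresponding feature pair. The residual predictor below is a placeholder convolution over concatenated left/right features and the upsampled disparity.

```python
# Coarse-to-fine disparity refinement sketch (placeholder residual predictor, illustrative shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualPredictor(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(2 * feat_ch + 1, 32, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(32, 1, 3, padding=1))
    def forward(self, left_feat, right_feat, disp_up):
        return self.net(torch.cat([left_feat, right_feat, disp_up], dim=1))

def coarse_to_fine(disp_coarse, feature_pairs, predictors):
    """feature_pairs: list of (left, right) feature maps from fine to coarse (coarsest excluded)."""
    disp = disp_coarse
    for (lf, rf), pred in zip(reversed(feature_pairs), list(reversed(predictors))):
        # upsample to the next scale, rescale disparity values, then add the regressed residual
        disp = F.interpolate(disp, size=lf.shape[-2:], mode="bilinear", align_corners=False) * 2
        disp = disp + pred(lf, rf, disp)
    return disp

feats = [(torch.rand(1, 16, 64, 128), torch.rand(1, 16, 64, 128)),
         (torch.rand(1, 16, 32, 64), torch.rand(1, 16, 32, 64))]
preds = [ResidualPredictor(16), ResidualPredictor(16)]
disp0 = torch.rand(1, 1, 16, 32)                   # coarse disparity at the lowest resolution
print(coarse_to_fine(disp0, feats, preds).shape)   # torch.Size([1, 1, 64, 128])
```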
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

17 pages, 69838 KiB  
Article
RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers
by Hatem Ibrahem, Ahmed Salem and Hyun-Soo Kang
Sensors 2022, 22(10), 3849; https://doi.org/10.3390/s22103849 - 19 May 2022
Cited by 4 | Viewed by 3765
Abstract
The latest research in computer vision has highlighted the effectiveness of vision transformers (ViTs) in performing several computer vision tasks; they can efficiently understand and process the image globally, unlike convolutions, which process the image locally. ViTs outperform convolutional neural networks in terms of accuracy in many computer vision tasks, but their speed remains an issue due to the extensive use of transformer layers, which contain many fully connected layers. Therefore, we propose a real-time ViT-based monocular depth estimation method (depth estimation from a single RGB image) with encoder–decoder architectures for indoor and outdoor scenes. The main architecture of the proposed method consists of a vision transformer encoder and a convolutional neural network decoder. We started by training the base vision transformer (ViT-b16) with 12 transformer layers and then reduced the transformer layers to six (ViT-s16, the Small ViT) and four (ViT-t16, the Tiny ViT) to obtain real-time processing. We also tried four different configurations of the CNN decoder network. The proposed architectures can learn the task of depth estimation efficiently and produce more accurate depth predictions than fully convolutional methods by taking advantage of the multi-head self-attention module. We trained the proposed encoder–decoder architectures end-to-end on the challenging NYU-DepthV2 and Cityscapes benchmarks and evaluated the trained models on their validation and test sets, showing that they outperform many state-of-the-art depth estimation methods while running in real time (∼20 fps). We also present a fast 3D reconstruction experiment (∼17 fps) based on the depth estimated by our method, which represents a real-world application of the method.
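A minimal sketch of the encoder–decoder pattern described above, under illustrative assumptions (not the released RT-ViT models): a patch-embedding transformer encoder whose depth can be shrunk (e.g., 12, 6, or 4 layers) followed by a small transposed-convolution decoder that maps the token grid back to a dense depth map. All dimensions and layer counts are placeholders.

```python
# ViT-encoder / CNN-decoder depth estimator sketch (illustrative hyperparameters).
import torch
import torch.nn as nn

class TinyViTDepth(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, num_layers=4, heads=6):
        super().__init__()
        self.grid = img_size // patch
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # fewer layers -> faster
        self.decoder = nn.Sequential(                                        # token grid -> depth map
            nn.ConvTranspose2d(dim, 128, 4, stride=4), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 32, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 2, stride=2))
    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos   # (B, N, dim)
        tokens = self.encoder(tokens)
        grid = tokens.transpose(1, 2).reshape(x.shape[0], -1, self.grid, self.grid)
        return self.decoder(grid)

model = TinyViTDepth(num_layers=4)                   # a "tiny" 4-layer encoder
depth = model(torch.rand(1, 3, 224, 224))
print(depth.shape)                                   # torch.Size([1, 1, 224, 224])
```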
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

Review

31 pages, 1426 KiB  
Review
Literature Review on Ship Localization, Classification, and Detection Methods Based on Optical Sensors and Neural Networks
by Eduardo Teixeira, Beatriz Araujo, Victor Costa, Samuel Mafra and Felipe Figueiredo
Sensors 2022, 22(18), 6879; https://doi.org/10.3390/s22186879 - 12 Sep 2022
Cited by 9 | Viewed by 3096
Abstract
Object detection is a common application within computer vision that combines the classic challenges of object localization and classification, which makes it a demanding task. The technique is crucial for maritime applications, since situational awareness can bring various benefits to surveillance systems. The literature presents various models to improve automatic target recognition and tracking capabilities that can be applied to and enhance maritime surveillance systems. Therefore, this paper reviews the available models focused on localization, classification, and detection. Moreover, it analyzes several works that apply the discussed models to the maritime surveillance scenario. Finally, it highlights the main opportunities and challenges, encouraging new research in this area.
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)

28 pages, 2300 KiB  
Review
A Review of Image Processing Techniques for Deepfakes
by Hina Fatima Shahzad, Furqan Rustam, Emmanuel Soriano Flores, Juan Luís Vidal Mazón, Isabel de la Torre Diez and Imran Ashraf
Sensors 2022, 22(12), 4556; https://doi.org/10.3390/s22124556 - 16 Jun 2022
Cited by 16 | Viewed by 9564
Abstract
Deep learning is used to address a wide range of challenging issues, including large-scale data analysis, image processing, object detection, and autonomous control. At the same time, deep learning techniques are also used to develop software and techniques that pose a danger to privacy, democracy, and national security. Fake images and videos produced by digital manipulation with artificial intelligence (AI) approaches have become widespread during the past few years, and deepfakes, in the form of audio, images, and videos, have become a major concern. Aided by AI, deepfakes swap the face of one person with that of another to generate hyper-realistic videos, and, given the speed of social media, they can immediately reach millions of people and be exploited for fake news, hoaxes, and fraud. Besides well-known movie stars, politicians have been victims of deepfakes, notably US presidents Barack Obama and Donald Trump; however, the public at large can also be targeted. To overcome the challenge of deepfake identification and mitigate its impact, large efforts have been devoted to devising novel methods for detecting face manipulation. This study also discusses how to counter the threats from deepfake technology and alleviate its impact. The outcomes suggest that, despite posing a serious threat to society, business, and political institutions, deepfakes can be combated through appropriate policies, regulation, individual actions, training, and education. In addition, further technological advances are needed for deepfake identification, content authentication, and deepfake prevention. Different studies have performed deepfake detection using machine learning and deep learning techniques such as support vector machines, random forests, multilayer perceptrons, k-nearest neighbors, and convolutional neural networks with and without long short-term memory, among other similar models. This study highlights recent research on deepfake image and video detection, covering deepfake creation, various detection algorithms evaluated on self-made datasets, and existing benchmark datasets.
(This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications)
