Intelligent Analysis and Image Recognition

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 November 2023) | Viewed by 10926

Special Issue Editors

School of Electronic and Information Engineering, Beihang University, No. 37 Xueyuan Road, Haidian District, Beijing 100191, China
Interests: remote sensing; image recognition; domain adaptation; few-shot learning; light-weight neural network
Department of Electronics and Information Engineering, Beihang University, Beijing 100191, China
Interests: affective computing; pattern recognition; human–computer interaction

Special Issue Information

Dear Colleagues,

Advances in both the technology and the exploitation of intelligent analysis and image recognition are being published regularly. However, real-world applications of intelligent data analysis and image recognition technologies are still limited. This is related to several factors, such as the complexity of realistic scenes, the high computing cost of large models and the difficulty of obtaining large amounts of annotated data. Current research is also focusing on the efficient use of multi-domain data. For example, infrared images can compensate for the information missing from RGB images in low-illumination scenes. Different image domains present different target features, so it is important to combine information from different image domains to obtain more powerful image recognition models and to build models that remain robust across image domains. We are interested in articles that explore robust intelligent analysis and image recognition systems. Potential topics include, but are not limited to, the following:

  • Intelligent recognition of large-scale images;
  • Multi-domain image recognition;
  • Few-shot/zero-shot image recognition;
  • Graph representation of images;
  • Multi-view image joint recognition;
  • Image recognition based on big data analytics;
  • Intelligent image recognition for real-world applications;
  • Efficient image recognition systems on embedded devices.

Prof. Dr. Qi Zhao
Dr. Lijiang Chen
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (7 papers)


Research

17 pages, 2107 KiB  
Article
NTPP-MVSNet: Multi-View Stereo Network Based on Neighboring Tangent Plane Propagation
Appl. Sci. 2023, 13(14), 8388; https://doi.org/10.3390/app13148388 - 20 Jul 2023
Viewed by 679
Abstract
Although learning-based multi-view stereo algorithms have produced exciting results in recent years, few researchers have explored the specific role of depth sampling in the network. We posit that depth sampling accuracy more directly impacts the quality of scene reconstruction. To address this issue, we propose NTPP-MVSNet, which uses normal vector and depth information from neighboring pixels to propagate tangent planes. Based on this, we obtain a more accurate depth estimate through homography transformation. We use deformable convolution to acquire continuous pixel positions on the surface and a 3D-UNet to regress the depth and normal vector maps without consuming additional GPU memory. Finally, we apply homography transformation to map between the imaging plane and the neighborhood surface tangent plane and so generate a depth hypothesis. Experiments on the DTU and Tanks and Temples datasets demonstrate the feasibility of NTPP-MVSNet, and ablation experiments confirm the superior performance of our depth sampling methodology.
(This article belongs to the Special Issue Intelligent Analysis and Image Recognition)
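
The abstract above does not state the homography explicitly. As a hedged illustration only, the textbook plane-induced homography between a reference view and a source view can be written as below. The symbols K_ref, K_src, R, t, n and d are assumptions introduced here for the sketch and are not notation taken from the paper.

```latex
% Assumed convention: the tangent plane in the reference camera frame is n^T X = d,
% and the relative pose maps reference coordinates to source coordinates: X_src = R X_ref + t.
H = K_{\mathrm{src}} \left( R + \frac{t\, n^{\top}}{d} \right) K_{\mathrm{ref}}^{-1},
\qquad
\tilde{p}_{\mathrm{src}} \simeq H\, \tilde{p}_{\mathrm{ref}}
```

Under this convention, a reference pixel with homogeneous coordinates p, depth z and surface normal n lies on the plane with offset d = n^T (z K_ref^{-1} p), which is how a per-pixel depth and normal pair determines a tangent plane.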

18 pages, 8019 KiB  
Article
FusionTrack: Multiple Object Tracking with Enhanced Information Utilization
Appl. Sci. 2023, 13(14), 8010; https://doi.org/10.3390/app13148010 - 08 Jul 2023
Viewed by 1173
Abstract
Multi-object tracking (MOT) is an important research direction in computer vision. Though existing methods can solve simple tasks like pedestrian tracking well, some complex downstream tasks featuring uniform appearance and diverse motion remain difficult. Inspired by DETR, the tracking-by-attention (TBA) method uses transformers to accomplish multi-object tracking tasks. However, existing TBA methods still have issues, such as difficulty detecting and tracking objects due to gradient conflict in shared parameters, and insufficient use of features to distinguish similar objects. We introduce FusionTrack to address these issues. It uses a joint track-detection decoder and a score-guided multi-level query fuser to enhance the use of information within and between frames. With these improvements, FusionTrack achieves an 11.1% higher HOTA score on the DanceTrack dataset than the baseline model MOTR.
(This article belongs to the Special Issue Intelligent Analysis and Image Recognition)

18 pages, 2376 KiB  
Article
Multiscale YOLOv5-AFAM-Based Infrared Dim-Small-Target Detection
Appl. Sci. 2023, 13(13), 7779; https://doi.org/10.3390/app13137779 - 30 Jun 2023
Viewed by 818
Abstract
Infrared detection plays an important role in military, aerospace, and other fields, offering all-weather operation, high stealth, and strong anti-interference capability. However, infrared dim-small-target detection suffers from complex backgrounds, low signal-to-noise ratio, blurred targets with small area percentages, and other challenges. In this paper, we propose a multiscale YOLOv5-AFAM algorithm to realize high-accuracy and real-time detection. To address intra-class feature difference and inter-class feature similarity among targets, the Adaptive Fusion Attention Module (AFAM) is proposed to generate feature maps that weight the features in the network and make the network focus on small targets. A multiscale fusion structure is proposed to handle the small and variable detection scales of infrared vehicle targets. In addition, the downsampling layer is improved by combining Maxpool and convolutional downsampling to reduce the number of model parameters and retain texture information. To cover multiple scenarios, we constructed an infrared dim and small vehicle target detection dataset, ISVD, and evaluated the multiscale YOLOv5-AFAM on it. Compared with YOLOv7, mAP@0.5 improves slightly while the parameter count is only 17.98% of YOLOv7's. Compared with the YOLOv5s model, mAP@0.5 improves from 81.4% to 85.7% while the parameter count falls from 7.0 M to 6.6 M. The experimental results demonstrate that the multiscale YOLOv5-AFAM has higher detection accuracy and detection speed on infrared dim and small vehicles.
(This article belongs to the Special Issue Intelligent Analysis and Image Recognition)
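
The figures quoted in the abstract above can be put side by side with a little arithmetic. The sketch below only restates the reported numbers (81.4% and 85.7% mAP@0.5, 7.0 M and 6.6 M parameters, 17.98% of YOLOv7). The helper name `relative_change` is hypothetical, and the implied YOLOv7 size assumes that the 17.98% ratio refers to the 6.6 M model, which the abstract does not state explicitly.

```python
def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new (negative means a reduction)."""
    return (new - old) / old


# Values reported in the abstract (YOLOv5s baseline vs. multiscale YOLOv5-AFAM).
map_yolov5s, map_afam = 81.4, 85.7          # mAP@0.5 in percent
params_yolov5s, params_afam = 7.0, 6.6      # parameters in millions

# Absolute gain in percentage points and relative parameter change.
map_gain_points = round(map_afam - map_yolov5s, 1)
param_change_pct = round(100 * relative_change(params_yolov5s, params_afam), 1)

# ASSUMPTION: "17.98% of YOLOv7" is taken relative to the 6.6 M model.
implied_yolov7_params = round(params_afam / 0.1798, 1)

print(map_gain_points, param_change_pct, implied_yolov7_params)
```

Read this way, the accuracy gain over YOLOv5s is a few percentage points while the parameter count shrinks by a few percent, so the improvement does not come from a larger model.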

15 pages, 11178 KiB  
Article
UTTSR: A Novel Non-Structured Text Table Recognition Model Powered by Deep Learning Technology
Appl. Sci. 2023, 13(13), 7556; https://doi.org/10.3390/app13137556 - 27 Jun 2023
Viewed by 1454
Abstract
To prevent the compilation of documents, many table documents are formatted with non-editable and non-structured texts such as PDFs or images. Quickly recognizing the contents of tables is still a challenge due to factors such as irregular formats, uneven text quality, and complex and diverse table content. This article proposes the UTTSR table recognition model, which consists of four parts: text region detection, text line detection, text line recognition, and table sequence recognition. For table detection, the Cascade Faster RCNN with the ResNeXt105 network is implemented, using TPS (Thin Plate Spline) transformation and affine transformation to correct the image and to improve accuracy. For text line detection, DBNET is used with Do-Conv in the FPN (Feature Pyramid Network) to speed up training. Text lines are recognized using CRNN without the CTC module, enhancing recognition performance. Table sequence recognition is based on the transformer combined with post-processing algorithms that fuse table structure sequences and unit grid content. Experimental results show that the UTTSR model outperforms the compared methods. The model significantly improves on the previous state-of-the-art F1 score on complex tables, reaching 97.8%.
(This article belongs to the Special Issue Intelligent Analysis and Image Recognition)

26 pages, 5177 KiB  
Article
FPGA Implementation of a Deep Learning Acceleration Core Architecture for Image Target Detection
Appl. Sci. 2023, 13(7), 4144; https://doi.org/10.3390/app13074144 - 24 Mar 2023
Cited by 3 | Viewed by 2763
Abstract
Due to the flexibility and ease of deployment of Field Programmable Gate Arrays (FPGAs), more and more studies have been conducted on developing and optimizing target detection algorithms based on Convolutional Neural Network (CNN) models using FPGAs. Still, these studies focus on improving the performance of the core algorithm and optimizing the hardware structure, with few studies focusing on unified architecture design and corresponding optimization techniques for the algorithm model, resulting in inefficient overall model performance. The essential reason is that these studies do not address arithmetic power, speed, and resource consistency. To solve this problem, we propose a deep learning acceleration core architecture based on FPGAs, designed for target detection algorithms with CNN models. It uses multi-channel parallelization of CNN models to improve the arithmetic power, uses task scheduling and intensive computation pipelining to meet the algorithm's data bandwidth requirements, and unifies the speed and area of the orchestrated computation matrix to save hardware resources. The proposed framework achieves 14 Frames Per Second (FPS) inference performance of the TinyYolo model at 5 Giga Operations Per Second (GOPS) with 30% higher running clock frequency, 2–4 times higher arithmetic power, and 28% higher Digital Signal Processing (DSP) resource utilization efficiency using less than 25% of FPGA resource usage.
(This article belongs to the Special Issue Intelligent Analysis and Image Recognition)

19 pages, 11500 KiB  
Article
Exploiting the Potential of Overlapping Cropping for Real-World Pedestrian and Vehicle Detection with Gigapixel-Level Images
Appl. Sci. 2023, 13(6), 3637; https://doi.org/10.3390/app13063637 - 13 Mar 2023
Cited by 1 | Viewed by 1325
Abstract
Pedestrian and vehicle detection is widely used in intelligent assisted driving, pedestrian counting, drone aerial photography, and other applications. Recently, with the development of gigacameras, gigapixel-level images have emerged. The large field of view and high resolution provide global and local information, which enables object detection in real-world scenarios. Although existing pedestrian and vehicle detection algorithms have achieved remarkable success on standard images, they are not suitable for ultra-high-resolution images. To improve the performance of existing pedestrian and vehicle detectors in real-world scenarios, we used a sliding window to crop the original images. When fusing the sub-images, we proposed a midline method to reduce the cropped objects that NMS could not eliminate. We also used varifocal loss to address the imbalance between positive and negative samples caused by the high resolution. We further found that pedestrians and vehicles were separable in size and comprised more than one target type. As a result, we improved the detector performance with single-class object detection for pedestrians and vehicles, respectively, and we provided many other useful strategies to improve the detector. The experimental results demonstrated that our method could improve the performance of real-world pedestrian and vehicle detection.
(This article belongs to the Special Issue Intelligent Analysis and Image Recognition)
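
The sliding-window cropping mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration of overlapping tiling in general, not the authors' implementation: the function names, the window size and the overlap are all hypothetical, and the paper's midline fusion method is not reproduced here.

```python
def tile_origins(length: int, window: int, stride: int) -> list[int]:
    """Start offsets of windows that together cover [0, length) along one axis."""
    if window >= length:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    # If the regular stride leaves a gap at the border, add one window flush with it.
    if starts[-1] + window < length:
        starts.append(length - window)
    return starts


def overlapping_tiles(width: int, height: int, window: int, overlap: int):
    """Return (x0, y0, x1, y1) crop boxes; neighbouring boxes share `overlap` pixels."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    stride = window - overlap
    return [
        (x, y, min(x + window, width), min(y + window, height))
        for y in tile_origins(height, window, stride)
        for x in tile_origins(width, window, stride)
    ]


def to_global(box, tile):
    """Shift a detection box from tile coordinates back to full-image coordinates."""
    x0, y0 = tile[0], tile[1]
    bx0, by0, bx1, by1 = box
    return (bx0 + x0, by0 + y0, bx1 + x0, by1 + y0)


# Hypothetical sizes, chosen only to keep the example small.
tiles = overlapping_tiles(width=1000, height=600, window=400, overlap=100)
print(len(tiles), tiles[0], tiles[-1])
```

Detections from each tile would be mapped back with `to_global` before any cross-tile fusion step. The overlap means an object cut by one tile border is usually whole in a neighbouring tile, at the cost of duplicate or partial boxes that the fusion step then has to resolve.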

21 pages, 8642 KiB  
Article
LGViT: A Local and Global Vision Transformer with Dynamic Contextual Position Bias Using Overlapping Windows
Appl. Sci. 2023, 13(3), 1993; https://doi.org/10.3390/app13031993 - 03 Feb 2023
Viewed by 1634
Abstract
Vision Transformers (ViTs) have shown their superiority in various visual tasks owing to the capability of self-attention mechanisms to model long-range dependencies. Some recent works try to reduce the high cost of vision transformers by limiting the self-attention module to a local window. As a trade-off, window-based self-attention also reduces the ability to capture long-range dependencies compared with the original self-attention in transformers. In this paper, we propose a Local and Global Vision Transformer (LGViT) that incorporates overlapping windows and multi-scale dilated pooling to strengthen self-attention both locally and globally. Our proposed self-attention mechanism is composed of a local self-attention module (LSA) and a global self-attention module (GSA), which are performed on overlapping windows partitioned from the input image. In LSA, the key and value sets are expanded by the surroundings of windows to increase the receptive field. For GSA, the key and value sets are expanded by multi-scale dilated pooling to promote global interactions. Moreover, a dynamic contextual positional encoding module is exploited to add positional information more efficiently and flexibly. We conduct extensive experiments on various visual tasks, and the experimental results strongly demonstrate that the proposed LGViT outperforms state-of-the-art approaches.
(This article belongs to the Special Issue Intelligent Analysis and Image Recognition)