Computer Vision in AI for Robotics Development

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensors and Robotics".

Deadline for manuscript submissions: 30 September 2024 | Viewed by 5459

Special Issue Editors


Dr. Chen-Chiung Hsieh
Guest Editor
Department of Computer Science and Engineering, Tatung University, Taipei 104, Taiwan
Interests: artificial intelligence; multimedia processing; database design; pattern recognition

Dr. Hsiao-Ting Tseng
Guest Editor
Department of Computer Science and Information Engineering, National Central University, Chung-li, Taiwan
Interests: artificial intelligence; machine learning; human-robot interaction; big data analytics

Special Issue Information

Dear Colleagues,

The integration of computer vision and machine learning techniques in robotics has enabled significant advancements in various applications (such as perception, navigation, and control). These techniques allow robots to process and understand visual information from their environment, leading to improved decision making and performance.

We invite researchers, practitioners, and academics to submit original, high-quality papers on the latest developments in the application of computer vision and machine learning to robotics. This Special Issue aims to foster discussion and collaboration within the community and to promote advances in this exciting field.

Topics of Interest:

  • Robotics perception using computer vision and machine learning;
  • Visual navigation for robots;
  • Robotic control using visual information;
  • Visual object recognition and tracking for robotics;
  • 3D reconstruction and scene understanding for robotics;
  • Transfer learning for robotics applications;
  • Active perception for robots;
  • Other related topics in the intersection of computer vision, machine learning, and robotics.

Dr. Chen-Chiung Hsieh
Dr. Hsiao-Ting Tseng
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (4 papers)


Research

22 pages, 8135 KiB  
Article
A Lightweight Visual Simultaneous Localization and Mapping Method with a High Precision in Dynamic Scenes
by Qi Zhang, Wentao Yu, Weirong Liu, Hao Xu and Yuan He
Sensors 2023, 23(22), 9274; https://doi.org/10.3390/s23229274 - 19 Nov 2023
Cited by 1 | Viewed by 838
Abstract
Currently, in most traditional visual SLAM (VSLAM) systems, static-scene assumptions lead to low accuracy in dynamic environments, while methods that restore accuracy typically do so at the cost of real-time performance. In highly dynamic scenes, balancing high accuracy with low computational cost has become a pivotal requirement for VSLAM systems. This paper proposes a new VSLAM system that balances the competing demands of positioning accuracy and computational complexity, thereby improving the overall system properties. Regarding accuracy, the system applies an improved lightweight target detection network to quickly detect dynamic feature points while extracting feature points at the front end, and only the feature points of static targets are used for frame matching. Meanwhile, an attention mechanism is integrated into the target detection network to continuously and accurately capture dynamic factors in more complex dynamic environments. Regarding computational expense, the lightweight GhostNet module is applied as the backbone of the YOLOv5s target detection network, significantly reducing the number of model parameters and improving the overall inference speed of the algorithm. Experimental results on the TUM dynamic dataset indicate that, compared with the ORB-SLAM3 system, the pose estimation accuracy of the system improved by 84.04%. Compared with dynamic SLAM systems such as DS-SLAM and DVO SLAM, the system's positioning accuracy is significantly improved. Compared with other deep-learning-based VSLAM algorithms, the system offers superior real-time performance while maintaining a similar accuracy index. Full article
(This article belongs to the Special Issue Computer Vision in AI for Robotics Development)
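The paper's central trick of matching frames only on static feature points can be sketched as follows. This is a minimal illustration, not the authors' implementation; the feature points and detector boxes are hypothetical, standing in for ORB keypoints and the boxes produced by the GhostNet-backed YOLOv5s detector.

```python
# Minimal sketch: discard feature points that fall inside bounding boxes
# of detected dynamic objects (e.g., people), keeping only static points
# for frame-to-frame matching. Boxes are (x1, y1, x2, y2) in pixels.

def filter_dynamic_points(points, dynamic_boxes):
    """Keep only feature points that lie outside every dynamic-object box."""
    def in_box(p, box):
        x, y = p
        x1, y1, x2, y2 = box
        return x1 <= x <= x2 and y1 <= y <= y2
    return [p for p in points if not any(in_box(p, b) for b in dynamic_boxes)]

points = [(10, 10), (50, 60), (200, 120)]   # hypothetical keypoints
boxes = [(40, 40, 100, 150)]                # one detected moving person
static = filter_dynamic_points(points, boxes)
print(static)  # → [(10, 10), (200, 120)]
```

Only the surviving static points would then be fed to pose estimation, which is what shields the tracker from moving objects.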

17 pages, 12029 KiB  
Article
A Study on Generative Models for Visual Recognition of Unknown Scenes Using a Textual Description
by Jose Martinez-Carranza, Delia Irazú Hernández-Farías, Victoria Eugenia Vazquez-Meza, Leticia Oyuki Rojas-Perez and Aldrich Alfredo Cabrera-Ponce
Sensors 2023, 23(21), 8757; https://doi.org/10.3390/s23218757 - 27 Oct 2023
Viewed by 1062
Abstract
In this study, we investigate the application of generative models to assist artificial agents, such as delivery drones or service robots, in visualising unfamiliar destinations solely based on textual descriptions. We explore the use of generative models, such as Stable Diffusion, and embedding representations, such as CLIP and VisualBERT, to compare generated images obtained from textual descriptions of target scenes with images of those scenes. Our research encompasses three key strategies: image generation, text generation, and text enhancement, the latter involving tools such as ChatGPT to create concise textual descriptions for evaluation. The findings of this study contribute to an understanding of the impact of combining generative tools with multi-modal embedding representations to enhance the artificial agent’s ability to recognise unknown scenes. Consequently, we assert that this research holds broad applications, particularly in drone parcel delivery, where an aerial robot can employ text descriptions to identify a destination. Furthermore, this concept can also be applied to other service robots tasked with delivering to unfamiliar locations, relying exclusively on user-provided textual descriptions. Full article
(This article belongs to the Special Issue Computer Vision in AI for Robotics Development)
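The comparison step the abstract describes, ranking candidate scenes by the similarity of their embeddings to the embedding of a generated image, can be sketched as below. The vectors here are tiny hypothetical stand-ins; in the paper they would come from an embedding model such as CLIP's image encoder.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings (real ones would be high-dimensional CLIP outputs).
gen_emb = [0.9, 0.1, 0.2]                       # generated image from the text description
scene_embs = {"scene_a": [0.8, 0.2, 0.1],       # candidate destination scenes
              "scene_b": [0.1, 0.9, 0.4]}

# Pick the scene whose embedding best matches the generated image.
best = max(scene_embs, key=lambda k: cosine_similarity(gen_emb, scene_embs[k]))
print(best)  # → scene_a
```

An agent could run this ranking against live camera frames to decide whether the scene in front of it matches the described destination.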

29 pages, 15531 KiB  
Article
Adaptation of YOLOv7 and YOLOv7_tiny for Soccer-Ball Multi-Detection with DeepSORT for Tracking by Semi-Supervised System
by Jorge Armando Vicente-Martínez, Moisés Márquez-Olivera, Abraham García-Aliaga and Viridiana Hernández-Herrera
Sensors 2023, 23(21), 8693; https://doi.org/10.3390/s23218693 - 25 Oct 2023
Cited by 2 | Viewed by 1795
Abstract
Object recognition and tracking have long been a challenge, drawing considerable attention from analysts and researchers, particularly in sports, where they play a pivotal role in refining trajectory analysis. This study introduces a different approach, advancing the detection and tracking of soccer balls through a semi-supervised network. Leveraging the YOLOv7 convolutional neural network and incorporating the focal loss function, the proposed framework achieves a remarkable 95% accuracy in ball detection, outperforming methods previously reported in the literature. The integration of focal loss gives the model a distinctive edge, improving ball detection across different fields. This modification, in tandem with the YOLOv7 architecture, results in a marked improvement in accuracy. Building on this result, the implementation of DeepSORT enriches the study by enabling precise trajectory tracking. A comparative analysis between versions underscores the efficacy of this approach, demonstrating its superiority over conventional methods using the default loss function. In the Materials and Methods section, a meticulously curated dataset of soccer balls is assembled: images sourced from freely available digital media are combined with additional images captured by the authors during training sessions and amateur matches, for a total of 6331 images, of which 5731 are used for the supervised system and the remaining 600 for the semi-supervised stage. This diverse dataset enables comprehensive testing, providing a solid foundation for evaluating the model's performance under varying conditions. The results are striking, with accuracy increasing to 95% under the focal loss function. Visual representations of real-world scenarios underscore the model's proficiency in both detection and classification, further affirming its effectiveness and innovative approach. The discussion also covers the hardware specifications employed, highlights errors encountered, and outlines promising avenues for future research. Full article
(This article belongs to the Special Issue Computer Vision in AI for Robotics Development)
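The focal loss the abstract credits for the accuracy gain down-weights easy, well-classified examples so training concentrates on hard detections (such as a small, fast-moving ball). A minimal single-prediction sketch of the standard binary form, not the paper's exact implementation:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.
    p: predicted probability of the positive class; y: true label (0 or 1).
    With gamma = 0 and alpha = 0.5 this reduces to (half) cross-entropy."""
    p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
    a_t = alpha if y == 1 else 1.0 - alpha  # class-balance weight
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, confidently correct example contributes far less loss
# than a hard, misclassified one, which is the focusing effect.
easy = focal_loss(0.9, 1)   # confident and correct
hard = focal_loss(0.1, 1)   # confident and wrong
print(easy < hard)  # → True
```

In a detector, this term replaces the classification part of the default loss; the `alpha` and `gamma` values above are the commonly used defaults, not values stated in the paper.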

23 pages, 5218 KiB  
Article
Automatic Speaker Positioning in Meetings Based on YOLO and TDOA
by Chen-Chiung Hsieh, Men-Ru Lu and Hsiao-Ting Tseng
Sensors 2023, 23(14), 6250; https://doi.org/10.3390/s23146250 - 08 Jul 2023
Viewed by 1045
Abstract
In recent years, many meetings have been held via video conference due to the worldwide impact of the COVID-19 epidemic, with a webcam used in conjunction with a computer and the Internet. However, a network camera cannot automatically turn toward, or lock the screen onto, the speaker. Therefore, this study uses the object detector YOLO to capture the upper body of every person on the screen and judge whether each person's mouth is open or closed. At the same time, the Time Difference of Arrival (TDOA) is used to estimate the angle of the sound source. The person's image position obtained by YOLO is mapped back to spatial coordinates using the distance between the person and the camera, and the angle between the person and the camera is then computed from these coordinates through inverse trigonometric functions. Finally, the angle obtained from the camera and the sound-source angle obtained from the microphone array are matched for positioning. The experimental results show that the recall rate of positioning through YOLOX-Tiny reached 85.2%, and the recall rate of TDOA alone reached 88%. Integrating YOLOX-Tiny and TDOA, the recall rate reached 86.7%, the precision rate reached 100%, and the accuracy reached 94.5%. Therefore, the method proposed in this study can locate the speaker, and it performs better than either source alone. Full article
(This article belongs to the Special Issue Computer Vision in AI for Robotics Development)
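The TDOA step maps a measured inter-microphone delay to a sound-source angle. For a two-microphone pair under the usual far-field assumption, sin(theta) = c * tau / d, where tau is the delay, d the microphone spacing, and c the speed of sound. A minimal sketch (the spacing and delay below are illustrative values, not the paper's setup):

```python
import math

def doa_angle(tdoa_s, mic_spacing_m, c=343.0):
    """Angle of arrival in degrees from broadside for a two-microphone
    pair, using the far-field relation sin(theta) = c * tau / d."""
    s = c * tdoa_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))     # clamp against measurement noise
    return math.degrees(math.asin(s))

# With 0.1 m spacing, a measured delay of ~145.8 microseconds places
# the speaker about 30 degrees off the array's broadside axis.
print(round(doa_angle(145.8e-6, 0.1), 1))  # → 30.0
```

The system then matches this acoustic angle against the camera-derived angle of each mouth-open person to pick out the speaker.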
