Transformers in Computer Vision

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: closed (30 September 2023) | Viewed by 3835

Special Issue Editors


Prof. Dr. Ren-Hung Hwang
Guest Editor
College of Artificial Intelligence, National Yang Ming Chiao Tung University, Tainan 71150, Taiwan
Interests: 5G/B5G/6G; space-air-ground integrated network (SAGIN); AI-enabled 6G networks; deep-learning-based HIDS & NIDS; smart transportation

Dr. Chen-Kuo Chiang
Guest Editor
Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621301, Taiwan
Interests: computer vision; machine learning; pattern recognition

Special Issue Information

Dear Colleagues,

State-of-the-art research in computer vision is currently dominated by deep learning, with transformer architectures driving significant breakthroughs across a wide range of tasks. Object detection and recognition is a key area of research, with substantial progress in models that accurately identify and locate objects in images and videos. Image and video analysis is likewise an active area, with many recent developments in action recognition, video segmentation, and activity detection. Another important direction is 3D reconstruction and scene understanding, which involves creating 3D models of objects and scenes from 2D images. Deep learning techniques are also improving the performance of medical image analysis, such as tumor detection and segmentation in medical scans. Finally, computer vision for autonomous systems, such as self-driving cars and drones, remains an active area of research, with ongoing work on object detection, scene understanding, and motion planning.
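To make the term concrete for prospective authors, the following is a minimal sketch of the patch-based self-attention step at the core of vision transformers. It assumes PyTorch, and the layer names, dimensions, and the `PatchSelfAttention` module itself are illustrative choices, not taken from any particular paper in this Special Issue.

```python
# Minimal sketch of patch-based self-attention as used in vision transformers.
# Illustrative only; sizes and names are assumptions, not from any specific paper.
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    def __init__(self, dim=192, num_heads=3, patch_size=16, in_chans=3):
        super().__init__()
        # Patch embedding: non-overlapping patches projected to `dim` channels.
        self.embed = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, images):                      # images: (B, 3, H, W)
        x = self.embed(images)                      # (B, dim, H/16, W/16)
        tokens = x.flatten(2).transpose(1, 2)       # (B, N, dim), N = number of patches
        tokens = self.norm(tokens)
        out, _ = self.attn(tokens, tokens, tokens)  # every patch attends to every other patch
        return out

# Usage: a 224x224 image yields 14x14 = 196 patch tokens.
block = PatchSelfAttention()
features = block(torch.randn(1, 3, 224, 224))       # -> (1, 196, 192)
```

A single attention step already gives every patch token a global receptive field; practical vision transformers stack many such blocks together with MLPs, residual connections, and positional information.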

Topics include but are not limited to:

  1. Transformer architectures for object detection and recognition
  2. Transformer-based face applications, such as face recognition systems and multiple-object tracking systems
  3. Transformer-based approaches for image segmentation
  4. Transformer-based models for video analysis and action recognition
  5. Transformer-based models for 3D object detection
  6. Transformer-based models for panoptic segmentation
  7. Multi-modal transformer models for image–text understanding
  8. Transformer-based models for image synthesis and style transfer
  9. Transformer-based models for image super-resolution
  10. Transformer-based models for image captioning and text-to-image synthesis
  11. Transformer-based models for video synthesis and action generation

Prof. Dr. Ren-Hung Hwang
Dr. Chen-Kuo Chiang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • computer vision
  • deep learning
  • transformer

Published Papers (1 paper)


Research

22 pages, 6404 KiB  
Article
PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention
by Nikolas Ebert, Didier Stricker and Oliver Wasenmüller
Sensors 2023, 23(7), 3447; https://doi.org/10.3390/s23073447 - 25 Mar 2023
Cited by 3 | Viewed by 3337
Abstract
Recently, transformer architectures have shown superior performance compared to their CNN counterparts in many computer vision tasks. The self-attention mechanism enables transformer networks to connect visual dependencies over short as well as long distances, thus generating a large, sometimes even a global receptive field. In this paper, we propose our Parallel Local-Global Vision Transformer (PLG-ViT), a general backbone model that fuses local window self-attention with global self-attention. By merging these local and global features, short- and long-range spatial interactions can be effectively and efficiently represented without the need for costly computational operations such as shifted windows. In a comprehensive evaluation, we demonstrate that our PLG-ViT outperforms CNN-based as well as state-of-the-art transformer-based architectures in image classification and in complex downstream tasks such as object detection, instance segmentation, and semantic segmentation. In particular, our PLG-ViT models outperformed similarly sized networks like ConvNeXt and Swin Transformer, achieving Top-1 accuracy values of 83.4%, 84.0%, and 84.5% on ImageNet-1K with 27M, 52M, and 91M parameters, respectively. Full article
(This article belongs to the Special Issue Transformers in Computer Vision)
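As a rough illustration of the parallel local/global idea described in the abstract above, the sketch below runs window-restricted self-attention and a pooled global cross-attention side by side and merges their outputs. It assumes PyTorch; the module name, the pooling-based global branch, and the concatenation-plus-linear fusion are assumptions made for illustration, not the authors' actual PLG-ViT implementation.

```python
# Rough sketch of parallel local (windowed) and global self-attention.
# NOT the authors' PLG-ViT code; the fusion scheme and global branch are assumptions.
import torch
import torch.nn as nn

class ParallelLocalGlobalAttention(nn.Module):
    def __init__(self, dim=96, num_heads=4, window=7, pooled=7):
        super().__init__()
        self.window, self.pooled = window, pooled
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # merge local and global features

    def forward(self, x):                            # x: (B, C, H, W), H and W divisible by window
        B, C, H, W = x.shape
        w = self.window
        # Local branch: self-attention restricted to non-overlapping windows.
        win = x.reshape(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        win = win.reshape(-1, w * w, C)               # (B * num_windows, w*w, C)
        local, _ = self.local_attn(win, win, win)
        local = local.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        local = local.reshape(B, C, H, W)
        # Global branch: every token attends to a coarse, pooled summary of the whole map.
        tokens = x.flatten(2).transpose(1, 2)         # (B, H*W, C)
        summary = nn.functional.adaptive_avg_pool2d(x, self.pooled)
        summary = summary.flatten(2).transpose(1, 2)  # (B, pooled*pooled, C)
        glob, _ = self.global_attn(tokens, summary, summary)
        glob = glob.transpose(1, 2).reshape(B, C, H, W)
        # Fuse the two branches channel-wise.
        merged = torch.cat([local, glob], dim=1).permute(0, 2, 3, 1)  # (B, H, W, 2C)
        return self.fuse(merged).permute(0, 3, 1, 2)                  # (B, C, H, W)

# Usage on a 56x56 feature map with 96 channels:
block = ParallelLocalGlobalAttention()
y = block(torch.randn(2, 96, 56, 56))                # -> (2, 96, 56, 56)
```

In this sketch the local branch captures short-range structure within each window, while the global branch gives every position access to image-wide context in the same layer, which is one way to realize the short- and long-range interactions the abstract describes without shifted windows.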
