Recent Trends in Automatic Image Captioning Systems

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Electrical, Electronics and Communications Engineering".

Deadline for manuscript submissions: 20 June 2024 | Viewed by 9663

Special Issue Editors

Faculty of Information Technology and Electrical Engineering (ITEE), University of Oulu, 90570 Oulu, Finland
Interests: social media; data mining; robotics; data fusion; computer vision
University of Orleans, IDP Laboratory, UMR CNRS 7013, 45100 Orleans, France
Interests: image analysis; medical analysis; image retrieval systems

Special Issue Information

Dear Colleagues,

The rapid development of digitalization, tagging, and user-generated content has brought about a substantial increase in datasets in which images are accompanied by related text, giving rise to automatic image captioning (AIC): the task of automatically generating properly formed English sentences, or captions, that describe the content of an image. AIC research has a great impact on various domains, such as virtual assistants, image indexing, recommendation systems, and medical diagnosis systems, among others. Image captioning relies on computer vision for image comprehension and on natural language processing to generate textual descriptions that are semantically and linguistically sound.

Various techniques have been pursued to improve automatic image captioning, including deep learning, automatic text generation, retrieval-based image captioning, and template-based image captioning. Nevertheless, progress in this field has been hampered by several inherent challenges and limitations: the theoretical frameworks employed, the quality of the datasets used for machine learning development, and the difficulty of cross-domain generalization. Thus, further research and advancement in the field is urgently needed. This Special Issue will present state-of-the-art research in the field of automatic image captioning, highlighting the latest developments in both the theory and practical applications of this emerging technology. Topics of interest include, but are not limited to:

  • Machine learning and deep learning techniques for image captioning;
  • Theoretical frameworks for automatic image captioning;
  • Text summarization for image captioning;
  • Review papers on image captioning;
  • New datasets and evaluation frameworks for automatic image captioning;
  • Preprocessing and filtering techniques for image captioning;
  • Cross-domain generalization in image captioning;
  • Recent advances in automatic medical image captioning.

Dr. Mourad Oussalah
Prof. Dr. Rachid Jennane
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image captioning
  • natural language processing
  • text generation
  • image content analysis
  • image tagging

Published Papers (7 papers)


Research

21 pages, 1298 KiB  
Article
A Unified Visual and Linguistic Semantics Method for Enhanced Image Captioning
by Jiajia Peng and Tianbing Tang
Appl. Sci. 2024, 14(6), 2657; https://doi.org/10.3390/app14062657 - 21 Mar 2024
Viewed by 295
Abstract
Image captioning, also recognized as the challenge of transforming visual data into coherent natural language descriptions, has persisted as a complex problem. Traditional approaches often suffer from semantic gaps, wherein the generated textual descriptions lack depth, context, or the nuanced relationships contained within the images. In an effort to overcome these limitations, we introduce a novel encoder–decoder framework called A Unified Visual and Linguistic Semantics Method. Our method comprises three key components: an encoder, a mapping network, and a decoder. The encoder employs a fusion of CLIP (Contrastive Language–Image Pre-training) and SegmentCLIP to process and extract salient image features. SegmentCLIP builds upon CLIP’s foundational architecture by employing a clustering mechanism, thereby enhancing the semantic relationships between textual and visual elements in the image. The extracted features are then transformed by a mapping network into a fixed-length prefix. A GPT-2-based decoder subsequently generates a corresponding Chinese language description for the image. This framework aims to harmonize feature extraction and semantic enrichment, thereby producing more contextually accurate and comprehensive image descriptions. Our quantitative assessment reveals that our model exhibits notable enhancements across the intricate AIC-ICC, Flickr8k-CN, and COCO-CN datasets, evidenced by a 2% improvement in BLEU@4 and a 10% uplift in CIDEr scores. Additionally, it demonstrates acceptable efficiency in terms of simplicity, speed, and reduction in computational burden. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
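
For readers who want a concrete picture of the prefix-style pipeline the abstract describes (a frozen CLIP-like encoder, a small mapping network that turns the pooled image feature into a fixed-length prefix, and a GPT-2 decoder), the following is a minimal sketch in PyTorch. It is an editorial illustration only: SegmentCLIP, its clustering mechanism, and the Chinese GPT-2 decoder are specific to the paper and are not reproduced here; the dimensions and the stand-in image feature are assumptions.

```python
# Minimal sketch of a prefix-based captioning pipeline (assumptions noted above).
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """Maps a pooled image feature to a fixed-length prefix of decoder embeddings."""

    def __init__(self, feat_dim: int = 512, embed_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len = prefix_len
        self.embed_dim = embed_dim
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, prefix_len * embed_dim // 2),
            nn.Tanh(),
            nn.Linear(prefix_len * embed_dim // 2, prefix_len * embed_dim),
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (batch, feat_dim) pooled feature from a CLIP-like encoder
        prefix = self.mlp(image_feat)
        return prefix.view(-1, self.prefix_len, self.embed_dim)


if __name__ == "__main__":
    # Stand-in for the CLIP/SegmentCLIP output; in practice this comes from the frozen encoder.
    image_feat = torch.randn(2, 512)
    mapper = MappingNetwork()
    prefix = mapper(image_feat)  # (2, 10, 768)
    # The prefix would be concatenated with the GPT-2 token embeddings and fed to the
    # decoder via its `inputs_embeds` argument; the language-model loss is computed
    # only on the caption tokens.
    print(prefix.shape)
```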

15 pages, 6866 KiB  
Article
IFE-Net: An Integrated Feature Extraction Network for Single-Image Dehazing
by Can Leng and Gang Liu
Appl. Sci. 2023, 13(22), 12236; https://doi.org/10.3390/app132212236 - 11 Nov 2023
Viewed by 587
Abstract
In recent years, numerous single-image dehazing algorithms have made significant progress; however, dehazing still presents a challenge, particularly in complex real-world scenarios. In fact, single-image dehazing is an inherently ill-posed problem, as scene transmission relies on unknown and nonhomogeneous depth information. This study proposes a novel end-to-end single-image dehazing method called the Integrated Feature Extraction Network (IFE-Net). Instead of estimating the transmission matrix and atmospheric light separately, IFE-Net directly generates the clean image using a lightweight CNN. During the dehazing process, texture details are often lost. To address this issue, an attention mechanism module is introduced in IFE-Net to handle different information impartially. Additionally, a new nonlinear activation function is proposed in IFE-Net, known as a bilateral constrained rectifier linear unit (BCReLU). Extensive experiments were conducted to evaluate the performance of IFE-Net. The results demonstrate that IFE-Net outperforms other single-image haze removal algorithms in terms of both PSNR and SSIM. In the SOTS dataset, IFE-Net achieves a PSNR value of 24.63 and an SSIM value of 0.905. In the ITS dataset, the PSNR value is 25.62, and the SSIM value reaches 0.925. The quantitative results of the synthesized images are either superior to or comparable with those obtained via other advanced algorithms. Moreover, IFE-Net also exhibits significant subjective visual quality advantages. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
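
The abstract introduces a bilaterally constrained rectifier linear unit (BCReLU); the paper's exact formulation is not given here, but the general idea of an activation bounded on both sides can be sketched as follows. This is a toy illustration, not the authors' definition, and the bounds are placeholders.

```python
import torch
import torch.nn as nn


class BoundedReLU(nn.Module):
    """Illustrative two-sided (bilaterally constrained) ReLU: output clamped to [lower, upper].

    The exact BCReLU formulation in the paper may differ; this only shows the idea of
    constraining the activation from both sides.
    """

    def __init__(self, lower: float = 0.0, upper: float = 1.0):
        super().__init__()
        self.lower = lower
        self.upper = upper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.clamp(x, self.lower, self.upper)


if __name__ == "__main__":
    act = BoundedReLU(0.0, 1.0)
    x = torch.tensor([-0.5, 0.3, 1.7])
    print(act(x))  # tensor([0.0000, 0.3000, 1.0000])
```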

13 pages, 4074 KiB  
Article
Machine Vision-Based Chinese Walnut Shell–Kernel Recognition and Separation
by Yongcheng Zhang, Xingyu Wang, Yang Liu, Zhanbiao Li, Haipeng Lan, Zhaoguo Zhang and Jiale Ma
Appl. Sci. 2023, 13(19), 10685; https://doi.org/10.3390/app131910685 - 26 Sep 2023
Viewed by 745
Abstract
Walnut shell–kernel separation is an essential step in the deep processing of walnut. It is a crucial factor that prevents the increase in the added value and industrial development of walnuts. This study proposes a walnut shell–kernel detection method based on YOLOX deep learning using machine vision and deep-learning technology to address common issues, such as incomplete shell–kernel separation in the current airflow screening, high costs and the low efficiency of manually assisted screening. A dataset was produced using Labelme by acquiring walnut shell and kernel images following shellshock. This dataset was transformed into the COCO dataset format. Next, 110 epochs of training were performed on the network. When the intersection over the union threshold was 0.5, the average precision (AP), the average recall rate (AR), the model size, and floating point operations per second were 96.3%, 84.7%, 99 MB, and 351.9, respectively. Compared with YOLOv3, Faster Region-based Convolutional Neural Network (Faster R-CNN), and Single Shot MultiBox Detector algorithms (SSD), the AP value of the proposed algorithm was increased by 2.1%, 1.3%, and 3.4%, respectively. Similarly, the AR was increased by 10%, 2.3%, and 9%, respectively. Meanwhile, walnut shell–kernel detection was performed under different situations, such as distinct species, supplementary lighting, or shielding conditions. This model exhibits high recognition and positioning precision under different walnut species, supplementary lighting, and shielding conditions. It has high robustness. Moreover, the small size of this model is beneficial for migration applications. This study’s results can provide some technological references to develop faster walnut shell–kernel separation methods. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
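
As background to the data-preparation step mentioned above (annotating with Labelme and converting to the COCO detection format), a rough conversion sketch is given below. It assumes the common Labelme JSON layout (imagePath, imageHeight, imageWidth, and shapes with label and points); the category names and file paths are hypothetical, and the authors' actual conversion may differ.

```python
# Rough sketch: convert Labelme-style polygon/rectangle annotations to COCO detection format.
import json
from pathlib import Path
from typing import Dict


def labelme_to_coco(labelme_dir: str, categories: Dict[str, int]) -> dict:
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in categories.items()],
    }
    ann_id = 1
    for img_id, path in enumerate(sorted(Path(labelme_dir).glob("*.json")), start=1):
        data = json.loads(path.read_text(encoding="utf-8"))
        coco["images"].append({
            "id": img_id,
            "file_name": data["imagePath"],
            "height": data["imageHeight"],
            "width": data["imageWidth"],
        })
        for shape in data["shapes"]:
            # Bounding box from the annotated points (works for rectangles and polygons).
            xs = [p[0] for p in shape["points"]]
            ys = [p[1] for p in shape["points"]]
            x, y = min(xs), min(ys)
            w, h = max(xs) - x, max(ys) - y
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": categories[shape["label"]],
                "bbox": [x, y, w, h],
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco


if __name__ == "__main__":
    # Hypothetical walnut shell/kernel categories and annotation folder.
    result = labelme_to_coco("annotations/", {"shell": 1, "kernel": 2})
    Path("instances_train.json").write_text(json.dumps(result))
```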

17 pages, 5406 KiB  
Article
Bi-LS-AttM: A Bidirectional LSTM and Attention Mechanism Model for Improving Image Captioning
by Tian Xie, Weiping Ding, Jinbao Zhang, Xusen Wan and Jiehua Wang
Appl. Sci. 2023, 13(13), 7916; https://doi.org/10.3390/app13137916 - 06 Jul 2023
Viewed by 1771
Abstract
The discipline of automatic image captioning represents an integration of two pivotal branches of artificial intelligence, namely computer vision (CV) and natural language processing (NLP). The principal functionality of this technology lies in transmuting the extracted visual features into semantic information of a higher order. The bidirectional long short-term memory (Bi-LSTM) has garnered wide acceptance in executing image captioning tasks. Of late, scholarly attention has been focused on modifying suitable models for innovative and precise subtitle captions, although tuning the parameters of the model does not invariably yield optimal outcomes. Given this, the current research proposes a model that effectively employs the bidirectional LSTM and attention mechanism (Bi-LS-AttM) for image captioning endeavors. This model exploits the contextual comprehension from both anterior and posterior aspects of the input data, synergistically with the attention mechanism, thereby augmenting the precision of visual language interpretation. The distinctiveness of this research is embodied in its incorporation of Bi-LSTM and the attention mechanism to engender sentences that are both structurally innovative and accurately reflective of the image content. To enhance temporal efficiency and accuracy, this study substitutes convolutional neural networks (CNNs) with fast region-based convolutional networks (Fast RCNNs). Additionally, it refines the process of generation and evaluation of common space, thus fostering improved efficiency. Our model was tested for its performance on Flickr30k and MSCOCO datasets (80 object categories). Comparative analyses of performance metrics reveal that our model, leveraging the Bi-LS-AttM, surpasses unidirectional and Bi-LSTM models. When applied to caption generation and image-sentence retrieval tasks, our model manifests time economies of approximately 36.5% and 26.3% vis-a-vis the Bi-LSTM model and the deep Bi-LSTM model, respectively. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
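
To make the Bi-LS-AttM idea more tangible, the sketch below shows one decoding step that combines a bidirectional LSTM with additive attention over detected region features. It is a schematic with made-up dimensions (36 regions, 2048-dimensional Fast R-CNN-style features), not the authors' exact architecture.

```python
# Schematic bidirectional-LSTM-plus-attention decoding step (illustrative dimensions).
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, regions: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, feat_dim) region features; hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(regions) + self.w_hidden(hidden).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)   # attention over regions
        return (weights * regions).sum(dim=1)    # attended context vector


class BiLSTMAttnDecoder(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 2048, embed_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + feat_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.attn = AdditiveAttention(feat_dim, 2 * hidden_dim)
        self.out = nn.Linear(2 * hidden_dim + feat_dim, vocab_size)

    def forward(self, regions: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        emb = self.embed(tokens)                                         # (B, T, E)
        mean_feat = regions.mean(dim=1, keepdim=True).expand(-1, emb.size(1), -1)
        hidden_seq, _ = self.lstm(torch.cat([emb, mean_feat], dim=-1))   # (B, T, 2H)
        ctx = self.attn(regions, hidden_seq[:, -1])                      # (B, feat_dim)
        return self.out(torch.cat([hidden_seq[:, -1], ctx], dim=-1))     # next-word logits


if __name__ == "__main__":
    decoder = BiLSTMAttnDecoder(vocab_size=10000)
    regions = torch.randn(2, 36, 2048)        # 36 detected regions per image
    tokens = torch.randint(0, 10000, (2, 5))  # partial caption
    print(decoder(regions, tokens).shape)     # torch.Size([2, 10000])
```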

24 pages, 10221 KiB  
Article
ACapMed: Automatic Captioning for Medical Imaging
by Djamila Romaissa Beddiar, Mourad Oussalah, Tapio Seppänen and Rachid Jennane
Appl. Sci. 2022, 12(21), 11092; https://doi.org/10.3390/app122111092 - 01 Nov 2022
Cited by 3 | Viewed by 2015
Abstract
Medical image captioning is a very challenging task that has been rarely addressed in the literature on natural image captioning. Some existing image captioning techniques exploit objects present in the image next to the visual features while generating descriptions. However, this is not possible for medical image captioning when one requires following clinician-like explanations in image content descriptions. Inspired by the preceding, this paper proposes using medical concepts associated with images, in accordance with their visual features, to generate new captions. Our end-to-end trainable network is composed of a semantic feature encoder based on a multi-label classifier to identify medical concepts related to images, a visual feature encoder, and an LSTM model for text generation. Beam search is employed to ensure the best selection of the next word for a given sequence of words based on the merged features of the medical image. We evaluated our proposal on the ImageCLEF medical captioning dataset, and the results demonstrate the effectiveness and efficiency of the developed approach. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
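
The abstract notes that beam search is used to select the next word given the merged image features. A generic beam search over an arbitrary next-token scoring function looks roughly like the sketch below; the toy step function stands in for the ACapMed network and is purely illustrative.

```python
# Generic beam search over a step function returning (token, log-probability) candidates.
import math
from typing import Callable, List, Tuple


def beam_search(step_fn: Callable[[List[int]], List[Tuple[int, float]]],
                bos: int, eos: int, beam_width: int = 3, max_len: int = 20) -> List[int]:
    beams: List[Tuple[List[int], float]] = [([bos], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))  # finished caption is kept as-is
                continue
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        # Keep only the best `beam_width` partial captions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]


if __name__ == "__main__":
    # Toy step function: always proposes tokens 1, 2 and the end token 3.
    def toy_step(seq: List[int]) -> List[Tuple[int, float]]:
        return [(1, math.log(0.5)), (2, math.log(0.3)), (3, math.log(0.2))]

    print(beam_search(toy_step, bos=0, eos=3))
```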

18 pages, 6632 KiB  
Article
Metaheuristics Optimization with Deep Learning Enabled Automated Image Captioning System
by Mesfer Al Duhayyim, Sana Alazwari, Hanan Abdullah Mengash, Radwa Marzouk, Jaber S. Alzahrani, Hany Mahgoub, Fahd Althukair and Ahmed S. Salama
Appl. Sci. 2022, 12(15), 7724; https://doi.org/10.3390/app12157724 - 31 Jul 2022
Cited by 8 | Viewed by 1523
Abstract
Image captioning is a popular topic in the domains of computer vision and natural language processing (NLP). Recent advancements in deep learning (DL) models have enabled the improvement of the overall performance of the image captioning approach. This study develops a metaheuristic optimization with a deep learning-enabled automated image captioning technique (MODLE-AICT). The proposed MODLE-AICT model focuses on the generation of effective captions to the input images by using two processes involving encoding unit and decoding unit. Initially, at the encoding part, the salp swarm algorithm (SSA), with a HybridNet model, is utilized to generate effectual input image representation using fixed-length vectors, showing the novelty of the work. Moreover, the decoding part includes a bidirectional gated recurrent unit (BiGRU) model used to generate descriptive sentences. The inclusion of an SSA-based hyperparameter optimizer helps in attaining effectual performance. For inspecting the enhanced performance of the MODLE-AICT model, a series of simulations were carried out, and the results are examined under several aspects. The experimental values suggested the betterment of the MODLE-AICT model over recent approaches. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
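
For orientation, the salp swarm algorithm (SSA) used here as a hyperparameter optimizer follows the usual leader/follower update scheme; a rough, self-contained sketch is shown below with a toy objective standing in for the captioning model's validation score. The bounds, swarm size, and objective are assumptions.

```python
# Rough sketch of the salp swarm algorithm (SSA) as a black-box minimizer.
import math
import random
from typing import Callable, List, Sequence


def salp_swarm(objective: Callable[[List[float]], float],
               lower: Sequence[float], upper: Sequence[float],
               n_salps: int = 20, n_iter: int = 50) -> List[float]:
    dim = len(lower)
    swarm = [[random.uniform(lower[d], upper[d]) for d in range(dim)] for _ in range(n_salps)]
    food = min(swarm, key=objective)[:]  # best position found so far (the "food source")
    for t in range(1, n_iter + 1):
        c1 = 2 * math.exp(-(4 * t / n_iter) ** 2)  # exploration/exploitation coefficient
        for i in range(n_salps):
            for d in range(dim):
                if i == 0:  # leader moves around the food source
                    c2, c3 = random.random(), random.random()
                    step = c1 * ((upper[d] - lower[d]) * c2 + lower[d])
                    swarm[i][d] = food[d] + step if c3 >= 0.5 else food[d] - step
                else:       # followers move toward the salp in front of them
                    swarm[i][d] = (swarm[i][d] + swarm[i - 1][d]) / 2
                swarm[i][d] = min(max(swarm[i][d], lower[d]), upper[d])  # keep in bounds
        best = min(swarm, key=objective)
        if objective(best) < objective(food):
            food = best[:]
    return food


if __name__ == "__main__":
    # Toy objective over two "hyperparameters" (e.g. learning rate and dropout).
    obj = lambda x: (x[0] - 0.001) ** 2 + (x[1] - 0.3) ** 2
    print(salp_swarm(obj, lower=[1e-5, 0.0], upper=[1e-1, 0.9]))
```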

20 pages, 10487 KiB  
Article
Dual-Modal Transformer with Enhanced Inter- and Intra-Modality Interactions for Image Captioning
by Deepika Kumar, Varun Srivastava, Daniela Elena Popescu and Jude D. Hemanth
Appl. Sci. 2022, 12(13), 6733; https://doi.org/10.3390/app12136733 - 02 Jul 2022
Cited by 6 | Viewed by 1852
Abstract
Image captioning is oriented towards describing an image with the best possible use of words that can provide a semantic, relatable meaning of the scenario inscribed. Different models can be used to accomplish this arduous task depending on the context and requirement of what needs to be achieved. An encoder–decoder model which uses the image feature vectors as an input to the encoder is often marked as one of the appropriate models to accomplish the captioning process. In the proposed work, a dual-modal transformer has been used which captures the intra- and inter-modal interactions in a simultaneous manner within an attention block. The transformer architecture is quantitatively evaluated on the publicly available Microsoft Common Objects in Context (MS COCO) dataset, yielding a Bilingual Evaluation Understudy (BLEU)-4 score of 85.01. The efficacy of the model is evaluated on the Flickr 8k, Flickr 30k, and MS COCO datasets, and the results are compared and analysed against state-of-the-art methods. The results show that the proposed model outperforms conventional models, such as the encoder–decoder model and the attention model. Full article
(This article belongs to the Special Issue Recent Trends in Automatic Image Captioning Systems)
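
The inter- and intra-modality interactions described in the abstract can be pictured as a block that applies self-attention within each modality and cross-attention between them. The sketch below illustrates that composition in PyTorch; the layer sizes, normalization placement, and omitted feed-forward sublayers are editorial simplifications, not the paper's design.

```python
# Illustrative block combining intra-modality self-attention with inter-modality cross-attention.
import torch
import torch.nn as nn


class DualModalBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # Intra-modality interactions: self-attention within each modality.
        img = img + self.self_img(img, img, img)[0]
        txt = txt + self.self_txt(txt, txt, txt)[0]
        # Inter-modality interactions: each modality attends to the other.
        img = self.norm_img(img + self.cross_img(img, txt, txt)[0])
        txt = self.norm_txt(txt + self.cross_txt(txt, img, img)[0])
        return img, txt


if __name__ == "__main__":
    block = DualModalBlock()
    image_tokens = torch.randn(2, 49, 512)  # e.g. grid image features
    word_tokens = torch.randn(2, 12, 512)   # partial caption embeddings
    i, t = block(image_tokens, word_tokens)
    print(i.shape, t.shape)
```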
