Artificial Intelligence for Multimedia Signal Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 June 2021) | Viewed by 49162

Printed Edition Available!
A printed edition of this Special Issue is available.

Special Issue Editors


Prof. Dr. Byung-Gyu Kim
Guest Editor
Department of IT Engineering, Sookmyung Women’s University, Seoul 04310, Republic of Korea
Interests: image/video signal processing; pattern recognition; computer vision; deep learning; artificial intelligence

Prof. Dr. Dongsan Jun
Guest Editor
Department of Information and Communication Engineering, Kyungnam University, Changwon 51767, Republic of Korea
Interests: media; video coding; video compression; video encoder; image processing; realistic digital broadcasting system

Special Issue Information

Dear Colleagues,

At the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a global image recognition contest held in 2012, the University of Toronto SuperVision team led by Prof. Geoffrey Hinton took first and second place by a wide margin, sparking an explosion of interest in deep learning. Since then, global experts and companies such as Google, Microsoft, NVIDIA, and Intel have been competing to lead artificial intelligence technologies such as deep learning. They are now applying deep-learning-based technologies across all industries and solving many classification and recognition problems.

These artificial intelligence technologies are also being actively applied to broadcasting and multimedia processing. A great deal of research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years attempts have been made to improve the compression efficiency of image, video, speech, and other data in areas related to MPEG media processing technology. In addition, technologies for media creation, processing, editing, and scenario generation are very important research areas in multimedia processing and engineering.

This Special Issue invites contributions broadly across advanced computational intelligence algorithms and technologies for emerging multimedia signal processing. Specific topics include, but are not limited to, the following:

- Signal/image/video processing algorithms for advanced deep learning;

- Fast and low-complexity mechanisms based on deep neural networks;

- Protection technologies for privacy and personalized media data;

- Advanced circuit/system design and analysis based on deep neural networks;

- Image/video-based recognition algorithms using deep neural networks;

- Deep-learning-based speech and audio processing;

- Efficient multimedia sharing schemes using artificial intelligence;

- Artificial intelligence technologies for multimedia creation, processing, editing, and scenario generation;

- Deep-learning-based web data mining and representation.

Prof. Dr. Byung-Gyu Kim
Prof. Dr. Dongsan Jun
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Artificial/computational intelligence
  • Image/video/speech signal processing
  • Advanced deep learning
  • Learning mechanism
  • Multimedia processing

Published Papers (14 papers)


Editorial


3 pages, 174 KiB  
Editorial
Artificial Intelligence for Multimedia Signal Processing
by Byung-Gyu Kim and Dong-San Jun
Appl. Sci. 2022, 12(15), 7358; https://doi.org/10.3390/app12157358 - 22 Jul 2022
Cited by 1 | Viewed by 1184
Abstract
At the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a 2012 global image recognition contest, the University of Toronto SuperVision team led by Prof [...] Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)

Research


13 pages, 6937 KiB  
Article
Reduction of Compression Artifacts Using a Densely Cascading Image Restoration Network
by Yooho Lee, Sang-hyo Park, Eunjun Rhee, Byung-Gyu Kim and Dongsan Jun
Appl. Sci. 2021, 11(17), 7803; https://doi.org/10.3390/app11177803 - 25 Aug 2021
Cited by 2 | Viewed by 1991
Abstract
Since high-quality realistic media are widely used in various computer vision applications, image compression is one of the essential technologies to enable real-time applications. Image compression generally causes undesired compression artifacts, such as blocking artifacts and ringing effects. In this study, we propose a densely cascading image restoration network (DCRN), which consists of an input layer, a densely cascading feature extractor, a channel attention block, and an output layer. The densely cascading feature extractor has three densely cascading (DC) blocks, and each DC block contains two convolutional layers, five dense layers, and a bottleneck layer. To optimize the proposed network architectures, we investigated the trade-off between quality enhancement and network complexity. Experimental results revealed that the proposed DCRN can achieve a better peak signal-to-noise ratio and structural similarity index measure for compressed Joint Photographic Experts Group (JPEG) images compared to the previous methods. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
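To make the DC-block structure above concrete, the following is a minimal PyTorch sketch of one densely cascading block followed by a channel attention block; the layer widths, growth rate, and kernel sizes are illustrative assumptions rather than the values used in the paper.

```python
# Sketch of a densely cascading (DC) block (two conv layers, five dense layers,
# a 1x1 bottleneck) and a channel attention block. Dimensions are assumptions.
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_ch, growth):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, growth, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Concatenate the new features with the input (dense connectivity).
        return torch.cat([x, self.act(self.conv(x))], dim=1)

class DCBlock(nn.Module):
    def __init__(self, ch=64, growth=32, n_dense=5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        layers, c = [], ch
        for _ in range(n_dense):
            layers.append(DenseLayer(c, growth))
            c += growth
        self.dense = nn.Sequential(*layers)
        self.bottleneck = nn.Conv2d(c, ch, kernel_size=1)  # 1x1 bottleneck

    def forward(self, x):
        return self.bottleneck(self.dense(self.head(x)))

class ChannelAttention(nn.Module):
    def __init__(self, ch=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))  # re-weight channels

x = torch.randn(1, 64, 48, 48)
y = ChannelAttention()(DCBlock()(x))
print(y.shape)  # torch.Size([1, 64, 48, 48])
```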

17 pages, 6472 KiB  
Article
Context-Based Structure Mining Methodology for Static Object Re-Identification in Broadcast Content
by Krishna Kumar Thirukokaranam Chandrasekar and Steven Verstockt
Appl. Sci. 2021, 11(16), 7266; https://doi.org/10.3390/app11167266 - 06 Aug 2021
Cited by 1 | Viewed by 1342
Abstract
Technological advancement, in addition to the pandemic, has given rise to an explosive increase in the consumption and creation of multimedia content worldwide. This has motivated people to enrich and publish their content in a way that enhances the experience of the user. In this paper, we propose a context-based structure mining pipeline that not only attempts to enrich the content, but also simultaneously splits it into shots and logical story units (LSU). Subsequently, this paper extends the structure mining pipeline to re-ID objects in broadcast videos such as SOAPs. We hypothesise the object re-ID problem of SOAP-type content to be equivalent to the identification of reoccurring contexts, since these contexts normally have a unique spatio-temporal similarity within the content structure. By implementing pre-trained models for object and place detection, the pipeline was evaluated using metrics for shot and scene detection on benchmark datasets, such as RAI. The object re-ID methodology was also evaluated on 20 randomly selected episodes from broadcast SOAP shows New Girl and Friends. We demonstrate, quantitatively, that the pipeline outperforms existing state-of-the-art methods for shot boundary detection, scene detection, and re-identification tasks. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
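As an illustration of the shot-splitting step that such a pipeline builds on, the sketch below detects candidate shot boundaries from HSV-histogram differences between consecutive frames; this is a generic baseline, not the paper's context-based method, and the video path and similarity threshold are assumptions.

```python
# Illustrative baseline: flag a shot boundary when the HSV-histogram
# correlation between consecutive frames drops below a threshold.
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Correlation close to 1 means similar frames; a drop suggests a cut.
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries

print(detect_shot_boundaries("episode.mp4"))  # assumed example file
```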

13 pages, 4074 KiB  
Article
3D Avatar Approach for Continuous Sign Movement Using Speech/Text
by Debashis Das Chakladar, Pradeep Kumar, Shubham Mandal, Partha Pratim Roy, Masakazu Iwamura and Byung-Gyu Kim
Appl. Sci. 2021, 11(8), 3439; https://doi.org/10.3390/app11083439 - 12 Apr 2021
Cited by 13 | Viewed by 8281
Abstract
Sign language is a visual language for communication used by hearing-impaired people with the help of hand and finger movements. Indian Sign Language (ISL) is a well-developed and standard way of communication for hearing-impaired people living in India. However, other people who use spoken language always face difficulty while communicating with a hearing-impaired person due to lack of sign language knowledge. In this study, we have developed a 3D avatar-based sign language learning system that converts the input speech/text into corresponding sign movements for ISL. The system consists of three modules. Initially, the input speech is converted into an English sentence. Then, that English sentence is converted into the corresponding ISL sentence using the Natural Language Processing (NLP) technique. Finally, the motion of the 3D avatar is defined based on the ISL sentence. The translation module achieves a 10.50 SER (Sign Error Rate) score. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
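The text-to-ISL step can be illustrated with a toy rule-based sketch: ISL glosses are commonly written in subject-object-verb order with articles and linking verbs dropped, so the example below reorders an English sentence accordingly. This is a simplification of the paper's NLP module, and the small word lists are assumptions for the example.

```python
# Toy English-to-ISL gloss conversion: drop function words and move verbs to
# the end to approximate subject-object-verb ordering. Word lists are assumed.
STOP_WORDS = {"is", "am", "are", "was", "were", "a", "an", "the", "to", "of"}
VERBS = {"go", "going", "eat", "eating", "play", "playing", "read", "reading"}

def english_to_isl_gloss(sentence):
    words = [w for w in sentence.lower().strip(".?!").split() if w not in STOP_WORDS]
    verbs = [w for w in words if w in VERBS]
    others = [w for w in words if w not in VERBS]
    return " ".join(others + verbs).upper()

print(english_to_isl_gloss("I am going to the school."))  # -> "I SCHOOL GOING"
```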

13 pages, 401 KiB  
Article
A Novel 1-D CCANet for ECG Classification
by Ian-Christopher Tanoh and Paolo Napoletano
Appl. Sci. 2021, 11(6), 2758; https://doi.org/10.3390/app11062758 - 19 Mar 2021
Cited by 11 | Viewed by 2857
Abstract
This paper puts forward a 1-D convolutional neural network (CNN) that exploits a novel analysis of the correlation between the two leads of the noisy electrocardiogram (ECG) to classify heartbeats. The proposed method is one-dimensional, enabling complex structures while maintaining a reasonable computational complexity. It is based on the combination of elementary handcrafted time domain features, frequency domain features through spectrograms and the use of autoregressive modeling. On the MIT-BIH database, a 95.52% overall accuracy is obtained by classifying 15 types, whereas a 95.70% overall accuracy is reached when classifying 7 types from the INCART database. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
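The kinds of per-beat features mentioned above (time-domain statistics, a spectrogram, and autoregressive coefficients) can be sketched for a single ECG lead as follows; the sampling rate and AR order are assumptions, and the actual network additionally exploits the correlation between the two leads.

```python
# Per-beat feature extraction sketch for one ECG lead: time-domain statistics,
# an averaged spectrogram profile, and least-squares AR coefficients.
import numpy as np
from scipy.signal import spectrogram

def ar_coefficients(x, order=4):
    # Least-squares fit of an AR(order) model: x[n] ~ sum_k a_k * x[n-k].
    X = np.column_stack([x[order - k - 1:len(x) - k - 1] for k in range(order)])
    y = x[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def beat_features(beat, fs=360):
    time_feats = np.array([beat.mean(), beat.std(), beat.min(), beat.max()])
    _, _, sxx = spectrogram(beat, fs=fs, nperseg=64, noverlap=32)
    freq_feats = sxx.mean(axis=1)          # average spectral profile
    ar_feats = ar_coefficients(beat)
    return np.concatenate([time_feats, freq_feats, ar_feats])

beat = np.sin(np.linspace(0, 8 * np.pi, 300)) + 0.05 * np.random.randn(300)
print(beat_features(beat).shape)
```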

16 pages, 3719 KiB  
Article
Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech
by Yun Kyung Lee and Jeon Gue Park
Appl. Sci. 2021, 11(6), 2642; https://doi.org/10.3390/app11062642 - 16 Mar 2021
Cited by 4 | Viewed by 1953
Abstract
This paper addresses automatic proficiency evaluation and speech recognition for second language (L2) speech. The proposed method recognizes the speech uttered by the L2 speaker, measures a variety of fluency scores, and evaluates the proficiency of the speaker’s spoken English. Stress and rhythm scores are among the important factors used to evaluate fluency in spoken English and are computed by comparing the stress patterns and the rhythm distributions with those of native speakers. In order to compute the stress and rhythm scores even when the phonemic sequence of the L2 speaker’s English sentence differs from that of the native speaker, we align the phonemic sequences based on a dynamic time-warping approach. We also improve the performance of the speech recognition system for non-native speakers and compute fluency features more accurately by augmenting the non-native training dataset and training an acoustic model with the augmented dataset. In this work, we augment the non-native speech by converting some speech signal characteristics (style) while preserving its linguistic information. The proposed variational autoencoder (VAE)-based speech conversion network trains the conversion model by decomposing the spectral features of the speech into a speaker-invariant content factor and a speaker-specific style factor to estimate diverse and robust speech styles. Experimental results show that the proposed method effectively measures the fluency scores and generates diverse output signals. In addition, in the proficiency evaluation and speech recognition tests, the proposed method improves the proficiency score performance and speech recognition accuracy for all proficiency areas compared to a method employing conventional acoustic models. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
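The content/style decomposition described above can be sketched as a small VAE with separate content and style encoders; all dimensions are illustrative assumptions, and only the reconstruction and KL terms of the loss are shown.

```python
# VAE-style decomposition sketch: spectral frames are encoded into a content
# factor and a style factor, then decoded back. Dimensions are assumptions.
import torch
import torch.nn as nn

class SpeechVAE(nn.Module):
    def __init__(self, n_mels=80, content_dim=32, style_dim=16):
        super().__init__()
        self.content_enc = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                         nn.Linear(128, 2 * content_dim))
        self.style_enc = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                       nn.Linear(128, 2 * style_dim))
        self.dec = nn.Sequential(nn.Linear(content_dim + style_dim, 128),
                                 nn.ReLU(), nn.Linear(128, n_mels))

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar), mu, logvar

    def forward(self, frames):
        c, c_mu, c_logvar = self.sample(self.content_enc(frames))
        s, s_mu, s_logvar = self.sample(self.style_enc(frames))
        recon = self.dec(torch.cat([c, s], dim=-1))
        kl = -0.5 * torch.mean(1 + c_logvar - c_mu.pow(2) - c_logvar.exp()) \
             - 0.5 * torch.mean(1 + s_logvar - s_mu.pow(2) - s_logvar.exp())
        return recon, kl

model = SpeechVAE()
frames = torch.randn(8, 80)           # a batch of mel-spectral frames
recon, kl = model(frames)
loss = nn.functional.mse_loss(recon, frames) + 0.1 * kl
print(loss.item())
```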

11 pages, 1606 KiB  
Article
Rain Streak Removal for Single Images Using Conditional Generative Adversarial Networks
by Prasad Hettiarachchi, Rashmika Nawaratne, Damminda Alahakoon, Daswin De Silva and Naveen Chilamkurti
Appl. Sci. 2021, 11(5), 2214; https://doi.org/10.3390/app11052214 - 03 Mar 2021
Cited by 13 | Viewed by 2553
Abstract
Rapid developments in urbanization and smart city environments have accelerated the need to deliver safe, sustainable, and effective resource utilization and service provision and have thereby enhanced the need for intelligent, real-time video surveillance. Recent advances in machine learning and deep learning have the capability to detect and localize salient objects in surveillance video streams; however, several practical issues remain unaddressed, such as diverse weather conditions, recording conditions, and motion blur. In this context, image de-raining is an important issue that has been investigated extensively in recent years to provide accurate and quality surveillance in the smart city domain. Existing deep convolutional neural networks have obtained great success in image translation and other computer vision tasks; however, image de-raining is ill posed and has not been addressed in real-time, intelligent video surveillance systems. In this work, we propose to utilize the generative capabilities of recently introduced conditional generative adversarial networks (cGANs) as an image de-raining approach. We utilize the adversarial loss in GANs that provides an additional component to the loss function, which in turn regulates the final output and helps to yield better results. Experiments on both real and synthetic data show that the proposed method outperforms most of the existing state-of-the-art models in terms of quantitative evaluations and visual appearance. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
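The conditional-GAN objective described above pairs an adversarial term with an additional image-space term that regulates the de-rained output; the sketch below uses an L1 term in the style of pix2pix, with stub networks and an assumed weighting factor.

```python
# Conditional-GAN generator objective sketch: adversarial loss plus L1 loss.
# The networks are stand-in stubs; lambda_l1 is an assumed weight.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(64, 3, 3, padding=1))               # stub G
discriminator = nn.Sequential(nn.Conv2d(6, 64, 4, stride=2, padding=1),
                              nn.LeakyReLU(0.2), nn.Conv2d(64, 1, 4))   # stub D
bce = nn.BCEWithLogitsLoss()

def generator_loss(rainy, clean, lambda_l1=100.0):
    derained = generator(rainy)
    # The discriminator sees the rainy input concatenated with the candidate output.
    pred_fake = discriminator(torch.cat([rainy, derained], dim=1))
    adv = bce(pred_fake, torch.ones_like(pred_fake))
    l1 = nn.functional.l1_loss(derained, clean)
    return adv + lambda_l1 * l1

rainy = torch.randn(2, 3, 64, 64)
clean = torch.randn(2, 3, 64, 64)
print(generator_loss(rainy, clean).item())
```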

14 pages, 2752 KiB  
Article
Discovering Sentimental Interaction via Graph Convolutional Network for Visual Sentiment Prediction
by Lifang Wu, Heng Zhang, Sinuo Deng, Ge Shi and Xu Liu
Appl. Sci. 2021, 11(4), 1404; https://doi.org/10.3390/app11041404 - 04 Feb 2021
Cited by 14 | Viewed by 2149
Abstract
With the popularity of online opinion expressing, automatic sentiment analysis of images has gained considerable attention. Most methods focus on effectively extracting the sentimental features of images, such as enhancing local features through saliency detection or instance segmentation tools. However, as a high-level abstraction, sentiment is difficult to capture accurately from visual elements alone because of the “affective gap”. Previous works have overlooked the contribution of the interaction among objects to the image sentiment. We aim to utilize the interactive characteristics of objects in the sentimental space, inspired by the human sentimental principle that each object contributes to the overall sentiment. To achieve this goal, we propose a framework to leverage the sentimental interaction characteristic based on a Graph Convolutional Network (GCN). We first utilize an off-the-shelf tool to recognize objects and build a graph over them. Visual features represent nodes, and the emotional distances between objects act as edges. Then, we employ GCNs to obtain the interaction features among objects, which are fused with the CNN output of the whole image to predict the final results. Experimental results show that our method exceeds the state-of-the-art algorithms, demonstrating that the rational use of interaction features can improve performance for sentiment analysis. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
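The object-interaction modelling can be sketched as a single graph convolution over detected objects, whose pooled output is fused with a whole-image CNN feature; all dimensions and the two-class output below are assumptions for illustration.

```python
# GCN-based interaction sketch: objects are nodes with visual features, edge
# weights encode emotional distance, and the pooled node embeddings are fused
# with a whole-image CNN feature. All dimensions are assumptions.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # Symmetrically normalise the adjacency (with self-loops), then propagate.
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).pow(-0.5)
        a_norm = d.unsqueeze(1) * a * d.unsqueeze(0)
        return torch.relu(self.lin(a_norm @ x))

n_objects, node_dim, img_dim = 5, 256, 512
node_feats = torch.randn(n_objects, node_dim)     # per-object visual features
adj = torch.rand(n_objects, n_objects)            # emotional-distance weights
adj = (adj + adj.t()) / 2                         # make the graph undirected
image_feat = torch.randn(img_dim)                 # CNN feature of the whole image

gcn = GCNLayer(node_dim, 128)
interaction = gcn(node_feats, adj).mean(dim=0)    # pool node embeddings
classifier = nn.Linear(128 + img_dim, 2)          # positive / negative sentiment
logits = classifier(torch.cat([interaction, image_feat]))
print(logits)
```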

12 pages, 5708 KiB  
Article
Single Image Super-Resolution Method Using CNN-Based Lightweight Neural Networks
by Seonjae Kim, Dongsan Jun, Byung-Gyu Kim, Hunjoo Lee and Eunjun Rhee
Appl. Sci. 2021, 11(3), 1092; https://doi.org/10.3390/app11031092 - 25 Jan 2021
Cited by 22 | Viewed by 4374
Abstract
There are many studies that seek to enhance a low-resolution image to a high-resolution image in the area of super-resolution. As deep learning technologies have recently shown impressive results in the image interpolation and restoration field, recent studies are focusing on convolutional neural network (CNN)-based super-resolution schemes to surpass the conventional pixel-wise interpolation methods. In this paper, we propose two lightweight neural networks with a hybrid residual and dense connection structure to improve the super-resolution performance. In order to design the proposed networks, we extracted training images from the DIVerse 2K (DIV2K) image dataset and investigated the trade-off between the quality enhancement performance and network complexity under the proposed methods. The experimental results show that the proposed methods can significantly reduce both the inference time and the memory required to store parameters and intermediate feature maps, while maintaining similar image quality compared to the previous methods. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
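A hybrid residual/dense design of the kind described above can be sketched as a small residual dense block followed by sub-pixel (PixelShuffle) upsampling; channel counts, depth, and the x2 scale factor are illustrative assumptions.

```python
# Lightweight super-resolution sketch: residual dense blocks plus PixelShuffle
# upsampling. Channel counts and depth are assumptions, not the paper's values.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, ch=32, growth=16, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        c = ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(c, growth, 3, padding=1), nn.ReLU(inplace=True)))
            c += growth
        self.fuse = nn.Conv2d(c, ch, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))   # local residual

class LightweightSR(nn.Module):
    def __init__(self, ch=32, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.body = nn.Sequential(ResidualDenseBlock(ch), ResidualDenseBlock(ch))
        self.tail = nn.Sequential(nn.Conv2d(ch, 3 * scale * scale, 3, padding=1),
                                  nn.PixelShuffle(scale))

    def forward(self, lr):
        return self.tail(self.body(self.head(lr)))

lr = torch.randn(1, 3, 32, 32)
print(LightweightSR()(lr).shape)   # torch.Size([1, 3, 64, 64])
```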

15 pages, 1416 KiB  
Article
A Multi-Resolution Approach to GAN-Based Speech Enhancement
by Hyung Yong Kim, Ji Won Yoon, Sung Jun Cheon, Woo Hyun Kang and Nam Soo Kim
Appl. Sci. 2021, 11(2), 721; https://doi.org/10.3390/app11020721 - 13 Jan 2021
Cited by 20 | Viewed by 3330
Abstract
Recently, generative adversarial networks (GANs) have been successfully applied to speech enhancement. However, there still remain two issues that need to be addressed: (1) GAN-based training is typically unstable due to its non-convex property, and (2) most of the conventional methods do not fully take advantage of the speech characteristics, which could result in a sub-optimal solution. In order to deal with these problems, we propose a progressive generator that can handle the speech in a multi-resolution fashion. Additionally, we propose a multi-scale discriminator that discriminates the real and generated speech at various sampling rates to stabilize GAN training. The proposed structure was compared with the conventional GAN-based speech enhancement algorithms using the VoiceBank-DEMAND dataset. Experimental results showed that the proposed approach can make the training faster and more stable, which improves the performance on various metrics for speech enhancement. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
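The multi-scale idea can be sketched by applying the same style of waveform discriminator at several sampling rates obtained by average-pooling downsampling; the number of scales and the layer sizes are assumptions, not the paper's configuration.

```python
# Multi-scale discriminator sketch: the same discriminator architecture is
# applied to progressively downsampled waveforms. Layer sizes are assumptions.
import torch
import torch.nn as nn

class WaveDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, 15, stride=4, padding=7), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, 3, padding=1))

    def forward(self, wav):
        return self.net(wav)

class MultiScaleDiscriminator(nn.Module):
    def __init__(self, n_scales=3):
        super().__init__()
        self.discs = nn.ModuleList(WaveDiscriminator() for _ in range(n_scales))
        self.down = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, wav):
        outputs = []
        for disc in self.discs:
            outputs.append(disc(wav))
            wav = self.down(wav)   # halve the sampling rate for the next scale
        return outputs

wav = torch.randn(2, 1, 16000)     # a batch of 1-second, 16 kHz waveforms
for o in MultiScaleDiscriminator()(wav):
    print(o.shape)
```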

12 pages, 1140 KiB  
Article
Place Classification Algorithm Based on Semantic Segmented Objects
by Woon-Ha Yeo, Young-Jin Heo, Young-Ju Choi and Byung-Gyu Kim
Appl. Sci. 2020, 10(24), 9069; https://doi.org/10.3390/app10249069 - 18 Dec 2020
Cited by 7 | Viewed by 2121
Abstract
Scene or place classification is one of the important problems in image and video search and recommendation systems. Humans can understand the scene in which they are located, but it is difficult for machines to do so. Considering a scene image that contains several objects, humans recognize the scene based on these objects, especially background objects. Based on this observation, we propose an efficient scene classification algorithm for three different classes by detecting objects in the scene. We use a pre-trained semantic segmentation model to extract objects from an image. After that, we construct a weight matrix to better determine the scene class. Finally, we classify an image into one of three scene classes (i.e., indoor, nature, city) using the designed weight matrix. Our scheme outperforms several classification methods using convolutional neural networks (CNNs), such as VGG, Inception, ResNet, ResNeXt, Wide-ResNet, DenseNet, and MnasNet. The proposed model achieves 90.8% verification accuracy, an improvement of more than 2.8% over the existing CNN-based methods. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
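The weight-matrix voting described above can be sketched in a few lines: objects found by semantic segmentation vote for a scene class through a weight matrix, and the class with the highest score wins. The object list and weight values below are assumptions for illustration, not the paper's designed matrix.

```python
# Scene classification sketch: segmented objects vote for indoor/nature/city
# through a weight matrix. Objects and weights are illustrative assumptions.
import numpy as np

CLASSES = ["indoor", "nature", "city"]
OBJECTS = ["wall", "ceiling", "tree", "grass", "building", "road", "sky"]

# weights[i, j]: how strongly object i supports scene class j
WEIGHTS = np.array([
    [0.9, 0.0, 0.1],   # wall
    [1.0, 0.0, 0.0],   # ceiling
    [0.0, 0.8, 0.2],   # tree
    [0.0, 1.0, 0.0],   # grass
    [0.1, 0.0, 0.9],   # building
    [0.0, 0.1, 0.9],   # road
    [0.0, 0.6, 0.4],   # sky
])

def classify_scene(pixel_fractions):
    """pixel_fractions: fraction of the image covered by each object."""
    scores = pixel_fractions @ WEIGHTS
    return CLASSES[int(np.argmax(scores))], scores

# Example: segmentation found mostly building, road, and some sky.
fractions = np.array([0.0, 0.0, 0.05, 0.0, 0.45, 0.3, 0.2])
print(classify_scene(fractions))   # ('city', ...)
```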

20 pages, 4670 KiB  
Article
Recommendations for Different Tasks Based on the Uniform Multimodal Joint Representation
by Haiying Liu, Sinuo Deng, Lifang Wu, Meng Jian, Bowen Yang and Dai Zhang
Appl. Sci. 2020, 10(18), 6170; https://doi.org/10.3390/app10186170 - 04 Sep 2020
Cited by 4 | Viewed by 2060
Abstract
Content curation social networks (CCSNs), such as Pinterest and Huaban, are interest driven and content centric. On CCSNs, user interests are represented by a set of boards, and a board is composed of various pins. A pin is an image with a description. All entities, such as users, boards, and categories, can be represented as a set of pins. Therefore, it is possible to implement entity representation and the corresponding recommendations on a uniform representation space from pins. Furthermore, lots of pins are re-pinned from others and the pin’s re-pin sequences are recorded on CCSNs. In this paper, a framework which can learn the multimodal joint representation of pins, including text representation, image representation, and multimodal fusion, is proposed. Image representations are extracted from a multilabel convolutional neural network. The multiple labels of pins are automatically obtained by the category distributions in the re-pin sequences, which benefits from the network architecture. Text representations are obtained with the word2vec tool. Two modalities are fused with a multimodal deep Boltzmann machine. On the basis of the pin representation, different recommendation tasks are implemented, including recommending pins or boards to users, recommending thumbnails to boards, and recommending categories to boards. Experimental results on a dataset from Huaban demonstrate that the multimodal joint representation of pins contains the information of user interests. Furthermore, the proposed multimodal joint representation outperformed unimodal representation in different recommendation tasks. Experiments were also performed to validate the effectiveness of the proposed recommendation methods. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
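A much-simplified sketch of the representation idea follows: each pin gets a joint embedding (here just the concatenation of its image and text vectors, whereas the paper fuses them with a multimodal deep Boltzmann machine), a board or user is the mean of its pins, and recommendation is nearest-neighbour search by cosine similarity. All vectors below are random placeholders.

```python
# Uniform-representation recommendation sketch: pins, users, and boards live in
# one embedding space; recommendation ranks candidates by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)

def pin_embedding(image_vec, text_vec):
    v = np.concatenate([image_vec, text_vec])   # simplified multimodal fusion
    return v / np.linalg.norm(v)

def entity_embedding(pin_vectors):
    v = np.mean(pin_vectors, axis=0)            # a user/board is a set of pins
    return v / np.linalg.norm(v)

def recommend(user_vec, candidate_vecs, top_k=3):
    sims = candidate_vecs @ user_vec            # cosine similarity (unit vectors)
    return np.argsort(-sims)[:top_k]

pins = [pin_embedding(rng.normal(size=128), rng.normal(size=64)) for _ in range(20)]
user = entity_embedding(pins[:5])               # a user represented by 5 pins
candidates = np.stack(pins[5:])
print(recommend(user, candidates))              # indices of recommended pins
```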

16 pages, 827 KiB  
Article
The Application and Improvement of Deep Neural Networks in Environmental Sound Recognition
by Yu-Kai Lin, Mu-Chun Su and Yi-Zeng Hsieh
Appl. Sci. 2020, 10(17), 5965; https://doi.org/10.3390/app10175965 - 28 Aug 2020
Cited by 8 | Viewed by 2101
Abstract
Neural networks have achieved great results in sound recognition, and many different kinds of acoustic features have been tried as the training input for the network. However, there is still doubt about whether a neural network can efficiently extract features from the raw audio signal input. This study improved on the raw-signal-input networks from other research using deeper network architectures. The raw signals could be better analyzed in the proposed network. We also presented a discussion of several kinds of network settings, and with the spectrogram-like conversion, our network could reach an accuracy of 73.55% on the open audio dataset “Dataset for Environmental Sound Classification 50” (ESC50). This study also proposed a network architecture that could combine different kinds of network feeds with different features. With the help of global pooling, a flexible fusion scheme was integrated into the network. Our experiment successfully combined two different networks with different audio feature inputs (a raw audio signal and the log-mel spectrum). Using the above settings, the proposed ParallelNet finally reached an accuracy of 81.55% on ESC50, which also reaches the recognition level of human beings. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
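The fusion-by-global-pooling idea can be sketched with two branches, one for the raw waveform and one for the log-mel spectrogram, whose globally pooled features are concatenated before classification; the 50-class output follows ESC50, while the layer sizes are assumptions.

```python
# Two-branch fusion sketch: a raw-waveform branch and a log-mel branch are each
# reduced by global average pooling and concatenated. Layer sizes are assumptions.
import torch
import torch.nn as nn

class ParallelNetSketch(nn.Module):
    def __init__(self, n_classes=50):
        super().__init__()
        self.raw_branch = nn.Sequential(
            nn.Conv1d(1, 32, 64, stride=8, padding=32), nn.ReLU(),
            nn.Conv1d(32, 64, 16, stride=4, padding=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))                    # global pooling
        self.mel_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                    # global pooling
        self.classifier = nn.Linear(64 + 64, n_classes)

    def forward(self, waveform, log_mel):
        a = self.raw_branch(waveform).flatten(1)
        b = self.mel_branch(log_mel).flatten(1)
        return self.classifier(torch.cat([a, b], dim=1))

waveform = torch.randn(4, 1, 22050)        # ~1 s of raw audio
log_mel = torch.randn(4, 1, 64, 128)       # log-mel spectrogram
print(ParallelNetSketch()(waveform, log_mel).shape)   # torch.Size([4, 50])
```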

18 pages, 7190 KiB  
Article
Human Height Estimation by Color Deep Learning and Depth 3D Conversion
by Dong-seok Lee, Jong-soo Kim, Seok Chan Jeong and Soon-kak Kwon
Appl. Sci. 2020, 10(16), 5531; https://doi.org/10.3390/app10165531 - 10 Aug 2020
Cited by 10 | Viewed by 10844
Abstract
In this study, an estimation method for human height is proposed using color and depth information. Color images are used for deep learning by mask R-CNN to detect a human body and a human head separately. If color images are not available for extracting the human body region due to a low-light environment, the human body region is extracted by comparing the current frame of the depth video with a pre-stored background depth image. The topmost point of the human head region is taken as the top of the head and the bottommost point of the human body region as the bottom of the foot. The depth value of the head-top point is corrected to a pixel value that has high similarity to a neighboring pixel. The position of the body-bottom point is corrected by calculating a depth gradient between vertically adjacent pixels. The head-top and foot-bottom points are converted into 3D real-world coordinates using the depth information, and the human height is estimated as the Euclidean distance between the two real-world coordinates. Estimation errors for human height are corrected as the average of accumulated heights. In the experimental results, the estimation errors of human height for a standing person are 0.7% and 2.2% when the human body region is extracted by mask R-CNN and by the background depth image, respectively. Full article
(This article belongs to the Special Issue Artificial Intelligence for Multimedia Signal Processing)
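The depth-to-3D conversion and height measurement can be sketched with a pinhole back-projection of the head-top and foot-bottom pixels followed by a Euclidean distance; the camera intrinsic parameters below are assumptions for the example.

```python
# Height estimation sketch: back-project the head-top and foot-bottom pixels to
# 3D camera coordinates using depth and intrinsics (assumed values), then take
# the Euclidean distance between the two 3D points.
import numpy as np

def to_camera_coords(u, v, depth_mm, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth in millimetres."""
    x = (u - cx) * depth_mm / fx
    y = (v - cy) * depth_mm / fy
    return np.array([x, y, depth_mm])

def estimate_height(head_px, foot_px, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    head = to_camera_coords(*head_px, fx=fx, fy=fy, cx=cx, cy=cy)
    foot = to_camera_coords(*foot_px, fx=fx, fy=fy, cx=cx, cy=cy)
    return np.linalg.norm(head - foot)           # height in millimetres

# Example: head and foot pixels (u, v, depth) from the corrected depth map.
head_point = (320, 60, 2500.0)
foot_point = (322, 430, 2600.0)
print(f"estimated height: {estimate_height(head_point, foot_point):.0f} mm")
```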
