Multimodal Data Analysis in Computer Vision

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: closed (20 June 2023) | Viewed by 3041

Special Issue Editors


Dr. Daniela Giordano
Guest Editor
Department of Electrical Electronics and Computer Engineering, University of Catania, Catania, Italy
Interests: computer vision; cognitive systems; natural and artificial learning; health informatics

Dr. Simone Palazzo
Guest Editor
Department of Electrical Electronics and Computer Engineering, University of Catania, Catania, Italy
Interests: computer vision; machine learning; video object segmentation; medical imaging

Special Issue Information

Dear Colleagues,

In less than a decade, breakthroughs in machine learning and artificial intelligence have profoundly impacted society, industry, and the scientific community, changing the way we work with large amounts of data. Computer vision is one of the research fields that has benefited the most from these advances, not least because it originally provided the ideal experimental setting for models such as deep convolutional neural networks, which now underpin most artificial intelligence approaches, even for non-visual tasks. Indeed, machine understanding of the visual world is approaching human performance, both in general scene comprehension tasks (e.g., object detection and classification, object tracking, semantic segmentation) and in specific domains, such as medicine.

However, while our machines are quickly improving at processing visual information, combining and correlating it with additional data sources remains a complex and unsolved challenge. Fusing multimodal sources with visual data has the potential to improve artificial visual systems by harnessing information that is complementary to, yet correlated with, the observed scene, thereby increasing the representational power of such models and their consistency across different aspects and perspectives of the real world. Well-known examples of this kind of multimodal analysis include visual–textual tasks (e.g., captioning, synthesis, question answering) and visual–audio tasks (e.g., speech recognition and audio source identification from videos), but new and emerging applications are attracting attention, such as autonomous driving (e.g., RGB/depth/LiDAR integration) and medicine (e.g., integration of X-ray/CT/MRI with laboratory tests). Moreover, several aspects of multimodal learning have not yet been sufficiently explored: How do learned visual representations and semantics improve when coupled with another modality? How can multimodal learning improve visual understanding in specific and diverse scenarios, such as affective computing, health monitoring, and interaction between humans and artificial agents? How can multimodal learning improve the interpretability of visual models?
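To make the fusion idea above concrete, the following is a minimal sketch of one common strategy, late fusion by concatenation of modality-specific embeddings. It is illustrative only: the PyTorch implementation, embedding dimensions, and class count are our assumptions, not something prescribed by this Special Issue.

    import torch
    import torch.nn as nn

    class LateFusionClassifier(nn.Module):
        """Fuses a visual embedding and a text embedding by concatenation,
        then classifies the joint representation."""
        def __init__(self, img_dim=512, txt_dim=300, hidden=256, n_classes=10):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, hidden)  # project image features
            self.txt_proj = nn.Linear(txt_dim, hidden)  # project text features
            self.head = nn.Sequential(
                nn.ReLU(),
                nn.Linear(2 * hidden, n_classes),       # classify the fused vector
            )

        def forward(self, img_feat, txt_feat):
            fused = torch.cat([self.img_proj(img_feat),
                               self.txt_proj(txt_feat)], dim=-1)
            return self.head(fused)

    # Toy usage: a batch of 4 precomputed image and caption embeddings.
    model = LateFusionClassifier()
    logits = model(torch.randn(4, 512), torch.randn(4, 300))
    print(logits.shape)  # torch.Size([4, 10])

Concatenation is the simplest fusion scheme; much of the research solicited here concerns richer alternatives, such as cross-modal attention or joint embedding spaces.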

The objective of this Special Issue, “Multimodal Data Analysis in Computer Vision”, is to attract high-quality, innovative, and original works related, but not restricted, to the above challenges.

Examples of the topics that we expect to cover are the following:

  • Machine learning methods for multimodal data;
  • Multimodal data fusion and representation;
  • Multimodal approaches for vision and/or text and/or audio;
  • Multimodal approaches for autonomous driving;
  • Multimodal approaches for machine learning in medicine;
  • Visual reasoning supported by multimodal data;
  • Multimodal computer vision for interactive systems;
  • Large-scale multimodal data analysis;
  • Multimodal summarization;
  • Novel benchmarks for multimodal information processing;
  • Visual model interpretability supported by multimodal data.

Dr. Daniela Giordano
Dr. Simone Palazzo
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Deep learning for multimodal data
  • Multimodal data fusion for visual applications
  • Multimodal visual embedding
  • Machine learning for video/text/audio/biosignals
  • Multimodal visual reasoning
  • Multimodal benchmarks
  • Multimodal monitoring and surveillance

Published Papers (1 paper)


Research

20 pages, 5008 KiB  
Article
Region Anomaly Detection via Spatial and Semantic Attributed Graph in Human Monitoring
by Kang Zhang, Muhammad Fikko Fadjrimiratno and Einoshin Suzuki
Sensors 2023, 23(3), 1307; https://doi.org/10.3390/s23031307 - 23 Jan 2023
Cited by 1 | Viewed by 1441
Abstract
This paper proposes a graph-based deep framework for detecting anomalous image regions in human monitoring. The most relevant previous methods, which adopt deep models to obtain salient regions with captions, focus on discovering anomalous single regions and anomalous region pairs. However, they cannot detect an anomaly involving more than two regions and have deficiencies in capturing interactions among humans and objects scattered in multiple regions. For instance, the region of a man making a phone call is normal when it is located close to a kitchen sink and a soap bottle, as they are in a resting area, but abnormal when close to a bookshelf and a notebook PC, as they are in a working area. To overcome this limitation, we propose a spatial and semantic attributed graph and develop a Spatial and Semantic Graph Auto-Encoder (SSGAE). Specifically, the proposed graph models the “context” of a region in an image by considering other regions with spatial relations, e.g., a man sitting on a chair is adjacent to a white desk, as well as other region captions with high semantic similarities, e.g., “a man in a kitchen” is semantically similar to “a white chair in the kitchen”. In this way, a region and its context are represented by a node and its neighbors, respectively, in the spatial and semantic attributed graph. Subsequently, SSGAE is devised to reconstruct the proposed graph to detect abnormal nodes. Extensive experimental results indicate that the AUC scores of SSGAE improve from 0.79 to 0.83, 0.83 to 0.87, and 0.91 to 0.93 compared with the best baselines on three real-world datasets.
(This article belongs to the Special Issue Multimodal Data Analysis in Computer Vision)
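As an illustration of the graph auto-encoding idea described in the abstract, below is a minimal sketch in which node attributes (e.g., fused visual and caption features of image regions) are propagated through a single graph-convolution layer, decoded back, and scored by per-node reconstruction error. This is not the authors' SSGAE implementation; the toy adjacency matrix, attribute dimensions, and single-layer architecture are illustrative assumptions.

    import torch
    import torch.nn as nn

    def normalize_adj(A):
        """Symmetrically normalize an adjacency matrix with self-loops."""
        A_hat = A + torch.eye(A.size(0))
        d = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(d.pow(-0.5))
        return D_inv_sqrt @ A_hat @ D_inv_sqrt

    class GraphAutoEncoder(nn.Module):
        """Encodes node attributes with one graph-convolution layer and
        reconstructs them; high reconstruction error flags anomalous nodes."""
        def __init__(self, in_dim, hidden_dim):
            super().__init__()
            self.enc = nn.Linear(in_dim, hidden_dim)
            self.dec = nn.Linear(hidden_dim, in_dim)

        def forward(self, X, A_norm):
            Z = torch.relu(A_norm @ self.enc(X))  # message passing + encoding
            return self.dec(A_norm @ Z)           # decode back to attributes

    # Toy graph: 5 region nodes with 16-d attributes; edges stand in for
    # spatial adjacency or caption similarity (both assumed precomputed).
    X = torch.randn(5, 16)
    A = torch.tensor([[0, 1, 1, 0, 0],
                      [1, 0, 1, 0, 0],
                      [1, 1, 0, 1, 0],
                      [0, 0, 1, 0, 1],
                      [0, 0, 0, 1, 0]], dtype=torch.float)
    model = GraphAutoEncoder(16, 8)
    X_rec = model(X, normalize_adj(A))
    anomaly_score = (X_rec - X).pow(2).mean(dim=1)  # per-node error
    print(anomaly_score)

In practice the model would be trained to minimize reconstruction error on normal scenes, so that regions whose context violates learned spatial and semantic regularities score highly at test time.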