Article

Semantic and Geometric-Aware Day-to-Night Image Translation Network

1 Emerging Design and Informatics Course, Graduate School of Interdisciplinary Information Studies, The University of Tokyo, 4 Chome-6-1 Komaba, Meguro-ku, Tokyo 153-0041, Japan
2 Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo, 7 Chome-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
3 Mitsubishi Heavy Industries Machinery Systems, Ltd., 1 Chome-1-1 Wadasaki-cho, Hyogo-ku, Kobe 652-8585, Japan
4 Mitsubishi Heavy Industries, Ltd., 1 Chome-1-1 Wadasaki-cho, Hyogo-ku, Kobe 652-8585, Japan
5 Institute of Industrial Science, The University of Tokyo, 4 Chome-6-1 Komaba, Meguro-ku, Tokyo 153-0041, Japan
* Authors to whom correspondence should be addressed.
Sensors 2024, 24(4), 1339; https://doi.org/10.3390/s24041339
Submission received: 12 January 2024 / Revised: 10 February 2024 / Accepted: 18 February 2024 / Published: 19 February 2024
(This article belongs to the Special Issue Advances in Sensing, Imaging and Computing for Autonomous Driving)

Abstract: Autonomous driving systems heavily depend on perception tasks for optimal performance. However, the prevailing datasets are primarily focused on scenarios with clear visibility (i.e., sunny and daytime). This concentration poses challenges in training deep-learning-based perception models for environments with adverse conditions (e.g., rainy and nighttime). In this paper, we propose an unsupervised network designed for day-to-night image translation to solve the ill-posed problem of learning the mapping between domains with unpaired data. The proposed method involves extracting both semantic and geometric information from input images in the form of attention maps. We assume that the multi-task network can extract semantic and geometric information during the estimation of semantic segmentation and depth maps, respectively. The image-to-image translation network integrates the two distinct types of extracted information, employing them as spatial attention maps. We compare our method with related works both qualitatively and quantitatively. The proposed method shows both qualitative and quantitative improvements over related work.

1. Introduction

Autonomous driving systems require effective and secure operation under various visibility conditions. The functionality of these systems is heavily dependent on their perception tasks, which have seen significant improvements in accuracy through advances in deep learning in recent years. Despite these advances, challenges persist in addressing perception tasks under poor visibility conditions (e.g., nighttime, rain, and fog). The primary obstacle stems from an imbalance in the amount of available data for each scenario. Deep-learning-based models, reliant on substantial datasets and annotations for training, often encounter difficulties due to the scarcity of relevant data for adverse visibility situations. Although numerous datasets have been created, most are concentrated on clear daytime conditions, making it impractical to collect and annotate data for every conceivable traffic scene and visibility scenario.
To address this challenge, researchers [1,2,3,4,5,6,7] have increasingly utilized synthetic data (e.g., computer graphic images from sources such as video games and simulators) to diversify the datasets. Despite the advantages of easy dataset creation for various scenarios, there remains a disparity between synthetic and real-world data. Consequently, deep-learning-based models (e.g., depth estimation, semantic segmentation, and camera pose estimation) trained with synthetic data may exhibit decreased performance in real-world applications. Efforts to enhance the photorealism [8,9,10,11] have been made, but the usability of models trained on synthetic data for autonomous driving systems remains a challenge.
In contrast, day-to-night image translation offers a solution by creating realistic nighttime data while preserving the objects, structure, and perspective. This process involves translating annotated daytime images into nighttime images, enabling the utilization of daytime image labels for the translated nighttime images. This facilitates the creation of nighttime datasets.
Numerous contemporary image translation methods leverage generative adversarial networks (GANs) [12], a robust framework for the training of generative models. However, it is challenging to obtain paired data for model training in traffic scenes (i.e., daytime and nighttime image pairs where every corresponding point is the same, except for the time of day). Consequently, this paper adopts unsupervised image-to-image translation methods to address the lack of paired data.
In this paper, we introduce an unsupervised day-to-night image translation model based on GANs [12] as a data augmentation technique. Translating daytime images to nighttime is a formidable challenge: it requires not only accurate color adjustment but also the consideration of semantic and geometric information at the pixel level. The goal is to transform semantic and geometric information consistently while allowing for diverse style conversion. As shown in Figure 1, our model is built on the hypothesis that semantic and geometric information can be extracted through semantic segmentation and depth estimation, respectively. We first train a multi-task network that estimates semantic segmentation and depth (Figure 2a). The trained parameters of the multi-task network are then used to initialize the encoder and decoders of the image translation networks (Figure 2b). Each attention module infers an attention map using the feature map extracted from the decoder as input. Leveraging the capacity of the semantic segmentation and depth multi-task estimation network to extract vital semantic and geometric information, the attention modules infer semantic and geometric attention maps along the spatial dimension.
Our contributions can be summarized as follows.
  • We propose a semantic- and geometric-aware image-to-image translation network that adopts semantic segmentation and depth estimation guided attention modules. To the best of our knowledge, this is the first work that utilizes both semantic segmentation and depth information in image-to-image translation.
  • We introduce the semantic segmentation and depth estimation guided attention modules and adopt them for image-to-image translation. Our method does not require annotations for the target domain.
  • The proposed method generates better results both quantitatively and qualitatively in our experiments; it outperforms the related work in two distinct evaluation metrics.
Our method is trained on two different authentic datasets (i.e., Berkeley Deep Drive [13] and Cityscapes [14]) simultaneously.

2. Related Work

2.1. Generative Adversarial Networks (GANs)

Generative models within the realm of deep learning, particularly those based on the framework of generative adversarial networks (GANs) [12], have received significant attention. The fundamental structure of GANs [12] involves two adversarial networks (i.e., a generator and a discriminator) engaged in a competitive training process. The generator aims to produce data that the discriminator perceives as real, leading to a continuous interplay between the two networks. Subsequent to the introduction of GANs [12], various enhancements and alternative versions have been proposed to further improve their capabilities. cGAN [15] is one such advancement, introducing a conditional approach by incorporating additional input layers to condition the data. This allows the explicit generation of outputs based on specified conditions. Combining GANs [12] with auto-encoders [16], VAE/GAN [17] and VEEGAN [18] represent innovative approaches. These models leverage the strengths of both GANs and auto-encoders, with the aim of enhancing the overall generative process. In pursuit of better training objectives, alternative loss functions have been explored. LSGAN [19] addresses the vanishing gradients problem by utilizing the least-squares loss function for the discriminator. This adjustment helps to stabilize the training process and improve the overall performance of the GAN [12] model. WGAN [20] introduces a different training objective by adopting the Wasserstein distance between the distributions of generated and real data. This alternative approach aims to overcome the limitations associated with traditional GANs’ training objectives.

2.2. Image-to-Image Translation Network

Pix2Pix [21] made significant strides as the initial unified framework for paired image-to-image translation, using cGAN [15]. In more recent developments, there has been a shift towards unsupervised image-to-image translation methods, which operate without a reliance on paired data. To tackle the inherent challenges of this ill-posed problem, various approaches have adopted the cycle consistency constraint. This constraint ensures that the translated data can be accurately reconstructed back to the source domain [22,23,24,25]. Some methods assume a shared latent space among images in different domains. CoGAN [26] features two generators with shared weights, producing images from different domains using the same random noise. UNIT [27], building upon CoGAN and incorporating VAE/GAN [17], maps each domain into a common latent space. Additionally, the exploration of multimodal image-to-image translation methods has gained traction [28,29,30,31,32,33]. However, these methods tend to exhibit suboptimal results when confronted with images from domains with substantial differences, such as daytime and nighttime, as they often lose instance-level information.

2.3. Day-to-Night Image Translation Network

Focusing on domain adaptation between daytime and nighttime images is crucial in enhancing the performance of various perception tasks, such as object detection [34], semantic segmentation [35,36], and localization [37]. Some works have attempted to boost deep network training for specific [38] or multiple [39] tasks. Some methods utilize semantic segmentation for additional information. SG-GAN [40] adopts semantic-aware discriminators, using semantic information to distinguish generated images from real ones. SemGAN [41] and Ramirez et al. [42] take a distinctive approach by inferring semantic segmentation from the translated images, thereby enforcing semantic consistency during the translation process. This emphasis on semantic information contributes to the overall perceptual coherence of the generated images. AugGAN [43,44] is a multi-task network designed for both day-to-night image translation and semantic segmentation estimation. This integrated approach reflects a comprehensive strategy in which the generator simultaneously learns the information about image translation between day and night and semantic segmentation.

2.4. Attention Mechanism

Attention mechanisms weight the parameters of deep learning models based on features extracted from input images. RAM [45] was initially proposed in the field of computer vision, introducing a method to recurrently estimate spatial attention and update the network. SENet [46] and ECA Net [47] introduced channel attention networks. Subsequent research has demonstrated advancements in spatial [48,49] or channel [50,51] attention. Some studies have proposed inferring attention maps along spatial and channel dimensions [52,53,54,55,56]. More recently, self-attention [57,58,59] and Transformer [60,61,62] have been introduced into computer vision, rapidly advancing the field.

3. Proposed Method

In this section, we propose a semantic- and geometric-aware day-to-night image translation method based on the CycleGAN framework [22]. When translating daytime images to nighttime ones, both descriptions of light sources with semantic properties and expressions of darkness according to geometric distances are required. As shown in Figure 2, the proposed method aims to extract semantic and geometric information from input images and apply it to the image-to-image translation network with the attention mechanism.
We assume that semantic and geometric information can be acquired from semantic segmentation and depth estimation processes, respectively. We first train the semantic segmentation and depth multi-task network. Afterwards, this trained multi-task network is utilized in the image translation phase to extract semantic and geometric information as feature maps. Subsequently, semantic and geometric attention maps are generated from the feature maps of the decoders. Finally, the attention maps are applied to the image-to-image translation networks.
Here, we denote the domains of the daytime and nighttime RGB images by X and Y, respectively.

3.1. Semantic Segmentation and Depth Estimation

Figure 2a provides an overview of the multi-task network for the estimation of semantic segmentation and depth. The multi-task network first encodes authentic daytime images $x \in X$ into latent representations via the encoder $E_{Mul}$; then, the decoders $D_{Seg}$ and $D_{Dep}$ estimate semantic segmentation maps $s_X$ and depth maps $d_X$, respectively, from the latent representation: (1) the semantic segmentation map $s_X = D_{Seg}(E_{Mul}(x))$ and (2) the depth map $d_X = D_{Dep}(E_{Mul}(x))$.
The trained parameters of the multi-task network are utilized in the image-to-image translation network.
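As a concrete illustration of this two-headed design, the following is a minimal PyTorch sketch of a shared encoder with separate segmentation and depth decoders. The module name (MultiTaskNet), layer widths, and class count are illustrative assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared encoder E_Mul with segmentation (D_Seg) and depth (D_Dep) heads.
    Layer widths and depths are placeholder assumptions for illustration."""

    def __init__(self, num_classes: int = 19):
        super().__init__()
        # E_Mul: a small convolutional encoder (downsamples by 4)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
        )
        # D_Seg: predicts per-pixel class logits
        self.seg_decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )
        # D_Dep: predicts a single-channel depth map
        self.dep_decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)        # latent representation E_Mul(x)
        s = self.seg_decoder(z)    # s_X = D_Seg(E_Mul(x))
        d = self.dep_decoder(z)    # d_X = D_Dep(E_Mul(x))
        return s, d
```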

3.2. Image Translation

The overall framework of our proposed method is depicted in Figure 2b. The framework consists of two opposite cycles (i.e., day-to-night cycle and night-to-day cycle) and each of them is a coupled image-to-image translation network. For each cycle, we refer to the translation from real images as translation and the translation from translated images as reconstruction.

3.2.1. Image-to-Image Translation Network

The day-to-night image translation network consists of one encoder $E_X$, one generator $G_X$, two decoders for semantic segmentation $D_X^{Seg}$ and depth $D_X^{Dep}$, and four CBAM [55] modules $\{A_X^k\}_{k \in \{Seg1, Dep1, Seg2, Dep2\}}$. The encoder $E_X$ and the decoders $D_X^{Seg}$ and $D_X^{Dep}$ utilize the parameters of the encoder $E_{Mul}$ and the decoders $D_{Seg}$ and $D_{Dep}$ of Section 3.1.
As shown in Figure 3, the decoders $D_X^{Seg}$ and $D_X^{Dep}$ are utilized for the extraction of feature maps $\{f_X^k \in \mathbb{R}^{H \times W \times C}\}_{k \in \{Seg1, Dep1, Seg2, Dep2\}}$. Then, the attention maps are generated from these feature maps: $\{A_X^k(f_X^k) =: m_X^k \in \mathbb{R}^{H \times W}\}_{k \in \{Seg1, Dep1, Seg2, Dep2\}} =: M_X$. The inferred attention maps $M_X$ are applied to the generator $G_X$ by pixel-wise multiplication. Additionally, channel attention maps are inferred and implemented within the generator $G_X$.
In each cycle, authentic images $\{x, y\}$ are translated into $\{\bar{y}, \bar{x}\}$ by the image-to-image translation networks with the semantic and geometric attention maps: $\bar{y} = G_X(E_X(x), M_X)$, $\bar{x} = G_Y(E_Y(y), M_Y)$.
The night-to-day image translation network is structured in the same way. Here, the transferred parameters of the day-to-night image translation network $E_X$, $D_X^{Seg}$, and $D_X^{Dep}$ are fixed, and only those of the night-to-day image translation network $E_Y$, $D_Y^{Seg}$, and $D_Y^{Dep}$ are retrained during the training of the image translation network.
Moreover, discriminators $\{Disc_X, Disc_Y\}$ are defined for each domain to determine whether a daytime or nighttime image is real or fake.
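The sketch below illustrates, under stated assumptions, how a CBAM-style spatial attention module could infer an attention map from a decoder feature map and gate a same-sized generator feature map by pixel-wise multiplication. The class name and tensor shapes are hypothetical; only the general CBAM spatial-attention pattern is taken from [55].

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pools features along channels and
    predicts a single-channel map in [0, 1] (one of the A_X^k modules)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):                   # f: decoder feature map, (B, C, H, W)
        avg = f.mean(dim=1, keepdim=True)   # channel-wise average pooling
        mx, _ = f.max(dim=1, keepdim=True)  # channel-wise max pooling
        m = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return m                            # attention map m_X^k, (B, 1, H, W)

# Applying a semantic or geometric attention map to a generator feature map
# (pixel-wise multiplication, broadcast over channels). Shapes are illustrative.
attn = SpatialAttention()
decoder_feat = torch.randn(1, 128, 64, 64)  # f_X^k from D_X^Seg or D_X^Dep
gen_feat = torch.randn(1, 128, 64, 64)      # same-sized feature inside G_X
gated = gen_feat * attn(decoder_feat)       # attention-weighted generator feature
```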

3.2.2. Sharing of Semantic and Geometric Feature Maps

Sharing semantic and geometric information during the image translation cycle is considered plausible, given the consistency observed in most scene elements before and after the image translation. However, challenges arise because the multi-task networks $E_{Mul}$, $D_{Seg}$, and $D_{Dep}$ are not trained on nighttime images, which can cause poor accuracy in semantic segmentation and depth estimation from nighttime images. Therefore, during the reconstruction phase, from translated images $\{\bar{y}, \bar{x}\}$ to reconstructed images $\{\hat{x}, \hat{y}\}$, the feature maps of the semantic segmentation and depth decoders are shared only within the day-to-night cycle. In contrast, these feature maps are separately estimated in the night-to-day cycle, as shown in Figure 2b: $\hat{x} = G_Y(E_Y(\bar{y}), M_X)$, $\hat{y} = G_X(E_X(\bar{x}), \bar{M}_X)$, where $\bar{M}_X = \{A_X^k(\bar{f}_X^k)\}_{k \in \{Seg1, Dep1, Seg2, Dep2\}}$, and $\bar{f}_X^k$ is the feature map extracted from the translated images $\bar{x}$.
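A minimal sketch of the two reconstruction pipelines described above is given below. The helper names (infer_attention_maps, infer_attention_maps_Y) are hypothetical placeholders for running the corresponding decoders and CBAM modules; they are not functions from the authors' code.

```python
# Hypothetical helpers illustrating feature-map sharing versus re-estimation.

def day_to_night_cycle(x, E_X, G_X, E_Y, G_Y, infer_attention_maps):
    M_X = infer_attention_maps(x)      # attention maps from the real daytime image
    y_fake = G_X(E_X(x), M_X)          # translation: day -> night
    x_rec = G_Y(E_Y(y_fake), M_X)      # reconstruction reuses (shares) M_X
    return y_fake, x_rec

def night_to_day_cycle(y, E_X, G_X, E_Y, G_Y,
                       infer_attention_maps_Y, infer_attention_maps):
    M_Y = infer_attention_maps_Y(y)         # maps for the night -> day translation
    x_fake = G_Y(E_Y(y), M_Y)               # translation: night -> day
    M_X_bar = infer_attention_maps(x_fake)  # maps re-estimated from the translated daytime image
    y_rec = G_X(E_X(x_fake), M_X_bar)       # reconstruction uses the freshly estimated maps
    return x_fake, y_rec
```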

3.3. Training Networks

The proposed method follows a two-step training process. In the initial step, the multi-task networks responsible for estimating semantic segmentation and depth maps (i.e., $\{E_{Mul}, D_{Seg}, D_{Dep}\}$) are trained exclusively on daytime data. In the subsequent step, the parameters of the initially trained networks $\{E_{Mul}, D_{Seg}, D_{Dep}\}$ are transferred to the image translation networks $\{E_X, E_Y\}$, $\{D_X^{Seg}, D_Y^{Seg}\}$, and $\{D_X^{Dep}, D_Y^{Dep}\}$. Following this transfer, the day-to-night and night-to-day image translation networks $\{T_{XY} := G_X(E_X(\cdot)), T_{YX} := G_Y(E_Y(\cdot))\}$ and the discriminators $\{Disc_X, Disc_Y\}$ are trained using the CycleGAN framework [22]. During the training of the image translation networks, the transferred parameters of the daytime domain networks, specifically $\{E_X, D_X^{Seg}, D_X^{Dep}\}$, remain fixed, while the networks associated with the nighttime domain, denoted as $\{E_Y, D_Y^{Seg}, D_Y^{Dep}\}$, undergo retraining. This selective retraining allows the model to adapt and fine-tune its parameters for the unique characteristics and challenges posed by nighttime data.
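Assuming modules shaped like the earlier MultiTaskNet sketch, the parameter transfer and selective freezing could look as follows in PyTorch; the variable names are illustrative.

```python
# `multitask` is the trained step-one network; E_X/E_Y, D_X_seg/D_Y_seg and
# D_X_dep/D_Y_dep are translation-stage modules with identical architectures.

E_X.load_state_dict(multitask.encoder.state_dict())
E_Y.load_state_dict(multitask.encoder.state_dict())
D_X_seg.load_state_dict(multitask.seg_decoder.state_dict())
D_Y_seg.load_state_dict(multitask.seg_decoder.state_dict())
D_X_dep.load_state_dict(multitask.dep_decoder.state_dict())
D_Y_dep.load_state_dict(multitask.dep_decoder.state_dict())

# Daytime-domain copies stay fixed; nighttime-domain copies are fine-tuned.
for module in (E_X, D_X_seg, D_X_dep):
    for p in module.parameters():
        p.requires_grad = False
```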

3.3.1. Semantic Segmentation and Depth Estimation Network

For the first step, we train the multi-task networks $\{E_{Mul}, D_{Seg}, D_{Dep}\}$ on the daytime images $x$. In this step, we employ a multi-class cross-entropy loss $l_{mce}$ for the training of semantic segmentation and an L2 loss for the training of depth estimation. Let $S$ and $D$ be the domains of the semantic segmentation and depth labels, respectively. The loss function for each task is given by
$$\mathcal{L}_{Seg} = \mathbb{E}_{x \in X,\, s \in S}\big[\, l_{mce}\big( D_{Seg}(E_{Mul}(x)),\, s \big) \big]$$
$$\mathcal{L}_{Dep} = \mathbb{E}_{x \in X,\, d \in D}\big[\, \big\| D_{Dep}(E_{Mul}(x)) - d \big\|_2 \big]$$
The overall loss function is as follows:
$$\mathcal{L}_{Mul} = \mathcal{L}_{Seg} + \mathcal{L}_{Dep}$$
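A minimal sketch of this combined objective, assuming the MultiTaskNet interface from the earlier sketch, with cross-entropy standing in for $l_{mce}$ and mean-squared error standing in for the L2 depth term:

```python
import torch
import torch.nn.functional as F

def multitask_loss(model, x, seg_target, dep_target):
    """L_Mul = L_Seg + L_Dep for one batch.
    seg_target: (B, H, W) integer class labels; dep_target: (B, 1, H, W) depth."""
    seg_logits, dep_pred = model(x)
    loss_seg = F.cross_entropy(seg_logits, seg_target)  # multi-class cross-entropy l_mce
    loss_dep = F.mse_loss(dep_pred, dep_target)          # L2-style loss on depth
    return loss_seg + loss_dep
```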

3.3.2. Image-to-Image Translation Network

As the second step, we train the image-to-image translation networks. The image translation generators adopt the attention maps inferred from the feature maps of the semantic segmentation and depth estimation networks. The attention modules (CBAM [55]) are jointly trained with the encoders and generators. Following the CycleGAN framework [22], we adopt the following loss functions to train the image translation network with unpaired data.

Adversarial Loss

Adversarial losses are implemented for both the day-to-night and night-to-day image translation networks, seeking to minimize the distributional gap between translated images and targets. The objectives of the image translation networks $\{T_{XY}, T_{YX}\}$ and their corresponding discriminators $\{Disc_Y, Disc_X\}$ are expressed as follows:
$$\mathcal{L}_{adv1} = \mathbb{E}_{y \in Y}\big[\log Disc_Y(y)\big] + \mathbb{E}_{x \in X}\big[\log\big(1 - Disc_Y(T_{XY}(x, M_X))\big)\big]$$
$$\mathcal{L}_{adv2} = \mathbb{E}_{x \in X}\big[\log Disc_X(x)\big] + \mathbb{E}_{y \in Y}\big[\log\big(1 - Disc_X(T_{YX}(y, M_Y))\big)\big]$$
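The log-based objectives above correspond to the standard GAN loss; a hedged sketch using binary cross-entropy with logits is shown below, split into the discriminator and generator parts (the function and variable names are assumptions, and the non-saturating generator form common in practice is used).

```python
import torch
import torch.nn.functional as F

def adversarial_losses(Disc_Y, y_real, y_fake):
    """BCE-with-logits form of L_adv1. Disc_Y outputs raw logits;
    y_fake = T_XY(x, M_X) is a translated nighttime image."""
    pred_real = Disc_Y(y_real)
    pred_fake_d = Disc_Y(y_fake.detach())   # detach: no gradient to the generator here

    # Discriminator: real -> 1, translated -> 0
    d_loss = F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real)) + \
             F.binary_cross_entropy_with_logits(pred_fake_d, torch.zeros_like(pred_fake_d))

    # Generator: fool Disc_Y into labelling translated images as real (non-saturating form)
    pred_fake_g = Disc_Y(y_fake)
    g_loss = F.binary_cross_entropy_with_logits(pred_fake_g, torch.ones_like(pred_fake_g))
    return d_loss, g_loss
```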

Cycle Consistency Loss

The cycle consistency loss is introduced to prevent the image translation network from generating arbitrary images in the target domain, regardless of the input images. The primary goal is to alter only the time of day, while preserving all other elements of the scene. To accomplish this, we apply constraints to guarantee the alignment between the input image and the reconstructed image. The cycle consistency loss is mathematically expressed as follows:
$$\mathcal{L}_{cyc} = \mathbb{E}_{x \in X}\big[\,\big\| T_{YX}(T_{XY}(x, M_X), M_X) - x \big\|_1\big] + \mathbb{E}_{y \in Y}\big[\,\big\| T_{XY}(T_{YX}(y, M_Y), \bar{M}_X) - y \big\|_1\big]$$

Identity Loss

The identity loss requires that the image translation networks exclusively translate the source domain images and not those from the target domain. The objective is expressed as follows:
$$\mathcal{L}_{id} = \mathbb{E}_{x \in X}\big[\,\big\| T_{YX}(x, M_X) - x \big\|_1\big] + \mathbb{E}_{y \in Y}\big[\,\big\| T_{XY}(y, M_Y) - y \big\|_1\big]$$
where $\{M_X, M_Y\}$ represents the attention maps inferred from the real images $\{x, y\}$ using the networks designed for the opposite domains.

Total Loss

The total objective is as follows:
$$\mathcal{L}_{total} = \mathcal{L}_{adv1} + \mathcal{L}_{adv2} + \lambda_{cyc}\mathcal{L}_{cyc} + \lambda_{id}\mathcal{L}_{id}$$
where $\lambda_{cyc}$ and $\lambda_{id}$ are hyperparameters that control the influence of each loss.
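A sketch of how the generator-side objective could be assembled is shown below. The $\lambda$ defaults are illustrative assumptions (common CycleGAN-style choices) rather than values stated in the paper, and the translation networks are assumed to accept attention maps as a second argument.

```python
import torch
import torch.nn.functional as F

def generator_total_loss(x, y, T_XY, T_YX, Disc_X, Disc_Y,
                         M_X, M_Y, M_X_bar, lambda_cyc=10.0, lambda_id=5.0):
    """Assembles L_total on the generator side; lambda defaults are assumptions."""
    y_fake = T_XY(x, M_X)   # day -> night translation
    x_fake = T_YX(y, M_Y)   # night -> day translation

    # Adversarial terms (generator parts of L_adv1 / L_adv2, non-saturating BCE form)
    pred_y, pred_x = Disc_Y(y_fake), Disc_X(x_fake)
    adv = F.binary_cross_entropy_with_logits(pred_y, torch.ones_like(pred_y)) + \
          F.binary_cross_entropy_with_logits(pred_x, torch.ones_like(pred_x))

    # Cycle consistency: the day-to-night cycle shares M_X for reconstruction,
    # while the night-to-day cycle uses maps re-estimated from x_fake (M_X_bar).
    cyc = F.l1_loss(T_YX(y_fake, M_X), x) + F.l1_loss(T_XY(x_fake, M_X_bar), y)

    # Identity mapping on images that are already in the target domain
    idt = F.l1_loss(T_YX(x, M_X), x) + F.l1_loss(T_XY(y, M_Y), y)

    return adv + lambda_cyc * cyc + lambda_id * idt
```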

4. Experiments

In this section, we first compare our method with related methods for day-to-night image translation. Then, we present experiments to investigate the validity of the architecture of the proposed method.

4.1. Experimental Environments

4.1.1. Datasets

The proposed method requires daytime and nighttime images, along with the corresponding labels for semantic segmentation and depth during training. However, there is currently no available dataset that encompasses all these requirements. Consequently, we conducted training using data from two distinct datasets: the Berkeley Deep Drive (BDD) dataset [13] and the Cityscapes dataset [14].
The Berkeley Deep Drive dataset comprises 10,000 RGB images capturing diverse driving scenarios (e.g., highway, urban area, bridge, and tunnel). The dataset includes variations in weather conditions and the time of day. Semantic segmentation labels are provided for these images. The image resolutions are 1280 × 720 pixels.
The Cityscapes dataset offers 5000 RGB daytime images showcasing various driving environments, accompanied by multiple annotations. In particular, the dataset includes disparity information that can be converted to depth. The image resolutions are 2048 × 1024 pixels.
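As a hedged illustration of the disparity-to-depth conversion mentioned above, the sketch below follows the standard Cityscapes encoding (a stored value $p > 0$ encodes disparity $(p - 1)/256$) together with the published stereo baseline and focal length; these constants are assumptions taken from the Cityscapes documentation, and the paper does not specify its exact conversion.

```python
import numpy as np

def cityscapes_disparity_to_depth(disp_png: np.ndarray,
                                  baseline_m: float = 0.209313,
                                  focal_px: float = 2262.52) -> np.ndarray:
    """Convert a Cityscapes 16-bit disparity PNG to metric depth.
    Stored values p > 0 encode disparity d = (p - 1) / 256; depth = baseline * focal / d.
    The baseline/focal defaults are the published calibration values (assumed here)."""
    p = disp_png.astype(np.float32)
    disparity = (p - 1.0) / 256.0
    depth = np.zeros_like(disparity)
    valid = disparity > 0                  # skip invalid pixels and zero disparity
    depth[valid] = baseline_m * focal_px / disparity[valid]
    return depth
```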
During the training of the network outlined in Section 3.1, we utilized semantic segmentation labels from the BDD dataset and depth labels from the Cityscapes dataset. The image translation network described in Section 3.2 was trained using both daytime and nighttime images from the BDD dataset.

4.1.2. Implementation Settings

Our networks were implemented based on the architectures of CycleGAN (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix accessed on 11 January 2024) [22] and CBAM (https://github.com/Jongchan/attention-module accessed on 11 January 2024) [55], using the PyTorch framework [63], and were trained on a single NVIDIA GeForce RTX 3090 GPU. The hyperparameters and training details are as follows.
  • Learning rate: fixed at $lr = 0.0002$ for the initial 100 epochs and then linearly decayed to $lr = 0$ over the next 100 epochs.
  • Batch size: set to 4.
  • Optimization algorithm: Adam with $\beta_1 = 0.5$ and $\beta_2 = 0.999$.
During training, we randomly sampled 1000 daytime images with semantic segmentation labels and 1000 nighttime images from the BDD dataset. In addition, 1000 daytime images with depth labels were randomly selected from the Cityscapes dataset. The images from the BDD dataset and the Cityscapes dataset were randomly cropped to 512 × 512 pixels and 824 × 824 pixels, respectively. Subsequently, the images were resized to 256 × 256 pixels after the cropping. For testing, 1000 daytime images were randomly sampled from the BDD dataset, and these images were center-cropped to 512 × 512 pixels before being resized to 256 × 256 pixels.
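A minimal sketch of these training settings with standard PyTorch/torchvision utilities is shown below; the module variables and the parameter grouping passed to the optimizer are placeholders, and the discriminators would use a separate optimizer with the same schedule.

```python
import itertools
import torch
from torch.optim.lr_scheduler import LambdaLR
from torchvision import transforms

# Placeholder modules standing in for the trainable translation networks.
G_X, G_Y, E_Y = (torch.nn.Conv2d(3, 3, 3, padding=1) for _ in range(3))

# Adam over the trainable modules (illustrative grouping; frozen day-domain
# modules and the separately optimized discriminators are omitted).
params = itertools.chain(G_X.parameters(), G_Y.parameters(), E_Y.parameters())
optimizer = torch.optim.Adam(params, lr=2e-4, betas=(0.5, 0.999))

# lr stays constant for epochs 0-99, then decays linearly to 0 over epochs 100-199.
def linear_decay(epoch: int) -> float:
    return 1.0 if epoch < 100 else max(0.0, 1.0 - (epoch - 100) / 100.0)

scheduler = LambdaLR(optimizer, lr_lambda=linear_decay)

# BDD images: random 512x512 crop, then resize to 256x256.
bdd_transform = transforms.Compose([
    transforms.RandomCrop(512),
    transforms.Resize(256),
    transforms.ToTensor(),
])

# Cityscapes images: random 824x824 crop before the same resize.
cityscapes_transform = transforms.Compose([
    transforms.RandomCrop(824),
    transforms.Resize(256),
    transforms.ToTensor(),
])
```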

4.1.3. Comparison

We compare the proposed method with the following models.
  • CycleGAN [22] introduced cycle consistency loss for unpaired image-to-image translation.
  • SemGAN [41] adopted semantic consistency loss to maintain semantic information during image-to-image translation.
  • AugGAN [43,44] learned image translation and semantic segmentation simultaneously.
  • Lee et al. [64] transfer-learned the weights of semantic segmentation networks to the day-to-night image translation networks.
  • UNIT [27] achieves unsupervised image-to-image translation by combining VAEs [16] and GANs under the assumption that the different domains share a latent space.
  • MUNIT [30] is a multimodal unsupervised image-to-image translation method that is an extension of UNIT [27].

4.1.4. Evaluation Metrics

We evaluate the proposed and compared methods using the following metrics for quantitative comparisons (a minimal computation sketch using common library implementations follows the list).
  • The Fréchet Inception Distance (FID) [65] measures the Fréchet distance between the distributions of features extracted from the Inception-V3 network [66] for real and generated images.
  • The Kernel Inception Distance (KID) [67] computes the squared Maximum Mean Discrepancy (MMD) between the Inception-V3 [66] features of the real and translated samples, using a polynomial kernel.
  • The Learned Perceptual Image Patch Similarity (LPIPS) metric [68], utilized to assess the diversity of an image set, computes the average feature distances between all pairs of images. Specifically, it gauges the translation diversity by evaluating the similarity between distinct deep features extracted from the pre-trained AlexNet [69].
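A hedged evaluation sketch using the torchmetrics implementations of these metrics is shown below; it assumes torchmetrics with its image dependencies (e.g., torch-fidelity) is installed, and the feature dimension, subset size, and dummy tensors are illustrative choices that may differ from the paper's evaluation setup.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

fid = FrechetInceptionDistance(feature=2048, normalize=True)    # Inception-V3 pool features
kid = KernelInceptionDistance(subset_size=50, normalize=True)   # polynomial-kernel MMD
lpips = LearnedPerceptualImagePatchSimilarity(net_type='alex')  # AlexNet features, inputs in [-1, 1]

# Dummy stand-ins for batches of real nighttime and translated images in [0, 1].
loader = [(torch.rand(50, 3, 256, 256), torch.rand(50, 3, 256, 256)) for _ in range(2)]

for real_batch, fake_batch in loader:
    fid.update(real_batch, real=True)
    fid.update(fake_batch, real=False)
    kid.update(real_batch, real=True)
    kid.update(fake_batch, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID x 100:", 100 * kid_mean.item())

# Diversity: average LPIPS distance between pairs of translated images.
a = torch.rand(4, 3, 256, 256) * 2 - 1
b = torch.rand(4, 3, 256, 256) * 2 - 1
print("Diversity (LPIPS):", lpips(a, b).item())
```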

4.2. Comparison to the Related Work

We conducted experiments to compare our proposed method with the related works as mentioned in Section 4.1.3.
Figure 4 shows the day-to-night image translation results of each method. CycleGAN [22], Lee et al.’s method [64], UNIT [27], and the proposed method generate expressions of streetlights, while other methods do not generate these expressions. Furthermore, our method generates visually more detailed expressions (i.e., color, shape, and position) of the streetlights than CycleGAN [22], Lee et al.’s method [64] and UNIT [27]. Translation failures in the sky area, sandwiched between buildings, are observed from the results of CycleGAN [22], AugGAN [43,44], and Lee et al.’s method [64]. UNIT [27] and MUNIT [30] appear to fail in their translations, as they darken the entire scene, giving the impression of inverting the colors in the images. In contrast, the proposed model’s results demonstrate the preservation of color information for each object in the input image. The results of our method show the car and road in front of the ego vehicle being illuminated by headlights.
Table 1 shows the quantitative evaluation with FID [65], KID [67], and Diversity (LPIPS [68]). Our method outperforms other methods in both the FID [65] and KID [67] metrics, which evaluate the realism of the results. AugGAN [43,44] obtained the highest score in Diversity (LPIPS [68]). However, it is crucial to note that achieving a high score in Diversity (LPIPS [68]) does not necessarily indicate a suitable result, as this metric does not consider the realism aspect. This point is underscored by the presence of images in the bottom row of AugGAN’s outputs in Figure 4. Here, the translation between daytime and nighttime failed, and the increased Diversity (LPIPS [68]) can be interpreted as a result of images closer to daytime (which lack realism). Therefore, Diversity (LPIPS [68]) metrics should be evaluated comprehensively alongside realism assessments.
In this context, our proposed model received the highest evaluations in both the quantitative and qualitative assessments of realism. Simultaneously, it recorded values close to the Diversity (LPIPS [68]) of real night images. Hence, our proposed model can be deemed to have performed the best overall.

4.3. Network Settings

We examined different configurations of the proposed method to optimize the architectural composition. Initially, we analyzed the impact of combinations of attention maps. Subsequently, we demonstrated the effectiveness of sharing feature maps within the decoder during the day-to-night cycle, and alternatively calculating them separately within the night-to-day cycle.

4.3.1. Pipelines of the Semantic and Geometric Feature Maps

The proposed method, as shown in Figure 2b, incorporates the sharing of feature maps from the semantic segmentation and depth estimation decoders during the day-to-night cycle. While a similar process could be applied to the night-to-day cycle, estimating semantic segmentation and depth maps from nighttime images poses a challenge. There is a concern that sharing low-accuracy estimates may mislead the reconstruction process. To address this, we conducted experiments with three types of feature map pipelines throughout the night-to-day cycle. Figure 5 illustrates the visual results of these three pipelines.
In the night-to-day cycle, the expressions of the sky and streetlights vary based on the chosen pipeline. Sharing the feature maps, or not adopting attention maps within the night-to-day cycle, can lead to expression failures in the sky. On the other hand, when we separately extract the feature maps for translation and reconstruction during the night-to-day cycle, more detailed expressions of streetlights are observed. Table 2 indicates that the translated images obtained by separately extracting the feature maps during the night-to-day cycle are more realistic, achieving the best scores in both realism metrics (FID and KID).
We adopt the pipeline that shares the feature maps on the day-to-night cycle and extracts them separately on the night-to-day cycle for our proposed method.

4.3.2. Attention Modules

The proposed method aimed to enhance the semantic and geometric information for the day-to-night image translation network. As a means of enhancing the semantic and geometric information, we introduced attention maps derived from relevant information in the input images, which were then utilized in the image translation network.
We assumed that the semantic and geometric information could be extracted as feature maps by the decoders trained to infer semantic segmentation and depth. Based on this assumption, we generated attention maps from the feature maps calculated by each decoder. In our method, one image translation network adopts two different sizes of attention maps for the semantic and geometric information, respectively: $\{m_T^k\}_{k \in \{Seg1, Dep1, Seg2, Dep2\}}$, where $T \in \{X, Y\}$ indicates the time domain. Each attention map is inferred from a feature map of a different size or type. $\{m_T^{Seg1}, m_T^{Seg2}\}$ and $\{m_T^{Dep1}, m_T^{Dep2}\}$ are generated from the feature maps of the semantic segmentation and depth decoders, respectively. Moreover, $\{m_T^{Seg1}, m_T^{Dep1}\}$ are generated from relatively small feature maps, whereas $\{m_T^{Seg2}, m_T^{Dep2}\}$ are inferred from relatively large feature maps.
We conducted an experiment to verify the effects of combinations of these attention maps. Several combinations of different types and sizes of attention maps were applied to the image-to-image translation networks. Figure 6 and Table 3 present visual and quantitative comparisons of the results for each network combination.
In the visual comparison, translation failures in the sky area, sandwiched between the buildings, are observed in all combinations except the proposed method. Additionally, the expressed size of the streetlights depends on the combination of attention maps, and the streetlights tended to appear larger when the image translation network adopted combinations including $\{m_T^{Seg2}, m_T^{Dep2}\}$.
The quantitative results indicate that the image translation network with all types and sizes of attention maps achieved the best results in both metrics.
Based on both the visual and quantitative results, we utilize all attention maps in our proposed method.

4.4. Discussion

Based on our assumption that the semantic segmentation and depth decoders can extract the semantic and geometric information from the input image, our method infers the semantic and geometric attention maps from the related feature maps extracted by the decoders. The attention maps were created from two different sizes of feature maps calculated from each decoder and applied to the image translation network. To investigate the necessity of these diverse types and sizes of attention maps for image translation, we conducted an experiment. The experimental results show that employing all types of attention maps yielded the best results in both the qualitative and quantitative evaluations.
Additionally, an experiment was performed to identify the optimal conditions for the pipeline of decoder feature maps in the image translation and reconstruction networks during the night-to-day cycle. The comprehensive results validate the effectiveness of our approach, demonstrating that introducing attention maps inferred from two different-sized feature maps from each decoder and extracting feature maps separately for the image translation and reconstruction networks during the night-to-day cycle yield the best performance.
Comparative evaluations against CycleGAN [22], SemGAN [41], AugGAN [43,44], Lee et al.’s method [64], UNIT [27], and MUNIT [30] substantiate the notion that our proposed model outperforms these methods in both quantitative and qualitative assessments.
Looking ahead, future efforts should include exploring the application of the proposed model to adverse weather conditions, such as rain or fog. In addition, further research is needed on leveraging translated images to enhance or evaluate model training.

5. Conclusions

In this paper, we present an unpaired image-to-image translation network with semantic and geometric attention maps. We propose to first pre-train the multi-task network for the estimation of semantic segmentation and depth maps and then utilize the extracted feature maps from these pre-trained decoders to infer semantic and geometric attention maps. We apply these attention maps to the image-to-image translation network. The experiments indicate both qualitative and quantitative performance gains with our proposed method.

Author Contributions

Conceptualization, G.B., K.N. and S.K.; methodology, G.B.; software, G.B.; validation, G.B., J.L. and T.N.; formal analysis, G.B. and Y.E.; investigation, G.B.; resources, K.N. and S.K.; data curation, G.B. and T.N.; writing—original draft preparation, G.B.; writing—review and editing, G.B., J.L. and Y.E.; visualization, G.B.; supervision, K.N. and S.K.; project administration, G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding authors. Please note that the data are not publicly accessible due to contractual obligations associated with the project. These data were derived from the following resources available in the public domain: BDD dataset (https://www.vis.xyz/bdd100k/ accessed on 11 January 2024), Cityscapes dataset (https://www.cityscapes-dataset.com/ accessed on 11 January 2024).

Conflicts of Interest

The authors T.N. and K.N. were employed by the company Mitsubishi Heavy Industries Machinery Systems Ltd. and Mitsubishi Heavy Industries Ltd., respectively. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  2. Richter, S.R.; Hayder, Z.; Koltun, V. Playing for Benchmarks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  3. Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for data: Ground truth from computer games. In Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar] [CrossRef]
  4. Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. PMLR 2017, 78, 1–16. [Google Scholar]
  5. Hossain, S.; Fayjie, A.R.; Doukhi, O.; Lee, D.-j. CAIAS Simulator: Self-driving Vehicle Simulator for AI Research. In Proceedings of the International Conference on Intelligent Computing & Optimization (ICO 2018), Pattaya, Thailand, 4–5 October 2018; Volume 866. [Google Scholar] [CrossRef]
  6. Rong, G.; Shin, B.H.; Tabatabaee, H.; Lu, Q.; Lemke, S.; Možeiko, M.; Boise, E.; Uhm, G.; Gerow, M.; Mehta, S.; et al. LGSVL Simulator: A High Fidelity Simulator for Autonomous Driving. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020. [Google Scholar] [CrossRef]
  7. Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Field and Service Robotics; Springer Proceedings in Advanced Robotics; Springer: Cham, Switzerland, 2018; Volume 5, pp. 621–635. [Google Scholar] [CrossRef]
  8. Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K.; Efros, A.A.; Darrell, T. CyCADA: Cycle-Consistent Adversarial Domain Adaptation. PMLR 2018, 80, 1989–1998. [Google Scholar]
  9. Richter, S.R.; Alhaija, H.A.; Koltun, V. Enhancing Photorealism Enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1700–1715. [Google Scholar] [CrossRef]
  10. Nakajima, K.; Katayama, T.; Song, T.; Jiang, X.; Shimamoto, T. Domain Adaptive Semantic Segmentation through Photorealistic Enhancement of Video Game. In Proceedings of the 2022 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 7–9 January 2022. [Google Scholar] [CrossRef]
  11. Shrivastava, A.; Pfister, T.; Tuzel, O.; Susskind, J.; Wang, W.; Webb, R. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  12. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
  13. Yu, F.; Xian, W.; Chen, Y.; Liu, F.; Liao, M.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling. arXiv 2018, arXiv:1805.04687v1. [Google Scholar]
  14. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  15. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  16. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2014, arXiv:1312.6114. [Google Scholar]
  17. Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding beyond pixels using a learned similarity metric. PMLR 2016, 48, 1558–1566. [Google Scholar]
  18. Srivastava, A.; Valkov, L.; Russell, C.; Gutmann, M.U.; Sutton, C. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  19. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Smolley, S.P. Least Squares Generative Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  20. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. PMLR 2017, 70, 214–223. [Google Scholar]
  21. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  22. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  23. Taigman, Y.; Polyak, A.; Wolf, L. Unsupervised cross-domain image generation. arXiv 2017, arXiv:1611.02200. [Google Scholar]
  24. Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative adversarial networks. PMLR 2017, 70, 1857–1865. [Google Scholar]
  25. Yi, Z.; Zhang, H.; Tan, P.; Gong, M. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  26. Liu, M.Y.; Tuzel, O. Coupled generative adversarial networks. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  27. Liu, M.Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  28. Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  29. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. StarGAN: Unified Generative Adversarial Networks for Multi-domain Image-to-Image Translation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  30. Huang, X.; Liu, M.Y.; Belongie, S.; Kautz, J. Multimodal Unsupervised Image-to-Image Translation. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018. [Google Scholar] [CrossRef]
  31. Lee, H.Y.; Tseng, H.Y.; Huang, J.B.; Singh, M.; Yang, M.H. Diverse Image-to-Image Translation via Disentangled Representations. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018. [Google Scholar] [CrossRef]
  32. Gonzalez-Garcia, A.; Weijer, J.V.D.; Bengio, Y. Image-to-image translation for cross-domain disentanglement. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 2–8 December 2018. [Google Scholar]
  33. Yang, D.; Hong, S.; Jang, Y.; Zhao, T.; Lee, H. Diversity-sensitive conditional generative adversarial networks. arXiv 2019, arXiv:1901.09024. [Google Scholar]
  34. Arruda, V.F.; Paixao, T.M.; Berriel, R.F.; Souza, A.F.D.; Badue, C.; Sebe, N.; Oliveira-Santos, T. Cross-Domain Car Detection Using Unsupervised Image-to-Image Translation: From Day to Night. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019. [Google Scholar] [CrossRef]
  35. Romera, E.; Bergasa, L.M.; Yang, K.; Alvarez, J.M.; Barea, R. Bridging the day and night domain gap for semantic segmentation. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019. [Google Scholar] [CrossRef]
  36. Sun, L.; Wang, K.; Yang, K.; Xiang, K. See clearer at night: Towards robust nighttime semantic segmentation through day-night image conversion. In Proceedings of the SPIE Security + Defence, Strasbourg, France, 9–12 September 2019. [Google Scholar] [CrossRef]
  37. Anoosheh, A.; Sattler, T.; Timofte, R.; Pollefeys, M.; Gool, L.V. Night-to-day image translation for retrieval-based localization. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019. [Google Scholar] [CrossRef]
  38. Punnappurath, A.; Abuolaim, A.; Abdelhamed, A.; Levinshtein, A.; Brown, M.S. Day-to-Night Image Synthesis for Training Nighttime Neural ISPs. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  39. Zheng, Z.; Wu, Y.; Han, X.; Shi, J. ForkGAN: Seeing into the Rainy Night. In Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar] [CrossRef]
  40. Li, P.; Liang, X.; Jia, D.; Xing, E.P. Semantic-aware grad-GAN for virtual-to-real urban scene adaption. arXiv 2019, arXiv:1801.01726. [Google Scholar]
  41. Cherian, A.; Sullivan, A. Sem-GAN: Semantically-consistent image-to-image translation. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019. [Google Scholar] [CrossRef]
  42. Ramirez, P.Z.; Tonioni, A.; Stefano, L.D. Exploiting semantics in adversarial training for image-level domain adaptation. In Proceedings of the 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), Sophia Antipolis, France, 12–14 December 2018. [Google Scholar] [CrossRef]
  43. Huang, S.W.; Lin, C.T.; Chen, S.P.; Wu, Y.Y.; Hsu, P.H.; Lai, S.H. AugGAN: Cross domain adaptation with GAN-based data augmentation. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018. [Google Scholar] [CrossRef]
  44. Lin, C.T.; Huang, S.W.; Wu, Y.Y.; Lai, S.H. GAN-Based Day-to-Night Image Style Transfer for Nighttime Vehicle Detection. IEEE Trans. Intell. Transp. Syst. 2021, 22, 951–963. [Google Scholar] [CrossRef]
  45. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 3. [Google Scholar]
  46. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  47. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  48. Yang, Z.; Zhu, L.; Wu, Y.; Yang, Y. Gated Channel Transformation for Visual Recognition. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  49. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
  50. Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation Networks for Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  51. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. PMLR 2019, 97, 7354–7363. [Google Scholar]
  52. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  53. Li, W.; Zhu, X.; Gong, S. Harmonious Attention Network for Person Re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  54. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2019, arXiv:1807.06514. [Google Scholar]
  55. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar] [CrossRef]
  56. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  57. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  58. Ramachandran, P.; Bello, I.; Parmar, N.; Levskaya, A.; Vaswani, A.; Shlens, J. Stand-alone self-attention in vision models. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  59. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  60. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar] [CrossRef]
  61. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. PMLR 2020, 119, 1691–1703. [Google Scholar]
  62. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  63. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  64. Lee, J.; Shiotsuka, D.; Bang, G.; Endo, Y.; Nishimori, T.; Nakao, K.; Kamijo, S. Day-to-night image translation via transfer learning to keep semantic information for driving simulator. IATSS Res. 2023, 47, 251–262. [Google Scholar] [CrossRef]
  65. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  66. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  67. Binkowski, M.; Sutherland, D.J.; Arbel, M.; Gretton, A. Demystifying MMD GANs. arXiv 2018, arXiv:1801.01401. [Google Scholar]
  68. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  69. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
Figure 1. The concept of our proposed method. Semantic and geometric information of input images is extracted as feature maps by pre-trained semantic segmentation and depth network. Utilizing attention modules, semantic and geometric spatial attention maps are deduced from these feature maps. Subsequently, both semantic and geometric attention maps are integrated into the image-to-image translation network.
Figure 2. The overview of the proposed method. The training process consists of two distinctive steps, (a) the semantic segmentation and depth multi-task network and (b) the image-to-image translation network. During the second step, encoders and decoders utilize the pre-trained parameters obtained in the first step, and attention modules infer spatial attention maps from the feature maps of decoders. Throughout only the day-to-night cycle, the feature maps of decoders are shared between the image translation and reconstruction processes.
Figure 3. Structure of the semantic and geometric attention module. The proposed method adopts four spatial attention modules (i.e., $A_T^{Seg1}$, $A_T^{Dep1}$, $A_T^{Seg2}$, and $A_T^{Dep2}$, where $T \in \{X, Y\}$ indicates the time domain). Each attention module infers a spatial attention map from the corresponding feature map. Here, two different sizes of feature maps are utilized from each decoder (i.e., the semantic segmentation and depth decoders). The resultant attention maps are then mapped to their correspondingly sized feature maps within the generator.
Figure 4. Qualitative comparison with related works. The figure shows, from left to right, the input images in the source domain, followed by the day-to-night image translation results from CycleGAN [22], SemGAN [41], AugGAN [43,44], Lee et al. [64], UNIT [27], MUNIT [30], and our method, respectively.
Figure 5. Comparison of pipelines for feature maps from semantic segmentation and depth generators. We present (a) input images from the source domain, followed by the outputs from networks employing three distinct pipelines with attention modules. The pipeline in (b) shares feature maps during both day-to-night and night-to-day cycles, while that in (c) shares feature maps exclusively during the day-to-night cycle, without adopting attention maps during the night-to-day cycle. Additionally, the pipeline in (d) shares feature maps only during the day-to-night cycle and separately extracts them during the night-to-day cycle. The red-colored boxes denote the sky area between buildings or trees, while the yellow-colored boxes highlight areas containing street lights.
Figure 6. Comparison of combinations of attention maps. Each row shows, from top to bottom, (a) input images from the source domain, followed by the outputs from networks employing distinct combinations of attention maps: (b) small-sized attention maps $\{m_T^{Seg1}, m_T^{Dep1}\}$, (c) large-sized attention maps $\{m_T^{Seg2}, m_T^{Dep2}\}$, (d) attention maps generated from the feature maps of the depth decoder $\{m_T^{Dep1}, m_T^{Dep2}\}$, (e) attention maps generated from the feature maps of the semantic segmentation decoder $\{m_T^{Seg1}, m_T^{Seg2}\}$, and (f) all attention maps $\{m_T^{Seg1}, m_T^{Dep1}, m_T^{Seg2}, m_T^{Dep2}\}$. The red-colored boxes denote the sky area between buildings or trees, while the yellow-colored boxes highlight areas containing streetlights.
Table 1. Quantitative evaluation with FID [65], KID [67], and Diversity (LPIPS [68]).
Method | FID ↓ | KID × 100 ↓ | Diversity ↑
CycleGAN [22] | 35.528 | 1.321 | 0.597
SemGAN [41] | 35.256 | 1.309 | 0.605
AugGAN [43,44] | 57.723 | 4.013 | 0.631
Lee et al. [64] | 39.598 | 1.727 | 0.587
UNIT [27] | 32.661 | 1.080 | 0.586
MUNIT [30] | 69.968 | 5.791 | 0.578
Ours | 31.245 | 0.898 | 0.586
Real night images | 19.673 | 0.022 | 0.601
The up and down arrows next to the metrics indicate that a larger and smaller numerical value corresponds to a better outcome, respectively. The bold numbers highlight the best results.
Table 2. Quantitative evaluation of pipelines of the feature maps for attention maps.
Pipeline | Feature Maps for Attention Maps: Day-to-Night Cycle | Night-to-Day Cycle | FID ↓ | KID × 100 ↓ | Diversity ↑
1 | share | share | 35.963 | 1.293 | 0.595
2 | share | no attention | 34.856 | 1.216 | 0.596
3 | share | separate | 31.245 | 0.898 | 0.586
The up and down arrows next to the metrics indicate that a larger and smaller numerical value corresponds to a better outcome, respectively. The bold numbers highlight the best results.
Table 3. Quantitative evaluation of attention map combinations.
Method | Adopted Attention Maps 1 ($m_T^{Seg1}$, $m_T^{Dep1}$, $m_T^{Seg2}$, $m_T^{Dep2}$) | FID ↓ | KID × 100 ↓ | Diversity ↑
Proposed method w/o $m_T^{Seg2}$, $m_T^{Dep2}$ 2 | ✓, ✓, –, – | 33.696 | 1.118 | 0.591
Proposed method w/o $m_T^{Seg1}$, $m_T^{Dep1}$ 3 | –, –, ✓, ✓ | 33.786 | 1.161 | 0.589
Proposed method w/o $m_T^{Dep1}$, $m_T^{Dep2}$ 4 | ✓, –, ✓, – | 33.365 | 1.132 | 0.591
Proposed method w/o $m_T^{Seg1}$, $m_T^{Seg2}$ 5 | –, ✓, –, ✓ | 35.105 | 1.324 | 0.590
Proposed method | ✓, ✓, ✓, ✓ | 31.245 | 0.898 | 0.586
1 $T$ indicates the time domain: $T \in \{X, Y\}$. 2 The network only adopts small-sized attention maps. 3 The network only adopts large-sized attention maps. 4 The network only adopts semantic attention maps. 5 The network only adopts geometric attention maps. The up and down arrows next to the metrics indicate that a larger and smaller numerical value corresponds to a better outcome, respectively. The bold numbers highlight the best results.
