# Invertible Autoencoder for Domain Adaptation


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Invertible Autoencoder

#### 3.1. Fully Connected Layer

#### 3.2. Convolutional Layer

#### 3.3. Activation Function

#### 3.4. Residual Block

#### 3.5. Bias

#### 3.6. Experimental Validation of Orthonormality

## 4. Invertible Autoencoder for Domain Adaptation

## 5. Experiments

#### 5.1. Experiments with Benchmark Datasets

- (i) Day-to-night and night-to-day image conversion: we used unpaired road pictures recorded during the day and at night, obtained from the KAIST dataset [30].
- (ii) Day-to-thermal and thermal-to-day image conversion: we used road pictures recorded during the day with a regular camera and a thermal camera, obtained from the KAIST dataset [30].
- (iii) Maps-to-satellite and satellite-to-maps conversion: we used satellite images and maps obtained from Google Maps [1].

- For tasks (ii) and (iii), we directly calculated the ${\ell}_{1}$ loss between the converted images and the ground truth.
- For task (i), we trained two autoencoders ${\mathsf{\Omega}}_{\mathcal{A}}$ and ${\mathsf{\Omega}}_{\mathcal{B}}$ on both domains, i.e., we trained each of them to perform high-quality reconstruction of the images from its own domain and low-quality reconstruction of the images from the other domain. We then used these two autoencoders to evaluate the quality of the converted images: a high ${\ell}_{1}$ reconstruction loss of an autoencoder on images converted to resemble its own domain indicates low-quality image translation.
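The evaluation protocol above can be sketched as follows; the helper names are ours, not from the paper's code, and images are simplified to flat lists of pixel intensities:

```python
def l1_loss(x, y):
    """Mean absolute (ell_1) difference between two equal-size images,
    given here as flat lists of pixel intensities."""
    assert len(x) == len(y)
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Tasks (ii) and (iii): paired ground truth exists, so the loss is computed
# directly between the converted image and its ground-truth counterpart.
converted = [0.1, 0.5, 0.9]
ground_truth = [0.0, 0.5, 1.0]
print(l1_loss(converted, ground_truth))  # 0.2 / 3 ≈ 0.0667

# Task (i): no paired ground truth, so a converted image y would instead be
# scored by the reconstruction loss of the target domain's autoencoder, e.g.
#     score = l1_loss(omega_b(y), y)   # high value => low-quality translation
```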

#### 5.2. Experiments with Autonomous Driving System

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Invertible Autoencoder for Domain Adaptation

- Additional Plots and Tables for Section 3.6

**Figure A1.** Comparison of the MSE of the diagonal of $DE-I$ for InvAuto, Auto, Cycle, and VAE on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets.

**Figure A2.** Comparison of the MSE of the off-diagonal of $DE-I$ for InvAuto, Auto, Cycle, and VAE on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets.

**Figure A3.** Comparison of the ratio of the MSE of the off-diagonal and diagonal of $DE$ for InvAuto, Auto, Cycle, and VAE on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets.

**Table A1.** Test reconstruction loss (MSE) for InvAuto, Auto, Cycle, and VAE on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets. VAE has significantly higher reconstruction loss by construction.

Dataset and Model | InvAuto | Auto | Cycle | VAE |
---|---|---|---|---|
MNIST MLP | $0.189$ | $0.100$ | $0.112$ | $1.245$ |
MNIST Conv | $0.168$ | $0.051$ | $0.057$ | $1.412$ |
CIFAR Conv | $0.236$ | $0.126$ | $0.195$ | $1.457$ |
CIFAR ResNet | $0.032$ | $0.127$ | $0.217$ | $0.964$ |

**Figure A4.** Heatmap of the values of matrix $DE$ for InvAuto (**a**,**e**,**i**,**m**), Auto (**b**,**f**,**j**,**n**), Cycle (**c**,**g**,**k**,**o**), and VAE (**d**,**h**,**l**,**p**) on MLP, convolutional (Conv), and ResNet architectures and MNIST and CIFAR datasets. Matrices E and D are constructed by multiplying the weight matrices of consecutive layers of the encoder and decoder, respectively. In the case of InvAuto, $DE$ is the closest to the identity matrix.

- Additional Experimental Results for Section 5

- Invertible Autoencoder for Domain Adaptation: Architecture and Training

**Generator architecture** Our implementation of InvAuto contains 18 invertible residual blocks for both $128\times 128$ and $512\times 512$ images: 9 blocks are used in the encoder and the remaining 9 in the decoder. All layers in the decoder are inverted versions of the encoder's layers. We furthermore add two down-sampling and two up-sampling layers for the model trained on $128\times 128$ images, and three down-sampling and three up-sampling layers for the model trained on $512\times 512$ images. The details of the generator's architecture are listed in Table A3 and Table A4. For convenience, Conv denotes a convolutional layer, ConvNormReLU a Convolutional-InstanceNorm-LeakyReLU layer, InvRes an invertible residual block, and Tanh the hyperbolic tangent activation function. The negative slope of the LeakyReLU function is set to $0.2$. All filters are square; K denotes the filter size and F the number of output feature maps. Padding is added accordingly.
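The weight-tying idea behind the inverted decoder can be illustrated with a toy linear sketch (our own simplification: real InvAuto blocks also invert convolutions and nonlinearities). The decoder reuses the encoder's weights, transposed and applied in reverse order, so it holds no separate parameters:

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Encoder: a stack of linear maps (nonlinearities omitted for brevity).
encoder_weights = [
    [[0.0, 1.0], [1.0, 0.0]],   # layer 1: a permutation (orthonormal)
    [[1.0, 0.0], [0.0, -1.0]],  # layer 2: a reflection (orthonormal)
]
# Decoder layers are the transposed encoder layers in reverse order.
decoder_weights = [transpose(w) for w in reversed(encoder_weights)]

x = [0.3, -0.7]
code = x
for w in encoder_weights:
    code = matvec(w, code)
recon = code
for w in decoder_weights:
    recon = matvec(w, recon)
print(recon)  # recovers [0.3, -0.7] exactly, since each layer is orthonormal
```

For orthonormal layers the transpose equals the inverse, which is why the decoder exactly undoes the encoder without any extra training.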

**Discriminator architecture** We use a discriminator architecture similar to that of PatchGAN [1], described in Table A2. The same architecture is used for training on both $128\times 128$ and $512\times 512$ images.

**Criterion and Optimization** At training, we set $\lambda =10$ and use the ${\ell}_{1}$ loss for the cycle consistency term in Equation (12). We use the Adam optimizer [32] with learning rate ${l}_{r}=0.0002$, ${\beta}_{1}=0.5$, and ${\beta}_{2}=0.999$. We also add an ${\ell}_{2}$ penalty with weight ${10}^{-6}$.
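The cycle-consistency term can be sketched as follows; the generator stand-ins below are toy functions of ours, not the paper's networks, and the hyperparameters are simply the stated training settings collected in one place:

```python
LAMBDA = 10.0  # weight of the cycle-consistency term (Equation (12))
ADAM = dict(lr=2e-4, beta1=0.5, beta2=0.999, weight_decay=1e-6)

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def cycle_loss(x_a, x_b, gen_ab, gen_ba):
    """lambda * (||G_BA(G_AB(x_A)) - x_A||_1 + ||G_AB(G_BA(x_B)) - x_B||_1)."""
    return LAMBDA * (l1(gen_ba(gen_ab(x_a)), x_a) +
                     l1(gen_ab(gen_ba(x_b)), x_b))

# For a toy pair of mutually inverse maps (negation), the cycle term vanishes,
# which is the behavior an invertible translator is designed to encourage:
identity_cycle = cycle_loss([0.2, -0.4], [1.0, 0.5],
                            gen_ab=lambda v: [-t for t in v],
                            gen_ba=lambda v: [-t for t in v])
print(identity_cycle)  # 0.0
```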

**Table A2.** Discriminator architecture.

Name | Stride | Filter |
---|---|---|
ConvNormReLU | 2 × 2 | K4-F64 |
ConvNormReLU | 2 × 2 | K4-F128 |
ConvNormReLU | 2 × 2 | K4-F256 |
ConvNormReLU | 1 × 1 | K4-F512 |
Conv | 1 × 1 | K4-F1 |

**Table A3.** Generator architecture for $128\times 128$ images.

Name | Stride | Filter |
---|---|---|
ConvNormReLU | 1 × 1 | K7-F64 |
ConvNormReLU | 2 × 2 | K3-F128 |
ConvNormReLU | 2 × 2 | K3-F256 |
InvRes | 1 × 1 | K3-F256 |
InvRes | 1 × 1 | K3-F256 |
InvRes | 1 × 1 | K3-F256 |
InvRes | 1 × 1 | K3-F256 |
InvRes | 1 × 1 | K3-F256 |
InvRes | 1 × 1 | K3-F256 |
InvRes | 1 × 1 | K3-F256 |
InvRes | 1 × 1 | K3-F256 |
InvRes | 1 × 1 | K3-F256 |
ConvNormReLU | 1/2 × 1/2 | K3-F128 |
ConvNormReLU | 1/2 × 1/2 | K3-F64 |
Conv | 1 × 1 | K7-F3 |
Tanh | | |

**Table A4.** Generator architecture for $512\times 512$ images.

Name | Stride | Filter |
---|---|---|
ConvNormReLU | 1 × 1 | K7-F64 |
ConvNormReLU | 2 × 2 | K3-F128 |
ConvNormReLU | 2 × 2 | K3-F256 |
ConvNormReLU | 2 × 2 | K3-F512 |
InvRes | 1 × 1 | K3-F512 |
InvRes | 1 × 1 | K3-F512 |
InvRes | 1 × 1 | K3-F512 |
InvRes | 1 × 1 | K3-F512 |
InvRes | 1 × 1 | K3-F512 |
InvRes | 1 × 1 | K3-F512 |
InvRes | 1 × 1 | K3-F512 |
InvRes | 1 × 1 | K3-F512 |
InvRes | 1 × 1 | K3-F512 |
ConvNormReLU | 1/2 × 1/2 | K3-F256 |
ConvNormReLU | 1/2 × 1/2 | K3-F128 |
ConvNormReLU | 1/2 × 1/2 | K3-F64 |
Conv | 1 × 1 | K7-F3 |
Tanh | | |

## References

1. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
2. Wang, T.; Liu, M.; Zhu, J.; Tao, A.; Kautz, J.; Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
3. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017.
4. Liu, M.; Breuel, T.; Kautz, J. Unsupervised Image-to-Image Translation Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
5. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
6. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
7. Nguyen, A.; Yosinski, J.; Bengio, Y.; Dosovitskiy, A.; Clune, J. Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
8. Gan, Z.; Chen, L.; Wang, W.; Pu, Y.; Zhang, Y.; Liu, H.; Li, C.; Carin, L. Triangle Generative Adversarial Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
9. Zhang, H.; Xu, T.; Li, H. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017.
10. Wang, C.; Wang, C.; Xu, C.; Tao, D. Tag Disentangled Generative Adversarial Network for Object Image Re-rendering. In Proceedings of the 2017 International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
11. Wang, W.; Huang, Q.; You, S.; Yang, C.; Neumann, U. Shape Inpainting Using 3D Generative Adversarial Network and Recurrent Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017.
12. Vondrick, C.; Pirsiavash, H.; Torralba, A. Generating Videos with Scene Dynamics. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
13. Finn, C.; Goodfellow, I.; Levine, S. Unsupervised Learning for Physical Interaction Through Video Prediction. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
14. Vondrick, C.; Torralba, A. Generating the Future with Adversarial Transformers. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
15. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative Adversarial Text to Image Synthesis. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016.
16. Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv **2016**, arXiv:1609.03499.
17. Lu, J.; Kannan, A.; Yang, J.; Parikh, D.; Batra, D. Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
18. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, 14–16 April 2014.
19. Dong, H.; Neekhara, P.; Wu, C.; Guo, Y. Unsupervised Image-to-Image Translation with Generative Adversarial Networks. arXiv **2017**, arXiv:1701.02676.
20. Taigman, Y.; Polyak, A.; Wolf, L. Unsupervised Cross-Domain Image Generation. arXiv **2016**, arXiv:1611.02200.
21. Liu, M.Y.; Tuzel, O. Coupled Generative Adversarial Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
22. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), Helsinki, Finland, 5–9 July 2008.
23. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image Style Transfer Using Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
24. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.
25. Ulyanov, D.; Lebedev, V.; Vedaldi, A.; Lempitsky, V.S. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016.
26. Gatys, L.A.; Bethge, M.; Hertzmann, A.; Shechtman, E. Preserving Color in Neural Artistic Style Transfer. arXiv **2016**, arXiv:1606.05897.
27. McCann, M.T.; Jin, K.H.; Unser, M. Convolutional Neural Networks for Inverse Problems in Imaging: A Review. IEEE Signal Process. Mag. **2017**, 34, 85–95.
28. Vasudevan, A.; Anderson, A.; Gregg, D. Parallel Multi Channel Convolution Using General Matrix Multiplication. In Proceedings of the 28th Annual IEEE International Conference on Application-Specific Systems, Architectures and Processors, Seattle, WA, USA, 10–12 July 2017.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
30. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral Pedestrian Detection: Benchmark Dataset and Baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–10 June 2015.
31. Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to End Learning for Self-Driving Cars. arXiv **2016**, arXiv:1604.07316.
32. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv **2014**, arXiv:1412.6980.

**Figure 1.** Heatmap of the values of matrix $DE$ for InvAuto (**a**,**e**), Auto (**b**,**f**), Cycle (**c**,**g**), and VAE (**d**,**h**) on MLP and ResNet architectures and MNIST and CIFAR datasets. Matrices E and D are constructed by multiplying the weight matrices of consecutive layers of the multi-layer encoder and decoder, respectively, e.g., $E={E}_{L}\dots {E}_{2}{E}_{1}$ and $D={D}_{L}\dots {D}_{2}{D}_{1}$ for a $2L$-layer autoencoder. In the case of InvAuto, $DE$ is the closest to the identity matrix.

**Figure 2.** Comparison of the mean squared error (MSE) $\mathrm{MSE}(DE-I)$ for InvAuto, Auto, Cycle, and VAE on MLP, convolutional, and ResNet architectures and MNIST and CIFAR datasets. Matrices E and D are constructed by multiplying the weight matrices of consecutive layers of the encoder and decoder, respectively.
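The quantity in this comparison can be reproduced numerically; the small matrices below are our own toy example of an exactly invertible encoder/decoder pair, not values from the paper:

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mse_from_identity(m):
    """Mean squared deviation of a square matrix from the identity."""
    n = len(m)
    return sum((m[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# E = E_L ... E_2 E_1 and D = D_L ... D_2 D_1, as in the figure captions.
E1 = [[0.0, 1.0], [1.0, 0.0]]   # a permutation
E2 = [[2.0, 0.0], [0.0, 0.5]]   # a diagonal scaling
E = matmul(E2, E1)
# InvAuto-style decoder: each D_i inverts the matching E_i, applied in
# reverse order, so the product DE is exactly the identity.
D1 = [[0.5, 0.0], [0.0, 2.0]]   # inverse of E2
D2 = [[0.0, 1.0], [1.0, 0.0]]   # inverse of E1
D = matmul(D2, D1)
DE = matmul(D, E)
print(mse_from_identity(DE))  # 0.0 for an exactly invertible pair
```

An ordinary autoencoder trained only on reconstruction loss would generally leave $DE$ close to, but not equal to, the identity, which is what the figure measures.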

**Figure 4.** The histograms of cosine similarity of the rows of E for InvAuto (**a**,**e**), Auto (**b**,**f**), Cycle (**c**,**g**), and VAE (**d**,**h**) on MLP and ResNet architectures and MNIST and CIFAR datasets.

**Figure 5.** The architecture of the domain translator with InvAuto $(E,D)$. ${x}_{\mathcal{A}}\in \mathcal{A}$ and ${x}_{\mathcal{B}}\in \mathcal{B}$ are the inputs of the translator. ${y}_{\mathcal{B}}$ is the image ${x}_{\mathcal{A}}$ converted into the $\mathcal{B}$ domain and ${y}_{\mathcal{A}}$ is the image ${x}_{\mathcal{B}}$ converted into the $\mathcal{A}$ domain. The invertible autoencoder $(E,D)$ is built of encoder E and decoder D, each of which is itself an autoencoder. ${\mathrm{Enc}}_{\mathcal{A}},{\mathrm{Enc}}_{\mathcal{B}}$ are feature extractors, and ${\mathrm{Dec}}_{\mathcal{A}},{\mathrm{Dec}}_{\mathcal{B}}$ are the final layers of the generators ${\mathrm{Gen}}_{\mathcal{B}}$, i.e., (${\mathrm{Enc}}_{\mathcal{A}},E,{\mathrm{Dec}}_{\mathcal{B}}$), and ${\mathrm{Gen}}_{\mathcal{A}}$, i.e., (${\mathrm{Enc}}_{\mathcal{B}},D,{\mathrm{Dec}}_{\mathcal{A}}$), respectively. Discriminators ${\mathrm{Dis}}_{\mathcal{A}}$ and ${\mathrm{Dis}}_{\mathcal{B}}$ discriminate whether their input comes from the generator (True) or the original dataset (False).

**Figure 6.** (**Left**) Day-to-night image conversion. Zoomed image is shown in Figure A5 in Appendix A. (**Right**) Night-to-day image conversion. Zoomed image is shown in Figure A6 in Appendix A.

**Figure 7.** (**Left**) Day-to-thermal image conversion. Zoomed image is shown in Figure A7 in Appendix A. (**Right**) Thermal-to-day image conversion. Zoomed image is shown in Figure A8 in Appendix A.

**Figure 8.** (**Left**) Maps-to-satellite image conversion. Zoomed image is shown in Figure A9 in Appendix A. (**Right**) Satellite-to-maps image conversion. Zoomed image is shown in Figure A10 in Appendix A.

**Figure 9.** (**Left**) Experimental results with autonomous driving system: day-to-night conversion. Zoomed image is shown in Figure A11 in Appendix A. (**Right**) Experimental results with autonomous driving system: night-to-day conversion. Zoomed image is shown in Figure A12 in Appendix A.

**Table 1.** Mean and standard deviation of the cosine similarity of the rows of E. InvAuto achieves cosine similarity closest to 0. The best performer is in bold.

Dataset and Model | InvAuto | Auto | Cycle | VAE |
---|---|---|---|---|
MNIST MLP | $\mathbf{0.001 \pm 0.118}$ | $0.008 \pm 0.210$ | $0.007 \pm 0.207$ | $0.001 \pm 0.219$ |
MNIST Conv | $\mathbf{0.001 \pm 0.148}$ | $0.001 \pm 0.179$ | $0.001 \pm 0.176$ | $-0.001 \pm 0.190$ |
CIFAR Conv | $\mathbf{0.001 \pm 0.145}$ | $0.002 \pm 0.176$ | $0.004 \pm 0.195$ | $0.003 \pm 0.268$ |
CIFAR ResNet | $\mathbf{0.000 \pm 0.134}$ | $0.000 \pm 0.203$ | $0.000 \pm 0.232$ | $0.001 \pm 0.298$ |

**Table 2.** Mean and standard deviation of the ${\ell}_{2}$-norm of the rows of E. InvAuto achieves row norms closest to the unit norm. The best performer is in bold.

Dataset and Model | InvAuto | Auto | Cycle | VAE |
---|---|---|---|---|
MNIST MLP | $\mathbf{0.976 \pm 0.190}$ | $1.326 \pm 0.095$ | $1.268 \pm 0.095$ | $1.832 \pm 0.501$ |
MNIST Conv | $\mathbf{0.905 \pm 0.321}$ | $1.699 \pm 0.732$ | $1.780 \pm 0.779$ | $1.971 \pm 0.794$ |
CIFAR Conv | $\mathbf{0.908 \pm 0.219}$ | $3.027 \pm 0.816$ | $2.463 \pm 0.688$ | $1.176 \pm 0.356$ |
CIFAR ResNet | $\mathbf{0.868 \pm 0.078}$ | $2.890 \pm 0.895$ | $2.650 \pm 0.937$ | $1.728 \pm 0.311$ |

**Table 3.** ${\ell}_{1}$ loss of the converted images for CycleGAN, UNIT, and InvAuto on the benchmark image translation tasks.

Tasks | CycleGAN | UNIT | InvAuto |
---|---|---|---|
Night-to-day | $0.033$ | $0.227$ | $0.062$ |
Day-to-night | $0.041$ | $0.114$ | $0.067$ |
Thermal-to-day | $0.287$ | $0.339$ | $0.299$ |
Day-to-thermal | $0.179$ | $0.194$ | $0.205$ |
Maps-to-satellite | $0.261$ | $0.331$ | $0.272$ |
Satellite-to-maps | $0.069$ | $0.104$ | $0.080$ |

**Table 4.** Experimental results with autonomous driving system: autonomy, position precision, and comfort.

Video Type | Autonomy | Position Precision | Comfort |
---|---|---|---|
Original day | $99.6\%$ | $73.3\%$ | $89.7\%$ |
Original night | $98.6\%$ | $63.1\%$ | $86.3\%$ |
Day-to-night (InvAuto) | $99.0\%$ | $69.6\%$ | $83.2\%$ |
Night-to-day (InvAuto) | $99.3\%$ | $68.0\%$ | $84.7\%$ |
Day-to-night (CycleGAN) | $99.0\%$ | $68.4\%$ | $84.7\%$ |
Night-to-day (CycleGAN) | $98.8\%$ | $64.0\%$ | $87.3\%$ |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Teng, Y.; Choromanska, A.
Invertible Autoencoder for Domain Adaptation. *Computation* **2019**, *7*, 20.
https://doi.org/10.3390/computation7020020
