# Sequential Normalization: Embracing Smaller Sample Sizes for Normalization

## Abstract

## 1. Introduction

#### 1.1. Related Work

## 2. Methodology

#### 2.1. Formulation

#### 2.2. The Effects of Ghost Normalization

#### 2.2.1. GhostNorm to BatchNorm

#### 2.2.2. GhostNorm to GroupNorm

#### 2.3. Implementation

#### 2.3.1. Ghost Normalization

#### 2.3.2. Sequential Normalization

## 3. Experiments and Results

#### 3.1. Loss Landscape Visualisation

#### 3.2. Image Classification

#### 3.2.1. CIFAR-100

#### 3.2.2. CIFAR-10

#### 3.2.3. ImageNet

## 4. Discussion

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. GhostNorm Implementations

## Appendix B. Loss Landscape Visualization

#### Appendix B.1. Implementation Details

#### Appendix B.2. Loss Landscape

#### Appendix B.3. Results

**Figure A3.** Comparison of the loss landscape on MNIST between the baseline, BatchNorm, and GhostNorm with different ${G}_{M}$ values. The last figure (bottom right) depicts the misclassification error on the testing set during training.

**Figure A4.** Comparison of the loss landscape on CIFAR-10 between BatchNorm and GhostNorm with different ${G}_{M}$ values. The last figure (bottom right) depicts the misclassification error on the testing set during training.

## Appendix C. Image Classification

#### Implementation Details

## Appendix D. Negative Results

## References

1. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167.
2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
3. Huang, G.; Liu, Z.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2016, arXiv:1608.06993.
4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
5. Graves, A.; Mohamed, A.; Hinton, G. Speech Recognition with Deep Recurrent Neural Networks. arXiv 2013, arXiv:1303.5778.
6. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 3104–3112.
7. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 2016, 529, 484–489.
8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
9. Santurkar, S.; Tsipras, D.; Ilyas, A.; Madry, A. How Does Batch Normalization Help Optimization? In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 2483–2493.
10. Bjorck, N.; Gomes, C.P.; Selman, B.; Weinberger, K.Q. Understanding Batch Normalization. In Advances in Neural Information Processing Systems 31; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 7694–7705.
11. Ioffe, S. Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1942–1950.
12. Lei Ba, J.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450.
13. Qiao, S.; Wang, H.; Liu, C.; Shen, W.; Yuille, A. Weight Standardization. arXiv 2019, arXiv:1903.10520.
14. Wu, Y.; He, K. Group Normalization. arXiv 2018, arXiv:1803.08494.
15. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv 2016, arXiv:1607.08022.
16. Luo, C.; Zhan, J.; Wang, L.; Gao, W. Extended Batch Normalization. arXiv 2020, arXiv:2003.05569.
17. Liang, S.; Huang, Z.; Liang, M.; Yang, H. Instance Enhancement Batch Normalization: An Adaptive Regulator of Batch Noise. arXiv 2019, arXiv:1908.04008v2.
18. Singh, S.; Shrivastava, A. EvalNorm: Estimating Batch Normalization Statistics for Evaluation. arXiv 2019, arXiv:1904.06031.
19. Hoffer, E.; Hubara, I.; Soudry, D. Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Long Beach, CA, USA, 4–9 December 2017; pp. 1729–1739.
20. Yan, J.; Wan, R.; Zhang, X.; Zhang, W.; Wei, Y.; Sun, J. Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization. arXiv 2020, arXiv:2001.06838.
21. Summers, C.; Dinneen, M.J. Four Things Everyone Should Know to Improve Batch Normalization. arXiv 2020, arXiv:1906.03548.
22. Wu, J.; Hu, W.; Xiong, H.; Huan, J.; Braverman, V.; Zhu, Z. On the Noisy Gradient Descent that Generalizes as SGD. arXiv 2019, arXiv:1906.07405.
23. Smith, S.L.; Elsen, E.; De, S. On the Generalization Benefit of Noise in Stochastic Gradient Descent. arXiv 2020, arXiv:2006.15081.
24. De, S.; Smith, S.L. Batch Normalization Biases Residual Blocks towards the Identity Function in Deep Networks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Vancouver, BC, Canada, 6–12 December 2020.
25. Bronskill, J.; Gordon, J.; Requeima, J.; Nowozin, S.; Turner, R.E. TaskNorm: Rethinking Batch Normalization for Meta-Learning. arXiv 2020, arXiv:2003.03284.
26. Luo, P.; Ren, J.; Peng, Z.; Zhang, R.; Li, J. Differentiable Learning-to-Normalize via Switchable Normalization. arXiv 2019, arXiv:1806.10779.
27. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. arXiv 2019, arXiv:1909.13719.
28. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning Augmentation Strategies From Data. arXiv 2019, arXiv:1805.09501.
29. Recht, B.; Roelofs, R.; Schmidt, L.; Shankar, V. Do ImageNet Classifiers Generalize to ImageNet? arXiv 2019, arXiv:1902.10811.
30. Smith, L.N.; Topin, N. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv 2017, arXiv:1708.07120.
31. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552.
32. Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv 2017, arXiv:1706.02677.
33. Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2017, arXiv:1605.07146.

**Figure 1.** The input tensor with $(M,C,F)$ dimensions is divided into a number of line (1D) or plane (2D) slices. Each normalization technique slices the input tensor differently, and each slice is normalized independently of the other slices. For ease of visualization, both ${G}_{C}$ and ${G}_{M}$ are set to 4.
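
To make the slicing above concrete, the following is a minimal PyTorch-style sketch (not the authors' code): GhostNorm splits the batch dimension into ${G}_{M}$ groups that are normalized independently, and SeqNorm composes channel-wise GroupNorm over ${G}_{C}$ groups with GhostNorm. The class names `GhostNorm2d` and `SeqNorm2d`, the reuse of `nn.BatchNorm2d` per ghost group, and the GroupNorm-then-GhostNorm ordering are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GhostNorm2d(nn.Module):
    """Sketch of Ghost Normalization: the mini-batch of size M is split into
    G_M "ghost" groups, each normalized with its own batch statistics."""

    def __init__(self, num_channels: int, g_m: int):
        super().__init__()
        self.g_m = g_m
        self.bn = nn.BatchNorm2d(num_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes the batch dimension M is divisible by G_M.
        groups = x.chunk(self.g_m, dim=0)
        # During training, each group is normalized independently of the others.
        return torch.cat([self.bn(g) for g in groups], dim=0)


class SeqNorm2d(nn.Module):
    """Sketch of Sequential Normalization: GroupNorm over G_C channel groups,
    followed by GhostNorm over G_M mini-batch groups (ordering assumed)."""

    def __init__(self, num_channels: int, g_c: int, g_m: int):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=g_c, num_channels=num_channels)
        self.ghost = GhostNorm2d(num_channels, g_m)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ghost(self.gn(x))


# Usage: an (M, C, H, W) tensor with G_C = G_M = 4, matching the figure.
x = torch.randn(16, 64, 8, 8)
y = SeqNorm2d(num_channels=64, g_c=4, g_m=4)(x)
assert y.shape == x.shape
```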

**Figure 3.** Comparison of the loss landscape on CIFAR-10 between BatchNorm, GhostNorm, and the baseline.

**Table 1.** Results on CIFAR-100. For SeqNorm, we only show the best results for each ${G}_{C}$. Both validation and testing performance results are averaged over two different runs.

| Method | Validation Accuracy | Testing Accuracy |
|---|---|---|
| BatchNorm | $80.6\pm 0.2\%$ | $82.1\pm 0.2\%$ |
| Noisy BatchNorm | $80.8\pm 0.1\%$ | $82.0\pm 0.3\%$ |
| GhostNorm (${G}_{M}=2$) | $80.9\pm 0.1\%$ | - |
| GhostNorm (${G}_{M}=4$) | $81.2\pm 0.1\%$ | - |
| GhostNorm (${G}_{M}=8$) | $81.4\pm 0.5\%$ | $82.8\pm 0.6\%$ |
| GhostNorm (${G}_{M}=16$) | $80.3\pm 0.5\%$ | - |
| SeqNorm (${G}_{C}=1$, ${G}_{M}=4$) | $82.3\pm 0.1\%$ | - |
| SeqNorm (${G}_{C}=2$, ${G}_{M}=4$) | $82.4\pm 0.2\%$ | - |
| SeqNorm (${G}_{C}=4$, ${G}_{M}=8$) | $\mathbf{82.5\pm 0.1\%}$ | $\mathbf{83.8\pm 0.04\%}$ |
| SeqNorm (${G}_{C}=8$, ${G}_{M}=8$) | $82.4\pm 0.2\%$ | - |
| SeqNorm (${G}_{C}=16$, ${G}_{M}=8$) | $82.3\pm 0.3\%$ | - |
| BatchNorm w/ RandAugment | $81.4\pm 0.0\%$ | $82.9\pm 0.2\%$ |
| GhostNorm w/ RandAugment | $82.3\pm 0.1\%$ | $83.5\pm 0.1\%$ |
| SeqNorm w/ RandAugment | $82.4\pm 0.2\%$ | $\mathbf{83.8\pm 0.3\%}$ |

**Table 2.** Results on the CIFAR-10 and ImageNet data sets. Both validation and testing performance results of CIFAR-10 are averaged over two different runs. For ImageNet, each model is evaluated on the conventional validation set, as well as on three newly released test sets [29].

| Method | Validation Accuracy | Testing Accuracy |
|---|---|---|
| **CIFAR-10** | | |
| BatchNorm | $96.6\pm 0.1\%$ | $97.1\pm 0.05\%$ |
| GhostNorm (${G}_{M}=4$) | $96.7\pm 0.2\%$ | $97.3\pm 0.1\%$ |
| SeqNorm (${G}_{C}=16$, ${G}_{M}=8$) | $\mathbf{96.8\pm 0.1\%}$ | $\mathbf{97.4\pm 0.2\%}$ |
| **ImageNet** | | |
| BatchNorm | $71.2\pm 0.01\%$ | $67.0\pm 6.9\%$ |
| GhostNorm (${G}_{M}=4$) | $71.6\pm 0.1\%$ | $67.4\pm 6.7\%$ |
| SeqNorm (${G}_{C}=4$, ${G}_{M}=4$) | $\mathbf{72.3\pm 0.2\%}$ | $\mathbf{68.1\pm 6.8\%}$ |

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
