# Modified Convolutional Neural Network Based on Dropout and the Stochastic Gradient Descent Optimizer


## Abstract


## 1. Introduction

- When the sigmoid function is used as the activation function, input values that originally vary over a wide range are squeezed into the (0, 1) output range. When the training dataset is large, the sigmoid function easily causes gradient saturation, which slows convergence.
- In the CNN, early stopping and regularization strategies are often used to mitigate the overfitting problem. In the early stopping strategy, the dataset is divided into a training set and a test set. The training set is used to calculate the gradient and update the connection weights and thresholds, and the test set is used to evaluate the error. Training stops when the training set error keeps decreasing while the test set error increases. In the regularization strategy, a term that describes the model complexity is added to the error objective function. As the number of learning layers in the CNN increases, the capability of these strategies to solve the overfitting problem is reduced.
- In the training and evaluation phases, the CNN minimizes the cumulative error on the training set through the backpropagation algorithm and the gradient descent strategy. The cumulative error is not evaluated at every training step but only after a given interval. Although the time cost decreases, the cumulative error increases.
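The early stopping rule mentioned in the second item can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `early_stopping_epoch`, the `patience` parameter, and the error values are hypothetical and chosen only to show the stopping criterion (held-out error stops improving while training continues).

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the epoch whose weights would be kept.

    val_errors: held-out (test) error measured after each epoch.
    patience:   number of consecutive non-improving epochs tolerated
                before training is stopped (hypothetical parameter).
    """
    best_epoch = 0
    best_error = float("inf")
    bad_epochs = 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            # Held-out error improved: remember this epoch.
            best_error = err
            best_epoch = epoch
            bad_epochs = 0
        else:
            # Held-out error did not improve: count towards stopping.
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


# Made-up error curve: the held-out error falls and then rises again.
val_errors = [0.30, 0.22, 0.18, 0.19, 0.21, 0.25]
stop_at = early_stopping_epoch(val_errors, patience=2)
```

With this made-up curve, the loop stops after two consecutive non-improving epochs and keeps the weights from the epoch with the lowest held-out error.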

## 2. Related Works

#### 2.1. Typical CNN Model

#### 2.2. SGD Optimizer

_{1}, and then randomly select a training sample $(x_{i_t}, y_{i_t})$ from the whole training set, where $i_t \in \{1, \dots, m\}$ is the index of the selected training sample at the $t$th iteration. The objective function is
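The sampling step described above can be sketched in a few lines of NumPy. Because the objective function is not reproduced here, the sketch assumes a per-sample squared error for a linear model. The names `sgd_linear_regression`, `lr`, `n_iters`, and `w`, as well as the synthetic data, are illustrative assumptions rather than the paper's settings.

```python
import numpy as np


def sgd_linear_regression(X, y, lr=0.05, n_iters=2000, seed=0):
    """Plain SGD on the assumed per-sample loss 0.5 * (x_i . w - y_i)^2."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)                       # initial parameter vector
    for t in range(n_iters):
        i_t = rng.integers(m)             # index drawn uniformly from the m samples
        x_i, y_i = X[i_t], y[i_t]
        grad = (x_i @ w - y_i) * x_i      # gradient of the sampled loss only
        w -= lr * grad                    # one stochastic gradient step
    return w


# Synthetic, noise-free data (an assumption made for this illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_hat = sgd_linear_regression(X, y)
```

Each iteration touches a single training sample, which is what distinguishes SGD from full-batch gradient descent over all $m$ samples.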

#### 2.3. Dropout Layer

## 3. Modified CNN Based on Dropout and the SGD Optimizer

#### 3.1. Quadratic CNN Structure

Here, $w^{(l)}_{i,j}$ represents the weight of the $i$ neuron in the $j$ class of the $l$ layer; $b_i$ represents the offset of the $i$ class; $*$ represents the convolution operation; $x_j^{(l)}$ represents the output of the $j$ neuron in the $l$ layer convolution; $x_j^{(l-1)}$ represents the output of the $j$ neuron in the $l-1$ layer, i.e., the input data for the $l$ layer; and $f(\dots)$ represents the activation function of the model, which has nonlinear characteristics.
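To make these symbols concrete, the following NumPy sketch computes one convolutional layer. It assumes the standard form in which each output map is the activation of the summed convolutions of the input maps plus an offset, which may differ in detail from the paper's formula. The helper names `conv2d_valid` and `conv_layer` are hypothetical, the offset is indexed by output map, and the "convolution" is implemented as cross-correlation, as most deep learning frameworks do.

```python
import numpy as np


def conv2d_valid(x, w):
    """'Valid' 2-D cross-correlation of one feature map x with one kernel w."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * w)
    return out


def conv_layer(x_prev, weights, bias, f):
    """Compute the output maps of one layer.

    x_prev:  (n_in, H, W) outputs of layer l-1
    weights: (n_in, n_out, kh, kw) kernels linking input map i to output map j
    bias:    (n_out,) offsets, one per output map (assumption)
    f:       elementwise activation function
    """
    n_in, n_out = weights.shape[:2]
    outputs = []
    for j in range(n_out):
        # Sum the contributions of all input maps, then add the offset.
        z = sum(conv2d_valid(x_prev[i], weights[i, j]) for i in range(n_in))
        outputs.append(f(z + bias[j]))
    return np.stack(outputs)


x_prev = np.ones((1, 4, 4))               # one 4x4 input map
weights = np.ones((1, 2, 3, 3))           # two 3x3 kernels
bias = np.array([0.0, -10.0])
out = conv_layer(x_prev, weights, bias, lambda z: np.maximum(z, 0.0))
```

In this toy case every 3 × 3 window sums to 9, so the first output map is 9 everywhere and the second, after the offset of −10 and a ReLU, is 0 everywhere.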

#### 3.2. Activation Function Based on Leaky ReLU

Here, $(W^{(l)})^{T}$ represents the weights of the neurons in the $l$ layer. When the backpropagated error term of the neurons in the $l+1$ layer, ${\delta}^{(l+1)}$, approaches 0, the computed gradient also approaches 0, so the parameters are effectively no longer updated. (2) Constantly positive weight: The average value of the function output is not 0, so the next layer of neurons receives input signals with a nonzero mean. The input data of these neurons are therefore positive and, consequently, the weights become positive. These problems result in slow parameter convergence and degrade the training efficiency and the recognition performance of the model. The tanh function maps a real input into the (−1, 1) range, but it is actually a variant of the sigmoid function, i.e., $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$. Moreover, the tanh function also exhibits the gradient saturation problem.
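The saturation behaviour and the tanh identity stated above can be checked numerically. The sketch below is only an illustration: the negative slope `alpha=0.01` of the Leaky ReLU is an assumed common default, not a value taken from the paper, and the sample points in `x` are arbitrary.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                  # derivative of the sigmoid


def leaky_relu(x, alpha=0.01):
    # alpha is the slope for negative inputs (assumed value).
    return np.where(x > 0, x, alpha * x)


def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)


x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# tanh is a rescaled, shifted sigmoid.
identity_holds = np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# Gradients at the sample points: sigmoid and tanh vanish for large |x|,
# whereas the Leaky ReLU gradient never drops below alpha.
sig_g = sigmoid_grad(x)
tanh_g = 1.0 - np.tanh(x) ** 2
leaky_g = leaky_relu_grad(x)
```

At $|x| = 10$ the sigmoid and tanh gradients are already negligible, while the Leaky ReLU gradient stays at least `alpha`. This is the property that motivates replacing the saturating activations.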

#### 3.3. Method Based on Dropout and SGD for Preventing Overfitting

#### 3.4. Modified Convolutional Neural Network Based on Dropout and the SGD Optimizer

- Step 1: Pretrain the filter, and initialize the filter size as $P_1 \times P_2$ pixels.
- Step 2: Enter the image dataset for training. Process the images of the training set to the same size as the filter, and read the data to form the image data matrix $\mathbf{X}$.
- Step 3: Initialize the weight $w^{(l)}_{i,j}$ and bias $b_i$, and invoke the kernel function def Kernel() provided by TensorFlow to initialize parallel operations.
- Step 4: Use Conv2d for the two-dimensional convolution operation to obtain the first layer convolution feature matrix $\mathbf{X}^{(1)}$.
- Step 5: Use the first layer convolution feature matrix $\mathbf{X}^{(1)}$ as the input data of the pooling layer. Use Formula (5) for the pooling operation to obtain the feature matrix $\mathbf{X}^{(2)}$.
- Step 6: Use the SGD optimizer function expressed in Formula (4) to derive the learning rate of the top-down tuning optimizer, and use the weight and bias update interface in TensorFlow to update the weight $w_i$ and the bias $b_i$, thus obtaining the feature matrix $\mathbf{X}^{(3)}$.
- Step 7: Generate the second convolution following Steps 4, 5, and 6 to derive the feature matrix $\mathbf{X}^{(4)}$.
- Step 8: Flatten the feature matrix $\mathbf{X}^{(4)}$ into a column vector as the input of the neurons at the fully connected layer, multiply it by the weight matrix, add the bias, and then use the Leaky ReLU activation function to obtain the feature vector $\mathbf{b}_1$.
- Step 9: Use the feature vector of the fully connected layer as the input of the dropout layer, and compute the output probability of the neurons in the dropout layer using Formula (6) to obtain the feature vector $\mathbf{b}_2$.
- Step 10: Use the feature vector $\mathbf{b}_2$ as the input of the Softmax classifier [25] and output the results.
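Steps 8 to 10 can be sketched with NumPy as follows. This is a rough illustration under stated assumptions, not the authors' TensorFlow code. Formula (6) is not reproduced here, so the sketch substitutes the common "inverted dropout" scheme. The keep probability of 0.5, the layer sizes, the random weights, and the names `dropout`, `softmax`, `W_fc`, and `W_out` are all hypothetical.

```python
import numpy as np


def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)


def dropout(b, keep_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability 1 - keep_prob
    and rescale the survivors so the expected activation is unchanged."""
    if not training:
        return b
    mask = rng.random(b.shape) < keep_prob
    return b * mask / keep_prob


def softmax(z):
    z = z - np.max(z)                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


rng = np.random.default_rng(0)
x4 = rng.normal(size=(4, 5, 5))           # stand-in for the feature matrix X^(4)

# Step 8: flatten, apply the fully connected layer and Leaky ReLU.
v = x4.reshape(-1)
W_fc = rng.normal(scale=0.1, size=(32, v.size))
b_fc = np.zeros(32)
b1 = leaky_relu(W_fc @ v + b_fc)

# Step 9: dropout on the fully connected features.
b2 = dropout(b1, keep_prob=0.5, rng=rng)

# Step 10: Softmax classifier over 10 hypothetical classes.
W_out = rng.normal(scale=0.1, size=(10, 32))
probs = softmax(W_out @ b2)
predicted_class = int(np.argmax(probs))
```

At test time the dropout layer is bypassed (`training=False`), so no rescaling is needed during inference. That is the reason this sketch uses the inverted form.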

## 4. Experiment and Analysis

#### 4.1. Test Environment

#### 4.2. Comparison Algorithm

- (1) Algorithm 1: weighted CNN (WCNN) [11]. This algorithm uses a sigmoid function as the activation function and processes the input signal through multiple stacked convolution and pooling layers. The mapping relationship between the connection layer and the output target is then established, and a clustering algorithm is used to classify the features.
- (2) Algorithm 2: convolutional neural network with fully connected multilayer perceptron (MLP), called MLP-CNN [6]. This algorithm improves model performance by increasing the number of features of the neural network and using the stochastic gradient descent algorithm to optimize the cross entropy.
- (3) Algorithm 3: extreme learning machine (ELM) for multi-classification, called SVM-ELM (ELM optimized with support vector machine) [27]. This algorithm combines the extreme learning machine and SVM, reduces the number of hidden nodes to the number of classes, and optimizes the linear decision function of each node through the SVM.

#### 4.3. Datasets and Settings

#### 4.4. Experimental Results and Analysis of Recognition Performance under Different Learning Rates

#### 4.5. Comparison and Analysis of the Three Kinds of Algorithms

## 5. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Vieira, S.; Pinaya, W.H.L.; Mechelli, A. Using deep learning to investigate the neuroimaging correlates of psychiatric and neurological disorders: Methods and applications. Neurosci. Biobehav. Rev. **2017**, 74, 58–75.
2. Li, S.; Dou, Y.; Niu, X.; Lv, Q.; Wang, Q. A Fast and Memory Saved GPU Acceleration Algorithm of Convolutional Neural Networks for Target Detection. Neurocomputing **2017**, 230, 48–59.
3. Gong, T.; Fan, T.; Guo, J.; Cai, Z. GPU-based parallel optimization of immune convolutional neural network and embedded system. Eng. Appl. Artif. Intell. **2016**, 36, 226–238.
4. Zhang, Y.N.; Qu, L.; Chen, J.W.; Liu, J.R.; Guo, D.S. Weights and structure determination method of multiple-input Sigmoid activation function neural network. Appl. Res. Comput. **2012**, 29, 4113–4116.
5. Chen, L.; Wu, C.; Fan, W.; Sun, J.; Naoi, S. Adaptive Local Receptive Field Convolutional Neural Networks for Handwritten Chinese Character Recognition. In Chinese Conference on Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2014; pp. 455–463.
6. Singh, P.; Verma, A.; Chaudhari, N.S. Deep Convolutional Neural Network Classifier for Handwritten Devanagari Character Recognition. In Information Systems Design and Intelligent Applications; Springer: New Delhi, India, 2016.
7. Sun, W.; Su, F. A novel companion objective function for regularization of deep convolutional neural networks. Image Vis. Comput. **2016**, 56, 110–126.
8. Wachinger, C.; Reuter, M.; Klein, T. DeepNAT: Deep convolutional neural network for segmenting neuroanatomy. NeuroImage **2017**.
9. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Comput. Sci. **2016**, 26, 357–361.
10. Izotov, P.Y.; Kazanskiy, N.L.; Golovashkin, D.L.; Sukhanov, S.V. CUDA-enabled implementation of a neural network algorithm for handwritten digit recognition. Opt. Mem. Neural Netw. **2011**, 20, 98–106.
11. Hao, H.W.; Jiang, R.R. Training sample selection method for Neural Networks based on Nearest neighbor rule. Acta Autom. Sin. **2007**, 33, 1247–1251.
12. Akeret, J.; Chang, C.; Lucchi, A.; Refregier, A. Radio frequency interference mitigation using deep convolutional neural networks. Astron. Comput. **2017**, 18, 35–39.
13. Costarelli, D.; Vinti, G. Pointwise and uniform approximation by multivariate neural network operators of the max-product type. Neural Netw. **2016**, 81, 81–90.
14. Lee, C.Y.; Xie, S.; Gallagher, P.; Zhang, Z.; Tu, Z. Deeply-supervised nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 21 February 2015; pp. 562–570.
15. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE **1998**, 86, 2278–2323.
16. Zhou, F.-Y.; Jin, L.-P.; Dong, J. Review of Convolutional Neural Network. Chin. J. Comput. **2017**, 40, 1229–1251.
17. Najafabadi, M.M.; Khoshgoftaar, T.M.; Villanustre, F.; Holt, J. Large-scale distributed L-BFGS. J. Big Data **2017**, 4, 22.
18. Zinkevich, M.; Weimer, M.; Li, L.; Smola, A.J. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: La Jolla, CA, USA, 2010; pp. 2595–2603.
19. Hardt, M.; Recht, B.; Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. arXiv, 2015; arXiv:1509.01240.
20. Hou, Z.H.; Fang, H.J. Robust fault-tolerant control for networked control system with packet dropout. J. Syst. Eng. Electron. **2007**, 18, 76–82.
21. Luo, P.; Li, H.F. Research on Quantum Neural Network and its Applications Based on Tanh Activation Function. Comput. Digit. Eng. **2012**, 16, 33–39.
22. Tang, Z.; Luo, L.; Peng, H.; Li, S. A joint residual network with paired ReLUs activation for image super-resolution. Neurocomputing **2018**, 273, 37–46.
23. Günnemann, N.; Pfeffer, J. Predicting Defective Engines using Convolutional Neural Networks on Temporal Vibration Signals. In Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, Munich, Germany, 11 October 2017; pp. 92–102.
24. Jin, X.; Xu, C.; Feng, J.; Wei, Y.; Xiong, J.; Yan, S. Deep learning with S-shaped rectified linear activation units. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI Press), Phoenix, AZ, USA, 12–17 February 2016; pp. 1737–1743.
25. Shi, X.B.; Fang, X.J.; Zhang, D.Y.; Guo, Z.Q. Image Classification Based on Mixed Deep Learning Model Transfer Learning. J. Syst. Simul. **2016**, 28, 167–173.
26. Yang, W.; Jin, L.; Tao, D.; Xie, Z.; Feng, Z. DropSample: A new training method to enhance deep convolutional neural networks for large-scale unconstrained handwritten Chinese character recognition. Pattern Recognit. **2016**, 58, 190–203.
27. Shen, F.; Wang, L.; Zhang, J. Reduced extreme learning machine employing SVM technique. J. Huazhong Univ. Sci. Technol. **2014**, 42, 107–110.
28. Yang, G.-C.; Yang, J.; Su, Z.D.; Chen, Z.-J. Improved YOLO feature extraction algorithm and its application to privacy situation detection of social robots. Acta Autom. Sin. **2018**, 1–12.

**Figure 2.** Neural network unit connections with different mechanisms. (**a**) Without the dropout layer mechanism; (**b**) With the dropout layer mechanism.

**Figure 5.** Nonlinear function. (**a**) Sigmoid function; (**b**) Tanh function; (**c**) ReLU function; (**d**) Leaky ReLU function.

**Figure 6.** Convolution neural network structure model of comparison algorithms. (**a**) WCNN; (**b**) MLP-CNN; (**c**) SVM-ELM.

**Figure 7.** Sample of handwriting character images. (**a**) MNIST dataset; (**b**) HCL2000 dataset; (**c**) EnglishHand dataset.

**Figure 9.** Boxplot of the recognition rate obtained by MCNN-DS under different SGD learning rates. (**a**) MNIST dataset; (**b**) EnglishHand dataset; (**c**) HCL2000 dataset.

| Datasets | Size of Training Set | Size of Test Set | Number of Image Categories |
|---|---|---|---|
| MNIST | 60,000 | 10,000 | 10 |
| HCL2000 | 2000 | 1000 | 10 |
| EnglishHand | 4200 | 1520 | 26 |

| Dataset | Metric | WCNN [11] | MLP-CNN [6] | SVM-ELM [27] | MCNN-DS |
|---|---|---|---|---|---|
| MNIST | The lowest recognition rate (%) | 95.11 | 97.82 | 89.5 | 97.36 |
| MNIST | The highest recognition rate (%) | 95.71 | 98.96 | 91.35 | 99.97 |
| MNIST | The average recognition rate (%) | 95.36 | 96.32 | 90.26 | 98.43 |
| MNIST | Standard deviation of recognition rate | -- | -- | -- | 0.0084 |
| HCL2000 | The lowest recognition rate (%) | -- | -- | 83.60 | 85.42 |
| HCL2000 | The highest recognition rate (%) | -- | -- | 90.00 | 93.99 |
| HCL2000 | The average recognition rate (%) | -- | -- | 88.63 | 90.98 |
| HCL2000 | Standard deviation of recognition rate | -- | -- | -- | 0.0396 |
| EnglishHand | The lowest recognition rate (%) | -- | -- | -- | 84.93 |
| EnglishHand | The highest recognition rate (%) | -- | -- | -- | 95.29 |
| EnglishHand | The average recognition rate (%) | -- | -- | -- | 89.77 |
| EnglishHand | Standard deviation of recognition rate | -- | -- | -- | 0.0280 |

| Algorithms | Time on MNIST (ms) | Time on HCL2000 (ms) | Time on EnglishHand (ms) |
|---|---|---|---|
| MLP-CNN | 290,491 | -- | -- |
| SVM-ELM | 132,634 | 316,372 | -- |
| MCNN-DS | 13,236 | 20,531 | 21,617 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yang, J.; Yang, G.
Modified Convolutional Neural Network Based on Dropout and the Stochastic Gradient Descent Optimizer. *Algorithms* **2018**, *11*, 28.
https://doi.org/10.3390/a11030028
