Article

Demystifying Deep Learning Building Blocks

by Humberto de Jesús Ochoa Domínguez 1,*,†, Vianey Guadalupe Cruz Sánchez 1,† and Osslan Osiris Vergara Villegas 2,†
1 Electrical and Computer Engineering Department, Universidad Autónoma de Ciudad Juárez, Ciudad Juárez 32310, Mexico
2 Industrial and Manufacturing Engineering Department, Universidad Autónoma de Ciudad Juárez, Ciudad Juárez 32310, Mexico
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2024, 12(2), 296; https://doi.org/10.3390/math12020296
Submission received: 5 December 2023 / Revised: 4 January 2024 / Accepted: 8 January 2024 / Published: 17 January 2024
(This article belongs to the Special Issue Deep Neural Networks: Theory, Algorithms and Applications)

Abstract

Building deep learning models proposed by third parties can become a simple task when specialized libraries are used. However, much mystery still surrounds the design of new models or the modification of existing ones. These tasks require in-depth knowledge of the different components or building blocks and their dimensions. This information is limited and scattered throughout the literature. In this article, we collect and explain in depth the building blocks used to design deep learning models, starting from the artificial neuron and progressing to the concepts involved in building deep neural networks. Furthermore, the implementation of each building block is exemplified using the Keras library.

1. Introduction

Industry 4.0 characterizes the transformation from traditional automation to engineered cyber-physical systems with human-like intelligence (including sensing, problem solving, and other cognitive capabilities). Indeed, this gives current Artificial Intelligence (AI) the privilege of playing a central role in Industry 4.0. Moreover, in the last ten years (2013–2023), the leading branch of AI turned out to be deep learning (DL) [1,2]. DL is an essential subfield of machine learning (ML) characterized by its layered structure of artificial neural networks (ANNs). Each layer extracts helpful knowledge to make decisions or future predictions to solve self-supervised, semisupervised, supervised, and unsupervised learning problems [3,4,5]. In this sense, DL has become the main driver of the current AI hype. Moreover, DL has achieved tremendous success in a wide range of tasks that have historically been extremely difficult for computers [1], leading to more AI models with human-level performance, as pointed out in [2]. This allows manufacturing companies to make intelligent use of the large amounts of data generated in the industrial business environment [6].
Currently, many industry sectors are experiencing a tremendous transformation, integrating DL models into their solutions [7,8,9]. For example, DL is used in the nuclear energy industry to predict cracking in hazardous areas. In the agriculture industry [10], DL is used to analyze historical rainfall patterns, wind direction, and atmospheric pressure to predict storms and river water levels [11]. In the manufacturing industry, it is used for predictive maintenance. In the food industry, it is used to understand current consumer preferences and behavior [12]. In the automotive sector [13], DL is used to guide autonomous vehicles [14]. In the medical industry, it is used to diagnose and predict illnesses [15,16,17]. The increase in popularity is mainly due to three factors: (1) the rapid evolution of hardware with a highly parallel structure, (2) the development of open-source platforms for ML, and (3) the predominance of DL models in terms of accuracy and flexibility to represent the world with concepts that range from the simplest to the most complex. All this is powered by vast amounts of data [18,19].
Due to the popularity of DL, specialized hardware is necessary because current microprocessors and CPUs are not designed for training and inference with DL models. Nevertheless, Graphics Processing Units (GPUs) have emerged as the ideal complement to Central Processing Units (CPUs) to bring intelligence to applications. In addition, neural network accelerators based on Field-Programmable Gate Arrays (FPGAs) are favored over CPUs because they accelerate computations by mapping calculations into parallel hardware [20]. Furthermore, the cloud has become mainstream and cheap. Consequently, industry, academia, and government are making their data, previously stored locally, available to the research community. As DL models are complex and require large amounts of data, the availability of data sets significantly impacts AI research because today, scientists can access large data sets with diverse data points to train complex DL models. Therefore, problems like detecting cancer or predicting rainfall can be solved quickly and accurately because of high-performance computers and vast data. In addition, the available open-source platforms allow models to be designed, modified, and tested quickly and help deploy them [2,21].
Despite this importance, DL models are considered black boxes whose components or building blocks are unclear and difficult to understand. Much of the research is based on the straight application of deep networks that have already been developed. The authors do not contemplate modifications and avoid discussing a new network design for a specific input signal type. Furthermore, the explanation of the building blocks for carrying out new model designs or modifications to existing ones is limited, shallow, and dispersed throughout the literature. In this paper, we describe the building blocks for constructing DL models, from the main component, the artificial neuron (AN), to the formalization needed to analyze deep neural networks (DNNs). Examples of implementations in Keras accompany all of this [22].

2. Literature Review

In the past, computer applications were developed from a single data processing perspective. Currently, applications have shifted toward ML, with DL being one of the leading research trends; its success is mainly due to convolutional neural networks (CNNs). DL algorithms can be classified into three groups: (i) recurrent neural networks (RNNs), (ii) multilayer perceptrons (MLPs), and (iii) CNNs.
RNNs are neural networks whose layers maintain hidden states with feedback connections, allowing past outputs to be used as inputs. A particular type of RNN is the long short-term memory (LSTM) network, capable of handling the vanishing gradient problem that RNNs face [23]. RNNs are commonly used in sequential data or time series like language translation [24], natural language processing (NLP) [25], speech recognition [26], and image captioning [27]. Moreover, RNNs are included in popular applications such as Siri [28] and Google Translate [29]. Currently, some challenges [30] that NLP is addressing with DL are (a) contextual words and homonyms, for example, the same words can have different meanings according to the context of a sentence, (b) irony and sarcasm, for example, a sentence that communicates the opposite of what is said, (c) ambiguity in phrases that can have two or more interpretations, (d) errors in text and speech, for example, misspelled words, (e) slang and colloquialism, and (f) domain-specific language, for example, the model used to process a legal document is different to that used to process a healthcare document.
An MLP consists of fully connected (FC) layers with nonlinear activation functions that separate nonlinearly separable data. MLPs have been successfully used in pattern classification, recognition, prediction, and approximation tasks. Transformers are among the newest and most powerful MLP-based models, invented to track relationships in sequential data, such as words, to learn the context and meaning of the data [31]. Vision transformers (ViTs) are the transformer version for computer vision applications [32,33]. In ViTs, the images are split into patches that are serialized into vectors and mapped to a smaller dimension. Then, the resulting vector is processed via a transformer. These algorithms have been employed for autonomous driving [34], image classification, object detection, image segmentation [35], video deepfake detection [32], anomaly detection [36], image synthesis [37], and cluster analysis [38]. Generative adversarial networks (GANs) are MLP extensions of stacked FC layers that generate new data with the same statistics as the training set. The GAN consists of two parts: the generator and the discriminator. The generator models the training data distribution and provides a compressed image representation, and the discriminator is a binary classifier that decides between real and fake [39]. There are several applications of GANs. For example, CycleGAN is a GAN architecture that uses a technique for training unsupervised image translation models to learn transformations between images of different styles [40]. StyleGAN is a GAN that generates images of high resolution built from a stack of FC layers. The initial layers generate low-resolution images, and further layers refine the resolution [41]. PixelRNN is an auto-regressive generative model capable of learning an explicit data distribution, whereas GANs learn implicit probability distributions [42]. Autoencoders (AEs) and variational autoencoders (VAEs) are generative models explicitly designed to capture the probability distribution of a training set and generate new samples [43]. An encoder is made of FC layers that decrease in dimension as the encoder becomes deeper. This compresses the input data into an encoded representation several orders of magnitude smaller than the input data to produce the latent space of variables. A decoder takes a representation of the latent space and decompresses it, increasing the dimension of the FC layers as it approaches the output. When combined with prior knowledge, AE-based models have proven successful for anomaly detection in hyperspectral images. For example, in [44], a dynamic low-rank and sparse prior-constrained model was developed to combine a linear-based low-rank model, a sparse model, and a nonlinear-based deep AE to detect the anomaly and to extract the discriminative features between the background and anomaly for complex scenes. In [45], a deep self-representation learning framework for hyperspectral anomaly detection was proposed. The model integrates the prior knowledge of robust principal component analysis (PCA) and the local spatial information into the AE model for a result that outperforms state-of-the-art methods.
CNNs are stacked layers composed of the convolution of the trainable filter with the input signal or receptive field, followed by a pooling and an activation layer. The filtering extracts features from previous layers to form a feature map. CNNs are used for classification and prediction in computer vision tasks. For example, the LeNet-5 [46,47], the AlexNet [48], the VGG-16 [49,50], the DenseNet121 [51,52], the ResNet50 [53,54], and the MobileNet-V2 [55,56] have been used for classification tasks, while the U-Net [57,58] has been used for semantic segmentation problems. At present, some challenges addressed by computer vision [59] through DL are (a) image and video synthesis to create realistic images and videos for content creation and entertainment, (b) image style transfer to merge the artistic style of one image with the content of another, (c) text-to-image synthesis to extract meaning from the text description and convert it into an image for image editing, (d) enhancing the capabilities of autonomous vehicles to more precisely handle difficult driving scenarios, (e) detecting early signs of diseases before the symptoms appear, (f) identifying suspicious behavior or objects more accurately for security purposes, and (g) making DL models more interpretable, especially in applications where human lives, safety, and ethics are involved.
In summary, it is essential to mention that works in the literature have addressed only one side, focusing on a single application or topic, such as a review of CNN architectures. However, these works do not provide a complete understanding of DL topics, such as the concepts and mathematics behind the building blocks used to develop an architecture.

3. Theoretical Foundations of Deep Learning

3.1. Motivation

DNNs, specifically CNNs, play a central role in data-driven applications such as noise reduction, cancer detection, and weather forecasting, among others. These applications can be implemented with different models that have caused breakthroughs in image recognition, computer vision, and other fields [60,61]. However, the developed architectures are ad hoc for a determined set of input signals, and the theoretical background for important design choices is lacking. Some relevant works try to explain a proposed CNN’s internal operations. Nevertheless, understanding the ideas requires a high level of expertise [62,63].
Even though existing DL models have performed well in tackling a specific task, they still lack an in-depth understanding of how the specific elements or building blocks work. Without a clear understanding of the blocks, determining the desired configurations of a model remains a mystery. This causes many existing architectures to be built empirically without researching whether the network’s depth (number of parameters) and each building block are suitable for the task to be carried out. In this section, we describe the building blocks from a mathematical point of view to address the aforementioned challenging issues in a theoretically sound and explainable way.

3.2. The Artificial Neuron

The AN is the backbone of deep neural network (DNN) algorithms and bases its operation on a collection of interconnected artificial neurons organized in layers, where each connection transmits a signal to the neurons in the next layer. Figure 1 represents an artificial neuron (also known as a unit), and $w_0$ to $w_{n_x}$ are learnable weights or learnable parameters. Furthermore, $w_0$ is the offset or bias that allows other influences to be included in the model, and $x_0$ is an input set permanently to 1, meaning that the bias is always on. The input features are $x_1$ to $x_{n_x}$, and $a = f(z)$ is the output of the nonlinear activation function.
Two operations are conducted in an artificial neuron: (1) a linear combination of the $n_x$ input features with the weights or parameters $w$ and (2) a nonlinear mapping. The linear operation is expressed as
$$z = \sum_{k=1}^{n_x} w_k x_k + w_0 \qquad (1)$$
The nonlinear mapping is carried out by adding a nonlinear activation function $a = f(z)$ to process the linear combination $z$ further and to approximate complex patterns. Then, the output $f(z)$ moves to the succeeding layer through the next learnable weights. This process is known as forward propagation. Subsequently, a backward pass or backpropagation of the errors is performed to adjust the weights to correct errors and improve accuracy. Backpropagation is based on a gradient that requires continuously differentiable activation functions [64,65]. Table 1 shows some common activation functions.
The following code defines a nine-input neuron with bias and sigmoid activation function in Keras.
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(1, input_dim=9,
          use_bias=True, activation='sigmoid'))
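As a quick check (a minimal sketch, assuming the model above has been defined), the parameter count can be inspected directly; Keras reports ten trainable parameters, the nine weights plus the bias.
    model.summary()
    print(model.count_params())   # 10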

3.3. Convolutional Layer

A CNN is a type of DNN used for finding patterns in images to recognize objects, classes, and categories. This network performs a convolution operation of trained filters with the input signal to extract patterns. Each filter extracts a different pattern and forms a convolutional layer, the major building block of CNNs. After training, the layers can extract patterns from the data without human intervention. The convolution operation consists of sliding trainable convolutional filters over the receptive field, producing a layer of features that indicates the strength and positions of the detected features. In addition, complex patterns are detected by adding activation functions to further process the filtered signal [66,67,68].

3.3.1. Convolutional Filters (Kernels)

The core of a CNN consists of trainable filters, also known as kernels, whose weights or parameters are learned during the training process. The input to the filters is known as the receptive field that, convolved with the filter parameters, yields a set of feature maps that is further processed with activation functions. The output can be used as the receptive field for the next set of trainable filters to extract different feature sets. The filter parameters depend on the characteristics of the input. They are trained to extract features from structures in the receptive field, such as vertical, horizontal, or diagonal edges. However, one filter can represent only a single structure. Hence, we must train multiple filters to represent multiple structures from a single receptive field.
Processing grayscale images of size m × n requires two-dimensional (2D) kernel windows of height $F_x$ and width $F_y$ coefficients. Figure 2 shows the convolution process of obtaining three features from a one-channel 2D receptive field. The lines represent the learnable parameters (weights) and the black dot, joining the lines, represents the sum of the products of Equation (2) with $F_x = 3$ and $F_y = 3$.
$$\sum_{k=1}^{F_x} \sum_{j=1}^{F_y} w_{k,j} \cdot x_{k,j} \qquad (2)$$
The 3 × 3 pixels in the receptive field are multiplied by the parameters, and the result is added to yield one feature. The filter is slid row- and column-wise over a valid area of the receptive field by a factor S known as the convolution stride. After the convolution operation over the complete receptive field, one map of features is obtained. This map is also called a channel.
Observe that the filter coefficients convolved with values inside the valid area of the receptive field produce a feature map smaller than the receptive field. To maintain the same size, the receptive field must be padded around with P rows and columns of zeros before convolution. After the convolution operation, the resulting feature map is of size
$$\left\lfloor \frac{m - F_x + 2P}{S} + 1 \right\rfloor \times \left\lfloor \frac{n - F_y + 2P}{S} + 1 \right\rfloor \qquad (3)$$
where $\lfloor \cdot \rfloor$ is the floor operation. Three-channel receptive fields, such as color images, are processed by programming one filter per channel. The filter outputs are added to produce a single feature map or channel. Notice that the number of channels is reduced from three to one, while the number of parameters is multiplied by a factor equal to the number of channels.
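As an illustration, Equation (3) can be scripted in a few lines (a rough sketch; the helper name conv_output_size is introduced here for convenience):
    def conv_output_size(m, n, fx, fy, p=0, s=1):
        # Equation (3): floor((m - Fx + 2P)/S + 1) x floor((n - Fy + 2P)/S + 1)
        return (m - fx + 2 * p) // s + 1, (n - fy + 2 * p) // s + 1

    print(conv_output_size(32, 32, 3, 3))   # (30, 30) for a 32 x 32 input, 3 x 3 filter, P = 0, S = 1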
Figure 3 shows the filtering process of extracting one feature from a three-channel receptive field. Each filter carries out the sum of products inside their corresponding window. The three results are added to yield one feature. Afterward, the filter is shifted to the number of pixels defined by S. The three channels must be padded to keep the output feature map size equal to the receptive field.
Before training, the filter parameters are randomly initialized. One popular technique is the He initialization [69]. Here, the normal probability distribution is used with zero mean and a standard deviation of $\sqrt{2 / n_h^{[l-1]}}$, where $n_h^{[l-1]}$ is the number of hidden units in the previous layer $l-1$.
During training, the feature maps are calculated and stored together with the parameters or learnable weights in the forward propagation step. The filter parameters are updated using the stored feature maps during backpropagation. Our example shows that the operation is a 2D convolution even though the filter moves over a 3D receptive field. The filter has the same depth as the receptive field and is shifted only along the rows and columns. The convolution does not occur along the depth, reducing the feature map to 2D.
One filter can extract features from only one structure of the receptive field. For example, if the extracted features belong to vertical edges and we want to extract features from other structures, we must add more filters. Therefore, each filter produces one channel or feature map. Hence, the resulting feature map can be seen as a volume of bidimensional feature maps with the number of channels (depth) defined by the number of filters utilized to produce the output volume.
Figure 4 shows an example of the convolution of a 3D filter of size 3 × 3 × 3 over a valid area of a three-channel receptive field of size 6 × 6 samples each. The filters are represented as a 3D grid of trainable parameters $\omega_1$ to $\omega_9$, and the bias term is omitted. According to Equation (3), the resulting 2D feature map is of size 4 × 4. The arrows point to the representation of the convolution in terms of volumes. The orange and yellow blocks are the volumetric representation of the receptive field and the filter, respectively. Observe that the output is only one channel or 2D map of size 4 × 4. Notice that $*$ is the convolution operation.
If we add more filters, the number of output channels increases, as shown in Figure 5, where three filters are convolved with a volumetric receptive field on a valid area of size 6 × 6 × 3 to yield three feature maps or channels of size 4 × 4 × 3 , represented by the green volume.
Convolutional filters possess two important characteristics: (1) parameter sharing and (2) sparsity of connections. Parameter sharing means that if a trained filter can extract certain features in one part of the receptive field, extracting the same features in another part of the same receptive field is also helpful. Therefore, the same weights are used for the kernel independently of where the kernel is placed. Hence, each filter can use the same parameters in different positions in the receptive field. The sparsity of connections means that each feature of the feature map is computed using an area of $F_x \times F_y$ samples (see Figure 5). The remaining values outside the area do not affect the output. These two mechanisms allow a model to be defined with a reduced number of parameters and a reduction in the variance of parameters, reducing the possibility of overfitting.
It is important to highlight that classical filters are handcrafted to produce results such as detecting edges and removing noise. The difference with convolutional filters is that the parameters of the filters are learned during training by looking at the scene of a specific problem.

3.3.2. Two-Dimensional Convolutional Layer

For the model to learn complex functional mappings, a layer of activation functions is required after filtering. In some cases, a bias term is added to the linear combination. Figure 6 shows the process of obtaining a 2D convolutional layer. Suppose the receptive field is a color image of size 6 × 6 × 3 pixels. The convolution with three filters of size 3 × 3 × 3 , over the valid area of the receptive field, with  P = 0 and S = 1 , will yield three feature maps of size 4 × 4 each (see Equation (3)). This feature map is also called a map of activations or a convolutional layer. Moreover, to learn complex parameters, it is necessary to process the feature map further using an activation layer whose output is also called the convolutional layer. Therefore, the output volume of Figure 6 is called a 2D convolutional layer.
If we want the layer to be the same size as the receptive field, we use zero padding P around the image before convolution. Figure 7 shows a portion of a receptive field padded around with one row and one column of zeros ( P = 1 ) and three steps of convolution with a stride of one ( S = 1 ) . The gray squares at the output of the activation layer represent the output features of the convolutional layer. For a color image, each channel must be padded, and the convolution operation must be carried out beyond the limits of the valid area of each channel. The activation unit shown represents a layer of activation functions and not only a single unit. Figure 8 shows the convolution operation with P = 1 and S = 2 .
CNN models add one convolutional layer after another to extract features from the previous layers. The greater the depth, the greater the complexity of the extracted features. However, the complexity of the model increases as the number of parameters increases. The number of parameters used in any of the layers can be calculated as
$$N_p^{[l]} = \left( F_x^{[l]} \cdot F_y^{[l]} \cdot N_c^{[l-1]} + 1 \right) \cdot N_c^{[l]} \qquad (4)$$
where $l$ is the current layer number, $l-1$ is the previous layer number, $N_p^{[l]}$ is the number of parameters in the current layer, $N_c^{[l]}$ is the number of channels in the current layer, and $N_c^{[l-1]}$ is the number of channels in the previous layer. The dimension of a convolutional layer can be determined by using Equation (3). For example, if the input image is of size 32 × 32, grayscale, the filter size is 3 × 3, the stride is one, and no padding is used, the dimension of the output feature map is
$$\left\lfloor \frac{32 - 3 + 2 \cdot 0}{1} + 1 \right\rfloor = 30$$
Hence, the dimension of the layer is 30 × 30. Observe that the number of parameters is determined using the number of weights of the filter plus the bias if it is used. According to Equation (4), the total number of parameters is $(3 \cdot 3 \cdot 1 + 1) \cdot 1 = 10$. Figure 9 shows an example of obtaining a convolutional layer after the convolution operation of the receptive field with a 3 × 3 filter and the activation layer. The size of the convolutional layer is determined via Equation (5).
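Equation (4) can be checked in the same spirit (a sketch; conv_params is a name we introduce for illustration):
    def conv_params(fx, fy, channels_in, channels_out, use_bias=True):
        # Equation (4): (Fx * Fy * Nc[l-1] + 1) * Nc[l]
        per_filter = fx * fy * channels_in + (1 if use_bias else 0)
        return per_filter * channels_out

    print(conv_params(3, 3, 1, 1))   # 10, as in the example above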
One filter is insufficient to capture most of the features in a receptive field. Hence, multiple filters must be used to create multiple feature maps, as shown in Figure 10. The gray shade grids represent the filter coefficients. The receptive field is convolved with four 3 × 3 filters. Each output is passed through an activation layer to yield a convolutional layer of size
$$\left\lfloor \frac{m^{[l-1]} - F_x^{[l]} + 2P^{[l]}}{S^{[l]}} + 1 \right\rfloor \times \left\lfloor \frac{n^{[l-1]} - F_y^{[l]} + 2P^{[l]}}{S^{[l]}} + 1 \right\rfloor \times N_c^{[l]}$$
Below is a piece of Keras code utilized to implement a two-layer CNN. The first layer accepts 2D receptive fields of size 28 × 28 and learns 32 convolutional filters of size 3 × 3 with a bias term. The convolution is carried out over a padded area with a convolution stride of 1 along rows and columns. The activation map is passed through a ReLU activation layer to remove negative values and produce a convolutional layer of the same size as the input receptive field. This layer is the receptive field of the next layer. The second layer learns 64 filters of 5 × 5 . The convolution operation is performed over a valid receptive field area, followed by a ReLU activation layer.
    model = Sequential()
    model.add(Conv2D(32, (3,3), padding='same',
          strides=(1,1), activation='relu',
          kernel_initializer='he_normal',
          input_shape=(28,28,1)))
    model.add(Conv2D(64, (5,5), padding='valid',
          strides=(1,1), activation='relu',
          kernel_initializer='he_normal'))
According to Equation (4), the number of parameters used in the first layer is $(3 \cdot 3 \cdot 1 + 1) \cdot 32 = 320$ and in the second layer is $(5 \cdot 5 \cdot 32 + 1) \cdot 64 = 51,264$. The first convolutional layer is a volume of size 28 × 28 × 32, and the output layer is a volume of size 24 × 24 × 64 (see Equation (3)).
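These shapes and counts can be confirmed by printing the model summary (assuming the two Conv2D layers above have been added):
    model.summary()   # 320 and 51,264 parameters; output shapes (28, 28, 32) and (24, 24, 64)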

3.4. Three-Dimensional Convolutional Layer

Three-dimensional (3D) convolutional neural networks are mostly used in 3D data such as Positron Emission Tomography (PET) Imaging, Magnetic Resonance Imaging (MRI), and Computerized Tomography (CT) Scan [70]. Typically, 3D architectures use the 3D convolution operation to extract features from three-dimensional ( x , y , z ) data. Three-dimensional convolutions apply three-dimensional filters to the data. The kernels may not be the same depth as the receptive field. For example, filters could be smaller in size than the input image.
Figure 11 shows the convolution of a volumetric receptive field (voxels of a 3D image) of size ( 8 × 8 × 8 ) with a filter of size ( 3 × 3 × 3 ). The feature in the color blue is obtained by adding the sum of products inside the volume engulfed by the 3D filter in the receptive field (orange color). The filter is shifted by the number of voxels defined by S, and the calculation is repeated to compute a new feature. The size of the output volume depends on the values of P and S. The 3D CNN layer is obtained after a 3D activation layer.
The dimensions of the output volume are related to the dimensions of the previous volume using Equation (6).
$$m^{[l]} = \left\lfloor \frac{m^{[l-1]} - F_x^{[l]} + 2P^{[l]}}{S^{[l]}} + 1 \right\rfloor \qquad (6)$$
where $m^{[l-1]}$ is the height of the input volume. The width and depth of the volume are calculated similarly. Feature maps often increase with the depth of the network, which can result in a considerable increase in the number of parameters. As a result, the computation complexity is exacerbated when filters of larger sizes are used (5 × 5, 7 × 7, etc.).
Following is a piece of Keras code for implementing a two-layer model. The input shape ( 16 × 16 × 16 × 3 ) has four dimensions. The fourth dimension is the number of input channels. The input data are sized ( 16 × 16 × 16 ). The layers learn 16 and 32 filters with kernel sizes of 5 × 5 × 5 and 3 × 3 × 3 , respectively. The parameters are randomly initialized [69]. Finally, the extracted 3D feature maps are passed through a 3D ReLU activation layer.
    model = Sequential()
    model.add(Conv3D(16, (5,5,5), padding='valid',
          activation='relu',
          kernel_initializer='he_normal',
          input_shape=(16,16,16,3)))
    model.add(Conv3D(32, (3,3,3), padding='valid',
          activation='relu',
          kernel_initializer='he_normal'))
The number of parameters learned in the first layer is $(5 \cdot 5 \cdot 5 \cdot 3 + 1) \cdot 16 = 6016$ and the layer size is 12 × 12 × 12 × 16. This layer is the receptive field of the second layer. The second layer learns $(3 \cdot 3 \cdot 3 \cdot 16 + 1) \cdot 32 = 13,856$ parameters, and the output layer is of size 10 × 10 × 10 × 32, where the fourth dimension is the yielded number of channels. In both layers, the convolution operation is performed on the valid area of the receptive field. Therefore, the dimension of the convolutional layers is smaller than that of its receptive field.
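The same arithmetic extends to 3D filters by including the depth term (a short sketch with an illustrative helper name):
    def conv3d_params(fx, fy, fz, channels_in, channels_out, use_bias=True):
        # (Fx * Fy * Fz * Nc[l-1] + 1) * Nc[l]
        per_filter = fx * fy * fz * channels_in + (1 if use_bias else 0)
        return per_filter * channels_out

    print(conv3d_params(5, 5, 5, 3, 16))    # 6016
    print(conv3d_params(3, 3, 3, 16, 32))   # 13,856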

3.5. A 1 × 1 Convolutional Layer

Stacking convolutional layers is computationally expensive and prone to overfitting. A strategy to reduce the complexity is to apply 1 × 1 convolution to multichannel receptive fields. It is implemented using filters of size 1 × 1 × D , where D is the depth of the receptive field. The convolution is carried out along rows and columns of the volume, creating a linear projection of a stack of channels or feature maps. Hence, the output is a feature map with the same spatial dimension as the receptive field with the number of channels D collapsed to one, as shown in Figure 12. It does not make sense to apply 1 × 1 convolution to a single-channel receptive field because, in this case, the convolutional filter consists of one coefficient only [71].
The 1 × 1 convolution layer can be added to a model architecture in Keras as follows.
    model.add(Conv2D(16, (1,1), use_bias=False,
          activation='relu',
          kernel_initializer='he_normal',
          input_shape=(256,256,3)))
The layer processes three-channel input of size 256 × 256 . The bias factor in each 1 × 1 filter is not used. Hence, the number of parameters is 48. The output layer is a volume of dimension 256 × 256 × 16 .
Figure 13 shows an inception module of the GoogLeNet or Inception V1 network [72] that uses three filters (1 × 1, 3 × 3 and 5 × 5), one 3 × 3 max pool operation and one concatenation layer. The convolution is implemented with a stride of one (S = 1), and the output layer keeps the same size (P = 'same') as the receptive field.
Suppose we want to obtain a 28 × 28 × 128 convolutional layer, similar to the yellow block in the second branch of Figure 13. Furthermore, assume we do not use the 96 filters of size 1 × 1 × 192 in the middle. Therefore, we must use 128 filters of 3 × 3 × 192. Hence, we carry out a total of 3 × 3 × 192 × 128 × 28 × 28 = 173,408,256 multiplications. On the other hand, if we insert the middle block, the number of multiplications is reduced to 1 × 1 × 192 × 28 × 28 × 96 + 3 × 3 × 96 × 28 × 28 × 128 = 101,154,816.
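The saving can be reproduced with two lines of arithmetic (plain Python, no Keras involved):
    direct = 3 * 3 * 192 * 128 * 28 * 28                                          # 173,408,256
    with_bottleneck = 1 * 1 * 192 * 96 * 28 * 28 + 3 * 3 * 96 * 128 * 28 * 28     # 101,154,816
    print(direct / with_bottleneck)   # roughly 1.7x fewer multiplications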

3.6. Pooling Layer

After the activation layer, the convolutional layer contains the precise position of the features. Hence, if small movements in the receptive field occur, the resulting features change. This problem has been addressed in signal processing by using the so-called downsampling operation, which is carried out by using a stride different from one during the convolution operation to keep important features and discard details that may not be useful. A similar operation is performed by adding a pooling layer after the activation layer [73]. However, some generative models replace this layer with a convolution stride different from one.
The pooling layer reduces the spatial resolution of a feature map and provides limited invariance to rotation and small shifts. Pooling acts on a convolutional layer separately to create a reduced set of feature maps and to lower the memory requirements. The main mechanisms used by this layer are max pooling, average pooling, and global pooling, which are explained below.

3.6.1. Max Pooling and Average Pooling

Pooling operations are applied on a fixed-shape window, known as a pooling window, that is shifted over the feature map according to its stride, computing a single value at each window location. Max pooling preserves the maximum value inside the window, and average pooling preserves the average value inside the window.
Figure 14 shows the max pooling operation on a 4 × 4 valid area of a receptive field, without padding and with a stride of one and two, respectively. The red square represents the initial pooling windows. Notice how the window pools the four values into one.
Figure 15 shows the average pooling operation with strides of one and two, respectively. The output is the reduced feature map or pooling layer. Notice that when S = 2, the resolution is halved along each dimension, reducing the number of values in each feature map to one-quarter of the number of values in the receptive field.
Pooling in 3D is carried out upon a volumetric receptive field by extracting the maximum or the average value in a window of size r × r × r that moves along the x, y, and z axes. A pooling layer can be added after a convolution operation carried out with a stride equal to one (S = 1) to reduce the size of the feature map.
More complex feature maps can be extracted using the convolutional layers as the receptive field of another set of filters. Hence, we can stack several layers by extracting features from previously generated layers. This process is called a hierarchical representation of features because the initial layers detect simple patterns like edges and the higher layers detect more abstract features. A max pooling layer with a window size of 2 × 2 and stride of 2 can be added to a model as follows.
    model.add(MaxPooling2D(pool_size=(2,2),
          strides =(2, 2)))
and the average pooling layer as,
    model.add(AveragePooling2D(pool_size=(2,2),
          strides =(2,2)))
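The three-dimensional pooling described above is added analogously (a brief sketch; MaxPooling3D and AveragePooling3D are the corresponding Keras layers):
    model.add(MaxPooling3D(pool_size=(2,2,2),
          strides=(2,2,2)))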

3.6.2. Global Pooling

Global pooling reduces an entire feature map into a single sample. This mechanism is utilized as an alternative to a fully connected (FC) layer. It consists of converting the entire feature map into a single output value that summarizes the presence of a feature in that map. Global pooling can be performed via a global average pooling operation, which takes the average of the entire feature map, or a global max pooling operation, which takes the maximum value of a feature map. The pooling operation is applied to each feature map, producing one value per map. The resulting vector can be fed directly into the softmax layer in the case of multiclass classification. Some advantages of global pooling are that it is more robust to spatial translations of the receptive field, there are no parameters to optimize, and it is more native to the convolution structure because there is a direct correspondence between feature maps and classes. One example of global pooling is the inception network, which uses the global average pooling operation after the last inception module instead of FC layers.
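In Keras, both variants are available as layers; for example, a classification head built with global average pooling instead of flattening and FC layers can be sketched as follows (GlobalMaxPooling2D works the same way):
    model.add(GlobalAveragePooling2D())
    model.add(Dense(10, activation='softmax'))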

3.7. Flattening Layer

In classification models, the last stage is built of fully connected layers. Hence, the input must be a one-dimensional vector. If the input to the fully connected layer has a different dimension, it has to be converted into a one-dimensional array to create a single feature vector. This process is known as flattening. Figure 16 shows the process of flattening a two-dimensional receptive field. It consists of concatenating the receptive field column by column (or row by row), one after the other, to obtain a one-dimensional vector of size (N · M) × 1. In Keras, a flattening layer can be added using the function
    model.add(Flatten())

3.8. Fully Connected Layer

Figure 17 shows a four-layer shallow network. The lines are the weights conveying information from one layer to another in the direction of the arrows. The green, blue, and orange circles represent the input, hidden, and output units, respectively. The white circles are the bias units. Notice that the output units of a previous layer are connected to every unit in the following layer. This type of layer is known as the dense or FC layer.
Typically, ANNs are built with one or more FC layers. The computations are carried out using vectorized algorithms. However, when building ANN models, one frustrating problem is obtaining the correct shape of the vectors and matrices of each layer. Even when using software like Keras [22], Caffe [74], or PyTorch [75] with any available version, model design can be overwhelming if the dimensions of each layer are not considered. In this section, we explain how to calculate the dimensions of the matrices and vectors of FC layers.
Here, we introduce the superscript $[l]$ to denote the $l$th layer and the subscript $j$ to denote the unit number in the layer. Therefore, the notation $z_j^{[l]}$ denotes the linear combination operation of the $j$th neuron, located in the $l$th layer. Hence, Equation (1) can be modified as
$$z_j^{[l]} = \sum_{k=1}^{n_x} w_{j,k} x_k + w_{j,0} \qquad (7)$$
Let us review the modeling of an ANN. We aim to describe the network by finding the equations that define the forward propagation from the input to the output layer. We first construct the $i$th input vector $x^{(i)[0]} = \{x_k\}_{k=1}^{n_x} \in \mathbb{R}^{n_x \times 1}$ in $l = 0$, also known as the feature vector or input example. The weights connecting the input features with each hidden unit can be placed in a column vector $w_j^{[1]} = \{w_{j,k}^{[1]}\}_{k=1}^{n_x} \in \mathbb{R}^{n_x \times 1}$ to end up having a total of $n_h^{[1]}$ column vectors, where $n_h^{[1]}$ is the number of hidden units in layer 1. Consider the weight matrix $w^{[1]} \in \mathbb{R}^{n_h^{[1]} \times n_x}$ mapping the input vector from layer $l = 0$ to layer $l = 1$, with rows being the set of transposed vectors $\{w_j^{[1]T}\}_{j=1}^{n_h^{[1]}}$ and the bias vector being the set of weights $b^{[1]} = \{w_{j,0}^{[1]}\}_{j=1}^{n_h^{[1]}} \in \mathbb{R}^{n_h^{[1]} \times 1}$. Then, the linear combination in the first layer is the vector $z^{(i)[1]} = w^{[1]} x^{(i)[0]} + b^{[1]} \in \mathbb{R}^{n_h^{[1]} \times 1}$, and the activation output is $a^{(i)[1]} = f^{(i)[1]}\left(z^{(i)[1]}\right) \in \mathbb{R}^{n_h^{[1]} \times 1}$.
By letting $x^{(i)[0]} = a^{(i)[0]}$ be the output of some activation function, we can express the forward propagation equations of the network in terms of the $i$th input example as
$$z^{(i)[l]} = w^{[l]} a^{(i)[l-1]} + b^{[l]} \qquad (8)$$
$$a^{(i)[l]} = f^{(i)[l]}\left( w^{[l]} a^{(i)[l-1]} + b^{[l]} \right) \qquad (9)$$
The dimension of the weight matrix $w^{[l]}$ is the number of neurons in the current layer $l$ multiplied by the number of neurons in the previous layer, $n_h^{[l]} \times n_h^{[l-1]}$. The dimension of the bias vector $b^{[l]}$, the linear combination vector $z^{[l]}$, and the output vector $a^{[l]}$ is the number of neurons in the current layer multiplied by one, $n_h^{[l]} \times 1$. Eventually, we can calculate the number of parameters as $(n_h^{[l-1]} + 1)\, n_h^{[l]}$.
Observe the similarity of Equations (8) and (9) with the process shown in Figure 6. Analogously, we can say that the receptive field is the input image $a^{(1)[0]}$, $w^{[1]}$ are the filter coefficients, $b^{[1]}$ are bias terms, and $f^{(1)[1]}$ is the activation layer. Therefore, $a^{(1)[1]}$ is the output volume or convolutional layer, which can be the input or receptive field to the next layer.
The following code implements the sequential model of Figure 17 using Keras software, assuming that the network accepts 50 inputs, the two hidden layers have 32 nodes each with the ReLU activation function, and the output layer has ten nodes with the softmax activation function. The weights are initialized using the He algorithm with normal distribution.
    model = Sequential()
    model.add(Dense(32, use_bias=True,
          activation='relu',
          kernel_initializer='he_normal',
          input_shape=(50,)))
    model.add(Dense(32, use_bias=True,
          activation='relu',
          kernel_initializer='he_normal'))
    model.add(Dense(10, activation='softmax'))
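Using the dimension rules above, the parameter count of this network can be verified by hand (a short sketch; dense_params is a helper introduced here for illustration, and model.summary() reports the same total):
    def dense_params(units_prev, units_curr):
        # (n_h[l-1] + 1) * n_h[l]: weights plus one bias per unit
        return (units_prev + 1) * units_curr

    # 50 -> 32 -> 32 -> 10
    print(dense_params(50, 32) + dense_params(32, 32) + dense_params(32, 10))   # 3018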

3.9. Deep Neural Networks

Deep neural networks (DNNs) are composed of many layers that increase a model's accuracy. However, the greater the number of layers, the greater the computational complexity. Hence, using loops during training to iterate over all the input examples in the data set makes the process very slow. Therefore, vectorization is an important strategy to help improve the time performance of a DL model.
Consider the matrix $A^{[0]} \in \mathbb{R}^{n_x \times m}$ with columns being a subset or the complete data set of $m$ input vectors $\{a^{(i)[0]}\}_{i=1}^{m}$. Equations (8) and (9) can now be generalized in matrix form as
$$Z^{[l]} = w^{[l]} A^{[l-1]} + b^{[l]} \qquad (10)$$
$$A^{[l]} = f^{[l]}\left(Z^{[l]}\right), \qquad (11)$$
where $Z^{[l]} \in \mathbb{R}^{n_h^{[l]} \times m}$ and $A^{[l]} \in \mathbb{R}^{n_h^{[l]} \times m}$. When $l = L$, $A^{[L]}$ is considered the output of the prediction model. Figure 18 shows a DNN with $L$ dense layers. The units or neurons are grouped into layers as follows: the green circles represent the input features, the blue ones are the hidden units in hidden layers, the white ones are the biases of each layer, and the orange circles are the units belonging to the output layer. The arrows indicate the flow of information from one layer to the next through the parameters. For binary classification, the last layer has one activation function, typically the sigmoid function, and for multiclass classification, it could be a layer with softmax functions.
In supervised learning, the data are labeled, meaning that, for each input example, there is a label that indicates to which class it belongs. Therefore, we can consider the matrix $Y \in \mathbb{R}^{n_y \times m}$ whose columns are the set of $m$ vectors $y^{(i)} = \{y_k\}_{k=1}^{n_y} \in \mathbb{R}^{n_y \times 1}$, with $n_y$ the number of output labels per vector.
After forward propagation, we measure how the model performs on the training set by using a cost function $J(w, b)$, which depends on the parameters we wish to adjust. The goal is to minimize the cost. For this, we measure the variation in the cost with respect to the variation in the parameters, $\partial J(w, b)/\partial w$ and $\partial J(w, b)/\partial b$, from the last to the first layer to update such parameters with the appropriate values that minimize the cost function:
$$J(w, b) = \frac{1}{m} \mathcal{L}\left(A^{[L]}, Y\right) \cdot \mathbf{1}^T, \qquad (12)$$
where $\mathbf{1}^T$ is a column vector of ones. In other words, we want to find the best parameters $w^*$ such that the cost function is minimal:
$$w^* = \arg\min_{w} J(w, b) \qquad (13)$$
The cost is the average loss over the entire training data set. For example, in binary classification, the loss function is the cross entropy denoted by
$$\mathcal{L}\left(A^{[L]}, Y\right) = -\left[ \left( Y \otimes \log A^{[L]} \right) + (1 - Y) \otimes \log\left(1 - A^{[L]}\right) \right], \qquad (14)$$
where ⊗ is the element-wise product operation. Often, we want to predict a continuous value function instead of ‘1’ or ‘0’. Then, we must use the mean squared error (MSE) as our loss function to measure the divergence between the predicted and the true output.
We can use the backpropagation algorithm with gradient descent or any variant, such as stochastic gradient descent with momentum, RMSprop, Adam, etc., to minimize Equation (13) [46]. The following equations are used to update the parameters, provided that $dA^{[L]}$ is given, where $dA^{[L]} = \partial \mathcal{L}(A^{[L]}, Y) / \partial A^{[L]}$ is the partial derivative of the cost with respect to $A^{[L]}$. For one input example, $da^{(i)[L]} = \frac{a^{(i)[L]} - y^{(i)}}{a^{(i)[L]} \left(1 - a^{(i)[L]}\right)}$; it follows that $dA^{[L]}$ is the vector of all derivatives $\left\{ \frac{a^{(i)[L]} - y^{(i)}}{a^{(i)[L]} \left(1 - a^{(i)[L]}\right)} \right\}_{i=1}^{m}$. By letting $dZ = \frac{\partial J(w,b)}{\partial Z}$, $dw = \frac{\partial J(w,b)}{\partial w}$, and $db = \frac{\partial J(w,b)}{\partial b}$, we can easily calculate the update equations as follows.
$$
\begin{aligned}
dZ^{[l]} &= dA^{[l]} \otimes f'^{[l]}\!\left(Z^{[l]}\right) \\
dw^{[l]} &= \frac{1}{m}\, dZ^{[l]} A^{[l-1]T} \\
db^{[l]} &= \frac{1}{m}\, dZ^{[l]} \cdot \mathbf{1}^T \\
dA^{[l-1]} &= w^{[l]T} \cdot dZ^{[l]}
\end{aligned} \qquad (15)
$$
Then, we update the parameters using the gradient descent [65] with an adjustable learning rate α .
$$
\begin{aligned}
w^{[l]} &= w^{[l]} - \alpha\, dw^{[l]} \\
b^{[l]} &= b^{[l]} - \alpha\, db^{[l]}
\end{aligned} \qquad (16)
$$
Let us assume binary classification. Then, $dZ^{[L]} = A^{[L]} - Y$. Notice that $dw^{[l]} \in \mathbb{R}^{n_h^{[l]} \times n_h^{[l-1]}}$ and $db^{[l]} \in \mathbb{R}^{n_h^{[l]} \times 1}$.
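To make the vectorized equations concrete, the following NumPy sketch (our own illustration with randomly generated data, not part of the Keras examples) performs one forward pass, one backward pass, and one gradient-descent update for a single sigmoid output layer in the binary classification case:
    import numpy as np

    rng = np.random.default_rng(0)
    n_x, m = 4, 8                                        # input features and number of examples
    A0 = rng.normal(size=(n_x, m))                       # input matrix, examples stored column-wise
    Y = rng.integers(0, 2, size=(1, m))                  # binary labels
    w = rng.normal(size=(1, n_x)) * np.sqrt(2 / n_x)     # He-style initialization
    b = np.zeros((1, 1))
    alpha = 0.1                                          # learning rate

    Z = w @ A0 + b                                       # forward propagation, Equation (10)
    A = 1 / (1 + np.exp(-Z))                             # sigmoid activation, Equation (11)

    dZ = A - Y                                           # sigmoid output with cross-entropy loss
    dw = (1 / m) * dZ @ A0.T                             # Equation (15)
    db = (1 / m) * dZ.sum(axis=1, keepdims=True)

    w = w - alpha * dw                                   # parameter update, Equation (16)
    b = b - alpha * db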

3.10. Comments on Training DNN

For a large number of input vectors, we can split the data set $A^{[0]}$ into $K$ subsets of data $\{A_k^{[0]}\}_{k=1}^{K}$, called minibatches, with $M$ examples each. Therefore, the set of minibatches are the matrices $A_k^{[0]} = \{a^{(i)[0]}\}_{i=(k-1)M+1}^{kM} \in \mathbb{R}^{n_x \times M}$. In supervised learning, each minibatch is composed of pairs of training vectors. Each pair contains an input example vector and its true output label or target output.
This technique is useful for huge training sets because it allows us to observe the progress of the model with each training minibatch before processing the entire training set, i.e., it allows us to take one gradient descent step with each minibatch. Also, to make the learning process fast, it is recommended to normalize the input features and the activations in each layer of the network.
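A minibatch split can be sketched in a few lines (assuming A0 and Y are NumPy arrays with examples stored column-wise, as in the sketch of Section 3.9):
    M = 8   # minibatch size
    num_batches = A0.shape[1] // M
    minibatches = [(A0[:, k * M:(k + 1) * M], Y[:, k * M:(k + 1) * M])
                   for k in range(num_batches)]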
During training, underfitting and overfitting must be avoided so that the network can correctly generalize to new examples. Underfitting occurs when the network has not been trained for enough time or the number of parameters is insufficient to establish the correct relationship between input and output variables. Underfitting is avoided by increasing the training time and/or increasing the number of model parameters. The overfitting problem is common when considering DNNs because of the huge number of parameters in the model. There are strategies to avoid overfitting. For example, we can employ data augmentation strategies such as random cropping, mirroring, rotating, and flipping the input data. In addition, we can add a regularization term to the loss function. Equation (17) shows the most common regularization term, known as the $L_2$-norm, weight decay, or Ridge Regression. It is nothing but the Euclidean norm of the weight matrices. Observe that the weights approach zero as the value of $\lambda$ increases. This tends to cancel neurons, which reduces the number of parameters.
$$J(w, b) = \frac{1}{m} \mathcal{L}\left(A^{[L]}, Y\right) \cdot \mathbf{1}^T + \frac{\lambda}{2m} \left\| w^{[l]} \right\|_2^2 \qquad (17)$$
The Lasso Regression or $L_1$-norm is another regularization technique that uses the absolute values of the weight parameters as regularizers, as shown in Equation (18).
$$J(w, b) = \frac{1}{m} \mathcal{L}\left(A^{[L]}, Y\right) \cdot \mathbf{1}^T + \frac{\lambda}{2m} \left\| w^{[l]} \right\|_1 \qquad (18)$$
A similar effect can be seen when the dropout technique is used. During training, some of the outputs of some layers are probabilistically ignored (dropped out). Hence, these layers have a reduced number of units and reduced connectivity to the next layer, which means that the number of parameters is reduced, which prevents overfitting. The effect is similar to regularization. The following Keras code implements the neural network discussed in Section 3.8, using a dropout of 0.25. The architecture is configured to use the categorical cross-entropy loss function and stochastic gradient descent with momentum and a learning rate of 0.01. The metric used to monitor and measure the performance of the classification model is ‘accuracy’. Finally, during training, the model iterates over the data ($A^{[0]}$) for 40 epochs in minibatches ($A_k^{[0]}$) of eight input vectors, each with their corresponding labels. There are three dense layers. However, only the middle one has an $L_2$ regularizer. Notice that the expression keras.regularizers.l2(0.01) denotes the $L_2$ regularizer with regularization parameter $\lambda = 0.01$.
    model = Sequential()
    model.add(Dense(32, activation='relu',
          kernel_initializer='he_normal',
          input_shape=(50,)))
    model.add(Dropout(0.25))
    model.add(Dense(32, activation='relu',
          kernel_initializer='he_normal',
          kernel_regularizer=keras.regularizers.l2(0.01)))
    model.add(Dropout(0.25))
    model.add(Dense(10, activation='softmax'))
    opt = SGD(lr=0.01, momentum=0.9)
    model.compile(loss='categorical_crossentropy',
          optimizer=opt, metrics=['accuracy'])
    model.fit(data, labels, epochs=40, batch_size=8)

3.11. Tensors

Even though tensors are not a building block for deep networks, they are important since the input of deep learning-based models requires a large number of high-dimensional data and tensors are special containers for this kind of data. Tensors can store multidimensional arrays with uniform types. For example, a zero-dimensional tensor stores a scalar with zero axes and rank zero. An array is a one-dimensional tensor for data of the same type, with one axis and rank one. A matrix is a two-dimensional tensor of rank two with two axes. In general, an n-dimensional tensor is a tensor of rank n. An axis is a specific dimension and the length is the number of indexes available along that axis [76].
Figure 19 shows six tensors. We have tensors of zero, one, two, and three dimensions in the upper part. The two-dimensional tensor can be used to contain a grayscale image. The three-dimensional tensor can be used to contain a color image or a color filter. The first dimension of the tensor expresses the number of channels. However, this dimension can be used to signal n different two-dimensional matrices, which could be n grayscale images. This type of tensor is used to store a batch of images. The first dimension indicates the batch size. In the lower part of Figure 19, we see four- and five-dimensional tensors. A four-dimensional tensor may contain color images or chunks of color images, or, in the case of a volumetric receptive field, it may hold the entire volume or small pieces of the volume.
The following TensorFlow code (the backend used by Keras) defines the tensors of ranks 0 through 3 shown in Figure 19, respectively.
    import tensorflow as tf

    rank_0_tensor = tf.constant(5)
    rank_1_tensor = tf.constant([2,3,4])
    rank_2_tensor = tf.constant([[2,5],[3,6],[4,7]])
    rank_3_tensor = tf.constant([[[1,2,3,4,5],
                [6,7,8,9,10]], [[11,12,13,14,15],
                [16,17,18,19,20]], [[21,22,23,24,25],
                [26,27,28,29,30]]])
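The shape of each tensor can be checked directly; the rank corresponds to the number of axes (assuming the definitions above):
    print(rank_0_tensor.shape)   # ()
    print(rank_1_tensor.shape)   # (3,)
    print(rank_2_tensor.shape)   # (3, 2)
    print(rank_3_tensor.shape)   # (3, 2, 5)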

4. Discussion

A vast number of applications with powerful predictive models have emerged in the area of DL. However, the explanations of the different building blocks used to modify or design new deep learning models are brief and dispersed in the literature [77,78]. Above, we describe in depth the building blocks and concepts used in developing architectures for DL. For example, filtering is one of the most important and least clear operations in CNNs. $N_c^{[1]}$ filters are needed in the first convolutional layer to produce $N_c^{[1]}$ two-dimensional feature maps. In other words, a feature volume with depth $N_c^{[1]}$ is produced by filtering the image with $N_c^{[1]}$ filters, as shown in Figure 2. However, in the subsequent layers, an independent filter is applied to each feature map, and the outputs are added to form a single feature map. Therefore, if the input volume of a subsequent layer is of depth $N_c^{[l-1]}$, then $N_c^{[l-1]}$ filters are applied, and the outputs are summed to produce a single two-dimensional feature map like that shown in Figure 3, Figure 5 and Figure 6. Nevertheless, this process increases the number of parameters after the first layer as more filters are added to produce maps that capture different features as the network deepens [78]. This can be seen in CNNs like the LeNet-5 [46,47], AlexNet [48], VGG-16 [49], DenseNet121 [51], and ResNet50 [53], among others. All these networks introduce a pooling layer after the filtering operation to alleviate the problem of the number of parameters.
Other CNNs, like the MobileNet family, use depthwise separable convolutions [55]. These operations split the convolution into two separate layers. The first applies an independent filter per input channel as in Figure 9, and the second is a 1 × 1 convolution called a pointwise convolution, which produces the feature map by calculating linear combinations of the input channels, as shown in Figure 12. This reduces the parameters needed to process a signal, making the MobileNet a family of lightweight networks.
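In Keras, this factorization is available directly; a depthwise separable block can be sketched as follows (a minimal illustration, not the exact MobileNet block; the filter count is arbitrary):
    model.add(DepthwiseConv2D((3,3), padding='same',
          activation='relu'))
    model.add(Conv2D(64, (1,1), activation='relu'))   # pointwise (1 x 1) convolution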
Latent-variable models also benefit from the DL building blocks. These models take training examples from some distribution as input and learn a model representing that distribution. There are three types of architectures: autoencoders, variational autoencoders (VAEs), and generative adversarial networks (GANs). Autoencoders are neural networks used in unsupervised machine learning that find patterns in a data set by detecting key features. An encoder performs the compressed representation of the input. An encoder comprises stacked FC layers, like the network in Figure 12. However, the dimension of each hidden layer is reduced to obtain a compressed space of latent variables. A decoder is a reflection of the encoder network. The architecture of VAEs is like that of autoencoders [43]. The difference is that the elements of randomness or stochasticity are introduced into the latent space generated to try to produce new variables and learn smoother and more complete representations of the latent space. GANs comprise two models, one that generates images called the generator (like the VAE decoder) and the discriminator or binary classifier built of FC layers to discriminate fake from real images [39].
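As a minimal illustration of the encoder-decoder idea (a sketch only; the layer sizes are arbitrary and chosen for a 784-dimensional input):
    model = Sequential()
    model.add(Dense(128, activation='relu', input_shape=(784,)))   # encoder
    model.add(Dense(32, activation='relu'))                        # latent space
    model.add(Dense(128, activation='relu'))                       # decoder
    model.add(Dense(784, activation='sigmoid'))                    # reconstruction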

5. Conclusions

In this paper, we presented an in-depth overview of the building blocks of deep learning, aiming to support the development of solutions and serve as a reference for constructing new models. The description begins with the artificial neuron's model and exemplifies how a ten-parameter neural network, including bias, is used as a filter to produce a feature map. Furthermore, the filtering process, or convolution, with two- and three-dimensional input signals to obtain feature volumes is described. In addition, the 1 × 1 convolution is described, and its use is exemplified in a simple inception module. The forward and backpropagation process with vectorization is described in a deep neural network context. Even though tensors are not part of the DL building blocks, they are explained because of their important role in containing data. Also, we explained the implementation of each building block using the Keras library.
As for future prospects, we recommend designing networks starting by analyzing the features of the input signal to process because, for example, medical images have a different set of characteristics than other natural images, leading to different network depths and building blocks. Finally, by performing ablation studies, one can investigate the performance of the developed architecture to understand the contribution of each block to the overall model.

Author Contributions

Conceptualization, H.d.J.O.D., V.G.C.S. and O.O.V.V.; methodology, H.d.J.O.D., V.G.C.S. and O.O.V.V.; validation, H.d.J.O.D., V.G.C.S. and O.O.V.V.; investigation, H.d.J.O.D., V.G.C.S. and O.O.V.V.; writing—original draft preparation, H.d.J.O.D.; writing—review and editing, V.G.C.S. and O.O.V.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hamid, O. Data-centric and model-centric AI: Twin drivers of compact and robust industry 4.0 solutions. Appl. Sci. 2023, 13, 2753. [Google Scholar] [CrossRef]
  2. Hamid, O.; Smith, N.; Barzanji, A. Automation, per se, is not job elimination: How artificial intelligence forwards cooperative human-machine coexistence. In Proceedings of the 15th IEEE International Conference on Industrial Informatics (INDIN), Emden, Germany, 24 July 2017; pp. 899–904. [Google Scholar] [CrossRef]
  3. Jiang, X.; Hadid, A.; Pang, Y.; Granger, E.; Feng, X. Deep Learning in Object Detection and Recognition; Springer Nature: Singapore, 2019. [Google Scholar]
  4. Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput. 2017, 29, 2352–2449. [Google Scholar] [CrossRef] [PubMed]
  5. Schmarje, L.; Santarossa, M.; Schröder, S.; Koch, R. A Survey on semi-, self- and unsupervised learning for image classification. IEEE Access 2021, 9, 82146–82168. [Google Scholar] [CrossRef]
  6. Rikalovic, A.; Suzic, N.; Bajic, B.; Piuri, V. Industry 4.0 implementation challenges and opportunities: A technological perspective. IEEE Syst. J. 2022, 16, 2797–2810. [Google Scholar] [CrossRef]
  7. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386. [Google Scholar] [CrossRef]
  8. Sun, Q.; Ge, Z. Deep learning for industrial KPI prediction: When ensemble learning meets semi-supervised data. IEEE Trans. Ind. Inform. 2021, 17, 260–269. [Google Scholar] [CrossRef]
  9. Daud, M.; Saad, H.; Ijab, M. Conceptual design of human detection via deep learning for industrial safety enforcement in manufacturing site. In Proceedings of the 2021 IEEE International Conference on Automatic Control Intelligent Systems (I2CACIS), Shah Alam, Malaysia, 26 June 2021; pp. 369–373. [Google Scholar] [CrossRef]
  10. Liu, Y.; Ma, X.; Shu, L.; Hancke, G.; Abu-Mahfouz, A. From industry 4.0 to agriculture 4.0: Current status, enabling technologies, and research challenges. IEEE Trans. Ind. Inform. 2021, 17, 4322–4334. [Google Scholar] [CrossRef]
  11. Masrur, A.; Deo, R.; Ghahramani, A.; Feng, Q.; Raj, N.; Yin, Z.; Yang, L. New double decomposition deep learning methods for river water level forecasting. Sci. Total Environ. 2022, 831, 154722. [Google Scholar] [CrossRef]
  12. Shiuann, S.; Bhaskar, C.; Vinay, S. A neural network based price sensitive recommender model to predict customer choices based on price effect. J. Retail. Consum. Serv. 2021, 61, 102573. [Google Scholar] [CrossRef]
  13. Singh, S.; Yadav, B.; Batheri, R. Industry 4.0: Meeting the challenges of demand sensing in the automotive industry. IEEE Eng. Manag. Rev. 2023, 51, 179–184. [Google Scholar] [CrossRef]
  14. Turay, T.; Vladimirova, T. Toward performing image classification and object detection with convolutional neural networks in autonomous driving systems: A survey. IEEE Access 2022, 10, 14076–14119. [Google Scholar] [CrossRef]
  15. Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.-L.; Chen, S.-C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. 2019, 51, 1–36. [Google Scholar] [CrossRef]
  16. Shi, D.; Ping, W.; Khushnood, A. A survey on deep learning and its applications. Comput. Sci. Rev. 2021, 40, 100379. [Google Scholar] [CrossRef]
  17. Piccialli, F.; Di Somma, V.; Gianpaolo, F.; Cuomo, S.; Fortino, G. A survey on deep learning in medicine: Why, how and when? Inf. Fusion 2021, 66, 111–137. [Google Scholar] [CrossRef]
  18. Marcus, G. The Next Decade in AI: Four Steps towards Robust Artificial Intelligence. 2020. Available online: https://arxiv.org/abs/2002.06177 (accessed on 22 January 2022).
  19. Ganaie, M.; Minghui, H.; Malik, A.; Tanveer, M.; Suganthan, P. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151. [Google Scholar] [CrossRef]
  20. Dhilleswararao, P.; Boppu, S.; Manikandan, M.; Cenkeramaddi, L. Efficient hardware architectures for accelerating deep neural networks: Survey. IEEE Access 2022, 10, 131788–131828. [Google Scholar] [CrossRef]
  21. Osypanka, P.; Nawrocki, P. Resource usage cost optimization in cloud computing using machine learning. IEEE Trans. Cloud Comput. 2022, 10, 2079–2089. [Google Scholar] [CrossRef]
  22. Keras Team. Developer Guides. 2022. Available online: https://keras.io/guides/ (accessed on 20 February 2022).
  23. Ribeiro, A.; Tiels, K.; Aguirre, L.; Schön, T. Beyond exploding and vanishing gradients: Analysing RNN training using attractors and smoothness. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Virtual, 28–30 March 2020; pp. 2370–2380. [Google Scholar]
  24. Natarajan, B.; Rajalakshmi, E.; Elakkiya, R.; Kotecha, K.; Abraham, A.; Gabralla, L.A.; Subramaniyaswamy, V. Development of an end-to-end deep learning framework for sign language recognition, translation, and video generation. IEEE Access 2022, 10, 104358–104374. [Google Scholar] [CrossRef]
  25. Choo, S.; Kim, W. A study on the evaluation of tokenizer performance in natural language processing. Appl. Artif. Intell. 2023, 37, 2175112. [Google Scholar] [CrossRef]
  26. Oruh, J.; Viriri, S.; Adegun, A. Long short-term memory recurrent neural network for automatic speech recognition. IEEE Access 2022, 10, 30069–30079. [Google Scholar] [CrossRef]
  27. Sairam, G.; Mandha, M.; Prashanth, P.; Swetha, P. Image captioning using CNN and LSTM. In Proceedings of the 4th Smart Cities Symposium (SCS 2021), Online, 21–23 November 2021; pp. 274–277. [Google Scholar] [CrossRef]
  28. Apple Inc. Speech and Natural Language Processing: Voice Trigger System for Siri. 2023. Available online: https://machinelearning.apple.com/research/voice-trigger/ (accessed on 30 December 2023).
  29. NLP Architect by Intel® AI Lab. Compression of Google Neural Machine Translation Model. 2023. Available online: https://intellabs.github.io/nlp-architect/sparse_gnmt.html (accessed on 30 December 2023).
  30. Khurana, D.; Koli, A.; Khatter, K.; Singh, S. Natural language processing: State of the art, current trends and challenges. Multimed. Tools Appl. 2023, 82, 3713–3744. [Google Scholar] [CrossRef]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–15. [Google Scholar] [CrossRef]
  32. Coccomini, D.; Messina, N.; Gennaro, C.; Falchi, F. Combining efficientNet and vision transformers for video deepfake detection. In Image Analysis and Processing—ICIAP 2022; Sclaroff, S., Distante, C., Leo, M., Farinella, G., Tombari, F., Eds.; Springer: Cham, Switzerland, 2022; pp. 219–229. [Google Scholar] [CrossRef]
  33. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.; Keutzer, K.; Vajda, P. Visual transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677. [Google Scholar] [CrossRef]
  34. Ma, J.; Xiong, G.; Xu, J.; Chen, X. CVTNet: A cross-view transformer network for LiDAR-based place recognition in autonomous driving environments. IEEE Trans. Ind. Inform. 2023, 1–10, early access. [Google Scholar] [CrossRef]
  35. Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in medical imaging: A survey. Med. Image Anal. 2023, 88, 102802. [Google Scholar] [CrossRef] [PubMed]
  36. Yao, H.; Luo, W.; Yu, W.; Zhang, X.; Qiang, Z.; Luo, D.; Shi, H. Dual-attention transformer and discriminative flow for industrial visual anomaly detection. IEEE Trans. Autom. Sci. Eng. 2023, 1–15, early access. [Google Scholar] [CrossRef]
  37. Dalmaz, O.; Yurt, M.; Çukur, T. ResViT: Residual vision transformers for multimodal medical image synthesis. IEEE Trans. Med. Imaging 2022, 41, 2598–2614. [Google Scholar] [CrossRef]
  38. Xie, Y.; Zhang, J.; Xia, Y.; van den Hengel, A.; Wu, Q. ClusTR: Exploring Efficient Self-Attention via Clustering for Vision Transformers. arXiv 2022, arXiv:2208.13138. [Google Scholar] [CrossRef]
  39. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  40. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv 2020, arXiv:1703.10593. [Google Scholar] [CrossRef]
  41. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
  42. Van Den Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, New York, NY, USA, 20–22 June 2016; pp. 1747–1756. [Google Scholar] [CrossRef]
  43. Ehsan, A.; Dick, A.; Van Den Hengel, A. Infinite variational autoencoder for semi-Supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  44. Lin, S.; Zhang, M.; Cheng, X.; Shi, L.; Gamba, P.; Wang, H. Dynamic low-rank and sparse priors constrained deep autoencoders for hyperspectral anomaly detection. IEEE Trans. Instrum. Meas. 2024, 73, 1–18. [Google Scholar] [CrossRef]
  45. Cheng, X.; Zhang, M.; Lin, S.; Li, Y.; Wang, H. Deep Self-Representation Learning Framework for Hyperspectral Anomaly Detection. IEEE Trans. Instrum. Meas. 2024, 73, 1–16. [Google Scholar] [CrossRef]
  46. LeCun, Y.; Boser, B.; Denker, J.; Henderson, D.; Howard, R.; Hubbard, W.; Jackel, L. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  47. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  48. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  49. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, 7–9 May 2015; Conference Track Proceedings. Bengio, Y., LeCun, Y., Eds.; Cornell University: Ithaca, NY, USA, 2015. [Google Scholar]
  50. Haque, M.; Lim, H.; Kang, D. Object Detection Based on VGG with ResNet Network. In Proceedings of the 2019 International Conference on Electronics, Information, and Communication (ICEIC), Auckland, New Zealand, 22–25 January 2019; pp. 1–3. [Google Scholar] [CrossRef]
  51. Huang, G.; Liu, Z.; Van Der, M.; Weinberger, K. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  52. Kateb, Y.; Meglouli, H.; Khebli, A. Coronavirus Diagnosis Based on Chest X-ray Images and Pre-Trained DenseNet-121. Rev. D’Intell. Artif. 2023, 37, 23. [Google Scholar] [CrossRef]
  53. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  54. Çınar, A.; Yıldırım, M.; Eroğlu, Y. Classification of pneumonia cell images using improved ResNet50 model. Trait. Signal 2021, 38, 165–173. [Google Scholar] [CrossRef]
  55. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  56. Kaya, Y.; Gürsoy, E. A MobileNet-based CNN model with a novel fine-tuning mechanism for COVID-19 infection detection. Soft Comput. 2023, 27, 5521–5535. [Google Scholar] [CrossRef]
  57. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef]
  58. Naseer, I.; Akram, S.; Masood, T.; Rashid, M.; Jaffar, A. Lung cancer classification using modified U-Net based lobe segmentation and nodule detection. IEEE Access 2023, 11, 60279–60291. [Google Scholar] [CrossRef]
  59. Morris, D.; Joppa, L. Challenges for the computer vision community. In Conservation Technology; Serge, A.W., Alex, K.P., Eds.; Oxford Academic: Oxford, UK, 2021; pp. 225–238. [Google Scholar] [CrossRef]
  60. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep learning for computer vision: A brief review. Comput. Intell. Neurosci. 2018, 2018, 7068349. [Google Scholar] [CrossRef] [PubMed]
  61. Neha, S.; Reecha, S.; Neeru, J. Machine learning and deep learning applications—A vision. Glob. Transit. Proc. 2021, 2, 24–28. [Google Scholar] [CrossRef]
  62. Wang, X.; Zhao, Y.; Pourpanah, F. Recent advances in deep learning. Int. J. Mach. Learn. Cybern. 2020, 11, 747–750. [Google Scholar] [CrossRef]
  63. Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695. [Google Scholar] [CrossRef]
  64. Maclaurin, D.; Duvenaud, D.; Adams, R. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 2113–2122. [Google Scholar]
  65. Ruder, S. An Overview of Gradient Descent Optimization Algorithms. 2017. Available online: http://arxiv.org/abs/1609.04747 (accessed on 2 February 2022).
  66. Li, H.; Yue, X.; Wang, Z.; Wang, W.; Tomiyama, H.; Meng, L. A survey of convolutional neural networks—From software to hardware and the applications in measurement. Meas. Sens. 2021, 18, 100080. [Google Scholar] [CrossRef]
  67. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516. [Google Scholar] [CrossRef]
  68. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef]
  69. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. INSPEC Accession Number: 15802053. [Google Scholar] [CrossRef]
  70. Khagi, B.; Kwon, G. 3D CNN design for the classification of alzheimer’s disease using brain MRI and PET. IEEE Access 2020, 8, 217830–217847. [Google Scholar] [CrossRef]
  71. Kiranyaz, S.; Avci, O.; Abdeljaber, O.; Ince, T.; Gabbouj, M.; Inman, D.J. 1D convolutional neural networks and applications: A survey. Mech. Syst. Signal Process. 2021, 151, 107398. [Google Scholar]
  72. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  73. Rajendran, N.; Siyamalan, M.; Amirthalingam, R.; Ruixuan, W. Pooling in convolutional neural networks for medical image analysis: A survey and an empirical study. Neural Comput. Appl. 2022, 1, 1–27. [Google Scholar] [CrossRef]
  74. Berkeley AI Research. Caffe. 2022. Available online: https://caffe.berkeleyvision.org/ (accessed on 25 February 2022).
  75. Facebook’s AI Research. From Research to Production. 2022. Available online: https://pytorch.org/ (accessed on 10 March 2022).
  76. Google Developers. Introduction to Tensors. 2020. Available online: https://www.tensorflow.org/guide/tensor (accessed on 26 January 2022).
  77. Shrestha, A.; Mahmood, A. Review of deep learning algorithms and architectures. IEEE Access 2019, 7, 53040–53065. [Google Scholar] [CrossRef]
  78. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The artificial neuron.
Figure 2. Three steps of 2D convolution of a 3 × 3 kernel over a valid area of a one-channel receptive field. The receptive field is a grayscale image, the lines are the kernel weights, the circle is the sum of products of the receptive field with the kernel, and the black squares represent the output features extracted by the kernel after each shift.
Figure 3. One step of 2D convolution of a 3 × 3 × 3 kernel with a three-channel receptive field. In this case, the input receptive field is a three-channel color image. The red, green, and blue squares represent the output of each filter, and the brown square is the final result or extracted feature. Notice that we need one filter per channel to produce one channel of features.
Figure 4. Two-dimensional filtering of 3D data over a valid area of the receptive field and their representation using volumes.
Figure 5. The convolution of three 3D filters over a valid area of a 3D receptive field and the resulting feature maps represented by the green volume with three channels.
Figure 6. Process for building one convolutional layer of size 4 × 4 × 3 .
Figure 7. Feature extraction to yield a 2D convolutional layer using a 3 × 3 filter with zero padding of P = 1 and stride S = 1 in the receptive field. The receptive field can be a grayscale image or the output of a previous feature map, the lines are the weights of the kernel, the circle is the sum of products of the receptive field with the kernel, and the black squares represent the output features extracted by the activation function after each shift.
Figure 8. Feature extraction to yield a 2D convolutional layer using a 3 × 3 filter with zero padding of P = 1 and stride S = 2 in the receptive field.
Figure 9. One 2D convolutional layer obtained after filtering and the activation layer.
Figure 10. One volumetric convolutional layer of size m × n × 4 , obtained after convolving the receptive field with the learned parameters of 4 filters and the activation layers.
Figure 11. Three-dimensional convolution and output features’ volumes. The purple volume represents the volumetric filter acting on the three-dimensional receptive field. The blue feature is yielded after the first convolution operation.
Figure 12. Here, 1 × 1 convolution is used to reduce the dimension of a 3D receptive field to a 2D feature map. The blue squares are the output features that are yielded after 17 convolution operations. Notice that the depth of the filter is equal to the depth of the receptive field.
Figure 13. Inception module with dimensionality reduction [72].
Figure 14. Max pooling operation using a 2 × 2 window (left side) with no padding ( P = 0 ) and stride of S = 1 and (right side) with no padding ( P = 0 ) and stride of S = 2 . Here, only the maximum value within the red window is retained.
Figure 15. Average pooling operation using a 2 × 2 window (left side) with no padding ( P = 0 ) and stride of S = 1 and (right side) with no padding ( P = 0 ) and stride of S = 2 . Here, the average of the values within the red window is calculated.
Figure 16. Flattening via columns of a 2D receptive field.
Figure 17. Artificial neural network with four fully connected layers.
Figure 18. Standard feed-forward DNN architecture with L dense layers.
Figure 19. Example of six tensors with different dimensions.
Table 1. Some common activation functions.
Name                 Definition
Sigmoid              f(z) = 1 / (1 + e^{−z})
Hyperbolic tangent   f(z) = (e^{z} − e^{−z}) / (e^{z} + e^{−z})
Softmax              f(z_i) = e^{z_i} / ∑_{j=1}^{K} e^{z_j},   i = 1, ⋯, K *
ReLU                 f(z) = max(z, 0)
* K is the number of classes.
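The activations in Table 1 can be selected in Keras either by name inside a layer or applied directly as functions; the layer sizes below are illustrative assumptions, shown only as a minimal sketch.

```python
# Sketch: choosing the activation functions of Table 1 in Keras.
import tensorflow as tf
from tensorflow.keras import layers

hidden_relu = layers.Dense(64, activation="relu")      # ReLU: max(z, 0)
hidden_tanh = layers.Dense(64, activation="tanh")      # hyperbolic tangent
out_sigmoid = layers.Dense(1, activation="sigmoid")    # sigmoid for binary outputs
out_softmax = layers.Dense(10, activation="softmax")   # softmax over K = 10 classes

z = tf.constant([[-1.0, 0.0, 2.0]])
print(tf.keras.activations.relu(z))     # [[0. 0. 2.]]
print(tf.keras.activations.sigmoid(z))  # elementwise 1 / (1 + e^{-z})
```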