Constrain Bias Addition to Train Low-Latency Spiking Neural Networks

Lin, Ranxi; Dai, Benzhe; Zhao, Yingkai; Chen, Gang; Lu, Huaxiang

doi:10.3390/brainsci13020319

by Ranxi Lin ^1,2,†, Benzhe Dai ^1,2,†, Yingkai Zhao ^1,2, Gang Chen ^1,3,* and Huaxiang Lu ^1,2,3,4,5

¹ Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
² University of Chinese Academy of Sciences, Beijing 100089, China
³ Semiconductor Neural Network Intelligent Perception and Computing Technology Beijing Key Laboratory, Beijing 100083, China
⁴ College of Microelectronics, University of Chinese Academy of Sciences, Beijing 100049, China
⁵ Materials and Optoelectronics Research Center, University of Chinese Academy of Sciences, Beijing 200031, China

* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Brain Sci. 2023, 13(2), 319; https://doi.org/10.3390/brainsci13020319

Submission received: 19 November 2022 / Revised: 19 January 2023 / Accepted: 10 February 2023 / Published: 13 February 2023

(This article belongs to the Collection Collection on Theoretical and Computational Neuroscience)

Abstract: In recent years, the spiking neural network, a third-generation neural network, has received a plethora of attention in the broad areas of machine learning and artificial intelligence. In this paper, a novel differential-based encoding method is proposed, and new spike-based learning rules for backpropagation are derived by constraining the addition of bias voltage in spiking neurons. The proposed differential encoding method effectively exploits the correlation between the data and improves the performance of the proposed model, and the new learning rule takes full advantage of the modulation properties of the bias on the spike firing threshold. We evaluate the proposed model on the environmental sound dataset RWCP and the image datasets MNIST and Fashion-MNIST, under various conditions that test its learning ability and robustness. The experimental results demonstrate that the proposed model achieves near-optimal results with fewer time steps while maintaining the highest accuracy and robustness with less training data. On the MNIST dataset, compared with the original spiking neural network with the same network structure, we achieve a 0.39% accuracy improvement.

Keywords:

spiking neural network; backpropagation; Sigma-Delta ADC; neural encoding; pattern recognition

1. Introduction

Owing to their energy efficiency and biological plausibility, spiking neural networks (SNNs) have made significant advances lately. Unlike conventional artificial neural networks (ANNs), SNNs attempt to mimic biological neural networks more closely through discrete events. SNNs encode information into sequences of spikes and, compared with traditional ANNs that operate on floating-point numbers, are useful tools for complex spatio-temporal information processing ([1]). Additionally, spike generation is an asynchronous process and operations on spikes are binary, which makes SNNs highly adaptable to hardware circuits ([2]). Many research institutions have developed neural computing devices based on SNNs, such as SpiNNaker ([3]), TrueNorth ([4]), and Loihi ([5]). The emergence of these platforms has also expanded the potential applications and research prospects of SNNs.

SNNs and ANNs differ primarily in two respects: (1) as shown in Figure 1, before data are fed into the network, an SNN must first encode them into a spike train; (2) the spike-firing process of spiking neurons is non-differentiable, which has a negative impact on training. In neuroscience, neural information coding is concerned with the relationship between input signals and the response of individual neurons or groups of neurons ([6,7,8]). SNNs primarily use two coding methods for spike coding: temporal coding ([9]) and rate coding ([10]). Latency coding ([11]) and phase coding ([12]) are two variations of temporal coding. In these schemes, only one spike is encoded in a given time window; the input value is carried by the precise firing time of the spike and is negatively correlated with it. At present, rate coding is the predominant coding method. At most one spike is emitted at each time step within a given time window, and the spike firing frequency increases with the stimulation intensity, resulting in a spike sequence. Rate coding generally requires a sufficient number of time steps to improve coding accuracy, which increases network power consumption. Many researchers have used Poisson coding or Bernoulli coding to implement rate coding, which is widely deployed in various deep SNNs. These rate coding methods, however, rely on statistical models that need enough time steps to guarantee coding accuracy. They also ignore data correlation and presume that the input data are independent of one another. In recent years, some researchers have fed real-valued data directly into SNNs and treated the first few spiking layers as the encoding layer ([13]). This scheme directly increases the scale of the network architecture and thereby increases the training time and energy consumption.

The post-synaptic neuron accumulates the spikes from the preceding layer in its membrane potential until the potential exceeds the threshold, whereupon the neuron emits a spike and resets its membrane potential. However, the spike-firing process is not differentiable, which complicates the backpropagation training of SNNs. At present, the training rules of SNNs can be divided into three main categories. (1) Learning rules based on biological characteristics. The most typical methods in practice are Spike-Timing-Dependent Plasticity (STDP) ([14]) and Tempotron ([15]). STDP is a prominent unsupervised learning rule in which the update of synaptic weights depends only on the activity of the pre- and post-synaptic neurons. Tempotron is a gradient descent learning algorithm based on biological characteristics that is suitable for temporal coding; when a classification error occurs, it updates the synaptic weights using temporal information. (2) Surrogate gradient methods ([16]). To alleviate the difficulty of training SNNs directly with backpropagation (BP), researchers use the Heaviside function in forward propagation and a surrogate gradient to approximate its gradient in backpropagation. For example, in [17], the researchers proposed the spatio-temporal backpropagation (STBP) algorithm, which uses surrogate gradients when computing the gradient of spiking neurons, and further discussed the impact of different surrogate gradients and hyperparameter settings on network performance. Although the surrogate gradient alleviates the supervised training problem of SNNs, there are many candidate functions, and it is difficult to choose the most suitable one. (3) ANN-SNN conversion. This method has two steps: train an ANN and then convert it into an SNN. Ref. [18] first proposed a method to convert ANNs to SNNs; [19] then proposed a data-based regularization method, and [20] further improved the conversion method to reduce the conversion error. In [21], the researchers converted a residual neural network, which increased the scale of SNNs, and in [22], the researchers converted the Tiny YOLO network, extending SNNs to object detection. Conversion-based methods can quickly apply SNNs to different tasks and avoid the difficulty of training SNNs directly. However, accuracy is lost in the conversion process, and the resulting SNNs often require more encoding time steps, which indirectly increases their inference latency and energy consumption.

In this paper, we propose a novel encoding scheme along with a spike-based learning rule. We introduce a Sigma-Delta encoding method that effectively exploits the correlation between input data and achieves higher encoding gains, and we reconsider the role of the bias voltage in SNNs and modify its addition mechanism. We deploy the proposed SNN model on an environmental sound classification task and on image classification tasks, and the results show that our models achieve competitive results with fewer spikes.

The rest of this article is organized as follows. Section 2 introduces the Sigma-Delta encoding method, presents the proposed bias addition rule and the new gradient approximation, and derives the backpropagation formulas. Section 3 verifies the effectiveness of the proposed encoding method and learning rules through experiments on different tasks and compares them with other models. Section 4 summarizes and discusses the results obtained in this work.

2. Materials and Methods

2.1. Spiking Neuron Model

As the most commonly used spiking neuron model, the Leaky Integrate-and-Fire (LIF) neuron adopts an event-driven simulation strategy and has limited neural computing characteristics ([23]). The first-order differential form of its kinetic equation is as follows:

$$\tau \frac{dV(t)}{dt} = -(V(t) - V_{rest}) + X(t) \tag{1}$$

where $V(t)$ represents the membrane potential at time t; $V_{rest}$ represents the resting potential; $X(t)$ is the input at time t, including the output spikes of the previous layer and the bias voltage; and $\tau$ is the decay time constant. As long as the membrane potential $V(t)$ does not exceed the threshold $V_{th}$, it decays at a rate controlled by $\tau$; otherwise, the neuron fires a spike and resets its membrane potential. We express the firing process with the following formula:

$$N(t) = \begin{cases} 1, & \text{if } V(t) > V_{th} \\ 0, & \text{if } V(t) \leq V_{th} \end{cases} \tag{2}$$

At the t-th time step, the membrane potential of the j-th neuron in the l-th layer is given by:

$$V_{j}^{l}(t) = \left(1 - \frac{1}{\tau}\right) V_{j}^{l}(t-1) + \sum_{i=1}^{M_{l-1}} W_{ij} N_{i}^{l-1}(t) + X_{bias,j}^{l} \tag{3}$$

where $M_{l-1}$ is the number of spiking neurons in the $(l-1)$-th layer and $X_{bias,j}^{l}$ represents the bias voltage. For the spike reset phase, we use a subtraction-based reset mechanism:

$$V_{j}^{l}(t) = V_{j}^{l}(t) - V_{th} \cdot N_{j}^{l}(t) \tag{4}$$

When compared with the method of directly resetting the membrane potential to a fixed value, the subtraction-based mechanism is better suited to training deep SNNs and makes more biological sense ([20,24]).
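The discrete update of Formulas (2)–(4) can be sketched for a single neuron as below. This is a minimal illustration, not the authors' implementation; the function name `lif_step` and the numeric values are hypothetical, and the weighted input sum is assumed to be computed beforehand.

```python
def lif_step(v_prev, weighted_input, bias, tau, v_th):
    """One discrete LIF update for a single neuron (illustrative sketch).

    v_prev         -- membrane potential after the previous time step
    weighted_input -- sum of W_ij * N_i(t) over presynaptic neurons
    bias           -- bias voltage X_bias
    tau            -- decay time constant
    v_th           -- firing threshold
    """
    # Formula (3): leaky integration of the input and the bias.
    v = (1.0 - 1.0 / tau) * v_prev + weighted_input + bias
    # Formula (2): fire when the potential exceeds the threshold.
    spike = 1 if v > v_th else 0
    # Formula (4): subtraction-based reset keeps the residual potential.
    v -= v_th * spike
    return v, spike


# Two consecutive steps with hypothetical values (tau = 2, V_th = 1).
v, s = lif_step(0.0, 0.6, 0.0, tau=2.0, v_th=1.0)
v, s = lif_step(v, 0.9, 0.0, tau=2.0, v_th=1.0)
```

With the subtraction-based reset, the potential above the threshold is carried into the next step rather than discarded, which is the property the comparison with fixed-value reset refers to.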

2.2. Spike Encoding Method

Most current SNN encoding methods, such as temporal coding and rate coding, encode each input value directly, which implicitly assumes that the input data are independent of one another. However, there are correlations between data at different points of a single image or of the spectrogram of a single speech signal, especially in spectrograms, where each row represents a subband. Differential Pulse Code Modulation (DPCM) is a common method in classical speech coding schemes that eliminates information redundancy in the speech signal and achieves good data compression. DPCM does not encode the input data directly, but rather encodes the difference between adjacent sampled signals. This technique has been applied not only in the field of communication, but also in digital image processing ([25,26]). However, DPCM has to reconstruct the previous sample during the calculation and then compute its difference from the current input, which increases the computational complexity of the encoding process.

The Sigma-Delta ADC is a very common circuit structure in the field of integrated circuits that converts analog signals to digital signals. We model the process by which a first-order Sigma-Delta ADC outputs a digital signal. Several previous studies have combined Sigma-Delta ADCs with SNNs: in [27,28], it is used as the spiking neuron to process spike trains; in [29], the researchers add it to each layer of the network and propose an STDP-like learning mechanism; and in [30], the researchers quantize the activations of an ANN and convert them to SNNs using a method similar to Sigma-Delta modulation. We use the Sigma-Delta ADC as a rate encoder and simplify the model for better computational efficiency. The Sigma-Delta ADC encoding process, shown in Figure 2, is similar to DPCM, but employs fixed feedback values instead of a predictor. When the input is an image or a spectrogram, i.e., a numeric matrix, we denote the datum in row i and column j as $x(i, j)$ and encode the input data along the rows. Note that the number of encoders equals the number of rows of the input data. The encoding process of the i-th row is as follows:

$$e(i, j) = x(i, j) - \hat{x}(i, j-1) \tag{5}$$

$$I(i, j) = e(i, j) + I(i, j-1) \tag{6}$$

$$\begin{cases} y(i, j) = 1,\ \hat{x}(i, j) = 1, & \text{if } I(i, j) > V_{th} \\ y(i, j) = 0,\ \hat{x}(i, j) = -1, & \text{if } I(i, j) \leq V_{th} \end{cases} \tag{7}$$

where $e(i, j)$ represents the prediction error, $I(i, j)$ denotes the value of the integrator, $\hat{x}(i, j)$ represents the predicted value determined by Formula (7), $y(i, j)$ is the output of the encoder, and $V_{th}$ is the reference level of the comparator. When the time window is N, the above formulas are looped N times, and at the end of each loop the value of the integrator is retained and used for the next loop.
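The encoder of Formulas (5)–(7) can be sketched for one input row as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name is hypothetical, the comparator level defaults to 0, the integrator and the feedback value both start at 0, and the feedback value is carried across passes together with the integrator (the text only states this for the integrator).

```python
def sigma_delta_encode_row(row, time_window, v_th=0.0):
    """Encode one row of values in [-1, 1] into `time_window` spike frames.

    Returns a list of `time_window` lists, each holding one 0/1 spike per
    input value of the row.
    """
    integrator = 0.0  # I(i, j), retained between passes over the row
    feedback = 0.0    # x_hat(i, j - 1); initial value 0 is an assumption
    frames = []
    for _ in range(time_window):
        frame = []
        for x in row:
            error = x - feedback        # Formula (5)
            integrator += error         # Formula (6)
            if integrator > v_th:       # Formula (7)
                frame.append(1)
                feedback = 1.0
            else:
                frame.append(0)
                feedback = -1.0
        frames.append(frame)
    return frames


# A constant row of 0.5 encoded over 4 passes (hypothetical input).
frames = sigma_delta_encode_row([0.5] * 100, time_window=4)
rate = sum(sum(f) for f in frames) / 400.0
```

Because the feedback takes the values +1 and −1, the long-run spike rate for a constant input x approaches (x + 1) / 2, so an input of 0.5 yields a rate of about 0.75, and inputs of −1 and 1 yield no spikes and a spike at every step, respectively.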

It can be seen from Formulas (5)–(7) that every datum is affected by the data already encoded. If the input is a spectrogram, the encoding of any value in a subband is influenced by the preceding data of the same subband. For image data, we likewise feed the input along the rows in our experiments. Note that the input range of the Sigma-Delta encoder (SDE) is [−1, 1]. We use a linear and a nonlinear method to map the input data to this range. The mapping formulas are as follows:

$$\text{Linear}: \quad x = \frac{2X}{X_{max} - X_{min}} - \frac{X_{max} + X_{min}}{X_{max} - X_{min}} \tag{8}$$

$$\text{Sin}: \quad x = \frac{\pi X}{X_{max} - X_{min}} - \frac{\pi (X_{max} + X_{min})}{2 (X_{max} - X_{min})} \tag{9}$$

where $X_{max}$ and $X_{min}$ represent the upper and lower bounds of the input value, respectively. Linear mapping does not affect the distribution of the data, while sinusoidal mapping increases the difference between higher- and lower-value regions.
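The two mappings can be sketched as below. Formula (9) as printed gives an affine argument that spans [−π/2, π/2]; the sketch assumes that the sine of this argument is what is fed to the encoder, which is an interpretation and is not stated explicitly in the text. The function names and the 0–255 pixel range in the example are hypothetical.

```python
import math


def linear_map(value, x_min, x_max):
    """Formula (8): affine map from [x_min, x_max] to [-1, 1]."""
    span = x_max - x_min
    return 2.0 * value / span - (x_max + x_min) / span


def sin_map(value, x_min, x_max):
    """Formula (9) followed by a sine (assumption, see lead-in)."""
    span = x_max - x_min
    angle = math.pi * value / span - math.pi * (x_max + x_min) / (2.0 * span)
    return math.sin(angle)  # maps [-pi/2, pi/2] onto [-1, 1]


# Example with a hypothetical 8-bit pixel range.
low = linear_map(0.0, 0.0, 255.0)
high = sin_map(255.0, 0.0, 255.0)
```

Both functions send the lower bound to −1 and the upper bound to 1; the sine variant is flatter near the two ends of the range and steeper around the midpoint.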

2.3. Supervised Training of Deep Spiking Neural Network

2.3.1. Spiking Neuron Gradient Estimation

ANNs often use the ReLU function as the activation function; its input-output characteristic is shown in Figure 3a. The information is passed on if and only if the input value exceeds zero; otherwise, the output is zero. The ReLU function introduces nonlinearity and alleviates the problem of exploding or vanishing gradients when training with backpropagation ([31]). As shown in Figure 3b, there is also a positive correlation between the output spike frequency and the input of Integrate-and-Fire (IF) neurons in SNNs. However, because their outputs are spike trains rather than continuous values, the spike firing process is non-differentiable, which makes SNNs difficult to train directly with the backpropagation algorithm.

In this study, we reconsider the role of the bias voltage in spiking neurons. We develop a new bias addition mechanism and, based on the input-output characteristic under the new rule, propose a novel gradient approximation method. The gradient of a spiking neuron is divided into two parts according to the range of the input: when the input is less than or equal to zero, the gradient is taken to be zero; when the input is greater than zero, the gradient is calculated as derived in the remainder of this section.

Spiking neurons in SNNs fire only when the membrane potential exceeds the threshold. The sources of the membrane potential can be separated into three parts, as shown in Figure 4: the membrane potential at the last time step, the input, and the bias. The bias voltage and the spike firing rate are positively correlated. Because only the membrane potential and the input change during network inference, the neuron's actual spike firing threshold ${\hat{V}}_{th}$ is equal to $V_{th} - X_{bias}$. In previous research, however, the bias is mainly handled in one of two ways: (1) the bias is fixed at 0 and is not a variable during learning; (2) as in ANNs, the bias is added at every time step and updated together with the weights during learning. In case 1, since the bias is permanently forced to zero, the regulating effect of the bias on the threshold is ignored. In case 2, adding the bias voltage at every time step violates the event-driven characteristics of SNNs, because as $T \to \infty$ the spiking neuron will spontaneously emit spikes even in the absence of any input, which has a negative impact on training and inference.

To address both cases, we modify the bias addition mechanism. We treat the bias as the reference voltage of the spiking neuron, which means that in the absence of any input, the membrane potential of the spiking neuron always remains at $X_{bias}$. When the output spikes of the previous layer arrive, the change in the membrane potential of the IF neuron is described by the following formula:

$$\begin{cases} V_{j}^{l}(t_{-}) = V_{j}^{l}(t-1) + \sum_{i=1}^{M_{l-1}} W_{ij} N_{i}^{l-1}(t) \\ V_{j}^{l}(t) = V_{j}^{l}(t_{-}) + (V_{th} - X_{bias,j}^{l}) \cdot N_{j}^{l}(t) \end{cases} \tag{10}$$

where $V_{j}^{l}(t_{-})$ denotes the membrane potential just after the spikes from the previous layer have arrived and $V_{j}^{l}(t)$ denotes the membrane potential after the spike is emitted. When the spiking neuron emits a spike, the membrane potential is reset according to Formula (4). We let the bias be consumed during the reset process and add it again at the end of the time step; the whole process is shown on the right side of Figure 4. We denote the final membrane potential after the inference process by $V_{j}^{l}(T)$. Over the whole process, a single neuron satisfies:

$$\sum_{t=1}^{T} \sum_{i=1}^{M_{l-1}} W_{ij} N_{i}^{l-1}(t) + (N_{j}^{l} + 1) X_{bias,j}^{l} = N_{j}^{l} V_{th} + V_{j}^{l}(T) \tag{11}$$

where $N_{j}^{l}$ is the total number of emitted spikes. To simplify the solution, we ignore the final membrane potential $V_{j}^{l}(T)$. Because $V_{j}^{l}(T)$ necessarily contains one bias, one bias also has to be subtracted from the left side of the equation, and Equation (11) becomes:

$$\sum_{t=1}^{T} \sum_{i=1}^{M_{l-1}} W_{ij} N_{i}^{l-1}(t) + N_{j}^{l} X_{bias,j}^{l} = N_{j}^{l} V_{th} \tag{12}$$

Since the value of a single spike is generally set to 1, $N_{j}^{l}$ is the final output of the neuron. Letting $X_{j}^{l} = \sum_{t=1}^{T} \sum_{i=1}^{M_{l-1}} W_{ij} N_{i}^{l-1}(t)$ denote the total input from the previous layer, we obtain:

$$N_{j}^{l} = \frac{X_{j}^{l}}{V_{th} - X_{bias,j}^{l}} \tag{13}$$

This gives an approximate input-output relationship for the IF neuron, which can also be applied approximately at each time step. The procedure is similar for the LIF neuron, but the decay factor in Formula (3) has to be considered when calculating the membrane potential, so the approximation of Formula (13) incurs a larger error. Therefore, only IF neurons are used in all experiments in this paper.
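The balance in Formula (11) and the approximation in Formula (13) can be checked numerically with a small IF simulation. The sketch below is an illustration under stated assumptions, not the authors' code: the neuron starts at the reference voltage $X_{bias}$, a spike subtracts $V_{th}$ as in Formula (4), and the bias is then added back, so each spike lowers the potential by $V_{th} - X_{bias}$ in total, which is the reading consistent with Formula (11). The threshold 3 and bias 1 follow the initialization reported for the sound experiments; the input sequence is hypothetical.

```python
def simulate_if_neuron(weighted_inputs, v_th=3.0, bias=1.0):
    """Run an IF neuron under the constrained bias rule (illustrative sketch).

    weighted_inputs -- per-step values of sum_i W_ij * N_i(t)
    Returns the number of emitted spikes and the final membrane potential.
    """
    v = bias  # with no input the potential rests at X_bias
    spikes = 0
    for x in weighted_inputs:
        v += x                 # integrate the incoming spikes
        if v > v_th:
            spikes += 1
            v -= v_th          # subtraction-based reset, Formula (4)
            v += bias          # the consumed bias is added back
    return spikes, v


def estimated_spike_count(total_input, v_th=3.0, bias=1.0):
    """Formula (13): approximate output from the total input."""
    return total_input / (v_th - bias)


inputs = [0.5] * 20
count, v_final = simulate_if_neuron(inputs)
estimate = estimated_spike_count(sum(inputs))
```

For this input the simulation emits 4 spikes and ends at a potential of 3.0, so both sides of Formula (11) equal 15, while Formula (13) estimates 5 spikes. The difference of one spike comes from the residual potential that the approximation ignores.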

2.3.2. Spike-Based Backpropagation Algorithm

Formula (13) gives the input-output relationship of the spiking neuron. Differentiating it yields:

$$\frac{dN_{j}^{l}}{dX_{j}^{l}} = \frac{1}{V_{th} - X_{bias,j}^{l}} \tag{14}$$

$$\frac{dN_{j}^{l}}{dX_{bias,j}^{l}} = \frac{N_{j}^{l}}{V_{th} - X_{bias,j}^{l}} \tag{15}$$
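The two approximate gradients can be sketched and compared against a finite difference of Formula (13), as below. This is a minimal illustration with hypothetical function names and values; following Section 2.3.1, the gradient is taken to be zero when the total input is not positive.

```python
def estimated_spike_count(total_input, bias, v_th):
    """Formula (13), with zero output for non-positive input."""
    if total_input <= 0.0:
        return 0.0
    return total_input / (v_th - bias)


def grad_wrt_input(total_input, bias, v_th):
    """Formula (14): dN/dX, zero when the input is not positive."""
    if total_input <= 0.0:
        return 0.0
    return 1.0 / (v_th - bias)


def grad_wrt_bias(total_input, bias, v_th):
    """Formula (15): dN/dX_bias = N / (V_th - X_bias)."""
    return estimated_spike_count(total_input, bias, v_th) / (v_th - bias)


# Central finite difference of Formula (13) with respect to the bias.
eps = 1e-3
numeric = (estimated_spike_count(10.0, 1.0 + eps, 3.0)
           - estimated_spike_count(10.0, 1.0 - eps, 3.0)) / (2.0 * eps)
analytic = grad_wrt_bias(10.0, 1.0, 3.0)
```

With a total input of 10, a bias of 1, and a threshold of 3, the analytic values are 0.5 for the input gradient and 2.5 for the bias gradient, and the finite difference agrees with the latter to within the step-size error.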

After obtaining the approximate gradient of the neuron, we design our backpropagation algorithm following the BackPropagation Through Time (BPTT) concept. We define the output layer of the network as a linear layer that emits no spikes and only accumulates membrane potential; the category is determined by the magnitude of the membrane potential. For an M-class classification problem, we use the Mean-Square-Error (MSE) as our loss function:

$$E = \frac{1}{2} \sum_{m=1}^{M} ({\hat{y}}_{m} - y_{m})^{2} \tag{16}$$

where $y_{m} = \frac{V_{m}^{L}(T)}{T V_{th}}$. During backpropagation, we use the chain rule. Taking the L-th layer as the output layer, the gradients for the output-layer neurons are calculated as follows:

$$\frac{dE}{dX_{m}^{L}} = \frac{dE}{dy_{m}} \frac{dy_{m}}{dX_{m}^{L}} \tag{17}$$

$$= -\frac{({\hat{y}}_{m} - y_{m})}{T V_{th}} \tag{18}$$

$$\frac{dE}{dX_{bias,m}^{L}} = \frac{dE}{dy_{m}} \frac{dy_{m}}{dX_{bias,m}^{L}} \tag{19}$$

$$= -\frac{({\hat{y}}_{m} - y_{m})}{V_{th}} \tag{20}$$
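The output-layer loss and gradients of Formulas (16)–(20) can be sketched as below. This is an illustration under stated assumptions: $\hat{y}_{m}$ is taken to be the target (for example a one-hot label) and $y_{m}$ the normalized final membrane potential; the function name and the numbers are hypothetical.

```python
def output_layer_loss_and_grads(v_final, target, time_window, v_th):
    """MSE loss and output-layer gradients for a non-spiking output layer.

    v_final     -- final membrane potentials V_m^L(T), one per class
    target      -- target values y_hat_m (assumed to be the labels)
    time_window -- number of time steps T
    v_th        -- threshold used for normalization
    """
    scale = time_window * v_th
    y = [v / scale for v in v_final]
    # Formula (16): half the summed squared error.
    loss = 0.5 * sum((t - o) ** 2 for t, o in zip(target, y))
    # Formulas (17)-(18): gradient with respect to the total input.
    grad_input = [-(t - o) / scale for t, o in zip(target, y)]
    # Formulas (19)-(20): gradient with respect to the bias.
    grad_bias = [-(t - o) / v_th for t, o in zip(target, y)]
    return loss, grad_input, grad_bias


loss, g_in, g_bias = output_layer_loss_and_grads(
    v_final=[6.0, 0.0], target=[1.0, 0.0], time_window=4, v_th=3.0)
```

In this example the normalized outputs are 0.5 and 0, so the loss is 0.125 and both gradients for the first class are negative, which pushes its accumulated potential upward under gradient descent.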

For the j-th neuron in the l-th (hidden) layer, the gradient at time $t_{0}$ is calculated as follows:

$$\frac{dE}{dN_{j}^{l,t_{0}}} = \sum_{m=1}^{M} \frac{dE}{dy_{m}^{T}} \left( \frac{dy_{m}^{T}}{dy_{m}^{t_{0}}} \frac{dy_{m}^{t_{0}}}{dN_{j}^{l,t_{0}}} + \sum_{t=t_{0}+1}^{T} \frac{dy_{m}^{T}}{dN_{j}^{l,t}} \frac{dN_{j}^{l,t}}{dN_{j}^{l,t_{0}}} \right) \tag{21}$$

$$\frac{dy_{m}^{t_{0}}}{dN_{j}^{l,t_{0}}} = \sum_{k=1}^{N^{l+1}} \frac{dy_{m}^{t_{0}}}{dN_{k}^{l+1,t_{0}}} \frac{dN_{k}^{l+1,t_{0}}}{dX_{k}^{l+1,t_{0}}} \frac{dX_{k}^{l+1,t_{0}}}{dN_{j}^{l,t_{0}}} \tag{22}$$

$$\frac{dN_{j}^{l,t}}{dN_{j}^{l,t_{0}}} = \frac{dN_{j}^{l,t}}{dV_{j}^{l}(t)} \frac{dV_{j}^{l}(t)}{dV_{j}^{l}(t_{0})} \frac{dV_{j}^{l}(t_{0})}{dN_{j}^{l,t_{0}}} \tag{23}$$

According to Formula (10), the first two factors of Formula (23) are:

$$\frac{dN_{j}^{l,t}}{dV_{j}^{l}(t)} = \frac{dN_{j}^{l,t}}{dV_{j}^{l}(t)} \frac{dV_{j}^{l}(t)}{dX_{j}^{l,t}} = \frac{N_{j}^{l,t}}{V_{th} - X_{bias,j}^{l}} \tag{24}$$

$$\frac{dV_{j}^{l}(t)}{dV_{j}^{l}(t_{0})} = 1 \tag{25}$$

When $N_{j}^{l,t_{0}} = 0$, the last term of Formula (23) cannot be calculated directly, but it can be eliminated by the chain rule. Finally, we obtain the gradients of the loss function with respect to the weight $W^{l}$ and the bias $X_{bias,j}^{l}$, respectively:

$$\frac{dE}{dW^{l}} = \sum_{t=1}^{T} \sum_{k=1}^{N^{l+1}} \frac{dE}{dN_{j}^{l+1,t}} \frac{dN_{j}^{l+1,t}}{dW^{l}} \tag{26}$$

$$\frac{dE}{dX_{bias,j}^{l}} = \sum_{t=1}^{T} \sum_{k=1}^{N^{l+1}} \frac{dE}{dN_{j}^{l,t}} \frac{dN_{j}^{l,t}}{dX_{bias,j}^{l}} \tag{27}$$

The pseudocode for the entire training phase is shown below:

Algorithm 1: Training process of the proposed SNN model and SDE method.

Input: network input $X_{i,j}$, $i \in [0, M]$, $j \in [0, N]$; sample label Y; time window T.

1: Initialize the parameters and states of the network $\omega_{ij}^{n}$, $bias_{i}^{n}$, $n = 1, 2, \dots, N$, and the states of the SDE integrators $I(i, j)$
2: Forward:
3: for $t \in [0, T]$ do
4:     for $j \in [0, N]$ do
5:         # Each SDE encodes an entire row of input data
6:         $Spike_{(i,j)} = SDE(X_{i,j})$
7:     end for
8:     for $n \in [0, N]$ do
9:         $Spike^{n} \leftarrow \omega \cdot Spike^{n-1}$
10:     end for
11: end for
12: Backward:
13: Calculate the loss function $L = MSE(Spike^{N-1}, Y)$
14: Calculate the gradients through BPTT: $\Delta_{\omega} \leftarrow \nabla_{\omega} L$, $\Delta_{bias} \leftarrow \nabla_{bias} L$
15: Update the parameters
16: Reset:
17: Reset the membrane potential $V(T-1)$ of the spiking neurons and the state I of the SDE integrators to 0.
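The forward part of Algorithm 1 can be sketched for a network with one hidden IF layer as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the encoded spike frames are supplied directly, the hidden layer uses the constrained bias rule (a spike subtracts $V_{th}$ and the bias is added back), the output layer is a non-spiking accumulator without bias, and all names, weights, and inputs are hypothetical. The backward pass and the parameter update are omitted.

```python
def forward(spike_frames, w_hidden, bias_hidden, w_out, v_th=3.0):
    """Forward pass over T encoded frames for a one-hidden-layer SNN.

    spike_frames -- list of T input spike vectors (0/1 values)
    w_hidden     -- w_hidden[j][i]: weight from input i to hidden neuron j
    bias_hidden  -- bias voltage of each hidden neuron
    w_out        -- w_out[m][j]: weight from hidden neuron j to output m
    Returns the normalized outputs y_m = V_m(T) / (T * V_th).
    """
    v_hidden = list(bias_hidden)   # hidden neurons rest at their bias
    v_out = [0.0] * len(w_out)     # output layer only accumulates
    for frame in spike_frames:
        hidden_spikes = []
        for j, weights in enumerate(w_hidden):
            v_hidden[j] += sum(w * s for w, s in zip(weights, frame))
            if v_hidden[j] > v_th:
                hidden_spikes.append(1)
                v_hidden[j] += bias_hidden[j] - v_th  # reset, bias re-added
            else:
                hidden_spikes.append(0)
        for m, weights in enumerate(w_out):
            v_out[m] += sum(w * s for w, s in zip(weights, hidden_spikes))
    steps = len(spike_frames)
    return [v / (steps * v_th) for v in v_out]


outputs = forward(
    spike_frames=[[1, 0], [1, 0], [1, 0]],
    w_hidden=[[2.5, 0.0]],
    bias_hidden=[1.0],
    w_out=[[3.0], [0.0]],
)
predicted_class = outputs.index(max(outputs))
```

In this toy case the hidden neuron fires at every step, so the first output accumulates 9 and normalizes to 1.0, and class 0 is predicted.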

3. Results

3.1. Experimental Setup

3.1.1. Experimental Environment

To validate the effectiveness of the proposed encoding method and learning rule, we apply the proposed model to environmental sound classification and image classification tasks. The network structures and hyperparameters differ between tasks, and the input data are preprocessed differently for each task. The training process of the network is shown in Algorithm 1. It is worth noting that in SNNs the input has to be presented to the network over multiple time steps. Hence, when a dropout layer is used, the mask must be the same at every time step for the same input sample, as mentioned in [32]. Our model is built with the PyTorch-based SpikingJelly framework, which supports GPU-accelerated inference ([33]). The environmental sound classification experiment is run on an Intel(R) Core(TM) i7-8700 CPU, and the image classification task is performed on an NVIDIA Titan Xp graphics card.
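The requirement on the dropout mask can be illustrated with a framework-independent sketch: the mask is sampled once per input sample and then reused at every time step. The function names are hypothetical, and the rescaling of kept units by 1 / (1 − p) follows the common inverted-dropout convention, which is an assumption here rather than a detail given in the text.

```python
import random


def make_dropout_mask(size, drop_prob, rng):
    """Sample one inverted-dropout mask (kept units scaled by 1 / (1 - p))."""
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [keep_scale if rng.random() >= drop_prob else 0.0
            for _ in range(size)]


def apply_mask_over_time(frames, mask):
    """Apply the same mask to the activations of every time step."""
    return [[value * m for value, m in zip(frame, mask)] for frame in frames]


rng = random.Random(0)                     # fixed seed for reproducibility
mask = make_dropout_mask(8, 0.5, rng)      # sampled once per input sample
frames = [[1.0] * 8 for _ in range(5)]     # 5 time steps of activations
dropped = apply_mask_over_time(frames, mask)
```

Sampling a fresh mask at each time step would instead silence a different subset of neurons at every step, so no neuron would be consistently dropped over the time window.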

3.1.2. Datasets

In the experimental stage, the dataset used for the environmental sound recognition task is the Real World Computing Partnership (RWCP) sound scene dataset ([34]). Following the settings of previous research, we select ten categories: bells, bottle, buzzer, cymbals, horn, kara, metal, phone, ring, and whistle, with durations ranging from 0.5 s to 3.0 s. We randomly select 80 files from each class, using half for training and the other half for testing. To test the effectiveness of learning, we add noise to the test set while leaving the training set unchanged, and evaluate the test set under these mismatched conditions at different signal-to-noise ratios. All experimental results in this part are averages over ten runs. We use the MNIST and Fashion-MNIST datasets for the image classification tasks. MNIST is one of the most classic datasets in deep learning; it contains handwritten digits from ten classes with an input size of $28 \times 28$ ([35]). The training set has a total of 60,000 samples, 6000 per class, and the test set contains 10,000 samples, 1000 per class. Fashion-MNIST has the same number of classes and samples as MNIST, but its data are more complex: the images show various articles of clothing, the input size is $28 \times 28$, and the training and test set sizes are the same as in MNIST ([36]).

3.1.3. Experimental Chapter Arrangement

Our experiments are divided into three parts. In Section 3.2, we perform environmental sound classification on the RWCP dataset. We briefly introduce the feature extraction algorithm and the whole process of feature extraction and coding. We run the experiments under different normalization mappings and examine the network performance under different time windows. To verify the robustness of the network, noise with various signal-to-noise ratios is added to the test set, in addition to training and testing on clean data.

In Section 3.3, we use the MNIST and Fashion-MNIST datasets for image classification tasks. We evaluate the network's performance with various numbers of codes, compare the outcomes with those of other SNN models, and report the results in a table. In Section 3.4, we further test the network's learning capacity by reducing the number of training samples. Finally, we analyze the efficiency of the algorithm in Section 3.5.

3.2. Environmental Sound Classification

3.2.1. Constant-Q Transform

The constant-Q transform (CQT) is a technique for converting a time-domain signal x(n) into the time-frequency domain such that the center frequencies of the frequency bins are geometrically spaced. The conversion from the discrete time-domain signal x(n) to the time-frequency representation X(k, n) is as follows:

$$X^{CQ}(k, n) = \sum_{j=n-\lfloor N_{k}/2 \rfloor}^{n+\lfloor N_{k}/2 \rfloor} x(j)\, a_{k}^{*}(j - n + N_{k}/2) \tag{28}$$

where k is the index of the frequency bin (the number of bins is denoted by $N_{bins}$ in Table 1), $\lfloor \cdot \rfloor$ denotes rounding towards negative infinity, and $a_{k}^{*}$ represents the complex conjugate of $a_{k}(n)$; details can be found in [37]. The center frequency $f_{k}$ of each frequency bin is calculated as:

$$f_{k} = f_{1} 2^{\frac{k-1}{B}} \tag{29}$$

where $f_{1}$ is the center frequency of the lowest frequency bin. The value of Q is determined by

$$Q = \frac{q}{\Delta\omega (2^{\frac{1}{B}} - 1)} \tag{30}$$

where $\Delta\omega$ takes different values for different window functions, q is a scaling factor usually set to 1, and B is the number of bins per octave, which is important to the performance of the CQT. If B is set very high, the interval between the center frequencies becomes very small, so more frequency bins are obtained for the same band, which can be regarded as oversampling the frequency axis. In our experiments, in order to limit the size of the spectrogram, we set B to 3.
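Formulas (29) and (30) can be evaluated directly, as in the short sketch below. The function names, the lowest center frequency of 100 Hz, and the choice $\Delta\omega = 1$ are hypothetical values for illustration only; the actual settings are those of Table 1.

```python
def cqt_center_frequencies(f1, n_bins, bins_per_octave):
    """Formula (29): geometrically spaced center frequencies f_1 .. f_n."""
    return [f1 * 2.0 ** ((k - 1) / bins_per_octave)
            for k in range(1, n_bins + 1)]


def q_factor(bins_per_octave, q=1.0, delta_omega=1.0):
    """Formula (30): quality factor for a given number of bins per octave."""
    return q / (delta_omega * (2.0 ** (1.0 / bins_per_octave) - 1.0))


# With B = 3, every third bin doubles the center frequency.
freqs = cqt_center_frequencies(f1=100.0, n_bins=7, bins_per_octave=3)
quality = q_factor(bins_per_octave=3)
```

With B = 3 the seven bins span exactly two octaves (100 Hz to 400 Hz in this example), and the quality factor is about 3.85 under the assumed $\Delta\omega$. A larger B packs more bins into each octave and raises Q accordingly.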

3.2.2. Network Structure and Parameter Setting

We use the CQT to extract audio features from environmental sound data and create spectrograms. The main advantage of the CQT is that it does not suffer from a uniform time-frequency resolution, because the ratio of each subband's bandwidth to its center frequency is constant. High-frequency bins therefore have a larger bandwidth and a higher temporal resolution, which helps track rapidly changing overtones; in contrast, low-frequency bins have a relatively narrow bandwidth and a higher frequency resolution, which helps resolve similar notes. In practice, the CQT can analyze the audio signal more thoroughly, since the energy of the sound signal is concentrated primarily in the low-frequency range. We use the Librosa library for the feature extraction part of the experiment ([38]). The hyperparameters we set in the experiments are shown in Table 1.

Because the environmental sound segments differ in length, the dimensions of the spectrograms extracted by the CQT also vary. We therefore divide the temporal dimension of each spectrogram into T segments; each segment contains a different number of samples for sound signals of different lengths. We then average each segment to obtain a spectrogram of fixed size. If M cochlear filters are used and T is the length of the time axis, segmenting the temporal dimension yields a spectrogram of size (M, T). We input the spectrogram directly to the SDE; since each SDE encodes only one row of data, M SDEs are needed for the encoding process, as shown in Figure 5. After the encoding process, we flatten the spike matrix into a one-dimensional vector and input it into the fully connected SNN. The network structure is $800 \times 50 \times 10$, and the threshold of the last layer is set to infinity. In the initialization phase of the network, the weights are initialized according to a normal distribution, the fixed bias is 1, and the threshold is 3. Since the RWCP dataset is relatively small, the default batch size is set to 1.
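The segment averaging that turns a variable-length spectrogram into a fixed (M, T) matrix can be sketched as below. This is an illustration under an assumption: segment boundaries are placed at integer divisions of the frame count, which the text does not specify. The function name and the toy input are hypothetical.

```python
def pool_time_axis(spectrogram, n_segments):
    """Average each row of a spectrogram over `n_segments` time segments.

    spectrogram -- list of M rows, each with the same number of frames
    n_segments  -- target length T of the time axis
    Returns a list of M rows with exactly `n_segments` values each.
    """
    pooled = []
    for row in spectrogram:
        n_frames = len(row)
        out = []
        for k in range(n_segments):
            start = k * n_frames // n_segments
            # Guarantee at least one frame per segment.
            stop = max((k + 1) * n_frames // n_segments, start + 1)
            segment = row[start:stop]
            out.append(sum(segment) / len(segment))
        pooled.append(out)
    return pooled


# One subband with 5 frames reduced to T = 2 segments.
fixed = pool_time_axis([[1.0, 2.0, 3.0, 4.0, 5.0]], n_segments=2)
```

Here the first segment averages two frames and the second three, which reflects the statement that segments hold different numbers of samples depending on the length of the signal.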

3.2.3. Classification Performance

We first conduct experiments with clean signals. As shown in Table 2, different mapping methods give different results even for the same time window. With a time window of 20, the network using linear mapping achieves the best performance, with an average accuracy of 99.85% over ten experiments. With a time window of 16, the sinusoidal mapping method achieves its best performance of 99.68%. In Table 3, we compare our results with other SNN models and the traditional MFCC-HMM method. Our approach, which uses a fully connected SNN, achieves results close to the best and even exceeds some SNN models that include convolutional operations. We further test the robustness of the model under mismatched conditions: the network is still trained on clean sound, and noise is added only to the test data, which probes the ability of the network to extract audio features. Table 4 shows that under mismatched conditions the network using sinusoidal mapping is more robust: its best average accuracy, 71.49%, is reached with a time window of 12, whereas the best average accuracy of linear mapping is 67.23%, reached with a time window of 14. We attribute this to the additional nonlinear transformation introduced by sinusoidal mapping, which can improve the robustness of the network to a certain extent. Moreover, the performance of the network first increases and then decreases as the time window grows, which is consistent with our expectations. When the time window is too small, the encoding of the input feature map is relatively coarse, and increasing the time window may benefit feature extraction. However, when the time window is too large, the network overfits, and slight differences within the network can also affect its performance.
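
For readers who want to reproduce a mismatched test condition, the sketch below scales a noise signal so that the mixture has a prescribed signal-to-noise ratio, using $\mathrm{SNR}_{dB} = 10 \log_{10}(P_{signal} / P_{noise})$. The SNR levels follow Table 4; the noise type, the mixing procedure, and the function name `mix_at_snr` are assumptions, since this section does not specify them.

```python
import math
import random


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power matches snr_db, then add it."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(p_signal / (gain**2 * p_noise)), solved for gain.
    gain = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]


rng = random.Random(0)
signal = [math.sin(0.05 * i) for i in range(1000)]   # stand-in for a clean sound
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # assumed Gaussian noise
noisy_versions = {snr: mix_at_snr(signal, noise, snr) for snr in (20, 10, 0, -5)}
```

Training data stay clean in this setting; only the test signals would be passed through such a mixing step.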

3.3. Image Classification

3.3.1. Network Structure and Parameter Settings

Most current SNN studies demonstrate network performance on image datasets rather than on environmental sound classification. The relevant network parameters are shown in Table 5. The network structure we choose is 15C5-40C5-FC300-FC10, where 15C5 denotes a convolutional layer with 15 output channels and a convolution kernel of size $5 \times 5$, and FC300 denotes a fully connected layer with 300 neurons. We set the parameter of the dropout layer to 0.2 and employ only linear mapping when we use the SDE. We use stochastic gradient descent to update the network parameters and cosine annealing to schedule the learning rate. The cosine annealing parameter $T_{max}$ is 100 epochs and training lasts 200 epochs, so the learning rate is at its maximum at the beginning and end of training and decreases to its lowest value at the 100th epoch. The motivation for this setting is that the loss is very small in the later training phase, where a small learning rate can hardly escape a local optimum. Using a higher learning rate at the end of training therefore allows the network to explore a larger space and achieve better performance.
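
The learning rate schedule described above can be written in closed form as $lr(t) = lr_{min} + \frac{1}{2}(lr_{ini} - lr_{min})(1 + \cos(\pi t / T_{max}))$. The sketch below evaluates this standard cosine annealing formula with the values from Table 5; which scheduler implementation was used in the experiments is not stated here, so the function `cosine_annealing_lr` is only an illustration.

```python
import math


def cosine_annealing_lr(epoch, lr_max=5e-2, lr_min=1e-4, t_max=100):
    """Closed-form cosine annealing; the curve repeats every 2 * t_max epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 50, 100, 150, 200):
    print(epoch, cosine_annealing_lr(epoch))
```

With 200 training epochs and a value of 100 for the annealing parameter, the schedule starts at the maximum learning rate, reaches the minimum at epoch 100, and returns to the maximum at epoch 200.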

3.3.2. MNIST Dataset

As one of the most widely used datasets for classification tasks, MNIST has been used to evaluate many spike-trained SNN models. Table 6 shows the performance of our model for different time windows. With a time window of 12, the network reaches 99.37%. When the time window is increased to 14, the classification accuracy reaches its highest value of 99.60%. We further compare our network with other models of similar structure in Table 7. Compared with these networks, our network shows advantages in both the time window and the accuracy.

To examine the effect of different time windows on classification performance, we plot the training curves in Figure 6. The training curves in Figure 6a show that our network achieves more than 97% accuracy in the initial stage of training and that the accuracy continues to increase as training proceeds. In the later stages of training, the performance of the network remains very stable even though the learning rate is increasing again. A smaller time window better highlights the low energy consumption of SNNs and reduces inference time. To test the influence of the SDE on model performance, we replaced it with Poisson coding. The resulting training curves are shown in Figure 6b. The SDE-based model achieves higher accuracy and converges faster. Since the Sigma-Delta ADC is a mature technique in the field of analog circuits and the SDE is equivalent to a first-order Sigma-Delta ADC, the SDE is easy to implement in hardware.
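
The link to a first-order Sigma-Delta modulator can be illustrated with a generic sketch. The function below is a textbook first-order modulator, not the SDE defined in Section 2.2: the input range [-1, 1], the zero initial state, and the name `sigma_delta_encode` are assumptions made for illustration. In line with Figure 2, the integrator is not reset between inputs and the 1-bit DAC returns -1 or 1.

```python
def sigma_delta_encode(values, steps_per_value):
    """Generic first-order sigma-delta modulator producing a 0/1 spike train."""
    integrator = 0.0   # carried across inputs, never reset
    feedback = 0.0     # output of the 1-bit DAC from the previous step
    spikes = []
    for x in values:
        for _ in range(steps_per_value):
            integrator += x - feedback          # accumulate the quantization error
            bit = 1 if integrator >= 0.0 else 0
            feedback = 1.0 if bit else -1.0     # 1-bit DAC: value in {-1, 1}
            spikes.append(bit)
    return spikes


spikes = sigma_delta_encode([0.5], steps_per_value=200)
decoded = [2 * b - 1 for b in spikes]            # map spikes back to DAC levels
print(sum(decoded) / len(decoded))
```

Because the integrator stays bounded for inputs in [-1, 1], the running mean of the DAC levels tracks the input value, with an error that shrinks as the number of steps grows.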

3.3.3. Fashion-MNIST Dataset

The images in the Fashion-MNIST dataset are more complex and challenging than those in MNIST. Table 8 shows the performance of other SNN models on this dataset and of our model for different time windows. When the time window is 20, the model accuracy reaches 90.26%, and when the time window is increased to 100, it reaches 91.71%. Our model achieves high accuracy with a small time window, which also reflects the low power consumption of SNNs.

3.4. Training with Less Data

To test the learning ability of the network, we limit the amount of data used to train the model and test on the three datasets above. For the RWCP dataset, we adjust the amount of training data. For the MNIST and Fashion-MNIST datasets, we train on 10% and on 50% of the training set, with time windows of 14 and 20, respectively.
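
One simple way to build such reduced training sets is to draw a fixed fraction of sample indices at random, as sketched below. Uniform random sampling, the seed, and the function name `subsample_indices` are assumptions; the selection procedure actually used is not described in this section. The total of 60,000 is the size of the standard MNIST and Fashion-MNIST training sets.

```python
import random


def subsample_indices(n_total, ratio, seed=0):
    """Pick round(n_total * ratio) distinct sample indices uniformly at random."""
    n_keep = round(n_total * ratio)
    return sorted(random.Random(seed).sample(range(n_total), n_keep))


for ratio in (0.1, 0.5):
    print(ratio, len(subsample_indices(60000, ratio)))
```

Ratios of 0.1 and 0.5 correspond to 6000 and 30,000 training images, respectively.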

The experimental results for the RWCP dataset are shown in Table 9. When we reduce the ratio of the training data to 0.1, our SNN model still achieves accuracies of 98.97% and 97.318% with linear and sinusoidal mapping, respectively, and even under mismatched conditions the average accuracy reaches 70.53%. When we increase the ratio to 0.9, the performance of the network improves: the accuracy without added noise reaches 99.70% and 99.75%, and the average accuracy under mismatched conditions is 68.00% and 73.20%, respectively.

As shown in Figure 7a, when 6000 MNIST samples are used for training, our SNN model achieves 98.41%, and when the number of training samples increases to 30,000, the accuracy rises further to 99.22%. The training curves for the Fashion-MNIST dataset are shown in Figure 7b. When we train on 10% of the training set, the model reaches 84.07% and its training curve shows a tendency to overfit; when we train on half of the training set, the accuracy improves to 87.74%. We attribute this to the images in Fashion-MNIST being more complex than those in MNIST: although the amount of training data is the same, feature extraction is more difficult for Fashion-MNIST, and the model is more prone to overfitting.

3.5. Algorithm Efficiency

We compare the proposed SNN model with SNNs based on native IF neurons. The network model is still LeNet5, and the dataset is MNIST. We first compare the average running time and memory consumption per epoch, which reflect the efficiency of the algorithm. A single epoch takes 94 s for the SNN based on native IF neurons and 100 s for our proposed SNN model. The extra time comes from the need to determine, in the implementation, whether to add the bias. In terms of memory consumption, as can be seen in Figure 8a, our model occupies about 889 MB, whereas the native IF neuron model consumes 1003 MB; this difference is related to the spike sparsity of the network. Note that the approximate gradient we derived has a clear advantage over the surrogate gradient: if a spiking neuron emits no spike at time t, the gradient at time t is 0, which effectively reduces memory consumption.

We further study the accuracy of the two networks under the same time window. The training curves of the two models are shown in Figure 8b. Our model not only reaches a much higher accuracy than the model using native IF neurons but also converges noticeably faster.

4. Discussion

For SNNs, spike encoding is a crucial process, and how to encode more efficiently is a question worth exploring, because the efficiency of the encoded spike trains has an important impact on model performance. The proposed SDE can be applied to sound recognition and image classification tasks and exploits the correlation between data to reduce the time window effectively. In traditional deep neural networks, the bias determines how easily positive and negative activations are generated, while in SNNs, the bias adjusts how easily spikes are fired. Previous studies have not emphasized the effect of the bias. We argue that adjusting the bias is equivalent to changing the spike firing threshold. In this work, we restrict the addition of the bias so that it is tightly coupled to spike firing, which allows us to obtain an identity between the input and output of a single neuron over the whole inference process and, from it, a gradient approximation. The bias voltage differs between neurons, so the actual spike firing threshold also differs, which reflects the heterogeneity of biological neurons and is consistent with biological systems.
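
The difference between the two bias addition mechanisms, and the kind of input-output identity that results, can be sketched with a single IF neuron. This is an illustration rather than the implementation evaluated above: reset by subtraction and the function name `if_neuron` are assumptions, while the threshold of 3 and the bias of 1 follow Table 1.

```python
def if_neuron(inputs, v_th=3.0, bias=1.0, constrained=True):
    """Simulate one IF neuron; return (spike train, final membrane potential)."""
    v = 0.0
    spikes = []
    for x in inputs:
        v += x                   # accumulate the weighted input
        if not constrained:
            v += bias            # conventional: bias added at every time step
        if v >= v_th:
            v -= v_th            # assumed reset by subtraction
            if constrained:
                v += bias        # constrained: bias added only when the neuron resets
            spikes.append(1)
        else:
            spikes.append(0)
    return spikes, v


silent = [0.0] * 10
print(if_neuron(silent, constrained=True)[0])    # spike train with constrained bias
print(if_neuron(silent, constrained=False)[0])   # spike train with conventional bias

inputs = [1.0, 2.5, 0.0, 4.0, 0.5]
spikes, v_end = if_neuron(inputs)
print(sum(inputs), (3.0 - 1.0) * sum(spikes) + v_end)
```

With zero input, the constrained neuron stays silent, whereas the conventional neuron fires from the bias alone. Under the reset-by-subtraction assumption, the total input equals $(V_{th} - b)$ times the spike count plus the residual membrane potential, which is one way to read the statement that the bias shifts the effective firing threshold.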

We apply the proposed SNN model to the classification of environmental sounds. Compared to images, sound signals contain rich spatiotemporal features and are more suitable for evaluating the performance of SNNs. On the test set without noise, our average accuracy reaches a maximum of 99.85%, and on the test set with noise, it reaches 71.49%, reflecting the robustness of the network. We also test the performance of the proposed SNN model with only a few samples. The experiments show that with only 80 audio samples for training and 720 audio samples for testing, the network achieves a highest average accuracy of 98.97% without added noise. The average accuracy when noise is added reaches 70.53%, which reflects the strong learning ability of the SNN. We verify the effectiveness of the proposed learning rule and the SDE on the MNIST and Fashion-MNIST image datasets. On MNIST, compared to previous SNN studies with similar network structures, our network achieves near-optimal results in only 14 time steps. On Fashion-MNIST, the network reaches 90.26% with a time window of 20, which is competitive with the other networks in Table 8.

Besides providing a novel SNN coding method that is well suited to hardware implementation, the proposed bias addition mechanism is consistent with the event-driven nature of SNNs. Compared to most other SNN models, our network uses fewer time steps, which means that our model emits spikes more sparsely during inference. This preserves the low energy consumption of SNNs and reduces the response time of the network, which is beneficial for practical applications. To speed up training, we avoid techniques such as gradient normalization and weight constraints, which also makes the method well suited to on-chip training on neuromorphic chips.

Author Contributions

Conceptualization, R.L. and B.D.; methodology, R.L. and B.D.; software, R.L.; validation, R.L. and B.D.; formal analysis, R.L.; investigation, R.L.; resources, G.C.; data curation, R.L. and B.D.; writing—original draft preparation, R.L., B.D. and Y.Z.; writing—review and editing, G.C.; visualization, R.L.; supervision, G.C.; project administration, H.L.; funding acquisition, H.L. and G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China U19A2080, the National Natural Science Foundation of China U1936106, the CAS Strategic Leading Science and Technology Project XDA27040303, XDA18040400, and XDB44000000, and the High Technology Project 31513070501 and 1916312ZD00902201.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are RWCP, MNIST, and Fashion-MNIST. They are openly available at http://research.nii.ac.jp/src/en/RWCP-SSD.html, http://yann.lecun.com/exdb/mnist and https://github.com/zalandoresearch/fashion-mnist (accessed on 10 November 2022).

Acknowledgments

We would like to express our gratitude to the Beijing Academy of Artificial Intelligence for supporting this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ponulak, F.; Kasinski, A. Introduction to spiking neural networks: Information processing, learning and applications. Acta Neurobiol. Exp. 2011, 71, 409–433.
2. Yan, Y.; Chu, H.; Jin, Y.; Huan, Y.; Zou, Z.; Zheng, L. Backpropagation With Sparsity Regularization for Spiking Neural Network Learning. Front. Neurosci. 2022, 16, 760298.
3. Furber, S.B.; Galluppi, F.; Temple, S.; Plana, L.A. The spinnaker project. Proc. IEEE 2014, 102, 652–665.
4. Akopyan, F.; Sawada, J.; Cassidy, A.; Alvarez-Icaza, R.; Arthur, J.; Merolla, P.; Imam, N.; Nakamura, Y.; Datta, P.; Nam, G.J.; et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2015, 34, 1537–1557.
5. Davies, M.; Srinivasa, N.; Lin, T.H.; Chinya, G.; Cao, Y.; Choday, S.H.; Dimou, G.; Joshi, P.; Imam, N.; Jain, S.; et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 2018, 38, 82–99.
6. Naud, R.; Sprekeler, H. Sparse bursts optimize information transmission in a multiplexed neural code. Proc. Natl. Acad. Sci. USA 2018, 115, E6329–E6338.
7. Zang, Y.; Hong, S.; De Schutter, E. Firing rate-dependent phase responses of Purkinje cells support transient oscillations. Elife 2020, 9, e60692.
8. Zang, Y.; De Schutter, E. The cellular electrophysiological properties underlying multiplexed coding in Purkinje cells. J. Neurosci. 2021, 41, 1850–1863.
9. Han, B.; Roy, K. Deep spiking neural network: Energy efficiency through time based coding. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 388–404.
10. Kiselev, M. Rate coding vs. temporal coding-is optimum between? In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 1355–1359.
11. Sharma, V.; Srinivasan, D. A spiking neural network based on temporal encoding for electricity price time series forecasting in deregulated markets. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010; pp. 1–8.
12. Pan, Z.; Zhang, M.; Wu, J.; Li, H. Multi-tones’ phase coding (mtpc) of interaural time difference by spiking neural network. arXiv 2020, arXiv:2007.03274.
13. Wu, Y.; Deng, L.; Li, G.; Zhu, J.; Xie, Y.; Shi, L. Direct training for spiking neural networks: Faster, larger, better. In Proceedings of the AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 8–12 October 2019; Volume 33, pp. 1311–1318.
14. Froemke, R.C.; Debanne, D.; Bi, G.Q. Temporal modulation of spike-timing-dependent plasticity. Front. Synaptic Neurosci. 2010, 2, 19.
15. Gütig, R.; Sompolinsky, H. The tempotron: A neuron that learns spike timing–based decisions. Nat. Neurosci. 2006, 9, 420–428.
16. Neftci, E.O.; Mostafa, H.; Zenke, F. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Process. Mag. 2019, 36, 51–63.
17. Wu, Y.; Deng, L.; Li, G.; Zhu, J.; Shi, L. Spatio-temporal backpropagation for training high-performance spiking neural networks. Front. Neurosci. 2018, 12, 331.
18. Cao, Y.; Chen, Y.; Khosla, D. Spiking deep convolutional neural networks for energy-efficient object recognition. Int. J. Comput. Vis. 2015, 113, 54–66.
19. Diehl, P.U.; Neil, D.; Binas, J.; Cook, M.; Liu, S.C.; Pfeiffer, M. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In Proceedings of the 2015 International joint conference on neural networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–8.
20. Rueckauer, B.; Lungu, I.A.; Hu, Y.; Pfeiffer, M.; Liu, S.C. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Front. Neurosci. 2017, 11, 682.
21. Hu, Y.; Tang, H.; Pan, G. Spiking Deep Residual Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021.
22. Kim, S.; Park, S.; Na, B.; Yoon, S. Spiking-yolo: Spiking neural network for energy-efficient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11270–11277.
23. Gerstner, W.; Kistler, W.M.; Naud, R.; Paninski, L. Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition; Cambridge University Press: Cambridge, UK, 2014.
24. Zang, Y.; Dieudonné, S.; De Schutter, E. Voltage-and branch-specific climbing fiber responses in Purkinje cells. Cell Rep. 2018, 24, 1536–1549.
25. Mun, S.; Fowler, J.E. DPCM for quantized block-based compressed sensing of images. In Proceedings of the 2012 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; pp. 1424–1428.
26. Adsumilli, C.B.; Mitra, S.K. Error concealment in video communications using DPCM bit stream embedding. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’05), Philadelphia, PA, USA, 18–23 March 2005; Volume 2, p. ii-169.
27. Yoon, Y.C. Lif and simplified srm neurons encode signals into spikes via a form of asynchronous pulse sigma-delta modulation. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 1192–1205.
28. O’Connor, P.; Gavves, E.; Welling, M. Training a spiking neural network with equilibrium propagation. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Japan, 16–18 April 2019; pp. 1516–1523.
29. O’Connor, P.; Gavves, E.; Welling, M. Temporally efficient deep learning with spikes. arXiv 2017, arXiv:1706.04159.
30. Yousefzadeh, A.; Hosseini, S.; Holanda, P.; Leroux, S.; Werner, T.; Serrano-Gotarredona, T.; Barranco, B.L.; Dhoedt, B.; Simoens, P. Conversion of synchronous artificial neural network to asynchronous spiking neural network using sigma-delta quantization. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu, Taiwan, 18–20 March 2019; pp. 81–85.
31. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375.
32. Lee, C.; Sarwar, S.S.; Panda, P.; Srinivasan, G.; Roy, K. Enabling spike-based backpropagation for training deep neural network architectures. Front. Neurosci. 2020, 14, 119.
33. Fang, W.; Chen, Y.; Ding, J.; Chen, D.; Yu, Z.; Zhou, H.; Masquelier, T.; Tian, Y.; Ismail, K.H.; Yolk, A.; et al. SpikingJelly. 2020. Available online: https://github.com/fangwei123456/spikingjelly (accessed on 10 November 2022).
34. Nakamura, S.; Hiyane, K.; Asano, F.; Nishiura, T.; Yamada, T. Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece, 31 May–2 June 2000.
35. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
36. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747.
37. Schörkhuber, C.; Klapuri, A. Constant-Q transform toolbox for music processing. In Proceedings of the 7th Sound and Music Computing Conference, Barcelona, Spain, 21–24 July 2010; pp. 3–64.
38. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference; Citeseer: University Park, PA, USA, 2015; Volume 8, pp. 18–25.
39. Wu, J.; Chua, Y.; Zhang, M.; Li, H.; Tan, K.C. A spiking neural network framework for robust sound classification. Front. Neurosci. 2018, 12, 836.
40. Yu, Q.; Yao, Y.; Wang, L.; Tang, H.; Dang, J. A multi-spike approach for robust sound recognition. In Proceedings of the ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 890–894.
41. Dennis, J.; Yu, Q.; Tang, H.; Tran, H.D.; Li, H. Temporal coding of local spectrogram features for robust sound recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 803–807.
42. Peterson, D.G.; Nawarathne, T.; Leung, H. Modulating STDP With Back-Propagated Error Signals to Train SNNs for Audio Classification. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 7, 89–101.
43. Yu, Q.; Yao, Y.; Wang, L.; Tang, H.; Dang, J.; Tan, K.C. Robust environmental sound recognition with sparse key-point encoding and efficient multispike learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 625–638.
44. Jin, Y.; Zhang, W.; Li, P. Hybrid macro/micro level backpropagation for training deep spiking neural networks. Adv. Neural Inf. Process. Syst. 2018, 31.
45. Zhang, W.; Li, P. Spike-train level backpropagation for training deep recurrent spiking neural networks. Adv. Neural Inf. Process. Syst. 2019, 32, 7800–7811.
46. Wu, H.; Zhang, Y.; Weng, W.; Zhang, Y.; Xiong, Z.; Zha, Z.J.; Sun, X.; Wu, F. Training spiking neural networks with accumulated spiking flow. In Proceedings of the AAAI conference on artificial intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 10320–10328.
47. Tang, J.; Lai, J.H.; Zheng, W.S.; Yang, L.; Xie, X. Relaxation LIF: A gradient-based spiking neuron for direct training deep spiking neural networks. Neurocomputing 2022, 501, 499–513.
48. Zhao, D.; Zeng, Y.; Zhang, T.; Shi, M.; Zhao, F. GLSNN: A multi-layer spiking neural network based on global feedback alignment and local STDP plasticity. Front. Comput. Neurosci. 2020, 14, 576841.
49. Cheng, X.; Hao, Y.; Xu, J.; Xu, B. LISNN: Improving spiking neural networks with lateral interactions for robust object recognition. In Proceedings of the IJCAI, Yokohama, Japan, 11–17 July 2020; pp. 1519–1525.
50. Mirsadeghi, M.; Shalchian, M.; Kheradpisheh, S.R.; Masquelier, T. Spike time displacement based error backpropagation in convolutional spiking neural networks. arXiv 2021, arXiv:2108.13621.

Figure 1. Information transfer and structure diagram of deep SNNs. If the input is a floating-point value, it is usually encoded as a spike train and injected into the network at different time steps. Spiking neurons accumulate the spikes from presynaptic neurons and fire when the membrane potential exceeds the threshold; otherwise, the output is zero.

Figure 2. The encoding process of the SDE. Because the integrator’s value $I(i, j)$ is not reset during encoding, each input value is affected by others. The return value of the 1-bit DAC is $\hat{x}(i, j) \in \{-1, 1\}$.

Figure 3. The input-output curve of different neurons. (A) When the input is higher than zero, the output is equal to the input, and the gradient is 1; otherwise, the output and gradient are both 0. (B) Spiking neurons fire only when the membrane potential exceeds the threshold, and the process is non-differentiable.

Figure 4. The computational process of IF neurons. The left side shows the existing bias addition mechanism, which adds the bias while the input spike train is being accumulated and triggers a spike when the membrane potential exceeds the threshold. The new bias addition mechanism adds the bias only when the membrane potential is reset, which ensures that the membrane potential does not increase in the absence of input.

Figure 5. We use the 'horn' sound to illustrate the Sigma-Delta audio encoding process. (A) The time-domain plot of the horn. (B) The CQT is employed to obtain the spectrogram. The sampling rate is 48 kHz, and the number of samples between consecutive frames is 128. (C) To reduce the scale of the spectrogram and the number of spiking neurons in the first layer of the fully connected SNN, the spectrogram of each subband is divided into a specified number of timeline segments, and the points in each segment are averaged. (D) The scaled spectrogram is encoded using the SDE; note that each SDE encodes only one row of data.

Figure 6. Training curves for different encoding numbers and types. (A) The effect of different encoding numbers on network training and performance. (B) The effect of different encoding methods.

Figure 7. Training curves of the proposed SNN model on the MNIST and Fashion-MNIST datasets when the ratio of training data to the training set is 0.5 and 0.1. (A) Although less training data is used for MNIST, the training curve remains flat. Even with a ratio of 0.1, the accuracy still reaches 98.41%. (B) The image content of the Fashion-MNIST dataset is comparatively complex. The training curve shows that the network tends to overfit when less training data is available.

Figure 8. (A) The average firing rate of the spiking layer and memory usage per epoch. (B) The training curves of our model and the original SNN. Our model not only achieves better final performance than the original SNN but also converges faster.

Table 1. Parameter settings for the RWCP dataset.

Network Parameter	Description	Value
T	Simulation time	20
$V_{th}$	Threshold of spiking neurons	3.0
b	Bias voltage	1.0
$N_{bins}$	Number of filters used in the CQT	20
$T_{seg}$	Number of timeline segments	40
$\lambda$	Learning rate	$8 \times 10^{-3}$

Table 2. Accuracy of different mapping methods under different encoding numbers.

Linear			Sine
Time Window	Clean	Average	Time Window	Clean	Average
N = 12	99.83%	66.46%	N = 12	99.53%	71.49%
N = 14	99.73%	67.23%	N = 14	99.53%	71.38%
N = 16	99.63%	66.73%	N = 16	99.68%	70.53%
N = 18	99.80%	66.12%	N = 18	99.38%	69.27%
N = 20	99.85%	66.42%	N = 20	99.65%	69.11%

Table 3. Comparison of performance with different models.

Methods	Accuracy	Methods	Accuracy
MFCC-HMM ([39])	99.00%	SPEC-DNN ([40])	100.00%
LSF-SNN ([41])	98.50%	SPEC-CNN ([40])	99.83%
SOM-SNN ([39])	99.60%	Peterson et al. ([42])	99.30%
KP-SNN ([43])	100.00%
This work	99.85%

Table 4. Performance evaluation of various models in mismatched conditions.

Methods	Clean	20 dB	10 dB	0 dB	–5 dB	Average
MFCC-HMM ([39])	99.00%	62.10%	34.40%	21.80%	19.50%	47.30%
SPEC-DNN ([40])	100.00%	94.38%	71.80%	42.68%	34.85%	68.74%
SOM-SNN ([39])	99.60%	79.15%	36.25%	26.50%	19.55%	52.21%
This Work (Sine)	99.53%	99.38%	90.50%	47.68%	20.38%	71.49%
This Work (Linear)	99.73%	98.70%	77.85%	41.68%	18.18%	67.23%

Table 5. Parameter settings for the MNIST and Fashion-MNIST datasets.

Network Parameter	Description	Value
$V_{th}$	Threshold of spiking neurons	1.0
epochs	Training epochs	200
batch size	Number of samples per training batch	100
$T_{max}$	One quarter of the learning rate decay period	100
$lr_{ini}$	Initial value of the learning rate	$5 \times 10^{-2}$
$lr_{min}$	Minimum value of the learning rate	$1 \times 10^{-4}$

Table 6. Network performance under different encoding numbers.

Time Window	Accuracy	Time Window	Accuracy
8	99.32%	14	99.60%
10	99.35%	16	99.46%
12	99.37%	18	99.51%

Table 7. Comparison with similar-architecture SNNs on MNIST dataset.

Model	Methods	Network Structure	Time Window	Accuracy
STBP [17]	Spike-based BP	15C5-P2-40C5-P2-FC300-FC10	30	99.42%
HM2-BP [44]	Spike-based BP	15C5-P2-40C5-P2-FC300-FC10	400	99.49%
ST-RSBP [45]	Spike-based BP	15C5-P2-40C5-P2-FC300-FC10	400	99.62%
Lee et al. [32]	Spike-based BP	20C5-P2-50C5-P2-FC200-FC10	120	99.59%
ASF-BP [46]	Spike-based BP	20C5-P2-50C5-P2-FC200-FC10	300	99.65%
Relaxation LIF [47]	Spike-based BP	15C5-P2-40C5-P2-FC300-FC10	10	99.53%
This work (without SDE)	Spike-based BP	15C5-P2-40C5-P2-FC300-FC10	14	99.52%
This work (with SDE)	Spike-based BP	15C5-P2-40C5-P2-FC300-FC10	14	99.60%
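
To make the structure notation in Table 7 concrete, the sketch below traces the spatial size of a 28 × 28 input through 15C5-P2-40C5-P2 and computes the number of features that reach FC300. Unpadded stride-1 convolutions and 2 × 2 pooling with stride 2 are assumptions of this sketch, since padding and stride are not stated here; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(input_size=28):
    """Trace 15C5-P2-40C5-P2 on a square input and return the flattened size."""
    s = conv_out(input_size, kernel=5)       # 15C5: 28 -> 24
    s = conv_out(s, kernel=2, stride=2)      # P2:   24 -> 12
    s = conv_out(s, kernel=5)                # 40C5: 12 -> 8
    s = conv_out(s, kernel=2, stride=2)      # P2:    8 -> 4
    return 40 * s * s                        # 40 channels feed FC300


print(flattened_features())
```

Under these assumptions, the first fully connected layer receives 640 inputs.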

Table 8. Comparison with other SNN models on the Fashion-MNIST dataset.

Model	Network Structure	Time Window	Accuracy
ST-RSBP [45]	400-R400	400	90.13%
GLSNN [48]	$256 \times 8$	10	89.02%
LISNN [49]	32C3-P2-32C3-P2-FC128-FC10	20	92.07%
STiDi-BP [50]	20C5-P2-40C5-P2-1000-10	100	92.80%
This work	15C5-P2-40C5-P2-FC300-FC10	20	90.26%
This work	15C5-P2-40C5-P2-FC300-FC10	100	91.71%

Table 9. The performance of the proposed SNNs under different mapping methods with different training set proportions.

Methods	Time Window	Ratio	Clean	20 dB	10 dB	0 dB	−5 dB	Average
Linear	14	0.9	99.70%	98.88%	80.38%	40.88%	20.00%	68.00%
Linear	14	0.1	98.97%	97.35%	76.07%	43.55%	18.75%	66.90%
Sine	12	0.9	99.75%	99.625%	92.125%	50.50%	24.00%	73.20%
Sine	12	0.1	97.292%	97.318%	88.916%	47.21%	21.915%	70.53%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
