Article

An HMM-DNN-Based System for the Detection and Classification of Low-Frequency Acoustic Signals from Baleen Whales, Earthquakes, and Air Guns off Chile

1 Center for Oceanographic Research COPAS Sur-Austral and COPAS COASTAL, Universidad de Concepción, Casilla 160-C, Concepción 4070043, Chile
2 Centro de Estudios Avanzados en Zonas Áridas (CEAZA), Coquimbo 1780000, Chile
3 Speech Processing and Transmission Laboratory, Electrical Engineering Department, University of Chile, Santiago 8330111, Chile
4 Departamento de Ingeniería Eléctrica, Universidad de Concepción, Concepción 4070043, Chile
5 Marine Mammal Institute, Oregon State University, Newport, OR 97365, USA
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(10), 2554; https://doi.org/10.3390/rs15102554
Submission received: 29 December 2022 / Revised: 21 March 2023 / Accepted: 31 March 2023 / Published: 13 May 2023
(This article belongs to the Section Ocean Remote Sensing)

Abstract

Marine passive acoustic monitoring can be used to study biological, geophysical, and anthropogenic phenomena in the ocean. The wide range of characteristics of geophysical, biological, and anthropogenic sound sources makes the simultaneous automatic detection and classification of these sounds a significant challenge. Here, we propose a single Hidden Markov Model-based system with a Deep Neural Network (HMM-DNN) for the detection and classification of low-frequency biological (baleen whales), geophysical (earthquakes), and anthropogenic (air guns) sounds. Acoustic data were obtained from the Preparatory Commission for the Comprehensive Nuclear-Test-Ban Treaty Organization station off Juan Fernandez, Chile (station HA03) and annotated by an analyst (498 h of audio data containing 30,873 events from 19 different classes), and then divided into training (60%), testing (20%), and tuning (20%) subsets. Each audio frame was represented as an observation vector obtained through a filterbank-based spectral feature extraction procedure. The HMM-DNN training procedure was carried out discriminatively by setting HMM states as targets. A model with Gaussian Mixture Models and HMM (HMM-GMM) was trained to obtain an initial set of HMM target states. Feature transformation based on Linear Discriminant Analysis and Maximum Likelihood Linear Transform was also incorporated. The HMM-DNN system displayed good capacity for correctly detecting and classifying events, with high event-level accuracy (84.46%), high weighted average sensitivity (84.46%), and high weighted average precision (89.54%). Event-level accuracy increased with higher event signal-to-noise ratios. Event-level metrics per class also showed that our HMM-DNN system generalized well for most classes, but performance was best for classes that had a high number of training exemplars (generally above 50) and/or signals with low variability in spectral features, duration, and energy levels. Fin whale song, Antarctic blue whale song, and air guns performed particularly well.

1. Introduction

Ocean observation is fundamental to national and international decision making regarding marine conservation and resource management. International observation networks such as the Global Ocean Observing System (https://www.goosocean.org/, accessed on 1 March 2023) and the International Monitoring System (IMS) of the Preparatory Commission for the Comprehensive Nuclear-Test-Ban Treaty Organization (CTBTO, Vienna, Austria, www.ctbto.org, accessed on 1 March 2023) include the collection of passive acoustic data (see review by Van Parijs et al. [1]). Marine passive acoustic monitoring (PAM) is a method of remote ocean observation and can be used to study geophysical phenomena, i.e., geophony (underwater earthquakes, volcanoes, landslides, glacier and iceberg dynamics) [2,3,4,5,6,7,8,9,10]. It can also be used to examine anthropogenic noise sources, i.e., anthropophony, (ship traffic, resource exploration, and extraction) [11,12]. Finally, PAM is particularly well suited to monitoring biological sounds, i.e., biophony, (e.g., marine mammals, fish) [13,14,15,16,17,18,19,20,21]. Many of these phenomena produce loud (high-amplitude) sounds at low-frequencies (0–1000 Hz) that can propagate over large distances (100 s of km) in the ocean.
Effective monitoring requires the capacity to detect these acoustic signals in PAM datasets to determine when, where, and what type of events occur within the dataset. Since PAM datasets are typically large, automatization of analyses is critical for reliable and time-sensitive results. However, currently, there is no single analytical system for the automatic detection and classification of all the different sound sources in the ocean, or even for the variety of low-frequency sound sources. The wide range of different characteristics of geophysical, biological, and anthropogenic signals makes the simultaneous automatic detection and classification of these sounds a significant challenge. Detection and classification systems should be time-efficient concerning both computation and human supervision. Globally, the advancement of automatic detection and classification algorithms is a growth area in the field of underwater acoustics and computer science, as evidenced by the recent growth in the number of publications addressing these topics, such as [22,23,24,25,26,27,28,29,30].
Machine and deep learning methods have gained significant interest in recent years for analyzing large datasets generated by PAM. These methods have been applied to automatically detect and classify acoustic signals with excellent results, with little or no human intervention (see review by Bianco et al. [31]). Machine learning methods include decision trees [32] and support vector machines [33], while deep learning methods have employed Convolutional Neural Networks (CNN) [23,34,35]; Recurrent Neural Networks (RNN) [36]; Convolutional Recurrent Neural Network (CRNN) [37]; classic multilayer perceptron neural networks [38]; and a combination of CNN and Long-Short Term Memory (LSTM) [39]. Although they generate good results, they most often target a small number of classes, and, in some cases, only a single class. Additionally, some of them fail to integrate detection and classification in a single framework, requiring manual segmentation of signals prior to feeding them to classifiers.
Hidden Markov Models (HMMs) are an excellent solution for detection and classification of multiple classes and where variability is high within a class, as is the case of the wide variety of sources in the ocean and where signals from the same source can be highly variable. HMMs can model time-varying signals as a time series of pseudo stationary signals, assigning them to states that cluster their statistical properties, thus capturing high variability within a given class [40,41]. A thorough review of methods for automatic detection and classification can be found in [22]. HMM-based systems have been successfully applied to detect and classify several animal sounds, including two whale species, e.g., Bryde’s whale [42,43] and humpback whales [24], Hainan gibbons [37], Meagre (a species of fish) [30], and pig cough sounds [44]. Since it is possible to concatenate HMMs to model an arbitrary number of classes, HMMs are also suitable for multi-class classification, using one model for each targeted class [45]. In [45], the signals of 11 whale species were targeted for detection and classification, exploring different Mel-Frequency Cepstral Coefficients (MFCC) and HMM configurations, namely, the number of cepstral coefficients and number of hidden states for each model. HMMs have seldom been used for anthropogenic and geophysical signals in the ocean. Only [46] used HMMs to detect and classify small boats and large marine vessels. For seismic air guns, Kyhn et al. [47] used traditional energy sum detectors. For automatic earthquake detection, numerous methods have been used, including data processing [48,49,50,51] and statistical techniques [3,52,53]. Although we have not found extensive use of HMMs for hydroacoustic seismic events, they have been widely used for land-based seismicity, such as those linked specifically to volcanic activity [54,55,56] and tectonic-related earthquakes [57,58,59].
In this paper, we tackled the significant challenge of integrating the automatic detection and classification of several classes from biological (baleen whales), geophysical (earthquakes), and anthropogenic (air guns) sound sources in a single framework, based on HMM and Deep Neural Networks (DNN), a system that we called the HMM-DNN system. To the best of our knowledge, this is the first attempt to automatically detect and classify biological, geophysical, and anthropogenic signals with a single system. The HMM-DNN system was trained with a PAM dataset, which is a discrete time series for which HMM-based approaches work particularly well, because each discrete unit of the series (a short piece of audio) can be assigned to an HMM state, clustering statistically relevant information of that unit. For each step, there is a state transition, which makes it possible to capture the evolution and temporal dependencies of the entire time series, making this method particularly suitable for modeling PAM data. Besides the excellent fit of HMM-based approaches for time series, combining HMM with DNNs for acoustic modeling requires fewer trainable parameters than other deep learning approaches, such as the use of LSTMs without HMMs, as the temporal dependence is not captured only with a DNN but is inherently present within the HMM framework. Other deep learning approaches usually require more layers or neurons to model the entire temporal dependence. Here, the authors' previous experience, especially with time series, suggests that using HMMs drastically reduces the number of trainable parameters. This makes the system perform better when data availability is a problem, which can occur in the ocean environment, e.g., for rare species of baleen whale, which are naturally scarce even with continuous PAM [24,25,26,27].
In this study, we aim to develop a detection and classification system based on HMM-DNN through a process called decoding, in which the input to the trained system is an audio file. The expected output is a sequence of events present in the audio file, along with the beginning and end of each event, which has very powerful potential applications including monitoring and decision making. We also present a comparison with more classical acoustic modeling based on Gaussian Mixture Models (GMM), testing the differences in performance when training only with the most numerous classes in our dataset. Our dataset contains a lot of noise, which is common in an ocean environment [60,61,62,63], and is evidenced by the low Signal-to-Noise Ratio (SNR) levels of events. In our case, roughly 40% of the events had SNR levels below 10 dB, and only 10% of the events had SNR levels of 20 dB or higher. Such a noisy environment is a significant challenge for automatic classifiers. We tested the robustness of our system by evaluating its classification accuracy against different levels of noise measured in dB.

2. Materials and Methods

2.1. Data Collection

Passive acoustic monitoring data that were continuously collected at a sample rate of 250 Hz at the HA03 station off the Juan Fernandez Archipelago (Figure 1) between 2016 and 2017 were obtained from the Preparatory Commission for the Comprehensive Nuclear Test Ban Treaty Organization (CTBTO). Data from a single hydrophone, the North Station Node 1, at S 33°27′28.8″, W 78°56′2.76″, were used for analysis. The hydrophone is deployed at a depth of 813 m in a total water column depth of 1538 m.

2.2. Methodological Approach

The raw data were 4-h long wav files that were divided into 150 s long pieces of audio called audio units. These audio units were used to train the HMM-DNN system under a supervised approach, in which parameters were estimated from the audio units. We used data augmentation to enlarge the dataset used to train the HMM-DNN system. The technique we used was based on speed perturbation that slows down and speeds up the original audios by factors of 0.9 and 1.1, respectively. This technique is common in voice recognition, especially in Kaldi, and is well described in [64].
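As an illustration, a minimal sketch of speed perturbation by resampling, assuming a single-channel waveform loaded as a NumPy array (the file name and interpolation-based resampling are illustrative; the published system relies on the standard Kaldi/sox recipe described in [64]):

```python
import numpy as np
from scipy.io import wavfile

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Speed up (factor > 1) or slow down (factor < 1) a waveform by resampling
    along the time axis; both duration and pitch change, as in sox-style speed
    perturbation."""
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal that map to each output sample.
    src_positions = np.arange(n_out) * factor
    return np.interp(src_positions, np.arange(len(waveform)), waveform)

# Hypothetical file name: create the 0.9x and 1.1x copies used for augmentation.
rate, audio = wavfile.read("audio_unit.wav")
augmented = {f: speed_perturb(audio.astype(np.float64), f) for f in (0.9, 1.1)}
```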
The training requires the sequence of events present in each audio unit, a sequence that is called the transcription. We define an event as the single occurrence of any example of the classes described in Table 1, i.e., a single whale call, a single earthquake, or a single air gun occurrence. The annotated acoustic dataset, i.e., audio units with their transcription, was curated by an expert acoustic analyst who manually annotated the acoustic data to obtain an annotated (or labeled) dataset of target events of baleen whale calls, air guns, and earthquakes (Table 1; Section 2.3), and the SNR of each event was estimated (Section 2.4). From all annotated events, we only preserved those with at least 70 exemplars (N ≥ 70) of a specific sound type (i.e., each target signal needed to be represented by at least 70 occurrences). If fewer than 70 occurrences of one type of signal were found, we removed the audio units containing those signals, regardless of whether they contained other types of signals. The annotated acoustic dataset was divided into three sets called train (60%), test (20%), and dev (20%), which were used for the training, testing, and tuning, respectively, of our proposed HMM-DNN system (Section 2.6).
The HMM training works by assigning a small piece of audio, called a frame, to a state. In this work, we used three-state HMMs as a means of representing the beginning, middle, and ending of an event. Empirically, we found that 150 s long audio units containing 600 frames, each 0.5 s long with 50% overlap, yielded fewer classification errors (see Section 2.5). Each frame was represented by an observation vector obtained through a filterbank-based spectral feature extraction procedure (Section 2.5).
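A minimal sketch of how a 150 s audio unit could be segmented into 0.5 s frames with 50% overlap (the 250 Hz sample rate comes from Section 2.1; the rounding of the hop size is an assumption):

```python
import numpy as np

def frame_signal(x: np.ndarray, rate: int = 250,
                 frame_len_s: float = 0.5, overlap: float = 0.5) -> np.ndarray:
    """Return an array of shape (n_frames, frame_len) of overlapping frames."""
    frame_len = int(round(frame_len_s * rate))             # 125 samples at 250 Hz
    hop = max(1, int(round(frame_len * (1.0 - overlap))))  # ~62 samples (50% overlap)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

# A 150 s unit at 250 Hz (37,500 samples) yields roughly 600 half-second frames.
frames = frame_signal(np.zeros(150 * 250))
```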
The idea during training is to assign the sequence of frames to the correct sequence of states, which in turn generates the correct transcription. Thus, each HMM state is the target. The process of assigning each frame to one HMM state is called the alignment process. Initially, we did not know to which state each frame should be assigned. Consequently, we generated an alignment with an HMM system that used GMMs for acoustic modeling. We refer to this system as the HMM-GMM system. The HMM-GMM system assigns mixtures of multivariate gaussians, to act as probability density estimators, to each state. Initially, the frames are assigned uniformly to each state, using the transcription to determine which HMM should be used. Then, the HMM parameters are updated in an iterative manner. A more detailed treatment on how the HMM-GMM system is trained can be found in [28].
We incorporated a feature transformation based on Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) into our HMM-GMM, because we found empirically that these transformations yielded better results than using just the HMM-GMM approach, ultimately producing better alignments for our HMM-DNN system. To measure general performance at the event-level, we used the metrics of Word Error Rate (WER) and accuracy (%); and to evaluate the performance of each type of event (class), we report the standard Machine Learning metrics of precision, sensitivity, and F1-score (Section 2.10). Additionally, we report macro average (the simple average of a metric for all classes) and the weighted average, which considers the relative contribution of each class, with the more numerous events being the ones that more heavily affect each metric.
Once the training of the HMM-GMM or HMM-DNN system is completed, one can feed the system new data to perform detection and classification, which, in the context of HMMs, is called decoding. The decoding process is performed using the well-established Viterbi algorithm [65]. The product of decoding is a sequence of events per audio unit (a transcription). Since the system assigns each frame to one HMM state during decoding, a by-product of the process is also the beginning and ending of each event.
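For illustration, a compact NumPy sketch of the Viterbi recursion over log probabilities, as a toy stand-in for the Kaldi decoder (how emission scores are obtained and the absence of an explicit HMM network are simplifications):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """log_pi: (S,) initial log probs; log_A: (S, S) transition log probs;
    log_B: (T, S) per-frame emission log probs (e.g., from the acoustic model).
    Returns the most likely state sequence of length T."""
    T, S = log_B.shape
    delta = np.full((T, S), -np.inf)
    backptr = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A      # (S, S): previous -> current
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```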
We tested the accuracy of the system as a function of SNR thresholds between 0 dB and 20 dB (see Section 2.4, which details the SNR computation). Since the HMM-GMM is a detection and classification system itself, we also used it for comparison with our proposed HMM-DNN system; we present the results of the HMM-GMM system in Section 3.3. The following sections detail all the steps involved, namely, acoustic data annotation, feature extraction, performance metrics definition, and the architecture of the HMM-DNN system.

2.3. Acoustic Data Annotation

Acoustic data from 2016 and 2017 were viewed as spectrograms (512-point FFT for signals > 3 s, 128-point for signals < 3 s, 85% overlap, Hamming window) using the software Raven Pro 1.5. (Bioacoustics Research Program 2011). A quick visual scan of the two years showed that most of the events occurred between the months of June to August. Thus, a subset of 498 h was selected, comprising the months of June to August 2016 and 2017, by an experienced analyst who manually annotated (detected and classified) all target events per class, as listed in Table 1 (total events = 30,873; total classes = 19). Target events were selected on the spectrogram, and the beginning and end times were annotated using the Raven Pro 1.5 selection table tool.
Table 1. Event class definition, reference, label, and number of events for train, test, and dev sets.
Event Class | Reference | Class Label | N Events Train Set | N Frames Train Set | N Events Test Set | N Frames Test Set | N Events Dev Set | N Frames Dev Set
Possible fin whale 13 Hz call | - | 13H | 82 | 1605 | 10 | 198 | 25 | 497
Antarctic blue whale | [13,66] | AA | 1403 | 47,762 | 386 | 13,589 | 437 | 15,199
Antarctic blue whale overlapped with Fin Whale Song | - | AAFWS | 112 | 575 | 21 | 106 | 37 | 191
Southeast Pacific blue whale song 2 (SEP2) unit A | [15,67] | S21 | 120 | 5802 | 28 | 1488 | 25 | 1218
Southeast Pacific blue whale song 2 (SEP2) unit B | [15,67] | S22 | 51 | 2496 | 17 | 862 | 15 | 697
Southeast Pacific blue whale song 2 (SEP2) units C and D | [15,67] | S23 | 812 | 26,688 | 270 | 8852 | 257 | 8309
Southeast Pacific blue whale song 1 (SEP1) unit C | [67,68,69,70] | S13 | 39 | 1465 | 22 | 790 | 10 | 347
Southeast Pacific blue whale song, Undefined Unit | - | SEP | 66 | 2583 | 25 | 878 | 24 | 943
Fin Whale, 20 Hz Song | [71,72] | FWS | 13,381 | 72,106 | 4671 | 25,199 | 4413 | 24,017
Fin whale Downsweep Type 1 | [71,72] | FWD | 691 | 2983 | 254 | 1079 | 242 | 1068
Fin whale Downsweep Type 2 | [71,72] | FWD2 | 399 | 1872 | 104 | 499 | 121 | 496
Fin whale Downsweep Type 3 | [73,74] | FWD3 | 267 | 1480 | 62 | 333 | 104 | 565
Sei whale Upsweep | [66] | SWU | 227 | 1371 | 44 | 260 | 64 | 369
Sei whale Downsweep | [73,74] | SWD | 118 | 694 | 31 | 186 | 40 | 233
Minke whale Pulse Trains | [75,76,77] | MI | 103 | 1240 | 30 | 385 | 19 | 244
Undefined biological sound | - | UND | 69 | 715 | 27 | 262 | 19 | 158
Earthquake | [2,5,50,78,79,80] | ERQ | 161 | 22,002 | 55 | 7269 | 57 | 7687
Unidentified Ambient Noise | - | AN | 115 | 6363 | 36 | 1924 | 62 | 3845
Seismic air gun | [20] | AG | 325 | 6192 | 103 | 2078 | 124 | 2462

2.4. Signal-to-Noise Ratio Computation

For each annotated event, the SNR was calculated by using the Inband Power (dB) measurement of each signal and its adjacent noise in the selection table according to the protocol detailed in https://ravensoundsoftware.com/knowledge-base/signal-to-noise-ratio-snr/ (accessed on 1 March 2023). The Inband Power is the integral of the average power spectral density over the frequency band of interest [81]. To calculate the SNR, the measurements of Inband Power were first converted into linear units (u) using
y = 10^(x/10)
where x is Inband Power (dB) and y is Inband Power (Units). Then, the SNR in linear units (SNR(u)) was calculated by using
SNR(u) = (y_signal − y_noise) / y_noise
where y_signal is Inband Power (Units) for the signal of interest and y_noise is Inband Power (Units) for the adjacent noise selection. Finally, we reconverted the SNR in linear units to SNR in dB, using
SNR(dB) = 10 · log10(SNR(u))
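A minimal sketch of this SNR computation (function and variable names are illustrative):

```python
import numpy as np

def snr_db(inband_power_signal_db: float, inband_power_noise_db: float) -> float:
    """Convert Inband Power (dB) to linear units, form the SNR, and return it in dB."""
    y_signal = 10.0 ** (inband_power_signal_db / 10.0)   # dB -> linear units
    y_noise = 10.0 ** (inband_power_noise_db / 10.0)
    snr_linear = (y_signal - y_noise) / y_noise           # SNR in linear units
    return 10.0 * np.log10(snr_linear)                    # back to dB

# Example: a 96 dB signal selection next to an 86 dB noise selection gives ~9.5 dB SNR.
print(round(snr_db(96.0, 86.0), 1))
```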

2.5. Feature Extraction

Audio units consisted of a 150 s long segment of data, from which we extracted 50% overlapped frames of 0.5 s of duration. We extracted a set of features from each frame that formed an observation vector O_t. We used a filterbank-based feature extraction process to capture the spectral characteristics of each frame. For the HMM-GMM system, we used a data-driven transformation based on LDA and MLLT to make the features as separable as possible, which greatly benefits modeling with GMMs. For our HMM-DNN system, we concatenated adjacent frames to a central frame, called context frames. The following sections detail each stage of the feature extraction process.

2.5.1. Filterbank Feature Extraction

Mel-Frequency Cepstral Coefficients (MFCC) is the standard feature extraction process used by the speech recognition community [82]. However, MFCCs are highly engineered towards extracting meaningful information where the source is the human voice, whereas our database contains signals produced by non-human biotic and abiotic sources. Thus, we designed a filterbank-based spectral feature extraction process that does not rely on the Mel-scale, which is a logarithmic transformation to adapt the sound’s frequency to mimic how humans hear (see [83] for a definition of the Mel scale), as there is no reason to assume that this scale works well for non-human signals. The spectral feature extraction performed with a filterbank was the first step in our feature extraction process.
The filterbank-based spectral feature extraction is depicted in Figure 2. The pipeline starts with a Hamming window, followed by the computation of power spectral density (PSD). The latter was defined as the square magnitude of the Fourier Transform, which was computed using the Fast Fourier Transform (FFT) algorithm with 128 points. Features from the PSD were extracted using a triangular filterbank, shown in Figure 3. Each filter F from the filterbank has a 50% overlap with the previous filter, and was designed to have the same bandwidth B. We found empirically that the best choice was 50 equally spaced filters, each having B = 3, meaning that each filter operates on 3 FFT samples. To account for the assumption of pseudo-stationary frames, we appended the so-called velocity and acceleration coefficients, or delta and delta-delta features that attempt to capture the dependence of future and previous frames. Delta features were computed using [41],
Δ_t = Σ_{i=1}^{5} (O_{t+i} − O_{t−i}) / 2
Acceleration coefficients were derived in the same manner but operated on differences of the delta coefficients. Concatenation of the above-computed coefficients, along with the original observation vector, plus the energy of the frame, formed a preliminary feature vector for frame at time t, which was then transformed using LDA and MLLT. Mean and variance normalization (MVN) was also included as it has been shown to reduce sensitivity to channel variation and additive noise [41].
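A rough sketch of the filterbank stage and the delta computation under the settings given above (128-point FFT, 50 triangular filters of bandwidth 3 FFT bins); filter placement, normalization, and edge handling are assumptions rather than the exact pipeline:

```python
import numpy as np

def triangular_filterbank(n_fft: int = 128, n_filters: int = 50, width_bins: int = 3) -> np.ndarray:
    """Equally spaced triangular filters over the FFT bins (a sketch; exact placement
    and normalization in the paper's pipeline may differ)."""
    n_bins = n_fft // 2 + 1
    centers = np.linspace(width_bins // 2, n_bins - 1 - width_bins // 2, n_filters)
    bins = np.arange(n_bins)
    # Each filter is a triangle of half-width width_bins / 2 around its center.
    fb = np.maximum(0.0, 1.0 - np.abs(bins[None, :] - centers[:, None]) / (width_bins / 2))
    return fb  # shape (n_filters, n_bins)

def frame_features(frame: np.ndarray, fb: np.ndarray) -> np.ndarray:
    """Hamming window -> power spectral density -> log filterbank energies."""
    windowed = frame * np.hamming(len(frame))
    psd = np.abs(np.fft.rfft(windowed, n=128)) ** 2
    return np.log(fb @ psd + 1e-10)

def deltas(feats: np.ndarray, n: int = 5) -> np.ndarray:
    """Delta features: sum of (O_{t+i} - O_{t-i}) / 2 for i = 1..n, per the equation above."""
    T = len(feats)
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    out = np.zeros_like(feats)
    for i in range(1, n + 1):
        out += (padded[n + i:n + i + T] - padded[n - i:n - i + T]) / 2.0
    return out
```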

2.5.2. Linear Discriminant Analysis and Maximum Likelihood Linear Transform

A widely adopted feature transformation is the combination of LDA and MLLT, where the objective is to make intra-class variance as low as possible while maximizing the inter-class variance, so that each class or state is as separable as possible, facilitating the modeling with mixtures of Gaussians. In this study, we considered a central frame at time t, to which 6 context frames at times t − 1, t − 2, t − 3, t + 1, t + 2, and t + 3 were concatenated. The filterbank feature dimensionality of these 7 concatenated frames was then reduced to 40 using LDA, which was followed by a diagonalizing transform known as MLLT. This transform was estimated three times during training, keeping the last computed transform. The reason it was re-estimated during training is that MLLT uses the state assigned to each frame for its computation (supervised training); thus, as training progresses, a more accurate assignment of frames to states is expected. A more detailed treatment of this technique can be found in [41,84], while a more implementation-oriented explanation is provided in the Kaldi scripts.
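A sketch of the LDA step using scikit-learn, assuming per-frame filterbank features and their aligned HMM-state labels are available as arrays; the MLLT diagonalizing transform is Kaldi-specific and omitted here:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def splice_frames(feats: np.ndarray, context: int = 3) -> np.ndarray:
    """Concatenate each frame with +/- `context` neighbours (7 frames total here)."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Placeholder features and alignment; in practice these come from the pipeline.
feats = np.random.randn(1000, 53)        # (T, D) per-frame features
state_labels = np.arange(1000) % 57      # (T,) aligned HMM-state ids (illustrative)
spliced = splice_frames(feats)           # (T, 7 * D)
lda = LinearDiscriminantAnalysis(n_components=40).fit(spliced, state_labels)
reduced = lda.transform(spliced)         # (T, 40), as in the text
```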

2.6. Hidden Markov Model and Deep Neural Network Architecture

Hidden Markov Models are a mathematical and stochastic framework to model time-varying signals as a sequence of pseudo-stationary signals by assigning small pieces (frames) to a set of states S_1, S_2, …, S_N [40]. The frame is a fixed-length part of the audio from which we extracted features. Thus, each frame was represented by a set of acoustic-spectral features that form the observation vector O_t, as explained in Section 2.5. At each time step, the HMM experiences a change of state (it could be a self-loop to the same state, see Figure 4) according to some probabilistic function. The underlying process of the HMM can be thought of as each state generating an observation vector once the state is entered. The HMM can be parametrically described as λ = (A, B, π), where A represents the probability of transitioning to the next state (thus called the transition probability matrix). B encodes the probability that, in state S_i at time t, the observation vector O_t is emitted; B is the emission probability matrix. π contains the initial probabilities for each state, that is, the probability that the first frame of an event is assigned to any of the three states of a given HMM. In our system, the first frame was always assigned to the first state, i.e., π = (1, 0, 0). An outstanding introduction and tutorial to HMMs can be found in [40].
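To make the parameterization concrete, a toy three-state left-to-right HMM (the numerical values are illustrative only):

```python
import numpy as np

# Initial state distribution: an event always starts in the first state.
pi = np.array([1.0, 0.0, 0.0])

# Left-to-right transition matrix A: each state can self-loop or move to the
# next state, and no state can be revisited once left (see Figure 4).
A = np.array([
    [0.7, 0.3, 0.0],   # beginning of the event
    [0.0, 0.8, 0.2],   # middle of the event
    [0.0, 0.0, 1.0],   # end of the event (exit handled by the decoder)
])

# B is the emission model P(O_t | S_i, lambda): a GMM or DNN per state (Section 2.6.1).
```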
Our base system used a mixture of Gaussians (hence called the HMM-GMM system) to model the emission probability P(O_t | S_i, λ). The proposed HMM-DNN system replaced the GMM with a DNN to model the observation probabilities, i.e., we used a DNN as the acoustic model and probability density estimator. A detailed explanation of our proposed HMM-DNN system can be found in Section 2.6.1.
We implemented a three-state left-to-right HMM (see Figure 4) to model all events with more than 70 exemplars from our database, concatenating all the HMM models to form an HMM network, whose components were individual models of each event, including noise. We consider noise to be the absence of an event, or background noise in an event frame. The parameters represented by the transition and emission matrices A and B were trained using the Kaldi Speech Recognition toolkit [85]. In the first step of the training procedure, the frames were uniformly assigned to each of the three states of their corresponding HMM. Then, the mean and covariance were calculated. This was followed by an iterative process of updating the mean and variances of each gaussian under an expectation maximization procedure, using the forward-backward algorithm [86]. Then, a realignment was performed, potentially reassigning frames to a more suitable state. The process of recomputing parameters was performed 40 times, and the realignment occurred at iterations 1–10, 12, 14, 16, 18, 20, 23, 26, 29, 32, 35, 38. These choices are the default configuration in Kaldi, which have been previously shown to provide better results [85].

2.6.1. Deep Neural Network for Acoustic Modelling

Mixtures of Gaussians were used for the acoustic modelling that mapped from input features to HMM states in a probabilistic way. It has been shown that this role can be performed by DNNs, which can even outperform GMMs [87]. We implemented a hybrid HMM-DNN system that took as input a central frame, t, plus 18 context frames, ranging from t − 9 to t + 9. A transformation based on LDA [88] was then applied to decorrelate the concatenated features as much as possible. Then, the DNN was discriminatively trained using the HMM states as targets (Figure 5). These targets were generated from an alignment produced by the trained HMM-GMM system with the lowest WER (see Section 2.10 for performance metrics). We trained the HMM-DNN system using the most recent and recommended approach with the Kaldi toolkit [89,90], which used back-propagation to update the weights associated with each neuron-to-neuron connection, and cross-entropy [91] as the objective function. The cross-entropy objective function penalizes the system when it misclassifies a frame; that is, when it assigns the frame to a state different from the state assigned in the base HMM-GMM system. As training progresses, new realignment processes occur, potentially reassigning the frames to different states.
Once the system is trained, it is possible to perform detection and classification in the decoding step. During decoding, the Viterbi algorithm [65] is used to find the sequence of states most likely to have generated the given input. Thus, we obtained a transcription (sequence of events) present in each audio unit, along with a classification (HMM states) of all the frames of said audio unit. Therefore, not only a sequence of events is produced, but also beginnings and endings of each detected event.
For the DNN architecture, we used 4 hidden layers, each with 2048 units and a Rectified Linear Unit (ReLU) [92] activation function. For classification, we used a softmax layer [93], which assigns a probability to each class, i.e., a probability to each HMM state. The system’s architecture is depicted in Figure 6. Although we tested using more hidden units and hidden layers, they did not provide significant improvement in performance in terms of WER (see Section 2.10 for performance metrics detail), and, in some cases, performed worse. We used the default hyperparameters provided by the standard DNN training script provided by the Kaldi toolkit. For input features, we tested three different sets of features including filterbank features, filterbank features with delta and delta-delta coefficients, and transformed features through the LDA and MLLT transformation found for the HMM-GMM system. The system configuration with the best performance made use of input features computed with a filterbank and added delta and delta-delta coefficients.
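A sketch of the DNN acoustic model in PyTorch following the architecture described above (4 hidden layers of 2048 ReLU units, softmax over HMM states via the cross-entropy loss); the per-frame feature dimensionality and the number of target states are assumptions:

```python
import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    """Maps a spliced feature vector (central frame + 18 context frames)
    to posterior scores over HMM states."""
    def __init__(self, feat_dim: int = 150, n_context: int = 18, n_states: int = 57):
        super().__init__()
        in_dim = feat_dim * (n_context + 1)
        layers = []
        for _ in range(4):                           # 4 hidden layers of 2048 ReLU units
            layers += [nn.Linear(in_dim, 2048), nn.ReLU()]
            in_dim = 2048
        layers.append(nn.Linear(in_dim, n_states))   # softmax is applied inside the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = AcousticDNN()
loss_fn = nn.CrossEntropyLoss()   # cross-entropy against aligned HMM-state targets
logits = model(torch.randn(8, 150 * 19))
loss = loss_fn(logits, torch.randint(0, 57, (8,)))
```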

2.6.2. Hyperparameter Tuning and Feature Selection

Our HMM-DNN system had atypical features and various DNN configuration choices. We started with a basic DNN with three layers, 1024 hidden units, and a ReLU activation function to find the best-performing set of features, hyperparameters, and configurations. To decide on the best set of features and configurations, we used the WER as the performance metric (see Section 2.10, which explains performance metrics) applied to the dev set. First, we maintained said configuration while testing different sets of features. We chose filterbank features and then varied the number of filterbanks. We then evaluated adding MVN, delta and delta-delta features, frame energy, and the feature transformation with LDA and MLLT. Finally, we tested the number of context frames, trying 3, 6, 9, 12, 15, 18, 23, and 30 context frames.

2.7. Comparison with the HMM-GMM System

GMMs have produced excellent results for many years [94], though using DNNs for acoustic modeling outperforms most of these systems [95]. Thus, given robust evidence for DNNs, the proposed system was based on them; however, the HMM-DNN system requires an initial set of target states. We compared both systems and measured performance using the performance metrics detailed in Section 2.10. Since some classes are naturally scarce, we also wanted to measure the impact on performance of using only the classes with more numerous exemplars. Thus, we trained both systems with a minimum of 500 exemplars per class, as well as with the minimum of 70 per class used in the original system, and used the same performance metrics to compare each configuration: HMM-GMM vs. HMM-DNN with at least 70 exemplars per class, and HMM-GMM vs. HMM-DNN with at least 500 exemplars per class (and therefore fewer total classes).

2.8. A SNR Filter for the HMM-DNN System

It is common to have noisy recordings in the marine environment. Noise can reduce the HMM system's performance, especially if the noise has spectral similarities to events of interest, as was our case with earthquakes. To evaluate the impact of noisy recordings on correctly detecting and classifying events, we used the performance metric accuracy (see Section 2.10 for its definition). We tested different SNR thresholds ranging from 0 dB to 20 dB. Very few events had negative SNR, and these were not considered in this analysis, as the minimum SNR considered was 0 dB. The higher the SNR threshold, the fewer events were analyzed, as events with lower SNR were excluded.
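A sketch of this SNR-threshold filter, assuming per-event records that pair an annotated SNR with a correct/incorrect flag obtained from the alignment:

```python
def accuracy_above_snr(events, threshold_db):
    """events: iterable of (snr_db, correctly_detected_and_classified) pairs.
    Returns event-level accuracy (%) restricted to events at or above the threshold."""
    kept = [ok for snr, ok in events if snr >= threshold_db]
    return 100.0 * sum(kept) / len(kept) if kept else float("nan")

# Sweep thresholds from 0 to 20 dB, as in Section 3.2 (toy data).
events = [(3.2, True), (12.5, True), (7.1, False), (21.0, True)]
for thr in range(0, 21, 5):
    print(thr, accuracy_above_snr(events, thr))
```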

2.9. Train, Test, and Dev Sets

We divided the database into three disjoint sets: train, test, and dev, containing 60%, 20%, and 20% of the total annotated events, respectively. We used the train set for training both the HMM-GMM and HMM-DNN systems. For hyperparameter tuning and some parameter choices (see Section 2.6.2), we used the dev set. The test set was used for evaluating performance according to the metrics explained in Section 2.10.

2.10. Performance Metrics

The output of the HMM-DNN system was a frame-to-frame classification and a transcription of events per file. The output transcription, which contains the events detected by the system, is called the hypothesis transcription. A hypothesis transcription does not necessarily have one-to-one mapping to our reference transcription. To assess the system performance at the event-level, we aligned the reference transcription (ground-truth list of events per file) and the hypothesis transcription. A text alignment generally consists of the minimum number of insertion, deletion, and substitution operations to convert one string into another. In our case, where we have lists of events, the alignment gives a one-to-one mapping between predicted events and ground-truth events. We used the Levenshtein Distance Algorithm [96] within the Kaldi toolkit for the alignment process. An example of the alignment process is shown in Figure 7.
The performance metrics used to measure general performance were the event-level classification accuracy and WER. In the case of event-level accuracy, a reference event was counted as correctly detected and classified when it was matched in the alignment process (see Figure 7; the two ERQ events are matched, while the FWS event was not). We define the performance metrics below:
event-level accuracy = (correctly detected and classified events) / (total number of events)
Word Error Rate = (Insertions + Deletions + Substitutions) / N
with N as the number of events in the reference transcription. Note that a WER of 0 translates to zero errors, i.e., a perfect one-to-one match between the reference transcription and the hypothesis transcription. Additionally, it is possible to have a WER greater than 100 because of insertions (the system could wrongly detect more events than the ones that are actually present).
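A sketch of the event-level alignment and WER computation using Levenshtein edit distance over event-label sequences (a stand-in for the Kaldi alignment tool; the toy sequences mirror the Figure 7 example):

```python
def wer(reference, hypothesis):
    """WER = (insertions + deletions + substitutions) / len(reference), as a percentage."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edit operations to turn reference[:i] into hypothesis[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return 100.0 * d[n][m] / n

# One missed FWS event out of three reference events gives a WER of ~33.3.
print(wer(["ERQ", "FWS", "ERQ"], ["ERQ", "ERQ"]))
```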
To understand the system’s capacity to detect and classify low SNR events and how this compares to high SNR events, we used the event-level accuracy metric to evaluate events as a function of their estimated SNR, using SNR thresholds ranging from 0 dB to 20 dB. In the decoding step, false positives were possible, confusing background noise with some events. As we only had the SNR of annotated events, any wrongly detected events (insertion errors, in the WER sense) would not have an SNR estimate. Thus, WER was not a suitable performance metric to address the system’s performance as a function of SNR thresholds, so only event-level accuracy was used.
The output transcription (hypothesis transcription) could contain correctly detected, but incorrectly classified, events. These errors were measured by the metrics in Equations (5) and (6). However, Equation (5) only considers the correct classification of reference events. Additional falsely detected events were not measured with this metric. These kinds of errors were included in the WER: insertions accounted for incorrectly detected events (false positives), deletions for non-detected events (false negatives), and substitutions for correctly detected events, but incorrectly classified. WER and event-level accuracy were computed after aligning a reference transcription, which contained the events presented in each audio file, and the transcription output of the trained HMM-DNN system.
To account for the performance of specific classes, we used the classic Machine Learning metrics precision (related to false positives), sensitivity (related to false negatives), and F1-score, each defined below
sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
F1-score = 2 · (precision · sensitivity) / (precision + sensitivity)
where TP, FN, and FP stand for true positive, false negative, and false positive, respectively. TP, FN, and FP were calculated with respect to each class individually, and, subsequently, sensitivity, precision, and F1-score were calculated for each class. Better performance is indicated by higher sensitivity, higher precision, and higher F1-score.
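A sketch of the per-class metrics computed from the aligned event pairs (the counts are toy values):

```python
def class_metrics(tp, fp, fn):
    """Sensitivity (recall), precision, and F1-score for one class."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}

print(class_metrics(tp=90, fp=10, fn=20))  # toy counts for a single class
```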

3. Results

To evaluate overall system performance, we used the metrics of WER and event-level accuracy (see Section 2.10). The results are shown in Table 2 and Table 3. Overall event-level accuracy was high, i.e., 84.46%, considering 6215 events in total. The WER was relatively low (25.02), with deletions (D, n = 757), which capture the number of false negatives, being the largest contributor. The next most important type of error worsening the WER was insertions (I, n = 589), which indicate the number of false positives. Substitutions (S, n = 209) were the least significant error, suggesting that, in general, the detected events were also correctly classified (a correctly detected but misclassified event was counted as a substitution; see Section 2.10 for details on this performance metric).

3.1. Performance of Each Class

The results of event-level performance for each class are presented in Table 2. For the baleen whale acoustic signal classes, according to F1-score, Fin Whale Song (FWS) (91.19%), Antarctic blue whale (AA) (87.05%), and Sei Whale Downsweep (SWD) (80.95%) were the highest performing classes; Southeast Pacific blue whale song 1 unit C (S13) (66.41%) and Southeast Pacific blue whale song 2 units C and D (S23) (71.06%) had a somewhat intermediate performance, and Minke whale Pulse Trains (MI) (39.49%) and Southeast Pacific blue whale song 2 unit B (S22) (16.68%) displayed the poorest performances. Note that the worst performing classes were also the ones with the lowest number of exemplars. MI was rather unbalanced in terms of precision vs. sensitivity, with intermediate precision but low sensitivity. For anthropogenic noise classes, seismic air gun (AG) (85.13%) and unidentified ambient sound (AN) (80.01%) had high performances. Performance for the only geophysical signal in the dataset, i.e., earthquakes (ERQ), was intermediate (67.80%), but in terms of precision, performance was poor for this class (55.01%). Overall, higher performance metrics were associated with higher numbers of exemplars (N).
Averaging the results from all the classes, intermediate values were obtained for the macro average precision (64.49%) and sensitivity (66.01%). However, when considering the relative contribution of each class, higher values were obtained for the weighted average precision (89.54%) and weighted average sensitivity (84.46%), mainly due to the strong performance of classes with a high number of exemplars (N).

3.2. HMM-DNN Performance with SNR Filter

Figure 8 shows the results of the performance evaluation in terms of event-level accuracy as a function of different SNR thresholds. Unsurprisingly, SNR was found to be an essential factor in detection and classification capacity, with a relative improvement of 14.04% (calculated as the difference between the original value and the improved value, as a percentage of the original value) in event-level accuracy when comparing an SNR threshold of 0 dB (event-level accuracy of 84.46%) to an SNR threshold of 20 dB (event-level accuracy of 98.1%). The dataset was generally noisy; therefore, when an SNR threshold of 10 dB for the desired signals was applied, we were left with less than 60% of the original events, and for an SNR threshold of 20 dB, only 10% of the original events remained.

3.3. A Comparison with the Ordinary HMM-GMM System

Table 3 summarizes the general performance metrics at event-level for the HMM-DNN and HMM-GMM systems. The HMM-DNN system had a relative improvement of 2.50% in event-level accuracy compared to the HMM-GMM system (84.46% vs. 82.35%) and showed a relative WER improvement of 4.26%, mainly due to fewer deletions, equivalent to a smaller number of false negatives.
The relative improvement in the HMM-DNN system for event-level accuracy compared with the HMM-GMM system when considering only classes with 500 or more exemplars was 4.92%. In terms of the WER, the relative improvement was 12.48%, which is considerably better than using classes with at least 70 exemplars.

4. Discussion

4.1. HMM-DNN System Performance and Event-Level Performance

In general, our HMM-DNN system displayed a good capacity for correctly detecting and classifying events, evidenced by high event-level accuracy (84.46%), high weighted average sensitivity (84.46%), and high weighted average precision (89.54%). In this paper, we performed multiclass classification with 19 classes. A similar multi-class study targeted 11 species of whales and obtained 84.1% classification accuracy [45]; however, that study focused only on classification, not detection, as the events were already detected and segmented. If we examine results per class, the WER and event-level metrics showed that our HMM-DNN system generalized well to most classes. Performance was best for classes with a high number of training exemplars (e.g., generally above 50) and/or signals with low variability in spectral features, duration, and energy levels, such as FWS and AA.

4.1.1. Baleen Whale Acoustic Signals

Fin whale song (class FWS) and Antarctic blue whale song (class AA) were classes with a high number of training exemplars and are highly stereotyped signals (low variability in spectral features, duration, and energy levels) [63,97]; hence, they had the highest F1-scores (91.19% and 87.05%, respectively). Although the number of exemplars was low for sei whale downsweeps (class SWD), this class also performed well (F1-score of 80.95%) due to low variability among different exemplars, which has been described by several authors [68,76]. Performance for all Southeast Pacific blue whale classes (S13, S21, S22, S23, SEP) was not as high as for other species, which could be due to the low number of exemplars and the higher variability of this song [67]. Class S23 was the class that performed best for this acoustic population of blue whale, possibly because it was the class with the highest number of exemplars.

4.1.2. Air Guns

Although we did not have a high number of training exemplars for seismic air guns (class AG), these are also highly stereotyped signals with very low variability between exemplars [98]. Additionally, AG spectral features have few similarities with biological sources, which facilitates their discrimination. Thus, despite training the AG-specific HMM with only 325 exemplars, we achieved a high F1-score (85.13%).

4.1.3. Earthquakes

Intermediate results were achieved for earthquakes (class ERQ), given the high variability of this class in spectral features, duration, and energy levels, which corresponds to different earthquake generation processes (for example, whether an earthquake is tectonic or volcanic in origin) [2]. Additionally, it has been observed that bathymetric conditions can affect the energy at which an earthquake is observed [99]. Furthermore, high-energy background noise has spectral features that were sometimes confused by our HMM-DNN system with earthquakes, which explains the low precision (high false positives) for this class (55.01%). This could be improved by using more data for training and noise reduction techniques, such as that proposed by Vickers et al. [25]. However, high sensitivity (low false negatives) was also observed for this class (88.33%), so these events were unlikely to be missed by our system, which is relevant to the possible use of this system in decision making in Chile. The high level of detection (high sensitivity) for earthquakes is consistent with previous studies, such as [3], where over 99% sensitivity was attained. However, those authors used a preprocessing step based on Short-Term Averages/Long-Term Averages (STA/LTA) that discarded all biological and anthropogenic (air gun) events due to their short duration.

4.2. HMM-DNN System Performance as a Function of SNR Thresholds

Recognizing events in a noisy environment poses a significant challenge for our HMM-DNN system. The results show that our dataset has relatively high levels of noise, as is expected in the marine environment [60,61,62], with a 40% reduction in events at an SNR threshold of 10 dB. Poorer detection and classification performance metrics have been observed in other studies that target underwater acoustic signals such as [25], where the introduction of denoising methods boosts the mean accuracy from 73% to 85.2%. In this study, noise-reduction techniques were not applied; however, we plan to implement such techniques in the future since they may well improve overall results.

4.3. Comparison of HMM-DNN and HMM-GMM Systems

Although we mainly trained an HMM-GMM system to generate target alignments for our proposed HMM-DNN system, we also evaluated this system, in which the acoustic modeling was performed with mixtures of Gaussians. Our HMM-DNN system performed better, with a 2.50% relative improvement in event-level accuracy and a 5.47% relative reduction in WER. Although insertions were higher for the HMM-DNN system, these were not enough for it to be outperformed by the HMM-GMM system. It is known that the advantage of DNNs grows with the amount of data fed to the system, and, in effect, the improvement in performance was even more striking (4.92% relative improvement in event-level accuracy and a 12.48% relative reduction in WER) when the system was trained only with the classes with more exemplars (>500).

5. Conclusions

In this study, we present an automatic system for detecting and classifying baleen whale, earthquake, and air gun sounds based on an HMM framework with a DNN for acoustic modeling, achieving high classification accuracy (84.46%) and low WER (25.02). To the best of our knowledge, it is the first attempt to integrate such a variety of low-frequency acoustic signals from PAM datasets within a single system, integrating detection and classification in a single step. Additionally, we present a comparison with a more classical approach based on GMMs, showing that DNNs outperform GMMs, with the improvement being more significant when considering only the most numerous classes. Our system also benefits from having high SNR data, as we observed improvements when considering high SNR events. However, high SNR events were scarce, and when a minimum SNR of 12 dB was required, we were left with less than 50% of the events. This suggests that future improvements must include noise-reducing techniques to improve classification accuracy, i.e., the number of correctly detected events. For some applications, such as monitoring the seasonal trends of baleen whales, temporal patterns of species with high acoustic presence are likely to still be apparent even when a large percentage of events is discarded. However, for other applications, such as monitoring very rare species or anthropogenic noise, it may be essential to include low SNR events.
Given that our system not only outputs a list of events in a recording but also reports the beginning and end of each event, we can precisely signal the timestamp of occurrence of an event, which is hugely useful for research and decision-making/monitoring purposes. This system could help the biological and earth sciences community decrease analysis times for the large datasets generated by hydrophones in the ocean. It could also be of use to the various decision makers and stakeholders in Chile and internationally with an interest in monitoring endangered baleen whales, seismic activity, and anthropogenic noise. This system can be built on in the future to improve the detection and classification performance of classes and to increase the number of target acoustic signals.

Author Contributions

S.J.B. contributed by securing funding, conception of study, planning and supervision of data analysis, writing and editing manuscript; M.D. contributed by carrying out the data analysis, writing and editing of manuscript; C.R. contributed by analyzing data and writing sections of the manuscript; J.W. and R.M. contributed with data analysis and technical input; K.M.S. contributed by obtaining data and manuscript editing; N.B.Y. contributed with securing funding, conception of study, planning and supervision of data analysis, and editing manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This study was made possible by funding from the Fondecyt de Iniciación grant from the Agencia Nacional de Investigación y Desarrollo (ANID) No. 11190597 awarded to S.J.B.; and the Fondecyt Regular grant from ANID/FONDECYT No. 1211946 awarded to N.B.Y. Thanks also goes to support from the Centro COPAS Sur-Austral AFB170006 and Centro COPAS Coastal FB210021 both funded by ANID.

Data Availability Statement

Acoustic data is fully available via contract with the Preparatory Commission for the Comprehensive Nuclear Test Ban Treaty Organization (CTBTO).

Acknowledgments

We would like to thank the Preparatory Commission for the Comprehensive Nuclear Test Ban Treaty Organization (CTBTO) and the Chilean Commission of Nuclear Energy (CCHEN) for providing the data used in this study. Thanks to Paola Garcia at CCHEN for her help during funding applications and throughout the project. Thanks to the Chilean Undersecretary of Fisheries and Aquaculture (SUBPESCA) and the Chilean National Fisheries and Aquaculture Service (SERNAPESCA) for their support during funding applications.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Van Parijs, S.M.; Clark, C.W.; Sousa-Lima, R.S.; Parks, S.E.; Rankin, S.; Risch, D.; Van Opzeeland, I.C. Management and research applications of real-time and archival passive acoustic sensors over varying temporal and spatial scales. Mar. Ecol. Prog. Ser. 2009, 395, 21–36.
2. Fox, C.G.; Matsumoto, H.; Lau, T.-K.A. Monitoring Pacific Ocean seismicity from an autonomous hydrophone array. J. Geophys. Res. Solid Earth 2001, 106, 4183–4206.
3. Sukhovich, A.; Irisson, J.; Perrot, J.; Nolet, G. Automatic recognition of T and teleseismic P waves by statistical analysis of their spectra: An application to continuous records of moored hydrophones. J. Geophys. Res. Solid Earth 2014, 119, 6469–6485.
4. Matsumoto, H.; Haralabus, G.; Zampolli, M.; Özel, N.M. T-phase and tsunami pressure waveforms recorded by near-source IMS water-column hydrophone triplets during the 2015 Chile earthquake. Geophys. Res. Lett. 2016, 43, 12511–12519.
5. Caplan-Auerbach, J.; Dziak, R.P.; Bohnenstiehl, D.R.; Chadwick, W.W.; Lau, T.-K. Hydroacoustic investigation of submarine landslides at West Mata volcano, Lau Basin. Geophys. Res. Lett. 2014, 41, 5927–5934.
6. Bohnenstiehl, D.R.; Dziak, R.P.; Matsumoto, H.; Conder, J.A. Acoustic response of submarine volcanoes in the Tofua Arc and northern Lau Basin to two great earthquakes. Geophys. J. Int. 2013, 196, 1657–1675.
7. Hay, A.E.; Hatcher, M.G.; Clarke, J.E.H. Underwater noise from submarine turbidity currents. JASA Express Lett. 2021, 1, 070801.
8. Pettit, E.C.; Lee, K.M.; Brann, J.P.; Nystuen, J.A.; Wilson, P.S.; O'Neel, S. Unusually loud ambient noise in tidewater glacier fjords: A signal of ice melt. Geophys. Res. Lett. 2015, 42, 2309–2316.
9. Glowacki, O.; Deane, G.B. Quantifying iceberg calving fluxes with underwater noise. Cryosphere 2020, 14, 1025–1042.
10. Glowacki, O.; Moskalik, M.; Deane, G.B. The impact of glacier meltwater on the underwater noise field in a glacial bay. J. Geophys. Res. Oceans 2016, 121, 8455–8470.
11. Merchant, N.D.; Blondel, P.; Dakin, D.T.; Dorocicz, J. Averaging underwater noise levels for environmental assessment of shipping. J. Acoust. Soc. Am. 2012, 132, EL343–EL349.
12. Hatch, L.; Clark, C.; Merrick, R.; Van Parijs, S.; Ponirakis, D.; Schwehr, K.; Thompson, M.; Wiley, D. Characterizing the relative contributions of large vessels to total ocean noise fields: A case study using the Gerry E. Studds Stellwagen Bank National Marine Sanctuary. Environ. Manag. 2008, 42, 735–752.
13. Stafford, K.M.; Fox, C.G.; Clark, D.S. Long-range acoustic detection and localization of blue whale calls in the northeast Pacific Ocean. J. Acoust. Soc. Am. 1998, 104, 3616–3625.
14. Mellinger, D.; Stafford, K.; Moore, S.; Dziak, R.; Matsumoto, H. An overview of fixed passive acoustic observation methods for cetaceans. Oceanography 2007, 20, 36–45.
15. Buchan, S.J.; Stafford, K.M.; Hucke-Gaete, R. Seasonal occurrence of southeast Pacific blue whale songs in southern Chile and the eastern tropical Pacific. Mar. Mammal Sci. 2014, 31, 440–458.
16. Wall, C.C.; Simard, P.; Lembke, C.; Mann, D.A. Large-scale passive acoustic monitoring of fish sound production on the West Florida Shelf. Mar. Ecol. Prog. Ser. 2013, 484, 173–188.
17. Prior, M.K.; Meless, O.; Bittner, P.; Sugioka, H. Long-range detection and location of shallow underwater explosions using deep-sound-channel hydrophones. IEEE J. Ocean. Eng. 2011, 36, 703–715.
18. Woodman, G.H.; Wilson, S.C.; Li, V.Y.; Renneberg, R. Acoustic characteristics of fish bombing: Potential to develop an automated blast detector. Mar. Pollut. Bull. 2002, 46, 99–106.
19. Braulik, G.; Wittich, A.; Macaulay, J.; Kasuga, M.; Gordon, J.; Davenport, T.R.; Gillespie, D. Acoustic monitoring to document the spatial distribution and hotspots of blast fishing in Tanzania. Mar. Pollut. Bull. 2017, 125, 360–366.
20. Nieukirk, S.L.; Mellinger, D.K.; Moore, S.E.; Klinck, K.; Dziak, R.P.; Goslin, J. Sounds from airguns and fin whales recorded in the mid-Atlantic Ocean, 1999–2009. J. Acoust. Soc. Am. 2012, 131, 1102–1112.
21. Sutin, A.; Bunin, B.; Sedunov, A.; Sedunov, N.; Fillinger, L.; Tsionskiy, M.; Bruno, M. Stevens Passive Acoustic System for underwater surveillance. In Proceedings of the 2010 International WaterSide Security Conference, Carrara, Italy, 3–5 November 2010; pp. 1–6.
22. Usman, A.M.; Ogundile, O.O.; Versfeld, D.J.J. Review of Automatic Detection and Classification Techniques for Cetacean Vocalization. IEEE Access 2020, 8, 105181–105206.
23. Yang, W.; Luo, W.; Zhang, Y. Classification of odontocete echolocation clicks using convolutional neural network. J. Acoust. Soc. Am. 2020, 147, 49–55.
24. Ogundile, O.O.; Babalola, O.P.; Odeyemi, S.G.; Rufai, K.I. Hidden Markov models for detection of Mysticetes vocalisations based on principal component analysis. Bioacoustics 2022, 31, 710–738.
25. Vickers, W.; Milner, B.; Risch, D.; Lee, R. Robust North Atlantic right whale detection using deep learning models for denoising. J. Acoust. Soc. Am. 2021, 149, 3797–3812.
26. Zhong, M.; Torterotot, M.; Branch, T.A.; Stafford, K.M.; Royer, J.-Y.; Dodhia, R.; Ferres, J.L. Detecting, classifying, and counting blue whale calls with Siamese neural networks. J. Acoust. Soc. Am. 2021, 149, 3086–3094.
27. Waddell, E.E.; Rasmussen, J.H.; Širović, A. Applying Artificial Intelligence Methods to Detect and Classify Fish Calls from the Northern Gulf of Mexico. J. Mar. Sci. Eng. 2021, 9, 1128.
28. Buchan, S.J.; Mahú, R.; Wuth, J.; Balcazar-Cabrera, N.; Gutierrez, L.; Neira, S.; Yoma, N.B. An unsupervised Hidden Markov Model-based system for the detection and classification of blue whale vocalizations off Chile. Bioacoustics 2019, 29, 140–167.
29. Baumgartner, M.F.; Ball, K.; Partan, J.; Pelletier, L.-P.; Bonnell, J.; Hotchkin, C.; Corkeron, P.J.; Van Parijs, S.M. Near real-time detection of low-frequency baleen whale calls from an autonomous surface vehicle: Implementation, evaluation, and remaining challenges. J. Acoust. Soc. Am. 2021, 149, 2950–2962.
30. Vieira, M.; Pereira, B.P.; Pousão-Ferreira, P.; Fonseca, P.J.; Amorim, M.C.P. Seasonal variation of captive meagre acoustic signalling: A manual and automatic recognition approach. Fishes 2019, 4, 28.
31. Bianco, M.J.; Gerstoft, P.; Traer, J.; Ozanich, E.; Roch, M.A.; Gannot, S.; Deledalle, C.-A. Machine learning in acoustics: Theory and applications. J. Acoust. Soc. Am. 2019, 146, 3590–3628.
32. Bahoura, M.; Simard, Y. Serial combination of multiple classifiers for automatic blue whale calls recognition. Expert Syst. Appl. 2012, 39, 9986–9993.
33. Caruso, F.; Dong, L.; Lin, M.; Liu, M.; Gong, Z.; Xu, W.; Alonge, G.; Li, S. Monitoring of a Nearshore Small Dolphin Species Using Passive Acoustic Platforms and Supervised Machine Learning Techniques. Front. Mar. Sci. 2020, 7, 267.
34. Ruff, Z.J.; Lesmeister, D.B.; Duchac, L.S.; Padmaraju, B.K.; Sullivan, C.M. Automated identification of avian vocalizations with deep convolutional neural networks. Remote Sens. Ecol. Conserv. 2019, 6, 79–92.
35. Rasmussen, J.H.; Širović, A. Automatic detection and classification of baleen whale social calls using convolutional neural networks. J. Acoust. Soc. Am. 2021, 149, 3635–3644.
36. Shiu, Y.; Palmer, K.J.; Roch, M.A.; Fleishman, E.; Liu, X.; Nosal, E.-M.; Helble, T.; Cholewiak, D.; Gillespie, D.; Klinck, H. Deep neural networks for automated detection of marine mammal species. Sci. Rep. 2020, 10, 607.
37. Wang, Y.; Ye, J.; Borchers, D.L. Automated call detection for acoustic surveys with structured calls of varying length. Methods Ecol. Evol. 2022, 13, 1552–1567.
38. Saffari, A.; Khishe, M.; Zahiri, S.-H. Fuzzy-ChOA: An improved chimp optimization algorithm for marine mammal classification using artificial neural network. Analog. Integr. Circuits Signal Process. 2022, 111, 403–417.
39. Madhusudhana, S.; Shiu, Y.; Klinck, H.; Fleishman, E.; Liu, X.; Nosal, E.-M.; Helble, T.; Cholewiak, D.; Gillespie, D.; Širović, A.; et al. Improve automatic detection of animal call sequences with temporal context. J. R. Soc. Interface 2021, 18, 20210297.
40. Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286.
  41. Gales, M.; Young, S. The Application of Hidden Markov Models in Speech Recognition. Found. Trends Signal Process. 2007, 1, 195–304. [Google Scholar] [CrossRef]
  42. Ogundile, O.; Usman, A.; Babalola, O.; Versfeld, D. A hidden Markov model with selective time domain feature extraction to detect inshore Bryde’s whale short pulse calls. Ecol. Inform. 2020, 57, 101087. [Google Scholar] [CrossRef]
  43. Putland, R.; Ranjard, L.; Constantine, R.; Radford, C. A hidden Markov model approach to indicate Bryde’s whale acoustics. Ecol. Indic. 2018, 84, 479–487. [Google Scholar] [CrossRef]
  44. Zhao, J.; Li, X.; Liu, W.; Gao, Y.; Lei, M.; Tan, H.; Yang, D. DNN-HMM based acoustic model for continuous pig cough sound recognition. Int. J. Agric. Biol. Eng. 2020, 13, 186–193. [Google Scholar] [CrossRef]
  45. Trawicki, M.B. Multispecies discrimination of whales (cetaceans) using Hidden Markov Models (HMMS). Ecol. Inform. 2021, 61, 101223. [Google Scholar] [CrossRef]
  46. Vieira, M.; Amorim, M.C.P.; Sundelöf, A.; Prista, N.; Fonseca, P. Underwater noise recognition of marine vessels passages: Two case studies using hidden Markov models. ICES J. Mar. Sci. 2019, 77, 2157–2170. [Google Scholar] [CrossRef]
  47. Kyhn, L.; Wisniewska, D.; Beedholm, K.; Tougaard, J.; Simon, M.; Mosbech, A.; Madsen, P. Basin-wide contributions to the underwater soundscape by multiple seismic surveys with implications for marine mammals in Baffin Bay, Greenland. Mar. Pollut. Bull. 2018, 138, 474–490. [Google Scholar] [CrossRef]
  48. Hanson, J.A. Indian Ocean ridge seismicity observed with a permanent hydroacoustic network. Geophys. Res. Lett. 2005, 32, 102931. [Google Scholar] [CrossRef]
  49. Metz, D.; Watts, A.B.; Grevemeyer, I.; Rodgers, M. Tracking Submarine Volcanic Activity at Monowai: Constraints from Long-Range Hydroacoustic Measurements. J. Geophys. Res. Solid Earth 2018, 123, 7877–7895. [Google Scholar] [CrossRef]
  50. Yun, S.; Ni, S.; Park, M.; Lee, W.S. Southeast Indian Ocean-Ridge earthquake sequences from cross-correlation analysis of hydroacoustic data. Geophys. J. Int. 2009, 179, 401–407. [Google Scholar] [CrossRef]
  51. Ingale, V.; Bazin, S.; Royer, J.-Y. Hydroacoustic observations of two contrasted seismic swarms along the Southwest Indian ridge in 2018. Geosciences 2021, 11, 225. [Google Scholar] [CrossRef]
  52. Tsang-Hin-Sun, E.; Royer, J.-Y.; Perrot, J. Seismicity and active accretion processes at the ultraslow-spreading Southwest and intermediate-spreading Southeast Indian ridges from hydroacoustic data. Geophys. J. Int. 2016, 206, 1232–1245. [Google Scholar] [CrossRef]
  53. Gomez, B.; Kadri, U. Earthquake source characterization by machine learning algorithms applied to acoustic signals. Sci. Rep. 2021, 11, 23062. [Google Scholar] [CrossRef] [PubMed]
  54. Benitez, M.C.; Ramirez, J.; Segura, J.C.; Ibanez, J.M.; Almendros, J.; Garcia-Yeguas, A.; Cortes, G. Continuous HMM-based seismic-event classification at deception Island, Antarctica. IEEE Trans. Geosci. Remote Sens. 2006, 45, 138–146. [Google Scholar] [CrossRef]
  55. Gutierrez, L.; Ibanez, J.; Cortes, G.; Ramirez, J.; Benitez, C.; Tenorio, V.; Isaac, A. Volcano-seismic signal detection and classification processing using hidden Markov models. Application to San Cristóbal volcano, Nicaragua. In Proceedings of the 2009 IEEE International Geoscience and Remote Sensing Symposium, Cape Town, South Africa, 12–17 July 2009; Volume 4, pp. IV-522–IV-525. [Google Scholar] [CrossRef]
  56. Trujillo-Castrillón, N.; Valdés-González, C.M.; Arámbula-Mendoza, R.; Santacoloma-Salguero, C.C. Initial processing of volcanic seismic signals using Hidden Markov Models: Nevado del Huila, Colombia. J. Volcanol. Geotherm. Res. 2018, 364, 107–120. [Google Scholar] [CrossRef]
  57. Beyreuther, M.; Wassermann, J. Continuous earthquake detection and classification using discrete Hidden Markov Models. Geophys. J. Int. 2008, 175, 1055–1066. [Google Scholar] [CrossRef]
  58. Ebel, J.E.; Chambers, D.W.; Kafka, A.L.; Baglivo, J.A. Non-Poissonian Earthquake Clustering and the Hidden Markov Model as Bases for Earthquake Forecasting in California. Seism. Res. Lett. 2007, 78, 57–65. [Google Scholar] [CrossRef]
  59. Pertsinidou, C.E.; Tsaklidis, G.; Papadimitriou, E.; Limnios, N. Application of hidden semi-Markov models for the seismic hazard assessment of the North and South Aegean Sea, Greece. J. Appl. Stat. 2016, 44, 1064–1085. [Google Scholar] [CrossRef]
  60. Haver, S.M.; Klinck, H.; Nieukirk, S.L.; Matsumoto, H.; Dziak, R.P.; Miksis-Olds, J.L. The not-so-silent world: Measuring Arctic, Equatorial, and Antarctic soundscapes in the Atlantic Ocean. Deep Sea Res. Part I Oceanogr. Res. Pap. 2017, 122, 95–104. [Google Scholar] [CrossRef]
  61. Shi, Y.; Yang, Y.; Tian, J.; Sun, C.; Zhao, W.; Li, Z.; Ma, Y. Long-term ambient noise statistics in the northeast South China Sea. J. Acoust. Soc. Am. 2019, 145, EL501–EL507. [Google Scholar] [CrossRef]
  62. Wilcock, W.S.; Stafford, K.M.; Andrew, R.K.; Odom, R.I. Sounds in the Ocean at 1–100 Hz. Annu. Rev. Mar. Sci. 2014, 6, 117–140. [Google Scholar] [CrossRef]
  63. Buchan, S.; Gutierrez, L.; Balcazar-Cabrera, N.; Stafford, K. Seasonal occurrence of fin whale song off Juan Fernandez, Chile. Endanger. Species Res. 2019, 39, 135–145. [Google Scholar] [CrossRef]
  64. Peddinti, V.; Povey, D.; Khudanpur, S. A time delay neural network architecture for efficient modeling of long temporal contexts. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, INTERSPEECH, Dresden, Germany, 6–10 September 2015; Volume 2015, pp. 3214–3218. [Google Scholar] [CrossRef]
  65. Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 1967, 13, 260–269. [Google Scholar] [CrossRef]
  66. Calderan, S.; Miller, B.; Collins, K.; Ensor, P.; Double, M.; Leaper, R.; Barlow, J. Low-frequency vocalizations of sei whales (Balaenoptera borealis) in the Southern Ocean. J. Acoust. Soc. Am. 2014, 136, EL418–EL423. [Google Scholar] [CrossRef]
  67. Buchan, S.; Hucke-Gaete, R.; Rendell, L.; Stafford, K. A new song recorded from blue whales in the Corcovado Gulf, Southern Chile, and an acoustic link to the Eastern Tropical Pacific. Endanger. Species Res. 2014, 23, 241–252. [Google Scholar] [CrossRef]
  68. Cummings, W.C.; Thompson, P.O. Underwater Sounds from the Blue Whale, Balaenoptera musculus. J. Acoust. Soc. Am. 1971, 50, 1193–1198. [Google Scholar] [CrossRef]
  69. Charif, R.A.; Mellinger, D.K.; Dunsmore, K.J.; Fristrup, K.M.; Clark, C.W. Estimated source levels of fin whale (Balaenoptera physalus) vocalizations: Adjustments for surface interference. Mar. Mammal Sci. 2002, 18, 81–98. [Google Scholar] [CrossRef]
  70. Watkins, W.A.; Tyack, P.; Moore, K.E.; Bird, J.E. The 20-Hz signals of finback whales (Balaenoptera physalus). J. Acoust. Soc. Am. 1987, 82, 1901–1912. [Google Scholar] [CrossRef]
  71. Watkins, W.A. Activities and underwater sounds of fin whales. Sci. Rep. Whales Res. Inst. 1981, 33, 83–117. [Google Scholar]
  72. Delarue, J.; Martin, B.; Hannay, D.; Berchok, C.L. Acoustic Occurrence and Affiliation of Fin Whales Detected in the Northeastern Chukchi Sea, July to October 2007–10. Arctic 2013, 66, 159–172. Available online: http://www.jstor.org/stable/23594680 (accessed on 15 July 2021). [CrossRef]
  73. Baumgartner, M.F.; Fratantoni, D.M. Diel periodicity in both sei whale vocalization rates and the vertical migration of their copepod prey observed from ocean gliders. Limnol. Oceanogr. 2008, 53, 2197–2209. [Google Scholar] [CrossRef]
  74. Español-Jiménez, S.; Bahamonde, P.A.; Chiang, G.; Häussermann, V. Discovering sounds in Patagonia: Characterizing sei whale (Balaenoptera borealis) downsweeps in the south-eastern Pacific Ocean. Ocean Sci. 2019, 15, 75–82. [Google Scholar] [CrossRef]
  75. Mellinger, D.K.; Carson, C.D.; Clark, C.W. Characteristics of minke whale (Balaenoptera acutorostrata) pulse trains recorded near puerto rico. Mar. Mammal Sci. 2000, 16, 739–756. [Google Scholar] [CrossRef]
  76. Schevill, W.E.; Watkins, W.A. Intense Low-Frequency Sounds from An Antarctic Minke Whale: Balaenoptera acutorostrata; Woods Hole Oceanographic Institution: Falmouth, MA, USA, 1972. [Google Scholar]
  77. Shabangu, F.W.; Findlay, K.; Stafford, K.M. Seasonal acoustic occurrence, diel-vocalizing patterns and bioduck call-type composition of Antarctic minke whales off the west coast of South Africa and the Maud Rise, Antarctica. Mar. Mammal Sci. 2020, 36, 658–675. [Google Scholar] [CrossRef]
  78. De Angelis, S.; McNutt, S.R. Observations of volcanic tremor during the January–February 2005 eruption of Mt. Veniaminof, Alaska. Bull. Volcanol. 2007, 69, 927–940. [Google Scholar] [CrossRef]
  79. Dziak, R.P.; Fox, C.G.; Matsumoto, H.; Schreiner, A.E. The April 1992 Cape Mendocino Earthquake Sequence: Seismo-Acoustic Analysis Utilizing Fixed Hydrophone Arrays. Mar. Geophys. Res. 1997, 19, 137–162. [Google Scholar] [CrossRef]
  80. Nishimura, C.E. Monitoring Whales and Earthquakes by Using SOSUS; Naval Research Laboratory: Washington, DC, USA, 1994. [Google Scholar]
  81. Charif, R.; Strickman, L.M.; Waack, A.M. Raven Pro 1.4 User’s Manual; The Cornell Lab of Ornithology: Ithaca, NY, USA, 2010. [Google Scholar]
  82. Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
  83. Pedersen, P. The Mel Scale. J. Music Theory 1965, 9, 295–308. [Google Scholar] [CrossRef]
  84. Gales, M. Semi-tied covariance matrices for hidden Markov models. IEEE Trans. Speech Audio Process. 1999, 7, 272–281. [Google Scholar] [CrossRef]
  85. Povey, D.; Mittal, S. The Kaldi Speech Recognition Toolkit. 2011. Available online: http://kaldi.sf.net/ (accessed on 15 July 2022).
  86. Baum, L.E.; Petrie, T.; Soules, G.; Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 1970, 41, 164–171. [Google Scholar]
  87. Maas, A.L.; Qi, P.; Xie, Z.; Hannun, A.Y.; Lengerich, C.T.; Jurafsky, D.; Ng, A.Y. Building DNN acoustic models for large vocabulary speech recognition. Comput. Speech Lang. 2017, 41, 195–213. [Google Scholar] [CrossRef]
  88. Rath, S.P.; Povey, D.; Veselý, K.; Cernocký, J. Improved feature processing for Deep Neural Networks. In Proceedings of the 14th Annual Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013. [Google Scholar]
  89. Paul, D.B.; Baker, J.M. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Speech and Natural Language Workshop, Harriman, NY, USA, 23–26 February 1992. [Google Scholar]
  90. Peddinti, V.; Wang, Y.; Povey, D.; Khudanpur, S. Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process. Lett. 2017, 25, 373–377. [Google Scholar] [CrossRef]
  91. Zhang, Z.; Sabuncu, M.R. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  92. Agarap, A.F. Deep Learning Using Rectified Linear Units (ReLU). 2018. Available online: https://github.com/AFAgarap/relu-classifier (accessed on 15 July 2022).
  93. Bengio, Y.; de Mori, R.; Cardin, R. Speaker Independent Speech Recognition with Neural Networks and Speech Knowledge. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 27–30 November 1989. [Google Scholar]
  94. Wang, D.; Wang, X.; Lv, S. An Overview of End-to-End Automatic Speech Recognition. Symmetry 2019, 11, 1018. [Google Scholar] [CrossRef]
  95. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.-R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97. [Google Scholar] [CrossRef]
  96. Hollmann, H. A relation between Levenshtein-type distances and insertion-and-deletion correcting capabilities of codes. IEEE Trans. Inf. Theory 1993, 39, 1424–1427. [Google Scholar] [CrossRef]
  97. Širović, A.; Hildebrand, J.A.; Wiggins, S.M.; McDonald, M.A.; Moore, S.E.; Thiele, D. Seasonality of blue and fin whale calls and the influence of sea ice in the Western Antarctic Peninsula. Deep Sea Res. Part II Top. Stud. Oceanogr. 2004, 51, 2327–2344. [Google Scholar] [CrossRef]
  98. Nieukirk, S.L.; Stafford, K.M.; Mellinger, D.K.; Dziak, R.P.; Fox, C.G. Low-frequency whale and seismic airgun sounds recorded in the mid-Atlantic Ocean. J. Acoust. Soc. Am. 2004, 115, 1832–1843. [Google Scholar] [CrossRef]
  99. De Caro, M.; Montuori, C.; Frugoni, F.; Monna, S.; Cammarano, F.; Beranzoli, L. T-Phases Observed at the Ionian Seafloor: Seismic Source and Bathymetric Effects. Seism. Res. Lett. 2020, 92, 481–493. [Google Scholar] [CrossRef]
Figure 1. Map of study area and location (white triangle) of CTBTO HA03 station off Juan Fernandez.
Figure 2. Filterbank feature extraction pipeline. A Hamming window was applied to each frame, followed by the power spectral density (PSD) computation. Features were extracted using a triangular filterbank and log-compressed for smoothing. The energy of each frame was appended. Mean and variance normalization (MVN) was applied. Delta (Δ) and delta-delta (ΔΔ) coefficients were concatenated, forming a 153-dimensional observation vector O_t.
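To make the pipeline in Figure 2 concrete, the following is a minimal NumPy sketch, not the authors' Kaldi-based implementation. The frame length, hop size, and delta computation are assumptions; the 50-filter triangular filterbank over a 128-point PSD, the appended frame energy, MVN, and delta/delta-delta coefficients follow the figure, yielding (50 + 1) × 3 = 153 coefficients per frame. The `filterbank` argument is a (50, 128) weight matrix, e.g., the one sketched after Figure 3.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    # Split the waveform into overlapping frames (frame/hop sizes are assumptions).
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def extract_features(x, filterbank):
    frames = frame_signal(x) * np.hamming(256)                # Hamming window
    psd = np.abs(np.fft.rfft(frames, n=256)) ** 2 / 256       # power spectral density
    fbank = np.log(psd[:, :128] @ filterbank.T + 1e-10)       # 50 log filterbank energies
    energy = np.log(psd.sum(axis=1, keepdims=True) + 1e-10)   # per-frame energy
    static = np.hstack([fbank, energy])                       # 51 static coefficients
    static = (static - static.mean(axis=0)) / (static.std(axis=0) + 1e-10)  # MVN
    delta = np.gradient(static, axis=0)                       # delta coefficients
    delta2 = np.gradient(delta, axis=0)                       # delta-delta coefficients
    return np.hstack([static, delta, delta2])                 # (n_frames, 153) observations O_t
```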
Figure 3. Triangular Filterbank. Each filter F has the same bandwidth B = 3, which means each triangular filter operates on 3 FFT samples. Note the 50% overlap each filter has with the previous one. A total of 50 filters forms the filterbank used to extract information from the 128-point power spectral density (PSD).
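As a companion to the sketch above, the filterbank of Figure 3 can be approximated as follows. The exact bin placement is an assumption; what follows honors the stated parameters (50 filters, bandwidth B = 3 PSD bins, 50% overlap between neighboring filters), and under these assumptions the filters cover the lower portion of the 128-bin PSD.

```python
import numpy as np

def triangular_filterbank(n_filters=50, n_bins=128, bandwidth=3, overlap=0.5):
    # Each row is one triangular filter of width `bandwidth` bins with unit gain
    # at its centre; successive centres are shifted by bandwidth * (1 - overlap)
    # bins, i.e., 50% overlap between neighbouring filters.
    step = bandwidth * (1.0 - overlap)
    centers = bandwidth / 2 + step * np.arange(n_filters)
    bins = np.arange(n_bins)
    fbank = np.zeros((n_filters, n_bins))
    for i, c in enumerate(centers):
        fbank[i] = np.clip(1.0 - np.abs(bins - c) / (bandwidth / 2), 0.0, 1.0)
    return fbank

filterbank = triangular_filterbank()  # shape (50, 128), usable in extract_features above
```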
Figure 4. Hidden Markov Model. In the proposed model, we used three states S for all the events shown in Table 1. The arcs coming out of each state represent the probabilities of transitioning to the next state. For example, a_{1,2} represents the probability of transitioning from state S_1 to state S_2. These probabilities were automatically estimated during training.
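For illustration, a left-to-right three-state topology such as the one in Figure 4 can be written as a transition matrix. The numerical values below are placeholders; in the proposed system these probabilities are estimated automatically during training rather than set by hand.

```python
import numpy as np

# Rows: current state S1..S3; columns: next state S1..S3.
A = np.array([
    [0.6, 0.4, 0.0],   # S1 -> S1 (self-loop) or S1 -> S2
    [0.0, 0.7, 0.3],   # S2 -> S2 or S2 -> S3
    [0.0, 0.0, 1.0],   # S3 -> S3 until the event ends
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a probability distribution
```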
Figure 5. The HMM-DNN detection and classifying system. For illustration purposes, only 2 context frames are shown for the training and decoding stage, but in total there were 18 context frames in the implemented system.
Figure 6. DNN architecture. The input layer receives a central frame plus context frames; for purposes of illustration, only 2 context frames are shown, whereas the actual architecture used 18 context frames. Each hidden layer contained 2048 units and was fully connected to the next layer. All units used the Rectified Linear Unit (ReLU) activation function. The output layer was a softmax whose dimensionality equaled the number of HMM states.
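The architecture in Figure 6 can be sketched as below; PyTorch is used only for illustration, as the authors' system was built in Kaldi. The number of hidden layers and the total number of HMM states are assumptions; the 2048-unit fully connected ReLU layers, the 153-dimensional frames spliced with 18 context frames, and the softmax output over HMM states follow the figure.

```python
import torch
import torch.nn as nn

N_CONTEXT = 18          # total context frames around the central frame
FEAT_DIM = 153          # per-frame observation vector (Figure 2)
N_HMM_STATES = 19 * 3   # assumption: 3 states per class for the 19 classes

class DnnAcousticModel(nn.Module):
    def __init__(self, n_hidden_layers=4, hidden_dim=2048):
        super().__init__()
        layers, in_dim = [], FEAT_DIM * (N_CONTEXT + 1)
        for _ in range(n_hidden_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        layers += [nn.Linear(in_dim, N_HMM_STATES)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, (N_CONTEXT + 1) * FEAT_DIM) spliced feature vectors
        return self.net(x).log_softmax(dim=-1)  # per-state log posteriors
```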
Figure 7. An example of aligning the reference and hypothesis transcriptions. Although there are several ways of arranging the events of the two sequences, the alignment produced has the minimum number of errors in terms of insertions, deletions, and substitutions. The special symbol <EPS> in the reference means there was no reference event to align a hypothesis event to, resulting in an insertion error. When an event was not detected, <EPS> appears in the hypothesis transcription, since there is no hypothesis event to align the reference event to, resulting in a deletion error. Note that if the events of the reference and hypothesis were swapped, the pair <EPS>, FWS (now FWS, <EPS>) would become a deletion.
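The alignment in Figure 7 minimizes the Levenshtein-type edit distance between the reference and hypothesis event sequences [96]. A minimal dynamic-programming sketch of this error counting is shown below; it is not the authors' scoring tool, but it is assumed to behave equivalently at the event level.

```python
def align_errors(reference, hypothesis):
    """Minimum total number of insertion, deletion, and substitution errors."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum edits aligning the first i reference events
    # with the first j hypothesis events
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i                      # all deletions
    for j in range(1, m + 1):
        cost[0][j] = j                      # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    return cost[n][m]

# Example: one optimal alignment deletes an FWS event and inserts an ERQ event.
print(align_errors(["FWS", "AA", "FWS"], ["FWS", "ERQ", "AA"]))  # -> 2
```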
Figure 8. Classification accuracy as a function of SNR thresholds of analyzed events (blue line) and the % of total events included in the analysis for each SNR threshold (red line).
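The curves in Figure 8 can be reproduced, in principle, by sweeping an SNR threshold over the annotated events. The sketch below assumes that per-event SNR values and per-event correctness flags are available; this is an assumed procedure, not necessarily the authors' exact one.

```python
import numpy as np

def accuracy_vs_snr(snr, correct, thresholds):
    """snr: per-event SNR (dB); correct: per-event boolean (classified correctly)."""
    snr, correct = np.asarray(snr), np.asarray(correct, dtype=bool)
    acc, kept = [], []
    for t in thresholds:
        mask = snr >= t                       # retain events at or above the threshold
        acc.append(100.0 * correct[mask].mean() if mask.any() else np.nan)
        kept.append(100.0 * mask.mean())      # % of all events retained
    return np.array(acc), np.array(kept)
```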
Table 2. Event-level performance metrics for each class (precision, sensitivity, and F1-score) for our proposed HMM-DNN system, with the number of events per class (N). Results are ranked by F1-score, from highest to lowest.
Class Label | Precision HMM-DNN | Sensitivity HMM-DNN | F1-Score HMM-DNN | N
FWS | 94.05% | 88.50% | 91.19% | 4671
AA | 83.01% | 91.51% | 87.05% | 386
AG | 97.02% | 75.84% | 85.13% | 103
SWD | 96.40% | 69.76% | 80.95% | 36
AN | 81.03% | 79.01% | 80.01% | 31
FWD | 83.33% | 67.23% | 74.42% | 254
S23 | 75.55% | 67.07% | 71.06% | 104
FWD2 | 71.66% | 67.34% | 69.43% | 22
ERQ | 55.01% | 88.33% | 67.80% | 270
FWD3 | 83.11% | 57.14% | 67.72% | 55
SWU | 63.61% | 70.45% | 66.86% | 62
S13 | 60.00% | 74.35% | 66.41% | 44
AAFWS | 50.13% | 73.43% | 59.58% | 21
S21 | 46.02% | 68.00% | 54.89% | 28
13H | 32.35% | 95.21% | 48.29% | 27
MI | 63.91% | 28.57% | 39.49% | 10
UND | 45.56% | 33.33% | 38.50% | 49
SEP | 20.26% | 46.21% | 28.17% | 25
S22 | 23.39% | 12.96% | 16.68% | 17
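The F1-scores in Table 2 are consistent with the standard harmonic mean of precision and sensitivity; for example, for the FWS class:

$$
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}
    = \frac{2 \times 94.05\% \times 88.50\%}{94.05\% + 88.50\%} \approx 91.19\%.
$$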
Table 3. Event-level performance metrics for the HMM-DNN and HMM-GMM systems when only classes with a minimum of 70 or 500 exemplars are considered. I, D, and S denote insertion, deletion, and substitution errors, and WER is the resulting error rate.
System | Minimum Number of Exemplars | Event-Level Accuracy | I | D | S | WER (%) | N events
HMM-DNN | 70 | 84.46% | 589 | 757 | 209 | 25.02 | 6215
HMM-GMM | 70 | 82.35% | 548 | 889 | 208 | 26.47 | 6215
HMM-DNN | 500 | 89.01% | 460 | 583 | 100 | 23.75 | 4812
HMM-GMM | 500 | 84.63% | 351 | 861 | 94 | 27.14 | 4812
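The WER column in Table 3 is consistent with the usual error rate over insertions (I), deletions (D), and substitutions (S) relative to the number of reference events; for example, for the HMM-DNN system with a minimum of 70 exemplars:

$$
\mathrm{WER} = \frac{I + D + S}{N_{\mathrm{events}}} = \frac{589 + 757 + 209}{6215} \approx 25.02\%.
$$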