Article

An Integrated Framework with ADD-LSTM and DeepLabCut for Dolphin Behavior Classification

Shih-Pang Tseng, Shao-En Hsu, Jhing-Fa Wang and I-Fan Jen

1 School of Software and Big Data, Changzhou College of Information Technology, No. 22, Mingxin Middle Road, Changzhou 213164, China
2 School of Information Science and Technology, Sanda University, Shanghai 201209, China
3 Department of Electrical Engineering, National Cheng Kung University, Tainan 701401, Taiwan
4 Farglory Ocean Park, Hualien 974, Taiwan
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(4), 540; https://doi.org/10.3390/jmse12040540
Submission received: 4 February 2024 / Revised: 2 March 2024 / Accepted: 16 March 2024 / Published: 24 March 2024
(This article belongs to the Section Marine Biology)

Abstract

Caring for dolphins is a delicate process that requires experienced caretakers to pay close attention to their behavioral characteristics. However, caretakers may lack experience or be unable to give their full attention, which can lead to misjudgment or oversight. To address these issues, a dolphin behavior analysis system was designed to assist caretakers in making accurate assessments. This study applied image preprocessing techniques to reduce sunlight reflection in the pool and enhance the outlines of the dolphins, making their movements easier to analyze. Each dolphin was annotated with 11 key points using the open-source tool DeepLabCut, which accurately marked the body parts used for skeletal detection. The AquaAI Dolphin Decoder (ADD) was then used to analyze six dolphin behaviors. To improve behavior recognition accuracy, a long short-term memory (LSTM) neural network was introduced, and the ADD and LSTM models were integrated into the ADD-LSTM system. Several classification models, including unidirectional and bidirectional LSTMs, GRU, and SVM, were compared. The results showed that the ADD module combined with a double-layer bidirectional LSTM achieved high accuracy in dolphin behavior analysis, with the accuracy rate for each behavior exceeding 90%.

1. Introduction

Marine sustainability refers to the responsible and balanced use of marine resources and ecosystems to ensure their long-term health and resilience [1]. Marine sustainability and the well-being of dolphins are interconnected because dolphins are marine mammals that depend on healthy and sustainable marine ecosystems for their survival. Ensuring the sustainability of marine environments is crucial for protecting dolphins and other marine species [2]. Dolphins have always been a subject of fascination for biologists and ecologists because of their highly intelligent nature and complex social behavior [3,4,5,6,7]. In Taiwan, these marine mammals are essential for the natural world, serving as vital indicators of the health of marine and coastal ecosystems [8,9]. However, understanding dolphin behavior and ecology is a challenging task.
Dolphin behavior is quite intricate, and it can be categorized into psychological and physiological aspects. We can typically observe dolphin behavior through their social activities [3], play [6,7], and resting times [5]. Nonetheless, dolphins also exhibit a behavior known as “vomiting fish”, which they often use as a method to catch prey [10].
Some differences can be expected between the behaviors of wild dolphins and those of dolphins under human care, and the main effort in dolphin conservation should focus on wild populations. However, capturing long-term video of a wild dolphin group is not yet practical. This study therefore focuses on dolphins under human care in order to develop a feasible dolphin behavior observation and recognition system. At this stage, the proposed system can be used to improve the quality of care for dolphins under human care; it can later be extended to research and conservation work on wild dolphin groups.
Manually observing the daily behaviors of dolphins takes a lot of time and effort. Prolonged human observation can lead to visual fatigue, exhaustion, and inconsistent interpretation standards caused by various factors, including subjective opinions, varying levels of experience, and decreases in concentration. As a result, caretakers invest substantial human resources and time in searching for potential abnormal dolphin behaviors in image analysis but still face the risk of interpretation errors due to human factors.
To address these challenges, a research project has been undertaken at Farglory Ocean Park. The project involves deploying cameras above dolphin pools to capture long-term footage, which is then analyzed using artificial intelligence to detect any abnormal behaviors in dolphins. By doing so, this system effectively lowers the chances of missing unusual behaviors and making incorrect judgments. It plays a vital role in the early assessment and treatment of dolphin care, and the analysis results serve as a reference for diagnosis. Additionally, it helps reduce costs related to dolphin care operations, such as manpower and time, within the ocean park.

Objectives

In this study, several important modules were proposed:
  • Image preprocessing module: To reduce the impact of oblique sunlight causing water surface reflections, this module employs image enhancement techniques in two steps: contrast and brightness adjustment [11], and sharpening processing [12]. These steps enhance the contours of dolphins in the images, ensuring the accuracy and effectiveness of subsequent analysis.
  • Pose estimation module: The DeepLabCut [13,14] deep learning model was used to identify the dolphin skeletons in the images. This model was trained on a pre-trained ResNet50 [15] architecture with 11 defined key points on the dolphin skeleton. It was capable of recognizing the skeletons of multiple dolphins while keeping the root-mean-square error (RMSE) of each skeleton below 10 pixels.
  • Behavior analysis module: We developed a custom behavior analysis module named AquaAI Dolphin Decoder with long short-term memory (ADD-LSTM). This module first performs a preliminary analysis of the dolphin key points produced by the pose estimation model, categorizing dolphin behaviors into six types based on set thresholds: vomiting fish, side swimming, swimming together, playing with toys, resting by the shore, and hooking the swimming ring. A double-layer bidirectional LSTM recurrent neural network [16,17] then refines the classification, achieving a behavior recognition accuracy of 94.3%.

2. Related Works

2.1. Paper Survey of Dolphin Behavior

Observing the daily behavior of dolphins is crucial to assessing their physiological and psychological well-being. Effective identification and recording methods are required to monitor their behavior accurately. We compiled the characteristics and descriptions of dolphin behavior mentioned by Jensen et al. [4] and other academic papers, and presented this information in Table 1.

2.2. Paper Survey of Dolphin Detection

In Section 2.2.1, we discuss various existing methods and technologies for dolphin detection, while Section 2.2.2 highlights the limitations and challenges associated with these methods.

2.2.1. Existing Approaches for Dolphin Detection

Sensor-based methods: Sensor-based methods for dolphin identification involve utilizing their physiological characteristics. These methods typically require attaching sensors to the backs of dolphins (Figure 1) to detect and analyze physiological signals related to the dolphins, such as their position, acceleration, and heart rate [19,20,21,22].
Vision-based tracking methods: Karnowski et al. [23] proposed an algorithm that can detect and track dolphins automatically. They compared the performance of two methods for dolphin detection: robust principal component analysis (RPCA) and the Gaussian mixture model (GMM). Additionally, they created the first dataset for dolphin detection, which included ground truth data. By utilizing the detection results, they were able to initialize a real-time compressed tracking algorithm, which enabled automated dolphin tracking.
Gabaldon et al. [24] proposed a framework based on deep learning to monitor and analyze dolphin behavior in artificially controlled environments. They used convolutional neural networks (CNNs) and the Faster R-CNN algorithm to detect dolphins in video footage. The study utilized Kalman filtering for post-processing to obtain short-term trajectories and kinematic information about the dolphins. The framework also employed heatmaps based on position and velocity to analyze the spatial utilization patterns of dolphins. They also calculated a motion diversity index, joint entropy, to reveal daily patterns in dolphin activity levels. The results demonstrated that this framework can enable long-term automated monitoring and analysis of dolphin behaviors such as resting, clockwise, and counterclockwise swimming.
Pose estimation-based methods: Pose estimation-based methods are used to identify and interpret the posture or joint positions of dolphins from images. These methods utilize tools such as OpenPose [25] or DeepLabCut [13,14] to recognize key points on dolphins. This involves analyzing the body structures and movements of dolphins in captured images or videos, which allows for detailed behavioral studies.
Mathis et al. [13] proposed an animal skeleton detection tool. They used the transfer learning principle to achieve high-precision animal body part pose estimation based on a small amount of annotated data. The core of this method is to use a pre-trained deep neural network (ResNet-50) to extract features, and then fine-tune it on a small amount of user-defined annotation data. Later, Lauer et al. [14] extended the functionality of the DeepLabCut tool to enable multi-animal pose estimation, recognition, and tracking. The research addresses complex challenges in multi-animal scenarios such as occlusion, similar appearance between animals, and frequent interactions.

2.2.2. Limitations of Current Approaches

Sensor-based methods: Using sensors to detect dolphins has been found to be an effective and accurate method. However, it can have adverse effects on the health of the dolphins. Van der Hoop et al. [26] studied the potential risks of attaching sensors to dolphins, particularly regarding their well-being and behavior. They discovered that installing biotelemetry or recording devices on dolphins could increase the hydrodynamic resistance of their streamlined bodies, affecting their posture, swimming patterns, and energy balance.
Vision-based tracking methods: Significant progress has been made in dolphin identification through vision-based tracking methods. These techniques use RPCA for background subtraction and compressed tracking algorithms for automated dolphin tracking, which deliver a precision rate of 78.8% in dolphin identification [23]. Another approach is to use neural networks like Faster R-CNN and Kalman filters for dolphin recognition and trajectory generation, which achieve an accuracy rate of 81% in dolphin identification [24].
However, it is worth noting that these methods may only provide insight into dolphin location and trajectories, and may not capture more complex behaviors such as social interactions or distinct swimming patterns. Additionally, the processing of data is complex and requires significant computational resources and time. Therefore, image tracking has limitations in terms of accuracy, individual recognition, and behavior monitoring.
Pose estimation-based methods: It is common to use pose estimation methods to recognize different animal behaviors. For instance, DeepLabCut and support vector machine (SVM) models have been adopted to identify cattle behavior [27] and analyze the movements of primates and pigs [28,29]. However, there has been a scarcity of research on using pose estimation techniques to recognize dolphin skeletons. The only study conducted in this area was by Qi et al. [30], who used OpenPose technology for dolphin skeleton detection. Nevertheless, there has been no research utilizing DeepLabCut for similar dolphin skeleton detection.
It is essential to note that OpenPose is mainly designed for human joint point recognition, and its key point recognition rate for animals is not very high. Washabaugh et al. [31] compared four open-source pose estimation methods (OpenPose, TensorFlow MoveNet Lightning, TensorFlow MoveNet Thunder, and DeepLabCut) for human posture analysis. The experimental results showed that OpenPose performed the best for human gait analysis. However, the authors also noted that although OpenPose may outperform other pose estimation methods for human gait, DeepLabCut might perform better for pose estimation in animal models. Therefore, the purpose of this study is to explore the use of DeepLabCut for dolphin skeleton detection and compare the accuracy of different methods.

2.3. Paper Survey of Behavior Classification

Pose estimation-based techniques have enabled more accurate methods of analyzing animal behavior. Various classification models, such as support vector machines (SVMs), long short-term memory networks (LSTMs), convolutional LSTMs, and others, are used for this purpose. For instance, Khin et al. [27] employed DeepLabCut and an SVM model to study the behavior and skeleton of cattle. They divided the body of the cattle into eight key parts and utilized the SVM model to classify behaviors such as standing, drinking water, eating, sitting down, and tail raised. The average accuracy achieved for each behavior was 88.75%.
Liu et al. [17] conducted a study on identifying mouse behavior using an improved DeepLabCut network for key point detection and behavior recognition through convolutional long short-term memory (ConvLSTM) networks. The authors utilized an enhanced DeepLabCut key point detection algorithm to identify specific points on the mouse's body, such as the nose, ears, and tail base. Then, they used ConvLSTM networks to analyze the data from these key points and classify behaviors such as walking, resting, and grooming.
After reviewing the studies mentioned above, it has become evident that pose estimation holds great importance in analyzing animal behavior. However, this technique is not used extensively in the study of dolphin behavior. Therefore, this research aims to implement pose estimation methods to analyze dolphin behavior. The study will also compare the effectiveness of various classification models and propose an optimal method for analyzing dolphin behavior.

3. Materials and Methods

3.1. System Overview

The architecture of the system used in this study is shown in Figure 2. At Farglory Ocean Park, three network cameras have been installed above the pools to capture images of dolphins. The images are transmitted to a PC and undergo four stages of processing. In the first stage, an image preprocessing module prepares the images for further analysis by the other modules. The second stage involves a pose estimation module, which locates various key points on the dolphins. In the third stage, the key point data are sent to a behavior analysis module for the classification of six different types of behavior. Finally, in the fourth stage, the behavior data are consolidated into an Excel file for later analysis by the user.
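To make the data flow concrete, the sketch below strings the four stages together. The helper functions `enhance_frame`, `estimate_keypoints`, and `classify_frame` are hypothetical stand-ins for the modules described in Sections 3.2, 3.3, and 3.4 (sketches of each appear in those sections), and pandas is assumed for the Excel export:

```python
import pandas as pd

def run_pipeline(frames, output_path="dolphin_behaviors.xlsx"):
    """Structural sketch of the four-stage pipeline in Figure 2.

    enhance_frame / estimate_keypoints / classify_frame are hypothetical
    stand-ins for the preprocessing, pose estimation, and behavior
    analysis modules described later in this section.
    """
    records, prev_pts = [], None
    for t, frame in enumerate(frames):
        enhanced = enhance_frame(frame)         # stage 1: image preprocessing
        pts = estimate_keypoints(enhanced)      # stage 2: DeepLabCut key points
        if prev_pts is not None:
            # stage 3: rule-based ADD labels, later refined by the BiLSTM
            records.append({"frame": t, "behavior": classify_frame(prev_pts, pts)})
        prev_pts = pts
    # stage 4: consolidate the behavior log into an Excel file for caretakers
    pd.DataFrame(records).to_excel(output_path, index=False)
```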

3.2. Data Preparation and Image Preprocessing

3.2.1. Data Preparation

At Farglory Ocean Park, six dolphins live in several pools connected by gates that allow them to move between the pools. For our research, we selected one pool of about 150 square meters to observe the daily behavior of the dolphins. To capture the entire pool, we installed three network cameras at a height of approximately 5 m above the water; the camera placement is shown in Figure 3.
The videos we recorded have a resolution of 1280 × 720 pixels and a frame rate of 5 fps. We captured over 200 h of dolphin video footage, which is equivalent to about one month of daytime pool footage. This footage was used as a dataset for our subsequent training and validation phases. However, filming the pools during the daytime can be challenging due to sunlight reflecting on the water. In the next section, we will introduce the method of image preprocessing that we used to address this issue.

3.2.2. Image Preprocessing

When taking photos from an elevated position above a swimming pool, several potential issues can affect the quality of the images. These include blurred images caused by the refraction of light by water waves, as well as obstructions due to the reflection of sunlight. To ensure that the data we used for analysis were of high quality, we performed image enhancement during the preprocessing stage. This helped improve the visibility of the dolphin contours within the images.
Image enhancement refers to the process of enhancing the overall quality, contrast, detail, and resolution of digital images. This paper uses two image enhancement techniques—contrast enhancement and edge enhancement—to reveal details more clearly than in the original image, thereby enhancing the effectiveness of model training and testing. We propose an image processing workflow that combines contrast adjustment [11] and sharpening techniques [12] to improve dolphin identification in images, as demonstrated in Figure 4. The results indicate that this method is highly effective in enhancing the visual clarity of dolphins while preserving the natural color of the image. The image preprocessing is divided into two parts as follows:
  • Contrast and brightness adjustment: Images captured in pool environments commonly have low contrast and uneven brightness due to unstable lighting conditions. To address this issue, we employed CLAHE (contrast limited adaptive histogram equalization) [11] for image enhancement. CLAHE enhances local contrast by dividing the image into small blocks and performing histogram equalization on each, improving contrast without affecting the overall brightness balance. The RGB channels were processed separately, and a comparison of histograms before and after processing demonstrated a significant improvement in contrast, as shown in Figure 5. This step is crucial for enhancing the clarity of the edges and contours of dolphins photographed under varying lighting conditions.
  • Sharpening process: The clarity of details in pool images of dolphins is often reduced by water ripples and other environmental factors. To improve the clarity of the edges and contours of the dolphins, we applied a sharpening process [12]. This step intensified the edge information in the images, significantly increasing the clarity of details and making the morphology and features of the dolphins more pronounced. An adaptive sharpening technique was employed, comparing two different convolution kernels to achieve the optimal sharpening effect while avoiding the introduction of excessive visual noise. The sharpening effects and comparisons can be seen in Figure 6.
After applying the image enhancement techniques mentioned above, we effectively improved the dolphin’s contour and reduced the sunlight reflection. This helped to enhance the recognition accuracy for subsequent pose estimation, resulting in a 3% improvement.
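A minimal sketch of this two-step enhancement with OpenCV is shown below; the CLAHE clip limit, tile size, and the particular sharpening kernel are assumptions, since the final kernel coefficients compared in Figure 6 are not listed:

```python
import cv2
import numpy as np

def enhance_frame(frame_bgr):
    """Contrast/brightness adjustment (per-channel CLAHE) followed by sharpening."""
    # Step 1: CLAHE on each color channel separately, as described above
    # (clip limit and tile size are assumed values).
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    contrast = cv2.merge([clahe.apply(ch) for ch in cv2.split(frame_bgr)])

    # Step 2: sharpen with a 3x3 convolution kernel to strengthen the
    # dolphin contours (a common kernel choice, not the paper's exact one).
    kernel = np.array([[ 0, -1,  0],
                       [-1,  5, -1],
                       [ 0, -1,  0]], dtype=np.float32)
    return cv2.filter2D(contrast, -1, kernel)
```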

3.3. Pose Estimation Model: DeepLabCut

DeepLabCut [13,14] is an open-source tool that uses deep learning techniques to analyze and understand animal movements and positions in video data. It is specifically designed for pose estimation in animal behavior studies. In this study, we used DeepLabCut to analyze the dolphin skeleton.
To label the dolphin skeleton, we extracted one image per second from the filmed video and annotated each dolphin with 11 key points, including the rostrum, melon, dorsal fin, pectoral fin, belly, tail fin, and other body parts. DeepLabCut's transfer learning-based approach enabled us to accurately identify these key points without requiring extensive training data.
Around 1200 dolphin skeleton images were labeled and used to train a model based on the pre-trained ResNet50 network [15]; the network architecture is described in the following section. We evaluated the accuracy of skeleton recognition using the RMSE (root-mean-square error) and PAFs (part affinity fields) [25]. If the results were unsatisfactory, we refined the annotations and repeated the training until the desired performance was achieved.
Eventually, we obtained data on the location of each key point at different times, which were used for deep analysis of the dolphin’s behavioral patterns and characteristics. Figure 7 shows the system flowchart we used for this study.
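For reference, this workflow maps onto DeepLabCut's public Python API roughly as follows (the project name, experimenter, and video paths are placeholders):

```python
import deeplabcut

# Create a project and register the pool videos (paths are placeholders).
config = deeplabcut.create_new_project(
    "dolphin-pose", "farglory", ["videos/pool_cam1.mp4"], copy_videos=True)

deeplabcut.extract_frames(config)           # sample frames (one per second here)
deeplabcut.label_frames(config)             # GUI annotation of the 11 key points
deeplabcut.create_training_dataset(config)  # build the train/test split
deeplabcut.train_network(config, maxiters=220000)
deeplabcut.evaluate_network(config)         # reports train/test pixel errors (RMSE)
deeplabcut.analyze_videos(config, ["videos/pool_cam1.mp4"])  # per-frame key points
```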

3.3.1. ResNet50

The technology behind DeepLabCut is based on transfer learning principles. Its backbone network is ResNet50, a residual neural network that has been pre-trained on the ImageNet dataset for large-scale image recognition. This training enabled it to identify various features, from simple shapes to intricate textures. The structure of ResNet50 is pivotal in this process, as illustrated in Figure 8. When processing images of 1280 × 720 resolution, the image first underwent a 7 × 7 convolutional layer with a stride of 2, followed by a 3 × 3 max pooling layer with the same stride. This marked the entrance of the image into the core architecture of ResNet50. Each stage in ResNet50 is made up of numerous residual blocks, as shown in Figure 9. These residual blocks are crucial components of ResNet50, featuring skip connections that enhance feature learning, prevent information loss during training, and address gradient vanishing.
The ResNet50 network has four stages, with each stage performing a specific operation on the feature map. In Stage 1, the network extracts the initial feature map while maintaining its spatial size but increasing its depth. In the subsequent stages (from Stage 2 to Stage 4), the network deepens the depth and complexity gradually while reducing the spatial size of the feature map. These operations work together to obtain increasingly abstract features from the original image. This process is essential in providing visual information needed in the final stage of detecting dolphin key points.
In DeepLabCut, the input image undergoes five stages of processing using ResNet50. Each stage encodes the features of the image into deeper feature maps. The feature map output of each stage is a transformation of the previous stage, capturing features at different levels by increasing the depth and reducing the spatial size of the feature maps. These feature maps are used to accurately detect dolphin key points, which allows for high-precision analysis of dolphin postures.

3.3.2. The Network Architecture of DeepLabCut

DeepLabCut uses a technique called multiscale feature fusion to detect features of different scales effectively. This technique is illustrated in Figure 10. When processing input images with a resolution of 1280 × 720, the system first goes through a series of downsampling operations using the residual blocks of ResNet50, which achieve a downsampling ratio of 32. To enhance the model’s ability to capture spatial information and fuse multiscale features, additional downsampling steps are implemented in Stage 1 and Stage 2. This involves using a 3 × 3 convolutional layer with a stride of 2 to generate feature maps of different resolutions, which are then combined through a specific fusion strategy to create a comprehensive feature map of size 40 × 23.
The size of the receptive field of feature maps is crucial in detecting key points. This study reveals that the Stage 1 branch’s feature maps have a smaller receptive field, making it ideal for capturing details and low-level features like edges and textures. The Stage 2 branch provides feature maps with a medium-sized receptive field, which is suitable for recognizing medium-scale structures. On the other hand, the original ResNet50’s feature maps have the largest receptive field, making them capable of capturing higher-level features such as the overall shape and motion patterns of the dolphin.
To improve the accuracy of detecting dolphin key points, the feature maps are subjected to a process of multiscale fusion. This process involves upsampling the feature maps by a factor of 8 and applying three 3 × 3 transposed convolution operations. The final output is three feature maps, namely the score map, part affinity field (PAF), and key point feature map, each with a size of 320 × 180. You can see the detailed representation of these feature maps in Figure 11.
DeepLabCut uses a combination of score maps and PAF to locate individual key points and understand their spatial and anatomical context relative to each other. This dual consideration leads to accurate and reliable pose estimation. While the score map indicates the location of key points, PAF reveals how these key points are interconnected.

3.4. Behavior Analysis Model

We obtained the coordinates of various key points on dolphins after conducting a pose estimation analysis. In this paper, we introduce a system for analyzing six different dolphin behaviors, called “AquaAI Dolphin Decoder” (ADD). The six behaviors that ADD can detect are vomiting fish, resting, side swimming, swimming together, playing with balls, and hooking the swimming ring. A detailed description of ADD will be provided in the following section (Section 3.4.1). Once we recognized these six behaviors, which had time sequences, we input the results into a double-layer bidirectional long short-term memory (LSTM) model [32] to improve its accuracy. We will explain the LSTM model in the following sections (Section 3.4.2). Additionally, we also attempted other classification models such as unidirectional LSTM, gated recurrent unit (GRU) [33], and support vector machine (SVM) for comparison. We found that the double-layer bidirectional LSTM model achieved the highest accuracy after comparing the results.

3.4.1. AquaAI Dolphin Decoder (ADD)

Our dolphin behavior analysis process is detailed in Figure 12. This process consists of two main steps:
(1) In the first step, we focus on the motion of the dolphins. We start by calculating the Euclidean distances (EDR and EDT) for the rostrum and tail of the dolphin in each frame; the Euclidean distance measures the straight-line distance between two points. If EDR and EDT exceed the pre-defined threshold of 20, we classify the dolphin as swimming. Additionally, if the distance between two dolphins is less than 50, we categorize the behavior as swimming together, and if we observe the specific posture of the dolphin's belly, we classify it as side swimming. On the other hand, if EDR and EDT are below 20, the dolphin might be resting or vomiting fish. Specifically, if the rostrum and tail of the dolphin are close to the shore, we classify it as resting; if the dolphin's head cannot be seen (possibly submerged) and the tail is vertical relative to the body, we categorize it as vomiting fish.
(2) In the second step, we analyze interactions with objects present in the scene. We measure the Euclidean distance (ED1) between the dolphin's rostrum and the ball, and the Euclidean distance (ED2) between the dolphin's dorsal fin and the swimming ring. If ED1 is less than 30 and the ball has moved more than 20 pixels in the past second, we classify the dolphin as playing with the ball. Similarly, if ED2 is less than 30 and the swimming ring has moved more than 20 pixels in the past second, we categorize the behavior as hooking the swimming ring.
We set these thresholds based on our observations of dolphin behavior and a previous data analysis. This approach enables us to initially differentiate between various behavior patterns exhibited by dolphins.
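The decision rules above can be condensed into a short sketch. Here we read EDR and EDT as the frame-to-frame displacement of the rostrum and tail (our interpretation of the description above), pass the posture tests (near shore, head visible, tail vertical, belly exposed) in as flags because they are described without formulas, and let object interactions take precedence over plain swimming, which is our simplification of the Figure 12 tree:

```python
import math

def dist(p, q):
    """Euclidean distance between two (x, y) key points, in pixels."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def classify_frame(prev_pts, pts, other=None, ball=None, ring=None,
                   ball_moved=0.0, ring_moved=0.0, near_shore=False,
                   head_visible=True, tail_vertical=False, belly_up=False):
    """One pass of the ADD decision rules (all thresholds in pixels)."""
    # Step 2 checks first: object interactions take precedence (our choice).
    if ball is not None and dist(pts["rostrum"], ball) < 30 and ball_moved > 20:
        return "playing with the ball"
    if ring is not None and dist(pts["dorsal_fin"], ring) < 30 and ring_moved > 20:
        return "hooking the swimming ring"

    # Step 1: motion rules based on rostrum/tail displacement between frames.
    edr = dist(prev_pts["rostrum"], pts["rostrum"])
    edt = dist(prev_pts["tail"], pts["tail"])
    if edr > 20 and edt > 20:
        if belly_up:
            return "side swimming"
        if other is not None and dist(pts["rostrum"], other["rostrum"]) < 50:
            return "swimming together"
        return "swimming"
    if near_shore:
        return "resting"
    if not head_visible and tail_vertical:
        return "vomiting fish"
    return "unclassified"
```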

3.4.2. Long Short-Term Memory (LSTM)

The study utilized a tool called AquaAI Dolphin Decoder (ADD) to analyze dolphin behavior. To improve the accuracy of behavior prediction, the study used a type of recurrent neural network (RNN) called the long short-term memory (LSTM) network [32]. LSTM is a specialized RNN that captures temporal dependencies within behavior patterns. It is useful for handling sequential data. Unlike traditional RNNs, LSTM overcomes the issues of vanishing or exploding gradients that arise when dealing with long sequences. This is due to its unique network structure that includes several key components: the forget gate, the input gate, the cell state, and the output gate.
In our research, we utilized a double-layer bidirectional LSTM (BiLSTM) architecture, as depicted in Figure 13. Unlike the standard LSTM, the BiLSTM has bidirectional inputs, which enables it to take into account both the forward and backward contexts of the time series data simultaneously. The first layer of the BiLSTM was designed with 128 units (64 × 2), while the second layer had 64 units (32 × 2). This design enables the model to capture a wide range of features in the first layer and then perform more detailed processing and optimization of these features in the second layer, effectively handling complex time-series data.
We used the AquaAI Dolphin Decoder (ADD) to obtain behavioral temporal data and combined it with the coordinates of the dolphin’s key points. This allowed us to create an integrated data input method that captures subtle variations in dolphin body movements through key point coordinates. Through time-series analysis, this approach provides a comprehensive understanding of dolphin behavior patterns and enables accurate predictions.
To prevent overfitting during the training process, we added a dropout layer after each BiLSTM layer, as illustrated in Figure 14. The dropout layer randomly drops some neural network connections during training, which helps to improve the model’s ability to generalize. For our model, we set the dropout rate to 0.2, which means that approximately 20% of the neurons are randomly ignored during training to reduce the risk of overfitting. In the final part of the model, the BiLSTM network’s output passes through a dense layer (fully connected layer). This dense layer’s task is to transform the complicated features extracted by the BiLSTM layers into specific outputs that correspond to the six specific behavioral categories of dolphins, as shown in Figure 15.
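Under these design choices, the classifier can be sketched in Keras as follows. The window length and per-frame feature dimension are assumptions, since the text combines ADD behavior codes with the 11 key point coordinates but does not state the window size:

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 6   # the six dolphin behaviors
TIMESTEPS = 30    # frames per input window (assumed; not stated in the paper)
FEATURES = 23     # assumed layout: 11 key points x (x, y) + one ADD behavior code

model = models.Sequential([
    layers.Input(shape=(TIMESTEPS, FEATURES)),
    # First BiLSTM layer: 64 units per direction (128 total, as in Figure 13).
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dropout(0.2),   # dropout rate from the text
    # Second BiLSTM layer: 32 units per direction (64 total).
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dropout(0.2),
    # Dense output layer mapping the learned features to the six behaviors.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```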

4. Results

4.1. Experimental Environment

Table 2 provides a detailed list of the hardware specifications and software versions used for training the neural network and developing the dolphin skeleton detection and behavior analysis in this study. The computer setup was located at Farglory Ocean Park. The model proposed in this research was built using TensorFlow, DeepLabCut, and other open-source neural network frameworks. The training process primarily took place on Windows and the Google Colab platform, utilizing an Nvidia RTX 3060 GPU for accelerated computation to improve training efficiency and processing capability.

4.2. Experiment of DeepLabCut

As part of our study, we used nearly 200 h of dolphin footage gathered at Farglory Ocean Park to create our dataset. We annotated a total of 1200 images, identifying both the dolphin skeletons and the environmental features within them. The dataset was then split into a training set (80%) and a test set (20%). Our annotations for the dolphin skeletons included 11 key points, such as the rostrum, melon, dorsal fin, pectoral fin, belly, and tail fin. For the training process, we utilized the ResNet50-based configuration from DeepLabCut with the following parameter settings:
  • Batch size: 8.
  • Learning rate: 0.0001 for the first 7500 iterations.
  • Reduced to 5 × 10⁻⁵ between 7500 and 12,000 iterations.
  • Reduced further to 1 × 10⁻⁵ between 12,000 and 100,000 iterations.
  • Finally reduced to 1 × 10⁻⁶ between 100,000 and 220,000 iterations.
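In DeepLabCut, a piecewise schedule like this lives in the multi_step entry of the project's train/pose_cfg.yaml; below is a sketch of setting it programmatically (the pose_cfg path is a placeholder for the project's actual one):

```python
from deeplabcut.utils import auxiliaryfunctions

# Placeholder path to the training configuration inside the DeepLabCut project.
pose_cfg_path = ("dolphin-pose/dlc-models/iteration-0/"
                 "dolphinsFeb4-trainset80shuffle1/train/pose_cfg.yaml")

cfg = auxiliaryfunctions.read_plainconfig(pose_cfg_path)
# Each pair is [learning_rate, train-until-iteration], matching the list above.
cfg["multi_step"] = [[1e-4, 7500], [5e-5, 12000], [1e-5, 100000], [1e-6, 220000]]
auxiliaryfunctions.write_plainconfig(pose_cfg_path, cfg)
```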
We utilized a technique of gradually adjusting the learning rate to optimize the training process and improve the model’s performance. Furthermore, we plotted the changes in the training error (train error) and testing error (test error) during each iteration, as illustrated in Figure 16. The graph highlights that the model achieved a stable state after 150 k iterations, with the final training error reduced to 5.95 pixels and the testing error reduced to 8.55 pixels. Additionally, we conducted a comparative analysis of the root mean square error (RMSE) values and confidence scores of the model for different body parts. The RMSE formula is presented in Equation (1). A comprehensive performance evaluation and analysis of these results are provided and summarized in Table 3.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{p} - y_{l}\right)^{2}}$$
Here, $y_p$ represents the predicted value and $y_l$ the labeled value, providing a quantitative measure of labeling accuracy.
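Equation (1) translates directly into a few lines of NumPy; this sketch assumes the predicted and labeled key point values arrive as equal-length arrays:

```python
import numpy as np

def rmse(y_pred, y_label):
    """Root-mean-square error of Equation (1), in the units of the inputs
    (pixels, for key point coordinates)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_label = np.asarray(y_label, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_label) ** 2)))
```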

4.3. Experiment of Behavior Analysis Model

For the task of classifying dolphin behavior, we used various metrics to evaluate the effectiveness of our model. These metrics included accuracy, recall, precision, F-score, and area under the ROC curve (AUC). Since the task requires detecting six distinct behaviors, achieving high accuracy and good recall is crucial in order to minimize the number of false negatives. Here are the formulas for accuracy, recall, precision, and the F-score:
$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}}$$
‘Accuracy’ represents the ratio of correctly classified samples to the total number of samples.
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
‘Recall’ represents the ratio of correctly predicted positive samples to the actual number of positive samples. It is also known as sensitivity or the true positive rate.
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
‘Precision’ represents the ratio of correctly predicted positive samples to the total number of samples predicted as positive.
$$\text{F-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F-score is a measure of the classifier performance that considers both precision and recall. It is the harmonic mean of these two metrics.
The AUC (area under the curve) is a critical evaluation metric for classification models. It provides a comprehensive assessment of the classifier’s performance. By analyzing the AUC score, we can gain insights into the model’s ability to distinguish between various categories of dolphin behavior.
  • AUC = 1: This represents a perfect classifier, where there is at least one threshold that can yield perfect predictions. In most practical cases, perfect classifiers are rare.
  • 0.5 < AUC < 1: A classifier with AUC in this range is better than random guessing. When appropriate thresholds are set, it provides predictive value.
  • AUC < 0.5: A classifier with an AUC below 0.5 performs worse than random guessing; however, inverting all of its predictions would yield a classifier that is better than random guessing.
We also used the confusion matrix as an evaluation tool to provide a more comprehensive assessment of the model's performance on the dolphin behavior classification task. In our confusion matrices, each row represents the class predicted by the model, while each column represents the actual class. This allows us to see how accurately the model identified each behavior and how often it was misclassified; for instance, one cell in the matrix shows how many times the model predicted "playing with toys" when the actual behavior was "resting".
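All of these metrics are available off the shelf; the sketch below uses scikit-learn (an assumption, as the paper does not name its evaluation library) to reproduce the accuracy, recall, precision, F-score, per-behavior one-vs-rest AUC, and confusion matrix described above:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.preprocessing import label_binarize

BEHAVIORS = ["vomiting fish", "resting", "side swimming", "swimming together",
             "playing with the ball", "hooking the swimming ring"]

def evaluate(y_true, y_pred, y_score):
    """y_true/y_pred: integer class labels; y_score: per-class probabilities
    (n_samples x 6), needed for the one-vs-rest ROC/AUC curves."""
    print("accuracy:", accuracy_score(y_true, y_pred))
    # Per-behavior precision, recall, and F-score.
    print(classification_report(y_true, y_pred, target_names=BEHAVIORS))
    # One-vs-rest AUC, matching the per-behavior ROC curves in the figures.
    y_bin = label_binarize(y_true, classes=range(len(BEHAVIORS)))
    print("macro AUC:", roc_auc_score(y_bin, y_score, average="macro"))
    # The paper's convention is rows = predicted, columns = actual; sklearn
    # returns rows = actual, so we transpose to match.
    print(confusion_matrix(y_true, y_pred).T)
```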

4.3.1. AquaAI Dolphin Decoder (ADD)

In our research, we utilized our custom-developed AquaAI Dolphin Decoder (ADD) to classify dolphin behaviors. Initially, the model achieved an accuracy of 85.5%, a recall rate of 84.6%, a precision of 85.1%, and an F1 score of 84.8%. In addition, to showcase the model’s performance, we plotted ROC curves for the six behaviors separately, as shown in Figure 17. We also generated and visualized confusion matrices, as shown in Figure 18. Upon observation, we found that the accuracy of identifying behaviors such as vomiting fish, resting, and side swimming was relatively low.

4.3.2. Long Short-Term Memory (LSTM)

To improve the classification accuracy of the six behaviors, we initially employed a double-layer unidirectional LSTM model. During the model training process, we set the number of training epochs to 200 and used 20% of the dataset as validation data. We utilized the Adam optimizer [34] to compile and train the model. After training, the model achieved an accuracy of 90.1%, a recall rate of 90.2%, a precision of 89.4%, and an F1 score of 89.7%. Furthermore, we also plotted ROC curves for these six behaviors separately to visually demonstrate the model’s performance, as shown in Figure 19. Additionally, confusion matrices were generated and visualized, as seen in Figure 20.
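A sketch of this unidirectional variant and its training call follows; the per-layer widths are assumed to mirror the BiLSTM's per-direction sizes, and the batch size is an assumption (the text fixes only the epochs, validation split, and optimizer):

```python
from tensorflow.keras import layers, models

TIMESTEPS, FEATURES, NUM_CLASSES = 30, 23, 6   # same assumed shapes as earlier

uni_lstm = models.Sequential([
    layers.Input(shape=(TIMESTEPS, FEATURES)),
    layers.LSTM(64, return_sequences=True),   # assumed width
    layers.Dropout(0.2),
    layers.LSTM(32),                          # assumed width
    layers.Dropout(0.2),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
uni_lstm.compile(optimizer="adam",            # Adam optimizer [34]
                 loss="categorical_crossentropy", metrics=["accuracy"])

# Training setup from the text: 200 epochs, 20% of the data for validation.
# uni_lstm.fit(x_train, y_train, epochs=200, validation_split=0.2, batch_size=32)
```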

4.3.3. Bidirectional Long Short-Term Memory (BiLSTM)

Subsequently, we introduced a double-layer bidirectional LSTM model for testing, with training parameters identical to the aforementioned unidirectional LSTM model. This model achieved an accuracy of 94.3%, a recall rate of 92.9%, a precision of 93.6%, and an F1 score of 93.2%, demonstrating superior performance compared to the double-layer unidirectional LSTM. For this model, we also plotted ROC curves and confusion matrices, as shown in Figure 21 and Figure 22, respectively.

4.3.4. Gated Recurrent Unit (GRU)

This study also incorporated a gated recurrent unit (GRU) model for analysis. Similar to the LSTM models, the GRU model was evaluated with the same training parameters and dataset. The results showed that the GRU model achieved an accuracy of 90.9%, a recall rate of 91.3%, a precision of 90.3%, and an F1 score of 90.7%. These results were further visualized using ROC curves and confusion matrices, as depicted in Figure 23 and Figure 24, respectively.

4.3.5. Support Vector Machine (SVM)

This study also explored the application of a support vector machine (SVM) in dolphin behavior analysis. We chose an SVM model with a Gaussian radial basis function (RBF) kernel, which performs exceptionally well in handling nonlinear data. However, in the context of dolphin behavior classification, the SVM model exhibited slightly lower performance compared to the LSTM and GRU models.
The SVM model achieved an accuracy of 88.6%, a recall rate of 85.6%, a precision rate of 86.3%, and an F1 score of 85.9%. These results are visually presented through ROC curves and confusion matrices, as shown in Figure 25 and Figure 26, providing an intuitive representation of the SVM model’s performance.
This comparison highlights that SVM slightly underperformed compared to the LSTM and GRU models in terms of accuracy and other metrics. It underscores the advantage of using recurrent neural networks like LSTM and GRU in sequence-based behavioral analysis tasks, as time dependency plays a crucial role in these tasks.
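For completeness, the RBF-kernel SVM baseline can be set up as below; the feature standardization and the C/gamma values are assumptions, and flattening each time window into a fixed-length vector is what discards the ordering information that the recurrent models exploit:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# RBF-kernel SVM for the behavior comparison (hyperparameters assumed).
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True),
)
# Each sample is a window of ADD codes and key point coordinates flattened
# into one vector, e.g. x_windows.reshape(len(x_windows), -1).
# svm.fit(x_train_flat, y_train)
```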

5. Discussion

5.1. Experiment Result Comparison

To further validate our model, we compared it with several popular classification models: ADD combined with SVM, LSTM, and GRU. Table 4 summarizes the performance of these methods on accuracy, recall, precision, and F-score. The double-layer bidirectional LSTM combined with ADD performed best on the dolphin behavior classification task, indicating that this structure handles time-series data better than the alternatives, especially for complex dolphin behavior patterns, and can identify and classify the different behaviors more accurately. This finding underscores the advantages of the double-layer bidirectional LSTM in complex behavior classification tasks and the effectiveness of our model in dolphin behavior analysis.
We also compared the results of this study with the findings of other related research, as listed in Table 5. Each of these works targets a different aspect of dolphin detection, and the comparison shows that our solution, based on DeepLabCut and classification model techniques, addresses a gap in dolphin behavior analysis: it classifies dolphin behaviors accurately and provides reliable observations of their daily activities.

5.2. Ablation Study

In this paper, an ablation study was conducted to evaluate the effect of each module on the system performance. The study aimed to analyze the functionality and performance of each module and determine its contribution to the overall system. The results of the study are presented in Table 6. The experimental design used in the study is described as follows:
  • Baseline experiment: We initially employed the basic model, which utilizes only DeepLabCut and the ADD algorithm for dolphin behavior analysis. The performance metrics obtained for this baseline model are as follows: accuracy of 82.4%, recall of 81.4%, precision of 81.8%, and an F-score of 81.6%.
  • Addition of the preprocessing module: Next, we introduced the preprocessing module and observed its impact on system performance. The results showed that the preprocessing module effectively improved system performance, achieving an accuracy of 85.5%, a recall of 84.6%, a precision of 85.1%, and an F-score of 84.8%.
  • The addition of the double-layer BiLSTM module: Finally, we introduced the double-layer BiLSTM module, further enhancing the system’s performance. The results demonstrate that the inclusion of this module significantly improved the system’s performance, with an accuracy of 94.3%, a recall of 92.9%, a precision of 93.6%, and an F-score of 93.2%.

6. Conclusions

In the past, there has been little research on detecting dolphin behavior, and the existing methods have some limitations. However, in our recent collaboration with Farglory Ocean Park, we developed a new way to observe dolphins under human care. We used image preprocessing techniques to reduce sunlight reflection and enhance dolphin contours, which made the images easier to analyze. We then labeled the dolphin skeletons using DeepLabCut and used the annotations to train a model. The results showed that the root-mean-square error (RMSE) of each skeleton could be kept below 10 pixels.
We also developed a custom analysis module called the AquaAI Dolphin Decoder with long short-term memory (ADD-LSTM). This module analyzes the dolphins’ key point information obtained through the pose estimation model and classifies their daily behaviors into six categories: vomiting fish, resting, side swimming, swimming together, playing with balls, and hooking the swimming ring. We used predefined thresholds for each behavior to make the classification. To provide an even more detailed classification of dolphin behavior, we used a double-layer bidirectional LSTM recurrent neural network. This approach resulted in an accuracy rate of 94.3%, a recall rate of 92.9%, a precision rate of 93.6%, and an F1 score of 93.2%.

6.1. Contributions

The primary focus of this paper is to introduce an innovative method for observing and analyzing dolphin behavior. In the beginning, we used image preprocessing techniques to enhance the dolphin’s contours, which led to an improvement in the accuracy and efficiency of the subsequent analyses. Later, we successfully integrated the DeepLabCut pose estimation tool and a custom classification module (ADD-LSTM), enabling us to effectively analyze long-duration videos with an impressive accuracy rate of 94.3%. The application of this method has significantly reduced the workload of dolphin caregivers and decreased labor costs. This study shows the strong potential of combining pose estimation tools with advanced machine learning techniques in dolphin behavior analysis. It also provides new perspectives and methods for future research in this field.
This paper proposes a feasible dolphin behavior observation and recognition system. At this stage, the proposed system can effectively help caregivers improve the quality of care for dolphins under human care. All animals adjust their behaviors to changes in their living environments, and the system can be enhanced to track and analyze the behaviors of wild dolphin groups. Different wild dolphin groups are expected to face different survival pressures in different environments, and these pressures may markedly change the ratios of the various behaviors. Thus, an enhanced version of the system could be used to directly assess the environmental pressures faced by wild dolphin groups.

6.2. Future Works

We developed a method to identify dolphins at Farglory Ocean Park, which we plan to extend to whale and dolphin rescue centers, like the Sicao Rescue Center. This will enable long-term observations of dolphins and whales without relying heavily on human resources. Our model currently identifies six different dolphin behaviors, and we plan to expand its capabilities in two ways. First, we intend to enhance the model’s scope and accuracy by increasing the recognition of other behaviors. Second, we plan to introduce sound recognition technology to identify and observe dolphin daily breathing sounds, which will provide richer research data for marine biology and improve the efficiency and effectiveness of dolphin conservation and medical rescue efforts.

Author Contributions

Conceptualization, S.-P.T. and J.-F.W.; methodology, S.-P.T. and S.-E.H.; software, S.-E.H.; validation, J.-F.W. and I.-F.J.; resources, I.-F.J.; data curation, I.-F.J.; writing—original draft preparation, S.-E.H.; writing—review and editing, S.-P.T.; visualization, S.-E.H.; supervision, J.-F.W.; project administration, I.-F.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Acknowledgments

This work was partially supported by Farglory Ocean Park, Hualien, Taiwan; partially supported by the Changzhou College of Information Technology (under contract no. “2019KYQD03”); and partially supported by Sanda University (under contract no. “20121ZD05”).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, V.Y.; Lu, D.J.; Han, Y.S. Hybrid Intelligence for Marine Biodiversity: Integrating Citizen Science with AI for Enhanced Intertidal Conservation Efforts at Cape Santiago, Taiwan. Sustainability 2024, 16, 454. [Google Scholar] [CrossRef]
  2. Lehnhoff, L.; Glotin, H.; Bernard, S.; Dabin, W.; Le Gall, Y.; Menut, E.; Meheust, E.; Peltier, H.; Pochat, A.; Pochat, K.; et al. Behavioural Responses of Common Dolphins Delphinus delphis to a Bio-Inspired Acoustic Device for Limiting Fishery By-Catch. Sustainability 2022, 14, 3186. [Google Scholar] [CrossRef]
  3. Shane, S.H.; Wells, R.S.; Würsig, B. Ecology, Behavior and Social Organization of the Bottlenose Dolphin: A Review. Mar. Mammal Sci. 1986, 2, 34–63. [Google Scholar] [CrossRef]
  4. Jensen, A.; Delfour, F.; Carter, T.J. Anticipatory behavior in captive bottlenose dolphins (Tursiops truncatus): A preliminary study. Zoo Biol. 2013, 32(4), 436–444. [Google Scholar] [CrossRef]
  5. Sekiguchi, Y.; Kohshima, S. Resting behaviors of captive bottlenose dolphins (Tursiops truncatus). Physiol. Behav. 2003, 79, 643–653. [Google Scholar] [CrossRef]
  6. Kuczaj, S.A.; Makecha, R.N.; Trone, M.C.; Paulos, R.D.; Ramos, J.A.A. Role of Peers in Cultural Innovation and Cultural Transmission: Evidence from the Play of Dolphin Calves. Int. J. Comp. Psychol. 2006, 19, 223–240. [Google Scholar] [CrossRef]
  7. Kuczaj, S.A.; Eskelinen, H.C. Why do Dolphins Play. Anim. Behav. Cogn. 2014, 1, 113–127. [Google Scholar] [CrossRef]
  8. Ko, F.C.; We, N.Y.; Chou, L.S. Bioaccumulation of persistent organic pollutants in stranded cetaceans from Taiwan coastal waters. J. Hazard. Mater. 2014, 277, 127–133. [Google Scholar] [CrossRef] [PubMed]
  9. Karczmarski, L.; Huang, S.; Wong, W.; Chang, W.L.; Chan, C.; Keith, M. Distribution of a Coastal Delphinid Under the Impact of Long-Term Habitat Loss: Indo-Pacific Humpback Dolphins off Taiwan’s West Coast. Estuaries Coasts 2017, 40, 594–603. [Google Scholar] [CrossRef]
  10. Martins, J.; Pandolfo, L.; Sazima, I. Vomiting Behavior of the Spinner Dolphin (Stenella longirostris) and Squid Meals. Aquat. Mamm. 2004, 30, 271–274. [Google Scholar]
  11. Reza, A.M. Realization of the Contrast Limited Adaptive Histogram Equalization (CLAHE) for Real-Time Image Enhancement. J. VLSI Signal Process. Syst. Signal Image Video Technol. 2004, 38, 35–44. [Google Scholar] [CrossRef]
  12. Gross, H.N.; Schott, J.R. Application of Spectral Mixture Analysis and Image Fusion Techniques for Image Sharpening. Remote Sens. Environ. 1998, 63, 85–94. [Google Scholar] [CrossRef]
  13. Mathis, A.; Mamidanna, P.; Cury, K.M.; Abe, T.; Murthy, V.N.; Mathis, M.W.; Bethge, M. DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 2018, 21, 1281–1289. [Google Scholar] [CrossRef] [PubMed]
  14. Lauer, J.; Zhou, M.; Ye, S.; Menegas, W.; Schneider, S.; Nath, T.; Rahman, M.M.; Santo, V.D.; Soberanes, D.; Feng, G.; et al. Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat. Methods 2022, 19, 496–504. [Google Scholar] [CrossRef] [PubMed]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  16. Wang, Y.; Hayashibe, M.; Owaki, D. Prediction of Whole-Body Velocity and Direction From Local Leg Joint Movements in Insect Walking via LSTM Neural Networks. IEEE Robot. Autom. Lett. 2022, 7, 9389–9396. [Google Scholar] [CrossRef]
  17. Liu, R.; Zhu, J.; Rao, X. Murine Motion Behavior Recognition Based on DeepLabCut and Convolutional Long Short-Term Memory Network. Symmetry 2022, 14, 1340. [Google Scholar] [CrossRef]
  18. Baker, I.; O’Brien, J.; McHugh, K.; Berrow, S. An Ethogram for Bottlenose Dolphins (Tursiops truncatus) in the Shannon Estuary, Ireland. Aquat. Mamm. 2017, 43, 594–613. [Google Scholar] [CrossRef]
  19. Zhang, D.; Shorter, K.A.; Rocho-Levine, J.; van der Hoop, J.M.; Moore, M.J.; Barton, K. Behavior Inference From Bio-Logging Sensors: A Systematic Approach for Feature Generation, Selection and State Classification. In Proceedings of the ASME 2018 Dynamic Systems and Control Conference, Atlanta, GA, USA, 30 September–3 October 2018. [Google Scholar]
  20. Lauderdale, L.K.; Shorter, K.A.; Zhang, D.; Gabaldon, J.; Mellen, J.D.; Walsh, M.T.; Granger, D.A.; Miller, L.J. Bottlenose dolphin habitat and management factors related to activity and distance traveled in zoos and aquariums. PLoS ONE 2021, 16, e0250687. [Google Scholar] [CrossRef]
  21. Shorter, K.A.; Shao, Y.; Ojeda, L.V.; Barton, K.; Rocho-Levine, J.; van der Hoop, J.; Moore, M.J. A day in the life of a dolphin: Using bio-logging tags for improved animal health and well-being. Mar. Mammal Sci. 2017, 33, 785–802. [Google Scholar] [CrossRef]
  22. Aoki, K.; Watanabe, Y.; Inamori, D.; Funasaka, N.; Sakamoto, K.Q. Towards non-invasive heart rate monitoring in free-ranging cetaceans: A unipolar suction cup tag measured the heart rate of trained Risso’s dolphins. Philos. Trans. R. Soc. B 2021, 376, 20200225. [Google Scholar] [CrossRef]
  23. Karnowski, J.; Hutchins, E.L.; Johnson, C.M. Dolphin Detection and Tracking. In Proceedings of the 2015 IEEE Winter Applications and Computer Vision Workshops, Waikoloa, HI, USA, 6–9 January 2015; pp. 51–56. [Google Scholar]
  24. Gabaldon, J.; Zhang, D.; Lauderdale, L.K.; Miller, L.J.; Johnson-Roberson, M.; Barton, K.; Shorter, K.A. Computer-vision object tracking for monitoring bottlenose dolphin habitat use and kinematics. PLoS ONE 2022, 17, e0254323. [Google Scholar] [CrossRef] [PubMed]
  25. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 43, 172–186. [Google Scholar] [CrossRef] [PubMed]
  26. van der Hoop, J.M.; Fahlman, A.; Hurst, T.; Rocho-Levine, J.; Shorter, K.A.; Petrov, V.; Moore, M.J. Bottlenose dolphins modify behavior to reduce metabolic effect of tag attachment. J. Exp. Biol. 2014, 217, 4229–4236. [Google Scholar] [CrossRef] [PubMed]
  27. Khin, M.P.; Zin, T.T.; Mar, C.C.; Tin, P.; Horii, Y. Cattle Pose Classification System Using DeepLabCut and SVM Model. In Proceedings of the 2022 IEEE 11th Global Conference on Consumer Electronics (GCCE), Osaka, Japan, 18–21 October 2022; pp. 494–495. [Google Scholar]
  28. Labuguen, R.T.; Bardeloza, D.K.; Negrete, S.B.; Matsumoto, J.; Inoue, K.; Shibata, T. Primate Markerless Pose Estimation and Movement Analysis Using DeepLabCut. In Proceedings of the 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Spokane, WA, USA, 30 May–2 June 2019; pp. 297–300. [Google Scholar]
  29. Farahnakian, F.; Heikkonen, J.; Björkman, S. Multi-pig Pose Estimation Using DeepLabCut. In Proceedings of the 2021 11th International Conference on Intelligent Control and Information Processing (ICICIP), Dali, China, 3–7 December 2021; pp. 143–148. [Google Scholar]
  30. Qi, H.; Xue, M.; Peng, X.R.; Wang, C.; Jiang, Y. Dolphin movement direction recognition using contour-skeleton information. Multimed. Tools Appl. 2020, 82, 21907–21923. [Google Scholar] [CrossRef]
  31. Washabaugh, E.P.; Shanmugam, T.A.; Ranganathan, R.; Krishnan, C. Comparing the accuracy of open-source pose estimation methods for measuring gait kinematics. Gait Posture 2022, 97, 188–195. [Google Scholar] [CrossRef]
  32. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  33. Chung, J.; Gülçehre, Ç.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  34. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Figure 1. The sensor attached to the back of the dolphin.
Figure 2. System overview.
Figure 3. The experimental environment: all cameras are from above and all videos are top-view.
Figure 4. The block diagram of preprocessing.
Figure 5. RGB color histogram after CLAHE enhancement.
Figure 6. Comparative analysis of image processing techniques.
Figure 7. The block diagram of DeepLabCut.
Figure 8. The architecture diagram of ResNet50.
Figure 9. ResNet50 network residual block diagram.
Figure 10. DeepLabCut’s network architecture diagram.
Figure 11. The maps of the dolphins.
Figure 12. Dolphin behavior classification tree.
Figure 13. The architecture of double-layer BiLSTM.
Figure 14. The BiLSTM layer with the dropout model.
Figure 15. The six behaviors of dolphins. (a) Swimming together, (b) rest, (c) vomit fish, (d) play the toys, (e) side swimming, and (f) hook the swimming ring.
Figure 16. Training error and testing error for each iteration.
Figure 17. The ROC curve of ADD.
Figure 18. The confusion matrix of ADD.
Figure 19. The ROC curve of LSTM.
Figure 20. The confusion matrix of LSTM.
Figure 21. The ROC curve of BiLSTM.
Figure 22. The confusion matrix of BiLSTM.
Figure 23. The ROC curve of GRU.
Figure 24. The confusion matrix of GRU.
Figure 25. The ROC curve of SVM.
Figure 26. The confusion matrix of SVM.
Table 1. Descriptions of the behaviors [4,18].

No. | Behavior | Description
1 | Spy hopping [4] | Vertical in the water with its head above the surface.
2 | Swim together [3] | Multiple dolphins swimming together.
3 | Side swimming [4] | Exposing their bellies while swimming.
4 | Jump [4] | Temporarily leaping out of the water and then returning to the surface.
5 | Play the toys [7] | Playing with toys using their fins or heads.
6 | Vomit fish [10] | Submerging in a head-down position at the bottom of the water while extending their tail fins vertically out of the water surface.
7 | Rest by the shore [4] | Remaining motionless with the dorsal fin and head above the water surface.
8 | Looking [4] | Floating on their side at the surface of the water, with one eye above the water level for observation.
Table 2. Experimental environment.

Item | Specification
Operating system | Windows
CPU | Intel i7-12700K
GPU | Nvidia RTX 3060
RAM | 32 GB
Software version | Python 3.8 + TensorFlow 2.10.0 + DeepLabCut 2.2.2
Table 3. RMSE values and confidence scores for each key point.

Key Point | RMSE | Confidence Score
Rostrum | 8.29 | 95.9%
Melon | 5.76 | 97.8%
Dorsal fin | 5.06 | 97.3%
Pectoral fin | 3.8 | 98.3%
Bodypart1 | 6.77 | 96.5%
Bodypart2 | 4.53 | 97.9%
Bodypart3 | 5.81 | 97.1%
Tail top | 6.49 | 96.7%
Tail left | 3.58 | 98.5%
Tail right | 3.24 | 98.7%
Belly | 3.98 | 98.1%
Table 4. Comparison of different classification models.

Method | Accuracy | Recall | Precision | F-Score
ADD | 85.5% | 84.6% | 85.1% | 84.8%
ADD+SVM | 88.6% | 85.6% | 86.3% | 85.9%
ADD+LSTM | 90.1% | 90.2% | 89.4% | 89.7%
ADD+GRU | 90.9% | 91.3% | 90.3% | 90.7%
ADD+BiLSTM | 94.3% | 92.9% | 93.6% | 93.2%
Table 5. Comparison of different approaches with our proposed system.

Method | Purpose | Accuracy | Recall | Precision | F-Score
Compressive Tracking [23] | Detect and track dolphins | - | 75.7% | 78.8% | 77.2%
OpenPose [30] | The angles of dolphin swimming | 86% | 85% | 81% | 82.9%
Faster R-CNN [24] | Track the trajectories of dolphins | 81% | 80.4% | 82.3% | 81.3%
Ours | Identify the daily behaviors of dolphins | 94.3% | 92.9% | 93.6% | 93.2%
Table 6. An ablation study of the proposed method with various models.

Setting | Accuracy | Recall | Precision | F-Score
Baseline | 82.4% | 81.4% | 81.8% | 81.6%
+Preprocessing | 85.5% | 84.6% | 85.1% | 84.8%
+Double-Layer BiLSTM | 94.3% | 92.9% | 93.6% | 93.2%
