Consequently, autonomous underwater intervention must draw on the most promising available Artificial Intelligence (AI) technologies, namely Deep Learning (DL) and Computer Vision (CV).
Related Work
Unmanned robots need to understand their complex environment to achieve fully autonomous operation, with object detection being the fundamental low-level task [4]. Achieving such autonomy is even more demanding for underwater vehicles because of the challenging environmental conditions. Underwater vehicles are equipped with various sensors and instruments such as GPS, cameras, LiDAR, and sonar [6,7]. Cameras are essential because they allow visual interaction between the user/operator and the vehicle [8].
During the past few years, object detection models have become more sophisticated and accurate than ever before, and they are able to take advantage of modern embedded systems [4]. Landmark architectures that revolutionised modern CV applications include the R-CNN family of models [9,10,11], the YOLO architecture and its successive versions [12,13], and Feature Pyramid Networks (FPN) [4,14]. Girshick et al. [9] introduced an algorithm designed to overcome the problem of exhaustively scanning large regions during object detection. The model performs a selective search on the image, looking for potential objects (region proposals); instead of detecting and classifying one large region, it divides the image into smaller candidate regions, which, however, increases the total training time. The Fast R-CNN model [4,10] was introduced to address the drawbacks of the original model and to make it faster. The two models follow the same approach, with the difference that in Fast R-CNN the input image is fed to the CNN to generate feature maps rather than processing each region proposal separately, which increased the model's speed and accuracy. The Mask R-CNN model [4,11] was introduced to handle object detection in images with complex backgrounds, detecting objects with high accuracy while also producing segmentation masks.
All of the region-based methods above localise the object inside the image using candidate areas; the network does not analyse the image in its entirety but focuses on the regions most likely to contain an object. You Only Look Once (YOLO) is an object detection algorithm that differs significantly from these region-based techniques: a single neural network predicts the bounding boxes and the class probabilities for those boxes. YOLO performs object detection faster than conventional object detection algorithms, with speeds ranging from 45 to 155 frames per second. The drawback of the YOLO algorithm is that it struggles to detect small objects in an image, although this improves in more recent versions.
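The single-pass idea behind YOLO can be illustrated by how a grid of predictions is decoded into detections. The following is a minimal, illustrative sketch (the tensor layout, cell size, and threshold are assumptions for exposition, not the published model):

```python
import numpy as np

def decode_yolo_grid(pred, conf_thresh=0.5):
    """Decode a YOLO-style S x S grid of predictions.

    pred[i, j] = (x, y, w, h, objectness, class scores...), where x, y are
    box-centre offsets within cell (i, j) and w, h are relative to the image.
    Returns a list of (box, class_id, score) for confident cells.
    """
    S = pred.shape[0]
    detections = []
    for i in range(S):
        for j in range(S):
            x, y, w, h, obj = pred[i, j, :5]
            class_probs = pred[i, j, 5:]
            cls = int(np.argmax(class_probs))
            score = obj * class_probs[cls]  # class-specific confidence
            if score >= conf_thresh:
                # convert cell-relative centre to image coordinates in [0, 1]
                cx, cy = (j + x) / S, (i + y) / S
                detections.append(((cx, cy, w, h), cls, float(score)))
    return detections
```

Because every cell is scored in one forward pass, the whole image is processed at once, which is what gives YOLO its speed advantage over region-based pipelines.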
Finally, FPN [14] is not a standalone object detector; it is a feature extractor that operates in conjunction with object detectors such as R-CNN and Fast R-CNN. It accepts a single-scale image of arbitrary size as input and generates proportionally scaled feature maps at several levels in a fully convolutional fashion, independently of the backbone convolutional architecture. It is a general approach for constructing feature pyramids inside deep convolutional networks for tasks such as object detection.
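The top-down pathway with lateral connections that FPN builds over backbone features can be sketched with plain arrays. In this illustrative outline, the 1x1 lateral convolutions are replaced by simple channel-mixing matrices, so it is a conceptual sketch rather than the published architecture:

```python
import numpy as np

def nearest_upsample2x(x):
    """Nearest-neighbour 2x spatial upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features, lateral_weights):
    """Minimal FPN-style top-down pathway.

    features: backbone maps ordered fine -> coarse, each of shape
    (C, H, W) with H and W halving at every level.
    lateral_weights: per-level (C_out, C_in) matrices standing in for the
    1x1 lateral convolutions.
    Returns pyramid maps in the same fine -> coarse order.
    """
    # 1x1 lateral "convolution" = channel mixing at every spatial position
    laterals = [np.einsum('oc,chw->ohw', w, f)
                for w, f in zip(lateral_weights, features)]
    pyramid = [laterals[-1]]          # coarsest level starts the pathway
    for lat in reversed(laterals[:-1]):
        top = nearest_upsample2x(pyramid[0])
        pyramid.insert(0, lat + top)  # merge upsampled coarse map + lateral
    return pyramid
```

Each finer level thus receives semantically strong context from above while keeping its own spatial resolution, which is the property detectors exploit.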
When applying deep learning to image classification or object detection, one of the most frequent obstacles is a lack of data. Applying data augmentation techniques to a dataset in order to expand its size and variety is a trial-and-error way of dealing with this shortage [15]. The traditional, or "classic", method of data augmentation relies on libraries such as those described in [16,17], which provide flexible, easy-to-use implementations of a variety of augmentations that increase the size and diversity of the dataset. These libraries include transformations such as cropping, blurring, changes to colour saturation and contrast, greyscale conversion, rotation, and swapping or shifting colour channels.
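A minimal version of such a "classic" augmentation pipeline can be written with NumPy alone; the particular transforms and parameter ranges below are illustrative and do not mirror any specific library:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop=24):
    """Apply a random flip, random crop, and brightness jitter to an
    (H, W, 3) uint8 image -- a 'classic' augmentation pipeline in miniature."""
    out = image
    if rng.random() < 0.5:               # random horizontal flip
        out = out[:, ::-1]
    h, w = out.shape[:2]
    top = rng.integers(0, h - crop + 1)  # random crop to crop x crop
    left = rng.integers(0, w - crop + 1)
    out = out[top:top + crop, left:left + crop]
    gain = rng.uniform(0.8, 1.2)         # brightness jitter
    return np.clip(out.astype(np.float32) * gain, 0, 255).astype(np.uint8)

# each call yields a new variant of the same source image
batch = [augment(np.full((32, 32, 3), 128, dtype=np.uint8)) for _ in range(4)]
```

Applied on the fly during training, such random transforms multiply the effective dataset size at negligible cost.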
For more project-specific tasks, the standard data augmentation method cannot generate images that are close to the desired real-world data, and it requires a significant amount of time and trial and error to produce acceptable results. DL models such as Generative Adversarial Networks (GANs), CycleGAN, and U-Nets are therefore the current state-of-the-art methods for augmenting datasets and increasing their size [18,19,20]. GANs are mainly used to produce synthetic images that follow the same probability distribution as the real images. CycleGAN is a well-known GAN architecture typically used to learn image transformations across different styles, whereas U-Net models focus more on semantic and structural differences between real and artificial content.
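The adversarial setup behind these generative augmenters can be stated compactly. The following is an illustrative sketch of the standard (non-saturating) GAN losses, with `D` and `G` as stand-in callables rather than any specific cited model:

```python
import numpy as np

def gan_losses(D, G, real_batch, noise_batch, eps=1e-8):
    """Non-saturating GAN losses: the discriminator D learns to score real
    samples high and generated ones low; the generator G learns to fool D."""
    fake = G(noise_batch)
    d_loss = -np.mean(np.log(D(real_batch) + eps)
                      + np.log(1.0 - D(fake) + eps))
    g_loss = -np.mean(np.log(D(fake) + eps))  # generator wants D(fake) -> 1
    return d_loss, g_loss
```

Training alternates gradient steps that decrease `d_loss` and `g_loss` in turn; at equilibrium the generated samples follow the same distribution as the real data, which is precisely what makes GANs useful for augmentation.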
In addition, the difficulties presented by the underwater environment make data collection a laborious job that requires specialised personnel and equipment. As a consequence, it is difficult to build projects that need large underwater datasets. The underwater habitat, the lighting conditions during image capture, and the task for which the image was shot all determine the unique problems associated with underwater photography [21]. When it comes to obtaining data for deep learning models, many researchers have therefore focused primarily on underwater image enhancement and restoration to improve the quality of images obtained from underwater environments [2,22,23].
Underwater image enhancement aims to improve an image's visual quality and does not usually take into account the physical properties of light in water, such as the attenuation coefficient or light scattering [22]. It is generally agreed that image enhancement is faster to implement and simpler to understand than image restoration. Image restoration is a more sophisticated process that must account for the physical behaviour of light in water, which attenuates and scatters light differently than air does. It requires information on the type of water present, whether coastal or oceanic, as well as on how light propagates through it [2,23]. These methods only produce satisfactory outcomes in a controlled underwater environment, and they are difficult to put into practice in the real world because of the complexity of their implementation and the large number of parameters that need to be taken into consideration [2].
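Restoration methods of this kind typically invert a physical image-formation model. A minimal sketch of the widely used haze-style model I = J * t + A * (1 - t), with t = exp(-beta * depth), is given below; the per-channel coefficients are illustrative values, not measurements from any cited work:

```python
import numpy as np

def attenuate(J, depth, beta, ambient):
    """Simulate underwater degradation with the common image-formation
    model I = J * t + A * (1 - t), where t = exp(-beta * depth) is the
    per-channel transmission (red attenuates fastest in water)."""
    t = np.exp(-beta * depth)              # transmission per colour channel
    return J * t + ambient * (1.0 - t)

def restore(I, depth, beta, ambient):
    """Invert the model: J = (I - A * (1 - t)) / t."""
    t = np.exp(-beta * depth)
    return (I - ambient * (1.0 - t)) / t

# illustrative per-channel attenuation coefficients (R, G, B) and water colour
beta = np.array([0.40, 0.10, 0.07])
ambient = np.array([0.05, 0.35, 0.45])
```

The sketch also makes the practical difficulty plain: `beta`, `ambient`, and the scene depth must all be known or estimated before the inversion is possible, which is exactly why such methods struggle outside controlled conditions.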
By including an attenuation coefficient for both the blue–red and blue–green spectral channels, the technique presented by Berman and colleagues [23] was able to account for the different light profiles produced by different underwater settings. Their method is based on the pixel-level intensity of the image's colour channels; more specifically, the attenuation coefficient incorporates the two spectral characteristics. The topography of the location, the time of year, and the climate were also taken into consideration. Arnold-Bos and colleagues [24] discuss the challenges that the vision systems of underwater vehicles encounter while operating in underwater conditions and suggest deconvolution and enhancement approaches. Their technique was developed to eliminate light backscattering, the primary source of noise, and the attenuation inequalities that arise with contrast equalisation. A wavelet filtering approach was then applied to the residual image noise, which may correspond to sensor noise or floating particles. This algorithm helps improve edge recognition in underwater images.
To increase the amount and quality of available data, the more novel techniques mentioned above, such as data augmentation using GANs, are being employed in various sectors. Examples of GAN models can be found in neuroscience, for instance where segmentation must be performed on CT scan images [25]. The use of Deep Neural Networks (DNNs) and U-Nets to segment brain cells in Electron Microscopy (EM) images [20,26] is a further example of CycleGAN-style models applied for the purpose of data augmentation.
Because of the progress made in DL and CV over the past few years in areas such as image classification, image segmentation, and object detection [18,27,28], it is now possible to develop models that perform image restoration and image enhancement more accurately and precisely, with the potential to outperform the manual approaches used in the past [19]. Thanks to Convolutional Neural Networks (CNNs) and GANs, it is possible in certain instances to identify and detect objects with a higher level of accuracy than humans can attain [29].
Zhu et al. [28] proposed the CycleGAN model for image-to-image translation, which learns the mapping functions between two image domains X and Y: a mapping G: X → Y that translates the first domain into the second and, vice versa, an inverse mapping F: Y → X that translates the second domain back into the first. In addition, the authors incorporated two adversarial discriminators, one per domain, whose purpose was to assess whether an output image had been successfully translated to the target domain. In most instances, the results were adequate, and translating one image domain into another delivered acceptable output. Nevertheless, the model may confuse the domains when the training set does not offer sufficient feature dispersion.
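The cycle-consistency constraint that ties the two mappings together requires F(G(x)) ≈ x and G(F(y)) ≈ y. A minimal sketch of that loss, with plain callables standing in for the generator networks:

```python
import numpy as np

def cycle_consistency_loss(G, F, x_batch, y_batch):
    """L1 cycle-consistency loss used by CycleGAN:
    ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1, with G: X -> Y and F: Y -> X
    given here as plain callables standing in for the generators."""
    forward = np.abs(F(G(x_batch)) - x_batch).mean()   # x -> Y -> back to X
    backward = np.abs(G(F(y_batch)) - y_batch).mean()  # y -> X -> back to Y
    return forward + backward

# toy 'generators': exact inverses of each other, so the cycle loss is zero
G = lambda x: 2.0 * x + 1.0    # X -> Y
F = lambda y: (y - 1.0) / 2.0  # Y -> X
```

During training, this term is added to the two adversarial losses; it is what prevents the generators from mapping every input to an arbitrary image of the target domain.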
Currently, DL models are the standard in underwater applications, with the primary emphasis on image restoration, image enhancement, and correction of underwater scenes. Anwar et al. [22] proposed a CNN model to improve the quality of images taken underwater. The network is composed of convolutional blocks that are all linked to a dense layer at the end, which gives the whole model its modularity. The model's output is an enhanced image of the subsea scene, free of the cyan and emerald tones present in the original image.
A similar technique for restoring the colours in underwater photographs was used by Chen et al. [30], who attempted to reduce the effects of the underwater environment, increase image detail, and correct the colours. The image model incorporates several components, each handled by one of the model's three main elements: the first estimates the ambient light of the image; the second is responsible for the direct transmission estimation, which is a function of both the ambient light and the input image; and the third reconstructs the enhanced image. Li et al. [31] presented an underwater enhancement method based on GAN models, aiming to remove underwater degradation effects such as low contrast, colour casts, and haze-like artefacts using a fusion GAN model on the U45 dataset. The model combines the benefits of the Inception architecture [27] with the deep residual learning framework [32].
Another study by Li et al. [33] approached underwater image enhancement from a different perspective by constructing a large-scale real-world underwater dataset containing 950 images under various lighting conditions, from natural to artificial light. The collected data were then used with the custom Water-Net model to perform image enhancement. Furthermore, Panetta et al. [34] went further into underwater object tracking and image enhancement and introduced a benchmark underwater dataset, UOT100. The dataset comprises 104 underwater videos, from which a complete set of 74 K annotated image frames was generated. They also introduced the CRN-UIE GAN model for image enhancement, which tries to improve underwater object detection performance by correcting for the effects of the underwater environment.
Underwater image restoration using real-world images from coral reefs was proposed by Han et al. [35], who created the custom HICRD dataset to overcome the limited environmental diversity of previous datasets. The dataset consists of 9676 images and is used with the Contrastive UnderWater Restoration (CWR) model for image restoration. At its core, CWR combines GANs with representation learning [36], essentially an unsupervised method of performing image restoration. The CWR model performed satisfactorily, and the end result was close to the reference images, without content or structural losses in the generated images.
In addition, during the last few years there has been a shift from conventional techniques toward the substantial use of CNN and GAN models for underwater image restoration and enhancement [19,30,37,38]. Owing to the characteristics of such networks, DL models represent a significant advancement in the analysis of underwater imagery. Processing and interacting with the underwater world poses a number of difficulties for any autonomous vehicle, and DL makes it possible to create more accurate data-driven models of the environment, improving the vehicle's capacity to analyse and comprehend it. The most notable benefit of DL models, and the distinct advantage that sets them apart, is that they can be put into action without explicitly describing every facet of the environment or manually coding everything required for the operation line by line. Provided the necessary data are fed through the network during training, DL models learn the most valuable features on their own; as a result, a model can learn the characteristics and parameters needed for any given job, notwithstanding the complexity of underwater scenes.
Consequently, thanks to the development of advanced deep learning algorithms, it is now more feasible than ever to generate underwater images that are as close as possible to the real world, despite the complexity of such an environment. Data collection for marine imagery, an essential component of any project relating to the underwater environment, can thus be made more accessible and need not rely on direct underwater data, at least in the initial stages of model development, saving time, resources, and funding.
This paper is organised as follows.
Section 2 describes the methodology used to generate the underwater photographs.
Section 3 describes the results of the model development.
Section 4 presents the conclusions of the paper.