CCGAN as a Tool for Satellite-Derived Chlorophyll a Concentration Gap Reconstruction

Ćatipović, Leon; Matić, Frano; Kalinić, Hrvoje; Sathyendranath, Shubha; Županović, Tomislav; Dingle, James; Jackson, Thomas

doi:10.3390/jmse11091814


by Leon Ćatipović ¹,*, Frano Matić ², Hrvoje Kalinić ¹, Shubha Sathyendranath ³, Tomislav Županović ¹, James Dingle ³ and Thomas Jackson ³

¹ Environmental Data Analysis Laboratory, Faculty of Science, University of Split, 21000 Split, Croatia

² University Department of Marine Studies, University of Split, 21000 Split, Croatia

³ National Centre for Earth Observations, Plymouth Marine Laboratory, Plymouth PL1 3DH, UK

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2023, 11(9), 1814; https://doi.org/10.3390/jmse11091814

Submission received: 17 August 2023 / Revised: 7 September 2023 / Accepted: 13 September 2023 / Published: 18 September 2023

(This article belongs to the Special Issue Technological Oceanography Volume II)


Abstract: This work represents a modification of the Context Conditional Generative Adversarial Network as a novel implementation of a non-linear gap reconstruction approach of missing satellite-derived chlorophyll a concentration data. By adjusting the loss functions of the network to focus on the structural credibility of the reconstruction, high numerical and structural reconstruction accuracies have been achieved in comparison to the original network architecture. The network also draws information from proxy data, sea surface temperature, and bathymetry, in this case, to improve the reconstruction quality. The implementation of this novel concept has been tested on the Adriatic Sea. The most accurate model reports an average error of 0.06 mg m^{-3} and a relative error of 3.87%. A non-deterministic method for the gap-free training dataset creation is also devised, further expanding the possibility of combining other various oceanographic data to possibly improve the reconstruction efforts. This method, the first of its kind, has satisfied the accuracy requirements set by scientific communities and standards, thus proving its validity in the initial stages of conceptual utilisation.

Keywords: GAN; generative adversarial network; reconstruction; satellite chlorophyll a

1. Introduction

Satellite oceanography is a key tool in oceanographic measurements, but its usefulness is hindered by frequent gaps in data [1]. Solving the issue of missing data has been the aim of many solutions throughout the past two decades. During this time, a number of different reconstruction methods and approaches have been designed and implemented. The earliest attempts at gap-filling included data merging from multiple sensors [2], kriging [3], and basic regression techniques [4]. One of the most popular methods is the so-called Data Interpolating Empirical Orthogonal Functions (DINEOF) [5]—a linear reconstruction method based on empirical orthogonal functions. Lately, reconstruction techniques in oceanographical remote sensing have been turning toward the utilisation of various machine learning techniques. These techniques range from support vector machines [6], random forests [7] to various neural networks [8,9,10]. These techniques have seen an ever-increasing popularity, with an emphasis on neural-network-based approaches, which have seen an extreme surge in implementation, practicality and popularity in the past several years [11].

One special neural network architecture, the Generative Adversarial Network (GAN), represents the next leap in the neural-network reconstruction approach [12]. Unlike other neural networks, GANs represent a unity of two separate networks, a generator and a discriminator, “pitted against” each other. While the generator is tasked with generating new data, the discriminator is tasked with discerning whether the data presented to it originate from a real source or from the generator. This way, based on the feedback information, both networks become more proficient in their respective tasks, resulting in the generation of realistic data [12]. GANs have been successfully utilised within the domain of satellite oceanography for reconstruction purposes [13,14,15,16], but their application has been limited to sea surface temperature (SST) only [11]. Outside of strictly reconstruction-oriented purposes, GANs have seen a vast range of utilisation in oceanography [17,18,19,20].

This paper aims to showcase the concept and validity of reconstructing oceanographic data using a variation of the Generative Adversarial Network, the so-called Context Conditional Generative Adversarial Network (CCGAN) [21]. While the generator of a typical GAN usually transforms a random input to generate new data, CCGAN takes corrupted data as input and generates only the missing part based on the available data, or the context, hence the name [21]. The original development of CCGAN focused on natural image reconstruction only and proved quite successful [21]. However, this particular architecture has not yet been used in the reconstruction of oceanographic data. This work utilises an adjusted version of CCGAN for the reconstruction of missing chlorophyll a data (chl_a). The paper also examines the effects of utilising proxy data, that is, correlated data originating from an external source (e.g., a different sensor), on the reconstruction accuracy. In this case, chlorophyll a data are augmented by sea surface temperature data and bathymetry data, both of which are correlated to chlorophyll a [22,23]. Rather than deriving the complex numerical correlation between the three to obtain the absolute chlorophyll a concentration, sea surface temperature has been included to better derive water masses that ultimately dictate the fronts and shape of surface chlorophyll a concentration, while bathymetry serves as an indicator of distance from the shore in the Adriatic [24], helping to discern high chlorophyll a coastal areas from oligotrophic open waters.

To gauge the validity of this reconstruction method, the International Organization for Standardization (ISO) standard for in situ measurements [25] and Essential Climate Variable (ECV) product requirements [26] will be taken into account.

2. Data and Methods

Rather than selecting open ocean regions, which are generally homogeneous regarding surface chl_a, the Adriatic was chosen as the area of interest due to a considerable amount of literature documenting oceanographic processes, different water masses, complex bathymetric and coastal structure, effects of fluvial and atmospheric influence, and so on [24]. Wet points, meaning points representing the sea in the area encapsulated by 40–46° N and 12–20° E, were considered, but the Tyrrhenian Sea was discarded. The area of study is depicted in Figure 1.

2.1. Data Sources

The Ocean Colour-Climate Change Initiative (OC-CCI) Version 5.0 data suite [27] was created by band-shifting and bias-correcting the data from the Sea-viewing Wide Field-of-view Sensor, Moderate Resolution Imaging Spectroradiometer, Visible Infrared Imaging Radiometer Suite, and Ocean and Land Colour Instrument to match the Medium Resolution Imaging Spectrometer data [27]. The chlorophyll a concentration was derived using an array of algorithms [28] based on the target water class membership [29]. With a spatial resolution of 1 km, this resulted in matrix dimensions of 576 × 768. The time frame spanned from 1 January 2003 to 31 December 2020, a total of 6575 days. All data are publicly available and accessible at www.oceancolour.org (accessed on 1 August 2023).
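The record length quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the variable names are illustrative, and treating 576 as the row count and 768 as the column count is an assumption, since the text does not state the axis order.

```python
from datetime import date

# Inclusive number of daily fields between the first and last day of the record.
start = date(2003, 1, 1)
end = date(2020, 12, 31)
n_days = (end - start).days + 1
print("daily fields:", n_days)

# Number of grid points per daily field (axis order is an assumption).
n_rows, n_cols = 576, 768
points_per_field = n_rows * n_cols
print("grid points per field:", points_per_field)
```

The inclusive count reproduces the 6575 days stated in the text, which includes the five leap days between 2003 and 2020.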

To determine whether a machine learning model for reconstructing missing chl_a could be improved, additional variables have been included. These variables are commonly referred to as proxy variables since they are used as an instrument for chlorophyll reconstruction. In this study, two additional proxy variables were considered: SST and bathymetry. Daily satellite SST was retrieved from the Group for High Resolution Sea Surface Temperature Level 4 dataset (available at: www.podaac.jpl.nasa.gov/dataset/MUR-JPL-L4-GLOB-v4.1 accessed on 1 August 2023) [30]. As the SST data resolution was higher than the chl_a resolution, it was scaled down to match the 576 × 768 grid. SST data were time-matched with chl_a. Bathymetrical data are time-independent, therefore only one instance was obtained from the General Bathymetric Chart of the Oceans (available at www.gebco.net accessed on 1 August 2023) [31]. Bathymetry resolution was scaled identically as SST, and the data were subjected to cutoffs at −1000 and 400 m. The cutoff at 400 m was imposed to differentiate between low-lying coastal areas and mountainous coastal areas as coastal regions above the cutoff should not have a significantly different effect on the chl_a regardless of the altitude, if such effects should even take place. Similarly, the seafloor at the maximum depth of the Adriatic at 1233 m would probably not affect the chl_a significantly differently than a seafloor at 1000 m depth.
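The two preprocessing steps described above, the elevation cutoffs and the downscaling to the chl_a grid, can be sketched with NumPy. The function names are hypothetical, and block averaging is only one plausible way of scaling a finer grid down; the text does not say which resampling method was used.

```python
import numpy as np

def clip_bathymetry(elevation_m, lower=-1000.0, upper=400.0):
    """Clamp elevations (negative = below sea level) to [lower, upper] metres."""
    return np.clip(elevation_m, lower, upper)

def block_mean(field, factor):
    """Downscale a 2D field by averaging non-overlapping factor x factor blocks.

    Assumption: block averaging stands in for the unspecified resampling.
    """
    rows, cols = field.shape
    assert rows % factor == 0 and cols % factor == 0
    blocks = field.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

# Toy elevations: the deepest Adriatic point, a shelf point, a hill, a mountain.
elevation = np.array([[-1233.0, -500.0], [120.0, 1800.0]])
clipped = clip_bathymetry(elevation)
print(clipped)

# Toy 4 x 4 field reduced to 2 x 2.
coarse = block_mean(np.arange(16.0).reshape(4, 4), 2)
print(coarse)
```

After clipping, the 1233 m deep point and a 1000 m deep point carry the same value, which mirrors the reasoning given for the cutoffs.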

2.2. Dataset Integration

The bane of any machine learning algorithm is the sheer amount of representative data required for training [32]. This becomes an even bigger issue when the original data have gaps, which is certainly the case for satellite-derived chl_a. Since there is no go-to method for creating a gap-free and representative dataset, a heuristic approach was developed. As the input of the neural network is three 64 × 64 data matrices, the dataset creation process implies sampling 64 × 64 subsets from the original 576 × 768 area from day to day while being subject to certain rules. The details and outcomes of the dataset creation method are explored in Appendix A and Appendix B. The three matrices that go into the neural network contain chl_a, SST and bathymetrical data, respectively. The matrices are stacked so that the final shape of the input is 3 × 64 × 64. Because SST and bathymetrical data are gap-free, the sampling is solely determined by the chl_a data.
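A minimal sketch of the sampling idea follows. The actual rules are given in the appendices and are not reproduced here; the single rule used below (reject any window whose chl_a contains a NaN) is an assumption, as are the function name, the number of tries, and the synthetic fields.

```python
import numpy as np

def sample_patches(chl, sst, bathy, size=64, n_tries=100, seed=0):
    """Draw random size x size windows and stack them as 3 x size x size.

    Assumed rule: a window is kept only if its chl_a part has no gaps (NaN).
    """
    rng = np.random.default_rng(seed)
    rows, cols = chl.shape
    patches = []
    for _ in range(n_tries):
        r = int(rng.integers(0, rows - size + 1))
        c = int(rng.integers(0, cols - size + 1))
        window = chl[r:r + size, c:c + size]
        if np.isnan(window).any():
            continue  # sampling is driven by chl_a only; proxies are gap-free
        patches.append(np.stack([window,
                                 sst[r:r + size, c:c + size],
                                 bathy[r:r + size, c:c + size]]))
    return patches

# Synthetic daily fields on the 576 x 768 grid.
rng = np.random.default_rng(1)
chl = rng.random((576, 768))
chl[:, :300] = np.nan          # pretend the western part is cloud-covered
sst = rng.random((576, 768))
bathy = rng.random((576, 768))

patches = sample_patches(chl, sst, bathy)
print(len(patches), patches[0].shape)
```

Because the windows are drawn at random, repeated runs with different seeds give different training sets, in line with the non-deterministic dataset creation mentioned in the abstract.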

2.3. Context Conditional Generative Adversarial Network

CCGAN is a variation of the established GAN architecture. The GAN architecture is based on pitting two separate networks against each other: the generator network and the discriminator network [12]. Simplified, the purpose of the generator is to generate a distribution p_g over the data x so that it matches the original data distribution p_x as closely as possible. The output of the generator for some input data z from distribution p_z is defined as G(z). On the other hand, the discriminator discriminates whether its input originated from p_x or p_g by outputting a probability value D(G(z)). The aim of the discriminator is to be as accurate as possible when discriminating between the real and fake input. GAN functions by simultaneously training both the generator and discriminator in a min-max game with a value function V(D, G), also known as the loss function, given as [12]:

min_{G} max_{D} V (D, G) = E_{x \sim p_{x}} [log D (x)] + E_{z \sim p_{z}} [log (1 - D (G (z)))] .

(1)

While some implementations and variations of GAN take random noise z as input [12,33], CCGAN takes masked/corrupted/missing data as input and fills in the missing parts based on the available context data around the missing parts [21]. This way, rather than transforming some random noise input z, CCGAN exploits available data to optimise the generation of missing data. Formally, let m be a binary mask that will obscure some parts of the data. Then, the generator receives m ⊙ x, where ⊙ denotes element-wise multiplication. The loss function from Equation (1) then becomes [21]:

min_{G} max_{D} V (D, G) = E_{x \sim p_{x}} [log D (x)] + E_{x \sim p_{x}, m} [log (1 - D (G (m ⊙ x)))],

(2)

With this formulation, the output of the generator becomes x_G = G(m ⊙ x). However, this is not the final reconstruction result, as the generator is not focused on reconstructing data not obscured by the binary mask [21]. The fully reconstructed, or better yet, inpainted matrix x_I is obtained via [21]:

x_{I} = (1 - m) ⊙ x_{G} + m ⊙ x .

(3)
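Equation (3) can be illustrated on a toy matrix. In the sketch below the generator output is a stand-in array rather than a trained network, and the variable names simply mirror the symbols in the text; following the equations, m equals 1 on the visible context and 0 on the obscured points.

```python
import numpy as np

def inpaint(x, x_g, m):
    """Equation (3): keep known points (m = 1), use generated ones where m = 0."""
    return (1 - m) * x_g + m * x

x = np.arange(16, dtype=float).reshape(4, 4)   # "real" data
x_g = np.full((4, 4), -1.0)                    # stand-in for G(m * x)
m = np.ones((4, 4))
m[1:3, 1:3] = 0                                # obscured central block

x_i = inpaint(x, x_g, m)
print(x_i)
```

Only the four central points are taken from the generator output; every other point is copied from the original data unchanged.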

Additionally, CCGAN has an optional input of the complete/uncorrupted/unmasked matrix of a lower resolution, x_R, which improves the reconstruction accuracy [21]. The low-resolution input has been obtained through bilinear interpolation to 16 × 16 points. Given that the original size of the matrix was 64 × 64 points, the low-resolution matrix contained just 6.25% of the original information. Finally, with this additional input, the loss function in Equation (2) became:

min_{G} max_{D} V (D, G) = E_{x \sim p_{x}} [log D (x)] + E_{x \sim p_{x}, x_{R} \sim p_{x}, m} [log (1 - D (G (m ⊙ x, x_{R})))],

(4)
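The low-resolution input x_R can be sketched with a small bilinear resampler. This is an illustration under assumptions: the function name is hypothetical, and sampling at output-cell centres is one common bilinear convention, which may differ from the one used in the original implementation.

```python
import numpy as np

def bilinear_resize(field, out_rows, out_cols):
    """Resample a 2D field bilinearly, sampling at output-cell centres."""
    in_rows, in_cols = field.shape
    r = (np.arange(out_rows) + 0.5) * in_rows / out_rows - 0.5
    c = (np.arange(out_cols) + 0.5) * in_cols / out_cols - 0.5
    r = np.clip(r, 0, in_rows - 1)
    c = np.clip(c, 0, in_cols - 1)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, in_rows - 1)
    c1 = np.minimum(c0 + 1, in_cols - 1)
    wr = (r - r0)[:, None]      # row interpolation weights
    wc = (c - c0)[None, :]      # column interpolation weights
    top = field[r0][:, c0] * (1 - wc) + field[r0][:, c1] * wc
    bottom = field[r1][:, c0] * (1 - wc) + field[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr

# A linear ramp is reproduced exactly by bilinear interpolation.
ramp = np.add.outer(np.arange(64.0), 2.0 * np.arange(64.0))
low = bilinear_resize(ramp, 16, 16)
fraction = low.size / ramp.size
print(low.shape, f"{fraction:.2%}")
```

The ratio of 256 to 4096 points gives the 6.25% quoted in the text.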

The architecture of the network remained mostly the same [21], except that the outputting layer of the generator was changed from a hyperbolic tangent function to a sigmoid function, for reasons explained in the following paragraphs. The generator [33] consists of six downscaling and six upscaling layers, based on 2D convolution and transposed-convolution operators, respectively. Both operators used a kernel size of 4 × 4, a stride of 2 × 2, and a padding of 1. The discriminator was based on the VGG-A network [34] without the fully connected layers [21]. The Adam optimiser [35] was utilised. The learning rate was set at 0.0002 and the momentum term at 0.5. The remainder of the hyperparameters were left unchanged [21,35].
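With a 4 × 4 kernel, a stride of 2 and a padding of 1, each convolution halves and each transposed convolution doubles the spatial size. The sketch below applies the standard output-size formulas, assuming a dilation of 1 and no output padding, and assuming that the six layers act one after another on a 64 × 64 input; the function names are illustrative.

```python
def conv_out(n, kernel=4, stride=2, padding=1):
    """Spatial output size of a 2D convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def conv_transpose_out(n, kernel=4, stride=2, padding=1):
    """Spatial output size of a 2D transposed convolution along one axis."""
    return (n - 1) * stride - 2 * padding + kernel

sizes_down = [64]
for _ in range(6):                     # six downscaling layers
    sizes_down.append(conv_out(sizes_down[-1]))

sizes_up = [sizes_down[-1]]
for _ in range(6):                     # six upscaling layers
    sizes_up.append(conv_transpose_out(sizes_up[-1]))

print(sizes_down)
print(sizes_up)
```

Under these assumptions, six downscaling steps take a 64 × 64 matrix to a single point and six upscaling steps restore the original size.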

Each model was trained for 50 epochs, and each training dataset was split into batches of 20 data matrices. The number of epochs and batch size were purposefully set identically for each set to minimise the algorithm effect and maximise the dataset variability effect on the reconstruction accuracy. During the training process, at every 50 batch passes, the generator and discriminator loss values were recorded for later analysis. For the sake of convenience and to showcase the robustness of the CCGAN, a square mask was applied to the middle of the training data matrix, so that a large and important portion of the data is obstructed. With the dimensions of the data matrix being 64 × 64 and the dimensions of the mask being 32 × 32, 25% of the data are covered, as depicted in Figure 2.
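The central mask is simple to construct. In this sketch the function name is illustrative, and the convention of 1 for visible context and 0 for obscured points follows Equations (2) and (3).

```python
import numpy as np

def centre_mask(size=64, hole=32):
    """Binary mask: 1 = visible context, 0 = obscured central square."""
    m = np.ones((size, size))
    start = (size - hole) // 2
    m[start:start + hole, start:start + hole] = 0
    return m

m = centre_mask()
covered = 1 - m.mean()
print(f"covered fraction: {covered:.0%}")
```

The obscured block spans indices 16 to 47 along both axes, which is 1024 of the 4096 points.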

For this research, the PyTorch implementation of CCGAN was used [36,37]. The implementation emulates semi-supervised learning based on arbitrary labels used for defining real data and fake data. Labels are defined as appropriately shaped tensors T and F filled with ones and zeros, representing real and fake data, respectively. For practicality, the original loss function given by Equation (4) was split into generator loss:

L_{G} = {∥ D (G (m ⊙ x, x_{R})), T ∥}_{2}^{2}

(5)

and discriminator loss, given as:

L_{D} = 0.5 \cdot [{∥ D (x), T ∥}_{2}^{2} + {∥ D (G (m ⊙ x, x_{R})), F ∥}_{2}^{2}],

(6)

where ∥ ∥_{2} represents the mean squared error (squared L2 norm, MSE) between each element. While this implementation [36], denoted as the MSE_1-based model, provided satisfactory results when used on the CelebA dataset [38], if the loss functions defined in this way were to be used on chl_a data, the discriminator would significantly outperform the generator during training, leading to poor reconstruction accuracy, as represented in Figure 3. To mitigate this problem, updated loss functions were proposed:

L_{G} = ∥ G (m ⊙ x_{1}, x_{R}) ⊙ (1 - m_{LC}), x_{1} ⊙ (1 - m) ∥

(7)

and

L_{D} = 0.5 \cdot [{∥ D ((1 - m) ⊙ x_{1}), T ∥}_{2}^{2} + {∥ D (G (m ⊙ x_{1}, x_{R}) ⊙ (1 - m_{LC})), F ∥}_{2}^{2}] .

(8)

These loss functions are a part of the MSE_2-based model. Two changes are to be noticed here. Firstly, x_1 denotes the chl_a data, the only data that are taken into account for training evaluation, as only chl_a is targeted for reconstruction. Secondly, as the positions of land points, as defined by the dynamic land–sea mask, are always known for each data matrix, there is no need to gauge reconstruction accuracy on points that are known to be land points. Therefore, the land points from the real data were superimposed onto the respective generated data before being subjected to evaluation. This way, the training process is streamlined and further optimised as only actual sea points are subjected to testing. This change is implemented via the m_{LC} mask. However, even this update to the loss functions resulted in poor reconstruction accuracy, mainly because, in this instance, the generator outperformed the discriminator, as depicted in Figure 3. A possible cause of such poor results might lie in the MSE itself. While the MSE might keep numerical values in check, it cares little for the actual structural distribution of the data, as exemplified in Figure 4. Therefore, another model was proposed, whose generator loss function has been swapped for:

L_{G} = 1 - ψ [G (m ⊙ x_{1}, x_{R}) ⊙ (1 - m_{LC}), x_{1} ⊙ (1 - m)]

(9)

The ∥ ∥_{2} function has been replaced with the Structural Similarity Index function (ψ) [39,40]. ψ is defined as:

ψ (a, b) = \frac{(2 μ_{a} μ_{b} + c_{1}) (2 σ_{a b} + c_{2})}{(μ_{a}^{2} + μ_{b}^{2} + c_{1}) (σ_{a}^{2} + σ_{b}^{2} + c_{2})},

where a and b are square matrices, μ_a and μ_b are the mean values of a and b, σ_a^2 and σ_b^2 are the variances of a and b, σ_{ab} is the covariance of a and b, and c_1 = (0.01 L)^2 and c_2 = (0.03 L)^2 are variables based on the dynamic range L of the data values [39]. ψ outputs a value in the range [−1, 1], with 1 indicating a perfect copy, while −1 denotes maximal dissimilarity. Since ψ measures similarity rather than difference, the loss value in Equation (9) is defined as 1 − ψ. As ψ is undefined for negative values, which can occur as an output of the final layer of the generator when it contains a hyperbolic tangent function, the function was replaced with a sigmoid function, ensuring all outputs are positive. ψ was implemented using the piqa package [41]. The motivation behind including ψ is similar to the inclusion of SST: while numerical accuracy in reconstruction is important, the shape of the reconstructed features is no less important. These changes resulted in a stabler training process, as depicted in Figure 3, resulting in better reconstruction accuracy. This model was denoted as the Structural Similarity Index Measure (SSIM)-based model.
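The two loss families discussed in this section can be contrasted in a short NumPy sketch. Several simplifications are assumed: discriminator outputs are plain arrays rather than network outputs, the masks m and m_{LC} are omitted, and ψ is computed from global statistics of the whole matrix, whereas SSIM implementations such as piqa typically average the index over local windows. All function names are illustrative.

```python
import numpy as np

def ls_loss(pred, label):
    """Mean squared error between predictions and labels."""
    return float(np.mean((pred - label) ** 2))

def discriminator_loss(d_real, d_fake):
    """Least-squares discriminator loss with labels T (ones) and F (zeros)."""
    t = np.ones_like(d_real)
    f = np.zeros_like(d_fake)
    return 0.5 * (ls_loss(d_real, t) + ls_loss(d_fake, f))

def ssim_global(a, b, dynamic_range=1.0):
    """Structural similarity from global means, variances and covariance."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

def generator_loss_ssim(generated, target):
    """Generator loss in the spirit of Equation (9): 1 - psi."""
    return 1.0 - ssim_global(generated, target)

# Sigmoid-like data in [0, 1]: a smooth ramp and its mirror image.
a = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)
b = 1.0 - a

print("perfect copy loss:", generator_loss_ssim(a, a))
print("mirrored loss:", generator_loss_ssim(a, b))
print("undecided discriminator:", discriminator_loss(np.full(4, 0.5), np.full(4, 0.5)))
```

A perfect copy gives a loss close to zero, while the mirrored ramp, which has the same mean and variance but an inverted structure, gives a loss well above one.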

2.4. Error Metrics

To quantify the reconstruction accuracy, three error metrics were implemented. The definitive accuracy of reconstruction was measured only on the masked part. The value of MSE was calculated as:

MSE = {∥ G (m ⊙ x_{1}, x_{R}) ⊙ (1 - m_{LC}), x_{1} ⊙ (1 - m) ∥}_{2}^{2} .

(10)

Secondly, SSIM was given by:

SSIM = ψ (G (m ⊙ x_{1}, x_{R}) ⊙ (1 - m_{LC}), x_{1} ⊙ (1 - m)) .

(11)

The final error metric was a relative error (RE) as defined by:

RE = \frac{1}{M} \sum_{i = 1}^{M} \frac{| {(G (m ⊙ x_{1}, x_{R}) ⊙ (1 - m_{LC}))}_{i} - {(x_{1} ⊙ (1 - m))}_{i} |}{{(x_{1} ⊙ (1 - m))}_{i}} \times 100 \%

(12)

This metric was added in order to provide a more tangible measure of error. Because RE divides by the data values, it can encounter division by zero if land data happen to be subjected to evaluation. To avoid this, RE was calculated using only sea points, indexed by i. The total number of sea points per data matrix is M. While a higher SSIM generally also means a lower MSE and RE, the measures are not equivalent and may produce significantly different reconstruction results. While MSE and RE are relatively easily physically interpretable, SSIM does not share that feature. Therefore, it is natural to question the motivation behind the use of SSIM. However, it is easy to imagine a case where MSE and RE fail. As a showcase where only relying on numerical values could significantly hinder the structural realism of reconstruction, Figure 4 is presented. As can be noticed, structurally significantly different matrices are virtually equidistant from the original matrix in terms of MSE and RE, while this is not the case with the SSIM. If solely relying on MSE (or RE), it would be difficult to determine which matrix is a better representation of the original. Thus, when aiming at matrix reconstruction optimisation, SSIM better captures the structure of the data than either MSE or RE.
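The argument above can be reproduced on synthetic data. The sketch below is not a reconstruction of Figure 4: the target, the two candidates and the evaluated region are made up for illustration, and SSIM is computed from global statistics, which is a simplification of windowed implementations.

```python
import numpy as np

def masked_mse(generated, target, region):
    """MSE restricted to the evaluated points (boolean region)."""
    return float(np.mean((generated[region] - target[region]) ** 2))

def relative_error(generated, target, region):
    """Mean relative error in percent over the evaluated points."""
    g, t = generated[region], target[region]
    return float(np.mean(np.abs(g - t) / t) * 100.0)

def ssim_global(a, b, dynamic_range=1.0):
    """Structural similarity from global statistics (simplified)."""
    c1, c2 = (0.01 * dynamic_range) ** 2, (0.03 * dynamic_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    return float((2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2)))

# Target: a smooth gradient; region: the masked, sea-only points to evaluate.
target = np.tile(np.linspace(0.2, 1.0, 8), (8, 1))
region = np.zeros((8, 8), dtype=bool)
region[2:6, 2:6] = True

# Two candidates with the same point-wise error magnitude of 0.1.
rows, cols = np.indices(target.shape)
checker = np.where((rows + cols) % 2 == 0, 1.0, -1.0)
shifted = target + 0.1               # uniform offset, structure preserved
rippled = target + 0.1 * checker     # alternating offset, structure disturbed

for name, candidate in (("shifted", shifted), ("rippled", rippled)):
    print(name,
          round(masked_mse(candidate, target, region), 4),
          round(relative_error(candidate, target, region), 2),
          round(ssim_global(candidate[region], target[region]), 3))
```

Both candidates have the same MSE and RE, yet the candidate with the alternating offset scores a noticeably lower SSIM, which is the behaviour the text attributes to the structural metric.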

2.5. Growing Neural Gas

To provide the most common examples of reconstruction, the Growing Neural Gas (GNG) [42] was implemented to extract the characteristic patterns from the testing dataset. Input into the algorithm consisted of the 3 × 64 × 64 data matrix formatted into a data vector containing 12,288 features. Rather than implementing GNG once on the test dataset (containing over 50,000 of the aforementioned data vectors, see Appendix B and Appendix C) and obtaining a presentable number of patterns (e.g., around 10), GNG was implemented twice to avoid the oversmoothing of the data. First, the implementation reduced the dataset to 100 patterns. These 100 patterns were further fed into GNG in order to obtain the eight patterns used for the visualisation of reconstruction examples. Using these eight final patterns, also known as best matching units (BMUs), the closest possible dataset examples were determined using the least vector norm. The GNG algorithm itself was implemented in the Python library NeuPy using fixed parameters: step = 0.1, neighbour step = 0.001, maximum edge age = 50, number of iterations before adding a neuron = 100, aftersplit error decay rate = 0.5, error decay rate = 0.995, and minimum update distance = 0.2. Each run lasted 300 epochs.

3. Results and Discussion

3.1. Verification of SSIM-Based Model

In order to justify changes to the architecture described in Section 2.3, a brief comparison between MSE-based and SSIM-based CCGAN models was made. Before calculating the metrics, data were rescaled to reflect the physical chl_a values in mg m^{-3}. Results are displayed in Table 1. Not surprisingly, due to the training behaviour depicted in Figure 3, MSE-based models (depicted in the top and middle graphs) failed to optimise properly, and, even though the latter performed slightly better, both resulted in poor reconstruction accuracy. The SSIM-based model significantly outperformed both MSE-based models. Interestingly enough, even though the SSIM-based model is trained solely on the Structural Similarity Index function, MSE and RE metrics also show improvements. Figure 3 displays how the MSE-based models failed to optimise properly, causing significant noise in the training process.

3.2. Testing the SSIM-Based Model

Reconstruction accuracy was examined in both time and geographical space. Figure 5 displays the mean spatial distribution of the error metrics: SSIM, MSE, and RE, respectively. Since both SSIM and MSE output a single value for every 16 × 16 point area, the single value was used to represent the entire respective area. The RE returns a value for each point, hence the higher resolution. From Figure 5, it is apparent that the highest reconstruction accuracy is achieved in the open sea areas. The accuracy diminishes as the distance from the coast decreases, and the gradient is higher towards the west coast than the east coast. The lowest accuracy is contained in the north-west area of the Adriatic by the mouth of the Po River. Figure 6 examines the metrics as a function of time. Intra-annually, RE fluctuates slightly, achieving its minimum during late spring. SSIM and MSE seem to share similar behaviour: higher accuracy is achieved during summer and the lowest accuracy occurs during late autumn. Interannually, SSIM, MSE, and RE are generally stable, with the caveat of a sudden drop in accuracy during 2013 and 2014.

These errors prompted a secondary investigation in order to determine the underlying cause of the diminished reconstruction accuracies in 2013 and 2014. This included an examination of the mean values of chl_a contained in the sets used for training and testing the model. The mean value and standard deviation of chl_a of both sets is 0.40 ± 0.72 mg m^{-3}. Instances where the MSE of reconstruction was greater than 0.40 mg^{2} m^{-6} numbered 151 samples in total. The mean value and standard deviation of chl_a of the aforementioned samples was 4.88 ± 1.76 mg m^{-3}. Furthermore, these samples were all localised at the mouth of the Po River. The examination of chl_a throughout the years indicated a sudden increase in the aforementioned area in the years 2013 and 2014, coinciding with the drop in accuracy in Figure 6.

The eight most representative vectors from the dataset were selected based on the least vector norm given the eight patterns determined by the GNG from Section 2.5 (Figure 7). These vectors were masked according to the procedure and fed into the SSIM-based CCGAN to reconstruct the missing parts of the data. The visualisation of the results is displayed in Figure 7. CCGAN successfully reconstructed all the samples closest to the characteristic patterns. Generally, the patterns describe certain distributions of chl_a, temperature and bathymetry. Pattern E represents a homogeneous distribution of low chl_a, typical for the southern parts of the Adriatic. High chl_a attributed to the Po River is displayed in pattern D. Eddy-like distributions can be seen in patterns A and G; the former represents a positive and the latter a negative eddy. All patterns are subject to some smoothing effects, but features (eddies, local minima and maxima, small-scale features) still prevail. The temperature in the Adriatic Sea oscillates between the summer maximum of 303 K and the winter minimum of 285 K. This range has not been encapsulated by the characteristic patterns; all of the patterns oscillate around the median value. The chl_a patterns that triggered the BMUs occurred during both summer and winter months, which resulted in the averaging of the temperature. Temperature has no clear influence on the reconstruction accuracy. Bathymetry, on the other hand, has been shown to be associated with chl_a, which is to be expected as bathymetry is connected with physical processes that influence chl_a, such as river discharge, upwelling, and coastal processes.

Another examination evaluated the effects of including various combinations of proxy variables. Up to this point, each data matrix contained chl_a, SST and bathymetry data. Three additional models were trained. Considering how chl_a is crucial for reconstruction, chl_a was kept in all models, while SST and/or bathymetry were excluded. Based on the possible combinations, training and testing dataset matrices were modified by removing the appropriate data, and models were trained on the newly obtained sets. Reconstruction accuracies were compared to the model that used all three types of data. Naturally, each model was tested using only the appropriate matrices. Based on the results, the chl_a and SST and bathymetry model performed the best. Removing either the SST or the bathymetry data decreases the reconstruction accuracy. It would seem that removing SST affects the SSIM and RE accuracy more than the removal of bathymetry, but the mean MSE benefits from retaining SST in favour of bathymetry. Removing both the SST and bathymetry improves the SSIM and RE accuracy compared to the chl_a and SST and the chl_a and bathymetry models, but degrades the mean MSE score when compared to the chl_a and SST model. Therefore, it could be concluded that SST and bathymetry provide complementary data which improve the reconstruction accuracy, but the inclusion of just one or the other has negative effects on the reconstruction, at least when considering the scale of the entire Adriatic.
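The set of models compared above follows from keeping chl_a fixed and toggling the two proxy variables. The short sketch below enumerates these input combinations and selects the matching channels from a stacked data matrix; the channel order (chl_a, SST, bathymetry) follows Section 2.2, while the names and helper function are illustrative.

```python
from itertools import combinations

CHANNELS = ("chl_a", "sst", "bathymetry")   # order of the stacked matrices
PROXIES = CHANNELS[1:]

# chl_a is always kept; every subset of the proxies defines one model.
model_inputs = [("chl_a",) + combo
                for k in range(len(PROXIES) + 1)
                for combo in combinations(PROXIES, k)]

def select_channels(matrix, names):
    """Keep only the named channels of a channel-first data matrix."""
    return [matrix[CHANNELS.index(name)] for name in names]

for names in model_inputs:
    print(" + ".join(names))
```

This yields four configurations: the full model used throughout the paper and the three additional models trained for this comparison.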

4. Conclusions

This paper described a method using a deep neural network for chlorophyll a concentration data reconstruction. The method uses the Context Conditional Generative Adversarial Network architecture, which was trained on satellite-obtained data. As the most important change in the canonical CCGAN implementation, the modification of the optimisation function was pointed out, along with the introduction of the Structural Similarity Index Measure as a measure of (dis-)similarity between data matrices. Such a modification improved the convergence properties and provided better overall reconstruction results. The final results of the average error of 0.06 mg m^{-3} and relative error of 3.87% demonstrate outstanding performance. This is especially visible when compared to the required measurement uncertainty of essential climate variables of 30% [26]. When compared to the ISO standard [25] for laboratory measurements of chlorophyll, one can notice that the results are quite competitive, since the standard declares the coefficient of variation to be 4.3%.

CCGAN successfully managed to reconstruct small-scale features (a few kilometres in size) in the distribution of the chlorophyll a concentration. Despite the smoothing effects, the physical credibility of the reconstruction is preserved. Differences between the target data and the reconstruction output are randomised; there is no clear bias towards overestimation or underestimation. A slight seasonal error is present, mostly related to the seasonal distribution of chlorophyll a in the Adriatic. This issue could potentially be rectified by introducing additional proxy variables, for example, nutrients. CCGAN’s accuracy remained stable throughout the years, with the exception of the years 2013 and 2014, when it failed to model the unusual patterns at the mouth of the Po River. The highest relative errors are localised in the coastal areas: apart from the error by the Italian coast caused by the discharge of the Po River, the largest errors are localised by the Albanian coast. Probable causes of this error are the exchange between the Adriatic and the Ionian Sea as dictated by the BIOS oscillating system [43] and increased coastal environmental pollution. Reconstructions in coastal areas are generally difficult due to increased, highly localised variation dictated by anthropogenic influence (sewage discharge, industrial waste, shellfish farms), river discharge, local air–sea interaction, etc.

Apart from the description of the reconstruction method, a description of the method for combining various data into layers and creating a representative dataset was provided. It was shown that the parameters used for dataset creation generally somewhat influence the results of the reconstruction at the cost of the computation time and memory storage. It is known that the chlorophyll a concentration and sea surface temperature are not highly correlated [43]. Likewise, it is known that the chlorophyll a concentration is correlated to river discharges (e.g., the Po River [44]). Furthermore, chlorophyll a concentration is not directly related to bathymetry, but is related to processes that are related to bathymetry (e.g., upwelling). By selecting non-highly correlated proxy variables, it was shown that CCGAN was able to utilise the non-linear correlation to improve the reconstruction accuracy. The inclusion of temperature aimed to describe the quasi-annual variability of the chlorophyll a concentration, while bathymetry was included to emphasise the physical processes that affect the concentration.

As a closing remark, while this paper showcased the feasibility of using GAN-based methods for missing chl_a reconstruction with relatively high reconstruction accuracies, a lot of room for improvement and further testing has been left, such as minimising the noise during training or varying the training parameters [45], inclusion of additional proxy variables, splitting models based on temporal and/or geographical differences, different methods for obtaining the low resolution, and so on. Regardless, based on the practical accuracy requirements of 5–11% [25] to 30% [26] when compared to the relative error obtained by this method as described in the previous section, the claim that solid reconstruction accuracy can be achieved by using CCGAN holds.

Author Contributions

Conceptualisation, L.Ć., H.K., F.M. and S.S.; formal analysis, L.Ć. and F.M.; investigation, L.Ć., F.M., T.Ž., J.D. and T.J.; writing (original draft preparation), L.Ć.; writing (review and editing), S.S., F.M. and H.K.; visualisation, L.Ć. and F.M.; supervision, S.S., F.M. and H.K.; project administration, H.K. and S.S.; funding acquisition, H.K., S.S. and F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Croatian Science Foundation (HRZZ) under the projects UIP-2019-04-1737 and IP-2019-04-5875 StVar-Adri, and in part by the Simons Foundation Project “Collaboration on Computational Biogeochemical Modeling of Marine Ecosystems” (CBIOMES) (549947, SS).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Remote sensing data used for the creation of the datasets analysed in this paper are in the public domain and freely available. Chlorophyll a concentration data were obtained from www.oceancolour.org (accessed on 1 January 2020), sea surface temperature data were retrieved from www.podaac.jpl.nasa.gov/dataset/MUR-JPL-L4-GLOB-v4.1 (accessed on 1 January 2020) and bathymetrical data are available at www.gebco.net (accessed on 1 January 2020).

Acknowledgments

The authors would like to extend their gratitude to Jadranka Šepić for providing additional funding via the Croatian Science Foundation (HRZZ) under the project IP-2019-04-5875 StVar-Adri, and to Shubha Sathyendranath for all the provided support.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ECV	Essential Climate Variables
GAN	Generative Adversarial Network
CCGAN	Context Conditional Generative Adversarial Network
ISO	International Organization for Standardization
OC-CCI	Ocean Colour Climate Change Initiative
SST	Sea Surface Temperature
${chl}_{a}$	Chlorophyll a Concentration
MSE	Mean Squared Error
SSIM	Structural Similarity Index Measure
RE	Relative Error
WR	Window Resampling
RA	Resample Allowed
LA	Land Allowed
BMU	Best Matching Unit

Appendix A. Dataset Sampling

The basic requirement for training and evaluating the model was to keep the dataset completely gap-free. Therefore, $64 \times 64$ areas that contain any missing data, whether due to obnubilation (from clouds, rain, sunglint) or to quality control, were not sampled. Over the entire time span, open-sea data points are available most of the time. However, points in a thin zone (less than 5 km) along the coastline are available only around 50% of the time, and in some areas even less than 20%. This is due to the increased optical complexity of coastal waters [46,47], which significantly hinders the retrieval of ${chl}_{a}$ [48,49]. The eastern part of the Adriatic, because of the Adriatic archipelago, is more affected by this optical complexity than the western part. If a fixed, predefined geographical land–sea mask were used, the result would be overwhelmingly many samples of the open sea and few, if any, samples of coastal areas. To avoid this, a much looser definition of the land–sea mask was adopted: a dynamic land–sea mask that depends on the availability of the data in the coastal area. Figure A1 depicts this: Figure A1a shows the geographical land–sea mask, Figure A1b shows the dilated land–sea mask, where the dilation was performed based on the availability of the data, and Figure A1c shows the difference between the two. Rather than obtaining the dilated land–sea mask through morphological transformations, it was obtained by defining areas that contain data less than 40% of the time as land, while the rest was defined as sea. This degrades the geographical accuracy on a global scale; however, the $64 \times 64$ areas only contain local information, so the most important information, namely the distribution of data over land and sea areas, is preserved.

Figure A1. Dilated land–sea mask: (a) geographical land–sea mask; (b) the dilated land–sea mask; and (c) the difference. Land points are purple, sea points are yellow. White represents the geographically accurate sea points that were classified as land points using the dilated land–sea mask.
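The availability-based dilation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(time, lat, lon)` array layout and the use of NaN to mark missing ${chl}_{a}$ are assumptions made for illustration; only the 40% threshold comes from the text.

```python
import numpy as np

def dilated_land_mask(chl_stack, geo_land, min_availability=0.40):
    """Return a boolean (lat, lon) array that is True where a pixel counts as land.

    chl_stack: (time, lat, lon) array, NaN where chl_a is missing (assumed encoding).
    geo_land:  (lat, lon) boolean geographical land mask.
    """
    available = ~np.isnan(chl_stack)
    availability = available.mean(axis=0)  # fraction of time steps with data
    # Pixels with data less than 40% of the time are treated as land.
    return geo_land | (availability < min_availability)

def reclassified_sea(geo_land, dilated_land):
    """Geographical sea pixels classified as land by the dilated mask
    (the white pixels in Figure A1c)."""
    return dilated_land & ~geo_land
```

Because the threshold is applied per pixel, the mask widens most where coastal retrievals fail most often, without any morphological operation.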

Unlike gaps in the open sea, which are mostly caused by cloud coverage, gaps in the coastal zone are usually caused by the inability of the ${chl}_{a}$ retrieval algorithm to deal with optically complex waters [48,49]. This issue, which stems from the retrieval algorithm itself, could easily be circumvented by simply applying the dilated land–sea mask. However, on some occasions, ${chl}_{a}$ was obtained in the white areas in Figure A1c. So as not to discard these data during sampling, the sampling algorithm used points in the white areas when they were available; otherwise, it treated them as land rather than as missing data. This is the reasoning behind the “dynamic” adjective in the dynamic land–sea mask.
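As a rough illustration of the “dynamic” behaviour, the sketch below applies the two masks to a single daily field. The function name, the NaN encoding of missing data and the array layout are hypothetical; the land value of zero follows the convention stated in Appendix B.

```python
import numpy as np

def apply_dynamic_mask(chl_day, geo_land, dilated_land, land_value=0.0):
    """Apply the dynamic land-sea mask to one daily chl_a field.

    chl_day: (lat, lon) array, NaN where chl_a is missing (assumed encoding).
    Returns the masked field and the land mask that is valid for this day.
    """
    white = dilated_land & ~geo_land  # sea pixels reclassified as land
    # A white pixel is kept as sea when it has data and treated as land otherwise.
    land_today = geo_land | (white & np.isnan(chl_day))
    masked = np.where(land_today, land_value, chl_day)
    return masked, land_today
```

Any NaN that remains in the returned field lies in the open sea and is therefore a genuine gap, which makes the enclosing $64 \times 64$ area ineligible for sampling.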

After applying the dynamic land–sea mask, which allowed for near-shore sampling, the goal was to create a dataset that would be as continuous as possible, while being as spatially and temporally representative as possible. To this end, several heuristic rules were put in place.

With the dynamic land–sea mask in use, sampling could result in numerous samples containing an unnecessarily large percentage of land points as opposed to sea points. This would significantly increase the size of the dataset, but would have either no effect or a negative effect on the spatial representativeness of the set. Put simply, the sampling algorithm needed to be barred from extracting $64 \times 64$ samples (containing 4096 points in total) that contain too few sea points (e.g., one sea point and 4095 land points). To do so, a Land Allowed (LA) rule was enforced, which verifies that each candidate sample contains a percentage of land points no greater than the one given by the parameter.

Furthermore, ${chl}_{a}$ exhibits a certain degree of seasonality in the Adriatic. This seasonality is in tune with the climatology of the Adriatic: spring and summer skies are generally cloud-free, while autumn and winter months display significantly larger and more frequent cloud cover. This in turn results in more missing points during the autumn–winter period than in the spring–summer period. If this difference is not regulated, the created dataset could contain disproportionately more spring–summer samples, skewing the temporal distribution, which may result in autumn–winter examples being reconstructed with the incorrect seasonality. Therefore, the second rule limits the number of times an area can be sampled in a day, based on the average number of times points in the area have been sampled for that date. The parameter of this rule is named Resample Allowed (RA). It is tied to the counter variable, which, as the name suggests, counts the number of times a point has been sampled. The counter is updated after each successful sampling by a value called the resample increment, denoted by N.

Surface ${chl}_{a}$ distribution is a complex system. While a $64 \times 64$ sample should encapsulate most oceanographic features (such as fronts, eddies, etc.), the surrounding area still has an influence on such features. Having a clear-cut separation between two neighbouring samples would most likely result in unnatural boundary artefacts [50,51] when attempting to reconstruct the missing data. To reduce this error, a continuous dataset is required. To achieve this, overlapping in sampling is allowed, but is governed by the third rule, which dictates how many points around the centre of the sampled area will be tagged as already sampled after each $64 \times 64$ sample. The width and height of this square window are defined by the Window Resampling (WR) parameter. Points within this WR window have their sampling counter increased by N after each sampling. A toy example of the algorithm on a downsampled matrix is displayed in Figure A2.

Figure A2. Depiction of how the algorithm used for data sampling updates the counter variable to ensure the representativeness of the set. Green pixels represent land, whilst blue pixels represent sea. The black numbers denote the sampling counter variable for each pixel. For details, see text.

Figure A2 depicts a toy example for a downsampled matrix. In this example, the output matrix is reduced from $64 \times 64$ points to just $4 \times 4$ points and is shown as a pink rectangle. The yellow rectangle is defined by the parameter WR, in this case $WR = 1$. Each point has an auxiliary integer variable, the counter. Based on the dynamic land–sea mask, the counter of land points is assigned the value $-1$, while that of sea points is assigned 0. A random pair of latitude and longitude (lat, lon) is selected. Three checks are then performed, in this specific order: first, that the pink window contains no missing points; second, that the LA condition is satisfied, by determining the percentage of green points inside the pink window; and third, that the RA condition is satisfied, by comparing the value of RA to the mean value of the counter within the yellow square. If all three checks pass, the part of the matrix within the pink rectangle is sampled and added to the dataset. After sampling, only the points within the yellow square, whose dimensions are defined by the WR parameter, receive a counter update, by adding $N = 1$ to their value (updated counter values are shown as red numbers). In the next step, a new (lat, lon) pair is selected (in this example, shifted by one point to the south) and the process is repeated. While the algorithm tries to visit every possible (lat, lon) pair, note that it is generally non-deterministic, as the resulting dataset depends on the order in which the (lat, lon) pairs are selected.
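The three rules can be put together in a short Python sketch for a single daily field. This is an illustrative reading of the description, not the authors' code. Several details are assumptions: WR is taken as the side length of a window centred in the sample, the RA check accepts a candidate when the mean counter is strictly below RA, only sea points receive the increment, and every candidate position is visited once in random order.

```python
import numpy as np

def sample_day(field, land, size=64, wr=16, la=0.50, ra=2.5, n=1, seed=0):
    """Extract gap-free size x size samples from one daily field.

    field: (rows, cols) array, NaN where data are missing (assumed encoding).
    land:  (rows, cols) boolean dynamic land-sea mask for that day.
    Returns the list of samples and the final counter array.
    """
    rng = np.random.default_rng(seed)
    counter = np.where(land, -1.0, 0.0)  # land points start at -1, sea points at 0
    off = (size - wr) // 2               # offset of the centred WR window
    rows, cols = field.shape
    corners = [(r, c) for r in range(rows - size + 1)
                      for c in range(cols - size + 1)]
    samples = []
    for k in rng.permutation(len(corners)):
        r, c = corners[k]
        patch = field[r:r + size, c:c + size]
        if np.isnan(patch).any():                    # check 1: gap-free
            continue
        if land[r:r + size, c:c + size].mean() > la:  # check 2: LA rule
            continue
        win = (slice(r + off, r + off + wr), slice(c + off, c + off + wr))
        if counter[win].mean() >= ra:                # check 3: RA rule
            continue
        samples.append(patch.copy())
        counter[win] += np.where(land[win], 0.0, n)  # tag the WR window as sampled
    return samples, counter
```

Changing `seed` changes the visiting order and therefore, in general, which candidates pass the RA check, which reflects the order dependence noted above.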

Appendix B. Sanity Tests

Dataset creation itself is a tedious and computationally intensive process that might introduce significant biases into the machine learning model. To avoid this, the statistical properties of the dataset, as well as the effects that the parameters used for dataset creation may have on the model output, were investigated. As explained earlier, the properties of the dataset are governed by the parameters WR, RA, and LA. The effect of each parameter on the dataset was tested by varying its value while keeping the values of the other two parameters constant. Starting with the dataset created with WR = 16, RA = 2.5, and LA = 0.50, six additional datasets were created by changing WR to 8 and to 32, RA to 1.5 and to 3.5, and finally LA to 0.25 and to 0.75. To select the best possible combination of the parameters, seven separate neural networks were trained on their respective datasets and their reconstruction accuracies were compared. While this approach might reveal which dataset gives the most accurate training, it does not quite verify the representativeness of the dataset compared to the initially available data. Therefore, three sanity tests were performed on the WR = 16, RA = 2.5, LA = 0.50 dataset in order to approximate and validate the spatial and temporal distribution along with the numerical representativeness. Numerical representativeness was also tested for the six remaining datasets. The results of the numerical representativeness tests are displayed in Table A1, and the results of the spatial and temporal tests of the WR = 16, RA = 2.5, LA = 0.50 dataset are displayed in Figure 5. All datasets performed similarly, with the general trend that bigger datasets capture the numerical distribution more closely.

Table A1. Numerical representativeness of the datasets with regard to the mean ${chl}_{a}$. While it may seem that the original .nc files contain a significantly higher mean ${chl}_{a}$ value, it is important to realise that natural obnubilation does not follow the rectangular shape of the matrices, so some deviation is to be expected.

WR	RA	LA	μchl_a (mg m⁻³)	σchl_a (mg m⁻³)	Number of Training Matrices
8	2.5	0.50	0.4254	0.7603	2,561,580
16	1.5	0.50	0.4147	0.7404	418,260
16	2.5	0.25	0.4085	0.7168	579,360
16	2.5	0.50	0.4189	0.7485	669,560
16	2.5	0.75	0.4306	0.7706	789,760
16	3.5	0.50	0.4215	0.7538	920,480
32	2.5	0.50	0.4048	0.7214	252,100
		.nc files	0.4831	1.005
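To put the differences in Table A1 into perspective, the relative deviation of each dataset mean from the mean of the original .nc files can be computed directly from the tabulated values. The short sketch below does only this arithmetic; the variable and function names are illustrative.

```python
# Mean chl_a (mg m^-3) per dataset, keyed by (WR, RA, LA), copied from Table A1.
dataset_means = {
    (8, 2.5, 0.50): 0.4254,
    (16, 1.5, 0.50): 0.4147,
    (16, 2.5, 0.25): 0.4085,
    (16, 2.5, 0.50): 0.4189,
    (16, 2.5, 0.75): 0.4306,
    (16, 3.5, 0.50): 0.4215,
    (32, 2.5, 0.50): 0.4048,
}
NC_MEAN = 0.4831  # mean chl_a of the original .nc files, from Table A1

def relative_deficit(mean, reference=NC_MEAN):
    """Percentage by which a dataset mean falls below the reference mean."""
    return 100.0 * (reference - mean) / reference

deficits = {key: relative_deficit(mu) for key, mu in dataset_means.items()}
closest = min(deficits, key=deficits.get)  # dataset whose mean is nearest the .nc mean

for key, value in sorted(deficits.items()):
    print(key, f"{value:.1f}%")
```

All seven dataset means lie below the .nc mean, which is consistent with the caption's remark that some deviation is expected because gaps do not follow the rectangular shape of the sampled matrices.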

Once the datasets were extracted, all data were normalised to the range $[0, 1]$ to improve the training stability. The normalisation was performed on all three channels. Bathymetry was normalised in accordance with the cutoff values that were assigned previously. ${chl}_{a}$ and SST were examined for the highest available values. The lowest value of both variables is zero, as defined by the value assigned to the land points. The highest recorded SST value was 303.57 K. ${chl}_{a}$ normalisation required more work. Namely, due to landlocked bodies of water, such as Valli di Comacchio, the maximum ${chl}_{a}$ was around 99 mg m$^{-3}$. This value is too high for the purpose of reconstructing sea ${chl}_{a}$, so a cutoff needed to be made. Examining the maximum values of ${chl}_{a}$ per matrix, it was determined that less than 0.5% of matrices had a maximum ${chl}_{a}$ value greater than 20 mg m$^{-3}$, and around 3% had a maximum ${chl}_{a}$ value greater than 10 mg m$^{-3}$. The cause of such unusually high ${chl}_{a}$ is the discharge of the Po River [52]. Since this paper deals with the reconstruction of ${chl}_{a}$ over the entire Adriatic, allowing such high values into the dataset could skew the data distribution in an unfavourable way, which could diminish the reconstruction accuracy. Therefore, a normalisation cutoff was set at 12 mg m$^{-3}$, so that the dataset would not be affected by unreasonably high ${chl}_{a}$. Reconstruction efforts were tested on previously unseen matrices from the test set. The train–test split was performed in an 80:20 ratio [53].
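The normalisation and the split can be sketched as follows. This is a minimal illustration under stated assumptions: values above a cutoff are clipped rather than discarded, bathymetry is expressed as positive depth, `bathy_cutoff` is a placeholder for the cutoff assigned earlier in the paper, and the split is a simple random permutation. Only the 12 mg m$^{-3}$ cutoff, the 303.57 K maximum and the 80:20 ratio come from the text.

```python
import numpy as np

CHL_CUTOFF = 12.0  # mg m^-3, normalisation cutoff stated in the text
SST_MAX = 303.57   # K, highest recorded SST stated in the text

def normalise_channels(chl, sst, bathy, bathy_cutoff):
    """Scale the three channels to [0, 1]; land points (value 0) stay at 0."""
    chl_n = np.clip(chl, 0.0, CHL_CUTOFF) / CHL_CUTOFF  # clipping is an assumption
    sst_n = np.clip(sst, 0.0, SST_MAX) / SST_MAX
    bathy_n = np.clip(bathy, 0.0, bathy_cutoff) / bathy_cutoff
    return np.stack([chl_n, sst_n, bathy_n], axis=0)

def train_test_split_indices(n_samples, test_fraction=0.20, seed=0):
    """Randomly split sample indices into train and test sets (80:20 by default)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]
```

With the cutoff at 12 mg m$^{-3}$, a Po River plume pixel of, say, 24 mg m$^{-3}$ maps to 1.0 in this sketch instead of compressing the rest of the distribution towards zero.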

Appendix C. Training and Testing the Datasets

In order to select the most appropriate of the seven datasets described, the reconstruction results were compared in terms of SSIM, MSE, and RE. The reconstruction accuracy was evaluated on the test set, in accordance with the experiment performed on the SSIM-based model. As for the effects of varying the WR, RA and LA parameter values, there are several findings to point out. Firstly, increasing WR decreases the SSIM and MSE values, while RE seems to be lowest for WR = 16. The RA value seems to be proportional to SSIM and inversely proportional to MSE, while RE seems to be lowest for RA = 2.5. Finally, increasing LA decreases both MSE and RE, while no conclusive correlation with SSIM is seen. As hinted by the results in Appendix B, the model trained on the WR = 8, RA = 2.5, LA = 0.50 dataset scored the best accuracies across all three metrics. However, this model took significantly longer to train, around 130 h, while the model for the next biggest dataset took around 50 h to train. Since the remaining models did not perform significantly worse, the model trained on the smallest dataset (WR = 32, RA = 2.5, LA = 0.50), which took around 14 h to train, was selected as the most efficient one and was used in the remainder of the paper. All training was GPU-accelerated, using four NVIDIA GeForce RTX 2080 Ti graphics cards on a system with 128 GB of RAM. The CPU used was an AMD Ryzen Threadripper 1920X 12-Core.

References

1. Groom, S.; Sathyendranath, S.; Ban, Y.; Bernard, S.; Brewin, R.; Brotas, V.; Brockmann, C.; Chauhan, P.; Choi, J.K.; Chuprin, A.; et al. Satellite Ocean Colour: Current Status and Future Perspective. Front. Mar. Sci. 2019, 6, 485.
2. Gregg, W.; Esaias, W.; Feldman, G.; Frouin, R.; Hooker, S.; McClain, C.; Woodward, R. Coverage opportunities for global ocean color in a multimission era. IEEE Trans. Geosci. Remote Sens. 1998, 36, 1620–1627.
3. Pukhtyar, L.D.; Stanichny, S.V.; Timchenko, I.E. Optimal interpolation of the data of remote sensing of the sea surface. Phys. Oceanogr. 2009, 19, 225.
4. Park, S.; Chu, P. Interannual SST variability in the Japan/East Sea and relationship with environmental variables. J. Oceanogr. 2006, 62, 115–132.
5. Alvera-Azcárate, A.; Barth, A.; Rixen, M.; Beckers, J. Reconstruction of incomplete oceanographic data sets using empirical orthogonal functions: Application to the Adriatic Sea surface temperature. Ocean Model. 2005, 9, 325–346.
6. Sunder, S.; Ramsankaran, R.; Ramakrishnan, B. Machine learning techniques for regional scale estimation of high-resolution cloud-free daily sea surface temperatures from MODIS data. ISPRS J. Photogramm. Remote Sens. 2020, 166, 228–240.
7. Park, J.; Kim, J.H.; Kim, H.C.; Kim, B.K.; Bae, D.; Jo, Y.H.; Jo, N.; Lee, S.H. Reconstruction of Ocean Color Data Using Machine Learning Techniques in Polar Regions: Focusing on Off Cape Hallett, Ross Sea. Remote Sens. 2019, 11, 1366.
8. Wang, J.; Deng, Z. Development of MODIS data-based algorithm for retrieving sea surface temperature in coastal waters. Environ. Monit. Assess. 2017, 189, 286.
9. Ehrler, M.; Ernst, N. VConstruct: Filling Gaps in Chl-a Data Using a Variational Autoencoder. arXiv 2021, arXiv:2101.10260.
10. Barth, A.; Alvera-Azcárate, A.; Licer, M.; Beckers, J.M. DINCAE 1.0: A convolutional neural network with error estimates to reconstruct sea surface temperature satellite observations. Geosci. Model Dev. 2020, 13, 1609–1622.
11. Ćatipović, L.; Matić, F.; Kalinić, H. Reconstruction Methods in Oceanographic Satellite Data Observation—A Survey. J. Mar. Sci. Eng. 2023, 11, 340.
12. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661.
13. Dong, J.; Yin, R.; Sun, X.; Li, Q.; Yang, Y.; Qin, X. Inpainting of Remote Sensing SST Images With Deep Convolutional Generative Adversarial Network. IEEE Geosci. Remote Sens. Lett. 2019, 16, 173–177.
14. Kang, S.H.; Choi, Y.; Choi, J.Y. Restoration of Missing Patterns on Satellite Infrared Sea Surface Temperature Images Due to Cloud Coverage Using Deep Generative Inpainting Network. J. Mar. Sci. Eng. 2021, 9, 310.
15. Shibata, S.; Iiyama, M.; Hashimoto, A.; Minoh, M. Restoration of Sea Surface Temperature Satellite Images Using a Partially Occluded Training Set. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2771–2776.
16. Hirahara, N.; Sonogashira, M.; Iiyama, M. Cloud-Free Sea-Surface-Temperature Image Reconstruction From Anomaly Inpainting Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4203811.
17. Zheng, Z.; Xin, Z.; Yu, Z.; Yeung, S.K. Real-time GAN-based image enhancement for robust underwater monocular SLAM. Front. Mar. Sci. 2023, 10, 1161399.
18. Lin, J.C.; Hsu, C.B.; Lee, J.C.; Chen, C.H.; Tu, T.M. Dilated Generative Adversarial Networks for Underwater Image Restoration. J. Mar. Sci. Eng. 2022, 10, 500.
19. Zhang, J.; Ning, P.; Zhang, X.; Wang, X.; Zhang, A. Deriving Sea Subsurface Temperature Fields From Satellite Remote Sensing Data Using a Generative Adversarial Network Model. Earth Space Sci. 2023, 10, e2022EA002804.
20. Wu, P.; Harris, C.A.; Salavasidis, G.; Lorenzo-Lopez, A.; Kamarudzaman, I.; Phillips, A.B.; Thomas, G.; Anderlini, E. Unsupervised anomaly detection for underwater gliders using generative adversarial networks. Eng. Appl. Artif. Intell. 2021, 104, 104379.
21. Denton, E.; Gross, S.; Fergus, R. Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks. arXiv 2016, arXiv:1611.06430.
22. Liu, D.; Wang, Y. Trends of satellite derived chlorophyll-a (1997–2011) in the Bohai and Yellow Seas, China: Effects of bathymetry on seasonal and inter-annual patterns. Prog. Oceanogr. 2013, 116, 154–166.
23. Hussein, K.A.; Al Abdouli, K.; Ghebreyesus, D.T.; Petchprayoon, P.; Al Hosani, N.; O. Sharif, H. Spatiotemporal Variability of Chlorophyll-a and Sea Surface Temperature, and Their Relationship with Bathymetry over the Coasts of UAE. Remote Sens. 2021, 13, 2447.
24. Cushman-Roisin, B.; Gacic, M.; Poulain, P.M.; Artegiani, A. Physical Oceanography of the Adriatic Sea: Past, Present, and Future; Springer: Dordrecht, The Netherlands, 2001.
25. ISO. ISO10260-1992; Water Quality—Measurement of Biochemical Parameters—Spectrometric Determination of the Chlorophyll-a Concentration. International Organization for Standardization: Geneva, Switzerland, 1992.
26. Belward, A.; Bourassa, M.; Dowell, M.; Briggs, S.; Dolman, H.A.; Holmlund, K.; Husband, R.; Quegan, S.; Simmons, A.; Sloyan, B.; et al. The Global Observing System for Climate: Implementation Needs; WHO: Geneva, Switzerland, 2016.
27. Sathyendranath, S.; Jackson, T.; Brockmann, C.; Brotas, V.; Calton, B.; Chuprin, A.; Clements, O.; Cipollini, P.; Danne, O.; Dingle, J.; et al. ESA Ocean Colour Climate Change Initiative (Ocean Colour CCI): Version 5.0 Data. 2021. Available online: https://doi.org/10.5285/1dbe7a109c0244aaad713e078fd3059a (accessed on 1 January 2020).
28. Hu, C.; Feng, L.; Lee, Z.; Franz, B.; Bailey, S.; Werdell, J.; Proctor, C. Improving Satellite Global Chlorophyll a Data Products Through Algorithm Refinement and Data Recovery. J. Geophys. Res. Ocean. 2019, 124, 1524–1543.
29. Jackson, T.; Sathyendranath, S.; Mélin, F. An improved optical classification scheme for the Ocean Colour Essential Climate Variable and its applications. Remote Sens. Environ. 2017, 203, 152–161.
30. NASA/JPL. GHRSST Level 4 MUR Global Foundation Sea Surface Temperature Analysis (v4.1). 2015. Available online: https://doi.org/10.5067/GHGMR-4FJ04 (accessed on 1 January 2020).
31. GEBCO. Gridded Bathymetry Data (General Bathymetric Chart of the Oceans). 2022. Available online: https://doi.org/10.5285/e0f0bb80-ab44-2739-e053-6c86abc0289c (accessed on 1 January 2020).
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
33. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434.
34. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
35. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
36. Županović, T. Using Deep Learning Methods Based on CNNS and Gans for Data Completion and Reconstruction. 2021. Available online: https://github.com/TomislavZupanovic/Data-Reconstruction (accessed on 1 January 2020).
37. Linder-Norén, E. PyTorch-GAN. 2018. Available online: https://github.com/eriklindernoren/PyTorch-GAN (accessed on 1 January 2020).
38. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
39. Wang, Z.; Simoncelli, E.; Bovik, A. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402.
40. Wang, Z.; Bovik, A.C. Mean squared error: Love it or leave it? A new look at Signal Fidelity Measures. IEEE Signal Process. Mag. 2009, 26, 98–117.
41. Francois-Rozet. Francois-Rozet/Piqa: Pytorch Image Quality Assessement Package. 2020. Available online: https://github.com/francois-rozet/piqa (accessed on 1 January 2020).
42. Fritzke, B. A Growing Neural Gas Network Learns Topologies. In Proceedings of the 7th International Conference on Neural Information Processing Systems, Denver, CO, USA, 28 November–1 December 1994; MIT Press: Cambridge, MA, USA, 1994; pp. 625–632.
43. Civitarese, G.; Gačić, M.; Lipizer, M.; Eusebi Borzelli, G.L. On the impact of the Bimodal Oscillating System (BiOS) on the biogeochemistry and biology of the Adriatic and Ionian Seas (Eastern Mediterranean). Biogeosciences 2010, 7, 3987–3997.
44. Kourafalou, V. Process studies on the Po River plume, North Adriatic Sea. J. Geophys. Res. 1999, 1042, 29963–29986.
45. Kodali, N.; Abernethy, J.D.; Hays, J.; Kira, Z. How to Train Your DRAGAN. arXiv 2017, arXiv:1705.07215.
46. Gower, J.; King, S.; Borstad, G.; Brown, L. Detection of intense plankton blooms using the 709nm band of the MERIS imaging spectrometer. Int. J. Remote Sens. 2005, 26, 2005–2012.
47. Smith, M.; Robertson Lain, L.; Bernard, S. An optimized Chlorophyll a switching algorithm for MERIS and OLCI in phytoplankton-dominated waters. Remote Sens. Environ. 2018, 215, 217–227.
48. Lee, Z.P.; Hu, C. Global distribution of Case-1 waters: An analysis from SeaWiFS measurements. Remote Sens. Environ. 2006, 101, 270–276.
49. Sathyendranath, S.; Prieur, L.; Morel, A. A three-component model of ocean colour and its application to remote sensing of phytoplankton pigments in coastal waters. Int. J. Remote Sens. 1989, 10, 1373–1394.
50. Odena, A.; Dumoulin, V.; Olah, C. Deconvolution and Checkerboard Artifacts. Distill 2016. Available online: https://distill.pub/2016/deconv-checkerboard/ (accessed on 1 January 2020).
51. Innamorati, C.; Ritschel, T.; Weyrich, T.; Mitra, N.J. Learning on the Edge: Investigating Boundary Filters in CNNs. Int. J. Comput. Vis. 2020, 128, 773–782.
52. Marini, M.; Jones, B.H.; Campanelli, A.; Grilli, F.; Lee, C.M. Seasonal variability and Po River plume influence on biochemical properties along western Adriatic coast. J. Geophys. Res. Ocean. 2008, 113.
53. Vrigazova, B. The Proportion for Splitting Data into Training and Test Set for the Bootstrap in Classification Problems. Bus. Syst. Res. J. 2021, 12, 228–242.

Figure 1. The Adriatic Sea. The position on the globe is enclosed by the red rectangle on the minimap.

Figure 2. Visualisation of the masking process of ${chl}_{a}$ data: (a) the full data; and (b) the manually masked data.

Figure 3. Iteration-dependent behaviour of the generator and discriminator loss values for the different loss metrics. The top graph displays the mean-squared-error-based (MSE) model. The middle graph depicts the behaviour of the MSE model that takes into account the distribution of land points. The bottom graph displays the Structural-Similarity-Index-Measure-based (SSIM) model.

Figure 4. Example of MSE ambiguity. The leftmost image was created by the cross multiplication of a sinusoidal vector and its transpose. The middle image was filled with the mean value of the leftmost image. The rightmost image was obtained by the linear transformation of the leftmost image.

Figure 5. Geospatial distribution of the three error metrics—Structural Similarity Index Measure (SSIM) in (a), mean squared error (MSE) in (b), relative error (RE) in (c). (d) displays the spatial distribution of test data sampling.

Figure 6. Intra-annual and interannual distributions of Structural Similarity Index Measure (SSIM, blue), mean squared error (MSE, red), and relative error (RE, green). Top part displays the monthly dependency, while the bottom displays the yearly. Boxplots display the minimum, maximum, mean, lower, and upper quartile of each metric.

Figure 7. Reconstruction of the masked parts of data matrices as selected by the least vector norm based on the characteristic patterns (A1–A5,B1–B5,C1–C5,D1–D5,E1–E5,F1–F5,G1–G5, and H1–H5) derived by double Growing Neural Gas. Columns 1 and 2 display the proxy variables, sea surface temperature (SST) and bathymetrical data, in that order. Column 3 is the masked real chlorophyll a concentration data presented to the CCGAN algorithm, column 4 displays CCGAN’s reconstructed output, and column 5 contains the respective difference between the target and reconstructed data.

Table 1. Comparison of the three reconstruction accuracy metrics for the three models with different loss functions. ${MSE}_{1}$ represents the base model whose loss function is determined solely by the mean squared error value; ${MSE}_{2}$ is the updated mean squared error model whose loss function takes into account the distribution of land points; and SSIM is the model whose loss function has exchanged the mean squared error metric for Structural Similarity Index Measure. $μ$ represents the mean, and $σ$ represents the standard deviation.

	$μ_{SSIM}$	$σ_{SSIM}$	$μ_{MSE}$	$σ_{MSE}$	$μ_{RE}$ (%)	$σ_{RE}$ (%)
${MSE}_{1}$ -based model	0.09	0.14	29.61	8.26	2570	1445
${MSE}_{2}$ -based model	0.12	0.15	29.56	7.66	2553	1408
SSIM-based model	0.95	0.04	0.01	0.02	3	2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
