Article

Few-Shot Wideband Tympanometry Classification in Otosclerosis via Domain Adaptation with Gaussian Processes

1. Laboratoire ImViA (EA 7535), Université Bourgogne Franche-Comté, 21078 Dijon, France
2. State Key Laboratory of Acoustics, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
3. University of Chinese Academy of Sciences, Beijing 100049, China
4. Otolaryngology Department, Dijon University Hospital, 21000 Dijon, France
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(24), 11839; https://doi.org/10.3390/app112411839
Submission received: 20 November 2021 / Revised: 8 December 2021 / Accepted: 8 December 2021 / Published: 13 December 2021
(This article belongs to the Special Issue Applied Artificial Intelligence (AI))

Abstract

Otosclerosis is a common middle ear disease whose routine diagnosis requires a combination of examinations. In a previous study, we showed that this disease could potentially be diagnosed rapidly and non-invasively by wideband tympanometry (WBT) coupled with a convolutional neural network (CNN), and that deep transfer learning with data augmentation could be applied successfully to this task. However, the synthetic and realistic data involved exhibit a significant discrepancy that impedes the performance of transfer learning. To address this issue, a Gaussian processes-guided domain adaptation (GPGDA) algorithm was developed. During transfer, it leverages both a loss on the distribution distance calculated by the Gaussian processes and the conventional cross-entropy loss. On a WBT dataset including 80 otosclerosis and 55 control samples, it achieved an area under the curve of 97.9 ± 1.1 percent after receiver operating characteristic analysis and an F1-score of 95.7 ± 0.9 percent, superior to the baseline methods (r = 10, p < 0.05, ANOVA). To understand the algorithm's behavior, the role of each component of the GPGDA was experimentally explored on the dataset. In conclusion, our GPGDA algorithm appears to be an effective tool to enhance CNN-based WBT classification in otosclerosis using only a limited number of realistic data samples.

1. Introduction

Wideband tympanometry (WBT) provides a wealth of information on the middle ear mechanics in the form of a 3D diagram of the absorbance as a function of probe sound frequency and external ear pressure [1]. The diagrams show different patterns in healthy and otosclerosis middle ears [2,3], but the visual analysis requires professional experience, is time-consuming, and is prone to error. The automatic analysis approaches based on the manual feature extraction often have a limited accuracy and need to be further improved [4,5].
Deep learning classifies images and complex signals using data-specific features automatically extracted by convolutional neural networks (CNN) and has made remarkable progress in recent years with the help of large-scale annotated datasets [6,7,8,9]. However, large amounts of annotated data are virtually unavailable in many real-world applications, including WBT analysis [5,10]. Therefore, for our few-shot scenario, it is necessary to improve the training phase of deep learning so that the method can achieve comparable generalization performance with less labeled data. There are many approaches to few-shot learning [11], including network pruning [12], metric learning [13], meta-learning [14], and transfer learning [15].
In a previous study, we showed that, with the help of data augmentation and transfer learning, CNN could achieve an accurate classification of otosclerosis versus healthy ears using only a small amount of annotated WBT data [5]. This method would pre-train the network on a similar but different large source dataset and then further optimize its parameters on the small target dataset that was actually concerned. The source dataset employed for pre-training could be a large open-source dataset for the general classification task such as ImageNet [16], or it could also be generated with the help of prior knowledge [17] or data augmentation [18]. However, there is an inevitable bias between the source and target datasets due to the inherent nature of the transfer learning solution. Hence, it is essential to address the issue of domain shift for effective transferring, and the domain adaptation method could usually be adopted to deal with this hurdle [19].
In this article, a deep transfer learning solution was adopted to perform WBT classification for automatic diagnosis of otosclerosis under the constraint of limited labeled WBT data. The source dataset was synthesized by data augmentation. We focused on the issue of domain shift between the synthetic source and realistic target datasets and proposed a novel approach of domain adaptation to leverage the source dataset applied to the transfer learning and to improve the diagnostic accuracy. Our approach measured the difference between the source and target datasets via Gaussian processes (GP) and incorporated the information into the parameter optimisation loop when training on the small but focused target dataset in order to guide the transferring. The performance of our algorithm was experimentally verified on an open-source WBT dataset for the otosclerosis diagnosis task [5] and was compared with the state-of-the-art methods. Additionally, the selection of the hyperparameters in the proposed algorithm was assessed via numerical experiments.

2. Material and Method

2.1. Background

Classification of WBT in otosclerosis. WBT analysis for the diagnosis of otosclerosis can be regarded as a binary image classification scenario, and the solution generally consists of two steps: feature extraction and SoftMax classification. Feature extraction is the key step of the pipeline, and existing works therefore regularly focus on it. Commonly extracted features include interpretable univariate features [2], interpretable multivariate features [4], abstract multivariate features [20], and abstract CNN-based features [5], where the first two are manually extracted, the third is obtained by classical machine learning methods such as principal component analysis, and the last is usually based on deep-learning approaches.
Domain adaptation. The data and corresponding labels from a dataset can be formalized as samples of a specific joint distribution. To achieve an effective transfer from the pre-trained task to the task of interest, a domain adaptation method should accurately measure the difference between these two joint distributions and use this knowledge to guide the adjustment of the network parameters on the target dataset. Existing methods usually employ the maximum mean discrepancy (MMD) distance [21,22,23] or the Wasserstein distance [24,25,26] to measure the difference between two distributions. Generally, the MMD distance only considers the first-order moment, while the Wasserstein distance is relatively difficult to compute numerically.
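To make the MMD distance mentioned above concrete, a biased empirical estimate of the squared MMD with an RBF kernel can be sketched in a few lines; the kernel width and the toy Gaussian data below are illustrative assumptions, not values used in this study:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    # Biased empirical estimate of the squared MMD between samples X and Y
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(50, 2))   # toy "source" features
Xt = rng.normal(2.0, 1.0, size=(50, 2))   # shifted toy "target" features
print(mmd2(Xs, Xs))  # identical samples: estimate is 0
print(mmd2(Xs, Xt))  # shifted samples: estimate is clearly positive
```

This is the kind of scalar discrepancy that DAN-style methods add to the training loss to penalize domain shift.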
GP framework. Given the mean function $m(\mathbf{x})$ and the covariance (or kernel) function $k(\mathbf{x}, \mathbf{x}')$ of $f(\mathbf{x})$, defined as
$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\big[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))\big],$$
the GP can be written as
$$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\big).$$
Any sampled value of a GP is a Gaussian random variable, and any collection of such values obeys a joint Gaussian distribution. A realistic observation $y$ can be modeled as
$$y = f(\mathbf{x}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2),$$
where $\epsilon$ is the potential error and $\sigma_\epsilon^2$ is its variance.
Given a set of observed data $\{(\mathbf{x}, y)\}$, the conditional probability $p(y_* \mid \{\mathbf{x}\}, \{y\}, \mathbf{x}_*)$ of the sampled value $y_*$ at a new observation point $\mathbf{x}_*$ obeys a Gaussian distribution $\mathcal{N}(\hat{\mu}, \hat{\Sigma})$ [27] with
$$\hat{\mu} = K(\mathbf{x}_*, \{\mathbf{x}\})\big[K(\{\mathbf{x}\}, \{\mathbf{x}\}) + \sigma_\epsilon^2 I\big]^{-1}\big(\{y\} - \{m(\mathbf{x})\}\big) + m(\mathbf{x}_*),$$
$$\hat{\Sigma} = K(\mathbf{x}_*, \mathbf{x}_*) - K(\mathbf{x}_*, \{\mathbf{x}\})\big[K(\{\mathbf{x}\}, \{\mathbf{x}\}) + \sigma_\epsilon^2 I\big]^{-1} K(\{\mathbf{x}\}, \mathbf{x}_*),$$
where $K(\cdot)$ is a matrix whose elements are given by the corresponding kernel $k(\cdot)$.
In this work, $m(\mathbf{x})$ is 0 (the usual setting) and $k(\mathbf{x}, \mathbf{x}')$ is the radial basis function (RBF) kernel:
$$k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{1}{2l^2}\,\|\mathbf{x} - \mathbf{x}'\|^2\right).$$
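The posterior mean and covariance above can be illustrated with a small zero-mean GP regression sketch; the length scale, noise level, and toy sine-wave data are assumptions chosen for illustration only:

```python
import numpy as np

def rbf(A, B, l=1.0):
    # RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * l ** 2))

def gp_posterior(X_train, y_train, X_new, l=1.0, sigma_eps=0.1):
    # Posterior mean and covariance of a zero-mean GP with RBF kernel,
    # following the expressions for mu_hat and Sigma_hat above.
    K = rbf(X_train, X_train, l) + sigma_eps ** 2 * np.eye(len(X_train))
    K_inv = np.linalg.inv(K)
    K_star = rbf(X_new, X_train, l)
    mu = K_star @ K_inv @ y_train
    Sigma = rbf(X_new, X_new, l) - K_star @ K_inv @ K_star.T
    return mu, Sigma

X = np.linspace(0, 5, 8).reshape(-1, 1)
y = np.sin(X).ravel()
mu, Sigma = gp_posterior(X, y, X, sigma_eps=0.05)
# At the training points, mu is close to y and the posterior
# variances on the diagonal of Sigma are small.
```

In the actual implementation, a library such as GPyTorch would handle this inference differentiably; the sketch only shows the algebra.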

2.2. Gaussian Processes-Guided Domain Adaptation (GPGDA) Algorithm

In this study, the approach to WBT classification in otosclerosis was based on deep transfer learning, which usually involves two problems: how to generate the source dataset and how to transfer from the source dataset to the target dataset. We mainly focused on optimizing the transfer strategy for our diagnosis task by considering the domain shift between the source and target datasets. The source datasets were generated by several data augmentation schemes, including nearest-neighbor interpolation [28], additive noise [29], and mixup [30], named SynDataI, SynDataN, and SynDataM [5], respectively. They depended on the (realistic) training data but were independent of the (realistic) testing data. SynDataI and SynDataN had deterministic hard labels, and SynDataM had probabilistic soft labels. The adopted backbone network was a lightweight LeNet-like architecture (Figure 1), where the number of channels was 12 and 24 for conv1 and conv2, respectively; the kernel size was 2 × 4 for the convolutional layers and 4 × 5 for the pooling layers; and the size of the hidden layer in the classifier (the output of fc3) was 32. In addition, no padding was employed in any convolutional or pooling layer of the backbone.
The whole pipeline was conventionally divided into a training phase to optimize the parameters of the backbone network and a testing phase to evaluate the final performance. The GPGDA algorithm ran during the training phase, using both the realistic target and the synthetic source datasets to guide the parameter optimization (Figure 2). The testing phase followed the routine procedure.
Specifically, the deep neural network transformed the data $\mathbf{x}$ (with its actual label $y$, unified as $(\mathbf{x}, y)$) into a feature space via the feature extraction layers to obtain the latent variable $\mathbf{z}$, then classified $\mathbf{z}$ via a SoftMax layer to yield the final classification output. The algorithm adapts the distributions to make $p(\{y^r\}, \{\mathbf{z}^r\} \mid \{\mathbf{x}^r\})$ on the realistic dataset and $p(\{y^s\}, \{\mathbf{z}^s\} \mid \{\mathbf{x}^s\})$ on the synthetic dataset as close as possible via GP. First, we hypothesized that the relationship between $y$ and $\mathbf{z}$ could be modeled as
$$y = \mathcal{GP}\big(0, k(\mathbf{z}, \mathbf{z}')\big) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2),$$
where $\epsilon$ is the modeling error.
Naturally, the low-dimensional manifold on which $p(\{y^s\}, \{\mathbf{z}^s\} \mid \{\mathbf{x}^s\})$ lies can be fitted with (6). As a result, given the synthetic dataset $\{(\mathbf{x}^s, y^s)\}$ and the realistic dataset $\{(\mathbf{x}^r, y^r)\}$, the conditional distribution $p(\{y^r\} \mid \{\mathbf{z}^s\}, \{y^s\}, \{\mathbf{z}^r\})$ can be expressed as $\mathcal{N}(\hat{\mu}^r, \hat{\Sigma}^r)$ with
$$\hat{\mu}^r = K(\{\mathbf{z}^r\}, \{\mathbf{z}^s\})\big[K(\{\mathbf{z}^s\}, \{\mathbf{z}^s\}) + \sigma_\epsilon^2 I\big]^{-1}\{y^s\},$$
$$\hat{\Sigma}^r = K(\{\mathbf{z}^r\}, \{\mathbf{z}^r\}) - K(\{\mathbf{z}^r\}, \{\mathbf{z}^s\})\big[K(\{\mathbf{z}^s\}, \{\mathbf{z}^s\}) + \sigma_\epsilon^2 I\big]^{-1} K(\{\mathbf{z}^s\}, \{\mathbf{z}^r\}).$$
Then, the uncertainty loss $l_m$ and the loss $l_c$ on the difference between mean values and labels can be defined as
$$l_m = \frac{1}{n}\sum_{i=1}^{n} \hat{\Sigma}^r_{ii}, \qquad l_c = \frac{1}{n}\sum_{i=1}^{n} \big(\hat{\mu}^r_i - y^r_i\big)^2,$$
where $n$ is the number of training (target) data and $\hat{\Sigma}^r_{ii}$ is the element in the $i$th row and $i$th column of the matrix $\hat{\Sigma}^r$; $l_m$ measures the difference between the marginal distributions $p(\{\mathbf{x}\})$, and $l_c$ measures the difference between the conditional distributions $p(\{y\} \mid \{\mathbf{x}\})$.
The conventional cross-entropy (CE) loss is denoted as $l_s^{CE}$ or $l_r^{CE}$. The GP loss $l_{GP}$ depends on whether the target data are labeled and is defined as
$$l_{GP} = \begin{cases} l_c + \lambda\, l_m, & \text{labeled}, \\ l_m, & \text{unlabeled}, \end{cases}$$
where $\lambda$ is a hyperparameter that adjusts the proportion of the different losses.
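A compact sketch of how the losses l_m, l_c, and l_GP could be computed from synthetic and realistic feature batches follows; the feature dimension, length scale, noise level, and toy labels are illustrative assumptions (only the default λ = 0.5 mirrors the paper's setting):

```python
import numpy as np

def rbf(A, B, l=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * l ** 2))

def gp_loss(z_s, y_s, z_r, y_r=None, sigma_eps=0.1, lam=0.5):
    # Fit a zero-mean GP on synthetic features/labels, infer on realistic
    # features, then compute l_m (mean posterior variance) and, when labels
    # are available, l_c (squared error of the posterior mean vs. labels).
    K_ss = rbf(z_s, z_s) + sigma_eps ** 2 * np.eye(len(z_s))
    K_inv = np.linalg.inv(K_ss)
    K_rs = rbf(z_r, z_s)
    mu_r = K_rs @ K_inv @ y_s
    Sigma_r = rbf(z_r, z_r) - K_rs @ K_inv @ K_rs.T
    l_m = np.diag(Sigma_r).mean()
    if y_r is None:                     # unlabeled target data: l_GP = l_m
        return l_m
    l_c = ((mu_r - y_r) ** 2).mean()
    return l_c + lam * l_m              # labeled target data

rng = np.random.default_rng(1)
z_s = rng.normal(size=(40, 4))                      # toy synthetic features
y_s = (z_s[:, 0] > 0).astype(float)                 # toy binary labels
z_r = z_s[:10] + 0.01 * rng.normal(size=(10, 4))    # nearby realistic features
loss = gp_loss(z_s, y_s, z_r, y_s[:10])
```

In the paper's pipeline this scalar is computed with GPyTorch so that its gradient can flow back into the feature extractor; the NumPy version only shows the arithmetic.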
The objective of the proposed GPGDA algorithm was to realize the effective transfer from the synthetic domain to the real domain through a novel training manner. It defined new losses and optimized them to make the distributions of realistic domain and synthetic domain as close as possible (Algorithm 1).
Algorithm 1 The proposed GPGDA algorithm.
Input: Realistic dataset $D^r$, synthetic dataset $D^s$, feature extractor $g(\cdot; \alpha_0)$, classifier $\sigma(\cdot; \beta_0)$, hyperparameter $\lambda$.
Output: Feature extractor $g(\cdot; \alpha_{up})$ and classifier $\sigma(\cdot; \beta_{up})$.
1: $\alpha = \alpha_0$, $\beta = \beta_0$.
2: for epoch = 1, 2, … do
3:   Calculate $l_s^{CE} = CE(y^s, \sigma(g(\mathbf{x}^s; \alpha); \beta))$; perform gradient backpropagation; update $\alpha$ and $\beta$.
4:   Learn the manifold $M$ on which $p(\{y^s\}, \{\mathbf{z}^s\} \mid \{\mathbf{x}^s\})$ lies by (6).
5:   if the data in $D^r$ are labeled then
6:     Calculate $l_r^{CE} = CE(y^r, \sigma(g(\mathbf{x}^r; \alpha); \beta))$; perform gradient backpropagation; update $\beta$.
7:     Infer $\hat{\mu}^r$ and $\hat{\Sigma}^r$ by $M$ and (7) with $\mathbf{z}^r = g(\mathbf{x}^r; \alpha)$.
8:     Calculate $l_{GP}$ by (10); perform gradient backpropagation; update $\alpha$.
9:   else
10:    Infer $\hat{\Sigma}^r$ by $M$ and (7) with $\mathbf{z}^r = g(\mathbf{x}^r; \alpha)$.
11:    Calculate $l_{GP}$ by (10); perform gradient backpropagation; update $\alpha$.
12:  end if
13: end for
14: $\alpha_{up} = \alpha$, $\beta_{up} = \beta$.
In the pseudo-code of the GPGDA algorithm, line 3 represents the training step on labeled synthetic data, lines 6 to 8 represent the training steps on labeled realistic data, and lines 10 and 11 represent the training steps on unlabeled realistic data. λ is the hyperparameter that adjusts the relative importance of the marginal loss l_m and the conditional loss l_c during backpropagation and parameter updating in the training phase.

2.3. Experimental Details

WBT dataset. In order to demonstrate the performance of the proposed algorithm for the otosclerosis diagnostic task, the numerical experiments were performed on an open-source WBT dataset [5]. This dataset consisted of 80 samples from otosclerosis ears and 55 samples from healthy control ears, and each WBT sample was normalized to a single-channel 2D image with 42 × 107 pixels (Figure 3). The pressure axis of these WBT data ranged from −275 to 135 daPa (a stride of 10 daPa), and the frequency extended from 226 to 8000 Hz (1/24 octave band frequencies).
Evaluation scheme. The performance evaluation results were obtained via the five-fold cross-validation, and the evaluation indexes included F1-score and the area under the receiver operating characteristic curve (AUC). The dataset was divided into five non-overlapping parts. Each fold used one part as the testing set and the others as the training set. The recorded results were the average value of testing performance in all folds. Subsequently, the Monte Carlo test was applied, repeating the cross-validation with 10 different random seeds. The mean and standard deviation of these performance values were noted as the final result.
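The evaluation protocol above can be sketched in pure Python; the toy evaluator standing in for the CNN training/testing step is an assumption for illustration (a real run would return per-fold F1 or AUC values):

```python
import random
import statistics

def five_fold_indices(n, seed):
    # Shuffle sample indices and split them into five non-overlapping parts;
    # each fold uses one part for testing and the remaining four for training.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::5] for i in range(5)]
    for k in range(5):
        test = parts[k]
        train = [i for j in range(5) if j != k for i in parts[j]]
        yield train, test

def monte_carlo_cv(n, evaluate, seeds=range(10)):
    # Repeat the five-fold cross-validation with 10 different random seeds
    # and report the mean and standard deviation of the per-seed averages.
    scores = []
    for s in seeds:
        fold_scores = [evaluate(tr, te) for tr, te in five_fold_indices(n, s)]
        scores.append(sum(fold_scores) / len(fold_scores))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy evaluator: with 135 samples, each fold trains on 108 and tests on 27.
mean, std = monte_carlo_cv(135, lambda tr, te: len(tr) / 135)
```

With the dataset size of 135, each training set contains 108 samples, which matches the figure quoted later for the weakly supervised experiment.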
Baseline methods. In the follow-up comparative study, two existing transfer strategies were designated as baselines, including Fine-Tuning (FT) method in [5] and deep adaptation networks (DAN) in [22]. The two could be regarded as the basic solutions in the field. FT strategy did not explicitly consider the difference between distributions of the source dataset and the target dataset, whereas DAN measured the difference by the MMD distance and employed it to guide the transfer. After reproducing these benchmark methods on the WBT dataset, the performance and efficiency of GPGDA were verified through comparative experiments.
Statistical testing. The performance results were analyzed by a two-way ANOVA test, of which the factors were the discussed issue for current experiment (approach types, hyperparameter values, etc.) and the type of source datasets. When a significant difference ( p < 0.05 ) was observed for the discussed issue, a post hoc test was further performed using Tukey’s multiple comparison test.
Implementation settings. These approaches were implemented via PyTorch (version 1.9.0) [31] and GPyTorch (version 1.5.0) [32], and the reproduction of the DAN method also involved the DALIB tools [33]. A single Tesla P100 GPU was used to train the networks from scratch. The weights were initialized with the default values from PyTorch. The batch size was 12, the number of epochs was 40, and early stopping was used during training. The ADAM optimizer [34] was employed to update the CNN parameters with an initial learning rate of 0.001 and a learning rate decay of 0.95 after every 20 epochs; these hyperparameters followed the default values from PyTorch. Moreover, SynDataI, SynDataN, and SynDataM [5] were employed as the source datasets in deep transfer learning, and the ratio of augmented to realistic data was 50 (the size of a source dataset was thus 50 × n, where n was the number of training data employed).
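The schedule and dataset-size arithmetic above can be summarized with two small helpers; the exact epoch at which each decay step applies is an assumption, since the text only states "after every 20 epochs":

```python
def learning_rate(epoch, base_lr=0.001, decay=0.95, step=20):
    # Step decay: multiply the learning rate by 0.95 after every 20 epochs.
    return base_lr * decay ** (epoch // step)

def source_dataset_size(n_train, ratio=50):
    # Ratio of augmented to realistic data is 50, so the synthetic
    # source dataset holds 50 x n samples for n realistic training samples.
    return ratio * n_train

print(learning_rate(0), learning_rate(20))   # 0.001 then 0.00095
print(source_dataset_size(108))              # 5400 synthetic samples
```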

3. Results

Comparative experiment. The GPGDA performance and those of the two baseline methods (FT and DAN) were assessed by F1-score and AUC on the WBT dataset and three different source datasets including SynDataI, SynDataN, and SynDataM (Table 1). As judged by the F1-score and AUC, GPGDA yielded the best results in comparison with FT and DAN regardless of the source dataset types.
Moreover, it has to be noted that the performance gains of GPGDA over FT came at the cost of a considerable increase in training time. Since the GPGDA algorithm involves multiplication and inversion of matrices, its time complexity is O(n³) in the number of target data n when the ratio of synthetic to realistic data is kept constant. This means that the difference in training time between the proposed algorithm and FT grows rapidly (cubically) with the dataset size. In practice, for our settings, training a five-fold cross-validation with FT took about 4 min on a Tesla P100 GPU, while GPGDA took 25 min under the same conditions on the same GPU. However, training was performed offline, and there was no significant difference in the inference time of these approaches (approximately 9 ms on an Intel CPU) in clinical practice, since they adopted the same backbone network architecture.
Ablation experiment I: hyperparameter selection. The proposed GPGDA algorithm included a hyperparameter λ for adjusting the weights of l m and l c in the final l G P that was adopted to update the parameters of feature extractors. λ significantly influenced the performance of GPGDA as evaluated by AUC, regardless of the source dataset types (Table 2). Additionally, the optimal results were generally achieved at λ = 0.5 for the task of WBT-based diagnosis of otosclerosis.
Ablation experiment II: the roles of l_GP and l_rCE. The proposed GPGDA algorithm included two losses, l_GP and l_rCE, which provided the gradient flow for optimizing the parameters of the feature extractor and of the classifier during training, respectively. The two losses played different roles in GPGDA (Table 3). The proposed algorithm had both losses enabled, and disabling either of them caused a drop in performance. When only l_rCE was enabled, the feature extractor retained the parameters from the previous pre-training on the synthetic source dataset (similar to the fixed-feature-extraction transfer strategy in [5]). When only l_GP was enabled, the classifier retained the pre-trained parameters. Disabling l_GP resulted in a more severe performance degradation than disabling l_rCE. Additionally, disabling both losses yielded a final classification performance similar to that of employing the pre-trained network directly; this configuration did not explicitly utilize the realistic dataset and can be treated as the lower bound of the algorithm's performance.
Furthermore, the decay modes of the training losses when increasing the number of epochs were investigated. Two components l c and l m in l G P were considered separately. The analysis for three losses involved in the GPGDA algorithm showed that l r C E and l c followed the same steep one-phase exponential decay while l m followed a different and more gradual decline (Figure 4).
Experiment on the algorithm run in a weakly supervised manner. Since Gaussian processes naturally output uncertainty without the information on labels, the proposed GPGDA algorithm can potentially exploit both unlabeled and labeled data. Considering that the WBT dataset size was 135 and a five-fold cross-validation was adopted, the size of the training set in our previous experiments was equivalent to 108. To assess the effectiveness of the algorithm on unlabeled data, 2/3 of the training data were regarded as labeled, and the rest was considered as unlabeled (Table 4). It should be noted that unlabeled data were not employed for data augmentation. The use of unlabeled data resulted in a performance improvement of about one percent, which suggested that the algorithm can work in a weakly supervised manner.

4. Discussion

In this article, we focused on few-shot WBT classification in otosclerosis and proposed a novel domain adaptation algorithm, named GPGDA, to deal with the inevitable domain shift in deep transfer learning methods. The proposed algorithm employed Gaussian processes to explicitly calibrate, after pre-training, the distributions of the feature extractor's outputs of the adopted backbone network between the synthetic source dataset and the realistic target dataset, and the measured distribution difference was leveraged to optimize the parameters of the feature extractor by gradient backpropagation. In addition, the classifier was trained on the realistic target dataset using the conventional cross-entropy loss as an independent signal. Furthermore, in the proposed algorithm, training alternated between the source and target datasets until the end of the epochs.
The algorithm was studied empirically on an open-source WBT dataset with 80 otosclerosis samples and 55 control samples to verify its performance on the WBT-based otosclerosis diagnosis task. All of the listed performance results were obtained through five-fold cross-validation and a Monte Carlo test to ensure the reliability of the evaluation as much as possible. Our algorithm yielded AUC values of 97.6 ± 1.1 percent, 97.8 ± 1.2 percent, and 97.9 ± 1.1 percent and F1-scores of 95.7 ± 1.0 percent, 96.0 ± 1.1 percent, and 95.7 ± 0.9 percent when the synthetic source datasets SynDataI, SynDataN, and SynDataM were employed, respectively (Table 1). Its performance was usually superior to that of the baseline approaches FT and DAN. In a further analysis, we showed that there was an optimal value of approximately 0.5 for the hyperparameter λ in the proposed algorithm (Table 2). The algorithm almost degenerated to FT when λ was too small and showed unstable testing performance when λ was too large. Moreover, a set of experiments verified that the loss l_GP calculated by the Gaussian processes, rather than the cross-entropy loss l_rCE, dominated the performance gains of the proposed algorithm (Table 3).
The FT approach, which did not exploit the domain shift information, produced lower AUCs and F1-scores than the existing DAN and the proposed GPGDA (Table 1). This indicated that there were non-negligible differences for the otosclerosis diagnosis between the distribution of the synthetic source dataset generated by data augmentation schemes and the distribution of the realistic target dataset. Therefore, explicitly considering the distribution calibration (domain adaptation) during the network training seems to be helpful in increasing the diagnostic accuracy of otosclerosis.
Our GPGDA provided a significant improvement in comparison with DAN (Table 1), and this could be explained by a more performant domain adaptation in our specific task of WBT-based diagnosis. It could be potentially attributed to the information employed to guide the distribution alignment. The GP was leveraged in GPGDA to characterize the manifold containing the latent variables (features) and to measure the distance between the distributions by jointly considering the first- and second-order moments (Equations (7)–(10)). In contrast, DAN exploited the MMD distance and took the first-order moment information into account only.
The mean values of AUC for GPGDA at λ = 0 were smaller than those at λ = 0.5 (Table 2). At λ = 0, the uncertainty information provided by the GP was not exploited, and only the loss l_c, generated by comparing the mean inferred from the latent variables (features) via GP with the original labels, was used to update the parameters of the feature extractor. This was equivalent to using an independent classifier (a GP without uncertainty output), distinct from the original SoftMax classifier, to update the feature extractor's parameters (l_c and l_rCE followed the same steep exponential decay as a function of epochs; Figure 4), so from this point of view, GPGDA with λ = 0 can be considered to degenerate into FT. In addition, when the uncertainty output of the GP was not utilized, GPGDA effectively measured the difference between the joint distributions using only the first-order moment. These potential reasons may explain the deterioration of the AUC performance of GPGDA at λ = 0.
The uncertainty loss l m in GP loss l G P was employed to characterize the confidence for the GP inference. It was calculated using the estimated covariance matrix, and it did not require the information on labels. The l m converged significantly slower than the other two losses ( l c and l r C E ), which caused l m to often remain at a value greater than 0 when the other two losses were already very close to 0 (Figure 4). Therefore, on one hand, the l m component in l G P could prevent the algorithm from falling into the local minima during the optimization by providing guidance information from another perspective, but on the other hand, a high proportion of l m in l G P would make the algorithm more difficult to converge because of the slow convergence rate of l m . Under a practical constraint of the finite number of epochs, the difficulty of convergence also meant that the test performance may demonstrate larger fluctuations at a specific number of epochs. As a result, the standard deviation of AUC for GPGDA tended to become larger as the λ value increased (Table 2).
The explicit calibration of the distributions of the network's hidden-variable outputs (features) via the Gaussian processes led to good results under the constraint of limited realistic data (Table 3), and it can be considered to play a key role in the proposed GPGDA algorithm. The underlying cause may be that the distributions of the synthetic (source) and realistic (target) datasets were significantly different, leading to a non-negligible discrepancy between the distributions of their corresponding features obtained by the same feature extractor. Moreover, further training of the classifier via l_rCE was also beneficial and boosted the performance on top of the distribution calibration.
When the information on labels is unavailable, GP loss l G P in the GPGDA algorithm can still be calculated, allowing the algorithm to employ the unlabeled realistic data. Indeed, under the same conditions, the algorithm showed better AUC performance results in the presence of unlabeled realistic data than in the absence of these data (Table 4). The weakly supervised manner of this algorithm makes it more practical in real-world WBT classification where there may be missing labels. In addition, other task-unrelated WBT data (unlabeled for the current task) can also be potentially exploited to boost the performance of the algorithm on the current task.
In conclusion, the proposed algorithm with the Gaussian processes to explicitly perform the domain adaptation appears to improve the performance of WBT classification in otosclerosis significantly at the cost of a higher computational load during the training phase. Since the training is offline and the inference time is not affected, the algorithm can still be effectively applied for a real-world WBT-based otosclerosis diagnostic task. The approach also potentially provides a performant solution for the minimally invasive diagnosis of other middle and inner ear diseases. Moreover, other performant distance metrics or novel optimization strategies in domain adaptation are worthy of further exploration and research for few-shot WBT-based classification in the future work.

Author Contributions

Conceptualization, F.M., A.B.G. and C.L.; methodology, L.N.; software, L.N. and C.L.; validation, L.N. and C.L.; formal analysis, A.B.G. and L.N.; data curation, A.B.G.; writing—original draft preparation, L.N.; writing—review and editing, C.L., A.B.G. and F.M.; visualization, C.L.; supervision, A.B.G. and F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by China Scholarship Council, Chinese Academy of Sciences, and National Natural Science Foundation of China (No. 62171440).

Data Availability Statement

An open-source dataset was adopted, and the code involved in this work is available upon request.

Conflicts of Interest

The authors have no conflicts of interest to declare that are relevant to the content of this article. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
ANOVA: Analysis of Variance
AUC: Area Under the Receiver Operating Characteristic Curve
CE: Cross Entropy
CNN: Convolutional Neural Network
DAN: Deep Adaptation Networks
FT: Fine-Tuning
GP: Gaussian Processes
GPGDA: Gaussian Processes-Guided Domain Adaptation
MMD: Maximum Mean Discrepancy
RBF: Radial Basis Function
WBT: Wideband Tympanometry

References

1. Margolis, R.H.; Saly, G.L.; Keefe, D.H. Wideband reflectance tympanometry in normal adults. J. Acoust. Soc. Am. 1999, 106, 265–280.
2. Shahnaz, N.; Bork, K.; Polka, L.; Longridge, N.; Bell, D.; Westerberg, B.D. Energy reflectance and tympanometry in normal and otosclerotic ears. Ear Hear. 2009, 30, 219–233.
3. Danesh, A.A.; Shahnaz, N.; Hall, J.W. The audiology of otosclerosis. Otolaryngol. Clin. N. Am. 2018, 51, 327–342.
4. Myers, J.; Kei, J.; Aithal, S.; Aithal, V.; Driscoll, C.; Khan, A.; Manuel, A.; Joseph, A.; Malicka, A.N. Development of a diagnostic prediction model for conductive conditions in neonates using wideband acoustic immittance. Ear Hear. 2018, 39, 1116–1135.
5. Nie, L.; Li, C.; Marzani, F.; Wang, H.; Thibouw, F.; Bozorg Grayeli, A. Classification of Wideband Tympanometry by Deep Transfer Learning with Data Augmentation for Automatic Diagnosis of Otosclerosis. IEEE J. Biomed. Health Inform. 2021.
6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
7. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
8. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
10. Grais, E.M.; Wang, X.; Wang, J.; Zhao, F.; Jiang, W.; Cai, Y.; Zhang, L.; Lin, Q.; Yang, H. Analysing wideband absorbance immittance in normal and ears with otitis media with effusion using machine learning. Sci. Rep. 2021, 11, 10643.
11. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. 2020, 53, 1–34.
12. Zheng, Y.D.; Ma, Y.T.; Liu, R.Z.; Lu, T. A Novel Group-Aware Pruning Method for Few-shot Learning. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–7.
13. Li, X.; Yu, L.; Fu, C.W.; Fang, M.; Heng, P.A. Revisiting metric learning for few-shot image classification. Neurocomputing 2020, 406, 49–58.
14. Mahajan, K.; Sharma, M.; Vig, L. Meta-dermdiagnosis: Few-shot skin disease identification using meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 730–731.
15. Shin, H.C.; Roth, H.R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 2016, 35, 1285–1298.
16. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
17. Uzunova, H.; Wilms, M.; Handels, H.; Ehrhardt, J. Training CNNs for image registration from few samples with model-based data augmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Quebec City, QC, Canada, 10–14 September 2017; pp. 223–231.
18. Frid-Adar, M.; Diamant, I.; Klang, E.; Amitai, M.; Goldberger, J.; Greenspan, H. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 2018, 321, 321–331.
19. Ghafoorian, M.; Mehrtash, A.; Kapur, T.; Karssemeijer, N.; Marchiori, E.; Pesteie, M.; Guttmann, C.R.; de Leeuw, F.E.; Tempany, C.M.; Van Ginneken, B.; et al. Transfer learning for domain adaptation in MRI: Application in brain lesion segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Quebec City, QC, Canada, 10–14 September 2017; pp. 516–524.
20. Keefe, D.H.; Archer, K.L.; Schmid, K.K.; Fitzpatrick, D.F.; Feeney, M.P.; Hunter, L.L. Identifying otosclerosis with aural acoustical tests of absorbance, group delay, acoustic reflex threshold, and otoacoustic emissions. J. Am. Acad. Audiol. 2017, 28, 838–860.
21. Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv 2014, arXiv:1412.3474.
22. Long, M.; Cao, Y.; Wang, J.; Jordan, M. Learning transferable features with deep adaptation networks. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 97–105.
23. Long, M.; Zhu, H.; Wang, J.; Jordan, M.I. Deep transfer learning with joint adaptation networks. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 2208–2217.
24. Shen, J.; Qu, Y.; Zhang, W.; Yu, Y. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
25. Lee, C.Y.; Batra, T.; Baig, M.H.; Ulbricht, D. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10285–10295.
26. Wu, H.; Yan, Y.; Ng, M.K.; Wu, Q. Domain-attention Conditional Wasserstein Distance for Multi-source Domain Adaptation. ACM Trans. Intell. Syst. Technol. (TIST) 2020, 11, 1–19.
27. Williams, C.K.; Rasmussen, C.E. Regression. In Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006.
28. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
29. Mikołajczyk, A.; Grochowski, M. Data augmentation for improving deep learning in image classification problem. In Proceedings of the International Interdisciplinary PhD Workshop, Swinoujście, Poland, 9–12 May 2018; pp. 117–122.
30. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–13.
31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037.
32. Gardner, J.R.; Pleiss, G.; Bindel, D.; Weinberger, K.Q.; Wilson, A.G. GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. Adv. Neural Inf. Process. Syst. 2018, 31, 1–11. Available online: https://proceedings.neurips.cc/paper/2018/hash/27e8e17134dd7083b050476733207ea1-Abstract.html (accessed on 10 December 2021).
33. Jiang, J.; Chen, B.; Fu, B.; Long, M. Transfer-Learning-Library. 2020. Available online: https://github.com/thuml/Transfer-Learning-Library (accessed on 10 December 2021).
34. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
Figure 1. The adopted CNN architecture. First, conv1 and conv2 (convolution filters or feature extractors) processed the WBT image for feature extraction; then, fc3 and fc4 (classifiers) classified the features to yield the final diagnosis. The network included convolutional (Conv) layers, batch normalization (BatchNorm) layers, ReLU layers, MaxPooling layers, fully connected (FC) layers, and a SoftMax layer.
Figure 2. Framework of the GPGDA algorithm. All parameters of the network were pre-trained on the source dataset first. Then, the difference between the distribution of features learned from the source dataset and the training dataset (target dataset) was measured via the Gaussian processes, and the parameters of feature extractors were updated based on this information. In addition, the information of training dataset labels was adopted to update the parameters of the classifier individually. CE Loss: cross entropy loss; GP Loss: the difference measured by GP.
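The framework in Figure 2 combines two signals during transfer: a cross-entropy loss computed on the labeled target samples (which updates the classifier) and a measure of the distance between source and target feature distributions (which updates the feature extractors), weighted by the hyperparameter λ explored in Table 2. As an illustrative sketch only, the snippet below uses an RBF-kernel maximum mean discrepancy (the statistic behind the DAN baseline) as a stand-in for the GP-based distance, since the GP loss itself is defined in the article body; all function names and values are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel matrix between rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd_rbf(src, tgt, gamma=1.0):
    # Squared maximum mean discrepancy with an RBF kernel:
    # zero when the two feature sets are identically distributed.
    return (rbf_kernel(src, src, gamma).mean()
            + rbf_kernel(tgt, tgt, gamma).mean()
            - 2.0 * rbf_kernel(src, tgt, gamma).mean())

def cross_entropy(probs, labels, eps=1e-12):
    # Mean cross entropy for integer class labels.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def total_loss(probs, labels, src_feats, tgt_feats, lam=0.5):
    # Weighted sum used during adaptation: CE on target labels
    # plus lam times the feature-distribution discrepancy.
    return cross_entropy(probs, labels) + lam * mmd_rbf(src_feats, tgt_feats)
```

With λ = 0 the discrepancy term vanishes and the update reduces to plain fine-tuning, which matches the first row of Table 2.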
Figure 3. Examples of WBT images in patients with otosclerosis and control subjects; (a): the WBT sample in a control subject; (b): the WBT sample in a patient. Visually, the WBT image shows a distinct absorbance peak (−200–0 daPa, <2000 Hz) in the control ear, while the peak is flattened in patients with otosclerosis.
Figure 4. Training losses in the GPGDA algorithm versus epochs with SynDataI. Average data from five folds could be fitted with a one-phase exponential decay equation (y = (y0 − p)·e^(−kx) + p, bottom-right panel; R²: 0.68, 0.71, and 0.70 for l_rCE, l_c, and l_m, respectively). l_rCE and l_c followed a similar decay (half-lives: 1.84 and 1.68, respectively; p > 0.05, extra-sum-of-squares F-test), while l_m followed a more gradual decay (half-life: 4.46; p < 0.001, F-test). The same results were obtained with SynDataN and SynDataM (data not shown).
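In the one-phase decay model fitted in Figure 4, y = (y0 − p)·e^(−kx) + p, the half-life relates to the rate constant as t½ = ln 2 / k. A minimal sketch of this relationship (the y0 and p values below are illustrative, not the fitted parameters from the figure):

```python
import math

def one_phase_decay(x, y0, p, k):
    # y starts at y0 (x = 0) and decays toward the plateau p at rate k.
    return (y0 - p) * math.exp(-k * x) + p

def half_life(k):
    # x at which y has covered half the distance from y0 to p.
    return math.log(2) / k

# e.g. the fitted l_m curve has half-life 4.46 epochs,
# i.e. k = ln(2) / 4.46 per epoch.
k = math.log(2) / 4.46
y0, p = 1.0, 0.2
mid = one_phase_decay(half_life(k), y0, p, k)
# At the half-life, y sits exactly midway between y0 and the plateau p.
assert abs(mid - (y0 + p) / 2) < 1e-12
```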
Table 1. Performance of the GPGDA algorithm in comparison with the baseline methods on the WBT dataset.
Source Dataset | Approach | F1-Score | AUC
SynDataI | FT | 94.36 ± 1.13 | 96.54 ± 1.39
SynDataI | DAN | 94.60 ± 1.07 | 96.22 ± 0.80
SynDataI | GPGDA | 95.71 ± 0.99 | 97.61 ± 1.13
SynDataN | FT | 94.69 ± 1.02 | 96.84 ± 1.05
SynDataN | DAN | 94.76 ± 0.67 | 97.01 ± 1.33
SynDataN | GPGDA | 95.97 ± 1.06 | 97.75 ± 1.21
SynDataM | FT | 93.85 ± 1.59 | 96.90 ± 0.98
SynDataM | DAN | 95.83 ± 1.41 | 97.34 ± 1.26
SynDataM | GPGDA | 95.74 ± 0.94 | 97.88 ± 1.10
Values are expressed as mean ± standard deviation ( r = 10 experimental replicates) of percentages. λ = 0.5 in GPGDA. p < 0.01 for the effect of approach types and not significant for the effect of source datasets for both AUC and F1-score; two-way ANOVA. For AUC, p < 0.01 for GPGDA versus FT and p < 0.05 for GPGDA versus DAN; for F1-score, p < 0.001 for GPGDA versus FT and p < 0.05 for GPGDA versus DAN; Tukey’s multiple comparison test, adjusted p-values for multiple comparisons.
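The AUC values reported throughout equal the probability that a randomly chosen otosclerosis sample scores higher than a randomly chosen control sample (the normalized Mann–Whitney U statistic). A minimal rank-based sketch of this equivalence, not the evaluation code used in the study:

```python
def auc_from_scores(pos_scores, neg_scores):
    # AUC as the fraction of (positive, negative) pairs where the
    # positive (otosclerosis) score exceeds the negative (control)
    # score; ties contribute 0.5.
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, a classifier that ranks every otosclerosis sample above every control sample yields an AUC of 1.0, while random scores yield about 0.5.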
Table 2. AUC performance of the GPGDA algorithm with different values of hyperparameter λ from 0 to 2.
λ | AUC/SynDataI | AUC/SynDataN | AUC/SynDataM
0 | 96.70 ± 1.15 | 96.94 ± 1.10 | 96.83 ± 0.68
0.5 | 97.61 ± 1.13 | 97.75 ± 1.21 | 97.88 ± 1.10
1 | 97.58 ± 1.36 | 98.02 ± 1.44 | 97.11 ± 1.18
2 | 96.50 ± 1.59 | 97.26 ± 1.47 | 97.39 ± 1.26
Values are expressed as mean ± standard deviation ( r = 10 experimental replicates) of percentages. p < 0.05 for the effect of different λ values and not significant for the effect of source datasets on AUC; two-way ANOVA. p < 0.05 for λ = 0.5 versus λ = 0 and not significant for the other cases; Tukey’s multiple comparison test, adjusted p-values for multiple comparisons.
Table 3. AUC performance of the GPGDA algorithm when the update gradient flows from l G P and l r C E are enabled or disabled.
l_GP | l_rCE | AUC/SynDataI | AUC/SynDataN | AUC/SynDataM
✗ | ✗ | 93.74 ± 1.09 | 94.11 ± 0.92 | 94.09 ± 0.88
✓ | ✗ | 96.28 ± 0.94 | 97.30 ± 1.27 | 97.19 ± 1.34
✗ | ✓ | 96.22 ± 0.75 | 96.76 ± 1.01 | 95.86 ± 1.38
✓ | ✓ | 97.61 ± 1.13 | 97.75 ± 1.21 | 97.88 ± 1.10
Values are expressed as mean ± standard deviation ( r = 10 experimental replicates) of percentages. λ = 0.5 in GPGDA. ✓: enabling the update flow affected by the loss; ✗: disabling the corresponding flow. p < 0.001 for the effect of update flow conditions and not significant for the effect of source datasets on AUC; two-way ANOVA. p < 0.05 for the second-row condition versus the fourth-row condition, not significant for the second-row condition versus the third-row condition, and p < 0.001 for the other cases; Tukey’s multiple comparison test, adjusted p-values for multiple comparisons.
Table 4. AUC performance of GPGDA algorithm with and without the additional unlabeled data.
Labeled | Unlabeled | AUC/SynDataI | AUC/SynDataN | AUC/SynDataM
72 | 0 | 93.13 ± 0.64 | 93.62 ± 0.82 | 93.85 ± 0.57
72 | 36 | 94.03 ± 0.93 | 94.57 ± 0.50 | 94.91 ± 0.69
108 | 0 | 97.61 ± 1.13 | 97.75 ± 1.21 | 97.88 ± 1.10
Values are expressed as mean ± standard deviation ( r = 10 experimental replicates) of percentages. p < 0.05 for varying numbers of labeled and unlabeled data on AUC; two-way ANOVA, Tukey’s multiple comparison test, adjusted p-values for multiple comparisons.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Nie, L.; Li, C.; Bozorg Grayeli, A.; Marzani, F. Few-Shot Wideband Tympanometry Classification in Otosclerosis via Domain Adaptation with Gaussian Processes. Appl. Sci. 2021, 11, 11839. https://doi.org/10.3390/app112411839

