
Deep Learning-Based Morphological Classification of Human Sperm Heads

1 Department of Information and Computational Sciences, School of Mathematical Sciences and LMAM, Peking University, Beijing 100871, China
2 Department of Biomedical Engineering, College of Engineering, Peking University, Beijing 100871, China
* Author to whom correspondence should be addressed.
Diagnostics 2020, 10(5), 325; https://doi.org/10.3390/diagnostics10050325
Submission received: 15 February 2020 / Revised: 1 May 2020 / Accepted: 15 May 2020 / Published: 20 May 2020
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract
Human infertility is a serious disease of the reproductive system that affects more than 10% of couples across the globe, and over 30% of the reported cases are related to men. The crucial step in the assessment of male infertility and subfertility is semen analysis, which strongly depends on sperm head morphology, i.e., the shape and size of the head of a spermatozoon. However, in medical diagnosis, the morphology of the sperm head is determined manually and heavily depends on the expertise of the clinician. Moreover, this assessment, as well as the morphological classification of human sperm heads, is laborious and non-repeatable, and there is a high degree of inter- and intra-laboratory variability in the results. To overcome these problems, we propose a specialized convolutional neural network (CNN) architecture to accurately classify human sperm heads from sperm images. It is carefully designed with several layers and multiple filter sizes, but fewer filters and parameters, to improve efficiency and effectiveness. We demonstrate that the proposed architecture outperforms state-of-the-art methods, exhibiting 88% recall on the SCIAN dataset in the total agreement setting and 95% recall on the HuSHeM dataset for the classification of human sperm heads. Our method shows the potential of deep learning to surpass embryologists in terms of reliability, throughput, and accuracy.

1. Introduction

The human spermatozoon is the gamete, the male reproductive cell, that may fertilize a mature oocyte. It is produced in the seminiferous tubules of the testicles. Structurally, a normal human spermatozoon has four main parts: head, midpiece, tail, and end piece, as shown in Figure 1 [1]. A normal human sperm has a smooth, egg-shaped oval head. The sperm head can be further divided into two subunits: the nucleus and the acrosome.
Embryologists can observe the behavior of a spermatozoon under a microscope. It resembles a translucent tadpole, with a long lashing tail and a rounded head. The tail propels the spermatozoon forward after it leaves the reproductive gland, driving it toward the uterus in pursuit of an egg in the fallopian tubes, and it provides the motion required to bind to and penetrate a mature oocyte upon arrival.
Male human infertility or subfertility occurs when male reproductive cells fail to make a fertile female partner pregnant, or delay pregnancy, after one or more years of regular unprotected sexual intercourse [2,3]. When a man fails to produce an adequate quantity of spermatozoa and/or produces low-quality spermatozoa, these spermatozoa are called sub-optimal.
The generation of low-quality spermatozoa reduces the pregnancy rate [4]. Such spermatozoa can be immotile and/or abnormal in shape. Immotile spermatozoa cannot travel up the fallopian tubes and therefore cannot fertilize an ovum. Abnormally shaped spermatozoa may be able to travel, but even if they reach the female gametocyte, they may fail to bind to and penetrate its shell, reducing the woman's chance of becoming pregnant. Likewise, when a male body produces a low number of reproductive cells, the probability that one of the sperm in the semen unites with an egg to form a zygote decreases significantly. Several factors, such as age, anxiety, pathogens, and diet, may affect the number of abnormal sperm in the semen [5,6]. It is clear that a high rate of sperm head deformities leads to low fertilization, implantation, and pregnancy rates [7].
Human infertility is a disease of the reproductive system that affects more than 10% of couples across the globe, and over 30% of reported cases are related to men [8]. The crucial step in male fertility diagnosis is the examination of sperm morphology through the seminogram. The key types of defects in abnormal sperm concern the head, neck, tail, and excess residual cytoplasm [9], but head abnormalities play the major role in male infertility. There are two main tasks in sperm morphology analysis: the first is to classify the types of defects in the sperm head, neck, and tail, and the second is to estimate the number of abnormal sperm. In this study, we focus on the classification of head morphological defects or abnormalities.
In practice, the results derived from manual morphological analyses of sperm rely heavily on the expertise of laboratory technicians [10]. Moreover, this manual examination is laborious, non-repeatable, and time intensive, and there is a high degree of inter- and intra-laboratory variability [11]. For animal spermatozoon analysis, there exist computer-aided sperm analysis (CASA) systems based on commercial software. However, human semen samples have a much lower quality of spermatozoa than animal semen samples [12], so the same software cannot be directly applied to human spermatozoon analysis. Furthermore, it was found that applying a CASA system to human spermatozoa requires human assistance, which may introduce subjectivity into the assessment results [13].
According to the above analysis, it is important to design accurate, automatic, and efficient artificial intelligence (AI) systems to improve the numerical analysis of human spermatozoa from sperm images. The morphological classification of human sperm heads plays an important role in this numerical analysis and has already attracted extensive interest in relation to the diagnosis of male infertility. Our main interest here is the development of a deep learning model that extracts features directly from sperm images for the morphological classification of human sperm heads. According to the World Health Organization (WHO), there are 11 abnormal categories of human sperm heads, defined according to particular morphometric characteristics of the heads. They differ in shape, size, and texture in such a complicated way that the task becomes extremely difficult even for a human expert. In addition to intra-class differences, there are also inter-class similarities. For instance, an elongated Amorphous head resembles a Tapered head, and a pear-shaped one resembles a Pyriform head; moreover, a Tapered head that is constricted near the tail can be nearly identical to a Pyriform head.
From the public SCIAN dataset [14] and recent studies, it was found that the morphological classification of human sperm heads is very challenging for the following reasons: (1) There is a high degree of inter-class similarity as well as considerable intra-class difference in some cases; (2) Low-magnification microscopic images of sperm heads are very noisy; (3) The images are very small: the length and width of the sperm heads are about 4 µm and 3 µm, respectively, and the size of each image is approximately 35 by 35 pixels; (4) The number of sperm head examples is insufficient for training a complex machine learning model; (5) Two-thirds of the examples in the SCIAN (partial agreement) dataset carry only 2-out-of-3 human expert agreement; (6) The classes are highly imbalanced (e.g., the Amorphous class has ten times more examples than the Small class); (7) The Amorphous class has no common structure, and its forms vary in many different ways.
The main aim of this research is to develop, implement, and calibrate an advanced deep learning model in the context of morphological sperm assessment. This specialized deep CNN architecture can accurately classify microscopic human sperm head images according to WHO criteria and can expedite the automatic classification of human sperm heads. The method shows the potential of deep learning to surpass embryologists in terms of accuracy, reliability, and throughput.

2. Related Work

According to the guidelines of the WHO, there are 11 categories of abnormalities of human sperm heads: Tapered, Pyriform, Amorphous, Small, Small acrosome, Large, Large acrosome, Round, Two heads, Vacuolated, and Vacuoles in the post-acrosomal region. Among them, the Tapered, Pyriform, and Amorphous categories, together with the Normal category, can mainly be discriminated by the precise shapes of their samples; they are therefore extremely challenging to distinguish, even for an embryologist. The remaining abnormal categories can mainly be discriminated by the different sizes of their heads or by the presence of vacuoles or the acrosome, and are thus relatively easy to distinguish and recognize. For sperm classification tasks, conventional machine learning algorithms have been adopted to alleviate the laborious work of embryologists and improve classification performance. Nonetheless, the input of these algorithms contains manually extracted spermatozoon features such as the head perimeter, area, and eccentricity [15,16]. Although several approaches have been established for the semen analysis of animals (e.g., [17,18]), there are only a few approaches for the morphological classification of human sperm heads. We now briefly review some machine learning approaches related to the morphological classification of human sperm heads.
In 2017, Chang et al. [14] introduced a gold-standard dataset, SCIAN-MorphoSpermGS, for the analysis and evaluation of the morphological classification of human sperm heads. Notably, no open, freely available dataset existed before this gold-standard dataset became public. The SCIAN dataset has five classes of human sperm heads for semen analysis, as defined in the WHO laboratory manual: Normal, Tapered, Pyriform, Amorphous, and Small. It consists of 1854 sperm head images, labeled by three Chilean referent domain experts following the WHO guidelines. Chang et al. [19] further proposed a two-phase analysis pipeline, CE-SVM, for the morphological classification of human sperm heads in the SCIAN dataset. In the first phase, a classifier is trained to distinguish the Amorphous category from the remaining four categories. In the second phase, four classifiers are trained for the four non-Amorphous categories, where each classifier aims to distinguish its specific non-Amorphous category from the Amorphous category.
From a different direction, Shaker et al. [20] released the Human Sperm Head Morphology (HuSHeM) dataset and proposed an adaptive dictionary learning (APDL)-based approach, which extracts square patches from the sperm head images to train dictionaries to recognize the sperm head categories. At the evaluation stage, square patches are reconstructed with each dictionary, and the minimum overall reconstruction error among all the categories identifies the best sperm head category. Recently, with the fast development of deep learning techniques, Riordon et al. [21] used a VGG16 architecture (FT-VGG) for the morphological classification of human sperm heads. First, the VGG network was pre-trained on ImageNet [22] and then fine-tuned on the SCIAN dataset. Their experimental results demonstrated that this automatic deep learning method can facilitate and improve the seminogram effectively.

3. Methodology

3.1. Dataset Description, Partitioning, and Augmentation

SCIAN [14] is a gold-standard dataset for the morphological classification of human sperm heads with five categories: Normal, Tapered, Pyriform, Amorphous, and Small. The manual labeling of the sperm head images was performed independently by three referent Chilean experts with several years of experience in sperm morphology examination. The images are greyscale images of stained sperm heads taken at 63× magnification; their height and width are both 35 pixels, or about 7 µm. There are three agreement settings among the three domain experts: no agreement, partial agreement, and total agreement. The first set consists of 1854 sperm head images (175 Normal, 420 Tapered, 188 Pyriform, 919 Amorphous, and 152 Small), where an image may have been assigned three different class labels by the three experts. The second set comprises 1132 images (100 Normal, 228 Tapered, 76 Pyriform, 656 Amorphous, and 72 Small), where an image may carry two different class labels. The third set includes 384 images (35 Normal, 69 Tapered, 7 Pyriform, 262 Amorphous, and 11 Small), for which all three experts assigned the same class label. From the number of images in these three sets, we can appreciate the difficulty of the morphological classification of human sperm heads, even for human experts. For illustration (Figure 2), we show typical microscopic images of human sperm heads of the five classes in the partial agreement setting of the SCIAN dataset and the four classes of the HuSHeM dataset.
For effective usage of the SCIAN dataset, all images are converted into three channels and rotated so that all human sperm heads share the same orientation. For the convenience of comparison, we also adopt a stratified five-fold cross-validation scheme as used in [21]. That is, the SCIAN dataset is randomly partitioned into five parts, where the four parts that contain approximately 80% of the data from each class form the training set, while the remaining part, which has roughly 20% of the data from each class, forms the test set. The complete training/evaluation procedure is repeated five times for all possible choices of the training and test sets, and the average results are reported. To compare the performance of our proposed model directly with previously published results [20,21], each five-fold cross-validation procedure runs three times for stability. In addition, 20% of the fold-1 images are used as the development set to tune the hyperparameters of our proposed network (see Table 1 for the details).
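The stratified partitioning described above can be sketched with scikit-learn; the arrays below are toy stand-ins for the SCIAN images and labels, not the real data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins: 100 fake 35x35 greyscale images over 5 balanced classes
# (labels 0..4); the real SCIAN classes are highly imbalanced.
rng = np.random.default_rng(0)
X = rng.random((100, 35, 35))
y = np.repeat(np.arange(5), 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Stratification keeps ~80%/20% of every class in train/test.
    assert len(train_idx) == 80 and len(test_idx) == 20
    # A further 20% of one training fold would serve as the development set.
```

With imbalanced labels, stratification still preserves each class's proportion per fold, which is what makes the per-class recall comparisons across folds meaningful.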
In order to tackle the issues of skewed classes and training image scarcity, we apply more augmentation to the minority classes and less augmentation to the majority classes, balancing the sample size of each class in the training set. The training set is thus virtually extended for the deep learning task, with the actual classes balanced. For example, the Pyriform and Amorphous classes in the fold-5 partition of the partial agreement setting (see Table 1) have 61 and 524 distinct images, respectively, but the corresponding augmented classes have similar sizes, i.e., 6283 and 6288 images, respectively.
As for the specific data augmentation, we adopt three common techniques for the SCIAN dataset: rotation, translation, and flipping. For rotation, we rotate each sample image by −5 to 5 degrees. For translation, we shift the image by ~6% of its size to the left, the right, up, and down. For flipping, we flip the image vertically. For both the partial (Table 1) and total (Table 2) agreement settings, we make a stratified five-fold partition of the SCIAN dataset, as well as its augmentation, for the evaluation of the proposed deep architecture. Similar pre-processing, partitioning, and augmentation are performed on the HuSHeM dataset; the details are available in Section 4.3 and Table 3. It should be noted that data augmentation is applied only to the training set, while the development and test sets contain only the original sample images.
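The three augmentation options can be sketched as follows; this is a minimal illustration with NumPy and SciPy on a random array standing in for a sperm head image, and the sampling of offsets and flips is our own illustrative choice, not the paper's exact implementation:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(img, rng):
    """Return one randomly augmented copy of a greyscale sperm-head image,
    mirroring the three options above: a small rotation (-5..5 degrees for
    SCIAN), a ~6% translation, and an optional vertical flip."""
    out = rotate(img, angle=rng.uniform(-5, 5), reshape=False, mode="nearest")
    dy = int(rng.choice([-1, 1])) * round(0.06 * img.shape[0])  # up or down
    dx = int(rng.choice([-1, 1])) * round(0.06 * img.shape[1])  # left or right
    out = shift(out, (dy, dx), mode="nearest")
    if rng.random() < 0.5:
        out = np.flipud(out)  # vertical flip
    return out

rng = np.random.default_rng(0)
img = rng.random((35, 35))  # stand-in for a 35x35 SCIAN image
aug = augment(img, rng)
```

For a 35-pixel image, a ~6% shift amounts to about 2 pixels, so the sperm head stays well within the frame.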

3.2. Proposed Deep CNN Architecture and Learning Paradigm

With the above pre-processing, partitioning, and augmentation of the SCIAN and HuSHeM datasets, we design a deep CNN architecture specifically for the morphological classification of human sperm heads. Deep CNN architectures [23,24,25,26,27,28] have obtained top results in many complicated classification and regression tasks. Since the morphological classification of human sperm heads is an image classification task, it is natural to apply a deep CNN to this complicated problem. To address it, our proposed deep CNN architecture, Morphological Classification of Human Sperm Heads (MC-HSH), consists of four main components, as shown in Figure 3.
Specifically, components one to four are all denoted by Block D, with 3, 4, 6, and 3 repetitions from top to bottom in the upper left subfigure, respectively; the ‘x’ with prefix 3, 4, or 6 near the lower right corner of Block D denotes that the block is repeated 3, 4, or 6 times. These components are connected by Block E. Block D is a combination of Blocks A, B, and C, whose concatenation and addition operations are shown in the bottom subfigure, with Blocks A, B, and C shown in the upper right. The numbers of filters in Blocks A, B, and C are 128, 32, and 32, while their filter sizes are 1 by 1, 5 by 5, and 3 by 3, respectively. In the first component, we use 9 convolutional layers to detect simple features such as those of the nucleus and nuclear vacuoles. In the second component, we use 12 convolutional layers to detect complex features such as the acrosome and outer acrosomal membrane patterns. In the third component, we implement 18 convolutional layers to identify still more complex features such as those of the peri- and sub-acrosomal space. In the fourth component, we add 9 more convolutional layers to learn features precise enough to describe the categories of human sperm heads. As a result, this deep CNN architecture is effective for the morphological classification of human sperm heads.
There are a total of 53 convolutional layers in our proposed deep CNN architecture. Before each convolutional layer, batch normalization [29] and LeakyReLU [30] are applied. In Block D, we use element-wise addition and channel-wise concatenation to make the architecture more effective for this classification. The number of filters in Block E is equal to half the number of existing channels. LeCun uniform initializers [31] are used to initialize the weights and biases. LeakyReLU and softmax are used as the activation functions for the convolutional layers and the output layer, respectively. We use an L2-norm kernel regularizer with λ = 0.005 in the dense layer to prevent overfitting.
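A Keras sketch of Block D is given below. This is a hypothetical reconstruction from the textual description only (the exact wiring is defined in Figure 3 and may differ); in particular, the 1×1 shortcut convolution is added here purely to make the element-wise addition dimensionally valid, and all layer names are our own:

```python
import tensorflow as tf
from tensorflow.keras import layers

def pre_act_conv(x, filters, size):
    # Batch normalization and LeakyReLU precede each convolution, as
    # described above; "same" padding preserves the spatial size.
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    return layers.Conv2D(filters, size, padding="same")(x)

def block_d(x):
    """One hypothetical Block D: Blocks A/B/C (128 1x1, 32 5x5, 32 3x3
    filters) merged by channel-wise concatenation, plus an element-wise
    residual addition."""
    a = pre_act_conv(x, 128, 1)
    b = pre_act_conv(x, 32, 5)
    c = pre_act_conv(x, 32, 3)
    merged = layers.Concatenate()([a, b, c])             # 128+32+32 = 192 channels
    shortcut = layers.Conv2D(192, 1, padding="same")(x)  # match channels for Add
    return layers.Add()([merged, shortcut])

inputs = tf.keras.Input(shape=(35, 35, 3))   # three-channel 35x35 SCIAN images
model = tf.keras.Model(inputs, block_d(inputs))
```

Stacking four groups of such blocks (3, 4, 6, and 3 repetitions, joined by Block E) would yield the 53-layer architecture described above.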
We utilize the Adam learning algorithm [32] to train our proposed deep CNN model with a mini-batch size of 1024 for 50 epochs on the SCIAN dataset. The learning rate is set to 0.0005 with a decay rate of 0.0055, while β1 and β2 are set to 0.9 and 0.999, respectively, in the moment estimates. Moreover, the categorical cross entropy is employed as the loss function. We implement the training procedure using Keras [33] with the TensorFlow [34] backend on GPU. We further tune the hyperparameters of the model on the development set; specifically, the hyperparameters are selected according to the lowest loss of the model evaluated on the development set. Finally, the obtained model is used to assess the test set.

4. Experimental Results

In this section, we run five-fold cross-validation analyses of our proposed deep CNN model for the morphological classification of human sperm heads on the SCIAN and HuSHeM datasets. We test it on both the partial and total agreement settings of SCIAN as well as on HuSHeM, and compare our results with state-of-the-art methods. We use the precision, recall, specificity, F1-score, Jaccard similarity coefficient, geometric mean (G-mean), Matthews correlation coefficient (MCC), and Cohen’s kappa score (CKS) as metrics for classification assessment and comparison. There are two types of averaging: macro-averaging and weighted-averaging. When computing the average of an index over the classes, macro-averaging assigns equal weight to every class, while weighted-averaging assigns each class a weight proportional to its number of images. According to the stratified five-fold partitions of the SCIAN (in both partial and total agreement settings) and HuSHeM datasets and the learning paradigm given in the previous section, we implement our model using the TensorFlow and Keras framework on an NVIDIA GeForce GTX 1080 card with 8 GB of GDDR5X memory. The training process takes roughly 18 hours in total for the SCIAN dataset. We also evaluate our deep learning model on the HuSHeM dataset, where the training process takes approximately 5 hours in total. In the following subsections, we summarize and discuss the experimental results and comparisons in both the partial and total agreement settings of the SCIAN dataset, as well as the results on the HuSHeM dataset.
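The distinction between macro- and weighted-averaging can be made concrete with scikit-learn; the label vectors below are hypothetical toy predictions, not results from the paper:

```python
import numpy as np
from sklearn.metrics import (precision_recall_fscore_support,
                             cohen_kappa_score, matthews_corrcoef)

# Hypothetical predictions over five classes (0..4) with unequal sizes.
y_true = np.array([0, 0, 1, 1, 1, 2, 3, 3, 3, 3, 4])
y_pred = np.array([0, 1, 1, 1, 1, 2, 3, 3, 0, 3, 4])

# Macro-averaging: every class counts equally, regardless of size.
p_macro, r_macro, f_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# Weighted-averaging: each class weighted by its number of images.
p_w, r_w, f_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)

kappa = cohen_kappa_score(y_true, y_pred)  # CKS
mcc = matthews_corrcoef(y_true, y_pred)    # MCC
```

On imbalanced data such as SCIAN, the two averages can differ substantially: macro-averaging exposes poor performance on small classes (e.g., Pyriform, Small) that weighted-averaging would mask.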

4.1. On the Stratified Five-Fold Partition of the SCIAN Dataset with the Partial Agreement Setting

Our proposed model is first evaluated on the stratified five-fold partition of the SCIAN dataset in the partial agreement setting. We train the deep CNN architecture on each choice of training set in the partial agreement setting and tune the hyperparameters on the development set. The experimental results of our proposed model on the SCIAN dataset with the partial agreement setting are shown in Figure 4a–h; the detailed experimental results are shown in the Supplementary Materials (Figures S1–S8). Figure 4a–b show typical classification accuracy and cost curves against the number of epochs on a specific choice of training and test sets. It is seen that the training process converged within 50 epochs. Notably, our proposed model achieves much better accuracy and recall than the previous methods in the partial agreement setting (Table 4). By the stratified five-fold cross-validation, we obtain the confusion matrix (Figure 4c), from which we can see how often images of each individual class (Normal, Tapered, Pyriform, Amorphous, and Small) are predicted by our proposed model on the test set in the partial agreement setting for a typical run. We also compute the average confusion matrix over 15 runs (5 folds × 3 runs), as shown in Table 5. After carefully examining these tables, we find that the Amorphous class is very difficult to distinguish from the remaining classes; the main reason may be that the Amorphous class takes a wide variety of forms. On the other hand, the average true positive rate (TPR) of the Tapered class is relatively high, so Tapered images can be detected easily. The precision, recall, and F1-score curves of the five classes on the test set in the partial agreement setting for a typical run are shown in Figure 4d–f. From these three subfigures, we can see that the five class curves for each of the precision, recall, and F1-score globally tend to stabilize and increase as the number of epochs increases.
We further plot the precision-recall curves of the five classes on the test set, as well as their micro-averaging precision-recall curve (Figure 4g), for a typical run. A large area under the precision-recall curve (PR-AUC) signifies both high precision and high recall. From this subfigure, we find that the Amorphous class has the highest PR-AUC, whereas the Pyriform class has the lowest one. Furthermore, we plot the receiver operating characteristic (ROC) curves of the five classes on the test set, as well as their macro- and micro-averaging ROC curves (Figure 4h), for a typical run. The area under the ROC curve (ROC-AUC) is also valuable because it reflects the tradeoff between the TPR and the false positive rate (FPR). From this subfigure, we further find that the Normal class has the highest ROC-AUC, whereas the Amorphous class has the lowest one. Finally, we summarize the detailed results of each fold in the partial agreement setting for each run in Table 6, which includes all the evaluation metrics: precision, recall, specificity, F1-score, Jaccard similarity coefficient, G-mean, ROC-AUC, PR-AUC, MCC, CKS, and evaluation time. The standard deviation in the last row of this table shows the stability of the results of our proposed model across training runs for each index. Since all standard deviations are less than 0.09, our proposed model is quite stable under the learning algorithm.
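The per-class PR and ROC curves above come from one-vs-rest evaluation of the softmax outputs. A minimal sketch with scikit-learn follows, using random Dirichlet-distributed scores as hypothetical stand-ins for the network's softmax probabilities:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.preprocessing import label_binarize

# Hypothetical labels and softmax-like scores for five classes (0..4).
y_true = np.array([0, 3, 3, 1, 4, 2, 3, 0])
scores = np.random.default_rng(0).dirichlet(np.ones(5), size=len(y_true))

y_bin = label_binarize(y_true, classes=list(range(5)))  # one-vs-rest targets
for k in range(5):
    prec, rec, _ = precision_recall_curve(y_bin[:, k], scores[:, k])
    pr_auc = auc(rec, prec)                       # area under the PR curve
    fpr, tpr, _ = roc_curve(y_bin[:, k], scores[:, k])
    roc_auc = auc(fpr, tpr)                       # area under the ROC curve
    assert 0.0 <= pr_auc <= 1.0 and 0.0 <= roc_auc <= 1.0

# Micro-averaging pools all one-vs-rest decisions into a single curve.
prec_micro, rec_micro, _ = precision_recall_curve(y_bin.ravel(), scores.ravel())
```

PR curves are generally more informative than ROC curves for the rare SCIAN classes, since the FPR of a rare class can look deceptively low when negatives dominate.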

4.2. On the Stratified Five-Fold Partition of the SCIAN Dataset with the Total Agreement Setting

Our proposed model is further evaluated on the stratified five-fold partition of the SCIAN dataset in the total agreement setting. Similarly, we train the deep CNN architecture on each choice of training set in the total agreement setting; however, we no longer tune the hyperparameters, since they were already tuned in the partial agreement setting. The experimental results of our proposed model in the total agreement setting are shown in Figure 5a–h; the detailed experimental evaluations are available in the Supplementary Materials (Figures S9–S16). Specifically, Figure 5a–b show typical classification accuracy and cost curves during training on a specific choice of training and test sets. It is seen that our proposed model obtains very high classification accuracy in the total agreement setting once the training process converges. Our proposed model also attains much higher accuracy and recall than the previous methods in the total agreement setting, which is clearly shown in Table 7 by comparing the precision, specificity, and F1-score indices of our proposed model and the VGG model in [21]. For elaborate comparisons with the models in [20,21], we employ the stratified five-fold cross-validation scheme in the total agreement setting. Figure 5c illustrates the confusion matrix of the classification for a typical run. We also compute the average confusion matrix over 15 runs (5 folds × 3 runs), as shown in Table 8. According to these tables, the Amorphous class remains the most difficult to differentiate from the other classes; nevertheless, the average TPRs of the Normal, Tapered, Pyriform, and Small classes improve.
The precision, recall, and F1-score curves of five classes on the test set in the total agreement setting through a typical run are shown in Figure 5d–f, respectively. From these three subfigures, we can again see that the five class curves of each of the precision, recall, and F1-score globally tend to stabilize and increase as the number of epochs increases. We further plot the precision-recall curves of five classes on the test set as well as their micro-averaging precision-recall curve in Figure 5g for a typical run. It is clearly observed from this subfigure that the Amorphous and Normal classes have a higher PR-AUC than the other classes, while the Pyriform class has the lowest one. Moreover, we plot the ROC curves of five classes on the test set as well as their macro and micro-averaging ROC curves in Figure 5h for a typical run. From this subfigure, we can see that the Normal class has the highest ROC-AUC, while the Amorphous class has the lowest one. The detailed results of each fold in the total agreement setting for each run are available in Table 9. From the last row of this table, we can also see low standard deviations of different indices from our proposed model with a training run in the total agreement setting, demonstrating the stability of our proposed model with the learning algorithm. As the agreement is strict in this case, common and essential features can be extracted effectively from the labeled images so that the classification results are improved considerably. In summary, our proposed model attains an overall accuracy of 77%, a macro precision of 64%, a macro recall of 88%, and a macro specificity of 94% in the total agreement setting, which are much better than the previous results.

4.3. On the Stratified Five-Fold Partition of the HuSHeM Dataset

Our proposed model is finally evaluated on the HuSHeM [20] dataset. This is another dataset for the morphological classification of human sperm heads with 216 images (54 Normal, 53 Tapered, 57 Pyriform, and 52 Amorphous). Its images are also manually annotated by three human experts, but only the images with three-expert agreement are recorded. Each image consists of 131 by 131 pixels, being taken at 100× magnification.
In the pre-processing step, we first rotate the images so that all human sperm heads share the same orientation. We then crop the sample images so that the sperm heads appear in the center; after this step, the images are reduced to 90 by 90 pixels. Approximately 80% of the images are used for training and the remaining images for evaluation. We further employ data augmentation to address the scarcity of training images, adopting the same three techniques as for the SCIAN training set. For rotation, we rotate the training image by −25 to 25 degrees. For translation, we shift the image by ~6% of its size to the left, the right, up, and down. For flipping, we flip the image vertically. Since the classes in this dataset are roughly evenly distributed, we apply equal augmentation to each class. For the convenience of comparison, we also adopt a stratified five-fold cross-validation scheme as used in [20,21]. We utilize the Adam learning algorithm to train our proposed deep CNN model with a mini-batch size of 256 for 25 epochs on the HuSHeM dataset. To compare the performance of our proposed model directly with previously published results [20,21], each five-fold cross-validation procedure runs three times for stability.
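The center-crop step can be sketched as follows; the helper name and the zero array standing in for a HuSHeM image are our own illustrative choices:

```python
import numpy as np

def center_crop(img, out_h=90, out_w=90):
    """Crop the central out_h x out_w window of an image, as in the
    HuSHeM pre-processing where 131x131 images are reduced to 90x90
    with the sperm head kept at the center."""
    h, w = img.shape[:2]
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return img[top:top + out_h, left:left + out_w]

img = np.zeros((131, 131, 3))      # stand-in for a 131x131 RGB HuSHeM image
assert center_crop(img).shape == (90, 90, 3)
```

Cropping after the orientation-normalizing rotation removes background pixels and keeps the head centered, so the network sees a consistent framing across samples.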
The experimental results of our proposed model on the HuSHeM dataset are shown in Figure 6a–e; the detailed experimental results are shown in the Supplementary Materials (Figures S17–S21). The results of our proposed model, as well as the previous methods, on the HuSHeM dataset are shown in Table 10. It is clearly seen that our proposed model achieves better accuracy, recall, precision, specificity, and F1-score than the previous methods. Moreover, from the confusion matrix of our proposed model (Table 11), we can see that Pyriform images in the test set are classified correctly 97% of the time. Results on the HuSHeM dataset are averaged over 15 runs (5 folds × 3 runs). We also plot the precision-recall curves of the four classes on the test set, as well as their micro-averaging precision-recall curve, in Figure 6d for a typical run. Furthermore, we plot the ROC curves of the four classes on the test set, as well as their macro- and micro-averaging ROC curves, in Figure 6e for a typical run. Finally, we summarize the detailed results of each fold for each run in Table 12, which includes all the evaluation metrics: precision, recall, specificity, F1-score, Jaccard similarity coefficient, G-mean, ROC-AUC, PR-AUC, MCC, CKS, and evaluation time.

5. Discussion and Conclusions

We have established an advanced deep CNN architecture, MC-HSH, specifically for the morphological classification of human sperm heads. The architecture contains a total of 53 convolutional layers, with batch normalization and LeakyReLU applied before each convolutional layer. We also apply channel-wise concatenation and element-wise addition to make the model more effective for this task, and we employ an L2 penalty as the kernel regularizer in the dense layer to prevent overfitting. The design uses several layers and multiple filter sizes, but fewer filters and parameters, together with a new arrangement of the convolutional layers and of the addition and concatenation operations.
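The batch normalization → LeakyReLU → convolution ordering and the two merge operations can be illustrated in miniature. The following is a toy NumPy sketch using 1×1 convolutions, not the actual 53-layer MC-HSH implementation; all function names are ours.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Inference-style normalization over batch and spatial axes,
    # without learned scale/shift (for illustration only).
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.tensordot(x, w, axes=([3], [0]))

def pre_activation_block(x, w1, w2):
    """Two parallel branches, each BN -> LeakyReLU -> Conv, merged by
    element-wise addition; the result is then channel-wise concatenated
    with the block input, doubling the channel count."""
    a = conv1x1(leaky_relu(batch_norm(x)), w1)
    b = conv1x1(leaky_relu(batch_norm(x)), w2)
    added = a + b                                  # element-wise addition
    return np.concatenate([x, added], axis=-1)     # channel-wise concat

x = np.random.default_rng(0).normal(size=(2, 8, 8, 4))
w1 = np.eye(4)
w2 = np.eye(4)
y = pre_activation_block(x, w1, w2)
print(y.shape)  # (2, 8, 8, 8)
```

Placing normalization and activation before the convolution (pre-activation ordering) follows the spirit of the residual and densely connected designs cited in [25,27].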
According to the WHO criteria [9], human sperm heads are classified into categories such as Normal, Tapered, Pyriform, Amorphous, and Small, and their morphological classification is very challenging. Based on the gold-standard SCIAN dataset of microscopic sperm images and the HuSHeM dataset, data-driven machine learning models and algorithms can be used to address this difficult problem. Through careful pre-processing, partitioning, and augmentation of the SCIAN and HuSHeM datasets, we design a specialized deep CNN architecture for the morphological classification of sperm heads from microscopic human sperm head images. The stratified five-fold cross-validation results demonstrate that our proposed model (along with the deep learning algorithm) is much more effective than the previous methods [14,19,20,21] for the morphological classification of human sperm heads. The performance indices on the five classes (see Table 4, Table 5, Table 7 and Table 8) indicate that it reliably recognizes the images in the Normal class as well as in the four abnormal classes. While attaining embryologist-level classification performance, our proposed model is also a balanced classifier in that its true positive rate (TPR) is similar to its positive predictive value (PPV).
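Stratified five-fold cross-validation keeps the class proportions of each fold close to those of the whole dataset, which matters for minority classes. A minimal sketch follows (an illustrative stand-in for scikit-learn's `StratifiedKFold`, using hypothetical class sizes rather than the SCIAN or HuSHeM counts):

```python
import numpy as np

def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        for i, sample in enumerate(idx):
            folds[i % k].append(int(sample))  # deal round-robin per class
    return [sorted(f) for f in folds]

# hypothetical imbalanced dataset: 100 majority, 25 minority samples
labels = [0] * 100 + [1] * 25
folds = stratified_kfold(labels, k=5)
print([len(f) for f in folds])  # [25, 25, 25, 25, 25]
```

Each fold here receives 20 majority and 5 minority samples, so every test fold sees the minority class at its true prevalence.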
Tables 4 and 7 show that the previous methods are not powerful enough to extract effective features from microscopic images for the classification of human sperm heads. Our proposed model achieves 68% and 88% average TPR on the SCIAN dataset in the partial and total agreement settings, respectively. Compared with the state-of-the-art results reported in [21], it improves accuracy and recall by 29% and 10%, respectively, in the partial agreement setting, and by 46% and 22%, respectively, in the total agreement setting. In the total agreement setting, our model achieves much better accuracy (77%) and recall (88%) than in the partial agreement setting (63% accuracy and 68% recall), because the training set contains more images and the test set contains only images with total expert agreement. Our proposed model can extract the morphometric features for the seminogram, which are significant for sperm binding to the oocyte. The morphological classification of human sperm heads is an intricate problem because of intrinsic inter-class similarities and intra-class variabilities, yet our model achieves better classification results than the previous state-of-the-art methods without using transfer learning. On the HuSHeM dataset, the results of our proposed model are also better than the state of the art: our approach achieves 96% accuracy and 95% recall, improving accuracy, recall, precision, and F1-score by approximately 2% and specificity by roughly 0.5% in comparison with [21]. The results of our proposed model are much better on the HuSHeM dataset than on the SCIAN dataset, for three main reasons: (1) the HuSHeM dataset has only four sperm head classes; (2) its images have a high resolution; and (3) all of its images have 3-out-of-3 human expert agreement.
The evaluation time of our proposed model is ~0.2 milliseconds (ms) per image for the SCIAN dataset and ~0.9 ms per image for the HuSHeM dataset.
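Per-image evaluation time of this kind is typically estimated by timing a full pass over the test set and dividing by the number of images. The generic sketch below is ours, with a toy predict function standing in for the trained model:

```python
import time

def per_image_latency_ms(predict_fn, images, repeats=3):
    """Rough wall-clock inference latency per image in milliseconds,
    taking the best of several passes to reduce timer noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for img in images:
            predict_fn(img)
        best = min(best, time.perf_counter() - start)
    return 1000.0 * best / len(images)

# toy stand-in for a trained classifier's forward pass
latency = per_image_latency_ms(lambda img: sum(img), [[0.0] * 100] * 50)
print(latency >= 0.0)  # True
```

For GPU inference, framework-specific synchronization would be needed before reading the timer, since kernel launches are asynchronous.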
Developing an automated classification system for human sperm heads can greatly reduce the workload of embryologists and also decrease the subjectivity and inaccuracy induced by human error. Such an automated system becomes especially valuable when experienced embryologists are not readily available, and for inexperienced clinicians in underdeveloped countries. In fact, the classification results of our proposed model are comparable to those of the domain experts. Consequently, our proposed model can be used to assign a class label to any new sperm head image, and this deep CNN architecture is well suited to expedite the automatic classification of human sperm heads. Indeed, our research provides further strong evidence that the deep learning approach can play a key role in healthcare systems, assisting doctors to achieve higher conception and gestation rates. Our proposed architecture shows the potential of deep learning to surpass embryologists in terms of throughput, accuracy, and reliability.
It is worth indicating the limitations of this study. As mentioned before, the experiments are conducted on two publicly available datasets. The SCIAN dataset has 1132 and 384 human sperm head images in the partial and total agreement settings, respectively, while the HuSHeM dataset has only 216 human sperm head images. These numbers are relatively small, so to obtain better generalizability it is essential to increase the number of images in future experiments. Secondly, owing to limited computational power and memory, the training time is high. Lastly, additional work remains to be done to evaluate the deep learning models in fertility clinics.

Supplementary Materials

Available online at https://www.mdpi.com/2075-4418/10/5/325/s1. Figure S1. Detailed experimental results of the proposed model through 50 epochs in the partial agreement setting of the SCIAN dataset, where 15 classification accuracy curves with the number of epochs during the training on each of five possible choices of the training and test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S2. Detailed experimental results of the proposed model through 50 epochs in the partial agreement setting of the SCIAN dataset, where 15 cost curves with the number of epochs during the training on each of five possible choices of the training and test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S3. Detailed experimental results of the proposed model through 50 epochs in the partial agreement setting of the SCIAN dataset, where 15 confusion matrices on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S4. Detailed experimental results of the proposed model through 50 epochs in the partial agreement setting of the SCIAN dataset, where 15 precision curves of each class with the number of epochs during the training on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S5. Detailed experimental results of the proposed model through 50 epochs in the partial agreement setting of the SCIAN dataset, where 15 recall curves of each class with the number of epochs during the training on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S6. Detailed experimental results of the proposed model through 50 epochs in the partial agreement setting of the SCIAN dataset, where 15 F1-score curves of each class with the number of epochs during the training on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S7. Detailed experimental results of the proposed model in the partial agreement setting of the SCIAN dataset, where 15 precision-recall curves of each class and their micro-averaging precision-recall curve on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated. A high area under the curve signifies high precision as well as high recall; Figure S8. Detailed experimental results of the proposed model in the partial agreement setting of the SCIAN dataset, where 15 receiver operating characteristic (ROC) curves of each class and their macro and micro-averaging ROC curves on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated. These plots show the tradeoff between the true positive rate and the false positive rate; Figure S9. Detailed experimental results of the proposed model through 50 epochs in the total agreement setting of the SCIAN dataset, where 15 classification accuracy curves with the number of epochs during the training on each of five possible choices of the training and test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S10. Detailed experimental results of the proposed model through 50 epochs in the total agreement setting of the SCIAN dataset, where 15 cost curves with the number of epochs during the training on each of five possible choices of the training and test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S11. Detailed experimental results of the proposed model through 50 epochs in the total agreement setting of the SCIAN dataset, where 15 confusion matrices on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S12. Detailed experimental results of the proposed model through 50 epochs in the total agreement setting of the SCIAN dataset, where 15 precision curves of each class with the number of epochs during the training on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S13. Detailed experimental results of the proposed model through 50 epochs in the total agreement setting of the SCIAN dataset, where 15 recall curves of each class with the number of epochs during the training on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S14. Detailed experimental results of the proposed model through 50 epochs in the total agreement setting of the SCIAN dataset, where 15 F1-score curves of each class with the number of epochs during the training on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S15. Detailed experimental results of the proposed model in the total agreement setting of the SCIAN dataset, where 15 precision-recall curves of each class and their micro-averaging precision-recall curve on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated. A high area under the curve signifies high precision as well as high recall; Figure S16. Detailed experimental results of the proposed model in the total agreement setting of the SCIAN dataset, where 15 receiver operating characteristic (ROC) curves of each class and their macro and micro-averaging ROC curves on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated. These plots show the tradeoff between the true positive rate and the false positive rate; Figure S17. Detailed experimental results of the proposed model through 25 epochs of the HuSHeM dataset, where 15 classification accuracy curves with the number of epochs during the training on each of five possible choices of the training and test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S18. Detailed experimental results of the proposed model through 25 epochs of the HuSHeM dataset, where 15 cost curves with the number of epochs during the training on each of five possible choices of the training and test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S19. Detailed experimental results of the proposed model through 25 epochs of the HuSHeM dataset, where 15 confusion matrices on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated; Figure S20. Detailed experimental results of the proposed model on the HuSHeM dataset, where 15 precision-recall curves of each class and their micro-averaging precision-recall curve on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated. A high area under the curve signifies high precision as well as high recall; Figure S21. Detailed experimental results of the proposed model on the HuSHeM dataset, where 15 receiver operating characteristic (ROC) curves of each class and their macro and micro-averaging ROC curves on each of five possible choices of the test sets for 3 runs (5 folds × 3 runs) are illustrated. These plots show the tradeoff between the true positive rate and the false positive rate.

Author Contributions

I.I. and J.M. devised the project, the main conceptual ideas and proof outline; I.I. designed and performed the experiments; I.I. drafted the manuscript, designed the figures and tables with support from G.M.; I.I. and J.M. contributed to the interpretation of the results; J.M. supervised the project; I.I., J.M., and G.M. reviewed and revised the manuscript prior to submission. All authors have read and approved the final submitted manuscript.

Funding

This work was supported by the National Key Research and Development Program of China under grant 2018AAA0100205.

Acknowledgments

The authors would like to thank Zhe Sage Chen, Zhongyu Joan Lu and Qiang Xu for their kind help.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Villarreal, M.R. Complete Diagram of a Human Spermatozoon. Available online: https://commons.wikimedia.org/wiki/File:Complete_diagram_of_a_human_spermatozoa_en.svg (accessed on 9 October 2019).
  2. Zegers-Hochschild, F.; Adamson, G.D.; De Mouzon, J.; Ishihara, O.; Mansour, R.; Nygren, K.; Sullivan, E.; Vanderpoel, S. International Committee for Monitoring Assisted Reproductive Technology (ICMART) and the World Health Organization (WHO) revised glossary of ART. Fertil. Steril. 2009, 92, 1520–1524. [Google Scholar] [CrossRef] [PubMed]
  3. Blasco, V.; Pinto, F.M.; Gonz, C.; Santamar, E.; Candenas, L.; Fern, M. Tachykinins and Kisspeptins in the Regulation of Human Male Fertility. J. Clin. Med. 2020, 9, 113. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Wang, S.-C.; Wang, S.-C.; Li, C.-J.; Lin, C.-H.; Huang, H.-L.; Tsai, L.-M.; Chang, C.-H. The Therapeutic Effects of Traditional Chinese Medicine for Poor Semen Quality in Infertile Males. J. Clin. Med. 2018, 7, 239. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Kidd, S.A.; Eskenazi, B.; Wyrobek, A.J. Effects of male age on semen quality and fertility: A review of the literature. Fertil. Steril. 2001, 75, 237–248. [Google Scholar] [CrossRef]
  6. Barone, M.A.; Roelke, M.E.; Howard, J.; Brown, J.L.; Anderson, A.E.; Wildt, D.E. Reproductive Characteristics of Male Florida Panthers: Comparative studies from Florida, Texas, Colorado, Latin America, and North American Zoos. J. Mammal. 1994, 75, 150–162. [Google Scholar] [CrossRef] [Green Version]
  7. Monte, G.L.; Murisier, F.; Piva, I.; Germond, M.; Marci, R. Focus on intracytoplasmic morphologically selected sperm injection (IMSI): A mini-review. Asian J. Androl. 2013, 15, 608–615. [Google Scholar] [CrossRef] [Green Version]
  8. Maduro, M.R.; Lamb, D.J. Understanding the new genetics of male infertility. J. Urol. 2002. [Google Scholar]
  9. World Health Organization. WHO Laboratory Manual for the Examination and Processing of Human Semen, 5th ed.; World Health Organization: Geneva, Switzerland, 2010. [Google Scholar]
  10. Brazil, C. Practical semen analysis: From A to Z. Asian J. Androl. 2010, 12, 14–20. [Google Scholar] [CrossRef] [Green Version]
  11. Gatimel, N.; Moreau, J.; Parinaud, J.; Léandri, R.D. Sperm morphology: Assessment, pathophysiology, clinical relevance, and state of the art in 2017. Andrology 2017, 5, 845–862. [Google Scholar] [CrossRef] [Green Version]
  12. Mortimer, S.T.; Van Der Horst, G.; Mortimer, D. The future of computer-aided sperm analysis. Asian J. Androl. 2015, 17, 545–553. [Google Scholar] [CrossRef]
  13. Menkveld, R.; Holleboom, C.A.G.; Rhemrev, J.P.T. Measurement and significance of sperm morphology. Asian J. Androl. 2011, 13, 59–68. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Chang, V.; Garcia, A.; Hitschfeld, N.; Hartel, S. Gold-standard for computer-assisted morphological sperm analysis. Comput. Biol. Med. 2017, 83, 143–150. [Google Scholar] [CrossRef] [PubMed]
  15. Yi, W.J.; Park, K.S.; Paick, J.S. Parameterized characterization of elliptic sperm heads using Fourier representation and wavelet transform. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Hong Kong, China, 1 November 1998; Volume 20, pp. 974–977. [Google Scholar]
  16. Li, J.; Tseng, K.K.; Dong, H.; Li, Y.; Zhao, M.; Ding, M. Human sperm health diagnosis with principal component analysis and k-nearest neighbor algorithm. In Proceedings of the International Conference on Medical Biometrics, Shenzhen, China, 30 May–1 June 2014; IEEE: Shenzhen, China, 2014; pp. 108–113. [Google Scholar]
  17. Beletti, M.E.; Costa, L.D.F.; Viana, M.P. A comparison of morphometric characteristics of sperm from fertile Bos taurus and Bos indicus bulls in Brazil. Anim. Reprod. Sci. 2005, 85, 105–116. [Google Scholar] [CrossRef] [PubMed]
  18. Severa, L.; Máchal, L.; Švábová, L.; Mamica, O. Evaluation of shape variability of stallion sperm heads by means of image analysis and Fourier descriptors. Anim. Reprod. Sci. 2010, 119, 50–55. [Google Scholar] [CrossRef]
  19. Chang, V.; Heutte, L.; Petitjean, C.; Hartel, S.; Hitschfeld, N. Automatic classification of human sperm head morphology. Comput. Biol. Med. 2017, 84, 205–216. [Google Scholar] [CrossRef]
  20. Shaker, F.; Monadjemi, S.A.; Alirezaie, J.; Naghsh-Nilchi, A.R. A dictionary learning approach for human sperm heads classification. Comput. Biol. Med. 2017, 91, 181–190. [Google Scholar] [CrossRef]
  21. Riordon, J.; Mccallum, C.; Sinton, D. Deep learning for the classification of human sperm. Comput. Biol. Med. 2019, 111. [Google Scholar] [CrossRef]
  22. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
  23. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  24. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Volume 9351, pp. 234–241. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  26. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar]
  27. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
  28. Iqbal, I.; Shahzad, G.; Rafiq, N.; Mustafa, G.; Ma, J. Deep learning-based automated detection of human knee joint’s synovial fluid from magnetic resonance images with transfer learning. IET Image Process. 2020. (In press) [Google Scholar]
  29. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  30. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16 June–21 June 2013; Volume 28. [Google Scholar]
  31. LeCun, Y.A.; Bottou, L.; Orr, G.B.; Müller, K.-R. Efficient BackProp; Springer: Heidelberg, Germany, 2012. [Google Scholar]
  32. Kingma, D.P.; Ba, J.L. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  33. Chollet, F. Keras (2015). Available online: http://keras.io (accessed on 9 October 2019).
  34. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2016, arXiv:1603.04467. [Google Scholar]
Figure 1. The diagram of a human spermatozoon.
Figure 2. Typical samples of human sperm heads of microscopic images of the five classes in the partial agreement setting of the SCIAN dataset and the four classes of the HuSHeM dataset.
Figure 3. The layout of the proposed deep CNN architecture, where ‘FC’ denotes the fully connected layer.
Figure 4. The experimental results in the partial agreement setting of the SCIAN dataset: (a,b) Typical classification accuracy and cost curves with the number of epochs during the training on the training and test sets; (c) The confusion matrix on the test set; (d–f) The precision, recall, and F1-score curves of five classes respectively on the test set; (g) The precision-recall curves of five classes on the test set as well as their micro-averaging precision-recall curve; (h) The receiver operating characteristic (ROC) curves of five classes on the test set as well as their macro and micro-averaging ROC curves.
Figure 5. The experimental results in the total agreement setting of the SCIAN dataset: (a,b) Typical classification accuracy and cost curves with the number of epochs during the training on the training and test sets; (c) The confusion matrix on the test set; (d–f) The precision, recall, and F1-score curves of five classes respectively on the test set; (g) The precision-recall curves of five classes on the test set as well as their micro-averaging precision-recall curve; (h) The receiver operating characteristic (ROC) curves of five classes on the test set as well as their macro and micro-averaging ROC curves.
Figure 6. The experimental results of the HuSHeM dataset: (a,b) Typical classification accuracy and cost curves with the number of epochs during the training on the training and test sets; (c) The confusion matrix on the test set; (d) The precision-recall curves of four classes on the test set as well as their micro-averaging precision-recall curve; (e) The receiver operating characteristic (ROC) curves of four classes on the test set as well as their macro and micro-averaging ROC curves.
Table 1. The stratified five-fold partition of the SCIAN dataset (partial agreement), where the numbers denote the distinct sample sizes in the different classes, while the numbers in parentheses denote the total number of augmented plus original samples in each class at each fold. Moreover, the second number after the dash in the training set (fold 1) denotes the number of samples in each class assigned to the development set for tuning the hyperparameters of the network. To avoid repetition, folds 2 and 3 are described together.
Fold | Set | Normal | Tapered | Pyriform | Amorphous | Small | Total
1 | Train | 80–20 (4860) | 182–46 (4896) | 60–15 (4860) | 525–131 (4728) | 58–14 (4840) | 905–226 (24184)
1 | Test | 20 | 46 | 16 | 131 | 14 | 227
2 and 3 | Train | 80 (6400) | 182 (6370) | 61 (6405) | 525 (6300) | 58 (6380) | 906 (31855)
2 and 3 | Test | 20 | 46 | 15 | 131 | 14 | 226
4 | Train | 80 (6240) | 183 (6222) | 61 (6283) | 525 (6300) | 57 (6270) | 906 (31315)
4 | Test | 20 | 45 | 15 | 131 | 15 | 226
5 | Train | 80 (6240) | 183 (6222) | 61 (6283) | 524 (6288) | 57 (6270) | 905 (31303)
5 | Test | 20 | 45 | 15 | 132 | 15 | 227
Table 2. The stratified five-fold partition of the SCIAN dataset (total agreement), where the numbers denote the distinct sample sizes in the different classes, while the numbers in parentheses denote the total number of augmented plus original samples in each class at each fold. To avoid repetition, folds 1 and 2 are described together.
Fold | Set | Normal | Tapered | Pyriform | Amorphous | Small | Total
1 and 2 | Train | 93 (7719) | 214 (7704) | 74 (7696) | 604 (7852) | 70 (7700) | 1055 (38671)
1 and 2 | Test | 7 | 14 | 2 | 52 | 2 | 77
3 | Train | 93 (7719) | 214 (7704) | 75 (7725) | 604 (7852) | 70 (7700) | 1056 (38700)
3 | Test | 7 | 14 | 1 | 52 | 2 | 76
4 | Train | 93 (7719) | 214 (7704) | 75 (7725) | 603 (7839) | 70 (7700) | 1055 (38687)
4 | Test | 7 | 14 | 1 | 53 | 2 | 77
5 | Train | 93 (7626) | 215 (7525) | 75 (7575) | 603 (7839) | 69 (7590) | 1055 (38155)
5 | Test | 7 | 13 | 1 | 53 | 3 | 77
Table 3. The stratified five-fold partition of the HuSHeM dataset, where the numbers denote the distinct sample sizes in the different classes, while the numbers in parentheses denote the total number of augmented plus original samples in each class at each fold. To avoid repetition, folds 1, 2 and 3 are described together.
Fold | Set | Normal | Tapered | Pyriform | Amorphous | Total
1, 2 and 3 | Train | 43 (4730) | 42 (4620) | 46 (5060) | 42 (4620) | 173 (19030)
1, 2 and 3 | Test | 11 | 11 | 11 | 10 | 43
4 | Train | 43 (4730) | 43 (4730) | 45 (4950) | 41 (4510) | 172 (18920)
4 | Test | 11 | 10 | 12 | 11 | 44
5 | Train | 44 (4840) | 43 (4730) | 45 (4950) | 41 (4510) | 173 (19030)
5 | Test | 10 | 10 | 12 | 11 | 43
Table 4. The performance comparison of our proposed model with the previous methods in the partial agreement setting of the SCIAN dataset in terms of accuracy, precision, recall, specificity, and F1-score metrics. Bold font shows the best results. All the metrics are described in percentages. The accuracy, precision, specificity, and F1-score of the method in [21] were not reported directly, but calculated from its confusion matrix. The symbol ‘-’ stands for unreported results.
Model | Normal TPR | Tapered TPR | Pyriform TPR | Amorphous TPR | Small TPR | Accuracy (Weighted Average TPR) | Recall (Average TPR) | Precision (Macro) | Specificity (Macro) | F1-Score (Macro)
MorphoSpermGS (SVM with Zernike moments) [14] | 44 | 62 | 33 | 23 | 70 | 36 | 46 | - | - | -
MorphoSpermGS (SVM with Fourier descriptors) [14] | 57 | 68 | 53 | 15 | 54 | 34 | 49 | - | - | -
CE-SVM [19] | 62 | 64 | 50 | 30 | 82 | 44 | 58 | - | - | -
APDL [20] | 71 | 67 | 71 | 35 | 68 | 49 | 62 | - | - | -
FT-VGG [21] | 67 | 57 | 69 | 38 | 78 | 49 | 62 | 47 | 87 | 53
Proposed model (MC-HSH) | 70 | 79 | 62 | 57 | 71 | 63 | 68 | 56 | 90 | 61
Table 5. The average confusion matrix on the stratified five-fold partition of the SCIAN dataset in the partial agreement setting, where each cell value (in percent) is the average of 15 runs (5 folds × 3 runs).
True Class \ Predicted Class | Normal | Tapered | Pyriform | Amorphous | Small
Normal | 70 | 3 | 3 | 20 | 4
Tapered | 2 | 79 | 5 | 13 | 1
Pyriform | 3 | 8 | 62 | 26 | 1
Amorphous | 10 | 16 | 8 | 57 | 9
Small | 10 | 2 | 11 | 6 | 71
Table 6. The stratified five-fold cross-validation results for the morphological classification of human sperm heads in the partial agreement setting of the SCIAN dataset for every fold and every run, where all the metrics except Matthews correlation coefficient (MCC) and Cohen’s kappa score (CKS) are described in percentages.
FoldRunPrecisionRecallSpecificityF1-ScoreJaccardG-meanROC-AUCPR-AUCMCCCKSEvaluation Time
per Image
(milliseconds)
MacroWeightedMacroWeightedMacroWeightedMacroWeightedMacroWeightedMacroWeightedMacroMicroMicro
Accuracy
1First577167679086606744517775889069+0.52+0.50~0.2
Second547067639087586441487774878963+0.48+0.47
Third537066639087566440467674878964+0.48+0.46
2First597273659088636647498175899069+0.52+0.50
Second587273659087636646508176889067+0.52+0.50
Third627271679186656848518076899172+0.53+0.52
3First506663568886525636397368858453+0.43+0.39
Second506863548888525535387468848451+0.42+0.38
Third476362528785505334367366838348+0.38+0.35
4First587169659086626645497975899172+0.51+0.49
Second627269699184657048547976899173+0.54+0.53
Third637368709186657148557877909275+0.55+0.54
5First567068659085606643507874888963+0.50+0.48
Second587168679084616845527875888962+0.51+0.50
Third557069639086596442477973878758+0.49+0.47
Average567068639086596443487873878964+0.49+0.47
Standard deviation0.04700.02630.03310.05330.01160.01220.04940.05390.04750.05720.02590.03360.01990.02820.08350.04780.0561
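The macro and weighted columns in the cross-validation tables are two averaging conventions over per-class scores: macro averaging gives every class equal weight regardless of its size, while weighted averaging scales each class by its support, which is why the weighted-average recall equals the overall accuracy. The sketch below (illustrative only; the confusion matrix is hypothetical, not data from this study) demonstrates that identity:

```python
# Hypothetical confusion matrix of raw counts (row = true class, column = predicted class).
# The third class is deliberately small to make macro and weighted averages diverge.
cm = [
    [70, 20, 10],
    [ 5, 90,  5],
    [ 2,  3, 15],
]
supports = [sum(row) for row in cm]      # number of true samples per class
total = sum(supports)

per_class_recall = [cm[k][k] / supports[k] for k in range(len(cm))]

# Macro: unweighted mean over classes.
macro_recall = sum(per_class_recall) / len(per_class_recall)
# Weighted: support-weighted mean over classes.
weighted_recall = sum(r * s for r, s in zip(per_class_recall, supports)) / total
# Overall accuracy: fraction of all samples on the diagonal.
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

print(f"macro recall    = {macro_recall:.3f}")
print(f"weighted recall = {weighted_recall:.3f} (equals accuracy {accuracy:.3f})")
```

Because the support-weighted recall sums each class's diagonal count over the grand total, it is algebraically identical to accuracy; the macro value, by contrast, is pulled toward the minority classes.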
Table 7. The performance comparison of our proposed model with the previous methods in the total agreement setting of the SCIAN dataset in terms of accuracy, precision, recall, specificity, and F1-score metrics. The bold font shows the best results. All the metrics are described in percentages. The accuracy, precision, specificity, and F1-score of the method in [21] were not reported directly, but calculated from its confusion matrix. The symbol '-' stands for unreported results. The first five columns give the per-class true positive rate (TPR).

| Model | TPR Normal | TPR Tapered | TPR Pyriform | TPR Amorphous | TPR Small | Accuracy (Weighted Avg. TPR) | Recall (Avg. TPR) | Precision (Macro) | Specificity (Macro) | F1-Score (Macro) |
|---|---|---|---|---|---|---|---|---|---|---|
| CE-SVM [19] | 74 | 70 | 92 | 30 | **100** | 46 | 73 | - | - | - |
| FT-VGG [21] | 72 | 67 | 95 | 44 | 84 | 53 | 72 | 45 | 90 | 55 |
| Proposed model (MC-HSH) | **80** | **86** | **100** | **72** | **100** | **77** | **88** | **64** | **94** | **74** |
Table 8. The average confusion matrix on the stratified five-fold partition of the SCIAN dataset in the total agreement setting, where each cell value (in percent) is the average of 15 runs (5 folds × 3 runs). Rows are true classes; columns are predicted classes.

| True \ Predicted | Normal | Tapered | Pyriform | Amorphous | Small |
|---|---|---|---|---|---|
| Normal | 80 | 0 | 3 | 10 | 7 |
| Tapered | 2 | 86 | 0 | 9 | 3 |
| Pyriform | 0 | 0 | 100 | 0 | 0 |
| Amorphous | 8 | 11 | 2 | 72 | 7 |
| Small | 0 | 0 | 0 | 0 | 100 |
Table 9. The stratified five-fold cross-validation results for the morphological classification of human sperm heads in the total agreement setting of the SCIAN dataset for every fold and every run, where all the metrics except Matthews correlation coefficient (MCC) and Cohen's kappa score (CKS) are described in percentages. Ma = macro average, W = weighted average, Mi = micro average; the weighted-average recall equals the overall accuracy.

| Fold | Run | Prec. (Ma) | Prec. (W) | Rec. (Ma) | Rec. (W) = Acc. | Spec. (Ma) | Spec. (W) | F1 (Ma) | F1 (W) | Jaccard (Ma) | Jaccard (W) | G-mean (Ma) | G-mean (W) | ROC-AUC (Ma) | ROC-AUC (Mi) | PR-AUC (Mi) | MCC | CKS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | First | 69 | 88 | 82 | 73 | 94 | 97 | 68 | 77 | 57 | 64 | 87 | 84 | 95 | 94 | 77 | +0.61 | +0.56 |
| 1 | Second | 62 | 85 | 88 | 70 | 93 | 96 | 67 | 72 | 52 | 57 | 90 | 81 | 96 | 93 | 75 | +0.60 | +0.54 |
| 1 | Third | 62 | 87 | 83 | 75 | 94 | 96 | 66 | 79 | 52 | 67 | 88 | 85 | 94 | 93 | 75 | +0.62 | +0.59 |
| 2 | First | 69 | 82 | 83 | 78 | 93 | 85 | 72 | 79 | 61 | 67 | 87 | 81 | 94 | 95 | 82 | +0.61 | +0.60 |
| 2 | Second | 60 | 84 | 87 | 77 | 93 | 90 | 65 | 79 | 50 | 66 | 89 | 83 | 94 | 94 | 81 | +0.62 | +0.60 |
| 2 | Third | 62 | 80 | 83 | 77 | 92 | 94 | 68 | 77 | 52 | 64 | 87 | 80 | 95 | 95 | 85 | +0.59 | +0.58 |
| 3 | First | 62 | 83 | 91 | 76 | 92 | 91 | 71 | 77 | 56 | 63 | 92 | 83 | 95 | 94 | 77 | +0.63 | +0.60 |
| 3 | Second | 64 | 84 | 87 | 80 | 93 | 87 | 72 | 81 | 57 | 69 | 90 | 83 | 95 | 94 | 79 | +0.65 | +0.64 |
| 3 | Third | 57 | 84 | 89 | 74 | 94 | 94 | 64 | 75 | 49 | 61 | 91 | 83 | 94 | 93 | 73 | +0.61 | +0.57 |
| 4 | First | 60 | 87 | 89 | 79 | 95 | 95 | 67 | 81 | 52 | 69 | 92 | 87 | 97 | 96 | 86 | +0.66 | +0.64 |
| 4 | Second | 70 | 81 | 85 | 77 | 92 | 86 | 75 | 78 | 63 | 65 | 89 | 81 | 95 | 95 | 82 | +0.59 | +0.58 |
| 4 | Third | 66 | 86 | 91 | 79 | 95 | 92 | 74 | 80 | 59 | 68 | 92 | 85 | 97 | 95 | 82 | +0.66 | +0.64 |
| 5 | First | 59 | 86 | 92 | 77 | 94 | 95 | 68 | 78 | 53 | 64 | 93 | 85 | 97 | 96 | 84 | +0.66 | +0.62 |
| 5 | Second | 67 | 89 | 94 | 79 | 95 | 97 | 75 | 80 | 61 | 68 | 94 | 87 | 98 | 96 | 86 | +0.70 | +0.66 |
| 5 | Third | 67 | 87 | 94 | 82 | 96 | 95 | 76 | 83 | 61 | 71 | 94 | 88 | 98 | 96 | 86 | +0.71 | +0.69 |
| Average | | 64 | 85 | 88 | 77 | 94 | 93 | 70 | 78 | 56 | 66 | 90 | 84 | 96 | 95 | 81 | +0.63 | +0.61 |
| Std. dev. | | 0.0404 | 0.0261 | 0.0405 | 0.0300 | 0.0123 | 0.0401 | 0.0394 | 0.0267 | 0.0456 | 0.0356 | 0.0247 | 0.0243 | 0.0145 | 0.0112 | 0.0442 | 0.0374 | 0.0421 |
Table 10. The performance comparison of our proposed model with the previous methods on the HuSHeM dataset in terms of accuracy, recall, precision, specificity, and F1-score metrics. Bold font shows the best results. All the metrics are described in percentages. The specificity of the methods in [19,20,21] was not reported directly, but calculated from their confusion matrices. The first four columns give the per-class true positive rate (TPR).

| Model | TPR Normal | TPR Tapered | TPR Pyriform | TPR Amorphous | Accuracy | Recall | Precision | Specificity | F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| CE-SVM [19] | 75.9 | 77.3 | 85.9 | 75.0 | 78.5 | 78.5 | 80.5 | 92.9 | 78.9 |
| APDL [20] | 94.4 | 94.3 | 87.7 | 94.2 | 92.2 | 92.3 | 93.5 | 97.5 | 92.9 |
| FT-VGG [21] | **96.4** | **94.5** | 92.3 | 93.2 | 94.0 | 94.1 | 94.7 | 98.1 | 94.1 |
| Proposed model (MC-HSH) | 95.8 | **94.5** | **96.6** | **96.4** | **95.7** | **95.5** | **96.1** | **98.5** | **95.5** |
Table 11. The average confusion matrix of our proposed model on the HuSHeM dataset, where each cell value (in percent) is the average of 15 runs. Rows are true classes; columns are predicted classes.

| True \ Predicted | Normal | Tapered | Pyriform | Amorphous |
|---|---|---|---|---|
| Normal | 96 | 3 | 1 | 0 |
| Tapered | 1 | 94 | 2 | 3 |
| Pyriform | 0 | 1 | 97 | 2 |
| Amorphous | 2 | 2 | 0 | 96 |
Table 12. The stratified five-fold cross-validation results for the morphological classification of human sperm heads on the HuSHeM dataset for every fold and every run, where all the metrics except Matthews correlation coefficient (MCC) and Cohen's kappa score (CKS) are described in percentages. Ma = macro average, W = weighted average, Mi = micro average; the weighted-average recall equals the overall accuracy. The evaluation time per image was approximately 0.9 ms.

| Fold | Run | Prec. (Ma) | Prec. (W) | Rec. (Ma) | Rec. (W) = Acc. | Spec. (Ma) | Spec. (W) | F1 (Ma) | F1 (W) | Jaccard (Ma) | Jaccard (W) | G-mean (Ma) | G-mean (W) | ROC-AUC (Ma) | ROC-AUC (Mi) | PR-AUC (Mi) | MCC | CKS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | First | 98 | 98 | 98 | 98 | 99 | 99 | 98 | 98 | 96 | 96 | 98 | 98 | 100 | 100 | 100 | +0.97 | +0.97 |
| 1 | Second | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | +1.00 | +1.00 |
| 1 | Third | 98 | 98 | 98 | 98 | 99 | 99 | 98 | 98 | 96 | 96 | 98 | 98 | 100 | 100 | 100 | +0.97 | +0.97 |
| 2 | First | 95 | 96 | 95 | 95 | 98 | 98 | 95 | 95 | 91 | 91 | 97 | 97 | 100 | 99 | 98 | +0.94 | +0.94 |
| 2 | Second | 95 | 96 | 95 | 95 | 98 | 98 | 95 | 95 | 91 | 91 | 97 | 97 | 100 | 100 | 99 | +0.94 | +0.94 |
| 2 | Third | 93 | 93 | 93 | 93 | 98 | 98 | 93 | 93 | 87 | 87 | 95 | 95 | 100 | 99 | 99 | +0.91 | +0.91 |
| 3 | First | 95 | 96 | 95 | 95 | 98 | 98 | 95 | 95 | 91 | 91 | 97 | 97 | 100 | 99 | 99 | +0.94 | +0.94 |
| 3 | Second | 95 | 96 | 95 | 95 | 98 | 98 | 95 | 95 | 91 | 91 | 97 | 97 | 99 | 99 | 98 | +0.94 | +0.94 |
| 3 | Third | 95 | 96 | 95 | 95 | 98 | 98 | 95 | 95 | 91 | 91 | 97 | 97 | 100 | 99 | 98 | +0.94 | +0.94 |
| 4 | First | 98 | 98 | 98 | 98 | 99 | 99 | 98 | 98 | 95 | 96 | 98 | 98 | 100 | 100 | 100 | +0.97 | +0.97 |
| 4 | Second | 95 | 95 | 95 | 95 | 99 | 99 | 95 | 95 | 91 | 91 | 97 | 97 | 100 | 100 | 100 | +0.94 | +0.94 |
| 4 | Third | 95 | 96 | 95 | 95 | 99 | 99 | 95 | 95 | 91 | 92 | 96 | 97 | 100 | 100 | 100 | +0.94 | +0.94 |
| 5 | First | 94 | 94 | 93 | 93 | 98 | 98 | 93 | 93 | 87 | 87 | 95 | 95 | 100 | 100 | 100 | +0.91 | +0.91 |
| 5 | Second | 95 | 96 | 96 | 95 | 99 | 99 | 95 | 95 | 91 | 91 | 97 | 97 | 100 | 99 | 98 | +0.94 | +0.94 |
| 5 | Third | 94 | 94 | 93 | 93 | 98 | 98 | 93 | 93 | 87 | 87 | 95 | 95 | 100 | 99 | 99 | +0.91 | +0.91 |
| Average | | 96 | 96 | 96 | 96 | 99 | 99 | 96 | 96 | 92 | 92 | 97 | 97 | 100 | 100 | 99 | +0.94 | +0.94 |
| Std. dev. | | 0.0191 | 0.0181 | 0.0207 | 0.0207 | 0.0064 | 0.0064 | 0.0207 | 0.0207 | 0.0365 | 0.0372 | 0.0133 | 0.0131 | 0.0026 | 0.0052 | 0.0086 | 0.0250 | 0.0250 |
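The MCC and CKS columns in Tables 6, 9, and 12 are chance-corrected agreement scores that can also be computed from a confusion matrix of counts. As a hedged illustration (not the authors' code; the matrix below is a hypothetical 3-class example), the sketch applies Cohen's kappa and the multiclass Matthews correlation coefficient (Gorodkin's R_K generalization):

```python
import math

# Hypothetical confusion matrix of raw counts (row = true class, column = predicted class).
cm = [
    [50,  5,  5],
    [ 4, 40,  6],
    [ 6,  4, 30],
]
n = len(cm)
s = sum(sum(row) for row in cm)                           # total samples
c = sum(cm[k][k] for k in range(n))                       # correctly classified samples
t = [sum(cm[k]) for k in range(n)]                        # true count per class (row sums)
p = [sum(cm[i][k] for i in range(n)) for k in range(n)]   # predicted count per class (column sums)

# Cohen's kappa score: observed agreement corrected for chance agreement.
po = c / s
pe = sum(t[k] * p[k] for k in range(n)) / s**2
cks = (po - pe) / (1 - pe)

# Multiclass Matthews correlation coefficient (Gorodkin's R_K).
mcc = (c * s - sum(t[k] * p[k] for k in range(n))) / math.sqrt(
    (s**2 - sum(pk**2 for pk in p)) * (s**2 - sum(tk**2 for tk in t))
)
print(f"CKS = {cks:+.2f}, MCC = {mcc:+.2f}")
```

Both scores range from -1 to +1, with 0 indicating chance-level prediction, which is why the tables report them with an explicit sign rather than as percentages.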

Share and Cite

MDPI and ACS Style

Iqbal, I.; Mustafa, G.; Ma, J. Deep Learning-Based Morphological Classification of Human Sperm Heads. Diagnostics 2020, 10, 325. https://doi.org/10.3390/diagnostics10050325
