1. Introduction
Cardiovascular disease, whose global mortality has been rising, is considered a severe threat to human health [1]. Accordingly, growing attention has been paid to research and technology aimed at improving the detection of cardiac disease and reducing the associated mortality rate. In cardiac function research and disease diagnosis, the left ventricle (LV) is one of the chief concerns and the diagnostic focus. LV boundary delineation is clinically critical in evaluating cardiac indices, such as the ejection fraction (EF), end-diastolic volume (EDV), and end-systolic volume (ESV) [2].
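As a brief illustration of how these indices relate, the ejection fraction is conventionally computed from the two volumes as EF = (EDV − ESV) / EDV × 100%. The following minimal Python sketch applies that standard definition; the function name and the example volumes are hypothetical and are not taken from the dataset.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Return the ejection fraction in percent: EF = (EDV - ESV) / EDV * 100."""
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("ESV must lie between 0 and EDV")
    return (edv_ml - esv_ml) / edv_ml * 100.0


# Hypothetical volumes in millilitres, chosen only for illustration.
ef = ejection_fraction(edv_ml=120.0, esv_ml=50.0)
print(f"EF = {ef:.1f}%")  # prints "EF = 58.3%"
```

Because both volumes are derived from the segmented LV area, errors in boundary delineation propagate directly into the EF estimate.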
Various medical imaging modalities have been utilized to evaluate the LV, and improvements in these modalities have made it easier to diagnose cardiovascular problems. For example, echocardiography creates anatomical images of the heart using high-frequency ultrasound (US) waves. Moreover, ultrasound is indispensable for assessing LV function due to its ease of access, exceptional temporal resolution, non-invasiveness, and real-time execution [3]. An example of a two-dimensional US image is shown in Figure 1, wherein the blue track indicates the LV boundary, which is an example output of manual segmentation. A specialist, usually a cardiologist, meticulously segments the endocardial border of the LV at the end-systolic and end-diastolic phases. This information is then used to provide a quantitative functional examination of the heart in order to identify cardiopathies.
Accurate US segmentation of cardiac anatomy currently relies on sonographers, which makes it a tedious task that depends on manual skill [4]. In addition, manual delineation is challenging because of low contrast, speckle noise, and poorly defined boundaries [5]. Recently, deep learning (DL) has been introduced as a fully automated technique for processing images, which can help in accurate automatic LV segmentation. Below, we introduce some mainstream methods and related systems evaluated on the EchoNet-Dynamic [6] dataset, which is the largest publicly available set of apical 4-chamber (A4C) echocardiography videos with tracings and cardiac function labels.
In recent years, deep learning has become the most extensively utilized approach for cardiac image segmentation [7]. The adoption of DL in a number of medical fields has generated significant interest in the field of medical image analysis. Ouyang et al. [8] first developed a convolutional model to generate frame-level semantic segmentation on the EchoNet-Dynamic dataset. Then, encouraged by the successful application of the transformer structure in imaging-related tasks, a lightweight hybrid model integrating the transformer and the CNN architecture was proposed [9]. The highlight of this work is the use of shuffled group convolution to redesign the patch embeddings, which helps reduce the otherwise excessive number of parameters.
Different learning paradigms, such as self-supervised and semi-supervised learning, have achieved enormous success in image segmentation [10]. The self-supervised approach, or self-supervision, has proven useful when access to labeled data is limited [11]. A self-supervised contrastive learning method was proposed in the literature to segment the left ventricle when only a limited number of annotated images exist [12]. It trains itself by leveraging one portion of the data to predict the other portion and thereby generate labels accurately. A more recent semi-supervised LV segmentation method exploits graph signal processing [13]. In this work, instance segmentation and temporal, texture, and statistical feature extraction were required to represent the nodes, followed by graph sampling, where several labeled samples were utilized to embed the graph.
Recently, more research has focused on directly segmenting the boundaries of the LV in time-varying 2D echocardiogram videos, since cardiac movement provides a more detailed description of heart function. For example, based on segmenting the entire cardiac cycle, Puyol-Antón et al. [14] developed an AI-based model for obtaining advanced systolic and diastolic LV function biomarkers. In addition, Chen et al. [15] employed a modified R(2+1)D ResNet stem to construct a fully automated framework that offers motion tracking in addition to joint semantic segmentation in multi-beat echocardiograms. This convolutional network uses sliding windows to accommodate the heart cycle while efficiently encoding spatiotemporal data.
In summary, convolutional neural network (CNN)-based models have remarkably improved LV segmentation on the EchoNet-Dynamic dataset. The evaluation results of the segmentation networks in the reviewed literature are summarized in Table 1. However, these approaches fail to efficiently integrate the low-level spatial features with the high-level contextual semantic features.
A dual-branch network, built on coarse and fine pathways, is another way to segment echocardiogram images. A typical bilateral network is made up of a spatial path, a context path, and a feature fusion module. The spatial path captures low-level spatial characteristics, the context path captures high-level context features, and the feature fusion module merges the features learned by both paths. However, according to our review of the literature, few studies have used a bilateral architecture to segment echocardiograms, whether on open datasets or on private ones. Therefore, this paper presents a bilateral network with dilated convolutions, which achieves precise segmentation of the LV in four-chamber echocardiograms based on high- and low-level features. The chief contributions of our method are as follows:
Considering low-level and high-level features, we adopted a bilateral-structured network, namely Bi-DCNet, to accurately segment the LV in four-chamber view images. The two branches are responsible for semantic and spatial information, respectively.
To improve the network’s ability to extract high-level semantic features, dilated convolutions were used in the context path, which also expands the receptive field to capture multi-scale information.
We conducted experiments on EchoNet-Dynamic, a dataset offering 10,030 echocardiogram videos (apical 4-chamber view) with corresponding labeled segmentation tracings of the LV. To the best of the authors’ knowledge, this is the first bilateral-shaped network implemented on this large clinical video dataset to assess cardiac function.
The remainder of this article is arranged as follows: In Section 2, every module of our methodology is elaborated, and the experimental setup, datasets, and evaluation metrics are described. Then, Section 3 summarizes the findings, and Section 4 presents the discussion. Lastly, Section 5 presents the conclusion of this study.
4. Discussion
This work proposes a bilateral segmentation model and evaluates its performance against U-Net and BiSeNet. We adopted a bilateral structured network for LV segmentation in echocardiography, where conventional segmentation approaches are not satisfactory due to low signal-to-noise ratios, fluctuating levels of speckle noise, and the existence of shadowing in the ultrasound. According to our literature review, this is the first time a bilateral structure has been used on the EchoNet-Dynamic dataset, and a considerable improvement in segmentation accuracy was achieved. Moreover, since the left ventricle’s area and volume vary significantly throughout the cardiac cycle due to constant contraction and expansion, dilated convolution with varied dilation rates was first integrated into the context module in order to capture multi-scale feature information.
The proposed model demonstrated the highest performance levels regarding DSC, IoU, accuracy, precision, and specificity, as indicated in Table 5. The proposed method used the ResNet-18 model to extract features, which were then separately processed to maximize the use of the information and features.
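For readers less familiar with these metrics, the sketch below computes them for one pair of binary masks. It assumes the standard confusion-matrix definitions (for example, DSC = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN)); the function names and the toy masks are hypothetical and serve only as an illustration, not as the evaluation code used in this study.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equally sized binary masks (lists of rows)."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def _ratio(num, den):
    # Return 0.0 for an empty denominator instead of raising (a sketch-level choice).
    return num / den if den else 0.0


def segmentation_metrics(pred, truth):
    """Standard overlap and classification metrics for binary segmentation."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return {
        "dsc": _ratio(2 * tp, 2 * tp + fp + fn),
        "iou": _ratio(tp, tp + fp + fn),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


# Toy 3x4 masks (1 = LV pixel), invented for illustration.
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
pred = [[0, 1, 1, 1],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]

metrics = segmentation_metrics(pred, truth)
print(f"DSC = {metrics['dsc']:.3f}, IoU = {metrics['iou']:.3f}")  # DSC = 0.750, IoU = 0.600
```

Note that accuracy and specificity are dominated by the background pixels, which is why DSC and IoU are usually the more informative metrics for a small structure such as the LV.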
On the other hand, the benefit of employing dilated convolution over conventional convolution is the possibility of achieving larger receptive fields while preserving the same feature resolution and using fewer parameters than a conventional convolution that covers the same receptive field [31]. Additionally, the model is better able to comprehend information about the whole context, for instance, how the left ventricle’s shape changes throughout the cardiac cycle [32]. We employed dilated convolution with varied dilation rates in the context detail module to capture multi-scale feature information in order to more precisely separate the left ventricle, with its various sizes and appearances. This can increase the segmentation precision of the left ventricle in echocardiography.
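The receptive-field argument can be made concrete with a small one-dimensional example. A kernel of size k with dilation rate d spans k + (k − 1)(d − 1) input samples while still holding only k weights. The sketch below is a plain-Python illustration of this property; the function names, the 3-tap kernel, and the dilation rates 1, 2, and 4 are assumptions chosen for illustration and are not the configuration used in Bi-DCNet.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input samples spanned by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D dilated convolution (cross-correlation, as in DL frameworks)."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = list(range(10))  # 0, 1, ..., 9
kernel = [1, 1, 1]        # always 3 weights, whatever the dilation rate

for rate in (1, 2, 4):    # hypothetical dilation rates
    span = effective_kernel_size(len(kernel), rate)
    print(rate, span, dilated_conv1d(signal, kernel, rate))
# 1 3 [3, 6, 9, 12, 15, 18, 21, 24]
# 2 5 [6, 9, 12, 15, 18, 21]
# 4 9 [12, 15]
```

In a segmentation network the input is typically padded so that the output keeps the same resolution; the valid-mode variant is used here only to keep the sketch short.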
In addition to the numerical and visual results, statistical results are also provided for the evaluation metrics. The distributions of the evaluation metrics over all test images were analyzed and compared using boxplots. Figure 5 (left) is a boxplot depicting the IoU values for the test images. The two bilateral networks clearly surpassed the U-Net model in terms of maximum values, minimum values, and medians, as well as the upper and lower quartiles. Moreover, the proposed Bi-DCNet exhibited slightly higher performance than the classic BiSeNet, indicating that the dilated convolutions helped capture a larger global context.
Boxplots for DSC values are depicted in Figure 5 (right). As inferred from the figure, the behavior of the proposed model for DSC is quite similar to that for IoU. In addition, the reduced skewness of the boxplot for the proposed Bi-DCNet, in contrast to those of the other models, demonstrates the superior performance of the proposed network.
Boxplots for recall, precision, accuracy, and specificity are depicted in Figure 6 and Figure 7, which provide additional evidence that performance improved when both the bilateral structure and dilated convolution were employed to collect local and global information.
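The boxplot comparisons above rest on a few summary statistics per model: the minimum, lower quartile, median, upper quartile, and maximum of a per-image metric. The following sketch shows one way to obtain them with the Python standard library, together with Bowley's quartile-based skewness as one possible way to quantify the asymmetry of a box. The scores are invented for illustration, the function names are hypothetical, and the choice of skewness measure is an assumption rather than the procedure used to produce the figures.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def bowley_skewness(values):
    """Quartile skewness (Q3 + Q1 - 2*median) / (Q3 - Q1); 0 for a symmetric box."""
    _, q1, median, q3, _ = five_number_summary(values)
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Hypothetical per-image IoU scores for one model.
iou_scores = [0.80, 0.82, 0.85, 0.86, 0.88, 0.90, 0.91]

low, q1, median, q3, high = five_number_summary(iou_scores)
print(f"min={low:.3f} Q1={q1:.3f} median={median:.3f} Q3={q3:.3f} max={high:.3f}")
print(f"quartile skewness = {bowley_skewness(iou_scores):.3f}")
```

A skewness value closer to zero corresponds to a box whose median sits nearer the middle of the interquartile range.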
5. Conclusions
This article presents Bi-DCNet, a DL network that exploits bilateral structures to achieve completely automatic segmentation of the left ventricle in four-chamber view echocardiograms. Our proposed model consists of a spatial path responsible for low-level feature acquisition, a context path for high-level feature acquisition, and a fusion module that enables efficient integration of features captured by the preceding two paths.
It is essential to note that the use of dilated convolutions, which offer a broad receptive field for a more global context, is the most significant factor in improving segmentation performance. Although dilated convolution provides a large receptive field, a segmentation mask formed with a single dilation rate throughout the segmentation process cannot capture semantic information at every scale. To successfully capture multi-scale objects, we applied multiple dilated convolutions with varying dilation rates in the bottleneck layer, a dilation strategy extensively used in the field of medical imaging.
The proposed model demonstrated impressive performance, with a DSC of 0.9228 and an IoU of 0.8576, outperforming the well-known deep learning techniques U-Net and BiSeNet. Although Bi-DCNet did not outperform BiSeNet by a large margin, it is important to highlight that this manuscript applies a bilateral structure to the EchoNet-Dynamic dataset for the first time, with excellent results. Given that BiSeNet’s output is already satisfactory (0.9207), we applied dilated convolution on this basis to account for the uncertainty in the size of the left ventricle. Although the improvement is small (a gain of approximately 0.002, or about 0.2%), it indicates that dilated convolution contributed to some degree. We hope to use this technique in the future for tasks such as automatic measurement and landmark recognition in medical image analysis.