Article

EA-ConvNeXt: An Approach to Script Identification in Natural Scenes Based on Edge Flow and Coordinate Attention

Zhiyun Zhang, Elham Eli, Hornisa Mamat, Alimjan Aysa and Kurban Ubul
1 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2 Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi 830046, China
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(13), 2837; https://doi.org/10.3390/electronics12132837
Submission received: 13 May 2023 / Revised: 19 June 2023 / Accepted: 20 June 2023 / Published: 27 June 2023

Abstract

In multilingual scene text understanding, script identification is an important prerequisite for text image recognition. Because text images in natural scenes have complex backgrounds and severe noise, and different language families share common symbols or similar layouts, script identification remains an open problem. This paper proposes a new script identification method based on an improved ConvNeXt, named EA-ConvNeXt. First, a method for generating an edge flow map from the original image is proposed, which enlarges the set of script images and reduces background noise. Then, on top of the features extracted by the ConvNeXt convolutional neural network, a coordinate attention module is introduced to enhance the description of spatial position features in the vertical direction. The public dataset SIW-13 is extended with a Uyghur script image set to form SIW-14. The improved method achieves identification rates of 97.3%, 93.5%, and 92.4% on the public script identification datasets CVSI-2015, MLe2e, and SIW-13, respectively, and 92.0% on the extended SIW-14 dataset, verifying its effectiveness.

1. Introduction

Script identification refers to the automatic identification of the language or cultural background of a text through computer technology. It emerged to deal with linguistic diversity and cultural differences in the process of globalization and to provide basic support for text translation, information extraction, and cultural exchange. With the acceleration of globalization and the development of Internet technology, more and more information and data involve multiple languages and cultural backgrounds. However, manual processing and management of this cross-language and cultural information can no longer meet the needs. Computer scientists have been studying how to use computer technology for script identification, improve the efficiency and accuracy of information processing, and better understand and manage cross-language and cultural information. Script identification has a wide range of practical applications, such as online archiving of multilingual scene images, product image search [1], multilingual machine translation [2], scene image understanding [3], etc.
Script identification technology plays a vital role in the Optical Character Recognition (OCR) pipeline [4]. Its main value is that identifying the language allows the most suitable text recognition model [5,6,7] to be selected, enabling automatic multilingual processing. Script identification helps people process text information involving multiple languages and improves the efficiency and accuracy of information processing. Script identification in natural scenes refers to identifying the language or script to which real-world text belongs, so that it can be processed and analyzed accordingly. Script categories include Arabic, Hindi, Bengali, and Telugu, among others. As shown in Figure 1, different scripts have different character structures. Script identification faces significant challenges on scene text images, which differ substantially from printed document images [8], for the following main reasons:
(1)
Scripted images come in a variety of styles due to their variety of sources, such as outdoor billboards, traffic signs, user instructions, and more. These styles include various fonts, colors, artistic styles, and more;
(2)
Low resolution, poor lighting, and noise degrade image quality, and scripts within the same language family that share a subset of characters differ only slightly; for example, Greek, English, and Russian share a common character subset;
(3)
Background interference will directly affect the identification rate. If the image background overlaps the text, the background may be mistaken for script;
(4)
There are few public datasets for natural scenes, and most contain only a small amount of data, which is not conducive to the training and convergence of deep learning models and results in low script identification rates.
The main contributions of this paper are as follows:
(1)
This paper proposes an improved natural scene script identification method based on the convolutional neural network ConvNeXt, named EA-ConvNeXt;
(2)
A method for generating edge flow images from raw images is proposed. The edge flow map contains only black and white, similar to a binarized image, so the number of script images is effectively increased while the background noise of the script image is reduced;
(3)
Based on the feature extraction of the convolutional neural network ConvNeXt, a coordinate attention mechanism is introduced to enhance the feature description of the spatial position in the vertical direction of the script image;
(4)
The public script identification dataset SIW-13 is extended with Uyghur text images from natural scenes to form SIW-14;
(5)
The method proposed in this paper is evaluated on the public script identification datasets CVSI-2015, MLe2e, and SIW-13 and the extended dataset SIW-14, which verifies the effectiveness of the method.

2. Related Works

The script identification process mainly includes preprocessing, feature extraction, and classification. For machine learning methods, the preprocessing part is also very important, but for deep learning methods, some preprocessing steps can be directly omitted. In recent years, script identification methods have developed rapidly in both machine learning and deep learning. Traditional manual feature extraction methods mainly include Local Binary Pattern (LBP) [9], Histogram of Oriented Gradients (HOG) [10], and Gray Level Co-occurrence Matrix (GLCM) [11], etc. These methods mainly extract the texture, color and other information of the text area for classification and identification. Although these methods have certain effects, because their feature extraction process is based on human experience, they cannot fully and effectively express the feature information of the text, so there are certain limitations in practical applications. Deep learning methods mainly include convolutional neural network (CNN) [12], recurrent neural network (RNN) [13], attention mechanism [14], etc. These methods mainly learn the features of text directly from raw data by end-to-end training on text images, so they have a good generalization ability. In recent years, deep learning methods have been widely used in the field of script identification in natural scenes and achieved good results.
Technical research on script identification began in 1990, and there are many traditional machine learning methods. At the end of the 20th century, Spitz et al. [15] proposed for the first time to divide text images into two categories according to Chinese and Latin and successfully identified three languages, Chinese, Japanese, and Korean, by analyzing the optical density distribution. In 2010, Padma [16] used the gray level co-occurrence matrix of wavelet transform coefficient subbands to extract Haralick texture features and used them for language identification of machine-printed text. Ferrer et al. [17] used LBP texture features to describe the stroke direction distribution of text characters in 2013 and used them as features of text images to classify them through SVM linear classifiers. In the same year, Rani et al. [18] used directional frequency-based Gabor features and gradient features based on individual character gradient information to identify multi-font and multi-size characters and used multi-class support vector machines for classification. Shivakumara et al. [19] used gradient angle segmentation to segment text words in document images in 2014 and generated candidate features for language identification. Mukarambi et al. [20] proposed a method based on LBP features in 2017 for language identification of Kannada, Indian, and English text images captured by cameras, and the method showed good identification results. The experimental results of these methods show that script identification technology has made remarkable progress.
The application of deep learning methods in script identification is developing rapidly. Gomez et al. [21] used convolutional features and a naive Bayes classifier to learn the stroke features of text, paying special attention to the discriminative features of small strokes within a fine-grained framework; their method achieved state-of-the-art results in scene text script identification. Mei et al. [22] combined a convolutional neural network and a recurrent neural network through feature sequence labeling and designed a network model that can be trained end-to-end, improving the semantic information and long-range spatial dependence of the extracted text image features and making the model perform well in script identification tasks. Zdenek et al. [23] developed a method that combines local convolutional triplet features with a visual bag-of-words model, enabling the model to obtain a more descriptive vocabulary and strengthening weak feature descriptors on low-resolution images; it showed competitive performance in script identification. Ankan et al. [24] proposed a convolutional recurrent neural network based on an attention mechanism, which fuses the global and local features of text images through dynamic weighting and demonstrated strong performance on scene text and video text images. Cheng et al. [25] developed a convolutional neural network based on a local patch aggregator that considers the prediction scores of local patches to learn more discriminative features, which is particularly useful for classifying similar scripts and achieved excellent performance. Shi et al. [26] introduced a discriminative clustering algorithm into the convolutional neural network and designed the DisCNN model, which effectively captures fine differences between languages. The multivariate time series classification method proposed by Karim et al. [27] combines Long Short-Term Memory (LSTM) and a Fully Convolutional Network (FCN), and adding an attention mechanism further improves its performance. Fujii et al. [28] used an encoder and a summarizer to extract local features, which are fused by an attention mechanism to reflect the importance of different patches.

3. Methods

3.1. Overview of Network Structure

The network proposed in this paper identifies the script category of text images in an end-to-end manner. It consists of three parts: a feature extractor, coordinate attention, and a classification layer. Figure 2 shows the network structure. In the first part, edge flow images are generated from the raw images for data augmentation, and the ConvNeXt feature extraction blocks produce feature maps. The second part introduces a coordinate attention mechanism that helps the network focus on the important alignment structure of the text, reducing the negative impact of the background and of other text types. The third part uses a linear fully connected layer to compute the probability of each script category and determine the script class. Input images of arbitrary size are resized to a fixed 224 × 224 square before being fed to the network.
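To make this three-part layout concrete, the following is a minimal PyTorch sketch of how the pieces could be wired together; it is not the authors' released code. The edge-flow and coordinate-attention modules are left as placeholders here (they are sketched in Sections 3.2 and 3.4), and the torchvision ConvNeXt-Base backbone stands in for the feature extractor.

```python
import torch
import torch.nn as nn
import torchvision

class EAConvNeXt(nn.Module):
    """Sketch of the three-part network: edge flow -> ConvNeXt features -> coordinate attention -> classifier."""
    def __init__(self, num_classes=14, edge_flow=None, attention=None):
        super().__init__()
        # Placeholders; see the sketches in Sections 3.2 and 3.4 for concrete modules.
        self.edge_flow = edge_flow if edge_flow is not None else nn.Identity()
        backbone = torchvision.models.convnext_base(weights="IMAGENET1K_V1")
        self.features = backbone.features                  # ConvNeXt feature extractor (Section 3.3)
        self.attention = attention if attention is not None else nn.Identity()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(1024, num_classes)     # linear classification layer

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        x = self.edge_flow(x)       # Gx, Gy and grayscale merged into a 3-channel image
        x = self.features(x)        # (B, 1024, 7, 7) feature map
        x = self.attention(x)       # spatially re-weighted features
        x = self.pool(x).flatten(1)
        return self.classifier(x)   # per-script class scores

logits = EAConvNeXt(num_classes=14)(torch.randn(1, 3, 224, 224))  # -> shape (1, 14)
```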

3.2. Edge Stream Image Enhancement Module

Script images in natural scenes vary widely in style and contain heavy noise, which makes processing them challenging. Removing background noise is a critical preprocessing step, because noise interferes with subsequent image analysis and interpretation. To this end, we propose a differentiable Sobel edge operator module for edge extraction and noise suppression. The Sobel operator is a gradient-based operator that can efficiently detect edges in images; our module uses it to identify edges and suppress noise in a differentiable manner, thereby improving image quality and usability.
As a method of edge detection, the Sobel operator is a commonly used discrete difference operator, which can be used to approximate the gradient of the image brightness function. It is a small integer filter that performs convolution operations both horizontally and vertically across the entire image, so computational resource requirements are minimal. The Sobel operator consists of two sets of 3 × 3 matrices, which can be used to calculate the brightness difference or gradient of the image in the horizontal and vertical directions. Operators in the horizontal direction are expressed as a matrix in the X direction, and operators in the vertical direction are expressed as a matrix in the Y direction. The operator calculation form is as follows:
$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * A \qquad (1)$$

$$G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * A \qquad (2)$$
In this context, A is the original image converted to grayscale, and Gx and Gy are the edge images obtained by applying the Sobel operator in the horizontal and vertical directions, respectively, to the grayscale image.
This paper proposes a new algorithm for generating edge flow images [29]. This method is based on the edge detection principle of Sobel operator, as shown in Figure 3. In order to extract the edge information of the image in the x and y directions, the algorithm first converts the original image to a grayscale image, and then applies two 3 × 3 convolutional layers to extract the image information. The colors of the two convolutional layers are orange and blue, respectively, the orange convolutional layer corresponds to the edge information in the x direction, and the blue convolutional layer corresponds to the edge information in the y direction.
Our edge flow detection algorithm improves upon the traditional Sobel edge operator by introducing differentiable parameters. Unlike the Sobel operator, which relies on fixed values to extract edges, our approach defines the Sobel operator as a convolution kernel that can be fine-tuned during network backpropagation. This makes the Sobel operator more adaptable, allowing it to learn its own parameters and extract edges that are better suited for the classification task. However, the edge images Gx and Gy extracted in the x and y directions only capture homogeneity, neglecting texture information. To address this limitation, we convert the original image to grayscale before inputting it. Finally, we merge the Gx, Gy, and grayscale images to produce a three-channel image.
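The following PyTorch sketch shows one way to implement such a learnable Sobel edge-flow module under the description above; the class name, the grayscale conversion weights, and the exact way the three channels are merged are our assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class EdgeFlow(nn.Module):
    """Differentiable Sobel edge-flow module: Gx, Gy and the grayscale image
    are stacked into a 3-channel image (a sketch of the module described above)."""
    def __init__(self):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]])
        sobel_y = sobel_x.t().contiguous()
        # Two 3x3 convolutions initialised with the Sobel kernels; because they are
        # ordinary nn.Conv2d weights, backpropagation can fine-tune them.
        self.conv_x = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
        self.conv_y = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
        self.conv_x.weight.data.copy_(sobel_x.view(1, 1, 3, 3))
        self.conv_y.weight.data.copy_(sobel_y.view(1, 1, 3, 3))
        # Standard luma weights for RGB -> grayscale conversion (an assumption).
        self.register_buffer("luma", torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1))

    def forward(self, x):                        # x: (B, 3, H, W) RGB image
        gray = (x * self.luma).sum(dim=1, keepdim=True)   # grayscale image A, (B, 1, H, W)
        gx = self.conv_x(gray)                   # horizontal-gradient edge image Gx
        gy = self.conv_y(gray)                   # vertical-gradient edge image Gy
        return torch.cat([gx, gy, gray], dim=1)  # merged 3-channel edge-flow image
```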

3.3. Basic Feature Extraction Module

The basic feature extraction module uses the pure convolutional neural network ConvNeXt [30] as the backbone, which improves on ResNet50 by borrowing design ideas from the Swin Transformer. Like Swin Transformer, ConvNeXt has four stages, and it changes the ratio of the number of blocks per stage from ResNet's 3:4:6:3 to 1:1:3:1, matching Swin Transformer. In addition, for the initial feature map downsampling, ConvNeXt adopts the same patchify approach as Swin Transformer, namely a convolution with a 4 × 4 kernel and a stride of 4.
The ConvNeXt network adopts the basic modules of the ResNeXt [31] network, which include parallel branches and grouped convolutions, an improvement strategy that balances model size and performance well. In contrast, the ResNet network uses a bottleneck with a narrow middle and two wide ends, whereas MobileNet [32] first used an inverted bottleneck with a wide middle and narrow ends, a structure also adopted by the Transformer and Swin Transformer. Large convolution kernels had long been abandoned, because stacking several small kernels achieves the same receptive field with fewer parameters and better performance. However, Swin Transformer reintroduced large local windows (7 × 7) without hurting performance, so the ConvNeXt network also experimented with larger convolution kernels and found that a 7 × 7 kernel achieves the best results.
In the ConvNeXt block, the activation function is changed from the ReLU commonly used in convolutional neural networks to the GELU commonly used in Transformers. In addition, not every layer needs to be followed by an activation, so an activation is used only in the fully connected layer after the depthwise convolution; using fewer activation functions improves performance. To further reduce model complexity, the normalization layers are also simplified: some normalization layers in the ConvNeXt block are removed, only the one after the depthwise convolution is retained, and Batch Normalization (BN) is replaced by Layer Normalization (LN). Furthermore, unlike ResNet, a separate downsampling layer is used instead of downsampling inside the residual block. The downsample operation, shown in Figure 4, consists of a Layer Normalization followed by a convolution with a 2 × 2 kernel and a stride of 2.
The detailed structure of the ConvNeXt block is shown in Figure 5. It consists of a 7 × 7 depthwise convolution, a Layer Normalization (LN) layer, a 1 × 1 convolution followed by GELU activation, a second 1 × 1 convolution, a Layer Scale layer, and a Drop Path layer, with a residual connection back to the block input.
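As a reference, the following PyTorch sketch reproduces the downsample layer of Figure 4 and the ConvNeXt block of Figure 5; Drop Path is omitted for brevity, and the code mirrors the public ConvNeXt design rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dimension of a (B, C, H, W) tensor."""
    def forward(self, x):
        x = x.permute(0, 2, 3, 1)
        x = super().forward(x)
        return x.permute(0, 3, 1, 2)

class Downsample(nn.Sequential):
    """Separate downsampling layer (Figure 4): LayerNorm + 2x2 conv with stride 2."""
    def __init__(self, dim_in, dim_out):
        super().__init__(LayerNorm2d(dim_in),
                         nn.Conv2d(dim_in, dim_out, kernel_size=2, stride=2))

class ConvNeXtBlock(nn.Module):
    """ConvNeXt block (Figure 5): 7x7 depthwise conv -> LN -> 1x1 conv -> GELU
    -> 1x1 conv -> Layer Scale, with a residual connection (Drop Path omitted)."""
    def __init__(self, dim, layer_scale_init=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)                       # LN after the depthwise conv
        self.pwconv1 = nn.Linear(dim, 4 * dim)              # 1x1 conv as a linear layer (inverted bottleneck)
        self.act = nn.GELU()                                # single GELU, as described above
        self.pwconv2 = nn.Linear(4 * dim, dim)              # second 1x1 conv
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))  # Layer Scale

    def forward(self, x):                                   # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)              # (B, H, W, C) for LN / linear layers
        x = self.gamma * self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)             # residual connection
```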

3.4. Coordinate Attention

An analysis of the scripts in the dataset shows that the differences between scripts lie mainly in the vertical direction, as shown in Figure 6. For the Hindi, Bengali, and Punjabi examples in the figure, a horizontal baseline runs through the text, and characters are distinguished by the strokes attached above and below this baseline, so the discriminative parts of these characters lie mainly in the vertical direction. Therefore, when extracting spatial features for script identification, the vertical direction is at least as important as the horizontal direction.
To address the above issue, we introduce the coordinate attention mechanism [33] to improve the network's ability to represent the glyph shapes of text images; Figure 7 depicts its architecture. Because channel attention compresses the spatial information of the feature map into the channel dimension through global average pooling, position information is lost. We therefore employ a CA module that aggregates features along the vertical and horizontal directions based on inter-channel relationships, enabling the attention module to capture long-range spatial interactions with precise location information. H × 1 and 1 × W one-dimensional average pooling operations are performed on each feature channel to efficiently capture the global receptive field and encode accurate location information. For channel c, the pooled outputs at height h and width w, denoted $t_c^h(h)$ and $t_c^w(w)$, are given in Equations (3) and (4).
$$t_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} a_c(h, i) \qquad (3)$$

$$t_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} a_c(j, w) \qquad (4)$$
Next, the vertical and horizontal feature maps are concatenated, and the channel dimension is compressed to C/s by a shared 1 × 1 convolutional transformation $Y_1$, where s is the channel reduction ratio, which reduces the number of parameters involved in the computation. A non-linear sigmoid activation is then applied to the result after batch normalization. The resulting feature map y has dimension 1 × (W + H) × (C/s), and the computation is shown in Formula (5).
$$y = \sigma\big(Y_1([t^h, t^w])\big) \qquad (5)$$
In the next step, the feature map y is split along the spatial dimension into two independent tensors $y^w$ and $y^h$ for the width and height directions, and two 1 × 1 convolutions $Y_w$ and $Y_h$ restore their channel number to C. The resulting weight tensors of size 1 × W × C and H × 1 × C for the horizontal and vertical directions are restricted to values between 0 and 1 by the sigmoid activation function. The horizontal weight matrix $z^w$ and the vertical weight matrix $z^h$ are computed as in Formulas (6) and (7).
$$z^w = \sigma\big(Y_w(y^w)\big) \qquad (6)$$

$$z^h = \sigma\big(Y_h(y^h)\big) \qquad (7)$$
Finally, to obtain a feature map that incorporates attention along the horizontal and vertical directions as well as channel attention, the weight matrices are multiplied element-wise with the original input feature map. The result is given in Formula (8).
$$u_c(i, j) = a_c(i, j) \times z_c^w(j) \times z_c^h(i) \qquad (8)$$
where $u_c$ is the final attention-weighted output feature map and $a_c$ is the original input feature map. The coordinate attention (CA) module uses the attention mechanism to integrate spatial location information into the original feature map along the channel dimension, reflecting whether the object of interest is present in the corresponding row and column and accurately locating it, which benefits the extraction of key feature information.
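A minimal PyTorch sketch of the coordinate attention module described by Equations (3)–(8) is given below; the default reduction ratio and the use of sigmoid after batch normalization follow the description above, and this is an illustrative re-implementation rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention (Equations (3)-(8)): directional pooling, a shared 1x1
    transform, and per-direction sigmoid weights multiplied back onto the input."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)                 # C/s channels
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))       # average over width  -> t^h, Eq. (3)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))       # average over height -> t^w, Eq. (4)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)    # shared transform Y1
        self.bn = nn.BatchNorm2d(mid)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)   # Y_h, restores channels to C
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)   # Y_w, restores channels to C

    def forward(self, a):                                   # a: (B, C, H, W) input feature map
        _, _, h, w = a.shape
        t_h = self.pool_h(a)                                # (B, C, H, 1)
        t_w = self.pool_w(a).permute(0, 1, 3, 2)            # (B, C, W, 1)
        y = torch.sigmoid(self.bn(self.conv1(torch.cat([t_h, t_w], dim=2))))   # Eq. (5)
        y_h, y_w = torch.split(y, [h, w], dim=2)            # split into height/width parts
        z_h = torch.sigmoid(self.conv_h(y_h))               # Eq. (7): (B, C, H, 1)
        z_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # Eq. (6): (B, C, 1, W)
        return a * z_h * z_w                                # Eq. (8): re-weighted feature map
```

Placed after the last ConvNeXt stage of the ConvNeXt-Base backbone, the module would be instantiated with 1024 channels.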

4. Experiments

This paper uses the public scene script identification datasets CVSI-2015 [34], MLe2e [35], and SIW-13 [24], together with SIW-14, the extension of SIW-13 with Uyghur script, for experimental evaluation, presentation, and analysis of the results.

4.1. Dataset

An overview of the CVSI-2015, MLe2e, SIW-13, and SIW-14 datasets is given in Table 1. CVSI-2015 is a scene script dataset released for the video script identification competition of the 2015 International Conference on Document Analysis and Recognition (ICDAR) and is used to evaluate scene text script identification algorithms. The text images come from text superimposed on videos and cover 10 scripts: English, Hindi, Bengali, Oriya, Gujarati, Punjabi, Kannada, Tamil, Telugu, and Arabic. Most of these scripts are Indian scripts, and their high mutual similarity makes them difficult to distinguish. The dataset contains 10,688 images in total, of which 6412 are used for training, 1069 for validation, and 3207 for testing.
The MLe2e dataset was built for a multilingual end-to-end reading system and covers three tasks: text detection, script identification, and text recognition. Since we focus on script identification, we use the pre-segmented version of the dataset containing cropped word images. The images come from several existing text detection datasets and are rescaled to a common size. The dataset is relatively small, containing 1178 word images for training and 643 for testing, and covers four scripts: Latin, Chinese, Kannada, and Korean.
The SIW-13 scene script dataset is collected by Shi et al. from the scene images of Google Maps, and the script images are cut from the text part and then corrected to the horizontal direction. Each script image has a label category, that is, language information. A total of 16,291 images have been collected in 13 languages, including Arabic, Cambodian, Chinese, English, Greek, Hebrew, Japanese, Kannada, Korean, Mongolian, Russian, Thai, and Tibetan. The text in this dataset varies in length and size, so the images vary in size. There are 9791 training samples and 6500 test samples in the entire dataset, and there are 500 images in each category in the test samples.
The SIW-14 dataset extends SIW-13 with Uyghur scene script images, filling the gap in Uyghur script identification. Uyghur scene images were first collected in natural scenes and then annotated and cropped to obtain script images; 1000 Uyghur images were added to the training set and 500 to the test set, so SIW-14 contains 10,791 word images for training and 7000 for testing across 14 script categories.

4.2. Implementation Details

The experiments were run on a single NVIDIA A40 GPU with 48 GB of memory and an Intel(R) Xeon(R) Platinum 8358P CPU at 2.60 GHz with 56 GB of RAM, under Ubuntu 20.04, using Python 3.8, PyTorch 1.10.0, and CUDA 11.3, with the PyCharm 2021.1 Professional edition as the IDE. The ConvNeXt model pre-trained on ImageNet (convnext_base_22k_224.pth) is used to initialize the feature extraction network, images are resized to 224 × 224 before training, and the batch size is set to 64. AdamW is used as the optimizer with a learning rate of 5 × 10−4 and a weight decay of 5 × 10−2.
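The following PyTorch snippet sketches this training configuration; the data loading, the cross-entropy loss, and the epoch count are illustrative assumptions, while the optimizer, learning rate, weight decay, input size, and batch size follow the values reported above. The EAConvNeXt class refers to the sketch in Section 3.1.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.utils.data import DataLoader

def train(model: nn.Module, train_set, epochs: int = 30, device: str = "cuda"):
    """Training loop with the reported hyper-parameters (epoch count is an assumption)."""
    loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
    optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=5e-2)
    criterion = nn.CrossEntropyLoss()          # assumed classification loss
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:          # images already resized to 224 x 224
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```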

4.3. Results of the Method in This Paper

In order to verify the effectiveness of the method in this paper, we evaluated and analyzed the added modules on the CVSI-2015 dataset. The results of the ablation experiments are shown in Table 2.

4.3.1. Benchmark Results

When only the basic convolutional neural network ConvNeXt is used to extract feature information from the script images, the results are shown in the second row of Table 2. The script identification rate is 93.6%, lower than under the other configurations. The bottleneck block of the ConvNeXt network adopts grouped convolutions and increases the network width to compensate for the resulting capacity drop; however, the limited amount of data and the absence of the coordinate attention mechanism prevent the model from converging optimally during training. Moreover, the baseline does not explicitly attend to feature extraction in the horizontal and vertical directions, resulting in a lower identification rate.

4.3.2. Edge Flow Image Enhancement Results

We then add the edge flow data augmentation module on top of the ConvNeXt baseline. For each original image, an edge flow image is generated, which alleviates the data scarcity problem. At the same time, the generated edge flow image is similar to a binary image, containing only black and white, which reduces the noise of script images in natural scenes. Edge flow information describes the characteristics of script characters more clearly and helps the convolutional neural network extract key features accurately. The results are shown in Table 2: after adding edge flow image enhancement, the script identification rate reaches 96.2%, an improvement of 2.6% over the ConvNeXt baseline.

4.3.3. Coordinate Attention Results

In the third and fourth experiments, we add the coordinate attention module after the ConvNeXt backbone. When ConvNeXt extracts script image features, the module pays more attention to the character stroke differences in the vertical direction of the script, which is important for classifying different script images and helps extract the key features that improve the identification rate. With coordinate attention alone (no edge flow), the script identification rate reaches 94.1%. With both edge flow enhancement and coordinate attention, it reaches 97.3%, 1.1% higher than without coordinate attention and 3.7% higher than the ConvNeXt baseline, verifying the effectiveness and superiority of our method.

4.3.4. Experimental Results under Different Parameters

To verify the effectiveness of the proposed method and determine the best parameter settings, we tested it under different parameter configurations on the CVSI-2015 dataset; the results are shown in Table 3. Different ConvNeXt variants are used as the backbone, loading the corresponding public pre-trained models convnext_small_1k_224_ema, convnext_base_22k_224, and convnext_large_22k_224, adding the edge flow and coordinate attention modules, and varying the batch size. The results show that, for a fixed batch size, the identification rate gradually improves as the network depth increases, since deeper networks extract richer feature information. For a fixed network depth, the identification rate generally increases with the batch size, although not strictly monotonically; for the improved method on the ConvNeXt_base backbone, a batch size of 64 performs better than 128. Overall, the improved method based on the ConvNeXt_base network with a batch size of 64 gives the best result, with an identification rate of 97.3%.
To further verify the effectiveness and generalization of the proposed method, we also evaluated it on the public datasets MLe2e and SIW-13 and on the extended dataset SIW-14. In the field of script identification, the CVSI-2015, SIW-13, and MLe2e public datasets are classic and authoritative benchmarks. The method achieves identification rates of 93.5% on MLe2e, 92.4% on SIW-13, and 92.0% on the extended SIW-14.

4.3.5. Error Analysis

The proposed method performs well on natural scene script identification datasets, but some script images are still misidentified. Figure 8 shows examples of identification errors on the CVSI-2015 dataset. For the first image (id: KAN_817), the prediction is Oriya, but the ground truth is Kannada; most of the image consists of shared characters, and the key discriminative information lies in a small initial part, so rich feature information is needed for correct identification. For the second image (id: TEL_962), the content is two Arabic numerals that were mislabeled as Telugu when the dataset was built, so the model's prediction of Gujarati is a reasonable result. For the third (id: GUJ_888) and sixth (id: BEN_792) images, the characters are blurred and too short, and the similarity between scripts causes Gujarati and Bengali to be misidentified as Telugu and Gujarati, respectively, making these samples very difficult to identify. For the fourth (id: TEL_781) and fifth (id: KAN_899) images, the predictions are Gujarati and Telugu, but the ground truths are Telugu and Kannada, respectively; these images contain characters shared by multiple scripts, and the script-specific segments are too short for their features to be matched reliably, leading to identification errors.
Table 4 shows how the identification results change step by step from the basic network model to the method proposed in this paper, using examples from the CVSI-2015 dataset, covering the whole process from misidentification by the original network to correct identification by the improved method. The first script image (id: KAN_1040) is Kannada, but its characters are blurred and poorly lit, leading to serious noise; the original ConvNeXt network cannot extract its key features well and misidentifies it as Telugu. After adding the edge flow image, it is correctly identified as Kannada, because the generated edge flow image resembles a binarized image containing only the black-and-white edge lines of the characters: the noise in the script image is reduced and the amount of data is doubled at the same time, so richer features of the script image are extracted, which benefits the learning and convergence of the deep network. The ground truth of the second script image is Bengali, but without the coordinate attention module it is misidentified as Gujrathi. The image is relatively blurred, and the key discriminative information of such scripts lies in the vertical direction; after adding coordinate attention, the network attends to features in both the horizontal and vertical directions, captures the script-specific information in the vertical direction, and correctly identifies the image as Bengali, effectively verifying the contribution of each proposed module.

4.4. Discussion and Comparison with State-of-the-Art Results

Table 5 compares the identification rates of the proposed method on the public datasets CVSI-2015, MLe2e, and SIW-13 and on the extended Uyghur script dataset SIW-14 with those of existing methods. Because published results on script identification in natural scenes are limited, the comparison is necessarily restricted. The table shows that the identification rate of the proposed method is relatively high. On CVSI-2015, it reaches 97.3%, better than most other methods and only slightly lower than the Score CNN + Attention CNN and 2-stage CNN methods, which may have designed dataset-specific loss functions. On the MLe2e and SIW-13 datasets, our method is also among the leading approaches. On SIW-14, the extension of SIW-13 with Uyghur scene script, our method reaches an identification rate of 92.0%.

4.5. Experimental Analysis

To evaluate the performance of the proposed method, we analyze the identification rate of each script on each dataset. The per-class prediction results are shown in Figure 9, and the confusion matrices of the predictions on each dataset are shown in Figure 10.
The identification rate of each script on the public CVSI-2015 dataset is shown in Figure 9a. Arabic has the highest identification rate, reaching 100%: it shares no characters with the other scripts, its characters differ markedly from other script types, and its samples have relatively simple backgrounds and little noise, all of which favor accurate identification. The lowest identification rate is Kannada at 88.9%, followed by English at 92.4%, both relatively low compared with the other scripts. Eight of the ten script categories exceed 97%, which illustrates the strength of the proposed method. The confusion matrix for CVSI-2015 is shown in Figure 10a. Kannada has the lowest identification rate: 35 Kannada images were misidentified, 31 of them as Telugu, because Kannada and Telugu characters are very similar and the two scripts share characters; some images contain only shared characters, which even humans cannot classify correctly. For English, 25 images were misidentified, 23 of them as Telugu, because the strokes of some blurred English samples resemble Telugu characters. No other script has more than five misidentified images, showing good identification performance. Our overall identification rate on this dataset reaches 97.3%.
The results on the public MLe2e dataset are shown in Figure 9b. Compared with the other datasets, MLe2e has fewer categories and fewer samples, which hinders the training and convergence of deep learning and limits how comprehensively the script features can be learned; thus, even with only four scripts, the results are not ideal. Kannada has the highest identification rate at 95.7%, followed by Chinese at 95.0%, while the lowest is 89.9%, showing that the performance is relatively stable at basically above 90%. This is reflected in the confusion matrix in Figure 10b. Seven Chinese images were incorrectly identified, most of them as Latin, because some Chinese images contain Latin digits and letters. Most misidentified Korean samples were predicted as Latin or Chinese, which is closely related to the similarity of digits and characters. The numbers of misidentified Chinese and Kannada samples are each no more than eight. Overall, the method identifies Chinese, Kannada, Korean, and Latin well and captures the key features of each script. We achieve an identification rate of 93.5% on the full MLe2e test set.
The identification rate of each script on the public SIW-13 dataset is shown in Figure 9c. Tibetan performs best, with an identification rate of 99%, while Greek is lowest at 76.2%, followed by English, Hebrew, and Russian, all below 90%. The other scripts exceed 90%, and six exceed 95%. Compared with the other datasets, SIW-13 has more complex backgrounds and more scripts with similar or shared characters, which makes it harder and lowers the identification rate of some scripts. The confusion matrix is shown in Figure 10c. Most misidentified Greek samples are predicted as Russian, because the two scripts share characters and are highly similar; some samples are too short and contain no script-specific characters, so even humans cannot identify them correctly. Among the English errors, 35 images were identified as Thai, again because of character similarity, and some Thai samples actually contain English text, suggesting labeling errors introduced when the dataset was built. Tibetan has the fewest misidentified samples, with 5 images, followed by Chinese and Mongolian with no more than 15 each. We achieve an overall identification rate of 92.4% on this dataset.
For the extended SIW-14 dataset, the per-script identification accuracy is shown in Figure 9d and the corresponding confusion matrix in Figure 10d. Thanks to the distinctiveness of Uyghur characters, the clear strokes, and the light background noise of the Uyghur scene script images, the Uyghur identification rate reaches 99.4%, indicating that the features of this script are learned almost completely. Among the other scripts, English remains low at 78.8%, and the identification rate of Russian drops compared with SIW-13; the added Uyghur script may interfere with the features extracted for Russian, reducing its identification rate. Most misidentified Russian samples are predicted as Greek, and conversely, 23 misidentified Greek images are predicted as Russian. The reason is that the two scripts share characters and are highly similar, and some samples cannot be identified correctly even by the human eye. The overall identification rate on the extended SIW-14 dataset is 92.0%.

5. Conclusions

To address the small size of scene script datasets, complex backgrounds, and character similarity across languages, this paper proposes a scene script identification method based on the convolutional neural network ConvNeXt, named EA-ConvNeXt. ConvNeXt is an improved network structure drawing on ResNet50 and the Swin Transformer, and using it as the backbone yields richer local feature information. First, a method for generating edge flow images from the original images is proposed to alleviate data scarcity and complex backgrounds; edge flow images use only black and white to describe character outlines, similar to binary images, and reduce the influence of background noise. Then, since the differences between scripts lie mainly in the vertical stroke features of the characters, a coordinate attention module is introduced to enhance the description of the spatial features of script images in the vertical direction. To address the limited variety of public datasets, Uyghur scene script images were collected and cropped, and the public dataset SIW-13 was extended with this new category to form SIW-14. The extracted key positional features directly improve identification, and the enlarged data volume further aids the training and convergence of the deep network. In the future, the script dataset should be further expanded where possible, and the identification method should be improved to handle more complex script images in real-world applications.

Author Contributions

Conceptualization, Z.Z.; methodology, Z.Z.; software, Z.Z.; investigation, Z.Z.; validation, Z.Z.; resources, H.M., E.E. and K.U.; writing—original draft preparation, Z.Z. and A.A.; writing—review and editing, Z.Z. and A.A.; visualization, Z.Z.; project administration, H.M., E.E. and K.U.; funding acquisition, A.A. and K.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Natural Science Foundation of China (Nos. 62266044, 61862061 and 61363064).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this paper include the public datasets CVSI-2015, MLe2e, and SIW-13, and the extended dataset SIW-14.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Banu, J.F.; Muneeshwari, P.; Raja, K.; Suresh, S.; Latchoumi, T.P.; Deepan, S. Ontology based image retrieval by utilizing model annotations and content. In Proceedings of the 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 27–28 January 2022; pp. 300–305. [Google Scholar]
  2. Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Baines, M.; Celebi, O.; Wenzek, G.; Chaudhary, V.; et al. Beyond english-centric multilingual machine translation. J. Mach. Learn. Res. 2021, 22, 4839–4886. [Google Scholar]
  3. Sumbul, G.; Charfuelan, M.; Demir, B.; Markl, V. Bigearthnet: A large-scale benchmark archive for remote sensing image understanding. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 5901–5904. [Google Scholar]
  4. Islam, N.; Islam, Z.; Noor, N. A survey on optical character recognition system. arXiv 2017, arXiv:1710.05703. [Google Scholar]
  5. Qiao, Z.; Zhou, Y.; Yang, D.; Zhou, Y.; Wang, W. Seed: Semantics enhanced encoder-decoder framework for scene text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13528–13537. [Google Scholar]
  6. Wang, T.; Zhu, Y.; Jin, L.; Luo, C.; Chen, X.; Wu, Y.; Wang, Q.; Cai, M. Decoupled attention network for text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12216–12224. [Google Scholar]
  7. Yu, D.; Li, X.; Zhang, C.; Liu, T.; Han, J.; Liu, J.; Ding, E. Towards accurate scene text recognition with semantic reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12113–12122. [Google Scholar]
  8. Han, X.K.; Aysa, A.; Mamat, H.; Yadikar, N.; Ubul, K. Script Identification of Central Asia Based on Fused Texture Features. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 3675–3680. [Google Scholar]
  9. Zeebaree, D.Q.; Haron, H.; Abdulazeez, A.M.; Zebari, D.A. Trainable model based on new uniform LBP feature to identify the risk of the breast cancer. In Proceedings of the 2019 International Conference on Advanced Science and Engineering (ICOASE), Zakho, Iraq, 2–4 April 2019; pp. 106–111. [Google Scholar]
  10. Korkmaz, S.A.; Akçiçek, A.; Bínol, H.; Korkmaz, M.F. Recognition of the stomach cancer images with probabilistic HOG feature vector histograms by using HOG features. In Proceedings of the 2017 IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY), Subotica, Serbia, 14–16 September 2017; pp. 339–342. [Google Scholar]
  11. Garg, M.; Dhiman, G. A novel content-based image retrieval approach for classification using GLCM features and texture fused LBP variants. Neural Comput. Appl. 2021, 33, 1311–1328. [Google Scholar] [CrossRef]
  12. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
  13. Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [Google Scholar] [CrossRef] [Green Version]
  14. Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. Coatnet: Marrying convolution and attention for all data sizes. Adv. Neural Inf. Process. Syst. 2021, 34, 3965–3977. [Google Scholar]
  15. Spitz, A.L. Determination of the script and language content of document images. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 235–245. [Google Scholar] [CrossRef] [Green Version]
  16. Padma, M.C.; Vijaya, P.A. Global approach for script identification using wavelet packet based features. Int. J. Signal Process. Image Process. Pattern Recognit. 2010, 3, 29–40. [Google Scholar]
  17. Ferrer, M.A.; Morales, A.; Pal, U. Lbp based line-wise script identification. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 369–373. [Google Scholar]
  18. Rani, R.; Dhir, R.; Lehal, G.S. Script identification of pre-segmented multi-font characters and digits. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1150–1154. [Google Scholar]
  19. Shivakumara, P.; Sharma, N.; Pal, U.; Blumenstein, M.; Tan, C.L. Gradient-angular-features for word-wise video script identification. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 3098–3103. [Google Scholar]
  20. Mukarambi, G.; Mallapa, S.; Dhandra, B.V. Script identification from camera based Tri-Lingual document. In Proceedings of the 2017 Third International Conference on Sensing, Signal Processing and Security (ICSSS), Chennai, India, 4–5 May 2017; pp. 214–217. [Google Scholar]
  21. Gomez, L.; Karatzas, D. A fine-grained approach to scene text script identification. In Proceedings of the 2016 12th IAPR Workshop on Document Analysis Systems (DAS), Santorini, Greece, 11–14 April 2016; pp. 192–197. [Google Scholar]
  22. Mei, J.; Dai, L.; Shi, B.; Bai, X. Scene text script identification with convolutional recurrent neural networks. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 4053–4058. [Google Scholar]
  23. Zdenek, J.; Nakayama, H. Bag of local convolutional triplets for script identification in scene text. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 369–375. [Google Scholar]
  24. Bhunia, A.K.; Konwer, A.; Bhunia, A.K.; Bhowmick, A.; Roy, P.P.; Pal, U. Script identification in natural scene image and video frames using an attention based convolutional-LSTM network. Pattern Recognit. 2019, 85, 172–184. [Google Scholar] [CrossRef] [Green Version]
  25. Cheng, C.; Huang, Q.; Bai, X.; Feng, B.; Liu, W. Patch aggregator for scene text script identification. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 1077–1083. [Google Scholar]
  26. Shi, B.; Yao, C.; Zhang, C.; Guo, X.; Huang, F.; Bai, X. Automatic script identification in the wild. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 531–535. [Google Scholar]
  27. Karim, F.; Majumdar, S.; Darabi, H.; Harford, S. Multivariate LSTM-FCNs for time series classification. Neural Netw. 2019, 116, 237–245. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  28. Fujii, Y.; Driesen, K.; Baccash, J.; Hurst, A.; Popat, A.C. Sequence-to-label script identification for multilingual ocr. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 161–168. [Google Scholar]
  29. Hao, S.; Wu, B.; Zhao, K.; Ye, Y.; Wang, W. Two-stream swin transformer with differentiable sobel operator for remote sensing image classification. Remote Sens. 2022, 14, 1507. [Google Scholar] [CrossRef]
  30. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  31. Zhou, T.; Zhao, Y.; Wu, J. Resnext and res2net structures for speaker verification. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 301–307. [Google Scholar]
  32. Wang, W.; Li, Y.; Zou, T.; Wang, X.; You, J.; Luo, Y. A novel image classification approach via dense-MobileNet models. Mob. Inf. Syst. 2020, 2020, 7602384. [Google Scholar] [CrossRef] [Green Version]
  33. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  34. Dastidar, S.G.; Dutta, K.; Das, N.; Kundu, M.; Nasipuri, M. Exploring knowledge distillation of a deep neural network for multi-script identification. In Proceedings of the International Conference on Computational Intelligence in Communications and Business Analytics, Santiniketan, India, 7–8 January 2021; Springer: Cham, Switzerland, 2021; pp. 150–162. [Google Scholar]
  35. Ma, M.; Wang, Q.F.; Huang, S.; Huang, S.; Goulermas, Y.; Huang, K. Residual attention-based multi-scale script identification in scene text images. Neurocomputing 2021, 421, 222–233. [Google Scholar] [CrossRef]
  36. Shi, B.; Bai, X.; Yao, C. Script identification in the wild via discriminative convolutional neural network. Pattern Recognit. 2016, 52, 448–458. [Google Scholar] [CrossRef]
  37. Nicolaou, A.; Bagdanov, A.D.; Liwicki, M.; Karatzas, D. Sparse radial sampling LBP for writer identification. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 716–720. [Google Scholar]
  38. Lu, L.; Wu, D.; Tang, Z.; Yi, Y.; Huang, F. Mining discriminative patches for script identification in natural scene images. J. Intell. Fuzzy Syst. 2021, 40, 551–563. [Google Scholar] [CrossRef]
  39. Mahajan, S.; Rani, R. Word level script identification using convolutional neural network enhancement for scenic images. Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 1–29. [Google Scholar] [CrossRef]
Figure 1. Sample dataset example.
Figure 2. Network structure.
Figure 3. Generate edge flow image module diagram.
Figure 4. Downsample structure diagram.
Figure 5. ConvNeXt Block structure diagram.
Figure 6. Analysis diagram of script example.
Figure 7. Coordinate attention structure diagram.
Figure 8. Samples of script identification errors in the CVSI-2015 dataset. Below each script image there are three lines of labels: image id, identification result, and ground truth.
Figure 9. Classification performance on the public datasets and the extended dataset. The abscissa and ordinate represent the identification rate and the script category, respectively; the longer the bar, the higher the identification rate. (a) CVSI-2015, (b) MLe2e, (c) SIW-13, (d) SIW-14.
Figure 10. Confusion matrices on the public datasets and the extended dataset. The x-axis and y-axis represent the ground-truth label and the predicted category, respectively; the darker the color, the higher the proportion of correct predictions. (a) CVSI-2015, (b) MLe2e, (c) SIW-13, (d) SIW-14.
Table 1. Image category and number in the dataset.

Dataset      Category   Train    Validate   Test
CVSI-2015    10         6412     1069       3207
MLe2e        4          1178     -          643
SIW-13       13         9791     -          6500
SIW-14       14         10,791   -          7000
Table 2. Ablation experiment results on CVSI-2015.

ConvNeXt   Edge   CA    Accu. (%)
✓          -      -     93.6
✓          ✓      -     96.2
✓          -      ✓     94.1
✓          ✓      ✓     97.3
Table 3. Identification results (%) of the improved method based on the ConvNeXt_small, ConvNeXt_base, and ConvNeXt_large networks under different batch sizes.

Method                        Batch 16   Batch 32   Batch 64   Batch 128
ConvNeXt_small + Edge + CA    96.7       96.9       96.9       97.0
ConvNeXt_base + Edge + CA     96.8       97.0       97.3       97.1
ConvNeXt_large + Edge + CA    97.0       97.1       97.2       97.2
Table 4. Two examples of script identification illustrating the step-by-step change from wrong to correct. The three methods correspond to the three configurations in the ablation experiments in Table 2.

Script Image   Id         Ground Truth   ConvNeXt   ConvNeXt + Edge   ConvNeXt + Edge + CA
(image)        KAN_1040   Kannada        Telugu     Kannada           Kannada
(image)        BEN_914    Bengali        Gujrathi   Gujrathi          Bengali
Table 5. Comparison of accuracy (%) on the four script datasets.

Method                              CVSI-2015   MLe2e   SIW-13   SIW-14
EA-ConvNeXt (Ours)                  97.3        93.5    92.4     92.0
DisCNN [36]                         94.0        -       88.0     -
CNN + LSTM [22]                     94.0        -       92.0     -
SRS + LBP + KNN [37]                94.0        82.7    -        -
Conv. feature + NBNN [21]           95.0        84.5    -        -
CNN + BoVW [23]                     97.0        -       92.0     -
CNN + LSTM, VGG16 [34]              88.3        -       -        -
Score CNN + Attention CNN [38]      97.9        95.8    95.7     -
2-stage CNN [39]                    97.4        -       -        -