1. Introduction
The object recognition problem, which has been extensively studied in past decades, has a wide range of real-world applications. The accuracy of recognition systems is always the first priority. However, the error cost largely depends on the problem’s specifics or on the particular application of the developed system. In many areas, such as identity verification [1,2,3], self-driving vehicles [4,5], and industrial diagnostics [6,7], incorrect recognition can cause financial loss or even harm to health. For this reason, predicting recognition reliability is vital for such systems. An uncertain result should lead to the rejection of the image processing output or the transfer of control back to the user to prevent unfortunate situations. Therefore, modern recognition systems include different types of reliability assessment modules. There are three main approaches to reliability evaluation: recognition confidence analysis, pixel-based image quality assessment, and geometric image quality assessment. These approaches work at different recognition stages and, of course, can be applied together.
The first approach involves the estimation of the recognition confidence provided by the recognition module. Such systems aggregate the confidences of all recognized objects, such as text lines, and decide to accept or reject the recognition result depending on the error cost [8]. However, these methods have several problems. First, recognition neural networks can be unstable with respect to changes in input images: an alteration of several pixels may change the recognition result [9,10] and, thus, the confidences. Additionally, neural networks tend to be overconfident, i.e., to return high confidences for incorrectly recognized images [11]. Moreover, this approach requires the entire recognition workflow to be performed for all detected objects, text fields, or other data structures, even if they are incorrectly segmented. This leads to unnecessary time spent on recognition with a high probability of an incorrect result and, therefore, to efficiency degradation, which can be a problem for mobile devices and embedded systems.
The other two approaches use image quality assessment because poor image quality is considered one of the most important sources of unstable recognition quality. Many works show a correlation between different image distortions and recognition accuracy [12,13,14]. These distortions can occur for many reasons, such as compression, transmission artifacts, or uncontrolled capturing conditions, with the possible presence of highlights, motion blur, defocus, or geometric distortions [15]. Quality assessment methods can be divided into two groups: pixel-based methods, which analyze the image’s pixel values, and geometric methods, which do not.
The main focus of recent research has been on pixel-based methods. These methods consider distortions such as blur, digital noise, compression, and transmission artifacts [16,17], given that these distortions are common in the majority of recognition system application fields. In addition, methods exist for evaluating a specular highlight saliency map by utilizing deep learning [18], an unnormalized form of Wiener entropy [19], and other approaches. The authors of [20] presented a detection method for holographic elements, which may significantly decrease text recognition accuracy.
Such methods can easily be incorporated into recognition systems. For example, in [21,22], the authors present a model of an optical recognition system with embedded image quality assessment and feedback modules. They rejected images of poor quality before recognition. This approach demonstrates an increase in recognition accuracy and reliability. Moreover, the authors showed that, in the case of recognition in a video stream, these modules provide new possibilities, such as selecting the best-quality frames for further recognition or rejecting the worst frames. Considering the problem of document recognition, the assessment of text field images allows for a re-evaluation of the confidence of the recognized text. The confidences obtained for one field in different frames can be further used as weights in the combination method for text field recognition in a video stream [23,24]. Pixel-based quality assessment methods need to analyze the whole input image, which can be time-consuming (especially if deep learning is used) and may be a problem for real-time recognition systems.
Geometric quality assessment methods analyze the geometric distortion of an object in an image. In the case of document recognition in images taken with a mobile device camera, the most common distortion is a projective transform of a plane. A user trying to avoid highlights may take a photo with a high projective distortion of the document. In this case, document text regions become poorly recognizable (Figure 1). Geometric quality assessment methods allow such regions to be rejected without analyzing their pixel intensities. This approach is fast and well suited for the document recognition problem. However, there is a lack of research on this subject.
In [25], the authors consider the recognition of rectangular documents. They obtained document quadrangles and checked three conditions: (1) at least one pair of opposite quadrangle edges is parallel, (2) the average difference in angles between each pair of opposite angles is relatively small, and (3) the average deviation from perpendicularity at the four vertices is less than 25${}^{\circ}$. In [26], the criterion includes the following conditions: (1) the ratio of the document quadrangle area to the area of the whole image must exceed a threshold, (2) the aspect ratio of the document quadrangle must fit some predefined interval, and (3) the angles of the document quadrangle must be close to 90${}^{\circ}$. Unfortunately, the authors do not report the thresholds and intervals used, so it is impossible to evaluate them experimentally. These empirical methods are reasonable. However, there is no theoretical proof or experimental evaluation of their connection to the level of projective distortion and recognition accuracy. For example, considering the relative area of a recognized object, images restored from source regions of the same area may have significantly different quality (Figure 2).
In this paper, we propose a novel no-reference method for the quality assessment of images restored from projectively distorted sources. The image quality is considered in terms of the probability of correct text recognition. The proposed method was tested experimentally on synthetic data created from the publicly available dataset MIDV-2019 [27].
2. Document Image Quality Assessment Problem Statement
We consider the problem of document recognition in images obtained with a mobile camera. We use the pinhole camera model (Figure 3), so the camera is assumed to have no optical aberrations. Given that the document is a flat rectangular object, the document image is affected by projective distortion [28], and the document boundary is a quadrangle.
Document recognition systems commonly consist of several submodules: document localization in a source image, segmentation of required zones such as text and photo fields, and field image restoration and recognition (Figure 4). Considering the field segmentation step, the majority of systems utilize document models. There are three general classes of models: templates, flexible forms, and end-to-end models. Templates define the strictest constraints on the location of each zone and are most commonly used for identity documents. In [25,29,30], document templates are used for the localization and classification of document images, but they are also helpful in field segmentation [31]. Flexible form models are based on text segmentation and recognition result analysis and describe documents with soft restrictions on their structure. Such a model may contain text feature points [32] or attributed relational graphs [33] as a structural representation of a document. End-to-end models involve the simultaneous segmentation and recognition of text field regions [34] and may not require any document structure.
We consider identity document recognition systems such as [2] that are based on the template description of documents. For many identity documents, the regions of text and photo fields are fixed, and the text fonts and font properties (size, boldness, etc.) for each of them are known. This information may be included in the document template description and used to assess field image quality.
We assume that the result of the field segmentation is provided as a quadrangle in the source image. According to its coordinates, the field image should be restored and recognized. The main goal of this paper is to assess the quality of the restored image in terms of the reliability of recognition before the restoration itself (see Figure 5). If the quality is insufficient, then the system ceases further processing to prevent false recognition results. Moreover, the possibility of early rejection decreases the runtime of the system, as the restoration and recognition are not performed on images of low quality. Based only on the source field quadrangle and a priori information about its size and font, the restored field image can be assessed by relying on known properties of the subsequent submodules. We briefly discuss these submodules below.
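The early-rejection logic described above can be sketched in a few lines. This is a hypothetical outline only: the callables `assess_quality`, `restore`, and `recognize` are illustrative placeholders, not interfaces defined in this work.

```python
# Hypothetical early-rejection wrapper for a field recognition pipeline.
# All callables are illustrative placeholders, not interfaces from this paper.

def process_field(src_image, quad, rect, l_threshold,
                  assess_quality, restore, recognize):
    """Run restoration and recognition only if the geometric check passes."""
    # Geometric quality check: uses only the quadrangle, the target rectangle,
    # and the threshold l -- no pixel data is touched at this point.
    if not assess_quality(quad, rect, l_threshold):
        return None  # early rejection: skip costly restoration and recognition
    restored = restore(src_image, quad, rect)
    return recognize(restored)
```

Because the check precedes restoration, rejected fields cost almost nothing, which is the source of the runtime benefit discussed above.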
The restoration submodule resamples the source image according to the projective transform, which maps the quadrangle of the field in the source image to a rectangle of a predefined size in the document model. The resampling process is usually characterized by interpolation and antialiasing methods. Given that, under a high level of projectivity, the field image may have an arbitrarily small area and consequently may not be recognizable, in this work, we focus only on the magnification problem, when the mapping magnifies a source region. In this particular case, antialiasing methods can be excluded, as after magnification, the restored image cannot contain high frequencies. The most well-known interpolation methods [35] are the nearest-neighbor, bilinear, bicubic, and cubic B-spline methods. For all of the mentioned interpolation methods, except the nearest-neighbor algorithm, a small source area causes blur in the restored area, as shown in Figure 2. The images obtained by nearest-neighbor interpolation have comparatively low quality (see Figure 6), so we exclude this method from consideration.
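A minimal one-dimensional sketch of this blurring effect, assuming linear interpolation and a synthetic step edge (all values are illustrative):

```python
import numpy as np

# A step edge in the compressed (projectively distorted) source signal.
src = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
u = np.arange(len(src))

# Restoration as x4 magnification with linear-interpolation resampling.
scale = 4
x = np.linspace(0, len(src) - 1, (len(src) - 1) * scale + 1)
rst = np.interp(x, u, src)

# Local contrast between neighboring samples: the sharp edge (contrast 1.0)
# is spread over `scale` samples, so the restored contrast drops to 1/scale.
contrast_src = np.max(np.abs(np.diff(src)))
contrast_rst = np.max(np.abs(np.diff(rst)))
```

The stronger the magnification, the lower the achievable local contrast in the restored signal, which is the mechanism behind the blur visible in Figure 2.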
The results presented in [21,36] show that the presence of blur in text field images decreases the quality of recognition. It should be noted that the recognition submodule may contain a preprocessing step that refines the image using a deblurring method, for example, [37]. However, its scope is limited, and there exists a level of blurring under which the text field cannot be reliably recognized.
Figure 5. The model of the quality assessment submodule.
Figure 6. Interpolation examples [38]. (a) Nearest pixel, (b) bilinear, (c) B-spline, and (d) bicubic.
Assuming that the image restoration and text recognition submodules are predefined, we can use the given font and size of the field to estimate the maximum local distortion level that provides stable recognition of any text. We assume that this level can be evaluated as a real value and denote the threshold distortion level as $\theta \in \mathbb{R}$. In this work, however, it is more convenient for us to use the inverse value $l\in \mathbb{R}$, $l={\displaystyle \frac{1}{\theta}}$, which we call the minimum scaling coefficient threshold. It should be noted that this threshold value is presumed to be evaluated once, while developing the recognition system.
Let us denote the source image as ${I}_{src}$; the segmented field quadrangle, i.e., the four points of its corners in the source image, as F; and the rectangle of the restored image borders, defined in the document model, as R. We need to estimate whether the quality of the restored field image ${I}_{rst}$ is sufficient in terms of reliability for further text recognition. For this purpose, let us denote the quality assessment function as Q. Q analyzes the source field quadrangle F, the restored field rectangle R, and the a priori minimum scaling coefficient threshold l and returns 1 if the image quality allows for reliable recognition and 0 otherwise:
where $\mathcal{F}$ is the set of all quadrangles lying inside the source image and $\mathcal{R}$ is the set of all possible rectangles. The function Q does not take the restored image itself as an argument. The evaluation process here is assumed to involve the analysis of a geometric transform rather than pixel intensities. Therefore, the quality assessment can be conducted before the use of the restoration submodule (Figure 3).
3. The Models of Distorted Field Image Acquisition and Restoration
First, let us briefly describe the model of projectively distorted text field image acquisition [35]. For simplification, we consider the one-dimensional case. Let us define the undistorted field image signal as a continuous bounded function $I\left(x\right)$:

where B is the upper bound of $I\left(x\right)$.

While being captured with a camera, the signal is distorted with a projective transform $u=H\left(x\right)$:

where ${I}_{src}^{c}\left(u\right)$ is the continuous projectively distorted signal. Then, the signal ${I}_{src}^{c}\left(u\right)$ is sampled by a function $s\left(u\right)$ with a known sampling pitch $\Delta {u}_{s}$ to obtain a discrete image ${I}_{src}\left(k\right),\phantom{\rule{3.33333pt}{0ex}}k\in \mathbb{Z}$:

where ${I}_{src}^{d}\left(u\right)$ is the sampled distorted signal defined on $\mathbb{R}$. We consider ideal sampling with the following:

where $\delta \left(u\right)$ is the Dirac delta function.
The image ${I}_{src}\left(k\right)$ is the input of the recognition system. Before the final text recognition, the image should be restored to compensate for the projective distortion. In the image restoration process, the image is resampled with the inverse of the original projective transform $x={H}^{-1}\left(u\right)$. This transformation can be evaluated based on the source field quadrangle F obtained in the field segmentation step and the rectangle of the restored field R defined by the template description: $R={H}^{-1}\left(F\right)$. The resampling model is as follows. The discrete image ${I}_{src}\left(k\right)$ is reconstructed to obtain a continuous signal ${I}_{src}^{c}\left(u\right)$ through convolution with a reconstruction filter $r\left(u\right)$:

After that, the domain of the continuous signal ${I}_{src}^{c}\left(u\right)$ is warped with the projective transform $x={H}^{-1}\left(u\right)$:

where ${I}_{rst}^{c}\left(x\right)$ is the restored continuous signal.
Depending on the mapping function ${H}^{-1}\left(x\right)$, ${I}_{rst}^{c}\left(x\right)$ may have arbitrarily high frequencies. To conform to the Nyquist rate, the signal should be band-limited by a prefilter function $h\left(x\right)$ that prevents aliasing:

where ${\widehat{I}}_{rst}^{c}\left(x\right)$ is the band-limited restored signal and ⊛ denotes convolution. Then, the obtained signal is sampled with the same sampling pitch $\Delta {x}_{s}=\Delta {u}_{s}$:

where ${I}_{rst}^{d}\left(x\right)$ is the sampled restored signal on $\mathbb{R}$ and ${I}_{rst}\left(j\right)$ is the discrete restored signal.
In this paper, we consider only the magnification case, when the source region is stretched:

In this scenario, the signal mapping cannot produce high frequencies. Therefore, the prefilter has little impact on the restored image signal and can be ignored. Then, the restored image ${I}_{rst}\left(j\right)$ is as follows:

Let us define the sample pitches as equal to 1: $\Delta {x}_{s}=\Delta {u}_{s}=1$. For simplicity, we refer to the discrete images ${I}_{src}\left(k\right)$ and ${I}_{rst}\left(j\right)$ as ${I}_{src}\left(u\right)$ and ${I}_{rst}\left(x\right)$ and specify $u,x\in \mathbb{Z}$. Then, Formula (11) can be rewritten as follows:
The ideal reconstruction filter $r\left(u\right),\phantom{\rule{3.33333pt}{0ex}}u\in \mathbb{R}$, is an ideal low-pass filter, $\mathrm{sinc}\left(x\right)=\mathrm{sin}\left(\pi x\right)/\left(\pi x\right)$, according to the cardinal theorem of interpolation [39]. However, in practice, one uses its approximations with a finite window radius R:

The bilinear reconstruction function has a finite window of radius $R=1$, and the bicubic B-spline and bicubic reconstruction functions have finite windows of radius $R=2$.
We also assume that the reconstruction function is Lipschitz continuous with a constant M:

Hypothesis 1. The bilinear, bicubic B-spline, and bicubic reconstruction functions (see Figure 7) are Lipschitz continuous.

Let us consider the bilinear reconstruction function first.
Lemma 1. The bilinear interpolation function ${r}_{l}\left(u\right)$ is Lipschitz continuous (14), where ${r}_{l}\left(u\right)$ is defined as follows:

Proof. Let us consider a pair of arbitrary points $x,y$. Due to the piecewise nature of ${r}_{l}\left(u\right)$, we have three cases.

Case 1: $\forall x,y\in (-\infty ,-1]\cup [1,\infty )$

Case 2: $\forall x,y\in (-1,1)$

By the reverse triangle inequality, we obtain:

Case 3: $\forall x\in (-1,1),\phantom{\rule{3.33333pt}{0ex}}\forall y\in (-\infty ,-1]\cup [1,\infty )$

Hence, the bilinear reconstruction function ${r}_{l}\left(u\right)$ is Lipschitz continuous with the constant $M=1$. □
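The claimed constant can also be checked numerically on the triangle (bilinear) kernel. The following sketch is purely illustrative and is not part of the proof; the kernel definition $1-|u|$ on $(-1,1)$ is the standard linear interpolation kernel assumed here.

```python
import numpy as np

def r_l(u):
    """Triangle (linear interpolation) kernel: 1 - |u| on (-1, 1), 0 elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) < 1.0, 1.0 - np.abs(u), 0.0)

# Empirical Lipschitz quotients |r(x) - r(y)| / |x - y| over random pairs:
# they should stay below (up to rounding) the claimed constant M = 1.
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 10000)
y = rng.uniform(-3.0, 3.0, 10000)
mask = x != y
q = np.abs(r_l(x[mask]) - r_l(y[mask])) / np.abs(x[mask] - y[mask])
M_emp = float(q.max())
```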
The bicubic B-spline and bicubic reconstruction functions are shown in Figure 7b,c. We can see that they are continuous and have a bounded value increment and are thus Lipschitz continuous. A direct proof of Hypothesis 1 falls outside of the scope of this work. □
4. The Minimum Scaling Coefficient Assessment at a Restored Image Point
In [21,36], the authors incorporated an estimation of image blur into algorithms for combining text recognition results in a video stream. Since an unblurred text image has high contrast in regions corresponding to strokes, they assume that the level of image blur is inversely related to the sharpness (called focus in the cited papers), which represents the directional minimum of the highest local contrasts of the image. In those papers, the blur is caused by defocusing or motion blur and is constant over the whole image. The sharpness is calculated based on the intensities in the source image. For this purpose, gradient images are calculated in different directions; for each of them, the 0.95 quantile of the gradient image is obtained, and their minimum represents the sharpness estimation.
In our case, the blurring distortion of the restored image is caused by the projective mapping and, hence, is uneven over different points of the image. Let us consider the original undistorted image $I\left(x\right)$ and denote its local contrast in a region between neighboring sampling points $[\overline{x},\overline{x}+\Delta {x}_{s}]$ as $L(\overline{x},\Delta {x}_{s}):0<L(\overline{x},\Delta {x}_{s})\le B/\Delta {x}_{s}$:

One can verify whether the restored image is able to provide the expected contrast in this region. Let us denote the local contrast of the restored image as ${L}_{rst}(\overline{x},\Delta {x}_{s})$. It can be calculated as follows:

where ${I}_{rst}\left(x\right)$ is the discrete restored signal; ${I}_{src}\left(k\right)$ is the discrete distorted signal; and ${K}_{1}=\{{t}_{1}\in \mathbb{Z}:|H\left(\overline{x}\right)-{t}_{1}|<R\}$ and ${K}_{2}=\{{t}_{2}\in \mathbb{Z}:|H(\overline{x}+\Delta {x}_{s})-{t}_{2}|<R\}$ are the sets of samples in the source image that are used for the reconstruction of the samples at $\overline{x}$ and $\overline{x}+\Delta {x}_{s}$, respectively.
According to (10), the distance between the points $H\left(\overline{x}\right)$ and $H(\overline{x}+\Delta {x}_{s})$ in the source image is less than its sampling pitch:

Then, in the worst case, the points $H\left(\overline{x}\right)$ and $H(\overline{x}+\Delta {x}_{s})$ have the same set of samples used for reconstruction, i.e., ${K}_{1}={K}_{2}$. In that case, the contrast (20) provided by the restored image can be estimated as follows:

where $|{K}_{1}|$ is the size of the set ${K}_{1}$.
If the restored local contrast is much lower than the contrast in the original undistorted image $L(\overline{x},\Delta {x}_{s})$, then the restored image edges are highly blurred or even undetectable. As we can see, the upper bound of the restored image local contrast ${L}_{rst}(\overline{x},\Delta {x}_{s})$ depends on the distance between the corresponding points in the source image, $|H\left(\overline{x}\right)-H(\overline{x}+\Delta {x}_{s})|$. Thus, the smaller the distance, the higher the level of blur distortion in the considered region. Then, the ratio of the distance between the source points to the sampling pitch can be used to estimate the maximum achievable sharpness of the restored region. Let us denote this function as the scaling coefficient $s(\overline{x},\Delta {x}_{s})$:
Above, we considered the one-dimensional case; however, the image is a two-dimensional function. The projective transform of the plane $(u,v)=H(x,y)$ is determined as follows:

where ${h}_{q,w},\phantom{\rule{3.33333pt}{0ex}}q,w\in \{0,1,2\}$ are the coefficients of the projective transform H.
The projective transform maps points unevenly. For a fixed point $(x,y)$ and several shifts $(\Delta {x}_{m},\Delta {y}_{m})$ of one length, $\parallel (\Delta {x}_{m},\Delta {y}_{m})\parallel =const\phantom{\rule{3.33333pt}{0ex}}\forall m$, the distance $\parallel H(x,y)-H(x+\Delta {x}_{m},y+\Delta {y}_{m})\parallel $ can significantly vary. Since the directions of the text strokes causing high local contrast in the image are arbitrary, the sharpness should be estimated for all possible shifts. Image sampling is conducted with a grid, so the sampling pitch in different directions also varies. However, the function $s(\overline{x},\Delta {x}_{s})$ is a length ratio, so a useful simplification is to consider $\Delta {x}_{s}$ equal for all directions. It should be noted that, here, we implicitly change the domain of the function $s(\overline{x},\Delta {x}_{s})$ from ${\mathbb{Z}}^{2}$ to ${\mathbb{R}}^{2}$. This can be done because the image function is no longer used, and the projective transform H is defined on the set of real numbers.
Then, the scaling coefficient function $s(\overline{p},\Delta {x}_{s})$ defined in (23) should be rewritten for the two-dimensional case as follows:

Let us denote this function as the minimum scaling coefficient. The region under consideration is a circle with its center at the point $\overline{p}=(\overline{x},\overline{y})$ and a radius equal to $\Delta {x}_{s}$:
The projective transform $(u,v)=H(x,y)$ maps the points of the infinity line ${l}_{\infty}:{h}_{2,0}x+{h}_{2,1}y+{h}_{2,2}=0$ to infinity. If this line crosses or touches the circle, then some points of its inner region are mapped to infinity, which is not possible in image restoration. Consequently, we can assume that the circle is not crossed by the ${l}_{\infty}$ line and is mapped onto an ellipse. Then, the length ${a}_{min}$ of the ellipse semi-minor axis is the minimum distance between pairs of projected points:
Since $\Delta p$ is assumed to be small, one can locally approximate the projective transform H with an affine transform. In this approach, it can be shown that, for a unit circle, the lengths of the ellipse semi-axes are equal to the square roots of the eigenvalues ${\lambda}_{min}$ and ${\lambda}_{max}$ of the matrix ${\overline{J}}^{T}\overline{J}$, where $\overline{J}$ is the Jacobian matrix of the transform H at the point $\overline{p}$ [40]. Then, for the circle with the radius $\Delta {x}_{s}$, the lengths of the semi-minor and semi-major axes for the restored point $\overline{p}$, ${a}_{min}$ and ${a}_{max}$, respectively, are calculated as follows:
It should be noted that the points on the infinity line ${l}_{\infty}$ are mapped to infinity under the transformation, so the eigenvalues are not defined on this line. Then, the domain of the length functions is ${\mathbb{R}}^{2}\setminus {l}_{\infty}$.
It is a well-known fact that the eigenvalues are the roots of the characteristic equation. Then, the lengths of the semi-minor and semi-major axes can be calculated as follows:

One can derive the values of the trace and the determinant of the matrix ${\overline{J}}^{T}\overline{J}$ expressed in terms of the coefficients of the homography H:
In this work, we use only the values of the semi-minor axis length. However, the other lengths may be helpful in the problem of image decimation estimation. To illustrate the behavior of the semi-minor and semi-major axis length functions, we constructed heatmaps for a synthetic example. An arbitrary source quadrangle F (Figure 8a) and a restored rectangle R (Figure 8b,c) were used to estimate the semi-minor (Figure 8b) and semi-major (Figure 8c) axis lengths at grid points on the restored plane. As we can see, the values increase as we approach the infinity line ${l}_{\infty}$, shown as a blue line in the figure. The region inside the rectangle R with semi-minor axis lengths less than the threshold appears to be connected.
Then, according to (27)–(29), the minimum scaling coefficient (25) does not depend on the sampling pitch $\Delta {x}_{s}$ and can be redefined as $s\left(\overline{p}\right)$:
This function can be used to estimate the local sharpness at each point of the restored image and is directly related to the local image quality. It should be noted that, if the transformation H is affine, i.e., ${h}_{2,0}^{2}+{h}_{2,1}^{2}=0$, then the Jacobian matrix and the minimum scaling coefficient are constant over the whole plane. Thus, only one value, at an arbitrary point, needs to be calculated.
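The construction of this section can be sketched numerically: compute the Jacobian of a homography at a point and obtain the semi-axis lengths from the eigenvalues of ${\overline{J}}^{T}\overline{J}$. This is an illustrative implementation; the helper name `semi_axes` is ours, and an explicit eigendecomposition is used instead of the closed-form trace/determinant expressions.

```python
import numpy as np

def semi_axes(Hm, p, dxs=1.0):
    """Semi-minor/semi-major axis lengths of the ellipse into which the
    homography Hm (3x3 matrix) maps a circle of radius dxs centered at p."""
    x, y = p
    P, Q, w = Hm @ np.array([x, y, 1.0])
    # Analytic Jacobian of (P/w, Q/w) with respect to (x, y).
    J = np.array([
        [(Hm[0, 0] * w - P * Hm[2, 0]) / w**2,
         (Hm[0, 1] * w - P * Hm[2, 1]) / w**2],
        [(Hm[1, 0] * w - Q * Hm[2, 0]) / w**2,
         (Hm[1, 1] * w - Q * Hm[2, 1]) / w**2],
    ])
    lam = np.linalg.eigvalsh(J.T @ J)  # eigenvalues in ascending order
    return dxs * float(np.sqrt(lam[0])), dxs * float(np.sqrt(lam[1]))
```

For an affine map, the result is the same at every point; for a proper projective map, it varies from point to point, matching the heatmap behavior described above.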
5. The Proposed Method of Projectively Distorted Image Quality Assessment
Next, we define the quality assessment method Q, which provides a binary estimate of the whole restored image in terms of recognition reliability. Considering that the incorrect recognition of any character renders the whole recognized field text incorrect, the image quality can be estimated according to the region with the lowest quality.
For this purpose, we can estimate the maximum level of local distortion $\theta $ that enables stable recognition of the restored image. The threshold depends on the recognition subsystem and on the chosen interpolation algorithm. Since the function $s\left(\overline{p}\right)$ is inversely proportional to the local distortion level, for simplification, we use the minimum scaling coefficient threshold l, which is the inverse of the level of distortion $\theta $:
Then, we can construct the level curve of the minimum scaling coefficient function as follows:
If the level curve intersects the restored rectangle R, then one of the two corresponding parts of the restored field image is not recognized reliably. Otherwise, if there is no intersection, we can calculate the value at one arbitrary point inside the rectangle to check whether the whole restored image has sufficient quality.
According to (31) and (30), the level curve (33) can be written as follows:

This equation holds for both the minimum scaling coefficient function $s\left(p\right)$ and the maximum scaling coefficient function ${s}_{max}\left(p\right)$, which is defined as the ratio of the semi-major axis length to the sampling pitch $\Delta {x}_{s}$:

If the ${s}_{max}\left(p\right)$ branch intersects the rectangle, then both of its parts have low quality.
For simplification, Equation (34) is translated to a new coordinate system by a transformation T:

Under this transform, the infinity line ${l}_{\infty}$ is mapped to the line $Y=0$. After the substitution of (36) into the level curve Equation (34), we obtain the following:

where $\gamma ={h}_{2,0}{c}_{1}+{h}_{2,1}{c}_{2},\phantom{\rule{3.33333pt}{0ex}}\delta ={h}_{2,0}{c}_{3}+{h}_{2,1}{c}_{4}$ and $\alpha ,\beta ,{c}_{1},{c}_{2},{c}_{3},{c}_{4}$ are defined in (30).
As we can see, Equation (37) is quadratic in terms of X and, hence, symmetric. Then, we can approximate it by a piecewise linear curve. For this purpose, the minimum and maximum Y values of the rectangle R are calculated. After that, we choose several values ${Y}_{i},\phantom{\rule{3.33333pt}{0ex}}i\in \{0,\dots ,n-1\},\phantom{\rule{3.33333pt}{0ex}}{Y}_{i}\ne 0$, between them and, for each ${Y}_{i}$, calculate the two corresponding X coordinates of the curve according to the following equality:

We should also take into account that, for the semi-minor and semi-major axis lengths, both branches of the curve may intersect the rectangle simultaneously. In order to construct the curve approximation correctly, we need to separate the points related to different branches. Then, moving along the Y-axis, for each ${Y}_{i}$ value, we compare the corresponding discriminant ${D}_{i}$ with zero. If it is positive, then the obtained points lie on one branch. If the discriminant for a ${Y}_{i}$ value is equal to zero, then there is a turning point in the current branch, and the following values ${Y}_{i+k},\phantom{\rule{3.33333pt}{0ex}}k\in \{1,\dots ,n-i-1\}$, relate to another branch of the curve. Similarly, a negative discriminant ${D}_{i}$ implies a gap between the branches, and the points calculated for further values ${Y}_{i+k}$ lie on another branch.
As soon as the curve is obtained, we should decide whether the considered field quality is sufficient. There are several possible approaches. For example, we could calculate the ratio of the sufficient- and insufficient-quality areas inside the restored rectangle. However, in this work, we mark the quality of the whole image as insufficient if there is a low-quality region of any area. The whole procedure for evaluating the restored image quality has $O\left(1\right)$ complexity because it does not depend on the input image size but only on the number of points in the curve approximation, which we assume to be predefined. The procedure is summarized in Algorithm 1.
Algorithm 1 Quality assessment of a projectively distorted field quadrangle.
Input: F — the field quadrangle in the source image; R — the rectangle of the restored field; l — the minimum scaling coefficient threshold; n — the number of vertices in the curve approximation.
Output: True $\equiv 1$ if the restored field is predicted to be recognizable; False $\equiv 0$ otherwise.
1: procedure Q($F,R,l,n$)
2:   calculate the coefficients of a projective transform $H$: $H\left(R\right)=F$
3:   $center\leftarrow$ center point of $R$
4:   if ${h}_{2,0}^{2}+{h}_{2,1}^{2}=0$ then ▹ affine transformation
5:     ${s}_{c}\leftarrow s\left(center\right)$ according to (31)
6:     return ${s}_{c}\ge l$
7:   ${R}^{\prime}\leftarrow T\left(R\right)$ according to (36) ▹ calculate the new coordinates of the rectangle
8:   ${Y}_{min}\leftarrow \min{\left\{{R}_{iY}^{\prime}\right\}}_{i=1..4}$
9:   ${Y}_{max}\leftarrow \max{\left\{{R}_{iY}^{\prime}\right\}}_{i=1..4}$
10:  calculate $\alpha,\beta,\gamma,\delta$ according to (30)
11:  ${X}_{sym}\leftarrow {\displaystyle \frac{\alpha\gamma+\beta\delta}{{\alpha}^{2}+{\beta}^{2}}}$
12:  $no\_roots\_prev\leftarrow True$
13:  $one\_root\_prev\leftarrow False$
14:  $curve\leftarrow \left\{\right\}$
15:  for $i=0,\ldots,n-1$ do
16:    ${Y}_{i}\leftarrow {Y}_{min}+i\,{\displaystyle \frac{{Y}_{max}-{Y}_{min}}{n-1}}$
17:    calculate ${D}_{i}$ according to (38)
18:    if ${D}_{i}>0$ then
19:      ${X}_{i,1},{X}_{i,2}\leftarrow {X}_{sym}\pm\sqrt{{D}_{i}}$
20:      if NOT $no\_roots\_prev$ then
21:        Insert($curve$, Segment$\{({X}_{i,1},{Y}_{i}),\ ({X}_{i-1,1},{Y}_{i-1})\}$)
22:        Insert($curve$, Segment$\{({X}_{i,2},{Y}_{i}),\ ({X}_{i-1,2},{Y}_{i-1})\}$)
23:      if $no\_roots\_prev$ AND $i\ne 0$ then
24:        Insert($curve$, Segment$\{({X}_{i,1},{Y}_{i}),\ ({X}_{i,2},{Y}_{i})\}$)
25:      $one\_root\_prev\leftarrow False$
26:      $no\_roots\_prev\leftarrow False$
27:    else if ${D}_{i}=0$ then
28:      ${X}_{i,1},{X}_{i,2}\leftarrow {X}_{sym}$
29:      if NOT $one\_root\_prev$ AND NOT $no\_roots\_prev$ then
30:        Insert($curve$, Segment$\{({X}_{i,1},{Y}_{i}),\ ({X}_{i-1,1},{Y}_{i-1})\}$)
31:        Insert($curve$, Segment$\{({X}_{i,1},{Y}_{i}),\ ({X}_{i-1,2},{Y}_{i-1})\}$)
32:      $one\_root\_prev\leftarrow True$
33:      $no\_roots\_prev\leftarrow False$
34:    else
35:      if NOT $one\_root\_prev$ AND NOT $no\_roots\_prev$ then
36:        Insert($curve$, Segment$\{({X}_{i-1,1},{Y}_{i-1}),\ ({X}_{i-1,2},{Y}_{i-1})\}$)
37:      $one\_root\_prev\leftarrow False$
38:      $no\_roots\_prev\leftarrow True$
39:  for each $segment$ in $curve$ do
40:    if $segment$ intersects ${R}^{\prime}$ then
41:      return False
42:  ${s}_{c}\leftarrow s\left(center\right)$ according to (31)
43:  return ${s}_{c}\ge l$
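Steps 39–41 of Algorithm 1 require testing whether a curve segment intersects the rectangle ${R}^{\prime}$. The paper does not specify the routine; as an illustrative sketch (assuming, for simplicity, an axis-aligned rectangle), a segment intersects the rectangle if one of its endpoints lies inside it or if it crosses one of the four edges:

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def _segments_cross(a, b, c, d):
    """True if segment ab properly crosses segment cd."""
    return (_orient(a, b, c) != _orient(a, b, d)
            and _orient(c, d, a) != _orient(c, d, b))

def segment_intersects_rect(a, b, rect):
    """True if segment ab touches the axis-aligned rectangle
    rect = (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = rect
    inside = lambda p: x0 <= p[0] <= x1 and y0 <= p[1] <= y1
    # An endpoint inside the rectangle is already an intersection.
    if inside(a) or inside(b):
        return True
    # Otherwise, the segment must cross one of the four edges.
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    edges = list(zip(corners, corners[1:] + corners[:1]))
    return any(_segments_cross(a, b, c, d) for c, d in edges)
```

Since the curve approximation has a predefined number of segments, this check preserves the $O\left(1\right)$ complexity of the whole procedure.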

6. Experimental Results
In this section, the experimental results obtained using the proposed algorithm for the quality assessment of projectively distorted field images are presented and compared with the performance of the algorithm described in [
25]. In the recognition system workflow, in order to obtain a quadrangle of the field to be restored and recognized, document localization and field segmentation must first be performed. When evaluating quality assessment methods, we had to eliminate the errors that occur in these stages. For this purpose, datasets that provide ground truth for field quadrangles are commonly used. To the best of our knowledge, the only publicly available dataset with at least mild projective distortions is MIDV-2019 [
27]. However, a preliminary experiment showed that it does not contain images with strong enough projective distortion to produce restored images of insufficient quality. For this reason, we created a dataset with synthetically distorted images of text fields.
6.1. Data Generation
In order to generate the data, we used the MIDV-2019 dataset. This dataset contains 50 different types of annotated identity documents (ID cards, passports, driving licenses, etc.). It consists of 50 template images (original high-quality document images used for creating physical document copies, one per document type) and video clips of these documents acquired under different conditions. An example of a template document is shown in
Figure 9.
All of the images were annotated manually. The video frames have a ground truth for their type and document quadrangle. The template images have a ground truth description consisting of field rectangles and their text content.
We considered template images only and scaled them to 300 dpi to obtain comparable pixel sizes for all documents. We used ground truth field rectangles to extract undistorted images of text fields with an additional 10% margin of their size. We only considered numeric fields and fields written in the Latin alphabet: dates, document numbers, machine-readable zone (MRZ) lines, and the document holder's name and surname. We recognized the text in the obtained field images with the Tesseract Open Source OCR Engine 4.1.1, which employs an LSTM neural network [
41]. Incorrectly recognized fields were eliminated from further processing. In our experiments, we used 184 fields collected from all document templates. Since the text in the fields may have different fonts, font sizes, and other properties, we considered them separately in our experiments. Here, we describe synthetic data generation for one field.
We denote an original image of a field f as ${D}_{f}$ and a rectangle bounding the field as ${R}_{f}$.
To test our algorithm, we generated a set of N projectively distorted field images ${\left\{{I}_{src,f}^{i}\right\}}_{i=1..N}$ with bounding quadrangles ${\left\{{F}_{f}^{i}\right\}}_{i=1..N}$ and corresponding projective transforms ${\left\{{H}_{f}^{i}\right\}}_{i=1..N}$: ${F}_{f}^{i}={H}_{f}^{i}\left({R}_{f}\right)$. In order to generate a distorted quadrangle ${F}_{f}^{i}$, we added random shifts to the corners of ${R}_{f}$. Then, the quadrangle ${F}_{f}^{i}$ was downscaled to approximately the same size as the original field image to make the dataset more representative. We also ensured that the obtained distorted quadrangle ${F}_{f}^{i}$ and the corresponding quadrangle of the whole distorted document were convex. Then, the homography ${H}_{f}^{i}$ was calculated, and the original field image was transformed to obtain the distorted field image: ${I}_{src,f}^{i}={H}_{f}^{i}\left({D}_{f}\right)$. Algorithm 2 shows the procedure of distorted image generation.
Then, the restoration process was conducted. The distorted images ${\left\{{I}_{src,f}^{i}\right\}}_{i=1..N}$ were rectified with projective transforms that map their bounding quadrangles ${F}_{f}^{i}$ to the rectangles ${R}_{f}$: ${R}_{f}={\left({H}_{f}^{i}\right)}^{-1}\left({F}_{f}^{i}\right)$. Thus, we obtained a set of restored images ${\left\{{I}_{rst,f}^{i}\right\}}_{i=1..N}$: ${I}_{rst,f}^{i}={\left({H}_{f}^{i}\right)}^{-1}\left({I}_{src,f}^{i}\right)$. The projective mapping of the images was performed using bilinear interpolation.
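The homographies used here can be computed from the four corner correspondences between a rectangle and its distorted quadrangle. A minimal sketch (not the authors' implementation) using the direct linear transform (DLT) with NumPy; the corner coordinates below are made up for illustration:

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate the 3x3 projective transform H with H(src[k]) = dst[k]
    from four point correspondences via the direct linear transform (DLT)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography (up to scale) is the right singular vector of A
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(np.array(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Apply H to an iterable of (x, y) points with homogeneous normalization."""
    out = []
    for x, y in pts:
        w = H[2, 0] * x + H[2, 1] * y + H[2, 2]
        out.append(((H[0, 0] * x + H[0, 1] * y + H[0, 2]) / w,
                    (H[1, 0] * x + H[1, 1] * y + H[1, 2]) / w))
    return out

# Rectangle R of the restored field and a distorted quadrangle F:
R = [(0, 0), (200, 0), (200, 50), (0, 50)]
F = [(10, 5), (190, 20), (185, 70), (5, 60)]
H = homography_from_points(R, F)  # H(R) = F
H_inv = np.linalg.inv(H)          # maps F (and the distorted image) back onto R
```

In practice, the rectified image itself would be produced by a resampling routine such as OpenCV's `cv2.warpPerspective` with bilinear interpolation, as described above.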
Finally, we generated the ground truth for our binary quality assessment problem. We consider it a binary classification problem, with a positive case when the field image is recognizable and a negative case otherwise. Thus, we used Tesseract to recognize the restored field images ${I}_{rst,f}^{i}$ and compared the results with the annotation from MIDV-500. If the recognition was correct, then the restored image was marked as recognizable.
6.2. Performance Metrics
To evaluate the performance of the quality assessment algorithms, we calculated the positive and negative predictive values, PPV and NPV, respectively, as follows:
$PPV={\displaystyle \frac{TP}{TP+FP}},\qquad NPV={\displaystyle \frac{TN}{TN+FN}},$
where
$TP$ is the number of true-positive samples (restored field images were correctly recognized by Tesseract and marked as recognizable by the quality assessment algorithm under evaluation),
$TN$ is the number of true-negative samples (fields were not recognized by Tesseract and marked as non-recognizable by the algorithm),
$FP$ is the number of false-positive samples (fields were not correctly recognized by Tesseract but marked as recognizable by the algorithm), and
$FN$ is the number of false-negative samples (fields were correctly recognized by Tesseract but marked as non-recognizable by the algorithm).
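For reference, the predictive values follow directly from these confusion counts; a small sketch of the computation:

```python
def predictive_values(tp, fp, tn, fn):
    """Positive and negative predictive values from confusion counts.

    PPV = TP / (TP + FP): the fraction of accepted fields that were
    actually recognized correctly.
    NPV = TN / (TN + FN): the fraction of rejected fields that were
    actually recognized incorrectly.
    """
    ppv = tp / (tp + fp) if tp + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    return ppv, npv
```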
We also had to ensure the balance of data used to evaluate the algorithm. The decision made by the proposed quality assessment algorithm
Q depends on the minimum scaling coefficient threshold
l. Hence, the probability of randomly generating a sample predicted to be positive or negative varies when
l changes. To overcome this issue, for each
l, we took 1000 restored field images marked as positive and 1000 restored field images marked as negative by the algorithm.
Algorithm 2 Generation of projectively distorted images of a field.
Input: D — an undistorted field image; $R(A,B,C,D)$ — the bounding rectangle of the undistorted field, where A, B, C, and D are its corners listed clockwise from the top left to the bottom left; $T({A}_{t},{B}_{t},{C}_{t},{D}_{t})$ — the bounding rectangle of the whole undistorted document, where ${A}_{t}$, ${B}_{t}$, ${C}_{t}$, and ${D}_{t}$ are its corners listed clockwise from the top left to the bottom left; N — the number of samples to generate.
Output: ${\left\{{I}_{src}^{i}\right\}}_{i=1..N}$ — a set of distorted field images; ${\left\{{F}^{i}({A}_{i}^{\prime},{B}_{i}^{\prime},{C}_{i}^{\prime},{D}_{i}^{\prime})\right\}}_{i=1..N}$ — a set of bounding quadrangles of the distorted fields; ${\left\{{H}^{i}\right\}}_{i=1..N}$ — a set of corresponding projective transforms.
1: procedure G($D,R,T,N$)
2:   calculate the width of R: $w\leftarrow B.x-D.x$
3:   calculate the height of R: $h\leftarrow A.y-D.y$
4:   set up a uniform real random number generator: $rand=uniform(0,\ 0.5\min(w,h))$
5:   $n\leftarrow 0$ ▹ the number of samples generated so far
6:   while $n<N$ do
7:     ${A}^{\prime}\leftarrow A+(rand(),rand())$
8:     ${B}^{\prime}\leftarrow B+(rand(),rand())$
9:     ${C}^{\prime}\leftarrow C+(rand(),rand())$
10:    ${D}^{\prime}\leftarrow D+(rand(),rand())$
11:    generated quadrangle: ${F}^{\prime}=({A}^{\prime},{B}^{\prime},{C}^{\prime},{D}^{\prime})$
12:    ${w}^{\prime}=\max({A}^{\prime}.x,{B}^{\prime}.x,{C}^{\prime}.x,{D}^{\prime}.x)-\min({A}^{\prime}.x,{B}^{\prime}.x,{C}^{\prime}.x,{D}^{\prime}.x)$
13:    ${h}^{\prime}=\max({A}^{\prime}.y,{B}^{\prime}.y,{C}^{\prime}.y,{D}^{\prime}.y)-\min({A}^{\prime}.y,{B}^{\prime}.y,{C}^{\prime}.y,{D}^{\prime}.y)$
14:    calculate the scale factor $s=\min\left(1.5{\displaystyle \frac{w}{{w}^{\prime}}},\ 1.5{\displaystyle \frac{h}{{h}^{\prime}}}\right)$
15:    if $s<1$ then
16:      ${F}^{\prime}\leftarrow s{F}^{\prime}$
17:    if ${F}^{\prime}$ is not convex then
18:      continue
19:    calculate the projective transform ${H}^{\prime}$: ${F}^{\prime}={H}^{\prime}\left(R\right)$
20:    if the quadrangle ${H}^{\prime}\left(T\right)$ is not convex then
21:      continue
22:    ${F}^{i}\leftarrow {F}^{\prime}$
23:    ${H}^{i}\leftarrow {H}^{\prime}$
24:    ${I}_{src}^{i}\leftarrow {H}^{i}\left(D\right)$
25:    $n\leftarrow n+1$
26:  return ${\left\{{I}_{src}^{i}\right\}}_{i=1..N}$, ${\left\{{F}^{i}\right\}}_{i=1..N}$, ${\left\{{H}^{i}\right\}}_{i=1..N}$
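Steps 17 and 20 of Algorithm 2 discard non-convex quadrangles. The paper does not specify the convexity test; a common implementation (sketched here as an assumption) checks that the cross products of all pairs of consecutive edges share the same sign:

```python
def is_convex(quad):
    """True if the quadrilateral (a list of 4 (x, y) vertices given in
    traversal order) is convex, i.e., all consecutive-edge cross
    products share the same sign."""
    signs = set()
    n = len(quad)
    for i in range(n):
        ax, ay = quad[i]
        bx, by = quad[(i + 1) % n]
        cx, cy = quad[(i + 2) % n]
        # Cross product of edge (a -> b) with edge (b -> c).
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross != 0:
            signs.add(cross > 0)
    return len(signs) == 1
```

The same test works for the field quadrangle $F'$ and for the image of the whole document quadrangle under $H'$.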

6.3. Behavior of the Proposed Method for Fields of Same and Different Fonts
In the framework of the first experiment, we estimated the variations in the PPV and NPV for the proposed algorithm
Q, depending on the minimum scaling coefficient threshold
l. We calculated the PPV and NPV functions separately for each field
f. We varied the threshold
l values from 0.075 to 0.9 with a step of 0.025. For each threshold
l, we generated 1000 positively and 1000 negatively marked images and calculated the predictive values. The parameter
n of the algorithm
Q that defines the vertex number of the level curve approximation was set to 100.
Figure 10 shows an example of the estimated PPV and NPV curves that were calculated for several text fields of the new Austrian driving license document, which is shown in
Figure 9.
While developing the assessment method, we assumed that the threshold is equal for all characters of one font. Thus, the predictive value functions should be close for different fields of one font and may vary if the font or font properties (size, boldness, etc.) are changed. As we can see, the curves for the date fields with the same font (
Figure 10a–c) show almost equal predictive values, as was expected. This means that we can estimate the valid threshold for all possible text fields of one font in advance. At the same time, PPV and NPV differ for a document number field that has a bold font (
Figure 10d). Comparing them, we can infer that bold text can be more projectively distorted while still being reliably recognized. Thus, the minimum scaling coefficient threshold should be chosen separately for each font and font property.
For all fields, the specific behavior of the curves is similar. The greater the
l, the sharper the restored image must be to be marked as "recognizable". Indeed, in
Section 2, we define the minimum scaling coefficient threshold to be inverse to the level of distortion
$\theta $. As the threshold
l increases, rejection occurs at a lower level of distortion. The threshold value can be chosen according to the cost of false-positive and false-negative errors. In the case of equal cost, the PPV and NPV are higher than 80% for all four considered fields.
It should be noted that the obtained predictive value curves are non-monotonic. This occurs because OCR performance is not strictly monotonic in the projective distortion level. However, the tendency toward reduced recognition accuracy is evident.
6.4. Recognition System Simulation
In the second experiment, we estimated the recognition system’s performance with the incorporated reject submodule. We compared the results obtained for the proposed algorithm with the rejection criterion presented in [
25], which assesses the whole distorted document quadrangle. In addition, we estimated the same algorithm applied to each field quadrangle separately.
The geometric criterion presented in [
25] is based on the analysis of the quadrangle angles. The document quadrangle is rejected if it does not satisfy the following conditions:
 1.
At least one pair of the opposed edges is parallel with a tolerance of
${5}^{\circ}$:
$\min\left(\left|\measuredangle \left[\overrightarrow{AB}\right]-\measuredangle \left[\overrightarrow{CD}\right]\right|,\ \left|\measuredangle \left[\overrightarrow{AD}\right]-\measuredangle \left[\overrightarrow{BC}\right]\right|\right)\le {5}^{\circ},$
where
$A,B,C$, and
D are the corners of the document quadrangle and
$\measuredangle \left[\overrightarrow{AB}\right],\measuredangle \left[\overrightarrow{CD}\right]$,
$\measuredangle \left[\overrightarrow{AD}\right]$, and
$\measuredangle \left[\overrightarrow{BC}\right]$ denote the edges’ angles with the horizontal axis defined in the range
$[-{90}^{\circ},{90}^{\circ}]$.
 2.
The average difference in angles between each pair of opposed angles is less than
${10}^{\circ}$:
${\displaystyle \frac{\left|\widehat{A}-\widehat{C}\right|+\left|\widehat{B}-\widehat{D}\right|}{2}}<{10}^{\circ},$
where
$\widehat{A},\widehat{B},\widehat{C}$, and
$\widehat{D}$ are the angles of the quadrangle defined in the range
$[{0}^{\circ},{180}^{\circ}]$.
 3.
The average perpendicularity of the four corners is less than
${25}^{\circ}$:
${\displaystyle \frac{\left|\widehat{A}-{90}^{\circ}\right|+\left|\widehat{B}-{90}^{\circ}\right|+\left|\widehat{C}-{90}^{\circ}\right|+\left|\widehat{D}-{90}^{\circ}\right|}{4}}<{25}^{\circ}.$
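The three conditions can be sketched as follows. This is a hedged reimplementation from the textual description above; the thresholds match those stated, but the angle conventions and details of the original code from [25] may differ:

```python
import math

def _edge_angle(p, q):
    """Angle of edge pq with the horizontal axis, folded into (-90, 90] degrees."""
    a = math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))
    if a <= -90:
        a += 180
    elif a > 90:
        a -= 180
    return a

def _corner_angle(prev, cur, nxt):
    """Interior angle at vertex cur, in [0, 180] degrees."""
    a1 = math.atan2(prev[1] - cur[1], prev[0] - cur[0])
    a2 = math.atan2(nxt[1] - cur[1], nxt[0] - cur[0])
    d = abs(math.degrees(a1 - a2)) % 360
    return 360 - d if d > 180 else d

def accept_quadrangle(A, B, C, D):
    """Geometric acceptance criterion: edge parallelism (5 deg),
    opposed-angle similarity (10 deg), and average corner
    perpendicularity (25 deg)."""
    parallel = min(abs(_edge_angle(A, B) - _edge_angle(C, D)),
                   abs(_edge_angle(A, D) - _edge_angle(B, C))) <= 5
    ang = {p: _corner_angle(prv, p, nxt)
           for prv, p, nxt in [(D, A, B), (A, B, C), (B, C, D), (C, D, A)]}
    opposed = (abs(ang[A] - ang[C]) + abs(ang[B] - ang[D])) / 2 < 10
    perpendicular = sum(abs(ang[p] - 90) for p in (A, B, C, D)) / 4 < 25
    return parallel and opposed and perpendicular
```

An undistorted rectangle passes all three conditions, while a strongly sheared parallelogram fails the perpendicularity check despite having parallel edges and equal opposed angles.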
In order to estimate the system performance and to avoid errors that may occur in the document localization and segmentation stages, we synthesized distorted field images, as described in
Section 6.1. Before this experiment, we automatically estimated the field thresholds for the proposed algorithm as follows. Each of the 184 original field images
${D}_{f}$ was gradually uniformly downscaled from 0.9 to 0.1 of its size with a step of 0.025. The smallest scale that provided a correct recognition result was chosen as the threshold
${l}_{f}$. Then, for each field
f and threshold
${l}_{f}$, we generated 1000 positively and 1000 negatively marked restored field images. The proposed algorithm parameter
n defining the vertex number of the level curve approximation was set to 100. The positive images of all fields were combined into an overall positive set of size 184,000. The overall negative set was obtained similarly. The restored images of both sets were recognized using Tesseract, and the cumulative PPV and NPV values were calculated.
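The per-field threshold estimation described above reduces to scanning scales from 0.9 down to 0.1 and keeping the smallest scale at which recognition still succeeds. A sketch with a stub in place of the real OCR call (`recognize_at_scale` is a hypothetical hook standing in for downscale-then-Tesseract, not a real API):

```python
def estimate_threshold(recognize_at_scale, hi=0.9, lo=0.1, step=0.025):
    """Scan scales from hi down to lo and return the smallest scale at
    which recognition is still correct (the field threshold l_f), or
    None if recognition fails at every scale.

    recognize_at_scale(s) -> bool should downscale the undistorted field
    image by factor s, run OCR, and compare against the ground truth.
    """
    n_steps = round((hi - lo) / step)
    best = None
    for k in range(n_steps + 1):
        s = hi - k * step  # multiply rather than accumulate to limit float error
        if recognize_at_scale(s):
            best = s
    return best
```

Note that because OCR success is not strictly monotonic in scale, the scan keeps the smallest successful scale rather than stopping at the first failure.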
In the experiments conducted to evaluate the algorithm [
25], we used two versions of the criterion. The first, original criterion assesses the document quadrangle and, thus, ceases further processing of all document fields simultaneously. Additionally, we evaluated the strategy of applying the criterion to each distorted field quadrangle. For both versions, we used the same processes of data generation and performance evaluation, except that the set of 1000 images predicted to be recognized was constructed based on the algorithm under evaluation. The same applies to the set predicted to be unrecognized.
The results of the conducted experiments are shown in
Table 1. It can be seen that the thresholds of the algorithm in [
25] were defined under the assumption of a much higher cost of false-positive errors. However, the proposed algorithm outperforms both versions of the algorithm from [
25] not only in NPV but also in PPV.
Examples of false-positive and false-negative field images for the proposed method are shown in
Figure 11 and
Figure 12, respectively. As we can see, for some of them, the recognition error is caused by the OCR submodule, while the images themselves are easily readable. In the examples of false-negative images, the level of corruption differs. For example, field (b) is barely recognizable, while field (e) has adequate sharpness. The main reason is that we estimate the minimum possible sharpness over all directions. However, if the image is scaled orthogonally to the stroke, the blurring effect is small, as can be seen in
Figure 12e.
Another possible source of errors in the proposed algorithm is the chosen approach to threshold estimation. Due to errors of the recognition submodule, the minimum scaling coefficient threshold may be overestimated for some fields. Moreover, in real applications, the text of a considered document field differs from the text in the template image. The current threshold estimation method is limited to only one possible text version. Thus, a more stable approach to threshold estimation needs to be developed to increase the performance of the algorithm. However, the presented results show that the proposed algorithm for text field quality assessment can already be successfully exploited for recognition reliability prediction.
7. Conclusions
In this paper, we consider the problem of the quality assessment of a field image restored from a projectively distorted source document image. The quality is interpreted in terms of text recognition reliability. The results show that, by using a priori information about the field font, the restored field image quality can be estimated based solely on an analysis of the projective transform. We present a theoretically grounded method for evaluating the distortion level at a point in the restored image. Moreover, we propose a novel binary quality assessment algorithm that does not depend on the image size, i.e., it has $O\left(1\right)$ complexity. We also discuss the model of the reject submodule embedded in a document recognition system.
The algorithm was tested on synthetically distorted field images. The dataset was created based on document template images from the publicly available MIDV-2019 dataset. According to the obtained results, the algorithm provides closely matching predictive value curves, both positive and negative, for different text strings of one font and one font size. For dissimilar fonts, these curves differ. Thus, the assumption is confirmed that the maximum level of distortion that enables reliable recognition depends on the font of the recognized text. Therefore, the threshold of the algorithm can be estimated in advance for each font, regardless of the text that may occur in the input distorted field images.
In the experiment evaluating the performance of the reject submodule, we compared the proposed algorithm with the rejection criterion presented in [
25]. This algorithm is designed to assess the whole document quadrangle and, therefore, rejects or accepts all document fields simultaneously. Additionally, we applied the same criterion separately to each distorted field image. The thresholds for the proposed algorithm were estimated in advance for each field by iteratively downscaling the undistorted field image and recognizing the result. The results show the superiority of our algorithm. The cumulative positive predictive value (PPV) for the proposed algorithm equals 86.7%, which is 7.5% higher than the best PPV of the compared algorithms. The cumulative negative predictive value (NPV) estimated for our algorithm is 64.1%, which exceeds the best value of the compared algorithms by 39.5%.
For future work, a more stable method for estimating the threshold should be developed. It should utilize all alphabet characters of an estimated font and projective distortions in addition to the scaling transform. Additionally, the current approach may be improved by relying on the ratio of sufficient and insufficient region areas defined by the constructed level curve.
It should be noted that the proposed method may be exploited not only for the reject submodule. The other possible application field is combination methods for text field recognition in video streams. The binary quality estimation can be used to reevaluate the confidence of the recognition result for one frame. Moreover, as the method also provides the level curve that bounds the lowquality region, we can utilize it to reevaluate the confidence of each recognized character according to its location. This may increase the video stream recognition accuracy.