1. Introduction
The growing number of visual communication systems results in more video data to transmit. On the other hand, wireless communication is limited by the throughput of available radio bands; similar limitations are encountered in wired transmission when cable connections are long. Despite progress in video compression, channel coding, and modulation techniques, transmitting high-resolution video signals over band-limited channels remains challenging.
Analog circuits are inherent parts of communication systems [1,2]; they generate and detect complex electric signals in the transmitter and receiver. Analog video signals used in popular low-cost CCTV cameras usually comply with old standard-definition specifications such as PAL and NTSC. The latest cameras also support custom formats developed for higher resolutions, such as HD-TVI (high-definition transport video interface) [3], HD-CVI (high-definition composite video interface) [4], and AHD (analog high definition) [5]. The required bandwidth for these analog video systems is closely proportional to the number of pixels transmitted in a given period. For standard-definition video, the bandwidth is about 6–8 MHz; high-definition custom formats increase it to tens of MHz. The bandwidth requirements grow accordingly if more video signals are transmitted in a common channel.
Several works have focused on video coding schemes for transmission over wireless analog channels. They exploit the redundancy in the source to improve video quality in wireless networks. In particular, the schemes are designed to achieve graceful quality degradation in wireless channels whose conditions may vary unpredictably and drastically. Schemes such as SoftCast [6], ParCast [7], LineCast [8], OmniCast [9], and FeatureCast [10] provide signals that can be transmitted in an analog form with a small amount of digital metadata. The metadata are used to scale power in visual data chunks according to their impact on quality. An effective chunk-based blind estimation method allows the metadata to be eliminated [11]. Other schemes, such as WaveCast [12], WSVC [13], SK-Cast [14], and a "practical" one [15], apply hybrid digital–analog transmission to enable scalable coding. The hybrid schemes utilize digital video coding at a reduced quality/resolution, which is improved by differential data from the analog transmission. The digital part provides a low-quality reconstruction but is strongly protected against transmission errors, so each receiver can decode videos at a guaranteed low quality. The analog part enables improvement depending on interference in the transmission channel.
The schemes described above minimize the impact of channel interference on video quality, while bandwidth requirements remain unchanged or are reduced only to a small extent. In some communication systems (e.g., short-range wireless surveillance, video transmitted over long cables), it is possible to preserve sufficient signal quality, and the main problem is limited bandwidth. This problem is solved to some extent by video coding algorithms working fully in the digital domain. Digital codecs applying advanced standards such as H.264/AVC [16], H.265/HEVC [17], and H.266/VVC [18] achieve excellent coding performance by using many coding tools with a large set of coding modes, together with a rate–distortion optimization (RDO) procedure to choose the optimal coding mode or parameters. Although these codecs have achieved great success in video communications, their algorithms are computationally intensive and utilize a significant amount of hardware resources [19,20]. Moreover, they usually do not work well with highly dynamic wireless channels due to their sensitivity to bit errors. This drawback is mitigated to some extent by error protection coding, which increases the system complexity and adds extra payload to compressed bit streams.
Bandwidth requirements on video transmission can also be reduced by selecting essential information in the analog domain at the cost of quality losses. Such an approach is applied in compressed sensing [21,22,23]. However, the complexity of algorithms based on compressed sensing is significant, as they involve transformations on the many pixels included in whole frames. Moreover, their reconstruction quality is much worse than that of digital codecs.
The main goal of this study is to develop a compression algorithm suitable for analog transmission with reduced complexity compared to digital video codecs. The goal is achieved by dividing the input video into three-dimensional (3D) blocks subsampled with different factors according to rate–distortion cost. The subsampling reduces the number of pixel samples, which can then be transmitted using different signal modulations, including those used in the abovementioned schemes. A rate–distortion analysis selects the subsampling factors. Since subsampling is a simple operation, evaluating each factor exhibits low complexity compared to digital codecs. The study evaluates two different reconstruction methods for the encoder and the decoder to demonstrate possible implementation alternatives.
The rest of the paper is organized as follows: Section 2 describes the main steps of the encoding and decoding algorithm. Implementation details are given in Section 3. Section 4 presents the evaluation results, and Section 5 concludes the paper.
2. Algorithm
Subsampling is widely applied in video compression to reduce the size of chroma components. In particular, removing every second chroma sample in each row, known as the 4:2:2 format, reduces the bit rate from 24 bits per pixel (bpp) to 16 bpp for typical 8-bit sample representations. Additionally, when every second chroma row is eliminated (the 4:2:0 format), the bit rate drops to 12 bpp. This subsampling of chroma components involves subjectively negligible quality losses due to the properties of the human visual system. The subsampling of all components (luma and chroma) can be much stronger, allowing much higher bit rate and bandwidth reductions. Owing to its regular pattern, this method can be applied to reconstruct a digital representation from analog signals.
On the other hand, information density is usually differentiated between frames (motion) and frame regions (texture). Therefore, a constant subsampling pattern can lead to significant information/quality losses in the case of regions with complex texture or high-motion activity. Simultaneously, samples with significant information redundancy can represent flat or static regions.
In video transmission, the goal of compression is to simultaneously reduce the bit rate to a desired level and maximize reconstruction quality. Quality maximization is achieved by allocating different numbers of bits to regions of various content complexity. While operating on bits is a feature of digital techniques, samples are much more suitable and efficient for forming analog signals due to their ability to transfer multi-valued variables more compactly in successive time instants or frequency carriers (in orthogonal frequency division multiplexing). Therefore, compression for analog transmission should allocate samples rather than bits to pixel regions. Since pixels are correlated both within the two-dimensional (2D) frame space and across successive time steps (between frames), operating on 3D blocks (cubes) provides higher compression ratios and more flexibility in sample allocation. For each 3D block, selecting a subsampling factor for each dimension allows different reconstruction qualities and sample numbers. The general block diagram of the proposed compression scheme is shown in Figure 1. The encoder and decoder modules are described in the following subsections.
2.1. Configurations and Subsampling Modes
The size of the 3D blocks used to select different subsampling factors can be fixed in the configuration or selected by the algorithm for each pixel region. For design simplicity, this study applied fixed sizes. The 3D size is specified by three numbers corresponding to each dimension and labeled as [X, Y, Z], where X, Y, and Z denote the width, the height, and the number of frames, respectively. Subsampling factors for blocks are limited by their size, i.e., at least one sample of each component must remain along a given dimension. A combination of subsampling factors for all three dimensions specifies a mode, labeled as [x, y, z], where x, y, and z denote luma subsampling factors along the horizontal, vertical, and frame dimensions, respectively. Chroma subsampling is twice as strong as luma subsampling in each dimension; when the luma subsampling factor is maximal, the chroma factor is the same due to the abovementioned limitation. For simplicity, sizes and subsampling factors were restricted to powers of 2. Subsampling divides the 3D block of each component into cubes, where each cube is represented by one sample; higher subsampling factors form larger cubes.
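As a sketch of the mode space described above, the following Python snippet (illustrative only, not the paper's code) enumerates all subsampling modes allowed for a given block size; a dimension of size S, a power of 2, admits the factors 1, 2, ..., S:

```python
import itertools

def subsampling_modes(block_size):
    """Enumerate all [x, y, z] luma subsampling modes for a 3D block.

    Factors are restricted to powers of 2 and limited so that at least one
    luma sample remains along each dimension.  The function name and data
    layout are illustrative assumptions, not the paper's implementation.
    """
    factors_per_dim = [
        [2 ** k for k in range(s.bit_length())]  # 1, 2, ..., s for s a power of 2
        for s in block_size
    ]
    return list(itertools.product(*factors_per_dim))

modes = subsampling_modes((16, 16, 16))
print(len(modes))  # 125 modes: 5 factors (1, 2, 4, 8, 16) per dimension
```

For a [16, 16, 16] block this yields the 125 combinations mentioned later in Section 2.4, and 64 for an [8, 8, 8] block.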
Apart from the 3D block size, the codec configuration is specified by the encoding and decoding methods. There are two ways to encode: selecting samples at regular intervals and calculating the average value of the samples. The first approach is more straightforward, as it picks the bottom-right (regarding horizontal and vertical frame dimensions), last-frame sample from each 3D cube resulting from subsampling. Locations of the picked samples are shown in Figure 2b and Figure 3b. The second approach computes the average of all samples included in a cube. There are also two ways of decoding. The first is more straightforward, as it duplicates the sample representative into all samples of the corresponding cube. The second applies trilinear or higher-order interpolation based on samples taken from previously decoded cubes and the representative of the current one.
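The two encoding methods and the duplication-based decoding can be sketched with NumPy as follows; the function names, and the assumption that the block size is divisible by the mode factors, are illustrative rather than taken from the paper's code:

```python
import numpy as np

def encode_block(block, mode, method="average"):
    """Reduce a 3D luma block to one representative sample per cube.

    block: ndarray of shape (X, Y, Z); mode: (x, y, z) subsampling factors
    dividing the block into cubes.  Assumes divisible dimensions (a sketch).
    """
    x, y, z = mode
    if method == "average":
        X, Y, Z = block.shape
        # Average all samples inside each x*y*z cube.
        return block.reshape(X // x, x, Y // y, y, Z // z, z).mean(axis=(1, 3, 5))
    # "pick": take the bottom-right, last-frame sample of each cube.
    return block[x - 1::x, y - 1::y, z - 1::z]

def decode_duplicate(samples, mode):
    """Simplest reconstruction: replicate each representative into its cube."""
    return samples.repeat(mode[0], axis=0).repeat(mode[1], axis=1).repeat(mode[2], axis=2)

block = np.arange(16 * 16 * 16, dtype=float).reshape(16, 16, 16)
rep = encode_block(block, (4, 4, 2))
print(rep.shape)                               # (4, 4, 8)
print(decode_duplicate(rep, (4, 4, 2)).shape)  # (16, 16, 16)
```

For flat regions the two encoding methods coincide; they differ in textured cubes, where the average acts as a crude low-pass filter while picking keeps one raw sample.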
One of the two decoding methods can be applied at the encoder side to estimate the reconstruction distortions required to select the best mode. If duplication is used in the decoder, it makes no sense to estimate distortion based on the interpolation, due to its complexity and the mismatch between the estimated (encoder) and actual (decoder) distortions. On the other hand, using the cube average reduces the encoder complexity; therefore, this method is beneficial even when the decoder exploits the interpolation. Although LDI/ADI configurations involve some mismatch between the qualities estimated in the encoder and obtained in the decoder, the quality losses are small (see Section 4) since samples in a cube are usually strongly correlated.
Table 1 summarizes the configurations regarding the computation methods used in the coding algorithm. The last sample denotes the bottom-right, last-frame sample within a cube. Figure 2, Figure 3, Figure 4 and Figure 5 depict examples of subsampling and reconstruction for four configurations: LDD, LDI/LII, ADD, and ADI/AII.
2.2. Rate–Distortion Optimization
The selection of subsampling modes uses a rate–distortion optimization (RDO) algorithm. The RDO takes the estimated sample rate (R) and reconstruction distortion (D) of each candidate mode. The two values are combined with a weight, known as the Lagrange multiplier λ, to provide a joint cost. Costs computed for all modes are compared to each other to find the best mode for a given 3D block according to the following formula [24]:

m* = arg min_m (D_m + λ · R_m),

where R_m and D_m denote the rate and distortion of candidate mode m.
The multiplier value decides the impact of rates and distortions on costs. Its selection depends on the sum of the rates of all blocks: larger values yield smaller rates through stronger subsampling. Since the rates contributed by all blocks depend on λ, several iterations must be performed to meet the requirements on the total rate while minimizing the total distortion. Each iteration evaluates one λ value. To limit the number of iterations, the bisection method was applied. The method halves the range of the λ multiplier in each iteration: the evaluated λ value is set in the middle of the range, and if the total rate exceeds the target, the upper subrange becomes the new range; otherwise, the bottom subrange is selected. The method effectively determines one bit of the multiplier in each iteration. Modes selected in the last iteration are used to subsample each block.
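The bisection on λ can be sketched as follows; the initial search range for λ, the iteration count, and the data layout (one array of (rate, distortion) rows per block) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def select_modes(rd_points, target_rate, iters=20):
    """Bisection on the Lagrange multiplier λ (sketch of the method in the text).

    rd_points: list of per-block arrays with rows (rate, distortion), one row
    per candidate mode.  Returns the mode indices chosen in the last iteration.
    The λ range [0, 1e6] and iters=20 are arbitrary illustrative choices.
    """
    lam_lo, lam_hi = 0.0, 1e6
    best = None
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        # For each block, pick the mode minimizing J = D + λ·R.
        best = [int(np.argmin(p[:, 1] + lam * p[:, 0])) for p in rd_points]
        total_rate = sum(p[i, 0] for p, i in zip(rd_points, best))
        if total_rate > target_rate:
            lam_lo = lam   # too many samples: raise λ for stronger subsampling
        else:
            lam_hi = lam   # rate fits: try a smaller λ to reduce distortion
    return best
```

Each iteration halves the λ range, so a modest number of iterations suffices to pin the multiplier down to the precision needed for mode selection.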
Figure 6 shows an example of the mode selection for the [32, 32, 1] configuration.
2.3. Smoothing Filter
The quality of reconstructed videos is improved by a smoothing filter operating on block edges within each frame. It reduces the blocking artifacts introduced by subsampling. This study used a simple filter with the configurations generating average samples (ADD, ADI, and AII). The filter modifies edge luma samples in a given dimension when the corresponding subsampling factor equals or exceeds 4; this condition applies separately to each side of the edge. The filtering is performed according to the following formula:
where p and q denote edge samples from the current and neighboring blocks, and p′ is the modified sample.
2.4. Metadata
In order to recreate the video from the samples, the decoder needs information about the subsampling modes used in the 3D blocks into which the source video was divided. In the target framework, this information can be sent over a separate low-bit-rate digital channel or embedded into the analog signal with redundancy and/or protection bits. For one 3D block, the mode identifier combines three parts related to each dimension. The size of each part depends on the block size in the corresponding dimension. Generally, the number of possible subsampling factors increases with the size, and the factor value is limited by the case when only one luma sample is present along a given dimension. For example, the [16, 16, 16] block needs a 7-bit identifier, as each dimension has five allowed subsampling factors, giving 125 (5 × 5 × 5) combinations in total, which fit into the 128-valued range of 7 bits. The general formula for the number of bits B is as follows:

B = ⌈log2((log2 X + 1) · (log2 Y + 1) · (log2 Z + 1))⌉.
The impact of the identifier on the required throughput depends on the selected redundancy and protection rates. Provided that each bit is transmitted as two samples, the 7-bit identifier of a [16, 16, 16] block occupies 14 samples, which corresponds to 0.003418 samples per pixel (spp) of digital side information.
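Under the assumption stated above (sizes are powers of 2, so a dimension of size S admits log2(S) + 1 factors), the identifier length can be computed as in this illustrative snippet:

```python
from math import ceil, log2

def mode_id_bits(block_size):
    """Bits needed to index all subsampling-factor combinations of a block.

    Assumes each size is a power of 2, so a dimension of size S allows
    log2(S) + 1 factors (1, 2, ..., S).  Sketch of the rule in the text.
    """
    combos = 1
    for s in block_size:
        combos *= int(log2(s)) + 1
    return ceil(log2(combos))

print(mode_id_bits((16, 16, 16)))  # 7 (125 combinations fit into 7 bits)
print(mode_id_bits((8, 8, 8)))     # 6 (64 combinations)
```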
3. Implementation
The encoder and the decoder were implemented in the Python programming language using popular libraries such as NumPy, SciPy, and scikit-image. To divide the source sequence into 3D blocks, a group of pictures consisting of Z successive frames was stored in memory. This group was then divided into [X, Y, Z] blocks. The blocks were processed in raster order, i.e., from the left block to the right one in each row, and from the top row to the bottom one. If the frame width or height was not divisible by X or Y, the last horizontal or vertical blocks contained fewer columns or rows, respectively.
All possible modes were generated for each block. For example, if the codec operated in the [16, 16, 16] configuration, there were 125 possible modes (from [1, 1, 1] to [16, 16, 16] for the luma and from [1, 1, 1] to [8, 8, 8] for the chroma components). For each mode in a given block, encoding and decoding were performed to calculate the rate and the mean square error (MSE). Using the ConvexHull class from the SciPy library, the convex hull (RD curve) was found for each block and stored in memory. Then the bisection method found the weight λ, based on which the target subsampling mode for each block was selected.
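The per-block RD curve, i.e., the lower convex hull of the (rate, MSE) points of all modes, can also be obtained without SciPy's ConvexHull class. The following pure-Python sketch uses Andrew's monotone-chain algorithm and is illustrative rather than the paper's implementation:

```python
def rd_convex_hull(points):
    """Lower convex hull of (rate, distortion) points (monotone chain).

    Keeps only RD-optimal modes: those on the lower-left envelope, where
    distortion is a convex, decreasing function of rate.  Illustrative sketch.
    """
    pts = sorted(set(points))  # sort by rate, then by distortion
    hull = []
    for p in pts:
        # Pop the last point while it lies on or above the chord hull[-2] -> p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

print(rd_convex_hull([(0, 10), (1, 5), (2, 4), (3, 0)]))
# (2, 4) is dropped: it lies above the segment from (1, 5) to (3, 0)
```

Restricting the λ search to hull points is safe because a mode above the hull can never minimize D + λ·R for any λ ≥ 0.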
The zoom function from the SciPy library was used for interpolation when the bottom-right, last-frame sample was selected as the representative (LDI and LII configurations). This function enlarges a multidimensional array so that the values in new cells are obtained by spline interpolation. It should be noted that samples from neighboring blocks were also needed to reconstruct a block; in particular, the interpolation based on the last sample in a cube refers to left/top/previous-frame-group blocks. The input to the zoom function was an array of samples and a zoom factor for each axis. For example, from the [16, 16, 16] block at the [4, 4, 2] subsampling mode, 32 samples were taken and formed into a 4 × 4 × 2 array. After adding samples from adjacent blocks, a 5 × 5 × 3 array was created. The zoom function enlarged this array to the size of 17 × 17 × 17, and the 16 × 16 × 16 slice corresponding to the current block was taken as the reconstruction.
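A minimal sketch of this zoom-based reconstruction, using the array sizes from the example above with random data standing in for real samples (which slice is kept is an illustrative assumption following the text's description of left/top/previous neighbors):

```python
import numpy as np
from scipy.ndimage import zoom

# 4 x 4 x 2 representatives of the current block, padded with one plane of
# neighbor samples along each axis, as in the text's example.
padded = np.random.rand(5, 5, 3)
target = (17, 17, 17)
factors = [t / s for t, s in zip(target, padded.shape)]

enlarged = zoom(padded, factors, order=1)  # order=1: trilinear interpolation
block = enlarged[1:, 1:, 1:]               # drop the neighbor planes
print(block.shape)  # (16, 16, 16)
```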
When the interpolation was performed using average samples from cubes, the map_coordinates function from the SciPy library was used (the previously mentioned zoom function uses it indirectly). The input to the map_coordinates function is an array of samples and an array of coordinates at which interpolated values are computed with fractional accuracy. The function supports extrapolation, which is useful for the average-sample configurations; in particular, the nearest-neighbor method calculates pixels between the extreme samples and the block edges. This extrapolation can lead to blocking artifacts, which can be reduced by the smoothing filter described in Section 2.3. The filter requires references to the four horizontally and vertically neighboring blocks.
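A possible sketch of the map_coordinates-based reconstruction for average-sample configurations; placing each average at its cube's center and the resulting coordinate mapping are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def decode_interpolate(averages, mode, block_size, order=1):
    """Reconstruct a block from cube-average samples with map_coordinates.

    Each average is treated as the sample at its cube's center; positions
    between the outermost centers and the block edges are extrapolated by
    the nearest neighbor (mode='nearest').  Illustrative sketch.
    """
    # Pixel i of a cube of size f maps to coordinate (i - (f - 1) / 2) / f
    # in the grid of cube centers.
    grids = [
        (np.arange(s) - (f - 1) / 2.0) / f
        for s, f in zip(block_size, mode)
    ]
    coords = np.meshgrid(*grids, indexing="ij")
    return map_coordinates(averages, coords, order=order, mode="nearest")

averages = np.random.rand(4, 4, 8)  # a [16, 16, 16] block at mode [4, 4, 2]
rec = decode_interpolate(averages, (4, 4, 2), (16, 16, 16))
print(rec.shape)  # (16, 16, 16)
```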
The interpolation functions can be executed with a specified order. For an order equal to 1, linear interpolation is performed; higher orders enable polynomial (spline) interpolation. Increasing the order should improve reconstruction quality at the cost of higher algorithmic complexity. Therefore, it is reasonable to use higher orders only in the decoder while keeping the encoder as simple as possible.
4. Results
The developed codec was evaluated using nine test sequences, as listed in Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7 [25]. For each sequence, the first 64 frames were coded. Four sample rates of 0.125, 0.1875, 0.25, and 0.3125 spp were tested, with the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) as distortion metrics. These sample rates corresponded to bit rates of 1, 1.5, 2, and 2.5 bits per pixel, respectively, and were selected to allow medium-quality reconstructions. Since the bandwidth is proportional to the number of samples transmitted in a period, the sample rate is a measure of the bandwidth reduction. If compression is not used, the bandwidth requirements are the highest; assuming the 4:2:0 chroma format, the sample rate of uncompressed analog video is 1.5 spp.
In the first test stage, the effectiveness of the codec was checked in various configurations while maintaining a constant block size of [16, 16, 16]. The interpolation order was set to 1, and the smoothing filter was not used. The obtained rate–distortion (RD) curves are shown in Figure 7 and Figure 8. As can be seen, the bandwidth reduction can be traded for reconstruction quality, which is typical of lossy compression schemes. Bjontegaard Delta (BD) metrics were computed for these curves and are listed in Table 2 and Table 3. The reference for the metrics was the LDD configuration, since it turned out to be the worst in terms of RD efficiency, as shown in Figure 7 and Figure 8. The best results were obtained for the LII configuration, with slightly worse results for the AII and ADI configurations. Although the LDI and ADD configurations were much better than LDD, they were worse than the three best configurations. In the case of LDI, the reason was the mismatch between the distortion estimated in the encoder and the actual one after decoding; ADD, on the other hand, did not take advantage of the interpolation.
The reconstruction quality strongly depends on motion activity. High motion activity decreases the efficiency of the subsampling in the time/frame dimension (sequences Football, Mobile, and Crowd_run). On the other hand, sequences with a fixed camera view (Hall_monitor and News) achieve good compression efficiency owing to strong temporal redundancy.
In the second test stage, the impact of the interpolation order and the smoothing filter on the quality of the reconstructed videos was evaluated. The three best configurations from the first stage were selected (LII, ADI, and AII). The results are summarized in Table 4 and Table 5 in terms of BD metrics. As can be seen, the order-2 interpolation outperformed the order-1 interpolation on average by 0.729 and 0.757 dB for the ADI/ADI2 and AII/AII2 configurations, respectively. The improvement of the order-2 interpolation was about 0.01 in terms of SSIM. On the other hand, the evaluation with the order equal to 3 provided slightly worse results. Thus, it is beneficial to use the order-2 interpolation.
The smoothing filter (denoted by the -F suffix in ADI2-F and AII2-F) provided additional quality improvements. Although the average PSNR improvement was below 0.1 dB, the subjective quality is much better due to the reduced blocking artifacts. This improvement can be observed in Figure 9.
A larger 3D block size means more available modes and the ability to use stronger subsampling. On the other hand, a larger block means less flexibility in the choice of modes because it imposes the same mode on a larger portion of the video, which can degrade results. To verify these claims, the third stage of experiments compared the performance of the codec in the ADI configuration using different block sizes. The results are summarized in Table 6 and Table 7. They show that the best efficiency is achieved for the [8, 8, 8] block size. Compared to the [16, 16, 16] configuration, the average gain is about 0.5 dB in PSNR and 0.0026 in SSIM. However, smaller block sizes have higher bandwidth requirements on the digital channel. In particular, the [8, 8, 8] configuration involves six mode bits per block, i.e., 0.0117 bits per pixel. Assuming that each bit occupies two samples (see Section 2.4), the increase is 0.0234 spp, almost seven times more than in the case of the [16, 16, 16] configuration. This is almost 19% of the bandwidth of the main analog stream at the lowest rate considered in this study (0.125 spp).
The best block size differed between sequences, as the bolded numbers in the tables indicate. In particular, it depended on the temporal and spatial correlation of texture. Smaller blocks adapted better to changing video content; on the other hand, they increased the number of pixels on block edges, leading to more blocking artifacts. For high motion activity (Football, Pedestrian_area, and Tractor), better results were achieved when the frame dimension of the block was decreased. For a fixed camera view with low motion activity (Hall_monitor, News), the frame dimension should be longer to exploit background redundancy between frames. An increase in spatial correlation within each frame favored blocks larger horizontally and vertically. These observations suggest that the block-size configuration should be selected considering the application and the expected video content. It is also possible to extend this study by enabling the encoder to select between several block sizes, which would allow adaptation to the best size based on local correlation.
Figure 10 depicts RD curves for the ADI2-F configuration and all tested video sequences. As can be seen, quality differed significantly between sequences, which mainly stems from their motion activities. Adaptation to the best block size should decrease these differences.
The implementation in Python was far from real-time performance; nevertheless, it allowed the evaluation of relative complexities between configurations. The target implementation should ultimately be in hardware. Using one thread of an Intel i5-6300HQ processor clocked at 2.3 GHz, the LDD [16, 16, 16] configuration required 525.3 and 3.91 s to encode and decode, respectively, 64 frames of Full High Definition (1920 × 1080) video. In this simplest configuration, the complexity of the decoder is smaller by two orders of magnitude compared to the encoder. This relationship stems from the fact that the encoder must evaluate all allowable modes to select the best one for each 3D block, whereas the decoder processes only one mode per block.
The average encoding and decoding times are summarized in Table 8 for the remaining [8, 8, 8] and [16, 16, 16] configurations. The LII and AII configurations were much more computationally complex, as they had to perform interpolation operations for every allowable subsampling mode. With this in mind, ADI was the best configuration because it provided good compression quality with relatively simple calculations. Although the [8, 8, 8] block size had fewer subsampling modes to evaluate (64 vs. 125), it involved a much longer execution time. This stems from the fact that Python is an interpreted language, and most of the execution time was spent on source code interpretation. Since the source code was executed eight times more often for the [8, 8, 8] configurations, the execution time increased significantly. In the target implementation, these ratios should be different, and execution times should be reduced by several orders of magnitude.
The smoothing filter and the order-2 interpolation applied at the decoder did not impact the encoder complexity. The increase in the decoder's execution time was small when the filter was active; on the other hand, the order-2 interpolation was much more complex than the order-1 interpolation.
To the best of our knowledge, there are no previous works on video compression for analog/hybrid communication; therefore, a direct comparison was not possible.
5. Conclusions
The compression based on subsampling in 3D blocks proved suitable for hybrid digital–analog transmission. It reduced the utilized bandwidth several times. In particular, medium-quality reconstructions can be obtained when transmitting Full High Definition (1920 × 1080) videos in a channel dedicated to standard-definition ones, i.e., at a sample rate of about 0.25–0.3125 spp. Moreover, one such channel can transmit more videos when low motion activity is expected (e.g., in surveillance systems).
The worst codec configuration regarding reconstruction quality was the one in which interpolation was not used and samples were picked from the original video (LDD). Subsampling based on interpolation applied at both the encoder and the decoder achieved the best results: for LII and AII, the average improvement was 3.733 and 3.388 dB, respectively. Limiting the interpolation to the decoder allowed a significant complexity reduction at the cost of a slightly smaller quality improvement, i.e., 3.350 dB for ADI. The order-2 interpolation increased the quality by about 0.7 dB for the subsampling based on averaging (ADI–ADI2); on the other hand, the increase was slight for the subsampling based on picking (LII–LII2). The smoothing filter gave an average improvement of about 0.1 dB and significantly reduced blocking artifacts. Changing the 3D block size from [16, 16, 16] to [8, 8, 8] increased the quality by about 0.5 dB. However, the best block size depends on motion activity, and further studies would be beneficial to develop better selection algorithms.
The reconstruction methods evaluated in this study are relatively simple. The author believes it is possible to improve reconstruction quality by leveraging more sophisticated approaches, such as variable 3D block sizes, transforms, and neural networks (on the decoder side). Developing a rate control algorithm more suitable for one-pass coding, i.e., one that does not store the results of all modes for a group of frames, would also be beneficial.