Article

Performance Comparison of H.264 and H.265 Encoders in a 4K FPV Drone Piloting System

Jakov Benjak, Daniel Hofman, Josip Knezović and Martin Žagar
1 Faculty of Electrical Engineering and Computing, University of Zagreb, 10000 Zagreb, Croatia
2 Web and Mobile Computing Department, RIT Croatia, 10000 Zagreb, Croatia
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(13), 6386; https://doi.org/10.3390/app12136386
Submission received: 2 May 2022 / Revised: 14 June 2022 / Accepted: 15 June 2022 / Published: 23 June 2022
(This article belongs to the Special Issue High Performance Computing and Computer Architectures)

Abstract: With the rapid growth of video data traffic on the Internet and the development of new types of video transmission systems, the need for ad hoc video encoders has also increased. One such case involves Unmanned Aerial Vehicles (UAVs), widely known as drones, which are used in drone races, search and rescue efforts, capturing panoramic views, and so on. In this paper, we provide an efficiency comparison of the two most popular video encoders—H.264 and H.265—in a drone piloting system using first-person view (FPV). In this system, a drone is used to capture video, which is then transmitted to FPV goggles in real time. We examine the compression efficiency of 4K drone footage by varying parameters such as Group of Pictures (GOP) size, Quantization Parameter (QP), and target bitrate. The quality of the compressed footage is determined using four objective video quality measures: PSNR, SSIM, VMAF, and BRISQUE. Apart from video quality, encoding time and encoding energy consumption are also compared. The research was performed using numerous nodes on a supercomputer.

1. Introduction

According to Cisco [1], video will comprise 82% of all global Internet traffic by 2022. Such a high percentage implies a need for high-quality encoders. To achieve the best compression results, an encoder must be adapted to the context it is used in, which can be attained by constructing a completely new encoder matched to a specific use-case. For example, in a first-person view (FPV) drone piloting system, input from the user affects the direction of movement of the drone; such information could be used by the encoder for efficient motion estimation. Encoder adaptation, yielding better though still suboptimal compression, can also be attained more simply and cheaply. Popular implementations of the video coding standards MPEG-4 Advanced Video Coding (AVC), also known as H.264, and High-Efficiency Video Coding (HEVC), also known as H.265, are x264 and x265, respectively. These expose a set of encoding parameters that can be tweaked to optimize compression efficiency. This is the approach we take throughout this paper, where our goal was to find the parameter set providing the best quality/bitrate ratio.
Some of the parameters with the greatest impact on encoder operation are Group of Pictures (GOP) size, Quantization Parameter (QP), and target bitrate. GOP size defines the distance (in frames) between two intracoded frames. QP is used to define how strongly the transformation coefficients will be quantized, which is usually described by two parameters: QP min and QP max. QP min sets the minimum quantizer scale and, similarly, QP max sets the maximum quantizer scale. While encoding a video, the encoder will attempt to generate a video with the given target bitrate. Note that all three of these parameters are correlated, especially QP and target bitrate. Therefore, the target bitrate is often not achieved, so we consider the actual bitrate.
The best way to evaluate video quality is through surveying a group of people and obtaining a subjective quality assessment [2]; unfortunately, this method is not scalable, as it requires considerable time and resources. We therefore evaluated video quality using four objective image/video quality measures: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), Video Multi-method Assessment Fusion (VMAF), and Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE). Later sections describe these four measures in greater detail.

2. Related Work

An overview of the H.264 and H.265 video coding standards and their core differences has been given in [3], chapter III. In addition to H.264 and H.265, ref. [4] has also discussed the novel H.266 video coding standard. Encoder comparisons typically revolve around on-demand adaptive streaming, a service based on a server–client architecture where multiple video streams (differing in resolution and/or bitrate) of the same content are stored on the server side. This kind of comparison focuses only on compression performance, meaning that parameters such as encoding time and power consumption are not considered. De Cock et al. [5] have shown that significant bitrate savings at the same quality can be achieved by using the x265 encoder instead of x264. They compared the x264, x265, and libvpx codecs by measuring six objective quality metrics: PSNR, PSNR(MSE), SSIM, MS-SSIM (Multiscale SSIM), VIF (Visual Information Fidelity), and VMAF. This resulted in more than half a million bitrate–quality curves. The same result has been obtained in [6], where it was revealed that x265 (along with aomenc and libvpx) achieved a substantial compression efficiency improvement over x264, according to the PSNR and VMAF metrics. Objective quality metric scores can deviate from subjective quality opinions, especially as not all objective metrics are able to recognize different types of distortions (e.g., flicker, frame drops, interlacing, and compression artifacts). Moreover, scene complexity also affects objective metric scores. For this reason, it is desirable to use as many different metrics as possible when determining video quality. A detailed analysis of objective quality metric performance, performed on UHD high-motion sports footage (LIVE database), is given in [7].
Objective quality metrics are not the only evidence that x265 is superior to x264. A description of subjective evaluation methods for video sequences, according to ITU-T P.910, has been presented in [8]. The ACR (Absolute Category Rating, which was chosen for the evaluation), DCR (Degradation Category Rating), and PC (Pair Comparison) methods were implemented, and a group of 89 students evaluated the influence of video codecs on three scenes encoded at 1080, 720, 480, and 360 px resolutions. Again, libx265 achieved overall higher median and mode scores when compared to libx264. According to similar analyses, when computational complexity is not of the essence, it is safe to say that x265 outperforms x264 in practically every aspect.
Quality assessment differs when it comes to real-time encoding systems. Mansri et al. [9] have claimed that each newly developed codec outperforms its predecessor in terms of coding efficiency, with a consequent increase in computational complexity. It is crucial to find a good trade-off between the encoded video bitrate and encoding time; for example, if a system is designed to operate under good network conditions with fast transmission speeds, the bitrate is less critical, as the encoding may take more time than the transmission itself. In an FPV drone piloting system, a drone is limited by its battery capacity; therefore, energy consumption is also an important factor. The authors of [10] have evaluated H.264/AVC, H.265/HEVC, and VP9 compression quality in a live streaming environment by encoding eight popular video games. Their results showed that H.265 provided the best compression efficiency but was 2.6 times slower than H.264. A comparison of H.264 and H.265 encoders for use in VANETs (Vehicular Ad Hoc Networks) has been detailed in [11]. The comparison was carried out on four video sequences of 352 × 288 px resolution. According to the authors, H.265 had an advantage of 49% bitrate savings when compared to H.264.
Previous research has shown that real-time encoding can be achieved by parallelizing the encoder up to its parallelization limits, which depend on the encoder implementation [12]. H.264 video encoder parameter optimization in real-time wireless transmission systems has been presented in [13], where it was concluded that PSNR and bitrate show a direct and proportional relation. Furthermore, an optimal trade-off between QP and GOP size was presented that minimizes processing latency and nodal power consumption.
In the area of specialized video quality assessment for drone devices, ref. [14] presented the findings of a user study that assessed the QoE (Quality of Experience) of a video stream displayed to pilots using an FPV-based drone system. The video-encoding parameters had a significant impact on the perceived QoE. The study also noted the influence of simulator sickness in FPV systems, which may impair the QoE and severely influence the results of the study. In our research, we limited our comparison to objective methods, thus not requiring an additional pool of real pilots.
One of the goals of this work is to provide motivation for designing an ad hoc FPV video encoder. The researchers in [15] have described step-by-step implementation details when converting an HEVC encoder into a customized VVC (Versatile Video Coding) encoder. It was shown that reusing parts of the coding structure is a valid and effective approach. A similar methodology could be utilized to design an FPV encoder.

3. Evaluation Methodology

This section describes the evaluation process. The main idea is to automate the process for determining the best set of encoding parameters (e.g., codec, GOP size, QP) for a specific use. In other words, if a user has limited bitrate throughput, we wish to provide a set of parameters that provides the best quality at the specified bitrate. Similarly, if a user defines a minimal required quality, we aim to provide a set of parameters that generates a video that matches the quality requirement but with the lowest possible bitrate. To automate the process, we designed a test environment consisting of six Python scripts.
  • encode.py: This script encodes the input video through all combinations of the defined codec, GOP size, QP, and target bitrate parameters. The four parameters are user-defined in separate lists. This script uses FFMPEG [16] to encode videos (a minimal sketch of this sweep follows the list).
  • quality_assessment.py: After generating a set of test videos, this script calculates and saves PSNR, SSIM, and BRISQUE scores into a newly created data.csv file.
  • generate_vmaf.py: As we use FFMPEG to calculate VMAF, this script saves the console output for each video into a separate text file for further analysis.
  • avg_vmaf.py: Using this script, we extract average VMAF scores from previously created files using a regex. We save the scores into a newly created file, data_vmaf.csv.
  • append_vmaf.py: This script creates another file, data_final.csv. This file is created by appending all the VMAF scores onto data.csv, meaning that data_final.csv contains all the data we need for the analysis, including codec type, target bitrate, actual bitrate, average QP ($(QP_{max} + QP_{min})/2$), GOP size, and the scores of all four quality metrics.
  • draw.py: The final step involves visualizing all of the results, which can be challenging when dealing with multi-dimensional data. For a clear interpretation, we draw interactive parallel coordinates and bubble charts using plotly [17].
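As an illustration of the encode.py sweep, a minimal sketch follows. The parameter lists, file names, and output naming are assumptions for the sake of the example, not the exact values used in our experiments:

```python
import itertools
import subprocess

# Illustrative parameter lists; in encode.py these are user-defined.
codecs = ["libx264", "libx265"]
gop_sizes = [90, 120, 150]
qp_ranges = [(18, 28), (28, 38)]        # (QP_min, QP_max)
bitrates = ["5M", "10M", "25M"]         # target bitrates

for codec, gop, (qp_min, qp_max), rate in itertools.product(
        codecs, gop_sizes, qp_ranges, bitrates):
    out = f"out_{codec}_g{gop}_q{qp_min}-{qp_max}_b{rate}.mp4"
    subprocess.run([
        "ffmpeg", "-y",
        "-i", "input_raw.mov",          # raw 4K/60 fps source sequence
        "-c:v", codec,
        "-g", str(gop),                 # GOP size (keyframe interval)
        "-qmin", str(qp_min),           # minimum quantizer scale
        "-qmax", str(qp_max),           # maximum quantizer scale
        "-b:v", rate,                   # target bitrate
        out,
    ], check=True)
```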
The scripts need to be executed in the order described in Figure 1. The whole process is computationally very complex, which is why we used the BURA supercomputer [18], located in Rijeka, Croatia. BURA includes multi-computer and multi-processor systems based on a hybrid computing architecture. The multi-computer cluster has 288 compute nodes, each having two Xeon E5 processors (24 physical cores in total). Each node has 64 GB of memory and 64 GB of storage.
We deployed all scripts on BURA and performed the simulations remotely. The Python scripts were run through SLURM (Simple Linux Utility for Resource Management) [19] batch scripts, which allow specifying the number of threads, nodes, and so on for the execution of the linked Python script. For each video, encoding was carried out using one node and one thread. The encoding process was fairly fast (around 30 min in total for each raw video), so there was no need for parallelization. Moreover, the onboard drone computer can hardly replicate the computational power of a supercomputer, so one thread was considered sufficient for encoding. However, quality assessment is computationally very complex; therefore, for each raw video, 24 scripts were executed on the 24 nodes simultaneously. Each script used one thread for assessing the quality of selected compressed sequence samples, and increasing the number of threads/nodes had no impact on the metric scores. Even on BURA, it took approximately 3 min to process a single frame. After obtaining all results and drawing the final graph, we could determine an optimal set of parameters that represents the best match for a specific need.
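As an illustration of how such jobs can be dispatched, the following sketch generates and submits one single-thread SLURM job; the directives and script names are assumptions, not the exact job files used on BURA:

```python
import subprocess

# One node, one task, one CPU per job, matching the setup described above.
job = """#!/bin/bash
#SBATCH --job-name=qa_seq01
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
python3 quality_assessment.py
"""

with open("qa_seq01.sh", "w") as f:
    f.write(job)

# sbatch queues the job; 24 such jobs ran simultaneously, one per node.
subprocess.run(["sbatch", "qa_seq01.sh"], check=True)
```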
In terms of the availability of reference images, there are three types of image quality metrics: Full-reference (FR), No-reference (NR), and Reduced-reference (RR) metrics. FR metrics require a reference image and calculate scores by comparing the compressed image pixels with the reference image pixels. NR metrics are typically used when the reference image is not available. NR metric models are trained on a database of images with known distortions, and they are limited to evaluating the quality of images with the same type of distortion. RR metrics involve a combination of FR and NR metrics. They are designed to predict the perceptual quality of distorted images with only partial information about the reference images. In this paper, we only use FR and NR metrics. We tried to use the RR VQM (Video Quality Metric) developed by NTIA (National Telecommunications and Information Administration) [20], but we discovered that it is not compatible with 4K resolution videos; it is limited to assessing the quality of videos with up to 1260 image rows. There are several resources [21,22,23] claiming that the VQM score is very similar to VMAF, so we believe that not using VQM is not a critical issue in this research.

3.1. Peak Signal-to-Noise Ratio (PSNR)

Although this is the most well-known FR objective image quality metric, PSNR values [24] often deviate significantly from subjective results. The PSNR value is calculated as:

$$\mathrm{PSNR} = 10 \times \log_{10}\frac{MAX_I^2}{MSE}, \quad (1)$$

where $MAX_I$ is the maximum possible pixel intensity (255 for 8-bit images), and $MSE$ is the mean squared error between the reference image pixel values and the compressed image pixel values. The reason why PSNR quality assessment often deviates from subjective results is that it is not adapted to Human Visual System (HVS) characteristics: it involves a simple pixel-value comparison. Typical PSNR values in lossy 8-bit image and video compression are between 30 and 50 dB, where higher is better.
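As a concrete illustration of Equation (1), a minimal Python sketch follows (our illustration, assuming frames are available as 8-bit NumPy arrays; not the exact code of quality_assessment.py):

```python
import numpy as np

def psnr(reference: np.ndarray, compressed: np.ndarray) -> float:
    """Return PSNR in dB between two 8-bit images of equal shape."""
    mse = np.mean((reference.astype(np.float64) -
                   compressed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    max_i = 255.0            # maximum intensity for 8-bit samples
    return 10.0 * np.log10(max_i ** 2 / mse)
```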

3.2. Structural Similarity (SSIM)

SSIM is also an FR objective image quality metric, which results in a range of values between 0 and 1 (the higher, the better). It is a perception-based model that considers image degradation as a perceived change in structural information while also incorporating important HVS characteristics. In comparison with PSNR, SSIM quality evaluation is commonly more correlated with subjective results [25].
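For completeness, a sketch of computing SSIM with scikit-image follows; this is our illustration of the metric, not necessarily the implementation behind quality_assessment.py:

```python
import numpy as np
from skimage.metrics import structural_similarity

# Toy 8-bit frames; in practice these would be decoded video frames.
reference = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
noise = np.random.randint(-3, 4, reference.shape)
compressed = np.clip(reference.astype(int) + noise, 0, 255).astype(np.uint8)

score = structural_similarity(reference, compressed, data_range=255)
print(f"SSIM: {score:.4f}")  # in [0, 1]; higher is better
```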

3.3. Video Multi-Method Assessment Fusion (VMAF)

Developed by Netflix [26], VMAF is another FR perceptual image/video quality assessment metric that combines human vision modeling with machine learning. This metric outputs a single score by fusing multiple objective quality features. The fusion model is based on Support Vector Machine (SVM) regression, and it is derived using machine learning over a set of subjective test results. VMAF scores range from 0 to 100, with 100 indicating the best possible quality. A good way to think about a VMAF score is to linearly map it to the human opinion scale.
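Since we compute VMAF through FFMPEG (the generate_vmaf.py step), the per-video invocation can be sketched as follows; the file names are illustrative, and an FFmpeg build with the libvmaf filter enabled is assumed:

```python
import subprocess

# libvmaf treats the first input as the distorted video and the second
# as the reference; "-f null -" discards the decoded frames.
cmd = [
    "ffmpeg",
    "-i", "encoded.mp4",    # compressed test video
    "-i", "reference.mov",  # raw reference sequence
    "-lavfi", "libvmaf",
    "-f", "null", "-",
]
# FFmpeg prints the aggregate VMAF score on stderr; save it for avg_vmaf.py.
with open("encoded_vmaf.txt", "w") as log:
    subprocess.run(cmd, stderr=log, check=True)
```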

3.4. Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE)

BRISQUE [27] is an NR image/video quality metric that uses a pre-trained model for outputting scores. The model is trained on a set of images with various degradations, such as noise and compression artefacts. A smaller score indicates better perceptual quality.

4. Experimental Results

The results presented in this section were obtained in experiments conducted on BURA, using one node for encoding each sequence and 24 nodes for assessing the quality. For this research, we needed raw video from the camera in order to avoid any possibility that previous compression led to degradation, which could influence the comparison results. We filmed several sequences of 4K/60 fps drone flights with a DJI ZENMUSE X5S camera on a DJI Inspire 2 drone.

4.1. Test Sequence Classification

All sequences were filmed in a rural area on a mostly cloudy day. Every sequence captured a different scene with various spatio-temporal characteristics. As mentioned in [28], scene complexity plays an important role in determining the compression quality. As spatial and temporal complexity increases, a higher bitrate is required to achieve satisfactory quality. Spatial information (or complexity) is commonly determined using an edge energy metric. This means that images with lots of smooth areas contain a small amount of spatial information (SI), while images with lots of spatially high-frequency areas contain a lot of spatial information. Temporal information (TI) is approximated from the difference between two consecutive frames. The higher the difference, the more temporal information. SI and TI can be calculated using mathematical models; for example, ITU-T P.910 [29] defines SI as in Equation (2) and TI as in Equations (3) and (4):

$$SI = \max_{time}\{SI_{std}\}, \quad (2)$$

$$TI = \max_{time}\{\mathrm{std}[M_{p,n}]\}, \quad (3)$$

$$M_{p,n} = F_{p,n} - F_{p,n-1}, \quad (4)$$

where $SI_{std}$ is the standard deviation over space of the Sobel-filtered frame, computed for each frame in a sequence, and the maximum among all frames is the final $SI$; and $M_{p,n}$ is the pixel intensity difference between the current frame ($F_{p,n}$) and the previous frame ($F_{p,n-1}$). Again, the standard deviation of $M_{p,n}$ is computed for each frame in a sequence, and the maximum is the final $TI$. These objective $SI$ and $TI$ approximation methods were defined in 2008, so to obtain the most credible results, we approximated $SI$ and $TI$ both subjectively and objectively for every sequence. We classified the sequences into four classes, from A to D (Figure 2), which are described as follows.
  • A: low spatial and low temporal complexity;
  • B: low spatial and high temporal complexity;
  • C: high spatial and low temporal complexity;
  • D: high spatial and high temporal complexity.
For example, a scene capturing a slow-moving panoramic view of crops would represent class A, while a scene capturing dynamic flight close to various obstacles would represent class D. We captured 16 sequences (Figure 3) in total and subjectively classified exactly four sequences into each class. When using objective approximations, each class contained at least two sequences.
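As a concrete sketch of the objective approximation in Equations (2)-(4), the following computes SI and TI with OpenCV's Sobel operator; this is our illustration, not necessarily the exact tooling used for the classification:

```python
import cv2
import numpy as np

def si_ti(frames):
    """Approximate ITU-T P.910 SI and TI for a list of grayscale frames."""
    si_values, ti_values = [], []
    prev = None
    for frame in frames:
        f = frame.astype(np.float64)
        # SI: spatial std-dev of the Sobel edge magnitude, per frame.
        gx = cv2.Sobel(f, cv2.CV_64F, 1, 0)
        gy = cv2.Sobel(f, cv2.CV_64F, 0, 1)
        si_values.append(np.hypot(gx, gy).std())
        # TI: std-dev of the frame difference M_{p,n}, per frame pair.
        if prev is not None:
            ti_values.append((f - prev).std())
        prev = f
    return max(si_values), max(ti_values)  # maxima over time
```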
Figure 4 shows an example of low spatial complexity and high spatial complexity frames. Figure 5 is an example of a low temporal complexity video, while Figure 6 is an example of a high temporal complexity video. Small green and red arrows are the motion vectors generated by the Elecard StreamEye software [30].
It is difficult to describe a video through a single frame. Just by looking at the frames, it might seem that some are identical, which is not the case. For example, A1 and B1 capture a similar scene, but the drone movement is very different. In A1, the drone moves slowly while in B1, the drone flies very fast and the video is much shorter. Table 1 displays basic information for every sequence.

4.2. Obtained Data

We represent the obtained data by drawing parallel coordinates using the Python package Plotly (Figure 7). The plot is interactive (Figure 8 and Figure 9) and very informative, but is not clear in static image form. For this reason, we extracted the crucial data and visualize them as tables and simpler graphs in the following paragraphs. Instead of showing QP and target bitrate values, we only consider the actual bitrate of the encoded videos, as the results indicated that the QP and target bitrate parameters affect each other too strongly, resulting in redundant data. Note that the last parallel axis represents a BRISQUE score of 100 to help make the chart look more intuitive.
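A sketch of the draw.py step is shown below; the column names are assumptions about the data_final.csv layout, and plotly.express requires numeric columns for parallel coordinates, which is why the categorical codec type is omitted here:

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("data_final.csv")

# One axis per encoding parameter and quality metric; lines colored by VMAF.
fig = px.parallel_coordinates(
    df,
    dimensions=["actual_bitrate", "avg_qp", "gop_size",
                "psnr", "ssim", "brisque", "vmaf"],
    color="vmaf",
)
fig.show()  # opens the interactive chart in a browser
```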
The percentages shown in Table 2 represent libx265 scores relative to libx264 scores for the subjective SI and TI approximation, and Table 3 presents the results for the objective approximation. For example, if the libx264 PSNR score is 35 and the libx265 PSNR score is 40, the table will display a value of 14.3%. For brevity, only the libx264 bitrate is displayed, with the relative percentage difference of the libx265 bitrate in parentheses.
We averaged the values of the samples from each class and display only the averages. Furthermore, we averaged across all samples and display an overall result for each bitrate. The VMAF data from Table 2 are depicted in Figure 10; note that the size of a bubble represents the encoding time.

4.3. GOP Size, QP, and Target Bitrate Analysis

One can notice large deviations between libx264 and libx265 at the highest and lowest bitrates. As previously mentioned, QP and target bitrate are strongly correlated, and oftentimes, a QP restriction renders the target bitrate completely irrelevant. In other words, QP restrictions sometimes cause codecs to deliver videos with much higher or lower bitrates than targeted. Both codecs were tested using the same set of parameters and, unlike libx264, libx265 almost always managed to output videos with bitrates very close to the target.
Although GOP sizes greater than 90 are possible, they are not advisable for UAV video streaming due to possible errors when transmitting video data over a lossy network. We noticed that the GOP size did not have a significant impact on the compression quality. For the same encoding configurations (i.e., same codec, min and max QP, and target bitrate), the VMAF scores varied by less than ±1.5% for GOP sizes of 90, 120, and 150. The other quality measures showed similar results, and the encoding time was practically unaffected.
Regarding sequence classification, there were noticeable differences between classes A, B, C, and D in the bitrates required for the same video quality. In Figure 10, considering the VMAF scores for all four classes and the two encoders, classes D and especially C required much higher bitrates to reach the same video quality. The same increase in bitrate can be seen regardless of the target video quality. This demonstrates that an increase in spatial complexity directly causes an increase in the output bitrate of the video sequence.

4.4. Energy Consumption

The energy consumed in encoding a video sequence is proportional to the time spent encoding it. Encoding can be performed in parallel on several cores or on a single core; the total amount of processing time is almost the same in either case. For this kind of high-resolution video, parallelization of both the H.264 and H.265 encoders is highly efficient, and they scale almost perfectly, especially in the range of up to 16 cores [31,32,33].
Depending on the codec, different parallelization strategies can be applied, such as slice-level, frame-level, intra-frame macroblock-level, or 3D-Wave parallelism [34]. In addition, hardware implementations (or additional hardware specialized for multimedia processing) are often found in the SoCs used for drones. Such hardware-accelerated encoders achieve better performance than software implementations. Different vendors offer video encoder and decoder semiconductor IP cores, such as the Hantro series from VeriSilicon. In SoCs, these are used for encoding and decoding alongside processors such as the ARM Cortex-A53 with several processing cores. A precise assessment of the hardware needed for real-time encoding of 4K video would require additional research with the particular hardware intended for deployment on the drone.

5. Conclusions and Future Work

In this paper, we provided an efficiency comparison of the two most popular video encoders—x264 and x265—for use in UAV systems. We devised a test environment on the BURA HPC system and successfully obtained and analyzed compressed footage data. After filming 16 drone sequences, we encoded them while varying the GOP size, min and max QP, and target bitrate. For each encoded sequence, we calculated four objective quality measures and tracked the encoding time. The overall results confirmed our assumptions and indicated that libx265 outperformed the libx264 codec in every tested aspect except encoding time.
In all test sequences, x265 outperformed x264 by up to 10% in PSNR quality and up to 5% in SSIM quality. With 4K footage compressed up to 10 Mbps rates, BRISQUE scores demonstrated that the quality with x265 was up to 11% better than that with x264; furthermore, for VMAF, x265 outperformed x264 by up to 20%.
All four quality metrics (PSNR, SSIM, BRISQUE, and VMAF) showed that the quality of drone video encoded at the same bitrate with x265 was better than that of x264. Encoding times for the two encoders showed that x265 requires higher processing power: up to 51%, on average, depending on the encoding parameters. This means that x264 might be a better pick than x265 when parameters such as battery capacity or encoding latency are of the essence.
x265 showed its compression superiority especially at lower bitrates, while at the highest bitrates, where the compression factors were lower, both codecs produced compressed videos with no significant difference in quality.
Classification of the sequences showed differences in the bitrate, depending on the temporal and spatial complexity of the video. A higher spatial complexity of the video introduced an additional increase in the size of the encoded video at the same target video quality. This can be used when setting the parameters for the encoder and estimating the output bitrate based on the type of video scene.
Future research could include a comparison of wavelet-based video codecs, which have gained a lot of attention lately. Sequence classification—or, more precisely, the spatial and temporal complexity—could be approximated using mathematical models.

Author Contributions

Conceptualization, D.H.; methodology, J.B. and D.H.; software, J.B.; validation, J.B. and D.H.; formal analysis, J.B. and D.H.; investigation, J.B.; resources, D.H.; data curation, J.B.; writing—original draft preparation, J.B., D.H., J.K. and M.Ž.; writing—review and editing, J.B., D.H., J.K. and M.Ž.; visualization, J.B. and D.H.; supervision, D.H.; project administration, D.H.; funding acquisition, D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported in part by the project KK.01.2.1.02.0054 Razvoj uređaja za prijenos video signala ultra niske latencije (Development of ultra low latency video signal transmission device), financed by the EU from the European Regional Development Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank Ivica Vranjić for providing the DJI ZENMUSE X5S camera and DJI Inspire 2 drone for the experiments carried out in this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cisco. VNI Complete Forecast Highlights; Cisco: San Jose, CA, USA, 2018.
  2. Soong, H.C.; Lau, P.Y. Video quality assessment: A review of full-referenced, reduced-referenced and no-referenced methods. In Proceedings of the 2017 IEEE 13th International Colloquium on Signal Processing and Its Applications (CSPA), Penang, Malaysia, 10–12 March 2017; pp. 232–237.
  3. Altinisik, E.; Tasdemir, K.; Sencar, H.T. Mitigation of H.264 and H.265 Video Compression for Reliable PRNU Estimation. IEEE Trans. Inf. Forensics Secur. 2020, 15, 1557–1571.
  4. Li, Z.N.; Drew, M.S.; Liu, J. Modern Video Coding Standards: H.264, H.265, and H.266. In Fundamentals of Multimedia; Springer: Cham, Switzerland, 2021; pp. 423–478.
  5. De Cock, J.; Mavlankar, A.; Moorthy, A.; Aaron, A. A large-scale video codec comparison of x264, x265 and libvpx for practical VOD applications. Appl. Digit. Image Process. XXXIX 2016, 9971, 997116.
  6. Guo, L.; De Cock, J.; Aaron, A. Compression Performance Comparison of x264, x265, libvpx and aomenc for On-Demand Adaptive Streaming Applications. In Proceedings of the 2018 Picture Coding Symposium, San Francisco, CA, USA, 24–27 June 2018; pp. 26–30.
  7. Shang, Z.; Ebenezer, J.P.; Wu, Y.; Wei, H.; Sethuraman, S.; Bovik, A.C. Study of the Subjective and Objective Quality of High Motion Live Streaming Videos. IEEE Trans. Image Process. 2022, 31, 1027–1041.
  8. Cika, P.; Kovac, D.; Skorpil, V.; Srnec, T. Subjective comparison of modern video codecs. In Proceedings of the 2017 Progress In Electromagnetics Research Symposium—Spring (PIERS), St. Petersburg, Russia, 22–25 May 2017; pp. 776–779.
  9. Mansri, I.; Doghmane, N.; Kouadria, N.; Harize, S.; Bekhouch, A. Comparative Evaluation of VVC, HEVC, H.264, AV1, and VP9 Encoders for Low-Delay Video Applications. In Proceedings of the 2020 4th International Conference on Multimedia Computing, Networking and Applications, Valencia, Spain, 19–22 October 2020; pp. 38–43.
  10. Barman, N.; Martini, M.G. H.264/MPEG-AVC, H.265/MPEG-HEVC and VP9 Codec Comparison for Live Gaming Video Streaming; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2017.
  11. Paredes, C.I.; Mezher, A.M.; Igartua, M.A. Performance Comparison of H.265/HEVC, H.264/AVC and VP9 Encoders in Video Dissemination over VANETs. In Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering; Springer: Cham, Switzerland, 2017; pp. 51–60.
  12. Knezović, J.; Čavrak, I.; Hofman, D. Parallelizing MPEG Decoder with Scalable Streaming Computation Kernels. Automatika 2014, 55, 359–371.
  13. Hassan, H.; Khan, M.N.; Gilani, S.O.; Jamil, M.; Maqbool, H.; Malik, A.W.; Ahmad, I. H.264 Encoder Parameter Optimization for Encoded Wireless Multimedia Transmissions. IEEE Access 2018, 6, 22046–22053.
  14. Silic, M.; Suznjevic, M.; Skorin-Kapov, L. QoE assessment of FPV drone control in a cloud gaming based simulation. In Proceedings of the 2021 13th International Conference on Quality of Multimedia Experience, Montreal, QC, Canada, 14–17 June 2021; pp. 175–180.
  15. Viitanen, M.; Sainio, J.; Mercat, A.; Lemmetti, A.; Vanne, J. From HEVC to VVC: The First Development Steps of a Practical Intra Video Encoder. IEEE Trans. Consum. Electron. 2022, 68, 139–148.
  16. FFmpeg. Available online: https://www.ffmpeg.org/ (accessed on 27 April 2022).
  17. Plotly Python Graphing Library. Available online: https://plotly.com/python/ (accessed on 27 April 2022).
  18. Computing Resources—Center for Advanced Computing and Modelling. Available online: https://cnrm.uniri.hr/bura/ (accessed on 27 April 2022).
  19. Slurm Workload Manager—Quick Start User Guide. Available online: https://slurm.schedmd.com/quickstart.html (accessed on 6 June 2022).
  20. GitHub—NTIA/vqm: Video Quality Metrics. Available online: https://github.com/NTIA/vqm (accessed on 6 June 2022).
  21. Brunnström, K.; Djupsjöbacka, A.; Andrén, B. Objective Video Quality Assessment Methods for Video Assistant Refereeing (VAR) System (ver 1.1); RISE Research Institute of Sweden: Gothenburg, Sweden, 2021.
  22. Toward a Practical Perceptual Video Quality Metric. Netflix Technology Blog. Available online: https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652 (accessed on 6 June 2022).
  23. Topiwala, P.; Dai, W.; Pian, J.; Biondi, K.; Krovvidi, A. VMAF and Variants: Towards A Unified VQA. arXiv 2021, arXiv:2103.07770.
  24. Ohm, J.R.; Sullivan, G.J.; Schwarz, H.; Tan, T.K.; Wiegand, T. Comparison of the coding efficiency of video coding standards—including high efficiency video coding (HEVC). IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1669–1684.
  25. Setiadi, D.R.I.M. PSNR vs SSIM: Imperceptibility quality assessment for image steganography. Multimed. Tools Appl. 2020, 80, 8423–8444.
  26. GitHub—Netflix/vmaf: Perceptual Video Quality Assessment Based on Multi-Method Fusion. Available online: https://github.com/Netflix/vmaf (accessed on 27 April 2022).
  27. Mittal, A.; Moorthy, A.K.; Bovik, A.C. Blind/referenceless image spatial quality evaluator. In Proceedings of the 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Pacific Grove, CA, USA, 6–9 November 2011; pp. 723–727.
  28. Barman, N.; Khan, N.; Martini, M.G. Analysis of spatial and temporal information variation for 10-bit and 8-bit video sequences. In Proceedings of the IEEE International Workshop on Computer Aided Modeling and Design of Communication Links and Networks, Limassol, Cyprus, 11–13 September 2019.
  29. ITU-T Study Group. ITU-T Rec. P.910 (04/2008) Subjective Video Quality Assessment Methods for Multimedia Applications; Technical Report; International Telecommunication Union: Geneva, Switzerland, 2008.
  30. Elecard StreamEye. Available online: https://www.elecard.com/products/video-analysis/streameye (accessed on 27 April 2022).
  31. Radicke, S.; Hahn, J.U.; Wang, Q.; Grecos, C. Many-core HEVC encoding based on wavefront parallel processing and GPU-accelerated motion estimation. Commun. Comput. Inf. Sci. 2015, 554, 393–417.
  32. Sankaraiah, S.; Shuan, L.H.; Eswaran, C.; Abdullah, J. Scalable video encoding with macroblock-level parallelism. EURASIP J. Adv. Signal Process. 2014, 2014, 145.
  33. Amit, G.; Pinhas, A. Real-Time H.264 Encoding by Thread-Level Parallelism: Gains and Pitfalls. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Phoenix, AZ, USA, 14–16 November 2005.
  34. Meenderinck, C.; Azevedo, A.; Juurlink, B.; Alvarez Mesa, M.; Ramirez, A. Parallel Scalability of Video Decoders. J. Signal Process. Syst. 2009, 57, 173–194.
Figure 1. Script execution flowchart.
Figure 2. Sequence classification.
Figure 3. Class A, B, C, and D frames.
Figure 4. Example of low spatial complexity (left) and high spatial complexity (right) images.
Figure 5. Motion vectors—low temporal complexity example.
Figure 6. Motion vectors—high temporal complexity example.
Figure 7. Video encoding results for x264 (purple) and x265 (teal) with various parameters, encoding times, and quality data, interpreted using parallel coordinates.
Figure 8. Parallel coordinates filtered for x264 (purple); the filter is shown as a magenta line.
Figure 9. Parallel coordinates filtered for x265 (teal); the filter is shown as a magenta line.
Figure 10. Classes A–D VMAF scores.
Table 1. Basic sequence information.

Sequence ID | Sequence alias | Class | Duration (s)
S1  | A1 | A | 5.71
S2  | A2 | A | 12.57
S3  | A3 | A | 3.95
S4  | A4 | A | 12.62
S5  | B1 | B | 2.95
S6  | B2 | B | 7.05
S7  | B3 | B | 21.18
S8  | B4 | B | 7.58
S9  | C1 | C | 30.97
S10 | C2 | C | 7.80
S11 | C3 | C | 12.38
S12 | C4 | C | 16.35
S13 | D1 | D | 4.22
S14 | D2 | D | 12.32
S15 | D3 | D | 19.17
S16 | D4 | D | 8.82
Table 2. libx265 results relative to libx264 (libx264 anchor), subjective sequence classification.

Class | Bitrate (Mbps) | PSNR (More Is Better) | SSIM (More Is Better) | BRISQUE (Less Is Better) | VMAF (More Is Better) | Encoding Time (Less Is Better)
A | 2.75 (−30.56%) | 9.11% | 5.55% | −7.39% | 19.39% | 19.48%
A | 4.12 (1.67%) | 8.37% | 3.52% | −7.99% | 14.31% | 25.76%
A | 5.88 (−1.90%) | 4.87% | 1.88% | −4.44% | 7.39% | 24.78%
A | 7.59 (0.88%) | 3.47% | 1.34% | −2.73% | 4.88% | 25.84%
A | 27.28 (−27.52%) | 0.33% | 0.23% | 16.25% | 0.71% | 37.27%
B | 2.66 (−25.34%) | 9.87% | 4.28% | −9.07% | 22.58% | 23.47%
B | 4.17 (1.96%) | 7.30% | 3.09% | −7.82% | 13.21% | 25.00%
B | 5.97 (1.81%) | 4.71% | 1.89% | −5.51% | 7.82% | 26.43%
B | 7.66 (2.27%) | 3.05% | 1.19% | −3.81% | 4.62% | 28.97%
B | 26.79 (−29.17%) | 0.48% | 0.29% | 11.85% | 0.70% | 44.63%
C | 4.49 (−22.53%) | 7.45% | 7.23% | −0.21% | 20.24% | 47.96%
C | 5.46 (−2.57%) | 8.93% | 7.06% | −1.04% | 21.18% | 20.99%
C | 7.24 (0.08%) | 6.84% | 4.72% | −0.12% | 13.84% | 38.00%
C | 13.03 (−9.36%) | 2.93% | 1.82% | 1.61% | 4.17% | 15.53%
C | 61.48 (1.07%) | 0.73% | 0.38% | 8.67% | 0.49% | 32.21%
D | 3.36 (−33.23%) | 8.63% | 5.00% | −7.56% | 24.43% | 48.32%
D | 4.06 (1.79%) | 10.31% | 6.07% | −7.83% | 24.65% | 33.10%
D | 5.91 (0.46%) | 6.41% | 3.14% | −4.81% | 11.82% | 42.63%
D | 9.48 (−3.64%) | 3.96% | 1.69% | −3.79% | 4.41% | 44.69%
D | 39.86 (−14.54%) | 0.63% | 0.31% | 4.67% | 0.19% | 80.38%
Overall | 2.49 (−26.91%) | 8.24% | 4.08% | −10.45% | 19.17% | 24.19%
Overall | 3.95 (−18.61%) | 3.91% | 2.18% | −7.88% | 9.69% | 31.75%
Overall | 5.79 (−16.99%) | 6.24% | 3.68% | −3.59% | 12.47% | 9.05%
Overall | 8.69 (−8.69%) | 3.92% | 2.22% | −0.36% | 5.79% | 11.71%
Overall | 36.37 (−14.82%) | 1.26% | 0.72% | 6.45% | 1.35% | 50.86%
Table 3. libx265 results relative to libx264 (libx264 anchor), objective sequence classification.

Class | Bitrate (Mbps) | PSNR (More Is Better) | SSIM (More Is Better) | BRISQUE (Less Is Better) | VMAF (More Is Better) | Encoding Time (Less Is Better)
A | 2.54 (−21.91%) | 9.11% | 4.98% | −5.49% | 18.61% | 21.91%
A | 4.35 (−1.35%) | 6.76% | 2.91% | −4.18% | 10.36% | 28.48%
A | 5.98 (−6.38%) | 4.17% | 1.72% | −2.28% | 6.06% | 28.40%
A | 7.52 (−1.36%) | 3.51% | 1.40% | −1.18% | 4.79% | 27.57%
A | 26.46 (−31.79%) | 0.25% | 0.20% | 15.88% | 0.58% | 34.27%
B | 2.66 (−25.34%) | 9.87% | 4.28% | −9.07% | 22.58% | 23.47%
B | 4.17 (1.96%) | 7.30% | 3.09% | −7.82% | 13.21% | 25.00%
B | 5.97 (1.81%) | 4.71% | 1.89% | −5.51% | 7.82% | 26.43%
B | 7.66 (2.27%) | 3.05% | 1.19% | −3.81% | 4.62% | 28.97%
B | 26.79 (−29.17%) | 0.48% | 0.29% | 11.85% | 0.70% | 44.63%
C | 4.91 (−20.15%) | 6.62% | 7.64% | 7.34% | 15.97% | 25.34%
C | 5.82 (−3.51%) | 8.54% | 8.33% | 6.30% | 19.03% | 28.14%
C | 7.46 (0.32%) | 7.18% | 5.73% | 8.63% | 13.75% | 33.08%
C | 14.42 (−7.83%) | 2.68% | 1.66% | 11.64% | 3.10% | 33.14%
C | 72.61 (3.10%) | 0.73% | 0.35% | 4.18% | 0.35% | 87.68%
D | 4.22 (−34.38%) | 8.42% | 5.19% | −8.36% | 27.94% | 32.31%
D | 4.66 (1.09%) | 10.81% | 6.16% | −7.38% | 30.44% | 30.75%
D | 6.49 (2.08%) | 7.56% | 3.69% | −5.04% | 17.23% | 43.48%
D | 12.27 (−9.44%) | 3.61% | 1.71% | −3.69% | 4.49% | 49.94%
D | 50.25 (−11.93%) | 0.70% | 0.31% | 3.49% | 0.15% | 110.86%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
