Article
Peer-Review Record

A High-Performance Computing Cluster for Distributed Deep Learning: A Practical Case of Weed Classification Using Convolutional Neural Network Models

Appl. Sci. 2023, 13(10), 6007; https://doi.org/10.3390/app13106007
by Manuel López-Martínez 1,†, Germán Díaz-Flórez 1,*,†, Santiago Villagrana-Barraza 1, Luis O. Solís-Sánchez 1, Héctor A. Guerrero-Osuna 1, Genaro M. Soto-Zarazúa 2 and Carlos A. Olvera-Olvera 1,*,†
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 8 February 2023 / Revised: 10 May 2023 / Accepted: 10 May 2023 / Published: 13 May 2023
(This article belongs to the Topic Theory and Applications of High Performance Computing)

Round 1

Reviewer 1 Report

This paper describes the design, development, and implementation of an HPCC for image processing and analysis using deep learning techniques. However, the writing logic is messy, which confuses the reader and makes it difficult to identify the authors' major innovations.

1. The objective of this article is to design an HPCC from desktop computers at hand with high performance and low cost. In terms of performance, it is not fair to compare the proposed HPCC with a single computer; it is strongly recommended that some distributed or parallel approaches be compared to prove the performance of the HPCC. In terms of cost, it is not clear whether USD $4000 is the total value of the computers or the new investment. There is no doubt that the cost of a single computer in the HPCC is not higher than USD $4000, so this comparison of the HPCC and a single computer is not acceptable. It is recommended to compare the accuracy and time on hardware at this price, for example a workstation or an FPGA.

2. In the discussion, the authors state that “Our proposal highlights the importance of using HPC for image processing using AI techniques to reduce processing times and efforts when training models or implementing algorithms that require a large amount of computing power, which is of considerable benefit compared to if you take into account using a single computer.” But this is a widely accepted conclusion, so what is the practical significance of the article?

3. As an article about technology, it is not enough to introduce how to design and configure the HPCC. The difficulties of achieving high cost performance for an HPCC should be discussed, and improved methods should be proposed to solve these problems and improve performance. For example, parallel distributed task scheduling is very important for running on multiple computers, but the article only states that Horovod was used to train the VGG16 and InceptionV3 models in parallel. This description lacks in-depth analysis and discussion. Therefore, the description in this article is unsatisfactory.

Author Response

We would like to thank you for critically reviewing our manuscript. Attached you will find the latest revised version of our manuscript addressing your remarks. We are pleased to re-submit the research article "Design, development, and implementation of a High-Performance Computing Cluster: A practical case, weed classification using Convolutional Neural Networks models." In this new version the article has been renamed; it is now titled "A High-Performance Computing Cluster for Distributed Deep Learning: A practical case of weed classification using convolutional neural network models".

 

Response to first reviewer comments

Review report 1

1. Reviewer comment

The objective of this article is to design an HPCC from desktop computers at hand with high performance and low cost. In terms of performance, it is not fair to compare the proposed HPCC with a single computer; it is strongly recommended that some distributed or parallel approaches be compared to prove the performance of the HPCC. In terms of cost, it is not clear whether USD $4000 is the total value of the computers or the new investment. There is no doubt that the cost of a single computer in the HPCC is not higher than USD $4000, so this comparison of the HPCC and a single computer is not acceptable. It is recommended to compare the accuracy and time on hardware at this price, for example a workstation or an FPGA.

Answer

First of all, we would like to express our sincere thanks to the reviewer for his careful review. A new version of the paper is presented, since the changes suggested by the reviewer have been implemented.

In relation to the comment "it is not fair...", a workstation was acquired to repeat the experiments and to carry out the comparison with the proposed HPCC.

Regarding "It is strongly...", we believe that implementing Horovod on both the HPCC and the workstation gives us the performance results sought in this article, so we do not consider it necessary to implement other distributed or parallel approaches, which could be considered for future work. Furthermore, based on the work by Lerat et al. [35], we decided to use the Horovod framework, since in terms of acceleration it reduces processing times when handling large amounts of data in a distributed manner.
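For readers unfamiliar with this framework, a minimal sketch of the data-parallel pattern that Horovod implements on top of Keras is shown below. It is only illustrative: the launch command, host names, class count, and hyperparameters are assumptions, not the exact configuration used in our experiments.

# Minimal Horovod + Keras data-parallel training sketch (illustrative only).
# Assumed launch across three nodes with two processes each, e.g.:
#   horovodrun -np 6 -H master:2,node01:2,node02:2 python train_weeds.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one worker process per allocated CPU core / GPU slot

# Placeholder data; each worker trains on its own shard of the dataset.
images = tf.random.uniform((64, 224, 224, 3))
labels = tf.random.uniform((64,), maxval=9, dtype=tf.int32)  # 9 assumed classes
dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .shard(num_shards=hvd.size(), index=hvd.rank())
           .batch(8))

model = tf.keras.applications.VGG16(weights=None, classes=9,
                                    input_shape=(224, 224, 3))

# Scale the learning rate by the number of workers and wrap the optimizer so
# that gradients are averaged across workers (allreduce) after every batch.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=opt, metrics=["accuracy"])

callbacks = [
    # Broadcast rank 0's initial weights so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(dataset, epochs=3, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)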

Regarding the comment "In Cost, it is ....", we decided to eliminate the cost, since after analyzing the paragraph and what was written, the simple fact of mentioning that it is a high-performance infrastructure, that although we can consider its low cost, requires a large investment, so the cost is no longer relevant.

Finally, thank you for the recommendation to use an FPGA or a workstation as a point of comparison for our proposed HPCC. For this purpose, a workstation was purchased to carry out the experiments.

 

2. Reviewer comment

In the discussion, the authors state that “Our proposal highlights the importance of using HPC for image processing using AI techniques to reduce processing times and efforts when training models or implementing algorithms that require a large amount of computing power, which is of considerable benefit compared to if you take into account using a single computer.” But this is a widely accepted conclusion, so what is the practical significance of the article?

Answer

It is a fact that applying parallel or distributed computing always yields a benefit in both time and effort when carrying out complex calculations. For this project in particular, the practical significance lies in showing the scientific community without access to this technology, either physically or in the cloud, that it can implement an HPCC with the computers already available in a small laboratory or computing center. A clear example is our university, which has an HPCC that is used only by the physics faculty; the other faculties that need infrastructure to process large amounts of data face two difficulties: first, the users' needs do not fit the HPCC applications, and second, the administrators do not grant access to it. In other words, although there is an HPCC at our university, its use is restricted. In the discussion and conclusion sections, this practical significance is made clearer. Both sections were replaced with the following text:

Discussion

Our study demonstrates the crucial role of HPC in image processing, specifically when using distributed DL to classify different types of weeds. By using HPC, we significantly reduced the processing times and the effort required to train models or implement computationally demanding algorithms. Compared with the use of a single computer, our proposal offers a considerable benefit in terms of efficiency and profitability, making it a valuable alternative for university and scientific communities that may not have access to an HPC infrastructure listed in the TOP500 or to a cloud-based service. Our research focused on DIP and AI applications, where we extracted important features, such as the characteristic plant greens, to improve weed-type classification. We used a color-based segmentation method to preprocess the images, which gave good results.
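As an illustration of this kind of color-based segmentation, a minimal OpenCV sketch is shown below; the HSV thresholds and the morphological filtering step are illustrative assumptions rather than the exact values used in our pipeline.

# Illustrative color-based segmentation: keep the characteristic plant greens
# and suppress the background before classification (thresholds are assumed).
import cv2
import numpy as np

def segment_green(path: str) -> np.ndarray:
    bgr = cv2.imread(path)                       # OpenCV loads images as BGR
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

    # Approximate hue/saturation/value range for green vegetation.
    lower = np.array([35, 40, 40])
    upper = np.array([85, 255, 255])
    mask = cv2.inRange(hsv, lower, upper)

    # Remove small speckles from the mask before applying it.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    return cv2.bitwise_and(bgr, bgr, mask=mask)  # background pixels set to 0

# Example: segmented = segment_green("weed_sample.jpg")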

When implementing the HPCC, we ensured the control and correct installation of the application versions to ensure compatibility with the hardware installed in the computing nodes. The application versions that worked successfully for this project were CUDA 11.7, NCCL 2.1, cuDNN 7.5, and OpenMPI 4.0. We also ensured that the versions of the TensorFlow, Keras, and Horovod libraries were compatible with these applications. It is important to highlight that the implementation process was very complex and time consuming, since when the library versions did not match or were not compatible, the HPCC did not work correctly.
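As a small aid for reproducing this kind of software stack, the sketch below shows how the installed versions and Horovod build options can be checked from Python on each node before launching distributed jobs; it assumes the libraries mentioned above are already installed.

# Sanity check of the software stack before launching distributed training.
import horovod
import horovod.tensorflow as hvd
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("Keras     :", tf.keras.__version__)
print("Horovod   :", horovod.__version__)

# Horovod reports which collective-communication backends it was built with.
print("Horovod built with MPI :", hvd.mpi_built())
print("Horovod built with NCCL:", hvd.nccl_built())
print("Horovod built with Gloo:", hvd.gloo_built())

# CUDA/cuDNN versions expected by this TensorFlow build, and visible GPUs.
print("TensorFlow build info  :", tf.sysconfig.get_build_info())
print("Visible GPUs           :", tf.config.list_physical_devices("GPU"))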

Our research results showed that, using the HPCC, we achieved a maximum accuracy of 82% with a training time of over 4 hours, whereas the work of Olsen et al. [22] reported an average accuracy between 95.1% and 95.7% with a training time of 13 hours using a single graphics card.

One of the critical metrics obtained in our research was accuracy, where we achieved the highest value of 0.82 using a single process in the proposed HPCC. However, the longer training times may have been due to the need for more epochs to achieve better loss values in the CNN VGG16 model. The number of epochs depends on the number of processes introduced to the cluster, which explains why the metric values varied, as shown in Tables 4, 5, and 6.

An interesting finding was that applying six processes in the HPCC resulted in the best time for both CNN models. By manually balancing the load and distributing two processes as a parameter in each node, we were able to utilize 50% of the processor capacity in each computing node. This trend suggests that utilizing eight processes in an HPCC with additional computing nodes could yield even better processing times.

Finally, our research indicates that increasing the number of computing nodes could improve the model's performance and execution times. Despite the limitations, our proposal offers a low-cost alternative that can be applied to different research areas that require HPC, such as DIP and AI.

 

Conclusions and future work

HPC is an accessible tool for the scientific and university communities. Implementing an HPCC with the characteristics presented in this proposal makes it possible, on a small budget, to use HPC to provide solutions or develop projects that involve DIP and AI techniques in a distributed way in small laboratories and research centers. This project demonstrates the benefit in processing time (a reduction of more than 60%) and the precision obtained when an application development environment for DIP and AI, which have become very important techniques today, is implemented.

Although there are high-performance infrastructures in the world much larger than the one proposed in this article, not everyone in the scientific community can access them, due to their high cost and low accessibility, unless a considerable sum of money is invested.

Based on the experience acquired, it can be added that in an HPCC the network resources are a critical component, essential to guarantee efficient and secure communication between cluster nodes and with other network resources. HPCCs require a high-quality network infrastructure to function optimally and achieve the benefits of high-performance computing.

For this reason, in this project, an accessible infrastructure is proposed that considerably reduces processing times and efforts compared to using a single computer.

As future work, it is proposed to integrate new computing nodes and GPU acceleration; although NCCL was used together with OpenMPI for the collective communication libraries, they could only be exploited by the CPUs and not by the installed NVIDIA cards. Implementing GPU acceleration could further reduce processing times and would allow testing multiple CNN models that were not used in this project.

 

3. Reviewer comment

As an article about technology, it is not enough to introduce how to design and configure the HPCC. The difficulties of achieving high cost performance for an HPCC should be discussed, and improved methods should be proposed to solve these problems and improve performance. For example, parallel distributed task scheduling is very important for running on multiple computers, but the article only states that Horovod was used to train the VGG16 and InceptionV3 models in parallel. This description lacks in-depth analysis and discussion. Therefore, the description in this article is unsatisfactory.

Answer

Regarding the comment "Should be discussed...", this was corrected and in the new version of the article the high-cost performance part was addressed, mentioning scalability in terms of applications and interconnected devices, in fact, an analysis of the total theoretical yield of the HPCC and the proposed workstation was done. It is important to mention that our objective is to show a methodology to implement distributed deep learning and that Horovod was used to show its processing capabilities by classifying images by implementing OpenMPI as a platform for the distribution of processes in HPCC. This article is part of an introductory phase to carry out subsequent projects in which heterogeneous systems can be used for the distribution of tasks in multiple nodes with different software and hardware architectures.

Author Response File: Author Response.docx

Reviewer 2 Report

The paper should be re-written with more care.

The main contributions of the work should be highlighted.

ResNet50 is mentioned in Figure 1, but never used.

On page 5, what do you mean by "following image"?

All the procedures are pretty straightforward and according to the book, so the novelty should be highlighted.

Include samples from the dataset and specify the objectives.

Present results in a more constructive and convincing manner.

Author Response

We would like to thank you for critically reviewing our manuscript. Attached you will find the latest revised version of our manuscript addressing your remarks. We are pleased to re-submit the research article "Design, development, and implementation of a High-Performance Computing Cluster: A practical case, weed classification using Convolutional Neural Networks models." In this new version the article has been renamed; it is now titled "A High-Performance Computing Cluster for Distributed Deep Learning: A practical case of weed classification using convolutional neural network models".

 

Response to second reviewer comments

Review report 2

1. Reviewer comment

The paper should be re-written with more care.

Answer

First of all, we would like to express our sincere thanks to the reviewer for her professional and thorough work. After reviewing our article again, we realized that it had to be modified, so a new version of the article has been made in which several sections have been rewritten to give it a better structure.

 

2. Reviewer comment

The main contributions of the work should be highlighted.

Answer

To highlight the contributions of the article, the topics of the sections were explored in more depth. In addition, in the discussion and conclusion sections, the text was modified to make clearer to the reader the relevance of implementing an HPCC for image processing in the area of precision agriculture and to promote the use of new technologies in agriculture.

3. Reviewer comment

ResNet50 is mentioned in Figure 1, but never used.

Answer

Thanks to the reviewer for this comment. In this new version of the article, the figure has been corrected by removing ResNet50, since it was never actually used.

 

4. Reviewer comment

On page 5, what do you mean by "following image"?

 

Answer

It was a mistake not to correctly reference the image we were referring to. The figure referenced by the phrase "following image ..." was removed.

 

5. Reviewer comment

All the procedures are pretty straightforward and according to the book, so the novelty should be highlighted.

Answer

Thanks to the reviewer for this comment. The novelty is the application of distributed deep learning on an HPCC that can be considered low cost. Currently, to carry out AI projects and obtain fast results, cloud services are used that become very expensive; there are also existing infrastructures with hundreds of computing nodes in the world and in our country, Mexico, but making use of them requires, in our experience, bureaucratic procedures that take a long time.

To highlight the procedures, the topic of distributed computing for image classification in precision agriculture was explored in more depth, and the text of the original article was modified to add the following paragraphs in Section 2.4:

 

This section describes the process carried out for weed classification with distributed DL, using the HPCC configured in the previous section.

In recent years, DL in precision agriculture has had numerous applications; crop field monitoring and management, crop field identification and prediction, yield prediction, disease and pest detection, and the development of autonomous farming equipment stand out. These provide farmers and the agricultural industry with efficient methods to maximize the growth of products and the proper administration of crop fields [29]. In the case of weed classification, methods that combine image processing and deep learning are used to optimize a control strategy for plant infestations that may thwart the growth of the desired plants in crop fields.

A CNN can be used for weed classification by training the model on a dataset of images of different types or classes. The CNN can learn to recognize patterns and features in the images that distinguish one type or class from another [12]. The recognition precision that a CNN achieves depends on the architecture or model of the neural network, that is, on the number of layers and the depth it possesses [30].
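A minimal transfer-learning sketch of this pattern is shown below, assuming ImageNet weights and the nine DeepWeeds classes (eight weed species plus a negative class); the classification head and hyperparameters are illustrative choices, not the exact configuration used in our experiments.

# Transfer learning with a pretrained CNN for weed classification (sketch).
import tensorflow as tf

NUM_CLASSES = 9            # assumed: eight weed species plus a negative class
IMG_SHAPE = (224, 224, 3)

# Pretrained convolutional base; the same pattern applies to InceptionV3.
base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,
                                   input_shape=IMG_SHAPE)
base.trainable = False     # reuse ImageNet features, train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()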

 

6. Reviewer comment

Include samples from the dataset and specify the objectives.

Answer

Thanks to the reviewer for this comment. We added a new image (Figure 4), which contains samples from the DeepWeeds dataset. To specify the objectives, we added the following paragraph in the introduction:

 

This article aims to demonstrate that HPC technology can be accessed by using idle or readily available computers to unify their processing and storage capabilities and use them for AI projects. To validate this HPCC, a weed classification process was carried out through the use of DIP and distributed DL; specifically, CNNs were applied to crop images. In addition, the processing times of a workstation and the proposed HPCC were compared. For this, two pretrained CNN models were used: InceptionV3 and VGG16. We used the “DeepWeeds” dataset [22] to observe the behavior of the processing speed and the precision of the weed classification.

 

7. Reviewer comment

Present results in a more constructive and convincing manner.

Answer

Thanks to the reviewer for this comment. To present the results in a more convincing way, the results in the text have been changed and replaced with the following paragraphs:

To carry out the processing required for this research, we used three nodes: the master node and two computing nodes. The balancing of the load of the processes in each node was conducted manually; hence, it was crucial to have extensive knowledge of the server configuration and application version control, as the components must complement each other to function correctly. Developing a high-performance and efficient system with high computational power demands a significant investment of resources and capital. However, even with limited computational capabilities, the HPCC was able to implement the CNN models for the analysis and processing of images.

Our choice of models was based solely on testing the capacity of the HPCC in applications with CNNs. The metric used to evaluate the CNN models was accuracy, which allows for the recognition of the best performance based on how close a measured value is to the true value.

The evaluation of the infrastructure was based on several key measures, as shown in the tables below. Table 4 and Table 5 present the execution time of the CNN scripts used for the experimentation, where each execution was carried out with a different number of processes to determine the optimal configuration. For the InceptionV3 model, the best execution time was achieved with six processes, while the best accuracy was obtained with eight processes. Similarly, the VGG16 model performed best for the execution time with six processes and for the accuracy with one process.

In Table 6, we can see the results obtained by applying four processor cores to perform the distributed DL on the workstation. As mentioned above, the value of four was used because, in the experiments in which more than four cores were applied, the memory overflowed. In the HPCC, by contrast, it was possible to take advantage of the 12 cores in total that were available for processing. Still, it can be noticed that, with four cores, the processing time was better in the HPCC than in the workstation.

Overall, the InceptionV3 model exhibited the best processing time, as shown in Figure 7, indicating its superior performance in infrastructures such as the one proposed in this project. By using the InceptionV3 model, we achieved a processing time of 37 minutes and 55.193 seconds with an accuracy of 0.65.

 

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

 

Thanks to the authors for their careful revision. Though it seems that the motivation, contributions, and significance have been substantially revised, the presentation is still not satisfactory.

1. Although some computers are used to implement the GPU cluster, there is no clear theoretical analysis of the load or of the scheduling algorithms/coordination methods. It is difficult for readers to judge whether to adopt this scheme. In general, as more nodes are scheduled, the computational burden of the distributed method will increase.

2. Figure 3 shows the minimum logical configuration of the proposed HPCC. This figure shows that many software packages are used for the implementation. If the goal of this article is to show that a distributed solution is available, then there should be a comparison of existing distributed scheduling solutions or an introduction to the shortcomings of the distributed methods. In a technical article, the thinking is as important as the implementation.

3. Notice that the article states that "The balancing of the load of the processes in each node was conducted manually." Because the target of the article is a high-performance computer cluster, manually adjusting the load is hardly satisfactory. It is hard to adjust by human experience.

4. Why are the experimental results not compared with the workstation? It is not sufficient to only analyze FLOPS between the workstation and the proposed solution in Section 2.5, Evaluation of the HPCC. The efficiency of deep learning algorithms is related to many factors, and FLOPS is certainly one of them. However, the memory of the GPU, for example, is also relevant (not the node memory in Table 2). The memory of the 3070, which is mentioned for the workstation in Table 2, is about 8 GB, whereas the memory of the 1050 Ti, which is mentioned for the nodes in Table 2, is not more than 4 GB. The memory of the GPU directly affects parameters such as batch size, which directly determines the training time and accuracy of deep learning. Of course, it is not easy to compare these important configurations between the workstation and the proposed solution, but a comparison of their performance is necessary.

5. The comparison between VGG16 and InceptionV3 is hardly satisfactory. Is the purpose of this article to show high performance or good weed classification? It should be compared with existing weed classification methods to show its good performance. Because most existing weed classification methods are run on a single high-performance server or computer, it is necessary to choose a classic method and implement it on the distributed system to be convincing.

 

 

Author Response

Universidad Autónoma de Zacatecas

Facultad de Ingeniería

98000 Zacatecas

México

May 2023

Ms. Flora Zong

Assistant Editor

 

I am pleased to re-submit the research article "Design, development, and implementation of a High-Performance Computing Cluster: A practical case, weed classification using Convolutional Neural Networks models." In this new version the article has been renamed; the research article is now titled "A High-Performance Computing Cluster for Distributed Deep Learning: A practical case of weed classification using convolutional neural network models", applsci-2240042.

 

A new version of the paper is presented since all the comments provided by the reviewers have been addressed. Attached you will find three documents: (1) the latest revised version of our manuscript addressing the remarks (with the "Track Changes" function disabled); (2) the same revised version with the "Track Changes" function in Microsoft Word activated; and (3) the responses to the reviewers' comments.

 

Yours sincerely,

 

Dr. Carlos Olvera

 

 

Response to first reviewer comments  

Review report 1 

1. Reviewer comment

Although some computers are used to implement the GPU cluster, there is no clear theoretical analysis of the load or of the scheduling algorithms/coordination methods. It is difficult for readers to judge whether to adopt this scheme. In general, as more nodes are scheduled, the computational burden of the distributed method will increase.

Answer 

First of all, we would like to express our sincere thanks to the reviewer for his careful review. Regarding this comment, the load and scheduling analysis is described in a general way in the development of the article; since the scheduling of processes is handled directly by the Horovod framework, there is no need to present the algorithms and the implementation of these methods, as they are described in the Horovod documentation [36,37].

2. Reviewer comment

Figure 3 shows the minimum logical configuration of the proposed HPCC. This figure shows that many software packages are used for the implementation. If the goal of this article is to show that a distributed solution is available, then there should be a comparison of existing distributed scheduling solutions or an introduction to the shortcomings of the distributed methods. In a technical article, the thinking is as important as the implementation.

Answer 

Thanks to the reviewer for his comment. At first, other applications or solutions used for distributed deep learning, such as PyTorch, MXNet, or distributed TensorFlow, were tried, but it was not possible to configure some of them in the HPCC. The problem arises when installing the dependencies and packages for these applications, as mentioned in Section 2.3 at line 222: “It is also important to mention that the applications must coexist or be compatible for processing to be successful. If any of the applications is not compatible or lacks dependencies, the executions or processes that must be carried out in a distributed or parallel manner cannot be executed within an HPCC”.

Installing the distributed dependencies of MXNet, TensorFlow, and others required changing driver versions, and this caused experiments to fail or not work during the classification process due to incompatibility with the hardware driver versions.

Any distributed solution depends on the compatibility of its components, and distributed computing intrinsically has shortcomings; in this case, there is a software shortcoming due to the dependency between library versions and hardware.

Due to their usability and performance, as shown in reference [35], we decided to use TensorFlow, Keras, and Horovod.

Therefore, a comparison between distributed approaches and an introduction to the shortcomings of the different methods is beyond the scope of this article. However, this comment opens the possibility for us to start work toward other publications.

3. Reviewer comment

Notice that the article states that "The balancing of the load of the processes in each node was conducted manually." Because the target of the article is a high-performance computer cluster, manually adjusting the load is hardly satisfactory. It is hard to adjust by human experience.

Answer 

Thanks to the reviewer for his comment. We agree with you that manual load balancing is unsatisfactory, so we corrected that part to "Process load balancing on each node was performed according to the Horovod documentation [37], using the horovodrun command to distribute processes between nodes". We use TORQUE and Maui to manage the task queue and handle the workload on the proposed HPCC.
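For concreteness, a hedged example of how such a six-process run could be launched from the master node is sketched below; the host names, slot counts, and training script name are hypothetical.

# Illustrative launcher: six worker processes, two slots on each of three nodes.
import subprocess

cmd = [
    "horovodrun",
    "-np", "6",                          # total number of worker processes
    "-H", "master:2,node01:2,node02:2",  # host:slots mapping (hypothetical names)
    "python", "train_weeds.py",          # assumed training script
]
subprocess.run(cmd, check=True)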

4. Reviewer comment

Why are the experimental results not compared with the workstation? It is not sufficient to only analyze FLOPS between the workstation and the proposed solution in Section 2.5, Evaluation of the HPCC. The efficiency of deep learning algorithms is related to many factors, and FLOPS is certainly one of them. However, the memory of the GPU, for example, is also relevant (not the node memory in Table 2). The memory of the 3070, which is mentioned for the workstation in Table 2, is about 8 GB, whereas the memory of the 1050 Ti, which is mentioned for the nodes in Table 2, is not more than 4 GB. The memory of the GPU directly affects parameters such as batch size, which directly determines the training time and accuracy of deep learning. Of course, it is not easy to compare these important configurations between the workstation and the proposed solution, but a comparison of their performance is necessary.

Answer 

Thanks to the reviewer for this comment. Regarding the question "Why are the experimental...", we modified the manuscript and now show the processing comparison in Table 6.

As you mention, there are many factors that affect the accuracy and performance of deep learning algorithms. One of the metrics taken into account to compare the total theoretical performance of the HPCC and the workstation was indeed FLOPS, which was calculated to show that, while the workstation had a higher overall theoretical performance in terms of FLOPS, it achieved results similar to or lower than those obtained with the HPCC in terms of processing time and precision, as shown in Tables 4, 5, and 6.

In Table 3, the header refers to a GPU cluster, but it is actually a proposal that implements CPUs, so "GPU" has been removed from the header. It is important to note that the information for the GPUs was added only to complement all the information for each of the nodes.

The objective of this article was to obtain the processing time and precision using the CPUs that make up the HPCC in the practical case of weed classification applying two CNN models. The analysis with the GPUs was considered from the beginning as future work in Section 5. As you mention, there are great acceleration benefits with graphics cards, but in our research we noticed that when combining CPUs and GPUs, the distribution of processes was not the same; that is, if two processes were executed on two CPU cores, only one process/core took advantage of the processing power of the graphics card, from which it follows that we would need one GPU per core, which is currently beyond our configuration capabilities.

5. Reviewer comment

The comparison between VGG16 and InceptionV3 is hardly satisfactory. Is the purpose of this article to show high performance or good weed classification? It should be compared with existing weed classification methods to show its good performance. Because most existing weed classification methods are run on a single high-performance server or computer, it is necessary to choose a classic method and implement it on the distributed system to be convincing.

Answer 

Thanks to the reviewer for his comment. To answer it, the analysis with the CNN models was done precisely to test the performance of the proposed HPCC. As we mentioned in the introduction, rather than a comparison of models, what we want to show is the great benefit of high-performance computing for artificial intelligence tasks within precision agriculture.

In relation to "it is necessary to choose...", comparing several classic methods would give rise to new experiments in which the methodologies would be explored in depth, concluding with a comparison of both the performance and the accuracy of all of them. We believe this could be another, more specific work rather than one simply applied to precision agriculture.

We greatly appreciate your comments for broadening our vision for new projects. 

Author Response File: Author Response.docx

Reviewer 2 Report

The authors have been able to answer my concerns.

Author Response

Thank you very much for your comments.
