Article
Peer-Review Record

RCM: A Remote Cache Management Framework for Spark

Appl. Sci. 2022, 12(22), 11491; https://doi.org/10.3390/app122211491
by Yixin Song 1,2, Junyang Yu 1,2, Bohan Li 1,2, Han Li 1,2,*, Xin He 1,2, Jinjiang Wang 1,2 and Rui Zhai 1,2
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 11 October 2022 / Revised: 5 November 2022 / Accepted: 8 November 2022 / Published: 12 November 2022

Round 1

Reviewer 1 Report

1. The main drawback is the insufficient discussion of the experimental results. The authors present bar charts comparing a large number of previous methods and only describe what is seen in each chart; the results should be analyzed.

2. In Section 5.1 (Experimental setup), the authors write that they used three standard datasets, Web-BerkStan, Web-Google and Cit-Patents, but the performance evaluation does not discuss them sufficiently.

3. The paper reports the execution time. What about other parameters, such as resource utilization and ….

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

Good topic

English can be improved

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors proposed a new cache management strategy named RCM that helps to improve the cache management of big data platforms. They also proposed the related modules that make up RCM. All the proposed modules are evaluated to assess their efficiency compared to existing modules in the literature. Moreover, they evaluated the performance of RCM against Spark cache management strategies such as MCM, SACM and DMAOM. This paper is a good read; however, it needs improvements in several areas to highlight the problem statement and to clarify some technical details.

In the Abstract, the research problem / gap in the literature is not clear. It only mentions that enhancing cache management helps to improve the performance of big data platforms and that this paper proposes RCM for the Spark framework. In the Introduction, I think the authors should be more concise in listing their main contributions. Also, it seems to me that the fifth point cannot be listed as a standalone contribution.

The authors do not discuss the limitations of existing works to draw out the research gap that is investigated in this paper.

As with Algorithm 3, I recommend that the authors provide the time complexity and space complexity of Algorithms 1 and 2.

It would be better to elaborate on why RCM is not in a dominant position compared with the other strategies on the WordCount workload. This could help reveal the type of scenario in which RCM achieves the greatest optimisation.

There are grammatical mistakes and typos, such as: SuzhenWang et al. [7] design -> designed; Geng et al. [8] propose -> proposed; Reids as a service of romote cache -> remote cache.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 4 Report

In this article, the authors present a remote cache management framework for big data processing. The problem addressed in the article is interesting and has practical applications in big data processing platforms. However, there are a few comments that need to be addressed before the article can be accepted for publication.

Comments/Suggestions

1) Line 42, authors state “applications[6]. However, these remote cache solutions lack a remote cache management strategy for big data platforms”.

A reference needs to be provided for this statement.

Also, why is a cache management strategy for big data platforms not implemented by these cache solutions?

 

2) Line 178, “Trecent represents run time of the job from beginning to now.”

What is meant by the run time of the job from the beginning to now? It is not clear from the text; the authors need to clarify this.

 

3) Line 178. The working and rationale of Eq. (5) need to be included for the readers' understanding.

 

4) The models presented in Eqs. (6) and (7) seem confusing.

Generally, the computing and transmission capacities of servers are fixed quantities rather than values calculated as in Eqs. (6) and (7); what is calculated from them is the execution time and transmission time of the jobs.

The authors need to revise the model or provide a reason for using such a model.
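
To make this comment concrete, a minimal sketch of the conventional formulation the reviewer alludes to is given below (this is not the paper's actual model); the symbols W_i (computation demand of job i), D_i (data volume of job i), C_j (fixed computing capacity of server j) and B_j (fixed bandwidth of server j) are hypothetical:

    T_exec(i, j)  = W_i / C_j    (execution time of job i on server j)
    T_trans(i, j) = D_i / B_j    (transmission time of job i to server j)

Here the capacities C_j and B_j are treated as constants of the server, and the per-job quantities derived from them are the execution and transmission times.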

 

5) Line 188. In Eq. (8), what do the i and j indices represent? This is not clarified in the text.

 

6) Figure 4. According to this flowchart, all data may get placed on a single server.

This is because, based on CWG, the best server will be selected, i.e., the fastest server with the best network links. If that server does not have enough space, CREP will replace data.

Why not select the second server in the priority list if the best server does not have enough space? The authors need to carefully revise the flow and the relevant text.
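
A minimal Python sketch of the fallback behaviour suggested here; the function and variable names (place_block, free_space, run_crep) are hypothetical and do not come from the paper:

    # Walk the CWG priority list and place the block on the first server with
    # enough free space; trigger replacement (CREP) only if every server is full.
    def place_block(block_size, servers_by_priority, free_space, run_crep):
        for server in servers_by_priority:            # best server first
            if free_space[server] >= block_size:      # enough room: place here
                free_space[server] -= block_size
                return server
        # no server has room: only now fall back to cache replacement on the best one
        best = servers_by_priority[0]
        run_crep(best, block_size)
        return best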

 

7) Line 258, the authors state “In order to solve the maximum weight matching problem in bipartite graph, we adopt the Kuhn-Munkres (KM) Algorithm[1] to find optimal solution”.

What is the reason for selecting the Kuhn-Munkres algorithm? Are there any alternatives?
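
For reference, maximum-weight matching on a bipartite graph can also be solved with off-the-shelf assignment solvers. The sketch below uses SciPy's linear_sum_assignment (a Hungarian/Kuhn-Munkres-style solver) on a small hypothetical weight matrix; it is not the authors' implementation:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # weight[i][j]: hypothetical benefit of assigning cached block i to server j
    weight = np.array([[4.0, 1.0, 3.0],
                       [2.0, 0.0, 5.0],
                       [3.0, 2.0, 2.0]])

    rows, cols = linear_sum_assignment(weight, maximize=True)  # optimal assignment
    print(list(zip(rows, cols)), weight[rows, cols].sum())     # matched pairs, total weight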

 

8) Line 349, “Analyzing this phenomenon, we can conclude that serious cache pressure reduces the optimization of cache placement module”.

This does not seem to be a valid reason; the authors need to revise it.

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

The authors have addressed my comments and their revision has improved the paper.
