Article

IMapC: Inner MAPping Combiner to Enhance the Performance of MapReduce in Hadoop

1 Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai 600119, India
2 Bachelor Program in Industrial Projects, National Yunlin University of Science and Technology, Douliu 640301, Taiwan
3 Department of Electronic Engineering, National Yunlin University of Science and Technology, Douliu 640301, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2022, 11(10), 1599; https://doi.org/10.3390/electronics11101599
Submission received: 29 March 2022 / Revised: 4 May 2022 / Accepted: 13 May 2022 / Published: 17 May 2022
(This article belongs to the Special Issue Big Data Technologies: Explorations and Analytics)

Abstract

Hadoop is a framework for storing and processing huge amounts of data. Its distributed file system, HDFS, manages large data sets on commodity hardware, and MapReduce is the programming model for processing those data in parallel. In standard MapReduce, a very large amount of data is transferred from the Mapper to the Reducer without any filtering or aggregation, which consumes excessive network bandwidth. In this paper, we introduce an algorithm for the map phase called the Inner MAPping Combiner (IMapC), which combines the values of recurring keys inside the Mapper. To test the efficiency of the algorithm, different implementation approaches were evaluated. According to the tests, MapReduce programs implemented with IMapC combined with the Default Combiner (DC) are 70% more efficient than those implemented without one. Combined with MapReduce, this work can make computations significantly faster.

1. Introduction

Big data refers to huge volumes of data that cannot be processed with traditional techniques. Storing such vast datasets efficiently and computing over them within a reasonable execution time is beyond conventional systems, so a distributed data processing system is required. Traditional databases also risk data loss due to network failures and weak recovery strategies, and the conventional approach cannot handle heterogeneous data. Google addressed these problems by introducing a parallel programming model called MapReduce, and the open-source Hadoop software framework was built on this approach.
Hadoop is a computational framework that is widely used for storing and processing large amounts of data. In Hadoop, MapReduce is the programming framework that handles the computation. There are typically two parts to a MapReduce model: map and reduce [1].
(a) The map function is customized by the user based on their requirements. It splits the input into key-value pairs and generates intermediate key-value pairs.
(b) The reduce function accepts the intermediate output from the mappers and combines it to produce a smaller set of values, as illustrated in the sketch below.
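As a concrete illustration of (a) and (b), the sketch below shows a minimal WordCount Mapper and Reducer written against the standard Hadoop MapReduce API. It is an illustrative example only; the class and variable names are ours and are not taken from the source code evaluated in this paper.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// (a) The map function splits each input record and emits intermediate <word, 1> pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // intermediate (key, value) pair
        }
    }
}

// (b) The reduce function receives all values of one intermediate key and combines them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum)); // smaller, aggregated value set
    }
}

Every map task emits one <word, 1> pair per token, and the reduce task sums the values per word; the combiners discussed later operate on exactly these intermediate pairs.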
Multiple map and reduce tasks can be executed simultaneously on one or more computers. An input data block is processed by a user-defined map function, and the intermediate result is generated as key-value pairs. In standard Hadoop MapReduce, the key-value pairs generated by the mappers are stored temporarily in an in-memory buffer, and the contents are spilled to disk when the buffer reaches a threshold. Completing a MapReduce job is therefore I/O intensive, so every data-intensive phase should involve as few I/O operations as possible.
Task Failure Resilience (TFR) [2] is our previous work, which allows a task to resume execution from the point where it left off after a failure. This technique stores the intermediate key-value pairs generated by the map tasks in distributed storage, Amazon ElastiCache for Redis. Amazon ElastiCache for Redis is an in-memory key-value store that offers high throughput and low latency for read-heavy applications, and using it reduces the number of disk I/O operations in TFR. In that work, however, all intermediate key-value pairs produced by map tasks are sent to the reducers without any filtering. Repeated (key, value) pairs may therefore be transferred, which increases the data transfer overhead from the Map to the Reduce phase. This work aims to carry out a comprehensive study and establish a methodology to enhance the performance of TFR MapReduce.
The major contributions of this paper are:
  • The MapReduce execution flow is modified and is implemented on top of TFR. The source code of the mapper class is customized to implement IMapC.
  • The following changes are made in the mapper code:
    (a) Whenever a map task produces intermediate key-value pairs, a filtering method is applied to repeating keys. All key-value pairs are stored in a HashMap, and each newly generated key is compared against the existing keys.
    (b) The filtered intermediate key-value pairs from the map phase are stored in the Redis instances by customizing the input and output formats.
  • We evaluated TFR with the DC of IMapC using different benchmarking workloads, namely WordCount and Sort. The HiBench benchmarking tool was used to compare against Hadoop, TFR, and previous work, the In-Node Combiner (INC).
The paper is organized as follows. The literature survey is discussed in Section 2 and the proposed algorithm is presented in Section 3. Section 4 discusses the performance evaluation. Finally, Section 5 concludes the paper.

2. Literature Survey

Several studies have addressed MapReduce performance optimization, but little work has been done on minimizing the amount of data transferred and filtering keys between phases. The study in [1] proposed an algorithm that reduces overall latency during the shuffle phase by reducing the number of intermediate key-value pairs. The works in [2,3] implement an in-memory cache to reduce disk I/O, so that the algorithm runs faster and takes less time to execute. Kavitha [4] proposed Task Failure Resilience (TFR) to improve the performance of the Hadoop MapReduce framework by allowing an interrupted task to continue without the entire process being redone; ElastiCache for Redis is used to store the key-value pairs in a non-volatile manner, and the approach was benchmarked using several Hadoop benchmarking suites. Compared to Hadoop's default implementation, TFR's experimental results showed significant performance improvements. Zhang [5] focused on small-file processing.
A multi-pipeline data transfer methodology was developed in [6]. Dean and Ghemawat [7] proposed a simplified data processing framework that runs on large clusters of commodity machines; it allows automatic parallelization and distribution of large-scale computations and a very efficient implementation on such clusters. One drawback of this work is the lack of a filter to check for duplicate or redundant keys. Lee [8] used a policy known as Limited Node Block Placement Policy (LNBPP). Under the conventional Default Block Placement Policy (DBPP), Rack-Local Map (RLM) blocks must be copied when placing data blocks; by contrast, LNBPP places each block in a way that avoids RLMs, which reduces the time required to copy blocks. A job in containers without RLM finishes faster as a whole than it does under DBPP, since the containers are assigned to individual cores of a multicore node. In addition to rearranging blocks, LNBPP also reduces data transfer time between nodes by restricting placement to a limited number of nodes (hence, Limited Node). Kavitha [9] proposed an algorithm that can be applied to any critical task to assess work quality without wasting resources; it is implemented in MapReduce parallel programming on the Hadoop platform and was tested across a wide variety of big data scenarios to verify its effectiveness and accuracy.
Guo et al. [10] proposed iShuffle, an engine that implements a user-transparent shuffle operation, pushing map output to nodes and scheduling reduce task execution according to the workload balance. Based on testing with representative workloads and Facebook workload traces, job completion time was reduced by 29.6 percent on a single-user cluster and by 34 percent on a multi-user cluster. Lee et al. [11] discussed how the Hadoop framework affects I/O performance and proposed methods to improve it; caching all map data is shown to improve memory locality. Their in-node combining design extends the traditional combiner to the node level with the aim of optimizing I/O: with the in-node combiner, network traffic between mappers and reducers is reduced, and the total number of intermediate results also decreases. Lu et al. [12] identified the performance bottlenecks of the current Hadoop RPC design by analyzing buffer management and communication bottlenecks that do not manifest on slower networks.
HDFS suffers from a large number of disk I/O operations. There are two approaches to this problem: combining stored files and deleting old ones, or modifying HDFS's existing I/O mechanism [13]. The latter approach requires modifying the Hadoop system as a whole, which is a very complex task. PACMan, described in [14], stores input data in a cache so that distributed caches can provide coordinated services. A distributed cache system named HDCache was developed by Zhang et al. [15]; in this work, the cached data are exposed to Hadoop as snapshots of local disks. The performance of the standard Hadoop framework improves greatly because local memory and network I/O are used instead of disk I/O. R-caching was implemented in Hadoop in [16]: Redis, an open-source distributed memory cache, provides a fast and efficient local and global cache layer for Hadoop. This creates localized memory during the shuffle phase, which results in fewer disk I/O and network operations. The authors of [17] were concerned with reducing the intermediate data, and intermediate results can be reduced using combiners [18]. An enhanced version of the traditional mapper combiner is proposed in [19], where combiner functions are executed as part of the map function. Using an aggregator to minimize network traffic is suggested in [20,21], but the placement of the aggregator is a challenge.
Our work was inspired by prior efforts to combine intermediate results and lower I/O volume. However, all existing works strive to minimize this volume only after the Mapper has emitted its intermediate keys. The main motivation of this paper is to explore and resolve this limitation.

3. Methodology

A MapReduce framework [22,23] provides the capability of processing large data sets in a distributed environment. A MapReduce job involves a great deal of disk and network activity, so unnecessary I/O operations must be minimized to improve performance. In this section, we describe the shortcomings of standard MapReduce and TFR, along with possible measures to fix them. Figure 1 shows the standard MapReduce execution flow. In the standard workflow, MapReduce uses the user-defined map function to create intermediate key-value pairs that are sent to the reduce function. The intermediate results from the mappers are stored in the in-memory buffer and on the local disk and are sent to the reducers without any filtering. TFR stores the intermediate results in the Redis instances without identifying repeated keys in the generated key-value pairs. The underlying problem is that there is no way to filter out redundant or repeated keys [24,25,26]; these recurring keys generate unnecessary data traffic, resulting in wasted time. There is a default combiner method for combining key-value pairs, which runs on every mapper node.
The execution flow of TFR:
  • The user submits the job.
  • The input is split into independent records, which are assigned to the map tasks.
  • Each mapper generates a huge amount of intermediate key-value pairs.
  • The output from the mappers is sent directly to the Redis instances.
  • Shuffling and sorting are performed by fetching from the Redis instances.
  • The sorted results are stored in the Redis instances.
  • Reduce tasks fetch the mapper results from the Redis instances and perform reduction on the sorted data.
  • The final reduced output is stored in HDFS.
The objective is to minimize the volume of output from the mapper, so recurring keys are handled before they are written to storage. A filter checks whether a key has been created previously and thereby identifies recurring intermediate keys within the mapper. When recurring keys are identified, their values are combined. Once all mappers complete their mapping, there is a substantial decrease in the number of intermediate key-value pairs. In the proposed approach, all generated intermediate keys are temporarily stored in a HashMap inside the mapper.
In Figure 2, it can be observed that there are 14 <key, value> pairs. The output of the first mapper has two identical pairs of <hello, 1>, the output of the second mapper has two identical pairs of <you, 1>, and the output of the third mapper has two identical pairs of <hello, 1>. Using combiners, the 14 <key, value> pairs are reduced to 11 <key, value> pairs before they are transferred to the reducers. This reduces the network traffic and communication cost and speeds up execution.
Figure 3 shows the proposed IMapC execution flow. The execution flow of the IMapC implemented in TFR is as follows:
  • The user submits the job. During a MapReduce job, the data reside in input files, and HDFS is the storage location for these input files.
  • The Java MapReduce API provides a class called JobClient that allows the user to interact with the cluster. This class is responsible for creating the tasks based on the input data: InputSplits are created and each map task takes one InputSplit.
  • The tasks are submitted to the MapReduce job controller (MR Job Controller), which is responsible for assigning the job to the mappers and reducers.
  • Each map task performs its execution and produces intermediate key-value pairs.
  • From the generated intermediate key-value pairs, the keys that recur inside the map task are identified and stored inside the mapper in a HashMap.
  • The output of this filtering method is stored in the Redis instances.
  • The reducer node fetches the mapper output from the Redis instances and performs the sorting operation.
  • The sorted data are recorded in the Redis instances.
  • The reduce function is applied to the sorted intermediate key-value pairs to produce the final reduced results.
  • The output of the reduce tasks is stored in HDFS.
This work implements the combining method inside the mapper process in TFR. During mapping, once the intermediate key-value pairs are produced by the mapper, this mini reducer reduces the I/O throughput very efficiently [14].
In the proposed approach, the generated intermediate key-value pairs are combined inside the mappers before the mapper output is stored in the Redis instance. Each map function therefore emits only unique keys, which makes the proposed approach much more efficient.
The next section demonstrates the validity and usefulness of the proposed system model through its implementation and performance analysis. Algorithm 1 describes the mapper function step by step and shows the process for identifying intermediate keys that recur inside the mapper and combining their values. Each map task executes and produces intermediate key-value pairs (K2, V2). In the mapper, the variable L contains all of the intermediate keys emitted (see lines 1 to 7 of Algorithm 1).
Algorithm 1: Proposed Map function
Input: (K1, V1)
Output: (K2, V2)
Begin
1:  L <- new HashMap
2:  Execute Map function
3:  for each (K1, V1) pair do
4:    Generate intermediate (K2, V2)
5:    if generated K2 ∊ L then
6:      combine the intermediate value V2 with the value previously stored for K2
7:    else
8:      add (K2, V2) to L
9:    end if
10: end for
11: End
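The following Java sketch shows one possible realization of Algorithm 1 for WordCount: recurring keys are filtered through a HashMap inside the mapper, and their values are combined before anything is emitted. This is a minimal sketch of the technique, not the actual TFR/IMapC source; the class name and the choice to flush the HashMap in cleanup() are assumptions made for the example.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative in-mapper combining for WordCount, following Algorithm 1.
public class IMapCWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Line 1: L <- new HashMap holding every intermediate key seen by this map task.
    private final Map<String, Integer> combined = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            String key = tokens.nextToken();      // Line 4: generate intermediate K2
            // Lines 5-9: if K2 already exists, combine its value; otherwise insert it.
            combined.merge(key, 1, Integer::sum);
        }
    }

    // After the whole split has been processed, emit each key exactly once.
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : combined.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

Because the HashMap keeps one entry per distinct key, each map task emits at most one (key, value) pair per word, which is exactly the reduction in intermediate data that IMapC targets.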
Customizing Input and Output Formats in the Map Phase
  • OutputFormat in Hadoop: The output configuration of a MapReduce job is checked, and a RecordWriter implementation is provided to write the unfiltered mapper output to the file system.
  • OutputFormat in the proposed approach: RedisHashOutputFormat is used to set up and verify the job configuration and to specify the Redis instance nodes as well as the Redis hash key under which all output is written. After the job has been submitted, it creates the RecordWriter that serializes all output key-value pairs. In general this destination would be a file in HDFS; however, HDFS is not used for storing these intermediate results. On the back end, the getRecordWriter method is used to create an instance of a RecordWriter for the map and reduce tasks. This record writer is a nested class of the RedisHashOutputFormat class [24,25]. The getOutputCommitter method is used by the Hadoop framework to manage any temporary results before committing, in case the task fails or needs re-execution.
  • RecordWriter in Hadoop: The RecordWriter class writes the intermediate <key, value> pairs to a file system. It provides two functions, 'write' and 'close': the 'write' function writes the filtered key-value pairs from the map phase to the local disk, and the 'close' function closes the Hadoop data stream to the output file.
  • RecordWriter in the proposed approach: A RedisHashRecordWriter class enables data to be written to the Redis cache and handles the connections to the Redis server through the Jedis client. After the intermediate key-value pairs have been filtered, all key-value pairs are distributed evenly across the Redis nodes. A constructor stores the hash key used for writing to Redis, connects every Jedis instance, and maps each instance to an integer. The write method selects the assigned Jedis instance: the hash code of the key [26,27] is taken modulo the number of configured Redis instances, and the key-value pair is then written to the configured hash on the returned Jedis instance. Finally, all Jedis instances are disconnected using the close method. A sketch of this output format and record writer is given below.
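The sketch below illustrates how such an output format and record writer might look when built on the Jedis client. The class names RedisHashOutputFormat and RedisHashRecordWriter come from the description above; the configuration property names (redis.hosts, redis.hash.key), the field names, and the use of a no-op output committer are assumptions made for this example and may differ from the evaluated implementation.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import redis.clients.jedis.Jedis;

// Writes the filtered map output into Redis hashes instead of HDFS (illustrative sketch).
public class RedisHashOutputFormat extends OutputFormat<Text, IntWritable> {

    @Override
    public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext context) {
        String[] hosts = context.getConfiguration().getStrings("redis.hosts"); // assumed property
        String hashKey = context.getConfiguration().get("redis.hash.key");     // assumed property
        return new RedisHashRecordWriter(hashKey, hosts);
    }

    @Override
    public void checkOutputSpecs(JobContext context) {
        // Verify that the job configuration names at least one Redis instance.
        if (context.getConfiguration().getStrings("redis.hosts") == null) {
            throw new IllegalStateException("redis.hosts is not configured");
        }
    }

    @Override
    public OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Nothing is written to temporary files, so delegate to a committer that does nothing.
        return new NullOutputFormat<Text, IntWritable>().getOutputCommitter(context);
    }

    // Nested record writer that spreads key-value pairs evenly over the Redis nodes.
    public static class RedisHashRecordWriter extends RecordWriter<Text, IntWritable> {
        private final String hashKey;
        private final List<Jedis> instances = new ArrayList<>();

        public RedisHashRecordWriter(String hashKey, String[] hosts) {
            this.hashKey = hashKey;
            for (String host : hosts) {           // each instance is mapped to an integer index
                Jedis jedis = new Jedis(host);
                jedis.connect();
                instances.add(jedis);
            }
        }

        @Override
        public void write(Text key, IntWritable value) {
            // Hash code modulo the number of configured instances selects the target node.
            int index = (key.hashCode() & Integer.MAX_VALUE) % instances.size();
            instances.get(index).hset(hashKey, key.toString(), value.toString());
        }

        @Override
        public void close(TaskAttemptContext context) {
            for (Jedis jedis : instances) {
                jedis.disconnect();
            }
        }
    }
}

Taking the key's hash code modulo the number of instances spreads the filtered pairs evenly over the Redis nodes, as described in the last bullet above.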

4. Results

The experiment was carried out on a 10-node Hadoop cluster. Each node had a 3.5 GHz AMD processor with 8 cores and 16 GB of memory, and every node in the cluster had a data throughput of 5 Gbps via its SATA interface. Each node was allocated 5 map and 5 reduce task slots. Table 1 shows the configuration details of the 10-node Hadoop cluster. The proposed method was implemented on Hadoop 3.3.0 with Java 2.0. We tested it on datasets of different sizes, from 5 GB to 20 GB, by running the code in the four different ways listed below. Table 2 shows the job completion time of the proposed model, taken from the job history in Hadoop. Four types of WordCount and Sort programs were used to test the data sets in TFR and standard Hadoop; the corresponding driver configuration calls are sketched after the list.
  • Standard WordCount/Sort without combiners
  • WordCount/Sort with inner mapping combiner method
  • Standard WordCount/Sort with default combiner
  • WordCount/Sort with both inner mapping combiner and default combiner
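As an illustration of how these four variants differ only in their driver configuration, the sketch below shows the relevant calls, reusing the illustrative mapper and reducer classes sketched in the earlier sections; the class names are assumptions, not the exact names of the evaluated code.

import org.apache.hadoop.mapreduce.Job;

public class VariantDrivers {
    // 1. Standard WordCount/Sort without combiners: plain mapper, no combiner.
    static void standard(Job job) {
        job.setMapperClass(WordCountMapper.class);
    }

    // 2. Inner mapping combiner (IMapC): recurring keys are merged inside the mapper.
    static void imapc(Job job) {
        job.setMapperClass(IMapCWordCountMapper.class);
    }

    // 3. Default combiner (DC): the reducer class also runs as a mini-reducer on map output.
    static void defaultCombiner(Job job) {
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
    }

    // 4. IMapC together with the default combiner (the best-performing variant in the tables below).
    static void imapcWithDefaultCombiner(Job job) {
        job.setMapperClass(IMapCWordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
    }
}

Only variants 3 and 4 call setCombinerClass, and variants 2 and 4 replace the plain mapper with the in-mapper-combining one.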

4.1. Performance Comparison of IMapC in Standard Hadoop

We performed a number of experiments to evaluate IMapC. We ran two benchmarking programs, WordCount and Sort, on Hadoop 3.3.0 and TFR with datasets ranging from 5 GB to 20 GB.
Table 2 and Table 3 show the performance comparison results for the WordCount and Sort benchmarks in standard Hadoop. Figure 4 and Figure 5 show the execution time of the WordCount and Sort programs with the Inner Mapping Combiner (IMapC) compared with the Default Combiner and without a combiner for Hadoop 3.3.0.
The experimental results show that the Inner Mapping Combiner with the Default Combiner executed more efficiently than the other program types for both Hadoop 3.3.0 and TFR. It was also observed that the execution times of the WordCount and Sort programs were lower for TFR with the DC of IMapC than for TFR and Hadoop. The reason is that repeated keys are filtered before the intermediate key-value pairs are stored in the Redis instances, which avoids traffic during the reduce phase and hence reduces the overall job execution time.

4.2. Performance Comparison of IMapC in TFR

Table 4 and Table 5 show the performance comparison results for the WordCount and Sort benchmarks in TFR. Figure 6 and Figure 7 show the execution time of the WordCount and Sort programs with the Inner Mapping Combiner (IMapC) compared with the Default Combiner and without a combiner for TFR. Figure 8 and Figure 9 show the execution time of the WordCount and Sort programs for TFR with IMapC compared with TFR and Hadoop with IMapC. Table 6 and Table 7 show the performance comparison of TFR, TFR with the DC of IMapC, and Hadoop on the WordCount program. TFR with the DC of IMapC reduces execution time by 10% and 60% compared to TFR and Hadoop with IMapC, respectively, on the WordCount program; for the Sort program, it speeds up execution by 15% and 65%, respectively. Hence, the proposed method shows better results than the standard framework.

4.3. Computational Complexity

In this subsection, we profile the resource utilization of TFR, TFR with IMapC, Hadoop 2.6.5, and Hadoop 2.6.5 with IMapC, using the WordCount and Sort workloads with 35 GB datasets for CPU utilization and 15 GB datasets for memory footprint. The experiments were run on a 10-node cluster running Linux (Debian), with each node equipped with 64 GB of RAM. The HiBench tool was used to obtain a clear view of the hardware indicators (CPU usage, memory usage). For all experiments, we report results averaged across five executions.

4.3.1. CPU Usage

Figure 10 and Figure 11 show the CPU usage when running the WordCount and Sort workloads on 35 GB datasets for the TFR, TFR with IMapC, Hadoop 2.6.5, and Hadoop 2.6.5 with IMapC platforms. The results indicate that the CPU usage of TFR is higher than that of TFR with IMapC, Hadoop 2.6.5, and Hadoop 2.6.5 with IMapC. The higher frequency of sending and receiving data requests increases the CPU usage of TFR with IMapC on the WordCount and Sort workloads, and this increase in CPU usage could be expected to decrease its execution efficiency. TFR records the progress of every task, which increases the frequency of read and write operations and thus raises CPU usage. However, since TFR with IMapC is implemented with asynchronous threading, it does not interrupt the normal execution process; therefore, TFR with IMapC does not add distinct or considerable execution time compared to standard Hadoop and previous works.

4.3.2. Memory Usage

Figure 12 and Figure 13 show the memory usage when running the WordCount and Sort workloads on a 15 GB dataset for the TFR, TFR with IMapC, Hadoop 2.6.5, and Hadoop 2.6.5 with IMapC platforms. During the execution of a MapReduce job, memory usage changes dynamically as the system allocates and releases memory. Standard Hadoop and the previous work use local memory to store the intermediate spills generated by the map tasks; the memory used by the map tasks is mainly the intermediate buffer, which requires extra storage. Hadoop performs coarse-grained memory management, which makes the memory-constraint problem even more serious. In contrast, TFR with IMapC stores the filtered intermediate spills in the Redis instances, from which they are fetched during reduce task execution.

5. Conclusions

This research work focuses on designing an efficient MapReduce framework for Hadoop. A new model called IMapC is presented: a method for filtering recurring keys and combining their values. By locally aggregating partial results within the map task, the Inner Mapping Combiner (IMapC) reduces the amount of intermediate data sent to the Redis instances. Compared to a traditional combiner, IMapC reduces the number of intermediate results emitted and the traffic over the mapper-reducer network, because data values are combined inside the map method as the mapper produces its output. We modified the Hadoop core to use IMapC to filter recurring intermediate results and send them to the external storage. The performance of the proposed and existing frameworks was evaluated experimentally using different benchmarking programs and measured with respect to execution time and computational complexity. Compared to the default combiner of the standard MapReduce framework and to TFR without a combiner, our framework showed better results. Because the filtered intermediate data from the map and reduce phases are stored in an in-memory data store, TFR with IMapC maintains its performance gain during MapReduce execution. In the future, we will try to extend our framework to large Redis clusters.

Author Contributions

C.K.: research concept and methodology, writing—original draft preparation. S.R.S.: investigation. W.-C.L.: validation and funding acquisition. V.M.: review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been funded by the National Yunlin University of Science and Technology, Douliu.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jeyaraj, R.; Ananthanarayana, V.S. Multi-level per node combiner (MLPNC) to minimize MapReduce job latency on virtualized environment. In Proceedings of the ACM Symposium on Applied Computing, Pau, France, 9–13 April 2018; pp. 167–174. [Google Scholar]
  2. Vinutha, D.C.; Raju, G.T. In-Memory Cache and Intra-Node Combiner Approaches for Optimizing Execution Time in High-Performance Computing. SN Comput. Sci. 2020, 1, 98. [Google Scholar] [CrossRef] [Green Version]
  3. Shishir, M.N.S.; Yousuf, M.A. Performance Enhancement of Hadoop MapReduce by Combining Data Inside the Mapper. In Proceedings of the International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, 5–7 January 2021. [Google Scholar]
  4. Kavitha, C.; Anita, X. Task failure resilience technique for improving the performance of MapReduce in Hadoop. ETRI J. 2020, 42, 748–760. [Google Scholar] [CrossRef]
  5. Zhang, Y.; Liu, D. Improving the efficiency of storing for small files in HDFS. In Proceedings of the 2012 International Conference on Computer Science and Service System, CSSS, Nanjing, China, 11–13 August 2012; pp. 2239–2242. [Google Scholar]
  6. Zhang, H.; Wang, L.; Huang, H. SMARTH: Enabling multi-pipeline data transfer in HDFS. In Proceedings of the International Conference on Parallel Processing, Minneapolis, MN, USA, 9–12 September 2014; pp. 30–39. [Google Scholar]
  7. Dean, J.; Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, USA, 6–8 December 2004; pp. 137–150. [Google Scholar]
  8. Lee, S.; Jo, J.Y.; Kim, Y. Performance improvement of MapReduce process by promoting deep data locality. In Proceedings of the IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016, Montreal, Canada, 17–19 October 2016; pp. 292–301. [Google Scholar]
  9. Kavitha, C.; Lakshmi, R.S.; Devi, J.A.; Pradheeba, U. Evaluation of worker quality in crowdsourcing system on Hadoop platform. Int. J. Reason.-Based Intell. Syst. 2019, 11, 181–185. [Google Scholar] [CrossRef]
  10. Guo, Y.; Rao, J.; Cheng, D.; Zhou, X. iShuffle: Improving hadoop performance with shuffle-on-write. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 1649–1662. [Google Scholar] [CrossRef]
  11. Lee, W.H.; Jun, H.; Kim, H.J. Hadoop MapReduce Performance Enhancement Using In-Node Combiners. Int. J. Comput. Sci. Inf. Technol. 2015, 7, 1–17. [Google Scholar] [CrossRef]
  12. Lu, X.; Islam, N.S.; Wasi-Ur-Rahman, M.; Jose, J.; Subramoni, H.; Wang, H.; Panda, D.K. High-Performance design of Hadoop RPC with RDMA over InfiniBand. In Proceedings of the International Conference on Parallel Processing, Lyon, France, 1–4 October 2013; pp. 641–650. [Google Scholar]
  13. Zhang, J.; Wu, G.; Hu, X.; Wu, X. A distributed cache for Hadoop distributed file system in real-time cloud services. In 2012 ACM/IEEE 13th International Conference on Grid Computing; IEEE: Piscataway, NJ, USA, 2012; pp. 12–21. [Google Scholar]
  14. Pinto, V.F. In Trend Analysis using Hadoop’s MapReduce Framework. In Proceedings of the 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), Bangalore, India, 21–23 December 2017; pp. 1–5. [Google Scholar]
  15. Ananthanarayanan, G.; Ghodsi, A.; Warfield, A.; Borthakur, D.; Kandula, S.; Shenker, S.; Stoica, I. PACMan: Coordinated memory caching for parallel jobs. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation; USENIX Association: Berkeley, CA, USA; pp. 1–14.
  16. Senthilkumar, K.; Satheeshkumar, K.; Chandrasekaran, S. Performance enhancement of data processing using multiple intelligent cache in hadoop. Int. J. Inf. Educ. Technol. 2014, 159–164. Available online: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.647.320 (accessed on 28 March 2022).
  17. Crume, A.; Buck, J.; Maltzahn, C.; Brandt, S. Compressing intermediate keys between mapper and reducers in scihadoop. In IEEE SC Companion: High Performance Computing, Networking Storage and Analysis; IEEE: Piscataway, NJ, USA, 2013; pp. 1–6. [Google Scholar]
  18. Lin, J.; Schatz, M. Design patterns for efficient graph algorithms in MapReduce. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, Washington, DC, USA, 24–25 July 2010; pp. 78–85. [Google Scholar]
  19. Ke, H.; Li, P.; Guo, S.; Stojmenovic, I. Aggregation on the fly: Reducing traffic for big data in the cloud. IEEE Netw. 2015, 29, 17–23. [Google Scholar] [CrossRef]
  20. Dean, J.; Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 2008, 51, 107–113. [Google Scholar] [CrossRef]
  21. Dev, K.; Maddikunta, P.K.R.; Gadekallu, T.R.; Bhattacharya, S.; Hegde, P.; Singh, S. Energy Optimization for Green Communication in IoT Using Harris Hawks Optimization. In IEEE Transactions on Green Communications and Networking; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar] [CrossRef]
  22. Roy, A.K.; Nath, K.; Srivastava, G.; Gadekallu, T.R.; Lin, J.C.-W. Privacy Preserving Multi-Party Key Exchange Protocol for Wireless Mesh Networks. Sensors 2022, 22, 1958. [Google Scholar] [CrossRef] [PubMed]
  23. Alazab, M.; Lakshmanna, K.; Reddy, T.; Pham, Q.V.; Maddikunta, P.K.R. Multi-objective cluster head selection using fitness averaged rider optimization algorithm for IoT networks in smart cities. Sustain. Energy Technol. Assess. 2021, 43, 100973. [Google Scholar] [CrossRef]
  24. Kavitha, C.; Anita, X.; Selvan, S. Improving the efficiency of speculative execution strategy in hadoop using amazon elasticache for redis. J. Eng. Sci. Technol. 2021, 16, 4864–4878. [Google Scholar]
  25. Mani, V.; Kavitha, C.; Band, S.S.; Mosavi, A.; Hollins, P.; Palanisamy, S. A Recommendation System Based on AI for Storing Block Data in the Electronic Health Repository. Front. Public Health 2022, 9, 831404. [Google Scholar] [CrossRef] [PubMed]
  26. Kavitha, C.; Mani, V.; Srividhya, S.R.; Khalaf, O.I.; Romero, C.A.T. Early-Stage Alzheimer’s Disease Prediction Using Machine Learning Models. Front. Public Health 2022, 10, 853294. [Google Scholar] [CrossRef] [PubMed]
  27. Vidhya, S.R.S.; Arunachalam, A.R. Automated Detection of False positives and false negatives in Cerebral Aneurysms from MR Angiography Images by Deep Learning Methods. In Proceedings of the 2021 International Conference on System, Computation, Automation and Networking (ICSCAN), Puducherry, India, 30–31 July 2021. [Google Scholar]
Figure 1. Standard MapReduce execution flow.
Figure 2. MapReduce using combiner.
Figure 3. Proposed IMapC in TFR.
Figure 4. Execution time of WordCount program for the Inner Mapping Combiner (IMapC) over the Default Combiner and without the Combiner in Standard Hadoop.
Figure 5. Execution time of Sort program for Inner Mapping Combiner (IMapC) over the Default Combiner and without the Combiner in Standard Hadoop.
Figure 6. The execution time of the WordCount program for Inner Mapping Combiner (IMapC) over Default Combiner and without Combiner in TFR.
Figure 7. The execution time of the Sort program for Inner Mapping Combiner (IMapC) over Default Combiner and without Combiner in TFR.
Figure 8. The execution time of WordCount program comparing TFR, TFR with IMapC, and Hadoop.
Figure 9. The execution time of the Sort program comparing TFR, TFR with IMapC, and Hadoop.
Figure 10. CPU resource usage on running WordCount job for 35 GB dataset.
Figure 11. CPU usage on running Sort job for 35 GB.
Figure 12. Memory usage on running WordCount job.
Figure 13. Memory usage on running Sort job.
Table 1. Configuration of the Hadoop cluster.
No. of Nodes: 10
CPU: AMD processor
No. of Cores: 8
Frequency: 3.5 GHz
Memory: 16 GB
OS version: CentOS
Interface: SATA 3.0
Speed of Interface: 6 Gbps
No. of map tasks: 5
No. of reduce tasks: 5
Hadoop: Hadoop 3.3.0
HDFS Block size: 256 MB
Replication factor size: 3
JVM: JDK 2.8
Table 2. Performance improvement of IMapC with Default Combiner on running WordCount program in Standard Hadoop (execution time in seconds).
Program | 5 GB | 10 GB | 15 GB | 20 GB
Standard WordCount without combiner | 200 | 533 | 754 | 1276
Inner Mapping Combiner (IMapC) | 150 | 370 | 600 | 1050
Default combiner | 200 | 460 | 680 | 1115
Inner Mapping Combiner with Default Combiner | 120 | 300 | 590 | 1000
Table 3. Performance improvement of IMapC with the Default Combiner on running the Sort program in Standard Hadoop (execution time in seconds).
Program | 5 GB | 10 GB | 15 GB | 20 GB
Sort without combiner | 140 | 300 | 500 | 927
Inner Mapping Combiner (IMapC) | 110 | 200 | 420 | 780
Default combiner | 122 | 250 | 482 | 810
Inner Mapping Combiner with Default Combiner | 99 | 180 | 400 | 730
Table 4. Performance improvement of IMapC with Default Combiner on running WordCount program in TFR (execution time in seconds).
Program | 5 GB | 10 GB | 15 GB | 20 GB
Standard WordCount without combiner | 150 | 368 | 500 | 1019
Inner Mapping Combiner (IMapC) | 110 | 280 | 380 | 760
Default combiner | 130 | 300 | 440 | 810
Inner Mapping Combiner with Default Combiner | 93 | 210 | 322 | 700
Table 5. Performance improvement of IMapC with Default Combiner on running Sort program in TFR (execution time in seconds).
Program | 5 GB | 10 GB | 15 GB | 20 GB
Sort without combiner | 100 | 220 | 370 | 690
Inner Mapping Combiner (IMapC) | 88 | 180 | 230 | 488
Default combiner | 96 | 200 | 280 | 552
Inner Mapping Combiner with Default Combiner | 80 | 120 | 200 | 365
Table 6. Performance comparison of TFR, TFR with DC of IMapC, and Hadoop on running WordCount program (execution time in seconds).
Algorithm | 5 GB | 10 GB | 15 GB | 20 GB
TFR | 150 | 368 | 500 | 1019
TFR with DC of IMapC | 93 | 210 | 322 | 700
Hadoop with IMapC | 120 | 300 | 590 | 1000
Table 7. Performance comparison of TFR, TFR with DC of IMapC, and Hadoop on running Sort program (execution time in seconds).
Algorithm | 5 GB | 10 GB | 15 GB | 20 GB
TFR | 100 | 220 | 370 | 690
TFR with DC of IMapC | 80 | 120 | 200 | 365
Hadoop with IMapC | 99 | 180 | 400 | 730
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
