Data Skew Handling in Heterogeneous Hadoop Cluster

Abhash Visoriya; Deepak Barade; Sunita Varma

Abhash Visoriya ¹, Deepak Barade ¹, Sunita Varma ²

Affiliations
1 Department of Computer Engineering, SGSITS, Indore, Madhya Pradesh, India
2 Department of Information Technology, SGSITS, Indore, Madhya Pradesh, India

Abstract

MapReduce has been widely accepted as a distributed processing model for the big data generated by data-intensive applications. Hadoop is an open-source implementation that uses MapReduce as its programming model. Many other programming tools exist for processing data, but most of them are not suitable for big data.
The current Hadoop implementation assumes that all nodes in the cluster are homogeneous, that is, that every node has the same computation power. In practice, however, a cluster may be heterogeneous. MapReduce performance can be improved by taking this heterogeneity into account.
The second issue in processing data with the MapReduce framework is data skew, the uneven distribution of data across tasks. When data skew arises, the tasks that receive the skewed data take much longer to complete than the other tasks, which degrades the performance of the overall system.
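To make the effect of skew concrete, the following minimal sketch counts how many records each reduce task would receive under a simple modulo partitioner and reports the ratio of the largest load to the mean load. The key distribution, the function names `partition_loads` and `skew_ratio`, and the modulo partitioner are illustrative assumptions, not taken from the paper.

```python
def partition_loads(keys, num_reducers):
    """Count the records each reducer receives under modulo partitioning.

    Integer keys and `key % num_reducers` are assumptions made for this
    sketch; they stand in for a hash partitioner.
    """
    loads = [0] * num_reducers
    for key in keys:
        loads[key % num_reducers] += 1
    return loads


def skew_ratio(loads):
    """Largest load divided by the mean load; 1.0 means a perfectly even split."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Hypothetical input: key 0 is "hot" and accounts for 70 of 100 records.
keys = [0] * 70 + list(range(1, 31))
loads = partition_loads(keys, num_reducers=4)

print(loads)              # records per reducer
print(skew_ratio(loads))  # how much heavier the busiest reducer is than average
```

In this hypothetical input, the reducer that owns the hot key receives roughly three times the average load, so the job cannot finish until that single task does.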
In this paper we focus on how data should be placed across the nodes and how it should be processed, taking the data skew problem into consideration, so that the given resources deliver maximum performance. Our data handling strategy distributes and processes the data so that each node performs at its maximum, which in turn increases the overall performance of MapReduce.
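The abstract does not spell out the placement rule, so the sketch below shows only one plausible reading: assign data blocks to nodes in proportion to an assumed relative computing capacity, using largest-remainder rounding so that every block is placed. The function name `allocate_blocks`, the node names, and the capacity figures are hypothetical and do not describe the authors' actual algorithm.

```python
def allocate_blocks(total_blocks, capacities):
    """Split total_blocks across nodes in proportion to their capacities.

    `capacities` maps a node name to an assumed relative computing capacity.
    Largest-remainder rounding guarantees the shares sum to total_blocks.
    """
    total_capacity = sum(capacities.values())
    exact = {node: total_blocks * cap / total_capacity
             for node, cap in capacities.items()}
    shares = {node: int(value) for node, value in exact.items()}

    # Hand the blocks lost to rounding to the nodes with the largest remainders.
    leftover = total_blocks - sum(shares.values())
    by_remainder = sorted(capacities,
                          key=lambda node: exact[node] - shares[node],
                          reverse=True)
    for node in by_remainder[:leftover]:
        shares[node] += 1
    return shares


# Hypothetical cluster: node-c is assumed to be three times as fast as node-a.
capacities = {"node-a": 1.0, "node-b": 2.0, "node-c": 3.0}
shares = allocate_blocks(100, capacities)
print(shares)
```

Under these assumed capacities, an even split would give each node about 33 blocks and leave the slowest node working roughly twice as long as it does with the proportional split, which is the kind of imbalance a heterogeneity-aware placement aims to avoid.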

Keywords

Data Skew, HDFS, Hadoop, MapReduce, Heterogeneous.

International Journal of Distributed and Cloud Computing
