
Data Skew Handling in Heterogeneous Hadoop Cluster


Affiliations
1 Department of Computer Engineering, SGSITS, Indore, Madhya Pradesh, India
2 Department of Information Technology, SGSITS, Indore, Madhya Pradesh, India
     



MapReduce is widely accepted as a distributed processing model for the big data generated by data-intensive applications. Hadoop is an open-source implementation that uses MapReduce as its programming model. Various other programming tools exist for processing data, but most of them are not suitable for big data.
The current Hadoop implementation assumes that all nodes are homogeneous, that is, that all nodes have the same computation power. In practice, however, a cluster may be heterogeneous. MapReduce performance can be improved by taking this heterogeneity into account.
The second issue in processing data with the MapReduce framework is data skew, the uneven distribution of data across tasks. When data skew arises, the tasks that receive the skewed data take much longer to complete than the other tasks, which degrades the performance of the overall system.
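As a rough illustration of the definition above (not the method of this paper), the following sketch counts how many records each reduce task would receive and reports the ratio of the heaviest load to the mean load. The function names, the modulo partitioning rule, and the sample keys are all assumptions made for the example.

```python
def partition_loads(keys, num_reducers):
    """Count the records each reducer receives under simple modulo partitioning.

    Hypothetical helper: integer keys and `key % num_reducers` stand in for
    a real partitioner.
    """
    loads = [0] * num_reducers
    for key in keys:
        loads[key % num_reducers] += 1
    return loads


def skew_ratio(loads):
    """Heaviest load divided by the mean load; 1.0 means a perfectly even split."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Assumed sample input: one "hot" key (0) appears 70 times, 30 other keys once each.
keys = [0] * 70 + list(range(1, 31))
loads = partition_loads(keys, 4)
print(loads, skew_ratio(loads))
```

With this assumed input, one reducer receives most of the records, so the job finishes only when that one task does.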
In this paper we focus on how data should be placed across the nodes and how it should be processed, taking the data skew problem into consideration, so that we get maximum performance from the given resources. Our data handling strategy distributes and processes the data so as to get the maximum performance from each node, which in turn increases the overall performance of MapReduce.
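The abstract does not spell out the placement algorithm, so the following is only a minimal sketch of one plausible reading: assign data blocks to nodes in proportion to each node's relative computing capacity. The function name, the node names, and the integer capacity scores are hypothetical, and largest-remainder rounding is a choice made for the example rather than something taken from the paper.

```python
def proportional_allocation(total_blocks, capacities):
    """Split `total_blocks` across nodes in proportion to integer capacity scores.

    Uses largest-remainder rounding so the per-node counts sum exactly to
    `total_blocks`. Illustrative only; not the authors' algorithm.
    """
    total_capacity = sum(capacities.values())
    alloc = {}
    remainders = {}
    for node, capacity in capacities.items():
        # Whole blocks the node is entitled to, plus the fractional leftover.
        alloc[node], remainders[node] = divmod(total_blocks * capacity, total_capacity)
    leftover = total_blocks - sum(alloc.values())
    # Hand the remaining blocks to the nodes with the largest fractional share.
    by_remainder = sorted(capacities, key=lambda n: remainders[n], reverse=True)
    for node in by_remainder[:leftover]:
        alloc[node] += 1
    return alloc


# Assumed capacity scores: "fast" is taken to be four times as powerful as "slow".
placement = proportional_allocation(10, {"fast": 4, "medium": 2, "slow": 1})
print(placement)
```

Under these assumptions the faster node holds more blocks, so all nodes would finish their local map work at roughly the same time.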

Keywords

Data Skew, HDFS, Hadoop, MapReduce, Heterogeneous.




Authors

Abhash Visoriya
Department of Computer Engineering, SGSITS, Indore, Madhya Pradesh, India
Deepak Barade
Department of Computer Engineering, SGSITS, Indore, Madhya Pradesh, India
Sunita Varma
Department of Information Technology, SGSITS, Indore, Madhya Pradesh, India


References