

Objectives: File structure and storage become challenging issues when processing huge amounts of data in a parallel and distributed environment. To increase processing capability, an appropriate file structure must be implemented.

Methods/Statistical Analysis: In our approach, we imported data from available relational databases such as Oracle or MySQL into Hive using Sqoop and analyzed query processing across different file storage formats. We focused on the Parquet, Sequence, RC, and ORC file formats for query analysis in the MapReduce framework on top of Hadoop.

Findings: For understanding the dynamic behavior of user buying habits across different web services, and for product recommendation using social media, e-marketing, etc., a MapReduce-based data warehousing system plays a vital role in performing Big Data analytics in a parallel and distributed environment. In this type of analysis, the data structure used to store the data for parallel query processing affects the performance of the Big Data warehouse system. When analyzing huge amounts of relational data in a parallel and distributed system, a few issues must be addressed to improve query performance and optimization: 1. Faster loading of huge amounts of relational data into the Big Data warehouse. 2. An optimized file format to manage the storage system efficiently. 3. Faster query processing through increased throughput. Our findings identify appropriate file formats for storing huge amounts of relational data in Hive, the Big Data warehouse system built on HDFS and the MapReduce framework, and evaluate query processing performance on a multi-node Hadoop cluster.

Application/Improvements: Choosing an appropriate file structure in Big Data warehouse systems reduces the cost of parallel query processing and increases distributed storage efficiency.
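As a rough illustration of the comparison described in the Methods, the following Python sketch generates HiveQL that copies one staging table into four tables, one per storage format. It is a minimal sketch under stated assumptions, not the authors' actual setup: the table name `sales_staging`, its columns, and the helper `build_statements` are hypothetical, and the staging table is assumed to have already been imported from the relational source with Sqoop. Only the `STORED AS` keywords (`PARQUET`, `SEQUENCEFILE`, `RCFILE`, `ORC`) are standard Hive syntax.

```python
# Hypothetical staging table, assumed to be already loaded into Hive via Sqoop.
SOURCE_TABLE = "sales_staging"

# Hypothetical schema, used only for illustration.
COLUMNS = [("order_id", "BIGINT"), ("customer_id", "BIGINT"), ("amount", "DOUBLE")]

# Hive STORED AS keywords for the four formats named in the abstract.
FORMATS = {
    "parquet": "PARQUET",
    "sequence": "SEQUENCEFILE",
    "rc": "RCFILE",
    "orc": "ORC",
}


def build_statements(source, columns, formats):
    """Return, per format, a CREATE TABLE and an INSERT OVERWRITE statement."""
    col_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    statements = {}
    for suffix, keyword in formats.items():
        target = f"{source}_{suffix}"
        create = f"CREATE TABLE {target} ({col_defs}) STORED AS {keyword};"
        load = f"INSERT OVERWRITE TABLE {target} SELECT * FROM {source};"
        statements[suffix] = [create, load]
    return statements


if __name__ == "__main__":
    for suffix, stmts in build_statements(SOURCE_TABLE, COLUMNS, FORMATS).items():
        print(f"-- {suffix}")
        print("\n".join(stmts))
```

Running the same query against each generated table would then allow load time, storage footprint, and query latency to be compared per format, which is the kind of evaluation the abstract describes.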

Keywords

Big Data, HDFS, Hive, Hadoop, MapReduce, ORC File, Sqoop.