Pattabiraman, V.
- An Alternative Extension of the K-Means Algorithm for Clustering Medical Data
Authors
1 Department of Computer Science and Engineering, Kalaignar Karunanidhi Institute of Technology, Coimbatore, IN
2 Department of Master of Computer Applications, PSG College of Arts and Science, Coimbatore, IN
Source
Data Mining and Knowledge Engineering, Vol 1, No 8 (2009), Pagination: 375-382
Abstract
Data clustering is a very powerful technique in many application areas. Not only may the clusters themselves be meaningful, but clustering also enables efficient data management, since data that is grouped together is usually accessed together. Access to data within a cluster may predict that other data in the same cluster will be accessed soon, which can lead to optimized storage strategies that perform much better than if the data were stored randomly. Most of the earlier work on clustering has focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect, but because it works only on numerical data it cannot be used directly for clustering categorical data. The main contribution of this work is to show how to apply the notion of “cluster centers” to a dataset of categorical objects and how to use this notion to formulate the clustering of categorical objects as a partitioning problem. Finally, a k-means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated on well-known medical data sets.
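The abstract does not reproduce the algorithm itself; a minimal Python sketch of a k-means-like procedure for categorical data, assuming per-attribute modes as cluster centers and simple matching dissimilarity (in the spirit of k-modes, not necessarily the paper's exact method), might look like this, with the toy records purely hypothetical:

import random
from collections import Counter

def matching_dissimilarity(a, b):
    # Count attribute positions where two categorical records differ.
    return sum(x != y for x, y in zip(a, b))

def mode_center(records):
    # Cluster "center" for categorical data: the per-attribute mode.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def k_modes(data, k, max_iter=100, seed=0):
    # k-means-like clustering for categorical records (tuples of strings).
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for rec in data:
            idx = min(range(k), key=lambda i: matching_dissimilarity(rec, centers[i]))
            clusters[idx].append(rec)
        new_centers = [mode_center(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical toy medical records: (blood group, smoker, symptom)
records = [("A", "yes", "cough"), ("A", "yes", "fever"),
           ("B", "no", "cough"), ("O", "no", "fever")]
centers, clusters = k_modes(records, k=2)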
Keywords
Clustering, K-Means Clustering, Proximity.
- Performance Evaluation of OR1200 Processor With Evolutionary Parallel HPRC Using GEP
Authors
1 School of Computing Science and Engineering, Vellore Institute of Technology, Chennai, IN
Source
ICTACT Journal on Soft Computing, Vol 2, No 3 (2012), Pagination: 331-336
Abstract
In this fast computing era, most embedded systems require more computing power to complete complex functions or tasks in less time. One way to achieve this is to boost processor performance, allowing the processor core to run faster. This paper presents a novel technique for increasing performance through parallel HPRC (High Performance Reconfigurable Computing) in the CPU/DSP (Digital Signal Processor) unit of the OR1200 (Open Reduced Instruction Set Computer (RISC) 1200) using Gene Expression Programming (GEP), an evolutionary programming model. OR1200 is a soft-core RISC processor, available as an Intellectual Property core, that can efficiently run any modern operating system. In the implementation of the OR1200, a parallel HPRC is placed internally in the Integer Execution Pipeline unit of the CPU/DSP core to increase performance. The GEP parallel HPRC is activated/deactivated by triggering the signals (i) HPRC_Gene_Start and (ii) HPRC_Gene_End. In the former part of the work, Verilog HDL (Hardware Description Language) functional code for the GEP parallel HPRC is developed and synthesised using Xilinx ISE; in the latter part, the CoreMark processor benchmark is used to test the performance of the OR1200 soft core. The results of the implementation show that the GEP-based parallel HPRC in the execution unit of the OR1200 increases the overall speed-up to 20.59%.
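The abstract names Gene Expression Programming but does not describe its encoding; purely as an illustration of the evolutionary model (not of the paper's hardware design), a minimal Python sketch of decoding and evaluating a Karva-notation GEP gene could look like the following, where the function set and the example gene are assumptions:

import operator

# Toy function set with arities; the paper's actual GEP gene encoding in
# hardware is not described in the abstract, so everything here is assumed.
FUNCS = {"+": (operator.add, 2), "-": (operator.sub, 2),
         "*": (operator.mul, 2), "/": (lambda a, b: a / b if b else 1.0, 2)}

def decode_karva(gene):
    # Decode a Karva-notation gene (a string of symbols) into a nested tree.
    # The tree is filled level by level: each function symbol consumes its
    # arguments from the symbols that follow it in the gene.
    root = [gene[0], []]
    queue, pos = [root], 1
    while queue:
        symbol, children = queue.pop(0)
        arity = FUNCS[symbol][1] if symbol in FUNCS else 0
        for _ in range(arity):
            child = [gene[pos], []]
            pos += 1
            children.append(child)
            queue.append(child)
    return root

def evaluate(node, env):
    # Evaluate a decoded expression tree against terminal values in env.
    symbol, children = node
    if symbol in FUNCS:
        fn, _ = FUNCS[symbol]
        return fn(evaluate(children[0], env), evaluate(children[1], env))
    return env[symbol]

# Gene "+*aab" decodes to (a * b) + a; with a=3, b=2 it evaluates to 9.
tree = decode_karva("+*aab")
print(evaluate(tree, {"a": 3, "b": 2}))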
Keywords
GEP, Gene, Crossover, Mutation, CoreMark.
- Built-in Big Data Applications Using Restful Web Services
Authors
1 School of Computing Science and Engineering, VIT University, Chennai, IN
Source
Data Mining and Knowledge Engineering, Vol 10, No 1 (2018), Pagination: 1-6
Abstract
Apache Hive is a widely used data warehousing and analysis tool. Developers write SQL-like Hive queries, which are converted into MapReduce programs that run on a cluster. Despite its popularity, there is little research on performance comparison and diagnosis, partly because the instrumentation techniques used to monitor execution cannot be applied to the intermediate MapReduce code generated from a Hive query. Because the generated MapReduce code is hidden from developers, runtime logs are the only place a developer can get a glimpse of the actual execution. An automatic tool that extracts information from these logs and generates reports is therefore essential for understanding query execution behavior. In this paper, a tool is designed to build the execution profile of individual Hive queries by extracting information from Hive and Hadoop logs. The profile consists of detailed information about the MapReduce jobs, tasks and attempts belonging to a query. It is stored as a JSON document in MongoDB and can be retrieved to generate reports as charts or tables. The profiling tool was tested with several experiments on AWS using TPC-H datasets and queries, and it was found to assist developers in comparing Hive queries written in different formats, run on different data sets and configured with different parameters. It is also able to compare tasks/attempts within the same job to diagnose performance issues.
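The paper's actual profiler and the exact log formats are not shown in the abstract; a minimal Python sketch of the general idea (scanning Hive/Hadoop logs for job and attempt identifiers and storing the resulting profile as a JSON document in MongoDB via pymongo) could look like the following, where the regex patterns, file names and database names are assumptions:

import json
import re
from pymongo import MongoClient  # assumes a reachable MongoDB instance

# Hypothetical log-line patterns: the real Hive/Hadoop log formats and the
# paper's parser are not given in the abstract, so these are illustrative only.
JOB_RE = re.compile(r"Starting Job = (job_\d+_\d+)")
TASK_RE = re.compile(r"Task Id : (attempt_\d+_\d+_[mr]_\d+_\d+), Status : (\w+)")

def build_profile(query_id, log_path):
    # Scan a Hive/Hadoop log file and collect jobs and task attempts.
    profile = {"query_id": query_id, "jobs": [], "attempts": []}
    with open(log_path) as fh:
        for line in fh:
            if m := JOB_RE.search(line):
                profile["jobs"].append(m.group(1))
            elif m := TASK_RE.search(line):
                profile["attempts"].append({"id": m.group(1), "status": m.group(2)})
    return profile

def store_profile(profile, uri="mongodb://localhost:27017"):
    # Persist the profile as a JSON document in MongoDB.
    client = MongoClient(uri)
    client["hive_profiles"]["queries"].insert_one(profile)

if __name__ == "__main__":
    prof = build_profile("q1", "hive_session.log")  # hypothetical log file
    print(json.dumps(prof, indent=2))
    store_profile(prof)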
References
- Apache Hive. [Online]. Available: http://hive.apache.org/
- Apache Hadoop. [Online]. Available: http://hadoop.apache.org/
- Apache Hadoop NextGen MapReduce (YARN). [Online]. Available: http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarnsite/YARN.html
- TPC-H benchmark. [Online]. Available: http://www.tpc.org/tpch/
- M. Poess and C. Floyd, “New TPC benchmarks for decision support and web commerce,” SIGMOD Rec., vol. 29, no. 4, pp. 64–71, Dec. 2000. [Online]. Available: http://doi.acm.org/10.1145/369275.369291
- R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang, “YSmart: Yet another SQL-to-MapReduce translator,” in Proceedings of the 31st International Conference on Distributed Computing Systems (ICDCS). IEEE, 2011, pp. 25–36.
- Y. Huai, A. Chauhan, A. Gates, G. Hagleitner, E. N. Hanson, O. O’Malley, J. Pandey, Y. Yuan, R. Lee, and X. Zhang, “Major technical advancements in Apache Hive,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014, pp. 1235–1246.
- A. Srivastava and A. Eustace, “ATOM: A system for building customized program analysis tools.” ACM, 1994, vol. 29, no. 6.
- Q. Gao, F. Qin, and D. K. Panda, “DMTracker: Finding bugs in large-scale parallel programs by detecting anomaly in data movements,” in Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. ACM, 2007, p. 15.
- H. Herodotou and S. Babu, “Profiling, what-if analysis, and cost-based optimization of MapReduce programs,” Proceedings of the VLDB Endowment, vol. 4, no. 11, pp. 1111–1122, 2011.
- BTrace: A dynamic instrumentation tool for Java. [Online]. Available: https://kenai.com/projects/btrace
- X. Zhao, Y. Zhang, D. Lion, M. Faizan, Y. Luo, D. Yuan, and M. Stumm, “lprof: A non-intrusive request flow profiler for distributed systems,” in Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), 2014.
- P. Barham, R. Isaacs, R. Mortier, and D. Narayanan, “Magpie: Online modelling and performance-aware systems,” in HotOS, 2003, pp. 85–90.
- R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica, “X-Trace: A pervasive network tracing framework,” in NSDI, 2007.
- R. R. Sambasivan, A. X. Zheng, M. De Rosa, E. Krevat, S. Whitman, M. Stroucken, W. Wang, L. Xu, and G. R. Ganger, “Diagnosing performance changes by comparing request flows,” in NSDI, 2011.