
A Novel Fragmentation Scheme for Textual Data Using Similarity-Based Threshold Segmentation Method in Distributed Network Environment


Affiliations
1 School of Computer Science and Engineering, Lovely Professional University, Phagwara, India
2 Department of Computer Science and Engineering, Chandigarh Engineering College, Mohali, India
 

Data distribution is a core architectural concern of any serving network: how data is organized across a distributed environment largely determines how efficiently it can be stored and retrieved. As technology has advanced, user requirements have changed. Users who were once stationary are now mobile and expect access to their data from anywhere in the world. Poorly organized data causes output delays in the network and may ultimately drive users from one service provider to another. Data fragmentation is therefore an essential part of data storage, because well-organized data is easier to use, whereas extracting information quickly from a vast collection of data is difficult. Achieving good performance in a distributed system thus requires an optimal strategy that overcomes the lapses of earlier approaches and serves the maximum number of users across a wide geographical network. This paper proposes a novel relative-based fragmentation method that analyses the attributes of the data in a relative architecture and helps achieve better query speed and accuracy. To assess the proposed work, a comparison is drawn with k-means-dependent cosine similarity measurement and with a hybridization of the cosine and soft-cosine partition methods for data partitioning. The results reported in the article show that the proposed similarity-based threshold segmentation method outperforms the existing methods in terms of partitioning strategy, precision, and recall.
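The abstract does not spell out the algorithm, so the following is only a minimal sketch of what similarity-based threshold segmentation of text records could look like, not the authors' method. It assumes a plain bag-of-words model, ordinary cosine similarity, and a greedy single pass; the function names, the sample records, and the threshold value of 0.5 are all hypothetical choices made for illustration. A small helper for precision and recall, the two evaluation measures named above, is included.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased, whitespace-tokenized term counts (a deliberately simple model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def threshold_segment(records, threshold=0.5):
    """Greedy single pass (an assumption, not the paper's procedure):
    each record joins the first fragment whose seed record is at least
    `threshold` similar; otherwise it starts a new fragment."""
    fragments = []  # list of (seed_vector, member_records)
    for record in records:
        vec = bag_of_words(record)
        for seed, members in fragments:
            if cosine_similarity(vec, seed) >= threshold:
                members.append(record)
                break
        else:
            fragments.append((vec, [record]))
    return [members for _, members in fragments]


def precision_recall(retrieved, relevant):
    """Precision = TP / |retrieved|, recall = TP / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sample records, invented for this illustration.
records = [
    "reset my password",
    "password reset link",
    "flight delayed again",
    "my flight was delayed",
]
for fragment in threshold_segment(records, threshold=0.5):
    print(fragment)
```

With these sample records the two password-related texts fall into one fragment and the two flight-related texts into another. Raising the threshold yields more, smaller fragments; lowering it merges loosely related records. A soft-cosine variant would additionally weight pairs of distinct but related terms, which this sketch does not attempt.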

Keywords

Fragmentation, K-Means, Similarity, Data Partitioning, Threshold, Segmentation, Precision, Recall.


Authors

Sashi Tarun
School of Computer Science and Engineering, Lovely Professional University, Phagwara, India
Ranbir Singh Batth
School of Computer Science and Engineering, Lovely Professional University, Phagwara, India
Sukhpreet Kaur
Department of Computer Science and Engineering, Chandigarh Engineering College, Mohali, India


DOI: https://doi.org/10.22247/ijcna/2020/205322