
A Semantic Deduplication of Temporal Dynamic Records from Multiple Web Databases


Affiliations
1 Department of Computer Science, Karpagam University, Coimbatore - 641021, Tamil Nadu, India
2 Department of Computer Science, Gobi Arts and Science College, Gobichettipalayam - 638453, Tamil Nadu, India
 

Objective: The main objective of this paper is to improve the true positive level of record deduplication using an ontology-based MHMM-Fuzzy clustering approach.

Methods/Statistical Analysis: Most record deduplication systems in the literature use genetic programming, which combines different pieces of evidence extracted from the data content. However, the accuracy of these systems is low. To overcome this problem, we propose a Multiple Hidden Markov Model (MHMM), which increases accuracy and also identifies joint duplicate records. If the database has multiple columns, however, this model performs deduplication on all of the columns, which degrades the performance of the system. To solve this problem, MHMM-Fuzzy clustering based record deduplication is introduced. In this system, fuzzy clustering is performed on multiple observations from the Hidden Markov Model. Duplicate records are then grouped into one cluster according to fuzzy logic, so they can be eliminated easily. However, the true positive level of this system is low. To improve it, fuzzy ontology based semantic similarity is incorporated into the MHMM-Fuzzy clustering approach. This increases the efficiency of the deduplication function that identifies replicas and duplicate records.

Findings: MHMM based record deduplication, MHMM-Fuzzy clustering based record deduplication and the ontology-based MHMM-Fuzzy clustering approach are applied to the Cora bibliographic dataset and the Restaurants dataset. Performance is evaluated in terms of precision, recall, F-measure, execution time and accuracy.

Applications/Improvements: The proposed approach achieves better record deduplication results than previous work in terms of precision, recall, F-measure, execution time and accuracy.
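The abstract names precision, recall and F-measure as evaluation measures but does not say how they are computed. The following is a minimal sketch, assuming the common pairwise convention for deduplication, in which every pair of records placed in the same cluster counts as a predicted duplicate. The function names and the toy record identifiers are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Return the set of unordered record-id pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes each pair canonical, so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted_clusters, true_clusters):
    """Compute pairwise precision, recall and F-measure (assumed convention)."""
    predicted = duplicate_pairs(predicted_clusters)
    actual = duplicate_pairs(true_clusters)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data (hypothetical): r1, r2, r3 are one entity; r5, r6 are another.
true_clusters = [{"r1", "r2", "r3"}, {"r4"}, {"r5", "r6"}]
predicted_clusters = [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}]

precision, recall, f_measure = pairwise_scores(predicted_clusters, true_clusters)
print(f"precision={precision:.3f} recall={recall:.3f} f={f_measure:.3f}")
```

In this toy case two of the three predicted pairs are true duplicates and two of the four true pairs are recovered, so precision is 2/3, recall is 1/2 and the F-measure is 4/7. Raising the true positive level, the stated objective, increases the numerator shared by precision and recall.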

Keywords

Hidden State Sequence, Membership Function, Observation Sequence, States, Semantic Deduplication

Authors

R. Parimala Devi
Department of Computer Science, Karpagam University, Coimbatore - 641021, Tamil Nadu, India
V. Thigarasu
Department of Computer Science, Gobi Arts and Science College, Gobichettipalayam - 638453, Tamil Nadu, India



DOI: https://doi.org/10.17485/ijst%2F2015%2Fv8i34%2F124256