
Record Matching in Web Databases Using Unsupervised Approach


Affiliations
1 Department of Computer Science & Engineering, Muffakam Jah College of Engineering & Technology, Banjara-Hills, Hyderabad-500034, India
2 Department of Computer Science & Engineering, Muffakam Jah College of Engineering & Technology, Banjara-Hills, Hyderabad-500034, India
 

Record matching is a problem that arises when combining information from multiple heterogeneous databases. One step of data integration is relating the records that appear in the different databases; specifically, determining which sets of records refer to the same real-world entities. Record matching thus addresses the duplicate detection problem, which creates the need to identify a suitable record matching technique. Most record matching methods are supervised, which requires the user to provide training data. These methods are not applicable to the Web database scenario, where the records to match are query results generated dynamically. To overcome this problem, a new record matching method named Unsupervised Duplicate Detection (UDD) is proposed which, for a given query, can effectively identify and eliminate duplicates among the dynamically generated query result records of multiple Web databases. The idea of this paper is to adjust the weights of record fields when calculating similarities among records. Two classifiers, namely a weighted component similarity summing classifier and a support vector machine classifier, are employed iteratively with UDD to identify duplicates in the query results from multiple Web databases.
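
The weighted similarity computation mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the field names, the weights, the threshold value, and the use of Python's `difflib.SequenceMatcher` as the per-field similarity measure are all assumptions made for illustration. The iterative weight adjustment and the support vector machine step are not shown.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values.

    SequenceMatcher is an illustrative stand-in; the abstract does not
    say which per-field similarity measure is used.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities.

    `weights` maps field name -> weight and is assumed to sum to 1,
    so the result also lies in [0, 1].
    """
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())


def is_duplicate(r1: dict, r2: dict, weights: dict, threshold: float = 0.85) -> bool:
    """Flag a pair as duplicate when its weighted similarity reaches the
    threshold (0.85 is an arbitrary, hypothetical value)."""
    return record_similarity(r1, r2, weights) >= threshold


# Hypothetical records returned by two Web databases for the same query.
weights = {"title": 0.5, "author": 0.3, "year": 0.2}
rec_a = {"title": "Data Integration", "author": "J. Smith", "year": "2009"}
rec_b = {"title": "Data integration", "author": "J. Smith", "year": "2009"}
print(is_duplicate(rec_a, rec_b, weights))
```

Raising the weight of a field makes disagreement on that field count more heavily against a match, which is the effect that adjusting field weights is meant to achieve.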

Keywords

Record Matching, Unsupervised, UDD, Query Results.

Authors

Fouzia Sultana
Department of Computer Science & Engineering, Muffakam Jah College of Engineering & Technology, Banjara-Hills, Hyderabad-500034, India
Manjusha Kalekuri
Department of Computer Science & Engineering, Muffakam Jah College of Engineering & Technology, Banjara-Hills, Hyderabad-500034, India
