
Data Extraction in Web Databases by Combining Tag and Value Similarity


Affiliations
1 Sethu Institute of Technology, India
     



Abstract

In real-time applications, identifying records that represent the same real-world entity is a major challenge. Detecting and removing duplicate records that refer to the same entity within a single dataset is an important data preprocessing task. This work enhances CTVS, a data extraction and alignment method that combines tag and value similarity, with an unsupervised duplicate detection (UDD) algorithm to eliminate duplicate records in web databases. CTVS automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in those pages and then aligning the segmented QRRs into a table in which data values of the same attribute are placed in the same column. New techniques are proposed to handle QRRs that are not contiguous, which may be caused by auxiliary information such as comments, recommendations, or advertisements, and to handle any nested structure that may exist in the QRRs. A new record alignment algorithm is also designed; it aligns the attributes in a record first pairwise and then holistically by combining tag and data value similarity.
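The pairwise alignment step described above can be illustrated with a minimal sketch. It is not the CTVS algorithm itself: the field representation (a tag path plus a text value), the equal weighting of the two similarities, the greedy one-to-one matching, and the names `tag_similarity`, `value_similarity`, `combined_similarity`, and `align_pair` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Assumption: each field of a query result record is a (tag_path, value) pair,
# e.g. ("tr/td/a", "Web Data Mining"). This representation is hypothetical.

def tag_similarity(path_a, path_b):
    """Share of matching tags between two slash-separated tag paths (0.0 to 1.0)."""
    return SequenceMatcher(None, path_a.split("/"), path_b.split("/")).ratio()

def value_similarity(value_a, value_b):
    """Case-insensitive string similarity between two data values (0.0 to 1.0)."""
    return SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio()

def combined_similarity(field_a, field_b, tag_weight=0.5):
    """Weighted mix of tag and value similarity; the weight is an assumed parameter."""
    (tag_a, value_a), (tag_b, value_b) = field_a, field_b
    return (tag_weight * tag_similarity(tag_a, tag_b)
            + (1 - tag_weight) * value_similarity(value_a, value_b))

def align_pair(record_a, record_b, threshold=0.6):
    """Greedily pair fields of two records, best-scoring pairs first.

    Returns sorted (index_in_a, index_in_b) pairs; fields with no partner
    scoring at or above the threshold stay unaligned.
    """
    scored = sorted(
        ((combined_similarity(fa, fb), i, j)
         for i, fa in enumerate(record_a)
         for j, fb in enumerate(record_b)),
        reverse=True,
    )
    used_a, used_b, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining candidates are all weaker
        if i in used_a or j in used_b:
            continue  # each field is aligned at most once
        used_a.add(i)
        used_b.add(j)
        pairs.append((i, j))
    return sorted(pairs)

if __name__ == "__main__":
    record_1 = [("tr/td/a", "Data Mining Concepts"), ("tr/td/span", "$45.00")]
    record_2 = [("tr/td/a", "Web Data Mining"), ("tr/td/span", "$39.99"),
                ("tr/td/i", "In stock")]
    print(align_pair(record_1, record_2))
```

Fields paired this way would share a column in the output table. A holistic pass over all records, as the abstract describes, would then reconcile the pairwise results; that step is not sketched here.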

Keywords

Automatic Wrapper Generation, Data Extraction, Data Record Alignment, Duplicate Detection.

Authors

J. Deepika
Sethu Institute of Technology, India
S. Senthamarai Kannan
Sethu Institute of Technology, India
