Language Independent Content Extraction from Web Pages

R. Chandramma; R. C. Ravindranath; B. Raviteja; Y. B. Ravikumar; S. Venkatesh; M. Yashwanth

Affiliations
1 Department of Computer Science and Engineering, Vivekananda Institute of Technology, Bangalore-74, India

Abstract

The rapid development of the internet and of web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web (WWW). However, web pages also carry a lot of redundant and irrelevant information, such as navigation panels, tables of contents (TOC), advertisements, and copyright statements. Various technologies and research efforts focus on extracting relevant information from large web data stores. There is still a need to annotate the extracted information automatically and systematically, so that it can be processed further for various purposes, and to do so independently of language. In this system we present a simple, robust, accurate, and language-independent solution for extracting the main content of an HTML-formatted web page and for removing additional content such as navigation menus, functional and design elements, and commercial advertisements. Accurate and efficient content extraction from web pages is needed when searching or mining web content. The system therefore uses a new approach to content extraction based on word-to-leaf ratio and link density.
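The abstract names the two measures but does not define them, so the following is only a minimal sketch of how such a scoring could look, not the authors' implementation. It assumes that the word-to-leaf ratio of an element is the number of words in its subtree divided by the number of leaf elements in that subtree, that link density is the fraction of those words sitting inside `<a>` elements, and that the two are combined as `ratio * (1 - link_density)`. The names `best_block`, `BLOCK_TAGS`, and `SAMPLE`, the candidate tag list, and whitespace-based word counting are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: only these container tags are considered candidate content blocks.
BLOCK_TAGS = {"div", "article", "section", "main", "td", "ul", "ol", "nav"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_words = 0  # words in text placed directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simple element tree using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            # Assumption: words are whitespace-separated tokens.
            self.current.own_words += len(data.split())


def measure(node, in_link=False):
    """Annotate each node with subtree word, link-word, and leaf counts."""
    in_link = in_link or node.tag == "a"
    words = node.own_words
    link_words = node.own_words if in_link else 0
    leaves = 0 if node.children else 1
    for child in node.children:
        w, lw, lv = measure(child, in_link)
        words, link_words, leaves = words + w, link_words + lw, leaves + lv
    node.words, node.link_words, node.leaves = words, link_words, leaves
    return words, link_words, leaves


def score(node):
    """Word-to-leaf ratio, discounted by link density (assumed combination)."""
    if node.words == 0:
        return 0.0
    word_to_leaf = node.words / node.leaves
    link_density = node.link_words / node.words
    return word_to_leaf * (1.0 - link_density)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def best_block(html):
    """Return the highest-scoring candidate block, or None if there is none."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    measure(builder.root)
    candidates = [n for n in walk(builder.root) if n.tag in BLOCK_TAGS]
    return max(candidates, key=score, default=None)


SAMPLE = """
<html><body>
<ul id="nav"><li><a href="/">Home</a></li><li><a href="/news">News</a></li><li><a href="/about">About us</a></li></ul>
<div id="main"><p>Content extraction keeps the main text of a page and drops the rest.</p>
<p>It should work the same way for any language, see <a href="/x">details</a>.</p></div>
<div id="footer">Copyright 2024 <a href="/terms">Terms</a></div>
</body></html>
"""

if __name__ == "__main__":
    best = best_block(SAMPLE)
    print(best.attrs.get("id"))  # main
```

In this sketch the navigation list scores zero because every word in it is link text, while the block with long paragraphs and few links scores highest. Nothing in the scoring depends on a dictionary or stop-word list, which is the sense in which such a measure can be language independent, although whitespace tokenization would need replacing for scripts that do not separate words with spaces.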

Keywords

Content Extraction, Entropy, Document Object Model, Template, Content Structure Tree, Web Page Segmentation, Clustering, Anchor Text.

International Journal of Engineering Research
