
Language Independent Content Extraction from Web Pages


Affiliations
1 Department of Computer Science and Engineering, Vivekananda Institute of Technology, Bangalore-74, India
 

The rapid development of the internet and of web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web (WWW). However, web pages also carry a lot of redundant and irrelevant information, such as navigation panels, tables of contents (TOC), advertisements, and copyright statements. Various technologies and research efforts focus on extracting relevant information from large web data stores, but there is still a need to annotate the extracted information automatically and systematically, in a language-independent way, so that it can be processed further for various purposes. We present a simple, robust, accurate, and language-independent solution for extracting the main content of an HTML-formatted web page and for removing additional content such as navigation menus, functional and design elements, and commercial advertisements. Accurate and efficient content extraction from web pages is much needed when searching or mining web content. To this end, the system uses a new approach to content extraction based on the word-to-leaf ratio and the density of links.
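The abstract names the two signals but does not define them, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that the word-to-leaf ratio of a DOM subtree is its word count divided by its number of leaf elements, and that link density is the share of those words that sit inside `<a>` elements. The names `TreeBuilder`, `stats`, `score`, `extract_main`, and the `max_link_density` threshold are all hypothetical. Words are split on whitespace, which is itself an assumption and would need another tokenizer for scripts written without spaces.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text chunks directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data)


def stats(node, in_link=False):
    """Return (words, words inside links, leaf elements) for a subtree."""
    if node.tag in SKIP_TAGS:
        return 0, 0, 0
    in_link = in_link or node.tag == "a"
    words = sum(len(chunk.split()) for chunk in node.text)
    link_words = words if in_link else 0
    leaves = 0 if node.children else 1
    for child in node.children:
        w, lw, lv = stats(child, in_link)
        words, link_words, leaves = words + w, link_words + lw, leaves + lv
    return words, link_words, leaves


def score(node):
    """Assumed definitions: words per leaf, and fraction of words in links."""
    words, link_words, leaves = stats(node)
    return words / max(leaves, 1), link_words / max(words, 1)


def text_of(node):
    """Collect subtree text (direct text first, so order is approximate)."""
    if node.tag in SKIP_TAGS:
        return ""
    parts = list(node.text) + [text_of(child) for child in node.children]
    return " ".join(p.strip() for p in parts if p.strip())


def extract_main(html, max_link_density=0.5):
    """Return the text of the subtree with the highest word-to-leaf ratio
    among subtrees whose link density stays under the assumed threshold."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_ratio = None, 0.0
    stack = list(builder.root.children)
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        ratio, density = score(node)
        if density <= max_link_density and ratio > best_ratio:
            best, best_ratio = node, ratio
    return text_of(best) if best else ""


if __name__ == "__main__":
    sample = ('<div><a href="/">Home</a> <a href="/about">About</a></div>'
              '<div><p>Main article text that should survive extraction.</p></div>')
    print(extract_main(sample))
```

The sketch recomputes subtree statistics for every node, which is quadratic in the worst case; a single bottom-up pass would be the natural refinement for large pages. Neither signal depends on a vocabulary or on tag semantics beyond `<a>`, which is one plausible reading of why such an approach can be described as language-independent.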

Keywords

Content Extraction, Entropy, Document Object Model, Template, Content Structure Tree, Web Page Segmentation, Clustering, Anchor Text.

Authors

R. Chandramma
Department of Computer Science and Engineering, Vivekananda Institute of Technology, Bangalore-74, India
R. C. Ravindranath
Department of Computer Science and Engineering, Vivekananda Institute of Technology, Bangalore-74, India
