Language Independent Content Extraction from Web Pages
The rapid development of the Internet and of web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web (WWW). However, web pages also carry a lot of redundant and irrelevant information, such as navigation panels, tables of contents (TOC), advertisements, and copyright statements. Various technologies and research efforts focus on extracting relevant information from large stores of web data. There is still a need, however, to annotate the extracted information automatically and systematically, so that it can be processed further for various purposes, and to do so independently of language. We present a simple, robust, accurate, and language-independent solution for extracting the main content of an HTML-formatted web page and for removing additional content such as navigation menus, functional and design elements, and commercial advertisements. Accurate and efficient content extraction from web pages is much needed when searching or mining web content. To this end, we use a new approach to content extraction based on the word-to-leaf ratio and the density of links.
Keywords
Content Extraction, Entropy, Document Object Model, Template, Content Structure Tree, Web Page Segmentation, Clustering, Anchor Text.
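The abstract names the word-to-leaf ratio and the density of links but does not define them, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that the word-to-leaf ratio of an element is the number of words in its subtree divided by the number of leaf elements in that subtree, and that link density is the share of those words that sit inside `<a>` elements. The thresholds `min_ratio` and `max_link_density`, the `EXEMPT` tag set, and all function names are hypothetical choices made for illustration. Because the sketch only counts whitespace-separated words and tags, it does not depend on the language of the page, as long as that language separates words with whitespace.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}  # text inside these is never page content
# Assumption: inline tags and the page wrappers are judged with their
# enclosing block instead of being tested on their own.
EXEMPT = {"html", "body", "a", "b", "i", "em", "strong", "span", "code"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.items = []  # text chunks (str) and child Nodes, in document order


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("#root")

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, self.cur)
        self.cur.items.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent  # stray end tags are ignored

    def handle_data(self, data):
        if self.cur.tag not in SKIP and data.strip():
            self.cur.items.append(data)


def measure(node):
    """Store and return (words, link_words, leaves) for node's subtree."""
    words = link_words = leaves = 0
    for item in node.items:
        if isinstance(item, str):
            words += len(item.split())
        else:
            w, lw, lf = measure(item)
            words, link_words, leaves = words + w, link_words + lw, leaves + lf
    if node.tag == "a":
        link_words = words       # every word inside an anchor is link text
    leaves = max(leaves, 1)      # an element without child elements is a leaf
    node.stats = (words, link_words, leaves)
    return node.stats


def collect(node, min_ratio, max_link_density, out):
    """Walk top-down and drop subtrees that look like navigation or chrome."""
    words, link_words, leaves = node.stats
    if words == 0:
        return
    if node.parent is not None and node.tag not in EXEMPT:
        if words / leaves < min_ratio or link_words / words > max_link_density:
            return
    for item in node.items:
        if isinstance(item, str):
            out.append(" ".join(item.split()))
        else:
            collect(item, min_ratio, max_link_density, out)


def extract_main_text(html, min_ratio=5.0, max_link_density=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    measure(builder.root)
    out = []
    collect(builder.root, min_ratio, max_link_density, out)
    return " ".join(out)


page = """
<html><head><title>Demo</title><style>p { color: red }</style></head><body>
<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li>
<li><a href="/about">About us</a></li></ul>
<div id="main">
<p>Content extraction keeps the article text and drops the page chrome around it.</p>
<p>Blocks with many words per leaf and few <a href="/x">linked words</a> are kept as content.</p>
</div>
<p><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></p>
</body></html>
"""
text = extract_main_text(page)
print(text)
```

On this toy page the navigation list and the footer links have roughly one word per leaf and consist entirely of link text, so both are dropped, while the two paragraphs in the main block are kept together with their inline link. Top-down pruning is a deliberate simplification: a wrapper element whose subtree is dominated by navigation would be discarded along with any content nested inside it, so a fuller implementation would likely need a more careful traversal and tuned thresholds.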