A Framework To Automatically Categorize The Unstructured Text Documents


Affiliations
1 Bhaskaracharya College of Applied Sciences, Dwaraka, University of Delhi, New Delhi – 110075, Delhi, India
 

Objectives: This paper presents an application of text mining that automatically assigns text documents to one of a predefined set of classes or categories based on the content of the document. Method: In our framework, association rule mining is used to generate frequently occurring word sets from the training data set; these word sets, rather than all the words in each document, form the feature set used for training. New documents are then classified by applying a classification method to the extracted feature sets. Findings: The results show that frequently associated word sets extracted by applying association rules provide more accurate classification of new documents than the whole word set extracted from the training documents. Applications: The associated word sets not only reduce the size of the training data but also relate to the context of the document, which is then used to classify new documents. This approach can be used for efficient storage and retrieval of documents on the web or in organizational databases. Hierarchical classification organizes documents into a tree-like structure, or meaningful hierarchy, forming a single-path structure to which new categories or classes can be added when needed. The approach can be combined with such a hierarchical structure for efficient browsing on search engines, and it provides an avenue for future research.
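
The method summarized in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not name the mining algorithm or the classifier, so this example assumes an Apriori-style search for frequent word sets within each class and a simple support-weighted overlap score standing in for the classification step. The function names, the stopword list, the `min_support` threshold, and the toy training data are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, and return the set of non-stopword words."""
    return {w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS}


def frequent_word_sets(docs, min_support=0.6, max_size=2):
    """Apriori-style search over a list of word sets.

    Returns {frozenset_of_words: support}, where support is the fraction
    of documents that contain every word in the set.
    """
    n = len(docs)
    counts = Counter(w for d in docs for w in d)  # each doc is a set, so one count per doc
    frequent = {frozenset([w]): c / n for w, c in counts.items() if c / n >= min_support}
    current = set(frequent)
    size = 1
    while current and size < max_size:
        size += 1
        items = sorted(set().union(*current))
        current = set()
        for combo in combinations(items, size):
            # Apriori pruning: every smaller subset must already be frequent.
            if not all(frozenset(s) in frequent for s in combinations(combo, size - 1)):
                continue
            candidate = frozenset(combo)
            support = sum(1 for d in docs if candidate <= d) / n
            if support >= min_support:
                frequent[candidate] = support
                current.add(candidate)
    return frequent


def train(labeled_docs, min_support=0.6, max_size=2):
    """Mine frequent word sets separately for each class label."""
    by_class = defaultdict(list)
    for text, label in labeled_docs:
        by_class[label].append(tokenize(text))
    return {label: frequent_word_sets(docs, min_support, max_size)
            for label, docs in by_class.items()}


def classify(model, text):
    """Assumed scoring rule: sum the supports of each class's word sets found in the text."""
    words = tokenize(text)
    scores = {label: sum(s for ws, s in word_sets.items() if ws <= words)
              for label, word_sets in model.items()}
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
TRAINING = [
    ("the team won the football match", "sports"),
    ("football team scores late goal in match", "sports"),
    ("coach praises team after match", "sports"),
    ("bank raises interest rates", "finance"),
    ("interest rates lift bank shares", "finance"),
    ("market reacts to bank interest decision", "finance"),
]

model = train(TRAINING)
print(classify(model, "the bank changed interest rates today"))
```

In this sketch only the frequent word sets are kept as features for each class, so rare words in the training documents do not enter the model at all. That is one way to read the abstract's remark that the associated word sets reduce the size of the training data.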

Keywords

Association Rules, Classification, Text Categorization, Text Mining

Authors

Anshika Singh
Bhaskaracharya College of Applied Sciences, Dwaraka, University of Delhi, New Delhi – 110075, Delhi, India


DOI: https://doi.org/10.17485/ijst/2017/v10i8/151178