
Feature Selection for Automatic Categorization of Patent Documents


Affiliations
1 Department of Analytics, School of Computer Science and Engineering, VIT University, Vellore - 632014, Tamil Nadu, India
2 Department of Computer Science and Engineering, Konkuk University, Seoul, Republic of Korea
 

Objective: With the rapid increase in the number of patent documents worldwide, demand for their automatic categorization has grown significantly. Automatic categorization organizes such documents in digital form, replacing the time-consuming manual process. In this work, we propose a system that automatically categorizes patent documents by considering their structural information.

Methods: We propose a three-stage mechanism for automatic categorization. In the first stage, we apply a pre-processing mechanism to reduce unwanted noise that can influence the categorization process. Such noise includes terms that carry little structural meaning in the document. In the second stage, feature selection is conducted based on term frequencies, and feature vectors are constructed from the structural information of the patent. In the third stage, classification is performed using Random Forest (RF), Support Vector Machine (SVM), and Naïve Bayes (NB) classifiers.

Findings: The semantic structural information of a patent document was found to be an important feature set for constructing the terms of a document for categorization. The experimental results also show that feature reduction using Information Gain (IG) helps to obtain a higher accuracy rate in a reduced dimensional space.

Applications: The results indicate the importance of the proposed method for the automatic categorization of patent documents.
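The first two stages described above (noise reduction, then IG-based feature reduction over term features) can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the toy corpus, the class labels, and all function names are hypothetical, and IG is computed here for the binary feature "term occurs in the document", which is one common formulation that the abstract does not confirm.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the paper does not specify its noise terms.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "with", "is", "to", "in"}


def tokenize(text):
    """Stage 1 (assumed form): lowercase, keep letter runs, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(term, docs, labels):
    """IG of the binary feature 'term occurs in document' w.r.t. the labels."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_features(texts, labels, k):
    """Stage 2 (assumed form): keep the k terms with the highest IG."""
    docs = [set(tokenize(t)) for t in texts]
    vocabulary = sorted(set().union(*docs))
    scores = {term: information_gain(term, docs, labels) for term in vocabulary}
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(vocabulary, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


def to_vector(text, features):
    """Term-frequency vector restricted to the selected features."""
    counts = Counter(tokenize(text))
    return [counts[f] for f in features]


# Toy corpus and labels, invented purely for illustration.
texts = [
    "A battery electrode with a lithium coating",
    "Lithium battery cell for energy storage",
    "An antenna for wireless signal transmission",
    "Wireless antenna array with signal amplifier",
]
labels = ["chemistry", "chemistry", "telecom", "telecom"]

features, scores = select_features(texts, labels, k=4)
vectors = [to_vector(t, features) for t in texts]
```

The resulting `vectors` could then be passed to any of the classifiers named in the third stage (RF, SVM, or NB); that step is omitted here to keep the sketch free of external dependencies.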

Keywords

Classification, Feature Selection, Patent Categorization, Structural Information.

Authors

S. Don
Department of Analytics, School of Computer Science and Engineering, VIT University, Vellore - 632014, Tamil Nadu, India
Dugki Min
Department of Computer Science and Engineering, Konkuk University, Seoul, Republic of Korea




DOI: https://doi.org/10.17485/ijst%2F2016%2Fv9i37%2F126923