Automatic Extraction of Significant Terms From the Title and Abstract of Scientific Papers Using the Machine Learning Algorithm: A Multiple Module Approach


Authors

Bhaskar Mukherjee
Professor, Department of Library & Information Science, Banaras Hindu University, Varanasi, India
Debasis Majhi
Junior Research Fellow, Department of Library & Information Science, Banaras Hindu University, Varanasi, India

Abstract


Keyword extraction is the task of identifying the important terms or phrases that are most representative of a source document. Although automatic extraction of keywords from titles is an established method, it has mainly been applied to single web documents. Our approach differs from previous research on keyword extraction in several respects. For non-experts in a scientific field, understanding research trends is difficult. The purpose of this study is to develop an automatic method that gives non-experts an overview of a scientific field by capturing its research trends. This empirical study explores significant-term extraction using Natural Language Processing (NLP) tools. Our dataset consisted of more than 15,000 titles saved in a .csv file, and we used scripts written in Python to compare how similar the significant terms of the scientific title corpus are to the terms found in the abstracts of the same articles. A lightweight unsupervised keyword extractor, Yet Another Keyword Extractor (YAKE), was used to extract the significant terms. Based on our analysis, we conclude that these algorithms can also be applied in other fields, allowing non-experts in a subject to extract significant words automatically and to understand trends. Our algorithm could help reduce the labour-intensive manual indexing process.

Keywords


Data mining, Title extraction, Natural Language Processing, YAKE, NLTK, Keyword Extraction-NLP.
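
The title-versus-abstract comparison described in the abstract could be sketched roughly as follows. This is an illustration only, not the authors' scripts: a plain frequency count stands in for YAKE so that the sketch runs on the Python standard library alone, and the column names `title` and `abstract`, the short stopword list, the sample row, and the Jaccard overlap measure are all assumptions made for the example.

```python
import csv
import io
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
             "of", "on", "the", "to", "we", "with"}


def significant_terms(text, top=10):
    """Return up to `top` most frequent non-stopword tokens.

    Crude frequency-based stand-in for a keyword extractor such as YAKE.
    """
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return {term for term, _ in counts.most_common(top)}


def jaccard(a, b):
    """Fraction of terms shared by both sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_title_and_abstract(title, abstract, top=10):
    """Compare the significant terms of one title with those of its abstract."""
    title_terms = significant_terms(title, top)
    abstract_terms = significant_terms(abstract, top)
    return {
        "shared": sorted(title_terms & abstract_terms),
        "title_only": sorted(title_terms - abstract_terms),
        "overlap": jaccard(title_terms, abstract_terms),
    }


def compare_corpus(rows, top=10):
    """Apply the comparison to rows with hypothetical 'title'/'abstract' keys."""
    return [compare_title_and_abstract(r["title"], r["abstract"], top) for r in rows]


# Made-up one-row sample standing in for the .csv file of titles and abstracts.
SAMPLE_CSV = (
    "title,abstract\n"
    "Keyword extraction from scientific titles,"
    "We study keyword extraction from abstracts of scientific papers.\n"
)

results = compare_corpus(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(results[0])
```

To move closer to the setup the abstract describes, `significant_terms` could be replaced by a call to the `yake` package, whose `KeywordExtractor(...).extract_keywords(text)` returns (keyword, score) pairs in which lower scores mark more relevant keywords.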
