- Urvashi Garg
- Rupinder Pal Kaur
- Shruti Bajaj Mangal
- Umrinder Pal Singh
- Anisha Rani
- Manish Kumar
- Jaskaran Bhullar
- Rakesh Kumar
- Ravinder Khanna
- Harjeet Singh
- Rajeev Puri
- R. P. S. Bedi
- Ajit Kumar
- Deepali
- Lalit Goyal
- Kapil Dev Goyal
- Mukhtiar Singh
- Sahil Raj
- Neetika Bansal
- Simpel Rani
- Maajid Bashir
- Kaiser J. Giri
Goyal, Vishal
- Effect of Stop Word Removal on Document Similarity for Hindi Text
Authors
1 Haryana College of Technology and Management, Kaithal, IN
2 Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 13 (2014), Pagination: 161-163
Abstract
Stop word removal is one of the important NLP techniques. Stop words are very common in any document. In this paper, we have created a list of stop words for Hindi text on the basis of the frequency of words in documents. Hindi documents from the EMILLE corpus have been used for finding the stop words. UTF-8 encoding is used. The percentage of stop words in any document has been found and experimentally analyzed. The paper discusses the effect of stop word removal on the similarity of two documents containing Hindi text. The Hoad and Zobel approach is used for finding the similarity of documents containing Hindi text.
Keywords
Stop Words, Removal, Text, Hindi, List, Frequency.
- Tool for Measuring ‘Quality in Use’ for Websites of Punjabi and Hindi Newspapers: PH.WUQT
Authors
1 Post Graduate Department of Computer Science, Sri Guru Gobind Singh College, Sector 26, Chandigarh, IN
2 Department of Computer Science, Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 13 (2014), Pagination: 185-190
Abstract
Quality is difficult to define and measure. The ISO 25010 standard describes the 'quality in use' model for evaluating the usability quality of software. In this research, a tool based on the ISO 25010 standard has been developed specifically to test 'Quality in Use' for websites of Punjabi and Hindi newspapers. The attributes that affect usability quality are identified and arranged in a questionnaire format in preparation for data collection. The data analysis can be conducted by applying a statistical tool to the weighted dataset, where each respondent's rating for a question is multiplied by its perceived importance. The tool can be used to obtain feedback that helps to understand, control, improve and make predictions about websites of newspapers in other languages as well.
Keywords
Quality in Use, Newspaper Domain, Questionnaire, ISO 25010, Usability Quality, Quality Attributes, Websites of Punjabi and Hindi Newspapers.
- Text News Classification System using Naive Bayes Classifier
Authors
1 Punjabi University, Patiala, IN
2 Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 13 (2014), Pagination: 209-213
Abstract
This paper describes the Naive Bayes text news classification system developed for the Punjabi language. A news corpus is used for training and testing the classifiers. Language-specific preprocessing techniques are applied to raw data to generate a standardized and reduced-feature lexicon. Punjabi is a morphologically rich language, which makes those tasks complex. Statistical characteristics of the corpus and lexicon are measured, and they show satisfactory results for the text preprocessing module. We are able to get satisfactory results using the Naive Bayes classifier.
Keywords
Naive Bayes, Text classification, Punjabi.
- Disambiguating Hindi Words Using N-Gram Smoothing Models
Authors
1 Department of Computer Science, Punjabi University, Patiala, Punjab, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 10 (2014), Pagination: 26-29
Abstract
Word sense disambiguation is a widely studied and discussed area of NLP for any natural language under consideration. Words have different senses. The task of selecting the correct sense for a word is called word sense disambiguation. In Machine Translation, the problem arises when we translate ambiguous words. To resolve this problem, we develop a Word Sense Disambiguation module that resolves the ambiguity of Hindi words in a particular sentence. Deleted Interpolation and Back-Off N-gram smoothing models are used to implement Word Sense Disambiguation. We have used trigrams to implement these models. We use a dictionary-based approach.
Keywords
Word Sense Disambiguation, Punjabi, Back-Off, Deleted Interpolation.
- Development of domain based Hindi to Punjabi Machine Translation System - Scope and Challenges
Authors
1 Punjab Technical University, Jalandhar, IN
2 Department of Computer Science, Punjabi University, Patiala, IN
3 MIMIT, Malout
Source
Research Cell: An International Journal of Engineering Sciences, Vol 9 (2013), Pagination: 49-58
Abstract
Technical education plays a vital role in the development of any state or country. Language also plays a major role in the educational development of any region. India is a multilingual country where every region has its own prominent local language, which is the primary mode of communication and also the mode of basic education. On the other hand, English plays the dominant role in providing technical education in Punjab. In this paper we study the scope and challenges of developing a domain-based Machine Translation system that can translate technical terms from Hindi to Punjabi to break the language barrier.
Keywords
Machine Translation, Optical Character Recognition, Higher Education, Hindi, Punjabi, Technical Ambiguity, Non-Technical Ambiguity.
- N-gram Based Word Sense Disambiguation of Hindi Post Position से (sē) in the context of Hindi to Punjabi Machine Translation System
Authors
1 Punjab Technical University, Kapurthala, IN
2 Dept. of Computer Science, Punjabi University, Patiala, IN
3 MM University, Sadopur, Ambala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 9 (2013), Pagination: 59-67
Abstract
India has many regional languages. Attempts have been made to develop machine translation between these languages, but little success has been reported so far. Analysis of the Hindi to Punjabi machine translation system devised by Punjabi University, Patiala, India has found that the Hindi post position से (sē) is translated inaccurately most of the time because of its ambiguous nature, as it has eighteen different senses in Punjabi. The overall translation success rate of this system is reported as 87.60%; however, the translation success rate for the post position से (sē) is only about 2%. In this paper, the N-gram approach (along with its smoothing variants) has been applied to improve the accuracy of translation of this post position in the already developed Hindi to Punjabi Machine Translation System. It has been concluded that the bigram approach with the Add-One smoothing algorithm gives the best results, improving the accuracy of translation of the post position से (sē) from 2% to 85.49% and thus improving the overall machine translation accuracy of the system from 87.60% to 92.30%.
Keywords
Natural Language Processing (NLP), Word Sense Disambiguation (WSD), Machine Translation (MT).
- Machine Translation system for Standard Punjabi to Malwai Dialect
Authors
1 Punjab Technical University, Jalandhar, IN
2 Dept. of Computer Science, Punjabi University, Patiala, IN
3 MM University, Sadopur, Ambala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 9 (2013), Pagination: 68-73
Abstract
A lot of work has been done on the standard varieties of Indian languages in the field of MT, but the dialectal variety of a language is still an unexplored area. Machine Translation between a standard language and its dialect is easy, as the standard variety and its dialect are closely related to each other. When the language pair is closely related, the common grammar and vocabulary make it easy to develop an MT system.
In Punjab, the Malwai, Majhi, Doabi and Powadi dialects are generally used for oral communication, and no work has yet been done on any dialect in the field of Machine Translation. This paper discusses the various phases in the development of a Machine Translation system for the Standard Punjabi - Malwai dialect pair.
Keywords
Machine Translation, Malwai Dialect.
- A Review of Literature on Word Sense Disambiguation
Authors
1 Punjab Technical University, Kapurthala, IN
2 Sachdeva Engg. College for Girls, Gharuan, Mohali, IN
3 Dept. of Computer Science, Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 6 (2012), Pagination: 224-230
Abstract
Artificial intelligence (AI) has been a major research area in the later quarter of the 20th century and is likely to be even more so in the 21st century. A key part of AI is Word Sense Disambiguation (WSD), which deals with choosing the correct sense of a word in the given text. All human languages have words with multiple meanings, and selecting the intended sense is important. This paper briefly describes various methods presently used for WSD and their relative effectiveness. WSD currently finds application in Information Retrieval, Information Extraction, Automated Answering Machines, Speech Recognition, and Machine Translation, among many others. WSD has promise for the future in taking AI to the next higher level.
Keywords
Natural Language Processing (NLP), Artificial Intelligence (AI), Word Sense Disambiguation (WSD), Knowledge Based Methods, Supervised/Unsupervised Methods.
- Plagiarism Detection in Regional Languages – Its Challenges in Context to Punjabi Documents
Authors
1 Punjab Technical University, Kapurthala Road, Jalandhar, IN
2 Dept of Comp. Sc, Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 5 (2011), Pagination: 296-304
Abstract
Plagiarism detection has always been a challenging task for researchers and developers across the world. A number of researchers have contributed to this task by suggesting methods to detect plagiarism in research publications, books and software. Subsequently, some applications based on these methods have become available in the open source as well as the commercial marketplace. The majority of these applications perform well with English language documents, but fail to give satisfactory results with documents written in regional languages. This paper investigates the challenges in plagiarism detection in regional languages, with a specific case study of documents written in the Punjabi language.
- Need and Acceptability of Mother Tongue as Medium of Technical Education at Higher Level in State of Punjab
Authors
1 Punjab Technical University, Jalandhar, Punjab, IN
2 Department of Computer Science, Punjabi University, Patiala, Punjab, IN
3 MIMIT, Malout, Punjab, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 8 (2013), Pagination: 98-108
Abstract
The present paper is the report of a survey conducted to build a consensus on whether technical education at the higher level should be provided in the mother tongue in the state of Punjab. The survey has been conducted with primary data collected through a structured questionnaire. The questionnaire was filled in by various stakeholders such as students and teachers. Data was analyzed with the help of percentile and mean, and presented in tables and figures.
Keywords
Technical Education, Translation, Transliteration, Survey, Mother Tongue, Medium of Instruction.
- Comparative Study of Standard Punjabi and Malwai Dialect with Regard to Machine Translation
Authors
1 Punjab Technical University, Jalandhar, IN
2 Department of Electronics, M.M. University, Ambala, IN
3 Department of Computer Science, Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 8 (2013), Pagination: 109-118
Abstract
The Punjabi language is a modern Indo-Aryan language. It is the 10th most spoken language in the world. Most speakers of the Punjabi language live in the Punjab region of India. Punjabi has mainly four dialects, viz. Majhi, Doabi, Malwai and Pwadhi. Today, dialects represent the default variety of oral communication in the Punjab, and Standard Punjabi is almost exclusively used for writing. The Malwai dialect is taken for the study because there are twenty-two districts in Punjab state and the Malwai dialect is spoken in more than half of the districts. Both Standard Punjabi and Malwai use the same script, i.e. Gurmukhi, but the dialect is different. In this paper the differences between Standard Punjabi and Malwai are explained.
Keywords
Machine Translation, Punjabi, Malwai Dialect.
- Automated Stopwords Identification in Punjabi Documents
Authors
1 Punjab Technical University, Kapurthala Road, Jalandhar, IN
2 Dept of Comp. Sc, Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 8 (2013), Pagination: 119-125
Abstract
Many information retrieval tasks deal with the classification of huge amounts of data before giving final results. The data being processed in IR tasks may or may not be useful for the researchers. There has to be some method to identify data that is not useful (called stop words) and remove it from the data set before beginning the IR task. This gives dual benefits: it reduces the overall vector space, thereby leading to performance improvements in terms of execution speed and the relevance of results. The purpose of this paper is to find a suitable, automated method for identification of stop words in Punjabi text.
Keywords
Punjabi Stop Words List, Statistical Modeling, Borda Count, Information Processing, Text Classification.
- TDIL Programme: A Government Initiative
Authors
1 Multani Mal Modi College, Patiala, IN
2 Department of Computer Science, Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, 2010, Pagination: 1-8
Abstract
India is a country with a huge population of over one hundred and twenty crore people, who speak different languages. Only 5% of the Indian population can communicate effectively in English; the remaining 95% are comfortable with their regional languages and are deprived of the benefits of information technology. To bring the benefits of Communication and Information Technology to the common masses, the Ministry of Communication and Information Technology, Government of India, initiated a programme known as Technology Development for Indian Languages (TDIL). The objective of the programme is to develop Information Processing Tools and Techniques to facilitate human-machine interaction without language barrier; to create and access multilingual knowledge resources; and to integrate them to develop innovative user products and services. This paper surveys the various initiatives taken under TDIL for the promotion of Communication and Information Technology in Indian languages. We also provide an introductory list of tools/products developed by different institutions for Indian languages.
- Translation of English Complex/Compound Sentences into Indian Sign Language
Authors
1 Department of Computer Science, Punjabi University, Patiala (Punjab), IN
2 Department of Computer Science, DAV College, Jalandhar (Punjab), IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 33 (2020), Pagination: 1-14
Abstract
This paper outlines the concept for translation of English text to Indian Sign Language using real-domain synthetic animations. The translation framework consists of a processing module that parses the English input sentence into a phrase structure grammar representation, to which Indian Sign Language grammar rules are applied to reorder the English phrase terms. The English sentences given by the user are parsed through the converter module, which in turn converts complex and compound English sentences into their simplified versions by means of complex-to-simple and compound-to-simple English grammar rules. The elimination module removes unacceptable terms from the reordered sentence. Lemmatization is done in order to translate the words into their root form, since Indian Sign Language does not use word inflections. All the words in the sentence are then checked against a lexicon containing each English word and its HamNoSys notation, and terms that are not in the lexicon are replaced by their synonyms. Sentence words are substituted with their HamNoSys code. In the event that a word is not in the lexicon, the HamNoSys code is taken for each letter of the word. The HamNoSys data is translated to SiGML tags, and these SiGML tags are sent to the animation module, which translates the SiGML code into synthetic animation using an avatar.
Keywords
Indian Sign Language, HamNoSys, SiGML.
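The lexicon step described in the abstract above (word to HamNoSys code, with a synonym fallback and letter-by-letter spelling for unknown words) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `to_hamnosys`, the three dictionaries and every code string are hypothetical placeholders, since real HamNoSys notation uses its own glyph set.

```python
from typing import Dict, List

# Hypothetical toy data. Real HamNoSys strings are sequences of special
# glyphs; plain-text placeholders are used here instead.
LEXICON: Dict[str, str] = {
    "i": "<hns:I>",
    "go": "<hns:GO>",
    "home": "<hns:HOME>",
}
# Assumed synonym table: maps an out-of-lexicon word to an in-lexicon one.
SYNONYMS: Dict[str, str] = {"house": "home"}
# Assumed per-letter codes used when a word must be spelled out.
LETTERS: Dict[str, str] = {
    ch: f"<hns-letter:{ch.upper()}>" for ch in "abcdefghijklmnopqrstuvwxyz"
}


def to_hamnosys(words: List[str]) -> List[str]:
    """Map already-lemmatized words to placeholder HamNoSys codes."""
    codes: List[str] = []
    for word in words:
        w = word.lower()
        if w in LEXICON:
            # Direct lexicon hit.
            codes.append(LEXICON[w])
        elif SYNONYMS.get(w) in LEXICON:
            # Fall back to a synonym that the lexicon does contain.
            codes.append(LEXICON[SYNONYMS[w]])
        else:
            # Last resort: one code per letter of the word.
            codes.extend(LETTERS[ch] for ch in w if ch in LETTERS)
    return codes
```

For example, `to_hamnosys(["I", "go", "house"])` resolves `house` through the synonym table, while an unknown name such as `ram` is spelled letter by letter. The later SiGML conversion and avatar animation stages are outside the scope of this sketch.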
- Extraction of Replicated Punjabi Multiword Expressions
Authors
1 Department of Computer Science, Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 33 (2020), Pagination: 33-45
Abstract
Multiword Expressions (MWEs) play a vital role in Natural Language Processing. A Multiword Expression is a combination of two or more words that is treated as a single word. In the Punjabi language, there are varieties of MWEs, and many of these are of types that are not found in English. In this paper, we discuss different types of MWEs encountered in Punjabi. For example, replicated words and word combinations with antonym, synonym, hyponym, gender, number and the ‘waala’ morpheme have not been discovered as MWEs in English. Rule-based approaches, statistical methods, and linguists’ approaches were used for MWE identification and extraction. In this paper, we present a methodology for identification and extraction of Punjabi MWEs using statistical methods, rule-based methods and linguists’ approach.
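Of the MWE types listed in the abstract above, replicated words are the simplest to illustrate with a rule. The sketch below is an assumption about what such a rule could look like, not the paper's method: the function name `find_replicated` is hypothetical, and it only flags adjacent tokens that are exactly identical.

```python
from typing import List, Tuple


def find_replicated(tokens: List[str]) -> List[Tuple[int, str]]:
    """Return (start_index, expression) for each pair of identical adjacent tokens."""
    hits: List[Tuple[int, str]] = []
    i = 0
    while i < len(tokens) - 1:
        if tokens[i] == tokens[i + 1]:
            hits.append((i, f"{tokens[i]} {tokens[i + 1]}"))
            i += 2  # skip past the pair so "a a a" yields one hit, not two
        else:
            i += 1
    return hits


# Romanized toy sentence; whitespace tokenization is an assumption.
print(find_replicated("uh hauli hauli turda hai".split()))
# prints [(1, 'hauli hauli')]
```

A fuller system would also need to handle partial reduplication and the antonym, synonym and ‘waala’ combinations mentioned in the abstract, which this exact-match rule does not cover.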
- Sentiment Analysis of Code Mixed Text Consisting of English- Punjabi Lexicon
Authors
1 Department of Computer Science, Punjabi University, Patiala, IN
2 School of Management Studies, Punjabi University, Patiala, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 33 (2020), Pagination: 15-23
Abstract
Sentiment analysis is a field of study for analyzing people's emotions, such as happy, sad, angry, etc., towards the entities and attributes expressed in written text. In this study, data was collected in textual form from different sources like Facebook, YouTube, Twitter, and WhatsApp, and the collected data was then pre-processed. After that, identification of the language of the code-mixed text was performed, which involves handling tokenization, word-play, misspelled words, abbreviations, slang words, phonetic typing, etc. After the identification task, an English-Punjabi dictionary was created, consisting of opinionated word lists: positive, negative, and neutral. The rest of the words are stored in an unsorted word list. Finally, a statistical technique was applied to determine the sentence-level sentiment polarity of the English-Punjabi code-mixed dataset. It was found that the approaches using up to Five-Grams and Tri-Grams gave similar results.
Keywords
Code Mixed Text, Romanized Text, Natural Language Processing, Text Processing, Sentiment Analysis, Microblogging.
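The abstract above describes positive, negative and neutral word lists plus an unsorted list for everything else, but it does not name the statistical technique used for sentence-level polarity. The sketch below therefore substitutes a plain lexicon count purely for illustration; the word lists, the romanized Punjabi entries and the function name `sentence_polarity` are all hypothetical.

```python
from typing import Dict, List, Set, Union

# Hypothetical toy lexicons mixing English and romanized Punjabi words.
POSITIVE: Set[str] = {"good", "love", "vadhia", "changa"}
NEGATIVE: Set[str] = {"bad", "hate", "ghatia", "maada"}
NEUTRAL: Set[str] = {"ok", "theek"}
KNOWN: Set[str] = POSITIVE | NEGATIVE | NEUTRAL


def sentence_polarity(sentence: str) -> Dict[str, Union[str, List[str]]]:
    """Label a sentence by counting lexicon hits (a stand-in technique)."""
    tokens = sentence.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    # Words found in none of the three lists go to the unsorted list.
    unsorted_words = [t for t in tokens if t not in KNOWN]
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return {"label": label, "unsorted": unsorted_words}
```

Calling `sentence_polarity("movie bahut vadhia si")` labels the sentence positive and returns `movie`, `bahut` and `si` as unsorted words. An n-gram variant would count lexicon hits over word sequences rather than single tokens.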
- Word Level Language Identification of English-Punjabi Code-Mixed Social Media Text
Authors
1 Department of Computer Science, Punjabi University College of Engineering & Management, Rampura Phul, IN
2 Department of Computer Science, Punjabi University, Patiala, IN
3 Department of Computer Science and Engineering, Yadavindra College of Engineering, Talwandi Sabo, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 33 (2020), Pagination: 24-32
Abstract
Code mixing denotes using multiple languages in an utterance. Code mixing is clearly pervasive when people communicate over social media, irrespective of the mode being used. The fusion of languages makes it more challenging and requires consistent updates according to recent trends. The current paper addresses three approaches, namely CRFs (Conditional Random Fields), Bi-LSTM (Bidirectional Long Short-Term Memory) and CNNs (Convolutional Neural Networks). Firstly, for word-level language identification of code-mixed English-Punjabi text, a CRF-based system uses lexical, contextual, character n-gram, and special character features. Secondly, a Recurrent Neural Network, namely Bi-LSTM with GloVe embedding, is used for language identification, and thirdly, a CNN with GloVe embedding is used for language identification. It is observed that the CRF is the best performing system, with an F1-score of 0.96.
Keywords
Code Mixing, Language Identification, Deep Learning, GloVe Embedding, Conditional Random Fields.
- PKIT: Printed Kashmiri Image Text Recognition Using Deep Learning
Authors
1 Department of Computer Science, Punjabi University, Patiala, 147002, Punjab, India, IN
2 Department of Computer Science, Punjabi University, Patiala, 147002, Punjab, India, IN
3 Department of Computer Science, Islamic University of Science and Technology, Awantipora, 192122, Jammu and Kashmir, IN
Source
Research Cell: An International Journal of Engineering Sciences, Vol 35 (2023), Pagination: 01-19
Abstract
Optical Character Recognition (OCR) converts scanned documents, images of text, and PDFs into text documents that can be edited and searched on a computer. OCR software analyzes a scanned image of text, identifies the characters in it, and turns them into machine-encoded text. Given the practical relevance of OCR, a multitude of approaches have evolved for both Western and Asian languages. Kashmiri is spoken mostly in the Kashmir Valley in Jammu and Kashmir, India. Although significant effort has gone into recognizing Indian scripts such as Devanagari, Bengali, Urdu, and Punjabi, no such effort has been made for the Kashmiri script. In addition, several benchmark corpora have been developed for other Perso-Arabic scripts, such as Urdu, Arabic, and Pashto, for training and assessing OCR systems. Notably, there is currently no OCR corpus for the Kashmiri script that can be used to train and evaluate deep neural networks for Kashmiri OCR. To that end, we propose a Kashmiri corpus, Printed Kashmiri Image Text (PKIT), consisting of 120,000 line-level and 523,000 word-level printed text images, well suited for use with deep learning techniques. Additionally, we used the proposed dataset to train different state-of-the-art deep learning approaches, obtaining a Word Error Rate (WER) and Character Error Rate (CER) of 5.62% on average.
Keywords
Dataset Generation, Deep Learning, Kashmiri OCR, Optical Character Recognition, Printed Kashmiri Text Recognition.
References
- Davis, R., Lyall, J.: Recognition of handwritten characters — a review. Image and Vision Computing 4(4), 208–218 (1986). https://doi.org/10.1016/0262-8856(86)90048-X
- Plamondon, R., Srihari, S.N.: On-line and off-line handwriting recognition: A comprehensive survey. IEEE Trans. Pattern Anal. Mach. Intell. 22, 63–84 (2000)
- Isheawy, N.A.M., Hasan, H.: Optical character recognition (ocr) system. IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN, 2278–0661 (2015)
- Gonçalves, G.R., Diniz, M.A., Laroca, R., Menotti, D., Schwartz, W.R.: Real-time automatic license plate recognition through deep multi-task networks. In: 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 110–117 (2018). IEEE
- Qadri, M.T., Asif, M.: Automatic number plate recognition system for vehicle identification using optical character recognition. In: 2009 International Conference on Education Technology and Computer, pp. 335–338 (2009). IEEE
- Fujisawa, H.: Forty years of research in character and document recognition—an industrial perspective. Pattern Recognition 41(8), 2435–2446 (2008)
- Mohamed, M., Gader, P.: Handwritten word recognition using segmentation-free hidden markov modeling and segmentation-based dynamic programming techniques. IEEE transactions on pattern analysis and machine intelligence 18(5), 548–554 (1996)
- Tappert, C.C., Suen, C.Y., Wakahara, T.: The state of the art in online handwriting recognition. IEEE Transactions on pattern analysis and machine intelligence 12(8), 787–808 (1990)
- Breuel, T.M., Ul-Hasan, A., Al-Azawi, M.A., Shafait, F.: High-performance ocr for printed english and fraktur using lstm networks. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 683–687 (2013). IEEE
- Cesar, M., Shinghal, R.: An algorithm for segmenting handwritten postal codes. International journal of man-machine studies 33(1), 63–80 (1990)
- Cheriet, M., Huang, Y.S., Suen, C.Y.: Background region-based algorithm for the segmentation of connected digits. In: 11th IAPR International Conference on Pattern Recognition. Vol. II. Conference B: Pattern Recognition Methodology and Systems, vol. 1, pp. 619–620 (1992). IEEE Computer Society
- Doetsch, P., Kozielski, M., Ney, H.: Fast and robust training of recurrent neural networks for offline handwriting recognition. In: 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 279–284 (2014). IEEE
- Natarajan, P., Bazzi, I., Lu, Z., Makhoul, J., Schwartz, R.: Robust ocr of degraded documents. In: Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR’99 (Cat. No. PR00318), pp. 357–361 (1999). IEEE
- Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. Advances in neural information processing systems 21 (2008)
- Ashwin, T., Sastry, P.: A font and size-independent ocr system for printed kannada documents using support vector machines. Sadhana 27(1), 35–58 (2002)
- Bashir, M., Goyal, V., Giri, K.J.: Challenges in recognition of Kashmiri script. In: Singh, P.K., Singh, Y., Kolekar, M.H., Kar, A.K., Gonçalves, P.J.S. (eds.) Recent Innovations in Computing, pp. 33–43. Springer, Singapore (2022)
- Bashir, R., Quadri, S.M.K.: Identification of kashmiri script in a bilingual document image. 2013 IEEE Second International Conference on Image Information Processing (ICIIP-2013), 575–579 (2013)
- Graves, A., Schmidhuber, J.: Offline arabic handwriting recognition with multidimensional recurrent neural networks, pp. 545–552 (2008). https://doi.org/10.1007/978-1-4471-4072-6_12
- Pal, A.: Bengali handwritten numeric character recognition using denoising autoencoders. In: 2015 IEEE International Conference on Engineering and Technology (ICETECH), pp. 1–6 (2015). IEEE
- Ahmad, I., Wang, X., Li, R., Rasheed, S.: Offline urdu nastaleeq optical character recognition based on stacked denoising autoencoder. China Communications 14(1), 146–157 (2017)
- Naz, S., Umar, A.I., Ahmad, R., Siddiqi, I., Ahmed, S.B., Razzak, M.I., Shafait, F.: Urdu nastaliq recognition using convolutional–recursive deep learning. Neurocomputing 243, 80–87 (2017)
- Addakiri, K., Bahaj, M.: On-line handwritten arabic character recognition using artificial neural network. International Journal of Computer Applications 55(13), 42–46 (2012)
- Osman, H., Zaghw, K., Hazem, M., Elsehely, S.: An efficient language-independent multi-font OCR for arabic script. CoRR abs/2009.09115 (2020)
- Fasha, M., Hammo, B.H., Obeid, N., Widian, J.: A hybrid deep learning model for arabic text recognition. ArXiv abs/2009.01987 (2020)
- Altwaijry, N., Al-Turaiki, I.: Arabic handwriting recognition system using convolutional neural network. Neural Computing and Applications 33 (2021). https://doi.org/10.1007/s00521-020-05070-8
- Mushtaq, M.M.M.K.M.K. Faisel, Sing, S.: Urdudeepnet: offline handwritten urdu character recognition using deep neural network. Neural Computing and Applications 33 (2021)
- Qurat-ul-Ain Akram, F.A.S.U.S.H.S.S. Anneta Niazi: A comprehensive image dataset of urdu nastalique document images, 81–88 (2016)
- Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39(11), 2298–2304 (2016)
- Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv (2014). https://doi.org/10.48550/ARXIV.1409.1556. https://arxiv.org/abs/1409.1556
- Wei, T.C., Sheikh, U., Ab Rahman, A.A.-H.: Improved optical character recognition with deep neural network. In: 2018 IEEE 14th International Colloquium on Signal Processing & Its Applications (CSPA), pp. 245–249 (2018). IEEE
- Paul, D., Chaudhuri, B.B.: A BLSTM network for printed bengali OCR system with high accuracy. CoRR abs/1908.08674 (2019)
- Levenshtein, V.I., et al.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady, vol. 10, pp. 707–710 (1966). Soviet Union
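The PKIT abstract reports results as Word Error Rate (WER) and Character Error Rate (CER). Both metrics are conventionally defined as the Levenshtein edit distance between the recognized text and the ground truth, divided by the length of the ground truth, counted in words for WER and in characters for CER. The sketch below illustrates that conventional definition only. It is not taken from the paper, and the function names `edit_distance`, `cer`, and `wer` are hypothetical; the authors' exact evaluation procedure (for example, any text normalization) is not stated in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because the functions operate on Python strings, they count Unicode code points, so a Perso-Arabic base letter and a combining diacritic are treated as separate characters. Whether that matches the paper's character counting is an assumption.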