
A Survey on Malware Classification Using Machine Learning and Deep Learning


Affiliations
1 Department of Computer Science and Engineering, IK Gujral Punjab Technical University, Kapurthala, Punjab, India
 

Information technology is developing rapidly, and maintaining security and privacy is a major concern for cybersecurity professionals. Studies show that the number of new malware samples is increasing sharply. Attack and defense form a never-ending cycle: antivirus companies continually work to develop signatures for malicious software, and attackers continually try to evade those signatures. Machine learning techniques are highly efficient for malware detection. The detection process is split into two stages: feature extraction and malware classification. The effectiveness of a classification algorithm depends on the features extracted. This paper first provides an in-depth study of the features that can be used to differentiate malware. It then describes the various stages of machine learning and deep learning that researchers use in their work, together with the pros and cons they encounter, to assist new researchers in selecting an algorithm for their own research.
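The two-stage process described above (feature extraction followed by classification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: byte 2-gram counts stand in for the extracted features, a nearest-centroid rule with cosine similarity stands in for the classifier, and the training byte strings are invented toy data rather than real samples.

```python
from collections import Counter
from math import sqrt


def extract_features(data: bytes, n: int = 2) -> Counter:
    """Stage 1 (feature extraction): count overlapping byte n-grams."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(samples):
    """Sum the feature counts of each class to form one centroid per label."""
    centroids = {}
    for data, label in samples:
        centroids.setdefault(label, Counter()).update(extract_features(data))
    return centroids


def classify(centroids, data: bytes) -> str:
    """Stage 2 (classification): pick the label whose centroid is most similar."""
    feats = extract_features(data)
    return max(centroids, key=lambda label: cosine(centroids[label], feats))


# Hypothetical toy data: the byte strings and labels are made up.
TRAINING = [
    (b"\x90\x90\x90\x90\xeb\xfe\x90\x90", "malware"),
    (b"\x90\x90\xeb\xfe\xeb\xfe\x90\x90", "malware"),
    (b"hello world, hello user", "benign"),
    (b"print hello to the user", "benign"),
]

centroids = train(TRAINING)
prediction = classify(centroids, b"\x90\x90\x90\xeb\xfe")
print(prediction)
```

Swapping `extract_features` for a different feature extractor changes what the classifier sees without touching `classify`, which is one way to picture why classification quality depends on the features extracted.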

Keywords

Malware Detection, Static Analysis, Dynamic Analysis, Security, Features of Malware, Machine Learning, Deep Learning.



Authors

Manish Goyal
Department of Computer Science and Engineering, IK Gujral Punjab Technical University, Kapurthala, Punjab, India
Raman Kumar
Department of Computer Science and Engineering, IK Gujral Punjab Technical University, Kapurthala, Punjab, India

DOI: https://doi.org/10.22247/ijcna/2021/210724