Software implementation of the automatic text classifier based on the specified method of forming a features space for categories

DOI: 10.31673/2412-4338.2020.011673

Authors

  • Т. В. Голуб, (Golub T. V.) National University "Zaporizhzhya Polytechnic", Zaporizhzhia
  • І. Я. Зеленьова, (Zeleneva I. Ya.) National University "Zaporizhzhya Polytechnic", Zaporizhzhia
  • С. С. Грушко, (Hrushko S. S.) National University "Zaporizhzhya Polytechnic", Zaporizhzhia
  • Н. В. Луценко, (Lutsenko N. V.) National University "Zaporizhzhya Polytechnic", Zaporizhzhia

Abstract

The article addresses text classification, one of the tasks of computational linguistics. It covers the theoretical development and software implementation of the specified method for forming the category attribute space, and studies the effectiveness of this method in the classification of text documents.
A distinguishing feature of this method is that it classifies documents into categories within a common subject area and thus refines the result. Within a single subject area, the same terminology is often used in several categories, which makes classification more difficult.
The specified method for creating the category attribute space includes two stages: preliminary processing of the text and creation of the attribute space. Preliminary text processing depends on the language of the source text, which requires algorithms specialized for individual languages. This study examines texts in Ukrainian. Stemming, one of the preprocessing steps, is based on a method adapted for Ukrainian texts that takes into account the syntax and word formation (morphology) of the language. The category attribute space is created with the TF-SLF method, which takes into account the occurrence of words in each category; the resulting space is then filtered by a threshold value that reflects the importance of a word for the category.
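The two stages described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' software package: the suffix list in `stem` is a hypothetical toy stand-in for the adapted Ukrainian stemmer, the weighting in `build_feature_space` (normalized in-category document frequency, scaled down for terms spread across several categories) is an assumed stand-in for TF-SLF rather than its published formula, and the corpus and the threshold of 0.5 are made up for the example.

```python
from collections import Counter, defaultdict
import math
import re

# Hypothetical, heavily simplified suffix list (longer suffixes first).
# A real Ukrainian stemmer handles far more morphology than this.
SUFFIXES = ("ами", "ями", "ого", "ий", "ів", "ах", "ям", "ом",
            "а", "и", "і", "у", "я")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Stage 1: lowercase the text, split it into letter-only tokens, stem them."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [stem(token) for token in tokens]


def build_feature_space(corpus: dict[str, list[str]],
                        threshold: float) -> dict[str, dict[str, float]]:
    """Stage 2: weight terms per category, then drop terms below the threshold."""
    # Share of documents in each category that contain each term.
    ndf: dict[str, dict[str, float]] = {}
    for category, docs in corpus.items():
        counts = Counter()
        for doc in docs:
            counts.update(set(preprocess(doc)))
        ndf[category] = {term: n / len(docs) for term, n in counts.items()}

    # How widely each term is spread over all categories.
    spread: dict[str, float] = defaultdict(float)
    for freqs in ndf.values():
        for term, value in freqs.items():
            spread[term] += value

    # Assumed weighting: terms shared by many categories get a lower weight.
    space: dict[str, dict[str, float]] = {}
    for category, freqs in ndf.items():
        weights = {term: value * math.log(len(corpus) / spread[term])
                   for term, value in freqs.items()}
        space[category] = {t: w for t, w in weights.items() if w >= threshold}
    return space


# Made-up toy corpus: two categories that share the word "атом".
CORPUS = {
    "physics": ["атом енергія поле", "енергія поле хвиля"],
    "chemistry": ["атом реакція кислота", "реакція кислота сіль"],
}
space = build_feature_space(CORPUS, threshold=0.5)
```

In this toy corpus the stem "атом" occurs in both categories, so its weight falls below the threshold and it is dropped from both feature spaces, while the category-specific stems remain.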
When the stages of the specified method are applied in sequence, the feature space for the different categories is reduced and uninformative terms are excluded from it. This reduces the number of iterations and calculations during subsequent classification, which shortens the total time needed to solve the problem.
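The effect of a smaller feature space on the amount of computation can be illustrated with a short Python sketch. The scorer below is a hypothetical, deliberately simple one (a sum of term weights multiplied by term counts), not the classifier used by the authors, and the weights and tokens are made-up values chosen only for the example.

```python
def classify(tokens: list[str],
             feature_space: dict[str, dict[str, float]]) -> tuple[str, int]:
    """Return the best-scoring category and the number of weight lookups made."""
    operations = 0
    scores: dict[str, float] = {}
    for category, weights in feature_space.items():
        score = 0.0
        for term, weight in weights.items():
            operations += 1  # one multiply-add per feature in the space
            score += weight * tokens.count(term)
        scores[category] = score
    return max(scores, key=scores.get), operations


# Made-up weights: "атом" is shared by both categories and weakly informative.
full_space = {
    "physics": {"атом": 0.35, "енергі": 0.69, "поле": 0.69, "хвил": 0.69},
    "chemistry": {"атом": 0.35, "реакці": 0.69, "кислот": 0.69, "сіль": 0.69},
}
# Threshold filtering removes the weakly informative shared term.
reduced_space = {
    category: {t: w for t, w in weights.items() if w >= 0.5}
    for category, weights in full_space.items()
}

tokens = ["атом", "енергі", "поле"]  # an already preprocessed document
label_full, ops_full = classify(tokens, full_space)
label_reduced, ops_reduced = classify(tokens, reduced_space)
```

With these values both spaces assign the document to the same category, while the filtered space needs fewer weight lookups; on real data the saving grows with the number of terms removed.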
A software package was developed on the basis of the specified method proposed by the authors and was then used to confirm the effectiveness of the method.

Keywords: text classification, word preprocessing, stemming, filtering, category attribute space.

Published

2020-08-03

Section

Articles