Improving the efficiency of processing big data using the distributed data analysis method

DOI: 10.31673/2412-4338.2021.021523

Authors

  • О. В. Гордійчук-Бублівська (Hordiychuk-Bublivsʹka O. V.), Lviv Polytechnic National University, Lviv
  • М. І. Бешлей (Beshley M. I.), Lviv Polytechnic National University, Lviv
  • М. І. Кирик (Kyryk M. I.), Lviv Polytechnic National University, Lviv
  • М. М. Климаш (Klymash M. M.), Lviv Polytechnic National University, Lviv

Abstract

Building efficient data processing systems requires a combination of methods for collecting, storing, and analyzing information. Machine learning algorithms are used to find the necessary information in large data sets. Because most modern large-scale information systems rely on a large number of computing devices, distributed data processing technologies are considerably more efficient for them. In particular, distributed machine learning is widely used: devices train on local datasets and send only the results to the global model. This approach improves the reliability and confidentiality of data because user information remains on the user's device. The article also presents an approach to analyzing large amounts of information using Singular Value Decomposition (SVD). This algorithm makes it possible both to reduce the amount of information by discarding redundancy and to predict events from patterns identified in the data. The article identifies the main features of distributed data analysis and the possibility of using complex information analysis and machine learning algorithms in such systems. However, SVD is difficult to implement in a distributed architecture. To improve the efficiency of this method in distributed systems, a modified FedSVD algorithm is proposed. With this algorithm, user data is collected from different devices, and additional protection against interference or interception is provided. The results can be used in the design of data analysis systems and to increase the reliability of the user information they rely on, including in corporate information systems and in the financial and IT sectors. The proposed approaches can serve as a basis for information technologies that automatically provide recommendations to users or predict emergencies at enterprises.
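
The idea of running SVD on data spread across devices while protecting it from interception can be illustrated with a minimal sketch. It is not the modified algorithm proposed in the article, whose details are not given in this abstract. It shows only the general orthogonal-masking approach, assuming that the data matrix is split by columns across devices and that a trusted party distributes the random masks. The function names `random_orthogonal` and `masked_federated_svd` and the parameters `local_blocks`, `rank`, and `seed` are hypothetical.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw a random n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def masked_federated_svd(local_blocks, rank, seed=0):
    """Sketch of SVD over column-partitioned data using orthogonal masks.

    local_blocks: list of (m x n_i) arrays, one per device (assumed layout).
    Returns the rank-truncated U, singular values, and V^T of the joint matrix.
    """
    rng = np.random.default_rng(seed)
    m = local_blocks[0].shape[0]
    n = sum(block.shape[1] for block in local_blocks)

    # Assumption: a trusted party generates the masks and gives each device
    # P plus the rows of Q that match its columns; the server sees neither.
    P = random_orthogonal(m, rng)
    Q = random_orthogonal(n, rng)

    # Each device masks its own block locally; the server only adds them up.
    masked_sum = np.zeros((m, n))
    start = 0
    for block in local_blocks:
        stop = start + block.shape[1]
        masked_sum += P @ block @ Q[start:stop, :]
        start = stop

    # The server factorizes the masked matrix P @ X @ Q, never X itself.
    U_masked, s, Vt_masked = np.linalg.svd(masked_sum, full_matrices=False)

    # Orthogonal masks leave the singular values unchanged, so removing
    # them from the singular vectors recovers the factors of X.
    U = P.T @ U_masked
    Vt = Vt_masked @ Q.T

    # Keeping only the leading components discards redundant information.
    return U[:, :rank], s[:rank], Vt[:rank, :]
```

Calling `masked_federated_svd(blocks, rank=k)` on a list of per-device arrays returns factors whose product `U @ np.diag(s) @ Vt` is a rank-`k` approximation of the horizontally stacked data. In a real deployment the unmasking step would run on the devices rather than in one process, which this single-process sketch does not model.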

Keywords: Big Data, distributed systems, machine learning, increasing the reliability of information processing.

Published

2022-05-28

Section

Articles