METHODS OF STREAM DATA ANALYSIS FOR ENSURING FAULT TOLERANCE OF DISTRIBUTED SYSTEMS
DOI: 10.31673/2412-4338.2025.048913
Abstract
The relevance of the topic stems from the fact that modern distributed information systems (DS) form the basis for the functioning of digital platforms in cloud computing, financial services, and the Internet of Things (IoT). Their operational efficiency and overall reliability directly depend on the ability to process vast amounts of data in real time. Component failures or incorrect data processing can lead to large-scale outages, data loss, and significant financial losses. In this context, stream analysis methods that enable real-time monitoring of infrastructure status, anomaly detection, and timely response to threats are particularly relevant. The application of machine learning (ML) to stream analysis significantly increases the accuracy of failure prediction and resource optimization. The purpose of the study is to analyze modern methods of processing and analyzing stream data to ensure the uninterrupted operation and fault tolerance of distributed system infrastructures, and to identify practical approaches for implementing such methods in real-world conditions. The research methodology is based on the analysis of scientific publications and technical documentation, comparative analysis of modern stream-processing platforms, mathematical modeling for building monitoring and failure-prediction models, and empirical methods for studying DS performance. Platforms such as Apache Kafka, Apache Flink, and Apache Spark Streaming are analyzed in detail. The role of machine learning methods is assessed, particularly in the context of anomaly detection and adaptation to "concept drift". As a result of the study, stream analysis methods for enhancing DS fault tolerance have been systematized.
Key architectural differences among the leading tools were identified: Apache Kafka as a distributed event storage and streaming platform (commit-log principle); Apache Flink as a tool for low-latency stream processing with support for stateful computations and checkpointing; and Apache Spark Streaming, which implements a micro-batching approach. The role of orchestration tools, particularly Kubernetes, in automating the deployment, scaling, and self-healing of streaming pipelines is examined. The feasibility of using ML for real-time failure prediction is substantiated. Practical methods for ensuring uninterrupted operation are detailed, including backups (Velero snapshots in Kubernetes), clustering, monitoring (Prometheus, Grafana), and fault tolerance testing, such as "chaos engineering" approaches (e.g., Chaos Mesh). The practical significance of the research lies in preparing recommendations for implementing stream analysis methods and fault tolerance systems in real-world information systems, particularly in cloud platforms, IoT systems, and critical IT infrastructures.
Keywords: computer network, streaming data, real-time processing, Apache Kafka, Apache Flink, machine learning, anomaly detection, Kubernetes, chaos engineering.