ANALYSIS OF FEATURE IMPORTANCE AND EFFECTIVENESS OF MACHINE LEARNING MODELS IN PREDICTING TUBERCULOSIS CASES IN INDIA

DOI 10.31673/2412-4338.2025.026806

Abstract

Tuberculosis (TB) remains one of the most severe infectious diseases globally, with India bearing the highest burden according to the World Health Organization (WHO). High population density, unequal access to healthcare, socioeconomic conditions, and comorbidities such as diabetes create a conducive environment for TB spread. This study analyzes the importance of factors influencing TB incidence in India and evaluates the effectiveness of machine learning models in predicting cases. The aim is to identify key determinants of TB spread and develop evidence-based recommendations to reduce the epidemiological burden in the region.
The study uses data from 2019–2022 sourced from open databases, including WHO and Indian government reports. The dataset comprises 126 records and 25 variables, covering diagnostic indicators (e.g., detected TB cases, multidrug-resistant TB, TB-HIV coinfection), social factors (e.g., tobacco and alcohol use), healthcare infrastructure, and treatment outcomes (e.g., success, mortality, treatment interruption). The analysis employed descriptive statistics, correlation analysis, multiple linear regression with regularization (L2 for Ridge, L1 for Lasso), and nonlinear machine learning methods, including Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest. Model accuracy was assessed via cross-validation using the R² (coefficient of determination) and MSE (mean squared error) metrics.
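The two metrics named above can be illustrated with a minimal sketch. The function names and the toy values below are assumptions made for illustration; they are not the study's code or data. The sketch also shows why R² can be negative, as reported for one model in this abstract: a model that predicts worse than the mean of the observed values has a residual sum of squares larger than the total sum of squares.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, purely illustrative (not taken from the dataset).
observed = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]     # reproduces every observation
mean_only = [2.5, 2.5, 2.5, 2.5]   # always predicts the observed mean
reversed_fit = [4.0, 3.0, 2.0, 1.0]  # systematically wrong

print(r2(observed, perfect), mse(observed, perfect))
print(r2(observed, mean_only), mse(observed, mean_only))
print(r2(observed, reversed_fit), mse(observed, reversed_fit))
```

In this sketch the mean-only predictor scores R² = 0, so any negative R² on held-out data indicates a model that generalizes worse than that trivial baseline.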
Correlation analysis revealed no strong linear relationships between factors and total TB cases, suggesting nonlinear dependencies. Multiple linear regression showed low explanatory power (R² ≈ 0.3), while regularized methods (Lasso with α = 0.01) slightly improved generalization (R² = 0.1007). Among linear models, factors related to case notifications by gender and multidrug-resistant TB diagnosis were most significant. Nonlinear models proved more effective: initial analysis indicated that Random Forest (R² = 0.4595 on test data) outperformed KNN and SVM, while Decision Tree suffered from overfitting (R² = -0.3044 on test data). To enhance accuracy, a new target variable (total_inf), defined as TB cases normalized per 100,000 population, was introduced to account for state population sizes. This adjustment significantly improved model performance: Decision Tree achieved R² = 0.8724, and Random Forest reached R² = 0.8378 on test data. Analysis of factor importance confirmed that multidrug-resistant TB diagnosis (MDR/RR TB Diagnosed) and treatment center infrastructure (PMDT-Infrastructure) are key predictors, highlighting the critical role of medical resources and timely detection of resistant strains.
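The per-100,000 normalization behind total_inf can be sketched as follows. The function name, the state labels, and all numbers are hypothetical and chosen only to show the arithmetic; the abstract does not specify how the study computed the variable beyond the per-100,000 definition.

```python
def cases_per_100k(total_cases, population):
    """Return TB cases per 100,000 population for one state-year record."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so the illustrative integer inputs divide cleanly.
    return total_cases * 100_000 / population


# Hypothetical records: (label, total cases, population). Not real data.
records = [
    ("State A", 50_000, 25_000_000),
    ("State B", 1_000, 500_000),
]

for label, cases, population in records:
    print(label, cases_per_100k(cases, population))
```

The two hypothetical states differ fiftyfold in raw case counts yet share the same rate, which illustrates why a population-adjusted target removes the effect of state size from the prediction task.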

Keywords: tuberculosis, India, machine learning, multiple linear regression, random forest, decision tree, SHAP analysis, socioeconomic factors, multidrug-resistant tuberculosis, incidence prediction.

Published

2025-06-25

Section

Articles