Abstract
Using echocardiogram data for cardiovascular disease (CVD) prognosis is difficult because imbalanced datasets lead to biased predictions. Machine learning models can improve prognosis accuracy, but their effectiveness depends on optimal feature selection and robust classification techniques. This study introduces an event-based self-similarity approach to enhance automatic feature selection for imbalanced echocardiogram data. Critical features correlated with disease progression were identified by leveraging self-similarity patterns. The study used an echocardiogram dataset (visual presentations of high-frequency sound wave signals) and data from patients with heart disease who were treated using three treatment methods (catheter ablation, ventricular defibrillator, and drug control) over the course of three years. The dataset was classified into nine categories, and Recursive Feature Elimination (RFE) was applied to identify the most relevant features, reducing model complexity while maintaining diagnostic accuracy. Machine learning classification models, including XGBoost and CatBoost, were trained and evaluated. The two models achieved comparable accuracy, 84.3% for XGBoost and 88.4% for CatBoost, under different normalization techniques. To further optimize performance, the models were combined into a voting ensemble, improving feature selection and predictive accuracy. Four features were identified as critical for prognosis and were found in the Random Forest (RF) voting ensemble classifier: age, aorta (AO), left ventricle (LV), and left atrium (LA). The results underscore the importance of feature selection techniques in handling imbalanced datasets, improving classification robustness, and reducing bias in automated prognosis systems. Our findings highlight the potential of machine learning-driven echocardiogram analysis to enhance patient care by providing accurate, data-driven assessments.
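The general shape of the pipeline described above (RFE followed by a voting ensemble on imbalanced data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic rather than the echocardiogram dataset, three classes stand in for the nine categories, scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost and CatBoost so that the example needs only scikit-learn, and the way RFE and the ensemble are chained is a guess, since the abstract does not specify it. The event-based self-similarity step is not reproduced.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.feature_selection import RFE
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in data (assumption: NOT the study's dataset).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=4, n_redundant=2,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)
# Stratify so the minority classes appear in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# RFE repeatedly drops the least important feature until four remain.
selector = RFE(
    estimator=RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    n_features_to_select=4,
)

# Soft voting averages the predicted class probabilities of the members.
# GradientBoostingClassifier is a stand-in for XGBoost/CatBoost (assumption).
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(
            n_estimators=100, class_weight="balanced", random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)

model = Pipeline([("rfe", selector), ("vote", ensemble)])
model.fit(X_train, y_train)

selected = model.named_steps["rfe"].support_  # boolean mask over features
score = balanced_accuracy_score(y_test, model.predict(X_test))
print("selected feature mask:", selected)
print("balanced accuracy:", round(score, 3))
```

Balanced accuracy is reported here instead of plain accuracy because it averages recall over classes, which makes it less flattering to a model that only predicts the majority class. The abstract itself reports accuracy.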
Citation
ID:
281982
Ref Key:
huang2025automatic