Data Preprocessing and Exploratory Data Analysis
The initial phase covered data cleaning and preprocessing: we handled missing values (most notably in the BMI column), encoded categorical variables, and standardized numerical features. Exploratory Data Analysis (EDA) then used histograms, correlation matrices, and statistical tests to examine how each variable relates to stroke occurrence.
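The three preprocessing steps can be illustrated with a minimal, dependency-free sketch. The source does not say how missing BMI values were filled, so median imputation is an assumption here, as are the column names and the toy records.

```python
from statistics import mean, median, pstdev

# Hypothetical records; column names and values are illustrative assumptions.
rows = [
    {"age": 67.0, "bmi": 36.6, "gender": "Male"},
    {"age": 61.0, "bmi": None, "gender": "Female"},
    {"age": 80.0, "bmi": 32.5, "gender": "Male"},
    {"age": 49.0, "bmi": 34.4, "gender": "Female"},
]

def impute_median(rows, col):
    """Fill missing values in `col` with the median of the observed values."""
    fill = median(r[col] for r in rows if r[col] is not None)
    for r in rows:
        if r[col] is None:
            r[col] = fill
    return fill

def one_hot(rows, col):
    """Replace a categorical column with one 0/1 indicator per category."""
    categories = sorted({r[col] for r in rows})
    for r in rows:
        value = r.pop(col)
        for c in categories:
            r[f"{col}_{c}"] = 1 if value == c else 0
    return categories

def standardize(rows, col):
    """Rescale `col` to zero mean and unit (population) standard deviation."""
    values = [r[col] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    for r in rows:
        r[col] = (r[col] - mu) / sigma if sigma else 0.0

bmi_fill = impute_median(rows, "bmi")
one_hot(rows, "gender")
for col in ("age", "bmi"):
    standardize(rows, col)
```

In a real pipeline the median, category list, mean, and standard deviation would be computed on the training split only and then reused on validation data, so that no information leaks across the split.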
Feature Engineering and Selection
We engineered new features, such as interaction terms between age and other risk factors. Statistical tests (t-tests for continuous variables, chi-square tests for categorical variables) were then used to identify the most significant predictors. This analysis showed that age, average glucose level, and the presence of hypertension or heart disease were highly influential factors.
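As a rough illustration of both ideas, the sketch below builds one interaction term and computes a Pearson chi-square statistic for a binary risk factor against the stroke label. The counts, field names, and helper names are hypothetical and are not results from this project.

```python
def add_interaction(record, a, b):
    """Add the product of two features as a new interaction feature."""
    record[f"{a}_x_{b}"] = record[a] * record[b]
    return record

def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical patient record.
patient = add_interaction({"age": 67, "hypertension": 1}, "age", "hypertension")

# Hypothetical counts: rows = hypertension (no, yes), columns = stroke (no, yes).
table = [[90, 10], [30, 20]]
stat = chi_square_2x2(table)

# With 1 degree of freedom, the 5% critical value is about 3.841.
significant = stat > 3.841
```

A statistic above the critical value suggests the factor and the outcome are not independent; it says nothing about effect size, which is why the result is best read alongside model-based importance.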
Model Development and Evaluation
Five machine learning models were implemented and compared:
- Logistic Regression
- Random Forest
- XGBoost
- LightGBM
- CatBoost
We tuned each model's hyperparameters using cross-validation and Bayesian optimization. Evaluation metrics included precision, recall, F1-score, ROC AUC, and PR AUC, with a particular focus on recall, because missing a true stroke case is the costliest error in this setting.
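For reference, the threshold-based metrics and ROC AUC named above can be computed as in the following sketch. The labels and scores are made up for illustration, and the 0.5 threshold is an assumption rather than the setting used in this project.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = stroke)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1, 0.7, 0.3, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise AUC shown here is quadratic in the number of samples, which is fine for a sketch; library implementations sort the scores instead.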
Key Findings
The CatBoost model emerged as the top performer, achieving a recall of 0.9048 and an ROC AUC of 0.8621 on the validation set. Among the models compared, it best balanced high sensitivity (crucial for identifying potential stroke cases) with acceptable precision.
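One common way to trade precision for recall is to lower the decision threshold applied to the predicted probabilities. The source does not state whether the threshold was tuned, so the sketch below is only an illustration of that trade-off, with a hypothetical helper `pick_threshold` and made-up scores.

```python
def pick_threshold(y_true, scores, target_recall):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall meets `target_recall`, or None if no threshold does."""
    positives = sum(y_true)
    if positives == 0:
        return None
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        recall = tp / positives
        if recall >= target_recall:
            # tp + fp >= 1 here because at least one score equals thr.
            return thr, tp / (tp + fp), recall
    return None

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1, 0.7, 0.3, 0.5]

result = pick_threshold(y_true, scores, target_recall=0.9)
```

Any threshold chosen this way should be selected on validation data and then checked on a held-out set, since the precision cost of a high recall target can be large when positives are rare.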
Age and average glucose level were consistently the most important predictors across all models. Hypertension and heart disease also showed significant influence on stroke risk.
Model Interpretation and Clinical Insights
Feature importance analysis confirmed this ranking: age, glucose level, and cardiovascular health indicators (hypertension and heart disease) were the most influential predictors. This insight can guide personalized prevention strategies and help healthcare providers prioritize interventions for high-risk patients.
Challenges and Future Work
A key challenge was the highly imbalanced dataset, in which stroke cases were relatively rare. Future work could explore more sophisticated techniques for handling class imbalance, such as alternative sampling methods or anomaly detection approaches. Incorporating more detailed medical history or lifestyle factors could also improve the model's predictive power.
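As a baseline against which more advanced methods could be compared, the sketch below shows two simple imbalance remedies: inverse-frequency class weights and random oversampling of the minority class. Neither is claimed to be the approach used in this project; the 95/5 class split and the function names are assumptions for illustration.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical training set: 95 non-stroke rows, 5 stroke rows.
labels = [0] * 95 + [1] * 5
rows = list(range(100))

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels)
```

Resampling should be applied to the training folds only; oversampling before the split would place copies of the same patient in both training and validation data and inflate the reported metrics.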