Stroke Risk Prediction: Figuring Out Who's at Risk
I built a machine learning model to spot people at risk of a stroke, using data from over 5,000 patients. The goal? Catch the warning signs early so doctors can step in before things get bad. Strokes are a big deal—second leading cause of death worldwide, according to the World Health Organization—so this could really help.
Project Overview and Methodology

What I Did

  • Data Setup: Started with a dataset packed with patient info: age, glucose levels, BMI, hypertension, smoking habits, you name it. Cleaned it up, dealt with missing values (rows without a BMI got dropped), and poked around with histograms and stats to see what's what.
  • Finding the Clues: Used stats tricks like t-tests and chi-square to pin down what's tied to strokes (there's a quick sketch of these tests right after this list). Age and glucose levels kept popping up as big red flags.
  • Building the Model: Tried a bunch of approaches—Logistic Regression, Random Forest, XGBoost, LightGBM, and CatBoost. Tuned them with Bayesian optimization to squeeze out the best results, focusing on catching as many stroke risks as possible (high recall).
  • Results: CatBoost came out on top with a recall of 0.9048 and ROC AUC of 0.8621 on the validation set. Translation: it catches 90% of potential stroke cases and does a decent job sorting real risks from noise.
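
To make the stats step concrete, here's a rough sketch of those tests. It's an illustration, not the project's exact code: it assumes a pandas DataFrame `df` with columns named `age`, `avg_glucose_level`, `hypertension`, and `stroke`, which are my guesses at the naming.

```python
# Sketch of the association tests: Welch's t-test for numeric features,
# chi-square for binary ones. Column names below are assumptions.
import pandas as pd
from scipy import stats

def welch_t_test(df: pd.DataFrame, feature: str, target: str = "stroke"):
    """Compare a numeric feature between stroke and non-stroke patients."""
    pos = df.loc[df[target] == 1, feature].dropna()
    neg = df.loc[df[target] == 0, feature].dropna()
    return stats.ttest_ind(pos, neg, equal_var=False)  # (statistic, p-value)

def chi_square_test(df: pd.DataFrame, feature: str, target: str = "stroke"):
    """Check whether a binary/categorical feature is tied to stroke."""
    table = pd.crosstab(df[feature], df[target])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return chi2, p

# Example usage:
# print(welch_t_test(df, "age"))
# print(chi_square_test(df, "hypertension"))
```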

How It Went Down

Started by scrubbing the data—dropped 201 rows with missing BMI (about 4% of the set) since it wasn't a dealbreaker. Plotted stuff like age and glucose distributions; turns out age is pretty spread out, while glucose spikes high for some folks. Also flagged outliers but kept them—they're often the juicy cases in medical data.
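
For reference, the cleanup step looks roughly like this. The file name and column names (`bmi`, `age`, `avg_glucose_level`) are assumptions, so adjust them to whatever your copy of the dataset uses.

```python
# Minimal cleanup/EDA sketch. File and column names are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("stroke_data.csv")      # hypothetical file name
print(df["bmi"].isna().sum())            # count rows missing BMI (201 here)

df = df.dropna(subset=["bmi"])           # drop the ~4% of rows without BMI
print(len(df))                           # roughly 4,909 patients left

# Quick look at the age and glucose distributions mentioned above
df[["age", "avg_glucose_level"]].hist(bins=40, figsize=(10, 4))
plt.tight_layout()
plt.show()
```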

Added some combo features—like age times glucose—to catch sneaky patterns. Stats showed age, glucose, hypertension, and heart disease matter most, so I leaned on those. Tested five models, tweaking them to nail recall since missing a stroke is worse than a false alarm. CatBoost won—it's great at balancing "don't miss anything" with "don't freak out over nothing."
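
Here's what the feature engineering and the recall-first model comparison might look like in code. This is a sketch, not the project's exact pipeline: it assumes the cleaned `df` from above, my guessed column names, and it only shows three of the five models to keep things short.

```python
# Sketch: build the age x glucose interaction, then compare models on recall.
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

# Interaction feature: age times glucose
df["age_glucose"] = df["age"] * df["avg_glucose_level"]

features = ["age", "avg_glucose_level", "bmi", "hypertension",
            "heart_disease", "age_glucose"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["stroke"], test_size=0.2,
    stratify=df["stroke"], random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "random_forest": RandomForestClassifier(class_weight="balanced", random_state=42),
    "catboost": CatBoostClassifier(auto_class_weights="Balanced", verbose=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    proba = model.predict_proba(X_val)[:, 1]
    print(f"{name}: recall={recall_score(y_val, preds):.4f}, "
          f"roc_auc={roc_auc_score(y_val, proba):.4f}")
```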

The Nitty-Gritty

  • Data: 4,909 patients after cleanup. Features like age (0.08 to 82), glucose levels (55 to 271 mg/dL), and binary stuff like hypertension (9% yes). Stroke cases? Just 4%—super imbalanced.
  • Stats: Age and stroke? Strong link (Cohen's d = 1.18, p < 0.001). Glucose too (d=0.70). Hypertension's odds ratio hit 4.4—yikes.
  • Model Details: CatBoost nailed it after 50 rounds of tuning—depth, learning rate, all that jazz. Final recall of 0.9048 means it misses only 4 out of 42 stroke cases in validation. Precision's lower (0.12), but that's the trade-off for not missing the big stuff.
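
The write-up doesn't pin down which Bayesian optimization library was used, so here's one way the 50 tuning rounds could look using Optuna, scoring each trial on validation recall. The parameter ranges, the `auto_class_weights` setting, and the `X_train`/`X_val` splits are all assumptions carried over from the sketch above.

```python
# Hypothetical Bayesian tuning of CatBoost with Optuna, maximizing recall.
import optuna
from catboost import CatBoostClassifier
from sklearn.metrics import recall_score

def objective(trial):
    params = {
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "iterations": trial.suggest_int("iterations", 200, 1000),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0),
        "auto_class_weights": "Balanced",   # counterweight for the ~4% positive class
        "verbose": 0,
    }
    model = CatBoostClassifier(**params)
    model.fit(X_train, y_train)
    return recall_score(y_val, model.predict(X_val))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)      # the "50 rounds of tuning"
print(study.best_params, study.best_value)
```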

What's Next

  • The imbalance is a pain—more stroke data would help.
  • Could try anomaly detection to catch those weird edge cases.
  • Adding lifestyle details (diet, exercise) might sharpen it up.

Why It Matters

This model's a tool, not a doctor. Pair it with a pro's judgment, and it could flag high-risk folks early—maybe save some lives. It's not perfect, but it's a solid start. Check the code and full breakdown on my GitHub if you're curious!

What the Data Showed