Data Preprocessing and Exploratory Data Analysis
The first phase was data cleaning and preprocessing: we handled missing values, encoded categorical variables, and normalized numerical features. Exploratory Data Analysis (EDA) then combined summary statistics with visualizations, including histograms, box plots, and correlation matrices.
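The preprocessing steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names, the choice of median and most-frequent imputation, one-hot encoding, and standard scaling are all assumptions made for the example, and the small data frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real dataset's schema may differ.
NUMERIC_COLS = ["Age", "AnnualIncome"]
CATEGORICAL_COLS = ["EmploymentType", "FrequentFlyer"]


def build_preprocessor():
    """Impute, scale, and encode features in one reusable transformer."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # fill missing numbers
        ("scale", StandardScaler()),                    # zero mean, unit variance
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    # sparse_threshold=0 forces a dense NumPy array as output.
    return ColumnTransformer(
        [("num", numeric, NUMERIC_COLS), ("cat", categorical, CATEGORICAL_COLS)],
        sparse_threshold=0,
    )


# Tiny made-up frame, only to show the call pattern.
toy = pd.DataFrame({
    "Age": [25, 31, np.nan, 40],
    "AnnualIncome": [400000, 1250000, 800000, np.nan],
    "EmploymentType": ["Private", "Government", "Private", "Private"],
    "FrequentFlyer": ["No", "Yes", "No", "No"],
})
X = build_preprocessor().fit_transform(toy)
```

Wrapping the steps in a single transformer means the same imputation and scaling parameters learned on the training data can be reused on new data, which avoids leaking information from validation folds.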
Feature Engineering and Selection
We engineered new features and used statistical tests (Chi-square for categorical variables, Mann-Whitney U for numerical ones) to identify the most significant predictors. These tests showed that employment type, travel frequency, international travel experience, and annual income were the features most strongly associated with the insurance purchase decision.
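As a rough illustration of the two tests named above, the sketch below uses SciPy on a small made-up data frame. The column names (`EmploymentType`, `AnnualIncome`, `TravelInsurance`) and the helper function names are hypothetical, and the values are invented purely to show the call pattern.

```python
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu


def chi_square_pvalue(df, feature, target):
    """Chi-square test of independence between a categorical feature and the target."""
    table = pd.crosstab(df[feature], df[target])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


def mann_whitney_pvalue(df, feature, target):
    """Mann-Whitney U test comparing a numerical feature across the two target classes."""
    buyers = df.loc[df[target] == 1, feature]
    non_buyers = df.loc[df[target] == 0, feature]
    return mannwhitneyu(buyers, non_buyers, alternative="two-sided").pvalue


# Made-up data; column names are assumptions, not the project's schema.
toy = pd.DataFrame({
    "EmploymentType": ["Private"] * 4 + ["Government"] * 2
                      + ["Private"] * 2 + ["Government"] * 4,
    "AnnualIncome": [1400000, 1350000, 1500000, 1250000, 1300000, 1450000,
                     500000, 600000, 550000, 700000, 650000, 800000],
    "TravelInsurance": [1] * 6 + [0] * 6,
})

p_employment = chi_square_pvalue(toy, "EmploymentType", "TravelInsurance")
p_income = mann_whitney_pvalue(toy, "AnnualIncome", "TravelInsurance")
```

A small p-value suggests the feature's distribution differs between buyers and non-buyers. When many features are tested this way, a multiple-comparison correction may be worth considering.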
Model Development and Evaluation
Multiple machine learning models were implemented and compared:
- Logistic Regression
- Random Forest
- Gradient Boosting
- Support Vector Machine (SVM)
We used cross-validation and hyperparameter tuning to optimize each model's performance. Evaluation metrics included accuracy, precision, recall, F1-score, and ROC AUC.
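The comparison described above might look something like the following sketch, which tunes each of the four model families with a small grid search and scores them by cross-validated ROC AUC. The synthetic data from `make_classification` stands in for the real dataset, and the hyperparameter grids, fold count, and random seeds are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed feature matrix and purchase labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Each entry: (estimator, illustrative hyperparameter grid).
candidates = {
    "Logistic Regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "Random Forest": (
        RandomForestClassifier(n_estimators=100, random_state=0),
        {"max_depth": [3, None]},
    ),
    "Gradient Boosting": (
        GradientBoostingClassifier(random_state=0),
        {"learning_rate": [0.05, 0.1]},
    ),
    "SVM": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1.0, 10.0]},
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, (model, grid) in candidates.items():
    # Grid search picks the setting with the best mean ROC AUC across folds.
    search = GridSearchCV(model, grid, scoring="roc_auc", cv=cv)
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

for name, (score, params) in results.items():
    print(f"{name}: ROC AUC = {score:.4f} with {params}")
```

Because the same folds are used for both tuning and scoring, `best_score_` tends to be slightly optimistic. A held-out test set or nested cross-validation gives a less biased estimate.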
Key Findings
The Random Forest model performed best of the four, with an ROC AUC of 0.7012. It also balanced precision and recall better than the other models, which matters when predictions are used for customer targeting.
Contrary to initial hypotheses, factors such as age, education level, and presence of chronic diseases showed minimal impact on insurance purchase decisions.
Model Interpretation and Business Insights
Feature importance analysis identified annual income, travel habits, and employment type as the most influential predictors, consistent with the statistical tests used during feature selection. These insights can guide personalized marketing strategies and product development in the travel insurance sector.
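One common way to obtain such a ranking is the impurity-based importance exposed by a fitted Random Forest, sketched below. The data is synthetic and the feature names are hypothetical labels attached for readability, so the ranking this example produces carries no meaning for the real dataset. Whether the project used this method or another (for example permutation importance) is not stated here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names attached to synthetic columns for illustration only.
FEATURES = ["AnnualIncome", "FrequentFlyer", "EverTravelledAbroad",
            "EmploymentType", "Age", "ChronicDiseases"]

X, y = make_classification(n_samples=300, n_features=len(FEATURES),
                           n_informative=3, n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = (
    pd.Series(forest.feature_importances_, index=FEATURES)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances can favor high-cardinality or continuous features, so cross-checking with `sklearn.inspection.permutation_importance` on held-out data is often worthwhile.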
Challenges and Future Work
A key challenge was balancing model complexity with interpretability. Future work could explore more advanced ensemble methods or deep learning approaches, and could incorporate additional data sources to improve predictive power.