Red Wine Quality Analysis

Introduction

In the quest for excellence, the wine industry heavily emphasizes maintaining consistent quality to satisfy consumers and uphold brand reputation. This project focuses on analyzing the Red Wine Quality dataset through robust data analysis and machine learning techniques to discern the key factors that influence wine quality. We aim to uncover the intricate patterns among the physicochemical properties of the wine and explore their impact on the sensory quality score.

Data Description

The dataset for this analysis comprises physicochemical (input) and sensory (output) variables of red wine. These input variables, representing measurements from various stages of wine production, include:

  • Fixed Acidity
  • Volatile Acidity
  • Citric Acid
  • Residual Sugar
  • Chlorides
  • Free Sulfur Dioxide
  • Total Sulfur Dioxide
  • Density
  • pH
  • Sulphates
  • Alcohol

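The output variable is the wine's sensory quality score, defined on a scale from 0 to 10; in this dataset the observed scores range from 3 to 8. As loaded below, the dataset contains 1,599 samples and 12 columns (11 physicochemical features plus the quality score). Details regarding the dataset's source will be acknowledged if available.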

Data Loading and Preprocessing

We will utilize Pandas for data loading. The preprocessing steps will aim to optimize data quality and readiness for further analysis. Key actions will include:

  • Handling missing values: Depending on the data attributes, we will apply mean/median imputation or outright removal of missing data.
  • Scaling numerical features: To equalize the influence of each feature in our analyses, we'll employ scaling techniques such as standardization or normalization based on the distribution of the data.

Exploratory Data Analysis (EDA)

Our EDA will aim to thoroughly understand the data properties and unearth relationships between variables through:

  • Univariate analysis: Utilizing histograms and boxplots to examine distributions and pinpoint outliers.
  • Multivariate analysis: Employing heatmaps to explore the correlation matrix, highlighting relationships and potential multicollinearity among features.

Statistical Inference

Defining our target population (e.g., varieties of red wine), we will develop specific statistical hypotheses based on EDA insights concerning relationships between features and the wine quality:

  • Testing correlations between alcohol content and wine quality, as well as volatile acidity and wine quality.
  • Establishing confidence intervals for crucial statistics and setting significance levels.
  • Implementing suitable statistical tests, such as the Pearson correlation coefficient, to test these hypotheses.

Machine Learning Models

Focusing on linear regression models, we will predict the quality and alcohol content of the wine using the remaining features. Our approach will start with an Ordinary Least Squares regression, evaluating model performance through:

  • Feature significance: Analyzing p-values of coefficients to identify influential features.
  • R-squared and Adjusted R-squared: These metrics will reflect how well the model captures the variance in the target variable.
  • Information criteria: Using AIC and BIC to compare models and select the optimal balance of simplicity and fit.

Visualization

To effectively communicate our findings, we will create a dashboard using tools like Looker Studio, incorporating various chart types such as:

  • Bar charts for feature distributions
  • Scatter plots for correlations
  • Line charts for tracking model performance

Conclusion and Suggestions for Improvement

The final section will synthesize the analysis results, underscoring critical insights into how physicochemical properties influence wine quality and suggesting avenues for enhancing the analytical approach.

In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
import os
import sqlite3

import pandas as pd
from IPython.display import Image

from red_wine_quality_analysis.utils.red_wine_quality_utils import (
    get_columns,
    identify_outliers,
    log_transform_features,
    plot_box_chart,
    plot_correlation,
    plot_heatmap,
    plot_histograms,
    plot_model_predictions,
    remove_duplicates,
    test_correlation,
    train_linear_model,
)
In [3]:
csv_path = os.path.join("..", "data", "winequality_red.csv")
db_path = os.path.join("..", "data", "wine_quality.db")

wine_df = pd.read_csv(csv_path)
conn = sqlite3.connect(db_path)
wine_df.to_sql("wine_quality", conn, if_exists="replace", index=False)
Out[3]:
1599
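
The value 1599 shown in Out[3] is what to_sql returned, which in recent pandas versions is the number of rows written. As a minimal sketch of how the stored table could be read back and checked, the snippet below uses an in-memory SQLite database and a three-row toy frame; both are assumptions for illustration and not the project's data or file paths.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Toy stand-in for the wine table (hypothetical values).
toy_df = pd.DataFrame({"alcohol": [9.4, 9.8, 9.8], "quality": [5, 5, 6]})

# closing() makes sure the connection is released when the block ends.
with closing(sqlite3.connect(":memory:")) as conn:
    toy_df.to_sql("wine_quality", conn, if_exists="replace", index=False)

    # Confirm the row count stored in the table.
    row_count = pd.read_sql_query(
        "SELECT COUNT(*) AS n FROM wine_quality", conn
    )["n"].iloc[0]

    # Example aggregate query: mean alcohol per quality score.
    mean_by_quality = pd.read_sql_query(
        "SELECT quality, AVG(alcohol) AS mean_alcohol "
        "FROM wine_quality GROUP BY quality ORDER BY quality",
        conn,
    )

print(row_count)
print(mean_by_quality)
```

Keeping a SQL copy of the data is mainly useful when a dashboard tool or another process needs to query it without going through pandas.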
In [4]:
wine_df.head()
Out[4]:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
In [5]:
wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

Great news, the data we are looking at does not have any missing entries. There are 1,599 rows of data with 12 columns each. Everything is stored as numeric variables.

Next, we check for duplicate rows and remove any we find.

In [6]:
wine_df = remove_duplicates(wine_df)
Removed 240 duplicate rows

The function's output shows that 240 duplicate rows were removed, leaving 1,359 of the original 1,599 rows.
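
The remove_duplicates helper lives in the project's utils module and is not shown in this notebook. A minimal sketch of what such a helper might look like is given below; the function name drop_duplicate_rows, the index reset, and the toy data are assumptions for illustration.

```python
import pandas as pd


def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop fully duplicated rows and report how many were removed."""
    deduplicated = df.drop_duplicates().reset_index(drop=True)
    print(f"Removed {len(df) - len(deduplicated)} duplicate rows")
    return deduplicated


# Toy frame (hypothetical values): the third row repeats the first.
toy_df = pd.DataFrame({"alcohol": [9.4, 9.8, 9.4], "quality": [5, 5, 5]})
clean_df = drop_duplicate_rows(toy_df)
```

Note that identical measurements can also come from genuinely different wines, so dropping exact duplicates is a modelling choice rather than a pure data-cleaning step.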

Now, we describe the data to get a better understanding of the distribution of the features.

In [7]:
wine_df.describe().T
Out[7]:
                       count       mean        std      min      25%      50%       75%        max
fixed acidity         1359.0   8.310596   1.736990  4.60000   7.1000   7.9000   9.20000   15.90000
volatile acidity      1359.0   0.529478   0.183031  0.12000   0.3900   0.5200   0.64000    1.58000
citric acid           1359.0   0.272333   0.195537  0.00000   0.0900   0.2600   0.43000    1.00000
residual sugar        1359.0   2.523400   1.352314  0.90000   1.9000   2.2000   2.60000   15.50000
chlorides             1359.0   0.088124   0.049377  0.01200   0.0700   0.0790   0.09100    0.61100
free sulfur dioxide   1359.0  15.893304  10.447270  1.00000   7.0000  14.0000  21.00000   72.00000
total sulfur dioxide  1359.0  46.825975  33.408946  6.00000  22.0000  38.0000  63.00000  289.00000
density               1359.0   0.996709   0.001869  0.99007   0.9956   0.9967   0.99782    1.00369
pH                    1359.0   3.309787   0.155036  2.74000   3.2100   3.3100   3.40000    4.01000
sulphates             1359.0   0.658705   0.170667  0.33000   0.5500   0.6200   0.73000    2.00000
alcohol               1359.0  10.432315   1.082065  8.40000   9.5000  10.2000  11.10000   14.90000
quality               1359.0   5.623252   0.823578  3.00000   5.0000   6.0000   6.00000    8.00000

Those 12 columns represent the following features:

  • Fixed Acidity: The average value is 8.31, the highest value is 15.9.
  • Volatile Acidity: The average value is 0.529, the highest value is 1.58.
  • Citric Acid: The average value is 0.272, the highest value is 1.00.
  • Residual Sugar: The average value is 2.523, the highest value is 15.5.
  • Chlorides: The average value is 0.088, the highest value is 0.611.
  • Free Sulfur Dioxide: The average value is 15.89, the highest value is 72.00.
  • Total Sulfur Dioxide: The average value is 46.83, the highest value is 289.00.
  • Density: The average value is 0.997, the highest value is 1.004.
  • pH: The average value is 3.31, the highest value is 4.01.
  • Sulphates: The average value is 0.66, the highest value is 2.00.
  • Alcohol: The average value is 10.43, the highest value is 14.90.
  • Quality: The average value is 5.62, the highest value is 8.00.

Next, we rename the columns whose names contain spaces so that they are easier to reference in code.

In [8]:
get_columns(wine_df)
Out[8]:
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality']
In [9]:
wine_df.rename(
    columns={
        "fixed acidity": "fixed_acidity",
        "volatile acidity": "volatile_acidity",
        "citric acid": "citric_acid",
        "residual sugar": "residual_sugar",
        "free sulfur dioxide": "free_sulfur_dioxide",
        "total sulfur dioxide": "total_sulfur_dioxide",
    },
    inplace=True,
)

With the column names fixed, we can check each feature for outliers.

In [10]:
plot_box_chart(
    wine_df,
    "Feature",
    "Value",
    "Boxplot of Features",
    save_path="../images/boxplot_features.png",
)
In [11]:
Image(filename="../images/boxplot_features.png")
Out[11]:

The boxplot reveals noticeable outliers in features like "free sulfur dioxide" and "total sulfur dioxide." While other features may also have outliers, the IQR (Interquartile Range) can help us further confirm these existing outliers and identify potential ones in other features.

In [12]:
outlier_info = identify_outliers(wine_df)
print(outlier_info["outliers_per_column"])
print(
    f"Total number of outliers across all columns: {outlier_info['total_outliers']}"
)
fixed_acidity            41
volatile_acidity         19
citric_acid               1
residual_sugar          126
chlorides                87
free_sulfur_dioxide      26
total_sulfur_dioxide     45
density                  35
pH                       28
sulphates                55
alcohol                  12
quality                  27
dtype: int64
Total number of outliers across all columns: 502

In our dataset, we have identified a total of 502 outliers across the 12 columns using the Interquartile Range (IQR) method.

These outliers represent values that significantly deviate from the rest of the data.

While outliers can sometimes indicate errors, they can also represent valid but extreme variations in the data.

At this stage, without a deeper understanding of the context and nature of the data, we have chosen to temporarily retain these outliers in our analysis.

This decision stems from the possibility that these outliers could provide valuable insights.

However, we acknowledge that outliers can potentially skew our analysis and we will therefore revisit this decision as necessary during later stages of our analysis.
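
The identify_outliers helper is not shown in this notebook. The sketch below illustrates the usual IQR rule it is described as using: a value is flagged when it lies more than 1.5 times the IQR below the first quartile or above the third. The function name iqr_outlier_counts, the multiplier k=1.5, and the toy data are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the matching columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy column (hypothetical values) with one clearly extreme entry.
toy_df = pd.DataFrame({"residual_sugar": [1.9, 2.0, 2.2, 2.3, 2.6, 15.5]})
counts = iqr_outlier_counts(toy_df)
print(counts)
print(f"Total flagged values: {int(counts.sum())}")
```

A total obtained by summing per-column counts counts flagged values, not rows: a single wine can be flagged in several columns at once.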

In [13]:
plot_histograms(
    wine_df,
    ["fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar"],
    save_path="../images/histograms.png",
)
In [14]:
Image(filename="../images/histograms.png")
Out[14]:

Fixed Acidity: The distribution is right-skewed, indicating that most wines have a fixed acidity level around 7-8, with fewer wines having higher fixed acidity.

Volatile Acidity: The data is also right-skewed, showing that most wines have a volatile acidity around 0.5, with only a few wines having a volatile acidity above 1.

Citric Acid: This feature shows a bimodal distribution, indicating two groups of wines, one with low citric acid close to 0 and another with citric acid between 0.25 and 0.5.

Residual Sugar: The distribution is highly right-skewed, suggesting that most wines have low residual sugar levels, with a peak below five. Few wines have high residual sugar levels.

In [15]:
plot_histograms(
    wine_df,
    ["chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density"],
    save_path="../images/histograms1.png",
)
In [16]:
Image(filename="../images/histograms1.png")
Out[16]:
In [17]:
plot_histograms(
    wine_df,
    ["pH", "sulphates", "alcohol", "quality"],
    save_path="../images/histograms2.png",
)
In [18]:
Image(filename="../images/histograms2.png")
Out[18]:

pH: The distribution is approximately normal, indicating that most wines have a pH level around 3.3, with fewer wines having significantly higher or lower pH levels.

Sulphates: The data is right-skewed, showing that most wines have a sulphate level around 0.6, with only a few wines having a level above 1.

Alcohol: This feature shows a right-skewed distribution, indicating that most wines have an alcohol level around 9-10%, with fewer wines having a level above 13%.

Quality: The distribution is approximately normal, suggesting that most wines have a quality rating of 5 or 6, with few wines having significantly higher or lower ratings.

Our initial data exploration using histograms revealed that most features (fixed acidity, volatile acidity, etc.) exhibit right-skewed distributions: values are concentrated at the lower end of the range, with a long tail of higher values.

This pattern suggests that the majority of wines fall within a relatively narrow range for these features, while a small number of wines have much higher values. In contrast, "density" and "pH" show distributions closer to normal, while "citric acid" has a unique bimodal distribution.

Notably, "quality" itself appears roughly bell-shaped, although it is a discrete integer score rather than a continuous variable.

These observations provide a foundation for further analysis, particularly investigating relationships between these features and wine quality to identify potential patterns and correlations.

Because most features are right-skewed, we will apply a log transformation to them.

In [19]:
columns_to_transform = [
    "sulphates",
    "alcohol",
    "chlorides",
    "free_sulfur_dioxide",
    "total_sulfur_dioxide",
    "fixed_acidity",
    "volatile_acidity",
    "citric_acid",
    "residual_sugar",
]
transformed_wine_df = log_transform_features(wine_df, columns_to_transform)
In [20]:
outlier_info = identify_outliers(transformed_wine_df)
print(outlier_info["outliers_per_column"])
print(
    f"Total number of outliers across all columns: {outlier_info['total_outliers']}"
)
fixed_acidity            12
volatile_acidity          8
citric_acid               0
residual_sugar          108
chlorides                87
free_sulfur_dioxide       0
total_sulfur_dioxide      0
density                  35
pH                       28
sulphates                48
alcohol                   7
quality                  27
dtype: int64
Total number of outliers across all columns: 360

The log transformation reduced the number of flagged outliers by 142 (from 502 to 360). For now, we keep the dataset as it is and move on to the correlation matrix; we revisit the distributions of the transformed features afterwards.
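
The log_transform_features helper is not shown in this notebook. Since citric acid contains zeros (its minimum is 0.0), a plain logarithm would produce infinite values, so the sketch below assumes a log(1 + x) transform. The function name log1p_transform, the choice of log1p, and the toy data are assumptions for illustration and may differ from the project's helper.

```python
import numpy as np
import pandas as pd


def log1p_transform(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with log(1 + x) applied to the given columns."""
    transformed = df.copy()
    transformed[columns] = np.log1p(transformed[columns])
    return transformed


# Toy frame (hypothetical values): one right-skewed column, one left untouched.
toy_df = pd.DataFrame(
    {
        "residual_sugar": [0.9, 1.9, 2.2, 2.6, 15.5],
        "pH": [3.5, 3.2, 3.3, 3.2, 3.0],
    }
)
toy_transformed = log1p_transform(toy_df, ["residual_sugar"])

# Sample skewness before and after; a smaller positive value means less right skew.
skew_before = toy_df["residual_sugar"].skew()
skew_after = toy_transformed["residual_sugar"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Comparing skewness before and after is a quick numeric complement to re-plotting the histograms.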

In [21]:
corr_matrix = transformed_wine_df.corr()
print(corr_matrix)
                      fixed_acidity  volatile_acidity  citric_acid  \
fixed_acidity              1.000000         -0.259980     0.656309   
volatile_acidity          -0.259980          1.000000    -0.575063   
citric_acid                0.656309         -0.575063     1.000000   
residual_sugar             0.159338          0.020262     0.163785   
chlorides                  0.112359          0.067558     0.194126   
free_sulfur_dioxide       -0.161301          0.005780    -0.061352   
total_sulfur_dioxide      -0.106640          0.077575     0.027812   
density                    0.677643          0.031674     0.352629   
pH                        -0.708034          0.245293    -0.551118   
sulphates                  0.198090         -0.278336     0.331680   
alcohol                   -0.091192         -0.209158     0.095063   
quality                    0.109715         -0.397329     0.227422   

                      residual_sugar  chlorides  free_sulfur_dioxide  \
fixed_acidity               0.159338   0.112359            -0.161301   
volatile_acidity            0.020262   0.067558             0.005780   
citric_acid                 0.163785   0.194126            -0.061352   
residual_sugar              1.000000   0.032933             0.089687   
chlorides                   0.032933   1.000000            -0.006385   
free_sulfur_dioxide         0.089687  -0.006385             1.000000   
total_sulfur_dioxide        0.146414   0.062295             0.786095   
density                     0.380607   0.211604            -0.030833   
pH                         -0.093706  -0.275122             0.079773   
sulphates                  -0.002417   0.359596             0.056486   
alcohol                     0.089509  -0.238845            -0.093764   
quality                     0.020154  -0.137302            -0.047132   

                      total_sulfur_dioxide   density        pH  sulphates  \
fixed_acidity                    -0.106640  0.677643 -0.708034   0.198090   
volatile_acidity                  0.077575  0.031674  0.245293  -0.278336   
citric_acid                       0.027812  0.352629 -0.551118   0.331680   
residual_sugar                    0.146414  0.380607 -0.093706  -0.002417   
chlorides                         0.062295  0.211604 -0.275122   0.359596   
free_sulfur_dioxide               0.786095 -0.030833  0.079773   0.056486   
total_sulfur_dioxide              1.000000  0.109176 -0.030143   0.053566   
density                           0.109176  1.000000 -0.355617   0.152989   
pH                               -0.030143 -0.355617  1.000000  -0.196653   
sulphates                         0.053566  0.152989 -0.196653   1.000000   
alcohol                          -0.247685 -0.500142  0.213968   0.113559   
quality                          -0.165289 -0.184252 -0.055245   0.279517   

                       alcohol   quality  
fixed_acidity        -0.091192  0.109715  
volatile_acidity     -0.209158 -0.397329  
citric_acid           0.095063  0.227422  
residual_sugar        0.089509  0.020154  
chlorides            -0.238845 -0.137302  
free_sulfur_dioxide  -0.093764 -0.047132  
total_sulfur_dioxide -0.247685 -0.165289  
density              -0.500142 -0.184252  
pH                    0.213968 -0.055245  
sulphates             0.113559  0.279517  
alcohol               1.000000  0.481462  
quality               0.481462  1.000000  
In [22]:
plot_heatmap(corr_matrix, save_path="../images/correlation_heatmap.png")
In [23]:
Image(filename="../images/correlation_heatmap.png")
Out[23]:

Our analysis of the correlation matrix reveals notable relationships within the red wine data:

Positive Correlations:

  • Fixed Acidity and Citric Acid: This suggests a trend where wines with higher fixed acidity tend to also possess higher citric acid content, potentially contributing to the overall flavor profile.
  • Quality and Alcohol: Alcohol has the strongest positive correlation with quality (0.48), indicating that wines with higher alcohol content tend to receive higher quality scores. The strongest positive correlation overall is between free and total sulfur dioxide (0.79).

Negative Correlations:

  • Volatile Acidity and Quality: There is a trend where higher levels of volatile acidity are associated with lower wine quality (-0.40), likely due to the influence on taste and aroma.
  • pH and Citric Acid: Wines with more citric acid typically have lower pH values (-0.55), reflecting the direct chemical relationship between acidity and pH.
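
Reading a 12 by 12 matrix by eye is error-prone, so it can help to rank the feature pairs programmatically. The sketch below does this for a small excerpt of the matrix printed above (four features only); the function name rank_correlation_pairs is an assumption for illustration.

```python
import numpy as np
import pandas as pd


def rank_correlation_pairs(corr: pd.DataFrame) -> pd.Series:
    """List each unordered feature pair once, sorted by absolute correlation."""
    # Keep only the strict upper triangle so (a, b) and (b, a) are not both listed
    # and the diagonal of ones is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Excerpt of the correlation matrix printed above.
names = ["free_sulfur_dioxide", "total_sulfur_dioxide", "alcohol", "quality"]
corr = pd.DataFrame(
    [
        [1.000000, 0.786095, -0.093764, -0.047132],
        [0.786095, 1.000000, -0.247685, -0.165289],
        [-0.093764, -0.247685, 1.000000, 0.481462],
        [-0.047132, -0.165289, 0.481462, 1.000000],
    ],
    index=names,
    columns=names,
)
ranked = rank_correlation_pairs(corr)
print(ranked)
```

Applied to the full corr_matrix, the same function would list all 66 feature pairs in order of strength.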

Before testing these correlations formally, we check the distributions of the log-transformed features once again.

In [24]:
plot_histograms(
    transformed_wine_df,
    ["fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar"],
    save_path="../images/histograms_transformed.png",
)
In [25]:
Image(filename="../images/histograms_transformed.png")
Out[25]:
In [26]:
plot_histograms(
    transformed_wine_df,
    ["chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density"],
    save_path="../images/histograms1_transformed.png",
)
In [27]:
Image(filename="../images/histograms1_transformed.png")
Out[27]:
In [28]:
plot_histograms(
    transformed_wine_df,
    ["pH", "sulphates", "alcohol", "quality"],
    save_path="../images/histograms2_transformed.png",
)
In [29]:
Image(filename="../images/histograms2_transformed.png")
Out[29]:

Since most features now appear to be approximately normally distributed, we can proceed with statistical hypothesis testing to further explore these relationships. Red wines of this type serve as our target population, and the dataset is treated as a sample from it. Because each hypothesis concerns the linear association between two numerical features rather than a difference in means, we will use a test based on Pearson's correlation coefficient (r) instead of Z or t-tests.

Hypotheses for Statistical Testing:

  1. Fixed Acidity and Citric Acid

    • Null Hypothesis (H0): There is no significant relationship between fixed acidity and citric acid levels in wines.
    • Alternative Hypothesis (H1): Wines with higher fixed acidity have higher citric acid levels.
  2. Quality and Alcohol

    • Null Hypothesis (H0): Alcohol content does not significantly affect wine quality.
    • Alternative Hypothesis (H1): Higher alcohol content is associated with higher wine quality.
  3. Volatile Acidity and Quality

    • Null Hypothesis (H0): Volatile acidity levels do not significantly affect wine quality.
    • Alternative Hypothesis (H1): Higher volatile acidity is associated with lower wine quality.
  4. pH and Citric Acid

    • Null Hypothesis (H0): Citric acid levels do not significantly affect the pH of wines.
    • Alternative Hypothesis (H1): Higher citric acid levels result in lower pH values in wines.

These hypotheses will guide our detailed statistical analysis to further validate the initial findings from our correlation study.
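
The test_correlation helper used below is not shown in this notebook. As a minimal sketch of a Pearson correlation test with a confidence interval, the snippet below uses scipy.stats.pearsonr and builds the interval with the Fisher z-transformation. The Fisher interval is a swapped-in, standard technique and may differ from how the project's helper computes its interval; the function name pearson_with_ci and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def pearson_with_ci(x, y, alpha: float = 0.05) -> dict:
    """Pearson r, two-sided p-value, and a Fisher-z confidence interval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r, p_value = stats.pearsonr(x, y)

    # Fisher z-transformation: arctanh(r) is approximately normal with
    # standard error 1 / sqrt(n - 3).
    z = np.arctanh(r)
    half_width = stats.norm.ppf(1 - alpha / 2) / np.sqrt(len(x) - 3)
    ci = (float(np.tanh(z - half_width)), float(np.tanh(z + half_width)))

    return {
        "Correlation": float(r),
        "P-Value": float(p_value),
        "Reject H0": bool(p_value < alpha),
        "CI": ci,
    }


# Synthetic data with a true correlation of about 0.6.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.8, size=500)
result = pearson_with_ci(x, y)
print(result)
```

The p-value from pearsonr is two-sided by default, whereas the alternative hypotheses above are directional; with effects this strong the distinction does not change the conclusion, but it is worth keeping in mind for weaker correlations.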

In [30]:
test_correlation(transformed_wine_df, "fixed_acidity", "citric_acid")
Out[30]:
{'Correlation': np.float64(0.6563086255295857),
 'P-Value': np.float64(3.131882291957192e-168),
 'Reject H0': np.True_,
 '95% CI': (np.float64(0.6181577261929445), np.float64(0.6944595248662269))}
In [31]:
plot_correlation(
    transformed_wine_df,
    "fixed_acidity",
    "citric_acid",
    save_path="../images/correlation_plot.png",
)
In [32]:
Image(filename="../images/correlation_plot.png")
Out[32]:

There's a clear connection between two important wine properties: fixed acidity and citric acid. Our analysis shows a positive correlation of 0.656. This means when fixed acidity levels go up, citric acid levels tend to go up as well.

The results are highly statistically significant, with a p-value of about 3.1e-168. This strong evidence allows us to reject the idea of no relationship between these properties (null hypothesis).

The 95% confidence interval (0.618 to 0.694) excludes zero, which boosts our confidence in the positive correlation. This interval tells us how precise our estimate of the connection is.

The equation of the regression line (y = 0.562x - 1.014) helps visualize this trend. Here, "x" represents fixed acidity and "y" represents citric acid content, both on the log-transformed scale. If fixed acidity (x) increases by one unit on that scale, we can expect the citric acid (y) to increase by about 0.562. Because this is a one-predictor fit, it does not hold other features constant.

In simpler terms, wines with higher fixed acidity tend to have more citric acid, and vice versa. This link is statistically significant, meaning it's very unlikely to be random.
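
The regression lines quoted in this section are one-predictor fits of the form y = slope * x + intercept. The sketch below, on synthetic data, shows two facts that help when reading them: the slope equals r multiplied by the ratio of the standard deviations of y and x, and the intercept is the fitted value at x = 0. The data and variable names are assumptions for illustration; plot_correlation itself is not shown in this notebook.

```python
import numpy as np

# Synthetic data: x loosely resembles a pH-like range, y falls as x rises.
rng = np.random.default_rng(42)
x = rng.uniform(2.8, 4.0, size=200)
y = 2.0 - 0.5 * x + rng.normal(scale=0.1, size=200)

# Least-squares line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)

# The slope of a simple regression is r * std(y) / std(x).
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# The intercept is the fitted value at x = 0, which here lies outside the
# observed range of x and so is an extrapolation.
intercept_from_means = y.mean() - slope * x.mean()

print(f"slope={slope:.3f}, slope from r={slope_from_r:.3f}")
print(f"intercept={intercept:.3f}, from means={intercept_from_means:.3f}")
```

Because the slope depends on the units of both variables, a slope fitted on log-transformed features describes changes on the transformed scale, not in the original measurement units.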

In [33]:
test_correlation(transformed_wine_df, "quality", "alcohol")
Out[33]:
{'Correlation': np.float64(0.4814619565501097),
 'P-Value': np.float64(8.792571112773603e-80),
 'Reject H0': np.True_,
 '95% CI': (np.float64(0.4371434215430752), np.float64(0.5257804915571442))}
In [34]:
plot_correlation(
    transformed_wine_df,
    "quality",
    "alcohol",
    save_path="../images/correlation_plot1.png",
)
In [35]:
Image(filename="../images/correlation_plot1.png")
Out[35]:

Our analysis suggests that higher alcohol content might be associated with better quality wines. There's a moderate positive correlation (0.481), meaning wines with more alcohol tend to get higher quality ratings.

This connection is statistically significant, with a very small p-value (about 8.8e-80). This means it's very unlikely to be due to random chance. We can therefore reject the null hypothesis that alcohol content and quality are unrelated.

The 95% confidence interval (0.437 to 0.526) excludes zero, which strengthens our confidence in the positive correlation. This interval tells us how precise our estimate of the connection is.

To visualize this, the scatter plot shows an upward trend. The equation (y=0.054x+2.131) represents this trend. Given the argument order passed to plot_correlation, "x" appears to be the quality score and "y" the log-transformed alcohol content, so each additional quality point corresponds to an increase of about 0.054 in log-transformed alcohol. Either way, the positive slope supports the positive association.

In simpler terms, wines with higher alcohol content tend to be rated as being of higher quality.

In [36]:
test_correlation(transformed_wine_df, "volatile_acidity", "quality")
Out[36]:
{'Correlation': np.float64(-0.39732873875315033),
 'P-Value': np.float64(1.2718423787494175e-52),
 'Reject H0': np.True_,
 '95% CI': (np.float64(-0.4437310239159599), np.float64(-0.35092645359034075))}
In [37]:
plot_correlation(
    transformed_wine_df,
    "volatile_acidity",
    "quality",
    save_path="../images/correlation_plot2.png",
)
In [38]:
Image(filename="../images/correlation_plot2.png")
Out[38]:

Looks like higher volatile acidity might not be good news for wine quality. Our analysis shows a negative correlation of -0.397. This means wines with higher levels of volatile acidity tend to get lower quality ratings.

More importantly, the results are statistically significant, with a very small p-value (about 1.3e-52). This strengthens the idea that the connection isn't random.

The 95% confidence interval (-0.444 to -0.351) excludes zero, making us more confident in the negative correlation. This interval tells us how precise our estimate of the connection is.

The equation of the regression line (y = -2.799x + 6.793) helps visualize this. Here, "x" represents log-transformed volatile acidity and "y" represents quality score. If x increases by one unit on the log-transformed scale, the quality score (y) is expected to decrease by about 2.799 points. Because this is a one-predictor fit, it does not hold other features constant.

In simpler terms, wines with higher volatile acidity tend to be rated lower in quality.

In [39]:
test_correlation(transformed_wine_df, "pH", "citric_acid")
Out[39]:
{'Correlation': np.float64(-0.5511177680998307),
 'P-Value': np.float64(8.277860767471543e-109),
 'Reject H0': np.True_,
 '95% CI': (np.float64(-0.5933105757530911), np.float64(-0.5089249604465703))}
In [40]:
plot_correlation(
    transformed_wine_df,
    "pH",
    "citric_acid",
    save_path="../images/correlation_plot3.png",
)
In [41]:
Image(filename="../images/correlation_plot3.png")
Out[41]:

Our analysis found a link between wine's pH and citric acid content. There's a moderate negative correlation of -0.551, meaning wines with higher levels of citric acid tend to have lower pH values. This makes sense because citric acid contributes to a wine's acidity.

The results are very statistically significant (p-value about 8.3e-109), which means it's highly unlikely to be random. We can confidently reject the idea of no connection (null hypothesis).

The 95% confidence interval (-0.593 to -0.509) excludes zero, further strengthening our belief in the negative correlation. This interval tells us how precise our estimate of the connection is.

The equation of the regression line (y = -0.543x + 2.028) helps visualize this. Here, "x" represents pH and "y" represents log-transformed citric acid content. If the pH (x) goes up by one unit, y is expected to decrease by about 0.543. The intercept (2.028) is the fitted value at x = 0, which lies far outside the observed pH range (2.74 to 4.01), so it has no practical interpretation on its own.

In simpler terms, wines with higher citric acid tend to have lower pH, and vice versa. This connection is statistically significant, meaning it's likely not a coincidence.

In [42]:
model_results, X_test, y_test, y_pred = train_linear_model(
    transformed_wine_df, "quality"
)
plot_model_predictions(
    X_test,
    y_test,
    y_pred,
    "Predicted Quality vs Actual Quality",
    save_path="../images/model_predictions.png",
)
Model summary:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                quality   R-squared:                       0.354
Model:                            OLS   Adj. R-squared:                  0.348
Method:                 Least Squares   F-statistic:                     53.65
Date:                Thu, 13 Feb 2025   Prob (F-statistic):           2.58e-94
Time:                        10:09:43   Log-Likelihood:                -1086.3
No. Observations:                1087   AIC:                             2197.
Df Residuals:                    1075   BIC:                             2256.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                  -15.8513     28.148     -0.563      0.573     -71.082      39.380
fixed_acidity           -0.1069      0.331     -0.323      0.747      -0.756       0.543
volatile_acidity        -1.5706      0.238     -6.589      0.000      -2.038      -1.103
citric_acid             -0.2693      0.232     -1.160      0.246      -0.725       0.186
residual_sugar          -0.0229      0.103     -0.222      0.825      -0.226       0.180
chlorides               -2.5049      0.577     -4.343      0.000      -3.637      -1.373
free_sulfur_dioxide      0.1132      0.055      2.039      0.042       0.004       0.222
total_sulfur_dioxide    -0.1681      0.053     -3.175      0.002      -0.272      -0.064
density                 15.9937     28.720      0.557      0.578     -40.360      72.347
pH                      -0.7253      0.253     -2.863      0.004      -1.222      -0.228
sulphates                1.7184      0.260      6.611      0.000       1.208       2.228
alcohol                  3.5327      0.396      8.932      0.000       2.757       4.309
==============================================================================
Omnibus:                       17.971   Durbin-Watson:                   1.954
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               23.847
Skew:                          -0.194   Prob(JB):                     6.63e-06
Kurtosis:                       3.613   Cond. No.                     1.36e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.36e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Mean Squared Error: 0.4263608394154518
R-squared: 0.3980979574615455
Adjusted R-squared: 0.3927318174091967
In [43]:
Image(filename="../images/model_predictions.png")
Out[43]:

This model uses various physicochemical characteristics of wine to predict its quality. Below are some key points from the model's summary:

  • R-squared (Explained Variance): The model explains approximately 35.4% of the variance in wine quality on the data it was fitted to. The R-squared of about 0.398 reported after the summary is presumably computed on the held-out test set. This indicates how much of the total variation in wine quality our model can account for based on the variables used.
  • Important Predictors: Characteristics such as volatile acidity, chlorides, free sulfur dioxide, total sulfur dioxide, pH, sulphates, and alcohol are significant predictors of wine quality (p < 0.05).
  • Model Diagnostics:
    • Durbin-Watson: Close to 2, suggesting there are no major concerns regarding the independence of residuals.
    • Jarque-Bera Test: Indicates that residuals are not normally distributed, which is an area for caution.
    • Condition Number: High, suggesting potential issues with multicollinearity among predictors.
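To make the "close to 2" reading of the Durbin-Watson statistic concrete, here is a minimal, dependency-free sketch of how the statistic is computed from a residual series. The function name and the toy residuals are illustrative assumptions and do not come from the wine model; statsmodels reports the same quantity in the summary above.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Toy residuals, made up for illustration (not taken from the wine model).
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every step: negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]       # no change between steps: positive autocorrelation

dw_alternating = durbin_watson(alternating)
dw_constant = durbin_watson(constant)
print(dw_alternating, dw_constant)
```

Values near 0 point to positive autocorrelation, values near 4 to negative autocorrelation, and values near 2 to little autocorrelation, which is the situation reported for this model.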

The plot compares the predicted wine quality against the actual wine quality, which helps us visualize the accuracy of our predictions across different quality levels. One notable finding is that predictions become more scattered as actual quality increases, indicating that the model is less accurate for higher-quality wines.

  • F-statistic and Prob (F-statistic): These metrics confirm that the overall regression model is statistically significant, meaning the predictors do have an impact on wine quality.
  • AIC/BIC: Lower values are preferred as they indicate a better model fit. These criteria help compare our model with others by balancing goodness of fit and complexity.
  • Condition Number: Indicates potential overlap in what our predictors tell us, which might affect the reliability of our predictions.
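One common way to see which predictors overlap is the variance inflation factor (VIF). The sketch below is not part of the original notebook: it assumes NumPy is available, uses synthetic columns instead of the wine features, and the helper name `variance_inflation_factors` is hypothetical. It relies on the identity that the VIF of predictor j equals the j-th diagonal entry of the inverse correlation matrix.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are samples).

    Uses VIF_j = [R^-1]_jj, where R is the correlation matrix
    of the predictors.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic predictors for illustration only (not the wine data).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a

X = np.column_stack([a, b, c])
vifs = variance_inflation_factors(X)
print(vifs)
```

A frequently used rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity. In this toy example the two nearly identical columns get large values while the independent column stays close to 1.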

  • Categorization of Quality: Consider transforming the wine quality score from a numeric value into categories (e.g., Low, Medium, High). This might enhance model performance and interpretability, especially at higher quality levels where predictions are currently less accurate.
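As a rough illustration of the categorization idea, the sketch below maps integer quality scores to three labels. The cut-offs (4 and 6), the function name, and the sample scores are assumptions chosen for illustration, not values taken from this analysis. With a DataFrame, `pandas.cut` with `bins` and `labels` would be the more usual tool.

```python
from collections import Counter


def quality_category(score, low_max=4, medium_max=6):
    """Map a numeric quality score to a coarse label.

    The thresholds are hypothetical: scores <= low_max are "Low",
    scores <= medium_max are "Medium", everything above is "High".
    """
    if score <= low_max:
        return "Low"
    if score <= medium_max:
        return "Medium"
    return "High"


# Made-up scores for illustration (not rows from the dataset).
scores = [3, 5, 5, 6, 7, 8, 4]
labels = [quality_category(s) for s in scores]
counts = Counter(labels)
print(labels)
print(counts)
```

Checking the resulting class counts before modeling is worthwhile, because quality scores tend to cluster in the middle and the outer categories can end up small.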

While our model provides a foundation for predicting wine quality, refining it or approaching the prediction task differently could yield better results and more actionable insights.

In [44]:
model_results, X_test, y_test, y_pred = train_linear_model(
    transformed_wine_df, "alcohol"
)
plot_model_predictions(
    X_test,
    y_test,
    y_pred,
    "Predicted Alcohol vs Actual Alcohol",
    save_path="../images/model_predictions1.png",
)
Model summary:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                alcohol   R-squared:                       0.719
Model:                            OLS   Adj. R-squared:                  0.717
Method:                 Least Squares   F-statistic:                     250.7
Date:                Thu, 13 Feb 2025   Prob (F-statistic):          1.63e-287
Time:                        10:09:43   Log-Likelihood:                 1738.0
No. Observations:                1087   AIC:                            -3452.
Df Residuals:                    1075   BIC:                            -3392.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   51.1606      1.397     36.614      0.000      48.419      53.902
fixed_acidity            0.4277      0.021     20.477      0.000       0.387       0.469
volatile_acidity         0.0676      0.018      3.759      0.000       0.032       0.103
citric_acid              0.0801      0.017      4.680      0.000       0.047       0.114
residual_sugar           0.1363      0.006     21.040      0.000       0.124       0.149
chlorides               -0.0770      0.043     -1.782      0.075      -0.162       0.008
free_sulfur_dioxide     -0.0020      0.004     -0.478      0.633      -0.010       0.006
total_sulfur_dioxide    -0.0104      0.004     -2.646      0.008      -0.018      -0.003
density                -51.2766      1.457    -35.205      0.000     -54.135     -48.419
pH                       0.3253      0.016     20.185      0.000       0.294       0.357
sulphates                0.1652      0.019      8.660      0.000       0.128       0.203
quality                  0.0196      0.002      8.932      0.000       0.015       0.024
==============================================================================
Omnibus:                       93.390   Durbin-Watson:                   1.965
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              148.154
Skew:                           0.624   Prob(JB):                     6.74e-33
Kurtosis:                       4.308   Cond. No.                     1.15e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.15e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Mean Squared Error: 0.002494729526941092
R-squared: 0.6849929074277996
Adjusted R-squared: 0.6821845232443922
In [45]:
Image(filename="../images/model_predictions1.png")
Out[45]:

This model evaluates how various characteristics of wine can predict its alcohol content. Here are the highlights and key points from the model's summary:

  • R-squared (Explained Variance): The model explains approximately 71.9% of the variance in alcohol content. This indicates a strong relationship between the predictors used and the alcohol content in the wine.
  • Important Predictors: Fixed acidity, volatile acidity, citric acid, residual sugar, total sulfur dioxide, density, pH, sulphates, and wine quality are all statistically significant predictors of alcohol content (p < 0.05). Chlorides (p = 0.075) and free sulfur dioxide (p = 0.633) are not significant at that level.
  • Model Diagnostics:
    • Durbin-Watson: With a value close to 2, this suggests that there is no significant autocorrelation in the residuals of the model.
    • Jarque-Bera Test: This indicates that the residuals are not normally distributed, which could suggest some concerns regarding the underlying assumptions of the model.
    • Condition Number: The high value points to potential multicollinearity issues among the predictors, which may affect the accuracy of the estimated coefficients.
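The Jarque-Bera figure in the summary can be reproduced approximately from the reported skewness, kurtosis, and number of observations. The sketch below is a sanity check under the standard formula JB = n/6 · (S² + (K − 3)²/4), with the p-value taken from a chi-squared distribution with 2 degrees of freedom, whose survival function is exp(−x/2). The function name is an assumption, and the inputs are the rounded values printed in the summary, so the result matches the reported statistic only up to rounding.

```python
import math


def jarque_bera(n, skew, kurtosis):
    """Jarque-Bera statistic and p-value from summary moments."""
    jb = n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Chi-squared survival function with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Rounded values read off the OLS summary for the alcohol model.
jb, p_value = jarque_bera(n=1087, skew=0.624, kurtosis=4.308)
print(f"JB = {jb:.1f}, p = {p_value:.2e}")
```

Because the statistic grows with the sample size, even the moderate skew and excess kurtosis seen here produce a very small p-value with more than a thousand observations.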

The plot compares the predicted alcohol content against the actual alcohol content, showing how well the model performs in predicting wine alcohol levels. Also, the plot shows that residuals are generally close to zero, especially around higher predicted values, indicating good model accuracy in these regions. However, some spread in residuals at lower predicted values suggests variability in accuracy.

  • F-statistic and Prob (F-statistic): These indicate that the model as a whole is statistically significant, meaning the included predictors are jointly associated with the alcohol content.
  • AIC/BIC: The lower these values, the better the model balance between fit and complexity, indicating a preferable model fit to the data.
  • Condition Number: A reminder that the high number could mean some predictors are providing overlapping information, potentially complicating the interpretation of individual effects.

  • Addressing Non-Normality and Multicollinearity: Adjustments such as transforming variables or revising the model to include or exclude certain variables might help in resolving the issues indicated by the Jarque-Bera test and high condition number.

  • Further Analysis: Investigating other modeling approaches or adding interaction terms could provide deeper insights and potentially enhance the model’s performance and interpretability.

The model predicts alcohol content well, with an R-squared of about 0.72 in the regression summary and about 0.68 in the separately reported evaluation. Some areas for improvement remain, particularly in handling the underlying model assumptions and extending the model's applicability to broader wine types.

Key Findings from Red Wine Quality Analysis

1. Key Physicochemical Properties

  • Included Properties: Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol content, and overall quality score.
  • These properties represent critical measurements from various stages of wine production.

2. Distribution of Properties

  • Distribution Patterns: Properties exhibit distributions that are right-skewed, normally distributed, or bimodal.
  • Understanding these distributions helps in profiling typical physicochemical characteristics of the wines.

3. Outliers

  • Identification and Impact: Outliers were identified using the Interquartile Range (IQR) method. Initially, these outliers were retained to allow potential uncovering of valuable insights.
  • The significance and impact of these outliers on further analysis were discussed, with a plan to revisit their inclusion.
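For reference, the IQR rule mentioned here can be sketched with the standard library alone. The function name, the multiplier default of 1.5, and the sample values are illustrative assumptions. The notebook's own implementation is not shown in this section and may differ, for example in how quartiles are interpolated.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the bounds."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)


# Made-up measurements for illustration (not taken from the dataset).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers, bounds = iqr_outliers(sample)
print(outliers, bounds)
```

Returning the bounds alongside the flagged values makes it easy to revisit the decision to keep or drop outliers later without recomputing the quartiles.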

4. Influence on Wine Quality

  • Significant Properties: Alcohol and volatile acidity show strong correlations with wine quality. Higher alcohol content correlates positively with higher quality, whereas higher volatile acidity correlates negatively with quality.
  • These findings are derived from correlation analyses and regression modeling.

5. Predictive Modeling of Wine Quality

  • Model Insights: Linear regression was used, with an R-squared of 35.4% indicating moderate explanatory power.
  • The model highlighted the importance of alcohol, sulphates, and acidity levels as predictors of quality.

6. Predictors of Alcohol Content

  • Model Findings: A separate model was developed for predicting alcohol content. Fixed acidity, volatile acidity, citric acid, residual sugar, density, pH, sulphates, and wine quality were all significant predictors.
  • This model showed strong explanatory power, with an R-squared of 71.9%, and provided insights into the physicochemical properties associated with alcohol levels.

7. Issues Affecting Model Reliability

  • Challenges Identified: Multicollinearity and non-normal distribution of residuals were notable issues, potentially affecting the accuracy and reliability of predictions.

8. Suggestions for Model Improvement

  • Enhancement Strategies:
    • Consideration of non-linear models to better capture relationships.
    • Variable transformations to address non-normality and multicollinearity.
    • Categorizing wine quality for improved model interpretability and accuracy.

Conclusion

  • This analysis provides a comprehensive understanding of the factors influencing red wine quality and alcohol content. The insights can aid producers and marketers in improving wine production and positioning strategies based on scientific findings.