Table of Contents¶

  1. Introduction
  2. Notebook settings
  3. Data Loading, Cleaning, and Checking
  4. EDA
    1. Target Variable and Hypothesis Formulation
    2. Data Preprocessing and Splitting
    3. Multicollinearity Analysis
    4. Model Testing - First Approach: OLS with All Features (No Transformation)
    5. Model Testing - Second Approach: OLS with All Features -> Box-Cox Transformation
    6. Model Testing - Third Approach: OLS with Reduced Features (Based on VIF)
    7. Hypothesis Testing
    8. Model Comparison and Evaluation
    9. QQ Plots for Residuals Analysis
    10. Final Conclusion
  5. Suggestions for Improvement

Introduction¶

The dataset used in this project can be found here: https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009
It presents chemical measurements of red wine samples and their quality ratings.
Only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

  • Goal:
    • Practice choosing variables and their transformations for a model.
    • Practice fitting a linear regression model.
    • Practice interpreting a fitted model’s coefficients and communicating uncertainty around them.

The dataset contains the following columns:

Input variables (based on physicochemical tests):

Column Description Datatype Count
fixed acidity most acids involved with wine are fixed or nonvolatile (they do not evaporate readily) float64 1599
volatile acidity the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste float64 1599
citric acid found in small quantities, citric acid can add 'freshness' and flavor to wines float64 1599
residual sugar the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet float64 1599
chlorides the amount of salt in the wine float64 1599
free sulfur dioxide the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. float64 1599
total sulfur dioxide amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine float64 1599
density the density of wine is close to that of water depending on the percent alcohol and sugar content float64 1599
pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale float64 1599
sulphates a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which act as an antimicrobial and antioxidant float64 1599
alcohol the percent alcohol content of the wine float64 1599

Output variable (based on sensory data):

Column Description Datatype Count
quality output variable (based on sensory data, score between 0 and 10) int64 1599

Notebook settings¶

In [ ]:
from assets.utils.functions import *
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import PowerTransformer
from statsmodels.stats.outliers_influence import variance_inflation_factor

Data Loading, Cleaning, and Checking¶

In [ ]:
wine_df = pd.read_csv('assets/data/winequality-red.csv')
wine_df.head()
Out[ ]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [ ]:
# Replace blank spaces with _ for each column
wine_df.columns = wine_df.columns.str.replace(' ', '_')
In [ ]:
print(wine_df.describe())
       fixed_acidity  volatile_acidity  citric_acid  residual_sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free_sulfur_dioxide  total_sulfur_dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000              6.000000     0.990070   
25%       0.070000             7.000000             22.000000     0.995600   
50%       0.079000            14.000000             38.000000     0.996750   
75%       0.090000            21.000000             62.000000     0.997835   
max       0.611000            72.000000            289.000000     1.003690   

                pH    sulphates      alcohol      quality  
count  1599.000000  1599.000000  1599.000000  1599.000000  
mean      3.311113     0.658149    10.422983     5.636023  
std       0.154386     0.169507     1.065668     0.807569  
min       2.740000     0.330000     8.400000     3.000000  
25%       3.210000     0.550000     9.500000     5.000000  
50%       3.310000     0.620000    10.200000     6.000000  
75%       3.400000     0.730000    11.100000     6.000000  
max       4.010000     2.000000    14.900000     8.000000  
In [ ]:
wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         1599 non-null   float64
 1   volatile_acidity      1599 non-null   float64
 2   citric_acid           1599 non-null   float64
 3   residual_sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free_sulfur_dioxide   1599 non-null   float64
 6   total_sulfur_dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
In [ ]:
print(f"\nNumber of duplicate rows: {wine_df.duplicated().sum()}")
Number of duplicate rows: 240

As we can see, the dataset has no null or missing values, and the datatypes are already correct, so no conversions or further cleaning are needed.

As for the 240 duplicated rows: since the dataset contains only physicochemical measurements, identical rows are plausibly genuine samples rather than data-entry mistakes.

Also, because the records carry no identifying labels, we cannot verify whether the duplicates are correct or not. Therefore, we will keep them in the dataset, and the dataframe is ready for analysis.
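
For reference, here is a quick way to inspect those repeated records side by side (a minimal sketch, not part of the original notebook):

In [ ]:
# List every duplicated record together with its twin(s), sorted so that
# identical physicochemical profiles sit next to each other
dupes = wine_df[wine_df.duplicated(keep=False)]
print(dupes.sort_values(by=list(wine_df.columns)).head(10))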

EDA¶

In [ ]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(x='quality',
                   data=wine_df,
                   hue='quality',
                   palette="viridis")
plt.title('Distribution of Wine Quality Ratings')

# Plot counts for each bar
for container in ax.containers:
    ax.bar_label(container)

# Remove y-axis numbers
ax.yaxis.set_ticks([])

# Remove the legend
ax.legend_.remove()

plt.show()
[Figure: bar chart of wine quality rating counts]

Quality Rating Distribution

  • Distribution Skewed Towards Middle Ratings: The majority of wine samples are rated as 5 and 6. Quality rating 5 has the highest frequency with 681 samples. Quality rating 6 follows closely with 638 samples.

  • Moderate Quality Wines are Most Common: The distribution suggests that moderate quality wines (ratings of 5 and 6) are the most common in the dataset.

  • Imbalance in Quality Ratings: There is an imbalance in the distribution of quality ratings, which could affect model performance, particularly for high (7, 8) and low (3, 4) quality wines.

  • Focus on Middle Quality Predictions: Given the skew towards middle ratings, the model might perform better for moderate quality wines.

Target Variable and Hypothesis Formulation¶

The target variable for this analysis is the quality column, whose scores range from 3 to 8, as seen in the summary statistics above.

The hypothesis we will focus on is formulated as follows:

  • Hypothesis: "Alcohol content has a positive impact on perceived wine quality."
  • Null Hypothesis: "Alcohol content has no impact on perceived wine quality."

Next, we examine how each feature relates to quality.

In [ ]:
plot_features_against_quality(wine_df)
[Figure: each feature plotted against quality]
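
plot_features_against_quality is imported from assets/utils/functions.py, which is not shown in the notebook. A plausible minimal implementation, assuming it draws one boxplot of each feature per quality score, might look like:

In [ ]:
def plot_features_against_quality(df, target='quality'):
    """Sketch: boxplot of every physicochemical feature grouped by quality."""
    features = [c for c in df.columns if c != target]
    fig, axes = plt.subplots(4, 3, figsize=(15, 16))
    for ax, col in zip(axes.ravel(), features):
        sns.boxplot(x=target, y=col, data=df, ax=ax)
        ax.set_title(f'{col} vs {target}')
    for ax in axes.ravel()[len(features):]:  # hide any unused panel
        ax.set_visible(False)
    plt.tight_layout()
    plt.show()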

Summary of Insights:

Positive Correlations with Quality: Citric acid, sulphates, and alcohol show positive trends with higher quality wines.

Negative Correlations with Quality: Volatile acidity and chlorides tend to be lower in higher quality wines.

No Clear Trends: Fixed acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, density, and pH do not show strong relationships with wine quality.

In [ ]:
plot_hist_corr(wine_df)
[Figures: feature distribution histograms and correlation heatmap]

Features Distribution Insights:

  • Features with Roughly Normal Distribution:

    • Fixed Acidity: slightly skewed to the right.
    • Density: Fairly normal but slightly skewed to the left.
    • pH: Roughly normal.
  • Features with Non-Normal Distribution (Right-Skewed):

    • Volatile Acidity
    • Citric Acid: Right-skewed with a peak at the lower end.
    • Residual Sugar: Heavily right-skewed with a long tail.
    • Chlorides: Heavily right-skewed.
    • Free Sulfur Dioxide: Right-skewed with a peak at the lower end.
    • Total Sulfur Dioxide
    • Sulphates
    • Alcohol
  • Features with Categorical Distribution:

    • Quality: Most wines have a quality score of 5, 6, or 7, with fewer wines scoring at the extremes (3, 4, 8).

Multicollinearity Insights:

  • Fixed Acidity and Citric Acid: Correlation of 0.66 suggests these variables are moderately related.
  • Free Sulfur Dioxide and Total Sulfur Dioxide: High correlation (0.79) indicates these variables are strongly related, which is expected since free sulfur dioxide is a part of the total sulfur dioxide.
  • Density and Residual Sugar: Correlation of 0.42 suggests a moderate relationship, which makes sense as residual sugar would contribute to the overall density.
  • pH and Fixed Acidity: High negative correlation (-0.71) suggests an inverse relationship between pH levels and fixed acidity.
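
These pairwise values can be verified directly from the data; for example (a quick spot-check, not part of the original notebook):

In [ ]:
# Spot-check the correlations called out above
pairs = [('fixed_acidity', 'citric_acid'),
         ('free_sulfur_dioxide', 'total_sulfur_dioxide'),
         ('density', 'residual_sugar'),
         ('pH', 'fixed_acidity')]
for a, b in pairs:
    print(f"{a} vs {b}: r = {wine_df[a].corr(wine_df[b]):.2f}")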

Summary:

  • Positive Influences on Quality: Alcohol, Sulphates, Citric Acid.
  • Negative Influences on Quality: Volatile Acidity, Chlorides, Density, Total Sulfur Dioxide.
  • Minimal Influence: Fixed Acidity, Residual Sugar, Free Sulfur Dioxide, pH.

Data Preprocessing and Splitting¶

In [ ]:
# Split the data into training and hold-out sets
train_df, holdout_df = train_test_split(
    wine_df, test_size=0.2, random_state=42)

# Select features and the target variable
X_train = train_df.drop(columns=['quality'])
y_train = train_df['quality']
X_holdout = holdout_df.drop(columns=['quality'])
y_holdout = holdout_df['quality']

Multicollinearity Analysis¶

Before fitting any models, I decided to check for multicollinearity among the independent variables.

In [ ]:
# Calculate VIF for each feature
X_train_const = sm.add_constant(X_train)
vif = pd.DataFrame()
vif['Variable'] = X_train_const.columns
vif['VIF'] = [variance_inflation_factor(X_train_const.values, i) 
              for i in range(X_train_const.shape[1])]
print(vif)
                Variable           VIF
0                  const  1.678384e+06
1          fixed_acidity  7.374554e+00
2       volatile_acidity  1.813146e+00
3            citric_acid  3.211154e+00
4         residual_sugar  1.731605e+00
5              chlorides  1.519827e+00
6    free_sulfur_dioxide  1.982913e+00
7   total_sulfur_dioxide  2.203259e+00
8                density  6.016488e+00
9                     pH  3.275199e+00
10             sulphates  1.455527e+00
11               alcohol  2.933465e+00

VIF Analysis:

The VIF calculation helps identify multicollinearity among the features: for feature j, VIF_j = 1 / (1 − R_j²), where R_j² is the R-squared from regressing feature j on all the other features. High VIF values indicate that a feature is highly correlated with other features in the dataset, which can lead to unreliable coefficient estimates.

Although the interpretation of VIF values is not clear-cut, the general rule of thumb is as follows:

  • VIF < 5: Generally considered acceptable; indicates low multicollinearity.

  • VIF between 5 and 10: Moderate multicollinearity; may warrant further investigation.

  • VIF > 10: High multicollinearity; suggests a problematic level of multicollinearity.

In this dataset, fixed_acidity (VIF = 7.37) and density (VIF = 6.02) fall into the moderate range, suggesting multicollinearity. Features with lower VIF values are generally preferred, as they contribute more independent information to the model.
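
A small sketch applying those cutoffs to the vif table computed above (not part of the original notebook):

In [ ]:
# Flag features above the VIF > 5 rule of thumb, skipping the intercept row
# (the constant's VIF is not meaningful)
flagged = vif[(vif['Variable'] != 'const') & (vif['VIF'] > 5)]
print(flagged.sort_values('VIF', ascending=False))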

Model Testing - First Approach: OLS with All Features (No Transformation)¶

In [ ]:
# Fit the model with all features (no transformation)
X_train_const = sm.add_constant(X_train)
X_holdout_const = sm.add_constant(X_holdout)

model_all = sm.OLS(y_train, X_train_const).fit()
print("\nModel with All Features (No Transformation) Summary:")
print(model_all.summary())
Model with All Features (No Transformation) Summary:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                quality   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.342
Method:                 Least Squares   F-statistic:                     61.48
Date:                Sat, 10 Aug 2024   Prob (F-statistic):          1.48e-109
Time:                        19:13:48   Log-Likelihood:                -1266.4
No. Observations:                1279   AIC:                             2557.
Df Residuals:                    1267   BIC:                             2619.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   14.3551     23.705      0.606      0.545     -32.150      60.860
fixed_acidity            0.0231      0.029      0.801      0.423      -0.033       0.080
volatile_acidity        -1.0013      0.137     -7.283      0.000      -1.271      -0.732
citric_acid             -0.1408      0.168     -0.839      0.402      -0.470       0.188
residual_sugar           0.0066      0.017      0.391      0.696      -0.026       0.039
chlorides               -1.8065      0.457     -3.949      0.000      -2.704      -0.909
free_sulfur_dioxide      0.0056      0.002      2.252      0.025       0.001       0.011
total_sulfur_dioxide    -0.0036      0.001     -4.419      0.000      -0.005      -0.002
density                -10.3516     24.192     -0.428      0.669     -57.812      37.109
pH                      -0.3937      0.215     -1.830      0.067      -0.816       0.028
sulphates                0.8412      0.126      6.651      0.000       0.593       1.089
alcohol                  0.2819      0.030      9.465      0.000       0.223       0.340
==============================================================================
Omnibus:                       28.708   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               46.050
Skew:                          -0.192   Prob(JB):                     1.00e-10
Kurtosis:                       3.847   Cond. No.                     1.12e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Insights from OLS Regression Results -> First Approach¶

  • Overview:

    • R-squared (Training Dataset): 0.348 -> approximately 34.8% of the variance in wine quality is explained by the model, based on the predictors included.
    • Adjusted R-squared: 0.342 -> this suggests a slight penalty for including predictors that do not significantly improve the model fit.
    • Multicollinearity issues raised concerns about the reliability of the model.
  • Significant Predictors (coefficients and p-value < 0.05):

    • Volatile Acidity: Higher volatile acidity is strongly associated with lower wine quality.
    • Chlorides: Higher chloride levels are significantly associated with lower wine quality.
    • Sulphates: Higher sulphates are significantly associated with higher wine quality.
    • Alcohol: Higher alcohol content is strongly associated with higher wine quality.
    • Free Sulfur Dioxide: Higher levels of free sulfur dioxide are associated with a slight increase in wine quality.
    • Total Sulfur Dioxide: Higher levels of total sulfur dioxide are significantly associated with lower wine quality.
  • Other Predictors:

    • Fixed Acidity, Citric Acid, Residual Sugar, Density, pH: These variables have higher p-values (> 0.05), indicating that they are not statistically significant in explaining wine quality in this model.

Model Testing - Second Approach: OLS with All Features -> Box-Cox Transformation¶

The Box-Cox transformation is applied to make each feature more normally distributed, which can improve the linear relationship between the features and the target variable, leading to potentially better model performance. For a strictly positive value x it is defined as x^(λ) = (x^λ − 1)/λ for λ ≠ 0 and ln(x) for λ = 0, with λ estimated per feature by maximum likelihood.

In [ ]:
boxcox_transformer = PowerTransformer(method='box-cox', standardize=False)
X_train_boxcox = boxcox_transformer.fit_transform(X_train + 1) # shift by 1: Box-Cox needs strictly positive inputs (citric_acid has zeros)
X_holdout_boxcox = boxcox_transformer.transform(X_holdout + 1)

# Ensure index alignment
X_train_boxcox_df = pd.DataFrame(X_train_boxcox, 
                                 columns=X_train.columns,
                                 index=X_train.index)
X_holdout_boxcox_df = pd.DataFrame(X_holdout_boxcox, 
                                   columns=X_holdout.columns,
                                   index=X_holdout.index)

X_train_boxcox_df = sm.add_constant(X_train_boxcox_df)
X_holdout_boxcox_df = sm.add_constant(X_holdout_boxcox_df)

# Fit the model with Box-Cox transformed features
model_all_boxcox = sm.OLS(y_train, X_train_boxcox_df).fit()
print("\nModel with All Features (Box-Cox Transformed) Summary:")
print(model_all_boxcox.summary())
Model with All Features (Box-Cox Transformed) Summary:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                quality   R-squared:                       0.358
Model:                            OLS   Adj. R-squared:                  0.352
Method:                 Least Squares   F-statistic:                     64.22
Date:                Sat, 10 Aug 2024   Prob (F-statistic):          9.46e-114
Time:                        19:13:49   Log-Likelihood:                -1256.5
No. Observations:                1279   AIC:                             2537.
Df Residuals:                    1267   BIC:                             2599.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                 7.313e+04      3e+04      2.435      0.015    1.42e+04    1.32e+05
fixed_acidity            4.9402      1.680      2.941      0.003       1.645       8.235
volatile_acidity        -2.1957      0.325     -6.754      0.000      -2.833      -1.558
citric_acid             -0.4502      0.219     -2.052      0.040      -0.881      -0.020
residual_sugar           3.0520      1.355      2.252      0.024       0.393       5.711
chlorides              -12.2315      4.509     -2.713      0.007     -21.077      -3.386
free_sulfur_dioxide      0.0744      0.041      1.821      0.069      -0.006       0.155
total_sulfur_dioxide    -0.1268      0.043     -2.916      0.004      -0.212      -0.041
density              -1.047e+06   3.82e+05     -2.740      0.006    -1.8e+06   -2.98e+05
pH                      -1.2888      1.667     -0.773      0.440      -4.559       1.982
sulphates               20.2285      2.159      9.371      0.000      15.993      24.464
alcohol               2.788e+04   3707.756      7.519      0.000    2.06e+04    3.52e+04
==============================================================================
Omnibus:                       31.407   Durbin-Watson:                   2.020
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               51.559
Skew:                          -0.204   Prob(JB):                     6.37e-12
Kurtosis:                       3.895   Cond. No.                     1.12e+08
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.87e-12. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
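
The per-feature λ values estimated by PowerTransformer can be inspected after fitting (a quick sketch; this output is not part of the original notebook):

In [ ]:
# lambdas_ holds the maximum-likelihood lambda estimated for each feature
for name, lam in zip(X_train.columns, boxcox_transformer.lambdas_):
    print(f"{name}: lambda = {lam:.3f}")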

Insights from OLS Regression Results -> Second Approach¶

  • Overview:

    • R-squared: The model explains 35.8% of the variance in wine quality, which indicates a moderate fit.
    • Adjusted R-squared: 0.352 -> is slightly lower, accounting for the number of predictors in the model. This suggests that the model is capturing some, but not all, of the variability in wine quality.
    • Multicollinearity issues raised concerns about the reliability of the model.
  • Significant Predictors (coefficients and p-value < 0.05):

    • Fixed Acidity: Positive and significant, with a coefficient of 4.9402. An increase in fixed acidity is associated with a slight increase in wine quality, indicating that this factor positively influences the perceived quality of wine.
    • Volatile Acidity: Negative and significant, with a coefficient of -2.1957. Higher levels of volatile acidity are associated with a decrease in wine quality, consistent with the understanding that volatile acidity can produce undesirable vinegar-like flavors.
    • Citric Acid: Negative and significant, with a coefficient of -0.4502. Higher levels of citric acid, while often adding freshness, are linked to a slight reduction in wine quality in this dataset.
    • Residual Sugar: Positive and significant, with a coefficient of 3.0520. Higher residual sugar content is associated with a higher quality rating, although the effect size is modest.
    • Chlorides: Strongly negative and significant, with a coefficient of -12.2315. Higher chloride levels, which indicate salt content, are strongly associated with a lower quality rating, as excessive salt can negatively impact taste.
    • Total Sulfur Dioxide: Negative and significant, with a coefficient of -0.1268. Higher levels of total sulfur dioxide are associated with lower wine quality, likely due to its impact on wine preservation, which can lead to unwanted chemical tastes if overused.
    • Density: Strongly negative and significant, with a coefficient of -1.047e+06. Higher density, which typically reflects more residual sugar and less alcohol, is associated with lower wine quality in this model.
    • Sulphates: Strongly positive and significant, with a coefficient of 20.2285. Higher levels of sulphates, which act as preservatives, are associated with higher wine quality.
    • Alcohol: Strongly positive and significant, with a coefficient of 2.788e+04. Higher alcohol content is strongly associated with higher wine quality, aligning with the hypothesis that alcohol content improves perceived quality.

Note that these coefficients are on the Box-Cox-transformed scale of each feature, so their magnitudes (e.g., the very large values for density and alcohol) are not directly comparable to the untransformed model's coefficients or to one another.

Model Testing - Third Approach: OLS with Reduced Features (Based on VIF)¶

In [ ]:
# Drop the high-VIF features ('fixed_acidity', 'density') together with
# features that were not statistically significant in the first model
X_train_reduced = X_train.drop(columns=['fixed_acidity', 'density',
                                        'citric_acid', 'pH',
                                        'residual_sugar', 'free_sulfur_dioxide',
                                        'total_sulfur_dioxide'])
X_holdout_reduced = X_holdout.drop(columns=['fixed_acidity', 'density',
                                            'citric_acid', 'pH',
                                            'residual_sugar', 'free_sulfur_dioxide',
                                            'total_sulfur_dioxide'])

# Fit model with reduced features
X_train_reduced_const = sm.add_constant(X_train_reduced)
X_holdout_reduced_const = sm.add_constant(X_holdout_reduced)

model_reduced = sm.OLS(y_train, X_train_reduced_const).fit()
print("\nModel with Reduced Features Summary (Based on VIF):")
print(model_reduced.summary())
Model with Reduced Features Summary (Based on VIF):
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                quality   R-squared:                       0.329
Model:                            OLS   Adj. R-squared:                  0.327
Method:                 Least Squares   F-statistic:                     156.0
Date:                Sat, 10 Aug 2024   Prob (F-statistic):          1.03e-108
Time:                        19:13:49   Log-Likelihood:                -1284.9
No. Observations:                1279   AIC:                             2580.
Df Residuals:                    1274   BIC:                             2606.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                2.7454      0.226     12.151      0.000       2.302       3.189
volatile_acidity    -1.1001      0.110     -9.971      0.000      -1.317      -0.884
chlorides           -1.5769      0.426     -3.697      0.000      -2.414      -0.740
sulphates            0.8175      0.122      6.688      0.000       0.578       1.057
alcohol              0.2939      0.019     15.819      0.000       0.257       0.330
==============================================================================
Omnibus:                       22.135   Durbin-Watson:                   1.985
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               35.753
Skew:                          -0.132   Prob(JB):                     1.72e-08
Kurtosis:                       3.776   Cond. No.                         247.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Insights from OLS Regression Results -> Third Approach (Reduced Features Based on VIF)¶

  • Overview:

    • R-squared: The model explains 32.9% of the variance in wine quality, which is moderately strong.
    • Adjusted R-squared: 0.327 -> The adjusted R-squared of 32.7% is very close to the R-squared value, suggesting that the model is well-balanced and that the included predictors are relevant without adding unnecessary complexity.
    • It does not present multicollinearity issues, since the high-VIF features were removed.
  • Significant Predictors (coefficients and p-value < 0.05):

    • Intercept (const): The intercept is significant and indicates the baseline level of wine quality when all predictors are zero (an extrapolation, since zero lies outside the observed range of these features).
    • Volatile Acidity: Negative and highly significant, with a coefficient of -1.1001. This indicates that higher levels of volatile acidity strongly decrease wine quality, which is consistent with its potential to produce off-flavors.
    • Chlorides: Negative and significant, with a coefficient of -1.5769. Higher chloride levels, which indicate salt content, are associated with lower wine quality, reflecting the detrimental effect of excessive saltiness on taste.
    • Sulphates: Positive and significant, with a coefficient of 0.8175. Higher levels of sulphates, which act as preservatives, are associated with higher wine quality, indicating that sulphates contribute positively to the preservation and taste of the wine.
    • Alcohol: Strongly positive and significant, with a coefficient of 0.2939. This suggests that higher alcohol content is strongly associated with higher wine quality, aligning with the hypothesis that alcohol content improves perceived quality.
  • Confidence Intervals:

    • Intercept (const):
      Coefficient: 2.7454
      95% Confidence Interval: [2.302, 3.189]
      Interpretation: The intercept (constant) is the expected wine quality score when all the predictor variables (volatile acidity, chlorides, sulphates, alcohol) are zero. The confidence interval does not include zero, indicating that the intercept is statistically significant. The relatively narrow interval suggests that the estimate is precise.

    • Volatile Acidity:
      Coefficient: -1.1001
      95% Confidence Interval: [-1.317, -0.884]
      Interpretation: The negative coefficient indicates that an increase in volatile acidity is associated with a decrease in wine quality. The confidence interval does not include zero, which confirms that volatile acidity has a statistically significant negative effect on wine quality. The interval is reasonably narrow, suggesting a precise estimate of this effect.

    • Chlorides:
      Coefficient: -1.5769
      95% Confidence Interval: [-2.414, -0.740]
      Interpretation: Chlorides also have a negative coefficient, meaning higher levels of chlorides are associated with lower wine quality. The confidence interval excludes zero, indicating statistical significance. However, the interval is somewhat wider compared to volatile acidity, suggesting more variability in the estimate, but it is still significant.

    • Sulphates:
      Coefficient: 0.8175
      95% Confidence Interval: [0.578, 1.057]
      Interpretation: The positive coefficient suggests that an increase in sulphates is associated with an increase in wine quality. The confidence interval does not include zero, confirming statistical significance. The interval is fairly narrow, indicating a precise estimate of the effect.

    • Alcohol:
      Coefficient: 0.2939
      95% Confidence Interval: [0.257, 0.330]
      Interpretation: Alcohol content has a positive effect on wine quality, as indicated by the positive coefficient. The confidence interval does not include zero, showing strong statistical significance. The interval is very narrow, which implies a highly precise estimate, making alcohol one of the most reliable predictors in the model.
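
The intervals quoted above can also be pulled programmatically from the fitted model (a small sketch using the statsmodels results API):

In [ ]:
# 95% confidence intervals for the reduced model's coefficients
ci = model_reduced.conf_int(alpha=0.05)
ci.columns = ['2.5%', '97.5%']
print(ci)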

Hypothesis Testing¶

In [ ]:
# Hypothesis Testing
# H0: "Alcohol content has no impact on perceived wine quality."
# H1: "Alcohol content has a positive impact on perceived wine quality."

# Extract the coefficient and p-value for the 'alcohol' variable
alcohol_coef = model_reduced.params['alcohol']
alcohol_pvalue = model_reduced.pvalues['alcohol']

# Display the results
print(f"Alcohol Coefficient: {alcohol_coef}")
print(f"Alcohol p-value: {alcohol_pvalue}")

# Hypothesis testing decision (the summary p-value is two-sided, so we
# also require a positive coefficient for the directional H1)
if alcohol_pvalue < 0.05 and alcohol_coef > 0:
    print("Reject the null hypothesis: Alcohol content has a significant "
          "positive impact on perceived wine quality.")
else:
    print("Fail to reject the null hypothesis: Alcohol content does not have "
          "a significant positive impact on perceived wine quality.")
Alcohol Coefficient: 0.2939209003966722
Alcohol p-value: 1.3474711519479874e-51
Reject the null hypothesis: Alcohol content has a significant positive impact on perceived wine quality.

Insights on Hypothesis Testing Results:

  • Alcohol Coefficient: 0.2939 -> each one-unit increase in alcohol content (one percentage point) is associated with an increase of roughly 0.294 in predicted quality, holding the other predictors constant.
  • Alcohol p-value: 1.35e-51 -> Extremely low, indicating strong evidence against the null hypothesis.
  • Conclusion: There is strong statistical evidence that alcohol content has a significant positive impact on perceived wine quality.
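
One caveat worth making explicit: the regression summary reports a two-sided p-value, while the alternative hypothesis here is directional (a positive effect). Since the estimated coefficient is positive, the one-sided p-value is simply half the two-sided one (a sketch, not part of the original notebook):

In [ ]:
# Convert the two-sided p-value to a one-sided one for H1: coefficient > 0
p_one_sided = alcohol_pvalue / 2 if alcohol_coef > 0 else 1 - alcohol_pvalue / 2
print(f"One-sided p-value: {p_one_sided:.3e}")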

Model Comparison and Evaluation¶

In [ ]:
# Evaluate each model
mse_all, mae_all, r2_all = evaluate_model(
    model_all, X_holdout_const, y_holdout)
mse_all_boxcox, mae_all_boxcox, r2_all_boxcox = evaluate_model(
    model_all_boxcox, X_holdout_boxcox_df, y_holdout)
mse_reduced, mae_reduced, r2_reduced = evaluate_model(
    model_reduced, X_holdout_reduced_const, y_holdout)

print("\nComparison of Model Performance:")
print(f"Model with All Features (No Transformation) - MSE: {mse_all}, "
      f"MAE: {mae_all}, R2: {r2_all}")
print(f"Model with All Features (Box-Cox) - MSE: {mse_all_boxcox}, "
      f"MAE: {mae_all_boxcox}, R2: {r2_all_boxcox}")
print(f"Model with Reduced Features (VIF Consideration) - MSE: {mse_reduced}, "
      f"MAE: {mae_reduced}, R2: {r2_reduced}")
Comparison of Model Performance:
Model with All Features (No Transformation) - MSE: 0.39002514396397386, MAE: 0.5035304415524223, R2: 0.4031803412795929
Model with All Features (Box-Cox) - MSE: 0.38402287532804735, MAE: 0.49277876529880127, R2: 0.41236506173744314
Model with Reduced Features (VIF Consideration) - MSE: 0.39583576287046485, MAE: 0.5162232158682217, R2: 0.3942888848019904
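
evaluate_model is imported from assets/utils/functions.py and not shown in the notebook. A minimal sketch consistent with how it is called here (a fitted statsmodels result, a design matrix with constant, and the targets) might look like:

In [ ]:
def evaluate_model(model, X, y):
    """Sketch: return (MSE, MAE, R^2) for a fitted statsmodels OLS model."""
    preds = model.predict(X)
    return (mean_squared_error(y, preds),
            mean_absolute_error(y, preds),
            r2_score(y, preds))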

Insights on Model Performance Comparison:

Best Performing Model: The model with all features after the Box-Cox transformation performs best, showing the lowest MSE and MAE and the highest R-squared on the hold-out set. This suggests that transforming the data helped the model better capture the relationships between the predictors and wine quality.

Trade-off with Simplicity: The reduced model, while simpler and easier to interpret, does not perform as well as the full model with or without transformation.

Conclusion: If the goal is to maximize predictive accuracy, the model with all features and Box-Cox transformation should be preferred. However, if interpretability and simplicity are more critical, the reduced model might still be a viable option, albeit with slightly lower predictive power.

QQ Plots for Residuals Analysis¶

In [ ]:
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

sm.qqplot(model_all.resid, line='s', ax=axes[0])
axes[0].set_title('QQ Plot (All Features No Transformation)')

sm.qqplot(model_all_boxcox.resid, line='s', ax=axes[1])
axes[1].set_title('QQ Plot (All Features Box-Cox)')

sm.qqplot(model_reduced.resid, line='s', ax=axes[2])
axes[2].set_title('QQ Plot (Reduced Features Based on VIF)')

plt.tight_layout()
plt.show()
[Figure: QQ plots of residuals for the three models]

Insights from QQ Plots:

  • QQ Plot (All Features No Transformation):
    The residuals in this plot show some deviation from the theoretical quantiles, especially at the tails.
    This indicates that the residuals are not perfectly normally distributed, suggesting potential issues with the model's fit.
    The slight S-shape suggests that the model may not capture the underlying data patterns perfectly, possibly due to the presence of outliers or the effect of untransformed skewed variables.

  • QQ Plot (All Features Box-Cox):
    After applying the Box-Cox transformation, the residuals align more closely with the theoretical quantiles, especially in the middle range.
    This indicates an improvement in normality compared to the model with no transformation.
    The Box-Cox transformation appears to have helped stabilize variance and reduce skewness in the data, leading to a better model fit.
    However, some deviations still exist at the extremes, though they are less pronounced than in the non-transformed model.

  • QQ Plot (Reduced Features Based on VIF):
    The residuals for the model with reduced features (based on VIF) also show alignment with the theoretical quantiles but exhibit slight deviations at the tails.
    The plot suggests that the reduced model, while simpler, still struggles slightly with capturing the data's full complexity. The deviations, particularly at the extremes, might be due to the exclusion of some influential features, leading to a less accurate model.

Summary:

  • Best Normality Fit: The model with all features and Box-Cox transformation demonstrates the best alignment with the normal distribution, suggesting that this model has the most normally distributed residuals and, therefore, the best fit among the three models.

  • Impact of Feature Reduction: The reduced model (based on VIF) shows some deviations in the tails, indicating that reducing the number of features may have slightly compromised the model's ability to fit the data perfectly.
    However, this model still provides a reasonably good fit, with residuals mostly following the normal distribution.

Final Conclusion¶

  • Model with All Features (No Transformation):
    This model included all the available features without any transformations. It explained approximately 40.3% of the variance in wine quality on the hold-out set (R² = 0.4032) and showed moderate predictive performance. However, it faced multicollinearity issues, particularly with fixed acidity and density, which may have introduced noise into the coefficient estimates.

  • Model with All Features (Box-Cox Transformation):
    By applying the Box-Cox transformation to all features, we aimed to stabilize variance and make the data more normally distributed. This model performed slightly better, explaining 41.2% of the variance on the hold-out set (R² = 0.4124), and demonstrated improved predictive accuracy with lower MSE and MAE values. The QQ plot showed that its residuals were more normally distributed than those of the untransformed model, indicating a better fit.

  • Model with Reduced Features (Based on VIF):
    This model excluded features with a high Variance Inflation Factor (VIF) to reduce multicollinearity. While simpler and more interpretable, it explained slightly less variance (hold-out R² = 0.3943) and had higher error metrics than the other two models. Dropping features likely discarded some useful information, resulting in a small loss of predictive power.

Suggestions for Improvement¶

  • Outlier Analysis:
    Outliers can significantly impact regression models by skewing coefficient estimates and violating the normality assumptions; a Cook's distance check is sketched at the end of this section.

  • Handling Multicollinearity:
    Multicollinearity can inflate the variance of coefficient estimates and make the model unstable.

  • In-depth feature analysis:
    I could have explored more features, such as interactions between variables, polynomial terms, or other transformations, to capture more complex relationships in the data.
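
As a concrete starting point for the outlier analysis suggested above, influential observations can be flagged with Cook's distance from statsmodels (a sketch using the common 4/n cutoff, which is a heuristic rather than a hard rule):

In [ ]:
# Flag influential observations in the reduced model via Cook's distance
influence = model_reduced.get_influence()
cooks_d, _ = influence.cooks_distance
threshold = 4 / len(y_train)
print(f"Observations with Cook's D > {threshold:.4f}: "
      f"{(cooks_d > threshold).sum()}")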