Table of Contents¶
- Introduction
- Notebook settings
- Data Loading, Cleaning, and Checking
- EDA
- Multicollinearity Analysis
- Model Testing - First Approach: OLS with All Features (No Transformation)
- Model Testing - Second Approach: OLS with All Features -> Box-Cox Transformation
- Model Testing - Third Approach: OLS with Reduced Features (Based on VIF)
- Hypothesis Testing
- Model Comparison and Evaluation
- QQ Plots for Residuals Analysis
- Final Conclusion
- Suggestions for Improvement
Introduction¶
The dataset used in this project can be found here: https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009
It presents physicochemical data of red wine samples together with a sensory quality score.
Only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
- Goal:
- Practice choosing variables and their transformations for a model.
- Practice fitting a linear regression model.
- Practice interpreting a fitted model’s coefficients and communicating uncertainty around them.
The dataset contains the following columns:
Input variables (based on physicochemical tests):
| Column | Description | Datatype | Count |
|---|---|---|---|
| fixed acidity | most acids involved with wine or fixed or nonvolatile (do not evaporate readily) | float64 | 1599 |
| volatile acidity | the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste | float64 | 1599 |
| citric acid | found in small quantities, citric acid can add 'freshness' and flavor to wines | float64 | 1599 |
| residual sugar | the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet | float64 | 1599 |
| chlorides | the amount of salt in the wine | float64 | 1599 |
| free sulfur dioxide | the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. | float64 | 1599 |
| total sulfur dioxide | amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine | float64 | 1599 |
| density | the density of wine is close to that of water depending on the percent alcohol and sugar content | float64 | 1599 |
| pH | describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale | float64 | 1599 |
| sulphates | a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant | float64 | 1599 |
| alcohol | the percent alcohol content of the wine | float64 | 1599 |
Output variable (based on sensory data):
| Column | Description | Datatype | Count |
|---|---|---|---|
| quality | output variable (based on sensory data, score between 0 and 10) | int64 | 1599 |
Notebook settings¶
from assets.utils.functions import *
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import PowerTransformer
from statsmodels.stats.outliers_influence import variance_inflation_factor
Data Loading, Cleaning, and Checking¶
wine_df = pd.read_csv('assets/data/winequality-red.csv')
wine_df.head()
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
# Replace blank spaces with _ for each column
wine_df.columns = wine_df.columns.str.replace(' ', '_')
print(wine_df.describe())
fixed_acidity volatile_acidity citric_acid residual_sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
chlorides free_sulfur_dioxide total_sulfur_dioxide density \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 0.087467 15.874922 46.467792 0.996747
std 0.047065 10.460157 32.895324 0.001887
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 22.000000 0.995600
50% 0.079000 14.000000 38.000000 0.996750
75% 0.090000 21.000000 62.000000 0.997835
max 0.611000 72.000000 289.000000 1.003690
pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 3.311113 0.658149 10.422983 5.636023
std 0.154386 0.169507 1.065668 0.807569
min 2.740000 0.330000 8.400000 3.000000
25% 3.210000 0.550000 9.500000 5.000000
50% 3.310000 0.620000 10.200000 6.000000
75% 3.400000 0.730000 11.100000 6.000000
max 4.010000 2.000000 14.900000 8.000000
wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed_acidity         1599 non-null   float64
 1   volatile_acidity      1599 non-null   float64
 2   citric_acid           1599 non-null   float64
 3   residual_sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free_sulfur_dioxide   1599 non-null   float64
 6   total_sulfur_dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
print(f"\nNumber of duplicate rows: {wine_df.duplicated().sum()}")
Number of duplicate rows: 240
As we can see, the dataset contains no null or missing values, and the datatypes are already correct, so no conversions or further cleaning are needed.
We did find 240 duplicated rows. Since the dataset contains only input variables from physicochemical tests, identical rows are plausibly genuine repeated measurements rather than data-entry mistakes.
Also, since the samples are not individually labeled, we cannot verify whether the duplicated rows are erroneous. Therefore, we will keep them in the dataset, and the dataframe is ready for analysis.
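The decision to keep the duplicates can be double-checked by inspecting the duplicate groups directly. A minimal sketch on a hypothetical toy frame (the same calls apply unchanged to wine_df):

```python
import pandas as pd

# Toy data for illustration only -- rows 0 and 2 are identical on purpose
toy_df = pd.DataFrame({
    'fixed_acidity': [7.4, 7.8, 7.4],
    'quality': [5, 5, 5],
})

# keep=False marks every member of a duplicate group (originals and repeats),
# which makes it easy to eyeball the full groups before deciding what to do
dupes = toy_df[toy_df.duplicated(keep=False)]
print(dupes)

# duplicated().sum() counts only the extra copies (here: 1)
print(toy_df.duplicated().sum())
```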
EDA¶
plt.figure(figsize=(10, 6))
ax = sns.countplot(x='quality',
data=wine_df,
hue='quality',
palette="viridis")
plt.title('Distribution of Wine Quality Ratings')
# Plot counts for each bar
for container in ax.containers:
ax.bar_label(container)
# Remove y-axis numbers
ax.yaxis.set_ticks([])
# Remove the legend
ax.legend_.remove()
plt.show()
Quality Rating Distribution
Distribution Skewed Towards Middle Ratings: The majority of wine samples are rated as 5 and 6. Quality rating 5 has the highest frequency with 681 samples. Quality rating 6 follows closely with 638 samples.
Moderate Quality Wines are Most Common: The distribution suggests that moderate quality wines (ratings of 5 and 6) are the most common in the dataset.
Imbalance in Quality Ratings: There is an imbalance in the distribution of quality ratings, which could affect model performance, particularly for high (7, 8) and low (3, 4) quality wines.
Focus on Middle Quality Predictions: Given the skew towards middle ratings, the model might perform better for moderate quality wines.
Target Variable and Hypothesis formulation¶
The target variable for this analysis is the quality column, whose scores range from 3 to 8, as seen in the summary statistics above.
The hypothesis we are going to focus on is formulated as follows:
• Hypothesis: "Alcohol content has a positive impact on perceived wine quality."
• Null Hypothesis: "Alcohol content has no impact on perceived wine quality."
Next, we will examine how each feature relates to the quality ratings.
plot_features_against_quality(wine_df)
Summary of Insights:
Positive Correlations with Quality: Citric acid, sulphates, and alcohol show positive trends with higher quality wines.
Negative Correlations with Quality: Volatile acidity and chlorides tend to be lower in higher quality wines.
No Clear Trends: Fixed acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, density, and pH do not show strong relationships with wine quality.
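plot_features_against_quality is imported from assets.utils.functions and its body is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming one seaborn boxplot per feature (the actual implementation may differ):

```python
# Hypothetical sketch -- the real helper lives in assets/utils/functions.py
import matplotlib.pyplot as plt
import seaborn as sns

def plot_features_against_quality(df, target='quality'):
    """Boxplot of each feature grouped by the target column."""
    features = [c for c in df.columns if c != target]
    n_rows = (len(features) + 2) // 3  # 3 plots per row
    fig, axes = plt.subplots(n_rows, 3, figsize=(15, 4 * n_rows))
    for ax, feature in zip(axes.ravel(), features):
        sns.boxplot(x=target, y=feature, data=df, ax=ax)
        ax.set_title(f'{feature} by {target}')
    plt.tight_layout()
    plt.show()
```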
plot_hist_corr(wine_df)
Features Distribution Insights:
Features with Roughly Normal Distribution:
- Fixed Acidity: slightly skewed to the right.
- Density: Fairly normal but slightly skewed to the left.
- pH: Roughly normal.
Features with Non-Normal Distribution (Right-Skewed):
- Volatile Acidity
- Citric Acid: Right-skewed with a peak at the lower end.
- Residual Sugar: Heavily right-skewed with a long tail.
- Chlorides: Heavily right-skewed.
- Free Sulfur Dioxide: Right-skewed with a peak at the lower end.
- Total Sulfur Dioxide
- Sulphates
- Alcohol
Features with Categorical Distribution:
- Quality: Most wines have a quality score of 5, 6, or 7, with fewer wines scoring at the extremes (3, 4, 8).
Multicollinearity Insights:
- Fixed Acidity and Citric Acid: Correlation of 0.66 suggests these variables are moderately related.
- Free Sulfur Dioxide and Total Sulfur Dioxide: High correlation (0.79) indicates these variables are strongly related, which is expected since free sulfur dioxide is a part of the total sulfur dioxide.
- Density and Residual Sugar: Correlation of 0.42 suggests a moderate relationship, which makes sense as residual sugar would contribute to the overall density.
- pH and Fixed Acidity: High negative correlation (-0.71) suggests an inverse relationship between pH levels and fixed acidity.
Summary:
- Positive Influences on Quality: Alcohol, Sulphates, Citric Acid.
- Negative Influences on Quality: Volatile Acidity, Chlorides, Density, Total Sulfur Dioxide.
- Minimal Influence: Fixed Acidity, Residual Sugar, Free Sulfur Dioxide, pH.
Data Preprocessing and Splitting¶
# Split the data into training and hold-out sets
train_df, holdout_df = train_test_split(
wine_df, test_size=0.2, random_state=42)
# Select features and the target variable
X_train = train_df.drop(columns=['quality'])
y_train = train_df['quality']
X_holdout = holdout_df.drop(columns=['quality'])
y_holdout = holdout_df['quality']
Multicollinearity Analysis¶
I decided to check the Multicollinearity between independent variables.
# Calculate VIF for each feature
X_train_const = sm.add_constant(X_train)
vif = pd.DataFrame()
vif['Variable'] = X_train_const.columns
vif['VIF'] = [variance_inflation_factor(X_train_const.values, i)
for i in range(X_train_const.shape[1])]
print(vif)
                Variable           VIF
0                  const  1.678384e+06
1          fixed_acidity  7.374554e+00
2       volatile_acidity  1.813146e+00
3            citric_acid  3.211154e+00
4         residual_sugar  1.731605e+00
5              chlorides  1.519827e+00
6    free_sulfur_dioxide  1.982913e+00
7   total_sulfur_dioxide  2.203259e+00
8                density  6.016488e+00
9                     pH  3.275199e+00
10             sulphates  1.455527e+00
11               alcohol  2.933465e+00
VIF Analysis:
The VIF calculation helps identify multicollinearity among the features. High VIF values indicate that a feature is highly correlated with other features in the dataset, which can lead to unreliable coefficient estimates.
Although the interpretation for Multicollinearity is not straightforward, the general rule of thumb is as follows:
VIF < 5: Generally considered acceptable; indicates low multicollinearity.
VIF between 5 and 10: Moderate multicollinearity; may warrant further investigation.
VIF > 10: High multicollinearity; suggests a problematic level of multicollinearity.
fixed_acidity (VIF = 7.37) and density (VIF = 6.02) exceed the threshold of 5, suggesting moderate multicollinearity. (The very large VIF of the constant term is an artifact of including the intercept and can be ignored.)
Features with lower VIF values are generally preferred as they contribute more independent information to the model.
Model Testing - First Approach: OLS with All Features (No Transformation)¶
# Fit the model with all features (no transformation)
X_train_const = sm.add_constant(X_train)
X_holdout_const = sm.add_constant(X_holdout)
model_all = sm.OLS(y_train, X_train_const).fit()
print("\nModel with All Features (No Transformation) Summary:")
print(model_all.summary())
Model with All Features (No Transformation) Summary:
OLS Regression Results
==============================================================================
Dep. Variable: quality R-squared: 0.348
Model: OLS Adj. R-squared: 0.342
Method: Least Squares F-statistic: 61.48
Date: Sat, 10 Aug 2024 Prob (F-statistic): 1.48e-109
Time: 19:13:48 Log-Likelihood: -1266.4
No. Observations: 1279 AIC: 2557.
Df Residuals: 1267 BIC: 2619.
Df Model: 11
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 14.3551 23.705 0.606 0.545 -32.150 60.860
fixed_acidity 0.0231 0.029 0.801 0.423 -0.033 0.080
volatile_acidity -1.0013 0.137 -7.283 0.000 -1.271 -0.732
citric_acid -0.1408 0.168 -0.839 0.402 -0.470 0.188
residual_sugar 0.0066 0.017 0.391 0.696 -0.026 0.039
chlorides -1.8065 0.457 -3.949 0.000 -2.704 -0.909
free_sulfur_dioxide 0.0056 0.002 2.252 0.025 0.001 0.011
total_sulfur_dioxide -0.0036 0.001 -4.419 0.000 -0.005 -0.002
density -10.3516 24.192 -0.428 0.669 -57.812 37.109
pH -0.3937 0.215 -1.830 0.067 -0.816 0.028
sulphates 0.8412 0.126 6.651 0.000 0.593 1.089
alcohol 0.2819 0.030 9.465 0.000 0.223 0.340
==============================================================================
Omnibus: 28.708 Durbin-Watson: 2.003
Prob(Omnibus): 0.000 Jarque-Bera (JB): 46.050
Skew: -0.192 Prob(JB): 1.00e-10
Kurtosis: 3.847 Cond. No. 1.12e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Insights from OLS Regression Results -> First Approach:
¶
Overview:
- R-squared (Training Dataset): 0.348 -> approximately 34.8% of the variance in wine quality is explained by the model, based on the predictors included.
- Adjusted R-squared: 0.342 -> this suggests a slight penalty for including predictors that do not significantly improve the model fit.
- Multicollinearity issues raised concerns about the reliability of the model.
Significant Predictors (coefficients and p-value < 0.05):
- Volatile Acidity: Higher volatile acidity is strongly associated with lower wine quality.
- Chlorides: Higher chloride levels are significantly associated with lower wine quality.
- Sulphates: Higher sulphates are significantly associated with higher wine quality.
- Alcohol: Higher alcohol content is strongly associated with higher wine quality.
- Free Sulfur Dioxide: Higher levels of free sulfur dioxide are associated with a slight increase in wine quality.
- Total Sulfur Dioxide: Higher levels of total sulfur dioxide are significantly associated with lower wine quality.
Other Predictors:
- Fixed Acidity, Citric Acid, Residual Sugar, Density, pH: These variables have higher p-values (> 0.05), indicating that they are not statistically significant in explaining wine quality in this model.
Model Testing - Second Approach: OLS with All Features -> Box-Cox Transformation¶
The Box-Cox transformation is applied to make the data more normally distributed, which can improve the linear relationship between the features and the target variable, leading to potentially better model performance.
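Before applying it to the wine features, the effect of the transformation can be illustrated on a synthetic right-skewed sample (the numbers below are illustrative, not from the wine data):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Lognormal draws: strictly positive and heavily right-skewed
x = rng.lognormal(mean=0.0, sigma=0.75, size=1000).reshape(-1, 1)

# Box-Cox estimates a power lambda by maximum likelihood; for lognormal
# data it should land near lambda = 0 (i.e. close to a log transform)
pt = PowerTransformer(method='box-cox', standardize=False)
x_bc = pt.fit_transform(x)

print(f"skew before: {skew(x.ravel()):.2f}, after: {skew(x_bc.ravel()):.2f}")
```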
boxcox_transformer = PowerTransformer(method='box-cox', standardize=False)
X_train_boxcox = boxcox_transformer.fit_transform(X_train + 1)  # Box-Cox requires strictly positive values, so shift by 1
X_holdout_boxcox = boxcox_transformer.transform(X_holdout + 1)
# Ensure index alignment
X_train_boxcox_df = pd.DataFrame(X_train_boxcox,
columns=X_train.columns,
index=X_train.index)
X_holdout_boxcox_df = pd.DataFrame(X_holdout_boxcox,
columns=X_holdout.columns,
index=X_holdout.index)
X_train_boxcox_df = sm.add_constant(X_train_boxcox_df)
X_holdout_boxcox_df = sm.add_constant(X_holdout_boxcox_df)
# Fit the model with Box-Cox transformed features
model_all_boxcox = sm.OLS(y_train, X_train_boxcox_df).fit()
print("\nModel with All Features (Box-Cox Transformed) Summary:")
print(model_all_boxcox.summary())
Model with All Features (Box-Cox Transformed) Summary:
OLS Regression Results
==============================================================================
Dep. Variable: quality R-squared: 0.358
Model: OLS Adj. R-squared: 0.352
Method: Least Squares F-statistic: 64.22
Date: Sat, 10 Aug 2024 Prob (F-statistic): 9.46e-114
Time: 19:13:49 Log-Likelihood: -1256.5
No. Observations: 1279 AIC: 2537.
Df Residuals: 1267 BIC: 2599.
Df Model: 11
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 7.313e+04 3e+04 2.435 0.015 1.42e+04 1.32e+05
fixed_acidity 4.9402 1.680 2.941 0.003 1.645 8.235
volatile_acidity -2.1957 0.325 -6.754 0.000 -2.833 -1.558
citric_acid -0.4502 0.219 -2.052 0.040 -0.881 -0.020
residual_sugar 3.0520 1.355 2.252 0.024 0.393 5.711
chlorides -12.2315 4.509 -2.713 0.007 -21.077 -3.386
free_sulfur_dioxide 0.0744 0.041 1.821 0.069 -0.006 0.155
total_sulfur_dioxide -0.1268 0.043 -2.916 0.004 -0.212 -0.041
density -1.047e+06 3.82e+05 -2.740 0.006 -1.8e+06 -2.98e+05
pH -1.2888 1.667 -0.773 0.440 -4.559 1.982
sulphates 20.2285 2.159 9.371 0.000 15.993 24.464
alcohol 2.788e+04 3707.756 7.519 0.000 2.06e+04 3.52e+04
==============================================================================
Omnibus: 31.407 Durbin-Watson: 2.020
Prob(Omnibus): 0.000 Jarque-Bera (JB): 51.559
Skew: -0.204 Prob(JB): 6.37e-12
Kurtosis: 3.895 Cond. No. 1.12e+08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.87e-12. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Insights from OLS Regression Results -> Second Approach:
¶
Overview:
- R-squared: The model explains 35.8% of the variance in wine quality, which indicates a moderate fit.
- Adjusted R-squared: 0.352 -> is slightly lower, accounting for the number of predictors in the model. This suggests that the model is capturing some, but not all, of the variability in wine quality.
- Multicollinearity issues raised concerns about the reliability of the model.
Significant Predictors (coefficients and p-value < 0.05):
- Fixed Acidity: Positive and significant, with a coefficient of 4.9402. An increase in fixed acidity is associated with a slight increase in wine quality, indicating that this factor positively influences the perceived quality of wine.
- Volatile Acidity: Negative and significant, with a coefficient of -2.1957. Higher levels of volatile acidity are associated with a decrease in wine quality, consistent with the understanding that volatile acidity can produce undesirable vinegar-like flavors.
- Citric Acid: Negative and significant, with a coefficient of -0.4502. Higher levels of citric acid, while often adding freshness, are linked to a slight reduction in wine quality in this dataset.
- Residual Sugar: Positive and significant, with a coefficient of 3.0520. Higher residual sugar content is associated with a higher quality rating, although the effect size is modest.
- Chlorides: Strongly negative and significant, with a coefficient of -12.2315. Higher chloride levels, which indicate salt content, are strongly associated with a lower quality rating, as excessive salt can negatively impact taste.
- Total Sulfur Dioxide: Negative and significant, with a coefficient of -0.1268. Higher levels of total sulfur dioxide are associated with lower wine quality, likely due to its impact on wine preservation, which can lead to unwanted chemical tastes if overused.
- Density: Strongly negative and significant, with a coefficient of -1.047e+06. This suggests that higher density, possibly indicative of higher sugar or alcohol content, has a strongly negative impact on wine quality in this context.
- Sulphates: Strongly positive and significant, with a coefficient of 20.2285. Higher levels of sulphates, which act as preservatives, are associated with higher wine quality.
- Alcohol: Strongly positive and significant, with a coefficient of 2.788e+04. Higher alcohol content is strongly associated with higher wine quality, aligning with the hypothesis that alcohol content improves perceived quality.
Model Testing - Third Approach: OLS with Reduced Features (Based on VIF)¶
# Drop the high-VIF features ('fixed_acidity', 'density') along with other correlated or weaker predictors to obtain a compact model
X_train_reduced = X_train.drop(columns=['fixed_acidity', 'density',
'citric_acid', 'pH',
'residual_sugar', 'free_sulfur_dioxide',
'total_sulfur_dioxide'])
X_holdout_reduced = X_holdout.drop(columns=['fixed_acidity', 'density',
'citric_acid', 'pH',
'residual_sugar', 'free_sulfur_dioxide',
'total_sulfur_dioxide'])
# Fit model with reduced features
X_train_reduced_const = sm.add_constant(X_train_reduced)
X_holdout_reduced_const = sm.add_constant(X_holdout_reduced)
model_reduced = sm.OLS(y_train, X_train_reduced_const).fit()
print("\nModel with Reduced Features Summary (Based on VIF):")
print(model_reduced.summary())
Model with Reduced Features Summary (Based on VIF):
OLS Regression Results
==============================================================================
Dep. Variable: quality R-squared: 0.329
Model: OLS Adj. R-squared: 0.327
Method: Least Squares F-statistic: 156.0
Date: Sat, 10 Aug 2024 Prob (F-statistic): 1.03e-108
Time: 19:13:49 Log-Likelihood: -1284.9
No. Observations: 1279 AIC: 2580.
Df Residuals: 1274 BIC: 2606.
Df Model: 4
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const 2.7454 0.226 12.151 0.000 2.302 3.189
volatile_acidity -1.1001 0.110 -9.971 0.000 -1.317 -0.884
chlorides -1.5769 0.426 -3.697 0.000 -2.414 -0.740
sulphates 0.8175 0.122 6.688 0.000 0.578 1.057
alcohol 0.2939 0.019 15.819 0.000 0.257 0.330
==============================================================================
Omnibus: 22.135 Durbin-Watson: 1.985
Prob(Omnibus): 0.000 Jarque-Bera (JB): 35.753
Skew: -0.132 Prob(JB): 1.72e-08
Kurtosis: 3.776 Cond. No. 247.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Insights from OLS Regression Results -> Third Approach: Reduced Features (Based on VIF):
¶
Overview:
- R-squared: The model explains 32.9% of the variance in wine quality, which is moderately strong.
- Adjusted R-squared: 0.327 -> The adjusted R-squared of 32.7% is very close to the R-squared value, suggesting that the model is well-balanced and that the included predictors are relevant without adding unnecessary complexity.
- It does not present Multicollinearity issues
Significant Predictors (coefficients and p-value < 0.05):
- Intercept (const): The intercept is significant and indicates the expected wine quality when all predictors are zero.
- Volatile Acidity: Negative and highly significant, with a coefficient of -1.1001. This indicates that higher levels of volatile acidity strongly decrease wine quality, which is consistent with its potential to produce off-flavors.
- Chlorides: Negative and significant, with a coefficient of -1.5769. Higher chloride levels, which indicate salt content, are associated with lower wine quality, reflecting the detrimental effect of excessive saltiness on taste.
- Sulphates: Positive and significant, with a coefficient of 0.8175. Higher levels of sulphates, which act as preservatives, are associated with higher wine quality, indicating that sulphates contribute positively to the preservation and taste of the wine.
- Alcohol: Strongly positive and significant, with a coefficient of 0.2939. This suggests that higher alcohol content is strongly associated with higher wine quality, aligning with the hypothesis that alcohol content improves perceived quality.
Confidence Intervals:
Intercept (const):
Coefficient: 2.7454
95% Confidence Interval: [2.302, 3.189]
Interpretation: The intercept (constant) is the expected wine quality score when all the predictor variables (volatile acidity, chlorides, sulphates, alcohol) are zero. The confidence interval does not include zero, indicating that the intercept is statistically significant. The relatively narrow interval suggests that the estimate is precise.
Volatile Acidity:
Coefficient: -1.1001
95% Confidence Interval: [-1.317, -0.884]
Interpretation: The negative coefficient indicates that an increase in volatile acidity is associated with a decrease in wine quality. The confidence interval does not include zero, which confirms that volatile acidity has a statistically significant negative effect on wine quality. The interval is reasonably narrow, suggesting a precise estimate of this effect.
Chlorides:
Coefficient: -1.5769
95% Confidence Interval: [-2.414, -0.740]
Interpretation: Chlorides also have a negative coefficient, meaning higher levels of chlorides are associated with lower wine quality. The confidence interval excludes zero, indicating statistical significance. However, the interval is somewhat wider compared to volatile acidity, suggesting more variability in the estimate, but it is still significant.
Sulphates:
Coefficient: 0.8175
95% Confidence Interval: [0.578, 1.057]
Interpretation: The positive coefficient suggests that an increase in sulphates is associated with an increase in wine quality. The confidence interval does not include zero, confirming statistical significance. The interval is fairly narrow, indicating a precise estimate of the effect.
Alcohol:
Coefficient: 0.2939
95% Confidence Interval: [0.257, 0.330]
Interpretation: Alcohol content has a positive effect on wine quality, as indicated by the positive coefficient. The confidence interval does not include zero, showing strong statistical significance. The interval is very narrow, which implies a highly precise estimate, making alcohol one of the most reliable predictors in the model.
Hypothesis Testing¶
# Hypothesis Testing
# H0: "Alcohol content has no impact on perceived wine quality."
# H1: "Alcohol content has a positive impact on perceived wine quality."
# Extract the coefficient and p-value for the 'alcohol' variable
alcohol_coef = model_reduced.params['alcohol']
alcohol_pvalue = model_reduced.pvalues['alcohol']
# Display the results
print(f"Alcohol Coefficient: {alcohol_coef}")
print(f"Alcohol p-value: {alcohol_pvalue}")
# Hypothesis testing decision
if alcohol_pvalue < 0.05:
print("Reject the null hypothesis: Alcohol content has a significant "
"positive impact on perceived wine quality.")
else:
print("Fail to reject the null hypothesis: Alcohol content does not have "
"a significant positive impact on perceived wine quality.")
Alcohol Coefficient: 0.2939209003966722
Alcohol p-value: 1.3474711519479874e-51
Reject the null hypothesis: Alcohol content has a significant positive impact on perceived wine quality.
Insights on Hypothesis Testing Results:
- Alcohol Coefficient: 0.2939 -> Each unit increase in alcohol content increases wine quality by approximately 0.294 units.
- Alcohol p-value: 1.35e-51 -> Extremely low, indicating strong evidence against the null hypothesis.
- Conclusion: There is strong statistical evidence that alcohol content has a significant positive impact on perceived wine quality.
Model Comparison and Evaluation¶
# Evaluate each model
mse_all, mae_all, r2_all = evaluate_model(
model_all, X_holdout_const, y_holdout)
mse_all_boxcox, mae_all_boxcox, r2_all_boxcox = evaluate_model(
model_all_boxcox, X_holdout_boxcox_df, y_holdout)
mse_reduced, mae_reduced, r2_reduced = evaluate_model(
model_reduced, X_holdout_reduced_const, y_holdout)
print("\nComparison of Model Performance:")
print(f"Model with All Features (No Transformation) - MSE: {mse_all}, "
f"MAE: {mae_all}, R2: {r2_all}")
print(f"Model with All Features (Box-Cox) - MSE: {mse_all_boxcox}, "
f"MAE: {mae_all_boxcox}, R2: {r2_all_boxcox}")
print(f"Model with Reduced Features (VIF Consideration) - MSE: {mse_reduced}, "
f"MAE: {mae_reduced}, R2: {r2_reduced}")
Comparison of Model Performance:
Model with All Features (No Transformation) - MSE: 0.39002514396397386, MAE: 0.5035304415524223, R2: 0.4031803412795929
Model with All Features (Box-Cox) - MSE: 0.38402287532804735, MAE: 0.49277876529880127, R2: 0.41236506173744314
Model with Reduced Features (VIF Consideration) - MSE: 0.39583576287046485, MAE: 0.5162232158682217, R2: 0.3942888848019904
Insights on Model Performance Comparison:
Best Performing Model: The model with all features after the Box-Cox transformation performs best, showing the lowest MSE and MAE and the highest R-squared on the hold-out set. This suggests that transforming the data helped the model better capture the relationships between the predictors and wine quality.
Trade-off with Simplicity: The reduced model, while simpler and easier to interpret, does not perform as well as the full model with or without transformation.
Conclusion: If the goal is to maximize predictive accuracy, the model with all features and Box-Cox transformation should be preferred. However, if interpretability and simplicity are more critical, the reduced model might still be a viable option, albeit with slightly lower predictive power.
QQ Plots for Residuals Analysis¶
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
sm.qqplot(model_all.resid, line='s', ax=axes[0])
axes[0].set_title('QQ Plot (All Features No Transformation)')
sm.qqplot(model_all_boxcox.resid, line='s', ax=axes[1])
axes[1].set_title('QQ Plot (All Features Box-Cox)')
sm.qqplot(model_reduced.resid, line='s', ax=axes[2])
axes[2].set_title('QQ Plot (Reduced Features Based on VIF)')
plt.tight_layout()
plt.show()
Insights from QQ Plots:
QQ Plot (All Features No Transformation):
The residuals in this plot show some deviation from the theoretical quantiles, especially at the tails.
This indicates that the residuals are not perfectly normally distributed, suggesting potential issues with the model's fit.
The slight S-shape suggests that the model may not capture the underlying data patterns perfectly, possibly due to the presence of outliers or the effect of untransformed skewed variables.
QQ Plot (All Features Box-Cox):
After applying the Box-Cox transformation, the residuals align more closely with the theoretical quantiles, especially in the middle range.
This indicates an improvement in normality compared to the model with no transformation.
The Box-Cox transformation appears to have helped stabilize variance and reduce skewness in the data, leading to a better model fit.
However, some deviations still exist at the extremes, though they are less pronounced than in the non-transformed model.
QQ Plot (Reduced Features Based on VIF):
The residuals for the model with reduced features (based on VIF) also show alignment with the theoretical quantiles but exhibit slight deviations at the tails.
The plot suggests that the reduced model, while simpler, still struggles slightly with capturing the data's full complexity. The deviations, particularly at the extremes, might be due to the exclusion of some influential features, leading to a less accurate model.
Summary:
Best Normality Fit: The model with all features and Box-Cox transformation demonstrates the best alignment with the normal distribution, suggesting that this model has the most normally distributed residuals and, therefore, the best fit among the three models.
Impact of Feature Reduction: The reduced model (based on VIF) shows some deviations in the tails, indicating that reducing the number of features may have slightly compromised the model's ability to fit the data perfectly.
However, this model still provides a reasonably good fit, with residuals mostly following the normal distribution.
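The visual QQ-plot assessment can be complemented with a formal normality test. The OLS summaries above already report Jarque-Bera statistics, and the same test can be run directly on any residual vector; a sketch on synthetic residuals (not the fitted models'):

```python
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(7)
normal_resid = rng.normal(size=1000)        # well-behaved residuals
skewed_resid = rng.exponential(size=1000)   # clearly non-normal residuals

# Jarque-Bera measures departure from normal skewness and kurtosis;
# a small p-value rejects normality
stat_n, p_n = jarque_bera(normal_resid)
stat_s, p_s = jarque_bera(skewed_resid)
print(f"normal: JB={stat_n:.1f}, p={p_n:.3f}; skewed: JB={stat_s:.1f}, p={p_s:.2e}")
```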
Final Conclusion¶
Model with All Features (No Transformation):
This model included all the available features without any transformations. It explained approximately 40.3% of the variance in wine quality on the hold-out set (R² = 0.4032) and showed moderate predictive performance. However, the model faced some issues with multicollinearity, particularly with variables like fixed acidity and density, which may have introduced noise into the coefficient estimates.
Model with All Features (Box-Cox Transformation):
By applying the Box-Cox transformation to all features, we aimed to stabilize variance and make the data more normally distributed. This model performed slightly better, explaining 41.2% of the hold-out variance in wine quality (R² = 0.4124), and demonstrated improved predictive accuracy with lower MSE and MAE values. The QQ plot showed that its residuals were more normally distributed than those of the untransformed model, indicating a better fit.
Model with Reduced Features (Based on VIF):
This model excluded features with high Variance Inflation Factor (VIF) to reduce multicollinearity. While this model was simpler and more interpretable, it explained slightly less variance (R² = 0.3943) and had higher error metrics compared to the other two models. The reduction in the number of features likely led to the exclusion of some important information, resulting in a decrease in predictive power.
Suggestions for Improvement¶
Outlier Analysis:
Outliers can have a significant impact on the performance of regression models by skewing the results and affecting the assumptions of normality.
Handling Multicollinearity:
Multicollinearity can inflate the variance of coefficient estimates and make the model unstable.
In-Depth Feature Analysis:
I believe I could have explored more features, such as interactions between variables, polynomial features, or other transformations, to capture more complex relationships in the data.