A Case Study: Red Wine Quality Data
Pick a dataset of your interest
The red wine quality data is used for the case study. The dataset is available from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Wine+Quality. The data set contains physicochemical variables (the inputs) and a sensory variable (the output).
The inputs come from objective tests. The available variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.
The output, or the response variable of interest, is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent).
Pose the initial questions (3 to 5) that you would like to answer
RQ1: How are the results of the objective tests and the wine quality distributed?
RQ2: How does quality change with alcohol and total sulfur dioxide?
RQ3: Given different pH levels, does the wine quality differ?
Assess the fitness of the data
Can we answer all those questions by using only the red wine quality data? The answer is yes. This can be an empirical, data-driven study. Based on the empirical distributions examined in RQ1, the continuous variables can be divided into factor variables or dummy variables, which gives a better understanding of the whole data set and of wine quality.
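As a minimal sketch of that idea in R, a continuous input can be binned with `cut()` and expanded into a dummy column with `model.matrix()`. The ten alcohol values are the leading values printed for this data set. The median split and the labels `low` and `high` are assumptions made for illustration, not the cut points used in the study.

```r
# First ten alcohol readings, copied from the str() listing of the data set
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Assumed rule: split at the sample median into two levels
alcohol_level <- cut(
  alcohol,
  breaks = c(-Inf, median(alcohol), Inf),
  labels = c("low", "high")
)

# Factor view: number of wines per level
level_counts <- table(alcohol_level)
print(level_counts)

# Dummy-variable view: one 0/1 column marking the "high" level
dummies <- model.matrix(~ alcohol_level)
print(dummies[, "alcohol_levelhigh"])
```

Other cut points (quartiles, or thresholds read off the histograms) would work the same way by changing `breaks` and `labels`.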
Answer the initial questions by visualizing the dataset using R
## 'data.frame': 1599 obs. of 12 variables:
##  $ fixed.acidity       : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 …
##  $ volatile.acidity    : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 …
##  $ citric.acid         : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 …
##  $ residual.sugar      : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 …
##  $ chlorides           : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 …
##  $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 …
##  $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 …
##  $ density             : num 0.998 0.997 0.997 0.998 0.998 …
##  $ pH                  : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 …
##  $ sulphates           : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 …
##  $ alcohol             : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 …
##  $ quality             : int 5 5 5 6 5 5 5 7 7 5 …
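A listing like the one above is what `str()` prints for a data frame. The sketch below shows one way it could be produced, assuming the red wine file is a semicolon-separated CSV with a header row. The helper name `read_wine` is hypothetical, and the tiny temporary file (three rows and four columns taken from the listing) only stands in for the real download so that the example runs on its own.

```r
# Hypothetical helper: read a semicolon-separated wine quality file
read_wine <- function(path) {
  wine <- read.csv(path, sep = ";", header = TRUE)
  # The response column must be present for the later questions
  stopifnot("quality" %in% names(wine))
  wine
}

# Stand-in file: first three rows of four columns from the listing above
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"fixed acidity";"volatile acidity";"alcohol";"quality"',
  "7.4;0.7;9.4;5",
  "7.8;0.88;9.8;5",
  "7.8;0.76;9.8;5"
), demo_path)

wine_demo <- read_wine(demo_path)

# read.csv turns spaces in column names into dots, e.g. fixed.acidity
str(wine_demo)
```

With the real file, `read_wine()` would be pointed at the downloaded CSV instead of `demo_path`.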
To answer RQ1: How are the results of the objective tests and the wine quality distributed?
One might first plot a histogram for each variable.
The histograms above show the range and overall distribution of each variable.
Approximately, for alcohol, the range is from 8 to 14; for chlorides, from 0 to 0.6; for citric acid, from 0 to 1; for density, from 0.9901 to 1.0037; for fixed acidity, from 4 to 16; for free sulfur dioxide, from 0 to 72; for pH, from 2 to 4; for quality, from 3 to 8; for residual sugar, from 0.9 to 15.5; for sulphates, from 0.33 to 2; for total sulfur dioxide, from 6 to 289; and for volatile acidity, from 0.12 to 1.58.
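The ranges and histograms can be obtained with a few lines of base R. The sketch below is illustrative only: it uses the first ten rows of three columns from the `str()` listing, so its numbers are narrower than the full-data ranges quoted above. The helper name `plot_histograms` is hypothetical.

```r
# First ten rows of three columns, copied from the str() listing
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Minimum and maximum of every column, one column of the result per variable
ranges <- sapply(wine_head, range)
rownames(ranges) <- c("min", "max")
print(ranges)

# Hypothetical helper: one histogram panel per variable
plot_histograms <- function(data) {
  old_par <- par(mfrow = c(1, ncol(data)))
  on.exit(par(old_par))
  for (v in names(data)) hist(data[[v]], main = v, xlab = v)
}

# Draw only in an interactive session
if (interactive()) plot_histograms(wine_head)
```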
From the histograms, residual sugar, free sulfur dioxide and total sulfur dioxide have a large range, and their distributions are strongly positively skewed. Therefore, one might apply a logarithmic transformation to these variables.
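The effect of such a transformation can be checked numerically. The sketch below compares the skewness of total sulfur dioxide before and after taking the natural logarithm. The `skewness` function is a hand-written moment-based helper (not from a package), and the ten values are only the leading rows of the data set, so this illustrates the mechanics rather than reproducing the full-data result.

```r
# First ten total sulfur dioxide readings from the str() listing
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Hypothetical helper: moment-based sample skewness (third standardized moment)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Natural log transformation; valid because all readings are positive
log_so2 <- log(total_so2)

skew_compare <- c(raw = skewness(total_so2), logged = skewness(log_so2))
print(round(skew_compare, 3))
```

A skewness closer to zero after the transformation suggests a more symmetric distribution. If a variable could contain zeros, `log1p()` would be a safer choice than `log()`.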
After transforming these variables, the overall distributions are more concentrated, as the plots above show.
The densities of all input variables are approximately bell-shaped. This is good news for the further questions, because the distributional assumptions of the statistical analysis techniques are less likely to be violated, which makes the final results more trustworthy.
For the output variable, the pie chart of the wine quality is shown above. Most of the wines have quality scores of 5 or 6. Only a small number of wines are extremely poor or extremely good.
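The counts behind such a pie chart come from a frequency table. The sketch below tabulates the first ten quality scores from the `str()` listing; the shares it prints describe those ten rows only, not the full 1599 wines.

```r
# First ten quality scores from the str() listing
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Number of wines per score, and the corresponding shares
quality_counts <- table(quality)
quality_share <- prop.table(quality_counts)
print(quality_counts)
print(round(quality_share, 2))

# Draw the pie chart only in an interactive session
if (interactive()) pie(quality_counts, main = "Wine quality")
```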
The boxplots for all variables are plotted above. The variables fixed.acidity and residual.sugar have relatively large numbers of outliers.
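The points a boxplot marks as outliers can also be listed directly with `boxplot.stats()`, which uses the same rule as `boxplot()` (by default, points more than 1.5 times the hinge spread beyond the hinges). The sketch below applies it to the first ten residual sugar readings from the `str()` listing, purely as an illustration.

```r
# First ten residual sugar readings from the str() listing
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Five-number summary plus the points beyond the whiskers
bp <- boxplot.stats(residual_sugar)
outliers <- bp$out
print(bp$stats)
print(outliers)

# Draw the boxplot only in an interactive session
if (interactive()) boxplot(residual_sugar, main = "residual.sugar")
```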
RQ2: How do qualities change according to different alcohol and total sulfur dioxide?
In the scatter plots with a simple linear regression fit, a higher alcohol level or a higher total sulfur dioxide level is associated with a higher wine quality.
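A simple linear regression of this kind can be fitted with `lm()`. The sketch below regresses quality on one predictor at a time and reports the slope. The helper name `quality_slope` is hypothetical, and the data are only the first ten rows from the `str()` listing, so the slopes it prints say nothing reliable about the full data set.

```r
# First ten rows of the relevant columns, copied from the str() listing
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical helper: slope of quality regressed on a single predictor
quality_slope <- function(predictor, data) {
  fit <- lm(reformulate(predictor, response = "quality"), data = data)
  unname(coef(fit)[2])
}

slopes <- sapply(c("alcohol", "total.sulfur.dioxide"),
                 quality_slope, data = wine_head)
print(slopes)

# Scatter plot with the fitted line, drawn only in an interactive session
if (interactive()) {
  plot(quality ~ alcohol, data = wine_head)
  abline(lm(quality ~ alcohol, data = wine_head))
}
```

The sign of each slope gives the direction of the fitted relationship; on the full data it should be read together with the scatter plot.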
RQ3: Given different pH levels, does the wine quality differ?
The above plots show that the quality of the wine is not affected by whether the pH level is low or high, as the distributions remain the same.
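One way to make this comparison concrete is to group the wines by pH level and tabulate quality within each group. The text does not say how the pH levels were defined, so the median split below is an assumption. The data are the first ten rows from the `str()` listing and serve only to show the mechanics.

```r
# First ten pH and quality values from the str() listing
wine_head <- data.frame(
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Assumed grouping: "low" at or below the median pH, "high" above it
wine_head$pH_level <- ifelse(wine_head$pH <= median(wine_head$pH), "low", "high")

# Quality counts per pH level, then row-wise shares for comparison
quality_by_pH <- table(wine_head$pH_level, wine_head$quality)
print(quality_by_pH)
print(prop.table(quality_by_pH, margin = 1))

# Side-by-side boxplots, drawn only in an interactive session
if (interactive()) boxplot(quality ~ pH_level, data = wine_head)
```

If the row-wise shares look alike across the two levels, that supports the reading that quality does not depend on pH.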
According to the above tree model and its rules, about 60% of the wine qualities can be predicted correctly. The rules start from the alcohol level. For low-alcohol wine, the tree next splits on the sulphates: on one branch the quality is 1; on the other, if the total sulfur dioxide is large, the quality is 4; otherwise, the quality is 2 if the volatile acidity is low and 3 if the volatile acidity is high. For high-alcohol wine, if the sulphates are low and the volatile acidity is low, the quality is 6; if the sulphates are low and the volatile acidity is high, the quality is 4; if the sulphates are high and the volatile acidity is low, the quality is 7; if the sulphates are high and the volatile acidity is high, the quality is 8.
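The text does not name the software used for the tree, so the following is only a sketch under the assumption that a classification tree is fitted with the `rpart` package. It uses the first ten rows from the `str()` listing with a very small `minsplit` so that the tree can split at all. The helper name `tree_accuracy` is hypothetical, and the accuracy it reports is computed on the training rows, which is optimistic compared with a held-out estimate.

```r
# First ten rows of the variables that appear in the rules above
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# Hypothetical helper: share of wines whose predicted class matches the observed one
tree_accuracy <- function(fit, data) {
  predicted <- predict(fit, newdata = data, type = "class")
  mean(as.character(predicted) == as.character(data$quality))
}

# Fit only if rpart is installed (assumption: rpart is the tree package)
if (requireNamespace("rpart", quietly = TRUE)) {
  fit <- rpart::rpart(
    quality ~ ., data = wine_head, method = "class",
    control = rpart::rpart.control(minsplit = 2, xval = 0)
  )
  print(fit)  # the printed splits are the if/else rules of the tree
  accuracy <- tree_accuracy(fit, wine_head)
  print(accuracy)
}
```

Each printed split corresponds to one "if ... otherwise ..." step in the rules, and the share of correct predictions is the quantity the 60% figure refers to.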