Assignment
Please follow the instructions for each question exactly and write short answers to the following questions in the answer sheet file on Wattle. You do not need to copy the questions into the answer sheet. Please submit only your finished answer sheet and do not paste any unrelated results. The data used in this assignment can be found on Wattle.
The significance level for all the questions is set to be 0.05.
The file “q1data.csv” contains a dataset from the March 1988 U.S. Current Population Survey. The dataset contains weekly wages in 1987 (dollars) for a sample of 25231 males between the ages of 18 and 70 who worked full-time, along with their years of education (educ), years of experience (exper), an indicator variable for whether they were black (black_ind: 1 if they were, otherwise 0), an indicator variable for whether they worked in a standard metropolitan area, i.e., in or near a city (smsa_ind: 1 if they did, otherwise 0), and a code for the region in the U.S. where they worked (region: northeast, midwest, south, and west).
Please use R to answer Questions 1 – 2 in the answer sheet.
Question 1 (Multiple Linear Regression for Continuous and Categorical Explanatory Variables, 6.5 points)
Consider regressing the logarithm of “wage” on the explanatory variable “educ” and the indicator variables of “region” only, where “region” is a categorical variable. Note that we are interested in whether the wage in each of the northeast, midwest and west regions is significantly different from that in the south region. Please choose the indicator variables so that this comparison can be read directly from the R output after the model fitting.
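As a rough illustration of how the reference category of a factor can be chosen in R, the sketch below uses “relevel” on a small made-up data frame. The data frame “toy” and its values are hypothetical and only mimic the column names “wage”, “educ” and “region”; the real analysis would read “q1data.csv” instead.

```r
# Hypothetical toy data, for illustration only (not the q1data.csv values).
set.seed(1)
toy <- data.frame(
  wage   = exp(rnorm(40, mean = 6, sd = 0.5)),
  educ   = sample(8:18, 40, replace = TRUE),
  region = rep(c("northeast", "midwest", "south", "west"), each = 10)
)

# Make "south" the reference level, so that every region coefficient
# is a contrast against south.
toy$region <- relevel(factor(toy$region), ref = "south")

# Additive model: log wage on educ and the region indicators.
fit <- lm(log(wage) ~ educ + region, data = toy)

# Indicators appear for midwest, northeast and west, but not for south.
names(coef(fit))
```

With south as the reference level, each region row of the “summary” output compares one region with south at a fixed value of “educ”.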
Please answer the following questions in the answer sheet.
a) (1 point) Please use R to fit the model that satisfies all the above requirements. Please do not consider the interaction terms for now. Based on the “summary” function output of this fitted model, what are the null hypothesis and the alternative hypothesis for the “F-statistic” in the “summary” function output? What conclusion can you draw from this F-test?
b) (1 point) Based on the “summary” function output of this fitted model, is the wage in each of the northeast, midwest and west regions significantly different from that in the south region, after “educ” is accounted for? Why or why not?
c) (1 point) If we are interested in testing whether at least one of the northeast, midwest and west regions has a different wage level from the south region, with “educ” held constant, please construct the null and alternative hypotheses. What are the test statistic and the corresponding p-value (rounded to three decimal places)? What conclusion can you draw from the result?
d) (1 point) Consider the model and the variables in a). Now add all the interaction terms between “educ” and “region” to obtain a new model. Compute and show the sum of squared errors (SSE) for the fitted model in this question and for the fitted model in a), respectively. Which one is smaller?
e) (1 point) Consider the model with interactions in d). What are the interpretations of the estimated coefficients of the interaction terms between “educ” and “region”? Are the interactions between “educ” and “region” significant? Why or why not?
f) (1 point) Consider the model with interactions in d). What are the 90% confidence intervals for the coefficients of the interaction terms between “educ” and “region”? Please round your answer to four decimal places. Please also interpret these confidence intervals for this real data analysis.
g) (0.5 points) Please paste the R codes for all the above analyses of Question 1 in the answer sheet.
Question 2 (Model Diagnostics, 5.5 points)
Consider the multiple linear regression model in Question 1 d). Please answer the following questions in the answer sheet.
a) (1 point) Please paste the residuals versus fitted values plot of the fitted model in Question 1 d) in the answer sheet. Are the assumptions in the multiple linear regression model violated based on this plot?
b) (1 point) Please paste the Q-Q plot of the residuals based on the fitted model in Question 1 d) in the answer sheet. What conclusions can you draw from the Q-Q plot?
c) (1 point) Please paste the Cook’s distance plot of the fitted model in Question 1 d) in the answer sheet. Based on the criterion introduced in lectures, are there any influential observations? Why or why not?
d) (1 point) Please find the observation with the largest Cook’s distance. (Hint: use the “which” function in R.) Based on the “rule of thumb” cut-offs for the studentized residual, is this observation an outlier? How would you deal with this suspected influential observation?
e) (1 point) We have found the observation with the largest Cook’s distance in d). Based on the “rule of thumb” cut-off for the leverage, does this observation have distant explanatory variable values? Why or why not?
f) (0.5 points) Please paste the R codes for all the above analyses of Question 2 in the answer sheet.
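The diagnostic quantities named in Question 2 can be extracted from any fitted “lm” object. The sketch below shows the relevant base R functions on hypothetical toy data with one deliberately unusual point. The cut-offs in the comments (absolute studentized residual above 2, leverage above 2p/n) are common textbook conventions and are assumptions here; the criteria introduced in lectures take precedence, as does the lecture definition of the studentized residual (“rstudent” gives the externally studentized version).

```r
# Hypothetical toy data: 29 ordinary points plus one distant, shifted point.
set.seed(3)
toy <- data.frame(x = c(rnorm(29), 6))
toy$y <- 1 + 2 * toy$x + rnorm(30)
toy$y[30] <- toy$y[30] + 15

fit <- lm(y ~ x, data = toy)

# Residuals vs fitted, Q-Q and Cook's distance plots would come from:
# plot(fit, which = 1); plot(fit, which = 2); plot(fit, which = 4)

cd  <- cooks.distance(fit)
idx <- unname(which(cd == max(cd)))   # observation with the largest Cook's distance

r_idx <- rstudent(fit)[idx]           # externally studentized residual
h_idx <- hatvalues(fit)[idx]          # leverage

p <- length(coef(fit))                # number of regression coefficients
n <- nrow(toy)

abs(r_idx) > 2                        # assumed outlier rule of thumb
h_idx > 2 * p / n                     # assumed leverage rule of thumb
```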
Question 3 (Simulation for Multiple Linear Regression, 3.0 points)
Consider the multiple linear regression model µ{Y |X1, X2} = β0 + β1X1 + β2X2 for the observations {(Yi, X1,i, X2,i) : i = 1, · · · , (n + 1)}, and the least squares estimates βˆ0, βˆ1 and βˆ2 based on the data {(Yi, X1,i, X2,i) : i = 1, · · · , n} for the coefficients β0, β1 and β2 can be obtained.
Lily wants to use R to generate random samples based on the multiple linear regression model assumptions. She follows the steps below.
Step 1: Specify β0 = 2, β1 = −1 and β2 = 1.
Step 2: Suppose the observations X1,1, · · · , X1,n+1 are 1, 2, · · · , 101, so n = 100.
Step 3: Generate X2,1, · · · , X2,n+1 from the t3 distribution. (Hint: use the R function “rt”.)
Step 4: Generate E1, · · · , En+1 from the normal distribution with mean 0 and variance 4 [N (0, 4)].
Step 5: Generate Yi = µ{Yi|X1,i, X2,i} + Ei, i = 1, · · · , (n + 1).
Step 6: Repeat Step 4 – Step 5 1,000 times and obtain 1,000 different datasets of
{(Yi, X1,i, X2,i) : i = 1, · · · , (n + 1)}.
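One possible way to code Steps 1 – 6 is sketched below. The seed and the object names (“x1”, “x2”, “mu”, “datasets”) are assumptions made for illustration. Note that “rnorm” is parameterised by the standard deviation, so variance 4 corresponds to “sd = 2”.

```r
set.seed(2024)                        # assumed seed, for reproducibility only

beta0 <- 2; beta1 <- -1; beta2 <- 1   # Step 1
x1 <- 1:101                           # Step 2: X1 values, so n = 100
n  <- length(x1) - 1
x2 <- rt(n + 1, df = 3)               # Step 3: t distribution with 3 df

# Mean of the response at the fixed explanatory variable values.
mu <- beta0 + beta1 * x1 + beta2 * x2

# Step 6: repeat Steps 4 and 5 1,000 times; x1 and x2 stay fixed.
n_rep <- 1000
datasets <- lapply(seq_len(n_rep), function(r) {
  e <- rnorm(n + 1, mean = 0, sd = 2)          # Step 4: N(0, 4) errors
  data.frame(y = mu + e, x1 = x1, x2 = x2)     # Step 5
})
```

Each element of “datasets” holds all n + 1 rows; the first n rows are what Lily hands over, and row n + 1 is the observation she keeps.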
Part 1.
(1.5 points) Lei Li is a friend of Lily. Lily hands over the above 1,000 datasets of {(Yi, X1,i, X2,i) : i = 1, · · · , n} to him but she keeps the observation (Yn+1, X1,n+1, X2,n+1) for each dataset only for herself. She also does not tell him the true values of β0, β1 and β2. Based on each dataset of {(Yi, X1,i, X2,i) : i = 1, · · · , n}, Lei Li computes the least squares estimates βˆ0, βˆ1 and βˆ2 as well as the 95% confidence interval for the mean of response given X1 = 2 and X2 = 1. Ultimately, he obtains 1,000 different confidence intervals.
Then Lily computes the mean of response µ{Y |X1 = 2, X2 = 1} and tells Lei Li this information. Lei Li counts the number of the above 1,000 confidence intervals that cover µ{Y |X1 = 2, X2 = 1}.
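The counting procedure described above could be sketched as follows, using “predict” with “interval = "confidence"”. The seed and the object names are assumptions, and the sketch regenerates the data itself so that it runs on its own. The count will vary with the seed.

```r
set.seed(1)                                   # assumed seed
x1 <- 1:101
x2 <- rt(101, df = 3)
mu <- 2 - x1 + x2                             # beta0 = 2, beta1 = -1, beta2 = 1

target <- 2 + (-1) * 2 + 1 * 1                # true mean at X1 = 2, X2 = 1
new_pt <- data.frame(x1 = 2, x2 = 1)

covered <- replicate(1000, {
  d   <- data.frame(y = mu + rnorm(101, sd = 2), x1 = x1, x2 = x2)
  fit <- lm(y ~ x1 + x2, data = d[1:100, ])   # Lei Li only sees i = 1, ..., n
  ci  <- predict(fit, newdata = new_pt,
                 interval = "confidence", level = 0.95)
  ci[, "lwr"] <= target && target <= ci[, "upr"]
})

sum(covered)                                  # number of intervals covering the mean
```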
Please answer the following questions in the answer sheet.
a) (0.5 points) Suppose you play both roles of Lily and Lei Li and realise the above steps in R. Please paste the complete R codes for all the above procedures in the answer sheet.
b) (0.5 points) What is the number of the confidence intervals that cover µ{Y |X1 = 2, X2 = 1} based on the output after running your R codes? Please answer this question in the answer sheet.
c) (0.5 points) Please use the result of b) to interpret the 95% confidence interval for the mean of response in the simulation. Please answer this question in the answer sheet.
Part 2.
(1.5 points) James is another friend of Lily. Lily hands over the above 1,000 datasets of {(Yi, X1,i, X2,i) : i = 1, · · · , n} and (X1,n+1, X2,n+1) to him but she keeps the observation of response Yn+1 for each dataset only for herself. She also does not tell him the true values of β0, β1 and β2. Based on each dataset of {(Yi, X1,i, X2,i) : i = 1, · · · , n}, James computes the least squares estimates βˆ0, βˆ1 and βˆ2. Using those estimates and (X1,n+1, X2,n+1), he also calculates the 95% prediction interval of the response Yn+1. Ultimately, he obtains one prediction interval of the response Yn+1 for each dataset, and 1,000 different prediction intervals in total.
Then Lily tells James the values of Yn+1 for 1,000 datasets. For each dataset, James counts “1” if the prediction interval covers the corresponding Yn+1; “0”, otherwise. Since there are 1,000 datasets, James can count the total number of “1”s in the above procedure.
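James's procedure differs from the confidence interval case only in the interval type and in the target, which is now the held-back response Yn+1 rather than a fixed mean. A sketch under the same assumptions (assumed seed, hypothetical object names, data regenerated so the block runs on its own):

```r
set.seed(1)                                   # assumed seed
x1 <- 1:101
x2 <- rt(101, df = 3)
mu <- 2 - x1 + x2                             # beta0 = 2, beta1 = -1, beta2 = 1

hits <- replicate(1000, {
  d    <- data.frame(y = mu + rnorm(101, sd = 2), x1 = x1, x2 = x2)
  fit  <- lm(y ~ x1 + x2, data = d[1:100, ])  # James only sees Y for i = 1, ..., n
  pint <- predict(fit, newdata = d[101, c("x1", "x2")],
                  interval = "prediction", level = 0.95)
  # Count 1 if the interval covers the held-back Y_{n+1}, 0 otherwise.
  as.integer(pint[, "lwr"] <= d$y[101] && d$y[101] <= pint[, "upr"])
})

sum(hits)                                     # total number of "1"s
```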
Please answer the following questions in the answer sheet.
a) (0.5 points) Suppose you play both roles of Lily and James and realise the above steps in R. Please paste the complete R codes for all the above procedures in the answer sheet.
b) (0.5 points) What is the total number of “1”s based on the output after running your R codes? Please answer this question in the answer sheet.
c) (0.5 points) Please use the result of b) to interpret the 95% prediction interval for Yn+1 in the simulation. Please answer this question in the answer sheet.