Stat 5330: Assignment 2
This assignment is due at 9pm on Tuesday, Oct 23rd, 2018.
A paper copy of the well-written report should be submitted to my mailbox (Xiwei Tang), which can be found on the first floor of Halsey Hall, in the corner where you make a left turn after entering the front door. An electronic copy of your code should be submitted on Collab. In the paper report, the results should be clearly stated with reasonable explanations (including any necessary plots). Please also copy your code at the end of the paper report. Please DO NOT just copy and paste the raw output obtained from your software. Also, please write both your full name and your computing ID on the first page.
1. Analyze the email spam dataset using different classification methods.
The dataset consists of two parts: a training dataset with 3065 obs and 58 variables, and a testing dataset with 1536 obs and 58 variables. In each dataset, the first 57 columns store the predictors, and the last column stores the binary response variable (spam=1, not spam=0). The datasets are in the attached .txt files. You might use the following code to read the data into R.
setwd("path of the folder where you put the data files")
train=read.table("traindata.txt")
test=read.table("testdata.txt")
(a) Perform LDA and QDA with the predictors V55-V57 (columns 55-57), and apply your trained models to the testing set. Report the classification accuracy (num of obs that are correctly classified / total num of obs in the testing set), the sensitivity rate (num of spam obs that are correctly classified as spam in the testing set / total num of spam obs in the testing set), and the specificity rate (num of non-spam obs that are correctly classified as non-spam in the testing set / total num of non-spam obs in the testing set).
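The three rates defined in part (a) can be computed directly from the true and predicted labels. The sketch below is only an illustration under stated assumptions: it uses simulated data in place of the real files, the helper name classification_metrics is hypothetical, and the column names V55-V58 assume the default naming that read.table gives to a headerless file (so the response would be V58). lda() and qda() come from the MASS package.

```r
library(MASS)  # provides lda() and qda()

# Hypothetical helper: accuracy, sensitivity and specificity from 0/1 labels.
classification_metrics <- function(truth, pred) {
  truth <- as.integer(as.character(truth))
  pred  <- as.integer(as.character(pred))
  c(accuracy    = mean(pred == truth),
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

# Simulated stand-in for the real data (assumption): three predictors
# and a 0/1 response whose classes differ in mean and variance.
set.seed(1)
simulate <- function(n) {
  y <- rbinom(n, 1, 0.4)
  data.frame(V55 = rnorm(n, mean = y),
             V56 = rnorm(n, mean = 2 * y, sd = 1 + y),
             V57 = rnorm(n),
             V58 = y)
}
sim_train <- simulate(500)
sim_test  <- simulate(300)

# Fit both discriminant models on the training part.
lda_fit <- lda(V58 ~ V55 + V56 + V57, data = sim_train)
qda_fit <- qda(V58 ~ V55 + V56 + V57, data = sim_train)

# Predicted classes on the held-out part.
lda_pred <- predict(lda_fit, newdata = sim_test)$class
qda_pred <- predict(qda_fit, newdata = sim_test)$class

lda_metrics <- classification_metrics(sim_test$V58, lda_pred)
qda_metrics <- classification_metrics(sim_test$V58, qda_pred)
print(round(rbind(LDA = lda_metrics, QDA = qda_metrics), 3))
```

With the real data, sim_train and sim_test would be replaced by the train and test data frames; the numbers printed here describe the simulated data only.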
(b) Perform LDA and QDA with all predictors V1-V57 (columns 1-57), and apply your trained models to the testing set. Report the corresponding classification accuracy, sensitivity rate, and specificity rate, respectively. Compare your results with those obtained in part (a).
(c) Fit a logistic regression model and an SVM model with all predictors V1-V57 (columns 1-57), and apply your trained models to the testing set. Please report the corresponding classification accuracies, sensitivity rates, and specificity rates, respectively. Compare your results with those obtained in part (b). Which method provides the "best" classification results? Why?
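For the logistic regression half of part (c), a minimal sketch with base R's glm() might look like the following. It runs on simulated data rather than the real files, and the variable names x1, x2, y, the helper classification_metrics, and the 0.5 cutoff on the predicted probability are all assumptions made for illustration. The SVM half is not shown because it needs an add-on package; svm() from e1071 is one commonly used option.

```r
# Hypothetical helper: accuracy, sensitivity and specificity from 0/1 labels.
classification_metrics <- function(truth, pred) {
  c(accuracy    = mean(pred == truth),
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

# Simulated stand-in for the real data (assumption).
set.seed(2)
simulate <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  p  <- plogis(-0.5 + 1.5 * x1 - 1 * x2)  # true spam probability
  data.frame(x1 = x1, x2 = x2, y = rbinom(n, 1, p))
}
sim_train <- simulate(500)
sim_test  <- simulate(300)

# Logistic regression on all predictors in the data frame.
logit_fit <- glm(y ~ ., data = sim_train, family = binomial)

# Predicted probabilities on the held-out part, cut at 0.5 (assumed cutoff).
prob_hat   <- predict(logit_fit, newdata = sim_test, type = "response")
logit_pred <- as.integer(prob_hat > 0.5)

logit_metrics <- classification_metrics(sim_test$y, logit_pred)
print(round(logit_metrics, 3))
```

Moving the cutoff away from 0.5 trades sensitivity against specificity, which is worth keeping in mind when comparing methods.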
(d) Suppose we just toss a coin to make a prediction on the testing set; that is, for each obs in the testing set, we randomly toss a coin, and assign the label 1 to this obs if we get a "head" and 0 if we get a "tail". Calculate the estimated classification accuracy, sensitivity rate, and specificity rate. You might generate the predictions on the testing set by using the following code to generate random binary (Bernoulli) outcomes:
# n is the sample size of the testing set, and let prob=0.5
y.test.prediction=rbinom(n,1,prob)
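To see how the random predictions feed into the three rates, here is a small sketch. The vector y.test below holds simulated placeholder labels (an assumption, not the real testing labels), so the printed values only illustrate the calculation. Because the coin toss is independent of the true label, the sensitivity should be close to prob and the specificity close to 1 - prob.

```r
set.seed(3)

# Placeholder 0/1 labels standing in for the real testing response (assumption).
y.test <- rbinom(1536, 1, 0.4)

n    <- length(y.test)   # sample size of the testing set
prob <- 0.5              # probability of predicting 1 ("head")
y.test.prediction <- rbinom(n, 1, prob)

# The three rates, following the definitions in part (a).
accuracy    <- mean(y.test.prediction == y.test)
sensitivity <- sum(y.test.prediction == 1 & y.test == 1) / sum(y.test == 1)
specificity <- sum(y.test.prediction == 0 & y.test == 0) / sum(y.test == 0)

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```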
(e) Following the procedure in part (d), we try a sequence of values of "prob": 0, 0.2, 0.4, 0.5, 0.6, 0.8, 1. Please plot three figures showing the value of "prob" (x-axis) vs. the prediction accuracy, sensitivity, and specificity, respectively. Based on this exploration, can you infer the best "prob" value for this procedure? Why?
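One possible way to organize the sweep in part (e) is sketched below. It reads the sequence of "prob" values as 0, 0.2, 0.4, 0.5, 0.6, 0.8, 1 and again uses simulated placeholder labels in y.test (an assumption), so the resulting curves illustrate the mechanics rather than the answer for the real data.

```r
set.seed(4)

# Placeholder 0/1 labels standing in for the real testing response (assumption).
y.test <- rbinom(1536, 1, 0.4)
n      <- length(y.test)

probs <- c(0, 0.2, 0.4, 0.5, 0.6, 0.8, 1)

# One row per value of prob, one column per rate.
results <- t(sapply(probs, function(prob) {
  pred <- rbinom(n, 1, prob)
  c(accuracy    = mean(pred == y.test),
    sensitivity = sum(pred == 1 & y.test == 1) / sum(y.test == 1),
    specificity = sum(pred == 0 & y.test == 0) / sum(y.test == 0))
}))

# Three panels: prob on the x-axis against each rate.
op <- par(mfrow = c(1, 3))
for (m in colnames(results)) {
  plot(probs, results[, m], type = "b", ylim = c(0, 1),
       xlab = "prob", ylab = m, main = m)
}
par(op)

print(round(cbind(prob = probs, results), 3))
```

Since each run is random, repeating the sweep with several seeds and averaging would give smoother curves.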
