ISE 529 Predictive Analytics Exam 2
1. Visit the Titanic page (https://www.kaggle.com/c/titanic) on the Kaggle site. Read the Description and Evaluation items, then use the Data tab to download the csv files. Read the Overview.
The objective is to predict whether a passenger survived, based on the feature data.
Visit the Titanic Data Science Solutions page:
https://www.kaggle.com/startupsci/titanic-data-science-solutions
Spend some time reading and running the Jupyter Notebook provided on that page.
a) (10 pts.)
Use the train set to answer true or false for each of the following statements:
- More than 75% of passengers did not travel with parents or children
- 30 to 33% of passengers had siblings and/or spouse aboard
- Less than 1% of passengers paid a fare as high as 500 dollars
- Less than 1% of passengers are 65+ years old
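One way to check the four statements is to compute each share directly with pandas. The sketch below is only an illustration: the helper name check_statements is hypothetical, "as high as 500 dollars" is read as Fare >= 500, and rows with a missing Age are counted as not 65+. The small demo frame is made up so the block runs on its own; for the actual answers, pass the frame loaded from train.csv.

```python
import pandas as pd


def check_statements(train: pd.DataFrame) -> dict:
    """Evaluate the four true/false statements on a Titanic-style frame."""
    no_parch = (train["Parch"] == 0).mean()    # share with no parents/children
    with_sibsp = (train["SibSp"] > 0).mean()   # share with siblings and/or spouse
    high_fare = (train["Fare"] >= 500).mean()  # assumption: "as high as" means >= 500
    # Assumption: share is over all passengers, so missing ages count as not 65+.
    elderly = (train["Age"] >= 65).mean()
    return {
        "no_parch_gt_75pct": bool(no_parch > 0.75),
        "sibsp_30_to_33pct": bool(0.30 <= with_sibsp <= 0.33),
        "fare_500_lt_1pct": bool(high_fare < 0.01),
        "age_65plus_lt_1pct": bool(elderly < 0.01),
    }


# Made-up rows, only to show the call; use pd.read_csv("train.csv") for the exam.
demo = pd.DataFrame({
    "Parch": [0, 0, 0, 0, 1],
    "SibSp": [0, 1, 0, 0, 0],
    "Fare": [7.25, 71.28, 8.05, 512.33, 13.0],
    "Age": [22.0, 38.0, None, 35.0, 70.0],
})
result = check_statements(demo)
```

train.describe() with custom percentiles gives a similar view, but the explicit shares make each answer easy to justify in the report.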
b) (20 pts.)
Drop columns and fill NA values as follows:
- Drop columns PassengerId, Name, Ticket, Cabin.
- Fill NA values in Embarked with the most common category
- Fill NA values in Fare with the median value in that column
- Fill NA values in Age with the median value within each Pclass x Sex (gender) combination
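The steps above can be sketched with pandas as follows. The function name fill_missing is hypothetical, and the demo rows are made up so the block is self-contained. The sketch computes the fill values from the frame it is given; whether the test file should reuse the train medians instead is not specified here and is left as a choice for the report.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the unused columns and fill NA values in Embarked, Fare and Age."""
    out = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"], errors="ignore")
    # Most common category for Embarked.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Column median for Fare.
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    # Median Age within each Pclass x Sex group, aligned row by row.
    group_median = out.groupby(["Pclass", "Sex"])["Age"].transform("median")
    out["Age"] = out["Age"].fillna(group_median)
    return out


# Made-up rows, only to show the behaviour.
demo = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Name": ["a", "b", "c", "d", "e"],
    "Ticket": ["t1", "t2", "t3", "t4", "t5"],
    "Cabin": [None, "C85", None, None, None],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "male", "female", "male"],
    "Age": [20.0, 38.0, None, None, 30.0],
    "Fare": [7.25, 71.0, None, 53.0, 8.0],
    "Embarked": ["S", "C", "S", None, "S"],
})
filled = fill_missing(demo)
```

Since PassengerId is dropped here, keep a copy of that column from test.csv beforehand if it will be needed later for the Kaggle submission.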
c) (20 pts.)
Perform the following data cleaning and feature engineering steps for both the train and test files.
- Split column Age into 5 intervals with edges (0, 16, 32, 48, 64, 100) (now categorical)
- Split column Fare into 4 intervals with edges (0, 7.9, 14.5, 31, 600) (now categorical)
- Create column Size by adding the values in columns SibSp and Parch, then drop those two columns and keep Size.
- Create binary column Alone with value 1 if the passenger travels alone and 0 otherwise.
Use get_dummies to convert the categorical columns to binary columns.
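A minimal sketch of these steps with pd.cut and pd.get_dummies is given below. Several details are assumptions rather than requirements: the helper name engineer, the interval label strings, include_lowest=True (so that a fare of exactly 0 falls in the first interval), treating Sex and Embarked as the remaining categorical columns, and dtype=int for 0/1 output. The demo rows are made up and assume NA values were already filled.

```python
import pandas as pd

AGE_EDGES = [0, 16, 32, 48, 64, 100]
AGE_LABELS = ["0-16", "16-32", "32-48", "48-64", "64-100"]  # assumed label names
FARE_EDGES = [0, 7.9, 14.5, 31, 600]
FARE_LABELS = ["0-7.9", "7.9-14.5", "14.5-31", "31-600"]    # assumed label names


def engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Bin Age and Fare, build Size and Alone, then one-hot encode."""
    out = df.copy()
    # include_lowest=True keeps values equal to the first edge in the first bin.
    out["Age"] = pd.cut(out["Age"], bins=AGE_EDGES, labels=AGE_LABELS, include_lowest=True)
    out["Fare"] = pd.cut(out["Fare"], bins=FARE_EDGES, labels=FARE_LABELS, include_lowest=True)
    # Size is the number of relatives aboard; Alone marks Size == 0.
    out["Size"] = out["SibSp"] + out["Parch"]
    out = out.drop(columns=["SibSp", "Parch"])
    out["Alone"] = (out["Size"] == 0).astype(int)
    return pd.get_dummies(out, columns=["Age", "Fare", "Sex", "Embarked"], dtype=int)


# Made-up rows, only to show the call.
demo = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, 38.0, 70.0],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 2],
    "Fare": [0.0, 71.28, 8.05],
    "Embarked": ["S", "C", "S"],
})
features = engineer(demo)
```

Because pd.cut returns a categorical column with all five (or four) intervals as categories, get_dummies emits a column for every interval even when one is empty, which helps keep the train and test frames aligned.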
d) (40 pts.)
The test.csv file from Kaggle does not include Survived, so split the train set into new train and test subsets. Use the train subset to fit the following models:
- KNN
- Support Vector Classifier
- Logistic Regression
- Random Forest
- Gradient Boosting
Use GridSearchCV to find the best hyperparameter values. Report the test accuracy rate (using the test subset). Try to improve the test accuracy rate (include polynomial and/or interaction terms, or use another machine learning or statistical method, or any other means).
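The workflow in this part can be sketched with scikit-learn as shown below. Everything specific in it is an assumption chosen for illustration: the helper name tune_models, the 80/20 stratified split, the small hyperparameter grids, and scaling the inputs for KNN, SVC and logistic regression. Synthetic data from make_classification stands in for the engineered Titanic features so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def tune_models(X, y, test_size=0.2, cv=5):
    """Grid-search five classifiers; return best params and test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y
    )
    # Illustrative grids only; widen or change them for the actual report.
    candidates = {
        "KNN": (make_pipeline(StandardScaler(), KNeighborsClassifier()),
                {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}),
        "Support Vector Classifier": (
            make_pipeline(StandardScaler(), SVC(random_state=0)),
            {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}),
        "Logistic Regression": (
            make_pipeline(StandardScaler(),
                          LogisticRegression(max_iter=1000, random_state=0)),
            {"logisticregression__C": [0.1, 1, 10]}),
        "Random Forest": (RandomForestClassifier(random_state=0),
                          {"n_estimators": [50, 100], "max_depth": [3, None]}),
        "Gradient Boosting": (GradientBoostingClassifier(random_state=0),
                              {"n_estimators": [50, 100],
                               "learning_rate": [0.05, 0.1]}),
    }
    results = {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, cv=cv, scoring="accuracy")
        search.fit(X_tr, y_tr)  # cross-validation uses the train subset only
        # score() refits nothing; it applies the best estimator to the test subset.
        results[name] = (search.best_params_, search.score(X_te, y_te))
    return results


# Synthetic stand-in data; replace with the engineered features and Survived.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
results = tune_models(X, y, cv=3)
```

To try polynomial or interaction terms, one option is to add sklearn.preprocessing.PolynomialFeatures as an extra pipeline step (for example with interaction_only=True) and compare the resulting test accuracy against the baseline.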
e) (10 pts.)
Submit your best predictions to Kaggle. Report your Kaggle ID name, the date submitted, and the score provided by Kaggle.
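For reference, the Titanic competition expects a csv file with the columns PassengerId and Survived. The sketch below shows one way to build it; the helper name make_submission and the three example rows are illustrative, and the csv is written to an in-memory buffer so the block runs without touching disk.

```python
import io

import pandas as pd


def make_submission(passenger_ids, predictions) -> pd.DataFrame:
    """Pair each PassengerId from test.csv with its predicted 0/1 label."""
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],
    })


# Illustrative ids and predictions only.
submission = make_submission([892, 893, 894], [0, 1, 0])

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False avoids an extra column
csv_text = buffer.getvalue()
```

For the actual upload, pass a file name such as "submission.csv" to to_csv instead of the buffer.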
Make sure your report includes your name and your Section (Tuesday or Friday).
The report should be clean and well formatted (do not truncate tables, plots, or Python commands; no screen captures). Use random_state = 0 wherever needed.