COMP 2019 Assignment 2 – Machine Learning
Please submit your solution via LEARNONLINE. Submission instructions are given at the end of this assignment.
This assessment is due on Sunday, 10 June 2018, 11:55 PM.
This assessment is worth 20% of the total marks. This assessment consists of 6 questions.
In this assignment you will aim to predict if it will rain on each day given weather observations from the preceding day. You will perform a number of machine learning tasks, including training a classifier, assessing its output, and optimising its performance. You will document your findings in a written report. Write concise explanations; approximately one paragraph per task will be sufficient.
Download the data file for this assignment from the course website (file weather.zip). The archive contains the data file in CSV format, and some python code that you may use to visualise a decision tree model.
Before starting this assignment, ensure that you have a good understanding of the Python programming language, the Jupyter Python notebook environment, and an overall understanding of machine learning training and evaluation methods using the scikit-learn python library (Practical 3). You will need a working Python 3.x system with the Jupyter Notebook environment and the ‘sklearn’ package installed.
Documentation that you may find useful:
· Python: https://www.python.org/doc/
· Jupyter: https://jupyter-notebook.readthedocs.io/en/stable/
· Scikit-learn: http://scikit-learn.org/stable/
· Numpy: https://docs.scipy.org/doc/
Preparation
Create a Jupyter notebook and load the data. Use
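import numpy as np
data = np.loadtxt('weather.csv', skiprows=1, delimiter=',', dtype=int)
to load the data. Type this code into the notebook rather than copying and pasting it from this document, as pasted text may contain typographic quotation marks that cause syntax errors. Note that the dtype is the built-in int; the alias np.int has been removed from recent NumPy releases. (Students familiar with the Pandas library may use that to load and explore the data instead.)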
Familiarise yourself with the data. There are 44 columns and 2716 rows. All values are binary (0/1) where 0 indicates false and 1 indicates true.
Categorical variables were encoded using “one-hot” encoding, where a separate column indicates the presence or absence of each possible value of the variable. For example, the three binary-valued columns “MinTemp_Low”, “MinTemp_Moderate”, and “MinTemp_High” correspond to the three possible values “Low”, “Moderate”, and “High” of the variable “MinTemp”. A 1 in column “MinTemp_Low” means that the value of MinTemp was “Low”; the cells for the other two values must be 0 in this case.
Explore the distribution of data in each column.
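One possible way to explore binary columns is sketched below. It uses a small hand-made array instead of weather.csv, so the column list and values are illustrative assumptions only. For 0/1 data, the mean of a column is the fraction of rows in which the column is 1, and every one-hot group should sum to exactly 1 in each row.

```python
import numpy as np

# Toy stand-in for the real data: one one-hot group plus the target column.
# The real file has 44 columns; this subset is for illustration only.
columns = ["MinTemp_Low", "MinTemp_Moderate", "MinTemp_High", "RainTomorrow"]
toy = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
])

# For 0/1 columns, the mean is the fraction of rows where the value is 1.
fractions = toy.mean(axis=0)
for name, frac in zip(columns, fractions):
    print(f"{name}: {frac:.2f}")

# Each row should have exactly one 1 within a one-hot group.
group_sums = toy[:, 0:3].sum(axis=1)
one_hot_ok = bool(np.all(group_sums == 1))
print("MinTemp group is a valid one-hot encoding:", one_hot_ok)
```

With the real data, the same calls can be applied to the array returned by np.loadtxt.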
The last column contains the prediction target (RainTomorrow). The meaning of the columns is as follows:
· MinTemp_{Low,Moderate,High}: 1 if the minimum temperature on the day was low/moderate/high
· MaxTemp_{Low,Moderate,High}: 1 if the maximum temperature on the day was low/moderate/high
· Evaporation_{Low,Moderate,High}: 1 if the measured evaporation on the day was low/moderate/high
· Sunshine_{Low,Moderate,High}: 1 if the aggregated periods of sunshine on the day was low/moderate/high
· WindSpeed9am_{Low,Moderate,High}: 1 if the measured wind speed at 9am on the day was low/moderate/high
· WindSpeed3pm_{Low,Moderate,High}: 1 if the measured wind speed at 3pm on the day was low/moderate/high
· Humidity9am_{Low,Moderate,High}: 1 if the humidity at 9am on the day was low/moderate/high
· Humidity3pm_{Low,Moderate,High}: 1 if the humidity at 3pm on the day was low/moderate/high
· Pressure9am_{Low,Moderate,High}: 1 if the barometric pressure at 9am on the day was low/moderate/high
· Pressure3pm_{Low,Moderate,High}: 1 if the barometric pressure at 3pm on the day was low/moderate/high
· Cloud9am_{Low,Moderate,High}: 1 if the cloud cover at 9am on the day was low/moderate/high
· Cloud3pm_{Low,Moderate,High}: 1 if the cloud cover at 3pm on the day was low/moderate/high
· Temp9am_{Low,Moderate,High}: 1 if the temperature at 9am on the day was low/moderate/high
· Temp3pm_{Low,Moderate,High}: 1 if the temperature at 3pm on the day was low/moderate/high
· RainToday: 1 if it rained on the day
· RainTomorrow: 1 if it rained on the following day. This is the target we wish to predict.
Question 1: Baseline
A simple model for predicting rain tomorrow is to use today’s weather (RainToday) as an indicator of tomorrow’s weather (RainTomorrow).
What performance can we expect from this simple model?
Choose an appropriate measure to evaluate the classifier. Select among Accuracy, F1-measure, Precision, and Recall.
Use a confusion matrix and/or classification report to support your analysis.
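As a rough illustration of how such a baseline could be evaluated, the sketch below treats the RainToday values directly as predictions for RainTomorrow. The two arrays are synthetic stand-ins, not values from weather.csv, so the resulting numbers say nothing about the real data.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic stand-ins for the RainToday and RainTomorrow columns.
rain_today = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 0])
rain_tomorrow = np.array([0, 1, 1, 0, 0, 0, 1, 0, 1, 0])

# Baseline "model": tomorrow's weather is predicted to equal today's.
y_pred = rain_today

# Rows are true classes, columns are predicted classes (class 0 first).
cm = confusion_matrix(rain_tomorrow, y_pred)
print(cm)

# Per-class precision, recall and F1, plus overall accuracy.
print(classification_report(rain_tomorrow, y_pred))
```

With the real data, rain_tomorrow would be the last column (data[:, -1]); taking RainToday from the second-to-last column is an assumption based on the order of the column list above and should be checked against the file header.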
Question 2: Naïve Bayes
Train a Naïve Bayes classifier to predict RainTomorrow.
As all attributes are binary vectors, use the BernoulliNB classifier provided by scikit-learn. Ensure that you follow correct training and evaluation procedures.
1. Assess how well the classifier performs on the prediction task.
2. What performance can we expect from the trained model if we used next month’s data as input?
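A minimal sketch of one training and evaluation procedure for BernoulliNB follows. It assumes a stratified hold-out split combined with 5-fold cross-validation on the training portion; other procedures are equally valid. The data is randomly generated, and the split ratio, random seeds and F1 scoring are illustrative choices rather than requirements of the assignment.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary data: the target follows feature 0, with 20% of labels flipped.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 6))
flip = rng.random(400) < 0.2
y = np.where(flip, 1 - X[:, 0], X[:, 0])

# Hold out a test set that is never used during training or model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = BernoulliNB()

# Cross-validation on the training portion gives an estimate with some spread.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
print("CV F1: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Fit on the full training portion, then evaluate once on the held-out test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(classification_report(y_test, y_pred))
```

With the real data, X would be data[:, :-1] and y would be data[:, -1], since the last column holds the prediction target.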
Question 3: Decision Tree
Train a DecisionTreeClassifier to predict RainTomorrow. Use the argument class_weight='balanced' when constructing the classifier, as the target variable RainTomorrow is not equally distributed in the data set.
Ensure that you follow correct training and evaluation procedures.
1. Assess how well the classifier performs on the prediction task.
2. What performance can we expect from the model on new data?
If you wish to visualise the decision tree, you can use the function print_dt in dtutils.py, which is included in the Assignment 2 zip archive:
import dtutils
dtutils.print_dt(tree, feature_names=flabels)
where tree refers to the trained decision tree model, and flabels is a list of feature names (columns) in the data.
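The sketch below shows one way to train the tree with class_weight='balanced' and compare its scores on the training and test portions. It runs on synthetic, imbalanced data with hypothetical feature names. Because the contents of dtutils.py are not shown here, the example prints the tree with scikit-learn's own export_text function instead of print_dt.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the positive class (roughly 30% of rows) depends on two features.
rng = np.random.default_rng(1)
flabels = [f"feature_{i}" for i in range(8)]  # hypothetical column names
X = rng.integers(0, 2, size=(600, 8))
y = X[:, 0] & X[:, 1]
flip = rng.random(600) < 0.1
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# class_weight="balanced" reweights samples inversely to class frequency.
tree = DecisionTreeClassifier(class_weight="balanced", random_state=1)
tree.fit(X_train, y_train)

# A large gap between these two scores is one symptom worth investigating.
train_f1 = f1_score(y_train, tree.predict(X_train))
test_f1 = f1_score(y_test, tree.predict(X_test))
print(f"train F1 = {train_f1:.3f}, test F1 = {test_f1:.3f}")

# Text rendering of the top levels of the tree.
tree_text = export_text(tree, feature_names=flabels, max_depth=2)
print(tree_text)
```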
Question 4: Diagnosis
Does the Decision Tree model suffer from overfitting or underfitting? Justify why/why not.
If the model exhibits overfitting or underfitting, revise your training procedure to remedy the problem, and re-evaluate the improved model. The DecisionTreeClassifier has a number of parameters that you can consider for tuning the model:
· max_depth: maximum depth of the tree
· min_samples_leaf: minimum number of samples in each leaf node
· max_leaf_nodes: maximum number of leaf nodes
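One way to tune these parameters is a cross-validated grid search, sketched below on synthetic data. The candidate values in the grid, the F1 scoring and the 5-fold setting are illustrative assumptions, not recommended settings for the weather data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary data (same construction idea as a toy example).
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(600, 8))
y = X[:, 0] & X[:, 1]
flip = rng.random(600) < 0.1
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2
)

# Candidate values are arbitrary; None means "no limit" for that parameter.
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5, 20],
    "max_leaf_nodes": [None, 8, 32],
}

search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=2),
    param_grid,
    cv=5,
    scoring="f1",
    return_train_score=True,  # keep training scores to compare with validation
)
search.fit(X_train, y_train)

best = search.best_index_
mean_train = search.cv_results_["mean_train_score"][best]
mean_valid = search.cv_results_["mean_test_score"][best]
print("best parameters:", search.best_params_)
print(f"mean train F1 = {mean_train:.3f}, mean validation F1 = {mean_valid:.3f}")

# The held-out test set is used only once, after tuning is finished.
test_score = search.score(X_test, y_test)
print(f"test F1 = {test_score:.3f}")
```

Comparing the mean training score with the mean validation score for each parameter setting gives evidence about overfitting or underfitting without touching the test set.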
Question 5: Recommendation
Which of the models you trained should be selected for the prediction task? Assume that all errors made are equally severe. That is, predicting rain if there is actually no rain is just as bad as predicting no rain if it actually rains.
Does your answer change if predicting rain for a day without rain is a negligible error? Justify why/why not.
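The effect of error costs on the choice of measure can be illustrated with two hypothetical sets of predictions, as in the sketch below. The labels are hand-made and do not come from any trained model; model A never raises a false alarm but misses half of the rainy days, while model B catches every rainy day at the price of several false alarms.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hand-made labels: 4 rainy days (1) and 6 dry days (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # hypothetical cautious model
pred_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # hypothetical eager model

results = {}
for name, pred in [("A", pred_a), ("B", pred_b)]:
    results[name] = {
        "accuracy": accuracy_score(y_true, pred),
        "precision": precision_score(y_true, pred),
        "recall": recall_score(y_true, pred),
    }
    print(name, results[name])
```

Here model A has the higher accuracy (0.8 against 0.7), which counts both kinds of error equally, whereas model B has the higher recall (1.0 against 0.5), which ignores false alarms entirely.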
Question 6: Report
Write a concise report showing your analysis for Questions 1-5.
Demonstrate that you have followed appropriate training and evaluation procedures, and justify your conclusions with relevant evidence from the evaluation output.
Where there are alternatives (e.g. measures, procedures, models, conclusions), demonstrate that you have considered all relevant alternatives and justify why the selected alternative is appropriate.
Do not include the python code in your report.
Submission Instructions
Submit a single zip archive containing the following:
· weather.ipynb: the Jupyter Notebook file.
· weather.html: the HTML version of weather.ipynb showing the notebook including all output. Create this by selecting File>Download as>HTML after having run all cells in the Jupyter notebook.
· report.pdf: the report as specified in Question 6.
Marking Scheme
Question | Criteria | Marks
Q1: Baseline | Appropriate measure selected and justified; correct evaluation | 10
Q2: Naïve Bayes | Correct training procedure applied; correct evaluation procedure applied; correct conclusion | 20
Q3: Decision Tree | Correct training procedure applied; correct evaluation procedure applied; correct conclusion | 15
Q4: Diagnosis | Correct diagnosis; correct revised training and evaluation procedure applied | 30
Q5: Recommendation | Correct recommendations; recommendations justified by evaluation results | 15
Q6: Report | Well-structured report; professional presentation | 10
Jupyter notebook | Executes correctly when using Run All; copy saved as HTML format submitted; matches the contents of the report | Deductions apply