IEOR 4525 Homework 2
1.
Consider the following table where certain keywords from daily Twitter messages of stock brokers and traders are collected, along with whether the stock market (as indicated by the S&P 500 index) went up or down on that day. Each word is represented by a binary feature variable, which is 1 if more than half of all participating brokers or traders used it on that day, and 0 otherwise. The data is given below:
Twit | Buy | Sell | inflation up | inflation down | unemployment high | unemployment down | recession | SP500 |
1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | UP |
2 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | UP |
3 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | UP |
4 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | UP |
5 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | Down |
6 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | Down |
7 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | Down |
8 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | Down |
9 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | Down |
Pr[1|UP] |  |  |  |  |  |  |  |  |
Pr[1|Down] |  |  |  |  |  |  |  |  |
Pr[0|UP] |  |  |  |  |  |  |  |  |
Pr[0|Down] |  |  |  |  |  |  |  |  |
1a) Fill in the last four rows of the table with the estimated conditional probabilities of the values 1 and 0 in each column, given that the S&P 500 index went up and given that it went down. In the last column, give the estimated prior probabilities of the index going up and going down. You may do the calculations by hand or write a program for this and the remaining calculations in Q1. If you use a program, however, you may not use any scikit-learn packages; your code should be a straightforward script that aids hand calculation.
1b) A new data point arrives today containing only the words “Buy”, “inflation down”, and “recession”. Using the Naive Bayes method, compute the estimated probabilities of the market going up and of the market going down. What is the prediction for the market today?
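The kind of script 1a) and 1b) allow can be sketched as follows. This is a minimal sketch without smoothing and without scikit-learn; it assumes the table rows are typed in by hand, and the function names prob_of_one and naive_bayes_scores are illustrative, not part of the assignment.

```python
import numpy as np

# Keyword columns of the table, in order (the Twit index and SP500 column are left out).
features = ["Buy", "Sell", "inflation up", "inflation down",
            "unemployment high", "unemployment down", "recession"]
X = np.array([
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 1],
])
y = np.array(["UP"] * 4 + ["Down"] * 5)


def prob_of_one(label):
    """Estimated Pr[feature = 1 | label] for every column (no smoothing)."""
    return X[y == label].mean(axis=0)


def naive_bayes_scores(x_new):
    """Prior times the product of per-feature likelihoods, for each class."""
    scores = {}
    for label in ("UP", "Down"):
        p1 = prob_of_one(label)
        # Use Pr[1 | label] where the word is present, Pr[0 | label] where it is absent.
        likelihood = np.prod(np.where(x_new == 1, p1, 1.0 - p1))
        scores[label] = likelihood * np.mean(y == label)
    return scores


# Today's message: only "Buy", "inflation down" and "recession" are present.
today = ("Buy", "inflation down", "recession")
x_new = np.array([1 if f in today else 0 for f in features])

scores = naive_bayes_scores(x_new)
total = sum(scores.values())
posterior = {label: s / total for label, s in scores.items()}
print(posterior)
```

The absent words contribute a factor Pr[0 | class] each, which is easy to forget when working by hand.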
1c) Repeat questions 1a) and 1b), but this time use the Laplace Smoothing technique, with α = 1 and β = 20. Will the prediction for the market behavior change?
2.
Consider the following set and its partition to the right.
2a) Compute the misclassification rate for both the left and right partitioning.
2b) Compute the Gini index for both left and right partitioning.
2c) Compute the cross entropy for both left and right partitioning. Use base 2 for all logarithms.
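The three impurity measures in 2a) to 2c) can be checked with a short helper. This is a minimal sketch: the function name impurities and the class counts in the example call are hypothetical, since the actual counts must be read off the figure.

```python
import math


def impurities(counts):
    """Return (misclassification rate, Gini index, cross entropy in bits)
    for one region, given the number of points of each class in it."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]   # skip empty classes: 0 * log(0) is taken as 0
    misclassification = 1.0 - max(probs)
    gini = 1.0 - sum(p * p for p in probs)
    cross_entropy = -sum(p * math.log2(p) for p in probs)
    return misclassification, gini, cross_entropy


# Hypothetical region with 6 points of one class and 2 of the other.
print(impurities([6, 2]))
```

Calling the helper once per region of each partitioning gives the per-region values that the three questions ask for.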
3.
Programming Project: In this assignment you will write a Naive Bayes script to test whether an e-mail is spam or not. The dataset you will use is an old one (created in the 1990s by a scientist working at HP Labs in Silicon Valley). The dataset is already cleaned up, so you need only minimal text processing.
3a)
The data set can be obtained from the data repository at the University of California, Irvine (UCI) site; follow this link. The file is in CSV format. Read the description and the format of this data set to make sure you understand how it is organized. You must not download the data. Instead, use Python's facility for reading from a web site to access it directly.
3b)
The actual headers are at another location in the UCI archive. That file contains the title of each column, along with some comments, which are the lines that start with “|”. Please go to the web site and examine its content; that will make the next few lines easier to understand. You can access the header file. Read it from the site and treat lines starting with “|” as comments (check the description of the file). Start from row 2, since row 1 does not contain any data. (Check the documentation for pandas.read_table; it explains how to treat comment lines.)
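Reading the header with comment lines skipped might look like the sketch below. The excerpt in SAMPLE_HEADER is a made-up stand-in in what is assumed to be the same "name: type" layout, not the real file, and read_column_names is an illustrative helper. The source argument can be a URL, which pandas reads directly without a manual download.

```python
import io

import pandas as pd

# Made-up excerpt imitating the assumed layout of the header file.
SAMPLE_HEADER = """| comment lines start with a vertical bar
|
1, 0.    | spam, non-spam classes

word_freq_make:         continuous.
word_freq_address:      continuous.
capital_run_length_total: continuous.
"""


def read_column_names(source):
    """Return the column names; `source` may be a URL, a path, or a file object."""
    header = pd.read_table(source, sep=":", comment="|", header=None,
                           names=["name", "kind"])
    # Row 1 lists the class labels rather than a column, so start from row 2.
    return header["name"].iloc[1:].str.strip().tolist()


names = read_column_names(io.StringIO(SAMPLE_HEADER))
print(names)
```

The data file itself could then be read with pd.read_csv(data_url, header=None, names=names + ["target"]), where data_url stands for the link given in 3a) and "target" is an assumed label for the last column.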
3c)
Once the header and the data are read, create a pandas data frame. The header labels are the keywords, and the values under each keyword give the frequency with which it occurred in each e-mail. The last three features contain other information about the e-mails. The last column is the target; each row under the target column should be labeled “spam” or “no-spam”, or 1, 0. Finally, randomly select 20% of the data and save it in the matrices X_test and y_test, and save the remaining 80% in the matrices X_train and y_train.
3d)
First, turn the data into 0/1 values; for this part, we do not care how many times a word occurs in an e-mail, only whether it occurs at all. Also, ignore the last three features, where various runs are counted. Build a binary Naive Bayes model with no Laplace smoothing. Test your model on the test data. Print the confusion matrix and the accuracy rate (the percentage of e-mails in the test set that your model correctly identified as spam or no-spam). (See the documentation for BernoulliNB in the module sklearn.naive_bayes to see how to set α. Notice that the default is α = 1.0, so you need to explicitly set it to zero.)
3e)
Repeat the previous question, but this time set the Laplace parameter α = 1.0. Was there an improvement in the results?
3f)
Repeat the previous question, but this time restore the proportion of occurrence of each keyword. Treat this proportion as if it follows the normal distribution, and run the naive Bayes model under this assumption. Compare the performance with that of the binary (Bernoulli) model.
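The workflow of 3c) to 3f) can be sketched on synthetic data as follows. The data here is randomly generated and only stands in for the e-mail data; the sizes, rates and variable names are assumptions. The binary model is shown with α = 1.0. For the unsmoothed variant, how α = 0 is handled depends on the scikit-learn version, so check the BernoulliNB documentation for the version installed.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic stand-in for the word-frequency columns (not the real e-mail data).
rng = np.random.default_rng(0)
n_rows, n_words = 500, 10
y = rng.integers(0, 2, size=n_rows)                 # 1 = spam, 0 = no-spam
use_rate = np.where(y[:, None] == 1, 0.6, 0.2)      # spam uses each word more often
X_freq = rng.random((n_rows, n_words)) * (rng.random((n_rows, n_words)) < use_rate)

# Hold out a random 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_freq, y, test_size=0.2, random_state=0)

# Binary model: keep only whether a word occurred at all.
bernoulli = BernoulliNB(alpha=1.0, binarize=None)
bernoulli.fit((X_train > 0).astype(int), y_train)
pred_bin = bernoulli.predict((X_test > 0).astype(int))

# Gaussian model: treat the frequencies themselves as normally distributed.
gaussian = GaussianNB().fit(X_train, y_train)
pred_gauss = gaussian.predict(X_test)

for label, pred in [("Bernoulli", pred_bin), ("Gaussian", pred_gauss)]:
    print(label)
    print(confusion_matrix(y_test, pred))
    print(f"accuracy: {100 * accuracy_score(y_test, pred):.1f}%")
```

Fixing random_state makes the split reproducible, so the two models are compared on the same test rows.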
4.
Programming Project: Classification. The dataset of handwritten digits has been used extensively for classification problems. However, many believe it is too easy: even the simplest learning algorithms can predict handwritten digits with high accuracy. A more challenging dataset consists of 28 × 28 images of ten different clothing items. This data set is known as Fashion-MNIST ².
² See the following link for the origins of the Fashion-MNIST dataset.
In this project, you will run the naive Bayes, CART and random forest algorithms on this data set.
4a)
First, you need to load this data. It is available from the tensorflow module ³. Make sure to install this module. Then you can load the data using the following lines:

    import tensorflow as tf
    from tensorflow import keras

    fashion_mnist = keras.datasets.fashion_mnist
    (X_train_full, y_train_full), (X_test_full, y_test_full) = \
        fashion_mnist.load_data()
    names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
             "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

³ You will need tensorflow and keras for neural networks anyway, so it is useful to install these modules.
This is the easiest way to load this data. However, if you do not want to install tensorflow and keras on your machine, you can go to the footnote below and download the data from the Kaggle.com site using the link given. The names list contains the names of the clothing items. Use these names in what follows, rather than the labels 0, 1, …, 9 that you used for handwritten digits.
The training data set contains 60,000 matrices, each of dimension 28 × 28, and the test data contains 10,000 matrices. This much data may be too large while you are developing and testing your models. It may be easier to work with the first 2,000 training examples and the first 500 test examples while you develop your code. Once you are done and are sure your model is good, you can run it on the full training and test data.
4b)
Run the naive Bayes method on this dataset. First, assume that the class-conditional distribution of each feature (a pixel in this case) roughly follows the normal distribution. Next, repeat, but this time assume the class-conditional distribution follows the multinomial distribution. Finally, repeat again, but this time assume a categorical distribution. For all three cases, print the confusion matrix, the error rate, and the sensitivity and recall report.
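The three variants can be run in one loop, as in the sketch below. It uses small synthetic 8 × 8 "images" with three classes as a stand-in for the real data, so every array and size here is an assumption. Two details carry over to the real data: the images must be flattened to one row per example, and CategoricalNB may meet pixel values in the test set that never occurred in training, which min_categories=256 guards against for 8-bit pixels.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import CategoricalNB, GaussianNB, MultinomialNB

# Synthetic stand-in: 300 images of size 8 x 8, integer pixels in 0..250, 3 classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
images = rng.integers(0, 51, size=(300, 8, 8))
for k in range(3):
    images[y == k, :, 2 * k:2 * k + 2] += 200   # each class brightens its own two columns

X = images.reshape(len(images), -1)             # flatten every image into one row
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

models = {
    "Gaussian": GaussianNB(),
    "Multinomial": MultinomialNB(),
    "Categorical": CategoricalNB(min_categories=256),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    error_rate = 1.0 - np.mean(y_pred == y_test)
    results[name] = (cm, error_rate)
    print(f"{name}: error rate {error_rate:.3f}")
    print(cm)
    print(classification_report(y_test, y_pred))
```

On the real data, passing target_names=names to classification_report prints the clothing names instead of the numeric labels.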
Also, as in the cases worked out in class for the digits, loop through the examples in the test set and plot each picture along with its predicted and actual labels (use the predicted and actual names and not just a number, for instance: predicted: Sandal, actual: Ankle boot). For each test example, also draw a bar chart of the predicted probability distribution over the ten clothing items. Label the x-axis with the actual names (e.g., Sandal, Shirt, etc.) and not numbers. Make sure the labels on the x-axis are readable by writing them with a 90-degree rotation.
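One way to lay out a single test example is sketched here. The helper name plot_prediction is illustrative, and the blank image and one-hot probabilities in the example call are placeholders for a real test image and the output of predict_proba.

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
         "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]


def plot_prediction(image, probs, actual, names):
    """Show one image next to a bar chart of its predicted class probabilities."""
    predicted = int(np.argmax(probs))
    fig, (ax_img, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))
    ax_img.imshow(image, cmap="gray")
    ax_img.set_title(f"predicted: {names[predicted]}, actual: {names[actual]}")
    ax_img.axis("off")
    ax_bar.bar(range(len(names)), probs)
    ax_bar.set_xticks(range(len(names)))
    ax_bar.set_xticklabels(names, rotation=90)   # rotated so the names stay readable
    ax_bar.set_ylabel("predicted probability")
    fig.tight_layout()
    return fig


# Placeholder example: a blank image whose probability mass sits on "Sandal".
fig = plot_prediction(np.zeros((28, 28)), np.eye(10)[5], actual=9, names=names)
```

In the loop over the test set, image would be X_test_full[i], probs would be row i of model.predict_proba(...), and actual would be y_test_full[i].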
4c)
Repeat 4b), but this time use CART with the following parameter values: set the maximum depth to 5 and to 10, and set the minimum number of points per leaf to 1000 and to 2000. For each of the four resulting models, draw the decision tree, print the confusion matrix and draw its heatmap, and print the accuracy report.
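The four parameter combinations can be generated with a loop, as in the sketch below. It uses synthetic data from make_classification as a stand-in for the images, and it prints a text rendering of each tree with export_text to stay runnable without a display; sklearn.tree.plot_tree is the graphical counterpart.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the flattened image data.
X, y = make_classification(n_samples=10000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

trees = {}
for depth, leaf in product([5, 10], [1000, 2000]):
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=leaf,
                                  random_state=0)
    tree.fit(X_train, y_train)
    trees[(depth, leaf)] = tree
    y_pred = tree.predict(X_test)
    print(f"max_depth={depth}, min_samples_leaf={leaf}")
    print(export_text(tree, max_depth=2))    # text view of the top of the tree
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```

With a minimum of 1000 or 2000 points per leaf, the leaf-size constraint rather than the maximum depth tends to decide how far a tree can grow, which is worth keeping in mind when comparing the four models.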
4d)
Repeat 4b), but this time use random forests with 500 estimators and with the maximum number of leaves equal to 16. Set the out-of-bag parameter to True. Print both the out-of-bag score and the test error and compare them.
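A minimal sketch of the random forest settings, again on synthetic stand-in data (the data set and variable names are assumptions). The out-of-bag score is an accuracy, so it is converted to an error before being compared with the test error.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flattened image data.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_error = 1.0 - forest.oob_score_               # estimated from the training data alone
test_error = 1.0 - forest.score(X_test, y_test)   # measured on held-out data
print(f"out-of-bag error: {oob_error:.3f}, test error: {test_error:.3f}")
```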
5.
Programming Project: Regression. A well-known but old dataset included in the scikit-learn package is called “California Housing”. It can be loaded with the following lines:

    from sklearn.datasets import fetch_california_housing

    dat = fetch_california_housing()
    Xlabel, ylabel = dat['feature_names'], dat['target_names']

Examine this dataset and find out about its properties. It has nine columns; the one labeled MedHouseVal is the target. It gives the median house price in a neighborhood, in units of $100,000. The other eight features describe the neighborhood the data is taken from. You are to use two approaches to build a regression model.
5a) First, organize the data into a pandas data frame. Use display (instead of print) to show the top and bottom rows of this data frame. Next, split the data into training and test sets, with 20% of the data randomly chosen for the test set.
5b) Use the Cost-Complexity Pruning technique to develop a regression CART model for this data. Plot both the number of nodes and the depth of the resulting tree against α. Plot the score (the R² statistic) for both the training and test sets on the same panel (blue for the training set, orange for the test set). For what α do you get the best score on the test set? Print the R² score for this best tree. Plot and print the best tree.
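The pruning sweep in 5b) can be sketched as below. The data comes from make_friedman1 as a synthetic stand-in for the housing data, and the plots are left out. The candidate α values come from cost_complexity_pruning_path; they are clipped at zero as a precaution against tiny negative values from rounding and thinned to every tenth value to keep the loop short, both of which are choices of this sketch.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing data.
X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Candidate alphas from the pruning path of the fully grown tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))[::10]

rows = []   # (alpha, number of nodes, depth, train R^2, test R^2)
for alpha in alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    rows.append((alpha, tree.tree_.node_count, tree.get_depth(),
                 tree.score(X_train, y_train), tree.score(X_test, y_test)))

best = max(rows, key=lambda row: row[4])   # highest R^2 on the test set
print(f"best alpha: {best[0]:.4f}, nodes: {best[1]}, depth: {best[2]}, "
      f"test R^2: {best[4]:.3f}")
```

The columns of rows are exactly the series needed for the requested plots: nodes and depth against α, and the two R² curves on one panel.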
5c) Repeat part 5b), but this time use the random forest model with the number of estimators equal to 500, the minimum number of items per leaf equal to 10, and the out-of-bag score set to True. (There is no need to plot the α plots for this question.) Also, print the out-of-bag score and compare it with the results on the test set.
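For the regression forest, the out-of-bag score and the test score are both R² values, so they can be compared directly. A minimal sketch on synthetic stand-in data (make_friedman1 is an assumption, not the housing data):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing data.
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=500, min_samples_leaf=10,
                               oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_r2 = forest.oob_score_                 # R^2 estimated from out-of-bag samples
test_r2 = forest.score(X_test, y_test)     # R^2 on the held-out test set
print(f"out-of-bag R^2: {oob_r2:.3f}, test R^2: {test_r2:.3f}")
```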