Hands-on assignment #3, due Monday 3/4/2019 @ 8am
Credit: The data and ideas behind these exercises and homeworks are from the NIH LINCS DCIC Crowdsourcing Portal and Ma’ayan Lab @ Mt Sinai, New York. http://www.maayanlab.net/crowdsourcing/megatask1.php
The overarching goal is to predict adverse drug reactions (ADRs). This assignment builds on the in-class examples on ADR prediction and on hands-on HW2.
This is a group assignment. You can work in a group consisting of 1, 2, 3 or 4 members. Each group will make one submission via canvas. Please state the names of all your team members in your submission.
This assignment focuses on classification and feature selection methods, and will be graded out of 10 points.
Upload 3 files for this assignment:
A Jupyter notebook file named hw3.ipynb containing the R code and answers for the 5 questions.
Please use “#” (comment lines) and markdown cells in your notebook to indicate the question number and to extensively document your code.
A spreadsheet in tab-delimited text format representing the cross validation results of your methods for each side effect.
Use the data files “gene_expression_n438x978.txt” and “ADRs_HLGT_n438x232.txt” to answer all questions in this assignment. You can assume both files are in your working directory.
In class, we discussed many techniques for classification and feature selection in the context of personalized medicine. We illustrated how to apply these methods to the breast cancer data in class.
| Feature selection methods | Classification methods |
|---|---|
| none | k-nearest neighbor (k-NN) |
| t-test | Support vector machine (SVM) |
| Signal-to-noise (S2N) | Bayesian Model Averaging (BMA) |
| BSS/WSS | Decision trees |
| Correlation with the class vector | Boosting, bagging and other ensemble methods |
| Golub’s method on the AML/ALL data | |
Experiment with combinations of the above feature selection and classification methods and apply them to the ADR data to predict side effects. Evaluate the performance using 10-fold cross validation, repeated 3 times. Note that you need to perform feature selection in each fold and each run of your cross validation. In other words, you will perform feature selection and classification a total of 30 times.
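The key requirement above is that feature selection happens inside each fold, using only the training samples, so that no information from the test fold leaks into the selected features. A minimal R sketch of this loop for one side effect, assuming a t-test feature filter and k-NN (variable names, the p-value cutoff, and k are illustrative, not prescribed):

```r
library(class)   # for knn()

X   <- as.matrix(read.delim("gene_expression_n438x978.txt", row.names = 1))
adr <- read.delim("ADRs_HLGT_n438x232.txt", row.names = 1)
y   <- factor(adr[, 1])   # one side effect at a time

set.seed(1)
n.folds <- 10
acc <- numeric(0)
for (rep in 1:3) {
  fold.id <- sample(rep(1:n.folds, length.out = nrow(X)))
  for (f in 1:n.folds) {
    train <- fold.id != f
    # Feature selection INSIDE the fold: t-test on training samples only.
    # (Assumes both classes are present in the training split.)
    pvals <- apply(X[train, ], 2, function(g) t.test(g ~ y[train])$p.value)
    keep  <- pvals < 0.01
    pred  <- knn(X[train, keep], X[!train, keep], y[train], k = 10)
    acc   <- c(acc, mean(pred == y[!train]))
  }
}
mean(acc)   # average accuracy over the 30 train/test splits
```

Swapping in a different feature filter or classifier (or different parameter settings) inside the inner loop gives you another combination.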
Each combination of feature selection + classification is worth 1 point. For example,
- t-test with p-value < 0.01 as feature selection and k-NN with k=10 as the classification method will earn you 1 point.
- t-test with p-value < 0.001 as feature selection and k-NN with k=10 as the classification method will earn you another 1 point.
- No feature selection and k-NN with k=12 will earn you an additional 1 point.
So, your group can try 8 combinations to earn up to 8 points. Different input parameter settings count as different combinations.
Submit a spreadsheet in tab-delimited text format representing a table that consists of 232 rows and 8 columns. Each column represents a combination of the methods you tried. Each row represents a side effect. Each entry in this table is the average prediction accuracy from 10-fold cross validation, repeated 3 times.
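One way to build and export that table in R, assuming the ADR file’s columns name the 232 side effects (the results matrix, combination names, and output filename are illustrative):

```r
adr <- read.delim("ADRs_HLGT_n438x232.txt", row.names = 1)

# 232 side effects (rows) x 8 method combinations (columns)
results <- matrix(NA_real_, nrow = ncol(adr), ncol = 8,
                  dimnames = list(colnames(adr), paste0("combo", 1:8)))

# ... fill `results` by running your cross-validation loop for every
#     side effect and every method combination ...

# col.names = NA writes a leading blank header cell so the side-effect
# names line up as a proper first column in the tab-delimited file.
write.table(results, file = "hw3_cv_results.txt",
            sep = "\t", quote = FALSE, col.names = NA)
```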
(2 points)
Compare the prediction accuracy of the methods you tried in your report. In particular, address the following questions:
- Which side effects can you predict with the highest accuracy in each combination?
- Which side effects do you predict with the lowest accuracy in each combination?
- Some side effects have unbalanced class sizes. Did you do anything about that? Why? If so, is your method effective?
- Which feature selection and/or classification method would you consider as the “winner” in your empirical study? You can include the results from HW2 in your discussion.
- Any interesting negative results?