COMP5434 Big Data Computing

Project 1: Implement Logistic Regression with Gradient Descent Backpropagation

Introduction

In Tutorial 1, we got start with Python and TensorFlow basics, and in LAB 1, we are going to have hands-on practices on Pyhton and TensorFlow. To consolidate the theoretical knowledge given in the lecture, in Project 1, we will implement the logistic regression with gradient descent backpropagation. Project 1 is an individ-ual student project that student should work independently. In this project, basic programming skills are required, otherwise you may have di culties in nishing it.

Logistic Regression

1. Key Points Recap

Logistic regression is the linear classi cation algorithm for binary classi cation problems, which can be expressed very much like linear regression. Input values,

x = x1; x2; , are combined linearly with parameters, = 1; 2; , to predict an output value, y. A key di erence from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value.

The logistic function is de ned as,

We define the cost function as

 The partial derivative of J ( ) for xj can be expressed as,

Please refer to the detailed derivation in Appendix.

Gradient descent algorithm for logistic algorithm is calculated as in Eq. 2, which minimizes J ( ) with parameter j. Repeat the step for many times until it con-verges.

2. Logistic Regression for Binary Classi cation

Inputs: suppose you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the

3

applicants scores on two exams and the admissions decision.

Problem: build a logistic regression model to predict whether a student gets admitted into a university (i.e., build a classi cation model that estimates an ap-plicants probability of admission based the scores from those two exams). If the probability h (x) > 0:5, the predicting class is 1, otherwise 0, if h (x) < 0:5.

Algorithm 1 shows logistic regression for binary classi cation step-by-step.

Algorithm 1: Logistic Regression for Binary Classi cation

Project Description

1. Dataset Introduction

In this project, a dataset, called \Pima Indians Diabetes Dataset", is provided to you, which involves predicting the onset of diabetes within 5 years in Pima Indians given basic medical details. The dataset is owned by National Institute of Diabetes and Digestive and Kidney Diseases, and donated by Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu), Research Center, RMI Group Leader Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road Laurel, MD 20707 (301) 953-6231.

The dataset has 768 rows and 9 columns. All of the values in the le are numeric, speci cally oating point values. Figure 1 illustrates a small number of samples of the rst few rows of the dataset and Table 1 lists the meanings of each segment in each line of the dataset.

Figure 1: Screenshot of \Pima Indians Diabetes Dataset" samples.

Table 1: Meanings of each segment of the \Pima Indians Diabetes Dataset".

 Index Meanings of each segment of the dataset 1. Number of times pregnant;

2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test;

 3. Diastolic blood pressure (mm Hg); 4. Triceps skin fold thickness (mm); 5. 2-Hour serum insulin (mu U/ml); 6. Body mass index (weight in kg/(height in m)2); 7. Diabetes pedigree function; 8. Age (years); 9. Class variable (0 or 1).

As we can see from the data structure of the \Pima Indians Diabetes Dataset", it can be formulated as a binary classi cation problem, where the prediction is either 0 (no diabetes) or 1 (diabetes).

In this project, we make use of the input data from index 1 to index 8 for predicting whether the patients have the disease of diabetes or not (in index 9). The dataset, \diabetes dataset.csv", is split to training set, x, and testing set, x0, with the proportion 8:2, which means train your logistic regression model with 80% of the total samples, then test your model for the classi cation results using the remaining 20% samples. In training the logistic regression model, you have to update the value of parameters, , for many iterations until they reach the global minimum. You will be given a small set of testing samples, \test samples.csv", without class labels (0 and 1) to test your model's performance.

In brief, the di erence between \diabetes dataset.csv" and \test samples.csv" is that \diabetes dataset.csv" contains the segment information of class labels while \test samples.csv" does not, which is illustrated in Figure 2.

Figure 2: Comparison of segment information between \diabetes dataset.csv" and \test samples.csv".

(2). The list of actual predicting classes in 0 and 1 of your model tested on the dataset given by \test samples.csv". For instance, the output should be a series of 0 and 1, e.g., [0; 0; 1; 0; 1; 1; ; 0; 1; 1].

Requirement

1. You must work independently without seeking cooperation from others, but you can look for information and examples from Google and other sources;

2. You can use any programming language that you are skilled with, e.g., C, C++, Matlab, and Java, not only restricted to Python, to nish this project;

3. You must implement the algorithm step-by-step on your own. It is not allowed to use the existing libraries or functions from other tools or packages;

4. You can compare your classi cation results with Scikit-Learn and the example code is given below, but it is not compulsory.

from sklearn import linear model

logreg = linear model.LogisticRegression() logreg. t(y, y')

Project Report Writing

1. Use the report template to nish the project report and make some changes in the format and structure of the template if necessary;

2. Write the project report in an academic way. Generally speaking, an academic report should include several parts, Abstract, Introduction, Problem De nition, Methodology, Experiments, Conclusion, Reference and Appendix ;

3. Well organize your project report following a clear logical order;

4. Present experiments in a nice way. Visualized ( gures) and numerical (tables of statistic) results are highly appreciated. Compare the results with di erent

6

parameter settings if necessary. Say for example, you can compare the results using di erent numbers of training iterations, k, and di erent learning rate, ;

5. Plot and illustrate your system model's architecture if applicable;

6. Enclose you code in the last Appendix part with brief description and comments on the function of di erent parts of your code. Similarity of codes will be checked, and similar codes from di erent students will be penalized in the project scores.

Project Report Submission

The due date of submitting Project 1 Report is 8th March, 2018 and please submit the hardcopy with your name and student number on it to Dr. ZHAO during the lecture.

 7 Appendix :  (8)