**STAT6118 Complex Survey Data Analysis**

**Assignment**

Submit an electronic copy (a **single file**) via the STAT6118 Blackboard website using the TurnItIn software (in the Assignments folder, select View/Complete to submit your report). A scanned handwritten document is not allowed. You may submit only one document to TurnItIn. Note that the file must be smaller than 10 MB. If your Word file is larger than this, try converting it to a readable PDF or saving your images (graphs, plots, …) as JPEG (instead of BMP).

It is the policy of the Department of Social Statistics that coursework is anonymous. Your Student ID Number must appear on the first page of your Word or PDF document. To maintain anonymity, please do not put your name on any part of your submission except the lower part of the Submission Form, which is removed before the marker sees the coursework.

Students are encouraged to discuss and exchange ideas, since this is an important part of the educational process. However, it is not acceptable to read and take ideas for your coursework from another student’s finished work. **It is very important that you carefully read Section 5** (Academic Integrity and Referencing).

Make sure that your assignment fits in a **single PDF document**.

A scanned handwritten document is not allowed. Your **Student ID Number must appear** on the first page of your Word or PDF document. Make sure that you have three sections called Part 1, Part 2 and Part 3. The subsections 1a), 1b), 2a), 2b) and 2c) should also be clearly labelled. The maximum number of words is 6000.

Information about coursework submission, the penalty for late submission, the policy for over-length work, the procedure for coursework extensions, feedback, and academic integrity and referencing can be found in the module outline (available on Blackboard). **It is very important that you carefully read the module outline**, because it contains additional important information about this assignment.

It is recommended to use STATA for this assignment. However, you can use R instead of STATA (at your own risk), if you prefer.

**Your Assignment**

The data file called samprj.dta contains an extract from the Brazilian Family Budget Survey 2002/2003 for the state of Rio de Janeiro, in Brazil. Observations in this file correspond to residents in the participating households. The original dataset has been pre-processed to remove a few cases of households containing records for absent members and to select the relevant variables for analysis. Otherwise, these are the real survey data for the target region.

The variable person contains the label of each person within the household, and the reference person of the household is always labelled 1 in this variable. A description of the variables can be found in the Excel file “samprj_variables.xls”, available on Blackboard.

The sampling design is stratified two-stage sampling of households.

**Stratification** of PSUs by state and education of heads of households (average at PSU level).

**Primary sampling units** (PSUs) are census enumeration areas, sampled with probability proportional to size (PPS), where size is the number of households in the census.

**Secondary sampling units** (SSUs) are households, sampled by simple random sampling (SRS) within each PSU.
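As a rough illustration of what this design implies for weighting, the sketch below computes a household's overall inclusion probability as the product of the two stage probabilities. It assumes a fixed number of PSUs drawn per stratum with probability proportional to the census household count, and it uses that same count as the SRS frame at the second stage. All function and parameter names are hypothetical, the numbers are made up, and the weights released with the survey may include further adjustments not described here.

```python
def household_inclusion_probability(n_psus_sampled, psu_size, stratum_size,
                                    n_households_sampled):
    """Two-stage inclusion probability: PPS for PSUs, SRS for households.

    All arguments are hypothetical illustration inputs:
      n_psus_sampled       -- PSUs drawn in the stratum
      psu_size             -- census household count of this PSU (PPS size measure)
      stratum_size         -- total census household count over the stratum's PSUs
      n_households_sampled -- households drawn by SRS within the PSU
    """
    # Stage 1: probability proportional to size (must not exceed 1).
    p_psu = n_psus_sampled * psu_size / stratum_size
    if p_psu > 1:
        raise ValueError("PSU is too large for PPS; it would be a certainty unit")
    # Stage 2: simple random sampling of households within the selected PSU.
    p_household = n_households_sampled / psu_size
    return p_psu * p_household


# Made-up numbers, for illustration only.
pi = household_inclusion_probability(
    n_psus_sampled=10, psu_size=200, stratum_size=50_000, n_households_sampled=8
)
design_weight = 1 / pi  # number of households each sampled household represents
```

Under these assumptions the PSU size cancels in the product, so households in the same stratum share the same inclusion probability whenever the same number of households is taken per PSU.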

**Achieved sample sizes**:

**Part 1 (Descriptive statistics)**

**1a)** Consider the proportion of households having microwave ovens (microwav) by education level (educatio) of the reference person (person = 1). In particular, consider the two proportions for heads with the highest and lowest education levels, respectively. You need to address both of the following:

1. Estimate the difference between the two proportions, taking the sampling design into account.

2. Test the hypothesis “*the two proportions are equal*” against the alternative “*the two proportions are unequal*”, taking the sampling design into account.

**1b)** Apply two different tests of independence between the number of bathrooms in the household (nbathrms) and the ethnic group of the reference person (ethngrp). If you think it is necessary, you can recode ethngrp by combining groups.

**For tasks 1a) and 1b), you should explain how you took the sampling design into account, by defining the appropriate estimator and test using analytic expressions (as in the lecture slides). You should use the STATA “svy” procedures. You should also briefly describe your STATA code.**

**Part 2 (Modelling)**

**2a)**

Fit a logistic regression model to the indicator of having a credit card (crdcard) for persons aged 20 or above, using educatio, ethngrp, sex and income as predictors. You should use an aggregated approach that takes the design into account.

1. Is sex a relevant determinant after you control for the other covariates?

2. Re-estimate the final fitted model without allowing for the sampling design. How do the results change?

**2b)**

Fit and interpret a model for the total monthly income (totincom) using as independent variables sex, ethngrp, educatio and age. Note that

- ‘no income’ is represented by ‘totincom = 0’ in this dataset,
- the dependent variable of your model may be a transformation of totincom,
- and different predictors may be constructed based on the given independent variables.

You should use an aggregated approach that takes the sampling design into account.

**For tasks 2a) and 2b), you should explain how you took the design into account, by defining the appropriate estimators using analytic expressions (as in the lecture slides). You should use the STATA “svy” procedures (except for the fits that do not allow for the survey design). You should also briefly describe your STATA code.**

**2c)**

By using the STATA “svy” procedures in **2b)**, you should have used an aggregated approach to fit your regression model.

1. Describe a disaggregated approach that could have been used for the total monthly income. Using the disaggregated approach you propose, fit your disaggregated model with the effects of your final model obtained in **2b)**. Briefly compare your results with **2b)**.

2. Describe a model-based aggregated approach that could have been used for the total monthly income. Compare this approach with the one used in **2b)**. Discuss the advantages and disadvantages. Fit this model and compare it with **2b)**.

**Part 3 (Nonresponse)**

The data (DataCPS.CSV) are extracted from the September 1976 Current Population Survey in the USA. The units are individual persons. We assume that stratified simple random sampling has been used. The population size is N = 46049. This does not correspond to any sub-population of the USA. It should be viewed as a fictitious population for the purpose of this assignment. The variables are

- “stratum”: the stratum label. There are 3 geographical strata.
- “area”: compact geographic area.
- “person”: person number
- “age”: the age of the person.
- “agecat”: age category. 1 = 19 years and under; 2 = 20-24; 3 = 25-34; 4 = 35-64; 5 = 65 years and over.
- “race”: 1 = non-black; 2 = black
- “sex”: 1 = male; 2 = female
- “hour”: usual number of hours worked per week
- “wage”: usual amount of weekly wages (in 1976 US $). Contains missing values, labelled NA.

The variable “wage” contains missing values. Your aim is to estimate the population average of the variable “wage”. The stratum sizes are 12279 for stratum 1, 18420 for stratum 2 and 15350 for stratum 3.
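As a check on these figures, the three stratum sizes sum to the stated population size. Under full response, the usual stratified estimator of the population mean would weight each stratum sample mean by its population share. The notation ($N_h$, $W_h$, $\bar{y}_h$) is assumed here and may differ from the lecture slides. With missing wages, $\bar{y}_h$ is not directly available, which is what the non-response adjustment has to address.

```latex
N = \sum_{h=1}^{3} N_h = 12279 + 18420 + 15350 = 46049,
\qquad
W_h = \frac{N_h}{N} \approx (0.2667,\ 0.4000,\ 0.3333),
\qquad
\bar{y}_{st} = \sum_{h=1}^{3} W_h \, \bar{y}_h .
```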

Some population counts are given in the following table.

Using the data above, create weights that take the design and non-response into account, with the aim of estimating the population average of the variable “wage”. The assumptions about the response mechanism must be clearly stated and justified. You should describe and justify the approach you adopt. Provide your weighted estimate of the population average of the variable “wage”. Any statistical package can be used.
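To make the idea of a combined weight concrete, here is a minimal sketch of one possible adjustment, not the required method. It uses the strata themselves as weighting classes, which amounts to assuming that, within a stratum, respondents and non-respondents have the same expected wage. Classes built from auxiliary variables such as agecat, sex or race may be easier to justify. The function name and the toy records are hypothetical and are not taken from DataCPS.CSV. The sketch also assumes every stratum has at least one respondent.

```python
from collections import defaultdict

# Stratum population sizes given in the assignment.
STRATUM_SIZES = {1: 12279, 2: 18420, 3: 15350}


def weighted_mean_wage(records, stratum_sizes=STRATUM_SIZES):
    """Estimate the mean wage with design weights adjusted for non-response.

    `records` is a list of (stratum, wage) pairs, with wage set to None when
    missing. Assumes that, within each stratum, respondents and
    non-respondents have the same expected wage (to be justified).
    """
    n_sampled = defaultdict(int)
    n_respond = defaultdict(int)
    for stratum, wage in records:
        n_sampled[stratum] += 1
        if wage is not None:
            n_respond[stratum] += 1

    numerator = 0.0
    denominator = 0.0
    for stratum, wage in records:
        if wage is None:
            continue  # non-respondents carry no weight after adjustment
        design_weight = stratum_sizes[stratum] / n_sampled[stratum]
        response_rate = n_respond[stratum] / n_sampled[stratum]
        weight = design_weight / response_rate
        numerator += weight * wage
        denominator += weight
    return numerator / denominator


# Toy records (made up for illustration only).
toy = [(1, 100.0), (1, None), (2, 200.0), (2, 300.0), (3, None), (3, 150.0)]
estimate = weighted_mean_wage(toy)
```

With this choice the adjusted weights in each stratum sum to the stratum size, so the estimate reduces to the population-share-weighted average of the respondent means.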

