
INF 559: Introduction to Data Management

HOMEWORK 2: FILE FORMATS

(CSV, JSON & TABLESCHEMA)

100 points

Introduction:

The goal of this assignment is to familiarize you with the different file formats we studied in lecture and the pros and cons of each. You will also get exposure to Python libraries for reading and writing different file types, see some of the differences and issues that arise when reading and writing different kinds of data files, and learn how to validate datasets using Table Schemas.

We will be working with the same information encoded in two different file formats that we will be covering in class. The first is the comma-separated values (CSV) format. In this format, each record is a row of values, and the values are separated by commas. Sometimes a different character, such as a tab or '|', is used to separate the columns. Often a "header" row gives the names of the columns. Usually every row stores the same number and type of values, but this is not always the case: sometimes two tables are stored in the same CSV file, which makes it harder to work out which values belong to which table.
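As an illustration of the delimiter and header-row ideas above, here is a minimal sketch using Python's standard csv module. The data is dummy text (not taken from vaccine.csv), and the helper name read_rows is only an assumption made for this example.

```python
import csv
import io

# Dummy data, not taken from vaccine.csv: a header row, then '|'-separated values.
raw = "iso_code|country|doses\nAAA|Aland|10\nBBB|Borduria|25\n"

def read_rows(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

rows = read_rows(raw, delimiter="|")
print(rows[0]["country"])
print(len(rows))
```

Note that every value comes back as a string (rows[1]["doses"] is "25", not 25), because CSV itself carries no type information.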

We will also be using files in JavaScript Object Notation, or JSON format.

A JSON document closely resembles a Python dictionary written in a standard text form that other programming environments, such as JavaScript programs on the web, can also read. JSON is the format used to exchange much of the data handled by Web APIs and cloud services because it is independent of any one platform or language.
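The correspondence between a Python dictionary and JSON text can be seen in a short round trip with the standard json module. This is only a sketch on a made-up record; the field names are placeholders.

```python
import json

# A made-up record; the keys are placeholders, not fields from the assignment data.
record = {"iso_code": "AAA", "doses": [10, 25], "verified": True, "note": None}

# Serialize the dictionary to a JSON string, then parse it back.
text = json.dumps(record, indent=2)
restored = json.loads(text)

print(text)
print(restored == record)
```

The JSON text is close to the Python literal but not identical: True is written as true and None as null, and JSON object keys are always strings.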

Python provides a library that reads and writes CSV files either one row at a time or all at once into a Python structure. To handle JSON, Python provides a library that converts a Python dictionary into a string in JSON format, which can then be written to a file. We expect you to consult online documentation and Stack Overflow to find the appropriate libraries for reading and writing the CSV and JSON file formats.
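For orientation, the sketch below assumes the standard-library csv and json modules are the libraries meant here; other packages would work as well. It writes dummy rows to temporary files and reads them back, and the file names are placeholders.

```python
import csv
import json
import os
import tempfile

rows = [["AAA", 1.5], ["BBB", 15.5]]  # dummy data

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "example.csv")    # placeholder file name
    json_path = os.path.join(tmp, "example.json")  # placeholder file name

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(rows[0])     # one row at a time
        writer.writerows(rows[1:])   # or several rows at once

    with open(csv_path, newline="") as f:
        csv_back = list(csv.reader(f))

    # json.dump writes the JSON text straight to the file object.
    with open(json_path, "w") as f:
        json.dump({"data": rows}, f)
    with open(json_path) as f:
        json_back = json.load(f)

print(csv_back)
print(json_back)
```

The numbers come back from the CSV file as strings, while the JSON file preserves them as numbers.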

Assignment Description:

In this assignment, we are going to read and write information in CSV and JSON format. One particular aspect we are going to explore is how to describe what is in the file we write (its metadata), so that when we give the file to someone else, they can figure out what is inside it. In this example, the metadata we are concerned with is the name of each column, a description of the file contents, and information about who created the data (the author, the date, and the author's organization). You have been provided with two files, vaccine.csv and vaccine.json, that contain the same data in different formats. vaccine.csv is a subset of the COVID-19 World Vaccination Progress dataset on Kaggle.

Note: All pictures and screenshots show dummy data and should not be used to check the results in your outputs.

Task 1 (15 Points):

Write a function task1 that reads in the CSV file vaccine.csv. For every unique iso_code, calculate the average daily_vaccinations_per_million (Note: rounded to 1 decimal place). Write the results, sorted alphabetically by iso_code (from A to Z), to a comma-separated CSV file named task1.csv. This file should contain just the data, without column headers.
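The general technique of averaging a numeric field per key, rounding, and sorting can be sketched as follows. This is an illustration on dummy rows and not a reference solution: the function name average_by_group and the field names are hypothetical, and skipping blank values is an assumption, since the assignment does not say how missing values should be treated.

```python
from collections import defaultdict

# Dummy rows (not from vaccine.csv); values are strings, as the csv module yields them.
rows = [
    {"group": "BBB", "value": "10"},
    {"group": "AAA", "value": "1.5"},
    {"group": "BBB", "value": "21"},
    {"group": "AAA", "value": ""},   # a missing value
]

def average_by_group(rows, key, field):
    """Return (key, rounded average) pairs sorted by key."""
    totals = defaultdict(lambda: [0.0, 0])   # key -> [running sum, count]
    for row in rows:
        if row[field] == "":
            continue  # assumption: blanks are skipped rather than counted as 0
        totals[row[key]][0] += float(row[field])
        totals[row[key]][1] += 1
    return [(k, round(s / n, 1)) for k, (s, n) in sorted(totals.items())]

result = average_by_group(rows, "group", "value")
print(result)
```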

Task 2 (15 points):

Write a function task2 that repeats the steps of Task 1. In addition, add a header row that contains the column names. Save the result in a file called task2.csv.

Task 3 (30 points):

Write a function task3 that reads in the JSON version of the data, computes the new values (as in Task 1), and writes out a JSON version to a file called task3.json that includes all of the output data and the metadata.
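One possible way to hold data and metadata in a single JSON document is sketched below. The top-level keys 'metadata', 'info', 'columns' and 'data' follow the names used in the Task 4 description; the keys inside 'info' and all values are placeholders assumed for illustration, and the required layout may differ.

```python
import json

# Dummy document; every value here is a placeholder.
document = {
    "metadata": {
        "info": {
            "description": "Dummy example of averaged values per group",
            "author": "Jane Doe",
            "date": "2021-01-01",
            "organization": "Example Org",
        },
        "columns": ["group", "average"],
    },
    "data": [["AAA", 1.5], ["BBB", 15.5]],
}

text = json.dumps(document, indent=2)
loaded = json.loads(text)

# The nesting lets a reader find the column names without guessing.
header = loaded["metadata"]["columns"]
print(header)
print(loaded["data"][0])
```

Because JSON supports nesting, the descriptive fields and the tabular rows can live side by side without any ad hoc conventions.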

Task 4 (25 points):

Write a function task4 that reads in the JSON file you created in Task 3 (task3.json) and creates a single CSV file called task4.csv that combines the information from ['metadata']['info'] with the average daily_vaccinations_per_million data. For the latter, take the column header from ['metadata']['columns'] and fill the rows with the average daily_vaccinations_per_million data from ['data'].

Note: Also include a short note (2-3 lines) stating which format (CSV or JSON) you conclude is more suitable for storing the average daily_vaccinations_per_million data together with the metadata in this task, and why.

(Extra Credit (10 points): Write a Python script that reads the data back from the task4.csv file you just created and prints it to the Python notebook console.)

Task 5 (15 points):

We would like you to explore the tableschema package, which can generate a schema from a CSV file. Write a function task5 that reads your vaccine.csv file and generates a schema for it in a file named 'schema.json'. Please refer to https://github.com/frictionlessdata/tableschema-py for documentation on how to use tableschema and what a schema looks like. For the steps to install it in the Google Colab environment, see Stack Overflow.

Coding Environment: You can use any IDE of your choice.

Submission on Blackboard:

A single zip file, FirstName_LastName_hw2.zip, containing the files below:

a. FirstName_LastName_hw2.ipynb

b. All output files – task1.csv, task2.csv, task3.json, task4.csv, schema.json

Grading Criteria

Only submissions written for Python 3.5 or later will be accepted for this coding assignment.
If the program does not run, there will be a 50% penalty for each task. In that case, grading will be based on the submitted output files.
If the resulting output (iso_code and avg_daily_vaccinations_per_million) is not sorted by iso_code, there will be a 20% penalty for each task.
Late homework is accepted up to 24 hours after the HW deadline, with a 20% penalty. No credit will be given for submissions made more than 24 hours after the deadline.

Important Note:

Submitted work must be your own. Don’t share your code with anyone, and start early!
