assignment 3B代写 In this assignment we are going to explore accessing data from different sources. You have already learned how to read text
Peter Dolan 3/6/2017
In this assignment we are going to explore accessing data from different sources. You have already learned how to
read text data in table format using (or the Tools…. Import Data-set… functions ). Now you will learn
how to access data from a web page, an XML data source, and data stored in an SQL database.
Problem 1. (8 points)assignment 3B代写
Web Scraping Background:
Questions (intermingle your answers)
The Web-site “http://www.50states.com/news/minn.htm” has a list of newspapers in Minnesota
- Use to obtain all HTML-lists from this webpage
- Identify the entry containing the list of newspapers
- Use , your knowledge of regular expressions, and whatever other techniques youdeem
necessary to extract the names of every newspaper.assignment 3B代写
- Extract the names of every town
- Generate a data-frame with two columns: One for the newspaper’s name, and the other for the town. Use to output the first 10 rows of your data-frame.
Problem 2. (8 points)assignment 3B代写
Create PDF in your applications with the Pdfcrowd HTML to PDF API
The basic strategy of extracting data from HTML (a web-page) is the same for tabular data (tables) as it is for lists. Look over the web page https://seer.cancer.gov/statfacts/html/thyro.html for some statistical data on Thyroid Cancer. In particular, look for the types of cancer, their number of occurrences, and the estimated deaths (this will appear as one table. For this problem you will want to use readHTMLTable()to scrape the data from the web-page.
to scrape the data from the web-page.
- Tryusing on the URL. Does the data you want appear?
- Use to get a local copy of the data in the URL and then use on that file.
Did that provide the data you want?
- Nowuse on the downloaded file. You should be able to extract the desired data this
- Store the table data in avariable named . The first column is superfluous– remove it. Now rename the remaining columns as “type”, “new”, and “death”. Assign row-names to the table using the cancer type. Then remove that column. Your data should have a format, somewhat like this (the table below is from 2015, not 2016):
|Breast Cancer (Female)||231,840||40,290|
|Lung and Bronchus Cancer||221,200||158,040|
|Colon and Rectum Cancer||132,700||49,700|
|Melanoma of the Skin||73,870||9,940|
|Kidney and Renal Pelvis Cancer||61,560||14,080|
Now the problem is that the all the data in your table is “character” (it’s the commas). You could try to use, but that won’t work (again, because of the commas). Remove them first using (look it up if you need to). Perhaps more than anything else in these assignments, this requires more independent understanding on your part. If you are not certain how to proceed, please contact one of the instructors to ask for some help.assignment 3B代写
- If you have managed to produce a 10-row by 2-col table with numeric entries, named rows, and named columns then create a bar plot showing the ratio of deaths (in 2016) to new cases (again in 2016). Based on this chart which cancer shows the best survival rate? (This is a bit unfair since new cases in 2016 are not really comparable to the death rate in 2016– but it’s not a horribleexaggeration)
Problem 3. (6 points)
- Use and to import the data from the URL into a data.frame.
- Use the shape file, http://www2.census.gov/geo/tiger/GENZ2010/gz_2010_27_700_00_500k.zip, and the techniques of chapter 15 to plot all 81 “Civil” entries from your SBA data on a map ofMinnesota
Problem 4. (8 points)
- What is the structure of your uncompressed file? (Ie, what are the directories, subdirectories, andfiles)?
- Extract all the text that is in bold assignment 3B代写
- Suppose you are hired as a TA for Introduction to Statistics and are asked to create a spreadsheet that details the questions asked and their point value. You decide it would be far more interesting to automate the process then to do it by hand. You begin to look carefully at the documents and realize that there are a few consistent patterns:
The question number and its point value are always bold
- The question number and its point value are always bold The question number always has a period in it
- The point value is always followed by the word ‘points’ assignment 3B代写
Demonstrate how to use R to extract this information from and turn it into a data.frame– the first column should contain the problems and be of type character, and the second should be of type numeric and contain the point values. Your data should look something like this:
|Problem Point Value|
|Problem Point Value|
- What is the total number of points that you extracted? How many points does the document think you need? If there is a difference where does the mistake lie?
Problem 5. (12 points)
- The SQL equivalent of the R command str() is DESCRIBE. We know that the database SUMMER_2015 has two tables. Let’s examine the structure of prices :
- The basic form of the SELECT command is: SELECT <cols> FROM <table>. Experiment with that. For example try the following SQL commands (you’ll need to R-ify them of course):
SELECT item,unit FROM inventory; SELECT unit FROM inventory;
SELECT unit, unit, item FROM inventory; assignment 3B代写
The * option will show all fields in a table. Try these too:
SELECT * FROM inventory; SELECT *, item FROM inventory;
3.You can control the name of a column in theoutput:
SELECT item AS ‘mega stuff’ FROM inventory;
SELECT item AS stuff, amount, unit AS ‘counting thing’ FROM inventory;
Try returning the items and their amount, but call the columns “junk_food” and “count”
- SQL has a large collection of functions and operators (see https://mariadb.com/kb/en/functions-and-operators/if you want to know more– but it’s pretty overwhelming, so don’t feel obligated to follow that link). For example:
dbGetQuery(con,”SELECT right(item,4) AS the_righiest FROM inventory”)
|## the_righiest ## 1 hips assignment 3B代写
## 2 eese
## 3 eese
## 4 corn
## 5 dogs
## 6 buns
Now you should: * Use
to store the contents of the
table in a data-frame called
Use R’s column.function on thecolumn into return the first 3 letters in the
Useand the SQL function
to do the same via SQL.
- SQL divides its commands into clauses. You have already beenusing the clause and theclause. SQL is picky and wants the clauses to go in the proper order. You are now going to see an example of the clause: assignment 3B代写
dbGetQuery(con,”SELECT * FROM inventory WHERE amount < 50″)
## item unit amount id ## 1 nacho cheese gallons 20 3
## 2 popcorn gallons 5 4
## 3 buns units 3 6
At this point you may find the tutorials at http://www.sqlclauses.com to be helpful. (although you may ignore alot of them).
SQL also has a limited ability to use regular expressions. The following command will find all rows in which contains an ‘n’:
dbGetQuery(con,”SELECT * FROM inventory WHERE item REGEXP ‘n'”)
## item unit amount id ## 1 nacho cheese gallons 100 2
## 2 nacho cheese gallons 20 3
## 3 popcorn gallons 5 4
## 4 buns units 3 6
Many variants of SQL even support a option:
This options is particularly helpful when searching through people’s names.
See if you can figure out how to return all the rows where the item name has an H OR, the amount is above 30.
- Two other clauses arehelpful: The clause allows you to only return a few results from your query: assignment 3B代写
dbGetQuery(con,“SELECT * FROM inventory LIMIT 2”)
## item unit amount id
## 1 Buffalo chips patty 1200 1
## 2 nacho cheese gallons 100 2
and the DISTINCT clause acts like R’s unique() function:
dbGetQuery(con,“SELECT distinct item from inventory”)
## 1 Buffalo chips
## 2 nacho cheese
## 3 popcorn
## 4 hot dogs
## 5 buns
However the real power in SQL comes from joins. Again, this is a complicated topic. This URL http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sqljoins.html has a good explanation (if you are interested), but our database server doesn’t do OUTER JOINS correctly… so some of this knowledge is,for the moment, theoretical).
In any event, R-ify the following commands:
SELECT * FROM inventory
SELECT * FROM prices
SELECT * FROM inventory,prices
- Noticethat using two tables in the clause created a result with 6 cross 4 The tricky bit is beingclever about your use of :
|##||1||Buffalo chips||patty||1200||1||1||12.43||Buy||from Ritchie|
|##||2||nacho cheese||gallons||100||2||2||100.19||Version A|
You will notice a few new things. Since both tables had an field– we can’t use “id” alone to specify its values. Instead we need to use SQL’s table.field notation. Second, this time we only had 3 rows in our result. Pay close attention to this next command. It uses everything that we have learned, introduces a new clause and shows how new fields can be constructed on the fly:
dbGetQuery(con,“SELECT amount,item,price as unit_price, price*amount as total_price F ROM inventory,prices WHERE inventory.id=prices.id AND amount > 50 ORDER BY amount”)assignment 3B代写
## Warning in .local(conn, statement, …): Decimal MySQL column 2 imported as
## Warning in .local(conn, statement, …): Decimal MySQL column 3 imported as
## amount item unit_price total_price
## 1 100 nacho cheese 100.19 10019
## 2 1200 Buffalo chips 12.43 14916
See if you can produce a query that will return the unit, item, and price of everything that costs more than $100.00. Don’t leave out inventory.id=prices.id .