STAT 7008
Assignment 4
Deadline 1 Dec 2018
All questions must be answered only with Python programs.
Question 1 (Download Apple Products from Amazon.com) (60 marks)
This exercise uses Python Selenium and regular expressions to download the Apple products offered on amazon.com.
Amazon.com lists Apple products (some of which may merely be Apple-related) on 58 webpages, starting at:
https://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2407749011%2Ck%3AApple&page=1
- Construct the links in a
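A minimal sketch of the link construction, assuming the links are collected in a Python list and that the 58 pages differ only in the page number, which appears twice in the URL above. The names URL_TEMPLATE and links are hypothetical.

```python
# Template built from the first-page URL; both "page" parameters are
# assumed to change together from page to page.
URL_TEMPLATE = (
    "https://www.amazon.com/s?ie=UTF8&page={page}"
    "&rh=n%3A2407749011%2Ck%3AApple&page={page}"
)

# One link per page, numbered 1 to 58 inclusive.
links = [URL_TEMPLATE.format(page=p) for p in range(1, 59)]
```

Each entry of links can then be passed to driver.get() in turn.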
- Set up the Selenium server:
from selenium import webdriver
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")  # placeholder: use the location of your chromedriver.exe
- The product information on each page is stored in the li tags whose id begins with "result_". The products on each page are divided into Amazon-offered and sponsored products. We are interested only in the Amazon-offered products. A typical product may contain the following fields:
Display_Dimensions   4
Display_Size         4 inches
Display_Type         Retina
Manufacturer         Apple
More_Offer           $89.75(31 used & new offers)
Number_of_Review     1924
Offer_Price          $90.00
Product_detail       Apple iPhone 5c Unlocked Cellphone, 16GB, White
Special_Feature      dual-camera
Special_Offer        NaN
Rating               [3.6 out of 5 stars]
Use one of the following Selenium commands to obtain the product information:
driver.find_element_by_class_name(name)
    Elements that use the CSS class name
driver.find_element_by_css_selector(selector)
    Elements that match the CSS selector
driver.find_element_by_id(id)
    Elements with a matching id attribute value
driver.find_element_by_link_text(text)
    <a> elements that completely match the text provided
driver.find_element_by_partial_link_text(text)
    <a> elements that contain the text provided
driver.find_element_by_name(name)
    Elements with a matching name attribute value
driver.find_element_by_tag_name(name)
    Elements with a matching tag name (case insensitive; an <a> element is matched by 'a' and 'A')
driver.find_element_by_xpath("//div[@id='a']/div/a[@class='click']")
    Elements matched by the XPath: here, an a tag with class = click inside a div tag that is itself inside the div tag with id = a
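As a rough sketch of how such a command might be applied to the product list, the function below collects the visible text of every li tag whose id begins with "result_". It uses the generic find_elements call with the "css selector" strategy (the string behind By.CSS_SELECTOR) so that all matches come back at once. The function name product_texts and the selector are assumptions for illustration, and the sketch does not separate Amazon-offered from sponsored products.

```python
def product_texts(driver):
    """Map each result id to the lines of text shown for that product."""
    # "css selector" is the locator strategy string; the selector matches
    # every li tag whose id attribute starts with "result_" (an assumption
    # about the page layout described in the text).
    items = driver.find_elements("css selector", 'li[id^="result_"]')
    return {item.get_attribute("id"): item.text.splitlines() for item in items}
```

The returned lists of lines are the raw material that still has to be sorted into fields.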
Since not all the products have the same set of fields, use regular expressions to match the collected data to the appropriate fields. For example, I use the following code to identify the 'More_Offer' field:
import re
r = re.compile(r'[$]?[0-9.]+\([0-9a-zA-Z\& ]+\)')
mo = list(filter(r.match, pri))  # pri holds the strings collected for a product
After the product information is obtained, store it in a pandas DataFrame or a dictionary.
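The same idea can be extended to several fields at once. The sketch below assumes the text of one product has already been split into lines. It reuses the More_Offer pattern from above, while the patterns for Offer_Price and Number_of_Review, the function name classify_lines, and the rule that the first unmatched line is the Product_detail are all assumptions for illustration.

```python
import re

MORE_OFFER = re.compile(r'[$]?[0-9.]+\([0-9a-zA-Z\& ]+\)')  # pattern from the text
OFFER_PRICE = re.compile(r'[$][0-9,]+\.[0-9]{2}')           # assumed, e.g. $90.00
REVIEW_COUNT = re.compile(r'[0-9,]+')                       # assumed, e.g. 1924

def classify_lines(lines):
    """Sort the text lines of one product into a dictionary of fields."""
    record = {}
    for line in lines:
        text = line.strip()
        if MORE_OFFER.fullmatch(text):
            record['More_Offer'] = text
        elif OFFER_PRICE.fullmatch(text):
            record['Offer_Price'] = text
        elif REVIEW_COUNT.fullmatch(text):
            record['Number_of_Review'] = int(text.replace(',', ''))
        elif text and 'Product_detail' not in record:
            # Assumption: the first line matching no pattern is the product title.
            record['Product_detail'] = text
    return record
```

A list of such dictionaries, one per product, can be passed to pandas.DataFrame, which fills fields missing from a product with NaN.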
- The complication is with the rating. The rating is stored in a pop-up tag that the Selenium commands above are unable to find. But if you read the page source carefully (please see amazon_current_page.txt), the rating is stored as the value of a span tag with class = 'a-icon-alt'. We adopt the following strategy to download the rating. Use driver.page_source to obtain the webpage as text. Once we have the text, we can use regular expressions to find the rating in it.
Note that although the ratings and the product information are interleaved, some products, such as sponsored products, may not have a rating. This makes it a problem to match ratings with products. The strategy is to match ratings with products by their positions in the text. We use match.group() and match.span() to build two dictionaries: the first holds the positions of the result_ ids and the second holds the positions of the a-icon-alt classes. Under normal circumstances, a result_ id must be followed by an a-icon-alt class. Try this strategy, or another one, to find the ratings of all products on the page.
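The position-matching strategy can be sketched as follows. The sketch assumes the page text contains markup of the form id="result_N" and <span class="a-icon-alt">X out of 5 stars</span>; the exact markup should be checked against amazon_current_page.txt. The function name match_ratings and both patterns are hypothetical.

```python
import re
from bisect import bisect_right

RESULT_RE = re.compile(r'id="(result_[0-9]+)"')                           # assumed markup
RATING_RE = re.compile(r'class="a-icon-alt">([0-9.]+ out of 5 stars)<')   # assumed markup

def match_ratings(page_text):
    """Assign each rating to the result_ id that most closely precedes it."""
    # Two dictionaries keyed by position in the text, built from span() and group().
    result_pos = {m.span()[0]: m.group(1) for m in RESULT_RE.finditer(page_text)}
    rating_pos = {m.span()[0]: m.group(1) for m in RATING_RE.finditer(page_text)}

    starts = sorted(result_pos)
    ratings = {rid: None for rid in result_pos.values()}  # None means no rating found
    for pos in sorted(rating_pos):
        i = bisect_right(starts, pos) - 1  # index of the last result_ id before this rating
        if i < 0:
            continue  # rating appears before any product
        rid = result_pos[starts[i]]
        if ratings[rid] is None:  # keep only the first rating after each id
            ratings[rid] = rating_pos[pos]
    return ratings
```

Products without a rating, such as sponsored ones, keep the value None, so the result can be merged with the product fields by result_ id.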
- Do the above parts for each of the 58 pages to collect all product information and store everything in a pandas
Question 2 (40 marks)
The website https://www.buzzfeednews.com/ contains a lot of current news. Our task is to extract all of the news. As you can see from the webpage, each news item contains several components: headline, content (of almost the same length as the headline) and author. To simplify matters, we will ignore the authors.
However, a complication with this website is that at the bottom of the page there is a “Show More” button. Every time you press it, more news appears. The problem is that we don’t know how many times we need to press the button to reach the last set of news.
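One possible way to deal with the unknown number of presses is to keep clicking until the button can no longer be found. The sketch below assumes the button can be located with a CSS selector, which has to be read from the page's HTML and is passed in as a parameter. The function name click_until_gone, the fixed pause between clicks, and the max_clicks safety limit are assumptions for illustration.

```python
import time

def click_until_gone(driver, selector, pause=2.0, max_clicks=1000):
    """Click the element matching selector until it is no longer displayed.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # "css selector" is the locator strategy string behind By.CSS_SELECTOR.
        buttons = [b for b in driver.find_elements("css selector", selector)
                   if b.is_displayed()]
        if not buttons:
            break  # no visible button left: the last set of news is shown
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the new items to load
    return clicks
```

The fixed pause is the simplest choice; if the page loads slowly, a longer pause or an explicit wait may be needed before the next click.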
Write a Python program to press the button automatically until the last news item is shown, and then extract all the news in one go. After the news items are retrieved, put each one in a tuple of the form:
('Duncan Hines Is Recalling 4 Cake Mixes Due To Possible Salmonella Contamination',
'Classic White, Yellow, and Confetti cake mixes are being pulled from shelves as the FDA investigates an ongoing salmonella outbreak.')
At the end, we will have a list of tuples of news.
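A small sketch of the final step, assuming the headlines and the contents have already been extracted as two lists of strings in the same page order. The function name pair_news is hypothetical, and the pairing is only correct if every headline has exactly one content entry.

```python
def pair_news(headlines, contents):
    """Return a list of (headline, content) tuples, skipping incomplete pairs."""
    pairs = []
    for headline, content in zip(headlines, contents):
        headline, content = headline.strip(), content.strip()
        if headline and content:  # drop pairs where either part is empty
            pairs.append((headline, content))
    return pairs
```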