Scraping - Loading dynamic buttons - Ruby

I'm trying to web-scrape the "Fresh & Chilled" products of Waitrose & Partners using Ruby and Nokogiri.
In order to load more products, I'd need to click on 'Load More...', which dynamically loads more products without altering the URL or redirecting to a new page.
How do I 'click' the "Load More" button to load more products?
I think it is a dynamic website, as items are loaded dynamically after clicking the "Load More..." button and the URL is not altered at all (so no pagination is visible).
Here's the code I've tried so far, but I'm stuck on loading more items. My guess is that the DOM loads by itself, but you cannot actually click the button, because clicking it calls a JavaScript method that loads the rest of the items.
require "csv"
require "json"
require "nokogiri"
require "open-uri"
require "pry"
def scrape_category(category)
CSV.open("out/waitrose_items_#{category}.csv", "w") do |csv|
headers = [:id, :name, :category, :price_per_unit, :price_per_quantity, :image_url, :available, :url]
csv << headers
url = "https://www.waitrose.com/ecom/shop/browse/groceries/#{category}"
html = open(url)
doc = Nokogiri::HTML(html)
load_more = doc.css(".loadMoreWrapper___UneG1").first
pages = 0
while load_more != nil
puts pages.to_s
load_more.content # Here's where I don't know how to click the button to load more items
products = doc.css(".podHeader___3yaub")
puts "products = " + products.length.to_s
pages = pages + 1
load_more = doc.css(".loadMoreWrapper___UneG1").first
end
(0..products.length-1).each do |i|
puts "url = " + products[i].text
end
load_more = doc.css(".loadMoreWrapper___UneG1")[0]
# here goes the processing of each single item to put in csv file
end
end
def scrape_waitrose
categories = [
"fresh_and_chilled",
]
threads = categories.map do |category|
Thread.new { scrape_category(category) }
end
threads.each(&:join)
end
#binding.pry

Nokogiri is a way of parsing HTML. It's the Ruby equivalent of JavaScript's Cheerio or Java's Jsoup. This is actually not a Nokogiri question.
What you are confusing is the way to parse the HTML with the way to collect the HTML as delivered over the network. It is important to remember that lots of features, like your button clicking, are enabled by JavaScript. These days many sites, such as React sites, are built entirely by JavaScript.
So when you execute this line:
doc = Nokogiri::HTML(html)
It is the html variable you have to concentrate on. Your html is NOT the same as the HTML I would see when viewing the same page in my browser.
In order to do any sort of reliable web scraping, you have to use a headless browser that will execute the JavaScript files. In Ruby terms, that used to mean using Poltergeist to drive PhantomJS, a headless version of the WebKit browser. PhantomJS became unsupported once Puppeteer and headless Chrome arrived.
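For example, here is a minimal sketch using Watir to drive headless Chrome and then hand the rendered HTML to Nokogiri. It assumes the generated class names from your question (loadMoreWrapper___UneG1, podHeader___3yaub), which are likely to change whenever the site's front end is redeployed:

require 'watir'
require 'nokogiri'

browser = Watir::Browser.new :chrome, headless: true
browser.goto 'https://www.waitrose.com/ecom/shop/browse/groceries/fresh_and_chilled'

# Keep clicking "Load More" until the button disappears from the DOM.
load_more = browser.element(class: 'loadMoreWrapper___UneG1')
while load_more.exists?
  load_more.click
  sleep 2 # crude wait; give the newly requested products time to render
  load_more = browser.element(class: 'loadMoreWrapper___UneG1')
end

# The rendered page now contains every product, so Nokogiri can see them all.
doc = Nokogiri::HTML(browser.html)
products = doc.css('.podHeader___3yaub')
puts "products = #{products.length}"
browser.close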

Related

Ruby Watir: Display hyperlinks from a chrome search

Objective: to print out an array of hyperlinks returned for a couple of keywords entered into a search engine (Google, driven through Chrome in this case), utilizing Watir.
The thing is, my code below worked fine over the last year. I tested it many times, but something has changed within the confines of Chrome, or the elements or selectors, or perhaps Watir has deprecated some functionality, even though I'm using very few lines of Ruby code. It just hangs. (Edit: no results are produced in the terminal; it doesn't actually hang.) I've gone over this painstakingly and found only one other link with a similar case, which is where I got my previous code from, but it no longer works: Show searched links url in command line with watir
require 'watir'
browser = Watir::Browser.new :chrome
browser.goto 'google.com'
browser.text_field(title: 'Search').set 'watir + ruby'
browser.button(name: 'btnK').click
sleep 3
links = browser.h3s(class: 'r').map(&:link)
hrefs = links.map(&:href)
links.each { |link| puts "#{link.data_href || link.href}" }
sleep(900)
Desired output (used to work):
[Screenshot: terminal displaying the hyperlinks]
Working script that now produces search-result hyperlinks (credit to Justin Ko):
require 'watir'
browser = Watir::Browser.new :chrome
browser.goto 'google.com'
browser.text_field(title: 'Search').set 'watir + ruby'
browser.button(name: 'btnK').click
sleep 3
puts browser.title
links = browser.divs(class: 'r').map(&:link)
hrefs = links.map(&:href)
# hrefs.each { |href| puts href }
links.each { |link| puts "#{link.data_href || link.href}" }
sleep(900)

Want to get taobao's list of product URLs on a search result page without the taobao API

I want to get the list of product URLs on taobao's search result page without using the taobao API.
I tried the following Ruby script.
require "open-uri"
require "rubygems"
require "nokogiri"
url='https://world.taobao.com/search/search.htm?_ksTS=1517338530524_300&spm=a21bp.7806943.20151106.1&search_type=0&_input_charset=utf-8&navigator=all&json=on&q=%E6%99%BA%E8%83%BD%E6%89%8B%E8%A1%A8&cna=htqfEgp0pnwCATyQWEDB%2FRCE&callback=__jsonp_cb&abtest=_AB-LR517-LR854-LR895-PR517-PR854-PR895'
charset = nil
html = open(url) do |f|
charset = f.charset
f.read
end
doc = Nokogiri::HTML.parse(html, nil, charset)
p doc.xpath('//*[#id="list-itemList"]/div/div/ul/li[1]/div/div[1]/div/a/#href').each{|i| puts i.text}
# => 0
I want to get a list of URLs like https://click.simba.taobao.com/cc_im?p=%D6%C7%C4%DC%CA%D6%B1%ED&s=328917633&k=525&e=lDs3%2BStGrhmNjUyxd8vQgTvfT37ERKUkJtUYVk0Fu%2FVZc0vyfhbmm9J7EYm6FR5sh%2BLS%2FyzVVWDh7%2FfsE6tfNMMXhI%2B0UDC%2FWUl0TVvvELm1aVClOoSyIIt8ABsLj0Cfp5je%2FwbwaEz8tmCoZFXvwyPz%2F%2ByQnqo1aHsxssXTFVCsSHkx4WMF4kAJ56h9nOp2im5c3WXYS4sLWfJKNVUNrw%2BpEPOoEyjgc%2Fum8LOuDJdaryOqOtghPVQXDFcIJ70E1c5A%2F3bFCO7mlhhsIlyS%2F6JgcI%2BCdFFR%2BwwAwPq4J5149i5fG90xFC36H%2B6u9EBPvn2ws%2F3%2BHHXRqztKxB9a0FyA0nyd%2BlQX%2FeDu0eNS7syyliXsttpfoRv3qrkLwaIIuERgjVDODL9nFyPftrSrn0UKrE5HoJxUtEjsZNeQxqovgnMsw6Jeaosp7zbesM2QBfpp6NMvKM5e5s1buUV%2F1AkICwRxH7wrUN4%2BFn%2FJ0%2FIDJa4fQd4KNO7J5gQRFseQ9Z1SEPDHzgw%3D; however, I am getting 0.
What should I do?
I don't know taobao.com, but the page seems to be running lots of JavaScript, so the content probably cannot be retrieved by a client without JavaScript capabilities. Instead of open-uri, you could try the selenium-webdriver gem:
https://rubygems.org/gems/selenium-webdriver/versions/2.53.4
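A minimal sketch of that approach: drive a real Chrome so the JavaScript runs, then pull the hrefs out of the rendered page. The XPath is a simplified assumption based on the one in the question, and the URL is the question's search URL shortened to its q parameter:

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :chrome
driver.get 'https://world.taobao.com/search/search.htm?q=%E6%99%BA%E8%83%BD%E6%89%8B%E8%A1%A8'
sleep 3 # crude wait for the JavaScript to render the result list

# Collect every link inside the rendered result list and print its href.
links = driver.find_elements(:xpath, '//*[@id="list-itemList"]//a')
links.each { |a| puts a.attribute('href') }
driver.quit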

Scraping successive pages until the last page using Nokogiri and Mechanize

I am trying to scrape multiple pages from a website. I want to scrape a page, then click on next, get that page, and repeat until I hit the end.
I wrote this so far:
page = agent.submit(form, form.buttons.first) # submitting a form

while lien = page.link_with(:text => 'Next') # while I have a next link on page, keep scraping
  html_body = Nokogiri::HTML(body)
  links = html_body.css('.list').xpath("//table/tbody/tr/td[2]/a[1]")
  links.each do |link|
    purelink = link['href']
    puts purelink[/codeClub=([^&]*)/].gsub('codeClub=', '')
    lien.click
  end
end
Unfortunately, with this script I keep scraping the same page in an infinite loop... How can I achieve what I want to do?
I would try this: replace lien.click with page = lien.click.
It should look more like this:
page = form.submit form.button
scrape page
while link = page.link_with :text => 'Next'
  page = link.click
  scrape page
end
Also, you don't need to parse the page body with Nokogiri; Mechanize has already done that for you.
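Putting it together, here is a sketch of the corrected loop, assuming the same 'Next' link text and table layout as in the question:

page = agent.submit(form, form.buttons.first)

loop do
  # A Mechanize::Page already responds to #xpath/#search, so no extra parsing step.
  page.xpath("//table/tbody/tr/td[2]/a[1]").each do |link|
    puts link['href'][/codeClub=([^&]*)/, 1]
  end

  next_link = page.link_with(:text => 'Next')
  break unless next_link
  page = next_link.click # reassign page so the next iteration scrapes the new page
end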

Web Scraping with Nokogiri and Mechanize

I am parsing prada.com and would like to scrape data in the div class "nextItem" and get its name and price. Here is my code:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'

agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
html_doc = Nokogiri::HTML(page)
page = html_doc.xpath("//ol[@class='nextItem']")
page.each do {|i| fp.write(i.text + "\n")}
end
I get an error and no output. What I think I am doing is instantiating a Mechanize object and calling it agent.
Then creating a page variable and assigning it the URL provided.
Then creating a variable that is a Nokogiri object, with the Mechanize page passed in.
Then searching the page for all class references titled nextItem.
Then printing all the data contained there.
Can someone show me where I might have gone wrong?
Since Prada's website dynamically loads its content via JavaScript, it will be hard to scrape its content. See "Scraping dynamic content in a website" for more information.
Generally speaking, with Mechanize, after you get a page:
page = agent.get(page_url)
you can easily search items with CSS selectors and scrape for data:
next_items = page.search(".fooClass")
next_items.each do |item|
  price = item.search(".fooPrice").text
end
Then simply handle the strings or generate hashes as you desire.
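For instance, a sketch that collects the scraped fields into an array of hashes (the .fooClass, .fooName and .fooPrice selectors are placeholders, as above):

items = page.search(".fooClass").map do |item|
  {
    name:  item.search(".fooName").text.strip,
    price: item.search(".fooPrice").text.strip
  }
end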
Here are the wrong parts:
Check the block syntax again - use {} or do/end, but not both at the same time.
Mechanize#get returns a Mechanize::Page, which acts like a Nokogiri document; at the least it has search, xpath and css. Use those instead of trying to coerce the document into a Nokogiri::HTML object.
There is no need to require 'open-uri' and require 'nokogiri' when you are not using them directly.
Finally, consider brushing up on Ruby's basics before continuing with web scraping.
Here is the code with fixes:
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
page.search("//ol[@class='nextItem']").each do |i|
  fp.write(i.text + "\n")
end
fp.close

Mechanize: picking right submit from multiple in same form

I use Mechanize to loop through a table, which is paginated.
I have a problem with a form that holds multiple submit inputs. The input tags are used for pagination and are generated dynamically. When I loop through the pages I need to scrape, I have to pick the right input, since only one of them will take me to the “next page”. The right tag can be identified by different attributes such as name, class, value, etc. My problem, though, is that I can't figure out how to tell Mechanize which one to use.
I tried this:
require 'mechanize'
require 'yaml'

url = "http://www.somewhere.com"
agent = Mechanize.new
page = agent.get(url)

loop do
  puts "some content from site using nokogiri"
  if next_page = page.form_with(:action => /.*/)
    page = next_page.submit(page.form_with(:action => /.*/).submits[3])
  else
    break
  end
end
I got this from this post, http://rubyforge.org/pipermail/mechanize-users/2008-November/000314.html, but as mentioned, the number of tags changes, so just picking a hardcoded index into the submits is not a good idea.
What I would like to know is whether there is a way to do something like this:
loop do
  puts "some content from site using nokogiri"
  if next_page = page.form_with(:action => /.*/)
    page = next_page.submit(:name => /the_right_submit_button/)
  else
    break
  end
end
or something like that, maybe with a css or xpath selector.
I usually use form.button_with to select the right button to click:
form = results_page.forms[0]
results_page = form.submit(form.button_with(:name => 'ctl00$ContentBody$ResultsPager$NextButton'))
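The *_with helpers accept regexes as well as strings, so something close to the question's hypothetical name pattern should also work; a sketch:

form = page.form_with(:action => /.*/)
next_button = form.button_with(:name => /the_right_submit_button/)
page = form.submit(next_button) if next_button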
