Scraping successive pages until the last page using Nokogiri and Mechanize - ruby

I am trying to scrape multiple pages from a website. I want to scrape a page, then click on next, get that page, and repeat until I hit the end.
I wrote this so far:
page = agent.submit(form, form.buttons.first)
# submitting a form
while lien = page.link_with(:text=>'Next')
  # while I have a next link on page, keep scraping
  html_body = Nokogiri::HTML(body)
  links = html_body.css('.list').xpath("//table/tbody/tr/td[2]/a[1]")
  links.each do |link|
    purelink = link['href']
    puts purelink[/codeClub=([^&]*)/].gsub('codeClub=', '')
    lien.click
  end
end
Unfortunately, with this script I keep scraping the same page in an infinite loop. How can I achieve what I want to do?

Try replacing lien.click with page = lien.click, so that page refers to the newly loaded page on the next iteration.

It should look more like this:
page = form.submit form.button
scrape page
while link = page.link_with :text => 'Next'
  page = link.click
  scrape page
end
Also, you don't need to parse the page body with Nokogiri; Mechanize already does that for you.
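To make the loop above concrete, here is a minimal sketch. The helper names each_page and club_code are made up for illustration; they are not part of Mechanize. The loop only assumes that the page responds to link_with and that the returned link responds to click, which is how Mechanize pages and links behave.

```ruby
# Pull the codeClub value out of an href such as "club.php?codeClub=AB12&x=1".
# The second argument to String#[] selects capture group 1, so no gsub is needed.
# Returns nil when the href has no codeClub parameter.
def club_code(href)
  href.to_s[/codeClub=([^&]*)/, 1]
end

# Yield the first page and then every page reached through the "Next" link.
def each_page(page, next_text: 'Next')
  loop do
    yield page
    link = page.link_with(text: next_text)
    break unless link
    page = link.click # click returns the new page; without reassigning, the loop never advances
  end
end

# Hypothetical usage with Mechanize (agent and form set up as in the question):
#
#   page = agent.submit(form, form.buttons.first)
#   each_page(page) do |current|
#     current.search("//table/tbody/tr/td[2]/a[1]").each do |a|
#       puts club_code(a['href'])
#     end
#   end
```

Calling current.search directly uses the document Mechanize has already parsed, so the separate Nokogiri::HTML call goes away.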

Related

Scraping - Loading dynamic buttons

I'm trying to web-scrape the "Fresh & Chilled" products of Waitrose & Partners using Ruby and Nokogiri.
In order to load more products, I need to click 'Load More...', which dynamically loads more products without altering the URL or redirecting to a new page.
How do I 'click' the "Load More" button to load more products?
I think it is a dynamic website, since items are loaded after clicking the "Load More..." button and the URL does not change at all (so there is no visible pagination).
Here's the code I've tried so far, but I'm stuck on loading more items. My guess is that the DOM is loaded on its own, but the button cannot actually be clicked because it calls a JavaScript method that loads the rest of the items.
require "csv"
require "json"
require "nokogiri"
require "open-uri"
require "pry"
def scrape_category(category)
  CSV.open("out/waitrose_items_#{category}.csv", "w") do |csv|
    headers = [:id, :name, :category, :price_per_unit, :price_per_quantity, :image_url, :available, :url]
    csv << headers
    url = "https://www.waitrose.com/ecom/shop/browse/groceries/#{category}"
    html = URI.open(url) # Kernel#open no longer handles URLs on Ruby 3.0+
    doc = Nokogiri::HTML(html)
    load_more = doc.css(".loadMoreWrapper___UneG1").first
    pages = 0
    while load_more != nil
      puts pages.to_s
      load_more.content # Here's where I don't know how to click the button to load more items
      products = doc.css(".podHeader___3yaub")
      puts "products = " + products.length.to_s
      pages = pages + 1
      load_more = doc.css(".loadMoreWrapper___UneG1").first
    end
    (0..products.length-1).each do |i|
      puts "url = " + products[i].text
    end
    load_more = doc.css(".loadMoreWrapper___UneG1")[0]
    # here goes the processing of each single item to put in csv file
  end
end
def scrape_waitrose
  categories = [
    "fresh_and_chilled",
  ]
  threads = categories.map do |category|
    Thread.new { scrape_category(category) }
  end
  threads.each(&:join)
end
#binding.pry
Nokogiri is a way of parsing HTML. It's the Ruby equivalent of JavaScript's Cheerio or Java's Jsoup. This is actually not a Nokogiri question.
You are conflating two things: parsing the HTML, and collecting the HTML as delivered over the network. It is important to remember that many features, like your button click, are implemented in JavaScript. These days many sites, such as React sites, are built entirely by JavaScript.
So when you execute this line:
doc = Nokogiri::HTML(html)
It is the html variable you have to concentrate on. Your html is NOT the same as the HTML that I would see for the same page in my browser, because open-uri only fetches the initial response and never runs the page's scripts.
To scrape a page like this reliably, you have to use a headless browser that executes the page's JavaScript. In Ruby terms, that used to mean using Poltergeist to control PhantomJS, a headless WebKit browser. PhantomJS became unsupported when Puppeteer and headless Chrome arrived.
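As a rough sketch of what that could look like today, the helper below clicks a "load more" control until it disappears and then returns the rendered HTML. The answer does not name a current Ruby library, so the Ferrum gem in the usage comment is an assumption, as is the idea that the wrapper element itself is clickable (the real target may be a button inside it). The helper name load_all_html and its max_clicks and pause parameters are invented for illustration.

```ruby
# Click the "load more" control until it is gone or a safety cap is reached,
# then return the fully rendered HTML for parsing.
# `browser` must respond to at_css(selector) and body, and the node returned by
# at_css must respond to click. Ferrum::Browser is assumed to fit this shape.
def load_all_html(browser, button_selector, max_clicks: 50, pause: 1.0)
  clicks = 0
  while clicks < max_clicks && (button = browser.at_css(button_selector))
    button.click
    clicks += 1
    sleep pause # crude wait for the new items to render
  end
  browser.body
end

# Hypothetical usage (requires the ferrum gem and a local Chrome install):
#
#   require 'ferrum'
#   require 'nokogiri'
#   browser = Ferrum::Browser.new
#   browser.go_to("https://www.waitrose.com/ecom/shop/browse/groceries/fresh_and_chilled")
#   html = load_all_html(browser, ".loadMoreWrapper___UneG1")
#   doc  = Nokogiri::HTML(html)
#   puts doc.css(".podHeader___3yaub").length
#   browser.quit
```

Nokogiri still does the parsing here; the only change is that the HTML comes from a browser that has run the page's scripts.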

Ruby Mechanize Click Link without HREF

I'm trying to click on a link that doesn't have an HREF and has generated text. Here's my code:
agent = Mechanize.new
page = agent.get("http://stopandshop.shoplocal.com/")
links = page.search("div.rightSide > a")
result_page = Mechanize::Page::Link.new( links[1], agent, page ).click
puts result_page.title
pp result_page
I'm not getting any errors but the page is not changing (seen on pp result_page).
How do I click the first link and get to the next page using Mechanize?

Using Nokogiri to scrape search results from POST form

I would like to scrape search results from http://maxdelivery.com, but unfortunately, they are using POST instead of GET for their search form. I found this description of how to use Nokogiri and RestClient to fake a post form submission, but it's not returning any results for me: http://ruby.bastardsbook.com/chapters/web-crawling/
I've worked with Nokogiri before, but not for the results of a POST form submission.
Here's my code right now, only slightly modified from the example at the link above:
class MaxDeliverySearch
  REQUEST_URL = "http://www.maxdelivery.com/nkz/exec/Search/Display"
  def initialize(search_term)
    @term = search_term
  end
  def search
    if page = RestClient.post(REQUEST_URL, {
      'searchCategory'=>'*',
      'searchString'=>@term,
      'x'=>'0',
      'y'=>'0'
    })
      puts "Success finding search term: #{@term}"
      File.open("temp/Display-#{@term}.html", 'w'){|f| f.write page.body}
      npage = Nokogiri::HTML(page)
      rows = npage.css('table tr')
      puts "#{rows.length} rows"
      rows.each do |row|
        puts row.css('td').map{|td| td.text}.join(', ')
      end
    end
  end
end
Now (ignoring the formatting stuff), I would expect RestClient.post(REQUEST_URL, {...}) to return some search results if passed a 'good' search term, but each time I just get the search results page back with no actual results, as if I had pasted the URL into the browser.
Anyone have any idea what I'm missing? Or, just how to get back the results I'm looking for with another gem?
With the class above, I would like to be able to do:
s = MaxDeliverySearch.new("ham")
s.search #=> big block of search results objects to traverse
Mechanize is what you should use to automate a web search form. This should get you started using Mechanize.
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://maxdelivery.com')
form = page.form('SearchForm')
form.searchString = "ham"
page = agent.submit(form)
page.search("div.searchResultItem").each do |item|
  puts item.search(".searchName i").text.strip
end

Ruby Mechanize: Follow a Link

In Mechanize on Ruby, I have to assign a new variable to every new page I come to. For example:
page2 = page1.link_with(:text => "Continue").click
page3 = page2.link_with(:text => "About").click
...etc
Is there a way to run Mechanize without a variable holding every page state? like
my_only_page.link_with(:text => "Continue").click!
my_only_page.link_with(:text => "About").click!
I don't know if I understand your question correctly, but if it's a matter of looping through a lot of pages dynamically and processing them, you could do it like this:
require 'mechanize'
url = "http://example.com"
agent = Mechanize.new
page = agent.get(url) # Get the starting page
loop do
  # What you want to do on the page, e.g. extract something...
  item = page.parser.css('.some_item').text
  puts item # replace with whatever you do with the extracted text
  if link = page.link_with(:text => "Continue") # As long as there is still a next-page link...
    page = link.click
  else # If no link is left, break out of the loop
    break
  end
end
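If the sequence of links is known up front, as in the question's "Continue" then "About" example, one variable can be avoided per step by folding over the link texts. This is only a sketch: follow is a made-up helper, not a Mechanize method, and it assumes nothing beyond link_with and click.

```ruby
# Follow a sequence of link texts, starting from `page`, and return the final page.
# Raises early if a link is missing, rather than failing later with a
# NoMethodError on nil.
def follow(page, *link_texts)
  link_texts.reduce(page) do |current, text|
    link = current.link_with(text: text)
    raise ArgumentError, "no link with text #{text.inspect}" unless link
    link.click
  end
end

# Hypothetical usage with Mechanize:
#
#   agent = Mechanize.new
#   page  = follow(agent.get("http://example.com"), "Continue", "About")
```

As far as I know, the agent also tracks the most recently fetched page as agent.page, so agent.page.link_with(:text => "Continue").click followed by agent.page.link_with(:text => "About").click should work without any page variables; treat that as something to verify against your Mechanize version.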

Mechanize: picking right submit from multiple in same form

I use Mechanize to loop through a table, which is paginated.
I have a problem with a form that holds multiple submit inputs. The input tags are used as pagination and they are generated dynamically. When I loop through the pages I need to scrape, I need to be able to pick the right input, since only one of them will take me to the “next page”. The right tag can be identified by different attributes such as name, class, value etc. My problem is though, that I can’t find out how to tell mechanize which one to use.
I tried this:
require 'mechanize'
require 'yaml'
url = "http://www.somewhere.com"
agent = Mechanize.new
page = agent.get(url)
loop do
  puts "some content from site using nokogiri"
  if next_page = page.form_with(:action => /.*/)
    page = next_page.submit(page.form_with(:action => /.*/).submits[3])
  else
    break
  end
end
I took this from the post http://rubyforge.org/pipermail/mechanize-users/2008-November/000314.html, but, as mentioned, the number of submit tags changes, so picking a hardcoded index into submits is not a good idea.
What I would like to know is if there is a way like this:
loop do
  puts "some content from site using nokogiri"
  if next_page = page.form_with(:action => /.*/)
    page = next_page.submit(:name => /the_right_submit_button/)
  else
    break
  end
end
or something like that, maybe with a css or xpath selector.
I usually use form.button_with to select the right button to click:
form = results_page.forms[0]
results_page = form.submit(form.button_with(:name=>'ctl00$ContentBody$ResultsPager$NextButton'))
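Putting button_with inside the pagination loop from the question could look roughly like this. The helper name each_submitted_page is invented for illustration, and the /NextButton/ pattern in the usage comment is a placeholder for whatever identifies the right submit on the real site. Mechanize's *_with finders accept a Regexp as well as a String, which helps when part of the generated name changes.

```ruby
# Yield each page, then submit the first form that has a button whose name
# matches `button_name` (a String or Regexp). Stops when no such button exists,
# which is normally the case on the last page.
def each_submitted_page(page, button_name)
  loop do
    yield page
    form = page.forms.find { |f| f.button_with(name: button_name) }
    break unless form
    page = form.submit(form.button_with(name: button_name))
  end
end

# Hypothetical usage with Mechanize:
#
#   agent = Mechanize.new
#   each_submitted_page(agent.get(url), /NextButton/) do |current|
#     puts current.title
#   end
```

Selecting by name rather than by position in submits keeps the loop working when the number of pagination buttons changes from page to page.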
