I'm trying to click on a link that has no href and whose text is generated dynamically. Here's my code:
agent = Mechanize.new
page = agent.get("http://stopandshop.shoplocal.com/")
links = page.search("div.rightSide > a")
result_page = Mechanize::Page::Link.new(links[1], agent, page).click
puts result_page.title
pp result_page
I'm not getting any errors, but the page is not changing (as seen in the pp result_page output).
How do I click the first link and get to the next page using Mechanize?
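For context, here's the direction I've been experimenting with, on the assumption that the link's JavaScript just fetches another URL behind the scenes (the endpoint below is made up; the real one would come from the browser's network tab):

require 'mechanize'

agent = Mechanize.new
page = agent.get("http://stopandshop.shoplocal.com/")

# An <a> with no href gives Mechanize nothing to follow, which seems to be
# why the click above appears to do nothing. Instead, find the request the
# link's JavaScript fires and fetch it directly.
result_page = agent.get("http://stopandshop.shoplocal.com/circulars") # hypothetical endpoint
puts result_page.title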
I'm trying to web-scrape the "Fresh & Chilled" products of Waitrose & Partners using Ruby and Nokogiri.
In order to load more products, I need to click on 'Load More...', which dynamically loads more products without altering the URL or redirecting to a new page.
How do I 'click' the "Load More" button to load more products?
I think it is a dynamic website, as items are loaded dynamically after clicking the "Load More..." button and the URL is not altered at all (so no pagination is visible in the URL).
Here's the code I've tried so far, but I'm stuck at loading more items. My guess is that the DOM itself loads fine, but you cannot actually click the button, because clicking it calls a JavaScript method that loads the rest of the items.
require "csv"
require "json"
require "nokogiri"
require "open-uri"
require "pry"
def scrape_category(category)
CSV.open("out/waitrose_items_#{category}.csv", "w") do |csv|
headers = [:id, :name, :category, :price_per_unit, :price_per_quantity, :image_url, :available, :url]
csv << headers
url = "https://www.waitrose.com/ecom/shop/browse/groceries/#{category}"
html = open(url)
doc = Nokogiri::HTML(html)
load_more = doc.css(".loadMoreWrapper___UneG1").first
pages = 0
while load_more != nil
puts pages.to_s
load_more.content # Here's where I don't know how to click the button to load more items
products = doc.css(".podHeader___3yaub")
puts "products = " + products.length.to_s
pages = pages + 1
load_more = doc.css(".loadMoreWrapper___UneG1").first
end
(0..products.length-1).each do |i|
puts "url = " + products[i].text
end
load_more = doc.css(".loadMoreWrapper___UneG1")[0]
# here goes the processing of each single item to put in csv file
end
end
def scrape_waitrose
categories = [
"fresh_and_chilled",
]
threads = categories.map do |category|
Thread.new { scrape_category(category) }
end
threads.each(&:join)
end
#binding.pry
Nokogiri is a way of parsing HTML. It's the Ruby equivalent of JavaScript's Cheerio or Java's Jsoup. This is actually not a Nokogiri question.
What you are confusing is the method of parsing the HTML with the method of collecting the HTML as it is delivered over the network. It is important to remember that a lot of functionality, like your button clicking, is enabled by JavaScript. These days many sites, like React sites, are built entirely by JavaScript.
So when you execute this line:
doc = Nokogiri::HTML(html)
It is the html variable you have to concentrate on. Your html is NOT the same as the HTML I would see when viewing the same page in my browser.
In order to do any sort of reliable scraping of a JavaScript-driven site, you have to use a headless browser that actually executes JavaScript. In Ruby terms, that used to mean using Poltergeist to control PhantomJS, a headless version of the WebKit browser. PhantomJS became unsupported when Puppeteer and headless Chrome arrived.
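As a rough sketch (not a drop-in solution), the "Load More" loop might look like this with selenium-webdriver driving headless Chrome. The CSS class names are the ones from your question and may well have changed since:

require 'selenium-webdriver'
require 'nokogiri'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get("https://www.waitrose.com/ecom/shop/browse/groceries/fresh_and_chilled")

# Keep clicking "Load More" until the button disappears from the DOM.
loop do
  buttons = driver.find_elements(css: ".loadMoreWrapper___UneG1 button")
  break if buttons.empty?
  buttons.first.click
  sleep 2 # crude wait for the new products to render; tune as needed
end

# The DOM now contains all the products; hand it to Nokogiri as before.
doc = Nokogiri::HTML(driver.page_source)
puts doc.css(".podHeader___3yaub").length
driver.quit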
I want to retrieve my driving license number, issue_date, and expiry_date from this website (https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp). When I try to fetch them, I get the error Mechanize::ResponseCodeError: 500 => Net::HTTPInternalServerError for https://sarathi.nic.in:8443/nrportal/sarathi/DlDetRequest.jsp -- unhandled response.
This is the code that I wrote to scrape:
require 'mechanize'
require 'logger'
require 'nokogiri'
require 'open-uri'
require 'openssl'

OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

agent = Mechanize.new
agent.log = Logger.new "mech.log"
agent.user_agent_alias = 'Mac Safari 4'

Mechanize.new.get("https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp")
page = agent.get('https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp') # opening home page
page = agent.page.links.find { |l| l.text == 'Status of Licence' }.click # click the link
page.forms_with(:name => "dlform").first.field_with(:name => "dlform:DLNumber").value = "TN38 20120001119" # user input to text field
page.form_with(:name => "dlform").field_with(:name => "javax.faces.ViewState").value = "SUBMIT" # assigning submit button value
page.form(:name => "dlform", :action => "/nrportal/sarathi/DlDetRequest.jsp") # to specify the form I need
agent.cookie_jar.clear!
gg = agent.submit page.forms.last # submitting my form
It isn't working because you are clearing the cookies before submitting the form, which throws away the session (and with it all the input data you provided). I could get it working simply by removing that line:
...
page.forms_with(:name => "dlform").first.field_with(:name => "dlform:DLNumber").value = "TN38 20120001119" # user input to text field
form = page.form(:name => "dlform", :action => "/nrportal/sarathi/DlDetRequest.jsp")
gg = agent.submit form, form.buttons.first
Note that you do not need to set a value on the submit button; instead, pass the submit button itself when submitting the form.
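Putting it together, here is a minimal sketch of the working flow, assuming the form and field names from the question are still what the site serves:

require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'

page = agent.get('https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp')
page = page.link_with(:text => 'Status of Licence').click

form = page.form_with(:name => "dlform", :action => "/nrportal/sarathi/DlDetRequest.jsp")
form.field_with(:name => "dlform:DLNumber").value = "TN38 20120001119"

# No agent.cookie_jar.clear! here -- the session cookies must survive
# until the form has been submitted.
gg = agent.submit form, form.buttons.first
puts gg.title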
I am trying to scrape multiple pages from a website. I want to scrape a page, then click on next, get that page, and repeat until I hit the end.
I wrote this so far:
page = agent.submit(form, form.buttons.first)
# submitting a form
while lien = page.link_with(:text => 'Next')
  # while I have a Next link on the page, keep scraping
  html_body = Nokogiri::HTML(page.body)
  links = html_body.css('.list').xpath("//table/tbody/tr/td[2]/a[1]")
  links.each do |link|
    purelink = link['href']
    puts purelink[/codeClub=([^&]*)/].gsub('codeClub=', '')
    lien.click
  end
end
Unfortunately, with this script I keep scraping the same page in an infinite loop. How can I achieve what I want to do?
I would try this: replace lien.click with page = lien.click.
It should look more like this:
page = form.submit form.button
scrape page

while link = page.link_with(:text => 'Next')
  page = link.click
  scrape page
end
Also, you don't need to parse the page body with Nokogiri; Mechanize already does that for you.
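For example, the scrape step could use Mechanize's built-in parser directly; page.search accepts the same CSS/XPath selectors you were passing to Nokogiri. A sketch using the selectors from your question:

def scrape(page)
  links = page.search('.list').xpath(".//table/tbody/tr/td[2]/a[1]")
  links.each do |link|
    code = link['href'][/codeClub=([^&]*)/, 1] # capture group instead of gsub
    puts code if code
  end
end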
I'm trying to get images or image addresses from the website below. It works for the one page that I put below: "http://www.1stsourceservall.com/Category/Accessories". However, once it's finished with that page, I want it to click on the next-page link and cycle through all 20+ pages. How would I do that?
require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.1stsourceservall.com/Category/Accessories"

while url
  doc = Nokogiri::HTML(open(url))
  puts doc.css(".productImageMed")

  link = doc.css('.pagination a').first
  url = link && link['href'] #=> url is nil once no next-page link is found, which ends the loop
end
In Mechanize on Ruby, I have to assign a new variable to every new page I come to. For example:
page2 = page1.link_with(:text => "Continue").click
page3 = page2.link_with(:text => "About").click
...etc
Is there a way to run Mechanize without a variable holding every page state? Something like:
my_only_page.link_with(:text => "Continue").click!
my_only_page.link_with(:text => "About").click!
I don't know if I understand your question correctly, but if it's a matter of looping through a lot of pages dynamically and processing them, you could do it like this:
require 'mechanize'

url = "http://example.com"
agent = Mechanize.new
page = agent.get(url) # get the starting page

loop do
  # What you want to do on each page - e.g. extract something...
  item = page.parser.css('.some_item').text
  # ...and then save it somewhere (CSV, database, etc.)

  if link = page.link_with(:text => "Continue") # as long as there is still a next-page link...
    page = link.click
  else # if no link is left, break out of the loop
    break
  end
end
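To answer the literal question: Mechanize itself tracks the current page in agent.page, and clicking a link through the agent updates it. So, as a sketch (the URL and link texts are just the ones from your example), you can click through without naming each page:

require 'mechanize'

agent = Mechanize.new
agent.get("http://example.com") # hypothetical starting URL

# Each click through the agent replaces agent.page with the new page,
# so no page1/page2/page3 variables are needed.
agent.page.link_with(:text => "Continue").click
agent.page.link_with(:text => "About").click
puts agent.page.title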