Ruby Watir: Display hyperlinks from a chrome search - ruby

Objective: print an array of hyperlinks returned for a couple of keywords entered into a search engine (Google, driven through Chrome), using Watir.
The code below worked fine over the last year and I tested it many times, but something has changed: either Google's elements or selectors, or perhaps Watir has deprecated some functionality, although I'm using very few lines of Ruby. (Edit: it doesn't actually hang; it simply produces no results in the terminal.) I've gone over this painstakingly and found only one other question with a similar case, which is where my previous code came from, but that code no longer works: Show searched links url in command line with watir
require 'watir'
browser = Watir::Browser.new :chrome
browser.goto 'google.com'
browser.text_field(title: 'Search').set 'watir + ruby'
browser.button(name: 'btnK').click
sleep 3
links = browser.h3s(class: 'r').map(&:link)
hrefs = links.map(&:href)
links.each { |link| puts "#{link.data_href || link.href}" }
sleep(900)
Desired output (used to work)
Terminal Displaying hyperlinks

Working script, which now prints the search result hyperlinks (credit to Justin Ko). The only selector change is that h3s(class: 'r') became divs(class: 'r'):
require 'watir'
browser = Watir::Browser.new :chrome
browser.goto 'google.com'
browser.text_field(title: 'Search').set 'watir + ruby'
browser.button(name: 'btnK').click
sleep 3
puts browser.title
links = browser.divs(class: 'r').map(&:link)
hrefs = links.map(&:href)
# hrefs.each { |href| puts href }
links.each { |link| puts "#{link.data_href || link.href}" }
sleep(900)

Related

Scraping - Loading dynamic buttons

I'm trying to scrape the "Fresh & Chilled" products from Waitrose & Partners using Ruby and Nokogiri.
To load more products I need to click "Load More...", which loads further products dynamically without changing the URL or redirecting to a new page (so there is no visible pagination).
How do I "click" the "Load More..." button to load more products?
Here's the code I've tried so far; I'm stuck at loading more items. My guess is that the DOM loads on its own, but the button can't actually be clicked from my script because it calls a JavaScript method that loads the rest of the items.
require "csv"
require "json"
require "nokogiri"
require "open-uri"
require "pry"
def scrape_category(category)
  CSV.open("out/waitrose_items_#{category}.csv", "w") do |csv|
    headers = [:id, :name, :category, :price_per_unit, :price_per_quantity, :image_url, :available, :url]
    csv << headers
    url = "https://www.waitrose.com/ecom/shop/browse/groceries/#{category}"
    html = open(url)
    doc = Nokogiri::HTML(html)
    load_more = doc.css(".loadMoreWrapper___UneG1").first
    pages = 0
    while load_more != nil
      puts pages.to_s
      load_more.content # Here's where I don't know how to click the button to load more items
      products = doc.css(".podHeader___3yaub")
      puts "products = " + products.length.to_s
      pages = pages + 1
      load_more = doc.css(".loadMoreWrapper___UneG1").first
    end
    (0..products.length-1).each do |i|
      puts "url = " + products[i].text
    end
    load_more = doc.css(".loadMoreWrapper___UneG1")[0]
    # here goes the processing of each single item to put in csv file
  end
end
def scrape_waitrose
  categories = [
    "fresh_and_chilled",
  ]
  threads = categories.map do |category|
    Thread.new { scrape_category(category) }
  end
  threads.each(&:join)
end
#binding.pry
Nokogiri is an HTML parser. It's the Ruby equivalent of JavaScript's Cheerio or Java's Jsoup. This is actually not a Nokogiri question.
You are confusing two things: parsing the HTML, and collecting the HTML as delivered over the network. It is important to remember that a lot of behaviour, like your button click, is driven by JavaScript. These days many sites, such as React sites, are built entirely by JavaScript.
So when you execute this line:
doc = Nokogiri::HTML(html)
It is the html variable you have to concentrate on. Your html is NOT the same as the HTML I would see for the same page in my browser, because open only fetches the document the server sends and never runs the page's scripts.
To scrape a page like this reliably, you have to use a headless browser that executes the JavaScript. In Ruby terms, that used to mean using Poltergeist to control PhantomJS, a headless WebKit-based browser. PhantomJS became unsupported when Puppeteer and headless Chrome arrived.
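As a rough sketch of that approach, the click loop could look something like the following when driven through Watir (which also appears elsewhere on this page). The helper name load_all_products and its max_clicks and pause parameters are made up for illustration, and it assumes the control is a button element whose text contains "Load more", which may not match the real Waitrose markup.

```ruby
# Hypothetical helper: keeps clicking a "Load more" control until it is gone
# (or a click limit is reached), then returns the rendered HTML for parsing.
# `browser` is expected to behave like a Watir::Browser.
def load_all_products(browser, max_clicks: 50, pause: 1)
  clicks = 0
  while clicks < max_clicks
    # Assumption: the control is a <button> whose text contains "Load more".
    button = browser.button(text: /load more/i)
    break unless button.exists?

    button.click
    clicks += 1
    sleep pause # crude wait for the newly loaded items to render
  end
  browser.html
end

# Usage sketch (needs the watir and nokogiri gems plus a local Chrome):
#   require "watir"
#   require "nokogiri"
#   browser = Watir::Browser.new :chrome
#   browser.goto "https://www.waitrose.com/ecom/shop/browse/groceries/fresh_and_chilled"
#   doc = Nokogiri::HTML(load_all_products(browser))
#   puts doc.css(".podHeader___3yaub").length
```

The browser is passed in rather than created inside the helper, so the same loop works with whatever driver you end up using. Nokogiri then parses the HTML after the scripts have run, instead of the bare document that open returns.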

Scraping successive pages until the last page using Nokogiri and Mechanize

I am trying to scrape multiple pages from a website. I want to scrape a page, then click on next, get that page, and repeat until I hit the end.
I wrote this so far:
page = agent.submit(form, form.buttons.first)
#submitting a form
while lien = page.link_with(:text=>'Next')
# while I have a next link on page, keep scraping
html_body = Nokogiri::HTML(body)
links = html_body.css('.list').xpath("//table/tbody/tr/td[2]/a[1]")
links.each do |link|
purelink = link['href']
puts purelink[/codeClub=([^&]*)/].gsub('codeClub=', '')
lien.click
end
end
Unfortunately, with this script I keep scraping the same page in an infinite loop... How can I achieve what I want?
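Try replacing lien.click with page = lien.click. The click returns the next page, and unless you assign it back to page, the while condition keeps inspecting the original page forever. The click also belongs outside the links.each block, so it runs once per page rather than once per link.
It should look more like this (scrape stands for your own method that extracts the links from a page):
page = agent.submit(form, form.buttons.first)
scrape(page)
while (link = page.link_with(text: 'Next'))
  page = link.click
  scrape(page)
end
Also, you don't need to parse the page body with Nokogiri yourself; Mechanize already does that for you, and page.search accepts the same CSS and XPath selectors.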
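To make the reassignment point concrete without needing a live site, here is a small simulation. FakePage and visit_all are hypothetical stand-ins for a Mechanize page and your scraping loop; they are not part of Mechanize.

```ruby
# Stand-in for a Mechanize page: a name plus the page its "Next" link leads to
# (nil on the last page, like link_with returning nil).
FakePage = Struct.new(:name, :next_page)

# Visits every page exactly once by reassigning `page` on each iteration.
def visit_all(first_page)
  visited = []
  page = first_page
  loop do
    visited << page.name       # "scrape" the current page
    following = page.next_page
    break if following.nil?    # no "Next" link: this was the last page

    page = following           # the crucial step: move on to the page just fetched
  end
  visited
end

last   = FakePage.new("page 3", nil)
middle = FakePage.new("page 2", last)
first  = FakePage.new("page 1", middle)

p visit_all(first) # prints ["page 1", "page 2", "page 3"]
```

If the `page = following` line is dropped, the loop never leaves the first page, which is the infinite loop described in the question.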

Using Nokogiri to scrape search results from POST form

I would like to scrape search results from http://maxdelivery.com, but unfortunately, they are using POST instead of GET for their search form. I found this description of how to use Nokogiri and RestClient to fake a post form submission, but it's not returning any results for me: http://ruby.bastardsbook.com/chapters/web-crawling/
I've worked with Nokogiri before, but not for the results of a POST form submission.
Here's my code right now, only slightly modified from the example at the link above:
require 'rest-client'
require 'nokogiri'
class MaxDeliverySearch
  REQUEST_URL = "http://www.maxdelivery.com/nkz/exec/Search/Display"
  def initialize(search_term)
    @term = search_term
  end
  def search
    if page = RestClient.post(REQUEST_URL, {
      'searchCategory'=>'*',
      'searchString'=>@term,
      'x'=>'0',
      'y'=>'0'
    })
      puts "Success finding search term: #{@term}"
      File.open("temp/Display-#{@term}.html", 'w'){|f| f.write page.body}
      npage = Nokogiri::HTML(page)
      rows = npage.css('table tr')
      puts "#{rows.length} rows"
      rows.each do |row|
        puts row.css('td').map{|td| td.text}.join(', ')
      end
    end
  end
end
Now (ignoring the formatting stuff), I would expect RestClient.post(REQUEST_URL, {...}) to return some search results when passed a 'good' search term, but each time I just get the search results page back with no actual results, as if I had pasted the URL into the browser.
Anyone have any idea what I'm missing? Or, just how to get back the results I'm looking for with another gem?
With the class above, I would like to be able to do:
s = MaxDeliverySearch.new("ham")
s.search #=> big block of search results objects to traverse
Mechanize is what you should use to automate a web search form. This should get you started using Mechanize.
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://maxdelivery.com')
form = page.form('SearchForm')
form.searchString = "ham"
page = agent.submit(form)
page.search("div.searchResultItem").each do |item|
  puts item.search(".searchName i").text.strip
end
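If you want to check what a hand-built POST actually sends, for comparison with what the browser form submits, Ruby's standard library can build the request without sending it. This is only a sketch: the URL and field names are the ones from the question, and whether the site also expects cookies or hidden fields from the form page is not something this example can tell you.

```ruby
require "net/http"
require "uri"

# URL and field names copied from the question; "ham" is the sample search term.
uri = URI("http://www.maxdelivery.com/nkz/exec/Search/Display")
fields = {
  "searchCategory" => "*",
  "searchString"   => "ham",
  "x"              => "0",
  "y"              => "0"
}

# Build the POST request in memory. Nothing is sent over the network here.
request = Net::HTTP::Post.new(uri)
request.set_form_data(fields) # URL-encodes the fields into the request body

puts request["Content-Type"]
puts request.body
```

Mechanize does the equivalent for you, but it starts from the real form on the page, so any extra fields the form contains are submitted along with the search term.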

Stripping out results from a website that doesn't have differing URLs

I'm trying to automate the process of searching for alternative telephone numbers using SayNoTo0870. Every time one searches for an alternative number or name, it brings up the '/companysearch.php' page.
Clearly this page carries no reference to the search, so in my mind you can't just link to it.
What I'm hoping to do is use the code below to automate opening a browser, searching for a name/number, stripping out the HTML and then providing the top 5 results. I've got the automation part down, but when I try to save the web page using Hpricot it only brings up the 'Sorry nothing can be found' page, because I can't link directly to the search result page.
Here is my code thus far:
(I've removed comments to shorten it)
require 'rubygems'
require 'watir'
require 'hpricot'
require 'open-uri'
class OH870
def searchName(name)
browser = Watir::Browser.new
browser.goto 'http://www.saynoto0870.com/search.php'
browser.text_field(:name => 'search_name').set name
browser.button(:name => 'submit').click
end
def searchNumber(number)
browser = Watir::Browser.new
browser.goto 'http://www.saynoto0870.com/search.php'
browser.text_field(:name => 'number').set number
browser.button(:name => 'submit').click
end
def loadNew(website)
doc = Hpricot(open(website))
puts(doc)
end
def strip_tags
stripped = website.gsub( %r{</?[^>]+?>}, '' )
puts stripped
end
end # class
class Main < OH870
puts "What is the name of the place you want?"
website = 'http://www.saynoto0870.com/companysearch.php'
question = gets.chomp
whichNumber = OH870.new
whichNumber.searchName(question)
#result = OH870.new
#withoutTags = website.strip_tags
#result.loadNew(withoutTags)
end
Now I'm not sure whether there's a way of "asking" Watir to follow through to the companysearch.php page and dump the results without having to pass this page as a variable.
I wonder if anyone has any suggestions here?
With WATIR, minus the extraneous libraries, here's all it takes to accomplish what you've described (using the 'name' test case only). I've pulled it out of the function format since you already know how to do that, and this will be a clearer test case path.
require 'watir'
@browser = Watir::Browser.new :firefox # open a browser called @browser
@browser.goto "http://(your search page here)" # go to the search page
@browser.text_field(:name => 'name').value = "Awesome" # fill in the 'name' field
@browser.button(:name => 'submit').click # submit the form
If all goes well, we should now be looking at the search results. WATIR already knows it's on a new page - we don't have to specify a URL. In the case that the results are in a frame, we do need to access that frame before we can view its content. Let's pretend they're in a DIV element with an ID of "search_results":
results = @browser.div(:id => "search_results").text
resultsFrame = @browser.frame(:index => 1) # in the case of a frame
results = resultsFrame.div(:id => "search_results").text
As you can see, you do not need to save the entire page to parse the results. They could be in table cells, they could be in a different div per line, or a new frame. All are easily accessible with WATIR to be stored in a variable, array, or immediately written to the console or log file.
@results = Array.new # create an Array to store our results
@browser.divs.each do |div| # for each div element on the page
  if div.id == "search_results" # if the div ID equals "search_results"
    @results << div.text # add it to our array named @results
  end
end
Now, if you just wanted the top 5 there are many ways to access them.
@results[0] # first element
@results[0..4] # first 5 elements
I'd also suggest you look into a few programming principles like DRY (Don't Repeat Yourself). In your function definitions where you see that they share code, like opening the browser and visiting the same URL - you can consolidate those:
def search(how, what)
  @browser = Watir::Browser.new :firefox
  @browser.goto "(that search url again)"
  @browser.text_field(:name => how).value = what
  # etc...
end
search("name", "Hilton")
search("number", "555555")
Since we know that the two available text_field names are "name" and "number", and those make good logical sense as a 'how', we can parameterize them and use a single function for both the Search by Name and Search by Number test cases. This is more efficient, as long as the test cases remain similar enough to be shared.

Ruby Regex Help

I want to extract the members' home site links from a site.
They look like this:
<a href="http://www.ptop.se" target="_blank">
I tested this regex on http://www.rubular.com/:
<a href="(.*?)" target="_blank">
It should output http://www.ptop.se.
Here is the code:
require 'open-uri'
url = "http://itproffs.se/forumv2/showprofile.aspx?memid=2683"
open(url) { |page| content = page.read()
links = content.scan(/<a href="(.*?)" target="_blank">/)
links.each {|link| puts #{link}
}
}
If you run this, it doesn't work. Why not?
I would suggest that you use one of the good ruby HTML/XML parsing libraries e.g. Hpricot or Nokogiri.
If you need to log in on the site you might be interested in a library like WWW::Mechanize.
Code example:
require "open-uri"
require "hpricot"
require "nokogiri"
url = "http://itproffs.se/forumv2"
# Using Hpricot
doc = Hpricot(open(url))
doc.search("//a[@target='_blank']").each { |user| puts "found #{user.inner_html}" }
# Using Nokogiri
doc = Nokogiri::HTML(open(url))
doc.xpath("//a[@target='_blank']").each { |user| puts "found #{user.text}" }
Several issues with your code:
1. I don't know what you mean by using #{link}. Outside a string, # starts a comment, so everything after puts on that line (including the closing brace) is ignored. If you meant string interpolation, wrap it in quotes, i.e. "#{link}".
2. String#scan accepts a block. Use it to loop through the matches.
3. The page you are trying to access does not return any links that the regex would match anyway.
Here's something that would work:
require 'open-uri'
url = "http://itproffs.se/forumv2/"
open(url) do |page|
  content = page.read()
  content.scan(/<a href="(.*?)" target="_blank">/) do |match|
    match.each { |link| puts link }
  end
end
There're better ways to do it, I am sure. But this should work.
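For a version you can run without hitting the network, the same regex can be tried against an inline string. The HTML below is made up for illustration; only the ptop.se link comes from the question.

```ruby
# Sample markup (invented for this example, apart from the ptop.se link).
html = <<~HTML
  <a href="http://www.ptop.se" target="_blank">Home</a>
  <a href="/internal">Internal</a>
  <a href="http://example.com" target="_blank">Other</a>
HTML

# With one capture group, scan returns an array of one-element arrays,
# so flatten turns it into a plain list of URLs.
links = html.scan(/<a href="(.*?)" target="_blank">/).flatten

links.each { |link| puts link }
# prints:
#   http://www.ptop.se
#   http://example.com
```

The second anchor is skipped because it has no target="_blank" attribute, and . does not match a newline, so the lazy group cannot run on into the next line.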
Hope it helps
