How to extract JS-rendered HTML using selenium-webdriver and Nokogiri? - ruby

Consider two webpages, one and two. Site number two is easy to scrape with Nokogiri because it doesn't use JS. Site number one, however, cannot be scraped with Nokogiri alone. After searching far and wide I found that if I loaded the page in an automated web browser I could scrape the rendered HTML. I have the following code:
require 'selenium-webdriver'

# creates an instance
driver = Selenium::WebDriver.for :chrome
# opens an existing webpage
driver.get 'http://www.bigstub.com/search.aspx'
# wait is used to let the webpage load up and let the JS render
wait = Selenium::WebDriver::Wait.new(:timeout => 5)
My question: I am trying to let the page load and then close the browser immediately once I find my desired class. For example, if I set the timeout to 10 seconds and wait until I can find the class .title-holder, how would I write that code?
Pseudocode:
the wait on rendered_source_page should stop as soon as page_source.include?("title-holder") is true. I just don't know how to write it.
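One way to write that (a sketch: Wait#until polls its block until it returns a truthy value or the timeout expires, ignoring NoSuchElementError raised in between):
# Wait up to 10 seconds for the element to appear, then capture the
# rendered HTML and close the browser immediately.
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_element(css: '.title-holder') }
rendered_source_page = driver.page_source
driver.quit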
UPDATE:
Regarding the headless question: Selenium accepts a Chrome options object to which you can add a headless flag. This is done with the code below:
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
driver = Selenium::WebDriver.for :chrome, options: options
As for my next question, to let the site's JS-rendered HTML finish loading before scraping, I set my timeout variable to 5 seconds:
wait = Selenium::WebDriver::Wait.new(:timeout => 5)
wait.until { /title-holder/.match(driver.page_source) }
wait.until polls for up to 5 seconds until it finds a title-holder class inside page_source (the rendered HTML), and raises a timeout error otherwise. This solved all my questions.
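From there the rendered HTML can go straight into Nokogiri, which ties this back to the original question (a sketch; the .title-holder selector is the one from the question):
require 'nokogiri'

# Parse the JS-rendered source, extract the target nodes, then quit.
doc = Nokogiri::HTML(driver.page_source)
titles = doc.css('.title-holder').map { |node| node.text.strip }
driver.quit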

I am assuming you are running Selenium on a server. So first install Xvfb:
sudo apt-get install xvfb
Install Firefox:
sudo apt-get install firefox
Add the following two gems to your Gemfile. You need headless because you want to run selenium-webdriver on your server; the headless gem starts and stops Xvfb for you.
#gemfile
gem 'selenium-webdriver'
gem 'headless'
Code for scraping:
headless = Headless.new
headless.start
driver = Selenium::WebDriver.for :firefox
driver.navigate.to 'http://example.com'
wait = Selenium::WebDriver::Wait.new(:timeout => 30)
# scraping code comes here
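# For example, an illustrative sketch (the selector is assumed, not
# part of the original answer): wait for a JS-rendered element to
# appear, then hand the page source to Nokogiri.
require 'nokogiri'
wait.until { driver.find_element(css: '.title-holder') }
doc = Nokogiri::HTML(driver.page_source)
doc.css('.title-holder').each { |node| puts node.text.strip }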
Housekeeping, so that you don't run out of memory:
driver.quit
headless.destroy
Hope this helps.


Related

Getting prompts to upgrade browser on certain sites using headless chrome

I'm attempting to do some web scraping using headless Chrome via selenium-webdriver on Heroku, but recently ran into trouble. Visiting https://music.youtube.com, I get a page with the following:
<div class="message">
Sorry, YouTube Music is not optimized for your browser. Check for updates or try Google Chrome.
</div>
I can confirm that my ChromeDriver is up to date and nothing on my end has changed to cause this break in functionality. Any other automated scraping gem similar to Selenium gives me a similar message telling me my browser is deprecated. Note that performing these actions is no problem on my local machine. For background, I originally followed this answer to get everything running correctly on Heroku prior to the breakage. On top of this, here's my setup:
gem 'selenium-webdriver'
gem 'webdrivers', '~> 4.0', require: false
require 'webdrivers'
require 'selenium-webdriver'
chrome_bin_path = ENV.fetch('GOOGLE_CHROME_SHIM', nil)
options = Selenium::WebDriver::Chrome::Options.new
options.binary = chrome_bin_path if chrome_bin_path
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
$driver = Selenium::WebDriver.for :chrome, options: options
What would be the most logical explanation for why I'm getting this message on a site like YouTube?
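One common cause worth checking: headless Chrome reports "HeadlessChrome" in its user-agent string, and some sites sniff that string and serve an unsupported-browser page. A sketch of overriding it (the user-agent value below is illustrative):
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
# Spoof a regular desktop Chrome user agent so UA sniffers do not
# see "HeadlessChrome" (the version string here is illustrative).
options.add_argument('--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36')
$driver = Selenium::WebDriver.for :chrome, options: options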

Do not wait for page to finish loading in selenium

As the title states, I'm trying to create a script which opens multiple tabs in a browser. At the moment the script seems to wait until each page has finished loading before moving on to a new tab. Is there a way to move on without waiting for the page to load? Relevant information seems hard to find online.
#!/usr/bin/env ruby
require 'selenium-webdriver'
file = File.open(ARGV[0], 'r')
driver = Selenium::WebDriver.for :firefox
file.each do |host|
  driver.get(host.strip) # strip the trailing newline from each input line
  driver.execute_script('window.open()')
  driver.switch_to.window(driver.window_handles.last)
end
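One approach (a sketch, assuming a Selenium and geckodriver version that supports the W3C pageLoadStrategy capability) is to tell the driver not to block on page loads at all:
require 'selenium-webdriver'

# 'none' hands control back as soon as navigation starts, instead of
# waiting for the page's load event to fire.
caps = Selenium::WebDriver::Remote::Capabilities.firefox(page_load_strategy: 'none')
driver = Selenium::WebDriver.for :firefox, desired_capabilities: caps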

chrome browser closes automatically after the program finishes in ruby using watir

I am using Chrome 56 and ChromeDriver 2.27 (latest release) with selenium-webdriver 3.1.0. I'm referring to the issue (https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/1811) where Chrome closes all instances once the program finishes, which does not give me a chance to debug. If this is fixed, why is it still happening? Or am I missing something?
I am using the following code. Any help is appreciated.
require "uri"
require "net/http"
require 'watir-webdriver'
require 'selenium-webdriver'
@b = Watir::Browser.new :chrome
@b.goto 'http://www.google.com'
Firstly, the watir-webdriver gem is deprecated; the updated code is in the watir gem. Also, you shouldn't need to require any of those other gems directly.
The chromedriver service is stopped when the Ruby process exits. If you do not want the browsers started by chromedriver to close as well, you need to use the detach parameter. Currently this is done like so:
require 'watir'
caps = Selenium::WebDriver::Remote::Capabilities.chrome
caps[:chrome_options] = {detach: true}
@b = Watir::Browser.new :chrome, desired_capabilities: caps
Declare these:
caps = Selenium::WebDriver::Remote::Capabilities.chrome("chromeOptions" => {'detach' => true})
browser = Watir::Browser.new :chrome, desired_capabilities: caps
On a side note: this might cause a problem when you are running multiple scenario tests, since chromedriver will actively refuse connections if another test initiates in the same Chrome session. Ensure you call browser.close whenever required, as in the sketch below.
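A minimal sketch of that housekeeping pattern (the URL is illustrative):
require 'watir'

browser = Watir::Browser.new :chrome
begin
  browser.goto 'http://www.example.com'
  # ... scenario steps ...
ensure
  browser.close # release the session even if a step raises
end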

How to bypass website's security certificate in IE using Ruby/Selenium WebDriver

I am trying to automate some IE browser tests using Ruby/Selenium WebDriver.
When I run the following code, it opens a new IE browser with the URL, but it always reports 'There is a problem with this website's security certificate.'
Is there any way to set the IE profile/capabilities using Ruby similar to the ones used in Java?
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :ie
driver.get "https://xxxxxxxxxxxxxxxxxx.com"
There's no way to set this through capabilities, even in Java. If you have found an approach that works in Java, please post it and we can see whether it translates into Ruby.
But you can always simulate the clicking to bypass it.
# Tested under Windows 7, IE 10, Ruby 2.0.0
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :ie
driver.get "https://xxxxxxxxxxxxxxxxxx.com"
driver.get("javascript:document.getElementById('overridelink').click()");

How to avoid launching the Firefox GUI when scraping a web page with JavaScript

I am trying to scrape a web page with a lot of JavaScript. With the help of pguardiano I have this piece of code in Ruby:
require 'rubygems'
require 'watir-webdriver'
require 'csv'
@browser = Watir::Browser.new
@browser.goto 'http://www.oddsportal.com/matches/soccer/'
CSV.open('out.csv', 'w') do |out|
  @browser.trs(:class => /deactivate/).each do |tr|
    out << tr.tds.map(&:text)
  end
end
The scraping runs repeatedly in the background with a sleep time of approximately 1 hour. I have no experience with Ruby, and in particular with web scraping, so I have a couple of questions.
How can I avoid opening a new Firefox session, with its heavy CPU and RAM consumption, every time?
Is it possible to use the Firefox engine without its GUI?
You can try the headless gem:
require 'watir-webdriver'
require 'headless'
headless = Headless.new
headless.start
b = Watir::Browser.start 'www.google.com'
puts b.title
b.close
headless.destroy
An alternative is to use the Selenium server (sketched below). A third alternative is to use a scraper like Kapow.
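For the Selenium server route, a sketch (assumes a server already listening on localhost:4444; the scraped URL is the one from the question):
require 'watir-webdriver'

# Point Watir at a remote Selenium server instead of a local browser,
# so the browser process runs wherever the server is.
browser = Watir::Browser.new(:remote,
                             url: 'http://localhost:4444/wd/hub',
                             desired_capabilities: :firefox)
browser.goto 'http://www.oddsportal.com/matches/soccer/'
puts browser.title
browser.close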
