Save page with css, js and images using watir - ruby

How can I save a page with all of its content using watir-webdriver?
browser.html saves only the page's DOM elements. If I open the file where I dumped browser.html, there is no styling.
Also, browser.html doesn't include iframes. I can loop through the iframes and save them separately, but then they are separated from the main page.
For now I record only the HTML; maybe later I'll save screenshots, because there is no simple way to dump a whole page with its CSS and images.
require 'fileutils'
class Recorder
  attr_reader :request, :counter, :browser
  # request should contain %w(login_id start_time)
  def initialize(request)
    @request, @counter = request, 1
    # Settings class contains my configs (enable recording, paths, etc.)
    FileUtils.mkpath(path) if Settings.recorder.record && !File.exist?(path)
  end
  def record(hash)
    return unless Settings.recorder.record
    @browser = hash["browser"]
    record_html(hash)
    record_frames(hash)
    @counter += 1
  end
  private
  # hash should contain %w(method_name browser)
  def record_html(hash)
    File.open("#{path}#{generate_file_name(hash)}", "w") do |file|
      file.write("<!--#{browser.url}-->\n")
      file.write(browser.html)
    end
  end
  def record_frames(hash)
    browser.frames.each_with_index do |frame, index|
      File.open("#{path}#{generate_file_name(hash, index + 1)}", "w") do |file|
        file.write("<!--#{browser.url}-->\n")
        file.write(frame.html)
      end
    end
  end
  def path
    "#{Settings.recorder.path}/#{request["login_id"]}/#{request["start_time"]}/"
  end
  def generate_file_name(hash, frame=nil)
    return "#{counter}-#{hash["method_name"]}.html" if frame.nil?
    "#{counter}-frame#{frame}-#{hash["method_name"]}.html"
  end
end

I don't know about Watir, but for those who might want to save a page (including CSS and JavaScript that are directly in the page) using Selenium WebDriver (which Watir wraps), the easiest way is to use the driver's page_source method. As its name suggests, it returns the whole source. Then it's just a matter of saving it to a new file, like so:
require 'selenium-webdriver'
driver = Selenium::WebDriver.for(:firefox)
driver.get(URL_of_page_to_save)
# The block form closes the file automatically
File.open(filename, "w") do |file|
  file.puts(driver.page_source)
end
driver.quit
It won't save JavaScript or CSS that lives in separate files, though.
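One way to get at those external files is to collect their URLs from the saved source and download them separately. The following is only a rough sketch: asset_urls is a hypothetical helper, the regex scan is a stand-in for a real HTML parser and will miss or mis-match unusual markup, and the sample HTML and base URL are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: collect href/src values from <link>, <script> and <img>
# tags and resolve them against the page URL. A regex is a rough approximation,
# not a real HTML parser.
def asset_urls(html, base_url)
  html.scan(/<(?:link|script|img)\b[^>]*?\b(?:href|src)=["']([^"']+)["']/i)
      .flatten
      .map { |ref| URI.join(base_url, ref).to_s }
      .uniq
end

# Made-up sample input standing in for driver.page_source
sample = '<link rel="stylesheet" href="/css/site.css">' \
         '<script src="js/app.js"></script>'
puts asset_urls(sample, 'http://example.com/page/index.html')
```

Each returned URL could then be fetched and written next to the saved HTML; rewriting the references inside the HTML to point at the local copies is left out here.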

Related

Ruby TempFile behaviour among different classes

Our processing server works mainly with Tempfiles because they make things easier on our side: we don't need to delete them ourselves (they are removed when garbage collected), handle name collisions, etc.
Lately, we have been having problems with Tempfiles getting GCed too early in the process, especially in one of our services that converts a Foo file from a URL into a Bar file and uploads it to our servers.
For the sake of clarity, I have added an example scenario below to make the discussion easier.
This workflow does the following:
1. Get a URL as a parameter
2. Download the Foo file as a Tempfile
3. Duplicate it to a new Tempfile
4. Download the related assets to Tempfiles
5. Link the related assets into the local duplicate Tempfile
6. Convert the Foo to Bar format
7. Upload it to our server
At times the conversion fails, and everything points to our local Foo file referencing related assets that were created and then GCed before the conversion.
My two questions:
1. Is it possible that my Tempfiles get GCed too early? I read that Ruby's GC is very conservative, precisely to avoid such scenarios.
2. How can I prevent this from happening? I could collect all related assets from download_and_replace_uri(node) and return them, so that they stay alive for as long as the ConvertService instance exists. But I'm not sure whether this would solve it.
myfile.foo
{
  "buffers": [
    { "uri": "http://example.com/any_file.jpg" },
    { "uri": "http://example.com/any_file.png" },
    { "uri": "http://example.com/any_file.jpmp3" }
  ]
}
main.rb
ConvertService.new('http://example.com/myfile.foo')
ConvertService
class ConvertService
  def initialize(url)
    @url = url
    @bar_file = Tempfile.new
  end
  def call
    import_foo
    convert_foo
    upload_bar
  end
  private
  def import_foo
    @foo_file = ImportService.new(@url).call.edited_file
  end
  def convert_foo
    `create-bar "#{@foo_file.path}" "#{@bar_file.path}"`
  end
  def upload_bar
    UploadBarService.new(@bar_file).call
  end
end
ImportService
class ImportService
  def initialize(url)
    @url = url
    @edited_file ||= Tempfile.new
  end
  def call
    download
    duplicate
    replace
  end
  private
  def download
    @original = DownloadFileService.new(@url).call.file
  end
  def duplicate
    FileUtils.cp(@original.path, @edited_file.path)
  end
  def replace
    file = File.read(@edited_file.path)
    json = JSON.parse(file, symbolize_names: true)
    json[:buffers]&.each do |node|
      node[:uri] = DownloadFileService.new(node[:uri]).call.file.path
    end
    write_to_disk(@edited_file.path, json.to_json)
  end
end
DownloadFileService
module Helper
  class DownloadFileService < ApplicationHelperService
    def initialize(url)
      @url = url
      @file = Tempfile.new
    end
    def call
      uri = URI.parse(@url)
      Net::HTTP.start(
        uri.host,
        uri.port,
        use_ssl: uri.scheme == 'https'
      ) do |http|
        response = http.request(Net::HTTP::Get.new(uri.path))
        @file.binmode
        @file.write(response.body)
        @file.flush
      end
    end
  end
end
UploadBarService
module Helper
  class UploadBarService < ApplicationHelperService
    def initialize(file)
      @file = file
    end
    def call
      HTTParty.post('http://example.com/upload', body: { file: @file })
      # NOTE: The endpoint returns the URL of the uploaded file
    end
  end
end
Because of the complexity of your code and the missing parts, which may be obfuscated for us, the simple answer is to ensure that your Tempfile objects remain referenced for the whole lifecycle in which they are needed. Otherwise they become eligible for garbage collection, which removes the temporary file from the file system and leads to the missing-tempfile state you've encountered.
The Ruby documentation for Tempfile states: "When a Tempfile object is garbage collected, or when the Ruby interpreter exits, its associated temporary file is automatically deleted."
As per comments, others may find this conversation helpful when running into this problem.
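As a minimal sketch of keeping the references alive, the downloaded assets could be collected in one object that lives as long as the conversion does. AssetStore and its methods are hypothetical names, not part of the code in the question:

```ruby
require 'tempfile'

# Hypothetical holder: every Tempfile stays referenced (and therefore on disk)
# for as long as the AssetStore instance itself is referenced.
class AssetStore
  def initialize
    @files = []
  end

  # Writes content to a new Tempfile, keeps a reference to it, returns its path.
  def add(content)
    file = Tempfile.new('asset')
    file.binmode
    file.write(content)
    file.flush
    @files << file
    file.path
  end

  # Explicitly closes and deletes all files once they are no longer needed.
  def cleanup
    @files.each(&:close!)
    @files.clear
  end
end

store = AssetStore.new
path = store.add('hello')
GC.start                  # the file survives: the store still references it
puts File.exist?(path)    # prints "true"
store.cleanup
puts File.exist?(path)    # prints "false"
```

In the question's code, the equivalent would be for the import step to hold on to the DownloadFileService results (or their files) instead of keeping only the path string.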

Why can't PhantomJS/Poltergeist pull up this website correctly?

I'm using Capybara to navigate a web page. The URL looks like this:
http://www.myapp.com/page/ABCXYZ?page=<x>
The page has a paginated table on it. Passing a page number in the URL paginates the table accordingly.
However, when using the poltergeist driver, the page parameter is always ignored.
Using the selenium driver is not an option: it's a hassle to get it to run headless, and it refuses to run more than once (it gives a "connection refused" error on localhost).
This looks like an encoding issue, but I'm not sure where exactly in the stack the issue lies.
class Verifier
  class << self
    include Capybara::DSL
    Capybara.default_driver = :poltergeist
    Capybara.default_wait_time = 10
    def parse_table(header)
      xpath = '//*[@id="products"]/table[3]/tbody/tr/td/div[4]/div/table'
      table = find(:xpath, xpath)
      rows = []
      table.all("tr").each do |row|
        product_hash = {}
        row.all("td").each_with_index do |col,idx|
          product_hash[header[idx]] = col.text
        end
        rows << product_hash
      end
      rows
    end
    def pages
      page.find(".numberofresults").text.gsub(" Products","").split(" ").last.to_i/25
    end
    def import(item)
      visit "http://www.myapp.com/page/#{item}"
      header = parse_header
      apps = parse_vehicles(header)
      pages.times do |pagenumber|
        url = "http://www.myapp.com/page/#{item}?page=#{pagenumber+1}" # This is the problem
      end
    end
  end
end
That URL in the last loop? It is processed as if the page number were not present. When I change the driver to :selenium, the whole thing works. So it's not a Capybara issue as far as I can see.

read json in Ruby and set variables for use in another class

I need to read a JSON file in one class, turn its keys into variables, and use those variables within another class. What I have so far is:
helper.rb
class MAGEINSTALLER_Helper
  #note nonrelated items removed
  require 'fileutils'
  #REFACTOR THIS LATER
  def load_settings()
    require 'json'
    file = File.open("scripts/installer_settings.json", "rb")
    contents = file.read
    file.close
    #note this should be changed for a better content check.. ie:valid json
    #so it's a hack for now
    if contents.length > 5
      begin
        parsed = JSON.parse(contents)
      rescue SystemCallError
        puts "must redo the settings file"
      else
        puts parsed['bs_mode']
        parsed.each do |key, value|
          puts "#{key}=>#{value}"
          instance_variable_set("@" + key, value) #better way?
        end
      end
    else
      puts "must redo the settings file"
    end
  end
  #a method to provide feedback simply
  def download(from,to)
    puts "completed download for #{from}\n"
  end
end
This is called from Pre_start.rb:
class Pre_start
  #note nonrelated items removed
  def initialize(params=nil)
    puts 'World'
    mi_h = MAGEINSTALLER_Helper.new
    mi_h.load_settings()
    bs_MAGEversion=instance_variable_get("@bs_MAGEversion") #doesn't seem to work
    file="www/depo/newfile-#{bs_MAGEversion}.tar.gz"
    if !File.exist?(file)
      mi_h.download("http://www.dom.com/#{bs_MAGEversion}/file-#{bs_MAGEversion}.tar.gz",file)
    else
      puts "mage package exists"
    end
  end
end
The JSON file is valid JSON and is a simple object (there is more in it; I'm showing only the relevant part):
{
  "bs_mode":"lite",
  "bs_MAGEversion":"1.8.0.0"
}
The reason I need a JSON settings file is that I will need to pull the settings from a bash script and later a PHP script. This file is the common thread used to pass the settings that they all share and that need to match.
Right now I end up with an empty string for the value.
instance_variable_set is creating the variables inside the MAGEINSTALLER_Helper instance. That's the reason why you can't access them from Pre_start.
You can refactor the helper into a module, like this:
require 'fileutils'
require 'json'
module MAGEINSTALLER_Helper
  #note nonrelated items removed
  #REFACTOR THIS LATER
  def load_settings()
    parsed = begin
      JSON.load_file('scripts/installer_settings.json')
    rescue
      puts 'must redo the settings file'
      {} # return an empty Hash object
    end
    parsed.each {|key, value| instance_variable_set("@#{key}", value)}
  end
  #a method to provide feedback simply
  def download(from,to)
    puts "completed download for #{from}\n"
  end
end
class PreStart
  include MAGEINSTALLER_Helper
  #note nonrelated items removed
  def initialize(params=nil)
    puts 'World'
    load_settings # The method is available inside the class
    file="www/depo/newfile-#{@bs_MAGEversion}.tar.gz"
    if !File.exist?(file)
      download("http://www.dom.com/#{@bs_MAGEversion}/file-#{@bs_MAGEversion}.tar.gz",file)
    else
      puts "mage package exists"
    end
  end
end
I also refactored it a little toward a more idiomatic Ruby style.
On this line:
bs_MAGEversion=instance_variable_get("@bs_MAGEversion") #doesn't seem to work
instance_variable_get isn't retrieving from the mi_h object, which is where your value is stored. The way you've used it, that line is equivalent to:
bs_MAGEversion=@bs_MAGEversion
Changing it to mi_h.instance_variable_get would work. It would also be painfully ugly Ruby. But I sense that's not quite what you're after. If I read you correctly, you want this line:
mi_h.load_settings()
to populate @bs_MAGEversion and @bs_mode in your Pre_start object. Ruby doesn't quite work that way. The closest thing to what you're looking for here would probably be a mixin, as described here:
http://www.ruby-doc.org/docs/ProgrammingRuby/html/tut_modules.html
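To illustrate why the lookup comes back empty, here is a small sketch with hypothetical class names (SettingsHolder, Caller) and an inline hash standing in for the parsed JSON. Instance variables belong to the object that set them, so the caller only sees them when it asks that object:

```ruby
# Hypothetical stand-in for the helper: stores each setting on itself.
class SettingsHolder
  def load_settings(settings)
    settings.each { |key, value| instance_variable_set("@#{key}", value) }
  end
end

# Hypothetical stand-in for Pre_start.
class Caller
  attr_reader :own, :theirs

  def initialize
    holder = SettingsHolder.new
    holder.load_settings('bs_mode' => 'lite')
    # Looks at this Caller object, which never had @bs_mode set
    @own = instance_variable_get('@bs_mode')
    # Looks at the holder, where the value actually lives
    @theirs = holder.instance_variable_get('@bs_mode')
  end
end

caller_obj = Caller.new
p caller_obj.own     # prints nil
p caller_obj.theirs  # prints "lite"
```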
We do something similar to this all the time in code at work. The problem, and solution, is proper use of variables and scoping in the main level of your code. We use YAML, you're using JSON, but the idea is the same.
Typically we define a constant, like CONFIG, which we load the YAML into, in our main code, and which is then available in all the code we require. For you, using JSON instead:
require 'json'
require_relative 'helper'
CONFIG = JSON.load_file('path/to/json')
At this point CONFIG would be available to the top-level code and in "helper.rb" code.
As an alternate way of doing it, just load your JSON in either file. The load-time is negligible and it'll still be the same data.
Since the JSON data should be static for the run-time of the program, it's OK to use it in a CONSTANT. Storing it in an instance variable only makes sense if the data would vary from instance to instance of the code, which makes no sense when you're loading data from a JSON or YAML-type file.
Also, notice that I'm using a method from the JSON module. Don't go through the rigmarole you're using to try to copy the JSON into instance variables.
Stripping your code down as an example:
require 'fileutils'
require 'json'
CONFIG = JSON.load_file('scripts/installer_settings.json')
class MAGEINSTALLER_Helper
  def download(from,to)
    puts "completed download for #{from}\n"
  end
end
class Pre_start
  def initialize(params=nil)
    mi_h = MAGEINSTALLER_Helper.new
    file = "www/depo/newfile-#{ CONFIG['bs_MAGEversion'] }.tar.gz"
    if !File.exist?(file)
      mi_h.download("http://www.dom.com/#{ CONFIG['bs_MAGEversion'] }/file-#{ CONFIG['bs_MAGEversion'] }.tar.gz", file)
    else
      puts "mage package exists"
    end
  end
end
CONFIG can be initialized/loaded in either file, just do it from the top-level before you need to access the contents.
Remember, Ruby starts executing at the top of the first file and reads downward. Code that is outside of def, class and module blocks gets executed as it's encountered, so the CONFIG initialization will happen as soon as Ruby sees that code. If that happens before you start calling your methods and creating instances of classes, then your code will be happy.
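A self-contained sketch of the same idea, with the JSON parsed from an inline string instead of the settings file so it runs anywhere. PackagePath is a hypothetical class used only to show that a top-level constant is visible inside class bodies:

```ruby
require 'json'

# Stand-in for JSON.load_file('scripts/installer_settings.json'):
# the settings are parsed from an inline string here.
CONFIG = JSON.parse('{"bs_mode":"lite","bs_MAGEversion":"1.8.0.0"}').freeze

# Hypothetical class: reads the top-level constant directly, no instance
# variables or helper objects needed.
class PackagePath
  def file
    "www/depo/newfile-#{CONFIG['bs_MAGEversion']}.tar.gz"
  end
end

puts PackagePath.new.file  # prints "www/depo/newfile-1.8.0.0.tar.gz"
```

Freezing the hash is optional; it just makes accidental modification of the shared settings raise an error.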

Can't get page data from Jekyll plugin

I'm trying to write a custom tag plugin for Jekyll that will output a hierarchical navigation tree of all the pages (not posts) on the site. Basically, I want a bunch of nested <ul>s with links to the pages (with the page title as the link text), with the current page marked by a certain CSS class.
I'm very inexperienced with ruby. I'm a PHP guy.
I figured I'd start just by trying to iterate through all the pages and output a one-dimensional list just to make sure I could at least do that. Here's what I have so far:
module Jekyll
  class NavTree < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
    end
    def render(context)
      site = context.registers[:site]
      output = '<ul>'
      site.pages.each do |page|
        output += '<li>'+page.title+'</li>'
      end
      output += '<ul>'
      output
    end
  end
end
Liquid::Template.register_tag('nav_tree', Jekyll::NavTree)
And I'm inserting it into my liquid template via {% nav_tree %}.
The problem is that the page variable in the code above doesn't have all the data that you'd expect. page.title is undefined and page.url is just the basename with a forward slash in front of it (e.g. for /a/b/c.html, it's just giving me /c.html).
What am I doing wrong?
Side note: I already tried doing this with pure Liquid markup, and I eventually gave up. I can easily iterate through site.pages just fine with Liquid, but I couldn't figure out a way to appropriately nest the lists.
Try:
module Jekyll
  # Add accessor for directory
  class Page
    attr_reader :dir
  end
  class NavTree < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
    end
    def render(context)
      site = context.registers[:site]
      output = '<ul>'
      site.pages.each do |page|
        # The title lives in the page's front matter data
        output += '<li>'+(page.data['title'] || page.url) +'</li>'
      end
      output += '</ul>'
      output
    end
  end
end
Liquid::Template.register_tag('nav_tree', Jekyll::NavTree)
The title is not always defined (example: atom.xml), so you have to check whether it is present. If it isn't, you can fall back to page.name or skip the entry:
def render(context)
  site = context.registers[:site]
  output = '<ul>'
  site.pages.each do |page|
    # Fall back to the file name when the page has no title
    t = page.data['title'] || page.name
    output += "<li>#{t}</li>"
  end
  output += '</ul>'
  output
end
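The fallback logic can be tried outside Jekyll with a small sketch. FakePage is a hypothetical stand-in that only mimics the data and name readers used above, and nav_list is an illustrative helper, not part of Jekyll:

```ruby
# Hypothetical stand-in for a Jekyll page: front matter data plus file name.
FakePage = Struct.new(:data, :name)

# Builds a flat list, using the title when present and the file name otherwise.
def nav_list(pages)
  items = pages.map do |page|
    title = page.data['title'] || page.name
    "<li>#{title}</li>"
  end
  "<ul>#{items.join}</ul>"
end

pages = [
  FakePage.new({ 'title' => 'Home' }, 'index.html'),
  FakePage.new({}, 'atom.xml') # no title in the front matter
]
puts nav_list(pages)  # prints "<ul><li>Home</li><li>atom.xml</li></ul>"
```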
Recently I faced a similar problem, where the error "cannot convert nil into string" was driving me crazy. My config.yml file had a line like "baseurl: /paradocs/jekyll/out/". That was for my local setup; for the server I needed to make baseurl empty, and then the error started to appear at build time. In the end I had to set "baseurl: /", and that did the job.

Download a zip file through Net::HTTP

I am trying to download the latest.zip from WordPress.org using Net::HTTP. This is what I have got so far:
Net::HTTP.start("wordpress.org/") { |http|
  resp = http.get("latest.zip")
  open("a.zip", "wb") { |file|
    file.write(resp.body)
  }
  puts "WordPress downloaded"
}
But this only gives me a 4-kilobyte 404 error HTML page (which I can read if I change the file name to a.txt). I think this has something to do with the URL probably being redirected somehow, but I have no clue what I am doing. I am a newbie to Ruby.
My first question is why use Net::HTTP, or code to download something that could be done more easily using curl or wget, which are designed to make it easy to download files?
But, since you want to download things using code, I'd recommend looking at Open-URI if you want to follow redirects. It's part of Ruby's standard library, and very useful for fast HTTP/FTP access to pages and files:
require 'open-uri'
open('latest.zip', 'wb') do |fo|
  # Kernel#open no longer accepts URLs as of Ruby 3.0; use URI.open instead
  fo.print URI.open('http://wordpress.org/latest.zip').read
end
I just ran that, waited a few seconds for it to finish, ran unzip against the downloaded file "latest.zip", and it expanded into the directory containing their content.
Beyond Open-URI, there's HTTPClient and Typhoeus, among others, that make it easy to open an HTTP connection and send queries/receive data. They're very powerful and worth getting to know.
Net::HTTP doesn't provide a nice way of following redirects. Here is a piece of code that I've been using for a while now:
require 'net/http'
class RedirectFollower
  class TooManyRedirects < StandardError; end
  attr_accessor :url, :body, :redirect_limit, :response
  def initialize(url, limit=5)
    @url, @redirect_limit = url, limit
  end
  def resolve
    raise TooManyRedirects if redirect_limit < 0
    self.response = Net::HTTP.get_response(URI.parse(url))
    if response.kind_of?(Net::HTTPRedirection)
      self.url = redirect_url
      self.redirect_limit -= 1
      resolve
    end
    self.body = response.body
    self
  end
  def redirect_url
    if response['location'].nil?
      response.body.match(/<a href=\"([^>]+)\">/i)[1]
    else
      response['location']
    end
  end
end
wordpress = RedirectFollower.new('http://wordpress.org/latest.zip').resolve
puts wordpress.url
# Open in binary mode so the zip is written byte for byte
File.open("latest.zip", "wb") do |file|
  file.write wordpress.body
end
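A more compact variant of the same idea is sketched below. Both method names are hypothetical; the sketch assumes the server sends a Location header on redirects and resolves it against the current URL with URI.join, so that relative redirect targets also work. Nothing is requested until fetch_following_redirects is called.

```ruby
require 'net/http'
require 'uri'

# Resolves a Location header (absolute or relative) against the URL that
# produced it.
def next_url(current_url, location)
  URI.join(current_url, location).to_s
end

# Hypothetical helper: follows up to `limit` redirects and returns the final
# Net::HTTP response object.
def fetch_following_redirects(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.negative?

  response = Net::HTTP.get_response(URI.parse(url))
  if response.is_a?(Net::HTTPRedirection) && response['location']
    fetch_following_redirects(next_url(url, response['location']), limit - 1)
  else
    response
  end
end

# Usage (performs real network requests, so it is left commented out):
# response = fetch_following_redirects('http://wordpress.org/latest.zip')
# File.binwrite('latest.zip', response.body)
```

Passing a URI object to Net::HTTP.get_response lets it pick the host, port and path itself, which avoids the host-with-trailing-slash and path-without-leading-slash mistakes in the original snippet.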
