Read dynamic PDF from Ruby Watir

I am using Watir to log into an application, push some buttons, etc... Basically the normal stuff that a person would use Watir for.
However, my problem is that there is one particular page I need to test: a dynamically generated PDF. I need the actual binary data from it so that I can load it using a certain gem that we're using. With static PDF files this normally works, because we can just use:
open("http://site.com/something.pdf")
This works for static PDFs. For a dynamically generated one it doesn't, because the plain Ruby HTTP request knows nothing about the headers/cookies/session that Watir's browser is using, so instead of the actual PDF we get the login page back.
Another thing we tried was to use Watir to get the PDF:
@browser.goto "http://site.com/dynamic/thepdffile"
@browser.text
@browser.html
We tried getting the text or HTML from the page, but no luck: Firefox builds a DOM when loading a PDF, so the text is an empty string and the HTML is just Firefox's PDF-viewer DOM. We need the raw HTTP response, and there doesn't seem to be a way to extract it.
So we need a solution for this, and in my opinion we have these options:
1. Figure out a way to use open (or a similar method) in Ruby while reusing the session/cookies from Watir (see the sketch after this question).
2. Figure out how to use Watir to get the raw binary HTTP response for the PDF page.
3. Disable the PDF plugin (which doesn't seem possible) so that the "save as" dialog appears instead.
Or if you have some other idea, please share! Thanks in advance!
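For option 1, a rough, untested sketch would be to copy the session cookies out of the Watir-driven browser and send them with the plain Ruby request. This assumes @browser is the already-logged-in Watir::Browser and that its cookies API exposes :name/:value pairs:
require 'open-uri'

# Build a Cookie header from whatever the logged-in browser currently holds.
cookie_header = @browser.cookies.to_a
                        .map { |c| "#{c[:name]}=#{c[:value]}" }
                        .join('; ')

# Fetch the dynamic PDF using the browser's own session.
pdf_data = open("http://site.com/dynamic/thepdffile", "Cookie" => cookie_header).read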

I figured out a solution.
In the Firefox profile you can set plugin.scan.Acrobat to "999", which effectively disables the PDF plugin:
profile = Selenium::WebDriver::Firefox::Profile.new
profile['plugin.scan.Acrobat'] = "999"
b = Watir::Browser.new :firefox, :profile => profile
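If you also want Firefox to save the PDF straight to disk instead of showing a viewer or a dialog, the profile can be extended with the standard download preferences. This is a sketch; the download directory is just an example:
profile = Selenium::WebDriver::Firefox::Profile.new
profile['plugin.scan.Acrobat'] = "999"                                 # disable the Acrobat plugin
profile['pdfjs.disabled'] = true                                       # disable the built-in PDF.js viewer too
profile['browser.download.folderList'] = 2                             # 2 = use a custom download directory
profile['browser.download.dir'] = "/tmp/watir_pdfs"                    # example path
profile['browser.helperApps.neverAsk.saveToDisk'] = "application/pdf"  # don't prompt for PDFs

b = Watir::Browser.new :firefox, :profile => profile
b.goto "http://site.com/dynamic/thepdffile"
# The PDF should now be written to /tmp/watir_pdfs without any prompt.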

Related

Watir-Webdriver: how can I get embedded PDF text in Chrome using a Watir browser?

For some reason I can't access the PDF text in Chrome's built-in PDF viewer anymore.
@browser.text
=> ""
The PDF is embedded and I haven't been able to easily get it with Net/HTTP gets or curb or httparty. But it is showing up plain as day in the browser...
Do I have to do something with @browser.driver (some method on the underlying driver?), or maybe change the capabilities hash before Watir::Browser.new :chrome?
What are people doing now to check PDF text in web apps with the recent changes to Chrome and Chromedriver?
Watir is great for handling HTML, but it isn't designed to deal with formats like PDF. If you want to parse PDF files, you can try something like pdf-reader:
require 'pdf-reader'
require 'open-uri'
io = open(@browser.url)
reader = PDF::Reader.new(io)
reader.pages.first.text
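If you already have the raw PDF bytes in memory (for example from an authenticated request), PDF::Reader also accepts an IO-like object, so a small variation on the above would be:
require 'pdf-reader'
require 'stringio'

# pdf_data is assumed to already hold the binary PDF content.
reader = PDF::Reader.new(StringIO.new(pdf_data))
text   = reader.pages.map(&:text).join("\n")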

How to download HTML generated by JavaScript as a PDF

I want to save the HTML generated by JavaScript on a website.
When I run the JavaScript, it returns the finished HTML, with a button that links to the Chrome print dialog to save it as a PDF. I want to save this generated HTML as a PDF, but I can't manage it.
I've spent days trying almost everything: PDFKit with Nokogiri parsing, searching for a Chrome printer API, etc., but nothing worked. Does anyone know how I can do that?
PhantomJS and rasterize.js can convert it.
Just run the command:
phantomjs rasterize.js $URL_OR_PATH $PDF_OUT_FILENAME Letter
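If you need to drive that from Ruby, a minimal sketch (assuming phantomjs is installed and rasterize.js, which ships with the PhantomJS examples, is in the working directory) is to shell out:
url     = "http://example.com/page-generated-by-js"   # hypothetical URL
outfile = "output.pdf"

# Shell out to PhantomJS with the rasterize.js example script.
system("phantomjs", "rasterize.js", url, outfile, "Letter") or raise "phantomjs failed"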
Based on the JavaScript you're running, figure out the URL it calls, along with whatever variables it adds to the GET/POST request, then use OpenURI or an HTTP client of some sort to request that file. Pass that to Nokogiri, and parse out the URL for the file.
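As a sketch of that idea (the endpoint, parameters and selector below are hypothetical; substitute whatever the JavaScript actually requests):
require 'open-uri'
require 'nokogiri'

# Hypothetical endpoint that the JavaScript would normally call.
html = open("http://example.com/generate_report?id=42").read

doc = Nokogiri::HTML(html)
# Hypothetical selector: pull out the link the "save as PDF" button points at.
pdf_url = doc.at_css('a.print-link')['href']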
The alternative is to use one of the Watir gems to drive a browser and access the file that way. Then you can retrieve the HTML, or have the browser retrieve the file and get it off the disk when it's done.
I didn't understand the second solution you proposed; can you explain more?
Sometimes developers use Ajax to retrieve HTML and insert it into a page, or directly manipulate the page's HTML using JavaScript.
You can ask a Watir-driven browser to give you the current HTML and then parse it using Nokogiri or another XML parser, to retrieve things that are part of the HTML DOM at that moment. From there you can save that to disk and have the Watir-driven browser read it and render it. Then it's a matter of figuring out how to get the browser to print to PDF, or grab a snapshot of the screen to turn it into a PDF.
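A sketch of that flow, using PDFKit (which the question already mentions) for the final HTML-to-PDF step; the difference is that the HTML is taken from the browser after the JavaScript has run. It assumes the watir and pdfkit gems plus wkhtmltopdf are installed, and the URL is hypothetical:
require 'watir'
require 'pdfkit'   # wraps wkhtmltopdf, which must be installed separately

browser = Watir::Browser.new :firefox
browser.goto "http://example.com/page-generated-by-js"   # hypothetical URL

# browser.html is the DOM *after* the JavaScript has run,
# unlike the raw source a plain HTTP client would see.
rendered_html = browser.html

PDFKit.new(rendered_html, :page_size => 'Letter').to_file('output.pdf')
browser.close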

How do I fetch AJAX-loaded content from another site using Nokogiri?

I was trying to parse some HTML content from a site. Nokogiri works perfectly for content loaded the first time.
Now the issue is how to fetch that content which is loaded using AJAX. For example, there is a "see more" link and more items are fetched using AJAX, or consider a case for AJAX-based tabs.
How can I fetch that content?
You won't be able to use Nokogiri to parse anything that requires a JavaScript runtime to produce its content. Nokogiri is an HTML/XML parser, not a web browser.
PhantomJS on the other hand is a web browser, albeit a special kind of browser ;) Take a look at that and have a play.
It isn't completely clear what you want to do, but if you are trying to get access to additional HTML that is loaded by AJAX, then you will need to study the code, figure out what URL is being used for the AJAX request, whether any session IDs or cookies have been set, then create a new URL that reproduces what AJAX is using. Request that, and you should get the new content back.
That can be difficult to do, though. As @Nuby said, Mechanize could be a good help, as it is designed to manage cookies and sessions for you in the background. Mechanize uses Nokogiri internally, so if you request a page with Mechanize you can run Nokogiri searches against it to drill down and extract any particular JavaScript strings. They'll be present as text, so you can then use a regex or substring matches to get at the particular parameters you need, construct the new URL, and ask Mechanize to get it.
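A rough sketch of that flow (the URL, the script contents and the regex are hypothetical placeholders):
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/page-with-ajax-tabs')   # hypothetical URL

# Mechanize pages are backed by Nokogiri, so they can be searched directly.
# Pull the inline JavaScript and fish the AJAX endpoint out of it.
js       = page.search('script').map(&:text).join("\n")
endpoint = js[/url:\s*["']([^"']+)["']/, 1]                    # hypothetical pattern

# Request the same URL the JavaScript would have requested;
# Mechanize carries cookies and the session along automatically.
fragment = agent.get(endpoint)
puts fragment.body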

How to download dynamically generated content from a webpage?

I'm trying to download some data from a webpage that is dynamically generated, so using wget doesn't work. The page is http://gaceta.diputados.gob.mx/SIL/Legislaturas/Listados.html. I want to download the list shown for each of the options that can be selected in the field "Legislatura"; once downloaded I can process the data in Ruby.
I just wanted to know what the best way to download this is, and if possible, how to select each of the options and download it.
You can use the Web Inspector in Safari or Chrome, or the Firebug extension in Firefox, to look at how the data is loaded. The page is doing an AJAX POST request to a Perl script on that site, and the data is returned as XML.
I would use cURL to grab the data.
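A Ruby sketch of the same idea (the Perl script path and the form fields below are hypothetical; copy the real ones from the Network tab of the inspector):
require 'net/http'
require 'nokogiri'

# Hypothetical endpoint and parameters observed in the inspector.
uri = URI('http://gaceta.diputados.gob.mx/SIL/some_script.pl')
res = Net::HTTP.post_form(uri, 'Legislatura' => 'LXI')

# The response is XML; parse it and pull out the entries.
doc = Nokogiri::XML(res.body)
doc.xpath('//registro').each { |node| puts node.text }   # hypothetical element name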
You could use http://watir.com/ or Webrat to simulate what you would do to view the data, then use Nokogiri to parse the HTML.

Using a Ruby script to login to a website via https

Alright, so here's the dealio: I'm working on a Ruby app that'll take data from a website, and aggregate that data into an XML file.
The website I need to take data from does not have any APIs I can make use of, so the only thing I can think of is to login to the website, sequentially load the pages that have the data I need (in this case, PMs; I want to archive them), and then parse the returned HTML.
The problem, though, is that I don't know of any way to programmatically simulate a login session.
Would anyone have any advice, or know of any proven methods I could use to successfully log in to an https page and then programmatically load pages from the site using a temporary cookie session from the login? It doesn't have to be a Ruby-only solution -- I just want to know how I can actually do this. And if it helps, the website in question is one that uses Microsoft's .NET Passport service as its login/session mechanism.
Any input on the matter is welcome. Thanks.
Mechanize
Mechanize is a Ruby library which imitates the behaviour of a web browser. You can click links, fill out forms and submit them. It even has a history and remembers cookies. It seems your problem could be easily solved with the help of Mechanize.
The following example is taken from http://docs.seattlerb.org/mechanize/EXAMPLES_rdoc.html:
require 'rubygems'
require 'mechanize'

a = Mechanize.new
a.get('http://rubyforge.org/') do |page|
  # Click the login link
  login_page = a.click(page.link_with(:text => /Log In/))

  # Submit the login form
  my_page = login_page.form_with(:action => '/account/login.php') do |f|
    f.form_loginname = ARGV[0]
    f.form_pw        = ARGV[1]
  end.click_button

  my_page.links.each do |link|
    text = link.text.strip
    next unless text.length > 0
    puts text
  end
end
You can try using wget to fetch the page. You can analyse the login process with this app: www.portswigger.net/proxy/.
For what it's worth, you could check out Webrat. It is meant to be used as a tool for automated acceptance tests, but I think you could use it to simulate filling out the login fields, then click through links by their names, and grab the needed HTML as a string. I haven't tried doing anything like that, though.
