How to download HTML generated by JavaScript as a PDF - Ruby

I want to save the HTML generated by JavaScript on a website.
When I run the JavaScript, it returns the finished HTML, with a button that links to Chrome's print dialog to save as PDF. I want to save this generated HTML as a PDF, but I can't manage it.
I've spent days trying almost everything: PDFKit with Nokogiri parsing, searching for a Chrome printer API, etc., but nothing worked. Does anyone know how I can do that?

PhantomJS with its bundled rasterize.js example script can convert it.
Just run the command:
phantomjs rasterize.js $URL_OR_PATH $PDF_OUT_FILENAME Letter
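If you need to kick that off from Ruby rather than by hand, a minimal sketch is to shell out to the phantomjs binary (the URL here is a placeholder, and rasterize.js is assumed to sit in the working directory):

url      = "http://example.com/page-that-builds-html"  # placeholder URL
pdf_path = "output.pdf"

# rasterize.js ships with PhantomJS's examples; adjust the path to your install.
ok = system("phantomjs", "rasterize.js", url, pdf_path, "Letter")
raise "PhantomJS did not produce the PDF" unless ok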

Based on the JavaScript you're running, figure out the URL it calls, along with whatever variables it adds to the GET/POST request, then use OpenURI or an HTTP client of some sort to request that file. Pass that to Nokogiri, and parse out the URL for the file.
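As a rough sketch of that approach, assuming you have already identified the endpoint the JavaScript calls (the URL, parameter, and CSS selector below are placeholders):

require "open-uri"
require "nokogiri"

# The endpoint and its parameter come from reading the page's JavaScript; placeholders here.
html = open("http://example.com/report?id=42").read
doc  = Nokogiri::HTML(html)

# Parse the file's URL out of the returned markup; the selector is a placeholder.
file_url = doc.at_css("a.download")["href"]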
The alternative is to use one of the Watir gems to drive a browser and access the file that way. Then you can retrieve the HTML, or have the browser retrieve the file and get it off the disk when it's done.
I didn't understand the second solution you proposed. Can you explain more?
Sometimes developers use Ajax to retrieve HTML and insert it into a page, or directly manipulate the page's HTML using JavaScript.
You can ask a Watir-driven browser to give you the current HTML and then parse it using Nokogiri or another XML parser to retrieve things that are part of the DOM at that moment. From there you can save it to disk and have the Watir-driven browser read and render it. Then it's a matter of figuring out how to get the browser to print to PDF, or grabbing a snapshot of the screen and turning it into a PDF.
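A minimal sketch of that flow with the watir and nokogiri gems (the URL and the button that triggers the JavaScript are placeholders):

require "watir"
require "nokogiri"

browser = Watir::Browser.new :firefox
browser.goto "http://example.com/page-that-builds-html"  # placeholder URL
browser.button(:id => "generate").click                  # placeholder trigger

# browser.html is the DOM as it stands now, after the JavaScript has run.
doc = Nokogiri::HTML(browser.html)
File.open("generated.html", "w") { |f| f.write(doc.to_html) }
browser.close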

Related

Getting the JSON file when triggering AJAX

I'm writing a crawler to get the content from a website which uses AJAX.
There is a "show more" button at the bottom of the page, and my origin approach is to use Selenium.PhantomJS to pretend a web browser but it works in some website and some don't.
I'm wondering if there is some way i can directly get the underly JSON file of the AJAX action. Please give me some details, thanks.
By the way, I'm using Python.
I understand this is less a Python problem than a scraping problem in general (and I understand you meant "scraping" instead of "crawling": a scraper reads/parses/processes one page, whereas a crawler processes multiple pages and their relation to each other).
You can get the JSON file immediately, provided you know its URL. If you don't (for example, because the URL changes from time to time), you might need to search through the JavaScript files on the page manually to find out how the URL is generated.
Once you know the JSON file's URL, it's quite simple. As you already seem to know how to get the HTML of the "main" page, you can use your existing code to get the JSON file.
I'm not familiar with PhantomJS, but I reckon it's easier to get the JSON file immediately instead of simulating an AJAX request (if that's even possible with Phantom).
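As a sketch of hitting such an endpoint directly, written in Ruby like the rest of this thread (Python's requests library does the same in two calls); the URL, the offset parameter, and the JSON keys are placeholders you would read out of the browser's network inspector:

require "open-uri"
require "json"

# Endpoint observed in the network inspector when clicking "show more"; placeholders.
raw  = open("http://example.com/items.json?offset=20").read
data = JSON.parse(raw)
data["items"].each { |item| puts item["title"] }  # keys are placeholders too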

Read dynamic PDF from Ruby Watir

I am using Watir to log into an application, push some buttons, etc... Basically the normal stuff that a person would use Watir for.
However, my problem is that there is one particular page that I need to test. It's actually a dynamically-generated PDF and I need to get the actual binary data from it, so that I can load it using a certain gem that we're using. This normally works with static PDF files because we can just use:
open("http://site.com/something.pdf")
However, for a dynamically generated PDF it doesn't work, because Ruby sends its own HTTP request and is not aware of the headers/cookies/session that Watir's browser is using. So instead of the actual PDF we get a login page.
Another thing we tried was to use Watir to get the PDF:
browser.goto "http://site.com/dynamic/thepdffile"
browser.text
browser.html
We tried getting the text or HTML from the page, but no luck: Firefox builds a DOM of its own when loading a PDF, so the text is an empty string and the HTML is just that viewer DOM. We need the raw HTTP response, and there doesn't seem to be a way to extract it.
So we need a solution for this and in my opinion we have these options:
Figure out a way to use "open" or a similar method in Ruby, reusing the session from Watir (see the sketch after this list).
Figure out how to use Watir to get the binary HTTP response from the PDF page.
Disable the PDF plugin (which doesn't seem possible) so that the "save as" dialog appears.
Or if you have some other idea please share! Thanks in advance!
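For example, I picture option 1 as something like this untested sketch, if Watir's cookie jar can be replayed on a plain open-uri request:

require "open-uri"

# Copy the session cookies out of the Watir browser into a Cookie: header.
cookie_header = browser.cookies.to_a.map { |c| "#{c[:name]}=#{c[:value]}" }.join("; ")

pdf = open("http://site.com/dynamic/thepdffile", "Cookie" => cookie_header).read
File.open("thepdffile.pdf", "wb") { |f| f.write(pdf) }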
I figured out a solution.
In the Firefox profile you can set plugin.scan.Acrobat to "999", which effectively disables the PDF plugin: Firefox will only pick up an Acrobat plugin at or above that version, and none exists.
# Firefox profile that never loads the Acrobat/PDF plugin: only Acrobat
# version 999 or later would be scanned for, and none exists.
profile = Selenium::WebDriver::Firefox::Profile.new
profile['plugin.scan.Acrobat'] = "999"
b = Watir::Browser.new :firefox, :profile => profile  # PDF URLs now hit the download dialog

How do I fetch AJAX-loaded content from another site using Nokogiri?

I was trying to parse some HTML content from a site. Nokogiri works perfectly for content loaded the first time.
Now the issue is how to fetch content that is loaded using AJAX. For example, there is a "see more" link that fetches more items via AJAX, or consider the case of AJAX-based tabs.
How can I fetch that content?
You won't be able to use Nokogiri to parse anything that requires a JavaScript runtime to produce the content. Nokogiri is an HTML/XML parser, not a web browser.
PhantomJS on the other hand is a web browser, albeit a special kind of browser ;) Take a look at that and have a play.
It isn't completely clear what you want to do, but if you are trying to get access to additional HTML that is loaded by AJAX, you will need to study the code: figure out what URL is used for the AJAX request and whether any session IDs or cookies are set, then create a new URL that reproduces what the AJAX call is doing. Request that, and you should get the new content back.
That can be difficult to do, though. As @Nuby said, Mechanize could be a good help, as it is designed to manage cookies and sessions for you in the background. Mechanize uses Nokogiri internally, so if you request a page from Mechanize, you can run Nokogiri searches against it to drill down and extract any particular JavaScript strings. They'll be present as text, so you can use regex or substring matches to get at the particular parameters you need, then construct the new URL and ask Mechanize to get it.
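A rough sketch of that Mechanize flow (the page URL and the pattern used to find the endpoint are placeholders):

require "mechanize"

agent = Mechanize.new
page  = agent.get("http://example.com/page")  # Mechanize tracks cookies/session for us

# Mechanize pages are backed by Nokogiri, so ordinary searches work on them.
script_text = page.search("script").map(&:text).join("\n")

# Fish the AJAX endpoint out of the inline JavaScript; the regex is a placeholder.
if ajax_path = script_text[%r{/ajax/[^"']+}]
  more_html = agent.get(ajax_path).body
end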

How to download dynamic generated content from webpage?

I'm trying to download some data from a webpage that is dynamically generated, so using wget doesn't work. The page is http://gaceta.diputados.gob.mx/SIL/Legislaturas/Listados.html. I want to download the list shown for each of the options that can be selected in the field "Legislatura"; once downloaded, I can process the data in Ruby.
I just want to know the best way to download this, and if possible how to select each of the options and download its list.
You can use the Web Inspector in Safari or Chrome, or the Firebug extension in Firefox, to look at how the data is loaded. The page does an AJAX POST request to a Perl script on the site, and the data is returned as XML.
I would use cURL to grab the data.
You could use Watir (http://watir.com/) or Webrat to simulate what you would do to view the data, then use Nokogiri to parse the HTML.
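A rough sketch of replaying that POST from Ruby, once the inspector has shown you the script's URL and form fields (the path, the field name, and the XPath below are placeholders; copy the real ones from the inspector):

require "net/http"
require "uri"
require "nokogiri"

# URL and form data copied from the request shown in the Web Inspector; placeholders.
uri  = URI.parse("http://gaceta.diputados.gob.mx/cgi-bin/listados.pl")
resp = Net::HTTP.post_form(uri, "legislatura" => "LXII")

# The script answers with XML, which Nokogiri parses directly.
doc = Nokogiri::XML(resp.body)
doc.xpath("//item").each { |node| puts node.text }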

HTMLAgilityPack get AJAX value

I am trying to get the value of a timer using the HtmlAgilityPack; however, when I get the innerText by the element ID it returns --:--:--.
Is there any way to get the time value, since it is filled in via AJAX?
The thing is that when you load an HtmlAgilityPack.HtmlDocument (from the web, of course), it makes an HTTP request to the website, and what you receive is plain text; no AJAX/JavaScript is executed and no images are loaded. When you view the page in a browser you don't notice this, because it's a browser ;). After the browser receives the response, it starts loading images and executing JavaScript code, but HtmlAgilityPack uses HttpWebRequest to get the source, and it doesn't run any JavaScript.
I suggest you download a really brilliant tool for inspecting the HTTP traffic on your network: Fiddler2. It will let you set breakpoints and see exactly how the response is returned, and you will see that those things are really handled by the browser itself.
I don't really know what the purpose of getting the value of an AJAX timer is, but I think you could use WatiN to load the page in Internet Explorer (hiding it, because otherwise an IE window will appear on screen), and at the moment you need the timer's value, get the source from WatiN and load an HtmlDocument using LoadHtml(string html).
