Chrome headless print-to-pdf doesn't render images - google-chrome-headless

I am trying to write a script to output a lot of markdown pages to PDF using Chrome's headless mode. My current command is:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless \
  --run-all-compositor-stages-before-draw --disable-gpu \
  --print-to-pdf="index.pdf" 'http://localhost:8080/#!index.md'
The resulting PDF renders the page as expected except for the images: where each image should be, the PDF shows a link to the image instead.
When I use the --screenshot option instead, the images do appear as expected in the resulting image file.
I suspect the cause is that the page is rendered with MDwiki, which does a lot of client-side work to convert markdown to HTML. But when I try the --virtual-time-budget option, Chrome errors out with a message saying that multiple tabs are only allowed if the debugger is enabled.
Any suggestions for what to try next?

It turns out that there is a Node package that takes care of this: chrome-headless-render-pdf. There isn't much documentation, but it works. Check out:
npm docs chrome-headless-render-pdf
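For the scripting side of the question (many markdown pages, one PDF each), here is a minimal Ruby sketch. It only builds and runs the command from the question; it does not by itself address the missing images. The helper names, the default base URL, and the output directory are assumptions for illustration. Passing the arguments to system as a list bypasses the shell, so the #! in the URL needs no quoting.

```ruby
# Path taken from the question; adjust for your platform.
CHROME = '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'

# Build the argument list for one page (hypothetical helper).
# The flags are the ones used in the question.
def chrome_pdf_args(page, base_url: 'http://localhost:8080/', out_dir: Dir.pwd)
  pdf = File.join(File.expand_path(out_dir), "#{File.basename(page, '.md')}.pdf")
  [
    CHROME,
    '--headless',
    '--run-all-compositor-stages-before-draw',
    '--disable-gpu',
    "--print-to-pdf=#{pdf}",
    "#{base_url}#!#{page}"
  ]
end

# Render each page in turn and return the pages whose command failed.
# system returns false or nil on failure, so those pages are kept by reject.
def render_all(pages, **opts)
  pages.reject { |page| system(*chrome_pdf_args(page, **opts)) }
end
```

Calling render_all(%w[index.md setup.md]) would then attempt both pages and return the names of any that failed (the page names here are made up).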

Related

Can anyone suggest how to take screenshot of full webpage using ruby selenium?

I want to capture a full-page screenshot in the Chrome browser using Ruby Selenium. I am using the RSpec testing framework. The save_screenshot method captures only the visible area.
I have gone through the following question:
How to take a screenshot of a full browser page and its elements using selenium-webdriver/capybara in Ruby?
But I don't want to use window resizing or the watir gem. Is there any other way, or another gem, to achieve the same thing?
1) You can use https://github.com/samnissen/watir-screenshot-stitch, which offers three approaches:
- Directly employing geckodriver's new full page screenshot functionality (only on Firefox).
- Screenshot stitching: paging down a given URL by the size of the viewport, capturing screenshots, and adjoining them.
- Employing a bundled html2canvas script against the page to generate a PNG from a canvas element.
2) Or use a native tool for your OS, such as webkit2png on macOS: https://paulhammond.org/webkit2png
The command looks like this:
webkit2png https://stackoverflow.com/questions/60728482/can-anyone-suggest-how-to-take-screenshot-of-full-webpage-using-ruby-selenium
where:
- webkit2png is the command
- everything after it is the URL of the page
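The stitching approach in the first option comes down to deciding where to scroll before each capture. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the two heights are assumed to come from the browser (for example by reading document.documentElement.scrollHeight and window.innerHeight through the driver's execute_script).

```ruby
# Scroll positions (in pixels) at which to capture one viewport-sized
# screenshot each, so that together they cover the whole page.
def scroll_offsets(page_height, viewport_height)
  raise ArgumentError, 'viewport_height must be positive' unless viewport_height.positive?

  # The browser cannot scroll past this point, so later offsets are clamped to it.
  max_scroll = [page_height - viewport_height, 0].max
  (0...[page_height, 1].max).step(viewport_height).map { |y| [y, max_scroll].min }
end
```

When the page height is not a multiple of the viewport height, the last capture overlaps the previous one, so the overlapping rows would need to be cropped before the images are joined.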
It's an old question, but it doesn't hurt to post this. You can use Scrot to take screenshots and later turn them into a GIF. Since the OP only wants screenshots, the code would be:
require 'screenscrot'
@screen = ScreenScrot.new
@screen.capture(:all)
On Linux, install scrot and the gem first:
sudo apt install scrot -y && gem install screenscrot

Internal Links Not Working Convert .HTM to .pdf

I am trying to convert an .htm file from the SEC website to a .pdf and have the internal links work. I am successfully converting to .pdf using wkhtmltopdf, but all the internal links point me back to the first page.
wkhtmltopdf https://www.sec.gov/Archives/edgar/data/1594617/000119312514117433/d640354ds1a.htm test.pdf
It looks like there's an issue with wkhtmltopdf dealing with anchor tags that have no content. There's a PR that was opened in 2017 to resolve it, but it remains open.
As it turns out, your document does indeed have empty anchor tags, so that's probably the root cause:
<A NAME="toc640354_15"></A>
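To check whether a given document contains empty named anchors like the one above, a rough Ruby sketch could scan the HTML with a regular expression. This is a heuristic rather than a real HTML parser, and the constant and function names are made up for illustration.

```ruby
# Matches an <a> tag that has a name attribute and nothing but
# whitespace between the opening tag and </a>. Case-insensitive,
# because the SEC document uses upper-case tags.
EMPTY_NAMED_ANCHOR = %r{<a\b[^>]*\bname\s*=\s*["']?([^"'\s>]+)["']?[^>]*>\s*</a>}i

# Return the name of every empty named anchor found in the HTML string.
def empty_named_anchors(html)
  html.scan(EMPTY_NAMED_ANCHOR).flatten
end
```

Called on the saved .htm source, this returns the anchor names that internal links may fail to resolve in the wkhtmltopdf output.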
I would suggest using Chrome to produce the PDF, with its --headless and --print-to-pdf flags. From your Chrome installation directory, run:
chrome.exe --headless --disable-gpu --print-to-pdf="C:\path\to\file.pdf" https://www.sec.gov/Archives/edgar/data/1594617/000119312514117433/d640354ds1a.htm
Make sure you specify an absolute path for the output file; with a relative path it doesn't seem to work, for whatever reason. The command returns immediately without any output or indication of success, so give it a few seconds to retrieve, render, and write the file.
I tested with your document, and the links work perfectly.
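Because the command returns before the file is written, a script that calls it needs to wait for the output. A small Ruby sketch of that wait is below; the function name and the default timings are assumptions, and a non-empty file is only a rough signal that writing has started, not proof that it has finished.

```ruby
# Poll until `path` exists and is non-empty, or until `timeout` seconds pass.
# Returns true if the file appeared in time, false otherwise.
def wait_for_file(path, timeout: 30, interval: 0.5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  until File.size?(path) # nil while the file is missing or empty
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
  true
end

# Possible usage (not run here):
#   out = File.expand_path('file.pdf') # absolute path, as advised above
#   system('chrome.exe', '--headless', '--disable-gpu', "--print-to-pdf=#{out}", url)
#   wait_for_file(out) or abort 'PDF was not written'
```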

Firefox add-on will not display images

I have a blank Firefox add-on that I made using the Getting Started tutorial. When I run my extension using jpm run, I observe the following.
If I navigate to any image on the web, it is displayed correctly, centred on the page.
However, I have the same image stored in my extension under ./data/test.jpg. When I navigate to resource://my-addon/data/test.jpg, I get a blank page.
The image is there, because the inspector shows it when I hover over the element.
Am I doing something wrong, am I missing something in the docs about rendering images, or is there a bug in how images are rendered from the extension?
Include the self module and then run:
var self = require("sdk/self");
console.log(self.data.url(''))
This prints the URL of your data directory, which contains the id of your add-on. It is very likely not my-addon; it will be something like jid1-4GP7z3tkUd3Tzg@jetpack, so the path to your image will be resource://jid1-4GP7z3tkUd3Tzg@jetpack/data/test.jpg.

How to download pdf file in ruby without .pdf in the link

I need to download a PDF using Ruby from a website that does not provide a link ending in .pdf. When I click the link manually, it takes me to a new page, and the dialog box to save or open the file appears after some time.
Please help me download the file.
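You can do this:
require 'open-uri'
# URI.open replaces the bare open(url) call, which no longer works on Ruby 3.
# "wb" writes the response in binary mode so the PDF bytes stay intact.
File.open('my_file_name.pdf', 'wb') do |file|
  file.write URI.open('http://someurl.com/2013-1-2/somefile/download').read
end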
I have been doing this for my projects and it works.
If you just need a simple Ruby script to do it, I'd shell out to wget, like this: exec 'wget "http://path.to.the.file/and/some/params"'
At that point, though, you might as well run wget directly.
The other way is to make a GET request to the page where you know the PDF is:
require 'net/http'
source = Net::HTTP.get('the.website.com', '/and/some/params') # host without the scheme, then the path
There are a number of other HTTP clients that you could use, but as long as you make a GET request to the endpoint that serves the PDF, it should give you the raw data. Then you can just rename the file, and you'll have the PDF.
In your case, I ran the following commands to get the pdf
wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf
Then open the PDF. Note that these are Linux commands. If you want to get the file with a Ruby script, you could use something like what I mentioned above.
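Since the URL gives no hint about the file type, a script may want to confirm that what it downloaded is a PDF before saving it under a .pdf name. A minimal sketch follows, relying on the fact that PDF files begin with the bytes %PDF-. The helper names are hypothetical, and the check assumes the header sits at the very start of the data.

```ruby
# Signature that opens a PDF file, as a binary string.
PDF_MAGIC = '%PDF-'.b

# True if the data starts with the PDF signature.
def pdf_data?(data)
  data.b.start_with?(PDF_MAGIC)
end

# Write the data to `path` in binary mode, refusing anything that
# does not look like a PDF (for example an HTML error page).
def save_pdf(data, path)
  raise ArgumentError, 'response does not look like a PDF' unless pdf_data?(data)

  File.binwrite(path, data)
end
```

The data argument would be the string returned by whichever download method is used, such as the Net::HTTP.get call shown earlier.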
Update:
There is an added complication that was not initially stated: the URL of the PDF changes every time the PDF is updated. To make this work, you probably want to do some web scraping; I suggest nokogiri. That way you can load the page where the download link is and then perform a GET request on the URL you find there. Furthermore, the server that hosts the PDF is misconfigured and breaks Chrome within a few seconds of opening the page.
How I solved this: I went to the site and refreshed it, then broke the connection to the server (press the X where the refresh button would otherwise be). Then I right-clicked next to the download link and selected Inspect Element, and browsed the DOM for something definitively identifying (like an id). Thankfully, I found <strong id="telecharger"> Download</strong>. This means that you can use something like page.css('strong#telecharger')[0].parent['href'], which should give you a URL. Then you can perform a GET request as described above. I don't have time to write the script for you (too much work to do), but this should be enough to solve the problem.
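One detail the steps above leave open is that the href pulled from the page may be relative. Assuming that can happen, a short sketch using Ruby's standard uri library resolves it against the URL of the page it was found on. The function name and the example.com URLs are placeholders.

```ruby
require 'uri'

# Resolve an href found on `page_url` into an absolute URL string.
# Absolute hrefs are returned unchanged.
def absolute_download_url(page_url, href)
  URI.join(page_url, href).to_s
end
```

For instance, absolute_download_url('http://example.com/en/documents/', 'func-download/129/') gives a full URL that can be passed to the GET request.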

Ruby pdf testing in browser

Has anyone been able to find a way to test PDFs with Ruby within the browser? I have tried a few different ways, and the only way I have been able to get any PDF testing to work is to save the PDF and use the pdf-reader gem. This only seems to work for PDFs that, when the link is clicked, open a dialog box with the options to open or save the file. Unfortunately, I have not been able to find a way to do anything like this with PDFs that open in the browser, with no dialog box offering to save them. Any ideas?
Maybe testing it in the browser isn't the best way. When you say test the PDF, what are you trying to do? I wouldn't test the PDF in the browser if I were you.
Try docsplit, if you want to verify its contents.
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
You are not inventing a browser, or a PDF generator.
Use unit tests to check your back-end modules can take data in, and write PDF out, then serve the PDF in a website and let the browser do its thing. Test (as what Rails calls a "functional test") that the MVC will produce a web page containing a link to the PDF, and you are done.
You can use the mechanize gem to download an online PDF (one that opens within the browser) to your computer and then read it with the pdf-reader gem.
