Internal links not working when converting .htm to .pdf - wkhtmltopdf

I am trying to convert an .htm file from the SEC website to a .pdf and have the internal links work. I am successfully converting to .pdf using wkhtmltopdf, but all the internal links point me back to the first page.
wkhtmltopdf https://www.sec.gov/Archives/edgar/data/1594617/000119312514117433/d640354ds1a.htm test.pdf

It looks like there's an issue with wkhtmltopdf dealing with anchor tags that have no content. There's a PR that was opened in 2017 to resolve it, but it remains open.
As it turns out, your document does indeed have empty anchor tags, so that's probably the root cause:
<A NAME="toc640354_15"></A>
Otherwise, I would suggest using Chrome to produce the PDF, via its --headless and --print-to-pdf flags. From your Chrome installation directory, run:
chrome.exe --headless --disable-gpu --print-to-pdf="C:\path\to\file.pdf" https://www.sec.gov/Archives/edgar/data/1594617/000119312514117433/d640354ds1a.htm
Make sure you specify an absolute path for the output file; a relative path doesn't seem to work, for whatever reason. The command returns immediately, without any output or indication of success, so give it a few seconds to retrieve, render, and write the file.
I tested with your document, and the links work perfectly.

Related

Chrome headless print-to-pdf doesn't render images

I am trying to write a script to output a lot of markdown pages to PDF using Chrome's headless mode. My current command is:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless
--run-all-compositor-stages-before-draw --disable-gpu
--print-to-pdf="index.pdf" http://localhost:8080/#!index.md
The resulting PDF file renders as the page would be shown, except for the images: what I get in the PDF is a link to each image instead of the image itself.
When I run the --screenshot option I do get the pictures you would expect in the resulting image file.
I think it has something to do with the page being rendered with MDwiki, which does a lot of client-side work to convert Markdown to HTML. But when I try to use the --virtual-time-budget option, Chrome errors out with a message about multiple tables only being allowed if the debugger is enabled.
Any suggestions for what next to try?
It turns out that there is a Node package that takes care of this: chrome-headless-render-pdf. There isn't much documentation, but it works. Check out:
npm docs chrome-headless-render-pdf
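Usage looks something like this (flag names taken from the package's README; verify with the npm docs above in case they've changed):
npx chrome-headless-render-pdf --url http://localhost:8080/#!index.md --pdf index.pdf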

How to download a PDF file in Ruby without .pdf in the link

I need to download a PDF from a website which does not provide a link ending in .pdf, using Ruby. Manually, when I click the link to download the PDF, it takes me to a new page, and the save/open dialog box appears after some time.
Please help me download the file.
The link
You can do this:
require 'open-uri'

# Stream the remote file and write it out in binary mode.
File.open('my_file_name.pdf', 'wb') do |file|
  file.write URI.open('http://someurl.com/2013-1-2/somefile/download').read
end
I have been doing this for my projects and it works.
If you just need a simple Ruby script to do it, you could shell out to wget, like this: exec 'wget "http://path.to.the.file/and/some/params"'. At that point, though, you might as well run wget directly.
The other way is to just run a GET on the page that you know the PDF is at:
require 'net/http'
source = Net::HTTP.get(URI('http://the.website.com/and/some/params'))
There are a number of other HTTP clients you could use, but as long as you make a GET request to the endpoint the PDF is served from, it should give you the raw data. Then just write that data out under a .pdf name, and you'll have the PDF.
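For example, a short sketch putting that together (the URL and filename are placeholders):
require 'net/http'

# Fetch the raw bytes and save them under a .pdf name.
data = Net::HTTP.get(URI('http://the.website.com/and/some/params'))
File.binwrite('the_file.pdf', data)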
In your case, I ran the following commands to get the PDF:
wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf
Then open the PDF. Note that these are Linux commands. If you want to get the file with a Ruby script, you could use something like what I previously mentioned.
Update:
There is an added complication that was not initially stated: the URL to the PDF changes every time the PDF is updated. To make this work, you'll probably want to do some web scraping; I suggest nokogiri. That way you can look at the page where the download lives and then perform a GET request on the desired URL. Furthermore, the server that hosts the PDF is misconfigured, and breaks Chrome within a few seconds of opening the page.
How to solve this: I went to the site and refreshed it, then broke the connection to the server (press the X where the refresh button would otherwise be). Then right-click next to the download link, select Inspect Element, and browse the DOM for something definitively identifying (like an id). Thankfully, I found <strong id="telecharger"> Download</strong>. That means you can use something like page.css('strong#telecharger')[0].parent['href'] to get the URL, and then perform a GET request on it as described above. I don't have time to write the whole script for you (too much work to do), but this should be enough to solve the problem; a rough sketch follows.
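A minimal sketch of that approach (the listing-page URL and output filename are assumptions of mine):
require 'open-uri'
require 'nokogiri'

# The page URL here is a guess -- use whatever page actually contains the download link.
page = Nokogiri::HTML(URI.open('http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/').read)

# Pull the href off the element wrapping the identifying <strong id="telecharger">.
pdf_url = page.css('strong#telecharger')[0].parent['href']

# If the href is relative, resolve it against the site root first.
File.binwrite('constitution.pdf', URI.open(pdf_url).read)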

Why won't pdftk-produced PDF files render in Firefox?

I have a site - www.jcrocetta.com.
On this site I have two PDF files. One file has blurred data and the other is clear; both were created with pdftk.
In order to blur out some personal data in the PDF I used Inkscape, but Inkscape only opens/edits one PDF page at a time. After making my edits I saved each page as a .pdf file, which left me with three separate PDFs, pages 1 through 3. I then used pdftk to concatenate the three files into one.
The final pdftk-produced files are on www.jcrocetta.com. Just click the public information button.
In Chrome viewing inline works fine.
Downloading the file from Firefox works fine too.
But viewing it inline in Firefox renders blank pages. How can I fix this?
Also, I know that pdf files not produced with pdftk will render correctly on both Chrome and Firefox.
Thanks for your help.
Firefox has a lovely new feature: it now uses the PDF.js library to render PDF files, instead of calling out to an Adobe Reader plugin or forcing you to save the file to disk. Unfortunately, it seems that PDF.js isn't quite perfect yet. A quick search shows that other people have the same issue, but the only "solution" I've seen offered boils down to "file a bug report at https://github.com/mozilla/pdf.js/issues or https://bugzilla.mozilla.org/enter_bug.cgi?product=Firefox&component=PDF+Viewer".
Also: do the three individual PDF files render in Firefox before you use pdftk to concatenate them?

Broken image in Chrome and Firefox, works in Safari

I have a logo that shows up in Safari, but in Chrome it appears as a broken link, and it simply does not show up at all in Firefox.
<img src="images/logo-01.png"/>
I have re-uploaded it many times and have even tried alternative paths and file names.
Anyone know how I might be screwing this up?
I ran into this same problem. For me, it turned out the image was corrupt: if I tried to open the PNG file in Photoshop, I would get an error saying it could not parse the file.
For whatever reason, Safari could display the corrupt file, but Chrome could not. Here is how I fixed my issue. I noticed Preview on my MacBook could open the file fine. If you are using Windows, try Paint or GIMP or some other program besides Photoshop.
1. I downloaded the corrupt file onto my MacBook and opened it with Preview (Open With > Preview).
2. In the Preview app, go to File > Duplicate, which makes a copy of your image.
3. Save that duplicated image.
4. As a test, I tried opening the new copy in Photoshop, and this time it worked!
5. Upload the new file to the website. I was able to view the image in Chrome now.
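If you don't have Preview handy, re-encoding the file with ImageMagick (assuming it's installed and can parse past the corruption) is the same kind of rewrite:
convert logo-01.png logo-01-fixed.png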
Hope that helps anyone who ran into the same problem.
It could be an issue with your file structure. Right now your links are using relative paths (e.g. href="index.html"). This is fine if the file you're referencing is in the same directory as the current page file. But if your current page is located elsewhere, like in a 'pages' directory or something, then you need to tell the links to start from the site root. That would look like href="/index.html" (note the slash). So for the image, you'd have:
<img src="/images/logo-01.png"/>

How do I save a web page, programmatically?

I would like to save a web page programmatically.
I don't mean merely save the HTML. I would also like to automatically store all associated files (images, CSS files, maybe embedded SWF, etc.), and hopefully rewrite the links for local browsing.
The intended usage is a personal bookmarks application, in which link content is cached in case the original copy is taken down.
Take a look at wget, specifically the -p flag
-p, --page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
The following command:
wget -p http://<site>/1.html
will download 1.html and all the files it requires.
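Since you also want the links rewritten for local browsing, it's worth adding wget's -k (--convert-links) flag, which rewrites links in the downloaded page to point at your local copies:
wget -p -k http://<site>/1.html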
On Windows, you can run IE as a COM object and pull everything out.
Another option is to take the source of Mozilla.
In Java, there's Lobo.
Or use commons-httpclient and write a lot of code.
You could try the MHTML format (which is what IE uses): http://en.wikipedia.org/wiki/MHTML
In other words, you'd be downloading each object (image, CSS, etc.) to your computer and then "embedding" them, via Base64, into a single file.
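The same single-file idea can also be done by hand with data: URIs instead of MHTML. A rough Ruby sketch (the page URL is a placeholder, and it naively assumes PNG images):
require 'base64'
require 'open-uri'
require 'nokogiri'

page_url = 'http://example.com/page.html'
doc = Nokogiri::HTML(URI.open(page_url).read)

# Replace each image reference with its Base64-encoded contents.
doc.css('img[src]').each do |img|
  data = URI.open(URI.join(page_url, img['src']).to_s).read
  # Assumes PNG; real code should look at the response's Content-Type.
  img['src'] = "data:image/png;base64,#{Base64.strict_encode64(data)}"
end

File.write('page_inlined.html', doc.to_html)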
