wkhtmltopdf: how to set caching of HTML footer and header?

Testing wkhtmltopdf for generating huge PDF reports. The final PDF is about 600 pages. After ~500 pages I got the same error:
Error:failed to load file .......... with network status code 201 and http status code 0 - Error opening kn_footer.html: Too many open files
The value of fs.file-max was about 1.6M. After the error I increased it to 2097152, but that did not help.
I tried to add file caching, but it's not working. The command line looks like:
wkhtmltopdf --cache-dir /tmp/ --allow /path/to/my/dir/ --margin-top 20 --load-error-handling ignore --orientation landscape --page-size A4 page kn_utf.html --footer-html kn_footer.html --header-html kn_header.html --footer-spacing 1 kn.pdf
Is there any way to tell wkhtmltopdf to load the header and footer once, or to close those files after each iteration?

I too was unable to get any sort of caching to work on a 532-page document. I did achieve some success by converting my footer.html file into wkhtmltopdf footer text. You might also consider splitting the source into two documents, running wkhtmltopdf twice, and then using pdfunite to merge the two temporary PDFs into the final document (a sketch of that follows the options below). If you're interested in the text-footer approach, here's the set of options I used. Don't forget to set the bottom margin (-B option), otherwise the text footer will get dropped off the page.
wkhtmltopdf \
-O landscape \
-s letter \
-L 20 \
-R 20 \
-T 20 \
-B 20 \
--exclude-from-outline \
--image-dpi 300 \
--header-html $BASE_DIR/header.html \
--footer-line \
--footer-spacing 10 \
--footer-left "$FOOTER_LEFT" \
--footer-center "$FOOTER_CENTER" \
--footer-right "$FOOTER_RIGHT" \
--footer-font-size 10 \
--cache-dir $BASE_DIR/cache \
--user-style-sheet $BASE_DIR/some.css \
$TARGET_DIR/[im]*.html \
$TARGET_DIR/final.pdf
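For the split-and-merge alternative mentioned above, a minimal sketch could look like the following. The split point and the part1/part2 file names are hypothetical, and pdfunite comes from poppler-utils:
# Render each half separately, then concatenate the PDFs.
wkhtmltopdf -O landscape -s letter --header-html header.html --footer-left "page [page]" part1.html part1.pdf
wkhtmltopdf -O landscape -s letter --header-html header.html --footer-left "page [page]" part2.html part2.pdf
pdfunite part1.pdf part2.pdf final.pdf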


Is there any way to use wildcards to download images from unspecified URLs within a domain?

I want to download all of the display images for each of the 807 pokemon on Bulbapedia. For instance, for Bulbasaur, I'd like to obtain its display image.
When I click on the image, I can see that the image addresses follow a certain pattern:
Bulbasaur: https://cdn.bulbagarden.net/upload/2/21/001Bulbasaur.png
Ivysaur: https://cdn.bulbagarden.net/upload/7/73/002Ivysaur.png
Venusaur: https://cdn.bulbagarden.net/upload/a/ae/003Venusaur.png
Charmander: https://cdn.bulbagarden.net/upload/7/73/004Charmander.png
Zeraora: https://cdn.bulbagarden.net/upload/a/a7/807Zeraora.png
...and so on. Basically, the URL that hosts each of the images is some form of https://cdn.bulbagarden.net/upload/*/*/*.png, each asterisk representing a wildcard.
My problem is that I'm unsure how I can represent these wildcards when using bash or wget. I've tried the following wget command to obtain the images:
wget -A.png -e robots=off -m -k -nv -np -p \
  --no-check-certificate \
  --user-agent="Mozilla/5.0 (compatible; Konqueror/3.0.0/10; Linux)" \
  https://cdn.bulbagarden.net/upload/
However, this downloads 0 bytes in 0 files, which means that no files are being recognized.
Is there any way I can go about doing this?
UPDATE: As some people have pointed out in the comments, I need some way to aggregate all the individual links themselves. I've found this page, which has links to the articles for each of the 807 pokemon. However, this creates the dilemma of recursively retrieving links from the linked pages. To actually get to the images, I'd need to click two more links after landing on the article for an individual pokemon. I'll show what I mean graphically:
From the List of Pokémon by National Pokédex number page, get the page link for Bulbasaur:
From the Bulbasaur (Pokémon) page, click on the Bulbasaur image to get to the directory that links to the actual png:
Finally, from the File:001Bulbasaur.png page, get the image link to the target PNG: https://cdn.bulbagarden.net/upload/2/21/001Bulbasaur.png
This process should be applied recursively to all of the links from the initial list page.
The command I've tried to get the desired result is:
wget --recursive --level=1 --no-directories --accept png https://bulbapedia.bulbagarden.net/wiki/List_of_Pokémon_by_National_Pokédex_number
But all I'm getting is this error: Unsupported scheme.
I'm pretty much a wget noob, so I'm not quite sure what I'm doing wrong here. How can I recursively get to the image links?
I want to download all of the display images for each of the 807 pokemon on Bulbapedia.
[...]
Basically, the URL that hosts each of the images is some form of https://cdn.bulbagarden.net/upload/*/*/*.png, each asterisk representing a wildcard.
Especially the first 2 asterisks are pretty random, so I'd forget about this pattern if I were you. Using an HTML-parser instead, like xidel, would be a much better idea if you're looking for specific files on a website.
Fast-forward to 2023, the List of Pokémon by National Pokédex number page now lists 1008 Pokémon.
Extracting the 1008 individual urls:
$ xidel -s "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number" \
-e '//tr[starts-with(td,"#")]/td[2]/a/@href'
/wiki/Bulbasaur_(Pok%C3%A9mon)
/wiki/Ivysaur_(Pok%C3%A9mon)
/wiki/Venusaur_(Pok%C3%A9mon)
[...]
Retrieving the indirect image-url from those 1008 individual urls:
$ xidel -s "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number" \
-f '//tr[starts-with(td,"#")]/td[2]/a/@href' \
-e '//td[@colspan="4"]/a[@class="image"]/@href'
/wiki/File:0001Bulbasaur.png
/wiki/File:0002Ivysaur.png
/wiki/File:0003Venusaur.png
[...]
Retrieve the direct image-url:
$ xidel -s "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number" \
-f '//tr[starts-with(td,"#")]/td[2]/a/@href' \
-f '//td[@colspan="4"]/a[@class="image"]/@href' \
-e '//div[@class="fullMedia"]/p/a/@href'
//archives.bulbagarden.net/media/upload/f/fb/0001Bulbasaur.png
//archives.bulbagarden.net/media/upload/8/81/0002Ivysaur.png
//archives.bulbagarden.net/media/upload/6/6b/0003Venusaur.png
[...]
And finally to download them to the current dir:
$ xidel "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number" \
-f '//tr[starts-with(td,"#")]/td[2]/a/@href' \
-f '//td[@colspan="4"]/a[@class="image"]/@href' \
-f '//div[@class="fullMedia"]/p/a/@href' \
--download .
(notice the absence of -s/--silent to see status information)
This involves 1 + (1008 x 3) = 3025 GET requests, so this will take a while!
Alternatively there's a quicker way:
$ xidel -s "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number" \
-e '//tr[starts-with(td,"#")]//img/@src'
//archives.bulbagarden.net/media/upload/thumb/f/fb/0001Bulbasaur.png/70px-0001Bulbasaur.png
//archives.bulbagarden.net/media/upload/thumb/8/81/0002Ivysaur.png/70px-0002Ivysaur.png
//archives.bulbagarden.net/media/upload/thumb/6/6b/0003Venusaur.png/70px-0003Venusaur.png
[...]
$ xidel -s "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number" \
-e '//tr[starts-with(td,"#")]//img/replace(@src,"(.+)/thumb(.+?png).+","$1$2")'
//archives.bulbagarden.net/media/upload/f/fb/0001Bulbasaur.png
//archives.bulbagarden.net/media/upload/8/81/0002Ivysaur.png
//archives.bulbagarden.net/media/upload/6/6b/0003Venusaur.png
[...]
$ xidel "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number" \
-f '//tr[starts-with(td,"#")]//img/replace(@src,"(.+)/thumb(.+?png).+","$1$2")' \
--download .
With a rather simple string manipulation on the thumbnail URLs on the list page you can get the same direct image URLs. This only involves 1 + 1008 = 1009 GET requests and will get you these images a lot quicker.
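If you'd rather let wget handle the downloading itself, a hypothetical variation on the same idea (the urls.txt name is my own) is to prepend the https: scheme in the replacement, save the list, and feed it to wget -i:
$ xidel -s "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number" \
-e '//tr[starts-with(td,"#")]//img/replace(@src,"(.+)/thumb(.+?png).+","https:$1$2")' > urls.txt
$ wget -i urls.txt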
Actually I had the same insane idea 5 years after you, and this was my solution:
I copied all the tables from the link you posted into VSCode and extracted all the URLs of the 70px images with the regex //archives.bulbagarden.net(.*?)(70px)(.*?)(.png)
I added https: at the beginning and replaced 70px with 375px (which is the most common resolution, and enough for my use case).
I made a Python script to download all the images from a .txt file:
import requests

# links.txt is expected to hold one image URL per line.
def download_image(url, filename):
    r = requests.get(url, allow_redirects=True)
    with open(filename, 'wb') as out:
        out.write(r.content)

def main():
    with open('links.txt', 'r') as f:
        for line in f:
            url = line.strip()
            filename = url.split('/')[-1]  # name the file after the last URL segment
            download_image(url, filename)

main()
Some regional versions are missing, but I'm satisfied with the result.

How to batch-convert PostScript files to PNGs via a folder action using ImageMagick

I'm trying to assign a folder action to a folder with PS files that would automatically convert the PS files dropped into the folder to PNG files. My shell script looks as follows:
for img in "$@"; do
  filename=${img%.*}
  convert "$img" -background white -flatten "$filename.png"
done
and the settings for my Automator folder action are provided in the screenshot below.
I'm experiencing two problems:
When I drop *.ps files onto the folder, the Automator action starts but does not produce any files. I'm guessing the problem is that the file names are not being passed to the shell script, but I'm not able to find a solution to this.
When I attempt to execute the conversion directly from the terminal with the command convert b.ps b.png, the produced image is cut off, as in the screenshot below.
I would like to fix the Automator action so that it:
- Takes all the files that I specify via the Filter Finder Items option
- Converts them to high-resolution, high-quality PNGs respecting the original PS file sizes (without cutting them off or adding extra margins)
(You should spell out clearly in your question that you are working on Mac OS X.)
You may have encountered a bug in ImageMagick when it comes to converting PS files (see also this discussion in the IM forum about it). Try adding -verbose to your convert command to see what exactly goes on.
The fact is that ImageMagick cannot by itself consume PostScript (or PDF) input. It has to employ a delegate to do that for you, and that delegate usually is Ghostscript.
It's a better approach for your task to make your shell script differentiate between input types: if you get PS or PDF, let Ghostscript do the job directly:
gs \
-o output.png \
-sDEVICE=pngalpha \
-dAlignToPixels=0 \
-dGridFitTT=2 \
-dTextAlphaBits=4 \
-dGraphicsAlphaBits=4 \
-r72x72 \
input.ps-or-pdf
Should you need further post-processing of the generated output.png (such as making the background white instead of transparent), you could pipe the output to an ImageMagick convert command now.
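A rough sketch of that piping step (my own, untested, and assuming a single-page input so only one PNG goes to stdout):
gs -q -o - -sDEVICE=pngalpha -r72x72 input.ps \
  | convert - -background white -flatten output.png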
Update
Since it has been asked in a comment: if you want to use the same file name for the output, but replace the .ps suffix with a .png suffix, use this Bashism:
inputfilename=../somedir/somefile.ps
gs \
-o ${inputfilename/.ps/.png} \
-sDEVICE=pngalpha \
-dAlignToPixels=0 \
-dGridFitTT=2 \
-dTextAlphaBits=4 \
-dGraphicsAlphaBits=4 \
-r72x72 \
${inputfilename}
or this one
-o $(dirname ${inputfilename})/$(basename ${inputfilename}).png
Both will allow you to keep the original directory (in case your inputfilename includes an absolute or relative path). Neither is fully flexible with the input file's suffix: the first works only for .ps, while the second keeps whatever suffix is there and appends .png, so for a PDF you'd get a .pdf.png suffix...
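A more suffix-agnostic variant (my own sketch, not from the answer above) strips whatever extension is present before appending .png:
inputfilename=../somedir/somefile.pdf
gs \
  -o "${inputfilename%.*}.png" \
  -sDEVICE=pngalpha \
  -r72x72 \
  "${inputfilename}"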
Update 2
First determine which is the real BoundingBox covered by the original PostScript. This is different from the declared BoundingBox, that may (or may not!) be stated in a %%BoundingBox line of the PostScript code. Ghostscript's -sDEVICE=bbox will do that for you:
gs -q -o - -sDEVICE=bbox a.ps
%%BoundingBox: 102 118 866 698
%%HiResBoundingBox: 102.257997 118.502434 865.278747 697.013979
Now you can use this info to determine how many pixels horizontally and how many pixels vertically you want the PNG output file to be. Here the content spans 866 - 102 = 764 points wide and 698 - 118 = 580 points high, which at Ghostscript's default 72 dpi maps to 764 x 580 pixels. I'll pick 940 pixels wide and 760 pixels high (to allow for some margin around the output). Use -g940x760 with Ghostscript to set this as the page size:
inputfilename=a.ps
gs \
-o ${inputfilename/.ps/.png} \
-sDEVICE=pngalpha \
-dAlignToPixels=0 \
-dGridFitTT=2 \
-dTextAlphaBits=4 \
-dGraphicsAlphaBits=4 \
-g940x760 \
${inputfilename}
The output is here:

Render a Blank Page with Ghostscript

How can I use Ghostscript to create a blank page? I would like to do this when merging multiple PDFs together, something like:
gs -dNOPAUSE -o /path/to/output input1.pdf <blank-page-here> input2.pdf
To spell out more explicitly what KenS suggested:
gs \
-o new.pdf \
-sDEVICE=pdfwrite \
-f input1.pdf \
-c showpage \
-f input2.pdf \
-c showpage \
-f input3.pdf \
-c showpage
will insert an additional blank page into new.pdf after the data of each input{1,2,3}.pdf has been processed.
Just send some PostScript: the showpage operator terminates a page, and if there's nothing on it, it will be blank.
You can either stick that in a file or use the -c and -f switches.
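A sketch of the file-based variant (the blank.ps name is my own invention): a one-line PostScript file whose single showpage emits one blank page wherever it appears in the argument list:
echo showpage > blank.ps
gs -o new.pdf -sDEVICE=pdfwrite input1.pdf blank.ps input2.pdf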
Please note that the pdfwrite device does not merge files. It interprets the content of the input to create marking operations which are fed to the device. The device then takes action on those operations; rendering devices render to a bitmap, while pdfwrite reassembles them into a PDF file.
So the output from your command line is not a 'merge' of the input files; it's a brand-new file whose only relationship with the input files is that the marks made on the page are the same.

How to transform a high-def PDF to low-def using command line tools?

I have a Unix server (Mac OS X, in fact) which currently transforms PS files into PDF files. It does this through ps2pdf, with these parameters:
ps2pdf14 \
-dPDFSETTINGS=/prepress \
-dEPSCrop \
-dColorImageResolution=72 \
-dColorConversionStrategy=/LeaveColorUnchanged \
INPUT_FILE \
OUTPUT_FILE
But now I have to adapt this script to take a PDF file as input instead of PS.
So I guess that ps2pdf will not work anymore, and I need something which can reduce the quality of a PDF.
Do you know a tool like this?
The ps2pdf14 script just runs the ps2pdfwr script with -dCompatibilityLevel=1.4, which in turn uses gs with various parameters. You can examine that script to see the options.
You could run gs directly, putting in the various options added by the scripts and your own -d options (which are passed directly to gs). I.e. try:
gs \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
-dEPSCrop \
-dColorImageResolution=72 \
-dColorConversionStrategy=/LeaveColorUnchanged \
-q \
-dNOPAUSE \
-dBATCH \
-sOutputFile=OUTPUT_FILE \
INPUT_FILE
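If the goal is simply a low-resolution PDF, a shorter variation (a sketch, untested against your files) is to rely on Ghostscript's /screen preset, which targets roughly 72 dpi images:
gs \
  -sDEVICE=pdfwrite \
  -dCompatibilityLevel=1.4 \
  -dPDFSETTINGS=/screen \
  -o OUTPUT_FILE \
  INPUT_FILE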
Your command should work with PDFs: Ghostscript (the backend for ps2pdf) accepts PDF as an input file. I just tested ps2pdf from Ghostscript 9.04 and it works.

Using Ghostscript in server mode to convert PDFs to PNGs

While I am able to convert a specific page of a PDF to a PNG like so:
gs \
-dSAFER \
-dBATCH \
-dNOPAUSE \
-sDEVICE=png16m \
-dGraphicsAlphaBits=4 \
-sOutputFile=gymnastics-20.png \
-dFirstPage=20 \
-dLastPage=20 \
gymnastics.pdf
I am wondering if I can somehow use Ghostscript's JOBSERVER mode to process several conversions without having to incur the cost of starting up Ghostscript each time.
from: http://pages.cs.wisc.edu/~ghost/doc/svn/Use.htm
-dJOBSERVER
Define \004 (^D) to start a new encapsulated job used for compatibility with Adobe PS Interpreters that ordinarily run under a job server. The -dNOOUTERSAVE switch is ignored if -dJOBSERVER is specified since job servers always execute the input PostScript under a save level, although the exitserver operator can be used to escape from the encapsulated job and execute as if the -dNOOUTERSAVE was specified.
This also requires that the input be from stdin, otherwise an error will result (Error: /invalidrestore in --restore--).
Example usage is:
gs ... -dJOBSERVER - < inputfile.ps
-or-
cat inputfile.ps | gs ... -dJOBSERVER -
Note: The ^D does not result in an end-of-file action on stdin as it may on some PostScript printers that rely on TBCP (Tagged Binary Communication Protocol) to cause an out-of-band ^D to signal EOF in a stream input data. This means that direct file actions on stdin such as flushfile and closefile will affect processing of data beyond the ^D in the stream.
The idea is to run Ghostscript in-process. The script would receive a request for a particular page of a PDF and would use Ghostscript to generate the specified image. I'd rather not start up a new Ghostscript process every time.
So why can't you simply use a command like this:
gs \
-sDEVICE=png16m \
-dGraphicsAlphaBits=4 \
-o pngimages_%03d.png \
\
-dFirstPage=20 \
-dLastPage=20 \
gymnastics.pdf
\
-dFirstPage=3 \
-dLastPage=3 \
sports.pdf
\
-dFirstPage=33 \
-dLastPage=33 \
athletics.pdf
\
-dFirstPage=4 \
-dLastPage=4 \
lazyness.pdf
This will generate several PNG images from different PDFs in a single go.
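If the page/file pairs arrive dynamically, a hypothetical wrapper (the jobs.txt file with "page pdf" lines is my own invention) can assemble that single invocation, so Ghostscript still starts only once per batch:
# Build one gs invocation from a list of "page pdf" pairs.
args=()
while read -r page pdf; do
  args+=(-dFirstPage="$page" -dLastPage="$page" "$pdf")
done < jobs.txt

# %03d numbers the output PNGs sequentially across all requested pages.
gs -dSAFER -sDEVICE=png16m -dGraphicsAlphaBits=4 -o "pngimages_%03d.png" "${args[@]}"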
