How can I control the file size of a PDF (S, M, L) based on image quality using wkhtmltopdf?

I have to control the size of a PDF based on the quality of its images or content. How can this be achieved with the wkhtmltopdf command-line tool?

According to the help output, you can use the --image-quality <integer> parameter to control the JPEG compression quality of the images (the default is 94). Replace <integer> with a number between 1 and 100, where 100 is the best quality. I would suggest trying 80; sometimes 70 still gives acceptable results, but it depends on the image.
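One way to turn this into S, M and L presets is a small wrapper. The sketch below is only an illustration: the preset names come from the question, while the quality values (50, 70, 90) and the function names are assumptions that you would tune against your own images.

```shell
# Map a size preset to an --image-quality value.
# The numbers are illustrative assumptions, not wkhtmltopdf defaults.
quality_for_preset() {
  case "$1" in
    S) echo 50 ;;
    M) echo 70 ;;
    L) echo 90 ;;
    *) echo "unknown preset: $1" >&2; return 1 ;;
  esac
}

# Usage: render_pdf PRESET INPUT OUTPUT
render_pdf() {
  quality=$(quality_for_preset "$1") || return 1
  wkhtmltopdf --image-quality "$quality" "$2" "$3"
}
```

For example, render_pdf S page.html small.pdf would render page.html with the lowest of the three quality settings.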

Related

Storing format for an image with its legend

I am trying to store images of plants and their legends (as text) together. However I can't find a straightforward way to do this.
I can of course use an "advanced" text editor (by advanced, I mean with formatting, not just raw text) in which I would import the image and write the text, before exporting in PDF. I have also thought about html, which could be used to create one stand-alone local web page for each pair image-legend. But still, there would be 2 files per pair : one for the image and one for the html code.
However those are quite heavy procedures and I would be much more satisfied if I could "simply" use a rawer format in which the image's data and the text are sort of concatenated, or so...
Do you know of any format of this kind ? If not I'd better just code it myself...
Thank you in advance !
Images can be polyglots of image plus text (not advisable).
Images can hold text as steganography (also inadvisable).
Images can hold textual metadata: think Exif, JPEG comments, TIFF tags or IPTC.
You could even add a legend strip along the base of the image, but that is not "text". At the time of placement you paste both image and text.
HTML can hold the image as base64 text, but the encoded image needs about 133% of the original storage.
FB2 is similar in that it is XML with encoded images, but it has the advantage that it can be stored zipped as FB2Z, which comes nearest to your "concatenated" requirement.
PDF can hold both natively and, if done right, with less overhead than HTML but a bit more than an image with Exif.
If done well as PDF/A, both the image and the text can be extracted perfectly raw from the PDF, so the image could be discarded. All too often, however, they are mashed beyond pure extraction or even reuse.
In my case I can extract the image at 100% scale, and this is the text returned from this mini PDF:
Hello, Flowers!
Microsoft Windows Welcome Scan
This is the script I used to store both together, using the cross-platform Artifex mutool. The first line is the command, and the rest is the page description it reads:
mutool create -o "output.pdf" -O ascii "Page1.txt" ["page2.txt" ...]
%%MediaBox 0 0 595 842
%%Font Helv Helvetica Latin
%%Image Flowers1 C:/Users/name/Documents/WelcomeScan.jpg
% Draw an image. The six numbers before cm are: width, y skew, x skew, height, left offset, bottom offset (units are points). cm is "concatenate matrix", not centimetres.
q 512 0.0 0.0 384 41.5 400 cm /Flowers1 Do Q
% Draw a rectangle. move line fill
q 1 0.5 1 rg 41.5 370 m 553.5 370 l 553.5 270 l 41.5 270 l f Q
% Show some text.
q 0 0 1 rg
BT /Helv 24 Tf 210 330 Td (Hello, Flowers!) Tj ET
BT /Helv 24 Tf 100 290 Td (Microsoft Windows Welcome Scan) Tj ET
Q
Notes
%%MediaBox is the paper size in points, so the values above give A4 portrait.
%%Font must be declared so that the text style (and language encoding) can be used later.
%%Image needs an internal name and the full path for pre-loading. Note that this image is 1024x768 when extracted at 100%, but is displayed by choice at 50% (512x384).
Lines starting with a single % are comments, there to remind me what the pseudo-PostScript directives that lay out the content do. Remove them in a working template, or else they may be added to the PDF :-) The q ... Q blocks are the guts of the page and are heavily abbreviated, with the operator following its values: 1 0.5 1 rg sets the RGB fill colour to 100% red, 50% green and 100% blue.
The trick is knowing how a PDF page works: vectors, scaled images and text are placed from a bottom-left origin, bounded by the media box. Mutool takes the script and adds all the overhead data needed for a valid PDF.
All of the above can easily be templated and run with CMD or Bash, in much the same way that an ePub can be templated and TAR then called to convert a folder into folder.epub. The more complex ePub structure is not so easy to write in a script, though, so I suggest using a scriptable library for that.
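As a rough illustration of that templating idea, the sketch below prints a page script like the one above for a given image path and caption. The function name page_script, the internal image name Img1 and the fixed 512x384 placement are assumptions made for the example. The caption is inserted as-is, so it must not contain unescaped parentheses or backslashes.

```shell
# Usage: page_script IMAGE_PATH CAPTION
# Prints a page description for "mutool create" on stdout.
page_script() {
  cat <<EOF
%%MediaBox 0 0 595 842
%%Font Helv Helvetica Latin
%%Image Img1 $1
q 512 0 0 384 41.5 400 cm /Img1 Do Q
q 0 0 1 rg
BT /Helv 24 Tf 100 330 Td ($2) Tj ET
Q
EOF
}
```

It could then be used along the lines of page_script "$PWD/flowers.jpg" "Hello, Flowers!" > page1.txt followed by mutool create -o output.pdf page1.txt.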
ePub is the go-to answer, since the XHTML and image are zipped in their native formats and can easily be printed to PDF or converted to normal HTML + images.

Any Tricks to Use in wkhtmltopdf and pdftk to Reduce File Size?

I'm using wkhtmltopdf on OS X, and while it has been generally working as intended, the size of the files it generates is larger than I had hoped for. My goal is to essentially save a screenshot of the text content webpage as a pdf, and I don't really care about the images, hyperlinks, and other features on the page. I've been using the tool in conjunction with pdftk to save the first page of a website as a pdf, and below is an example of my code for the desired webpage (http://espn.go.com/mens-college-basketball/boxscore?gameId=400589702):
/usr/local/bin/wkhtmltopdf http://espn.go.com/mens-college-basketball/boxscore?gameId=400589702 --zoom 0.65 /Users/dwm8/Desktop/test.pdf
/usr/local/bin/pdftk /Users/dwm8/Desktop/test.pdf cat 1 output /Users/dwm8/Desktop/test2.pdf dont_ask
The size of the final file test2.pdf is 487 KB, which is larger than I would prefer. Are there any tricks I can use in wkhtmltopdf or pdftk to reduce the file size? Thanks for the help!
Well, if you don't care about hyperlinks or images, the obvious thing to do is suppress them using --disable-external-links and --no-images. If you are really only interested in the text, which is black and white, you may as well only generate a greyscale PDF too:
/usr/local/bin/wkhtmltopdf --disable-external-links --no-images --zoom 0.65 --grayscale 'http://espn.go.com/mens-college-basketball/boxscore?gameId=400589702' result.pdf
which gets the file size down from 500kB to 70kB on my system - a fairly useful 86% space saving!
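To check the effect of each option on your own pages, it helps to compare file sizes before and after. This is a small sketch with made-up helper names (size_bytes, saving_pct) that uses integer arithmetic only:

```shell
# File size in bytes; tr strips the padding some wc implementations add.
size_bytes() {
  wc -c < "$1" | tr -d ' '
}

# Percentage saved going from BEFORE to AFTER (same unit for both).
saving_pct() {
  echo $(( ($1 - $2) * 100 / $1 ))
}
```

With the figures quoted above, saving_pct 500 70 prints 86. For real files you could run saving_pct "$(size_bytes test.pdf)" "$(size_bytes result.pdf)".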
You could also pass --lowquality (it is a switch and takes no argument), since it is meant to shrink the size of the generated PDF.
More information on the options can be found here: http://wkhtmltopdf.org/usage/wkhtmltopdf.txt

Atom 1.0 Syndication Feed Icon and Logo Sizes

The Atom 1.0 specification has the following lines about the icon and logo elements in the feed:
icon - The image SHOULD have an aspect ratio of one (horizontal) to one (vertical) and SHOULD be suitable for presentation at a small size.
logo - The image SHOULD have an aspect ratio of 2 (horizontal) to 1 (vertical).
What is the recommended size for the icon and logo images so that they can be used by most Atom feed readers?
I found a blog post discussing just your question http://snook.ca/archives/rss/add_logo_to_feed/:
The Atom specification details two separate XML elements that can
contain a URL to an image: icon and logo. For icon, it indicates that
the image should have a 1:1 ratio and should be appropriate for small
sizes. Not much more is spelled out, such as appropriate image types.
... And further on ...
Now, the RSS specification is a little more specific when it comes to possible image sizes.
Maximum value for width is 144, default value is 88.
Maximum value for height is 400, default value is 31.
Since feed readers will probably show both Atom and RSS feeds, this is the ballpark image size they're expecting. So I checked out the Flickr Atom feed ( http://api.flickr.com/services/feeds/groups_discuss.gne?id=34427469792#N01&lang=en-us&format=atom), which uses a 48px square icon. Maybe you could stick to that for your icon. It doesn't supply a logo though.

SSRS can't properly render *some* images within PDF

I have a report that renders images (jpg) that have been collected from various sources. This works fine within the report viewer, and when exporting via Excel.
However, when exporting to PDF, about 5% of the images are rendered incorrectly as can be seen below, with the original on the left, and what is rendered on the right;
I find that if I open up one of these images in mspaint, and just click save, on the next report-run the image is now rendered correctly.
Are there any rules as to what image properties/format are valid for SSRS to render the image correctly within a PDF? Essentially I'd like to somehow find these images that will render incorrectly before the report is run and fix them prior...
Current Workaround
I never ended up getting SSRS to display the problem images as they were. However, working out before the report runs which images fall into the non-displayable set, so that they can be converted (automatically) to a supported format, also solved the problem.
In my case, all images were supplied via users uploading to a website, so I was able to identify and convert images as they arrived. For all existing images, I was able to run a script that identified the problem images and convert them.
Identifying problem images
From the thousands of images I had, I was able to determine that the images that wouldn't render correctly had the following properties:
Image had CMYK colorspace or;
Image had extended color profiles or;
Both of the above
Converting an image
I was originally using the standard .NET GDI (System.Drawing) to manipulate images however the API is often prone to crashes (OutOfMemoryException) when dealing with images that have extra data. As such, I switched to using ImageMagick where for each of the identified images I:
Stripped the color profiles and;
Converted to RGB
Note that the conversion to RGB from CMYK without stripping the color profiles was not enough to get all images to render properly.
I ended up just doing those items on every image byte stream I received from users (without first identifying the problem) before saving an uploaded image to disk. After which, I never had the rendering problem again.
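For reference, the same two steps can be sketched with the ImageMagick command-line tools. This assumes identify and convert are on the PATH (ImageMagick 7 installs them under the magick command), and the function names are hypothetical. It performs a plain colorspace conversion, which may differ from what the original .NET code did.

```shell
# True when the colorspace name reported by identify is CMYK.
needs_rgb_conversion() {
  [ "$1" = "CMYK" ]
}

# Usage: normalize_image INPUT OUTPUT
# Converts to sRGB and strips profiles for every image, logging CMYK ones.
normalize_image() {
  colorspace=$(identify -format '%[colorspace]' "$1") || return 1
  if needs_rgb_conversion "$colorspace"; then
    echo "converting $1 from $colorspace" >&2
  fi
  convert "$1" -colorspace sRGB -strip "$2"
}
```

Running normalize_image upload.jpg clean.jpg on every upload mirrors the "do it on every image" approach described above.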
Judging by the output, I would say those JPEG images use the CMYK colorspace, but SSRS assumes they are RGB and sets the wrong colorspace in the PDF.
If you can post a JPEG image and a sample PDF I can give you more details.
I've had exactly the same problem with an image rendering correctly on screen but appearing like the one in the question when I exported the report to PDF. Here's how I solved it.
The Problem
The first clue was this article I came across on MSDN. It seems that regardless of the original image density, the PDF renderer in SSRS resizes all images to 96 DPI. If the original size of the image is larger than the size of the page (or container), then you will get this problem.
The Solution
The solution is to resize the source image so that it fits on your page. This requires a little calculation, depending on your page size and margin settings.
In my case, I'm using A4 paper size, which is 21cm by 29.7cm. However, my left margin is 1.5cm, and my right margin is 0.5cm, for a total inner width of 19cm. I allow an extra 0.5 cm as a margin of error, so I use an inner width of 18.5cm.
21 cm - 1.5 cm - 0.5 cm - 0.5 cm = 18.5 cm
As noted before, the resolution generated by the PDF renderer is 96 DPI (dots per inch). For those of us not in the United States or Republic of Liberia, that's 37.79 DPC (dots per centimetre). So, to get our width:
18.5 cm * 37.79 dpc = 699 pixels
Your result may be different depending on (1) the paper size you are using, and (2) the left and right margins.
As the page is higher than it is wide, we need only resize the width while keeping the image proportional. If you're using a paper size which is wider than it is tall, you'd use the length instead.
So now open the source image in Paint (or your image editor of choice), and proportionally resize the image to the desired width (or length) in pixels, save it, import it into your container, and size the image visually with respect to the container. It should look the same on screen, and now render correctly to PDF.
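The width calculation above can be reproduced on the command line. This is a minimal sketch that assumes awk is available; the function name max_width_px and its argument order are made up for illustration.

```shell
# Usage: max_width_px PAGE_CM LEFT_MARGIN_CM RIGHT_MARGIN_CM SLACK_CM
# Prints the widest image, in whole pixels, that fits at 96 DPI.
max_width_px() {
  awk -v page="$1" -v left="$2" -v right="$3" -v slack="$4" \
    'BEGIN { printf "%d\n", (page - left - right - slack) * 96 / 2.54 }'
}
```

With the A4 figures from this answer, max_width_px 21 1.5 0.5 0.5 prints 699. If ImageMagick is installed, something like convert in.jpg -resize 699x out.jpg would then resize to that width while keeping the proportions.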
This is an issue reported to Microsoft Connect.
From SSRS 2008 How to get the best image quality possible?:
The image behavior you see in PDF is a result of some image conversions that the PDF renderer does, based on how the PDF specification requires that serialize images into PDF.
We know it's not ideal, and we classify the loss of image quality as a product issue. Therefore, it's difficult to really say what to do to get the best quality image.
Anecdotally, I have heard that customers have good results when the original image is a BMP

Ghostscript: how to reduce file size of large PDFs without changing smaller PDFs

I am using GhostScript to convert large batches of PDF to PDF to reduce file size. The original PDFs vary in size and quality. Where there is a low quality, small file size (<350kb) PDF the output from Ghostscript is often poor.
Is there a way I can get GhostScript to ignore files below a certain size and just pass them through without downsampling?
Current settings:
SearchablePDFSetting=-dColorImageResolution=120 -dMonoImageResolution=38 -dMonoImageDownsampleType=/Average -dOptimize=true -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dUseCIEColor -dColorConversionStrategy=/sRGB -dFIXEDMEDIA -dDEVICEWIDTHPOINTS=596 -dDEVICEHEIGHTPOINTS=834
Thanks,
Vix
The pdfwrite device can already pass images (not files) through without downsampling, but there is no way to 'pass through without changing' a file. If you do not want to process files below a certain size, then don't process them.
To avoid further downsampling of images, you need to add the 'xxxxImageDownsampleThreshold' parameters (one each for Mono, Gray and Color). If you set this to, say, 1.5, then images which are up to 50% higher resolution than the target resolution won't be downsampled.
Note that you haven't (apparently) set a GrayImageResolution, you haven't set the downsample type for Color or Gray images, and a MonoImageResolution of 38 looks pretty ugly to me.
The default Gray image filter is DCT (JPEG) as is the Color filter. If the original image was DCT then applying a second round of DCT compression will result in ugly artefacts, especially if the image is not downsampled. I would suggest you change the filter type to FlateEncode.
All these options are documented in ps2pdf.htm in the Ghostscript doc folder.
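Putting those suggestions together, a wrapper script can skip the small files itself and hand only the larger ones to Ghostscript. The sketch below is one possible shape: the 350 kB cut-off mirrors the figure in the question, the 120 dpi gray resolution is an assumption copied from the color setting, and the function names are hypothetical.

```shell
THRESHOLD_BYTES=358400   # 350 kB, the "small file" size from the question

# True when a file of this many bytes is large enough to be worth processing.
should_compress() {
  [ "$1" -ge "$THRESHOLD_BYTES" ]
}

# Usage: process_pdf INPUT OUTPUT
# Small files are copied unchanged; larger ones go through pdfwrite.
process_pdf() {
  size=$(wc -c < "$1" | tr -d ' ') || return 1
  if should_compress "$size"; then
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -dDownsampleColorImages=true -dColorImageResolution=120 \
       -dColorImageDownsampleThreshold=1.5 \
       -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
       -dDownsampleGrayImages=true -dGrayImageResolution=120 \
       -dGrayImageDownsampleThreshold=1.5 \
       -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode \
       -sOutputFile="$2" "$1"
  else
    cp "$1" "$2"
  fi
}
```

A batch run could then be as simple as looping over the input folder and calling process_pdf "$f" "out/$(basename "$f")" for each file.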
Add the option:
-dPDFSETTINGS=/screen
This "selects low-resolution output similar to the Acrobat Distiller 'Screen Optimized' setting."
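Ghostscript ships several such presets, so the S, M and L idea can also be expressed through -dPDFSETTINGS. In the sketch below the mapping of S, M and L onto /screen, /ebook and /printer is my own assumption, as are the function names. The preset names themselves are standard Ghostscript values.

```shell
# Map a size preset to a Ghostscript -dPDFSETTINGS value (assumed mapping).
pdfsettings_for_preset() {
  case "$1" in
    S) echo /screen ;;
    M) echo /ebook ;;
    L) echo /printer ;;
    *) echo "unknown preset: $1" >&2; return 1 ;;
  esac
}

# Usage: shrink_pdf PRESET INPUT OUTPUT
shrink_pdf() {
  preset=$(pdfsettings_for_preset "$1") || return 1
  gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
     -dPDFSETTINGS="$preset" -sOutputFile="$3" "$2"
}
```

For example, shrink_pdf S big.pdf small.pdf applies the /screen preset quoted above.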
