iText: PDF size when adding an image - image

I'm using iText 5.5.4 for creating PDF from Java, and it's a great library.
This is not an issue or wrong behaviour. It's only a curious fact, that I'm trying to understand.
I have a three-page PDF, with header, footer, tables, etc.
Its size is 96KB.
I have added a 4th page with a 950KB JPEG image. It fits within the dimensions of an A4 page.
Adding 96KB + 950KB + the 4th page's metadata + other items (header, footer), I expected the new PDF to be about 1.15MB.
But the final size was 1.41MB.
So, I have these two questions:
Is there anything wrong in my estimate? Why does adding an image add so much overhead?
If I scale the image to 75% with iText, the resulting PDF is still 1.41MB. Does the PDF include the original JPEG image despite the reduction?
I insist: the behaviour is right. It's only my own curiosity.
Thanks a lot.
EDIT: I don't have permission to share the image. My code to add an image to PDF is:
public Document addNewElement(Document input, String imageFilename) throws DocumentException {
    try {
        input.newPage();
        Image image = Image.getInstance(path + imageFilename);
        input.add(image);
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return input;
}

Scaling an image with iText doesn't reduce or increase the number of pixels of an image. The size of the image inside of the PDF is independent of the scaling factor you use. The only thing that changes is the perceived resolution. If you have a picture of 300 by 300 pixels, and you scale it to one by one inch, the resolution will be 300 dpi. If you scale it to two by two inches, the resolution will decrease, but the size of the image bytes will remain the same.
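The arithmetic in the paragraph above can be sketched as follows. `EffectiveDpi` is a hypothetical helper written for illustration, not an iText class. It only shows that the pixel count stays fixed while the printed size changes the resolution.

```java
// Hypothetical helper (not part of iText): resolution of an image whose
// pixel count is fixed but whose printed size changes with scaling.
final class EffectiveDpi {
    /** Resolution in dots per inch when `pixels` are spread over `inches`. */
    static double dpi(int pixels, double inches) {
        return pixels / inches;
    }

    public static void main(String[] args) {
        System.out.println(dpi(300, 1.0)); // prints 300.0
        System.out.println(dpi(300, 2.0)); // prints 150.0
    }
}
```

In both calls the image still has 300 pixels per side, so the number of image bytes stored in the PDF does not change.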
The increase of size of a PDF when you add an image depends on the type of image.
If you add a PNG file, iText will decode the PNG into pixels, create a bitmap, and add the compressed bitmap instead of the PNG, because the PDF specification doesn't have native support for PNG.
If you add a transparent image (e.g. a transparent GIF or PNG), iText will add two images: the opaque image as well as an image mask that defines the transparent parts. Why? Well, that's how transparency is defined in the PDF specification.
... (The behavior of all of the different types of images is described in one of my books. If you want to know more, I could look it up, or put it on the official web site if it's not already there somewhere.)
If you add a JPG, an exact copy of those bytes will be present in the PDF. It will be wrapped inside a stream object, causing some overhead, but not as much as a quarter of a megabyte.
Reading your question, I assume that you are adding a JPEG, but the increase of the file is much higher than you expect. We should take a look at the original PDF and your code to give a conclusive answer. I'll take one guess though. Suppose that the original PDF was compressed using the compression introduced in the PDF 1.5 specification (compressed objects + compressed XRef table), and you don't tell the stamper that you want the final result to use that compression, iText may create a PDF that only uses compression as defined in the PDF specifications prior to PDF 1.5. That cause is totally unrelated to you adding an image, but it could result in a difference of a quarter of a megabyte.

Related

Create small high quality PDF embedding optimized PNG?

I'm trying to create a small PDF file, embedding one optimized PNG image displayed as a header and footer on a 3 page PDF (same image must appear 6x in the PDF)
My optimized PNG image is only 2.3KB. It looks very sharp.
Failed with LibreOffice
When I insert just one instance of the 2.3KB PNG image into a LibreOffice Writer doc containing only text and then export it as PDF, I can see that the image gets recompressed to JPG, and the resulting PDF file grows by about 40KB after adding the image. It also loses quality: the PNG gets fuzzy JPG edges.
If I right-click the image and select compression, there is no way to disable recompressing the image (it's already optimized better than LibreOffice could do it). I've tried setting a compression level of 0, 1, 9, etc., and choosing JPG, no resize, lossless, etc., but there was no improvement.
Failed with wkhtmltopdf
I also tried making a test page and used wkhtmltopdf, but it did the same thing. Adding the low quality flag made no difference.
PDF Spec suggests PNG is supported?
From skimming the PDF spec, it looks like PNG images are supported.
Even plain text PDF files are surprisingly large
The disappointing thing is also when I take a 7KB HTML file which is basically just <html><body><p>foo...</p><p>bar...</p> (only about 15 paragraphs) with no CSS. The resulting 2 page PDF file is 30KB. Why should a 7kb (almost plain text) file become 30kb as a PDF?
Suggestions?
Can someone please suggest how to make a small PDF file in Linux?
I need to include 7KB of text and repeat one PNG image 6 times.
Manually or programmatically. I'll take whatever I can get at this point.
PDF Spec suggests PNG is supported?
PNG isn't supported per se; PDF allows embedding JPEG images as-is, but not PNG images. PDF does borrow a set of features of the PNG format, however.
rinohtype (full disclosure: I'm the author) tries to embed as much as possible from PNG images as-is into the PDF. This does involve some bit-juggling to separate the alpha channel from the color data for example, but no reencoding of the image is performed. It does not (yet) support interlaced PNGs.
rinohtype should be able to do what you want to achieve. But please note that it currently is in a beta stage, so you might encounter some bugs.
Even plain text PDF files are surprisingly large
To keep the PDF size as small as possible, make sure not to embed/subset any of the fonts. Use only the fonts from the base 14 PDF fonts which are provided by PDF readers.
What you want is certainly achievable. Regarding the image quality, I would recommend making your image twice the size that you want it to actually display at in the PDF to keep it looking sharp.
As to the size, I've just modified a test in my PDF writer module (WIP) to include a 7.2K PNG, 200px x 70px, in a PDF twice, and the PDF came out at 6.8K 8). There's not much text included, but more text will only add what it's worth + a small percentage.
You can see the module and original test here: https://github.com/DoccaPDF/docca-pdf-writer/blob/master/src/tests/writer.js#L40
That test adds ~112K of images to the PDF and results in a 103K PDF.
Of course not all images are created equal, so your mileage may vary.
*The images are only actually added to the PDF once, but are displayed multiple times.

Convert image to Blob

I want to upload image data to a PHP script on the server. I have a URL for an image source (PNG; the image might be located on a different server). I load this into a JavaScript image, draw it into a canvas, and use the canvas.toBlob() method (or a polyfill, as it is not widely supported yet) to generate a blob holding the image data. This works fine, but I noticed that the resulting blob is much bigger than the original image data.
In contrast, if I use an HTML file input and let the user select an image on the client, the resulting blob has the same size as the original image. Can I get image data from a canvas that is equal in size to the original image?
I guess the reason is that I lose the PNG compression (or any image compression) when using the canvas.toBlob() polyfill:
value: function (callback, type, quality) {
    var binStr = atob(this.toDataURL(type, quality).split(',')[1]),
        len = binStr.length,
        arr = new Uint8Array(len);
    for (var i = 0; i < len; i++) {
        arr[i] = binStr.charCodeAt(i);
    }
    callback(new Blob([arr], {type: type || 'image/png'}));
}
I am confused by so many conversion steps via image, canvas, blob - so maybe there is an alternative to get the image data from a given URL and finally append it to FormData to send it to the server?
The toDataURL method, when using the PNG format, only uses a limited subset of the formats available for PNG files: the compressed 8-bit-per-channel RGBA (32-bit) format. There are no options to use any of the other formats, so you are forced to include redundant data when you save as a PNG. PNG also has 24-bit and 8-bit formats, as well as several compression options, though I am unsure which is used by each browser.
In most cases it is best to send the original image. If you need to modify the image and do not use the alpha channel (no transparency) but still want the quality to be high send it as a jpeg with quality set to 1 (max).
You may also consider the use of a custom encoder for PNG that gives you access to more of the PNG encoding options, or even try one of the many other formats available, or make up your own format, though you will be hard pushed to improve on jpeg and webp.
You could also consider compressing the data on the server when you store it; even JPEG and WebP have a little room for more compression. For transport you should not worry, as most data these days is compressed as it leaves the page and is most definitely compressed by the time it leaves the client's ISP.

OpenCV imwrite gives washed-out result for jpeg images

I am using OpenCV 3.0 and whenever I read an image and write it back the result is a washed-out image.
code:
cv::Mat img = cv::imread("dir/frogImage.jpg",-1);
cv::imwrite("dir/result.jpg",img);
Does anyone know what's causing this?
Original:
Result:
You can try to increase the compression quality parameter as shown in OpenCV Documentation of cv::imwrite :
cv::Mat img = cv::imread("dir/frogImage.jpg",-1);
std::vector<int> compression_params;
compression_params.push_back(CV_IMWRITE_JPEG_QUALITY);
compression_params.push_back(100);
cv::imwrite("dir/result.jpg",img, compression_params);
If you don't specify the compression quality manually, a quality of 95 is applied.
However: 1. you don't know what JPEG compression quality your original image had (so you might increase the file size), and 2. it will (afaik) still introduce additional minor artifacts, because JPEG is, after all, a lossy compression method.
UPDATE: Your problem seems to be caused not by compression artifacts but by an image that uses the Adobe RGB (1998) color profile. OpenCV interprets the color values as they are, whereas it should convert them to fit the "real" RGB color space. Browsers and some image viewers apply the color profile correctly, while others don't (e.g. IrfanView). I used GIMP to verify. With GIMP you can decide on startup how to interpret the color values, getting either your desired image or your "washed-out" one.
OpenCV definitely doesn't care about such things, since it's not a photo-editing library, so the color profile is handled neither on reading nor on writing.
This is because you are saving the image as JPG. When you do this, OpenCV compresses the image.
Try saving it as PNG or BMP and there will be no difference.
However, the IMPORTANT QUESTION: I am loading the image as JPG and saving it as JPG. So how can there be a difference?!
Because there are many JPG compression/decompression algorithms, and they are not identical.
if you want to get into some details see this question:
Reading jpg file in OpenCV vs C# Bitmap
EDIT:
You can see what I mean exactly here:
auto bmp(cv::imread("c:/Testing/stack.bmp"));
cv::imwrite("c:/Testing/stack_OpenCV.jpg", bmp);
auto jpg_opencv(cv::imread("c:/Testing/stack_OpenCV.jpg"));
auto jpg_mspaint(cv::imread("c:/Testing/stack_mspaint.jpg"));
cv::imwrite("c:/Testing/stack_mspaint_opencv.jpg", jpg_mspaint);
jpg_mspaint=(cv::imread("c:/Testing/stack_mspaint_opencv.jpg"));
cv::Mat jpg_diff;
cv::absdiff(jpg_mspaint, jpg_opencv, jpg_diff);
std::cout << cv::mean(jpg_diff);
The Result:
[0.576938, 0.466718, 0.495106, 0]
As @Micha commented:
cv::Mat img = cv::imread("dir/frogImage.jpg",-1);
cv::imwrite("dir/result.bmp",img);
I was always annoyed when mspaint.exe did the same to JPEG images. Especially for screenshots... it ruined them every time.

EXIF and thumbnails

I'm working on a photo viewer. In this context, I wrote a small class to be able to read and use some EXIF data, as e.g. image orientation. This class works well for reading.
However, I would like to add a new option to rotate photos. I want to rotate and write the photo data itself, not just rewrite the orientation tag. I already wrote the code to rotate and save the primary JPEG image, and it works well. But I also need to rotate the thumbnail contained in the EXIF data, if any, to keep the image coherent. For this reason I need to write to the EXIF data, to replace the existing thumbnail.
But this raises some questions that I have trouble answering, namely:
Can the EXIF data contain more than one thumbnail, and if so, what is the maximum number of thumbnails that an image can contain?
What are the supported formats for thumbnails? (I found JPEG and TIFF; are there others?)
Is there any guarantee in the EXIF standards that the thumbnails are always written last in the EXIF data, just before the primary image?
If not, then every tag containing an offset that points to a location beyond the thumbnail being replaced should be updated. So, is there a standard way to iterate through all tags and sub-directories, to recognize which EXIF tags contain offsets, and to update them if needed? Or is the only way to read as many tags as possible and rewrite only those that are known?
Or is there a way to guarantee that the newly rotated thumbnail will be no larger than the thumbnail it replaces?
Regards
Here are some answers for your questions:
1) The EXIF data is laid out like a TIFF file with 2 pages. The first page is the camera information and the second page is the thumbnail. If you add more pages (with thumbnails), 99.99% of the applications probably won't notice since you'll be doing it differently than the "standard" way. As far as "maximum count", you have 64k of data that can be stored in any JFIF tag. You can put what you want in that 64k.
2) There is only 1 supported EXIF thumbnail format: TIFF. Inside the TIFF there can be compressed (JPEG) or uncompressed data. Again, you're welcome to stick LZW-compressed data in there, but most apps probably won't be prepared to display it properly.
3) The JFIF container format allows for tags with metadata to appear before the main image. The APPx tags contain metadata that can follow the standard or not. You're welcome to stick multiple EXIF APP1 tags into your files, but again, most apps won't be able to properly handle that situation. So the simple answer is that the EXIF data (including thumbnail) must come before the main image and if you put more than 1 thumbnail it will most likely be ignored.
4) If you are modifying a JFIF (including the metadata), you must rewrite the metadata. It's actually quite simple because each tag is independent and has a simple length value instead of relative offsets.
5) You can do anything you want with the size/orientation of your thumbnail as long as you make the EXIF APP1 tag data total size fit within 64k.
Here's what you need to do...
1) Read the source image (and thumbnail if present).
2) Prepare your rotated image (and thumbnail).
3) Write the new metadata with the new thumbnail image.
4) Write the new main image.
If you want to preserve the original metadata along with your new thumbnail, it's pretty easy. Just read the original tags and hold on to them, then write them in the new image. Each JFIF tag is just a 2 byte identifier (FFxx) followed by a 2 byte length and then the data. They can be packed in almost any order and there's no hard limit on how many total tags can appear before the main image.
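The segment layout described above (a 2-byte FFxx identifier, a 2-byte length, then the data) can be walked with a short loop. This is a minimal sketch, not a full JPEG parser: the `JpegSegments` class and its `Segment` type are hypothetical names, it assumes the whole file is already in a byte array, it ignores optional 0xFF fill bytes between segments, and it stops at the start-of-scan marker (FFDA), where the compressed main image begins. Note that the stored length counts its own two bytes but not the marker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the header segments that precede the main image.
final class JpegSegments {
    static final class Segment {
        final int marker;     // the xx in FFxx, e.g. 0xE1 for APP1 (EXIF)
        final int offset;     // position of the FF byte in the file
        final int dataLength; // payload size, excluding the 2 length bytes

        Segment(int marker, int offset, int dataLength) {
            this.marker = marker;
            this.offset = offset;
            this.dataLength = dataLength;
        }
    }

    static List<Segment> list(byte[] jpeg) {
        if (jpeg.length < 2 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            throw new IllegalArgumentException("missing SOI marker (FFD8)");
        }
        List<Segment> segments = new ArrayList<>();
        int pos = 2; // first segment starts right after SOI
        while (pos + 4 <= jpeg.length && (jpeg[pos] & 0xFF) == 0xFF) {
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xDA) {
                break; // start of scan: compressed image data follows
            }
            // Big-endian length that includes its own two bytes.
            int length = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            if (length < 2 || pos + 2 + length > jpeg.length) {
                break; // truncated or corrupt segment
            }
            segments.add(new Segment(marker, pos, length - 2));
            pos += 2 + length; // skip marker plus length field plus payload
        }
        return segments;
    }
}
```

To keep the original metadata, one could copy every listed segment unchanged except the APP1 segment (marker 0xE1), which would be rebuilt with the new thumbnail.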

SSRS can't properly render *some* images within PDF

I have a report that renders images (jpg) that have been collected from various sources. This works fine within the report viewer, and when exporting via Excel.
However, when exporting to PDF, about 5% of the images are rendered incorrectly as can be seen below, with the original on the left, and what is rendered on the right;
I find that if I open up one of these images in mspaint, and just click save, on the next report-run the image is now rendered correctly.
Are there any rules as to what image properties/format are valid for SSRS to render the image correctly within a PDF? Essentially I'd like to somehow find these images that will render incorrectly before the report is run and fix them prior...
Current Workaround
I never ended up getting SSRS to display the problem images as they were. However, determining before running the report which images belonged to the non-displayable set, so they could be converted to a supported format (automatically), was also a solution.
In my case, all images were supplied via users uploading to a website, so I was able to identify and convert images as they arrived. For all existing images, I was able to run a script that identified the problem images and convert them.
Identifying problem images
From the thousands of images I had, I was able to determine that the images that wouldn't render correctly had the following properties:
Image had CMYK colorspace or;
Image had extended color profiles or;
Both of the above
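As a rough illustration of the first criterion, the sketch below reads the component count from a JPEG's frame header without any imaging library. This is not the script used in this answer. It rests on the assumption that a JPEG with four components is CMYK (or the related YCCK encoding), it does not look at color profiles at all, and `JpegComponents` is a hypothetical name.

```java
// Hypothetical pre-check: a JPEG frame header with 4 components is assumed
// to indicate CMYK/YCCK data. Color profiles are NOT inspected here.
final class JpegComponents {
    /** Component count from the first frame header, or -1 if none is found. */
    static int count(byte[] jpeg) {
        if (jpeg.length < 2 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            return -1; // not a JPEG: no SOI marker
        }
        int pos = 2;
        while (pos + 4 <= jpeg.length && (jpeg[pos] & 0xFF) == 0xFF) {
            int marker = jpeg[pos + 1] & 0xFF;
            int length = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            // Start-of-frame markers are C0..CF except C4 (DHT), C8 and CC (DAC).
            boolean isFrameHeader = marker >= 0xC0 && marker <= 0xCF
                    && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
            if (isFrameHeader) {
                // Payload: precision (1 byte), height (2), width (2), components (1).
                return pos + 9 < jpeg.length ? (jpeg[pos + 9] & 0xFF) : -1;
            }
            if (marker == 0xDA || length < 2) {
                break; // image data reached or corrupt segment
            }
            pos += 2 + length;
        }
        return -1;
    }

    static boolean looksLikeCmyk(byte[] jpeg) {
        return count(jpeg) == 4;
    }
}
```

Files flagged by `looksLikeCmyk` would then be candidates for conversion. Images that only carry an extended color profile would need a separate check.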
Converting an image
I was originally using the standard .NET GDI (System.Drawing) to manipulate images however the API is often prone to crashes (OutOfMemoryException) when dealing with images that have extra data. As such, I switched to using ImageMagick where for each of the identified images I:
Stripped the color profiles and;
Converted to RGB
Note that the conversion to RGB from CMYK without stripping the color profiles was not enough to get all images to render properly.
I ended up just doing those items on every image byte stream I received from users (without first identifying the problem) before saving an uploaded image to disk. After which, I never had the rendering problem again.
Because of the way the output looks I would say those JPEG images have CMYK colorspace but the SSRS assumes they use RGB colorspace and sets the wrong colorspace in PDF.
If you can post a JPEG image and a sample PDF I can give you more details.
I've had exactly the same problem with an image rendering correctly on screen but appearing like the one in the question when I exported the report to PDF. Here's how I solved it.
The Problem
The first clue was this article I came across on MSDN. It seems that regardless of the original image density, the PDF renderer in SSRS resizes all images to 96 DPI. If the original size of the image is larger than the size of the page (or container), then you will get this problem.
The Solution
The solution is to resize the source image so that it will fit on your page. This requires a little calculation depending on your page size and margin settings.
In my case, I'm using A4 paper size, which is 21cm by 29.7cm. However, my left margin is 1.5cm, and my right margin is 0.5cm, for a total inner width of 19cm. I allow an extra 0.5 cm as a margin of error, so I use an inner width of 18.5cm.
21 cm - 1.5 cm - 0.5 cm - 0.5 cm = 18.5 cm
As noted before, the resolution generated by the PDF renderer is 96 DPI (dots per inch). For those of us not in the United States or Republic of Liberia, that's 37.79 DPC (dots per centimetre). So, to get our width:
18.5 cm * 37.79 dpc = 699 pixels
Your result may be different depending on (1) the paper size you are using, and (2) the left and right margins.
As the page is higher than it is wide, we need only resize the width while keeping the image proportional. If you're using a paper size which is wider than it is tall, you'd use the length instead.
So now open the source image in Paint (or your image editor of choice), and proportionally resize the image to the desired width (or length) in pixels, save it, import it into your container, and size the image visually with respect to the container. It should look the same on screen, and now render correctly to PDF.
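The width calculation above can be written as a small function so it is easy to repeat for other paper sizes and margins. This is a sketch of the arithmetic only; `PrintableWidth` and its parameter names are hypothetical, and the 96 DPI figure is the one quoted in this answer.

```java
// Hypothetical helper reproducing the calculation above.
final class PrintableWidth {
    static final double CM_PER_INCH = 2.54;

    /** Largest whole pixel width that fits the inner page width at the given DPI. */
    static long maxPixels(double pageCm, double leftMarginCm, double rightMarginCm,
                          double safetyCm, double dpi) {
        double innerCm = pageCm - leftMarginCm - rightMarginCm - safetyCm;
        return (long) Math.floor(innerCm * dpi / CM_PER_INCH);
    }

    public static void main(String[] args) {
        // A4 width, 1.5 cm left margin, 0.5 cm right margin, 0.5 cm safety, 96 DPI.
        System.out.println(maxPixels(21.0, 1.5, 0.5, 0.5, 96.0)); // prints 699
    }
}
```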
This is an issue reported to Microsoft Connect.
From SSRS 2008 How to get the best image quality possible?:
The image behavior you see in PDF is a result of some image conversions that the PDF renderer does, based on how the PDF specification requires that serialize images into PDF.
We know it's not ideal, and we classify the loss of image quality as a product issue. Therefore, it's difficult to really say what to do to get the best quality image.
Anecdotally, I have heard that customers have good results when the original image is a BMP

Resources