Converting PDF to images of original size - image

I have a PDF file made of photographs of a book, combined into a single file. I'm trying to convert it back to individual PNG images, but every tool I tried asks me to set a DPI, which alters the size of the resulting images. Is there a way to get images with exactly the same pixel dimensions as the originals?

Most PDFs of books contain a single image per page and depending on the scanner these images can basically be in three different formats: JPEG, JPEG2000 or TIFF. JPEG2000 is rarely used, so your PDF probably contains JPEG and/or TIFF images.
The good thing about JPEG (and JPEG2000) images is that they can be embedded as-is into a PDF! So you can extract the images as they are stored in the PDF. With TIFF this is also sometimes possible (but I don't think always...).
As mentioned by Tim Roberts you should try using pdfimages or hexapdf images to view and extract the images stored in the PDF. This will give you the best result.
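If a tool insists on a DPI value anyway, the DPI that reproduces the original pixel size can be worked out from the page size, because PDF page dimensions are expressed in points (1/72 inch by default). A minimal sketch, assuming the embedded image covers the full page width and that you already know the image's pixel width and the page width in points; the function name and the sample numbers are illustrative, not taken from your file:

```python
def dpi_for_original_size(image_px: int, page_pt: float) -> float:
    """DPI at which a page `page_pt` points wide renders to `image_px` pixels.

    Assumes the default PDF user unit of 1/72 inch and an image that
    covers the full page width.
    """
    page_inches = page_pt / 72.0
    return image_px / page_inches


# Hypothetical example: a 2550-pixel-wide photo on a US Letter page (612 pt wide).
print(dpi_for_original_size(2550, 612))  # prints 300.0
```

Rendering still re-encodes the page, so extracting the stored images directly remains the better option whenever it works.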

Related

Extract a whole scale image from a SVS format file in C++

I am trying to extract a whole scale image from a SVS file in C++.
I saw an explanation from the OpenSlide homepage.
It says the SVS format is "single-file pyramidal tiled TIFF".
So I tried to extract a whole-scale image the way I do for a TIFF image: I read all the IFDs from the SVS file, but none of them has tag 273 (StripOffsets), which holds the address of the whole-scale image data.
That's why I am a little confused now. Doesn't the SVS format have a whole-scale image inside the file?
I found an undefined private tag, number 34675, in an SVS file. Is this tag for the whole-scale image?
Or is there a proper way to extract it?
Aperio SVS is a tiled format. All levels of the pyramid are tiled images. The base layer is the first TIFF directory. This page of the LibTiff documentation shows how to read tiled images.
In short, you need to look for tag 324 (TIFFTAG_TILEOFFSETS), as well as tags 322 and 323 (TIFFTAG_TILEWIDTH, TIFFTAG_TILELENGTH). I highly recommend you use LibTiff for this, and don’t try to roll your own.
The custom tag in the SVS file contains metadata, including the physical size of a pixel in microns (SVS doesn’t set the resolution TIFF tags).
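To make the tag numbers above concrete, here is a rough sketch of where tags 322, 323 and 324 live in the first TIFF directory. It is not a replacement for LibTiff: it assumes a classic (non-BigTIFF) file already loaded into memory, only decodes SHORT and LONG fields, and the function names are made up for this illustration.

```python
import struct

TILE_WIDTH, TILE_LENGTH, TILE_OFFSETS = 322, 323, 324
# struct format code and byte size of the two field types handled here:
# SHORT (type 3) and LONG (type 4).
FIELD_FORMATS = {3: ("H", 2), 4: ("I", 4)}


def read_first_ifd(data: bytes) -> dict:
    """Return {tag: tuple_of_ints} for SHORT/LONG fields in the first IFD.

    Handles classic TIFF only; BigTIFF (magic number 43) is rejected.
    """
    order = {b"II": "<", b"MM": ">"}[data[:2]]  # little- or big-endian
    magic, ifd_offset = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, count = struct.unpack_from(order + "HHI", data, entry)
        if field_type not in FIELD_FORMATS:
            continue  # other field types are skipped in this sketch
        code, size = FIELD_FORMATS[field_type]
        if size * count <= 4:
            start = entry + 8  # small values are stored inline in the entry
        else:
            (start,) = struct.unpack_from(order + "I", data, entry + 8)
        tags[tag] = struct.unpack_from(order + f"{count}{code}", data, start)
    return tags


def tile_grid(image_width, image_length, tile_width, tile_length):
    """Number of tiles across and down, counting partial edge tiles."""
    across = (image_width + tile_width - 1) // tile_width
    down = (image_length + tile_length - 1) // tile_length
    return across, down
```

With the image size from tags 256 and 257 and the tile size from tags 322 and 323, tile_grid tells you how many entries to expect in tag 324. Decoding the tiles themselves is the part that is best left to LibTiff.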
You can read out the thumbnail image (is this what you mean by whole scale image?) as an openslide associated image.
For example, libvips has a convenient openslide binding written by the openslide authors:
$ vipsheader -f slide-associated-images CMU-1.svs
label, macro, thumbnail
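This lists the associated images in the SVS file: thumbnail is the small overview, label is the shot of the slide label, and macro is a low-resolution photograph of the whole slide. The huge pyramid is what you get by default when you open the file without the associated option.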
Get the thumbnail like this:
$ vips copy CMU-1.svs[associated=thumbnail] x.jpg
This reads the thumbnail and writes it out as a JPEG image.
In C++, you could write:
#include <vips/vips8>
using namespace vips;
// Inside main(), after VIPS_INIT(argv[0]):
VImage thumb = VImage::new_from_file("CMU-1.svs",
    VImage::option()->set("associated", "thumbnail"));
thumb.write_to_file("x.jpg");

Create small high quality PDF embedding optimized PNG?

I'm trying to create a small PDF file, embedding one optimized PNG image displayed as a header and footer on a 3 page PDF (same image must appear 6x in the PDF)
My optimized PNG image is only 2.3KB. It looks very sharp.
Failed with libreoffice
When I insert just one instance of the 2.3KB PNG image into a LibreOffice Writer doc containing only text and then export it as PDF, I can see that the image gets recompressed to JPG and the resulting PDF file grows by about 40KB. It also loses quality: the PNG gets fuzzy JPG edges.
If I right-click the image and select Compression, there is no way to disable recompressing the image (it's already optimized better than LibreOffice could do it). I've tried setting compression levels of 0, 1, 9, etc., and choosing JPG, no resize, lossless, and so on, but there was no improvement.
Failed with wkhtmltopdf
I also made a test page and used wkhtmltopdf, but it did the same thing. Adding the low quality flag made no difference.
PDF Spec suggests PNG is supported?
From skimming the PDF spec, it looks like PNG images are supported.
Even plain text PDF files are surprisingly large
The other disappointing thing is that when I take a 7KB HTML file, which is basically just <html><body><p>foo...</p><p>bar...</p> (only about 15 paragraphs) with no CSS, the resulting 2-page PDF file is 30KB. Why should a 7KB (almost plain text) file become 30KB as a PDF?
Suggestions?
Can someone please suggest how to make a small PDF file in Linux?
I need to include 7KB of text and repeat one PNG image 6 times.
Manually or programmatically. I'll take whatever I can get at this point.
PDF Spec suggests PNG is supported?
PNG isn't supported per se; PDF allows embedding JPEG images as-is, but not PNG images. PDF does borrow a set of features of the PNG format, however.
rinohtype (full disclosure: I'm the author) tries to embed as much as possible from PNG images as-is into the PDF. This does involve some bit-juggling to separate the alpha channel from the color data for example, but no reencoding of the image is performed. It does not (yet) support interlaced PNGs.
rinohtype should be able to do what you want to achieve. But please note that it currently is in a beta stage, so you might encounter some bugs.
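To illustrate the alpha-channel separation mentioned above: PDF keeps transparency in a separate soft-mask image rather than alongside the color samples, so RGBA data has to be split in two. The sketch below is not rinohtype's code; it only shows the idea on already-decoded 8-bit samples, and the function name is made up.

```python
def split_rgba(rgba: bytes) -> tuple[bytes, bytes]:
    """Split interleaved 8-bit RGBA samples into RGB bytes and alpha bytes."""
    if len(rgba) % 4:
        raise ValueError("expected a whole number of 4-byte RGBA pixels")
    alpha = rgba[3::4]  # every fourth byte, starting at the first A sample
    rgb = bytearray()
    for i in range(0, len(rgba), 4):
        rgb += rgba[i:i + 3]  # keep R, G, B and drop A
    return bytes(rgb), alpha


# Two pixels: (1, 2, 3, alpha 4) and (5, 6, 7, alpha 8).
color, mask = split_rgba(bytes([1, 2, 3, 4, 5, 6, 7, 8]))
```

The color bytes would go into the image itself and the alpha bytes into the image referenced by its /SMask entry.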
Even plain text PDF files are surprisingly large
To keep the PDF size as small as possible, make sure not to embed/subset any of the fonts. Use only the fonts from the base 14 PDF fonts which are provided by PDF readers.
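To show how little a PDF needs when it relies on a base 14 font, here is a hand-rolled sketch that writes a one-page file referencing Helvetica by name without embedding it. It is only an illustration under simple assumptions (one line of Latin-1 text, US Letter page, no compression); the function name is made up.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Build a one-page PDF that uses the base 14 font Helvetica, unembedded."""
    # Backslashes and parentheses must be escaped inside a PDF string literal.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        # The font is referenced by name only; no font program is included.
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf_bytes = build_minimal_pdf("Hello, small PDF")
```

For a short line of text the result is well under 1 KB, so most of a 30KB "plain text" PDF is typically font data and other overhead rather than the text itself.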
What you want is certainly achievable. Regarding the image quality, I would recommend making your image twice the size that you want it to actually display at in the PDF to keep it looking sharp.
As to the size, I've just modified a test in my PDF writer module (a work in progress) to include a 7.2K PNG, 200px x 70px, in a PDF twice, and the PDF came out at 6.8K. There's not much text included, but more text will only add its own size plus a small percentage.
You can see the module and original test here: https://github.com/DoccaPDF/docca-pdf-writer/blob/master/src/tests/writer.js#L40
That test adds ~112K of images to the PDF and results in a 103K PDF.
Of course, not all images are created equal, so your mileage may vary.
*The images are only actually added to the PDF once, but are displayed multiple times.
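The "added once, displayed multiple times" part works because a PDF stores an image as a single image XObject, and each page's content stream just paints it by name with the Do operator. A small sketch of the operators involved, assuming the image is registered as /Im1 in the page resources; the name, sizes and margins are placeholders, and this is not how the module linked above is written:

```python
def place_image(name: str, x: float, y: float, width: float, height: float) -> str:
    """Content-stream operators that paint image XObject `name` at (x, y)."""
    # q/Q save and restore the graphics state; cm maps the unit square in
    # which images are defined onto a width x height box at (x, y).
    return f"q {width:g} 0 0 {height:g} {x:g} {y:g} cm /{name} Do Q"


def header_and_footer(name="Im1", page_height=792, width=200, height=70, margin=36):
    """Paint the same image at the top and at the bottom of one page."""
    top = place_image(name, margin, page_height - margin - height, width, height)
    bottom = place_image(name, margin, margin, width, height)
    return top + "\n" + bottom


print(header_and_footer())
```

Using the same two lines in the content stream of all three pages would show the image six times while its data is stored only once.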

Can a PNG image contain multiple pages?

On OSX I converted a multi-page PDF file to PNG and (somehow) it created a multi-page PNG file.
Is there an extension to the PNG format that allows this? Or is this not something I can validly create?
~~~~
To clarify, this is a PNG file, per the builtin file command and the identify command from imagemagick.
$ file algorithms-combined-print.png
algorithms-combined-print.png: PNG image data, 1275 x 1650, 8-bit/color RGBA, non-interlaced
$ identify algorithms-combined-print.png
algorithms-combined-print.png PNG 1275x1650 1275x1650+0+0 8-bit sRGB 3.537MB 0.000u 0:00.000
And here is a pastebin of the command identify -verbose algorithms-combined-print.png: http://pastebin.com/hw1yuRKa
What is notable from that output is that the pixel count is Number pixels: 2.104M which corresponds to one page. However, the file size is 3.537MB, which is clearly sufficient to hold all the pages.
Per request, here is the output of pngcheck: http://pastebin.com/aCRMEd9L
PNG does not support "multipage" images.
MNG is a PNG variant that supports multiple images, mostly for animations, but it's not a real PNG image (different signature/header), and it has never become popular.
APNG is a similar attempt, but more focused on animations. It's more popular and alive, though less official, and it's PNG compatible (a standard PNG viewer that is unaware of APNG will display it as a single PNG image).
Another possible explanation is that your image is actually a TIFF image with a wrong .png extension, and the viewer ignores the extension.
The only way to know for sure is to look inside the image file itself (at least at the first bytes).
Update: given the pngcheck output, it seems to be an APNG file.
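If you want to do the "look at the first bytes" check yourself, the sketch below tells the cases above apart: it compares the file signature and, for a PNG, walks the chunk list looking for the acTL chunk that marks an APNG. The function name is made up, and it does not validate CRCs or anything else about the file.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MNG_SIGNATURE = b"\x8aMNG\r\n\x1a\n"


def identify_image(data: bytes) -> str:
    """Classify image bytes as 'TIFF', 'MNG', 'APNG', 'PNG' or 'unknown'."""
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF"
    if data[:8] == MNG_SIGNATURE:
        return "MNG"
    if data[:8] != PNG_SIGNATURE:
        return "unknown"

    pos = 8
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack_from(">I4s", data, pos)
        if chunk_type == b"acTL":
            return "APNG"  # animation control chunk
        if chunk_type == b"IDAT":
            break  # acTL has to appear before the first IDAT
        pos += 12 + length
    return "PNG"
```

For example, identify_image(open("algorithms-combined-print.png", "rb").read()) would report which of these the file really is.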

.net library for PDF creation that can handle vector (eps) images

Are there any libraries that let you include vector-based EPS files (e.g. Illustrator EPS images) in a PDF file? I mean without first rasterizing the image to a .PNG file; we need to retain the resolution-independent vector format.
We are working on a solution that lets users create hi-res printable documents online. Their layout can include text and images in various formats. They can upload their own art, which we then convert to a low-res onscreen image. If they upload an EPS file, we need to include the resolution-independent original file in the final production document, which would be a PDF document.
Most PDF creation libraries seem to lack support for vector EPS images...

TIFF image file format

I am working on TIFF images for image compression.
I want to know how the actual raw image data, i.e. the R, G, B components, is organised/stored in the TIFF file.
Is it stored as G0B0R0G1B1R1... (1 byte for each color component, all components interleaved)
or is it some other way, viz. planar format or something else?
Thank you.
-AD.
TIFF specifies:
How attributes are associated with a page
How multiple pages (and their attributes) are packed into a single file
Page attributes include properties such as:
Dimensions
Encoding scheme
In other words, a TIFF file may contain data that's encoded using any of many different encoding schemes.
The TIFF file can store various image types:
Bilevel (B/W)
Grayscale
Palette-color
RGB full-color
The storing of actual image data is done differently for each image type.
The specification is not the scariest I have seen, but it is definitely not trivial!
The TIFF specification can be found here: http://partners.adobe.com/public/developer/tiff/index.html
I have been doing the same with TIFF files, looking at multi-resolution TIFFs.
Adobe has the TIFF 6 documentation on its website.
You should be able to use P/Invoke on LibTiff with C# or VB.NET.
There are many types of compression, some of them proprietary.
Looking at the doc supplied by tomassao, I see that uncompressed RGB is just one of the possible TIFF encodings.
By default (PlanarConfiguration = 1, "chunky") the components of each pixel are stored interleaved, as R0G0B0R1G1B1...; PlanarConfiguration = 2 stores each component in its own plane instead. You can also specify more than 3 samples per pixel (but RGB is 3), and different numbers of bits per sample (but 8,8,8 is common).
I assume you already know about how the headers work. The document covers it if you don't.
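On the interleaved-versus-planar question: TIFF 6.0 records the layout in the PlanarConfiguration tag (284), where 1 (the default, "chunky") means R0 G0 B0 R1 G1 B1 ... and 2 means one separate plane per component. A small sketch of converting the chunky layout into planes, assuming uncompressed 8-bit samples already read from the strips; the function name is made up.

```python
def chunky_to_planar(samples: bytes, samples_per_pixel: int = 3) -> list[bytes]:
    """Split interleaved 8-bit samples into one byte string per component."""
    if len(samples) % samples_per_pixel:
        raise ValueError("sample count is not a multiple of samples_per_pixel")
    # Component c occupies every samples_per_pixel-th byte, starting at c.
    return [samples[c::samples_per_pixel] for c in range(samples_per_pixel)]


# Two RGB pixels: (10, 20, 30) and (11, 21, 31).
red, green, blue = chunky_to_planar(bytes([10, 20, 30, 11, 21, 31]))
```

If you plan to compress the components separately, this is the step that turns the usual TIFF layout into per-channel data.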

Resources