Converting PDF to image after zooming in

This link shows how PDFs can be converted to images. Is there a way to zoom my PDFs before converting them to images? In my project, I am converting PDFs to PNGs and then using the Python-tesseract library to extract text. I noticed that if I zoom into a PDF and then save parts of it as PNGs, OCR gives much better results. So is there a way to zoom PDFs before converting them to PNGs?

I think that raising the resolution of the rendered image is a better solution than zooming into the PDF.
Using pdf2image you can accomplish this quite easily.
Install pdf2image: pip install pdf2image
Then, in Python, convert your PDF into a high-resolution image:
from pdf2image import convert_from_path
pages = convert_from_path('sample.pdf', dpi=400)  # render at 400 DPI (the default is 200)
pages[0].save("sample.png")  # save the first page as a PNG
By playing around with the dpi parameter you should get the result you desire.
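To tie this back to the OCR use case in the question, here is a minimal sketch that treats "zoom" as a multiplier on the rendering DPI and passes each rendered page to pytesseract. The helper names dpi_for_zoom and ocr_pdf are hypothetical, and the 200 DPI baseline is simply pdf2image's default, not a recommendation.

```python
def dpi_for_zoom(zoom, base_dpi=200):
    """Translate a zoom factor into a rendering DPI (hypothetical helper).

    A zoom of 2.0 on a 200 DPI baseline renders at 400 DPI, so every page
    has twice as many pixels in each direction.
    """
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    return int(round(base_dpi * zoom))


def ocr_pdf(path, zoom=2.0):
    """Render each page at the scaled DPI and return the OCR text per page."""
    # Imported here so dpi_for_zoom stays usable without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(path, dpi=dpi_for_zoom(zoom))
    return [pytesseract.image_to_string(page) for page in pages]
```

Calling ocr_pdf('sample.pdf', zoom=2.0) would render at 400 DPI, which matches the example above. Higher values cost more memory and time, so it is worth trying a few zoom factors on a representative page.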

Related

Converting PDF to images of original size

I have a PDF file made of photographs of a book combined into a single PDF. I'm trying to convert it back to individual images in PNG format. Every tool I tried asks me to set a DPI, which alters the size of the resulting images. Is there a way to get images with exactly the same pixel size as the original images?
Most PDFs of books contain a single image per page and depending on the scanner these images can basically be in three different formats: JPEG, JPEG2000 or TIFF. JPEG2000 is rarely used, so your PDF probably contains JPEG and/or TIFF images.
The good thing about JPEG (and JPEG2000) images is that they can be embedded as-is into a PDF! So you can extract the images as they are stored in the PDF. With TIFF this is also sometimes possible (but I don't think always...).
As mentioned by Tim Roberts you should try using pdfimages or hexapdf images to view and extract the images stored in the PDF. This will give you the best result.
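As a rough sketch of the pdfimages route, the snippet below builds the command line and runs it from Python. It assumes the Poppler version of pdfimages is on the PATH and that it supports the -list and -all options; the helper names and the "page" output prefix are hypothetical.

```python
import subprocess


def pdfimages_command(pdf_path, output_prefix=None):
    """Build the argument list for pdfimages (hypothetical helper).

    Without a prefix, list the embedded images (pixel size, encoding).
    With a prefix, extract them; -all asks pdfimages to write each image
    in its native format where it can instead of re-encoding it.
    """
    if output_prefix is None:
        return ["pdfimages", "-list", pdf_path]
    return ["pdfimages", "-all", pdf_path, output_prefix]


def extract_images(pdf_path, output_prefix="page"):
    """Run the extraction; raises CalledProcessError if pdfimages fails."""
    subprocess.run(pdfimages_command(pdf_path, output_prefix), check=True)
```

Because the images are copied out of the PDF rather than rendered, no DPI setting is involved and the pixel dimensions are whatever was stored in the file.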

Converting TIFF to PDF with GraphicsMagick MediaBox / CropBox resolution

We are currently converting TIFF files to PDF using GraphicsMagick. The TIFF is coming from an eFAX and has a (pixel) resolution of 1728x2200.
If you do the conversion with tiff2pdf, or just open the file in Preview and export it to PDF, the result has a MediaBox of 612x792 points, which is what is expected.
However, GraphicsMagick generates a MediaBox of 1728x4400 and a CropBox of 610x792. It all looks good if you open it in a PDF viewer, because the viewer uses the CropBox, but if you feed it to Ghostscript afterwards, the image does not fill the page; it appears as a small square inside the document.
The lazy solutions would be to switch to tiff2pdf or add -dUseCropBox to our Ghostscript command, but I'd like to know which GraphicsMagick option produces a PDF with the correct MediaBox. It's as if it doesn't understand that the resolution is in pixels and not in points. I hope somebody has insights.
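For reference, the relationship between pixel size and page size in points can be sketched as below. PDF user space uses 72 points per inch, so the page size depends on the density (DPI) the converter assumes for the image. The helper names are hypothetical, and the 204 and 200 DPI values in the usage note are illustrative assumptions, not values read from the asker's file.

```python
POINTS_PER_INCH = 72.0  # default PDF user space unit


def pixels_to_points(pixels, dpi):
    """Convert a pixel count to PDF points for a given image density."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixels * POINTS_PER_INCH / dpi


def media_box(width_px, height_px, x_dpi, y_dpi):
    """Return the (width, height) in whole points for an image-sized page."""
    return (
        round(pixels_to_points(width_px, x_dpi)),
        round(pixels_to_points(height_px, y_dpi)),
    )
```

With a density of 72 DPI in both directions, media_box(1728, 2200, 72, 72) returns (1728, 2200), i.e. the pixel counts pass through unchanged, which is the kind of pixel/point confusion the question describes. With the assumed 204x200 DPI, media_box(1728, 2200, 204, 200) returns (610, 792).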

Create small high quality PDF embedding optimized PNG?

I'm trying to create a small PDF file that embeds one optimized PNG image, displayed as a header and footer on a 3-page PDF (so the same image must appear 6 times in the PDF).
My optimized PNG image is only 2.3KB. It looks very sharp.
Failed with LibreOffice
When I insert just one instance of the 2.3KB PNG image into a LibreOffice Writer document containing only text and then export it as PDF, I can see that the image gets recompressed to JPG, and the resulting PDF file grows by about 40KB after adding the image. It also loses quality: the PNG gets fuzzy JPG edges.
If I right-click the image and select compression, there is no way to disable recompressing the image (it's already optimized better than LibreOffice could do it). I've tried setting a compression level of 0, 1, 9, etc., and choosing JPG, no resize, lossless, etc., but there was no improvement.
Failed with wkhtmltopdf
I also tried making a test page and used wkhtmltopdf, but it did the same thing. Adding the low quality flag made no difference.
PDF Spec suggests PNG is supported?
From skimming the PDF spec, it looks like PNG images are supported.
Even plain text PDF files are surprisingly large
It is also disappointing that when I take a 7KB HTML file which is basically just <html><body><p>foo...</p><p>bar...</p> (only about 15 paragraphs) with no CSS, the resulting 2-page PDF file is 30KB. Why should a 7KB (almost plain text) file become 30KB as a PDF?
Suggestions?
Can someone please suggest how to make a small PDF file on Linux?
I need to include 7KB of text and repeat one PNG image 6 times.
Manually or programmatically, I'll take whatever I can get at this point.
PDF Spec suggests PNG is supported?
PNG isn't supported per se; PDF allows embedding JPEG images as-is, but not PNG images. PDF does borrow a set of features of the PNG format, however.
rinohtype (full disclosure: I'm the author) tries to embed as much as possible from PNG images as-is into the PDF. This does involve some bit-juggling to separate the alpha channel from the color data for example, but no reencoding of the image is performed. It does not (yet) support interlaced PNGs.
rinohtype should be able to do what you want to achieve. But please note that it currently is in a beta stage, so you might encounter some bugs.
Even plain text PDF files are surprisingly large
To keep the PDF size as small as possible, make sure not to embed/subset any of the fonts. Use only the fonts from the base 14 PDF fonts which are provided by PDF readers.
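To illustrate how small a text-only PDF can be when it relies on a base 14 font, here is a minimal sketch that writes a one-page PDF by hand using Helvetica without embedding it. The function name build_minimal_pdf is hypothetical, the page size and text position are arbitrary choices, and the sketch only handles text that can be encoded as Latin-1; it is not how rinohtype or any other tool mentioned here works.

```python
def build_minimal_pdf(lines):
    """Return the bytes of a one-page PDF showing the given text lines."""
    # Content stream: 12pt Helvetica, 14pt leading, starting near the top left.
    ops = ["BT", "/F1 12 Tf", "14 TL", "72 720 Td"]
    for line in lines:
        # Backslashes and parentheses must be escaped inside PDF strings.
        escaped = line.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        ops.append("(%s) Tj T*" % escaped)
    ops.append("ET")
    stream = "\n".join(ops).encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        # A base 14 font is referenced by name only; no font data is embedded.
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)
```

Writing the result of build_minimal_pdf(["foo", "bar"]) to a file gives a PDF whose size is dominated by the few structural objects; the text itself adds roughly its own length because the content stream is left uncompressed here.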
What you want is certainly achievable. Regarding the image quality, I would recommend making your image twice the size that you want it to actually display at in the PDF to keep it looking sharp.
As to the size, I've just modified a test in my PDF writer module (WIP..) to include a 7.2K png, 200px x 70px, in a PDF twice and the PDF came out at 6.8K 8). There's not much text included, but more text will only add what it's worth + a small percentage.
You can see the module and original test here.. https://github.com/DoccaPDF/docca-pdf-writer/blob/master/src/tests/writer.js#L40
That test adds ~112K of images to the PDF and results in a 103K PDF.
Of course not all images are created equal, so your mileage may vary.
*The images are only actually added to the PDF once, but are displayed multiple times.

How Do I Convert Palette-Based PNG with Transparency To RGB in PIL?

I'm currently building a site on App Engine that uploads images to Google Cloud Storage, and I'm using Python's PIL for basic manipulations.
I've been having problems with the following image, which another Stack Overflow member has mentioned is a palette-based PNG with transparency. From what I've been reading, that may be a bit buggy in PIL.
My question is really a back-to-basics one: what is the best way to convert this to RGB with the transparent pixels set to #FFF? I've been able to get it to work by converting to RGBA and then pasting onto an RGB image, but that seems redundant.
However, with a direct conversion I'm getting a bad transparency mask, i.e. using the solution from PIL Convert PNG or GIF with Transparency to JPG without
Also, if anybody has ideas why the image degrades to terrible quality after conversion, that's entirely a bonus for me!
One way to do this is to first convert the file to JPG. It seems to be a problem with the PNG encoding (or something related to that).
Check out this link that I used and got smooth conversion from transparent PNG to GIF:
Convert RGBA PNG to RGB with PIL
The function you are looking for is pure_pil_alpha_to_color_v2.
I also used it in my image conversion tool, PySmile:
https://github.com/vietlq/PySmile/blob/master/pysmile.py
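A minimal sketch of the paste-with-alpha-mask approach is shown below. The names blend_over_background and flatten_to_rgb are hypothetical and this is not the linked function verbatim. It assumes Pillow is installed and that converting the palette image to RGBA picks up its transparency, which may not hold for every PIL version.

```python
def blend_over_background(pixel, background=(255, 255, 255)):
    """Composite one RGBA pixel over an opaque background colour.

    This is the per-pixel arithmetic that pasting with an alpha mask
    performs: out = alpha * colour + (1 - alpha) * background.
    """
    r, g, b, a = pixel
    alpha = a / 255.0
    return tuple(
        int(round(alpha * channel + (1.0 - alpha) * bg))
        for channel, bg in zip((r, g, b), background)
    )


def flatten_to_rgb(image, background=(255, 255, 255)):
    """Return an RGB copy of a Pillow image with transparency filled in."""
    # Imported here so the arithmetic helper above works without Pillow.
    from PIL import Image

    rgba = image.convert("RGBA")  # expands the palette and its transparency
    canvas = Image.new("RGB", rgba.size, background)
    canvas.paste(rgba, mask=rgba.split()[3])  # band 3 is the alpha channel
    return canvas
```

Going through RGBA first looks redundant, but it gives the paste step a real alpha band to use as the mask, which a palette image does not have on its own.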

tcpdf: poor image quality

I am using TCPDF to create PDF files converted from HTML input using its writeHTML() function. However, images within the PDF have poor quality, while the original images have high quality (as expected). The images are in PNG format. I already tried SetJPEGQuality(100), but that had no effect.
What is causing this?
Try using this:
$pdf->setImageScale(1.53);
http://sourceforge.net/projects/tcpdf/forums/forum/435311/topic/4831671
When using HTML to generate your PDFs, you need to manually calculate each image's dimensions by dividing its original width and height by 1.53, and set the results as attributes.
For example, an image with dimensions of 200x100 pixels will become:
<img src="image.jpg" width="131" height="65" />
This is a nasty workaround and doesn't completely remove the blur, but the result is much better than without any scaling.
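The division can be sketched as a small helper, written here in Python only to show the arithmetic. The function name scaled_dimensions is hypothetical; 1.53 is the scale factor from the answer above.

```python
def scaled_dimensions(width_px, height_px, scale=1.53):
    """Divide pixel dimensions by the image scale and round to whole units."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return (round(width_px / scale), round(height_px / scale))


def img_tag(src, width_px, height_px, scale=1.53):
    """Build an <img> tag carrying the scaled width and height attributes."""
    width, height = scaled_dimensions(width_px, height_px, scale)
    return '<img src="%s" width="%d" height="%d" />' % (src, width, height)
```

For the 200x100 example, scaled_dimensions(200, 100) gives (131, 65), matching the attributes shown above.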
Try converting your image to JPG first. So far I haven't had any problems converting images with TCPDF. I think TCPDF is powerful, because it can also handle Arabic text. I tried Arabic fonts with FPDF and it still failed.
A little bump.
I had the same quality problem and I solved it.
When you save your picture, save it as 8-bit instead of 24-bit and you will see a "beautiful anti-aliasing".

Resources