Can anyone suggest me how to convert a scanned image into a searchable image or a scanned pdf to a searchable pdf ?
I have been stuck in this situation since quite a while now.
i have tried pdfocr application in ubuntu but no success.
Tesseract version 3.03 supports creation of searchable PDF from image. For PDF, you can use GhostScript to convert it to image before sending it to Tesseract.
https://github.com/tesseract-ocr/tesseract
Currently, there is no right way of doing this on Ubuntu. All OCR engines output plain text and there is no way to add that text as a hidden layer on PDF over the image text.
Option 1: Use gscan2pdf which will make you a searchable PDF, but the OCRed text is placed in the top-left corner of the page, is invisible and much too small.
Option 2: Use PDF X-Change Viewer which has an option to OCR and works correctly by adding a text layer over the scanned image which is in concordance with it. You'll have to run it in wine, because it is a Windows application.
Related
I am converting PDF to image using GhostScript. The problem i am facing is when pdf has links then i need to have those clickable links in converted image as well. How could i achieve this.
How do you expect an image format to have 'clickable links' ? What image format do you think has the ability to click on links ?
The image http://www.barre.nom.fr/medical/samples/files/MR-MONO2-16-head.gz on http://www.barre.nom.fr/medical/samples/is not converting to other image formats. I tried following commands (after extracting the dicom file):
dcm2pnm --write-png MR-MONO2-16-head out.png
dcm2pnm +obr MR-MONO2-16-head out.bmp
dcm2pnm MR-MONO2-16-head out.pnm
It also did not work with dcmj2pnm and dcml2pnm. All of them just produce a gray box. The image otherwise is OK and is correctly read by proper dicom viewer softwares. Where is the problem and how can it be solved?
The problem is that no windowing settings are present in the header. You need to instruct dcm2pnm to calculate a window from the histogram (+Wm) or specify windowing values to apply.
dcm2pnm +Wm +obr MR-MONO2-16-head MR-MONO2-16-head.bmp
yields a bitmap image that looks fine to me.
I'm trying to create a small PDF file, embedding one optimized PNG image displayed as a header and footer on a 3 page PDF (same image must appear 6x in the PDF)
My optimized PNG image is only 2.3KB. It looks very sharp.
Failed with libreoffice
When I insert just one instance of the 2.3KB PNG image into a Libreoffice Writer doc containing only text, then export as PDF I can see that the image gets re-compressed to JPG and the resulting PDF file grows by about 40KB after adding the image. It also loses quality, the PNG also gets JPG fuzzy edges.
If I right click the image and select compression, there is no way to disable recompressing the image (it's already optimized better than libreoffice could do it) I've tried setting a compression level of 0,1,9 etc. Choosing JPG, no resize, lossless, etc but there was no improvement.
Failed with wkhtmltopdf
I also tried making a test page and used wkhtml2pdf but it did the same thing. Adding the low quality flag made no difference.
PDF Spec suggests PNG is supported?
From skimming the PDF spec, it looks like PNG images are supported.
Even plain text PDF files are surprisingly large
The disappointing thing is also when I take a 7KB HTML file which is basically just <html><body><p>foo...</p><p>bar...</p> (only about 15 paragraphs) with no CSS. The resulting 2 page PDF file is 30KB. Why should a 7kb (almost plain text) file become 30kb as a PDF?
Suggestions?
Can someone please suggest how to make a small PDF file in Linux?
I need to include 7KB of text and repeat one PNG image 6 times.
Manually or programatically. I'll take whatever I can get at this point.
PDF Spec suggests PNG is supported?
PNG isn't supported per se; PDF allows embedding JPEG images as-is, but not PNG images. PDF does borrow a set of features of the PNG format, however.
rinohtype (full disclosure: I'm the author) tries to embed as much as possible from PNG images as-is into the PDF. This does involve some bit-juggling to separate the alpha channel from the color data for example, but no reencoding of the image is performed. It does not (yet) support interlaced PNGs.
rinohtype should be able to do what you want to achieve. But please note that it currently is in a beta stage, so you might encounter some bugs.
Even plain text PDF files are surprisingly large
To keep the PDF size as small as possible, make sure not to embed/subset any of the fonts. Use only the fonts from the base 14 PDF fonts which are provided by PDF readers.
What you want is certainly achievable. Regarding the image quality, I would recommend making your image twice the size that you want it to actually display at in the PDF to keep it looking sharp.
As to the size, I've just modified a test in my PDF writer module (WIP..) to include a 7.2K png, 200px x 70px, in a PDF twice and the PDF came out at 6.8K 8). There's not much text included, but more text will only add what it's worth + a small percentage.
You can see the module and original test here.. https://github.com/DoccaPDF/docca-pdf-writer/blob/master/src/tests/writer.js#L40
That test adds ~112K of images to the PDF and results in a 103K PDF.
Of course not all images are created equal so you milage may vary..
*the images are only actually added to the PDF once, but are displayed multiple time.
I am using matplotlib to draw a graph using some data and I have saved it in Pdf format.Now I want to add a logo to this file.How can I do this.
Thanks in advance
If you can do it the other way round, it is easier:
plot the image
load the logo from file with, e.g. Image module (PIL)
add the logo with plt.imshow, use the extent keyword to place it correctly
save the image into PDF
(You may even want to plot the logo first, so that it stays in the background.)
Unfortunately, this does not work with vector graphics, but as logos usually are not that large, you may use a .png or even a .jpg.
If you already have the PDF's then this is not a matplotlib or python question. You need some PDF editing tools or libraries to add the logo. Possible, but an entirely different thing.
I'm doing a document scanning project that involves inserting a scanned image into an existing PDF template page that contains form fields. I've used ImageMagick to take process the scan, and then append a raster image of the form template to the bottom, and convert that image into a PDF. However, forms and checkbox fields have to be added manually to the resulting PDF. Below is a sample of my ImageMagick command.
convert inputScan.jpg -resize 975x420 FormTemplate.png -append CombinedFile.pdf
Ideally, I would run a command that would take the JPG scan and the PDF template file containing fields, and output a PDF file with the scan at the top of a page and the field-containing template text below it. The closest thing I could find to a solution was here, but PHP can't be used on the computer in question.
Any help or suggestions are greatly appreciated!