How can I take a pdf, and convert any jpeg2000/jpx/jp2 images in it to jpeg images? - macos

I am using MacOS Mojave on a Mac Mini, and I am also using an old Kindle Dx which cannot read jpeg2000 images. It also has trouble with too many or too large jpeg images.
I cannot use touchscreens, so newer e-readers and tablets aren't a solution.
So far, I've found some buggy solutions--
I can use Willus's k2pdfopt with -mode copy and -dev dx, which rasterizes everything. It's a good solution for scanned pdfs. If more detail is needed, -mode copy without -dev dx will preserve higher resolution. It's something of a last resort for pdf-born-pdfs, since text can be uglier and harder to read, and file sizes can increase alarmingly.
I can also use Ghostscript with -dCompatibilityLevel=1.4, which doesn't rasterize everything. It converts jpeg2000 images to jpeg images. But it doesn't tackle some oversized or poorly-constructed images, it often creates dark rectangles which can obscure text, and it occasionally loses the ability to search or select text. [P.S. I mean it takes a pdf which had searchable text and outputs one which does not. Also, if I do any kind of image downsampling or removal, it sometimes rescales everything or loses pages.]
I have experimented with options to compress images in Ghostscript, with mixed success, and with the above bugs persisting. [P.S. I think I was downsampling, yes.]
For whatever reason, MacOS Quartz filters only work if they will reduce image sizes. So they tend not to work on the buggy images.
Now my ideal solution would preserve the text itself, preferably untangling ligatures, and would compress the images like Willus's k2pdfopt. But I have no idea if that's possible or how.
Short of that-- I'm wondering if there's a way to use Ghostscript to convert the jpeg2000 images without causing the gray rectangles or losing the ability to search or select text.
or if there's a way to use Quartz filters so they work. In some older versions of MacOS they did work.
or if there's a way to batch-print these pdf files to the appropriate resolution, apparently 800x1180, reprocessing images in the process.
I don't have much programming experience. I mainly use homebrew to install command-line tools, very sloppy bash scripts, and Automator to run them.
P.S. For a minimal example of the gray rectangles in Ghostscript, using the free pdf from here: https://www.peginc.com/store/test-drive-savage-worlds-the-wild-hunt/
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -o out.pdf in.pdf
substituting that pdf for in.pdf.
For a minimal example of losing searchable text, using the free pdf from here: http://datafortress2020.com/fileproject/details.php?image_id=498
same minimal script
Compatibility Level
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 -o out.pdf in.pdf
Aggressive Downsampling and Grayscale
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-dFastWebView -sColorConversionStrategy=Gray \
-dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true \
-dColorImageResolution=75 -dGrayImageResolution=75 -dMonoImageResolution=150 \
-dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 \
-o out.pdf in.pdf
P.P.S. I can use k2pdfopt to rasterize to fit my Kindle. If the file has searchable text, this retains it, if it doesn't I can run tesseract in k2 or run ocrmypdf afterwards.
But if I want especially good graphics, or especially clear text, and the file has hundreds of pages, it will need hundreds of megs. I had blamed this on rasterizing the text, which was why my ideal solution was to keep text and rasterize images, but apparently it's an issue with the images themselves.

If you think you've found a bug, then it's helpful to report it. If you don't, it will never be fixed. You can report a bug at https://bugs.ghostscript.com. Please be sure to attach an example file that reproduces the problem and state the command line used.
The Ghostscript pdfwrite device does not, ever, produce JPEG2000 images (due to patent issues). So you don't need to set the CompatibilityLevel at all, and I'd recommend that you do not. By setting the CompatibilityLevel you are limiting the output. Don't do this unless your device cannot handle later versions.
Without seeing an example file, a command line and knowing the version and operating system it's obviously not possible for anyone to comment on your 'gray rectangles'.
You can reduce the size of images (in bytes) by downsampling (as opposed to compressing) them, but you can't do anything about the number of images.
Note that searchable text depends on the construction of the PDF file, and so cannot ever be guaranteed. Searchable text (in the sense of ToUnicode CMaps) was a later addition to the PDF Reference and is always optional, because it's possible to have input from which the Unicode code points cannot be determined (without using OCR software) but a perfectly readable PDF file can still be produced.
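One rough way to compare a file before and after conversion is to ask Ghostscript's txtwrite device what text it can extract. The sketch below assumes gs is on the PATH; has_text is a made-up helper name. A non-empty result only shows that some characters came out, not that they are correct Unicode.

```shell
# has_text FILE: succeed if Ghostscript's txtwrite device extracts any
# non-whitespace characters from FILE.
# (has_text is a hypothetical helper name, not a Ghostscript feature.)
has_text() {
    extracted=$(gs -q -sDEVICE=txtwrite -o - "$1" 2>/dev/null | tr -d '[:space:]')
    [ -n "$extracted" ]
}
```

For example, running `has_text in.pdf && echo "in.pdf has extractable text"` and then the same on out.pdf would show whether a conversion step lost the text.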
Ghostscript itself can produce a PDF file which is a rendered representation of the original, wrapped up as a PDF. See the pdfimage* devices.
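A minimal sketch of the pdfimage route, assuming gs is on the PATH. The function name is made up, and the resolution and downscale factor are illustrative values rather than recommendations. Rendering at 300 dpi and downscaling by 2 gives pages at an effective 150 dpi.

```shell
# render_as_image_pdf SRC DST: rasterize every page of SRC and wrap the
# rendered pages in a new PDF. pdfimage8 produces 8-bit grayscale pages;
# pdfimage24 would keep colour.
# (render_as_image_pdf is a hypothetical helper name.)
render_as_image_pdf() {
    gs -q -sDEVICE=pdfimage8 -r300 -dDownScaleFactor=2 -o "$2" "$1"
}
```

Called as `render_as_image_pdf in.pdf out.pdf`. The result contains only page images, so any searchable text in the original is gone unless an OCR step is added afterwards.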
Tesseract can take images and produce PDF files with searchable text, produced by OCR'ing the images. This would seem to me to be your best option, though obviously I don't know if a single large image is going to be acceptable to your device.
Edit
I already agreed that searching text is inherently not supported in PDF, except as an optional adjunct. The bug report you pointed to talks about 'corrupting text layers'. There are no text layers in PDF, and the text is neither corrupted nor missing, it's just not encoded as ASCII any more.
The reason you shouldn't set the resolution, and the size in pixels, is because PDF is not an image format. You aren't gaining anything by doing this. All that happens is that pdfwrite divides the 'g' values by the resolution, to get a media size in inches, and writes that as the MediaBox. Simpler just to set the Media Size. If you set the resolution you are fixing anything which does get rendered at that resolution. Choose a low resolution and you get crappy output. If you use a higher resolution then the image can be downscaled and smoothed giving better output.
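To make the arithmetic concrete: 800 pixels at 150 dpi is 800 / 150 inches, and at 72 points per inch that is 384 points. The sketch below sets the media size directly in points instead of using -g and -r. The function names px_to_pt and fit_to_media are made up for illustration, and combining -dFIXEDMEDIA with -dPDFFitPage is one way to do it, not the only one.

```shell
# px_to_pt PIXELS DPI: convert a pixel count at a given resolution to
# PostScript points (72 per inch), rounded down to a whole point.
# (px_to_pt and fit_to_media are hypothetical helper names.)
px_to_pt() {
    echo $(( $1 * 72 / $2 ))
}

# fit_to_media SRC DST WIDTH_PT HEIGHT_PT: scale each page of SRC to a
# fixed media size given in points, without setting -g or -r.
fit_to_media() {
    gs -q -sDEVICE=pdfwrite \
       -dDEVICEWIDTHPOINTS="$3" -dDEVICEHEIGHTPOINTS="$4" \
       -dFIXEDMEDIA -dPDFFitPage \
       -o "$2" "$1"
}
```

For example, `fit_to_media in.pdf out.pdf "$(px_to_pt 800 150)" "$(px_to_pt 1080 150)"` requests 384 x 518 point media, which is about what -g800x1080 -r150 would have produced, while leaving the rendering resolution alone.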
It is indeed possible that your Kindle cannot handle transparency any better than the Mac, it is after all an old device. It's also possible that whoever built Ghostscript for you introduced a bug. I'm afraid we can't help you with either of those.
I did suggest, right back at the end of the original post, that you render the content to an image (Ghostscript will do that for you), then use Tesseract to convert the image back to a PDF, and at the same time OCR the text.
That will get past your problems with JPEG2000, will do a better job of creating searchable text, since even files that aren't already searchable will become so, and will allow you to specify the resolution.
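The render-then-OCR route could look roughly like this. It is a sketch under assumptions: gs and tesseract are both on the PATH, grayscale at 150 dpi is an arbitrary default picked for an e-ink screen, and the function name is made up. Merging the per-page PDFs with pdfwrite is just one way to join them.

```shell
# rerender_with_ocr SRC DST [DPI]: render SRC to grayscale page images,
# OCR each page into a one-page searchable PDF, then merge the pages.
# (rerender_with_ocr is a hypothetical helper name; 150 dpi is an assumed default.)
rerender_with_ocr() {
    src=$1
    dst=$2
    dpi=${3:-150}
    work=$(mktemp -d) || return 1

    # Step 1: one PNG per page, so no JPEG2000 data is carried over.
    gs -q -sDEVICE=pnggray -r"$dpi" -o "$work/page-%04d.png" "$src" || return 1

    # Step 2: tesseract writes page-NNNN.pdf next to each PNG.
    for png in "$work"/page-*.png; do
        [ -e "$png" ] || continue
        tesseract "$png" "${png%.png}" pdf || return 1
    done

    # Step 3: concatenate the per-page PDFs; zero-padded names keep page order.
    gs -q -sDEVICE=pdfwrite -o "$dst" "$work"/page-*.pdf || return 1
    rm -rf "$work"
}
```

Called as `rerender_with_ocr in.pdf out.pdf 150`. A higher DPI generally helps OCR accuracy at the cost of file size, so the value is worth experimenting with.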

Related

Use Ghostscript 9.21 to convert text to outlines, and how to keep the resolution of the picture

I use Ghostscript to convert text to outlines with the following command:
gswin32c.exe -sDEVICE=pdfwrite -sOutputFile=output.pdf -dQUIET -dNOPAUSE -dBATCH -dNoOutputFonts -f test_new.pdf
It works, but the output file shrank from 2.5 MB to 70 KB, and I then found that the picture had become blurred in the PDF.
Adding -dPDFSETTINGS=/default gives the same result.
It's better to use -dPDFSETTINGS=/printer or -dPDFSETTINGS=/prepress, but 300 dpi is not enough for me (or for my boss).
Is there any way to keep the original resolution of the picture?
Or how can I set a higher dpi for images in the output PDF?
The test file is here.
Thanks in advance.
The answer to your question is 'yes' (but see later). Don't use PDFSETTINGS, that sets lots of things all in one go. If you want control then you need to specify each setting individually.
Rather than use this shotgun approach you need to read the documentation, decide which controls affect areas you want to change, and alter those controls only.
However, image downsampling is not your problem. If you don't use -dPDFSETTINGS then the PDF file written by Ghostscript contains an image at exactly the same resolution as the image in the original file.
Your problem is that the image is being written with JPEG compression, and JPEG is a lossy compression, so you are losing fidelity. Note that in the original file the image is written uncompressed, which is why it's so large.
It looks like the original image was a JPEG, and the free PDF editor you are using has realised that so it saved the image uncompressed (I may be giving it too much credit here, it may save all images uncompressed). Applying JPEG to an image which has already been quantised simply amplifies the artefacts.
Instead you need to specify that you want images compressed with Flate, which is a lossless compression. The documentation for the pdfwrite controls can be found here, you need to change AutoFilterColorImages and ColorImageFilter.
Note that by not applying JPEG quantisation (a second time) and DCT encoding, the compression is less than your first experience. For me the output file comes in at just over 600Kb (leaving the font in place, and the text as text, would be a couple of Kb smaller). However the image is identical, as expected.
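As a sketch of the two controls named above, written for a POSIX shell (on Windows the same switches would go to gswin32c.exe). The function name is made up, and the matching gray-image switches are an extra assumption on my part, added so that grayscale pictures are treated the same way as colour ones.

```shell
# outline_text_lossless SRC DST: convert text to outlines (-dNoOutputFonts)
# while storing images with lossless Flate compression instead of letting
# pdfwrite pick JPEG automatically.
# (outline_text_lossless is a hypothetical helper name; the Gray switches
# are an addition beyond the two controls mentioned in the answer.)
outline_text_lossless() {
    gs -q -sDEVICE=pdfwrite -dNoOutputFonts \
       -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
       -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode \
       -o "$2" "$1"
}
```

Called as `outline_text_lossless test_new.pdf output.pdf`. No -dPDFSETTINGS is given, so image resolution is left alone.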
Since you are clearly using Ghostscript in a commercial environment, can I just point you at the licence and ask you to check that your usage is compatible with the AGPL, bearing in mind that this covers software as a service usage as well.

how to convert PS image to PNG and fit to page or to multiple pages

I am analyzing a model (compiled with -pg option so it would generate "gmon.out") and then generated a PS file (using gprof2dot.py and piping that to "dot") that charts how much time is spent in each subroutine. When I use ghostscript to convert to a PDF it is cutting off some of the right hand side of the figure. So I tested outputting to multiple pages, but the first page still has the right side cut off and the second page is blank.
These are the 2 commands I have tried:
gs -dBATCH -dNOPAUSE -dPDFFitPage -sOutputFile=myfile.pdf -sDEVICE=pdfwrite output.ps
gs -dBATCH -dNOPAUSE -dPDFFitPage -sOutputFile=out%d.pdf -sDEVICE=pdfwrite output.ps
Please let me know if you have any suggestions. Thanks!
Your title says you want a PNG (and you render a PostScript program to PNG, not "convert a PostScript image", PostScript is a programming language not an image format) but your description says you are creating PDF files. So which is it, PNG or PDF ?
Using PDFFitPage scales the requested media size to fit an actual (already set up, fixed) media size, since you haven't set a fixed media size on the command line, no scaling will be performed.
So, what media size are you getting, and why is that not correct ? I would 'guess' that your PostScript program does not request a media size, if it did then pdfwrite would create a PDF with the same media size.
In the absence of a media request, the pdfwrite device uses the default. Depending on your system and configuration that will most likely be either A4 or US Letter. It then reproduces the PostScript drawing program by either rendering to a bitmap, or as a PDF page description, using that media size. If the original PostScript required a different media size, then bits will be clipped.
Since you have not supplied an example it's not possible for me to do more than guess of course.
However, you should probably try setting -sPAPERSIZE to something like A3. Or set a specific media size using -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS.
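Both suggestions can be sketched as small wrappers, assuming gs is on the PATH. The function names are made up for illustration, and the right paper size for a given drawing still has to be found by trial.

```shell
# ps_to_pdf_on PAPER SRC DST: convert a PostScript file to PDF using a
# named default paper size such as a3.
# (ps_to_pdf_on and ps_to_pdf_points are hypothetical helper names.)
ps_to_pdf_on() {
    gs -q -sDEVICE=pdfwrite -sPAPERSIZE="$1" -o "$3" "$2"
}

# ps_to_pdf_points WIDTH_PT HEIGHT_PT SRC DST: the same, with the media
# size given explicitly in points (72 per inch).
ps_to_pdf_points() {
    gs -q -sDEVICE=pdfwrite \
       -dDEVICEWIDTHPOINTS="$1" -dDEVICEHEIGHTPOINTS="$2" \
       -o "$4" "$3"
}
```

For example, `ps_to_pdf_on a3 output.ps myfile.pdf`, or `ps_to_pdf_points 842 1191 output.ps myfile.pdf` for roughly the same A3 page expressed in points.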
If you supplied an example I could probably be more specific.
It would also be a decent idea to mention what version of Ghostscript you are using too.

ghostscript option to make a pdf with flattened images

Is there an option to print a pdf in ghostscript as images?
I can use:
gs -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 -sOutputFile=p%03d.png my.pdf
Then use imagemagick to make a pdf out of them with:
convert *.png new.pdf
PDF printers seem to have an option that does the same thing that is a checkbox that says "print as image". I could not find anything in the ghostscript docs that sounded like that was an option. There may be a term for it that I just don't know to look for.
It is kind of hard to explain why you would want to take a pdf document that is text and turn it into a document of images of text that is 4 times the size of the original but that is what I want to do.
Currently the only way to do that would be to start with a PDF which contains transparency operations, and select a CompatibilityLevel of 1.3 or less.
I have an idea to implement this feature, but I have not had time to work on it.
You can do it as a 2-pass approach using Ghostscript to render an image, then using the view* scripts to read the image back into Ghostscript and produce a PDF. No better than using convert of course.

How to remove anti-aliasing in PDF images?

I use Abbyy FineReader for ScanSnap to OCR a couple of scanned PDF files. The software claims it retains the original PDF images. The PDF file sizes pre-OCR and post-OCR are almost identical, which is good.
After the software is done, all PDF images appear anti-aliased in Acrobat X. Page navigation is much slower than before, and when I zoom in/out, the images first go to what looks like the pre-anti-aliasing version before quickly changing to anti-aliased images.
Left: Scanned PDF / Right: after OCR with Abbyy
I would like to get the original images without anti-aliasing back. Interestingly, when I open a single page from the anti-aliased PDF in Photoshop, there is no anti-aliasing and the image looks like the left one.
My limited PDF programming experience leads me to believe that Abbyy likely sets some kind of anti-alias flag for each image during OCR processing. How do I un-set this flag?
Any pointers to useful ideas would be much appreciated.
After the software is done, all PDF images appear anti-aliased in Acrobat X. Page navigation is much slower than before, and when I zoom in/out, the images first go to what looks like the pre-anti-aliasing version before quickly changing to anti-aliased images.
Actually the original file 2013_11_15_22_51_31.pdf contains a JPEG image while the OCR'ed file 2013_11_15_22_51_31_OCR.pdf contains a JPEG2000 image.
Comparing them in third party viewers, it becomes clear that the image in the OCR'ed file is not inherently anti-alias'ed. Furthermore there is no evident flag in the PDF instructing PDF viewers to apply anti-aliasing to the JPEG2000 image. Thus, Adobe Reader seems to automatically render JPEG and JPEG2000 images differently, applying anti-aliasing to the latter but not to the former.
Comparing both images in detail, though, it becomes clear that these images are not identical but instead the image in the OCR'ed PDF is slightly rotated.
I assume Abbyy FineReader recognized that the original scanned image is not correctly oriented. Thus, it rotated it slightly to correct this orientation.
Thus, replacing the image in the OCR'ed version with the one from the original one is no option: Due to the rotation the OCR information would partially be somewhat off.
What you might want to try is to recode the JPEG2000 image to JPEG and replace the image in the OCR'ed version with this recoded one. This will mean some loss of quality but most likely you can get rid of the anti-aliasing this way.
Be aware, though, that the JPEG2000 image is slightly larger than the JPEG image to accommodate the rotation.
PS: As @VadimR pointed out, there is indeed an /Interpolate true entry in the image dictionary of the OCR-ed version, which I missed when looking at the file. This does not seem to be the major issue slowing down the rendering.
There is an /Interpolate true entry in the image dictionary of the OCR-ed version, and that's what causes the 'anti-aliasing'. Whether that (and not JPEG2000 instead of JPEG compression) is the cause of the slow-down, you can check on large enough files.
To un-set this key, the best would be to turn it off while creating a file, and if that's not possible, to write and run a small program in suitable language.
But, since your file doesn't sport 'compressed objects' and offending key is in plain view inside a file, in the spirit of 'job done quickly' you can simply process your file e.g. like this:
perl -M-encoding -0777pe "s!/Interpolate true!' 'x17!ge" <in.pdf >out.pdf

Fit to page size in ghostscript (with a possibly corrupt input)

I'm trying to use ghostscript to convert a .ps file to a series of .png files, largely because I don't have a tolerable ps viewer.
This is the command I've used:
gs -dBATCH -dEPSCrop -dEPSFitPage -sDEVICE=png16m -r300 -dNOPAUSE -sOutputFile=neptune_111115_ob1-2_13pca_boloplots_%d.png neptune_111115_ob1-2_13pca_boloplots.ps
(the .ps file is a multi-page postscript).
The outputs are partly off the page. I'd like the images to fit inside the page.
I can include example files, but they're pretty large - is there any particular part of the .ps file that would be helpful?
My suspicion is that the .ps file is specifying the bounding box incorrectly, but hacking the BB values didn't have any effect. The .ps file is written by IDL (ittvis' Interactive Data Language). I've also tried the above command without the -dEPS* commands without luck.
-dEPSCrop and -dEPSFitPage are mutually exclusive:
One crops the EPS to the BoundingBox specified in the comments.
The other scales up the EPS from the %%BoundingBox specified in the PS file's internal comments to fit the current media.
You can't really use both at the same time.
The file can't be an EPS file anyway, because you can't have multiple pages in an EPS file. So actually neither switch will have any effect (as you've discovered).
Either the PostScript requests a media size using setpage or setpagedevice, or it just uses whatever the currently set media is. My guess is that it's just using the current media. Try setting -sPAPERSIZE=a4 and -sPAPERSIZE=letter.
If that works then the program does not request a media size. If it has no effect, then set -dFIXEDMEDIA in addition which will ignore subsequent requests to change the media size.
That should allow you to specify the correct media size, if you don't know what the media size should be then you can use the Ghostscript -sDEVICE=bbox device to find out.
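The steps above can be sketched as two small helpers, assuming gs is on the PATH. The function names are made up, and the 300 dpi figure is simply carried over from the command in the question.

```shell
# show_bboxes SRC: print the bounding box Ghostscript measures for each
# page. The bbox device reports on stderr, hence the 2>&1.
# (show_bboxes and render_fixed are hypothetical helper names.)
show_bboxes() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=bbox "$1" 2>&1 | grep '^%%BoundingBox'
}

# render_fixed PAPER SRC PATTERN: render to PNG on a fixed paper size,
# ignoring any media request made inside the PostScript program.
render_fixed() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 \
       -sPAPERSIZE="$1" -dFIXEDMEDIA \
       -sOutputFile="$3" "$2"
}
```

For example, `show_bboxes plots.ps` lists one %%BoundingBox line per page in points, which suggests a paper size large enough to hold the drawing. Then `render_fixed a4 plots.ps 'plots_%d.png'` renders with that size fixed.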
Lastly, Ghostscript has a rudimentary display device which you can use to view the rendered output without first going to a PNG.

Resources