Fit to page size in ghostscript (with a possibly corrupt input)

I'm trying to use ghostscript to convert a .ps file to a series of .png files, largely because I don't have a tolerable ps viewer.
This is the command I've used:
gs -dBATCH -dEPSCrop -dEPSFitPage -sDEVICE=png16m -r300 -dNOPAUSE -sOutputFile=neptune_111115_ob1-2_13pca_boloplots_%d.png neptune_111115_ob1-2_13pca_boloplots.ps
(the .ps file is a multi-page postscript).
The outputs are partly off the page. I'd like the images to fit inside the page.
I can include example files, but they're pretty large - is there any particular part of the .ps file that would be helpful?
My suspicion is that the .ps file is specifying the bounding box incorrectly, but hacking the BB values didn't have any effect. The .ps file is written by IDL (ittvis' Interactive Data Language). I've also tried the above command without the -dEPS* switches, without luck.

-dEPSCrop and -dEPSFitPage are mutually exclusive:
One crops the EPS to the BoundingBox specified in the comments.
The other scales up the EPS from the %%BoundingBox specified in the PS file's internal comments to fit the current media.
You can't really use both at the same time.
The file can't be an EPS file anyway, because you can't have multiple pages in an EPS file. So actually neither switch will have any effect (as you've discovered).
Either the PostScript requests a media size using setpage or setpagedevice, or it just uses whatever the currently set media is. My guess is that it's just using the current media. Try setting -sPAPERSIZE=a4 and -sPAPERSIZE=letter.
If that works, then the program does not request a media size. If it has no effect, then set -dFIXEDMEDIA in addition, which makes Ghostscript ignore subsequent requests to change the media size.
That should allow you to specify the correct media size. If you don't know what the media size should be, you can use the Ghostscript bbox device (-sDEVICE=bbox) to find out.
Lastly, Ghostscript has a rudimentary display device which you can use to view the rendered output without first going to a PNG.
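The media-size suggestions above can be sketched as two small shell functions. This is only a sketch under stated assumptions: the function names `find_bbox` and `render_fixed_media`, the example file name, and the default paper name are hypothetical, while the `gs` switches are the ones named in the answer.

```shell
# Hypothetical helper: report each page's bounding box using the bbox device.
# The bbox device writes its %%BoundingBox lines to stderr, hence 2>&1.
find_bbox() {
    gs -dBATCH -dNOPAUSE -dQUIET -sDEVICE=bbox "$1" 2>&1
}

# Hypothetical helper: render every page to PNG on a fixed media size.
# $1 = PostScript file, $2 = paper name (assumed default: a4).
render_fixed_media() {
    ps_file=$1
    paper=${2:-a4}
    gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 \
       -sPAPERSIZE="$paper" -dFIXEDMEDIA \
       -sOutputFile="${ps_file%.ps}_%d.png" "$ps_file"
}

# Usage (not run here):
#   find_bbox plots.ps
#   render_fixed_media plots.ps letter
```

If the reported bounding boxes are larger than the chosen paper, the content would still be clipped, so a larger paper name would be the next thing to try.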

Related

How can I take a pdf, and convert any jpeg2000/jpx/jp2 images in it to jpeg images?

I am using MacOS Mojave on a Mac Mini, and I am also using an old Kindle Dx which cannot read jpeg2000 images. It also has trouble with too many or too large jpeg images.
I cannot use touchscreens, so newer e-readers and tablets aren't a solution.
So far, I've found some buggy solutions--
I can use Willus's k2pdfopt with -mode copy and -dev dx, which rasterizes everything. It's a good solution for scanned pdfs. If more detail is needed, -mode copy without -dev dx will preserve higher resolution. It's something of a last resort for pdf-born-pdfs, since text can be uglier and harder to read, and file sizes can increase alarmingly.
I can also use Ghostscript with -dCompatibilityLevel=1.4, which doesn't rasterize everything. It converts jpeg2000 images to jpeg images. But it doesn't tackle some oversized or poorly-constructed images, it often creates dark rectangles which can obscure text, and it occasionally loses the ability to search or select text. [P.S. I mean it takes a pdf which had searchable text and outputs one which does not. Also, if I do any kind of image downsampling or removal, it sometimes rescales everything or loses pages.]
I have experimented with options to compress images in Ghostscript, with mixed success, and with the above bugs persisting. [P.S. I think I was downsampling, yes.]
For whatever reason, MacOS Quartz filters only work if they will reduce image sizes. So they tend not to work on the buggy images.
Now my ideal solution would preserve the text itself, preferably untangling ligatures, and would compress the images like Willus's k2pdfopt. But I have no idea if that's possible or how.
Short of that-- I'm wondering if there's a way to use Ghostscript to convert the jpeg2000 images without causing the gray rectangles or losing the ability to search or select text.
or if there's a way to use Quartz filters so they work. In some older versions of MacOS they did work.
or if there's a way to batch-print these pdf files to the appropriate resolution, apparently 800x1180, reprocessing images in the process.
I don't have much programming experience. I mainly use homebrew to install command-line tools, very sloppy bash scripts, and Automator to run them.
P.S. For a minimal example of the gray rectangles in Ghostscript, using the free pdf from here: https://www.peginc.com/store/test-drive-savage-worlds-the-wild-hunt/
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -o out.pdf in.pdf
substituting that pdf for in.pdf.
For a minimal example of losing searchable text, using the free pdf from here: http://datafortress2020.com/fileproject/details.php?image_id=498
same minimal script
Compatibility Level
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 -o out.pdf in.pdf
Aggressive Downsampling and Grayscale
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-dFastWebView -sColorConversionStrategy=Gray \
-dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true \
-dColorImageResolution=75 -dGrayImageResolution=75 -dMonoImageResolution=150 \
-dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 \
-o out.pdf in.pdf
P.P.S. I can use k2pdfopt to rasterize to fit my Kindle. If the file has searchable text, this retains it, if it doesn't I can run tesseract in k2 or run ocrmypdf afterwards.
But if I want especially good graphics, or especially clear text, and the file has hundreds of pages, it will need hundreds of megs. I had blamed this on rasterizing the text, which was why my ideal solution was to keep text and rasterize images, but apparently it's an issue with the images themselves.
If you think you've found a bug, then it's helpful to report it. If you don't, it will never be fixed. You can report a bug at https://bugs.ghostscript.com. Please be sure to attach an example file that reproduces the problem and state the command line used.
The Ghostscript pdfwrite device does not, ever, produce JPEG2000 images (due to patent issues). So you don't need to set the CompatibilityLevel at all, and I'd recommend that you do not. By setting the CompatibilityLevel you are limiting the output. Unless your device cannot handle later PDF versions, don't do this.
Without seeing an example file, a command line and knowing the version and operating system it's obviously not possible for anyone to comment on your 'gray rectangles'.
You can reduce the size of images (in bytes) by downsampling (as opposed to compressing) them, you can't do anything about the number of images.
Note that searchable text depends on the construction of the PDF file, and so cannot ever be guaranteed. Searchable text (in the sense of ToUnicode CMaps) was a later addition to the PDF Reference and is always optional, because it's possible to have input from which the Unicode code points cannot be determined (without using OCR software) but a perfectly readable PDF file can still be produced.
Ghostscript itself can produce a PDF file which is a rendered representation of the original, wrapped up as a PDF. See the pdfimage* devices.
Tesseract can take images and produce PDF files with searchable text, produced by OCR'ing the images. This would seem to me to be your best option, though obviously I don't know if a single large image is going to be acceptable to your device.
Edit
I already agreed that searching text is inherently not supported in PDF, except as an optional adjunct. The bug report you pointed to talks about 'corrupting text layers'. There are no text layers in PDF, and the text is neither corrupted nor missing, it's just not encoded as ASCII any more.
The reason you shouldn't set the resolution, and the size in pixels, is because PDF is not an image format. You aren't gaining anything by doing this. All that happens is that pdfwrite divides the 'g' values by the resolution, to get a media size in inches, and writes that as the MediaBox. It is simpler just to set the media size. If you set the resolution, anything which does get rendered is fixed at that resolution. Choose a low resolution and you get crappy output. If you use a higher resolution, then the image can be downscaled and smoothed, giving better output.
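The arithmetic described above (pixels divided by resolution gives inches, and there are 72 points per inch) can be checked with a few lines of shell. This is an illustrative sketch: the function name `px_to_points` is hypothetical, and the 800x1180 pixel size at 150 dpi is only an assumed example.

```shell
# Convert a pixel dimension at a given resolution (dpi) to PostScript points.
# points = pixels / dpi * 72
px_to_points() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.1f\n", px / dpi * 72 }'
}

# Assumed example values: 800x1180 pixels at 150 dpi.
width_pt=$(px_to_points 800 150)
height_pt=$(px_to_points 1180 150)
echo "$width_pt x $height_pt points"   # prints: 384.0 x 566.4 points
```

The resulting size is what ends up in the MediaBox, which is why requesting the media size directly gives the same page without pinning the rendering resolution.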
It is indeed possible that your Kindle cannot handle transparency any better than the Mac, it is after all an old device. It's also possible that whoever built Ghostscript for you introduced a bug. I'm afraid we can't help you with either of those.
I did suggest, right back at the end of the original post, that you render the content to an image (Ghostscript will do that for you), then use Tesseract to convert the image back to a PDF, and at the same time OCR the text.
That will get past your problems with JPEG2000, will do a better job of creating searchable text, since even files that aren't already searchable will become so, and will allow you to specify the resolution.
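The render-then-OCR route suggested above might look roughly like the following. Treat it as a sketch under assumptions: the function name `render_and_ocr`, the working directory layout, and the 300 dpi default are hypothetical, and it produces one searchable PDF per page rather than a single merged file.

```shell
# Hypothetical helper: render each PDF page to a PNG with Ghostscript,
# then let Tesseract turn each PNG into a one-page searchable PDF.
# $1 = input PDF, $2 = working directory, $3 = resolution in dpi (assumed default: 300).
render_and_ocr() {
    in_pdf=$1
    work_dir=$2
    dpi=${3:-300}
    mkdir -p "$work_dir" || return 1
    gs -dBATCH -dNOPAUSE -dQUIET -sDEVICE=png16m -r"$dpi" \
       -sOutputFile="$work_dir/page_%04d.png" "$in_pdf" || return 1
    for png in "$work_dir"/page_*.png; do
        [ -e "$png" ] || continue
        # "pdf" asks Tesseract to write <output base>.pdf with OCR'd text.
        tesseract "$png" "${png%.png}" pdf || return 1
    done
}

# Usage (not run here):
#   render_and_ocr in.pdf ./pages 150
```

A lower resolution keeps the files small for an old e-reader, at the cost of OCR accuracy and image sharpness.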

Ghostscript increases length of content

I am using Ghostscript to reduce the size of a PDF. The following command is used:
/opt/pdf/ghostpdl-9.23/bin/gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=$1 $2
This reduces the size of the PDF by compressing all the images inside it. However, when I inspect the compressed PDF in the debugger tool of PDFBox, I can see that the length of the content has increased. It looks like Ghostscript uncompressed the content, but the recompression is not done appropriately.
Original PDF: https://35.200.235.243/download?fileName=/opt/pdf/test.pdf
Compressed PDF: https://35.200.235.243/download?fileName=/opt/pdf/test-compress1.pdf
I tried using iText to recompress the content using setCompressionLevel(9). However, the original compression is still not achieved.
Is there any mechanism by which the original compression of the content can be achieved after processing by Ghostscript?
Ghostscript (more specifically the pdfwrite device) doesn't 'compress' PDF files at all. It produces a brand new PDF file which may (or equally well may not) be smaller than the original.
Ghostscript always decompresses input. The process is described here and should explain why this is always going to happen.
I don't see any reason why you think Ghostscript isn't recompressing the image streams. All the image streams are compressed with either Flate or DCT encoding.
You haven't said which content you think has increased, and given that the original file is 1.2 MB and the Ghostscript output is 390 KB, I'm not clear on what your complaint actually is. The output file appears significantly smaller to me.
If you are expecting the streams to be numbered the same in the output file as the input file then you are out of luck, see the Overview linked above to see why.
NB your command line doesn't compress images, it reduces their resolution resulting in lower quality.

how to convert PS image to PNG and fit to page or to multiple pages

I am analyzing a model (compiled with -pg option so it would generate "gmon.out") and then generated a PS file (using gprof2dot.py and piping that to "dot") that charts how much time is spent in each subroutine. When I use ghostscript to convert to a PDF it is cutting off some of the right hand side of the figure. So I tested outputting to multiple pages, but the first page still has the right side cut off and the second page is blank.
These are the 2 commands I have tried:
gs -dBATCH -dNOPAUSE -dPDFFitPage -sOutputFile=myfile.pdf -sDEVICE=pdfwrite output.ps
gs -dBATCH -dNOPAUSE -dPDFFitPage -sOutputFile=out%d.pdf -sDEVICE=pdfwrite output.ps
Please let me know if you have any suggestions. Thanks!
Your title says you want a PNG (and you render a PostScript program to PNG, not "convert a PostScript image"; PostScript is a programming language, not an image format) but your description says you are creating PDF files. So which is it, PNG or PDF?
Using PDFFitPage scales the requested media size to fit an actual (already set up, fixed) media size. Since you haven't set a fixed media size on the command line, no scaling will be performed.
So, what media size are you getting, and why is that not correct? I would 'guess' that your PostScript program does not request a media size. If it did, then pdfwrite would create a PDF with the same media size.
In the absence of a media request, the pdfwrite device uses the default. Depending on your system and configuration that will most likely be either A4 or US Letter. It then reproduces the PostScript drawing program by either rendering to a bitmap, or as a PDF page description, using that media size. If the original PostScript required a different media size, then bits will be clipped.
Since you have not supplied an example, it's not possible for me to do more than guess, of course.
However, you should probably try setting -sPAPERSIZE to something like A3. Or set a specific media size using -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS.
If you supplied an example I could probably be more specific.
It would also be a decent idea to mention what version of Ghostscript you are using too.

Ghostscript cuts off part of image

I have this eps image named "input.eps".
I run the following command on it:
gs -dNOPAUSE -dBATCH -q -sDEVICE=ps2write -sOutputFile=output.eps input.eps
The resulting output file "output.eps" has the right side of the figure chopped off. Why?
Note: The reason I'm using GhostScript is to change the fonts in the input.eps file, which I'll do by specifying the -I switch with the path to the fonts. I haven't put that in the code snippet as it is not relevant to the issue.
EPS files do not request a media size (they are intended for inclusion in a PostScript program by applications). So, if you don't tell Ghostscript what size media to use it has no choice but to use its default.
Depending on your operating system (and locale if appropriate), this is likely to be either Letter (612 by 792 units) or A4 (595 by 842 units). Your EPS file claims it has a Bounding Box of 1008 units by 504 units.
So clearly your EPS won't fit across the media, and will therefore be cropped.
You can either wrap the EPS up as is normal for inclusion in a PostScript program, and request the media there, or you can use the -dEPSCrop switch which reads the Bounding Box from the comments and uses that for a media request.
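To see why the figure is cropped, the bounding box can be read straight from the EPS comments and compared with the default media sizes. The following is a sketch, assuming the file carries an ordinary `%%BoundingBox:` comment in its header; the function names `eps_bbox_size` and `convert_eps_cropped` are hypothetical.

```shell
# Print "<width> <height>" in points from the first %%BoundingBox comment.
# A deferred "(atend)" entry is skipped so the real values are used.
eps_bbox_size() {
    awk '/^%%BoundingBox:/ && $2 != "(atend)" { print $4 - $2, $5 - $3; exit }' "$1"
}

# Hypothetical wrapper around the original command with -dEPSCrop added,
# so the bounding box is used as the media request.
# $1 = input EPS, $2 = output file.
convert_eps_cropped() {
    gs -dNOPAUSE -dBATCH -q -dEPSCrop -sDEVICE=ps2write \
       -sOutputFile="$2" "$1"
}

# Usage (not run here):
#   eps_bbox_size input.eps          # e.g. "1008 504", wider than Letter's 612
#   convert_eps_cropped input.eps output.eps
```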
Note that, despite the existence of the BoundingBox, this is not technically a valid EPS file. It has the wrong DSC identifier and executes showpage.
As a final note, you won't be 'changing' the fonts in the EPS file, as the EPS file does not contain any fonts, just references to font names.

GhostScript PDF to PostScript

I have to convert pdf files (created with jasperreports) to postscript.
I'm using ghostscript (Version 9.19) to make the conversion.
The command I'm using is:
gswin64c -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=file.ps file.pdf
The conversion completes without problems, but when I open the generated PostScript file (using GSview 5.0), the top margin is cropped by 2-3 cm, and some of the information to print is lost.
I have changed the device from ps2write to eps2write and used the switch -g<width>x<height> with the page size in pixels, but the problem persists.
The file is to be printed on preformatted paper, so I cannot use the generated PostScript to print.
Can someone help?
Thanks
It's not possible to say with great certainty, but it sounds like the PDF MediaBox is larger than the media you have specified to GSView.
You can try using the -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS along with -dFIXEDMEDIA and -dPDFFitPage, that should allow you to set up a specific media size, override the size in the PDF file and scale the result to fit the specified size.
Perhaps you could post an example PDF file. Without that, it's very hard to comment sensibly.
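The combination of switches suggested in this answer could be wrapped as follows. This is a sketch, not a tested fix for the asker's file: the function name `pdf_to_ps_fit` is hypothetical, the media size must be supplied in points, and on Windows the executable would be `gswin64c` rather than `gs`.

```shell
# Hypothetical helper: convert a PDF to PostScript on a fixed media size,
# scaling each page to fit that size.
# $1 = input PDF, $2 = output PS, $3 = media width in points, $4 = media height in points.
pdf_to_ps_fit() {
    in_pdf=$1
    out_ps=$2
    width_pt=$3
    height_pt=$4
    gs -dNOPAUSE -dBATCH -sDEVICE=ps2write \
       -dDEVICEWIDTHPOINTS="$width_pt" -dDEVICEHEIGHTPOINTS="$height_pt" \
       -dFIXEDMEDIA -dPDFFitPage \
       -sOutputFile="$out_ps" "$in_pdf"
}

# Usage (not run here), with A4 expressed in points:
#   pdf_to_ps_fit file.pdf file.ps 595 842
```

Because the page is scaled to fit, content positions shift slightly, which may matter when printing onto preformatted paper.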

Resources