Prevent Ghostscript from turning PNGs into JPEGs

I use Ghostscript to fix/repair non-compliant or corrupted PDFs so that they can be opened successfully by PDF readers and edited with Acrobat Pro without errors or warnings.
gs \
-o repaired.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/default \
corrupted.pdf
I noticed, however, that PNG images in the PDF are converted to JPEGs, with a loss of quality.
Is there a way, or a specific option, to avoid that?
I searched the documentation without success.

PDF cannot contain PNG images, because the PDF format does not support PNG. Images can be compressed with a variety of algorithms and the options are documented. See:
https://ghostscript.readthedocs.io/en/latest/VectorDevices.html#distiller-parameters
You will want to alter the AutoFilter...Images switches and then the ColorImageFilter, MonoImageFilter and GrayImageFilter settings.
And there's really no point in putting -dPDFSETTINGS=/default :-)
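For instance, a minimal sketch of the repair command that drops PDFSETTINGS, turns off automatic filter selection, and forces lossless Flate compression for colour and gray images (untested against your file, but using only documented switches):
gs \
-o repaired.pdf \
-sDEVICE=pdfwrite \
-dAutoFilterColorImages=false \
-dAutoFilterGrayImages=false \
-dColorImageFilter=/FlateEncode \
-dGrayImageFilter=/FlateEncode \
corrupted.pdf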

Related

How can I take a pdf, and convert any jpeg2000/jpx/jp2 images in it to jpeg images?

I am using MacOS Mojave on a Mac Mini, and I am also using an old Kindle Dx which cannot read jpeg2000 images. It also has trouble with too many or too large jpeg images.
I cannot use touchscreens, so newer e-readers and tablets aren't a solution.
So far, I've found some buggy solutions--
I can use Willus's k2pdfopt with -mode copy and -dev dx, which rasterizes everything. It's a good solution for scanned pdfs. If more detail is needed, -mode copy without -dev dx will preserve higher resolution. It's something of a last resort for pdf-born-pdfs, since text can be uglier and harder to read, and file sizes can increase alarmingly.
I can also use Ghostscript with -dCompatibilityLevel=1.4, which doesn't rasterize everything. It converts jpeg2000 images to jpeg images. But it doesn't tackle some oversized or poorly-constructed images, it often creates dark rectangles which can obscure text, and it occasionally loses the ability to search or select text. [P.S. I mean it takes a pdf which had searchable text and outputs one which does not. Also, if I do any kind of image downsampling or removal, it sometimes rescales everything or loses pages.]
I have experimented with options to compress images in Ghostscript, with mixed success, and with the above bugs persisting. [P.S. I think I was downsampling, yes.]
For whatever reason, MacOS Quartz filters only work if they will reduce image sizes. So they tend not to work on the buggy images.
Now my ideal solution would preserve the text itself, preferably untangling ligatures, and would compress the images like Willus's k2pdfopt. But I have no idea if that's possible or how.
Short of that-- I'm wondering if there's a way to use Ghostscript to convert the jpeg2000 images without causing the gray rectangles or losing the ability to search or select text.
or if there's a way to use Quartz filters so they work. In some older versions of MacOS they did work.
or if there's a way to batch-print these pdf files to the appropriate resolution, apparently 800x1180, reprocessing images in the process.
I don't have much programming experience. I mainly use homebrew to install command-line tools, very sloppy bash scripts, and Automator to run them.
P.S. For a minimal example of the gray rectangles in Ghostscript, using the free pdf from here: https://www.peginc.com/store/test-drive-savage-worlds-the-wild-hunt/
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -o out.pdf in.pdf
substituting that pdf for in.pdf.
For a minimal example of losing searchable text, using the free pdf from here: http://datafortress2020.com/fileproject/details.php?image_id=498
same minimal script
Compatibility Level
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 -o out.pdf in.pdf
Aggressive Downsampling and Grayscale
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-dFastWebView -sColorConversionStrategy=Gray \
-dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true \
-dColorImageResolution=75 -dGrayImageResolution=75 -dMonoImageResolution=150 \
-dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 \
-o out.pdf in.pdf
P.P.S. I can use k2pdfopt to rasterize to fit my Kindle. If the file has searchable text, this retains it, if it doesn't I can run tesseract in k2 or run ocrmypdf afterwards.
But if I want especially good graphics, or especially clear text, and the file has hundreds of pages, it will need hundreds of megs. I had blamed this on rasterizing the text, which was why my ideal solution was to keep text and rasterize images, but apparently it's an issue with the images themselves.
If you think you've found a bug, it's helpful to report it; if you don't, it will never be fixed. You can report a bug at https://bugs.ghostscript.com. Please be sure to attach an example file to reproduce the problem and state the command line used.
The Ghostscript pdfwrite device does not, ever, produce JPEG2000 images (due to patent issues). So you don't need to set the CompatibilityLevel at all, and I'd recommend that you do not. By setting the CompatibilityLevel you are limiting the output; unless your device cannot handle later versions, don't do this.
Without seeing an example file, a command line and knowing the version and operating system it's obviously not possible for anyone to comment on your 'gray rectangles'.
You can reduce the size of images (in bytes) by downsampling (as opposed to compressing) them, you can't do anything about the number of images.
Note that searchable text depends on the construction of the PDF file, and so cannot ever be guaranteed. Searchable text (in the sense of ToUnicode CMaps) was a later addition to the PDF Reference and is always optional, because it's possible to have input from which the Unicode code points cannot be determined (without using OCR software) but a perfectly readable PDF file can still be produced.
Ghostscript itself can produce a PDF file which is a rendered representation of the original, wrapped up as a PDF. See the pdfimage* devices.
Tesseract can take images and produce PDF files with searchable text, produced by OCR'ing the images. This would seem to me to be your best option, though obviously I don't know if a single large image is going to be acceptable to your device.
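A minimal sketch of that two-step approach, rendering with Ghostscript and letting Tesseract rebuild a searchable PDF (the device, resolution and file names here are assumptions, not a tested recipe):
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 -o page_%03d.png in.pdf
ls page_*.png > pages.txt
tesseract pages.txt searchable pdf
This produces searchable.pdf, one OCR'd page per rendered image.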
Edit
I already agreed that searching text is inherently not supported in PDF, except as an optional adjunct. The bug report you pointed to talks about 'corrupting text layers'. There are no text layers in PDF, and the text is neither corrupted nor missing; it's just not encoded as ASCII any more.
The reason you shouldn't set the resolution, and the size in pixels, is that PDF is not an image format. You aren't gaining anything by doing this. All that happens is that pdfwrite divides the '-g' values by the resolution to get a media size in inches, and writes that as the MediaBox. It's simpler just to set the media size. If you set the resolution, you are fixing anything which does get rendered at that resolution: choose a low resolution and you get crappy output; use a higher resolution and the image can be downscaled and smoothed, giving better output.
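For example, a sketch that sets the media size directly instead of -g and -r (the point values are assumptions for a Kindle DX-sized page; 1 point = 1/72 inch):
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=396 -dDEVICEHEIGHTPOINTS=576 -dFIXEDMEDIA -dPDFFitPage -o out.pdf in.pdf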
It is indeed possible that your Kindle cannot handle transparency any better than the Mac, it is after all an old device. It's also possible that whoever built Ghostscript for you introduced a bug. I'm afraid we can't help you with either of those.
I did suggest, right back at the end of the original post, that you render the content to an image (Ghostscript will do that for you), then use Tesseract to convert the image back to a PDF, and at the same time OCR the text.
That will get past your problems with JPEG2000, will do a better job of creating searchable text, since even files that aren't already searchable will become so, and will allow you to specify the resolution.

Achieving the same PDF compression as ImageMagick's convert using Ghostscript

Is there a way to achieve the same compression as the following (great compression ratio and quality, but it's slow and can break PDFs):
pdfimages -tiff $1 pdf_images
convert pdf_images-* -alpha off -monochrome -compress Group4 -density 250 ${1%.pdf}.compressed.pdf
rm pdf_images-*
by only using Ghostscript instead?
I tried playing around with -dPDFSETTINGS, -dGrayImageDownsampleType and -sColorConversionStrategy, but the result was usually lower quality or bigger in size.
The PDF consists of scanned pages (one image per page).
I usually use something like the following with GS (there's still something missing because images aren't converted...is this by design?):
gs \
-q \
-dNOPAUSE \
-dBATCH \
-dSAFER \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/screen \
-dEmbedAllFonts=false \
-dSubsetFonts=false \
-dGrayImageDownsampleType=/Bicubic \
-dGrayImageResolution=250 \
-dMonoImageDownsampleType=/Bicubic \
-dMonoImageResolution=250 \
-sProcessColorModel=DeviceGray \
-dProcessColorModel=/DeviceGray \
-sColorConversionStrategy=/Mono \
-dOverrideICC \
-sOutputFile=output.pdf \
input.pdf
Random PDF Sample from Google: https://www.2ndcollege.com/colleges/gcet/btech/sem5/ic/socio/notes/unit1.pdf
Original: 5.6MB
GS: 1.4MB (not mono)
PDFImages + ImageMagick: 1.4MB (only images are converted)
Adding this as an answer because it's too long for a comment.
The artefacts you are referring to are, I think, caused by JPEG quantisation. The original image has been decompressed, downsampled to a lower resolution, and then recompressed. Since you haven't selected any other compression method, the default for the /screen PDFSETTINGS is used, which is JPEG for Gray and colour images and CCITT Fax for mono images.
You could easily avoid that by using a different compression filter, though of course that would not produce as much compression of the output.
There are several suggestions I can make; firstly don't use PDFSETTINGS unless you are completely sure you want all the things it is doing. In general I would expect better results by leaving most settings alone and simply applying the switches you need.
Given that these are scanned pages, there is no point in setting any of the Font related parameters (unless invisible fonts have been added).
You've set ProcessColorModel twice, once as a name and once as a string. In fact, if you use ColorConversionStrategy, you shouldn't set it at all, and if you aren't using ColorConversionStrategy then it won't have any effect, so you can just drop these two entirely.
There is no ColorConversionStrategy of /Mono, and trying to set that causes errors for me. There appears to be a bug in ColorConversionStrategy in the current release: if you set Gray you will actually get RGB; in order to get Gray you actually need to request CMYK. That has been fixed, but in the meantime all the spaces are 'off by one': sRGB->CMYK, CMYK->Gray and Gray->RGB. LeaveColorUnchanged is unaffected.
Of course this means that your setting of the Gray and Mono Image parameters is having no effect (at least not on the colour images anyway). This is why you get a low output size, and also why the result is heavily downsampled and quantised.
Now, as I've already said, you can't get Ghostscript's pdfwrite to produce monochrome output, only grayscale. Reducing the image data by a factor of between 8 and 24 is where most of the gains are coming from, I believe. So frankly there's no way you are going to get down to the same output size using pdfwrite without heavily downsampling the images, and if you do that, the quality is going to suffer.
This command line:
\ghostpdl\debugbin\gswin32c -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=out.pdf -sColorConversionStrategy=CMYK -dPDFSETTINGS=/screen -dGrayImageDownsampleType=/Bicubic -dGrayImageFilter=/FlateEncode -dAutoFilterGrayImages=false unit1.pdf
produces a gray output file 2.1 MB in size, but the extreme downsampling has resulted in very blurry output, I don't think you will like it at all. You could change the amount of downsampling, but that of course will result in a larger output file. You could leave the compression filter unchanged (DCTEncode == JPEG), but that will get you compression artefacts.
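For instance, adding an explicit image resolution to the same command raises the downsampling target above the /screen preset's 72 dpi (150 dpi here is just an illustrative value):
\ghostpdl\debugbin\gswin32c -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=out.pdf -sColorConversionStrategy=CMYK -dPDFSETTINGS=/screen -dGrayImageDownsampleType=/Bicubic -dGrayImageFilter=/FlateEncode -dAutoFilterGrayImages=false -dGrayImageResolution=150 unit1.pdf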
Basically, as I said right at the beginning, if you want to manipulate image data, the best way to do it is with a tool designed to manipulate images, not one designed to render PostScript/PDF files.
You could, with some effort, render the original pages to a bitmap format with Ghostscript, using a stochastic screening method as IM appears to have used, then read the images back into Ghostscript to produce a PDF file, but that hardly seems easier than using IM as you are now.

Convert RGB PDF to CMYK PDF (preserve)

I am using ghostscript 9.25 windows.
I am trying to convert RGB pdf to CMYK preserve pdf using following command:
gswin32c.exe -dSAFER -dBATCH -dNOPAUSE -dNOCACHE -sDEVICE=pdfwrite -sColorConversionStrategy=CMYK -dProcessColorModel=/DeviceCMYK -dAutoFilterColorImages=false -dAutoFilterGrayImages=false -sOutputFile=out.pdf input.pdf
input.pdf file here
https://www.dropbox.com/s/8jfnov526nhb9m9/blank.pdf?dl=0
output.pdf file here
https://www.dropbox.com/s/ftrmm32mmixaxqh/out.pdf?dl=0
but my output is light compared to the Adobe output. The expected result is the darker one: when I use Adobe's CMYK preserve option, the output is a little darker than Ghostscript's. Am I doing anything wrong?
Should I use any icc profile?
Thanks
There is nothing immediately obviously wrong with your command line, but initially you gave no example file, nor any reason why you expected the result to be 'dark'.
If you want to control the conversion then you will need to supply at least one and possibly up to 4 ICC profiles. You will certainly need a CIE->CMYK Output profile, and you might like to supply ICC profiles for Gray->CIE, RGB->CIE and CMYK->CIE as well, in order to override the default ones Ghostscript is using.
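For instance (the profile file names below are placeholders for profiles you would supply; the switches themselves are documented):
gswin32c.exe -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=CMYK -sOutputICCProfile=press_cmyk.icc -sDefaultRGBProfile=source_rgb.icc -sOutputFile=out.pdf input.pdf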
[EDIT]
The problem is nothing to do with colour conversion. Your original file contains nothing except a very large image, which is compressed with the Flate filter (lossless).
You've turned off auto filtering, but you haven't told Ghostscript which compression filter to use for images, so it sticks with the default, which is JPEG (DCT), and recompresses the image with that.
For the nature of your original image, JPEG (lossy) compression is an outstandingly bad choice. The output image compresses less well, and it loses fidelity. You should change to using Flate compression instead of JPEG for images of this kind.
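For example, a sketch based on your command line with the image filters made explicit (untested against your file):
gswin32c.exe -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sColorConversionStrategy=CMYK -dProcessColorModel=/DeviceCMYK -dAutoFilterColorImages=false -dAutoFilterGrayImages=false -dColorImageFilter=/FlateEncode -dGrayImageFilter=/FlateEncode -sOutputFile=out.pdf input.pdf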
By the way, the image in your original PDF file was defined in CMYK space already.

Ghostscript option to make a PDF with flattened images

Is there an option in Ghostscript to print a PDF as images?
I can use:
gs -dNOPAUSE -dBATCH -sDEVICE=pngalpha -r300 -sOutputFile=p%03d.png my.pdf
Then use imagemagick to make a pdf out of them with:
convert *.png new.pdf
PDF printers seem to have an option that does the same thing: a checkbox that says "print as image". I could not find anything in the Ghostscript docs that sounded like that was an option. There may be a term for it that I just don't know to look for.
It is kind of hard to explain why you would want to take a PDF document that is text and turn it into a document of images of text that is 4 times the size of the original, but that is what I want to do.
Currently the only way to do that would be to start with a PDF which contains transparency operations, and select a CompatibilityLevel of 1.3 or less.
I have an idea to implement this feature, but I have not had time to work on it.
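A sketch of that workaround (it only has an effect if the input actually contains transparency, which PDF 1.3 cannot represent, forcing pdfwrite to rasterize it):
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -o new.pdf my.pdf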
You can do it as a 2-pass approach using Ghostscript to render an image, then using the view* scripts to read the image back into Ghostscript and produce a PDF. No better than using convert of course.
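Alternatively, newer Ghostscript releases provide the pdfimage* devices mentioned in the thread above, which render the input and wrap the images as a PDF in a single pass; a sketch with an assumed 300 dpi (pdfimage24 produces 24-bit RGB):
gs -dNOPAUSE -dBATCH -sDEVICE=pdfimage24 -r300 -o new.pdf my.pdf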

Scale scanned PDF without resampling image layer

I have a batch of PDFs (about 80,000 files) that consist of scanned pages. The pixel size of the image layer is consistent with 300dpi, but seems to be set to 72dpi. As a result the page size is showing something like 46x35 inches. I need to adjust these files so they register as 8.5 x 11, or whatever their natural size is, and I need to be able to script the process so I can leave this to churn on 80,000 documents (2-5 pages per document.)
I'd like to avoid resampling the image layer since that would potentially add loss, and slow the process down significantly. I've tried:
convert -density 300x300 input.pdf output.pdf
But it resamples the images. I've tried different variants on ghostscript such as
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dDownsampleMonoImages=false \
-dMonoImageResolution=300 \
input.pdf
That generates a file, but it seems unchanged and still registers as oversized. Also, the pages are different sizes and orientations, so forcing them all to one size/orientation won't work.
(FYI, really I wouldn't care, but the next step is to have Acrobat Pro OCR all these files, and its OCR chokes on anything over 45 inches.)
PDF is a resolution independent format, so the resolution of the images and so on is pretty irrelevant. The 'natural size' of the pages is whatever Acrobat says it is, this is gathered from the MediaBox (or CropBox) information which is in the file.
It sounds to me like the original conversion to PDF is at fault, and the files genuinely are the (media) size they claim to be now.
I suspect that you can probably get the result you need; you 'simply' need to resize the document. The problem is that this isn't trivial where the media sizes differ (which you say they do).
However before going further I suggest you take a file which you want to be 8.5x11 and try this:
gs -dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite \
-dDEVICEWIDTHPOINTS=612 -dDEVICEHEIGHTPOINTS=792 \
-dFIXEDMEDIA \
-dPDFFitPage \
-sOutputFile=output.pdf \
input.pdf
This will fix the media being used at 8.5x11 and tell Ghostscript to resize the document to fit the page (by calculating and applying a scale factor). It should not affect the image data except for compression; if there are colour images we might need to worry about JPEG artefacts, but that can be dealt with separately.
cpdf -scale-pages usletterportrait in.pdf -o out.pdf
Doesn't touch the page content other than to wrap it in a transformation matrix to do the scaling, and scales the media/crop/art/bleed/trim boxes too.
(Commercial, I'm afraid:
http://www.coherentpdf.com/
Disclaimer: I wrote it.)
Modify your original gs command like this:
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dPDFFitPage \
-g6120x7920 \
input.pdf
Since pdfwrite works at an internal default resolution of 720 dpi, -g6120x7920 device pixels corresponds to 612x792 points, i.e. 8.5x11 inches.
Then check 2 things:
1. Page dimensions are displaying as 'letter' (or 612x792 pts, or 8.5x11 in) now.
2. File size is only marginally different from the original one (indicating that no resampling of the page image has happened).
If the input is scanned documents in grayscale only (as it seems to be), there is no need to set -dDownsample*Images or -d*ImageResolution.
