Use Ghostscript 9.21 to convert text to outlines, and how to keep the resolution of the picture - ghostscript

I use Ghostscript to convert text to outlines with the following command:
gswin32c.exe -sDEVICE=pdfwrite -sOutputFile=output.pdf -dQUIET -dNOPAUSE -dBATCH -dNoOutputFonts -f test_new.pdf
It works, but the output file is very small: it went from 2.5 MB to 70 KB, and the picture in the PDF has become blurred.
Adding -dPDFSETTINGS=/default gives the same result.
It is better with -dPDFSETTINGS=/printer or -dPDFSETTINGS=/prepress, but 300 dpi is not enough for me (or for my boss).
Is there any way to keep the original resolution of the picture?
Or how can I set a higher dpi for images in the output PDF?
The test file is here.
Thanks in advance.

The answer to your question is 'yes' (but see later). Don't use PDFSETTINGS, that sets lots of things all in one go. If you want control then you need to specify each setting individually.
Rather than use this shotgun approach you need to read the documentation, decide which controls affect areas you want to change, and alter those controls only.
However, image downsampling is not your problem. If you don't use -dPDFSETTINGS then the PDF file written by Ghostscript contains an image at exactly the same resolution as the image in the original file.
Your problem is that the image is being written with JPEG compression, and JPEG is a lossy compression, so you are losing fidelity. Note that in the original file the image is written uncompressed, which is why it's so large.
It looks like the original image was a JPEG, and the free PDF editor you are using has realised that so it saved the image uncompressed (I may be giving it too much credit here, it may save all images uncompressed). Applying JPEG to an image which has already been quantised simply amplifies the artefacts.
Instead you need to specify that you want images compressed with Flate, which is a lossless compression. The documentation for the pdfwrite controls can be found here; the ones you need to change are AutoFilterColorImages and ColorImageFilter.
Note that by not applying JPEG quantisation (a second time) and DCT encoding, the compression is less than your first experience. For me the output file comes in at just over 600Kb (leaving the font in place, and the text as text, would be a couple of Kb smaller). However the image is identical, as expected.
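As a rough sketch of the Flate settings described above (not a tested recipe for this particular file): the function below only prints the pdfwrite switches that turn off automatic filter selection and ask for FlateEncode. The function name, the matching switches for grey images, and the guard around the call are my own additions; on Windows substitute gswin32c.exe for gs.

```shell
# Print the pdfwrite switches that keep images lossless (Flate instead of JPEG).
lossless_image_args() {
    printf '%s\n' \
        -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
        -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode
}

# Run the conversion only if Ghostscript and the input file are present.
if command -v gs >/dev/null 2>&1 && [ -f test_new.pdf ]; then
    # Unquoted on purpose: each printed switch must become its own argument.
    gs -sDEVICE=pdfwrite -dQUIET -dNOPAUSE -dBATCH -dNoOutputFonts \
        $(lossless_image_args) \
        -sOutputFile=output.pdf -f test_new.pdf
fi
```

Expect a larger file than the JPEG version, since nothing is thrown away.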
Since you are clearly using Ghostscript in a commercial environment, can I just point you at the licence and ask you to check that your usage is compatible with the AGPL, bearing in mind that this covers software as a service usage as well.

Related

How can I take a pdf, and convert any jpeg2000/jpx/jp2 images in it to jpeg images?

I am using MacOS Mojave on a Mac Mini, and I am also using an old Kindle Dx which cannot read jpeg2000 images. It also has trouble with too many or too large jpeg images.
I cannot use touchscreens, so newer e-readers and tablets aren't a solution.
So far, I've found some buggy solutions--
I can use Willus's k2pdfopt with -mode copy and -dev dx, which rasterizes everything. It's a good solution for scanned pdfs. If more detail is needed, -mode copy without -dev dx will preserve higher resolution. It's something of a last resort for pdf-born-pdfs, since text can be uglier and harder to read, and file sizes can increase alarmingly.
I can also use Ghostscript with -dCompatibilityLevel=1.4, which doesn't rasterize everything. It converts jpeg2000 images to jpeg images. But it doesn't tackle some oversized or poorly-constructed images, it often creates dark rectangles which can obscure text, and it occasionally loses the ability to search or select text. [P.S. I mean it takes a pdf which had searchable text and outputs one which does not. Also if I do any kind of image downsampling or removal, it sometimes rescales everything or loses pages.]
I have experimented with options to compress images in Ghostscript, with mixed success, and with the above bugs persisting. [P.S. I think I was downsampling, yes.]
For whatever reason, MacOS Quartz filters only work if they will reduce image sizes. So they tend not to work on the buggy images.
Now my ideal solution would preserve the text itself, preferably untangling ligatures, and would compress the images like Willus's k2pdfopt. But I have no idea if that's possible or how.
Short of that-- I'm wondering if there's a way to use Ghostscript to convert the jpeg2000 images without causing the gray rectangles or losing the ability to search or select text.
or if there's a way to use Quartz filters so they work. In some older versions of MacOS they did work.
or if there's a way to batch-print these pdf files to the appropriate resolution, apparently 800x1180, reprocessing images in the process.
I don't have much programming experience. I mainly use homebrew to install command-line tools, very sloppy bash scripts, and Automator to run them.
P.S. For a minimal example of the gray rectangles in Ghostscript, using the free pdf from here: https://www.peginc.com/store/test-drive-savage-worlds-the-wild-hunt/
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -o out.pdf in.pdf
substituting that pdf for in.pdf.
For a minimal example of losing searchable text, using the free pdf from here: http://datafortress2020.com/fileproject/details.php?image_id=498
same minimal script
Compatibility Level
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 -o out.pdf in.pdf
Aggressive Downsampling and Grayscale
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-dFastWebView -sColorConversionStrategy=Gray \
-dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true \
-dColorImageResolution=75 -dGrayImageResolution=75 -dMonoImageResolution=150 \
-dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 \
-o out.pdf in.pdf
P.P.S. I can use k2pdfopt to rasterize to fit my Kindle. If the file has searchable text, this retains it, if it doesn't I can run tesseract in k2 or run ocrmypdf afterwards.
But if I want especially good graphics, or especially clear text, and the file has hundreds of pages, it will need hundreds of megs. I had blamed this on rasterizing the text, which was why my ideal solution was to keep text and rasterize images, but apparently it's an issue with the images themselves.
If you think you've found a bug, then it's helpful to report it. If you don't it will never be fixed. You can report a bug at https://bugs.ghostscript.com, please be sure to attach an example file to reproduce the problem and state the command line used.
The Ghostscript pdfwrite device does not, ever, produce JPEG2000 images (due to patent issues). So you don't need to set the CompatibilityLevel at all, and I'd recommend that you do not. By setting the CompatibilityLevel you are limiting the output. Unless your device cannot handle later versions, don't do this.
Without seeing an example file, a command line and knowing the version and operating system it's obviously not possible for anyone to comment on your 'gray rectangles'.
You can reduce the size of images (in bytes) by downsampling (as opposed to compressing) them, you can't do anything about the number of images.
Note that searchable text depends on the construction of the PDF file, and so cannot ever be guaranteed. Searchable text (in the sense of ToUnicode CMaps) was a later addition to the PDF Reference and is always optional, because it's possible to have input from which the Unicode code points cannot be determined (without using OCR software) but a perfectly readable PDF file can still be produced.
Ghostscript itself can produce a PDF file which is a rendered representation of the original, wrapped up as a PDF. See the pdfimage* devices.
Tesseract can take images and produce PDF files with searchable text, produced by OCR'ing the images. This would seem to me to be your best option, though obviously I don't know if a single large image is going to be acceptable to your device.
Edit
I already agreed that searching text is inherently not supported in PDF, except as an optional adjunct. The bug report you pointed to talks about 'corrupting text layers'. There are no text layers in PDF, and the text is neither corrupted nor missing, it's just not encoded as ASCII any more.
The reason you shouldn't set the resolution, and the size in pixels, is because PDF is not an image format. You aren't gaining anything by doing this. All that happens is that pdfwrite divides the 'g' values by the resolution, to get a media size in inches, and writes that as the MediaBox. It is simpler just to set the media size. If you set the resolution you are fixing anything which does get rendered at that resolution. Choose a low resolution and you get crappy output. If you use a higher resolution then the image can be downscaled and smoothed, giving better output.
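To make that arithmetic concrete: 800x1080 pixels at 150 dpi is 5.33 x 7.2 inches, or 384 x 518.4 points. A minimal sketch that does the same conversion and then sets the media size directly, without touching -r. The helper name px_to_pt and the rounding to whole points are my own choices, and I am only assuming that the page size from the question is the one you actually want.

```shell
# Convert a pixel count at a given dpi into PDF points (1 point = 1/72 inch),
# rounded to the nearest whole point.
px_to_pt() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.0f\n", px * 72 / dpi }'
}

width_pt=$(px_to_pt 800 150)     # 800 / 150 = 5.33 in = 384 pt
height_pt=$(px_to_pt 1080 150)   # 1080 / 150 = 7.2 in = 518.4 pt, rounded to 518

if command -v gs >/dev/null 2>&1 && [ -f in.pdf ]; then
    # Fix the media size in points and scale each page to fit it.
    gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH \
        -dDEVICEWIDTHPOINTS="$width_pt" -dDEVICEHEIGHTPOINTS="$height_pt" \
        -dFIXEDMEDIA -dPDFFitPage \
        -o out.pdf in.pdf
fi
```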
It is indeed possible that your Kindle cannot handle transparency any better than the Mac, it is after all an old device. It's also possible that whoever built Ghostscript for you introduced a bug. I'm afraid we can't help you with either of those.
I did suggest, right back at the end of the original post, that you render the content to an image (Ghostscript will do that for you), then use Tesseract to convert the image back to a PDF, and at the same time OCR the text.
That will get past your problems with JPEG2000, will do a better job of creating searchable text, since even files that aren't already searchable will become so, and will allow you to specify the resolution.
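A sketch of that render-then-OCR route, under some assumptions of mine: greyscale PNG output (the pnggray device), 300 dpi, and a scratch directory called pages. It relies on tesseract accepting a text file that lists one image per line as its input, and on the pdf config to write a searchable PDF. Treat it as a starting point, not something I have run on your files.

```shell
dpi=300            # assumed rendering resolution; pick what your device needs
workdir=pages      # assumed scratch directory for the rendered pages

# List the rendered page images, one per line.
# Zero-padded names make the glob expand in page order.
list_pages() {
    printf '%s\n' "$1"/page-*.png
}

if command -v gs >/dev/null 2>&1 && command -v tesseract >/dev/null 2>&1 \
        && [ -f in.pdf ]; then
    mkdir -p "$workdir"
    # 1. Render every page to a greyscale PNG.
    gs -sDEVICE=pnggray -r"$dpi" -dNOPAUSE -dBATCH -dQUIET \
        -sOutputFile="$workdir/page-%04d.png" in.pdf
    # 2. OCR all pages into one searchable PDF (written as out.pdf).
    list_pages "$workdir" > "$workdir/pages.txt"
    tesseract "$workdir/pages.txt" out pdf
fi
```

The resolution is the knob that trades file size against legibility, so it is worth trying a few values on a short document first.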

Convert RGB pdf to CMYK preserve pdf

I am using Ghostscript 9.25 on Windows.
I am trying to convert an RGB PDF to a CMYK preserve PDF using the following command:
gswin32c.exe -dSAFER -dBATCH -dNOPAUSE -dNOCACHE -sDEVICE=pdfwrite -sColorConversionStrategy=CMYK -dProcessColorModel=/DeviceCMYK -dAutoFilterColorImages=false -dAutoFilterGrayImages=false -sOutputFile=out.pdf input.pdf
input.pdf file here
https://www.dropbox.com/s/8jfnov526nhb9m9/blank.pdf?dl=0
output.pdf file here
https://www.dropbox.com/s/ftrmm32mmixaxqh/out.pdf?dl=0
But my output is lighter than Adobe's. The expected result is darker: when I do the conversion with Adobe's CMYK preserve option, the result is a little darker than the Ghostscript output. Am I doing anything wrong?
Should I use any icc profile?
Thanks
You say you are using ImageMagick, yet you give a Ghostscript command line....
I presume that when you say CMYL you mean CMYK.
There is nothing immediately obviously wrong with your command line, but you have given no example file, nor any reason why you expect the result to be 'dark'.
If you want to control the conversion then you will need to supply at least one and possibly up to 4 ICC profiles. You will certainly need a CIE->CMYK Output profile, and you might like to supply ICC profiles for Gray->CIE, RGB->CIE and CMYK->CIE as well, in order to override the default ones Ghostscript is using.
[EDIT]
The problem is nothing to do with colour conversion. Your original file contains nothing except a very large image, which is compressed with the Flate filter (lossless). It looks like this:
You've turned off auto filtering, but you haven't told Ghostscript which compression filter to use for images, so it sticks with the default, which is JPEG (DCT). The image now looks like this:
For the nature of your original image, JPEG (lossy) compression is an outstandingly bad choice. The output image compresses less well, and it loses fidelity. You should change to using Flate compression instead of JPEG for images of this kind.
By the way, the image in your original PDF file was defined in CMYK space already.
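For illustration, the command line from the question with an explicit lossless filter added might look like the sketch below. The function name and the guard are mine, and I have assumed a Unix-style gs binary; use gswin32c.exe on Windows.

```shell
# The colour switches from the question, plus explicit Flate filters so that
# pdfwrite does not fall back to JPEG (DCT) once auto filtering is off.
cmyk_flate_args() {
    printf '%s\n' \
        -sColorConversionStrategy=CMYK -dProcessColorModel=/DeviceCMYK \
        -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
        -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode
}

if command -v gs >/dev/null 2>&1 && [ -f input.pdf ]; then
    # Unquoted on purpose: each printed switch must become its own argument.
    gs -dSAFER -dBATCH -dNOPAUSE -dNOCACHE -sDEVICE=pdfwrite \
        $(cmyk_flate_args) \
        -sOutputFile=out.pdf input.pdf
fi
```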

Create small high quality PDF embedding optimized PNG?

I'm trying to create a small PDF file, embedding one optimized PNG image displayed as a header and footer on a 3 page PDF (same image must appear 6x in the PDF)
My optimized PNG image is only 2.3KB. It looks very sharp.
Failed with libreoffice
When I insert just one instance of the 2.3KB PNG image into a Libreoffice Writer doc containing only text, then export as PDF I can see that the image gets re-compressed to JPG and the resulting PDF file grows by about 40KB after adding the image. It also loses quality, the PNG also gets JPG fuzzy edges.
If I right click the image and select compression, there is no way to disable recompressing the image (it's already optimized better than libreoffice could do it) I've tried setting a compression level of 0,1,9 etc. Choosing JPG, no resize, lossless, etc but there was no improvement.
Failed with wkhtmltopdf
I also tried making a test page and used wkhtmltopdf but it did the same thing. Adding the low quality flag made no difference.
PDF Spec suggests PNG is supported?
From skimming the PDF spec, it looks like PNG images are supported.
Even plain text PDF files are surprisingly large
The disappointing thing is also when I take a 7KB HTML file which is basically just <html><body><p>foo...</p><p>bar...</p> (only about 15 paragraphs) with no CSS. The resulting 2 page PDF file is 30KB. Why should a 7kb (almost plain text) file become 30kb as a PDF?
Suggestions?
Can someone please suggest how to make a small PDF file in Linux?
I need to include 7KB of text and repeat one PNG image 6 times.
Manually or programmatically. I'll take whatever I can get at this point.
PDF Spec suggests PNG is supported?
PNG isn't supported per se; PDF allows embedding JPEG images as-is, but not PNG images. PDF does borrow a set of features of the PNG format, however.
rinohtype (full disclosure: I'm the author) tries to embed as much as possible from PNG images as-is into the PDF. This does involve some bit-juggling to separate the alpha channel from the color data for example, but no reencoding of the image is performed. It does not (yet) support interlaced PNGs.
rinohtype should be able to do what you want to achieve. But please note that it currently is in a beta stage, so you might encounter some bugs.
Even plain text PDF files are surprisingly large
To keep the PDF size as small as possible, make sure not to embed/subset any of the fonts. Use only the fonts from the base 14 PDF fonts which are provided by PDF readers.
What you want is certainly achievable. Regarding the image quality, I would recommend making your image twice the size that you want it to actually display at in the PDF to keep it looking sharp.
As to the size, I've just modified a test in my PDF writer module (WIP..) to include a 7.2K png, 200px x 70px, in a PDF twice and the PDF came out at 6.8K 8). There's not much text included, but more text will only add what it's worth + a small percentage.
You can see the module and original test here.. https://github.com/DoccaPDF/docca-pdf-writer/blob/master/src/tests/writer.js#L40
That test adds ~112K of images to the PDF and results in a 103K PDF.
Of course not all images are created equal so your mileage may vary..
*the images are only actually added to the PDF once, but are displayed multiple times.

How to remove anti-aliasing in PDF images?

I use Abbyy FineReader for ScanSnap to OCR a couple of scanned PDF files. The software claims it retains the original PDF images. The PDF file sizes pre-OCR and post-OCR are almost identical, which is good.
After the software is done, all PDF images appear anti-aliased in Acrobat X. Page navigation is much slower than before, and when I zoom in/out, the images first go to what looks like the pre-anti-aliasing version before quickly changing to anti-aliased images.
Left: Scanned PDF / Right: after OCR with Abbyy
I would like to get the original images without anti-aliasing back. Interestingly, when I open a single page from the anti-aliased PDF in Photoshop, there is no anti-aliasing and the image looks like the left one.
My limited PDF programming experience leads me to believe that Abbyy likely sets some kind of anti-alias flag for each image during OCR processing. How do I un-set this flag?
Any pointers to useful ideas would be much appreciated.
After the software is done, all PDF images appear anti-aliased in Acrobat X. Page navigation is much slower than before, and when I zoom in/out, the images first go to what looks like the pre-anti-aliasing version before quickly changing to anti-aliased images.
Actually, the original file 2013_11_15_22_51_31.pdf contains a JPEG image while the OCR'ed file 2013_11_15_22_51_31_OCR.pdf contains a JPEG2000 image.
Comparing them in third party viewers, it becomes clear that the image in the OCR'ed file is not inherently anti-alias'ed. Furthermore there is no evident flag in the PDF instructing PDF viewers to apply anti-aliasing to the JPEG2000 image. Thus, Adobe Reader seems to automatically render JPEG and JPEG2000 images differently, applying anti-aliasing to the latter but not to the former.
Comparing both images in detail, though, it becomes clear that these images are not identical but instead the image in the OCR'ed PDF is slightly rotated.
I assume Abbyy FineReader recognized that the original scanned image is not correctly oriented. Thus, it rotated it slightly to correct this orientation.
Thus, replacing the image in the OCR'ed version with the one from the original one is no option: Due to the rotation the OCR information would partially be somewhat off.
What you might want to try is to recode the JPEG2000 image to JPEG and replace the image in the OCR'ed version with this recoded one. This will mean some loss of quality but most likely you can get rid of the anti-aliasing this way.
Be aware, though, that the JPEG2000 image is slightly larger than the JPEG image to accommodate the rotation.
PS: As @VadimR pointed out, there is indeed an /Interpolate true entry in the image dictionary of the OCR-ed version, which I missed when looking at the file. This does not seem to be the major issue slowing down the rendering.
There is an /Interpolate true entry in the image dictionary of the OCR-ed version, and that's what causes the 'anti-aliasing'. Whether that (and not JPEG2000 instead of JPEG compression) is the cause of the slow-down, you can check on large enough files.
To un-set this key, the best would be to turn it off while creating a file, and if that's not possible, to write and run a small program in suitable language.
But, since your file doesn't sport 'compressed objects' and offending key is in plain view inside a file, in the spirit of 'job done quickly' you can simply process your file e.g. like this:
perl -M-encoding -0777pe "s!/Interpolate true!' 'x17!ge" <in.pdf >out.pdf
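The replacement is exactly 17 spaces because '/Interpolate true' is 17 bytes long: the file length and every byte offset in the cross-reference table stay the same. A small sanity check you could run afterwards, as a sketch; the two helper names are made up for this example, and it assumes a grep that understands -a and -o.

```shell
# Count how many times the key still appears (-a treats the binary PDF as text).
count_interpolate() {
    grep -a -o '/Interpolate true' "$1" | wc -l | tr -d ' '
}

# File size in bytes; it must not change, or the xref offsets would be wrong.
size_of() {
    wc -c < "$1" | tr -d ' '
}

if [ -f in.pdf ] && [ -f out.pdf ]; then
    echo "keys left in out.pdf: $(count_interpolate out.pdf)"
    if [ "$(size_of in.pdf)" -eq "$(size_of out.pdf)" ]; then
        echo "size unchanged"
    else
        echo "size changed: do not use out.pdf"
    fi
fi
```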

Ghostscript: how to reduce file size of large PDFs without changing smaller PDFs

I am using GhostScript to convert large batches of PDF to PDF to reduce file size. The original PDFs vary in size and quality. Where there is a low quality, small file size (<350kb) PDF the output from Ghostscript is often poor.
Is there a way I can get GhostScript to ignore files below a certain size and just pass them through without downsampling?
Current settings:
SearchablePDFSetting=-dColorImageResolution=120 -dMonoImageResolution=38 -dMonoImageDownsampleType=/Average -dOptimize=true -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dUseCIEColor -dColorConversionStrategy=/sRGB -dFIXEDMEDIA -dDEVICEWIDTHPOINTS=596 -dDEVICEHEIGHTPOINTS=834
Thanks,
Vix
The pdfwrite device can already pass images (not files) through without downsampling, there is no way to 'pass through without changing' a file. If you want to not process files below a certain size, then don't process them.
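"Don't process them" has to happen outside Ghostscript, in whatever script drives the batch. A minimal sketch, assuming a POSIX shell, the 350 KB figure from the question, and an output directory called out (the function names are invented for the example):

```shell
min_bytes=358400   # 350 * 1024, the threshold mentioned in the question

# Succeed if the file is at least min_bytes long and so worth recompressing.
is_large_enough() {
    size=$(wc -c < "$1")
    [ "$size" -ge "$min_bytes" ]
}

# Recompress the large PDFs in the current directory into ./out and copy
# the small ones through untouched.
shrink_batch() {
    mkdir -p out
    for f in *.pdf; do
        [ -f "$f" ] || continue    # nothing matched the glob
        if is_large_enough "$f"; then
            # Add the rest of your pdfwrite settings here.
            gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dQUIET \
                -dDownsampleColorImages=true -dColorImageResolution=120 \
                -sOutputFile="out/$f" "$f"
        else
            cp "$f" "out/$f"
        fi
    done
}
```

Call shrink_batch from the directory that holds the PDFs.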
To avoid further downsampling of images, you need to add the 'xxxxImageDownsampleThreshold' parameters (one each for Mono, Grey and Color). If you set this to (eg) 1.5 then images which are up to 50% higher resolution than the target resolution won't be downsampled.
Note that you haven't (apparently) set a GrayImageResolution, you haven't set the downsample type for Color or Gray images, and a MonoImageResolution of 38 looks pretty ugly to me.
The default Gray image filter is DCT (JPEG) as is the Color filter. If the original image was DCT then applying a second round of DCT compression will result in ugly artefacts, especially if the image is not downsampled. I would suggest you change the filter type to FlateEncode.
All these options are documented in ps2pdf.htm in the Ghostscript doc folder.
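Put together, those suggestions might look like the sketch below. The 1.5 threshold is the example value from above and 120 is the colour resolution from the question; reusing 120 for grey, 300 for mono, and /Average and /Subsample as the downsample types are my own guesses, so adjust them. The function name is invented.

```shell
# Print pdfwrite switches that downsample only images more than 1.5x the
# target resolution and store colour and grey images with lossless Flate.
downsample_args() {
    printf '%s\n' \
        -dDownsampleColorImages=true -dColorImageResolution=120 \
        -dColorImageDownsampleType=/Average -dColorImageDownsampleThreshold=1.5 \
        -dDownsampleGrayImages=true -dGrayImageResolution=120 \
        -dGrayImageDownsampleType=/Average -dGrayImageDownsampleThreshold=1.5 \
        -dDownsampleMonoImages=true -dMonoImageResolution=300 \
        -dMonoImageDownsampleType=/Subsample -dMonoImageDownsampleThreshold=1.5 \
        -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
        -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode
}

if command -v gs >/dev/null 2>&1 && [ -f in.pdf ]; then
    # Unquoted on purpose: each printed switch must become its own argument.
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dQUIET \
        $(downsample_args) \
        -sOutputFile=out.pdf in.pdf
fi
```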
Add the option:
-dPDFSETTINGS=/screen
This "selects low-resolution output similar to the Acrobat Distiller 'Screen Optimized' setting."

Resources