CropBox and MediaBox in Ghostscript

I'm using Ghostscript to turn PDFs into jpeg thumbnails. It works great for most files, but I've got a few that end up looking bad - like a tiny thumbnail on a huge white background.
This happens because, on those problem PDFs, the MediaBox is set to a much larger size than the CropBox. I can fix this in Ghostscript by using -dUseCropBox to make it ignore the MediaBox dimensions ... but that does not work on other PDFs that have no CropBox defined.
So I can think of two solutions:
1. Somehow check each PDF file before import to see whether it has a CropBox defined. If it does, use the -dUseCropBox switch; if it does not, omit the switch.
2. Modify the MediaBox dimensions in the PDF file itself so that they match the CropBox dimensions.
So what code would I use to check a PDF file for CropBox/MediaBox dimensions and, if necessary, edit them?
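For option 1, here is a minimal shell sketch, assuming poppler-utils' pdfinfo is available. Note that pdfinfo reports a CropBox even when the file merely inherits it from the MediaBox, so the useful test is whether the two boxes differ (thumb.jpg and the 72 dpi resolution are placeholder choices):
pdf="$1"
media=$(pdfinfo -box "$pdf" | awk '/^MediaBox/ {print $2,$3,$4,$5}')
crop=$(pdfinfo -box "$pdf" | awk '/^CropBox/ {print $2,$3,$4,$5}')
opts=""
# Only pass -dUseCropBox when a distinct CropBox is actually present.
[ -n "$crop" ] && [ "$crop" != "$media" ] && opts="-dUseCropBox"
gs -q -dNOPAUSE -dBATCH -sDEVICE=jpeg -r72 $opts -sOutputFile=thumb.jpg "$pdf"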

What do you plan to do with files that have no CropBox? It seems to me that you are already doing everything you can: if a CropBox is present (and you select -dUseCropBox), it is used; if not, then (if I recall correctly) Ghostscript will fall back to the MediaBox anyway.

I think what you're really looking for is a program or script to crop whitespace from PDFs, irrespective of the media/trim/crop box settings. You could try either of these freeware PDF croppers (usage sketch after the list):
pdfcrop - a perl script, works on multiple platforms
http://tug.ctan.org/tex-archive/support/pdfcrop
(requires *tex, ghostscript and obviously perl)
PDF Cropper, for Windows
http://www.noliturbare.com/pdf-tools/pdf-cropper
(requires ghostscript and .NET 3.5)
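A typical pdfcrop invocation, as a sketch (--margins adds a few points of breathing room around the detected bounding box):
pdfcrop --margins 5 input.pdf output-cropped.pdf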
Alternatively, if you have a Mac, you could use the crop function in the Preview application. This sets the CropBox without touching the MediaBox (at least it does on Mac OS X 10.4), allowing you to use -dUseCropBox.

Instead of Ghostscript, use ImageMagick (which delegates PDF rasterization to Ghostscript under the hood). For example:
convert -resize 70x70 file.pdf file.jpg
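One caveat: ImageMagick rasterizes the PDF at 72 dpi by default before resizing, so thumbnails can come out blurry. A possible refinement (a sketch):
convert -density 150 file.pdf -resize 70x70 file.jpg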

Related

How can I take a pdf, and convert any jpeg2000/jpx/jp2 images in it to jpeg images?

I am using macOS Mojave on a Mac Mini, and I am also using an old Kindle DX which cannot read JPEG 2000 images. It also has trouble with too many or too large JPEG images.
I cannot use touchscreens, so newer e-readers and tablets aren't a solution.
So far, I've found some buggy solutions:
I can use Willus's k2pdfopt with -mode copy and -dev dx, which rasterizes everything. It's a good solution for scanned PDFs. If more detail is needed, -mode copy without -dev dx will preserve higher resolution. It's something of a last resort for born-digital PDFs, since text can be uglier and harder to read, and file sizes can increase alarmingly.
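A sketch of those two invocations, for reference (file names are placeholders, and the -ui- flag to suppress k2pdfopt's interactive menu is from memory, so check the tool's help output first):
k2pdfopt -ui- -mode copy -dev dx in.pdf
k2pdfopt -ui- -mode copy in.pdf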
I can also use Ghostscript with -dCompatibilityLevel=1.4, which doesn't rasterize everything; it converts JPEG 2000 images to JPEG images. But it doesn't tackle some oversized or poorly-constructed images, it often creates dark rectangles which can obscure text, and it occasionally loses the ability to search or select text. [P.S. I mean it takes a PDF which had searchable text and outputs one which does not. Also, if I do any kind of image downsampling or removal, it sometimes rescales everything or loses pages.]
I have experimented with options to compress images in Ghostscript, with mixed success, and with the above bugs persisting. [P.S. I think I was downsampling, yes.]
For whatever reason, macOS Quartz filters only work if they will reduce image sizes, so they tend not to work on the buggy images.
Now my ideal solution would preserve the text itself, preferably untangling ligatures, and would compress the images like Willus's k2pdfopt. But I have no idea if that's possible or how.
Short of that, I'm wondering if there's a way to use Ghostscript to convert the JPEG 2000 images without causing the gray rectangles or losing the ability to search or select text;
or if there's a way to use Quartz filters so they work (in some older versions of macOS they did);
or if there's a way to batch-print these PDF files to the appropriate resolution, apparently 800x1180, reprocessing the images in the process.
I don't have much programming experience. I mainly use homebrew to install command-line tools, very sloppy bash scripts, and Automator to run them.
P.S. For a minimal example of the gray rectangles in Ghostscript, using the free pdf from here: https://www.peginc.com/store/test-drive-savage-worlds-the-wild-hunt/
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -o out.pdf in.pdf
substituting that pdf for in.pdf.
For a minimal example of losing searchable text, use the same minimal command with the free PDF from here: http://datafortress2020.com/fileproject/details.php?image_id=498
Compatibility Level
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 -o out.pdf in.pdf
Aggressive Downsampling and Grayscale
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 \
  -g800x1180 -r150 -dPDFFitPage \
  -dFastWebView -sColorConversionStrategy=Gray \
  -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true \
  -dColorImageResolution=75 -dGrayImageResolution=75 -dMonoImageResolution=150 \
  -dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 \
  -o out.pdf in.pdf
P.P.S. I can use k2pdfopt to rasterize to fit my Kindle. If the file has searchable text, this retains it; if it doesn't, I can run Tesseract in k2pdfopt or run ocrmypdf afterwards.
But if I want especially good graphics, or especially clear text, and the file has hundreds of pages, it will need hundreds of megabytes. I had blamed this on rasterizing the text, which was why my ideal solution was to keep text and rasterize images, but apparently it's an issue with the images themselves.
If you think you've found a bug, then it's helpful to report it; if you don't, it will never be fixed. You can report a bug at https://bugs.ghostscript.com. Please be sure to attach an example file to reproduce the problem and state the command line used.
The Ghostscript pdfwrite device does not, ever, produce JPEG 2000 images (due to patent issues). So you don't need to set the CompatibilityLevel at all, and I'd recommend that you do not. By setting the CompatibilityLevel you are limiting the output. Unless your device cannot handle later versions, don't do this.
Without seeing an example file, a command line and knowing the version and operating system it's obviously not possible for anyone to comment on your 'gray rectangles'.
You can reduce the size of images (in bytes) by downsampling (as opposed to compressing) them; you can't do anything about the number of images.
Note that searchable text depends on the construction of the PDF file, and so cannot ever be guaranteed. Searchable text (in the sense of ToUnicode CMaps) was a later addition to the PDF Reference and is always optional, because it's possible to have input from which the Unicode code points cannot be determined (without using OCR software) but a perfectly readable PDF file can still be produced.
Ghostscript itself can produce a PDF file which is a rendered representation of the original, wrapped up as a PDF. See the pdfimage* devices.
Tesseract can take images and produce PDF files with searchable text, produced by OCR'ing the images. This would seem to me to be your best option, though obviously I don't know if a single large image is going to be acceptable to your device.
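A minimal sketch of that render-then-OCR pipeline, going via PNG rather than the pdfimage* devices since Tesseract wants image input (file names and the 300 dpi figure are placeholder choices):
# Render each page to an image; png16m is Ghostscript's 24-bit PNG device.
gs -sDEVICE=png16m -r300 -dNOPAUSE -dBATCH -o page-%03d.png in.pdf
# OCR the pages into one searchable PDF; tesseract accepts a plain text
# file listing the input images, one per line.
ls page-*.png > pages.txt
tesseract pages.txt searchable pdf
The output is searchable.pdf, with invisible OCR'd text behind the page images.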
Edit
I already agreed that searching text is inherently not supported in PDF, except as an optional adjunct. The bug report you pointed to talks about 'corrupting text layers'. There are no text layers in PDF, and the text is neither corrupted nor missing; it's just not encoded as ASCII any more.
The reason you shouldn't set the resolution, and the size in pixels, is that PDF is not an image format. You aren't gaining anything by doing this. All that happens is that pdfwrite divides the '-g' values by the resolution to get a media size in inches, and writes that as the MediaBox. It's simpler just to set the media size. If you set the resolution, you are also fixing the resolution at which anything that does get rendered is rendered. Choose a low resolution and you get crappy output; use a higher resolution and the image can be downscaled and smoothed, giving better output.
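To make that concrete with the numbers from the question: -g800x1180 at -r150 means 800/150 by 1180/150 inches, so pdfwrite writes a MediaBox of roughly 384 x 566 points. You can request that page size directly instead, without fixing a rendering resolution (a sketch; -dFIXEDMEDIA and -dPDFFitPage force and scale to the requested size):
gs -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=384 -dDEVICEHEIGHTPOINTS=566 -dFIXEDMEDIA -dPDFFitPage -o out.pdf in.pdf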
It is indeed possible that your Kindle cannot handle transparency any better than the Mac; it is, after all, an old device. It's also possible that whoever built Ghostscript for you introduced a bug. I'm afraid we can't help you with either of those.
I did suggest, right back at the end of the original post, that you render the content to an image (Ghostscript will do that for you), then use Tesseract to convert the image back to a PDF, and at the same time OCR the text.
That will get past your problems with JPEG 2000, will do a better job of creating searchable text (since even files that aren't already searchable will become so), and will allow you to specify the resolution.

Ghostscript - EPS (with embedded TIFF with transparent background) to PNG conversion

I'm trying to convert an EPS file with an embedded TIFF that has a transparent background to a PNG using Ghostscript. The problem I am having is that the background of the TIFF image becomes white in the PNG. It looks like the following:
IncorrectPNG
When I export from Adobe Illustrator, it comes out correct:
CorrectPNG
I was reading that there is no transparency in EPS, only marked and unmarked areas. I was wondering if there is a call that I'm missing that would create the PNG through Ghostscript similar to what Illustrator does? Or is there any other alternative that doesn't just replace white with transparency through ImageMagick?
I am using Windows and have Ghostscript 9.25 installed. Here is the command (one of many) that I've tried:
gswin64c -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -sDEVICE=pngalpha -r300 -dEPSCrop -sOutputFile=NamePlatePNG.png NamePlate.eps
I can get the EPS file to you if needed. Any help would be appreciated, thanks!
UPDATE:
Here is the EPS file (Hopefully this link works):
https://drive.google.com/open?id=1m4HHGLoPe0jdWkx1Oghe7ttiXPldZnJs
Also, I should have mentioned that the images I uploaded were just screenshots of the PNGs open in an image editor. The checkered portion is indeed fully transparent alpha channel. I was trying to easily accentuate the difference.
Your file doesn't look like it's transparent; it looks like it's masked, possibly with a stencil mask, possibly chroma-keyed. Without seeing the file I can't tell for sure.
You are correct that PostScript (and hence EPS) doesn't support transparency, but it does support several features which have somewhat similar effects.
The color space is irrelevant; in fact, the only kind of 'transparency' supported in PostScript works when the color space is CMYK, but not when it's RGB (and certainly not sRGB, which isn't even a PostScript color space; you have to manufacture it from CIEBasedABC).
As far as I can see the command line you are using is correct, but as I say I can't tell much without seeing the actual EPS program.
[EDIT]
So the Ghostscript rendering is correct; that's what is in your EPS file: there is no transparency of any kind there. So how is Illustrator able to make a transparent PNG? Well, the answer is that Illustrator isn't using the PostScript part of the EPS file.
About 1/3 of the way through the EPS file you'll see a line which reads:
%AI9_PrivateDataBegin
What follows that is Adobe Illustrator's native file format. When AI reads the file it finds that line, throws away the PostScript portion of the file, and reads the AI representation of the content from the portion of the file beginning with that comment.
Now stored somewhere in there will be the information that portions of the content are transparent. Although PostScript can't represent that, Illustrator's internal format can. So when you write a PNG file from Illustrator it knows that portion is transparent and writes it as such.
Ghostscript, however, is constrained by the PostScript portion of the file, it can't read the Illustrator native format, and so renders the image with a white background.
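You can verify this yourself with a quick check (a sketch; grep's -a option treats the file as text even though it contains binary sections):
grep -a -n '%AI9_PrivateDataBegin' NamePlate.eps
If that prints a line number, everything after it is Illustrator's private data, which Ghostscript never looks at.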
It 'might' be possible to save a different kind of EPS from Illustrator (level 3 instead of level 2, possibly; I notice this is a language level 2 EPS file) which duplicates the effect, but from what you have here, there isn't anything a standard PostScript interpreter can do which will give you the result you want.

Ghostscript PDF to PostScript

I have to convert PDF files (created with JasperReports) to PostScript.
I'm using Ghostscript (version 9.19) to do the conversion.
The command I'm using is:
gswin64c -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=file.ps file.pdf
The conversion completes without problems, but when I open the generated PostScript file (using GSview 5.0), the top margin is cropped by 2-3 cm and some of the information to print is lost.
I have changed the device from ps2write to eps2write, and used the -g<width>x<height> switch with the page size in pixels, but the problem persists.
The file is to be printed on preformatted paper, so I cannot use the generated PostScript for printing.
Can someone help?
Thanks
It's not possible to say with great certainty, but it sounds like the PDF MediaBox is larger than the media you have specified to GSview.
You can try using -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS along with -dFIXEDMEDIA and -dPDFFitPage; that should allow you to set up a specific media size, override the size in the PDF file, and scale the result to fit the specified size.
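For example (a sketch using A4 media at 595x842 points; substitute your preprinted form's dimensions):
gswin64c -dNOPAUSE -dBATCH -sDEVICE=ps2write -dDEVICEWIDTHPOINTS=595 -dDEVICEHEIGHTPOINTS=842 -dFIXEDMEDIA -dPDFFitPage -sOutputFile=file.ps file.pdf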
Perhaps you could post an example PDF file; without that it's very hard to comment sensibly.

Image not shown in DVI after latexing

I include several images in EPS format in LaTeX. After running the latex command, some of the images are missing from the DVI file. I am not sure if it is related to image size: most of the missing images are around 83 kB, while those that show up are less than 40 kB. After conversion from DVI to PS, the images are all back. I just wonder what causes the images to go missing in the DVI file?
Thanks and regards!
As far as I can remember, a DVI viewer cannot show EPS files. Just use pdflatex as the front end instead of latex and view the resulting PDF file.
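A sketch of that workflow, assuming your TeX distribution ships epstopdf (it writes a .pdf next to each .eps):
for f in *.eps; do epstopdf "$f"; done
pdflatex document.tex
Then \includegraphics{figure} (no extension) will pick up figure.pdf under pdflatex.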
Checking man xdvi reveals this:
Xdvi can show PostScript specials by any of three methods. It will try first to use Display PostScript, then NeWS, then it will try to use Ghostscript to render the images. All of these options depend on additional software to work properly; moreover, some of them may not be compiled into this copy of xdvi.
So it would appear to be platform- and/or implementation-dependent.

How to convert an image (i.e. pdf) for use in a LaTeX document?

What is the preferred way to convert various images, bitmap and vector, for use in a LaTeX and PDFLaTeX document?
There are many ways to do this; some make use of standard inclusions in the various LaTeX packages, others give better results.
You can include a PDF image directly into a LaTeX document if you want to produce your final output using pdflatex, but not if you want to produce a dvi file.
pdflatex can use PDF, PNG, and JPEG
latex/dvips can use PS, EPS
See more details:
Including images in LaTeX files
Watch what you name graphics files in LaTeX
I convert bitmaps into PNG, and vector graphics (e.g. SVG) into PDF. pdflatex understands both PNG and PDF.
If you have an image "as PDF", and you don't want to include it as PDF, you may want to extract the complete image data first with pdfimages; other conversions may render the image only at reduced resolution.
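A sketch of those conversions; the tool choices are assumptions (any SVG renderer and image converter will do):
rsvg-convert -f pdf -o diagram.pdf diagram.svg
convert scan.bmp scan.png
pdfimages -png source.pdf img
The last command extracts the embedded images losslessly as img-000.png, img-001.png, and so on; pdfimages is part of poppler-utils, and the -png flag needs a reasonably recent version.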
My current preferred way is using bmeps and epstopdf, included in MiKTeX, for generating PDF and EPS versions of a PNG.
In a file called convertimage.bat,
bmeps -p3 -c -e8f -tpng %1.png > %1.eps
epstopdf %1.eps
Use it by putting the script on your PATH and running convertimage.bat filenameminusextension
Include in the documents using,
\begin{figure}[h]
\begin{center}
\includegraphics[scale=0.25]{path/to/fileminusextension}
\caption{My caption here}
\label{somelabelforreference}
\end{center}
\end{figure}
I only use Encapsulated PostScript (.eps) figures (converting bitmaps with NetPBM first), since I always use dvips + ps2pdf anyway, and then I do \includegraphics{file}.
As John D. Cook says, your available image formats depend on whether you are using latex or pdflatex.
I find ImageMagick a useful tool for converting images between formats. Handles bitmap images, plus ps/pdf/eps (with ghostscript) and a zillion others. Available through apt, macports, etc.
I use a mac so I use GraphicConverter to load images and export as PDFs.
When I draw diagrams, I use Omnigraffle which lets me export as PDFs.
On Windows I used to use Visio, which supported EPS files that I also had no problems embedding.
The basic issues are that a) you want to handle raster and vector images differently and b) this introduces potential pitfalls.
The "right" thing to do depends a bit on your final output.
If your final output is going to be a .pdf file, and you don't need pstricks or anything else like that, these days you're probably better off just using pdflatex to produce the file directly.
In this case:
store all vector figures as .pdf
store all raster figures as .png (or jpeg if they were originally jpeg)
use graphicx package and \includegraphics{filename-without-suffix}
If you don't do the above, your raster figures will be converted to JPEG and may gain compression artifacts. PNG is the best bet if you can choose the output format.
If you are headed for .dvi file you're going to want .eps for everything. (You can gzip these files as long as you generate a bounding box file).
If you're careful you can do both. I store all vector figures as (compressed) .eps because there are a few things .pdf can't do that .eps can. I store all raster figures as .png. Using make, I can have temporary copies of these canonical versions generated on the fly for .dvi or .pdf output as needed.
Someone above pointed out the filename issue: you want to avoid "." in the file names, and always omit the suffix in your LaTeX file itself.
I always include images in PNG format.
If you compile your code with pdflatex, then you can also use \includegraphics to include images in PDF format (you have to include the graphicx package).
