PostScript/PCL - Get document page info: page size, bw/color - ghostscript

I need to determine document page information from a PostScript or PCL file, preferably in Java, but Ghostscript/GhostPCL would be fine as well.
Here is what I tried in order to get the following information:
Page color
This can be achieved with ghostscript/ghostpcl using the device called inkcov.
PostScript
gswin64c.exe -dNOPAUSE -dBATCH -sDEVICE=inkcov -o- input.ps
PCL6
gpcl6win64 -dNOPAUSE -dBATCH -sDEVICE=inkcov -o- input.pcl
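The inkcov device prints one line per page with four numbers for cyan, magenta, yellow and black. A minimal sketch of turning that output into a per-page color/bw decision is shown here; the line format in the regular expression, the sample text and the name `classify_pages` are assumptions for illustration, so check them against the output of your own Ghostscript build.

```python
import re

# Assumed shape of one inkcov line: four fractions followed by "CMYK OK".
INKCOV_LINE = re.compile(
    r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+CMYK\s+OK\s*$"
)

def classify_pages(inkcov_output):
    """Return 'color', 'bw' or 'blank' for each page line found."""
    pages = []
    for line in inkcov_output.splitlines():
        match = INKCOV_LINE.match(line)
        if not match:
            continue  # skip banner text, warnings, etc.
        c, m, y, k = (float(v) for v in match.groups())
        if c > 0 or m > 0 or y > 0:
            pages.append("color")
        elif k > 0:
            pages.append("bw")
        else:
            pages.append("blank")
    return pages

# Hypothetical sample, not real output from any particular file.
sample = """\
 0.00000  0.00000  0.00000  0.02230 CMYK OK
 0.10022  0.09563  0.10071  0.06259 CMYK OK
"""
print(classify_pages(sample))
```

A page whose gray tones are built from equal amounts of C, M and Y would be reported as color by this test, so a small threshold instead of `> 0` may be worth trying.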
Page size
There is a device called bbox which gives me the bounding box per page for PostScript or PCL6 documents.
PostScript
gswin64c.exe -dNOPAUSE -dBATCH -sDEVICE=bbox -o- input.ps
PCL6
gpcl6win64 -dNOPAUSE -dBATCH -sDEVICE=bbox -o- input.pcl
But in the end the bounding box is an inaccurate approximation of the page size.
I checked the following post, but the solution does not seem to work with my Ghostscript version 9.5:
Getting the page sizes of a PostScript document

The bbox device should provide accurate information. In what way is it inaccurate? I'd test it myself, but you haven't supplied a file to demonstrate this.
You need to bear in mind that it's possible some objects (e.g. images) might mark the page with white space. That still counts as marking the page for the purposes of the bbox device. If you want to count only non-white output samples, then you need to render the document (at the final resolution you intend to use) and actually count the non-white pixels. That's a potentially very slow operation because it needs to read every output colour sample of what could be a very large image.
It's not hard to code though, and you could use the inkcov device as a basis for doing both operations in the same pass.
Or you could just have GhostPDL deliver the rendered bitmap for you and code a solution to the bounding box using some other tool/language.
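As a rough illustration of the "count the non-white pixels" idea, here is a minimal sketch that works on an already decoded grayscale bitmap. The function name `nonwhite_bbox` and the list-of-rows representation are assumptions made for the example; decoding the rendered file into such rows is left to whatever image library you use.

```python
def nonwhite_bbox(rows, white=255):
    """Bounding box (left, top, right, bottom) of non-white samples, or None.

    rows is a list of equal-length lists of gray values, top row first.
    right and bottom are exclusive pixel indices.
    """
    left = top = right = bottom = None
    for y, row in enumerate(rows):
        for x, value in enumerate(row):
            if value == white:
                continue
            if left is None:
                # First marked pixel seen: start the box here.
                left, right, top, bottom = x, x + 1, y, y + 1
            else:
                left = min(left, x)
                right = max(right, x + 1)
                bottom = y + 1  # rows are visited top to bottom
    if left is None:
        return None
    return (left, top, right, bottom)

page = [
    [255, 255, 255, 255],
    [255,   0, 128, 255],
    [255, 255,  64, 255],
    [255, 255, 255, 255],
]
print(nonwhite_bbox(page))
```

The result is in pixel coordinates with the origin at the top left, so it has to be flipped vertically and scaled by 72/dpi before it can be compared with a PostScript bounding box.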
Ah, are you actually looking for the requested media size, rather than the bounding box? That's not the same thing at all. The bounding box is the smallest rectangle which encloses all the marks on the output; it doesn't tell you how big the requested media was. So a small rectangle in the bottom left would give you a tiny BBox, even if the media was large.
You can reasonably easily get the media size requests from PostScript by writing a small PostScript program, but you can't do that with PCL. Perhaps the easiest solution in both cases is to render the content to a file at 72 dpi, then read the width/height of the rendered output, which gives you the media size in points.
Or use the pdfwrite device to convert the input into PDF and then the pdf_info.ps PostScript program can be used to give you the sizes of the pages from the PDF file.
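For the render-at-72-dpi route, a minimal sketch of reading the pixel dimensions and converting them to points might look like this. It assumes the page was rendered to PNG (for example with a PNG device and -r72); the helper names `png_size` and `pixels_to_points` are made up for the example, and the header below is built by hand rather than taken from a real render.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_size(data):
    """Return (width, height) in pixels from the IHDR chunk of PNG bytes."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])

def pixels_to_points(pixels, dpi):
    """Convert a pixel count at the given resolution to points (1/72 inch)."""
    return pixels * 72.0 / dpi

# Hand-built header standing in for the first bytes of a page rendered at 72 dpi.
header = (PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR"
          + struct.pack(">II", 612, 792))
width, height = png_size(header)
print(pixels_to_points(width, 72), pixels_to_points(height, 72))
```

Only the first 24 bytes of each rendered file are needed. Since the render has a whole number of pixels, the result is only accurate to about one point at 72 dpi.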

Indeed I am looking for the requested media size, rather than the Bounding Box.
Maybe I should have been more specific.
Here is some ascii art to brighten up your day.
y
^
|
|
+-----------+
| +----+    |
| |bbox|    |
| +----+    |
|           |
|           |
|           |
|           |
|           |
+-----------+----> x
A simple document with some text in the upper left corner.
KenS: "The bounding box returns the smallest rectangle which encloses all the marks on the output, it doesn't tell you how big the requested media was."
So for the time being the "easiest" solution was really to transform the ps/pcl file into a pdf and read the media size from there.
Conversion to PDF
PostScript
gswin64c.exe -dBATCH -dNOPAUSE -dNOOUTERSAVE -sDEVICE=pdfwrite -sOutputFile=output.pdf input.ps
PCL6
gpcl6win64 -dBATCH -dNOPAUSE -dNOOUTERSAVE -sDEVICE=pdfwrite -sOutputFile=output.pdf input.pcl
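Once the PDF exists, the media size can be read from the /MediaBox entries. The following is only a rough sketch: it assumes the page dictionaries are stored uncompressed and that each MediaBox is written as four plain numbers, which is not guaranteed for every PDF. The name `media_sizes` and the sample bytes are made up for the example, and a proper PDF library is the more robust choice.

```python
import re

# Matches e.g. "/MediaBox[0.0 0.0 495.12 756.0]" or "/MediaBox [0 0 612 792]".
MEDIABOX = re.compile(
    rb"/MediaBox\s*\[\s*([-+\d.]+)\s+([-+\d.]+)\s+([-+\d.]+)\s+([-+\d.]+)\s*\]"
)

def media_sizes(pdf_bytes):
    """Return (width, height) in points for every MediaBox found, in file order."""
    sizes = []
    for match in MEDIABOX.finditer(pdf_bytes):
        llx, lly, urx, ury = (float(v) for v in match.groups())
        sizes.append((urx - llx, ury - lly))
    return sizes

# Hypothetical fragment of a page object, not taken from a real file.
sample = b"<</Type/Page/MediaBox [0 0 612 792]\n/Parent 3 0 R>>"
print(media_sizes(sample))
```

With a real file the bytes would come from `open("output.pdf", "rb").read()`. File order is not necessarily page order, and a MediaBox inherited from a parent node appears only once.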

Related

paper size not proper in output pdf ghostscript

I am trying to resize a PDF from the Ghostscript command line, but the paper size of the output PDF does not match what I requested. I am using this command:
gswin64c.exe -o E:\output.pdf -dBATCH -dNOPAUSE -dDOPDFMARKS -sDEVICE=pdfwrite -dFIXEDMEDIA -dPDFFitPage -dDEVICEWIDTHPOINTS=396 -dDEVICEHEIGHTPOINTS=612 -f E:\comic.pdf
The output PDF size is 396 x 604.653 pts.
Can you help me with this issue?
The answer is simple: you are trying to scale the PDF by different amounts horizontally and vertically, and the PDFFitPage switch doesn't do that.
In fact there is no canned option for doing that in Ghostscript at all; you would need to write a PostScript program to do so.
If we look at your original file, the page has a MediaBox of /MediaBox[0.0 0.0 495.12 756.0], so that's (as you say) 495.12 x 756 points. You insist the output be 396 x 612.
So the x scale factor is 396/495.12 = 0.7998 and the y scale factor is 612/756 = 0.8095. To scale uniformly (the same factor in both directions) and still fit, we need to use the smaller factor, 0.7998. 756 * 0.7998 = 604.6488. The slight difference from the 604.653 you got comes from rounding the scale factor to four digits; the unrounded factor gives 604.653.
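The arithmetic above can be checked with a few lines of Python. This is only a sketch of a fit-to-page calculation, and `fit_scale` is a name invented for the example, not something Ghostscript exposes.

```python
def fit_scale(src_w, src_h, dst_w, dst_h):
    """Uniform scale factor that fits the source page inside the target."""
    return min(dst_w / src_w, dst_h / src_h)

# MediaBox of the original page and the requested media, both in points.
scale = fit_scale(495.12, 756.0, 396.0, 612.0)
print(round(scale, 4), round(495.12 * scale, 3), round(756.0 * scale, 3))
```

The width fills the requested 396 points and the height comes out a little short of 612, which is the behaviour reported in the question.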

Achieving same PDF compression as imagemagick's convert using ghostscript

Is there a way to achieve the same compression as the following (great compression ratio and quality, but it's slow and can break PDFs):
pdfimages -tiff $1 pdf_images
convert pdf_images-* -alpha off -monochrome -compress Group4 -density 250 ${1%.pdf}.compressed.pdf
rm pdf_images-*
but using only Ghostscript instead?
Tried playing around with dPDFSETTINGS, dGrayImageDownsampleType, sColorConversionStrategy but the result was usually lower quality or bigger in size.
PDF consists of scanned pages (one image per page)
I usually use something like the following with GS (there's still something missing because images aren't converted...is this by design?):
gs \
-q \
-dNOPAUSE \
-dBATCH \
-dSAFER \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/screen \
-dEmbedAllFonts=false \
-dSubsetFonts=false \
-dGrayImageDownsampleType=/Bicubic \
-dGrayImageResolution=250 \
-dMonoImageDownsampleType=/Bicubic \
-dMonoImageResolution=250 \
-sProcessColorModel=DeviceGray \
-dProcessColorModel=/DeviceGray \
-sColorConversionStrategy=/Mono \
-dOverrideICC \
-sOutputFile=output.pdf \
input.pdf
Random PDF Sample from Google: https://www.2ndcollege.com/colleges/gcet/btech/sem5/ic/socio/notes/unit1.pdf
Original: 5.6MB
GS: 1.4MB (not mono)
PDFImages + ImageMagick: 1.4MB (only images are converted)
Adding as an answer because it's too long for a comment.
The artefacts you are referring to are, I think, caused by JPEG quantisation. The original image has been decompressed, downsampled to a lower resolution, and then recompressed. Since you haven't selected any other compression method, the default for the /screen PDFSETTINGS is used, which is JPEG for Gray and colour images and CCITT Fax for mono images.
You could easily avoid that by using a different compression filter, though of course that would not produce as much compression of the output.
There are several suggestions I can make; firstly don't use PDFSETTINGS unless you are completely sure you want all the things it is doing. In general I would expect better results by leaving most settings alone and simply applying the switches you need.
Given that these are scanned pages, there is no point in setting any of the Font related parameters (unless invisible fonts have been added).
You've set ProcessColorModel twice, once as a name and once as a string. In fact, if you use ColorConversionStrategy, you shouldn't set it at all, and if you aren't using ColorConversionStrategy then it won't have any effect, so you can just drop these two entirely.
There is no ColorConversionStrategy of /Mono, and trying to set that causes errors for me. There also appears to have been a bug introduced with ColorConversionStrategy in the current release: if you set Gray you will actually get RGB, and in order to get Gray you actually need to request CMYK. Obviously that's been fixed, but in the meantime all the spaces are 'off by one': sRGB->CMYK, CMYK->Gray and Gray->RGB. LeaveColorUnchanged is unaffected.
Of course this means that your setting of the Gray and Mono Image parameters is having no effect (at least not on the colour images anyway). This is why you get a low output size, and also why the result is heavily downsampled and quantised.
Now, as I've already said, you can't get Ghostscript's pdfwrite to produce monochrome output, only grayscale. Reducing the image data by a factor of between 8 and 24 is where most of the gains are coming from, I believe. So frankly there's no way you are going to get down to the same output size using pdfwrite without heavily downsampling the images. And if you do that, then the quality is going to suffer.
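The "factor of between 8 and 24" presumably reflects bits per pixel before any compression: 1 bit for monochrome against 8 for grayscale and 24 for RGB. A small sketch of that arithmetic follows; the US Letter page at 250 dpi is just an illustrative assumption, and `raw_image_bytes` is a made-up helper.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a full-page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel / 8

# One scanned page, 8.5 x 11 inches at 250 dpi (illustrative values).
mono = raw_image_bytes(8.5, 11, 250, 1)
gray = raw_image_bytes(8.5, 11, 250, 8)
rgb = raw_image_bytes(8.5, 11, 250, 24)
print(gray / mono, rgb / mono)
```

Compression changes the absolute numbers, but the starting gap between 1-bit and 8-bit or 24-bit data is what the grayscale output has to make up through downsampling.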
This command line:
\ghostpdl\debugbin\gswin32c -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=out.pdf -sColorConversionStrategy=CMYK -dPDFSETTINGS=/screen -dGrayImageDownsampleType=/Bicubic -dGrayImageFilter=/FlateEncode -dAutoFilterGrayImages=false unit1.pdf
produces a gray output file 2.1 MB in size, but the extreme downsampling has resulted in very blurry output, I don't think you will like it at all. You could change the amount of downsampling, but that of course will result in a larger output file. You could leave the compression filter unchanged (DCTEncode == JPEG), but that will get you compression artefacts.
Basically, as I said right at the beginning, if you want to manipulate image data, the best way to do it is with a tool designed to manipulate images, not one designed to render PostScript/PDF files.
You could, with some effort, render the original pages to a bitmap format with Ghostscript, using a stochastic screening method as IM appears to have used, then read the images back into Ghostscript to produce a PDF file, but that hardly seems like it's easier than using IM as you are now.

how to convert PS image to PNG and fit to page or to multiple pages

I am analyzing a model (compiled with -pg option so it would generate "gmon.out") and then generated a PS file (using gprof2dot.py and piping that to "dot") that charts how much time is spent in each subroutine. When I use ghostscript to convert to a PDF it is cutting off some of the right hand side of the figure. So I tested outputting to multiple pages, but the first page still has the right side cut off and the second page is blank.
These are the 2 commands I have tried:
gs -dBATCH -dNOPAUSE -dPDFFitPage -sOutputFile=myfile.pdf -sDEVICE=pdfwrite output.ps
gs -dBATCH -dNOPAUSE -dPDFFitPage -sOutputFile=out%d.pdf -sDEVICE=pdfwrite output.ps
Please let me know if you have any suggestions. Thanks!
Your title says you want a PNG (and you render a PostScript program to PNG, not "convert a PostScript image", PostScript is a programming language not an image format) but your description says you are creating PDF files. So which is it, PNG or PDF ?
Using PDFFitPage scales the requested media size to fit an actual (already set up, fixed) media size, since you haven't set a fixed media size on the command line, no scaling will be performed.
So, what media size are you getting, and why is that not correct ? I would 'guess' that your PostScript program does not request a media size, if it did then pdfwrite would create a PDF with the same media size.
In the absence of a media request, the pdfwrite device uses the default. Depending on your system and configuration that will most likely be either A4 or US Letter. It then reproduces the PostScript drawing program by either rendering to a bitmap, or as a PDF page description, using that media size. If the original PostScript required a different media size, then bits will be clipped.
Since you have not supplied an example, it's not possible for me to do more than guess, of course.
However, you should probably try setting -sPAPERSIZE to something like A3. Or set a specific media size using -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS.
If you supplied an example I could probably be more specific.
It would also be a decent idea to mention what version of Ghostscript you are using too.
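To try several media sizes quickly, the command line can be assembled in a script. This is a minimal sketch that only uses switches already mentioned in this thread; `pdfwrite_args` is a hypothetical helper and the file names are placeholders.

```python
def pdfwrite_args(input_ps, output_pdf, paper=None, size_points=None):
    """Build a Ghostscript argument list with an explicit media size.

    paper is a name for -sPAPERSIZE (e.g. "a3"); size_points is a
    (width, height) pair in points and takes precedence if given.
    """
    args = ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite"]
    if size_points is not None:
        width, height = size_points
        args += [f"-dDEVICEWIDTHPOINTS={width}",
                 f"-dDEVICEHEIGHTPOINTS={height}"]
    elif paper is not None:
        args.append(f"-sPAPERSIZE={paper}")
    args += [f"-sOutputFile={output_pdf}", input_ps]
    return args

print(" ".join(pdfwrite_args("output.ps", "myfile.pdf", paper="a3")))
print(" ".join(pdfwrite_args("output.ps", "myfile.pdf", size_points=(1200, 900))))
```

The list can be handed to `subprocess.run(args, check=True)`. The 1200 x 900 size is arbitrary; a sensible value depends on how large the dot-generated graph actually is.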

Scale scanned PDF without resampling image layer

I have a batch of PDFs (about 80,000 files) that consist of scanned pages. The pixel size of the image layer is consistent with 300dpi, but seems to be set to 72dpi. As a result the page size is showing something like 46x35 inches. I need to adjust these files so they register as 8.5 x 11, or whatever their natural size is, and I need to be able to script the process so I can leave this to churn on 80,000 documents (2-5 pages per document.)
I'd like to avoid resampling the image layer since that would potentially add loss, and slow the process down significantly. I've tried:
convert -density 300x300 input.pdf output.pdf
But it resamples the images. I've tried different variants on ghostscript such as
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dDownsampleMonoImages=false \
-dMonoImageResolution=300 \
input.pdf
That generates a file, but it seems unchanged and still registers as oversized. Also, the pages are different sizes and orientations, so forcing them all to one size/orientation won't work.
(FYI, really I wouldn't care, but the next step is to have Acrobat Pro OCR all these files, and its OCR chokes on anything over 45 inches.)
PDF is a resolution independent format, so the resolution of the images and so on is pretty irrelevant. The 'natural size' of the pages is whatever Acrobat says it is, this is gathered from the MediaBox (or CropBox) information which is in the file.
It sounds to me like the original conversion to PDF is at fault, and the files genuinely are the (media) size they claim to be now.
I suspect that you can probably get the result you need; you 'simply' need to resize the document. The problem is that this isn't trivial where the media sizes differ (which you say they do).
However before going further I suggest you take a file which you want to be 8.5x11 and try this:
gs -dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite \
-dDEVICEWIDTHPOINTS=612 -dDEVICEHEIGHTPOINTS=792 \
-dFIXEDMEDIA \
-dPDFFitPage \
-sOutputFile=output.pdf \
input.pdf
This will fix the media being used at 8.5x11 and tell Ghostscript to resize the document to fit the page (by calculating and applying a scale factor). It should not affect the image data except for compression, if there are colour images we might need to worry about JPEG artefacts but that can be dealt with separately.
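Because the pages differ in size and orientation, a single fixed 612 x 792 target will not suit every file. One possible approach, sketched here under the assumption that every scan really is 300 dpi data tagged as 72 dpi, is to derive each page's target size from its current media size and pass the result as -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS. The function `rescaled_media` is made up for the example.

```python
def rescaled_media(width_pt, height_pt, assumed_dpi=72, true_dpi=300):
    """Target media size in whole points for a page with a mis-tagged resolution."""
    factor = assumed_dpi / true_dpi
    return (round(width_pt * factor), round(height_pt * factor))

# The 46 x 35 inch page from the question, expressed in points.
print(rescaled_media(46 * 72, 35 * 72))
```

For that page the result is 795 x 605 points, roughly 11 x 8.4 inches, so the original was probably a landscape letter-sized sheet.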
cpdf -scale-to-fit usletterportrait in.pdf -o out.pdf
Doesn't touch the page content other than to wrap it in a transformation matrix to do the scaling, and scales the media/crop/art/bleed/trim boxes too.
(Commercial, I'm afraid:
http://www.coherentpdf.com/
Disclaimer: I wrote it.)
Modify your original gs command like this:
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dPDFFitPage \
-g6120x7920 \
input.pdf
Then check 2 things:
Page dimensions are displaying as 'letter' (or 612x792 pts, or 8.5x11 in) now.
File size is only marginally different from original one (indicating that no resampling of page image has happened).
If the input is scanned documents in grayscale only (as it seems to be), there is no need for setting -dDownSample*Images or for setting -d*ImageResolution.
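The -g switch takes device pixels rather than points, so the value 6120x7920 only means US Letter at pdfwrite's default resolution of 720 dpi. A small sketch of that conversion, with `geometry_flag` as a made-up helper name:

```python
def geometry_flag(width_pt, height_pt, dpi=720):
    """Build a -g value (device pixels) for a page size given in points."""
    width_px = round(width_pt * dpi / 72)
    height_px = round(height_pt * dpi / 72)
    return f"-g{width_px}x{height_px}"

print(geometry_flag(612, 792))
```

If the resolution is changed with -r, the pixel counts have to change with it, which is one reason the -dDEVICEWIDTHPOINTS/-dDEVICEHEIGHTPOINTS form can be easier to reason about.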

Fit to page size in ghostscript (with a possibly corrupt input)

I'm trying to use ghostscript to convert a .ps file to a series of .png files, largely because I don't have a tolerable ps viewer.
This is the command I've used:
gs -dBATCH -dEPSCrop -dEPSFitPage -sDEVICE=png16m -r300 -dNOPAUSE -sOutputFile=neptune_111115_ob1-2_13pca_boloplots_%d.png neptune_111115_ob1-2_13pca_boloplots.ps
(the .ps file is a multi-page postscript).
The outputs are partly off the page. I'd like the images to fit inside the page.
I can include example files, but they're pretty large - is there any particular part of the .ps file that would be helpful?
My suspicion is that the .ps file is specifying the bounding box incorrectly, but hacking the BB values didn't have any effect. The .ps file is written by IDL (ittvis' Interactive Data Language). I've also tried the above command without the -dEPS* commands without luck.
-dEPSCrop and -dEPSFitPage are mutually exclusive:
One crops the EPS to the BoundingBox specified in the comments.
The other scales up the EPS from the %%BoundingBox specified in the PS file's internal comments to fit the current media.
You can't really use both at the same time.
The file can't be an EPS file anyway, because you can't have multiple pages in an EPS file. So actually neither switch will have any effect (as you've discovered).
Either the PostScript requests a media size using setpage or setpagedevice, or it just uses whatever the currently set media is. My guess is that it's just using the current media. Try setting -sPAPERSIZE=a4 and -sPAPERSIZE=letter.
If that works then the program does not request a media size. If it has no effect, then set -dFIXEDMEDIA in addition which will ignore subsequent requests to change the media size.
That should allow you to specify the correct media size, if you don't know what the media size should be then you can use the Ghostscript -sDEVICE=bbox device to find out.
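The bbox device reports each page as %%BoundingBox and %%HiResBoundingBox lines. A minimal sketch of collecting them and finding one box that encloses every page is shown here; the sample text is invented, and `parse_bboxes` and `enclosing_box` are names made up for the example.

```python
import re

BBOX_LINE = re.compile(
    r"^%%BoundingBox:\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)"
)

def parse_bboxes(text):
    """Return one (llx, lly, urx, ury) tuple per %%BoundingBox line."""
    boxes = []
    for line in text.splitlines():
        match = BBOX_LINE.match(line)
        if match:
            boxes.append(tuple(int(v) for v in match.groups()))
    return boxes

def enclosing_box(boxes):
    """Smallest box that contains every page's bounding box."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

# Hypothetical two-page output, not from a real file.
sample = """\
%%BoundingBox: 36 40 580 760
%%HiResBoundingBox: 36.18 40.50 579.90 759.60
%%BoundingBox: 30 52 700 748
%%HiResBoundingBox: 30.00 52.20 699.84 747.90
"""
boxes = parse_bboxes(sample)
print(boxes, enclosing_box(boxes))
```

The box describes the marks rather than the media, so treat its upper right corner as a lower bound when picking a paper size.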
Lastly, Ghostscript has a rudimentary display device which you can use to view the rendered output without first going to a PNG.
