Convert a searchable PDF to searchable PDF/A using Ghostscript - ghostscript

I am using Ghostscript to convert PDF to PDF/A by command line:
gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile="output.pdf" input.pdf
But the output file has lost its searchable text.
How can I obtain searchable PDF/A files as output?
Thanks.

You haven't supplied an input file to look at, nor mentioned which version of Ghostscript you are using.
Let me start with my standard lecture on this subject; when you take a PDF file as input, and use Ghostscript's pdfwrite device to produce a new PDF file, you are NOT 'converting', 'editing' or 'modifying' the input file.
What happens is that the PDF interpreter interprets the PDF file, and produces a series of graphics primitives, which it feeds to the graphics library. This then processes these primitives, and passes them to the device. The device then emits them to the output file. In the case of a rendering device (e.g. TIFF) it renders the operations to a bitmap and, when it reaches the end of the file, it writes the bitmap as a file. In the case of pdfwrite, it re-assembles these primitives into a brand new PDF file.
So the output PDF file has nothing in common with the input PDF file, except its appearance.
There are disadvantages to this approach (it does limit us in preserving some non-printing aspects of the input file), but there are also advantages; for instance it permits us to alter colour spaces, flatten transparency, change font encodings etc.
In addition to this you have chosen to create a PDF/A file. PDF/A limits the available features of the PDF specification, and it may be (it's impossible to tell without seeing the original file) that it simply isn't possible to represent the original PDF file as a PDF/A file without altering some aspects of it.
Again, without seeing the original file I can't tell, but it may be that you simply cannot achieve what you want, or at least not using Ghostscript.
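
Since the answer depends on the Ghostscript version and on what actually ends up in the output file, a small diagnostic sketch may help narrow things down. It assumes a Unix-like shell with gs on the PATH; check_text is a hypothetical helper name, not part of Ghostscript. It uses the txtwrite device to dump whatever text Ghostscript can extract from a PDF.

```shell
# Hypothetical helper: print the Ghostscript version, then write the text
# that Ghostscript can extract from the given PDF to standard output.
check_text() {
  gs --version
  gs -q -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=- "$1"
}
```

Running check_text input.pdf and then check_text output.pdf gives a rough comparison: if the first prints text and the second prints none, the text was not carried through as text in the new file.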

Related

The right part is lost when using ghostscript to convert .prn file to pdf

I am using ghostpcl-9.20-win32.
I have tried this:
gpcl6win32 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf input.prn
The right part of the input file is lost in the output.
input file:
https://drive.google.com/file/d/0B29492qqMUX7Zk9nUmhDYXpJRVk
output file:
https://drive.google.com/open?id=0B29492qqMUX7T2RxVDZJaE9seEE
Your PCL file (actually it appears to be simple text, not even PCL) doesn't contain a media request.
In the absence of a media size, GhostPCL (NB NOT Ghostscript, GhostPCL) uses its default media size. Depending on a number of factors that will be either A4 or US Letter, portrait.
If you want different media, then you need to tell GhostPCL what you want. You need to use -sPAPERSIZE, or -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS, or any of the other media selection switches.
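
As a rough sketch of the -sPAPERSIZE route, assuming a Unix-like shell and a GhostPCL executable named gpcl6 (the question uses gpcl6win32 on Windows): pcl_to_pdf is a hypothetical wrapper, and the paper name passed to it is only a placeholder, since the right size depends on how wide the lines in the input really are.

```shell
# Hypothetical wrapper: convert a PCL or plain-text print file to PDF
# on an explicitly named paper size instead of GhostPCL's default.
#   $1 = input file, $2 = output PDF, $3 = paper name (e.g. a3)
pcl_to_pdf() {
  gpcl6 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    -sPAPERSIZE="$3" \
    -sOutputFile="$2" "$1"
}
```

For example, pcl_to_pdf input.prn output.pdf a3 would try a larger page; if the right edge is still cut off, a wider size is needed.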

GhostScript PDF to PostScript

I have to convert pdf files (created with jasperreports) to postscript.
I'm using Ghostscript (version 9.19) to make the conversion.
The command I'm using is:
gswin64c -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=file.ps file.pdf
The conversion completes without problems, but when I open the generated PostScript file (using GSview 5.0), the top margin is cropped by 2-3 cm, and some of the information to print is lost.
I have changed the device from ps2write to eps2write, and used the option -g<width>x<height> with the page size in pixels, but the problem persists.
The file is to be printed on preformatted paper, so I cannot use the generated PostScript to print.
Can someone help?
Thanks
It's not possible to say with great certainty, but it sounds like the PDF MediaBox is larger than the media you have specified to GSview.
You can try using -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS along with -dFIXEDMEDIA and -dPDFFitPage; that should allow you to set up a specific media size, override the size in the PDF file, and scale the result to fit the specified size.
Perhaps you could post an example PDF file; without that it's very hard to comment sensibly.
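
Put together, those switches might look like the sketch below. It assumes a Unix-like shell with gs on the PATH (on Windows the executable would be gswin64c, as in the question) and assumes A4 media, 595 x 842 points, purely for illustration; pdf_to_ps_fit is a hypothetical helper name.

```shell
# Hypothetical helper: convert a PDF to PostScript on a fixed A4 page
# (595 x 842 points), ignoring the page size in the PDF and scaling
# each page to fit the fixed media.
#   $1 = input PDF, $2 = output PostScript file
pdf_to_ps_fit() {
  gs -dNOPAUSE -dBATCH -sDEVICE=ps2write \
    -dDEVICEWIDTHPOINTS=595 -dDEVICEHEIGHTPOINTS=842 \
    -dFIXEDMEDIA -dPDFFitPage \
    -sOutputFile="$2" "$1"
}
```

Called as pdf_to_ps_fit file.pdf file.ps. The two point values should be replaced with the real size of the preformatted paper (1 point = 1/72 inch).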

Extract 2nd page of each document and merge into a single document with Ghostscript

I have a set of pdf files from which I would like to:
extract the 2nd page of each
merge all the 2nd pages into a single document
I know how to do each of these independently with Ghostscript (generating a bunch of temporary 1-page PDF files on the way), but is there any way to do it in one command?
What have you tried ?
Provided you want the same page(s) from every file then this:
gs -sDEVICE=pdfwrite -o out.pdf \
-dFirstPage=2 -dLastPage=2 \
input1.pdf input2.pdf
should work.
Please note that my usual caveats apply; pdfwrite is not 'manipulating' the source PDF files, it is fully interpreting them to produce lists of drawing primitives, which are then reassembled to form a brand new PDF file. At no point are you 'extracting' or 'merging' PDF files, the content of the output file(s) bears no relation, other than visual appearance, to the input file(s).
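
For a larger set of files, the same command can be wrapped so the input list comes from a shell glob. This is only a sketch, assuming a Unix-like shell; second_pages is a hypothetical helper name.

```shell
# Hypothetical helper: write page 2 of every input PDF into one new PDF.
#   $1 = output file, remaining arguments = input PDFs
second_pages() {
  out=$1
  shift
  gs -sDEVICE=pdfwrite -o "$out" -dFirstPage=2 -dLastPage=2 "$@"
}
```

For example, second_pages out.pdf reports/*.pdf. Keep the output file outside the directory matched by the glob, so it is not picked up as an input on a later run.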

Ghostscript Stamp Image on PDF

Is there any way to stamp or overlay a TIFF image on an existing PDF file and output the result using Ghostscript?
I have two PDFs which I want to merge into a single PDF, with one on top of the other, using Ghostscript. I want to know if this can be done and how, or if it may work with one PDF as a TIFF image on top of the base PDF.
Can ghostscript make this stamp using layers in the PDF?
Thank you for your answers
The pdfwrite device in Ghostscript doesn't really support layers, so you can't use that. Also, it's unclear why you think layers would help.
TIFF isn't part of PostScript (or PDF), so you can't directly read a TIFF file into GS. I have elsewhere posted a PostScript program which reads TIFF files and renders them for output. You could use that to read a TIFF file.
However, you would have to mess about with either the PDF interpreter or a custom EndPage procedure in order to read and render the TIFF file. And unless you take specific kinds of action, it will be opaque, which may well not be what you want.
The Ghostscript PDF interpreter doesn't really lend itself to this kind of manipulation; have you considered using pdftk instead?
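
For the two-PDF case, a pdftk invocation might look like the sketch below. It assumes pdftk is installed and on the PATH; stamp_pdf is a hypothetical helper name. pdftk's stamp operation places the first page of the stamp PDF on top of every page of the base PDF.

```shell
# Hypothetical helper: overlay a stamp PDF on every page of a base PDF.
#   $1 = base PDF, $2 = stamp PDF, $3 = output PDF
stamp_pdf() {
  pdftk "$1" stamp "$2" output "$3"
}
```

Called as stamp_pdf base.pdf overlay.pdf result.pdf. pdftk also offers multistamp (page N of the stamp goes on page N of the base) and background (the second PDF is placed underneath rather than on top). A TIFF would first have to be converted to a PDF with some other tool.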

Fit to page size in ghostscript (with a possibly corrupt input)

I'm trying to use ghostscript to convert a .ps file to a series of .png files, largely because I don't have a tolerable ps viewer.
This is the command I've used:
gs -dBATCH -dEPSCrop -dEPSFitPage -sDEVICE=png16m -r300 -dNOPAUSE -sOutputFile=neptune_111115_ob1-2_13pca_boloplots_%d.png neptune_111115_ob1-2_13pca_boloplots.ps
(the .ps file is a multi-page postscript).
The outputs are partly off the page. I'd like the images to fit inside the page.
I can include example files, but they're pretty large - is there any particular part of the .ps file that would be helpful?
My suspicion is that the .ps file is specifying the bounding box incorrectly, but hacking the BB values didn't have any effect. The .ps file is written by IDL (ittvis' Interactive Data Language). I've also tried the above command without the -dEPS* commands without luck.
-dEPSCrop and -dEPSFitPage are mutually exclusive:
One crops the EPS to the BoundingBox specified in the comments.
The other scales up the EPS from the %%BoundingBox specified in the PS file's internal comments to fit the current media.
You can't really use both at the same time.
The file can't be an EPS file anyway, because you can't have multiple pages in an EPS file. So actually neither switch will have any effect (as you've discovered).
Either the PostScript requests a media size using setpage or setpagedevice, or it just uses whatever the currently set media is. My guess is that it's just using the current media. Try setting -sPAPERSIZE=a4 and -sPAPERSIZE=letter.
If that works then the program does not request a media size. If it has no effect, then set -dFIXEDMEDIA in addition which will ignore subsequent requests to change the media size.
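
A sketch of that second step, with both switches applied to the PNG rendering from the question. It assumes a Unix-like shell with gs on the PATH; render_pages and the page_%d.png output name are hypothetical choices for illustration.

```shell
# Hypothetical helper: render each page of a PostScript file to a PNG
# on a fixed paper size, ignoring media requests made by the file.
#   $1 = input PostScript file, $2 = paper name (e.g. a4 or letter)
render_pages() {
  gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 \
    -sPAPERSIZE="$2" -dFIXEDMEDIA \
    -sOutputFile=page_%d.png "$1"
}
```

For example, render_pages neptune_111115_ob1-2_13pca_boloplots.ps a4. For the first attempt described above, the same command without -dFIXEDMEDIA shows whether the file requests its own media size.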
That should allow you to specify the correct media size. If you don't know what the media size should be, you can use the Ghostscript bbox device (-sDEVICE=bbox) to find out.
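
A minimal sketch of the bbox check, assuming a Unix-like shell with gs on the PATH; page_bbox is a hypothetical helper name. The bbox device reports a %%BoundingBox line for each page on standard error, so the helper redirects that stream to standard output.

```shell
# Hypothetical helper: print the bounding box Ghostscript measures for
# each page. The bbox device writes its report to standard error, so
# 2>&1 makes it available on standard output for piping or capture.
page_bbox() {
  gs -q -dBATCH -dNOPAUSE -sDEVICE=bbox "$1" 2>&1
}
```

Called as page_bbox file.ps. The reported values are in points, which gives a starting figure for -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS when none of the named paper sizes fits.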
Lastly, Ghostscript has a rudimentary display device which you can use to view the rendered output without first going to a PNG.

Resources