I am using Ghostscript to convert a source PDF file into a series of PNG images. Before converting each PDF page to a PNG image, I need to remove all text from the PDF, so that the converted page image contains every other element but no text.
Can I achieve this with Ghostscript or will I need to look into different tools?
I would also be interested in a tool that can read my source PDF and save it again with all the text removed.
Since my previous answer, development has continued, and a new option is available now, which justifies a new answer.
The most recent versions of Ghostscript support 3 new parameters, which allow you to remove either all TEXT, or all IMAGE or all VECTOR elements from a PDF.
To remove all TEXT elements from an input PDF, run
gs -o no-more-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
To remove all raster IMAGE elements from an input PDF, run
gs -o no-more-images.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
To remove all VECTOR elements from an input PDF, run
gs -o no-more-vectors.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf
Of course, you can also combine any two of the above parameters (combining all three will create empty pages).
Here are screenshots of a PDF page, where the original contained all three elements whereas the resulting pages look different.
Screenshot of original PDF page containing "image", "vector" and "text" elements.
Running the following 6 commands will create all 6 possible variations of remaining contents:
gs -o noIMG.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
gs -o noTXT.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
gs -o noVCT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf
gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT input.pdf
gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERTEXT input.pdf
The following image illustrates the results:
Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.
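The six commands above follow a regular pattern, so they can be generated instead of typed. Here is a minimal sketch in Python, assuming gs is on the PATH and the source file is called input.pdf; the function name filter_commands is made up for this illustration. It only builds the argument lists and does not run Ghostscript itself.

```python
from itertools import combinations

# The three filter switches named in the answer, keyed by the tags
# used in the output file names.
FILTERS = {"IMG": "-dFILTERIMAGE", "TXT": "-dFILTERTEXT", "VCT": "-dFILTERVECTOR"}


def filter_commands(source="input.pdf", gs="gs"):
    """Return {output_name: argv} for the six single and paired filter runs."""
    commands = {}
    # One filter: everything except that element type remains.
    for tag, flag in FILTERS.items():
        out = f"no{tag}.pdf"
        commands[out] = [gs, "-o", out, "-sDEVICE=pdfwrite", flag, source]
    # Two filters: only the third element type remains.
    for pair in combinations(FILTERS, 2):
        kept = next(tag for tag in FILTERS if tag not in pair)
        out = f"only{kept}.pdf"
        flags = [FILTERS[tag] for tag in pair]
        commands[out] = [gs, "-o", out, "-sDEVICE=pdfwrite", *flags, source]
    return commands
```

Each value can be handed to subprocess.run(argv, check=True) to actually produce the file.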
You can achieve what you want without Ghostscript, simply by using a text editor.
Convert your compressed PDF into one which has (nearly) all PDF objects' contents and streams expanded into a readable form using QPDF:
qpdf --qdf --object-streams=disable input.pdf editable.pdf
Open your new editable.pdf file with a text editor (which also gracefully handles any remaining binary blobs inside the PDF such as font or ICC resources).
Search for all occurrences of the TJ and Tj strings (the PDF operators used to show text) inside PDF object streams and replace them with JT and jT respectively (undefined, nonsense PDF operators). Save the file as edited.pdf.
Now convert your edited.pdf to your PNG images as needed.
Note that edited.pdf will still display in most PDF viewers, but the text will be missing as intended. However, it will be easy to restore the text again, by restoring the original TJ/Tj operators and thus reversing any manual modification.
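The search-and-replace step can also be scripted. The following is only a rough sketch in Python, assuming editable.pdf was produced by the qpdf command above; the function name disable_text_operators is hypothetical. It does not restrict itself to content streams, so a binary blob (an embedded font, for instance) that happens to contain the same byte pattern would be altered too. Because each replacement has the same length as the original, byte offsets and stream lengths in the file are unchanged.

```python
import re

# Matches Tj and TJ when they stand alone as a token: preceded by whitespace
# or the end of an operand (")", "]" or ">") and followed by whitespace.
_SHOW_TEXT = re.compile(rb"(?<![^\s\])>])T([Jj])(?=\s)")


def disable_text_operators(pdf_bytes):
    """Swap Tj -> jT and TJ -> JT; the byte length stays the same."""
    return _SHOW_TEXT.sub(lambda m: m.group(1) + b"T", pdf_bytes)


# Possible usage (file names taken from the steps above):
# with open("editable.pdf", "rb") as src:
#     data = src.read()
# with open("edited.pdf", "wb") as dst:
#     dst.write(disable_text_operators(data))
```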
In the "normalized" form created by the qpdf command given above, objects with streams usually look like this (where NNN is an integer number):
NNN 0 obj
<<
% Here are the key:value pairs of the object dictionary
/Key1 somevalue1
/Key2 somevalue2
% ... (more key:value pairs)
>>
stream
% Here is the content of the object stream
endstream
endobj
An "image stream" has basically the same structure. But the key:value pairs typically contain the following four entries, in any order (where NNN and MMM are integer values giving width and height of the image in pixels):
/Type /XObject
/Subtype /Image
/Width NNN
/Height MMM
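As a rough illustration of how those four entries can be recognised, here is a small Python sketch that inspects the text of one object dictionary. The function name image_size is made up, and the sketch assumes the dictionary has already been cut out of the QDF file as a string; it is not a PDF parser.

```python
import re


def image_size(dictionary_text):
    """Return (width, height) if the dictionary describes an image, else None."""
    # \b keeps /ImageMask and similar longer names from matching.
    if not re.search(r"/Subtype\s*/Image\b", dictionary_text):
        return None
    width = re.search(r"/Width\s+(\d+)", dictionary_text)
    height = re.search(r"/Height\s+(\d+)", dictionary_text)
    if not (width and height):
        return None
    return int(width.group(1)), int(height.group(1))
```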
Update/Correction
My bad! My original answer contained a repeated typo. I had used tj at places where Tj should have been used. Sorry for any confusion that may have created.
Obviously this is not a standard requirement, but it was recently discussed on the #Ghostscript forum on IRC. The channel is logged and you can find the discussion here:
http://ghostscript.com/irclogs/2014/05/21.html
We originally suggested changing the initial text rendering mode to 3 in pdf_ops.ps, but that had no effect on the file as it was using a type 3 font. So we suggested instead altering the definitions of TJ and Tj in the same file. Look at around 15:37 in the log.
Related
I have to convert PDF files (created with JasperReports) to PostScript.
I'm using Ghostscript (version 9.19) to make the conversion.
The command I'm using is:
gswin64c -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=file.ps file.pdf
The conversion completes without problems, but when I open the generated PostScript file (using GSview 5.0), the top margin is cropped by 2-3 cm and some of the information to be printed is lost.
I have changed the device from ps2write to eps2write and used the option -g<width>x<height> with the page size in pixels, but the problem persists.
The file is to be printed on preformatted paper, so I cannot use the generated PostScript for printing.
Can someone help?
Thanks
It's not possible to say with great certainty, but it sounds like the PDF MediaBox is larger than the media you have specified to GSview.
You can try using -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS along with -dFIXEDMEDIA and -dPDFFitPage. That should allow you to set up a specific media size, override the size in the PDF file and scale the result to fit the specified size.
Perhaps you could post an example PDF file; without that it's very hard to comment sensibly.
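The two size switches take values in PostScript points (1/72 inch), so the paper size has to be converted first. A minimal Python sketch of that arithmetic follows; the A4 size used in the usage note is only an assumption, since the question does not say what the preprinted paper measures, and the function names are made up. The other switches are the ones from the question and the answer above.

```python
def mm_to_points(mm):
    """Convert millimetres to PostScript points (72 per inch, 25.4 mm per inch)."""
    return round(mm / 25.4 * 72)


def fixed_media_command(pdf, ps, width_mm, height_mm, gs="gswin64c"):
    """Build a ps2write command that forces the given media size."""
    return [
        gs, "-dNOPAUSE", "-dBATCH", "-sDEVICE=ps2write",
        f"-dDEVICEWIDTHPOINTS={mm_to_points(width_mm)}",
        f"-dDEVICEHEIGHTPOINTS={mm_to_points(height_mm)}",
        "-dFIXEDMEDIA", "-dPDFFitPage",
        f"-sOutputFile={ps}", pdf,
    ]
```

For example, fixed_media_command("file.pdf", "file.ps", 210, 297) builds the command for an A4 sheet.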
I have a set of pdf files from which I would like to:
extract the 2nd page of each
merge all the 2nd pages into a single document
I know how to do each of these independently with Ghostscript (generating a bunch of temporary 1-page PDF files on the way), but is there any way to do it in one command?
What have you tried?
Provided you want the same page(s) from every file then this:
gs -sDEVICE=pdfwrite -o out.pdf \
-dFirstPage=2 -dLastPage=2 \
input1.pdf input2.pdf
should work.
Please note that my usual caveats apply; pdfwrite is not 'manipulating' the source PDF files, it is fully interpreting them to produce lists of drawing primitives, which are then reassembled to form a brand new PDF file. At no point are you 'extracting' or 'merging' PDF files, the content of the output file(s) bears no relation, other than visual appearance, to the input file(s).
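For a whole set of files, the argument list can be assembled programmatically. This is a small Python sketch, assuming gs is on the PATH; the function name same_page_command and the glob pattern in the usage comment are placeholders, not anything taken from the question.

```python
def same_page_command(inputs, output="out.pdf", page=2, gs="gs"):
    """Build one pdfwrite command that takes the same page from every input."""
    return [
        gs, "-sDEVICE=pdfwrite", "-o", output,
        f"-dFirstPage={page}", f"-dLastPage={page}",
        *inputs,
    ]


# Possible usage:
# import glob, subprocess
# subprocess.run(same_page_command(sorted(glob.glob("*.pdf"))), check=True)
```

If the usage line is run in the directory where out.pdf is written, a second run would pick up out.pdf as an input, so the output is better placed elsewhere.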
I'm converting a folder full of print-ready PDFs into 600 dpi TIFFs, using CCITT Group IV compression (bitonal) on the TIFFs (one TIFF per page). My problem is that the PDFs, which begin with a page dimension of 9x6 inches, are converted into 8.5x11 inch TIFFs (5100 x 6600 px at 600 dpi). Here is the command I'm using to convert PDFs to TIFF files (using bash in Mac OS X):
for folder in $(find * -maxdepth 0 -type d ); \
do gs -dBATCH -dNOPAUSE -q -sDEVICE=tiffg4 -r600 "-sOutputFile=$folder/tiff/%04d.tif" "$folder/pdf/$folder.pdf";
done;
Is there a way to preserve the original page dimensions in my output files?
Thanks in advance!
Ghostscript will preserve the media size of the PDF when creating the TIFF files, so if it's not what you expected then either it's a bug (you don't say which version of GS you are using, so it might be something that's been fixed) or, more likely, the PDF file has a CropBox which is different from the MediaBox. Screen viewers tend to use the CropBox; Ghostscript defaults to using the MediaBox (because it is at heart a printing application).
You can use the -dUseCropBox switch to have Ghostscript use the CropBox instead, if this is the problem. If it isn't I'd need to see a specimen PDF file. Probably the easiest way is to open a bug report at bugs.ghostscript.com where you can attach a file.
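The pixel dimensions make the diagnosis easy to check: the output size is simply the page size in inches multiplied by the resolution. A short Python sketch of that arithmetic, with a made-up function name and taking the 9 x 6 inch figure in the order the question gives it:

```python
def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size rendered at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# The size the question reports: US Letter at 600 dpi.
letter = pixel_size(8.5, 11, 600)   # (5100, 6600)
# The size expected for a 9 x 6 inch page at 600 dpi.
expected = pixel_size(9, 6, 600)    # (5400, 3600)
```

If adding -dUseCropBox to the gs command in the question yields TIFFs of the expected size, the CropBox was indeed the cause.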
I'm writing an application that faxes a document (many supported types) provided by the end user. A requirement is that the end user can also provide text to be used as part of a custom fax header.
I've been using Ghostscript to render PDFs as TIFFs and it's been working great so far, but I have yet to find a straightforward way of overlaying the custom header at the top of a PDF. I've tried out a few recommendations:
How can I make a program overlay text on a postscript file?
How can I add a footer to the bottom of each page of a postscript or pdf file in linux?
Add comments to PDF files automagically with regular expressions
Stamp PDF file with control for position of stamp file
... with no luck.
I've used ImageMagick to do this successfully with documents rendered to TIFF via other tools, and I'm aware that ImageMagick can render PDF-to-TIFF on its own. However, I want to stick with Ghostscript because in my experience it has performed better and rendered clearer TIFFs.
Is this possible using Ghostscript and perhaps a PS helper script?
Edit:
Ghostscript (v9.04) is not throwing any errors. For example:
gswin64c -dSAFER -dBATCH -dNOPAUSE -dPDFFitPage -sDEVICE=tiffg3 ^
-sOutputFile=goofy.tif ^
-c "/Courier findfont 12 scalefont setfont 50 765 moveto (header text) show" ^
-f goofy.pdf
... produces a TIFF of the original PDF, but without the text I tried to add. If I append showpage to the postscript one-liner it (predictably, I suppose) prints a new, blank-except-for-header page, which doesn't help me much.
I would use another commandline tool combined with Ghostscript for this task. This tool is pdftk.exe. Then use a 3 step approach:
The task of Ghostscript would be to create an (otherwise empty) page with the header text:
gswin64c.exe ^
-o header.pdf ^
-sDEVICE=pdfwrite ^
-c "/Courier findfont 12 scalefont setfont" ^
-c "50 765 moveto (header text) show showpage"
The task of pdftk would be to overlay (stamp or background) the PDF file with the text header over the original PDF:
pdftk.exe goofy.pdf background header.pdf output goofy-with-header.pdf
or
pdftk.exe goofy.pdf stamp header.pdf output goofy-with-header.pdf
The last step is to employ Ghostscript again in order to create your final TIFF output:
gswin64c.exe ^
-dPDFFitPage ^
-o goofy-with-header.tif ^
-sDEVICE=tiffg3 ^
goofy-with-header.pdf
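Since the header text comes from the end user, it has to be escaped before it is placed inside a PostScript string: backslashes and parentheses need a leading backslash. Below is a minimal Python sketch of the three steps above; the function names ps_string and header_pipeline are made up, the intermediate file names are placeholders, and the stamp variant is chosen arbitrarily over background. It only builds the three argument lists.

```python
def ps_string(text):
    """Wrap text as a PostScript (...) literal, escaping \\, ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def header_pipeline(pdf, header_text, tiff, gs="gswin64c.exe", pdftk="pdftk.exe"):
    """Return the argv lists for: make header page, stamp it, render TIFF."""
    header_pdf = "header.pdf"          # placeholder intermediate file
    stamped_pdf = "with-header.pdf"    # placeholder intermediate file
    return [
        [gs, "-o", header_pdf, "-sDEVICE=pdfwrite",
         "-c", "/Courier findfont 12 scalefont setfont",
         "-c", f"50 765 moveto {ps_string(header_text)} show showpage"],
        [pdftk, pdf, "stamp", header_pdf, "output", stamped_pdf],
        [gs, "-dPDFFitPage", "-o", tiff, "-sDEVICE=tiffg3", stamped_pdf],
    ]
```

Running the three lists in order with subprocess.run(argv, check=True) would reproduce the manual workflow.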
I just tried your exact same approach with your exact same result. Then I removed -dSAFER from my command-line arguments and it works like a charm.
The way I see doing it is to append what you need to the PDF file itself, before any conversion. The PDF file format is designed so that one can append extra information at the end of the file (even information that goes on previous pages).
Unfortunately, I have never worked on this, so I can't tell you exactly what you need to do. Maybe there is some PDF editing library in a programming language that would make this task easier; otherwise you will have to create the PDF bits yourself. (Traditional libraries that render PDFs from some input format won't do, as you have to work inside the structure of your existing document.) Taking a look at the PDF specification may enlighten you on this approach and let you check whether it is worth it:
http://www.adobe.com/devnet/pdf/pdf_reference_archive.html
Another approach there would be to work on the "other end" of your files: layout text on the post-rendered TIFF files using an image manipulation library. This is only possible, of course, if there is a fixed space on the pages reserved for you to add the information.
Sorry for not being able to offer a complete solution.
I'm trying to use ghostscript to convert a .ps file to a series of .png files, largely because I don't have a tolerable ps viewer.
This is the command I've used:
gs -dBATCH -dEPSCrop -dEPSFitPage -sDEVICE=png16m -r300 -dNOPAUSE -sOutputFile=neptune_111115_ob1-2_13pca_boloplots_%d.png neptune_111115_ob1-2_13pca_boloplots.ps
(the .ps file is a multi-page postscript).
The outputs are partly off the page. I'd like the images to fit inside the page.
I can include example files, but they're pretty large - is there any particular part of the .ps file that would be helpful?
My suspicion is that the .ps file is specifying the bounding box incorrectly, but hacking the BB values didn't have any effect. The .ps file is written by IDL (ittvis' Interactive Data Language). I've also tried the above command without the -dEPS* commands without luck.
-dEPSCrop and -dEPSFitPage are mutually exclusive:
One crops the EPS to the BoundingBox specified in the comments.
The other scales up the EPS from the %%BoundingBox specified in the PS file's internal comments to fit the current media.
You can't really use both at the same time.
The file can't be an EPS file anyway, because you can't have multiple pages in an EPS file. So actually neither switch will have any effect (as you've discovered).
Either the PostScript requests a media size using setpage or setpagedevice, or it just uses whatever the currently set media is. My guess is that it's just using the current media. Try setting -sPAPERSIZE=a4 or -sPAPERSIZE=letter.
If that works then the program does not request a media size. If it has no effect, then also set -dFIXEDMEDIA, which makes Ghostscript ignore subsequent requests to change the media size.
That should allow you to specify the correct media size. If you don't know what the media size should be, you can use the Ghostscript -sDEVICE=bbox device to find out.
Lastly, Ghostscript has a rudimentary display device which you can use to view the rendered output without first going to a PNG.
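The bbox device mentioned above reports one %%BoundingBox line (plus a %%HiResBoundingBox line) per page, typically on stderr, for example from gs -dBATCH -dNOPAUSE -sDEVICE=bbox file.ps. Here is a rough Python sketch for turning that report into a media size; the function names are made up, and the idea of taking the largest upper-right corner over all pages is an assumption about what "correct media size" means here.

```python
import re

_BBOX = re.compile(
    r"^%%BoundingBox:\s*(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)", re.MULTILINE
)


def parse_bounding_boxes(bbox_output):
    """Return one (llx, lly, urx, ury) tuple per %%BoundingBox line."""
    return [tuple(int(v) for v in m.groups()) for m in _BBOX.finditer(bbox_output)]


def media_size(boxes):
    """Width and height in points that cover every page's marks from the origin."""
    return max(box[2] for box in boxes), max(box[3] for box in boxes)
```

The resulting width and height could then be passed as -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS together with -dFIXEDMEDIA.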