How to add TrimBox and BleedBox without fonts being rasterized - pdf-generation

I want Ghostscript to prepare PDFs for print production. The input PDF (Version 1.3) is in RGB, uses transparency and has crop marks.
Convert colors to CMYK by applying an ICC profile
Add a TrimBox and a BleedBox
I managed to achieve the steps above using the following command:
gs -o output.pdf -sDEVICE=pdfwrite -dPDFX -r100 -dOverrideICC=true -sOutputICCProfile=ISOcoated_v2_300_eci.icc -sColorConversionStrategy=CMYK -dProcessColorModel=/DeviceCMYK -dRenderIntent=3 input.pdf -c "[ /PAGES pdfmark << /PDFXSetBleedBoxToMediaBox false /PDFXTrimBoxToMediaBoxOffset [29 29 29 29] /PDFXBleedBoxToTrimBoxOffset [8 8 8 8] >> setdistillerparams" -f
But unfortunately the fonts are getting rasterized.
I found out that -dPDFX causes that.
But it seems like -dPDFX is needed to add the TrimBox and BleedBox. Without -dPDFX the fonts remain unchanged but the boxes won't be added.
I'm on OSX, the PDF contains two Type 3 fonts.
Any help is very appreciated.

OK, a few points:
Ghostscript (and more particularly the pdfwrite device) doesn't add anything to PDF files. It makes brand new PDF files from the supplied input, which may or may not be a PDF file.
The process is described here and I'd suggest you read it. Essentially you cannot assume that the content of the produced PDF file bears any relationship at all to the content of the input, if it's a PDF file.
Your use of ICC profiles is not what causes the conversion to CMYK; that is done by setting ColorConversionStrategy. Your OverrideICC and OutputICCProfile settings aren't doing anything, and you should remove those switches. In addition, you should not set ProcessColorModel if you are setting ColorConversionStrategy. Similarly, setting RenderIntent does nothing at all with the pdfwrite device, so drop that too.
Don't set the resolution. All that does is set the resolution of any content which must be rendered (e.g. when creating a PDF file below version 1.4 from an input file containing transparency). So drop the -r100.
-dPDFX doesn't cause fonts to be rasterised.
If that's happening then it's almost certainly nothing to do with selecting PDFX. Without seeing your input file I can't comment further.
-dPDFX is not necessary to create a PDF file with TrimBox or BleedBox.
Of course, what your PostScript actually does is create PDFX Bleed and Trim Offsets, and yes, if you want those then you need to set PDFX, clearly. On the other hand, if you actually want to set normal regular Bleed or Trim boxes then that is also documented in the pdfmark reference (see page 37 of the 1.7 pdfmark reference):
The syntax for specifying a non-default page cropping for a particular page in a document is as follows:
[ /CropBox [xll yll xur yur] /PAGE pdfmark
The syntax for specifying the default page cropping for a document is as follows:
[ /CropBox [xll yll xur yur] /PAGES pdfmark
Obviously you would substitute BleedBox or TrimBox for CropBox.
Update
This command line:
gs -sDEVICE=pdfwrite -sOutputFile=new.pdf /temp/input.pdf -c "[/CropBox [100 100 7568 3784] /PAGES pdfmark" -f
for me produces a PDF file (new.pdf) which has a CropBox in the Pages tree root node. Entries in the Pages tree are inherited by all pages below the node where they are defined, and when I open the file with Acrobat I can see that the crop marks are no longer visible.
So for me, it all works fine, and without using PDFX. Adding -sColorConversionStrategy=CMYK should be enough to get a CMYK file out. And since it isn't using PDF/X, it will maintain the transparency instead of rendering it. Note that since your file is all in DeviceGray, the colour is retained unchanged anyway.
I forgot to say that using /TrimBox instead of /CropBox works perfectly well for me as well, in exactly the same manner.
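As a rough sketch of the pdfmark approach described above, the box coordinates can be derived from the page size and the offsets in the question (29 pt from MediaBox to trim edge, 8 pt of bleed outside the trim). The helper name box_pdfmark and the 612 x 792 page size are assumptions made for illustration, and the MediaBox is assumed to start at 0 0. Whether both keys are honoured in a single /PAGES pdfmark is worth verifying on your own Ghostscript version.

```shell
# box_pdfmark WIDTH HEIGHT TRIM_OFFSET BLEED   (hypothetical helper)
# Prints a /PAGES pdfmark that sets TrimBox and BleedBox for every page.
# All values are whole PostScript points; MediaBox is assumed to be [0 0 WIDTH HEIGHT].
box_pdfmark() {
  w=$1 h=$2 trim=$3 bleed=$4
  b=$((trim - bleed))   # the BleedBox lies BLEED points outside the TrimBox
  printf '[ /TrimBox [%d %d %d %d] /BleedBox [%d %d %d %d] /PAGES pdfmark\n' \
    "$trim" "$trim" "$((w - trim))" "$((h - trim))" \
    "$b" "$b" "$((w - b))" "$((h - b))"
}

# Possible use, without -dPDFX:
#   gs -o output.pdf -sDEVICE=pdfwrite -sColorConversionStrategy=CMYK \
#      input.pdf -c "$(box_pdfmark 612 792 29 8)" -f
```

For a 612 x 792 page, box_pdfmark 612 792 29 8 gives a TrimBox of [29 29 583 763] and a BleedBox of [21 21 591 771].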

Related

Can I bulk-remove links from a pdf from the command line?

I'm downloading some newspapers as PDF (for posterity). One title is a pain: it includes URI links in the PDF itself, and if you accidentally click one it opens a browser tab to a page that 500s. It's not so bad on a desktop computer, but a pain in the butt if someone is reading it on a tablet. Each issue has approximately 200 of these links.
For a different title, it was as simple as using QPDF, like so:
qpdf --qdf --object-streams=disable file temp-file
This puts the temp version into postscript mode or something, and I was able to nuke the links with something like this:
s/obj\n<<\n( \/A <<\n \/S \/URI.+?)>>\nendobj/"obj\n<<\n" . " " x length($1). ">>\nendobj"/sge
This still works. However, a 15 meg original pdf is now becoming a 108meg "fixed" pdf. I can accept some bloat, but 720% is a bit absurd (I think it was more like 10% on the other title). Whenever I google for how to do this, I get results for Acrobat Reader and how you can click around in 20 menus to do such... does no one that uses Adobe products ever want to automate this stuff? There are between 180 and 300 links in a typical issue, spread across 45-150 pages (Sunday editions).
Are there any tools that can do this? Are there any clever arguments to qpdf that will make this more reasonable?
PS Yes I know it's hacky as hell to just overwrite the URIs with spaces, but I've never managed to figure out how to remove the objects entirely since their references also have to be removed.
You can do this with the community edition of cpdf: https://community.coherentpdf.com/
To remove all links in a PDF (well, to replace them with an empty link):
cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '""' -o out.pdf
This does not remove the annotations - it just makes sure that clicking on them won't go anywhere. It leaves the annotation in place, but with an empty link. You could replace with a working URL too, of course:
cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '"https://www.google.com/"' -o out.pdf
(You can also use -replace-dict-entry-search to replace only certain URLs - see the manual.)
Or, if you just want rid of all the annotations (link and non-link):
cpdf -remove-annotations in.pdf -o out.pdf
You can use HexaPDF (you need to have Ruby installed and then use gem install hexapdf to install HexaPDF) and the following small script to remove the links:
require 'hexapdf'

HexaPDF::Document.open(ARGV[0]) do |doc|
  doc.pages.each do |page|
    page.each_annotation.select {|annot| annot[:Subtype] == :Link}.each do |annot|
      page[:Annots].delete(annot)
    end
  end
  doc.write(ARGV[0] + '_processed.pdf', optimize: true)
end
Then batch execute the script for all the files you want the links removed.
Note that this will remove all links.
Just to round off the options: I would suggest the best is potentially a dedicated PDF command-line tool, such as cpdf in the answer by johnwhitington, or a dedicated library like iText.
There are several alternative methods touted for batch text editing, including the qpdf route you are using.
"temp version into postscript mode or something,"
That step converts the PDF into QDF, a plain, decompressed, text-editable form of PDF, so that you can run sed or a similar string editor on it. After editing, the file is still an editable QDF-1.0 file, so it needs converting back to a conventional PDF, in which the streams are binary because they have been recompressed.
1) qpdf
At the end of an edit that has bloated the file, the idea is to convert back to a regular application/pdf file using
fix-qdf file-temp.pdf>out.pdf
to repair the offsets and lengths disturbed by the edit, and then
qpdf --compress-streams=y out.pdf outfixed.pdf
to get back to a compressed outfixed.pdf.
Other cross-platform options are:
2) pdftk
$ pdftk infile.pdf output outfile.pdf uncompress
Edit with vim or whatever sed scripting method you prefer, then
$ pdftk outfile.pdf output fixedfile.pdf compress
3) mutool
mutool clean -d [options] input.pdf [output.pdf] [pages]
-d Decompress streams. This will make the output file larger, but provides easy access for reading and editing the contents with a text editor.
-i Toggle decompression of image streams. Use in conjunction with -d to leave images compressed.
-f Toggle decompression of font streams. Use in conjunction with -d to leave fonts compressed.
-a ASCII Hex encode binary streams. Use in conjunction with -d and -i or -f to ensure that although the images and/or fonts are compressed, the resulting file can still be viewed and edited with a text editor.
Whichever options you use need to be reversed when recompressing.
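A minimal sketch of the qpdf round trip from option 1, wrapped in two hypothetical helper functions. The RUN variable is a convention of this sketch only: leave it unset to execute qpdf, or set RUN=echo to print the commands instead. Adding --object-streams=generate is my assumption about where part of the size increase may come from (the QDF step disables object streams); it is not something the answers above tested.

```shell
# to_qdf IN.pdf TEMP.qdf: expand a PDF into text-editable QDF form.
to_qdf() {
  ${RUN-} qpdf --qdf --object-streams=disable "$1" "$2"
}

# from_qdf TEMP.qdf OUT.pdf: write a compact PDF again, compressing streams
# and packing objects back into object streams.
from_qdf() {
  ${RUN-} qpdf --object-streams=generate --compress-streams=y "$1" "$2"
}

# Typical sequence (file names are placeholders):
#   to_qdf issue.pdf temp.qdf
#   ...edit temp.qdf; if byte lengths changed, run: fix-qdf temp.qdf > temp2.qdf
#   from_qdf temp.qdf fixed.pdf
```

Comparing the size of fixed.pdf with the original shows whether the recompression step was enough for a given title.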
NOTE
Text editors can corrupt binary fonts and binary images, so check the file for damage after using an editor that changes the encoding or the line endings. Even when pdftk has decompressed an image stream into what looks like simple text, any change to the end-of-line characters by the editor would break that stream.
Additionally, text edits that are not simple byte-for-byte "find and replace" operations can corrupt the xref table too badly for it to be reindexed during recompression, so try to overwrite with the same number of characters when using a text-editing method.
SIDE NOTE
Even if you remove the link actions and external hyperlinks, a reader may still offer a clickable link wherever the URL is present as visible text. It is the same with https://google.com here, although HTML will usually highlight it in blue with an underline.
Hence make sure the viewer's security settings are enabled.

Initial View Property for PDFA

I am using ghostscript to convert my postscript file to PDF/A.
My requirement is to have the Initial View- Magnification property set to Fit Page.
However, the value is set to default always. I have tried different View properties in PDFMarks but none of them seems to be working.
Below are my pdfmarks:
[ /Title (Document title) /DOCINFO pdfmark
[ /PageMode /UseOutlines /View [/Fit] /Page 1 /DOCVIEW pdfmark
I have also tried /FitV,/FitB but none of them seem to be working.
Ghostscript's pdfwrite device converts this pdfmark into an OpenAction in the Catalog. Using your pdfmark code, and an empty page, this appears to work well for me.
So the questions:
Which version of Ghostscript are you using?
What is the exact command line you are sending?
What makes you think this isn't working? (How precisely are you verifying the action?)

Can I create a PDF which will always opens at zoom level 100%?

I am running into an issue with PNG to PDF conversion.
My PNG files are big, not in file size but in content.
The conversion creates correspondingly big PDF files. I don't have any issue with the quality, but whenever I open such a PDF in a PDF viewer, it opens in "Fit to Page" mode.
So I can't read the PDF in its initial view and have to zoom in to 100% myself.
My question is: can I create a PDF which will always open at zoom 100% ?
You can possibly achieve what you want with the help of Ghostscript.
Ghostscript lets you insert PostScript snippets into its command line via -c "...[PostScript code here]...".
PostScript has a special operator called pdfmark. This operator is not understood by most PostScript interpreters, but is understood by Acrobat Distiller and (for most of its parameters) also by Ghostscript when generating PDFs.
So you could try to insert
-c "[ /PageMode /UseNone /Page 1 /View [/XYZ null null 1] \
/PageLayout /SinglePage /DOCVIEW pdfmark"
into a PDF->PDF conversion Ghostscript command line.
Please take note about various basic things concerning this snippet:
The contents of the command line snippet appears to be 'unbalanced' regarding the [ and ] operators/keywords. But it is not! The initial [ is balanced by the final pdfmark keyword. (Don't ask -- I did not define this syntax...)
The 'inner' [ ... ] brackets delimit an array representing the page /View settings you desire.
Not all PDF viewers respect the view settings embedded in the PDF file (Acrobat software does!).
Most PDF viewers allow users to override the view settings embedded in PDF files (Acrobat software also does this). That is, you can tell your viewer never to respect any settings from the PDF files it opens but, for example, always to open them with "fit to width".
Some specific things about this snippet:
The page mode /UseNone means: the document displays without bookmarks or thumbnails. It could be replaced by
/UseOutlines (to display bookmarks also, not just the pages)
/UseThumbs (to display thumbnail images of the pages, not just the pages)
/FullScreen (to open document in full screen mode)
The array for the view mode constructed as [/XYZ <left> <top> <zoom>] means: The zoom factor is 1 (=100%), the left distance from the page origin is the special 'null' value, which means to keep the previously user-set value; the top distance from the page origin is also 'null'. This array could be replaced by
/Fit (to adapt the page to the current window size)
/FitB (to adapt the visible page content to the current window size)
/FitH <top> (to adapt the page width to the current window width); <top> indicates the required distance from the page origin to the upper edge of the window.
...plus several others I cannot remember right now.
So to change the settings of an existing PDF file, you could do the following:
gs \
-o out.pdf \
-sDEVICE=pdfwrite \
-c "[ /PageMode /UseNone /Page 1 /View [ /XYZ null null 1 ] " \
-c " /PageLayout /SinglePage /DOCVIEW pdfmark" \
-f in.pdf
To check if the Ghostscript command worked, open the PDF in a text editor which is capable of handling binary files. Search for the /View or the /PageMode keywords and check if they are there, inserted as values into the PDF root object.
If it worked, check whether your PDF viewer honors the settings. If it doesn't, see if there is an overriding setting in the viewer's preferences.
I did a quick test run on a sample PDF of mine. Here is how the PDF root object's dictionary looks now, checked with the help of pdf-parser.py:
pdf-parser-beta.py -s Catalog a.pdf
obj 1 0
Type: /Catalog
Referencing: 3 0 R, 9 0 R
<<
/Type /Catalog
/Pages 3 0 R
/PageMode /UseNone
/Page 1
/View [/XYZ null null 1]
/PageLayout /SinglePage
/Metadata 9 0 R
>>
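The /DOCVIEW pdfmark used above can be parameterised by zoom level. The following is only a sketch: the helper name docview_pdfmark is made up for illustration, and it assumes you want to pass the zoom as a percentage, which it converts to the factor that /XYZ expects (100 becomes 1, 150 becomes 1.5).

```shell
# docview_pdfmark ZOOM_PERCENT   (hypothetical helper)
# Prints a /DOCVIEW pdfmark that opens page 1 at the given zoom level.
docview_pdfmark() {
  # Convert the percentage into a zoom factor; LC_ALL=C keeps "." as the decimal point.
  zoom=$(LC_ALL=C awk -v p="$1" 'BEGIN { printf "%g", p / 100 }')
  printf '[ /PageMode /UseNone /Page 1 /View [ /XYZ null null %s ] /PageLayout /SinglePage /DOCVIEW pdfmark\n' "$zoom"
}

# Possible use:
#   gs -o out.pdf -sDEVICE=pdfwrite -c "$(docview_pdfmark 100)" -f in.pdf
```

As noted above, the viewer still decides whether to honor the setting.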
To learn more about the pdfmark operator, google for 'pdfmark reference filetype:pdf'. You should be able to find it on the Adobe website and elsewhere:
https://www.google.de/search?q=pdfmark%20reference%20filetype%3Apdf&oq=pdfmark%20reference%20filetype%3Apdf
In order to let ImageMagick create a PDF as you want it, you may be able to hack the file defining your delegate settings. For more help about this topic see for example here:
http://www.imagemagick.org/Usage/files/#delegates
The PDF specification supports this as follows: create a GoTo action that goes to the first page and sets the zoom level to 100%, then set that action as the document's open action.
How exactly you implement it in real life depends very much on the tool you use to create the PDF file. I do not know if ImageMagick can create such actions.

Add page to multiple PDFs in batch without messing with fonts

I'm trying to use Ghostscript to append a PDF as "last page" to multiple other PDFs. The problem I'm encountering is that Ghostscript walks through the whole PDF and does a bunch of font substitution.
I'm using the following batch script:
FOR %%G IN (*.pdf) DO IF NOT %%G==lastpage.pdf gswin64c -sDEVICE=pdfwrite -sOutputFile="output\%%G" -dNOPAUSE -dBATCH "%%G" lastpage.pdf
Example Error:
Page 12
Substituting font Courier for GGCJBF+Courier.
I will also sometimes get other errors, like this:
jbig2dec FATAL ERROR decoding image: prevent DOS while decoding height classes (segment 0x00)
failed to create parsed JBIG2GLOBALS object.
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
All I need gs to do is append my lastpage.pdf to the existing PDFs without walking through the entire PDF I'm appending to, especially with font substitution, because I will not have most of the fonts other people are using in their PDFs.
Is it possible in gs to simply append without walking through every page of the PDF? Is there another tool that will allow appending of PDFs in batches without this issue?
You need to be aware that Ghostscript does not simply manipulate the incoming PDF file, so you aren't 'appending' a page. What it does is interpret the incoming file into marking operations, pass those to a device, and that device takes further action on them. Rendering devices write to a bitmap, pdfwrite reassembles the marking operations into a brand new file.
That's why it 'walks through the whole file'; it's the way it works. There are advantages to this (it's possible to alter the file contents, for example) and disadvantages.
Now if you are getting a font substitution for an embedded font, there's something wrong with the embedded font (or possibly you are using a really old version of Ghostscript with a bug). You could try a newer version of Ghostscript but you're never going to get away from processing the entire input file.
Why not try pdftk?
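A sketch of what the pdftk route might look like as a batch job, written as a POSIX shell function rather than a Windows batch file. The helper name append_last_page and the RUN variable are assumptions of this sketch: leave RUN unset to execute pdftk, or set RUN=echo to print the commands first. pdftk's cat operation concatenates the listed files without re-interpreting the page contents, which is the behaviour the question asks for.

```shell
# append_last_page LAST.pdf OUTDIR   (hypothetical helper)
# Appends LAST.pdf to every other PDF in the current directory and writes
# the results into OUTDIR, which must already exist.
append_last_page() {
  last=$1 outdir=$2
  for f in *.pdf; do
    [ -e "$f" ] || continue              # the glob matched nothing
    if [ "$f" != "$last" ]; then         # don't append the page to itself
      ${RUN-} pdftk "$f" "$last" cat output "$outdir/$f"
    fi
  done
}

# Possible use:
#   mkdir -p output && append_last_page lastpage.pdf output
```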

Configuring GhostScript to rotate page output other than multiple of 90 degrees (PDF => PNG)

I'm trying to rotate GhostScript's output from a PDF input using the following:
gs -dSAFER -dBATCH -dNOPAUSE -r200 -sDEVICE=pngmono \
-dAutoRotatePages=/None -sOutputFile=output.png -c 10 rotate -f input.pdf
It generates the output file without any rotation (vs. the desired 10 degree rotation). Any ideas what's going wrong here?
Firstly, AutoRotatePages is only defined for the pdfwrite family of devices; other devices don't do anything with it. So specifying it to the pngmono device will have no effect.
Secondly, the PDF interpreter resets the graphics state when it processes the PDF file. It does this because, in order to do things like page fitting, setting the PageSize to the MediaBox of the PDF file and a bunch of other stuff, it calls setpagedevice. One of the implicit actions of setpagedevice is to call initgraphics, which resets the CTM.
Basically you can't rely on the PostScript graphics state at the time when you start processing a PDF file to have any effect on the graphics state while processing the PDF.
If you really want to do this you will have to modify gs/Resource/Init/pdf_main.ps, at the end of pdfshowpage_setpage:
pop currentdict end setpagedevice
} bind def
You'll need to insert your rotation here, after the setpagedevice. The neatest way to do this is to use a PostScript parameter, say UserRotation. You might then do:
pop currentdict end setpagedevice
/UserRotation where {
/UserRotation get rotate
} if
} bind def
And call GS with -dUserRotation=10
For those running on a system where the Resources are built into the ROM file system, you will need to modify the file on disk and then tell GS to use the modified Resources using the -I switch (-I/ghostpdl/gs/Resource/Init). For anyone trying to use this in Windows, you will first need to get hold of the Resources (they are not currently supplied as part of the Windows binary release) which will probably mean downloading the Ghostscript sources.

Resources