I'm trying to use Ghostscript to append a PDF as "last page" to multiple other PDFs. The problem I'm encountering is that Ghostscript walks through the whole PDF and does a bunch of font substitution.
I'm using the following batch script:
FOR %%G IN (*.pdf) DO IF NOT %%G==lastpage.pdf gswin64c -sDEVICE=pdfwrite -sOutputFile="output\%%G" -dNOPAUSE -dBATCH "%%G" lastpage.pdf
Example Error:
Page 12
Substituting font Courier for GGCJBF+Courier.
I will also sometimes get other errors, like this:
jbig2dec FATAL ERROR decoding image: prevent DOS while decoding height classes (segment 0x00)
failed to create parsed JBIG2GLOBALS object.
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
All I need gs to do is append my lastpage.pdf to the existing PDFs without walking through the entire PDF I'm appending to, especially with font substitution, because I will not have most of the fonts other people are using in their PDFs.
Is it possible in gs to simply append without walking through every page of the PDF? Is there another tool that will allow appending of PDFs in batches without this issue?
You need to be aware that Ghostscript does not simply manipulate the incoming PDF file, so you aren't 'appending' a page. What it does is interpret the incoming file into marking operations, pass those to a device, and that device takes further action on them. Rendering devices write to a bitmap, pdfwrite reassembles the marking operations into a brand new file.
That's why it 'walks through the whole file', its the way it works. There are advantages to this (its possible to alter the file contents for example) and disadvantages.
Now if you are getting a font substitution for an embedded font, there's something wrong with the embedded font (or possibly you are using a really old version of Ghostscript with a bug). You could try a newer version of Ghostscript but you're never going to get away from processing the entire input file.
Why not try pdftk.
Related
I'm downloading some newspapers as pdf (for posterity). One title is a pain, it includes URI links in the pdf itself, if you accidentally click these it opens a browser tab to a page that 500s. It's not so bad on a desktop computer, but a pain in the butt if someone is reading it with a tablet. Each issues has approximately 200 of these links.
For a different title, it was as simple as using QPDF, like so:
qpdf --qdf --object-streams=disable file temp-file
This puts the temp version into postscript mode or something, and I was able to nuke the links with something like this:
s/obj\n<<\n( \/A <<\n \/S \/URI.+?)>>\nendobj/"obj\n<<\n" . " " x length($1). ">>\nendobj"/sge
This still works. However, a 15 meg original pdf is now becoming a 108meg "fixed" pdf. I can accept some bloat, but 720% is a bit absurd (I think it was more like 10% on the other title). Whenever I google for how to do this, I get results for Acrobat Reader and how you can click around in 20 menus to do such... does no one that uses Adobe products ever want to automate this stuff? There are between 180 and 300 links in a typical issue, spread across 45-150 pages (Sunday editions).
Are there any tools that can do this? Are there any clever arguments to qpdf that will make this more reasonable?
PS Yes I know it's hacky as hell to just overwrite the URIs with spaces, but I've never managed to figure out how to remove the objects entirely since their references also have to be removed.
You can do this with the community edition of cpdf: https://community.coherentpdf.com/
To remove all links in a PDF (well, to replace them with an empty link):
cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '""' -o out.pdf
This does not remove the annotations - it just makes sure that clicking on them won't go anywhere. It leaves the annotation in place, but with an empty link. You could replace with a working URL too, of course:
cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '"https://www.google.com/"' -o out.pdf
(You can also use -replace-dict-entry-search to replace only certain URLs - see the manual.)
Or, if you just want rid of all the annotations (link and non-link):
cpdf -remove-annotations in.pdf -o out.pdf
You can use HexaPDF (you need to have Ruby installed and then use gem install hexapdf to install HexaPDF) and the following small script to remove the links:
require 'hexapdf'
HexaPDF::Document.open(ARGV[0]) do |doc|
doc.pages.each do |page|
page.each_annotation.select {|annot| annot[:Subtype] == :Link}.each do |annot|
page[:Annots].delete(annot)
end
end
doc.write(ARGV[0] + '_processed.pdf', optimize: true)
end
Then batch execute the script for all the files you want the links removed.
Note that this will remove all links.
Just to round off the options I would suggest the best is potentially a PDF dedicated command line tool such as cpdf answer by johnwhitington or a dedicated library like iText.
There are several alternative methods touted for batch text editing your using qpdf
"temp version into postscript mode or something,"
That is a converted pdf into plain old decompressed text/pdf hybrid qdf so you can run sed or similar string editor. Here the primary difference is the upper out.pdf file shows as an editable QDF-1.0 version after editing so needs conversion to a conventional PDF as seen in the lower part where the stream is binary thus recompressed.
1) qpdf
At end of a bloating edit exercise the idea is to reverse back to application/pdf using
fix-qdf file-temp.pdf>out.pdf
to tidy up redirects and then
qpdf --compress-streams=y out.pdf outfixed.pdf
back to fixed.pdf
Other cross platform means are using
2) pdftk
$ pdftk infile.pdf output outfile.pdf uncompress
edit with vim or whatever sed scripting method then
$ pdftk outfile.pdf output fixedfile.pdf compress
3) mutool
mutool clean -d [options] input.pdf [output.pdf] [pages]
-d Decompress streams. This will make the output file larger, but provides easy access for reading and editing the contents with a text editor.
-i Toggle decompression of image streams. Use in conjunction with -d to leave images compressed.
-f Toggle decompression of font streams. Use in conjunction with -d to leave fonts compressed.
-a ASCII Hex encode binary streams. Use in conjunction with -d and -i or -f to ensure that although the images and/or fonts are compressed, the resulting file can still be viewed and edited with a text editor.
Whichever options you use, need to be reversed when recompressing
NOTE
Using text editors will potentially corrupt binary fonts and binary images, thus they need monitoring for any corruption in an editor that changes encoding or line feeds. This pdftk sample shows the image stream has been decompressed well into simple text but beware any change of End Of Line by editor would break up that stream
Additionally when making text edits that are not simple byte wise "find and replace", the xref table can be corrupted too much to be reindexed by recompression, try to overwrite with same number of characters when using a text edit method.
SIDE NOTE
EVEN if you remove actions and external hyperlinks actions but the text is present the reader will still provide that exploitable action. Same as here https://google.com but html will highlight usually in blue underline.
Hence ensure security is on
I am trying to do exactly what is described in the following thread:
AppleScript/Automator: renaming PDF with extracted text content of this PDF
So I am using the Chino22's version and there are two issues with it:
First, instead of the contents of the pdf, theFileContentsText gets some metadata stuff.
Second, althought the script runs to the end, I get the following error for the last step:
error "The variable thisFile is not defined." number -2753 from "thisFile"
So, how do I get the text contents instead, and how do I define thisFile to the current pdf that is being processed in the loop?
Thanks in advance!
I would not expect the linked script to work.
Except for document metadata, extracting text content from PDF is notoriously difficult and unreliable, and not a road you want to go down if you can possibly avoid it. Adobe’s PDF file format is designed for printing, not for data processing. PDF files contain blocks of Postscript-like page drawing instructions, typically compressed, and while it’s possible for PDFs also to include the original plain text for accessibility use, most PDF generators do not do this so the only way to get the original text is by reconstructing it from those low-level drawing instructions—not a trivial job.
AppleScript’s read command only reads that raw file data; it does not parse it into drawing instructions, never mind translating those drawing instructions back into plain text. Change a PDF file’s extension to .txt and open it in a plain text editor, and you’ll see what I mean. Nasty.
If you need to work with the PDF’s original content (text, images, whatever), your best solution is to get those files before they were converted into a PDF.
If you must extract content from a PDF file, use an existing tool that knows how to do it.
For instance, if you’re lucky enough to have PDFs that contain XFDF (XML form) or accessibility data, there are 3rd-party apps and libraries to extract that content in readable form. I can’t think offhand of any that are AppleScriptable (Adobe Acrobat has only minimal AS support) so you’ll probably need to find one you can run from command line (do shell script in AS).
Or, if the PDFs have a consistent visual structure, a 3rd-party library such as Python’s PDFMiner (which I’ve used in the past) can identify blocks of characters by position and convert those back into strings with varying degrees of reliability (it has to convert font glyphs back into Unicode characters, guess at which characters are close enough to constitute a word, and where to insert space and return characters between those words). You’ll have to write some Python code to extract the bits you want, so look for tutorials to get started (or pay someone to write it for you).
But again, if you can possibly avoid having to extract text from PDF, you should. You will save yourself a lot of trouble.
Is there an option for to me to ask Ghostscript to indent the Postscript it creates?
Everything starts at the beginning of a line and I find it difficult to follow.
Alternatively, I am using Emacs and ps-mode.
If anyone know how to indent code in this mode I would appreciate a tip (apologize because this may not be relevant to this StackExchange)
No, there is no option for indenting the output.
PostScript is pretty much regarded as a write-only language anyway, and the output of ps2write (which is what I assume you are using though you don't say) is particularly difficult since it fundamentally outputs PDF syntax with a PostScript program on the front to parse it into PostScript operations.
Why do you want to read it ?
[EDIT]
You can always edit your question, you don't need to post a new answer.
I'm afraid what you want to do isn't as simple as you might think.
It might be possible for this use case if the PDF files you receive are always created the same way, but there are significant problems.
The font you use as a substitute for the missing font must be encoded the same way. Say for example the font in the PDF file is encoded so that 0x41 is 'A', you need to make sure that the replacement font is also encoded so that 0x41 is an 'A'. So just the findfont, scalefont, setfont sequence is not always going to be sufficient, sometimes you will need to re-encode the font.
CIDFonts will be a major stumbling block. Firstly because ps2write simply doesn't emit CIDFonts at all. These were not part of level 2 PostScript. As a result all text in a CIDFont will be embedded as bitmaps. If your original file doesn't contain the CIDFont then you'll get the fallback CIDFont bitmapped.
Secondly CIDFonts can use multiple-byte character codes, of variable length. You can't simply replace a CIDFont with a Font, it just won't work.
The best solution, obviously, is to have the PDF files created with the fonts required embedded. This is best practice. If you can't get that, then I'd suggest that rather than trying to hand edit PostScript, you use the fontmap.GS and cidfmap files which Ghostscript uses to find font.
Ghostscript already has a load of code to do font substitution automatically, using both Fonts and CIDFonts as substitutes, and it does all the hard work of re-encoding the fonts or building CMaps as required. If you are on Windows much of this may already be done for you, when you install Ghostscript it will ask if you want to create font mappings. If you said yes then it will
Add the font substitutions you want to use in those files (they have comments explaining the layout) and then use the pdfwrite device to make a new PDF file. Set EmbedAllFonts to true (you may need to add a AlwayEmbed font array as well, listing the fonts specifically) and SubsetFonts to false.
That should create a new PDF file where the missing fonts have been replaced by your defined substitutes, those substitutes will have been embedded in the new PDF file and they have will not been subset (Acrobat will generally refuse to edit text in a subset font).
The switches I mentioned above are standard Adobe Distiller parameters, but they are documented for pdfwrite here. There's some documentation on adding fonts here and here and specifically for CIDFonts here.
Basically I'd suggest you define your substitutions and let Ghostscript do the work for you.
This is not an answer to the problem but rather an answer to KenS's question about "Why do you want to read it?"
I tried to put it in the comment box but it was too long.
I am a retired engineer with a strong programming background.
I would like to read and understand the postscript code for the reason shown below.
I play duplicate bridge as a hobby. I recieve a PDF file of what is know as a convention card (a single page document of bridge agreements).
Frequently I would like to edit these files.
When I open with Adobe Illustrator I have to spend a significant amount of time replacing fonts that are not on my system with fonts that I do have.
I can take the PDF and export it as a postscript file using Ghostscript.
I was going to write a little program to replace the embedded fonts with the fonts that I use to replace them.
I was going to leave the postscript file unaltered and insert things like
/HelveticaMonospacedPro-RG findfont
12 scalefont setfont
just above where the text is written.
I was planning on using the fonts that I have on my system (e.g., HelveticaMonospacedPro-RG).
Good day,
We print Postscript files directly on industrial Xerox printers.
One client's Postscript files were getting garbled due to a font issue that I was unable to track down, so I used Adobe's Distiller to convert from PS to PDF. The same font issues turned up in the PDFs that were generated from Distiller. No amount of option tweaking helped me out, and find/replace font operations using the Callas pdfToolbox didn't work out for me.
So, I downloaded Ghostscript and spent an entertaining hour remembering how DOS worked. I was eventually able to convert several PS files into flawless-looking PDFs by going to the Ghostscript directory and doing this:
gswin64 -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=myoutputfilename.pdf myinputfilename.ps
But, I didn't think things all the way through because now I'm faced with the problem of mixed-plex. Some of the documents in the file are one-page documents and some are two-page documents, which should be printed duplex.
PS handles all of this for us when we put it on one of the Xerox printers. PDF, of course, does not. I can only specify simplex or duplex on the printer - so it's either one or the other, which doesn't work for a PDF with both.
Is there any clean, (or dirty), way to get around this? I was thinking of somehow instructing Ghostscript to insert blank pages after every simplex page of a PS file, and then just printing the entire PDF duplex, but have no idea how I would begin to do this.
Any assistance greatly appreciated. :)
It 'sounds like' you have concatenated several PostScript program together here, is that the case ?
This isn't really a great idea, it can lead to incorrect output, I wonder if this is the source of your problem with Distiller and your printer.
Have you tried producing PostScript instead of PDF, by using the ps2write device instead of pdfwrite ? While this won't carry any of the device-specific controls (such as /Duplex), you can easily put them back. In fact recent versions of the device will allow you to specify code to be inserted at document and/or page level.
I have a pdf file basically RTC datasheet,
It is not allowing to jump to page using table of content and it doesn't have index or bookmark on the left side panel.
http://www.horustech.com.tw/WebMaster/FileData/Epson/RX8900SA(SA;CE).pdf
Now my question is it possible to update index/bookmark in this pdf using ghostscript or pdftk command ?
You could make a *new** PDF file, and you could add to that a new /Outlines tree (which I believe is called Bookmarks by Acrobat), but you would have to do it by creating a sequence of PostScript pdfmark operations. You would have to build those yourself, there's no way to do it automatically.
You could also (more difficult) /Link annotations to the table of contents so that it could jump to the relevant page/area. Again with Ghostscript you would have to do this by manually creating pdfmark operations.
Again, none of this can be done automatically, creating the pdfmarks would have to be done manually, especially the hyperlinks in the table of contents, those are almost certainly better handled using an interactive program, if you wanted to do that.
I'm reasonably sure you could do this with pdftk too, but again I think you would have to add the Outlines or Links manually.