Ghostscript annotation conversion

I'm trying to convert from PDF to PDF/A using Ghostscript 9.19 on Windows Server 2012 R2.
commandline:
"D:\Program Files\gs\gs9.19\bin\gswin64c" -dPDFA -dNOOUTERSAVE -dColorConversionStrategy=/sRGB -dProcessColorModel=/DeviceRGB -sDEVICE=pdfwrite -o target.pdf -dPDFACompatibilityPolicy=2 "PDFA_def.ps" source.pdf
For a lot of files I get
"Annotation set to non-printing, not permitted in PDF/A, aborting conversion"
Using Acrobat Pro's conversion, it converts non-printing annotations without problems.
What may I need to look for in PDFA_def.ps?

There is nothing to look for in PDFA_def.ps, since that is just a template for the additional information required to produce a PDF/A file.
Your problem is that your annotation is not valid for inclusion in a PDF/A file: non-printing annotations are not permitted in PDF/A. To create a PDF/A file from such an input, either the annotation must be removed, or it must be set to print. Ghostscript's pdfwrite device cannot know which one you want.
However, you can change the PDFACompatibilityPolicy. The default value is 0, which will include the offending feature and produce a non-PDF/A file. You can try changing it to 1 instead, which will ignore the feature. I'm not in a position to test this right now (I'm heading for an airport), but it ought to work.
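As a sketch only (untested, paths and file names taken from the question), the command line with the policy changed to 1 would look like this:

```shell
rem Same conversion as in the question, but with PDFACompatibilityPolicy=1
rem so that features not permitted in PDF/A (such as the non-printing
rem annotation) are dropped instead of aborting the conversion.
"D:\Program Files\gs\gs9.19\bin\gswin64c" -dPDFA -dNOOUTERSAVE ^
  -dColorConversionStrategy=/sRGB -dProcessColorModel=/DeviceRGB ^
  -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 ^
  -o target.pdf "PDFA_def.ps" source.pdf
```

Note that the resulting file will then simply lack the annotation's non-printing behaviour; whether that is acceptable depends on what the annotations are for.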
Obviously I don't know what Acrobat does in this case, but it must do something similar, or produce an invalid file. At least Ghostscript gives you the choice.

Related

Ghostscript makes text unsearchable after converting to PDF

Starting with a PDF file in which all text is searchable, I transform it into a new PS file with this command:
gswin64c -q -dSAFER -dNOPAUSE -dBATCH -sDEVICE=ps2write -dDOPDFMARKS -dLanguageLevel=2 -sOutputFile="new.ps" "old.pdf"
After that I transform the new.ps file into a PDF with this command:
gswin64c -q -r400 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dSubsetFonts=false -dAutoRotatePages=/PageByPage -dAutoRot -dCompatibilityLevel=1.2 -sOutputFile="new.pdf" new.ps
In the new.pdf file I can't search for text, although everything is visible.
How can I solve this problem?
This is what I'm using:
GPL Ghostscript 9.20 (2016-09-26)
Here is the output of the new.ps file:
https://pastebin.com/HTXZJnKY
Firstly; don't go to PostScript and then to PDF. If you want a new PDF file make it directly from the original PDF.
You haven't supplied the file to look at, so anything I say here is speculation, but... PDF files can (and often do) contain a ToUnicode CMap. This maps character codes to Unicode code points and is a reliable way to support copy/paste/search of text.
PostScript, being intended for printing (on paper), doesn't have any such mechanism. So by creating a PostScript file and then creating a new PDF file from that PostScript, you are going to lose the ToUnicode information if it was present.
Beyond that, if the original file lacked a ToUnicode CMap, it may be that the character codes used simply happened to match up to ASCII. The default for both ps2write and pdfwrite is to subset fonts. This has the effect of altering the character codes so that the first glyph used gets character code 1, the second gets character code 2, and so on. So Hello becomes 0x01, 0x02, 0x03, 0x03, 0x04.
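That renumbering can be mimicked outside of any font machinery. As an illustration only (plain awk, nothing Ghostscript-specific), assigning each distinct character the next free code in order of first use reproduces the Hello example:

```shell
# Mimic subset-font renumbering: each distinct character gets the next
# available code the first time it appears, and later occurrences of the
# same character reuse that code.
printf '%s\n' "Hello" | awk '{
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    if (!(c in code)) code[c] = ++next_code
    printf "%s%d", (i > 1 ? " " : ""), code[c]
  }
  print ""
}'
# prints: 1 2 3 3 4
```

Once the codes have been rewritten like this, a search for the ASCII string "Hello" can no longer match the character codes stored in the file.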
You are also using a three-year-old version of Ghostscript. The current version is 9.50, and you should upgrade to that anyway, even though it won't affect this particular situation.
Your command lines have problems. You don't need to specify -dLanguageLevel=2 for ps2write; that's the default. You haven't specified -dSubsetFonts=false for ps2write, so there's no point in specifying it for pdfwrite; the damage is done in the first pass. -dAutoRot won't do anything. Unless you have a good reason, you shouldn't change the resolution. Setting -dDOPDFMARKS won't preserve all the 'metadata' from the PDF file into the PostScript file; a load of stuff like Outlines and annotations won't be preserved.
You have specified a very low CompatibilityLevel for pdfwrite; why is that? It's fairly pointless anyway, since you are starting from Level 2 PostScript.
So in summary: don't do PDF->PS->PDF, just do PDF->PDF.
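As a hedged sketch (file names taken from the question, otherwise untested), the single-pass equivalent of the two commands above would be something like:

```shell
rem One pass, PDF -> PDF: ToUnicode CMaps and other PDF-only metadata
rem that a PostScript intermediate would discard are carried through.
rem -dSubsetFonts=false only helps when applied on this (single) pass.
gswin64c -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite ^
  -dSubsetFonts=false ^
  -sOutputFile="new.pdf" "old.pdf"
```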
If that doesn't achieve what you want, you'll have to supply an example and be more specific about what your goal is here.

Error occurs when distilling with Adobe, but not in Ghostscript

I have a PostScript file; when I open it with Ghostscript it shows the output with no error. But when I try to distill it with Adobe, it stops with the following error:
%%[ Error: undefined; OffendingCommand: show; ErrorInfo: MetricsCount --nostringval-- ]%%
I have shortened the file by removing text from it; now there are only two words in the output.
postscript file
The MetricsCount key is documented in Adobe Tech Note 5012, The Type 42 Font Format Specification. According to the specification it can have three possible values: 0, 2, or 4.
Section 5.7 on page 22 of the document:
When a key /MetricsCount is found in a CIDFont with CIDFontType 2, it must be an integer with values 0, 2, or 4.
To me this suggests that the MetricsCount key/value pair is optional, and as I said other interpreters don't insist on its presence. I can't possibly tell you why Adobe Distiller appears to insist on it, I don't have any experience of the internals of the Distiller PostScript interpreter. I'd have to guess that all Adobe PostScript interpreters have this 'feature' though, presumably your printer is using an Adobe PostScript interpreter.
Simply adding the MetricsCount key does not work. Why didn't you try this yourself instead of asking me? It would have been quicker...
The error is subtly different, I suspect the answer is that your CIDFont is missing something (or has something present) which is causing Distiller to look for a MetricsCount. I can't see anything obvious in the PostScript information, so perhaps there's something in the sfnts, though that would be surprising.
Interestingly I have in front of me a PostScript file containing a CIDFont which does not have a MetricsCount entry, and which Distiller processes without a complaint.
I can't let you have the file I'm using; it's a customer file. However, the fact that such a file exists indicates that other such files must exist. The one I'm looking at was created by QuarkXPress. I'd suggest that you try to find a working file to compare against. I'd also suggest that you try to make a smaller, simpler CIDFont. One with a single glyph would be favourite, I'd think.

Ghostscript undefined glyph

I'm using gs 9.20 to merge some PDF documents into a single document:
/usr/bin/gs9/bin/gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dRENDERTTNOTDEF=true -sOutputFile=/docs/merged.pdf
And I'm getting this error and have no idea how to resolve it. Has anyone come across these types of errors?
GPL Ghostscript 9.20: ERROR: Page 5 used undefined glyph 'g2' from type 3 font 'PDFType3Untitled'
Without seeing the original file it's not possible to be certain, but I would guess from the error that the file calls for a particular glyph in a font (PDFType3Untitled), and that font does not contain that glyph.
The result is that you get an error message (messages from the PDF interpreter which begin with ERROR, as opposed to WARNING, mean that the output is very likely to be incorrect).
You will still get a PDF file, and it may be visually identical to the original because, obviously, the original file didn't have the glyph either.
As for 'resolving' it, you need to fix the original PDF file; that's almost certainly where the problem is.
Please note that you are not 'merging' PDF files, as I keep on saying to people. The original file is torn down to graphics primitives, and a new file is then built from those primitives. You cannot depend on any constructs in the original file being present in the final file. A truly 'merged' file would preserve them; Ghostscript's pdfwrite device does not.
See here for an explanation.

Escaping special characters in User Input in IzPack Installer

I have an IzPack installer that takes in a lot of User Inputs and substitutes them in an XML file. This XML file is actually the configuration file for my application.
There is a major problem that I have hit and I can't get past it.
In the input fields (in the installer) the user can enter any text, including special characters like & # % ' etc. These special characters mess up my XML file, as they are not allowed in XML syntax and need to be escaped; for example, for & one would need &amp;.
So far I have been asking the user to do this, i.e. escape the special characters themselves, but that's not working either.
Is there a way to have this done automatically? I really need a solution fast.
I am using IzPack V 4.1
You should use a proper XML API (SAX, DOM) to generate the XML file; this will apply the correct escaping automatically. This may look more complicated at first, but it guarantees that a well-formed, syntactically correct file is written.
Searching for JAXP should give you a proper starting point.
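If a full XML API really is out of reach in the installer, the five predefined XML entities can at least be substituted in a pre-processing step. A minimal sketch in shell (the function name and sample string are invented for illustration, and this is no substitute for a real XML writer):

```shell
# Escape the five XML predefined entities in a text stream.
# '&' must be handled first, so that the entities introduced by the
# later substitutions are not themselves re-escaped.
escape_xml() {
  sed -e 's/&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g' \
      -e "s/'/\&apos;/g" \
      -e 's/"/\&quot;/g'
}

printf '%s\n' "O'Reilly & Sons <admin>" | escape_xml
# prints: O&apos;Reilly &amp; Sons &lt;admin&gt;
```

This only makes the text safe for element content and quoted attribute values; a proper API also handles character encoding and structural correctness, which is why it remains the better answer.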

Methods of Parsing Large PDF Files

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's only 200 MByte -- for simple text in tables that's huuuuge, unless you have 200,000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this is certainly a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience of that yet).
If step 2 works, you may already be halfway done, and you also know that doing it programmatically with Ruby is a job that can in principle be solved. If step 2 doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest to use Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the Xpdf folks). There are other, somewhat less convenient methods available too, but this might do the job already:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages, only pages 1-10 (as a proof of concept, to see if it works at all). To extract from every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a custom encoding vector), you should ask a new question describing the details of your findings so far. Then you'll need to resort to bigger calibres to shoot down the problem.
At the very least, could anyone point me to a Ruby PDF library for this task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure if any of these solutions is actually suitable for your problem, especially since you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text at predetermined locations. It may not be possible to determine what the rows and columns are, but that may depend on the PDF itself.
The Java libraries are the most robust, and may do more than just extract text. So I would look into JRuby and iText or PDFBox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn ruby library?
