Error occurs when distilling with Adobe but not in Ghostscript - debugging

I have a PostScript file; when I open it with Ghostscript it shows the output with no errors. But when I try to distill it with Adobe Distiller, it stops with the following error.
%%[ Error: undefined; OffendingCommand: show; ErrorInfo: MetricsCount --nostringval-- ]%%
I have shortened the file by removing text from it; now there are only two words in the output.
PostScript file

The MetricsCount key is documented in Adobe Tech Note #5012, The Type 42 Font Format Specification. According to the specification it can have three possible values: 0, 2, or 4.
Section 5.7 on page 22 of the document:
When a key /MetricsCount is found in a CIDFont with CIDFontType 2, it
must be an integer with values 0, 2, or 4.
To me this suggests that the MetricsCount key/value pair is optional, and as I said, other interpreters don't insist on its presence. I can't tell you why Adobe Distiller appears to insist on it; I don't have any experience of the internals of the Distiller PostScript interpreter. I'd have to guess that all Adobe PostScript interpreters have this 'feature' though, and presumably your printer is using an Adobe PostScript interpreter.
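For clarity, 'adding the key' just means putting one more entry in the CIDFont dictionary before it is defined. A hedged sketch (the comment lines stand for whatever entries your file already has):

  % hypothetical fragment: inside the CIDFontType 2 dictionary,
  % alongside the existing /CIDFontName, /CIDSystemInfo, /sfnts
  % and /CIDMap entries
  /MetricsCount 0 def   % 0 = CIDMap entries carry no metrics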
Simply adding the MetricsCount key does not work. Why didn't you try this yourself instead of asking me? It would have been quicker...
The error is subtly different; I suspect the answer is that your CIDFont is missing something (or has something present) which is causing Distiller to look for a MetricsCount. I can't see anything obvious in the PostScript, so perhaps there's something in the sfnts, though that would be surprising.
Interestingly I have in front of me a PostScript file containing a CIDFont which does not have a MetricsCount entry, and which Distiller processes without a complaint.
I can't let you have the file I'm using; it's a customer file. However, the fact that such a file exists indicates that other such files must exist. The one I'm looking at was created by QuarkXPress. I'd suggest that you try to find a working file to compare against. I'd also suggest that you try to make a smaller, simpler CIDFont. One with a single glyph would be favourite, I'd think.

Related

What does a gcc output file look like and what exactly does it contain?

When compiling a C file, gcc by default compiles it to a file called "a.out". My professor said that the output file contains the binaries, but when I open it I usually encounter unreadable text (VS Code says something like "This file contains unsupported text encoding").
I assumed that by 'binaries' I would be able to see literal zeroes and ones in the file, but that does not seem to be the case. So what exactly does the output file look like, what exactly does it contain, and what is 'text encoding'? Why can I not read it? What special characters might it contain? I'm aware that gcc first pre-processes the file, which means it removes all comments, expands all macros, and copies in the contents of any header files that are included. You get the pre-processed file by running gcc -E <file_name>.c; this processed file is then compiled into assembly. Up to this point, the output files are readable, i.e., I can open them with VS Code, but after this the assembled code and the object file thereafter are human-unreadable.
For reference, I have no prior experience with programming or any language for that matter; this is my first CS-related course in my first semester of college, and I apologize if this is too trivial a question to ask.
I actually had the same confusion early on. Not about that file type specifically, but about binary vs text files.
After all, aren't all files, even text ones, binary? In the sense that all information is 1s and 0s? Well, yes, all information can be stored/transmitted as 1s and 0s, but that's not what binary/text files refer to.
It refers to what that information, the content of the file, those 1s and 0s represent.
In a text file the bytes encode characters. In a binary file the bits encode some information that is not text. The format and semantics of that information are completely free; it can mean anything and use whatever encoding scheme. It's up to the application that writes/reads the file to properly understand the bit patterns.
Most text editors (like VS Code) treat any file they open as a text file. That is, they try to interpret the bit patterns as a text encoding scheme (e.g. ASCII or UTF-8). But not all bit patterns are valid ASCII/UTF-8, so that's why you get "unsupported text encoding".
If you want to inspect the actual 1s and 0s of both text and binary files, you need to use a utility that shows you that, e.g. a hex viewer/editor.
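For example, on Linux (tool names are the common ones; your system may differ):

  gcc hello.c              # produces a.out
  file a.out               # asks the OS to identify the file format
  xxd a.out | head         # first few lines of a hex dump
  od -An -tx1 -N16 a.out   # alternative: the first 16 bytes as hex

On Linux the first four bytes of a.out will be 7f 45 4c 46, the ELF magic number: the byte 0x7f followed by the characters "ELF". That header is perfectly meaningful data, just not text.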

Ghostscript undefined glyph

I'm using gs 9.20 to merge some PDF documents into a single document:
/usr/bin/gs9/bin/gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dRENDERTTNOTDEF=true -sOutputFile=/docs/merged.pdf
And I'm getting this error and have no idea how to resolve it. Has anyone come across these types of errors?
GPL Ghostscript 9.20: ERROR: Page 5 used undefined glyph 'g2' from
type 3 font 'PDFType3Untitled'
Without seeing the original file it's not possible to be certain, but I would guess from the error that the file calls for a particular glyph in a font (PDFType3Untitled), and that font does not contain that glyph.
The result is that you get an error message (messages from the PDF interpreter which begin with ERROR, as opposed to WARNING, mean that the output is very likely to be incorrect).
You will still get a PDF file, and it may be visually identical with the original because, obviously, the original file didn't have the glyph either.
As for 'resolving' it, you need to fix the original PDF file; that's almost certainly where the problem is.
Please note that, as I keep on saying to people, you are not 'merging' PDF files: the original file is torn down to graphics primitives, and a new file is then built from those primitives. You cannot depend on any constructs in the original file being present in the final file. A truly 'merged' file would preserve them; Ghostscript's pdfwrite device does not.
See here for an explanation.
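For reference, such a pdfwrite invocation names its input files last; a sketch with made-up input file names:

  gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
     -sOutputFile=/docs/merged.pdf first.pdf second.pdf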

Appearance of no-break space character while typing

Using TeXShop to typeset LaTeX, I often come across the error Package inputenc error: Character \u8 not set up for use with LaTeX. That, I have learnt, is because some spaces somehow become no-break spaces (U+00A0), which apparently inputenc doesn't like. So this is NOT a LaTeX question, just one that was brought up by LaTeX. It might be about TeXShop or about I don't know what, but the LaTeX part is definitely solved. So the question is: why does it turn up? Is it a shortcut I am unaware of (I'm on Mac OS X 10.7.5), a TeXShop-specific thing, or something else?
PS I'm not sure if the tag is appropriate. Were I not forced to give at least one, I probably would have given none. LaTeX, as stated above, is definitely NOT an appropriate tag for this question. The one I had put was probably more appropriate. Anyway I'll have a look at the list of popular tags (if I find one) and change the tag to the one that seems most appropriate to me.
As Wikipedia states on this page, Alt+Space on the Mac inputs a no-break space. In speedy typing, I probably inadvertently press Alt while typing the space, resulting in this problem.
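If it keeps happening, you can scrub stray no-break spaces from the source before typesetting. A sketch with perl (the file name is hypothetical; this edits the file in place):

  perl -CSD -pi -e 's/\x{00A0}/ /g' document.tex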

Utility to Stamp/Watermark Unicode Text Into a PDF

I am looking for a (preferably) command line utility to stamp/watermark unicode text content into a PDF document.
I tried PDF Stamp and a couple of others that I found over the net, but to no avail with Greek characters (e.g. ΓΔΘΛ become ÃÄÈË).
Many thanks for any help!
With sufficiently "odd" characters, you generally need to specify a font and an encoding. I suspect that at least one of the tools you experimented with has the capability to define such things. (Incidentally, ΓΔΘΛ coming out as ÃÄÈË is exactly what you'd get if Greek text encoded as single-byte Windows-1253 were read back as Latin-1, which points at an encoding problem rather than a missing font.)
Reading their docs, it looks like PDFStamp will let you specify a font, but not an encoding. That doesn't bode well. It might always pick "Identity-H" for system fonts... worth trying.
I must admit, I'm surprised. "Disappointed" even. Have you contacted their email support?
Once upon a time, iText shipped with a number of command line tools that were mostly intended as examples but were nonetheless useful. I suspect you could dig them out of the SVN archive on SourceForge and get them to build again, if your Java-fu is up to the task. Just be sure to use BaseFont.IDENTITY_H whenever you're given a choice of encodings for a font.
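If you do end up writing your own, the core of such a stamper is only a few lines with the iText 5 API. A hedged sketch (the file names and font path are made up; pick a TTF that actually covers Greek):

  import java.io.FileOutputStream;
  import com.itextpdf.text.pdf.BaseFont;
  import com.itextpdf.text.pdf.PdfContentByte;
  import com.itextpdf.text.pdf.PdfReader;
  import com.itextpdf.text.pdf.PdfStamper;

  public class StampUnicode {
      public static void main(String[] args) throws Exception {
          PdfReader reader = new PdfReader("input.pdf");
          PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("stamped.pdf"));
          // Identity-H is what lets arbitrary Unicode text survive intact.
          BaseFont font = BaseFont.createFont("DejaVuSans.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
          for (int i = 1; i <= reader.getNumberOfPages(); i++) {
              PdfContentByte over = stamper.getOverContent(i); // draw on top of page i
              over.beginText();
              over.setFontAndSize(font, 24);
              over.showTextAligned(PdfContentByte.ALIGN_CENTER, "ΓΔΘΛ", 300f, 400f, 45f);
              over.endText();
          }
          stamper.close();
          reader.close();
      }
  }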

Methods of Parsing Large PDF Files

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's 200 MByte only -- for simple text in tables that's huuuuge, unless you have 200000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try with different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do the same is certainly a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience of that yet).
If "2." works, you may already be halfway done, and you'll know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest using Ghostscript. You could also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the Xpdf folks). There are other, somewhat less convenient methods available too, but this might do the job already:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages, only 1-10 (as a proof of concept, to see if it works at all). To extract from every page, just leave off the -f 1 and -l 10 parameters. You may need to tweak the encoding by changing the -enc parameter to ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a custom encoding vector), you should ask a new question describing the details of your findings so far. Then you'll need to resort to bigger calibres to shoot down the problem.
"At the very least, could anyone point me to a Ruby PDF library for this task?"
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure whether any of these solutions is actually suitable for your problem, especially since you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text at predetermined locations. It may not be possible to determine which runs of text are rows and which are columns; it will depend on the PDF itself.
The Java libraries are the most robust, and may do more than just extract text. So I would look into JRuby with iText or PDFBox.
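A hedged sketch of the PDFBox route (PDFBox 2.x API; the file name is made up), which you could drive from Ruby via JRuby:

  import java.io.File;
  import org.apache.pdfbox.pdmodel.PDDocument;
  import org.apache.pdfbox.text.PDFTextStripper;

  public class ExtractText {
      public static void main(String[] args) throws Exception {
          try (PDDocument doc = PDDocument.load(new File("tables.pdf"))) {
              PDFTextStripper stripper = new PDFTextStripper();
              stripper.setStartPage(1);   // limit the range for a first test
              stripper.setEndPage(10);
              // getText walks the content streams and emits plain text in an
              // approximate reading order -- there is no real table structure.
              System.out.print(stripper.getText(doc));
          }
      }
  }

You would still have to reassemble rows yourself, including the case where the Name column overflows onto a second line.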
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?
