Ghostscript makes text unsearchable after converting to pdf - ghostscript

Starting with a PDF file, in which all text is searchable, I transform it to a new PS file with this command:
gswin64c -q -dSAFER -dNOPAUSE -dBATCH -sDEVICE=ps2write -dDOPDFMARKS -dLanguageLevel=2 -sOutputFile="new.ps" "old.pdf"
After that I transform the new.ps file to a PDF with this command:
gswin64c -q -r400 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dSubsetFonts=false -dAutoRotatePages=/PageByPage -dAutoRot -dCompatibilityLevel=1.2 -sOutputFile="new.pdf" new.ps
In the new.pdf file I can't search for text, although everything is visible.
How can I solve this problem?
This is what I'm using:
GPL Ghostscript 9.20 (2016-09-26)
Here is the output of the new.ps file:
https://pastebin.com/HTXZJnKY

Firstly, don't go to PostScript and then to PDF. If you want a new PDF file, make it directly from the original PDF.
You haven't supplied the file to look at, so anything I say here is speculation but.... PDF files can (and often do) contain a ToUnicode CMap. This maps character codes to Unicode code points and is a reliable way of copy/paste/search for text.
PostScript, being intended for printing (on paper), doesn't have any such mechanism. So by creating a PostScript file and then creating a new PDF file from that PostScript, you are going to lose the ToUnicode information if it was present.
Further than that, if the original file lacked a ToUnicode CMap, it may be that the character codes used simply happened to match up to ASCII. The default for both ps2write and pdfwrite is to subset fonts. This has the effect of altering the character codes so that the first glyph used gets character code 1, the second gets character code 2, and so on. So 'Hello' becomes 0x01, 0x02, 0x03, 0x03, 0x04.
You are also using a 3 year old version of Ghostscript. The current version is 9.50 and you should upgrade to that anyway, even though it won't affect this particular situation.
Your command lines have problems: you don't need to specify -dLanguageLevel=2 for ps2write, that's the default. You haven't specified -dSubsetFonts=false for ps2write, so there's no point in specifying it for pdfwrite; the damage is done in the first pass. -dAutoRot won't do anything. Unless you have a good reason, you shouldn't change the resolution. Setting -dDOPDFMARKS won't preserve all the 'metadata' from the PDF file into the PostScript file; a load of stuff like Outlines and annotations won't be preserved.
You have specified a very low CompatibilityLevel for pdfwrite; why is that? It's fairly pointless anyway, since you are starting from Level 2 PostScript.
So in summary: don't do PDF->PS->PDF, just do PDF->PDF.
If that doesn't achieve what you want you'll have to supply an example and be more specific about what your goal is here.
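As a sketch of that direct route (file names here are placeholders, and this assumes a working Ghostscript install), a single pdfwrite pass would look something like this; -dSubsetFonts=false is only worth trying if the original turns out to lack ToUnicode CMaps:

```shell
# One-step PDF->PDF: pdfwrite reads the original PDF directly, so any
# ToUnicode CMaps it contains can be carried into the output.
gswin64c -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite ^
  -dSubsetFonts=false ^
  -sOutputFile="new.pdf" "old.pdf"
```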

Related

GhostScript undefined glyph

I'm using gs 9.20 to merge some pdf documents into a single document
/usr/bin/gs9/bin/gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dRENDERTTNOTDEF=true -sOutputFile=/docs/merged.pdf
And I'm getting this error and have no idea how to resolve it. Has anyone come across these types of errors?
GPL Ghostscript 9.20: ERROR: Page 5 used undefined glyph 'g2' from
type 3 font 'PDFType3Untitled'
Without seeing the original file it's not possible to be certain, but I would guess from the error that the file calls for a particular glyph in a font (PDFType3Untitled), and that font does not contain that glyph.
The result is that you get an error message (messages from the PDF interpreter which begin with ERROR, as opposed to WARNING, mean that the output is very likely to be incorrect).
You will still get a PDF file, and it may be visually identical with the original because, obviously, the original file didn't have the glyph either.
As for 'resolving' it, you need to fix the original PDF file; that's almost certainly where the problem is.
Please note that, as I keep on saying to people, you are not 'merging' PDF files: the original file is torn down to graphics primitives, and then a new file is built from those primitives. You cannot depend on any constructs in the original file being present in the final file. A truly 'merged' file would preserve them; Ghostscript's pdfwrite device does not.
See here for an explanation.
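For reference, a typical invocation of that kind looks like the following (input file names are hypothetical); each input is interpreted down to primitives and re-emitted into one output PDF:

```shell
# "Merge" via pdfwrite: the inputs are interpreted and a brand new
# PDF is built containing all their pages, in argument order.
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
   -sOutputFile=merged.pdf first.pdf second.pdf third.pdf
```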

Maximum number of input file for Ghostscript (gs)

I simply want to combine multiple EPS files into one big file using the gs command.
The command works flawlessly, except when I specify more than 20 input files.
Somehow the command ignores input files starting from the 21st input.
Has anyone experienced the same behavior? Is there a cap on the number of input files specified anywhere?
I looked through the site and couldn't find one.
sample command
gs -o output.eps -sDEVICE=eps2write file1.eps file2.eps .... file21.eps
Thank you.
Edit: add sample command
Almost certainly you have simply reached the maximum length of the command line for your Operating System. You can use the @ syntax for Ghostscript to supply a file containing the command line instead.
https://www.ghostscript.com/doc/current/Use.htm#Input_control
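A sketch of that workaround (file names hypothetical): write the arguments, one per line, into a text file, then pass that file with @ so the actual command line stays short:

```shell
# Build an argument file: the shell still expands the glob, but the
# long list ends up inside args.txt rather than on the command line.
printf '%s\n' -o output.eps -sDEVICE=eps2write file*.eps > args.txt
# Ghostscript reads the contents of args.txt as if typed on the command line.
gs @args.txt
```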
Note that the EPS files will not be placed appropriately using that command, and this does not actually combine EPS files, it creates a new EPS file whose marking content should be the same as the input(s).
If you actually want to combine the EPS files it's easy enough, but it will require a small amount of programming to parse the EPS file headers and produce appropriate scale/translate operations, as well as stripping off any bitmap previews (which will also happen when you run them through Ghostscript).

ghostscript annotation conversion

I'm trying to convert from PDF to PDF/A using version 9.19 on Windows Server 2012 R2.
commandline:
"D:\Program Files\gs\gs9.19\bin\gswin64c" -dPDFA -dNOOUTERSAVE -dColorConversionStrategy=/sRGB -dProcessColorModel=/DeviceRGB -sDEVICE=pdfwrite -o target.pdf -dPDFACompatibilityPolicy=2 "PDFA_def.ps" source.pdf
For a lot of files I get
"Annotation set to non-printing, not permitted in PDF/A, aborting conversion"
Using Acrobat Pro conversion, it converts non-printing annotations without problems.
What may I need to look for in PDFA_def.ps?
There is nothing to look for in pdfa_def.ps, since that is just a template for the additional information required to produce a PDF/A file.
Your problem is that your annotation is not valid for inclusion in a PDF/A file; non-printing annotations are not permitted in PDF/A. To create a PDF/A file from such an input, either the annotation must be removed, or it must be set to print. Ghostscript's pdfwrite device cannot know which one you want.
However, you can change the PDFACompatibilityPolicy; the default value is 0, which will include the offending feature and produce a non-PDF/A file. You can try changing it to 1 instead, which will ignore the feature. I'm not in a position to test this right now (I'm heading for an airport) but it ought to work.
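As a sketch, with that one change and everything else as in the question's command line, the invocation would be:

```shell
# PDFACompatibilityPolicy=1 tells pdfwrite to ignore features that
# PDF/A forbids (here the non-printing annotation) instead of
# aborting the conversion; the result may omit those annotations.
"D:\Program Files\gs\gs9.19\bin\gswin64c" -dPDFA -dNOOUTERSAVE ^
  -dColorConversionStrategy=/sRGB -dProcessColorModel=/DeviceRGB ^
  -sDEVICE=pdfwrite -o target.pdf -dPDFACompatibilityPolicy=1 ^
  "PDFA_def.ps" source.pdf
```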
Obviously I don't know what Acrobat does in this case, but it must do something similar, or produce an invalid file. At least Ghostscript gives you the choice.

TeXtoGIF for XeTeX

I need to extend TeXtoGIF to be able to handle XeTeX, ConTeXt, and other more modern implementations of TeX (point being that they can handle multiple fonts). Unfortunately, XeTeX in particular does not support DVI as an output format for its input, and my modifications break.
Please see the diff of changes at GitHub. My changes to the codebase are as follows:
Introduce a variable $cmdTeX to hold the TeX engine (LaTeX, XeLaTeX, etc.)
Add the option -xetex (or anything beginning with an x, really) to specify xelatex as the engine
Substitute the hard-coded latex call with the variable $cmdTeX.
I see two options to fixing this issue:
Coerce XeLaTeX to produce standard DVI output which, IIRC, isn't possible.
Find another sequence of commands (probably a different use of GS, which is why I included the tag, but so be it) to work with PDF output directly instead of DVI
So, I guess the question boils down to:
How can I convert a PDF into GIF without using graphical software?
which, probably, isn't a good fit for SO anymore IMHO.
It sounds like what you have is a patch you would like to submit to the author. Have you contacted him? Unfortunately his software doesn't (appear to) include a license so it may be hard to proceed from a legal standpoint. Most of the time in the open source world, if you encounter a non-responsive (or unwilling) author, you can do as you have already done, fork and patch. At that point you can choose to publish your new version, possibly with a new name, and conforming to the author's license.
From a software standpoint, the code is rather ancient, written for Perl 4. Because Perl has excellent backwards compatibility it will probably still work, but the question is, do you really want to? It may depend on your use case. The original author was making GIFs to use in web pages. If this is what you are doing, you might want to look at MathJax, which lets you use LaTeX right in your browser/HTML directly.
Instead of adding to my Q, this turned out to be a valid solution to my overall issue and should be recorded as such.
I should also note that someone over at TeX.SX pointed me to the standalone class which provides an option convert which, using -shell-escape, can do just about everything I need. Thus,
\documentclass[convert={density=6000,
size=1920x1600,
outext=.png},
border=1cm]{standalone}
\usepackage{fontspec}
\setmainfont{Zapfino}
\pagestyle{empty}
\begin{document}
it's all text
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-engine: xetex
%%% TeX-master: t
%%% End:
ConTeXt is not a modern TeX implementation (like LuaTeX, for instance). It's a macro package for several engines.
Since you want to support specific engines (e.g. XeTeX) and particular macro packages (e.g. ConTeXt), MathJax is not an option. You have to run the TeX engine, create a PDF, and post-process that PDF. I don't know why you chose GIF as a format; the vector format SVG would produce much prettier results, and PNG would be my second choice.
Since you are not very specific about your input, I assume you deal with multi-page input files. You can use Ghostscript to convert the PDF to a series of images.
As you said, you require GIF. According to gs -h, Ghostscript does not support GIF output, so we convert to PNG first:
gs \
-sDEVICE=png256 \
-dNOPAUSE \
-dBATCH \
-dSAFER \
-dTextAlphaBits=4 \
-q \
-r300x300 \
-sOutputFile=output-%02d.png input.pdf
Then use GraphicsMagick or ImageMagick to convert the PNGs to GIFs:
mogrify -format gif output-*.png

Methods of Parsing Large PDF Files

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's 200 MByte only -- for simple text in tables that's huuuuge, unless you have 200000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try with different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this certainly is a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience with that yet).
If "2." works, you may halfway be done already. If it works, you also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest using Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the XPDF folks). There are other, slightly more inconvenient methods available too, but this might do the job already:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages, but only 1-10 (as a proof of concept, to see if it works at all). To extract from every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a "custom encoding vector"), you should ask a new question describing the details of your findings so far. Then you'll need to resort to bigger calibres to shoot down the problem.
At the very least, could anyone point me to a Ruby PDF library for this task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure whether any of these solutions is actually suitable for your problem, especially as you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text at predetermined locations. It may not be possible to determine what the rows and columns are, though that depends on the PDF itself.
The Java libraries are the most robust, and may do more than just extract text. So I would look into JRuby and iText or PDFBox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?
