Ghostscript PDF to text delimiter

I'm trying to do a conversion of a PDF to text with Ghostscript, with this command:
-dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=bla.txt c:\temp\example.pdf
My problem is with the separation of the fields/columns. Some of my fields get separated without any space or tab in between; for example, three columns "CAT", "DOG", "12345" will output as CATDOG12345.
Is there any way I can specify a delimiter to be used, so my text would come out "CAT|DOG|12345"?
Thanks in advance

You could modify the source. However, this simply should not happen unless the original had no space between the text fragments. You don't say what version of Ghostscript you are using, and you haven't supplied an example, so it's not really possible to say more.
You could always output the text in pseudo-XML format and pick up the fragments and their locations yourself.
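If it helps, the txtwrite output format is selected with the -dTextFormat switch; as far as I recall, values 0 and 1 produce the pseudo-XML output with position information, while 3 (the default) produces plain UTF-8. Something along these lines (file names are placeholders):
gswin64c -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dTextFormat=0 -sOutputFile=bla.xml c:\temp\example.pdf
You could then read the reported positions and insert a "|" yourself wherever the horizontal gap between fragments is large enough.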

Related

Can I bulk-remove links from a pdf from the command line?

I'm downloading some newspapers as PDF (for posterity). One title is a pain: it includes URI links in the PDF itself, and if you accidentally click these it opens a browser tab to a page that 500s. It's not so bad on a desktop computer, but a pain in the butt if someone is reading it with a tablet. Each issue has approximately 200 of these links.
For a different title, it was as simple as using QPDF, like so:
qpdf --qdf --object-streams=disable file temp-file
This puts the temp version into postscript mode or something, and I was able to nuke the links with something like this:
s/obj\n<<\n( \/A <<\n \/S \/URI.+?)>>\nendobj/"obj\n<<\n" . " " x length($1). ">>\nendobj"/sge
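For reference, I ran that substitution over the whole slurped file with a one-liner along these lines (the exact flags are from memory, so treat them as approximate):
perl -0777 -i -pe 's/obj\n<<\n( \/A <<\n \/S \/URI.+?)>>\nendobj/"obj\n<<\n" . " " x length($1) . ">>\nendobj"/sge' temp-file
The /e modifier evaluates the replacement as code, so each captured link body is overwritten with the same number of spaces and the file's byte offsets are preserved.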
This still works. However, a 15 MB original PDF becomes a 108 MB "fixed" PDF. I can accept some bloat, but 720% is a bit absurd (I think it was more like 10% on the other title). Whenever I google for how to do this, I get results for Acrobat Reader and how you can click around in 20 menus to do it... does no one that uses Adobe products ever want to automate this stuff? There are between 180 and 300 links in a typical issue, spread across 45-150 pages (Sunday editions).
Are there any tools that can do this? Are there any clever arguments to qpdf that will make this more reasonable?
PS Yes I know it's hacky as hell to just overwrite the URIs with spaces, but I've never managed to figure out how to remove the objects entirely since their references also have to be removed.
You can do this with the community edition of cpdf: https://community.coherentpdf.com/
To remove all links in a PDF (well, to replace them with an empty link):
cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '""' -o out.pdf
This does not remove the annotations; it leaves them in place with an empty link, so clicking on them won't go anywhere. You could replace with a working URL too, of course:
cpdf -replace-dict-entry /URI cpdfmanual.pdf -replace-dict-entry-value '"https://www.google.com/"' -o out.pdf
(You can also use -replace-dict-entry-search to replace only certain URLs - see the manual.)
Or, if you just want rid of all the annotations (link and non-link):
cpdf -remove-annotations in.pdf -o out.pdf
You can use HexaPDF (you need to have Ruby installed and then use gem install hexapdf to install HexaPDF) and the following small script to remove the links:
require 'hexapdf'

HexaPDF::Document.open(ARGV[0]) do |doc|
  doc.pages.each do |page|
    # Select only the link annotations on this page and delete each one
    # from the page's /Annots array.
    page.each_annotation.select {|annot| annot[:Subtype] == :Link}.each do |annot|
      page[:Annots].delete(annot)
    end
  end
  # Write the result next to the original, with optimization enabled.
  doc.write(ARGV[0] + '_processed.pdf', optimize: true)
end
Then batch execute the script for all the files you want the links removed.
Note that this will remove all links.
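For example, if the script is saved as remove_links.rb (the file name here is just an example), a small shell loop can run it over every PDF in a directory:
for f in *.pdf; do ruby remove_links.rb "$f"; done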
Just to round off the options, I would suggest the best approach is potentially a dedicated PDF command-line tool such as cpdf (see the answer by johnwhitington) or a dedicated library like iText.
There are several alternative methods for the kind of batch text editing you are doing with qpdf.
"temp version into postscript mode or something"
That step converts the PDF into a plain, decompressed text/PDF hybrid (QDF) so you can run sed or a similar string editor over it. The key point is that the edited file is still marked as an editable QDF-1.0 version, so it needs converting back into a conventional PDF with recompressed binary streams.
1) qpdf
At the end of a bloating edit exercise, the idea is to convert back to a conventional application/pdf. Use
fix-qdf file-temp.pdf > out.pdf
to tidy up the object numbering and cross references, and then
qpdf --compress-streams=y out.pdf outfixed.pdf
to recompress the streams back into a fixed PDF.
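Putting the whole qpdf round trip together, a rough sketch (file names are placeholders) is:
qpdf --qdf --object-streams=disable in.pdf temp.pdf
(edit temp.pdf with sed/perl, keeping replacements the same length)
fix-qdf temp.pdf > out.pdf
qpdf --compress-streams=y --object-streams=generate out.pdf outfixed.pdf
Adding --object-streams=generate on the final pass should claw back some of the size the QDF step added.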
Other cross-platform options are:
2) pdftk
$ pdftk infile.pdf output outfile.pdf uncompress
edit with vim, sed, or whatever scripting method you prefer, then
$ pdftk outfile.pdf output fixedfile.pdf compress
3) mutool
mutool clean -d [options] input.pdf [output.pdf] [pages]
-d Decompress streams. This will make the output file larger, but provides easy access for reading and editing the contents with a text editor.
-i Toggle decompression of image streams. Use in conjunction with -d to leave images compressed.
-f Toggle decompression of font streams. Use in conjunction with -d to leave fonts compressed.
-a ASCII Hex encode binary streams. Use in conjunction with -d and -i or -f to ensure that although the images and/or fonts are compressed, the resulting file can still be viewed and edited with a text editor.
Whichever options you use will need to be reversed when recompressing.
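As a rough sketch (the -z option to re-deflate streams is from memory, so check mutool clean's usage output first):
mutool clean -d -i -f input.pdf editable.pdf
(edit editable.pdf, preserving stream lengths)
mutool clean -z editable.pdf recompressed.pdf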
NOTE
Text editors can potentially corrupt binary fonts and binary images, so watch for corruption if the editor changes the encoding or line feeds. A pdftk-decompressed file may show an image stream decompressed cleanly into simple text, but beware: any change of end-of-line characters by the editor will break up that stream.
Additionally, when making text edits that are not simple byte-wise "find and replace", the xref table can be corrupted too badly to be re-indexed by recompression, so try to overwrite with the same number of characters when using a text-editing method.
SIDE NOTE
Even if you remove the external hyperlink actions, if the URL text itself is still present the reader will still offer that exploitable action. It's the same as https://google.com here, except that HTML will usually highlight it with a blue underline.
Hence, ensure the reader's security settings are on.

RTF Template - Excel output didn't display full data

I created an RTF template in MS Word. My problem is with wrapping of text in the output Excel cell.
The data gets wrapped in the output cell, but the full data is not visible when I open the xls file.
I tried:
-Uncheck Wrap text.
-Resize the column width.
-Check Fit text.
-Check Automatically resize to fit content.
but it didn't work. Can anyone help me find what the problem is?
Regards,
Mint
If you don't have any formatting requirements, and you just need to export data into something Excel can easily work with, try e-text templates. Use the comma-separated version, not the fixed width.

macOS using html tags in nstextfield

I'm wondering whether there is any way to make this possible:
I have an NSTextField (or NSTextView). I also have one button; clicking it should activate bold mode for the selected text, or for text that will be written afterwards.
The first idea I had is to use attributes for the characters that will be written next, but this idea is not so good, as I will need to save that string to a file later. I can save an attributed string, but that doesn't give me the proper format; what I would like to see is something like <b>…</b> or similar.
If I understand correctly, your "first idea" is correct. Within your program you use NSAttributedString to add bold etc. to your text. When you wish to save the text you can convert to HTML, or a number of other formats, and reading these formats and converting back to NSAttributedString is also supported. A good place to start is Formatted Documents and Attributed Strings.

Spacing issue between letters while converting Word to PDF on Windows

I have a Word document (docx) of Urdu text in the Jameel Noori Nastaleeq font. In Word it shows as a 10-page file, but after exporting to PDF it shows as an 11-page file because every letter contains extra space.
Can anyone please provide information?
Edited:
Please download the file from
File
It has to do with the XML formatting of Word. When any text is pasted into Word (while the font is Jameel Noori Nastaleeq), Word places extra formatting in between the words. That formatting shows fine in Word; however, when the file is converted into PDF the extra space becomes visible. When the text is merely typed in Word, the formatting is applied to entire paragraphs rather than words. That is why a typed document doesn't contain the extra spaces.

Add page to multiple PDFs in batch without messing with fonts

I'm trying to use Ghostscript to append a PDF as "last page" to multiple other PDFs. The problem I'm encountering is that Ghostscript walks through the whole PDF and does a bunch of font substitution.
I'm using the following batch script:
FOR %%G IN (*.pdf) DO IF NOT %%G==lastpage.pdf gswin64c -sDEVICE=pdfwrite -sOutputFile="output\%%G" -dNOPAUSE -dBATCH "%%G" lastpage.pdf
Example Error:
Page 12
Substituting font Courier for GGCJBF+Courier.
I will also sometimes get other errors, like this:
jbig2dec FATAL ERROR decoding image: prevent DOS while decoding height classes (segment 0x00)
failed to create parsed JBIG2GLOBALS object.
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
All I need gs to do is append my lastpage.pdf to the existing PDFs without walking through the entire PDF I'm appending to, especially with font substitution, because I will not have most of the fonts other people are using in their PDFs.
Is it possible in gs to simply append without walking through every page of the PDF? Is there another tool that will allow appending of PDFs in batches without this issue?
You need to be aware that Ghostscript does not simply manipulate the incoming PDF file, so you aren't 'appending' a page. What it does is interpret the incoming file into marking operations, pass those to a device, and that device takes further action on them. Rendering devices write to a bitmap, pdfwrite reassembles the marking operations into a brand new file.
That's why it 'walks through the whole file'; it's the way it works. There are advantages to this (it's possible to alter the file contents, for example) and disadvantages.
Now if you are getting a font substitution for an embedded font, there's something wrong with the embedded font (or possibly you are using a really old version of Ghostscript with a bug). You could try a newer version of Ghostscript but you're never going to get away from processing the entire input file.
Why not try pdftk?
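For example, mirroring your existing batch loop (the output folder and file names are assumed from your script), something like this should do it:
FOR %%G IN (*.pdf) DO IF NOT %%G==lastpage.pdf pdftk "%%G" lastpage.pdf cat output "output\%%G"
pdftk's cat operation copies the page objects across without re-interpreting the content streams, so the embedded fonts should be left alone.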
