How to add a separator after each word with ghostscript -sDEVICE=txtwrite

I have used ghostscript to successfully extract text from PDFs that have tables.
This simple command works very well:
gswin64c -sDEVICE=txtwrite -o test.txt "c:\reports\sample.pdf"
However, some words get joined together, especially in tables, for example:
234801111111109-12-2014 16:17:04764030208117034 2883253100.00 Payment
234801111111109-12-2014 16:18:461088956908117033 2883253400.00 Payment
234801111111109-12-2014 16:19:48769948208117040 2883253750.00 Payment
should actually be:
2348011111111 09-12-2014 16:17:04 764030208117034 2883253 100.00 Payment
2348011111111 09-12-2014 16:18:46 1088956908117033 2883253 400.00 Payment
2348011111111 09-12-2014 16:19:48 769948208117040 2883253 750.00 Payment
Is there a way to add a separator character at the end of each word?
That would solve this perfectly.

No, sorry, this idea simply won't work.
There is no such thing as a 'word' in a PDF file; there is simply a sequence of character codes and positions. The txtwrite code goes to some lengths to try to reconstruct words by looking at the position of each piece of text and the metrics of the fonts used, but there are no words in the original.
I don't claim this is perfect; if you'd like me to look at it, you will need to supply the original file. The best solution is to open a bug report and attach the file to it.
This is still an area I'm looking at for a different project (RTF output), so now is a good time to report it. I cannot guarantee being able to resolve it, but it may well simply be that the 'rebuild the page layout' code is being too simple-minded about the location of the text.
You can, however, get lower-level output: the XML-like output will give you each fragment of text individually, along with its position on the page. You could use that information yourself to rebuild the content.
The default option tries to build a simple representation of the page by using space characters to reproduce the layout of the original, as far as possible, but I have no illusions that there aren't bugs :-)
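As a rough illustration of that do-it-yourself route: once you have the per-fragment output with coordinates (the txtwrite device has a -dTextFormat switch for the XML-like listing, if I remember correctly), you can emit a separator wherever the horizontal gap between consecutive fragments exceeds a small threshold. The sketch below shows only that grouping step; the coordinates are invented, and in practice you would fill the fragment list from whatever your Ghostscript version actually emits.

GAP = 2.0   # horizontal gap (in output units) above which we call it a word break

def rebuild_line(fragments):
    # fragments: list of (x, width, text) tuples for one line of the page,
    # taken from the low-level txtwrite output; coordinates here are made up.
    fragments.sort(key=lambda frag: frag[0])
    parts, prev_end = [], None
    for x, width, text in fragments:
        if prev_end is not None and x - prev_end > GAP:
            parts.append("|")            # the separator the question asks for
        parts.append(text)
        prev_end = x + width
    return "".join(parts)

# Invented coordinates mimicking one of the joined table rows:
line = [(10, 60, "2348011111111"), (75, 80, "09-12-2014 16:17:04"),
        (160, 70, "764030208117034"), (235, 35, "2883253"),
        (275, 25, "100.00"), (305, 40, "Payment")]
print(rebuild_line(line))
# -> 2348011111111|09-12-2014 16:17:04|764030208117034|2883253|100.00|Payment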

Related

Intermec Printer Language - Tabulation Problem

I'm an ABAP programmer and I was asked to make a minor modification to an IPL label.
Easily done, but now I have been tasked with fixing a long-running error within said label.
I know nothing about IPL, and the lack of an online viewer makes everything worse...
The problem is the "tabulation" right in the middle of the text (I underlined it in blue in the label picture).
I checked the code and there's nothing there that should make that tabulation appear.
I spent a whole month reading manuals and trying to fix it, but nothing changes...
Here's the code and the resulting label:
<STX>R<ETX>
<STX><ESC>C<SI>W791<SI>h<ETX>
<STX><ESC>P<ETX>
<STX>F*<ETX>
<STX>H1;f3;o220,52;c34;b0;h2;w1;d3,300052947-FANDANGOS PRESUNTO 140GX14 LD<ETX>
<STX>H2;f3;o130,52;c33;b0;h1;w1;d3,Val:<ETX>
<STX>H3;f3;o130,204;c34;b0;h1;w1;d3,QTD.Unidade:<ETX>
<STX>H4;f3;o90,33;c34;b0;h0;w1;d3,16/08/21<ETX>
<STX>H5;f3;o90,302;c34;b0;h1;w1;d3,14<ETX>
<STX>B6;f3;o375,44;c2,0;w6;h102;r0;d3,17892840816329<ETX>
<STX>H7;f3;o275,44;c26;b0;h17;w17;d3,17892840816329<ETX>
<STX>H8;f3;o130,490;c34;b0;h0;w1;d3,Lote:<ETX>
<STX>B9;f3;o090,600;c2,0;w2;h45;r0;d3,0005218177<ETX>
<STX>H10;f3;o130,600;c34;b0;h0;w1;d3,0005218177<ETX>
<STX>D0<ETX>
<STX>R<ETX>
<STX><SI>l13<ETX>
<STX><ESC>E*,1<CAN><ETX>
<STX><RS>1000<US>1<ETB><ETX>
(image: resulting label)
Can you guys help me, please??
Edit: Just to make it clear, I drew that blue line on the image to show where the problem is.
Here are some tests I did by changing the data:
(image: Test1)
(image: Test2)
The error always appears at the same point in the label, as long as there's a space in that text.
Have you looked at the raw data of the output? Is it POSSIBLE that what looks like a space is actually some special character that is making IPL choke? The blue mark sits on literally the one character between the "O" and the "1". For grins, you might also try changing that character in the data to a "-" just to confirm the data context. It might even just be a TAB character.
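One quick way to check is to dump the bytes around the suspect spot and look for a hidden TAB or other control character. A minimal sketch in Python (the file name and search text are placeholders for your own data):

with open("label.ipl", "rb") as f:        # placeholder file name
    data = f.read()

needle = b"FANDANGOS"                     # any text near the blue mark
pos = data.find(needle)
if pos != -1:
    window = data[max(0, pos - 10):pos + 40]
    print(window)                         # repr makes \t or \x1b visible
    print(" ".join(f"{b:02x}" for b in window))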
I did IPL years ago and actually went as far as defining a pre-defined label template and generating output that says to use template X (whatever number I created it as), passing along the data that fills the respective fields.
A final option I would throw in is this: take the sample output you have and just force sample data into each of the output areas. So, instead of your literal data, put fake data of a similar shape in, just to see whether the problem is data-specific or something else. For example,
<STX>H1;f3;o220,52;c34;b0;h2;w1;d3,300052947-FANDANGOS PRESUNTO 140GX14 LD<ETX>
becomes
<STX>H1;f3;o220,52;c34;b0;h2;w1;d3,123456789-TESTING-SAMPLEDATA-123XY12-AB<ETX>
Notice it is the same shape of data, but with no spaces, using a dash "-" just for testing. Is there something special about the actual data? This is an approach that has historically served me well for similar strangeness early on when doing IPL labels.
The user decided not to spend any more time on this issue, so now I'm unable to test the label further.
Unfortunately this problem will go unsolved for now. I hope I get another chance to fix it and learn more about IPL.
Thank you so much for your answers!

PDF - Edit raw text without a special paid tool

Is there a way to edit the raw text from a PDF without any special paid software?
So there are PDFs with highlightable text. I assume that the text is stored somewhere in the file.
I tried to just drag & drop a PDF into VS Code, but it mostly showed me unknown characters, plus a little bit of meta text; and if I edit that meta info, the file gets mostly corrupted.
Apart from that, I could not find any of the text content of my desired PDF in the VS Code editor.
Does someone know of a way to inspect and change the source somehow, without special software? I want to edit the contents, not the meta info.
(I use macOS)
The text you see on a PDF page can be constructed in dozens of different ways; there are millions of producers, using potentially hundreds if not thousands of different methods.
Update
The question is about macOS, but for native cross-platform use you need to work with MIME text/pdf to be universally useful. By way of example of how that is possible, on Windows specifically you can write a PDF line by line using, say, cmd; here is a snippet of what was a few dozen lines :-)
echo %%PDF-1.0>demo.pdf
echo %%µ¶µ¶>>demo.pdf
echo/>>demo.pdf
for %%Z in (demo.pdf) do set "FZ1=%%~zZ"
echo 1 0 obj>>demo.pdf
echo ^<^</Type/Catalog/Pages 2 0 R^>^>>>demo.pdf
echo endobj>>demo.pdf
echo/>>demo.pdf
For the fuller, feature-creeping version, now at more than 100 lines and counting, see
https://github.com/GitHubRulesOK/MyNotes/raw/master/MAKE-PDF.cmd
However, although plain text could be the simplest approach, it is rarely used except to prove the conceptual point that it is possible. The rest of the time, "special software" as you call it (a PDF generator/editor) will be used to compress the file objects, most frequently into various optimized binary streams.
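That compression is the main reason you cannot find your text in VS Code: the page content usually sits inside zlib-compressed (FlateDecode) streams. As a very rough illustration (not a robust parser, and the file name is a placeholder), a few lines of Python can inflate those streams so you can at least see the raw text operators:

import re
import zlib

with open("sample.pdf", "rb") as f:       # placeholder file name
    data = f.read()

# Try to inflate everything between "stream" and "endstream"; streams using
# other filters (DCTDecode, LZW, predictors, object streams) will be skipped.
for match in re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL):
    try:
        content = zlib.decompress(match.group(1))
    except zlib.error:
        continue
    if b"Tj" in content or b"TJ" in content:   # text-showing operators
        print(content[:200])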
So some text may be scanned pixels, while other text may be line shapes that look like letters, or at other times plain letters without fonts but with a named style, or even letters with the font included (embedded) in the file (the preferred option).
Each page may be built differently from the others, and thus no two PDFs will generally use the same structure, unless, like a bank statement, they follow a format that does not change much from month to month, even if the balance wobbles about.
So, in summary, the tool that will work best is the one that covers every single permutation that Adobe dreamed of, and still keeps the result a valid Adobe PDF.
Thus Acrobat PRO 3D is on my shelf (even if it is not used from one year to the next).
There are many cheaper editors; the ones I use more often for small modifications are Tracker Xchange and FreePDF PRO, and both have different limitations.
Your choices for macOS will be more limited, so search for the best you are willing to pay for.

How to find foreign language used in "C comments"

I have a large source code base where most of the documentation and source code comments are in English. But one of the minor contributors wrote comments in a different language, spread across various places.
Is there a simple trick that will let me find them? I imagine first a way to extract all comments from the code and generate a single text file (ideally with source file / line number info), then pipe this through some language detection app.
If that matters, I'm on Linux and the current compiler on this project is Clang.
The only thing that comes to mind is to go through all of the code manually and check it yourself. If it's a similar language that doesn't contain foreign letters, consider using something with a spellchecker. This way, the text that isn't recognized gets underlined and is easy to spot.
Other than that, I don't see an easy way to go through with this.
You could make a program that reads the files and prints only the comments to another output file, which you then spell-check; but this may be a waste of time, as you would easily be able to spot the comments yourself.
If you do make a program for that, however, keep in mind that there are three things to check for (a rough sketch follows this list):
If a comment starts with /*, make sure it stops reading when encountering */
If a comment starts with //, only read one line - unless:
a line starting with // ends with \, in which case read the next line as well
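Here is what such a sketch could look like; Python is used here just for brevity (nothing in the question requires C). It implements the three rules above in a simplified way and prints each comment-bearing line with its file and line number; it deliberately ignores corner cases such as comment markers inside string literals, so treat it as a starting point rather than a parser.

import sys

def extract_comments(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        lines = f.readlines()
    in_block = False        # inside a /* ... */ comment
    cont = False            # previous // line ended with a backslash
    for lineno, line in enumerate(lines, 1):
        if in_block:
            print(f"{path}:{lineno}: {line.rstrip()}")
            if "*/" in line:
                in_block = False
        elif cont:
            print(f"{path}:{lineno}: {line.rstrip()}")
            cont = line.rstrip().endswith("\\")
        elif "/*" in line:
            print(f"{path}:{lineno}: {line.rstrip()}")
            in_block = "*/" not in line[line.index("/*"):]
        elif "//" in line:
            print(f"{path}:{lineno}: {line.rstrip()}")
            cont = line.rstrip().endswith("\\")

if __name__ == "__main__":
    for path in sys.argv[1:]:   # e.g. python extract_comments.py src/*.c
        extract_comments(path)

The output can then be piped into a spellchecker or a language-detection tool, as suggested in the question.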
While it is possible to detect a language from a string automatically, you need way more words than fit in a usual comment to do so.
Solution: Use your own eyes and your own brain...

Methods of Parsing Large PDF Files

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's 200 MByte only -- for simple text in tables that's huuuuge, unless you have 200000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try with different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this is certainly a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience with that yet).
If "2." works, you may be halfway done already. You also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest using Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the Xpdf folks). There are other, somewhat more inconvenient, methods available too, but this might already do the job:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages, but only pages 1-10 (for a proof of concept, to see if it works at all). To extract from every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a "custom encoding vector"), you should ask a new question describing the details of your findings so far. Then you need to resort to bigger calibres to shoot down the problem.
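If the -layout extraction does work, the remaining job is splitting each extracted line back into the five columns, including the case where a long Name pushes the remaining columns onto the next line. A rough sketch of that post-processing (in Python only to keep it short; the same logic ports directly to Ruby, and the file name and column boundaries are placeholders you would read off your own output):

import csv

# Placeholder character positions of the five columns in the -layout output.
COLS = [(0, 30), (30, 70), (70, 90), (90, 100), (100, 140)]

def split_row(line):
    return [line[a:b].strip() for a, b in COLS]

def merge(first, second):
    # Join two physical lines that make up one logical record.
    return [(a + " " + b).strip() for a, b in zip(first, second)]

rows, pending = [], None
with open("extracted.txt", encoding="latin-1") as f:   # placeholder file name
    for raw in f:
        fields = split_row(raw.rstrip("\n"))
        if not any(fields):
            continue                                   # blank line
        if pending is not None:
            rows.append(merge(pending, fields))
            pending = None
        elif not any(fields[2:]):
            pending = fields       # overflowed Name: rest follows on next line
        else:
            rows.append(fields)

with open("holders.csv", "w", newline="") as out:
    csv.writer(out).writerows(rows)   # easy to bulk-load into MySQL from here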
"At the very least, could anyone point me to a Ruby PDF library for this task?"
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure if any of these solutions is actually suitable for your problem, especially since you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text in predetermined locations. It may not be possible to determine what the rows and columns are, but that may depend on the PDF itself.
The Java libraries are the most robust, and may do more than just extract text. So I would look into JRuby and iText or PDFBox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?

How to efficiently work with gettext PO files when making small edits to large text values

Looking for tips and/or tools on how to efficiently work with gettext PO files when making small edits to large msgid values.
Example: We have lots of multi-sentence/multi-paragraph messages that are stored in our PO message catalog files. If we make a very minor change to a message, perhaps editing a single sentence or even correcting punctuation, we lose our original translation when we run the msgmerge utility.
Rather than re-translate long messages (that have already gone through an editorial approval process) from scratch, our translators return to backup copies of their PO files and manually search for the text of the last msgid/msgstr translation pair, which they then diff against the current msgid values to see what has changed, followed by a copy-and-paste of the last translation, which they then edit to reflect the updated msgid value.
That's a lot of work! Certainly there must be a better way of handling this type of workflow?
Is there a best practice way to archive and find previous translations that are no longer in a PO file? One idea that comes to mind is to store a unique msg id in the text of our messages or in the comments that precede our message and use this id to retrieve previous msgid/msgstr translation pairs for review. Or are there PO editors or online services that make this process more efficient?
Thank you,
Malcolm
I've been looking for a way to make minor changes to msgids without disturbing existing translations - for instance, typo fixes in the source text. Here's a recipe I've just worked out that doesn't involve websites:
Use msgen from GNU gettext to generate an English-to-English po file:
msgen project.pot >corrections.po
Manually edit the msgstrs in "corrections.po" to reflect the typo fixes made in the source text, so we have a mapping from uncorrected to corrected strings. (I haven't thought about how to automate this bit.)
For each "real" translation (for example ca.po): abuse poswap from the Translate Toolkit (translate-toolkit in Ubuntu) to change the msgids:
poswap -i corrections.po -t ca.po -o ca.new.po
This does seem to lose header comments and obsolete strings from GNU gettext po files, but manually fixing those up is much less work than manually tweaking msgids in each translation (and could probably easily be scripted).
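To apply the poswap step to every catalog in one go, a small loop is enough. A sketch (in Python only for brevity; a shell loop works just as well, and the po/ directory layout is an assumption about your project):

import pathlib
import subprocess

corrections = "corrections.po"                 # the English-to-English mapping file
for po in pathlib.Path("po").glob("*.po"):     # assumed catalog location
    if po.name == corrections or po.name.endswith(".new.po"):
        continue                               # skip the mapping file and old output
    out = po.with_suffix(".new.po")
    subprocess.run(
        ["poswap", "-i", corrections, "-t", str(po), "-o", str(out)],
        check=True,
    )
    print(f"rewrote msgids: {po} -> {out}")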
(Obviously, this should only be used in exceptional circumstances, where you're absolutely sure that none of the translators need the opportunity to re-review their translations.)
Virtaal's translation memory support can probably help with this. If your original units are in the translation memory, it will be shown (with differences) within a certain margin of change (based on Levenshtein distance). It will still contain the original (unmodified) translation, but at least the original text is more easily accessible and the differences highlighted.
I'm not 100% sure, but Pootle might also offer a web based solution. If you need any help, ask in #pootle on FreeNode.
The more general improvement is, of course, to separate/segment the units as far as possible.
