What is wrong with this PDF file? - debugging

I have to work with a PDF form created by a person unknown to me. Why did the program with which the form was created (Word + PDF export?) split the term "Stunde" into "S", "t" and "unde" in line 6909 of the decoded PDF? There is no visual break between the three parts.
/TT1 1 Tf
11.04 0 0 11.04 59.16 476.1203 Tm
(Datum)Tj
/C2_1 1 Tf
<0003>Tj
/TT1 1 Tf
(der)Tj
0.424 -1.315 Td
(Tätigkeit)Tj
-0.0022 Tc 0 11.04 -11.04 0 261.24 437.7203 Tm
[(Ve)-4.6<7267fc74>-4.2(ungssat)-4.2(z)]TJ
/C2_1 1 Tf
0 Tc <0003>Tj
/TT1 1 Tf
-0.0021 Tc 0.935 -1.315 Td
[<2880>-6.1(/)-7.2(S)0.8(t)-4.1(unde)-4.5(\))]TJ % <<< the important line
0 Tc 11.04 0 0 11.04 340.92 468.8003 Tm
(Anlass/Art)Tj
/C2_1 1 Tf
To get the source code above, I decoded the PDF file as described here. I have no know-how concerning the PDF file format.
Background: I had to replace the word "Stunde", it drove me crazy to find the place where "Stunde" was written (in parts) within the source code, since no free PDF editor seems to be able to work with horizontal text without problems.
Academic Bonus questions: Is it possible to set the sum over a column as default value for a form field? (Modifiable; changed every time the column is changed.) Why was I able to replace "Stunde" with "Einsatz" without making the PDF file corrupt due to now irregular offsets?

Why did the program with which the form was created (Word + PDF export?) split the term "Stunde" into "S", "t" and "unde" in line 6909 of the decoded PDF?
As @gettalong mentioned in his answer, in your case this most likely has been done to apply kerning.
If you start looking into the outputs of some other PDF producers, you'll see that this export from Word actually is very unobtrusive in regard to splitting words:
there are PDF producers that draw each character individually after explicitly setting the text matrix for it, and
there also are PDF producers that have the width information for the characters of the used fonts set to zero and use the numbers in TJ instructions to forward the current text matrix between characters accordingly.
And this doesn't cover all the variants to be found, not by far...
Thus,
I had to replace the word "Stunde", it drove me crazy to find the place where "Stunde" was written (in parts) within the source code
in your case replacing actually was a fairly trivial task...
Is it possible to set the sum over a column as default value for a form field? (Modifiable; changed every time the column is changed.)
If all the column values in question are stored in form fields, you can use JavaScript to recalculate sums after form changes. To have it serve as "default" only, you can use some other (hidden) field for a flag whether the field has already been touched. Beware, though: JavaScript is not supported by all PDF viewers. Furthermore, the JavaScript object model for PDF is not specified in an independent (like ISO) specification but in an Adobe one which can make interpretation of the specification biased.
Why was I able to replace "Stunde" with "Einsatz" without making the PDF file corrupt due to now irregular offsets?
As we don't know how exactly you applied the changes, this obviously is hard to tell.
Most likely, though, you did corrupt the PDF, and the PDF viewers you opened it in merely repaired the corruption under the hood. There is a strong tendency in PDF viewers to do such under-the-hood repairs without informing the user; the result is that a large part of the PDFs in the wild are actually broken.

You don't see a visual break, but the standard distance between "S", "t" and "unde" has been changed nonetheless. This is done by PDF writers that support e.g. kerning, so that the word appears nicer. This is the reason why it is split that way.
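To find a word that has been split this way, it can help to strip the kerning numbers from a TJ array and join the string pieces that remain. Below is a minimal Python sketch of that idea. The name tj_text is made up for illustration, and the sketch assumes simple literal strings without nested parentheses or octal escapes. Hex strings such as <2880> are shown as ? because decoding them needs the font's encoding.

```python
import re

# One token is either a literal string "(...)" with backslash escapes,
# a hex string "<...>", or a kerning number.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|-?\d*\.?\d+")


def tj_text(array_source: str) -> str:
    """Join the string pieces of a TJ array, dropping the kerning numbers."""
    pieces = []
    for token in TOKEN.findall(array_source):
        if token.startswith("("):
            # Undo simple escapes such as \) and \\ (octal escapes are not handled).
            pieces.append(re.sub(r"\\(.)", r"\1", token[1:-1]))
        elif token.startswith("<"):
            # Hex-encoded glyph codes cannot be decoded without the font.
            pieces.append("?")
        # Anything else is a kerning number and is ignored.
    return "".join(pieces)


line = r"[<2880>-6.1(/)-7.2(S)0.8(t)-4.1(unde)-4.5(\))]"
print(tj_text(line))
```

Running such a helper over every TJ line of the decoded file and searching the joined result for "Stunde" would point to the line in question.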


To build a flow using Power Automate to download linked csv report in gmail

I'm trying to create a flow using Power Automate (which I'm quite new to) that can get the link/URL in an email I receive daily, then download the .csv file that normally a click to the link would do, and then save the file to a given local folder.
An example of the email I get: [screenshot of the daily email]
I searched the Power Automate Community and found an insightful post and answer that almost solved it. However, after I followed the steps and built the flow, it kept failing at the Compose step.
[Screenshots: the flow and the error message]
Expression used:
substring(body('Html_to_text'),add(indexOf(body('Html_to_text'),'here'),5),sub(indexOf(body('Html_to_text'),'Name'),5))
It seems the expression can't actually extract the URL. I'm not sure why, and I couldn't find any other posts that help.
Please share any approaches or workarounds that you think may help me solve the problem. Thanks!
We need to break down the parts of the function here, which needs 3 pieces of information:
substring(1 text to search, 2 starting position of the text you want, 3 length of text)
For example, if you were trying to return an unknown number from the text dog 4567 bird
Our function would have 3 parts.
body('Html_to_text'): this part supplies the text we are searching in.
add(indexOf(body('Html_to_text'),'dog'),4): this part finds the position 4 characters after the start of the word dog (3 letters for dog + the space), which is where the number starts.
sub(sub(indexOf(body('Html_to_text'),'bird'),1),add(indexOf(body('Html_to_text'),'dog'),4)): I've changed the structure of your code here because this part needs to return the length of the URL, not the ending position. So here we take the position of the end of the URL (the position of the word bird minus one for the space) and subtract from it the position of the start of the URL (the position of the word dog plus 4) to get the length. For dog 4567 bird that is (9 - 1) - (0 + 4) = 4.
In your HTML to text output, check what the text looks like, pick a word that appears before the URL starts and a word that appears after the URL ends, and count the exact number of characters between those words and the URL. You can then put those words and counts into your expression.
More generally, when you have a complicated problem that you need to troubleshoot, you can break it down into steps. For example, rather than putting that big mess of code into a single block, you can put each chunk of the code in its own Compose action, and then use one final Compose to bring them all together. That way, when you run the flow, you can see what each part outputs or where it fails, and experiment from there to discover what is wrong.
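The start/length arithmetic can also be checked outside Power Automate. Here is a small Python sketch of the same logic, not Power Automate code; the function name between and its skip/trim parameters are made up for illustration. Like indexOf, Python's str.index returns a zero-based position.

```python
def between(text: str, before: str, after: str, skip: int, trim: int) -> str:
    """Return the text between two marker words.

    skip: characters from the start of `before` to the start of the wanted text
    trim: characters between the end of the wanted text and the start of `after`
    """
    start = text.index(before) + skip   # like add(indexOf(text, before), skip)
    end = text.index(after) - trim      # like sub(indexOf(text, after), trim)
    length = end - start                # substring() wants a length, not an end position
    return text[start:start + length]


print(between("dog 4567 bird", "dog", "bird", skip=4, trim=1))
```

Printing start, end and length separately plays the same role as splitting the expression over several Compose actions.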

PDFClown MarkerContent gives only first two ContentObjects

I am a newbie to PDF Clown and need help parsing my PDF contents.
My PDF has a huge number of marked content sections, which show up when I look at the content as a stream.
But I am not able to parse them into objects to extract the path information contained within, which is my objective.
Here is my code -
if (level.Contents[i] is MarkedContent)
{
    PdfDataObject ContentDataObj = level.Contents.BaseDataObject;
    PdfIndirectObject pdfIndirectObject = level.Contents.BaseDataObject.IndirectObject;
    PdfStream ContentStream = (PdfStream)ContentDataObj.Resolve();
    ContentParser contentParser = new ContentParser(ContentStream.GetBody(true).ToByteArray());
    IList<ContentObject> markerContentObjList = contentParser.ParseContentObjects();
    // Here I am getting only two content objects, whereas the stream has so many distinct marked contents
    for (int k = 0; k < markerContentObjList.Count; k++)
    {
    }
}
Below is the DOM Inspector screenshot and Stream data
In Short
There are multiple errors in the content streams of your PDF, in particular errors that close more objects than are opened. This most likely is causing the early stop of parsing. Even if it is not, PDF Clown would associate starts and ends of objects differently than intended. Thus, the only real fix of the issue is to ask the source of the documents to provide a non-broken version.
The First Content Stream
The screenshot you provided shows the first content stream of your page. The second content stream of that page exhibits the same issues as this one.
Non-Matching Starts and Ends of Marked Content Sequences
If we look at the marked content operators, we see
/OC /Heading BDC
...
EMC
EMC
/OC /Heading BDC
...
EMC
As you can see, there are two EMC operators for the first BDC. This is invalid. Confer ISO 32000-2 section 14.6 Marked content.
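A balance check like this is easy to automate. The following Python sketch is not PDF Clown code; it assumes the content stream has already been split into a list of operator names, and the function name check_marked_content is made up for illustration. It reports each EMC that has no open BDC or BMC to close.

```python
def check_marked_content(operators):
    """Return (positions of unmatched EMC operators, sequences left open)."""
    depth = 0
    unmatched = []
    for position, op in enumerate(operators):
        if op in ("BDC", "BMC"):
            depth += 1                      # a marked content sequence starts
        elif op == "EMC":
            if depth == 0:
                unmatched.append(position)  # EMC without a matching start
            else:
                depth -= 1                  # closes the innermost open sequence
    return unmatched, depth


# Operator order as in the excerpt above: BDC ... EMC EMC BDC ... EMC
print(check_marked_content(["BDC", "EMC", "EMC", "BDC", "EMC"]))
```

For the excerpt, the second EMC (position 2) is reported as unmatched.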
Invalid Fill Operator
Furthermore, there is a Fill operator directly following a text object:
BT
...
ET
f
This is also invalid: path painting operators are only allowed after a path object or a clipping path object, not after a text object. Confer ISO 32000-2 Figure 9 Graphics objects.
A Related PDF Clown Issue
Actually there is a bug in PDF Clown which makes processing of marked content with PDF Clown impossible anyway: PDF Clown assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap, see this answer for details. This assumption is wrong and results in incorrect graphic state contents as explained in that answer.
Thus, one should patch marked content support out of PDF Clown as explained there to at least have proper graphics state information. Thereafter, obviously, you cannot properly process marked content unless you add correct support for it yourself.
Why PDF Clown Stops at the End of the First Stream
As you observed, PDF Clown stops not after the extra EMC but instead at the end of the first content stream.
This is due to the PDF Clown issue explained above: based on the assumption that marked content sections and save/restore graphics state blocks are properly contained in each other, PDF Clown simply makes EMC and Q close the most recently opened and still open marked content section or save/restore graphics state block, without checking whether the opening and closing operators are of the same kind.
Thus, it matches opening and closing operators in your stream like this:
[Start of page content]
. q
. . /OC /Heading BDC
. . EMC
. EMC
. /OC /Drawing BDC
. EMC
Q
So for PDF Clown that last Q does not match the initial q in the content but the start of page content itself.
I think that PDF Clown stops parsing here because it assumes it has found the end of page contents.

Other options to resize barcode for zebra printer using ZPL?

I want to print a Code 128 barcode with a Zebra printer. But I can't get it to the size I want, because the barcode is either too small or too big for the label size of 40x20 mm. Is there anything else I can try besides changing the module width and ratio with ^BY (Bar Code Field Default)?
^XA^PQ2^LH0,0^FS
^MUM
^GB40,20,0.1,B^FS
^FO1.5,4
^BY0.2
^BCN,10,N,N
^FD*030493LEJCG002999*^FS
^FO8,15
^A0N,3,3
^FD*030493LEJCG002830*^FS
^MUD
^XZ
Above script gives me a label that looks like this:
But when I just decrease the module width to 0.1 (which is the lowest) the barcode becomes too small and may be problematic to scan with a hand scanner:
Code-128 is a fixed-ratio code, so you would appear to have the choice of two sizes. You may be able to solve the problem by using a 300dpi printer in place of a 200dpi one.
If you can change the format (and I'm intrigued by the barcode and readable being different values) then you could save a little by printing one number-sequence and one alpha-sequence, as an even count of numerics will be encoded as alphabet C so you'd save one change-alphabet element.
Do you really need the * on each end?
Otherwise, perhaps code 39 (which prints the * if you use the print-interpretation-line option) would suit your purposes better.
Another possibility is to change code sets on the fly. Try something like:
^XA^PQ2^LH0,0^FS
^MUM
^GB60,20,0.1,B^FS
^FO1.5,4
^BY0.2
^BCN,10,N,N
^FD>:*>5030493>6LEJCG>5002830>6*^FS
^FO8,15
^A0N,3,3
^FD*030493LEJCG002830*^FS
^MUD
^XZ
This needs fewer symbols to encode your data, and it works best if you can structure the content so that all the alpha characters are at one end or the other.
Or (depending on your firmware) you could use automatic mode: ^BCN,10,N,N,N,A
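To see why only a few sizes are available, and what saving symbols buys, here is a rough width estimate in Python. It relies on the Code 128 layout (every symbol is 11 modules wide, the stop pattern 13) and on the assumption that the printer rounds the module width to whole dots, with about 8 dots/mm on a 203dpi head and 12 dots/mm on a 300dpi head. The function name is made up for illustration, and quiet zones are not included.

```python
def code128_width_mm(data_symbols: int, dots_per_module: int, dots_per_mm: int) -> float:
    """Width of the bars of a Code 128 barcode, excluding quiet zones."""
    # start symbol + data symbols + check symbol (11 modules each) + stop pattern (13 modules)
    modules = 11 * (data_symbols + 2) + 13
    return modules * dots_per_module / dots_per_mm


# *030493LEJCG002830* is 19 characters, so 19 data symbols in code set B only.
# With the >5 / >6 switches it is 17 symbols:
# * + switch + 3 digit pairs + switch + 5 letters + switch + 3 digit pairs + switch + *
for dots_per_mm in (8, 12):
    for dots_per_module in (1, 2, 3):
        plain = code128_width_mm(19, dots_per_module, dots_per_mm)
        switched = code128_width_mm(17, dots_per_module, dots_per_mm)
        print(dots_per_mm, dots_per_module, round(plain, 1), round(switched, 1))
```

Under these assumptions the achievable widths jump in steps of one dot per module, which is why a finer print head gives more usable sizes on a 40 mm label.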

Convert PDF files to PDF/A via Ghostscript

I'd like to convert arbitrary PDF files to PDF/A with Ghostscript 9.15.
Is Ghostscript able to create PDF/A-3b conformant PDFs? There is no parameter which represents a PDF/A conformance level, so I assume there is no possibility. Or is there anything I have overlooked?
I was following a blog post where a Windows batch file is used to convert from PDF to PDF/A (see http://www.mcbsys.com/techblog/2013/04/batch-convert-pdf-to-pdfa/). The gs invocation in the batch is:
"%gs_path%\gswin64c" ^
-dPDFA ^
-dNOOUTERSAVE ^
-sProcessColorModel=DeviceRGB ^
-sDEVICE=pdfwrite ^
-o "GS_%file1%" ^
-dPDFACompatibilityPolicy=1 ^
"%currentdir%\PDFA_def.ps" ^
%inputfilelist%
The PDFA_def.ps is an adjusted version of the official one:
%!
% This prefix file for creating a PDF/A document is derived from
% the sample included with Ghostscript 9.07, released under the
% GNU Affero General Public License.
% Modified 4/15/2013 by MCB Systems.
% Feel free to modify entries marked with "Customize".
% This assumes an ICC profile to reside in the file (AdobeRGB1998.icc),
% unless the user modifies the corresponding line below.
% The color space described by the ICC profile must correspond to the
% ProcessColorModel specified when using this prefix file (GRAY with
% DeviceGray, RGB with DeviceRGB, and CMYK with DeviceCMYK).
% Define entries in the document Info dictionary :
/ICCProfile (... PATH TO ... AdobeRGB1998.icc) % Customize.
def
[ /Title (Title) % Customize.
/DOCINFO pdfmark
% Define an ICC profile :
[/_objdef {icc_PDFA} /type /stream /OBJ pdfmark
[{icc_PDFA} <</N systemdict /ProcessColorModel get /DeviceGray eq {1} {systemdict /ProcessColorModel get /DeviceRGB eq {3} {4} ifelse} ifelse >> /PUT pdfmark
[{icc_PDFA} ICCProfile (r) file /PUT pdfmark
% Define the output intent dictionary :
[/_objdef {OutputIntent_PDFA} /type /dict /OBJ pdfmark
[{OutputIntent_PDFA} <<
/Type /OutputIntent % Must be so (the standard requires).
/S /GTS_PDFA1 % Must be so (the standard requires).
/DestOutputProfile {icc_PDFA} % Must be so (see above).
/OutputConditionIdentifier (AdobeRGB1998) % Customize
>> /PUT pdfmark
[{Catalog} <</OutputIntents [ {OutputIntent_PDFA} ]>> /PUT pdfmark
So, I use AdobeRGB1998.icc, which is obviously usable for PDF files with RGB color space. Depending on the -sProcessColorModel value (DeviceRGB) a correct value is printed out.
The conversion works for all files. But when I validate the created PDF file against PDF/A-1b, I get different results depending whether the input file has RGB color space or not (e.g. CMYK). So, when I have an input PDF file which uses CMYK color space, the file gets converted by the script, but the validator says something like this:
"input.pdf", 1, 38, 0x03418614, "A device-specific color space (DeviceCMYK) without an appropriate output intent is used.", 1
"output.pdf", 20, 0, 0x83410612, "The document does not conform to the requested standard.", 1
My question: Is there a way to get the conversion done for arbitrary files (i.e. independent of the used color space in the input file)?
Update
@KenS Thanks for your answer. I've updated my initial post to clarify what I want to achieve.
To make it more explicit, I will use an example. There are two files: input1.pdf (seems to use RGB) and input2.pdf (seems to use CMYK). I want to convert both of them to PDF/A-1. Thanks to your hint, I've let go of the above mentioned batch script and instead tested the command directly in the command line. After reading Ps2pdf.htm#PDFA, I have adjusted the (official) PDFA_def.ps so that AdobeRGB1998.icc is used. Then I invoked the following command on both input files (replaced output1.pdf by output2.pdf and input1.pdf by input2.pdf for the second file):
gswin64c.exe -dPDFA=1 -dBATCH -dNOPAUSE -dNOOUTERSAVE \
-sColorConversionStrategy=/RGB \
-sOutputICCProfile=AdobeRGB1998.icc -sDEVICE=pdfwrite \
-sOutputFile=output1.pdf -dPDFACompatibilityPolicy=1 \
"PATH/TO/OFFICIAL/PDFA_def.ps" input1.pdf
The conversion was done without any errors. The output1.pdf seems to be valid, but the output2.pdf is still invalid (tested with 3heights Validator):
"output2.pdf", 1, 40, 0x03418614, "A device-specific color space (DeviceCMYK) without an appropriate output intent is used.", 1
"output2.pdf", 20, 0, 0x83410612, "The document does not conform to the requested standard.", 1
So if I understand your answer correctly, the above command should produce a PDF file which uses the RGB color space, independent of the color space of the input file. If the input file uses CMYK, then the colors have to be translated into RGB with the above command.
If I interpret the first error message correctly, the color space used in output2.pdf is still CMYK (despite command parameters such as ColorConversionStrategy=/RGB). Since I used AdobeRGB1998.icc, the validation error appears.
What am I missing in the above command?
Going back to my original question (which is one step further): Instead of always converting to RGB (or CMYK), I wanted to somehow detect which color space is used in the input file and then dynamically switch to a RGB or CMYK icc file. Is it possible to achieve that?
Ghostscript does not support PDF/A-3. The conformance parameter you are looking for is -dPDFA= where valid values are nothing (defaults to 1), 1 or 2. You can find this documented in ghostpdl/gs/doc/Ps2pdf.htm#PDFA
I'm not sure what you are asking for here though. You must either create a PDF/A file (in level 1 or 2 anyway, I haven't read the revision 3 spec yet) which is RGB or CMYK, because you aren't allowed to use both (you can convert everything to device independent colour of course). The colour space used in the input isn't relevant, other than to decide whether it needs to be converted.
This is something you need to decide, we can't decide it for you. One important reason is that the OutputIntent must be consistent with either RGB or CMYK, and the pdfwrite device doesn't check it; it assumes you chose one which matches the device space you are using for the PDF file (by the way, don't set the ProcessColorModel, use ColorConversionStrategy instead). In your case you have set OutputIntent to AdobeRGB1998, so your colours must be specified either in device independent colour, or RGB.
Given the errors you quote, I would suggest the problem is that you haven't specified -sColorConversionStrategy, so the input colours are not being converted to the required device space. I would further guess that the script you copied this from set -dUseCIEColor, and you didn't copy that bit. DO NOT set -dUseCIEColor, it's a horrible ancient piece of PostScript hackery. Instead set ColorConversionStrategy, which will convert colours in a much better way, as required.
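As an illustration of that advice, here is a Python sketch that only assembles such a command line as an argument list, without -sProcessColorModel and without -dUseCIEColor. The switches are the ones quoted in the question; the function name, the file names and the plain "RGB" value (the question wrote "/RGB") are assumptions, so check the value format against the documentation of your Ghostscript version.

```python
def pdfa_command(input_pdf, output_pdf, pdfa_def, strategy="RGB", level=1, gs="gswin64c"):
    """Build (but do not run) a Ghostscript PDF/A conversion command."""
    return [
        gs,                                       # executable name is an assumption
        f"-dPDFA={level}",                        # PDF/A conformance level: 1 or 2
        "-dBATCH",
        "-dNOPAUSE",
        f"-sColorConversionStrategy={strategy}",  # must match the OutputIntent ICC profile
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={output_pdf}",
        "-dPDFACompatibilityPolicy=1",
        pdfa_def,                                 # the customized PDFA_def.ps prefix file
        input_pdf,
    ]


print(" ".join(pdfa_command("input2.pdf", "output2.pdf", "PDFA_def.ps")))
```

The list can be passed to subprocess.run as is, which avoids the shell-specific line continuation characters.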
Updated answer as this started getting too long for a comment:
I can't immediately see any problems with your command line. Can you share an example PDF file? It's much easier to investigate these things with a solid example. I know from our customers and other free users that pdfwrite is capable of producing conforming PDF/A-1b files.
Regarding the second question: it's not possible to do that, because currently you need to set the OutputIntentProfile to either a CMYK one or an RGB one before you start. You can't just run through the input PDF file until you come to a colour operation and then decide. If you feel like some programming, it could be done by modifying pdfwrite, because the profile isn't actually used till the output is closed.
One problem is that, in order to do the colour conversion, you need to set the underlying ProcessColorModel (this is done for you automatically by ColorConversionStrategy). The only way to change ProcessColorModel is to execute a setpagedevice, which causes an erasepage. Now I think that's actually fixable with pdfwrite: all it does is write a white rectangle over the page, so you should be able to intercept that and not emit it. Otherwise any marks you made before you encountered an RGB or CMYK operation would be underneath the white rectangle.
So essentially no, you can't do it right now. If it's important to you then you could probably modify the code to do so (don't forget you will also need to supply 2 OutputIntent profiles to choose between as well). We've never had a customer request to do this, so we won't likely take it on as a project. Of course if you did get this working we might very well incorporate it into the code base if you were to offer it back to us.

Parsing text files in Ruby when the content isn't well formed

I'm trying to read files and create a hashmap of the contents, but I'm having trouble at the parsing step. An example of the text file is
put 3
returns 3
between
3
paragraphs 1
4
3
#foo 18
****** 2
The word becomes the key and the number is the value. Notice that the spacing is fairly erratic. The word isn't always a word (which doesn't get picked up by /\w+/) and the number associated with that word isn't always on the same line. This is why I'm calling it not well-formed. If there were one word and one number on one line, I could just split it, but unfortunately, this isn't the case. I'm trying to create a hashmap like this.
{"put"=>3, "#foo"=>18, "returns"=>3, "paragraphs"=>1, "******"=>2, "4"=>3, "between"=>3}
Coming from Java, it's fairly easy. Using Scanner I could just use scanner.next() for the next key and scanner.nextInt() for the number associated with it. I'm not quite sure how to do this in Ruby when it seems I have to use regular expressions for everything.
I'd recommend just using split, as in:
h = Hash[*s.split]
where s is your text (e.g. s = open('filename').read). Believe it or not, this will give you precisely what you're after.
EDIT: I realized you wanted the values as integers. You can add that as follows:
h.each{|k,v| h[k] = v.to_i}
