iText - adding Image element generates a corrupt PDF file - image

I'm using iText® 5.2.1 ©2000-2012 1T3XT BVBA and Integration Designer 8.0 to create a PDF file that is exported as a byte array.
I am creating a document with a fair amount of text and want to add a logo at the beginning.
Part of the code that is adding the image is as follows:
BASE64Decoder decoder = new BASE64Decoder();
byte[] decodedBytes = decoder.decodeBuffer(Stringovi.SLIKA1);
Image image1 = Image.getInstance(decodedBytes);
image1.setAbsolutePosition(30f, 770f);
image1.scalePercent(60f);
document.add(image1);
The input image is in byte array format because of the system requirements.
The rest of the document consists of different tables with various content and it's all text.
When I add the image in the aforementioned way, the program finishes and I get a byte output that I run through a Base64 decoder. The resulting PDF cannot be opened, and the error shown is:
"Error [PDF Structure 40]:Invalid reference table (xref)"
I can't see where my mistake is so if anybody could be so kind and point me in the right direction I would very much appreciate it.

The document you presented as a "broken PDF file" is not a complete PDF file: it doesn't end with %%EOF, and it doesn't have a cross-reference table.
This means that you don't have the following line in your code:
document.close();
If you do have this line, it isn't reached. For instance: an exception is thrown causing the code to jump to a catch clause, skipping the close() operation.
The error message saying Invalid reference table (xref) is consistent with that diagnosis. This isn't a problem caused by iText. It's a problem caused by bad coding: not closing the document and/or not dealing with exceptions correctly.
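To check whether the bytes you received are a finished PDF before handing them to a viewer, a rough test is to look for the trailer markers near the end of the data. The following is only a minimal sketch: the function name looks_complete and the 1024-byte window are assumptions made for illustration, and the check does not validate the cross-reference table itself.

```python
def looks_complete(pdf_bytes: bytes) -> bool:
    """Heuristic: a closed PDF has 'startxref' followed by '%%EOF' near its end."""
    tail = pdf_bytes[-1024:]  # assumed window size; the markers sit at the very end
    startxref = tail.rfind(b"startxref")
    eof = tail.rfind(b"%%EOF")
    # Both markers must be present, and %%EOF must come after startxref.
    return startxref != -1 and eof > startxref
```

If this returns False for the decoded output, the writer was most likely never closed, which matches the missing document.close() described above.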

Related

Numpy genfromtxt() error using converters: Converter function does not skip footer

I am using numpy.genfromtxt() to load time data in the string format HH:MM:SS.S and want to modify it with a converter function (translating a UTC time string into float decimal hours). The code raises an error at the end of the file, where the converter tries to read the footer lines even though they are set to be skipped.
Here is my code:
utc_converter= lambda hhmmss: int(hhmmss.split(':')[0])+ int(hhmmss.split(':')[1])/60 + float(hhmmss.split(':')[2])/3600
testtime= np.genfromtxt(path_to_gps, dtype=None, skip_header=headerlines, skip_footer=2, usecols=(9), converters={9: utc_converter}, encoding='utf-8')
Generates the error:
Converter #0 is locked and cannot be upgraded: (occurred line #18859 for value '(21910')
Line #18859 is indeed a footer message. When no converter is used, the data is read in correctly without including the footer. When the footer lines are manually deleted from the data file, and the code modified accordingly (skip_footer=0), the conversion works fine.
How can I solve this?

What is wrong with this PDF file?

I have to work with a PDF form created by a person unknown to me. Why did the program with which the form was created (Word + PDF export?) split the term "Stunde" into "S", "t" and "unde" in line 6909 of the decoded PDF? There is no visual break between the three parts.
/TT1 1 Tf
11.04 0 0 11.04 59.16 476.1203 Tm
(Datum)Tj
/C2_1 1 Tf
<0003>Tj
/TT1 1 Tf
(der)Tj
0.424 -1.315 Td
(Tätigkeit)Tj
-0.0022 Tc 0 11.04 -11.04 0 261.24 437.7203 Tm
[(Ve)-4.6<7267fc74>-4.2(ungssat)-4.2(z)]TJ
/C2_1 1 Tf
0 Tc <0003>Tj
/TT1 1 Tf
-0.0021 Tc 0.935 -1.315 Td
[<2880>-6.1(/)-7.2(S)0.8(t)-4.1(unde)-4.5(\))]TJ % <<< the important line
0 Tc 11.04 0 0 11.04 340.92 468.8003 Tm
(Anlass/Art)Tj
/C2_1 1 Tf
resulting in
[image not preserved]
To get the source code above, I decoded the PDF file as described here. I have no know-how concerning the PDF file format.
Background: I had to replace the word "Stunde", it drove me crazy to find the place where "Stunde" was written (in parts) within the source code, since no free PDF editor seems to be able to work with horizontal text without problems.
Academic Bonus questions: Is it possible to set the sum over a column as default value for a form field? (Modifiable; changed every time the column is changed.) Why was I able to replace "Stunde" with "Einsatz" without making the PDF file corrupt due to now irregular offsets?
Why did the program with which the form was created (Word + PDF export?) split the term "Stunde" into "S", "t" and "unde" in line 6909 of the decoded PDF?
As @gettalong mentioned in his answer, in your case this most likely has been done to apply kerning.
If you start looking into the outputs of some other PDF producers, you'll see that this export from Word actually is very unobtrusive in regard to splitting words:
there are PDF producers that draw each character individually after explicitly setting the text matrix for it, and
there also are PDF producers that have the width information for the characters of the used fonts set to zero and use the numbers in TJ instructions to forward the current text matrix between characters accordingly.
And this doesn't cover all the variants to be found, not by far...
Thus,
I had to replace the word "Stunde", it drove me crazy to find the place where "Stunde" was written (in parts) within the source code
in your case replacing actually was a fairly trivial task...
Is it possible to set the sum over a column as default value for a form field? (Modifiable; changed every time the column is changed.)
If all the column values in question are stored in form fields, you can use JavaScript to recalculate sums after form changes. To have it serve as "default" only, you can use some other (hidden) field for a flag whether the field has already been touched. Beware, though: JavaScript is not supported by all PDF viewers. Furthermore, the JavaScript object model for PDF is not specified in an independent (like ISO) specification but in an Adobe one which can make interpretation of the specification biased.
Why was I able to replace "Stunde" with "Einsatz" without making the PDF file corrupt due to now irregular offsets?
As we don't know how exactly you applied the changes, this obviously is hard to tell.
Most likely, though, you did corrupt the PDF and the PDF viewers you opened it in merely repair the corruption under the hood. There is a strong tendency in PDF viewers to do such under-the-hood repairs without informing the user; the result is that a large part of the PDFs in the wild actually are broken.
You don't see a visual break, but the standard distance between "S", "t" and "unde" has been changed nonetheless. This is done by PDF writers that support e.g. kerning so that the word appears nicer. This is the reason why it is split that way.
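To find a word that a TJ instruction has split for kerning, it can help to join the string pieces of the array and ignore the numbers between them. The sketch below is a simplified illustration, not a full PDF string parser: the function name tj_text is made up, it only handles simple backslash escapes such as \( and \), it expects hex strings with an even number of digits, and it assumes the font uses a single-byte WinAnsi-like encoding (decoded here as cp1252), which the question does not confirm.

```python
import re

# One token of a TJ array body: literal string "(...)", hex string "<...>", or a number.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|[-+]?\d*\.?\d+")

def tj_text(array_body: bytes, encoding: str = "cp1252") -> str:
    """Concatenate the string operands of a TJ array, skipping kerning numbers."""
    parts = []
    for tok in TOKEN.findall(array_body):
        if tok.startswith(b"("):
            # Literal string: drop the backslash of simple escapes like \( and \).
            raw = re.sub(rb"\\(.)", rb"\1", tok[1:-1])
        elif tok.startswith(b"<"):
            # Hex string: two hex digits per byte.
            raw = bytes.fromhex(tok[1:-1].decode("ascii"))
        else:
            # Number: a position adjustment in thousandths of a text space unit.
            continue
        parts.append(raw.decode(encoding))
    return "".join(parts)
```

Under these assumptions, the array from the line marked as important joins to "(€/Stunde)", and the array two instructions earlier joins to "Vergütungssatz".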

PDFClown MarkerContent gives only first two ContentObjects

I am a newbie to PDFClown and need help parsing my PDF contents.
My PDF has a huge number of MarkedContents, which are displayed when the content is converted to a Stream.
But I am not able to parse them into objects to extract the path information contained within, which is my objective.
Here is my code -
if(level.Contents[i] is MarkedContent)
{
PdfDataObject ContentDataObj = level.Contents.BaseDataObject;
PdfIndirectObject pdfIndirectObject = level.Contents.BaseDataObject.IndirectObject;
PdfStream ContentStream = (PdfStream)ContentDataObj.Resolve();
ContentParser contentParser = new ContentParser(ContentStream.GetBody(true).ToByteArray());
IList<ContentObject> markerContentObjList = contentParser.ParseContentObjects();
// Here I am getting only two ContentObjects, whereas the stream has many distinct marked contents
for (int k = 0; k < markerContentObjList.Count; k++)
{
}
}
Below is the DOM Inspector screenshot and Stream data
In Short
There are multiple errors in the content streams of your PDF, in particular errors that close more objects than are opened. This most likely is causing the early stop of parsing. Even if it is not, PDF Clown would associate starts and ends of objects differently than intended. Thus, the only real fix of the issue is to ask the source of the documents to provide a non-broken version.
The First Content Stream
The screen shot you provided shows your first page content stream:
The second content stream of that page exhibits the same issues as this one:
Non-Matching Starts and Ends of Marked Content Sequences
If we look at the marked content operators, we see
/OC /Heading BDC
...
EMC
EMC
/OC /Heading BDC
...
EMC
As you can see, there are two EMC operators for the first BDC. This is invalid. Confer ISO 32000-2 section 14.6 Marked content.
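A quick way to spot such surplus EMC operators is to count the nesting depth of marked content while walking through the operators. The following is a rough sketch with a made-up function name; it splits the stream on whitespace, which is an assumption that breaks if a string operand happens to contain "BDC", "BMC" or "EMC", so treat it as a diagnostic aid only.

```python
from typing import List

def unmatched_emc_positions(content: str) -> List[int]:
    """Return the token indexes of EMC operators that have no open BDC/BMC."""
    depth = 0
    unmatched = []
    for index, token in enumerate(content.split()):
        if token in ("BDC", "BMC"):
            depth += 1            # a marked content sequence starts
        elif token == "EMC":
            if depth == 0:
                unmatched.append(index)  # nothing left to close
            else:
                depth -= 1
    return unmatched
```

For a stream shaped like the excerpt above, the second EMC is the one reported.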
Invalid Fill Operator
Furthermore, there is a Fill operator directly following a text object:
BT
...
ET
f
This also is invalid; path painting operators are only allowed after a path object or a clipping path object, not after a text object. Confer ISO 32000-2 Figure 9 Graphics objects.
A Related PDF Clown Issue
Actually there is a bug in PDF Clown which makes processing of marked content with PDF Clown impossible anyway: PDF Clown assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap, see this answer for details. This assumption is wrong and results in incorrect graphic state contents as explained in that answer.
Thus, one should patch marked content support out of PDF Clown as explained there to at least have proper graphics state information. Thereafter, obviously, you cannot properly process marked content unless you add correct support for it yourself.
Why PDF Clown Stops at the End of the First Stream
As you observed, PDF Clown stops not after the extra EMC but instead at the end of the first content stream.
This is due to the PDF Clown issue explained above: Based on the assumption that marked content sections and save/restore graphics state blocks are properly contained in each other, PDF Clown simply makes EMC and Q close the most recently opened and still open marked content section or save/restore graphics state block without checking whether it matches alright.
Thus, it matches opening and closing operators in your stream like this:
[Start of page content]
. q
. . /OC /Heading BDC
. . EMC
. EMC
. /OC /Drawing BDC
. EMC
Q
So for PDF Clown that last Q does not match the initial q in the content but the start of page content itself.
I think that PDF Clown stops parsing here because it assumes it has found the end of page contents.
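The matching behaviour described above can be reproduced with a toy model: one shared stack for q/Q and BDC/EMC, where every closing operator pops whatever is on top. This is not PDF Clown's code; the function name and the operator lists are assumptions made purely to illustrate why the final Q ends up closing the page content.

```python
from typing import List, Optional

OPENERS = {"q", "BDC", "BMC"}
CLOSERS = {"Q", "EMC"}

def first_operator_closing_root(operators: List[str]) -> Optional[int]:
    """Return the index of the first closer that arrives with nothing left open."""
    open_blocks = []  # single stack shared by both kinds of blocks
    for index, op in enumerate(operators):
        if op in OPENERS:
            open_blocks.append(op)
        elif op in CLOSERS:
            if not open_blocks:
                return index      # this closer "closes" the page content itself
            open_blocks.pop()     # no check that closer and opener match
    return None
```

Fed with the operator sequence q, BDC, EMC, EMC, BDC, EMC, Q from the listing above, the model returns the index of the last Q, not of the surplus EMC.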

How to get Developer Exception page to show multiple lines of code around exception?

This page https://learn.microsoft.com/en-us/aspnet/core/api/microsoft.aspnetcore.builder.developerexceptionpageoptions states that a DeveloperExceptionPageOptions object can be passed as a parameter to app.UseDeveloperExceptionPage() and one of the properties on the options object is SourceCodeLineCount. Specifically, it says that the SourceCodeLineCount property:
Determines how many lines of code to include before and after the line of code present in an exception's stack frame. Only applies when symbols are available and source code referenced by the exception stack trace is present on the server.
But when I put the following code in the Configure method of the startup.cs class:
app.UseDeveloperExceptionPage( new DeveloperExceptionPageOptions() { SourceCodeLineCount = 10} );
The output in the Developer Exception Page doesn't appear to show the 20 lines of source code that it's supposed to.
How does one get the Developer Exception page to show multiple lines of code around exception?
I'm not sure if you're still having this problem or not, but have you tried browsers other than IE? When I'm running in Chrome, I'm seeing a little [+] symbol on the left of each line number, that can be used to expand each snippet of code.
The line on which the error occurs is highlighted in red, and the SourceCodeLineCount value (in my case, set to 2) determines the number of lines displayed above the line that caused your exception.
See sample screenshot below. Hope this helps!

Validation : how to check if the file being uploaded is in excel format? - Apache POI

Is there any way I can check if the file being uploaded is in excel format? I am using Apache POI library to read excel, and checking the uploaded file extension while reading the file.
Code snippet for getting the extension
String suffix = FilenameUtils.getExtension(uploadedFile.getName());
courtesy BalusC : uploading-files-with-jsf
String fileExtension = FilenameUtils.getExtension(uploadedFile.getName());
if ("xls".equals(fileExtension)) {
//rest of the code
}
I am sure this is not the proper way of validation.
Sample code for browse button
<h:inputFileUpload id="file" value="#{sampleInterface.uploadedFile}"
valueChangeListener="#{sampleInterface.uploadedFile}" />
Sample code for upload button
<h:commandButton action="#{sampleInterface.sampleMethod}" id="upload"
value="upload"/>
A user could change the extension of a doc or a movie file to "xls" and upload it; reading the file would then certainly throw an exception.
Just hoping somebody could throw some input my way.
You can't check that before feeding it to POI. Just catch the exception which POI can throw during parsing. If it throws an exception then you can just show a FacesMessage to the enduser that the uploaded file is not in the supported excel format.
Please try to be more helpful to the poster. Of course you can test before POI.
Regular tests, to be performed before the try/catch, include the following. I suggest a fail-fast approach.
Is it a "good" file?
if file.isDirectory() -> die and exit.
if !file.canRead() -> die and exit.
if file.length() <= 100 -> die and exit (includes file size zero)
if file.length() >= some ridiculously large number (check your biggest excel file and multiply by 10) -> die and exit.
File seems good, but is contents like Excel?
Does it start with "ÐÏà" (the bytes D0 CF 11 E0, the start of the OLE2 header used by binary .xls files) -> if not, die.
Does it contain the text "Sheet"-> if not, die
Some other internal excel bytes that I expected from you guys here.
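The "starts with" test above can be written as a signature check on the first bytes of the upload. This is a minimal sketch with made-up names. It relies on two facts: binary .xls files are OLE2 compound files beginning with D0 CF 11 E0 A1 B1 1A E1, and .xlsx files are ZIP packages beginning with "PK\x03\x04". It only identifies the container, so a .doc or an arbitrary ZIP passes as well, and the parser's exception handling is still needed afterwards.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 container: .xls, but also .doc, .ppt
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP container: .xlsx, but also any other ZIP

def sniff_container(head: bytes) -> str:
    """Classify a file by its first bytes: 'ole2', 'zip' or 'unknown'."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Reading the first eight bytes of the uploaded stream is enough for this check, so it can fail fast before the whole file is handed to the parser.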

Resources