PDFClown MarkerContent gives only first two ContentObjects - pdfclown

I am a newbee to PDFClown and need help in parsing my pdf contents.
My PDF has huge number of MarkedContents which is displayed when converted as Stream.
But i am not able to parse them into objects to extract the Path Information contained within, which is my objective.
Here is my code -
if(level.Contents[i] is MarkedContent)
{
PdfDataObject ContentDataObj = level.Contents.BaseDataObject;
PdfIndirectObject pdfIndirectObject = level.Contents.BaseDataObject.IndirectObject;
PdfStream ContentStream = (PdfStream)ContentDataObj.Resolve();
ContentParser contentParser = new ContentParser(ContentStream.GetBody(true).ToByteArray());
IList<ContentObject> markerContentObjList = contentParser.ParseContentObjects();
//Here i am getting only two Content Objects, where as the stream has so many distinct Marked Contents
for (int k = 0; k < markerContentObjList.Count; k++)
{
}
}
Below is the DOM Inspector screenshot and Stream data

In Short
There are multiple errors in the content streams of your PDF, in particular errors that close more objects than are opened. This most likely is causing the early stop of parsing. Even if it is not, PDF Clown would associate starts and ends of objects differently than intended. Thus, the only real fix of the issue is to ask the source of the documents to provide a non-broken version.
The First Content Stream
The screen shot you provided shows your first page content stream:
The second content stream of that page exhibits the same issues as this one:
Non-Matching Starts and Ends of Marked Content Sequences
If we look at the marked content operators, we see
/OC /Heading BDC
...
EMC
EMC
/OC /Heading BDC
...
EMC
As you can see, there are two EMC operators for the first BDC. This is invalid. Confer ISO 32000-2 section 14.6 Marked content.
Invalid Fill Operator
Furthermore, there is a Fill operator directly following a text object:
BT
...
ET
f
This also is invalid, path painting operators are only allowed after a path object or a clipping path object, not after a text object. Confer ISO 32000-2 Figure 9 Graphics objects.
A Related PDF Clown Issue
Actually there is a bug in PDF Clown which makes processing of marked content with PDF Clown impossible anyway: PDF Clown assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap, see this answer for details. This assumption is wrong and results in incorrect graphic state contents as explained in that answer.
Thus, one should patch marked content support out of PDF Clown as explained there to at least have proper graphics state information. Thereafter, obviously, you cannot properly process marked content unless you add correct support for it yourself.
Why PDF Clown Stops at the End of the First Stream
As you observed, PDF Clown stops not after the extra EMC but instead at the end of the first content stream.
This is due to the PDF Clown issue explained above: Based on the assumption that marked content sections and save/restore graphics state blocks are properly contained in each other, PDF Clown simply makes EMC and Q close the most recently opened and still open marked content section or save/restore graphics state block without checking whether it matches alright.
Thus, it matches opening and closing operators in your stream like this:
[Start of page content]
. q
. . /OC /Heading BDC
. . EMC
. EMC
. /OC /Drawing BDC
. EMC
Q
So for PDF Clown that last Q does not match the initial q in the content but the start of page content itself.
I think that PDF Clown stops parsing here because it assumes it has found the end of page contents.

Related

EOFError when converting gensim word2vec to binary format

I have a pretrained embeddings with word2vec format in txt. I loaded it and then saved it to .bin. But I cannot load this embeddings as an EOFError: unexpected end of input; is count incorrect or file otherwise damaged?
My original code is:
model = KeyedVectors.load_word2vec_format(wordfile)
model.save_word2vec_format("file.bin",binary=True,write_header=True)
bin_model = KeyedVectors.load_word2vec_format("file.bin",binary=True)
And I can load this file.bin with a limit arguement: KeyedVectors.load_word2vec_format("file.bin",binary=True, limit=10000).
Is there some other process needed when I save embeddings?
There's a good chance that your .bin file has an incorrect leading-count, or the file has been otherwise been damaged/truncated – because that error means the file declared in its header (1st line) a larger number of word-vectors than were found during attempted-load.
So, if you downloaded it or copied it from somewhere, check the original source, to make sure you've got the full file.
Is there a reason you're performing this conversion? The formats are essentialy equivalent, and result in the exact same object-in-Python after loading.
If there's any tiny on-disk size savings in binary-format, you could probably save more by GZIPping the file (which the .load_word2vec_format() will also happily decompress, if it sees a trailing .gz on the filename).

What is wrong with this PDF file?

I have to work with a PDF form created by a person unknown to me. Why did the program with which the form was created (Word + PDF export?) split the term "Stunde" into "S", "t" and "unde" in line 6909 of the decoded PDF? There is no visual break between the three parts.
/TT1 1 Tf
11.04 0 0 11.04 59.16 476.1203 Tm
(Datum)Tj
/C2_1 1 Tf
<0003>Tj
/TT1 1 Tf
(der)Tj
0.424 -1.315 Td
(Tätigkeit)Tj
-0.0022 Tc 0 11.04 -11.04 0 261.24 437.7203 Tm
[(Ve)-4.6<7267fc74>-4.2(ungssat)-4.2(z)]TJ
/C2_1 1 Tf
0 Tc <0003>Tj
/TT1 1 Tf
-0.0021 Tc 0.935 -1.315 Td
[<2880>-6.1(/)-7.2(S)0.8(t)-4.1(unde)-4.5(\))]TJ % <<< the important line
0 Tc 11.04 0 0 11.04 340.92 468.8003 Tm
(Anlass/Art)Tj
/C2_1 1 Tf
resulting in
[]
To get the source code above, I decoded the PDF file as described here. I have no know-how concerning the PDF file format.
Background: I had to replace the word "Stunde", it drove me crazy to find the place where "Stunde" was written (in parts) within the source code, since no free PDF editor seems to be able to work with horizontal text without problems.
Academic Bonus questions: Is it possible to set the sum over a column as default value for a form field? (Modifiable; changed every time the column is changed.) Why was I able to replace "Stunde" with "Einsatz" without making the PDF file corrupt due to now irregular offsets?
Why did the program with which the form was created (Word + PDF export?) split the term "Stunde" into "S", "t" and "unde" in line 6909 of the decoded PDF?
As #gettalong mentioned in his answer, in your case this most likely has been done to apply kerning.
If you start looking into the outputs of some other PDF producers, you'll see that this export from Word actually is very unobtrusive in regard to splitting words:
there are PDF producers that draw each character individually after explicitly setting the text matrix for it, and
there also are PDF producers that have the width information for the characters of the used fonts set to zero and use the numbers in TJ instructions to forward the current text matrix between characters accordingly.
And this doesn't cover all the variants to be found, not by far...
Thus,
I had to replace the word "Stunde", it drove me crazy to find the place where "Stunde" was written (in parts) within the source code
in your case replacing actually was a fairly trivial task...
Is it possible to set the sum over a column as default value for a form field? (Modifiable; changed every time the column is changed.)
If all the column values in question are stored in form fields, you can use JavaScript to recalculate sums after form changes. To have it serve as "default" only, you can use some other (hidden) field for a flag whether the field has already been touched. Beware, though: JavaScript is not supported by all PDF viewers. Furthermore, the JavaScript object model for PDF is not specified in an independent (like ISO) specification but in an Adobe one which can make interpretation of the specification biased.
Why was I able to replace "Stunde" with "Einsatz" without making the PDF file corrupt due to now irregular offsets?
As we don't know how exactly you applied the changes, this obviously is hard to tell.
Most likely, though, you did corrupt the PDF and the PDF viewers you opened it in merely repair the corruption under the hood. There is a strong tendency in PDF viewers to do such under-the-hood repairs without informing the user; the result is that a large part of the PDFs in the wild actually being broken.
You don't see a visual break but the standard distance between "S", "t" and "unde" has been changed nonetheless. This is done by PDF writers that support e.g. kerning so that the word appear nicer. This is the reason why it is split that way.

Save a figure to file with specific resolution

In an old version of my code, I used to do a hardcopy() with a given resolution, ie:
frame = hardcopy(figHandle, ['-d' renderer], ['-r' num2str(round(pixelsperinch))]);
For reference, hardcopy saves a figure window to file.
Then I would typically perform:
ZZ = rgb2gray(frame) < 255/2;
se = strel('disk',diskSize);
ZZ2 = imdilate(ZZ,se); %perform dilation.
Surface = bwarea(ZZ2); %get estimated surface (in pixels)
This worked until I switched to Matlab 2017, in which the hardcopy() function is deprecated and we are left with the print() function instead.
I am unable to extract the data from figure handler at a specific resolution using print. I've tried many things, including:
frame = print(figHandle, '-opengl', strcat('-r',num2str(round(pixelsperinch))));
But it doesn't work. How can I overcome this?
EDIT
I don't want to 'save' nor create a figure file, my aim is to extract the data from the figure in order to mesure a surface after a dilation process. I just want to keep this information and since 'im processing a LOT of different trajectories (total is approx. 1e7 trajectories), i don't want to save each file to disk (this is costly, time execution speaking). I'm running this code on a remote server (without a graphic card).
The issue I'm struggling with is: "One or more output arguments not assigned during call to "varargout"."
getframe() does not allow for setting a specific resolution (it uses current resolution instead as far as I know)
EDIT2
Ok, figured out how to do, you need to pass the '-RGBImage' argument like this:
frame = print(figHandle, ['-' renderer], ['-r' num2str(round(pixelsperinch))], '-RGBImage');
it also accept custom resolution and renderer as specified in the documentation.
I think you must specify formattype too (-dtiff in my case). I've tried this in Matlab 2016b with no problem:
print(figHandle,'-dtiff', '-opengl', '-r600', 'nameofmyfig');
EDIT:
If you need the CData just find the handle of the corresponding axes and get its CData
f = findobj('Tag','mytag')
Then depending on your matlab version use:
mycdata = get(f,'CData');
or directly
mycdta = f.CData;
EDIT 2:
You can set the tag of your image programatically and then do what I said previously:
a = imshow('peppers.png');
set(a,'Tag','mytag');

iText - adding Image element generates a corrupt PDF file

I'm using iText® 5.2.1 ©2000-2012 1T3XT BVBA and Integration Designer 8.0 to create a PDF file that is exported in an byte array.
I am creating a document with a fair amount of text and want to add a logo at the beginning.
Part of the code that is adding the image is as follows:
BASE64Decoder decoder = new BASE64Decoder();
byte[] decodedBytes = decoder.decodeBuffer(Stringovi.SLIKA1);
Image image1 = Image.getInstance(decodedBytes);
image1.setAbsolutePosition(30f, 770f);
image1.scalePercent(60f);
document.add(image1);
The input image is in byte array format because of the system requirements.
The rest of the document consists of different tables with various content and it's all text.
When I add the image in the before mentioned way the program finishes and i get an byte output that i run trough a Base64 decoder. Resulting PDF can not be opend and the error shown is:
"Error [PDF Structure 40]:Invalid reference table (xref)"
I can't see where my mistake is so if anybody could be so kind and point me in the right direction I would very much appreciate it.
The document you presented as a "broken PDF file" is not a complete PDF file. It doesn't end with %%EOF, it doesn't have a cross-reference table,... It's a PDF document that isn't complete.
This means that you don't have the following line in your code:
document.close();
If you do have this line, it isn't reached. For instance: an exception is thrown causing the code to jump to a catch clause, skipping the close() operation.
The error message saying Invalid reference table (xref) is consistent with that diagnosis. This isn't a problem caused by iText. It's a problem caused by bad coding: not closing the document and/or not dealing with exceptions correctly.

Is altChunk valid within PresentationML?

Is it valid to use an altChunk element within a any of the slide xml files within a pptx ooxml package?
I have read through the ECMA-376 spec, and while altChunk is defined within the WordprocessingML section of the spec (e.g.: "Any document part that permits a p element can also contain an altChunk element, whose id attribute refers to a relationship"), it is not mentioned anywhere else.
PresentationML apparently doesn't have a p element, and the other valid parent elements for AltChunk (body (§17.2.2); comment (§17.13.4.2); docPartBody (§17.12.6); endnote (§17.11.2); footnote (§17.11.10); ftr (§17.10.3); hdr (§17.10.4); tc (§17.4.66)) don't appear to be valid within PresentationML.
My attempts to use altChunk within a slide xml file (with proper validated entries in the relevant rels file) have resulted in invalid xml: PPT2010 offers to repair the file, and this great tool http://www.probatron.org:8080/officeotron/officeotron.html offers a number of different errors (e.g.: "Invalid content was found starting with element 'p:altChunk'." or "One of '{"http://schemas.openxmlformats.org/drawingml/2006/main":p}' is expected") depending on where I place the altChunk element.
(FWIW, the actual problem I am trying to solve is to include some basic HTML within a ppt slide.)
What do you want to accomplish by including the HTML within a slide?
If you simply need to store it (or any other text) you can use Tags. I don't know how they're implemented in XML but you can add one via a bit of VBA and see what appears in the XML:
Sub AddTagToSlide()
With ActivePresentation.Slides(1)
.Tags.Add "ThisIsTheTagName", "ThisIsTheTagValue"
End With
' Did it work? How do we retrieve a tag?
With ActivePresentation.Slides(1)
MsgBox .Tags("ThisIsTheTagName")
End With
End Sub
You can add as many tags as you like to the Presentation object, Slide objects, and Shape objects, among other things. There's no UI for tags, so they're not visible to users.

Resources