Debugging PDF for error

I'm creating PDF files using PDFClown java library.
Sometimes, when opening these files with Adobe Acrobat Reader, I get the famous error message:
"An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem."
The error shows up while reading the attached file (with Adobe) only when scrolling down to the 8th page, then scrolling back up to the 3rd page. Alternatively, zooming out to 33.3% will also produce the message.
Just for the record, Foxit Reader reads the file flawlessly, as do other PDF readers such as browsers.
My questions are:
What's wrong with my file? (file is attached)
How can I find what's wrong with it? Is there a tool that tells you where the error lies?
Thanks!

OK, this wasn't easy.
Due to a bug in PDFClown, the main stream of information in my PDF page had been corrupted.
After its end it had a copy of a past instance of itself.
This caused a partial text section without the starting command "BT", which left a single "ET" without a matching "BT" at the end of the stream.
Once I corrected this, it ran great.
Thank you all for your help.
I would have had a much more difficult time debugging it without the RUPS tool that Bruno suggested.
edit:
The bug was in Buffer.java:clone() (line 217).
Instead of the line:
clone.append(data);
it needs to be:
clone.append(data, 0, this.length);
Without this correction it clones the whole data array and sets the cloned Buffer's length to data.length. This is very problematic if the Buffer's length is smaller than data.length.
The result in my case was garbage at the end of the stream.
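To make the effect concrete, here is a minimal stand-in for such an append-style buffer (this is only an illustration of the length-versus-capacity mismatch, not PDFClown's actual Buffer class):

// Simplified illustration, NOT PDFClown's Buffer: the backing array is usually
// larger than the content, so cloning it blindly copies stale bytes as well.
public class SimpleBuffer {
    private byte[] data = new byte[1024]; // capacity, typically larger than the content
    private int length = 0;               // bytes actually in use

    public void append(byte[] src, int offset, int count) {
        if (length + count > data.length) {
            data = java.util.Arrays.copyOf(data, Math.max(length + count, data.length * 2));
        }
        System.arraycopy(src, offset, data, length, count);
        length += count;
    }

    // Buggy variant: copies the whole backing array, so the clone's length
    // becomes data.length and everything past the real length is garbage.
    public SimpleBuffer buggyClone() {
        SimpleBuffer clone = new SimpleBuffer();
        clone.append(data, 0, data.length);
        return clone;
    }

    // Fixed variant, matching the correction above: copy only the used bytes.
    public SimpleBuffer fixedClone() {
        SimpleBuffer clone = new SimpleBuffer();
        clone.append(data, 0, this.length);
        return clone;
    }
}

With a content stream built through such a buffer, the buggy clone explains both the stale copy of older content after the stream's end and the trailing garbage bytes.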

The error shows up while reading the attached file (with Adobe) only when scrolling down to the 8th page, then scrolling back up to the 3rd page. Alternatively, zooming out to 33.3% will also produce the message.
Well, I can reproduce it even more easily: I merely open the PDF and scroll down using the cursor keys. As soon as the top 2 cm of page 3 appear, the message appears.
What's wrong with my file?
The content of pages 1 and 2 look ok, so let's look at the content of page 3.
My initial attempt to attribute the issue to the use of text-specific operations (especially Tf and Tw) outside of a text object was wrong, as Stefano Chizzolini pointed out: some text-related operations are indeed allowed outside text objects, namely the text state operations, cf. Figure 9 of the PDF specification.
So, while less common, text state operations at the page description level are completely OK.
After my incorrect attempt to explain the issue, the OP's own answer indicated that the
main stream of information in the PDF page has been corrupted. After its end it had a copy of a past instance of itself. This caused a partial text section without the starting command "BT", which left a single "ET" without a "BT" at the end of the stream.
An ET without a prior BT would indeed be an error, and it would quite likely be accompanied by operations at the wrong level... Inspecting the stream content of that third page (the page this issue focuses on), though, I could not find any unmatched ET. In the course of that inspection, however, I discovered that the content stream contains more than 2000 trailing 0 bytes! Adobe Reader seems unable to cope with these 0 bytes.
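A quick way to hunt for both suspects (an unmatched ET and trailing 0 bytes) is a naive scan over the decoded stream bytes. This is only a rough sketch and assumes you already have the decompressed content stream in a byte array (extracted with RUPS, or programmatically as shown further below); it tokenizes on whitespace and ignores string literals and inline images, so treat its output as a hint only:

import java.nio.charset.StandardCharsets;

public class ContentStreamChecks {
    public static void check(byte[] content) {
        // Count trailing 0 bytes at the end of the decoded stream.
        int trailingZeros = 0;
        for (int i = content.length - 1; i >= 0 && content[i] == 0; i--) {
            trailingZeros++;
        }
        System.out.println("Trailing 0 bytes: " + trailingZeros);

        // Naive BT/ET balance check over whitespace-separated tokens.
        int openTextObjects = 0;
        boolean unmatchedEt = false;
        for (String token : new String(content, StandardCharsets.ISO_8859_1).split("\\s+")) {
            if (token.equals("BT")) {
                openTextObjects++;
            } else if (token.equals("ET")) {
                if (openTextObjects == 0) {
                    unmatchedEt = true;
                } else {
                    openTextObjects--;
                }
            }
        }
        System.out.println("Unmatched ET: " + unmatchedEt + ", unclosed BT: " + openTextObjects);
    }
}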
The bug the OP found can explain the issue:
in the Buffer.java:clone() (line 217)
instead of line:
clone.append(data);
needs to be:
clone.append(data, 0, this.length);
Without this correction it clones the whole data array and sets the cloned Buffer's length to data.length. This is very problematic if the Buffer's length is smaller than data.length.
Trailing 0 bytes can be an effect of such a buffer copying bug.
Furthermore, symptoms like those found by the OP ("after its end it had a copy of a past instance of it") can also be the effect of such a bug. So I assume the OP found those symptoms on a different page, not page 3, but fixing the bug healed all symptoms.
How can I find what's wrong with it? Is there a tool that tells you where the error lies?
There are PDF syntax checkers, e.g. the Preflight tool included in Adobe Acrobat, but even that fails on your file.
So essentially you have to extract the page content (using a PDF browser, e.g. RUPS) and check it manually with the PDF specification open on the other screen.
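If you prefer to do the extraction programmatically instead of through RUPS's UI, something along these lines should work; this is a sketch using Apache PDFBox 2.x (my own choice of library here; the file name and page index are just placeholders):

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

// Dump the decoded content stream of one page to a text file so it can be
// read side by side with the PDF specification.
public class DumpPageContent {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("test.pdf"))) {
            PDPage page = document.getPage(2); // 0-based index, i.e. page 3
            try (InputStream contents = page.getContents();
                 FileOutputStream out = new FileOutputStream("page3-content.txt")) {
                IOUtils.copy(contents, out);   // PDFBox hands back the decoded bytes
            }
        }
    }
}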

The general post about debugging PDF might also have been helpful, as RUPS / pdfstreamdump etc. are mentioned there: How do you debug PDF files?

Related

How can I get a "break" after an image

Using iTextSharp (4.1.6) in a Xamarin.Forms app.
I have lines of data with CRLF at the end of each. I put these into paragraphs, which I add to the doc. At the end of many, I add to the doc one or more images (photos). Works great.
Now, I'd like to have a page break after the last image, so I get a fresh start.
But iText seems to be flowing the text around the image some, and sometimes I don't get that page break, at least not where I want it. The next paragraph follows immediately after the image.
I tried adding a small paragraph after the image, but this did not solve it. I did find that adding a SECTION seems to cause a good break, but it puts in some text that I don't want or need.
I don't seem to find anything like API documentation for this, I've been just working from examples.
It seems like this would be really easy. What am I doing wrong?
iText seems to be flowing the text around the image some
Indeed, if you add an image to your document which does not fit on the current page anymore (but there still is some space for text on that page), iText does not immediately start a new page but keeps the image in memory and waits for your next content additions. If you then add text, that text first fills the current page, and only if even the text does not fit anymore (or if you add another image) is a new page started and the waiting image added.
You can switch this off for your PdfWriter writer using
writer.StrictImageSequence = true;

exist-db how to access a pdf

I am sure it is very simple ... I just cannot get my head around this...
The exist-db documentation is a bit fuzzy on content extraction ...
http://exist-db.org/exist/apps/doc/contentextraction.
I have a PDF file consisting of about 162 high-res images (the PDF is quite big ...) and I do not know how to access any of the images that are presumably created ...
Please do not destroy me! I am just starting to build a database (for an edition at uni). I'd love to have a facsimile edition (so one tab with the image file and one tab with the transcribed texts).
I aim at doing something similar to what Heidelberg University did with the "Welsche Gast Digital": http://digi.ub.uni-heidelberg.de/diglit/cpg389/0190/image
(the chosen image is just an example!)
When clicking on "Faksimile" the scan opens, and when clicking on "Transkription" the transcribed text opens!
I am quite new to XQuery, XPath and most X-related stuff. I have a "working design" put together in exist-db and am looking at TEI for marking up the transcription etc. I fear I'll have to spend quite some time on this issue ...
(it is not about doing my job for me, it's just about pointing me in the right direction)
I'm afraid the short answer is simply: don't.
Storing a PDF in your db, and then trying to extract images from it, is kind of a recipe for disaster. Instead you should use the source images (not necessarily extracted from the PDF) and store these individually in a collection (e.g. resources/img). Those image files are then the binary resources that the documentation is actually talking about.
You might want to take a look at tei-publisher for creating digital editions in eXist, especially this demo app, for how to present high-res facsimiles with transcribed portions of text. I'm afraid it's all a bit more involved than just opening a PDF in a browser, but so is the Welsche Gast Digital.

C# PDFsharp end of page detection

Is it possible to detect the end of the page when creating a PDF file with the PDFsharp library? How? Or to detect text overflowing the page? I am generating a PDF file with a list of users, and if the list is too long, I need to add a new page and continue on it. I don't want to write ugly code; I want it to be as automatic as possible.
I am aware of the MigraDoc library, but I already have a lot of code written with PDFsharp, so if it's not necessary to use MigraDoc (which seems to be better), I would rather stay with PDFsharp. Thanks.
When using PDFsharp, you are responsible for detecting the end of the page and creating a new page for the continuation.
We always say that PDFsharp is low level: no automatic page breaks, but anything can be drawn anywhere.
Still, you can write clean code with PDFsharp that handles page breaks properly.
You always have a current page, a current gfx, and a current y position on the page. So when you have to start a new page, re-initialize those variables.
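PDFsharp is a C# library, so the snippet below is only a rough analogue of that pattern written in Java with Apache PDFBox as a stand-in (my assumption of an equivalent workflow, not PDFsharp code); the idea is the same: keep the current page, the current drawing object and the current y position in variables, and re-initialize all three whenever the page is full.

import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

// Analogue sketch (Apache PDFBox, not PDFsharp): manual page breaks by
// tracking the current page, content stream and y position.
public class UserListReport {
    public static void write(List<String> users, String outFile) throws Exception {
        final float top = 800, bottom = 50, left = 50, lineHeight = 14;
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage();
            doc.addPage(page);
            PDPageContentStream cs = new PDPageContentStream(doc, page);
            float y = top;
            for (String user : users) {
                if (y < bottom) {                       // end of page reached:
                    cs.close();                         // finish the current page,
                    page = new PDPage();                // start a new one, and
                    doc.addPage(page);
                    cs = new PDPageContentStream(doc, page);
                    y = top;                            // re-initialize the y position
                }
                cs.beginText();
                cs.setFont(PDType1Font.HELVETICA, 10);
                cs.newLineAtOffset(left, y);
                cs.showText(user);
                cs.endText();
                y -= lineHeight;
            }
            cs.close();
            doc.save(outFile);
        }
    }
}

In PDFsharp itself the structure is the same, just with document.AddPage and XGraphics.FromPdfPage providing the new page and the new gfx.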

Opening a large text file (30mb) consumes 500,000k of Firefox memory

I was initially dabbling with IFrames to launch a document, and found that for large files, the memory in all browsers (I first noticed this in FF) jumped to 500,000 K.
At first I thought it might have been some bad JS code that I had written, but removing all the extraneous code and just OPENING the text file still showed the same problem.
So right now, all I'm doing is going to a site, http://url/largefile, and watching the file slowly render to the screen.
Is there any efficient way for me to display the file without the browser exploding? What am I missing here?
EDIT: I've received responses suggesting a text editor for this purpose. My original goal was to allow a user to click the URL, which would append a search term as a POST variable. The opened text file would then scroll to the specified point of the search term. Is there a way to auto-open a text editor ... on that person's computer and then go directly to the search point?
30 MB is kind of big, even for a regular text editor, so I suspect you will be unable to convince FF to handle it well. I would try one of the following:
implement paging/searching in your web site so it only displays a portion of the file at one time
open the file in an actual text editor - it's what they are good at after all
Your paging implementation (if suitably clever) might only load the text around the selected piece of the file, and when the user scrolls up or down, use AJAX to load additional parts of the file (somewhat like a virtual list control in Windows). This might help to mitigate the performance impact.
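The server side can be anything; just as a hypothetical illustration (the endpoint, file path and chunk size are all made up), a small Java servlet that returns one slice of the file per request, which the page can then fetch piece by piece via AJAX while the user scrolls:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical chunked endpoint: GET /largefile?offset=123456 returns the
// next 64 KB of the text file instead of the whole 30 MB at once.
@WebServlet("/largefile")
public class LargeFileChunkServlet extends HttpServlet {
    private static final int CHUNK_SIZE = 64 * 1024;
    private static final File FILE = new File("/data/largefile.txt"); // made-up path

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String param = req.getParameter("offset");
        long offset = (param == null) ? 0 : Long.parseLong(param);
        resp.setContentType("text/plain; charset=UTF-8");
        try (RandomAccessFile raf = new RandomAccessFile(FILE, "r")) {
            raf.seek(Math.min(offset, raf.length()));
            byte[] buf = new byte[CHUNK_SIZE];
            int read = raf.read(buf);
            if (read > 0) {
                resp.getOutputStream().write(buf, 0, read);
            }
        }
    }
}

The page-side script then only has to request the next offset when the user nears the bottom of the loaded text.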
Is it XML? Firefox tries to create a DOM for XML files, which can be many times larger than the file itself.

PDF generation under ruby - block should not cut by page separator

My PDF consists of a number of blocks (actually, a list of quotations); they go one after another till the end of the document. If the text of a quotation does not fit on the page, the whole quotation should start from the top of the next page instead of being torn apart. How can I implement that with any Ruby library?
Try PrinceXML - this is a standalone executable that generates PDF out of HTML or XML. It supports a lot of special CSS properties that will even help you control page breaks. Refer to http://www.princexml.com/doc/6.0/page-breaks/
This application is available for Windows and Linux. I was using it to generate pretty complicated PDF documents with headers and footers on every page except the first one. And since you don't need to output a PDF with precise positioning of elements, it might be a perfect solution for you.
I haven't tried it, but in Prawn I would try using either the Document#text_box method or looking up the table methods and putting your text in cells with invisible borders. The documentation's unclear on how page break functionality fits in with the bounding box models, but it's worth a shot.
HTMLDoc, which converts HTML to PDF, has a page break facility.
