Opening a large text file (30mb) consumes 500,000k of Firefox memory - firefox

I was initially dabbling with IFrames to launch a document, and found that for large files, the memory in all browsers (I first noticed this in FF) jumped to 500,000 K.
At first I thought it might have been some bad JS code that I had written, but removing all the extraneous code and just OPENING the text file still displayed the same problem.
So right now, all I'm doing is going to a site http://url/largefile, and seeing the file slowly display to the screen.
Is there any efficient way for me to display the file without the browser exploding? What am I missing here?
EDIT: I've received responses to use a text editor for this purpose. My original goal was to allow a user to click the url, which would append a search term as a post variable. The opened textfile would then scroll to the specified point of the search term. Is there a way to auto open a text editor ... on that person's computer and then going directly to the search point?

30MB is kind of big, even for a regular text editor, I suspect you will be unable to convince FF to handle it well. I might try one of the following:
implement paging/searching in your web site so it only displays a portion of the file at one time
open the file in an actual text editor - it's what they are good at after all
Your paging implementation (if suitably clever) might only load the text around the selected piece of the file, and when they scroll up or down use AJAX to load additional parts of the file (kind of like a virtual list control in windows). This might help to mitigate the performance impact.

Is it xml? Firefox tries to create a DOM for xml files that can be many times larger than the file itself.

Related

AppleScript: renaming PDF with content of PDF

I am trying to do exactly what is described in the following thread:
AppleScript/Automator: renaming PDF with extracted text content of this PDF
So I am using the Chino22's version and there are two issues with it:
First, instead of the contents of the pdf, theFileContentsText gets some metadata stuff.
Second, althought the script runs to the end, I get the following error for the last step:
error "The variable thisFile is not defined." number -2753 from "thisFile"
So, how do I get the text contents instead, and how do I define thisFile to the current pdf that is being processed in the loop?
Thanks in advance!
I would not expect the linked script to work.
Except for document metadata, extracting text content from PDF is notoriously difficult and unreliable, and not a road you want to go down if you can possibly avoid it. Adobe’s PDF file format is designed for printing, not for data processing. PDF files contain blocks of Postscript-like page drawing instructions, typically compressed, and while it’s possible for PDFs also to include the original plain text for accessibility use, most PDF generators do not do this so the only way to get the original text is by reconstructing it from those low-level drawing instructions—not a trivial job.
AppleScript’s read command only reads that raw file data; it does not parse it into drawing instructions, never mind translating those drawing instructions back into plain text. Change a PDF file’s extension to .txt and open it in a plain text editor, and you’ll see what I mean. Nasty.
If you need to work with the PDF’s original content (text, images, whatever), your best solution is to get those files before they were converted into a PDF.
If you must extract content from a PDF file, use an existing tool that knows how to do it.
For instance, if you’re lucky enough to have PDFs that contain XFDF (XML form) or accessibility data, there are 3rd-party apps and libraries to extract that content in readable form. I can’t think offhand of any that are AppleScriptable (Adobe Acrobat has only minimal AS support) so you’ll probably need to find one you can run from command line (do shell script in AS).
Or, if the PDFs have a consistent visual structure, a 3rd-party library such as Python’s PDFMiner (which I’ve used in the past) can identify blocks of characters by position and convert those back into strings with varying degrees of reliability (it has to convert font glyphs back into Unicode characters, guess at which characters are close enough to constitute a word, and where to insert space and return characters between those words). You’ll have to write some Python code to extract the bits you want, so look for tutorials to get started (or pay someone to write it for you).
But again, if you can possibly avoid having to extract text from PDF, you should. You will save yourself a lot of trouble.

exist-db how to access a pdf

I am sure it is very simple ... I just cannot get my head around this...
the exist-db Documentation is a bit fuzzy on content extraction...
http://exist-db.org/exist/apps/doc/contentextraction.
I have a pdf-file, containing of about 162 high-res images (the pdf is quite big ...) and I do not know how to access any of the that are presumably created ...
please do not destroy me! I am just starting to build a database (for an Edition at Uni)I'd love to have a facsimile edition (so one Tab with the image-file and one tab with the transcribed texts)
I aim at doing something similar to what Heidelberg Universitdy did with the "Welsche Gast Digital" http://digi.ub.uni-heidelberg.de/diglit/cpg389/0190/image
(the choosen image is just an example! )
This pic
When clicking on faksimile the Scan opens and when clicking on Transkription the transcribed texts open!
I am quite new to Xquery, Xpath and most X-related stuff. I have a "working design" put together in exist-db and am looking at TEI for marking up the transcritpion etc, I fear I'll have to spend quite some time on this issue ...
(it is not about doing my job for me, it's just about pointing me in the right direction)
I m afraid the short answer is simply don't.
Storing a pdf in your db, and then trying to extract images from it, is kind of a recipe for disaster. Instead you should use the source images (not necessarily extracted from the pdf), and store these individually in a collection (e.g. resources/img). Those image files are then the binary resources that the documentation is actually talking about.
You might want to take a look at tei-publisher for creating digital edition in exist, especially this demo app for how to present high-res facsimiles with transcribed portions of text. I m afraid its all a bit more involved then just opening a pdf in a browser, but so is the Welsche Gast Digital

Debugging PDF for error

I'm creating PDF files using PDFClown java library.
Sometimes, when openning these files with Adobe Acrobat Reader I get the famous error message:
"An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem."
The error shows while reading (with Adobe) the attached file only when scrolling down to the 8'th page, then scrolling back up to 3'td page. Alternatively, Zooming out to 33.3% will also produce the message.
Just for the record, Foxit reader reads the file flawlessly, as well as other PDF readers like browsers.
My questions are:
What's wrong with my file?? (file is attached)
How can I find what's wrong with it? is there a tool which tells you where does the error lie?
Thanks!
Ok, this wasn't easy -
Due to a bug in PDFClown the my main stream of information in the PDF page has been corrupted.
After it's end it had a copy of a past instance of it.
This caused a partial text section without the starting command "BT" - which left a single "ET" without a "BT" in the end of the stream.
once I corrected this, it ran great.
Thank you all for your help.
I would have much more difficult time debugging it without the tool RUPS which #Bruno suggested.
edit:
The bug was in the Buffer.java:clone() (line 217)
instead of line:
clone.append(data);
needs to be:
clone.append(data, 0, this.length);
Without this correction it clones the whole data buffer, and set the cloned Buffer's length to the data[].length. This is very problematic if the Buffer.length is smaller than the data[].length.
The result in my case was that in the end of the stream there was garbage.
The error shows while reading (with Adobe) the attached file only when scrolling down to the 8'th page, then scrolling back up to 3'td page. Alternatively, Zooming out to 33.3% will also produce the message.
Well, I get it easier, I merely open the PDF and scroll down using the cursor keys. As soon as the top 2 cm of page 3 appear, the message appears.
What's wrong with my file??
The content of pages 1 and 2 look ok, so let's look at the content of page 3.
My initial attributing the issue to the use of text specific operations (especially Tf and Tw) outside of a text object was wrong as Stefano Chizzolini pointed out: Some text related operations indeed are allowed outside text objects, namely the text state operations, cf. figure 9 from the PDF specification:
So while being less common, text state operations at page description level are completely ok.
After my incorrect attempt to explain the issue, the OP's own answer indicated that the
main stream of information in the PDF page has been corrupted. After it's end it had a copy of a past instance of it. This caused a partial text section without the starting command "BT" - which left a single "ET" without a "BT" in the end of the stream.
An ET without a prior BT indeed would be an error, and quite likely it would be accompanied by operations at the wrong level... Inspecting the stream content of that third page (the focused page of this issue), though, I could not find any unmatched ET. In the course of that inspection, though, I discovered that the content stream contains more than 2000 trailing 0 bytes! Adobe Reader seems not to be able to cope with these 0 bytes.
The bug the OP found, can explain the issue:
in the Buffer.java:clone() (line 217)
instead of line:
clone.append(data);
needs to be:
clone.append(data, 0, this.length);
Without this correction it clones the whole data buffer, and set the cloned Buffer's length to the data[].length. This is very problematic if the Buffer.length`` is smaller than the data[].length.
Trailing 0 bytes can be an effect of such a buffer copying bug.
Furthermore symptoms as found by the OP (After it's end it had a copy of a past instance of it) can also be the effect of such a bug. So I assume the OP found those symptoms on a different page, not page 3, but fixing the bug healed all symptoms.
How can I find what's wrong with it? is there a tool which tells you where does the error lie?
There are PDF syntax checkers, e.g. the Preflight tool included in Adobe Acrobat. but even that fails on your file.
So essentially you have to extract the page content (using a PDF browser, e.g. RUPS) and check manually with the PDF specification on the other screen.
the general post about debugging pdf might have been also helpful as rups / pdfstreamdump etc is mentioned there How do you debug PDF files?

Splitting Ruby strings into pages

I've been thinking about this problem for a while, and not quite sure the best way to go about it.
In a rails app I have books, which have many chapters, which have many sections. Chapters are basically just containers for sections, though may contain strings of text themselves. The sections hold most of the book text.
I'm planning to build an HTML 5 ebook reader that works in a mobile browser, and I don't want the user to have to scroll down -- I want the text to break at the end of the page.
I'd assumed using split might be the way to go, but I'm not sure there's a way to break at regular intervals? Would a javascript option work better here?
I'd looked at this: Dividing text article to smaller parts with paging in Ruby on Rails but can't feasibly insert manual break marks in the text, some of which are 90,000+ words.
Any ideas would be appreciated.
I think the main problem here is that the page length will depend on the device (and possibly the text size, if that is feature of your app). You should probably send large chunks that are sure to be at least say 5 pages long, at a time and then let the javascript do the paging. Rails has no access, nor should it, to the size of the display.
Text requires very little data, you shouldn't worry about transmitting more than you need or keeping too much in memory.
You may use blank line("\n" or "") as the separator.
I'd send enough of the page content down to easily fill a page and more, then use javascript on the client slide to remove sentences from the page until the scroll-bar disappears.
Resize.js is something similar I wrote a while ago. I wanted to enlarge/reduce the font size used on a screen until the screen was just full (for a dashboard monitor).. Yours would be similar, but instead of changing the font size, you are trimming off sentences.
Let me know if you can't see how to adapt this code.
Note: I would also make the javascript note the amount of text it ends up displaying, and pass that to the server in the 'next page' request, so the server knows where to start the next page from.

PDF generation under ruby - block should not cut by page separator

My PDF consists of a number of blocks (actually, a list of quotations), they go one after another till the end of the document. If the text of a quotation
does not fit on the page, the whole quotation should start from the top of the next page, instead of being torn apart. How can I implement that on any library under ruby?
Try PrinceXML - this is a standalone executable that generates PDF out of HTML or XML. It supports a lot of special CSS properties that will even help you to control page breaks. Refer to http://www.princexml.com/doc/6.0/page-breaks/
This application is available for windows and linux. I was using it for generation of a pretty complicated PDF documents with headers and footers on every page except first one. And since you don't need to output a PDF with precise positioning of elements, it might be a perfect solution for you.
I haven't tried it, but in Prawn I would try using either the Document#text_box method or looking up the table methods and putting your text in cells with invisible borders. The documentation's unclear on how page break functionality fits in with the bounding box models, but it's worth a shot.
HTMLDoc which converts HTML to PDF has a page break facility.

Resources