Algorithm for Rendering Long Text in a Text Editor

I've been thinking about writing a text editor control that can edit text of arbitrary length (say, hundreds of megabytes), similar in some ways to the Scintilla editor. The goal is to lazy-read the file, so the user doesn't have to read five hundred megabytes of data just to view a small portion of it. I'm having two problems with this:
1. It seems to me impossible to implement any sensible scrolling feature for such an editor unless I pre-read the entire file once in order to figure out line breaks. Is this really true? Or is there a way to approximate things that I'm not thinking of?
2. Because of various issues with Unicode (e.g. it allows many bytes to represent just one character, not just because of variable-length encoding but also because of accents and such), it seems nearly impossible to determine exactly how much text will fit on the screen -- I'd have to use TextOut() or something to draw one character, measure how big it was, and then draw the next character. And even then, that still doesn't say how I'd map the user's clicks back to the correct text position.
Is there anything I could read on the web regarding algorithms for handling these issues? I've searched, but I haven't found anything.
Thank you!

You can set a "coarse" position based on data size instead of lines. The "fine" position of your text window can be based on a local scan around an arbitrary entry point.
This means you will need to write functions that can scan locally (backwards and forwards) to find line starts, count Unicode characters, and so forth. This should not be too difficult; UTF-8 is designed to be easy to parse in this way.
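For illustration, here is a minimal sketch of those local scans in Python, assuming a memory-mapped UTF-8 file with "\n" line endings (the file name and the choice of entry point are made up):

    import mmap

    def utf8_char_start(buf, pos):
        # Back up to the start of the UTF-8 sequence containing pos;
        # continuation bytes all look like 10xxxxxx.
        while pos > 0 and (buf[pos] & 0xC0) == 0x80:
            pos -= 1
        return pos

    def line_start(buf, pos):
        # Scan backwards to the byte just after the previous newline.
        i = buf.rfind(b"\n", 0, pos)
        return 0 if i < 0 else i + 1

    def line_end(buf, pos):
        # Scan forwards to the next newline (or the end of the buffer).
        i = buf.find(b"\n", pos)
        return len(buf) if i < 0 else i

    with open("huge.txt", "rb") as f:                 # hypothetical file
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        pos = len(buf) // 2                           # "coarse" position: halfway by size
        pos = utf8_char_start(buf, pos)               # snap to a character boundary
        line = buf[line_start(buf, pos):line_end(buf, pos)].decode("utf-8")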
You may want to give special consideration to what to do about extremely long lines. Since there is no upper limit on how long a line can be, finding the beginning (or end) of a line is an unbounded task; I believe everything else you need for a screen editor display should be local.
Finally, if you want a general text editor, you need to figure out what you're going to do when you want to save a file in which you've inserted/deleted things. The straightforward thing is to rewrite the file; however, this is obviously going to take longer with a huge file. You can expect the user to run into problems if there is not enough room for a modified copy, so at the very least, you will want to check to make sure there is enough room on the filesystem.

#comingstorm is basically right. For display, you start at the cursor and scan backwards until you're sure you're past the top of the screen. Then you scan backwards to a line end, assuming you can identify a line end scanning backwards. Now you scan forwards, calculating and saving screen line start positions until you've gone far enough. Finally, you pick the line you want to start displaying on and off you go.
For simple text this can be done fast enough, even on an archaic processor, to redraw a memory-mapped video display on every keystroke. [I invented this technology 30 years ago.] The right way to do this is to fix the cursor on the middle line of the screen.
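As a rough sketch of that layout pass (assuming, for simplicity, one byte per character; the UTF-8 boundary handling from the previous answer would slot into the scans):

    def visible_window(buf, cursor, cols, rows):
        # Returns the byte offsets of the screen lines to draw around `cursor`.
        # 1. Scan backwards far enough to be past the top of the screen,
        #    then snap to a physical line start.
        back = max(0, cursor - (cols + 1) * rows)
        nl = buf.rfind(b"\n", 0, back)
        pos = 0 if nl < 0 else nl + 1

        # 2. Scan forwards, recording where each screen line starts; long
        #    physical lines wrap every `cols` bytes.
        starts, past_cursor = [], 0
        while pos <= len(buf) and past_cursor < rows:
            starts.append(pos)
            if pos > cursor:
                past_cursor += 1
            nl = buf.find(b"\n", pos, pos + cols)
            pos = (nl + 1) if nl != -1 else (pos + cols)

        # 3. Keep the cursor's screen line in the middle of the window.
        cur = max(i for i, s in enumerate(starts) if s <= cursor)
        top = max(0, cur - rows // 2)
        return starts[top:top + rows]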
For actually modifying files, you might look at using GNU's ropes. A rope is essentially a tree of small buffers. The idea is that all local edits can be done in just one small buffer, occasionally adding a new buffer, and occasionally merging adjacent buffers.
I would consider combining this technology with differential storage: the kind of thing all modern source control systems do. You basically have to use this kind of transaction-based editing if you want to implement the undo function.
The key to this is invertible transactions, i.e. transactions that contain enough information to be applied backwards, undoing what they did when applied forwards. The core editor transaction is:
at pos p replace old with new
which has inverse
at pos p replace new with old
This handles insert (old is empty) and delete (new is empty) as well as replace. Given a transaction list, you can undo in-place modifications to a string by applying the inverse transactions in reverse order.
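A toy sketch of such invertible transactions (illustrative Python; a real editor would apply them to its buffer structure rather than to a string):

    from collections import namedtuple

    Txn = namedtuple("Txn", "pos old new")        # "at pos p replace old with new"

    def apply(text, t):
        assert text[t.pos:t.pos + len(t.old)] == t.old
        return text[:t.pos] + t.new + text[t.pos + len(t.old):]

    def invert(t):
        return Txn(t.pos, t.new, t.old)           # swap old and new

    def undo(text, history):
        for t in reversed(history):               # apply inverses, newest first
            text = apply(text, invert(t))
        return text

    text = "hello world"
    history = [Txn(5, "", ","), Txn(7, "world", "there")]   # an insert, then a replace
    for t in history:
        text = apply(text, t)
    assert text == "hello, there"
    assert undo(text, history) == "hello world"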
Now you use the old checkpointing concept: you store a fairly recent in-place modified image of the file together with some recent transactions that haven't been applied yet. To display, you apply the transactions on the fly. To undo, you just throw away some transactions. Occasionally, you actually apply the transactions, making a "checkpoint" image. This speeds up the display, at the cost of making the undo slower.
Finally: to rewrite a huge sequential text file, you would normally rewrite the whole text, which is horrible. If you can cheat a bit and allow arbitrary NUL (0) characters in the text, and you have access to the virtual memory system's page manager and low-level disk access, you can do much better by keeping all the unchanged pages of text and just reorganising them: in other words, the ropes idea on disk.

Related

My Algorithm only fails for large values - How do I debug this?

I'm working on transcribing as3delaunay to Objective-C. For the most part, the entire algorithm works and creates graphs exactly as they should be. However, for large values (thousands of points), the algorithm mostly works, but creates some incorrect graphs.
I've been going back through and checking the most obvious places for error, and I haven't been able to actually find anything. For smaller values I ran the output of the original algorithm and placed it into JSON files. I then read that output into my own tests (tests with 3 or 4 points only), and debugged until the output matched; I checked the output of the two algorithms line for line, and found the discrepancies. But I can't feasibly do that for 1000 points.
Answers don't need to be specific to my situation (although suggesting tools I can use would be excellent).
How can I debug algorithms that only fail for large values?
If you are transcribing an existing algorithm to Objective-C, do you have a working original in some other language? In that case, I would be inclined to put in print statements in both versions and debug the first discrepancy (the first, because later discrepancies could be knock-on errors).
I think it is very likely that the program also makes mistakes for smaller graphs, but more rarely. My first step would in fact be to use the working original (or some other means) to run a large number of automatically checked test runs on small graphs, hoping to find the bug on some more manageable input size.
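For instance (a sketch only; the two runner commands are placeholders for however you drive the reference implementation and the port), you could hammer both with small random inputs until they disagree:

    import json, random, subprocess

    def run_reference(points):                     # e.g. the as3delaunay original
        out = subprocess.run(["./reference_runner"], input=json.dumps(points),
                             capture_output=True, text=True)
        return json.loads(out.stdout)

    def run_port(points):                          # e.g. the Objective-C port
        out = subprocess.run(["./port_runner"], input=json.dumps(points),
                             capture_output=True, text=True)
        return json.loads(out.stdout)

    random.seed(0)
    for trial in range(10_000):
        points = [[random.uniform(0, 100), random.uniform(0, 100)]
                  for _ in range(random.randint(3, 12))]   # small graphs only
        if run_reference(points) != run_port(points):
            print("first mismatch on trial", trial, points)
            break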
Find the threshold
If it works for 3 or 4 items, but not for 1000, then there's probably some threshold in between. Use a binary search to find that threshold.
The threshold itself may be a clue. For example, maybe it corresponds to a magic value in the algorithm or to some other value you wouldn't expect to be correlated. For example, perhaps it's a problem when the number of items exceeds the number of pixels in the x direction of the chart you're trying to draw. The clue might be enough to help you solve the problem. If not, it may give you a clue as to how to force the problem to happen with a smaller value (e.g., debug it with a very narrow chart area).
The threshold may be smaller than you think, and may be directly debuggable.
If the threshold is a big value, like 1000, perhaps you can set a conditional breakpoint to skip right to iteration 999 and then single-step from there.
There may not be a definite threshold, which suggests that it's not the magnitude of the input size, but some other property you should be looking at (e.g., powers of 10 don't work, but everything else does).
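If the failure really is monotonic in the input size (which, as noted above, it may not be), the binary search is only a few lines:

    def smallest_failing_size(check, lo=3, hi=1000):
        # Assumes check(lo) passes and check(hi) fails; check(n) is whatever
        # routine you write to run the algorithm on an input of size n and
        # verify the output.
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if check(mid):
                lo = mid        # still correct at mid: threshold is above it
            else:
                hi = mid        # already wrong at mid: threshold is at or below it
        return hi               # smallest size known to fail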
Decompose the problem and write unit tests
This can be tedious but is often extremely valuable--not just for the current issue, but for the future. Convince yourself that each individual piece works in isolation.
Re-visit recent changes
If it used to work and now it doesn't, look at the most recent changes first. Source control tools are very useful in helping you remember what has changed recently.
Remove code and add it back piece by piece
Comment out as much code as you can and still get some kind of reasonable output (even if that output doesn't meet all the requirements). For example, instead of using a complicated rounding function, just truncate values. Comment out code that adds decorative touches. Put assert(false) in any special case handlers you don't think should be activated for the test data.
Now verify that output, and slowly add back the functionality you removed, one baby step at a time. Test thoroughly at each step.
Profile the code
Profiling is usually for optimization, but it can sometimes give you insight into code, especially when the data size is too large for single-stepping through the debugger. I like to use line or statement counts. Is the loop body executing the number of times you expect? Or twice as often? Or not at all? How about the then and else clauses of those if statements? Logic bugs often become very obvious with this type of profiling.
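A sketch of what that can look like with nothing fancier than a counter (names and the toy branch condition are made up):

    from collections import Counter
    hits = Counter()

    def process(items):
        for item in items:
            hits["loop body"] += 1
            if item % 2 == 0:
                hits["then branch"] += 1
            else:
                hits["else branch"] += 1

    process(range(10))
    print(hits)   # Counter({'loop body': 10, 'then branch': 5, 'else branch': 5})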

Data structures for audio editor

I have been writing an audio editor for the last couple of months, and have been recently thinking about how to implement fast and efficient editing (cut, copy, paste, trim, mute, etc.). There doesn't really seem to be very much information available on this topic, however... I know that Audacity, for example, uses a block file strategy, in which the sample data (and summaries of that data, used for efficient waveform drawing) is stored on disk in fixed-sized chunks. What other strategies might be possible, however? There is quite a lot of info on data-structures for text editing - many text (and hex) editors appear to use the piece-chain method, nicely described here - but could that, or something similar, work for an audio editor?
Many thanks in advance for any thoughts, suggestions, etc.
Chris
The classical problem for editors handling relatively large files is how to cope with deletion and insertion. Text editors obviously face this, as the user typically enters characters one at a time. Audio editors don't typically do "sample by sample" inserts (the user doesn't interactively enter one sample at a time), but they do have cut-and-paste operations.
I would start with a representation where an audio file is represented by chunks of data stored in a (binary) search tree. Insert works by splitting the chunk at the insertion point into two chunks, adding the inserted data as a third chunk, and updating the tree. To make this efficient and responsive to the user, you should then have a background process that defragments the representation on disk (or in memory) and then makes an atomic update to the tree holding the chunks. This should make inserts and deletes as fast as possible.
Many other audio operations (effects, normalize, mix) operate in place and do not require changes to the data structure, but doing e.g. a normalize on the whole sample is a good opportunity to defragment it at the same time. If the audio samples are large, you can also keep the chunks on hard disk, as is standard. I don't believe the chunks need to be fixed size; they can be variable size, preferably 1024 x (a power of two) bytes to make file operations efficient, though a fixed-size strategy can be easier to implement.
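A minimal sketch of the chunk-splitting insert (a flat Python list stands in for the search tree suggested above):

    class Chunk:
        def __init__(self, samples):
            self.samples = samples          # e.g. a list of floats, or a (file, offset, length) reference

    def insert(chunks, position, new_samples):
        # Insert new_samples at absolute sample `position` by splitting one chunk.
        offset = 0
        for i, c in enumerate(chunks):
            if offset <= position <= offset + len(c.samples):
                cut = position - offset
                left, right = Chunk(c.samples[:cut]), Chunk(c.samples[cut:])
                chunks[i:i + 1] = [left, Chunk(new_samples), right]
                return
            offset += len(c.samples)
        chunks.append(Chunk(new_samples))   # past the end: just append

    track = [Chunk([0.0] * 1024), Chunk([0.1] * 1024)]
    insert(track, 1500, [0.5] * 64)         # splits the second chunk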

How are text editors generally implemented?

This question is probably going to make me sound pretty clueless. That's because I am.
I'm just thinking, if I were hypothetically interested in designing my own text editor GUI control, widget, or whatever you want to call it (which I'm not), how would I even do it?
The temptation to a novice such as myself would be to store the content of the text editor in the form of a string, which seems quite costly (not that I'm too familiar with how string implementations differ between one language/platform and the next; but I know that in .NET, for example, they're immutable, so frequent manipulation such as what you'd need to support in a text editor would be magnificently wasteful, constructing one string instance after another in very rapid succession).
Presumably some mutable data structure containing text is used instead; but figuring out what this structure might look like strikes me as a bit of a challenge. Random access would be good (I would think, anyway—after all, don't you want the user to be able to jump around to anywhere in the text?), but then I wonder about the cost of, say, navigating to somewhere in the middle of a huge document and starting to type immediately. Again, the novice approach (say you store the text as a resizeable array of characters) would lead to very poor performance, I'm thinking, as with every character typed by the user there would be a huge amount of data to "shift" over.
So if I had to make a guess, I'd suppose that text editors employ some sort of structure that breaks the text down into smaller pieces (lines, maybe?), which individually comprise character arrays with random access, and which are themselves randomly accessible as discrete chunks. Even that seems like it must be a rather monstrous oversimplification, though, if it is even remotely close to begin with.
Of course I also realize that there may not be a "standard" way that text editors are implemented; maybe it varies dramatically from one editor to another. But I figured, since it's clearly a problem that's been tackled many, many times, perhaps a relatively common approach has surfaced over the years.
Anyway, I'm just interested to know if anyone out there has some knowledge on this topic. Like I said, I'm definitely not looking to write my own text editor; I'm just curious.
One technique that's common (especially in older editors) is called a split buffer. Basically, you "break" the text into everything before the cursor and everything after the cursor. Everything before goes at the beginning of the buffer. Everything after goes at the end of the buffer.
When the user types in text, it goes into the empty space in between without moving any data. When the user moves the cursor, you move the appropriate amount of text from one side of the "break" to the other. Typically there's a lot of moving around a single area, so you're usually only moving small amounts of text at a time. The biggest exception is if you have a "go to line xxx" kind of capability.
Charles Crowley has written a much more complete discussion of the topic. You might also want to look at The Craft of Text Editing, which covers split buffers (and other possibilities) in much greater depth.
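A toy sketch of the idea (two Python lists stand in for the front and back halves of a single contiguous buffer with a gap in the middle):

    class GapBuffer:
        def __init__(self, text=""):
            self.before = list(text)        # text left of the cursor
            self.after = []                 # text right of the cursor, stored reversed

        def insert(self, ch):
            self.before.append(ch)          # typing goes into the gap: no data moves

        def delete(self):                   # backspace
            if self.before:
                self.before.pop()

        def left(self):
            if self.before:
                self.after.append(self.before.pop())

        def right(self):
            if self.after:
                self.before.append(self.after.pop())

        def text(self):
            return "".join(self.before) + "".join(reversed(self.after))

    buf = GapBuffer("hello world")
    for _ in range(5):                      # move the cursor back over "world"
        buf.left()
    buf.insert("?")
    assert buf.text() == "hello ?world"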
A while back, I wrote my own text editor in Tcl (actually, I stole the code from somewhere and extended it beyond recognition, ah the wonders of open source).
As you mentioned, doing string operations on very, very large strings can be expensive. So the editor splits the text into smaller strings at each newline ("\n" or "\r" or "\r\n"). So all I'm left with is editing small strings at line level and doing list operations when moving between lines.
The other advantage of this is that it is a simple and natural concept to work with. My mind already considers each line of text to be separate, reinforced by years of programming where newlines are stylistically or syntactically significant.
It also helps that the use case for my text editor is as a programmer's editor. For example, I implemented syntax highlighting but not word/line wrap. So in my case there is a 1:1 mapping between newlines in the text and the lines drawn on screen.
In case you want to have a look, here's the source code for my editor: http://wiki.tcl.tk/16056
It's not a toy BTW. I use it daily as my standard console text editor unless the file is too large to fit in RAM (Seriously, what text file is? Even novels, which are typically 4 to 5 MB, fit in RAM. I've only seen log files grow to hundreds of MB).
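A minimal sketch of that line-list approach (in Python rather than Tcl):

    import re

    class LineEditor:
        def __init__(self, text=""):
            self.lines = re.split(r"\r\n|\r|\n", text)      # split at any newline style

        def insert(self, row, col, s):
            line = self.lines[row]
            self.lines[row] = line[:col] + s + line[col:]   # edit one small string

        def join_with_next(self, row):                      # e.g. delete at end of line
            self.lines[row:row + 2] = [self.lines[row] + self.lines[row + 1]]

        def text(self):
            return "\n".join(self.lines)

    ed = LineEditor("first line\nsecond line")
    ed.insert(1, 7, "short ")
    assert ed.text() == "first line\nsecond short line"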
Depending on the amount of text that needs to be in the editor at one time, a one string for the entire buffer approach would probably be fine. I think Notepad does this-- ever notice how much slower it gets to insert text in a large file?
Having one string per line in a hash table seems like a good compromise. It would make navigation to a particular line and delete/paste efficient without much complexity.
If you want to implement an undo function, you'll want a representation that allows you to go back to previous versions without storing 30 copies of the entire file for 30 changes, although again that would probably be fine if the file was sufficiently small.
The simplest way would be to use some kind of string buffer class provided by the language. Even a simple array of char objects would do at a pinch.
Appending, replacing and seeking text are then relatively quick. Other operations are potentially more time-consuming, of course, with insertion of a sequence of characters at the start of the buffer being one of the more expensive actions.
However, this may be perfectly acceptable performance-wise for a simple use case.
If the cost of insertions and deletions is particularly significant, I'd be tempted to optimise by creating a buffer wrapper class that internally maintained a list of buffer objects. Any action (except simple replacement) that didn't occur at the tail of an existing buffer would result in the buffer concerned being split at the relevant point, so the buffer could be modified at its tail. However, the outer wrapper would maintain the same interface as a simple buffer, so that I didn't have to rewrite e.g. my search action.
Of course, this simple approach would quickly end up with an extremely fragmented buffer, and I'd consider having some kind of rule to coalesce the buffers when appropriate, or to defer splitting a buffer in the case of e.g. a single character insertion. Maybe the rule would be that I'd only ever have at most 2 internal buffers, and I'd coalesce them before creating a new one – or when something asked me for a view of the whole buffer at once. Not sure.
Point is, I'd start simple but access the mutable buffer through a carefully chosen interface, and play with the internal implementation if profiling showed me I needed to.
However, I definitely wouldn't start with immutable String objects!

Ways of Efficiently Seeking in Custom File Formats

I've been wondering how seeking is implemented across different file formats, and what would be a good way to construct a file holding a lot of data so that it can be sought through efficiently. Some approaches I've considered are equal-sized packets, which allow quick skipping since you know what each data chunk is like; pre-indexing whenever a file is loaded is also a thought.
This entirely depends on the kind of data, and what you're trying to seek to.
If you're trying to seek by record index, then sure: fixed-size fields make life easier, but waste space. If you're trying to seek by anything else, keeping an index of key:location works well. If you want to be able to build the file up sequentially, you can put the index at the end but reserve the first four bytes of the file (after the magic number or whatever) to represent the location of the index itself (assuming you can rewrite those first four bytes).
If you want to be able to perform a sort of binary chop on variable length blocks, then having a reasonably efficient way of detecting the start of a block helps - as does having next/previous pointers, as mentioned by Alexander.
Basically it's all about metadata, really - but the right kind of metadata will depend on the kind of data, and the use cases for seeking in the first place.
Well, giving each chunk a size field (effectively the offset to the next chunk) is common and allows fast skipping of unknown data. Another way would be an index chunk at the beginning of the file, storing a table of all chunks in the file along with their offsets. Programs would simply read the index chunk into memory.
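A sketch combining both suggestions: length-prefixed chunks so unknown data can be skipped, plus an index whose offset lives in a fixed header slot so the index itself can be written last. The magic number and field widths here are made up for illustration:

    import struct

    MAGIC = b"MYF1"

    def write_file(path, chunks):                   # chunks: list of (tag, payload bytes)
        index = []
        with open(path, "wb") as f:
            f.write(MAGIC + struct.pack("<Q", 0))   # placeholder for the index offset
            for tag, payload in chunks:
                index.append((tag, f.tell()))
                f.write(struct.pack("<4sI", tag, len(payload)) + payload)
            index_off = f.tell()
            f.write(struct.pack("<I", len(index)))
            for tag, off in index:
                f.write(struct.pack("<4sQ", tag, off))
            f.seek(len(MAGIC))
            f.write(struct.pack("<Q", index_off))   # patch the header slot

    def read_chunk(path, wanted):
        with open(path, "rb") as f:
            assert f.read(4) == MAGIC
            (index_off,) = struct.unpack("<Q", f.read(8))
            f.seek(index_off)
            (count,) = struct.unpack("<I", f.read(4))
            index = dict(struct.unpack("<4sQ", f.read(12)) for _ in range(count))
            f.seek(index[wanted])
            tag, size = struct.unpack("<4sI", f.read(8))
            return f.read(size)

    write_file("demo.bin", [(b"HDR ", b"hello"), (b"DATA", b"x" * 1000)])
    assert read_chunk("demo.bin", b"DATA") == b"x" * 1000

Writing the index last keeps the format writable in one sequential pass; readers that don't need random access can still walk the chunks in order using the length prefixes.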

Combining semacodes and steganography?

Update
I asked this question quite a while ago now, and I'm curious: has anything like this been developed since?
I don't even know if there is a term for this kind of algorithm, and I guess there won't be if nobody has invented it yet. However, that also makes googling for it a bit hard. Does anybody know if there is a term for this algorithm/principle yet?
This is an idea I have been thinking about, but I do not quite know how to solve it. I would like to know if any solutions like this exists out there, or if you guys have any idea how this could be implemented.
Steganography
Steganography is basically the art of hiding messages. In modern days we do this digitally by e.g. modifying the least significant bits in an image, as in the one below. Thus, for every pixel and for every colour component of that pixel, we might be able to hide a bit or two.
This alteration is not visible to the naked eye, but analysing the least significant bits might reveal patterns that expose the existence, and possibly the content, of a hidden message. To counter this we simply encrypt the message before embedding it in the image, which keeps the message safe and also helps prevent discovery of the existence of a hidden message.
Thus, in principle, steganography provides the following:
Hiding an encrypted message in any kind of media data (images, music, video, etc.).
Complete deniability of the existence of a hidden message without the correct key.
Extraction of the hidden message with the correct key.
[Image: an example image with data hidden in its least significant bits (source: cs.vu.nl)]
Semacodes
Semacodes are a way of encoding data in a visual representation that may be printed, copied, and scanned easily. A Data Matrix, such as one containing the famous Lorem Ipsum text, is an example of a semacode; it is essentially a 2D barcode with a higher capacity than usual barcodes. Programs for generating semacodes are readily available, and ditto for software for reading them, especially for cell phones. Semacodes usually contain error-correcting codes, are generally very robust, and can be read even when badly damaged.
Thus semacodes have the following properties:
Data encoding that may be printed and copied.
May be scanned and interpreted even in damaged (dirty) conditions, and is generally a very robust encoding.
Combining it
So my idea is to create something that combines these two, with all of the combined properties. This means it would have to:
Embed an encrypted message in any media, probably a scanned image.
The message should be extractable even if the image is printed and scanned, and even partly damaged.
The existence of an embedded message should be undetectable without the key used for encryption.
So, first of all I would like to know if any solutions, algorithms, or research are available on this. Secondly, I would like to hear any ideas/thoughts on how this might be done.
I really hope to get a good discussion going on the possibilities and feasibility of implementing something like this, and I am looking forward to reading your answers.
Update
Thanks for all the good input on this. I will probably work a bit more on this idea when I have more time. I am convinced it must be possible. Think about research in embedding watermarks in music and movies.
I imagine part of the robustness of a semacode to damage/dirt/obscuration is the high contrast between the two states of any "cell". The reader can still make a good guess as to the actual state, even with some distortion.
That sort of contrast is not available in a photographic image, and is the very reason why steganography works - the LSB bit-flipping has almost no visual effect on the image itself, while digital fidelity ensures that a non-visual system can still very accurately read the embedded data.
As the two applications are sort of at opposite ends of the analog/digital spectrum (semacodes are all about being decipherable by analog (visual) processing but are on paper, not digital; steganography is all about the bits in the file and cares nothing for the analog representation, whether light or sound or something else), I imagine a combination of the two will be extremely difficult, if not impossible.
Essentially what you're thinking of is being able to steganographically embed something in an image, print the image, make a colour photocopy of it, scan it in, and still be able to extract the embedded data.
I'm afraid I can't help, but if anyone achieves this, I'll be DAMN impressed! :)
It's not a complete answer, but you should look at watermarking. This technique solves your first two goals (embeddable in a printed image and readable even from a partly damaged scan).
Part of watermarking's resilience to distortion and transcription errors (from going from digital to analog and back) comes from redundancy (e.g. repeating the data several times). That redundancy would make the watermark detectable even without a key. However, you might be able to use redundancy techniques that are more subtle, maybe something related to erasure coding or secret sharing.
I know that's not a complete answer, but hopefully those leads will point you in the right direction!
What language/environment are you using? It shouldn't be that hard to write code that opens both the image and the semacode as bitmaps (the latter as monochrome) and sets the lowest bit(s) of each byte of each pixel in the color image to the value of the corresponding pixel of the monochrome bitmap.
(Optionally, expand the semacode bitmap first to the same pixel dimensions, padding with white.)
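As a digital-only sketch of that last suggestion (using Pillow; the file names are placeholders), hiding a monochrome semacode bitmap in the blue channel's least significant bit and reading it back might look like this. Note that, as the other answers point out, this does not survive printing and scanning; it only demonstrates the in-file combination:

    from PIL import Image

    def embed(cover_path, code_path, out_path):
        cover = Image.open(cover_path).convert("RGB")
        code = Image.open(code_path).convert("1")      # monochrome semacode bitmap
        out = cover.copy()
        px, cx = out.load(), code.load()
        for y in range(min(code.height, out.height)):
            for x in range(min(code.width, out.width)):
                r, g, b = px[x, y]
                bit = 1 if cx[x, y] else 0
                px[x, y] = (r, g, (b & ~1) | bit)      # overwrite the blue channel's LSB
        out.save(out_path, "PNG")                      # must be lossless, or the bits are destroyed

    def extract(stego_path):
        img = Image.open(stego_path).convert("RGB")
        px = img.load()
        bits = Image.new("1", img.size)
        bx = bits.load()
        for y in range(img.height):
            for x in range(img.width):
                bx[x, y] = 255 if (px[x, y][2] & 1) else 0
        return bits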
