I have been writing an audio editor for the last couple of months, and have been recently thinking about how to implement fast and efficient editing (cut, copy, paste, trim, mute, etc.). There doesn't really seem to be very much information available on this topic, however... I know that Audacity, for example, uses a block file strategy, in which the sample data (and summaries of that data, used for efficient waveform drawing) is stored on disk in fixed-sized chunks. What other strategies might be possible, however? There is quite a lot of info on data-structures for text editing - many text (and hex) editors appear to use the piece-chain method, nicely described here - but could that, or something similar, work for an audio editor?
Many thanks in advance for any thoughts, suggestions, etc.
Chris
The classical problem for editors handling relatively large files is how to cope with deletion and insertion. Text editors obviously face this, as the user typically enters characters one at a time. Audio editors don't usually do "sample by sample" inserts (the user doesn't interactively enter one sample at a time), but you do have cut-and-paste operations.

I would start with a representation where an audio file is represented by chunks of data stored in a (binary) search tree. Insert works by splitting the chunk you are inserting into in two, adding the inserted data as a third chunk, and updating the tree. To make this efficient and responsive to the user, you should then have a background process that defragments the representation on disk (or in memory) and then makes an atomic update to the tree holding the chunks. This should make inserts and deletes as fast as possible.

Many other audio operations (effects, normalize, mix) operate in place and do not require changes to the data structure, but running e.g. normalize over the whole sample is a good opportunity to defragment it at the same time.

If the audio samples are large, you can keep the chunks on disk as well, as is standard. I don't believe the chunks need to be fixed size; they can be variable size, preferably 1024 x (power of two) bytes to make file operations efficient, although a fixed-size strategy can be easier to implement.
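To make the chunk idea concrete, here is a minimal sketch of a piece list of chunks, assuming the sample data already lives in some backing store; the names (Chunk, AudioPieces) are made up for the example. A linear scan is used for clarity, where the balanced tree keyed on cumulative length suggested above would make the lookup logarithmic.

    #include <cstdint>
    #include <list>

    struct Chunk {
        int64_t sourceOffset;  // where this run of samples starts in its backing store
        int64_t length;        // number of samples in the run
    };

    class AudioPieces {
    public:
        // Insert newChunk so that its first sample lands at logical position pos.
        void insert(int64_t pos, Chunk newChunk) {
            auto it = pieces_.begin();
            int64_t logical = 0;
            while (it != pieces_.end() && logical + it->length <= pos) {
                logical += it->length;
                ++it;
            }
            if (it != pieces_.end() && pos > logical) {
                // pos falls inside *it: split it in two and put the new chunk in between.
                int64_t left = pos - logical;
                Chunk tail{it->sourceOffset + left, it->length - left};
                it->length = left;
                ++it;
                it = pieces_.insert(it, tail);
            }
            pieces_.insert(it, newChunk);
        }

    private:
        std::list<Chunk> pieces_;  // ordered run of chunks making up the edited audio
    };

Deleting a range works the same way: split the chunks at the range boundaries and unlink the pieces in between, without touching the sample data itself.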
I am looking for an algorithm to compress small ASCII strings. They contain lots of letters but they also can contain numbers and rarely special characters. They will be small, about 50-100 bytes average, 250 max.
Examples:
Android show EditText.setError() above the EditText and not below it
ImageView CENTER_CROP dont work
Prevent an app to show on recent application list on android kitkat 4.4.2
Image can't save validable in android
Android 4.4 SMS - Not receiving sentIntents
Imported android-map-extensions version 2.0 now my R.java file is missing
GCM registering but not receiving messages on pre 4.0.4. devices
I want to compress the titles one by one, not many titles together and I don't care much about CPU and memory usage.
You can use Huffman coding with a shared Huffman tree among all texts you want to compress.
While you typically construct a Huffman tree for each string to be compressed separately, this would require a lot of overhead in storage which should be avoided here. That's also the major problem when using a standard compression scheme for your case: most of them have some overhead which kills your compression efficiency for very short strings. Some of them don't have a (big) overhead but those are typically less efficient in general.
When constructing a Huffman tree which is later used for compression and decompression, you typically use the texts that will be compressed to decide which character is encoded with which bits. Since in your case the texts to be compressed seem to be unknown in advance, you need some "pseudo" texts to build the tree, perhaps taken from a dictionary of the target language or from your experience with previous user data.
Then construct the Huffman tree and store it once in your application; either hardcode it into the binary or provide it in the form of a file. Then you can compress and decompress any texts using this tree. Whenever you decide to change the tree since you gain better experience on which texts are compressed, the compressed string representation also changes. It might be a good idea to introduce versioning and store the tree version together with each string you compress.
Another improvement you might think about is to use multi-character Huffman encoding. Instead of compressing the texts character by character, you could find frequent syllables or words and put them into the tree too; then they require even less bits in the compressed string. This however requires a little bit more complicated compression algorithm, but it might be well worth the effort.
To process a string of bits in the compression and decompression routine in C++(*), I recommend either boost::dynamic_bitset or std::vector<bool>. Both internally pack multiple bits into bytes.
(*)The question once had the c++ tag, so OP obviously wanted to implement it in C++. But as the general problem is not specific to a programming language, the tag was removed. But I still kept the C++-specific part of the answer.
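As a rough illustration of the shared-tree approach, here is a sketch in C++ that builds one code table from a frequency count gathered over representative sample texts and then encodes a title into a std::vector<bool>. The function names (buildCodes, encode) are made up for the example, memory management and degenerate cases are ignored, and a real implementation would also need an escape code for characters missing from the table.

    #include <map>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    struct Node {
        long freq;
        unsigned char ch;        // meaningful only for leaves
        Node* left = nullptr;
        Node* right = nullptr;
    };

    struct ByFreq {
        bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
    };

    // Build the shared code table (character -> bit string) from corpus frequencies.
    std::map<unsigned char, std::vector<bool>> buildCodes(const std::map<unsigned char, long>& freq) {
        std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
        for (const auto& f : freq) pq.push(new Node{f.second, f.first});
        if (pq.empty()) return {};
        while (pq.size() > 1) {                   // classic Huffman: merge the two rarest nodes
            Node* a = pq.top(); pq.pop();
            Node* b = pq.top(); pq.pop();
            pq.push(new Node{a->freq + b->freq, 0, a, b});
        }
        std::map<unsigned char, std::vector<bool>> codes;
        std::vector<std::pair<Node*, std::vector<bool>>> stack;
        stack.push_back({pq.top(), {}});
        while (!stack.empty()) {                  // record the path (0 = left, 1 = right) to each leaf
            auto [node, path] = stack.back();
            stack.pop_back();
            if (!node->left && !node->right) { codes[node->ch] = path; continue; }
            auto l = path; l.push_back(false);
            auto r = path; r.push_back(true);
            stack.push_back({node->left, l});
            stack.push_back({node->right, r});
        }
        return codes;
    }

    // Compress one title with the shared table; decompression walks the same tree.
    std::vector<bool> encode(const std::string& text,
                             const std::map<unsigned char, std::vector<bool>>& codes) {
        std::vector<bool> bits;
        for (unsigned char c : text) {
            const auto& code = codes.at(c);
            bits.insert(bits.end(), code.begin(), code.end());
        }
        return bits;
    }

For the multi-character variant mentioned above, the map keys would simply become strings (frequent words or syllables) instead of single characters, with a tokenizer deciding how to split the input.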
I have to write a tool that manages very large data sets (well, large for an ordinary workstation). I basically need something that works the opposite way to the JPEG format: I need the dataset to be intact on disk, where it can be arbitrarily large, but it should be lossily compressed when it gets read into memory, and only the sub-part used at any given time needs to be decompressed on the fly. I have started looking at IPP (Intel Integrated Performance Primitives) but it's not yet clear whether I can use it for what I need to do.
Can anyone point me in the right direction?
Thank you.
Given the nature of your data, it seems you are handling some kind of raw samples.
So the easiest and most generic "lossy" technique will be to drop the lower bits, reducing precision, up to the level you want.
Note that you will need to "drop the lower bits", which is quite different from "round to the next power of 10". Computers work in base 2, and you want all your lower bits to be "00000" for compression to perform as well as possible. This method assumes that the selected compression algorithm will make use of the predictable zero-bit pattern.
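For example, assuming 16-bit integer samples, dropping the lower bits is just a mask applied before handing the buffer to the compressor (a sketch, not tied to any particular dataset):

    #include <cstdint>
    #include <vector>

    // Zero out the lowest `bitsToDrop` bits of every sample; the long runs of
    // predictable zero bits are what the downstream compressor feeds on.
    void dropLowerBits(std::vector<int16_t>& samples, int bitsToDrop) {
        const int16_t mask = static_cast<int16_t>(~((1 << bitsToDrop) - 1));
        for (auto& s : samples) s &= mask;   // e.g. bitsToDrop = 4 keeps 12 bits of precision
    }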
Another method, more complex and more specific, could be to convert your values as an index into a table. The advantage is that you can "target" precision where you want it. The obvious drawback is that the table will be specific to a distribution pattern.
On top of that, you may also store not the value itself but its delta from the preceding value, if there is any kind of relation between them. This will help compression too.
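A sketch of that delta step, again assuming 16-bit samples; the original values are recovered by a running sum:

    #include <cstdint>
    #include <vector>

    // Replace each sample with its difference from the previous one; slowly
    // varying data turns into many small, highly compressible numbers.
    std::vector<int16_t> toDeltas(const std::vector<int16_t>& samples) {
        std::vector<int16_t> deltas(samples.size());
        int16_t prev = 0;
        for (size_t i = 0; i < samples.size(); ++i) {
            deltas[i] = static_cast<int16_t>(samples[i] - prev);  // truncation is undone by the running sum
            prev = samples[i];
        }
        return deltas;
    }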
For the data to compress well, you will need to group it into packets of an appropriate size, such as 64 KB. No compression algorithm will give you suitable results on a single field. This, in turn, means that each time you want to access a field you need to decompress the whole packet, so tune the packet size according to how you intend to access the data. Sequential access is easier to deal with in such circumstances.
Regarding compression algorithm, since these data are going to be "live", you need something very fast, so that accessing the data has very small latency impact.
There are several open-source alternatives out there for that use. For easier license management, I would recommend a BSD-licensed alternative. Since you use C++, the following look suitable:
http://code.google.com/p/snappy/
and
http://code.google.com/p/lz4/
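For instance, with LZ4's C API (LZ4_compressBound, LZ4_compress_default and LZ4_decompress_safe from lz4.h), compressing and decompressing one packet might look roughly like this; error handling is minimal and the packet size is whatever you settled on above:

    #include <stdexcept>
    #include <vector>
    #include "lz4.h"

    std::vector<char> compressPacket(const std::vector<char>& packet) {
        std::vector<char> out(LZ4_compressBound(static_cast<int>(packet.size())));  // worst-case output size
        const int written = LZ4_compress_default(packet.data(), out.data(),
                                                 static_cast<int>(packet.size()),
                                                 static_cast<int>(out.size()));
        if (written <= 0) throw std::runtime_error("LZ4 compression failed");
        out.resize(written);
        return out;
    }

    std::vector<char> decompressPacket(const std::vector<char>& compressed, int originalSize) {
        std::vector<char> out(originalSize);                                         // caller must remember the original size
        const int read = LZ4_decompress_safe(compressed.data(), out.data(),
                                             static_cast<int>(compressed.size()), originalSize);
        if (read < 0) throw std::runtime_error("LZ4 decompression failed");
        return out;
    }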
This question is probably going to make me sound pretty clueless. That's because I am.
I'm just thinking, if I were hypothetically interested in designing my own text editor GUI control, widget, or whatever you want to call it (which I'm not), how would I even do it?
The temptation to a novice such as myself would be to store the content of the text editor in the form of a string, which seems quite costly (not that I'm too familiar with how string implementations differ between one language/platform and the next; but I know that in .NET, for example, they're immutable, so frequent manipulation such as what you'd need to support in a text editor would be magnificently wasteful, constructing one string instance after another in very rapid succession).
Presumably some mutable data structure containing text is used instead; but figuring out what this structure might look like strikes me as a bit of a challenge. Random access would be good (I would think, anyway—after all, don't you want the user to be able to jump around to anywhere in the text?), but then I wonder about the cost of, say, navigating to somewhere in the middle of a huge document and starting to type immediately. Again, the novice approach (say you store the text as a resizeable array of characters) would lead to very poor performance, I'm thinking, as with every character typed by the user there would be a huge amount of data to "shift" over.
So if I had to make a guess, I'd suppose that text editors employ some sort of structure that breaks the text down into smaller pieces (lines, maybe?), which individually comprise character arrays with random access, and which are themselves randomly accessible as discrete chunks. Even that seems like it must be a rather monstrous oversimplification, though, if it is even remotely close to begin with.
Of course I also realize that there may not be a "standard" way that text editors are implemented; maybe it varies dramatically from one editor to another. But I figured, since it's clearly a problem that's been tackled many, many times, perhaps a relatively common approach has surfaced over the years.
Anyway, I'm just interested to know if anyone out there has some knowledge on this topic. Like I said, I'm definitely not looking to write my own text editor; I'm just curious.
One technique that's common (especially in older editors) is called a split buffer, also known as a gap buffer. Basically, you "break" the text into everything before the cursor and everything after the cursor. Everything before goes at the beginning of the buffer. Everything after goes at the end of the buffer.
When the user types in text, it goes into the empty space in between without moving any data. When the user moves the cursor, you move the appropriate amount of text from one side of the "break" to the other. Typically there's a lot of moving around a single area, so you're usually only moving small amounts of text at a time. The biggest exception is if you have a "go to line xxx" kind of capability.
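A minimal sketch of such a buffer (the class and member names are my own; a real editor would also track lines, undo, and so on):

    #include <algorithm>
    #include <string>
    #include <vector>

    class SplitBuffer {
    public:
        explicit SplitBuffer(size_t capacity = 4096) : buf_(capacity), preLen_(0), postLen_(0) {}

        void insert(char c) {
            if (preLen_ + postLen_ == buf_.size()) grow();
            buf_[preLen_++] = c;                    // typing fills the gap; no data moves
        }

        void moveCursorLeft() {
            if (preLen_ == 0) return;
            buf_[buf_.size() - postLen_ - 1] = buf_[preLen_ - 1];  // shift one char across the gap
            --preLen_;
            ++postLen_;
        }

        void moveCursorRight() {
            if (postLen_ == 0) return;
            buf_[preLen_++] = buf_[buf_.size() - postLen_];
            --postLen_;
        }

        std::string text() const {                  // stitch the two halves back together
            return std::string(buf_.begin(), buf_.begin() + preLen_) +
                   std::string(buf_.end() - postLen_, buf_.end());
        }

    private:
        void grow() {                               // double the buffer, keeping the gap in the middle
            std::vector<char> bigger(buf_.size() * 2 + 1);
            std::copy(buf_.begin(), buf_.begin() + preLen_, bigger.begin());
            std::copy(buf_.end() - postLen_, buf_.end(), bigger.end() - postLen_);
            buf_.swap(bigger);
        }

        std::vector<char> buf_;
        size_t preLen_, postLen_;
    };

Typing repeatedly at one spot only writes into the gap; moving the cursor by one position moves exactly one character across it, which is why the common case is cheap.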
Charles Crowley has written a much more complete discussion of the topic. You might also want to look at The Craft of Text Editing, which covers split buffers (and other possibilities) in much greater depth.
A while back, I wrote my own text editor in Tcl (actually, I stole the code from somewhere and extended it beyond recognition, ah the wonders of open source).
As you mentioned, doing string operations on very, very large strings can be expensive. So the editor splits the text into smaller strings at each newline ("\n" or "\r" or "\r\n"). So all I'm left with is editing small strings at line level and doing list operations when moving between lines.
The other advantage of this is that it is a simple and natural concept to work with. My mind already considers each line of text to be separate, reinforced by years of programming where newlines are stylistically or syntactically significant.
It also helps that the use case for my text editor is as a programmer's editor. For example, I implemented syntax highlighting but not word/line wrapping, so in my case there is a 1:1 mapping between newlines in the text and the lines drawn on screen.
In case you want to have a look, here's the source code for my editor: http://wiki.tcl.tk/16056
It's not a toy BTW. I use it daily as my standard console text editor unless the file is too large to fit in RAM (Seriously, what text file is? Even novels, which are typically 4 to 5 MB, fit in RAM. I've only seen log files grow to hundreds of MB).
Depending on the amount of text that needs to be in the editor at one time, a single string for the entire buffer would probably be fine. I think Notepad does this; ever notice how much slower inserting text gets in a large file?
Having one string per line in a hash table seems like a good compromise. It would make navigation to a particular line and delete/paste efficient without much complexity.
If you want to implement an undo function, you'll want a representation that allows you to go back to previous versions without storing 30 copies of the entire file for 30 changes, although again that would probably be fine if the file was sufficiently small.
The simplest way would be to use some kind of string buffer class provided by the language. Even a simple array of char objects would do at a pinch.
Appending, replacing and seeking text are then relatively quick. Other operations are potentially more time-consuming, of course, with insertion of a sequence of characters at the start of the buffer being one of the more expensive actions.
However, this may be perfectly acceptable performance-wise for a simple use case.
If the cost of insertions and deletions is particularly significant, I'd be tempted to optimise by creating a buffer wrapper class that internally maintained a list of buffer objects. Any action (except simple replacement) that didn't occur at the tail of an existing buffer would result in the buffer concerned being split at the relevant point, so the buffer could be modified at its tail. However, the outer wrapper would maintain the same interface as a simple buffer, so that I didn't have to rewrite e.g. my search action.
Of course, this simple approach would quickly end up with an extremely fragmented buffer, and I'd consider having some kind of rule to coalesce the buffers when appropriate, or to defer splitting a buffer in the case of e.g. a single character insertion. Maybe the rule would be that I'd only ever have at most 2 internal buffers, and I'd coalesce them before creating a new one – or when something asked me for a view of the whole buffer at once. Not sure.
Point is, I'd start simple but access the mutable buffer through a carefully chosen interface, and play with the internal implementation if profiling showed me I needed to.
However, I definitely wouldn't start with immutable String objects!
I've been thinking about writing a text editor control that can edit text that can have any arbitrary length (say, hundreds of megabytes), similar in some ways to the Scintilla editor. The goal is to lazy-read the file, so the user doesn't have to read five hundred megabytes of data just to view a small portion of it. I'm having two problems with this:
It seems to me to be impossible to implement any sensible scrolling feature for such an editor, unless I pre-read the entire file once, in order to figure out line breaks. Is this really true? Or is there a way to approximate things, that I'm not thinking of?
Because of various issues with Unicode (e.g. it allows many bytes to represent just one character, not just because of variable-length encoding but also because of accents and such), it seems nearly impossible to determine exactly how much text will fit on the screen -- I'd have to use TextOut() or something to draw one character, measure how big it was, and then draw the next character. And even then, that still doesn't say how I'd map the user's clicks back to the correct text position.
Is there anything I could read on the web regarding algorithms for handling these issues? I've searched, but I haven't found anything.
Thank you!
You can set a "coarse" position based on data size instead of lines. The "fine" position of your text window can be based on a local scan around an arbitrary entry point.
This means you will need to write functions that can scan locally (backwards and forwards) to find line starts, count Unicode characters, and so forth. This should not be too difficult; UTF-8 is designed to be easy to parse in this way.
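As a small sketch of what "scan locally" means in practice, assuming the visible region is held in a UTF-8 byte buffer: a newline byte never appears inside a multi-byte sequence, and continuation bytes always match 10xxxxxx, so both scans are simple byte loops.

    #include <cstddef>
    #include <string>

    // Walk backwards from an arbitrary byte offset to the start of its line.
    size_t lineStart(const std::string& buf, size_t pos) {
        while (pos > 0 && buf[pos - 1] != '\n') --pos;
        return pos;
    }

    // Count Unicode code points in [from, to) by skipping UTF-8 continuation bytes.
    size_t codePointsBetween(const std::string& buf, size_t from, size_t to) {
        size_t count = 0;
        for (size_t i = from; i < to; ++i)
            if ((static_cast<unsigned char>(buf[i]) & 0xC0) != 0x80)
                ++count;
        return count;
    }

Note that code points are still not display columns (combining accents, as you mention), so mapping clicks to text positions still has to go through the platform's text measurement API.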
You may want to give special consideration to what to do about extremely long lines. Since there is no upper limit on how long a line can be, finding the beginning (or end) of a line is an unbounded task; I believe everything else you need for a screen-editor display is local.
Finally, if you want a general text editor, you need to figure out what you're going to do when you want to save a file in which you've inserted/deleted things. The straightforward thing is to rewrite the file; however, this is obviously going to take longer with a huge file. You can expect the user to run into problems if there is not enough room for a modified copy, so at the very least, you will want to check to make sure there is enough room on the filesystem.
#comingstorm is basically right. For display, you start at the cursor and scan backwards until you're sure you're past the top of the screen. Then you scan backwards to a line end, assuming you can identify a line end scanning backwards. Now you scan forwards, calculating and saving screen line start positions until you've gone far enough. Finally, you pick the line you want to start displaying on and off you go.
For simple text this can be done on an archaic processor fast enough to redraw a memory mapped video display every keystroke. [I invented this technology 30 years ago]. The right way to do this is to fix the cursor in the middle line of the screen.
For actually modifying files, you might look at using GNU's ropes. A rope is basically a linked list of buffers; the idea is that all local edits can be done in just one small buffer, occasionally adding a new buffer and occasionally merging adjacent buffers.
I would consider combining this technology with differential storage: the kind of thing all modern source control systems do. You basically have to use this kind of transaction based editing if you want to implement the undo function.
The key to this is invertible transactions, i.e. ones that contain enough information to be applied backwards to undo what they did when applied forwards. The core editor transaction is:
at pos p replace old with new
which has inverse
at pos p replace new with old
This handles insert (old is empty) and delete (new is empty) as well as replace. Given a transaction list, you can undo in-place modifications to a string by applying the inverse transactions in reverse order.
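A bare-bones sketch of that transaction, with undo done by applying the inverse of the most recent edit (the struct and function names are just for illustration):

    #include <string>
    #include <vector>

    struct Edit {
        size_t pos;
        std::string oldText;   // empty for a pure insert
        std::string newText;   // empty for a pure delete

        Edit inverse() const { return Edit{pos, newText, oldText}; }

        void apply(std::string& doc) const {
            // assumes doc currently contains oldText at pos
            doc.replace(pos, oldText.size(), newText);
        }
    };

    void undo(std::string& doc, std::vector<Edit>& history) {
        if (history.empty()) return;
        history.back().inverse().apply(doc);   // undo = apply the inverse of the last edit
        history.pop_back();
    }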
Now you use the old checkpointing concept: you store a fairly recent in-place modified image of the file together with some recent transactions that haven't been applied yet. To display, you apply the transactions on the fly. To undo, you just throw away some transactions. Occasionally, you actually apply the transactions, making a "checkpoint" image. This speeds up the display, at the cost of making the undo slower.
Finally: to rewrite a huge sequential text file, you would normally rewrite the whole text, which is horrible. If you can cheat a bit, allow arbitrary NUL (0) characters in the text, and have access to the virtual memory system's page manager and low-level disk access, you can do much better by keeping all the unchanged pages of text and just reorganising them: in other words, the ropes idea on disk.
For an open source project I have I am writing an abstraction layer on top of the filesystem.
This layer allows me to attach metadata and relationships to each file.
I would like the layer to handle file renames gracefully and maintain the metadata if a file is renamed / moved or copied.
To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.
So, I have been thinking of an algorithm that, though not 100% correct, will be right the vast majority of the time and is cheap.
One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.
Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.
I need this algorithm to work for very large files (1 GB+) as well as tiny files (5 KB).
EDIT
I need this algorithm to work on NTFS and all SMB shares (Linux- or Windows-based). I would like it to support situations where a file is copied from one spot to another (the two physical copies are treated as one identity). I may even consider wanting this to work in situations where MP3s are re-tagged (the physical file is changed, so I may have an identity provider per file type).
EDIT 2
Related question: Algorithm for determining a file’s identity (Optimisation)
Bucketing, with multiple layers of comparison, should be the fastest and most scalable approach across the range of files you're discussing.
First level of indexing is just the length of the file.
Second level is hash. Below a certain size it is a whole-file hash. Beyond that, yes, I agree with your idea of a sampling algorithm. Issues that I think might affect the sampling speed:
To avoid hitting regularly spaced headers, which may be highly similar or identical, you need to step by a non-conforming amount, e.g. multiples of a prime or successive primes.
Avoid steps that might end up encountering regular record headers; if you are getting the same value from your sample bytes despite different locations, try adjusting the step by another prime.
Cope with anomalous files with large stretches of identical values, either because they are unencoded images or just filled with nulls.
Do the first 128 KB, another 128 KB at the 1 MB mark, another 128 KB at the 10 MB mark, another 128 KB at the 100 MB mark, another 128 KB at the 1000 MB mark, and so on. As file sizes get larger, and it becomes more likely that you'll be able to distinguish two files based on their size alone, you hash a smaller and smaller fraction of the data. Everything under 128 KB is taken care of completely.
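A sketch of that layout, hashing the file length plus 128 KB samples at exponentially spaced offsets; FNV-1a stands in here purely as a cheap placeholder hash, and error handling is kept minimal:

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    uint64_t fnv1a(uint64_t h, const char* data, size_t len) {
        for (size_t i = 0; i < len; ++i) {
            h ^= static_cast<unsigned char>(data[i]);
            h *= 1099511628211ULL;
        }
        return h;
    }

    uint64_t sampledIdentity(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        if (!in) return 0;                         // error handling kept minimal
        in.seekg(0, std::ios::end);
        const uint64_t size = static_cast<uint64_t>(in.tellg());

        // Start from the file length, then mix in sampled blocks.
        uint64_t h = fnv1a(14695981039346656037ULL,
                           reinterpret_cast<const char*>(&size), sizeof(size));

        std::vector<char> block(128 * 1024);
        uint64_t offset = 0;                       // 0, 1 MB, 10 MB, 100 MB, ...
        while (offset < size) {
            in.seekg(static_cast<std::streamoff>(offset));
            in.read(block.data(), static_cast<std::streamsize>(block.size()));
            h = fnv1a(h, block.data(), static_cast<size_t>(in.gcount()));
            offset = (offset == 0) ? 1024 * 1024 : offset * 10;
        }
        return h;
    }

Files under 128 KB end up hashed in full, exactly as described.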
Believe it or not, I use the ticks of the last write time for the file. It is as cheap as it gets, and I have yet to see a clash between different files.
If you can drop the Linux share requirement and confine yourself to NTFS, then NTFS Alternate Data Streams will be a perfect solution that:
doesn't require any kind of hashing;
survives renames; and
survives moves (even between different NTFS volumes).
You can read more about it here. Basically you just append a colon and a name for your stream (e.g. ":meta") and write whatever you like to it. So if you have a directory "D:\Movies\Terminator", write your metadata using normal file I/O to "D:\Movies\Terminator:meta". You can do the same if you want to save the metadata for a specific file (as opposed to a whole folder).
If you'd prefer to store your metadata somewhere else and just be able to detect moves/renames on the same NTFS volume, you can use the GetFileInformationByHandle API call (see MSDN /en-us/library/aa364952(VS.85).aspx) to get the unique ID of the folder (combine VolumeSerialNumber and FileIndex members). This ID will not change if the file/folder is moved/renamed on the same volume.
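A sketch of that lookup using the Win32 calls named above (CreateFileW, GetFileInformationByHandle); the function name fileIndexOf is mine, and error handling is minimal:

    #include <windows.h>
    #include <cstdint>
    #include <optional>
    #include <string>

    // Returns the file index, which stays stable across renames and moves on the
    // same volume; pair it with info.dwVolumeSerialNumber for the full identity.
    std::optional<uint64_t> fileIndexOf(const std::wstring& path) {
        HANDLE h = CreateFileW(path.c_str(), 0,
                               FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                               nullptr, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, nullptr);
        if (h == INVALID_HANDLE_VALUE) return std::nullopt;

        BY_HANDLE_FILE_INFORMATION info{};
        const BOOL ok = GetFileInformationByHandle(h, &info);
        CloseHandle(h);
        if (!ok) return std::nullopt;

        return (static_cast<uint64_t>(info.nFileIndexHigh) << 32) | info.nFileIndexLow;
    }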
How about storing some random integers r_i and looking up the bytes at (r_i mod n), where n is the size of the file? For files with headers, you can ignore them first and then do this process on the remaining bytes.
If your files are actually pretty different (not just a difference in a single byte somewhere, but say at least 1% different), then a random selection of bytes would notice that. For example, with a 1% difference in bytes, 100 random bytes would fail to notice the difference with probability 0.99^100, which is about 1/e, or 37%; increasing the number of bytes you look at makes this probability drop exponentially.
The idea behind using random bytes is that they are essentially guaranteed (well, probabilistically speaking) to be as good as any other sequence of bytes, except they aren't susceptible to some of the problems with other sequences (e.g. happening to look at every 256-th byte of a file format where that byte is required to be 0 or something).
Some more advice:
Instead of grabbing bytes, grab larger chunks to justify the cost of seeking.
I would suggest always looking at the first block or so of the file. From this, you can determine filetype and such. (For example, you could use the file program.)
At least weigh the cost/benefit of something like a CRC of the entire file. It's not as expensive as a real cryptographic hash function, but still requires reading the entire file. The upside is it will notice single-byte differences.
Well, first you need to look more deeply into how filesystems work. Which filesystems will you be working with? Most filesystems support things like hard links and soft links and therefore "filename" information is not necessarily stored in the metadata of the file itself.
Actually, this is the whole point of a stackable layered filesystem: you can extend it in various ways, say to support compression or encryption. This is what "vnodes" are all about. You could do this in several ways; some of it is very dependent on the platform you are looking at. It is much simpler on UNIX/Linux systems that use a VFS concept. You could implement your own layer on top of ext3, for instance, or whatever you have.
After reading your edits, a couple more things. Filesystems already do this, as mentioned before, using things like inodes. Hashing is probably going to be a bad idea, not just because it is expensive but because two or more preimages can share the same image; that is to say, two entirely different files can have the same hash value. I think what you really want to do is exploit the metadata that the filesystem already exposes. This would be simpler on an open-source system, of course. :)
Which bytes should I choose for the sample?
I think I would try to use a sequence like the Fibonacci numbers. These are easy to calculate, and their density diminishes as the offsets grow: small files would have a higher sample ratio than big files, and the samples would still be spread over the whole file.
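For illustration, a tiny sketch of generating those sample offsets (the function name is mine):

    #include <cstdint>
    #include <vector>

    // Offsets 1, 2, 3, 5, 8, 13, ... bytes into the file: dense near the start,
    // increasingly sparse further in, so small files get a higher sample ratio.
    std::vector<uint64_t> fibonacciOffsets(uint64_t fileSize) {
        std::vector<uint64_t> offsets;
        uint64_t a = 1, b = 2;
        while (a < fileSize) {
            offsets.push_back(a);
            const uint64_t next = a + b;
            a = b;
            b = next;
        }
        return offsets;
    }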
This work sounds like it could be more effectively implemented at the filesystem level or with some loose approximation of a version control system (both?).
To address the original question, you could keep a database of (file size, bytes hashed, hash) for each file and try to minimize the number of bytes hashed for each file size. Whenever you detect a collision you either have an identical file, or you increase the hash length to go just past the first difference.
There are undoubtedly optimizations to be made and CPU vs. I/O tradeoffs as well, but it's a good start for something that won't have false positives.