What are some alternatives to a bit array? - data-structures

I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in the array varies widely, from all clear to all set. Currently, I'm using a straightforward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes.
My plan is to look at the cardinality of the first N bits, then make a decision about what data structure to use for the remainder. Clearly some data structures are better for very sparse bit arrays, and others when roughly half the bits are set (when most bits are set, I can use negation to treat it as a sparse set of zeroes).
What structures might be good at each extreme?
Are there any in the middle?
Here are a few constraints or hints:
The bits are set only once, and in index order.
I need 100% accuracy, so something like a Bloom filter isn't good enough.
After the set is built, I need to be able to efficiently iterate over the "set" bits.
The bits are randomly distributed, so run-length–encoding algorithms aren't likely to be much better than a simple list of bit indexes.
I'm trying to optimize memory utilization, but speed still carries some weight.
Something with an open source Java implementation is helpful, but not strictly necessary. I'm more interested in the fundamentals.

Unless the data is truly random with a symmetric 1/0 distribution, this is simply a lossless data compression problem, closely analogous to the CCITT Group 3 compression used for black-and-white (i.e., binary) fax images. CCITT Group 3 uses a Huffman coding scheme. Fax uses a fixed set of Huffman codes, but you can generate a specific set of codes for each data set to improve the compression ratio. As long as you only need to access the bits sequentially, as you implied, this will be a pretty efficient approach. Random access would create some additional challenges, but you could probably build a binary search tree index to various offset points in the array that would let you get close to the desired location and then walk in from there.
Note: The Huffman scheme still works well even if the data is random, as long as the 1/0 distribution is not perfectly even. That is, the less even the distribution, the better the compression ratio.
Finally, if the bits are truly random with an even distribution, then, well, according to Mr. Claude Shannon, you are not going to be able to compress it any significant amount using any scheme.
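To put rough numbers on this, here is the binary entropy bound, assuming each bit is set independently with probability p (an idealization; the symbol p and the 10-million-bit example are mine, not from the question):

```latex
H(p) = -p \log_2 p - (1-p) \log_2 (1-p) \quad \text{bits per array position}

H(0.5) = 1, \qquad H(0.1) \approx 0.469, \qquad H(0.01) \approx 0.081
```

Under that assumption, a 10-million-bit array with 1% of its bits set needs at least about 10^7 * 0.081 = 810,000 bits, or roughly 101 KB, against 1.25 MB for the plain bit array and 400 KB for a list of 100,000 32-bit indexes. At p = 0.5 the bound is the full 1.25 MB, which is the "no compression" case described above.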

I would strongly consider using range encoding in place of Huffman coding. In general, range encoding can exploit asymmetry more effectively than Huffman coding, but this is especially so when the alphabet size is so small. In fact, when the "native alphabet" is simply 0s and 1s, the only way Huffman can get any compression at all is by combining those symbols -- which is exactly what range encoding will do, more effectively.

Maybe too late for you, but there is a very fast and memory-efficient library for sparse bit arrays (lossless) and other data types, based on tries. Look at Judy arrays.

Thanks for the answers. This is what I'm going to try for dynamically choosing the right method:
I'll collect all of the first N hits in a conventional bit array, and choose one of three methods, based on the symmetry of this sample.
If the sample is highly asymmetric, I'll simply store the indexes to the set bits (or maybe the distance to the next bit) in a list.
If the sample is highly symmetric, I'll keep using a conventional bit array.
If the sample is moderately symmetric, I'll use a lossless compression method like Huffman coding suggested by InSciTekJeff.
The boundaries between the asymmetric, moderate, and symmetric regions will depend on the time required by the various algorithms balanced against the space they need, where the relative value of time versus space would be an adjustable parameter. The space needed for Huffman coding is a function of the symmetry, and I'll profile that with testing. Also, I'll test all three methods to determine the time requirements of my implementation.
It's possible (and actually I'm hoping) that the middle compression method will always be better than the list or the bit array or both. Maybe I can encourage this by choosing a set of Huffman codes adapted for higher or lower symmetry. Then I can simplify the system and just use two methods.
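A minimal sketch of the selection step, covering only the two simple representations. The class name, the enum, and the 1/32 cutoff are assumptions of mine: the cutoff is just the break-even point between one 32-bit index per set bit and one bit per position, and the real boundaries would come from the profiling described above. The middle (Huffman) tier is left out.

```java
import java.util.BitSet;

/** Hypothetical helper: picks a representation from the density of a sample prefix. */
final class AdaptiveBitSetChooser {
    enum Representation { INDEX_LIST, INVERTED_INDEX_LIST, BIT_ARRAY }

    /**
     * Assumed cost model: 32 bits per stored index versus 1 bit per position,
     * so an index list is smaller only when fewer than 1 bit in 32 is listed.
     */
    static Representation choose(BitSet sample, int sampleBits) {
        double density = sample.cardinality() / (double) sampleBits;
        double sparseCutoff = 1.0 / 32.0;
        if (density < sparseCutoff) {
            return Representation.INDEX_LIST;          // list the set bits
        }
        if (density > 1.0 - sparseCutoff) {
            return Representation.INVERTED_INDEX_LIST; // list the clear bits instead
        }
        return Representation.BIT_ARRAY;
    }

    /** Sorted indexes of the set bits; iterating this array visits them in order. */
    static int[] toIndexList(BitSet bits) {
        return bits.stream().toArray();
    }
}
```

Storing the gap to the next set bit instead of the absolute index would shift the cutoff, since small gaps fit in fewer than 32 bits.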

One more compression thought:
If the bit array is not crazy long, you could try applying the Burrows-Wheeler transform before using any repetition encoding, such as Huffman. A naive implementation would take O(n^2) memory during (de)compression and O(n^2 log n) time to decompress - there are almost certainly shortcuts to be had, as well. But if there's any sequential structure to your data at all, this should really help the Huffman encoding out.
You could also apply that idea to one block at a time to keep the time/memory usage more practical. Using one block at a time could allow you to always keep most of the data structure compressed if you're reading/writing sequentially.

Straightforward lossless compression is the way to go. To make it searchable you will have to compress relatively small blocks and create an index into an array of the blocks. This index can contain the bit offset of the starting bit in each block.

Quick combinatoric proof that you can't really save much space:
Suppose you have an arbitrary subset of n/2 bits set to 1 out of n total bits. You have (n choose n/2) possibilities. Using Stirling's formula, this is roughly 2^n / sqrt(n) * sqrt(2/pi). If every possibility is equally likely, then there's no way to give more likely choices shorter representations. So we need log_2 (n choose n/2) bits, which is about n - (1/2) log_2(n) bits.
That's not a very good savings of memory. For example, if you're working with n=2^20 (1 meg), then you can only save about 10 bits. It's just not worth it.
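Spelled out, using the same Stirling approximation and the same n = 2^20 example:

```latex
\binom{n}{n/2} \approx 2^{n} \sqrt{\frac{2}{\pi n}}
\quad\Longrightarrow\quad
\log_2 \binom{n}{n/2} \approx n - \tfrac{1}{2}\log_2 n - \tfrac{1}{2}\log_2 \frac{\pi}{2}

n = 2^{20}: \qquad \tfrac{1}{2}\log_2 n + \tfrac{1}{2}\log_2 \frac{\pi}{2} \approx 10 + 0.33 = 10.33 \text{ bits saved}
```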
Having said all that, it also seems very unlikely that any really useful data is truly random. In case there's any more structure to your data, there's probably a more optimistic answer.

Related

What is the best lossless compression algorithm for random data

I need to compress a stream of random data like [25,94,182,3,254, ...]. There are close to 4 million values. I currently get only a 1.4x ratio with Huffman coding. The LZW algorithm I tried takes too much time to compress. I hope to find an efficient compression method that still has a high compression ratio, at least 3x.
Is there another algorithm that would compress this random data better?
It depends on the distribution of the RNG. A compression ratio of 1:1.4 suggests that it's not uniform or not good. Huffman and arithmetic coding are practically the only options*, since there is no other correlation between successive entries of a good RNG.
*To be precise, the best compression scheme has to be 0-order statistical compression that is able to allocate a variable number of bits for each symbol to reach the Shannon entropy
H(x) = -Sigma_{i=1}^{N} P(x_i) log_2 P(x_i)
The theoretical best is achieved by arithmetic coding, but other encodings can come close by chance. Arithmetic coding can allocate less than one bit per symbol, whereas Huffman or Golomb coding needs at least one bit per symbol (or symbol group).
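To see how much room is left, you can measure the zero-order entropy of the stream directly. This is a small sketch assuming the symbols are bytes; the class and method names are made up for illustration.

```java
/** Hypothetical helper: zero-order (per-symbol) Shannon entropy of a byte stream. */
final class ZeroOrderEntropy {
    /** H = -sum p_i * log2(p_i), in bits per byte, from the observed byte frequencies. */
    static double bitsPerSymbol(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        double entropy = 0.0;
        for (long count : counts) {
            if (count == 0) {
                continue; // a symbol that never occurs contributes nothing
            }
            double p = (double) count / data.length;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    /** Best ratio a zero-order coder can reach on 8-bit symbols (infinite if entropy is 0). */
    static double bestRatio(byte[] data) {
        return 8.0 / bitsPerSymbol(data);
    }
}
```

If the symbols really are bytes, a 1.4x Huffman ratio corresponds to about 8 / 1.4 = 5.7 bits per symbol, while 3x would need the entropy to be below about 2.67 bits per symbol. If the measured value is near 5.7, no zero-order coder will get close to 3x.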

Fast hash function with collision possibility near SHA-1

I'm using SHA-1 to detect duplicates in a program handling files. It is not required to be cryptographically strong and may be reversible. I found this list of fast hash functions https://code.google.com/p/xxhash/ (list has been moved to https://github.com/Cyan4973/xxHash)
What should I choose if I want a faster function with a collision probability on random data close to SHA-1's?
Maybe a 128 bit hash is good enough for file deduplication? (vs 160 bit sha-1)
In my program the hash is calculated on chunks from 0 - 512 KB.
Maybe this will help you:
https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM & MurmurHash
I don't know about xxHash but it looks also promising.
MurmurHash is very fast and version 3 supports 128bit length, I would choose this one. (Implemented in Java and Scala.)
Since the only relevant property of hash algorithms in your case is the collision probability, you should estimate it and choose the fastest algorithm which fulfills your requirements.
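If we suppose your algorithm has absolute uniformity, the probability p of a hash collision among n files using hashes with d possible values will be approximately p ≈ 1 - e^(-n(n-1)/(2d)) ≈ n^2/(2d) (the birthday bound; the last step holds while p is small).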
For example, if you need a collision probability lower than one in a million among one million of files, you will need to have more than 5*10^17 distinct hash values, which means your hashes need to have at least 59 bits. Let's round to 64 to account for possibly bad uniformity.
So I'd say any decent 64-bit hash should be sufficient for you. Longer hashes will further reduce collision probability, at a price of heavier computation and increased hash storage volume. Shorter hashes like CRC32 will require you to write some explicit collision handling code.
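The estimate is easy to reproduce. This sketch uses the birthday approximation p ≈ n^2 / (2d) with d = 2^bits, which assumes a perfectly uniform hash and a small p; the class and method names are hypothetical.

```java
/** Hypothetical helper around the birthday approximation p ~ n^2 / (2d), d = 2^bits. */
final class CollisionBound {
    /** Approximate probability of at least one collision among n uniform hashes. */
    static double collisionProbability(double n, int bits) {
        return n * n / (2.0 * Math.pow(2.0, bits));
    }

    /** Smallest hash width, in bits, that keeps the approximation at or below maxProbability. */
    static int bitsNeeded(double n, double maxProbability) {
        double distinctValues = n * n / (2.0 * maxProbability);
        return (int) Math.ceil(Math.log(distinctValues) / Math.log(2));
    }
}
```

For one million files and a one-in-a-million target, bitsNeeded(1e6, 1e-6) works out to 59, matching the figure above.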
Google developed and uses (I think) FarmHash for performance-critical hashing. From the project page:
FarmHash is a successor to CityHash, and includes many of the same tricks and techniques, several of them taken from Austin Appleby’s MurmurHash.
...
On CPUs with all the necessary machine instructions, about six different hash functions can contribute to FarmHash's lineup. In some cases we've made significant performance gains over CityHash by using newer instructions that are now commonly available. However, we've also squeezed out some more speed in other ways, so the vast majority of programs using CityHash should gain at least a bit when switching to FarmHash.
(CityHash was already a performance-optimized hash function family by Google.)
It was released a year ago, at which point it was almost certainly the state of the art, at least among the published algorithms. (Or else Google would have used something better.) There's a good chance it's still the best option.
The facts:
Good hash functions, especially the cryptographic ones (like SHA-1), require considerable CPU time because they have to honor a number of properties that won't be very useful for you in this case;
Any hash function will give you only one certainty: if the hash values of two files are different, the files are surely different. If, however, their hash values are equal, chances are that the files are also equal, but the only way to tell for sure if this "equality" is not just a hash collision, is to fall back to a binary comparison of the two files.
The conclusion:
In your case I would try a much faster algorithm like CRC32, that has pretty much all the properties you need, and would be capable of handling more than 99.9% of the cases and only resorting to a slower comparison method (like binary comparison) to rule out the false positives. Being a lot faster in the great majority of comparisons would probably compensate for not having an "awesome" uniformity (possibly generating a few more collisions).
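A minimal sketch of that two-step check, using java.util.zip.CRC32 from the standard library. The class and method names are placeholders, and a real deduplicator would cache each chunk's checksum rather than recompute it per comparison.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

/** Hypothetical helper: cheap checksum first, full comparison only on a checksum match. */
final class ChunkDeduplicator {
    static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    static boolean isDuplicate(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;              // different sizes: cannot be equal
        }
        if (checksum(a) != checksum(b)) {
            return false;              // different checksums: surely different
        }
        return Arrays.equals(a, b);    // equal checksums: rule out a collision byte by byte
    }
}
```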
128 bits is indeed good enough to detect different files or chunks. The risk of collision is infinitesimal, at least as long as no intentional collision is being attempted.
64 bits can also prove good enough if the number of files or chunks you want to track remains "small enough" (i.e. no more than a few million).
Once you have settled on the size of the hash, you need a hash with some very good distribution properties, such as the ones listed with Q.Score=10 in your link.
It kind of depends on how many hashes you are going to compute over in an iteration.
E.g., a 64-bit hash reaches a collision probability of 1 in 1,000,000 with about 6 million hashes computed.
Refer to : Hash collision probabilities
Check out MurmurHash2_160. It's a modification of MurmurHash2 which produces 160-bit output.
It computes 5 unique results of MurmurHash2 in parallel and mixes them thoroughly. The collision probability is equivalent to SHA-1 based on the digest size.
It's still fast, but MurmurHash3_128, SpookyHash128 and MetroHash128 are probably faster, albeit with a higher (but still very unlikely) collision probability. There's also CityHash256 which produces a 256-bit output which should be faster than SHA-1 as well.

Fast/Area optimised sorting in hardware (fpga)

I'm trying to sort an array of 8bit numbers using vhdl.
I'm trying to find out a method which optimise delay and another which would use less hardware.
The size of the array is fixed. But I'm also interested to extend the functionality to variable lengths.
I've come across 3 algorithms so far:
Batcher Parallel Method
Green Sort
Van Voorhis Sort
Which of these will do the best job? Are there any other methods I should be looking at?
Thanks.
There are a lot of research articles on the matter. You could try to search the web for it. I did a search for "Sorting Networks" and came up with a lot of comparisons of different algorithms and how well they fitted into an FPGA.
The algorithm you choose will greatly depend on which parameter is most important to optimize for, i.e. latency, area, etc. Another important factor is where the values are stored at the beginning and end of the sort. If they are stored in registers, all might be accessed at once, but if you have to read them from a memory with a limited width, you should consider that in your implementation as well, because then you will have to sort values in a stream, and rearrange that stream before saving it back to memory.
Personally, I'd consider something time-constant like merge-sort, which has a constant time to sort, so you could easily schedule the sort for a fixed size array. I'm however not sure how well this scales or works with arbitrary sized arrays. You'd probably have to set an upper limit on array size, and also this approach works best if all data is stored in registers.
I read about this in a book by Knuth and according to that book, the Batcher's parallel merge sort is the fastest algorithm and also the most hardware efficient.
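To show what a fixed, data-independent compare-exchange schedule looks like, here is a software model in Java (not VHDL). It uses the bitonic sorter, another of Batcher's networks, because its index pattern is short; it is not the odd-even merge network itself, and the class name is made up. In hardware, all the comparators within one stage could operate in parallel.

```java
/** Hypothetical software model of Batcher's bitonic sorting network. */
final class BitonicNetwork {
    /**
     * Sorts the array in place in ascending order and returns the number of stages.
     * The array length must be a power of two. Which pairs are compared depends only
     * on the indexes, never on the data, which is what makes this a sorting network.
     */
    static int sort(int[] a) {
        int n = a.length;
        if (Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        int stages = 0;
        for (int k = 2; k <= n; k <<= 1) {          // size of the bitonic sequences being merged
            for (int j = k >> 1; j > 0; j >>= 1) {  // comparator distance within this stage
                stages++;
                for (int i = 0; i < n; i++) {
                    int partner = i ^ j;
                    if (partner > i) {
                        boolean ascending = (i & k) == 0;
                        boolean outOfOrder = ascending ? a[i] > a[partner] : a[i] < a[partner];
                        if (outOfOrder) {
                            int tmp = a[i];
                            a[i] = a[partner];
                            a[partner] = tmp;
                        }
                    }
                }
            }
        }
        return stages;
    }
}
```

For 8 inputs this runs 6 stages of 4 comparators each, so latency grows with the number of stages and area with the number of comparators.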

Good compression algorithm for small chunks of data? (around 2k in size)

I have a system in which one machine generates small chunks of data in the form of objects containing arrays of integers and longs. These chunks get passed to another server which in turn distributes them elsewhere.
I want to compress these objects so the memory load on the pass-through server is reduced. I understand that compression algorithms like deflate need to build a dictionary so something like that wouldn't really work on data this small.
Are there any algorithms that could compress data like this efficiently?
If not, another thing I could do is batch these chunks into arrays of objects and compress the array once it gets to be a certain size. But I am reluctant to do this because I would have to change interfaces in an existing system. Compressing them individually would not require any interface changes, the way this is all set up.
Not that I think it matters, but the target system is Java.
Edit: Would Elias gamma coding be the best for this situation?
Thanks
If you think that reducing your data packet to its entropy level is as good as it can get, you can try simple Huffman compression.
For an early look at how well this would compress, you can pass a packet through Huff0 :
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
It is a simple 0-order huffman encoder. So the result will be representative.
For more specific ideas on how to efficiently use the characteristics of your data, it would help to describe what data the packets contain and how it is generated (as you have done in the comments, so they are ints (4 bytes?) and longs (8 bytes?)), and then provide one or a few samples.
It sounds like you're currently looking at general-purpose compression algorithms. The most effective way to compress small chunks of data is to build a special-purpose compressor that knows the structure of your data.
The important thing is that you need to match the coding you use with the distribution of values you expect from your data: to get a good result from Elias gamma coding, you need to make sure the values you code are smallish positive integers...
If different integers within the same block are not completely independent (e.g., if your arrays represent a time series), you may be able to use this to improve your compression (e.g., the differences between successive values in a time series tend to be smallish signed integers). However, because each block needs to be independently compressed, you will not be able to take this kind of advantage of differences between successive blocks.
If you're worried that your compressor might turn into an "expander", you can add an initial flag to indicate whether the data is compressed or uncompressed. Then, in the worst case where your data doesn't fit your compression model at all, you can always punt and send the uncompressed version; your worst-case overhead is the size of the flag...
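A sketch of the flag idea, with java.util.zip.Deflater standing in for whatever compressor you end up using. The one-byte flag layout and the class name are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical codec: first byte says whether the rest is raw (0) or deflated (1). */
final class FlaggedCodec {
    private static final byte RAW = 0;
    private static final byte DEFLATED = 1;

    static byte[] encode(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(DEFLATED);
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        if (out.size() < input.length + 1) {
            return out.toByteArray();            // compression paid off
        }
        byte[] raw = new byte[input.length + 1]; // otherwise punt: flag plus original bytes
        raw[0] = RAW;
        System.arraycopy(input, 0, raw, 1, input.length);
        return raw;
    }

    static byte[] decode(byte[] packet) {
        if (packet[0] == RAW) {
            return Arrays.copyOfRange(packet, 1, packet.length);
        }
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packet, 1, packet.length - 1);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new DataFormatException("truncated packet");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt packet", e);
        } finally {
            inflater.end();
        }
    }
}
```

The worst case is then exactly one byte larger than the uncompressed chunk.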
Elias Gamma Coding might actually increase the size of your data.
You already have upper bounds on your numbers (whatever fits into a 4- or probably 8-byte int/long). This method encodes the length of your numbers, followed by your number (probably not what you want). If you get many small values, it might make things smaller. If you also get big values, it will probably increase the size (the 8-byte unsigned max value would become almost twice as big).
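You can check the cost per value before committing to it. This sketch computes the Elias gamma code length, 2 * floor(log2 n) + 1 bits for n >= 1, and builds the code as a string of '0' and '1' characters for inspection; the class name is made up and a real encoder would write packed bits.

```java
/** Hypothetical helper for inspecting Elias gamma codes of positive integers. */
final class EliasGamma {
    /** Length in bits of the gamma code of n (n must be at least 1). */
    static int codeLength(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("gamma coding handles positive integers only");
        }
        int floorLog2 = 63 - Long.numberOfLeadingZeros(n);
        return 2 * floorLog2 + 1;
    }

    /** The code itself: floor(log2 n) zeros, then n in binary (which starts with a 1). */
    static String encode(long n) {
        int zeros = (codeLength(n) - 1) / 2;
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < zeros; i++) {
            bits.append('0');
        }
        return bits.append(Long.toBinaryString(n)).toString();
    }
}
```

A value of 5 takes 5 bits, a value of 255 takes 15, and Long.MAX_VALUE takes 125, so gamma coding only wins over a fixed 32 or 64 bits when most values are small.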
Look at the entropy of your data packets. If it's close to the maximum, compression will be useless. Otherwise, try different general-purpose compressors. Though I'm not sure if the time spent compressing and decompressing is worth the size reduction.
I would have a close look at the options of your compression library, for instance deflateSetDictionary() and the flag Z_FILTERED in http://www.zlib.net/manual.html. If you can distribute - or hardwire in the source code - an agreed dictionary to both sender and receiver ahead of time, and if that dictionary is representative of real data, you should get decent compression savings. Oops - in Java look at java.util.zip.Deflater.setDictionary() and FILTERED.
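A sketch of the Java side, assuming sender and receiver already share the same dictionary bytes. The class name and the buffer sizes are arbitrary, and whether FILTERED helps depends on your data.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical codec using a preset dictionary that both ends agree on ahead of time. */
final class DictionaryCodec {
    static byte[] compress(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setStrategy(Deflater.FILTERED);
        deflater.setDictionary(dictionary);   // must be set before any input is deflated
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] data, byte[] dictionary) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[512];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n > 0) {
                    out.write(buf, 0, n);
                } else if (inflater.needsDictionary()) {
                    inflater.setDictionary(dictionary); // the stream asks for it after the header
                } else if (inflater.needsInput()) {
                    throw new DataFormatException("truncated input");
                }
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt data", e);
        } finally {
            inflater.end();
        }
    }
}
```

The dictionary should be built from byte sequences that actually recur in real chunks; a dictionary that does not resemble the data buys nothing.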

If consistent hashing is efficient, why don't people use it everywhere?

I was asked about some shortcomings of consistent hashing. But I think it just costs a little more than a traditional hash % N scheme. As the title says, if consistent hashing is so good, why don't we just use it everywhere?
Does anyone know of more shortcomings?
Implementing consistent hashing is not trivial and in many cases you have a hash table that rarely or never needs remapping or which can remap rather fast.
The only substantial shortcoming of consistent hashing I'm aware of is that implementing it is more complicated than simple hashing. More code means more places to introduce a bug, but there are freely available options out there now.
Technically, consistent hashing consumes a bit more CPU; consulting a sorted list to determine which server to map an object to is an O(log n) operation, where n is the number of servers X the number of slots per server, while simple hashing is O(1).
In practice, though, O(log n) is so fast it doesn't matter. (E.g., 8 servers X 1024 slots per server = 8192 items, log2(8192) = 13 comparisons at most in the worst case.) The original authors tested it and found that computing the cache server using consistent hashing took only 20 microseconds in their setup. Likewise, consistent hashing consumes space to store the sorted list of server slots, while simple hashing takes no space, but the amount required is minuscule, on the order of Kb.
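For reference, the sorted list of slots fits in a few lines with a TreeMap. This is a sketch only: the class name is made up, CRC32 is a placeholder for a better-distributed hash, and colliding slot positions simply overwrite each other.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Hypothetical consistent-hash ring: each server owns several slots on a sorted ring. */
final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int slotsPerServer;

    ConsistentHashRing(int slotsPerServer) {
        this.slotsPerServer = slotsPerServer;
    }

    // Placeholder hash (an assumption); a production ring would use a stronger one.
    private static long hash(String key) {
        CRC32 crc = new CRC32();
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    void addServer(String server) {
        for (int i = 0; i < slotsPerServer; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        ring.values().removeIf(server::equals);
    }

    /** O(log n) lookup: first slot at or after the key's position, wrapping around. */
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> slot = ring.ceilingEntry(hash(key));
        return (slot != null ? slot : ring.firstEntry()).getValue();
    }
}
```

Removing a server only remaps the keys that server owned; every other key keeps its slot, which is the property hash % N lacks.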
Why is it not better known? If I had to guess, I would say it's only because it can take time for academic ideas to propagate out into industry. (The original paper was written in 1997.)
I assume you're talking about hash tables specifically, since you mention mod N. Please correct me if I'm wrong in that assumption, as hashes are used for all sorts of different things.
The reason is that consistent hashing doesn't really solve a problem that hash tables pressingly need to solve. On a rehash, a hash table probably needs to reassign a very large fraction of its elements no matter what, possibly a majority of them. This is because we're probably rehashing to increase the size of our table, which is usually done geometrically; it's very typical, for instance, to double the number of nodes once the table starts to get too full.
So in consistent hashing terms, we're not just adding a node; we're doubling the amount of nodes. That means, one way or another, best case, we're moving half of the elements. Sure, a consistent hashing technique could cut down on the moves, and try to approach this ideal, but the best case improvement is only a constant factor of 2x, which doesn't change our overall complexity.
Approaching from the other end, hash tables are all about cache performance, in most applications. All interest in making them go fast is on computing stuff as quickly as possible, touching as little memory as possible. Adding consistent hashing is probably going to be more than a 2x slowdown, no matter how you look at this; ultimately, consistent hashing is going to be worse.
Finally, this entire issue is sort of unimportant from another angle. We want rehashing to be fast, but it's much more important that we don't rehash at all. In any normal practical scenario, when a programmer sees he's having a problem due to rehashing, the correct answer is nearly always to find a way to avoid (or at least limit) the rehashing, by choosing an appropriate size to begin with. Given that this is the typical scenario, maintaining a fairly substantial side-structure for something that shouldn't even be happening is obviously not a win, and again, makes us overall slower.
Nearly all of the optimization effort on hash tables is either in how to calculate the hash faster, or how to perform collision resolution faster. These are things that happen on a much smaller time scale than we're talking about for consistent hashing, which is usually used where we're talking about time scales measured in microseconds or even milliseconds because we have to do I/O operations.
One reason is that consistent hashing tends to cause more work on the read side for range scan queries.
For example, if you want to search for entries that are sorted by a particular column, then you'd need to send the query to EVERY node, because consistent hashing will place even "adjacent" items in separate nodes.
It's often preferred to instead use a partitioning that matches the usage patterns. Better yet, replicate the same data in a host of different partitions/formats.

Resources