What are some common uses for bitarrays? - data-structures

I've done an example using bit arrays from a newbie manual. I want to know what they can be used for, and what some common data structures for them are (assuming that "array" is fairly loose terminology).
Thanks.

There are several listed in the Applications section of the Bit array Wikipedia article:
Because of their compactness, bit arrays have a number of applications in areas where space or efficiency is at a premium. Most commonly, they are used to represent a simple group of boolean flags or an ordered sequence of boolean values.
We mentioned above that bit arrays are used for priority queues, where the bit at index k is set if and only if k is in the queue; this data structure is used, for example, by the Linux kernel, and benefits strongly from a find-first-zero operation in hardware.
Bit arrays can be used for the allocation of memory pages, inodes, disk sectors, etc. In such cases, the term bitmap may be used. However, this term is frequently used to refer to raster images, which may use multiple bits per pixel.
Another application of bit arrays is the Bloom filter, a probabilistic set data structure that can store large sets in a small space in exchange for a small probability of error. It is also possible to build probabilistic hash tables based on bit arrays that accept either false positives or false negatives.
Bit arrays and the operations on them are also important for constructing succinct data structures, which use close to the minimum possible space. In this context, operations like finding the nth 1 bit or counting the number of 1 bits up to a certain position become important.
Bit arrays are also a useful abstraction for examining streams of compressed data, which often contain elements that occupy portions of bytes or are not byte-aligned. For example, the compressed Huffman coding representation of a single 8-bit character can be anywhere from 1 to 255 bits long.
In information retrieval, bit arrays are a good representation for the posting lists of very frequent terms. If we compute the gaps between adjacent values in a list of strictly increasing integers and encode them using unary coding, the result is a bit array with a 1 bit in the nth position if and only if n is in the list. The implied probability of a gap of n is 1/2^n. This is also the special case of Golomb coding where the parameter M is 1; this parameter is only normally selected when -log(2-p)/log(1-p) ≤ 1, or roughly the term occurs in at least 38% of documents.
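To make the posting-list example concrete, here is a minimal C++ sketch (my own illustration, not from the article): a strictly increasing list of document IDs is stored as a bit array with bit n set iff n appears in the list, which is exactly what concatenating the unary-coded gaps produces.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        // Posting list: strictly increasing document IDs for a frequent term.
        std::vector<uint32_t> postings = {0, 2, 3, 7, 8};

        // Unary-coding the gaps is equivalent to a bit array with bit n set
        // iff n appears in the posting list.
        std::vector<bool> bits(postings.back() + 1, false);
        for (uint32_t id : postings) bits[id] = true;

        // Recover the posting list by scanning for set bits.
        for (uint32_t n = 0; n < bits.size(); ++n)
            if (bits[n]) std::cout << n << ' ';
        std::cout << '\n';   // prints: 0 2 3 7 8
    }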

Related

what algorithm can save one bit of storage space for each arbitrary 32bit number in a LUT

A lookup table has a total of 4G entries; each entry is an arbitrary 32-bit number, and the numbers never repeat.
Is there an algorithm that can use the index of each entry together with its 32-bit value so that a bit at a fixed position of the value is always zero (so I can use that bit as a flag to log something), while still being able to retrieve the original 32-bit number by a reverse calculation?
Or, stepping back: can I at least make a fixed-position bit of every two consecutive entries always zero?
My question is whether there is any universal code that can save 1 bit from each arbitrary 32-bit number, so I can use that bit as a lock flag. Alternatively, is there a way to leverage the index and value of a lookup-table entry, via some calculation, to save 1 bit of storage per value?
It is not at all clear what you are asking. However I can perhaps find one thing in there that can be addressed, if I am reading it correctly, which is that you have a permutation of all of the integers in 0..2^32-1. Such a permutation can be represented in fewer bits than the direct representation, which takes 32*2^32 bits. With a perfect representation of the permutations, each would be ceiling(log2((2^32)!)) bits, since there are (2^32)! possible permutations. That length turns out to be about 95.5% of the bits in the direct representation. So each permutation could be represented in about 30.6*2^32 bits, effectively taking off more than one bit per word.
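A quick way to check that 30.6 figure, as a sketch of my own, is to evaluate log2((2^32)!) with lgamma:

    #include <cmath>
    #include <cstdio>

    int main() {
        // Bits needed for a perfect encoding of a permutation of 0..2^32-1:
        // log2((2^32)!) = lgamma(2^32 + 1) / ln(2), since lgamma(n+1) = ln(n!).
        const double n = 4294967296.0;               // 2^32
        const double total_bits = std::lgamma(n + 1.0) / std::log(2.0);
        std::printf("bits per word: %.2f (direct representation uses 32)\n",
                    total_bits / n);                 // prints roughly 30.56
    }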

Why are we using linked list to address collisions in hash tables?

I was wondering why many languages (Java, C++, Python, Perl etc) implement hash tables using linked lists to avoid collisions instead of arrays?
I mean instead of buckets of linked lists, we should use arrays.
If the concern is about the size of the array then that means that we have too many collisions so we already have a problem with the hash function and not the way we address collisions. Am I misunderstanding something?
I mean instead of buckets of linked lists, we should use arrays.
Pros and cons to everything, depending on many factors.
The two biggest problems with arrays:
1. changing capacity involves copying all content to another memory area
2. you have to choose between:
   a) arrays of Element*s, adding one extra indirection during table operations, and one extra memory allocation per non-empty bucket with associated heap management overheads
   b) arrays of Elements, such that pre-existing Element iterators/pointers/references are invalidated by some operations on other nodes (e.g. insert) (the linked list approach - or 2a above for that matter - needn't invalidate these)
...will ignore several smaller design choices about indirection with arrays...
Practical ways to reduce the copying from 1. include keeping excess capacity (i.e. currently unused memory for anticipated or already-erased elements), and - if sizeof(Element) is much greater than sizeof(Element*) - you're pushed towards arrays of Element*s (with the "2a" problems) rather than arrays of Elements (2b).
There are a couple of other answers claiming erasing in arrays is more expensive than for linked lists, but the opposite's often true: searching contiguous Elements is faster than scanning a linked list (fewer steps in code, more cache friendly), and once found you can copy the last array Element or Element* over the one being erased and then decrement the size.
If the concern is about the size of the array then that means that we have too many collisions so we already have a problem with the hash function and not the way we address collisions. Am I misunderstanding something?
To answer that, let's look at what happens with a great hash function. Packing a million elements into a million buckets using a cryptographic strength hash, a few runs of my program counting the number of buckets to which 0, 1, 2 etc. elements hashed yielded...
0=367790 1=367843 2=184192 3=61200 4=15370 5=3035 6=486 7=71 8=11 9=2
0=367664 1=367788 2=184377 3=61424 4=15231 5=2933 6=497 7=75 8=10 10=1
0=367717 1=368151 2=183837 3=61328 4=15300 5=3104 6=486 7=64 8=10 9=3
If we increase that to 100 million elements - still with load factor 1.0:
0=36787653 1=36788486 2=18394273 3=6130573 4=1532728 5=306937 6=51005 7=7264 8=968 9=101 10=11 11=1
We can see the ratios are pretty stable. Even with load factor 1.0 (the default maximum for C++'s unordered_set and -map), 36.8% of buckets can be expected to be empty, another 36.8% handling one Element, 18.4% 2 Elements and so on. For any given array resizing logic you can easily get a sense of how often it will need to resize (and potentially copy elements). You're right that it doesn't look bad, and may be better than linked lists if you're doing lots of lookups or iterations, for this idealistic cryptographic-hash case.
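For the curious, here is a rough reconstruction of the kind of counting program described above (not the author's actual code; std::mt19937_64 stands in for a "cryptographic strength" hash, since all that matters here is uniformly distributed output):

    #include <cstdio>
    #include <map>
    #include <random>
    #include <vector>

    int main() {
        const size_t n = 1000000;                 // elements == buckets, load factor 1.0
        std::vector<unsigned> bucket(n, 0);

        // Uniform pseudo-random values stand in for a strong hash of the keys.
        std::mt19937_64 rng(12345);
        for (size_t i = 0; i < n; ++i)
            ++bucket[rng() % n];

        std::map<unsigned, size_t> histogram;     // occupancy -> number of buckets
        for (unsigned count : bucket) ++histogram[count];
        for (auto [occupancy, buckets] : histogram)
            std::printf("%u=%zu ", occupancy, buckets);
        std::printf("\n");                        // roughly 368k, 368k, 184k, 61k, ...
    }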
But, good-quality hashing is relatively expensive in CPU time, so the hash functions shipped for general-purpose hash tables are often very weak: e.g. it's very common for C++ Standard library implementations of std::hash<int> to return their argument, and MS Visual C++'s std::hash<std::string> picks 10 characters evenly spaced along the string to incorporate in the hash value, regardless of how long the string is.
Clearly implementers' experience has been that this combination of weak-but-fast hash functions and linked lists (or trees) to handle the greater collision proneness works out faster on average - and has fewer user-antagonising manifestations of obnoxiously bad performance - for everyday keys and requirements.
Strategy 1
Use (small) arrays which get instantiated and subsequently filled once collisions occur. 1 heap operation for the allocation of the array, then room for N-1 more. If no collision ever occurs again for that bucket, N-1 slots for entries are wasted. The list wins if collisions are rare: no excess memory is allocated just for the probability of having more overflows on a bucket. Removing items is also more expensive: either mark deleted spots in the array or move the stuff behind it to the front. And what if the array is full? Linked list of arrays, or resize the array?
One potential benefit of using arrays would be to do a sorted insert and then a binary search upon retrieval. The linked list approach cannot compete with that. But whether or not that pays off depends on the write/retrieve ratio. The less frequently writing occurs, the more this could pay off.
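As an illustration of that sorted-insert / binary-search bucket idea, a minimal sketch (the type and member names are mine):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // One bucket of a hypothetical separate-chaining table: keys kept sorted
    // so that lookups can use binary search instead of a linear scan.
    struct SortedBucket {
        std::vector<uint64_t> keys;

        void insert(uint64_t k) {                       // O(log n) search + O(n) shift
            auto it = std::lower_bound(keys.begin(), keys.end(), k);
            if (it == keys.end() || *it != k) keys.insert(it, k);
        }
        bool contains(uint64_t k) const {               // O(log n)
            return std::binary_search(keys.begin(), keys.end(), k);
        }
    };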
Strategy 2
Use lists. You pay for what you get: 1 collision = 1 heap operation. No eager assumption (and price to pay in terms of memory) that "more will come". Linear search within the collision lists. Cheaper delete (not counting free() here). One major motivation to think of arrays instead of lists would be to reduce the number of heap operations. Amusingly, the general assumption seems to be that they are cheap, but few actually know how much time an allocation requires compared to, say, traversing the list looking for a match.
Strategy 3
Use neither arrays nor lists, but store the overflow entries within the hash table at another location. Last time I mentioned that here, I got frowned upon a bit. Benefit: 0 memory allocations. It probably works best if you have a genuinely low fill factor in the table and only few collisions.
Summary
There are indeed many options and trade-offs to choose from. Generic hash table implementations such as those in standard libraries cannot make any assumption regarding write/read ratio, quality of the hash key, use cases, etc. If, on the other hand, all those traits of a hash table application are known (and if it is worth the effort), it is entirely possible to create an optimized implementation of a hash table which is tailored to the set of trade-offs the application requires.
The reason is that the expected length of these lists is tiny, with only zero, one, or two entries in the vast majority of cases. Yet these lists may also become arbitrarily long in the worst case of a really bad hash function. And even though this worst case is not the case that hash tables are optimized for, they still need to be able to handle it gracefully.
Now, for an array-based approach, you would need to set a minimal array size. And, if that initial array size is anything other than zero, you already have significant space overhead due to all the empty lists. A minimal array size of two would mean that you waste half your space. And you would need to implement logic to reallocate the arrays when they become full, because you cannot put an upper limit on the list length: you need to be able to handle the worst case.
The list-based approach is much more efficient under these constraints: it has only the allocation overhead for the node objects, most accesses have the same amount of indirection as the array-based approach, and it's easier to write.
I'm not saying that it's impossible to write an array-based implementation, but it's significantly more complex and less efficient than the list-based approach.
why many languages (Java, C++, Python, Perl etc) implement hash tables using linked lists to avoid collisions instead of arrays?
I'm almost sure that, at least for most of those "many" languages:
The original implementors of hash tables for these languages just followed the classic algorithm description from Knuth or another algorithms book, and didn't even consider such subtle implementation choices.
Some observations:
Even using collision resolution with separate chains instead of, say, open addressing for the "most generic hash table implementation" is a seriously doubtful choice. My personal conviction: it is not the right choice.
When the hash table's load factor is pretty low (which should be chosen in nearly 99% of hash table usages), the difference between the suggested approaches can hardly affect overall data structure performance (as cmaster explained in the beginning of his answer, and delnan meaningfully refined in the comments). Since generic hash table implementations in languages are not designed for high density, "linked lists vs arrays" is not a pressing issue for them.
Returning to the topic question itself, I don't see any conceptual reason why linked lists should be better than arrays. I can easily imagine that, in fact, arrays are faster on modern hardware / consume less memory with modern memory allocators inside modern language runtimes / operating systems. Especially when the hash table's key is a primitive, or a copied structure. You can find some arguments backing this opinion here: http://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures
But the only way to find the correct answer (for a particular CPU, OS, memory allocator, virtual machine and its garbage collection algorithm, and the hash table use case / workload!) is to implement both approaches and compare them.
Am I misunderstanding something?
No, you don't misunderstand anything; your question is legitimate. It's an example of fair confusion, when something is done in some specific way not for a strong reason but, largely, by historical accident.
If it is implemented using arrays, insertion will be costly due to reallocation, which in the case of a linked list doesn't happen.
Coming to the case of deletion, we have to search the complete array, then either mark the slot as deleted or move the remaining elements. (In the former case it makes insertion even more difficult, as we have to search for empty slots.)
To improve the worst-case time complexity from O(n) to O(log n), once the number of items in a hash bucket grows beyond a certain threshold, that bucket will switch from using a linked list of entries to a balanced tree (in Java).

Such a thing as a constant quality (variable bit) digest hashing algorithm?

Problem space: We have a ton of data to digest that can range 6 orders of magnitude in size. Looking for a way to be more efficient, and thus use less disk space to store all of these digests.
So I was thinking about lossy audio encoding, such as MP3. There are two basic approaches - constant bitrate and constant quality (aka variable bitrate). Since my primary interest is quality, I usually go for VBR. Thus, to achieve the same level of quality, a pure sine tone would require a significantly lower bitrate than something like a complex classical piece.
Using the same idea, two very small data chunks should require significantly less total digest bits than two very large data chunks to ensure roughly the same statistical improbability (what I am calling quality in this context) of their digests colliding. This is an assumption that seems intuitively correct to me, but then again, I am not a crypto mathematician. Also note that this is all about identification, not security. It's okay if a small data chunk has a small digest, and thus computationally feasible to reproduce.
I tried searching around the inter-tubes for anything like this. The closest thing I found was a posting somewhere that talked about using a fixed-size digest hash, like SHA256, as an initialization vector for AES/CTR acting as a pseudo-random generator, then taking the first x bits off that.
That seems like a totally do-able thing. The only problem with this approach is that I have no idea how to calculate the appropriate value of x as a function of the data chunk size. I think my target quality would be statistical improbability of SHA256 collision between two 1GB data chunks. Does anyone have thoughts on this calculation?
Are there any existing digest hashing algorithms that already do this? Or are there any other approaches that will yield this same result?
Update: Looks like there is the SHA3 Keccak "sponge" that can output an arbitrary number of bits. But I still need to know how many bits I need as a function of input size for a constant quality. It sounded like this algorithm produces an infinite stream of bits, and you just truncate at however many you want. However testing in Ruby, I would have expected the first half of a SHA3-512 to be exactly equal to a SHA3-256, but it was not...
Your logic from the comment is fairly sound. Quality hash functions will not generate a duplicate/previously generated output until the input length is nearly (or has exceeded) the hash digest length.
But, the key factor in collision risk is the size of the input set relative to the size of the hash digest. When using a quality hash function, the chance of a collision for two 1TB files is not significantly different from the chance of a collision for two 1KB files, or even one 1TB and one 1KB file. This is because hash functions strive for uniformity; good functions achieve it to a high degree.
Due to the birthday problem, collisions become likely with far fewer inputs than the size of the hash output space would naively suggest. The Wikipedia article on the pigeonhole principle, which is the basis for the birthday problem, says:
The [pigeonhole] principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression is lossless), which possibility the pigeonhole principle excludes.
So going to a 'VBR' hash digest is not guaranteed to save you space. The birthday problem provides the math for calculating the chance that two random things will share the same property (a hash code is a property, in a broad sense), but this article gives a better summary, including the following table.
[Table of collision probabilities by hash size and number of inputs - source: preshing.com]
The top row of the table says that in order to have a 50% chance of a collision with a 32-bit hash function, you only need to hash 77k items. For a 64-bit hash function, that number rises to 5.04 billion for the same 50% collision risk. For a 160-bit hash function, you need 1.42 * 10^24 inputs before there is a 50% chance that a new input will have the same hash as a previous input.
Note that 1.42 * 10^24 160-bit numbers would themselves take up an unreasonably large amount of space; millions of terabytes, if I'm doing the math right. And that's without counting the 10^24 item values they represent.
The bottom end of that table should convince you that a 160-bit hash function has a sufficiently low risk of collisions. In particular, you would have to have 10^21 hash inputs before there is even a 1 in a million chance of a hash collision. That's why your searching turned up so little: it's not worth dealing with the complexity.
No matter what hash strategy you decide upon, however, there is a non-zero risk of collision. Any type of ID system that relies on a hash needs to have a fallback comparison. An easy additional check for files is to compare their sizes (this works well for any variable-length data where the length is known, such as strings). Wikipedia covers several different collision mitigation and detection strategies for hash tables, most of which can be extended to a filesystem with a little imagination. If you require perfect fidelity, then after you've run out of fast checks, you need to fall back to the most basic comparator: the expensive bit-for-bit check of the two inputs.
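A tiny sketch of such a layered check (the names are illustrative, not from the answer; the final memcmp is the expensive bit-for-bit fallback):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical identification check with cheap tests first and an
    // expensive bit-for-bit comparison as the final fallback.
    bool same_item(const std::vector<uint8_t>& a, uint64_t hash_a,
                   const std::vector<uint8_t>& b, uint64_t hash_b) {
        if (a.size() != b.size()) return false;                  // cheap: length check
        if (hash_a != hash_b)     return false;                  // cheap: digest check
        return std::memcmp(a.data(), b.data(), a.size()) == 0;   // expensive fallback
    }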
If I understand the question correctly, you have a number of data items of different lengths, and for each item you are computing a hash (i.e. a digest) so the items can be identified.
Suppose you have already hashed N items (without collisions), and you are using a 64-bit hash code.
The next item you hash will take one of 2^64 values, and so you will have an N / 2^64 probability of a hash collision when you add the next item.
Note that this probability does NOT depend on the original size of the data item. It does depend on the total number of items you have to hash, so you should choose the number of bits according to the probability you are willing to tolerate of a hash collision.
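To put numbers on that, here is a small sketch of my own: the per-insert figure is the N / 2^64 above, and the cumulative figure uses the standard birthday approximation 1 - exp(-N(N-1)/2^65), which is my addition rather than part of this answer.

    #include <cmath>
    #include <cstdio>

    int main() {
        const double bits = 64.0;                       // hash width
        const double space = std::pow(2.0, bits);       // 2^64 possible digests

        for (double n : {1e6, 1e9, 5.04e9}) {
            double next = n / space;                    // chance the (n+1)th item collides
            double any  = 1.0 - std::exp(-n * (n - 1.0) / (2.0 * space)); // birthday approx.
            std::printf("N=%.2e  next-item: %.3e  any-collision: %.3e\n", n, next, any);
        }
        // At N = 5.04e9 the cumulative collision chance is roughly 50%,
        // consistent with the 64-bit row of the table quoted earlier.
    }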
However, if you have partitioned your data set in some way such that there are different numbers of items in each partition, then you may be able to save a small amount of space by using variable sized hashes.
For example, suppose you use 1TB disk drives to store items, and all items >1GB are on one drive, while items <1KB are on another, and a third is used for intermediate sizes. There will be at most 1000 items on the first drive so you could use a smaller hash, while there could be a billion items on the drive with small files so a larger hash would be appropriate for the same collision probability.
In this case the hash size does depend on file size, but only in an indirect way based on the size of the partitions.

Data structure for searching & inserting bitstrings where only "1"s are important

It's hard to explain the problem in pure words, so here's an example of the abstract problem I need to solve:
In this example, there are entries with the keys "1111","1010","1011","1000","0001" already inserted into the data structure
I search using the query "1001"
The query is supposed to return all entries in the data structure where the query has a matching "1" for all "1"s in the key of the entry, but the query may have many more 1s than the compared entries. For this example, the keys "1000" and "0001" should be returned, since the query matches the 1s of those keys. You could say the entries in the data structure "don't care" about the other bits in the query: the entry with the "1000" key only cares that the first bit of the query be 1, and the "0001" key only cares that the last bit be 1.
Some side information/constraints:
This is optimization for a real-time application, where profiling has shown that improvement in this area would be welcomed.
The number of entries will be "small" (most likely <500). This means I'm not necessarily looking for the best "big O" performance, but rather practical performance on contemporary PC and mobile CPUs and memory. As small a memory footprint as possible is a huge bonus, but I strongly suspect this will go hand-in-hand with a well-performing solution.
Insertions into the data structure will be very infrequent (most will happen at application startup), so the structure doesn't have to be optimized for them. But searches will be frequent.
The values of the entries (key/value pair) in my concrete problem will be arrays of pointers.
The number of bits in the keys is arbitrary, but all keys in the structure and the queries will have the same length. I'm just mentioning this in case there are algorithms which rely on CPU hardware instructions to function efficiently, which would likely only work for 32-bit / 64-bit types. My keys will be longer, but not huge (~128-256 bits).
I want to specifically mention again that this is for strings of bits, nothing else.
Queries can have no results as well. For example, in my application, "0000" will never return results, since there are no "1"s to care about.
The programming language used is C++, and the compiler is "various compilers", as this will run on multiple platforms and operating systems.
How can I solve this efficiently? Also, are there practical implementations to look at?
First of all, I assume that you have already optimized your query/key comparison code. You should be able to do that efficiently with a bitwise-and plus a compare for each word of the key and query. If you are on an architecture with SIMD instructions, then those can be done in parallel.
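For the example in the question, that comparison is just (query & key) == key per word. A minimal single-word sketch of my own (a real implementation would loop over the 2-4 words of a 128-256 bit key):

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // A key matches when every 1 bit of the key is also set in the query.
    bool matches(uint64_t query, uint64_t key) { return (query & key) == key; }

    int main() {
        std::vector<uint64_t> keys = {0b1111, 0b1010, 0b1011, 0b1000, 0b0001};
        uint64_t query = 0b1001;
        for (uint64_t k : keys)
            if (matches(query, k)) std::cout << k << ' ';   // prints 8 1 ("1000", "0001")
        std::cout << '\n';
    }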
You haven't said anything about the meaning of the bits or how you expect them to be distributed among the keys and queries.
If you expect queries to be repeated frequently, one very easy thing you can do is to simply use a linear search with a cache of the n-most frequently used queries.
If most bits are not set in the majority of keys, then you could reorder the bits in the keys such that the least frequently occurring bits have the lowest indexes (i.e. bit 0 has the fewest keys with that bit set, bit 1 the next fewest, and so on). Then create an array indexed by bit index whose entries contain the list of keys containing that bit. When resolving queries, pick the lowest set bit in the query (there are bit hacks to do this efficiently), look up the corresponding list of candidates and search it linearly. As long as the keys do not have overly dense bit patterns, this should provide a significant speedup.
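Below is a sketch of that per-bit index (the names are mine, and the bit-frequency reordering is omitted). One adjustment on my part: instead of looking only at the query's lowest set bit, the sketch scans the candidate list for every set bit of the query, accepting each key only at its own lowest set bit, so matching keys that happen not to contain the query's lowest bit are not missed. __builtin_ctzll is GCC/Clang-specific; C++20's std::countr_zero is the portable equivalent.

    #include <cstdint>
    #include <vector>

    // Keys are a single 64-bit word here for brevity; longer keys would use
    // an array of words and the same idea.
    struct BitIndex {
        static const int kBits = 64;
        std::vector<uint64_t> keys_with_bit[kBits];   // keys that contain bit b

        void insert(uint64_t key) {
            for (int b = 0; b < kBits; ++b)
                if ((key >> b) & 1) keys_with_bit[b].push_back(key);
        }

        // Return all keys whose 1 bits are a subset of the query's 1 bits.
        std::vector<uint64_t> query(uint64_t q) const {
            std::vector<uint64_t> out;
            for (uint64_t rest = q; rest != 0; rest &= rest - 1) {  // each set bit of q
                int b = __builtin_ctzll(rest);
                for (uint64_t key : keys_with_bit[b])
                    // Accept a key only at its own lowest set bit, so it is
                    // reported exactly once, and only if it matches the query.
                    if (__builtin_ctzll(key) == b && (q & key) == key)
                        out.push_back(key);
            }
            return out;
        }
    };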

Fast space efficient data structure for set membership queries on small sets

I am trying to create a data structure for a fixed size set that should support the following operations:
Query whether an element is in the set (false positives are ok, false negatives are not)
Replace one element of the set with another element
In my case, the size of the set is likely to be very small (4-16 elements), but the lookups must be as fast as possible and read as few bits as possible. Also, it needs to be space efficient. Replacements (i.e. operation 2) are likely to be few. I looked into the following options:
Bloom Filters: This is the standard solution. However, it is difficult to delete elements and as such difficult to implement operation 2.
Counting Bloom Filters: The space requirement becomes much higher (~3-4x that of the standard Bloom filter) for no decrease in false +ve rates.
Simply storing a list of hashes of all the elements: Gives better false +ve rates than a counting Bloom filter for similar space requirements, but is expensive to look up (in the worst case all bits will be looked up).
The previous idea, with perfect hashing for the location: I don't know of fast perfect hashes for small sets of elements.
Additional Information:
The elements are 64 bit numbers.
Any ideas on how to solve this?
A Cuckoo filter is an option that should be considered. To quote the paper's abstract:
Cuckoo Filter: Practically Better Than Bloom
We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions, substantially in both time and space.
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
Well, note the following:
Using a standard hash table with a decent hash function (since the elements are numbers, there are a bunch of standard hash functions) and 4|S| entries will require on average fewer than 2 lookups (assuming unbiased numbers as input), though it might deteriorate to a terrible worst case of 4|S|. Of course you can bound it as follows:
- If the number of cells searched exceeds k, abort and return true (this will cause a FP with some probability that you can calculate, and will give faster worst-case performance).
Regarding counting Bloom filters - this is the way to do it, IMO. Note that a (standard) Bloom filter requires 154 bits to have a FP probability of 1%, or 100 bits to have a FP probability of 5% (for the 16-element case). (*)
So, if you need 4 times this number, you get 616 bits / 400 bits. Note that on most modern machines this is small enough to fit in a few CPU cache lines, which means (depending on the machine) reading all these bits could take less than 10 cycles.
IMO you cannot do anything to beat it without accepting a much higher FP rate.
(*) Calculated according to:
m = -n * ln(p) / (ln 2)^2
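A quick sanity check of the 154-bit and 100-bit figures using that formula, plus the usual k = (m/n) * ln 2 for the optimal number of hash functions (my own snippet, assuming n = 16, the largest set size in the question):

    #include <cmath>
    #include <cstdio>

    int main() {
        const double n = 16.0;                        // largest set size in the question
        for (double p : {0.01, 0.05}) {
            double m = -n * std::log(p) / (std::log(2.0) * std::log(2.0));
            double k = (m / n) * std::log(2.0);       // optimal number of hash functions
            std::printf("p=%.2f  m=%.0f bits  k=%.1f hashes\n", p, std::ceil(m), k);
            // prints m=154 for p=0.01 and m=100 for p=0.05
        }
    }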
P.S. If you can guarantee each element is removed at most once, you can use a variation of the Bloom filter with double the space that has slightly better FP behaviour but also some FNs, by simply using 2 Bloom filters: one for inserted elements and one for deleted elements. An element is in the set if it is in the "inserted" filter and NOT in the "deleted" one.
This improves the FP rate at the expense of also having FNs.
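A minimal sketch of that two-filter idea (the sizes, names, and two-hash construction are illustrative only, not from the answer):

    #include <bitset>
    #include <cstdint>

    // "present" and "deleted" Bloom filters; membership = present AND NOT deleted.
    // Two hash functions derived from a splitmix64-style mixer, for brevity.
    struct DeletableBloom {
        static const size_t kBits = 1024;
        std::bitset<kBits> present, deleted;

        static uint64_t mix(uint64_t x) {               // splitmix64 finalizer
            x += 0x9e3779b97f4a7c15ULL;
            x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
            x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
            return x ^ (x >> 31);
        }
        static void set(std::bitset<kBits>& b, uint64_t x) {
            uint64_t h = mix(x);
            b.set(h % kBits);
            b.set((h >> 32) % kBits);
        }
        static bool test(const std::bitset<kBits>& b, uint64_t x) {
            uint64_t h = mix(x);
            return b.test(h % kBits) && b.test((h >> 32) % kBits);
        }

        void insert(uint64_t x) { set(present, x); }
        void erase(uint64_t x)  { set(deleted, x); }    // assumes each element erased at most once
        bool contains(uint64_t x) const {               // false positives and false negatives possible
            return test(present, x) && !test(deleted, x);
        }
    };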
Check out succinct data structures, for example Membership in Constant Time and Minimum Space.
There are many situations in dealing with a subset chosen from the bounded universe, in which the size of the subset is relatively big but not big enough to use a bit map.
