What algorithm can save one bit of storage space for each arbitrary 32-bit number in a LUT?

A lookup table has 4G entries in total. Each entry is an arbitrary 32-bit number, and no value ever repeats.
Is there an algorithm that can use the index of each entry together with its value (the 32-bit number) so that a bit at a fixed position in the stored value is always zero (so I can use that bit as a flag to log something), while still letting me recover the original 32-bit number by a reverse calculation?
Or, as a fallback, can I at least free up one fixed-position bit per pair of consecutive entries?
My question is whether there is any universal code that can save 1 bit for each arbitrary 32-bit number, so that I can use this bit as a lock flag. Alternatively, is there a way to combine the index and the value of a lookup table entry in some calculation to save 1 bit of storage for the value?

It is not at all clear what you are asking. However, I can perhaps find one thing in there that can be addressed, if I am reading it correctly, which is that you have a permutation of all of the integers in 0..2^32-1. Such a permutation can be represented in fewer bits than the direct representation, which takes 32*2^32 bits. With a perfect representation of the permutations, each would be ceiling(log2((2^32)!)) bits, since there are (2^32)! possible permutations. That length turns out to be about 95.5% of the bits in the direct representation. So each permutation could be represented in about 30.6*2^32 bits, effectively taking off more than one bit per word.
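The figures above can be checked numerically. The following is a minimal sketch, assuming floating-point precision is good enough for this estimate; `permutation_bits` is a hypothetical helper name, and it uses the log-gamma function to get log2(n!) without computing the factorial itself.

```python
import math

def permutation_bits(n: int) -> float:
    """Return log2(n!), the information content of one permutation of n items.

    lgamma(n + 1) equals ln(n!), so dividing by ln(2) converts to bits.
    """
    return math.lgamma(n + 1) / math.log(2)

n = 2 ** 32
total_bits = permutation_bits(n)
bits_per_entry = total_bits / n          # average bits needed per table entry
fraction_of_direct = bits_per_entry / 32  # compared with 32 bits per entry

print(f"bits per entry: {bits_per_entry:.3f}")
print(f"fraction of direct representation: {fraction_of_direct:.4f}")
```

Note that this only gives the size of an ideal encoding of the whole table. It does not by itself provide random access to individual entries, which is what a lookup table needs.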

Related

If a Bitcoin mining nonce is just 32 bits long, how come it is increasingly difficult to find the winning hash?

I'm learning about mining, and the first thing that surprised me is that the nonce part of the algorithm, which is supposed to be looped over until you get a hash smaller than the target, is just 32 bits long.
Can you explain why it is so difficult to loop over an unsigned int, and how come it gets increasingly difficult over time? Thank you.

The task is: try different nonce values in your potential block until you reach a block having a hash value below some given threshold.
I can't find the source right now, but I'm quite sure that since the introduction of special mining ASICs the 32-bit nonce is no longer enough to keep the miners busy for the planned 10-minute interval between blocks. They are able to compute 4 billion block hashes in less than 10 minutes.
Increasing the difficulty didn't help anymore, as that reached the point where none of the 4 billion possible nonce values gave a hash below the threshold.
So they found some additional fields in the block that are now used as nonce-extension. The principle is still the same: try different values until you reach a block with a hash below the threshold, only now it's more than 32 bits that can be varied, allowing for the threshold to be lowered beyond the former 32-bit-implied barrier.
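The principle can be sketched as a toy loop. This is not real Bitcoin code: the header bytes, the `block_hash` and `mine` helpers, and the way the digest is turned into an integer are all assumptions made for illustration. It only shows "try nonces until the hash is below a threshold", and that the 32-bit nonce space can run out without a hit.

```python
import hashlib

def block_hash(header: bytes, nonce: int) -> int:
    """Double SHA-256 of header + 4-byte nonce, read as a 256-bit integer (toy layout)."""
    data = header + nonce.to_bytes(4, "little")
    digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
    return int.from_bytes(digest, "big")

def mine(header: bytes, target: int, max_nonce: int = 2 ** 32):
    """Return the first nonce whose hash is below target, or None if none is found."""
    for nonce in range(max_nonce):
        if block_hash(header, nonce) < target:
            return nonce
    return None  # nonce space exhausted: some other field of the header must change

# An easy target: roughly 1 in 4096 hashes falls below it.
easy_target = 1 << 244
found = mine(b"toy block header", easy_target)
print("nonce found:", found)
```

When `mine` returns `None`, a miner in this toy model would alter another part of the header (the "nonce extension" described above) and start over.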

Because it's not just the 32-bit nonce that is involved in the calculation. The 1 MB of transaction data is also part of the mining input. There is then a non-trivial amount of arithmetic to arrive at the output, which can then be compared with the target.
Bitcoin mining is looping over all 4 billion uints until you find a "right" one.
The way that difficulty is increased is that only some of the bits of the output matter. E.g. early on the lowest 11 bits had to be some specific pattern, and the remaining 21 bits could be anything. In theory there would be 2 million "right" values for each transaction block, uniformly distributed across the range of a uint. Then the "difficulty" is increased so that 13 bits have to be some pattern, so now there are 4x fewer "right" answers, and it takes (on average) 4x longer to find one.
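The arithmetic in this answer can be written out as follows. This is a sketch of the simplified model used above (a fixed number of output bits must match a pattern), not of the actual Bitcoin target rule; `expected_valid_nonces` is a hypothetical helper.

```python
NONCE_SPACE = 2 ** 32  # number of distinct 32-bit nonce values

def expected_valid_nonces(required_bits: int) -> int:
    """Expected count of nonces whose output matches a required_bits-bit pattern."""
    return NONCE_SPACE // (2 ** required_bits)

early = expected_valid_nonces(11)  # about 2 million "right" values
later = expected_valid_nonces(13)  # 4x fewer after two more bits are required
print(early, later, early // later)
```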

Such a thing as a constant quality (variable bit) digest hashing algorithm?

Problem space: We have a ton of data to digest that can range 6 orders of magnitude in size. Looking for a way to be more efficient, and thus use less disk space to store all of these digests.
So I was thinking about lossy audio encoding, such as MP3. There are two basic approaches: constant bitrate and constant quality (aka variable bitrate). Since my primary interest is quality, I usually go for VBR. Thus, to achieve the same level of quality, a pure sine tone would require a significantly lower bitrate than something like a complex classical piece.
Using the same idea, two very small data chunks should require significantly less total digest bits than two very large data chunks to ensure roughly the same statistical improbability (what I am calling quality in this context) of their digests colliding. This is an assumption that seems intuitively correct to me, but then again, I am not a crypto mathematician. Also note that this is all about identification, not security. It's okay if a small data chunk has a small digest, and thus computationally feasible to reproduce.
I tried searching around the inter-tubes for anything like this. The closest thing I found was a posting somewhere that talked about using a fixed-size digest hash, like SHA256, as an initialization vector for AES/CTR acting as a pseudo-random generator, and then taking the first x bits of its output.
That seems like a totally do-able thing. The only problem with this approach is that I have no idea how to calculate the appropriate value of x as a function of the data chunk size. I think my target quality would be statistical improbability of SHA256 collision between two 1GB data chunks. Does anyone have thoughts on this calculation?
Are there any existing digest hashing algorithms that already do this? Or are there any other approaches that will yield this same result?
Update: Looks like there is the SHA3 Keccak "sponge" that can output an arbitrary number of bits. But I still need to know how many bits I need as a function of input size for a constant quality. It sounded like this algorithm produces an infinite stream of bits, and you just truncate at however many you want. However testing in Ruby, I would have expected the first half of a SHA3-512 to be exactly equal to a SHA3-256, but it was not...
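The observation in the update can be reproduced with Python's standard `hashlib`, used here instead of Ruby purely for illustration. The fixed-length SHA3-256 and SHA3-512 functions are distinct functions, so one is not a prefix of the other. The SHAKE functions from the same family are the extendable-output variants: asking for fewer bytes yields a prefix of the longer output.

```python
import hashlib

data = b"some data chunk"

# Fixed-length SHA-3 variants: the first half of SHA3-512 is not SHA3-256.
sha3_256 = hashlib.sha3_256(data).digest()
sha3_512 = hashlib.sha3_512(data).digest()
print(sha3_512[:32] == sha3_256)

# Extendable-output function: shorter outputs are prefixes of longer ones.
short = hashlib.shake_256(data).digest(16)  # 128-bit digest
long = hashlib.shake_256(data).digest(32)   # 256-bit digest
print(long[:16] == short)
```

This only shows how to get a digest of a chosen length. How many bits to choose is a separate question.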

Your logic from the comment is fairly sound. Quality hash functions will not generate a duplicate/previously generated output until the input length is nearly (or has exceeded) the hash digest length.
But the key factor in collision risk is the size of the input set relative to the size of the hash digest. When using a quality hash function, the chance of a collision for two 1 TB files is not significantly different from the chance of a collision for two 1 KB files, or even one 1 TB and one 1 KB file. This is because hash functions strive for uniformity; good functions achieve it to a high degree.
Due to the birthday problem, the collision resistance of a hash function is lower than the bit width of its output suggests: with an n-bit digest, collisions become likely after roughly 2^(n/2) inputs, not 2^n. The wiki article for the pigeonhole principle, which is the basis for the birthday problem, says:
The [pigeonhole] principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression is lossless), which possibility the pigeonhole principle excludes.
So going to a 'VBR' hash digest is not guaranteed to save you space. The birthday problem provides the math for calculating the chance that two random things will share the same property (a hash code is a property, in a broad sense), but this article gives a better summary, including the following table.
Source: preshing.com
The top row of the table says that in order to have a 50% chance of a collision with a 32-bit hash function, you only need to hash 77k items. For a 64-bit hash function, that number rises to 5.04 billion for the same 50% collision risk. For a 160-bit hash function, you need 1.42 * 10^24 inputs before there is a 50% chance that a new input will have the same hash as a previous input.
Note that 1.42 * 10^24 160-bit numbers would themselves take up an unreasonably large amount of space: tens of trillions of terabytes, if I'm doing the math right (1.42 * 10^24 values of 20 bytes each is about 2.8 * 10^25 bytes). And that's without counting the 10^24 item values they represent.
The bottom end of that table should convince you that a 160-bit hash function has a sufficiently low risk of collisions. In particular, you would have to have 10^21 hash inputs before there is even a 1 in a million chance of a hash collision. That's why your searching turned up so little: it's not worth dealing with the complexity.
No matter what hash strategy you decide upon, however, there is a non-zero risk of collision. Any type of ID system that relies on a hash needs to have a fallback comparison. An easy additional check for files is to compare their sizes (this works well for any variable-length data where the length is known, such as strings). Wikipedia covers several different collision mitigation and detection strategies for hash tables, most of which can be extended to a filesystem with a little imagination. If you require perfect fidelity, then after you've run out of fast checks, you need to fall back to the most basic comparator: the expensive bit-for-bit check of the two inputs.
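The numbers quoted from the table follow from the usual birthday approximation. A minimal sketch, assuming an ideal uniformly distributed hash; `inputs_for_collision_probability` is a hypothetical helper name:

```python
import math

def inputs_for_collision_probability(bits: int, p: float) -> float:
    """Approximate number of random inputs at which a bits-wide hash
    has collision probability p, using n ~ sqrt(2 * 2^bits * ln(1 / (1 - p)))."""
    return math.sqrt(2.0 * (2.0 ** bits) * -math.log1p(-p))

n32 = inputs_for_collision_probability(32, 0.5)            # roughly 77 thousand
n160 = inputs_for_collision_probability(160, 0.5)          # roughly 1.42e24
n160_rare = inputs_for_collision_probability(160, 1e-6)    # on the order of 1e21

print(f"{n32:.0f} {n160:.3e} {n160_rare:.3e}")
```

The size of the individual inputs does not appear anywhere in the formula; only the number of inputs and the digest width do.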

If I understand the question correctly, you have a number of data items of different lengths, and for each item you are computing a hash (i.e. a digest) so the items can be identified.
Suppose you have already hashed N items (without collisions), and you are using a 64bit hash code.
The next item you hash will take one of 2^64 values, so you will have an N / 2^64 probability of a hash collision when you add the next item.
Note that this probability does NOT depend on the original size of the data item. It does depend on the total number of items you have to hash, so you should choose the number of bits according to the probability you are willing to tolerate of a hash collision.
However, if you have partitioned your data set in some way such that there are different numbers of items in each partition, then you may be able to save a small amount of space by using variable sized hashes.
For example, suppose you use 1TB disk drives to store items, and all items >1GB are on one drive, while items <1KB are on another, and a third is used for intermediate sizes. There will be at most 1000 items on the first drive so you could use a smaller hash, while there could be a billion items on the drive with small files so a larger hash would be appropriate for the same collision probability.
In this case the hash size does depend on file size, but only in an indirect way based on the size of the partitions.
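To make the partition example concrete, here is a rough sketch that picks a digest width from the item count and a tolerated overall collision probability. It assumes an ideal hash and the birthday approximation p ≈ n^2 / 2^(b+1); the function name and the tolerance of one in a billion are illustrative choices, not from the answer.

```python
import math

def bits_needed(num_items: int, max_collision_probability: float) -> int:
    """Smallest digest width b with num_items^2 / 2^(b+1) <= max_collision_probability."""
    return math.ceil(math.log2(num_items ** 2 / (2 * max_collision_probability)))

tolerance = 1e-9
big_files_drive = bits_needed(1_000, tolerance)            # at most ~1000 items
small_files_drive = bits_needed(1_000_000_000, tolerance)  # up to a billion items

print(big_files_drive, small_files_drive)
```

The drive holding few large files gets by with a much shorter digest than the drive holding a billion small files, which is the indirect dependence on file size described above.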

Efficiently Store List of Numbers in Binary Format

I'm writing a compression algorithm (mostly for fun) in C, and I need to be able to store a list of numbers in binary. Each element of this list will be in the form of two digits, both under 10 (like (5,5), (3,6), (9,2)). I'll potentially be storing thousands of these pairs (one pair is made for each character in a string in my compression algorithm).
Obviously the simplest way to do this would be to concatenate each pair (-> 55, 36, 92) to make a 2-digit number (since they're just one digit each), then store each pair as a 7-bit number (since 99 is the highest). Unfortunately, this isn't so space-efficient (7 bits per pair).
Then I thought perhaps if I concatenate each pair, then concatenate that (553692), I'd be able to then store that as a plain number in binary form (10000111001011011100, which for three pairs is already smaller than storing each number separately), and keep a quantifier for the number of bits used for the binary number. The only problem is, this approach requires a bigint library and could be potentially slow because of that. As the number gets bigger and bigger (+2 digits per character in the string) the memory usage and slowdown would get bigger and bigger as well.
So here's my question: Is there a better storage-efficient way to store a list of numbers like I'm doing, or should I just go with the bignum or 7-bit approach?

The information-theoretic minimum for storing 100 different values is log2(100) bits, which is about 6.644. In other words, the possible compression from 7 bits is a hair more than 5%. (log2(100) / 7 is 94.91%.)
If these pairs are simply for temporary storage during the algorithm, then it's almost certainly not worth going to a lot of effort to save 5% of storage, even if you managed to do that.
If the pairs form part of your compressed output, then your compression cannot be great (a character is only eight bits, and presumably the pairs are additional to any compressed character data). Nonetheless, the easy compression technique is to store up to 6 pairs in 40 bits (5 bytes), which can be done without a bigint package assuming a 64-bit machine. (Alternatively, store up to 3 pairs in 20 bits and then pack two 20-bit sequences into five bytes.) That gives you 99.66% of the maximum compression for the values.
All of the above assumes that the 100 possible values are equally distributed. If the distribution is not even and it is possible to predict the frequencies, then you can use Huffman coding to improve compression. Even so, I wouldn't recommend it for temporary storage.
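The 6-pairs-in-5-bytes idea works because 100^6 = 10^12 is less than 2^40. A minimal sketch follows, written in Python for brevity although the question targets C, where the same arithmetic fits in a 64-bit unsigned integer. `pack_pairs` and `unpack_pairs` are hypothetical names, and the pair count is assumed to be stored elsewhere.

```python
def pack_pairs(pairs):
    """Pack up to 6 (a, b) pairs, each digit in 0..9, into exactly 5 bytes."""
    if len(pairs) > 6:
        raise ValueError("at most 6 pairs fit in 40 bits")
    value = 0
    for a, b in pairs:
        value = value * 100 + (a * 10 + b)  # treat each pair as one base-100 digit
    return value.to_bytes(5, "big")

def unpack_pairs(blob, count):
    """Recover count pairs from a 5-byte block produced by pack_pairs."""
    value = int.from_bytes(blob, "big")
    pairs = []
    for _ in range(count):
        value, code = divmod(value, 100)   # peel off the last base-100 digit
        pairs.append(divmod(code, 10))     # split it back into (a, b)
    pairs.reverse()
    return pairs

original = [(5, 5), (3, 6), (9, 2), (0, 0), (9, 9), (1, 7)]
packed = pack_pairs(original)
print(len(packed), unpack_pairs(packed, len(original)) == original)
```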

Efficient mapping from 2^24 values to a 2^7 index

I have a data structure that stores amongst others a 24-bit wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index to that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search for which one of the 2^7 values fits best. Obviously, this takes some time for a large number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation of the 16 MB array: On average 31 consecutive values are the same. Unfortunately there are also a number of consecutive values that are different.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?

Hard to say without knowing what the definition is of "best fit". Perhaps a kd-tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates, and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem that an image processor has when reducing to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and might help.

As an idea...
Up the index table to 8 bits, then XOR all 3 bytes of the 24-bit word into it.
Then your table would consist of this 8-bit hash value, plus the index back to the original 24-bit value.
Since your data is RGB-like, a more sophisticated hashing method may be needed.
bit24var & 0xff gives you the rightmost byte.
(bit24var >> 8) & 0xff gives you the one beside it.
(bit24var >> 16) & 0xff gives you the one beside that.
Yes, you are thinking correctly. It is quite likely that more than one of the 24-bit values will hash to the same index, due to the pigeonhole principle.
One method of resolving a hash clash is to use some sort of chaining.
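A minimal sketch of the XOR-fold hash with chaining described above. The helper names and sample values are made up for illustration, and the lookup only finds exact matches; it does not implement any "best fit" search.

```python
def xor_fold(value24: int) -> int:
    """Fold a 24-bit value into 8 bits by XOR-ing its three bytes."""
    return (value24 ^ (value24 >> 8) ^ (value24 >> 16)) & 0xFF

def build_table(important_values):
    """Build 256 buckets; each bucket is a chain of (value, index) entries."""
    table = [[] for _ in range(256)]
    for index, value in enumerate(important_values):
        table[xor_fold(value)].append((value, index))
    return table

def lookup(table, value24: int):
    """Return the index of value24 in the important list, or None if absent."""
    for value, index in table[xor_fold(value24)]:
        if value == value24:
            return index
    return None

important = [0x123456, 0x563412, 0xABCDEF]  # the first two collide on purpose
table = build_table(important)
print(lookup(table, 0x563412), lookup(table, 0x000070))
```

The table costs 256 buckets plus one entry per important value, instead of 16 MB for a full mapping array.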

Another idea would be to put your important values in a different array, then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.

How many 2^24 values do you have? Can you sort these values and count them by counting the number of consecutive equal values?

Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply just filter incoming data and assign a value, starting from 0 and up to 2^7-1, to these values as we encounter them. Of course, we would need some way of keeping track of which of the important values we have already seen and assigned a label in [0,2^7) already. For that we can use some sort of tree or hashtable based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random

def make_mapping(data, important):
    """Assign labels 0, 1, 2, ... to important values in order of first appearance."""
    mapping = dict()  # dictionary to hold the final mapping
    next_index = 0    # the next free label that can be assigned to an incoming value
    for elem in data:
        if elem in important:        # check that the element is important
            if elem not in mapping:  # check that this element hasn't been assigned a label yet
                mapping[elem] = next_index
                next_index += 1      # this label is assigned, the next new important value will get the next label
    return mapping

if __name__ == '__main__':
    important_values = [1, 5, 200000, 6, 24, 33]
    data = list(range(0, 300000))  # list() so that random.shuffle works in Python 3
    random.shuffle(data)
    answer = make_mapping(data, important_values)
    print(answer)
You can make the search much faster by using a hash- or tree-based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if it is a hash table), where n is the size of the input and k is the number of important values.

Another idea is to represent the set of 24-bit values as a bitmap. An unsigned char can hold 8 bits, so one would need 2^24 / 8 = 2^21 array elements. That's 2,097,152 bytes (2 MB). If the corresponding bit is set, then you know that that specific 24-bit value is present in the array and needs to be checked.
One would need an iterator, to walk through the array and find the next set bit. Some machines actually provide a "find first bit" operation in their instruction set.
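A rough sketch of such a bitmap with a "find next set bit" helper. The class and method names are illustrative; a real implementation in C would likely scan whole machine words and use a hardware bit-scan instruction where available.

```python
class Bitmap24:
    """One bit per possible 24-bit value, stored in 2^21 bytes (2 MB)."""

    SIZE = 1 << 24

    def __init__(self):
        self.bits = bytearray(self.SIZE // 8)

    def set(self, value: int) -> None:
        self.bits[value >> 3] |= 1 << (value & 7)

    def test(self, value: int) -> bool:
        return bool(self.bits[value >> 3] & (1 << (value & 7)))

    def next_set(self, start: int = 0):
        """Return the smallest set value >= start, or None if there is none."""
        value = start
        while value < self.SIZE:
            # Drop the bits below `value` inside the current byte.
            rest = self.bits[value >> 3] >> (value & 7)
            if rest:
                # Offset of the lowest set bit in what remains of this byte.
                return value + ((rest & -rest).bit_length() - 1)
            value = (value | 7) + 1  # jump to the first bit of the next byte
        return None

bitmap = Bitmap24()
for v in (5, 200000):
    bitmap.set(v)
print(bitmap.test(5), bitmap.test(6), bitmap.next_set(0), bitmap.next_set(6))
```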
Good luck on your quest.
Let us know how things turn out.
Evil.

What are some common uses for bitarrays?

I've done an example using bitarrays from a newbie manual. I want to know what they can be used for and what some common data structures for them are (assuming that "array" is fairly loose terminology).
Thanks.

There are several listed in the Applications section of the Bit array Wikipedia article:
Because of their compactness, bit arrays have a number of applications in areas where space or efficiency is at a premium. Most commonly, they are used to represent a simple group of boolean flags or an ordered sequence of boolean values.
We mentioned above that bit arrays are used for priority queues, where the bit at index k is set if and only if k is in the queue; this data structure is used, for example, by the Linux kernel, and benefits strongly from a find-first-zero operation in hardware.
Bit arrays can be used for the allocation of memory pages, inodes, disk sectors, etc. In such cases, the term bitmap may be used. However, this term is frequently used to refer to raster images, which may use multiple bits per pixel.
Another application of bit arrays is the Bloom filter, a probabilistic set data structure that can store large sets in a small space in exchange for a small probability of error. It is also possible to build probabilistic hash tables based on bit arrays that accept either false positives or false negatives.
Bit arrays and the operations on them are also important for constructing succinct data structures, which use close to the minimum possible space. In this context, operations like finding the nth 1 bit or counting the number of 1 bits up to a certain position become important.
Bit arrays are also a useful abstraction for examining streams of compressed data, which often contain elements that occupy portions of bytes or are not byte-aligned. For example, the compressed Huffman coding representation of a single 8-bit character can be anywhere from 1 to 255 bits long.
In information retrieval, bit arrays are a good representation for the posting lists of very frequent terms. If we compute the gaps between adjacent values in a list of strictly increasing integers and encode them using unary coding, the result is a bit array with a 1 bit in the nth position if and only if n is in the list. The implied probability of a gap of n is 1/2^n. This is also the special case of Golomb coding where the parameter M is 1; this parameter is only normally selected when -log(2-p)/log(1-p) ≤ 1, or roughly the term occurs in at least 38% of documents.
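The last point in the quoted list can be illustrated with a short sketch. It assumes positions are counted from 1 and that a gap of g is written in unary as g - 1 zero bits followed by a one bit; the function name is made up for this example.

```python
def unary_gap_encode(postings):
    """Encode a strictly increasing list of positive integers as a bit list
    by writing each gap g as (g - 1) zeros followed by a single one."""
    bits = []
    previous = 0
    for n in postings:
        gap = n - previous
        bits.extend([0] * (gap - 1) + [1])
        previous = n
    return bits

postings = [2, 3, 7]
bits = unary_gap_encode(postings)
# Reading the positions of the 1 bits (counting from 1) recovers the list.
decoded = [i + 1 for i, bit in enumerate(bits) if bit]
print(bits, decoded)
```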
