Generate random 128-bit decimal in given range in Go

Let's say that we have a random number generator that can generate random 32- or 64-bit integers (like rand.Rand in the standard library).
Generating a random int64 in a given range [a,b] is fairly easy:
rand.Seed(time.Now().UnixNano())
n := rand.Int63n(b-a+1) + a // Int63n(m) returns a value in [0, m), so b-a+1 covers the inclusive range [a, b]
Is it possible to generate a random 128-bit decimal (as defined in the IEEE 754-2008 specification) in a given range from a combination of 32- or 64-bit random integers?

It is possible, but the solution is far from trivial. For a correct solution, there are several things to consider.
For one thing, values with exponent E are 10 times more likely than values with exponent E - 1.
Other issues include subnormal numbers and ranges that straddle zero.
I am aware of the Rademacher Floating-Point Library, which tackled this problem for binary floating-point numbers, but the solution there is complicated and its author has not yet written up how his algorithm works.
EDIT (May 11):
I have now specified an algorithm for generating random "uniform" floating-point numbers:
in any range,
with full coverage, and
regardless of the digit base (such as binary or decimal).

Possible, but by no means easy. Here is a sketch of a solution that might be acceptable — writing and debugging it would probably be at least a day of concerted effort.
Let min and max be primitive.Decimal128 objects from go.mongodb.org/mongo-driver/bson. Let MAXBITS be a multiple of 32; 128 is likely to be adequate.
1. Get the significand (as big.Int) and exponent (as int) of min and max using the BigInt method.
2. Align min and max so that they have the same exponent. As far as possible, left-justify the value with the larger exponent by decreasing its exponent and adding a corresponding number of zeroes to the right side of its significand. If this would cause the absolute value of the significand to become >= 2**(MAXBITS-1), then either
(a) right-shift the value with the smaller exponent by dropping digits from the right side of its significand and increasing its exponent, causing precision loss,
(b) dynamically increase MAXBITS, or
(c) throw an error.
3. At this point both exponents will be the same, and both significands will be aligned big integers. Set aside the exponents for now, and let range (a new big.Int) be maxSignificand - minSignificand. It will be between 0 and 2**MAXBITS.
4. Turn range into MAXBITS/32 uint32s using the Bytes or DivMod methods, whichever is easier.
5. If the highest word of range is equal to math.MaxUint32 then set a flag limit to false, otherwise true.
6. For n from 0 to MAXBITS/32 - 1, working from the most significant word of range to the least significant:
If limit is true, use rand.Int63n (!, not rand.Int31n or rand.Uint32, since the bound can be as large as 2**32 and doesn't fit in an int32) to generate a value between 0 and the nth word of range, inclusive, cast it to uint32, and store it as the nth word of the output. If the value generated is equal to the nth word of range (i.e. if we generated the maximum possible random value for this word) then let limit remain true, otherwise set it false.
If limit is false, use rand.Uint32 to generate the nth word of the output. limit remains false regardless of the generated value.
7. Combine the generated words into a big.Int by building a []byte and using big.Int.SetBytes, or by multiplication and addition, as convenient.
8. Add the generated value to minSignificand to obtain the significand of the result.
9. Use ParseDecimal128FromBigInt with the result significand and the exponent from steps 2-3 to obtain the result.
The heart of the algorithm is step 6, which generates a uniform random unsigned integer of arbitrary length 32 bits at a time. The alignment in step 2 reduces the problem from a floating-point to an integer one, and the subtraction in step 3 reduces it to an unsigned one, so that we only have to think about one bound instead of 2. The limit flag records whether we're still dealing with that bound, or whether we've already narrowed the result down to an interval that doesn't include it.
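For concreteness, here is a rough, untested Go sketch of step 6 only: drawing a uniform random big.Int in [0, range] one 32-bit word at a time, most significant word first. The function name, the use of FillBytes and the fixed MAXBITS constant are my own choices, and the limit flag here simply starts out true and is maintained inside the loop (a small simplification of step 5).

// Sketch only: uniform random big.Int in [0, rng], built 32 bits at a time.
package main

import (
	"fmt"
	"math/big"
	"math/rand"
)

const MAXBITS = 128

func randomBelowOrEqual(r *rand.Rand, rng *big.Int) *big.Int {
	// Split rng into MAXBITS/32 big-endian uint32 words (rng must fit in MAXBITS bits).
	buf := rng.FillBytes(make([]byte, MAXBITS/8))
	words := make([]uint32, MAXBITS/32)
	for i := range words {
		words[i] = uint32(buf[4*i])<<24 | uint32(buf[4*i+1])<<16 |
			uint32(buf[4*i+2])<<8 | uint32(buf[4*i+3])
	}

	out := make([]byte, 0, MAXBITS/8)
	limit := true // output is still pinned to the upper bound
	for _, w := range words {
		var v uint32
		if limit {
			// rand.Int63n because the bound w+1 may not fit in an int32.
			v = uint32(r.Int63n(int64(w) + 1))
			if v < w {
				limit = false // strictly below the bound from here on
			}
		} else {
			v = r.Uint32()
		}
		out = append(out, byte(v>>24), byte(v>>16), byte(v>>8), byte(v))
	}
	return new(big.Int).SetBytes(out)
}

func main() {
	r := rand.New(rand.NewSource(1))
	rng, _ := new(big.Int).SetString("123456789012345678901234567890", 10)
	fmt.Println(randomBelowOrEqual(r, rng))
}

Steps 7-9 would then add minSignificand back and rebuild the Decimal128 with ParseDecimal128FromBigInt, as described above.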
Caveats:
I haven't written this, let alone tested it. I may have gotten it quite wrong. A sanity check by someone who does more numerical computation work than me would be welcome.
Generating numbers across a large dynamic range (including crossing zero) will lose some precision and omit some possible output values with smaller exponents unless a ludicrously large MAXBITS is used; however, 128 bits should give a result at least as good as a naive algorithm implemented in terms of decimal128.
The performance is probably pretty bad.

Go has a large number package that can do arbitrary length integers: https://golang.org/pkg/math/big/
It has a pseudo random number generator https://golang.org/pkg/math/big/#Int.Rand, and the crypto package also has https://golang.org/pkg/crypto/rand/#Int
You'd want to specify the max using https://golang.org/pkg/math/big/#Int.Exp as 2^128.
Can't speak to performance, though, or whether this is compliant with the IEEE standard, but large random numbers like what you'd use for UUIDs are possible.
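As a hedged illustration of this suggestion (not this answer's own code), generating a uniform random integer below 2^128 with math/big and crypto/rand might look like this:

package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

func main() {
	// upper = 2^128, built with Int.Exp as suggested above.
	upper := new(big.Int).Exp(big.NewInt(2), big.NewInt(128), nil)
	n, err := rand.Int(rand.Reader, upper) // uniform in [0, 2^128)
	if err != nil {
		panic(err)
	}
	fmt.Println(n)
}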

It depends how many values you want to generate. If it's enough to have no more than 10^34 values in a specified range, it's quite simple.
As I see the problem, a random value in the range min..max can be calculated as random(0..1)*(max-min)+min.
It looks like we only need to generate a decimal128 value in the range 0..1. That is a random value in the range 0..10^34-1 with exponent -34, and it can be generated with the standard Go random package.
To multiply, add and subtract decimal128 values, the Go math/big package can be used, with normalization of the values.
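A speculative sketch of this idea, using math/big for the significand and the primitive package from the mongo-driver mentioned in the earlier answer to build the decimal128 (the scaling into min..max is left out, and the variable names are mine):

package main

import (
	"fmt"
	"math/big"
	"math/rand"

	"go.mongodb.org/mongo-driver/bson/primitive"
)

func main() {
	r := rand.New(rand.NewSource(42))
	limit := new(big.Int).Exp(big.NewInt(10), big.NewInt(34), nil) // 10^34
	sig := new(big.Int).Rand(r, limit)                             // uniform in [0, 10^34)
	d, err := primitive.ParseDecimal128FromBigInt(sig, -34)        // value = sig * 10^-34
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // a decimal128 value in [0, 1)
}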

This is definitely what you are looking for.

Related

Random number generator with freely chosen period

I want a simple (non-cryptographic) random number generation algorithm where I can freely choose the period.
One candidate would be a special instance of LCG:
X(n+1) = (aX(n)+c) mod m (m,c relatively prime; (a-1) divisible by all prime factors of m and also divisible by 4 if m is).
This has period m and does not restrict possible values of m.
I intend to use this RNG to create a permutation of an array by generating indices into it. I tried the LCG and it might be OK. However, it may not be "random enough" in that distances between adjacent outputs have very few possible values (i.e., plotting x(n) vs n gives a wrapped line). The arrays I want to index into have some structure that has to do with this distance, and I want to avoid potential issues with this.
Of course, I could use any good PRNG to shuffle (using e.g. Fisher–Yates) an array [1,..., m]. But I don't want to have to store this array of indices. Is there some way to capture the permuted indices directly in an algorithm?
I don't really mind the method ending up biased w.r.t. the choice of RNG seed. Only the period matters, and the permuted sequence (for a given seed) should be reasonably random.
Encryption is a one-to-one operation. If you encrypt a range of numbers, you will get the same count of apparently random numbers back. In this case the period will be the size of the chosen range. So for a period of 20, encrypt the numbers 0..19.
If you want the output numbers to be in a specific range, then pick a block cipher with an appropriately sized block and use Format Preserving Encryption if needed, as @David Eisenstat suggests.
It is not difficult to set up a cipher with almost any reasonable block size, so long as it is an even number of bits, using the Feistel structure. If you don't require cryptographic security then four or six Feistel rounds should give you enough randomness.
Changing the encryption key will give you a different ordering of the numbers.
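As a hedged, non-cryptographic illustration of this answer (not a vetted cipher), here is a small Go sketch: a few Feistel rounds over the smallest even-bit block covering m, plus cycle-walking so the permutation stays inside 0..m-1. The round function and key handling are placeholders of my own.

package main

import (
	"fmt"
	"math/bits"
)

// feistel permutes a value in [0, 1<<(2*halfBits)) using one round per key.
func feistel(x uint64, halfBits uint, keys []uint32) uint64 {
	mask := uint64(1)<<halfBits - 1
	l, r := uint32(x>>halfBits), uint32(x&mask)
	for _, k := range keys {
		// Cheap non-cryptographic round function: multiply, then rotate.
		f := bits.RotateLeft32((r^k)*0x9E3779B1, 13)
		l, r = r, l^(f&uint32(mask))
	}
	return uint64(l)<<halfBits | uint64(r)&mask
}

// permute maps i in [0, m) to a unique value in [0, m) by cycle-walking:
// apply the Feistel permutation repeatedly until the result lands in range.
func permute(i, m uint64, keys []uint32) uint64 {
	halfBits := uint((bits.Len64(m-1) + 1) / 2) // smallest even-bit block covering m
	x := feistel(i, halfBits, keys)
	for x >= m {
		x = feistel(x, halfBits, keys)
	}
	return x
}

func main() {
	keys := []uint32{0xA5A5A5A5, 0x3C3C3C3C, 0x0F0F0F0F, 0xF0F0F0F0} // acts as the "seed"
	m := uint64(20)
	for i := uint64(0); i < m; i++ {
		fmt.Print(permute(i, m, keys), " ") // prints a permutation of 0..19
	}
	fmt.Println()
}

Changing the keys changes the ordering, which plays the role of the seed mentioned above.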

Compression by quasi-logarithmic scale

I need to compress a large set of (unsigned) integer values, where the goal is to keep their relative accuracy. Simply speaking, there is a big difference between 1, 2, 3, but the difference between 1001, 1002, 1003 is minor.
So, I need a sort of lossy transformation. The natural choice is to build a logarithmic scale, but the drawback is that conversion to/from it requires floating-point operations, log/exp calculations, etc.
OTOH I don't need a truly logarithmic scale, I just need it to resemble it in some sense.
I came up with an idea of encoding numbers in a floating-point manner. That is, I allocate N bits for each compressed number, from which some represent the mantissa, and the remaining are for the order. The choice for the size of the mantissa and order would depend on the needed range and accuracy.
My question is: is it a good idea? Or perhaps there exists a better encoding scheme w.r.t. computation complexity vs quality (similarity to logarithmic scale).
What I implemented, in detail:
As I said, there are bits for the mantissa and bits for the order. The order bits are leading, so that the greater the encoded number, the greater the raw one.
The actual number is decoded by appending an extra leading bit to the mantissa (aka the implicit bit), and left-shifting it by the encoded order. The smallest decoded number would be 1 << M where M is the size of the mantissa. If the needed range should start from 0 (like in my case) then this number can be subtracted.
Encoding the number is also simple. Add the 1 << M, then find its order, i.e. how much it should be right-shifted until it fits our mantissa with the implicit leading bit, and then encoding is trivial. Finding the order is done via median search, which results in just a few ifs (for example, if there are 4 order bits, the max order is 15, and it is found within 4 ifs).
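A minimal Go sketch of the encode/decode just described, assuming M mantissa bits and E order bits (the constants, names and uint32 types are my own choices; it assumes the order fits into the E order bits, and it finds the order with bits.Len32 rather than the median search described above):

package main

import (
	"fmt"
	"math/bits"
)

const (
	M = 4 // mantissa bits
	E = 4 // order bits; encoded values are E+M bits, order bits leading
)

// encode maps a raw value to its compressed form.
func encode(raw uint32) uint32 {
	v := raw + (1 << M) // add the range offset / implicit leading bit
	order := uint32(0)
	if bits.Len32(v) > M+1 {
		order = uint32(bits.Len32(v) - (M + 1)) // how far to right-shift
	}
	mantissa := (v >> order) & ((1 << M) - 1) // drop the implicit bit
	return order<<M | mantissa
}

// decode maps a compressed value back to (the low end of) its raw group.
func decode(enc uint32) uint32 {
	order := enc >> M
	mantissa := enc & ((1 << M) - 1)
	v := (mantissa | (1 << M)) << order // restore implicit bit, shift back
	return v - (1 << M)                 // undo the range offset
}

func main() {
	for _, x := range []uint32{0, 1, 2, 3, 1001, 1002, 1003} {
		fmt.Printf("%5d -> enc %3d -> dec %5d\n", x, encode(x), decode(encode(x)))
	}
}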
I call this a "quazi-logarithmic" scale. The absolute precision decreases the greater is the number. But unlike the true logarithmic scale, where the granularity increases contiguously, in our case it jumps by factor of 2 after each fixed-size range.
The advantages of this encoding:
Fast encoding and decoding
No floating-point numbers, no implicit precision loss during manipulations with them, no boundary cases, etc.
Not dependent on standard libraries, complex math functions, etc.
Encoding and decoding may be implemented via C++ template stuff, so that conversion may even be implemented in compile-time. This is convenient to define some compile-time constants in a human-readable way.
In your compression algorithm every group of numbers that results in the same output after being compressed will be decompressed to the lowest number in that group. If you changed that to the number in the middle, the average error would be reduced.
E.g. for an 8-bit mantissa and 5-bit exponent the numbers in the range [0x1340, 0x1350) will be translated into 0x1340 by decompress(compress(x)). If the entire range were first compressed and afterwards decompressed, the total difference would be 120. If the output were 0x1348 instead, the total error would only be 64, which reduces the error by a solid 46.7%. So simply adding 2^(exponent - 1) to the output will significantly reduce the error of the compression scheme.
Apart from that I don't see much of an issue with this scheme. Just keep in mind that you'll need a specific encoding for 0. There would be alternative encodings, but without knowing anything specific about the input this one will be the best you can get.
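Continuing the hypothetical sketch from the question section, the midpoint correction suggested here could look roughly like this on the decompression side:

package main

import "fmt"

const M = 4 // mantissa bits, as in the earlier sketch

// decodeMid is decode with the suggested correction: add half of the
// 2^order-wide group so each code maps to the middle of its group
// instead of its low end.
func decodeMid(enc uint32) uint32 {
	order := enc >> M
	mantissa := enc & ((1 << M) - 1)
	v := (mantissa | (1 << M)) << order
	if order > 0 {
		v += 1 << (order - 1) // half of the group size
	}
	return v - (1 << M)
}

func main() {
	// 95 was the code for 1001..1003 in the earlier sketch; it now decodes to 992,
	// the middle of its [976, 1008) group, rather than to 976.
	fmt.Println(decodeMid(95))
}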
EDIT:
While it is possible to move the correction of the result from the decompression to the compression step, this comes at the expense of enlarging the exponent range by one. This is due to the fact that, for the numbers with the MSB set, only half of the numbers will use the corresponding exponent (the other half will be populated by numbers with the second-most-significant bit set). The upper half of the numbers with the MSB set will be placed in the next-higher order.
So, e.g., for 32-bit numbers encoded with a 15-bit mantissa, only numbers up to 0x8FFF FFFF will have order 15 (Mantissa = 0x1FFF and Exponent = 15). All higher values will have order 16 (Mantissa = 0x?FFF and Exponent = 16). While the increase of the exponent by 1 in itself doesn't seem like much, in this example it already costs an additional bit for the exponent.
In addition the decompression-step for the above example will produce an integer-overflow, which may be problematic under certain circumstances (e.g. C# will throw an exception if the decompression is done in checked-mode). Same applies for the compression-step: unless properly handled, adding 2^(order(n) - 1) to the input n will cause an overflow, thus placing the number in order 0.
I would recommend moving the correction to the decompression-step (as shown above) to remove potential integer-overflows as a source of problems/bugs and keep the number of exponents that need to be encoded minimal.
EDIT2:
Another issue with this approach is the fact that half of the numbers (excluding the lowest order) wind up in a larger "group" when the correction is done on compression, thus reducing precision.

How to generate 256-bit random number within a range on embedded system

I need to generate a cryptographically secure random number which is 256 bits long, in a specific range. I use a microcontroller equipped with a random number generator (the producer boasts that it's a true random number generator, based on thermal noise).
The upper limit of the number to be generated is given as a byte array. My question is: will it be secure to get the random number byte by byte, performing:
n[i] = rand[i] mod limit[i]
where n[i] is i'th byte of my number etc.
The standard method, using all the bits from the RNG is:
number <- random()
while (number outside range)
number <- random()
endwhile
return number
There are some tweaks possible if the required range is less than half the size of the RNG output, but I assume that is not the case here: it would reduce the output size by one or more bits. Given that, then the while loop will normally only be entered once or twice if at all.
Comparing byte arrays is reasonably simple, and usually speedy providing you compare the most significant bytes first. If the most significant bytes differ, then there is no need to compare less significant bytes at all. We can tell that 7,###,###,### is larger than 5,###,###,### without knowing what digits the # stand for.
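A rough Go sketch of this loop, treating the limit as an exclusive upper bound given as a big-endian byte array. random256 is a stand-in for the microcontroller's TRNG, not a real API; crypto/rand is used here only as a placeholder.

package main

import (
	"bytes"
	"crypto/rand"
	"fmt"
)

// random256 fills a 32-byte buffer with random bytes; on the target hardware
// this would read the TRNG instead of crypto/rand.
func random256(buf []byte) {
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
}

// randomBelow returns a uniform 256-bit number in [0, limit), retrying until
// the draw falls inside the range. bytes.Compare works most significant byte
// first, so most rejections are decided after a byte or two.
func randomBelow(limit []byte) []byte {
	n := make([]byte, len(limit))
	for {
		random256(n)
		if bytes.Compare(n, limit) < 0 {
			return n
		}
	}
}

func main() {
	// Example limit: top byte 0x40, the rest 0xFF.
	limit := make([]byte, 32)
	limit[0] = 0x40
	for i := 1; i < len(limit); i++ {
		limit[i] = 0xFF
	}
	fmt.Printf("%x\n", randomBelow(limit))
}

As the answer notes, this only stays cheap when the limit is a reasonably large fraction of the full 256-bit space; otherwise the loop rejects too often.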

How many hash functions are required in a minhash algorithm

I am keen to try and implement minhashing to find near-duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write-up, but it leaves open the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.
The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.
Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?
To quote from the blog:
"It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale — to cut the error bar in half we would need four times as many samples."
Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?
One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over. Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location.
The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that) it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.
This problem does not occur with signed integers because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beat out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
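To make the XOR approach concrete, here is a hedged Go sketch. hashShingle stands in for whatever good 32-bit hash you already have (FNV-1a is used purely as an example), and the function names are mine.

package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// hashShingle produces one good 32-bit hash of a shingle.
func hashShingle(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// minHashSignature computes a minhash signature using numHashes hash
// functions derived from one good hash plus fixed random XOR masks.
func minHashSignature(shingles []string, numHashes int, seed int64) []uint32 {
	rnd := rand.New(rand.NewSource(seed))
	masks := make([]uint32, numHashes)
	for i := 1; i < numHashes; i++ { // mask 0 stays zero: the original hash
		masks[i] = rnd.Uint32()
	}
	sig := make([]uint32, numHashes)
	for i := range sig {
		sig[i] = ^uint32(0) // start at max so any hash is smaller
	}
	for _, sh := range shingles {
		h := hashShingle(sh)
		for i, m := range masks {
			if v := h ^ m; v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}

func main() {
	shingles := []string{"the cat sat", "cat sat on", "sat on the", "on the mat"}
	sig := minHashSignature(shingles, 200, 1)
	fmt.Println(len(sig), sig[0], sig[1])
}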
Pretty much.. but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.
That means that a reported measurement of 78% could easily come from only 50% similarity..
Or that 50% similarity could easily be reported as 22%. Doesn't sound accurate enough for business expectations, to me.
Mathematically, if you're reporting two digits the second should be meaningful.
Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is: calculate a decent-quality hashcode for each shingle/string once -- then apply 200 cheap & fast transformations, to emphasise certain factors / bring certain bits to the front.
I recommend combining bitwise rotations (or shuffling) with an XOR operation. Each hash function can combine rotation by some number of bits with XORing by a randomly generated integer.
This both "spreads" the selectivity of the min() function around the bits, and varies which value min() ends up selecting.
The rationale for rotation is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all top bits are the same do lower bits have any effect in the comparison, so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.
The rationale for XOR is that, on its own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero, and that would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together -- thus an excessive tendency for them to end up selecting the same shingle, not independent shingles.
There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers -- where it would be obvious for unsigned. XOR must still be used in these circumstances, anyway.
Java has 32-bit hashcodes builtin. And if you use Google Guava libraries, there are 64-bit hashcodes available.
Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.
What you want can be easily obtained from universal hashing. Popular textbooks like Cormen et al. have very readable information in section 11.3.3, pp. 265-268. In short, you can generate a family of hash functions using the following simple equation:
h(x,a,b) = ((ax+b) mod p) mod m
x is the key you want to hash
a is any odd number you can choose between 1 and p-1, inclusive.
b is any number you can choose between 0 and p-1, inclusive.
p is a prime number that is greater than the max possible value of x
m is the max possible value you want for the hash code, plus 1
By selecting different values of a and b you can generate many hash codes that are independent of each other.
An optimized version of this formula can be implemented as follows in C/C++/C#/Java:
(unsigned) (a*x+b) >> (w-M)
Here,
- w is the size of the machine word in bits (typically 32)
- M is the size of the hash code you want, in bits
- a is any odd integer that fits into a machine word
- b is any integer less than 2^(w-M)
The above works for hashing a number. To hash a string, get a hash code using built-in functions like GetHashCode and then use that value in the above formula.
For example, let's say you need 200 16-bit hash codes for a string s; then the following code can be written as an implementation:
public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    var machineWordSize = 8 * sizeof(int);      // word size in bits (32)
    var hashCodeSize = machineWordSize / 2;     // 16-bit hash codes
    var hashCodeSizeDiff = machineWordSize - hashCodeSize;
    var hstart = s.GetHashCode();               // one "good" hash of the string
    var bmax = 1 << hashCodeSizeDiff;
    var rnd = new Random(seed);
    for (var i = 0; i < count; i++)
    {
        hashCodes[i] = ((hstart * (i * 2 + 1)) + rnd.Next(0, bmax)) >> hashCodeSizeDiff;
    }
    return hashCodes;
}
Notes:
I'm using a hash code size of half the machine word size, which in most cases would be 16-bit. This is not ideal and has a far greater chance of collision. This can be remedied by upgrading all arithmetic to 64-bit.
Normally you want to select a and b both randomly within the ranges stated above.
Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)
Check out this article for the state-of-the-art practical and theoretical bounds. It has a nice graph explaining why you probably want to use just one 2-independent hash function and save the k smallest values.
When estimating set sizes the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k) where f is the jaccard similarity and k is the number of values kept. So if you want error ε, you need k=1/(fε^2) or if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
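A hedged Go sketch of the "one hash function, keep the k smallest values" idea, with the usual bottom-k Jaccard estimate; the hash choice, names and the tiny example sets are my own.

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hashItem(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// bottomK returns the k smallest distinct hash values of the items, sorted ascending.
func bottomK(items []string, k int) []uint64 {
	seen := make(map[uint64]bool)
	hs := make([]uint64, 0, len(items))
	for _, it := range items {
		h := hashItem(it)
		if !seen[h] {
			seen[h] = true
			hs = append(hs, h)
		}
	}
	sort.Slice(hs, func(i, j int) bool { return hs[i] < hs[j] })
	if len(hs) > k {
		hs = hs[:k]
	}
	return hs
}

// estimateJaccard takes the k smallest values of the union of two sketches
// and counts how many of them appear in both sketches.
func estimateJaccard(a, b []uint64, k int) float64 {
	inA := make(map[uint64]bool, len(a))
	for _, v := range a {
		inA[v] = true
	}
	inB := make(map[uint64]bool, len(b))
	for _, v := range b {
		inB[v] = true
	}
	union := make([]uint64, 0, len(a)+len(b))
	seen := make(map[uint64]bool, len(a)+len(b))
	for _, v := range append(append([]uint64{}, a...), b...) {
		if !seen[v] {
			seen[v] = true
			union = append(union, v)
		}
	}
	sort.Slice(union, func(i, j int) bool { return union[i] < union[j] })
	if len(union) > k {
		union = union[:k]
	}
	both := 0
	for _, v := range union {
		if inA[v] && inB[v] {
			both++
		}
	}
	return float64(both) / float64(len(union))
}

func main() {
	k := 100
	a := bottomK([]string{"the", "cat", "sat", "on", "the", "mat"}, k)
	b := bottomK([]string{"the", "cat", "sat", "on", "a", "hat"}, k)
	// With k larger than these tiny sets, this equals the exact Jaccard (4/7).
	fmt.Printf("estimated Jaccard: %.2f\n", estimateJaccard(a, b, k))
}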
It seems like another way to get N number of good hashed values would be to salt the same hash with N different salt values.
In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.
Likely not as cheap as XORing against N values, but it seems like there's a possibility of better-quality results at a minimal extra cost, especially if the data being hashed is much larger than the salt value.

Map two numbers to one to achieve a particular sort?

I have a list in which each item has 2 integer attributes, n and m. I would like to map these two integer attributes to a single new attribute so that when the list is sorted on the new attribute, it is sorted on n first and then ties are broken with m.
I came up with n - 1/m. So the two integers are mapped to a single real number. I think this works. Any better ideas?
That's clever, so I hate to break it to you, but it won't work. Try it (with a computer) using n=1,000,000,000 and values of m between 999,999,990 and 1,000,000,010. You'll find that n-1/m is the same value for all of those cases.
It would work if floating point numbers had infinite precision, or even if they had twice as much precision as an int (although even there you might run into some issues), but they don't: a double precision floating point number has 53 bits of precision. An integer is (probably) 32 bits, so you'd need at least 64 bits to encode two of them. But then, you could just use a 64-bit (long long) integer, encoding the pair as n*2^32 + m.
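A tiny Go illustration of the suggested 64-bit encoding, assuming n and m are non-negative 32-bit values (names are mine): pack n into the high 32 bits and m into the low 32 bits, so comparing keys sorts by n first and breaks ties with m.

package main

import (
	"fmt"
	"sort"
)

// key encodes the pair as n*2^32 + m, as suggested above.
func key(n, m uint32) uint64 {
	return uint64(n)<<32 | uint64(m)
}

func main() {
	items := []struct{ n, m uint32 }{{2, 1}, {1, 9}, {1, 3}, {2, 0}}
	sort.Slice(items, func(i, j int) bool {
		return key(items[i].n, items[i].m) < key(items[j].n, items[j].m)
	})
	fmt.Println(items) // [{1 3} {1 9} {2 0} {2 1}]
}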
