Converting a unique seed string into a random, yet deterministic, float value in Ruby

I'm having a hard time with this, conceptually.
Basically, I need to accept some arbitrary unique string, and be able to convert that to a normalized float value. What the output float value is doesn't really matter, so long as the same string input always results in the same normalized float output.
So this is a hashing algorithm right? I'm familiar with SHA1 or MD5, and this seems similar to password hashing where the result is the same for the correct password. But those methods output strings of characters, I believe. And what I'm not getting is how I would turn the result of a SHA1 or MD5 into a consistent float value.
# Goal
def string_to_float(seed_string)
  # ...
end

string_to_float('abc-123') #=> 0.15789
string_to_float('abc-123') #=> 0.15789
string_to_float('def-456') #=> 0.57654
string_to_float('def-456') #=> 0.57654
So what kind of approach in Ruby can I take that would turn an arbitrary string into a random but consistent float value?

The key part that you want is a way of converting a SHA1 or MD5 hash output into a float that is both deterministic and 1-1. Here's a simple solution based on MD5; the intermediate integer value could be used directly as well.
require 'digest/md5'

class String
  def float_hash
    Digest::MD5.hexdigest(self).to_i(16).to_f
  end
end

puts "example_string".float_hash # returns 1.3084281619666243e+38
This generates a hexadecimal hash, then converts it to an integer, then converts that to a float. Each step is deterministic.
Note: as pointed out by @emboss, this reduces collision resistance, because a double is 8 bytes while the MD5 hash is 16 bytes. It shouldn't be a big deal, though, by the sound of your application.

If security is no issue, what you are describing is in my opinion not a hash function. A hash function is a one-way function, meaning computing the hash is easy but reverting it is "hard" or, ideally, impossible.
Your requirements instead describe an injective function: given any x1, x2 in your domain X, the following holds:
For all x1, x2 in X: x1 != x2 => f(x1) != f(x2)
f(x) = x is such a function, f(x) = x² is not. In plain English: you want different results if your inputs are different, and the same result only if the inputs are the same. Secure hashes have this property too, but they additionally provide one-way characteristics, such as not being able (easily) to find x if you are only given f(x), among others. As far as I understood, you don't need these security properties.
Trivially, such an injective mapping from String to Float would be given by simply interpreting the "String bytes" as "Float bytes", i.e. reinterpreting the same bytes as a different type (think C):
unsigned char *bytes = "...";
double d;
memcpy(&d, bytes, sizeof(d)); /* reinterpret the first 8 bytes as a double */
But there is a downside to this - the real trouble is that a Float has a fixed maximum width, so you will run into an overflow situation if your strings are too long (Floats are internally represented as double values, which are 8 bytes). So there is not enough space for almost any use case. Even MD5-ing your strings first doesn't solve the problem - MD5 output is already 16 bytes long.
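For illustration, a rough Ruby analogue of that C idea (my own sketch, not part of the original answer) reinterprets the leading bytes of the string via String#unpack; the helper name bytes_as_float is made up:
def bytes_as_float(str)
  # Reinterpret the first 8 bytes as a big-endian IEEE 754 double.
  # Anything beyond 8 bytes is simply dropped, which is exactly the
  # overflow problem described above.
  str.ljust(8, "\0")[0, 8].unpack('G')[0]
end

bytes_as_float('abc-123') # => some fixed double derived from the bytes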
So this could be a real problem, depending on your exact requirements. Although MD5 (or any other hash) will mess sufficiently with the input to make it as random as possible, you still cut the range of possible values from 16 bytes down to effectively 8 bytes. (Note: truncating random 16-byte output to 8 bytes is generally considered "secure" in terms of preserving the randomness; Elliptic Curve Cryptography does something similar. As far as I know nobody has really proven this, but nobody has proven the contrary either.) So a collision is much more likely within your restricted Float range. By the birthday paradox, finding a collision takes about sqrt(number of values in the range) tries. For MD5 this is 2^64, but for your scheme it's only 2^32. That's still very, very unlikely to produce a collision - probably on the order of winning the lottery while being struck by lightning at the same time. If you can live with this minimal possibility, go for it:
require 'digest/md5'

def string_to_float(str)
  Digest::MD5.digest(str).unpack('D')[0]
end
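To put a rough number on that "very, very unlikely" (a back-of-the-envelope helper of my own, not from the answer), the birthday approximation for n inputs in a b-bit space is about n^2 / 2^(b+1):
# Birthday-paradox estimate of the collision probability for n random
# values drawn from a 2**bits space.
def approx_collision_probability(n, bits = 64)
  n.to_f**2 / 2**(bits + 1)
end

approx_collision_probability(1_000_000)      # => ~2.7e-08 for the 8-byte (64-bit) float range
approx_collision_probability(1_000_000, 128) # => vanishingly small for the full MD5 range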
If uniqueness is of absolute priority, I would recommend moving from floats to integers. Ruby has built-in support for large integers that are not restricted by the internal constraints of a long value (which is what a Fixnum boils down to). So any arbitrary hash output can be represented as a large integer number.
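A sketch of that integer alternative (my own illustration), keeping the full 128-bit MD5 value as an arbitrary-precision Integer:
require 'digest/md5'

def string_to_int(str)
  # Ruby Integers have arbitrary precision, so the whole 128-bit digest
  # survives without being squeezed into a double.
  Digest::MD5.hexdigest(str).to_i(16)
end

string_to_int('abc-123') # => the same 128-bit integer on every call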

Yes, you are describing a hashing algorithm. You could use an MD5 or SHA1 digest (since they just produce random bits) to generate a floating point number simply by calling String#unpack on the digest with an argument of "G" (double-precision float, network byte order):
require 'digest/sha1'

def string_to_float(str)
  Digest::SHA1.digest(str).unpack("G")[0]
end
string_to_float("abc-123") # => -2.86011943713676e-154
string_to_float("def-456") # => -1.13232994606094e+214
string_to_float("abc-123") # => -2.86011943713676e-154 OK!
string_to_float("def-456") # => -1.13232994606094e+214 OK!
Note that if you want the resulting floats to be in a particular range then you'll need to do some massaging.
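For instance, if you wanted the result in [0, 1), one possible massaging step (an assumption on my part, not part of the answer above) is to read the first 8 digest bytes as an unsigned integer and scale it:
require 'digest/sha1'

def string_to_unit_float(str)
  # First 8 digest bytes as an unsigned 64-bit big-endian integer,
  # scaled into [0, 1). Still deterministic for a given input.
  Digest::SHA1.digest(str).unpack('Q>')[0] / 2.0**64
end

string_to_unit_float('abc-123') # => always the same value between 0 and 1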
Also note that the unpacked number doesn't use all of the bits from the digest, so you might want to fold the digest down to the number of bytes used by a double-precision float (although you'll have to be careful not to decrease the entropy of the hash function, if you care about that sort of thing), e.g.:
def str2float(s)
  # Work on the raw byte values; String#[] returns one-character strings
  # in Ruby 1.9+, so grab the bytes first.
  d = Digest::SHA1.digest(s).bytes
  x, y = d[0..9], d[10..19]
  # XOR the 1st (x) and 2nd (y) halves to use all bits.
  (0..9).map { |i| x[i] ^ y[i] }.pack("C*").unpack("G")[0]
end

Related

Ruby: Help improving hashing algorithm

I am still relatively new to Ruby as a language, but I know there are a lot of convenience methods built into the language. I am trying to generate a "hash" to check against in a low-level block-chain verifier, and I am wondering if there are any "convenience methods" that I could use to make this hashing algorithm more efficient. I think I can make this more efficient by utilizing Ruby's max integer size, but I'm not sure.
Below is the current code which takes in a string to hash, unpacks it into an array of UTF-8 values, does computationally intensive math to each one of those values, adds up all of those values after the math is done to them, takes that value modulo 65,536, and then returns the hex representation of that value.
def generate_hash(string)
  unpacked_string = string.unpack('U*')
  sum = 0
  unpacked_string.each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  new_val = sum % 65_536 # Gives a number from 0 to 65,535
  new_val.to_s(16)
end
On very large block-chains there is a very large performance hit which I am trying to get around. Any help would be great!
First and foremost, it is extremely unlikely that you are going to create anything that is more efficient than simply using String#hash. This is a case of you trying to build a better mousetrap.
Honestly, your hashing algorithm is very inefficient. The entire point of a hash is to be a fast, low-overhead way of quickly getting a "unique" (as unique as possible) integer to represent any object to avoid comparing by values.
Using that as a premise, if you start doing any type of intense computation in a hash algorithm, it is already counter-productive. Once you start implementing modulo and pow functions, it is inefficient.
Usually, best practice involves taking one or more values of the object that can be represented as integers, and performing bit operations on them, typically with prime numbers to help reduce hash collisions.
def hash
  h = value1 ^ 393
  h += value2 ^ 17
  h
end
In your example, you are for some reason forcing the hash to the max value of a 16-bit unsigned integer, when typically 32 bits are used - although if you are comparing on the Ruby side, this would be 31 bits due to how Ruby masks Fixnum values. Fixnum was deprecated on the Ruby side, as it should have been, but internally the same threshold still exists between how a Bignum and a Fixnum are handled. The Integer class simply provides one interface on the Ruby side, as those two really should never have been exposed outside of the C code.
In your specific example using strings, I would simply symbolize them. This gives a quick and efficient way to determine whether two strings are equal, with hardly any overhead, and comparing two symbols is exactly the same as comparing two integers. There is a caveat to this method if you are comparing a vast number of strings: once a symbol is created, it is alive for the life of the program. Any additional strings equal to it will return the same symbol, but you cannot reclaim the memory of the symbol (just a few bytes) for as long as the program runs. Not good if you use this method to compare thousands and thousands of unique strings.
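A small illustration of the symbol approach (my own sketch; the string is just an example):
# Interned symbols compare by identity, so repeated equality checks on the
# same strings become as cheap as comparing two integers.
a = "block-0001-payload".to_sym
b = "block-0001-payload".to_sym
a == b                     # => true
a.object_id == b.object_id # => true, same interned object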

why does Ruby's Rational class treat string arguments differently from numeric arguments?

I'm using ruby's Rational library to convert the width & height of images to aspect ratios.
I've noticed that string arguments are treated differently than numeric arguments.
>> Rational('1.91','1')
=> (191/100)
>> Rational(1.91,1)
=> (8601875288277647/4503599627370496)
>> RUBY_VERSION
=> "2.1.5"
>> RUBY_ENGINE
=> "ruby"
FYI 1.91:1 is an aspect ratio recommended by Facebook for images on their platform.
Values like 191 and 100 are much more convenient to store in my database than 8601875288277647 and 4503599627370496. But I'd like to understand where this difference originates before deciding which approach to use.
The Rational test suite doesn't seem to cover this exact case.
Disclaimer: This is only an educated guess, based on some knowledge on how to implement such a feat.
As Kent Dahl already said, Floats are not precise, they have a fixed precision, which means 1.91 is really 1.910000000000000000001 or something like that, which ruby "knows" should be displayed as 1.91.
"1.91" on the other hand is a string, basically an array of characters: '1', '.', '9', '1'.
This said, here is what you need to do to build the rational out of a float or a string:
1. Get rid of the . (mathematically, by multiplying the numerator and denominator by 10^x, i.e. multiplying by ten as many times as there are digits behind the .)
2. Find the greatest common divisor (gcd)
3. Divide the numerator and denominator by the gcd
Step 1, however, is a little different for Float and String:
For the Float, we have to multiply by 10^x, where x is (because of the precision) not 2 (as one would expect for 1.91), but something more like 16 (remember: 1.9100...1).
For the String, we COULD cast it into a float and do the same trick, but hey, there is an easier way: we just count the number of digits behind the dot (which is 2), remove the dot, and multiply the denominator by 10^2. This is not only the easier, but also the more precise way.
The big numbers might disappear again when applying step 3; that's why you will not always get those strange results when dealing with rationals built from floats.
TLDR: The numbers are built differently depending on whether the argument is a String or a Float. Floats can produce very long numbers because of their limited precision.
The Float 1.91 is stored as a double, which has a given amount of precision limited by its binary representation. The equivalent Rational object retains this precision as well as possible, so it is huge. There is no way of storing 1.91 exactly in a double, but the value you get is close enough for most uses.
As for the String, it represents a different value - the exact value of 1.91 - and as you create a Rational it retains it exactly. It is more correct than the Float, but takes longer to use for calculations.
This is similar to the problem with 1.0/3, as it "goes on forever" (0.333333... etc.) in decimal, but a Rational can represent it exactly.
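A small illustration of both behaviours, plus Float#rationalize as one way to recover the compact form from a Float (my own addition, not from the answers; it should give (191/100) on MRI, since that is the simplest fraction that round-trips to the same Float):
Rational('1.91', '1') # => (191/100), the exact decimal the string denotes
Rational(1.91, 1)     # => (8601875288277647/4503599627370496), the exact binary Float value
1.91.rationalize      # => (191/100), the simplest Rational that converts back to the same Float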

What is the range of possible sha1hash results?

What are the lowest and highest possible return values from SHA-1? (bearing in mind that SHA-1 results are actually five 32-bit values rather than one true 160-bit value)
To create a secure hash, the output of the hash must be indistinguishable from random. Many pseudo-random number generators and key derivation methods actually use a hash as the final calculation.
So the "highest" result consists of all ones and the "lowest" of all zeros - if you interpret the result as an unsigned integer, of course. The chance of getting exactly those values is almost zero, as SHA-1 results should be evenly distributed. But the chance of a number starting with 8 ones is still 1/2^8 == 1/256, which is certainly not insignificant.
Note that the result of SHA-1 should be interpreted as a bit string. Most runtimes don't have a very useful bit string representation and use an octet string (aka byte array) instead. I would consider it very annoying if a SHA-1 implementation returned shorts instead of bytes. You don't want to annoy the user with differences between little-endian and big-endian representations, and most other primitives expect their input represented as bytes.
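To make the range concrete in Ruby terms (my own sketch), you can read the whole digest as one unsigned 160-bit integer:
require 'digest/sha1'

# The digest interpreted as a single unsigned 160-bit integer lies in
# 0 .. 2**160 - 1, even though the library hands back 20 bytes.
value   = Digest::SHA1.hexdigest('some input').to_i(16)
lowest  = 0
highest = 2**160 - 1
value.between?(lowest, highest) # => true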

Looking for an one-way function with small input and long output

I'm looking for an algorithm which is a one-way function, like a hash function, but which accepts a small input (several bits, less than 512 bits) and maps it to a long output (1 KB or more). Do you know an algorithm or a function like this?
From the Shannon theorem, you don't gain any security by having a ciphertext of a size bigger than your plaintext, unless the key (or the procedure to create the ciphertext) is different for each input. Even in this case, you must assign only one key (or mechanism) to each input x, otherwise you violate the definition of a function. So if you apply an encryption mechanism f: X (set of inputs) -> Y (set of outputs), then |Y| <= |X|.
All this to say that if your input is less than 512 bits, you gain nothing by producing a 1 KB output. Now, I recommend using one of the functions listed on the one-way function wiki page.
Keccak has variable-length output (although this was not evaluated for SHA-3); its "security claim is disentangled from the output length. There is a minimum output length..." The Skein hash function also has variable output, of up to 16 exabytes.
Whatever your reasons are, you can calculate hashes of the same small data using different algorithms, then concatenate those hashes. If the output is not large enough, calculate hashes of hashes and append them.
As pointed out in other answers, this doesn't make much sense from a security perspective.
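A sketch of that concatenation idea in Ruby (my own naive counter-based expansion, not a security recommendation, as noted above):
require 'digest'

def stretch(input, bytes_needed)
  out = ''.b
  counter = 0
  # Hash the input together with a counter and append the digests until
  # the requested length is reached, then truncate.
  while out.bytesize < bytes_needed
    out << Digest::SHA256.digest(input.b + [counter].pack('N'))
    counter += 1
  end
  out[0, bytes_needed]
end

stretch('small input', 1024).bytesize # => 1024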

Proving a perfect hash function over a fixed length input

I have seen answers on here stating to use gperf; however, I would prefer to roll my own based on a proof I create for the domain of strings with a fixed length of <= 200. Based on the calculations I have from Wolfram, I get ~7.9 x 10^374 total permutations. Therefore my line of thinking is that if I have a 2048-bit hash function (3.2 x 10^616), I should be able to handle the entire universe of strings that I need to process. My question is: how can I prove that the hash implementation I end up producing will be perfect, given the constraint of the universe of all strings of length 200 or less?
Strings with a length of 200 characters only have 200 * 8 = 1600 bits. If a 2048-bit hash is OK for your purpose, you could just use the string bits themselves as a perfect hash. The identity hash function is perfect, as it maps each input to a distinct hash value (obviously, since no information is discarded).
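A sketch of that identity idea in Ruby (my own illustration; the 0x01 prefix is an assumption I added so that leading NUL bytes are not lost when the bytes are read as a single integer):
def identity_hash(str)
  raise ArgumentError, 'input longer than 200 bytes' if str.bytesize > 200
  # Prefix a 0x01 byte, then read the raw bytes as one big integer.
  # At most 201 bytes = 1608 bits, comfortably below 2048 bits, and
  # distinct inputs always yield distinct integers.
  ("\x01" + str).unpack('H*')[0].to_i(16)
end

identity_hash('abc') == identity_hash('abc') # => true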
