Random Hash for Blockchain

We are developing a true random bit generator based on chaotic systems. Can we use it to produce the hash values in a blockchain?
Is it okay to use random hash values when creating a node/block, or do they definitely have to be generated using hashing algorithms?

Related

How to find the value for a given key if a hash function is chosen randomly from Universal family of hash functions?

I am taking a data structures course on Coursera and recently read about universal families of hash functions. If I choose a hash function at random from a universal family, how exactly do I reapply it later to look up a value? If I have to remember the function chosen for each key, I would have to maintain a list of them, and finding the correct hash function for a key would itself take linear time, violating the constant-time lookup of hash tables. How should I proceed with the implementation?
When making one hash map, you use one function from the family. When you rehash the entire map (typically because of lack of capacity or too many collisions) or create a separate map, you can then choose a different hashing function from the family. You wouldn't use two different functions to attempt to create the same hash map.
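A minimal sketch of that idea in Python (integer keys, chaining; the prime P and the class itself are made up for illustration). The table draws one (a, b) pair from the classic Carter-Wegman family h(x) = ((a*x + b) mod p) mod m when it is built, and only draws a fresh pair when it rehashes, so a lookup never has to search for the right function:

import random

P = 2_147_483_647   # a prime larger than any key we expect (2^31 - 1); arbitrary choice

class UniversalHashTable:
    """Chaining hash table that remembers a single (a, b) pair per table."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]
        self._pick_function()

    def _pick_function(self):
        # One function for the whole table, drawn at random from the family.
        self.a = random.randrange(1, P)
        self.b = random.randrange(0, P)

    def _index(self, key):
        return ((self.a * key + self.b) % P) % self.capacity

    def put(self, key, value):
        if self.size >= self.capacity:      # too full: rehash with a fresh function
            self._rehash()
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def _rehash(self):
        items = [kv for bucket in self.buckets for kv in bucket]
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        self.size = 0
        self._pick_function()               # a new function for the new table
        for k, v in items:
            self.put(k, v)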

Hashing and encryption technique for a huge data set containing phone numbers

Description of problem:
I'm working with a highly sensitive data set that contains people's phone numbers as one of its columns. I need to apply an encryption or hash function to convert them into encoded values and do my analysis on those. It can be a one-way hash - i.e., after processing the encoded data we won't need to convert it back to the original phone numbers. Essentially, I am looking for an anonymizer that takes phone numbers and converts them to some random-looking value I can process. Suggestions on the best way to go about this, and recommendations on algorithms to use, are welcome.
Update: size of the dataset
My dataset is really huge in the size of hundreds of GB.
Update: Sensitive
By sensitive, I meant that the phone number itself should not be a part of our analysis. So basically I need a one-way hash function without redundancy: each phone number should map to a unique value, and two phone numbers should never map to the same value.
Update: Implementation?
Thanks for your answers. I am looking for a more elaborate implementation. I was going through Python's hashlib library for hashing; does it necessarily do the same set of steps that you suggested? Here is the link
Can you give me some example code to achieve this, preferably in Python?
Generate a key for your data set (16 or 32 bytes) and keep it secret. Apply HMAC-SHA1 to your data with this key and Base64-encode the result, and you have a random-looking, unique string per phone number that isn't reversible (without the key).
Example (HMAC-SHA1 with a 256-bit key) using Keyczar:
Create random secret key:
$> python keyczart.py create --location=path_to_key_set --purpose=sign
$> python keyczart.py addkey --location=path_to_key_set --status=primary
Anonymize phone number:
from keyczar import keyczar
def anonymize(phone_num):
    signer = keyczar.Signer.Read("path_to_key_set")
    return signer.Sign(phone_num)
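If you would rather stay inside Python's standard library (as the hashlib question above suggests), the same keyed construction can be sketched with hmac, hashlib, and base64. The key handling below is only illustrative: in practice you generate the key once, store it securely (or destroy it when you no longer need consistent tokens), and reuse it for the whole data set so equal phone numbers keep mapping to equal tokens.

import base64
import hashlib
import hmac
import os

# Illustrative only: a fresh key per run would give different tokens each time.
SECRET_KEY = os.urandom(32)   # 256-bit random key

def anonymize(phone_num: str) -> str:
    # Keyed HMAC of the phone number, then Base64 for a printable token.
    digest = hmac.new(SECRET_KEY, phone_num.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")

print(anonymize("555-0101"))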
If you're going to use cryptography, you want to apply a pseudorandom function to each phone number and throw away the key. Collision-resistant hashes such as SHA-256 do not provide the right security guarantees. Really, though, are there so many distinct phone numbers that you can't just incrementally construct a map representing an actually random function?
sort your data by the respective column and start counting distinct values ... replace the actual values with their respective counter value ... collision free ... one way ...
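One in-memory variant of that counter idea, as a rough sketch (the function name and sample values are made up; for hundreds of GB you would stream the data and keep only the lookup dictionary, or do the equivalent in a database):

def anonymize_column(values):
    # Assign each distinct phone number the next counter value the first
    # time it is seen: collision free, and one way as long as the mapping
    # itself is discarded or kept secret.
    counters = {}
    out = []
    for v in values:
        if v not in counters:
            counters[v] = len(counters)
        out.append(counters[v])
    return out

print(anonymize_column(["555-0101", "555-0102", "555-0101"]))   # [0, 1, 0]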
"So, basically I would need a one-way hashing function but without redundancy - Each phone number should map to unique value --Two phones numbers should not map to a same value."
This screams for a solution based on a cryptographic hash function. MD5 and SHA-1 are the best known examples, and work wonderfully for this. You will read that "MD5 has been cracked", but for your purpose that doesn't matter.

Perfect Hash Building

Why don't we use SHA-1, MD5, and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions and are also not reversible. So rather than coming up with a new set of hash functions, which might have collisions, why don't we use them?
The only reason I can think of is that they require a large key, say 32 bits. But they still avoid collisions, so the lookup will definitely be O(1).
Because they are very slow, for several reasons:
They aim to be cryptographically secure, not only collision-resistant in general
They produce a much larger hash value than what you actually need in a hash table
Because they handle unstructured data (octet / byte streams) but the objects you need to hash are often structured and would require linearization first
Why don't we use SHA-1, MD5, and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions...
Wrong because:
Two inputs can still happen to have the same hash value. Say the hash value is 32 bits: a great general-purpose hash routine (i.e. one that doesn't use insights into the set of actual keys) still has at least a 1/2^32 chance of returning the same hash value for any 2 keys, then a 2/2^32 chance of colliding with one of those as a third key is hashed, 3/2^32 for the fourth, etc.
Having distinct hash values is a very different thing from having the hash values map to distinct hash buckets in a hash table. Hash values are generally modded into the table size to select a bucket, so at best - and again for general-purpose hashing - the chance of a collision when adding an element to a hash table is #preexisting-elements / table-size.
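A quick way to see that second point in code, using hashlib.md5 purely as a stand-in for a general-purpose hash (the key names and the 16-bucket table size are arbitrary): with only 16 buckets, two distinct keys land in the same bucket almost immediately, no matter how collision-resistant the underlying hash is.

import hashlib

def bucket(key: str, table_size: int) -> int:
    # Take 32 bits of the digest and mod it into the table size.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h % table_size

table_size = 16
seen = {}
for i in range(100):
    b = bucket(f"key-{i}", table_size)
    if b in seen:
        print(f"key-{i} collides with {seen[b]} in bucket {b}")
        break
    seen[b] = f"key-{i}"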
So rather than coming up with a new set of hash functions, which might have collisions, why don't we use them?
Because speed is often the programmer's goal when choosing to use a hash table over say a binary tree. If the hash values are mathematically complicated to calculate, they may take a lot longer than using a slightly more (but still not particularly) collision prone but faster-to-calculate hash function. That said, there are times when more effort on the hashing can pay off - for example, when the hash table exists on magnetic disk and the I/O costs of seeking & reading records dwarfs hash calculation effort.
antti makes an interesting point about data too... general purpose hashing routines often work on blocks of binary data with a specific starting address and a number of bytes (they may even require that number of bytes to be a multiple of 2 or 4). In many applications, data that needs to be hashed will be intermingled with data that must not be included in the hash - such as cached values, file handles, pointers/references to other data or virtual dispatch tables etc.. A common solution is to hash the desired fields separately and combine the hash keys - perhaps using exclusive-or. As there can be bit fields that should be hashed in the same byte of memory as other data that should not be hashed, you sometimes need custom code to extract those values. Still, even if some copying and padding was required beforehand, each individual field could eventually be hashed using md5, SHA-1 or whatever and those hash values could be similarly combined, so this complication doesn't really categorically rule out the approach you're interested in.
The only reason I can think of is that they require a large key, say 32 bits.
All other things being equal, the larger the key the better, though if the hash function is mathematically ideal then any N of its bits - where 2^N >= # hash buckets - will produce minimal collisions.
But they still avoid collisions, so the lookup will definitely be O(1).
Again, wrong as mentioned above.
(BTW... I stress general-purpose in a couple places above. That's just because there are trivial cases where you might have some insight into the keys you'll need to hash that allows you to position them perfectly within the available hash buckets. For example, if you knew the keys were the numbers 1000, 2000, 3000 etc. up to 100000 and that you had at least 100 hash buckets, you could trivially define your hash function as x/1000 and know you'd have perfect hashing sans collisions. This situation of knowing that all your keys map to distinct hash table buckets is known as "perfect hashing" - as per your question title - a good general-purpose hash like md5 is not a perfect hash, and indeed it makes no sense to talk about perfect hashing without knowing the complete set of possible keys).
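The toy example from that parenthetical, spelled out (the function name is made up; the minus one just makes the buckets zero-based):

def perfect_hash(x: int) -> int:
    # Keys are 1000, 2000, ..., 100000 and we have at least 100 buckets,
    # so x // 1000 places every key in its own bucket with no collisions.
    return x // 1000 - 1

buckets = {perfect_hash(k) for k in range(1000, 100001, 1000)}
assert len(buckets) == 100   # all 100 keys land in distinct buckets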

What hashing algorithm does memcached use to hash keys?

Memcached uses consistent hashing to choose which server to put a key on, but which hashing algorithm does it use to map a string key to the hash value on which the Ketama algorithm is applied for server selection? And how good is that algorithm at spreading similar keys to different servers?
According to the source code in hash.c, memcached uses the following algorithm:
The hash function used here is by Bob Jenkins, 1996:
http://burtleburtle.net/bob/hash/doobs.html
"By Bob Jenkins, 1996. bob_jenkins#burtleburtle.net.
You may use this code any way you wish, private, educational,
or commercial. It's free."
From Bob Jenkins' website:
I offer you a new hash function for hash table lookup that is faster and more thorough than the one you are using now. I also give you a way to verify that it is more thorough.
Also, his requirements are:
The keys are unaligned variable-length byte arrays.
Sometimes keys are several such arrays.
Sometimes a set of independent hash functions were required.
Average key lengths ranged from 8 bytes to 200 bytes.
Keys might be character strings, numbers, bit-arrays, or weirder things.
Table sizes could be anything, including powers of 2.
The hash must be faster than the old one.
The hash must do a good job.
...
The real requirement then is that a good hash function should distribute hash values uniformly for the keys that users actually use.
To get back to your other question, he measured the ability of the algorithm to uniformly distribute hash values, so I would presume that the hash does a good job at spreading similar keys to different servers. If you have concerns, the code is isolated so you should be able to run your own tests.
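A rough way to run such a test yourself, using Python's built-in hash() merely as a stand-in for memcached's Jenkins hash (to test the real thing you would call it through a memcached client or the C code in hash.c); the key pattern and server count are arbitrary:

from collections import Counter

servers = 8
keys = [f"user:{i}:session" for i in range(10_000)]   # deliberately similar keys
counts = Counter(hash(k) % servers for k in keys)
print(counts)   # a well-spread hash gives roughly 1250 keys per server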

How to test a hash function?

Is there a way to test the quality of a hash function? I want to have a good spread when used in the hash table, and it would be great if this were verifiable in a unit test.
EDIT: For clarification, my problem was that I have used long values in Java in such a way that the first 32 bits encoded one ID and the second 32 bits encoded another ID. Unfortunately, Java's hash of long values just XORs the first 32 bits with the second 32 bits, which in my case led to very poor performance when used in a HashMap. So I need a different hash, and would like to have a unit test so that this problem cannot creep in any more.
You have to test your hash function using data drawn from the same (or similar) distribution that you expect it to work on. When looking at hash functions on 64-bit longs, the default Java hash function is excellent if the input values are drawn uniformly from all possible long values.
However, you've mentioned that your application uses the long to store essentially two independent 32-bit values. Try to generate a sample of values similar to the ones you expect to actually use, and then test with that.
For the test itself, take your sample input values, hash each one and put the results into a set. Count the size of the resulting set and compare it to the size of the input set, and this will tell you the number of collisions your hash function is generating.
For your particular application, instead of simply XORing them together, try combining the 32-bit values the way a typical good hash function would combine two independent ints, i.e. multiply one by a prime and add the other.
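A sketch of that collision-count test together with the multiply-by-a-prime combiner, in Python for brevity (the ID distribution and function names are only illustrative of the kind of sample you would build from your real data):

import random

def xor_hash(high_id: int, low_id: int) -> int:
    return high_id ^ low_id                        # what XORing the two halves does

def combined_hash(high_id: int, low_id: int) -> int:
    return (high_id * 31 + low_id) & 0xFFFFFFFF    # multiply by a prime, then add

def collision_count(pairs, hash_fn):
    # Hash every sample key, collect the results in a set, and compare sizes.
    return len(pairs) - len({hash_fn(a, b) for a, b in pairs})

# Pathological but realistic sample: the two IDs are frequently equal,
# so the XOR hash collapses many keys onto the single value 0.
pairs = [(i, i) for i in range(50_000)] + \
        [(random.getrandbits(32), random.getrandbits(32)) for _ in range(50_000)]

print("XOR collisions:     ", collision_count(pairs, xor_hash))       # tens of thousands
print("combined collisions:", collision_count(pairs, combined_hash))  # almost none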
First I think you have to define what you mean by a good spread to yourself. Do you mean a good spread for all possible input, or just a good spread for likely input?
For example, if you're hashing strings that represent proper full (first + last) names, you're not likely to care about how strings made up of numerical ASCII characters hash.
As for testing, your best bet is probably to get a huge or random input set of the data you expect, push it through the hash function, and see how the spread ends up. There's not likely to be a magic program that can say "Yes, this is a good hash function for your use case." However, if you can programmatically generate the input data, you should easily be able to create a unit test that generates a significant amount of it and then verifies that the spread is within your definition of good.
Edit: In your case with a 64 bit long, is there even really a reason to use a hash map? Why not just use a balanced tree directly, and use the long as the key directly rather than rehashing it? You pay a little penalty in overall node size (2x the size for the key value), but may end up saving it in performance.
If you're using a chaining hash table, what you really care about is the number of collisions. This would be trivial to implement as a simple counter on your hash table: every time an item is inserted and the table has to chain, increment a chain counter. A better hashing algorithm will result in a lower number of collisions. A good general-purpose table hashing function to check out is djb2.
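For reference, a Python transcription of djb2 (the original is Daniel Bernstein's C function; the 32-bit mask stands in for C's unsigned overflow):

def djb2(key: str) -> int:
    h = 5381
    for byte in key.encode("utf-8"):
        h = ((h << 5) + h + byte) & 0xFFFFFFFF   # h * 33 + byte
    return h

print(djb2("hello"))   # e.g. bucket = djb2(key) % table_size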
Based on your clarification:
I have used long values in Java in such a way that the first 32 bits encoded one ID and the second 32 bits encoded another ID. Unfortunately, Java's hash of long values just XORs the first 32 bits with the second 32 bits, which in my case led to very poor performance when used in a HashMap.
it appears you have some unhappy "resonances" between the way you assign the two ID values and the sizes of your HashMap instances.
Are you explicitly sizing your maps, or using the defaults? A quick-and-dirty check seems to indicate that a HashMap<Long,String> starts with a 16-bucket structure and doubles on overflow. That would mean that only the low-order bits of the ID values actually participate in hash bucket selection. You could try using one of the constructors that takes an initial-size parameter and create your maps with a prime initial size.
Alternately, Dave L's suggestion of defining your own hashing of long keys would allow you to avoid the low-bit-dependency problem.
Another way to look at this is that you're using a primitive type (long) as a way to avoid defining a real class. I'd suggest looking at the benefits you could achieve by defining the business classes and then implementing hash-coding, equality, and other methods as appropriate on your own classes to manage this issue.
