I need to compute a hash code of a string and store it into a 'long' variable.
MD5 and SHA-1 produce hash codes that are longer than 64 bits (MD5: 128 bits, SHA-1: 160 bits).
Any ideas?
Cheers,
Doron
You can truncate the hash and use just the first 64 bits. The hash will be somewhat less strong, but the first 64 bits are still extremely likely to be unique.
For most uses of a hash this is both a common and perfectly acceptable practice.
You can also store the complete hash in two 64-bit integers.
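As a rough illustration of both options, here is a minimal Java sketch. The class and method names (`TruncatedMd5`, `hash64`, `hash128`) are made up for this example, and it assumes the string should be hashed as UTF-8 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of any library.
final class TruncatedMd5 {
    private static byte[] md5(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return md.digest(s.getBytes(StandardCharsets.UTF_8)); // assumption: UTF-8 input
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Keeps only the first 8 bytes of the digest, read as a big-endian long.
    static long hash64(String s) {
        return ByteBuffer.wrap(md5(s)).getLong();
    }

    // Keeps the complete 128-bit digest as two longs (high half, low half).
    static long[] hash128(String s) {
        ByteBuffer buf = ByteBuffer.wrap(md5(s));
        return new long[] { buf.getLong(), buf.getLong() };
    }
}
```

For the empty string, whose MD5 digest is d41d8cd98f00b204e9800998ecf8427e, `hash64` returns the long with hex value d41d8cd98f00b204.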
The FNV Hash is pretty easy to implement. We extended it to 64 bits and it works very well. Using it is much faster than computing MD5 or SHA1 and then truncating the result. However, we don't depend on it for cryptographic functions--just for hash tables and such.
More information on FNV, with source code and detailed explanations: http://isthe.com/chongo/tech/comp/fnv/
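For reference, a minimal Java sketch of the 64-bit FNV-1a variant follows. It is not the code referred to above; the class name `Fnv1a64` is invented for this example, and the offset basis and prime are the published 64-bit FNV constants.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of 64-bit FNV-1a; not suitable for cryptographic use.
final class Fnv1a64 {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    static long hash(byte[] data) {
        long h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xFF); // FNV-1a: XOR the byte in first...
            h *= PRIME;      // ...then multiply (wraps modulo 2^64)
        }
        return h;
    }

    static long hash(String s) {
        return hash(s.getBytes(StandardCharsets.UTF_8)); // assumption: UTF-8 input
    }
}
```

Calling `Fnv1a64.hash("a")` yields the long with hex value af63dc4c8601ec8c, which matches the published FNV-1a test vector.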
I'm using this (Java):
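import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SimpleLongHash {
    // MessageDigest is not thread-safe; use one SimpleLongHash per thread.
    final MessageDigest md;

    public SimpleLongHash() throws NoSuchAlgorithmException {
        md = MessageDigest.getInstance("MD5");
    }

    public long hash(final String str) {
        // Fix the charset so the result does not depend on the platform default.
        return hash(str.getBytes(StandardCharsets.UTF_8));
    }

    public long hash(final byte[] buf) {
        md.reset();
        final byte[] digest = md.digest(buf);
        // Fold the 128-bit digest to 64 bits by XORing its two halves.
        return getLong(digest, 0) ^ getLong(digest, 8);
    }

    // Reads 8 bytes starting at offset as a big-endian long.
    private static long getLong(final byte[] array, final int offset) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (array[offset + i] & 0xFF);
        }
        return value;
    }
}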
What would be the probability for a collision as a result of a XOR between the first 64 bits and the last 64 bits?
XOR the bits together? E.g. for MD5, bits 0-63 XOR bits 64-127, voila, 64 bits. This will give you a weaker hash, check if that's acceptable for you.
(also, unless your environment is extremely constrained - e.g. embedded devices - there's a question of "why do you need to shorten it?")
You can also play with various hash algorithms with FooBabel Hasher
What I need
I need an algorithm that produces a bijective output. I have a 31-bit input and need a pseudo-random 31-bit output.
What I have considered
CRCs are bijective within their bit-width.
I have looked on Google and can find the polynomials for this, but not the tables or algorithm.
Could anyone point me in the right direction?
I need a CRC-31 algorithm using polynomial say 0x737e312b, or any bijective function that will do what I need.
NOTE
I found the following code, but I unfortunately don't have the tools to compile and run it.
For any hash function hash, you can do:
function bijectiveHash31(int val) {
val &= 0x7FFFFFFF; //make sure it's 31 bits
for (int i=0; i<5; ++i) {
// the high bits affect the low bits
val ^= hash(val>>15) & 32767;
// rotate bits
val = ((val&32767)<<16) | ((val>>15)&65535);
}
return val;
}
This is a Feistel structure, which forms the basis of many ciphers: https://en.wikipedia.org/wiki/Feistel_cipher
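To make the invertibility concrete, here is a hedged Java sketch of the same structure together with its inverse. The round function `round` is an arbitrary stand-in chosen for this example (any deterministic function works, because it is only ever XORed in and never inverted); the class name `Feistel31` is likewise made up.

```java
// Illustrative 31-bit Feistel-style permutation with an explicit inverse.
final class Feistel31 {
    private static final int ROUNDS = 5;

    // Hypothetical round function; replace with any deterministic hash.
    private static int round(int x) {
        x *= 0x9E3779B1;
        x ^= x >>> 16;
        return x;
    }

    static int encode(int val) {
        val &= 0x7FFFFFFF; // make sure it's 31 bits
        for (int i = 0; i < ROUNDS; i++) {
            // the high 16 bits affect the low 15 bits
            val ^= round(val >>> 15) & 0x7FFF;
            // rotate left by 16 within 31 bits
            val = ((val & 0x7FFF) << 16) | (val >>> 15);
        }
        return val;
    }

    static int decode(int val) {
        val &= 0x7FFFFFFF;
        for (int i = 0; i < ROUNDS; i++) {
            // undo the rotation: the low 16 bits go back to the top
            val = ((val & 0xFFFF) << 15) | (val >>> 16);
            // the high 16 bits were untouched by the XOR, so repeating it cancels it
            val ^= round(val >>> 15) & 0x7FFF;
        }
        return val;
    }
}
```

Because `decode(encode(x)) == x` for every 31-bit `x`, `encode` cannot map two inputs to the same output, which is exactly the bijectivity requirement.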
If you need it to be fast and you don't need it to be super good, then this works fine:
function bijectiveHash31(int val) {
val = ((val*RANDOM_ODD_NUMBER) + RANDOM_NUMBER) & 0x7FFFFFFF;
val ^= (val>>15);
val ^= (val>>8);
return val;
}
In both of these cases, it's not too difficult to figure out how you could undo each elementary operation, which shows that the whole hash is bijective. If you need help establishing that for the multiplication, see https://en.wikipedia.org/wiki/Modular_multiplicative_inverse
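As a sketch of that argument for the fast variant, the Java code below implements the forward function and undoes each step in reverse order. The constants `MUL` and `ADD` are arbitrary placeholders picked for this example (the multiplier only has to be odd), and the inverse of the multiplier is computed with a Newton iteration modulo 2^32.

```java
// Illustrative sketch: fast 31-bit bijection and its inverse.
final class FastBijection31 {
    private static final int MASK = 0x7FFFFFFF;
    private static final int MUL = 0x2545F491; // hypothetical constant, must be odd
    private static final int ADD = 0x01234567; // hypothetical constant
    private static final int MUL_INV = inverseOdd(MUL);

    // Multiplicative inverse of an odd number modulo 2^32.
    // a*a == 1 (mod 8), so a is its own inverse to 3 bits; each step doubles that.
    static int inverseOdd(int a) {
        int inv = a;
        for (int i = 0; i < 5; i++) {
            inv *= 2 - a * inv;
        }
        return inv;
    }

    static int forward(int val) {
        val = (val * MUL + ADD) & MASK;
        val ^= val >>> 15;
        val ^= val >>> 8;
        return val;
    }

    static int inverse(int val) {
        val &= MASK;
        // undo val ^= val >>> 8 (shifts of 32 or more contribute nothing)
        val ^= (val >>> 8) ^ (val >>> 16) ^ (val >>> 24);
        // undo val ^= val >>> 15
        val ^= (val >>> 15) ^ (val >>> 30);
        // undo the affine step modulo 2^31
        return ((val - ADD) * MUL_INV) & MASK;
    }
}
```

Each xorshift is undone by XORing in the repeated shifts, and the affine step is undone by subtracting the constant and multiplying by the modular inverse.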
I found this problem:
Consider sequences of 36 bits. Each such sequence has 32 5-bit
sequences consisting of adjacent bits. For example, the sequence
1101011… contains the 5-bit sequences 11010, 10101, 01011, …. Write a
program that prints all 36-bit sequences with the two properties:
1. The first 5 bits of the sequence are 00000.
2. No two 5-bit subsequences are the same.
So I generalized the problem: find all n-bit sequences whose k-bit subsequences are all unique and which satisfy the requirements above. However, the only approach I can think of is a brute-force search: generate every n-bit sequence whose first k bits are zero, then check for each one whether all of its k-bit subsequences are unique. This is clearly not a very efficient approach. Is there a better way to solve the problem?
Thanks.
The simplest approach seems to be a backtracking approach. You can keep track of which 5-bit sequences you've seen with a flat array. At each bit, try adding 0 -- counter = (counter & 0x0f) << 1 and check if you've seen that before, then do a counter = counter | 1 and try that path.
There are probably more efficient algorithms that can prune the search space faster. This seems related to https://en.wikipedia.org/wiki/De_Bruijn_sequence. I am not certain, but I believe that it is actually equivalent; that is, the last five digits of the sequence will have to be 10000, making it cyclic.
EDIT: Here's some C code. It is less efficient than it could be in terms of space because of the recursion, but it is simple. The worst bit is the mask management. It appears I was correct about De Bruijn sequences; this finds all 2048 of them.
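#include <stdio.h>
#include <stdlib.h>

/* Formats val as a 32-character binary string (static buffer, not reentrant). */
char *binprint(unsigned int val) {
    static char res[33];
    int i;
    res[32] = 0;
    for (i = 0; i < 32; i++) {
        res[31 - i] = (val & 1) + '0';
        val = val >> 1;
    }
    return res;
}

/* mask: bit i is set once the 5-bit window with value i has been seen.
   counter: the bits generated so far; its low 5 bits are the current window.
   Unsigned types avoid undefined behaviour when shifting into the top bit
   (this assumes a 32-bit unsigned int). */
void checkPoint(unsigned int mask, unsigned int counter) {
    // Get the appropriate bit in the mask
    unsigned int idxmask = 1u << (counter & 0x1f);
    // Abort if we've seen this suffix before
    if (mask & idxmask) {
        return;
    }
    // Update the mask
    mask = mask | idxmask;
    // We're done if we've hit all 32
    if (mask == 0xffffffffu) {
        printf("%10u 0000%s\n", counter, binprint(counter));
        return;
    }
    checkPoint(mask, counter << 1);
    checkPoint(mask, (counter << 1) | 1);
}

int main(void) {
    checkPoint(0, 0);
    return 0;
}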
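For the generalized n-bit / k-bit version of the question, the same backtracking can be sketched in Java with the flat "seen" array described earlier in this answer. This is only an illustration under the assumption that k is small enough for an array of 2^k booleans; the class name `UniqueWindows` and the method `find` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative backtracking search, not an optimized De Bruijn generator.
final class UniqueWindows {
    // Returns every n-bit string that starts with k zeros and whose
    // k-bit windows are pairwise distinct.
    static List<String> find(int n, int k) {
        List<String> out = new ArrayList<>();
        boolean[] seen = new boolean[1 << k];
        seen[0] = true; // the initial window of k zeros
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < k; i++) {
            bits.append('0');
        }
        extend(n, k, 0, seen, bits, out);
        return out;
    }

    private static void extend(int n, int k, int window, boolean[] seen,
                               StringBuilder bits, List<String> out) {
        if (bits.length() == n) {
            out.add(bits.toString());
            return;
        }
        for (int b = 0; b <= 1; b++) {
            // slide the window: drop the oldest bit, append the new one
            int next = ((window << 1) | b) & ((1 << k) - 1);
            if (seen[next]) {
                continue; // this window already occurred, prune
            }
            seen[next] = true;
            bits.append((char) ('0' + b));
            extend(n, k, next, seen, bits, out);
            bits.setLength(bits.length() - 1); // backtrack
            seen[next] = false;
        }
    }
}
```

With n = 36 and k = 5 this should produce the same 2048 sequences as the C program, each ending in 10000.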
I remember seeing this exact problem in a programming interview questions book of mine. Here is their solution:
hope it helps. cheers.
The following is the implementation of BitSet in the solution to question 10-4 in the book Cracking the Coding Interview. Why does it allocate an array of size/32 elements rather than (size/32 + 1)? Am I missing something, or is this a bug?
If I pass 33 to the BitSet constructor, only one int is allocated, and if I try to set or get bit 32 I get an ArrayIndexOutOfBoundsException!
package Question10_4;
class BitSet {
int[] bitset;
public BitSet(int size) {
bitset = new int[size >> 5]; // divide by 32
}
boolean get(int pos) {
int wordNumber = (pos >> 5); // divide by 32
int bitNumber = (pos & 0x1F); // mod 32
return (bitset[wordNumber] & (1 << bitNumber)) != 0;
}
void set(int pos) {
int wordNumber = (pos >> 5); // divide by 32
int bitNumber = (pos & 0x1F); // mod 32
bitset[wordNumber] |= 1 << bitNumber;
}
}
From what I can gather from reading the solution you mention (on page 205), and the little I understand about computer programming, it seems to me that this is a special implementation of a bitset, meant to take the argument 32,000 in its construction (see the checkDuplicates function). The question is about examining an array with numbers from 1 to N, where N is at most 32,000, with only 4KB of memory.
This way, an array of 1000 elements is created, each one used for 32 bits in the bit set. You can see in the bitset class that to get a bit's position, we (floor) divide by 32 to get the array index, and then mod 32 to get the specific bit position.
Yes, the answer in the book is incorrect in general: it only allocates enough space when size is a multiple of 32.
Correct answer:
bitset = new int[(size + 31) >> 5]; // divide by 32, rounding up
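A minimal sketch of the class with the rounded-up allocation is shown below. It is renamed `SimpleBitSet` here (a name chosen for this example) to avoid clashing with `java.util.BitSet`.

```java
// Sketch of the book's bit set with the allocation rounded up.
final class SimpleBitSet {
    private final int[] words;

    SimpleBitSet(int size) {
        // (size + 31) >> 5 is ceil(size / 32): enough words for bits 0..size-1
        words = new int[(size + 31) >> 5];
    }

    boolean get(int pos) {
        int wordNumber = pos >> 5;  // divide by 32
        int bitNumber = pos & 0x1F; // mod 32
        return (words[wordNumber] & (1 << bitNumber)) != 0;
    }

    void set(int pos) {
        int wordNumber = pos >> 5;  // divide by 32
        int bitNumber = pos & 0x1F; // mod 32
        words[wordNumber] |= 1 << bitNumber;
    }
}
```

With this change, `new SimpleBitSet(33)` allocates two ints, so bit 32 can be set and read without an exception.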
This was asked in my Google interview recently. I offered an answer that used bit shifts and was O(n), but the interviewer said it was not the fastest way to do it. I don't understand: is there a way to count the set bits without having to iterate over all the bits provided?
Brute force: 10000 * 16 * 4 = 640,000 ops (shift, compare, increment and iteration for each bit of each 16-bit word).
Faster way:
We can build a table 00-FF -> number of bits set. 256 * 8 * 4 = 8,192 ops.
I.e. we build a table where for each byte value we calculate the number of bits set.
Then we split each 16-bit int into its upper and lower bytes:
for (n in array) {
byte lo = n & 0xFF; // lower 8 bits
byte hi = n >> 8; // upper 8 bits
// simply add the number of bits in the upper and lower parts
// of each 16-bit number
// using the pre-calculated table
k += table[lo] + table[hi];
}
60,000 ops in total in the iteration, i.e. 68,192 ops in total. It is still O(n), but with a smaller constant (about 9 times smaller).
In other words, we calculate number of bits for every 8-bits number, and then split each 16-bits number into two 8-bits in order to count bits set using the pre-built table.
There's (almost) always a faster way. Read up about lookup tables.
I don't know what the correct answer was when this question was asked, but I believe the most sensible way to solve this today is to use the POPCNT instruction. Specifically, you should use the 64-bit version. Since we just want the total number of set bits, boundaries between 16-bit elements are of no interest to us. Since the 32-bit and 64-bit POPCNT instructions are equally fast, you should use the 64-bit version to count four elements' worth of bits per cycle.
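In Java, the closest portable equivalent is `Long.bitCount`. The sketch below packs four 16-bit elements into one long per call; whether the JIT actually compiles `Long.bitCount` down to a single POPCNT instruction depends on the JVM and CPU, so treat that as an assumption. The class name `PopCount16` is invented for this example.

```java
// Illustrative sketch: count set bits in an array of 16-bit values.
final class PopCount16 {
    static long countBits(short[] words) {
        long total = 0;
        int i = 0;
        // Element boundaries do not matter for the total, so pack four
        // 16-bit elements into one 64-bit value and count them together.
        for (; i + 4 <= words.length; i += 4) {
            long packed = (words[i] & 0xFFFFL)
                    | ((words[i + 1] & 0xFFFFL) << 16)
                    | ((words[i + 2] & 0xFFFFL) << 32)
                    | ((words[i + 3] & 0xFFFFL) << 48);
            total += Long.bitCount(packed);
        }
        // Handle the remaining 0..3 elements one at a time.
        for (; i < words.length; i++) {
            total += Integer.bitCount(words[i] & 0xFFFF);
        }
        return total;
    }
}
```

The `& 0xFFFFL` masks matter because Java's `short` is signed; without them, negative values would be sign-extended and inflate the count.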
I just implemented it in Java:
import java.util.Random;

public class Main {
    static int array_size = 1024;
    static int[] array = new int[array_size];
    // One entry per byte value 0..255
    static int[] table = new int[256];
    static int total_bits_in_the_array = 0;

    // Precompute the number of set bits for every byte value.
    private static void create_table() {
        for (int i = 0; i < 256; i++) {
            int bits_set = 0;
            for (int z = 0; z < 8; z++) {
                bits_set += (i >> z) & 0x1;
            }
            table[i] = bits_set;
        }
    }

    public static void main(String[] args) {
        create_table();
        fill_array();
        parse_array();
        System.out.println("The number of set bits in the array is: " + total_bits_in_the_array);
    }

    private static void parse_array() {
        for (int i = 0; i < array.length; i++) {
            int current = array[i];
            int down = current & 0xff;      // lower byte
            int up = (current >> 8) & 0xff; // upper byte, shifted down so it can index the table
            total_bits_in_the_array += table[up] + table[down];
        }
    }

    // Fill the array with random 16-bit values.
    private static void fill_array() {
        Random ran = new Random();
        for (int i = 0; i < array.length; i++) {
            array[i] = ran.nextInt(1 << 16);
        }
    }
}
Also at https://github.com/leitao/bits-in-a-16-bits-integer-array/blob/master/Main.java
You can pre-calculate the bit counts in bytes and then use that for lookup. It is faster, if you make certain assumptions.
The number of operations (computation only, not reading the input) works out as follows.
Shift approach:
For each 16-bit element:
2 ops (shift, add) times 16 bits = 32 ops and 0 mem accesses; times 10 000 elements = 320 000 ops + 0 mem accesses
Pre-calculation approach:
255 times 2 ops (shift, add) times 8 bits = 4 080 ops + 255 mem accesses (writing the results)
For each 16-bit element:
2 ops (compute addresses) + 2 mem accesses + 1 op (add the results); times 10 000 elements = 30 000 ops + 20 000 mem accesses
Total of 34 080 ops + 20 255 mem accesses
So a lot more memory accesses with a lot fewer operations.
Thus, assuming everything else is equal, pre-calculation for 10 000 elements is faster as long as one memory access costs less than (320 000 - 34 080)/20 255 ≈ 14.12 operations.
Which is probably true if you are alone on a dedicated core on a reasonably modern box as the 255 bytes should fit into a cache. If you start getting cache misses, the assumption might no longer hold.
Also, this math assumes pointer arithmetic and direct memory access as well as atomic operations and atomic memory access. Depending on your language of choice (and, apparently, based on previous answers, your choice of compiler switches), that assumption might not hold.
Finally, things get more interesting if you consider scalability: shifting can be easily parallelised onto up to 10000 cores but pre-computation not necessarily. As byte number goes up, however, lookup gets more and more advantageous.
So, in short. Yes, pre-calculation is faster under pretty reasonable assumptions but no, it is not guaranteed to be faster.
Using the MD5 hash function on a string produces a very long value, and it produces the same value for the same string every time. My question is: is there a way to do something similar, that is, give it a string and have it return the same integer every time, with the integers returned for different strings all falling inside a specific interval? What I mean is something like this.
Ex: Give it "Mary had a little lamb." and it returns the value 10. Give it the same string and it returns 10 again.
Please do ask in case I wasn't entirely clear.
You are describing a "hash function". Look it up on Wikipedia.
MD5 is one kind of hash function. Most MD5 implementations return a string, but that string is just a representation of a (LARGE) integer. You can take an MD5 hash, and then use as many of the low-order bits as you need to get an integer of the desired size. If the desired range is not a power of 2, you will need to do a modulo operation to get it into the desired range.
Also, virtually every modern programming language has a built-in function for hashing strings, which returns an integer. In Java, it's String.hashCode(). In Ruby, it's String#hash.
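As a rough Java illustration of both suggestions, the sketch below maps a string to an integer in the interval [0, buckets). The class `RangeHash` and its method names are invented for this example, and the MD5 variant assumes the string is hashed as UTF-8 bytes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: deterministic string-to-interval hashing.
final class RangeHash {
    // Uses the built-in String.hashCode(); floorMod keeps the result
    // non-negative even when hashCode() is negative.
    static int viaHashCode(String s, int buckets) {
        return Math.floorMod(s.hashCode(), buckets);
    }

    // Treats the MD5 digest as one large unsigned integer and reduces it
    // modulo the size of the desired range.
    static int viaMd5(String s, int buckets) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest)
                    .mod(BigInteger.valueOf(buckets))
                    .intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Both methods return the same value for the same string on every call; for instance, `RangeHash.viaHashCode("abc", 1000)` is 354, because `"abc".hashCode()` is 96354.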
In this case, the language is JavaScript, which (I was shocked to learn) doesn't seem to have something like this built in. This is String.hashCode() from the Java platform (perhaps you can port it to JavaScript):
public int hashCode() {
int h = hash;
if (h == 0) {
int off = offset;
char val[] = value;
int len = count;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return h;
}
You can use a few bytes of an MD5 hash. Keep in mind that JavaScript numbers are IEEE 754 doubles, which represent integers exactly only up to 53 bits, so at most 6 whole bytes (48 bits) can be used safely; an MD5 hash, on the other hand, is 128 bits (16 bytes) long. So you will necessarily have more hash collisions than you would normally get with MD5. But still:
function toHashCode(str)
{
// Convert string to an array of bytes
var array = Array.prototype.slice.call(str);
// Create MD5 hash
var hashEngine = Components.classes["#mozilla.org/security/hash;1"]
.createInstance(Components.interfaces.nsICryptoHash);
hashEngine.init(hashEngine.MD5);
hashEngine.update(array, array.length);
var hash = hashEngine.finish(false);
// Turn the first 6 bytes of the hash into a number
var result = 0;
for (var i = 0; i < 6; i++)
result = result * 256 + hash.charCodeAt(i);
return result;
}
alert(toHashCode("test")); // Displays 265892827251497
alert(toHashCode("Mary had a little lamb.")); // Displays 117938552300214