Non colliding hash algorithm for strings up to 255 characters

Non colliding hash algorithm for strings up to 255 characters - algorithm

I am looking for a hash-algorithm, to create as close to a unique hash of a string (max len = 255) as possible, that produces a long integer (DWORD).
I realize that 26^255 >> 2^32, but also know that the number of words in the English language is far less than 2^32.
The strings I need to 'hash' would be mostly single words or some simple construct using two or three words.
The answer:
One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs. (Answered by Arachnid)

See here for a previous iteration of this question (and the answer).

One technique is to use a well-known hash algorithm (say, MD5 or SHA-1) and use only the first 32 bits of the result.
Be aware that the risk of hash collisions increases faster than you might expect. For information on this, read about the Birthday Paradox.

Ronny Pfannschmidt did a test with common english words yesterday and hasn't encountered any collisions for the 10000 words he tested in the Python string hash function. I haven't tested it myself, but that algorithm is very simple and fast, and seems to be optimized for common words.
Here the implementation:
static long
string_hash(PyStringObject *a)
{
register Py_ssize_t len;
register unsigned char *p;
register long x;
if (a->ob_shash != -1)
return a->ob_shash;
len = Py_SIZE(a);
p = (unsigned char *) a->ob_sval;
x = *p << 7;
while (--len >= 0)
x = (1000003*x) ^ *p++;
x ^= Py_SIZE(a);
if (x == -1)
x = -2;
a->ob_shash = x;
return x;
}

H(key) = [GetHash(key) + 1 + (((GetHash(key) >> 5) + 1) % (hashsize – 1))] % hashsize
MSDN article on HashCodes

Java's String.hash() can be easily viewed here, its algorithm is
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Related

How to add extra information to a hash code?

Given an array of bytes, there are several well-known good algorithms for calculating a hash code, such as FNV or MD5. (Not talking about cryptography here, just general purpose hash codes.)
Suppose what you have is an array of bytes plus one extra piece of information, a small integer (which is not located next to the array in memory), and you want a hash code based on the whole lot. What's the best way to do this? For example, one could take the hash code of the array and add it to the small integer, or exclusive-or them. But is there a better way?

I think, more easiest and efficient way - just init "hash" accumulator with your small value, and thereafter compute hash by ordinary way.
Following example illustrates my approach, where we compute hash from int and C-style string:
uint32_t hash(const char *str, uint32_t x) {
char c;
while((c = *str++) != 0)
x = ((x << 5) | (x >> (32 - 5))) + c;
return x ^ (x >> 16);
}

Dynamic Programming +Bit-Masks

I recently learned the concept of Bit Manipulation for Competitive Programming so I'm quite new to the concept ,I also read many tutorials on Bit-Masking + Dynamic Programming on Hackerearth ,CodeChef and many more .
I also solved a couple of problems on Codechef including this one problem
and I have a couple of doubts regarding Bitmasks after I have been through some questions.
The problems I solved were mostly focused on manipulating the subsets but I wonder how do I work on permutations with bitmasks , i.e when I have to work on a state where all bits in the mask need to be set.
For ex: If we have to find number of numbers that can be formed by arranging all digits of a given number A which are divisible by a given number B where (A ,B<= 10**6) how can this be done with bitmasks.(I hope this can be done with bitmask+dp)
If A= 514 ,and B=2
The question expects the answer to be
514
154
Which are both divisible by 2 .
So the answer is 2.
With the knowledge I have: 514 and 154 represent the same mask 111 where all bits are set So how do I use bitmasks here where the mask is same for two or more answers!( I hope you understand this ).
And also as it is impossible to allocate memory worth n!*n for a little large value of n since we can have that many permutations of digits how can this problem be done using bitmasks where we need only (2**n)*n space (If i'm not wrong).
So how do I approach the above problem iteratively? /Or any DP state equation which I can possibly understand ,I couldn't understand recursive approach of some similar problems I Read.
I also tried to think on a similar problem TSHIRTS but I couldn't understand the logic behind the recursion.

You don't actually need DP for this one but you can use bit manipulation nicely :) Since A <= 10^6 it means that A has most 7 digits; so you only have to check 7! = 5040 states.
const int A = 514;
const int B = 2;
vector <int> v; //contains digits of A (e.g. 5, 1, 4) this can be done before the recursive function in a while loop.
int rec(int mask, int current_number){
if(mask == (1 << v.size()) - 1){ //no digit left to pick
if(current_number % B == 0) return 1;
else return 0;
}
int ret = 0;
for(int i = 0; i < v.size(); i++){
if(mask & (1 << i)) continue; //this is already picked
ret += rec(mask | (1 << i), current_number * 10 + v[i]);
}
return ret;
}
Note that the reason I didn't use DP here was that current number might differ even if mask is the same; so you can't actually say that the situation has been repeated. Unless you memo-ize mask AND current_number which requires much more space.

how to calculate 2^32 without multiplying numbers directly?

the simplest way to calculate 2^32 is 2*2*2*2*2......= 4294967296
, I want to know that is there any other way to get 4294967296? (2^16 * 2^16 is treated as the same method as 2*2*2.... )
and How many ways to calculate it?
Is there any function to calculate it?
I can't come up with any methods to calculate it without 2*2*2...

2 << 31
is a bit shift. It effectively raises 2 to the 32nd power.

Options:
1 << 32
2^32 = (2^32 - 1) + 1 = (((2^32 - 1) + 1) - 1) + 1 = ...
Arrange 32 items on a table. Count the ways you can choose subsets of them.

If you are not much of a fan of binary magic, then I would suggest quickpower.This function computes xn in O(logn) time.
int qpower(int x,int n)
{
if(n==0)return 1;
if(n==1)return x;
int mid=qpower(x,n/2);
if(n%2==0)return mid*mid;
return x*mid*mid;
}

If you are on a common computer you can left bitshift 2 by 31 (i.e. 2<<31) to obtain 2^32.
In standard C:
unsigned long long x = 2ULL << 31;
unsigned long long is needed since a simple unsigned long is not guaranteed to be large enough to store the value of 2<<31.
In section 5.2.4.2.1 paragraph 1 of the C99 standard:
... the
following shall be replaced by expressions that have the same type as would an
expression that is an object of the corresponding type converted according to the integer
promotions. Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
— maximum value for an object of type unsigned long int
ULONG_MAX 4294967295 // 2^32 - 1
— maximum value for an object of type unsigned long long int
ULLONG_MAX 18446744073709551615 // 2^64 - 1

Why not using Math.Pow() (in .NET). I think most language (or environment) would support the similar function for you:
Math.Pow(2,32);

In Groovy/Java you can do something like following with long number (signed integer can be max 2^31 in Java)
long twoPOW32 = 1L << 32;

How to compute the integer absolute value

How to compute the integer absolute value without using if condition.
I guess we need to use some bitwise operation.
Can anybody help?

Same as existing answers, but with more explanations:
Let's assume a twos-complement number (as it's the usual case and you don't say otherwise) and let's assume 32-bit:
First, we perform an arithmetic right-shift by 31 bits. This shifts in all 1s for a negative number or all 0s for a positive one (but note that the actual >>-operator's behaviour in C or C++ is implementation defined for negative numbers, but will usually also perform an arithmetic shift, but let's just assume pseudocode or actual hardware instructions, since it sounds like homework anyway):
mask = x >> 31;
So what we get is 111...111 (-1) for negative numbers and 000...000 (0) for positives
Now we XOR this with x, getting the behaviour of a NOT for mask=111...111 (negative) and a no-op for mask=000...000 (positive):
x = x XOR mask;
And finally subtract our mask, which means +1 for negatives and +0/no-op for positives:
x = x - mask;
So for positives we perform an XOR with 0 and a subtraction of 0 and thus get the same number. And for negatives, we got (NOT x) + 1, which is exactly -x when using twos-complement representation.

Set the mask as right shift of integer by 31 (assuming integers are stored as two's-complement 32-bit values and that the right-shift operator does sign extension).
mask = n>>31
XOR the mask with number
mask ^ n
Subtract mask from result of step 2 and return the result.
(mask^n) - mask

Assume int is of 32-bit.
int my_abs(int x)
{
int y = (x >> 31);
return (x ^ y) - y;
}

One can also perform the above operation as:
return n*(((n>0)<<1)-1);
where n is the number whose absolute need to be calculated.

In C, you can use unions to perform bit manipulations on doubles. The following will work in C and can be used for both integers, floats, and doubles.
/**
* Calculates the absolute value of a double.
* #param x An 8-byte floating-point double
* #return A positive double
* #note Uses bit manipulation and does not care about NaNs
*/
double abs(double x)
{
union{
uint64_t bits;
double dub;
} b;
b.dub = x;
//Sets the sign bit to 0
b.bits &= 0x7FFFFFFFFFFFFFFF;
return b.dub;
}
Note that this assumes that doubles are 8 bytes.

I wrote my own, before discovering this question.
My answer is probably slower, but still valid:
int abs_of_x = ((x*(x >> 31)) | ((~x + 1) * ((~x + 1) >> 31)));

If you are not allowed to use the minus sign you could do something like this:
int absVal(int x) {
return ((x >> 31) + x) ^ (x >> 31);
}

For assembly the most efficient would be to initialize a value to 0, substract the integer, and then take the max:
pxor mm1, mm1 ; set mm1 to all zeros
psubw mm1, mm0 ; make each mm1 word contain the negative of each mm0 word
pmaxswmm1, mm0 ; mm1 will contain only the positive (larger) values - the absolute value

In C#, you can implement abs() without using any local variables:
public static long abs(long d) => (d + (d >>= 63)) ^ d;
public static int abs(int d) => (d + (d >>= 31)) ^ d;
Note: regarding 0x80000000 (int.MinValue) and 0x8000000000000000 (long.MinValue):
As with all of the other bitwise/non-branching methods shown on this page, this gives the single non-mathematical result abs(int.MinValue) == int.MinValue (likewise for long.MinValue). These represent the only cases where result value is negative, that is, where the MSB of the two's-complement result is 1 -- and are also the only cases where the input value is returned unchanged. I don't believe this important point was mentioned elsewhere on this page.
The code shown above depends on the value of d used on the right side of the xor being the value of d updated during the computation of left side. To C# programmers this will seem obvious. They are used to seeing code like this because .NET formally incorporates a strong memory model which strictly guarantees the correct fetching sequence here. The reason I mention this is because in C or C++ one may need to be more cautious. The memory models of the latter are considerably more permissive, which may allow certain compiler optimizations to issue out-of-order fetches. Obviously, in such a regime, fetch-order sensitivity would represent a correctness hazard.

If you don't want to rely on implementation of sign extension while right bit shifting, you can modify the way you calculate the mask:
mask = ~((n >> 31) & 1) + 1
then proceed as was already demonstrated in the previous answers:
(n ^ mask) - mask

What is the programming language you're using? In C# you can use the Math.Abs method:
int value1 = -1000;
int value2 = 20;
int abs1 = Math.Abs(value1);
int abs2 = Math.Abs(value2);

Sort N numbers in digit order

Given a N number range E.g. [1 to 100], sort the numbers in digit order (i.e) For the numbers 1 to 100, the sorted output wound be
1 10 100 11 12 13 . . . 19 2 20 21..... 99
This is just like Radix Sort but just that the digits are sorted in reversed order to what would be done in a normal Radix Sort.
I tried to store all the digits in each number as a linked list for faster operation but it results in a large Space Complexity.
I need a working algorithm for the question.
From all the answers, "Converting to Strings" is an option, but is there no other way this can be done?
Also an algorithm for Sorting Strings as mentioned above can also be given.

Use any sorting algorithm you like, but compare the numbers as strings, not as numbers. This is basically lexiographic sorting of regular numbers. Here's an example gnome sort in C:
#include <stdlib.h>
#include <string.h>
void sort(int* array, int length) {
int* iter = array;
char buf1[12], buf2[12];
while(iter++ < array+length) {
if(iter == array || (strcmp(itoa(*iter, &buf1, 10), itoa(*(iter-1), &buf2, 10) >= 0) {
iter++;
} else {
*iter ^= *(iter+1);
*(iter+1) ^= *iter;
*iter ^= *(iter+1);
iter--;
}
}
}
Of course, this requires the non-standard itoa function to be present in stdlib.h. A more standard alternative would be to use sprintf, but that makes the code a little more cluttered. You'd possibly be better off converting the whole array to strings first, then sort, then convert it back.
Edit: For reference, the relevant bit here is strcmp(itoa(*iter, &buf1, 10), itoa(*(iter-1), &buf2, 10) >= 0, which replaces *iter >= *(iter-1).

I have a solution but not exactly an algorithm.. All you need to do is converts all the numbers to strings & sort them as strings..

Here is how you can do it with a recursive function (the code is in Java):
void doOperation(List<Integer> list, int prefix, int minimum, int maximum) {
for (int i = 0; i <= 9; i++) {
int newNumber = prefix * 10 + i;
if (newNumber >= minimum && newNumber <= maximum) {
list.add(newNumber);
}
if (newNumber > 0 && newNumber <= maximum) {
doOperation(list, newNumber, minimum, maximum);
}
}
}
You call it like this:
List<Integer> numberList = new ArrayList<Integer>();
int min=1, max =100;
doOperation(numberList, 0, min, max);
System.out.println(numberList.toString());
EDIT:
I translated my code in C++ here:
#include <stdio.h>
void doOperation(int list[], int &index, int prefix, int minimum, int maximum) {
for (int i = 0; i <= 9; i++) {
int newNumber = prefix * 10 + i;
if (newNumber >= minimum && newNumber <= maximum) {
list[index++] = newNumber;
}
if (newNumber > 0 && newNumber <= maximum) {
doOperation(list, index, newNumber, minimum, maximum);
}
}
}
int main(void) {
int min=1, max =100;
int* numberList = new int[max-min+1];
int index = 0;
doOperation(numberList, index, 0, min, max);
printf("[");
for(int i=0; i<max-min+1; i++) {
printf("%d ", numberList[i]);
}
printf("]");
return 0;
}
Basically, the idea is: for each digit (0-9), I add it to the array if it is between minimum and maximum. Then, I call the same function with this digit as prefix. It does the same: for each digit, it adds it to the prefix (prefix * 10 + i) and if it is between the limits, it adds it to the array. It stops when newNumber is greater than maximum.

i think if you convert numbers to string, you can use string comparison to sort them.
you can use anny sorting alghorighm for it.
"1" < "10" < "100" < "11" ...

Optimize the way you are storing the numbers: use a binary-coded decimal (BCD) type that gives simple access to a specific digit. Then you can use your current algorithm, which Steve Jessop correctly identified as most significant digit radix sort.
I tried to store all the digits in
each number as a linked list for
faster operation but it results in a
large Space Complexity.
Storing each digit in a linked list wastes space in two different ways:
A digit (0-9) only requires 4 bits of memory to store, but you are probably using anywhere from 8 to 64 bits. A char or short type takes 8 bits, and an int can take up to 64 bits. That's using 2X to 16X more memory than the optimal solution!
Linked lists add additional unneeded memory overhead. For each digit, you need an additional 32 to 64 bits to store the memory address of the next link. Again, this increases the memory required per digit by 8X to 16X.
A more memory-efficient solution stores BCD digits contiguously in memory:
BCD only uses 4 bits per digit.
Store the digits in a contiguous memory block, like an array. This eliminates the need to store memory addresses. You don't need linked lists' ability to easily insert/delete from the middle. If you need the ability to grow the numbers to an unknown length, there are other abstract data types that allow that with much less overhead. For example, a vector.
One option, if other operations like addition/multiplication are not important, is to allocate enough memory to store each BCD digit plus one BCD terminator. The BCD terminator can be any combination of 4 bits that is not used to represent a BCD digit (like binary 1111). Storing this way will make other operations like addition and multiplication trickier, though.
Note this is very similar to the idea of converting to strings and lexicographically sorting those strings. Integers are internally stored as binary (base 2) in the computer. Storing in BCD is more like base 10 (base 16, actually, but 6 combinations are ignored), and strings are like base 256. Strings will use about twice as much memory, but there are already efficient functions written to sort strings. BCD's will probably require developing a custom BCD type for your needs.

Edit: I missed that it's a contiguous range. That being the case, all the answers which talk about sorting an array are wrong (including your idea stated in the question that it's like a radix sort), and True Soft's answer is right.
just like Radix Sort but just that the digits are sorted in reversed order
Well spotted :-) If you actually do it that way, funnily enough, it's called an MSD radix sort.
http://en.wikipedia.org/wiki/Radix_sort#Most_significant_digit_radix_sorts
You can implement one very simply, or with a lot of high technology and fanfare. In most programming languages, your particular example faces a slight difficulty. Extracting decimal digits from the natural storage format of an integer, isn't an especially fast operation. You can ignore this and see how long it ends up taking (recommended), or you can add yet more fanfare by converting all the numbers to decimal strings before sorting.
Of course you don't have to implement it as a radix sort: you could use a comparison sort algorithm with an appropriate comparator. For example in C, the following is suitable for use with qsort (unless I've messed it up):
int lex_compare(void *a, void *b) {
char a_str[12]; // assuming 32bit int
char b_str[12];
sprintf(a_str, "%d", *(int*)a);
sprintf(b_str, "%d", *(int*)b);
return strcmp(a_str,b_str);
}
Not terribly efficient, since it does a lot of repeated work, but straightforward.

If you do not want to convert them to strings, but have enough space to store an extra copy of the list I would store the largest power of ten less than the element in the copy. This is probably easiest to do with a loop. Now call your original array x and the powers of ten y.
int findPower(int x) {
int y = 1;
while (y * 10 < x) {
y = y * 10;
}
return y;
}
You could also compute them directly
y = exp10(floor(log10(x)));
but I suspect that the iteration may be faster than the conversions to and from floating point.
In order to compare the ith and jth elements
bool compare(int i, int j) {
if (y[i] < y[j]) {
int ti = x[i] * (y[j] / y[i]);
if (ti == x[j]) {
return (y[i] < y[j]); // the compiler will optimize this
} else {
return (ti < x[j]);
}
} else if (y[i] > y[j]) {
int tj = x[j] * (y[i] / y[j]);
if (x[i] == tj) {
return (y[i] < y[j]); // the compiler will optimize this
} else {
return (x[i] < tj);
}
} else {
return (x[i] < x[j];
}
}
What is being done here is we are multiplying the smaller number by the appropriate power of ten to make the two numbers have an equal number of digits, then comparing them. if the two modified numbers are equal, then compare the digit lengths.
If you do not have the space to store the y arrays you can compute them on each comparison.
In general, you are likely better off using the preoptimized digit conversion routines.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio