Hash function required for custom data structure containing 12 integers - C++11

I have a custom structure that holds 12 integer values, x1,y1,x2,y2,x3,y3,x4,y4,x5,y5,x6,y6.
The values are between 1 and 5 inclusive, and every structure is guaranteed to hold a different combination, i.e. no two structures have the same values for all of x1,y1,x2,y2,x3,y3,x4,y4,x5,y5,x6,y6.
I need a good hash function so that lookups are O(1).
The requirement is to find the structure with specific x1,y1,...,x6,y6 values.
Right now I am using the following:
struct Hash_6
{
    size_t operator () ( const Node& n ) const
    {
        size_t result = 17;   // size_t rather than int avoids signed overflow while accumulating
        result = 31*result + n.x1;
        result = 31*result + n.x2;
        result = 31*result + n.x3;
        result = 31*result + n.x4;
        result = 31*result + n.x5;
        result = 31*result + n.x6;
        result = 31*result + n.y1;
        result = 31*result + n.y2;
        result = 31*result + n.y3;
        result = 31*result + n.y4;
        result = 31*result + n.y5;
        result = 31*result + n.y6;
        return result;
    }
};
I want to know if there is any better more efficient hash function out there which I could use for this specific case.

If the values are always between one and five inclusive, then you can get a unique hash within a 32-bit value.
That's because five (the values) to the power of twelve (the number of variables) is 244,140,625, a value that can be represented in 28 bits.
Hence your hash function becomes (pseudo-code):
def hasher(s):
    res = s.x1 - 1
    for val in s.x2, s.x3, s.x4, s.x5, s.x6, s.y1, s.y2, s.y3, s.y4, s.y5, s.y6:
        res = res * 5 + val - 1
    return res
With your constraints, you get a unique value out of that hash function.
If you wanted to use that hash for bucket selection (such as used in a set or dictionary), you would probably want to reduce it with a modulus to a more suitable value (introducing collisions as part of the process).
But it's unclear whether you need the hash for identification (leave it as is) or for bucketing (reduce it). If the latter, and the values are reasonably evenly distributed, that would be along the lines of:
bucket_to_use = hasher(item) modulo num_buckets
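For completeness, here is roughly what that could look like as a C++11 functor for the Node struct from the question (a sketch, untested; the functor name is just illustrative):

struct NodeBase5Hash
{
    size_t operator () ( const Node& n ) const
    {
        // Treat the twelve values (each 1..5) as digits of a base-5 number.
        // 5^12 = 244,140,625 fits in 28 bits, so the result is unique per Node.
        const int vals[12] = { n.x1, n.x2, n.x3, n.x4, n.x5, n.x6,
                               n.y1, n.y2, n.y3, n.y4, n.y5, n.y6 };
        size_t res = 0;
        for ( int v : vals )
            res = res * 5 + ( v - 1 );
        return res;
    }
};

An std::unordered_set<Node, NodeBase5Hash> (together with an operator== or equality functor for Node) could then use it directly; the container reduces the value modulo its bucket count internally.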

Related

How to create a unique hash that will match any of a string's permutations

Given a string abcd, how can I create a unique hashing method that will hash those 4 characters to match bcad or any other permutation of the letters abcd?
Currently I have this code
long hashString(string a) {
    long hashed = 0;
    for(int i = 0; i < a.length(); i++) {
        hashed += a[i] * 7; // Multiplied by a prime to make the hash more unique?
    }
    return hashed;
}
Now this will not work, because "ad" will hash to the same value as "bc".
I know you can make the hash more unique by multiplying the position of the letter by the letter itself, hashed += a[i] * i, but then the string will no longer hash to the same value as its permutations.
Is it possible to create a hash that achieves this?
Edit
Some have suggested sorting the strings before you hash them. That is a valid answer, but sorting takes O(n log n) time and I am looking for a hash function that runs in O(n) time.
I am also looking to do this in O(1) memory.
Create an array of 26 integers, corresponding to letters a-z. Initialize it to 0. Scan the string from beginning to end, and increment the array element corresponding to the current letter. Note that up to this point the algorithm has O(n) time complexity and O(1) space complexity (since the array size is a constant).
Finally, hash the contents of the array using your favorite hash function.
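A rough C++ sketch of that idea might look like this (the function name and the choice of FNV-1a as the "favourite hash function" are mine, and it assumes lowercase a-z input as in the question):

#include <array>
#include <cstdint>
#include <string>

// Anagrams produce identical histograms, so they hash to the same value.
std::uint64_t anagramHash(const std::string& a)
{
    std::array<unsigned, 26> histogram{};   // O(1) extra space
    for (char ch : a)
        ++histogram[ch - 'a'];              // assumes lowercase a-z input

    std::uint64_t h = 0xcbf29ce484222325ull;  // FNV-1a 64-bit offset basis
    for (unsigned count : histogram) {
        h ^= count;
        h *= 0x100000001b3ull;                // FNV-1a 64-bit prime
    }
    return h;
}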
The basic thing you can do is sort the strings before applying the hash function. So, to compute the hash of "adbc" or "dcba" you instead compute the hash of "abcd".
If you want to make sure that there are no collisions in your hash function, then the only way is to have the hash result be a string. There are many more strings than there are 32-bit (or 64-bit) integers, so collisions are inevitable (though collisions are unlikely with a good hash function).
Easiest way to understand: sort the letters in the string, and then hash the resulting string.
Some variations on your original idea also work, like:
long hashString(string a) {
    long hashed = 0;
    for(int i = 0; i < a.length(); i++) {
        long t = a[i] * 16777619;
        hashed += t ^ (t >> 8);
    }
    return hashed;
}
I suppose you need a hash such that two anagrams will hash to the same value. I'd suggest you sort the string first and then use any common hash function such as MD5. I wrote the following code in Scala:
import java.security.MessageDigest
def hash(s: String) = {
  MessageDigest.getInstance("MD5").digest(s.sorted.getBytes)
}
Note, in Scala:
scala> "hello".sorted
res0: String = ehllo
scala> "cinema".sorted
res1: String = aceimn
Synopsis: store a histogram of the letters in the hash value.
Step 1: compute a histogram of the letters (since a histogram uniquely identifies the letters in the string without regard to the order of the letters).
int histogram[26] = {0};   // zero-initialize the counts
for ( int i = 0; i < a.length(); i++ )
    histogram[a[i] - 'a']++;
Step 2: pack the histogram into the hash value. You have several options here. Which option to choose depends on what sort of limitations you can put on the strings.
If you knew that each letter would appear no more than 3 times, then it takes 2 bits to represent the count, so you could create a 52-bit hash that's guaranteed to be unique.
If you're willing to use a 128-bit hash, then you've got 5 bits for 24 letters, and 4 bits for 2 letters (e.g. q and z). The 128-bit hash allows each letter to appear 31 times (15 times for q and z).
But if you want a fixed sized hash, say 16-bit, then you need to pack the histogram into those 16 bits in a way that reduces collisions. The easiest way to do that is to create a 26 byte message (one byte for each entry in the histogram, allowing each letter to appear up to 255 times). Then take the 16-bit CRC of the message, using your favorite CRC generator.
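As a rough illustration of the 2-bits-per-letter packing (assuming, as stated above, that no letter appears more than 3 times; the clamping and the function name are just for the sketch):

#include <cstdint>
#include <string>

// 2 bits per letter * 26 letters = 52 bits, so the packed histogram fits in a
// 64-bit value and is collision-free as long as no letter appears more than
// 3 times (counts are clamped otherwise, which would reintroduce collisions).
std::uint64_t packedAnagramHash(const std::string& a)
{
    int histogram[26] = {0};
    for (char ch : a)
        ++histogram[ch - 'a'];

    std::uint64_t h = 0;
    for (int i = 0; i < 26; ++i) {
        std::uint64_t count = histogram[i] > 3 ? 3 : histogram[i];
        h |= count << (2 * i);    // 2 bits per letter
    }
    return h;
}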

Shuffled index without needing extra memory

I need a function with profile
int shuffledIndex(seed, index, range)
that, for every index in the range, returns another pseudo-random index from the range, such that for a single seed every value in the range is returned once and only once.
There are plenty of algorithms that can shuffle a given container, but I am not looking for those.
I need something that will not require extra memory, because the range is relatively big and there will be many simultaneous seed sessions at the same time.
The shuffling does not need to be extremely strong, and a restriction on the range would be acceptable - say, requiring its size to be a power of 2.
Are you aware of such an algorithm?
This doesn't shuffle particularly well, but multiplying by an odd number modulo a power of two gives a bijection, so if you input all the indexes you get a permutation of them. You could also add an offset to prevent 0 from mapping to itself for every seed.
For example, in C# or similar: (requires range to be a power of two)
int shuffledIndex(int seed, int index, int range)
{
    return (index * (seed | 1) + seed) & (range - 1);
}

Random logic engine implementation ideas

I am trying to find an effective randomization algorithm for this scenario; it doesn't matter which programming language:
Say I have a 20-element array filled with numbers
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
From this I need to construct, each time, a 15-element array, BUT
each time I specify numbers that must be in this new array, and the remaining slots are filled with random numbers from the master array.
For example:
In the new array the numbers that must be included are: 1,11,13,20,8,9
so the new array will be:
[1,N,N,11,N,20,8,N,9,N,N,N,13,N,N]
Where the Ns are random numbers from ALL 20 elements of the Master array.
Another example:
given 2,18,17,9,5
create a new 10-element array:
[2,2,18,2,11,17,20,5,5,9]
Duplicate elements are not a problem.
I'm trying to find some good algorithm for this.
If you want to receive one random number at a time and don't want to create the full result array up front, an alternative to my other answer is this:
Get a random number in the range 0..requested_number-1 (where requested_number is the number of elements still to fetch).
If this index is less than length(required), output that element from the required array, then remove it from the array;
... else the index falls outside the required part, so pick any random element out of the optional array.
Decrease requested_number and repeat until it reaches 0.
You need at most 2 calls to random per element: the first selects an index that decides from which array to pick a value, and the second, only needed for optional, picks a random element out of the entire optional range.
Here is a basic implementation in C (footnote: using mod on rand() does not yield A Good Random Number, but it'll do for this example).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <errno.h>

int main()
{
    int optional[] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 };
    int required[] = { 21,22,23,24,25 };
    int requested_number = 15;
    int take_from_required, optional_size, next;

    srand(time(NULL));

    if (requested_number < (int)(sizeof(required)/sizeof(required[0])))
    {
        printf ("requested number of elements must be at least as large as required array\n");
        return EDOM;
    }

    /* Use this much from 'required': */
    take_from_required = sizeof(required)/sizeof(required[0]);
    /* Use this much from 'optional': */
    optional_size = sizeof(optional)/sizeof(optional[0]);

    while (requested_number > 0)
    {
        /* Please note this is a fairly bad 'random'!
           As discussed many times before on SO. */
        next = rand() % requested_number;
        /* Take from which array? */
        if (next >= take_from_required)
        {
            printf ("%d\n", optional[rand() % optional_size]);
        } else
        {
            printf ("%d (required)\n", required[next]);
            required[next] = required[take_from_required-1];
            take_from_required--;
        }
        requested_number--;
    }
    return 0;
}
If I understand correctly, this is the issue:
optional [ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 ]
required [ 2,18,17,9,5 ]
Now construct a new array containing at least all elements of required, and filled to its capacity with elements taken from optional.
The problem seems to be that you need to take out random numbers from either required or optional and at the same time make sure required is empty at the end. [*]
Create a new array result (which needs to be at least as long as required -- then again, that can be inferred from the question). Copy all elements of required into it; fill the rest with random elements from optional.
At this point, you fulfill the primary condition, but the elements of required always appear first. So, as a last step, shuffle the elements now stored in the result array (for example, with the well-known Fisher-Yates shuffle).
[*] 'Empty', because all numbers in required must be used at least once. Taking them "out" of the array is the easiest way to make sure this happens. Things start to get complicated when (a) you may have duplicates of any number (from both optional and required) and (b) required is not a subset of optional.
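A minimal C++ sketch of that copy-then-shuffle approach (the names and types are mine, and it assumes the requested size is at least length(required) and that optional is non-empty):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Copy-then-shuffle: all required elements go in first, the remaining slots
// are filled from optional, then the whole result is shuffled so the
// required elements no longer cluster at the front.
std::vector<int> buildArray(const std::vector<int>& required,
                            const std::vector<int>& optional,
                            std::size_t resultSize,
                            std::mt19937& rng)
{
    std::vector<int> result(required);
    std::uniform_int_distribution<std::size_t> pick(0, optional.size() - 1);
    while (result.size() < resultSize)
        result.push_back(optional[pick(rng)]);
    std::shuffle(result.begin(), result.end(), rng);   // Fisher-Yates
    return result;
}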

Data structure for set of (non-disjoint) sets

I'm looking for a data structure that roughly corresponds to (in Java terms) Map<Set<int>, double>. Essentially a set of sets of labeled marbles, where each set of marbles is associated with a scalar. I want it to be able to efficiently handle the following operations:
Add a given integer to every set.
Remove every set that contains (or does not contain) a given integer, or at least set the associated double to 0.
Union two of the maps, adding together the doubles for sets that appear in both.
Multiply all of the doubles by a given double.
Rarely, iterate over the entire map.
under the following conditions:
The integers will fall within a constrained range (between 1 and 10,000 or so); the exact range will be known at compile-time.
Most of the integers within the range (80-90%) will never be used, but which ones will not be easily determinable until the end of the calculation.
The number of integers used will almost always still be over 100.
Many of the sets will be very similar, differing only by a few elements.
It may be possible to identify certain groups of integers that frequently appear only in sequential order: for example, if a set contains the integers 27 and 29 then it (almost?) certainly contains 28 as well.
It may be possible to identify these groups prior to running the calculation.
These groups would typically have 100 or so integers.
I've considered tries, but I don't see a good way to handle the "remove every set that contains a given integer" operation.
The purpose of this data structure would be to represent discrete random variables and permit addition, multiplication, and scalar multiplication operations on them. Each of these discrete random variables would ultimately have been created by applying these operations to a fixed (at compile-time) set of independent Bernoulli random variables (i.e. each takes the value 1 or 0 with some probability).
The systems being modeled are close to being representable as a time-inhomogeneous Markov chains (which would of course simplify this immensely) but, unfortunately, it is essential to track the duration since various transitions.
Here's a data structure that can do all of your operations pretty efficiently.
I'm going to refer to it as a BitmapArray for this explanation.
Thinking about it, for just the operations you have described, a sorted array with bitmaps as keys and weights (your doubles) as values will be pretty efficient.
The bitmaps are what maintain membership in your set. Since you said the integers in a set fall between 1 and 10,000, we can maintain information about any set with a bitmap of length 10,000.
It's going to be tough sorting an array where the keys can be as big as 2^10000, but you can be smart about implementing the comparison function in the following way:
Iterate from left to right over the two bitmaps
XOR the bits at each index
Say you get a 1 at the i-th position
Whichever bitmap has a 1 at the i-th position is greater
If you never get a 1, they're equal
I know this is still a slow comparison, but not too slow. Here's a benchmark fiddle I did on bitmaps of length 10,000.
This is in JavaScript; if you're going to write it in Java, it will perform even better.
function runTest() {
    var num = document.getElementById("txtValue").value;
    num = isNaN(num * 1) ? 0 : num * 1;
    /* For integers in the range 1-10,000 the worst case for comparison is two
       equal integers, which forces the comparison to scan the whole bit array. */
    var bitmap1 = convertToBitmap(10000, num);
    var bitmap2 = convertToBitmap(10000, num);
    var before = Date.now();
    var result = firstIsGreater(bitmap1, bitmap2, 10000);
    var after = Date.now();
    alert(result + " in time: " + (after - before) + " ms");
}

function convertToBitmap(size, number) {
    var bits = [];
    var q = number;
    do {
        bits.push(q % 2);
        q = Math.floor(q / 2);
    } while (q > 0);
    var xbitArray = [];
    for (var i = 0; i < size; i++) {
        xbitArray.push(0);
    }
    var j = xbitArray.length - 1;
    for (var i = bits.length - 1; i >= 0; i--) {
        xbitArray[j] = bits[i];
        j--;
    }
    return xbitArray;
}

function firstIsGreater(bitArray1, bitArray2, lengthOfArrays) {
    for (var i = 0; i < lengthOfArrays; i++) {
        if (bitArray1[i] ^ bitArray2[i]) {
            if (bitArray1[i]) return true;
            else return false;
        }
    }
    return false;
}

document.getElementById("btnTest").onclick = function (e) {
    runTest();
};
Also, remember that you only have to do this once, when building your BitmapArray (or while taking unions) and then it's going to become pretty efficient for the operations you'd do most often:
Note: N is the length of the BitmapArray.
Add integer to every set: Worst/best case O(N) time. Flip a 0 to 1 in each bitmap.
Remove every set that contains a given integer: Worst case O(N) time.
For each bitmap check the bit that represents the given integer, if 1 mark it's index.
Compress the array by deleting all marked indices.
If you're okay with just setting the weights to 0 it'll be even more efficient. This also makes it very easy if you want to remove all sets that have any element in a given set.
Union of two maps: Worst case O(N1+N2) time. Just like merging two sorted arrays, except you have to be smart about comparisons once more.
Multiply all of the doubles by a given double: Worst/best case O(N) time. Iterate and multiply each value by the input double.
Iterate over the BitmapArray: Worst/best case O(1) time for next element.
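If you were doing this in C++ rather than Java or JavaScript, a very rough sketch of the BitmapArray could look like the following; it uses std::bitset keys in a std::map for brevity where the answer describes a sorted array, and the helper names are illustrative only:

#include <bitset>
#include <cstddef>
#include <map>

// Each set is a 10,000-bit bitmap (range bound taken from the question) and
// the associated double is the weight.
constexpr std::size_t kRange = 10000;
using SetKey = std::bitset<kRange>;

struct BitmapLess {
    bool operator()(const SetKey& a, const SetKey& b) const {
        // Compare from the most significant bit downwards, as described above.
        for (std::size_t i = kRange; i-- > 0; )
            if (a[i] != b[i]) return b[i];   // first differing bit decides
        return false;                        // equal
    }
};

using BitmapArray = std::map<SetKey, double, BitmapLess>;

// "Add a given integer to every set": keys change, so the map is rebuilt.
void addToEverySet(BitmapArray& m, int value) {
    BitmapArray updated(m.key_comp());
    for (const auto& kv : m) {
        SetKey k = kv.first;
        k.set(value - 1);            // integers are 1-based in the question
        updated[k] += kv.second;     // merge weights if two sets become equal
    }
    m.swap(updated);
}

// "Multiply all of the doubles by a given double": a single O(N) pass.
void scaleAll(BitmapArray& m, double factor) {
    for (auto& kv : m) kv.second *= factor;
}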

Is there an efficient data structure for row and column swapping?

I have a matrix of numbers and I'd like to be able to:
Swap rows
Swap columns
If I were to use an array of pointers to rows, then I can easily switch between rows in O(1) but swapping a column is O(N) where N is the amount of rows.
I have a distinct feeling there isn't a win-win data structure that gives O(1) for both operations, though I'm not sure how to prove it. Or am I wrong?
Without having thought this entirely through:
I think your idea with the pointers to rows is the right start. Then, to be able to "swap" a column, I'd just keep another array whose size is the number of columns and store in each entry the index of the current physical position of that column.
m =
[0] -> 1 2 3
[1] -> 4 5 6
[2] -> 7 8 9
c[] {0,1,2}
Now, to exchange columns 1 and 2, you would just change c to {0,2,1}.
When you then want to read row 1, you'd do:
for (i = 0; i < colcount; i++) {
    print m[1][c[i]];
}
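Wrapped up as a small C++ class, the same idea might look like this (a sketch; the class name, the double element type and the member names are just for illustration):

#include <cstddef>
#include <utility>
#include <vector>

// The data never moves; swapping a row or a column only swaps two entries
// in the corresponding index map, so both operations are O(1).
class PermutedMatrix {
public:
    PermutedMatrix(std::size_t rows, std::size_t cols)
        : data_(rows, std::vector<double>(cols, 0.0)), rowMap_(rows), colMap_(cols)
    {
        for (std::size_t i = 0; i < rows; ++i) rowMap_[i] = i;
        for (std::size_t j = 0; j < cols; ++j) colMap_[j] = j;
    }

    double& at(std::size_t r, std::size_t c) { return data_[rowMap_[r]][colMap_[c]]; }

    void swapRows(std::size_t a, std::size_t b)    { std::swap(rowMap_[a], rowMap_[b]); }
    void swapColumns(std::size_t a, std::size_t b) { std::swap(colMap_[a], colMap_[b]); }

private:
    std::vector<std::vector<double>> data_;
    std::vector<std::size_t> rowMap_, colMap_;
};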
Just a random thought here (no experience of how well this really works, and it's a late night without coffee):
What I'm thinking is for the internals of the matrix to be a hashtable as opposed to an array.
Every cell within the array has three pieces of information:
The row in which the cell resides
The column in which the cell resides
The value of the cell
In my mind, this is readily represented by the tuple ((i, j), v), where (i, j) denotes the position of the cell (i-th row, j-th column) and v its value.
That would be a somewhat normal representation of a matrix. But let's abstract the idea here. Rather than i denoting the row as a position (i.e. 0 before 1 before 2 before 3, etc.), let's just consider i to be some sort of canonical identifier for its corresponding row. Let's do the same for j. (While in the most general case i and j could then be unrestricted, let's assume a simple case where they will remain within the ranges [0..M] and [0..N] for an M x N matrix, but don't denote the actual coordinates of a cell.)
Now, we need a way to keep track of the identifier for a row, and the current index associated with the row. This clearly requires a key/value data structure, but since the number of indices is fixed (matrices don't usually grow/shrink), and only deals with integral indices, we can implement this as a fixed, one-dimensional array. For a matrix of M rows, we can have (in C):
int RowMap[M];
For the m-th row, RowMap[m] gives the identifier of the row in the current matrix.
We'll use the same thing for columns:
int ColumnMap[N];
where ColumnMap[n] is the identifier of the n-th column.
Now to get back to the hashtable I mentioned at the beginning:
Since we have complete information (the size of the matrix), we should be able to generate a perfect hashing function (without collision). Here's one possibility (for modestly-sized arrays):
int Hash(int row, int column)
{
    return row * N + column;
}
If this is the hash function for the hashtable, we should get zero collisions for most sizes of arrays. This allows us to read/write data from the hashtable in O(1) time.
The cool part is interfacing the index of each row/column with the identifiers in the hashtable:
// row and column are given in the usual way, in the range [0..M] and [0..N]
// These parameters are really just used as handles to the internal row and
// column indices
int MatrixLookup(int row, int column)
{
    // Get the canonical identifiers of the row and column, and hash them.
    int canonicalRow = RowMap[row];
    int canonicalColumn = ColumnMap[column];
    int hashCode = Hash(canonicalRow, canonicalColumn);

    return HashTableLookup(hashCode);
}
Now, since the interface to the matrix only uses these handles, and not the internal identifiers, a swap operation of either rows or columns corresponds to a simple change in the RowMap or ColumnMap array:
// This function simply swaps the values at
// RowMap[row1] and RowMap[row2]
void MatrixSwapRow(int row1, int row2)
{
    int canonicalRow1 = RowMap[row1];
    int canonicalRow2 = RowMap[row2];

    RowMap[row1] = canonicalRow2;
    RowMap[row2] = canonicalRow1;
}

// This function simply swaps the values at
// ColumnMap[column1] and ColumnMap[column2]
void MatrixSwapColumn(int column1, int column2)
{
    int canonicalColumn1 = ColumnMap[column1];
    int canonicalColumn2 = ColumnMap[column2];

    ColumnMap[column1] = canonicalColumn2;
    ColumnMap[column2] = canonicalColumn1;
}
So that should be it - a matrix with O(1) access and mutation, as well as O(1) row swapping and O(1) column swapping. Of course, even an O(1) hash access will be slower than the O(1) of array-based access, and more memory will be used, but at least there is equality between rows/columns.
I tried to be as agnostic as possible about exactly how you implement your matrix, so I wrote some C. If you'd prefer another language, I can change it, but I think it's pretty self-descriptive. I can't vouch for its correctness as far as C goes, since I'm actually a C++ guy trying to act like a C guy right now (and did I mention I don't have coffee?). Personally, writing this in a full OO language would do the entire design more justice, and also give the code some beauty, but like I said, this was a quickly whipped-up implementation.
