CRC32 endianness swapped

I have two functions to calculate CRC32:
1)
for (loop = 0u; loop < len; ++loop)
{
crc = lut[((uint8_t)(crc >> 24) ^ data[loop])] ^ (crc << 8u);
}
2)
for (i = 0u; i < len; i++)
{
crc = lut[((uint32_t)data[i] ^ crc) & 0xFFu] ^ (crc >> 8u);
}
Both calculate the same result, but:
the lookup table for the second one has values with different endianness;
after the calculation, the result also has swapped endianness.
The question is: why are there two different implementations? Is there a specific name for the calculation in the second example?

They are clearly equivalent after byte-swapping the result of one of them, provided the table is also byte-swapped.
Normally I see those two forms of CRC calculation for different CRCs, because one uses a bit-reflected polynomial (the code with ">>") and the other a normal polynomial (the code with "<<").
However, I have not seen a case where someone took one of those, byte-swapped the table, and switched from "<<" to ">>" or vice versa. I don't think there is a name for that.
The application I can imagine for such a thing is that someone had to byte-swap the result at the end to make it easier to put the CRC into some predefined format. They then realized they could avoid the byte swap by building it into the table and into the direction of shifting when computing.
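The equivalence can be checked directly. The following is a minimal sketch, assuming the common non-reflected polynomial 0x04C11DB7 (the question does not say which polynomial is used); the function names are made up for illustration, and the table entries are computed on the fly instead of being stored in `lut` to keep the example short. Note that the initial value has to be byte-swapped as well, which makes no difference for 0 or 0xFFFFFFFF.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY 0x04C11DB7u /* assumed polynomial, not given in the question */

/* Reverse the byte order of a 32-bit word. */
uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Entry i of the conventional MSB-first table (the one used with "<<"). */
uint32_t msb_entry(uint32_t i)
{
    uint32_t r = i << 24;
    for (int k = 0; k < 8; k++)
        r = (r & 0x80000000u) ? (r << 1) ^ POLY : (r << 1);
    return r;
}

/* Form 1 from the question: shift left, index with the top byte. */
uint32_t crc_shift_left(const uint8_t *data, size_t len, uint32_t crc)
{
    for (size_t i = 0; i < len; i++)
        crc = msb_entry((crc >> 24) ^ data[i]) ^ (crc << 8);
    return crc;
}

/* Form 2 from the question: shift right, index with the low byte,
   using byte-swapped table entries. */
uint32_t crc_shift_right(const uint8_t *data, size_t len, uint32_t crc)
{
    for (size_t i = 0; i < len; i++)
        crc = bswap32(msb_entry((crc ^ data[i]) & 0xFFu)) ^ (crc >> 8);
    return crc;
}
```

For any input and any initial value `init`, `crc_shift_right(data, len, bswap32(init))` should equal `bswap32(crc_shift_left(data, len, init))`, because byte-swapping turns `crc << 8` into `crc >> 8` and moves the top byte to the bottom.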

Related

Converting decimal to binary and using it as a filter

So the user enters a number and then gives a map of key-value pairs (keys from 0 to n).
E.g. Number = 6
Map = {{0,a},{1,b},{2,c},{3,d},{4,e},{5,f}}
The problem is to convert the number to binary (110 in this case) and print the elements of the map whose position corresponds to a 1 bit.
In this case, print {0,a} and {1,b}, as they correspond to "110".
I converted the number to binary by dividing by two recursively, then traversed the array from the end and printed the corresponding value in the map if the bit was 1.
I was asked this during an internship, and I was told my solution was very inefficient and had high time and space complexity. I was asked to use the AND operator to do this efficiently. No more was said and we moved on. I still wonder how to complete this using the AND operator, so I would like to know how it should be used here to get the solution.
You need to know which bits are set in the given number so that you can print the entries with the corresponding keys.
Using the bitwise-and operator you can mask out those bits and check whether they are set.
The following is in JS but it's very similar in other languages:
const num = 6;
let mask = 1;
for(let i = 0; i < 8; i++) {
console.log("num (binary rep):\t", ("0".repeat(7) + num.toString(2)).slice(-8))
console.log("mask (binary rep):\t", ("0".repeat(7) + mask.toString(2)).slice(-8))
console.log(!!(num & mask));
console.log(i+1, `th bit is${num & mask ? "" : " not"} set!`)
console.log("*".repeat(20))
mask = mask << 1;
}
There is a lot of fluff in the above code for presentation purposes but it boils down to the following lines:
if (mask & num)
console.log("Hurray! Set! Do stuff")
mask = mask << 1
Now, since the keys of your dictionary are all integers starting from 0 and increasing by one, you can incorporate the logic for checking them right into the above loop.
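As a rough sketch of that last step in C, the loop below walks the keys and the mask together. The function name `select_by_bits` and the use of a plain `char` array as the "map" are assumptions made for illustration. It also assumes key 0 corresponds to the least significant bit, so for 6 (binary 110) it selects keys 1 and 2; if the intended reading is left to right, as in the question's example, the mask would have to start at the highest bit instead.

```c
#include <stddef.h>

/* Copies into out[] every value whose key (its index in values[]) has the
   corresponding bit set in num. Bit 0 is the least significant bit.
   Returns how many values were selected. */
size_t select_by_bits(unsigned num, const char *values, size_t n, char *out)
{
    size_t count = 0;
    unsigned mask = 1u;

    /* mask becomes 0 once it is shifted past the top bit, which ends the loop. */
    for (size_t key = 0; key < n && mask != 0; key++, mask <<= 1) {
        if (num & mask)
            out[count++] = values[key];
    }
    return count;
}
```

This does one AND and one shift per key and needs no intermediate array of binary digits.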

Fill device array consecutively in CUDA

(This might be more of a theoretical parallel optimization problem than a CUDA-specific problem per se. I'm very new to parallel programming in general, so this may just be personal ignorance.)
I have a workload that consists of 64-bit binary numbers upon which I run an analysis. If the analysis completes successfully then that binary number is a "valid solution". If the analysis breaks midway then the number is "invalid". The end goal is to get a list of all the valid solutions.
Now there are many trillions of 64-bit binary numbers I am analyzing, but only ~5% or less will be valid solutions, and they usually come in bunches (i.e. every consecutive 1000 numbers are valid and then every random billion or so are invalid). I can't find a pattern to the space between bunches, so I can't ignore the large chunks of invalid solutions.
Currently, every thread in a kernel call analyzes just one number. If the number is valid, the thread marks it as such in its respective place in a device array. Ditto if it's invalid. So basically I generate a data point for every value analyzed, regardless of whether it's valid or not. Then, once the array is full, I copy it to the host only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the number of threads in the grid.
But copying memory to and from the GPU is expensive time-wise. What I would like to do is copy data over only when necessary: I want to fill up a device array with only valid solutions and then, once the array is full, copy it over to the host. But how do you consecutively fill an array in a parallel environment? Or am I approaching this problem the wrong way?
EDIT 1
This is the kernel I initially developed. As you can see, I am generating 1 byte of data for each value analyzed. Now I really only need each 64-bit number which is valid; if need be I can make a new kernel. As suggested by some of the commenters, I am currently looking into stream compaction.
__global__ void kValid(unsigned long long*kInfo, unsigned char*values, char *solutionFound) {
//a 64 bit binary value to be evaluated is called a kValue
unsigned long long int kStart, kEnd, kRoot, kSize, curK;
//kRoot is the kValue at the start of the device array; this is used if the device array is larger than the total threads in the grid
//kStart is the kValue to start this kernel call on
//kEnd is the last kValue to validate
//kSize is how many bits long kValue is (we don't necessarily use all 64 bits, but this value stays constant over the entire chunk of values defined on the host)
//curK is the current kValue represented as a 64 bit unsigned integer
int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges;
kStart = kInfo[0];
kEnd = kInfo[1];
kRoot = kInfo[2];
nodes = kInfo[3];
edges = kInfo[4];
kSize = kInfo[5];
curK = blockIdx.x*blockDim.x + threadIdx.x + kStart;
if (curK > kEnd) {//check to make sure you don't overshoot the end value
return;
}
kBitLocation = 1;//assuming the first bit in the kvalue has a position 1;
for (row = 0; row < nodes; row++) {
rowCount = 0;
kMirrorBitLocation = row;//the bit position for the mirrored kvals is always starts at the row value (assuming the first row has a position of 0)
for (col = 0; col < nodes; col++) {
if (col > row) {
if (curK & (1ULL << (unsigned long long int)(kSize - kBitLocation))) {//add one to kIterator to convert to counting space; 1ULL keeps the shift 64 bits wide
rowCount++;
}
kBitLocation++;
}
if (col < row) {
if (col > 0) {
kMirrorBitLocation += (nodes - 2) - (col - 1);
}
if (curK & (1ULL << (unsigned long long int)(kSize - kMirrorBitLocation))) {//if bit is set; 1ULL keeps the shift 64 bits wide
rowCount++;
}
}
}
if (rowCount != edges) {
//set the ith bit to zero
values[curK - kRoot] = 0;
return;
}
}
//set the ith bit to one
values[curK - kRoot] = 1;
*solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread.
}
(This answer assumes output order is inconsequential and so are the positions of the valid values.)
Conceptually, your analysis produces a set of valid values. The implementation you described uses a dense representation of this set: One bit for every potential value. Yet you've indicated that the data is quite sparse (either 5e-2 or 1000/10^9 = 1e-6); moreover, copying data across PCI express is quite a pain.
Well, then, why not consider a sparse representation? The simplest one would be merely an unordered sequence of the valid values. Of course, writing that requires some synchronization across threads - perhaps even across blocks. Roughly, you can have warps collect their valid values in shared memory; then synchronize at the block level to collect the block's valid values (for a given chunk of the input it has analyzed); and finally use atomics to collect the data from all the blocks.
Oh, also - have each thread analyze multiple values, so you don't have to do that much synchronization.
So, you would want to have each thread analyze multiple numbers (thousands or millions) before it returns from the computation. If you analyze a million numbers in your thread, you will only need about 5% of that amount of space to hold the results of that computation.
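The core of filling an array consecutively from many threads is to reserve a slot with an atomic increment and then write into the reserved slot. The sketch below shows that pattern in host-side C with C11 atomics, purely as an illustration; the type `result_list` and both function names are hypothetical. On the device the same idea would use CUDA's `atomicAdd` on a counter in global memory, ideally after collecting results per warp or per block first so that fewer atomics are issued.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t *slots;      /* output buffer holding valid values only */
    size_t capacity;      /* number of entries in slots */
    atomic_size_t count;  /* next free slot; may run past capacity */
} result_list;

void result_list_init(result_list *list, uint64_t *slots, size_t capacity)
{
    list->slots = slots;
    list->capacity = capacity;
    atomic_init(&list->count, 0);
}

/* Reserve one slot atomically and store the value there.
   Returns 1 on success, 0 if the buffer is already full.
   The resulting order of values is unspecified when called concurrently. */
int push_valid(result_list *list, uint64_t value)
{
    size_t slot = atomic_fetch_add(&list->count, 1);
    if (slot >= list->capacity)
        return 0;
    list->slots[slot] = value;
    return 1;
}
```

Because each caller gets a distinct slot index from the atomic increment, no two writers touch the same entry, and the buffer only needs to be copied back once it reports full.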

Dynamic Programming + Bit-Masks

I recently learned the concept of bit manipulation for competitive programming, so I'm quite new to it. I also read many tutorials on bit-masking + dynamic programming on HackerEarth, CodeChef and many more.
I also solved a couple of problems on CodeChef, including this one problem,
and I have a couple of doubts regarding bitmasks after going through some questions.
The problems I solved were mostly focused on manipulating subsets, but I wonder how to work on permutations with bitmasks, i.e. when I have to work on a state where all bits in the mask need to be set.
For example: given a number A, find how many of the numbers formed by arranging all digits of A are divisible by a given number B, where A, B <= 10**6. How can this be done with bitmasks? (I hope this can be done with bitmask + DP.)
If A= 514 ,and B=2
The question expects the answer to be
514
154
Which are both divisible by 2 .
So the answer is 2.
With the knowledge I have: 514 and 154 are represented by the same mask, 111, where all bits are set. So how do I use bitmasks here, where the mask is the same for two or more answers? (I hope you understand this.)
Also, it is impossible to allocate n!*n memory for even a moderately large n, since there can be that many permutations of digits. How can this problem be done using bitmasks so that we need only (2**n)*n space (if I'm not wrong)?
So how do I approach the above problem iteratively? Or is there a DP state equation which I can possibly understand? I couldn't understand the recursive approach of some similar problems I read.
I also tried to think about a similar problem, TSHIRTS, but I couldn't understand the logic behind the recursion.
You don't actually need DP for this one, but you can use bit manipulation nicely :) Since A <= 10^6, A has at most 7 digits, so you only have to check 7! = 5040 states.
const int A = 514;
const int B = 2;
vector <int> v; //contains digits of A (e.g. 5, 1, 4) this can be done before the recursive function in a while loop.
int rec(int mask, int current_number){
if(mask == (1 << v.size()) - 1){ //no digit left to pick
if(current_number % B == 0) return 1;
else return 0;
}
int ret = 0;
for(int i = 0; i < v.size(); i++){
if(mask & (1 << i)) continue; //this is already picked
ret += rec(mask | (1 << i), current_number * 10 + v[i]);
}
return ret;
}
Note that the reason I didn't use DP here is that current_number might differ even if mask is the same, so you can't actually say that the situation has been repeated. Unless you memoize mask AND current_number, which requires much more space.
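If an iterative DP is wanted anyway, one option (a swapped-in variant, not what the code above does) is to key the state on the mask and on current_number modulo B rather than on current_number itself, since only the remainder matters for divisibility. The sketch below is written in C under that assumption; `count_divisible` is a hypothetical name. Like the recursion above, it treats digit positions as distinct and allows leading zeros. The table has 2^n * B entries, so for large B the plain recursion is the cheaper choice.

```c
#include <stdlib.h>

/* Counts the orderings of the decimal digits of a whose value is divisible
   by b. State: dp[mask * b + r] = number of ways to arrange the digits
   selected by mask so that the number built so far has remainder r mod b.
   Returns 0 if b is 0 or the table cannot be allocated. */
unsigned long long count_divisible(unsigned a, unsigned b)
{
    int digits[10];
    int n = 0;

    if (b == 0)
        return 0;
    do {                      /* extract the digits of a */
        digits[n++] = (int)(a % 10);
        a /= 10;
    } while (a > 0);

    size_t full = ((size_t)1 << n) - 1;
    unsigned long long *dp = calloc((full + 1) * b, sizeof *dp);
    if (dp == NULL)
        return 0;

    dp[0] = 1;                /* empty prefix, remainder 0 */
    for (size_t mask = 0; mask < full; mask++) {
        for (unsigned r = 0; r < b; r++) {
            unsigned long long ways = dp[mask * b + r];
            if (ways == 0)
                continue;
            for (int i = 0; i < n; i++) {
                if (mask & ((size_t)1 << i))
                    continue; /* digit i is already used */
                unsigned nr = (r * 10 + (unsigned)digits[i]) % b;
                dp[(mask | ((size_t)1 << i)) * b + nr] += ways;
            }
        }
    }

    unsigned long long answer = dp[full * b]; /* all digits used, remainder 0 */
    free(dp);
    return answer;
}
```

For A = 514 and B = 2 this counts 514 and 154, matching the expected answer of 2.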

Optimizing this query based search

We have two N-bit numbers (0 < N < 100000). We have to perform q queries (0 < q < 500000) over these numbers. The query can be of the following three types:
set_a idx x: Set A[idx] to x, where 0 <= idx < N, where A[idx] is idx'th least significant bit of A.
set_b idx x: Set B[idx] to x, where 0 <= idx < N.
get_c idx: Print C[idx], where C=A+B, and 0<=idx
Now, I have optimized the code as far as I can.
First, I tried an int array for a, b and c. For every update, I calculated c and returned the ith bit when queried. It was damn slow. Cleared only 4/11 test cases.
I moved over to using a boolean array. It was around 2 times faster than the int array approach. Cleared 7/11 test cases.
Next, I figured out that I need not calculate c to get the idx'th bit of A+B. I just scan A and B to the right of idx until I find either a[i]=b[i]=0 or a[i]=b[i]=1. If a[i]=b[i]=0, I add up leftwards to the idx'th bit starting with an initial carry of 0. If a[i]=b[i]=1, I add up leftwards to the idx'th bit starting with an initial carry of 1.
This was faster but cleared only 8/11 test cases.
Then I figured out that once I reach a position i where a[i]=b[i]=0 or a[i]=b[i]=1, I need not add up towards the idx'th position at all. If a[i]=b[i]=0, the answer is (a[idx]+b[idx])%2, and if a[i]=b[i]=1, the answer is (a[idx]+b[idx]+1)%2. It was around 40% faster but still cleared only 8/11 test cases.
Now my question is: how do I get past those 3 'hard' test cases? I don't know what they are, but the program is taking >3 sec to solve the problem.
Here is the code: http://ideone.com/LopZf
One possible optimization is to replace
(a[pos]+b[pos]+carry)%2
with
a[pos]^b[pos]^carry
The XOR operator (^) performs addition modulo 2, making the potentially expensive mod operation (%) unnecessary. Depending on the language and compiler, the compiler may make optimizations for you when doing a mod with a power of 2. But since you are micro-optimizing it is a simple change to make that removes dependence on that optimization being made for you behind the scenes.
http://en.wikipedia.org/wiki/Exclusive_or
This is just one suggestion that is simple to make. As others have suggested, using packed ints to represent your bit array will likely also improve what is probably the worst-case test for your code. That would be a get_c query for the most significant bit, with either A or B (but not both) being 1 in all the other positions, requiring a scan of every bit position down to the least significant bit to determine the carry. If you were using packed ints for your bits, only approximately 1/32 as many operations would be necessary (assuming 32-bit ints). Using packed ints, however, would be somewhat more complicated than your use of a simple boolean array (which really is likely just an array of bytes).
C/C++ Bit Array or Bit Vector
Convert bit array to uint or similar packed value
http://en.wikipedia.org/wiki/Bit_array
There are lots of other examples on Stackoverflow and the net for using ints as if they were bit arrays.
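To make the packed-int idea concrete, here is a minimal sketch in C, assuming the bits are stored in 32-bit words with word 0 holding the 32 least significant bits; `get_c_bit` is a hypothetical name. The carry into the queried word is found by scanning the lower words one word at a time instead of one bit at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns bit idx of A + B, where a[0] and b[0] hold the 32 least
   significant bits. idx may be as large as 32 * nwords, in which case the
   carry out of the top word is returned. */
unsigned get_c_bit(const uint32_t *a, const uint32_t *b,
                   size_t nwords, size_t idx)
{
    size_t word = idx / 32;
    unsigned carry = 0;

    /* Scan the lower words, nearest first, to find the carry into `word`. */
    for (size_t i = (word < nwords ? word : nwords); i-- > 0; ) {
        uint64_t s = (uint64_t)a[i] + b[i];
        if (s > UINT32_MAX) { carry = 1; break; } /* carries out regardless of carry-in */
        if (s < UINT32_MAX) break;                /* absorbs any carry-in */
        /* s == UINT32_MAX: a carry-in would pass straight through, keep scanning */
    }

    uint64_t sum = carry;
    if (word < nwords)
        sum += (uint64_t)a[word] + b[word];
    return (unsigned)((sum >> (idx % 32)) & 1u);
}
```

In the worst case described above, every lower word sums to exactly 0xFFFFFFFF, so the scan still visits all of them, but it does so in N/32 steps rather than N.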
Here is a solution that looks a bit like your algorithm. I demonstrate it with bytes, but of course you can easily optimize the algorithm using 32 bit words (I suppose your machine has 64 bits arithmetic nowadays).
void setbit( unsigned char*x,unsigned int idx,unsigned int bit)
{
unsigned int digitIndex = idx>>3;
unsigned int bitIndex = idx & 7;
if( ((x[digitIndex]>>bitIndex)&1) ^ bit) x[digitIndex]^=(1u<<bitIndex);
}
unsigned int getbit(unsigned char *a,unsigned char *b,unsigned int idx)
{
unsigned int digitIndex = idx>>3;
unsigned int bitIndex = idx & 7;
unsigned int c = a[digitIndex]+b[digitIndex];
unsigned int bit = (c>>bitIndex) & 1;
/* a zero bit on the right will absorb a carry, let's check if any */
if( (c^(c+1))>>bitIndex )
{
/* none, we must check if there's a carry propagating from the right digits */
for(;digitIndex-- > 0;)
{
c=a[digitIndex]+b[digitIndex];
if( c > 255 ) return bit^1; /* yes, a carry */
if( c < 255 ) return bit; /* no carry possible, a zero bit will absorb it */
}
}
return bit;
}
If you find anything cryptic, just ask.
Edit: oops, I inverted the zero bit condition...

Non colliding hash algorithm for strings up to 255 characters

I am looking for a hash-algorithm, to create as close to a unique hash of a string (max len = 255) as possible, that produces a long integer (DWORD).
I realize that 26^255 >> 2^32, but also know that the number of words in the English language is far less than 2^32.
The strings I need to 'hash' would be mostly single words or some simple construct using two or three words.
The answer:
One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs. (Answered by Arachnid)
See here for a previous iteration of this question (and the answer).
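For reference, a minimal sketch of one such variant, 32-bit FNV-1a, in C. The function name `fnv1a_32` is chosen here for illustration; the offset basis 2166136261 and the prime 16777619 are the published 32-bit FNV parameters.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime. */
uint32_t fnv1a_32(const char *s, size_t len)
{
    uint32_t h = 2166136261u;      /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;            /* FNV prime; wraps modulo 2^32 */
    }
    return h;
}
```

The result fits a DWORD directly, and strings of up to 255 characters need no special handling.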
One technique is to use a well-known hash algorithm (say, MD5 or SHA-1) and use only the first 32 bits of the result.
Be aware that the risk of hash collisions increases faster than you might expect. For information on this, read about the Birthday Paradox.
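To put a number on that risk, the sketch below estimates the chance of at least one collision among n distinct strings, assuming an ideal hash that spreads values uniformly over 2^32 outputs (real hash functions only approximate this). The function name is hypothetical.

```c
/* Probability that at least two of n items share a 32-bit hash value,
   assuming uniformly distributed hashes. Computed as
   1 - product over i = 1..n-1 of (1 - i / 2^32). */
double collision_probability(unsigned long n)
{
    const double buckets = 4294967296.0; /* 2^32 */
    double p_unique = 1.0;

    if ((double)n > buckets)
        return 1.0;                      /* more items than hash values */
    for (unsigned long i = 1; i < n; i++)
        p_unique *= 1.0 - (double)i / buckets;
    return 1.0 - p_unique;
}
```

Under that assumption, 10,000 words give a collision chance of a little over 1%, and the chance reaches about 50% at roughly 77,000 words, far below 2^32.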
Ronny Pfannschmidt did a test with common English words yesterday and didn't encounter any collisions for the 10000 words he tested with the Python string hash function. I haven't tested it myself, but that algorithm is very simple and fast, and seems to be optimized for common words.
Here is the implementation:
static long
string_hash(PyStringObject *a)
{
register Py_ssize_t len;
register unsigned char *p;
register long x;
if (a->ob_shash != -1)
return a->ob_shash;
len = Py_SIZE(a);
p = (unsigned char *) a->ob_sval;
x = *p << 7;
while (--len >= 0)
x = (1000003*x) ^ *p++;
x ^= Py_SIZE(a);
if (x == -1)
x = -2;
a->ob_shash = x;
return x;
}
H(key) = [GetHash(key) + 1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))] % hashsize
MSDN article on HashCodes
Java's String.hashCode() can be easily viewed here; its algorithm is
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
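In that formula `^` denotes exponentiation, and the sum can be evaluated with Horner's rule. A minimal C sketch follows, with the hypothetical name `poly31_hash`; it works on bytes and unsigned 32-bit arithmetic, so it should match the bit pattern of Java's result only for ASCII strings, since Java hashes UTF-16 code units into a signed int.

```c
#include <stddef.h>
#include <stdint.h>

/* Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] modulo 2^32
   using Horner's rule. */
uint32_t poly31_hash(const char *s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = 31u * h + (unsigned char)s[i];
    return h;
}
```

For example, "ab" hashes to 97*31 + 98 = 3105.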

Resources