The difference of location of Checksum and CRC. WHY? - location

I have a question while studying CRC and checksum.
CRC is located at tail, but checksum is located at header.
I thought because of complication of CRC and checksum.
CRC is more complicate, so it locates header, but checksum is less complicate, so it locates tail.
Is it right?
Why is that?

In the case of transmitted or received data, hardware implementations generally generate a CRC or checksum while data is being transmitted, and then transmit the CRC or checksum (so the CRC or checksum would be at the end of data). This eliminates the need to buffer more than what is needed to hold the CRC or checksum and a unit of transmission (such as a byte).
For a message in memory, the CRC parity bytes or checksum can be located anywhere within a message. For a checksum, this is straight forward, but for a CRC, the CRC has to be generated normally, then cycled backwards and stored into where it will be located in a message. The backward cycling can be optimized as shown in the second part of my answer.
Checksums can be anywhere in a message, since the calculation is easy.
Intel hex format is/was a fairly common format for storing binary data in a text file, and has the checksum after the end of data on each line of a text file:
https://en.wikipedia.org/wiki/Intel_HEX#Record_structure
IPv4 header puts the checksum on message_word[5]:
https://en.wikipedia.org/wiki/IPv4_header_checksum#Calculating_the_IPv4_header_checksum
It is possible to have the CRC parities anywhere in a message. The parity bytes are zeroed, a normal CRC is calculated, then the CRC is "reverse cycled" to the location for where it will be stored. Rather than actually reversing the CRC, a carryless multiply can be used:
parity = (crc ยท (pow(2,-1-reverse_distance)%poly))%poly
The -1 represents the cyclic period for a CRC. For CRC32, the period is 2^32-1 = 0xffffffff
Example code for a 32 byte message with 14 data byte, 4 parity bytes, 14 data bytes. After the parity bytes are stored in the message, a normal CRC calculation on the message will be zero.
#include <stdio.h>
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
static uint32_t crctbl[256];
void GenTbl(void) /* generate crc table */
{
uint32_t crc;
uint32_t c;
uint32_t i;
for(c = 0; c < 0x100; c++){
crc = c<<24;
for(i = 0; i < 8; i++)
crc = (crc<<1)^((0-(crc>>31))&0x04c11db7);
crctbl[c] = crc;
}
}
uint32_t GenCrc(uint8_t * bfr, size_t size) /* generate crc */
{
uint32_t crc = 0u;
while(size--)
crc = (crc<<8)^crctbl[(crc>>24)^*bfr++];
return(crc);
}
/* carryless multiply modulo crc */
uint32_t MpyModCrc(uint32_t a, uint32_t b) /* (a*b)%crc */
{
uint32_t pd = 0;
uint32_t i;
for(i = 0; i < 32; i++){
pd = (pd<<1)^((0-(pd>>31))&0x04c11db7u);
pd ^= (0-(b>>31))&a;
b <<= 1;
}
return pd;
}
/* exponentiate by repeated squaring modulo crc */
uint32_t PowModCrc(uint32_t p) /* pow(2,p)%crc */
{
uint32_t prd = 0x1u; /* current product */
uint32_t sqr = 0x2u; /* current square */
while(p){
if(p&1)
prd = MpyModCrc(prd, sqr);
sqr = MpyModCrc(sqr, sqr);
p >>= 1;
}
return prd;
}
/* message 14 data, 4 parities, 14 data */
/* parities = crc cycled backwards 18 bytes */
int main()
{
uint32_t pmr;
uint32_t crc;
uint32_t par;
uint8_t msg[32] = {0x01,0x02,0x03,0x04,0x05,0x06,0x07,0x08,
0x09,0x0a,0x0b,0x0c,0x0d,0x0e,0x00,0x00,
0x00,0x00,0x13,0x14,0x15,0x16,0x17,0x18,
0x19,0x1a,0x1b,0x1c,0x1d,0x1e,0x1f,0x20};
GenTbl(); /* generate crc table */
pmr = PowModCrc(-1-(18*8)); /* pmr = pow(2,-1-18*8)%crc */
crc = GenCrc(msg, 32); /* generate crc */
par = MpyModCrc(crc, pmr); /* par = (crc*pmr)%crc */
msg[14] = (uint8_t)(par>>24); /* store parities in msg */
msg[15] = (uint8_t)(par>>16);
msg[16] = (uint8_t)(par>> 8);
msg[17] = (uint8_t)(par>> 0);
crc = GenCrc(msg, 32); /* crc == 0 */
printf("%08x\n", crc);
return 0;
}

Related

Finding out the correct CRC polynomial

I am using simpleBGC gimbal controller from Basecam electronics. The controller has a serial API for communication which requires the calculation of crc16 checksum for the commands sent to the controller(https://www.basecamelectronics.com/file/SimpleBGC_2_6_Serial_Protocol_Specification.pdf) (page 3)
I want to send the reset command to the controller which has the following format:
Header: {start char: '$', command id: '114', payload size: '3', header checksum : '117'}
Payload: {3,0,0} (3 bytes corresponding to reset options and time to reset)
crc16 checksum : ? (using polynomial 0x8005 calculated for all bytes except start char)
The hex representation of my command is: 0x24720375030000 and I need to find crc16 checksum for 0x720375030000. I used different crc calculators but the controller is not responding to the command and I assume that crc checksum is not correct.
To find correct crc16 checksum I sent every possible combination of crc16 checksum and found out that the controller responds when checksum is '7b25'.
so the correct command in hex is : "24 720375030000 7b25".
But this checksum 7b25 does not correspond to the polynomial 0x8005.
How can I find the correct polynomial or crc16 calculation function?
Did you try the code in the appendix of the document you linked? It works fine, and produces 0x257b for the CRC of your example data. That is then written in the stream in little-endian order, giving the 7b 25 you are expecting.
Here is a simpler and faster C implementation than what is in the appendix:
#include <stddef.h>
// Return a with the low 16 bits reversed and any bits above that zeroed.
static unsigned rev16(unsigned a) {
a = (a & 0xff00) >> 8 | (a & 0x00ff) << 8;
a = (a & 0xf0f0) >> 4 | (a & 0x0f0f) << 4;
a = (a & 0xcccc) >> 2 | (a & 0x3333) << 2;
a = (a & 0xaaaa) >> 1 | (a & 0x5555) << 1;
return a;
}
// Implement the CRC specified in the BASECAM SimpleBGC32 2.6x serial protocol
// specification. Return crc updated with the length bytes at message. If
// message is NULL, then return the initial CRC value. This CRC is like
// CRC-16/ARC, but with the bits reversed.
//
// This is a simple bit-wise implementation. Byte-wise and word-wise algorithms
// using tables exist for higher speed if needed. Also this implementation
// chooses to reverse the CRC bits as opposed to the data bits, as done in the
// specficiation appendix. The CRC only needs to be reversed once at the start
// and once at the end, whereas the alternative is reversing every data byte of
// the message. Reversing the CRC twice is faster for messages with length
// greater than two bytes.
unsigned crc16_simplebgc(unsigned crc, void const *message, size_t length) {
if (message == NULL)
return 0;
unsigned char const *data = message;
crc = rev16(crc);
for (size_t i = 0; i < length; i++) {
crc ^= data[i];
for (int k = 0; k < 8; k++)
crc = crc & 1 ? (crc >> 1) ^ 0xa001 : crc >> 1;
}
return rev16(crc);
}
#include <stdio.h>
// Example usage of crc_simplebgc(). A CRC can be computed all at once, or with
// portions of the data at a time.
int main(void) {
unsigned crc = crc16_simplebgc(0, NULL, 0); // set initial CRC
crc = crc16_simplebgc(crc, "\x72\x03\x75", 3); // first three bytes
crc = crc16_simplebgc(crc, "\x03\x00\x00", 3); // remaining bytes
printf("%04x\n", crc); // prints 257b
return 0;
}

Zlib crc32 combine endian format

I just go through the code of Zlib CRC32 combine function, but I am confused about the endian of the CRC32 input. Does it only work for big endian ? If I have small endian format, should I do a byte swap first before I use the function ? Thanks in advance.
/* ========================================================================= */
local uLong crc32_combine_(crc1, crc2, len2)
uLong crc1;
uLong crc2;
z_off64_t len2;
{
int n;
unsigned long row;
unsigned long even[GF2_DIM]; /* even-power-of-two zeros operator */
unsigned long odd[GF2_DIM]; /* odd-power-of-two zeros operator */
/* degenerate case (also disallow negative lengths) */
if (len2 <= 0)
return crc1;
/* put operator for one zero bit in odd */
odd[0] = 0xedb88320UL; /* CRC-32 polynomial */
row = 1;
for (n = 1; n < GF2_DIM; n++) {
odd[n] = row;
row <<= 1;
}
/* put operator for two zero bits in even */
gf2_matrix_square(even, odd);
/* put operator for four zero bits in odd */
gf2_matrix_square(odd, even);
/* apply len2 zeros to crc1 (first square will put the operator for one
zero byte, eight zero bits, in even) */
do {
/* apply zeros operator for this bit of len2 */
gf2_matrix_square(even, odd);
if (len2 & 1)
crc1 = gf2_matrix_times(even, crc1);
len2 >>= 1;
/* if no more bits set, then done */
if (len2 == 0)
break;
/* another iteration of the loop with odd and even swapped */
gf2_matrix_square(odd, even);
if (len2 & 1)
crc1 = gf2_matrix_times(odd, crc1);
len2 >>= 1;
/* if no more bits set, then done */
} while (len2 != 0);
/* return combined crc */
crc1 ^= crc2;
return crc1;
}
All of zlib works in little or big endian architectures.
There is no "endianess" of the arguments of crc32_combine(). The crc1 and crc2 arguments are passed as 32-bit integers, not sequences of bytes, and so have no endianess.
By the way, there is more recent code for crc32_combine() that is little bit more efficient.

CRC32 Calculation for Zero Filled Buffer/File

If I want to calculate the CRC32 value for a large number of consecutive zero bytes, is there a constant time formula I can use given the length of the run of zeros? For example, if I know I have 1000 bytes all filled with zeros, is there a way to avoid a loop with 1000 iterations (just an example, actual number of zeros is unbounded for the sake of this question)?
You can compute the result of applying n zeros not in O(1) time, but in O(log n) time. This is done in zlib's crc32_combine(). A binary matrix is constructed that represents the operation of applying a single zero bit to the CRC. The 32x32 matrix multiplies the 32-bit CRC over GF(2), where addition is replaced by exclusive-or (^) and multiplication is replaced by and (&), bit by bit.
Then that matrix can be squared to get the operator for two zeros. That is squared to get the operator for four zeros. The third one is squared to get the operator for eight zeros. And so on as needed.
Now that set of operators can be applied to the CRC based on the one bits in the number n of zero bits that you want to compute the CRC of.
You can precompute the resulting matrix operator for any number of zero bits, if you happen to know you will be frequently applying exactly that many zeros. Then it is just one matrix multiplication by a vector, which is in fact O(1).
You do not need to use the pclmulqdq instruction suggested in another answer here, but that would be a little faster if you have it. It would not change the O() of the operation.
Time complexity can be reduced to O(1) using a table lookup followed by a multiply. The explanation and example code are shown in the third section of this answer.
If the 1000 is a constant, a precomputed table of 32 values, each representing
each bit of a CRC to 8000th power mod poly could be used. A set of matrices (one set per byte of the CRC) could be used to work with a byte at a time. Both methods would be constant time (fixed number of loops) O(1).
As commented above, if the 1000 is not a constant, then exponentiation by squaring could be used which would be O(log2(n)) time complexity, or a combination of precomputed tables for some constant number of zero bits, such as 256, followed by exponentiation by squaring could be used so that the final step would be O(log2(n%256)).
Optimization in general: for normal data with zero and non-zero elements, on an modern X86 with pclmulqdq (uses xmm registers), a fast crc32 (or crc16) can be implemented, although it's close to 500 lines of assembly code. Intel document: crc using pclmulqdq. Example source code for github fast crc16. For a 32 bit CRC, a different set of constants is needed. If interested, I converted the source code to work with Visual Studio ML64.EXE (64 bit MASM), and created examples for both left and right shift 32 bit CRC's, each with two sets of constants for the two most popular CRC 32 bit polynomials (left shift polys: crc32:0x104C11DB7 and crc32c: 0x11EDC6F41, right shift poly's are bit reversed).
Example code for fast adjustment of CRC using a software based carryless multiply modulo the CRC polyonomial. This will be much faster than using a 32 x 32 matrix multiply. A CRC is calculated for non-zero data: crf = GenCrc(msg, ...). An adjustment constant is calculated for n zero bytes: pmc = pow(2^(8*n))%poly (using exponentiation by repeated squaring). Then the CRC is adjusted for the zero bytes: crf = (crf*pmc)%poly.
Note that time complexity can be reduced to O(1) with generation of a table of pow(2^(8*i))%poly constants for i = 1 to n. Then the calculation is a table lookup and a fixed iteration (32 cycles) multiply % poly.
#include <stdio.h>
#include <stdlib.h>
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
static uint32_t crctbl[256];
void GenTbl(void) /* generate crc table */
{
uint32_t crc;
uint32_t c;
uint32_t i;
for(c = 0; c < 0x100; c++){
crc = c<<24;
for(i = 0; i < 8; i++)
crc = (crc<<1)^((0-(crc>>31))&0x04c11db7);
crctbl[c] = crc;
}
}
uint32_t GenCrc(uint8_t * bfr, size_t size) /* generate crc */
{
uint32_t crc = 0u;
while(size--)
crc = (crc<<8)^crctbl[(crc>>24)^*bfr++];
return(crc);
}
/* carryless multiply modulo crc */
uint32_t MpyModCrc(uint32_t a, uint32_t b) /* (a*b)%crc */
{
uint32_t pd = 0;
uint32_t i;
for(i = 0; i < 32; i++){
pd = (pd<<1)^((0-(pd>>31))&0x04c11db7u);
pd ^= (0-(b>>31))&a;
b <<= 1;
}
return pd;
}
/* exponentiate by repeated squaring modulo crc */
uint32_t PowModCrc(uint32_t p) /* pow(2,p)%crc */
{
uint32_t prd = 0x1u; /* current product */
uint32_t sqr = 0x2u; /* current square */
while(p){
if(p&1)
prd = MpyModCrc(prd, sqr);
sqr = MpyModCrc(sqr, sqr);
p >>= 1;
}
return prd;
}
/* # data bytes */
#define DAT ( 32)
/* # zero bytes */
#define PAD (992)
/* DATA+PAD */
#define CNT (1024)
int main()
{
uint32_t pmc;
uint32_t crc;
uint32_t crf;
uint32_t i;
uint8_t *msg = malloc(CNT);
for(i = 0; i < DAT; i++) /* generate msg */
msg[i] = (uint8_t)rand();
for( ; i < CNT; i++)
msg[i] = 0;
GenTbl(); /* generate crc table */
crc = GenCrc(msg, CNT); /* generate crc normally */
crf = GenCrc(msg, DAT); /* generate crc for data */
pmc = PowModCrc(PAD*8); /* pmc = pow(2,PAD*8)%crc */
crf = MpyModCrc(crf, pmc); /* crf = (crf*pmc)%crc */
printf("%08x %08x\n", crc, crf); /* crf == crc */
free(msg);
return 0;
}
CRC32 is based on multiplication in GF(2)[X] modulo some polynomial, which is multiplicative. Tricky part is splitting the non-multiplicative from the multiplicative.
First define a sparse file with the following structure (in Go):
type SparseFile struct {
FileBytes []SparseByte
Size uint64
}
type SparseByte struct {
Position uint64
Value byte
}
In your case it would be SparseFile{[]FileByte{}, 1000}
Then, the function would be:
func IEEESparse (file SparseFile) uint32 {
position2Index := map[uint64]int{}
for i , v := range(file.FileBytes) {
file.FileBytes[i].Value = bits.Reverse8(v.Value)
position2Index[v.Position] = i
}
for i := 0; i < 4; i++ {
index, ok := position2Index[uint64(i)]
if !ok {
file.FileBytes = append(file.FileBytes, SparseByte{Position: uint64(i), Value: 0xFF})
} else {
file.FileBytes[index].Value ^= 0xFF
}
}
// Add padding
file.Size += 4
newReminder := bits.Reverse32(reminderIEEESparse(file))
return newReminder ^ 0xFFFFFFFF
}
So note that:
Division is performed on bits in the opposite order (per byte).
First four bytes are xored with 0xFF.
File is padded with 4 bytes.
Reminder is reversed again.
Reminder is xored again.
The inner function reminderIEEESparse is the true reminder and it is easy to implement it in O(log n) where n is the size of the file.
You can find a full implementation here.

Why are odd-sized fread requests split in two?

I noticed that on Windows every time I issue an unbuffered fread() request with an odd length, it's split into 2 requests (as observed through procmon):
a) fread for my requested length-1
b) 2-byte fread for the last byte
This has an obvious performance overhead like 2 kernel requests instead of one etc.
Sample code ran on Windows 10:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[]) {
FILE* pFile;
char* buffer;
pFile = fopen(argv[0], "rb");
setbuf(pFile, nullptr);
size_t len = 3;
buffer = (char*)malloc(sizeof(char)*len);
if (len != fread(buffer, 1, len, pFile)) { fputs("Reading error", stderr); exit(3); }
free(buffer);
fclose(pFile);
return 0;
}
This results in the following procmon reported calls:
ReadFile c:\work\cpptry\Debug\cpptry.exe SUCCESS Offset: 0, Length: 2, Priority: Normal
ReadFile c:\work\cpptry\Debug\cpptry.exe SUCCESS Offset: 2, Length: 2
It seems as if Windows is incapable of issuing odd-sized requests to the file system.
What's up with that?
This is implementation artifact.
MS CRT keeps all FILEs buffered even if you tell it to don't do this. Instead file buffer is set to internal buffer with space for two bytes. This allows to keep one code path instead of two and simplifies implementation of fast path in fgetc and fputc.
#define fgetc(_stream) (--(_stream)->_cnt >= 0 ? 0xff & *(_stream)->_ptr++ : _filbuf(_stream))
Some of you are probably bothered by size of the buffer (2 bytes when quasi unbuffered), but in _fread_nolock_s function we can find optimization
witch tries to read multiplies of buffer size directly to the destination bypassing file buffer.
See fread.c in CRT sources:
/* calc chars to read -- (count/streambufsize) * streambufsize */
nbytes = (unsigned)(count - count % streambufsize);
...
nread = _read_nolock(_fileno(stream), data, nbytes);
Because the file buffer's size is equal 2, even number of bytes is read directly to the destination and eventual one byte goes through the file buffer. Sometimes there could be some bytes in the buffer that need to be transfered to destination before optimized read can take place.
Bonus: buffer size is always forced to multiple of 2.
See setvbuf.c:
/*
* force size to be even by masking down to the nearest multiple
* of 2
*/
size &= (size_t)~1;
...
/*
* CASE 1: No Buffering.
*/
if (type & _IONBF) {
stream->_flag |= _IONBF;
buffer = (char *)&(stream->_charbuf);
size = 2;
}
Code snippets above are from VC 2013 CRT.
For comparison snippets from Universal CRT 10.0.17134
read.cpp
unsigned const bytes_to_read = stream_buffer_size != 0
? static_cast<unsigned>(maximum_bytes_to_read - maximum_bytes_to_read % stream_buffer_size)
: maximum_bytes_to_read;
...
int const bytes_read = _read_nolock(_fileno(stream.public_stream()), data, bytes_to_read);
setvbuf.cpp
// Force the buffer size to be even by masking the low order bit:
size_t const usable_buffer_size = buffer_size_in_bytes & ~static_cast<size_t>(1);
...
// Case 1: No buffering:
if (type & _IONBF)
{
return set_buffer(stream, reinterpret_cast<char*>(&stream->_charbuf), 2, _IOBUFFER_NONE);
}
And snippets from VC 6.0 (1998)
read.c
/* calc chars to read -- (count/bufsize) * bufsize */
nbytes = ( bufsize ? (count - count % bufsize) : count );
nread = _read(_fileno(stream), data, nbytes);
setvbuf.c
/*
* force size to be even by masking down to the nearest multiple
* of 2
*/
size &= (size_t)~1;
...
/*
* CASE 1: No Buffering.
*/
if (type & _IONBF) {
stream->_flag |= _IONBF;
buffer = (char *)&(stream->_charbuf);
size = 2;
}

efficiently find the first element matching a bit mask

I have a list of N 64-bit integers whose bits represent small sets. Each integer has at most k bits set to 1. Given a bit mask, I would like to find the first element in the list that matches the mask, i.e. element & mask == element.
Example:
If my list is:
index abcdef
0 001100
1 001010
2 001000
3 000100
4 000010
5 000001
6 010000
7 100000
8 000000
and my mask is 111000, the first element matching the mask is at index 2.
Method 1:
Linear search through the entire list. This takes O(N) time and O(1) space.
Method 2:
Precompute a tree of all possible masks, and at each node keep the answer for that mask. This takes O(1) time for the query, but takes O(2^64) space.
Question:
How can I find the first element matching the mask faster than O(N), while still using a reasonable amount of space? I can afford to spend polynomial time in precomputation, because there will be a lot of queries. The key is that k is small. In my application, k <= 5 and N is in the thousands. The mask has many 1s; you can assume that it is drawn uniformly from the space of 64-bit integers.
Update:
Here is an example data set and a simple benchmark program that runs on Linux: http://up.thirld.com/binmask.tar.gz. For large.in, N=3779 and k=3. The first line is N, followed by N unsigned 64-bit ints representing the elements. Compile with make. Run with ./benchmark.e >large.out to create the true output, which you can then diff against. (Masks are generated randomly, but the random seed is fixed.) Then replace the find_first() function with your implementation.
The simple linear search is much faster than I expected. This is because k is small, and so for a random mask, a match is found very quickly on average.
A suffix tree (on bits) will do the trick, with the original priority at the leaf nodes:
000000 -> 8
1 -> 5
10 -> 4
100 -> 3
1000 -> 2
10 -> 1
100 -> 0
10000 -> 6
100000 -> 7
where if the bit is set in the mask, you search both arms, and if not, you search only the 0 arm; your answer is the minimum number you encounter at a leaf node.
You can improve this (marginally) by traversing the bits not in order but by maximum discriminability; in your example, note that 3 elements have bit 2 set, so you would create
2:0 0:0 1:0 3:0 4:0 5:0 -> 8
5:1 -> 5
4:1 5:0 -> 4
3:1 4:0 5:0 -> 3
1:1 3:0 4:0 5:0 -> 6
0:1 1:0 3:0 4:0 5:0 -> 7
2:1 0:0 1:0 3:0 4:0 5:0 -> 2
4:1 5:0 -> 1
3:1 4:0 5:0 -> 0
In your example mask this doesn't help (since you have to traverse both the bit2==0 and bit2==1 sides since your mask is set in bit 2), but on average it will improve the results (but at a cost of setup and more complex data structure). If some bits are much more likely to be set than others, this could be a huge win. If they're pretty close to random within the element list, then this doesn't help at all.
If you're stuck with essentially random bits set, you should get about (1-5/64)^32 benefit from the suffix tree approach on average (13x speedup), which might be better than the difference in efficiency due to using more complex operations (but don't count on it--bit masks are fast). If you have a nonrandom distribution of bits in your list, then you could do almost arbitrarily well.
This is the bitwise Kd-tree. It typically needs less than 64 visits per lookup operation. Currently, the selection of the bit (dimension) to pivot on is random.
#include <limits.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
typedef unsigned long long Thing;
typedef unsigned long Number;
unsigned thing_ffs(Thing mask);
Thing rand_mask(unsigned bitcnt);
#define WANT_RANDOM 31
#define WANT_BITS 3
#define BITSPERTHING (CHAR_BIT*sizeof(Thing))
#define NONUMBER ((Number)-1)
struct node {
Thing value;
Number num;
Number nul;
Number one;
char pivot;
} *nodes = NULL;
unsigned nodecount=0;
unsigned itercount=0;
struct node * nodes_read( unsigned *sizp, char *filename);
Number *find_ptr_to_insert(Number *ptr, Thing value, Thing mask);
unsigned grab_matches(Number *result, Number num, Thing mask);
void initialise_stuff(void);
int main (int argc, char **argv)
{
Thing mask;
Number num;
unsigned idx;
srand (time(NULL));
nodes = nodes_read( &nodecount, argv[1]);
fprintf( stdout, "Nodecount=%u\n", nodecount );
initialise_stuff();
#if WANT_RANDOM
mask = nodes[nodecount/2].value | nodes[nodecount/3].value ;
#else
mask = 0x38;
#endif
fprintf( stdout, "\n#### Search mask=%llx\n", (unsigned long long) mask );
itercount = 0;
num = NONUMBER;
idx = grab_matches(&num,0, mask);
fprintf( stdout, "Itercount=%u\n", itercount );
fprintf(stdout, "KdTree search %16llx\n", (unsigned long long) mask );
fprintf(stdout, "Count=%u Result:\n", idx);
idx = num;
if (idx >= nodecount) idx = nodecount-1;
fprintf( stdout, "num=%4u Value=%16llx\n"
,(unsigned) nodes[idx].num
,(unsigned long long) nodes[idx].value
);
fprintf( stdout, "\nLinear search %16llx\n", (unsigned long long) mask );
for (idx = 0; idx < nodecount; idx++) {
if ((nodes[idx].value & mask) == nodes[idx].value) break;
}
fprintf(stdout, "Cnt=%u\n", idx);
if (idx >= nodecount) idx = nodecount-1;
fprintf(stdout, "Num=%4u Value=%16llx\n"
, (unsigned) nodes[idx].num
, (unsigned long long) nodes[idx].value );
return 0;
}
void initialise_stuff(void)
{
unsigned num;
Number root, *ptr;
root = 0;
for (num=0; num < nodecount; num++) {
nodes[num].num = num;
nodes[num].one = NONUMBER;
nodes[num].nul = NONUMBER;
nodes[num].pivot = -1;
}
nodes[num-1].value = 0; /* last node is guaranteed to match anything */
root = 0;
for (num=1; num < nodecount; num++) {
ptr = find_ptr_to_insert (&root, nodes[num].value, 0ull );
if (*ptr == NONUMBER) *ptr = num;
else fprintf(stderr, "Found %u for %u\n"
, (unsigned)*ptr, (unsigned) num );
}
}
Thing rand_mask(unsigned bitcnt)
{struct node * nodes_read( unsigned *sizp, char *filename)
{
struct node *ptr;
unsigned size,used;
FILE *fp;
if (!filename) {
size = (WANT_RANDOM+0) ? WANT_RANDOM : 9;
ptr = malloc (size * sizeof *ptr);
#if (!WANT_RANDOM)
ptr[0].value = 0x0c;
ptr[1].value = 0x0a;
ptr[2].value = 0x08;
ptr[3].value = 0x04;
ptr[4].value = 0x02;
ptr[5].value = 0x01;
ptr[6].value = 0x10;
ptr[7].value = 0x20;
ptr[8].value = 0x00;
#else
for (used=0; used < size; used++) {
ptr[used].value = rand_mask(WANT_BITS);
}
#endif /* WANT_RANDOM */
*sizp = size;
return ptr;
}
fp = fopen( filename, "r" );
if (!fp) return NULL;
fscanf(fp,"%u\n", &size );
fprintf(stderr, "Size=%u\n", size);
ptr = malloc (size * sizeof *ptr);
for (used = 0; used < size; used++) {
fscanf(fp,"%llu\n", &ptr[used].value );
}
fclose( fp );
*sizp = used;
return ptr;
}
Thing value = 0;
unsigned bit, cnt;
for (cnt=0; cnt < bitcnt; cnt++) {
bit = 54321*rand();
bit %= BITSPERTHING;
value |= 1ull << bit;
}
return value;
}
Number *find_ptr_to_insert(Number *ptr, Thing value, Thing done)
{
Number num=NONUMBER;
while ( *ptr != NONUMBER) {
Thing wrong;
num = *ptr;
wrong = (nodes[num].value ^ value) & ~done;
if (nodes[num].pivot < 0) { /* This node is terminal */
/* choose one of the wrong bits for a pivot .
** For this bit (nodevalue==1 && searchmask==0 )
*/
if (!wrong) wrong = ~done ;
nodes[num].pivot = thing_ffs( wrong );
}
ptr = (wrong & 1ull << nodes[num].pivot) ? &nodes[num].nul : &nodes[num].one;
/* Once this bit has been tested, it can be masked off. */
done |= 1ull << nodes[num].pivot ;
}
return ptr;
}
unsigned grab_matches(Number *result, Number num, Thing mask)
{
Thing wrong;
unsigned count;
for (count=0; num < *result; ) {
itercount++;
wrong = nodes[num].value & ~mask;
if (!wrong) { /* we have a match */
if (num < *result) { *result = num; count++; }
/* This is cheap pruning: the break will omit both subtrees from the results.
** But because we already have a result, and the subtrees have higher numbers
** than our current num, we can ignore them. */
break;
}
if (nodes[num].pivot < 0) { /* This node is terminal */
break;
}
if (mask & 1ull << nodes[num].pivot) {
/* avoid recursion if there is only one non-empty subtree */
if (nodes[num].nul >= *result) { num = nodes[num].one; continue; }
if (nodes[num].one >= *result) { num = nodes[num].nul; continue; }
count += grab_matches(result, nodes[num].nul, mask);
count += grab_matches(result, nodes[num].one, mask);
break;
}
mask |= 1ull << nodes[num].pivot;
num = (wrong & 1ull << nodes[num].pivot) ? nodes[num].nul : nodes[num].one;
}
return count;
}
unsigned thing_ffs(Thing mask)
{
unsigned bit;
#if 1
if (!mask) return (unsigned)-1;
for ( bit=random() % BITSPERTHING; 1 ; bit += 5, bit %= BITSPERTHING) {
if (mask & 1ull << bit ) return bit;
}
#elif 0
for (bit =0; bit < BITSPERTHING; bit++ ) {
if (mask & 1ull <<bit) return bit;
}
#else
mask &= (mask-1); // Kernighan-trick
for (bit =0; bit < BITSPERTHING; bit++ ) {
mask >>=1;
if (!mask) return bit;
}
#endif
return 0xffffffff;
}
struct node * nodes_read( unsigned *sizp, char *filename)
{
struct node *ptr;
unsigned size,used;
FILE *fp;
if (!filename) {
size = (WANT_RANDOM+0) ? WANT_RANDOM : 9;
ptr = malloc (size * sizeof *ptr);
#if (!WANT_RANDOM)
ptr[0].value = 0x0c;
ptr[1].value = 0x0a;
ptr[2].value = 0x08;
ptr[3].value = 0x04;
ptr[4].value = 0x02;
ptr[5].value = 0x01;
ptr[6].value = 0x10;
ptr[7].value = 0x20;
ptr[8].value = 0x00;
#else
for (used=0; used < size; used++) {
ptr[used].value = rand_mask(WANT_BITS);
}
#endif /* WANT_RANDOM */
*sizp = size;
return ptr;
}
fp = fopen( filename, "r" );
if (!fp) return NULL;
fscanf(fp,"%u\n", &size );
fprintf(stderr, "Size=%u\n", size);
ptr = malloc (size * sizeof *ptr);
for (used = 0; used < size; used++) {
fscanf(fp,"%llu\n", &ptr[used].value );
}
fclose( fp );
*sizp = used;
return ptr;
}
UPDATE:
I experimented a bit with the pivot-selection, favouring bits with the highest discriminatory value ("information content"). This involves:
making a histogram of the usage of bits (can be done while initialising)
while building the tree: choosing the one with frequency closest to 1/2 in the remaining subtrees.
The result: the random pivot selection performed better.
Construct a a binary tree as follows:
Every level corresponds to a bit
It corresponding bit is on go right, otherwise left
This way insert every number in the database.
Now, for searching: if the corresponding bit in the mask is 1, traverse both children. If it is 0, traverse only the left node. Essentially keep traversing the tree until you hit the leaf node (BTW, 0 is a hit for every mask!).
This tree will have O(N) space requirements.
Eg of tree for 1 (001), 2(010) and 5 (101)
root
/ \
0 1
/ \ |
0 1 0
| | |
1 0 1
(1) (2) (5)
With precomputed bitmasks. Formally is is still O(N), since the and-mask operations are O(N). The final pass is also O(N), because it needs to find the lowest bit set, but that could be sped up, too.
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
/* For demonstration purposes.
** In reality, this should be an unsigned long long */
typedef unsigned char Thing;
#define BITSPERTHING (CHAR_BIT*sizeof (Thing))
#define COUNTOF(a) (sizeof a / sizeof a[0])
Thing data[] =
/****** index abcdef */
{ 0x0c /* 0 001100 */
, 0x0a /* 1 001010 */
, 0x08 /* 2 001000 */
, 0x04 /* 3 000100 */
, 0x02 /* 4 000010 */
, 0x01 /* 5 000001 */
, 0x10 /* 6 010000 */
, 0x20 /* 7 100000 */
, 0x00 /* 8 000000 */
};
/* Note: this is for demonstration purposes.
** Normally, one should choose a machine wide unsigned int
** for bitmask arrays.
*/
struct bitmap {
char data[ 1+COUNTOF (data)/ CHAR_BIT ];
} nulmaps [ BITSPERTHING ];
#define BITSET(a,i) (a)[(i) / CHAR_BIT ] |= (1u << ((i)%CHAR_BIT) )
#define BITTEST(a,i) ((a)[(i) / CHAR_BIT ] & (1u << ((i)%CHAR_BIT) ))
void init_tabs(void);
void map_empty(struct bitmap *dst);
void map_full(struct bitmap *dst);
void map_and2(struct bitmap *dst, struct bitmap *src);
int main (void)
{
Thing mask;
struct bitmap result;
unsigned ibit;
mask = 0x38;
init_tabs();
map_full(&result);
for (ibit = 0; ibit < BITSPERTHING; ibit++) {
/* bit in mask is 1, so bit at this position is in fact a don't care */
if (mask & (1u <<ibit)) continue;
/* bit in mask is 0, so we can only select items with a 0 at this bitpos */
map_and2(&result, &nulmaps[ibit] );
}
/* This is not the fastest way to find the lowest 1 bit */
for (ibit = 0; ibit < COUNTOF (data); ibit++) {
if (!BITTEST(result.data, ibit) ) continue;
fprintf(stdout, " %u", ibit);
}
fprintf( stdout, "\n" );
return 0;
}
void init_tabs(void)
{
unsigned ibit, ithing;
/* 1 bits in data that dont overlap with 1 bits in the searchmask are showstoppers.
** So, for each bitpos, we precompute a bitmask of all *entrynumbers* from data[], that contain 0 in bitpos.
*/
memset(nulmaps, 0 , sizeof nulmaps);
for (ithing=0; ithing < COUNTOF(data); ithing++) {
for (ibit=0; ibit < BITSPERTHING; ibit++) {
if ( data[ithing] & (1u << ibit) ) continue;
BITSET(nulmaps[ibit].data, ithing);
}
}
}
/* Logical And of two bitmask arrays; simular to dst &= src */
void map_and2(struct bitmap *dst, struct bitmap *src)
{
unsigned idx;
for (idx = 0; idx < COUNTOF(dst->data); idx++) {
dst->data[idx] &= src->data[idx] ;
}
}
void map_empty(struct bitmap *dst)
{
memset(dst->data, 0 , sizeof dst->data);
}
void map_full(struct bitmap *dst)
{
unsigned idx;
/* NOTE this loop sets too many bits to the left of COUNTOF(data) */
for (idx = 0; idx < COUNTOF(dst->data); idx++) {
dst->data[idx] = ~0;
}
}

Resources