I'm looking for a good checksum for short binary data messages (3-5 bytes typical) on a microcontroller. I would like something that detects the kinds of errors that can sometimes happen on an SPI bus, for example off-by-ones and repeats ("abc" -> "bcd", and "abc"->"aab"). Also it should catch the edge cases of all-zeros, all-ones and all-same-value. The checksum can add 2-4 bytes.
Running speed is not as critical as this will not process very much data; but code size is somewhat important.
I ended up using CRC16 CCITT. This is only ~50 bytes of compiled code on the target system (not using any lookup tables!), runs reasonably fast, and handles all-zero and all-one cases pretty decently.
Code (from http://www.sal.wisc.edu/st5000/documents/tables/crc16.c):
unsigned short int
crc16(unsigned char *p, int n)
{
    unsigned short int crc = 0xffff;

    while (n-- > 0) {
        crc = (unsigned char)(crc >> 8) | (crc << 8);
        crc ^= *p++;
        crc ^= (unsigned char)(crc & 0xff) >> 4;
        crc ^= (crc << 8) << 4;
        crc ^= ((crc & 0xff) << 4) << 1;
    }
    return crc;
}
See http://pubs.opengroup.org/onlinepubs/009695299/utilities/cksum.html for the algorithm used by cksum, which is itself based on the CRC used in the Ethernet standard. Ethernet uses it to catch errors very similar to the ones you face.
That algorithm will give you a 4 byte checksum for any size of data that you wish.
I am using the SimpleBGC gimbal controller from Basecam Electronics. The controller has a serial API for communication which requires a CRC16 checksum to be calculated for each command sent to it (https://www.basecamelectronics.com/file/SimpleBGC_2_6_Serial_Protocol_Specification.pdf, page 3).
I want to send the reset command to the controller which has the following format:
Header: {start char: '$', command id: '114', payload size: '3', header checksum : '117'}
Payload: {3,0,0} (3 bytes corresponding to reset options and time to reset)
crc16 checksum : ? (using polynomial 0x8005 calculated for all bytes except start char)
The hex representation of my command is 0x24720375030000, and I need to find the CRC16 checksum for 0x720375030000. I used different CRC calculators, but the controller is not responding to the command, so I assume the checksum is not correct.
To find the correct CRC16 checksum I sent every possible checksum value and found that the controller responds when the checksum is '7b25'.
So the correct command in hex is: "24 720375030000 7b25".
But this checksum 7b25 does not correspond to the polynomial 0x8005.
How can I find the correct polynomial or crc16 calculation function?
Did you try the code in the appendix of the document you linked? It works fine, and produces 0x257b for the CRC of your example data. That is then written in the stream in little-endian order, giving the 7b 25 you are expecting.
Here is a simpler and faster C implementation than what is in the appendix:
#include <stddef.h>

// Return a with the low 16 bits reversed and any bits above that zeroed.
static unsigned rev16(unsigned a) {
    a = (a & 0xff00) >> 8 | (a & 0x00ff) << 8;
    a = (a & 0xf0f0) >> 4 | (a & 0x0f0f) << 4;
    a = (a & 0xcccc) >> 2 | (a & 0x3333) << 2;
    a = (a & 0xaaaa) >> 1 | (a & 0x5555) << 1;
    return a;
}

// Implement the CRC specified in the BASECAM SimpleBGC32 2.6x serial protocol
// specification. Return crc updated with the length bytes at message. If
// message is NULL, then return the initial CRC value. This CRC is like
// CRC-16/ARC, but with the bits reversed.
//
// This is a simple bit-wise implementation. Byte-wise and word-wise algorithms
// using tables exist for higher speed if needed. Also this implementation
// chooses to reverse the CRC bits as opposed to the data bits, as done in the
// specification appendix. The CRC only needs to be reversed once at the start
// and once at the end, whereas the alternative is reversing every data byte of
// the message. Reversing the CRC twice is faster for messages with length
// greater than two bytes.
unsigned crc16_simplebgc(unsigned crc, void const *message, size_t length) {
    if (message == NULL)
        return 0;
    unsigned char const *data = message;
    crc = rev16(crc);
    for (size_t i = 0; i < length; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ 0xa001 : crc >> 1;
    }
    return rev16(crc);
}
#include <stdio.h>

// Example usage of crc16_simplebgc(). A CRC can be computed all at once, or
// with portions of the data at a time.
int main(void) {
    unsigned crc = crc16_simplebgc(0, NULL, 0);    // set initial CRC
    crc = crc16_simplebgc(crc, "\x72\x03\x75", 3); // first three bytes
    crc = crc16_simplebgc(crc, "\x03\x00\x00", 3); // remaining bytes
    printf("%04x\n", crc);                         // prints 257b
    return 0;
}
I need to swap alternate bytes in a buffer as quickly as possible in an embedded system using an ARM Cortex-M4 processor. I use gcc. The amount of data is variable, but the max is a little over 2K. It doesn't matter if a few extra bytes are converted, because I can use an over-sized buffer.
I know that the ARM has the REV16 instruction, which I can use to swap alternate bytes in a 32-bit word. What I don't know is:
Is there a way of getting at this instruction in gcc without resorting to assembler? The __builtin_bswap16 intrinsic appears to operate on 16-bit words only. Converting 4 bytes at a time will surely be faster than converting 2 bytes.
Does the Cortex M4 have a reorder buffer and/or do register renaming? If not, what do I need to do to minimise pipeline stalls when I convert the dwords of the buffer in a partially-unrolled loop?
For example, is this code efficient, where REV16 is appropriately defined to resolve (1):
uint32_t *buf = ... ;
size_t n = ... ;  // (number of bytes to convert + 15)/16

for (size_t i = 0; i < n; ++i)
{
    uint32_t a = buf[0];
    uint32_t b = buf[1];
    uint32_t c = buf[2];
    uint32_t d = buf[3];
    REV16(a, a);
    REV16(b, b);
    REV16(c, c);
    REV16(d, d);
    buf[0] = a;
    buf[1] = b;
    buf[2] = c;
    buf[3] = d;
    buf += 4;
}
You can't use the __builtin_bswap16 function for the reason you stated: it works on 16-bit words, so it will zero the other halfword. I guess the reason for this is to keep the intrinsic behaving the same on processors which don't have an instruction like ARM's REV16.
The function
uint32_t swap(uint32_t in)
{
    in = __builtin_bswap32(in);
    in = (in >> 16) | (in << 16);
    return in;
}
compiles to (ARM GCC 5.4.1 -O3 -std=c++11 -march=armv7-m -mtune=cortex-m4 -mthumb)
rev r0, r0
ror r0, r0, #16
bx lr
And you could probably ask the compiler to inline it, which would give you 2 instructions per 32bit word. I can't think of a way to get GCC to generate REV16 with a 32bit operand, without declaring your own function with inline assembly.
EDIT
As a follow up, and based on artless noise's comment about the non portability of the __builtin_bswap functions, the compiler recognizes
uint32_t swap(uint32_t in)
{
    in = ((in & 0xff000000) >> 24) | ((in & 0x00FF0000) >> 8) |
         ((in & 0x0000FF00) << 8) | ((in & 0xFF) << 24);
    in = (in >> 16) | (in << 16);
    return in;
}
and creates the same 3-instruction function as above, so that is a more portable way to achieve it. Whether different compilers would produce the same output, though, is another question.
EDIT EDIT
If inline assembler is allowed, the following function
inline uint32_t Rev16(uint32_t a)
{
    asm ("rev16 %0, %1"
         : "=r" (a)
         : "r" (a));
    return a;
}
gets inlined, and compiles down to a single instruction.
Is there an efficient way to get 0x00000001 or 0xFFFFFFFF for a non-zero unsigned integer value, and 0 for zero, without branching?
I want to test several masks and create another mask based on that. Basically, I want to optimize the following code:
unsigned getMask(unsigned x, unsigned masks[4])
{
    return (x & masks[0] ? 1 : 0) | (x & masks[1] ? 2 : 0) |
           (x & masks[2] ? 4 : 0) | (x & masks[3] ? 8 : 0);
}
I know that some optimizing compilers can handle this, but even if that's the case, how exactly do they do it? I looked through the Bit twiddling hacks page, but found only a description of conditional setting/clearing of a mask using a boolean condition, so the conversion from int to bool should be done outside the method.
If there is no generic way to solve this, how can I do that efficiently using x86 assembler code?
x86 SSE2 can do this in a few instructions, the most important being movmskps which extracts the top bit of each 4-byte element of a SIMD vector into an integer bitmap.
Intel's intrinsics guide is pretty good, see also the SSE tag wiki
#include <immintrin.h>

static inline
unsigned getMask(unsigned x, unsigned masks[4])
{
    __m128i vx = _mm_set1_epi32(x);
    __m128i vm = _mm_load_si128((const __m128i*)masks); // or loadu if this can inline where masks[] isn't aligned
    __m128i vand = _mm_and_si128(vx, vm);
    __m128i eqzero = _mm_cmpeq_epi32(vand, _mm_setzero_si128()); // vector of 0 or -1 elems
    unsigned zeromask = _mm_movemask_ps(_mm_castsi128_ps(eqzero));
    return zeromask ^ 0xf; // flip the low 4 bits
}
Until AVX512, there's no SIMD cmpneq, so the best option is scalar XOR after extracting a mask. (We want to just flip the low 4 bits, not all of them with a NOT.)
The usual way to do this in x86 is:
test eax, eax
setne al
You can use !! to coerce a value to 0 or 1 and rewrite the expression like this
return !!(x & masks[0]) | (!!(x & masks[1]) << 1) |
       (!!(x & masks[2]) << 2) | (!!(x & masks[3]) << 3);
Is there an efficient (fast) algorithm that will perform bit expansion/duplication?
For example, expand each bit in an 8-bit value by 3 (creating a 24-bit value):
1101 0101 => 11111100 01110001 11000111
The brute force method that has been proposed is to create a lookup table. In the future, the expansion value may need to be variable. That is, in the above example we are expanding by 3 but may need to expand by some other value(s). This would require multiple lookup tables that I'd like to avoid if possible.
There is a chance to make it quicker than a lookup table if arithmetic calculations are for some reason faster than memory access. This may be possible if the calculations are vectorized (PPC AltiVec or Intel SSE) and/or if other parts of the program need every bit of cache memory.
If expansion factor = 3, only 7 instructions are needed:
out = (((in * 0x101 & 0x0F00F) * 0x11 & 0x0C30C3) * 5 & 0x249249) * 7;
Or other alternative, with 10 instructions:
out = (in | in << 8) & 0x0F00F;
out = (out | out << 4) & 0x0C30C3;
out = (out | out << 2) & 0x249249;
out *= 7;
For other expansion factors >= 3:
unsigned mask = 0x0FF;
unsigned out = in;
unsigned scale, shift;

for (scale = 4; scale != 0; scale /= 2)
{
    shift = scale * (N - 1);
    mask &= ~(mask << scale);
    mask |= mask << (scale * N);
    out = out * ((1 << shift) + 1) & mask;
}
out *= (1 << N) - 1;
Or other alternative, for expansion factors >= 2:
unsigned mask = 0x0FF;
unsigned out = in;
unsigned scale, shift;

for (scale = 4; scale != 0; scale /= 2)
{
    shift = scale * (N - 1);
    mask &= ~(mask << scale);
    mask |= mask << (scale * N);
    out = (out | out << shift) & mask;
}
out *= (1 << N) - 1;
The shift and mask values are best computed ahead of time, before processing the bit stream.
You can do it one input bit at a time. Of course, it will be slower than a lookup table, but if you're doing something like writing for a tiny 8-bit microcontroller without enough room for a table, it should have the smallest possible ROM footprint.
I noticed that in Jeff's slides "Challenges in Building Large-Scale Information Retrieval Systems", which can also be downloaded here: http://research.google.com/people/jeff/WSDM09-keynote.pdf, a method of integer compression called "group varint encoding" was mentioned. It was said to be much faster (about 2x) than the usual 7-bits-per-byte integer encoding. I am very interested in this and looking for an implementation, or any further details that could help me implement it myself.
I am not a pro and new to this, and any help is welcome!
That's referring to "variable integer encoding" (varint), where the number of bytes used to store an integer when serialized is not fixed at 4. There is a good description of varint in the protocol buffer documentation.
It is used in encoding Google's protocol buffers, and you can browse the protocol buffer source code.
The CodedOutputStream contains the exact encoding function WriteVarint32FallbackToArrayInline:
inline uint8* CodedOutputStream::WriteVarint32FallbackToArrayInline(
    uint32 value, uint8* target) {
  target[0] = static_cast<uint8>(value | 0x80);
  if (value >= (1 << 7)) {
    target[1] = static_cast<uint8>((value >> 7) | 0x80);
    if (value >= (1 << 14)) {
      target[2] = static_cast<uint8>((value >> 14) | 0x80);
      if (value >= (1 << 21)) {
        target[3] = static_cast<uint8>((value >> 21) | 0x80);
        if (value >= (1 << 28)) {
          target[4] = static_cast<uint8>(value >> 28);
          return target + 5;
        } else {
          target[3] &= 0x7F;
          return target + 4;
        }
      } else {
        target[2] &= 0x7F;
        return target + 3;
      }
    } else {
      target[1] &= 0x7F;
      return target + 2;
    }
  } else {
    target[0] &= 0x7F;
    return target + 1;
  }
}
The cascading ifs only add additional bytes onto the end of the target array if the magnitude of the value warrants those extra bytes. OR'ing with 0x80 sets the high (continuation) bit of each byte as it is written, and the value is shifted down 7 bits at a time. The final AND with 0x7F clears the high bit of the last byte, marking it as the end of the encoding. So, when reading varints, you read until you get a byte with a zero in the highest bit.
I just realized you asked about "Group VarInt encoding" specifically. Sorry, that code is basic VarInt encoding (the 7-bits-per-byte scheme). The basic idea of Group VarInt looks similar. Unfortunately, it's not what's being used to store 64-bit numbers in protocol buffers. I wouldn't be surprised if that code was open-sourced somewhere, though.
Using the ideas from varint and the diagrams of "Group varint" from the slides, it shouldn't be too hard to cook up your own :)
Here is another page describing Group VarInt compression, which contains decoding code. Unfortunately they allude to publicly available implementations, but they don't provide references.
void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) {
    const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF };
    const byte* limit = compressed + size;
    uint32_t current_value = 0;

    while (compressed != limit) {
        const uint32_t selector = *compressed++;

        const uint32_t selector1 = (selector & 3);
        current_value += *((uint32_t*)(compressed)) & MASK[selector1];
        *uncompressed++ = current_value;
        compressed += selector1 + 1;

        const uint32_t selector2 = ((selector >> 2) & 3);
        current_value += *((uint32_t*)(compressed)) & MASK[selector2];
        *uncompressed++ = current_value;
        compressed += selector2 + 1;

        const uint32_t selector3 = ((selector >> 4) & 3);
        current_value += *((uint32_t*)(compressed)) & MASK[selector3];
        *uncompressed++ = current_value;
        compressed += selector3 + 1;

        const uint32_t selector4 = (selector >> 6);
        current_value += *((uint32_t*)(compressed)) & MASK[selector4];
        *uncompressed++ = current_value;
        compressed += selector4 + 1;
    }
}
I was looking for the same thing and found this GitHub project in Java:
https://github.com/stuhood/gvi/
Looks promising !
Instead of decoding with bitmasks, in C/C++ you could use predefined structures that correspond to the value in the first byte. Here is a complete example that uses this approach: http://www.oschina.net/code/snippet_12_5083
Another Java implementation for groupvarint: https://github.com/catenamatteo/groupvarint
But I suspect the very large switch has some drawbacks in Java.