Fast bit shift of a byte array - CMAC subkeys - performance

I need to implement a left bit shift of a 16-byte array in JavaCard, as fast as possible.
I tried this code:
private static final void rotateLeft(final byte[] output, final byte[] input) {
    short carry = 0;
    short i = (short) 16;
    do {
        --i;
        carry = (short) ((input[i] << 1) | carry);
        output[i] = (byte) carry;
        carry = (short) ((carry >> 8) & 1);
    } while (i > 0);
}
Any ideas on how to improve the performance? I was thinking about some Util.getShort(...) and Util.setShort(...) magic, but I did not manage to make it work faster than the implementation above.
This is one part of the CMAC subkey computation and it is done quite often, unfortunately. In case you know a faster way to compute the CMAC subkeys (both subkeys in one loop, or something like that), please let me know.

When it comes to speed, a hard-coded version for the known length is the fastest (but ugly). If you need to shift by more than one bit, be sure to update the code accordingly.
output[0] = (byte)((byte)(input[0] << 1) | (byte)((input[1] >> 7) & 1));
output[1] = (byte)((byte)(input[1] << 1) | (byte)((input[2] >> 7) & 1));
output[2] = (byte)((byte)(input[2] << 1) | (byte)((input[3] >> 7) & 1));
output[3] = (byte)((byte)(input[3] << 1) | (byte)((input[4] >> 7) & 1));
output[4] = (byte)((byte)(input[4] << 1) | (byte)((input[5] >> 7) & 1));
output[5] = (byte)((byte)(input[5] << 1) | (byte)((input[6] >> 7) & 1));
output[6] = (byte)((byte)(input[6] << 1) | (byte)((input[7] >> 7) & 1));
output[7] = (byte)((byte)(input[7] << 1) | (byte)((input[8] >> 7) & 1));
output[8] = (byte)((byte)(input[8] << 1) | (byte)((input[9] >> 7) & 1));
output[9] = (byte)((byte)(input[9] << 1) | (byte)((input[10] >> 7) & 1));
output[10] = (byte)((byte)(input[10] << 1) | (byte)((input[11] >> 7) & 1));
output[11] = (byte)((byte)(input[11] << 1) | (byte)((input[12] >> 7) & 1));
output[12] = (byte)((byte)(input[12] << 1) | (byte)((input[13] >> 7) & 1));
output[13] = (byte)((byte)(input[13] << 1) | (byte)((input[14] >> 7) & 1));
output[14] = (byte)((byte)(input[14] << 1) | (byte)((input[15] >> 7) & 1));
output[15] = (byte)(input[15] << 1);
And use a RAM (transient) byte array!
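Since the goal here is CMAC, it may help to fold the conditional XOR into the same pass. Per NIST SP 800-38B, each subkey is the previous value shifted left by one bit, XORed with 0x87 in the last byte when the bit shifted out was 1. A rough C sketch of that derivation (function names are mine; on card you would use the unrolled shift above):

#include <stdint.h>

/* Left-shift a 16-byte block by one bit; returns the bit shifted out. */
static int shift_left_1(uint8_t out[16], const uint8_t in[16])
{
    int carry = 0;
    for (int i = 15; i >= 0; i--) {
        int b = in[i];
        out[i] = (uint8_t)((b << 1) | carry);
        carry = (b >> 7) & 1;
    }
    return carry;
}

/* Derive both CMAC subkeys from L = AES_K(0^128), per NIST SP 800-38B. */
void cmac_subkeys(const uint8_t L[16], uint8_t K1[16], uint8_t K2[16])
{
    if (shift_left_1(K1, L))
        K1[15] ^= 0x87;  /* Rb constant for 128-bit blocks */
    if (shift_left_1(K2, K1))
        K2[15] ^= 0x87;
}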

This is the fastest algorithm to rotate by an arbitrary number of bits that I could come up with (I rotate an array of 8 bytes; you can easily transform it to shift 16 instead):
Use EEPROM to store a masking table for your shifts. The mask is just an increasing number of 1s from the right:
final static byte[] ROTL_MASK = {
    (byte) 0x00, // shift 0: 00000000 // this one is never used, we don't do shift 0
    (byte) 0x01, // shift 1: 00000001
    (byte) 0x03, // shift 2: 00000011
    (byte) 0x07, // shift 3: 00000111
    (byte) 0x0F, // shift 4: 00001111
    (byte) 0x1F, // shift 5: 00011111
    (byte) 0x3F, // shift 6: 00111111
    (byte) 0x7F  // shift 7: 01111111
};
Then first use Util.arrayCopyNonAtomic for a quick swap of whole bytes if the shift is 8 or larger:
final static byte BITS = 8;
//swap whole bytes:
Util.arrayCopyNonAtomic(in, (short) (shift/BITS), out, (short) 0, (short) (8-(shift/BITS)));
Util.arrayCopyNonAtomic(in, (short) 0, out, (short) (8-(shift/BITS)), (short) (shift/BITS));
shift %= BITS; //now we need to shift only up to 8 remaining bits
if (shift > 0) {
    //apply masks
    byte mask = ROTL_MASK[shift];
    byte comp = (byte) (8 - shift);

    //rotate using masks
    out[8] = in[0]; // out[8] is an auxiliary byte, careful with bounds!
                    // note: reading in[8] below only recovers this saved
                    // byte when the rotation is done in place (in == out)
    out[0] = (byte)((byte)(in[0] << shift) | (byte)((in[1] >> comp) & mask));
    out[1] = (byte)((byte)(in[1] << shift) | (byte)((in[2] >> comp) & mask));
    out[2] = (byte)((byte)(in[2] << shift) | (byte)((in[3] >> comp) & mask));
    out[3] = (byte)((byte)(in[3] << shift) | (byte)((in[4] >> comp) & mask));
    out[4] = (byte)((byte)(in[4] << shift) | (byte)((in[5] >> comp) & mask));
    out[5] = (byte)((byte)(in[5] << shift) | (byte)((in[6] >> comp) & mask));
    out[6] = (byte)((byte)(in[6] << shift) | (byte)((in[7] >> comp) & mask));
    out[7] = (byte)((byte)(in[7] << shift) | (byte)((in[8] >> comp) & mask));
}
You can additionally remove the mask variable and reference the table directly instead.
Using this rather than a naive implementation of bit-wise rotation proved to be about 450%-500% faster.

It might help to cache the CMAC subkeys when signing repeatedly with the same key (e.g. the same DESFire EV1 session key); the subkeys are always the same for a given key.
I think David's answer could be even faster if it used two local variables to cache the values read twice from the same offset of the input array (from my observations on JCOP, array access is quite expensive even for transient arrays).
EDIT: I can provide the following implementation, which does a 4-bit right shift using shorts (a 32-bit int variant for cards supporting it would be even faster):
short pom = 0;  // X000 to be stored next
short pom2;     // loaded value
short pom3;     // 0XXX to be stored next
short curOffset = PARAMS_TRACK2_OFFSET;
while (curOffset < 16) {
    pom2 = Util.getShort(mem_PARAMS, curOffset);
    pom3 = (short) ((pom2 >>> 4) & 0x0FFF); // mask off bits dragged in by sign extension of the promoted short
    curOffset = Util.setShort(mem_RAM, curOffset, (short) (pom | pom3));
    pom = (short) (pom2 << 12);
}
Beware, this code assumes the same offsets in source and destination.
You can unroll this loop and use constant parameters if desired.
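For reference, here is the same word-at-a-time idea in plain C (a sketch under my own naming, not tested on card: fixed 16-byte buffer, 4-bit right shift, two bytes per step like Util.getShort/Util.setShort):

#include <stdint.h>

/* Right-shift a 16-byte buffer by 4 bits, two bytes at a time. */
void shift_right_4(uint8_t dst[16], const uint8_t src[16])
{
    uint16_t carry = 0;  /* nibble carried into the next 16-bit word */
    for (int off = 0; off < 16; off += 2) {
        uint16_t w = (uint16_t)((src[off] << 8) | src[off + 1]);
        uint16_t o = (uint16_t)(carry | (w >> 4));
        dst[off]     = (uint8_t)(o >> 8);
        dst[off + 1] = (uint8_t)o;
        carry = (uint16_t)(w << 12);  /* low nibble becomes the next high nibble */
    }
}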

Related

Algorithm for bit expansion/duplication?

Is there an efficient (fast) algorithm that will perform bit expansion/duplication?
For example, expand each bit in an 8-bit value by 3 (creating a 24-bit value):
1101 0101 => 11111100 01110001 11000111
The brute force method that has been proposed is to create a lookup table. In the future, the expansion value may need to be variable. That is, in the above example we are expanding by 3, but may need to expand by some other value(s). This would require multiple lookup tables, which I'd like to avoid if possible.
There is a chance to make it quicker than a lookup table if arithmetic calculations are for some reason faster than memory access. This may be possible if calculations are vectorized (PPC AltiVec or Intel SSE) and/or if other parts of the program need every bit of cache memory.
If expansion factor = 3, only 7 instructions are needed:
out = (((in * 0x101 & 0x0F00F) * 0x11 & 0x0C30C3) * 5 & 0x249249) * 7;
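To see what each multiply-and-mask stage does, here is the expression traced on the question's example input (intermediate values worked out by hand, so treat them as illustrative):

/* stage by stage for in = 0xD5 (1101 0101), expansion factor 3 */
unsigned out;
out = 0xD5 * 0x101 & 0x0F00F;  /* 0x0D005: nibbles spread 12 bits apart */
out = out * 0x11 & 0x0C30C3;   /* 0x0C1041: 2-bit pairs, 6 bits apart   */
out = out * 5 & 0x249249;      /* 0x241041: single bits, 3 bits apart   */
out = out * 7;                 /* 0xFC71C7: each bit duplicated 3 times */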
Or an alternative, with 10 instructions:
out = (in | in << 8) & 0x0F00F;
out = (out | out << 4) & 0x0C30C3;
out = (out | out << 2) & 0x249249;
out *= 7;
For other expansion factors >= 3:
unsigned mask = 0x0FF;
unsigned out = in;      /* in = input byte, N = expansion factor */
unsigned scale, shift;
for (scale = 4; scale != 0; scale /= 2)
{
    shift = scale * (N - 1);
    mask &= ~(mask << scale);
    mask |= mask << (scale * N);
    out = out * ((1 << shift) + 1) & mask;
}
out *= (1 << N) - 1;
Or another alternative, for expansion factors >= 2:
unsigned mask = 0x0FF;
unsigned out = in;      /* in = input byte, N = expansion factor */
unsigned scale, shift;
for (scale = 4; scale != 0; scale /= 2)
{
    shift = scale * (N - 1);
    mask &= ~(mask << scale);
    mask |= mask << (scale * N);
    out = (out | out << shift) & mask;
}
out *= (1 << N) - 1;
The shift and mask values are best precomputed before processing the bit stream.
You can do it one input bit at a time. Of course, it will be slower than a lookup table, but if you're doing something like writing for a tiny, 8-bit microcontroller without enough room for a table, it should have the smallest possible ROM footprint.
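A minimal sketch of that bit-at-a-time approach (my own formulation: MSB first, expansion factor n up to 4 so an 8-bit input fits a 32-bit result):

#include <stdint.h>

/* Expand each of the 8 input bits into n adjacent copies, MSB first. */
uint32_t expand_bits(uint8_t in, unsigned n)
{
    uint32_t run = (1u << n) - 1;  /* n consecutive 1s */
    uint32_t out = 0;
    for (int i = 7; i >= 0; i--) {
        out <<= n;
        if (in & (1u << i))
            out |= run;
    }
    return out;  /* expand_bits(0xD5, 3) == 0xFC71C7 */
}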

Bit hack to generate all integers with a given number of 1s

I forgot a bit hack to generate all integers with a given number of 1s. Does anybody remember it (and probably can explain it also)?
From Bit Twiddling Hacks
Update: Test program Live On Coliru
#include <cstdint>
#include <utility>
#include <iostream>
#include <bitset>

using I = uint8_t;

auto dump(I v) { return std::bitset<sizeof(I) * __CHAR_BIT__>(v); }

I bit_twiddle_permute(I v) {
    I t = v | (v - 1); // t gets v's least significant 0 bits set to 1
    // Next set to 1 the most significant bit to change,
    // set to 0 the least significant ones, and add the necessary 1 bits.
    I w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
    return w;
}

int main() {
    I p = 0b001001;
    std::cout << dump(p) << "\n";
    for (I n = bit_twiddle_permute(p); n > p; p = n, n = bit_twiddle_permute(p)) {
        std::cout << dump(n) << "\n";
    }
}
Prints
00001001
00001010
00001100
00010001
00010010
00010100
00011000
00100001
00100010
00100100
00101000
00110000
01000001
01000010
01000100
01001000
01010000
01100000
10000001
10000010
10000100
10001000
10010000
10100000
11000000
Compute the lexicographically next bit permutation
Suppose we have a pattern of N bits set to 1 in an integer and we want the next permutation of N 1 bits in a lexicographical sense. For example, if N is 3 and the bit pattern is 00010011, the next patterns would be 00010101, 00010110, 00011001, 00011010, 00011100, 00100011, and so forth. The following is a fast way to compute the next permutation.
unsigned int v; // current permutation of bits
unsigned int w; // next permutation of bits
unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number of trailing zeros. If you are using Microsoft compilers for x86, the intrinsic is _BitScanForward. These both emit a bsf instruction, but equivalents may be available for other architectures. If not, then consider using one of the methods for counting the consecutive zero bits mentioned earlier.
Here is another version that tends to be slower because of its division operator, but it does not require counting the trailing zeros.
unsigned int t = (v | (v - 1)) + 1;
w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
Thanks to Dario Sneidermanis of Argentina, who provided this on November 28, 2009.
For bit hacks I like to refer to this page: Bit Twiddling Hacks.
Regarding your specific question, read the part entitled Compute the lexicographically next bit permutation.
Compute the lexicographically next bit permutation
Suppose we have a pattern of N bits set to 1 in an integer and we want the next permutation of N 1 bits in a lexicographical sense. For example, if N is 3 and the bit pattern is 00010011, the next patterns would be 00010101, 00010110, 00011001, 00011010, 00011100, 00100011, and so forth. The following is a fast way to compute the next permutation.
unsigned int v; // current permutation of bits
unsigned int w; // next permutation of bits
unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
// Next set to 1 the most significant bit to change,
// set to 0 the least significant ones, and add the necessary 1 bits.
w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number of trailing zeros. If you are using Microsoft compilers for x86, the intrinsic is _BitScanForward. These both emit a bsf instruction, but equivalents may be available for other architectures. If not, then consider using one of the methods for counting the consecutive zero bits mentioned earlier.
Here is another version that tends to be slower because of its division operator, but it does not require counting the trailing zeros.
unsigned int t = (v | (v - 1)) + 1;
w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
Thanks to Dario Sneidermanis of Argentina, who provided this on November 28, 2009.
To add onto sehe's answer quoted below (originally from Dario Sneidermanis, also at http://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation):
#include <cstdint>
#include <utility>
#include <iostream>
#include <bitset>

using I = uint8_t;

auto dump(I v) { return std::bitset<sizeof(I) * __CHAR_BIT__>(v); }

I bit_twiddle_permute(I v) {
    I t = v | (v - 1); // t gets v's least significant 0 bits set to 1
    // Next set to 1 the most significant bit to change,
    // set to 0 the least significant ones, and add the necessary 1 bits.
    I w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
    return w;
}

int main() {
    I p = 0b001001;
    std::cout << dump(p) << "\n";
    for (I n = bit_twiddle_permute(p); n > p; n = bit_twiddle_permute(n)) {
        std::cout << dump(n) << "\n";
    }
}
There are boundary issues with bit_twiddle_permute(I v). Whenever v is the last permutation, t is all 1s (e.g. 2^8 - 1), (~t & -~t) = 0, and w is the first permutation with one fewer 1 than v, except when v = 00000000, in which case w = 01111111. In particular, if you set p to 0, the loop in main will produce all permutations with seven 1s, and the following slight modification of the for loop will cycle through all permutations with 0, 7, 6, ..., 1 bits set:
for (I n = bit_twiddle_permute(p); n>p; n = bit_twiddle_permute(n))
If this is the intention, it is perhaps worth a comment. If not, it is trivial to fix, e.g.
if (t == (I)(-1)) { return v >> __builtin_ctz(v); }
So with an additional small simplification
I bit_twiddle_permute2(I v) {
    I t = (v | (v - 1)) + 1;
    if (t == 0) { return v >> __builtin_ctz(v); }
    I w = t | ((~t & v) >> (__builtin_ctz(v) + 1));
    return w;
}
int main() {
    I p = 0b1;
    std::cout << dump(p) << "\n";
    for (I n = bit_twiddle_permute2(p); n > p; n = bit_twiddle_permute2(n)) {
        std::cout << dump(n) << "\n";
    }
}
The following adaptation of Dario Sneidermanis's idea may be slightly easier to follow
I bit_twiddle_permute3(I v) {
    int n = __builtin_ctz(v);
    I s = v >> n;
    I t = s + 1;
    I w = (t << n) | ((~t & s) >> 1);
    return w;
}
or with a similar solution to the issue I mentioned at the beginning of this post
I bit_twiddle_permute3(I v) {
    int n = __builtin_ctz(v);
    I s = v >> n;
    I t = s + 1;
    if (v == 0 || t << n == 0) { return s; }
    I w = (t << n) | ((~t & s) >> 1);
    return w;
}

UTF8 Conversion

I need to generate a UTF8 string to pass to a 3rd party library and I'm having trouble figuring out the right gymnastics... Also, to make matters worse, I'm stuck using C++ Builder 6, and every example I found talks about using std::string, which C++ Builder 6 evidently has no support for. I'd like to accomplish this without using the STL whatsoever.
Here is my code so far that I can't seem to make work.
wchar_t *SS1;
char *SS2;
SS1 = L"select * from mnemonics;";
int strsize = WideCharToMultiByte(CP_UTF8, 0, SS1, wcslen(SS1), NULL, 0, NULL, NULL);
SS2 = new char[strsize+1];
WideCharToMultiByte( CP_UTF8, 0, SS1, wcslen(SS1), SS2, strsize, NULL, NULL);
The 3rd party library chokes when I pass it SS2 as a parameter. Obviously, I'm on a Windows platform using Microsoft's WideCharToMultiByte, but eventually I would like to not need this function call, as this code must also be compiled on an embedded platform under Linux. But I'll cross that bridge when I get to it.
For now, I just need to be able to either convert a wchar_t or char to UTF8 encoded string preferably without using any STL. I won't have STL on the embedded platform.
Thanks!
Something like this:
extern void someFunctionThatAcceptsUTF8(const char* utf8);
const char* ss1 = "string in system default multibyte encoding";
someFunctionThatAcceptsUTF8( w2u( a2w(ss1) ) ); // that conversion you need:
// a2w: "ansi" -> widechar string
// w2u: widechar string -> utf8 string.
You just need to grab and include this file:
http://code.google.com/p/tiscript/source/browse/trunk/sdk/include/aux-cvt.h
It should work on Builder just fine.
If you're still looking for an answer, here's a simple implementation of a UTF-8 converter in C:
/*
** Transforms a wchar to utf-8, returning a string of converted bytes
*/
void ft_to_utf8(wchar_t c, unsigned char *buffer)
{
    if (c < (1 << 7))
        *buffer++ = (unsigned char)(c);
    else if (c < (1 << 11))
    {
        *buffer++ = (unsigned char)((c >> 6) | 0xC0);
        *buffer++ = (unsigned char)((c & 0x3F) | 0x80);
    }
    else if (c < (1 << 16))
    {
        *buffer++ = (unsigned char)((c >> 12) | 0xE0);
        *buffer++ = (unsigned char)(((c >> 6) & 0x3F) | 0x80);
        *buffer++ = (unsigned char)((c & 0x3F) | 0x80);
    }
    else if (c < (1 << 21))
    {
        *buffer++ = (unsigned char)((c >> 18) | 0xF0);
        *buffer++ = (unsigned char)(((c >> 12) & 0x3F) | 0x80);
        *buffer++ = (unsigned char)(((c >> 6) & 0x3F) | 0x80);
        *buffer++ = (unsigned char)((c & 0x3F) | 0x80);
    }
    *buffer = '\0';
}
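Since ft_to_utf8 converts one character at a time and null-terminates its output, a caller could drive it over a whole wide string like this (a sketch; wide_to_utf8 is a hypothetical helper, and the destination buffer must have room for up to 4 bytes per character):

#include <string.h>
#include <wchar.h>

void ft_to_utf8(wchar_t c, unsigned char *buffer);  /* defined above */

void wide_to_utf8(const wchar_t *src, unsigned char *dst)
{
    while (*src)
    {
        ft_to_utf8(*src++, dst);     /* writes 1-4 bytes plus '\0' */
        dst += strlen((char *)dst);  /* advance past what was written */
    }
    *dst = '\0';
}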

clear all but the two most significant set bits in a word

Given a 32-bit int which is known to have at least 2 bits set, is there a way to efficiently clear all except the 2 most significant set bits? i.e. I want to ensure the output has exactly 2 bits set.
What if the input is guaranteed to have only 2 or 3 bits set?
Examples:
0x2040 -> 0x2040
0x0300 -> 0x0300
0x0109 -> 0x0108
0x5040 -> 0x5000
Benchmarking Results:
Code:
QueryPerformanceFrequency(&freq);
/***********/
value = (base = 2) | 1;
QueryPerformanceCounter(&start);
for (l = 0; l < A_LOT; l++)
{
    //!!value calculation goes here
    junk += value; //use result to prevent optimizer removing it.

    //advance to the next 2|3 bit word
    if (value & 0x80000000)
    {
        if (base & 0x80000000)
        {
            base = 6;
        }
        base *= 2;
        value = base | 1;
    }
    else
    {
        value <<= 1;
    }
}
QueryPerformanceCounter(&end);
time = (end.QuadPart - start.QuadPart);
time /= freq.QuadPart;
printf("--------- name\n");
printf("%ld loops took %f sec (%f additional)\n", A_LOT, time, time - baseline);
printf("words /sec = %f Million\n", A_LOT / (time - baseline) / 1.0e6);
Results using VS2005 default release settings on a Core2Duo E7500 @ 2.93 GHz:
--------- BASELINE
1000000 loops took 0.001630 sec
--------- sirgedas
1000000 loops took 0.002479 sec (0.000849 additional)
words /sec = 1178.074206 Million
--------- ashelly
1000000 loops took 0.004640 sec (0.003010 additional)
words /sec = 332.230369 Million
--------- mvds
1000000 loops took 0.005250 sec (0.003620 additional)
words /sec = 276.242030 Million
--------- spender
1000000 loops took 0.009594 sec (0.007964 additional)
words /sec = 125.566361 Million
--------- schnaader
1000000 loops took 0.025680 sec (0.024050 additional)
words /sec = 41.580158 Million
If the input is guaranteed to have exactly 2 or 3 bits set, then the answer can be computed very quickly. We exploit the fact that the expression x&(x-1) equals x with its lowest set bit cleared. Applying that expression twice to the input will produce 0 if 2 or fewer bits are set. If exactly 2 bits are set, we return the original input. Otherwise, we return the original input with its lowest set bit cleared.
Here is the code in C++:
// assumes a has exactly 2 or 3 bits set
int topTwoBitsOf( int a )
{
    int b = a&(a-1); // b = a with lowest set bit cleared
    return b&(b-1) ? b : a; // check if clearing the lowest set bit of b produces 0
}
This can be written as a confusing single expression, if you like:
int topTwoBitsOf( int a )
{
    return a&(a-1)&((a&(a-1))-1) ? a&(a-1) : a;
}
I'd create a mask in a loop. At the beginning, the mask is 0. Then go from the MSB to the LSB and set each corresponding bit in the mask to 1 until you have found 2 set bits. Finally, AND the value with this mask.
#include <stdio.h>
#include <stdlib.h>

int clear_bits(int value) {
    unsigned int mask = 0;
    unsigned int act_bit = 0x80000000;
    unsigned int bit_set_count = 0;
    do {
        if ((value & act_bit) == act_bit) bit_set_count++;
        mask = mask | act_bit;
        act_bit >>= 1;
    } while ((act_bit != 0) && (bit_set_count < 2));
    return (value & mask);
}

int main() {
    printf("0x2040 => %X\n", clear_bits(0x2040));
    printf("0x0300 => %X\n", clear_bits(0x0300));
    printf("0x0109 => %X\n", clear_bits(0x0109));
    printf("0x5040 => %X\n", clear_bits(0x5040));
    return 0;
}
This is quite complicated, but should be more efficient than a for loop over all 32 bits every time (it stops as soon as the 2 most significant set bits are found). Anyway, be sure to benchmark the different ways before settling on one.
Of course, if memory is not a problem, use a lookup table approach like some recommended - this will be much faster.
How much memory is available, and at what latency? I would propose a lookup table ;-)
But seriously: if you perform this on 100s of numbers, an 8-bit lookup table giving the 2 MSBs and another 8-bit lookup table giving 1 MSB may be all you need. Depending on the processor, this might beat actually counting bits.
For speed, I would create a lookup table mapping an input byte I to
M(I) = 0 if 1 or 0 bits are set
M(I) = B' otherwise, where B' is I with all but its 2 most significant set bits cleared.
Your 32-bit int is 4 input bytes I1 I2 I3 I4.
Look up M(I1); if nonzero, you're done.
If M(I1) == 0, repeat the previous step for I2.
Else, look up I2 in a second lookup table giving 1 MSB; if nonzero, you're done.
Else, repeat the previous step for I3.
etc. etc. Don't actually loop over I1-I4, but unroll it fully.
Summing up: 2 lookup tables with 256 entries, 247/256 of cases are resolved with one lookup, approx 8/256 with two lookups, etc.
Edit: the tables, for clarity (input, table giving 2 MSBs, table giving 1 MSB):
I table2 table1
0 00000000 00000000
1 00000000 00000001
2 00000000 00000010
3 00000011 00000010
4 00000000 00000100
5 00000101 00000100
6 00000110 00000100
7 00000110 00000100
8 00000000 00001000
9 00001001 00001000
10 00001010 00001000
11 00001010 00001000
12 00001100 00001000
13 00001100 00001000
14 00001100 00001000
15 00001100 00001000
16 00000000 00010000
17 00010001 00010000
18 00010010 00010000
19 00010010 00010000
20 00010100 00010000
..
250 11000000 10000000
251 11000000 10000000
252 11000000 10000000
253 11000000 10000000
254 11000000 10000000
255 11000000 10000000
Here's another attempt (no loops, no lookup, no conditionals). This time it works:
var orig = 0x109;
var x = orig;
// smear the highest set bit into all lower positions
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
// x & ~(x >> 1) isolates the MSB; clear it from orig, then smear again
x = orig & ~(x & ~(x >> 1));
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
// keep only bits at or above the second-highest set bit
var solution = orig & ~(x >> 1);
Console.WriteLine(solution.ToString("X")); //0x108
Could probably be shortened by someone cleverer than me.
Following up on my previous answer, here's the complete implementation. I think it is as fast as it can get. (sorry for unrolling the whole thing ;-)
#include <stdio.h>
#include <stdlib.h>

unsigned char bittable1[256];
unsigned char bittable2[256];
unsigned int lookup(unsigned int);
void gentable(void);

int main(int argc, char **argv)
{
    unsigned int challenge = 0x42341223, result;
    gentable();
    if ( argc > 1 ) challenge = atoi(argv[1]);
    result = lookup(challenge);
    printf("%08x --> %08x\n", challenge, result);
}

unsigned int lookup(unsigned int i)
{
    unsigned int ret;
    ret = bittable2[i>>24]<<24; if ( ret ) return ret;
    ret = bittable1[i>>24]<<24;
    if ( !ret )
    {
        ret = bittable2[i>>16]<<16; if ( ret ) return ret;
        ret = bittable1[i>>16]<<16;
        if ( !ret )
        {
            ret = bittable2[i>>8]<<8; if ( ret ) return ret;
            ret = bittable1[i>>8]<<8;
            if ( !ret )
            {
                return bittable2[i] | bittable1[i];
            } else {
                return (ret | bittable1[i&0xff]);
            }
        } else {
            if ( bittable1[(i>>8)&0xff] )
            {
                return (ret | (bittable1[(i>>8)&0xff]<<8));
            } else {
                return (ret | bittable1[i&0xff]);
            }
        }
    } else {
        if ( bittable1[(i>>16)&0xff] )
        {
            return (ret | (bittable1[(i>>16)&0xff]<<16));
        } else if ( bittable1[(i>>8)&0xff] ) {
            return (ret | (bittable1[(i>>8)&0xff]<<8));
        } else {
            return (ret | (bittable1[i&0xff]));
        }
    }
}

void gentable()
{
    int i;
    for ( i=0; i<256; i++ )
    {
        int bitset = 0;
        int j;
        for ( j=128; j; j>>=1 )
        {
            if ( i&j )
            {
                bitset++;
                if ( bitset == 1 ) bittable1[i] = i&(~(j-1));
                else if ( bitset == 2 ) bittable2[i] = i&(~(j-1));
            }
        }
        //printf("%3d %02x %02x\n", i, bittable1[i], bittable2[i]);
    }
}
Using a variation of this, I came up with the following:
var orig=56;
var x=orig;
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
Console.WriteLine(orig&~(x>>2));
In c# but should translate easily.
EDIT: I'm not so sure I've answered your question. This takes the highest bit and preserves it and the bit next to it, e.g. 101 => 100.
Here's some python that should work:
def bit_play(num):
    bits_set = 0
    upper_mask = 0
    bit_index = 31
    while bit_index >= 0:
        upper_mask |= (1 << bit_index)
        if num & (1 << bit_index) != 0:
            bits_set += 1
            if bits_set == 2:
                num &= upper_mask
                break
        bit_index -= 1
    return num
It makes one pass over the number. It builds a mask of the bits that it crosses so it can mask off the bottom bits as soon as it hits the second-most significant one. As soon as it finds the second bit, it proceeds to clear the lower bits. You should be able to create a mask of the upper bits and &= it in instead of the second while loop. Maybe I'll hack that in and edit the post.
I'd also use a table based approach, but I believe one table alone should be sufficient. Take the 4-bit case as an example. If your input is guaranteed to have 2 or 3 bits, then your output can only be one of 6 values:
0011
0101
0110
1001
1010
1100
Put these possible values in an array sorted by size. Starting with the largest, find the first value which is equal to or less than your target value. This is your answer. For the 8-bit version you'll have more possible return values, but still easily fewer than the maximum possible permutations of 8*7.
public static final int[] MASKS = {
    0x03, //0011
    0x05, //0101
    0x06, //0110
    0x09, //1001
    0x0A, //1010
    0x0C, //1100
};

for (int i = 0; i < 16; ++i) {
    if (countBits(i) < 2) {
        continue;
    }
    for (int j = MASKS.length - 1; j >= 0; --j) {
        if (MASKS[j] <= i) {
            System.out.println(Integer.toBinaryString(i) + " " + Integer.toBinaryString(MASKS[j]));
            break;
        }
    }
}
Here's my implementation in C#
uint OnlyMostSignificant(uint value, int count) {
    uint newValue = 0;
    int c = 0;
    for (uint high = 0x80000000; high != 0 && c < count; high >>= 1) {
        if ((value & high) != 0) {
            newValue = newValue | high;
            c++;
        }
    }
    return newValue;
}
Using count, you could make it keep the most significant (count) bits.
My solution:
Use "The best method for counting bits in a 32-bit integer", then clear the lower bit if the answer is 3. Only works when input is limited to 2 or 3 bits set.
unsigned int c; // c is the total bits set in v
unsigned int v = value;
v = v - ((v >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
crc += value & (value - (c - 2)); // c==2: value & value (unchanged); c==3: value & (value-1), clearing the lowest set bit

Looking for more details about "Group varint encoding/decoding" presented in Jeff's slides

I noticed that in Jeff's slides "Challenges in Building Large-Scale Information Retrieval Systems", which can also be downloaded here: http://research.google.com/people/jeff/WSDM09-keynote.pdf, a method of integer compression called "group varint encoding" was mentioned. It was said to be much faster than the 7-bits-per-byte integer encoding (about 2x). I am very interested in this and am looking for an implementation, or any more details that could help me implement it myself.
I am not a pro and am new to this; any help is welcome!
That's referring to "variable integer encoding", where the number of bytes used to store an integer when serialized is not fixed at 4. There is a good description of varint in the protocol buffer documentation.
It is used in encoding Google's protocol buffers, and you can browse the protocol buffer source code.
The CodedOutputStream contains the exact encoding function WriteVarint32FallbackToArrayInline:
inline uint8* CodedOutputStream::WriteVarint32FallbackToArrayInline(
    uint32 value, uint8* target) {
  target[0] = static_cast<uint8>(value | 0x80);
  if (value >= (1 << 7)) {
    target[1] = static_cast<uint8>((value >> 7) | 0x80);
    if (value >= (1 << 14)) {
      target[2] = static_cast<uint8>((value >> 14) | 0x80);
      if (value >= (1 << 21)) {
        target[3] = static_cast<uint8>((value >> 21) | 0x80);
        if (value >= (1 << 28)) {
          target[4] = static_cast<uint8>(value >> 28);
          return target + 5;
        } else {
          target[3] &= 0x7F;
          return target + 4;
        }
      } else {
        target[2] &= 0x7F;
        return target + 3;
      }
    } else {
      target[1] &= 0x7F;
      return target + 2;
    }
  } else {
    target[0] &= 0x7F;
    return target + 1;
  }
}
The cascading ifs will only add additional bytes onto the end of the target array if the magnitude of the value warrants those extra bytes. The 0x80 masks the byte being written, and the value is shifted down. From what I can tell, the 0x7F mask signifies the "last byte of the encoding": when OR'ing 0x80, the highest bit is always 1, and the last byte then clears the highest bit (by AND'ing 0x7F). So, when reading varints, you read until you get a byte with a zero in the highest bit.
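For that reading side, a small decoding sketch (my own code, not from the protobuf sources): accumulate 7 bits per byte, least-significant group first, until a byte with the high bit clear:

#include <stdint.h>

const uint8_t* ReadVarint32(const uint8_t* p, uint32_t* value)
{
    uint32_t result = 0;
    int shift = 0;
    while (*p & 0x80) {                  /* high bit set: more bytes follow */
        result |= (uint32_t)(*p++ & 0x7F) << shift;
        shift += 7;
    }
    result |= (uint32_t)(*p++) << shift; /* final byte, high bit clear */
    *value = result;
    return p;                            /* points just past the varint */
}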
I just realized you asked about "Group VarInt encoding" specifically. Sorry, that code was about basic VarInt encoding (still faster than 7-bit). The basic idea looks to be similar. Unfortunately, it's not what's being used to store 64-bit numbers in protocol buffers. I wouldn't be surprised if that code was open-sourced somewhere, though.
Using the ideas from varint and the diagrams of "Group varint" from the slides, it shouldn't be too hard to cook up your own :)
Here is another page describing Group VarInt compression, which contains decoding code. Unfortunately they allude to publicly available implementations, but they don't provide references.
void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) {
    const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF };
    const byte* limit = compressed + size;
    uint32_t current_value = 0;
    while (compressed != limit) {
        const uint32_t selector = *compressed++;
        const uint32_t selector1 = (selector & 3);
        current_value += *((uint32_t*)(compressed)) & MASK[selector1];
        *uncompressed++ = current_value;
        compressed += selector1 + 1;
        const uint32_t selector2 = ((selector >> 2) & 3);
        current_value += *((uint32_t*)(compressed)) & MASK[selector2];
        *uncompressed++ = current_value;
        compressed += selector2 + 1;
        const uint32_t selector3 = ((selector >> 4) & 3);
        current_value += *((uint32_t*)(compressed)) & MASK[selector3];
        *uncompressed++ = current_value;
        compressed += selector3 + 1;
        const uint32_t selector4 = (selector >> 6);
        current_value += *((uint32_t*)(compressed)) & MASK[selector4];
        *uncompressed++ = current_value;
        compressed += selector4 + 1;
    }
}
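For the encoding direction, which that page doesn't show, here is a sketch consistent with the decoder above (my own code: little-endian, values assumed already delta-encoded, and EncodeGroupVarInt is a made-up name):

#include <stdint.h>
#include <string.h>

typedef unsigned char byte;

/* Pack 4 values into one selector byte plus 4-16 data bytes. */
byte* EncodeGroupVarInt(byte* out, const uint32_t values[4])
{
    byte* selector = out++;
    *selector = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t v = values[i];
        /* bytes needed (1..4), stored as 0..3 in two selector bits */
        int len = (v < (1u << 8)) ? 1 : (v < (1u << 16)) ? 2
                : (v < (1u << 24)) ? 3 : 4;
        *selector |= (byte)((len - 1) << (2 * i));
        memcpy(out, &v, len);  /* little-endian, matching the decoder's cast */
        out += len;
    }
    return out;
}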
I was looking for the same thing and found this GitHub project in Java:
https://github.com/stuhood/gvi/
Looks promising !
Instead of decoding with a bitmask, in C/C++ you could use predefined structures that correspond to the value in the first byte. A complete example that uses this: http://www.oschina.net/code/snippet_12_5083
Another Java implementation for groupvarint: https://github.com/catenamatteo/groupvarint
But I suspect the very large switch has some drawbacks in Java.
