Neon Optimization for multiplication and store in ARM

Neon Optimization for multiplication and store in ARM - gcc

Using an ARM Cortex A15 board I'm trying to optimize a perfectly working C code by using NEON intrinsics.
compiler: gcc 4.7 on ubuntu 12.04
Flags:-g -O3 -mcpu=cortex-a15 -mfpu=neon-vfpv4 -ftree-vectorize -DDRA7XX_ARM -DARM_PROC -DSL -funroll-loops -ftree-loop-ivcanon -mfloat-abi=hard
I wanted to do the following function ,its just a simple load->multiply->store.
here are some parameters:
*input is a pointer to an array of size 40680 and after completing the loop the pointer should retain the current position and do the same for next input stream via input pointer.
float32_t A=0.7;
float32_t *ptr_op=(float*)output[9216];
float32x2_t reg1;
for(i= 0;i< 4608;i+=4){
/*output[(2*i)] = A*(*input); // C version
input++;
output[(2*i)+1] = A*(*input);
input++;*/
reg1=vld1q_f32(input++); //Neon version
R_N=vmulq_n_f32(reg1,A);
vst1q_f32(ptr_op++,R_N);
}
I want to understand where am I making mistake in this loop because it seems pretty straightforward.
Here is my assembly implementation of the same . Am I going in the correct direction???
__asm__ __volatile__(
"\t mov r4, #0\n"
"\t vdup.32 d1,%3\n"
"Lloop2:\n"
"\t cmp r4, %2\n"
"\t bge Lend2\n"
"\t vld1.32 d0, [%0]!\n"
"\t vmul.f32 d0, d0, d1\n"
"\t vst1.32 d0, [%1]!\n"
"\t add r4, r4, #2\n"
"\t b Lloop2\n"
"Lend2:\n"
: "=r"(input), "=r"(ptr_op), "=r"(length), "=r"(A)
: "0"(input), "1"(ptr_op), "2"(length), "3"(A)
: "cc", "r4", "d1", "d0");

Hmmmmm, does your code compile in the first place? I didn't know that you can multiply a vector by a float scalar. Probably the compiler did convert if for you.
Anyway, you have to understand that most NEON instructions are bound with a long latency. Unless you hide them properly, your code won't be any faster than the standard C version, if not slower.
vld1q..... // 1 cycle
// 4 cycles latency + potential cache miss penalty
vmulq..... // 2 cycles
// 6 cycles latency
vst1q..... // 1 cycle
// 2 cycles loop overhead
The example above roughly shows the cycles required for each iteration.
And as you can see, it's minimum 18 cycles/iteration from which only 4 cycles are spent on actual computation while 14 cycles are wasted meaninglessly.
It's called RAW dependency (Read after Write)
The easiest and practically only way to hide these latencies is loop unrolling: a deep one.
Unrolling by four vectors per iteration is usually sufficient, and eight is even better, if you don't mind the code length.
void vecMul(float * pDst, float * pSrc, float coeff, int length)
{
const float32x4_t scal = vmovq_n_f32(coeff);
float32x4x4_t veca, vecb;
length -= 32;
if (length >= 0)
{
while (1)
{
do
{
length -= 32;
veca = vld1q_f32_x4(pSrc++);
vecb = vld1q_f32_x4(pSrc++);
veca.val[0] = vmulq_f32(veca.val[0], scal);
veca.val[1] = vmulq_f32(veca.val[1], scal);
veca.val[2] = vmulq_f32(veca.val[2], scal);
veca.val[3] = vmulq_f32(veca.val[3], scal);
vecb.val[0] = vmulq_f32(vecb.val[0], scal);
vecb.val[1] = vmulq_f32(vecb.val[1], scal);
vecb.val[2] = vmulq_f32(vecb.val[2], scal);
vecb.val[3] = vmulq_f32(vecb.val[3], scal);
vst1q_f32_x4(pDst++, veca);
vst1q_f32_x4(pDst++, vecb);
} while (length >= 0);
if (length <= -32) return;
pSrc += length;
pDst += length;
}
}
///////////////////////////////////////////////////////////////
if (length & 16)
{
veca = vld1q_f32_x4(pSrc++);
}
if (length & 8)
{
vecb.val[0] = vld1q_f32(pSrc++);
vecb.val[1] = vld1q_f32(pSrc++);
}
if (length & 4)
{
vecb.val[2] = vld1q_f32(pSrc++);
}
if (length & 2)
{
vld1q_lane_f32(pSrc++, vecb.val[3], 0);
vld1q_lane_f32(pSrc++, vecb.val[3], 1);
}
if (length & 1)
{
vld1q_lane_f32(pSrc, vecb.val[3], 2);
}
veca.val[0] = vmulq_f32(veca.val[0], scal);
veca.val[1] = vmulq_f32(veca.val[1], scal);
veca.val[2] = vmulq_f32(veca.val[2], scal);
veca.val[3] = vmulq_f32(veca.val[3], scal);
vecb.val[0] = vmulq_f32(vecb.val[0], scal);
vecb.val[1] = vmulq_f32(vecb.val[1], scal);
vecb.val[2] = vmulq_f32(vecb.val[2], scal);
vecb.val[3] = vmulq_f32(vecb.val[3], scal);
if (length & 16)
{
vst1q_f32_x4(pDst++, veca);
}
if (length & 8)
{
vst1q_f32(pDst++, vecb.val[0]);
vst1q_f32(pDst++, vecb.val[1]);
}
if (length & 4)
{
vst1q_f32(pDst++, vecb.val[2]);
}
if (length & 2)
{
vst1q_lane_f32(pDst++, vecb.val[3], 0);
vst1q_lane_f32(pDst++, vecb.val[3], 1);
}
if (length & 1)
{
vst1q_lane_f32(pDst, vecb.val[3], 2);
}
}
Now we are dealing with eight independent vectors, hence the latencies are completely hidden, and the potential cache miss penalty as well as the flat loop overhead are rather diminishing.

Related

CRC Reverse Engineer (Checksum from Machine / PC)

I'm currently looking for on how to determine the CRC produced from the machine to PC (and vice-versa).
The devices are communicating using serial communication or RS232 cable.
I do only have data to be able for us to create a program to be used for both devices.
The data given was from my boss and the program was corrupted. So we are trying for it to work out.
I hope everyone can help.
Thanks :)

The sequence to use for the CRC calculation in your protocol is the ASCII string
starting from the first printing character (e.g. the 'R' from REQ)
until and including the '1E' in the calculation.
It's a CRC with the following specs according to our CRC calculator
CRC:16,1021,0000,0000,No,No
which means:
CRC width: 16 bit (of course)
polynomial: 1021 HEX (truncated CRC polynomial)
init value: 0000
final Xor applied: 0000
reflectedInput: No
reflectedOutput: No`
(If 'init value' were FFFF, it would be a "16 bit width CRC as designated by CCITT").
See also the Docklight CRC glossary and the Boost CRC library on what the CRC terms mean plus sample code.
What I did is to write a small script that tries out the popular 16 bit CRCs on varying parts of the first simple "REQ=INI" command, and see if I end up with a sum of 4255. This failed, but instead of going a full brute force with trying all sorts of polynoms, I assumed that it was maybe just an oddball / flawed implementation of the known standards, and indeed succeeded with a variation of the CRC-CCITT.
Heres is some slow & easy C code (not table based!) to calculate all sorts of CRCs:
// Generic, not table-based CRC calculation
// Based on and credits to the following:
// CRC tester v1.3 written on 4th of February 2003 by Sven Reifegerste (zorc/reflex)
unsigned long reflect (unsigned long crc, int bitnum) {
// reflects the lower 'bitnum' bits of 'crc'
unsigned long i, j=1, crcout=0;
for (i=(unsigned long)1<<(bitnum-1); i; i>>=1) {
if (crc & i) crcout|=j;
j<<= 1;
}
return (crcout);
}
calcCRC(
const int width, const unsigned long polynominal, const unsigned long initialRemainder,
const unsigned long finalXOR, const int reflectedInput, const int reflectedOutput,
const unsigned char message[], const long startIndex, const long endIndex)
{
// Ensure the width is in range: 1-32 bits
assert(width >= 1 && width <= 32);
// some constant parameters used
const bool b_refInput = (reflectedInput > 0);
const bool b_refOutput = (reflectedOutput > 0);
const unsigned long crcmask = ((((unsigned long)1<<(width-1))-1)<<1)|1;
const unsigned long crchighbit = (unsigned long)1<<(width-1);
unsigned long j, c, bit;
unsigned long crc = initialRemainder;
for (long msgIndex = startIndex; msgIndex <= endIndex; ++msgIndex) {
c = (unsigned long)message[msgIndex];
if (b_refInput) c = reflect(c, 8);
for (j=0x80; j; j>>=1) {
bit = crc & crchighbit;
crc<<= 1;
if (c & j) bit^= crchighbit;
if (bit) crc^= polynominal;
}
}
if (b_refOutput) crc=reflect(crc, width);
crc^= finalXOR;
crc&= crcmask;
return(crc);
}
With this code and the CRCs specs listed above, I have been able to re-calculate the following three sample CRCs:
10.03.2014 22:20:57.109 [TX] - REQ=INI<CR><LF>
<RS>CRC=4255<CR><LF>
<GS>
10.03.2014 22:20:57.731 [TX] - ANS=INI<CR><LF>
STATUS=0<CR><LF>
<RS>CRC=57654<CR><LF>
<GS>
10.03.2014 22:20:59.323 [TX] - ANS=INI<CR><LF>
STATUS=0<CR><LF>
MID="CTL1"<CR><LF>
DEF="DTLREQ";1025<CR><LF>
INFO=0<CR><LF>
<RS>CRC=1683<CR><LF>
<GS>
I failed on the very complex one with the DEF= parts - probably didn't understand the character sequence correctly.
The Docklight script I used to reverse engineer this:
Sub crcReverseEngineer()
Dim crctypes(7)
crctypes(0) = "CRC:16,1021,FFFF,0000" ' CCITT
crctypes(1) = "CRC:16,8005,0000,0000" ' CRC-16
crctypes(2) = "CRC:16,8005,FFFF,0000" ' CRC-MODBUS
' lets try also some nonstandard variations with different init and final Xor, but stick
' to the known two polynoms.
crctypes(3) = "CRC:16,1021,FFFF,FFFF"
crctypes(4) = "CRC:16,1021,0000,FFFF"
crctypes(5) = "CRC:16,1021,0000,0000"
crctypes(6) = "CRC:16,8005,FFFF,FFFF"
crctypes(7) = "CRC:16,8005,FFFF,0000"
crcString = "06 1C 52 45 51 3D 49 4E 49 0D 0A 1E 43 52 43 3D 30 30 30 30 0D 0A 1D"
For reflectedInOrOut = 0 To 3
For cType = 0 To 7
crcSpec = crctypes(cType) & "," & IIf(reflectedInOrOut Mod 2 = 1, "Yes", "No") & "," & IIf(reflectedInOrOut > 1, "Yes", "No")
For cStart = 1 To 3
For cEnd = 9 To (Len(crcString) + 1) / 3
subDataString = Mid(crcString, (cStart - 1) * 3 + 1, (cEnd - cStart + 1) * 3)
result = DL.CalcChecksum(crcSpec, subDataString, "H")
resultInt = CLng("&h" + Left(result, 2)) * 256 + CLng("&h" + Right(result, 2))
If resultInt = 4255 Then
DL.AddComment "Found it!"
DL.AddComment "sequence: " & subDataString
DL.AddComment "CRC spec: " & crcSpec
DL.AddComment "CRC result: " & result & " (Integer = " & resultInt & ")"
Exit Sub
End If
Next
Next
Next
Next
End Sub
Public Function IIf(blnExpression, vTrueResult, vFalseResult)
If blnExpression Then
IIf = vTrueResult
Else
IIf = vFalseResult
End If
End Function
Hope this helps and I'm happy to provide extra information or clarify details.

F# image manipulation performance problem

I am currently trying to improve the performance of an F# program to make it as fast as its C# equivalent. The program does apply a filter array to a buffer of pixels. Access to memory is always done using pointers.
Here is the C# code which is applied to each pixel of an image:
unsafe private static byte getPixelValue(byte* buffer, double* filter, int filterLength, double filterSum)
{
double sum = 0.0;
for (int i = 0; i < filterLength; ++i)
{
sum += (*buffer) * (*filter);
++buffer;
++filter;
}
sum = sum / filterSum;
if (sum > 255) return 255;
if (sum < 0) return 0;
return (byte) sum;
}
The F# code looks like this and takes three times as long as the C# program:
let getPixelValue (buffer:nativeptr<byte>) (filterData:nativeptr<float>) filterLength filterSum : byte =
let rec accumulatePixel (acc:float) (buffer:nativeptr<byte>) (filter:nativeptr<float>) i =
if i > 0 then
let newAcc = acc + (float (NativePtr.read buffer) * (NativePtr.read filter))
accumulatePixel newAcc (NativePtr.add buffer 1) (NativePtr.add filter 1) (i-1)
else
acc
let acc = (accumulatePixel 0.0 buffer filterData filterLength) / filterSum
match acc with
| _ when acc > 255.0 -> 255uy
| _ when acc < 0.0 -> 0uy
| _ -> byte acc
Using mutable Variables and a for loop in F# does result in the same speed as using recursion. All Projects are configured to run in Release Mode with Code Optimization turned on.
How could the performance of the F# version be improved?
EDIT:
The bottleneck seems to be in (NativePtr.get buffer offset). If I replace this code with a fixed value and also replace the corresponding code in the C# version with a fixed value, I get about the same speed for both programs. In fact, in C# the speed does not change at all, but in F# it makes a huge difference.
Can this behaviour possibly be changed or is it rooted deeply in the architecture of F#?
EDIT 2:
I refactored the code again to use for-loops. The execution speed remains the same:
let mutable acc <- 0.0
let mutable f <- filterData
let mutable b <- tBuffer
for i in 1 .. filter.FilterLength do
acc <- acc + (float (NativePtr.read b)) * (NativePtr.read f)
f <- NativePtr.add f 1
b <- NativePtr.add b 1
If I compare the IL code of a version that uses (NativePtr.read b) and another version that is the same except that it uses a fixed value 111uy instead of reading it from the pointer, Only the following lines in the IL code change:
111uy has IL-Code ldc.i4.s 0x6f (0.3 seconds)
(NativePtr.read b) has IL-Code lines ldloc.s b and ldobj uint8 (1.4 seconds)
For comparison: C# does the filtering in 0.4 seconds.
The fact that reading the filter does not impact performance while reading from the image buffer does is somehow confusing. Before I filter a line of the image I copy the line into a buffer that has the length of a line. That's why the read operations are not spread all over the image but are within this buffer, which has a size of about 800 bytes.

If we look at the actual IL code of the inner loop which traverses both buffers in parallel generated by C# compiler (relevant part):
L_0017: ldarg.0
L_0018: ldc.i4.1
L_0019: conv.i
L_001a: add
L_001b: starg.s buffer
L_001d: ldarg.1
L_001e: ldc.i4.8
L_001f: conv.i
L_0020: add
and F# compiler:
L_0017: ldc.i4.1
L_0018: conv.i
L_0019: sizeof uint8
L_001f: mul
L_0020: add
L_0021: ldarg.2
L_0022: ldc.i4.1
L_0023: conv.i
L_0024: sizeof float64
L_002a: mul
L_002b: add
we'll notice that while C# code uses only add operator while F# needs both mul and add. But obviously on each step we only need to increment pointers (by 'sizeof byte' and 'sizeof float' values respectively), not to calculate address (addrBase + (sizeof byte)) F# mul is unnecessary (it always multiplies by 1).
The cause for that is that C# defines ++ operator for pointers while F# provides only add : nativeptr<'T> -> int -> nativeptr<'T> operator:
[<NoDynamicInvocation>]
let inline add (x : nativeptr<'a>) (n:int) : nativeptr<'a> = to_nativeint x + nativeint n * (# "sizeof !0" type('a) : nativeint #) |> of_nativeint
So it's not "rooted deeply" in F#, it's just that module NativePtr lacks inc and dec functions.
Btw, I suspect the above sample could be written in a more concise manner if the arguments were passed as arrays instead of raw pointers.
UPDATE:
So does the following code have only 1% speed up (it seems to generate very similar to C# IL):
let getPixelValue (buffer:nativeptr<byte>) (filterData:nativeptr<float>) filterLength filterSum : byte =
let rec accumulatePixel (acc:float) (buffer:nativeptr<byte>) (filter:nativeptr<float>) i =
if i > 0 then
let newAcc = acc + (float (NativePtr.read buffer) * (NativePtr.read filter))
accumulatePixel newAcc (NativePtr.ofNativeInt <| (NativePtr.toNativeInt buffer) + (nativeint 1)) (NativePtr.ofNativeInt <| (NativePtr.toNativeInt filter) + (nativeint 8)) (i-1)
else
acc
let acc = (accumulatePixel 0.0 buffer filterData filterLength) / filterSum
match acc with
| _ when acc > 255.0 -> 255uy
| _ when acc < 0.0 -> 0uy
| _ -> byte acc
Another thought: it might also depend on the number of calls to getPixelValue your test does (F# splits this function into two methods while C# does it in one).
Is it possible that you post your testing code here?
Regarding array - I'd expect the code be at least more concise (and not unsafe).
UPDATE #2:
Looks like the actual bottleneck here is byte->float conversion.
C#:
L_0003: ldarg.1
L_0004: ldind.u1
L_0005: conv.r8
F#:
L_000c: ldarg.1
L_000d: ldobj uint8
L_0012: conv.r.un
L_0013: conv.r8
For some reason F# uses the following path: byte->float32->float64 while C# does only byte->float64. Not sure why is that, but with the following hack my F# version runs with the same speed as C# on gradbot test sample (BTW, thanks gradbot for the test!):
let inline preadConvert (p : nativeptr<byte>) = (# "conv.r8" (# "ldobj !0" type (byte) p : byte #) : float #)
let inline pinc (x : nativeptr<'a>) : nativeptr<'a> = NativePtr.toNativeInt x + (# "sizeof !0" type('a) : nativeint #) |> NativePtr.ofNativeInt
let rec accumulatePixel_ed (acc, buffer, filter, i) =
if i > 0 then
accumulatePixel_ed
(acc + (preadConvert buffer) * (NativePtr.read filter),
(pinc buffer),
(pinc filter),
(i-1))
else
acc
Results:
adrian 6374985677.162810 1408.870900 ms
gradbot 6374985677.162810 1218.908200 ms
C# 6374985677.162810 227.832800 ms
C# Offset 6374985677.162810 224.921000 ms
mutable 6374985677.162810 1254.337300 ms
ed'ka 6374985677.162810 227.543100 ms
LAST UPDATE
It turned out that we can achieve the same speed even without any hacks:
let rec accumulatePixel_ed_last (acc, buffer, filter, i) =
if i > 0 then
accumulatePixel_ed_last
(acc + (float << int16 <| NativePtr.read buffer) * (NativePtr.read filter),
(NativePtr.add buffer 1),
(NativePtr.add filter 1),
(i-1))
else
acc
All we need to do is to convert byte into, say int16 and then into float. This way 'costly' conv.r.un instruction will be avoided.
PS Relevant conversion code from "prim-types.fs" :
let inline float (x: ^a) =
(^a : (static member ToDouble : ^a -> float) (x))
when ^a : float = (# "" x : float #)
when ^a : float32 = (# "conv.r8" x : float #)
// [skipped]
when ^a : int16 = (# "conv.r8" x : float #)
// [skipped]
when ^a : byte = (# "conv.r.un conv.r8" x : float #)
when ^a : decimal = (System.Convert.ToDouble((# "" x : decimal #)))

How does this compare? It has less calls to NativePtr.
let getPixelValue (buffer:nativeptr<byte>) (filterData:nativeptr<float>) filterLength filterSum : byte =
let accumulatePixel (acc:float) (buffer:nativeptr<byte>) (filter:nativeptr<float>) length =
let rec accumulate acc offset =
if offset < length then
let newAcc = acc + (float (NativePtr.get buffer offset) * (NativePtr.get filter offset))
accumulate newAcc (offset + 1)
else
acc
accumulate acc 0
let acc = (accumulatePixel 0.0 buffer filterData filterLength) / filterSum
match acc with
| _ when acc > 255.0 -> 255uy
| _ when acc < 0.0 -> 0uy
| _ -> byte acc
F# source code of NativePtr.
[<NoDynamicInvocation>]
[<CompiledName("AddPointerInlined")>]
let inline add (x : nativeptr<'T>) (n:int) : nativeptr<'T> = toNativeInt x + nativeint n * (# "sizeof !0" type('T) : nativeint #) |> ofNativeInt
[<NoDynamicInvocation>]
[<CompiledName("GetPointerInlined")>]
let inline get (p : nativeptr<'T>) n = (# "ldobj !0" type ('T) (add p n) : 'T #)

My results on a larger test.
adrian 6374730426.098020 1561.102500 ms
gradbot 6374730426.098020 1842.768000 ms
C# 6374730426.098020 150.793500 ms
C# Offset 6374730426.098020 150.318900 ms
mutable 6374730426.098020 1446.616700 ms
F# test code
open Microsoft.FSharp.NativeInterop
open System.Runtime.InteropServices
open System.Diagnostics
open AccumulatePixel
#nowarn "9"
let test size fn =
let bufferByte = Marshal.AllocHGlobal(size * 4)
let bufferFloat = Marshal.AllocHGlobal(size * 8)
let bi = NativePtr.ofNativeInt bufferByte
let bf = NativePtr.ofNativeInt bufferFloat
let random = System.Random()
for i in 1 .. size do
NativePtr.set bi i (byte <| random.Next() % 256)
NativePtr.set bf i (random.NextDouble())
let duration (f, name) =
let stopWatch = Stopwatch.StartNew()
let time = f(0.0, bi, bf, size)
stopWatch.Stop()
printfn "%10s %f %f ms" name time stopWatch.Elapsed.TotalMilliseconds
List.iter duration fn
Marshal.FreeHGlobal bufferFloat
Marshal.FreeHGlobal bufferByte
let rec accumulatePixel_adrian (acc, buffer, filter, i) =
if i > 0 then
let newAcc = acc + (float (NativePtr.read buffer) * (NativePtr.read filter))
accumulatePixel_adrian (newAcc, (NativePtr.add buffer 1), (NativePtr.add filter 1), (i - 1))
else
acc
let accumulatePixel_gradbot (acc, buffer, filter, length) =
let rec accumulate acc offset =
if offset < length then
let newAcc = acc + (float (NativePtr.get buffer offset) * (NativePtr.get filter offset))
accumulate newAcc (offset + 1)
else
acc
accumulate acc 0
let accumulatePixel_mutable (acc, buffer, filter, length) =
let mutable acc = 0.0
let mutable f = filter
let mutable b = buffer
for i in 1 .. length do
acc <- acc + (float (NativePtr.read b)) * (NativePtr.read f)
f <- NativePtr.add f 1
b <- NativePtr.add b 1
acc
[
accumulatePixel_adrian, "adrian";
accumulatePixel_gradbot, "gradbot";
AccumulatePixel.getPixelValue, "C#";
AccumulatePixel.getPixelValueOffset, "C# Offset";
accumulatePixel_mutable, "mutable";
]
|> test 100000000
System.Console.ReadLine() |> ignore
C# test code
namespace AccumulatePixel
{
public class AccumulatePixel
{
unsafe public static double getPixelValue(double sum, byte* buffer, double* filter, int filterLength)
{
for (int i = 0; i < filterLength; ++i)
{
sum += (*buffer) * (*filter);
++buffer;
++filter;
}
return sum;
}
unsafe public static double getPixelValueOffset(double sum, byte* buffer, double* filter, int filterLength)
{
for (int i = 0; i < filterLength; ++i)
{
sum += buffer[i] * filter[i];
}
return sum;
}
}
}

Most efficient way to calculate Levenshtein distance

I just implemented a best match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found out that the overwhelming majority of time is spent calculating the distance between the query and the possible results. I am currently implementing the algorithm to calculate the Levenshtein Distance using a 2-D array, which makes the implementation an O(n^2) operation. I was hoping someone could suggest a faster way of doing the same.
Here's my implementation:
public int calculate(String root, String query)
{
int arr[][] = new int[root.length() + 2][query.length() + 2];
for (int i = 2; i < root.length() + 2; i++)
{
arr[i][0] = (int) root.charAt(i - 2);
arr[i][1] = (i - 1);
}
for (int i = 2; i < query.length() + 2; i++)
{
arr[0][i] = (int) query.charAt(i - 2);
arr[1][i] = (i - 1);
}
for (int i = 2; i < root.length() + 2; i++)
{
for (int j = 2; j < query.length() + 2; j++)
{
int diff = 0;
if (arr[0][j] != arr[i][0])
{
diff = 1;
}
arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
}
}
return arr[root.length() + 1][query.length() + 1];
}
public int min(int n1, int n2, int n3)
{
return (int) Math.min(n1, Math.min(n2, n3));
}

The wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation -- the most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!) you can reduce the computation to O(n times k) instead of O(n squared) (basically by giving up as soon as the minimum possible distance becomes > k).
Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far -- this won't affect the worst case behavior (as the matches might be in decreasing order of distance, meaning you'll never bail out any sooner) but average case should improve.
I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).

According to a comment on this blog, Speeding Up Levenshtein, you can use VP-Trees and achieve O(nlogn). Another comment on the same blog points to a python implementation of VP-Trees and Levenshtein. Please let us know if this works.

The Wikipedia article discusses your algorithm, and various improvements. However, it appears that at least in the general case, O(n^2) is the best you can get.
There are however some improvements if you can restrict your problem (e.g. if you are only interested in the distance if it's smaller than d, complexity is O(dn) - this might make sense as a match whose distance is close to the string length is probably not very interesting ). See if you can exploit the specifics of your problem...

I modified the Levenshtein distance VBA function found on this post to use a one dimensional array. It performs much faster.
'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2
L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1
For i = 0 To L1 Step 1: D(i) = i: Next i 'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j 'setup array positions 0,1,2,3,4,...
For j = 1 To L2
ssL = (L1 + 1) * j
For i = (ssL + 1) To (ssL + L1)
If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
cI = D(i - 1) + 1
cD = D(i - L1p1) + 1
cS = D(i - L1p2) + cost
If cI <= cD Then 'Insertion or Substitution
If cI <= cS Then D(i) = cI Else D(i) = cS
Else 'Deletion or Substitution
If cD <= cS Then D(i) = cD Else D(i) = cS
End If
Next i
Next j
LevenshteinDistance2 = D(LD)
End Function
I have tested this function with string 's1' of length 11,304 and 's2' of length 5,665 ( > 64 million character comparisons). With the above single dimension version of the function, the execution time is ~24 seconds on my machine. The original two dimensional function that I referenced in the link above requires ~37 seconds for the same strings. I have optimized the single dimensional function further as shown below and it requires ~10 seconds for the same strings.
'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long 'loop counters, loop step
Dim ssL As Long, cost As Long 'loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2
Dim sss1() As String, sss2() As String 'Character arrays for string S1 & S2
L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1
For i = 0 To L1 Step 1: D(i) = i: Next i 'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j 'setup array positions 0,1,2,3,4,...
ReDim sss1(1 To L1) 'Size character array S1
ReDim sss2(1 To L2) 'Size character array S2
For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i 'Fill S1 character array
For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i 'Fill S2 character array
For j = 1 To L2
ssL = (L1 + 1) * j
For i = (ssL + 1) To (ssL + L1)
If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
cI = D(i - 1) + 1
cD = D(i - L1p1) + 1
cS = D(i - L1p2) + cost
If cI <= cD Then 'Insertion or Substitution
If cI <= cS Then D(i) = cI Else D(i) = cS
Else 'Deletion or Substitution
If cD <= cS Then D(i) = cD Else D(i) = cS
End If
Next i
Next j
LevenshteinDistance = D(LD)
End Function

Commons-lang has a pretty fast implementation. See http://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htm.
Here's my translation of that into Scala:
// The code below is based on code from the Apache Commons lang project.
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with this
* work for additional information regarding copyright ownership. The ASF
* licenses this file to You under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance with the
* License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/
/**
* assert(levenshtein("algorithm", "altruistic")==6)
* assert(levenshtein("1638452297", "444488444")==9)
* assert(levenshtein("", "") == 0)
* assert(levenshtein("", "a") == 1)
* assert(levenshtein("aaapppp", "") == 7)
* assert(levenshtein("frog", "fog") == 1)
* assert(levenshtein("fly", "ant") == 3)
* assert(levenshtein("elephant", "hippo") == 7)
* assert(levenshtein("hippo", "elephant") == 7)
* assert(levenshtein("hippo", "zzzzzzzz") == 8)
* assert(levenshtein("hello", "hallo") == 1)
*
*/
def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = {
import scala.annotation.tailrec
def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = {
// Inside impl n <= m!
val p = new Array[Int](n + 1) // 'previous' cost array, horizontally
val d = new Array[Int](n + 1) // cost array, horizontally
#tailrec def fillP(i: Int) {
p(i) = i
if (i < n) fillP(i + 1)
}
fillP(0)
#tailrec def eachJ(j: Int, t_j: Char, d: Array[Int], p: Array[Int]): Int = {
d(0) = j
#tailrec def eachI(i: Int) {
val a = d(i - 1) + 1
val b = p(i) + 1
d(i) = if (a < b) a else {
val c = if (s.charAt(i - 1) == t_j) p(i - 1) else p(i - 1) + 1
if (b < c) b else c
}
if (i < n)
eachI(i + 1)
}
eachI(1)
if (j < m)
eachJ(j + 1, t.charAt(j), p, d)
else
d(n)
}
eachJ(1, t.charAt(0), d, p)
}
val n = s.length
val m = t.length
if (n == 0) m else if (m == 0) n else {
if (n > m) impl(t, s, m, n) else impl(s, t, n, m)
}
}

I know this is very late but it is relevant to the discussion at hand.
As mentioned by others, if all you want to do is check whether the edit distance between two strings is within some threshold k, you can reduce the time complexity to O(kn). A more precise expression would be O((2k+1)n). You take a strip which spans k cells either side of the diagonal cell (length of strip 2k+1) and compute the values of cells lying on this strip.
Interestingly, there's been an improvement by Li et. al. and this has been further reduced to O((k+1)n).

Counting, reversed bit pattern

I am trying to find an algorithm to count from 0 to 2n-1 but their bit pattern reversed. I care about only n LSB of a word. As you may have guessed I failed.
For n=3:
000 -> 0
100 -> 4
010 -> 2
110 -> 6
001 -> 1
101 -> 5
011 -> 3
111 -> 7
You get the idea.
Answers in pseudo-code is great. Code fragments in any language are welcome, answers without bit operations are preferred.
Please don't just post a fragment without even a short explanation or a pointer to a source.
Edit: I forgot to add, I already have a naive implementation which just bit-reverses a count variable. In a sense, this method is not really counting.

This is, I think easiest with bit operations, even though you said this wasn't preferred
Assuming 32 bit ints, here's a nifty chunk of code that can reverse all of the bits without doing it in 32 steps:
unsigned int i;
i = (i & 0x55555555) << 1 | (i & 0xaaaaaaaa) >> 1;
i = (i & 0x33333333) << 2 | (i & 0xcccccccc) >> 2;
i = (i & 0x0f0f0f0f) << 4 | (i & 0xf0f0f0f0) >> 4;
i = (i & 0x00ff00ff) << 8 | (i & 0xff00ff00) >> 8;
i = (i & 0x0000ffff) << 16 | (i & 0xffff0000) >> 16;
i >>= (32 - n);
Essentially this does an interleaved shuffle of all of the bits. Each time around half of the bits in the value are swapped with the other half.
The last line is necessary to realign the bits so that bin "n" is the most significant bit.
Shorter versions of this are possible if "n" is <= 16, or <= 8

At each step, find the leftmost 0 digit of your value. Set it, and clear all digits to the left of it. If you don't find a 0 digit, then you've overflowed: return 0, or stop, or crash, or whatever you want.
This is what happens on a normal binary increment (by which I mean it's the effect, not how it's implemented in hardware), but we're doing it on the left instead of the right.
Whether you do this in bit ops, strings, or whatever, is up to you. If you do it in bitops, then a clz (or call to an equivalent hibit-style function) on ~value might be the most efficient way: __builtin_clz where available. But that's an implementation detail.

This solution was originally in binary and converted to conventional math as the requester specified.
It would make more sense as binary, at least the multiply by 2 and divide by 2 should be << 1 and >> 1 for speed, the additions and subtractions probably don't matter one way or the other.
If you pass in mask instead of nBits, and use bitshifting instead of multiplying or dividing, and change the tail recursion to a loop, this will probably be the most performant solution you'll find since every other call it will be nothing but a single add, it would only be as slow as Alnitak's solution once every 4, maybe even 8 calls.
int incrementBizarre(int initial, int nBits)
// in the 3 bit example, this should create 100
mask=2^(nBits-1)
// This should only return true if the first (least significant) bit is not set
// if initial is 011 and mask is 100
// 3 4, bit is not set
if(initial < mask)
// If it was not, just set it and bail.
return initial+ mask // 011 (3) + 100 (4) = 111 (7)
else
// it was set, are we at the most significant bit yet?
// mask 100 (4) / 2 = 010 (2), 001/2 = 0 indicating overflow
if(mask / 2) > 0
// No, we were't, so unset it (initial-mask) and increment the next bit
return incrementBizarre(initial - mask, mask/2)
else
// Whoops we were at the most significant bit. Error condition
throw new OverflowedMyBitsException()
Wow, that turned out kinda cool. I didn't figure in the recursion until the last second there.
It feels wrong--like there are some operations that should not work, but they do because of the nature of what you are doing (like it feels like you should get into trouble when you are operating on a bit and some bits to the left are non-zero, but it turns out you can't ever be operating on a bit unless all the bits to the left are zero--which is a very strange condition, but true.
Example of flow to get from 110 to 001 (backwards 3 to backwards 4):
mask 100 (4), initial 110 (6); initial < mask=false; initial-mask = 010 (2), now try on the next bit
mask 010 (2), initial 010 (2); initial < mask=false; initial-mask = 000 (0), now inc the next bit
mask 001 (1), initial 000 (0); initial < mask=true; initial + mask = 001--correct answer

Here's a solution from my answer to a different question that computes the next bit-reversed index without looping. It relies heavily on bit operations, though.
The key idea is that incrementing a number simply flips a sequence of least-significant bits, for example from nnnn0111 to nnnn1000. So in order to compute the next bit-reversed index, you have to flip a sequence of most-significant bits. If your target platform has a CTZ ("count trailing zeros") instruction, this can be done efficiently.
Example in C using GCC's __builtin_ctz:
void iter_reversed(unsigned bits) {
unsigned n = 1 << bits;
for (unsigned i = 0, j = 0; i < n; i++) {
printf("%x\n", j);
// Compute a mask of LSBs.
unsigned mask = i ^ (i + 1);
// Length of the mask.
unsigned len = __builtin_ctz(~mask);
// Align the mask to MSB of n.
mask <<= bits - len;
// XOR with mask.
j ^= mask;
}
}
Without a CTZ instruction, you can also use integer division:
void iter_reversed(unsigned bits) {
unsigned n = 1 << bits;
for (unsigned i = 0, j = 0; i < n; i++) {
printf("%x\n", j);
// Find least significant zero bit.
unsigned bit = ~i & (i + 1);
// Using division to bit-reverse a single bit.
unsigned rev = (n / 2) / bit;
// XOR with mask.
j ^= (n - 1) & ~(rev - 1);
}
}

void reverse(int nMaxVal, int nBits)
{
int thisVal, bit, out;
// Calculate for each value from 0 to nMaxVal.
for (thisVal=0; thisVal<=nMaxVal; ++thisVal)
{
out = 0;
// Shift each bit from thisVal into out, in reverse order.
for (bit=0; bit<nBits; ++bit)
out = (out<<1) + ((thisVal>>bit) & 1)
}
printf("%d -> %d\n", thisVal, out);
}

Maybe increment from 0 to N (the "usual" way") and do ReverseBitOrder() for each iteration. You can find several implementations here (I like the LUT one the best).
Should be really quick.

Here's an answer in Perl. You don't say what comes after the all ones pattern, so I just return zero. I took out the bitwise operations so that it should be easy to translate into another language.
sub reverse_increment {
my($n, $bits) = #_;
my $carry = 2**$bits;
while($carry > 1) {
$carry /= 2;
if($carry > $n) {
return $carry + $n;
} else {
$n -= $carry;
}
}
return 0;
}

Here's a solution which doesn't actually try to do any addition, but exploits the on/off pattern of the seqence (most sig bit alternates every time, next most sig bit alternates every other time, etc), adjust n as desired:
#define FLIP(x, i) do { (x) ^= (1 << (i)); } while(0)
int main() {
int n = 3;
int max = (1 << n);
int x = 0;
for(int i = 1; i <= max; ++i) {
std::cout << x << std::endl;
/* if n == 3, this next part is functionally equivalent to this:
*
* if((i % 1) == 0) FLIP(x, n - 1);
* if((i % 2) == 0) FLIP(x, n - 2);
* if((i % 4) == 0) FLIP(x, n - 3);
*/
for(int j = 0; j < n; ++j) {
if((i % (1 << j)) == 0) FLIP(x, n - (j + 1));
}
}
}

How about adding 1 to the most significant bit, then carrying to the next (less significant) bit, if necessary. You could speed this up by operating on bytes:
Precompute a lookup table for counting in bit-reverse from 0 to 256 (00000000 -> 10000000, 10000000 -> 01000000, ..., 11111111 -> 00000000).
Set all bytes in your multi-byte number to zero.
Increment the most significant byte using the lookup table. If the byte is 0, increment the next byte using the lookup table. If the byte is 0, increment the next byte...
Go to step 3.

With n as your power of 2 and x the variable you want to step:
(defun inv-step (x n) ; the following is a function declaration
"returns a bit-inverse step of x, bounded by 2^n" ; documentation
(do ((i (expt 2 (- n 1)) ; loop, init of i
(/ i 2)) ; stepping of i
(s x)) ; init of s as x
((not (integerp i)) ; breaking condition
s) ; returned value if all bits are 1 (is 0 then)
(if (< s i) ; the loop's body: if s < i
(return-from inv-step (+ s i)) ; -> add i to s and return the result
(decf s i)))) ; else: reduce s by i
I commented it thoroughly as you may not be familiar with this syntax.
edit: here is the tail recursive version. It seems to be a little faster, provided that you have a compiler with tail call optimization.
(defun inv-step (x n)
(let ((i (expt 2 (- n 1))))
(cond ((= n 1)
(if (zerop x) 1 0)) ; this is really (logxor x 1)
((< x i)
(+ x i))
(t
(inv-step (- x i) (- n 1))))))

When you reverse 0 to 2^n-1 but their bit pattern reversed, you pretty much cover the entire 0-2^n-1 sequence
Sum = 2^n * (2^n+1)/2
O(1) operation. No need to do bit reversals

Edit: Of course original poster's question was about to do increment by (reversed) one, which makes things more simple than adding two random values. So nwellnhof's answer contains the algorithm already.
Summing two bit-reversal values
Here is one solution in php:
function RevSum ($a,$b) {
// loop until our adder, $b, is zero
while ($b) {
// get carry (aka overflow) bit for every bit-location by AND-operation
// 0 + 0 --> 00 no overflow, carry is "0"
// 0 + 1 --> 01 no overflow, carry is "0"
// 1 + 0 --> 01 no overflow, carry is "0"
// 1 + 1 --> 10 overflow! carry is "1"
$c = $a & $b;
// do 1-bit addition for every bit location at once by XOR-operation
// 0 + 0 --> 00 result = 0
// 0 + 1 --> 01 result = 1
// 1 + 0 --> 01 result = 1
// 1 + 1 --> 10 result = 0 (ignored that "1", already taken care above)
$a ^= $b;
// now: shift carry bits to the next bit-locations to be added to $a in
// next iteration.
// PHP_INT_MAX here is used to ensure that the most-significant bit of the
// $b will be cleared after shifting. see link in the side note below.
$b = ($c >> 1) & PHP_INT_MAX;
}
return $a;
}
Side note: See this question about shifting negative values.
And as for test; start from zero and increment value by 8-bit reversed one (10000000):
$value = 0;
$add = 0x80; // 10000000 <-- "one" as bit reversed
for ($count = 20; $count--;) { // loop 20 times
printf("%08b\n", $value); // show value as 8-bit binary
$value = RevSum($value, $add); // do addition
}
... will output:
00000000
10000000
01000000
11000000
00100000
10100000
01100000
11100000
00010000
10010000
01010000
11010000
00110000
10110000
01110000
11110000
00001000
10001000
01001000
11001000

Let assume number 1110101 and our task is to find next one.
1) Find zero on highest position and mark position as index.
11101010 (4th position, so index = 4)
2) Set to zero all bits on position higher than index.
00001010
3) Change founded zero from step 1) to '1'
00011010
That's it. This is by far the fastest algorithm since most of cpu's has instructions to achieve this very efficiently. Here is a C++ implementation which increment 64bit number in reversed patern.
#include <intrin.h>
unsigned __int64 reversed_increment(unsigned __int64 number)
{
unsigned long index, result;
_BitScanReverse64(&index, ~number); // returns index of the highest '1' on bit-reverse number (trick to find the highest '0')
result = _bzhi_u64(number, index); // set to '0' all bits at number higher than index position
result |= (unsigned __int64) 1 << index; // changes to '1' bit on index position
return result;
}
Its not hit your requirements to have "no bits" operations, however i fear there is now way how to achieve something similar without them.

Count the number of set bits in a 32-bit integer

8 bits representing the number 7 look like this:
00000111
Three bits are set.
What are the algorithms to determine the number of set bits in a 32-bit integer?

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.
Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86's popcnt (on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed - hardware popcount is normally fast if it exists at all.).
The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.
Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20 std::popcount(), or C++ std::bitset<32>::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.
Portable algorithms that don't need (or benefit from) any HW support
A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.
If you know that your bytes will be mostly 0's or mostly 1's then there are efficient algorithms for these scenarios, e.g. clearing the lowest set with a bithack in a loop until it becomes zero.
I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):
GCC10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds. (https://godbolt.org/z/qGdh1dvKK)
int numberOfSetBits(uint32_t i)
{
// Java: use int, and use >>> instead of >>. Or use Integer.bitCount()
// C or C++: use uint32_t
i = i - ((i >> 1) & 0x55555555); // add pairs of bits
i = (i & 0x33333333) + ((i >> 2) & 0x33333333); // quads
i = (i + (i >> 4)) & 0x0F0F0F0F; // groups of 8
return (i * 0x01010101) >> 24; // horizontal sum of bytes
}
For JavaScript: coerce to integer with |0 for performance: change the first line to i = (i|0) - ((i >> 1) & 0x55555555);
This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)
References:
https://graphics.stanford.edu/~seander/bithacks.html
https://catonmat.net/low-level-bit-hacks for bithack basics, like how subtracting 1 flips contiguous zeros.
https://en.wikipedia.org/wiki/Hamming_weight
http://gurmeet.net/puzzles/fast-bit-counting-routines/
http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)
How this SWAR bithack works:
i = i - ((i >> 1) & 0x55555555);
The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). Like (i & 0x55555555) + ((i>>1) & 0x55555555).
The next step takes the odd/even eight of those 16x 2-bit accumulators and adds again, producing 8x 4-bit sums. The i - ... optimization isn't possible this time so it does just mask before / after shifting. Using the same 0x33... constant both times instead of 0xccc... before shifting is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.
The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4, if all 4 bits of the corresponding input bits were set. 4+4 = 8 which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).
So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:
Once we have few enough "elements", a multiply with a magic constant can sum all the elements into the top element. In this case byte elements. Multiply is done by left-shifting and adding, so a multiply of x * 0x01010101 results in x + (x<<8) + (x<<16) + (x<<24). Our 8-bit elements are wide enough (and holding small enough counts) that this doesn't produce carry into that top 8 bits.
A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with >>56. So it doesn't take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll on x86 systems when the hardware popcnt instruction isn't enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.
With full SIMD for wider vectors (e.g. counting a whole array)
This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)
However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).
On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.
https://github.com/WojciechMula/sse-popcount state-of-the-art x86 SIMD popcount for SSSE3, AVX2, AVX512BW, AVX512VBMI, or AVX512 VPOPCNT. Using Harley-Seal across vectors to defer popcount within an element. (Also ARM NEON)
Counting 1 bits (population count) on large data using AVX-512 or AVX-2
related: https://github.com/mklarqvist/positional-popcount - separate counts for each bit-position of multiple 8, 16, 32, or 64-bit integers. (Again, x86 SIMD including AVX-512 which is really good at this, with vpternlogd making Harley-Seal very good.)

Some languages portably expose the operation in a way that can use efficient hardware support if available, otherwise some library fallback that's hopefully decent.
For example (from a table by language):
C++ has std::bitset<>::count(), or C++20 std::popcount(T x)
Java has java.lang.Integer.bitCount() (also for Long or BigInteger)
C# has System.Numerics.BitOperations.PopCount()
Python has int.bit_count() (since 3.10)
Not all compilers / libraries actually manage to use HW support when it's available, though. (Notably MSVC, even with options that make std::popcount inline as x86 popcnt, its std::bitset::count still always uses a lookup table. This will hopefully change in future versions.)
Also consider the built-in functions of your compiler when the portable language doesn't have this basic bit operation. In GNU C for example:
int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);
In the worst case (no single-instruction HW support) the compiler will generate a call to a function (which in current GCC uses a shift/and bit-hack like this answer, at least for x86). In the best case the compiler will emit a cpu instruction to do the job. (Just like a * or / operator - GCC will use a hardware multiply or divide instruction if available, otherwise will call a libgcc helper function.) Or even better, if the operand is a compile-time constant after inlining, it can do constant-propagation to get a compile-time-constant popcount result.
The GCC builtins even work across multiple platforms. Popcount has almost become mainstream in the x86 architecture, so it makes sense to start using the builtin now so you can recompile to let it inline a hardware instruction when you compile with -mpopcnt or something that includes that (e.g. https://godbolt.org/z/Ma5e5a). Other architectures have had popcount for years, but in the x86 world there are still some ancient Core 2 and similar vintage AMD CPUs in use.
On x86, you can tell the compiler that it can assume support for popcnt instruction with -mpopcnt (also implied by -msse4.2). See GCC x86 options. -march=nehalem -mtune=skylake (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.
To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).
MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc it's really an intrinsic for the hardware instruction and requires hardware support.
Using std::bitset<>::count() instead of a built-in
In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.
For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and it's std::bitset<>::count always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, which in turn implies the popcnt feature.) (Update: see below; that does get MSVC's C++20 std::popcount to use x86 popcnt, but still not its bitset<>::count. MSVC could fix that by updating their standard library headers to use std::popcount when available.)
But at least you get something portable that works everywhere, and with gcc/clang with the right target options, you get hardware popcount for architectures that support it.
#include <bitset>
#include <limits>
#include <type_traits>
template<typename T>
//static inline // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value, unsigned >::type
popcount(T x)
{
static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");
// sizeof(x)*CHAR_BIT
constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
// std::bitset constructor was only unsigned long before C++11. Beware if porting to C++03
static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");
typedef typename std::make_unsigned<T>::type UT; // probably not needed, bitset width chops after sign-extension
std::bitset<bitwidth> bs( static_cast<UT>(x) );
return bs.count();
}
See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.
x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:
unsigned test_short(short a) { return popcount(a); }
movzx eax, di # note zero-extension, not sign-extension
popcnt rax, rax
ret
unsigned test_int(int a) { return popcount(a); }
mov eax, edi
popcnt rax, rax # unnecessary 64-bit operand size
ret
unsigned test_u64(unsigned long long a) { return popcount(a); }
xor eax, eax # gcc avoids false dependencies for Intel CPUs
popcnt rax, rdi
ret
PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):
rldicl 3,3,0,32 # zero-extend from 32 to 64-bit
popcntd 3,3 # popcount
blr
This source isn't x86-specific or GNU-specific at all, but only compiles well with gcc/clang/icc, at least when targeting x86 (including x86-64).
Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.
C++20 has std::popcount(T)
Current libstdc++ headers unfortunately define it with a special case if(x==0) return 0; at the start, which clang doesn't optimize away when compiling for x86:
#include <bit>
int bar(unsigned x) {
return std::popcount(x);
}
clang 11.0.1 -O3 -std=gnu++20 -march=nehalem (https://godbolt.org/z/arMe5a)
# clang 11
bar(unsigned int): # #bar(unsigned int)
popcnt eax, edi
cmove eax, edi # redundant: if popcnt result is 0, return the original 0 instead of the popcnt-generated 0...
ret
But GCC compiles nicely:
# gcc 10
xor eax, eax # break false dependency on Intel SnB-family before Ice Lake.
popcnt eax, edi
ret
Even MSVC does well with it, as long as you use -arch:AVX or later (and enable C++20 with -std:c++latest). https://godbolt.org/z/7K4Gef
int bar(unsigned int) PROC ; bar, COMDAT
popcnt eax, ecx
ret 0
int bar(unsigned int) ENDP ; bar

In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution which some have already provided but I prefer readability over cleverness any time.
unsigned int bitCount (unsigned int value) {
unsigned int count = 0;
while (value > 0) { // until all bits are zero
if ((value & 1) == 1) // check lower bit
count++;
value >>= 1; // shift bits, removing lower bit
}
return count;
}
If you want more speed (and assuming you document it well to help out your successors), you could use a table lookup:
// Lookup table for fast calculation of bits set in 8-bit unsigned char.
static unsigned char oneBitsInUChar[] = {
// 0 1 2 3 4 5 6 7 8 9 A B C D E F (<- n)
// =====================================================
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
: : :
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
};
// Function for fast calculation of bits set in 16-bit unsigned short.
unsigned char oneBitsInUShort (unsigned short x) {
return oneBitsInUChar [x >> 8]
+ oneBitsInUChar [x & 0xff];
}
// Function for fast calculation of bits set in 32-bit unsigned int.
unsigned char oneBitsInUInt (unsigned int x) {
return oneBitsInUShort (x >> 16)
+ oneBitsInUShort (x & 0xffff);
}
These rely on specific data type sizes so they're not that portable. But, since many performance optimisations aren't portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.

From Hacker's Delight, p. 66, Figure 5-2
int pop(unsigned x)
{
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
x = x + (x >> 8);
x = x + (x >> 16);
return x & 0x0000003F;
}
Executes in ~20-ish instructions (arch dependent), no branching.Hacker's Delight is delightful! Highly recommended.

I think the fastest way—without using lookup tables and popcount—is the following. It counts the set bits with just 12 operations.
int popcount(int v) {
v = v - ((v >> 1) & 0x55555555); // put count of each 2 bits into those 2 bits
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits
return ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}
It works because you can count the total number of set bits by dividing in two halves, counting the number of set bits in both halves and then adding them up. Also know as Divide and Conquer paradigm. Let's get into detail..
v = v - ((v >> 1) & 0x55555555);
The number of bits in two bits can be 0b00, 0b01 or 0b10. Lets try to work this out on 2 bits..
---------------------------------------------
| v | (v >> 1) & 0b0101 | v - x |
---------------------------------------------
0b00 0b00 0b00
0b01 0b00 0b01
0b10 0b01 0b01
0b11 0b01 0b10
This is what was required: the last column shows the count of set bits in every two bit pair. If the two bit number is >= 2 (0b10) then and produces 0b01, else it produces 0b00.
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
This statement should be easy to understand. After the first operation we have the count of set bits in every two bits, now we sum up that count in every 4 bits.
v & 0b00110011 //masks out even two bits
(v >> 2) & 0b00110011 // masks out odd two bits
We then sum up the above result, giving us the total count of set bits in 4 bits. The last statement is the most tricky.
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
Let's break it down further...
v + (v >> 4)
It's similar to the second statement; we are counting the set bits in groups of 4 instead. We know—because of our previous operations—that every nibble has the count of set bits in it. Let's look an example. Suppose we have the byte 0b01000010. It means the first nibble has its 4bits set and the second one has its 2bits set. Now we add those nibbles together.
v = 0b01000010
(v >> 4) = 0b00000100
v + (v >> 4) = 0b01000010 + 0b00000100
It gives us the count of set bits in a byte, in the second nibble 0b01000110 and therefore we mask the first four bytes of all the bytes in the number (discarding them).
0b01000110 & 0x0F = 0b00000110
Now every byte has the count of set bits in it. We need to add them up all together. The trick is to multiply the result by 0b10101010 which has an interesting property. If our number has four bytes, A B C D, it will result in a new number with these bytes A+B+C+D B+C+D C+D D. A 4 byte number can have maximum of 32 bits set, which can be represented as 0b00100000.
All we need now is the first byte which has the sum of all set bits in all the bytes, and we get it by >> 24. This algorithm was designed for 32 bit words but can be easily modified for 64 bit words.

If you happen to be using Java, the built-in method Integer.bitCount will do that.

I got bored, and timed a billion iterations of three approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.
Fastest is the following, at 3.7 seconds:
static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount( unsigned int i )
{
return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}
Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.
Third place goes to the bit-twiddling 'sideways addition' approach, which took 8.6 seconds.
Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.
The counting one-bit-at-a-time approach was waaaay slower, and I got bored of waiting for it to complete.
So if you care about performance above all else then use the first approach. If you care, but not enough to spend 64Kb of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.
It's hard to think of a situation where you'd want to use the bit-twiddling approach.
Edit: Similar results here.

unsigned int count_bit(unsigned int x)
{
x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
x = (x & 0x0000FFFF) + ((x >> 16)& 0x0000FFFF);
return x;
}
Let me explain this algorithm.
This algorithm is based on Divide and Conquer Algorithm. Suppose there is a 8bit integer 213(11010101 in binary), the algorithm works like this(each time merge two neighbor blocks):
+-------------------------------+
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | <- x
| 1 0 | 0 1 | 0 1 | 0 1 | <- first time merge
| 0 0 1 1 | 0 0 1 0 | <- second time merge
| 0 0 0 0 0 1 0 1 | <- third time ( answer = 00000101 = 5)
+-------------------------------+

This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3 using C++ inlines to eliminate function call overhead, one billion iterations, keeping the running sum of all counts to ensure the compiler doesn't remove anything important, using rdtsc for timing (clock cycle precise).
inline int pop2(unsigned x, unsigned y)
{
x = x - ((x >> 1) & 0x55555555);
y = y - ((y >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
y = (y + (y >> 4)) & 0x0F0F0F0F;
x = x + (x >> 8);
y = y + (y >> 8);
x = x + (x >> 16);
y = y + (y >> 16);
return (x+y) & 0x000000FF;
}
The unmodified Hacker's Delight took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. 10.5s total elapsed for both together on a 2.4GHz Core Duo. 25 gigacycles = just over 10 seconds at this clock frequency, so I'm confident my timings are right.
This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I was clever and added x+y a little sooner I could shave off some shifts. The 64-bit version with some small tweaks would come out about even, but count twice as many bits again.
With 128 bit SIMD registers, yet another factor of two, and the SSE instruction sets often have clever short-cuts, too.
There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced on-line in many places, and it's amenable to comprehensive unit test. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.
OK, I decided to bench the tweaked 64-bit version. For this one sizeof(unsigned long) == 8
inline int pop2(unsigned long x, unsigned long y)
{
x = x - ((x >> 1) & 0x5555555555555555);
y = y - ((y >> 1) & 0x5555555555555555);
x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
y = (y & 0x3333333333333333) + ((y >> 2) & 0x3333333333333333);
x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0F;
x = x + y;
x = x + (x >> 8);
x = x + (x >> 16);
x = x + (x >> 32);
return x & 0xFF;
}
That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. That later number summed 128 billion bits and corresponds to 5.9s elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode and it likes 64-bit registers slightly better than 32-bit registers.
Let's see if there's a bit more OOO pipelining to be had here. This was a bit more involved, so I actually tested a bit. Each term alone sums to 64, all combined sum to 256.
inline int pop4(unsigned long x, unsigned long y,
unsigned long u, unsigned long v)
{
enum { m1 = 0x5555555555555555,
m2 = 0x3333333333333333,
m3 = 0x0F0F0F0F0F0F0F0F,
m4 = 0x000000FF000000FF };
x = x - ((x >> 1) & m1);
y = y - ((y >> 1) & m1);
u = u - ((u >> 1) & m1);
v = v - ((v >> 1) & m1);
x = (x & m2) + ((x >> 2) & m2);
y = (y & m2) + ((y >> 2) & m2);
u = (u & m2) + ((u >> 2) & m2);
v = (v & m2) + ((v >> 2) & m2);
x = x + y;
u = u + v;
x = (x & m3) + ((x >> 4) & m3);
u = (u & m3) + ((u >> 4) & m3);
x = x + u;
x = x + (x >> 8);
x = x + (x >> 16);
x = x & m4;
x = x + (x >> 32);
return x & 0x000001FF;
}
I was excited for a moment, but it turns out gcc is playing inline tricks with -O3 even though I'm not using the inline keyword in some tests. When I let gcc play tricks, a billion calls to pop4() takes 12.56 gigacycles, but I determined it was folding arguments as constant expressions. A more realistic number appears to be 19.6gc for another 30% speed-up. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.
hitime b4 = rdtsc();
for (unsigned long i = 10L * 1000*1000*1000; i < 11L * 1000*1000*1000; ++i)
sum += pop4 (i, i^1, ~i, i|1);
hitime e4 = rdtsc();
256 billion bits summed in 8.17s elapsed. Works out to 1.02s for 32 million bits as benchmarked in the 16-bit table lookup. Can't compare directly, because the other bench doesn't give a clock speed, but looks like I've slapped the snot out of the 64KB table edition, which is a tragic use of L1 cache in the first place.
Update: decided to do the obvious and create pop6() by adding four more duplicated lines. Came out to 22.8gc, 384 billion bits summed in 9.5s elapsed. So there's another 20% Now at 800ms for 32 billion bits.

Why not iteratively divide by 2?
count = 0
while n > 0
if (n % 2) == 1
count += 1
n /= 2
I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue though that "best" should have an element of clarity

The Hacker's Delight bit-twiddling becomes so much clearer when you write out the bit patterns.
unsigned int bitCount(unsigned int x)
{
x = ((x >> 1) & 0b01010101010101010101010101010101)
+ (x & 0b01010101010101010101010101010101);
x = ((x >> 2) & 0b00110011001100110011001100110011)
+ (x & 0b00110011001100110011001100110011);
x = ((x >> 4) & 0b00001111000011110000111100001111)
+ (x & 0b00001111000011110000111100001111);
x = ((x >> 8) & 0b00000000111111110000000011111111)
+ (x & 0b00000000111111110000000011111111);
x = ((x >> 16)& 0b00000000000000001111111111111111)
+ (x & 0b00000000000000001111111111111111);
return x;
}
The first step adds the even bits to the odd bits, producing a sum of bits in each two. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until we have the final count taking up the entire int.

For a happy medium between a 232 lookup table and iterating through each bit individually:
int bitcount(unsigned int num){
int count = 0;
static int nibblebits[] =
{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
for(; num != 0; num >>= 4)
count += nibblebits[num & 0x0f];
return count;
}
From http://ctips.pbwiki.com/CountBits

This can be done in O(k), where k is the number of bits set.
int NumberOfSetBits(int n)
{
int count = 0;
while (n){
++ count;
n = (n - 1) & n;
}
return count;
}

It's not the fastest or best solution, but I found the same question in my way, and I started to think and think. finally I realized that it can be done like this if you get the problem from mathematical side, and draw a graph, then you find that it's a function which has some periodic part, and then you realize the difference between the periods... so here you go:
unsigned int f(unsigned int x)
{
switch (x) {
case 0:
return 0;
case 1:
return 1;
case 2:
return 1;
case 3:
return 2;
default:
return f(x/4) + f(x%4);
}
}

I think the Brian Kernighan's method will be useful too...
It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.
int countSetBits(unsigned int n) {
unsigned int n; // count the number of bits set in n
unsigned int c; // c accumulates the total bits set in n
for (c=0;n>0;n=n&(n-1)) c++;
return c;
}
Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp11-12 (although there was a brief reference in Volume 2, 4.6.3-(7).)
The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.

private int get_bits_set(int v)
{
int c; // 'c' accumulates the total bits set in 'v'
for (c = 0; v>0; c++)
{
v &= v - 1; // Clear the least significant bit set
}
return c;
}

Few open questions:-
If the number is negative then?
If the number is 1024 , then the "iteratively divide by 2" method will iterate 10 times.
we can modify the algo to support the negative number as follows:-
count = 0
while n != 0
if ((n % 2) == 1 || (n % 2) == -1
count += 1
n /= 2
return count
now to overcome the second problem we can write the algo like:-
int bit_count(int num)
{
int count=0;
while(num)
{
num=(num)&(num-1);
count++;
}
return count;
}
for complete reference see :
http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html

I use the below code which is more intuitive.
int countSetBits(int n) {
return !n ? 0 : 1 + countSetBits(n & (n-1));
}
Logic : n & (n-1) resets the last set bit of n.
P.S : I know this is not O(1) solution, albeit an interesting solution.

What do you means with "Best algorithm"? The shorted code or the fasted code? Your code look very elegant and it has a constant execution time. The code is also very short.
But if the speed is the major factor and not the code size then I think the follow can be faster:
static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };
static int bitCountOfByte( int value ){
return BIT_COUNT[ value & 0xFF ];
}
static int bitCountOfInt( int value ){
return bitCountOfByte( value )
+ bitCountOfByte( value >> 8 )
+ bitCountOfByte( value >> 16 )
+ bitCountOfByte( value >> 24 );
}
I think that this will not more faster for a 64 bit value but a 32 bit value can be faster.

I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same amount of cycles.) It assumes that small constants (such as 6, 12, 24) cost nothing to load into the registers, or are stored in temporaries and reused over and over again.
With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lowerbound.
#define BitCount(X,Y) \
Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
Y = ((Y + (Y >> 3)) & 030707070707); \
Y = (Y + (Y >> 6)); \
Y = (Y + (Y >> 12) + (Y >> 24)) & 077;
Here is a secret to the first and most complex step:
input output
AB CD Note
00 00 = AB
01 01 = AB
10 01 = AB - (A >> 1) & 0x1
11 10 = AB - (A >> 1) & 0x1
so if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.
Don Gillies

if you're using C++ another option is to use template metaprogramming:
// recursive template to sum bits in an int
template <int BITS>
int countBits(int val) {
// return the least significant bit plus the result of calling ourselves with
// .. the shifted value
return (val & 0x1) + countBits<BITS-1>(val >> 1);
}
// template specialisation to terminate the recursion when there's only one bit left
template<>
int countBits<1>(int val) {
return val & 0x1;
}
usage would be:
// to count bits in a byte/char (this returns 8)
countBits<8>( 255 )
// another byte (this returns 7)
countBits<8>( 254 )
// counting bits in a word/short (this returns 1)
countBits<16>( 256 )
you could of course further expand this template to use different types (even auto-detecting bit size) but I've kept it simple for clarity.
edit: forgot to mention this is good because it should work in any C++ compiler and it basically just unrolls your loop for you if a constant value is used for the bit count (in other words, I'm pretty sure it's the fastest general method you'll find)

C++20 std::popcount
The following proposal has been merged http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0553r4.html and should add it to a the <bit> header.
I expect the usage to be like:
#include <bit>
#include <iostream>
int main() {
std::cout << std::popcount(0x55) << std::endl;
}
I'll give it a try when support arrives to GCC, GCC 9.1.0 with g++-9 -std=c++2a still doesn't support it.
The proposal says:
Header: <bit>
namespace std {
// 25.5.6, counting
template<class T>
constexpr int popcount(T x) noexcept;
and:
template<class T>
constexpr int popcount(T x) noexcept;
Constraints: T is an unsigned integer type (3.9.1 [basic.fundamental]).
Returns: The number of 1 bits in the value of x.
std::rotl and std::rotr were also added to do circular bit rotations: Best practices for circular shift (rotate) operations in C++

You can do:
while(n){
n = n & (n-1);
count++;
}
The logic behind this is the bits of n-1 is inverted from rightmost set bit of n.
If n=6, i.e., 110 then 5 is 101 the bits are inverted from rightmost set bit of n.
So if we & these two we will make the rightmost bit 0 in every iteration and always go to the next rightmost set bit. Hence, counting the set bit. The worst time complexity will be O(log n) when every bit is set.

I'm particularly fond of this example from the fortune file:
#define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x) ((x) - (((x)>>1)&0x77777777)
- (((x)>>2)&0x33333333)
- (((x)>>3)&0x11111111))
I like it best because it's so pretty!

Java JDK1.5
Integer.bitCount(n);
where n is the number whose 1's are to be counted.
check also,
Integer.highestOneBit(n);
Integer.lowestOneBit(n);
Integer.numberOfLeadingZeros(n);
Integer.numberOfTrailingZeros(n);
//Beginning with the value 1, rotate left 16 times
n = 1;
for (int i = 0; i < 16; i++) {
n = Integer.rotateLeft(n, 1);
System.out.println(n);
}

I found an implementation of bit counting in an array with using of SIMD instruction (SSSE3 and AVX2). It has in 2-2.5 times better performance than if it will use __popcnt64 intrinsic function.
SSSE3 version:
#include <smmintrin.h>
#include <stdint.h>
const __m128i Z = _mm_set1_epi8(0x0);
const __m128i F = _mm_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m128i T = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
uint64_t BitCount(const uint8_t * src, size_t size)
{
__m128i _sum = _mm128_setzero_si128();
for (size_t i = 0; i < size; i += 16)
{
//load 16-byte vector
__m128i _src = _mm_loadu_si128((__m128i*)(src + i));
//get low 4 bit for every byte in vector
__m128i lo = _mm_and_si128(_src, F);
//sum precalculated value from T
_sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, lo)));
//get high 4 bit for every byte in vector
__m128i hi = _mm_and_si128(_mm_srli_epi16(_src, 4), F);
//sum precalculated value from T
_sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, hi)));
}
uint64_t sum[2];
_mm_storeu_si128((__m128i*)sum, _sum);
return sum[0] + sum[1];
}
AVX2 version:
#include <immintrin.h>
#include <stdint.h>
const __m256i Z = _mm256_set1_epi8(0x0);
const __m256i F = _mm256_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m256i T = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
uint64_t BitCount(const uint8_t * src, size_t size)
{
__m256i _sum = _mm256_setzero_si256();
for (size_t i = 0; i < size; i += 32)
{
//load 32-byte vector
__m256i _src = _mm256_loadu_si256((__m256i*)(src + i));
//get low 4 bit for every byte in vector
__m256i lo = _mm256_and_si256(_src, F);
//sum precalculated value from T
_sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, lo)));
//get high 4 bit for every byte in vector
__m256i hi = _mm256_and_si256(_mm256_srli_epi16(_src, 4), F);
//sum precalculated value from T
_sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, hi)));
}
uint64_t sum[4];
_mm256_storeu_si256((__m256i*)sum, _sum);
return sum[0] + sum[1] + sum[2] + sum[3];
}

A fast C# solution using a pre-calculated table of Byte bit counts with branching on the input size.
public static class BitCount
{
public static uint GetSetBitsCount(uint n)
{
var counts = BYTE_BIT_COUNTS;
return n <= 0xff ? counts[n]
: n <= 0xffff ? counts[n & 0xff] + counts[n >> 8]
: n <= 0xffffff ? counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff]
: counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff] + counts[(n >> 24) & 0xff];
}
public static readonly uint[] BYTE_BIT_COUNTS =
{
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};
}

I always use this in competitive programming, and it's easy to write and is efficient:
#include <bits/stdc++.h>
using namespace std;
int countOnes(int n) {
bitset<32> b(n);
return b.count();
}

There are many algorithm to count the set bits; but i think the best one is the faster one!
You can see the detailed on this page:
Bit Twiddling Hacks
I suggest this one:
Counting bits set in 14, 24, or 32-bit words using 64-bit instructions
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
// option 1, for at most 14-bit values in v:
c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
// option 2, for at most 24-bit values in v:
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
% 0x1f;
// option 3, for at most 32-bit values in v:
c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) %
0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Neon Optimization for multiplication and store in ARM - gcc

Related

CRC Reverse Engineer (Checksum from Machine / PC)

F# image manipulation performance problem

Most efficient way to calculate Levenshtein distance

Counting, reversed bit pattern

Count the number of set bits in a 32-bit integer

Categories

Resources