Usage of _mm_shuffle_epi8 intrinsic - performance

Can someone please explain the _mm_shuffle_epi8 SSSE3 intrinsic?
I know it shuffles 16 8-bit integers in an __m128i, but I'm not sure how I could use this.
I basically want to use _mm_shuffle_epi8 to modify the function below to get better performance.
while (not done) {
    dest[i+0] = (src+j)->a;
    dest[i+1] = (src+j)->b;
    dest[i+2] = (src+j)->c;
    dest[i+3] = (src+j+1)->a;
    dest[i+4] = (src+j+1)->b;
    dest[i+5] = (src+j+1)->c;
    i += 6;
    j += 2;
}

_mm_shuffle_epi8 (better known as pshufb) essentially does this:
// src holds the 16 control bytes, dst the 16 data bytes being shuffled
temp = dst;
for (int i = 0; i < 16; i++)
    dst[i] = (src[i] & 0x80) == 0 ? temp[src[i] & 15] : 0;
As for whether you can use it here, it's impossible to tell without knowing the types involved. It won't be "nice" anyway because the destination is a block of 6 bytes (or words? or dwords?). You could make that work by unrolling and doing a lot of shifting and or-ing.
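To make the semantics above concrete, here is a scalar model of the byte shuffle in plain C++, so it can be tried without SSSE3 hardware. The names Bytes16, shuffle_bytes_model and pack3of4_control are made up for this sketch, and the control mask assumes, purely for illustration, that each source element occupies 4 bytes with a, b and c in the first three; the question does not state the real layout.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Scalar model of pshufb: an output byte is 0 when the control byte has its
// top bit set, otherwise it is the data byte selected by the low 4 control bits.
Bytes16 shuffle_bytes_model(const Bytes16& data, const Bytes16& control) {
    Bytes16 out{};
    for (int i = 0; i < 16; ++i) {
        out[i] = (control[i] & 0x80) ? 0 : data[control[i] & 0x0F];
    }
    return out;
}

// Hypothetical control mask (assumed 4-byte elements): keep the first three
// bytes of each element, pack them together, and zero the last four outputs.
Bytes16 pack3of4_control() {
    const Bytes16 control = {{0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14,
                              0x80, 0x80, 0x80, 0x80}};
    return control;
}
```

With the real intrinsic the same control bytes would be placed in an __m128i (for example with _mm_setr_epi8), and each 16-byte load would yield 12 useful output bytes under this assumed layout.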

Here's an example of using the intrinsic; you'll have to work out how to apply it to your particular situation. This code endian-swaps four 32-bit integers at a time:
#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8
// length is assumed to be a multiple of 4
unsigned int *bswap(unsigned int *destination, unsigned int *source, int length) {
    int i;
    // control bytes: reverse the byte order within each 32-bit lane
    __m128i mask = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);
    for (i = 0; i < length; i += 4) {
        _mm_storeu_si128((__m128i *)&destination[i],
            _mm_shuffle_epi8(_mm_loadu_si128((__m128i *)&source[i]), mask));
    }
    return destination;
}


Minimizing the number of warehouses for an order

I am trying to figure out an algorithm to efficiently solve the following problem:
There are w warehouses that store p different products with different quantities
A customer places an order on n out of the p products
The goal is to pick the minimum number of warehouses from which the order could be allocated.
E.g. the distribution of inventory in three warehouses is as follows
| | Product 1 | Product 2 | Product 3 |
|---------------|---------------|---------------|---------------|
| Warehouse 1 | 2 | 5 | 0 |
| Warehouse 2 | 1 | 4 | 4 |
| Warehouse 3 | 3 | 1 | 4 |
Now suppose an order is placed with the following ordered quantities:
| | Product 1 | Product 2 | Product 3 |
|---------------|---------------|---------------|---------------|
| Ordered Qty | 5 | 4 | 1 |
The optimal solution here would be to allocate the order from Warehouse 1 and Warehouse 3. No smaller subset of the three warehouses can fulfil the order.
I have tried using brute force to solve this, however, for a larger number of warehouses, the algorithm performs very poorly. I have also tried a few greedy allocation algorithms, however, as expected, they are unable to minimize the number of sub-orders in many cases. Are there any other algorithms/approaches that I should look into?
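For reference, the brute-force baseline mentioned above can be sketched as follows. This is only an illustration of the exhaustive approach, not the asker's actual code: the names Stock and min_warehouses_bruteforce are made up, and the sketch assumes fewer than 32 warehouses so that a subset fits in a 32-bit mask.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using Stock = std::vector<std::vector<std::uint32_t>>;

// Tries every subset of warehouses and returns the indices (0-based) of a
// smallest subset whose combined stock covers the order, or nullopt if no
// subset does. Runs in O(2^w * w * p) time.
std::optional<std::vector<std::size_t>> min_warehouses_bruteforce(
        const Stock& stock, const std::vector<std::uint32_t>& order) {
    const std::size_t w = stock.size();  // assumption: w < 32
    std::optional<std::vector<std::size_t>> best;
    for (std::uint32_t mask = 0; mask < (std::uint32_t{1} << w); ++mask) {
        std::vector<std::uint64_t> total(order.size(), 0);
        std::vector<std::size_t> chosen;
        for (std::size_t i = 0; i < w; ++i) {
            if ((mask >> i) & 1u) {
                chosen.push_back(i);
                for (std::size_t j = 0; j < order.size(); ++j)
                    total[j] += stock[i][j];
            }
        }
        bool covers = true;
        for (std::size_t j = 0; j < order.size(); ++j)
            if (total[j] < order[j]) covers = false;
        if (covers && (!best || chosen.size() < best->size()))
            best = chosen;
    }
    return best;
}
```

On the three-warehouse example above this returns indices 0 and 2, i.e. Warehouse 1 and Warehouse 3. The 2^w factor is what makes it impractical for larger inputs.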
Part 1 (see also Part 2 below)
Your task looks like the Set Cover Problem, which is NP-hard, so no polynomial-time algorithm is known for the general case.
I designed and implemented (in C++) my own solution, which can be sub-exponential in one case: when many subsets of warehouses yield the same total product amounts. In other words, when the number of all subsets of warehouses (which is 2^NumWarehouses) is much bigger than the number of distinct product-amount vectors produced by those subsets. This often happens in tests for problems like yours in online competitions. In that case my solution is sub-exponential in both CPU time and RAM.
I used a Dynamic Programming approach. The whole algorithm can be described as follows:
1. Create a map whose key is a vector holding the amount of each product. Each key maps to a triple: a) the set of warehouses that can be taken last to reach these product amounts (used to restore the exact chosen warehouses), b) the minimal number of warehouses needed to reach these amounts, c) the previously taken warehouse that achieves this minimum. The map is initialized with a single key, the zero vector (0, 0, ..., 0).
2. Iterate through all warehouses in a loop and do steps 3-4.
3. Iterate through all product-amount vectors currently in the map and do step 4.
4. Add the product amounts of the iterated warehouse to the iterated vector. The sum of the two vectors is a new key in the map; in the value for this key, add the index of the iterated warehouse to the set, and set the minimum and previous warehouse to -1 (uninitialized).
5. Using a recursive function, find for each key of the map the minimal number of warehouses needed and the previous warehouse achieving this minimum. This is easily done: for a given key, iterate over all warehouses in its set, recursively find the minimum for the key obtained by removing that warehouse, and take the smallest of those minimums plus 1.
6. Iterate through all keys in the map that are greater than or equal to (component-wise) the ordered product amounts. All of these keys give a solution, but only some give a minimal one; save the key that gives the minimal solution. If no key in the map is greater than or equal to the ordered vector, there is no solution and the program can finish with an error.
7. Starting from the minimal key, restore the path of used warehouses backwards. This is easy because for each key the map keeps the minimal number of warehouses and the previous warehouse that should be taken to achieve this minimum. Jumping along the "previous" warehouses restores the whole path. Finally, output the minimal solution found.
As already mentioned, the memory and time complexity of this algorithm are proportional to the number of distinct product-amount vectors that can be formed by all subsets of warehouses. This may (if we're lucky) or may not (if we're unlucky) be sub-exponential.
Full C++ code implementing algorithm above (implemented from scratch by me):
#include <cstdint>
#include <vector>
#include <tuple>
#include <map>
#include <set>
#include <unordered_map>
#include <functional>
#include <stdexcept>
#include <iostream>
#include <algorithm>
#include <string>
#define ASSERT(cond) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "!"); }
#define LN { std::cout << "LN " << __LINE__ << std::endl; }
using u16 = uint16_t;
using u32 = uint32_t;
using u64 = uint64_t;
int main() {
std::vector<std::vector<u32>> warehouses_products = {
{2, 5, 0},
{1, 4, 4},
{3, 1, 4},
};
std::vector<u32> order_products = {5, 4, 1};
size_t const nwares = warehouses_products.size(),
nprods = warehouses_products.at(0).size();
ASSERT(order_products.size() == nprods);
std::map<std::vector<u32>, std::tuple<std::set<u16>, u16, u16>> d =
{{std::vector<u32>(nprods), {{}, 0, u16(-1)}}};
for (u16 iware = 0; iware < nwares; ++iware) {
auto const & wprods = warehouses_products[iware];
ASSERT(wprods.size() == nprods);
auto dc = d;
for (auto const & [k, _]: d) {
auto prods = k;
for (size_t i = 0; i < wprods.size(); ++i)
prods[i] += wprods[i];
dc.insert({prods, {{}, u16(-1), u16(-1)}});
std::get<0>(dc[prods]).insert(iware);
}
d = dc;
}
std::function<u16(std::vector<u32> const &)> FindMin =
[&](auto const & prods) {
auto & [a, b, c] = d.at(prods);
if (b != u16(-1))
return b;
u16 minv = u16(-1), minw = u16(-1);
for (auto iware: a) {
auto const & wprods = warehouses_products[iware];
auto cprods = prods;
for (size_t i = 0; i < wprods.size(); ++i)
cprods[i] -= wprods[i];
auto const fmin = FindMin(cprods) + 1;
if (fmin < minv) {
minv = fmin;
minw = iware;
}
}
ASSERT(minv != u16(-1) && minw != u16(-1));
b = minv;
c = minw;
return b;
};
for (auto const & [k, v]: d)
FindMin(k);
std::vector<u32> minp;
u16 minv = u16(-1);
for (auto const & [k, v]: d) {
bool matched = true;
for (size_t i = 0; i < nprods; ++i)
if (order_products[i] > k[i]) {
matched = false;
break;
}
if (!matched)
continue;
if (std::get<1>(v) < minv) {
minv = std::get<1>(v);
minp = k;
}
}
if (minp.empty()) {
std::cout << "Can't buy all products!" << std::endl;
return 0;
}
std::vector<u16> answer;
while (minp != std::vector<u32>(nprods)) {
auto const & [a, b, c] = d.at(minp);
answer.push_back(c);
auto const & wprods = warehouses_products[c];
for (size_t i = 0; i < wprods.size(); ++i)
minp[i] -= wprods[i];
}
std::sort(answer.begin(), answer.end());
std::cout << "WareHouses: ";
for (auto iware: answer)
std::cout << iware << ", ";
std::cout << std::endl;
}
Input:
WareHouses Products:
{2, 5, 0},
{1, 4, 4},
{3, 1, 4},
Ordered Products:
{5, 4, 1}
Output:
WareHouses: 0, 2,
Part 2
I also implemented a totally different solution, shown below.
It is based on backtracking using a recursive function.
Although this solution is exponential in the worst case, it gives a close-to-optimal solution after a short time. So you just run the program for as long as you can afford, and output whatever it has found so far as an approximate solution.
The algorithm is as follows:
1. Suppose we have some products left to buy. Sort all warehouses not taken so far in descending order of the total amount of still-needed products they can supply.
2. In a loop, take each next warehouse from the sorted list, but only the first limit elements of it (limit is a fixed given value). This way we take warehouses greedily in order of relevance, that is, in order of how much of the remaining order they cover.
3. After a warehouse is taken, recurse into the same function, which again forms a sorted list of warehouses and takes the next most relevant one; in other words, jump to step 1 of this algorithm.
4. On each function call, if all products have been bought and the number of taken warehouses is less than the current minimum, output this solution and update the minimum.
The algorithm thus starts out very greedy and then becomes slower and slower as limit grows, turning less greedy and more like brute force. Very good solutions already appear within the first seconds.
As an example, below I create 40 random warehouses with 40 random product amounts each. This fairly large task is solved, probably optimally, within the first second. By "probably" I mean that further minutes of running don't produce any better solution.
#include <algorithm>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
#include <functional>
#include <chrono>
#include <cmath>
using u8 = uint8_t;
using u16 = uint16_t;
using u32 = uint32_t;
using i32 = int32_t;
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
void Solve(auto const & wps, auto const & ops) {
size_t const nwares = wps.size(), nprods = ops.size(), max_depth = 1000;
std::vector<u32> prods_left = ops;
std::vector<std::vector<u16>> sorted_wares_all(max_depth);
std::vector<std::vector<u32>> prods_copy_all(max_depth);
std::vector<u16> path;
std::vector<u8> used(nwares);
size_t min_wares = size_t(-1);
auto ProdGrow = [&](auto const & prods){
size_t grow = 0;
for (size_t i = 0; i < nprods; ++i)
grow += std::min(prods_left[i], prods[i]);
return grow;
};
std::function<void(size_t, size_t, size_t)> Rec = [&](size_t depth, size_t off, size_t lim){
size_t prods_need = 0;
for (auto e: prods_left)
prods_need += e;
if (prods_need == 0) {
if (path.size() < min_wares) {
min_wares = path.size();
std::cout << std::endl << "Time " << std::setw(4) << std::llround(Time())
<< " sec, Cnt " << std::setw(3) << path.size() << ": ";
auto cpath = path;
std::sort(cpath.begin(), cpath.end());
for (auto e: cpath)
std::cout << e << ", ";
std::cout << std::endl << std::flush;
}
return;
}
auto & sorted_wares = sorted_wares_all.at(depth);
auto & prods_copy = prods_copy_all.at(depth);
sorted_wares.clear();
for (u16 i = off; i < nwares; ++i)
if (!used[i])
sorted_wares.push_back(i);
std::sort(sorted_wares.begin(), sorted_wares.end(),
[&](auto a, auto b){
return ProdGrow(wps[a]) > ProdGrow(wps[b]);
});
sorted_wares.resize(std::min(lim, sorted_wares.size()));
for (size_t i = 0; i < sorted_wares.size(); ++i) {
u16 const iware = sorted_wares[i];
auto const & wprods = wps[iware];
prods_copy = prods_left;
for (size_t j = 0; j < nprods; ++j)
prods_left[j] -= std::min(prods_left[j], wprods[j]);
path.push_back(iware);
used[iware] = 1;
Rec(depth + 1, iware + 1, lim);
used[iware] = 0;
path.pop_back();
prods_left = prods_copy;
}
for (auto e: sorted_wares)
used[e] = 0;
};
for (size_t lim = 1; lim <= nwares; ++lim) {
std::cout << "Limit " << lim << ", " << std::flush;
Rec(0, 0, lim);
}
}
int main() {
size_t const nwares = 40, nprods = 40;
std::mt19937_64 rng{std::random_device{}()};
std::vector<std::vector<u32>> wps(nwares);
for (size_t i = 0; i < nwares; ++i) {
wps[i].resize(nprods);
for (size_t j = 0; j < nprods; ++j)
wps[i][j] = rng() % 90 + 10;
}
std::vector<u32> ops;
for (size_t i = 0; i < nprods; ++i)
ops.push_back(rng() % (nwares * 20));
Solve(wps, ops);
}
Output:
Limit 1, Limit 2, Limit 3, Limit 4,
Time 0 sec, Cnt 13: 6, 8, 12, 13, 29, 31, 32, 33, 34, 36, 37, 38, 39,
Limit 5,
Time 0 sec, Cnt 12: 6, 8, 12, 13, 28, 29, 31, 32, 36, 37, 38, 39,
Limit 6, Limit 7,
Time 0 sec, Cnt 11: 6, 8, 12, 13, 19, 26, 31, 32, 33, 36, 39,
Limit 8, Limit 9, Limit 10, Limit 11, Limit 12, Limit 13, Limit 14, Limit 15,
If you want to go down the ILP route, you could formulate the following programme:
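The formula itself did not survive in this copy. Judging from the symbol definitions that follow and from the Python code further down (minimise sum(x), one covering constraint per product), it is presumably the following; treat it as a reconstruction rather than the answerer's exact notation:

```latex
\begin{aligned}
\min \quad & \sum_{i=1}^{w} x_i \\
\text{s.t.} \quad & \sum_{i=1}^{w} C_{ij}\, x_i \ge n_j, && j = 1, \dots, p \\
& x_i \in \{0, 1\}, && i = 1, \dots, w
\end{aligned}
```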
Where w is the number of warehouses, p the number of products, n_j the quantity of product j ordered, and C_ij the quantity of product j stored in warehouse i. Then, the decisions are to select warehouse i (x_i = 1) or not (x_i = 0).
Using Google's ortools and the open-source CBC solver, this could be implemented as follows in Python:
import numpy as np
from ortools.linear_solver import pywraplp
# Some test data, replace with your own.
p = 50
w = 1000
n = np.random.randint(0, 10, p)
C = np.random.randint(0, 5, (w, p))
solver = pywraplp.Solver("model", pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
x = [solver.BoolVar(f"x[{i}]") for i in range(w)]
for j in range(p):
    solver.Add(C[:, j] @ x >= n[j])
solver.Minimize(sum(x))
This formulation solves instances with up to a thousand warehouses in a few seconds to a minute. Smaller instances solve much quicker, for (I hope) obvious reasons.
The following outputs the solution, and some statistics:
assert solver.Solve() == pywraplp.Solver.OPTIMAL
print("Solution:")
print(f"assigned = {[i + 1 for i in range(len(x)) if x[i].solution_value()]}")
print(f" obj = {solver.Objective().Value()}")
print(f" time = {solver.WallTime() / 1000}s")

Cannot understand how to recursively merge sort

Currently self-learning C++ with Daniel Liang's Introduction to C++.
On the topic of the merge sort, I cannot seem to understand how his code is recursively calling itself.
I understand the general concept of the merge sort, but I am having trouble understanding this code specifically.
In this example, we first pass the list 1, 7, 3, 4, 9, 3, 3, 1, 2, and its size (9) to the mergeSort function.
From there, we divide the list in two until the array size reaches 1. In this case we would get: 1,7,3,4 -> 1,7 -> 1. We then move on to merge sorting the second half, which is the array 7 in this case. We merge the two arrays [1] and [7] and then delete the two dynamically allocated arrays to prevent a memory leak.
The part I don't understand is how the code continues from here, after delete[] firstHalf and delete[] secondHalf. From my understanding, shouldn't there be another mergeSort function call in order to merge sort the new firstHalf and secondHalf?
#include <iostream>
using namespace std;
// Function prototype
void arraycopy(int source[], int sourceStartIndex,
int target[], int targetStartIndex, int length);
void merge(int list1[], int list1Size,
int list2[], int list2Size, int temp[]);
// The function for sorting the numbers
void mergeSort(int list[], int arraySize)
{
if (arraySize > 1)
{
// Merge sort the first half
int* firstHalf = new int[arraySize / 2];
arraycopy(list, 0, firstHalf, 0, arraySize / 2);
mergeSort(firstHalf, arraySize / 2);
// Merge sort the second half
int secondHalfLength = arraySize - arraySize / 2;
int* secondHalf = new int[secondHalfLength];
arraycopy(list, arraySize / 2, secondHalf, 0, secondHalfLength);
mergeSort(secondHalf, secondHalfLength);
// Merge firstHalf with secondHalf
merge(firstHalf, arraySize / 2, secondHalf, secondHalfLength,
list);
delete [] firstHalf;
delete [] secondHalf;
}
}
void merge(int list1[], int list1Size,
int list2[], int list2Size, int temp[])
{
int current1 = 0; // Current index in list1
int current2 = 0; // Current index in list2
int current3 = 0; // Current index in temp
while (current1 < list1Size && current2 < list2Size)
{
if (list1[current1] < list2[current2])
temp[current3++] = list1[current1++];
else
temp[current3++] = list2[current2++];
}
while (current1 < list1Size)
temp[current3++] = list1[current1++];
while (current2 < list2Size)
temp[current3++] = list2[current2++];
}
void arraycopy(int source[], int sourceStartIndex,
int target[], int targetStartIndex, int length)
{
for (int i = 0; i < length; i++)
{
target[i + targetStartIndex] = source[i + sourceStartIndex];
}
}
int main()
{
const int SIZE = 9;
int list[] = {1, 7, 3, 4, 9, 3, 3, 1, 2};
mergeSort(list, SIZE);
for (int i = 0; i < SIZE; i++)
cout << list[i] << " ";
return 0;
}
"From my understanding, shouldn't there be another mergeSort function call in order to merge sort the new firstHalf and secondHalf?"
It happens implicitly through the recursive calls. When execution reaches these two lines:
delete [] firstHalf;
delete [] secondHalf;
it means that one call to mergeSort has completed. If that call was sorting a first half, control returns to the caller at the line just after that call, i.e. these lines:
// Merge sort the second half
int secondHalfLength = arraySize - arraySize / 2;
...
But if that call was sorting a second half, control returns to the caller at the line just after that call, i.e. these lines:
// Merge firstHalf with secondHalf
merge(firstHalf, arraySize / 2, secondHalf, secondHalfLength,
list);
So everything proceeds as planned: no extra mergeSort call is needed, because each caller simply continues from where it left off.
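One way to watch this order of returns is to log every call. The sketch below is not the book's code: it uses std::vector and std::merge instead of raw arrays, and the name merge_sort_traced and the log format are invented for illustration. The recursion structure (sort first half, sort second half, merge) is the same.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sorts v and appends one line per event to log, indented by recursion depth.
void merge_sort_traced(std::vector<int>& v, std::vector<std::string>& log,
                       int depth = 0) {
    const std::string pad(static_cast<std::size_t>(2 * depth), ' ');
    log.push_back(pad + "enter size " + std::to_string(v.size()));
    if (v.size() > 1) {
        const auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
        std::vector<int> first(v.begin(), mid);
        std::vector<int> second(mid, v.end());
        merge_sort_traced(first, log, depth + 1);   // when this returns, we continue below
        merge_sort_traced(second, log, depth + 1);  // when this returns, we continue below
        std::merge(first.begin(), first.end(), second.begin(), second.end(),
                   v.begin());
        log.push_back(pad + "merged size " + std::to_string(v.size()));
    }
}
```

Printing the log for the nine-element list from the question shows each "merged" line appearing only after both nested calls at the deeper indentation have finished, which is exactly the implicit continuation described above.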

Halide: Reduction over a domain for the specific values

I have a Func f(x, y, z) whose values are either 1 or 0, and I need to get the first 100 coordinates whose value equals 1 and update them to 0.
This is very simple to do in C and other languages. However, I've been trying to solve it with Halide for a couple of days. Is there any function or algorithm that I can use to solve it in Halide generators?
The question amounts to "How do I implement stream compaction in Halide?" There is much written on parallel stream compaction, and it is somewhat non-trivial to do well. See this Stack Overflow answer on doing it in CUDA for some discussion and references: CUDA stream compaction algorithm
A quick implementation of simple stream compaction in Halide using a prefix sum looks like this:
#include "Halide.h"
#include <iostream>
using namespace Halide;
static void print_1d(const Buffer<int32_t> &result) {
std::cout << "{ ";
const char *prefix = "";
for (int i = 0; i < result.dim(0).extent(); i++) {
std::cout << prefix << result(i);
prefix = ", ";
}
std::cout << "}\n";
}
int main(int argc, char **argv) {
uint8_t vals[] = {0, 10, 99, 76, 5, 200, 88, 15};
Buffer<uint8_t> in(vals);
Var x;
Func prefix_sum;
RDom range(1, in.dim(0).extent() - 1);
prefix_sum(x) = (int32_t)0;
prefix_sum(range) = select(in(range - 1) > 42, prefix_sum(range - 1) + 1, prefix_sum(range - 1));
RDom in_range(0, in.dim(0).extent());
Func compacted_indices;
compacted_indices(x) = -1;
compacted_indices(clamp(prefix_sum(in_range), 0, in.dim(0).extent() - 1)) = select(in(in_range) > 42, in_range, - 1);
Buffer<int32_t> sum = prefix_sum.realize(8);
Buffer<int32_t> indices = compacted_indices.realize(8);
print_1d(sum);
print_1d(indices);
return 0;
}
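For readers without Halide at hand, the same prefix-sum idea can be sketched in plain C++. This is a serial illustration only, not a Halide schedule: the function name compact_indices and the limit parameter (standing in for "the first 100" from the question) are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices of the first `limit` elements of `in` that exceed
// `threshold`, in their original order.
std::vector<int> compact_indices(const std::vector<std::uint8_t>& in,
                                 std::uint8_t threshold, std::size_t limit) {
    const std::size_t n = in.size();
    // Exclusive prefix sum: slot[i] = number of kept elements before i,
    // which is the output position of element i if it is kept.
    std::vector<std::size_t> slot(n, 0);
    for (std::size_t i = 1; i < n; ++i)
        slot[i] = slot[i - 1] + (in[i - 1] > threshold ? 1 : 0);
    const std::size_t kept =
        (n == 0) ? 0 : slot[n - 1] + (in[n - 1] > threshold ? 1 : 0);
    std::vector<int> out(std::min(kept, limit));
    // Scatter: every kept element writes its own index to its slot.
    for (std::size_t i = 0; i < n; ++i)
        if (in[i] > threshold && slot[i] < out.size())
            out[slot[i]] = static_cast<int>(i);
    return out;
}
```

For the sample values in the Halide program and a threshold of 42, the kept indices are 2, 3, 5 and 6. Once the indices are known, setting the corresponding elements to 0 is a plain scatter.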

find the index of the highest bit set of a 32-bit number without loops obviously

Here's a tough one (at least I had a hard time :P):
Find the index of the highest set bit of a 32-bit number without using any loops.
With recursion:
int firstset(int bits) {
return (bits & 0x80000000) ? 31 : firstset((bits << 1) | 1) - 1;
}
Assumes [31,..,0] indexing
Returns -1 if no bits set
| 1 prevents stack overflow by capping the number of shifts until a 1 is reached (32)
Not tail recursive :)
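The version above shifts a signed int, and left-shifting into the sign bit is not well defined in C or in C++ before C++20. A minimal sketch of the same recursion on an unsigned type (the name firstset_u32 is made up for this illustration):

```cpp
#include <cstdint>

// Index (31..0) of the highest set bit, or -1 if no bit is set.
// Shifting a 1 in from the right guarantees that the top bit is reached
// after at most 32 recursive calls, so the recursion always terminates.
constexpr int firstset_u32(std::uint32_t bits) {
    return (bits & 0x80000000u) ? 31 : firstset_u32((bits << 1) | 1u) - 1;
}
```

Each recursive step subtracts 1 from the final answer, so an input of 0 needs 32 shifts before the top bit is set and ends at 31 - 32 = -1.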
Very interesting question. I will provide an answer with a benchmark.
Solution using a loop
uint8_t highestBitIndex( uint32_t n )
{
uint8_t r = 0;
while ( n >>= 1 )
r++;
return r;
}
This helps to understand the question better, but it is highly inefficient.
Solution using log
The same result can be obtained with the log method:
uint8_t highestSetBitIndex2(uint32_t n) {
return (uint8_t)(log(n) / log(2));
}
However, it is also inefficient (even more so than the loop above; see the benchmark).
Solution using built-in instruction
uint8_t highestBitIndex3( uint32_t n )
{
return 31 - __builtin_clz(n);
}
This solution, while very efficient, suffers from the fact that it only works with specific compilers (gcc and clang will do) and on specific platforms. Note also that the result of __builtin_clz(0) is undefined.
NB: It is 31 and not 32 because we want the index.
Solution with intrinsic
#include <x86intrin.h>
uint8_t highestSetBitIndex5(uint32_t n)
{
return _bit_scan_reverse(n); // undefined behavior if n == 0
}
This compiles to the bsr instruction at assembly level.
Solution using inline assembly
BSR and LZCNT can also be used directly through inline assembly with the functions below:
uint8_t highestSetBitIndex4(uint32_t n) // undefined behavior if n == 0
{
__asm__ __volatile__ (R"(
.intel_syntax noprefix
bsr eax, edi
.att_syntax noprefix
)"
);
}
uint8_t highestSetBitIndex7(uint32_t n) // undefined behavior if n == 0
{
__asm__ __volatile__ (R"(.intel_syntax noprefix
lzcnt ecx, edi
mov eax, 31
sub eax, ecx
.att_syntax noprefix
)");
}
NB: Do not use this unless you know what you are doing (these functions have no return statement and rely on the result being left in eax).
Solution using lookup table and magic number multiplication (probably the best AFAIK)
First you use the following function to clear all the bits except the highest one:
uint32_t keepHighestBit( uint32_t n )
{
n |= (n >> 1);
n |= (n >> 2);
n |= (n >> 4);
n |= (n >> 8);
n |= (n >> 16);
return n - (n >> 1);
}
Credit: the idea comes from Henry S. Warren, Jr.'s book Hacker's Delight.
Then we use an algorithm based on a de Bruijn sequence, which acts as a kind of hash from the isolated bit to its index:
uint8_t highestBitIndex8( uint32_t b )
{
static const uint32_t deBruijnMagic = 0x06EB14F9; // equivalent to 0b111(0xff ^ 3)
static const uint8_t deBruijnTable[64] = {
0, 0, 0, 1, 0, 16, 2, 0, 29, 0, 17, 0, 0, 3, 0, 22,
30, 0, 0, 20, 18, 0, 11, 0, 13, 0, 0, 4, 0, 7, 0, 23,
31, 0, 15, 0, 28, 0, 0, 21, 0, 19, 0, 10, 12, 0, 6, 0,
0, 14, 27, 0, 0, 9, 0, 5, 0, 26, 8, 0, 25, 0, 24, 0,
};
return deBruijnTable[(keepHighestBit(b) * deBruijnMagic) >> 26];
}
Another version:
void propagateBits(uint32_t *n) {
*n |= *n >> 1;
*n |= *n >> 2;
*n |= *n >> 4;
*n |= *n >> 8;
*n |= *n >> 16;
}
uint8_t highestSetBitIndex8(uint32_t b)
{
static const uint32_t Magic = (uint32_t) 0x07C4ACDD;
static const int BitTable[32] = {
0, 9, 1, 10, 13, 21, 2, 29,
11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7,
19, 27, 23, 6, 26, 5, 4, 31,
};
propagateBits(&b);
return BitTable[(b * Magic) >> 27];
}
Benchmark with 100 million calls
compiling with g++ -std=c++17 highestSetBit.cpp -O3 && ./a.out
highestBitIndex1 136.8 ms (loop)
highestBitIndex2 183.8 ms (log(n) / log(2))
highestBitIndex3 10.6 ms (de Bruijn lookup Table with power of two, 64 entries)
highestBitIndex4 4.5 ms (inline assembly bsr)
highestBitIndex5 6.7 ms (intrinsic bsr)
highestBitIndex6 4.7 ms (gcc lzcnt)
highestBitIndex7 7.1 ms (inline assembly lzcnt)
highestBitIndex8 10.2 ms (de Bruijn lookup Table, 32 entries)
I would personally go for highestBitIndex8 if portability is your focus, else gcc built-in is nice.
Floor of logarithm-base-two should do the trick (though you have to special-case 0).
Floor of log base 2 of 0001 is 0 (bit with index 0 is set).
" " of 0010 is 1 (bit with index 1 is set).
" " of 0011 is 1 (bit with index 1 is set).
" " of 0100 is 2 (bit with index 2 is set).
and so on.
On an unrelated note, this is actually a pretty terrible interview question (I say this as someone who does technical interviews for potential candidates), because it really doesn't correspond to anything you do in practical programming.
Your boss isn't going to come up to you one day and say "hey, so we have a rush job for this latest feature, and it needs to be implemented without loops!"
You could do it like this (not optimised):
int index = 0;
uint32_t temp = number;
if ((temp >> 16) != 0) {
temp >>= 16;
index += 16;
}
if ((temp >> 8) != 0) {
temp >>= 8;
index += 8;
}
...
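Spelled out in full, and assuming the elided steps continue the same halving pattern with shifts of 4, 2 and 1, the idea might look like this (the function name highest_bit_unrolled is made up here):

```cpp
#include <cstdint>

// Index of the highest set bit of `number`, found with five comparisons.
// Returns 0 for both number == 0 and number == 1, so 0 needs a separate check.
int highest_bit_unrolled(std::uint32_t number) {
    int index = 0;
    std::uint32_t temp = number;
    if ((temp >> 16) != 0) { temp >>= 16; index += 16; }
    if ((temp >> 8) != 0)  { temp >>= 8;  index += 8; }
    if ((temp >> 4) != 0)  { temp >>= 4;  index += 4; }
    if ((temp >> 2) != 0)  { temp >>= 2;  index += 2; }
    if ((temp >> 1) != 0)  { index += 1; }
    return index;
}
```

Each test asks whether the highest set bit lies in the upper half of the remaining window; if so, the window is shifted down and the index advanced by the half width.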
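Sorry for bumping an old thread, but how about this:
inline int ilog2(unsigned long long i) {
    union { float f; int i; } u = { (float)i }; // the union needs a name to be used below
    return (u.i>>23)-127; // the IEEE 754 single-precision exponent bias is 127
}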
...
int highest=ilog2(x); highest+=(x>>highest)-1;
// and in case you need it
int lowest = ilog2((x^x-1)+1)-1;
This can be done as a binary search, reducing the complexity from O(N) (for an N-bit word) to O(log(N)). A possible implementation is:
int highest_bit_index(uint32_t value)
{
if(value == 0) return 0;
int depth = 0;
int exponent = 16;
while(exponent > 0)
{
int shifted = value >> (exponent);
if(shifted > 0)
{
depth += exponent;
if(shifted == 1) return depth + 1;
value >>= exponent;
}
exponent /= 2;
}
return depth + 1;
}
The input is a 32-bit unsigned integer. Note that the function returns a 1-based position (and 0 when no bit is set).
It has a loop that can be converted into 5 levels of if-statements, resulting in 32 or so if-statements. You could also use recursion to get rid of the loop, or the absolutely evil "goto" ;)
Let
n - the decimal number whose highest set bit is to be located
start - the decimal value of (1 << 31), i.e. 2147483648
bitLocation - the location of the bit that is set to 1
public int highestBitSet(int n, long start, int bitLocation)
{
if (start == 0)
{
return 0;
}
if ((start & n) > 0)
{
return bitLocation;
}
else
{
return highestBitSet(n, (start >> 1), --bitLocation);
}
}
long i = 1;
long startIndex = (i << 31);
int bitLocation = 32;
int value = highestBitSet(64, startIndex, bitLocation);
System.out.println(value);
int high_bit_set(int n, int pos)
{
if(pos<0)
return -1;
else
return (0x80000000 & n)?pos:high_bit_set((n<<1),--pos);
}
int main()
{
    int n=0x23;
    int high_pos = high_bit_set(n,31);
    printf("highest index = %d",high_pos);
    return 0;
}
Call high_bit_set(int n, int pos) from main with the input value n and 31 as the starting (highest) position. The function is shown above.
Paislee's solution is actually pretty easy to make tail-recursive, though, it's a much slower solution than the suggested floor(log2(n));
int firstset_tr(int bits, int final_dec) {
// pass in 0 for final_dec on first call, or use a helper function
if (bits & 0x80000000) {
return 31-final_dec;
} else {
return firstset_tr( ((bits << 1) | 1), final_dec+1 );
}
}
This function also works for other bit sizes, just change the check,
e.g.
if (bits & 0x80) { // for 8-bit
return 7-final_dec;
}
Note that what you are trying to do is calculate the integer log2 of an integer:
#include <stdio.h>
#include <stdlib.h>
unsigned int
Log2(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1; int k=0;
for( step = 1; step < bits; ) {
n |= (n >> step);
step *= 2; ++k;
}
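// NOTE: as written this returns x - (n >> 1), which is not the bit index;
// the function is also left out of the test in main below.
//printf("%ld %ld\n",x, (x - (n >> 1)) );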
return(x - (n >> 1));
}
Observe that you can attempt to search more than 1 bit at a time.
unsigned int
Log2_a(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1;
int step2 = 0;
//observe that you can move 8 bits at a time, and there is a pattern...
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//}
//}
//}
for( step2=0; x>1L<<step2+8; ) {
step2+=8;
}
//printf("step2 %d\n",step2);
for( step = 0; x>1L<<(step+step2); ) {
step+=1;
//printf("step %d\n",step+step2);
}
printf("log2(%ld) %d\n",x,step+step2);
return(step+step2);
}
This approach uses a binary search
unsigned int
Log2_b(unsigned long x)
{
unsigned long n = x;
unsigned int bits = sizeof(x)*8;
unsigned int hbit = bits-1;
unsigned int lbit = 0;
unsigned long guess = bits/2;
int found = 0;
while ( hbit-lbit>1 ) {
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
//when value between guess..lbit
if( (x<=(1L<<guess)) ) {
//printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
hbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
//when value between hbit..guess
//else
if( (x>(1L<<guess)) ) {
//printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
lbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
}
if( (x>(1L<<guess)) ) ++guess;
printf("log2(x%lu)=r%lu\n",x,guess);
return(guess);
}
Another binary search method, perhaps more readable,
unsigned int
Log2_c(unsigned long x)
{
unsigned long v = x;
unsigned int bits = sizeof(x)*8;
unsigned int step = bits;
unsigned int res = 0;
for( step = bits/2; step>0; )
{
//printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
while ( v>>step ) {
v>>=step;
res+=step;
//printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
}
step /= 2;
}
if( (x>(1L<<res)) ) ++res;
printf("log2(x%lu)=r%u\n",x,res);
return(res);
}
And because you will want to test these,
int main()
{
unsigned long int x = 3;
for( x=2; x<1000000000; x*=2 ) {
//printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
}
return(0);
}
Well, from what I know, the log function is implemented very efficiently in most programming languages, and even if it does contain loops internally, there are probably very few of them.
So I would say that in most cases using log would be faster and more direct.
You do have to check for 0 and avoid taking the log of 0, though: the result is not a finite number (negative infinity in C), so converting it to an integer index is undefined.

Very basic radix sort

I just wrote a simple iterative radix sort and I'm wondering if I have the right idea.
Recursive implementations seem to be much more common.
I am sorting 4-byte integers (unsigned to keep it simple).
I am using 1-byte as the 'digit'. So I have 2^8=256 buckets.
I am sorting the most significant digit (MSD) first.
After each sort I put them back into array in the order they exist in buckets and then perform the next sort.
So I end up doing 4 bucket sorts.
It seems to work for a small set of data. Since I am doing it MSD I'm guessing that's not stable and may fail with different data.
Did I miss anything major?
#include <iostream>
#include <vector>
#include <list>
using namespace std;
void radix(vector<unsigned>&);
void print(const vector<list<unsigned> >& listBuckets);
unsigned getMaxForBytes(unsigned bytes);
void merge(vector<unsigned>& data, vector<list<unsigned> >& listBuckets);
int main()
{
unsigned d[] = {5,3,6,9,2,11,9, 65534, 4,10,17,13, 268435455, 4294967294,4294967293, 268435454,65537};
vector<unsigned> v(d,d+17);
radix(v);
return 0;
}
void radix(vector<unsigned>& data)
{
int bytes = 1; // How many bytes to compare at a time
unsigned numOfBuckets = getMaxForBytes(bytes) + 1;
cout << "Numbuckets" << numOfBuckets << endl;
int chunks = sizeof(unsigned) / bytes;
for(int i = chunks - 1; i >= 0; --i)
{
vector<list<unsigned> > buckets; // lazy, wasteful allocation
buckets.resize(numOfBuckets);
unsigned mask = getMaxForBytes(bytes);
unsigned shift = i * bytes * 8;
mask = mask << shift;
for(unsigned j = 0; j < data.size(); ++j)
{
unsigned bucket = data[j] & mask; // isolate bits of current chunk
bucket = bucket >> shift; // bring bits down to least significant
buckets[bucket].push_back(data[j]);
}
print(buckets);
merge(data,buckets);
}
}
unsigned getMaxForBytes(unsigned bytes)
{
unsigned max = 0;
for(unsigned i = 1; i <= bytes; ++i)
{
max = max << 8;
max |= 0xFF;
}
return max;
}
void merge(vector<unsigned>& data, vector<list<unsigned> >& listBuckets)
{
int index = 0;
for(unsigned i = 0; i < listBuckets.size(); ++i)
{
list<unsigned>& list = listBuckets[i];
std::list<unsigned>::const_iterator it = list.begin();
for(; it != list.end(); ++it)
{
data[index] = *it;
++index;
}
}
}
void print(const vector<list<unsigned> >& listBuckets)
{
cout << "Printing listBuckets: " << endl;
for(unsigned i = 0; i < listBuckets.size(); ++i)
{
const list<unsigned>& list = listBuckets[i];
if(list.size() == 0) continue;
std::list<unsigned>::const_iterator it = list.begin(); // Why do I need std here!?
for(; it != list.end(); ++it)
{
cout << *it << ", ";
}
cout << endl;
}
}
Update:
It seems to work well in LSD form, which is obtained by changing the chunk loop in radix as follows:
for(int i = 0; i < chunks; ++i)
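For comparison, a least-significant-digit-first radix sort can be sketched without per-bucket lists by using a counting pass per byte. This is an alternative formulation under my own naming (radix_sort_lsd), not a rewrite of the code above. The important property is that each pass is stable, which is what allows the passes to be applied to the whole array one after another.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort on 32-bit unsigned integers, one byte per pass.
void radix_sort_lsd(std::vector<std::uint32_t>& data) {
    std::vector<std::uint32_t> buffer(data.size());
    for (int shift = 0; shift < 32; shift += 8) {
        // start[b + 1] first counts how many values have digit b.
        std::array<std::size_t, 257> start{};
        for (std::uint32_t v : data) ++start[((v >> shift) & 0xFF) + 1];
        // Prefix sums: start[b] becomes the first output position of digit b.
        for (std::size_t b = 0; b < 256; ++b) start[b + 1] += start[b];
        // Stable scatter: equal digits keep their relative order.
        for (std::uint32_t v : data) buffer[start[(v >> shift) & 0xFF]++] = v;
        data.swap(buffer);
    }
}
```

Because there are four passes and the buffers are swapped after each one, the sorted result ends up back in data.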
Let's look at an example with two-digit decimal numbers:
49, 25, 19, 27, 87, 67, 22, 90, 47, 91
Sorting by the first digit yields
19, 25, 27, 22, 49, 47, 67, 87, 90, 91
Next, you sort by the second digit, yielding
90, 91, 22, 25, 27, 47, 67, 87, 19, 49
Seems wrong, doesn't it? Or isn't this what you are doing? Maybe you can show us the code if I got you wrong.
If you are doing the second bucket sort on all groups with the same first digit(s), your algorithm would be equivalent to the recursive version. It would be stable as well. The only difference is that you'd do the bucket sorts breadth-first instead of depth-first.
You also need to make sure you sort every bucket from MSD to LSD before reassembling.
Example:
19,76,90,34,84,12,72,38
Sort into 10 buckets [0-9] on MSD
B0=[];B1=[19,12];B2=[];B3=[34,38];B4=[];B5=[];B6=[];B7=[76,72];B8=[84];B9=[90];
If you were to reassemble and then sort again, it would not work. Instead, recursively sort each bucket.
B1 is sorted into B1B2=[12];B1B9=[19]
Once all have been sorted you can reassemble correctly.
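The "sort each bucket before reassembling" rule can be sketched as a recursive MSD radix sort. The example uses bytes of 32-bit unsigned integers as digits, as in the question, rather than decimal digits; the name radix_sort_msd and the default starting shift of 24 are choices made for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Distributes data by the byte at `shift`, recursively sorts every bucket on
// the next less significant byte, and only then concatenates the buckets.
void radix_sort_msd(std::vector<std::uint32_t>& data, int shift = 24) {
    if (data.size() < 2 || shift < 0) return;  // nothing left to distinguish
    std::array<std::vector<std::uint32_t>, 256> buckets;
    for (std::uint32_t v : data) buckets[(v >> shift) & 0xFF].push_back(v);
    data.clear();
    for (auto& bucket : buckets) {
        radix_sort_msd(bucket, shift - 8);  // finish this bucket first
        data.insert(data.end(), bucket.begin(), bucket.end());
    }
}
```

Values that share a more significant byte never leave their bucket, so a later pass on a less significant byte cannot undo the earlier ordering. That is the difference from reassembling the whole array and re-sorting it.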
