Halide: Reduction over a domain for specific values

I have a Func f(x, y, z) whose values are either 0 or 1, and I need to get the first 100 coordinates whose value equals 1 and reduce/update them to 0.
This is very simple to implement in C and other languages, but I've been trying to solve it with Halide for a couple of days. Is there any function or algorithm I can use to solve this in Halide generators?

The question amounts to "How do I implement stream compaction in Halide?" Much has been written on parallel stream compaction, and it is somewhat non-trivial to do well. See this Stack Overflow answer on doing it in CUDA for some discussion and references: CUDA stream compaction algorithm
A quick implementation of simple stream compaction in Halide using a prefix sum looks like so:
#include "Halide.h"
#include <iostream>
using namespace Halide;
static void print_1d(const Buffer<int32_t> &result) {
    std::cout << "{ ";
    const char *prefix = "";
    for (int i = 0; i < result.dim(0).extent(); i++) {
        std::cout << prefix << result(i);
        prefix = ", ";
    }
    std::cout << "}\n";
}
int main(int argc, char **argv) {
    uint8_t vals[] = {0, 10, 99, 76, 5, 200, 88, 15};
    Buffer<uint8_t> in(vals);

    Var x;

    // prefix_sum(x) counts how many elements before index x pass the
    // predicate (here: value > 42).
    Func prefix_sum;
    RDom range(1, in.dim(0).extent() - 1);
    prefix_sum(x) = (int32_t)0;
    prefix_sum(range) = select(in(range - 1) > 42,
                               prefix_sum(range - 1) + 1,
                               prefix_sum(range - 1));

    // Scatter each passing index into its slot in the compacted output.
    RDom in_range(0, in.dim(0).extent());
    Func compacted_indices;
    compacted_indices(x) = -1;
    compacted_indices(clamp(prefix_sum(in_range), 0, in.dim(0).extent() - 1)) =
        select(in(in_range) > 42, in_range, -1);

    Buffer<int32_t> sum = prefix_sum.realize(8);
    Buffer<int32_t> indices = compacted_indices.realize(8);
    print_1d(sum);
    print_1d(indices);
    return 0;
}
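Applying the same idea to the original question's Func f(x, y, z): below is a rough sketch (my adaptation, not part of the answer above, and not tested against any particular Halide version) that flattens the 3D domain, scans it once to count the 1s seen so far, and zeroes a cell only if it is 1 and fewer than 100 ones precede it. The sizes W, H, D and the toy input are illustrative assumptions:
#include "Halide.h"
using namespace Halide;

int main() {
    const int W = 8, H = 8, D = 8;  // assumed extents
    Buffer<uint8_t> f(W, H, D);
    f.fill(1);  // toy input: every cell is 1

    Var x, y, z, i;

    // Flatten to 1D so a single scan gives a global ordering of cells.
    Func flat;
    flat(i) = f(i % W, (i / W) % H, i / (W * H));

    // ones_before(i) = number of 1s at flat indices strictly less than i.
    Func ones_before;
    ones_before(i) = cast<int32_t>(0);
    RDom r(1, W * H * D - 1);
    ones_before(r) = ones_before(r - 1) + cast<int32_t>(flat(r - 1));

    // Zero a cell iff it is 1 and fewer than 100 ones come before it.
    Expr fi = x + W * (y + H * z);
    Func out;
    out(x, y, z) = select(f(x, y, z) == 1 && ones_before(fi) < 100,
                          cast<uint8_t>(0), f(x, y, z));

    Buffer<uint8_t> result = out.realize(W, H, D);
    return 0;
}
Note the scan over a single RDom runs serially; as the CUDA discussion linked above explains, a parallel implementation would block the prefix sum, which Halide can express but which is considerably more involved.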


Minimizing the number of warehouses for an order

I am trying to figure out an algorithm to efficiently solve the following problem:
There are w warehouses that store p different products with different quantities
A customer places an order on n out of the p products
The goal is to pick the minimum number of warehouses from which the order could be allocated.
E.g. the distribution of inventory in three warehouses is as follows
|             | Product 1 | Product 2 | Product 3 |
|-------------|-----------|-----------|-----------|
| Warehouse 1 | 2         | 5         | 0         |
| Warehouse 2 | 1         | 4         | 4         |
| Warehouse 3 | 3         | 1         | 4         |
Now suppose an order is placed with the following ordered quantities:
|             | Product 1 | Product 2 | Product 3 |
|-------------|-----------|-----------|-----------|
| Ordered Qty | 5         | 4         | 1         |
The optimal solution here would be to allocate the order from Warehouse 1 and Warehouse 3; no other, smaller subset of the 3 warehouses would be a better choice.
I have tried using brute force to solve this, however, for a larger number of warehouses, the algorithm performs very poorly. I have also tried a few greedy allocation algorithms, however, as expected, they are unable to minimize the number of sub-orders in many cases. Are there any other algorithms/approaches that I should look into?
Part 1 (see also Part 2 below)
Your task looks like a Set Cover Problem, which is NP-complete, hence all known exact algorithms take exponential time in the worst case.
I designed (and implemented in C++) my own solution for it, which may be sub-exponential in one case: when many subsets of warehouses produce the same total amounts of products. In other words, when the exponential number of subsets of warehouses (2^NumWarehouses) is much bigger than the number of distinct product-count combinations those subsets can produce. This often happens in tests of problems like yours in online competitions. When it does, my solution is sub-exponential in both CPU and RAM.
I used a Dynamic Programming approach for this. The whole algorithm can be described as follows:
1. We create a map whose key is a vector of the amounts of each product. The key points to a triple: a) the set of warehouses that can be taken last to reach the current product amounts (used to restore the exact chosen warehouses), b) the minimal number of warehouses needed to achieve these product amounts, c) the previously taken warehouse achieving this minimum. The map is initialized with a single key, the vector of 0 products (0, 0, ..., 0).
2. Iterate through all warehouses in a loop and do steps 3-4 for each.
3. Iterate through all current product-amount vectors (keys) in the map and do step 4 for each.
4. To the iterated vector of products, add the product amounts of the iterated warehouse. This sum of two vectors is a new key in the map; in the value pointed to by this key, add the index of the iterated warehouse to the set, while the minimum and previous warehouse are set to -1 (uninitialized).
5. Using a recursive function, for each key of the map find the minimal number of warehouses needed and the previous warehouse achieving this minimum. This is easily done: for a given key, iterate over all warehouses in its set and find their minimums recursively; the minimum of the current key is then the minimum of all those minimums plus 1.
6. Iterate through all keys in the map that are greater than or equal (componentwise) to the ordered vector of product amounts. All these keys give a solution, but only some of them give the minimal solution; save the key that gives the smallest one. If all keys in the map are smaller than the ordered vector, there is no feasible solution and the program finishes with an error.
7. Having the minimal key, restore the path of all used warehouses backwards. This is easy because for each key in the map we keep the minimal number of warehouses and the previous warehouse taken to achieve it; jumping through the "previous" warehouses restores the whole path. Finally, output the found minimal solution.
As already mentioned, this algorithm's memory and time complexity are proportional to the number of distinct product-amount vectors that can be formed by all subsets of warehouses, which may (if we're lucky) or may not (if we're unlucky) be sub-exponential.
Full C++ code implementing the algorithm above (implemented from scratch by me):
Try it online!
#include <cstdint>
#include <vector>
#include <tuple>
#include <map>
#include <set>
#include <unordered_map>
#include <functional>
#include <stdexcept>
#include <iostream>
#include <algorithm>
#define ASSERT(cond) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "!"); }
#define LN { std::cout << "LN " << __LINE__ << std::endl; }
using u16 = uint16_t;
using u32 = uint32_t;
using u64 = uint64_t;
int main() {
    std::vector<std::vector<u32>> warehouses_products = {
        {2, 5, 0},
        {1, 4, 4},
        {3, 1, 4},
    };
    std::vector<u32> order_products = {5, 4, 1};

    size_t const nwares = warehouses_products.size(),
                 nprods = warehouses_products.at(0).size();
    ASSERT(order_products.size() == nprods);

    // Key: vector of product amounts. Value: (set of possible last warehouses,
    // minimal number of warehouses, previous warehouse on the minimal path).
    std::map<std::vector<u32>, std::tuple<std::set<u16>, u16, u16>> d =
        {{std::vector<u32>(nprods), {{}, 0, u16(-1)}}};

    for (u16 iware = 0; iware < nwares; ++iware) {
        auto const & wprods = warehouses_products[iware];
        ASSERT(wprods.size() == nprods);
        auto dc = d;
        for (auto const & [k, _]: d) {
            auto prods = k;
            for (size_t i = 0; i < wprods.size(); ++i)
                prods[i] += wprods[i];
            dc.insert({prods, {{}, u16(-1), u16(-1)}});
            std::get<0>(dc[prods]).insert(iware);
        }
        d = dc;
    }

    std::function<u16(std::vector<u32> const &)> FindMin =
        [&](auto const & prods) {
            auto & [a, b, c] = d.at(prods);
            if (b != u16(-1))
                return b;
            u16 minv = u16(-1), minw = u16(-1);
            for (auto iware: a) {
                auto const & wprods = warehouses_products[iware];
                auto cprods = prods;
                for (size_t i = 0; i < wprods.size(); ++i)
                    cprods[i] -= wprods[i];
                auto const fmin = FindMin(cprods) + 1;
                if (fmin < minv) {
                    minv = fmin;
                    minw = iware;
                }
            }
            ASSERT(minv != u16(-1) && minw != u16(-1));
            b = minv;
            c = minw;
            return b;
        };

    for (auto const & [k, v]: d)
        FindMin(k);

    // Among all keys that cover the order, pick the one with the minimal count.
    std::vector<u32> minp;
    u16 minv = u16(-1);
    for (auto const & [k, v]: d) {
        bool matched = true;
        for (size_t i = 0; i < nprods; ++i)
            if (order_products[i] > k[i]) {
                matched = false;
                break;
            }
        if (!matched)
            continue;
        if (std::get<1>(v) < minv) {
            minv = std::get<1>(v);
            minp = k;
        }
    }

    if (minp.empty()) {
        std::cout << "Can't buy all products!" << std::endl;
        return 0;
    }

    // Walk the "previous warehouse" links back to the zero vector.
    std::vector<u16> answer;
    while (minp != std::vector<u32>(nprods)) {
        auto const & [a, b, c] = d.at(minp);
        answer.push_back(c);
        auto const & wprods = warehouses_products[c];
        for (size_t i = 0; i < wprods.size(); ++i)
            minp[i] -= wprods[i];
    }
    std::sort(answer.begin(), answer.end());

    std::cout << "WareHouses: ";
    for (auto iware: answer)
        std::cout << iware << ", ";
    std::cout << std::endl;
}
Input:
WareHouses Products:
{2, 5, 0},
{1, 4, 4},
{3, 1, 4},
Ordered Products:
{5, 4, 1}
Output:
WareHouses: 0, 2,
Part 2
I also implemented a totally different solution below.
This one is based on backtracking, using a recursive function.
Although this solution is exponential in the worst case, it gives a close-to-optimal solution after a short time. So you just run this program for as long as you can afford, and output whatever it has found so far as an approximate solution.
The algorithm is as follows:
1. Suppose we have some products left to buy. Sort all warehouses not taken so far in descending order of the total amount of still-needed products they can supply.
2. In a loop, take each next warehouse from this sorted list, but only the first limit (a fixed, given value) elements of it. This way we greedily take warehouses in order of relevance to the products left to buy.
3. After a warehouse is taken, recurse into the current function, which again forms a sorted list of warehouses and takes the next most relevant one; in other words, jump to step 1.
4. On each function call, if all products are bought and the number of taken warehouses is less than the current minimum, output this solution and update the minimum.
Thus the algorithm starts out very greedy and then becomes slower and slower, less greedy and closer to brute force. Very good solutions already appear in the first seconds.
As an example, below I create 40 random warehouses with 40 random product amounts each. This quite large task is solved, probably optimally, within the first second. By "probably" I mean that further minutes of running don't produce any better solution.
Try it online!
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
#include <functional>
#include <chrono>
#include <cmath>
#include <algorithm>  // std::sort, std::min

using u8  = uint8_t;
using u16 = uint16_t;
using u32 = uint32_t;
using i32 = int32_t;

double Time() {
    static auto const gtb = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}

void Solve(auto const & wps, auto const & ops) {
    size_t const nwares = wps.size(), nprods = ops.size(), max_depth = 1000;
    std::vector<u32> prods_left = ops;
    std::vector<std::vector<u16>> sorted_wares_all(max_depth);
    std::vector<std::vector<u32>> prods_copy_all(max_depth);
    std::vector<u16> path;
    std::vector<u8> used(nwares);
    size_t min_wares = size_t(-1);

    // How much of the still-needed products this warehouse would supply.
    auto ProdGrow = [&](auto const & prods){
        size_t grow = 0;
        for (size_t i = 0; i < nprods; ++i)
            grow += std::min(prods_left[i], prods[i]);
        return grow;
    };

    std::function<void(size_t, size_t, size_t)> Rec = [&](size_t depth, size_t off, size_t lim){
        size_t prods_need = 0;
        for (auto e: prods_left)
            prods_need += e;
        if (prods_need == 0) {
            // All products bought; report if this beats the current minimum.
            if (path.size() < min_wares) {
                min_wares = path.size();
                std::cout << std::endl << "Time " << std::setw(4) << std::llround(Time())
                          << " sec, Cnt " << std::setw(3) << path.size() << ": ";
                auto cpath = path;
                std::sort(cpath.begin(), cpath.end());
                for (auto e: cpath)
                    std::cout << e << ", ";
                std::cout << std::endl << std::flush;
            }
            return;
        }
        auto & sorted_wares = sorted_wares_all.at(depth);
        auto & prods_copy = prods_copy_all.at(depth);
        sorted_wares.clear();
        for (u16 i = off; i < nwares; ++i)
            if (!used[i])
                sorted_wares.push_back(i);
        // Most relevant warehouses first; keep only the best `lim` of them.
        std::sort(sorted_wares.begin(), sorted_wares.end(),
            [&](auto a, auto b){
                return ProdGrow(wps[a]) > ProdGrow(wps[b]);
            });
        sorted_wares.resize(std::min(lim, sorted_wares.size()));
        for (size_t i = 0; i < sorted_wares.size(); ++i) {
            u16 const iware = sorted_wares[i];
            auto const & wprods = wps[iware];
            prods_copy = prods_left;
            for (size_t j = 0; j < nprods; ++j)
                prods_left[j] -= std::min(prods_left[j], wprods[j]);
            path.push_back(iware);
            used[iware] = 1;
            Rec(depth + 1, iware + 1, lim);
            used[iware] = 0;
            path.pop_back();
            prods_left = prods_copy;
        }
        for (auto e: sorted_wares)
            used[e] = 0;
    };

    for (size_t lim = 1; lim <= nwares; ++lim) {
        std::cout << "Limit " << lim << ", " << std::flush;
        Rec(0, 0, lim);
    }
}

int main() {
    size_t const nwares = 40, nprods = 40;
    std::mt19937_64 rng{std::random_device{}()};
    std::vector<std::vector<u32>> wps(nwares);
    for (size_t i = 0; i < nwares; ++i) {
        wps[i].resize(nprods);
        for (size_t j = 0; j < nprods; ++j)
            wps[i][j] = rng() % 90 + 10;
    }
    std::vector<u32> ops;
    for (size_t i = 0; i < nprods; ++i)
        ops.push_back(rng() % (nwares * 20));
    Solve(wps, ops);
}
Output:
Limit 1, Limit 2, Limit 3, Limit 4,
Time 0 sec, Cnt 13: 6, 8, 12, 13, 29, 31, 32, 33, 34, 36, 37, 38, 39,
Limit 5,
Time 0 sec, Cnt 12: 6, 8, 12, 13, 28, 29, 31, 32, 36, 37, 38, 39,
Limit 6, Limit 7,
Time 0 sec, Cnt 11: 6, 8, 12, 13, 19, 26, 31, 32, 33, 36, 39,
Limit 8, Limit 9, Limit 10, Limit 11, Limit 12, Limit 13, Limit 14, Limit 15,
If you want to go down the ILP route, you could formulate the following programme:

minimize     sum_{i=1..w} x_i
subject to   sum_{i=1..w} C_ij * x_i >= n_j    for all j = 1..p
             x_i in {0, 1}                     for all i = 1..w

where w is the number of warehouses, p the number of products, n_j the quantity of product j ordered, and C_ij the quantity of product j stored in warehouse i. The decisions are to select warehouse i (x_i = 1) or not (x_i = 0).
Using Google's ortools and the open-source CBC solver, this could be implemented as follows in Python:
import numpy as np
from ortools.linear_solver import pywraplp

# Some test data, replace with your own.
p = 50
w = 1000
n = np.random.randint(0, 10, p)
C = np.random.randint(0, 5, (w, p))

solver = pywraplp.Solver("model", pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
x = [solver.BoolVar(f"x[{i}]") for i in range(w)]

# Cover each ordered product with the selected warehouses' stock.
for j in range(p):
    solver.Add(C[:, j] @ x >= n[j])

solver.Minimize(sum(x))
This formulation solves instances with up to a thousand warehouses in a few seconds to a minute. Smaller instances solve much quicker, for (I hope) obvious reasons.
The following outputs the solution, and some statistics:
assert solver.Solve() is not None
print("Solution:")
print(f"assigned = {[i + 1 for i in range(len(x)) if x[i].solution_value()]}")
print(f" obj = {solver.Objective().Value()}")
print(f" time = {solver.WallTime() / 1000}s")

Calculate depth of array

I want to calculate the depth of an array as per the formula in the image, i.e. the continued fraction

depth = arr[0] + 1 / (arr[1] + 1 / (arr[2] + ... + 1 / arr[n-1]))

I have implemented the following code, but I am not able to get correct results. The input contains the size of the array n and its elements.
depth = 0;
for (int i = 0; i < n - 1; i++)
{
    depth = depth + arr[i] + (1 / arr[i + 1]);
}
Try this; it's simplest to do with recursion:
static double calc_depth(int arr[], int i) {
    return arr[i] + (i < arr.length - 1 ? 1.0 / calc_depth(arr, i + 1) : 0.0);
}

public static void main(String args[]) {
    int[] a = {2, 1};
    System.out.println(calc_depth(a, 0));
}
I don't have a C++ editor right now, but the idea is the same.
You can do this your way; here is my Python code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

def cal(arr):
    depth, n = 0.0, len(arr)
    # go from n - 1 down to 0
    for i in range(n - 1, -1, -1):
        if depth == 0.0:
            depth = arr[i]
        else:
            depth = arr[i] + 1.0 / depth
    return depth

arr = [10, 20, 30]
print(cal(arr))
As I mentioned in a comment, if you wish to implement it iteratively, you need to go from n-1 down to 0, not 0 to n-1, and you need to handle the base case.
This can also be implemented recursively, as in the first answer.
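For reference, here is a minimal recursive C++ version of the same idea (my sketch, mirroring the Java answer above):
#include <iostream>
#include <vector>

// Evaluate the continued fraction from the front:
// depth(i) = arr[i] + 1 / depth(i + 1), with the last element as base case.
double calc_depth(const std::vector<double>& arr, std::size_t i) {
    if (i + 1 == arr.size())
        return arr[i];
    return arr[i] + 1.0 / calc_depth(arr, i + 1);
}

int main() {
    std::vector<double> a = {2, 1};
    std::cout << calc_depth(a, 0) << std::endl;  // 2 + 1/1 = 3
}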
It looks like the C++ tag has been removed in the meantime, but if you still want C++, here's almost a one-liner:
#include <vector>
#include <numeric>
#include <limits>  // std::numeric_limits

double depth(const std::vector<double>& array) {
    // Fold from the back: 1/inf == 0 seeds the innermost term.
    return std::accumulate(array.rbegin(), array.rend(),
                           std::numeric_limits<double>::infinity(),
                           [](double s, double a) {
        return a + 1.0 / s;
    });
}

#include <iostream>
int main() {
    std::vector<double> v{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
    std::cout << depth(v);
}
But you won't get very precise results for continued fractions due to the floating-point precision.
You can do it like this (assuming arr holds floating-point values):
double ans = 0;
for (int i = n - 1; i > 0; i--) {
    arr[i-1] = arr[i-1] + 1.0 / arr[i];  // floating-point division, down to arr[0]
    ans = arr[i-1];
}

Error: Input buffer filter is accessed at 63, which is beyond the max (15) in dimension 2 Aborted (core dumped)

I want to test my algorithm, written in Halide, on the Tiramisu compiler. Once I run it, I get an error like this one:
Error: Input buffer filter is accessed at 63, which is beyond the max (15) in dimension 2
Aborted (core dumped)
So I decided to test just the call of the method. Even though I pass the same parameters, I get the same error, or a similar one:
Error: Input buffer bias is accessed at 15, which is beyond the max (4) in dimension 0
Aborted (core dumped)
Here is my wrapper_vgg.h:
#ifndef HALIDE__build___wrapper_vgg_o_h
#define HALIDE__build___wrapper_vgg_o_h

#include <tiramisu/utils.h>

#define RADIUS 3

#ifdef __cplusplus
extern "C" {
#endif

int vgg_tiramisu(halide_buffer_t *, halide_buffer_t *_b_input_buffer, halide_buffer_t *filter, halide_buffer_t *bias, halide_buffer_t *conv, halide_buffer_t *filter2, halide_buffer_t *bias2, halide_buffer_t *conv2, halide_buffer_t *_b_output_buffer, halide_buffer_t *_negative_slope);
int vgg_tiramisu_argv(void **args);
int vgg_ref(halide_buffer_t *_b_input_buffer, halide_buffer_t *filter, halide_buffer_t *bias, halide_buffer_t *filter2, halide_buffer_t *bias2, halide_buffer_t *_b_output_buffer);
int vgg_ref_argv(void **args);

// Result is never null and points to constant static data
const struct halide_filter_metadata_t *vgg_tiramisu_metadata();
const struct halide_filter_metadata_t *vgg_ref_metadata();

#ifdef __cplusplus
}  // extern "C"
#endif

#endif  // HALIDE__build___wrapper_vgg_o_h
And here is my vgg_ref.cpp:
#include "Halide.h"
#include "configure.h"
using namespace Halide;

int main(int argc, char **argv)
{
    ImageParam input{Float(32), 4, "input"};
    ImageParam filter{Float(32), 4, "filter"};
    ImageParam bias{Float(32), 1, "bias"};
    ImageParam filter2{Float(32), 4, "filter2"};
    ImageParam bias2{Float(32), 1, "bias2"};

    /* THE ALGORITHM */
    Var x("x"), y("y"), z("z"), n("n");
    Func f_conv("conv"), f_conv2("conv2");
    Func f_ReLU("ReLU"), f_ReLU2("ReLU2");
    //Func f_Maxpool("Maxpool");
    Func f_vgg("vgg");
    RDom r(0, K+1, 0, K+1, 0, FIn);
    RDom r2(0, K+1, 0, K+1, 0, FOut);

    // First conv computations
    f_conv(x, y, z, n) = bias(z);
    f_conv(x, y, z, n) += filter(r.x, r.y, r.z, z) * input(x + r.x, y + r.y, r.z, n);
    // First ReLU
    f_ReLU(x, y, z, n) = max(0, f_conv(x, y, z, n));
    .....
    .....
    /* THE SCHEDULE */
    // Provide estimates on the input image
    .....
    .....
    f_vgg.compile_to_object("build/generated_fct_vgg_ref.o", {input, filter, bias, filter2, bias2}, "vgg_ref");
    f_vgg.compile_to_lowered_stmt("build/generated_fct_vgg_ref.txt", {input, filter, bias, filter2, bias2}, Text);
    return 0;
}
And here is the wrapper where I call the vgg_ref method:
...
#include "configure.h"
#include "wrapper_vgg.h"
#include <tiramisu/utils.h>
using namespace std;

int main(int, char**)
{
    Halide::Buffer<float> input(N+K, N+K, FIn, BATCH_SIZE);
    Halide::Buffer<float> filter(K+1, K+1, FIn, FOut);
    Halide::Buffer<float> bias(FOut);
    Halide::Buffer<float> conv(N, N, FOut, BATCH_SIZE);
    Halide::Buffer<float> filter2(K+1, K+1, FOut, FOut);
    Halide::Buffer<float> bias2(FOut);
    Halide::Buffer<float> conv2_tiramisu(N-K, N-K, FOut, BATCH_SIZE);
    Halide::Buffer<float> vgg_tiramisu_buff(N-2*K, N-2*K, FOut, BATCH_SIZE);
    Halide::Buffer<int> parameters(5);
    Halide::Buffer<float> negative_slope(1);
    negative_slope(0) = 1;

    // Buffer for Halide
    Halide::Buffer<float> vgg_halide(N-2*K, N-2*K, FOut, BATCH_SIZE);

    std::vector<std::chrono::duration<double,std::milli>> duration_vector_1;
    std::vector<std::chrono::duration<double,std::milli>> duration_vector_2;

    /****************************************** Initialize Buffers *********************************************/
    ....
    ....
    ....
    std::cout << "\t\tBuffers initialized" << std::endl;

    /****************************************** Halide Part ********************************************************/
    for (int i=0; i<NB_TESTS; i++)
    {
        auto start1 = std::chrono::high_resolution_clock::now();
        vgg_ref(input.raw_buffer(), filter.raw_buffer(), bias.raw_buffer(), filter2.raw_buffer(), bias2.raw_buffer(), vgg_halide.raw_buffer());
        auto end1 = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double,std::milli> duration = end1 - start1;
        duration_vector_2.push_back(duration);
    }
    std::cout << "\t\tHalide vgg duration" << ": " << median(duration_vector_2)/1000 << "; " << std::endl;
    std::cout << "\t\t Result" << ": ";

    /****************************************** Tiramisu Part ********************************************************/
    /*  // Initialize parameters[]
    parameters(0) = N;
    parameters(1) = K;
    parameters(2) = FIn;
    parameters(3) = FOut;
    parameters(4) = BATCH_SIZE;
    for (int i=0; i<NB_TESTS; i++)
    {
        // srand (1);
        auto start1 = std::chrono::high_resolution_clock::now();
        vgg_tiramisu(parameters.raw_buffer(), input.raw_buffer(), filter.raw_buffer(), bias.raw_buffer(), conv.raw_buffer(), filter2.raw_buffer(), bias2.raw_buffer(), conv2_tiramisu.raw_buffer(), vgg_tiramisu_buff.raw_buffer(), negative_slope.raw_buffer());
        auto end1 = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double,std::milli> duration = end1 - start1;
        duration_vector_1.push_back(duration);
    }
    std::cout << "\t\tTiramisu vgg duration" << ": " << median(duration_vector_1)/1000 << "; " << std::endl;
    std::cout << "\t\t Result" << ": ";
    */
}
I noticed that once I comment out this line in the Halide part, everything works well:
vgg_ref(input.raw_buffer(), filter.raw_buffer(), bias.raw_buffer(), filter2.raw_buffer(), bias2.raw_buffer(), vgg_halide.raw_buffer());
So the problem is in this call of the Halide function vgg_ref.
But I do not know what this error relates to. I tried calling it with only one parameter, and I always get the same problem. I do not know how to fix it.
Thank you for sharing any advice or pointing my attention to something.
Thank you.
I was able to fix the problem later, Alhamdulillah.
I want to point out that it is impossible to run the benchmarks without creating the ".o" file, so without this line:
f_vgg.compile_to_object("build/generated_fct_vgg_ref.o", {input, filter, bias, filter2, bias2}, "vgg_ref");
But then how did it run in my case? Basically because a ".o" file had been generated somewhere in a previous execution.
Be careful here: checking for a stale ".o" should be a reflex. Many false-result issues are due to an old copy of that object file lying around.
Even after I paid attention to that, I still had the same error, or a similar one :(.
What does this error refer to? Generally it means that somewhere in your code an index does not match its definition in the wrapper.
So here are two things to verify to help fix this issue:
1. Verify the call of the function and its parameters: e.g. if the function requires 5 parameters, check that you pass 5, not more, not less.
2. Verify the intervals of all the indices.
My problem was in these two lines:
RDom r(0, K, 0, K, 0, FIn);
RDom r2(0, K, 0, K, 0, FOut);
An RDom (a multi-dimensional domain over which to iterate) lets you traverse a small matrix within the input matrix, e.g. to apply a filter to the input. The RDom above defines the intervals of x, y and z of the filter matrix.
In the wrapper I defined the filter like this:
Halide::Buffer<float> filter(K+1, K+1, FIn, FOut);
So in the RDom, too, x has to cover K+1 values, but I had only K; that is why I got the problem shown in the question. It should be:
RDom r(0, K+1, 0, K+1, 0, FIn);
RDom r2(0, K+1, 0, K+1, 0, FOut);
And that fixed my problem.
So just pay attention to those small errors that may ruin your day; but it's OK, since fixing them will help you learn more.

C++11 app that uses dispatch_apply not working under Mac OS Sierra

I had a completely functioning codebase written in C++11 that used Grand Central Dispatch parallel processing, specifically dispatch_apply, to run the basic parallel for loop for some trivial game calculations.
Since upgrading to Sierra, this code still runs, but each block runs serially: the cout statement shows that they are executed in serial order, and the CPU usage graph shows no parallel work going on.
Queue is defined as:
workQueue = dispatch_queue_create("workQueue", DISPATCH_QUEUE_CONCURRENT);
And the relevant program code is:
case Concurrency::Parallel: {
    dispatch_apply(stateMap.size(), workQueue, ^(size_t stateIndex) {
        string thisCode = stateCodes[stateIndex];
        long thisCount = stateCounts[stateIndex];
        GameResult sliceResult = playStateOfCode(thisCode, thisCount);
        results[stateIndex] = sliceResult;
        if ((stateIndex + 1) % updatePeriod == 0) {
            cout << stateIndex << endl;
        }
    });
    break;
}
I strongly suspect that this is a bug, but if this is GCD forcing me to use new methods for this, I'm all ears.
I'm not sure if it is a bug in Sierra or not, but it seems to work if you explicitly associate a global concurrent queue as the target:
dispatch_queue_t target =
    dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
dispatch_queue_t workQueue =
    dispatch_queue_create_with_target("workQueue", DISPATCH_QUEUE_CONCURRENT, target);
//                        ^~~~~~~~~~~~                                        ^~~~~~
Here is a working example:
#include <iostream>
#include <fstream>
#include <vector>
#include <cmath>
#include <sstream>
#include <dispatch/dispatch.h>

void load_problem(const std::string, std::vector<std::pair<double,double>>&);

int main() {
    // n-factor polynomial - test against a given problem provided as a set of
    // space delimited x y values in 2d.txt
    std::vector<std::pair<double,double>> problem;
    std::vector<double> test = {14.1333177226503,-0.0368874860476915,
        0.0909424058436257,2.19080982673558,1.24632025036125,0.0444549880462031,
        1.06824631867947,0.551482840616757, 1.04948148731933};
    load_problem("weird.txt", problem);  // a list of space delimited doubles representing x, y.
    size_t a_count = test.size();
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    __block double diffs = 0.0;  // sum of all values..
    dispatch_apply(problem.size(), queue, ^(size_t i) {
        double g = 0;
        for (size_t j = 0; j < a_count - 1; j++) {
            g += test[j] * pow(problem[i].first, a_count - j - 1);
        }
        g += test[a_count - 1];
        // Note: this unsynchronized accumulation races across blocks; it is
        // good enough to demonstrate that the blocks run concurrently, but
        // real code should reduce per-block results safely.
        diffs += pow(g - problem[i].second, 2);
    });
    double delta = 1 / (1 + sqrt(diffs));
    std::cout << "test: fit delta: " << delta << std::endl;
}

void load_problem(const std::string file, std::vector<std::pair<double,double>>& repo) {
    repo.clear();
    std::ifstream ifs(file);
    if (ifs.is_open()) {
        std::string line;
        while (getline(ifs, line)) {
            double x = std::nan("");
            double y = std::nan("");
            std::istringstream istr(line);
            istr >> std::skipws >> x >> y;
            if (!std::isnan(x) && !std::isnan(y)) {
                repo.push_back({x, y});
            }
        }
        ifs.close();
    }
}
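For comparison, here is a minimal sketch of my own (not from the answer above; it assumes macOS 10.12+, where dispatch_queue_create_with_target is available) of the questioner's parallel-for pattern on a queue with an explicit global target. Each block writes to its own slot, so no shared-state synchronization is needed:
#include <iostream>
#include <vector>
#include <dispatch/dispatch.h>

int main() {
    dispatch_queue_t target =
        dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
    dispatch_queue_t workQueue =
        dispatch_queue_create_with_target("workQueue",
                                          DISPATCH_QUEUE_CONCURRENT, target);

    const size_t n = 16;
    std::vector<int> results(n);
    int* out = results.data();  // blocks capture this pointer by value

    // One result slot per block index: disjoint writes, no data race.
    dispatch_apply(n, workQueue, ^(size_t i) {
        out[i] = (int)(i * i);
    });

    for (int r : results) std::cout << r << ' ';
    std::cout << '\n';
}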

Very basic radix sort

I just wrote a simple iterative radix sort and I'm wondering if I have the right idea.
Recursive implementations seem to be much more common.
I am sorting 4-byte integers (unsigned to keep it simple).
I am using 1 byte as the 'digit', so I have 2^8 = 256 buckets.
I am sorting by the most significant digit (MSD) first.
After each sort I put them back into the array in the order they exist in the buckets, and then perform the next sort.
So I end up doing 4 bucket sorts.
It seems to work for a small set of data. Since I am doing it MSD-first, I'm guessing it's not stable and may fail with different data.
Did I miss anything major?
#include <iostream>
#include <vector>
#include <list>
using namespace std;

void radix(vector<unsigned>&);
void print(const vector<list<unsigned> >& listBuckets);
unsigned getMaxForBytes(unsigned bytes);
void merge(vector<unsigned>& data, vector<list<unsigned> >& listBuckets);

int main()
{
    unsigned d[] = {5,3,6,9,2,11,9, 65534, 4,10,17,13, 268435455, 4294967294,4294967293, 268435454,65537};
    vector<unsigned> v(d, d + 17);
    radix(v);
    return 0;
}

void radix(vector<unsigned>& data)
{
    int bytes = 1; // How many bytes to compare at a time
    unsigned numOfBuckets = getMaxForBytes(bytes) + 1;
    cout << "Numbuckets" << numOfBuckets << endl;
    int chunks = sizeof(unsigned) / bytes;
    for(int i = chunks - 1; i >= 0; --i)
    {
        vector<list<unsigned> > buckets; // lazy, wasteful allocation
        buckets.resize(numOfBuckets);
        unsigned mask = getMaxForBytes(bytes);
        unsigned shift = i * bytes * 8;
        mask = mask << shift;
        for(unsigned j = 0; j < data.size(); ++j)
        {
            unsigned bucket = data[j] & mask; // isolate bits of current chunk
            bucket = bucket >> shift;         // bring bits down to least significant
            buckets[bucket].push_back(data[j]);
        }
        print(buckets);
        merge(data, buckets);
    }
}

unsigned getMaxForBytes(unsigned bytes)
{
    unsigned max = 0;
    for(unsigned i = 1; i <= bytes; ++i)
    {
        max = max << 8;
        max |= 0xFF;
    }
    return max;
}

void merge(vector<unsigned>& data, vector<list<unsigned> >& listBuckets)
{
    int index = 0;
    for(unsigned i = 0; i < listBuckets.size(); ++i)
    {
        list<unsigned>& list = listBuckets[i];
        std::list<unsigned>::const_iterator it = list.begin();
        for(; it != list.end(); ++it)
        {
            data[index] = *it;
            ++index;
        }
    }
}

void print(const vector<list<unsigned> >& listBuckets)
{
    cout << "Printing listBuckets: " << endl;
    for(unsigned i = 0; i < listBuckets.size(); ++i)
    {
        const list<unsigned>& list = listBuckets[i];
        if(list.size() == 0) continue;
        std::list<unsigned>::const_iterator it = list.begin(); // Why do I need std here!?
        for(; it != list.end(); ++it)
        {
            cout << *it << ", ";
        }
        cout << endl;
    }
}
Update:
It seems to work well in LSD form, which it can be modified to by changing the chunk loop in radix as follows (least significant chunk first):
for(int i = 0; i < chunks; ++i)
Let's look at an example with two-digit decimal numbers:
49, 25, 19, 27, 87, 67, 22, 90, 47, 91
Sorting by the first digit yields
19, 25, 27, 22, 49, 47, 67, 87, 90, 91
Next, you sort by the second digit, yielding
90, 91, 22, 25, 27, 47, 67, 87, 19, 49
Seems wrong, doesn't it? Or isn't this what you are doing? Maybe you can show us the code if I got you wrong.
If you are doing the second bucket sort on all groups with the same first digit(s), your algorithm would be equivalent to the recursive version. It would be stable as well. The only difference is that you'd do the bucket sorts breadth-first instead of depth-first.
You also need to make sure you sort every bucket from MSD to LSD before reassembling.
Example:
19,76,90,34,84,12,72,38
Sort into 10 buckets [0-9] on MSD
B0=[];B1=[19,12];B2=[];B3=[34,38];B4=[];B5=[];B6=[];B7=[76,72];B8=[84];B9=[90];
If you were to reassemble and then sort again, it would not work. Instead, recursively sort each bucket (see the sketch below).
B1 is sorted into B1B2=[12];B1B9=[19]
Once all have been sorted you can reassemble correctly.
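To make that concrete, here is a minimal sketch of my own (not the answerer's; it assumes the question's setup of unsigned 32-bit values with one byte per digit) of the recursive, depth-first MSD radix sort described above:
#include <iostream>
#include <vector>

// Bucket by the byte at position `byte` (3 = most significant for 32-bit
// values), recursively sort each bucket on the next byte, then reassemble.
void msd_radix(std::vector<unsigned>& data, int byte) {
    if (byte < 0 || data.size() < 2)
        return;
    const unsigned shift = byte * 8;
    std::vector<std::vector<unsigned>> buckets(256);
    for (unsigned v : data)
        buckets[(v >> shift) & 0xFFu].push_back(v);
    data.clear();
    for (auto& b : buckets) {
        msd_radix(b, byte - 1);  // depth-first: finish this bucket first
        data.insert(data.end(), b.begin(), b.end());
    }
}

int main() {
    std::vector<unsigned> v = {19, 76, 90, 34, 84, 12, 72, 38};
    msd_radix(v, 3);
    for (unsigned x : v)
        std::cout << x << ", ";
    std::cout << std::endl;
}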
