Using an algorithm to determine the ideal size for shipping boxes

I work in the logistics department of a company, and recently we have been trying to narrow down the number of different packaging options that we use.
I have all the necessary product data (length, width, height, volume) as well as sales data.
So I was wondering whether it is possible to use an algorithm to cluster the different volumes of the products, and maybe also take into account which sizes sell the most, to determine which box sizes would be ideal.
(Taking into account how often a product sells is secondary, so that is not absolutely necessary.)
What I want is to give the algorithm the number of different box sizes I need, and have it determine where to put the limits so that there is a solution for every product we have, the goal of the optimization being minimum wasted volume while not using more than the set number of different boxes.
Also important to note: the orientation of the products and the quantity per box are fixed, so there is no need to determine how to pack the products, how many would ideally go into one box, or anything like that.
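(One way to state this formally, in my notation: given products p with fixed outer dimensions, demand s_p and a budget of at most K box sizes, choose a set B of box sizes such that every product fits into some box of B, minimizing

    sum over all products p of  s_p * vol(b(p)),

where b(p) is the smallest box in B that p fits into. Since the total product volume, sum of s_p * vol(p), is fixed, this is the same as minimizing the wasted volume; dropping the weights s_p gives the unweighted variant.)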
What kinds of algorithms could be used for a problem like this, and what are my options for programming them? I was thinking of using MATLAB, but I would also be open to other options. I want to program it myself, not simply use an existing program like SPSS.
Thanks in advance, and forgive me if my English is not the best; I'm not a native speaker.

The following C++ program will find optimal solutions for small instances. For 10 input box sizes, each having dimensions randomly chosen in the range 1..100, and for any number 1..10 of box sizes to choose, it computes the answer in a couple of seconds on my computer. For 15 input box sizes, it takes around 10s. For 20 input box sizes, I could compute up to 4 chosen box sizes in about 3 minutes, with memory becoming an issue (it used around 3GB). I had to increase the linker's default stack size to avoid stack overflows.
#include <iostream>
#include <algorithm>
#include <vector>
#include <array>
#include <map>
#include <set>
#include <tuple>
#include <functional>
#include <climits>
using namespace std;

ostream& operator<<(ostream& os, array<int, 3> a) {
    return os << '(' << a[0] << ", " << a[1] << ", " << a[2] << ')';
}

template <size_t N>
long long vol(array<int, N> b) {
    return static_cast<long long>(b[0]) * b[1] * b[2];
}

template <size_t N, size_t M>
bool fits(array<int, N> a, array<int, M> b) {
    return a[0] <= b[0] && a[1] <= b[1] && a[2] <= b[2];
}

// Compares first by volume (descending), then lexicographically.
struct CompareByVolumeDesc {
    bool operator()(array<int, 3> a, array<int, 3> b) const {
        return vol(a) > vol(b) || (vol(a) == vol(b) && a < b);
    }
};

vector<array<int, 3>> candSizes;

struct State {
    vector<array<int, 4>> req; // outstanding product sizes; req[i][3] is the request count
    int n;                     // number of candidate box sizes still considered
    int k;                     // number of box sizes we may still choose
    // Needed for map<>
    bool operator<(State const& other) const {
        return tie(n, k, req) < tie(other.n, other.k, other.req);
    }
} dummy = { {}, -1, -1 };

map<State, pair<long long, State>> memo;

// Compute the minimum total box volume needed to cover s.req using at most
// s.k of the first s.n candidate box sizes, plus the successor state on an
// optimal path.
pair<long long, State> solve(State const& s) {
    if (s.req.empty()) return { 0, dummy };
    if (s.k == 0 || s.k > s.n) return { LLONG_MAX / 4, dummy }; // infeasible
    auto previousAnswer = memo.find(s);
    if (previousAnswer != end(memo)) return previousAnswer->second;
    // See what fits into the nth candidate box size.
    long long nFitting = 0;
    vector<array<int, 4>> notFitting;
    for (auto r : s.req) {
        if (fits(r, candSizes[s.n - 1])) {
            nFitting += r[3];
        } else {
            notFitting.push_back(r);
        }
    }
    // Option 1: skip the nth candidate.
    pair<long long, State> solution;
    solution.second = { s.req, s.n - 1, s.k };
    solution.first = solve(solution.second).first;
    // Option 2: choose it, and pack everything that fits into it.
    if (nFitting > 0) {
        State useNth = { notFitting, s.n - 1, s.k - 1 };
        long long useNthVol = nFitting * vol(candSizes[s.n - 1]) + solve(useNth).first;
        if (useNthVol < solution.first) solution = { useNthVol, useNth };
    }
    memo[s] = solution;
    return solution;
}

// Walk the successor states recorded by solve(), printing each chosen size.
void printOptimalSolution(State s) {
    while (!s.req.empty()) {
        State next = solve(s).second;
        if (next.k < s.k) cout << candSizes[s.n - 1] << endl;
        s = next;
    }
}

int main(int argc, char** argv) {
    int n, k;
    cin >> n >> k;
    vector<array<int, 4>> requestedBoxSizes;
    set<int> lengths, widths, heights;
    for (int i = 0; i < n; ++i) {
        array<int, 4> d; // d[3] is the number of requests for this box size
        cin >> d[0] >> d[1] >> d[2] >> d[3];
        sort(begin(d), begin(d) + 3, greater<int>());
        requestedBoxSizes.push_back(d);
        lengths.insert(d[0]);
        widths.insert(d[1]);
        heights.insert(d[2]);
    }
    // Generate all candidate box sizes: every combination of an occurring
    // length, width and height, with dimensions sorted so they compare
    // component-wise.
    for (int l : lengths) {
        for (int w : widths) {
            for (int h : heights) {
                array<int, 3> cand = { l, w, h };
                sort(begin(cand), end(cand), greater<int>());
                candSizes.push_back(cand);
            }
        }
    }
    sort(begin(candSizes), end(candSizes), CompareByVolumeDesc());
    candSizes.erase(unique(begin(candSizes), end(candSizes)), end(candSizes));
    cout << "Number of candidate box sizes: " << candSizes.size() << endl;
    State startState = { requestedBoxSizes, static_cast<int>(candSizes.size()), k };
    long long minVolume = solve(startState).first;
    cout << "Minimum achievable volume using " << k << " box sizes: " << minVolume << endl;
    cout << "Optimal set of " << k << " box sizes:" << endl;
    printOptimalSolution(startState);
    return 0;
}
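For reference, this is plain C++17; assuming the file is saved as boxes.cpp (file name mine), a typical build is:
$ g++ -O2 -std=c++17 boxes.cpp -o boxes
solve() recurses once per candidate box size it skips, which is why I had to raise the default stack size; on Linux, raising the runtime stack limit (e.g. ulimit -s unlimited) achieves the same.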
Example input:
15 5
100 61 35 27
17 89 96 47
31 69 30 55
37 23 39 9
94 11 48 19
38 17 29 36
63 79 80 36
59 52 37 51
86 63 54 7
32 30 11 26
50 88 51 5
74 70 33 14
67 46 4 79
83 94 89 58
65 42 37 69
Example output:
Number of candidate box sizes: 2310
Minimum achievable volume using 5 box sizes: 124069460
Optimal set of 5 box sizes:
(94, 48, 11)
(69, 52, 37)
(100, 89, 35)
(88, 79, 63)
(94, 89, 83)
I'll explain the algorithm behind this if there's interest. It's better than considering all possible combinations of k candidate box sizes, but not terribly efficient.
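For the curious, the recurrence that the memo map caches can be written as follows (my notation, read off solve() above). With R the outstanding product sizes, c_n the nth candidate box, F = { r in R : r fits into c_n } and demand(F) the total request count over F:

    solve(R, n, k) = min( solve(R, n-1, k),
                          demand(F) * vol(c_n) + solve(R \ F, n-1, k-1) )

where the second option only exists if F is non-empty. Since candSizes is sorted by decreasing volume and solve() works from the back, candidates are tried from smallest to largest; that is why it is safe to pack everything in F into c_n as soon as c_n is chosen: every box chosen later is at least as big.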

Related

Understanding the speed-up of an OpenMP program across NUMA nodes

I came across this speed-up behavior and I am finding it hard to explain. Following is the background:
Program
Gaussian elimination is invoked inside a loop to solve a batch of linear systems, parallelizing the workload across compute units. We use an augmented matrix of dimension M by (M+1), where the additional column holds the RHS.
HPC setup: Cray XC50 node with Intel Xeon Gold 6148 processors, with the following configuration:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 0 size: 95325 MB
node 0 free: 93811 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 1 size: 96760 MB
node 1 free: 96374 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Although this page does not describe our actual HPC system, the block diagram and the related explanation seem to fully apply (https://www.nas.nasa.gov/hecc/support/kb/skylake-processors_550.html). Specifically, sub-NUMA clustering seems to be disabled.
Jobs are submitted through ALPS (aprun) as follows:
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,30,31,32,33,34,35,36,37,38,39 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 40,41,42,43,44,45,46,47,48,49,60,61,62,63,64,65,66,67,68,69 -e N=4000 -e M=200 -e MODE=2 ./gem
In the above, N indicates the number of matrices and M the dimension of each matrix. These are passed as environment variables to the program and used internally. MODE can be ignored for this discussion.
The -cc list explicitly lists the CPUs to bind to. OMP_NUM_THREADS is set to 20. The intent is to use 20 threads across 20 compute units.
The sequential and parallel run times are recorded within the program using omp_get_wtime(), and the results are the following:
| CPU Binding | Objective | Speed Up |
|---------------|---------------|---------------|
| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 | Load work across 20 physical cores on socket 0 | 13.081944 |
| 0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29 | Spread across first 10 physical cores on socket 0 & socket 1 | 18.332559 |
| 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 | Spread across 2nd set of 10 physical cores on socket 0 & first 10 of socket 1 | 18.636265 |
| 40,41,42,43,44,45,46,47,48,49,60,61,62,63,64,65,66,67,68,69 | Spread across virtual cores across sockets (40-0, 60-21) | 15.922209 |
Why is the speed-up lower in the first case, when all 20 physical cores on socket 0 are being used? The understanding here is that when tasks are spread across sockets, UPI comes into effect and it should be slower, whereas it seems to be exactly the opposite. Also, what can possibly explain the last scenario, where virtual cores are used?
Note: we have tried multiple iterations and the results for the above combinations are quite consistent.
Edit2: Source code
#define _GNU_SOURCE
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <omp.h>

double drand(double low, double high, unsigned int *seed)
{
    return ((double)rand_r(seed) * (high - low)) / (double)RAND_MAX + low;
}

void init_vars(int *N, int *M, int *mode)
{
    const char *number_of_instances = getenv("N");
    if (number_of_instances) {
        *N = atoi(number_of_instances);
    }
    const char *matrix_dim = getenv("M");
    if (matrix_dim) {
        *M = atoi(matrix_dim);
    }
    const char *running_mode = getenv("MODE");
    if (running_mode) {
        *mode = atoi(running_mode);
    }
}

void print_matrix(double *instance, int M)
{
    for (int row = 0; row < M; row++) {
        for (int column = 0; column <= M; column++) {
            printf("%lf ", instance[row * (M + 1) + column]);
        }
        printf("\n");
    }
    printf("\n");
}

void swap(double *a, double *b)
{
    double temp = *a;
    *a = *b;
    *b = temp;
}

void init_matrix(double *instance, unsigned int M)
{
    unsigned int seed = 45613 + 19 * omp_get_thread_num();
    for (int row = 0; row < M; row++) {
        for (int column = 0; column <= M; column++) {
            instance[row * (M + 1) + column] = drand(-1.0, 1.0, &seed);
        }
    }
}

void initialize_and_solve(int M)
{
    double *instance = malloc(M * (M + 1) * sizeof(double));
    // Initialise the matrix
    init_matrix(instance, M);
    // Performing elementary row operations
    int i, j, k = 0, c, flag = 0;
    for (i = 0; i < M; i++) {
        // instance[i * (M + 2)] is the diagonal element instance[i][i]
        if (instance[i * (M + 2)] == 0) {
            c = 1;
            while ((i + c) < M && instance[(i + c) * (M + 1) + i] == 0)
                c++;
            if ((i + c) == M) { // no non-zero pivot found: give up
                flag = 1;
                break;
            }
            for (j = i, k = 0; k <= M; k++) {
                swap(&instance[j * (M + 1) + k], &instance[(j + c) * (M + 1) + k]);
            }
        }
        for (j = 0; j < M; j++) {
            // Excluding all i == j
            if (i != j) {
                // Converting the matrix to reduced row
                // echelon form (diagonal matrix)
                double pro = instance[j * (M + 1) + i] / instance[i * (M + 2)];
                for (k = 0; k <= M; k++)
                    instance[j * (M + 1) + k] -= instance[i * (M + 1) + k] * pro;
            }
        }
    }
    // Get the solution in the last column
    for (int i = 0; i < M; i++) {
        instance[i * (M + 1) + M] /= instance[i * (M + 2)];
    }
    free(instance);
    instance = NULL;
}

double solve_serial(int N, int M)
{
    double now = omp_get_wtime();
    for (int i = 0; i < N; i++) {
        initialize_and_solve(M);
    }
    return omp_get_wtime() - now;
}

double solve_parallel(int N, int M)
{
    double now = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        initialize_and_solve(M);
    }
    return omp_get_wtime() - now;
}

int main(int argc, char **argv)
{
    // Default parameters
    int N = 200, M = 200, mode = 2;
    if (argc == 4) {
        N = atoi(argv[1]);
        M = atoi(argv[2]);
        mode = atoi(argv[3]);
    }
    init_vars(&N, &M, &mode);
    if (mode == 0) {
        // Serial only
        double serial = solve_serial(N, M);
        printf("Time, %d, %d, %lf\n", N, M, serial);
    } else if (mode == 1) {
        // Parallel only
        double parallel = solve_parallel(N, M);
        printf("Time, %d, %d, %lf\n", N, M, parallel);
    } else {
        // Both serial and parallel, plus the speed-up
        double serial = solve_serial(N, M);
        double parallel = solve_parallel(N, M);
        printf("Time, %d, %d, %lf, %lf, %lf\n", N, M, serial, parallel, serial / parallel);
    }
    return 0;
}
Edit3: Rephrased the first point to clarify what is actually being done (based on feedback in the comments).
You say you implement a "Simple implementation of Gaussian Elimination". Sorry, there is no such thing. There are multiple different algorithms and they all come with their own analysis. But let's assume you use the textbook one. Even then, Gaussian Elimination is not simple.
First of all, you haven't stated that you initialize your data in parallel. If you don't do that, all the data will wind up on socket 0 and you will get bad performance, never mind the speedup. But let's assume you did the right thing here. (If not, google "first touch"; a sketch follows below.)
In the GE algorithm, each of the sequential k iterations works on a smaller and smaller subset of the data. This means that no simple mapping of data to cores is possible. If you place your data in such a way that initially each core works on local data, this will quickly no longer be the case.
In fact, after half the number of iterations, half your cores will be pulling data from the other socket, leading to NUMA coherence delays. Maybe a spread binding is better here than your compact binding.
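To make the first-touch point concrete, here is a minimal sketch (my example, not the poster's code; it relies on the initialization loop and the compute loop using the same static schedule):
#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const long N = 1L << 26;
    // malloc only reserves address space; physical pages are not placed yet.
    double *a = (double *)malloc(N * sizeof(double));
    // First touch: each page lands on the NUMA node of the thread that first
    // writes it, so initialize with the same schedule the compute loop uses.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) a[i] = 0.0;
    // Compute: with the matching schedule, each thread now works mostly on
    // node-local memory.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) a[i] = 2.0 * a[i] + 1.0;
    std::printf("%f\n", a[0]);
    free(a);
    return 0;
}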
Why is the speed-up lower in the first case, when all 20 physical cores on socket 0 are being used?
Results often depend on the application, but some patterns recur. My guess is that your application makes heavy use of main memory, and using 2 sockets brings more DDR4 RAM channels into play than using only one. Indeed, with NUMA-node-local allocations, 1 socket can access RAM at 128 GB/s, while 2 sockets together can access RAM at 256 GB/s. (With accesses spread evenly across both sockets' RAM instead, performance would be far worse and bounded by UPI, though I would not expect 2 sockets to be much slower there because of the full-duplex data transfer.)
The understanding here is that when tasks are spread across sockets, UPI comes into effect and it should be slower, whereas it seems to be exactly the opposite.
UPI is only a bottleneck when data is massively transferred between the two sockets, but good NUMA applications should not do that: each part should operate on its own NUMA node's memory.
You can check the UPI and RAM throughput using hardware counters.
Also, what can possibly explain the last scenario, where virtual cores are used?
I do not have a definite explanation for this. Note that the higher IDs are the second hardware thread of each core, so it is quite likely related to low-level hyper-threading behaviour (maybe some processes are bound to certain PUs, pre-empting the target PUs, or the second PU of a core simply has a lower priority somehow). Note also that physical core IDs and logical PU IDs are often not mapped the same way, so if you use the wrong one you could end up binding 2 threads to the same core. I advise you to use hwloc to check that.
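A minimal sketch using the hwloc C API (my code; hwloc-ls from the same package prints this mapping too; build with g++ pu_map.cpp -lhwloc):
#include <hwloc.h>
#include <cstdio>

// Print which physical core each logical PU belongs to, so that the
// aprun -cc list can be checked against the real topology.
int main() {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    int npu = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    for (int i = 0; i < npu; ++i) {
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
        hwloc_obj_t core = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu);
        std::printf("PU P#%u is on core P#%u\n", pu->os_index, core ? core->os_index : 0u);
    }
    hwloc_topology_destroy(topo);
    return 0;
}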

Minimizing the number of warehouses for an order

I am trying to figure out an algorithm to efficiently solve the following problem:
There are w warehouses that store p different products with different quantities
A customer places an order for n of the p products
The goal is to pick the minimum number of warehouses from which the order could be allocated.
E.g. the distribution of inventory in three warehouses is as follows:
| | Product 1 | Product 2 | Product 3 |
|---------------|---------------|---------------|---------------|
| Warehouse 1 | 2 | 5 | 0 |
| Warehouse 2 | 1 | 4 | 4 |
| Warehouse 3 | 3 | 1 | 4 |
Now suppose an order is placed with the following ordered quantities:
| | Product 1 | Product 2 | Product 3 |
|---------------|---------------|---------------|---------------|
| Ordered Qty | 5 | 4 | 1 |
The optimal solution here would be to allocate the order from Warehouse 1 and Warehouse 3; no smaller subset of the 3 warehouses can satisfy it.
I have tried using brute force to solve this; however, for larger numbers of warehouses the algorithm performs very poorly. I have also tried a few greedy allocation algorithms, but, as expected, they fail to minimize the number of sub-orders in many cases. Are there any other algorithms/approaches I should look into?
Part 1 (see also Part 2 below)
Your task looks like the Set Cover Problem, which is NP-hard, so exact solutions are believed to require exponential time in the worst case.
I designed (and implemented in C++) my own solution for it, which may be sub-exponential in one case: when many subsets of warehouses produce the same summed product quantities. In other words, when the exponential number of warehouse subsets (2^NumWarehouses) is much bigger than the number of distinct product-count vectors those subsets can produce. This often happens in the tests of online-competition problems like yours. If it does, my solution is sub-exponential both in CPU and in RAM.
I used a dynamic programming approach. The whole algorithm can be described as follows:
1. Create a map whose keys are vectors of per-product amounts. Each key points to a triple: (a) the set of warehouses that can be the last one added to reach this amount vector (used to restore the exact chosen warehouses), (b) the minimal number of warehouses needed to achieve these amounts, (c) the previous warehouse taken to achieve that minimum. The map is initialized with a single key, the all-zero vector (0, 0, ..., 0).
2. Iterate through all warehouses in a loop, doing steps 3-4 for each.
3. Iterate through all product-amount vectors (keys) currently in the map, doing step 4 for each.
4. To the iterated vector, add the product amounts of the iterated warehouse. This sum of two vectors is a new key in the map; in the value pointed to by this key, add the index of the iterated warehouse to the set, and leave the minimum and previous-warehouse fields at -1 (uninitialized).
5. Using a recursive function, find for each key the minimal number of warehouses needed, together with the previous warehouse achieving it. This is easy: for a given key, iterate over all warehouses in its set, recursively find their minima, and the minimum of the current key is the smallest of those minima plus 1.
6. Iterate through all keys in the map that are greater than or equal (componentwise) to the ordered amounts. All these keys yield a solution, but only some yield a minimal one; keep the key with the smallest minimum. If all keys in the map are smaller than the ordered vector, there is no feasible solution and the program exits with an error.
7. Given the minimal key, restore the path of used warehouses backwards. This is easy because every key stores the minimal warehouse count and the previous warehouse taken to achieve it; jumping through the "previous" warehouses recovers the whole path. Finally, output this minimal solution.
As already mentioned, this algorithm's memory and time complexity equal the number of distinct product-amount vectors that can be formed by subsets of warehouses, which may (if we're lucky) or may not (if we're unlucky) be sub-exponential.
Full C++ code implementing algorithm above (implemented from scratch by me):
Try it online!
#include <cstdint>
#include <vector>
#include <tuple>
#include <map>
#include <set>
#include <unordered_map>
#include <functional>
#include <stdexcept>
#include <iostream>
#include <algorithm>

#define ASSERT(cond) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "!"); }
#define LN { std::cout << "LN " << __LINE__ << std::endl; }

using u16 = uint16_t;
using u32 = uint32_t;
using u64 = uint64_t;

int main() {
    std::vector<std::vector<u32>> warehouses_products = {
        {2, 5, 0},
        {1, 4, 4},
        {3, 1, 4},
    };
    std::vector<u32> order_products = {5, 4, 1};

    size_t const nwares = warehouses_products.size(),
                 nprods = warehouses_products.at(0).size();
    ASSERT(order_products.size() == nprods);

    // Key: reachable product-amount vector. Value: (set of warehouses that
    // can be the last one added, minimal warehouse count, previous warehouse).
    std::map<std::vector<u32>, std::tuple<std::set<u16>, u16, u16>> d =
        {{std::vector<u32>(nprods), {{}, 0, u16(-1)}}};
    for (u16 iware = 0; iware < nwares; ++iware) {
        auto const & wprods = warehouses_products[iware];
        ASSERT(wprods.size() == nprods);
        auto dc = d;
        for (auto const & [k, _]: d) {
            auto prods = k;
            for (size_t i = 0; i < wprods.size(); ++i)
                prods[i] += wprods[i];
            dc.insert({prods, {{}, u16(-1), u16(-1)}});
            std::get<0>(dc[prods]).insert(iware);
        }
        d = dc;
    }

    // Recursively compute the minimal number of warehouses for each key.
    std::function<u16(std::vector<u32> const &)> FindMin =
        [&](auto const & prods) {
            auto & [a, b, c] = d.at(prods);
            if (b != u16(-1))
                return b;
            u16 minv = u16(-1), minw = u16(-1);
            for (auto iware: a) {
                auto const & wprods = warehouses_products[iware];
                auto cprods = prods;
                for (size_t i = 0; i < wprods.size(); ++i)
                    cprods[i] -= wprods[i];
                auto const fmin = FindMin(cprods) + 1;
                if (fmin < minv) {
                    minv = fmin;
                    minw = iware;
                }
            }
            ASSERT(minv != u16(-1) && minw != u16(-1));
            b = minv;
            c = minw;
            return b;
        };
    for (auto const & [k, v]: d)
        FindMin(k);

    // Among all keys that dominate the order, pick the one needing the
    // fewest warehouses.
    std::vector<u32> minp;
    u16 minv = u16(-1);
    for (auto const & [k, v]: d) {
        bool matched = true;
        for (size_t i = 0; i < nprods; ++i)
            if (order_products[i] > k[i]) {
                matched = false;
                break;
            }
        if (!matched)
            continue;
        if (std::get<1>(v) < minv) {
            minv = std::get<1>(v);
            minp = k;
        }
    }
    if (minp.empty()) {
        std::cout << "Can't buy all products!" << std::endl;
        return 0;
    }

    // Walk the "previous warehouse" links back to the zero vector.
    std::vector<u16> answer;
    while (minp != std::vector<u32>(nprods)) {
        auto const & [a, b, c] = d.at(minp);
        answer.push_back(c);
        auto const & wprods = warehouses_products[c];
        for (size_t i = 0; i < wprods.size(); ++i)
            minp[i] -= wprods[i];
    }
    std::sort(answer.begin(), answer.end());
    std::cout << "WareHouses: ";
    for (auto iware: answer)
        std::cout << iware << ", ";
    std::cout << std::endl;
}
Input:
WareHouses Products:
{2, 5, 0},
{1, 4, 4},
{3, 1, 4},
Ordered Products:
{5, 4, 1}
Output:
WareHouses: 0, 2,
Part 2
A totally different solution, which I also implemented below.
This one is based on backtracking with a recursive function.
Although this solution is exponential in the worst case, it gives close-to-optimal solutions after a short time. So you just run this program for as long as you can afford, and output whatever it has found so far as an approximate solution.
The algorithm is as follows:
1. Suppose we have some products left to buy. Sort all warehouses not taken so far in descending order of the total amount of still-needed products they can supply.
2. In a loop, take each next warehouse from this sorted list, but only the first limit elements (limit is a fixed given value). This way we greedily take warehouses in order of relevance to the products left to buy.
3. After a warehouse is taken, recurse into the current function, where we again form a sorted list of warehouses and take the next most relevant one; in other words, jump to step 1 of this algorithm.
4. On each function call, if we have bought all products and the number of taken warehouses is smaller than the current minimum, output this solution and update the minimum.
Thus the algorithm starts out very greedy and then becomes slower and less greedy, approaching a brute-force approach. Very good solutions already appear in the first seconds.
As an example, below I create 40 random warehouses with 40 random product amounts each. This quite large task is solved, probably optimally, within the first second. By "probably" I mean that further minutes of running don't give any better solution.
Try it online!
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
#include <functional>
#include <algorithm>
#include <chrono>
#include <cmath>

using u8 = uint8_t;
using u16 = uint16_t;
using u32 = uint32_t;
using i32 = int32_t;

double Time() {
    static auto const gtb = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}

// Needs C++20 (abbreviated function template syntax).
void Solve(auto const & wps, auto const & ops) {
    size_t const nwares = wps.size(), nprods = ops.size(), max_depth = 1000;
    std::vector<u32> prods_left = ops;
    std::vector<std::vector<u16>> sorted_wares_all(max_depth);
    std::vector<std::vector<u32>> prods_copy_all(max_depth);
    std::vector<u16> path;
    std::vector<u8> used(nwares);
    size_t min_wares = size_t(-1);
    // How much of the still-needed products this warehouse could supply.
    auto ProdGrow = [&](auto const & prods){
        size_t grow = 0;
        for (size_t i = 0; i < nprods; ++i)
            grow += std::min(prods_left[i], prods[i]);
        return grow;
    };
    std::function<void(size_t, size_t, size_t)> Rec = [&](size_t depth, size_t off, size_t lim){
        size_t prods_need = 0;
        for (auto e: prods_left)
            prods_need += e;
        if (prods_need == 0) {
            // All products bought: report if this is a new minimum.
            if (path.size() < min_wares) {
                min_wares = path.size();
                std::cout << std::endl << "Time " << std::setw(4) << std::llround(Time())
                          << " sec, Cnt " << std::setw(3) << path.size() << ": ";
                auto cpath = path;
                std::sort(cpath.begin(), cpath.end());
                for (auto e: cpath)
                    std::cout << e << ", ";
                std::cout << std::endl << std::flush;
            }
            return;
        }
        auto & sorted_wares = sorted_wares_all.at(depth);
        auto & prods_copy = prods_copy_all.at(depth);
        sorted_wares.clear();
        for (u16 i = off; i < nwares; ++i)
            if (!used[i])
                sorted_wares.push_back(i);
        // Most relevant warehouses first; keep only the best `lim` of them.
        std::sort(sorted_wares.begin(), sorted_wares.end(),
            [&](auto a, auto b){
                return ProdGrow(wps[a]) > ProdGrow(wps[b]);
            });
        sorted_wares.resize(std::min(lim, sorted_wares.size()));
        for (size_t i = 0; i < sorted_wares.size(); ++i) {
            u16 const iware = sorted_wares[i];
            auto const & wprods = wps[iware];
            prods_copy = prods_left;
            for (size_t j = 0; j < nprods; ++j)
                prods_left[j] -= std::min(prods_left[j], wprods[j]);
            path.push_back(iware);
            used[iware] = 1;
            Rec(depth + 1, iware + 1, lim);
            used[iware] = 0;
            path.pop_back();
            prods_left = prods_copy;
        }
        for (auto e: sorted_wares)
            used[e] = 0;
    };
    for (size_t lim = 1; lim <= nwares; ++lim) {
        std::cout << "Limit " << lim << ", " << std::flush;
        Rec(0, 0, lim);
    }
}

int main() {
    size_t const nwares = 40, nprods = 40;
    std::mt19937_64 rng{std::random_device{}()};
    std::vector<std::vector<u32>> wps(nwares);
    for (size_t i = 0; i < nwares; ++i) {
        wps[i].resize(nprods);
        for (size_t j = 0; j < nprods; ++j)
            wps[i][j] = rng() % 90 + 10;
    }
    std::vector<u32> ops;
    for (size_t i = 0; i < nprods; ++i)
        ops.push_back(rng() % (nwares * 20));
    Solve(wps, ops);
}
Output:
Limit 1, Limit 2, Limit 3, Limit 4,
Time 0 sec, Cnt 13: 6, 8, 12, 13, 29, 31, 32, 33, 34, 36, 37, 38, 39,
Limit 5,
Time 0 sec, Cnt 12: 6, 8, 12, 13, 28, 29, 31, 32, 36, 37, 38, 39,
Limit 6, Limit 7,
Time 0 sec, Cnt 11: 6, 8, 12, 13, 19, 26, 31, 32, 33, 36, 39,
Limit 8, Limit 9, Limit 10, Limit 11, Limit 12, Limit 13, Limit 14, Limit 15,
If you want to go down the ILP route, you could formulate the following program:

    minimize     x_1 + x_2 + ... + x_w
    subject to   C_1j * x_1 + ... + C_wj * x_w >= n_j   for each product j = 1, ..., p
                 x_i in {0, 1}                          for each warehouse i = 1, ..., w

where w is the number of warehouses, p the number of products, n_j the quantity of product j ordered, and C_ij the quantity of product j stored in warehouse i. The decisions are to select warehouse i (x_i = 1) or not (x_i = 0).
Using Google's ortools and the open-source CBC solver, this could be implemented as follows in Python:
import numpy as np
from ortools.linear_solver import pywraplp
# Some test data, replace with your own.
p = 50
w = 1000
n = np.random.randint(0, 10, p)
C = np.random.randint(0, 5, (w, p))
solver = pywraplp.Solver("model", pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
x = [solver.BoolVar(f"x[{i}]") for i in range(w)]
for j in range(p):
    solver.Add(C[:, j] @ x >= n[j])
solver.Minimize(sum(x))
This formulation solves instances with up to a thousand warehouses in a few seconds to a minute. Smaller instances solve much quicker, for (I hope) obvious reasons.
The following outputs the solution, and some statistics:
assert solver.Solve() is not None
print("Solution:")
print(f"assigned = {[i + 1 for i in range(len(x)) if x[i].solution_value()]}")
print(f" obj = {solver.Objective().Value()}")
print(f" time = {solver.WallTime() / 1000}s")

hash function for a 64-bit OS/compile, for an object that's really just a 4-byte int

I have a class named Foo that is privately nothing more than 4-byte int. If I return its value as an 8-byte size_t, am I going to be screwing up unordered_map<> or anything else? I could fill all bits with something like return foo + foo << 32;. Would that be better, or would it be worse as all hashes are now multiples of 0x100000001? Or how about return ~foo + foo << 32; which would use all 64 bits and also not have a common factor?
namespace std {
    template<> struct hash<MyNamespace::Foo> {
        typedef size_t result_type;
        typedef MyNamespace::Foo argument_type;
        size_t operator()(const MyNamespace::Foo& f) const { return (size_t) f.u32InternalValue; }
    };
}
An incremental uint32_t key converted to uint64_t works well.
unordered_map grows its hash table incrementally.
The less significant bits of the key are used to determine the bucket position; in an example with 4 buckets, the 2 least significant bits are used.
Elements whose keys fall into the same bucket (i.e. are congruent modulo the number of buckets) are chained in a linked list. This is where the concept of load factor comes in.
// 4 Buckets example
******** ******** ******** ******** ******** ******** ******** ******XX
bucket 00 would contain keys like {0, 256, 200000 ...}
bucket 01 would contain keys like {1, 513, 4008001 ...}
bucket 10 would contain keys like {2, 130, 10002 ...}
bucket 11 would contain keys like {3, 259, 1027, 20003, ...}
If you try to save an additional value in a bucket and the load factor goes over the limit, the table is resized (e.g. you try to save a 5th element in a 4-bucket table with load_factor=1.0).
Consequently:
Having a uint32_t or a uint64_t key will have little impact until you reach 2^32 elements in the hash table.
Would that be better, or would it be worse as all hashes are now multiples of 0x100000001?
This will have no impact until your table has more than 2^32 buckets: the low 32 bits of foo * 0x100000001 are just foo.
Good key conversion between an incremental uint32_t and uint64_t:
key64 = static_cast<uint64_t>(key32);
Bad key conversion between an incremental uint32_t and uint64_t:
key64 = static_cast<uint64_t>(key32) << 32;
The best is to keep the keys as evenly distributed as possible, avoiding hashes that share a common factor again and again. E.g. in the code below, keys that are all multiples of 7 would collide until the table is resized beyond 7 buckets (to 17 in the run below).
https://onlinegdb.com/r1N7TNySv
#include <iostream>
#include <unordered_map>
using namespace std;

// Print to std output the internal structure of an unordered_map.
template <typename K, typename T>
void printMapStruct(unordered_map<K, T>& map)
{
    cout << "The map has " << map.bucket_count()
         << " buckets and max load factor: " << map.max_load_factor() << endl;
    for (size_t i = 0; i < map.bucket_count(); ++i)
    {
        cout << " Bucket " << i << ": ";
        for (auto it = map.begin(i); it != map.end(i); ++it)
        {
            cout << it->first << " ";
        }
        cout << endl;
    }
    cout << endl;
}

// Print the list of bucket sizes by this implementation
void printMapResizes()
{
    cout << "Map bucket counts:" << endl;
    unordered_map<size_t, size_t> map;
    size_t lastBucketSize = 0;
    for (size_t i = 0; i < 1024 * 1024; ++i)
    {
        if (lastBucketSize != map.bucket_count())
        {
            cout << map.bucket_count() << " ";
            lastBucketSize = map.bucket_count();
        }
        map.emplace(i, i);
    }
    cout << endl;
}

int main()
{
    unordered_map<size_t, size_t> map;
    printMapStruct(map);
    map.emplace(0, 0);
    map.emplace(1, 1);
    printMapStruct(map);
    map.emplace(72, 72);
    map.emplace(17, 17);
    printMapStruct(map);
    map.emplace(7, 7);
    map.emplace(14, 14);
    printMapStruct(map);
    printMapResizes();
    return 0;
}
A note on the bucket count:
In the above example, the bucket count is as follows:
1 3 7 17 37 79 167 337 709 1493 3209 6427 12983 26267 53201 107897 218971 444487 902483 1832561
This seems to purposely follow a series of prime numbers (minimizing collisions). I am not aware of the exact function behind it.

Algorithm for Simple Squared Squares

I want to split a square into unequal squares.
After some searching on the web, I found this link.
This is the output I need:
Does anyone have an idea for this?
As Yves Daoust said, the algorithm to solve this is going to be slow. The first challenge is to determine which squares COULD be combined to fit into your big square. Then figure out if they WILL fit.
I would first filter by area.
To answer the first part, you need to look for combinations of squares whose areas add up to that of your big one. There are likely multiple combinations, as a 5x5 square takes up the same area as a 3x3 plus a 4x4 square. This is an O(2^n) problem in itself.
Then attempt to arrange them.
I would make a matrix the size of your big square. Then, starting at the topmost, then leftmost unoccupied index, add a square by marking the matrix positions it covers as occupied. Then move to the next unoccupied space and, following the previous rules, add an unused square. If no square fits, remove the previously placed square and continue with the next candidate. This is a method begging for recursion; a sketch follows below.
As I said at the beginning, this is a SLOW way to do it, but it will give you a solution if one exists.
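A minimal sketch of that recursion (my code; it scans top-to-bottom, left-to-right, reads the big square's side n, the number of candidate squares m and their sides from stdin, and prints one tiling or "no tiling"):
#include <iostream>
#include <vector>
using namespace std;

// Try to exactly tile an n x n square with the given sides, each used at
// most once. Grid cells hold 0 (free) or the side of the covering square.
bool place(vector<vector<int>>& grid, const vector<int>& sides, vector<bool>& used) {
    int n = grid.size();
    int r = -1, c = -1;
    for (int i = 0; i < n && r < 0; ++i)        // find the first free cell,
        for (int j = 0; j < n; ++j)             // scanning row by row
            if (grid[i][j] == 0) { r = i; c = j; break; }
    if (r < 0) return true;                     // no free cell left: tiled
    for (size_t k = 0; k < sides.size(); ++k) {
        if (used[k]) continue;
        int s = sides[k];
        if (r + s > n || c + s > n) continue;   // sticks out of the square
        bool ok = true;
        for (int i = r; i < r + s && ok; ++i)
            for (int j = c; j < c + s; ++j)
                if (grid[i][j] != 0) { ok = false; break; }
        if (!ok) continue;
        for (int i = r; i < r + s; ++i)         // place the square
            for (int j = c; j < c + s; ++j)
                grid[i][j] = s;
        used[k] = true;
        if (place(grid, sides, used)) return true;
        used[k] = false;                        // backtrack
        for (int i = r; i < r + s; ++i)
            for (int j = c; j < c + s; ++j)
                grid[i][j] = 0;
    }
    return false;                               // no unused square fits here
}

int main() {
    int n, m;
    cin >> n >> m;
    vector<int> sides(m);
    for (auto& s : sides) cin >> s;
    vector<vector<int>> grid(n, vector<int>(n, 0));
    vector<bool> used(m, false);
    if (place(grid, sides, used)) {
        for (auto& row : grid) {
            for (int v : row) cout << v << ' ';
            cout << '\n';
        }
    } else {
        cout << "no tiling\n";
    }
}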
I used a dynamic programming approach for solving this, but it only works up to n ~ 50. I store each solution as a bitset for efficiency. (Note that this enumerates the sets of distinct squares whose areas sum to n*n, i.e. the area filter from the answer above; it does not check that the squares can actually be arranged geometrically.)
You can compile the code yourself with:
$ g++ -O3 -std=c++11 squares.cpp -o squares
#include <bitset>
#include <iostream>
#include <list>
#include <vector>
using namespace std;

constexpr auto N = 116;

// A set of used square sides, stored as a bitset for cheap copies.
class FastSquareList {
   public:
    FastSquareList() = default;
    FastSquareList(int i) { mask_.set(i); }
    FastSquareList operator+(int i) const {
        FastSquareList result = *this;
        result.mask_.set(i);
        return result;
    }
    bool has(int i) const { return mask_.test(i); }
    void display() const {
        for (auto i = 1; i <= N; ++i) {
            if (has(i)) {
                cout << i * i << " ";
            }
        }
        cout << endl;
    }

   private:
    bitset<N + 1> mask_;
};

int main() {
    int n;
    cin >> n;
    // v[a] holds all sets of distinct square sides with total area a.
    vector<list<FastSquareList>> v(n * n + 1);
    for (int i = 1; i <= n; ++i) {
        v[i * i].push_back(i);
        for (int a = i * i + 1; a <= n * n; ++a) {
            int p = a - i * i;
            for (const auto& l : v[p]) {
                if (l.has(i)) {
                    continue;
                }
                v[a].emplace_back(l + i);
            }
        }
    }
    for (const auto& l : v[n * n]) {
        l.display();
    }
    cout << "solutions count = " << v[n * n].size() << endl;
    return 0;
}
An example (the 15 is the input n; each output line lists the areas of one set of squares):
$ ./squares
15
9 16 36 64 100
25 36 64 100
1 4 9 16 25 49 121
4 36 64 121
4 100 121
4 16 25 36 144
1 16 64 144
81 144
4 16 36 169
4 9 16 196
4 25 196
225
solutions count = 12

How to partly sort arrays on CUDA?

Problem
Provided I have two arrays:
const int N = 1000000;
float A[N];
myStruct *B[N];
The numbers in A can be positive or negative (e.g. A[N]={3,2,-1,0,5,-2}). How can I rearrange A so that it is partly sorted (all positive values first, in no particular order, then the negative values, e.g. A[N]={3,2,5,0,-1,-2} or A[N]={5,2,3,0,-2,-1}) on the GPU? The array B should be changed along with A (A holds the keys, B the values).
Since A and B can be very large, I think the sorting algorithm should be implemented on the GPU (especially on CUDA, because I use this platform). Surely I know thrust::sort_by_key can do this work, but it does much extra work, since I do not need the arrays A & B to be sorted entirely.
Has anyone come across this kind of problem?
Thrust example
thrust::sort_by_key(thrust::device_ptr<float>(A),
                    thrust::device_ptr<float>(A + N),
                    thrust::device_ptr<myStruct>(B),
                    thrust::greater<float>());
Thrust's documentation on GitHub is not up to date. As @JaredHoberock said, thrust::partition is the way to go, since it now supports stencils. You may need to get a copy from the GitHub repository:
git clone git://github.com/thrust/thrust.git
Then run scons doc in the Thrust folder to get updated documentation, and use these updated Thrust sources when compiling your code (nvcc -I/path/to/thrust ...). With the new stencil partition, you can do:
#include <thrust/partition.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

struct is_positive
{
    __host__ __device__
    bool operator()(const int &x)
    {
        return x >= 0;
    }
};

thrust::partition(thrust::host, // if you want to test on the host
                  thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
                  thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
                  keyVec.begin(),
                  is_positive());
This returns:
Before:
keyVec = 0 -1 2 -3 4 -5 6 -7 8 -9
valVec = 0 1 2 3 4 5 6 7 8 9
After:
keyVec = 0 2 4 6 8 -5 -3 -7 -1 -9
valVec = 0 2 4 6 8 5 3 7 1 9
Note that the two partitions are not necessarily sorted. Also, their order may differ from the order in the original vectors. If this matters to you, you can use thrust::stable_partition (quoting the docs; a sketch of the call follows the quote):
stable_partition differs from partition in that stable_partition is
guaranteed to preserve relative order. That is, if x and y are
elements in [first, last), such that pred(x) == pred(y), and if x
precedes y, then it will still be true after stable_partition that x
precedes y.
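Presumably the call is then identical with stable_partition swapped in (my sketch, assuming the stencil overload of stable_partition mirrors that of partition in the updated sources):
thrust::stable_partition(thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
                         thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
                         keyVec.begin(),
                         is_positive());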
If you want a complete example, here it is:
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

struct is_positive
{
    __host__ __device__
    bool operator()(const int &x)
    {
        return x >= 0;
    }
};

void print_vec(const thrust::host_vector<int>& v)
{
    for (size_t i = 0; i < v.size(); i++)
        std::cout << " " << v[i];
    std::cout << "\n";
}

int main()
{
    const int N = 10;
    thrust::host_vector<int> keyVec(N);
    thrust::host_vector<int> valVec(N);
    int sign = 1;
    for (int i = 0; i < N; ++i)
    {
        keyVec[i] = sign * i;
        valVec[i] = i;
        sign *= -1;
    }

    // Copy host to device
    thrust::device_vector<int> d_keyVec = keyVec;
    thrust::device_vector<int> d_valVec = valVec;

    std::cout << "Before:\n keyVec = ";
    print_vec(keyVec);
    std::cout << " valVec = ";
    print_vec(valVec);

    // Partition key-val on device
    thrust::partition(thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.begin(), d_valVec.begin())),
                      thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.end(), d_valVec.end())),
                      d_keyVec.begin(),
                      is_positive());

    // Copy result back to host
    keyVec = d_keyVec;
    valVec = d_valVec;

    std::cout << "After:\n keyVec = ";
    print_vec(keyVec);
    std::cout << " valVec = ";
    print_vec(valVec);
}
UPDATE
I made a quick comparison with the thrust::sort_by_key version, and the thrust::partition implementation does seem to be faster (which is what we could naturally expect). Here is what I obtained with the NVIDIA Visual Profiler, with N = 1024 * 1024, the sort version on the left and the partition version on the right. You may want to run the same kind of test yourself.
How about this?
1. Count how many positive numbers there are to determine the inflexion point (a Thrust sketch for this step follows the list).
2. Evenly divide each side of the inflexion point into groups (the negative groups are all the same length, but possibly a different length than the positive groups; these groups are the memory chunks for the results).
3. Use one kernel call (one thread) per chunk pair.
4. Each kernel swaps any out-of-place elements in its input groups into the desired output groups. You will need to flag any chunks that have more swaps than the maximum, so that you can fix them during subsequent iterations.
5. Repeat until done.
Memory traffic is swaps only (from the original element position to the sorted position). I don't know if this algorithm sounds like anything already defined...
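For step 1, a minimal sketch using Thrust (my code; is_positive_f is a float variant of the functor used in the other answer):
#include <thrust/count.h>
#include <thrust/device_ptr.h>

struct is_positive_f
{
    __host__ __device__ bool operator()(float x) const { return x >= 0.0f; }
};

// Number of non-negative keys = the inflexion point.
int numPositive = thrust::count_if(thrust::device_ptr<float>(A),
                                   thrust::device_ptr<float>(A + N),
                                   is_positive_f());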
You should be able to achieve this in Thrust simply with a modification of your comparison operator. Note that the comparator must be a valid strict weak ordering, so "non-negatives before negatives" has to be expressed as below; the tempting !((x<0.0f) && (y>0.0f)) is not a strict weak ordering and would give sort undefined behaviour:
struct my_compare
{
    __device__ __host__ bool operator()(const float x, const float y) const
    {
        // true iff x must come before y: x is non-negative and y is negative
        return (x >= 0.0f) && (y < 0.0f);
    }
};
thrust::sort_by_key(thrust::device_ptr<float>(A),
                    thrust::device_ptr<float>(A + N),
                    thrust::device_ptr<myStruct>(B),
                    my_compare());
