How would you merge sparse matrices to create a new sparse matrix?

I'm trying to achieve the following thing:
Given A(m,n), B(m,q), C(p,n), D(p,q), sparse matrices.
Create E(m+p,n+q), a sparse matrix, like :
E = | A B |
    | C D |
I tried the following :
Eigen - the only way I found was to read all the nonzero values from A, B, C and D, store them in a std::vector<Triplet>, and construct E with setFromTriplets. This is far too costly.
Intel MKL - the algorithm was the same, except that I used the Block Sparse Row storage representation: first read the nonzeros, then call a constructor. Same complexity, so not usable.
The problem with these libraries is that E is constructed just like any other sparse matrix, without exploiting the redundancy between the parts of E and A, B, C, D. I suppose it should be possible to construct E just by reindexing what is stored in the internal structures of A, B, C and D.
The question is the following : how would you achieve this merging operation with sparse matrices? Which software would you use? Which algorithm would you use?
The ideal solution would not use a block storage scheme so that sparsity would be based on zero values, not zero blocks.
Programming language doesn't matter.
Thanks in advance.

Old question, new solution! It seems everybody does this by creating a new container of triplets and filling it accordingly. That is admittedly safe and intuitive, but not very fast, even if you reserve space for the nonzeros up front.
In this approach, which works only on column-major matrices (row-major versions are easy to derive), I work directly with the CSC data structure. I present 3 methods: 1) "vertically stacking" two sparse matrices with the same number of columns, 2) "horizontally stacking" two sparse matrices with the same number of rows, and 3) "in-place horizontally stacking" two sparse matrices with the same number of rows. Method 3 is inherently more efficient because the storage of the left matrix is left unchanged.
As you can see, horizontally stacking two matrices is almost trivial even when working directly with the raw pointers. Vertically stacking requires a bit of thinking, but not much. If anyone can figure out an in-place method for vertically stacking that avoids some copying (or any improvement), please let me know!
#include <algorithm>
#include <cassert>
#include <Eigen/Sparse>

using Eigen::ColMajor;
using Eigen::SparseMatrix;

// Stack `top` over `bottom` (same number of columns) into `stacked`.
// Both inputs are read through their raw CSC arrays, so they must be in compressed mode.
template<typename Scalar, typename StorageIndex>
void sparse_stack_v(
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& top,
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& bottom,
    SparseMatrix<Scalar, ColMajor, StorageIndex>& stacked)
{
    assert(top.cols() == bottom.cols());
    stacked.resize(top.rows() + bottom.rows(), top.cols());
    stacked.resizeNonZeros(top.nonZeros() + bottom.nonZeros());
    StorageIndex i = 0;  // next free slot in the stacked arrays
    for (StorageIndex col = 0; col < top.cols(); col++) {
        stacked.outerIndexPtr()[col] = i;
        // Entries of `top` keep their row indices.
        for (StorageIndex j = top.outerIndexPtr()[col]; j < top.outerIndexPtr()[col + 1]; j++, i++) {
            stacked.innerIndexPtr()[i] = top.innerIndexPtr()[j];
            stacked.valuePtr()[i] = top.valuePtr()[j];
        }
        // Entries of `bottom` are shifted down by top.rows().
        for (StorageIndex j = bottom.outerIndexPtr()[col]; j < bottom.outerIndexPtr()[col + 1]; j++, i++) {
            stacked.innerIndexPtr()[i] = (StorageIndex)top.rows() + bottom.innerIndexPtr()[j];
            stacked.valuePtr()[i] = bottom.valuePtr()[j];
        }
    }
    stacked.outerIndexPtr()[top.cols()] = i;
}
// Place `right` next to `left` (same number of rows) into `stacked`.
template<typename Scalar, typename StorageIndex>
void sparse_stack_h(
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& left,
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& right,
    SparseMatrix<Scalar, ColMajor, StorageIndex>& stacked)
{
    assert(left.rows() == right.rows());
    stacked.resize(left.rows(), left.cols() + right.cols());
    stacked.resizeNonZeros(left.nonZeros() + right.nonZeros());
    const StorageIndex leftnz = (StorageIndex)left.nonZeros();
    // Row indices and values are simply concatenated.
    std::copy(left.innerIndexPtr(), left.innerIndexPtr() + left.nonZeros(), stacked.innerIndexPtr());
    std::copy(right.innerIndexPtr(), right.innerIndexPtr() + right.nonZeros(), stacked.innerIndexPtr() + left.nonZeros());
    std::copy(left.valuePtr(), left.valuePtr() + left.nonZeros(), stacked.valuePtr());
    std::copy(right.valuePtr(), right.valuePtr() + right.nonZeros(), stacked.valuePtr() + left.nonZeros());
    // The last entry of left.outerIndexPtr() is not needed: the total length is
    // stacked.cols() + 1 = left.cols() + right.cols() + 1.
    std::copy(left.outerIndexPtr(), left.outerIndexPtr() + left.cols(), stacked.outerIndexPtr());
    // Column offsets of `right` are shifted by the number of nonzeros in `left`.
    std::transform(right.outerIndexPtr(), right.outerIndexPtr() + right.cols() + 1, stacked.outerIndexPtr() + left.cols(), [&](StorageIndex i) { return i + leftnz; });
}
// Append the columns of `right` to `left` in place (same number of rows).
template<typename Scalar, typename StorageIndex>
void sparse_stack_h_inplace(
    SparseMatrix<Scalar, ColMajor, StorageIndex>& left,
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& right)
{
    assert(left.rows() == right.rows());
    const StorageIndex leftcol = (StorageIndex)left.cols();
    const StorageIndex leftnz = (StorageIndex)left.nonZeros();
    left.conservativeResize(left.rows(), left.cols() + right.cols());
    left.resizeNonZeros(leftnz + right.nonZeros());
    // The existing entries of `left` stay where they are; only `right` is copied.
    std::copy(right.innerIndexPtr(), right.innerIndexPtr() + right.nonZeros(), left.innerIndexPtr() + leftnz);
    std::copy(right.valuePtr(), right.valuePtr() + right.nonZeros(), left.valuePtr() + leftnz);
    std::transform(right.outerIndexPtr(), right.outerIndexPtr() + right.cols() + 1, left.outerIndexPtr() + leftcol, [&](StorageIndex i) { return i + leftnz; });
}
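To make the reindexing idea concrete without depending on Eigen, here is a minimal sketch on a hand-rolled CSC container. The names Csc, hstack, vstack, block2x2 and at are hypothetical and not part of any library; the sketch assumes well-formed CSC input (outer has cols + 1 entries and starts at 0).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal CSC container (not Eigen's class).
struct Csc {
    std::size_t rows = 0, cols = 0;
    std::vector<std::size_t> outer{0};  // column start offsets, size cols + 1
    std::vector<std::size_t> inner;     // row index of each stored entry
    std::vector<double> values;         // stored entries, column by column
};

// [left right]: concatenate the arrays, shifting right's offsets by nnz(left).
Csc hstack(const Csc& left, const Csc& right) {
    assert(left.rows == right.rows);
    Csc out = left;
    out.cols += right.cols;
    out.inner.insert(out.inner.end(), right.inner.begin(), right.inner.end());
    out.values.insert(out.values.end(), right.values.begin(), right.values.end());
    const std::size_t shift = left.values.size();
    for (std::size_t c = 1; c <= right.cols; ++c)
        out.outer.push_back(right.outer[c] + shift);
    return out;
}

// [top; bottom]: merge column by column, shifting bottom's rows by top.rows.
Csc vstack(const Csc& top, const Csc& bottom) {
    assert(top.cols == bottom.cols);
    Csc out;
    out.rows = top.rows + bottom.rows;
    out.cols = top.cols;
    for (std::size_t c = 0; c < top.cols; ++c) {
        for (std::size_t k = top.outer[c]; k < top.outer[c + 1]; ++k) {
            out.inner.push_back(top.inner[k]);
            out.values.push_back(top.values[k]);
        }
        for (std::size_t k = bottom.outer[c]; k < bottom.outer[c + 1]; ++k) {
            out.inner.push_back(bottom.inner[k] + top.rows);
            out.values.push_back(bottom.values[k]);
        }
        out.outer.push_back(out.values.size());
    }
    return out;
}

// E = [A B; C D], built from two horizontal stacks and one vertical stack.
Csc block2x2(const Csc& A, const Csc& B, const Csc& C, const Csc& D) {
    return vstack(hstack(A, B), hstack(C, D));
}

// Look up one entry (linear scan of a column); zero if not stored.
double at(const Csc& m, std::size_t r, std::size_t c) {
    for (std::size_t k = m.outer[c]; k < m.outer[c + 1]; ++k)
        if (m.inner[k] == r) return m.values[k];
    return 0.0;
}
```

No triplet list is built and nothing is sorted: every stored entry is copied once, and only the row indices of the bottom blocks and the column offsets of the right blocks change.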

As of Octave 7, you can just join using standard matrix notation:
>> E = [A,B;C,D];
As a test:
>> x = sparse(2,3)
x = Compressed Column Sparse (rows = 2, cols = 3, nnz = 0 [0%])
>> y = sparse(20,3)
y = Compressed Column Sparse (rows = 20, cols = 3, nnz = 0 [0%])
>> x(1,2) = 1
x = Compressed Column Sparse (rows = 2, cols = 3, nnz = 1 [17%])
>> y(10,1)=99
y = Compressed Column Sparse (rows = 20, cols = 3, nnz = 1 [1.7%])
>> [x,x;y,y]
ans = Compressed Column Sparse (rows = 22, cols = 6, nnz = 4 [3%])
(12, 1) -> 99
(1, 2) -> 1
(12, 4) -> 99
(1, 5) -> 1


Minimizing the number of warehouses for an order

I am trying to figure out an algorithm to efficiently solve the following problem:
There are w warehouses that store p different products with different quantities
A customer places an order on n out of the p products
The goal is to pick the minimum number of warehouses from which the order could be allocated.
E.g. the distribution of inventory in three warehouses is as follows
|             | Product 1 | Product 2 | Product 3 |
| Warehouse 1 | 2         | 5         | 0         |
| Warehouse 2 | 1         | 4         | 4         |
| Warehouse 3 | 3         | 1         | 4         |
Now suppose an order is placed with the following ordered quantities:
|             | Product 1 | Product 2 | Product 3 |
| Ordered Qty | 5         | 4         | 1         |
The optimal solution here is to allocate the order from Warehouse 1 and Warehouse 3. No smaller subset of the 3 warehouses can fulfil the order.
I have tried using brute force to solve this, however, for a larger number of warehouses, the algorithm performs very poorly. I have also tried a few greedy allocation algorithms, however, as expected, they are unable to minimize the number of sub-orders in many cases. Are there any other algorithms/approaches that I should look into?
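For reference, the brute-force baseline mentioned above can be sketched as follows. This is an illustrative sketch only, assuming an order is feasible for a subset when the summed stock of the subset covers every ordered quantity; the name min_warehouses is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Try every subset of warehouses (2^w of them) and keep a smallest one whose
// combined stock covers the order. Returns false if no subset covers it.
bool min_warehouses(const std::vector<std::vector<unsigned>>& stock,
                    const std::vector<unsigned>& order,
                    std::vector<std::size_t>& chosen) {
    const std::size_t w = stock.size();
    assert(w < 24);  // exponential cost: only sensible for small w
    bool found = false;
    for (std::uint32_t mask = 0; mask < (std::uint32_t{1} << w); ++mask) {
        std::vector<unsigned long long> total(order.size(), 0);
        std::vector<std::size_t> subset;
        for (std::size_t i = 0; i < w; ++i) {
            if ((mask >> i) & 1u) {
                subset.push_back(i);
                for (std::size_t j = 0; j < order.size(); ++j)
                    total[j] += stock[i][j];
            }
        }
        bool covers = true;
        for (std::size_t j = 0; j < order.size(); ++j)
            if (total[j] < order[j]) covers = false;
        if (covers && (!found || subset.size() < chosen.size())) {
            chosen = subset;
            found = true;
        }
    }
    return found;
}
```

On the inventory table above with the order (5, 4, 1), this selects the zero-based indices 0 and 2, that is Warehouse 1 and Warehouse 3.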
Part 1 (see also Part 2 below)
Your task looks like a Set Cover problem, which is NP-complete (in its decision form), so every known exact algorithm takes exponential time in the worst case.
I designed and implemented in C++ my own solution, which can be sub-exponential in one case: when many subsets of warehouses yield the same total product amounts. In other words, when the number of warehouse subsets (2^NumWarehouses) is much larger than the number of distinct product-count vectors that those subsets produce. This often happens in the test data for problems like yours in online competitions. When it does, my solution is sub-exponential in both CPU time and RAM.
I used a dynamic programming approach. The whole algorithm can be described as follows:
1. Create a map whose key is a vector holding the amount of each product. Each key points to a triple: a) the set of previously taken warehouses that reach these product amounts, used to restore the exact chosen warehouses; b) the minimal number of warehouses needed to reach these amounts; c) the previously taken warehouse that achieves this minimum. The map is initialized with a single key, the zero vector (0, 0, ..., 0).
2. Iterate over all warehouses in a loop and do steps 3-4.
3. Iterate over all product-amount vectors currently in the map and do step 4.
4. Add the product amounts of the current warehouse to the current vector. The sum of the two vectors is a new key in the map; add the index of the current warehouse to the set stored under that key, and set the minimum and the previous warehouse to -1 (uninitialized).
5. Using a recursive function, find for each key the minimal number of warehouses needed and the previous warehouse that achieves it. For a given key, iterate over all warehouses in its set and find (recursively) the minimum of the key obtained by removing that warehouse; the minimum of the current key is the smallest of these plus 1.
6. Iterate over all keys that are greater than or equal (component-wise) to the ordered amounts. Each such key gives a solution, but only some give a minimal one; keep the key with the smallest minimum. If no key covers the ordered vector, there is no solution and the program stops with an error message.
7. Starting from the minimal key, restore backwards the path of warehouses used to achieve this minimum. This is easy because each key stores the minimal number of warehouses and the previous warehouse to take to achieve it. Following the "previous" warehouses restores the whole path. Finally, output the minimal solution found.
As already mentioned, the memory and time complexity of this algorithm are proportional to the number of distinct product vectors that can be formed by subsets of the warehouses, which may (if we're lucky) or may not (if we're unlucky) be sub-exponential.
Full C++ code implementing the algorithm above (written from scratch by me):
#include <cstdint>
#include <vector>
#include <tuple>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <functional>
#include <stdexcept>
#include <iostream>
#include <algorithm>

#define ASSERT(cond) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "!"); }
#define LN { std::cout << "LN " << __LINE__ << std::endl; }

using u16 = uint16_t;
using u32 = uint32_t;
using u64 = uint64_t;

int main() {
    std::vector<std::vector<u32>> warehouses_products = {
        {2, 5, 0},
        {1, 4, 4},
        {3, 1, 4},
    };
    std::vector<u32> order_products = {5, 4, 1};

    size_t const nwares = warehouses_products.size(),
        nprods = warehouses_products.at(0).size();
    ASSERT(order_products.size() == nprods);
    // Key: product amounts. Value: (set of last-taken warehouses, min count, previous warehouse).
    std::map<std::vector<u32>, std::tuple<std::set<u16>, u16, u16>> d =
        {{std::vector<u32>(nprods), {{}, 0, u16(-1)}}};
    for (u16 iware = 0; iware < nwares; ++iware) {
        auto const & wprods = warehouses_products[iware];
        ASSERT(wprods.size() == nprods);
        auto dc = d;
        for (auto const & [k, _]: d) {
            auto prods = k;
            for (size_t i = 0; i < wprods.size(); ++i)
                prods[i] += wprods[i];
            dc.insert({prods, {{}, u16(-1), u16(-1)}});
            // Step 4: remember that this key can be reached by taking iware last.
            std::get<0>(dc[prods]).insert(iware);
        }
        d = dc;
    }
    // Step 5: memoized recursion for the minimal number of warehouses per key.
    std::function<u16(std::vector<u32> const &)> FindMin =
        [&](auto const & prods) {
            auto & [a, b, c] = d.at(prods);
            if (b != u16(-1))
                return b;
            u16 minv = u16(-1), minw = u16(-1);
            for (auto iware: a) {
                auto const & wprods = warehouses_products[iware];
                auto cprods = prods;
                for (size_t i = 0; i < wprods.size(); ++i)
                    cprods[i] -= wprods[i];
                auto const fmin = FindMin(cprods) + 1;
                if (fmin < minv) {
                    minv = fmin;
                    minw = iware;
                }
            }
            ASSERT(minv != u16(-1) && minw != u16(-1));
            b = minv;
            c = minw;
            return b;
        };
    for (auto const & [k, v]: d)
        FindMin(k);
    // Step 6: among keys that cover the order, keep the one with the fewest warehouses.
    std::vector<u32> minp;
    u16 minv = u16(-1);
    for (auto const & [k, v]: d) {
        bool matched = true;
        for (size_t i = 0; i < nprods; ++i)
            if (order_products[i] > k[i]) {
                matched = false;
                break;
            }
        if (!matched)
            continue;
        if (std::get<1>(v) < minv) {
            minv = std::get<1>(v);
            minp = k;
        }
    }
    if (minp.empty()) {
        std::cout << "Can't buy all products!" << std::endl;
        return 0;
    }
    // Step 7: walk back through the "previous warehouse" links.
    std::vector<u16> answer;
    while (minp != std::vector<u32>(nprods)) {
        auto const & [a, b, c] = d.at(minp);
        answer.push_back(c);
        auto const & wprods = warehouses_products[c];
        for (size_t i = 0; i < wprods.size(); ++i)
            minp[i] -= wprods[i];
    }
    std::sort(answer.begin(), answer.end());
    std::cout << "WareHouses: ";
    for (auto iware: answer)
        std::cout << iware << ", ";
    std::cout << std::endl;
}
WareHouses Products:
{2, 5, 0},
{1, 4, 4},
{3, 1, 4},
Ordered Products:
{5, 4, 1}
WareHouses: 0, 2,
Part 2
Below is a totally different solution that I also implemented.
It is based on backtracking with a recursive function.
Although this solution is exponential in the worst case, it finds a close-to-optimal solution quickly. So you can run the program for as long as you can afford and output whatever it has found so far as an approximate solution.
The algorithm is as follows:
1. Suppose some products are still left to buy. Sort all warehouses not taken so far in descending order of the total amount of still-needed products they can supply.
2. In a loop, take each next warehouse from the sorted list, but only the first limit elements (limit is a fixed given value). This way warehouses are taken greedily in order of relevance, that is, by the amount of remaining products they cover.
3. After a warehouse is taken, recurse into the same function, which again builds a sorted list of warehouses and takes the next most relevant one; in other words, go back to step 1.
4. On each function call, if all products are bought and the number of taken warehouses is less than the current minimum, output this solution and update the minimum.
Thus the algorithm starts out very greedy and then becomes slower and slower as it becomes less greedy and closer to brute force. Very good solutions already appear within the first seconds.
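The most greedy pass (limit equal to 1, no backtracking) can be sketched on its own as below. This is a simplified illustration under that assumption, not the full backtracking program; the names gain and greedy_cover are hypothetical. It yields a quick upper bound on the number of warehouses, not necessarily the optimum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of still-needed units that a warehouse would supply.
std::size_t gain(const std::vector<unsigned>& stock, const std::vector<unsigned>& left) {
    std::size_t g = 0;
    for (std::size_t j = 0; j < left.size(); ++j)
        g += std::min(stock[j], left[j]);
    return g;
}

// Repeatedly take the unused warehouse with the largest gain.
// `left` is the remaining demand (passed by value so the caller's order is untouched).
// Returns false if the order cannot be covered by all warehouses together.
bool greedy_cover(const std::vector<std::vector<unsigned>>& stock,
                  std::vector<unsigned> left,
                  std::vector<std::size_t>& picked) {
    picked.clear();
    std::vector<bool> used(stock.size(), false);
    for (;;) {
        if (std::all_of(left.begin(), left.end(), [](unsigned q) { return q == 0; }))
            return true;
        std::size_t best = stock.size(), best_gain = 0;
        for (std::size_t i = 0; i < stock.size(); ++i) {
            if (used[i]) continue;
            const std::size_t g = gain(stock[i], left);
            if (g > best_gain) { best_gain = g; best = i; }  // ties keep the lower index
        }
        if (best == stock.size()) return false;  // no remaining warehouse helps
        used[best] = true;
        picked.push_back(best);
        for (std::size_t j = 0; j < left.size(); ++j)
            left[j] -= std::min(left[j], stock[best][j]);
    }
}
```

The backtracking version described above generalizes this by also trying the second, third, ... best warehouse at every level, up to the current limit.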
As an example, below I create 40 random warehouses, each with random amounts of 40 products. This fairly large task is solved, probably optimally, within the first second. By "probably" I mean that further minutes of running do not produce any better solution.
#include <algorithm>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>
#include <functional>
#include <chrono>
#include <cmath>

using u8 = uint8_t;
using u16 = uint16_t;
using u32 = uint32_t;
using i32 = int32_t;

double Time() {
    static auto const gtb = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}

void Solve(auto const & wps, auto const & ops) {
    size_t const nwares = wps.size(), nprods = ops.size(), max_depth = 1000;
    std::vector<u32> prods_left = ops;
    std::vector<std::vector<u16>> sorted_wares_all(max_depth);
    std::vector<std::vector<u32>> prods_copy_all(max_depth);
    std::vector<u16> path;
    std::vector<u8> used(nwares);
    size_t min_wares = size_t(-1);

    // How many still-needed units a warehouse can supply.
    auto ProdGrow = [&](auto const & prods){
        size_t grow = 0;
        for (size_t i = 0; i < nprods; ++i)
            grow += std::min(prods_left[i], prods[i]);
        return grow;
    };

    std::function<void(size_t, size_t, size_t)> Rec = [&](size_t depth, size_t off, size_t lim){
        size_t prods_need = 0;
        for (auto e: prods_left)
            prods_need += e;
        if (prods_need == 0) {
            // Everything is bought: report the solution if it improves the minimum.
            if (path.size() < min_wares) {
                min_wares = path.size();
                std::cout << std::endl << "Time " << std::setw(4) << std::llround(Time())
                    << " sec, Cnt " << std::setw(3) << path.size() << ": ";
                auto cpath = path;
                std::sort(cpath.begin(), cpath.end());
                for (auto e: cpath)
                    std::cout << e << ", ";
                std::cout << std::endl << std::flush;
            }
            return;
        }
        auto & sorted_wares = sorted_wares_all.at(depth);
        auto & prods_copy = prods_copy_all.at(depth);
        // Candidate warehouses, most relevant first, truncated to `lim` entries.
        sorted_wares.clear();
        for (u16 i = off; i < nwares; ++i)
            if (!used[i])
                sorted_wares.push_back(i);
        std::sort(sorted_wares.begin(), sorted_wares.end(),
            [&](auto a, auto b){
                return ProdGrow(wps[a]) > ProdGrow(wps[b]);
            });
        sorted_wares.resize(std::min(lim, sorted_wares.size()));
        for (size_t i = 0; i < sorted_wares.size(); ++i) {
            u16 const iware = sorted_wares[i];
            auto const & wprods = wps[iware];
            prods_copy = prods_left;
            for (size_t j = 0; j < nprods; ++j)
                prods_left[j] -= std::min(prods_left[j], wprods[j]);
            path.push_back(iware);
            used[iware] = 1;
            Rec(depth + 1, iware + 1, lim);
            used[iware] = 0;
            path.pop_back();
            prods_left = prods_copy;
        }
        for (auto e: sorted_wares)
            used[e] = 0;
    };
    for (size_t lim = 1; lim <= nwares; ++lim) {
        std::cout << "Limit " << lim << ", " << std::flush;
        Rec(0, 0, lim);
    }
}

int main() {
    size_t const nwares = 40, nprods = 40;
    std::mt19937_64 rng{std::random_device{}()};
    std::vector<std::vector<u32>> wps(nwares);
    for (size_t i = 0; i < nwares; ++i) {
        wps[i].resize(nprods);
        for (size_t j = 0; j < nprods; ++j)
            wps[i][j] = rng() % 90 + 10;
    }
    std::vector<u32> ops;
    for (size_t i = 0; i < nprods; ++i)
        ops.push_back(rng() % (nwares * 20));
    Solve(wps, ops);
}
Limit 1, Limit 2, Limit 3, Limit 4,
Time 0 sec, Cnt 13: 6, 8, 12, 13, 29, 31, 32, 33, 34, 36, 37, 38, 39,
Limit 5,
Time 0 sec, Cnt 12: 6, 8, 12, 13, 28, 29, 31, 32, 36, 37, 38, 39,
Limit 6, Limit 7,
Time 0 sec, Cnt 11: 6, 8, 12, 13, 19, 26, 31, 32, 33, 36, 39,
Limit 8, Limit 9, Limit 10, Limit 11, Limit 12, Limit 13, Limit 14, Limit 15,
If you want to go down the ILP route, you could formulate the following programme:
minimise sum_{i=1..w} x_i
subject to sum_{i=1..w} C_ij * x_i >= n_j for each j = 1..p
x_i in {0, 1} for each i = 1..w
Here w is the number of warehouses, p the number of products, n_j the quantity of product j ordered, and C_ij the quantity of product j stored in warehouse i. The decisions are to select warehouse i (x_i = 1) or not (x_i = 0).
Using Google's ortools and the open-source CBC solver, this could be implemented as follows in Python:
import numpy as np
from ortools.linear_solver import pywraplp

# Some test data, replace with your own.
p = 50
w = 1000
n = np.random.randint(0, 10, p)
C = np.random.randint(0, 5, (w, p))

solver = pywraplp.Solver("model", pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
x = [solver.BoolVar(f"x[{i}]") for i in range(w)]

# Demand constraints: the selected warehouses must cover every ordered quantity.
for j in range(p):
    solver.Add(C[:, j] @ x >= n[j])

# Objective: use as few warehouses as possible.
solver.Minimize(sum(x))
This formulation solves instances with up to a thousand warehouses in a few seconds to a minute. Smaller instances solve much quicker, for (I hope) obvious reasons.
The following outputs the solution, and some statistics:
assert solver.Solve() is not None
print(f"assigned = {[i + 1 for i in range(len(x)) if x[i].solution_value()]}")
print(f" obj = {solver.Objective().Value()}")
print(f" time = {solver.WallTime() / 1000}s")

Assemble block sparse matrix in Eigen

Let's say I have a small sparse matrix B.
I want to build a bigger sparse matrix like
BtB 0
0 (BtB)^-1
I want to know if Eigen provides some functionality to assemble something like this. I have been searching and found nothing. One option that I can use is to compute the operations, extract the triplets and assemble the matrix based on the triplets. Is there any easier way?
If you want to explicitly build a sparse block matrix (there are legitimate reasons!), this is what I use (it only works for column-major matrices, but is easily adapted to row-major). Here, sparse_stack_v "stacks" two matrices vertically. sparse_stack_h likewise "stacks" two sparse matrices horizontally. sparse_stack_h_inplace is a more efficient operation, as it reuses most of the sparsity structure of the left matrix (and overwrites it).
// Assumes <algorithm>, <cassert> and <Eigen/Sparse> are included, `using namespace Eigen;`,
// and that all inputs are in compressed mode.
template<typename Scalar, typename StorageIndex>
void sparse_stack_v(
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& top,
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& bottom,
    SparseMatrix<Scalar, ColMajor, StorageIndex>& stacked)
{
    assert(top.cols() == bottom.cols());
    stacked.resize(top.rows() + bottom.rows(), top.cols());
    stacked.resizeNonZeros(top.nonZeros() + bottom.nonZeros());
    StorageIndex i = 0;
    for (StorageIndex col = 0; col < top.cols(); col++) {
        stacked.outerIndexPtr()[col] = i;
        for (StorageIndex j = top.outerIndexPtr()[col]; j < top.outerIndexPtr()[col + 1]; j++, i++) {
            stacked.innerIndexPtr()[i] = top.innerIndexPtr()[j];
            stacked.valuePtr()[i] = top.valuePtr()[j];
        }
        for (StorageIndex j = bottom.outerIndexPtr()[col]; j < bottom.outerIndexPtr()[col + 1]; j++, i++) {
            stacked.innerIndexPtr()[i] = (StorageIndex)top.rows() + bottom.innerIndexPtr()[j];
            stacked.valuePtr()[i] = bottom.valuePtr()[j];
        }
    }
    stacked.outerIndexPtr()[top.cols()] = i;
}
template<typename Scalar, typename StorageIndex>
void sparse_stack_h(
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& left,
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& right,
    SparseMatrix<Scalar, ColMajor, StorageIndex>& stacked)
{
    assert(left.rows() == right.rows());
    stacked.resize(left.rows(), left.cols() + right.cols());
    stacked.resizeNonZeros(left.nonZeros() + right.nonZeros());
    const StorageIndex leftnz = (StorageIndex)left.nonZeros();
    std::copy(left.innerIndexPtr(), left.innerIndexPtr() + left.nonZeros(), stacked.innerIndexPtr());
    std::copy(right.innerIndexPtr(), right.innerIndexPtr() + right.nonZeros(), stacked.innerIndexPtr() + left.nonZeros());
    std::copy(left.valuePtr(), left.valuePtr() + left.nonZeros(), stacked.valuePtr());
    std::copy(right.valuePtr(), right.valuePtr() + right.nonZeros(), stacked.valuePtr() + left.nonZeros());
    // The last entry of left.outerIndexPtr() is not needed: the total length is
    // stacked.cols() + 1 = left.cols() + right.cols() + 1.
    std::copy(left.outerIndexPtr(), left.outerIndexPtr() + left.cols(), stacked.outerIndexPtr());
    std::transform(right.outerIndexPtr(), right.outerIndexPtr() + right.cols() + 1, stacked.outerIndexPtr() + left.cols(), [&](StorageIndex i) { return i + leftnz; });
}
template<typename Scalar, typename StorageIndex>
void sparse_stack_h_inplace(
    SparseMatrix<Scalar, ColMajor, StorageIndex>& left,
    const SparseMatrix<Scalar, ColMajor, StorageIndex>& right)
{
    assert(left.rows() == right.rows());
    const StorageIndex leftcol = (StorageIndex)left.cols();
    const StorageIndex leftnz = (StorageIndex)left.nonZeros();
    left.conservativeResize(left.rows(), left.cols() + right.cols());
    left.resizeNonZeros(leftnz + right.nonZeros());
    std::copy(right.innerIndexPtr(), right.innerIndexPtr() + right.nonZeros(), left.innerIndexPtr() + leftnz);
    std::copy(right.valuePtr(), right.valuePtr() + right.nonZeros(), left.valuePtr() + leftnz);
    std::transform(right.outerIndexPtr(), right.outerIndexPtr() + right.cols() + 1, left.outerIndexPtr() + leftcol, [&](StorageIndex i) { return i + leftnz; });
}
If you want to 'pad' a sparse matrix with zeros, just stack the matrices with an empty sparse matrix of an appropriate size.
I know this is years late, but in case anyone else finds their way here, I have a solution. My particular implementation comes from the perspective of using Rcpp and RcppEigen, but I believe this could be written using the std library list object class. Since I'm working with precision matrices, I do assume that the input matrices are square, but again, it would not take too much thought to convert this code to allow for arbitrary dimensions from each of the constituent matrices. This will create a block-diagonal sparse matrix efficiently:
#include <Rcpp.h>
#include <RcppEigen.h>
Eigen::SparseMatrix<double> sparseBdiag(Rcpp::List B_list)
{
  int K = B_list.length();
  // First pass: collect the block sizes to get the size of the result.
  Eigen::VectorXi B_cols(K);
  for(int k=0;k<K;k++) {
    Eigen::SparseMatrix<double> Bk = B_list(k);
    B_cols[k] = Bk.cols();
  }
  int sumCols = B_cols.sum();
  Eigen::SparseMatrix<double> A(sumCols,sumCols);
  int startCol=0, stopCol=0, Bk_cols;
  // Second pass: copy each block, column by column, shifted onto the diagonal.
  for(int k=0;k<K;k++) {
    Eigen::SparseMatrix<double> Bk = B_list(k);
    Bk_cols = Bk.cols();
    stopCol = startCol + Bk_cols;
    for(int j=startCol;j<stopCol;j++){
      A.startVec(j);  // insertBack requires each column to be opened first
      for(Eigen::SparseMatrix<double,0,int>::InnerIterator it(Bk,j-startCol); it; ++it) {
        A.insertBack(it.row()+startCol,j) = it.value();
      }
    }
    startCol = stopCol;
  }
  A.finalize();
  return A;
}
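The same block-diagonal assembly can be sketched on raw CSC arrays without Rcpp or Eigen. CscBlock and block_diag are hypothetical names used for illustration only, and the sketch assumes each block is well-formed CSC with outer[0] == 0; unlike the function above it does not require square blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical plain CSC block (not an Eigen or Rcpp type).
struct CscBlock {
    std::size_t rows, cols;
    std::vector<std::size_t> outer;  // size cols + 1, outer[0] == 0
    std::vector<std::size_t> inner;  // row indices
    std::vector<double> values;
};

// diag(B_1, ..., B_K): each block is appended with its rows shifted by the
// rows already placed and its column offsets shifted by the nonzeros already placed.
CscBlock block_diag(const std::vector<CscBlock>& blocks) {
    CscBlock out{0, 0, {0}, {}, {}};
    for (const CscBlock& b : blocks) {
        const std::size_t nnz_before = out.values.size();
        for (std::size_t c = 0; c < b.cols; ++c)
            out.outer.push_back(b.outer[c + 1] + nnz_before);
        for (std::size_t k = 0; k < b.values.size(); ++k) {
            out.inner.push_back(b.inner[k] + out.rows);
            out.values.push_back(b.values[k]);
        }
        out.rows += b.rows;
        out.cols += b.cols;
    }
    return out;
}
```

Because the blocks never share a column, the per-column entries are already in order and no sorting or insertion search is needed.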

Copy an Eigen matrix of vectors

I have
A(matrix of vectors with length = depth) is 5x5 (5 rows and 5 cols).
depth = 3 (it is the length of vector of any cell of matrix A).
B(matrix of single values) is 75 x Any (5*5*3 rows and Any cols).
x_size_kernel = 5.
block_idx is the index, here for example we have made it equal 0 (for one column of matrix B only)
The task of this simple and strict example is to copy all vectors of matrix A to one (first column) of matrix B.
Now I solve the problem like this (it is concrete example with precise data)
auto depth = 3;
Eigen::MatrixXf B;
B = Eigen::MatrixXf(x_size_kernel * y_size_kernel * depth, 100).setZero();
Eigen::Matrix<Eigen::VectorXf, Eigen::Dynamic, Eigen::Dynamic> A;
A.resize(5, 5);
for (auto yy = 0; yy < A.rows(); yy++) {
    for (auto xx = 0; xx < A.cols(); xx++) {
        A(yy, xx).resize(depth);
    }
}
auto block_idx = 0;
// Copy every vector of A into one column of B.
for (auto my = 0; my < x_size_kernel; my++) {
    for (auto mx = 0; mx < x_size_kernel; mx++) {
        // add the next column of block data
        B.col(block_idx).segment(mx * depth + my * x_size_kernel * depth, depth).noalias() =
            A(my, mx);
    }
}
But the above code is very slow, so I need faster code. Maybe somebody knows how to copy the data this way in a single pass using only Eigen.
Thank you for helping.

How to improve the performance of the Needleman-Wunsch algorithm in CUDA

I need advice on how to optimize my implementation of the Needleman-Wunsch algorithm in CUDA.
I want to optimize my code to fill the DP matrix in CUDA. Due to the data dependence between matrix elements (each next element depends on the other ones - left to it, up to it, and left-up to it), I'm filling anti-diagonal matrix elements in parallel as follows:
__global__ void alignment_kernel(int *T, char *A, char *B, int t_M, int t_N, int d) {
    int row = BLOCK_SIZE_Y * blockIdx.y + threadIdx.y;
    int col = BLOCK_SIZE_X * blockIdx.x + threadIdx.x;
    // Check if we are inside the table boundaries.
    if (!(row < t_M && col < t_N)) {
        return;
    }
    // Check if current thread is on the current diagonal
    if (row + col != d) {
        return;
    }
    int v1;
    int v2;
    int v3;
    int v4;
    v1 = v2 = v3 = v4 = INT_MIN;
    if (row > 0 && col > 0) {
        v1 = T[t_N * (row - 1) + (col - 1)] + score_matrix_read(A[row - 1], B[col - 1]);
    }
    if (row > 0 && col >= 0) {
        v2 = T[t_N * (row - 1) + col] + gap;
    }
    if (row >= 0 && col > 0) {
        v3 = T[t_N * row + (col - 1)] + gap;
    }
    if (row == 0 && col == 0) {
        v4 = 0;
    }
    // All inputs lie on earlier diagonals, which previous kernel launches have already written.
    T[t_N * row + col] = mmax(v1, v2, v3, v4);
}
Nevertheless, one obvious problem with my code is that I make multiple kernel calls (code below). So far, I don't know how to use threads to process the anti-diagonals synchronously without doing that. I think this is a major obstacle to better performance.
// Invoke kernel.
for (int d = 0; d < t_M + t_N - 1; d++) {
    alignment_kernel<<< gridDim, blockDim >>>(d_T, d_A, d_B, t_M, t_N, d);
}
How can I process the anti-diagonal in parallel and, maybe, using shared memory to increase the speedup?
Beyond this problem, is there any way to do the back trace step of the needleman-wunsch algorithm in parallel?
I am currently working on a parallel implementation of the Needleman Wunsch algorithm as well (to use in a genome mapper). Depending on how many alignments you will be doing, it may be more efficient to do a single alignment per thread.
However, here is a publication that performs a single alignment in parallel (on a GPU). The novelty of their approach is that it does not generate the matrix sequentially, but rather diagonally. They don't discuss backtracking in the publication: they send the matrix back to the host after it is generated and then perform the backtrack on the CPU. I think that backtracking on the GPU would be terribly inefficient due to branching.
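The anti-diagonal (wavefront) order discussed in this thread can be illustrated on the CPU as below. This is a plain C++ sketch, not CUDA code; the function name nw_score_wavefront and the default scores (match +1, mismatch -1, gap -1) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fill the Needleman-Wunsch table one anti-diagonal at a time and return the
// final alignment score. Cells on the same diagonal d = row + col depend only
// on diagonals d - 1 and d - 2, so the inner loop has no internal dependencies.
int nw_score_wavefront(const std::string& a, const std::string& b,
                       int match = 1, int mismatch = -1, int gap = -1) {
    const int rows = static_cast<int>(a.size()) + 1;
    const int cols = static_cast<int>(b.size()) + 1;
    std::vector<int> T(static_cast<std::size_t>(rows) * cols, 0);
    for (int d = 0; d < rows + cols - 1; ++d) {
        const int row_lo = std::max(0, d - (cols - 1));
        const int row_hi = std::min(d, rows - 1);
        // Each iteration of this loop could be handled by a separate thread.
        for (int row = row_lo; row <= row_hi; ++row) {
            const int col = d - row;
            int best;
            if (row == 0 && col == 0) {
                best = 0;
            } else if (row == 0) {
                best = T[col - 1] + gap;
            } else if (col == 0) {
                best = T[(row - 1) * cols] + gap;
            } else {
                const int s = (a[row - 1] == b[col - 1]) ? match : mismatch;
                best = std::max({T[(row - 1) * cols + (col - 1)] + s,
                                 T[(row - 1) * cols + col] + gap,
                                 T[row * cols + (col - 1)] + gap});
            }
            T[row * cols + col] = best;
        }
    }
    return T.back();
}
```

The explicit row_lo and row_hi bounds enumerate only the cells that lie on diagonal d, whereas the kernel in the question launches a thread for every cell of the table and lets the off-diagonal threads return immediately.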

Data structures and algorithms for adaptive "uniform" mesh?

I need a data structure for storing float values on a uniformly sampled 3D mesh:
x = x0 + ix*dx where 0 <= ix < nx
y = y0 + iy*dy where 0 <= iy < ny
z = z0 + iz*dz where 0 <= iz < nz
Up to now I have used my Array class:
Array3D<float> A(nx, ny,nz);
A(0,0,0) = 0.0f; // ix = iy = iz = 0
Internally it stores the float values as an 1D array with nx * ny * nz elements.
However, now I need to represent a mesh with more values than I have RAM,
e.g. nx = ny = nz = 2000.
I think many neighbouring nodes in such a mesh may have similar values, so I was wondering whether there is some simple way to "coarsen" the mesh adaptively.
For instance, if the 8 (ix,iy,iz) nodes of a cell in this mesh have values that are less than 5% apart, they are "removed" and replaced by just one value: the mean of the 8 values.
How could I implement such a data structure in a simple and efficient way?
Thanks, Ante, for suggesting lossy compression. I think this could work the following way:
#define BLOCK_SIZE 64
struct CompressedArray3D {
    CompressedArray3D(int ni, int nj, int nk) : ni(ni), nj(nj), nk(nk) {
        NI = ni/BLOCK_SIZE + 1;
        NJ = nj/BLOCK_SIZE + 1;
        NK = nk/BLOCK_SIZE + 1;
        blocks = new float*[NI*NJ*NK];
        compressedSize = new unsigned int[NI*NJ*NK];
    }
    void setBlock(int I, int J, int K, float values[BLOCK_SIZE][BLOCK_SIZE][BLOCK_SIZE]) {
        unsigned int csize;
        blocks[I*NJ*NK + J*NK + K] = compress(values, csize);
        compressedSize[I*NJ*NK + J*NK + K] = csize;
    }
    float getValue(int i, int j, int k) {
        // Block index and offset inside the block.
        int I = i/BLOCK_SIZE;
        int J = j/BLOCK_SIZE;
        int K = k/BLOCK_SIZE;
        int ii = i - I*BLOCK_SIZE;
        int jj = j - J*BLOCK_SIZE;
        int kk = k - K*BLOCK_SIZE;
        float *compressedBlock = blocks[I*NJ*NK + J*NK + K];
        unsigned int csize = compressedSize[I*NJ*NK + J*NK + K];
        // Scratch buffer for one decompressed block (1 MiB, so kept off the stack).
        static float values[BLOCK_SIZE][BLOCK_SIZE][BLOCK_SIZE];
        decompress(compressedBlock, csize, values);
        return values[ii][jj][kk];
    }
    // number of blocks:
    int NI, NJ, NK;
    // number of samples:
    int ni, nj, nk;
    float** blocks;
    unsigned int* compressedSize;
};
For this to be useful I need a lossy compression that:
is extremely fast, also on small datasets (e.g. 64x64x64)
compresses quite hard (> 3x); never mind if it loses quite a bit of information.
Any good candidates?
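One very simple "codec" in the spirit of the original coarsening idea is to collapse a block to its mean whenever its values are close together, and to keep the full block otherwise. The sketch below is only an illustration under that assumption: CoarsenedGrid and its members are hypothetical names, the grid is cubic for brevity, and closeness is measured as (max - min) relative to the absolute mean.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: an n^3 grid split into b^3 blocks. A block is either
// "flat" (only its mean is kept) or stores all of its samples.
class CoarsenedGrid {
    struct Block {
        float mean = 0.0f;         // used when cells is empty
        std::vector<float> cells;  // b*b*b samples, or empty if collapsed
    };
    std::size_t b_, nb_;
    float tol_;
    std::vector<Block> blocks_;

    std::size_t block_index(std::size_t I, std::size_t J, std::size_t K) const {
        return (I * nb_ + J) * nb_ + K;
    }

public:
    CoarsenedGrid(std::size_t n, std::size_t block, float rel_tol)
        : b_(block), nb_((n + block - 1) / block), tol_(rel_tol),
          blocks_(nb_ * nb_ * nb_) {}

    // Store one block; collapse it if its spread is within the tolerance.
    void set_block(std::size_t I, std::size_t J, std::size_t K,
                   const std::vector<float>& cells) {
        assert(cells.size() == b_ * b_ * b_);
        const auto mm = std::minmax_element(cells.begin(), cells.end());
        double sum = 0.0;
        for (float v : cells) sum += v;
        const float mean = static_cast<float>(sum / cells.size());
        Block& blk = blocks_[block_index(I, J, K)];
        if (*mm.second - *mm.first <= tol_ * std::fabs(mean)) {
            blk.mean = mean;
            blk.cells.clear();
            blk.cells.shrink_to_fit();
        } else {
            blk.cells = cells;
        }
    }

    bool is_flat(std::size_t I, std::size_t J, std::size_t K) const {
        return blocks_[block_index(I, J, K)].cells.empty();
    }

    // Read one sample; flat blocks return their mean for every cell.
    float get(std::size_t i, std::size_t j, std::size_t k) const {
        const Block& blk = blocks_[block_index(i / b_, j / b_, k / b_)];
        if (blk.cells.empty()) return blk.mean;
        return blk.cells[((i % b_) * b_ + (j % b_)) * b_ + (k % b_)];
    }
};
```

Reads stay O(1) and need no decompression buffer, but the saving depends entirely on how many blocks turn out flat; blocks never written read as zero.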
It sounds like you're looking for a LOD (level of detail) adaptive mesh. It's a recurring theme in video games and terrain simulation.
For terrain, see here: -- look for the ROAM video which is IIRC not only adaptive by distance, but also by view direction.
For non-terrain entities, there is a huge body of work (here's one example: Generic Adaptive Mesh Refinement).
I would suggest to use OctoMap to handle large 3D data.
And to extend it as shown here to handle geometrical properties.
