Given a vector of reals c and a vector of integers rw, I want to create a vector z with elements z_i=c_i^rw_i. I tried to do this using the component-wise function pow, but I get a compiler error.
#include <Eigen/Core>
typedef Eigen::VectorXd RealVector;
typedef Eigen::VectorXi IntVector; // dynamically-sized vector of integers
RealVector c; c << 2, 3, 4, 5;
IntVector rw; rw << 6, 7, 8, 9;
RealVector z = c.pow(rw); **compile error**
The compiler error is
error C2664: 'const Eigen::MatrixComplexPowerReturnValue<Derived> Eigen::MatrixBase<Derived>::pow(const std::complex<double> &) const': cannot convert argument 1 from 'IntVector' to 'const double &'
with
[
Derived=Eigen::Matrix<double,-1,1,0,-1,1>
]
c:\auc\sedanal\LammSolve.h(117): note: Reason: cannot convert from 'IntVector' to 'const double'
c:\auc\sedanal\LammSolve.h(117): note: No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
What is wrong with this code? And, assuming it can be fixed, how would I do the same operation when c is a real matrix instead of a vector, to compute c_ij^b_i for all elements of c?
Compiler is Visual Studio 2015, running under 64-bit Windows 7.
First of all, MatrixBase::pow is a function that computes the matrix power of a square matrix (if the matrix has an eigenvalue decomposition, it is the same matrix, but with the eigenvalues raised to the given power).
What you want is an element-wise power, which since there is no cwisePow function in MatrixBase, requires switching to the Array-domain. Furthermore, there is no integer-specialization for the powers (this could be efficient, but only up to a certain threshold -- and checking for that threshold for every element would waste computation time), so you need to cast the exponents to the type of your matrix.
To also answer your bonus-question:
#include <iostream>
#include <Eigen/Core>
int main(int argc, char **argv) {
Eigen::MatrixXd A; A.setRandom(3,4);
Eigen::VectorXi b = (Eigen::VectorXd::Random(3)*16).cast<int>();
Eigen::MatrixXd C = A.array() // go to array domain
.pow( // element-wise power
b.cast<double>() // cast exponents to double
.replicate(1, A.cols()).array() // repeat exponents to match size of A
);
std::cout << A << '\n' << b << '\n' << C << '\n';
}
Essentially, this will call C(i,j) = std::pow(A(i,j), b(i)) for each i, j. If all your exponents are small, you might actually be faster than that with a
simple nested loop that calls a specialized pow(double, int) implementation (like gcc's __builtin_powi), but you should benchmark that with actual data.
Related
I have a large array that I need to sort on the GPU. The array itself is a concatenation of multiple smaller subarrays that satisfy the condition that given i < j, the elements of the subarray i are smaller than the elements of the subarray j. An example of such array would be {5 3 4 2 1 6 9 8 7 10 11},
where the elements of the first subarray of 5 elements are smaller than the elements of the second subarray of 6 elements. The array I need is {1, 2, 3, 4, 5, 6, 7, 10, 11}. I know the position where each subarray starts in the large array.
I know I can simply use thrust::sort on the whole array, but I was wondering if it's possible to launch multiple concurrent sorts, one for each subarray. I'm hoping to get a performance improvement by doing that. My assumption is that it would be faster to sort multiple smaller arrays than one large array with all the elements.
I'd appreciate if someone could give me a way to do that or correct my assumption in case it's wrong.
A way to do multiple concurrent sorts (a "vectorized" sort) in thrust is via the marking of the sub arrays, and providing a custom functor that is an ordinary thrust sort functor that also orders the sub arrays by their key.
Another possible method is to use back-to-back thrust::stable_sort_by_key as described here.
As you have pointed out, another method in your case is just to do an ordinary sort, since that is ultimately your objective.
However I think its unlikely that any of the thrust sort methods will give a signficant speed-up over a pure sort, although you can try it. Thrust has a fast-path radix sort which it will use in certain situations, which the pure sort method could probably use in your case. (In other cases, e.g. when you provide a custom functor, thrust will often use a slower merge-sort method.)
If the sizes of the sub arrays are within certain ranges, I think you're likely to get much better results (performance-wise) with block radix sort in cub, one block per sub-array.
Here is an example that uses specific sizes (since you've given no indication of size ranges and other details), comparing a thrust "pure sort" to a thrust segmented sort with functor, to the cub block sort method. For this particular case, the cub sort is fastest:
$ cat t1.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/equal.h>
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
const int num_blocks = 2048;
const int items_per = 4;
const int nTPB = 512;
const int block_size = items_per*nTPB; // must be a whole-number multiple of nTPB;
typedef float mt;
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
struct my_sort_functor
{
template <typename T, typename T2>
__host__ __device__
bool operator()(T t1, T2 t2){
if (thrust::get<1>(t1) < thrust::get<1>(t2)) return true;
if (thrust::get<1>(t1) > thrust::get<1>(t2)) return false;
if (thrust::get<0>(t1) > thrust::get<0>(t2)) return false;
return true;}
};
// from: https://nvlabs.github.io/cub/example_block_radix_sort_8cu-example.html#_a0
#define CUB_STDERR
#include <stdio.h>
#include <iostream>
#include <algorithm>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <cub/block/block_radix_sort.cuh>
using namespace cub;
//---------------------------------------------------------------------
// Globals, constants and typedefs
//---------------------------------------------------------------------
bool g_verbose = false;
bool g_uniform_keys;
//---------------------------------------------------------------------
// Kernels
//---------------------------------------------------------------------
template <
typename Key,
int BLOCK_THREADS,
int ITEMS_PER_THREAD>
__launch_bounds__ (BLOCK_THREADS)
__global__ void BlockSortKernel(
Key *d_in, // Tile of input
Key *d_out) // Tile of output
{
enum { TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD };
// Specialize BlockLoad type for our thread block (uses warp-striped loads for coalescing, then transposes in shared memory to a blocked arrangement)
typedef BlockLoad<Key, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_WARP_TRANSPOSE> BlockLoadT;
// Specialize BlockRadixSort type for our thread block
typedef BlockRadixSort<Key, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
// Shared memory
__shared__ union TempStorage
{
typename BlockLoadT::TempStorage load;
typename BlockRadixSortT::TempStorage sort;
} temp_storage;
// Per-thread tile items
Key items[ITEMS_PER_THREAD];
// Our current block's offset
int block_offset = blockIdx.x * TILE_SIZE;
// Load items into a blocked arrangement
BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);
// Barrier for smem reuse
__syncthreads();
// Sort keys
BlockRadixSortT(temp_storage.sort).SortBlockedToStriped(items);
// Store output in striped fashion
StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, d_out + block_offset, items);
}
int main(){
const int ds = num_blocks*block_size;
thrust::host_vector<mt> data(ds);
thrust::host_vector<int> keys(ds);
for (int i = block_size; i < ds; i+=block_size) keys[i] = 1; // mark beginning of blocks
thrust::device_vector<int> d_keys = keys;
for (int i = 0; i < ds; i++) data[i] = (rand()%block_size) + (i/block_size)*block_size; // populate data
thrust::device_vector<mt> d_data = data;
thrust::inclusive_scan(d_keys.begin(), d_keys.end(), d_keys.begin()); // fill out keys array 000111222...
thrust::device_vector<mt> d1 = d_data; // make a copy of unsorted data
cudaDeviceSynchronize();
unsigned long long os = dtime_usec(0);
thrust::sort(d1.begin(), d1.end()); // ordinary sort
cudaDeviceSynchronize();
os = dtime_usec(os);
thrust::device_vector<mt> d2 = d_data; // make a copy of unsorted data
cudaDeviceSynchronize();
unsigned long long ss = dtime_usec(0);
thrust::sort(thrust::make_zip_iterator(thrust::make_tuple(d2.begin(), d_keys.begin())), thrust::make_zip_iterator(thrust::make_tuple(d2.end(), d_keys.end())), my_sort_functor());
cudaDeviceSynchronize();
ss = dtime_usec(ss);
if (!thrust::equal(d1.begin(), d1.end(), d2.begin())) {std::cout << "oops1" << std::endl; return 0;}
std::cout << "ordinary thrust sort: " << os/(float)USECPSEC << "s " << "segmented sort: " << ss/(float)USECPSEC << "s" << std::endl;
thrust::device_vector<mt> d3(ds);
cudaDeviceSynchronize();
unsigned long long cs = dtime_usec(0);
BlockSortKernel<mt, nTPB, items_per><<<num_blocks, nTPB>>>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d3.data()));
cudaDeviceSynchronize();
cs = dtime_usec(cs);
if (!thrust::equal(d1.begin(), d1.end(), d3.begin())) {std::cout << "oops2" << std::endl; return 0;}
std::cout << "cub sort: " << cs/(float)USECPSEC << "s" << std::endl;
}
$ nvcc -o t1 t1.cu
$ ./t1
ordinary thrust sort: 0.001652s segmented sort: 0.00263s
cub sort: 0.000265s
$
(CUDA 10.2.89, Tesla V100, Ubuntu 18.04)
I have no doubt that your sizes and array dimensions don't correspond to mine. The purpose here is to illustrate some possible methods, not a black-box solution that works for your particular case. You probably should do benchmark comparisons of your own. I also acknowledge that the block radix sort method for cub expects equal-sized sub-arrays, which you may not have. It may not be a suitable method for you, or you may wish to explore some kind of padding arrangement. There's no need to ask this question of me; I won't be able to answer it based on the information in your question.
I don't claim correctness for this code or any other code that I post. Anyone using any code I post does so at their own risk. I merely claim that I have attempted to address the questions in the original posting, and provide some explanation thereof. I am not claiming my code is defect-free, or that it is suitable for any particular purpose. Use it (or not) at your own risk.
I'm trying to insert one value into the third location in a host_vector using thrust.
static thrust::host_vector <int *> bins;
int * p;
bins.insert(3, 1, p);
But am getting errors:
error: no instance of overloaded function "thrust::host_vector<T, Alloc>::insert [with T=int *, Alloc=std::allocator<int *>]" matches the argument list
argument types are: (int, int, int *)
object type is: thrust::host_vector<int *, std::allocator<int *>>
Has anyone seen this before, and how can I solve this? I want to use a vector to pass information into the GPU. I was originally trying to use a vector of vectors to represent spatial cells that hold different numbers of data, but learned that wasn't possible with thrust. So instead, I'm using a vector bins that holds my data, sorted by the spatial cell (first 3 values might correspond to the first cell, the next 2 to the second cell, the next 0 to the third cell, etc.). The values held are pointers to particles, and represent the numbers of particles in the spatial cell (which is not known before runtime).
As noted in comments, thrust::host_vector is modelled directly on std::vector and the operation you are trying to use requires an iterator for the position argument, which is why you get a compilation error. You can see this if you consult the relevant documentation:
http://en.cppreference.com/w/cpp/container/vector/insert
https://thrust.github.io/doc/classthrust_1_1host__vector.html#a9bb7c8e26ee8c10c5721b584081ae065
A complete working example of the code snippet you showed would look like this:
#include <iostream>
#include <thrust/host_vector.h>
int main()
{
thrust::host_vector <int *> bins(10, reinterpret_cast<int *>(0));
int * p = reinterpret_cast<int *>(0xdeadbeef);
bins.insert(bins.begin()+3, 1, p);
auto it = bins.begin();
for(int i=0; it != bins.end(); ++it, i++) {
int* v = *it;
std::cout << i << " " << v << std::endl;
}
return 0;
}
Note that this requires that C++11 language features are enabled in nvcc (so use CUDA 8.0):
~/SO$ nvcc -std=c++11 -arch=sm_52 thrust_insert.cu
~/SO$ ./a.out
0 0
1 0
2 0
3 0xdeadbeef
4 0
5 0
6 0
7 0
8 0
9 0
10 0
In the quest for optimal matrix-matrix multiplication using eigen3 (and hopefully profiting from SIMD support) I wrote the following test:
#include <iostream>
#include <Eigen/Dense>
#include <ctime>
using namespace Eigen;
using namespace std;
const int test_size= 13;
const int test_size_16b= test_size+1;
typedef Matrix<double, Dynamic, Dynamic, ColMajor, test_size_16b, test_size_16b> TestMatrix_dyn16b_t;
typedef Matrix<double, Dynamic, Dynamic> TestMatrix_dynalloc_t;
typedef Matrix<double, test_size, test_size> TestMatrix_t;
typedef Matrix<double, test_size_16b, test_size_16b> TestMatrix_fix16b_t;
template<typename TestMatrix_t> EIGEN_DONT_INLINE void test(const char * msg, int m_size= test_size, int n= 10000) {
double s= 0.0;
clock_t elapsed= 0;
TestMatrix_t m3;
for(int i= 0; i<n; i++) {
TestMatrix_t m1 = TestMatrix_t::Random(m_size, m_size);
TestMatrix_t m2= TestMatrix_t::Random(m_size, m_size);
clock_t begin = clock();
m3.noalias()= m1*m2;
clock_t end = clock();
elapsed+= end - begin;
// make sure m3 is not optimized away
s+= m3(1, 1);
}
double elapsed_secs = double(elapsed) / CLOCKS_PER_SEC;
cout << "Elapsed time " << msg << ": " << elapsed_secs << " size " << m3.cols() << ", " << m3.rows() << endl;
}
int main() {
#ifdef EIGEN_VECTORIZE
cout << "EIGEN_VECTORIZE on " << endl;
#endif
test<TestMatrix_t> ("normal ");
test<TestMatrix_dyn16b_t> ("dyn 16b ");
test<TestMatrix_dynalloc_t>("dyn alloc");
test<TestMatrix_fix16b_t> ("fix 16b ", test_size_16b);
}
compiled with g++ -msse3 -O2 -DEIGEN_DONT_PARALLELIZE test.cpp and ran it on an Athlon II X2 255. The result rather surprised me:
EIGEN_VECTORIZE on
Elapsed time normal : 0.019193 size 13, 13
Elapsed time dyn 16b : 0.025226 size 13, 13
Elapsed time dyn alloc: 0.018648 size 13, 13
Elapsed time fix 16b : 0.018221 size 14, 14
Similar results are attained with other odd numbers for test_size. What confuses me is this:
From reading Eigen Vectorization FAQ I would have thought that a 13x13 matrix has no multiple of 16 bytes size and thus will not profit from SIMD optimization. I expected the computation time to be much worse but it isn't.
From reading about Optional template parameters I would have thought that dynamic matrices with fixed upper bound known at compile time would behave much like dynamically allocated matrices an thus would have a similar computation speed. But they don't. That's actually what surprises me the most and what triggered my initial quest: I wanted to know if it is better to use a dynamic matrix with fixed upper bound that is a multiple of 16 bytes than a fixed size matrix whos size is not a multiple of 16 bytes.
Finally interesting but not so much surprising: a matrix whos fixed size is a multiple of 16 is no slower that that of a matrix whos col and row length is one less. SIMD just does the extra col and row for free.
Not my original question but also interesting: when the test is compiled without SSE2 support and thus without vectorization the relative computation times are roughly proportional. The dynamically sized fixed memory matrix is again slowest.
To put my question short: why is Matrix<double, Dynamic, Dynamic, ColMajor, test_size_16b, test_size_16b> so much slower? Can you confirm my observations and maybe even explain them?
The FAQ was obsolete. Since Eigen version 3.3, unaligned vectors and matrices are vectorized.
Then regarding why Matrix<double, Dynamic, Dynamic, ColMajor, test_size_16b, test_size_16b> was slower, that was just an issue in the compile-time selection of the preferred matrix product implementation. The fix will be part of Eigen 3.3.1.
The compiler I use is g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4.
I compile my programs with the following command:
g++ -std=c++11 -pedantic -Wall program.cpp
The program no. 1.:
#include <iostream>
using namespace std;
int main() {
unsigned int b;
b = -54;
cout << b << endl;
return 0;
}
The program prints 4294967242 and this is the value I expected, because this is the case when we assign an out-of-range value to a variable of unsigned type, so the result is the remainder of a modulo division.
The program no. 2.:
#include <iostream>
using namespace std;
int main() {
unsigned int b;
b = 54.1234;
cout << b << endl;
return 0;
}
The program prints 54, and this is also OK, because the stored value is the part before the decimal point, and the franctional part is truncated.
The program no. 3.:
#include <iostream>
using namespace std;
int main() {
unsigned int b;
b = -54.1234;
cout << b << endl;
return 0;
}
Here during compilation I get the warning "overflow in implicit constant conversion".
And the program prints 0. Why is it so? I thought that it will do the truncation of the fractional part (as in program 2) and then store the result of the modulo division (as in program 1).
But if I write program no. 4.:
program no. 4.
#include <iostream>
using namespace std;
int main() {
unsigned int b;
float k = -54.1234;
b = k;
cout << b << endl;
return 0;
}
then I get no warning, and I get the result (expected by me) 4294967242, which is the result of the modulo division.
I would be grateful if somebody can explain it to me.
Why doesn't the program no. 3 behave like program no. 4? Why don't I get a warning when compiling program no. 1, but I get one when compiling program no. 3.?
According to the standard (ยง[conv.fpint]).
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.
So, your -54.1234 is truncated to -54. Since that can't be represented in an unsigned, you get undefined behavior.
When converting floating point numbers to integers, C and C++ round floating point numbers towards zero. The rounded result must then be representable in the destination type.
As a result, for 32 bit unsigned int the conversion is guaranteed to give the correct result if -1 < x < 2^32. For smaller numbers there are no guarantees. Since numbers between -1 and 0 must be rounded to zero, and numbers -1 and smaller have no requirements, it wouldn't be surprising if the compiler checks whether x < 0 and gives a result of 0 in that case. (The compiler might check whether x < 1 and give a result of 0; this handles very small positive numbers as well).
Situation is the following: I have a number (1000s) of elements which are given by small matrices of dimensions 4x2, 9x3 ... you get the idea. All matrices have the same dimension.
I want to multiply each of these matrices with a fixed vector of precalculated values. In short:
for(i = 1...n)
X[i] = M[i] . N;
What is the best approach to do this in parallel using Thrust? How do I lay out my data in memory?
NB: There might be specialized, more suitable libraries to do this on GPUs. I'm interested in Thrust because it allows me to deploy to different backends, not just CUDA.
One possible approach:
flatten the arrays (matrices) into a single data vector. This is an advantageous step for enabling general thrust processing anyway.
use a strided range mechanism to take your scaling vector and extend it to the overall length of your flattened data vector
use thrust::transform with thrust::multiplies to multiply the two vectors together.
If you need to access the matrices later out of your flattened data vector (or result vector), you can do so with pointer arithmetic, or a combination of fancy iterators.
If you need to re-use the extended scaling vector, you may want to use the method outlined in step 2 exactly (i.e. create an actual vector using that method, length = N matrices, repeated). If you are only doing this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (modulo the length of your matrix in elements), followed by a permutation iterator, to index into your original scaling vector (length = 1 matrix).
The following example implements the above, without using the strided range iterator method:
#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>
#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024
struct my_modulo_functor : public thrust::unary_function<int, int>
{
__host__ __device__
int operator() (int idx) {
return idx%(H_MAT*W_MAT);}
};
int main(){
thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
thrust::host_vector<int> scale(H_MAT*W_MAT);
// synthetic; instead flatten/copy matrices into data vector
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;
thrust::device_vector<int> d_data = data;
thrust::device_vector<int> d_scale = scale;
thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);
thrust::transform(d_data.begin(), d_data.end(), thrust::make_permutation_iterator(d_scale.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_modulo_functor())) ,d_result.begin(), thrust::multiplies<int>());
thrust::host_vector<int> result = d_result;
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++)
if (result[i] != data[i] * scale[i%(H_MAT*W_MAT)]) {std::cout << "Mismatch at: " << i << " cpu result: " << (data[i] * scale[i%(H_MAT*W_MAT)]) << " gpu result: " << result[i] << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
EDIT: Responding to a question below:
The benefit of fancy iterators (i.e. transform(numbers, iterator)) is that they often allow for eliminaion of extra data copies/data movement, as compared to assembling other number (which requires extra steps and data movement) and then passing it to transform(numbers, other numbers). If you're only going to use other numbers once, then the fancy iterators will generally be better. If you're going to use other numbers again, then you may want to assemble it explicitly. This preso is instructive, in particular "Fusion".
For a one-time use of other numbers the overhead of assembling it on the fly using fancy iterators and the functor is generally lower than explicitly creating a new vector, and then passing that new vector to the transform routine.
When looking for a software library which is concisely made for multiplying small matrices, then one may have a look at https://github.com/hfp/libxsmm. Below, the code requests a specialized matrix kernel according to the typical GEMM parameters (please note that some limitations apply).
double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
int prefetch = LIBXSMM_PREFETCH_AUTO;
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;
xmm = libxsmm_dmmdispatch(m, n, k,
&m/*lda*/, &k/*ldb*/, &m/*ldc*/,
&alpha, &beta, &flags, &prefetch);
Given the above code, one can proceed and run "xmm" for an entire series of (small) matrices without a particular data structure (below code also uses "prefetch locations").
if (0 < n) { /* check that n is at least 1 */
# pragma parallel omp private(i)
for (i = 0; i < (n - 1); ++i) {
const double *const ai = a + i * asize;
const double *const bi = b + i * bsize;
double *const ci = c + i * csize;
xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
}
xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
/* pseudo prefetch for last element of batch (avoids page fault) */
a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}
In addition to the manual loop control as shown above, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see ReadTheDocs). The latter is useful if data structures exist that describe the series of operands (A, B, and C matrices).
There are two reasons why this library gives superior performance: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while calculating the current product.
( Given one is looking for something that blends well with C/C++, this library supports it. However, it does not aim for CUDA/Thrust. )