I was exploring the "solveInPlace()" function of LLT in Eigen 3.3.7 to speed up the matrix inverse computation in my application.
I used the following code to test it.
int main()
{
const int M=3;
Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic> R = Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic>::Zero(M,M);
// to make sure full rank
for(int i=0; i<M*2; i++)
{
const Eigen::Matrix<MyType, Eigen::Dynamic,1> tmp = Eigen::Matrix<MyType,Eigen::Dynamic,1>::Random(M);
R += tmp*tmp.transpose();
}
std::cout<<"R \n";
std::cout<<R<<std::endl;
decltype (R) R0 = R; // saving for later comparison
Eigen::LLT<Eigen::Ref<Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic> > > myllt(R);
const Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic> I = Eigen::Matrix<MyType,Eigen::Dynamic,Eigen::Dynamic>::Identity(R.rows(), R.cols());
myllt.solveInPlace(I);
std::cout<<"I: "<<I<<std::endl;
std::cout<<"Prod InPlace: \n"<<R0*I<<std::endl;
return 0;
}
After reading the Eigen documentation, I thought that the input matrix (here "R") would be modified while computing the transform. To my surprise, I found that the result is stored in "I". This was not expected, as I defined "I" as a constant. Please provide an explanation for this behaviour.
The simple non-compiler answer would be that you're asking for the LLT to solve in-place (i.e. in the passed parameter) so what would you expect the result to be? Apparently, you would expect it to be a compiler error, as the "in-place" means change the parameter, but you're passing a const object.
So, if we search the Eigen docs for solveInPlace, we find the only item that takes a const reference to have the following note:
"in-place" version of TriangularView::solve() where the result is written in other
Warning
The parameter is only marked 'const' to make the C++ compiler accept a temporary expression here. This function will const_cast it, so constness isn't honored here.
The non-in-place option would be:
R = myllt.solve(I);
but that won't really speed up the calculation. In any case, benchmark before you decide that you need the in-place option.
Your question is well placed, as what const_cast is meant to do is strip references/pointers of their const-ness iff the underlying variable is not const qualified* (cppref). If you were to write some examples:
const int i = 4;
int& iRef = const_cast<int&>(i); // Compiles, but i is actually const
iRef++; // UB: modifies an object that was declared const
std::cout << i; // Prints "I want coffee", or it can as we like UB
int j = 4;
const int& jRef = j;
const_cast<int&>(jRef)++; // Legal. Underlying variable is not const.
std::cout << j; // Prints 5
The case with i may well work as expected or not, we're dependent on each implementation/compiler. It may work with gcc but not with clang or MSVC. There are no guarantees. As you are indirectly invoking UB in your example, the compiler can choose to do what you expect or something else entirely.
*Technically it's the modification that's UB, not the const_cast itself.
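The pattern that the Eigen warning describes can be reproduced without Eigen. The sketch below is only an illustration: fillInPlace and makeFilled are hypothetical names, not Eigen APIs. The function takes a const reference purely so that it accepts any argument, then writes through a const_cast, which is well-defined only when the caller's object is not itself const.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an "in-place" API: the parameter is declared
// const only for convenience, and the function writes to it anyway.
void fillInPlace(const std::vector<int>& other, int value)
{
    // Strips the const from the reference. Writing through the result is
    // well-defined only if the object the caller passed is not itself const.
    std::vector<int>& out = const_cast<std::vector<int>&>(other);
    for (std::size_t k = 0; k < out.size(); ++k)
        out[k] = value;
}

// Safe usage: the vector is non-const, so the write is legal.
std::vector<int> makeFilled(std::size_t n, int value)
{
    std::vector<int> v(n, 0);
    fillInPlace(v, value);
    return v;
}
```

By the same reasoning, declaring "I" without const in the question's code should avoid the undefined behaviour, since the object written to would then be non-const.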
I have defined a constexpr function as following:
constexpr int foo(int i)
{
return i*2;
}
And this is what in the main function:
int main()
{
int i = 2;
cout << foo(i) << endl;
int arr[foo(i)];
for (int j = 0; j < foo(i); j++)
arr[j] = j;
for (int j = 0; j < foo(i); j++)
cout << arr[j] << " ";
cout << endl;
return 0;
}
The program was compiled under OS X 10.8 with command clang++. I was surprised that the compiler did not produce any error message about foo(i) not being a constant expression, and the compiled program actually worked fine. Why?
The definition of constexpr functions in C++ is such that the function is guaranteed to be able to produce a constant expression when called such that only constant expressions are used in the evaluation. Whether the evaluation happens at compile time or at run time if the result isn't used in a constexpr isn't specified, though (see also this answer). When passing non-constant expressions to a constexpr function you may not get a constant expression.
Your above code should, however, not compile because i is not a constant expression which is clearly used by foo() to produce a result and it is then used as an array dimension. It seems clang implements C-style variable length arrays as it produces the following warning for me:
warning: variable length arrays are a C99 feature [-Wvla-extension]
A better test to see if something is, indeed, a constant expression is to use it to initialize the value of a constexpr, e.g.:
constexpr int j = foo(i);
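As a minimal sketch of that test, the block below reuses foo from the question; the names kTwo, kFour, table and doubledAtRuntime are hypothetical and exist only for illustration. The constexpr variable and the static_assert compile only because the argument is itself a constant expression.

```cpp
// Same function as in the question.
constexpr int foo(int i)
{
    return i * 2;
}

// A constexpr variable forces compile-time evaluation, so these lines only
// compile if foo's argument is itself a constant expression.
constexpr int kTwo = 2;
constexpr int kFour = foo(kTwo);
static_assert(kFour == 4, "foo(2) must be usable at compile time");

// Standard C++ array bound: fine because foo(kTwo) is a constant expression.
int table[foo(kTwo)];

// With a run-time argument, foo behaves like an ordinary function.
int doubledAtRuntime(int n)
{
    // constexpr int bad = foo(n);  // would not compile: n is not a constant
    return foo(n);
}
```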
I used the code at the top (with "using namespace std;" added in) and had no errors when compiling using "g++ -std=c++11 code.cc" (see below for a reference that qualifies this code). Here is the code and output:
#include <iostream>
using namespace std;
constexpr int foo(int i)
{
return i*2;
}
int main()
{
int i = 2;
cout << foo(i) << endl;
int arr[foo(i)];
for (int j = 0; j < foo(i); j++)
arr[j] = j;
for (int j = 0; j < foo(i); j++)
cout << arr[j] << " ";
cout << endl;
return 0;
}
output:
4
0 1 2 3
Now consider reference https://msdn.microsoft.com/en-us/library/dn956974.aspx It states: "...A constexpr function is one whose return value can be computed at compile when consuming code requires it. A constexpr function must accept and return only literal types. When its arguments are constexpr values, and consuming code requires the return value at compile time, for example to initialize a constexpr variable or provide a non-type template argument, it produces a compile-time constant. When called with non-constexpr arguments, or when its value is not required at compile-time, it produces a value at run time like a regular function. (This dual behavior saves you from having to write constexpr and non-constexpr versions of the same function.)"
It gives as valid example:
constexpr float exp(float x, int n)
{
return n == 0 ? 1 :
n % 2 == 0 ? exp(x * x, n / 2) :
exp(x * x, (n - 1) / 2) * x;
}
This is an old question, but it's the first result on a google search for the VS error message "constexpr function return is non-constant". And while it doesn't help my situation, I thought I'd put my two cents in...
While Dietmar gives a good explanation of constexpr, and although the error should be caught straight away (as it is with the -pedantic flag), this code looks like it's suffering from some compiler optimization.
The value i is being set to 2, and for the duration of the program i never changes. The compiler probably noticed this and optimized the variable to be a constant (just replacing all references to variable i to the constant 2... before applying that parameter to the function), thus creating a constexpr call to foo().
I bet if you looked at the disassembly you'd see that calls to foo(i) were replaced with the constant value 4 - since that is the only possible return value for a call to this function during execution of the program.
Using the -pedantic flag forces the compiler to analyze the program from the strictest point of view (probably done before any optimizations) and thus catches the error.
I am writing a function in RcppEigen for weighted covariances. In one of the steps I want to take column i and column j of a matrix, X, and compute the cwiseProduct, which should return some kind of vector. The output of cwiseProduct will go into an intermediate variable which can be reused many times. From the docs it seems cwiseProduct returns a CwiseBinaryOp, which itself takes two types. My cwiseProduct operates on two column vectors, so I thought the correct return type should be Eigen::CwiseBinaryOp<Eigen::ColXpr, Eigen::ColXpr>, but I get the error no member named ColXpr in namespace Eigen
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
Rcpp::List Crossprod_sparse(Eigen::MappedSparseMatrix<double> X, Eigen::Map<Eigen::MatrixXd> W) {
int K = W.cols();
int p = X.cols();
Rcpp::List crossprods(W.cols());
for (int i = 0; i < p; i++) {
for (int j = i; j < p; j++) {
Eigen::CwiseBinaryOp<Eigen::ColXpr, Eigen::ColXpr> prod = X.col(i).cwiseProduct(X.col(j));
for (int k = 0; k < K; k++) {
//double out = prod.dot(W.col(k));
}
}
}
return crossprods;
}
I have also tried saving into a SparseVector
Eigen::SparseVector<double> prod = X.col(i).cwiseProduct(X.col(j));
as well as computing, but not saving at all
X.col(i).cwiseProduct(X.col(j));
If I don't save the product at all, the function returns very quickly, hinting that cwiseProduct is not an expensive function. When I save it into a SparseVector, the function is extremely slow, making me think that SparseVector is not the right return type and Eigen is doing extra work to get it into that type.
Recall that Eigen relies on expression templates, so if you don't assign an expression then this expression is essentially a no-op. In your case, assigning it to a SparseVector is the right thing to do. Regarding speed, make sure to compile with compiler optimizations ON (like -O3).
Nonetheless, I believe there is a faster way to write your overall computations. For instance, are you sure that all X.col(i).cwiseProduct(X.col(j)) are non empty? If not, then the second loop should be rewritten to iterate over the sparse set of overlapping columns only. Loops could also be interchanged to leverage efficient matrix products.
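To illustrate what iterating over only the overlapping entries could look like, here is a plain C++ sketch that does not use Eigen at all. SparseCol and weightedOverlapDot are hypothetical names, and the sorted (row, value) layout is an assumption made for the example rather than Eigen's storage format.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sparse column: (row index, value) pairs sorted by row index.
typedef std::vector<std::pair<int, double> > SparseCol;

// Computes the sum over rows r of a[r] * b[r] * w[r], visiting only rows
// where both columns store an entry (a merge over two sorted index lists).
double weightedOverlapDot(const SparseCol& a, const SparseCol& b,
                          const std::vector<double>& w)
{
    double sum = 0.0;
    std::size_t ia = 0, ib = 0;
    while (ia < a.size() && ib < b.size()) {
        if (a[ia].first < b[ib].first) {
            ++ia;                       // row only in a: product is zero
        } else if (b[ib].first < a[ia].first) {
            ++ib;                       // row only in b: product is zero
        } else {
            sum += a[ia].second * b[ib].second * w[a[ia].first];
            ++ia;
            ++ib;
        }
    }
    return sum;
}
```

When two columns share no rows, the loop finishes without doing any multiplications, which is the saving the paragraph above points to.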
I have a C++11 application where I commonly iterate over several different structures of arrays for various algorithms. Raw CPU performance is important for this app.
The array elements are fundamental types (int, double, ..) or simple structs. The arrays are typically tens of thousands of elements long. I often need to iterate over several arrays at once in a given loop, so typically I would need one pointer for each array of whatever type. Sometimes I need to increment five individual pointers, which is verbose.
Based on these answers about tuples,
Why is std::pair faster than std::tuple
C++11 tuple performance
I hoped there was no overhead to using tuples to pack the pointers together into a single object.
I thought it might be nice to implement a cursor like object to assist in iterating, since missing the increment on a particular pointer would be an annoying bug.
auto pts = std::make_tuple(p1, p2, p3...);
allows you to bundle a bunch of variables together in a type-safe way. Then you can implement a variadic template function to increment each pointer in the tuple in a type-safe way.
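A minimal C++11 sketch of such an increment helper is shown here. The names advanceAll and dotProduct are hypothetical, and this is not the cursor code that was actually measured; it only shows one way the idea could be written.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Base case: index I is past the last element, nothing left to increment.
template <std::size_t I = 0, typename... Ptrs>
typename std::enable_if<I == sizeof...(Ptrs), void>::type
advanceAll(std::tuple<Ptrs...>&)
{
}

// Recursive case: increment element I, then move on to element I + 1.
template <std::size_t I = 0, typename... Ptrs>
typename std::enable_if<(I < sizeof...(Ptrs)), void>::type
advanceAll(std::tuple<Ptrs...>& cursor)
{
    ++std::get<I>(cursor);
    advanceAll<I + 1>(cursor);
}

// Example: sum a[k] * b[k] while walking two arrays with one cursor.
double dotProduct(const int* a, const double* b, std::size_t n)
{
    auto cursor = std::make_tuple(a, b);
    double sum = 0.0;
    for (std::size_t k = 0; k < n; ++k) {
        sum += *std::get<0>(cursor) * *std::get<1>(cursor);
        advanceAll(cursor);  // one call advances every pointer
    }
    return sum;
}
```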
However...
When I measured performance, the tuple version was slower than using raw pointers. When I look at the generated assembly I see additional mov instructions in the tuple loop increment. Maybe due to the fact that std::get<> returns a reference? I had hoped that would be compiled away...
Am I missing something or are raw pointers just going to beat tuples when used like this? Here is a simple test harness. I threw away the fancy cursor code and just use a std::tuple<> for this test
On my machine, the tuple loop is consistently twice as slow as the raw pointer version for various data sizes.
My system config is Visual C++ 2013 x64 on Windows 8 with a release build. I did try turning on various optimization in Visual Studio such as
Inline Function Expansion : Any Suitable (/Ob2)
but it did not seem to change the time result for my case.
I did need to do two extra things to avoid aggressive optimization by VS
1) I forced the test data array to be allocated on the heap, not the stack. That made a big difference when I timed things, possibly due to memory cache effects.
2) I forced a side effect by writing to static variable at the end so the compiler would not just skip my loop.
struct forceHeap
{
__declspec(noinline) int* newData(int M)
{
int* data = new int[M];
return data;
}
};
void timeSumCursor()
{
static int gIntStore;
int maxCount = 20;
int M = 10000000;
// compiler might place array on stack which changes the timing
// int* data = new int[N];
forceHeap fh;
int* data = fh.newData(M);
int *front = data;
int *end = data + M;
int j = 0;
for (int* p = front; p < end; ++p)
{
*p = (++j) % 1000;
}
{
BEGIN_TIMING_BLOCK("raw pointer loop", maxCount);
int* p = front;
int sum = 0;
int* cursor = front;
while (++cursor != end)
{
sum += *cursor;
}
gIntStore = sum;// force a side effect
END_TIMING_BLOCK();
}
printf("%d\n", gIntStore);
{
// just use a simple tuple to show the issue
// rather full blown cursor object
BEGIN_TIMING_BLOCK("tuple loop", maxCount);
int sum = 0;
auto cursor = std::make_tuple(front);
while (++std::get<0>(cursor) != end)
{
sum += *std::get<0>(cursor);
}
gIntStore = sum; // force a side effect
END_TIMING_BLOCK();
}
printf("%d\n", gIntStore);
delete[] data;
}
I am creating a ruby wrapper for the fftw3 library for the Scientific Ruby Foundation which uses nmatrix objects instead of regular ruby arrays.
I have a problem with returning the transformed array: I am not sure how to do this so that my specs can check that the transform has been computed correctly against Octave (or something similar).
My idea is that it might be best to cast the output array, out, which has type fftw_complex, to a VALUE and pass it to the nmatrix object before returning, but I am not sure whether I should instead be using wisdom and getting the values from that with fftw.
Here is the method and the link to the spec output on travis-ci
static VALUE
fftw_r2c_one(VALUE self, VALUE nmatrix)
{
VALUE cNMatrix = rb_define_class("NMatrix", rb_cObject);
fftw_plan plan;
VALUE shape = rb_funcall(nmatrix, rb_intern("shape"), 0);
const int size = NUM2INT(rb_funcall(cNMatrix, rb_intern("size"), 1, shape));
double* in = ALLOC_N(double, size);
for (int i = 0; i < size; i++)
{
in[i] = NUM2DBL(rb_funcall(nmatrix, rb_intern("[]"), 1, INT2FIX(i)));
printf("IN[%d]: in[%.2f] \n", i, in[i]);
}
fftw_complex* out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * size + 1);
plan = fftw_plan_dft_r2c(1,&size, in, out, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
xfree(in);
fftw_free(out);
return nmatrix;
}
Feel free to clone the repo from github and have a play about, if you like.
Note: I am pretty new to fftw3 and had not used C (or ruby) much before starting this project. I am more used to java, python and javascript, so I haven't quite got my head around lower-level concepts like memory management, but I am getting there with this project. Please bear that in mind in your answers, and try to keep them clear for someone who until recently has mainly used an object-oriented approach, by avoiding jargon (or taking care to point it out), as that would really help.
Thank you.
I got some advice from Colin Fuller and after some pointers from him I came up with this solution:
VALUE fftw_complex_to_nm_complex(fftw_complex* in) {
// fftw_complex is double[2]: index 0 is the real part, index 1 the imaginary part
double real = ((double (*)) in)[0];
double imag = ((double (*)) in)[1];
VALUE mKernel = rb_define_module("Kernel");
return rb_funcall(mKernel,
rb_intern("Complex"),
2,
rb_float_new(real),
rb_float_new(imag));
}
/**
fftw_r2c
@param self
@param nmatrix
@return nmatrix
With FFTW_ESTIMATE as a flag in the plan,
the input and and output are not overwritten at runtime
The plan will use a heuristic approach to picking plans
rather than take measurements
*/
static VALUE
fftw_r2c_one(VALUE self, VALUE nmatrix)
{
/**
Define and initialise the NMatrix class:
The initialisation rb_define_class will
just retrieve the NMatrix class that already exists
or define a new class altogether if it does not
find NMatrix. */
VALUE cNMatrix = rb_define_class("NMatrix", rb_cObject);
fftw_plan plan;
const int rank = rb_iv_set(self, "#rank", 1);
// shape is a ruby array, e.g. [2, 2] for a 2x2 matrix
VALUE shape = rb_funcall(nmatrix, rb_intern("shape"), 0);
// size is the number of elements stored for a matrix with dimensions = shape
const int size = NUM2INT(rb_funcall(cNMatrix, rb_intern("size"), 1, shape));
double* in = ALLOC_N(double, size);
fftw_complex* out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * size * size);
for (int i = 0; i < size; i++)
{
in[i] = NUM2DBL(rb_funcall(nmatrix, rb_intern("[]"), 1, INT2FIX(i)));;
}
plan = fftw_plan_dft_r2c(1,&size, in, out, FFTW_ESTIMATE);
fftw_execute(plan);
for (int i = 0; i < 2; i++)
{
rb_funcall(nmatrix, rb_intern("[]="), 2, INT2FIX(i), fftw_complex_to_nm_complex(out + i));
}
// INFO: http://www.fftw.org/doc/New_002darray-Execute-Functions.html#New_002darray-Execute-Functions
fftw_destroy_plan(plan);
xfree(in);
fftw_free(out);
return nmatrix;
}
The only problem that remains is getting the specs to recognise the output types, which I am looking at solving in the ruby core Complex API.
If you want to see any performance benefit from using FFTW then you'll need to re-factor this code so that plan generation is performed only once for a given FFT size, since plan generation is quite costly, while executing the plan is where the performance gains come from.
You could either
a) have two entry points - an initialisation routine which generates the plan and then a main entry point which executes the plan
b) use a memoization technique so that you only generate the plan once, the first time you are called for a given FFT dimension, and then you cache the plan for subsequent re-use.
The advantage of b) is that it is a cleaner implementation with a single entry point; the disadvantage being that it breaks if you call the function with dimensions that change frequently.
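Option b) could be sketched roughly as follows. Everything here is a hypothetical stand-in: Plan, buildPlan, planFor and g_plansBuilt are illustration names, and no FFTW function is called. A real wrapper would store an fftw_plan in the cache and would also have to account for a plan being tied to the arrays it was created with, for example via the new-array execute functions linked in the code above.

```cpp
#include <map>
#include <utility>

// Hypothetical stand-in for an expensive-to-build plan; a real wrapper
// would hold an fftw_plan here instead.
struct Plan {
    int size;
};

// Counts how often the expensive step runs (for illustration only).
static int g_plansBuilt = 0;

// Hypothetical expensive construction step.
Plan buildPlan(int size)
{
    ++g_plansBuilt;
    Plan p;
    p.size = size;
    return p;
}

// Returns the cached plan for this size, building it on first use only.
const Plan& planFor(int size)
{
    static std::map<int, Plan> cache;
    std::map<int, Plan>::iterator it = cache.find(size);
    if (it == cache.end())
        it = cache.insert(std::make_pair(size, buildPlan(size))).first;
    return it->second;
}
```

The cache grows by one entry per distinct size, so a caller that keeps changing dimensions pays the planning cost every time and gains nothing from the cache.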
Consider you have some expression like
i = j = 0
supposing this is well-defined in your language of choice. Would it generally be better to split this up into two expressions like
i = 0
j = 0
I see this sometimes in library code. It doesn't seem to buy you much in terms of brevity and shouldn't perform any better than the two statements (though that may be compiler dependent). So, is there a reason to use one over the other? Or is it just personal preference? I know this sounds like a silly question but it's been bugging me for a long time now :-).
Once upon a time there was a performance difference, which is one of the reasons that this kind of assignment was used. The compilers would turn i = 0; j = 0; into:
load 0
store [i]
load 0
store [j]
So you could save an instruction by using i = j = 0 as the compiler would turn this into:
load 0
store [j]
store [i]
Nowadays compilers can do this type of optimisations by themselves. Also, as the current CPUs run several instructions at once, performance can no longer simply be measured in number of instructions. Instructions where one action doesn't rely on the result of another can run in parallel, so the version that uses a separate value for each variable might actually be faster.
Regarding programming style, you should use the way that best expresses the intention of the code.
You can for example chain the assignments when you simply want to clear some variables, and make it separate assignments when the value has a specific meaning. Especially if the meaning of setting one variable to the value is different from setting the other variable to the same value.
The two forms reflect different points of view on the assignment.
The first case treats assignment (at least the inner one) as an operator (a function returning a value).
The second case treats assignment as a statement (a command to do something).
There are some cases where assignment as an operator has its point, mostly for brevity, or for use in contexts that expect a result. However, I find it confusing, for several reasons:
Assignment operators are basically side-effect operators, and nowadays it's a problem for compilers to optimize them. In languages like C and C++ they lead to many Undefined Behavior cases, or unoptimized code.
It is unclear what it should return. Should the assignment operator return the value that has been assigned, or should it return the address of the place where it has been stored? One or the other could be useful, depending on the context.
With composite assignments like +=, it's even worse. It is unclear if the operator should return the initial value, the combined result, or even the place it was stored to.
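For C++ specifically, the language does pin these questions down for built-in types: assignment and compound assignment both yield an lvalue referring to the left operand, holding the new value. A minimal sketch (the function names are hypothetical, chosen only for illustration):

```cpp
// Chained assignment groups right to left: i = (j = value).
int chained(int value)
{
    int i = -1, j = -1;
    i = j = value;
    return i + j;
}

// In C++, built-in assignment yields an lvalue referring to the left
// operand, so the result can be bound to a reference and modified.
int assignThenBump(int value)
{
    int i = 0;
    int& ref = (i = value);  // ref aliases i
    ref += 1;
    return i;
}

// Compound assignment yields the same thing: the updated left operand.
int compoundResult(int start, int delta)
{
    int i = start;
    int result = (i += delta);  // combined value, not the initial one
    return result;
}
```

Other languages make different choices, so this settles the question only for C++.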
Assignment as a statement sometimes leads to intermediate variables, but that's the only drawback I see. It is clear, and compilers know how to efficiently optimize successive statements of this kind.
Basically, I would avoid assignment as an operator whenever possible. The presented case is very simple and not really confusing, but as a general rule I would still prefer
i = 0
j = 0
or
i, j = 0, 0
for languages that support parallel assignment.
It depends on the language. In highly-object-oriented languages, double assignment results in the same object being assigned to multiple variables, so changes in one variable are reflected in the other.
$ python -c 'a = b = [] ; a.append(1) ; print b'
[1]
Firstly, at a semantic level, it depends whether you want to say that i and j are the same value, or just happen to both have the same value.
For example, if i and j are the indexes into a 2D array, they both start at zero. j = i = 0 says i starts at zero, and j starts where i started. If you wanted to start at the second row, you wouldn't necessarily want to start at the second column, so I wouldn't initialise them both in the same statement - the indices for rows and columns independently happen to both start at zero.
Also, in languages where i and j represent complicated objects rather than integral variables, or where assignment may cause an implicit conversion, they are not equivalent:
#include <iostream>
class ComplicatedObject
{
public:
const ComplicatedObject& operator= ( const ComplicatedObject& other ) {
std::cout << " ComplicatedObject::operator= ( const ComplicatedObject& )\n";
return *this;
}
const ComplicatedObject& operator= ( int value ) {
std::cout << " ComplicatedObject::operator= ( int )\n";
return *this;
}
};
int main ()
{
{
// the functions called are not the same
ComplicatedObject i;
ComplicatedObject j;
std::cout << "j = i = 0:\n";
j = i = 0;
std::cout << "i = 0; j = 0:\n";
i = 0;
j = 0;
}
{
// the result of the first assignment is
// affected by implicit conversion
double j;
int i;
std::cout << "j = i = 0.1:\n";
j = i = 0.1;
std::cout << " i == " << i << '\n'
<< " j == " << j << '\n'
;
std::cout << "i = 0.1; j = 0.1:\n";
i = 0.1;
j = 0.1;
std::cout << " i == " << i << '\n'
<< " j == " << j << '\n'
;
}
}
Most people will find both possibilities equally readable. Some of them will have a personal preference either way. But there are people who might, at first glance, get confused by the "double assignment". I personally like the separate approach, because
It is 100% readable
It is not really verbose compared to the double variant
It allows me to forget the rules of associativity for the = operator
The second way is more readable and clear, I prefer it.
However I try to avoid "double" declaration:
int i, j;
instead of
int i;
int j;
if they're going consecutively. Especially in case of MyVeryLong.AndComplexType