My code is as follows:
// Generate the returns matrix
boost::shared_ptr<Eigen::MatrixXd> returns_m = boost::make_shared<Eigen::MatrixXd>(factor_size, num_of_obs_per_simulation);
//Generate covariance matrix
boost::shared_ptr<MatrixXd> corMatrix = boost::make_shared<MatrixXd>(factor_size, factor_size);
(*corMatrix) = (*returns_m) * (*returns_m).transpose() / (num_of_obs_per_simulation - 1);
The point is that I want to return corMatrix as a pointer, not as an object, so that it can be stored as a member of a result class for later use. The code above seems to make a copy of the big matrix; is there a better way to do it?
Thank you and best wishes...
A slight improvement to ggael's answer: you can construct your corMatrix shared pointer directly from the expression:
boost::shared_ptr<MatrixXd> corMatrix
= boost::make_shared<MatrixXd>((*returns_m) * (*returns_m).transpose() * (1./(num_of_obs_per_simulation - 1)));
Or, you can exploit the symmetry of the product using rankUpdate:
boost::shared_ptr<MatrixXd> corMatrix = boost::make_shared<MatrixXd>(MatrixXd::Zero(factor_size, factor_size));
corMatrix->selfadjointView<Eigen::Upper>().rankUpdate(*returns_m, 1.0 / (num_of_obs_per_simulation - 1));
// optionally copy upper half to lower half as well:
corMatrix->triangularView<Eigen::StrictlyLower>() = corMatrix->adjoint();
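To make the symmetry trick concrete without depending on Eigen, here is a plain C++ sketch that computes R * R^T / (n - 1) for a factors x observations matrix R. The Mat type and the covariance function are hypothetical stand-ins, not Eigen API. Like rankUpdate on a selfadjointView<Upper>, it fills only the upper triangle and then mirrors it, and it allocates the result exactly once, inside the shared_ptr, which is the point of constructing the pointer directly.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Minimal column-major dense matrix; a hypothetical stand-in for Eigen::MatrixXd.
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;  // element (r, c) lives at data[r + c * rows]
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t r, std::size_t c) { return data[r + c * rows]; }
    double operator()(std::size_t r, std::size_t c) const { return data[r + c * rows]; }
};

// Returns R * R^T / (n - 1) for an f x n matrix R (factors x observations).
// The result is allocated once, inside the shared_ptr, and filled in place.
std::shared_ptr<Mat> covariance(const Mat& R) {
    assert(R.cols >= 2);  // need at least two observations for n - 1
    const std::size_t f = R.rows, n = R.cols;
    auto C = std::make_shared<Mat>(f, f);
    const double scale = 1.0 / static_cast<double>(n - 1);
    // Upper triangle only (j >= i): f*(f+1)/2 dot products instead of f*f.
    for (std::size_t i = 0; i < f; ++i)
        for (std::size_t j = i; j < f; ++j) {
            double s = 0.0;
            for (std::size_t k = 0; k < n; ++k) s += R(i, k) * R(j, k);
            (*C)(i, j) = s * scale;
        }
    // Mirror the upper triangle into the strictly lower part.
    for (std::size_t j = 0; j < f; ++j)
        for (std::size_t i = j + 1; i < f; ++i) (*C)(i, j) = (*C)(j, i);
    return C;
}
```

With Eigen itself you would keep the rankUpdate version above; this sketch only shows why it does roughly half the multiply-adds of the full product.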
I don't quite understand your question: returning corMatrix as a shared_ptr already does exactly what you want. Regarding your product, however, you can save one temporary by using noalias() and multiplying by (1./x) instead of dividing by x:
(*corMatrix).noalias() = (*returns_m) * (*returns_m).transpose() * (1./(num_of_obs_per_simulation - 1));
The whole expression will be turned into a single call to a gemm-like routine.
To complete the explanation:
Recall that Eigen is an expression template library, so when you write A = 2*B + C.transpose(); with A, B and C matrices, everything happens in operator=, that is, the right-hand-side expression is evaluated directly into A. For products the story is slightly different because 1) to be efficient, a product needs to be evaluated directly into something, and 2) it is not possible to write directly to the destination if there is aliasing, e.g., A = A*B. noalias() tells Eigen that the destination does not alias the operands, so the product can be evaluated directly into it.
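The aliasing problem in point 2) can be reproduced with a naive triple loop. This is a plain C++ sketch, not Eigen's kernel, and the function names are hypothetical: matmul_direct writes each result element straight into the destination (what noalias() permits), while matmul_safe evaluates into a temporary first (roughly what Eigen does by default for a product assignment). Computing A = A*A with the direct version overwrites A(0,0) before it is read again for A(0,1).

```cpp
#include <cstddef>
#include <vector>

// Naive n x n row-major product that writes straight into dst.
// Correct only if dst does not overlap a or b.
void matmul_direct(const double* a, const double* b, double* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double s = 0.0;
            for (std::size_t k = 0; k < n; ++k) s += a[i * n + k] * b[k * n + j];
            dst[i * n + j] = s;  // may overwrite an input that is still needed
        }
}

// Safe default: evaluate into a temporary, then copy into dst.
void matmul_safe(const double* a, const double* b, double* dst, std::size_t n) {
    std::vector<double> tmp(n * n);
    matmul_direct(a, b, tmp.data(), n);
    for (std::size_t i = 0; i < n * n; ++i) dst[i] = tmp[i];
}
```

For A = [[1, 2], [3, 4]], the safe version gives [[7, 10], [15, 22]], while the direct version already gets the second element wrong (22 instead of 10).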
I'm writing some Lua scripts in Tabletop Simulator and seeing the error "attempt to call a number value near for..in", which has me completely perplexed. Here's the code snippet with the for loop that is causing the error:
function resetTurnOrder()
  local map = getObjectFromGUID(GUIDs.Map)
  local shift, center, points = map.getPosition(), map.getTable('MapData').center, map.getSnapPoints()
  local i, p = 0
  for nation, guids in pairs(GUIDs.Nations) do
    print('Resetting turn order for ' .. nation)
    if checkScenario(nation) then
      i = i + 1
      p = map.positionToLocal(shift - points[i].position)
      p[1] = p[1] * 0.75 + center[1] * 0.25
      p[3] = p[3] * 0.75 + center[3] * 0.25
      getObjectFromGUID(guids.turn_token).setPositionSmooth(map.positionToWorld(p), false, false)
      getObjectFromGUID(guids.turn_token).setRotationSmooth({0, points[i].rotation[2], 0}, false, false)
      print('Done resetting turn order for ' .. nation)
    else
      print(nation .. ' not in this scenario')
    end
  end
end
First of all, the error went away when I commented out the two lines that assign directly to p[1] and p[3], and when I replaced those lines with the equivalent statement
p = {p[1] * 0.75 + center[1] * 0.25, p[2], p[3] * 0.75 + center[3] * 0.25}
then everything worked perfectly. However, I am completely dumbfounded as to why this would fix the error. I use this exact for loop definition in about half a dozen places to iterate over the players and their components (which are stored in the global GUIDs), and it has worked flawlessly everywhere else.
To add a little more detail, even with the old code the first iteration of the loop works perfectly. The first turn token is moved to its proper position, both messages are printed, but the error prevents further iterations. The error is clearly occurring when incrementing the loop iterator, but I can't understand how assigning directly to p[1] and p[3] could possibly interfere with this but assigning to p is fine. One more detail: declaring p inside the for loop instead of outside beforehand didn't help.
(EDITED TO ADD MORE DETAILS)
After more testing it looks like luther is probably correct that something weird is going on with the metatable for the value returned by positionToLocal. The value returned by this function is a Vector defined by Tabletop Simulator, which I believe is an extension of Unity's Vector3 type. An important detail is that this type allows you to refer to the indices x, y, z and 1, 2, 3 interchangeably.
So, I replaced the p[1] and p[3] assignments with p.x and p.z which fixed the error. This seems to imply that the Vector returned by positionToLocal did not define indices 1,2,3 explicitly but instead uses a metamethod to link those indices to x,y,z. And, somehow, that metamethod is messing with the loop iterator... but honestly that still boggles my mind.
GUIDs.Nations is the table passed to pairs() to generate the iterator, and it is basically a constant: I never add to or update it in any function because it contains static GUIDs. It certainly has no connection to p.
FURTHER DETAILS
This seems very likely to be connected to Tabletop Simulator's Vector implementation: https://forums.tabletopsimulator.com/showthread.php?8344-For-loop
The example in that thread uses a simple numeric for loop to update indices 1, 2, 3 of a Vector value, and an assignment statement that uses i to index the Vector ends up changing the value of i to the value being assigned.
I'm still unable to understand how this is even possible in the language though...
Okay, I'm certain that this is a bug in Moonsharp, which is the Lua interpreter used by Tabletop Simulator: https://github.com/moonsharp-devs/moonsharp/issues/133
The bug was fixed in March 2016 (https://github.com/moonsharp-devs/moonsharp/commit/3ebc0e1fc706c452df9b309d51daec88a15eb0d1) but it seems like TTS probably hasn't updated Moonsharp.
When casting a vector of integers (i.e., Eigen::VectorXi) to a vector of doubles and then operating on that vector of doubles, the generated assembly is dramatically different if the variable holding the result of the cast is declared auto.
In other words, using:
Eigen::VectorXi int_vec(3);
int_vec << 1, 2, 3;
Eigen::VectorXd dbl_vec = int_vec.cast<double>();
Compared to:
Eigen::VectorXi int_vec(3);
int_vec << 1, 2, 3;
auto dbl_vec = int_vec.cast<double>();
Here are two examples on godbolt:
VectorXd return type: https://godbolt.org/z/0FLC4r
auto return type: https://godbolt.org/z/MGxCaL
What are the ramifications of using auto for the return here? I thought it would be more efficient by avoiding a copy, but now I'm not sure.
In the code in your question you do avoid a copy (until dbl_vec is used, it's essentially a no-op). However, in the code on godbolt you traverse the original int_vec and evaluate dbl_vec at least twice, possibly three times:
max + std::log((dbl_vec.array() - max)
^^^             ^^^^^^^           ^^^
I'm not sure if the two calls to max are collapsed into a temporary or not. I'd hope so.
In any case, kmdreko is right and you should avoid using auto with Eigen unless you know exactly what you're doing. In this case, the auto is an expression template that does not get evaluated until used. If you use it more than once, then it gets evaluated more than once. If the evaluation is expensive, then the savings from not using a copy are lost (with interest) to the additional evaluation times.
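The re-evaluation cost can be reproduced without Eigen using a toy expression object. LazyCast, lazy_cast, evaluate and sum_twice are hypothetical names, and LazyCast only mimics the behaviour described in the answer (keep a reference to the source, convert on every read); it is not Eigen's actual cast expression. Reading the lazy object in two passes converts every element twice, while evaluating into a std::vector first converts each element once.

```cpp
#include <cstddef>
#include <vector>

// Counts element conversions so the cost of re-evaluation is visible.
std::size_t g_casts = 0;

// Toy lazy "cast" expression: stores a reference to the source and converts
// an element every time it is read, like the object auto would hold.
struct LazyCast {
    const std::vector<int>& src;
    std::size_t size() const { return src.size(); }
    double operator[](std::size_t i) const {
        ++g_casts;
        return static_cast<double>(src[i]);
    }
};

LazyCast lazy_cast(const std::vector<int>& v) { return LazyCast{v}; }

// Eager evaluation, analogous to assigning the expression to an Eigen::VectorXd.
std::vector<double> evaluate(const LazyCast& e) {
    std::vector<double> out(e.size());
    for (std::size_t i = 0; i < e.size(); ++i) out[i] = e[i];
    return out;
}

// Uses the vector in two full traversals, as the godbolt code does.
template <class Vec>
double sum_twice(const Vec& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}
```

For a three-element input, sum_twice on the lazy object performs six conversions, while evaluate followed by sum_twice performs three. Because the toy object holds a reference, it would also dangle if the source vector went out of scope first, a hazard Eigen's documentation warns about for auto as well.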
I'm trying to translate the following Matlab code to C/C++.
indl = find(dlamu1 < 0); indu = find(dlamu2 < 0);
s = min([1; -lamu1(indl)./dlamu1(indl); -lamu2(indu)./dlamu2(indu)]);
I've read in another thread that there's no equivalent of the find() function in the Eigen library yet; I'm at peace with that and have brute-forced around it.
Now, if I wanted to do the coefficient-wise division of lamu1 and dlamu1, I'd go for lamu1.cwiseQuotient(dlamu1), but how do I do that for only some of their coefficients, whose indices are specified by the coefficients of indl? I haven't found anything about this in the documentation, but maybe I'm not using the right search terms.
With the default branch you can simply write lamu1(indl), where indl is a std::vector<int>, an Eigen::VectorXi, or anything else that supports random access through operator[].
There is no equivalent of find (yet) even in the default branch. Your function can however be expressed using the select method (also works with Eigen 3.3.x):
double ret1 = (dlamu1.array()<0).select(-lamu1.cwiseQuotient(dlamu1), 1.0).minCoeff();
return std::min(1.0,ret1); // not necessary, if dlamu1.array()<0 at least once
select evaluates lazily, i.e., the quotient is calculated only where the condition is true. On the other hand, the code above performs many unnecessary comparisons with 1.0.
If [d]lamu are stored in Eigen::ArrayXd instead of Eigen::VectorXd, you can write:
double ret1 = (dlamu1<0).select(-lamu1/dlamu1, 1.0).minCoeff();
If you brute-forced indl anyway, you can as ggael suggested write:
lamu1(indl).cwiseQuotient(dlamu1(indl)).minCoeff();
(this is undefined/crashes if indl.size()==0)
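If you don't need the index vectors themselves, the whole Matlab expression can also be computed with one plain loop per vector, which avoids building indl and indu at all. This is a sketch using std::vector<double> in place of Eigen vectors, and step_bound is a hypothetical name. Each quotient is formed only where the corresponding d-entry is negative, and the leading 1 of the Matlab vector becomes the initial value, so unlike the lamu1(indl) version it is well defined when no entry is negative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Loop equivalent of
//   indl = find(dlamu1 < 0); indu = find(dlamu2 < 0);
//   s = min([1; -lamu1(indl)./dlamu1(indl); -lamu2(indu)./dlamu2(indu)]);
double step_bound(const std::vector<double>& lamu1, const std::vector<double>& dlamu1,
                  const std::vector<double>& lamu2, const std::vector<double>& dlamu2) {
    double s = 1.0;  // the leading 1 in the Matlab vector
    for (std::size_t i = 0; i < dlamu1.size(); ++i)
        if (dlamu1[i] < 0) s = std::min(s, -lamu1[i] / dlamu1[i]);
    for (std::size_t i = 0; i < dlamu2.size(); ++i)
        if (dlamu2[i] < 0) s = std::min(s, -lamu2[i] / dlamu2[i]);
    return s;
}
```

For example, lamu1 = {1, 2, 3}, dlamu1 = {-1, 0.5, -4}, lamu2 = {2}, dlamu2 = {-4} gives the candidates 1, 1, 0.75 and 0.5, so the result is 0.5.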
I need some quick how-to advice. The following scenario is based on the C API already available in my 64-bit MonetDBLite build; the intention is to use it with some ad hoc functions written in C.
In short: how can I achieve or simulate the following scenario:
update aTable set a,b,c = func(x,y,z,…)
In more detail: many algorithms return more than one variable, for instance multiple regression:
bool m_regression(IN const double **data, IN const int cols, IN const int rows, OUT double *fit_values, OUT double *residuals, OUT double *std_residuals, OUT double &p_value);
In order to minimize the transfer of data between MonetDB and the computationally heavy function, all those results are generated in one step. The question is how I can transfer them back at once, minimizing computation time and memory traffic between MonetDB and the external C/C++ (/R/Python) function.
My first thought to solve this is something like this:
1. update aTable set dummy = func_compute(x,y,z,…)
where dummy is a temporary __int64 field and func_compute computes all the necessary outputs and stores the result behind a dummy pointer. To make sure there is no issue with constant estimation, the first returned value in the array will be the real dummy pointer and the rest just an incremented value, dummy + i;
2. update aTable set a = func_ret(dummy, 1), b= func_ret (dummy, 2), c= func_ret (dummy, 3) [, dummy=func_free(dummy)];
Assuming func_ret receives dummy in the same order in which it was returned by the first call, I would just copy the prepared result into the provided storage. If the order is not preserved, I need an extra step: get the minimum (the real dummy pointer), then use the offset of the current value to look up my array.
__int64 real_dummy = __inputs[0][0];
double *my_pointer_data = (double *) (real_dummy + __inputs[1][0] * sizeof(double)* row_count);
memcpy(__outputs[0], my_pointer_data, sizeof(double)* row_count);
// or ============================
__int64 real_dummy = minimum(__inputs[0]);
double *my_pointer_data = (double *) (real_dummy + __inputs[0][1] * sizeof(double)* row_count);
for (int i=0;i<row_count;i++)
    __outputs[0][i] = my_pointer_data[__inputs[0][i] - real_dummy];
How I free the temporary memory is less relevant: it can be done in the last statement of the update or in a separate dummy update statement using func_free.
The problem is that, even though I save a (large) amount of computation time, dummy is still passed three times (is there any chance that the memory is not actually copied?).
Is there a better way of achieving this?
I am not aware of a good way of doing this, sorry. You could retrieve the table, add your columns as BATs in whichever way you like and write it back.
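For illustration only, the compute-once, fetch-later bookkeeping the question describes can be kept in plain C++ behind an integer handle instead of doing pointer arithmetic on the dummy values. This is a sketch under assumptions: compute_all, fetch_column and release are hypothetical names standing in for func_compute, func_ret and func_free, the three output formulas are placeholders for the regression outputs, and nothing here is MonetDB's UDF API. It still copies each output column once, so it does not remove the overhead the question asks about.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical in-process cache; NOT part of MonetDB's API.
using Columns = std::vector<std::vector<double>>;
static std::map<std::int64_t, Columns> g_results;
static std::int64_t g_next_handle = 1;

// Step 1 (cf. func_compute): run the expensive computation once per batch
// and keep every output column. The formulas are placeholders.
std::int64_t compute_all(const std::vector<double>& x, const std::vector<double>& y) {
    Columns cols(3, std::vector<double>(x.size()));
    for (std::size_t i = 0; i < x.size(); ++i) {
        cols[0][i] = x[i] + y[i];  // stand-in for fit_values
        cols[1][i] = x[i] - y[i];  // stand-in for residuals
        cols[2][i] = x[i] * y[i];  // stand-in for std_residuals
    }
    const std::int64_t h = g_next_handle++;
    g_results.emplace(h, std::move(cols));
    return h;
}

// Step 2 (cf. func_ret): copy output column k into the caller's buffer,
// which must hold as many doubles as there were input rows.
void fetch_column(std::int64_t h, std::size_t k, double* out) {
    const std::vector<double>& col = g_results.at(h).at(k);
    std::copy(col.begin(), col.end(), out);
}

// Step 3 (cf. func_free): drop the cached results.
void release(std::int64_t h) { g_results.erase(h); }
```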
First of all, I apologize for the overly verbose question. I couldn't think of any other way to accurately summarize my problem... Now on to the actual question:
I'm currently experimenting with C++0x rvalue references... The following code produces unwanted behavior:
#include <iostream>
#include <utility>
struct Vector4
{
    float x, y, z, w;
    inline Vector4 operator + (const Vector4& other) const
    {
        Vector4 r;
        std::cout << "constructing new temporary to store result"
                  << std::endl;
        r.x = x + other.x;
        r.y = y + other.y;
        r.z = z + other.z;
        r.w = w + other.w;
        return r;
    }
    Vector4&& operator + (Vector4&& other) const
    {
        std::cout << "reusing temporary 2nd operand to store result"
                  << std::endl;
        other.x += x;
        other.y += y;
        other.z += z;
        other.w += w;
        return std::move(other);
    }
    friend inline Vector4&& operator + (Vector4&& v1, const Vector4& v2)
    {
        std::cout << "reusing temporary 1st operand to store result"
                  << std::endl;
        v1.x += v2.x;
        v1.y += v2.y;
        v1.z += v2.z;
        v1.w += v2.w;
        return std::move(v1);
    }
};
int main (void)
{
    Vector4 r,
            v1 = {1.0f, 1.0f, 1.0f, 1.0f},
            v2 = {2.0f, 2.0f, 2.0f, 2.0f},
            v3 = {3.0f, 3.0f, 3.0f, 3.0f},
            v4 = {4.0f, 4.0f, 4.0f, 4.0f},
            v5 = {5.0f, 5.0f, 5.0f, 5.0f};
    ///////////////////////////
    // RELEVANT LINE HERE!!! //
    ///////////////////////////
    r = v1 + v2 + (v3 + v4) + v5;
    return 0;
}
results in the output
constructing new temporary to store result
constructing new temporary to store result
reusing temporary 1st operand to store result
reusing temporary 1st operand to store result
while I had hoped for something like
constructing new temporary to store result
reusing temporary 1st operand to store result
reusing temporary 2nd operand to store result
reusing temporary 2nd operand to store result
After trying to re-enact what the compiler was doing (I'm using MinGW G++ 4.5.2 with option -std=c++0x in case it matters), it actually seems quite logical. The standard says that arithmetic operations of equal precedence are evaluated/grouped left-to-right (why I assumed right-to-left I don't know, I guess it's more intuitive to me). So what happened here is that the compiler evaluated the sub-expression (v3 + v4) first (since it's in parentheses?), and then began matching the operations in the expression left-to-right against the operator overloads, resulting in a call to Vector4 operator + (const Vector4& other) for the sub-expression v1 + v2. If I want to avoid the unnecessary temporary, I'd have to make sure that no more than one lvalue operand appears to the immediate left of any parenthesized sub-expression, which is counter-intuitive to anyone using this "library" and innocently expecting optimal performance (as in minimizing the creation of temporaries).
(I'm aware that there's ambiguity in my code regarding operator + (Vector4&& v1, const Vector4& v2) and operator + (Vector4&& other) when (v3 + v4) is to be added to the result of v1 + v2, resulting in a warning. But it's harmless in my case and I don't want to add yet another overload for two rvalue reference operands; does anyone know if there's a way to disable this warning in gcc?)
Long story short, my question boils down to: Is there any way or pattern (preferably compiler-independent) this vector class could be rewritten to enable arbitrary use of parentheses in expressions that still results in the "optimal" choice of operator overloads (optimal in terms of "performance", i.e. maximizing the binding to rvalue references)? Perhaps I'm asking for too much though and it's impossible... if so, then that's fine too. I just want to make sure I'm not missing anything.
Many thanks in advance
Addendum
First, thanks for the quick responses I got, within minutes (!); I really should have started posting here sooner...
It's becoming tedious replying in the comments, so I think a clarification of my intent with this class design is in order. Maybe you can point me to a fundamental conceptual flaw in my thought process if there is one.
You may notice that I don't hold any resources in the class like heap memory. Its members are only scalar types even. At first sight this makes it a suspect candidate for move-semantics based optimizations (see also this question that actually helped me a great deal grasping the concepts behind rvalue references).
However, since the classes this one is supposed to be a prototype for will be used in a performance-critical context (a 3D engine to be precise), I want to optimize every little thing possible. Low-complexity algorithms and maths-related techniques like look-up tables should of course make up the bulk of the optimizations as anything else would simply be addressing the symptoms and not eradicating the real reason for bad performance. I am well aware of that.
With that out of the way, my intent here is to optimize algebraic expressions with vectors and matrices that are essentially plain-old-data structs without pointers to data in them (mainly due to the performance drawbacks you get with data on the heap [having to dereference additional pointers, cache considerations etc.]).
I don't care about move-assignment or construction, I just don't want more temporaries being created during the evaluation of a complicated algebraic expression than absolutely necessary (usually just one or two, e.g. a matrix and a vector).
Here are my thoughts, which might be erroneous. If they are, please correct me:
1. To achieve this without relying on RVO, return-by-reference is necessary (again: keep in mind I don't have remote resources, only scalar data members).
2. Returning by reference makes the function-call expression an lvalue, implying the returned object is not a temporary, which is bad, but returning by rvalue reference makes the function-call expression an xvalue (see 3.10.1), which is okay in the context of my approach (see 4).
3. Returning by reference is dangerous because of the possibly short lifetime of objects, but:
4. temporaries are guaranteed to live until the end of the evaluation of the expression they were created in, therefore:
5. it is safe to return by reference from those operators that take at least one rvalue reference as their argument, if the object referenced by this rvalue-reference argument is the one being returned by reference. Therefore:
6. any arbitrary expression that only employs binary operators can be evaluated by creating only one temporary when no more than one PoD-like type is involved and the binary operations don't require a temporary by nature (as matrix multiplication does).
(Another reason to return by rvalue-reference is because it behaves like returning by value in terms of rvalue-ness of the function-call expression; and it's required for the operator/function-call expression to be an rvalue in order to bind to subsequent calls to operators that take rvalue references. As stated in (2), calls to functions that return by reference are lvalues, and would therefore bind to operators with the signature T operator+(const T&, const T&), resulting in the creation of an unnecessary temporary)
I could achieve the desired performance by using a C-style approach of functions like add(Vector4 *result, Vector4 *v1, Vector4 *v2), but come on, we're living in the 21st century...
In summary, my goal is to create a vector class that achieves the same performance as the C approach while using overloaded operators. If that in itself is impossible, then I guess it can't be helped. But I'd appreciate it if someone could explain to me why my approach is doomed to fail (the left-to-right operator evaluation issue that was the initial reason for my post aside, of course).
As a matter of fact, I've been using the "real" vector class this one is a simplification of for a while without any crashes or corrupted memory so far. And in fact, I never actually return local objects as references, so there shouldn't be any problems. I dare say what I'm doing is standard-compliant.
Any help on the original issue would of course be appreciated as well!
Many thanks again for all the patience
You should not return an rvalue reference, you should return a value. In addition, you should not specify both a member and a free operator+. I'm amazed that even compiled.
Edit:
r = v1 + v2 + (v3 + v4) + v5;
How could you possibly only have one temporary value when you're performing two sub-computations? That's just impossible. You can't re-write the Standard and change this.
You will just have to trust your users not to do something completely stupid, like writing the above line of code and expecting to have just one temporary.
I recommend modeling your code after the basic_string operator+() found in chapter 21 of N3225.
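A minimal sketch of that advice for the question's Vector4, loosely following the pattern of the rvalue overloads of std::basic_string's operator+: a temporary operand is modified in place and then returned by value (moved), never by reference. The overload set is an illustration, not the answerer's code. The extra rvalue/rvalue overload removes the ambiguity the question mentions, and since every overload returns a value, nothing can dangle.

```cpp
#include <utility>

// Same plain-old-data layout as in the question; still an aggregate,
// so brace initialisation like Vector4{1.0f, 1.0f, 1.0f, 1.0f} works.
struct Vector4 {
    float x, y, z, w;
    Vector4& operator+=(const Vector4& o) {
        x += o.x; y += o.y; z += o.z; w += o.w;
        return *this;
    }
};

// lvalue + lvalue: a new result object is unavoidable.
inline Vector4 operator+(const Vector4& a, const Vector4& b) {
    Vector4 r = a;
    r += b;
    return r;
}

// rvalue + lvalue: reuse the left temporary, but return it BY VALUE.
inline Vector4 operator+(Vector4&& a, const Vector4& b) {
    a += b;
    return std::move(a);
}

// lvalue + rvalue: vector addition commutes, so reuse the right temporary.
inline Vector4 operator+(const Vector4& a, Vector4&& b) {
    b += a;
    return std::move(b);
}

// rvalue + rvalue: removes the overload ambiguity mentioned in the question.
inline Vector4 operator+(Vector4&& a, Vector4&& b) {
    a += b;
    return std::move(a);
}
```

For a struct of four floats, "moving" is just a copy, so this will not hit the "one temporary" target the question asks for; what it buys is an overload set that is unambiguous for r = v1 + v2 + (v3 + v4) + v5 and contains no reference that can outlive its object.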