I would like to create a "vectorized", high-performance Eigen::Matrix for running multiple simulation scenarios at once, where the matrix's scalar elements would be vectors of doubles instead of doubles, with componentwise operations defined on these vectors. ArrayXd looks like a good choice for such a scalar element, since it already has component-wise operations defined. Being a relative novice with Eigen, I have some questions in this regard.
Is it true that when multiplying two matrices of type Matrix<ArrayXd, ...>, generally speaking, no unnecessary temporaries are created during the computation of the sum of products of ArrayXd elements, thanks to lazily evaluated expression templates for ArrayXd?
Given that in our case there is no interaction between "horizontal" 2-dimensional slices of Matrix<ArrayXd, ...>, is this the most efficient data representation in terms of computing speed, or would the Tensor module be better?
What are the possible approaches to managing heap memory allocation when creating large Matrix<ArrayXd, ...> matrices?
The product of two Matrix<ArrayXd, ...> matrices does not compile due to several overloads of operator*, giving the error message below.
Is there an easy way to fix it?
error C2666: 'Eigen::MatrixBase<Derived>::operator *' : 3 overloads have similar conversions
    with [ Derived=Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1> ]
...\vendor\i686-win32-vc12.0\include\eigen\src\core\generalproduct.h(387): could be
    'const Eigen::Product<Derived,Derived,0>
     Eigen::MatrixBase<Derived>::operator *<Derived>(const Eigen::MatrixBase<Derived> &) const'
    with [ Derived=Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1> ]
...\vendor\i686-win32-vc12.0\include\eigen\src\plugins\commoncwisebinaryops.h(50): or
    'const Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<Eigen::Array<double,-1,1,0,-1,1>,PromotedType>,const Derived,
     const Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<Eigen::Array<double,-1,1,0,-1,1>>,const Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1>>>
     Eigen::MatrixBase<Derived>::operator *<Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1>>(const T &) const'
    with [ PromotedType=Eigen::Array<double,-1,1,0,-1,1>, Derived=Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1>, T=Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1> ]
...\vendor\i686-win32-vc12.0\include\eigen\src\plugins\commoncwisebinaryops.h(50): or
    'const Eigen::CwiseBinaryOp<Eigen::internal::scalar_product_op<Eigen::Array<double,-1,1,0,-1,1>,PromotedType>,
     const Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<Eigen::Array<double,-1,1,0,-1,1>>,const Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1>>,
     const Derived> Eigen::operator *<Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1>>(const T &,const Eigen::MatrixBase<Derived> &)'
    with [ PromotedType=Eigen::Array<double,-1,1,0,-1,1>, Derived=Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1>, T=Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1> ]
while trying to match the argument list '(Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1>, Eigen::Matrix<Eigen::ArrayXd,-1,-1,0,-1,-1>)'
Suppose I have a vector of integers and a vector of strings, and I want to compare whether they have equivalent elements, without regard to order. Ultimately, I'm asking whether the integer vector is a permutation of the string vector (or vice versa). I'd like to be able to just call is_permutation, specify a binary predicate that allows me to compare the two, and move on with my life. E.g.:
bool checkIntStringComparison(const std::vector<int>& intVec,
                              const std::vector<std::string>& stringVec,
                              const std::map<int, std::string>& intStringMap)
{
    return std::is_permutation<std::vector<int>::const_iterator, std::vector<std::string>::const_iterator>(
        intVec.cbegin(), intVec.cend(), stringVec.cbegin(),
        [&intStringMap](const int& i, const std::string& string) {
            return string == intStringMap.at(i);
        });
}
But trying to compile this (in gcc) returns an error message that boils down to:
no match for call to stuff::<lambda(const int&, const string&)>(const std::__cxx11::basic_string&, const int&)
See how it switches the calling signature from the lambda's? If I switch the lambda's parameters around, the signature switches the other way.
Digging around about this error, it seems that the standard specifies for std::is_permutation that ForwardIterator1 and 2 must be the same type. So I understand the compiler error in that regard. But why should it be this way? If I provide a binary predicate that allows me to compare the two (or if we had previously defined some equality operator between the two?), isn't the real core of the algorithm just searching through container 1 to make sure all its elements are in container 2 uniquely?
The problem is that an element can occur more than once. That means that the predicate needs to be able to not only compare the elements of the first range to the elements of the second range, but to compare the elements of the first range to themselves:
if (size(range1) != size(range2))
    return false;
for (auto const& x1 : range1)
    if (count_if(range1, [&](auto const& y1) { return pred(x1, y1); }) !=
        count_if(range2, [&](auto const& y2) { return pred(x1, y2); }))
        return false;
return true;
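That sketch can be turned into a runnable check for two ranges of the same value type; the function name is_permutation_by is mine, for illustration:

```cpp
#include <algorithm>
#include <vector>

// Runnable rendering of the pseudocode above: for every element x1 of
// range1, the number of matches in range1 must equal the number of
// matches in range2, and the sizes must agree.
template <class R1, class R2, class Pred>
bool is_permutation_by(const R1& range1, const R2& range2, Pred pred)
{
    if (range1.size() != range2.size())
        return false;
    for (auto const& x1 : range1) {
        auto n1 = std::count_if(range1.begin(), range1.end(),
                                [&](auto const& y1) { return pred(x1, y1); });
        auto n2 = std::count_if(range2.begin(), range2.end(),
                                [&](auto const& y2) { return pred(x1, y2); });
        if (n1 != n2)
            return false;
    }
    return true;
}
```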
Since it's relatively tricky to create a function object that takes two distinct signatures, and passing two predicates would be confusing, the easiest option was to specify that both ranges must have the same value type.
Your options are:
Wrap one range (or both) in a transform that gives the same value type (e.g. use Boost.Adaptors.Transformed);
Write your own implementation of std::is_permutation (e.g. copying the example implementation on cppreference);
Actually, note that the gcc (i.e. libstdc++) implementation does not enforce that the value types are the same; it just requires several signatures which you'd have to provide anyway. So you can write a polymorphic predicate, e.g. a function object or a polymorphic lambda, or one with parameter types convertible from both range value types (in your case boost::variant<int, string> - ugly, but probably not that bad). This is non-portable, though, as another implementation might choose to enforce that requirement.
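A Boost-free way to realize the first option is to materialize the transformed range and then call the ordinary homogeneous is_permutation. A sketch mirroring the question's function (the 4-iterator overload used here is C++14):

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Map the int range onto the string value type first, then compare two
// ranges of the same value type with plain std::is_permutation.
bool checkIntStringComparison(const std::vector<int>& intVec,
                              const std::vector<std::string>& stringVec,
                              const std::map<int, std::string>& intStringMap)
{
    std::vector<std::string> mapped;
    mapped.reserve(intVec.size());
    for (int i : intVec)
        mapped.push_back(intStringMap.at(i));  // throws if i is unmapped
    return std::is_permutation(mapped.cbegin(), mapped.cend(),
                               stringVec.cbegin(), stringVec.cend());
}
```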
Would somebody please describe the following code?
template<typename _Rep2, typename = typename
enable_if<is_convertible<_Rep2, rep>::value
&& (treat_as_floating_point<rep>::value
|| !treat_as_floating_point<_Rep2>::value)>::type>
constexpr explicit duration(const _Rep2& __rep)
: __r(static_cast<rep>(__rep)) { }
template<typename _Rep2, typename _Period2, typename = typename
enable_if<treat_as_floating_point<rep>::value
|| (ratio_divide<_Period2, period>::den == 1
&& !treat_as_floating_point<_Rep2>::value)>::type>
constexpr duration(const duration<_Rep2, _Period2>& __d)
: __r(duration_cast<duration>(__d).count()) { }
These are the gcc/libstdc++ implementation of the std::chrono::duration constructors. We can look at them one at a time:
template <typename _Rep2,
          typename = typename enable_if
                     <
                         is_convertible<_Rep2, rep>::value &&
                         (treat_as_floating_point<rep>::value ||
                          !treat_as_floating_point<_Rep2>::value)
                     >::type>
constexpr
explicit
duration(const _Rep2& __rep)
    : __r(static_cast<rep>(__rep))
    { }
Formatting helps readability. It doesn't really matter what the style is, as long as it has some. ;-)
This first constructor is constexpr and explicit, meaning if the inputs are compile-time constants, the constructed duration can be a compile-time constant, and the input won't implicitly convert to the duration.
The overall purpose of this constructor is to explicitly convert a scalar (or emulation of a scalar) into a chrono::duration.
The second typename in the template argument list is a constraint on _Rep2. It says:
_Rep2 must be implicitly convertible to rep (rep is the representation type of the duration), and
Either rep is a floating point type (or emulating a floating point type), or _Rep2 is not a floating point type (or emulation of one).
If these constraints are not met, this constructor literally does not exist. The effect of these constraints is that you can construct floating-point-based durations from floating-point and integral arguments, but integral-based durations must be constructed from integral arguments.
The rationale for this constraint is to prevent silently discarding the fractional part of floating-point arguments. For example:
minutes m{1.5}; // compile-time error
This will not compile because minutes is integral based, and the argument is floating point, and if it did compile, it would silently discard the .5 resulting in 1min.
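If a fractional count is genuinely wanted, the fix is to make the destination floating-point-based. A sketch (the alias fminutes and the helper whole_minutes are mine, not part of <chrono>):

```cpp
#include <chrono>
#include <ratio>

// A floating-point-based "minutes" type accepts fractional counts,
// because its rep is a floating point type.
using fminutes = std::chrono::duration<double, std::ratio<60>>;

// Converting back to integral-based minutes must be asked for
// explicitly; duration_cast truncates toward zero.
inline std::chrono::minutes whole_minutes(fminutes m)
{
    return std::chrono::duration_cast<std::chrono::minutes>(m);
}
```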
Now for the second chrono::duration constructor:
template <typename _Rep2,
          typename _Period2,
          typename = typename enable_if
                     <
                         treat_as_floating_point<rep>::value ||
                         (ratio_divide<_Period2, period>::den == 1 &&
                          !treat_as_floating_point<_Rep2>::value)
                     >::type>
constexpr
duration(const duration<_Rep2, _Period2>& __d)
    : __r(duration_cast<duration>(__d).count())
    { }
This constructor serves as a converting chrono::duration constructor. That is, it converts one unit into another (e.g. hours to minutes).
Again there is a constraint on the template arguments _Rep2 and _Period2. If these constraints are not met, the constructor does not exist. The constraints are:
rep is floating-point, or
_Period2 / period results in a ratio with a denominator of 1 and _Rep2 is an integral type (or emulation thereof).
The effect of this constraint is that if you have a floating-point duration, then any other duration (integral or floating-point-based) will implicitly convert to it.
However integral-based durations are much more picky. If you are converting to an integral-based duration, then the source duration can not be floating-point-based and the conversion from the source integral-based duration to the destination integral-based duration must be exact. That is, the conversion must not divide by any number except 1 (only multiply).
For example:
hours h = 30min; // will not compile
minutes m = 1h; // ok
The first example does not compile because it would require division by 60, producing an h that is not equal to 30min. But the second example compiles because m will exactly equal 1h (it will hold 60min).
What you can take away from this:
Always let <chrono> do conversions for you. If you are multiplying or dividing by 60 or 1000 (or whatever) in your code, you are needlessly introducing the possibility of errors. Furthermore <chrono> will let you know if you have any lossy conversions if you delegate all of your conversions to <chrono>.
Use implicit <chrono> conversions as much as possible. They will either compile and be exact, or they won't compile. If they don't compile, that means you are asking for a conversion that involves truncation error. It is ok to ask for truncation error, as long as you don't do so accidentally. The syntax for asking for a truncating conversion is:
hours h = duration_cast<hours>(30min); // ok, h == 0h
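These rules can be condensed into a small compilable sketch (the helper names exact_conversion and truncating_conversion are mine):

```cpp
#include <chrono>

using namespace std::chrono;

// Exact, implicit unit conversion: compiles because 1 hour is exactly
// 60 minutes (only a multiplication by 60 is involved).
inline minutes exact_conversion() { return hours{1}; }

// hours h = minutes{30};  // would not compile: requires dividing by 60

// Truncating conversion, requested explicitly via duration_cast:
inline hours truncating_conversion() { return duration_cast<hours>(minutes{30}); }
```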
I compiled the following code on VS2013 (using "Release" mode optimization) and was dismayed to find the assembly of std::swap(v1,v2) was not the same as std::swap(v3,v4).
#include <vector>
#include <iterator>
#include <algorithm>

template <class T>
class WRAPPED_VEC
{
public:
    typedef T value_type;

    void push_back(T value) { m_vec.push_back(value); }

    WRAPPED_VEC() = default;
    WRAPPED_VEC(WRAPPED_VEC&& other) : m_vec(std::move(other.m_vec)) {}
    WRAPPED_VEC& operator=(WRAPPED_VEC&& other)
    {
        m_vec = std::move(other.m_vec);
        return *this;
    }

private:
    std::vector<T> m_vec;
};

int main(int, char*[])
{
    WRAPPED_VEC<int> v1, v2;
    std::generate_n(std::back_inserter(v1), 10, std::rand);
    std::generate_n(std::back_inserter(v2), 10, std::rand);
    std::swap(v1, v2);

    std::vector<int> v3, v4;
    std::generate_n(std::back_inserter(v3), 10, std::rand);
    std::generate_n(std::back_inserter(v4), 10, std::rand);
    std::swap(v3, v4);

    return 0;
}
The std::swap(v3, v4) statement turns into "perfect" assembly. How can I achieve the same efficiency for std::swap(v1, v2)?
There are a couple of points to be made here.
1. If you don't know for absolutely certain that your way of calling swap is equivalent to the "correct" way of calling swap, you should always use the "correct" way:
using std::swap;
swap(v1, v2);
2. A really convenient way to look at the assembly for something like calling swap is to put the call by itself in a test function. That makes it easy to isolate the assembly:
void
test1(WRAPPED_VEC<int>& v1, WRAPPED_VEC<int>& v2)
{
    using std::swap;
    swap(v1, v2);
}

void
test2(std::vector<int>& v1, std::vector<int>& v2)
{
    using std::swap;
    swap(v1, v2);
}
As it stands, test1 will call std::swap which looks something like:
template <class T>
inline
void
swap(T& x, T& y) noexcept(is_nothrow_move_constructible<T>::value &&
                          is_nothrow_move_assignable<T>::value)
{
    T t(std::move(x));
    x = std::move(y);
    y = std::move(t);
}
And this is fast. It will use WRAPPED_VEC's move constructor and move assignment operator.
However vector swap is even faster: It swaps the vector's 3 pointers, and if std::allocator_traits<std::vector<T>::allocator_type>::propagate_on_container_swap::value is true (and it is not), also swaps the allocators. If it is false (and it is), and if the two allocators are equal (and they are), then everything is ok. Otherwise Undefined Behavior happens.
To make test1 identical to test2 performance-wise you need:
friend
void
swap(WRAPPED_VEC& v1, WRAPPED_VEC& v2)
{
    using std::swap;
    swap(v1.m_vec, v2.m_vec);
}
One interesting thing to point out:
In your case, where you are always using std::allocator<T>, the friend function is always a win. However if your code allowed other allocators, possibly those with state, which might compare unequal, and which might have propagate_on_container_swap::value false (as std::allocator<T> does), then these two implementations of swap for WRAPPED_VEC diverge somewhat:
1. If you rely on std::swap, then you take a performance hit, but you will never have the possibility to get into undefined behavior. Move construction on vector is always well-defined and O(1). Move assignment on vector is always well-defined and can be either O(1) or O(N), and either noexcept(true) or noexcept(false).
If propagate_on_container_move_assignment::value is false, and if the two allocators involved in a move assignment are unequal, vector move assignment will become O(N) and noexcept(false). Thus a swap using vector move assignment will inherit these characteristics. However, no matter what, the behavior is always well-defined.
2. If you overload swap for WRAPPED_VEC, thus relying on the swap overload for vector, then you expose yourself to the possibility of undefined behavior if the allocators compare unequal and have propagate_on_container_swap::value equal to false. But you pick up a potential performance win.
As always, there are engineering tradeoffs to be made. This post is meant to alert you to the nature of those tradeoffs.
PS: The following comment is purely stylistic. All capital names for class types are generally considered poor style. It is tradition that all capital names are reserved for macros.
The reason for this is that std::swap does have an optimized overload for type std::vector<T> (see right click -> go to definition). To make this code work fast for your wrapper, follow instructions found on cppreference.com about std::swap:
std::swap may be specialized in namespace std for user-defined types,
but such specializations are not found by ADL (the namespace std is
not the associated namespace for the user-defined type). The expected
way to make a user-defined type swappable is to provide a non-member
function swap in the same namespace as the type: see Swappable for
details.
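Put together, the cppreference advice for a wrapper like the one in the question looks roughly like this (a sketch; the namespace name and the size() accessor are mine):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

namespace app {  // any namespace holding the type will do

template <class T>
class WRAPPED_VEC {
public:
    void push_back(T value) { m_vec.push_back(std::move(value)); }
    std::size_t size() const { return m_vec.size(); }

    // Non-member swap in the same namespace as the type, found by ADL
    // when called as `using std::swap; swap(a, b);`.
    friend void swap(WRAPPED_VEC& a, WRAPPED_VEC& b) noexcept
    {
        a.m_vec.swap(b.m_vec);  // delegates to vector's pointer swap
    }

private:
    std::vector<T> m_vec;
};

}  // namespace app
```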
class Foo {
    std::vector<SomeType> data_;
};
Say Foo can only be constructed by making a copy (technically, a copy or a move) of a std::vector<SomeType> object. What's the best way to write constructor(s) for Foo?
My first feeling is
Foo(std::vector<SomeType> data) noexcept : data_(std::move(data)) {};
Using it, construction of an instance takes zero or one vector copies, depending on whether the argument for data is movable or not.
Your first feeling is good. Strictly speaking it is not optimal. But it is so close to optimal that you would be justified in saying you don't care.
Explanation:
Foo(std::vector<SomeType> data) noexcept : data_(std::move(data)) {};
When the client passes in an lvalue std::vector<SomeType> 1 copy will be made to bind to the data argument. And then 1 move will be made to "copy" the argument into data_.
When the client passes in an xvalue std::vector<SomeType> 1 move will be made to bind to the data argument. And then another move will be made to "copy" the argument into data_.
When the client passes in a prvalue std::vector<SomeType> the move will be elided in binding to the data argument. And then 1 move will be made to "copy" the argument into data_.
Summary:
client argument    number of copies    number of moves
lvalue             1                   1
xvalue             0                   2
prvalue            0                   1
If you instead did:
Foo(const std::vector<SomeType>& data) : data_(data) {};
Foo( std::vector<SomeType>&& data) noexcept : data_(std::move(data)) {};
Then you have a very slightly higher performance:
When the client passes in an lvalue std::vector<SomeType> 1 copy will be made to copy the argument into data_.
When the client passes in an xvalue std::vector<SomeType> 1 move will be made to "copy" the argument into data_.
When the client passes in a prvalue std::vector<SomeType> 1 move will be made to "copy" the argument into data_.
Summary:
client argument    number of copies    number of moves
lvalue             1                   0
xvalue             0                   1
prvalue            0                   1
Conclusion:
std::vector move constructions are very cheap, especially measured with respect to copies.
The first solution will cost you an extra move when the client passes in an lvalue. This is likely to be in the noise level, compared to the cost of the copy which must allocate memory.
The first solution will cost you an extra move when the client passes in an xvalue. This could be a weakness in the solution, as it doubles the cost. Performance testing is the only reliable way to assure that either this is, or is not an issue.
Both solutions are equivalent when the client passes a prvalue.
As the number of parameters in the constructor increases, the maintenance cost of the second solution increases exponentially. That is, you need every combination of const lvalue reference and rvalue reference for each parameter. This is very manageable at 1 parameter (two constructors), less so at 2 parameters (4 constructors), and rapidly becomes unmanageable after that (8 constructors with 3 parameters). So optimal performance is not the only concern here.
If one has many parameters, and is concerned about the cost of an extra move construction for lvalue and xvalue arguments, there are other solutions, but they involve relatively ugly template meta-programming techniques which many consider too ugly to use (I don't, but I'm trying to be unbiased).
For std::vector, the cost of an extra move construction is typically small enough you won't be able to measure it in overall application performance.
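The copy/move counts in the tables above can be verified empirically with an instrumented element type (all names here are mine, for illustration; the prvalue case relies on copy elision into the by-value parameter, guaranteed since C++17):

```cpp
#include <utility>

// Instrumented payload type that counts copy and move constructions.
struct Counted {
    static int copies;
    static int moves;
    Counted() = default;
    Counted(const Counted&) { ++copies; }
    Counted(Counted&&) noexcept { ++moves; }
    static void reset() { copies = moves = 0; }
};
int Counted::copies = 0;
int Counted::moves = 0;

// The pass-by-value-then-move pattern from the first solution:
struct Sink {
    Counted data_;
    Sink(Counted data) noexcept : data_(std::move(data)) {}
};
```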
The complexity problem of the performance-optimal solution mentioned in Howard Hinnant's answer, for constructors taking multiple arguments, can be solved with perfect forwarding:
template<typename A0>
Foo(A0 && a0) : data_(std::forward<A0>(a0)) {}
In case of more parameters, extend accordingly:
template<typename A0, typename A1, ...>
Foo(A0 && a0, A1 && a1, ...)
: m0(std::forward<A0>(a0))
, m1(std::forward<A1>(a1))
, ...
{}
class Foo {
    using data_t = std::vector<SomeType>;
    data_t data_;
public:
    constexpr Foo(data_t&& d) noexcept : data_(std::forward<data_t>(d)) {}
};
I'm porting a game to OS X, which uses operator overloading for the __m128 type, like:
__forceinline __m128 operator - ( __m128 a, __m128 b )
{
    return _mm_sub_ps ( a, b );
}

__forceinline __m128 operator * ( __m128 a, __m128 b )
{
    return _mm_mul_ps ( a, b );
}
And Apple GCC v4.2.1 gives me the following errors:
error: 'float vector operator-(float vector, float vector)' must have an argument of class or enumerated type
error: 'float vector operator*(float vector, float vector)' must have an argument of class or enumerated type
I have found a link describing this kind of error as a GCC bug that was solved in v4.0...
And now I'm completely lost... Please help me deal with this issue...
I'm using g++ 4.6.1 and hit the same error message in a similar situation; it happens when both arguments are built-in types:
algo operator+(const char* a, double b);
algo operator+(double a, const char* b);
I understand it would be tricky to redefine +(int, int), because the compiler relies on that operator for its own calculations; those calculations occur at compile time, not at runtime, so the code you provide is not available to the compiler, and by runtime the data has already been calculated. Too bad we can't do this; arguably it should be allowed for built-in types provided the two types are different (which is my case), since for that case the compiler has no default answer. For the +(int, int) case, I think it will never be allowed, for the reason explained above, unless compilers accept some sort of flag to defer those calculations to runtime (and I haven't checked for such a flag). I think a similar thing applies to floats.
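For the original __m128 problem, a common workaround (not from the posts above; the wrapper name vec4 is mine) is to wrap __m128 in a class type, so the overloads do have "an argument of class or enumerated type":

```cpp
#include <xmmintrin.h>

// Wrapping __m128 in a struct gives the operators a class-type
// argument, which satisfies both GCC and MSVC. Requires an x86 target.
struct vec4 {
    __m128 v;
    vec4(__m128 m) : v(m) {}
};

inline vec4 operator-(vec4 a, vec4 b) { return vec4(_mm_sub_ps(a.v, b.v)); }
inline vec4 operator*(vec4 a, vec4 b) { return vec4(_mm_mul_ps(a.v, b.v)); }
```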