Do the memory alignment issues with Eigen listed in the documentation still apply with C++11? It seems that C++11 can already take care of properly aligning objects on the stack and on the heap, with alignas and std::allocator which supports alignment.
Yes, the alignment issues are still present in C++11. The alignas specifier has no effect on dynamic allocations, which can thus still cause misalignments resulting in assertions thrown by Eigen.
You will have to continue to use the facilities Eigen provides for aligned allocation, such as EIGEN_MAKE_ALIGNED_OPERATOR_NEW for allocating objects or Eigen::aligned_allocator<T> for aligning containers.
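For reference, here is a minimal sketch of both facilities (the Camera struct is a made-up example; the macro and allocator are standard Eigen):

```cpp
#include <Eigen/Dense>
#include <vector>

struct Camera {
    Eigen::Vector4f position;         // fixed-size vectorizable member
    EIGEN_MAKE_ALIGNED_OPERATOR_NEW   // overloads operator new/delete for 16-byte alignment
};

int main() {
    Camera* cam = new Camera;         // now safely aligned on the heap
    delete cam;

    // Containers of fixed-size vectorizable types need the aligned allocator:
    std::vector<Eigen::Vector4f, Eigen::aligned_allocator<Eigen::Vector4f>> pts;
    pts.push_back(Eigen::Vector4f::Zero());
}
```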
While the question is about C++11 specifically, it is worth noting that the combination of the upcoming Eigen 3.4 with a C++17-compliant compiler frees us from the need to use EIGEN_MAKE_ALIGNED_OPERATOR_NEW and Eigen::aligned_allocator<T>; in that configuration the former macro even expands to nothing. This is made possible by a new form of operator new that is specifically designed to support over-alignment.
Back in D3DXMath we had the ability to multiply, add, subtract, and even divide the vector types, which were the D3DXVECTOR2, D3DXVECTOR3, and D3DXVECTOR4 structures.
Now, in the DirectXMath incarnation, we have XMFLOAT2, XMFLOAT3, XMFLOAT4, and XMVECTOR. If I want to do any math operation I must convert from XMFLOAT to XMVECTOR; otherwise Visual Studio throws the error "There is no user defined conversion". Why is that? It is a fact that vector operations have changed slightly in the newer versions (Windows 8.1, 10) of the DirectX math library. Am I doing something wrong?
P.S. For matrices there is another question, but right now let's talk only about vectors. These changes are pushing third-party developers to create their own math libraries, and they have done so. :)
This is actually explained in detail in the DirectXMath Programmer's Guide on MSDN:
The XMVECTOR and XMMATRIX types are the work horses for the DirectXMath Library. Every operation consumes or produces data of these types. Working with them is key to using the library. However, since DirectXMath makes use of the SIMD instruction sets, these data types are subject to a number of restrictions. It is critical that you understand these restrictions if you want to make good use of the DirectXMath functions.
You should think of XMVECTOR as a proxy for a SIMD hardware register, and XMMATRIX as a proxy for a logical grouping of four SIMD hardware registers. These types are annotated to indicate they require 16-byte alignment to work correctly. The compiler will automatically place them correctly on the stack when they are used as a local variable, or place them in the data segment when they are used as a global variable. With proper conventions, they can also be passed safely as parameters to a function (see Calling Conventions for details).
Allocations from the heap, however, are more complicated. As such, you need to be careful whenever you use either XMVECTOR or XMMATRIX as a member of a class or structure to be allocated from the heap. On Windows x64, all heap allocations are 16-byte aligned, but for Windows x86, they are only 8-byte aligned. There are options for allocating structures from the heap with 16-byte alignment (see Properly Align Allocations). For C++ programs, you can use operator new/delete/new[]/delete[] overloads (either globally or class-specific) to enforce optimal alignment if desired.
However, often it is easier and more compact to avoid using XMVECTOR or XMMATRIX directly in a class or structure. Instead, make use of the XMFLOAT3, XMFLOAT4, XMFLOAT4X3, XMFLOAT4X4, and so on, as members of your structure. Further, you can use the Vector Loading and Vector Storage functions to move the data efficiently into XMVECTOR or XMMATRIX local variables, perform computations, and store the results. There are also streaming functions (XMVector3TransformStream, XMVector4TransformStream, and so on) that efficiently operate directly on arrays of these data types.
By design, DirectXMath is encouraging you to write efficient, SIMD-friendly code. Loading or storing a vector is expensive, so you should try to work in a 'stream' model where you load data, work with it in-register a lot, then write the results.
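To make that pattern concrete, here is a minimal sketch (the Particle struct and Integrate function are hypothetical; the load/store and arithmetic calls are the actual DirectXMath API):

```cpp
#include <DirectXMath.h>
using namespace DirectX;

struct Particle {
    XMFLOAT3 position;   // compact storage types live in the struct
    XMFLOAT3 velocity;
};

void Integrate(Particle& p, float dt) {
    // Load into SIMD registers, compute in-register, store once at the end.
    XMVECTOR pos = XMLoadFloat3(&p.position);
    XMVECTOR vel = XMLoadFloat3(&p.velocity);
    pos = XMVectorMultiplyAdd(vel, XMVectorReplicate(dt), pos); // pos += vel * dt
    XMStoreFloat3(&p.position, pos);
}
```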
That said, I totally get that the usage is a little complex for people new to SIMD math or DirectX in general, and is a bit verbose even for professional developers. That's why I also wrote the SimpleMath wrapper for DirectXMath, which makes it work more like the classic math library you are looking for, using XNA Game Studio-style Vector2, Vector3, and Matrix classes with 'C++ magic' covering up all the explicit loads and stores. SimpleMath types interop neatly with DirectXMath, so you can mix and match as you want.
See this blog post and GitHub as well.
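For a flavor of what SimpleMath usage looks like, here is a small sketch (the types and factory methods are from the DirectX Tool Kit's SimpleMath.h):

```cpp
#include "SimpleMath.h"
using namespace DirectX::SimpleMath;

void Example() {
    Vector3 a(1, 2, 3), b(4, 5, 6);
    Vector3 c = a + b;                         // operator overloads hide the loads/stores
    Matrix m = Matrix::CreateRotationY(1.0f);  // build a rotation matrix
    Vector3 r = Vector3::Transform(c, m);      // transform the vector
}
```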
DirectXMath is purposely an 'inline' library, meaning that in optimized code you shouldn't be passing variables around much; instead you just compute the value inside your larger function. The D3DXMath functions in the deprecated D3DX9, D3DX10, and D3DX11 libraries are more old-school, relying on function-pointer tables and heavily performance-bound by calling-convention overhead.
These of course represent different engineering trade-offs. D3DXMath was able to do more substitution at runtime of specialized processor code paths, but pays for this flexibility with the calling-convention and indirection overhead. DirectXMath, on the other hand, assumes a SIMD baseline of SSE/SSE2 (or AVX on Xbox One) so you avoid the need for runtime detection or indirection and instead aggressively utilize inlining.
C++17 adds extensions for parallelism to the standard library (e.g. std::sort(std::execution::par_unseq, arr, arr + 1000), which will allow the sort to be done with multiple threads and with vector instructions).
I noticed that Microsoft's experimental implementation mentions that the VC++ compiler lacks support to do vectorization over here, which surprises me - I thought that modern C++ compilers are able to reason about the vectorizability of loops, but apparently the VC++ compiler/optimizer is unable to generate SIMD code even if explicitly told to do so. The seeming lack of automatic vectorization support contradicts the answers for this 2011 question on Quora, which suggests that compilers will do vectorization where possible.
Maybe compilers will only vectorize very obvious cases, such as operations on a std::array<int, 4>, and no more than that, in which case C++17's explicit parallelization would be useful.
Hence my question: Do current compilers automatically vectorize my code when not explicitly told to do so? (To make this question more concrete, let's narrow this down to Intel x86 CPUs with SIMD support, and the latest versions of GCC, Clang, MSVC, and ICC.)
As an extension: do compilers for other languages do a better job with automatic vectorization (perhaps due to language design), such that the C++ standards committee saw a need for explicit (C++17-style) vectorization?
In my experience, the best compiler for automatically spotting SIMD-style vectorisation (when told it can generate opcodes for the appropriate instruction sets, of course) is the Intel compiler, which can also generate code to dispatch dynamically depending on the actual CPU if required. It is closely followed by GCC and Clang, with MSVC last (of your four).
This is perhaps unsurprising, I realise - Intel do have a vested interest in helping developers exploit the latest features they've been adding to their offerings.
I'm working quite closely with Intel, and while they are keen to demonstrate how their compiler can spot auto-vectorisation, they also quite rightly point out that using their compiler lets you use pragma simd constructs to tell the compiler which assumptions can or can't be made (ones that are unclear at a purely syntactic level), and hence allow it to vectorise the code further without resorting to intrinsics.
This, I think, points at the issue with hoping that the compiler (for C++ or another language) will do all the vectorisation work: if you have simple vector-processing loops (e.g. multiplying all the elements of a vector by a scalar, as in the sketch below) then yes, you could expect three of the four compilers to spot that.
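A minimal sketch of such a trivially vectorisable loop (the function name is made up; check the generated assembly for packed instructions such as mulps to confirm):

```cpp
#include <cstddef>

// Modern compilers will typically vectorise this at -O2/-O3 (or /O2 for MSVC).
void scale(float* dst, const float* src, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```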
But for more complicated code, the vectorisation gains come not from simple loop unwinding and combining iterations, but from actually using a different or tweaked algorithm, and that is going to be hard if not impossible for a compiler to do completely alone. Whereas if you understand how vectorisation might be applied to an algorithm, and you can structure your code to let the compiler see the opportunity to do so, perhaps with pragma simd constructs or OpenMP (as sketched below), then you may get the results you want.
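For example, a sketch of an OpenMP SIMD hint (the saxpy function is hypothetical; compile with -fopenmp-simd on GCC/Clang or -qopenmp-simd on ICC to enable the pragma):

```cpp
// The pragma asserts that the iterations are independent, which frees the
// compiler from having to prove it before vectorising.
void saxpy(float* y, const float* x, float a, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```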
Vectorisation comes when the code has a certain mechanical sympathy for the underlying CPU and memory bus - if you have that then I think the Intel compiler will be your best bet. Without it, changing compilers may make little difference.
Can I recommend Matt Godbolt's Compiler Explorer as a way to actually test this? Put your C++ code in there and look at what different compilers actually generate. Very handy... it doesn't include older versions of MSVC (I think it currently supports VC++ 2017 and later) but it will show you what different versions of ICC, GCC, Clang, and others can do with your code.
Boost provides two different implementations of string_view, which will be a part of C++17:
boost::string_ref in utility/string_ref.hpp
boost::string_view in core/string_view.hpp
Are there any significant differences between these? Which should be preferred going forward?
Note: I noticed in Boost 1.61, boost::log has deprecated string_ref in favor of string_view; perhaps that's an indicator? (http://www.boost.org/users/history/version_1_61_0.html)
Funnily enough, right now I'm at the ACCU conference with Marshall Clow (the force behind string_view et al. on the committee), and I was quite literally about to ask him at the bar earlier today, before I was called away, about his views on string_view versus gsl::span<T> from Bjarne's Guideline Support Library (GSL), which is a very similar thing (gsl-lite is my personal favourite implementation of the GSL as it's C++03 compatible, but there are many others). I had heard they were to be unified into a single implementation for standardisation, with the gsl::span<T> direction being the future, but I'll report back here from the horse's mouth himself if I'm wrong on that. For now, assume the gsl::span<T> direction is the current future and that Boost will soon be updated with something similar, even if using string_view = gsl::span<char> is essentially string_view.
Edit: I just spoke to Marshall downstairs. He tells me that string_view, as per the implementation in Boost, is definitely in C++17. array_view is not, nor is anything historically surrounding string_view for now.
The GSL string_span is a separate entity not expected to enter C++17, nor are there any present plans to unify the implementations, as they solve different use cases: string_view is always a constant view of the borrowed character array, whereas string_span is expected to be a potentially modifiable view of the borrowed character array, with potential uses as a source for constructing new strings. So string_span might perhaps eventually become a generalisation of string_view in some future C++ standard.
According to this email from the boost mailing list, boost::string_ref won't be used in the future and is being replaced by string_view in other boost libraries.
boost::string_view has the following advantages:
Better matches what the standards committee is doing for C++17
Has WAY more constexpr support
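As a quick illustration of the non-owning semantics both types share (the print_first_word function is a made-up example; the include shown is the boost/utility/string_view.hpp header from Boost 1.61+):

```cpp
#include <boost/utility/string_view.hpp>
#include <iostream>
#include <string>

// boost::string_view borrows the characters; it never allocates or copies.
void print_first_word(boost::string_view sv) {
    std::cout << sv.substr(0, sv.find(' ')) << '\n';
}

int main() {
    std::string s = "hello world";
    print_first_word(s);                // implicit conversion from std::string
    print_first_word("from a literal"); // works for string literals, too
}
```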
I'm thinking about (just as an idea) disjoint pointer aliasing in C++0x. I was wondering whether it could be implemented similarly to const correctness - that is, enforced by the compiler. What would be the requirements for such a thing? As this is more of a thought experiment, I'm perfectly happy to look at solutions that destroy legacy code or redefine half the language, and that kind of thing.
What I'd really rather not do is have, say, restrict from C99 where the programmer just promises it. It should be enforced.
I was thinking about having unique_ptr be not part of the library, but part of the language. That way, the compiler could perform special optimizations on it, and people could still write their own unique-pointer classes if they needed to.
The Standard C++ Library (including std::unique_ptr) is a part of the language.
Also, conforming programs are not allowed to add declarations and definitions to the namespace std.
Upon seeing an instantiation of std::unique_ptr<T>, the compiler knows everything about the behavior of this instantiation - it's exactly that behavior which was implemented as a part of the language implementation the compiler itself is a part of and the compiler is free to perform "special optimizations" coming from the guarantees of the C++ standard.
As an example of something coming from the same line of thinking, GCC already does this with a number of standard C99 functions in hosted mode: it may replace standard function calls with an inline instruction sequence or with calls to other functions, precisely because GCC knows the exact semantics just from the name of the function.
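A classic instance of this (compile with optimizations enabled and inspect the output; GCC documents these as built-in functions):

```cpp
#include <cstdio>

int main() {
    // GCC recognizes that this printf call has a constant format string
    // ending in '\n' with no conversions, and emits the cheaper puts("hello").
    std::printf("hello\n");
}
```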
I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (e.g. HADDPD, horizontal-add-packed-double, in SSE3). These require a certain register layout that needs to be either deliberately set up or produced by the series of instructions before it. How often do general-purpose compilers like GCC actually use these instructions (or a subset thereof), or are they mainly to be used in hand-coded assembler? How does the compiler detect where it is appropriate to use SIMD instructions?
Generally, few compilers use them. GCC and Visual Studio aren't usually able to use the SIMD instructions. If you enable SSE as a compiler flag, they will use the scalar SSE instructions for regular floating-point operations, but generally, don't expect the vectorized ones to be used automatically. Recent versions of GCC might be able to use them in some cases, but it didn't work the last time I tried. Intel's C++ compiler is the only big compiler I know of that is able to auto-vectorize some loops.
In general though, you'll have to use them yourself, either in raw assembler or by using compiler intrinsics. In general, I'd say intrinsics are the better approach, since they better allow the compiler to understand the code, and so to schedule and optimize it, but in practice I know MSVC at least doesn't always generate very efficient code from intrinsics, so plain asm may be the best solution there. Experiment and see what works. But don't expect the compiler to use these instructions for you unless you 1) use the right compiler, and 2) write fairly simple loops that can be trivially vectorized.
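For reference, a minimal sketch of the intrinsics approach (the add4 function is made up; the intrinsics are the standard SSE ones from <xmmintrin.h>, and the code assumes n is a multiple of 4 and the pointers are 16-byte aligned):

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

void add4(float* dst, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);            // load 4 floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb)); // add and store 4 at once
    }
}
```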
Update 2012
OK, so three years have passed since I wrote this answer. GCC has been able to auto-vectorize (simple) code for a couple of years now, and with VS2012, MSVC finally gained the same capability. Of course, the main part of my answer still applies: compilers can still only vectorize fairly trivial code. For anything more complex, you're stuck fiddling with intrinsics or inline asm.
Mono can use SIMD extensions as long as you use its classes for vectors. You can read about it here: http://tirania.org/blog/archive/2008/Nov-03.html
GCC should do some automatic vectorisation as long as you're using -O3 or the specific flag -ftree-vectorize. They have an info page here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
The question of how to exploit SSE and other small vector units automatically (without direction from the programmer in the form of special language constructs or specially blessed compiler "intrinsics") has been a topic of compiler research for some time. Most results seem to be specialized to a particular problem domain, such as digital signal processing. I have not kept up with the literature on this topic, but what I have read suggests that exploiting the vector (SSE) unit is still a topic for research, and that one should have low expectations of general-purpose compilers commonly used in the field.
Suggested search term: vectorizing compiler
I have seen GCC use SSE to zero out a default std::string object. Not a particularly powerful use of SSE, but it exists. In most cases, though, you will have to write your own.
I only know this because I had allowed the stack to become unaligned and it crashed; otherwise I probably wouldn't have noticed!
If you use the Vector Pascal compiler you will get efficient SIMD code for types for which SIMD gives an advantage. Basically this is anything of length less than 64 bits (for 64-bit reals it is actually slower to use SIMD).
The latest versions of the compiler will also automatically parallelise across cores.