Pass v4sf by value or reference - gcc

Which is more efficient: passing an SSE vector by value or by reference?
typedef float v4sf __attribute__ ((vector_size(16)));
//Pass by reference
void doStuff(v4sf& foo);
//Pass by value
v4sf doStuff(v4sf foo);
On the one hand, v4sf is large (16 bytes).
On the other hand, we can treat these vectors as if they were single-element data, and a reference may introduce one level of indirection.

Typically SIMD functions which take vector parameters are relatively small and performance-critical, which usually means they should be inlined. Once inlined it doesn't really matter whether you pass by value, pointer or reference, as the compiler will optimise away unnecessary copies or dereferences.
One further point: if you think you might ever need to port your code to Windows then you will almost certainly want to use references, as there are some inane ABI restrictions which limit how many vector parameters you can pass (by value), even when the function is inlined.
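A minimal sketch of the two calling styles, assuming GCC or Clang (vector_size is a compiler extension). The function names scale_by_value and scale_by_reference are made up for illustration. Both are marked inline, so after inlining the compiler would typically generate the same code for either.

```cpp
// Requires GCC or Clang: vector_size is a compiler extension.
typedef float v4sf __attribute__((vector_size(16)));

// Pass by value: takes a copy and returns the scaled vector.
static inline v4sf scale_by_value(v4sf v, float k) {
    v4sf kk = {k, k, k, k};  // broadcast the scalar by hand
    return v * kk;
}

// Pass by reference: scales the caller's vector in place.
static inline void scale_by_reference(v4sf& v, float k) {
    v4sf kk = {k, k, k, k};
    v = v * kk;
}
```

To see whether the two forms really compile to the same instructions in your build, compare the generated assembly (for example with -O2 -S).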

Related

List storing pointers or "plain object"

I am designing a class which tracks user manipulations in an application in order to restore previous application states (i.e. CTRL+Z/CTRL+Y). I simply wanted to clarify something about performance.
I am using the std::list container of the STL. This list is not meant to contain really huge objects, but a significant number. Should I use pointers or not?
For instance, here is the kinds of objects which will be stored:
struct ImagesState
{
cv::Mat first;
cv::Mat second;
};
struct StatusBarState
{
std::string notification;
std::string algorithm;
};
For now, I store the whole thing under the form of struct pointers, such as:
std::list<ImagesState*> stereoImages;
I know (I think) that the new and delete operators are time-consuming, but I don't want to run into a stack overflow with "plain objects". Is it a bad design?
If you are using a list, I would suggest not using pointers. The list items are on the heap anyway, and the pointer just adds an unnecessary layer of indirection.
If you are after performance, std::list is most likely not the best solution. Using std::vector might boost your performance significantly, since its elements are stored contiguously and are therefore friendlier to your caches.
Even in a vector, the objects lie on the heap, so the pointers are not needed (they would hurt you even more than with a list). You only have to worry about stack usage if you create an array on the stack,
like so:
Type arrayName[REALLY_HUGE_NUMBER];
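As a rough sketch of storing the states by value, here is a small undo history built on std::vector. The History class and its member names are hypothetical, and StatusBarState is copied from the question so the example does not depend on OpenCV.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Same shape as StatusBarState in the question.
struct StatusBarState {
    std::string notification;
    std::string algorithm;
};

// Hypothetical undo history that stores states by value.
class History {
public:
    // Takes the state by value and moves it into the container.
    void record(StatusBarState s) { states_.push_back(std::move(s)); }

    // Removes and returns the most recent state.
    // The caller must check empty() first.
    StatusBarState undo() {
        StatusBarState last = std::move(states_.back());
        states_.pop_back();
        return last;
    }

    bool empty() const { return states_.empty(); }
    std::size_t size() const { return states_.size(); }

private:
    // The vector object is small; its elements live in heap storage,
    // so a long history does not consume stack space.
    std::vector<StatusBarState> states_;
};
```

No new or delete appears in user code; the container owns the elements and releases them when the History object is destroyed.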

Go implicit conversion to interface does memory allocation?

When defining a function with variadic arguments of type interface{} (e.g. Printf), the arguments are apparently implicitly converted to interface instances.
Does this conversion imply memory allocation? Is this conversion fast? If I am concerned about code efficiency, should I avoid using variadic functions?
The best explanation I have found of interface memory allocation in Go is still this article by Russ Cox, one of the core Go programmers. It is well worth reading.
http://research.swtch.com/interfaces
I picked up some of the most interesting parts:
Values stored in interfaces might be arbitrarily large, but only one
word is dedicated to holding the value in the interface structure, so
the assignment allocates a chunk of memory on the heap and records the
pointer in the one-word slot.
...
Calling fmt.Printf(), the Go compiler generates code that calls the
appropriate function pointer from the itable, passing the interface
value's data word as the function's first (in this example, only)
argument.
Go's dynamic type conversions mean that it isn't reasonable for the
compiler or linker to precompute all possible itables: there are too
many (interface type, concrete type) pairs, and most won't be needed.
Instead, the compiler generates a type description structure for each
concrete type like Binary or int or func(map[int]string). Among other
metadata, the type description structure contains a list of the
methods implemented by that type.
...
The interface runtime computes the itable by looking for each method
listed in the interface type's method table in the concrete type's
method table. The runtime caches the itable after generating it, so
that this correspondence need only be computed once.
...
If the interface type involved is empty—it has no methods—then the
itable serves no purpose except to hold the pointer to the original
type. In this case, the itable can be dropped and the value can point
at the type directly.
Because Go has the hint of static typing to go along with the dynamic method lookups, it can move the lookups back from the call sites to the point when the value is stored in the interface.
Converting to an interface{} is a separate concept from variadic arguments, which are contained in a slice and can be of any type. However, these are all probably free in terms of allocations as long as they don't escape to the heap (in the gc toolchain).
The excess allocations you would see from fmt functions like Printf are going to be from reflection rather than from the use of interface{} or variadic arguments.
If you're concerned with efficiency though, avoiding indirection will always be more efficient than not, so using the correct value types will yield more efficient code. The difference can be minimal though, so benchmark the code first before concerning yourself with minor optimizations.
Go passes arguments by value, so memory is allocated for the copies anyway. You should avoid using interface{} where possible. In the case described, your function will need to reflect on the arguments to use them. Reflection is quite an expensive operation, which is why fmt.Printf() is so slow.

Data type compatibility with NEON intrinsics

I am working on ARM optimizations using the NEON intrinsics, from C++ code. I understand and master most of the typing issues, but I am stuck on this one:
The vzip_u8 intrinsic returns a uint8x8x2_t value (in fact an array of two uint8x8_t). I want to assign the returned value to a plain uint16x8_t. I see no appropriate vreinterpretq intrinsic to achieve that, and simple casts are rejected.
Some definitions to answer clearly...
NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
uint16x8_t is a type which requires 128-bit storage, so it needs to be in a quadword register.
ARM NEON Intrinsics has a definition called vector array data type in ARM® C Language Extensions:
... for use in load and store operations, in
table-lookup operations, and as the result type of operations that return a pair of vectors.
vzip instruction
... interleaves the elements of two vectors.
vzip Dd, Dm
and has an intrinsic like
uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t)
From these we can conclude that uint8x8x2_t is actually a list of two arbitrarily numbered doubleword registers, because the vzip instruction places no requirement on the order of its input registers.
Now the answer is...
uint8x8x2_t can hold two non-consecutive doubleword registers, while uint16x8_t is a data structure consisting of two consecutive doubleword registers, the first of which has an even index (D0-D31 -> Q0-Q15).
Because of this you can't cast a vector array data type made of two doubleword registers to a quadword register... easily.
The compiler may be smart enough to assist you, or you can just force the conversion; however, I would check the resulting assembly for correctness as well as performance.
You can construct a 128 bit vector from two 64 bit vectors using the vcombine_* intrinsics. Thus, you can achieve what you want like this.
#include <arm_neon.h>
uint8x16_t f(uint8x8_t a, uint8x8_t b)
{
uint8x8x2_t tmp = vzip_u8(a,b);
uint8x16_t result;
result = vcombine_u8(tmp.val[0], tmp.val[1]);
return result;
}
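The function above returns uint8x16_t, while the question asks for a uint16x8_t. A possible way to get there, sketched under assumptions, is to reinterpret the combined vector with vreinterpretq_u16_u8. The function name zip_to_u16 and the array-based interface are made up for this example, and the scalar branch exists only so the sketch also builds on hosts without NEON.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Interleaves a[0..7] with b[0..7] byte by byte and stores the
// resulting 16 bytes as eight 16-bit lanes in out[0..7].
void zip_to_u16(const std::uint8_t a[8], const std::uint8_t b[8],
                std::uint16_t out[8]) {
#if defined(__ARM_NEON)
    uint8x8x2_t zipped = vzip_u8(vld1_u8(a), vld1_u8(b));
    // Join the two 64-bit halves into one 128-bit vector.
    uint8x16_t bytes = vcombine_u8(zipped.val[0], zipped.val[1]);
    // Same 128 bits, now typed as 8 x uint16_t; no data is converted.
    uint16x8_t lanes = vreinterpretq_u16_u8(bytes);
    vst1q_u16(out, lanes);
#else
    // Scalar fallback (assumption: only for building on non-NEON hosts).
    std::uint8_t bytes[16];
    for (int i = 0; i < 8; ++i) {
        bytes[2 * i] = a[i];
        bytes[2 * i + 1] = b[i];
    }
    std::memcpy(out, bytes, sizeof bytes);
#endif
}
```

As with the cast-based workaround, it is worth checking the generated assembly to confirm the value stays in registers.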
I have found a workaround: the val member of the uint8x8x2_t type is an array, so it decays to a pointer. Casting and dereferencing that pointer works! [Whereas taking the address of the data raises an "address of temporary" warning.]
uint16x8_t Value= *(uint16x8_t*)vzip_u8(arg0, arg1).val;
It turns out that this compiles and executes as it should (at least in the case I have tried). I haven't looked at the assembly code, so I cannot guarantee it is implemented efficiently (I mean just keeping the value in a register instead of writing it to memory and reading it back).
I was facing the same kind of problem, so I introduced a flexible data type.
I can now therefore define the following:
typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
It's a bug in GCC (now fixed) in the 4.5 and 4.6 series.
Bugzilla link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48252
Please take the fix from this bug report, apply it to the GCC source and rebuild.

Performance of std::vector<Test> vs std::vector<Test*>

In an std::vector of a non POD data type, is there a difference between a vector of objects and a vector of (smart) pointers to objects? I mean a difference in the implementation of these data structures by the compiler.
E.g.:
class Test {
public:
std::string s;
Test *other;
};
std::vector<Test> vt;
std::vector<Test*> vpt;
Could there be no performance difference between vt and vpt?
In other words: when I define a vector<Test>, internally will the compiler create a vector<Test*> anyway?
In other words: when I define a vector<Test>, internally will the compiler create a vector<Test*> anyway?
No, this is not allowed by the C++ standard. The following code is legal C++:
std::vector<Test> vt;
Test t1; t1.s = "1"; t1.other = NULL;
Test t2; t2.s = "2"; t2.other = NULL;
vt.push_back(t1);
vt.push_back(t2);
Test* pt = &vt[0];
pt++;
Test q = *pt; // q is now a copy of the second element (s == "2")
In other words, a vector "decays" to an array (accessing it like a C array is legal), so the compiler effectively has to store the elements internally as an array, and may not just store pointers.
But beware that the array pointer is valid only as long as the vector is not reallocated (which normally only happens when the size grows beyond capacity).
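The contiguity argument can be checked directly. This is a small sketch; the helper names is_contiguous and push_back_keeps_pointers are hypothetical, and Test is redeclared as a struct so the example is self-contained.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Test {
    std::string s;
    Test* other;
};

// True if element i sits exactly i objects after the first element,
// i.e. the storage can be walked like a plain array.
bool is_contiguous(const std::vector<Test>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (&v[i] != v.data() + i) return false;
    }
    return true;
}

// True if one more push_back fits in the current capacity, in which
// case pointers to existing elements stay valid.
bool push_back_keeps_pointers(const std::vector<Test>& v) {
    return v.size() < v.capacity();
}
```

Calling reserve() up front is one way to make sure a known number of push_back calls will not reallocate.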
In general, whatever the type being stored in the vector is, instances of that may be copied. This means that if you are storing a std::string, instances of std::string will be copied.
For example, when you push a Type into a vector, the Type instance is copied into an instance housed inside the vector. Copying a pointer is cheap but, as Konrad Rudolph pointed out in the comments, this should not be the only thing you consider.
For simple objects like your Test, copying is going to be so fast that it will not matter.
Additionally, with C++11, moving allows avoiding creating an extra copy if one is not necessary.
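A short sketch of the difference between copying and moving into a vector. The helper names append_copied and append_moved are invented for the example, and Test is redeclared as a struct to keep it self-contained.

```cpp
#include <string>
#include <utility>
#include <vector>

struct Test {
    std::string s;
    Test* other;
};

// Appends a copy: the string contents are duplicated.
void append_copied(std::vector<Test>& v, const Test& t) {
    v.push_back(t);
}

// Appends by moving: the string's buffer can be handed over to the
// new element instead of being duplicated. After the call, t is left
// in a valid but unspecified state.
void append_moved(std::vector<Test>& v, Test&& t) {
    v.push_back(std::move(t));
}
```

A caller opts into the move explicitly, for example append_moved(v, std::move(t)), and should not rely on the contents of t afterwards.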
So in short: A pointer will be copied faster, but copying is not the only thing that matters. I would worry about maintainable, logical code first and performance when it becomes a problem (or the situation calls for it).
As for your question about an internal pointer vector: no, vectors are implemented as arrays that are resized when necessary. You can find GNU's libstdc++ implementation of vector online.
The answer gets a lot more complicated below the C++ level. Pointers will of course have to be involved, since an entire program cannot fit into registers. I don't know enough about that low a level to elaborate further, though.

Is it guaranteed that Complex Float variables will be 8-byte aligned in memory?

In C99 the new complex types were defined. I am trying to understand whether a compiler can take advantage of this knowledge in optimizing memory accesses. Are these objects (A-F) of type complex float guaranteed to be 8-byte aligned in memory?
#include <complex.h>
typedef complex float cfloat;
cfloat A;
cfloat B[10];
void func(cfloat C, cfloat *D)
{
cfloat E;
cfloat F[10];
}
Note that for D, the question relates to the object pointed to by D, not to the pointer storage itself. And, if that is assumed aligned, how can one be sure that the address passed is of an actual complex and not a cast from another (non 8-aligned) type?
UPDATE 1: I probably answered myself in the last comment regarding the D pointer. Because there is no way to know what address will be passed as the function argument, there is no way to guarantee that it will be 8-byte aligned. This is solvable via the __builtin_assume_aligned() function.
The question is still open for the other variables.
UPDATE 2: I posted a follow-up question here.
A float complex is guaranteed to have the same memory layout and alignment as an array of two float (§6.2.5). Exactly what that alignment will be is defined by your compiler or platform. All you can say for sure is that a float complex is at least as aligned as a float.
if that is assumed aligned, how can one be sure that the address passed is of an actual complex and not a cast from another (non 8-aligned) type?
If your caller passes you an insufficiently-aligned pointer, that's undefined behavior and a bug in their code (§6.3.2.3). You don't need to support that (though you may choose to).
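The points above can be sketched in code. The question is about C99, so as an assumption this example uses std::complex<float> as the C++ counterpart of float complex. The helper names is_aligned and sum_real_aligned8 are hypothetical, and __builtin_assume_aligned is a GCC/Clang builtin rather than standard C++.

```cpp
#include <complex>
#include <cstddef>
#include <cstdint>

// Mirrors the rule quoted above: at least as aligned as a float.
// Whether the alignment is actually 8 depends on the platform.
static_assert(alignof(std::complex<float>) >= alignof(float),
              "complex<float> should be at least float-aligned");

// True if p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Sums the real parts. The caller promises that d is 8-byte aligned;
// passing a less-aligned pointer is undefined behaviour.
float sum_real_aligned8(const std::complex<float>* d, std::size_t n) {
    const auto* p = static_cast<const std::complex<float>*>(
        __builtin_assume_aligned(d, 8));  // GCC/Clang builtin
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += p[i].real();
    return total;
}
```

If 8-byte alignment matters, requesting it explicitly with alignas(8) on the objects you own is more reliable than assuming the platform default.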
