GCC Vector Extensions Sqrt

GCC Vector Extensions Sqrt - gcc

I am currently experimenting with the GCC vector extensions. However, I am wondering how to go about getting sqrt(vec) to work as expected.
As in:
typedef double v4d __attribute__ ((vector_size (16)));
v4d myfunc(v4d in)
{
return some_sqrt(in);
}
and at least on a recent x86 system have it emit a call to the relevant intrinsic sqrtpd. Is there a GCC builtin for sqrt that works on vector types or does one need to drop down to the intrinsic level to accomplish this?

Looks like it's a bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54408 I don't know of any workaround other than do it component-wise. The vector extensions were never meant to replace platform specific intrinsics anyway.
Some funky code to this effect:
#include <cmath>
#include <utility>
template <::std::size_t...> struct indices { };
template <::std::size_t M, ::std::size_t... Is>
struct make_indices : make_indices<M - 1, M - 1, Is...> {};
template <::std::size_t... Is>
struct make_indices<0, Is...> : indices<Is...> {};
typedef float vec_type __attribute__ ((vector_size(4 * sizeof(float))));
template <::std::size_t ...Is>
vec_type sqrt_(vec_type const& v, indices<Is...> const)
{
vec_type r;
::std::initializer_list<int>{(r[Is] = ::std::sqrt(v[Is]), 0)...};
return r;
}
vec_type sqrt(vec_type const& v)
{
return sqrt_(v, make_indices<4>());
}
int main()
{
vec_type v;
return sqrt(v)[0];
}
You could also try your luck with auto-vectorization, which is separate from the vector extension.

You can loop over the vectors directly
#include <math.h>
typedef double v2d __attribute__ ((vector_size (16)));
v2d myfunc(v2d in) {
v2d out;
for(int i=0; i<2; i++) out[i] = sqrt(in[i]);
return out;
}
The sqrt function has to trap for signed zero and NAN but if you avoid these with -Ofast both Clang and GCC produce simply sqrtpd.
https://godbolt.org/g/aCuovX
GCC might have a bug because I had to loop to 4 even though there are only 2 elements to get optimal code.
But with AVX and AVX512 GCC and Clang are ideal
AVX
https://godbolt.org/g/qdTxyp
AVX512
https://godbolt.org/g/MJP1n7

My reading of the question is that you want the square root of 4 packed double precision values... that's 32 bytes. Use the appropriate AVX intrinsic:
#include <x86intrin.h>
typedef double v4d __attribute__ ((vector_size (32)));
v4d myfunc (v4d v) {
return _mm256_sqrt_pd(v);
}
x86-64 gcc 10.2 and x86-64 clang 10.0.1
using -O3 -march=skylake :
myfunc:
vsqrtpd %ymm0, %ymm0 # (or just `ymm0` for Intel syntax)
ret
ymm0 is the return value register.
That said, it just so happens there is a builtin: __builtin_ia32_sqrtpd256, which doesn't require the intrinsics header. I would definitely discourage its use however.

Related

Vector overload of a function (provide a manually vectorized version of a function for auto-vectorization to use)

I am using C, and I want to have two versions of the same function, a scalar version and a vector version. The two functions the same signature, and the compiler should pick the correct version depending on the context - if the context is autovectorized loop, it should pick an vectorized function, otherwise it should pick the scalar version.
How to achieve this, so it works in both GCC and CLANG?
One of the suggestions is to use pragma omp declare simd ..., but this doesn't work for me because I need different implementation for the two version (vectorized version is implemented using vector intrinsics).

Let' say we have a function int square(int num) { return num*num; } and we want to have an explicit manually vectorized version.
We compile with -fopenmp-simd, -mavx2 and -O3 compilation flags. This enables SIMD extensions and enabled vectorization.
In the first compilation unit we have something like this:
#pragma omp declare simd notinbranch
int square(int num);
...
#pragma omp simd
for (int i = 0; i < SIZE; i++) {
res[i] = square(values[i]);
}
The compiler knows there is a vectorized version of square in another compilation unit.
In another compilation unit, we define square, both its scalar and vector counterparts. The scalar counterpart has a simple name square, whereas the vector counterparts use name mangling as described here.
In the other compilation unit we define:
#include <stdio.h>
int square(int num) {
printf("1");
return num * num;
}
#include <immintrin.h>
__m256i _ZGVdN8v_square(__m256i num) {
printf("2");
return num;
}
__m128i _ZGVcN4v_square(__m128i num) {
printf("3");
return num;
}
The first function square is the scalar version, the second _ZGVdN8v_square is a version that processes 8 integers in one call, and the third version _ZGVcN4v_square processes 4 integers in one call.
For this example, name mangling goes like this:
_ZGV is the vector prefix
d is AVX2 isa, c is AVX isa
N is the unmasked version (corresponds to notinbranch in square declaration)
4 and 8 are vector lengths
*v stands for vector parameter
I tested it with GCC and it works.

GCC AVX __m256i cast to int array leads to wrong values [duplicate]

I'm trying to learn to code using intrinsics and below is a code which does addition
compiler used: icc
#include<stdio.h>
#include<emmintrin.h>
int main()
{
__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
c = _mm_add_epi32(a,b);
printf("%d\n",c[2]);
return 0;
}
I get the below error:
test.c(9): error: expression must have pointer-to-object type
printf("%d\n",c[2]);
How do I print the values in the variable c which is of type __m128i

Use this function to print them:
#include <stdint.h>
#include <string.h>
void print128_num(__m128i var)
{
uint16_t val[8];
memcpy(val, &var, sizeof(val));
printf("Numerical: %i %i %i %i %i %i %i %i \n",
val[0], val[1], val[2], val[3], val[4], val[5],
val[6], val[7]);
}
You split 128bits into 16-bits(or 32-bits) before printing them.
This is a way of 64-bit splitting and printing if you have 64-bit support available:
#include <inttypes.h>
void print128_num(__m128i var)
{
int64_t v64val[2];
memcpy(v64val, &var, sizeof(v64val));
printf("%.16llx %.16llx\n", v64val[1], v64val[0]);
}
Note: casting the &var directly to an int* or uint16_t* would also work MSVC, but this violates strict aliasing and is undefined behaviour. Using memcpy is the standard compliant way to do the same and with minimal optimization the compiler will generate the exact same binary code.

Portable across gcc/clang/ICC/MSVC, C and C++.
fully safe with all optimization levels: no strict-aliasing violation UB
print in hex as u8, u16, u32, or u64 elements (based on #AG1's answer)
Prints in memory order (least-significant element first, like _mm_setr_epiX). Reverse the array indices if you prefer printing in the same order Intel's manuals use, where the most significant element is on the left (like _mm_set_epiX). Related: Convention for displaying vector registers
Using a __m128i* to load from an array of int is safe because the __m128 types are defined to allow aliasing just like ISO C unsigned char*. (e.g. in gcc's headers, the definition includes __attribute__((may_alias)).)
The reverse isn't safe (pointing an int* onto part of a __m128i object). MSVC guarantees that's safe, but GCC/clang don't. (-fstrict-aliasing is on by default). It sometimes works with GCC/clang, but why risk it? It sometimes even interferes with optimization; see this Q&A. See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
See GCC AVX _m256i cast to int array leads to wrong values for a real-world example of GCC breaking code which points an int* at a __m256i.
(uint32_t*) &my_vector violates the C and C++ aliasing rules, and is not guaranteed to work the way you'd expect. Storing to a local array and then accessing it is guaranteed to be safe. It even optimizes away with most compilers, so you get movq / pextrq directly from xmm to integer registers instead of an actual store/reload, for example.
Source + asm output on the Godbolt compiler explorer: proof it compiles with MSVC and so on.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h> // C11 defines _Alignas(). This header defines alignas()
#endif
void p128_hex_u8(__m128i in) {
alignas(16) uint8_t v[16];
_mm_store_si128((__m128i*)v, in);
printf("v16_u8: %x %x %x %x | %x %x %x %x | %x %x %x %x | %x %x %x %x\n",
v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7],
v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
}
void p128_hex_u16(__m128i in) {
alignas(16) uint16_t v[8];
_mm_store_si128((__m128i*)v, in);
printf("v8_u16: %x %x %x %x, %x %x %x %x\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}
void p128_hex_u32(__m128i in) {
alignas(16) uint32_t v[4];
_mm_store_si128((__m128i*)v, in);
printf("v4_u32: %x %x %x %x\n", v[0], v[1], v[2], v[3]);
}
void p128_hex_u64(__m128i in) {
alignas(16) unsigned long long v[2]; // uint64_t might give format-string warnings with %llx; it's just long in some ABIs
_mm_store_si128((__m128i*)v, in);
printf("v2_u64: %llx %llx\n", v[0], v[1]);
}
If you need portability to C99 or C++03 or earlier (i.e. without C11 / C++11), remove the alignas() and use storeu instead of store. Or use __attribute__((aligned(16))) or __declspec( align(16) ) instead.
(If you're writing code with intrinsics, you should be using a recent compiler version. Newer compilers usually make better asm than older compilers, including for SSE/AVX intrinsics. But maybe you want to use gcc-6.3 with -std=gnu++03 C++03 mode for a codebase that isn't ready for C++11 or something.)
Sample output from calling all 4 functions on
// source used:
__m128i vec = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15, 16);
// output:
v2_u64: 0x807060504030201 0x100f0e0d0c0b0a09
v4_u32: 0x4030201 0x8070605 0xc0b0a09 0x100f0e0d
v8_u16: 0x201 0x403 0x605 0x807 | 0xa09 0xc0b 0xe0d 0x100f
v16_u8: 0x1 0x2 0x3 0x4 | 0x5 0x6 0x7 0x8 | 0x9 0xa 0xb 0xc | 0xd 0xe 0xf 0x10
Adjust the format strings if you want to pad with leading zeros for consistent output width. See printf(3).

I know this question is tagged C, but it was the best search result also when looking for a C++ solution to the same problem.
So, this could be a C++ implementation:
#include <string>
#include <cstring>
#include <sstream>
#if defined(__SSE2__)
template <typename T>
std::string __m128i_toString(const __m128i var) {
std::stringstream sstr;
T values[16/sizeof(T)];
std::memcpy(values,&var,sizeof(values)); //See discussion below
if (sizeof(T) == 1) {
for (unsigned int i = 0; i < sizeof(__m128i); i++) { //C++11: Range for also possible
sstr << (int) values[i] << " ";
}
} else {
for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) { //C++11: Range for also possible
sstr << values[i] << " ";
}
}
return sstr.str();
}
#endif
Usage:
#include <iostream>
[..]
__m128i x
[..]
std::cout << __m128i_toString<uint8_t>(x) << std::endl;
std::cout << __m128i_toString<uint16_t>(x) << std::endl;
std::cout << __m128i_toString<uint32_t>(x) << std::endl;
std::cout << __m128i_toString<uint64_t>(x) << std::endl;
Result:
141 114 0 0 0 0 0 0 151 104 0 0 0 0 0 0
29325 0 0 0 26775 0 0 0
29325 0 26775 0
29325 26775
Note: there exists a simple way to avoid the if (size(T)==1), see https://stackoverflow.com/a/28414758/2436175

#include<stdio.h>
#include<emmintrin.h>
int main()
{
__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
const int32_t* q;
//add a pointer
c = _mm_add_epi32(a,b);
q = (const int32_t*) &c;
printf("%d\n",q[2]);
//printf("%d\n",c[2]);
return 0;
}
Try this code.

What differences in behaviour can there be for a single program between C and C++? [duplicate]

C and C++ have many differences, and not all valid C code is valid C++ code.
(By "valid" I mean standard code with defined behavior, i.e. not implementation-specific/undefined/etc.)
Is there any scenario in which a piece of code valid in both C and C++ would produce different behavior when compiled with a standard compiler in each language?
To make it a reasonable/useful comparison (I'm trying to learn something practically useful, not to try to find obvious loopholes in the question), let's assume:
Nothing preprocessor-related (which means no hacks with #ifdef __cplusplus, pragmas, etc.)
Anything implementation-defined is the same in both languages (e.g. numeric limits, etc.)
We're comparing reasonably recent versions of each standard (e.g. say, C++98 and C90 or later)
If the versions matter, then please mention which versions of each produce different behavior.

Here is an example that takes advantage of the difference between function calls and object declarations in C and C++, as well as the fact that C90 allows the calling of undeclared functions:
#include <stdio.h>
struct f { int x; };
int main() {
f();
}
int f() {
return printf("hello");
}
In C++ this will print nothing because a temporary f is created and destroyed, but in C90 it will print hello because functions can be called without having been declared.
In case you were wondering about the name f being used twice, the C and C++ standards explicitly allow this, and to make an object you have to say struct f to disambiguate if you want the structure, or leave off struct if you want the function.

For C++ vs. C90, there's at least one way to get different behavior that's not implementation defined. C90 doesn't have single-line comments. With a little care, we can use that to create an expression with entirely different results in C90 and in C++.
int a = 10 //* comment */ 2
+ 3;
In C++, everything from the // to the end of the line is a comment, so this works out as:
int a = 10 + 3;
Since C90 doesn't have single-line comments, only the /* comment */ is a comment. The first / and the 2 are both parts of the initialization, so it comes out to:
int a = 10 / 2 + 3;
So, a correct C++ compiler will give 13, but a strictly correct C90 compiler 8. Of course, I just picked arbitrary numbers here -- you can use other numbers as you see fit.

The following, valid in C and C++, is going to (most likely) result in different values in i in C and C++:
int i = sizeof('a');
See Size of character ('a') in C/C++ for an explanation of the difference.
Another one from this article:
#include <stdio.h>
int sz = 80;
int main(void)
{
struct sz { char c; };
int val = sizeof(sz); // sizeof(int) in C,
// sizeof(struct sz) in C++
printf("%d\n", val);
return 0;
}

C90 vs. C++11 (int vs. double):
#include <stdio.h>
int main()
{
auto j = 1.5;
printf("%d", (int)sizeof(j));
return 0;
}
In C auto means local variable. In C90 it's ok to omit variable or function type. It defaults to int. In C++11 auto means something completely different, it tells the compiler to infer the type of the variable from the value used to initialize it.

Another example that I haven't seen mentioned yet, this one highlighting a preprocessor difference:
#include <stdio.h>
int main()
{
#if true
printf("true!\n");
#else
printf("false!\n");
#endif
return 0;
}
This prints "false" in C and "true" in C++ - In C, any undefined macro evaluates to 0. In C++, there's 1 exception: "true" evaluates to 1.

Per C++11 standard:
a. The comma operator performs lvalue-to-rvalue conversion in C but not C++:
char arr[100];
int s = sizeof(0, arr); // The comma operator is used.
In C++ the value of this expression will be 100 and in C this will be sizeof(char*).
b. In C++ the type of enumerator is its enum. In C the type of enumerator is int.
enum E { a, b, c };
sizeof(a) == sizeof(int); // In C
sizeof(a) == sizeof(E); // In C++
This means that sizeof(int) may not be equal to sizeof(E).
c. In C++ a function declared with empty params list takes no arguments. In C empty params list mean that the number and type of function params is unknown.
int f(); // int f(void) in C++
// int f(*unknown*) in C

This program prints 1 in C++ and 0 in C:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
int d = (int)(abs(0.6) + 0.5);
printf("%d", d);
return 0;
}
This happens because there is double abs(double) overload in C++, so abs(0.6) returns 0.6 while in C it returns 0 because of implicit double-to-int conversion before invoking int abs(int). In C, you have to use fabs to work with double.

#include <stdio.h>
int main(void)
{
printf("%d\n", (int)sizeof('a'));
return 0;
}
In C, this prints whatever the value of sizeof(int) is on the current system, which is typically 4 in most systems commonly in use today.
In C++, this must print 1.

Another sizeof trap: boolean expressions.
#include <stdio.h>
int main() {
printf("%d\n", (int)sizeof !0);
}
It equals to sizeof(int) in C, because the expression is of type int, but is typically 1 in C++ (though it's not required to be). In practice they are almost always different.

An old chestnut that depends on the C compiler, not recognizing C++ end-of-line comments...
...
int a = 4 //* */ 2
+2;
printf("%i\n",a);
...

The C++ Programming Language (3rd Edition) gives three examples:
sizeof('a'), as #Adam Rosenfield mentioned;
// comments being used to create hidden code:
int f(int a, int b)
{
return a //* blah */ b
;
}
Structures etc. hiding stuff in out scopes, as in your example.

Another one listed by the C++ Standard:
#include <stdio.h>
int x[1];
int main(void) {
struct x { int a[2]; };
/* size of the array in C */
/* size of the struct in C++ */
printf("%d\n", (int)sizeof(x));
}

Inline functions in C default to external scope where as those in C++ do not.
Compiling the following two files together would print the "I am inline" in case of GNU C but nothing for C++.
File 1
#include <stdio.h>
struct fun{};
int main()
{
fun(); // In C, this calls the inline function from file 2 where as in C++
// this would create a variable of struct fun
return 0;
}
File 2
#include <stdio.h>
inline void fun(void)
{
printf("I am inline\n");
}
Also, C++ implicitly treats any const global as static unless it is explicitly declared extern, unlike C in which extern is the default.

#include <stdio.h>
struct A {
double a[32];
};
int main() {
struct B {
struct A {
short a, b;
} a;
};
printf("%d\n", sizeof(struct A));
return 0;
}
This program prints 128 (32 * sizeof(double)) when compiled using a C++ compiler and 4 when compiled using a C compiler.
This is because C does not have the notion of scope resolution. In C structures contained in other structures get put into the scope of the outer structure.

struct abort
{
int x;
};
int main()
{
abort();
return 0;
}
Returns with exit code of 0 in C++, or 3 in C.
This trick could probably be used to do something more interesting, but I couldn't think of a good way of creating a constructor that would be palatable to C. I tried making a similarly boring example with the copy constructor, that would let an argument be passed, albeit in a rather non-portable fashion:
struct exit
{
int x;
};
int main()
{
struct exit code;
code.x=1;
exit(code);
return 0;
}
VC++ 2005 refused to compile that in C++ mode, though, complaining about how "exit code" was redefined. (I think this is a compiler bug, unless I've suddenly forgotten how to program.) It exited with a process exit code of 1 when compiled as C though.

Don't forget the distinction between the C and C++ global namespaces. Suppose you have a foo.cpp
#include <cstdio>
void foo(int r)
{
printf("I am C++\n");
}
and a foo2.c
#include <stdio.h>
void foo(int r)
{
printf("I am C\n");
}
Now suppose you have a main.c and main.cpp which both look like this:
extern void foo(int);
int main(void)
{
foo(1);
return 0;
}
When compiled as C++, it will use the symbol in the C++ global namespace; in C it will use the C one:
$ diff main.cpp main.c
$ gcc -o test main.cpp foo.cpp foo2.c
$ ./test
I am C++
$ gcc -o test main.c foo.cpp foo2.c
$ ./test
I am C

int main(void) {
const int dim = 5;
int array[dim];
}
This is rather peculiar in that it is valid in C++ and in C99, C11, and C17 (though optional in C11, C17); but not valid in C89.
In C99+ it creates a variable-length array, which has its own peculiarities over normal arrays, as it has a runtime type instead of compile-time type, and sizeof array is not an integer constant expression in C. In C++ the type is wholly static.
If you try to add an initializer here:
int main(void) {
const int dim = 5;
int array[dim] = {0};
}
is valid C++ but not C, because variable-length arrays cannot have an initializer.

Empty structures have size 0 in C and 1 in C++:
#include <stdio.h>
typedef struct {} Foo;
int main()
{
printf("%zd\n", sizeof(Foo));
return 0;
}

This concerns lvalues and rvalues in C and C++.
In the C programming language, both the pre-increment and the post-increment operators return rvalues, not lvalues. This means that they cannot be on the left side of the = assignment operator. Both these statements will give a compiler error in C:
int a = 5;
a++ = 2; /* error: lvalue required as left operand of assignment */
++a = 2; /* error: lvalue required as left operand of assignment */
In C++ however, the pre-increment operator returns an lvalue, while the post-increment operator returns an rvalue. It means that an expression with the pre-increment operator can be placed on the left side of the = assignment operator!
int a = 5;
a++ = 2; // error: lvalue required as left operand of assignment
++a = 2; // No error: a gets assigned to 2!
Now why is this so? The post-increment increments the variable, and it returns the variable as it was before the increment happened. This is actually just an rvalue. The former value of the variable a is copied into a register as a temporary, and then a is incremented. But the former value of a is returned by the expression, it is an rvalue. It no longer represents the current content of the variable.
The pre-increment first increments the variable, and then it returns the variable as it became after the increment happened. In this case, we do not need to store the old value of the variable into a temporary register. We just retrieve the new value of the variable after it has been incremented. So the pre-increment returns an lvalue, it returns the variable a itself. We can use assign this lvalue to something else, it is like the following statement. This is an implicit conversion of lvalue into rvalue.
int x = a;
int x = ++a;
Since the pre-increment returns an lvalue, we can also assign something to it. The following two statements are identical. In the second assignment, first a is incremented, then its new value is overwritten with 2.
int a;
a = 2;
++a = 2; // Valid in C++.

Intel C++ compiler failing to select template function overload

The following code in C++11 compiles correctly with g++ 6.3.0 and result in the behavior that I consider correct (namely, the first function is picked). However with Intel's C++ compiler (icc 17.0.4) it fails to compile; the compiler indicates that multiple possible function overloads exist.
#include <iostream>
template<typename R, typename ... Args>
static void f(R(func)(const int&, Args...)) {
std::cout << "In first version of f" << std::endl;
}
template<typename R, typename ... Args, typename X = typename std::is_void<R>::type>
static void f(R(func)(Args...), X x = X()) {
std::cout << "In second version of f" << std::endl;
}
double h(const int& x, double y) {
return 0;
}
int main(int argc, char** argv) {
f(h);
return 0;
}
Here is the error reported by icc:
test.cpp(18): error: more than one instance of overloaded function "f" matches the argument list:
function template "void f(R (*)(const int &, Args...))"
function template "void f(R (*)(Args...), X)"
argument types are: (double (const int &, double))
f(h);
^
So my two questions are: which compiler is correct with respect to the standard? and how would you modify this code so that it compiles? (note that f is a user-facing API and I would like to avoid modifying its prototype).
Note that if I remove typename X = typename std::is_void<R>::type and the X x = X() argument in the second version of f, icc compiles it fine.

This is a bug in Intel Compiler 17.0 Update 4. The behavior of GCC is right in this case. This issue is resolved in Intel Compiler 18.0 Update 4 and above.

SSE (SIMD extensions) support in gcc

I see a code as below:
#include "stdio.h"
#define VECTOR_SIZE 4
typedef float v4sf __attribute__ ((vector_size(sizeof(float)*VECTOR_SIZE)));
// vector of four single floats
typedef union f4vector
{
v4sf v;
float f[VECTOR_SIZE];
} f4vector;
void print_vector (f4vector *v)
{
printf("%f,%f,%f,%f\n", v->f[0], v->f[1], v->f[2], v->f[3]);
}
int main()
{
union f4vector a, b, c;
a.v = (v4sf){1.2, 2.3, 3.4, 4.5};
b.v = (v4sf){5., 6., 7., 8.};
c.v = a.v + b.v;
print_vector(&a);
print_vector(&b);
print_vector(&c);
}
This code builds fine and works expectedly using gcc (it's inbuild SSE / MMX extensions and vector data types. this code is doing a SIMD vector addition using 4 single floats.
I want to understand in detail what does each keyword/function call on this typedef line means and does:
typedef float v4sf __attribute__ ((vector_size(sizeof(float)*VECTOR_SIZE)));
What is the vector_size() function return;
What is the __attribute__ keyword for
Here is the float data type being type defined to vfsf type?
I understand the rest part.
thanks,
-AD

__attribute__ is GCCs way of exposing functionality from the compiler that isn't in the C or C++ standards.
__attribute__((vector_size(x))) instructs GCC to treat the type as a vector of size x. For SSE this is 16 bytes.
However, I would suggest using the __m128, __m128i or __m128d types found in the various <*mmintrin.h> headers. They are more portable across compilers.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio