I'm trying to learn to code using intrinsics and below is a code which does addition
compiler used: icc
#include<stdio.h>
#include<emmintrin.h>
int main()
{
__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
c = _mm_add_epi32(a,b);
printf("%d\n",c[2]);
return 0;
}
I get the below error:
test.c(9): error: expression must have pointer-to-object type
printf("%d\n",c[2]);
How do I print the values in the variable c which is of type __m128i
Use this function to print them:
#include <stdint.h>
#include <string.h>
void print128_num(__m128i var)
{
uint16_t val[8];
memcpy(val, &var, sizeof(val));
printf("Numerical: %i %i %i %i %i %i %i %i \n",
val[0], val[1], val[2], val[3], val[4], val[5],
val[6], val[7]);
}
You split 128bits into 16-bits(or 32-bits) before printing them.
This is a way of 64-bit splitting and printing if you have 64-bit support available:
#include <inttypes.h>
void print128_num(__m128i var)
{
int64_t v64val[2];
memcpy(v64val, &var, sizeof(v64val));
printf("%.16llx %.16llx\n", v64val[1], v64val[0]);
}
Note: casting the &var directly to an int* or uint16_t* would also work MSVC, but this violates strict aliasing and is undefined behaviour. Using memcpy is the standard compliant way to do the same and with minimal optimization the compiler will generate the exact same binary code.
Portable across gcc/clang/ICC/MSVC, C and C++.
fully safe with all optimization levels: no strict-aliasing violation UB
print in hex as u8, u16, u32, or u64 elements (based on #AG1's answer)
Prints in memory order (least-significant element first, like _mm_setr_epiX). Reverse the array indices if you prefer printing in the same order Intel's manuals use, where the most significant element is on the left (like _mm_set_epiX). Related: Convention for displaying vector registers
Using a __m128i* to load from an array of int is safe because the __m128 types are defined to allow aliasing just like ISO C unsigned char*. (e.g. in gcc's headers, the definition includes __attribute__((may_alias)).)
The reverse isn't safe (pointing an int* onto part of a __m128i object). MSVC guarantees that's safe, but GCC/clang don't. (-fstrict-aliasing is on by default). It sometimes works with GCC/clang, but why risk it? It sometimes even interferes with optimization; see this Q&A. See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
See GCC AVX _m256i cast to int array leads to wrong values for a real-world example of GCC breaking code which points an int* at a __m256i.
(uint32_t*) &my_vector violates the C and C++ aliasing rules, and is not guaranteed to work the way you'd expect. Storing to a local array and then accessing it is guaranteed to be safe. It even optimizes away with most compilers, so you get movq / pextrq directly from xmm to integer registers instead of an actual store/reload, for example.
Source + asm output on the Godbolt compiler explorer: proof it compiles with MSVC and so on.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h> // C11 defines _Alignas(). This header defines alignas()
#endif
void p128_hex_u8(__m128i in) {
alignas(16) uint8_t v[16];
_mm_store_si128((__m128i*)v, in);
printf("v16_u8: %x %x %x %x | %x %x %x %x | %x %x %x %x | %x %x %x %x\n",
v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7],
v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
}
void p128_hex_u16(__m128i in) {
alignas(16) uint16_t v[8];
_mm_store_si128((__m128i*)v, in);
printf("v8_u16: %x %x %x %x, %x %x %x %x\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}
void p128_hex_u32(__m128i in) {
alignas(16) uint32_t v[4];
_mm_store_si128((__m128i*)v, in);
printf("v4_u32: %x %x %x %x\n", v[0], v[1], v[2], v[3]);
}
void p128_hex_u64(__m128i in) {
alignas(16) unsigned long long v[2]; // uint64_t might give format-string warnings with %llx; it's just long in some ABIs
_mm_store_si128((__m128i*)v, in);
printf("v2_u64: %llx %llx\n", v[0], v[1]);
}
If you need portability to C99 or C++03 or earlier (i.e. without C11 / C++11), remove the alignas() and use storeu instead of store. Or use __attribute__((aligned(16))) or __declspec( align(16) ) instead.
(If you're writing code with intrinsics, you should be using a recent compiler version. Newer compilers usually make better asm than older compilers, including for SSE/AVX intrinsics. But maybe you want to use gcc-6.3 with -std=gnu++03 C++03 mode for a codebase that isn't ready for C++11 or something.)
Sample output from calling all 4 functions on
// source used:
__m128i vec = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15, 16);
// output:
v2_u64: 0x807060504030201 0x100f0e0d0c0b0a09
v4_u32: 0x4030201 0x8070605 0xc0b0a09 0x100f0e0d
v8_u16: 0x201 0x403 0x605 0x807 | 0xa09 0xc0b 0xe0d 0x100f
v16_u8: 0x1 0x2 0x3 0x4 | 0x5 0x6 0x7 0x8 | 0x9 0xa 0xb 0xc | 0xd 0xe 0xf 0x10
Adjust the format strings if you want to pad with leading zeros for consistent output width. See printf(3).
I know this question is tagged C, but it was the best search result also when looking for a C++ solution to the same problem.
So, this could be a C++ implementation:
#include <string>
#include <cstring>
#include <sstream>
#if defined(__SSE2__)
template <typename T>
std::string __m128i_toString(const __m128i var) {
std::stringstream sstr;
T values[16/sizeof(T)];
std::memcpy(values,&var,sizeof(values)); //See discussion below
if (sizeof(T) == 1) {
for (unsigned int i = 0; i < sizeof(__m128i); i++) { //C++11: Range for also possible
sstr << (int) values[i] << " ";
}
} else {
for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) { //C++11: Range for also possible
sstr << values[i] << " ";
}
}
return sstr.str();
}
#endif
Usage:
#include <iostream>
[..]
__m128i x
[..]
std::cout << __m128i_toString<uint8_t>(x) << std::endl;
std::cout << __m128i_toString<uint16_t>(x) << std::endl;
std::cout << __m128i_toString<uint32_t>(x) << std::endl;
std::cout << __m128i_toString<uint64_t>(x) << std::endl;
Result:
141 114 0 0 0 0 0 0 151 104 0 0 0 0 0 0
29325 0 0 0 26775 0 0 0
29325 0 26775 0
29325 26775
Note: there exists a simple way to avoid the if (size(T)==1), see https://stackoverflow.com/a/28414758/2436175
#include<stdio.h>
#include<emmintrin.h>
int main()
{
__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
const int32_t* q;
//add a pointer
c = _mm_add_epi32(a,b);
q = (const int32_t*) &c;
printf("%d\n",q[2]);
//printf("%d\n",c[2]);
return 0;
}
Try this code.
Related
Having following code:
#include <iostream>
#include <bitset>
#include <limits>
#include <limits.h>
using namespace std;
constexpr std::size_t maxBits = CHAR_BIT * sizeof(std::size_t);
int main() {
std::size_t value =47;
unsigned int begin=0;
unsigned int end=32;
//std::size_t allBitsSet(std::numeric_limits<std::size_t>::max());
std::bitset<maxBits> allBitsSet(std::numeric_limits<std::size_t>::max());
//std::size_t mask((allBitsSet >> (maxBits - end)) ^(allBitsSet >> (maxBits - begin)));
std::bitset<maxBits> mask = (allBitsSet >> (maxBits - end)) ^(allBitsSet >> (maxBits - begin));
//std::size_t bitsetValue(value);
std::bitset<maxBits> bitsetValue(value);
auto maskedValue = bitsetValue & mask;
auto result = maskedValue >> begin;
//std::cout << static_cast<std::size_t>(result) << std::endl;
std::cout << static_cast<std::size_t>(result.to_ulong()) << std::endl;
}
Which in fact should return the same value as value, but for some reason the version with std::bitset works just fine and version with std::size_t does not.
It is strange as such, because AFAIK std::bitset, when something is wrong simply throws exception and what is more AFAIK bitset should behave the same way as operations on unsigned integers, but as we can see even if bitset has same number of bits it does not behave the same. In fact it seems for me, that std::bitset works fine, while std::size_t does not.
My configuration is:
intel corei7 - g++-5.4.0-r3
[expr.shift]/1 ... The behavior [of the shift operator - IT] is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand.
Emphasis mine. allBitsSet >> (maxBits - begin) (in the non-bitset version) exhibits undefined behavior.
On the other hand, the behavior of bitset::operator>> is well-defined: allBitsSet >> (maxBits - begin) produces a bitset with all zero bits.
I've just read about support for %fs and %gs segment prefixes on the Intel platforms in GCC.
It was mentioned that "The way you obtain %gs-based pointers, or control the
value of %gs itself, is out of the scope of gcc;"
I'm looking for a way when I manually can set the value of %fs (I'm on IA32, RH Linux) and work with it. When I just set %fs=%ds the test below works fine and this is expected. But I cannot change the test in order to have another value of %fs and do not get a segmentation fault. I start thinking that changing the value of %fs is not the only thing to do. So I'm looking for an advice how to make a part of memory addressed by %fs that is not equal to DS.
#include <stddef.h>
typedef char __seg_fs fs_ptr;
fs_ptr p[] = {'h','e','l','l','o','\0'};
void fs_puts(fs_ptr *s)
{
char buf[100];
buf[0] = s[0];
buf[1] = s[1];
buf[2] = s[2];
buf[3] = '\0';
puts(buf);
}
void __attribute__((constructor)) set_fs()
{
__asm__("mov %ds, %bx\n\t"
"add $0, %bx\n\t" //<---- if fs=ds then the program executes as expected. If not $0 here, then segmentation fault happens.
"mov %bx, %fs\n\t");
}
int main()
{
fs_puts(p);
return 0;
}
I've talked with Armin who implemented __seg_gs/__seg_fs in GCC (Thanks Armin!).
So basically I cannot use these keywords for globals. The aim of introducing __seg_gs/fs was to have a possibility to dynamically allocate regions of memory that are thread-local.
We cannot use __thread for a pointer and to allocate a memory for it using malloc. But __seg_gs/fs introduce such possibility.
The test below somehow illustrates that.
Note that arch_prctl() was used. It exists as 64-bit version only.
Also note that %fs is used for __thread on 64-bit and %gs is free.
#include <stddef.h>
#include <string.h>
#include <stdio.h>
#include <asm/ldt.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <asm/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
typedef __seg_gs char gs_str;
void gs_puts(gs_str *ptr)
{
int i;
char buf[100];
for(i = 0; i < 100; i++)
buf[i] = ptr[i];
puts(buf);
}
int main()
{
int i;
void *buffer = malloc(100 * sizeof(char));
arch_prctl(ARCH_SET_GS, buffer);
gs_str *gsobj = (gs_str *)0;
for (i = 0; i < 100; i++)
gsobj[i] = 'a'; /* in the %gs space */
gs_puts(gsobj);
return 0;
}
I am currently experimenting with the GCC vector extensions. However, I am wondering how to go about getting sqrt(vec) to work as expected.
As in:
typedef double v4d __attribute__ ((vector_size (16)));
v4d myfunc(v4d in)
{
return some_sqrt(in);
}
and at least on a recent x86 system have it emit a call to the relevant intrinsic sqrtpd. Is there a GCC builtin for sqrt that works on vector types or does one need to drop down to the intrinsic level to accomplish this?
Looks like it's a bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54408 I don't know of any workaround other than do it component-wise. The vector extensions were never meant to replace platform specific intrinsics anyway.
Some funky code to this effect:
#include <cmath>
#include <utility>
template <::std::size_t...> struct indices { };
template <::std::size_t M, ::std::size_t... Is>
struct make_indices : make_indices<M - 1, M - 1, Is...> {};
template <::std::size_t... Is>
struct make_indices<0, Is...> : indices<Is...> {};
typedef float vec_type __attribute__ ((vector_size(4 * sizeof(float))));
template <::std::size_t ...Is>
vec_type sqrt_(vec_type const& v, indices<Is...> const)
{
vec_type r;
::std::initializer_list<int>{(r[Is] = ::std::sqrt(v[Is]), 0)...};
return r;
}
vec_type sqrt(vec_type const& v)
{
return sqrt_(v, make_indices<4>());
}
int main()
{
vec_type v;
return sqrt(v)[0];
}
You could also try your luck with auto-vectorization, which is separate from the vector extension.
You can loop over the vectors directly
#include <math.h>
typedef double v2d __attribute__ ((vector_size (16)));
v2d myfunc(v2d in) {
v2d out;
for(int i=0; i<2; i++) out[i] = sqrt(in[i]);
return out;
}
The sqrt function has to trap for signed zero and NAN but if you avoid these with -Ofast both Clang and GCC produce simply sqrtpd.
https://godbolt.org/g/aCuovX
GCC might have a bug because I had to loop to 4 even though there are only 2 elements to get optimal code.
But with AVX and AVX512 GCC and Clang are ideal
AVX
https://godbolt.org/g/qdTxyp
AVX512
https://godbolt.org/g/MJP1n7
My reading of the question is that you want the square root of 4 packed double precision values... that's 32 bytes. Use the appropriate AVX intrinsic:
#include <x86intrin.h>
typedef double v4d __attribute__ ((vector_size (32)));
v4d myfunc (v4d v) {
return _mm256_sqrt_pd(v);
}
x86-64 gcc 10.2 and x86-64 clang 10.0.1
using -O3 -march=skylake :
myfunc:
vsqrtpd %ymm0, %ymm0 # (or just `ymm0` for Intel syntax)
ret
ymm0 is the return value register.
That said, it just so happens there is a builtin: __builtin_ia32_sqrtpd256, which doesn't require the intrinsics header. I would definitely discourage its use however.
Suppose we have;
struct collapsed {
char **seq;
int num;
};
...
__device__ *collapsed xdev;
...
collapsed *x_dev
cudaGetSymbolAddress((void **)&x_dev, xdev);
cudaMemcpyToSymbol(x_dev, x, sizeof(collapsed)*size); //x already defined collapsed * , this line gives ERROR
Whay do you think I am getting error at the last line : invalid device symbol ??
The first problem here is that x_dev isn't a device symbol. It might contain an address in a device memory, but that address cannot be passed to cudaMemcpyToSymbol. The call should just be:
cudaMemcpyToSymbol(xdev, ......);
Which brings up the second problem. Doing this:
cudaMemcpyToSymbol(xdev, x, sizeof(collapsed)*size);
would be illegal. xdev is a pointer, so the only valid value you can copy to xdev is a device address. If x is the address of a struct collapsed in device memory, then the only valid version of this memory transfer operation is
cudaMemcpyToSymbol(xdev, &x, sizeof(collapsed *));
ie. x must have previously have been set to the address of memory allocated in the device, something like
collapsed *x;
cudaMalloc((void **)&x, sizeof(collapsed)*size);
cudaMemcpy(x, host_src, sizeof(collapsed)*size, cudaMemcpyHostToDevice);
As promised, here is a complete working example. First the code:
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>
struct collapsed {
char **seq;
int num;
};
__device__ collapsed xdev;
__global__
void kernel(const size_t item_sz)
{
if (threadIdx.x < xdev.num) {
char *p = xdev.seq[threadIdx.x];
char val = 0x30 + threadIdx.x;
for(size_t i=0; i<item_sz; i++) {
p[i] = val;
}
}
}
#define gpuQ(ans) { gpu_assert((ans), __FILE__, __LINE__); }
void gpu_assert(cudaError_t code, const char *file, const int line)
{
if (code != cudaSuccess)
{
std::cerr << "gpu_assert: " << cudaGetErrorString(code) << " "
<< file << " " << line << std::endl;
exit(code);
}
}
int main(void)
{
const int nitems = 32;
const size_t item_sz = 16;
const size_t buf_sz = size_t(nitems) * item_sz;
// Gpu memory for sequences
char *_buf;
gpuQ( cudaMalloc((void **)&_buf, buf_sz) );
gpuQ( cudaMemset(_buf, 0x7a, buf_sz) );
// Host array for holding sequence device pointers
char **seq = new char*[nitems];
size_t offset = 0;
for(int i=0; i<nitems; i++, offset += item_sz) {
seq[i] = _buf + offset;
}
// Device array holding sequence pointers
char **_seq;
size_t seq_sz = sizeof(char*) * size_t(nitems);
gpuQ( cudaMalloc((void **)&_seq, seq_sz) );
gpuQ( cudaMemcpy(_seq, seq, seq_sz, cudaMemcpyHostToDevice) );
// Host copy of the xdev structure to copy to the device
collapsed xdev_host;
xdev_host.num = nitems;
xdev_host.seq = _seq;
// Copy to device symbol
gpuQ( cudaMemcpyToSymbol(xdev, &xdev_host, sizeof(collapsed)) );
// Run Kernel
kernel<<<1,nitems>>>(item_sz);
// Copy back buffer
char *buf = new char[buf_sz];
gpuQ( cudaMemcpy(buf, _buf, buf_sz, cudaMemcpyDeviceToHost) );
// Print out seq values
// Each string should be ASCII starting from ´0´ (0x30)
char *seq_vals = buf;
for(int i=0; i<nitems; i++, seq_vals += item_sz) {
std::string s;
s.append(seq_vals, item_sz);
std::cout << s << std::endl;
}
return 0;
}
and here it is compiled and run:
$ /usr/local/cuda/bin/nvcc -arch=sm_12 -Xptxas=-v -g -G -o erogol erogol.cu
./erogol.cu(19): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : 8 bytes gmem, 4 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelm' for 'sm_12'
ptxas info : Used 5 registers, 20 bytes smem, 4 bytes cmem[1]
$ /usr/local/cuda/bin/cuda-memcheck ./erogol
========= CUDA-MEMCHECK
0000000000000000
1111111111111111
2222222222222222
3333333333333333
4444444444444444
5555555555555555
6666666666666666
7777777777777777
8888888888888888
9999999999999999
::::::::::::::::
;;;;;;;;;;;;;;;;
<<<<<<<<<<<<<<<<
================
>>>>>>>>>>>>>>>>
????????????????
################
AAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEE
FFFFFFFFFFFFFFFF
GGGGGGGGGGGGGGGG
HHHHHHHHHHHHHHHH
IIIIIIIIIIIIIIII
JJJJJJJJJJJJJJJJ
KKKKKKKKKKKKKKKK
LLLLLLLLLLLLLLLL
MMMMMMMMMMMMMMMM
NNNNNNNNNNNNNNNN
OOOOOOOOOOOOOOOO
========= ERROR SUMMARY: 0 errors
Some notes:
To simplify things a bit, I have only used a single memory allocation _buf to hold all of the string data. Each value of seq is set to a different address within _buf. This is functionally equivalent to running a separate cudaMalloc call for each pointer, but much faster.
The key concept is to assemble a copy of the structure you wish to access on the device in host memory, then copy that to the device. All of the pointers in my xdev_host are device pointers. The CUDA API doesn't have any sort of deep copy or automatic pointer translation facility, so it is the programmer's responsibility to make sure this is correct.
Each thread in the kernel just fills its sequence with a difference ASCII character. Note that I have declared my xdev as a structure, rather than pointer to structure and copy values rather than a reference to the __device__ symbol (again to simplify things slightly). But otherwise the sequence of operations is what you would need to make your design pattern work.
Because I only have access to a compute 1.x device, the compiler issues a warning. One compute 2.x and 3.x this won't happen because of the improved memory model in those devices. The warning is normal and can be safely ignored.
Because each sequence is just written into a different part of _buf, I can transfer all the sequences back to the host with a single cudaMemcpy call.
good afternoon.
I got the code below on a book. I'm trying to execute it, but I don't know what is the "first" and "last" parameters on the MakeCodeWritable function, or where I can find them. Someone can help? This code is about C obfuscation method. I'm using Xcode program and LLVM GCC 4.2 compiler.
#include <stdio.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
typedef unsigned int uint32;
typedef char* caddr_t;
typedef uint32* waddr_t;
#define Tam_celula 64
#define ALIGN __attribute__((aligned(Tam_celula)))
void makeCodeWritable(char* first, char* last) {
char* firstpage = first - ((int)first % getpagesize());
char* lastpage = last - ((int)last % getpagesize());
int pages = (lastpage-firstpage)/getpagesize()+1;
if (mprotect(firstpage,pages*getpagesize(), PROT_READ|PROT_EXEC|PROT_WRITE)==-1) perror("mprotect");
}
void xor(caddr_t from, caddr_t to, int len){
int i;
for(i=0;i<len;i++){
*to ^= *from; from++; to++;
} }
void swap(caddr_t from, caddr_t to, int len){
int i;
for(i=0;i<len;i++){
char t = *from; *from = *to; *to = t; from++; to++;
} }
#define CELLSIZE 64
#define ALIGN asm volatile (".align 64\n");
void P() {
static int firsttime=1; if (firsttime) {
xor(&&cell5,&&cell2,CELLSIZE);
xor(&&cell0,&&cell3,CELLSIZE);
swap(&&cell1,&&cell4,CELLSIZE);
firsttime = 0; }
char* a[] = {&&align0,&&align1,&&align2,&&align3,&&align4,&&align5};
char*next[] ={&&cell0,&&cell1,&&cell2,&&cell3, &&cell4,&&cell5};
goto *next[0];
align0: ALIGN
cell0: printf("SPGM0\n");
xor(&&cell0,&&cell3,3*CELLSIZE);
goto *next[3];
align1: ALIGN
cell1: printf("SPGM2\n"); xor(&&cell0,&&cell3,3*CELLSIZE);
goto *next[4];
align2: ALIGN
cell2: printf("SPGM4\n"); xor(&&cell0,&&cell3,3*CELLSIZE);
goto *next[5];
align3: ALIGN
cell3: printf("SPGM1\n"); xor(&&cell3,&&cell0,3*CELLSIZE);
goto *next[1];
align4: ALIGN
cell4: printf("SPGM3\n"); xor(&&cell3,&&cell0,3*CELLSIZE);
goto *next[2];
align5: ALIGN
cell5: printf("SPGM5\n");
xor(&&cell3,&&cell0,3*CELLSIZE);
}
int main (int argc, char *argv[]) {
makeCodeWritable(...);
P(); P();
}
The first argument should be (char *)P, because it looks like you want to modify code inside function P. The second argument is the ending address of function P. You can first compile the code, and using objdump -d to see the address of beginning and end of P, then calculate the size of the function, SIZE, then manually specify in the makeCodeWritable( (char *)P, ((char *)P) + SIZE.
The second way is utilizing the as to get the size of function P, but it depends on the assembler language on your platform. This is code snipe I modified from your code, it should be able to compile and run in x86, x86_64 in GCC 4.x on Linux platform.
align5: ALIGN
cell5: printf("SPGM5\n");
xor(&&cell3,&&cell0,3*CELLSIZE);
// adding an label to the end of function P to assembly code
asm ("END_P: \n");
;
}
extern char __sizeof__myfunc[];
int main (int argc, char *argv[]) {
// calculate the code size, ending - starting address of P
asm (" __sizeof__myfunc = END_P-P \n");
// you can see the code size of P
printf("code size is %d\n", (unsigned)__sizeof__myfunc);
makeCodeWritable( (char*)P, ((char *)P) + (unsigned)__sizeof__myfunc);
P(); P();
}
With some modification to support LLVM GCC and as in Mac OS X
int main (int argc, char *argv[]) {
size_t sizeof__myfunc = 0;
asm volatile ("movq $(_END_P - _P),%0;"
: "=r" (sizeof__myfunc)
: );
printf("%d\n", sizeof__myfunc);