I am working on a C++ app on an Intel Mac (OS X 10.6.x). I have a variable which contains pixel data obtained with the OpenGL call glReadPixels. I want to perform some operations directly on the pixel data using x86_64 assembly instructions. The assembly routine works fine in test programs, but when I try to use it on the pixel data, it only sees zeroes at the memory location pointed to by the pixel data variable. I am guessing this is because I am trying to access video memory directly from x86_64 assembly. Is there a way to access video memory directly from x86_64 assembly? Otherwise, how can I resolve this situation?
Appreciate any pointers. Thanks in advance.
See below for a code sample that flips the last n and first n bytes. The same code works well in a test program.
void Flip(void *b, unsigned int w, unsigned int h)
{
__asm {
mov r8, rdi //rdi b
mov r9, rsi //W
mov r10,rdx //H
mov r11, 0 // h <- 0
mov r12, 0 // w<- 0
outloop:
------------
.............
.............
}
This isn't really an answer, but the comment box is too short to post this.
Your inline assembly is problematic, in multiple ways:
it assumes that by the time the compiler reaches the inline block, the function arguments are still in the argument registers. That isn't guaranteed.
it uses MSVC-style inline assembly; I'm not sure about Clang on OS X, but GCC proper refuses to even compile this.
It'd also be good to have a complete (compilable) source fragment. Something like (32bit code):
int mogrifyFramebufferContents(unsigned char *fb, int width, int height)
{
    int i, sum;

    glReadPixels(1, 1, width, height, GL_RGBA, GL_UNSIGNED_BYTE, fb);

    for (i = 0, sum = 0; i < 4 * width * height; i++)
        sum += fb[i];
    printf("sum over fb via C: %d\n", sum);

    asm("xorl %0, %0\n"             /* sum = 0 */
        "xorl %1, %1\n"             /* i = 0 */
        "0:\n"
        "movzbl (%2, %1), %%ebx\n"  /* load fb[i] as an unsigned byte */
        "addl %%ebx, %0\n"          /* sum += fb[i] */
        "incl %1\n"
        "cmpl %3, %1\n"
        "jl 0b"                     /* loop while i < 4 * width * height */
        : "=&r"(sum), "=&r"(i)
        : "r"(fb), "r"(4 * width * height)
        : "cc", "%ebx", "memory");

    printf("sum over fb via inline asm: %d\n", sum);
    return sum;
}
If I haven't made an off-by-one error, this code should produce the same output for the C and assembly versions. Try something similar right at the place where you access the read data within C, and compare the results - or even single-step through the generated assembly with a debugger.
A stackoverflow search for "gcc inline assembly syntax" will give you a starting point for what the %0...%3 placeholders mean, and/or how register assignment constraints like "=&r"(sum) above work.
Related
I looked into some C code from
http://www.mcs.anl.gov/~kazutomo/rdtsc.html
They use things like __inline__, __asm__, etc., as in the following:
code1:
static __inline__ tick gettick (void) {
unsigned a, d;
__asm__ __volatile__("rdtsc": "=a" (a), "=d" (d) );
return (((tick)a) | (((tick)d) << 32));
}
code2:
volatile int __attribute__((noinline)) foo2 (int a0, int a1) {
__asm__ __volatile__ ("");
}
I was wondering: what do code1 and code2 do?
(Editor's note: for this specific RDTSC use case, intrinsics are preferred: How to get the CPU cycle count in x86_64 from C++? See also https://gcc.gnu.org/wiki/DontUseInlineAsm)
The __volatile__ modifier on an __asm__ block forces the compiler's optimizer to execute the code as-is. Without it, the optimizer may think it can be either removed outright, or lifted out of a loop and cached.
This is useful for the rdtsc instruction like so:
__asm__ __volatile__("rdtsc": "=a" (a), "=d" (d) )
This asm statement has no inputs, so the compiler might assume the value can be cached. __volatile__ is used to force it to read a fresh timestamp every time.
When used alone, like this:
__asm__ __volatile__ ("")
It will not actually execute anything. You can extend this, though, to get a compile-time memory barrier that won't allow reordering any memory access instructions:
__asm__ __volatile__ ("":::"memory")
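For illustration, here is a minimal compilable sketch of that idiom (my own example, not from the answer; barrier() follows the Linux kernel's naming, and data/ready/publish are made-up names):
/* Compiler-only fence: the compiler may not move memory accesses across it.
 * It emits no CPU instruction at all. */
#define barrier() __asm__ __volatile__("" ::: "memory")

int data;      /* illustrative shared variables */
int ready;

void publish(int value)
{
    data = value;
    barrier();   /* keeps the compiler from sinking the store to data below this point */
    ready = 1;   /* this is only a compile-time barrier, not a CPU memory fence */
}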
The rdtsc instruction is a good example for volatile. rdtsc is usually used when you need to time how long some instructions take to execute. Imagine some code like this, where you want to time r1 and r2's execution:
__asm__ ("rdtsc": "=a" (a0), "=d" (d0) )
r1 = x1 + y1;
__asm__ ("rdtsc": "=a" (a1), "=d" (d1) )
r2 = x2 + y2;
__asm__ ("rdtsc": "=a" (a2), "=d" (d2) )
Here the compiler is actually allowed to cache the timestamp, and valid output might show that each line took exactly 0 clocks to execute. Obviously this isn't what you want, so you introduce __volatile__ to prevent caching:
__asm__ __volatile__("rdtsc": "=a" (a0), "=d" (d0))
r1 = x1 + y1;
__asm__ __volatile__("rdtsc": "=a" (a1), "=d" (d1))
r2 = x2 + y2;
__asm__ __volatile__("rdtsc": "=a" (a2), "=d" (d2))
Now you'll get a new timestamp each time, but it still has a problem that both the compiler and the CPU are allowed to reorder all of these statements. It could end up executing the asm blocks after r1 and r2 have already been calculated. To work around this, you'd add some barriers that force serialization:
__asm__ __volatile__("mfence;rdtsc": "=a" (a0), "=d" (d0) :: "memory")
r1 = x1 + y1;
__asm__ __volatile__("mfence;rdtsc": "=a" (a1), "=d" (d1) :: "memory")
r2 = x2 + y2;
__asm__ __volatile__("mfence;rdtsc": "=a" (a2), "=d" (d2) :: "memory")
Note the mfence instruction here, which enforces a CPU-side barrier, and the "memory" specifier in the volatile block, which enforces a compile-time barrier. On modern CPUs you can replace mfence;rdtsc with rdtscp for something more efficient.
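As a concrete sketch of that suggestion using the intrinsics mentioned in the editor's note above (my own example; do_work() and time_work() are placeholder names for whatever you want to time):
#include <x86intrin.h>   /* __rdtsc, __rdtscp */

extern void do_work(void);              /* the code under test (placeholder) */

static inline unsigned long long time_work(void)
{
    unsigned aux;
    unsigned long long start = __rdtsc();      /* timestamp before the work */
    do_work();
    unsigned long long stop = __rdtscp(&aux);  /* rdtscp doesn't read the counter until prior instructions have finished */
    return stop - start;
}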
asm is for embedding native assembly code in the C source. For example (using GCC extended asm so that the C variable can be referenced reliably):
int a = 2;
asm("movl $3, %0" : "=r"(a)); // overwrite a with 3
printf("%i", a); // will print 3
Compilers have different variants of it. __asm__ should be synonymous, maybe with some compiler-specific differences.
volatile means the variable can be modified from outside the C program's control. For instance, when programming a microcontroller, a memory address such as 0x00001234 may be mapped to some device-specific interface (e.g. when coding for the Game Boy, buttons/screen/etc. are accessed this way):
volatile std::uint8_t* const button1 = reinterpret_cast<volatile std::uint8_t*>(0x00001111);
This disables compiler optimizations that rely on *button1 not changing unless it is changed by the code.
It has also been used in multi-threaded programming, where a variable might be modified by another thread (though std::atomic is the right tool for that today).
inline is a hint to the compiler to "inline" calls to a function.
inline int f(int a) {
return a + 1;
}
int a;
int b = f(a);
This should not be compiled into a function call to f but into int b = a + 1, as if f were a macro. Compilers mostly do this optimization automatically, depending on the function's usage and content. __inline__ in this example might have a more specific meaning.
Similarly, __attribute__((noinline)) (GCC-specific syntax) prevents a function from being inlined.
The __asm__ attribute specifies the name to be used in assembler code for the function or variable.
The __volatile__ qualifier, commonly used in real-time and embedded code, addresses a problem where optimization breaks repeated tests of a status register for the ERROR or READY bit. __volatile__ was introduced as a way of telling the compiler that the object is subject to rapid change, and to force every reference to the object to be a genuine reference.
Can I base a mission-critical application on the results of this test, that 100 threads reading a pointer set a billion times by a main thread never see a tear?
Any other potential problems doing this besides tearing?
Here's a stand-alone demo that compiles with g++ -g tear.cxx -o tear -pthread.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>
using namespace std;
void* pvTearTest;
atomic<int> iTears( 0 );
void TearTest( void ) {
while (1) {
void* pv = (void*) pvTearTest;
intptr_t i = (intptr_t) pv;
if ( ( i >> 32 ) != ( i & 0xFFFFFFFF ) ) {
printf( "tear: pv = %p\n", pv );
iTears++;
}
if ( ( i >> 32 ) == 999999999 )
break;
}
}
int main( int argc, char** argv ) {
printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );
vector<thread> athr;
// Create lots of threads and have them do the test simultaneously.
for ( int i = 0; i < 100; i++ )
athr.emplace_back( TearTest );
for ( int i = 0; i < 1000000000; i++ )
pvTearTest = (void*) (intptr_t)
( ( i % (1L<<32) ) * 0x100000001 );
for ( auto& thr: athr )
thr.join();
if ( iTears )
printf( "%d tears\n", iTears.load() );
else
printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}
The actual application is a malloc()'ed and sometimes realloc()'d array (size is power of two; realloc doubles storage) that many child threads will absolutely be hammering in a mission-critical but also high-performance-critical way.
From time to time a thread will need to add a new entry to the array, and will do so by setting the next array entry to point to something, then increment an atomic<int> iCount. Finally it will add data to some data structures that would cause other threads to attempt to dereference that cell.
It all seems fine (except I'm not positive the increment of the count is assured of happening before the following non-atomic updates)... except for one thing: realloc() will typically change the address of the array, and it also frees the old one, while the pointer to it is still visible to other threads.
OK, so instead of realloc(), I malloc() a new array, manually copy the contents, and set the pointer to the array. I would free the old array, but I realize other threads may still be accessing it: they read the array base; I free the base; a third thread allocates that memory and writes something else there; the first thread then adds the indexed offset to the base and expects a valid pointer. I'm happy to leak those, though. (Given the doubling growth, all old arrays combined are about the same size as the current array, so the overhead is simply an extra 16 bytes per item, and it's memory that is soon never referenced again.)
So, here's the crux of the question: once I allocate the bigger array, can I write its base address with a non-atomic write, in utter safety? Or despite my billion-access test, do I actually have to make it atomic<> and thus slow all worker threads down to read that atomic?
(As this is surely environment dependent, we're talking 2012-or-later Intel, g++ 4 to 9, and Red Hat of 2012 or later.)
EDIT: here is a modified test program that matches my planned scenario much more closely, with only a small number of writes. I've also added a count of the reads. When switching from void* to atomic<void*> I go from 2240 reads/usec to 660 reads/usec (with optimization disabled). The machine language for the read is shown after the source.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>
using namespace std;
chrono::time_point<chrono::high_resolution_clock> tp1, tp2;
// void*: 1169.093u 0.027s 2:26.75 796.6% 0+0k 0+0io 0pf+0w
// atomic<void*>: 6656.864u 0.348s 13:56.18 796.1% 0+0k 0+0io 0pf+0w
// Different definitions of the target variable.
atomic<void*> pvTearTest;
//void* pvTearTest;
// Children sum the tears they find, and at end, total checks performed.
atomic<int> iTears( 0 );
atomic<uint64_t> iReads( 0 );
bool bEnd = false; // main thr sets true; children all finish.
void TearTest( void ) {
uint64_t i;
for ( i = 0; ! bEnd; i++ ) {
intptr_t iTearTest = (intptr_t) (void*) pvTearTest;
// Make sure top 4 and bottom 4 bytes are the same. If not it's a tear.
if ( ( iTearTest >> 32 ) != ( iTearTest & 0xFFFFFFFF ) ) {
printf( "tear: pv = %lx\n", iTearTest );
iTears++;
}
// Output periodically to prove we're seeing changing values.
if ( ( (i+1) % 50000000 ) == 0 )
printf( "got: pv = %lx\n", iTearTest );
}
iReads += i;
}
int main( int argc, char** argv ) {
printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );
vector<thread> athr;
// Create lots of threads and have them do the test simultaneously.
for ( int i = 0; i < 100; i++ )
athr.emplace_back( TearTest );
tp1 = chrono::high_resolution_clock::now();
#if 0
// Change target as fast as possible for fixed number of updates.
for ( int i = 0; i < 1000000000; i++ )
pvTearTest = (void*) (intptr_t)
( ( i % (1L<<32) ) * 0x100000001 );
#else
// More like our actual app: change target only periodically, for fixed time.
for ( int i = 0; i < 100; i++ ) {
pvTearTest.store( (void*) (intptr_t) ( ( i % (1L<<32) ) * 0x100000001 ),
std::memory_order_release );
this_thread::sleep_for(10ms);
}
#endif
bEnd = true;
for ( auto& thr: athr )
thr.join();
tp2 = chrono::high_resolution_clock::now();
chrono::duration<double> dur = tp2 - tp1;
printf( "%lu reads in %.4f secs: %.2f reads/usec\n",
iReads.load(), dur.count(), iReads.load() / dur.count() / 1000000 );
if ( iTears )
printf( "%d tears\n", iTears.load() );
else
printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}
Dump of assembler code for function TearTest():
0x0000000000401256 <+0>: push %rbp
0x0000000000401257 <+1>: mov %rsp,%rbp
0x000000000040125a <+4>: sub $0x10,%rsp
0x000000000040125e <+8>: movq $0x0,-0x8(%rbp)
0x0000000000401266 <+16>: movzbl 0x6e83(%rip),%eax # 0x4080f0 <bEnd>
0x000000000040126d <+23>: test %al,%al
0x000000000040126f <+25>: jne 0x40130c <TearTest()+182>
=> 0x0000000000401275 <+31>: mov $0x4080d8,%edi
0x000000000040127a <+36>: callq 0x40193a <std::atomic<void*>::operator void*() const>
0x000000000040127f <+41>: mov %rax,-0x10(%rbp)
0x0000000000401283 <+45>: mov -0x10(%rbp),%rax
0x0000000000401287 <+49>: sar $0x20,%rax
0x000000000040128b <+53>: mov -0x10(%rbp),%rdx
0x000000000040128f <+57>: mov %edx,%edx
0x0000000000401291 <+59>: cmp %rdx,%rax
0x0000000000401294 <+62>: je 0x4012bb <TearTest()+101>
0x0000000000401296 <+64>: mov -0x10(%rbp),%rax
0x000000000040129a <+68>: mov %rax,%rsi
0x000000000040129d <+71>: mov $0x40401a,%edi
0x00000000004012a2 <+76>: mov $0x0,%eax
0x00000000004012a7 <+81>: callq 0x401040 <printf@plt>
0x00000000004012ac <+86>: mov $0x0,%esi
0x00000000004012b1 <+91>: mov $0x4080e0,%edi
0x00000000004012b6 <+96>: callq 0x401954 <std::__atomic_base<int>::operator++(int)>
0x00000000004012bb <+101>: mov -0x8(%rbp),%rax
0x00000000004012bf <+105>: lea 0x1(%rax),%rcx
0x00000000004012c3 <+109>: movabs $0xabcc77118461cefd,%rdx
0x00000000004012cd <+119>: mov %rcx,%rax
0x00000000004012d0 <+122>: mul %rdx
0x00000000004012d3 <+125>: mov %rdx,%rax
0x00000000004012d6 <+128>: shr $0x19,%rax
0x00000000004012da <+132>: imul $0x2faf080,%rax,%rax
0x00000000004012e1 <+139>: sub %rax,%rcx
0x00000000004012e4 <+142>: mov %rcx,%rax
0x00000000004012e7 <+145>: test %rax,%rax
0x00000000004012ea <+148>: jne 0x401302 <TearTest()+172>
0x00000000004012ec <+150>: mov -0x10(%rbp),%rax
0x00000000004012f0 <+154>: mov %rax,%rsi
0x00000000004012f3 <+157>: mov $0x40402a,%edi
0x00000000004012f8 <+162>: mov $0x0,%eax
0x00000000004012fd <+167>: callq 0x401040 <printf@plt>
0x0000000000401302 <+172>: addq $0x1,-0x8(%rbp)
0x0000000000401307 <+177>: jmpq 0x401266 <TearTest()+16>
0x000000000040130c <+182>: mov -0x8(%rbp),%rax
0x0000000000401310 <+186>: mov %rax,%rsi
0x0000000000401313 <+189>: mov $0x4080e8,%edi
0x0000000000401318 <+194>: callq 0x401984 <std::__atomic_base<unsigned long>::operator+=(unsigned long)>
0x000000000040131d <+199>: nop
0x000000000040131e <+200>: leaveq
0x000000000040131f <+201>: retq
Yes, on x86 aligned loads are atomic, BUT this is an architectural detail that you should NOT rely on!
Since you are writing C++ code, you have to abide by the rules of the C++ standard, i.e., you have to use atomics instead of volatile. The fact
that volatile has been part of that language long before the introduction
of threads in C++11 should be a strong enough indication that volatile was
never designed or intended to be used for multi-threading. It is important to
note that in C++ volatile is something fundamentally different from volatile
in languages like Java or C# (in these languages volatile is in
fact related to the memory model and therefore much more like an atomic in C++).
In C++, volatile is used for what is often referred to as "unusual memory".
This is typically memory that can be read or modified outside the current process,
for example when using memory mapped I/O. volatile forces the compiler to
execute all operations in the exact order as specified. This prevents
some optimizations that would be perfectly legal for atomics, while also allowing
some optimizations that are actually illegal for atomics. For example:
volatile int x;
int y;
volatile int z;
x = 1;
y = 2;
z = 3;
z = 4;
...
int a = x;
int b = x;
int c = y;
int d = z;
In this example, there are two assignments to z, and two read operations on x.
If x and z were atomics instead of volatile, the compiler would be free to treat
the first store as irrelevant and simply remove it. Likewise it could just reuse the
value returned by the first load of x, effectively generating code like int b = a.
But since x and z are volatile, these optimizations are not possible. Instead,
the compiler has to ensure that all volatile operations are executed in the
exact order as specified, i.e., the volatile operations cannot be reordered with
respect to each other. However, this does not prevent the compiler from reordering
non-volatile operations. For example, the operations on y could freely be moved
up or down - something that would not be possible if x and z were atomics. So
if you were to try implementing a lock based on a volatile variable, the compiler
could simply (and legally) move some code outside your critical section.
Last but not least it should be noted that marking a variable as volatile does
not prevent it from participating in a data race. In those rare cases where you
have some "unusual memory" (and therefore really require volatile) that is
also accessed by multiple threads, you have to use volatile atomics.
Since aligned loads are actually atomic on x86, the compiler will translate an atomic.load() call to a simple mov instruction, so an atomic load is not slower than reading a volatile variable. An atomic.store() is actually slower than writing a volatile variable, but for good reasons, since in contrast to the volatile write it is by default sequentially consistent. You can relax the memory orders, but you really have to know what you are doing!!
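To make that concrete for the pointer-publishing pattern in the question, here is a minimal sketch (the names g_table, publish and read_first are mine, not from the question). The release store and the acquire load both compile to a plain mov on x86-64, so readers pay essentially nothing compared to a raw void*:
#include <atomic>

std::atomic<int*> g_table{nullptr};   // the published array pointer (illustrative)

void publish(int* bigger_copy) {      // writer: install the new, larger array
    g_table.store(bigger_copy, std::memory_order_release);
}

int read_first() {                    // readers: plain mov load on x86-64, no lock prefix
    int* t = g_table.load(std::memory_order_acquire);
    return t ? t[0] : -1;
}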
If you want to learn more about the C++ memory model, I can recommend this paper: Memory Models for C/C++ Programmers
I have implemented the strlen() function in different ways, including SSE2 assembly, SSE4.2 assembly, and SSE2 intrinsics, and I have also run some experiments on them, comparing against strlen() from <string.h> and strlen() in glibc. However, their performance in milliseconds was unexpected.
My experiment environment:
CentOS 7.0 + gcc 4.8.5 + Intel Xeon
Following are my implementations:
1. strlen using SSE2 assembly
long strlen_sse2_asm(const char* src){
long result = 0;
asm(
"movl %1, %%edi\n\t"
"movl $-0x10, %%eax\n\t"
"pxor %%xmm0, %%xmm0\n\t"
"lloop:\n\t"
"addl $0x10, %%eax\n\t"
"movdqu (%%edi,%%eax), %%xmm1\n\t"
"pcmpeqb %%xmm0, %%xmm1\n\t"
"pmovmskb %%xmm1, %%ecx\n\t"
"test %%ecx, %%ecx\n\t"
"jz lloop\n\t"
"bsf %%ecx, %%ecx\n\t"
"addl %%ecx, %%eax\n\t"
"movl %%eax, %0"
:"=r"(result)
:"r"(src)
:"%eax"
);
return result;
}
2. strlen using SSE4.2 assembly
long strlen_sse4_2_asm(const char* src){
long result = 0;
asm(
"movl %1, %%edi\n\t"
"movl $-0x10, %%eax\n\t"
"pxor %%xmm0, %%xmm0\n\t"
"lloop2:\n\t"
"addl $0x10, %%eax\n\t"
"pcmpistri $0x08,(%%edi, %%eax), %%xmm0\n\t"
"jnz lloop2\n\t"
"add %%ecx, %%eax\n\t"
"movl %%eax, %0"
:"=r"(result)
:"r"(src)
:"%eax"
);
return result;
}
3. strlen using SSE2 intrinsic
long strlen_sse2_intrin_align(const char* src){
if (src == NULL || *src == '\0'){
return 0;
}
const __m128i zero = _mm_setzero_si128();
const __m128i* ptr = (const __m128i*)src;
if(((size_t)ptr&0xF)!=0){
__m128i xmm = _mm_loadu_si128(ptr);
unsigned int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(xmm,zero));
if(mask!=0){
return (const char*)ptr-src+(size_t)ffs(mask);
}
ptr = (__m128i*)(0x10+(size_t)ptr & ~0xF);
}
for (;;ptr++){
__m128i xmm = _mm_load_si128(ptr);
unsigned int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(xmm,zero));
if (mask!=0)
return (const char*)ptr-src+(size_t)ffs(mask);
}
}
I also looked at the one implemented in the Linux kernel; the following is its implementation:
size_t strlen_inline_asm(const char* str){
int d0;
size_t res;
asm volatile("repne\n\t"
"scasb"
:"=c" (res), "=&D" (d0)
: "1" (str), "a" (0), "0" (0xffffffffu)
: "memory");
return ~res-1;
}
In my experiment, I also added the one from the standard library and compared their performance.
Followings are my main function code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h>
#include <x86intrin.h>
#include <emmintrin.h>
#include <time.h>
#include <unistd.h>
#include <sys/time.h>
int main()
{
struct timeval tpstart,tpend;
char test_str[1024]; /* 1023 'a' characters plus the terminating '\0' */
int i=0;
for(;i<1023;i++){
test_str[i] = 'a';
}
test_str[i]='\0';
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen(test_str);
gettimeofday(&tpend,NULL);
printf("strlen from string.h--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen_inline_asm(test_str);
gettimeofday(&tpend,NULL);
printf("strlen_inline_asm--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen_sse2_asm(test_str);
gettimeofday(&tpend,NULL);
printf("strlen_sse2_asm--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen_sse4_2_asm(test_str);
gettimeofday(&tpend,NULL);
printf("strlen_sse4_2_asm--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
gettimeofday(&tpstart,NULL);
for(i=0;i<10000000;i++)
strlen_sse2_intrin_align(test_str);
gettimeofday(&tpend,NULL);
printf("strlen_sse2_intrin_align--->%lf\n",(tpend.tv_sec-tpstart.tv_sec)*1000+(tpend.tv_usec-tpstart.tv_usec)/1000.0);
return 0;
}
The result is (in ms):
strlen from string.h--->23.518000
strlen_inline_asm--->222.311000
strlen_sse2_asm--->782.907000
strlen_sse4_2_asm--->955.960000
strlen_sse2_intrin_align--->3499.586000
I have some questions about it:
Why is strlen from string.h so fast? I think its code should be identical to strlen_inline_asm, because I copied the code from /linux-4.2.2/arch/x86/lib/string_32.c (http://lxr.oss.org.cn/source/arch/x86/lib/string_32.c#L164).
Why are the SSE2 intrinsic and SSE2 assembly versions so different in performance?
Could someone show me how to disassemble the code so that I can see what the strlen from the static library has been transformed into by the compiler? I used gcc -S but didn't find the disassembly of strlen from <string.h>.
I think my code may not be very good; I would appreciate it if you could help me improve it, especially the assembly versions.
Thanks.
Like I said in comments, your biggest error is benchmarking with -O0. I discussed exactly why testing with -O0 is a terrible idea in the first part of another post.
Benchmarks should be done with at least -O2, preferably with the same optimizations as your full project will build with, if you're trying to test what source makes the fastest asm.
-O0 explains inline asm being way faster than C with intrinsics (or regular compiled C, for C strlen implementation borrowed from glibc).
I don't know whether -O0 would still optimize away a loop that discards the result of library strlen repeatedly, or if it somehow just avoided some other huge performance pitfall. It's not interesting to guess about exactly what happened in such a flawed test.
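One common way (a sketch of mine, not from this answer; use_result, touch_memory and bench are made-up names) to keep an optimized benchmark loop honest is to launder the result and the input buffer through empty asm statements, so the compiler can neither discard nor hoist the call:
static inline void use_result(long v) { asm volatile("" :: "g"(v)); }     /* pretend v is consumed */
static inline void touch_memory(void) { asm volatile("" ::: "memory"); }  /* pretend memory may have changed */

void bench(long (*fn)(const char *), const char *s, long iters)
{
    for (long i = 0; i < iters; i++) {
        touch_memory();       /* stops the pure call from being hoisted out of the loop */
        use_result(fn(s));    /* stops the call from being optimized away entirely */
    }
}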
I tightened up your SSE2 inline-asm version. Mostly just because I've been playing with gcc inline asm input/output constraints recently, and wanted to see what it would look like if I wrote it to let the compiler choose which registers to use for temporaries, and avoided unneeded instructions.
The same inline asm works for 32 and 64-bit x86 targets; see this compiled for both on the Godbolt compiler explorer. When compiled as a stand-alone function, it doesn't have to save/restore any registers, even in 32-bit mode:
WARNING: it can read past the end of the string by up to 15 bytes. This could segfault. See Is it safe to read past the end of a buffer within the same page on x86 and x64? for details on avoiding that: get to an alignment boundary, then use aligned loads because that's always safe if the vector contains at least 1 byte of string data. I left the code unchanged because it's interesting to discuss the effect of aligning pointers for SSE vs. AVX. Aligning pointers also avoids cache-line splits, and 4k page-splits (which are a performance pothole before Skylake).
#include <immintrin.h>
size_t strlen_sse2_asm(const char* src){
// const char *orig_src = src; // for a pointer-increment with a "+r" (src) output operand
size_t result = 0;
unsigned int tmp1;
__m128i zero = _mm_setzero_si128(), vectmp;
// A pointer-increment may perform better than an indexed addressing mode
asm(
"\n.Lloop:\n\t"
"movdqu (%[src], %[res]), %[vectmp]\n\t" // result reg is used as the loop counter
"pcmpeqb %[zerovec], %[vectmp]\n\t"
"pmovmskb %[vectmp], %[itmp]\n\t"
"add $0x10, %[res]\n\t"
"test %[itmp], %[itmp]\n\t"
"jz .Lloop\n\t"
"bsf %[itmp], %[itmp]\n\t"
"add %q[itmp], %q[res]\n\t" // q modifier to get quadword register.
// (add %edx, %rax doesn't work). But in 32bit mode, q gives a 32bit reg, so the same code works
: [res] "+r"(result), [vectmp] "=&x" (vectmp), [itmp] "=&r" (tmp1)
: [zerovec] "x" (zero) // There might already be a zeroed vector reg when inlining
, [src] "r"(src)
, [dummy] "m" (*(const char (*)[])src) // this reads the whole object, however long gcc thinks it is
: //"memory" // not needed because of the dummy input
);
return result;
// return result + tmp1; // doing the add outside the asm makes gcc sign or zero-extend tmp1.
// No benefit anyway, since gcc doesn't know that tmp1 is the offset within a 16B chunk or anything.
}
Note the dummy input, as an alternative to a "memory" clobber, to tell the compiler that the inline asm reads the memory pointed to by src, as well as the value of src itself. (The compiler doesn't know what the asm does; for all it knows the asm just aligns a pointer with an and instruction or something, so assuming that all input pointers are dereferenced would lead to missed optimizations from reordering / combining loads and stores across the asm. Also, this lets the compiler know we only read the memory, not modify it.) The GCC manual uses an example with this unspecified-length array syntax: "m" (*(const char (*)[])src)
It should keep register pressure to a minimum when inlining, and doesn't tie up any special-purpose registers (like ecx which is needed for variable-count shifts).
If you could shave another uop out of the inner loop, it would be down to 4 uops that could issue at one per cycle. As it is, 5 uops means each iteration may take 2 cycles to issue from the frontend, on Intel SnB CPUs. (Or 1.25 cycles on later CPUs like Haswell, and maybe on SnB if I was wrong about the whole-number behaviour.)
Using an aligned pointer would allow the load to fold into a memory operand for pcmpeqb. (As well as being necessary for correctness if the string start is unaligned and the end is near the end of a page). Interestingly, using the zero-vector as the destination for pcmpeqb is ok in theory: you don't need to re-zero the vector between iterations, because you exit the loop if it's ever non-zero. It has 1-cycle latency, so turning the zero vector into a loop-carried dependency is only a problem when cache-misses delay an old iteration. Removing this loop-carried dependency chain might help in practice, though, by letting the back end go faster when catching up after a cache miss that delayed an old iteration.
AVX solves the problem completely (except for correctness if the string ends near the end of a page). AVX allows the load to be folded even without doing an alignment check first. 3-operand non-destructive vpcmpeqb avoids turning the zero vector into a loop-carried dependency. AVX2 would allow checking 32B at once.
Unrolling will help either way, but helps more without AVX. Align to a 64B boundary or something, and then load the whole cache line into four 16B vectors. Doing a combined check on the result of PORing them all together may be good, since pmovmsk + compare-and-branch is 2 uops.
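A rough intrinsics sketch of that 4 x 16B idea (my illustration, not code from this answer; line_has_zero is a made-up name, p is assumed 64-byte aligned, and it only answers "does this cache line contain a zero byte?" - the caller still has to locate the exact byte):
#include <emmintrin.h>

static int line_has_zero(const char *p)   /* p assumed 64-byte aligned */
{
    const __m128i *v = (const __m128i *)p;
    __m128i zero = _mm_setzero_si128();
    __m128i c0 = _mm_cmpeq_epi8(_mm_load_si128(v + 0), zero);
    __m128i c1 = _mm_cmpeq_epi8(_mm_load_si128(v + 1), zero);
    __m128i c2 = _mm_cmpeq_epi8(_mm_load_si128(v + 2), zero);
    __m128i c3 = _mm_cmpeq_epi8(_mm_load_si128(v + 3), zero);
    __m128i any = _mm_or_si128(_mm_or_si128(c0, c1), _mm_or_si128(c2, c3));
    return _mm_movemask_epi8(any);        /* non-zero iff a 0 byte is somewhere in the 64B line */
}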
Using SSE4.1 PTEST doesn't help (compared to pmovmsk / test / jnz) because it's 2 uops and can't macro-fuse the way test can.
PTEST can directly test for the whole 16B vector being all-zero or all-ones (using ANDNOT -> CF part), but not if one of the byte-elements is zero. (So we can't avoid pcmpeqb).
Have a look at Agner Fog's guides for optimizing asm, and the other links on the x86 wiki. Most optimization guides (Agner Fog's, and Intel's and AMD's) mention optimizing memcpy and strlen specifically, IIRC.
If you read the source of the strlen function in the glibc, you can see that the function is not testing the string char by char, but longword by longword with complex bitwise operations : http://www.stdlib.net/~colmmacc/strlen.c.html. I guess it explains its speed, but the fact that it's even faster than rep instructions in assembly is indeed quite surprising.
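The core of that longword trick is the classic "does this word contain a zero byte?" test; here is a minimal 64-bit sketch (word_has_zero_byte is my name; the real glibc code additionally handles alignment and then locates the exact byte):
#include <stdint.h>

static int word_has_zero_byte(uint64_t v)
{
    /* A byte of v is zero iff subtracting 1 from it borrows into its high bit
       while that high bit was not already set in v. */
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}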
I have this simple binary correlation method. It beats table-lookup and HAKMEM bit-twiddling methods by 3-4x and is 25% better than GCC's __builtin_popcount (which I think maps to a popcnt instruction when SSE4 is enabled).
Here is the much simplified code:
int correlation(uint64_t *v1, uint64_t *v2, int size64) {
__m128i* a = reinterpret_cast<__m128i*>(v1);
__m128i* b = reinterpret_cast<__m128i*>(v2);
int count = 0;
for (int j = 0; j < size64 / 2; ++j, ++a, ++b) {
union { __m128i s; uint64_t b[2]; } x;
x.s = _mm_xor_si128(*a, *b);
count += _mm_popcnt_u64(x.b[0]) +_mm_popcnt_u64(x.b[1]);
}
return count;
}
I tried unrolling the loop, but I think GCC already does this automatically, so I ended up with the same performance. Do you think performance can be improved further without making the code too complicated? Assume v1 and v2 are of the same size and that the size is even.
I am happy with its current performance but I was just curious to see if it could be further improved.
Thanks.
Edit: Fixed an error in the union; it turned out this error was what made this version faster than __builtin_popcount. Anyway, I modified the code again; it is slightly faster than the builtin now (15%), but I don't think it is worth investing more time in this. Thanks for all the comments and suggestions.
for (int j = 0; j < size64 / 4; ++j, a+=2, b+=2) {
__m128i x0 = _mm_xor_si128(_mm_load_si128(a), _mm_load_si128(b));
count += _mm_popcnt_u64(_mm_extract_epi64(x0, 0))
+_mm_popcnt_u64(_mm_extract_epi64(x0, 1));
__m128i x1 = _mm_xor_si128(_mm_load_si128(a + 1), _mm_load_si128(b + 1));
count += _mm_popcnt_u64(_mm_extract_epi64(x1, 0))
+_mm_popcnt_u64(_mm_extract_epi64(x1, 1));
}
Second Edit: it turned out that the builtin is the fastest, sigh, especially with the -funroll-loops and -fprefetch-loop-arrays flags. Something like this:
for (int j = 0; j < size64; ++j) {
count += __builtin_popcountll(a[j] ^ b[j]);
}
Third Edit:
This is an interesting SSSE3 parallel 4-bit lookup algorithm. The idea is from Wojciech Muła; the implementation is from Marat Dukhan's answer. Thanks to @Apriori for reminding me of this algorithm. Below is the heart of the algorithm; it is very clever: it basically counts bits per byte by using an SSE register as a 16-way lookup table, with the lower nibbles as indices of which table cells are selected. Then it sums the counts.
static inline __m128i hamming128(__m128i a, __m128i b) {
static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
const __m128i x = _mm_xor_si128(a, b);
const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(x, popcount_mask));
const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(x, 4), popcount_mask));
return _mm_add_epi8(pcnt0, pcnt1);
}
In my tests this version is on par: slightly faster on smaller inputs, slightly slower on larger ones, than using the hardware popcount. I think this should really shine if it is implemented in AVX. But I don't have time for this; if anyone is up to it, I would love to hear their results.
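For what it's worth, here is a hedged, untested AVX2 sketch of the same nibble-lookup idea (AVX2, since 256-bit integer shuffles need it; hamming256 mirrors the hamming128 above, the table just has to be duplicated into both 128-bit lanes because vpshufb shuffles within lanes, and summing the per-byte counts, e.g. with _mm256_sad_epu8, is left out):
#include <immintrin.h>

static inline __m256i hamming256(__m256i a, __m256i b) {
    const __m256i mask  = _mm256_set1_epi8(0x0F);
    const __m256i table = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
                                           0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i x     = _mm256_xor_si256(a, b);
    const __m256i pcnt0 = _mm256_shuffle_epi8(table, _mm256_and_si256(x, mask));
    const __m256i pcnt1 = _mm256_shuffle_epi8(table, _mm256_and_si256(_mm256_srli_epi16(x, 4), mask));
    return _mm256_add_epi8(pcnt0, pcnt1);   // per-byte bit counts of a XOR b
}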
The problem is that popcnt (which is what __builtin_popcount compiles to on Intel CPUs) operates on the integer registers. This causes the compiler to issue instructions to move data between the SSE and integer registers. I'm not surprised that the non-SSE version is faster, since the ability to move data between the vector and integer registers is quite limited/slow.
uint64_t count_set_bits(const uint64_t *a, const uint64_t *b, size_t count)
{
uint64_t sum = 0;
for(size_t i = 0; i < count; i++) {
sum += popcnt(a[i] ^ b[i]);
}
return sum;
}
This runs at approx. 2.36 clocks per loop on small data sets (fits in cache). I think it runs slowly because of the 'long' dependency chain on sum, which restricts the CPU's ability to handle more things out of order. We can improve it by manually pipelining the loop:
uint64_t count_set_bits_2(const uint64_t *a, const uint64_t *b, size_t count)
{
uint64_t sum = 0, sum2 = 0;
for(size_t i = 0; i < count; i+=2) {
sum += popcnt(a[i ] ^ b[i ]);
sum2 += popcnt(a[i+1] ^ b[i+1]);
}
return sum + sum2;
}
This runs at 1.75 clocks per item. My CPU is a Sandy Bridge model (i7-2820QM, fixed @ 2.4 GHz).
How about four-way pipelining? That's 1.65 clocks per item. What about 8-way? 1.57 clocks per item. We can derive that the runtime per item is (1.5n + 0.5) / n, where n is the number of pipelines in our loop. I should note that for some reason 8-way pipelining performs worse than the others when the dataset grows; I have no idea why. The generated code looks okay.
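For reference, the four-way version presumably just extends the same pattern as count_set_bits_2 above (this is my reconstruction, reusing the answer's popcnt wrapper, not the answer's original source):
uint64_t count_set_bits_4(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for(size_t i = 0; i < count; i += 4) {
        s0 += popcnt(a[i    ] ^ b[i    ]);
        s1 += popcnt(a[i + 1] ^ b[i + 1]);
        s2 += popcnt(a[i + 2] ^ b[i + 2]);
        s3 += popcnt(a[i + 3] ^ b[i + 3]);
    }
    return (s0 + s1) + (s2 + s3);
}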
Now if you look carefully there is one xor, one add, one popcnt, and one mov instruction per item. There is also one lea instruction per loop (and one branch and decrement, which I'm ignoring because they're pretty much free).
$LL3@count_set_:
; Line 50
mov rcx, QWORD PTR [r10+rax-8]
lea rax, QWORD PTR [rax+32]
xor rcx, QWORD PTR [rax-40]
popcnt rcx, rcx
add r9, rcx
; Line 51
mov rcx, QWORD PTR [r10+rax-32]
xor rcx, QWORD PTR [rax-32]
popcnt rcx, rcx
add r11, rcx
; Line 52
mov rcx, QWORD PTR [r10+rax-24]
xor rcx, QWORD PTR [rax-24]
popcnt rcx, rcx
add rbx, rcx
; Line 53
mov rcx, QWORD PTR [r10+rax-16]
xor rcx, QWORD PTR [rax-16]
popcnt rcx, rcx
add rdi, rcx
dec rdx
jne SHORT $LL3@count_set_
You can check with Agner Fog's optimization manual that an lea is half a clock cycle in throughput, and the mov/xor/popcnt/add combo is apparently 1.5 clock cycles, although I don't fully understand exactly why.
Unfortunately, I think we're stuck here. The PEXTRQ instruction is what's usually used to move data from the vector registers to the integer registers, and we can fit this instruction and one popcnt instruction neatly into one clock cycle. Add one integer add instruction and our pipeline is at minimum 1.33 cycles long, and we still need to add a vector load and xor in there somewhere... If Intel offered instructions to move multiple registers between the vector and integer registers at once, it would be a different story.
I don't have an AVX2 CPU at hand (xor on 256-bit vector registers is an AVX2 feature), but my vectorized-load implementation performs quite poorly with low data sizes and reached a minimum of 1.97 clock cycles per item.
For reference these are my benchmarks:
"pipe 2", "pipe 4" and "pipe 8" are 2, 4 and 8-way pipelined versions of the code shown above. The poor showing of "sse load" appears to be a manifestation of the lzcnt/tzcnt/popcnt false dependency bug which gcc avoided by using the same register for input and output. "sse load 2" follows below:
uint64_t count_set_bits_4sse_load(const uint64_t *a, const uint64_t *b, size_t count)
{
uint64_t sum1 = 0, sum2 = 0;
for(size_t i = 0; i < count; i+=4) {
__m128i tmp = _mm_xor_si128(
_mm_load_si128(reinterpret_cast<const __m128i*>(a + i)),
_mm_load_si128(reinterpret_cast<const __m128i*>(b + i)));
sum1 += popcnt(_mm_extract_epi64(tmp, 0));
sum2 += popcnt(_mm_extract_epi64(tmp, 1));
tmp = _mm_xor_si128(
_mm_load_si128(reinterpret_cast<const __m128i*>(a + i+2)),
_mm_load_si128(reinterpret_cast<const __m128i*>(b + i+2)));
sum1 += popcnt(_mm_extract_epi64(tmp, 0));
sum2 += popcnt(_mm_extract_epi64(tmp, 1));
}
return sum1 + sum2;
}
Have a look here. There is an SSSE3 version that beats the popcnt instruction by a lot. I'm not sure but you may be able to extend it to AVX as well.
Until today I used my own min() function (for float and int) based on an if, but I just learned that x86 has an instruction for min:
MINSS - Minimum of operands
I think an if-based min() routine is counter-effective, and I am very careful about optimization, so I would like to rewrite my own routine into a minss version with some inline assembly. I would like to find out what the most effective version of this would look like in GCC inline assembly.
I need something like
int min(int a, int b)
{
// minss a, b
//return
}
for both int and float, using the minss opcode and with a minimal prologue and epilogue.
Or would just using the library version be faster? I would prefer not to use the library min/max, and I'd like it to be as fast as possible.
Here is the most efficient possible implementation of min for ints and floats:
int
min_int(int a, int b)
{
return a < b ? a : b;
}
float
min_float(float a, float b)
{
return a < b ? a : b;
}
"But," you exclaim, "those will have conditional jumps in them!" Nope. Here's the output of gcc -S -O2:
min_int:
cmpl %edi, %esi
movl %edi, %eax
cmovle %esi, %eax
ret
min_float:
minss %xmm1, %xmm0
ret
For ints you get a conditional move, and for floats you get minss, because the compiler is very smart. No inline ASM needed!
EDIT: If you're still curious about how to do it with inline assembly, here's an example (for gcc):
float
min_float_asm(float a, float b)
{
float result = a;
asm ("minss %1, %0" : "+x" (result) : "x" (b));
return result;
}
The x constraint means "any SSE register", and "+x" means the value will be read and written, whereas "x" means read-only.
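If you also want the integer variant as inline asm, here is a sketch (min_int_asm is my name; the compiler-generated cmov shown earlier is already optimal, so this is purely illustrative):
int min_int_asm(int a, int b)
{
    int result = a;
    asm ("cmpl %1, %0\n\t"
         "cmovg %1, %0"        /* if result > b, take b */
         : "+r" (result)
         : "r" (b)
         : "cc");
    return result;
}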
Well, I would advise against such micro-optimization. If you want to do it anyway, GCC has some __builtin_* functions. One is v4sf __builtin_ia32_minss (v4sf, v4sf). There are other min* built-ins as well; check the docs.
Update
To gain more portability, you might want to take a look at the Intel Intrinsics Guide. Those functions are usually supported by GCC and Clang as well.
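For example, a minimal sketch with the portable intrinsic _mm_min_ss (min_float_intrin is my wrapper name; note it keeps the raw minss semantics, i.e. the second argument is returned when an input is NaN):
#include <xmmintrin.h>   /* SSE: _mm_min_ss, _mm_set_ss, _mm_cvtss_f32 */

static inline float min_float_intrin(float a, float b)
{
    return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
}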